What are we missing by ignoring text records in the Clinical Practice Research Datalink? Using three symptoms of cancer as examples to estimate the extent of data in text format that is hidden to research
Price, Sarah Jane
Thesis or dissertation
University of Exeter
Electronic medical record databases (e.g. the Clinical Practice Research Datalink, CPRD) are increasingly used in epidemiological research. The CPRD has two formats of data: coded, which is the sole format used in almost all research; and free-text (or ‘hidden’), which may contain much clinical information but is generally unavailable to researchers. This thesis examines the ramifications of omitting free-text records from research. Cases with bladder (n=4,915) or pancreatic (n=3,635) cancer were matched to controls (n=21,718, bladder; n=16,459, pancreas) on age, sex and GP practice. Coded and text-only records of attendance for haematuria, jaundice and abdominal pain in the year before cancer diagnosis were identified. The number of patients whose entire attendance record for a symptom/sign existed solely in the text was quantified. Associations between recording method (coded or text-only) and case/control status were estimated (χ² test). For each symptom/sign, the positive predictive value (PPV, Bayes' Theorem) and odds ratio (OR, conditional logistic regression) for cancer were estimated before and after supplementation with text-only records. Text-only recording was considerable, with 7,951/20,958 (37%) of symptom records being in that format. For individual patients, text-only recording was more likely in controls (140/336=42%) than cases (556/3,147=18%) for visible haematuria in bladder cancer (χ² test, p<0.001), and for jaundice (21/31=67% vs 463/1,565=30%, p<0.0001) and abdominal pain (323/1,126=29% vs 397/1,789=22%, p<0.001) in pancreatic cancer. Adding text records reduced PPVs of visible haematuria for bladder cancer from 4.0% (95% CI: 3.5–4.6%) to 2.9% (2.6–3.2%) and of jaundice for pancreatic cancer from 12.8% (7.3–21.6%) to 6.3% (4.5–8.7%). Coded records suggested that non-visible haematuria occurred in 127/4,915 (2.6%) cases, a figure below that generally used for study. Supplementation with text-only records increased this to 312/4,915 (6.4%), permitting the first estimation of its OR (28.0, 95% CI: 20.7–37.9, p<0.0001) and PPV (1.60%, 1.22–2.10%, p<0.0001) for bladder cancer. The results suggest that GPs make strong clinical judgements about the probable significance of symptoms – preferentially coding clinical features they consider significant to a diagnosis, while using text to record those that they think are not.
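The association between recording method and case/control status can be illustrated from the counts quoted in the abstract. The following is a minimal sketch, not the thesis's own analysis code: it assumes a Pearson χ² test on a 2×2 table without continuity correction (the abstract does not say which variant was used), and the names `chi2_2x2`, `REPORTED` and `RESULTS` are hypothetical.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction, an assumption)
    for the 2x2 table [[a, b], [c, d]]. Returns (statistic, p_value)."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# (text-only patients, all patients with the symptom), as quoted in the abstract.
REPORTED = {
    "visible haematuria (bladder)": {"controls": (140, 336), "cases": (556, 3147)},
    "jaundice (pancreas)": {"controls": (21, 31), "cases": (463, 1565)},
    "abdominal pain (pancreas)": {"controls": (323, 1126), "cases": (397, 1789)},
}


def analyse(counts):
    """Build the text-only vs coded table for controls and cases, then test it."""
    ctrl_text, ctrl_total = counts["controls"]
    case_text, case_total = counts["cases"]
    return chi2_2x2(ctrl_text, ctrl_total - ctrl_text,
                    case_text, case_total - case_text)


RESULTS = {name: analyse(counts) for name, counts in REPORTED.items()}

for name, (stat, p_value) in RESULTS.items():
    print(f"{name}: chi2 = {stat:.1f}, p = {p_value:.2g}")
```

Under these assumptions, each of the three comparisons gives a p-value below 0.001, in line with the significance levels reported above.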
PhD in Medical Studies