DistinCT: R Package for Automated CT Imaging Indication Abstraction Using EHRs
Background
Accurately abstracting CT imaging indications, that is, whether a scan is performed for surveillance, symptom evaluation, or another clinical reason, is critical for real-world oncology research. For instance, among long-term lung cancer survivors, differentiating surveillance CTs from diagnostic CTs is essential for studying surveillance patterns and for evaluating whether routine surveillance contributes to early detection of second primary lung cancers or improved long-term outcomes, compared with imaging prompted by clinical symptoms or complications. However, determining the true intent behind a CT scan remains a major challenge in real-world data. Most natural language processing (NLP) models in radiology have focused on abstracting imaging findings rather than the purpose of the scan. Furthermore, accurately inferring imaging intent often requires information beyond the radiology report, such as recent diagnoses, prior imaging, and clinical history, which is typically captured in structured electronic health records (EHRs).
DistinCT addresses these challenges by combining NLP-derived features from free-text CT reports with structured EHR variables that describe the clinical context. This hybrid model enables accurate prediction of imaging indications (surveillance CT versus other reasons for CT), facilitating scalable abstraction of surveillance imaging patterns. By bridging the gap between unstructured and structured EHR data, DistinCT supports robust real-world evaluations of surveillance strategies among long-term lung cancer survivors.
Methodology
DistinCT combines NLP and structured EHR data to predict CT imaging indications. A six-step NLP pipeline processes free-text CT reports: segmenting the report, tokenizing the text, tagging parts of speech, extracting key phrases using regular expressions and graph-based algorithms, clustering key phrases into oncology-related concepts, and deriving frequency-based features. In parallel, structured EHR features such as time since previous CT, provider specialty, recent symptom diagnoses, and lung disease diagnoses are extracted. By integrating both feature sets in a hybrid logistic regression model, DistinCT classifies each CT scan as surveillance or non-surveillance. The R package provides a ready-to-use, pre-trained model that can be applied to new datasets without retraining, requiring only the CT report texts and the associated structured clinical data.
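The flow described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the DistinCT R implementation or its API. The concept lexicon, the coefficient values, and the function names (segment_report, concept_frequencies, predict_surveillance_probability) are all hypothetical, and part-of-speech tagging, graph-based key-phrase extraction, and concept clustering are replaced here by a small hand-written set of regular expressions.

```python
import math
import re
from collections import Counter

# Hypothetical concept lexicon (regex -> concept label). The package itself
# derives concepts from key phrases; this hand-written map is only a stand-in.
CONCEPT_PATTERNS = {
    r"\b(surveillance|follow[- ]?up|restaging)\b": "surveillance",
    r"\b(cough|dyspnea|chest pain|hemoptysis|shortness of breath)\b": "symptom",
    r"\b(nodule|mass|lesion)s?\b": "finding",
}

# Hypothetical coefficients, chosen only so that the example runs.
# They are NOT the pre-trained DistinCT model weights.
COEFFICIENTS = {
    "intercept": -0.5,
    "surveillance": 1.2,
    "symptom": -1.0,
    "finding": 0.1,
    "months_since_prior_ct": 0.05,
    "recent_symptom_dx": -0.8,
}


def segment_report(text):
    """Split a report into sections keyed by upper-case headers such as 'INDICATION:'."""
    parts = re.split(r"^([A-Z][A-Z ]+):", text, flags=re.MULTILINE)
    # parts looks like [preamble, header1, body1, header2, body2, ...]
    return {
        header.strip().lower(): body.strip()
        for header, body in zip(parts[1::2], parts[2::2])
    }


def concept_frequencies(text):
    """Count how often each concept is mentioned (frequency-based features)."""
    lowered = text.lower()
    counts = Counter()
    for pattern, concept in CONCEPT_PATTERNS.items():
        counts[concept] += len(re.findall(pattern, lowered))
    return counts


def predict_surveillance_probability(report_text, months_since_prior_ct, recent_symptom_dx):
    """Combine text-derived and structured features in a logistic model."""
    sections = segment_report(report_text)
    # Fall back to the whole report when no INDICATION section is present.
    indication_text = sections.get("indication", report_text)
    features = dict(concept_frequencies(indication_text))
    features["months_since_prior_ct"] = float(months_since_prior_ct)
    features["recent_symptom_dx"] = 1.0 if recent_symptom_dx else 0.0
    linear = COEFFICIENTS["intercept"] + sum(
        COEFFICIENTS[name] * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-linear))


if __name__ == "__main__":
    report = (
        "INDICATION: Surveillance of treated lung cancer.\n"
        "FINDINGS: Stable 4 mm nodule."
    )
    print(predict_surveillance_probability(report, months_since_prior_ct=12, recent_symptom_dx=False))
```

In this toy setup, a report whose indication mentions surveillance, ordered a year after the previous CT with no recent symptom diagnosis, receives a higher surveillance probability than a report ordered for cough shortly after a prior scan. The point of the sketch is only the shape of the hybrid model: text-derived counts and structured variables enter a single logistic regression.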
Application
We applied the DistinCT R package to predict imaging indications for longitudinal chest CT reports among long-term lung cancer survivors at Stanford Health Care. The hybrid model, trained on over 1,200 CT reports, achieved high predictive performance on an independent hold-out test set, with an AUC of 0.86 and good calibration, outperforming models based on structured variables alone or NLP features alone. We then used the model to predict imaging indications for 585 lung cancer survivors across their 3,362 longitudinal CT reports and corresponding structured EHR data, enabling characterization of temporal surveillance patterns in this cohort.
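For readers less familiar with the AUC metric reported here, the following minimal Python sketch computes it from labels and predicted probabilities using the pairwise (Mann-Whitney) definition: the fraction of surveillance/non-surveillance pairs in which the surveillance scan receives the higher score, with ties counted as one half. The function name roc_auc and the toy data are illustrative assumptions and are unrelated to the study data or the package code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy example: 1 = surveillance CT, 0 = other indication.
    toy_labels = [1, 1, 0, 0]
    toy_scores = [0.9, 0.4, 0.5, 0.1]
    print(roc_auc(toy_labels, toy_scores))  # 3 of 4 pairs ranked correctly -> 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. The quadratic loop is adequate for a test set of a few hundred reports; a rank-based implementation would be preferable for much larger sets.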
To assess the clinical utility of the DistinCT model, we conducted an exploratory survival analysis incorporating the predicted CT indications. A naïve approach that compared any post-5-year CT imaging, regardless of indication, with no imaging showed no significant difference in overall survival (hazard ratio: 0.88; confidence interval: [0.61–1.26]; p=0.53). When patients were instead stratified by DistinCT-predicted surveillance CT receipt, a significant difference emerged: survivors who received at least one surveillance CT beyond 5 years had better overall survival than matched patients without any post-5-year CT imaging (hazard ratio: 0.60; confidence interval: [0.41–0.89]; p=0.016). These results highlight the added value of precise CT indication prediction with DistinCT for real-world evaluations of survivorship care and outcomes in lung cancer populations.
Availability
The DistinCT R package is freely available on GitHub: https://github.com/thehanlab/DistinCT
Contacts
Summer S. Han, Ph.D., Principal Investigator
Aparajita Khan, Ph.D., Methodology Development and Programming
Questions and comments should be addressed to: summerh@stanford.edu
Reference
A. Khan, E. Choi, C. Su, A. Graber-Naidich, S. Henry, A. W. Kurian, S. Liang, J. Neal, M. Desai, A. Leung, H. A. Wakelee, L. M. Backhus, C. Langlotz, J. Wu, and S. Han, “Automatic Abstraction of CT Imaging Indication using Natural Language Processing for Evaluation of Surveillance Patterns in Long-Term Lung Cancer Survivors”, Journal of Clinical Oncology: Clinical Cancer Informatics, April 2025.