Automated Clinical Information Extraction from Diagnostic and Nondiagnostic Radiology Reports Using Modern Language Models.
Authors
Affiliations (5)
- Medical Scientist Training Program, Case Western Reserve University School of Medicine, Cleveland, OH, USA. [email protected].
- Department of Population and Quantitative Health Sciences, Case Western Reserve University School of Medicine, Cleveland, OH, USA. [email protected].
- Center for Value-Based Care Research, Department of Internal Medicine and Geriatrics, Primary Care Institute, Cleveland Clinic, Cleveland, OH, USA. [email protected].
- Center for Diagnostics and Artificial Intelligence, Pathology & Laboratory Medicine Institute, Cleveland Clinic, Cleveland, OH, USA.
- Center for Value-Based Care Research, Department of Internal Medicine and Geriatrics, Primary Care Institute, Cleveland Clinic, Cleveland, OH, USA.
Abstract
Our objective was to automate the extraction of clinical diagnoses from diagnostic and nondiagnostic radiology reports using modern language models and structured electronic health record (EHR) data. We selected venous thromboembolism (VTE) as our use case because imaging is the gold standard for diagnosis but is not always fully diagnostic. We extracted venous duplex, computed tomography, and ventilation-perfusion scan reports from the Cleveland Clinic EHR system for patients admitted from 2011 through 2020. Report ground truths were positive, negative, or nondiagnostic. We compared multiple large language models (LLMs) and bidirectional encoder representations from transformers (BERT) models on multiclass classification in holdout evaluation sets. Error analysis guided iterative LLM prompt design and maximized the detection of nondiagnostic reports. ICD-10 codes and therapeutic anticoagulation data were used to adjudicate VTE diagnoses for patients with nondiagnostic reports. We identified 82,476 radiology reports among 213,724 patients. Across models, multiclass areas under the receiver operating characteristic and precision-recall curves ranged from 0.83 to 0.96 and from 0.57 to 0.94, respectively. The most accurate model, Llama-3.3, detected 95% of VTE-positive reports with a precision of 99.6% and detected 87% of nondiagnostic reports with a precision of 88%. The positive detection rate increased to 98% when we paired structured EHR variables with minimal chart review (0.7% of the evaluation set) to adjudicate diagnoses for patients with nondiagnostic reports. In summary, Llama-3.3 was highly sensitive and specific for positive VTE diagnoses and nondiagnostic radiology reports. We integrated an LLM, structured EHR variables, and limited chart review to successfully manage diagnostic uncertainty in automated information extraction from radiology reports.
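The adjudication step described above (resolving nondiagnostic reports with structured EHR variables before escalating to chart review) could be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function, field names, and decision rules are assumptions, although the ICD-10 prefixes (I26 pulmonary embolism, I80 phlebitis/thrombophlebitis, I82 other venous embolism and thrombosis) are real code families.

```python
# Hypothetical sketch of post-classification adjudication: reports the model
# labels "nondiagnostic" are resolved with structured EHR signals (VTE ICD-10
# codes and therapeutic anticoagulation), with discordant cases escalated to
# manual chart review. All names and thresholds are illustrative.

# Real ICD-10 code families for VTE; the exact code set used is an assumption.
VTE_ICD10_PREFIXES = ("I26", "I80", "I82")

def adjudicate(report_label: str,
               icd10_codes: list[str],
               on_therapeutic_anticoagulation: bool) -> str:
    """Return a final VTE status: 'positive', 'negative', or 'chart review'."""
    if report_label in ("positive", "negative"):
        return report_label  # diagnostic reports stand on their own
    # Nondiagnostic report: fall back to structured EHR variables.
    has_vte_code = any(code.startswith(VTE_ICD10_PREFIXES) for code in icd10_codes)
    if has_vte_code and on_therapeutic_anticoagulation:
        return "positive"
    if not has_vte_code and not on_therapeutic_anticoagulation:
        return "negative"
    return "chart review"  # discordant signals require manual review

if __name__ == "__main__":
    print(adjudicate("nondiagnostic", ["I26.99", "E11.9"], True))   # positive
    print(adjudicate("nondiagnostic", ["E11.9"], False))            # negative
    print(adjudicate("nondiagnostic", ["I80.2"], False))            # chart review
```

Under a rule like this, only the discordant minority of nondiagnostic reports would need manual review, consistent with the small chart-review fraction (0.7% of the evaluation set) reported in the abstract.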