Data Extraction from Oncology Imaging Reports by Large Language Models: A Comparative Accuracy Study
Authors
Affiliations (1)
- CLEAR Methods Center, Division of Clinical Epidemiology, Department of Clinical Research, University and University Hospital Basel, Basel, Switzerland.
Abstract
Importance: Manual data extraction from clinical text is resource intensive. Locally hosted large language models (LLMs) may offer a privacy-preserving alternative, but their performance on non-English data remains unclear.

Objective: To investigate whether the classification accuracy of locally hosted LLMs is non-inferior to human accuracy when determining metastasis status and treatment response from German radiology reports.

Design: In this retrospective comparative accuracy study, five locally hosted LLMs (llama3.3:70b, mistral-small:24b, qwq:32b, qwen3:32b, and gpt-oss:120b) were compared against human reviewers. To calculate accuracy, a ground truth was established via duplicate human extraction, with discrepancies adjudicated by a senior oncologist. Both the initial human extraction and the LLM outputs were compared against this ground truth.

Setting: The study was conducted at a tertiary referral hospital in Switzerland; all data processing and analyses took place inside the hospital network.

Participants: 400 randomly sampled radiology reports (CT, MRI, PET) from adult cancer patients, generated between January 2023 and May 2025.

Exposures: Automated classification of metastasis status and treatment response by LLMs using a standardized prompt pipeline, compared with manual human review.

Main Outcomes and Measures: Primary outcomes were non-inferiority (5 percentage point [pp] margin) of LLM classification accuracy compared with human accuracy for metastasis status (presence/absence by anatomical site) and treatment response categories. Secondary outcomes included accuracy for primary tumor diagnosis, radiological absence of tumor, and extraction time per report.

Results: The analysis included 400 reports from 317 patients (mean age 63 years; 32% women). On the test set (n=300), human accuracy for metastasis status was 98.4% (95% CI, 98.0%-98.8%). All LLMs were non-inferior; gpt-oss:120b performed best (97.6% accuracy; difference, -0.8 pp [90% CI, -1.3 to -0.3 pp]). For response to treatment, human accuracy was 86.0% (95% CI, 83.2%-88.8%). All LLMs were inferior; the most accurate model, gpt-oss:120b, achieved 78.3% (difference, -7.7 pp [90% CI, -11.6 to -3.8 pp]). Mean human time per report was 120 seconds vs 11-63 seconds for LLMs.

Conclusions and Relevance: In this study, LLMs were non-inferior to human accuracy for classification of metastasis status but were inferior for assessment of response to treatment. gpt-oss:120b was the most accurate of the tested LLMs.

Study Registration: OSF: 45PVQ

Key Points

Question: Can locally hosted large language models (LLMs) match human performance when extracting sites of metastases and response to treatment from radiology reports of cancer patients?

Findings: In this preregistered, single-center study of 300 German radiology reports, all evaluated LLMs were non-inferior to humans in extracting the presence or absence of metastasis by organ site, but inferior to humans in classifying response to treatment.

Meaning: LLMs can be suitable for classification of metastasis status, whereas more caution is warranted for more complex tasks that may require additional clinical reasoning.
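The non-inferiority logic described in the outcomes (accuracy difference with a 90% CI judged against a -5 pp margin) can be sketched as follows. This is an illustrative simplification using an unpaired Wald interval and hypothetical item counts, not the study's actual data or its exact statistical method, which may use a paired or cluster-adjusted interval.

```python
import math


def noninferiority_check(correct_llm, correct_human, n, margin=-0.05, z=1.645):
    """Compare LLM vs human accuracy on n classification items.

    Returns the accuracy difference (LLM - human), its 90% Wald CI
    (z = 1.645), and whether the CI lower bound exceeds the margin,
    i.e. whether non-inferiority at the 5 pp margin is concluded.
    Simplified unpaired SE; a paired design would normally use a
    McNemar-style interval instead.
    """
    p_llm = correct_llm / n
    p_hum = correct_human / n
    diff = p_llm - p_hum
    se = math.sqrt(p_llm * (1 - p_llm) / n + p_hum * (1 - p_hum) / n)
    ci = (diff - z * se, diff + z * se)
    return diff, ci, ci[0] > margin


# Hypothetical counts chosen only to illustrate the mechanics
diff, ci, noninferior = noninferiority_check(2928, 2952, 3000)
```

With these made-up counts the difference is -0.8 pp and the CI lower bound stays above -5 pp, so non-inferiority would be concluded; a wider interval or larger deficit (as reported for treatment response) would fail the check.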