Prompt Engineering Enables Open-Source LLMs to Match Proprietary Models in Diagnostic Accuracy for Annotation of Radiology Reports
Authors
Affiliations (1)
- University of Copenhagen
Abstract
Aim
The aim of this study was to test whether open-source Large Language Models (LLMs) can match the diagnostic accuracy of proprietary models in annotating Danish trauma radiology reports across three clinical findings.
Materials and Methods
This retrospective study included 2,939 radiology reports of trauma radiographs collected from three Danish emergency departments. The data were split into 600 cases for prompt engineering and 2,339 for model evaluation. Eight LLMs (GPT-4o and GPT-4o-mini from OpenAI, and six Llama3 variants from Meta) were prompted to annotate the reports for fractures, effusions, and luxations. Human annotations served as the reference standard. Diagnostic performance was assessed using accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) with 95% confidence intervals (CIs).
Results
Prompt engineering improved the Match-score for Llama3-8b from 77.8% (95% CI: 74.4% - 81.1%) to 94.3% (95% CI: 92.5% - 96.2%). GPT-4o achieved the highest overall diagnostic accuracy at 97.9% (95% CI: 97.3% - 98.5%), followed by Llama3.1-405b (97.1%; 95% CI: 96.4% - 97.8%), GPT-4o-mini (96.9%; 95% CI: 96.2% - 97.6%), Llama3-8b (96.9%; 95% CI: 95.9% - 97.3%), and Llama3.1-70b (96.0%; 95% CI: 95.2% - 96.8%). Across the three findings, all models performed best for fractures, whereas effusions and luxations were more prone to errors. Semantic Confusion was the most frequent error type, accounting for 53.2% to 59.4% of misclassifications.
Conclusion
Small, open-source LLMs can accurately annotate Danish trauma radiology reports when supported by effective prompt engineering, achieving accuracy that rivals proprietary competitors. They offer a viable, privacy-conscious alternative for clinical use, even in a low-resource language setting.
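The abstract reports accuracy, sensitivity, specificity, PPV, and NPV with 95% CIs for binary per-finding annotations (fracture, effusion, luxation) against a human reference. As a minimal sketch of how such metrics can be derived from paired binary labels, the snippet below uses Wilson score intervals; the paper does not state its CI method, and all function and variable names here are hypothetical, not the authors' code.

```python
# Minimal sketch: per-finding diagnostic metrics with Wilson 95% CIs,
# given binary model annotations and binary human reference labels.
from math import sqrt

def wilson_ci(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if total == 0:
        return (0.0, 0.0)
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    margin = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (centre - margin, centre + margin)

def diagnostic_metrics(predicted: list[int], reference: list[int]) -> dict:
    """Accuracy, sensitivity, specificity, PPV, and NPV with 95% CIs."""
    tp = sum(p == 1 and r == 1 for p, r in zip(predicted, reference))
    tn = sum(p == 0 and r == 0 for p, r in zip(predicted, reference))
    fp = sum(p == 1 and r == 0 for p, r in zip(predicted, reference))
    fn = sum(p == 0 and r == 1 for p, r in zip(predicted, reference))
    n = tp + tn + fp + fn

    def prop(num: int, den: int) -> dict:
        return {"estimate": num / den if den else float("nan"),
                "ci95": wilson_ci(num, den)}

    return {
        "accuracy": prop(tp + tn, n),
        "sensitivity": prop(tp, tp + fn),
        "specificity": prop(tn, tn + fp),
        "ppv": prop(tp, tp + fp),
        "npv": prop(tn, tn + fn),
    }

# Example: hypothetical labels for the "fracture" finding across a few reports.
model_labels = [1, 0, 1, 1, 0, 0, 1, 0]
human_labels = [1, 0, 1, 0, 0, 0, 1, 0]
print(diagnostic_metrics(model_labels, human_labels)["accuracy"])
```

In the study itself, one such confusion matrix would be built per model and per finding over the 2,339 evaluation reports; the toy labels above only illustrate the calculation.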