
Prompt engineering enables open-source large language models to match proprietary models in diagnostic accuracy for annotation of radiology reports.

March 20, 2026

Authors

Petersen LA, Beck MS, Andersen MB, Xu JJ, Bruun FJ

Affiliations (3)

  • The Faculty of Health and Medical Sciences, University of Copenhagen, Nørre Allé 20, 2200 Copenhagen, Denmark.
  • The Department of Radiology, Herlev and Gentofte Hospital, Borgmester Ib Juuls Vej 1, 2730 Herlev, Denmark.
  • The Department of Radiology, Bispebjerg and Frederiksberg Hospital, Nielsine Nielsens Vej 41A, 2400 Copenhagen NV, Denmark. Electronic address: [email protected].

Abstract

This study aimed to test whether open-source large language models (LLMs) can match the diagnostic accuracy of proprietary models in annotating trauma radiology reports written in a low-resource language across 3 clinical findings. This retrospective study included 2939 radiology reports of trauma radiographs collected from 3 Danish emergency hospital centers. The data were split, with 600 cases for prompt engineering and 2339 for model evaluation. Eight LLMs (GPT-4o, GPT-4o-mini, Ministral-8b, Qwen3-8b, DeepseekR1-8b, and 3 Llama3 variants) were prompted to annotate the reports for fractures, effusions, and luxations. The reference standard was human annotations. Diagnostic performance was assessed using accuracy, sensitivity, specificity, positive predictive value, and negative predictive value. Prompt engineering improved the Match score for Llama3-8b from 77.8% (95% CI: 74.4%-81.1%) to 94.3% (95% CI: 92.5%-96.2%). GPT-4o achieved the highest overall diagnostic accuracy at 97.9% (95% CI: 97.3%-98.5%), followed by Qwen3-8b (97.8%; 95% CI: 97.4%-98.1%), Llama3.1-405b (97.1%; 95% CI: 96.4%-97.8%), GPT-4o-mini (96.9%; 95% CI: 96.2%-97.6%), Ministral-8b (96.9%; 95% CI: 96.4%-97.2%), and Llama3-8b (96.9%; 95% CI: 95.9%-97.3%). Across the 3 specific findings, all models performed best for fractures, whereas effusion and luxation were more prone to errors. Of the error types, semantic confusion was the most frequent, accounting for 53.2%-59.4% of misclassifications. Small, open-source LLMs can accurately annotate trauma radiology reports written in a low-resource language when supported by effective prompt engineering, achieving accuracy levels that rival those of proprietary competitors. They offer a viable, privacy-conscious alternative for clinical use.
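
The five performance measures named in the abstract can all be derived from the counts of a binary confusion matrix (model annotation versus human reference). The following is a minimal Python sketch of that arithmetic, not the authors' analysis code. The function names and the example counts are hypothetical, and because the abstract does not say how the 95% confidence intervals were computed, the Wilson score interval used here is an assumption.

```python
from math import sqrt


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def diagnostic_metrics(tp, fp, tn, fn):
    """Compute the five metrics named in the abstract from confusion-matrix counts."""
    return {
        "accuracy": safe_ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": safe_ratio(tp, tp + fn),   # true positive rate
        "specificity": safe_ratio(tn, tn + fp),   # true negative rate
        "ppv": safe_ratio(tp, tp + fp),           # positive predictive value
        "npv": safe_ratio(tn, tn + fn),           # negative predictive value
    }


def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a proportion (assumed method; z=1.96 for ~95%)."""
    if n == 0:
        return (float("nan"), float("nan"))
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (centre - half, centre + half)


# Hypothetical counts for a single finding; these are NOT data from the study.
counts = {"tp": 90, "fp": 10, "tn": 880, "fn": 20}
metrics = diagnostic_metrics(**counts)
low, high = wilson_ci(counts["tp"] + counts["tn"], sum(counts.values()))
print(f"accuracy={metrics['accuracy']:.3f} (95% CI: {low:.3f}-{high:.3f})")
```

In a setting like the one described, such a function would be applied once per finding (fracture, effusion, luxation) and per model, with the human annotations supplying the reference labels.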

Topics

  • Large Language Models
  • Wounds and Injuries
  • Radiology Information Systems
  • Journal Article
