
Large language models for structuring knee ultrasound reports: a comparative evaluation of DeepSeek R1, Gemini 2.5 Flash, and GPT-4o.

March 3, 2026

Authors

Tang M, Xu L, Zhu L, Maimaitiabula A, Zhang X, Pei C, Hu L

Affiliations (5)

  • Department of Ultrasound, The First Affiliated Hospital of USTC, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, 230001, China.
  • Department of Orthopedics, The First Affiliated Hospital of USTC, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, 230001, China.
  • Department of Orthopedics, The First Affiliated Hospital of USTC, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, 230001, China. [email protected].
  • Department of Respiratory and Critical Care Medicine, The First People's Hospital of Hefei City, The Third Affiliated Hospital of Anhui Medical University, Hefei, 230001, China. [email protected].
  • Department of Ultrasound, The First Affiliated Hospital of USTC, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, 230001, China. [email protected].

Abstract

To evaluate three LLMs (DeepSeek R1, Gemini 2.5 Flash, and GPT-4o) for generating structured, clinically interpretable outputs from free-text knee ultrasound reports.

We retrospectively analyzed 359 de-identified knee ultrasound reports using four prompt strategies targeting (Prompt 1) information extraction, (Prompt 2) structured report generation, (Prompt 3) impression generation, and (Prompt 4) clinical recommendation generation. Two experienced physicians rated outputs using prespecified 5-point Likert rubrics for extraction quality, structuring quality, diagnostic usefulness, recommendation quality, readability, and subjective repeatability. Objective evaluations included Prompt 2 structural completeness and stability under a fixed slot schema (schema compliance, missing-item rate, and slot-level exact match), Prompt 3 objective repeatability via mapped binary findings (percent agreement and Cohen's κ across repeated runs), and Prompt 3 diagnostic performance against an expert binary reference standard (sensitivity/specificity, F1 score, balanced accuracy, and MCC; 95% confidence intervals via bootstrapping). For Chinese recommendation metrics, BLEU/ROUGE-L were computed using character-level tokenization.

Across 359 knee ultrasound reports, GPT-4o demonstrated the strongest overall performance across extraction, diagnostic impression generation, and clinical recommendation tasks. It achieved the highest physician-rated scores for NER (4.0 ± 0.6) and diagnostic summarization (3.8 ± 0.7), along with the best diagnostic performance (F1 score = 0.79 vs 0.54 for DeepSeek R1 and 0.45 for Gemini 2.5 Flash, with the highest balanced accuracy and MCC). DeepSeek R1 showed competitive performance in structured reporting (4.0 ± 0.5) and repeatability (3.41 ± 0.06), with stable schema compliance and run-to-run consistency. Gemini 2.5 Flash consistently underperformed, particularly in NER accuracy (3.2 ± 0.7) and diagnostic recall (0.37). Objective analyses further confirmed superior structural completeness, repeatability, and semantic quality of recommendations for GPT-4o compared with the other models.

GPT-4o consistently outperformed DeepSeek R1 and Gemini 2.5 Flash, supporting its potential to improve knee ultrasound report consistency and clinical utility.
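The objective evaluations named in the abstract (percent agreement and Cohen's κ across repeated runs, and character-level ROUGE-L for Chinese text) can be sketched as below. This is an illustrative reimplementation, not the authors' code; the example finding vectors and the Chinese strings are invented for demonstration.

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two paired label sequences (e.g., two repeated runs)."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    labels = set(a) | set(b)
    # Observed agreement: fraction of positions where the runs match
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of marginal label frequencies, summed over labels
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

def rouge_l_f1(reference, candidate):
    """Character-level ROUGE-L F1: each character is one token, scored via
    the longest common subsequence (LCS) between reference and candidate."""
    m, n = len(reference), len(candidate)
    dp = [[0] * (n + 1) for _ in range(m + 1)]  # LCS dynamic-programming table
    for i in range(m):
        for j in range(n):
            if reference[i] == candidate[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / n, lcs / m
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: two repeated model runs mapped to binary findings
# (1 = finding present, 0 = absent), as in the Prompt 3 repeatability analysis
run1 = [1, 1, 0, 0, 1, 0, 1, 0]
run2 = [1, 0, 0, 0, 1, 0, 1, 1]
agreement = sum(x == y for x, y in zip(run1, run2)) / len(run1)  # 0.75
kappa = cohens_kappa(run1, run2)                                  # 0.5

# Character-level tokenization for Chinese: no word segmentation needed,
# since each character is treated as a token
f1 = rouge_l_f1("膝关节腔积液", "关节腔积液")
```

Character-level tokenization sidesteps Chinese word segmentation, which is why it is a common choice when computing BLEU/ROUGE-L on Chinese clinical text.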

Topics

Journal Article
