Large language models for structuring knee ultrasound reports: a comparative evaluation of DeepSeek R1, Gemini 2.5 Flash, and GPT-4o.
Authors
Affiliations (5)
- Department of Ultrasound, The First Affiliated Hospital of USTC, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, 230001, China.
- Department of Orthopedics, The First Affiliated Hospital of USTC, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, 230001, China.
- Department of Orthopedics, The First Affiliated Hospital of USTC, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, 230001, China. [email protected].
- Department of Respiratory and Critical Care Medicine, The First People's Hospital of Hefei City, The Third Affiliated Hospital of Anhui Medical University, Hefei, 230001, China. [email protected].
- Department of Ultrasound, The First Affiliated Hospital of USTC, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, 230001, China. [email protected].
Abstract
This study evaluated three large language models (LLMs): DeepSeek R1, Gemini 2.5 Flash, and GPT-4o, for generating structured, clinically interpretable outputs from free-text knee ultrasound reports. We retrospectively analyzed 359 de-identified knee ultrasound reports using four prompt strategies targeting (Prompt 1) information extraction, (Prompt 2) structured report generation, (Prompt 3) impression generation, and (Prompt 4) clinical recommendation generation. Two experienced physicians rated outputs using prespecified 5-point Likert rubrics for extraction quality, structuring quality, diagnostic usefulness, recommendation quality, readability, and subjective repeatability. Objective evaluations included Prompt 2 structural completeness and stability under a fixed slot schema (schema compliance, missing-item rate, and slot-level exact match), Prompt 3 objective repeatability via mapped binary findings (percent agreement and Cohen's κ across repeated runs), and Prompt 3 diagnostic performance against an expert binary reference standard (sensitivity/specificity, F1 score, balanced accuracy, and Matthews correlation coefficient [MCC]; 95% confidence intervals via bootstrapping). For Chinese recommendation metrics, BLEU/ROUGE-L were computed using character-level tokenization. Across the 359 knee ultrasound reports, GPT-4o demonstrated the strongest overall performance across extraction, diagnostic impression generation, and clinical recommendation tasks. It achieved the highest physician-rated scores for named entity recognition (NER; 4.0 ± 0.6) and diagnostic summarization (3.8 ± 0.7), along with the best diagnostic performance (F1 score = 0.79 vs 0.54 for DeepSeek R1 and 0.45 for Gemini 2.5 Flash, with the highest balanced accuracy and MCC). DeepSeek R1 showed competitive performance in structured reporting (4.0 ± 0.5) and repeatability (3.41 ± 0.06), with stable schema compliance and run-to-run consistency. Gemini 2.5 Flash consistently underperformed, particularly in NER accuracy (3.2 ± 0.7) and diagnostic recall (0.37).
Objective analyses further confirmed superior structural completeness, repeatability, and semantic quality of recommendations for GPT-4o compared with the other models. GPT-4o consistently outperformed DeepSeek R1 and Gemini 2.5 Flash, supporting its potential to improve knee ultrasound report consistency and clinical utility.
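The abstract names its objective metrics but does not describe how they were computed. As a minimal sketch only, assuming each report's impression has been mapped to a binary finding (1 = present, 0 = absent), the diagnostic metrics, Cohen's κ between two repeated runs, and a percentile bootstrap confidence interval could be computed as below. The function names, the resampling count, and the zero-denominator convention are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def confusion(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = finding present)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def _div(num, den):
    # Assumed convention: an undefined ratio (zero denominator) is reported as 0.0.
    return num / den if den else 0.0


def diagnostic_metrics(y_true, y_pred):
    """Sensitivity, specificity, F1, balanced accuracy, and MCC."""
    tp, fp, fn, tn = confusion(y_true, y_pred)
    sens = _div(tp, tp + fn)
    spec = _div(tn, tn + fp)
    prec = _div(tp, tp + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": sens,
        "specificity": spec,
        "f1": _div(2 * prec * sens, prec + sens),
        "balanced_accuracy": (sens + spec) / 2,
        "mcc": _div(tp * tn - fp * fn, mcc_den),
    }


def cohen_kappa(run_a, run_b):
    """Cohen's kappa for two binary label sequences (e.g. two repeated runs)."""
    n = len(run_a)
    p_observed = sum(a == b for a, b in zip(run_a, run_b)) / n
    pa, pb = sum(run_a) / n, sum(run_b) / n
    p_chance = pa * pb + (1 - pa) * (1 - pb)
    # Returns 0.0 when chance agreement is 1 and kappa is undefined.
    return _div(p_observed - p_chance, 1 - p_chance)


def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for one metric, resampling reports with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        sample = diagnostic_metrics([y_true[i] for i in idx],
                                    [y_pred[i] for i in idx])
        stats.append(sample[metric])
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


if __name__ == "__main__":
    # Toy labels for illustration only; these are not study data.
    reference = [1, 1, 1, 1, 0, 0, 0, 0]
    model_out = [1, 1, 1, 0, 0, 0, 0, 1]
    print(diagnostic_metrics(reference, model_out))
    print(cohen_kappa(reference, model_out))
    print(bootstrap_ci(reference, model_out, "f1"))
```

On the toy labels above, sensitivity, specificity, F1, and balanced accuracy all equal 0.75, while MCC and κ equal 0.5. MCC and balanced accuracy are reported alongside F1 because F1 ignores true negatives and can look favorable on imbalanced findings.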