A fine-tuned, domain-specific LLM (LLM-RadSum) outperforms GPT-4o in accurately summarizing radiology reports across multiple patient demographics and modalities.
Key Details
- LLM-RadSum, based on Llama 2, was trained and evaluated on more than 1 million CT and MRI radiology reports from five hospitals.
- The model achieved higher F1 scores in summarization than GPT-4o (0.58 vs. 0.3, p < 0.001), with results consistent across anatomic regions, modalities, sex, and age groups.
- 88.9% of LLM-RadSum's outputs were 'completely consistent' with the original reports, versus 43.1% for GPT-4o.
- 81.5% of LLM-RadSum outputs met senior radiologists' standards for safety and clinical use; most GPT-4o outputs required minor edits.
- Human evaluation covered 1,800 randomly selected reports, supporting generalizability across diverse hospital settings.
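The article does not specify which F1 variant was used; a common choice for comparing a generated summary against a reference is token-overlap F1. The sketch below is illustrative only, assuming whitespace tokenization and a hypothetical `token_f1` helper, not the study's actual evaluation code.

```python
from collections import Counter

def token_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between a generated summary and a reference summary.

    Illustrative sketch: assumes lowercase whitespace tokenization, which may
    differ from the tokenization used in the study.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both the candidate and the reference.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: identical impressions score a perfect 1.0.
print(round(token_f1("no acute intracranial abnormality",
                     "no acute intracranial abnormality"), 2))  # prints 1.0
```

Under this metric, a score of 0.58 versus 0.3 would mean LLM-RadSum's summaries share substantially more tokens with the reference impressions than GPT-4o's do.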
Why It Matters

Source
AuntMinnie
Related News

New Report Highlights Clinical AI Performance, Sustainability, and Adoption Challenges
A multi-institutional review details key challenges, progress, and sustainability concerns in deploying clinical AI in real-world healthcare settings.

FDA Clears AI Platform for Comprehensive Cardiac Risk Assessment on CT
HeartLung Corporation's AI-CVD receives FDA clearance for opportunistic multi-condition screening on routine chest CT scans.

LLM Boosts Terminology Expansion in Radiology Reports Over RadLex
A large language model (LLM) significantly outperforms RadLex in expanding terms for radiology report language standardization.