A fine-tuned, domain-specific LLM (LLM-RadSum) outperforms GPT-4o in accurately summarizing radiology reports across multiple patient demographics and modalities.
Key Details
- LLM-RadSum, based on Llama2, was trained and evaluated on over 1 million CT and MRI radiology reports from five hospitals.
- The model achieved a higher summarization F1 score than GPT-4o (0.58 vs. 0.3, p < 0.001), and the advantage held across anatomic regions, modalities, sex, and age groups.
- 88.9% of LLM-RadSum's outputs were 'completely consistent' with original reports, versus 43.1% for GPT-4o.
- 81.5% of LLM-RadSum outputs met senior radiologists’ standards for safety and clinical use; most GPT-4o outputs required minor edits.
- Human evaluation included 1,800 randomly selected reports, underscoring generalizability within diverse hospital settings.
Why It Matters

Source
AuntMinnie