A fine-tuned, domain-specific LLM (LLM-RadSum) outperforms GPT-4o in accurately summarizing radiology reports across multiple patient demographics and modalities.
Key Details
- LLM-RadSum, based on Llama2, was trained and evaluated on over 1 million CT and MRI radiology reports from five hospitals.
- The model achieved higher F1 scores in summarization compared to GPT-4o (0.58 vs. 0.3, p < 0.001), consistent across anatomic regions, modalities, sex, and ages.
- 88.9% of LLM-RadSum's outputs were 'completely consistent' with original reports, versus 43.1% for GPT-4o.
- 81.5% of LLM-RadSum outputs met senior radiologists’ standards for safety and clinical use; most GPT-4o outputs required minor edits.
- Human evaluation included 1,800 randomly selected reports, underscoring generalizability within diverse hospital settings.
Source
AuntMinnie