A fine-tuned, domain-specific LLM (LLM-RadSum) outperforms GPT-4o in accurately summarizing radiology reports across multiple patient demographics and modalities.
Key Details
- LLM-RadSum, based on Llama2, was trained and evaluated on over 1 million CT and MRI radiology reports from five hospitals.
- The model achieved higher F1 scores in summarization compared to GPT-4o (0.58 vs. 0.3, p < 0.001), consistent across anatomic regions, modalities, sex, and ages.
- 88.9% of LLM-RadSum's outputs were 'completely consistent' with original reports, versus 43.1% for GPT-4o.
- 81.5% of LLM-RadSum outputs met senior radiologists’ standards for safety and clinical use; most GPT-4o outputs required minor edits.
- Human evaluation included 1,800 randomly selected reports, underscoring generalizability within diverse hospital settings.
Why It Matters

Source
AuntMinnie