A fine-tuned, domain-specific LLM (LLM-RadSum) outperforms GPT-4o in accurately summarizing radiology reports across multiple patient demographics and modalities.
Key Details
- LLM-RadSum, based on Llama2, was trained and evaluated on over 1 million CT and MRI radiology reports from five hospitals.
- The model achieved higher F1 scores in summarization than GPT-4o (0.58 vs. 0.3, p < 0.001), consistent across anatomic regions, modalities, sexes, and age groups (see the F1 sketch after this list).
- 88.9% of LLM-RadSum's outputs were 'completely consistent' with the original reports, versus 43.1% for GPT-4o.
- 81.5% of LLM-RadSum outputs met senior radiologists' standards for safety and clinical use; most GPT-4o outputs required minor edits.
- Human evaluation included 1,800 randomly selected reports, underscoring generalizability across diverse hospital settings.
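The reported F1 gap (0.58 vs. 0.3) compares how closely each model's summary matches the reference report. The study's exact F1 definition is not given in this summary; the Python sketch below is an illustrative token-overlap F1 only (the function name and example texts are hypothetical), not the paper's actual evaluation code.

```python
from collections import Counter

def token_f1(generated: str, reference: str) -> float:
    """Whitespace-token overlap F1 between a generated summary and a reference impression.

    Illustrative only: the study's F1 may use a different matching scheme
    (e.g., entity- or finding-level comparison).
    """
    gen_tokens = generated.lower().split()
    ref_tokens = reference.lower().split()
    if not gen_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection: each shared token counts at most as often as it
    # appears in both texts.
    overlap = sum((Counter(gen_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: model-generated impression vs. radiologist impression.
print(round(token_f1(
    "no acute intracranial hemorrhage or mass effect",
    "no acute hemorrhage mass effect or midline shift",
), 2))
```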
Why It Matters
A domain-specific, fine-tuned model outperformed a general-purpose frontier LLM on radiology report summarization, suggesting that targeted fine-tuning on clinical data, rather than model scale alone, may be key to safe, clinically usable summarization tools.
Source
AuntMinnie
Related News

Multimodal LLMs Show Improved Performance on Japanese Radiology Board Exams
New multimodal large language models (LLMs) like OpenAI o3 and Gemini 2.5 Pro demonstrated significant advancements in answering Japanese radiology board exam questions, particularly with image input.

AI Surpasses Radiologists in Predicting Lung Cancer Treatment Response
AI demonstrates higher accuracy than radiologists in predicting lung cancer treatment response from imaging.

AI Model Improves Short-Term Breast Cancer Risk Prediction
AI models combining mammography and clinical data improve identification of women at high short-term breast cancer risk.