
Comparing large language models and human experts in interpreting MRI reports for personalized patient education.

March 19, 2026

Authors

Du K, Li A, Zuo QH, Zhang CY, Guo R, Chen P, Du WS, Zuo YL, Li SM

Affiliations (3)

  • Department of Pain Medicine, Beijing Hospital of Traditional Chinese Medicine, Capital Medical University, 23 Meishuguan Houjie, Dongcheng District, Beijing 100010, China; Graduate School, Beijing University of Chinese Medicine, 11 Beisanhuan Donglu, Chaoyang District, Beijing 100029, China.
  • Department of Pain Medicine, Beijing Hospital of Traditional Chinese Medicine, Capital Medical University, 23 Meishuguan Houjie, Dongcheng District, Beijing 100010, China.
  • Department of Pain Medicine, Beijing Hospital of Traditional Chinese Medicine, Capital Medical University, 23 Meishuguan Houjie, Dongcheng District, Beijing 100010, China. Electronic address: [email protected].

Abstract

Knee osteoarthritis (OA) is a prevalent global condition. While MRI guides clinical decisions, its technical complexity hinders patient understanding and engagement, and translating these findings into comprehensible, personalized patient education remains challenging. Large language models (LLMs) show promise in automating this process. This study aimed to evaluate and compare the effectiveness of advanced LLMs against experienced clinicians in generating comprehensible, personalized patient education materials derived from knee MRI reports.

The study compared the performance of two LLMs, GPT-4o and Claude 3.5 Sonnet, with that of experienced clinicians in generating personalized patient education materials from 150 anonymized knee MRI reports. To assess their effectiveness, the authors developed a comprehensive, multidimensional evaluation framework comprising: readability evaluation, using both validated linguistic metrics and expert assessments of clarity, emphasis, and coherence; content personalization, quantified with a novel structured scoring system focused on the specificity, practicality, and actionability of recommendations; and generation efficiency, measured in words per minute.

Both LLMs significantly outperformed clinicians across key metrics, with GPT-4o showing superior performance. Compared with clinicians, GPT-4o and Claude 3.5 Sonnet demonstrated higher expert-rated understandability (72 [IQR 6] vs 60 [IQR 6] vs 50 [IQR 12], P < 0.001), better personalization scores (68 [IQR 2] vs 62 [IQR 4] vs 64 [IQR 9], P < 0.001), and markedly higher generation efficiency (1348.5 ± 202.2 vs 1160.8 ± 137.2 vs 142.6 ± 29.8 WPM, P < 0.001). Readability indices consistently favored LLM-generated content.

Advanced LLMs, particularly GPT-4o, showed strong performance in translating knee MRI reports into comprehensible and personalized patient education materials, with advantages in readability, personalization, and efficiency over clinician-generated outputs in this study setting. These findings support the potential role of LLMs as clinician-supervised tools for scalable patient education, while highlighting the need for further validation across institutions, models, and clinical workflows before deployment.
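The abstract does not name which validated readability indices were applied, and it gives generation efficiency in words per minute. As an illustration only (not the study's actual pipeline), the sketch below computes the standard Flesch Reading Ease score, one widely used validated readability metric, alongside a words-per-minute measure; the function names and the crude syllable heuristic are the author's assumptions.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels (min 1).
    # Real readability tools use dictionary-based syllabification.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease; higher scores indicate easier text.
    Score = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/word)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * len(words) / len(sentences)
            - 84.6 * syllables / len(words))

def words_per_minute(text: str, elapsed_seconds: float) -> float:
    # Generation efficiency as the study reports it: words per minute.
    return len(re.findall(r"[A-Za-z']+", text)) / (elapsed_seconds / 60.0)
```

For example, a short sentence of monosyllables ("The cat sat on the mat.") scores about 116, near the top of the Flesch scale, while dense radiology prose typically scores far lower, which is the gap the expert understandability ratings are meant to capture.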

Topics

Journal Article
