Consensus-Level and Cluster-Adjusted Evaluation of a Large Language Model for Diagnostic Extraction from Musculoskeletal Radiology Reports.

May 22, 2026

DOI: 10.3390/diagnostics16111590 PMID: 42279457

Authors

Bosbach WA, Montazeri E, Senge JF, Beisbart C, Mitrakovic M, Anderson SE, Divjak E, Ivanac G, Grieser T, Weber MA, Sanal HT, Daneshvar K

Affiliations (10)

Department of Nuclear Medicine, Inselspital, Bern University Hospital, University of Bern, 3010 Bern, Switzerland.
Department of Mathematics and Computer Science, University of Bremen, 28359 Bremen, Germany.
Dioscuri Centre for Topological Data Analysis, Institute of Mathematics, Polish Academy of Sciences, 00-656 Warsaw, Poland.
Institute of Philosophy, University of Bern, 3012 Bern, Switzerland.
Department of Diagnostic, Interventional and Paediatric Radiology (DIPR), Inselspital, Bern University Hospital, University of Bern, 3010 Bern, Switzerland.
School of Medicine, Sydney Campus, University of Notre Dame, Broadway, P.O. Box 944, Sydney 2007, Australia.
Department of Diagnostic and Interventional Radiology, University Hospital "Dubrava", University of Zagreb School of Medicine, 10000 Zagreb, Croatia.
Department of Diagnostic and Interventional Radiology, University Hospital Augsburg, 86156 Augsburg, Germany.
Institute of Diagnostic and Interventional Radiology, Pediatric Radiology and Neuroradiology, University Medical Center Rostock, 18057 Rostock, Germany.
Radiology Department, Gülhane Training and Research Hospital, University of Health Sciences, 06010 Etlik, Ankara, Türkiye.

Abstract

Purpose: Large language models (LLMs) may reduce administrative workload in radiology by automating structured diagnostic extraction from text reports. This study evaluates the accuracy of ChatGPT-4.0 when extracting correct diagnoses from musculoskeletal (MSK) radiology text reports, and compares its performance with that of experienced human readers, using cluster-adjusted and consensus-level analyses.

Materials and Methods: Twenty-three multimodal MSK imaging cases (X-ray, ultrasound, CT, and MRI) were analysed. Ten human readers and ChatGPT-4.0 (10 independent iterations) provided primary (1st) and secondary (2nd) diagnoses from six predefined options. We analysed data at the individual-reader level using cluster-adjusted generalised estimating equations (GEE) and at the case level using majority consensus with exact McNemar testing. Within-case (α_case) and within-reader (α_reader) correlations and design effects were calculated to assess clustering and implications for sample size.

Results: For 1st diagnoses, AI accuracy was 0.957 (95% CI 0.922-0.976) versus 0.865 (95% CI 0.815-0.903) for human readers (absolute difference -0.091; OR 3.43, 95% CI 1.07-11.02; p = 0.038). Within-case correlation (α_case = 0.247) exceeded within-reader correlation (α_reader ≈ 0); this resulted in a design effect of 5.7 and an effective sample size of 80.7. At the consensus level, discordance occurred in 2/23 cases (8.7%), with no significant difference between methods (McNemar p = 1.00). When 1st and 2nd diagnoses were combined, both systems achieved 23/23 correct consensus classifications. Interrater reliability between AI and human classifications was almost perfect (Gwet's AC1 = 0.836-0.927).

Conclusions and Key points: In this structured MSK text-report setting, ChatGPT-4.0 achieved diagnostic accuracy comparable to that of experienced radiologists, with modest individual-reader advantages that disappeared under consensus aggregation. Clustering analysis indicates that variability is primarily case-driven, suggesting that future validation studies will benefit more from expanding case numbers than reader numbers. Our data suggest that large performance divergences between AI and human consensus are unlikely in similar structured diagnostic contexts.
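The clustering figures in the Results can be roughly reproduced with two textbook formulas. The sketch below is illustrative only and is not the authors' code. It assumes the standard Kish design effect, 1 + (m - 1)·α, with m = 20 ratings per case (10 human readers plus 10 ChatGPT-4.0 iterations), and it assumes the two discordant consensus cases split one in each direction, which is the only split consistent with an exact McNemar p of 1.00. The function names are hypothetical.

```python
from math import comb


def design_effect(cluster_size: int, icc: float) -> float:
    """Kish design effect for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1.0 + (cluster_size - 1) * icc


def exact_mcnemar_p(b: int, c: int) -> float:
    """Two-sided exact (binomial) McNemar p-value from the discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled and capped at 1
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Values taken from the abstract
n_cases = 23
alpha_case = 0.247

# Assumption: every case is rated by 10 humans and 10 AI iterations
ratings_per_case = 20

deff = design_effect(ratings_per_case, alpha_case)
n_eff = n_cases * ratings_per_case / deff

# Assumption: the 2 discordant consensus cases split 1 vs 1
p_consensus = exact_mcnemar_p(1, 1)

print(f"design effect ~ {deff:.1f}, effective sample size ~ {n_eff:.1f}")
print(f"exact McNemar p = {p_consensus:.2f}")
```

Under these assumptions the design effect comes out at about 5.7 and the effective sample size at about 81 of the 460 ratings, in line with the reported values up to rounding of α_case. Because nearly all of the correlation sits within cases, adding readers mostly adds correlated ratings, which is why the authors recommend adding cases instead.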

Topics

Journal Article
