
Can AI write reports like a radiologist? A blinded evaluation of large language model-generated lumbar spine MRI reports.

February 23, 2026

Authors

Zanardo M, Albano D, Molinari V, Fabrizio R, Conca M, Asmundo L, Pardo F, Traina F, Montechiari M, Gitto S, Sconfienza LM

Affiliations (10)

  • Radiology Unit, IRCCS Policlinico San Donato, San Donato Milanese, Italy.
  • Department of Radiology, ASST Grande Ospedale Metropolitano Niguarda, Milan, Italy. [email protected].
  • Dipartimento di Scienze Biomediche, Chirurgiche ed Odontoiatriche, Università degli Studi di Milano, Milano, Italy. [email protected].
  • Scuola di Specializzazione in Radiodiagnostica, Università degli Studi di Milano, 20122, Milan, Italy.
  • Department of Radiology, ASST Grande Ospedale Metropolitano Niguarda, Milan, Italy.
  • SC Ortopedia-Traumatologia e Chirurgia Protesica e dei Reimpianti d'Anca e di Ginocchio, IRCCS Istituto Ortopedico Rizzoli, Via Pupilli 1, Bologna, 40136, Italy.
  • Orthopaedics and Traumatology, University of Bologna, DIBINEM, Bologna, 40123, Italy.
  • Azienda Socio-Sanitaria Territoriale (ASST) Fatebenefratelli-Sacco, Milan, Italy.
  • IRCCS Istituto Ortopedico Galeazzi, Milan, Italy.
  • Dipartimento di Scienze Biomediche per la Salute, Università degli Studi di Milano, Milan, Italy.

Abstract

To compare the quality and clinical usefulness of large language model (LLM)-generated lumbar spine magnetic resonance imaging (MRI) reports with radiologist-written ones, and to assess whether medical professionals can distinguish between them.

This retrospective observational single-center study was approved by the local ethics committee. A total of 125 lumbar spine MRI reports (104 human-written, 21 LLM-generated using ChatGPT-4o) were anonymized, randomized, and blindly evaluated by five medical professionals (one board-certified radiologist, two radiology residents, one general practitioner, and one orthopedic surgeon), all with basic familiarity with LLMs. The radiologist and residents scored each report on a five-point Likert scale for clinical relevance, clarity, completeness, diagnostic accuracy, and intelligibility, whereas the general practitioner and the orthopedic surgeon rated intelligibility only. Evaluators also classified each report as AI-generated or human-written; identification accuracy was defined as the proportion of reports correctly classified as LLM-generated or radiologist-written. Mann-Whitney U or Student's t-tests were used for comparisons.

Radiologists' reports consistently received higher median scores across all domains (p < 0.001); no differences were found in the description of the imaging technique (p > 0.175), and no clinically false statements were identified in the LLM-generated reports. Identification accuracy varied widely among evaluators: the board-certified radiologist achieved 88.0% (sensitivity 66.7%, specificity 92.3%), resident 1 achieved 65.6% (14.3%, 76.0%), resident 2 achieved 94.4% (66.7%, 100%), the orthopedic surgeon 78.4% (90.5%, 76.0%), and the general practitioner 65.6% (81.0%, 62.5%).

Radiologist-written lumbar spine MRI reports outperform LLM-generated ones in quality and structure, scoring significantly higher for clinical relevance, findings, and structure. Even so, LLM-generated reports were clinically coherent, stylistically comparable to those written by expert radiologists, and sometimes misclassified as human-written, particularly by non-specialized readers. LLMs can draft lumbar spine MRI reports but currently lack the quality and consistency of radiologist reports; with radiologist supervision, they may support structured reporting and improve workflow efficiency while preserving diagnostic reliability and supporting clinical decision-making.
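The per-evaluator identification figures can be cross-checked from the class sizes reported in the abstract (21 LLM-generated vs. 104 human-written reports). The sketch below is an illustration, not the authors' analysis code: the raw correct-classification counts (14 AI and 96 human reports for the board-certified radiologist) are inferred here as the only integers consistent with the stated sensitivity, specificity, and accuracy; only the percentages appear in the paper.

```python
# Illustrative cross-check (not the authors' code): reproduce the reported
# identification metrics for the board-certified radiologist from the study's
# class sizes. The raw counts below are inferred, not published.

N_AI, N_HUMAN = 21, 104             # LLM-generated vs. radiologist-written reports
correct_ai, correct_human = 14, 96  # inferred correct classifications

sensitivity = correct_ai / N_AI                             # 14/21   -> 66.7%
specificity = correct_human / N_HUMAN                       # 96/104  -> 92.3%
accuracy = (correct_ai + correct_human) / (N_AI + N_HUMAN)  # 110/125 -> 88.0%

print(f"sensitivity {sensitivity:.1%}, "
      f"specificity {specificity:.1%}, "
      f"accuracy {accuracy:.1%}")
```

The same arithmetic reproduces the other evaluators' figures; for example, resident 2's sensitivity of 66.7% (14/21) and specificity of 100% (104/104) yield the reported 94.4% accuracy (118/125).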

Topics

Magnetic Resonance Imaging, Lumbar Vertebrae, Radiologists, Artificial Intelligence, Language, Journal Article, Observational Study
