
Reliability of Gemini 2.5 Pro, ChatGPT 4.1, DeepSeek V3, and Claude Opus 4 in generating standardized CMR protocols.

January 26, 2026

Authors

Licu RA, Muscogiuri G, Casartelli D, Bacârea A, Pop M, Licu AM, Sferratore D, Caruso A, Mirchuk M, Tarkowski P, Byczkowski J, Sironi S

Affiliations (14)

  • School of Medicine and Surgery, University of Milano-Bicocca, Milan, Italy.
  • Department of Radiology, ASST Papa Giovanni XXIII, Bergamo, Italy.
  • Doctoral School of Medicine and Pharmacy, George Emil Palade University of Medicine, Pharmacy, Science and Technology, Târgu Mureș, Romania.
  • Department of Radiology, County Emergency Clinical Hospital, Târgu Mureș, Romania.
  • School of Medicine and Surgery, University of Milano-Bicocca, Milan, Italy. [email protected].
  • Department of Radiology, ASST Papa Giovanni XXIII, Bergamo, Italy. [email protected].
  • Department of Pathophysiology, George Emil Palade University of Medicine, Pharmacy, Science and Technology, Târgu Mureș, Romania.
  • Faculty of Medicine in English, George Emil Palade University of Medicine, Pharmacy, Science and Technology, Târgu Mureș, Romania.
  • Department of Radiology, Dr. Fogolyán Kristóf County Emergency Hospital, Sfântu Gheorghe, Romania.
  • Department of Radiation Diagnostics, Danylo Halytsky Lviv National Medical University, Lviv, Ukraine.
  • Ukrainian-Polish Heart Center Lviv, Lviv, Ukraine.
  • Department of Radiology and Nuclear Medicine, University Hospital No 4 of Lublin, Lublin, Poland.
  • Department of Diagnostic Imaging, University Hospital No 1 of Lublin, Lublin, Poland.
  • Department of Radiology, Medical University of Gdańsk, Gdańsk, Poland.

Abstract

Artificial intelligence (AI) and large language models (LLMs) are increasingly integrated into radiology, offering new possibilities for advanced imaging techniques, including cardiovascular magnetic resonance (CMR). This proof-of-concept study assessed four high-performing LLMs (Gemini 2.5 Pro, ChatGPT 4.1, DeepSeek V3, and Claude Opus 4) on their ability to generate CMR protocols for 140 hypothetical cardiac cases. AI-generated protocols were compared against a reference standard established by consensus between two experienced cardiovascular radiologists, following the Society for Cardiovascular Magnetic Resonance (SCMR) recommendations. Descriptive statistics were used to quantify the concordance of LLM-generated sequences with the SCMR guidelines, and statistical agreement was measured using Cohen and Fleiss κ statistics.

Gemini 2.5 Pro achieved the highest concordance, aligning with the SCMR guidelines in 71.5% of all evaluated scenarios. Overall, the LLMs showed moderate agreement with the SCMR protocols, with Gemini 2.5 Pro again performing best (Cohen κ = 0.55). Agreement was substantial for mandatory CMR sequences (Fleiss κ ∈ [0.69, 0.74]) and predominantly fair for optional sequences. The tested LLMs demonstrate the potential to generate efficient and pathology-adapted CMR protocols. Under expert supervision, this capability could streamline the imaging workflow and help extend CMR to primary healthcare centers through protocol automation.

RELEVANCE STATEMENT: The potential of Gemini 2.5 Pro, ChatGPT 4.1, DeepSeek V3, and Claude Opus 4 to suggest pathology-adapted CMR protocols could improve imaging throughput and help expand access to advanced cardiac diagnostics in primary healthcare centers.

KEY POINTS:

  • The tested large language models show potential for generating CMR protocols.
  • Substantial agreement on mandatory CMR sequences promises more efficient examinations.
  • Automation of CMR protocols could help improve access to this advanced technique outside major medical institutions.
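As background on the agreement statistic reported above: Cohen's κ corrects the raw agreement rate between two raters (here, an LLM and the expert reference standard) for agreement expected by chance from each rater's label frequencies. A minimal pure-Python sketch follows; the function and the toy include/omit labels are illustrative, not the study's actual code or data.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items (nominal labels)."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # observed agreement: fraction of items both raters label identically
    po = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # chance agreement: product of marginal label frequencies, summed over labels
    ca, cb = Counter(rater_a), Counter(rater_b)
    pe = sum((ca[k] / n) * (cb[k] / n) for k in set(ca) | set(cb))
    return (po - pe) / (1 - pe)

# hypothetical example: did the model and the reference each include a sequence?
model     = ["include", "include", "omit", "include", "omit", "omit"]
reference = ["include", "omit",    "omit", "include", "omit", "include"]
print(round(cohen_kappa(model, reference), 2))  # → 0.33
```

With four raw agreements out of six (p_o ≈ 0.67) and balanced marginals (p_e = 0.5), κ ≈ 0.33, which by the conventional Landis and Koch bands counts as "fair" agreement; the study's κ = 0.55 falls in the "moderate" band.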

Topics

Magnetic Resonance Imaging, Artificial Intelligence, Journal Article
