Methodological quality of cardiac CT and MRI radiomics studies assessed using METRICS and RQS by human readers and ChatGPT 5.1 Thinking.

June 19, 2026

DOI: 10.1186/s41747-026-00756-5 PMID: 42319678

Authors

Garello LF, Giannini V, Gatti M, Defeudis A, Cafaro D, Nicoletti G, Culasso NC, Faletti R, Veltri A, Cuocolo R, Balbi M

Affiliations (8)

Radiology Unit, San Luigi Gonzaga Hospital, Turin, Italy.
Department of Oncology, University of Turin, Turin, Italy.
Department of Surgical Sciences, University of Turin, Turin, Italy.
Azienda Ospedaliero-Universitaria Città della Salute e della Scienza di Torino, Turin, Italy.
Candiolo Cancer Institute, FPO-IRCCS, Turin, Italy.
Department of Medicine, Surgery and Dentistry, University of Salerno, Baronissi, Italy.
Radiology Unit, San Luigi Gonzaga Hospital, Turin, Italy.
Department of Oncology, University of Turin, Turin, Italy.

Abstract

To assess the methodological quality of cardiac CT and MRI radiomics studies using the METhodological RadiomICs Score (METRICS) and Radiomics Quality Score (RQS), and to evaluate inter-rater reliability (IRR) of both scoring tools among human readers and ChatGPT 5.1 Thinking.

Cardiac CT and MRI radiomics studies published up to 28 February 2025 were scored by human readers with complementary expertise in cardiac imaging and radiomics using both scoring systems. IRR was evaluated in 30 randomly selected studies by two independent groups of secondary readers and ChatGPT 5.1 Thinking.

Of 781 screened records, 154 were included. The overall median METRICS was 0.60 (IQR, 0.52-0.68), and the median percentage RQS was 0.36 (IQR, 0.19-0.42), corresponding to a median absolute RQS of 13 (IQR, 8-15). The scoring systems highlighted several methodological limitations, such as the lack of external validation, of prospective study design, and of open data availability. Between human readers, IRR was good for METRICS (ICC, 0.77-0.88) and moderate to good for RQS (ICC, 0.59-0.82). Between human readers and ChatGPT 5.1 Thinking, IRR was moderate to good for METRICS (ICC, 0.70-0.85) but only poor to moderate for RQS (ICC, 0.46-0.56).

Cardiac CT and MRI radiomics research quality was rated as good by METRICS, whereas RQS yielded lower scores. Human readers showed good reproducibility with METRICS and moderate to good reproducibility with RQS. ChatGPT 5.1 Thinking showed potential for automating the evaluation process, but its use requires caution due to potential discrepancies with human evaluations.

Research quality in cardiac CT and MRI still suffers from substantial limitations. The application of METRICS and RQS using LLMs requires caution, given the limited reproducibility when compared with human assessments.

According to METRICS and RQS, radiomics-based cardiac CT and MRI studies remain affected by substantial methodological limitations. Human readers achieved good reproducibility with METRICS and moderate to good reproducibility with RQS. ChatGPT 5.1 Thinking may be helpful for scoring radiomics research quality, but its results should be interpreted with caution due to potential discrepancies with human evaluations.

Topics

Magnetic Resonance Imaging
Tomography, X-Ray Computed
Radiomics
Heart
Journal Article
