Accuracy and reproducibility of large language model measurements of liver metastases: comparison with radiologist measurements.
Authors
Affiliations (5)
- Department of Medical Imaging, The Ottawa Hospital, University of Ottawa, 501 Smyth Road, Ottawa, ON, K1H 8L6, Canada. [email protected].
- Department of Diagnostic Radiology, McGill University, Montreal, QC, Canada.
- Augmented Intelligence and Precision Health Laboratory (AIPHL), Research Institute of the McGill University Health Centre, Montreal, Canada.
- Diagnostic Radiology and Radiation Oncology, Chiba University Graduate School of Medicine, Chiba, Japan.
- Department of Radiology, Institute of Medical Science, The University of Tokyo, Tokyo, Japan.
Abstract
This study compared the accuracy and reproducibility of lesion-diameter measurements made by three state-of-the-art large language models (LLMs) with those made by radiologists. In this retrospective study using a public database, 83 patients with solitary colorectal cancer liver metastases were identified. From each CT series, a radiologist extracted the single axial slice showing the maximal tumor diameter and converted it to a 512 × 512-pixel PNG image (window level 50 HU, window width 400 HU), with the pixel size encoded in the filename. Three LLMs, ChatGPT-o3 (OpenAI), Gemini 2.5 Pro (Google), and Claude 4 Opus (Anthropic), were each prompted twice, ≥ 1 week apart, to estimate the longest lesion diameter. Two board-certified radiologists (12 years' experience each) independently measured the same single-slice images, and one radiologist repeated the measurements after ≥ 1 week. Agreement was assessed with intraclass correlation coefficients (ICC); 95% confidence intervals (CI) were obtained by bootstrap resampling (5,000 iterations). Radiologist inter-observer agreement was excellent (ICC = 0.95, 95% CI 0.86-0.99), as was intra-observer agreement (ICC = 0.98, 95% CI 0.94-0.99). Gemini achieved good model-to-radiologist agreement (ICC = 0.81, 95% CI 0.68-0.89) and good intra-model reproducibility (ICC = 0.78, 95% CI 0.65-0.87). ChatGPT-o3 showed moderate agreement (ICC = 0.52) and poor reproducibility (ICC = 0.25); Claude showed poor agreement (ICC = 0.07) and poor reproducibility (ICC = 0.47). LLMs do not yet match radiologists in measuring colorectal cancer liver metastases; however, Gemini's good agreement and reproducibility highlight the rapid progress in the image-interpretation capability of LLMs.
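The agreement analysis described above (an ICC with a bootstrap 95% confidence interval) can be illustrated with a minimal sketch. The abstract does not state which ICC form or which bootstrap interval was used, so the choices here are assumptions: ICC(2,1) (two-way random effects, absolute agreement, single rater) and a percentile bootstrap that resamples patients with replacement. The function names `icc2_1` and `bootstrap_ci` are hypothetical, and the demo data are synthetic, not the study's measurements.

```python
import numpy as np


def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: array of shape (n_subjects, k_raters), e.g. one column per reader.
    ASSUMPTION: the abstract does not say which ICC form the study used.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # Sums of squares from a two-way ANOVA without replication.
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()  # between subjects
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()  # between raters
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def bootstrap_ci(ratings, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for ICC(2,1), resampling subjects (rows).

    ASSUMPTION: percentile interval; the study's exact method is not stated.
    """
    x = np.asarray(ratings, dtype=float)
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample row indices with replacement
        stats[b] = icc2_1(x[idx])
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)


if __name__ == "__main__":
    # Synthetic diameters (mm) for 83 lesions and two readers; illustrative only.
    rng = np.random.default_rng(42)
    true_diameter = rng.uniform(10.0, 80.0, size=83)
    reader_a = true_diameter + rng.normal(0.0, 2.0, size=83)
    reader_b = true_diameter + rng.normal(0.0, 2.0, size=83)
    data = np.column_stack([reader_a, reader_b])
    print("ICC(2,1):", icc2_1(data))
    print("95% CI:", bootstrap_ci(data, n_boot=5000))
```

Resampling whole rows keeps each lesion's paired measurements together, so the interval reflects patient-to-patient variability rather than treating the two readings as independent samples.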