Accuracy and reproducibility of large language model measurements of liver metastases: comparison with radiologist measurements.

Authors

Sugawara H, Takada A, Kato S

Affiliations (5)

  • Department of Medical Imaging, The Ottawa Hospital, University of Ottawa, 501 Smyth Road, Ottawa, ON, K1H 8L6, Canada.
  • Department of Diagnostic Radiology, McGill University, Montreal, QC, Canada.
  • Augmented Intelligence and Precision Health Laboratory (AIPHL), Research Institute of the McGill University Health Centre, Montreal, Canada.
  • Diagnostic Radiology and Radiation Oncology, Chiba University Graduate School of Medicine, Chiba, Japan.
  • Department of Radiology, Institute of Medical Science, The University of Tokyo, Tokyo, Japan.

Abstract

This study compared the accuracy and reproducibility of lesion-diameter measurements made by three state-of-the-art large language models (LLMs) with those made by radiologists.

In this retrospective study using a public database, 83 patients with solitary colorectal cancer liver metastases were identified. From each CT series, a radiologist extracted the single axial slice showing the maximal tumor diameter and converted it to a 512 × 512-pixel PNG image (window level 50 HU, window width 400 HU), with the pixel size encoded in the filename. Three LLMs (ChatGPT-o3 [OpenAI], Gemini 2.5 Pro [Google], and Claude 4 Opus [Anthropic]) were each prompted to estimate the longest lesion diameter twice, ≥ 1 week apart. Two board-certified radiologists (12 years' experience each) independently measured the same single-slice images, and one radiologist repeated the measurements after ≥ 1 week. Agreement was assessed with intraclass correlation coefficients (ICC); 95% confidence intervals (CI) were obtained by bootstrap resampling (5,000 iterations).

Radiologist inter-observer agreement was excellent (ICC = 0.95, 95% CI 0.86-0.99), and the intra-observer ICC was 0.98 (95% CI 0.94-0.99). Gemini achieved good model-to-radiologist agreement (ICC = 0.81, 95% CI 0.68-0.89) and good intra-model reproducibility (ICC = 0.78, 95% CI 0.65-0.87). ChatGPT-o3 showed moderate agreement (ICC = 0.52) and poor reproducibility (ICC = 0.25); Claude showed poor agreement (ICC = 0.07) and poor reproducibility (ICC = 0.47).

LLMs do not yet match radiologists in measuring colorectal cancer liver metastases; however, Gemini's good agreement and reproducibility highlight the rapid progress in the image-interpretation capability of LLMs.
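
The agreement analysis described in the abstract (an ICC with a bootstrap 95% confidence interval over 5,000 resamples) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the abstract does not state which ICC form was used, so the two-way random-effects, absolute-agreement, single-measurement form, ICC(2,1), is an assumption here, as are the percentile interval, resampling at the patient level, and the function names `icc_2_1` and `bootstrap_icc_ci`.

```python
import numpy as np


def icc_2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    ratings has shape (n_subjects, k_raters), e.g. one row per lesion and one
    column per reader (radiologist or model). The ICC form is an assumption;
    the abstract does not specify it.
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)  # per-subject means
    col_means = ratings.mean(axis=0)  # per-rater means

    # Two-way ANOVA sums of squares.
    ss_rows = k * np.sum((row_means - grand) ** 2)
    ss_cols = n * np.sum((col_means - grand) ** 2)
    ss_err = np.sum((ratings - grand) ** 2) - ss_rows - ss_cols

    msr = ss_rows / (n - 1)             # between-subject mean square
    msc = ss_cols / (k - 1)             # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))  # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def bootstrap_icc_ci(ratings, n_boot=5000, alpha=0.05, seed=0):
    """Point estimate plus percentile bootstrap CI, resampling subjects with replacement."""
    ratings = np.asarray(ratings, dtype=float)
    rng = np.random.default_rng(seed)
    n = ratings.shape[0]
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample whole rows (patients)
        estimates[b] = icc_2_1(ratings[idx])
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return icc_2_1(ratings), lower, upper


if __name__ == "__main__":
    # Synthetic stand-in data (NOT the study data): 83 lesions, two readers.
    rng = np.random.default_rng(42)
    true_mm = rng.uniform(10, 80, size=83)
    reader_a = true_mm + rng.normal(0, 2, size=83)
    reader_b = true_mm + rng.normal(0, 2, size=83)
    icc, lo, hi = bootstrap_icc_ci(np.column_stack([reader_a, reader_b]))
    print(f"ICC = {icc:.2f}, 95% CI {lo:.2f}-{hi:.2f}")
```

Resampling whole rows keeps each patient's paired measurements together, which is the usual choice when the patient is the unit of analysis; a different ICC form or interval method would give somewhat different numbers.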

Topics

Journal Article
