Validating Radiology AI Model Performance on Photon-Counting CT Images Using Large Language Models for Ground Truth Extraction.
Authors
Affiliations (3)
- Department of Radiology, University of Washington, Seattle, WA, USA. Electronic address: [email protected].
- School of Medicine, University of Washington, Seattle, WA, USA.
- Department of Radiology, University of Washington, Seattle, WA, USA.
Abstract
This study evaluates the feasibility of using large language models (LLMs) to automate ground truth label extraction from radiology reports, enabling scalable assessment and monitoring of radiology artificial intelligence (AI) tools. The framework is tested by validating AI model performance on a newly installed photon-counting CT (PCCT) scanner. We retrospectively analyzed four FDA-cleared deep learning-based computer-aided detection and triage (CADt) tools targeting pulmonary embolism (PE), intracranial hemorrhage (ICH), cervical spine fractures (CSPFX), and vertebral compression fractures (COMPFX). Radiology reports from exams acquired on the new PCCT scanner and on conventional scanners were processed with an LLM (Llama 3.3) to extract binary ground truth labels. AI outputs were compared against these labels to estimate performance metrics. Discrepant cases were adjudicated by three human annotators, with inter-rater reliability measured using Fleiss' kappa. Performance metrics were then recalculated after partial human correction of LLM errors. LLM-extracted labels enabled rapid performance assessment across all four diagnostic tasks. There were no statistically significant differences in performance between the PCCT and non-PCCT cohorts. In discrepant cases, the agreement between LLM labels and final human annotations (κ = 0.731) was comparable to inter-reader agreement (κ = 0.720), supporting the reliability of LLM labeling. LLMs can therefore be used to automate ground truth label extraction from radiology reports, offering a scalable and efficient alternative to manual annotation. This method supports rapid local validation of AI tools, even in response to input drift from new imaging hardware.
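The comparison step described in the abstract can be illustrated with a minimal sketch. It assumes binary AI outputs and binary report-derived labels are already available as lists, and that each adjudicated case was rated by the same number of annotators. The function names (`confusion_metrics`, `discrepant_cases`, `fleiss_kappa`) and the toy data are hypothetical and are not taken from the study; the study's actual pipeline and statistical tests are not specified here.

```python
from collections import Counter


def confusion_metrics(ai_flags, report_labels):
    """Sensitivity and specificity of binary AI outputs against reference labels."""
    pairs = Counter(zip(ai_flags, report_labels))
    tp, fn = pairs[(1, 1)], pairs[(0, 1)]
    tn, fp = pairs[(0, 0)], pairs[(1, 0)]
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


def discrepant_cases(ai_flags, report_labels):
    """Indices where the AI output disagrees with the label (candidates for human review)."""
    return [i for i, (a, r) in enumerate(zip(ai_flags, report_labels)) if a != r]


def fleiss_kappa(ratings, categories=(0, 1)):
    """Fleiss' kappa; `ratings` is one list of per-rater labels per case."""
    n_cases, n_raters = len(ratings), len(ratings[0])
    # counts[i][j] = number of raters assigning case i to category j
    counts = [[row.count(c) for c in categories] for row in ratings]
    # Observed agreement, averaged over cases
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_cases
    # Chance agreement from the overall category proportions
    p_e = sum(
        (sum(row[j] for row in counts) / (n_cases * n_raters)) ** 2
        for j in range(len(categories))
    )
    if p_e == 1.0:  # every rating fell in one category; kappa is undefined
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)


# Toy data (illustrative only, not study data)
ai_flags = [1, 1, 0, 0, 1, 0]        # AI tool output per exam
llm_labels = [1, 0, 0, 0, 1, 1]      # labels extracted from reports
metrics = confusion_metrics(ai_flags, llm_labels)
to_review = discrepant_cases(ai_flags, llm_labels)
kappa = fleiss_kappa([[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 1]])
```

In this sketch only the discrepant cases are passed to human review, mirroring the partial human correction described above; metrics can be recomputed by overwriting the reviewed labels and calling `confusion_metrics` again.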