Evaluating the Large Language Model-Based Quality Assurance Tool for Auto-Contouring
Authors
Affiliations (1)
- Department of Therapeutic Radiology, University of Yamanashi, Yamanashi, Japan.
Abstract
Purpose
Manual verification of AI-based auto-contouring is labor-intensive and prone to fatigue-related errors. This study developed a large language model (LLM)-based automated quality assurance (QA) for auto-contouring (LAQUA) system using a multimodal LLM, Gemini 2.5 Pro, and evaluated its feasibility as a clinical primary screening tool to streamline the QA workflow.

Methods
Twenty male pelvic CT scans from an open dataset were used to generate auto-contours of the bladder, prostate, rectum, and bilateral femoral heads with three distinct software packages (OncoStudio, RatoGuide prototype, and syngo.via). The generated contours for each slice were exported as PDF images with overlaid contours and input into Gemini 2.5 Pro. The LLM was instructed to rate contour quality on a 5-point clinical scale (5: Optimal; 4: Acceptable; 3: Suboptimal; 2: Unacceptable, redraw from scratch; 1: Unacceptable, organ not detected or completely wrong). Spearman's rank correlation coefficients (ρ) and weighted kappa coefficients (κ) were calculated against evaluations by two board-certified radiation oncologists as ground truth. Additionally, to assess screening performance, sensitivity and specificity were calculated by dichotomizing the scores, defining "inadequate" contours (scores < 3 or < 4) as the target for detection, compared with "adequate" contours (scores ≥ 3 or ≥ 4). Finally, the alignment of the rationales provided by the LLM with auto-contouring quality was evaluated by the same two board-certified radiation oncologists, using a Likert scale covering four domains (error detection, hallucination, clinical relevance, and anatomical understanding), each scored out of 2 points.

Results
The LAQUA system demonstrated moderate to strong agreement with expert judgments across all evaluated software (ρ: 0.733-0.794; quadratic weighted κ: 0.730-0.798) and organs (ρ: 0.567-0.835; quadratic weighted κ: 0.639-0.804).
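The two agreement statistics above can be sketched as follows. This is a minimal illustration, not the study's analysis code: the score arrays are hypothetical placeholders, and the quadratic weighted κ is implemented directly from its standard definition rather than from any toolkit the authors may have used.

```python
import numpy as np
from scipy.stats import spearmanr

def quadratic_weighted_kappa(a, b, n_classes=5):
    """Quadratic weighted Cohen's kappa for ordinal scores in 1..n_classes."""
    a, b = np.asarray(a), np.asarray(b)
    obs = np.zeros((n_classes, n_classes))  # observed confusion matrix
    for x, y in zip(a, b):
        obs[x - 1, y - 1] += 1
    idx = np.arange(n_classes)
    # quadratic disagreement penalty, normalized to [0, 1]
    w = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    # chance-expected matrix from the row/column marginals
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()
    return 1.0 - (w * obs).sum() / (w * exp).sum()

# Hypothetical 5-point ratings: LLM vs. expert ground truth (illustrative only)
llm    = [5, 4, 4, 3, 2, 5, 1, 3, 4, 5]
expert = [5, 4, 3, 3, 2, 4, 1, 2, 4, 5]

rho, _ = spearmanr(llm, expert)
kappa = quadratic_weighted_kappa(llm, expert)
```

The quadratic weighting penalizes a two-point disagreement four times as heavily as a one-point disagreement, which suits ordinal clinical scales like the 5-point scale described above.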
Regarding screening performance, a cutoff of ≥ 3 as "adequate" achieved the highest sensitivity and specificity in specific subgroups, but with wide 95% confidence intervals (CIs). A cutoff of ≥ 4 as "adequate" narrowed the CIs, yielding the highest sensitivity in the rectum (0.976) and the highest specificity in the left femoral head (0.933). Qualitatively, the LLM's rationales achieved an overall mean score of 1.70 ± 0.48 (out of 2), with 155 of 291 outputs receiving perfect scores across all criteria.

Conclusions
The LAQUA system demonstrated substantial agreement with expert evaluations in AI-based auto-contouring quality assessment. While a potential overestimation bias (risk of missing "inadequate" cases) warrants caution, the observed sensitivity suggests its feasibility as a primary screening QA tool to efficiently filter acceptable contours, thereby reducing the clinical workload.
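The dichotomized screening evaluation can be sketched as below, under the assumption that scores below the cutoff count as "inadequate" (the detection target, i.e., the positive class). The helper name and score arrays are hypothetical, not from the study.

```python
import numpy as np

def screening_performance(llm_scores, expert_scores, cutoff):
    """Sensitivity/specificity for detecting 'inadequate' contours
    (score < cutoff) from LLM scores against expert ground truth."""
    llm_bad = np.asarray(llm_scores) < cutoff
    exp_bad = np.asarray(expert_scores) < cutoff
    tp = np.sum(llm_bad & exp_bad)    # inadequate contour, correctly flagged
    fn = np.sum(~llm_bad & exp_bad)   # inadequate contour, missed by the LLM
    tn = np.sum(~llm_bad & ~exp_bad)  # adequate contour, correctly passed
    fp = np.sum(llm_bad & ~exp_bad)   # adequate contour, falsely flagged
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical scores; cutoff=4 corresponds to treating scores >= 4 as "adequate"
sens, spec = screening_performance([5, 4, 2, 3, 1, 4], [5, 3, 2, 4, 1, 4], cutoff=4)
```

For a primary screening tool, sensitivity on the "inadequate" class is the safety-critical quantity: a missed inadequate contour (false negative) would pass unreviewed into planning, whereas a false positive merely triggers an unnecessary manual review.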