Robustness evaluation of an artificial intelligence-based automatic contouring software in daily routine practice.
Authors
Affiliations (3)
Affiliations (3)
- Centre Marie Curie, Valence, France.
- IRUDIGI SARL, Bayonne, France.
- Centre Marie Curie, Valence, France. Electronic address: [email protected].
Abstract
AI-based automatic contouring streamlines radiotherapy by reducing contouring time but requires rigorous validation and ongoing daily monitoring. This study assessed how software updates affect contouring accuracy and examined how image quality variations influence AI performance. Two patient cohorts were analyzed. The software updates cohort (40 CT scans: 20 thorax, 10 pelvis, 10 H&N) compared six versions of Limbus AI contouring software. The image quality cohort (20 patients: H&N, pelvis, brain, thorax) analyzed 12 reconstructions per patient using Standard, iDose, and IMR algorithms, with simulated noise and spatial resolution (SR) degradations. AI performance was assessed using Volumetric Dice Similarity Coefficient (vDSC) and 95 % Hausdorff Distance (HD95%) with Wilcoxon tests for significance. In the software updates cohort, vDSC improved for re-trained structures across versions (mean DSC ≥ 0.75), with breast contour vDSC decreasing by 1 % between v1.5 and v1.8B3 (p > 0.05). Median HD95% values were consistently <4 mm, <5 mm, and <12 mm for H&N, pelvis, and thorax contours, respectively (p > 0.05). In the image quality cohort, no significant differences were observed between Standard, iDose, and IMR algorithms. However, noise and SR degradation significantly reduced performance: vDSC ≥ 0.9 dropped from 89 % at 2 % noise to 30 % at 20 %, and from 87 % to 70 % as SR degradation increased (p < 0.001). AI contouring accuracy improved with software updates and showed robustness to minor reconstruction variations, but it was sensitive to noise and SR degradation. Continuous validation and quality control of AI-generated contours are essential. Future studies should include a broader range of anatomical regions and larger cohorts.