Geometric, dosimetric and psychometric evaluation of three commercial AI software solutions for OAR auto-segmentation in head and neck radiotherapy.
Authors
Affiliations (4)
- Faculty of Electrical Engineering, University of Ljubljana, Ljubljana, Slovenia.
- Faculty of Information and Communication Technology, University of Malta, Msida, Malta.
- Faculty of Health Sciences, Department of Radiography, University of Malta, Msida, Malta.
- Faculty of Electrical Engineering, University of Ljubljana, Ljubljana, Slovenia.
Abstract
Contouring organs-at-risk (OARs) is a critical yet time-consuming step in head and neck (HaN) radiotherapy planning. Auto-segmentation methods have been widely studied, and commercial solutions are increasingly entering clinical use. However, their adoption warrants a comprehensive, multi-perspective evaluation. The purpose of this study is to compare three commercial artificial intelligence (AI) software solutions (Limbus, MIM and MVision) for HaN OAR auto-segmentation on a cohort of 10 computed tomography images with reference contours obtained from the public HaN-Seg dataset, from both observational (descriptive and empirical) and analytical (geometric, dosimetric and psychometric) perspectives. The observational evaluation included vendor questionnaires on technical specifications and radiographer feedback on usability. The analytical evaluation covered geometric (Dice similarity coefficient, DSC, and 95th percentile Hausdorff distance, HD95), dosimetric (dose constraint compliance, OAR priority-based analysis), and psychometric (5-point Likert scale) assessments. All software solutions covered a broad range of OARs. Overall differences in geometric performance were relatively small (Limbus: 69.7% DSC, 5.0 mm HD95; MIM: 69.2% DSC, 5.6 mm HD95; MVision: 66.7% DSC, 5.3 mm HD95); however, statistically significant differences were observed for smaller structures such as the cochleae, optic chiasm, and pituitary and thyroid glands. Differences in dosimetric compliance were minor overall, with the lowest compliance observed for the oral cavity and submandibular glands. In the psychometric assessment, radiographers gave the highest average Likert rating to Limbus (3.9), followed by MVision (3.7) and MIM (3.5). With few exceptions, the software solutions produced good-quality AI-generated contours (Likert ratings ≥ 3), yet some editing is still required to reach clinical acceptability. Notable discrepancies were seen for the optic chiasm and in cases affected by mouth bites or dental artifacts. Importantly, no clear relationship emerged between geometric, dosimetric, and psychometric metrics, underscoring the need for a multi-perspective evaluation without shortcuts.
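For readers unfamiliar with the two geometric metrics named above, the following is a minimal sketch of how DSC and HD95 might be computed from a pair of binary masks. It is an illustration under stated assumptions, not the implementation used in the study: the function names (`dice`, `surface_points`, `hd95`), the surface definition (foreground voxels with at least one face-adjacent background voxel), and the symmetric HD95 convention (maximum of the two directed 95th percentiles) are choices made here, and other conventions exist.

```python
import numpy as np


def dice(a, b):
    """Dice similarity coefficient between two binary masks (0..1)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * np.logical_and(a, b).sum() / denom


def surface_points(mask, spacing):
    """Physical coordinates of surface voxels of a binary mask.

    A voxel is treated as surface if it is foreground and at least one
    face-adjacent neighbour is background (assumed definition).
    """
    mask = np.asarray(mask, dtype=bool)
    padded = np.pad(mask, 1, constant_values=False)
    interior = padded.copy()
    for axis in range(mask.ndim):
        interior &= np.roll(padded, 1, axis=axis)
        interior &= np.roll(padded, -1, axis=axis)
    core = tuple(slice(1, -1) for _ in range(mask.ndim))
    surface = mask & ~interior[core]
    return np.argwhere(surface) * np.asarray(spacing, dtype=float)


def hd95(a, b, spacing):
    """Symmetric 95th percentile Hausdorff distance, in units of `spacing`.

    Brute-force pairwise distances: fine for small structures, but memory
    grows with the product of the two surface sizes.
    """
    pa = surface_points(a, spacing)
    pb = surface_points(b, spacing)
    if len(pa) == 0 or len(pb) == 0:
        raise ValueError("HD95 is undefined when a mask is empty")
    d = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=-1)
    a_to_b = d.min(axis=1)  # each surface point of a to nearest point of b
    b_to_a = d.min(axis=0)  # each surface point of b to nearest point of a
    return max(np.percentile(a_to_b, 95), np.percentile(b_to_a, 95))


# Toy 2D example: a 4x4 square and the same square shifted by one voxel.
reference = np.zeros((10, 10), dtype=bool)
reference[2:6, 2:6] = True
predicted = np.zeros((10, 10), dtype=bool)
predicted[3:7, 2:6] = True

toy_dsc = dice(reference, predicted)
toy_hd95 = hd95(reference, predicted, spacing=(1.0, 1.0))
```

In the toy example the overlap is 12 of 16 voxels per mask, so the DSC is 0.75, and the one-voxel shift gives an HD95 of 1.0 at unit spacing. The two metrics respond differently to the same error, since DSC is a volume-overlap ratio while HD95 is a boundary distance, which is one reason they are typically reported together.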