Quantitative Evaluation of AI-based Organ Segmentation Across Multiple Anatomical Sites Using Eight Commercial Software Platforms.
Affiliations (12)
- Virginia Commonwealth University, Richmond, VA. Electronic address: [email protected].
- City of Hope Comprehensive Cancer Center, Duarte, CA; Mayo Clinic Arizona, Phoenix, AZ.
- Emory University, Atlanta, GA.
- Radiation Physics, MD Anderson Cancer Center, Houston, TX.
- University of Pennsylvania/Abramson Cancer Center, Philadelphia, PA.
- Moffitt Cancer Center, Tampa, FL.
- The University of Texas Southwestern Medical Center, Dallas, TX.
- Duke University Medical Center, Durham, NC.
- University of California Davis Comprehensive Cancer Center, Sacramento, CA.
- Mayo Clinic Arizona, Phoenix, AZ.
- National Cancer Institute, Bethesda, MD.
- University of California Los Angeles, Los Angeles, CA. Electronic address: [email protected].
Abstract
To evaluate organ-at-risk (OAR) segmentation variability across eight commercial AI-based segmentation software platforms using independent multi-institutional datasets, and to provide recommendations for clinical practices using AI segmentation. 160 planning CT image sets from four anatomical sites (head-and-neck, thorax, abdomen, and pelvis) were retrospectively pooled from three institutions. Contours for 31 OARs generated by the software were compared to clinical contours using multiple accuracy metrics, including the Dice similarity coefficient (DSC), the 95th-percentile Hausdorff distance (HD95), and surface DSC, as well as relative added path length (RAPL) as an efficiency metric. A two-factor analysis of variance was used to quantify variability in contouring accuracy across software platforms (inter-software) and patients (inter-patient). Pairwise comparisons were performed to categorize the software into performance groups, and inter-software variations (ISV) were calculated as the average performance differences between the groups. Significant inter-software and inter-patient variations in contouring accuracy (p < 0.05) were observed for most OARs. The largest ISV in DSC in each anatomical region was observed for the cervical esophagus (0.41), trachea (0.10), spinal cord (0.13), and prostate (0.17). Among the organs evaluated, 7 had mean DSC > 0.9 (e.g., heart, liver), 15 had DSC ranging from 0.7 to 0.89 (e.g., parotid, esophagus), and the remaining organs (e.g., optic nerves, seminal vesicles) had DSC < 0.7. Sixteen of the 31 organs (52%) had RAPL less than 0.1. Our results reveal significant inter-software and inter-patient variability in the performance of AI segmentation software. These findings highlight the need for thorough software commissioning, testing, and quality assurance across disease sites, patient-specific anatomies, and image acquisition protocols.
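The two volumetric/distance metrics named in the abstract have standard definitions: DSC = 2|A∩B| / (|A| + |B|) for binary masks, and HD95 is the 95th percentile of symmetric surface-to-surface distances. A minimal illustrative sketch of both is given below, assuming boolean masks on a common voxel grid and (N, 3) arrays of surface-point coordinates; the function names and the brute-force pairwise-distance approach are our own choices for clarity, not the implementation used by any of the evaluated platforms (which typically also account for anisotropic voxel spacing).

```python
import numpy as np

def dice_coefficient(a, b):
    """Dice similarity coefficient between two boolean masks: 2|A∩B| / (|A| + |B|)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # both masks empty: define as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / denom

def hd95(a_points, b_points):
    """95th-percentile symmetric Hausdorff distance between two (N, 3) point sets."""
    a_points = np.asarray(a_points, dtype=float)
    b_points = np.asarray(b_points, dtype=float)
    # Full pairwise Euclidean distance matrix (fine for small surfaces;
    # production code would use a k-d tree or distance transform instead).
    d = np.linalg.norm(a_points[:, None, :] - b_points[None, :, :], axis=-1)
    d_ab = d.min(axis=1)  # each point of A to its nearest point of B
    d_ba = d.min(axis=0)  # each point of B to its nearest point of A
    return max(np.percentile(d_ab, 95), np.percentile(d_ba, 95))
```

For example, two identical masks give DSC = 1.0, disjoint masks give DSC = 0.0, and identical surface point sets give HD95 = 0.0.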