
Quantitative Evaluation of AI-based Organ Segmentation Across Multiple Anatomical Sites Using Eight Commercial Software Platforms.

Authors

Yuan L, Chen Q, Al-Hallaq H, Yang J, Yang X, Geng H, Latifi K, Cai B, Wu QJ, Xiao Y, Benedict SH, Rong Y, Buchsbaum J, Qi XS

Affiliations (12)

  • Virginia Commonwealth University, Richmond, VA. Electronic address: [email protected].
  • City of Hope Comprehensive Cancer Center, Duarte, CA; Mayo Clinic Arizona, Phoenix, AZ.
  • Emory University, Atlanta, GA.
  • Radiation Physics, MD Anderson Cancer Center, Houston, TX.
  • University of Pennsylvania/Abramson Cancer Center, Philadelphia, PA.
  • Moffitt Cancer Center, Tampa, FL.
  • The University of Texas Southwestern Medical Center, Dallas, TX.
  • Duke University Medical Center, Durham, NC.
  • University of California Davis Comprehensive Cancer Center, Sacramento, CA.
  • Mayo Clinic Arizona, Phoenix, AZ.
  • National Cancer Institute, Bethesda, MD.
  • University of California Los Angeles, Los Angeles, CA. Electronic address: [email protected].

Abstract

To evaluate organ-at-risk (OAR) segmentation variability across eight commercial AI-based segmentation software platforms using independent multi-institutional datasets, and to provide recommendations for clinical practices utilizing AI segmentation. 160 planning CT image sets from four anatomical sites (head-and-neck, thorax, abdomen, and pelvis) were retrospectively pooled from three institutions. Contours for 31 OARs generated by the software were compared to clinical contours using multiple accuracy metrics, including the Dice similarity coefficient (DSC), the 95th percentile of the Hausdorff distance (HD95), and surface DSC, as well as relative added path length (RAPL) as an efficiency metric. A two-factor analysis of variance was used to quantify variability in contouring accuracy across software platforms (inter-software) and patients (inter-patient). Pairwise comparisons were performed to categorize the software into performance groups, and inter-software variations (ISV) were calculated as the average performance differences between the groups. Significant inter-software and inter-patient variations in contouring accuracy (p < 0.05) were observed for most OARs. The largest ISVs in DSC for each anatomical region were the cervical esophagus (0.41), trachea (0.10), spinal cord (0.13), and prostate (0.17). Among the organs evaluated, 7 had mean DSC > 0.9 (e.g., heart, liver), 15 had DSC ranging from 0.7 to 0.89 (e.g., parotid, esophagus), and the remaining organs (e.g., optic nerves, seminal vesicle) had DSC < 0.7. Sixteen of the 31 organs (52%) had RAPL less than 0.1. Our results reveal significant inter-software and inter-patient variability in the performance of AI segmentation software. These findings highlight the need for thorough software commissioning, testing, and quality assurance across disease sites, patient-specific anatomies, and image acquisition protocols.
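For readers implementing the accuracy metrics described above, the following is a minimal sketch of how DSC and HD95 could be computed from binary segmentation masks using NumPy and SciPy. It is not the authors' evaluation code: the function names (dice, hd95), the assumption of boolean 3-D arrays, and the voxel-spacing parameter are illustrative, and surface DSC and RAPL are omitted.

```python
import numpy as np
from scipy import ndimage

def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice similarity coefficient, 2|A∩B| / (|A| + |B|), for binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def hd95(a: np.ndarray, b: np.ndarray, spacing=(1.0, 1.0, 1.0)) -> float:
    """95th-percentile symmetric Hausdorff distance (mm) between mask surfaces."""
    a, b = a.astype(bool), b.astype(bool)
    # Surface voxels: each mask minus its binary erosion.
    surf_a = a ^ ndimage.binary_erosion(a)
    surf_b = b ^ ndimage.binary_erosion(b)
    # Euclidean distance from every voxel to the nearest surface voxel of each mask,
    # scaled by the physical voxel spacing.
    dist_to_a = ndimage.distance_transform_edt(~surf_a, sampling=spacing)
    dist_to_b = ndimage.distance_transform_edt(~surf_b, sampling=spacing)
    # Pool surface-to-surface distances in both directions, take the 95th percentile.
    return float(np.percentile(np.hstack([dist_to_b[surf_a], dist_to_a[surf_b]]), 95))
```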

