Evaluating the Utility and Limitations of Machine Learning Tumor Segmentation for Automated Longitudinal RANO Treatment Response Classification.
Authors
Affiliations (2)
- From the Department of Neuroradiology (P.K., A.N., H.S., S.A., A.M., C.B., M.W., K.S.), Division of Diagnostic Imaging, MD Anderson Cancer Center, Houston, TX and Department of Radiology (M.W.), The University of Texas Medical Branch, Galveston, TX. [email protected].
Abstract
Machine learning segmentation has shown high performance in volumetric evaluation of brain tumors, but it remains unclear whether such methods can be translated into full end-to-end clinical tools for assessing treatment response. This study assesses the ability, and current limitations, of machine learning to perform fully automated longitudinal Response Assessment in Neuro-Oncology (RANO). An nnU-Net model was trained for segmentation on 4,162 MRIs of pre- and post-treatment brain tumors. A separate dataset of 91 patients was preprocessed, including co-registration for longitudinal lesion tracking, and the nnU-Net model was used to segment enhancing tumor on this longitudinal dataset. After excluding patients classified as progression on the basis of T2/FLAIR changes or clinical criteria, 233 pairwise MRI comparisons remained across 60 patients. Using the auto-segmented tumor masks, each comparison was classified as Progressive Disease (PD), Partial Response (PR), Stable Disease (SD), or Complete Response (CR) according to the original RANO criteria, based on 2D lesion size, measurability criteria, appearance of new lesions, and change over time. Model results were compared with expert ground truth and evaluated for accuracy in distinguishing PD from non-PD disease, per-class accuracy, and a variety of performance metrics. All discrepant cases were manually reviewed to classify the cause of error. The model was 77.7% (95% CI 70.6%-84.2%) accurate in distinguishing PD from non-PD disease, with sensitivity of 92.1% (86.2%-97.3%) and specificity of 60.4% (43.4%-74.7%). Per-class accuracy was highest for PD at 92.1% (86.2%-97.3%) and much lower for the non-PD categories, worst for PR at 14.3% (0.0%-38.5%). Patient-averaged accuracy was 66.9% (58.2%-75.3%). False-positive PD predictions were most commonly due to evolving changes around the resection cavity, which alter the morphology and location of enhancement, accounting for 32.8% of SD classifications. Other causes of error included changing morphology with similar enhancement volume, missed lesions in less typical locations, and radiation necrosis. Automated tumor segmentation is emerging in treatment response assessment, but even highly trained segmentation models still face limitations in fully automated response assessment due to nuances such as post-surgical enhancement and the comparison of location and morphology changes between timepoints.
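The 2D size-based component of the RANO classification described above can be sketched in code. This is a minimal illustrative sketch, not the authors' implementation: it assumes lesion burden is summarized as the sum of products of perpendicular diameters (SPD) of measurable enhancing lesions, applies the standard RANO size thresholds (≥25% increase over nadir for PD, ≥50% decrease from baseline for PR, disappearance for CR), and omits the T2/FLAIR, steroid, and clinical-status criteria that the study excluded. The function name and argument names are hypothetical.

```python
def classify_rano_2d(current_spd, baseline_spd, nadir_spd, new_lesion=False):
    """Classify one pairwise comparison as CR/PR/SD/PD from 2D measurements.

    current_spd  -- sum of perpendicular-diameter products (mm^2) at the
                    current timepoint
    baseline_spd -- the same sum at the pre-treatment baseline
    nadir_spd    -- the smallest sum observed at or after baseline
    new_lesion   -- whether an unequivocal new enhancing lesion appeared
    """
    if new_lesion:
        return "PD"  # a new measurable lesion is progression regardless of size
    if current_spd == 0:
        return "CR"  # disappearance of all measurable enhancing disease
    if current_spd >= 1.25 * nadir_spd:
        return "PD"  # >= 25% increase in SPD relative to nadir
    if current_spd <= 0.5 * baseline_spd:
        return "PR"  # >= 50% decrease in SPD relative to baseline
    return "SD"      # neither progression nor partial response


# Example: enhancement shrinks modestly but not enough for PR
print(classify_rano_2d(current_spd=300, baseline_spd=400, nadir_spd=250))  # SD
```

In the full criteria a confirmed CR or PR also requires stable or improved clinical status and no new T2/FLAIR progression; the study sidesteps those inputs by excluding such cases before classification, which is why a purely size-based sketch like this matches the evaluated setting.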