
Clinical Evaluation of a Novel Deep Learning-Based Auto-Segmentation Software: Utility and Potential Pitfalls

January 11, 2026 · medRxiv preprint

Authors

Tozuka, R., Saito, M., Matsuda, M., Akita, T., Nemoto, H., Komiyama, T., Kadoya, N., Jingu, K., Onishi, H.

Affiliations (1)

  • Department of Therapeutic Radiology, University of Yamanashi, Chuo, Yamanashi, Japan

Abstract

Background: Accurate contouring of target volumes and organs at risk is critical for radiotherapy. While deep learning (DL) models offer efficient automation, their generalizability to real-world clinical cases containing anatomical variations and artifacts requires rigorous validation.

Purpose: To evaluate the clinical accuracy and robustness of RatoGuide, a novel DL-based auto-segmentation software, using a dataset derived from routine clinical practice, including atypical cases.

Methods: This single-center retrospective study included 36 patients treated for head and neck, thoracic, abdominal, and pelvic cancers. The cohort was intentionally selected to encompass diverse anatomies and artifacts (e.g., pacemakers, artificial femoral head replacement). Auto-contours generated by RatoGuide were compared with expert-approved manual contours. Performance was evaluated quantitatively using the Dice Similarity Coefficient (DSC) and 95th percentile Hausdorff Distance (HD95), and qualitatively via a 5-point visual assessment scale (higher is better) by four independent reviewers. A score of ≤2 by multiple reviewers was defined as failure.

Results: Overall, the mean DSC, HD95, and visual assessment score were 0.79 ± 0.19, 6.35 ± 12.2 mm, and 3.65 ± 0.88, respectively. The mean DSC exceeded 0.8 in 62% (23/37) of the evaluated structure types, and 93.5% (315/337) of all contours were considered clinically acceptable based on visual evaluation. However, lower performance was observed in small structures (e.g., optic chiasm) and low-contrast organs (e.g., esophagus).

Conclusions: RatoGuide demonstrated favorable performance for major organs across various anatomical regions, consistent with benchmarks reported in the literature. However, performance variability in atypical cases underscores the necessity of rigorous visual verification by experts for clinical implementation.
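The two quantitative metrics used above can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes segmentations are represented as sets of voxel coordinates (in mm), and computes the symmetric HD95 over the pooled directed surface distances, one common convention among several in the literature.

```python
import math

def dice(a, b):
    """Dice Similarity Coefficient between two voxel sets (1.0 = perfect overlap)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

def hd95(a, b):
    """95th-percentile Hausdorff distance (mm) between two point sets.

    Pools the directed nearest-neighbor distances from a to b and from
    b to a, then takes the 95th percentile of the combined list.
    """
    def directed(src, dst):
        return [min(math.dist(p, q) for q in dst) for p in src]
    d = sorted(directed(a, b) + directed(b, a))
    return d[min(len(d) - 1, max(0, math.ceil(0.95 * len(d)) - 1))]

# Toy 2D example: identical contours score DSC = 1.0 and HD95 = 0.0.
ref = [(0, 0), (1, 0), (1, 1)]
print(dice(ref, ref), hd95(ref, ref))
```

Clinical toolkits typically compute these on full 3D binary masks with voxel spacing taken into account (e.g., via surface meshes or distance transforms); the brute-force nearest-neighbor search here is only practical for small point sets.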

Topics

radiology and imaging
