Comprehensive framework for evaluation of deep neural networks in detection and quantification of lymphoma from PET/CT images: Clinical insights, pitfalls, and observer agreement analyses.

June 12, 2026

Authors

Ahamed S, Xu Y, Kurkowska S, Gowdy C, H O J, Bloise I, Wilson D, Martineau P, Bénard F, Yousefirizi F, Dodhia R, Lavista JM, Weeks WB, Uribe CF, Rahmim A

Affiliations (8)

  • University of British Columbia, Vancouver, BC, Canada; BC Cancer Research Institute, Vancouver, BC, Canada; Microsoft AI for Good Lab, Redmond, WA, USA.
  • Microsoft AI for Good Lab, Redmond, WA, USA.
  • BC Cancer Research Institute, Vancouver, BC, Canada; Pomeranian Medical University, Szczecin, Zachodniopomorskie, Poland.
  • BC Children's Hospital, Vancouver, BC, Canada.
  • St. Mary's Hospital, Seoul, Republic of Korea.
  • BC Cancer, Vancouver, BC, Canada.
  • BC Cancer Research Institute, Vancouver, BC, Canada.
  • University of British Columbia, Vancouver, BC, Canada; BC Cancer Research Institute, Vancouver, BC, Canada.

Abstract

This study addresses critical gaps in automated lymphoma segmentation from PET/CT imaging that are often overlooked in prior work. While deep learning has been applied to this task, few studies evaluate generalizability on external or out-of-distribution data. Similarly, intra- and inter-observer variability analyses remain rare, limiting understanding of task difficulty. Moreover, most methods emphasize global segmentation metrics, neglecting lesion-level characteristics that are crucial for clinical decision-making. We propose a clinically relevant evaluation framework to assess four commonly used deep segmentation networks (ResUNet, SegResNet, DynUNet, SwinUNETR) on 611 PET/CT cases from multi-institutional datasets spanning varied lymphoma subtypes and lesion characteristics. In addition to the Dice similarity coefficient (DSC), we compute prediction errors on clinical lesion measures and analyze DSC performance as a function of these measures. We also use traditional lesion-specific detection criteria (Criteria 1 and 2), which provide insight into the networks' performance in identifying and localizing lesions, respectively, and propose an additional Criterion 3 for segmenting lesions based on metabolic characteristics. Finally, we contextualize network performance by comparing it to expert human observers through intra- and inter-observer variability analyses. Networks perform best on large, metabolically active lesions. Their error patterns closely resemble those of expert annotators, while small and faint lesions remain challenging for both networks and physicians. Our clinically relevant benchmarking framework enables more consistent and meaningful evaluation of lymphoma segmentation models, supporting robust decision-making in patient care. The approach is extensible to other architectures and disease types. Code is available at: https://github.com/microsoft/lymphoma-segmentation-dnn.
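
As a rough illustration of the two kinds of evaluation the abstract mentions, the sketch below computes the DSC between two binary masks and the signed error on one lesion measure. It is not taken from the authors' repository: the function names `dice_similarity` and `volume_error_ml`, the use of total segmented volume as the lesion measure, the voxel volume, and the convention that two empty masks score 1.0 are all assumptions made for this example.

```python
import numpy as np


def dice_similarity(gt, pred):
    """Dice similarity coefficient: 2|A ∩ B| / (|A| + |B|) for binary masks."""
    gt = np.asarray(gt, dtype=bool)
    pred = np.asarray(pred, dtype=bool)
    denom = gt.sum() + pred.sum()
    if denom == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return float(2.0 * np.logical_and(gt, pred).sum() / denom)


def volume_error_ml(gt, pred, voxel_volume_ml):
    """Signed error (predicted minus ground truth) in total segmented volume.

    Total volume is a hypothetical stand-in for a clinical lesion measure.
    """
    gt_voxels = np.asarray(gt, dtype=bool).sum()
    pred_voxels = np.asarray(pred, dtype=bool).sum()
    return float((pred_voxels - gt_voxels) * voxel_volume_ml)


# Toy 3D example: an 8-voxel ground-truth lesion and a 12-voxel prediction
# that fully contains it.
gt = np.zeros((4, 4, 4), dtype=bool)
gt[1:3, 1:3, 1:3] = True
pred = np.zeros((4, 4, 4), dtype=bool)
pred[1:3, 1:3, 1:4] = True

dsc = dice_similarity(gt, pred)                          # 2*8 / (8 + 12) = 0.8
vol_err = volume_error_ml(gt, pred, voxel_volume_ml=0.5)  # (12 - 8) * 0.5 = 2.0
print(f"DSC = {dsc:.2f}, volume error = {vol_err:.1f} ml")
```

A single global DSC hides whether a prediction over- or under-segments, which is one reason to report it alongside errors on lesion measures, as the abstract describes.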

Topics

Journal Article
