Evaluating the Impact of Annotation Expertise on AI-Based Ultrasound Segmentation: A Case Study on Left Atrial Appendage.
Affiliations (7)
- 2Ai-School of Technology, IPCA, Campus do IPCA, Vila Frescaínha S. Martinho, 4750-810, Barcelos, Portugal.
- LASI, Intelligent Systems Associate Laboratory, 4800-058, Guimarães, Portugal.
- Department of Cardiology, Shanghai Chest Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China.
- Division of Cardiology, Department of Medicine and Therapeutics, Prince of Wales Hospital, New Territories, Hong Kong, China.
- Laboratory for Cardiac Imaging and 3D Printing, Li Ka Shing Institute of Health Science, Faculty of Medicine, The Chinese University of Hong Kong, New Territories, Hong Kong, China.
- 2Ai-School of Technology, IPCA, Campus do IPCA, Vila Frescaínha S. Martinho, 4750-810, Barcelos, Portugal. [email protected].
- LASI, Intelligent Systems Associate Laboratory, 4800-058, Guimarães, Portugal. [email protected].
Abstract
Medical image segmentation using artificial intelligence (AI) is a prominent area of research with diverse applications across many fields. In recent years, a multitude of datasets representing different body structures have been developed and made publicly available. However, the volume of data, particularly the ground truth data, which often relies on manual annotation, remains limited. Supervised learning remains the state-of-the-art approach for deep learning methods; however, its performance is often reported to depend on the expertise of the operator who generates the ground truth. This dependency becomes more critical for challenging medical imaging modalities, such as ultrasound, which is often characterized by low image quality and various artifacts. This study investigates the influence of user expertise on the accuracy of ground truth annotations and its impact on the final performance of the segmentation method. Specifically, we focus on segmenting the left atrial appendage (LAA) in ultrasound images. Two datasets were initially created: one annotated by an Expert and the other by a novice observer. Additionally, synthetic variations of these manually annotated datasets were generated by introducing both systematic and non-systematic errors to examine their effects on segmentation outcomes. Using the nnU-Net framework as the computational basis, the network was trained on each dataset, and the results were evaluated against the Expert's test labels. Training with Expert and Naive contours achieved Dice values on the test set of 0.81 ± 0.09 and 0.77 ± 0.12, respectively, with no statistically significant difference between them. Similarly, training with the synthetic variations showed no statistically significant differences for non-systematic errors, whereas systematic errors resulted in statistically significant differences against the manual contours.
These findings demonstrate that the AI network remains highly effective across most tested scenarios, even when synthetic errors are introduced, efficiently handling non-systematic errors, which synthetically mimic inter-observer variability. However, the network encounters greater challenges with systematic errors, failing to accurately delineate the LAA boundaries.
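The Dice coefficient reported above, and the distinction between systematic and non-systematic annotation errors, can be sketched as follows. This is a minimal illustrative example, not the paper's actual pipeline: the square "expert" mask, the 5-pixel shift, and the 1% random-flip rate are all hypothetical stand-ins chosen only to show why a consistent bias shifts the Dice score more predictably than unbiased jitter.

```python
import numpy as np

def dice(a, b):
    """Dice similarity coefficient between two binary masks."""
    inter = np.logical_and(a, b).sum()
    denom = a.sum() + b.sum()
    return 2.0 * inter / denom if denom else 1.0

rng = np.random.default_rng(0)

# Hypothetical "expert" ground truth: a filled square standing in for an LAA contour.
expert = np.zeros((128, 128), dtype=bool)
expert[40:90, 40:90] = True

# Systematic error: every annotation biased the same way (here, translated 5 px),
# as when an observer consistently over- or under-segments the boundary.
systematic = np.roll(expert, shift=5, axis=1)

# Non-systematic error: random, unbiased jitter (flip ~1% of pixels),
# loosely mimicking inter-observer variability.
noise = rng.random(expert.shape) < 0.01
non_systematic = np.logical_xor(expert, noise)

print(f"Dice (systematic shift): {dice(expert, systematic):.3f}")
print(f"Dice (random jitter):    {dice(expert, non_systematic):.3f}")
```

Averaged over a training set, the random-jitter errors tend to cancel out, whereas the systematic shift biases every label in the same direction, which is consistent with the network tolerating the former but not the latter.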