Lung nodule detection and potential impact on guideline-based management: a retrospective post-market evaluation of three commercial software systems.
Authors
Affiliations (6)
- Department of Radiology, Medical University of Innsbruck, Innsbruck, Austria.
- Department of Nuclear Medicine, Medical University of Innsbruck, Innsbruck, Austria.
- Department of Internal Medicine V, Medical University of Innsbruck, Innsbruck, Austria.
- Department of Internal Medicine II, Medical University of Innsbruck, Innsbruck, Austria.
- Department of Visceral, Transplant and Thoracic Surgery, Center of Operative Medicine, Medical University of Innsbruck, Innsbruck, Austria.
- Department of Radiology, Medical University of Innsbruck, Innsbruck, Austria.
Abstract
To evaluate three commercial AI software tools for pulmonary nodule detection and segmentation and to assess their impact on guideline-based management recommendations.

A total of 740 CT and PET-CT studies from clinical routine were analyzed using three software tools (S1, S2, S3). We compared the total number of detected nodules and the number of "actionable" nodules, as defined by the British Thoracic Society (BTS). We further evaluated how measurement variations between tools affected hypothetical management according to the Fleischner Society and BTS guidelines for incidental nodules.

The tools differed significantly in the total number of detections (S1: 1336; S2: 1060; S3: 1536; p < 0.001) and in the number of wrong findings (S1: 965; S2: 720; S3: 1169; p < 0.001). However, the detection of actionable nodules was comparable across all tools (S1: 375; S2: 341; S3: 373; p = 0.73). Although mean diameter and volume measurements did not differ significantly between tools, small absolute variations led to significant differences in management. Specifically, S2 triggered significantly more 1-year follow-up recommendations than S3 under BTS guidelines (p < 0.001). No significant management differences were observed when applying Fleischner Society guidelines.

While the three included AI tools show comparable performance in detecting actionable nodules, minor measurement variations significantly impact downstream management when using guidelines with narrow thresholds, such as the BTS criteria. Fleischner Society guidelines appear more robust to these inter-software variations.

Question: How do commercial software tools for pulmonary nodule detection perform in real-world settings, and how do they affect hypothetical management under BTS and Fleischner guidelines?

Findings: Detection of actionable nodules was comparable across all tools, but small absolute measurement variations triggered significantly more 1-year follow-up recommendations under BTS guidelines.

Clinical relevance: AI software can cause inconsistent BTS-based management due to narrow thresholds, while Fleischner criteria appear more stable. Frequent detection of benign lesions may pose a risk of overdiagnosis and overtreatment in standalone AI-based reporting.
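The detection counts reported above can be put on a common scale by expressing them as shares of each tool's total detections. The following is a minimal descriptive sketch, not the study's statistical analysis: the dictionary layout and the helper name `share` are illustrative assumptions, and only the counts themselves come from the abstract. The "wrong" and "actionable" counts are treated as independent tallies and are not assumed to partition the total.

```python
# Counts as reported in the abstract (per tool).
counts = {
    "S1": {"total": 1336, "wrong": 965, "actionable": 375},
    "S2": {"total": 1060, "wrong": 720, "actionable": 341},
    "S3": {"total": 1536, "wrong": 1169, "actionable": 373},
}


def share(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Hypothetical summary structure: percentage of all detections per category.
summary = {
    tool: {
        "actionable_pct": share(c["actionable"], c["total"]),
        "wrong_pct": share(c["wrong"], c["total"]),
    }
    for tool, c in counts.items()
}

for tool, s in summary.items():
    print(f"{tool}: actionable {s['actionable_pct']}%, wrong {s['wrong_pct']}%")
```

On these figures, the absolute numbers of actionable nodules are similar across tools, while the proportion of each tool's output that is actionable differs because the total detection counts differ.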