Prospective pilot evaluation of a deep learning model for kidney stone detection on CT using a web-based workflow platform.
Authors
Affiliations (1)
- Department of Electrical and Electronics Engineering, Faculty of Engineering and Architecture, Izmir Bakırçay University, İzmir, Turkey. [email protected].
Abstract
Rapid and reliable detection of kidney stones on non-contrast abdominal CT is essential for timely decision-making in emergency radiology. However, rising imaging volumes and workflow pressures continue to limit reporting capacity, creating a need for AI systems capable of supporting routine diagnostic practice. Although many AI-based stone detection models have been proposed, most rely on retrospective datasets, and few have been evaluated prospectively within environments that reflect real radiology workflow conditions. This study prospectively evaluates the performance, usability, and workflow compatibility of a deep learning-based kidney stone detection model deployed within a web-based platform designed to emulate key components of routine radiology practice, enabling forward-in-time evaluation without direct integration into routine clinical operations such as PACS/RIS or clinical reporting. A dual-stage convolutional neural network was developed using an internal dataset of 235 cases (3,452 slices) and validated through five-fold patient-level cross-validation. A separate set of 732 slices served as an independent hold-out set. For prospective evaluation, the trained model was integrated into a secure, browser-based interface supporting case upload, slice-level review, independent radiologist labeling, and visualization of AI-generated predictions. Over a six-month period, three radiologists uploaded and annotated a total of 5,152 anonymized CT slices. The platform dynamically calculated diagnostic metrics and logged human-AI interactions to assess performance stability and concordance. The pilot deployment demonstrated strong diagnostic performance under real-world variability, achieving 97.83% accuracy, 94.64% sensitivity, 98.27% specificity, 88.50% precision, and a Cohen's kappa of 0.90. Concordance between radiologists and the model exhibited increasing stability across sequential pilot stages.
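The metrics reported above can all be derived from slice-level confusion-matrix counts of model predictions against radiologist labels. The sketch below shows the standard definitions, including the two-rater Cohen's kappa the platform would compute; the counts used here are hypothetical, not the study's actual data.

```python
# Hypothetical slice-level counts (TP, FP, TN, FN) of model predictions
# versus radiologist labels; the paper's actual counts are not given here.
tp, fp, tn, fn = 212, 28, 4780, 12

total = tp + fp + tn + fn
accuracy    = (tp + tn) / total
sensitivity = tp / (tp + fn)   # recall on stone-positive slices
specificity = tn / (tn + fp)   # recall on stone-negative slices
precision   = tp / (tp + fp)   # positive predictive value

# Cohen's kappa: observed agreement corrected for chance agreement
# between the model and the radiologist label distributions.
p_observed = accuracy
p_yes = ((tp + fp) / total) * ((tp + fn) / total)
p_no  = ((fn + tn) / total) * ((fp + tn) / total)
p_expected = p_yes + p_no
kappa = (p_observed - p_expected) / (1 - p_expected)

print(f"accuracy={accuracy:.4f} sensitivity={sensitivity:.4f} "
      f"specificity={specificity:.4f} precision={precision:.4f} "
      f"kappa={kappa:.4f}")
```

Because stone-positive slices are rare relative to negatives, accuracy and specificity alone can look strong even for a weak detector; kappa and precision are the class-imbalance-sensitive checks in this set.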
These findings present a reproducible framework for transitioning radiological AI systems from retrospective validation toward workflow-aligned, prospective pilot deployment. Although full PACS/RIS integration was not attempted, the results underscore the importance of pilot-stage evaluation as a critical intermediary step toward clinical implementation and regulatory approval.