External Validation of a Winning Artificial Intelligence Algorithm from the RSNA 2022 Cervical Spine Fracture Detection Challenge.
Authors
Affiliations (4)
- From the Department of Radiology (J.P.H., X.V.N., N.Q., L.M.P.), The Ohio State University Wexner Medical Center, Columbus, Ohio.
- College of Medicine (G.R.L.), The Ohio State University, Columbus, Ohio.
- Department of Radiology (I.P.), Brigham and Women's Hospital, Boston, Massachusetts.
- From the Department of Radiology (J.P.H., X.V.N., N.Q., L.M.P.), The Ohio State University Wexner Medical Center, Columbus, Ohio.
Abstract
The Radiological Society of North America (RSNA) has actively promoted artificial intelligence (AI) challenges since 2017. Algorithms emerging from the recent RSNA 2022 Cervical Spine Fracture Detection Challenge demonstrated state-of-the-art performance on the competition's data set, surpassing results from prior publications. However, their performance in real-world clinical practice is not known. As an initial step toward assessing the feasibility of these models in clinical practice, we conducted a generalizability test by using one of the leading algorithms of the competition. The deep learning algorithm was selected for its performance, portability, and ease of use, and was installed locally. One hundred examinations (50 consecutive cervical spine CT scans with at least 1 fracture present and 50 consecutive negative CT scans) from a level 1 trauma center not represented in the competition data set were processed at 6.4 seconds per examination. Ground truth was established based on the radiology report, with retrospective confirmation of positive fracture cases. Sensitivity, specificity, F1 score, and area under the curve were calculated. The external validation data set comprised older patients than the competition set (58 ± 22.0 years versus 53.5 ± 21.8 years, respectively; P < .05). Sensitivity and specificity were 86% and 70% in the external validation group and 85% and 94% in the competition group, respectively. Cases misclassified by the convolutional neural networks frequently showed advanced degenerative disease, subtle nondisplaced fractures not easily identified on the axial plane, or malalignment. The model performed with similar sensitivity on the test and external data sets, suggesting that such a tool could potentially generalize as a triage tool in the emergency setting. Discordant factors such as age-associated comorbidities may affect the accuracy and specificity of AI models when used in certain populations.
Further research should be encouraged to help elucidate the potential contributions and pitfalls of these algorithms in supporting clinical care.
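As a rough illustration of how the reported metrics relate to one another, the sketch below reconstructs a confusion matrix from the figures given in the abstract (50 positive and 50 negative examinations, 86% sensitivity, 70% specificity) and derives the remaining threshold-based metrics. The class and function names are hypothetical, and the precision, F1, and accuracy values are arithmetic consequences of those figures rather than results reported by the authors. Area under the curve cannot be recovered this way because it requires per-examination scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Examination-level confusion matrix (hypothetical helper type)."""
    tp: int  # fracture present, flagged by the model
    fn: int  # fracture present, missed by the model
    tn: int  # no fracture, not flagged
    fp: int  # no fracture, flagged by the model


def counts_from_rates(n_pos: int, n_neg: int,
                      sensitivity: float, specificity: float) -> ConfusionCounts:
    """Back out integer counts from reported rates and group sizes."""
    tp = round(n_pos * sensitivity)
    tn = round(n_neg * specificity)
    return ConfusionCounts(tp=tp, fn=n_pos - tp, tn=tn, fp=n_neg - tn)


def metrics(c: ConfusionCounts) -> dict:
    """Standard threshold-based metrics computed from confusion counts."""
    sensitivity = c.tp / (c.tp + c.fn)
    specificity = c.tn / (c.tn + c.fp)
    precision = c.tp / (c.tp + c.fp)
    f1 = 2 * c.tp / (2 * c.tp + c.fp + c.fn)
    accuracy = (c.tp + c.tn) / (c.tp + c.fn + c.tn + c.fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "accuracy": accuracy,
    }


if __name__ == "__main__":
    # Group sizes and rates taken from the abstract (external validation set).
    external = counts_from_rates(n_pos=50, n_neg=50,
                                 sensitivity=0.86, specificity=0.70)
    print(external)
    for name, value in metrics(external).items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions the external set works out to 43 true positives, 7 false negatives, 35 true negatives, and 15 false positives. Because the validation set was balanced by design (50/50), precision and F1 computed this way would not carry over to a clinical population with a lower fracture prevalence.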