Clinical evaluation of two artificial intelligence algorithms in standard radiography for post-traumatic exploration of peripheral limbs in children.
Affiliations (2)
- Radiology Department, Centre Hospitalier Universitaire de Reims, 45 Rue Cognacq-Jay, 51092, Reims, France.
- Radiology Department, Centre Hospitalier Universitaire de Reims, 45 Rue Cognacq-Jay, 51092, Reims, France.
Abstract
Pediatric traumatology accounts for a substantial share of emergency department visits, and advances in artificial intelligence (AI) offer promising tools for detecting fractures on radiographs. However, few studies have compared these tools within the same pediatric population. This study aimed to evaluate the diagnostic performance of two commercially available AI software programs for detecting peripheral limb fractures in children, both as standalone tools and in comparison with radiology residents. We consecutively recruited 366 children under 18 years of age who were referred by the pediatric emergency department for radiographic exploration of peripheral limb trauma, grouped into five anatomical subgroups: hand/wrist, elbow, shoulder, knee/leg, and foot/ankle. Two fracture-detection AI algorithms (Boneview and Rayvolve) were evaluated; joint effusions and dislocations were analyzed with Boneview only. Two radiology residents (one in the first year and one in the third year of training) interpreted the radiographs first without and then with AI assistance. Boneview's sensitivity and specificity were 0.85 and 0.87, respectively, whereas Rayvolve's were 0.81 and 0.93. The areas under the curve (AUC) of Boneview (0.862) and Rayvolve (0.874) did not differ significantly (P = 0.62). AI assistance significantly improved the performance of the less experienced resident but not that of the more experienced resident. For effusion detection, Boneview's sensitivity and specificity were 0.57 and 0.96, close to but slightly below those of the residents. AI assistance had no substantial impact on the residents' performance in effusion detection. Evaluating two AI algorithms in the same pediatric population showed consistent and acceptable performance, with differing performance profiles, and suggests that AI assistance may improve performance for less experienced users.
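As an illustration of how the reported metrics relate to one another, the sketch below computes sensitivity and specificity from a confusion matrix and the area under the ROC curve of a classifier that outputs only a binary decision (fracture / no fracture). For such a classifier the ROC curve passes through (0, 0), (1 − specificity, sensitivity) and (1, 1), so its AUC reduces to (sensitivity + specificity) / 2. The abstract does not state how the AUCs were computed; the confusion-matrix counts and the names `Confusion` and `binary_auc` below are hypothetical and chosen only for illustration, not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Hypothetical per-reader confusion matrix (counts of radiographs)."""
    tp: int  # fracture present, flagged
    fn: int  # fracture present, missed
    tn: int  # no fracture, not flagged
    fp: int  # no fracture, flagged


def sensitivity(c: Confusion) -> float:
    # Proportion of true fractures that were detected.
    return c.tp / (c.tp + c.fn)


def specificity(c: Confusion) -> float:
    # Proportion of fracture-free cases correctly left unflagged.
    return c.tn / (c.tn + c.fp)


def binary_auc(sens: float, spec: float) -> float:
    """Trapezoidal AUC of the ROC curve (0,0) -> (1-spec, sens) -> (1,1)."""
    fpr = 1.0 - spec
    first = fpr * sens / 2.0                 # area under segment from the origin
    second = (1.0 - fpr) * (sens + 1.0) / 2.0  # area under segment to (1, 1)
    return first + second                    # algebraically (sens + spec) / 2


if __name__ == "__main__":
    # Hypothetical counts, not study data.
    c = Confusion(tp=85, fn=15, tn=87, fp=13)
    print(sensitivity(c), specificity(c))
    # Plugging in the sensitivity/specificity pairs reported in the abstract:
    print(round(binary_auc(0.85, 0.87), 3))  # Boneview pair, ≈ 0.86
    print(round(binary_auc(0.81, 0.93), 3))  # Rayvolve pair, ≈ 0.87
```

These values are close to the reported AUCs (0.862 and 0.874), which is what one would expect if the AUCs were derived from binary outputs; comparing the two AUCs statistically, as the P-value of 0.62 implies, requires the paired per-case results and is not reproduced here.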