Impact of test set composition on AI performance in pediatric wrist fracture detection in X-rays.

Authors

Till T, Scherkl M, Stranger N, Singer G, Hankel S, Flucher C, Hržić F, Štajduhar I, Tschauner S

Affiliations (4)

  • Division of Pediatric Radiology, Department of Radiology, Medical University of Graz, Auenbruggerplatz 34, Graz, 8036, Austria.
  • Division of Pediatric Radiology, Department of Radiology, Medical University of Graz, Auenbruggerplatz 34, Graz, 8036, Austria. [email protected].
  • Department of Pediatric and Adolescent Surgery, Medical University of Graz, Auenbruggerplatz 34, Graz, 8036, Austria.
  • Faculty of Engineering, Department of Computer Engineering, University of Rijeka, Vukovarska 58, Rijeka, 51000, Croatia.

Abstract

To evaluate how different test set sampling strategies (random selection versus balanced sampling) affect the performance of artificial intelligence (AI) models in pediatric wrist fracture detection on radiographs, and to highlight the need for standardization in test set design. This retrospective study used the open-source GRAZPEDWRI-DX dataset of radiographs from 6091 pediatric patients. Two test sets, each containing 4588 images, were constructed: one balanced by case difficulty, projection type, and fracture presence, and the other drawn by random selection. EfficientNet and YOLOv11 models were trained and validated on 18,762 radiographs and tested on both sets. Binary classification and object detection tasks were evaluated using metrics such as precision, recall, F1 score, AP50, and AP50-95, and the test sets were compared using nonparametric statistical tests. Performance metrics decreased significantly on the balanced test set with its more challenging cases: for example, YOLOv11 precision fell from 0.95 on the random set to 0.83 on the balanced set. Similar trends were observed for recall, accuracy, and F1 score, indicating that models trained on easy-to-recognize cases performed poorly on more complex ones. These results were consistent across all model variants tested. AI models for pediatric wrist fracture detection thus exhibit reduced performance when tested on balanced datasets containing more difficult cases compared to randomly selected cases, underscoring the importance of representative, standardized test sets that account for clinical complexity to ensure robust AI performance in real-world settings.

Question: Do sampling strategies based on sample complexity influence deep learning models' performance in fracture detection?

Findings: AI performance in pediatric wrist fracture detection drops significantly when tested on balanced datasets with more challenging cases, compared to randomly selected cases.

Clinical relevance: Without standardized and validated AI test datasets that reflect clinical complexity, performance metrics may be overestimated, limiting the utility of AI in real-world settings.
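The balanced-versus-random test set construction and the metric comparison described in the abstract can be sketched in a few lines. This is a minimal illustration, not code from the study: the stratum fields (`difficulty`, `projection`, `fracture`) and the helper names are assumptions standing in for the paper's actual stratification variables.

```python
import random
from collections import defaultdict


def balanced_sample(cases, per_stratum, seed=0):
    """Group cases by (difficulty, projection, fracture presence) and draw
    the same number from every stratum, mimicking a balanced test set.
    A plain random test set would instead be random.sample(cases, n)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for c in cases:
        strata[(c["difficulty"], c["projection"], c["fracture"])].append(c)
    sample = []
    for group in strata.values():
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


def precision_recall_f1(y_true, y_pred):
    """Binary classification metrics as reported in the study."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Evaluating the same trained model with `precision_recall_f1` on both a random and a balanced sample, then comparing the per-image or per-bootstrap metric distributions with a nonparametric test (e.g. Wilcoxon or Mann-Whitney), reproduces the shape of the comparison the authors report.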

Topics

Journal Article