Large language model-based uncertainty-adjusted label extraction for artificial intelligence model development in upper extremity radiography.

November 14, 2025

Authors

Kreutzer H,Caselitz AS,Dratsch T,Pinto Dos Santos D,Kuhl C,Truhn D,Nebelung S

Affiliations (6)

  • Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Aachen, Germany. [email protected].
  • Lab for Artificial Intelligence in Medicine, Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Aachen, Germany. [email protected].
  • Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Aachen, Germany.
  • Lab for Artificial Intelligence in Medicine, Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Aachen, Germany.
  • Institute for Diagnostic and Interventional Radiology, Faculty of Medicine and University Hospital Cologne, University of Cologne, Cologne, Germany.
  • Department of Diagnostic and Interventional Radiology, University Medical Center Mainz, Mainz, Germany.

Abstract

To evaluate GPT-4o's zero-shot ability to extract structured diagnostic labels (with uncertainty) from free-text radiology reports and to test how these labels affect multi-label image classification of musculoskeletal radiographs.

This retrospective study included radiography series of the clavicle (n = 1170), elbow (n = 3755), and thumb (n = 1978). After anonymization, GPT-4o filled out structured templates by indicating imaging findings as present ("true"), absent ("false"), or "uncertain." To assess the impact of label uncertainty, "uncertain" labels in the training and validation sets were automatically reassigned to "true" (inclusive) or "false" (exclusive). Label-image pairs were used for multi-label classification with the ResNet50 architecture. Label extraction accuracy was manually verified on internal (clavicle: n = 233, elbow: n = 745, thumb: n = 393) and external test sets (n = 300 each). Performance was assessed using macro-averaged receiver operating characteristic (ROC) area under the curve (AUC), precision-recall curves, sensitivity, specificity, and accuracy. AUCs were compared with the DeLong test.

Automatic extraction was correct for 98.6% (60,618 of 61,488) of labels in the test sets. Across anatomic regions, label-based model training yielded competitive performance as measured by macro-averaged AUC values for inclusive (e.g., elbow: AUC = 0.80 (range, 0.62-0.87)) and exclusive models (elbow: AUC = 0.80 (range, 0.61-0.88)). Models generalized well to external datasets (elbow (inclusive): AUC = 0.79 (range, 0.61-0.87); elbow (exclusive): AUC = 0.79 (range, 0.63-0.89)). No significant differences were observed across labeling strategies or datasets (p ≥ 0.15).

GPT-4o extracted labels from radiologic reports with high accuracy, enabling the training of competitive multi-label classification models. Uncertainty detected in the radiologic reports did not influence the performance of these models.
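The uncertainty-handling step described above can be sketched in a few lines. This is a hypothetical illustration, not the study's code: the finding names and report dictionary are invented, and only the general mapping described in the abstract (GPT-4o emits "true"/"false"/"uncertain" per finding; "uncertain" becomes true under the inclusive strategy and false under the exclusive one) is assumed.

```python
# Illustrative sketch of reassigning "uncertain" labels before training.
# Finding names ("fracture", "luxation", "implant") are hypothetical examples.

def reassign_uncertain(labels, strategy):
    """Map per-finding string labels to booleans.

    strategy: "inclusive" treats "uncertain" as present (True);
              "exclusive" treats "uncertain" as absent (False).
    """
    if strategy not in ("inclusive", "exclusive"):
        raise ValueError(f"unknown strategy: {strategy}")
    mapping = {
        "true": True,
        "false": False,
        "uncertain": strategy == "inclusive",
    }
    return {finding: mapping[value] for finding, value in labels.items()}

# One structured template as GPT-4o might fill it out for a single report:
report = {"fracture": "true", "luxation": "uncertain", "implant": "false"}

print(reassign_uncertain(report, "inclusive"))
print(reassign_uncertain(report, "exclusive"))
```

Applying both strategies to the same extracted labels yields the two training sets (inclusive and exclusive) whose downstream classifiers the study then compares.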
Key points

Question: Can GPT-4o automatically extract high-accuracy, uncertainty-aware diagnostic labels from routine radiologic reports of the clavicle, elbow, and thumb for use in training multi-label image classifiers?

Findings: GPT-4o extracted labels with > 98% accuracy, and multi-label classifiers for clavicle, elbow, and thumb radiographs performed consistently regardless of how uncertainty was handled.

Clinical relevance: Automated GPT-4o-based labeling enables the rapid conversion of routine clavicle, elbow, and thumb radiologic reports into structured multi-label training datasets, supporting scalable development of dedicated image classification models.

Topics

Journal Article
