Deriving the OTA/AO fracture classification from routinely collected radiology reports using a large language model.

May 1, 2026

Authors

Hu S, Keeley T, Halvorson R, Campbell ST, Lefaivre KA, Levack A, Lundy DW, Meinberg E, Schweser K, Shymon SJ, Marmor MT

Affiliations (9)

  • University of California San Francisco, School of Medicine, San Francisco, CA.
  • University of California San Francisco, Orthopaedic Trauma Institute, San Francisco, CA.
  • Department of Orthopaedic Surgery, University of California Davis, Sacramento, CA.
  • Department of Orthopaedics, The University of British Columbia, Vancouver, BC, Canada.
  • Loyola University Medical Center, Maywood, IL.
  • St. Luke's University Health Network, Bethlehem, PA.
  • HealthPartners Medical Group, Bloomington, MN.
  • University of Missouri Health Care, Columbia, MO.
  • Department of Orthopaedic Surgery, Harbor-UCLA Medical Center, Torrance, CA.

Abstract

Fracture classification plays a pivotal role in research and quality assurance; despite its wide acceptance, the OTA/AO classification is seldom documented in patients' electronic medical records, which impedes fracture registry creation and effective interdisciplinary communication. In this study, we investigated "off-the-shelf" large language models (LLMs) for translating free text in radiology reports into OTA/AO classification labels. We employed a Health Insurance Portability and Accountability Act-compliant LLM to classify 109 fracture descriptions from randomly selected radiology reports in a deidentified electronic medical record database. Ground-truth classifications were assigned by expert orthopaedic traumatologists based on the corresponding radiographs. Multiple prompting strategies were tested, including zero-shot prompting, zero-shot chain-of-thought prompting, and retrieval-augmented generation. We additionally asked the LLM to assign classification labels to "ideal" fracture descriptions written according to the 2018 OTA/AO Fracture and Dislocation Classification Compendium. Model performance was assessed using Cohen's kappa and accuracy against the ground-truth labels. The three prompting strategies yielded similar classification performance on radiology report fracture descriptions, with almost perfect agreement at the bone and the bone-and-location levels. Performance declined to slight agreement at the subgroup level. The best performance was observed with ideal fracture descriptions and retrieval-augmented generation, for which agreement between the full LLM-generated and ground-truth labels remained moderate. Classification errors were largely due to imprecise descriptions, hallucinated information, or incorrect application of factually correct information.
Our study demonstrates some potential for LLMs to translate free-text fracture descriptions into OTA/AO classifications, allowing for efficient labeling of large datasets of radiology reports. Future work should focus on refining model classification capabilities using more sophisticated prompting methods. Level of Evidence: III.
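The evaluation described above scores LLM-generated codes against ground truth at successive levels of the OTA/AO hierarchy (bone, bone and location, type, group, subgroup). A minimal sketch of that comparison, assuming codes in the compendium's alphanumeric form (e.g. "42A1.2"; the example labels and helper names are illustrative, not taken from the paper's data), with a plain implementation of Cohen's kappa:

```python
# Hedged sketch: hierarchical agreement between LLM-predicted and
# ground-truth OTA/AO codes. The code format ("42A1.2") and the
# prefix lengths below are assumptions for illustration only.

def level_prefix(code: str, level: str) -> str:
    """Truncate an OTA/AO code to a given classification level."""
    lengths = {"bone": 1, "bone_location": 2, "type": 3,
               "group": 4, "subgroup": 6}
    return code[: lengths[level]]

def cohen_kappa(truth, pred):
    """Cohen's kappa for two equal-length label sequences."""
    assert len(truth) == len(pred) and truth
    n = len(truth)
    # Observed agreement.
    po = sum(t == p for t, p in zip(truth, pred)) / n
    # Chance agreement from the marginal label frequencies.
    cats = set(truth) | set(pred)
    pe = sum((truth.count(c) / n) * (pred.count(c) / n) for c in cats)
    return 1.0 if pe == 1.0 else (po - pe) / (1 - pe)

# Hypothetical predicted vs. ground-truth labels.
truth = ["42A1.2", "32B2.1", "42A1.2", "44C1.1"]
pred  = ["42A1.3", "32B2.1", "42A3.1", "44B1.1"]

for level in ("bone", "bone_location", "type", "group", "subgroup"):
    t = [level_prefix(c, level) for c in truth]
    p = [level_prefix(c, level) for c in pred]
    print(level, round(cohen_kappa(t, p), 2))
```

Truncating both labels to a common prefix before scoring is what lets agreement stay near-perfect at the bone level while dropping at the subgroup level, as the abstract reports.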

Topics

Journal Article
