
Less time Coding, more time Caring: Performance evaluation of ChatGPT-5 for ICD-10 coding of radiology reports.

January 17, 2026

Authors

Ruhwedel T, Rogasch JMM, Dahlke PM, Shnayien S, Furth C, Wetz C, Amthauer H, Schatka I, Beetz NL

Affiliations (3)

  • Department of Nuclear Medicine, Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Augustenburger Platz 1, 13353 Berlin, Germany. Electronic address: [email protected].
  • Department of Nuclear Medicine, Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Augustenburger Platz 1, 13353 Berlin, Germany.
  • Department of Nuclear Medicine, University Hospital Zurich (USZ), University of Zurich (UZH), Raemistrasse 100, CH-8091 Zurich, Switzerland.

Abstract

Worldwide, radiologists face a high administrative workload. ICD-10 coding is mandatory for reimbursement in many health systems and a frequent source of billing errors. Large language models have shown promise in supporting coding-related tasks, but previous studies with earlier ChatGPT versions reported mixed results, and evidence specific to radiology reports remains scarce. We therefore aimed to investigate whether ChatGPT-5 can be consulted when assigning ICD-10 codes to radiology reports and whether this yields a measurable time advantage. A total of 2,738 fictitious radiology reports across multiple modalities were derived from the PARROT database; an additional 100 fictitious PET/CT reports were created. Each report was assigned a single, most relevant ICD-10 code using ChatGPT-5. For PARROT, ChatGPT-derived codes were compared with predefined database reference labels; for PET/CT, they were compared with codes assigned by an independent manual coder. Exact and character-level concordance were assessed. In cases of discordance, a blinded adjudicator selected the more accurate ICD-10 code. Coding efficiency was evaluated for PET/CT reports by measuring coding time per report. For PARROT, exact-code concordance was 1,590/2,738 (58.1%). In a random subset of 200 mismatches, blinded adjudication preferred the ChatGPT-derived code in 123 cases and the reference label in 77 (p = 0.0015). Coding non-English reports resulted in significantly lower concordance (first character: p = 0.002; second/third characters: p < 0.001; last characters: p = 0.012) and longer coding times than English reports (p = 0.002). For PET/CT reports, median coding time was 8 s with ChatGPT and 135 s without; the median time saved was 127 s per report. Applied to daily clinical care, higher code correctness might reduce billing errors, while the time saved could be reallocated to patient care. Radiologists should collaborate with developers to create versions of LLMs that operate within data-secure environments.
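
For readers who want a concrete sense of the two headline analyses, the Python sketch below shows one way the character-level concordance and the adjudication comparison could be computed. The function `char_level_concordance` and its code-splitting rule are illustrative assumptions, not the authors' pipeline; only the adjudication counts (123 vs. 77 of 200 sampled mismatches) come from the abstract, and a two-sided exact binomial (sign) test against chance reproduces a p-value of about 0.0015.

```python
from scipy.stats import binomtest

def char_level_concordance(pred: str, ref: str) -> dict:
    """Position-wise agreement between two ICD-10 codes.

    ICD-10 codes look like 'C34.1': a chapter letter, two category
    digits, then optional sub-classification characters. The splits
    below mirror the abstract's first character, second/third
    characters, and last characters; the paper does not specify its
    exact splitting rule, so this is an assumption.
    """
    p, r = pred.replace(".", ""), ref.replace(".", "")
    return {
        "exact": pred == ref,            # exact-code concordance
        "first_char": p[:1] == r[:1],    # chapter letter
        "second_third": p[1:3] == r[1:3],# category digits
        "last_chars": p[3:] == r[3:],    # sub-classification
    }

# Toy codes for illustration, not study data:
print(char_level_concordance("C34.1", "C34.9"))
# -> {'exact': False, 'first_char': True, 'second_third': True, 'last_chars': False}

# Blinded adjudication of 200 sampled mismatches: ChatGPT-derived code
# preferred 123 times vs. 77 for the reference label. Two-sided exact
# binomial (sign) test against chance (p = 0.5):
print(binomtest(123, n=200, p=0.5).pvalue)  # ~0.0015
```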

Topics

International Classification of Diseases, Clinical Coding, Journal Article
