Evaluation of GPT-4o for multilingual translation of radiology reports across imaging modalities.

Authors

Terzis R, Salam B, Nowak S, Mueller PT, Mesropyan N, Oberlinkels L, Efferoth AF, Kravchenko D, Voigt M, Ginzburg D, Pieper CC, Hayawi M, Kuetting D, Afat S, Maintz D, Luetkens JA, Kaya K, Isaak A

Affiliations (6)

  • Institute for Diagnostic and Interventional Radiology, Faculty of Medicine and University Hospital Cologne, University of Cologne, Kerpener Straße 62, 50937 Cologne, Germany.
  • Department of Diagnostic and Interventional Radiology, University Hospital Bonn, Venusberg-Campus 1, 53127 Bonn, Germany; Quantitative Imaging Lab Bonn (QILaB), University Hospital Bonn, Venusberg-Campus 1, 53127 Bonn, Germany.
  • Department of Diagnostic and Interventional Radiology, University Hospital Bonn, Venusberg-Campus 1, 53127 Bonn, Germany.
  • Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Pauwelsstraße 30, 52074 Aachen, Germany.
  • Department of Diagnostic and Interventional Radiology, Eberhard Karls University Tuebingen, Hoppe-Seyler-Straße 3, Tuebingen 72076, Germany.
  • Department of Diagnostic and Interventional Radiology, University Hospital Bonn, Venusberg-Campus 1, 53127 Bonn, Germany; Quantitative Imaging Lab Bonn (QILaB), University Hospital Bonn, Venusberg-Campus 1, 53127 Bonn, Germany. Electronic address: [email protected].

Abstract

Large language models (LLMs) like GPT-4o offer multilingual and real-time translation capabilities. This study aims to evaluate GPT-4o's effectiveness in translating radiology reports into different languages. In this experimental two-center study, 100 real-world radiology reports from four imaging modalities (X-ray, ultrasound, CT, MRI) were randomly selected and fully anonymized. Reports were translated from German into four languages (English, French, Spanish, and Russian) using GPT-4o with zero-shot prompting (n = 400 translations). Eight bilingual radiologists (two per language) evaluated the translations for general readability, overall quality, and utility for translators using 5-point Likert scales (ranging from 5 [best score] to 1 [worst score]). Binary (yes/no) questions were used to assess potentially harmful errors, completeness, and factual correctness. The average processing time of GPT-4o for translating reports ranged from 9 to 24 s. The overall quality of translations achieved a median of 4.5 (IQR 4-5), with English (5 [4-5]), French and Spanish (each 4.5 [4-5]) significantly outperforming Russian (4 [3.5-4]; each p < 0.05). Usefulness for translators was rated highest for English (5 [5-5], p < 0.05 against other languages). Readability scores and translation completeness were significantly higher for translations into Spanish, English and French compared to Russian (each p < 0.05). Factual correctness averaged 79 %, with English (84 %) and French (83 %) outperforming Russian (69 %) (each p < 0.05). Potentially harmful errors were identified in 4 % of translations, primarily in Russian (9 %). GPT-4o demonstrated robust performance in translating radiology reports across multiple languages, with limitations observed in Russian translations.

Topics

Journal Article
