Evaluation of radiology residents' reporting skills using large language models: an observational study.

Authors

Atsukawa N, Tatekawa H, Oura T, Matsushita S, Horiuchi D, Takita H, Mitsuyama Y, Omori A, Shimono T, Miki Y, Ueda D

Affiliations (3)

  • Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3, Asahi-Machi, Abeno-ku, Osaka, 545-8585, Japan.
  • Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3, Asahi-Machi, Abeno-ku, Osaka, 545-8585, Japan. [email protected].
  • Department of Artificial Intelligence, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan.

Abstract

Large language models (LLMs) have the potential to objectively evaluate radiology resident reports; however, research on their use for feedback in radiology training and assessment of resident skill development remains limited. This study aimed to assess the effectiveness of LLMs in revising radiology reports by comparing them with reports verified by board-certified radiologists, and to analyze the progression of residents' reporting skills over time. To identify the LLM that best aligned with human radiologists, 100 reports were randomly selected from 7376 reports authored by nine first-year radiology residents. The reports were evaluated on six criteria: (1) addition of missing positive findings, (2) deletion of findings, (3) addition of negative findings, (4) correction of the expression of findings, (5) correction of the diagnosis, and (6) proposal of additional examinations or treatments. Reports were segmented into four time-based terms, and 900 reports (450 CT and 450 MRI) were randomly chosen from the initial and final terms of the residents' first year. The revised rates for each criterion were compared between the first and last terms using the Wilcoxon signed-rank test. Among the three LLMs evaluated (ChatGPT-4 Omni (GPT-4o), Claude 3.5 Sonnet, and Claude 3 Opus), GPT-4o demonstrated the highest agreement with board-certified radiologists. Using GPT-4o, significant improvements were noted in Criteria 1-3 between the first and last terms (Criteria 1, 2, and 3: P < 0.001, P = 0.023, and P = 0.004, respectively). No significant changes were observed for Criteria 4-6, although all criteria except Criterion 6 showed progressive improvement over time. LLMs can effectively provide feedback on commonly corrected areas in radiology reports, enabling residents to objectively identify and improve their weaknesses and monitor their progress. Additionally, LLMs may help reduce the workload of radiologist mentors.
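The paired comparison described in the abstract (per-resident revised rates in the first term versus the last term, tested with the Wilcoxon signed-rank test) can be sketched as follows. This is a minimal illustration, not the authors' code, and the nine per-resident rates below are invented placeholder values, not study data.

```python
# Sketch of a paired Wilcoxon signed-rank comparison of revised rates,
# as described in the abstract. Values are hypothetical, for illustration only.
from scipy.stats import wilcoxon

# Hypothetical revised rates (fraction of reports revised) for nine residents,
# one value per resident, for a single criterion.
first_term = [0.42, 0.38, 0.51, 0.47, 0.39, 0.44, 0.50, 0.36, 0.45]
last_term = [0.30, 0.29, 0.40, 0.35, 0.31, 0.33, 0.41, 0.28, 0.34]

# Paired, non-parametric test of whether the rates changed between terms.
stat, p = wilcoxon(first_term, last_term)
print(f"W = {stat}, P = {p:.4f}")
```

With nine paired observations that all move in the same direction, the exact two-sided P value is 2/2^9 ≈ 0.004, which is the kind of result the abstract reports for Criterion 3.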

Topics

Internship and Residency; Radiology; Clinical Competence; Language; Journal Article; Observational Study
