Interpreting BI-RADS-Free Breast MRI Reports Using a Large Language Model: Automated BI-RADS Classification From Narrative Reports Using ChatGPT.

Authors

Tekcan Sanli DE, Sanli AN, Ozmen G, Ozmen A, Cihan I, Kurt A, Esmerer E

Affiliations (4)

  • Department of Radiology, Faculty of Medicine, Gaziantep University, Gaziantep, Turkey (D.E.T.S., G.O., A.O., I.C., A.K.). Electronic address: [email protected].
  • Department of General Surgery, Abdulkadir Yüksel State Hospital, Gaziantep, Turkey (A.N.S.).
  • Department of Radiology, Faculty of Medicine, Gaziantep University, Gaziantep, Turkey (D.E.T.S., G.O., A.O., I.C., A.K.).
  • Department of Radiology, Başakşehir Çam and Sakura City Hospital, Istanbul, Turkey (E.E.).

Abstract

This study aimed to evaluate the performance of ChatGPT (GPT-4o) in interpreting free-text breast magnetic resonance imaging (MRI) reports by assigning BI-RADS categories and recommending appropriate clinical management steps in the absence of explicitly stated BI-RADS classifications. In this retrospective, single-center study, a total of 352 full-text breast MRI reports documenting at least one identifiable breast lesion with descriptive imaging findings, generated between January 2024 and June 2025, were included. Reports that were incomplete due to technical limitations, reports describing only normal findings, and MRI examinations performed at external institutions were excluded. The first aim was to assess ChatGPT's ability to infer the correct BI-RADS category (2, 3, 4A, 4B, 4C, or 5) based solely on the narrative imaging findings. The second aim was to evaluate the model's ability to distinguish benign from suspicious/malignant imaging features for clinical decision-making. To this end, BI-RADS 2-3 categories were grouped as "benign" and BI-RADS 4-5 as "suspicious/malignant," in alignment with how BI-RADS categories are used to guide patient management rather than to represent definitive diagnostic outcomes. Reports originally containing the term "BI-RADS" were manually de-identified by removing BI-RADS categories and clinical recommendations. Each narrative report was then processed through ChatGPT using two standardized prompts: (1) What is the most appropriate BI-RADS category based on the findings in the report? (2) What should be the next clinical step (e.g., follow-up, biopsy)? Responses were evaluated in real time by two experienced breast radiologists, and their consensus served as the reference standard. ChatGPT demonstrated moderate agreement with the radiologists' consensus for BI-RADS classification (Cohen's kappa (κ) = 0.510, p < 0.001). Classification accuracy was highest for BI-RADS 5 reports (77.9%), whereas lower agreement was observed in intermediate categories such as BI-RADS 3 (52.4% correct) and 4B (29.4% correct). In the binary classification of reports as benign or suspicious/malignant, ChatGPT achieved almost perfect agreement (κ = 0.843), correctly identifying 91.7% of benign and 93.2% of suspicious/malignant reports. Notably, the model's management recommendations were 100% consistent with its assigned BI-RADS categories, advising biopsy for all BI-RADS 4-5 cases and short-interval follow-up or conditional biopsy for BI-RADS 3 reports. ChatGPT accurately interpreted unstructured breast MRI reports, particularly with respect to benign/malignant discrimination and the corresponding clinical recommendations. This technology holds potential as a decision-support tool to standardize reporting and enhance clinical workflows, especially in settings with variable reporting practices. Prospective, multi-institutional studies are needed for further validation.
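
For illustration, the two-prompt workflow and the agreement analysis described in the abstract could be reproduced along the following lines. This is a minimal sketch, not the authors' code: it assumes the OpenAI Python client with the "gpt-4o" model and scikit-learn's cohen_kappa_score, and the system message, prompt formatting, and helper names (query_report, kappa_scores) are illustrative assumptions.

    # Minimal sketch of the two-prompt workflow and kappa analysis described above.
    # Assumptions: OpenAI Python client (openai>=1.0) and scikit-learn; prompt framing,
    # model name, and function names are illustrative, not the authors' actual code.
    from openai import OpenAI
    from sklearn.metrics import cohen_kappa_score

    client = OpenAI()

    PROMPTS = [
        "What is the most appropriate BI-RADS category based on the findings in the report?",
        "What should be the next clinical step (e.g., follow-up, biopsy)?",
    ]

    def query_report(report_text: str) -> list[str]:
        """Send one de-identified narrative MRI report through both standardized prompts."""
        answers = []
        for prompt in PROMPTS:
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[
                    {"role": "system", "content": "You are a breast radiology assistant."},
                    {"role": "user", "content": f"{prompt}\n\nReport:\n{report_text}"},
                ],
            )
            answers.append(response.choices[0].message.content.strip())
        return answers

    def kappa_scores(model_birads: list[str], consensus_birads: list[str]) -> tuple[float, float]:
        """Cohen's kappa over the full BI-RADS scale and over the benign (2-3) vs.
        suspicious/malignant (4-5) grouping, mirroring the two analyses in the abstract."""
        to_binary = lambda c: "benign" if c in ("2", "3") else "suspicious/malignant"
        full = cohen_kappa_score(model_birads, consensus_birads)
        binary = cohen_kappa_score([to_binary(c) for c in model_birads],
                                   [to_binary(c) for c in consensus_birads])
        return full, binary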

Topics

Journal Article
