Cross-Institutional Evaluation of Large Language Models for Radiology Diagnosis Extraction: A Prompt-Engineering Perspective.

May 8, 2025

Authors

Moassefi M, Houshmand S, Faghani S, Chang PD, Sun SH, Khosravi B, Triphati AG, Rasool G, Bhatia NK, Folio L, Andriole KP, Gichoya JW, Erickson BJ

Affiliations (9)

  • Mayo Clinic Artificial Intelligence Lab, Department of Radiology, Mayo Clinic, 200 1st Street, S.W., Rochester, MN, 55905, USA.
  • Department of Radiology, University of California San Francisco, San Francisco, CA, USA.
  • Departments of Radiological Sciences and Computer Science, University of California, Irvine, CA, USA.
  • The Center for Artificial Intelligence in Diagnostic Medicine (CAIDM), University of California, Irvine, CA, USA.
  • Moffitt Cancer Center, Tampa, FL, USA.
  • Department of Radiology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA.
  • Department of Radiology and Imaging Sciences, Emory University School of Medicine, Atlanta, GA, USA.
  • Healthcare AI Innovation and Translational Informatics (HITI) Lab, Emory University School of Medicine, Atlanta, GA, USA.
  • Mayo Clinic Artificial Intelligence Lab, Department of Radiology, Mayo Clinic, 200 1st Street, S.W., Rochester, MN, 55905, USA. [email protected].

Abstract

The rapid evolution of large language models (LLMs) offers promising opportunities for radiology report annotation, aiding in determining the presence of specific findings. This study evaluates the effectiveness of a human-optimized prompt for labeling radiology reports across multiple institutions using LLMs. Six institutions collected 500 radiology reports: 100 in each of five categories. A standardized Python script was distributed to participating sites, allowing the use of one common locally executed LLM with a standard human-optimized prompt. The script ran the LLM's analysis on each report and compared its predictions to reference labels provided by local investigators. Model performance was measured as accuracy, and results were aggregated centrally. The human-optimized prompt demonstrated high consistency across sites and pathologies. Preliminary analysis indicates substantial agreement between the LLM's outputs and the investigator-provided reference labels across multiple institutions. At one site, eight LLMs were systematically compared, with Llama 3.1 70b achieving the highest performance in accurately identifying the specified findings. Comparable performance with Llama 3.1 70b was observed at two additional centers, demonstrating the model's adaptability to variations in report structure and institutional practice. Our findings illustrate the potential of optimized prompt engineering for leveraging LLMs in cross-institutional radiology report labeling. The approach is straightforward while maintaining high accuracy and adaptability. Future work will explore model robustness to diverse report structures and further refine prompts to improve generalizability.
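The abstract describes a simple protocol: a shared script sends each report to a locally hosted LLM with a fixed prompt, parses a yes/no answer, compares it to the site's reference label, and reports accuracy per category. The sketch below is a minimal illustration of that workflow, not the authors' distributed script; the endpoint (an Ollama-style local API), the model tag, the prompt wording, and the CSV column names are all assumptions introduced for illustration.

```python
import csv
import json
import urllib.request
from collections import defaultdict

# Hypothetical local endpoint and model tag (Ollama-style API); the paper only
# states that the LLM was executed locally, not which serving stack was used.
LLM_URL = "http://localhost:11434/api/generate"
MODEL = "llama3.1:70b"

# Illustrative stand-in for the study's human-optimized prompt
# (the actual prompt is not reproduced in the abstract).
PROMPT_TEMPLATE = (
    "You are labeling a radiology report. "
    "Answer strictly 'yes' or 'no': does the report describe {finding}?\n\n"
    "Report:\n{report}"
)

def query_llm(prompt: str) -> str:
    """Send one prompt to the locally hosted model and return its text output."""
    payload = json.dumps({"model": MODEL, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        LLM_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def label_reports(csv_path: str) -> dict:
    """Label each report, compare against the site's reference label, and
    accumulate per-category accuracy. Assumes columns: category, finding,
    report_text, reference_label ('yes'/'no')."""
    correct = defaultdict(int)
    total = defaultdict(int)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            prompt = PROMPT_TEMPLATE.format(
                finding=row["finding"], report=row["report_text"]
            )
            prediction = "yes" if "yes" in query_llm(prompt).lower() else "no"
            total[row["category"]] += 1
            if prediction == row["reference_label"].strip().lower():
                correct[row["category"]] += 1
    return {cat: correct[cat] / total[cat] for cat in total}

if __name__ == "__main__":
    for category, accuracy in label_reports("reports.csv").items():
        print(f"{category}: accuracy = {accuracy:.3f}")
```

In a multi-site setting like the one described, each institution would run such a script against its own report file and return only the aggregate accuracy figures for central pooling, so that no report text leaves the local environment.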

Topics

Journal Article