
Leveraging Fine-Tuned Large Language Models for Interpretable Pancreatic Cystic Lesion Feature Extraction and Risk Categorization.

May 6, 2026 · PubMed

Authors

Rasromani E, Kang SK, Xu Y, Liu B, Luhadia G, Chui WF, Pasadyn FL, Hung YC, An JY, Mathieu E, Gu Z, Fernandez-Granda C, Javed AA, Sacks GD, Gonda T, Huang C, Shen Y

Affiliations (8)

  • Center for Data Science, New York University, 60 5th Ave, New York, NY 10011, United States.
  • Department of Radiology, Columbia University Irving Medical Center, 722 West 168th Street, New York, NY 10032, United States.
  • Department of Radiology, NYU Langone Health, 550 First Avenue, New York, NY 10016, United States.
  • NYU Grossman School of Medicine, 550 First Ave, New York, NY 10016, United States.
  • Department of Radiology, University of California San Diego School of Medicine, 9500 Gilman Dr, La Jolla, CA 92093, United States.
  • Department of Mathematics, Courant Institute of Mathematical Sciences, New York University, 251 Mercer St, New York, NY 10012, United States.
  • Department of Surgery, NYU Langone Health, 550 First Avenue, New York, NY 10016, United States.
  • Department of Medicine, NYU Langone Health, 550 First Avenue, New York, NY 10016, United States.

Abstract

Background: Manual extraction of pancreatic cystic lesion (PCL) features from radiology reports is labor-intensive, limiting large-scale studies needed to advance PCL research. Objective: The purpose of this study was to evaluate GPT-4o (closed source), Llama (open source), and DeepSeek (open source) large language models (LLMs) for PCL feature extraction, without and with chain-of-thought (CoT) reasoning. Methods: We curated a dataset of 6469 abdominal MRI or CT reports (2005-2024) from 5615 patients that described PCLs. Llama and DeepSeek were fine-tuned using Quantized Low-Rank Adaptation on GPT-4o-generated CoT labels for extracting PCL and main pancreatic duct features. Features were mapped to risk categories per institutional policy. Evaluation was performed on 285 held-out human-annotated reports from 281 patients. Model outputs for 100 cases were independently reviewed by three radiologists. Feature extraction was evaluated using exact match accuracy, risk categorization with macro-averaged F1 score, and radiologist-model agreement with Fleiss' kappa. Error analyses were performed to assess how and why models made mistakes. Results: CoT fine-tuned LLMs showed a feature extraction accuracy of 97% (95% CI, 97-98%) for Llama, 98% (95% CI, 97-98%) for DeepSeek, and 97% (95% CI, 97-98%) for GPT-4o. Risk categorization F1 scores were 0.93 (95% CI, 0.89-0.97) for Llama, 0.94 (95% CI, 0.90-0.98) for DeepSeek, and 0.97 (95% CI, 0.93-0.99) for GPT-4o. Radiologist interreader agreement was high (κ = 0.888) and showed no significant difference with the addition of Llama (κ = 0.882, p > .99), DeepSeek (κ = 0.893, p > .99), or GPT-4o (κ = 0.897, p > .99). Across all models, object identification and clinical reasoning were the most frequent error types, accounting for 29.3-37.3% and 18.1-21.1% of total errors, respectively. Conclusion: LLMs show feasibility for automatically extracting PCL features from radiology reports. Fine-tuned open-source LLMs achieved performance comparable to GPT-4o. CoT reasoning improved accuracy and enabled interpretable error analysis. Model-assigned risk categories showed high agreement with abdominal radiologists. Clinical Impact: LLMs have the potential to enable creation of large structured registries from existing radiology reports to support population-level research on PCLs.
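The abstract describes fine-tuning the open-source models with Quantized Low-Rank Adaptation (QLoRA). The configuration below is a minimal sketch of what such a setup typically looks like with Hugging Face `transformers` and `peft`; the base model name, adapter rank, and all other hyperparameters are illustrative assumptions, not values reported by the paper.

```python
# Sketch of a QLoRA setup: 4-bit quantized base weights plus trainable
# low-rank adapters. Model name and hyperparameters are assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # the "Quantized" in QLoRA
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=16,                                 # adapter rank (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    task_type="CAUSAL_LM",
)

# Placeholder base model; the paper's exact Llama/DeepSeek checkpoints
# are not specified in the abstract.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
)
model = get_peft_model(model, lora_config)  # only adapter weights train
```

Training would then proceed on the GPT-4o-generated CoT labels with a standard causal-language-modeling objective; only the small adapter matrices are updated, which is what makes fine-tuning feasible on quantized models.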
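The three evaluation metrics named in the abstract (exact match accuracy for feature extraction, macro-averaged F1 for risk categorization, and Fleiss' kappa for radiologist-model agreement) can be computed with standard libraries. This is a minimal sketch assuming scikit-learn and statsmodels; all labels and ratings below are illustrative, not data from the study.

```python
# Illustrative computation of the abstract's three evaluation metrics.
# Data are made up for demonstration purposes only.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Feature extraction: exact match between predicted and reference values.
ref_features = ["cyst_present", "duct_dilated", "septation", "mural_nodule"]
pred_features = ["cyst_present", "duct_dilated", "septation", "calcification"]
acc = accuracy_score(ref_features, pred_features)  # 3 of 4 match -> 0.75

# Risk categorization: F1 per category, averaged with equal class weight.
ref_risk = ["low", "intermediate", "high", "low", "high"]
pred_risk = ["low", "intermediate", "low", "low", "high"]
macro_f1 = f1_score(ref_risk, pred_risk, average="macro")

# Agreement: rows = cases, columns = raters (e.g. 3 radiologists + 1 model),
# entries = assigned risk category (0 = low, 1 = intermediate, 2 = high).
ratings = np.array([
    [0, 0, 0, 0],
    [1, 1, 1, 2],
    [2, 2, 2, 2],
    [0, 1, 0, 0],
])
table, _ = aggregate_raters(ratings)  # per-case counts in each category
kappa = fleiss_kappa(table)           # chance-corrected agreement
```

Macro-averaging gives each risk category equal weight regardless of prevalence, which matters here because high-risk lesions are presumably rarer than low-risk ones in the report corpus.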

Topics

Journal Article
