Leveraging Fine-Tuned Large Language Models for Interpretable Pancreatic Cystic Lesion Feature Extraction and Risk Categorization.
Authors
Affiliations (8)
- Center for Data Science, New York University, 60 5th Ave, New York, NY 10011, United States.
- Department of Radiology, Columbia University Irving Medical Center, 722 West 168th Street, New York, NY 10032, United States.
- Department of Radiology, NYU Langone Health, 550 First Avenue, New York, NY 10016, United States.
- NYU Grossman School of Medicine, 550 First Ave, New York, NY 10016, United States.
- Department of Radiology, University of California San Diego School of Medicine, 9500 Gilman Dr, La Jolla, CA 92093, United States.
- Department of Mathematics, Courant Institute of Mathematical Sciences, New York University, 251 Mercer St, New York, NY 10012, United States.
- Department of Surgery, NYU Langone Health, 550 First Avenue, New York, NY 10016, United States.
- Department of Medicine, NYU Langone Health, 550 First Avenue, New York, NY 10016, United States.
Abstract
Background: Manual extraction of pancreatic cystic lesion (PCL) features from radiology reports is labor-intensive, limiting the large-scale studies needed to advance PCL research.

Objective: To evaluate GPT-4o (closed source), Llama (open source), and DeepSeek (open source) large language models (LLMs) for PCL feature extraction, with and without chain-of-thought (CoT) reasoning.

Methods: We curated a dataset of 6469 abdominal MRI or CT reports (2005-2024) from 5615 patients that described PCLs. Llama and DeepSeek were fine-tuned using Quantized Low-Rank Adaptation (QLoRA) on GPT-4o-generated CoT labels to extract PCL and main pancreatic duct features. Features were mapped to risk categories per institutional policy. Evaluation was performed on 285 held-out, human-annotated reports from 281 patients, and model outputs for 100 cases were independently reviewed by three radiologists. Feature extraction was evaluated with exact match accuracy, risk categorization with macro-averaged F1 score, and radiologist-model agreement with Fleiss' kappa. Error analyses were performed to characterize how and why models made mistakes.

Results: CoT fine-tuned LLMs showed feature extraction accuracies of 97% (95% CI, 97-98%) for Llama, 98% (95% CI, 97-98%) for DeepSeek, and 97% (95% CI, 97-98%) for GPT-4o. Risk categorization F1 scores were 0.93 (95% CI, 0.89-0.97) for Llama, 0.94 (95% CI, 0.90-0.98) for DeepSeek, and 0.97 (95% CI, 0.93-0.99) for GPT-4o. Radiologist interreader agreement was high (κ = 0.888) and did not change significantly with the addition of Llama (κ = 0.882, p > .99), DeepSeek (κ = 0.893, p > .99), or GPT-4o (κ = 0.897, p > .99). Across all models, object identification and clinical reasoning were the most frequent error types, accounting for 29.3-37.3% and 18.1-21.1% of total errors, respectively.

Conclusion: LLMs can feasibly extract PCL features from radiology reports automatically. Fine-tuned open-source LLMs achieved performance comparable to GPT-4o. CoT reasoning improved accuracy and enabled interpretable error analysis. Model-assigned risk categories showed high agreement with those of abdominal radiologists.

Clinical Impact: LLMs could enable the creation of large structured registries from existing radiology reports, supporting population-level research on PCLs.
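The QLoRA fine-tuning step named in the Methods can be sketched with the Hugging Face `transformers` and `peft` libraries. This is a minimal sketch only: the checkpoint name, adapter hyperparameters, and target modules below are illustrative assumptions, not the study's actual configuration.

```python
# Minimal QLoRA sketch: 4-bit quantized base model + trainable low-rank adapters.
# Checkpoint and hyperparameters are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical base checkpoint

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights (the "Q" in QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base, quantization_config=bnb, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,          # illustrative values
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)  # only the adapter weights are trained

# Supervised fine-tuning would then run on (report text -> CoT reasoning +
# structured feature) pairs, e.g. with a standard causal-LM trainer; omitted here.
```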
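The three reported metrics (exact match accuracy, macro-averaged F1, and Fleiss' kappa) can all be computed with standard libraries. The sketch below uses hypothetical toy data; the feature tuples, category labels, and rating counts are invented for illustration and do not reflect the study's dataset.

```python
# Toy computation of the abstract's three evaluation metrics.
import numpy as np
from sklearn.metrics import f1_score
from statsmodels.stats.inter_rater import fleiss_kappa

# Exact match accuracy: a prediction counts only if every extracted feature matches.
pred = [("2.1 cm", "body", "no"), ("3.4 cm", "head", "yes")]
gold = [("2.1 cm", "body", "no"), ("3.4 cm", "tail", "yes")]
exact_match = np.mean([p == g for p, g in zip(pred, gold)])  # 0.5 here

# Macro-averaged F1 for risk categories (e.g. 0=low, 1=intermediate, 2=high):
# F1 is computed per class, then averaged, so rare classes weigh equally.
y_true = [0, 1, 2, 2, 0, 1]
y_pred = [0, 1, 2, 1, 0, 1]
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Fleiss' kappa: rows = cases, columns = risk categories, entries = number of
# raters (radiologists, optionally plus a model) assigning that category.
counts = np.array([
    [3, 0, 0],   # all 3 raters chose category 0
    [0, 2, 1],   # split decision
    [0, 0, 3],
])
kappa = fleiss_kappa(counts)

print(exact_match, macro_f1, kappa)
```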