Extraction of distant recurrence sites for breast cancer patients from free-text clinical notes using large language models.
Authors
Affiliations (7)
- Department of Radiology, Mayo Clinic, Phoenix, AZ, USA.
- Department of Radiology, Mayo Clinic, Phoenix, AZ, USA.
- Departments of Medicine and of Epidemiology & Population Health, Stanford University School of Medicine, Palo Alto, CA, USA.
- Rollins School of Public Health, Emory University, Atlanta, GA, USA.
- Department of Internal Medicine, UC Davis School of Medicine, Sacramento, CA, USA.
- Department of Biomedical Data Science, Radiology, and Medicine, Stanford University School of Medicine, Palo Alto, CA, USA.
- Department of Radiology, Mayo Clinic, Phoenix, AZ, USA; School of Computing and Augmented Intelligence, Arizona State University, Tempe, AZ, USA.
Abstract
Accurate documentation of distant recurrence sites in breast cancer is essential for evaluating treatment effectiveness and for outcomes research. However, this information is embedded in unstructured clinical notes, making manual abstraction labor-intensive. Large language models (LLMs) offer a scalable solution for extracting complex information from heterogeneous clinical narratives; however, generic LLMs often lack the specialized clinical reasoning needed to interpret oncologic documentation accurately. This study aims to develop an efficient LLM-based framework that automatically extracts distant recurrence sites from free-text documentation. We used clinical notes, pathology reports, and radiology reports from recurrent breast cancer patients at Mayo Clinic (n = 766) for model development and evaluated generalizability on internal hold-out samples (n = 112) and an external Stanford Medicine cohort (n = 110). For cross-disease domain adaptation, we further validated the framework on prostate cancer patients (n = 49). Our proposed framework employs BioLinkBERT, a pretrained language model (PLM) backbone, with weak supervision and an epoch-wise entropy optimization to address limited labeled data and class imbalance across recurrence sites. The fine-tuned model was compared against state-of-the-art models, including Llama2-7B, Llama-3-8B and MedAlpaca, using precision, recall, and F1-score. The fine-tuned model outperformed generic and domain-specific LLM baselines, with notable gains in identifying multi-site distant recurrence. In-domain validation showed consistent F1-score improvement (average 0.78), particularly for rare recurrence sites. The model also demonstrated strong performance on the external Stanford cohort and on prostate cancer, achieving F1-scores of 0.83 and 0.93, respectively. This study presents an efficient, weakly supervised LLM framework that accurately extracts metastatic recurrence sites, reducing reliance on manual chart review.
The results demonstrate that relatively small LLMs, optimized with domain-aware weak supervision, can outperform larger models for complex oncologic information extraction. The model is released as a platform-independent Docker image to support seamless cancer registry integration.
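The abstract reports precision, recall, and F1-score for a task in which one patient can have several recurrence sites at once. As a minimal sketch of how such per-site metrics might be computed for multi-label output, the Python below scores each site independently and then averages the F1-scores. The site names, the toy gold and predicted labels, the function names `per_site_scores` and `macro_f1`, and the choice of an unweighted (macro) average are all illustrative assumptions; the abstract does not state how the reported averages were aggregated.

```python
from typing import Dict, List, Set


def per_site_scores(
    gold: List[Set[str]], pred: List[Set[str]], sites: List[str]
) -> Dict[str, Dict[str, float]]:
    """Compute precision, recall, and F1 for each site across all patients."""
    scores: Dict[str, Dict[str, float]] = {}
    for site in sites:
        # Count true positives, false positives, and false negatives for this site.
        tp = sum(1 for g, p in zip(gold, pred) if site in g and site in p)
        fp = sum(1 for g, p in zip(gold, pred) if site not in g and site in p)
        fn = sum(1 for g, p in zip(gold, pred) if site in g and site not in p)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[site] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def macro_f1(scores: Dict[str, Dict[str, float]]) -> float:
    """Unweighted mean of per-site F1 (an assumed aggregation, not the paper's)."""
    return sum(s["f1"] for s in scores.values()) / len(scores)


# Hypothetical site vocabulary and toy labels, one set of sites per patient.
SITES = ["bone", "liver", "lung", "brain"]
gold_labels = [{"bone"}, {"bone", "liver"}, {"lung"}, {"brain", "bone"}]
pred_labels = [{"bone"}, {"bone"}, {"lung", "liver"}, {"brain"}]

site_scores = per_site_scores(gold_labels, pred_labels, SITES)
for site, s in site_scores.items():
    print(f"{site}: P={s['precision']:.2f} R={s['recall']:.2f} F1={s['f1']:.2f}")
print(f"macro F1 = {macro_f1(site_scores):.2f}")
```

Scoring each site separately keeps rare sites visible: in the toy data a single missed and a single spurious liver prediction drive that site's F1 to zero, which an accuracy figure pooled over all sites would largely hide.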