Human-level information extraction from clinical reports with fine-tuned language models.
Affiliations (13)
- Electrical Engineering and Computer Sciences, UC Berkeley, 387 Soda Hall, Berkeley, CA, 94720, USA.
- Department of Bioengineering, University of Pennsylvania, 210 South 33rd St, Philadelphia, PA, 19104, USA.
- University of California San Francisco School of Medicine, 533 Parnassus Ave, San Francisco, CA, 94143, USA.
- University of California Riverside School of Medicine, 92521 Botanic Gardens Dr, Riverside, CA, 92507, USA.
- Division of Genetic Medicine, Department of Medicine, Vanderbilt University Medical Center, 2213 Garland Avenue, Nashville, TN, 37232, USA.
- Department of Medicine, Vanderbilt University Medical Center, 2213 Garland Avenue, Nashville, TN, 37232, USA.
- Vanderbilt Ingram Cancer Center, Vanderbilt University Medical Center, 2220 Pierce Ave, Nashville, TN, 37232, USA.
- Department of Urology, School of Medicine, University of California San Francisco (UCSF), 1825 4th St, Box 1695, San Francisco, CA, 94143, USA.
- Computational Precision Health, UC Berkeley and UCSF, 2177 Hearst Ave, Berkeley, CA, 94709, USA.
- Department of Radiology and Biomedical Imaging, School of Medicine, University of California San Francisco (UCSF), 505 Parnassus Ave, San Francisco, CA, 94143, USA.
- Electrical Engineering and Computer Sciences, UC Berkeley, 387 Soda Hall, Berkeley, CA, 94720, USA. [email protected].
- Computational Precision Health, UC Berkeley and UCSF, 2177 Hearst Ave, Berkeley, CA, 94709, USA. [email protected].
- Department of Statistics, UC Berkeley, 367 Evans Hall, Berkeley, CA, 94720, USA. [email protected].
Abstract
Extracting structured data from clinical notes remains a key bottleneck in clinical research. We hypothesized that, with minimal computational and annotation resources, open-source large language models (LLMs) could create high-quality research databases. We developed Strata, a low-code library for leveraging LLMs for data extraction from clinical reports. Trained researchers labeled four datasets from prostate MRI, breast pathology, kidney pathology, and bone marrow (MDS) pathology reports. Using Strata, we evaluated open-source LLMs, including instruction-tuned, medicine-specific, reasoning-based, and LoRA-finetuned LLMs, and compared them to zero-shot GPT-4 and a second human annotator. Our primary evaluation metric was exact match accuracy, which assesses whether all variables for a report were extracted correctly. LoRA-finetuned Llama-3.1 8B achieved non-inferior performance to the second human annotator across all four datasets, with an average exact match accuracy of 90.0 ± 1.7. Fine-tuned Llama-3.1 outperformed all other open-source models, including DeepSeek-R1-Distill-Llama and Llama-3-8B-UltraMedical, which obtained average exact match accuracies of 56.8 ± 29.0 and 39.1 ± 24.4, respectively. GPT-4 was non-inferior to the second human annotator on all datasets except kidney pathology. Small, open-source LLMs offer an accessible solution for curating local research databases: they achieve human-level accuracy while requiring only desktop-grade hardware and ≤ 100 training reports, and, unlike commercial LLMs, they can be locally hosted and version-controlled. Strata enables automated, human-level extraction of structured data from clinical notes using ≤ 100 training reports and a single desktop-grade GPU.
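To make the exact match criterion concrete, here is a minimal Python sketch: a report counts as correct only if every extracted variable matches the reference annotation. The field names and values are hypothetical illustrations, not drawn from the paper's datasets.

```python
# Minimal sketch of exact match accuracy as described in the abstract:
# a report is scored correct only if ALL of its extracted variables
# match the reference annotation. Field names below are hypothetical.

def exact_match_accuracy(predictions, references):
    """Fraction of reports whose extracted variables all match the reference."""
    assert len(predictions) == len(references)
    hits = sum(pred == ref for pred, ref in zip(predictions, references))
    return hits / len(references)

preds = [{"pirads_score": "4", "lesion_count": "2"},
         {"pirads_score": "3", "lesion_count": "1"}]
refs  = [{"pirads_score": "4", "lesion_count": "2"},
         {"pirads_score": "5", "lesion_count": "1"}]  # one field wrong in report 2
print(exact_match_accuracy(preds, refs))  # 0.5
```

Note that a single wrong field fails the whole report, which makes exact match a stricter (and arguably more clinically meaningful) bar than per-field accuracy.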
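For readers unfamiliar with LoRA fine-tuning on desktop-grade hardware, the following is a hedged sketch using the Hugging Face transformers and peft libraries. It is not Strata's API; the checkpoint name, 4-bit quantization, and LoRA hyperparameters are assumptions for illustration, not the paper's actual configuration.

```python
# Sketch of setting up LoRA fine-tuning for an 8B model on a single GPU.
# NOTE: checkpoint name, quantization choice, and LoRA hyperparameters are
# illustrative assumptions; the paper's exact configuration may differ.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B-Instruct"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # fits desktop-grade VRAM
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                                 # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # common attention-projection targets
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trained
```

Because only the low-rank adapter weights are updated, training memory and compute stay within what a single desktop GPU can handle, consistent with the hardware claim above.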