From BERT to generative AI - Comparing encoder-only vs. large language models in a cohort of lung cancer patients for named entity recognition in unstructured medical reports.
Authors
Affiliations (8)
- Institute for Artificial Intelligence in Medicine, University Hospital Essen, Essen, Germany; Central IT Department, Data Integration Center, University Hospital Essen, Essen, Germany.
- Institute for Transfusion Medicine, University Hospital Essen, Essen, Germany; Department of Computer Science, University of Applied Sciences and Arts Dortmund, Dortmund, Germany.
- Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS), Sankt Augustin, Germany.
- Institute for Artificial Intelligence in Medicine, University Hospital Essen, Essen, Germany.
- West German Cancer Center Essen, Department of Medical Oncology, University Hospital Essen, Essen, Germany.
- Institute for Artificial Intelligence in Medicine, University Hospital Essen, Essen, Germany; Institute of Diagnostic and Interventional Radiology and Neuroradiology, University Hospital Essen, Essen, Germany.
- Department of Computer Science, University of Applied Sciences and Arts Dortmund, Dortmund, Germany; Institute for Medical Informatics, Biometry and Epidemiology (IMIBE), University Hospital Essen, Essen, Germany.
- Institute for Artificial Intelligence in Medicine, University Hospital Essen, Essen, Germany; Institute of Diagnostic and Interventional Radiology and Neuroradiology, University Hospital Essen, Essen, Germany. Electronic address: [email protected].
Abstract
Extracting clinical entities from unstructured medical documents is critical for improving clinical decision support and documentation workflows. This study examines the performance of various encoder and decoder models trained for Named Entity Recognition (NER) of clinical parameters in pathology and radiology reports, assessing the applicability of Large Language Models (LLMs) for this task. Three NER methods were evaluated: (1) flat NER using transformer-based models, (2) nested NER with a multi-task learning setup, and (3) instruction-based NER using LLMs. A dataset of 2,013 pathology reports and 413 radiology reports, annotated by medical students, was used for training and testing. Encoder-based NER models (flat and nested) outperformed LLM-based approaches. The best-performing flat NER models achieved F1-scores of 0.87-0.88 on pathology reports and up to 0.78 on radiology reports, while nested NER models performed slightly worse. In contrast, multiple LLMs, despite achieving high precision, yielded substantially lower F1-scores (0.18 to 0.30) due to poor recall. A contributing factor appears to be that these LLMs extract fewer, but more accurate, entities, suggesting they become overly conservative when generating outputs. LLMs in their current form are unsuitable for comprehensive entity extraction tasks in the clinical domain, particularly when faced with a large number of entity types per document, though instructing them to return more entities in subsequent refinement steps may improve recall. Additionally, their higher computational overhead does not yield proportional performance gains. Encoder-based NER models, particularly those pre-trained on biomedical data, remain the preferred choice for extracting information from unstructured medical documents.
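As a rough illustration of the flat NER setup described above, the sketch below runs a Hugging Face token-classification pipeline over a synthetic radiology sentence. The checkpoint, entity labels, and report text are illustrative assumptions, not the biomedical models or clinical data used in the study.

```python
# Minimal sketch of flat NER with an encoder-only transformer.
# The checkpoint and example report are illustrative assumptions;
# a biomedical checkpoint (as the study recommends) would replace
# this general-purpose English NER model in practice.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",    # hypothetical stand-in checkpoint
    aggregation_strategy="simple",  # merge sub-word tokens into entity spans
)

report = "CT of the chest shows a 2.3 cm spiculated nodule in the right upper lobe."
for entity in ner(report):
    print(entity["entity_group"], entity["word"], f"{entity['score']:.2f}")
```

By contrast, the instruction-based variant replaces the token-classification head with a free-text prompt asking a generative model to list the entities it finds, a setup consistent with the precision/recall pattern reported above: the model emits only entities it is confident about and omits the rest.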