Application of generative artificial intelligence to utilize unstructured clinical data for acceleration of inflammatory bowel disease research.
Authors
Affiliations (11)
- Department of Human Genetics and Genomic Medicine, University of Southampton, Southampton SO16 6YD, UK; National Institute for Health Research (NIHR) Southampton Biomedical Research Centre, Southampton SO16 6YD, UK. Electronic address: [email protected].
- Department of Human Genetics and Genomic Medicine, University of Southampton, Southampton SO16 6YD, UK; Department of Paediatric Gastroenterology, Southampton Children's Hospital, Southampton SO16 6YD, UK.
- Department of Human Genetics and Genomic Medicine, University of Southampton, Southampton SO16 6YD, UK.
- Department of Paediatric Gastroenterology, Southampton Children's Hospital, Southampton SO16 6YD, UK.
- Clinical Informatics Research Unit, University Hospital Southampton NHS Trust, Southampton SO16 6YD, UK; Southampton Emerging Therapies and Technologies (SETT) Centre, University Hospital Southampton NHS Trust, Southampton SO16 6YD, UK.
- Clinical Informatics Research Unit, University Hospital Southampton NHS Trust, Southampton SO16 6YD, UK.
- Department of Histopathology, University Hospital Southampton NHS Trust, Southampton SO16 6YD, UK.
- Clinical Informatics Research Unit, University Hospital Southampton NHS Trust, Southampton SO16 6YD, UK; Southampton Emerging Therapies and Technologies (SETT) Centre, University Hospital Southampton NHS Trust, Southampton SO16 6YD, UK; Department of Gastroenterology, University Hospital Southampton NHS Trust, Southampton SO16 6YD, UK.
- Southampton Emerging Therapies and Technologies (SETT) Centre, University Hospital Southampton NHS Trust, Southampton SO16 6YD, UK; Department of Neurology, University Hospital Southampton NHS Trust, Southampton SO16 6YD, UK.
- Department of Human Genetics and Genomic Medicine, University of Southampton, Southampton SO16 6YD, UK; Department of Paediatric Gastroenterology, Southampton Children's Hospital, Southampton SO16 6YD, UK. Electronic address: [email protected].
- Department of Human Genetics and Genomic Medicine, University of Southampton, Southampton SO16 6YD, UK; National Institute for Health Research (NIHR) Southampton Biomedical Research Centre, Southampton SO16 6YD, UK. Electronic address: [email protected].
Abstract
Inflammatory bowel disease (IBD) research is a dynamic field. However, the growing volume of electronic health records (EHRs) and research data presents significant challenges. Traditional methods for structuring unstructured EHRs are labor-intensive and lack scalability. Large language models (LLMs) may offer a solution; however, their usefulness for data standardization in the context of IBD remains unknown. We sought to evaluate LLMs in structuring free-text histology and radiology reports from IBD patients (n = 32,041), compare their performance to manual clinician curation, and assess the usefulness of fine-tuning and retrieval-augmented generation (RAG). We developed an IBD-specialized LLM-based framework utilizing structured prompt engineering and fine-tuning. Free-text reports from two independent sites were manually curated and processed using various LLMs (n = 120). Overall, Llama 3.3 achieved the highest F1 scores for histology and imaging (1.00 ± 0 and 0.85 ± 0.29, respectively) in extracting findings and anatomical regions, surpassing other models in structured data generation. Fine-tuning improved the performance of the smaller Llama 3.1 8B model for imaging reports (0.70 ± 0.46 vs. 0.82 ± 0.35), enabling better extraction with reduced computational requirements. Our findings demonstrate the feasibility of LLM-based automated structuring of IBD-related medical records. Unstructured data from free-text reports can be reliably converted into standardized ontologies with location, severity, and qualifiers. These advancements enable scalable, privacy-compliant AI-driven solutions for data standardization.
Funding: The Institute for Life Sciences, University of Southampton; the NIHR Southampton BRC; and EPSRC (EP/Y01720X/1).
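The abstract reports F1 scores as mean ± standard deviation but does not describe the scoring procedure. As a minimal sketch of how such a comparison might be computed, assume each report is reduced to a set of (finding, anatomical region) pairs, once by a clinician and once by the model. The function names, the example pairs, and the convention that two empty sets score 1.0 are all illustrative assumptions, not the authors' method.

```python
from statistics import mean, stdev


def f1_score(curated, extracted):
    """F1 between clinician-curated and model-extracted (finding, region) pairs.

    Assumption: a pair counts as correct only on an exact match of both fields.
    """
    curated, extracted = set(curated), set(extracted)
    if not curated and not extracted:
        return 1.0  # assumed convention: nothing to find, nothing reported
    true_positives = len(curated & extracted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(extracted)
    recall = true_positives / len(curated)
    return 2 * precision * recall / (precision + recall)


def summarize(report_pairs):
    """Return (mean, sample standard deviation) of per-report F1 scores."""
    scores = [f1_score(curated, extracted) for curated, extracted in report_pairs]
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread


# Hypothetical example data, not taken from the study.
reports = [
    (
        {("ulceration", "terminal ileum"), ("cryptitis", "sigmoid colon")},
        {("ulceration", "terminal ileum"), ("cryptitis", "sigmoid colon")},
    ),
    (
        {("wall thickening", "terminal ileum"), ("stricture", "terminal ileum")},
        {("wall thickening", "terminal ileum")},
    ),
]

avg, sd = summarize(reports)
print(f"F1 = {avg:.2f} ± {sd:.2f}")
```

Scoring per report and then averaging gives each report equal weight, which is one way a result such as 0.85 ± 0.29 could arise; pooling all pairs across reports before scoring would yield a single value with no spread.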