Natural language processing algorithm accurately classifies diverticulitis-related complications and predicts long-term outcomes.

March 23, 2026

DOI: 10.1016/j.cgh.2026.03.009 PMID: 41881290

Authors

Ma W, Wu Y, Challa PK, Sikavi D, Downie JM, Nguyen LH, Raghu VK, Simon TG, Khalili H, Kambadakone AR, Ananthakrishnan AN, Strate LL, Chan AT

Affiliations (4)

Clinical and Translational Epidemiology Unit and Division of Gastroenterology, Massachusetts General Hospital and Harvard Medical School, Boston, MA.
Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, MA.
Division of Gastroenterology and Hepatology, University of Wisconsin-Madison School of Medicine and Public Health, Madison, WI.
Clinical and Translational Epidemiology Unit and Division of Gastroenterology, Massachusetts General Hospital and Harvard Medical School, Boston, MA; Channing Division of Network Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA; Broad Institute of MIT and Harvard, Cambridge, Boston, MA; Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Boston, MA. Electronic address: [email protected].

Abstract

Diagnostic codes lack the precision to identify specific complications of diverticulitis, limiting their utility in large-scale, real-world data. We developed a natural language processing (NLP) algorithm to classify diverticulitis and associated features using computed tomography (CT) reports. Using data from the Mass General Brigham Research Patient Data Registry (1979-2024), we identified patients with a diagnosis code for diverticular disease (ICD-9: 562; ICD-10: K57) and a prior abdominopelvic CT report. We developed and validated our NLP algorithm to detect diverticulitis and associated features. We subsequently investigated the associations between NLP-defined severity at first diagnosis (i.e., uncomplicated, mild, severe, or chronic complications) and risk of severe diverticulitis recurrence using a Cox proportional hazards regression model. We assessed the predictive value of NLP-detected features using random forest models. The NLP algorithm achieved positive and negative predictive values of 82.8% to 99.9%, outperforming both ICD codes and a generalist large language model. Among 16,349 patients with NLP-detected diverticulitis, 3,192 developed severe recurrence over 76,736 person-years. Compared to uncomplicated diverticulitis, the multivariable-adjusted hazard ratio (HR) for severe recurrence was 1.39 (95% confidence interval [CI]: 1.14-1.69) for mild complications, 3.02 (95% CI: 2.80-3.27) for severe complications, and 5.41 (95% CI: 4.78-6.13) for chronic complications. NLP-detected features significantly improved the prediction of severe diverticulitis recurrence compared to codified variables. Our NLP algorithm accurately classifies diverticulitis features, facilitating the construction of large and high-quality EHR-based cohorts. Severity at initial diagnosis predicts risk of severe recurrence, supporting the use of artificial intelligence for risk stratification and long-term management.
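
To put the reported figures in context, the following is a minimal Python sketch that works only from the numbers quoted in the abstract. It derives the crude (unadjusted, all severity groups pooled) incidence of severe recurrence per 1,000 person-years, and back-calculates the approximate standard error of each log hazard ratio from its 95% CI. The helper names (`log_hr_standard_error`, `geometric_midpoint`) are illustrative and not taken from the paper, and the back-calculation assumes the intervals are Wald intervals symmetric on the log scale, which is typical for Cox models but is not stated in the abstract.

```python
import math

# Counts quoted in the abstract.
events = 3192          # patients who developed severe recurrence
person_years = 76736   # total follow-up time

# Crude incidence rate; this pools all severity groups and is not adjusted.
rate_per_1000 = 1000 * events / person_years

# (HR, lower 95% bound, upper 95% bound) versus uncomplicated diverticulitis.
hazard_ratios = {
    "mild": (1.39, 1.14, 1.69),
    "severe": (3.02, 2.80, 3.27),
    "chronic": (5.41, 4.78, 6.13),
}


def log_hr_standard_error(lower, upper, z=1.96):
    """Approximate SE of log(HR), assuming a Wald interval on the log scale."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


def geometric_midpoint(lower, upper):
    """Point estimate implied by a CI that is symmetric on the log scale."""
    return math.sqrt(lower * upper)


print(f"Crude rate: {rate_per_1000:.1f} per 1,000 person-years")
for group, (hr, lower, upper) in hazard_ratios.items():
    se = log_hr_standard_error(lower, upper)
    mid = geometric_midpoint(lower, upper)
    print(f"{group}: HR={hr}, implied midpoint={mid:.2f}, SE(log HR)={se:.3f}")
```

The implied midpoints fall within rounding distance of the reported hazard ratios, which is consistent with the log-scale symmetry assumption. The crude rate is only a summary of overall event frequency and should not be compared directly with the adjusted hazard ratios.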

Topics

Journal Article
