Natural language processing algorithm accurately classifies diverticulitis-related complications and predicts long-term outcomes.
Authors
Affiliations (4)
- Clinical and Translational Epidemiology Unit and Division of Gastroenterology, Massachusetts General Hospital and Harvard Medical School, Boston, MA.
- Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, MA.
- Division of Gastroenterology and Hepatology, University of Wisconsin-Madison School of Medicine and Public Health, Madison, WI.
- Clinical and Translational Epidemiology Unit and Division of Gastroenterology, Massachusetts General Hospital and Harvard Medical School, Boston, MA; Channing Division of Network Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA; Broad Institute of MIT and Harvard, Cambridge, Boston, MA; Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Boston, MA.
Abstract
Diagnostic codes lack the precision to identify specific complications of diverticulitis, which limits their utility in large-scale, real-world data. We developed and validated a natural language processing (NLP) algorithm that detects diverticulitis and its associated features in computed tomography (CT) reports. Using data from the Mass General Brigham Research Patient Data Registry (1979-2024), we identified patients with a diagnosis code for diverticular disease (ICD-9: 562; ICD-10: K57) and a prior abdominopelvic CT report. We then used a Cox proportional hazards regression model to examine the association between NLP-defined severity at first diagnosis (i.e., uncomplicated disease or mild, severe, or chronic complications) and the risk of severe diverticulitis recurrence, and we assessed the predictive value of NLP-detected features using random forest models. The NLP algorithm achieved positive and negative predictive values ranging from 82.8% to 99.9%, outperforming both ICD codes and a generalist large language model. Among 16,349 patients with NLP-detected diverticulitis, 3,192 developed severe recurrence over 76,736 person-years. Compared with uncomplicated diverticulitis, the multivariable-adjusted hazard ratio (HR) for severe recurrence was 1.39 (95% confidence interval [CI]: 1.14-1.69) for mild complications, 3.02 (95% CI: 2.80-3.27) for severe complications, and 5.41 (95% CI: 4.78-6.13) for chronic complications. NLP-detected features significantly improved the prediction of severe diverticulitis recurrence compared with codified variables. Our NLP algorithm accurately classifies diverticulitis features, facilitating the construction of large, high-quality cohorts based on electronic health records (EHR). Severity at initial diagnosis predicts the risk of severe recurrence, supporting the use of artificial intelligence for risk stratification and long-term management.
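For readers less familiar with the metrics reported above, the following is a minimal sketch of how positive and negative predictive values are computed from validation counts, and how a crude event rate follows from the reported recurrence count and person-years. The confusion-matrix counts and the function names are hypothetical and chosen only for illustration; they are not taken from the study. Only the 3,192 events and 76,736 person-years come from the abstract, and the crude rate is an unadjusted figure that the abstract itself does not report.

```python
def predictive_values(tp, fp, tn, fn):
    """Return (PPV, NPV) from confusion-matrix counts."""
    ppv = tp / (tp + fp)  # share of algorithm-positive reports that are true positives
    npv = tn / (tn + fn)  # share of algorithm-negative reports that are true negatives
    return ppv, npv


def crude_rate_per_1000(events, person_years):
    """Unadjusted event rate per 1,000 person-years of follow-up."""
    return 1000 * events / person_years


# Hypothetical validation counts (NOT from the study), for illustration only.
ppv, npv = predictive_values(tp=90, fp=10, tn=180, fn=20)

# Events and follow-up time as reported in the abstract.
rate = crude_rate_per_1000(events=3192, person_years=76736)

print(f"PPV = {ppv:.1%}, NPV = {npv:.1%}")
print(f"Crude severe-recurrence rate = {rate:.1f} per 1,000 person-years")
```

The crude rate pools all severity groups, so it should not be compared directly with the multivariable-adjusted hazard ratios, which contrast groups against uncomplicated diverticulitis.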