Performance of Open-Source LLMs in Identifying Pediatric Pneumonia From Free-Text Chest Radiograph Reports.
Authors
Affiliations (5)
- Department of Preventive Medicine, Division of Biostatistics and Informatics, Northwestern University Feinberg School of Medicine, Chicago, IL.
- Division of Emergency Medicine, Ann and Robert H. Lurie Children's Hospital of Chicago, Chicago, IL.
- Stanley Manne Children's Research Institute, Ann and Robert H. Lurie Children's Hospital of Chicago, Chicago, IL.
- Department of Health Administration and Policy, George Mason University, Fairfax, VA.
- Department of Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL.
Abstract
We aimed to develop and internally validate an automated system for classifying chest radiograph (CXR) reports for community-acquired pneumonia in children. We performed a retrospective single-center study using 1000 pediatric emergency department (ED) encounters (2016 to 2022) with CXR. Two physicians adjudicated each report as positive, negative, or indeterminate for pneumonia. We evaluated five open-source large language models (LLMs: Gemma2 9B, Gemma2 27B, Falcon3 7B, DeepSeek R1 Distill Llama 8B, and Llama3.1 8B) using a 70/30 train-test split, with pneumonia as the outcome. We reported performance metrics for both three-class and binary classification (pneumonia + indeterminate vs. no pneumonia). The median patient age was 4.2 years (IQR 1.7 to 10.5), and 54.4% were admitted from the ED. After clinician adjudication, 27.8% of reports were labeled pneumonia, 13.7% indeterminate, and 58.5% no pneumonia. Gemma2 9B achieved the best performance overall, with a pneumonia F1 score of 0.82 and a no-pneumonia F1 score of 0.97 in three-class classification. Binary classification further improved performance (F1=0.97 for Gemma2 9B and 0.93 for Gemma2 27B). Discrepancies between model and human labels often involved ambiguous language, highlighting interpretive subjectivity rather than model error. All LLMs substantially outperformed traditional NLP classifiers such as XGBoost, random forest, and logistic regression. Open-source LLMs accurately classified pediatric CXR reports for pneumonia. These findings support the feasibility of integrating LLMs into decision support and quality improvement pipelines to enhance radiographic interpretation and improve pediatric emergency care.
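The relationship between the three-class and binary metrics reported above can be illustrated with a minimal sketch. It assumes per-class F1 is computed one-vs-rest and that the binary task merges pneumonia and indeterminate into a single positive class, as the abstract describes. The function names (`f1_per_class`, `to_binary`), the label strings, and the toy labels are hypothetical and are not taken from the study.

```python
# Hypothetical label names; the study's actual label encoding is not given.
LABELS = ("pneumonia", "indeterminate", "no pneumonia")


def f1_per_class(y_true, y_pred, labels=LABELS):
    """Return {label: F1}, computed one-vs-rest for each label."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the class never appears.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def to_binary(labels):
    """Collapse pneumonia + indeterminate into one positive class."""
    return [
        "no pneumonia" if label == "no pneumonia" else "pneumonia or indeterminate"
        for label in labels
    ]


# Toy labels invented for illustration only (not study data).
y_true = ["pneumonia", "indeterminate", "no pneumonia",
          "no pneumonia", "pneumonia", "indeterminate"]
y_pred = ["pneumonia", "pneumonia", "no pneumonia",
          "no pneumonia", "indeterminate", "indeterminate"]

three_class = f1_per_class(y_true, y_pred)
binary = f1_per_class(
    to_binary(y_true),
    to_binary(y_pred),
    labels=("pneumonia or indeterminate", "no pneumonia"),
)

print(three_class)
print(binary)
```

In this toy example the only errors are confusions between pneumonia and indeterminate, so they lower the three-class F1 scores but disappear once the two classes are merged. This is one plausible reason binary performance can exceed three-class performance.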