Multimodal GPT-5 for Predicting Poor Functional Outcomes After Intracerebral Hemorrhage in the Emergency Department: Validation Study.

May 27, 2026

Authors

Matsumoto K, Ishihara K, Tamba R, Fujiyoshi Y, Tokunaga K, Matsuda K, Nohara Y, Chen J, Yamashiro S, Nakashima N, Kamouchi M

Affiliations (12)

  • Department of Health Care Administration and Management, Graduate School of Medical Sciences, Kyushu University, 3-1-1 Maidashi, Higashi-ku, Fukuoka-shi, Fukuoka, 812-8582, Japan, 81 0926426960.
  • Institute for Medical Information Research and Analysis, Saiseikai Kumamoto Hospital, Kumamoto, Japan.
  • Department of Computer Science and Information Engineering, Chang Gung University, Taoyuan, Taiwan.
  • Graduate Degree Program of Applied Data Sciences, Sophia University, Tokyo, Japan.
  • Joint Graduate School of Mathematics for Innovation, Kyushu University, Fukuoka, Japan.
  • Department of Pharmacy, Saiseikai Kumamoto Hospital, Kumamoto, Japan.
  • Department of Radiology, Saiseikai Kumamoto Hospital, Kumamoto, Japan.
  • Big Data Science and Technology, Faculty of Advanced Science and Technology, Kumamoto University, Kumamoto, Japan.
  • Division of Neurosurgery, Saiseikai Kumamoto Hospital, Japan.
  • Medical Information Center, Kyushu University Hospital, Fukuoka, Japan.
  • Department of Medical Informatics, Graduate School of Medical Sciences, Kyushu University, Fukuoka, Japan.
  • Center for Cohort Studies, Graduate School of Medical Sciences, Kyushu University, Fukuoka, Japan.

Abstract

In the emergency department, rapid prognostic assessment of patients with intracerebral hemorrhage (ICH) is essential for guiding early management decisions, particularly when stroke specialists are not immediately available. Recent advances in large language models have generated interest in their potential utility as clinical decision-support tools. This study aimed to evaluate the predictive performance and potential clinical utility of GPT (OpenAI)-based models for poor functional outcomes after ICH using real-world multimodal data routinely available at emergency department presentation.

We analyzed data from patients with ICH admitted to a tertiary hospital. Using routinely collected clinical data and noncontrast computed tomography (CT) images at admission, GPT-4.1 (OpenAI) and GPT-5 (OpenAI), accessed via the Azure OpenAI Service, were applied to predict poor functional outcomes, defined as a modified Rankin Scale score of 3-6 at discharge. A conventional machine learning (ML) model was developed by combining deep learning-extracted imaging features from Digital Imaging and Communications in Medicine CT data with clinical variables using L1-regularized logistic regression. GPT-based models were evaluated using the same clinical dataset and JPEG-format CT images. Model performance was assessed in terms of discrimination (area under the receiver operating characteristic curve [AUROC]), overall performance (scaled Brier score and Nagelkerke R²), calibration, reproducibility (intraclass correlation coefficient [ICC]), and clinical utility (decision curve analysis).

The ML model achieved an AUROC of 0.85 (95% CI 0.79-0.90), a scaled Brier score of 0.23 (95% CI 0.06-0.36), and a Nagelkerke R² of 0.35 (95% CI 0.18-0.48). Zero-shot GPT-4.1 and GPT-5 demonstrated discrimination comparable to the ML model (AUROC 0.84, 95% CI 0.77-0.91 and 0.85, 95% CI 0.78-0.91, respectively) with high reproducibility (ICC 0.91 and 0.95, respectively) but inferior overall performance, as reflected by lower scaled Brier scores and low or negative Nagelkerke R² values. Incorporating ML-derived information into the prompts modestly improved discrimination (AUROC 0.84, 95% CI 0.78-0.90 and 0.87, 95% CI 0.81-0.92, respectively) and reproducibility (ICC 0.97 and 0.96, respectively). Calibration plots indicated that GPT-based models tended to underestimate predicted probabilities, although this bias was partially attenuated after model-informed prompting. Decision curve analysis indicated that GPT-based models provided net benefit only at higher threshold probabilities and did not demonstrate superior clinical utility compared with the ML model.

Zero-shot GPT models achieved discriminatory performance comparable to a conventional ML model but showed limitations in calibration and overall predictive accuracy. Rather than replacing established prognostic ML models, GPT-based models may be better positioned as complementary interfaces that translate predictive outputs into clinically interpretable natural language to support decision-making.
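
The overall-performance and clinical-utility measures named in the abstract can be illustrated with a minimal Python sketch. It uses the standard textbook definitions of the scaled Brier score, Nagelkerke R², and decision-curve net benefit applied to fixed predicted probabilities. The function names and the toy data are hypothetical, and the study's actual computation (software, confidence intervals, any refitting) is not described here, so this should be read as an illustration under those assumptions rather than the authors' method.

```python
import math


def brier_score(y, p):
    """Mean squared difference between predicted probability and outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


def scaled_brier(y, p):
    """1 - Brier / Brier_ref, where the reference model always predicts
    the observed prevalence (its Brier score is prev * (1 - prev))."""
    prev = sum(y) / len(y)
    return 1.0 - brier_score(y, p) / (prev * (1.0 - prev))


def log_likelihood(y, p, eps=1e-12):
    """Bernoulli log-likelihood; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1.0 - eps)
        total += yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
    return total


def nagelkerke_r2(y, p):
    """Cox-Snell R^2 divided by its maximum attainable value.
    For fixed (not refitted) predictions this can be negative when the
    predictions fit worse than the prevalence-only null model."""
    n = len(y)
    prev = sum(y) / n
    ll_null = log_likelihood(y, [prev] * n)
    ll_model = log_likelihood(y, p)
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_r2 = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_r2


def net_benefit(y, p, threshold):
    """Decision-curve net benefit at one threshold probability:
    TP/n - FP/n * threshold / (1 - threshold)."""
    n = len(y)
    tp = sum(1 for yi, pi in zip(y, p) if pi >= threshold and yi == 1)
    fp = sum(1 for yi, pi in zip(y, p) if pi >= threshold and yi == 0)
    return tp / n - fp / n * threshold / (1.0 - threshold)


# Toy data (hypothetical): both prediction sets rank patients perfectly,
# but the second systematically underestimates risk.
y = [1, 1, 0, 0]
p_calibrated = [0.9, 0.8, 0.2, 0.1]
p_underestimating = [0.3, 0.2, 0.05, 0.02]

for name, p in [("calibrated", p_calibrated), ("underestimating", p_underestimating)]:
    print(name, scaled_brier(y, p), nagelkerke_r2(y, p), net_benefit(y, p, 0.5))
```

In the toy data both prediction sets order the patients identically, so their discrimination is the same, yet the set that underestimates risk yields a negative scaled Brier score and a negative Nagelkerke R². This is one way a model can show a good AUROC alongside poor overall performance, the pattern the abstract reports for the zero-shot GPT models.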

Topics

Journal Article
