
K-STAMM: a knowledge-enhanced spatial-temporal attention model with multimodal fusion for pneumonia prediction.

April 9, 2026

Authors

Anbukkarasi S, Hemalatha S, Balakrishnan A, Varadhaganapathy S, Easwaramoorthy SV

Affiliations (4)

  • Manipal Institute of Technology Bengaluru, Manipal Academy of Higher Education, Manipal, India. [email protected].
  • Kongu Engineering College, Erode, India.
  • Manipal Institute of Technology Bengaluru, Manipal Academy of Higher Education, Manipal, India.
  • School of Engineering and Technology, Sunway University, No. 5, Jalan Universiti, Bandar Sunway, 47500, Selangor Darul Ehsan, Malaysia.

Abstract

Accurate pneumonia prediction remains challenging because it requires integrating highly heterogeneous clinical data: longitudinal electronic health records (EHRs), medical imaging, clinical text, and domain knowledge. Most existing multimodal transformer-based models struggle with multimodal alignment, temporal irregularity, and limited incorporation of structured medical knowledge. To address these problems, we present K-STAMM, a knowledge-augmented spatiotemporal attention model with multimodal fusion. Unlike traditional methods, K-STAMM incorporates biomedical knowledge from the Unified Medical Language System (UMLS) through embedding-based representations, enabling semantically enriched feature learning. It further combines attention-based spatial modeling of structured EHR data, without explicit graph construction, with temporal sequence modeling to capture disease progression across irregular time intervals. A cross-modal fusion mechanism then harmonizes chest X-ray images, clinical text, and knowledge embeddings into a single, interpretable patient representation. Experimental results on the MIMIC-IV and MIMIC-CXR datasets show that K-STAMM surpasses strong unimodal and multimodal baselines, achieving an AUROC of 0.953, an AUPRC of 0.962, and an F1-score of 0.910. Ablation studies confirm the contributions of knowledge augmentation, temporal attention, and multimodal fusion. Overall, K-STAMM offers a scalable and interpretable framework for multimodal clinical prediction.
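The abstract does not specify the fusion layer's exact form, but the general idea it describes, attention over per-modality embeddings (image, text, knowledge) to build one patient vector, can be sketched as scaled dot-product attention with the EHR state as the query. Everything below (function names, a shared embedding dimension, using the modality embeddings as both keys and values) is an illustrative assumption, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_fusion(ehr, modalities):
    """Fuse modality embeddings into one patient representation via
    scaled dot-product attention, using the EHR vector as the query.

    ehr        -- (d,) query vector from the EHR encoder
    modalities -- list of (d,) vectors, e.g. [image_emb, text_emb, kg_emb]
    Returns the fused (d,) vector and the per-modality attention weights.
    """
    K = np.stack(modalities)                  # (m, d): keys (and values here)
    scores = K @ ehr / np.sqrt(ehr.shape[0])  # (m,) attention logits
    w = softmax(scores)                       # modality weights, sum to 1
    return w @ K, w                           # weighted sum over modalities

# Toy usage with random embeddings for three modalities
rng = np.random.default_rng(0)
ehr = rng.standard_normal(8)
fused, weights = cross_modal_fusion(ehr, [rng.standard_normal(8) for _ in range(3)])
```

In a trained model each modality would first pass through its own encoder and learned projection into the shared dimension; the attention weights then give one axis of interpretability, indicating how much each modality contributed to a given prediction.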

Topics

Journal Article
