Clinically integrated multi-modal transformer framework with cross-modal gated fusion and clinical nomogram for automated Kellgren-Lawrence grading of knee osteoarthritis on x-ray images.
Authors
Affiliations (3)
- College of Big Data and Software Engineering, Zhejiang Wanli University, No. 8 Qianhu South Road, Ningbo, Zhejiang, 315100, China.
- School of Informatics, The University of Edinburgh, Edinburgh, UK.
- College of Big Data and Software Engineering, Zhejiang Wanli University, No. 8 Qianhu South Road, Ningbo, Zhejiang, 315100, China.
Abstract
We developed a multi-modal Transformer framework integrating knee radiographs with clinical covariates to enable automated, objective, and generalizable ordinal Kellgren-Lawrence (KL) grading. A total of 2,703 anteroposterior knee radiographs were retrospectively collected from three independent medical centers (January 2018 to December 2024). Data from two centers (n = 1,953) were used for model development and internal five-fold stratified cross-validation, while the third center (n = 750) served as an independent external test set. The proposed framework combines a Swin Transformer-Base image encoder with a clinical feature Transformer through a novel Robust Cross-Modal Gated Fusion (RCGF) module that employs bidirectional cross-attention and uncertainty-aware dynamic gating via Monte-Carlo dropout. Ordinal prediction was performed using Consistent Rank Logits (CORAL). Eight classifier architectures were systematically compared, encompassing multi-modal models, unimodal image-only baselines, and a clinical-only model. On the independent external test set, the proposed RCGF framework achieved a Quadratic Weighted Kappa (QWK) of 0.900 (95% CI: 0.877-0.921), a macro-averaged AUC of 0.930 (95% CI: 0.910-0.950), and a balanced accuracy of 87.6%, significantly outperforming all baseline models, including BioViL-T (QWK = 0.850) and MedViT (QWK = 0.830; all FDR-corrected p < 0.001). Sensitivity for severe osteoarthritis (OA) (Grade 4) reached 83.5% (95% CI: 79.1-87.4%), with a specificity of 95.3%. The clinical nomogram demonstrated excellent calibration (calibration slope = 0.98, Brier score = 0.072, C-statistic = 0.940) and a higher net benefit than the treat-all and treat-none strategies across all clinically relevant decision thresholds. This multi-modal Transformer framework with uncertainty-aware gated fusion provides robust external generalizability for ordinal knee OA severity grading and delivers a clinically actionable nomogram.
The approach has strong potential to reduce radiologist workload and facilitate objective assessment on routine clinical radiographs, particularly in resource-constrained settings.
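To make the ordinal prediction and the headline agreement metric concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes the usual five-level KL scale (grades 0 to 4), a CORAL-style head that emits four threshold logits per image, and a 0.5 decision threshold. The function names `coral_decode` and `quadratic_weighted_kappa` are hypothetical and chosen for illustration only.

```python
import numpy as np


def coral_decode(logits, threshold=0.5):
    """Convert K-1 CORAL threshold logits into ordinal grades 0..K-1.

    Assumption: column k holds the logit for P(grade > k), so the
    predicted grade is the number of thresholds the sample passes.
    """
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    return (probs > threshold).sum(axis=1)


def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """Cohen's kappa with quadratic disagreement weights.

    Assumption: labels are integers in 0..n_classes-1 (KL grades 0-4).
    """
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)

    # Observed confusion matrix: rows are true grades, columns predictions.
    observed = np.zeros((n_classes, n_classes))
    np.add.at(observed, (y_true, y_pred), 1)

    # Quadratic penalty grows with the squared distance between grades.
    grades = np.arange(n_classes)
    weights = (grades[:, None] - grades[None, :]) ** 2 / (n_classes - 1) ** 2

    # Expected matrix under chance agreement, from the two marginals.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()


if __name__ == "__main__":
    # Toy logits for four images (illustrative values, not study data).
    toy_logits = np.array([
        [5.0, 5.0, 5.0, 5.0],
        [5.0, -5.0, -5.0, -5.0],
        [-5.0, -5.0, -5.0, -5.0],
        [5.0, 5.0, -5.0, -5.0],
    ])
    print(coral_decode(toy_logits))
    print(quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4]))
```

Because the quadratic weights penalize a prediction two grades off four times as heavily as one that is one grade off, QWK rewards models whose errors stay close to the true grade, which is why it suits an ordinal scale such as KL.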