RE-LIG: A Faithfulness-Driven Layer Integrated Gradients Framework for Explainable Medical Visual Question Answering.
Authors
Affiliations (3)
- Department of Software Engineering, Bandırma Onyedi Eylül University, Bandırma, Türkiye.
- Department of Software Engineering, Celal Bayar University, Manisa, Türkiye.
- Department of Computer Engineering, Fırat University, Elazig, Türkiye.
Abstract
Medical Visual Question Answering (Med-VQA) systems have the potential to support medical image interpretation and clinical decision-making. However, the "black-box" nature of existing systems and their low-resolution input constraints limit the transparency of model decisions and hinder clinical applicability. This work proposes a high-resolution, holistic framework called robust and efficient layer-integrated gradients (RE-LIG) to enhance reliability and explainability in Med-VQA systems. The proposed architecture is built upon three key components: (1) High-resolution visual encoding: the PubMedCLIP encoder is scaled to high resolution using dynamic positional embedding interpolation to capture fine details. (2) Multimodal semantic fusion: clinical questions encoded by BioLinkBERT and visual features extracted by PubMedCLIP are aligned through a co-attention mechanism. (3) Explainability framework: to counter the noisy nature of classical gradient methods, the RE-LIG algorithm, which combines noise tunneling and layer-based integration strategies, is integrated into the system. Extensive experiments on the SLAKE dataset show that the main benefit of the proposed framework is increased model faithfulness. Quantitative analyses show that RE-LIG achieves 28.9% higher explanation fidelity than standard gradient approaches (RE-LIG AOPC = 0.3180 vs. Vanilla IG = 0.2467; bootstrap 95% CI [0.262, 0.375]; Wilcoxon p < 0.001). This gain in explainability comes without compromising diagnostic performance, which remains competitive with state-of-the-art (SOTA) models (80.77% overall accuracy, 87.61% closed-ended, and 77.34% open-ended). Ablation studies confirm that the integrated noise reduction mechanisms shift the model's focus from background noise to actual pathological boundaries.
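The explainability component described above pairs integrated gradients with noise tunneling. The abstract does not give the RE-LIG algorithm itself, so the following is only a minimal sketch of the two generic ingredients on a toy function: a midpoint Riemann-sum approximation of integrated gradients, and averaging of attributions over Gaussian-perturbed copies of the input. It attributes to the inputs directly rather than to an intermediate layer, and every name in it (`toy_model`, `toy_grad`, `integrated_gradients`, `noise_tunnel_ig`, and the values of `steps`, `n_samples`, `sigma`) is a hypothetical choice for illustration, not taken from the paper.

```python
import random


def toy_model(x, w):
    """Toy scalar 'model' f(x) = sum_i w_i * x_i**2 (stand-in for a network output)."""
    return sum(wi * xi * xi for wi, xi in zip(w, x))


def toy_grad(x, w):
    """Analytic gradient of toy_model with respect to x."""
    return [2.0 * wi * xi for wi, xi in zip(w, x)]


def integrated_gradients(x, baseline, w, steps=50):
    """Approximate integrated gradients with a midpoint Riemann sum.

    attribution_i = (x_i - b_i) * mean over alpha of df/dx_i at b + alpha * (x - b)
    """
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = toy_grad(point, w)
        total = [t + g for t, g in zip(total, grad)]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


def noise_tunnel_ig(x, baseline, w, n_samples=64, sigma=0.1, seed=0):
    """Average integrated-gradients attributions over Gaussian-noised inputs."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    acc = [0.0] * len(x)
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        attr = integrated_gradients(noisy, baseline, w)
        acc = [a + v for a, v in zip(acc, attr)]
    return [a / n_samples for a in acc]


if __name__ == "__main__":
    x, w, baseline = [1.0, 2.0, -1.0], [0.5, 1.0, 2.0], [0.0, 0.0, 0.0]
    print(integrated_gradients(x, baseline, w))
    print(noise_tunnel_ig(x, baseline, w))
```

For this toy function the clean attributions sum to `toy_model(x, w) - toy_model(baseline, w)`, which is the completeness property of integrated gradients. In a real network the gradient would come from automatic differentiation, and a layer-based variant would integrate over a chosen layer's activations instead of the raw inputs.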
The findings demonstrate that explainability is not merely a visual aid for clinical confidence but a measurable and verifiable requirement.
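One way to make "measurable" concrete is the AOPC score reported in the abstract. The sketch below assumes a common deletion-style formulation: features are removed in order of decreasing attribution, and the resulting drops in the model score are averaged. The paper's exact perturbation protocol is not stated in the abstract, so the function names (`toy_score`, `aopc`, `relative_gain`), the replacement value, and the averaging convention are assumptions. The last helper simply checks the arithmetic behind the reported relative gain.

```python
def toy_score(x):
    """Toy model score: weighted sum of squares (stand-in for a class probability)."""
    weights = [0.5, 1.0, 2.0]
    return sum(w * xi * xi for w, xi in zip(weights, x))


def aopc(score_fn, x, attributions, baseline_value=0.0):
    """Deletion-style AOPC: mean score drop as top-attributed features are removed.

    Assumed convention: average over the steps k = 1..K, where step k has the
    k most highly attributed features replaced by baseline_value.
    """
    order = sorted(range(len(x)), key=lambda i: attributions[i], reverse=True)
    original = score_fn(x)
    perturbed = list(x)
    drops = []
    for i in order:
        perturbed[i] = baseline_value  # cumulative removal
        drops.append(original - score_fn(perturbed))
    return sum(drops) / len(drops)


def relative_gain(new, old):
    """Relative improvement of new over old, as a fraction."""
    return (new - old) / old


if __name__ == "__main__":
    x = [1.0, 2.0, -1.0]
    faithful = [0.5, 4.0, 2.0]        # ranks features by their true contribution
    unfaithful = [-0.5, -4.0, -2.0]   # reversed ranking
    print(aopc(toy_score, x, faithful), aopc(toy_score, x, unfaithful))
    # Figures from the abstract: RE-LIG AOPC = 0.3180, Vanilla IG AOPC = 0.2467
    print(round(100 * relative_gain(0.3180, 0.2467), 1))
```

An attribution that ranks the truly influential features first produces larger early drops and therefore a higher AOPC, which is why the score can serve as a faithfulness measure. Applying `relative_gain` to the two reported values gives roughly 28.9%, consistent with the figure quoted in the abstract.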