A Prompt-Guided Vision-Language Framework for Interpretable and Region-Aware Disease Diagnosis in Chest X-rays
Authors
Abstract
Effective interpretation of chest X-rays requires a tightly integrated process of visual analysis, diagnostic reasoning, and structured reporting. Yet most machine learning systems handle these steps in isolation: visual encoders are typically trained without diagnostic context, and language outputs often lack spatial grounding. To address this gap, we propose an interactive vision-language framework that supports prompt-guided reasoning over both textual and spatial queries, enabling region-aware, clinically aligned interpretations. The framework comprises three functional modules: Prompt-Guided Localization (PGL) for identifying relevant regions, Region-Level Diagnosis (RLD) for structured classification, and Region-Aware Explanation (RAE) for generating localized descriptions. These modules are unified through a regional alignment mechanism built on a multi-task Detection Transformer (DETR) backbone, which maps prompts and image regions into a shared semantic space. To train the system under limited supervision, we adopt a two-stage strategy: contrastive pretraining to establish cross-modal alignment, followed by multi-task fine-tuning for downstream tasks including disease classification and report generation. Experiments on the publicly available chest X-ray datasets MIMIC-CXR, VinDr-CXR, and MS-CXR demonstrate consistent gains over state-of-the-art methods. Module-wise ablations further validate the contribution of each component and highlight the framework's potential for transparent, clinically applicable diagnostic support.
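The abstract describes a regional alignment mechanism that projects prompts and DETR-style image regions into a shared semantic space, trained first with a contrastive objective. The sketch below illustrates one plausible form of that contrastive alignment step; it is not the authors' implementation, and the module names, feature dimensions, temperature value, and use of PyTorch are assumptions made for illustration only.

```python
# Minimal sketch (assumed, not the paper's code) of contrastive region-prompt
# alignment: region features and prompt embeddings are projected into a shared
# space and pulled together with a symmetric InfoNCE-style loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegionPromptAligner(nn.Module):
    """Projects DETR-style region queries and text-prompt embeddings into a
    shared semantic space (dimensions are illustrative assumptions)."""

    def __init__(self, region_dim=256, text_dim=768, shared_dim=256, temperature=0.07):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, shared_dim)  # hypothetical projection head
        self.text_proj = nn.Linear(text_dim, shared_dim)      # hypothetical projection head
        self.temperature = temperature

    def forward(self, region_feats, prompt_feats):
        # region_feats: (B, region_dim) pooled region queries from a detection backbone
        # prompt_feats: (B, text_dim) prompt embeddings from a text encoder
        z_img = F.normalize(self.region_proj(region_feats), dim=-1)
        z_txt = F.normalize(self.text_proj(prompt_feats), dim=-1)
        logits = z_img @ z_txt.t() / self.temperature  # (B, B) pairwise similarities
        targets = torch.arange(logits.size(0), device=logits.device)
        # Matched region/prompt pairs lie on the diagonal; penalize both directions.
        loss = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
        return loss


# Toy usage with random tensors standing in for encoder outputs.
aligner = RegionPromptAligner()
loss = aligner(torch.randn(8, 256), torch.randn(8, 768))
loss.backward()
```

In the two-stage strategy summarized above, an objective of this kind would supply the cross-modal alignment during pretraining, after which the shared projections could be reused by the PGL, RLD, and RAE heads during multi-task fine-tuning.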