Multimodal Learning with Privileged Report Supervision for Generalizable Tuberculosis Detection on Chest Radiographs.
Affiliations (2)
- Division of Intramural Research, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA.
- Division of Intramural Research, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA. [email protected].
Abstract
Multimodal learning using images and associated clinical text offers richer semantic supervision for medical AI. However, models trained with synthetic reports risk hallucination, and conventional multimodal tuberculosis (TB) systems are impractical because they require text at inference. In real-world screening workflows, particularly in low-resource settings or during triage, radiology reports are often unavailable or delayed. Computer-aided detection systems for chest X-rays (CXRs) are considered a potential solution. In this context, this study proposes a method that uses clinically grounded text as privileged information during training to improve a binary CXR classifier, while enabling image-only TB prediction at deployment. Frontal CXRs from Shenzhen (internal train/validation/test), Montgomery County, TBX11K, and NIAID TB Portals (external tests) were lung-cropped using a YOLOv8s detector and resized to 224 × 224. For Shenzhen, de-identified metadata and brief clinical notes were converted into structured reports encoding population type, TB status, laterality, lobar involvement, and adjunct findings; a parallel model used raw notes. A VGG-11 vision encoder and a frozen CXR-BERT text encoder were co-trained in a shared 256-dimensional space using image classification, cosine similarity, and supervised contrastive alignment losses. At inference, the text branch was removed, yielding an image-only classifier regularized through multimodal supervision. Multimodal training with report supervision consistently improved image-only predictions, with structured reports outperforming raw notes. Across internal and external cohorts, performance gains were reflected in higher balanced accuracy, Matthews correlation coefficient, and area under the curve. UMAP embeddings showed clearer class separation, and Grad-CAM maps demonstrated improved localization of TB-relevant lesions. The online version contains supplementary material available at 10.1007/s10916-026-02368-3.
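The combined training objective described in the abstract (image classification, cosine-similarity alignment between paired image/text embeddings, and a supervised contrastive loss in the shared 256-dimensional space) can be sketched as follows. This is a minimal PyTorch illustration assuming equal loss weighting and a binary logit head; the function name `alignment_losses`, the temperature value, and the weighting are assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def alignment_losses(img_emb, txt_emb, logits, labels, temperature=0.07):
    """Sketch of a combined objective (assumed equal weighting):
    - binary classification loss on the image branch,
    - cosine-similarity alignment between paired image/text embeddings,
    - supervised contrastive loss over the joint image+text batch.
    img_emb/txt_emb: (B, 256) projections; logits: (B,) image-branch
    outputs; labels: (B,) with 0/1 TB status."""
    # Image-branch classification loss (binary cross-entropy on logits).
    cls = F.binary_cross_entropy_with_logits(logits, labels.float())

    # Cosine alignment: pull each image embedding toward its paired report.
    cos = (1.0 - F.cosine_similarity(img_emb, txt_emb, dim=-1)).mean()

    # Supervised contrastive loss over L2-normalized embeddings:
    # samples sharing a TB label are positives, all others negatives.
    z = F.normalize(torch.cat([img_emb, txt_emb], dim=0), dim=-1)
    y = torch.cat([labels, labels], dim=0)
    sim = z @ z.t() / temperature
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (y.unsqueeze(0) == y.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))  # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(1).clamp(min=1)
    supcon = -(log_prob.masked_fill(~pos_mask, 0.0).sum(1) / pos_counts).mean()

    return cls + cos + supcon
```

In this formulation each image embedding always has at least one positive (its paired report), so the contrastive term is well defined for every anchor even when a batch is class-imbalanced.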