KeyCap3D: Keyword-Guided 3D Medical Image Captioning with Cross-Attention.
Authors
Affiliations (4)
- Faculty of Mathematics and Natural Science, Mulawarman University, Indonesia.
- Faculty of Engineering, Mulawarman University, Indonesia.
- Faculty of Medicine, Muhammadiyah University of East Kalimantan, Indonesia.
- Faculty of Computing and Informatics, Universiti Malaysia Sabah, Malaysia.
Abstract
This study presents a keyword-guided cross-attention framework for automated radiological report generation from 3D FLAIR MRI brain tumor images. The architecture uses M3D-CLIP as the image encoder. Hierarchical keywords are extracted with a fine-tuned KeyBERT and BioBERT semantic embeddings in a 768-dimensional space. Six cross-attention layers fuse visual features with clinical keywords across four hierarchical levels: abnormality type, lesion characteristics, anatomical location, and lateralization. A four-layer transformer decoder generates captions autoregressively. The BraTS2020 dataset, containing 369 glioma patients paired with TextBraTS radiological descriptions, was preprocessed with center-focused selection of 32 of the 155 slices and spatial interpolation to 256 × 256 resolution. Training on an NVIDIA RTX 3050 GPU for 15 epochs with the AdamW optimizer reduced the loss from 4.16 to 1.33. Evaluation on 20 test samples yielded a BLEU-1 of 0.5359, a BLEU-2 of 0.3969, and a ROUGE-L of 0.5051, with generated captions accurately capturing clinical information for decision support applications.
- Multi-modal fusion through keyword-guided cross-attention, integrating visual MRI features with hierarchical clinical terminology.
- Transformer-based autoregressive generation conditioned on enriched image-keyword representations.
- Comprehensive evaluation using BLEU and ROUGE metrics on the brain tumor caption generation task.
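The slice selection and keyword-guided fusion described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not say which modality supplies the queries, so the sketch assumes visual tokens act as queries and keyword embeddings as keys and values. The function names (`center_slices`, `cross_attention`), the single attention head, the residual connection, the toy in-plane size, the one-token-per-slice layout, and the random weights are all hypothetical choices made for illustration; only the 32-of-155 slice count, the 768-dimensional embedding space, and the four keyword levels come from the abstract.

```python
import numpy as np


def center_slices(volume, keep=32):
    """Keep the `keep` central slices along the last (slice) axis."""
    depth = volume.shape[-1]
    start = (depth - keep) // 2
    return volume[..., start:start + keep]


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(visual, keywords, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention (illustrative only).

    visual:   (n_vis, d) visual tokens, used as queries (assumption).
    keywords: (n_kw, d) keyword embeddings, used as keys and values (assumption).
    Returns the fused tokens and the attention weights.
    """
    q = visual @ w_q
    k = keywords @ w_k
    v = keywords @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    fused = visual + weights @ v  # residual connection (assumption)
    return fused, weights


rng = np.random.default_rng(0)
d = 768  # embedding size stated in the abstract

# Toy volume: small in-plane size for speed, 155 slices as in the abstract.
volume = rng.random((8, 8, 155))
selected = center_slices(volume, keep=32)

# Hypothetical token layout: one visual token per selected slice,
# one keyword embedding per hierarchical level (four levels).
visual_tokens = rng.standard_normal((selected.shape[-1], d))
keyword_tokens = rng.standard_normal((4, d))

# Random projection matrices stand in for learned weights.
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

fused, weights = cross_attention(visual_tokens, keyword_tokens, w_q, w_k, w_v)
```

Each row of `weights` is a distribution over the four keyword levels, so it shows how strongly a given visual token attends to, for example, the anatomical-location keyword. A trained model would stack several such layers with learned projections and multiple heads; the sketch shows only the data flow of one layer.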