Semi-supervised medical image captioning via anatomical collaborative evidence network.
Authors
Affiliations (2)
- College of Information Engineering, Sichuan Agricultural University, Ya'an, China.
- Department of Otolaryngology, Ya'an People's Hospital, Ya'an, China.
Abstract
Medical image captioning bridges visual perception and clinical language, but its development is limited by the high cost of detailed anatomical annotation and by the risk of hallucinations or overconfidence in ambiguous endoscopic images. We propose ACE-Net, an Anatomy Collaborative Evidence Network for semi-supervised medical image captioning. ACE-Net integrates evidential deep learning into the visual encoding stage through an evidence-driven soft-gating mechanism that quantifies epistemic uncertainty and suppresses unreliable visual noise. A triple-guided Mixture-of-Experts decoder further organizes clinical reasoning into semantic anchoring, visual evidencing, and spatial calibration. Spatial consistency alignment is imposed within a teacher-student co-training framework to promote stable anatomical attention patterns without pixel-level supervision. On a high-resolution otolaryngology endoscopy dataset, ACE-Net achieved a BLEU-4 score of 0.7511 and a ROUGE-L score of 0.8728, demonstrating strong text-generation performance and improved anatomical grounding under limited annotation. These results suggest that effective anatomical localization can be induced through evidence-constrained global supervision rather than expensive pixel-level annotations, providing a data-efficient and reliable paradigm for medical image captioning.
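The abstract does not specify how the evidence-driven soft gate is computed. As a rough illustration only, the sketch below uses the common subjective-logic formulation of evidential deep learning: non-negative evidence defines Dirichlet parameters, and the uncertainty mass is the number of classes divided by the Dirichlet strength. The function names `evidential_uncertainty` and `soft_gate`, the softplus evidence function, and the gate `1 - u` are all assumptions made for this example, not details taken from ACE-Net.

```python
import numpy as np

def evidential_uncertainty(logits):
    """Map raw per-region outputs of shape (N, K) to belief masses and uncertainty.

    Assumed formulation (not confirmed by the abstract):
    evidence e = softplus(logits), alpha = e + 1, S = sum(alpha), u = K / S.
    """
    evidence = np.logaddexp(0.0, logits)              # softplus keeps evidence >= 0
    alpha = evidence + 1.0                            # Dirichlet parameters
    strength = alpha.sum(axis=-1, keepdims=True)      # Dirichlet strength S
    belief = evidence / strength                      # per-class belief mass
    uncertainty = logits.shape[-1] / strength         # u in (0, 1], shape (N, 1)
    return belief, uncertainty

def soft_gate(features, uncertainty):
    """Scale each region's feature vector by (1 - u): a hypothetical soft gate."""
    return features * (1.0 - uncertainty)

# Hypothetical usage: 5 image regions, 3 evidence classes, 8-dim features.
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 3))
features = rng.normal(size=(5, 8))
belief, u = evidential_uncertainty(logits)
gated = soft_gate(features, u)
```

Under this formulation the belief masses and the uncertainty sum to one for every region, so regions with little evidence receive a gate value near zero and contribute little to downstream decoding.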