Semi-supervised medical image captioning via anatomical collaborative evidence network.

May 26, 2026 (PubMed)

Authors

Zhou S, Liu Q, Cai L, Lu K, Qiao L, Xu N, Wu Y, Xu Y, Li J

Affiliations (2)

  • College of Information Engineering, Sichuan Agricultural University, Ya'an, China.
  • Department of Otolaryngology, Ya'an People's Hospital, Ya'an, China.

Abstract

Medical image captioning bridges visual perception and clinical language, but its development is limited by the high cost of detailed anatomical annotation and by the risk of hallucinations or overconfidence in ambiguous endoscopic images. We propose ACE-Net, an Anatomy Collaborative Evidence Network for semi-supervised medical image captioning. ACE-Net integrates evidential deep learning into the visual encoding stage through an evidence-driven soft-gating mechanism that quantifies epistemic uncertainty and suppresses unreliable visual noise. A triple-guided Mixture-of-Experts decoder further organizes clinical reasoning into semantic anchoring, visual evidencing, and spatial calibration. Spatial consistency alignment is imposed within a teacher-student co-training framework to promote stable anatomical attention patterns without pixel-level supervision. On a high-resolution otolaryngology endoscopy dataset, ACE-Net achieved a BLEU-4 score of 0.7511 and a ROUGE-L score of 0.8728, demonstrating strong text-generation performance and improved anatomical grounding under limited annotation. These results suggest that effective anatomical localization can be induced through evidence-constrained global supervision rather than expensive pixel-level annotations, providing a data-efficient and reliable paradigm for medical image captioning.
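The abstract does not give the exact form of the evidence-driven soft gate, so the following is only a minimal sketch of how such a gate is commonly built in evidential deep learning. It assumes non-negative evidence obtained with a softplus, a Dirichlet parameterization alpha = evidence + 1, and the subjective-logic uncertainty u = K / S, where K is the number of classes and S is the sum of alpha. The function names, the per-region logits, and the choice of (1 - u) as the gate value are all hypothetical and are not taken from ACE-Net.

```python
import math


def softplus(x):
    # Numerically stable softplus, log(1 + exp(x)); the result is always > 0.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def evidential_uncertainty(logits):
    """Return the subjective-logic uncertainty u = K / S for one image region.

    Assumed parameterization (not confirmed by the abstract):
    evidence e_k = softplus(logit_k), alpha_k = e_k + 1, S = sum of alpha_k.
    Because every e_k > 0, S > K and therefore 0 < u < 1.
    """
    alphas = [softplus(z) + 1.0 for z in logits]
    return len(alphas) / sum(alphas)


def soft_gate(features, region_logits):
    """Scale each region's feature vector by (1 - u).

    Regions with little evidence (u close to 1) are damped toward zero;
    regions with strong evidence pass through almost unchanged.
    """
    gated = []
    for feat, logits in zip(features, region_logits):
        gate = 1.0 - evidential_uncertainty(logits)
        gated.append([gate * v for v in feat])
    return gated


# Hypothetical example: two regions, two feature channels, three classes.
features = [[1.0, 2.0], [1.0, 2.0]]
region_logits = [[10.0, 0.0, 0.0], [-10.0, -10.0, -10.0]]
gated_features = soft_gate(features, region_logits)
```

In this sketch the first region carries strong evidence for one class and keeps most of its feature magnitude, while the second region carries almost no evidence and is suppressed nearly to zero. A multiplicative gate of this kind is one simple way to realize "suppressing unreliable visual noise"; the published model may use a different gating function.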

Topics

Journal Article
