
MedFusionT5: Cross-Modal Attention Boosts Semantic Quality and Reduces Hallucinations in Dental AI.

March 1, 2026

Authors

Abdaoui H,Barbaria S,Dergaa I,Ceylan Hİ,Bragazzi NL,de Giorgio A,Salah RB,Rahmouni HB

Affiliations (6)

  • Laboratory of Biophysics and Medical Technologies, Higher Institute of Medical Technologies of Tunis (ISTMT), University of Tunis El Manar, Tunis, Tunisia.
  • High Institute of Sport and Physical Education of Ksar Said, University of Manouba, Manouba, Tunisia.
  • Physical Education of Sports Teaching Department, Faculty of Sports Sciences, Atatürk University, Erzurum, Türkiye. Electronic address: [email protected].
  • Department of Mathematics and Statistics, Laboratory for Industrial and Applied Mathematics (LIAM), York University, Toronto, Ontario, Canada; Department of Clinical Pharmacy, Saarland University, Saarbrücken, Germany. Electronic address: [email protected].
  • Artificial Engineering, Naples, Italy.
  • Laboratory of Biophysics and Medical Technologies, Higher Institute of Medical Technologies of Tunis (ISTMT), University of Tunis El Manar, Tunis, Tunisia; The Computer Science Research Centre, the University of the West of England, Bristol, UK.

Abstract

Automated dental report generation faces significant challenges in multimodal fusion, often resulting in suboptimal semantic quality and risks of hallucination, where AI generates clinically unsupported content. Current approaches that rely on simple feature concatenation or bidirectional attention mechanisms fail to effectively capture visual-textual relationships in medical imaging. This study aims to develop MedFusionT5, a unidirectional cross-modal alignment framework that (1) achieves superior clinical report quality through focused attention between visual patches and clinical text representations, and (2) ensures exceptional factual consistency by minimising hallucination rates.

We implemented a novel architecture that integrates a vision transformer (ViT) for patch-based visual feature extraction with Bio_ClinicalBERT for clinical text encoding. The core innovation is a unidirectional multihead attention alignment module that selectively maps textual embeddings to relevant visual patches before multimodal fusion. A T5-base decoder then generates diagnostic reports from the aligned representations. We evaluated performance on 700 dental panoramic radiographs using comprehensive metrics, including BLEU, ROUGE, CIDEr, clinical precision/recall, and specialised hallucination analysis, comparing against both concatenation and coattention baselines.

MedFusionT5 demonstrated superior performance across all evaluated metrics. Compared to the coattention baseline, CIDEr increased by 122% (5.65 vs 2.54) and by 320% over simple concatenation. BLEU-4 reached 0.865, outperforming both baselines, while maintaining the lowest hallucination rate at 2.42% (a 39% reduction vs coattention and 46% vs concatenation). The model achieved an optimal balance between precision (0.982) and recall (0.923), with 90% of reports exhibiting near-zero hallucination. Notably, MedFusionT5 showed consistent quality independent of report length (r = -0.022), unlike coattention's length-dependent performance (r = +0.795).

MedFusionT5 establishes a new state-of-the-art in automated dental report generation, demonstrating that unidirectional cross-modal alignment achieves superior semantic quality and clinical precision while minimising hallucinations. This work identifies unidirectional attention as the optimal alignment strategy for medical AI, providing a foundation for trustworthy clinical deployment where both accuracy and reliability are paramount.
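The core idea of the alignment module, as the abstract describes it, is that textual embeddings query visual patches in one direction only. A minimal single-head NumPy sketch of that text-to-patch attention is shown below; the function name `align_text_to_patches`, the random projection matrices, and all dimensions are illustrative assumptions, not the paper's actual multihead implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def align_text_to_patches(text_emb, patch_emb, d_k=64, seed=0):
    """Unidirectional alignment: text tokens query visual patches (never the reverse).

    Projection matrices are random placeholders; a trained model learns them.
    """
    rng = np.random.default_rng(seed)
    W_q = rng.standard_normal((text_emb.shape[1], d_k)) / np.sqrt(text_emb.shape[1])
    W_k = rng.standard_normal((patch_emb.shape[1], d_k)) / np.sqrt(patch_emb.shape[1])
    W_v = rng.standard_normal((patch_emb.shape[1], d_k)) / np.sqrt(patch_emb.shape[1])
    Q, K, V = text_emb @ W_q, patch_emb @ W_k, patch_emb @ W_v
    # Each text token distributes attention over all image patches
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)   # (tokens, patches)
    return weights @ V, weights  # visually grounded token representations

rng = np.random.default_rng(42)
text = rng.standard_normal((12, 768))      # 12 clinical-text tokens (BERT-sized)
patches = rng.standard_normal((196, 768))  # e.g. a 14x14 ViT patch grid
aligned, attn = align_text_to_patches(text, patches)
print(aligned.shape, attn.shape)  # (12, 64) (12, 196)
```

In the full architecture, the aligned representations would then be fused and fed to the T5-base decoder; the key design choice sketched here is that only text attends to image, avoiding the bidirectional coupling the authors identify as a source of hallucination.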
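The length-independence claim above is stated as a Pearson correlation between report length and quality score (r = -0.022 for MedFusionT5 vs r = +0.795 for coattention). For readers unfamiliar with the metric, a short sketch of how such a coefficient is computed, on made-up data, looks like this:

```python
import numpy as np

def pearson_r(x, y):
    # Pearson correlation: covariance normalised by both standard deviations
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Perfectly length-dependent quality would give r = 1.0
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0

# Hypothetical per-report (length, score) pairs; values near 0 indicate
# quality that does not track report length
lengths = [40, 55, 62, 70, 85, 90]
scores = [5.7, 5.6, 5.8, 5.5, 5.7, 5.6]
print(round(pearson_r(lengths, scores), 3))
```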

Topics

Journal Article
