Medical image local augmentation via text- and mask-guided diffusion model.
Authors
Affiliations (2)
- School of Artificial Intelligence and Big Data, Hefei University, Hefei, Anhui, China.
- Department of Radiation Oncology, the First Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, Zhejiang, China.
Abstract
Medical images are the core basis for precision diagnosis and treatment, yet their scarcity severely hampers the advancement of intelligent medical image analysis. Data augmentation is a key pathway to overcoming this data bottleneck. However, existing methods primarily focus on global image transformations and exhibit limited control over local regional details. To address the insufficient diversity of synthesized medical images, this paper proposes a text- and mask-guided local augmentation method for medical images (MILA-TMGDiff). The method first employs a pre-trained MedSAM model to segment target regions within input medical images, yielding precise masks. Next, semantically relevant, task-specific text prompts are designed for the different types of medical imaging data. Finally, the mask and text prompts are jointly input as local guidance conditions into a diffusion generative model. By applying controlled perturbations to the local noise distribution, the method achieves fine-grained generation control over specific anatomical regions, ultimately producing synthetic medical images with high visual realism and diversity. The method was evaluated in local augmentation experiments on X-ray, MRI, and CT images. Quantitative analysis shows that the local structural similarity of the generated images within the mask region changes significantly, with reductions of 97.9%, 103.3%, and 42.2% on chest X-ray, pelvic CT, and brain CT data, respectively. This result confirms that the proposed local feature enhancement mechanism can effectively modulate the distribution of structural features in the mask region while maintaining global texture consistency.
This provides a new technical pathway for controlled data augmentation in medical imaging, helping to advance the development of intelligent medical image analysis and laying the foundation for future research on fine-grained medical image generation.
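The evaluation idea described in the abstract, measuring structural similarity only inside the mask while the rest of the image stays unchanged, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names (`masked_ssim`, `perturb_inside_mask`, `relative_reduction`) are hypothetical, the SSIM is a single-window variant computed over the masked pixels rather than whatever windowed form the paper uses, the baseline for the reduction is assumed to be the similarity of an image with itself, and the Gaussian noise blend is only a toy stand-in for the text- and mask-guided diffusion model, not the MILA-TMGDiff method.

```python
import numpy as np


def masked_ssim(x, y, mask, data_range=1.0):
    """Single-window SSIM over the pixels where mask is True (illustrative definition)."""
    m = mask.astype(bool)
    a = x[m].astype(np.float64)
    b = y[m].astype(np.float64)
    # Standard SSIM stabilising constants.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


def perturb_inside_mask(image, mask, strength, rng):
    """Toy stand-in for mask-guided generation: mix Gaussian noise into the masked region only."""
    noise = rng.standard_normal(image.shape)
    noisy = np.sqrt(1.0 - strength) * image + np.sqrt(strength) * noise
    # Pixels outside the mask are copied from the input unchanged.
    return np.where(mask, noisy, image)


def relative_reduction(before, after):
    """Percentage drop from `before` to `after` (assumed definition of the reported reduction)."""
    return 100.0 * (before - after) / before


rng = np.random.default_rng(0)
image = rng.random((64, 64))          # placeholder for a normalised medical image
mask = np.zeros((64, 64), dtype=bool)
mask[16:48, 16:48] = True             # placeholder for a MedSAM-style segmentation mask

augmented = perturb_inside_mask(image, mask, strength=0.5, rng=rng)
ssim_inside = masked_ssim(image, augmented, mask)
ssim_outside = masked_ssim(image, augmented, ~mask)
print("SSIM inside mask: ", ssim_inside)
print("SSIM outside mask:", ssim_outside)
print("Reduction inside mask (%):", relative_reduction(1.0, ssim_inside))
```

Under these assumed definitions, the similarity outside the mask stays at 1 because those pixels are untouched, while the similarity inside the mask drops. SSIM can become negative when the two regions are anticorrelated, so a relative reduction above 100% (such as the 103.3% reported for pelvic CT) would be possible under this definition; whether the paper computes its figure this way is not stated in the abstract.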