Automated generation of structured breast ultrasound reports using BreastViT and ChatGPT.
Authors
Affiliations (3)
- Department of Ultrasound, Peking University Third Hospital, Beijing, 100191, China.
- Department of Radiology, Seoul National University Hospital, Seoul, Korea.
- Department of Ultrasound, Peking University Third Hospital, Beijing, 100191, China. [email protected].
Abstract
Breast cancer is the most common malignancy in women. Ultrasound plays a critical role in evaluating dense breasts, and BI-RADS provides a standardized framework for lesion assessment. However, conventional reports are prone to variability. Deep learning and large language models (LLMs) show promise for automated report generation. We propose a workflow that integrates deep learning with GPT-4o to produce structured breast ultrasound reports. We retrospectively collected 2,243 ultrasound images from 362 patients (BI-RADS 4B, 4C, or 5; 2019-2024). The proposed BreastViT model, a VisionEncoderDecoderModel initialized from the pretrained nlpconnect/vit-gpt2-image-captioning checkpoint, was compared against three baseline architectures: CNN-Transformer (R2Gen), CNN-Attention-LSTM, and CNN-RNN. The generated texts were then refined by GPT-4o for language optimization and terminology standardization. An external validation set (49 cases, Oct-Dec 2024) was used to compare three outputs: GPT-4o alone, BreastViT alone, and BreastViT + GPT-4o. On internal evaluation, BreastViT achieved a best BLEU score of 0.9187 and a loss of 0.1277, and GPT-4o refinement markedly improved fluency and structure. In external validation, GPT-4o alone produced natural language but was occasionally inconsistent with the images; BreastViT alone captured key findings but lacked structure; the combined approach yielded the best accuracy, completeness, and terminological consistency.
The external validation reports were scored in a blinded evaluation by three senior radiologists. The intraclass correlation coefficient (ICC) demonstrated excellent inter-rater reliability (ICC = 0.8808; 95% CI: 0.86-0.90). The combined BreastViT + GPT-4o model achieved the highest scores in all four categories: Clinical Accuracy (7.31 ± 0.94), Information Completeness (7.61 ± 0.66), Structural Integrity (8.03 ± 0.93), and Terminology Standardization (8.02 ± 0.98). These scores were significantly higher than those of the standalone models (all P < 0.05). The proposed BreastViT + GPT-4o workflow automatically generates clinically compliant structured breast ultrasound reports with improved readability and standardization. This approach can improve report consistency and efficiency, offering a promising pathway for clinical integration.
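
The abstract reports inter-rater reliability as an ICC but does not state which ICC form was used. As a rough illustration only, the sketch below computes ICC(2,1) (two-way random effects, absolute agreement, single rater), one common choice when the same raters score every report. The function name `icc_2_1` and the example scores are hypothetical and are not taken from the study.

```python
from typing import Sequence


def icc_2_1(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings[i][j] is the score that rater j gave to report i.
    Assumption: the study's ICC form is not stated; this is one common variant.
    """
    n = len(ratings)       # number of rated reports
    k = len(ratings[0])    # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a full n x k table with n >= 2 and k >= 2")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares (one observation per cell).
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-report mean square
    msc = ss_cols / (k - 1)               # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


if __name__ == "__main__":
    # Hypothetical scores: 5 reports, each rated by 3 raters (not study data).
    example_scores = [
        [7, 8, 7],
        [6, 6, 7],
        [9, 8, 9],
        [5, 6, 5],
        [8, 8, 8],
    ]
    print(f"ICC(2,1) = {icc_2_1(example_scores):.4f}")
```

The function expects a complete table in which every rater scores every report. It does not compute a confidence interval, and it divides by zero if all scores are identical.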