Structured Report Generation for Breast Cancer Imaging Based on Large Language Modeling: A Comparative Analysis of GPT-4 and DeepSeek.

Authors

Chen K, Hou X, Li X, Xu W, Yi H

Affiliations (4)

  • Department of Nuclear Medicine, Zhejiang Cancer Hospital, Hangzhou Institute of Medicine (HIM), Chinese Academy of Sciences, Hangzhou, Zhejiang 310022, China (K.C., H.Y.).
  • Department of Medical Imaging, Tianjin Children's Hospital (Tianjin University Children's Hospital), Tianjin, China (X.H.).
  • Department of Molecular Imaging and Nuclear Medicine, Tianjin Medical University Cancer Institute and Hospital, National Clinical Research Center for Cancer, Tianjin, China (X.L., W.X.).
  • Department of Nuclear Medicine, Zhejiang Cancer Hospital, Hangzhou Institute of Medicine (HIM), Chinese Academy of Sciences, Hangzhou, Zhejiang 310022, China (K.C., H.Y.). Electronic address: [email protected].

Abstract

The purpose of this study was to compare the performance of the GPT-4 and DeepSeek large language models in generating structured breast cancer multimodality imaging integrated reports from free-text radiology reports, including mammography, ultrasound, MRI, and PET/CT.

A retrospective analysis was conducted on 1358 free-text reports from 501 breast cancer patients across two institutions. The study design involved synthesizing multimodal imaging data into structured reports with three components: primary lesion characteristics, metastatic lesions, and TNM staging. Input prompts were standardized for both models, with GPT-4 using predesigned instructions and DeepSeek requiring manual input. Reports were evaluated on physician satisfaction (Likert scale), descriptive accuracy (lesion localization, size, SUV, and metastasis assessment), and TNM staging correctness according to NCCN guidelines. Statistical analysis included McNemar tests for binary outcomes and correlation analysis for multiclass comparisons, with a significance threshold of P < .05.

Physician satisfaction scores showed strong correlation between the models (r = 0.665 and 0.558; P < .001). Both models demonstrated high accuracy in data extraction and integration. Mean accuracy for primary lesion features was 91.7% for GPT-4 and 92.1% for DeepSeek, while feature synthesis accuracy was 93.4% for GPT-4 and 93.9% for DeepSeek. Metastatic lesion identification showed comparable overall accuracy: 93.5% for GPT-4 and 94.4% for DeepSeek. GPT-4 performed better in pleural lesion detection (94.9% accuracy vs 79.5% for DeepSeek), whereas DeepSeek achieved higher accuracy in mesenteric metastasis identification (87.5% vs 43.8% for GPT-4). TNM staging accuracy exceeded 92% for T-stage and 94% for M-stage, and N-stage accuracy improved beyond 90% when supplemented with physical examination data.

Both GPT-4 and DeepSeek effectively generate structured breast cancer imaging reports, with high accuracy in data mining, integration, and TNM staging. Integrating these models into clinical practice is expected to enhance report standardization and physician productivity.
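The binary accuracy comparisons above rely on the McNemar test for paired outcomes (each report is scored correct/incorrect by both models, so only discordant pairs are informative). As a minimal illustration of the statistic, the exact p-value can be computed directly from the two discordant counts; the counts used here are hypothetical, not data from this study.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact (binomial) McNemar test p-value for paired binary outcomes.

    b: cases where model A is correct and model B is wrong
    c: cases where model A is wrong and model B is correct
    Under the null of equal accuracy, b ~ Binomial(b + c, 0.5).
    """
    n = b + c
    k = min(b, c)
    # Two-sided exact p-value: double the tail probability, capped at 1
    p = 2 * sum(comb(n, i) for i in range(k + 1)) * 0.5 ** n
    return min(p, 1.0)

# Hypothetical example: of 8 reports where the models disagree,
# model A is correct in 7 and model B in 1
print(mcnemar_exact(7, 1))  # → 0.0703125
```

With only eight discordant pairs the exact binomial form is preferred over the chi-square approximation, which is why it is sketched here.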

Topics

Journal Article
