Automated Esophageal Cancer Staging From Free-Text Radiology Reports: Large Language Model Evaluation Study.
Affiliations (4)
- Information Center, Shanghai Chest Hospital, Shanghai Jiao Tong University School of Medicine, No 241, West Huaihai Road, Shanghai, 200030, China, 86 22200000.
- Department of Healthcare, INF Technology, Shanghai, China.
- School of Life Sciences, Shanghai University, Shanghai, China.
- Artificial Intelligence Innovation and Incubation Institute of Fudan University, Fudan University, Shanghai, China.
Abstract
Accurate staging of esophageal cancer is crucial for determining prognosis and guiding treatment strategies, but manual interpretation of radiology reports by clinicians is prone to variability, limiting staging accuracy. Recent advances in large language models (LLMs) have shown promise in medical applications, but their utility in esophageal cancer staging remains underexplored. This study aimed to compare the performance of 3 locally deployed LLMs (INF-72B, Qwen2.5-72B, and LLaMA3.1-70B) against clinicians in preoperative esophageal cancer staging using free-text radiology reports.

This retrospective study included 200 patients who underwent esophageal cancer surgery at Shanghai Chest Hospital from May to December 2024. The dataset consisted of 1134 Chinese free-text radiology reports, with postoperative pathological staging serving as the reference standard. Each of the 3 LLMs determined tumor classification (T1-T4), node classification (N0-N3), and overall stage (I-IV) under 3 prompting strategies: zero-shot, chain-of-thought, and a proposed interpretable reasoning (IR) method. The McNemar test and the Pearson chi-square test were used for comparisons.

INF-72B+IR achieved the highest overall staging accuracy, 61.5%, with an F1-score of 0.60, substantially exceeding the clinicians' accuracy of 39.5% and F1-score of 0.39 (all P<.001). Qwen2.5-72B+IR also outperformed clinicians, achieving an overall staging accuracy of 46% and an F1-score of 0.51 (P<.001). LLaMA3.1-70B showed no statistically significant difference in overall staging performance compared with clinicians (all P>.05).

This study demonstrates that LLMs, particularly when guided by the proposed IR strategy, can accurately and reliably perform esophageal cancer staging from free-text radiology reports.
This approach not only provides high-performance predictions but also offers a transparent and verifiable reasoning process, highlighting its potential as a valuable decision-support tool to augment human expertise in complex clinical diagnostic tasks.