Advancing radiology foundation models with reasoning through step-by-step verification from daily reports.
Affiliations
- Shanghai Jiao Tong University, Shanghai, China.
- Shanghai Artificial Intelligence Laboratory, Shanghai, China.
Abstract
Recent advances in reasoning-enhanced large language models (LLMs) and multimodal LLMs (MLLMs) have significantly improved performance on complex tasks, yet medical AI models often overlook the structured reasoning processes inherent in clinical practice. In this work, we propose a two-stage pipeline to train ChestX-Reasoner, a radiology diagnosis MLLM, using process supervision mined directly from clinical reports that reflects the step-by-step reasoning radiologists follow. To facilitate and evaluate reasoning capabilities, we introduce RadRBench-CXR, a comprehensive benchmark comprising 59K visual question answering (VQA) samples with 301K clinically validated reasoning steps, and RadRScore, a metric that evaluates reasoning factuality, completeness, and effectiveness. Here we show that ChestX-Reasoner achieves significant improvements of 16% and 8.5% in reasoning ability over leading medical and general-purpose models, respectively. Furthermore, it improves diagnostic accuracy by 3.3% to 24% over the baselines. These results demonstrate that incorporating explicit reasoning steps improves diagnostic outcomes and that integrating process supervision enhances the reliability and transparency of medical diagnosis.