Latest Papers on Radiology AI. Tags: Benchmark SOTA

MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks

Yinghao Zhu, Ziyi He, Haoran Hu, Xiaochen Zheng, Xichen Zhang, Zixiang Wang, Junyi Gao, Liantao Ma, Lequan Yu

•preprint•May 18 2025

The rapid advancement of Large Language Models (LLMs) has stimulated interest in multi-agent collaboration for addressing complex medical tasks. However, the practical advantages of multi-agent collaboration approaches remain insufficiently understood. Existing evaluations often lack generalizability, failing to cover diverse tasks reflective of real-world clinical practice, and frequently omit rigorous comparisons against both single-LLM-based and established conventional methods. To address this critical gap, we introduce MedAgentBoard, a comprehensive benchmark for the systematic evaluation of multi-agent collaboration, single-LLM, and conventional approaches. MedAgentBoard encompasses four diverse medical task categories: (1) medical (visual) question answering, (2) lay summary generation, (3) structured Electronic Health Record (EHR) predictive modeling, and (4) clinical workflow automation, across text, medical images, and structured EHR data. Our extensive experiments reveal a nuanced landscape: while multi-agent collaboration demonstrates benefits in specific scenarios, such as enhancing task completeness in clinical workflow automation, it does not consistently outperform advanced single LLMs (e.g., in textual medical QA) or, critically, specialized conventional methods that generally maintain better performance in tasks like medical VQA and EHR-based prediction. MedAgentBoard offers a vital resource and actionable insights, emphasizing the necessity of a task-specific, evidence-based approach to selecting and developing AI solutions in medicine. It underscores that the inherent complexity and overhead of multi-agent collaboration must be carefully weighed against tangible performance gains. All code, datasets, detailed prompts, and experimental results are open-sourced at https://medagentboard.netlify.app/.

Mixed Modality Classification Dataset Release In Silico Academic Lab Benchmark SOTA Open Dataset Open Code

Harnessing Artificial Intelligence for Accurate Diagnosis and Radiomics Analysis of Combined Pulmonary Fibrosis and Emphysema: Insights from a Multicenter Cohort Study

Zhang, S., Wang, H., Tang, H., Li, X., Wu, N.-W., Lang, Q., Li, B., Zhu, H., Chen, X., Chen, K., Xie, B., Zhou, A., Mo, C.

•preprint•May 18 2025

Combined Pulmonary Fibrosis and Emphysema (CPFE), formally recognized as a distinct pulmonary syndrome in 2022, is characterized by unique clinical features and pathogenesis that may lead to respiratory failure and death. However, the diagnosis of CPFE presents significant challenges that hinder effective treatment. Here, we assembled three-dimensional (3D) reconstruction data of the chest High-Resolution Computed Tomography (HRCT) of patients from multiple hospitals across different provinces in China, including Xiangya Hospital, West China Hospital, and Fujian Provincial Hospital. Using this dataset, we developed CPFENet, a deep learning-based diagnostic model for CPFE. It accurately differentiates CPFE from COPD, with performance comparable to that of professional radiologists. Additionally, we developed a CPFE score based on radiomic analysis of 3D CT images to quantify disease characteristics. Notably, female patients demonstrated significantly higher CPFE scores than males, suggesting potential sex-specific differences in CPFE. Overall, our study establishes the first diagnostic framework for CPFE, providing a diagnostic model and clinical indicators that enable accurate classification and characterization of the syndrome.

CT Classification Chest Retrospective Clinical In Silico Academic Lab Benchmark SOTA

Mutual Evidential Deep Learning for Medical Image Segmentation

Yuanpeng He, Yali Bi, Lijian Li, Chi-Man Pun, Wenpin Jiao, Zhi Jin

•preprint•May 18 2025

Existing semi-supervised medical segmentation co-learning frameworks have realized that model performance can be diminished by the biases in model recognition caused by low-quality pseudo-labels. Due to the averaging nature of their pseudo-label integration strategy, they fail to explore the reliability of pseudo-labels from different sources. In this paper, we propose a mutual evidential deep learning (MEDL) framework that offers a potentially viable solution for pseudo-label generation in semi-supervised learning from two perspectives. First, we introduce networks with different architectures to generate complementary evidence for unlabeled samples and adopt an improved class-aware evidential fusion to guide the confident synthesis of evidential predictions sourced from diverse architectural networks. Second, utilizing the uncertainty in the fused evidence, we design an asymptotic Fisher information-based evidential learning strategy. This strategy enables the model to initially focus on unlabeled samples with more reliable pseudo-labels, gradually shifting attention to samples with lower-quality pseudo-labels while avoiding over-penalization of mislabeled classes in high data uncertainty samples. Additionally, for labeled data, we continue to adopt an uncertainty-driven asymptotic learning strategy, gradually guiding the model to focus on challenging voxels. Extensive experiments on five mainstream datasets have demonstrated that MEDL achieves state-of-the-art performance.

Mixed Modality Segmentation Methodology In Silico Benchmark SOTA

A Robust Automated Segmentation Method for White Matter Hyperintensity of Vascular-origin.

He H, Jiang J, Peng S, He C, Sun T, Fan F, Song H, Sun D, Xu Z, Wu S, Lu D, Zhang J

•papers•May 17 2025

White matter hyperintensity (WMH) is a primary manifestation of small vessel disease (SVD), leading to vascular cognitive impairment and other disorders. Accurate WMH quantification is vital for diagnosis and prognosis, but current automatic segmentation methods often fall short, especially across different datasets. The aims of this study are to develop and validate a robust deep learning segmentation method for WMH of vascular-origin. In this study, we developed a transformer-based method for the automatic segmentation of vascular-origin WMH using both 3D T1 and 3D T2-FLAIR images. Our initial dataset comprised 126 participants with varying WMH burdens due to SVD, each with manually segmented WMH masks used for training and testing. External validation was performed on two independent datasets: the WMH Segmentation Challenge 2017 dataset (170 subjects) and an in-house vascular risk factor dataset (70 subjects), which included scans acquired on eight different MRI systems at field strengths of 1.5T, 3T, and 5T. This approach enabled a comprehensive assessment of the method's generalizability across diverse imaging conditions. We further compared our method against LGA, LPA, BIANCA, UBO-detector and TrUE-Net in optimized settings. Our method consistently outperformed others, achieving a median Dice coefficient of 0.78±0.09 in our primary dataset, 0.72±0.15 in the external dataset 1, and 0.72±0.14 in the external dataset 2. The relative volume errors were 0.15±0.14, 0.50±0.86, and 0.47±1.02, respectively. The true positive rates were 0.81±0.13, 0.92±0.09, and 0.92±0.12, while the false positive rates were 0.20±0.09, 0.40±0.18, and 0.40±0.19. None of the external validation datasets were used for model training; instead, they comprise previously unseen MRI scans acquired from different scanners and protocols. This setup closely reflects real-world clinical scenarios and further demonstrates the robustness and generalizability of our model across diverse MRI systems and acquisition settings. As such, the proposed method provides a reliable solution for WMH segmentation in large-scale cohort studies.

MRI Segmentation Neurological Retrospective Clinical In Silico Academic Lab Benchmark SOTA

CorBenchX: Large-Scale Chest X-Ray Error Dataset and Vision-Language Model Benchmark for Report Error Correction

Jing Zou, Qingqiu Li, Chenyu Lian, Lihao Liu, Xiaohan Yan, Shujun Wang, Jing Qin

•preprint•May 17 2025

AI-driven models have shown great promise in detecting errors in radiology reports, yet the field lacks a unified benchmark for rigorous evaluation of error detection and further correction. To address this gap, we introduce CorBenchX, a comprehensive suite for automated error detection and correction in chest X-ray reports, designed to advance AI-assisted quality control in clinical practice. We first synthesize a large-scale dataset of 26,326 chest X-ray error reports by injecting clinically common errors via prompting DeepSeek-R1, with each corrupted report paired with its original text, error type, and human-readable description. Leveraging this dataset, we benchmark both open- and closed-source vision-language models,(e.g., InternVL, Qwen-VL, GPT-4o, o4-mini, and Claude-3.7) for error detection and correction under zero-shot prompting. Among these models, o4-mini achieves the best performance, with 50.6 % detection accuracy and correction scores of BLEU 0.853, ROUGE 0.924, BERTScore 0.981, SembScore 0.865, and CheXbertF1 0.954, remaining below clinical-level accuracy, highlighting the challenge of precise report correction. To advance the state of the art, we propose a multi-step reinforcement learning (MSRL) framework that optimizes a multi-objective reward combining format compliance, error-type accuracy, and BLEU similarity. We apply MSRL to QwenVL2.5-7B, the top open-source model in our benchmark, achieving an improvement of 38.3% in single-error detection precision and 5.2% in single-error correction over the zero-shot baseline.

X-Ray LLM Radiology Report Chest Dataset Release In Silico Open Dataset Benchmark SOTA

Evaluating the Performance of Reasoning Large Language Models on Japanese Radiology Board Examination Questions.

Nakaura T, Takamure H, Kobayashi N, Shiraishi K, Yoshida N, Nagayama Y, Uetani H, Kidoh M, Funama Y, Hirai T

•papers•May 17 2025

This study evaluates the performance, cost, and processing time of OpenAI's reasoning large language models (LLMs) (o1-preview, o1-mini) and their base models (GPT-4o, GPT-4o-mini) on Japanese radiology board examination questions. A total of 210 questions from the 2022-2023 official board examinations of the Japan Radiological Society were presented to each of the four LLMs. Performance was evaluated by calculating the percentage of correctly answered questions within six predefined radiology subspecialties. The total cost and processing time for each model were also recorded. The McNemar test was used to assess the statistical significance of differences in accuracy between paired model responses. The o1-preview achieved the highest accuracy (85.7%), significantly outperforming GPT-4o (73.3%, P<.001). Similarly, o1-mini (69.5%) performed significantly better than GPT-4o-mini (46.7%, P<.001). Across all radiology subspecialties, o1-preview consistently ranked highest. However, reasoning models incurred substantially higher costs (o1-preview: $17.10, o1-mini: $2.58) compared to their base counterparts (GPT-4o: $0.496, GPT-4o-mini: $0.04), and their processing times were approximately 3.7 and 1.2 times longer, respectively. Reasoning LLMs demonstrated markedly superior performance in answering radiology board exam questions compared to their base models, albeit at a substantially higher cost and increased processing time.

LLM Radiology Report Retrospective Clinical In Silico Academic Lab GenAI Benchmark SOTA

MedVKAN: Efficient Feature Extraction with Mamba and KAN for Medical Image Segmentation

Hancan Zhu, Jinhao Chen, Guanghua He

•preprint•May 17 2025

Medical image segmentation relies heavily on convolutional neural networks (CNNs) and Transformer-based models. However, CNNs are constrained by limited receptive fields, while Transformers suffer from scalability challenges due to their quadratic computational complexity. To address these limitations, recent advances have explored alternative architectures. The state-space model Mamba offers near-linear complexity while capturing long-range dependencies, and the Kolmogorov-Arnold Network (KAN) enhances nonlinear expressiveness by replacing fixed activation functions with learnable ones. Building on these strengths, we propose MedVKAN, an efficient feature extraction model integrating Mamba and KAN. Specifically, we introduce the EFC-KAN module, which enhances KAN with convolutional operations to improve local pixel interaction. We further design the VKAN module, integrating Mamba with EFC-KAN as a replacement for Transformer modules, significantly improving feature extraction. Extensive experiments on five public medical image segmentation datasets show that MedVKAN achieves state-of-the-art performance on four datasets and ranks second on the remaining one. These results validate the potential of Mamba and KAN for medical image segmentation while introducing an innovative and computationally efficient feature extraction framework. The code is available at: https://github.com/beginner-cjh/MedVKAN.

Mixed Modality Segmentation Methodology In Silico Benchmark SOTA Open Code

Patient-Specific Autoregressive Models for Organ Motion Prediction in Radiotherapy

Yuxiang Lai, Jike Zhong, Vanessa Su, Xiaofeng Yang

•preprint•May 17 2025

Radiotherapy often involves a prolonged treatment period. During this time, patients may experience organ motion due to breathing and other physiological factors. Predicting and modeling this motion before treatment is crucial for ensuring precise radiation delivery. However, existing pre-treatment organ motion prediction methods primarily rely on deformation analysis using principal component analysis (PCA), which is highly dependent on registration quality and struggles to capture periodic temporal dynamics for motion modeling.In this paper, we observe that organ motion prediction closely resembles an autoregressive process, a technique widely used in natural language processing (NLP). Autoregressive models predict the next token based on previous inputs, naturally aligning with our objective of predicting future organ motion phases. Building on this insight, we reformulate organ motion prediction as an autoregressive process to better capture patient-specific motion patterns. Specifically, we acquire 4D CT scans for each patient before treatment, with each sequence comprising multiple 3D CT phases. These phases are fed into the autoregressive model to predict future phases based on prior phase motion patterns. We evaluate our method on a real-world test set of 4D CT scans from 50 patients who underwent radiotherapy at our institution and a public dataset containing 4D CT scans from 20 patients (some with multiple scans), totaling over 1,300 3D CT phases. The performance in predicting the motion of the lung and heart surpasses existing benchmarks, demonstrating its effectiveness in capturing motion dynamics from CT images. These results highlight the potential of our method to improve pre-treatment planning in radiotherapy, enabling more precise and adaptive radiation delivery.

CT Detection Chest Retrospective Clinical In Silico Academic Lab Benchmark SOTA

From Embeddings to Accuracy: Comparing Foundation Models for Radiographic Classification

Xue Li, Jameson Merkow, Noel C. F. Codella, Alberto Santamaria-Pang, Naiteek Sangani, Alexander Ersoy, Christopher Burt, John W. Garrett, Richard J. Bruce, Joshua D. Warner, Tyler Bradshaw, Ivan Tarapov, Matthew P. Lungren, Alan B. McMillan

•preprint•May 16 2025

Foundation models, pretrained on extensive datasets, have significantly advanced machine learning by providing robust and transferable embeddings applicable to various domains, including medical imaging diagnostics. This study evaluates the utility of embeddings derived from both general-purpose and medical domain-specific foundation models for training lightweight adapter models in multi-class radiography classification, focusing specifically on tube placement assessment. A dataset comprising 8842 radiographs classified into seven distinct categories was employed to extract embeddings using six foundation models: DenseNet121, BiomedCLIP, Med-Flamingo, MedImageInsight, Rad-DINO, and CXR-Foundation. Adapter models were subsequently trained using classical machine learning algorithms. Among these combinations, MedImageInsight embeddings paired with an support vector machine adapter yielded the highest mean area under the curve (mAUC) at 93.8%, followed closely by Rad-DINO (91.1%) and CXR-Foundation (89.0%). In comparison, BiomedCLIP and DenseNet121 exhibited moderate performance with mAUC scores of 83.0% and 81.8%, respectively, whereas Med-Flamingo delivered the lowest performance at 75.1%. Notably, most adapter models demonstrated computational efficiency, achieving training within one minute and inference within seconds on CPU, underscoring their practicality for clinical applications. Furthermore, fairness analyses on adapters trained on MedImageInsight-derived embeddings indicated minimal disparities, with gender differences in performance within 2% and standard deviations across age groups not exceeding 3%. These findings confirm that foundation model embeddings-especially those from MedImageInsight-facilitate accurate, computationally efficient, and equitable diagnostic classification using lightweight adapters for radiographic image analysis.

X-Ray Classification Chest Retrospective Clinical In Silico Benchmark SOTA

Pancreas segmentation using AI developed on the largest CT dataset with multi-institutional validation and implications for early cancer detection.

Mukherjee S, Antony A, Patnam NG, Trivedi KH, Karbhari A, Nagaraj M, Murlidhar M, Goenka AH

•papers•May 16 2025

Accurate and fully automated pancreas segmentation is critical for advancing imaging biomarkers in early pancreatic cancer detection and for biomarker discovery in endocrine and exocrine pancreatic diseases. We developed and evaluated a deep learning (DL)-based convolutional neural network (CNN) for automated pancreas segmentation using the largest single-institution dataset to date (n = 3031 CTs). Ground truth segmentations were performed by radiologists, which were used to train a 3D nnU-Net model through five-fold cross-validation, generating an ensemble of top-performing models. To assess generalizability, the model was externally validated on the multi-institutional AbdomenCT-1K dataset (n = 585), for which volumetric segmentations were newly generated by expert radiologists and will be made publicly available. In the test subset (n = 452), the CNN achieved a mean Dice Similarity Coefficient (DSC) of 0.94 (SD 0.05), demonstrating high spatial concordance with radiologist-annotated volumes (Concordance Correlation Coefficient [CCC]: 0.95). On the AbdomenCT-1K dataset, the model achieved a DSC of 0.96 (SD 0.04) and a CCC of 0.98, confirming its robustness across diverse imaging conditions. The proposed DL model establishes new performance benchmarks for fully automated pancreas segmentation, offering a scalable and generalizable solution for large-scale imaging biomarker research and clinical translation.

CT Segmentation Abdominal Retrospective Clinical In Silico Academic Lab Benchmark SOTA Open Dataset

Filter Papers

Tags

MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks

Harnessing Artificial Intelligence for Accurate Diagnosis and Radiomics Analysis of Combined Pulmonary Fibrosis and Emphysema: Insights from a Multicenter Cohort Study

Mutual Evidential Deep Learning for Medical Image Segmentation

A Robust Automated Segmentation Method for White Matter Hyperintensity of Vascular-origin.

CorBenchX: Large-Scale Chest X-Ray Error Dataset and Vision-Language Model Benchmark for Report Error Correction

Evaluating the Performance of Reasoning Large Language Models on Japanese Radiology Board Examination Questions.

MedVKAN: Efficient Feature Extraction with Mamba and KAN for Medical Image Segmentation

Patient-Specific Autoregressive Models for Organ Motion Prediction in Radiotherapy

From Embeddings to Accuracy: Comparing Foundation Models for Radiographic Classification

Pancreas segmentation using AI developed on the largest CT dataset with multi-institutional validation and implications for early cancer detection.

Ready to Sharpen Your Edge?