
Benchmarking Chest X-ray Diagnosis Models Across Multinational Datasets

Qinmei Xu, Yiheng Li, Xianghao Zhan, Ahmet Gorkem Er, Brittany Dashevsky, Chuanjun Xu, Mohammed Alawad, Mengya Yang, Liu Ya, Changsheng Zhou, Xiao Li, Haruka Itakura, Olivier Gevaert

arXiv preprint · May 21, 2025
Foundation models leveraging vision-language pretraining have shown promise in chest X-ray (CXR) interpretation, yet their real-world performance across diverse populations and diagnostic tasks remains insufficiently evaluated. This study benchmarks the diagnostic performance and generalizability of foundation models versus traditional convolutional neural networks (CNNs) on multinational CXR datasets. We evaluated eight CXR diagnostic models - five vision-language foundation models and three CNN-based architectures - across 37 standardized classification tasks using six public datasets from the USA, Spain, India, and Vietnam, and three private datasets from hospitals in China. Performance was assessed using AUROC, AUPRC, and other metrics across both shared and dataset-specific tasks. Foundation models outperformed CNNs in both accuracy and task coverage. MAVL, a model incorporating knowledge-enhanced prompts and structured supervision, achieved the highest performance on public (mean AUROC: 0.82; AUPRC: 0.32) and private (mean AUROC: 0.95; AUPRC: 0.89) datasets, ranking first in 14 of 37 public and 3 of 4 private tasks. All models showed reduced performance on pediatric cases, with average AUROC dropping from 0.88 ± 0.18 in adults to 0.57 ± 0.29 in children (p = 0.0202). These findings highlight the value of structured supervision and prompt design in radiologic AI and suggest future directions including geographic expansion and ensemble modeling for clinical deployment. Code for all evaluated models is available at https://drive.google.com/drive/folders/1B99yMQm7bB4h1sVMIBja0RfUu8gLktCE
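
To make the evaluation protocol concrete, here is a minimal sketch of computing per-task AUROC and AUPRC with scikit-learn and averaging them across tasks. The task names, labels, and scores are synthetic placeholders, not data or code from the study.

```python
# Minimal sketch of per-task AUROC/AUPRC evaluation (illustrative only).
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

# Hypothetical stand-ins for one model's outputs on two tasks.
tasks = {
    "cardiomegaly": (rng.integers(0, 2, 500), rng.random(500)),
    "pleural_effusion": (rng.integers(0, 2, 500), rng.random(500)),
}

results = {}
for task, (y_true, y_score) in tasks.items():
    results[task] = {
        "AUROC": roc_auc_score(y_true, y_score),                 # ranking quality
        "AUPRC": average_precision_score(y_true, y_score),       # precision-recall summary
    }

mean_auroc = np.mean([m["AUROC"] for m in results.values()])
mean_auprc = np.mean([m["AUPRC"] for m in results.values()])
print(results, mean_auroc, mean_auprc)
```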

Predictive machine learning and multimodal data to develop highly sensitive, composite biomarkers of disease progression in Friedreich ataxia.

Saha S, Corben LA, Selvadurai LP, Harding IH, Georgiou-Karistianis N

PubMed paper · May 21, 2025
Friedreich ataxia (FRDA) is a rare, inherited progressive movement disorder for which there is currently no cure. The field urgently requires more sensitive, objective, and clinically relevant biomarkers to enhance the evaluation of treatment efficacy in clinical trials and to speed up the process of drug discovery. This study pioneers the development of clinically relevant, multidomain, fully objective composite biomarkers of disease severity and progression, using multimodal neuroimaging and background data (i.e., demographics, disease history, genetics). Data from 31 individuals with FRDA and 31 controls from IMAGE-FRDA, a longitudinal multimodal natural history study, were included. Using an elastic net predictive machine learning (ML) regression model, we derived a weighted combination of background, structural MRI, diffusion MRI, and quantitative susceptibility mapping (QSM) measures that predicted the Friedreich Ataxia Rating Scale (FARS) with high accuracy (R² = 0.79, root mean square error (RMSE) = 13.19). This composite also exhibited strong sensitivity to disease progression over two years (Cohen's d = 1.12), outperforming the sensitivity of the FARS score alone (d = 0.88). The approach was validated using the Scale for the Assessment and Rating of Ataxia (SARA), demonstrating the potential and robustness of ML-derived composites to surpass individual biomarkers and act as complementary or surrogate markers of disease severity and progression. Further validation, refinement, and the integration of additional data modalities will open up new opportunities for translating these biomarkers into clinical practice and clinical trials for FRDA, as well as other rare neurodegenerative diseases.
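
For readers unfamiliar with the modelling approach, the sketch below shows how an elastic-net regression can combine multimodal features into a single composite score, and how sensitivity to change can be summarized with Cohen's d. The feature dimensions, values, and cross-validation setup are assumptions for illustration, not the study's pipeline.

```python
# Sketch of an elastic-net composite predicting a clinical score from multimodal
# features (illustrative; features and targets here are synthetic placeholders).
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(62, 40))                 # e.g., demographic + MRI/dMRI/QSM measures
y = X[:, :5].sum(axis=1) * 10 + 60 + rng.normal(scale=5, size=62)  # stand-in FARS scores

model = make_pipeline(StandardScaler(), ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5))
pred = cross_val_predict(model, X, y, cv=5)   # out-of-sample composite score

rmse = np.sqrt(np.mean((pred - y) ** 2))
r2 = 1 - np.sum((pred - y) ** 2) / np.sum((y - y.mean()) ** 2)

def cohens_d(change_scores: np.ndarray) -> float:
    # Sensitivity to change, e.g., cohens_d(followup_composite - baseline_composite).
    return change_scores.mean() / change_scores.std(ddof=1)
```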

Seeing the Trees for the Forest: Rethinking Weakly-Supervised Medical Visual Grounding

Ta Duc Huy, Duy Anh Huynh, Yutong Xie, Yuankai Qi, Qi Chen, Phi Le Nguyen, Sen Kim Tran, Son Lam Phung, Anton van den Hengel, Zhibin Liao, Minh-Son To, Johan W. Verjans, Vu Minh Hieu Phan

arXiv preprint · May 21, 2025
Visual grounding (VG) is the capability to identify the specific regions in an image associated with a particular text description. In medical imaging, VG enhances interpretability by highlighting relevant pathological features corresponding to textual descriptions, improving model transparency and trustworthiness for wider adoption of deep learning models in clinical practice. Current models struggle to associate textual descriptions with disease regions due to inefficient attention mechanisms and a lack of fine-grained token representations. In this paper, we empirically demonstrate two key observations. First, current VLMs assign high norms to background tokens, diverting the model's attention from regions of disease. Second, the global tokens used for cross-modal learning are not representative of local disease tokens. This hampers the identification of correlations between text and disease tokens. To address this, we introduce a simple yet effective Disease-Aware Prompting (DAP) process, which uses the explainability map of a VLM to identify the appropriate image features. This simple strategy amplifies disease-relevant regions while suppressing background interference. Without any additional pixel-level annotations, DAP improves visual grounding accuracy by 20.74% compared with state-of-the-art methods across three major chest X-ray datasets.
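
The paper's exact DAP procedure is not reproduced here; the snippet below only sketches the general idea of re-weighting ViT patch features with a normalized explainability map so that disease-relevant regions are amplified. The tensor shapes and the weighting rule are illustrative assumptions.

```python
# Illustrative re-weighting of patch tokens by an explainability map
# (a generic sketch of the idea, not the paper's exact DAP procedure).
import torch

def reweight_patches(patch_tokens: torch.Tensor, saliency: torch.Tensor) -> torch.Tensor:
    """patch_tokens: (B, N, D) image patch features; saliency: (B, N) explainability scores."""
    weights = saliency.clamp(min=0)
    weights = weights / (weights.sum(dim=1, keepdim=True) + 1e-6)   # normalize per image
    # Amplify disease-relevant patches, suppress background ones.
    return patch_tokens * (1.0 + weights.unsqueeze(-1) * patch_tokens.shape[1])

tokens = torch.randn(2, 196, 768)      # e.g., 14x14 patches from a ViT encoder
saliency = torch.rand(2, 196)          # e.g., from a VLM explainability method
weighted = reweight_patches(tokens, saliency)
```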

VET-DINO: Learning Anatomical Understanding Through Multi-View Distillation in Veterinary Imaging

Andre Dourson, Kylie Taylor, Xiaoli Qiao, Michael Fitzke

arXiv preprint · May 21, 2025
Self-supervised learning has emerged as a powerful paradigm for training deep neural networks, particularly in medical imaging where labeled data is scarce. While current approaches typically rely on synthetic augmentations of single images, we propose VET-DINO, a framework that leverages a unique characteristic of medical imaging: the availability of multiple standardized views from the same study. Using a series of clinical veterinary radiographs from the same patient study, we enable models to learn view-invariant anatomical structures and develop an implied 3D understanding from 2D projections. We demonstrate our approach on a dataset of 5 million veterinary radiographs from 668,000 canine studies. Through extensive experimentation, including view synthesis and downstream task performance, we show that learning from real multi-view pairs leads to superior anatomical understanding compared to purely synthetic augmentations. VET-DINO achieves state-of-the-art performance on various veterinary imaging tasks. Our work establishes a new paradigm for self-supervised learning in medical imaging that leverages domain-specific properties rather than merely adapting natural image techniques.
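
As a rough illustration of the core idea, the sketch below builds positive pairs from two real views of the same study rather than two synthetic augmentations of one image. The on-disk layout (one folder per study containing its PNG views) is an assumption for the example, not the dataset format used in the paper.

```python
# Sketch of a dataset yielding two real views from the same veterinary study,
# in place of two synthetic augmentations of one image (assumed folder layout).
from pathlib import Path
import random
from PIL import Image
from torch.utils.data import Dataset

class MultiViewStudyDataset(Dataset):
    def __init__(self, root: str, transform=None):
        # Keep only studies with at least two radiograph views.
        self.studies = [p for p in Path(root).iterdir() if len(list(p.glob("*.png"))) >= 2]
        self.transform = transform

    def __len__(self):
        return len(self.studies)

    def __getitem__(self, idx):
        views = random.sample(list(self.studies[idx].glob("*.png")), 2)
        imgs = [Image.open(v).convert("L") for v in views]
        if self.transform:
            imgs = [self.transform(im) for im in imgs]
        return imgs[0], imgs[1]   # a real multi-view positive pair for DINO-style training
```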

Deep Learning with Domain Randomization in Image and Feature Spaces for Abdominal Multiorgan Segmentation on CT and MRI Scans.

Shi Y, Wang L, Qureshi TA, Deng Z, Xie Y, Li D

PubMed paper · May 21, 2025
<i>"Just Accepted" papers have undergone full peer review and have been accepted for publication in <i>Radiology: Artificial Intelligence</i>. This article will undergo copyediting, layout, and proof review before it is published in its final version. Please note that during production of the final copyedited article, errors may be discovered which could affect the content.</i> Purpose To develop a deep learning segmentation model that can segment abdominal organs on CT and MR images with high accuracy and generalization ability. Materials and Methods In this study, an extended nnU-Net model was trained for abdominal organ segmentation. A domain randomization method in both the image and feature space was developed to improve the generalization ability under cross-site and cross-modality settings on public prostate MRI and abdominal CT and MRI datasets. The prostate MRI dataset contains data from multiple health care institutions with domain shifts. The abdominal CT and MRI dataset is structured for cross-modality evaluation, training on one modality (eg, MRI) and testing on the other (eg, CT). This domain randomization method was then used to train a segmentation model with enhanced generalization ability on the abdominal multiorgan segmentation challenge (AMOS) dataset to improve abdominal CT and MR multiorgan segmentation, and the model was compared with two commonly used segmentation algorithms (TotalSegmentator and MRSegmentator). Model performance was evaluated using the Dice similarity coefficient (DSC). Results The proposed domain randomization method showed improved generalization ability on the cross-site and cross-modality datasets compared with the state-of-the-art methods. The segmentation model using this method outperformed two other publicly available segmentation models on data from unseen test domains (Average DSC: 0.88 versus 0.79; <i>P</i> < .001 and 0.88 versus 0.76; <i>P</i> < .001). Conclusion The combination of image and feature domain randomizations improved the accuracy and generalization ability of deep learning-based abdominal segmentation on CT and MR images. © RSNA, 2025.

NOVA: A Benchmark for Anomaly Localization and Clinical Reasoning in Brain MRI

Cosmin I. Bercea, Jun Li, Philipp Raffler, Evamaria O. Riedel, Lena Schmitzer, Angela Kurz, Felix Bitzer, Paula Roßmüller, Julian Canisius, Mirjam L. Beyrle, Che Liu, Wenjia Bai, Bernhard Kainz, Julia A. Schnabel, Benedikt Wiestler

arXiv preprint · May 20, 2025
In many real-world applications, deployed models encounter inputs that differ from the data seen during training. Out-of-distribution detection identifies whether an input stems from an unseen distribution, while open-world recognition flags such inputs to ensure the system remains robust as ever-emerging, previously unknown categories appear and must be addressed without retraining. Foundation and vision-language models are pre-trained on large and diverse datasets with the expectation of broad generalization across domains, including medical imaging. However, benchmarking these models on test sets with only a few common outlier types silently collapses the evaluation back to a closed-set problem, masking failures on rare or truly novel conditions encountered in clinical use. We therefore present NOVA, a challenging, real-life, evaluation-only benchmark of approximately 900 brain MRI scans that span 281 rare pathologies and heterogeneous acquisition protocols. Each case includes rich clinical narratives and double-blinded expert bounding-box annotations. Together, these enable joint assessment of anomaly localisation, visual captioning, and diagnostic reasoning. Because NOVA is never used for training, it serves as an extreme stress-test of out-of-distribution generalisation: models must bridge a distribution gap both in sample appearance and in semantic space. Baseline results with leading vision-language models (GPT-4o, Gemini 2.0 Flash, and Qwen2.5-VL-72B) reveal substantial performance drops across all tasks, establishing NOVA as a rigorous testbed for advancing models that can detect, localize, and reason about truly unknown anomalies.
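
NOVA's official evaluation protocol is not shown here; as a simple illustration of how anomaly localization against expert bounding boxes can be scored, the sketch below computes box IoU and applies a common 0.5 threshold. The boxes and threshold are placeholders.

```python
# Sketch of a bounding-box IoU check for anomaly localization
# (a generic measure; not NOVA's official evaluation protocol).
def box_iou(a, b):
    """Boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

pred, gt = (40, 50, 120, 140), (35, 60, 110, 150)   # hypothetical predicted / expert boxes
hit = box_iou(pred, gt) >= 0.5                      # a common localization threshold
```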

XDementNET: An Explainable Attention Based Deep Convolutional Network to Detect Alzheimer Progression from MRI data

Soyabul Islam Lincoln, Mirza Mohd Shahriar Maswood

arXiv preprint · May 20, 2025
A common neurodegenerative disease, Alzheimer's disease (AD) requires a precise diagnosis and efficient treatment, particularly in light of escalating healthcare expenses and the expanding use of artificial intelligence in medical diagnostics. Many recent studies show that the combination of brain Magnetic Resonance Imaging (MRI) and deep neural networks has achieved promising results for diagnosing AD. Using deep convolutional neural networks, this paper introduces a novel deep learning architecture that incorporates multiresidual blocks, specialized spatial attention blocks, grouped query attention, and multi-head attention. The study assessed the model's performance on four publicly accessible datasets, addressing both binary and multiclass classification problems across various categories. This paper also considers the explainability of AD progression and compares the results with state-of-the-art methods, namely Gradient-weighted Class Activation Mapping (Grad-CAM), Score-CAM, Faster Score-CAM, and XGrad-CAM. Our methodology consistently outperforms current approaches, achieving 99.66% accuracy in 4-class classification, 99.63% in 3-class classification, and 100% in binary classification using Kaggle datasets. For the Open Access Series of Imaging Studies (OASIS) datasets, the accuracies are 99.92%, 99.90%, and 99.95%, respectively. The Alzheimer's Disease Neuroimaging Initiative-1 (ADNI-1) dataset was used for experiments in three planes (axial, sagittal, and coronal) and a combination of all planes. The study achieved accuracies of 99.08% for the axial plane, 99.85% for the sagittal plane, 99.5% for the coronal plane, and 99.17% for all planes combined, and 97.79% and 8.60%, respectively, for ADNI-2. The network's ability to retrieve important information from MRI images is demonstrated by its excellent accuracy in categorizing AD stages.
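
For context on the explainability baselines mentioned above, here is a minimal, generic Grad-CAM sketch in PyTorch. The backbone (a torchvision ResNet-18 with four output classes) and the input are stand-ins, not the XDementNET architecture or its data.

```python
# Minimal Grad-CAM sketch in PyTorch (the generic technique, not the paper's
# exact explainability pipeline or network).
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(num_classes=4).eval()           # stand-in classifier (e.g., 4 AD stages)
feats, grads = {}, {}

def save_activation(module, inputs, output):
    feats["v"] = output                                      # feature maps of the last conv block
    output.register_hook(lambda g: grads.update(v=g))        # and their gradients

model.layer4.register_forward_hook(save_activation)

x = torch.randn(1, 3, 224, 224)                  # stand-in for a preprocessed MRI slice
score = model(x)[0].max()                        # score of the top predicted class
model.zero_grad()
score.backward()

weights = grads["v"].mean(dim=(2, 3), keepdim=True)          # channel-wise gradient averages
cam = F.relu((weights * feats["v"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)[0, 0]
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-6)     # normalized heatmap in [0, 1]
```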

Pancreas segmentation in CT scans: A novel MOMUNet based workflow.

Juwita J, Hassan GM, Datta A

PubMed paper · May 20, 2025
Automatic pancreas segmentation in CT scans is crucial for various medical applications, including early diagnosis and computer-assisted surgery. However, existing segmentation methods remain suboptimal due to significant pancreas size variations across slices and severe class imbalance caused by the pancreas's small size and CT scanner movement during imaging. Traditional computer vision techniques struggle with these challenges, while deep learning-based approaches, despite their success in other domains, still face limitations in pancreas segmentation. To address these issues, we propose a novel three-stage workflow that enhances segmentation accuracy and computational efficiency. First, we introduce External Contour Cropping (ECC), a background cleansing technique that mitigates class imbalance. Second, we propose a Size Ratio (SR) technique that restructures the training dataset based on the relative size of the target organ, improving the robustness of the model against anatomical variations. Third, we develop MOMUNet, an ultra-lightweight segmentation model with only 1.31 million parameters, designed for optimal performance on limited computational resources. Our proposed workflow achieves an improvement in Dice score (DSC) of 2.56% over state-of-the-art (SOTA) models on the NIH-Pancreas dataset and 2.97% on the MSD-Pancreas dataset. Furthermore, applying the proposed model to another small-organ task, colon cancer segmentation in the MSD-Colon dataset, yielded a DSC of 68.4%, surpassing the SOTA models. These results demonstrate the effectiveness of our approach in significantly improving segmentation accuracy for small abdominal organs, including the pancreas and colon, making deep learning more accessible for low-resource medical facilities.
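
Since the reported gains are in Dice score, a small reference implementation of the metric for binary masks may be useful; the toy masks below are placeholders.

```python
# Dice similarity coefficient for binary segmentation masks.
import numpy as np

def dice_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# Toy example: two overlapping square masks.
a = np.zeros((64, 64), dtype=bool); a[10:30, 10:30] = True
b = np.zeros((64, 64), dtype=bool); b[15:35, 15:35] = True
print(round(dice_score(a, b), 3))
```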

A multi-modal model integrating MRI habitat and clinicopathology to predict platinum sensitivity in patients with high-grade serous ovarian cancer: a diagnostic study.

Bi Q, Ai C, Meng Q, Wang Q, Li H, Zhou A, Shi W, Lei Y, Wu Y, Song Y, Xiao Z, Li H, Qiang J

PubMed paper · May 20, 2025
Platinum resistance in high-grade serous ovarian cancer (HGSOC) cannot currently be recognized by specific molecular biomarkers. We aimed to compare the predictive capacity of various models integrating MRI habitat, whole-slide images (WSIs), and clinical parameters to predict platinum sensitivity in HGSOC patients. A retrospective study involving 998 eligible patients from four hospitals was conducted. MRI habitats were clustered using the K-means algorithm on multi-parametric MRI. Following feature extraction and selection, a Habitat model was developed. A Vision Transformer (ViT) and multi-instance learning were trained to derive patch-level and WSI-level predictions on hematoxylin and eosin (H&E)-stained WSIs, respectively, forming a Pathology model. Logistic regression (LR) was used to create a Clinic model. A multi-modal model integrating Clinic, Habitat, and Pathology (CHP) was constructed using Multi-Head Attention (MHA) and compared with the unimodal models and Ensemble multi-modal models. The area under the curve (AUC) and integrated discrimination improvement (IDI) value were used to assess model performance and gains. In the internal validation cohort and the external test cohort, the Habitat model showed the highest AUCs (0.722 and 0.685) compared with the Clinic model (0.683 and 0.681) and the Pathology model (0.533 and 0.565), respectively. The AUCs of the multi-modal model integrating CHP based on MHA (0.789 and 0.807) were higher than those of any unimodal model or Ensemble multi-modal model, with positive IDI values. MRI-based habitat imaging showed potential to predict platinum sensitivity in HGSOC patients, and multi-modal integration of CHP based on MHA helped improve prediction performance.
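
As a sketch of the habitat step, the snippet below clusters multi-parametric MRI voxel features with K-means. The number of clusters, the sequences, and the preprocessing are assumptions for illustration, not the study's configuration.

```python
# Sketch of K-means "habitat" clustering on multi-parametric MRI voxels
# (illustrative; sequence names, k, and preprocessing are assumptions).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Stand-in voxel features inside a tumor mask: one column per MRI sequence
# (e.g., T2WI, DWI, ADC, contrast-enhanced T1WI).
voxels = rng.normal(size=(5000, 4))

features = StandardScaler().fit_transform(voxels)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# Each voxel now carries a habitat label; radiomics features can then be
# extracted per habitat and fed to the downstream prediction model.
habitat_sizes = np.bincount(labels)
```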

Semiautomated segmentation of breast tumor on automatic breast ultrasound image using a large-scale model with customized modules.

Zhou Y, Ye M, Ye H, Zeng S, Shu X, Pan Y, Wu A, Liu P, Zhang G, Cai S, Chen S

PubMed paper · May 19, 2025
To verify the capability of the Segment Anything Model for medical images in 3D (SAM-Med3D), tailored with low-rank adaptation (LoRA) strategies, in segmenting breast tumors in Automated Breast Ultrasound (ABUS) images. This retrospective study collected data from 329 patients diagnosed with breast cancer (average age 54 years). The dataset was randomly divided into training (n = 204), validation (n = 29), and test (n = 59) sets. Two experienced radiologists manually annotated the regions of interest of each sample in the dataset, which served as ground truth for training and evaluating the SAM-Med3D model with additional customized modules. For semi-automatic tumor segmentation, points were randomly sampled within the lesion areas to simulate the radiologists' clicks in real-world scenarios. Segmentation performance was evaluated using the Dice Similarity Coefficient (DSC). A total of 492 cases (200 from the "Tumor Detection, Segmentation, and Classification Challenge on Automated 3D Breast Ultrasound (TDSC-ABUS) 2023" challenge) were subjected to semi-automatic segmentation inference. The average DSC scores for the training, validation, and test sets of the Lishui dataset were 0.75, 0.78, and 0.75, respectively. The Breast Imaging Reporting and Data System (BI-RADS) categories of all samples ranged from BI-RADS 3 to 6, yielding average DSC values between 0.73 and 0.77. When the samples (lesion volumes ranging from 1.64 to 100.03 cm³) were categorized by lesion size, the average DSC fell between 0.72 and 0.77. The overall average DSC for the TDSC-ABUS 2023 challenge dataset was 0.79, with the test set achieving a score of 0.79. The SAM-Med3D model with additional customized modules demonstrates good performance in semi-automatic 3D ABUS breast cancer tumor segmentation, indicating its feasibility for application in computer-aided diagnosis systems.
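
For readers unfamiliar with LoRA, here is a minimal adapter around a single linear layer in PyTorch; the rank, scaling, and where such adapters are inserted into SAM-Med3D are assumptions, not the customized modules used in the study.

```python
# Minimal LoRA adapter around a linear layer (generic sketch; not the exact
# modules added to SAM-Med3D in this study).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                  # freeze the pretrained weights
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)           # adapter starts as a zero update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 196, 768))                # only the low-rank factors are trainable
```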