
Mamba-based deformable medical image registration with an annotated brain MR-CT dataset.

Wang Y, Guo T, Yuan W, Shu S, Meng C, Bai X

PubMed · Jul 1, 2025
Deformable registration is essential in medical image analysis, especially for handling various multi- and mono-modal registration tasks in neuroimaging. Existing studies have left brain MR-CT registration largely unexplored, and learning-based methods still face challenges in improving both accuracy and efficiency. To broaden the practice of multi-modal registration in the brain, we present SR-Reg, a new benchmark dataset comprising 180 volumetric paired MR-CT images with annotated anatomical regions. Building on this foundation, we introduce MambaMorph, a novel deformable registration network that uses the efficient state space model Mamba for global feature learning together with a fine-grained feature extractor for low-level embedding. Experimental results demonstrate that MambaMorph surpasses advanced ConvNet-based and Transformer-based networks across several multi- and mono-modal tasks, showing notable gains in both accuracy and efficiency. Code and dataset are available at https://github.com/mileswyn/MambaMorph.
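
For orientation, the snippet below sketches the spatial warping step that learning-based deformable registration networks typically end with: a dense displacement field resamples the moving volume. This is a generic, VoxelMorph-style illustration in PyTorch, not the authors' MambaMorph code; all shapes and names are placeholders.

# Minimal sketch of displacement-field warping for deformable registration.
import torch
import torch.nn.functional as F

def warp(moving: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a moving volume (N, C, D, H, W) by a dense displacement
    field `flow` (N, 3, D, H, W) given in voxel units."""
    n, _, d, h, w = moving.shape
    # Identity sampling grid in voxel coordinates (z, y, x channels).
    zz, yy, xx = torch.meshgrid(
        torch.arange(d), torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((zz, yy, xx), dim=0).float().to(moving.device)
    coords = grid.unsqueeze(0) + flow  # displaced voxel coordinates
    # Normalize each axis to [-1, 1] and reorder to (x, y, z) for grid_sample.
    for i, size in enumerate((d, h, w)):
        coords[:, i] = 2.0 * coords[:, i] / (size - 1) - 1.0
    coords = coords.permute(0, 2, 3, 4, 1)[..., [2, 1, 0]]
    return F.grid_sample(moving, coords, align_corners=True)

moved = warp(torch.rand(1, 1, 32, 32, 32), torch.zeros(1, 3, 32, 32, 32))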

Prediction of PD-L1 expression in NSCLC patients using PET/CT radiomics and prognostic modelling for immunotherapy in PD-L1-positive NSCLC patients.

Peng M, Wang M, Yang X, Wang Y, Xie L, An W, Ge F, Yang C, Wang K

PubMed · Jul 1, 2025
To develop a positron emission tomography/computed tomography (PET/CT)-based radiomics model for predicting programmed cell death ligand 1 (PD-L1) expression in non-small cell lung cancer (NSCLC) patients and estimating progression-free survival (PFS) and overall survival (OS) in PD-L1-positive patients undergoing first-line immunotherapy. We retrospectively analysed 143 NSCLC patients who underwent pretreatment 18F-fluorodeoxyglucose (18F-FDG) PET/CT scans, of whom 86 were PD-L1-positive. Clinical data collected included gender, age, smoking history, Tumor-Node-Metastasis (TNM) stage, pathologic type, laboratory parameters, and PET metabolic parameters. Four machine learning algorithms (Bayes, logistic regression, random forest, and support vector machine (SVM)) were used to build models. Predictive performance was validated using receiver operating characteristic (ROC) curves. Univariate and multivariate Cox analyses identified independent predictors of OS and PFS in PD-L1-positive patients undergoing immunotherapy, and a nomogram was created to predict OS. A total of 20 models were built for predicting PD-L1 expression. The clinical combined PET/CT radiomics model based on the SVM algorithm performed best (area under the curve for the training and test sets: 0.914 and 0.877, respectively). The Cox analyses showed that smoking history independently predicted PFS. SUVmean, monocyte percentage, and white blood cell count were independent predictors of OS, and a nomogram was created to predict 1-year, 2-year, and 3-year OS from these three factors. We developed PET/CT-based machine learning models to help predict PD-L1 expression in NSCLC patients and identified independent predictors of PFS and OS in PD-L1-positive patients receiving immunotherapy, thereby aiding precision treatment.
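
As a rough illustration of the modelling step, the scikit-learn sketch below fits an SVM with probability outputs on a combined clinical-plus-radiomics feature matrix and reports the test-set ROC AUC. The data, feature count, and split are synthetic placeholders, not the study's.

# Hedged sketch: SVM-based PD-L1 prediction with ROC-AUC evaluation.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(143, 30))      # placeholder clinical + radiomics features
y = rng.integers(0, 2, size=143)    # placeholder PD-L1 labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
model = make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))
model.fit(X_tr, y_tr)
print("test AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))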

World of Forms: Deformable geometric templates for one-shot surface meshing in coronary CT angiography.

van Herten RLM, Lagogiannis I, Wolterink JM, Bruns S, Meulendijks ER, Dey D, de Groot JR, Henriques JP, Planken RN, Saitta S, Išgum I

PubMed · Jul 1, 2025
Deep learning-based medical image segmentation and surface mesh generation typically involve a sequential pipeline from image to segmentation to meshes, often requiring large training datasets while making limited use of prior geometric knowledge. This may lead to topological inconsistencies and suboptimal performance in low-data regimes. To address these challenges, we propose a data-efficient deep learning method for direct 3D anatomical object surface meshing using geometric priors. Our approach employs a multi-resolution graph neural network that operates on a prior geometric template which is deformed to fit the object boundaries of interest. We show how different templates may be used for different surface meshing targets, and introduce a novel masked autoencoder pretraining strategy for 3D spherical data. The proposed method outperforms nnUNet in a one-shot setting for segmentation of the pericardium, the left ventricle (LV) cavity, and the LV myocardium. Similarly, the method outperforms other lumen segmentation methods operating on multi-planar reformatted images. Results further indicate that mesh quality is on par with or improves upon marching cubes post-processing of voxel mask predictions, while remaining flexible in the choice of mesh triangulation prior, thus paving the way for more accurate and topologically consistent 3D medical object surface meshing.
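
A toy sketch of the core idea follows: a network predicts per-vertex displacements that deform a fixed geometric template toward the object boundary. The MLP below stands in for the paper's multi-resolution graph neural network, and all shapes are invented.

# Illustrative template deformation: predicted offsets move prior vertices.
import torch
import torch.nn as nn

class TemplateDeformer(nn.Module):
    def __init__(self, feat_dim: int = 16):
        super().__init__()
        # Stand-in for a multi-resolution GNN: an MLP over vertex features.
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, template_xyz: torch.Tensor, vert_feats: torch.Tensor):
        # template_xyz: (V, 3) prior mesh vertices; vert_feats: (V, F) image
        # features sampled at each vertex. Returns deformed vertices.
        return template_xyz + self.head(vert_feats)

sphere = torch.randn(642, 3)   # placeholder spherical-template vertices
feats = torch.randn(642, 16)   # placeholder per-vertex image features
deformed = TemplateDeformer()(sphere, feats)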

TIER-LOC: Visual Query-based Video Clip Localization in fetal ultrasound videos with a multi-tier transformer.

Mishra D, Saha P, Zhao H, Hernandez-Cruz N, Patey O, Papageorghiou AT, Noble JA

PubMed · Jul 1, 2025
In this paper, we introduce the Visual Query-based task of Video Clip Localization (VQ-VCL) for medical video understanding. Specifically, we aim to retrieve a video clip containing frames similar to a given exemplar frame from a given input video. To solve the task, we propose a novel visual query-based video clip localization model called TIER-LOC. TIER-LOC is designed to improve video clip retrieval, especially in fine-grained videos, by extracting features at different levels, i.e., coarse to fine-grained, referred to as TIERS. The aim is to utilize multi-tier features to detect subtle differences and adapt to scale or resolution variations, leading to improved video clip retrieval. TIER-LOC has three main components: (1) a Multi-Tier Spatio-Temporal Transformer that fuses spatio-temporal features extracted from multiple tiers of video frames with features from multiple tiers of the visual query, enabling better video understanding; (2) a Multi-Tier, Dual Anchor Contrastive Loss to deal with real-world annotation noise, which can be notable at event boundaries and in videos featuring highly similar objects; (3) a Temporal Uncertainty-Aware Localization Loss designed to reduce the model's sensitivity to imprecise event boundaries. This is achieved by relaxing hard boundary constraints, allowing the model to learn underlying class patterns rather than be influenced by individual noisy samples. To demonstrate the efficacy of TIER-LOC, we evaluate it on two ultrasound video datasets and an open-source egocentric video dataset. First, we develop a sonographer workflow assistive task model to detect standard-frame clips in fetal ultrasound heart sweeps. Second, we assess the model's performance in retrieving standard-frame clips for detecting fetal anomalies in routine ultrasound scans, using the large-scale PULSE dataset. Lastly, we test the model on an open-source computer vision dataset by creating a VQ-VCL fine-grained video dataset based on the Ego4D dataset. Our model outperforms the best-performing state-of-the-art model by 7%, 4%, and 4% on the three video datasets, respectively.
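
To fix ideas, the snippet below implements a generic InfoNCE-style contrastive loss between a visual-query embedding and candidate frame embeddings. It is a simplified stand-in for the paper's Multi-Tier Dual Anchor Contrastive Loss; the temperature and dimensions are invented.

# Generic query-to-frame contrastive loss (InfoNCE), for illustration only.
import torch
import torch.nn.functional as F

def query_frame_contrastive(query, frames, pos_idx, tau: float = 0.07):
    """query: (D,) embedding; frames: (T, D); pos_idx: matching frame index."""
    q = F.normalize(query, dim=0)
    f = F.normalize(frames, dim=1)
    logits = f @ q / tau                       # (T,) similarity scores
    target = torch.tensor(pos_idx)
    return F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))

loss = query_frame_contrastive(torch.randn(128), torch.randn(50, 128), 7)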

Automated vertebrae identification and segmentation with structural uncertainty analysis in longitudinal CT scans of patients with multiple myeloma.

Madzia-Madzou DK, Jak M, de Keizer B, Verlaan JJ, Minnema MC, Gilhuijs K

PubMed · Jul 1, 2025
To optimize deep learning-based vertebrae segmentation in longitudinal CT scans of multiple myeloma patients using structural uncertainty analysis. Retrospective CT scans from 474 multiple myeloma patients were divided into a training cohort (179 patients, 349 scans, 2005-2011) and a test cohort (295 patients, 671 scans, 2012-2020). An enhanced segmentation pipeline was developed on the training cohort. It integrated vertebrae segmentation using an open-source deep learning method (Payer's) with a post-hoc structural uncertainty analysis. This analysis identified inconsistencies, automatically correcting them or flagging uncertain regions for human review. Segmentation quality was assessed through vertebral shape analysis using topology. Metrics included 'identification rate', 'longitudinal vertebral match rate', 'success rate', and 'series success rate', and were evaluated across age and sex subgroups. Statistical analysis included McNemar and Wilcoxon signed-rank tests, with p < 0.05 indicating significant improvement. Payer's method achieved an identification rate of 95.8% and a success rate of 86.7%. The proposed pipeline automatically improved these metrics to 98.8% and 96.0%, respectively (p < 0.001). Additionally, 3.6% of scans were marked for human inspection, increasing the success rate from 96.0% to 98.8% (p < 0.001). The vertebral match rate increased from 97.0% to 99.7% (p < 0.001), and the series success rate from 80.0% to 95.4% (p < 0.001). Subgroup analysis showed more consistent performance across age and sex groups. The proposed pipeline significantly outperforms Payer's method, enhancing segmentation accuracy and reducing longitudinal matching errors while minimizing evaluation workload. Its uncertainty analysis ensures robust performance, making it a valuable tool for longitudinal studies in multiple myeloma.
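
The McNemar comparison reported above can be reproduced in outline with statsmodels on paired per-scan outcomes, as sketched below; the counts in the 2x2 table are made up for illustration.

# Hedged sketch: McNemar test on paired success/failure outcomes.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Rows: baseline success/failure; columns: proposed-pipeline success/failure.
table = np.array([[560, 5],
                  [62, 44]])   # illustrative counts, not the study's
result = mcnemar(table, exact=True)
print(f"McNemar p-value: {result.pvalue:.4g}")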

Cascade learning in multi-task encoder-decoder networks for concurrent bone segmentation and glenohumeral joint clinical assessment in shoulder CT scans.

Marsilio L, Marzorati D, Rossi M, Moglia A, Mainardi L, Manzotti A, Cerveri P

PubMed · Jul 1, 2025
Osteoarthritis is a degenerative condition that affects bones and cartilage, often leading to structural changes, including osteophyte formation, bone density loss, and the narrowing of joint spaces. Over time, this process may disrupt glenohumeral (GH) joint functionality, requiring targeted treatment. Various options are available to restore joint function, ranging from conservative management to surgical intervention, depending on the severity of the condition. This work introduces a deep learning framework for processing shoulder CT scans. It features semantic segmentation of the proximal humerus and scapula, 3D reconstruction of bone surfaces, identification of the GH joint region, and staging of three common osteoarthritis-related conditions: osteophyte formation (OS), GH space reduction (JS), and humeroscapular alignment (HSA). Each condition was stratified into multiple severity stages, offering a comprehensive analysis of shoulder bone structure pathology. The pipeline comprised two cascaded CNN architectures: 3D CEL-UNet for segmentation and 3D Arthro-Net for threefold classification. A retrospective dataset of 571 CT scans featuring patients with various degrees of GH osteoarthritis-related pathology was used to train, validate, and test the pipeline. Median root mean squared error and Hausdorff distance for 3D reconstruction were 0.22 mm and 1.48 mm for the humerus and 0.24 mm and 1.48 mm for the scapula, outperforming state-of-the-art architectures and making the framework potentially suitable for patient-specific instrumentation (PSI)-based preoperative planning in shoulder arthroplasty. Classification accuracy for OS, JS, and HSA consistently reached around 90% across all three categories. The computational time for the entire inference pipeline was less than 15 s, showcasing the framework's efficiency and compatibility with orthopedic radiology practice. The achieved reconstruction and classification accuracy, combined with the rapid processing time, represent a promising advancement toward the clinical translation of artificial intelligence tools. This progress aims to streamline the preoperative planning pipeline, delivering high-quality bone surfaces and supporting surgeons in selecting the most suitable surgical approach for the unique joint condition of each patient.
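
In outline, a cascade of this kind runs as sketched below: a segmentation network yields bone labels, a joint-region crop is taken, and a classifier stages each condition. The stub networks are placeholders, not CEL-UNet or Arthro-Net, and the crop indices are arbitrary.

# Toy two-stage cascade: segment, crop the joint region, then classify.
import torch
import torch.nn as nn

seg_net = nn.Conv3d(1, 3, kernel_size=3, padding=1)       # stub segmenter
cls_net = nn.Sequential(nn.Flatten(), nn.LazyLinear(3))   # stub 3-stage head

ct = torch.randn(1, 1, 64, 64, 64)                        # placeholder CT
mask = seg_net(ct).argmax(dim=1, keepdim=True)            # bg/humerus/scapula
roi = ct[..., 16:48, 16:48, 16:48] * (mask[..., 16:48, 16:48, 16:48] > 0)
stage_logits = cls_net(roi)                               # e.g. OS severity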

Rethinking boundary detection in deep learning-based medical image segmentation.

Lin Y, Zhang D, Fang X, Chen Y, Cheng KT, Chen H

PubMed · Jul 1, 2025
Medical image segmentation is a pivotal task in medical image analysis and computer vision. While current methods have shown promise in accurately segmenting major regions of interest, precise segmentation of boundary areas remains challenging. In this study, we propose a novel network architecture named CTO, which combines Convolutional Neural Networks (CNNs), Vision Transformer (ViT) models, and explicit edge detection operators to tackle this challenge. CTO surpasses existing methods in segmentation accuracy and strikes a better balance between accuracy and efficiency, without the need for additional data inputs or label injections. Specifically, CTO adheres to the canonical encoder-decoder network paradigm, with a dual-stream encoder comprising a mainstream CNN for capturing local features and an auxiliary StitchViT stream for integrating long-range dependencies. Furthermore, to enhance the model's ability to learn boundary areas, we introduce a boundary-guided decoder network that employs binary boundary masks generated by dedicated edge detection operators to provide explicit guidance during the decoding process. We validate the performance of CTO through extensive experiments on seven challenging medical image segmentation datasets: ISIC 2016, PH2, ISIC 2018, CoNIC, LiTS17, BraTS, and BTCV. Our results demonstrate that CTO achieves state-of-the-art accuracy on these datasets while maintaining competitive model complexity. The code has been released at: CTO.
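
As an illustration of the explicit edge prior, the snippet below derives a binary boundary mask from a foreground mask with a Sobel operator; the kernel and threshold choices are illustrative and not taken from the paper.

# Sobel-based binary boundary mask, the kind of edge map a
# boundary-guided decoder can consume.
import torch
import torch.nn.functional as F

def sobel_boundary(mask: torch.Tensor, thresh: float = 0.1) -> torch.Tensor:
    """mask: (N, 1, H, W) soft or binary foreground mask."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    ky = kx.t()
    k = torch.stack((kx, ky)).unsqueeze(1)      # (2, 1, 3, 3) x/y kernels
    grad = F.conv2d(mask, k, padding=1)         # per-direction gradients
    magnitude = grad.pow(2).sum(dim=1, keepdim=True).sqrt()
    return (magnitude > thresh).float()

edges = sobel_boundary((torch.rand(1, 1, 64, 64) > 0.5).float())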

Reconstruction-based approach for chest X-ray image segmentation and enhanced multi-label chest disease classification.

Hage Chehade A, Abdallah N, Marion JM, Hatt M, Oueidat M, Chauvet P

PubMed · Jul 1, 2025
U-Net is a commonly used model for medical image segmentation. However, when applied to chest X-ray images that show pathologies, it often fails to include these critical pathological areas in the generated masks. To address this limitation, we developed a novel CycleGAN-based approach to precise segmentation and mask generation that encompasses the areas affected by pathologies within the region of interest, allowing the extraction of radiomic features relevant to those pathologies. Furthermore, we adopted a feature selection approach to focus the analysis on the most significant features. The results of the proposed pipeline are promising, with an average accuracy of 92.05% and an average AUC of 89.48% for multi-label classification of effusion and infiltration from the ChestX-ray14 dataset, using the XGBoost model. Furthermore, applying our methodology to classification of the 14 diseases in the ChestX-ray14 dataset resulted in an average AUC of 83.12%, outperforming previous studies. This research highlights the importance of effective pathological mask generation and feature selection for accurate classification of chest diseases. The promising results of our approach underscore its potential for broader applications in the classification of chest diseases.
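
A rough sketch of the classification stage, univariate feature selection followed by an XGBoost classifier with AUC evaluation, is shown below on synthetic placeholder data; it does not reproduce the study's radiomics extraction.

# Hedged sketch: feature selection + XGBoost with AUC evaluation.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))     # placeholder radiomic feature matrix
y = rng.integers(0, 2, size=500)    # placeholder disease labels

X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)   # keep top features
X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, stratify=y, random_state=0)
clf = XGBClassifier(n_estimators=200, eval_metric="logloss").fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))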

ConnectomeAE: Multimodal brain connectome-based dual-branch autoencoder and its application in the diagnosis of brain diseases.

Zheng Q, Nan P, Cui Y, Li L

PubMed · Jul 1, 2025
Exploring the dependencies between multimodal brain networks and integrating node features to enhance brain disease diagnosis remains a significant challenge. Some work has examined only brain connectivity changes in patients, ignoring important information carried by radiomics features, such as the shape and texture of individual brain regions in structural images. To this end, this study proposed a novel deep learning approach that integrates multimodal brain connectome information and regional radiomics features for brain disease diagnosis. A dual-branch autoencoder (ConnectomeAE) based on multimodal brain connectomes was proposed for brain disease diagnosis. Specifically, a matrix of radiomics features extracted from structural magnetic resonance images (MRI) was used as input to the Rad_AE branch for learning important brain region features. Functional brain networks built from functional MRI were used as inputs to the Cycle_AE branch for capturing brain disease-related connections. By separately learning node features and connection features from multimodal brain networks, the method demonstrates strong adaptability in diagnosing different brain diseases. ConnectomeAE was validated on two publicly available datasets. The experimental results show that ConnectomeAE achieved excellent diagnostic performance, with an accuracy of 70.7% for autism spectrum disorder and 90.5% for Alzheimer's disease. A comparison of training time with other methods indicated that ConnectomeAE exhibits a simplicity and efficiency suitable for clinical application. Furthermore, the interpretability analysis of the model aligned with previous studies, further supporting the biological basis of ConnectomeAE. ConnectomeAE could effectively leverage the complementary information between multimodal brain connectomes for brain disease diagnosis. By separately learning radiomic node features and connectivity features, ConnectomeAE demonstrated good adaptability to different brain disease classification tasks.
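
A toy dual-branch autoencoder in PyTorch is sketched below: one branch encodes regional radiomics-like node features, the other a functional connectivity matrix, and the fused latents feed a diagnosis head. All dimensions are invented and the architecture only loosely mirrors ConnectomeAE.

# Toy dual-branch autoencoder with a fused classification head.
import torch
import torch.nn as nn

class DualBranchAE(nn.Module):
    def __init__(self, n_rois: int = 90, rad_dim: int = 20, z: int = 64):
        super().__init__()
        self.rad_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(z), nn.ReLU())
        self.con_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(z), nn.ReLU())
        self.rad_dec = nn.Linear(z, n_rois * rad_dim)   # reconstructs radiomics
        self.con_dec = nn.Linear(z, n_rois * n_rois)    # reconstructs connectome
        self.cls = nn.Linear(2 * z, 2)                  # fused diagnosis head

    def forward(self, radiomics, connectome):
        zr, zc = self.rad_enc(radiomics), self.con_enc(connectome)
        logits = self.cls(torch.cat((zr, zc), dim=1))
        return logits, self.rad_dec(zr), self.con_dec(zc)

logits, rec_r, rec_c = DualBranchAE()(torch.randn(4, 90, 20),
                                      torch.randn(4, 90, 90))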

MDAL: Modality-difference-based active learning for multimodal medical image analysis via contrastive learning and pointwise mutual information.

Wang H, Jin Q, Du X, Wang L, Guo Q, Li H, Wang M, Song Z

PubMed · Jul 1, 2025
Multimodal medical images reveal different characteristics of the same anatomy or lesion, offering significant clinical value. Deep learning has achieved widespread success in medical image analysis with large-scale labeled datasets. However, annotating medical images is expensive and labor-intensive for doctors, and the variation between modalities further increases the annotation cost for multimodal images. This study aims to minimize the annotation cost of multimodal medical image analysis. We propose MDAL, a novel active learning framework based on modality differences for multimodal medical images. MDAL quantifies sample-wise modality differences through pointwise mutual information estimated by multimodal contrastive learning. We hypothesize that samples with larger modality differences are more informative for annotation and propose two sampling strategies based on these differences: MaxMD and DiverseMD. Moreover, MDAL can select informative samples in one shot without initial labeled data. We evaluated MDAL on public brain glioma and meningioma segmentation datasets and an in-house ovarian cancer classification dataset. MDAL outperforms other advanced active learning competitors. Moreover, when using only 20%, 20%, and 15% of the labeled samples in these datasets, MDAL reaches 99.6%, 99.9%, and 99.3% of the performance of supervised training with the fully labeled dataset, respectively. These results show that MDAL can significantly reduce the annotation cost of multimodal medical image analysis. We expect MDAL can be further extended to other multimodal medical data for lower annotation costs.
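
As a crude proxy for the paper's PMI-based scoring, the sketch below ranks samples by negative cosine similarity between their two modal embeddings and selects the k most different ones for annotation (a MaxMD-like rule); the embeddings are random placeholders.

# MaxMD-style selection: pick samples whose modalities disagree most.
import torch
import torch.nn.functional as F

def max_md_select(emb_a: torch.Tensor, emb_b: torch.Tensor, k: int):
    """emb_a, emb_b: (N, D) embeddings of the same samples in two modalities."""
    sim = F.cosine_similarity(emb_a, emb_b, dim=1)   # high sim = small difference
    return torch.topk(-sim, k).indices               # indices of k most different

picked = max_md_select(torch.randn(100, 32), torch.randn(100, 32), k=15)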