Latest Papers on Radiology AI. Tags: Reproducibility, Order: Best Match, Limit: 10.

A Deep Learning Lung Cancer Segmentation Pipeline to Facilitate CT-based Radiomics

So, A. C. P., Cheng, D., Aslani, S., Azimbagirad, M., Yamada, D., Dunn, R., Josephides, E., McDowall, E., Henry, A.-R., Bille, A., Sivarasan, N., Karapanagiotou, E., Jacob, J., Pennycuick, A.

•preprint•Jun 18 2025

Background: CT-based radio-biomarkers could provide non-invasive insights into tumour biology to risk-stratify patients. One of the limitations is laborious manual segmentation of regions-of-interest (ROI). We present a deep learning auto-segmentation pipeline for radiomic analysis. Patients and Methods: 153 patients with resected stage 2A-3B non-small cell lung cancer (NSCLC) had tumours segmented using nnU-Net with review by two clinicians. The nnU-Net was pretrained with anatomical priors in non-cancerous lungs and finetuned on NSCLCs. Three ROIs were segmented: intra-tumoural, peri-tumoural, and whole lung. 1967 features were extracted using PyRadiomics. Feature reproducibility was tested using segmentation perturbations. Features were selected using minimum-redundancy-maximum-relevance with Random Forest-recursive feature elimination nested in 500 bootstraps. Results: Auto-segmentation time was ~36 seconds/series. Mean volumetric and surface Dice-Sorensen coefficient (DSC) scores were 0.84 (±0.28) and 0.79 (±0.34), respectively. DSC scores were significantly correlated with tumour shape (sphericity, diameter) and location (worse with chest wall adherence), but not batch effects (e.g. contrast, reconstruction kernel). 6.5% of cases had missed segmentations; 6.5% required major changes. Pre-training on anatomical priors resulted in better segmentations compared to training on tumour-labels alone (p<0.001) and tumour with anatomical labels (p<0.001). Most radiomic features were not reproducible following perturbations and resampling. Adding radiomic features, however, did not significantly improve the clinical model in predicting 2-year disease-free survival: AUCs 0.67 (95% CI 0.59-0.75) vs 0.63 (95% CI 0.54-0.71) respectively (p=0.28). Conclusion: Our study demonstrates that integrating auto-segmentation into radio-biomarker discovery is feasible with high efficiency and accuracy. Whilst radiomic analysis shows limited reproducibility, our auto-segmentation may allow more robust radio-biomarker analysis using deep learning features.
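
For readers unfamiliar with the volumetric Dice-Sorensen coefficient reported above, the following is a minimal sketch of how it can be computed for two binary masks. It is not the authors' code: the function name `dice_coefficient`, the flat-list mask representation, and the convention that two empty masks score 1.0 are all assumptions made for illustration.

```python
def dice_coefficient(pred, truth):
    """Volumetric Dice between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labelled as tumour in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total tumour voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks agree perfectly.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy example: 8 voxels, flattened (hypothetical values).
auto_mask = [0, 1, 1, 1, 0, 0, 1, 0]
manual_mask = [0, 1, 1, 0, 0, 0, 1, 1]
print(dice_coefficient(auto_mask, manual_mask))  # 0.75
```

Under this definition a completely missed lesion (empty prediction, non-empty reference) scores 0, so a small number of missed cases can pull the mean down and widen the spread.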

CT Segmentation Chest Retrospective Clinical In Silico Academic Lab Reproducibility

Federated Learning for MRI-based BrainAGE: a multicenter study on post-stroke functional outcome prediction

Vincent Roca, Marc Tommasi, Paul Andrey, Aurélien Bellet, Markus D. Schirmer, Hilde Henon, Laurent Puy, Julien Ramon, Grégory Kuchcinski, Martin Bretzner, Renaud Lopes

•preprint•Jun 18 2025

Objective: Brain-predicted age difference (BrainAGE) is a neuroimaging biomarker reflecting brain health. However, training robust BrainAGE models requires large datasets, often restricted by privacy concerns. This study evaluates the performance of federated learning (FL) for BrainAGE estimation in ischemic stroke patients treated with mechanical thrombectomy, and investigates its association with clinical phenotypes and functional outcomes. Methods: We used FLAIR brain images from 1674 stroke patients across 16 hospital centers. We implemented standard machine learning and deep learning models for BrainAGE estimates under three data management strategies: centralized learning (pooled data), FL (local training at each site), and single-site learning. We reported prediction errors and examined associations between BrainAGE and vascular risk factors (e.g., diabetes mellitus, hypertension, smoking), as well as functional outcomes at three months post-stroke. Logistic regression evaluated BrainAGE's predictive value for these outcomes, adjusting for age, sex, vascular risk factors, stroke severity, time between MRI and arterial puncture, prior intravenous thrombolysis, and recanalisation outcome. Results: While centralized learning yielded the most accurate predictions, FL consistently outperformed single-site models. BrainAGE was significantly higher in patients with diabetes mellitus across all models. Comparisons between patients with good and poor functional outcomes, and multivariate predictions of these outcomes showed the significance of the association between BrainAGE and post-stroke recovery. Conclusion: FL enables accurate age predictions without data centralization. The strong association between BrainAGE, vascular risk factors, and post-stroke recovery highlights its potential for prognostic modeling in stroke care.
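
The abstract does not say which aggregation rule the federated setup used. As an illustration only, the sketch below assumes sample-size-weighted federated averaging, a common choice in FL, and shows BrainAGE as predicted age minus chronological age. The function names, the two-site toy parameters, and the example ages are hypothetical.

```python
def federated_average(site_weights, site_sizes):
    """Average per-site parameter vectors, weighted by local sample count.

    Assumption: every site returns a parameter vector of the same length
    after local training; only these vectors (not images) leave the site.
    """
    total = sum(site_sizes)
    averaged = [0.0] * len(site_weights[0])
    for weights, size in zip(site_weights, site_sizes):
        for i, w in enumerate(weights):
            averaged[i] += w * size / total
    return averaged


def brain_age_gap(predicted_age, chronological_age):
    """BrainAGE: brain-predicted age minus chronological age, in years."""
    return predicted_age - chronological_age


# Two hypothetical sites with 100 and 300 patients.
site_weights = [[1.0, 2.0], [3.0, 4.0]]
site_sizes = [100, 300]
global_weights = federated_average(site_weights, site_sizes)
print(global_weights)             # [2.5, 3.5]
print(brain_age_gap(71.5, 64.0))  # 7.5
```

The larger site dominates the average, which is why the global parameters sit closer to the second site's values.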

MRI Registration Neurological Retrospective Clinical In Silico Academic Lab Reproducibility

A Digital Twin Framework for Adaptive Treatment Planning in Radiotherapy

Chih-Wei Chang, Sri Akkineni, Mingzhe Hu, Keyur D. Shah, Jun Zhou, Xiaofeng Yang

•preprint•Jun 17 2025

This study aims to develop and evaluate a digital twin (DT) framework to enhance adaptive proton therapy for prostate stereotactic body radiotherapy (SBRT), focusing on improving treatment precision for dominant intraprostatic lesions (DILs) while minimizing organ-at-risk (OAR) toxicity. We propose a decision-theoretic (DT) framework combining deep learning (DL)-based deformable image registration (DIR) with a prior treatment database to generate synthetic CTs (sCTs) for predicting interfractional anatomical changes. Using daily CBCT from five prostate SBRT patients with DILs, the framework precomputes multiple plans with high (DT-H) and low (DT-L) similarity sCTs. Plan optimization is performed in RayStation 2023B, assuming a constant RBE of 1.1 and robustly accounting for positional and range uncertainties. Plan quality is evaluated via a modified ProKnow score across two fractions, with reoptimization limited to 10 minutes. Daily CBCT evaluation showed clinical plans often violated OAR constraints (e.g., bladder V20.8Gy, rectum V23Gy), with DIL V100 < 90% in 2 patients, indicating SIFB failure. DT-H plans, using high-similarity sCTs, achieved better or comparable DIL/CTV coverage and lower OAR doses, with reoptimization completed within 10 min (e.g., DT-H-REopt-A score: 154.3-165.9). DT-L plans showed variable outcomes; lower similarity correlated with reduced DIL coverage (e.g., Patient 4: 84.7%). DT-H consistently outperformed clinical plans within time limits, while extended optimization brought DT-L and clinical plans closer to DT-H quality. This DT framework enables rapid, personalized adaptive proton therapy, improving DIL targeting and reducing toxicity. By addressing geometric uncertainties, it supports outcome gains in ultra-hypofractionated prostate RT and lays groundwork for future multimodal anatomical prediction.

CT Registration Abdominal Methodology In Silico Academic Lab Reproducibility

Beyond the First Read: AI-Assisted Perceptual Error Detection in Chest Radiography Accounting for Interobserver Variability

Adhrith Vutukuri, Akash Awasthi, David Yang, Carol C. Wu, Hien Van Nguyen

•preprint•Jun 16 2025

Chest radiography is widely used in diagnostic imaging. However, perceptual errors, especially overlooked but visible abnormalities, remain common and clinically significant. Current workflows and AI systems provide limited support for detecting such errors after interpretation and often lack meaningful human-AI collaboration. We introduce RADAR (Radiologist-AI Diagnostic Assistance and Review), a post-interpretation companion system. RADAR ingests finalized radiologist annotations and CXR images, then performs regional-level analysis to detect and refer potentially missed abnormal regions. The system supports a "second-look" workflow and offers suggested regions of interest (ROIs) rather than fixed labels to accommodate inter-observer variation. We evaluated RADAR on a simulated perceptual-error dataset derived from de-identified CXR cases, using F1 score and Intersection over Union (IoU) as primary metrics. RADAR achieved a recall of 0.78, precision of 0.44, and an F1 score of 0.56 in detecting missed abnormalities in the simulated perceptual-error dataset. Although precision is moderate, this reduces over-reliance on AI by encouraging radiologist oversight in human-AI collaboration. The median IoU was 0.78, with more than 90% of referrals exceeding 0.5 IoU, indicating accurate regional localization. RADAR effectively complements radiologist judgment, providing valuable post-read support for perceptual-error detection in CXR interpretation. Its flexible ROI suggestions and non-intrusive integration position it as a promising tool in real-world radiology workflows. To facilitate reproducibility and further evaluation, we release a fully open-source web implementation alongside a simulated error dataset. All code, data, demonstration videos, and the application are publicly available at https://github.com/avutukuri01/RADAR.
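
The reported F1 score follows directly from the stated precision and recall, as the short sketch below checks. It also shows one way IoU can be computed, assuming ROIs are axis-aligned boxes given as (x_min, y_min, x_max, y_max); that box representation and both function names are assumptions for illustration and are not taken from the RADAR repository.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Precision and recall as stated in the abstract.
print(round(f1_score(0.44, 0.78), 2))  # 0.56

# Two hypothetical 10x10 boxes that overlap by half their width.
print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

In the toy box example the intersection is 50 and the union is 150, so the IoU is one third, below the 0.5 level that the abstract uses as a reference point.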

X-Ray Detection Chest Methodology In Silico Academic Lab Open Code Open Dataset Reproducibility

Real-time cardiac cine MRI: A comparison of a diffusion probabilistic model with alternative state-of-the-art image reconstruction techniques for undersampled spiral acquisitions.

Schad O, Heidenreich JF, Petri N, Kleineisel J, Sauer S, Bley TA, Nordbeck P, Petritsch B, Wech T

•papers•Jun 16 2025

Electrocardiogram (ECG)-gated cine imaging in breath-hold enables high-quality diagnostics in most patients but can be compromised by arrhythmia and inability to hold breath. Real-time cardiac MRI offers faster and robust exams without these limitations. To achieve sufficient acceleration, advanced reconstruction methods, which transfer data into high-quality images, are required. In this study, undersampled spiral balanced SSFP (bSSFP) real-time data in free-breathing were acquired at 1.5T in 16 healthy volunteers and five arrhythmic patients, with ECG-gated Cartesian cine in breath-hold serving as clinical reference. Image reconstructions were performed using a tailored and specifically trained score-based diffusion model, compared to a variational network and different compressed sensing approaches. The techniques were assessed using an expert reader study, scalar metric calculations, difference images against a segmented reference, and Bland-Altman analysis of cardiac functional parameters. In participants with irregular RR-cycles, spiral real-time acquisitions showed superior image quality compared to the clinical reference. Quantitative and qualitative metrics indicate enhanced image quality of the diffusion model in comparison to the alternative reconstruction methods, although improvements over the variational network were minor. Slightly higher ejection fractions for the real-time diffusion reconstructions were exhibited relative to the clinical references with a bias of 1.1 ± 5.7% for healthy subjects. The proposed real-time technique enables free-breathing acquisitions of spatio-temporal images with high quality, covering the entire heart in less than 1 min. Evaluation of ejection fraction using the ECG-gated reference can be vulnerable to arrhythmia and averaging effects, highlighting the need for real-time approaches. Prolonged inference times and stochastic variability of the diffusion reconstruction represent obstacles to overcome for clinical translation.
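
As a rough illustration of the Bland-Altman analysis mentioned above, the sketch below computes the bias (mean paired difference) and the conventional 95% limits of agreement (bias ± 1.96 standard deviations of the differences). The ejection-fraction values are made up for the example, the function name is hypothetical, and the abstract does not state whether its "±" refers to a standard deviation or to limits of agreement.

```python
from statistics import mean, stdev


def bland_altman(method_a, method_b):
    """Return (bias, lower_limit, upper_limit) for paired measurements.

    bias is the mean of (a - b); the limits are bias +/- 1.96 * SD of the
    differences, using the sample standard deviation.
    """
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical ejection fractions (%) for four subjects.
realtime_ef = [62.0, 58.0, 65.0, 55.0]
reference_ef = [60.0, 59.0, 62.0, 55.0]
bias, lower, upper = bland_altman(realtime_ef, reference_ef)
print(bias, lower, upper)
```

A positive bias here means the first method reads higher on average than the second, which is the direction the abstract reports for the real-time diffusion reconstructions relative to the clinical reference.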

MRI Reconstruction Cardiac Retrospective Clinical In Silico Academic Lab Reproducibility

Evaluating Explainability: A Framework for Systematic Assessment and Reporting of Explainable AI Features

Miguel A. Lago, Ghada Zamzmi, Brandon Eich, Jana G. Delfino

•preprint•Jun 16 2025

Explainability features are intended to provide insight into the internal mechanisms of an AI device, but there is a lack of evaluation techniques for assessing the quality of provided explanations. We propose a framework to assess and report explainable AI features. Our evaluation framework for AI explainability is based on four criteria: 1) Consistency quantifies the variability of explanations to similar inputs, 2) Plausibility estimates how close the explanation is to the ground truth, 3) Fidelity assesses the alignment between the explanation and the model's internal mechanisms, and 4) Usefulness evaluates the impact of the explanation on task performance. Finally, we developed a scorecard for AI explainability methods that serves as a complete description and evaluation to accompany this type of algorithm. We describe these four criteria and give examples of how they can be evaluated. As a case study, we use Ablation CAM and Eigen CAM to illustrate the evaluation of explanation heatmaps on the detection of breast lesions on synthetic mammograms. The first three criteria are evaluated for clinically-relevant scenarios. Our proposed framework establishes criteria through which the quality of explanations provided by AI models can be evaluated. We intend for our framework to spark a dialogue regarding the value provided by explainability features and help improve the development and evaluation of AI-based medical devices.

Mammography Detection Breast Methodology In Silico Academic Lab Ethics Reproducibility

Artificial intelligence (AI) and CT in abdominal imaging: image reconstruction and beyond.

Pisuchpen N, Srinivas Rao S, Noda Y, Kongboonvijit S, Rezaei A, Kambadakone A

•papers•Jun 16 2025

Computed tomography (CT) is a cornerstone of abdominal imaging, playing a vital role in accurate diagnosis, appropriate treatment planning, and disease monitoring. The evolution of artificial intelligence (AI) in imaging has introduced deep learning-based reconstruction (DLR) techniques that enhance image quality, reduce radiation dose, and improve workflow efficiency. Traditional image reconstruction methods, including filtered back projection (FBP) and iterative reconstruction (IR), have limitations such as high noise levels and artificial image texture. DLR overcomes these challenges by leveraging convolutional neural networks to generate high-fidelity images while preserving anatomical details. Recent advances in vendor-specific and vendor-agnostic DLR algorithms, such as TrueFidelity, AiCE, and Precise Image, have demonstrated significant improvements in contrast-to-noise ratio, lesion detection, and diagnostic confidence across various abdominal organs, including the liver, pancreas, and kidneys. Furthermore, AI extends beyond image reconstruction to applications such as low contrast lesion detection, quantitative imaging, and workflow optimization, augmenting radiologists' efficiency and diagnostic accuracy. However, challenges remain in clinical validation, standardization, and widespread adoption. This review explores the principles, advancements, and future directions of AI-driven CT image reconstruction and its expanding role in abdominal imaging.

CT Reconstruction Abdominal Review In Silico Academic Lab Reproducibility

BreastDCEDL: Curating a Comprehensive DCE-MRI Dataset and developing a Transformer Implementation for Breast Cancer Treatment Response Prediction

Naomi Fridman, Bubby Solway, Tomer Fridman, Itamar Barnea, Anat Goldshtein

•preprint•Jun 13 2025

Breast cancer remains a leading cause of cancer-related mortality worldwide, making early detection and accurate treatment response monitoring critical priorities. We present BreastDCEDL, a curated, deep learning-ready dataset comprising pre-treatment 3D Dynamic Contrast-Enhanced MRI (DCE-MRI) scans from 2,070 breast cancer patients drawn from the I-SPY1, I-SPY2, and Duke cohorts, all sourced from The Cancer Imaging Archive. The raw DICOM imaging data were rigorously converted into standardized 3D NIfTI volumes with preserved signal integrity, accompanied by unified tumor annotations and harmonized clinical metadata including pathologic complete response (pCR), hormone receptor (HR), and HER2 status. Although DCE-MRI provides essential diagnostic information and deep learning offers tremendous potential for analyzing such complex data, progress has been limited by lack of accessible, public, multicenter datasets. BreastDCEDL addresses this gap by enabling development of advanced models, including state-of-the-art transformer architectures that require substantial training data. To demonstrate its capacity for robust modeling, we developed the first transformer-based model for breast DCE-MRI, leveraging Vision Transformer (ViT) architecture trained on RGB-fused images from three contrast phases (pre-contrast, early post-contrast, and late post-contrast). Our ViT model achieved state-of-the-art pCR prediction performance in HR+/HER2- patients (AUC 0.94, accuracy 0.93). BreastDCEDL includes predefined benchmark splits, offering a framework for reproducible research and enabling clinically meaningful modeling in breast cancer imaging.

MRI Classification Breast Dataset Release In Silico Academic Lab Open Dataset Benchmark SOTA Reproducibility

Empirical evaluation of artificial intelligence distillation techniques for ascertaining cancer outcomes from electronic health records.

Riaz IB, Naqvi SAA, Ashraf N, Harris GJ, Kehl KL

•papers•Jun 10 2025

Phenotypic information for cancer research is embedded in unstructured electronic health records (EHR), requiring effort to extract. Deep learning models can automate this but face scalability issues due to privacy concerns. We evaluated techniques for applying a teacher-student framework to extract longitudinal clinical outcomes from EHRs. We focused on the challenging task of ascertaining two cancer outcomes, overall response and progression according to Response Evaluation Criteria in Solid Tumors (RECIST), from free-text radiology reports. Teacher models with hierarchical Transformer architecture were trained on data from Dana-Farber Cancer Institute (DFCI). These models labeled public datasets (MIMIC-IV, Wiki-text) and GPT-4-generated synthetic data. "Student" models were then trained to mimic the teachers' predictions. DFCI "teacher" models achieved high performance, and student models trained on MIMIC-IV data showed comparable results, demonstrating effective knowledge transfer. However, student models trained on Wiki-text and synthetic data performed worse, emphasizing the need for in-domain public datasets for model distillation.

Mixed Modality LLM Radiology Report Methodology In Silico Academic Lab GenAI Reproducibility

Post-processing steps improve generalisability and robustness of an MRI-based radiogenomic model for human papillomavirus status prediction in oropharyngeal cancer.

Ahmadian M, Bodalal Z, Bos P, Martens RM, Agrotis G, van der Hulst HJ, Vens C, Karssemakers L, Al-Mamgani A, de Graaf P, Jasperse B, Brakenhoff RH, Leemans CR, Beets-Tan RGH, Castelijns JA, van den Brekel MWM

•papers•Jun 6 2025

To assess the impact of image post-processing steps on the generalisability of MRI-based radiogenomic models. Using a human papillomavirus (HPV) status in oropharyngeal squamous cell carcinoma (OPSCC) prediction model, this study examines the potential of different post-processing strategies to increase its generalisability across data from different centres and image acquisition protocols. Contrast-enhanced T1-weighted MR images of OPSCC patients of two cohorts from different centres, with confirmed HPV status, were manually segmented. After radiomic feature extraction, the HPV prediction model trained on a training set with 91 patients was subsequently tested on two independent cohorts: a test set with 62 patients and an externally derived cohort of 157 patients. The data processing options included: data harmonisation, a process to ensure consistency in data from different centres; exclusion of unstable features across different segmentations and scan protocols; and removal of highly correlated features to reduce redundancy. The predictive model, trained without post-processing, showed high performance on the test set, with an AUC of 0.79 (95% CI: 0.66-0.90, p < 0.001). However, when tested on the external data, the model performed less well, resulting in an AUC of 0.52 (95% CI: 0.45-0.58, p = 0.334). The model's generalisability substantially improved after performing post-processing steps. The AUC for the test set reached 0.76 (95% CI: 0.63-0.87, p < 0.001), while for the external cohort, the predictive model achieved an AUC of 0.73 (95% CI: 0.64-0.81, p < 0.001). When applied before model development, post-processing steps can enhance the robustness and generalisability of predictive radiogenomics models. Question: How do post-processing steps impact the generalisability of MRI-based radiogenomic prediction models? Findings: Applying post-processing steps, i.e., data harmonisation, identification of stable radiomic features, and removal of correlated features, before model development can improve model robustness and generalisability. Clinical relevance: Post-processing steps in MRI radiogenomic model generation lead to reliable non-invasive diagnostic tools for personalised cancer treatment strategies.
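
One of the post-processing steps named above, removal of highly correlated features, can be sketched as a greedy filter on pairwise Pearson correlation. This is an illustrative assumption about how such a step might look, not the study's implementation: the 0.9 threshold, the keep-first ordering, the function names, and the toy feature values are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def drop_correlated(features, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature is below threshold.

    features maps a feature name to its per-patient values; features are
    visited in insertion order, so earlier features win ties.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical radiomic features for four patients.
features = {
    "volume": [1.0, 2.0, 3.0, 4.0],
    "surface_area": [2.1, 3.9, 6.2, 8.0],  # nearly proportional to volume
    "entropy": [0.9, 0.2, 0.8, 0.3],
}
print(drop_correlated(features))  # ['volume', 'entropy']
```

Because the filter is greedy, the result depends on feature order; a real pipeline might instead rank features by stability or relevance before filtering.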

MRI Classification Retrospective Clinical In Silico Academic Lab Reproducibility