Latest Papers on Radiology AI. Tags: Reproducibility, Order: Best Match, Limit: 10.

FreqSelect: Frequency-Aware fMRI-to-Image Reconstruction

Junliang Ye, Lei Wang, Md Zakir Hossain

•preprint•May 18 2025

Reconstructing natural images from functional magnetic resonance imaging (fMRI) data remains a core challenge in neural decoding due to the mismatch between the richness of visual stimuli and the noisy, low-resolution nature of fMRI signals. While recent two-stage models, combining deep variational autoencoders (VAEs) with diffusion models, have advanced this task, they treat all spatial-frequency components of the input equally. This uniform treatment forces the model to extract meaningful features and suppress irrelevant noise simultaneously, limiting its effectiveness. We introduce FreqSelect, a lightweight, adaptive module that selectively filters spatial-frequency bands before encoding. By dynamically emphasizing frequencies that are most predictive of brain activity and suppressing those that are uninformative, FreqSelect acts as a content-aware gate between image features and neural data. It integrates seamlessly into standard very deep VAE-diffusion pipelines and requires no additional supervision. Evaluated on the Natural Scenes dataset, FreqSelect consistently improves reconstruction quality across both low- and high-level metrics. Beyond performance gains, the learned frequency-selection patterns offer interpretable insights into how different visual frequencies are represented in the brain. Our method generalizes across subjects and scenes, and holds promise for extension to other neuroimaging modalities, offering a principled approach to enhancing both decoding accuracy and neuroscientific interpretability.

MRI Reconstruction Neurological Methodology In Silico Reproducibility

External Validation of a CT-Based Radiogenomics Model for the Detection of EGFR Mutation in NSCLC and the Impact of Prevalence in Model Building by Using Synthetic Minority Over Sampling (SMOTE): Lessons Learned.

Kohan AA, Mirshahvalad SA, Hinzpeter R, Kulanthaivelu R, Avery L, Ortega C, Metser U, Hope A, Veit-Haibach P

•papers•May 15 2025

Radiogenomics holds promise in identifying molecular alterations in nonsmall cell lung cancer (NSCLC) using imaging features. Previously, we developed a radiogenomics model to predict epidermal growth factor receptor (EGFR) mutations based on contrast-enhanced computed tomography (CECT) in NSCLC patients. The current study aimed to externally validate this model using a publicly available National Institutes of Health (NIH)-based NSCLC dataset and assess the effect of EGFR mutation prevalence on model performance through synthetic minority oversampling technique (SMOTE). The original radiogenomics model was validated on an independent NIH cohort (n=140). For assessing the influence of disease prevalence, six SMOTE-augmented datasets were created, simulating EGFR mutation prevalence from 25% to 50%. Seven models were developed (one from original data, six SMOTE-augmented), each undergoing rigorous cross-validation, feature selection, and logistic regression modeling. Models were tested against the NIH cohort. Performance was compared using the area under the receiver operating characteristic curve (AUC), and differences between radiomic-only, clinical-only, and combined models were statistically assessed. External validation revealed poor diagnostic performance for both our model and a previously published EGFR radiomics model (AUC ∼0.5). The clinical model alone achieved higher diagnostic accuracy (AUC 0.74). SMOTE-augmented models showed increased sensitivity but did not improve overall AUC compared to the clinical-only model. Changing EGFR mutation prevalence had minimal impact on AUC, challenging previous assumptions about the influence of sample imbalance on model performance. External validation failed to reproduce prior radiogenomics model performance, while clinical variables alone retained strong predictive value.
SMOTE-based oversampling did not improve diagnostic accuracy, suggesting that, in EGFR prediction, radiomics may offer limited value beyond clinical data. Emphasis on robust external validation and data-sharing is essential for future clinical implementation of radiogenomic models.
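As a rough illustration of how oversampling to a target prevalence works, the sketch below computes how many synthetic minority samples are needed to reach a given prevalence and generates them by SMOTE-style interpolation between a minority sample and one of its nearest minority neighbours. It is a minimal standard-library sketch, not the authors' pipeline: the function names, the choice of `k`, and the cohort sizes in the usage lines are all hypothetical.

```python
import math
import random


def synthetic_count(n_minority, n_majority, target_prevalence):
    """Synthetic minority samples s needed so that
    (n_minority + s) / (n_minority + s + n_majority) >= target_prevalence."""
    needed = target_prevalence * n_majority / (1.0 - target_prevalence) - n_minority
    return max(0, math.ceil(needed))


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic points by interpolating between a randomly chosen
    minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical cohort: 20 mutation-positive and 120 mutation-negative cases.
extra = synthetic_count(20, 120, 0.25)
print(extra, (20 + extra) / (20 + extra + 120))
```

Because every synthetic point lies on a segment between two existing minority samples, oversampling changes class balance without adding information from outside the original feature distribution, which is consistent with the abstract's finding that AUC barely moved.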

CT Classification Chest Retrospective Clinical In Silico Academic Lab Reproducibility

Challenges in Implementing Artificial Intelligence in Breast Cancer Screening Programs: Systematic Review and Framework for Safe Adoption.

Goh S, Goh RSJ, Chong B, Ng QX, Koh GCH, Ngiam KY, Hartman M

•papers•May 15 2025

Artificial intelligence (AI) studies show promise in enhancing accuracy and efficiency in mammographic screening programs worldwide. However, its integration into clinical workflows faces several challenges, including unintended errors, the need for professional training, and ethical concerns. Notably, specific frameworks for AI imaging in breast cancer screening are still lacking. This study aims to identify the challenges associated with implementing AI in breast screening programs and to apply the Consolidated Framework for Implementation Research (CFIR) to discuss a practical governance framework for AI in this context. Three electronic databases (PubMed, Embase, and MEDLINE) were searched using combinations of the keywords "artificial intelligence," "regulation," "governance," "breast cancer," and "screening." Original studies evaluating AI in breast cancer detection or discussing challenges related to AI implementation in this setting were eligible for review. Findings were narratively synthesized and subsequently mapped directly onto the constructs within the CFIR. A total of 1240 results were retrieved, with 20 original studies ultimately included in this systematic review. The majority (n=19) focused on AI-enhanced mammography, while 1 addressed AI-enhanced ultrasound for women with dense breasts. Most studies originated from the United States (n=5) and the United Kingdom (n=4), with publication years ranging from 2019 to 2023. The quality of papers was rated as moderate to high. The key challenges identified were reproducibility, evidentiary standards, technological concerns, trust issues, as well as ethical, legal, societal concerns, and postadoption uncertainty. By aligning these findings with the CFIR constructs, action plans targeting the main challenges were incorporated into the framework, facilitating a structured approach to addressing these issues. 
This systematic review identifies key challenges in implementing AI in breast cancer screening, emphasizing the need for consistency, robust evidentiary standards, technological advancements, user trust, ethical frameworks, legal safeguards, and societal benefits. These findings can serve as a blueprint for policy makers, clinicians, and AI developers to collaboratively advance AI adoption in breast cancer screening. PROSPERO CRD42024553889; https://tinyurl.com/mu4nwcxt.

Mammography Detection Breast Review In Silico Academic Lab Policy Ethics Reproducibility

Fed-ComBat: A Generalized Federated Framework for Batch Effect Harmonization in Collaborative Studies

Silva, S., Lorenzi, M., Altmann, A., Oxtoby, N.

•preprint•May 14 2025

In neuroimaging research, the utilization of multi-centric analyses is crucial for obtaining sufficient sample sizes and representative clinical populations. Data harmonization techniques are typically part of the pipeline in multi-centric studies to address systematic biases and ensure the comparability of the data. However, most multi-centric studies require centralized data, which may result in exposing individual patient information. This poses a significant challenge in data governance, leading to the implementation of regulations such as the GDPR and the CCPA, which attempt to address these concerns but also hinder data access for researchers. Federated learning offers a privacy-preserving alternative approach in machine learning, enabling models to be collaboratively trained on decentralized data without the need for data centralization or sharing. In this paper, we present Fed-ComBat, a federated framework for batch effect harmonization on decentralized data. Fed-ComBat extends existing linear methods, both centralized (ComBat) and distributed (d-ComBat), as well as nonlinear approaches like ComBat-GAM, by accounting for potentially nonlinear and multivariate covariate effects. By doing so, Fed-ComBat enables the preservation of nonlinear covariate effects without requiring centralization of data and without prior knowledge of which variables should be considered nonlinear or their interactions, differentiating it from ComBat-GAM. We assessed Fed-ComBat and existing approaches on simulated data and multiple cohorts comprising healthy controls (CN) and subjects with various disorders such as Parkinson's disease (PD), Alzheimer's disease (AD), and autism spectrum disorder (ASD). The results of our study show that Fed-ComBat performs better than centralized ComBat when dealing with nonlinear effects and is on par with centralized methods like ComBat-GAM.
Through experiments using synthetic data, Fed-ComBat demonstrates a superior ability to reconstruct the target unbiased function, achieving a 35% improvement (RMSE=0.5952) compared to d-ComBat (RMSE=0.9162) and a 12% improvement compared to our proposal to federate ComBat-GAM, d-ComBat-GAM (RMSE=0.6751). Additionally, Fed-ComBat achieves comparable results to centralized methods like ComBat-GAM for MRI-derived phenotypes without requiring prior knowledge of potential nonlinearities.
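The quoted percentages can be checked directly from the reported RMSE values. The short sketch below assumes "improvement" means the relative RMSE reduction with respect to the comparison method, which is an interpretation rather than something the abstract states; the function name is illustrative.

```python
def relative_improvement(baseline_rmse, new_rmse):
    """Fractional RMSE reduction of the new method relative to the baseline."""
    return (baseline_rmse - new_rmse) / baseline_rmse


fed_combat = 0.5952      # reported RMSE for Fed-ComBat
d_combat = 0.9162        # reported RMSE for d-ComBat
d_combat_gam = 0.6751    # reported RMSE for d-ComBat-GAM

print(f"vs d-ComBat:     {relative_improvement(d_combat, fed_combat):.1%}")
print(f"vs d-ComBat-GAM: {relative_improvement(d_combat_gam, fed_combat):.1%}")
```

Under that reading, both figures round to the 35% and 12% given in the abstract.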

MRI Reconstruction Neurological Methodology In Silico Academic Lab Reproducibility

Calibration and Uncertainty for multiRater Volume Assessment in multiorgan Segmentation (CURVAS) challenge results

Meritxell Riera-Marin, Sikha O K, Julia Rodriguez-Comas, Matthias Stefan May, Zhaohong Pan, Xiang Zhou, Xiaokun Liang, Franciskus Xaverius Erick, Andrea Prenner, Cedric Hemon, Valentin Boussot, Jean-Louis Dillenseger, Jean-Claude Nunes, Abdul Qayyum, Moona Mazher, Steven A Niederer, Kaisar Kushibar, Carlos Martin-Isla, Petia Radeva, Karim Lekadir, Theodore Barfoot, Luis C. Garcia Peraza Herrera, Ben Glocker, Tom Vercauteren, Lucas Gago, Justin Englemann, Joy-Marie Kleiss, Anton Aubanell, Andreu Antolin, Javier Garcia-Lopez, Miguel A. Gonzalez Ballester, Adrian Galdran

•preprint•May 13 2025

Deep learning (DL) has become the dominant approach for medical image segmentation, yet ensuring the reliability and clinical applicability of these models requires addressing key challenges such as annotation variability, calibration, and uncertainty estimation. This is why we created the Calibration and Uncertainty for multiRater Volume Assessment in multiorgan Segmentation (CURVAS), which highlights the critical role of multiple annotators in establishing a more comprehensive ground truth, emphasizing that segmentation is inherently subjective and that leveraging inter-annotator variability is essential for robust model evaluation. Seven teams participated in the challenge, submitting a variety of DL models evaluated using metrics such as Dice Similarity Coefficient (DSC), Expected Calibration Error (ECE), and Continuous Ranked Probability Score (CRPS). By incorporating consensus and dissensus ground truth, we assess how DL models handle uncertainty and whether their confidence estimates align with true segmentation performance. Our findings reinforce the importance of well-calibrated models, as better calibration is strongly correlated with the quality of the results. Furthermore, we demonstrate that segmentation models trained on diverse datasets and enriched with pre-trained knowledge exhibit greater robustness, particularly in cases deviating from standard anatomical structures. Notably, the best-performing models achieved high DSC and well-calibrated uncertainty estimates. This work underscores the need for multi-annotator ground truth, thorough calibration assessments, and uncertainty-aware evaluations to develop trustworthy and clinically reliable DL-based medical image segmentation models.
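For readers unfamiliar with two of the metrics named above, the following sketch computes the Dice Similarity Coefficient on flattened binary masks and a binned Expected Calibration Error on per-voxel foreground probabilities. It uses the common textbook definitions with equal-width confidence bins; the challenge's exact formulations (for example, how multiple raters are combined) may differ, and the function names are illustrative.

```python
def dice(pred, truth):
    """Dice Similarity Coefficient for two flat binary masks (0/1 values)."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(pred) + sum(truth)
    # Two empty masks are treated as a perfect match by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - confidence| over bins.

    probs are foreground probabilities; the predicted class is foreground
    when p >= 0.5, and confidence is the probability of the predicted class.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(a for _, a in members) / len(members)
            ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


print(dice([1, 1, 0, 0], [1, 0, 0, 0]))
print(expected_calibration_error([1.0, 1.0], [1, 0]))
```

A model that is fully confident but right only half the time gets an ECE of 0.5 here, while a model whose confidence matches its accuracy gets 0, which is the sense in which the abstract speaks of confidence estimates aligning with true performance.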

Mixed Modality Segmentation Whole Body Retrospective Clinical In Silico Consortium Benchmark SOTA Reproducibility

Enhancing Liver Fibrosis Measurement: Deep Learning and Uncertainty Analysis Across Multi-Centre Cohorts

Wojciechowska, M. K., Malacrino, S., Windell, D., Culver, E., Dyson, J., UK-AIH Consortium, Rittscher, J.

•preprint•May 13 2025

[Figure 1: Graphical Abstract]
Highlights: A retrospective cohort of liver biopsies collected from over 20 healthcare centres has been assembled. The cohort is characterized on the basis of collagen staining used for liver fibrosis assessment. A computational pipeline for the quantification of collagen from liver histology slides has been developed and applied to the described cohorts. Uncertainty estimation is evaluated as a method to build trust in deep-learning based collagen predictions.
The introduction of digital pathology has revolutionised the way in which histology-based measurements can support large, multi-centre studies. However, pooling data from various centres often reveals significant differences in specimen quality, particularly regarding histological staining protocols. These variations present challenges in reliably quantifying features from stained tissue sections using image analysis. In this study, we investigate the statistical variation of measuring fibrosis across a liver cohort composed of four individual studies from 20 clinical sites across Europe and North America. In a first step, we apply colour consistency measurements to analyse staining variability across this diverse cohort. Subsequently, a learnt segmentation model is used to quantify the collagen proportionate area (CPA), and uncertainty mapping is employed to evaluate the quality of the segmentations. Our analysis highlights a lack of standardisation in PicroSirius Red (PSR) staining practices, revealing significant variability in staining protocols across institutions.
The deconvolution of the staining of the digitised slides identified the different numbers and types of counterstains used, leading to potentially incomparable results. Our analysis highlights the need for standardised staining protocols to ensure reliable collagen quantification in liver biopsies. The tools and methodologies presented here can be applied to perform slide colour quality control in digital pathology studies, thus enhancing the comparability and reproducibility of fibrosis assessment in the liver and other tissues.
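The collagen proportionate area (CPA) referred to above is commonly understood as the fraction of tissue area that is collagen. The sketch below computes it from two binary masks under that assumption; the mask layout, the function name, and the choice to count only collagen pixels that fall inside the tissue mask are illustrative and not taken from the authors' pipeline.

```python
def collagen_proportionate_area(collagen_mask, tissue_mask):
    """CPA = collagen pixels inside tissue / tissue pixels.

    Both masks are 2D lists of 0/1 values with the same shape.
    """
    tissue_pixels = 0
    collagen_pixels = 0
    for collagen_row, tissue_row in zip(collagen_mask, tissue_mask):
        for is_collagen, is_tissue in zip(collagen_row, tissue_row):
            if is_tissue:
                tissue_pixels += 1
                if is_collagen:
                    collagen_pixels += 1
    if tissue_pixels == 0:
        raise ValueError("tissue mask is empty")
    return collagen_pixels / tissue_pixels


# Toy 2x4 slide: 6 tissue pixels, 2 of them segmented as collagen.
tissue = [[1, 1, 1, 0],
          [1, 1, 1, 0]]
collagen = [[1, 0, 0, 0],
            [0, 1, 0, 1]]   # the last pixel lies outside tissue and is ignored
print(collagen_proportionate_area(collagen, tissue))
```

Since CPA is a ratio of segmented areas, any staining difference that shifts the collagen segmentation also shifts the measurement, which is why the abstract ties reliable quantification to staining standardisation.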

Mixed Modality Segmentation Abdominal Retrospective Clinical In Silico Academic Lab Reproducibility

AmygdalaGo-BOLT: an open and reliable AI tool to trace boundaries of human amygdala

Zhou, Q., Dong, B., Gao, P., Jintao, W., Xiao, J., Wang, W., Liang, P., Lin, D., Zuo, X.-N., He, H.

•preprint•May 13 2025

Each year, thousands of brain MRI scans are collected to study structural development in children and adolescents. However, the amygdala, a particularly small and complex structure, remains difficult to segment reliably, especially in developing populations where its volume is even smaller. To address this challenge, we developed AmygdalaGo-BOLT, a boundary-aware deep learning model tailored for human amygdala segmentation. It was trained and validated using 854 manually labeled scans from pediatric datasets, with independent samples used to ensure performance generalizability. The model integrates multiscale image features, spatial priors, and self-attention mechanisms within a compact encoder-decoder architecture to enhance boundary detection. Validation across multiple imaging centers and age groups shows that AmygdalaGo-BOLT closely matches expert manual labels, improves processing efficiency, and outperforms existing tools in accuracy. This enables robust and scalable analysis of amygdala morphology in developmental neuroimaging studies where manual tracing is impractical. To support open and reproducible science, we publicly release both the labeled datasets and the full source code.

MRI Segmentation Neurological Methodology In Silico Academic Lab Open Dataset Open Code Reproducibility

An incremental algorithm for non-convex AI-enhanced medical image processing

Elena Morotti

•preprint•May 13 2025

Solving non-convex regularized inverse problems is challenging due to their complex optimization landscapes and multiple local minima. However, these models remain widely studied as they often yield high-quality, task-oriented solutions, particularly in medical imaging, where the goal is to enhance clinically relevant features rather than merely minimizing global error. We propose incDG, a hybrid framework that integrates deep learning with incremental model-based optimization to efficiently approximate the $\ell_0$-optimal solution of imaging inverse problems. Built on the Deep Guess strategy, incDG exploits a deep neural network to generate effective initializations for a non-convex variational solver, which refines the reconstruction through regularized incremental iterations. This design combines the efficiency of Artificial Intelligence (AI) tools with the theoretical guarantees of model-based optimization, ensuring robustness and stability. We validate incDG on TpV-regularized optimization tasks, demonstrating its effectiveness in medical image deblurring and tomographic reconstruction across diverse datasets, including synthetic images, brain CT slices, and chest-abdomen scans. Results show that incDG outperforms both conventional iterative solvers and deep learning-based methods, achieving superior accuracy and stability. Moreover, we confirm that training incDG without ground truth does not significantly degrade performance, making it a practical and powerful tool for solving non-convex inverse problems in imaging and beyond.

CT Reconstruction Methodology In Silico Reproducibility

A comparison of performance of DeepSeek-R1 model-generated responses to musculoskeletal radiology queries against ChatGPT-4 and ChatGPT-4o - A feasibility study.

Uldin H, Saran S, Gandikota G, Iyengar KP, Vaishya R, Parmar Y, Rasul F, Botchu R

•papers•May 12 2025

Artificial Intelligence (AI) has transformed society and chatbots using Large Language Models (LLM) are playing an increasing role in scientific research. This study aims to assess and compare the efficacy of newer DeepSeek R1 and ChatGPT-4 and 4o models in answering scientific questions about recent research. We compared output generated from ChatGPT-4, ChatGPT-4o, and DeepSeek-R1 in response to ten standardized questions in the setting of musculoskeletal (MSK) radiology. These were independently analyzed by one MSK radiologist and one final-year MSK radiology trainee and graded using a Likert scale from 1 to 5 (1 being inaccurate to 5 being accurate). Five DeepSeek answers were significantly inaccurate and provided fictitious references only on prompting. All ChatGPT-4 and 4o answers were well-written with good content, the latter including useful and comprehensive references. ChatGPT-4o generates structured research answers to questions on recent MSK radiology research with useful references in all our cases, enabling reliable usage. DeepSeek-R1 generates articles that, on the other hand, may appear authentic to the unsuspecting eye but contain a higher amount of falsified and inaccurate information in the current version. Further iterations may improve these accuracies.

Mixed Modality LLM Radiology Report Musculoskeletal Retrospective Clinical In Silico Academic Lab GenAI Reproducibility

Reproducing and Improving CheXNet: Deep Learning for Chest X-ray Disease Classification

Daniel Strick, Carlos Garcia, Anthony Huang

•preprint•May 10 2025

Deep learning for radiologic image analysis is a rapidly growing field in biomedical research and is likely to become a standard practice in modern medicine. On the publicly available NIH ChestX-ray14 dataset, containing X-ray images that are classified by the presence or absence of 14 different diseases, we reproduced an algorithm known as CheXNet and explored other algorithms that outperform CheXNet's baseline metrics. Model performance was primarily evaluated using the F1 score and AUC-ROC, both of which are critical metrics for imbalanced, multi-label classification tasks in medical imaging. The best model achieved an average AUC-ROC score of 0.85 and an average F1 score of 0.39 across all 14 disease classifications present in the dataset.
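To make the two reported metrics concrete, the sketch below computes AUC-ROC as the probability that a positive case is scored above a negative one, the F1 score from thresholded predictions, and an unweighted average across labels. The abstract does not say how the averages were taken, so the unweighted (macro) mean, the 0.5 threshold, and the toy two-label data are assumptions for illustration only.

```python
def roc_auc(scores, labels):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def f1_score(preds, labels):
    """F1 = 2*TP / (2*TP + FP + FN) for binary predictions."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    denominator = 2 * tp + fp + fn
    return 0.0 if denominator == 0 else 2.0 * tp / denominator


def macro_average(values):
    """Unweighted mean across labels."""
    return sum(values) / len(values)


# Toy multi-label example with two hypothetical disease labels.
scores = {"label_a": [0.9, 0.8, 0.3, 0.1], "label_b": [0.7, 0.2, 0.6, 0.4]}
truth = {"label_a": [1, 0, 1, 0], "label_b": [1, 0, 1, 0]}

aucs = [roc_auc(scores[k], truth[k]) for k in scores]
f1s = [f1_score([1 if s >= 0.5 else 0 for s in scores[k]], truth[k]) for k in scores]
print(macro_average(aucs), macro_average(f1s))
```

AUC-ROC depends only on how cases are ranked, while F1 depends on the chosen threshold and on class prevalence, which is one reason a model can pair a fairly high average AUC-ROC with a much lower average F1 on an imbalanced dataset.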

X-Ray Classification Chest Methodology In Silico Benchmark SOTA Reproducibility Open Dataset
