Latest Papers on Radiology AI. Tags: Reproducibility

Calibration and Uncertainty for multiRater Volume Assessment in multiorgan Segmentation (CURVAS) challenge results

Meritxell Riera-Marin, Sikha O K, Julia Rodriguez-Comas, Matthias Stefan May, Zhaohong Pan, Xiang Zhou, Xiaokun Liang, Franciskus Xaverius Erick, Andrea Prenner, Cedric Hemon, Valentin Boussot, Jean-Louis Dillenseger, Jean-Claude Nunes, Abdul Qayyum, Moona Mazher, Steven A Niederer, Kaisar Kushibar, Carlos Martin-Isla, Petia Radeva, Karim Lekadir, Theodore Barfoot, Luis C. Garcia Peraza Herrera, Ben Glocker, Tom Vercauteren, Lucas Gago, Justin Englemann, Joy-Marie Kleiss, Anton Aubanell, Andreu Antolin, Javier Garcia-Lopez, Miguel A. Gonzalez Ballester, Adrian Galdran

•preprint•May 13 2025

Deep learning (DL) has become the dominant approach for medical image segmentation, yet ensuring the reliability and clinical applicability of these models requires addressing key challenges such as annotation variability, calibration, and uncertainty estimation. This is why we created the Calibration and Uncertainty for multiRater Volume Assessment in multiorgan Segmentation (CURVAS) challenge, which highlights the critical role of multiple annotators in establishing a more comprehensive ground truth, emphasizing that segmentation is inherently subjective and that leveraging inter-annotator variability is essential for robust model evaluation. Seven teams participated in the challenge, submitting a variety of DL models evaluated using metrics such as Dice Similarity Coefficient (DSC), Expected Calibration Error (ECE), and Continuous Ranked Probability Score (CRPS). By incorporating consensus and dissensus ground truth, we assess how DL models handle uncertainty and whether their confidence estimates align with true segmentation performance. Our findings reinforce the importance of well-calibrated models, as better calibration is strongly correlated with the quality of the results. Furthermore, we demonstrate that segmentation models trained on diverse datasets and enriched with pre-trained knowledge exhibit greater robustness, particularly in cases deviating from standard anatomical structures. Notably, the best-performing models achieved high DSC and well-calibrated uncertainty estimates. This work underscores the need for multi-annotator ground truth, thorough calibration assessments, and uncertainty-aware evaluations to develop trustworthy and clinically reliable DL-based medical image segmentation models.

Mixed Modality Segmentation Whole Body Retrospective Clinical In Silico Consortium Benchmark SOTA Reproducibility
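
As a rough illustration of two of the metrics named in the CURVAS abstract above, the following is a minimal sketch of the Dice Similarity Coefficient and a binned Expected Calibration Error for a single binary mask. It is not the challenge's evaluation code: the function names, the 0.5 decision threshold, the ten equal-width bins, and the single-rater ground truth are all assumptions made for illustration, and the challenge itself evaluates against multiple annotators.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Dice Similarity Coefficient for two flattened binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Binned ECE for per-voxel foreground probabilities (binary case)."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        predicted = 1 if p >= 0.5 else 0           # assumed threshold
        confidence = p if predicted == 1 else 1.0 - p
        correct = 1.0 if predicted == y else 0.0
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, correct))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(k for _, k in members) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of voxels.
        ece += (len(members) / n) * abs(mean_conf - accuracy)
    return ece
```

For example, `dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])` gives 2/3, and a model that predicts 0.9 for two voxels of which only one is foreground has an ECE of about 0.4 under these assumptions.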

Enhancing Liver Fibrosis Measurement: Deep Learning and Uncertainty Analysis Across Multi-Centre Cohorts

Wojciechowska, M. K., Malacrino, S., Windell, D., Culver, E., Dyson, J., UK-AIH Consortium, Rittscher, J.

•preprint•May 13 2025

[Figure 1: Graphical abstract]

Highlights: A retrospective cohort of liver biopsies collected from over 20 healthcare centres has been assembled. The cohort is characterized on the basis of collagen staining used for liver fibrosis assessment. A computational pipeline for the quantification of collagen from liver histology slides has been developed and applied to the described cohorts. Uncertainty estimation is evaluated as a method to build trust in deep-learning based collagen predictions.

The introduction of digital pathology has revolutionised the way in which histology-based measurements can support large, multi-centre studies. However, pooling data from various centres often reveals significant differences in specimen quality, particularly regarding histological staining protocols. These variations present challenges in reliably quantifying features from stained tissue sections using image analysis. In this study, we investigate the statistical variation of measuring fibrosis across a liver cohort composed of four individual studies from 20 clinical sites across Europe and North America. In a first step, we apply colour consistency measurements to analyse staining variability across this diverse cohort. Subsequently, a learnt segmentation model is used to quantify the collagen proportionate area (CPA), and uncertainty mapping is employed to evaluate the quality of the segmentations. Our analysis highlights a lack of standardisation in PicroSirius Red (PSR) staining practices, revealing significant variability in staining protocols across institutions. The deconvolution of the staining of the digitised slides identified the different numbers and types of counterstains used, leading to potentially incomparable results. Our analysis highlights the need for standardised staining protocols to ensure reliable collagen quantification in liver biopsies. The tools and methodologies presented here can be applied to perform slide colour quality control in digital pathology studies, thus enhancing the comparability and reproducibility of fibrosis assessment in the liver and other tissues.

Mixed Modality Segmentation Abdominal Retrospective Clinical In Silico Academic Lab Reproducibility

AmygdalaGo-BOLT: an open and reliable AI tool to trace boundaries of human amygdala

Zhou, Q., Dong, B., Gao, P., Jintao, W., Xiao, J., Wang, W., Liang, P., Lin, D., Zuo, X.-N., He, H.

•preprint•May 13 2025

Each year, thousands of brain MRI scans are collected to study structural development in children and adolescents. However, the amygdala, a particularly small and complex structure, remains difficult to segment reliably, especially in developing populations where its volume is even smaller. To address this challenge, we developed AmygdalaGo-BOLT, a boundary-aware deep learning model tailored for human amygdala segmentation. It was trained and validated using 854 manually labeled scans from pediatric datasets, with independent samples used to ensure performance generalizability. The model integrates multiscale image features, spatial priors, and self-attention mechanisms within a compact encoder-decoder architecture to enhance boundary detection. Validation across multiple imaging centers and age groups shows that AmygdalaGo-BOLT closely matches expert manual labels, improves processing efficiency, and outperforms existing tools in accuracy. This enables robust and scalable analysis of amygdala morphology in developmental neuroimaging studies where manual tracing is impractical. To support open and reproducible science, we publicly release both the labeled datasets and the full source code.

MRI Segmentation Neurological Methodology In Silico Academic Lab Open Dataset Open Code Reproducibility

An incremental algorithm for non-convex AI-enhanced medical image processing

Elena Morotti

•preprint•May 13 2025

Solving non-convex regularized inverse problems is challenging due to their complex optimization landscapes and multiple local minima. However, these models remain widely studied as they often yield high-quality, task-oriented solutions, particularly in medical imaging, where the goal is to enhance clinically relevant features rather than merely minimizing global error. We propose incDG, a hybrid framework that integrates deep learning with incremental model-based optimization to efficiently approximate the $\ell_0$-optimal solution of imaging inverse problems. Built on the Deep Guess strategy, incDG exploits a deep neural network to generate effective initializations for a non-convex variational solver, which refines the reconstruction through regularized incremental iterations. This design combines the efficiency of Artificial Intelligence (AI) tools with the theoretical guarantees of model-based optimization, ensuring robustness and stability. We validate incDG on TpV-regularized optimization tasks, demonstrating its effectiveness in medical image deblurring and tomographic reconstruction across diverse datasets, including synthetic images, brain CT slices, and chest-abdomen scans. Results show that incDG outperforms both conventional iterative solvers and deep learning-based methods, achieving superior accuracy and stability. Moreover, we confirm that training incDG without ground truth does not significantly degrade performance, making it a practical and powerful tool for solving non-convex inverse problems in imaging and beyond.

CT Reconstruction Methodology In Silico Reproducibility

A comparison of performance of DeepSeek-R1 model-generated responses to musculoskeletal radiology queries against ChatGPT-4 and ChatGPT-4o - A feasibility study.

Uldin H, Saran S, Gandikota G, Iyengar KP, Vaishya R, Parmar Y, Rasul F, Botchu R

•papers•May 12 2025

Artificial Intelligence (AI) has transformed society and chatbots using Large Language Models (LLM) are playing an increasing role in scientific research. This study aims to assess and compare the efficacy of newer DeepSeek R1 and ChatGPT-4 and 4o models in answering scientific questions about recent research. We compared output generated from ChatGPT-4, ChatGPT-4o, and DeepSeek-R1 in response to ten standardized questions in the setting of musculoskeletal (MSK) radiology. These were independently analyzed by one MSK radiologist and one final-year MSK radiology trainee and graded using a Likert scale from 1 to 5 (1 being inaccurate to 5 being accurate). Five DeepSeek answers were significantly inaccurate and provided fictitious references only on prompting. All ChatGPT-4 and 4o answers were well-written with good content, the latter including useful and comprehensive references. ChatGPT-4o generates structured research answers to questions on recent MSK radiology research with useful references in all our cases, enabling reliable usage. DeepSeek-R1 generates articles that, on the other hand, may appear authentic to the unsuspecting eye but contain a higher amount of falsified and inaccurate information in the current version. Further iterations may improve these accuracies.

Mixed Modality LLM Radiology Report Musculoskeletal Retrospective Clinical In Silico Academic Lab GenAI Reproducibility

Reproducing and Improving CheXNet: Deep Learning for Chest X-ray Disease Classification

Daniel Strick, Carlos Garcia, Anthony Huang

•preprint•May 10 2025

Deep learning for radiologic image analysis is a rapidly growing field in biomedical research and is likely to become a standard practice in modern medicine. On the publicly available NIH ChestX-ray14 dataset, containing X-ray images that are classified by the presence or absence of 14 different diseases, we reproduced an algorithm known as CheXNet, as well as explored other algorithms that outperform CheXNet's baseline metrics. Model performance was primarily evaluated using the F1 score and AUC-ROC, both of which are critical metrics for imbalanced, multi-label classification tasks in medical imaging. The best model achieved an average AUC-ROC score of 0.85 and an average F1 score of 0.39 across all 14 disease classifications present in the dataset.

X-Ray Classification Chest Methodology In Silico Benchmark SOTA Reproducibility Open Dataset
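
The CheXNet reproduction above reports AUC-ROC and F1 averaged over 14 disease labels. The sketch below shows one way such per-class scores could be computed and macro-averaged. It is not the authors' code: the function names are hypothetical, the pairwise AUC formulation is chosen for readability rather than speed, and the threshold used to binarise predictions before computing F1 is left to the caller.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for one binary label: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # Assumed convention: F1 is 0 when there are no positives at all.
    return 0.0 if denom == 0 else 2 * tp / denom


def auc_roc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC as the probability that a positive outranks a negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative")
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_average(per_class_scores: Sequence[float]) -> float:
    """Unweighted mean over classes, e.g. over the 14 disease labels."""
    if not per_class_scores:
        raise ValueError("need at least one class score")
    return sum(per_class_scores) / len(per_class_scores)
```

Because F1 depends on the decision threshold while AUC-ROC does not, a model can pair a fairly high average AUC-ROC with a much lower average F1 on imbalanced labels, which is the pattern the abstract reports (0.85 versus 0.39).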

A novel framework for esophageal cancer grading: combining CT imaging, radiomics, reproducibility, and deep learning insights.

Alsallal M, Ahmed HH, Kareem RA, Yadav A, Ganesan S, Shankhyan A, Gupta S, Joshi KK, Sameer HN, Yaseen A, Athab ZH, Adil M, Farhood B

•papers•May 10 2025

This study aims to create a reliable framework for grading esophageal cancer. The framework combines feature extraction, deep learning with attention mechanisms, and radiomics to ensure accuracy, interpretability, and practical use in tumor analysis. This retrospective study used data from 2,560 esophageal cancer patients across multiple clinical centers, collected from 2018 to 2023. The dataset included CT scan images and clinical information, representing a variety of cancer grades and types. Standardized CT imaging protocols were followed, and experienced radiologists manually segmented the tumor regions. Only high-quality data were used in the study. A total of 215 radiomic features were extracted using the SERA platform. The study used two deep learning models, DenseNet121 and EfficientNet-B0, enhanced with attention mechanisms to improve accuracy. A combined classification approach used both radiomic and deep learning features, and machine learning models like Random Forest, XGBoost, and CatBoost were applied. These models were validated with strict training and testing procedures to ensure effective cancer grading. This study analyzed the reliability and performance of radiomic and deep learning features for grading esophageal cancer. Radiomic features were classified into four reliability levels based on their ICC (Intraclass Correlation) values. Most of the features had excellent (ICC > 0.90) or good (0.75 < ICC ≤ 0.90) reliability. Deep learning features extracted from DenseNet121 and EfficientNet-B0 were also categorized, and some of them showed poor reliability. The machine learning models, including XGBoost and CatBoost, were tested for their ability to grade cancer. XGBoost with Recursive Feature Elimination (RFE) gave the best results for radiomic features, with an AUC (Area Under the Curve) of 91.36%. For deep learning features, XGBoost with Principal Component Analysis (PCA) gave the best results using DenseNet121, while CatBoost with RFE performed best with EfficientNet-B0, achieving an AUC of 94.20%. Combining radiomic and deep features led to significant improvements, with XGBoost achieving the highest AUC of 96.70%, accuracy of 96.71%, and sensitivity of 95.44%. The combination of both DenseNet121 and EfficientNet-B0 models in ensemble models achieved the best overall performance, with an AUC of 95.14% and accuracy of 94.88%. This study improves esophageal cancer grading by combining radiomics and deep learning. It enhances diagnostic accuracy, reproducibility, and interpretability, while also helping in personalized treatment planning through better tumor characterization.

CT Classification Abdominal Retrospective Clinical In Silico Academic Lab Reproducibility
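
The esophageal cancer abstract above sorts features into four reliability levels by ICC but states only the two upper cut-offs (excellent: ICC > 0.90; good: 0.75 < ICC ≤ 0.90). The sketch below shows how such a binning might look. The "moderate" and "poor" labels and the 0.50 boundary between them are assumptions borrowed from a commonly used convention, not values given in the abstract, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def icc_reliability_level(icc: float) -> str:
    """Map an ICC value to a reliability label."""
    if icc > 0.90:
        return "excellent"   # stated in the abstract
    if icc > 0.75:
        return "good"        # stated in the abstract
    if icc >= 0.50:
        return "moderate"    # ASSUMED cut-off, not from the abstract
    return "poor"            # ASSUMED label for everything below 0.50


def summarise_reliability(icc_values: Iterable[float]) -> Dict[str, int]:
    """Count how many features fall into each reliability level."""
    counts = Counter(icc_reliability_level(v) for v in icc_values)
    return {level: counts.get(level, 0)
            for level in ("excellent", "good", "moderate", "poor")}
```

Calling `summarise_reliability` on the per-feature ICC values would give the kind of breakdown the abstract describes, for instance to filter out low-reliability deep features before model training.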

Systematic review and epistemic meta-analysis to advance binomial AI-radiomics integration for predicting high-grade glioma progression and enhancing patient management.

Chilaca-Rosas MF, Contreras-Aguilar MT, Pallach-Loose F, Altamirano-Bustamante NF, Salazar-Calderon DR, Revilla-Monsalve C, Heredia-Gutiérrez JC, Conde-Castro B, Medrano-Guzmán R, Altamirano-Bustamante MM

•papers•May 8 2025

High-grade gliomas, particularly glioblastoma (MeSH:Glioblastoma), are among the most aggressive and lethal central nervous system tumors, necessitating advanced diagnostic and prognostic strategies. This systematic review and epistemic meta-analysis explore the integration of Artificial Intelligence (AI) and Radiomics Inter-field (AIRI) to enhance predictive modeling for tumor progression. A comprehensive literature search identified 19 high-quality studies, which were analyzed to evaluate radiomic features and machine learning models in predicting overall survival (OS) and progression-free survival (PFS). Key findings highlight the predictive strength of specific MRI-derived radiomic features such as log-filter and Gabor textures and the superior performance of Support Vector Machines (SVM) and Random Forest (RF) models, achieving high accuracy and AUC scores (e.g., 98% AUC and 98.7% accuracy for OS). This research demonstrates the current state of the AIRI field and shows that current articles report their results with different performance indicators and metrics, making outcomes heterogeneous and knowledge hard to integrate. Additionally, some articles were found to use biased methodologies. This study proposes a structured AIRI development roadmap and guidelines to avoid bias and make results comparable, emphasizing standardized feature extraction and AI model training to improve reproducibility across clinical settings. By advancing precision medicine, AIRI integration has the potential to refine clinical decision-making and enhance patient outcomes.

MRI Classification Neurological Meta Analysis In Silico Academic Lab Benchmark SOTA Reproducibility

Machine learning-based approaches for distinguishing viral and bacterial pneumonia in paediatrics: A scoping review.

Rickard D, Kabir MA, Homaira N

•papers•May 8 2025

Pneumonia is the leading cause of hospitalisation and mortality among children under five, particularly in low-resource settings. Accurate differentiation between viral and bacterial pneumonia is essential for guiding appropriate treatment, yet it remains challenging due to overlapping clinical and radiographic features. Advances in machine learning (ML), particularly deep learning (DL), have shown promise in classifying pneumonia using chest X-ray (CXR) images. This scoping review summarises the evidence on ML techniques for classifying viral and bacterial pneumonia using CXR images in paediatric patients. This scoping review was conducted following the Joanna Briggs Institute methodology and the PRISMA-ScR guidelines. A comprehensive search was performed in PubMed, Embase, and Scopus to identify studies involving children (0-18 years) with pneumonia diagnosed through CXR, using ML models for binary or multiclass classification. Data extraction included ML models, dataset characteristics, and performance metrics. A total of 35 studies, published between 2018 and 2025, were included in this review. Of these, 31 studies used the publicly available Kermany dataset, raising concerns about overfitting and limited generalisability to broader, real-world clinical populations. Most studies (n=33) used convolutional neural networks (CNNs) for pneumonia classification. While many models demonstrated promising performance, significant variability was observed due to differences in methodologies, dataset sizes, and validation strategies, complicating direct comparisons. For binary classification (viral vs bacterial pneumonia), a median accuracy of 92.3% (range: 80.8% to 97.9%) was reported. For multiclass classification (healthy, viral pneumonia, and bacterial pneumonia), the median accuracy was 91.8% (range: 76.8% to 99.7%). 
Current evidence is constrained by a predominant reliance on a single dataset and variability in methodologies, which limit the generalisability and clinical applicability of findings. To address these limitations, future research should focus on developing diverse and representative datasets while adhering to standardised reporting guidelines. Such efforts are essential to improve the reliability, reproducibility, and translational potential of machine learning models in clinical settings.

X-Ray Classification Chest Review In Silico Academic Lab Reproducibility

False Promises in Medical Imaging AI? Assessing Validity of Outperformance Claims

Evangelia Christodoulou, Annika Reinke, Pascaline Andrè, Patrick Godau, Piotr Kalinowski, Rola Houhou, Selen Erkan, Carole H. Sudre, Ninon Burgos, Sofiène Boutaj, Sophie Loizillon, Maëlys Solal, Veronika Cheplygina, Charles Heitz, Michal Kozubek, Michela Antonelli, Nicola Rieke, Antoine Gilson, Leon D. Mayer, Minu D. Tizabi, M. Jorge Cardoso, Amber Simpson, Annette Kopp-Schneider, Gaël Varoquaux, Olivier Colliot, Lena Maier-Hein

•preprint•May 7 2025

Performance comparisons are fundamental in medical imaging Artificial Intelligence (AI) research, often driving claims of superiority based on relative improvements in common performance metrics. However, such claims frequently rely solely on empirical mean performance. In this paper, we investigate whether newly proposed methods genuinely outperform the state of the art by analyzing a representative cohort of medical imaging papers. We quantify the probability of false claims based on a Bayesian approach that leverages reported results alongside empirically estimated model congruence to estimate whether the relative ranking of methods is likely to have occurred by chance. According to our results, the majority (>80%) of papers claim outperformance when introducing a new method. Our analysis further revealed a high probability (>5%) of false outperformance claims in 86% of classification papers and 53% of segmentation papers. These findings highlight a critical flaw in current benchmarking practices: claims of outperformance in medical imaging AI are frequently unsubstantiated, posing a risk of misdirecting future research efforts.

Classification Review In Silico Academic Lab Ethics Policy Reproducibility
