Latest Papers on Radiology AI. Tags: Reproducibility

Understanding Dataset Bias in Medical Imaging: A Case Study on Chest X-rays

Ethan Dack, Chengliang Dai

•preprint•Jul 10 2025

Recent work has revisited the infamous task "Name That Dataset" and established that non-medical datasets carry an underlying bias, achieving high accuracies on the dataset origin task. In this work, we revisit the same task applied to popular open-source chest X-ray datasets. Medical images are naturally more difficult to release as open source due to their sensitive nature, which has led to certain open-source datasets being extremely popular for research purposes. By performing the same task, we wish to explore whether dataset bias also exists in these datasets. We apply simple transformations of the datasets to try to identify bias. Given the importance of AI applications in medical imaging, it's vital to establish whether modern methods are taking shortcuts or are focused on the relevant pathology. We implement a range of different network architectures on the datasets: NIH, CheXpert, MIMIC-CXR and PadChest. We hope this work will encourage more explainable research being performed in medical imaging and the creation of more open-source datasets in the medical domain. The corresponding code will be released upon acceptance.

X-Ray Classification Chest Methodology In Silico Open Dataset Open Code Reproducibility

MRI-based interpretable clinicoradiological and radiomics machine learning model for preoperative prediction of pituitary macroadenomas consistency: a dual-center study.

Liang M, Wang F, Yang Y, Wen L, Wang S, Zhang D

•papers•Jul 9 2025

To establish an interpretable and non-invasive machine learning (ML) model using clinicoradiological predictors and magnetic resonance imaging (MRI) radiomics features to predict the consistency of pituitary macroadenomas (PMAs) preoperatively. A total of 350 patients with PMA (272 from Xinqiao Hospital of Army Medical University and 78 from Daping Hospital of Army Medical University) were stratified and randomly divided into training and test cohorts in a 7:3 ratio. The tumor consistency was classified as soft or firm. Clinicoradiological predictors were examined utilizing univariate and multivariate regression analyses. Radiomics features were selected employing the minimum redundancy maximum relevance (mRMR) and least absolute shrinkage and selection operator (LASSO) algorithms. Logistic regression (LR) and random forest (RF) classifiers were applied to construct the models. Receiver operating characteristic (ROC) curves and decision curve analyses (DCA) were performed to compare and validate the predictive capacities of the models. A comparative study of the area under the curve (AUC), accuracy (ACC), sensitivity (SEN), and specificity (SPE) was performed. The Shapley additive explanation (SHAP) was applied to investigate the optimal model's interpretability. The combined model predicted the PMAs' consistency more effectively than the clinicoradiological and radiomics models. Specifically, the LR-combined model displayed optimal prediction performance (test cohort: AUC = 0.913; ACC = 0.840). The SHAP-based explanation of the LR-combined model suggests that the wavelet-transformed and Laplacian of Gaussian (LoG) filter features extracted from T2WI and CE-T1WI occupy a dominant position. Meanwhile, the skewness of the original first-order features extracted from T2WI (T2WI_original_first-order_Skewness) demonstrated the most substantial contribution.
An interpretable machine learning model incorporating clinicoradiological predictors and multiparametric MRI (mpMRI)-based radiomics features may predict PMA consistency, enabling tailored and precise therapies for patients with PMA.
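
The abstract above compares models by AUC, ACC, SEN, and SPE. As a reminder of how those four numbers relate to predictions, here is a minimal standard-library sketch. The function names, the label coding (1 = firm, 0 = soft), and the 0.5-credit handling of tied scores are assumptions made for illustration; this is not the authors' evaluation code.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    sensitivity: float
    specificity: float


def binary_metrics(y_true, y_pred, positive=1):
    """Compute ACC, SEN and SPE from paired true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        raise ValueError("need at least one prediction")
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return BinaryMetrics(
        accuracy=(tp + tn) / len(pairs),
        # Sensitivity: fraction of true positives that were detected.
        sensitivity=tp / (tp + fn) if (tp + fn) else float("nan"),
        # Specificity: fraction of true negatives that were rejected.
        specificity=tn / (tn + fp) if (tn + fp) else float("nan"),
    )


def auc(y_true, scores, positive=1):
    """AUC as the probability that a positive case outscores a negative one.

    Ties count as half a win (an assumed, but common, convention).
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

ACC, SEN, and SPE depend on the decision threshold used to turn scores into labels, whereas AUC is computed from the scores directly and summarises ranking quality across all thresholds.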

MRI Classification Neurological Retrospective Clinical In Silico Academic Lab Reproducibility

Impact of polymer source variations on hydrogel structure and product performance in dexamethasone-loaded ophthalmic inserts.

VandenBerg MA, Zaman RU, Plavchak CL, Smith WC, Nejad HB, Beringhs AO, Wang Y, Xu X

•papers•Jul 9 2025

Localized drug delivery can enhance therapeutic efficacy while minimizing systemic side effects, making sustained-release ophthalmic inserts an attractive alternative to traditional eye drops. Such inserts offer improved patient compliance through prolonged therapeutic effects and a reduced need for frequent administration. This study focuses on dexamethasone-containing ophthalmic inserts. These inserts utilize a key excipient, polyethylene glycol (PEG), which forms a hydrogel upon contact with tear fluid. Developing generic equivalents of PEG-based inserts is challenging due to difficulties in characterizing inactive ingredients and the absence of standardized physicochemical characterization methods to demonstrate similarity. To address this gap, a suite of analytical approaches was applied to both PEG precursor materials sourced from different vendors and manufactured inserts. 1H NMR, FTIR, MALDI, and SEC revealed variations in end-group functionalization, impurity content, and molecular weight distribution of the excipient. These differences led to changes in the finished insert network properties such as porosity, pore size and structure, gel mechanical strength, and crystallinity, which were corroborated by X-ray microscopy, AI-based image analysis, thermal, mechanical, and density measurements. In vitro release testing revealed distinct drug release profiles across formulations, with swelling rate correlated to release rate (i.e., faster release with rapid swelling). The use of non-micronized and micronized dexamethasone also contributed to release profile differences. Through comprehensive characterization of these PEG-based dexamethasone inserts, correlations between polymer quality, hydrogel microstructure, and release kinetics were established. The study highlights how excipient differences can alter product performance, emphasizing the importance of thorough analysis in developing generic equivalents of complex drug products.

X-Ray Segmentation Methodology In Silico Academic Lab Reproducibility

An autonomous agent for auditing and improving the reliability of clinical AI models

Lukas Kuhn, Florian Buettner

•preprint•Jul 8 2025

The deployment of AI models in clinical practice faces a critical challenge: models achieving expert-level performance on benchmarks can fail catastrophically when confronted with real-world variations in medical imaging. Minor shifts in scanner hardware, lighting or demographics can erode accuracy, but reliability auditing to identify such catastrophic failure cases before deployment is currently a bespoke and time-consuming process. Practitioners lack accessible and interpretable tools to expose and repair hidden failure modes. Here we introduce ModelAuditor, a self-reflective agent that converses with users, selects task-specific metrics, and simulates context-dependent, clinically relevant distribution shifts. ModelAuditor then generates interpretable reports explaining how much performance likely degrades during deployment, discussing specific likely failure modes and identifying root causes and mitigation strategies. Our comprehensive evaluation across three real-world clinical scenarios - inter-institutional variation in histopathology, demographic shifts in dermatology, and equipment heterogeneity in chest radiography - demonstrates that ModelAuditor is able to correctly identify context-specific failure modes of state-of-the-art models such as the established SIIM-ISIC melanoma classifier. Its targeted recommendations recover 15-25% of performance lost under real-world distribution shift, substantially outperforming both baseline models and state-of-the-art augmentation methods. These improvements are achieved through a multi-agent architecture and execute on consumer hardware in under 10 minutes, costing less than US$0.50 per audit.

X-Ray Classification Chest Methodology In Silico Academic Lab Benchmark SOTA Reproducibility

Deep supervised transformer-based noise-aware network for low-dose PET denoising across varying count levels.

Azimi MS, Felfelian V, Zeraatkar N, Dadgar H, Arabi H, Zaidi H

•papers•Jul 8 2025

Reducing radiation dose from PET imaging is essential to minimize cancer risks; however, it often leads to increased noise and degraded image quality, compromising diagnostic reliability. Recent advances in deep learning have shown promising results in addressing these limitations through effective denoising. However, existing networks trained on specific noise levels often fail to generalize across diverse acquisition conditions. Moreover, training multiple models for different noise levels is impractical due to data and computational constraints. This study aimed to develop a supervised Swin Transformer-based unified noise-aware (ST-UNN) network that handles diverse noise levels and reconstructs high-quality images in low-dose PET imaging. We present a Swin Transformer-based Noise-Aware Network (ST-UNN), which incorporates multiple sub-networks, each designed to address specific noise levels ranging from 1% to 10%. An adaptive weighting mechanism dynamically integrates the outputs of these sub-networks to achieve effective denoising. The model was trained and evaluated using a PET/CT dataset encompassing the entire head and malignant lesions in the head and neck region. Performance was assessed using a combination of structural and statistical metrics, including the Structural Similarity Index (SSIM), Peak Signal-to-Noise Ratio (PSNR), Standardized Uptake Value (SUV) mean bias, SUVmax bias, and Root Mean Square Error (RMSE). This comprehensive evaluation ensured reliable results for both global and localized regions within PET images. The ST-UNN consistently outperformed conventional networks, particularly in ultra-low-dose scenarios. At the 1% count level, it achieved a PSNR of 34.77, RMSE of 0.05, and SSIM of 0.97, notably surpassing the baseline networks. It also achieved the lowest SUVmean bias (0.08) and RMSE lesion (0.12) at this level.
Across all count levels, ST-UNN maintained high performance and low error, demonstrating strong generalization and diagnostic integrity. ST-UNN offers a scalable, transformer-based solution for low-dose PET imaging. By dynamically integrating sub-networks, it effectively addresses noise variability and provides superior image quality, thereby advancing the capabilities of low-dose and dynamic PET imaging.
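
Two of the metrics reported above, RMSE and PSNR, are simple functions of the voxel-wise error between a full-dose reference and a denoised image. The sketch below shows the textbook definitions on flattened voxel lists. The abstract does not state how images were normalised or which peak value was used for PSNR, so the `peak` default (the reference maximum) is an assumption, as are the function names.

```python
import math


def rmse(reference, estimate):
    """Root mean square error between two equally sized voxel lists."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    return math.sqrt(mse)


def psnr(reference, estimate, peak=None):
    """Peak signal-to-noise ratio in dB: 20 * log10(peak / RMSE).

    `peak` defaults to the maximum of the reference image, which is an
    assumed convention; other choices shift every PSNR by a constant.
    """
    if peak is None:
        peak = max(reference)
    err = rmse(reference, estimate)
    if err == 0:
        return math.inf  # identical images
    return 20 * math.log10(peak / err)
```

Because PSNR depends on the chosen peak and on intensity scaling, values are only comparable between studies that use the same normalisation.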

PET Reconstruction Neurological Methodology In Silico Academic Lab Reproducibility Benchmark SOTA

Development and retrospective validation of an artificial intelligence system for diagnostic assessment of prostate biopsies: study protocol.

Mulliqi N, Blilie A, Ji X, Szolnoky K, Olsson H, Titus M, Martinez Gonzalez G, Boman SE, Valkonen M, Gudlaugsson E, Kjosavik SR, Asenjo J, Gambacorta M, Libretti P, Braun M, Kordek R, Łowicki R, Hotakainen K, Väre P, Pedersen BG, Sørensen KD, Ulhøi BP, Rantalainen M, Ruusuvuori P, Delahunt B, Samaratunga H, Tsuzuki T, Janssen EAM, Egevad L, Kartasalo K, Eklund M

•papers•Jul 7 2025

Histopathological evaluation of prostate biopsies using the Gleason scoring system is critical for prostate cancer diagnosis and treatment selection. However, grading variability among pathologists can lead to inconsistent assessments, risking inappropriate treatment. Similar challenges complicate the assessment of other prognostic features like cribriform cancer morphology and perineural invasion. Many pathology departments are also facing an increasingly unsustainable workload due to rising prostate cancer incidence and a decreasing pathologist workforce coinciding with increasing requirements for more complex assessments and reporting. Digital pathology and artificial intelligence (AI) algorithms for analysing whole slide images show promise in improving the accuracy and efficiency of histopathological assessments. Studies have demonstrated AI's capability to diagnose and grade prostate cancer comparably to expert pathologists. However, external validations on diverse data sets have been limited and often show reduced performance. Historically, there have been no well-established guidelines for AI study designs and validation methods. Diagnostic assessments of AI systems often lack preregistered protocols and rigorous external cohort sampling, essential for reliable evidence of their safety and accuracy. This study protocol covers the retrospective validation of an AI system for prostate biopsy assessment. The primary objective of the study is to develop a high-performing and robust AI model for diagnosis and Gleason scoring of prostate cancer in core needle biopsies, and at scale evaluate whether it can generalise to fully external data from independent patients, pathology laboratories and digitalisation platforms. The secondary objectives cover AI performance in estimating cancer extent and detecting cribriform prostate cancer and perineural invasion. 
This protocol outlines the steps for data collection, predefined partitioning of data cohorts for AI model training and validation, model development and predetermined statistical analyses, ensuring systematic development and comprehensive validation of the system. The protocol adheres to Transparent Reporting of a multivariable prediction model of Individual Prognosis Or Diagnosis+AI (TRIPOD+AI), Protocol Items for External Cohort Evaluation of a Deep Learning System in Cancer Diagnostics (PIECES), Checklist for AI in Medical Imaging (CLAIM) and other relevant best practices. Data collection and usage were approved by the respective ethical review boards of each participating clinical laboratory, and centralised anonymised data handling was approved by the Swedish Ethical Review Authority. The study will be conducted in agreement with the Helsinki Declaration. The findings will be disseminated in peer-reviewed publications (open access).

Mixed Modality Classification Abdominal Retrospective Clinical In Silico Academic Lab Reproducibility

Introducing Image-Space Preconditioning in the Variational Formulation of MRI Reconstructions

Bastien Milani, Jean-Baptist Ledoux, Berk Can Acikgoz, Xavier Richard

•preprint•Jul 7 2025

The aim of the present article is to enrich the comprehension of iterative magnetic resonance imaging (MRI) reconstructions, including compressed sensing (CS) and iterative deep learning (DL) reconstructions, by describing them in the general framework of finite-dimensional inner-product spaces. In particular, we show that image-space preconditioning (ISP) and data-space preconditioning (DSP) can be formulated as non-conventional inner-products. The main gain of our reformulation is an embedding of ISP in the variational formulation of the MRI reconstruction problem (in an algorithm-independent way), which in principle allows ISP to be propagated naturally and systematically to all iterative reconstructions, including many iterative DL and CS reconstructions where preconditioning is lacking. The way we apply linear-algebraic tools to MRI reconstruction in this article is novel. A secondary aim of our article is to offer didactic material to scientists who are new to the field of MRI reconstruction. Since we explore here some mathematical concepts of reconstruction, we take the opportunity to recall some principles that may be well understood by experts but hard to find in the literature for beginners. In fact, the description of many mathematical tools of MRI reconstruction is fragmented in the literature or sometimes missing because it is considered general knowledge. Further, some of those concepts can be found in mathematics manuals, but not in a form that is oriented toward MRI. Examples include the conjugate gradient descent, the notion of derivative with respect to non-conventional inner products, and the notion of adjoint. The authors therefore believe that it is beneficial for their field of research to dedicate some space to such didactic material.
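
To make the phrase "preconditioning as a non-conventional inner product" concrete, the following is a short worked derivation using standard linear algebra. The notation is assumed for illustration and is not taken from the article: H_X and H_Y are Hermitian positive-definite matrices defining the image-space and data-space inner products, A is the linear encoding operator, and y is the measured data.

```latex
% Assumed notation (not the article's): H_X, H_Y Hermitian positive definite,
% A the linear encoding operator, y the measured data.
\begin{gather}
\langle x_1, x_2 \rangle_X = x_1^{H} H_X\, x_2,
\qquad
\langle y_1, y_2 \rangle_Y = y_1^{H} H_Y\, y_2, \\
\langle A x, y \rangle_Y = \langle x, A^{\dagger} y \rangle_X
\;\Longrightarrow\;
A^{\dagger} = H_X^{-1} A^{H} H_Y, \\
f(x) = \tfrac{1}{2}\, \lVert A x - y \rVert_Y^{2}
\;\Longrightarrow\;
\nabla_X f(x) = A^{\dagger}(A x - y) = H_X^{-1} A^{H} H_Y\,(A x - y).
\end{gather}
```

Under this reading, changing the inner products leaves the objective's form untouched but changes the adjoint and the gradient: H_X^{-1} multiplies the usual gradient on the image side and H_Y weights the residual on the data side, so any algorithm written in terms of the adjoint and gradient inherits both factors.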

MRI Reconstruction Methodology Concept Reproducibility

Self-supervised Deep Learning for Denoising in Ultrasound Microvascular Imaging

Lijie Huang, Jingyi Yin, Jingke Zhang, U-Wai Lok, Ryan M. DeRuiter, Jieyang Jin, Kate M. Knoll, Kendra E. Petersen, James D. Krier, Xiang-yang Zhu, Gina K. Hesley, Kathryn A. Robinson, Andrew J. Bentall, Thomas D. Atwell, Andrew D. Rule, Lilach O. Lerman, Shigao Chen, Chengwu Huang

•preprint•Jul 7 2025

Ultrasound microvascular imaging (UMI) is often hindered by low signal-to-noise ratio (SNR), especially in contrast-free or deep tissue scenarios, which impairs subsequent vascular quantification and reliable disease diagnosis. To address this challenge, we propose Half-Angle-to-Half-Angle (HA2HA), a self-supervised denoising framework specifically designed for UMI. HA2HA constructs training pairs from complementary angular subsets of beamformed radio-frequency (RF) blood flow data, across which vascular signals remain consistent while noise varies. HA2HA was trained using in-vivo contrast-free pig kidney data and validated across diverse datasets, including contrast-free and contrast-enhanced data from pig kidneys, as well as human liver and kidney. An improvement exceeding 15 dB in both contrast-to-noise ratio (CNR) and SNR was observed, indicating a substantial enhancement in image quality. In addition to power Doppler imaging, denoising directly in the RF domain is also beneficial for other downstream processing such as color Doppler imaging (CDI). CDI results of human liver derived from the HA2HA-denoised signals exhibited improved microvascular flow visualization, with a suppressed noisy background. HA2HA offers a label-free, generalizable, and clinically applicable solution for robust vascular imaging in both contrast-free and contrast-enhanced UMI.
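
The core idea described above is to build two views of the same acquisition from complementary angular subsets, so the vascular signal is shared while the noise differs. The sketch below illustrates that pairing step only. The interleaved even/odd split, the plain averaging used for compounding, and all function names are assumptions for illustration; the abstract does not specify how HA2HA partitions or combines the angles.

```python
def split_half_angles(angle_frames):
    """Split per-angle frames into two complementary, interleaved subsets.

    `angle_frames` holds one flattened frame (list of samples) per
    steering angle. The even/odd interleaving is an assumed rule.
    """
    if len(angle_frames) < 2:
        raise ValueError("need at least two steering angles")
    return angle_frames[0::2], angle_frames[1::2]


def compound(frames):
    """Average frames sample by sample to form one compounded frame."""
    n = len(frames)
    return [sum(samples) / n for samples in zip(*frames)]


def make_training_pair(angle_frames):
    """Return (input, target) frames compounded from the two half-angle sets."""
    half_a, half_b = split_half_angles(angle_frames)
    return compound(half_a), compound(half_b)
```

A denoiser trained to map one half-angle image onto the other cannot predict the independent noise in the target, so minimising the error pushes it toward the shared signal. This is the same reasoning behind other noisy-pair self-supervised schemes.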

Ultrasound Reconstruction Abdominal Methodology In Silico Academic Lab Reproducibility

PGMI assessment in mammography: AI software versus human readers.

Santner T, Ruppert C, Gianolini S, Stalheim JG, Frei S, Hondl M, Fröhlich V, Hofvind S, Widmann G

•papers•Jul 5 2025

The aim of this study was to evaluate human inter-reader agreement of parameters included in PGMI (perfect-good-moderate-inadequate) classification of screening mammograms and explore the role of artificial intelligence (AI) as an alternative reader. Five radiographers from three European countries independently performed a PGMI assessment of 520 anonymized mammography screening examinations randomly selected from representative subsets from 13 imaging centres within two European countries. As a sixth reader, a dedicated AI software was used. Accuracy, Cohen's Kappa, and confusion matrices were calculated to compare the predictions of the software against the individual assessment of the readers, as well as potential discrepancies between them. A questionnaire and a personality test were used to better understand the decision-making processes of the human readers. Significant inter-reader variability among human readers with poor to moderate agreement (κ = -0.018 to κ = 0.41) was observed, with some showing more homogenous interpretations of single features and overall quality than others. In comparison, the software surpassed human inter-reader agreement in detecting glandular tissue cuts, mammilla deviation, pectoral muscle detection, and pectoral angle measurement, while remaining features and overall image quality exhibited comparable performance to human assessment. Notably, human inter-reader disagreement of PGMI assessment in mammography is considerably high. AI software may already reliably categorize quality. Its potential for standardization and immediate feedback to achieve and monitor high levels of quality in screening programs needs further attention and should be included in future approaches. AI has promising potential for automated assessment of diagnostic image quality. Faster, more representative and more objective feedback may support radiographers in their quality management processes. 
Direct transformation of common PGMI workflows into an AI algorithm could be challenging.
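
Agreement in the study above is reported as Cohen's kappa, which corrects raw agreement for the agreement two readers would reach by chance. Below is a minimal standard-library sketch of the unweighted statistic for two readers. The function name and the P/G/M/I label strings in the usage note are illustrative assumptions, and the study may have used a different variant such as a weighted kappa.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two readers rating the same cases."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed agreement: fraction of cases with identical labels.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from each reader's marginal label frequencies.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # both readers used a single identical label
    return (observed - expected) / (1.0 - expected)
```

For example, `cohens_kappa(list("PGGMIGMP"), list("PGMMIGGG"))` compares two hypothetical readers over eight examinations. A kappa of 0 means agreement no better than chance, and slightly negative values such as the reported κ = -0.018 mean agreement marginally below chance level.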

Mammography Classification Breast Retrospective Clinical In Silico Academic Lab Reproducibility

Impact of super-resolution deep learning-based reconstruction for hippocampal MRI: A volunteer and phantom study.

Takada S, Nakaura T, Yoshida N, Uetani H, Shiraishi K, Kobayashi N, Matsuo K, Morita K, Nagayama Y, Kidoh M, Yamashita Y, Takayanagi R, Hirai T

•papers•Jul 5 2025

To evaluate the effects of super-resolution deep learning-based reconstruction (SR-DLR) on thin-slice T2-weighted hippocampal MR image quality using 3 T MRI, in both human volunteers and phantoms. Thirteen healthy volunteers underwent hippocampal MRI at standard and high resolutions. Original (standard-resolution; StR) images were reconstructed with and without deep learning-based reconstruction (DLR) (Matrix = 320 × 320), and with SR-DLR (Matrix = 960 × 960). High-resolution (HR) images were also reconstructed with/without DLR (Matrix = 960 × 960). Contrast, contrast-to-noise ratio (CNR), and septum slope were analyzed. Two radiologists evaluated the images for noise, contrast, artifacts, sharpness, and overall quality. Quantitative and qualitative results are reported as medians and interquartile ranges (IQR). Comparisons used the Wilcoxon signed-rank test with Holm correction. We also scanned an American College of Radiology (ACR) phantom to evaluate the ability of our SR-DLR approach to reduce artifacts induced by zero-padding interpolation (ZIP). SR-DLR exhibited contrast comparable to original images and significantly higher than HR-images. Its slope was comparable to that of HR images but was significantly steeper than that of StR images (p < 0.01). Furthermore, the CNR of SR-DLR (10.53; IQR: 10.08, 11.69) was significantly superior to the StR-images without DLR (7.5; IQR: 6.4, 8.37), StR-images with DLR (8.73; IQR: 7.68, 9.0), HR-images without DLR (2.24; IQR: 1.43, 2.38), and HR-images with DLR (4.84; IQR: 2.99, 5.43) (p < 0.05). In the phantom study, artifacts induced by ZIP were scarcely observed when using SR-DLR. SR-DLR for hippocampal MRI potentially improves image quality beyond that of actual HR-images while reducing acquisition time.
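
The comparisons above use the Wilcoxon signed-rank test with Holm correction. The correction step is easy to get wrong by hand, so here is a minimal sketch of the Holm step-down adjustment applied to a list of raw p-values. The function name is an assumption, and the raw p-values would come from whatever test implementation is used; the sketch covers only the multiplicity adjustment.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order.

    The k-th smallest raw p-value (k = 0, 1, ...) is multiplied by
    (m - k), capped at 1, and forced to be non-decreasing in rank order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so a larger raw p never gets a smaller
        # adjusted value than a smaller raw p.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

A comparison is then declared significant when its adjusted p-value falls below the chosen level. Holm controls the family-wise error rate like Bonferroni but is never less powerful.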

MRI Reconstruction Neurological Retrospective Clinical In Silico Academic Lab Reproducibility
