Page 366 of 664 (6636 results)

Nataliia Molchanova, Alessandro Cagol, Mario Ocampo-Pineda, Po-Jui Lu, Matthias Weigel, Xinjie Chen, Erin Beck, Charidimos Tsagkas, Daniel Reich, Colin Vanden Bulcke, Anna Stolting, Serena Borrelli, Pietro Maggi, Adrien Depeursinge, Cristina Granziera, Henning Mueller, Pedro M. Gordaliza, Meritxell Bach Cuadra

arXiv preprint, Jul 16 2025
Cortical lesions (CLs) have emerged as valuable biomarkers in multiple sclerosis (MS), offering high diagnostic specificity and prognostic relevance. However, their routine clinical integration remains limited due to their subtle appearance on magnetic resonance imaging (MRI), challenges in expert annotation, and a lack of standardized automated methods. We propose a comprehensive multi-centric benchmark of CL detection and segmentation in MRI. A total of 656 MRI scans, including clinical trial and research data from four institutions, were acquired at 3T and 7T using MP2RAGE and MPRAGE sequences with expert-consensus annotations. We rely on the self-configuring nnU-Net framework, designed for medical imaging segmentation, and propose adaptations tailored to improve CL detection. We evaluated model generalization through out-of-distribution testing, demonstrating strong lesion detection capabilities with in-domain and out-of-domain F1-scores of 0.64 and 0.50, respectively. We also analyze internal model features and model errors for a better understanding of AI decision-making. Our study examines how data variability, lesion ambiguity, and protocol differences impact model performance, offering recommendations to address these barriers to clinical adoption. To reinforce reproducibility, the implementation and models will be publicly accessible and ready to use at https://github.com/Medical-Image-Analysis-Laboratory/ and https://doi.org/10.5281/zenodo.15911797.
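The lesion-wise F1-scores reported above count each predicted lesion as a true positive when it overlaps a reference lesion. A minimal sketch of that metric (the greedy one-to-one matching and the single-voxel overlap criterion are illustrative assumptions, not necessarily the benchmark's exact protocol):

```python
def lesion_f1(pred_lesions, ref_lesions, min_overlap=1):
    """Lesion-wise F1. Each lesion is a set of voxel coordinates."""
    matched = set()
    tp = 0
    for pred in pred_lesions:
        for i, ref in enumerate(ref_lesions):
            # greedily match each predicted lesion to one unmatched reference lesion
            if i not in matched and len(pred & ref) >= min_overlap:
                tp += 1
                matched.add(i)
                break
    fp = len(pred_lesions) - tp  # predicted lesions with no reference match
    fn = len(ref_lesions) - tp   # reference lesions that were missed
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
```

With one matched pair, one false positive, and one miss, this yields F1 = 0.5.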

Sybelle Goedicke-Fritz, Michelle Bous, Annika Engel, Matthias Flotho, Pascal Hirsch, Hannah Wittig, Dino Milanovic, Dominik Mohr, Mathias Kaspar, Sogand Nemat, Dorothea Kerner, Arno Bücker, Andreas Keller, Sascha Meyer, Michael Zemlin, Philipp Flotho

arXiv preprint, Jul 16 2025
Bronchopulmonary dysplasia (BPD) is a chronic lung disease affecting 35% of extremely low birth weight infants. Defined by oxygen dependence at 36 weeks postmenstrual age, it causes lifelong respiratory complications. However, preventive interventions carry severe risks, including neurodevelopmental impairment, ventilator-induced lung injury, and systemic complications. Therefore, early BPD prognosis and prediction of BPD outcome are crucial to avoid unnecessary toxicity in low-risk infants. Admission radiographs of extremely preterm infants are routinely acquired within 24 h of life and could serve as a non-invasive prognostic tool. In this work, we developed and investigated a deep learning approach using chest X-rays from 163 extremely low-birth-weight infants ($\leq$32 weeks gestation, 401-999 g) obtained within 24 hours of birth. We fine-tuned a ResNet-50 pretrained specifically on adult chest radiographs, employing progressive layer freezing with discriminative learning rates to prevent overfitting, and evaluated CutMix augmentation and linear probing. For moderate/severe BPD outcome prediction, our best performing model, with progressive freezing, linear probing, and CutMix, achieved an AUROC of 0.78 $\pm$ 0.10, a balanced accuracy of 0.69 $\pm$ 0.10, and an F1-score of 0.67 $\pm$ 0.11. In-domain pre-training significantly outperformed ImageNet initialization (p = 0.031), confirming that domain-specific pretraining is important for BPD outcome prediction. Routine IRDS grades showed limited prognostic value (AUROC 0.57 $\pm$ 0.11), confirming the need for learned markers. Our approach demonstrates that domain-specific pretraining enables accurate BPD prediction from routine day-1 radiographs. Through progressive freezing and linear probing, the method remains computationally feasible for site-level implementation and future federated learning deployments.
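The CutMix augmentation evaluated here pastes a random box from one image into another and mixes the labels in proportion to the pasted area. A minimal NumPy sketch of the idea (the box-sampling scheme is a common convention, assumed rather than taken from the paper):

```python
import numpy as np

def cutmix(img_a, img_b, label_a, label_b, lam, rng):
    """Paste a box covering roughly (1 - lam) of the area from img_b into img_a."""
    h, w = img_a.shape[-2:]
    cut_h = int(h * np.sqrt(1 - lam))
    cut_w = int(w * np.sqrt(1 - lam))
    cy, cx = int(rng.integers(h)), int(rng.integers(w))  # random box center
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    mixed = img_a.copy()
    mixed[..., y1:y2, x1:x2] = img_b[..., y1:y2, x1:x2]
    # weight labels by the area actually pasted (the box may be clipped at borders)
    lam_adj = 1.0 - ((y2 - y1) * (x2 - x1)) / (h * w)
    return mixed, lam_adj * label_a + (1.0 - lam_adj) * label_b
```

The mixed label keeps the loss consistent with how much of each image survives in the composite.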

Felix Nützel, Mischa Dombrowski, Bernhard Kainz

arXiv preprint, Jul 16 2025
Phrase grounding, i.e., mapping natural language phrases to specific image regions, holds significant potential for disease localization in medical imaging through clinical reports. While current state-of-the-art methods rely on discriminative, self-supervised contrastive models, we demonstrate that generative text-to-image diffusion models, leveraging cross-attention maps, can achieve superior zero-shot phrase grounding performance. Contrary to prior assumptions, we show that fine-tuning diffusion models with a frozen, domain-specific language model, such as CXR-BERT, substantially outperforms domain-agnostic counterparts. This setup achieves remarkable improvements, with mIoU scores doubling those of current discriminative methods. These findings highlight the underexplored potential of generative models for phrase grounding tasks. To further enhance performance, we introduce Bimodal Bias Merging (BBM), a novel post-processing technique that aligns text and image biases to identify regions of high certainty. BBM refines cross-attention maps, achieving even greater localization accuracy. Our results establish generative approaches as a more effective paradigm for phrase grounding in the medical imaging domain, paving the way for more robust and interpretable applications in clinical practice. The source code and model weights are available at https://github.com/Felix-012/generate_to_ground.
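Zero-shot grounding from cross-attention can be sketched as: average the attention maps of the phrase's tokens, min-max normalize, and threshold into a region mask. A minimal NumPy illustration (the averaging rule and threshold value are assumptions; the paper's BBM post-processing is not reproduced here):

```python
import numpy as np

def ground_phrase(attn_maps, threshold=0.5):
    """attn_maps: array (n_tokens, H, W) of cross-attention for the phrase's tokens.
    Returns a binary mask of high-attention image regions."""
    avg = attn_maps.mean(axis=0)            # pool attention over the phrase's tokens
    lo, hi = avg.min(), avg.max()
    norm = (avg - lo) / (hi - lo + 1e-8)    # min-max normalize to [0, 1]
    return norm >= threshold
```

In practice the low-resolution attention map would be upsampled to image resolution before thresholding.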

Anida Nezović, Jalal Romano, Nada Marić, Medina Kapo, Amila Akagić

arXiv preprint, Jul 16 2025
Deep learning has significantly advanced the field of medical image classification, particularly with the adoption of Convolutional Neural Networks (CNNs). Various deep learning frameworks such as Keras, PyTorch and JAX offer unique advantages in model development and deployment. However, their comparative performance in medical imaging tasks remains underexplored. This study presents a comprehensive analysis of CNN implementations across these frameworks, using the PathMNIST dataset as a benchmark. We evaluate training efficiency, classification accuracy and inference speed to assess their suitability for real-world applications. Our findings highlight the trade-offs between computational speed and model accuracy, offering valuable insights for researchers and practitioners in medical image analysis.
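Framework comparisons like this typically time a fixed training or inference step with warm-up runs discarded, which matters especially for JAX, where the first call triggers JIT compilation. A generic sketch of such a harness (the measured function is a placeholder, not the study's code):

```python
import time

def benchmark(fn, n_warmup=2, n_runs=5):
    """Average wall-clock time of fn(), discarding warm-up runs (JIT, caching)."""
    for _ in range(n_warmup):
        fn()
    times = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return sum(times) / len(times)
```

The same harness can wrap a Keras `train_on_batch`, a PyTorch optimizer step, or a jitted JAX update to make the timings comparable.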

Zahid Ullah, Dragan Pamucar, Jihie Kim

arXiv preprint, Jul 16 2025
Magnetic Resonance Imaging (MRI) is widely recognized as the most reliable tool for detecting tumors due to its capability to produce detailed images that reveal their presence. However, the accuracy of diagnosis can be compromised when human specialists evaluate these images. Factors such as fatigue, limited expertise, and insufficient image detail can lead to errors. For example, small tumors might go unnoticed, or overlap with healthy brain regions could result in misidentification. To address these challenges and enhance diagnostic precision, this study proposes a novel double-ensembling framework, consisting of an ensemble of pre-trained deep learning (DL) models for feature extraction and an ensemble of hyperparameter-tuned machine learning (ML) models to efficiently classify brain tumors. Specifically, our method includes extensive preprocessing and augmentation, applies transfer learning by using various pre-trained deep convolutional neural networks and vision transformer networks to extract deep features from brain MRI, and fine-tunes the hyperparameters of the ML classifiers. Our experiments utilized three different publicly available Kaggle MRI brain tumor datasets to evaluate the pre-trained DL feature extractor models, the ML classifiers, and the effectiveness of an ensemble of deep features along with an ensemble of ML classifiers for brain tumor classification. Our results indicate that the proposed feature fusion and classifier fusion improve upon the state of the art, with hyperparameter fine-tuning providing a significant enhancement over the ensemble method. Additionally, we present an ablation study to illustrate how each component contributes to accurate brain tumor classification.
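The classifier-fusion step can be illustrated by soft voting: average the class-probability vectors produced by several ML classifiers and take the argmax. A plain-Python sketch (the paper's exact fusion rule is an assumption):

```python
def soft_vote(prob_lists):
    """prob_lists: one list of class-probability vectors per classifier,
    samples in the same order. Averages the probabilities across classifiers
    and returns the predicted class index per sample."""
    n_clf = len(prob_lists)
    preds = []
    for sample_probs in zip(*prob_lists):  # iterate samples across classifiers
        avg = [sum(p[c] for p in sample_probs) / n_clf
               for c in range(len(sample_probs[0]))]
        preds.append(max(range(len(avg)), key=avg.__getitem__))
    return preds
```

Feature fusion works analogously upstream: the deep feature vectors from several extractors are concatenated before being fed to each classifier.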

Matthias Perkonigg, Nina Bastati, Ahmed Ba-Ssalamah, Peter Mesenbrink, Alexander Goehler, Miljen Martic, Xiaofei Zhou, Michael Trauner, Georg Langs

arXiv preprint, Jul 16 2025
Quantifiable image patterns associated with disease progression and treatment response are critical tools for guiding individual treatment, and for developing novel therapies. Here, we show that unsupervised machine learning can identify a pattern vocabulary of liver tissue in magnetic resonance images that quantifies treatment response in diffuse liver disease. Deep clustering networks simultaneously encode and cluster patches of medical images into a low-dimensional latent space to establish a tissue vocabulary. The resulting tissue types capture differential tissue change and its location in the liver associated with treatment response. We demonstrate the utility of the vocabulary on a randomized controlled trial cohort of non-alcoholic steatohepatitis patients. First, we use the vocabulary to compare longitudinal liver change in a placebo and a treatment cohort. Results show that the method identifies specific liver tissue change pathways associated with treatment, and enables a better separation between treatment groups than established non-imaging measures. Moreover, we show that the vocabulary can predict biopsy derived features from non-invasive imaging data. We validate the method on a separate replication cohort to demonstrate the applicability of the proposed method.
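The tissue-vocabulary idea, encoding image patches and clustering them into a small set of tissue types, can be illustrated with plain k-means over patch embeddings. Note that the paper's deep clustering networks train the encoder and the clustering jointly, which this sketch does not:

```python
import numpy as np

def kmeans(x, k, n_iter=50):
    """Cluster patch embeddings x (n_patches, dim) into k 'vocabulary' entries."""
    x = np.asarray(x, dtype=float)
    centers = x[:k].copy()  # naive deterministic initialization
    for _ in range(n_iter):
        # squared distances of every patch to every center
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = x[labels == j].mean(0)
    return labels, centers
```

Each patch's cluster label then acts as a vocabulary token, and longitudinal change can be summarized as shifts in the distribution of labels across the liver.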

Zhang C, Zhu H, Long H, Shi Y, Guo J, You M

PubMed paper, Jul 16 2025
The panoramic radiograph is the most commonly used imaging modality for predicting maxillary canine impaction. Several prediction models have been constructed based on panoramic radiographs. This study aimed to compare the prediction accuracy of existing models in an external validation facilitated by an automatic landmark detection system based on deep learning. Patients aged 7-14 years who underwent panoramic radiographic examinations and received a diagnosis of impacted canines were included in the study. An automatic landmark localization system was employed to assist the measurement of geometric parameters on the panoramic radiographs, followed by the calculated prediction of canine impaction. Three prediction models, constructed by Arnautska, Alqerban et al., and Margot et al., were evaluated. The metrics of accuracy, sensitivity, specificity, precision, and area under the receiver operating characteristic curve (AUC) were used to compare the performance of the different models. A total of 102 panoramic radiographs with 102 impacted canines and 102 nonimpacted canines were analyzed in this study. The prediction outcomes indicated that the model by Margot et al. achieved the highest performance, with a sensitivity of 95% and a specificity of 86% (AUC, 0.97), followed by the model by Arnautska, with a sensitivity of 93% and a specificity of 71% (AUC, 0.94). The model by Alqerban et al. showed poor performance, with an AUC of only 0.20. Two of the existing predictive models exhibited good diagnostic accuracy, whereas the third demonstrated suboptimal performance. Nonetheless, even the most effective model is constrained by several limitations, such as logical and computational challenges, which necessitate further refinement.
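The comparison metrics above can be computed directly from binary predictions and continuous scores. A small sketch of sensitivity, specificity, and a rank-based (Mann-Whitney) AUC:

```python
def sensitivity_specificity(y_true, y_pred):
    """Binary labels: 1 = impacted, 0 = nonimpacted."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def auc(y_true, scores):
    """AUC as the probability a positive outscores a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.20, as reported for one model, means the score ranking is largely inverted relative to the true labels.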

Zhou X, Xiong D, Liu F, Li J, Tan N, Duan X, Du X, Ouyang Z, Bao S, Ke T, Zhao Y, Tao J, Dong X, Wang Y, Liao C

PubMed paper, Jul 16 2025
This study assesses the effectiveness of super-resolution deep learning reconstruction (SR-DLR), conventional deep learning reconstruction (C-DLR), and hybrid iterative reconstruction (HIR) in enhancing image quality and diagnostic performance for pediatric congenital heart disease (CHD) in cardiac CT angiography (CCTA). A total of 91 pediatric patients aged 1-10 years, suspected of having CHD, were consecutively enrolled for CCTA under free-breathing conditions. Reconstructions were performed using the SR-DLR, C-DLR, and HIR algorithms. Objective metrics (standard deviation (SD), signal-to-noise ratio (SNR), and contrast-to-noise ratio (CNR)) were quantified. Two radiologists provided blinded subjective image-quality evaluations. The full width at half maximum of lesions was significantly larger on SR-DLR (9.50 ± 6.44 mm) than on C-DLR (9.08 ± 6.23 mm; p < 0.001) and HIR (8.98 ± 6.37 mm; p < 0.001). SR-DLR exhibited superior performance, with significantly reduced SD and increased SNR and CNR, particularly in the left ventricle, left atrium, and right ventricle regions (p < 0.05). Subjective evaluations favored SR-DLR over C-DLR and HIR (p < 0.05). The accuracy (99.12%), sensitivity (99.07%), and negative predictive value (85.71%) of SR-DLR were the highest, significantly exceeding those of C-DLR (+7.01%, +7.40%, and +45.71%) and HIR (+20.17%, +21.29%, and +65.71%), with statistically significant differences (p < 0.05 and p < 0.001). In the detection of atrial septal defects (ASDs) and ventricular septal defects (VSDs), SR-DLR demonstrated significantly higher sensitivity than C-DLR (+8.96% and +9.09%) and HIR (+20.90% and +36.36%). For multi-perforated ASDs and VSDs, SR-DLR's sensitivity reached 85.71% and 100%, far surpassing C-DLR and HIR. SR-DLR significantly reduces image noise and enhances resolution, improving the diagnostic visualization of CHD structures in pediatric patients. It outperforms existing algorithms in detecting small lesions, achieving diagnostic accuracy close to that of ultrasound.
Question: Pediatric cardiac computed tomography angiography (CCTA) often fails to adequately visualize intracardiac structures, creating diagnostic challenges for CHD, particularly complex multi-perforated atrioventricular defects.
Findings: SR-DLR markedly improves image quality and diagnostic accuracy, enabling detailed visualization and precise detection of small congenital lesions.
Clinical relevance: SR-DLR enhances the diagnostic confidence and accuracy of CCTA in pediatric CHD, reducing missed diagnoses and improving the characterization of complex intracardiac anomalies, thus supporting better clinical decision-making.
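The objective metrics compared here have standard ROI-based definitions: SD is the pixel standard deviation in a homogeneous region, SNR is mean signal over noise SD, and CNR is the absolute difference of two tissue means over noise SD. A sketch using those common definitions (the study's exact ROI placement is not specified in the abstract):

```python
import statistics

def roi_stats(roi):
    """roi: list of pixel values in a region of interest; returns (mean, SD)."""
    return statistics.mean(roi), statistics.pstdev(roi)

def snr(signal_roi, noise_sd):
    """Signal-to-noise ratio: mean signal over background noise SD."""
    return statistics.mean(signal_roi) / noise_sd

def cnr(roi_a, roi_b, noise_sd):
    """Contrast-to-noise ratio between two tissues, e.g. blood pool vs. myocardium."""
    return abs(statistics.mean(roi_a) - statistics.mean(roi_b)) / noise_sd
```

Lower SD with higher SNR and CNR, as reported for SR-DLR, corresponds directly to less noisy, higher-contrast chamber delineation.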

Blemker SS, Riem L, DuCharme O, Pinette M, Costanzo KE, Weatherley E, Statland J, Tapscott SJ, Wang LH, Shaw DWW, Song X, Leung D, Friedman SD

PubMed paper, Jul 16 2025
Facioscapulohumeral muscular dystrophy (FSHD) is a genetic neuromuscular disorder characterized by progressive muscle degeneration with substantial variability in severity and progression patterns. FSHD is a highly heterogeneous disease; however, current clinical metrics used for tracking disease progression lack sensitivity for personalized assessment, which greatly limits the design and execution of clinical trials. This study introduces a multi-scale machine learning framework leveraging whole-body magnetic resonance imaging (MRI) and clinical data to predict regional, muscle, joint, and functional progression in FSHD. The goal of this work is to create a 'digital twin' of individual FSHD patients that can be leveraged in clinical trials. Using a combined dataset of over 100 patients from seven studies, baseline MRI-derived metrics (fat fraction, lean muscle volume, and fat spatial heterogeneity) were integrated with clinical and functional measures. A three-stage random forest model was developed to predict annualized changes in muscle composition and a functional outcome, the timed up-and-go (TUG) test. All model stages showed strong predictive performance on separate holdout datasets. After training, the models predicted fat fraction change with a root mean square error (RMSE) of 2.16% and lean volume change with an RMSE of 8.1 ml in a holdout testing dataset. Feature analysis revealed that metrics of fat heterogeneity within muscle predict muscle-level progression. The stage-3 model, which combined functional muscle groups, predicted change in TUG with an RMSE of 0.6 s in the holdout testing dataset. This study demonstrates that machine learning models incorporating individual muscle and performance data can effectively predict MRI disease progression and functional performance of complex tasks, addressing the heterogeneity and nonlinearity inherent in FSHD.
Further studies incorporating larger longitudinal cohorts, as well as comprehensive clinical and functional measures, will allow for expanding and refining this model. As many neuromuscular diseases are characterized by variability and heterogeneity similar to FSHD, such approaches have broad applicability.
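The errors reported above are RMSE values on holdout data. For reference, a minimal implementation of that evaluation metric:

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted annualized changes."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
                     / len(y_true))
```

Applied to held-out patients, this yields the 2.16% (fat fraction), 8.1 ml (lean volume), and 0.6 s (TUG) figures quoted in the abstract.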

Zhao K, Mei Y, Wang X, Ma W, Shen W

PubMed paper, Jul 16 2025
Satisfactory reduction of a fracture is hard to achieve. The purpose of this study is to develop a virtual fracture reduction technique using a conditional GAN (generative adversarial network) and evaluate its performance in simulating and guiding reduction of femoral neck fractures, which are hard to reduce. We compared its reduction quality with manual reduction performed by orthopedic surgeons. It is a pilot study for augmented-reality-assisted femoral neck fracture surgery. To establish the gold standard of reduction, we invited an orthopedic surgeon to perform virtual reduction registration with reference to the healthy proximal femur. The invited orthopedic surgeon also performed manual reduction in Mimics software to represent the capability of a human doctor. We then trained conditional GAN models on our dataset, which consisted of 208 images from 208 different patients. For displaced femoral neck fractures, it is not easy to measure accurate angles of the fracture line, such as the Pauwels angle; however, the fracture lines become clearer after reduction. We compared the results of manual reduction, the conditional GAN models, and registration by Pauwels angle, Garden index, and satisfactory reduction rate. We tried different numbers of downsampling operations (α) to optimize the performance of the conditional GAN models. There were 208 pre-surgical CT scans from 208 patients included in our study (age 69.755 ± 13.728 years, including 88 men). The Pauwels angle of the conditional GAN model (α = 0) was 38.519°, significantly more stable than manual reduction (44.647°, p < 0.001). The Garden index of the conditional GAN model (α = 0) was 176.726°, also significantly more stable than manual reduction (163.590°, p = 0.002). The satisfactory reduction rate of the conditional GAN model (α = 0) was 88.372%, significantly higher than manual reduction (53.488%, p < 0.001).
The Pauwels angles, Garden indices, and satisfactory reduction rate of the conditional GAN model (α = 0) showed no difference from registration. The conditional GAN model (α = 0) can achieve better performance in the virtual reduction of femoral neck fractures than an orthopedic surgeon.
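The Pauwels angle used for comparison is the angle between the fracture line and the horizontal; steeper angles indicate a more vertical, mechanically less stable fracture. Given two landmark points on the fracture line it can be computed as below (2D coronal-plane landmarks are an illustrative simplification of the CT-based measurement):

```python
import math

def pauwels_angle(p1, p2):
    """Angle (degrees) between the fracture line through p1, p2 and the horizontal.
    p1, p2: (x, y) landmarks on the fracture line in a coronal view."""
    dx = p2[0] - p1[0]
    dy = p2[1] - p1[1]
    # absolute values fold the line into the first quadrant, giving 0-90 degrees
    return math.degrees(math.atan2(abs(dy), abs(dx)))
```

A perfectly horizontal fracture line gives 0° and a vertical one 90°; the reported GAN-model mean of 38.519° versus 44.647° for manual reduction reflects a flatter, more stable configuration.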