
Use of artificial intelligence for classification of fractures around the elbow in adults according to the 2018 AO/OTA classification system.

Pettersson A, Axenhus M, Stukan T, Ljungberg O, Nåsell H, Razavian AS, Gordon M

PubMed · Sep 9, 2025
This study evaluates the accuracy of an artificial intelligence (AI) system, specifically a convolutional neural network (CNN), in classifying elbow fractures using the detailed 2018 AO/OTA fracture classification system. A retrospective analysis of 5,367 radiograph exams visualizing the elbow in adult patients (2002-2016) was conducted using a deep neural network. Radiographs were manually categorized according to the 2018 AO/OTA system by orthopedic surgeons. A pretrained EfficientNet-B4 network with squeeze-and-excitation layers was fine-tuned. Performance was assessed against a test set of 208 radiographs reviewed independently by four orthopedic surgeons, with disagreements resolved via consensus. The study evaluated 54 distinct fracture types, each with a minimum of 10 cases, ensuring adequate dataset representation. Overall fracture detection achieved an AUC of 0.88 (95% CI 0.83-0.93). The weighted mean AUC was 0.80 for proximal radius fractures, 0.86 for proximal ulna fractures, and 0.85 for distal humerus fractures. These results underscore the AI system's ability to accurately detect and classify a broad spectrum of elbow fractures. AI systems such as CNNs can enhance clinicians' ability to identify and classify elbow fractures, offering a complementary tool to improve diagnostic accuracy and optimize treatment decisions. The findings suggest AI can reduce the risk of undiagnosed fractures, enhancing clinical outcomes and radiologic evaluation.
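The setup this abstract describes, fine-tuning a pretrained EfficientNet-B4 (whose blocks already include squeeze-and-excitation layers) for fracture classification, might look roughly like the following sketch using the timm library. The class count (the 54 evaluated fracture types plus a no-fracture output), the multi-label loss, and the optimizer settings are assumptions for illustration, not the authors' exact configuration.

```python
# Minimal fine-tuning sketch, assuming a multi-label AO/OTA setup.
import timm
import torch
import torch.nn as nn

# timm's EfficientNet-B4 ships with squeeze-and-excitation blocks built in;
# num_classes replaces the ImageNet head with a fresh classification layer.
model = timm.create_model("efficientnet_b4", pretrained=True, num_classes=55)

criterion = nn.BCEWithLogitsLoss()  # one sigmoid output per fracture code
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step on a batch of (B, 3, H, W) radiographs."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```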

MedicalPatchNet: A Patch-Based Self-Explainable AI Architecture for Chest X-ray Classification

Patrick Wienholt, Christiane Kuhl, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn

arXiv preprint · Sep 9, 2025
Deep neural networks excel in radiological image classification but frequently suffer from poor interpretability, limiting clinical acceptance. We present MedicalPatchNet, an inherently self-explainable architecture for chest X-ray classification that transparently attributes decisions to distinct image regions. MedicalPatchNet splits images into non-overlapping patches, independently classifies each patch, and aggregates predictions, enabling intuitive visualization of each patch's diagnostic contribution without post-hoc techniques. Trained on the CheXpert dataset (223,414 images), MedicalPatchNet matches the classification performance of EfficientNet-B0 (AUROC 0.907 vs. 0.908) while substantially improving interpretability, with higher pathology localization accuracy (mean hit-rate 0.485 vs. 0.376 with Grad-CAM) on the CheXlocalize dataset. By providing explicit, reliable explanations accessible even to non-AI experts, MedicalPatchNet mitigates risks associated with shortcut learning, thus improving clinical trust. We make the model publicly available with reproducible training and inference scripts, contributing to safer, explainable AI-assisted diagnostics across medical imaging domains: https://github.com/TruhnLab/MedicalPatchNet
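The patch-classify-aggregate idea is simple enough to sketch. The following is a minimal illustration, not the authors' implementation (see their repository for that): the patch size, the shared EfficientNet-B0 backbone, and mean aggregation are assumptions.

```python
# Sketch of a patch-based self-explainable classifier: every tile is scored
# independently by a shared backbone, and the image score is the patch mean,
# so each tile's contribution to the decision can be read off directly.
import torch
import torch.nn as nn
import torchvision.models as tvm

class PatchClassifier(nn.Module):
    def __init__(self, num_classes: int = 14, patch: int = 64):
        super().__init__()
        self.patch = patch
        backbone = tvm.efficientnet_b0(weights="IMAGENET1K_V1")
        backbone.classifier[1] = nn.Linear(
            backbone.classifier[1].in_features, num_classes
        )
        self.backbone = backbone  # weights shared across all patches

    def forward(self, x: torch.Tensor):
        # x: (B, 3, H, W), with H and W divisible by the patch size.
        b, c, h, w = x.shape
        p = self.patch
        # Cut into non-overlapping p x p tiles, then fold tiles into the batch.
        tiles = x.unfold(2, p, p).unfold(3, p, p)        # (B, C, H/p, W/p, p, p)
        tiles = tiles.permute(0, 2, 3, 1, 4, 5).reshape(-1, c, p, p)
        patch_logits = self.backbone(tiles)              # one prediction per tile
        patch_logits = patch_logits.view(b, -1, patch_logits.shape[-1])
        image_logits = patch_logits.mean(dim=1)          # transparent aggregation
        return image_logits, patch_logits                # patch_logits drive the heatmap
```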

Automatic bone age assessment: a Turkish population study.

Öztürk S, Yüce M, Pamuk GG, Varlık C, Cimilli AT, Atay M

PubMed · Sep 8, 2025
Established methods for bone age assessment (BAA), such as the Greulich and Pyle atlas, suffer from variability due to population differences and observer discrepancies. Although automated BAA offers speed and consistency, limited research exists on its performance across different populations using deep learning. This study examines deep learning algorithms on the Turkish population to enhance bone age models by understanding demographic influences. We analyzed reports from Bağcılar Hospital's Health Information Management System between April 2012 and September 2023 using "bone age" as a keyword. Patient images were re-evaluated by an experienced radiologist and anonymized. A total of 2,730 hand radiographs from Bağcılar Hospital (Turkish population), 12,572 from the Radiological Society of North America (RSNA), and 6,185 from the Radiological Hand Pose Estimation (RHPE) public datasets were collected, along with corresponding bone ages and gender information. A set of 546 radiographs (273 from Bağcılar, 273 from the public datasets) was randomly split off as an internal test set, stratified by bone age; the remaining data were used for training and validation. BAAs were generated using a modified InceptionV3 model on 500 × 500-pixel images, selecting the model with the lowest mean absolute error (MAE) on the validation set. Three models were trained and tested based on dataset origin: Bağcılar (Turkish), public (RSNA-RHPE), and a combined model. On the internal test set, the combined model estimated bone age to within 6, 12, 18, and 24 months at rates of 44%, 73%, 87%, and 94%, respectively. The MAE was 9.2 months on the overall internal test set, 7 months on the public test set, and 11.5 months on the Bağcılar internal test data. The Bağcılar-only model had an MAE of 12.7 months on the Bağcılar internal test data. Despite less training data, there was no significant difference between the combined and Bağcılar models on the Bağcılar dataset (P > 0.05). The public model showed an MAE of 16.5 months on the Bağcılar dataset, significantly worse than the other models (P < 0.05). We developed an automatic BAA model including the Turkish population, one of the few such studies using deep learning. Despite challenges from population differences and data heterogeneity, these models can be effectively used in various clinical settings. Model accuracy can improve over time with cumulative data, and publicly available datasets may further refine them. Our approach enables more accurate and efficient BAAs, supporting healthcare professionals where traditional methods are time-consuming and variable. The developed automated BAA model for the Turkish population offers a reliable and efficient alternative to traditional methods. By utilizing deep learning with diverse datasets from Bağcılar Hospital and publicly available sources, the model minimizes assessment time and reduces variability. This advancement enhances clinical decision-making, supports standardized BAA practices, and improves patient care in various healthcare settings.
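A "modified InceptionV3" regressor of the kind described here might look like the sketch below: ImageNet weights, the classifier replaced by a one-unit regression head, and the sex flag concatenated to the pooled features. The head layout and the sex input are assumptions for illustration; the paper's exact modification is not specified in the abstract.

```python
# Sketch of an InceptionV3-based bone age regressor trained to minimize MAE.
import torch
import torch.nn as nn
import torchvision.models as tvm

class BoneAgeNet(nn.Module):
    def __init__(self):
        super().__init__()
        net = tvm.inception_v3(weights="IMAGENET1K_V1")
        net.fc = nn.Identity()            # expose the 2048-d pooled features
        self.backbone = net
        self.head = nn.Sequential(
            nn.Linear(2048 + 1, 256), nn.ReLU(),
            nn.Linear(256, 1),            # predicted bone age in months
        )

    def forward(self, image: torch.Tensor, is_male: torch.Tensor):
        feats = self.backbone(image)      # image: (B, 3, 500, 500)
        if isinstance(feats, tuple):      # training mode returns (main, aux)
            feats = feats[0]
        x = torch.cat([feats, is_male.float().unsqueeze(1)], dim=1)
        return self.head(x).squeeze(1)

criterion = nn.L1Loss()  # L1 optimizes MAE directly, the metric reported above
```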

Automated Radiographic Total Sharp Score (ARTSS) in Rheumatoid Arthritis: A Solution to Reduce Inter-Intra Reader Variation and Enhancing Clinical Practice

Hajar Moradmand, Lei Ren

arXiv preprint · Sep 8, 2025
Assessing the severity of rheumatoid arthritis (RA) using the Total Sharp/van der Heijde Score (TSS) is crucial, but manual scoring is often time-consuming and subjective. This study introduces an Automated Radiographic Sharp Scoring (ARTSS) framework that leverages deep learning to analyze full-hand X-ray images, aiming to reduce inter- and intra-observer variability. The research uniquely accommodates patients with joint disappearance and variable-length image sequences. We developed ARTSS using data from 970 patients, structured into four stages: I) image pre-processing and re-orientation using ResNet50, II) hand segmentation using UNet.3, III) joint identification using YOLOv7, and IV) TSS prediction using models such as VGG16, VGG19, ResNet50, DenseNet201, EfficientNetB0, and Vision Transformer (ViT). We evaluated model performance with intersection over union (IoU), mean average precision (mAP), mean absolute error (MAE), root mean squared error (RMSE), and Huber loss. The average TSS from two radiologists was used as the ground truth. Model training employed 3-fold cross-validation, with each fold consisting of 452 training and 227 validation samples, and external testing included 291 unseen subjects. Our joint identification model achieved 99% accuracy. The best-performing model, ViT, achieved a notably low Huber loss of 0.87 for TSS prediction. Our results demonstrate the potential of deep learning to automate RA scoring, which can significantly enhance clinical practice. Our approach addresses the challenge of joint disappearance and variable joint numbers, offers time-saving benefits, reduces inter- and intra-reader variability, improves radiologist accuracy, and aids rheumatologists in making more informed decisions.
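Stage IV, the best-performing part of the pipeline, could be sketched as a ViT regressor trained with Huber loss, which is what the abstract reports. The backbone choice (torchvision's vit_b_16), the single-output head, the delta value, and the optimizer are assumptions for illustration.

```python
# Sketch of stage IV: regress a Sharp score from a joint crop with a ViT.
import torch
import torch.nn as nn
import torchvision.models as tvm

vit = tvm.vit_b_16(weights="IMAGENET1K_V1")
vit.heads = nn.Linear(vit.hidden_dim, 1)   # single regression output

criterion = nn.HuberLoss(delta=1.0)        # the study reports Huber loss (0.87 for ViT)
optimizer = torch.optim.AdamW(vit.parameters(), lr=3e-5)

def train_step(crops: torch.Tensor, scores: torch.Tensor) -> float:
    """One step on joint crops from stage III, resized to (B, 3, 224, 224)."""
    vit.train()
    optimizer.zero_grad()
    loss = criterion(vit(crops).squeeze(1), scores)
    loss.backward()
    optimizer.step()
    return loss.item()
```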

The Effect of Image Resolution on the Performance of Deep Learning Algorithms in Detecting Calcaneus Fractures on X-Ray

Yee, N. J., Taseh, A., Ghandour, S., Sirls, E., Halai, M., Whyne, C., DiGiovanni, C. W., Kwon, J. Y., Ashkani-Esfahani, S. J.

medRxiv preprint · Sep 7, 2025
Purpose: To evaluate convolutional neural network (CNN) model training strategies that optimize the performance of calcaneus fracture detection on radiographs at different image resolutions.
Materials and Methods: This retrospective study included foot radiographs from a single hospital between 2015 and 2022, for a total of 1,775 x-ray series (551 with fracture; 1,224 without), split into training (70%), validation (15%), and testing (15%). ImageNet pre-trained ResNet models were fine-tuned on the dataset. Three training strategies were evaluated: 1) single size: trained exclusively on 128x128, 256x256, 512x512, 640x640, or 900x900 radiographs (5 model sets); 2) curriculum learning: trained exclusively on 128x128 radiographs, then exclusively on 256x256, then 512x512, then 640x640, and finally on 900x900 (5 model sets); and 3) multi-scale augmentation: trained on x-ray images resized along continuous dimensions between 128x128 and 900x900 (1 model set). Inference time and training time were compared.
Results: Models trained with multi-scale augmentation achieved the highest average area under the receiver operating characteristic curve for a single model across image resolutions, 0.938 [95% CI: 0.936-0.939], compared with the other strategies, without prolonging training or inference time. Using the optimal model sets, curriculum learning had the highest sensitivity on in-distribution low-resolution images (85.4% to 90.1%) and on out-of-distribution high-resolution images (78.2% to 89.2%). However, curriculum learning models took significantly longer to train (11.8 [IQR: 11.1-16.4] hours; P<.001).
Conclusion: While 512x512 images worked well for fracture identification, curriculum learning and multi-scale augmentation training strategies algorithmically improved model robustness towards different image resolutions without requiring additional annotated data.
Summary statement: Different deep learning training strategies affect performance in detecting calcaneus fractures on radiographs across in- and out-of-distribution image resolutions, with a multi-scale augmentation strategy conferring the greatest overall performance improvement in a single model.
Key points:
- Training strategies addressing differences in radiograph image resolution (or pixel dimensions) could improve deep learning performance.
- The highest average performance across different image resolutions in a single model was achieved by multi-scale augmentation, where the sampled training dataset is uniformly resized between square resolutions of 128x128 and 900x900.
- Compared to model training on a single image resolution, sequentially training on increasingly higher-resolution images up to 900x900 (i.e., curriculum learning) resulted in higher fracture detection performance on image resolutions between 128x128 and 2048x2048.
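The winning strategy, multi-scale augmentation, amounts to resizing each training image to a square resolution sampled uniformly from the evaluated range. A minimal sketch follows; sampling per image (rather than per batch) and the antialias setting are assumptions, and images within a batch still need to be collated to a common size or padded.

```python
# Sketch of multi-scale augmentation: random square resize in [128, 900].
import random
import torchvision.transforms.functional as TF

class MultiScaleResize:
    """Resize an image to a random square resolution in [low, high]."""

    def __init__(self, low: int = 128, high: int = 900):
        self.low, self.high = low, high

    def __call__(self, img):
        size = random.randint(self.low, self.high)  # re-sampled on every call
        return TF.resize(img, [size, size], antialias=True)
```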

AI-Based Applied Innovation for Fracture Detection in X-rays Using Custom CNN and Transfer Learning Models

Amna Hassan, Ilsa Afzaal, Nouman Muneeb, Aneeqa Batool, Hamail Noor

arXiv preprint · Sep 7, 2025
Bone fractures present a major global health challenge, often resulting in pain, reduced mobility, and productivity loss, particularly in low-resource settings where access to expert radiology services is limited. Conventional imaging workflows suffer from high costs, radiation exposure, and dependency on specialized interpretation. To address this, we developed an AI-based solution for automated fracture detection from X-ray images using a custom convolutional neural network (CNN) and benchmarked it against transfer learning models including EfficientNetB0, MobileNetV2, and ResNet50. Training was conducted on the publicly available FracAtlas dataset, comprising 4,083 anonymized musculoskeletal radiographs. The custom CNN achieved 95.96% accuracy, 0.94 precision, 0.88 recall, and an F1-score of 0.91 on the FracAtlas dataset. Although the transfer learning models performed poorly in this specific setup, these results should be interpreted in light of class imbalance and dataset limitations. This work highlights the promise of lightweight CNNs for detecting fractures in X-rays and underscores the importance of fair benchmarking, diverse datasets, and external validation for clinical translation.
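Because FracAtlas is imbalanced, the reported precision/recall/F1 for the fracture class matter more than accuracy alone. A minimal sketch of how those metrics can be computed follows; the 0.5 decision threshold is an assumption for illustration.

```python
# Sketch of imbalance-aware evaluation for a binary fracture classifier.
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def report(y_true: np.ndarray, y_prob: np.ndarray, thresh: float = 0.5) -> None:
    """y_true: (N,) in {0, 1}; y_prob: (N,) predicted fracture probabilities."""
    y_pred = (y_prob >= thresh).astype(int)
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", pos_label=1  # fracture = positive class
    )
    acc = accuracy_score(y_true, y_pred)
    print(f"accuracy={acc:.4f} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```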

Prediction of bronchopulmonary dysplasia using machine learning from chest X-rays of premature infants in the neonatal intensive care unit.

Ozcelik G, Erol S, Korkut S, Kose Cetinkaya A, Ozcelik H

PubMed · Sep 5, 2025
Bronchopulmonary dysplasia (BPD) is a significant morbidity in premature infants. This study aimed to develop an artificial intelligence model that predicts BPD from chest radiographs and to assess the accuracy of its predictions against clinical outcomes. Medical records of premature infants born at ≤ 28 weeks and < 1,250 g between January 1, 2020, and December 31, 2021, in the neonatal intensive care unit were obtained. In this retrospective model development and validation study, an artificial intelligence model was developed using the DenseNet121 deep learning architecture. The training and test sets consisted of chest radiographs obtained on postnatal day 1 as well as during the 2nd, 3rd, and 4th weeks. The model predicted the likelihood of developing no BPD, or mild, moderate, or severe BPD. The accuracy of the artificial intelligence model was tested based on the clinical outcomes of patients. This study included 122 premature infants with a birth weight of 990 g (range: 840-1120 g). Of these, 33 (27%) patients did not develop BPD, 24 (19.7%) had mild BPD, 28 (23%) had moderate BPD, and 37 (30%) had severe BPD. A total of 395 chest radiographs from these patients were used to develop an artificial intelligence (AI) model for predicting BPD. Area under the curve values, representing the accuracy of predicting severe, moderate, mild, and no BPD, were as follows: 0.79, 0.75, 0.82, and 0.82 for day 1 radiographs; 0.88, 0.82, 0.74, and 0.94 for week 2 radiographs; 0.87, 0.83, 0.88, and 0.96 for week 3 radiographs; and 0.90, 0.82, 0.86, and 0.97 for week 4 radiographs. The artificial intelligence model successfully identified BPD on chest radiographs and classified its severity. The accuracy of the model can be improved using larger control and external validation datasets.
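The abstract implies a DenseNet121 with a four-class head (no, mild, moderate, or severe BPD) evaluated by per-class AUC. A minimal sketch follows; the head layout and one-vs-rest AUC formulation are assumptions consistent with, but not confirmed by, the abstract.

```python
# Sketch of a four-class DenseNet121 severity classifier and per-class AUC.
import numpy as np
import torch.nn as nn
import torchvision.models as tvm
from sklearn.metrics import roc_auc_score

model = tvm.densenet121(weights="IMAGENET1K_V1")
model.classifier = nn.Linear(model.classifier.in_features, 4)

def per_class_auc(y_true: np.ndarray, y_prob: np.ndarray) -> list:
    """y_true: (N,) labels in {0..3}; y_prob: (N, 4) softmax probabilities.
    Returns one one-vs-rest AUC per severity class."""
    return [
        roc_auc_score((y_true == k).astype(int), y_prob[:, k]) for k in range(4)
    ]
```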

Veriserum: A dual-plane fluoroscopic dataset with knee implant phantoms for deep learning in medical imaging

Jinhao Wang, Florian Vogl, Pascal Schütz, Saša Ćuković, William R. Taylor

arXiv preprint · Sep 5, 2025
Veriserum is an open-source dataset designed to support the training of deep learning registration for dual-plane fluoroscopic analysis. It comprises approximately 110,000 X-ray images of 10 knee implant pair combinations (2 femur and 5 tibia implants) captured during 1,600 trials, incorporating poses associated with daily activities such as level gait and ramp descent. Each image is annotated with an automatically registered ground-truth pose, while 200 images include manually registered poses for benchmarking. Key features of Veriserum include dual-plane images and calibration tools. The dataset supports the development of applications such as 2D/3D image registration, image segmentation, X-ray distortion correction, and 3D reconstruction. By providing a freely accessible, reproducible benchmark for algorithm development and evaluation, Veriserum aims to advance computer vision and medical imaging research. The dataset is publicly available via https://movement.ethz.ch/data-repository/veriserum.html, with the data stored at the ETH Zürich Research Collections: https://doi.org/10.3929/ethz-b-000701146.
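As a generic illustration of the 2D/3D registration problem this dataset targets (not code from Veriserum's own tooling), the sketch below projects posed 3D implant-model points into an image plane under a pinhole camera; the pose format and intrinsics matrix are assumptions, and Veriserum ships its own calibration tools.

```python
# Sketch of the forward projection at the heart of 2D/3D registration.
import numpy as np

def project(points: np.ndarray, R: np.ndarray, t: np.ndarray,
            K: np.ndarray) -> np.ndarray:
    """points: (N, 3) model coords; R: (3, 3) rotation; t: (3,) translation;
    K: (3, 3) camera intrinsics. Returns (N, 2) pixel coordinates."""
    cam = points @ R.T + t           # model -> camera coordinates
    pix = cam @ K.T                  # camera -> homogeneous pixel coordinates
    return pix[:, :2] / pix[:, 2:3]  # perspective divide

# A dual-plane setup uses two such cameras; registration searches for the
# pose (R, t) whose projections best match both X-ray views simultaneously.
```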

Deep learning-based precision phenotyping of spine curvature identifies novel genetic risk loci for scoliosis in the UK Biobank

Zeosk, M., Kun, E., Reddy, S., Pandey, D., Xu, L., Wang, J. Y., Li, C., Gray, R. S., Wise, C. A., Otomo, N., Narasimhan, V. M.

medRxiv preprint · Sep 5, 2025
Scoliosis is the most common developmental spinal deformity, but its genetic underpinnings remain only partially understood. To enhance the identification of scoliosis-related loci, we utilized whole-body dual-energy X-ray absorptiometry (DXA) scans from 57,887 individuals in the UK Biobank (UKB) and quantified spine curvature by applying deep learning models to segment and then landmark vertebrae, measuring the cumulative horizontal displacement of the spine from a central axis. On a subset of 120 individuals, our automated image-derived curvature measurements showed a correlation of 0.92 with clinical Cobb angle assessments, supporting their validity as a proxy for scoliosis severity. To connect spinal curvature with its genetic basis, we conducted a genome-wide association study (GWAS). Our quantitative imaging phenotype allowed us to identify 2 novel loci associated with scoliosis in a European population not seen in previous GWAS: one in the gene SEM1/SHFM1 and one in a lncRNA on chromosome 3 that is downstream of EDEM1 and upstream of GRM7. Genetic correlation analysis revealed significant overlap between our image-based GWAS and ICD-10-based GWAS in both the UKB and the Biobank of Japan. We also showed that our quantitative GWAS had more statistical power to identify new loci than a case-control dataset with an order-of-magnitude larger sample size. Increased spine curvature was also associated with increased leg length discrepancy, reduced muscle strength, decreased bone density, and an increased incidence of knee, but not hip, osteoarthritis. Our results illustrate the potential of using quantitative imaging phenotypes to uncover genetic associations that are challenging to capture with medical records alone and identify new loci for functional follow-up.
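As a rough illustration of the curvature phenotype described here, the sketch below sums the horizontal displacement of landmarked vertebral centroids from a central axis; the landmark format and the axis definition (a vertical line through the mean x coordinate) are assumptions, not the authors' exact pipeline.

```python
# Sketch of a cumulative-horizontal-displacement curvature measure.
import numpy as np

def cumulative_horizontal_displacement(centroids: np.ndarray) -> float:
    """centroids: (N, 2) array of (x, y) vertebral body centers from the
    segmentation/landmarking models, ordered top to bottom in the DXA scan."""
    central_axis_x = centroids[:, 0].mean()  # assumed central (vertical) axis
    return float(np.abs(centroids[:, 0] - central_axis_x).sum())
```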

Detecting, Characterizing, and Mitigating Implicit and Explicit Racial Biases in Health Care Datasets With Subgroup Learnability: Algorithm Development and Validation Study.

Gulamali F, Sawant AS, Liharska L, Horowitz C, Chan L, Hofer I, Singh K, Richardson L, Mensah E, Charney A, Reich D, Hu J, Nadkarni G

PubMed · Sep 4, 2025
The growing adoption of diagnostic and prognostic algorithms in health care has led to concerns about the perpetuation of algorithmic bias against disadvantaged groups of individuals. Deep learning methods to detect and mitigate bias have revolved around modifying models, optimization strategies, and threshold calibration with varying levels of success and tradeoffs. However, there have been limited substantive efforts to address bias at the level of the data used to generate algorithms in health care datasets. The aim of this study is to create a simple metric (AEquity) that uses a learning curve approximation to distinguish and mitigate bias via guided dataset collection or relabeling. We demonstrate this metric in 2 well-known examples, chest X-rays and health care cost utilization, and detect novel biases in the National Health and Nutrition Examination Survey. We demonstrated that using AEquity to guide data-centric collection for each diagnostic finding in the chest radiograph dataset decreased bias by between 29% and 96.5% when measured by differences in area under the curve. Next, we wanted to examine (1) whether AEquity worked on intersectional populations and (2) whether AEquity is invariant to different types of fairness metrics, not just area under the curve. Subsequently, we examined the effect of AEquity on mitigating bias when measured by false negative rate, precision, and false discovery rate for Black patients on Medicaid. When we examined Black patients on Medicaid, at the intersection of race and socioeconomic status, we found that AEquity-based interventions reduced bias across a number of different fairness metrics: overall false negative rate by 33.3% (absolute bias reduction 1.88×10⁻¹, 95% CI 1.4×10⁻¹ to 2.5×10⁻¹; relative reduction 33.3%, 95% CI 26.6%-40%), precision bias by 94.6% (absolute bias reduction 7.50×10⁻², 95% CI 7.48×10⁻² to 7.51×10⁻²; relative reduction 94.6%, 95% CI 94.5%-94.7%), and false discovery rate by 94.5% (absolute bias reduction 3.50×10⁻², 95% CI 3.49×10⁻² to 3.50×10⁻²). Similarly, AEquity-guided data collection demonstrated bias reduction of up to 80% on mortality prediction with the National Health and Nutrition Examination Survey (absolute bias reduction 0.08, 95% CI 0.07-0.09). Then, we wanted to compare AEquity to state-of-the-art data-guided debiasing measures such as balanced empirical risk minimization and calibration. Consequently, we benchmarked against balanced empirical risk minimization and calibration and showed that AEquity-guided data collection outperforms both standard approaches. Moreover, we demonstrated that AEquity works on fully connected networks; convolutional neural networks such as ResNet-50; transformer architectures such as ViT-B-16, a vision transformer with 86 million parameters; and nonparametric methods such as Light Gradient-Boosting Machine. In short, we demonstrated that AEquity is a robust tool by applying it to different datasets, algorithms, and intersectional analyses and measuring its effectiveness with respect to a range of traditional fairness metrics.
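As a rough illustration of the learning-curve idea behind a subgroup-learnability metric like AEquity, the sketch below traces holdout AUC as a function of per-subgroup training-set size; the estimator, sample sizes, and summary are assumptions for illustration, not the authors' formulation.

```python
# Sketch: compare how quickly performance saturates for each subgroup.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def learning_curve_auc(X: np.ndarray, y: np.ndarray,
                       sizes=(50, 100, 200, 400), seed: int = 0) -> np.ndarray:
    """Fit on growing random subsets of one subgroup's data and score the
    remainder; returns one holdout AUC per training size."""
    rng = np.random.default_rng(seed)
    aucs = []
    for n in sizes:
        train_idx = rng.choice(len(y), size=n, replace=False)
        holdout = np.setdiff1d(np.arange(len(y)), train_idx)
        clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        aucs.append(roc_auc_score(y[holdout], clf.predict_proba(X[holdout])[:, 1]))
    return np.array(aucs)

# A subgroup whose curve sits lower or rises more slowly is harder to learn
# from the current data; guided collection then prioritizes samples from it.
```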