Latest Papers on Radiology AI. Sources: medrxiv, Tags: X-Ray.

The Effect of Image Resolution on the Performance of Deep Learning Algorithms in Detecting Calcaneus Fractures on X-Ray

Yee, N. J., Taseh, A., Ghandour, S., Sirls, E., Halai, M., Whyne, C., DiGiovanni, C. W., Kwon, J. Y., Ashkani-Esfahani, S. J.

•preprint•Sep 7 2025

Purpose: To evaluate convolutional neural network (CNN) model training strategies that optimize the performance of calcaneus fracture detection on radiographs at different image resolutions. Materials and Methods: This retrospective study included foot radiographs from a single hospital between 2015 and 2022, for a total of 1,775 x-ray series (551 fractures; 1,224 without), split into training (70%), validation (15%), and testing (15%) sets. ImageNet pre-trained ResNet models were fine-tuned on the dataset. Three training strategies were evaluated: 1) single size: trained exclusively on 128x128, 256x256, 512x512, 640x640, or 900x900 radiographs (5 model sets); 2) curriculum learning: trained exclusively on 128x128 radiographs, then exclusively on 256x256, then 512x512, then 640x640, and finally on 900x900 (5 model sets); and 3) multi-scale augmentation: trained on x-ray images resized along continuous dimensions between 128x128 and 900x900 (1 model set). Inference time and training time were compared. Results: Multi-scale augmentation trained models achieved the highest average area under the Receiver Operating Characteristic curve of 0.938 [95% CI: 0.936 - 0.939] for a single model across image resolutions compared to the other strategies without prolonging training or inference time. Using the optimal model sets, curriculum learning had the highest sensitivity on in-distribution low-resolution images (85.4% to 90.1%) and on out-of-distribution high-resolution images (78.2% to 89.2%). However, curriculum learning models took significantly longer to train (11.8 [IQR: 11.1-16.4] hours; P<.001). Conclusion: While 512x512 images worked well for fracture identification, curriculum learning and multi-scale augmentation training strategies algorithmically improved model robustness towards different image resolutions without requiring additional annotated data.
Summary statement: Different deep learning training strategies affect performance in detecting calcaneus fractures on radiographs across in- and out-of-distribution image resolutions, with a multi-scale augmentation strategy conferring the greatest overall performance improvement in a single model. Key points: 1) Training strategies addressing differences in radiograph image resolution (or pixel dimensions) could improve deep learning performance. 2) The highest average performance across different image resolutions in a single model was achieved by multi-scale augmentation, where the sampled training dataset is uniformly resized between square resolutions of 128x128 and 900x900. 3) Compared to model training on a single image resolution, sequentially training on increasingly higher resolution images up to 900x900 (i.e., curriculum learning) resulted in higher fracture detection performance on image resolutions between 128x128 and 2048x2048.
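
The multi-scale augmentation strategy described above can be illustrated with a minimal sketch. It assumes that each training image is resized to a square side length drawn uniformly from 128 to 900 pixels; the function names (`sample_side`, `resize_nearest`, `multiscale_augment`), the nearest-neighbour resize, and the toy image are illustrative assumptions, not the authors' implementation.

```python
import random

MIN_SIDE, MAX_SIDE = 128, 900  # resolution range reported in the abstract


def sample_side(rng: random.Random) -> int:
    """Draw a square side length uniformly from [MIN_SIDE, MAX_SIDE]."""
    return rng.randint(MIN_SIDE, MAX_SIDE)


def resize_nearest(image, side):
    """Resize a 2-D list of pixel values to side x side by nearest-neighbour lookup.

    Assumption: nearest-neighbour is used only to keep the sketch dependency-free;
    the interpolation used in the study is not stated in the abstract.
    """
    src_h, src_w = len(image), len(image[0])
    return [
        [image[r * src_h // side][c * src_w // side] for c in range(side)]
        for r in range(side)
    ]


def multiscale_augment(image, rng):
    """Return the image resized to a randomly sampled square resolution."""
    return resize_nearest(image, sample_side(rng))


rng = random.Random(0)
toy = [[r * 4 + c for c in range(4)] for r in range(4)]  # stand-in for a radiograph
augmented = multiscale_augment(toy, rng)
print(len(augmented), len(augmented[0]))  # side length varies with the seed
```

In a real training loop the same idea would usually be applied per sample or per batch inside the data loader, so that a single model sees the full range of resolutions during training.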

X-Ray Detection Musculoskeletal Retrospective Clinical In Silico Academic Lab

Deep learning-based precision phenotyping of spine curvature identifies novel genetic risk loci for scoliosis in the UK Biobank

Zeosk, M., Kun, E., Reddy, S., Pandey, D., Xu, L., Wang, J. Y., Li, C., Gray, R. S., Wise, C. A., Otomo, N., Narasimhan, V. M.

•preprint•Sep 5 2025

Scoliosis is the most common developmental spinal deformity, but its genetic underpinnings remain only partially understood. To enhance the identification of scoliosis-related loci, we utilized whole body dual energy X-ray absorptiometry (DXA) scans from 57,887 individuals in the UK Biobank (UKB), and quantified spine curvature by applying deep learning models to segment and then landmark vertebrae to measure the cumulative horizontal displacement of the spine from a central axis. On a subset of 120 individuals, our automated image-derived curvature measurements showed a correlation of 0.92 with clinical Cobb angle assessments, supporting their validity as a proxy for scoliosis severity. To connect spinal curvature with its genetic basis, we conducted a genome-wide association study (GWAS). Our quantitative imaging phenotype allowed us to identify 2 novel loci associated with scoliosis in a European population not seen in previous GWAS. These loci are in the gene SEM1/SHFM1 as well as on a lncRNA on chr 3 that is downstream of EDEM1 and upstream of GRM7. Genetic correlation analysis revealed significant overlap between our image-based GWAS and ICD-10 based GWAS in both the UKB and Biobank of Japan. We also showed that our quantitative GWAS had more statistical power to identify new loci than a case-control dataset with an order of magnitude larger sample size. Increased spine curvature was also associated with increased leg length discrepancy, reduced muscle strength and decreased bone density, and increased incidence of knee but not hip osteoarthritis. Our results illustrate the potential of using quantitative imaging phenotypes to uncover genetic associations that are challenging to capture with medical records alone and identify new loci for functional follow-up.
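
As a rough illustration of a "cumulative horizontal displacement from a central axis" measure, the sketch below sums the absolute horizontal offsets of vertebra landmarks from a reference line. The abstract does not define the central axis, so the choice here (the straight line through the first and last landmark) is an assumption, as are the function name and the toy coordinates.

```python
def cumulative_horizontal_displacement(landmarks):
    """Sum of absolute horizontal offsets of vertebra centres from a central axis.

    landmarks: list of (x, y) vertebra centres ordered from top to bottom.
    Assumption: the central axis is the straight line through the first and
    last landmark; the paper's actual axis definition may differ.
    """
    (x0, y0), (x1, y1) = landmarks[0], landmarks[-1]
    if y1 == y0:
        raise ValueError("first and last landmarks must differ in height")
    total = 0.0
    for x, y in landmarks:
        # Horizontal position of the axis at this landmark's height.
        axis_x = x0 + (x1 - x0) * (y - y0) / (y1 - y0)
        total += abs(x - axis_x)
    return total


# Toy coordinates (hypothetical, in pixels).
straight = [(10.0, float(y)) for y in range(0, 50, 10)]
curved = [(10.0, 0.0), (13.0, 10.0), (15.0, 20.0), (13.0, 30.0), (10.0, 40.0)]

print(cumulative_horizontal_displacement(straight))  # 0.0
print(cumulative_horizontal_displacement(curved))    # 11.0
```

A perfectly straight spine scores zero under this definition, and the score grows with both the size and the number of lateral deviations, which is the property that makes such a measure usable as a continuous phenotype.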

X-Ray Segmentation Musculoskeletal Retrospective Clinical In Silico Academic Lab Open Dataset

Multimodal Deep Learning for ARDS Detection

Broecker, S., Adams, J. Y., Kumar, G., Callcut, R., Ni, Y., Strohmer, T.

•preprint•Aug 12 2025

Objective: Poor outcomes in acute respiratory distress syndrome (ARDS) can be alleviated with tools that support early diagnosis. Current machine learning methods for detecting ARDS do not take full advantage of the multimodality of ARDS pathophysiology. We developed a multimodal deep learning model that uses imaging data, continuously collected ventilation data, and tabular data derived from a patient's electronic health record (EHR) to make ARDS predictions. Materials and Methods: A chest radiograph (x-ray), at least two hours of ventilator waveform data (VWD) within the first 24 hours of intubation, and EHR-derived tabular data were used from 220 patients admitted to the ICU to train a deep learning model. The model uses pretrained encoders for the x-rays and ventilation data and trains a feature extractor on tabular data. Encoded features for a patient are combined to make a single ARDS prediction. Ablation studies for each modality assessed their effect on the model's predictive capability. Results: The trimodal model achieved an area under the receiver operating characteristic curve (AUROC) of 0.86 with a 95% confidence interval of 0.01. This was a statistically significant improvement (p<0.05) over single modality models and bimodal models trained on VWD+tabular and VWD+x-ray data. Discussion and Conclusion: Our results demonstrate the potential utility of using deep learning to address complex conditions with heterogeneous data. More work is needed to determine the additive effect of modalities on ARDS detection. Our framework can serve as a blueprint for building performant multimodal deep learning models for conditions with small, heterogeneous datasets.

X-Ray Classification Chest Methodology In Silico

DREAM: A framework for discovering mechanisms underlying AI prediction of protected attributes

Gadgil, S. U., DeGrave, A. J., Janizek, J. D., Xu, S., Nwandu, L., Fonjungo, F., Lee, S.-I., Daneshjou, R.

•preprint•Jul 21 2025

Recent advances in Artificial Intelligence (AI) have started disrupting the healthcare industry, especially medical imaging, and AI devices are increasingly being deployed into clinical practice. Such classifiers have previously demonstrated the ability to discern a range of protected demographic attributes (like race, age, sex) from medical images with unexpectedly high performance, a sensitive task which is difficult even for trained physicians. In this study, we motivate and introduce a general explainable AI (XAI) framework called DREAM (DiscoveRing and Explaining AI Mechanisms) for interpreting how AI models trained on medical images predict protected attributes. Focusing on two modalities, radiology and dermatology, we are successfully able to train high-performing classifiers for predicting race from chest x-rays (ROC-AUC score of ~0.96) and sex from dermoscopic lesions (ROC-AUC score of ~0.78). We highlight how incorrect use of these demographic shortcuts can have a detrimental effect on the performance of a clinically relevant downstream task like disease diagnosis under a domain shift. Further, we employ various XAI techniques to identify specific signals which can be leveraged to predict sex. Finally, we propose a technique, which we call removal via balancing, to quantify how much a signal contributes to the classification performance. Using this technique and the signals identified, we are able to explain ~15% of the total performance for radiology and ~42% of the total performance for dermatology. We envision DREAM to be broadly applicable to other modalities and demographic attributes. This analysis not only underscores the importance of cautious AI application in healthcare but also opens avenues for improving the transparency and reliability of AI-driven diagnostic tools.

X-Ray Classification Chest Methodology In Silico Ethics

Prediction of OncotypeDX recurrence score using H&E stained WSI images

Cohen, S., Shamai, G., Sabo, E., Cretu, A., Barshack, I., Goldman, T., Bar-Sela, G., Pearson, A. T., Huo, D., Howard, F. M., Kimmel, R., Mayer, C.

•preprint•Jul 21 2025

The OncotypeDX 21-gene assay is a widely adopted tool for estimating recurrence risk and informing chemotherapy decisions in early-stage, hormone receptor-positive, HER2-negative breast cancer. Although informative, its high cost and long turnaround time limit accessibility and delay treatment in low- and middle-income countries, creating a need for alternative solutions. This study presents a deep learning-based approach for predicting OncotypeDX recurrence scores directly from hematoxylin and eosin-stained whole slide images. Our approach leverages a deep learning foundation model pre-trained on 171,189 slides via self-supervised learning, which is fine-tuned for our task. The model was developed and validated using five independent cohorts, out of which three are external. On the two external cohorts that include OncotypeDX scores, the model achieved an AUC of 0.825 and 0.817, and identified 21.9% and 25.1% of the patients as low-risk with sensitivity of 0.97 and 0.95 and negative predictive value of 0.97 and 0.96, showing strong generalizability despite variations in staining protocols and imaging devices. Kaplan-Meier analysis demonstrated that patients classified as low-risk by the model had a significantly better prognosis than those classified as high-risk, with a hazard ratio of 4.1 (P<0.001) and 2.0 (P<0.01) on the two external cohorts that include patient outcomes. This artificial intelligence-driven solution offers a rapid, cost-effective, and scalable alternative to genomic testing, with the potential to enhance personalized treatment planning, especially in resource-constrained settings.

X-Ray Classification Breast Retrospective Clinical In Silico Benchmark SOTA

A clinically relevant morpho-molecular classification of lung neuroendocrine tumours

Sexton-Oates, A., Mathian, E., Candeli, N., Lim, Y., Voegele, C., Di Genova, A., Mange, L., Li, Z., van Weert, T., Hillen, L. M., Blazquez-Encinas, R., Gonzalez-Perez, A., Morrison, M. L., Lauricella, E., Mangiante, L., Bonheme, L., Moonen, L., Absenger, G., Altmuller, J., Degletagne, C., Brustugun, O. T., Cahais, V., Centonze, G., Chabrier, A., Cuenin, C., Damiola, F., de Montpreville, V. T., Deleuze, J.-F., Dingemans, A.-M. C., Fadel, E., Gadot, N., Ghantous, A., Graziano, P., Hofman, P., Hofman, V., Ibanez-Costa, A., Lacomme, S., Lopez-Bigas, N., Lund-Iversen, M., Milione, M., Muscarella, L

•preprint•Jul 18 2025

Lung neuroendocrine tumours (NETs, also known as carcinoids) are rapidly rising in incidence worldwide but have unknown aetiology and limited therapeutic options beyond surgery. We conducted multi-omic analyses on over 300 lung NETs including whole-genome sequencing (WGS), transcriptome profiling, methylation arrays, spatial RNA sequencing, and spatial proteomics. The integration of multi-omic data provides definitive proof of the existence of four strikingly different molecular groups that vary in patient characteristics, genomic and transcriptomic profiles, microenvironment, and morphology, as much as distinct diseases. Among these, we identify a new molecular group, enriched for highly aggressive supra-carcinoids, that displays an immune-rich microenvironment linked to tumour-macrophage crosstalk, and we uncover an undifferentiated cell population within supra-carcinoids, explaining their molecular and behavioural link to high-grade lung neuroendocrine carcinomas. Deep learning models accurately identified the Ca A1, Ca A2, and Ca B groups based on morphology alone, outperforming current histological criteria. The characteristic tumour microenvironment of supra-carcinoids and the validation of a panel of immunohistochemistry markers for the other three molecular groups demonstrate that these groups can be accurately identified based solely on morphological features, facilitating their implementation in the clinical setting. Our proposed morpho-molecular classification highlights group-specific therapeutic opportunities, including DLL3, FGFR, TERT, and BRAF inhibitors. Overall, our findings unify previously proposed molecular classifications and refine the lung cancer map by revealing novel tumour types and potential treatments, with significant implications for prognosis and treatment decision-making.

X-Ray Classification Chest Retrospective Clinical In Silico Breakthrough

Detecting Fifth Metatarsal Fractures on Radiographs through the Lens of Smartphones: A FIXUS AI Algorithm

Taseh, A., Shah, A., Eftekhari, M., Flaherty, A., Ebrahimi, A., Jones, S., Nukala, V., Nazarian, A., Waryasz, G., Ashkani-Esfahani, S.

•preprint•Jul 18 2025

Background: Fifth metatarsal (5MT) fractures are common but challenging to diagnose, particularly with limited expertise or subtle fractures. Deep learning shows promise but faces limitations due to image quality requirements. This study develops a deep learning model to detect 5MT fractures from smartphone-captured radiograph images, enhancing accessibility of diagnostic tools. Methods: A retrospective study included patients aged >18 with 5MT fractures (n=1240) and controls (n=1224). Radiographs (AP, oblique, lateral) from Electronic Health Records (EHR) were obtained and photographed using a smartphone, creating a new dataset (SP). Models using ResNet 152V2 were trained on EHR, SP, and combined datasets, then evaluated on a separate smartphone test dataset (SP-test). Results: On validation, the SP model achieved optimal performance (AUROC: 0.99). On the SP-test dataset, the EHR model's performance decreased (AUROC: 0.83), whereas SP and combined models maintained high performance (AUROC: 0.99). Conclusions: Smartphone-specific deep learning models effectively detect 5MT fractures, suggesting their practical utility in resource-limited settings.

X-Ray Detection Musculoskeletal Retrospective Clinical In Silico

Three-dimensional high-content imaging of unstained soft tissue with subcellular resolution using a laboratory-based multi-modal X-ray microscope

Esposito, M., Astolfo, A., Zhou, Y., Buchanan, I., Teplov, A., Endrizzi, M., Egido Vinogradova, A., Makarova, O., Divan, R., Tang, C.-M., Yagi, Y., Lee, P. D., Walsh, C. L., Ferrara, J. D., Olivo, A.

•preprint•Jul 14 2025

With increasing interest in studying biological systems across spatial scales, from centimetres down to nanometres, histology continues to be the gold standard for tissue imaging at cellular resolution, providing an essential bridge between macroscopic and nanoscopic analysis. However, its inherently destructive and two-dimensional nature limits its ability to capture the full three-dimensional complexity of tissue architecture. Here we show that phase-contrast X-ray microscopy can enable three-dimensional virtual histology with subcellular resolution. This technique provides direct quantification of electron density without restrictive assumptions, allowing for direct characterisation of cellular nuclei in a standard laboratory setting. By combining high spatial resolution and soft tissue contrast with automated segmentation of cell nuclei, we demonstrated virtual H&E staining using machine learning-based style transfer, yielding volumetric datasets compatible with existing histopathological analysis tools. Furthermore, by integrating electron density and the sensitivity to nanometric features of the dark field contrast channel, we achieve stain-free, high-content imaging capable of distinguishing nuclei and extracellular matrix.

X-Ray Segmentation Methodology Prototype GenAI

A Clinically-Informed Framework for Evaluating Vision-Language Models in Radiology Report Generation: Taxonomy of Errors and Risk-Aware Metric

Guan, H., Hou, P. C., Hong, P., Wang, L., Zhang, W., Du, X., Zhou, Z., Zhou, L.

•preprint•Jul 14 2025

Recent advances in vision-language models (VLMs) have enabled automatic radiology report generation, yet current evaluation methods remain limited to general-purpose NLP metrics or coarse classification-based clinical scores. In this study, we propose a clinically informed evaluation framework for VLM-generated radiology reports that goes beyond traditional performance measures. We define a taxonomy of 12 radiology-specific error types, each annotated with clinical risk levels (low, medium, high) in collaboration with physicians. Using this framework, we conduct a comprehensive error analysis of three representative VLMs, i.e., DeepSeek VL2, CXR-LLaVA, and CheXagent, on 685 gold-standard, expert-annotated MIMIC-CXR cases. We further introduce a risk-aware evaluation metric, the Clinical Risk-weighted Error Score for Text-generation (CREST), to quantify safety impact. Our findings reveal critical model vulnerabilities, common error patterns, and condition-specific risk profiles, offering actionable insights for model development and deployment. This work establishes a safety-centric foundation for evaluating and improving medical report generation models. The source code of our evaluation framework, including CREST computation and error taxonomy analysis, is available at https://github.com/guanharry/VLM-CREST.
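
To make the idea of a risk-weighted error score concrete, here is a minimal sketch of one way such a metric could be computed. It is not the CREST definition, which is given in the authors' repository linked above; the weights in `RISK_WEIGHTS`, the per-report averaging, and the function name are all hypothetical choices made for illustration.

```python
# Hypothetical weights for the three risk levels named in the abstract.
RISK_WEIGHTS = {"low": 1.0, "medium": 2.0, "high": 3.0}


def risk_weighted_error_score(reports):
    """Mean risk-weighted error count per report.

    reports: list of reports, each a list of risk levels ("low", "medium",
    "high"), one entry per annotated error in that report.
    Assumption: errors are summed by weight and averaged over reports.
    """
    if not reports:
        raise ValueError("at least one report is required")
    total = sum(RISK_WEIGHTS[level] for errors in reports for level in errors)
    return total / len(reports)


# Three toy reports: one with a low and a high risk error, one error-free,
# one with a single medium risk error.
example = [["low", "high"], [], ["medium"]]
print(risk_weighted_error_score(example))  # 2.0
```

Weighting by risk level means two models with the same raw error count can receive different scores when one of them concentrates its errors in high-risk categories.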

X-Ray LLM Radiology Report Chest Methodology In Silico Open Code GenAI

A Unified Platform for Radiology Report Generation and Clinician-Centered AI Evaluation

Ma, Z., Yang, X., Atalay, Z., Yang, A., Collins, S., Bai, H., Bernstein, M., Baird, G., Jiao, Z.

•preprint•Jul 8 2025

Generative AI models have demonstrated strong potential in radiology report generation, but their clinical adoption depends on physician trust. In this study, we conducted a radiology-focused Turing test to evaluate how well attendings and residents distinguish AI-generated reports from those written by radiologists, and how their confidence and decision time reflect trust. We developed an integrated web-based platform comprising two core modules: Report Generation and Report Evaluation. Using the web-based platform, eight participants evaluated 48 anonymized X-ray cases, each paired with two reports from three comparison groups: radiologist vs. AI model 1, radiologist vs. AI model 2, and AI model 1 vs. AI model 2. Participants selected the AI-generated report, rated their confidence, and indicated report preference. Attendings outperformed residents in identifying AI-generated reports (49.9% vs. 41.1%) and exhibited longer decision times, suggesting more deliberate judgment. Both groups took more time when both reports were AI-generated. Our findings highlight the role of clinical experience in AI acceptance and the need for design strategies that foster trust in clinical applications. The project page of the evaluation platform is available at: https://zachatalay89.github.io/Labsite.

X-Ray Report Generation Retrospective Clinical In Silico Academic Lab GenAI
