
Leveraging Large Language Models for Accurate AO Fracture Classification from CT Text Reports.

Mergen M, Spitzl D, Ketzer C, Strenzke M, Marka AW, Makowski MR, Bressem KK, Adams LC, Gassert FT

pubmed · Jul 7 2025
Large language models (LLMs) have shown promising potential in analyzing complex textual data, including radiological reports. These models can assist clinicians, particularly those with limited experience, by integrating and presenting diagnostic criteria within radiological classifications. However, before clinical adoption, LLMs must be rigorously validated by medical professionals to ensure accuracy, especially in the context of advanced radiological classification systems. This study evaluates the performance of four LLMs (ChatGPT-4o, AmbossGPT, Claude 3.5 Sonnet, and Gemini 2.0 Flash) in classifying fractures based on the AO classification system using CT reports. A dataset of 292 fictitious physician-generated CT reports, representing 310 fractures, was used to retrospectively assess the accuracy of each LLM in AO fracture classification. Performance was evaluated by comparing the models' classifications to ground truth labels, with accuracy rates analyzed across different fracture types and subtypes. ChatGPT-4o and AmbossGPT achieved the highest overall accuracy (74.6% and 74.3%, respectively), outperforming Claude 3.5 Sonnet (69.5%) and Gemini 2.0 Flash (62.7%). Statistically significant differences were observed in fracture type classification, particularly between ChatGPT-4o and Gemini 2.0 Flash (Δ12%, p < 0.001). While all models demonstrated strong bone recognition rates (90-99%), their accuracy in fracture subtype classification remained lower (71-77%), indicating limitations in nuanced diagnostic categorization. LLMs show potential in assisting radiologists with initial fracture classification, particularly in high-volume or resource-limited settings. However, their performance remains inconsistent for detailed subtype classification, highlighting the need for further refinement and validation before clinical integration in advanced diagnostic workflows.
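
Evaluation here amounts to a prompting loop: send each report to the model, read back an AO code, and score it against the ground-truth label. A minimal sketch of such a loop, using the OpenAI Python client as a stand-in; the system prompt, model choice, and scoring are illustrative assumptions, not the study's materials:

```python
# Sketch: prompt an LLM to assign an AO code to a CT report and score
# the answers. Prompt wording and model are assumptions, not the
# study's materials.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

SYSTEM_PROMPT = (
    "You are a musculoskeletal radiologist. Given a CT report, return "
    "only the AO/OTA fracture classification code (e.g. 23-A2)."
)

def classify_report(report_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # deterministic output for reproducible scoring
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": report_text},
        ],
    )
    return response.choices[0].message.content.strip()

def accuracy(cases: list[tuple[str, str]]) -> float:
    """Fraction of (report, ground_truth_code) pairs classified correctly."""
    return sum(classify_report(r) == gt for r, gt in cases) / len(cases)
```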

Development and validation of an improved volumetric breast density estimation model using the ResNet technique.

Asai Y, Yamamuro M, Yamada T, Kimura Y, Ishii K, Nakamura Y, Otsuka Y, Kondo Y

pubmed · Jul 7 2025
Objective: Temporal changes in volumetric breast density (VBD) may serve as prognostic biomarkers for predicting the risk of future breast cancer development. However, accurately measuring VBD from archived X-ray mammograms remains challenging. In a previous study, we proposed a method to estimate volumetric breast density using imaging parameters (tube voltage, tube current, and exposure time) and patient age. This approach, based on a multiple regression model, achieved a determination coefficient (R²) of 0.868. Approach: In this study, we developed and applied machine learning models (Random Forest and XGBoost) and the deep learning model Residual Network (ResNet) to the same dataset. Model performance was assessed using several metrics: determination coefficient, correlation coefficient, root mean square error, mean absolute error, root mean square percentage error, and mean absolute percentage error. Five-fold cross-validation was conducted to ensure robust validation. Main results: The best-performing fold resulted in R² values of 0.895, 0.907, and 0.918 for Random Forest, XGBoost, and ResNet, respectively, all surpassing the previous study's results. ResNet consistently achieved the lowest error values across all metrics. Significance: These findings suggest that ResNet successfully achieved the task of accurately determining VBD from past mammography, a task that has not been realised to date. We are confident that this achievement contributes to advancing research aimed at predicting future risks of breast cancer development by enabling high-accuracy time-series analyses of retrospective VBD.
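
Since the predictors are just four tabular values, the cross-validated comparison is easy to outline. A sketch with scikit-learn and XGBoost, assuming a feature table of tube voltage, tube current, exposure time, and age; the column names and file path are placeholders:

```python
# Sketch: five-fold CV for VBD regression from acquisition parameters
# and age. Feature names and the CSV path are placeholder assumptions.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import KFold
from xgboost import XGBRegressor

df = pd.read_csv("mammo_params.csv")  # hypothetical dataset
X = df[["tube_voltage", "tube_current", "exposure_time", "age"]].values
y = df["vbd"].values

for name, model in [("RF", RandomForestRegressor(random_state=0)),
                    ("XGB", XGBRegressor(random_state=0))]:
    r2s, maes = [], []
    for tr, te in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model.fit(X[tr], y[tr])
        pred = model.predict(X[te])
        r2s.append(r2_score(y[te], pred))
        maes.append(mean_absolute_error(y[te], pred))
    print(f"{name}: R2={np.mean(r2s):.3f}  MAE={np.mean(maes):.3f}")
```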

MedGemma Technical Report

Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, Justin Chen, Fereshteh Mahvar, Liron Yatziv, Tiffany Chen, Bram Sterling, Stefanie Anna Baby, Susanna Maria Baby, Jeremy Lai, Samuel Schmidgall, Lu Yang, Kejia Chen, Per Bjornsson, Shashir Reddy, Ryan Brush, Kenneth Philbrick, Howard Hu, Howard Yang, Richa Tiwari, Sunny Jansen, Preeti Singh, Yun Liu, Shekoofeh Azizi, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Riviere, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Elena Buchatskaya, Jean-Baptiste Alayrac, Dmitry Lepikhin, Vlad Feinberg, Sebastian Borgeaud, Alek Andreev, Cassidy Hardin, Robert Dadashi, Léonard Hussenot, Armand Joulin, Olivier Bachem, Yossi Matias, Katherine Chou, Avinatan Hassidim, Kavi Goel, Clement Farabet, Joelle Barral, Tris Warkentin, Jonathon Shlens, David Fleet, Victor Cotruta, Omar Sanseviero, Gus Martins, Phoebe Kirk, Anand Rao, Shravya Shetty, David F. Steiner, Can Kirmizibayrak, Rory Pilgrim, Daniel Golden, Lin Yang

arxiv preprint · Jul 7 2025
Artificial intelligence (AI) has significant potential in healthcare applications, but its training and deployment face challenges due to healthcare's diverse data, complex tasks, and the need to preserve privacy. Foundation models that perform well on medical tasks and require less task-specific tuning data are critical to accelerate the development of healthcare AI applications. We introduce MedGemma, a collection of medical vision-language foundation models based on Gemma 3 4B and 27B. MedGemma demonstrates advanced medical understanding and reasoning on images and text, significantly exceeding the performance of similar-sized generative models and approaching the performance of task-specific models, while maintaining the general capabilities of the Gemma 3 base models. For out-of-distribution tasks, MedGemma achieves 2.6-10% improvement on medical multimodal question answering, 15.5-18.1% improvement on chest X-ray finding classification, and 10.8% improvement on agentic evaluations compared to the base models. Fine-tuning MedGemma further improves performance in subdomains, reducing errors in electronic health record information retrieval by 50% and reaching comparable performance to existing specialized state-of-the-art methods for pneumothorax classification and histopathology patch classification. We additionally introduce MedSigLIP, a medically tuned vision encoder derived from SigLIP. MedSigLIP powers the visual understanding capabilities of MedGemma and as an encoder achieves comparable or better performance than specialized medical image encoders. Taken together, the MedGemma collection provides a strong foundation of medical image and text capabilities, with potential to significantly accelerate medical research and development of downstream applications. The MedGemma collection, including tutorials and model weights, can be found at https://goo.gle/medgemma.
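
For readers who want to try the weights, the collection is distributed through Hugging Face. A sketch of querying the 4B instruction-tuned variant with the transformers image-text-to-text pipeline; the model id and message format follow our reading of the public model card, so verify both there before relying on them:

```python
# Sketch: query MedGemma via the transformers image-text-to-text
# pipeline. Model id, prompt, and image path are assumptions; check the
# model card linked from https://goo.gle/medgemma for authoritative usage.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/medgemma-4b-it",  # gated; requires accepting the license
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "chest_xray.png"},  # placeholder path
        {"type": "text", "text": "Describe the findings on this chest X-ray."},
    ],
}]
out = pipe(text=messages, max_new_tokens=200)
print(out[0]["generated_text"][-1]["content"])
```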

AG-MS3D-CNN multiscale attention guided 3D convolutional neural network for robust brain tumor segmentation across MRI protocols.

Lilhore UK, Sunder R, Simaiya S, Alsafyani M, Monish Khan MD, Alroobaea R, Alsufyani H, Baqasah AM

pubmed · Jul 7 2025
Accurate segmentation of brain tumors from multimodal Magnetic Resonance Imaging (MRI) plays a critical role in diagnosis, treatment planning, and disease monitoring in neuro-oncology. Traditional methods of tumor segmentation, often manual and labor-intensive, are prone to inconsistencies and inter-observer variability. Recently, deep learning models, particularly Convolutional Neural Networks (CNNs), have shown great promise in automating this process. However, these models face challenges with generalization across diverse datasets, accurate tumor boundary delineation, and uncertainty estimation. To address these challenges, we propose AG-MS3D-CNN, an attention-guided multiscale 3D convolutional neural network for brain tumor segmentation. Our model integrates local and global contextual information through multiscale feature extraction and leverages spatial attention mechanisms to enhance boundary delineation, particularly in complex tumor regions. We also introduce Monte Carlo dropout for uncertainty estimation, providing clinicians with confidence scores for each segmentation, which is crucial for informed decision-making. Furthermore, we adopt a multitask learning framework, which enables the simultaneous segmentation, classification, and volume estimation of tumors. To ensure robustness and generalizability across diverse MRI acquisition protocols and scanners, we integrate a domain adaptation module into the network. Extensive evaluations on the BraTS 2021 dataset and additional external datasets, such as OASIS, ADNI, and IXI, demonstrate the superior performance of AG-MS3D-CNN compared to existing state-of-the-art methods. Our model achieves high Dice scores and shows excellent robustness, making it a valuable tool for clinical decision support in neuro-oncology.
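
Of the paper's ingredients, Monte Carlo dropout is the easiest to isolate: keep dropout layers active at inference, run repeated stochastic forward passes, and report the mean as the prediction and the variance as a per-voxel confidence map. A generic PyTorch sketch in which a toy network stands in for any segmentation model:

```python
# Sketch: Monte Carlo dropout for segmentation uncertainty. The tiny
# 3D network here is a stand-in for any model with dropout layers.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv3d(4, 16, 3, padding=1), nn.ReLU(),
    nn.Dropout3d(p=0.2),
    nn.Conv3d(16, 1, 1),
)

def mc_dropout_predict(model, x, n_samples=20):
    model.eval()
    # Re-enable dropout layers only, keeping e.g. batch norm in eval mode.
    for m in model.modules():
        if isinstance(m, (nn.Dropout, nn.Dropout3d)):
            m.train()
    with torch.no_grad():
        probs = torch.stack(
            [torch.sigmoid(model(x)) for _ in range(n_samples)]
        )
    return probs.mean(0), probs.var(0)  # prediction, uncertainty map

x = torch.randn(1, 4, 32, 32, 32)  # 4 MRI modalities, toy volume
mean_pred, uncertainty = mc_dropout_predict(model, x)
```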

X-ray transferable polyrepresentation learning

Weronika Hryniewska-Guzik, Przemyslaw Biecek

arxiv preprint · Jul 7 2025
The success of machine learning algorithms is inherently related to the extraction of meaningful features, as they play a pivotal role in the performance of these algorithms. Central to this challenge is the quality of data representation. However, the ability to generalize and extract these features effectively from unseen datasets is also crucial. In light of this, we introduce a novel concept: the polyrepresentation. Polyrepresentation integrates multiple representations of the same modality extracted from distinct sources, for example, vector embeddings from a Siamese network, self-supervised models, and interpretable radiomic features. This approach yields better performance metrics compared to relying on a single representation. Additionally, in the context of X-ray images, we demonstrate the transferability of the created polyrepresentation to a smaller dataset, underscoring its potential as a pragmatic and resource-efficient approach in various image-related solutions. It is worth noting that the concept of polyrepresentation, demonstrated here on medical data, can also be applied to other domains, showcasing its versatility and broad potential impact.
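
Operationally, a polyrepresentation is a concatenation of per-image feature vectors from several extractors, fed to a downstream classifier. A schematic sketch in which the three extractors are dummy placeholders for, e.g., a Siamese network, a self-supervised model, and radiomics:

```python
# Sketch: build a polyrepresentation by concatenating per-image feature
# vectors from several sources. The extractors are dummy placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholders standing in for a Siamese-network embedding, a
# self-supervised embedding, and interpretable radiomic features.
def siamese_embed(img):      return rng.normal(size=128)
def ssl_embed(img):          return rng.normal(size=256)
def radiomic_features(img):  return rng.normal(size=40)

def polyrepresentation(img) -> np.ndarray:
    """Concatenate multiple representations of the same image."""
    return np.concatenate([siamese_embed(img), ssl_embed(img),
                           radiomic_features(img)])

images = [None] * 20                  # dummy inputs
labels = np.array([0, 1] * 10)        # dummy binary labels
X = np.stack([polyrepresentation(im) for im in images])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
```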

Prediction of tissue and clinical thrombectomy outcome in acute ischaemic stroke using deep learning.

von Braun MS, Starke K, Peter L, Kürsten D, Welle F, Schneider HR, Wawrzyniak M, Kaiser DPO, Prasse G, Richter C, Kellner E, Reisert M, Klingbeil J, Stockert A, Hoffmann KT, Scheuermann G, Gillmann C, Saur D

pubmed · Jul 7 2025
The advent of endovascular thrombectomy has significantly improved outcomes for stroke patients with intracranial large vessel occlusion, yet individual benefits can vary widely. As demand for thrombectomy rises and geographical disparities in stroke care access persist, there is a growing need for predictive models that quantify individual benefits. However, current imaging methods for estimating outcomes may not fully capture the dynamic nature of cerebral ischaemia and lack a patient-specific assessment of thrombectomy benefits. Our study introduces a deep learning approach to predict individual responses to thrombectomy in acute ischaemic stroke patients. The proposed models provide predictions for both tissue and clinical outcomes under two scenarios: one assuming successful reperfusion and another assuming unsuccessful reperfusion. The resulting simulations of penumbral salvage and difference in National Institutes of Health Stroke Scale (NIHSS) at discharge quantify the potential individual benefits of the intervention. Our models were developed on an extensive dataset from routine stroke care, which included 405 ischaemic stroke patients who underwent thrombectomy. We used acute data for training (n = 304), including multimodal CT imaging and clinical characteristics, along with post hoc markers such as thrombectomy success, final infarct localization and NIHSS at discharge. We benchmarked our tissue outcome predictions under the observed reperfusion scenario against a thresholding-based clinical method and a generalized linear model. Our deep learning model showed significant superiority, with a mean Dice score of 0.48 on internal test data (n = 50) and 0.52 on external test data (n = 51), versus 0.26/0.36 and 0.34/0.35 for the baselines, respectively. The NIHSS sum score prediction achieved median absolute errors of 1.5 NIHSS points on the internal test dataset and 3.0 NIHSS points on the external test dataset, outperforming other machine learning models. By predicting the patient-specific response to thrombectomy for both tissue and clinical outcomes, our approach offers an innovative biomarker that captures the dynamics of cerebral ischaemia. We believe this method holds significant potential to enhance personalized therapeutic strategies and to facilitate efficient resource allocation in acute stroke care.
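
The headline tissue-outcome metric, the Dice score between predicted and final infarct masks, is straightforward to compute on binary volumes. A short sketch; the smoothing constant and toy mask shapes are conventional choices, not taken from the paper:

```python
# Sketch: Dice overlap between a predicted lesion mask and the final
# infarct mask. The epsilon guards against empty masks.
import numpy as np

def dice(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

pred = np.zeros((64, 64, 32), dtype=bool); pred[10:30, 10:30, 5:20] = True
target = np.zeros_like(pred);              target[15:35, 12:28, 5:18] = True
print(f"Dice = {dice(pred, target):.3f}")
```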

FB-Diff: Fourier Basis-guided Diffusion for Temporal Interpolation of 4D Medical Imaging

Xin You, Runze Yang, Chuyan Zhang, Zhongliang Jiang, Jie Yang, Nassir Navab

arxiv preprint · Jul 6 2025
The temporal interpolation task for 4D medical imaging plays a crucial role in the clinical practice of respiratory motion modeling. Following the simplified linear-motion hypothesis, existing approaches adopt optical flow-based models to interpolate intermediate frames. However, realistic respiratory motions are nonlinear and quasi-periodic, with specific frequencies. Motivated by this property, we resolve the temporal interpolation task from the frequency perspective and propose a Fourier basis-guided Diffusion model, termed FB-Diff. Specifically, because respiration follows a regular motion pattern, physiological motion priors are introduced to describe general characteristics of temporal data distributions. A Fourier motion operator is then devised to extract Fourier bases by incorporating physiological motion priors and case-specific spectral information in the feature space of a Variational Autoencoder. Well-learned Fourier bases can better simulate respiratory motions with motion patterns of specific frequencies. Conditioned on starting and ending frames, the diffusion model further leverages these Fourier bases via the basis interaction operator, which promotes the temporal interpolation task in a generative manner. Extensive results demonstrate that FB-Diff achieves state-of-the-art (SOTA) perceptual performance with better temporal consistency while maintaining promising reconstruction metrics. Code is available.
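
The frequency-domain intuition is easy to demonstrate: an FFT along time exposes the dominant quasi-periodic components of a breathing-like signal, which is the kind of information a learned Fourier basis can exploit. A generic rFFT decomposition sketch, not the paper's Fourier motion operator:

```python
# Sketch: extract dominant temporal frequencies from a breathing-like
# signal with an rFFT, then reconstruct from the top-k bases. This is a
# generic decomposition, not the paper's Fourier motion operator.
import torch

T = 40                                    # frames in the 4D sequence
t = torch.linspace(0, 4, T)
signal = torch.sin(2 * torch.pi * t) + 0.3 * torch.sin(6 * torch.pi * t)

spectrum = torch.fft.rfft(signal)         # complex Fourier coefficients
k = 3
top = torch.topk(spectrum.abs(), k).indices
filtered = torch.zeros_like(spectrum)
filtered[top] = spectrum[top]             # keep only the dominant bases
reconstruction = torch.fft.irfft(filtered, n=T)

print("reconstruction error:", torch.norm(signal - reconstruction).item())
```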

A CT-Based Deep Learning Radiomics Nomogram for Early Recurrence Prediction in Pancreatic Cancer: A Multicenter Study.

Guan X, Liu J, Xu L, Jiang W, Wang C

pubmed · Jul 6 2025
Early recurrence (ER) following curative-intent surgery remains a major obstacle to improving long-term outcomes in patients with pancreatic cancer (PC). The accurate preoperative prediction of ER could significantly aid clinical decision-making and guide postoperative management. A retrospective cohort of 493 patients with histologically confirmed PC who underwent resection was analyzed. Contrast-enhanced computed tomography (CT) images were used for tumor segmentation, followed by radiomics and deep learning feature extraction. In total, four distinct feature selection algorithms were employed. Predictive models were constructed using random forest (RF) and support vector machine (SVM) classifiers. The model performance was evaluated by the area under the receiver operating characteristic curve (AUC). A comprehensive nomogram integrating feature scores and clinical factors was developed and validated. Among all of the constructed models, the Inte-SVM demonstrated superior classification performance. The nomogram, incorporating the Inte-feature score, CT-assessed lymph node status, and carbohydrate antigen 19-9 (CA19-9), yielded excellent predictive accuracy in the validation cohort (AUC = 0.920). Calibration curves showed strong agreement between predicted and observed outcomes, and decision curve analysis confirmed the clinical utility of the nomogram. A CT-based deep learning radiomics nomogram enabled the accurate preoperative prediction of early recurrence in patients with pancreatic cancer. This model may serve as a valuable tool to assist clinicians in tailoring postoperative strategies and promoting personalized therapeutic approaches.
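
The modelling pipeline (feature selection followed by an SVM, judged by AUC) follows a standard radiomics pattern. A schematic scikit-learn version with a random placeholder feature matrix standing in for the extracted radiomics and deep features:

```python
# Sketch: SVM-based early-recurrence classifier evaluated by AUC.
# The feature matrix and labels are random placeholders; in the real
# setting they come from segmented contrast-enhanced CT tumours.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(493, 200))          # placeholder feature matrix
y = rng.integers(0, 2, size=493)         # placeholder ER labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=20),        # one of several selection options
    SVC(kernel="rbf", probability=True),
)
clf.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"AUC = {auc:.3f}")
```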

ViTaL: A Multimodality Dataset and Benchmark for Multi-pathological Ovarian Tumor Recognition

You Zhou, Lijiang Chen, Guangxia Cui, Wenpei Bai, Yu Guo, Shuchang Lyu, Guangliang Cheng, Qi Zhao

arxiv preprint · Jul 6 2025
Ovarian tumor, as a common gynecological disease, can rapidly deteriorate into serious health crises when undetected early, thus posing significant threats to the health of women. Deep neural networks have the potential to identify ovarian tumors, thereby reducing mortality rates, but limited public datasets hinder progress. To address this gap, we introduce a vital ovarian tumor pathological recognition dataset called ViTaL that contains Visual, Tabular and Linguistic modality data of 496 patients across six pathological categories. The ViTaL dataset comprises three subsets corresponding to different patient data modalities: visual data from 2216 two-dimensional ultrasound images, tabular data from medical examinations of 496 patients, and linguistic data from ultrasound reports of 496 patients. It is insufficient to merely distinguish between benign and malignant ovarian tumors in clinical practice. To enable multi-pathology classification of ovarian tumors, we propose ViTaL-Net, based on a Triplet Hierarchical Offset Attention Mechanism (THOAM), to minimize the loss incurred during feature fusion of multi-modal data. This mechanism effectively enhances the relevance and complementarity between information from different modalities. ViTaL-Net serves as a benchmark for the task of multi-pathology, multi-modality classification of ovarian tumors. In our comprehensive experiments, the proposed method exhibited satisfactory performance, achieving accuracies exceeding 90% on the two most common pathological types of ovarian tumor and an overall performance of 85%. Our dataset and code are available at https://github.com/GGbond-study/vitalnet.
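
THOAM itself is paper-specific, but the underlying recipe of embedding each modality and fusing with attention before classification can be sketched generically. A PyTorch example of attention-weighted late fusion over visual, tabular, and linguistic embeddings; the dimensions are arbitrary and this is not the paper's mechanism:

```python
# Sketch: attention-weighted fusion of three modality embeddings into a
# multi-pathology classifier. Generic pattern, not the paper's THOAM.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, dims=(512, 32, 256), hidden=128, n_classes=6):
        super().__init__()
        # Project each modality (visual, tabular, linguistic) to a
        # shared space, score each with attention, then classify.
        self.proj = nn.ModuleList(nn.Linear(d, hidden) for d in dims)
        self.attn = nn.Linear(hidden, 1)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, visual, tabular, linguistic):
        feats = torch.stack(
            [p(x) for p, x in zip(self.proj, (visual, tabular, linguistic))],
            dim=1)                              # (B, 3, hidden)
        weights = torch.softmax(self.attn(feats), dim=1)
        fused = (weights * feats).sum(dim=1)    # attention-weighted sum
        return self.head(fused)

model = LateFusionClassifier()
logits = model(torch.randn(4, 512), torch.randn(4, 32), torch.randn(4, 256))
print(logits.shape)  # torch.Size([4, 6])
```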

Bridging Vision and Language: Optimal Transport-Driven Radiology Report Generation via LLMs

Haifeng Zhao, Yufei Zhang, Leilei Ma, Shuo Xu, Dengdi Sun

arxiv preprint · Jul 5 2025
Radiology report generation is a significant application of medical AI and has achieved impressive results. Concurrently, large language models (LLMs) have demonstrated remarkable performance across various domains. However, empirical validation indicates that general LLMs tend to prioritize linguistic fluency over clinical effectiveness and cannot effectively capture the relationship between X-ray images and their corresponding texts, resulting in poor clinical practicability. To address these challenges, we propose Optimal Transport-Driven Radiology Report Generation (OTDRG), a novel framework that leverages Optimal Transport (OT) to align image features with disease labels extracted from reports, effectively bridging the cross-modal gap. The core component of OTDRG is Alignment & Fine-Tuning, where OT aligns encoded label features with image visual features to minimize cross-modal distances, before integrating image and text features for LLM fine-tuning. Additionally, we design a novel disease prediction module to predict the disease labels contained in X-ray images during validation and testing. Evaluated on the MIMIC-CXR and IU X-Ray datasets, OTDRG achieves state-of-the-art performance in both natural language generation (NLG) and clinical efficacy (CE) metrics, delivering reports that are not only linguistically coherent but also clinically accurate.
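
The OT alignment step can be illustrated with the POT library: build a cost matrix between image patch features and disease-label embeddings, then take the entropic transport cost as an alignment loss. A generic sketch under assumed feature shapes, not OTDRG's exact formulation:

```python
# Sketch: Sinkhorn optimal transport between image patch features and
# disease-label embeddings, used as a cross-modal alignment cost. The
# shapes and regularisation are assumptions, not OTDRG's formulation.
import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)

rng = np.random.default_rng(0)
img_feats = rng.normal(size=(49, 64))    # e.g. 7x7 grid of patch features
lbl_feats = rng.normal(size=(14, 64))    # e.g. 14 disease-label embeddings

# Uniform marginals and a cosine-distance cost matrix.
a = np.full(len(img_feats), 1 / len(img_feats))
b = np.full(len(lbl_feats), 1 / len(lbl_feats))
M = ot.dist(img_feats, lbl_feats, metric="cosine")

alignment_cost = ot.sinkhorn2(a, b, M, reg=0.1)  # entropic OT cost
print(f"cross-modal alignment cost: {alignment_cost:.4f}")
```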