
Cross-channel feature transfer 3D U-Net for automatic segmentation of the perilymph and endolymph fluid spaces in hydrops MRI.

Yoo TW, Yeo CD, Lee EJ, Oh IS

PubMed · Sep 1, 2025
The identification of endolymphatic hydrops (EH) using magnetic resonance imaging (MRI) is crucial for understanding inner ear disorders such as Meniere's disease and sudden low-frequency hearing loss. The EH ratio is calculated as the ratio of the endolymphatic fluid space to the perilymphatic fluid space. We propose a novel cross-channel feature transfer (CCFT) 3D U-Net for fully automated segmentation of the perilymphatic and endolymphatic fluid spaces in hydrops MRI. The model exhibits state-of-the-art performance in segmenting the endolymphatic fluid space by transferring magnetic resonance cisternography (MRC) features to HYDROPS-Mi2 (HYbriD of Reversed image Of Positive endolymph signal and native image of positive perilymph Signal multiplied with the heavily T2-weighted MR cisternography). Experimental results using the CCFT module showed that the segmentation performance of the perilymphatic space was 0.9459 for the Dice similarity coefficient (DSC) and 0.8975 for the intersection over union (IOU), and that of the endolymphatic space was 0.8053 for the DSC and 0.6778 for the IOU.
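
As a quick reference for the metrics and ratio reported above, here is a minimal sketch (not the authors' code) of how the DSC, IoU, and EH ratio can be computed from binary 3D masks; the array names and shapes are illustrative assumptions.

```python
# Minimal sketch: DSC, IoU, and the EH ratio from binary 3D masks.
# Assumes non-empty masks; names are hypothetical, not from the paper.
import numpy as np

def dice_iou(pred: np.ndarray, truth: np.ndarray) -> tuple[float, float]:
    """Dice similarity coefficient and intersection over union for binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    inter = np.logical_and(pred, truth).sum()
    dsc = 2.0 * inter / (pred.sum() + truth.sum())
    iou = inter / np.logical_or(pred, truth).sum()
    return float(dsc), float(iou)

def eh_ratio(endolymph: np.ndarray, perilymph: np.ndarray) -> float:
    """EH ratio as defined above: endolymphatic over perilymphatic fluid volume."""
    return float(endolymph.astype(bool).sum() / perilymph.astype(bool).sum())

# Toy usage on random volumes.
rng = np.random.default_rng(0)
endo, peri = rng.random((32, 32, 32)) > 0.7, rng.random((32, 32, 32)) > 0.4
print(dice_iou(endo, peri), eh_ratio(endo, peri))
```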

Prior-Guided Residual Diffusion: Calibrated and Efficient Medical Image Segmentation

Fuyou Mao, Beining Wu, Yanfeng Jiang, Han Xue, Yan Tang, Hao Zhang

arXiv preprint · Sep 1, 2025
Ambiguity in medical image segmentation calls for models that capture full conditional distributions rather than a single point estimate. We present Prior-Guided Residual Diffusion (PGRD), a diffusion-based framework that learns voxel-wise distributions while maintaining strong calibration and practical sampling efficiency. PGRD embeds discrete labels as one-hot targets in a continuous space to align segmentation with diffusion modeling. A coarse prior predictor provides step-wise guidance; the diffusion network then learns the residual to the prior, accelerating convergence and improving calibration. A deep diffusion supervision scheme further stabilizes training by supervising intermediate time steps. Evaluated on representative MRI and CT datasets, PGRD achieves higher Dice scores and lower NLL/ECE values than Bayesian, ensemble, Probabilistic U-Net, and vanilla diffusion baselines, while requiring fewer sampling steps to reach strong performance.
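
To make the residual idea concrete, the sketch below shows one plausible way to embed discrete labels as one-hot targets and form the residual to a coarse prior; the function, tensor shapes, and `prior_logits` input are assumptions for illustration, not the authors' implementation.

```python
# Sketch of a prior-guided residual target: the diffusion network would
# regress the difference between the one-hot label embedding and a coarse
# prior prediction. Shapes and names are hypothetical.
import torch
import torch.nn.functional as F

def residual_target(labels: torch.Tensor, prior_logits: torch.Tensor) -> torch.Tensor:
    """labels: (B, H, W) integer class labels; prior_logits: (B, C, H, W)."""
    num_classes = prior_logits.shape[1]
    # Embed discrete labels as one-hot targets in continuous space.
    x0 = F.one_hot(labels, num_classes).permute(0, 3, 1, 2).float()
    prior = torch.softmax(prior_logits, dim=1)  # coarse prior prediction
    return x0 - prior  # residual to the prior

# Toy usage.
labels = torch.randint(0, 3, (1, 8, 8))
logits = torch.randn(1, 3, 8, 8)
print(residual_target(labels, logits).shape)  # torch.Size([1, 3, 8, 8])
```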

Prediction of lymphovascular invasion in invasive breast cancer via intratumoral and peritumoral multiparametric magnetic resonance imaging machine learning-based radiomics with Shapley additive explanations interpretability analysis.

Chen S, Zhong Z, Chen Y, Tang W, Fan Y, Sui Y, Hu W, Pan L, Liu S, Kong Q, Guo Y, Liu W

PubMed · Sep 1, 2025
The use of multiparametric magnetic resonance imaging (MRI) to predict lymphovascular invasion (LVI) in breast cancer is well documented in the literature. However, most related studies have focused primarily on intratumoral characteristics, overlooking the potential contribution of peritumoral features. The aim of this study was to evaluate the effectiveness of multiparametric MRI in predicting LVI by analyzing both intratumoral and peritumoral radiomics features and to assess the added value of incorporating both regions in LVI prediction. A total of 366 patients underwent preoperative breast MRI at two centers and were divided into training (n=208), validation (n=70), and test (n=88) sets. Imaging features were extracted from intratumoral and peritumoral regions on T2-weighted imaging, diffusion-weighted imaging, and dynamic contrast-enhanced MRI. Five logistic regression models were developed for predicting LVI status: the tumor area (TA) model, peritumoral area (PA) model, tumor-plus-peritumoral area (TPA) model, clinical model, and combined model. The combined model incorporated the best-performing radiomics score and clinical factors. Predictive efficacy was evaluated via the receiver operating characteristic (ROC) curve and area under the curve (AUC). The Shapley additive explanations (SHAP) method was used to rank the features and explain the final model. The performance of the TPA model was superior to that of the TA and PA models. The combined model was therefore developed via multivariable logistic regression, incorporating the TPA radiomics score (radscore), MRI-assessed axillary lymph node (ALN) status, and peritumoral edema (PE). The combined model demonstrated good calibration and discrimination across the training, validation, and test datasets, with AUCs of 0.888 [95% confidence interval (CI): 0.841-0.934], 0.856 (95% CI: 0.769-0.943), and 0.853 (95% CI: 0.760-0.946), respectively. Furthermore, SHAP analysis was conducted to evaluate the contributions of the TPA radscore, MRI-ALN status, and PE to LVI status prediction. The combined model, incorporating clinical factors and the intratumoral-plus-peritumoral radscore, effectively predicts LVI and may aid in tailored treatment planning.
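
As a schematic of how such a combined model is typically built (a sketch with synthetic data, not the study's code or cohort), logistic regression can fuse a radiomics score with the two clinical factors:

```python
# Sketch: logistic-regression "combined model" over a radiomics score plus
# MRI-assessed ALN status and peritumoral edema. All data below are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 300
radscore = rng.normal(size=n)          # hypothetical TPA radiomics score
aln = rng.integers(0, 2, size=n)       # MRI-assessed ALN status (0/1)
pe = rng.integers(0, 2, size=n)        # peritumoral edema (0/1)
# Synthetic LVI labels loosely driven by the three predictors.
logit = 1.5 * radscore + 0.8 * aln + 0.6 * pe - 0.5
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

X = np.column_stack([radscore, aln, pe])
model = LogisticRegression().fit(X[:200], y[:200])
print("hold-out AUC:", roc_auc_score(y[200:], model.predict_proba(X[200:])[:, 1]))
```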

Ultrasound-based detection and malignancy prediction of breast lesions eligible for biopsy: A multi-center clinical-scenario study using nomograms, large language models, and radiologist evaluation

Ali Abbasian Ardakani, Afshin Mohammadi, Taha Yusuf Kuzan, Beyza Nur Kuzan, Hamid Khorshidi, Ashkan Ghorbani, Alisa Mohebbi, Fariborz Faeghi, Sepideh Hatamikia, U Rajendra Acharya

arXiv preprint · Aug 31, 2025
To develop and externally validate integrated ultrasound nomograms combining BI-RADS features and quantitative morphometric characteristics, and to compare their performance with expert radiologists and state-of-the-art large language models in biopsy recommendation and malignancy prediction for breast lesions. In this retrospective multicenter, multinational study, 1747 women with pathologically confirmed breast lesions underwent ultrasound across three centers in Iran and Turkey. A total of 10 BI-RADS and 26 morphological features were extracted from each lesion. A BI-RADS nomogram, a morphometric nomogram, and a fused nomogram integrating both feature sets were constructed via logistic regression. Three radiologists (one senior, two general) and two ChatGPT variants independently interpreted deidentified breast lesion images. Diagnostic performance for biopsy recommendation (BI-RADS 4 or 5) and malignancy prediction was assessed in internal and two external validation cohorts. In pooled analysis, the fused nomogram achieved the highest accuracy for biopsy recommendation (83.0%) and malignancy prediction (83.8%), outperforming the morphometric nomogram, the three radiologists, and both ChatGPT models. Its AUCs were 0.901 and 0.853 for the two tasks, respectively. In addition, the BI-RADS nomogram performed significantly better than the morphometric nomogram, the three radiologists, and both ChatGPT models for both biopsy recommendation and malignancy prediction. External validation confirmed robust generalizability across different ultrasound platforms and populations. An integrated BI-RADS and morphometric nomogram consistently outperforms standalone models, LLMs, and radiologists in guiding biopsy decisions and predicting malignancy. These interpretable, externally validated tools have the potential to reduce unnecessary biopsies and enhance personalized decision-making in breast imaging.

Can General-Purpose Omnimodels Compete with Specialists? A Case Study in Medical Image Segmentation

Yizhe Zhang, Qiang Chen, Tao Zhou

arXiv preprint · Aug 31, 2025
The emergence of powerful, general-purpose omnimodels capable of processing diverse data modalities has raised a critical question: can these "jack-of-all-trades" systems perform on par with highly specialized models in knowledge-intensive domains? This work investigates this question within the high-stakes field of medical image segmentation. We conduct a comparative study analyzing the zero-shot performance of a state-of-the-art omnimodel (Gemini 2.5 Pro, the "Nano Banana" model) against domain-specific deep learning models on three distinct tasks: polyp (endoscopy), retinal vessel (fundus), and breast tumor (ultrasound) segmentation. Our study focuses on performance at the extremes by curating subsets of the "easiest" and "hardest" cases based on the specialist models' accuracy. Our findings reveal a nuanced and task-dependent landscape. For polyp and breast tumor segmentation, specialist models excel on easy samples, but the omnimodel demonstrates greater robustness on hard samples where specialists fail catastrophically. Conversely, for the fine-grained task of retinal vessel segmentation, the specialist model maintains superior performance across both easy and hard cases. Intriguingly, qualitative analysis suggests omnimodels may possess higher sensitivity, identifying subtle anatomical features missed by human annotators. Our results indicate that while current omnimodels are not yet a universal replacement for specialists, their unique strengths suggest a potential complementary role, particularly in enhancing robustness on challenging edge cases.
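
The case-curation step can be sketched as follows (an illustrative assumption about the procedure, with hypothetical names and subset size): rank cases by the specialist model's per-case Dice score and keep the two extremes.

```python
# Sketch: select "hardest" and "easiest" cases from per-case specialist Dice.
import numpy as np

def extreme_subsets(dice_scores: np.ndarray, k: int):
    """Return indices of the k hardest (lowest-Dice) and k easiest cases."""
    order = np.argsort(dice_scores)
    return order[:k], order[-k:]

scores = np.random.default_rng(1).random(100)  # hypothetical per-case Dice
hardest, easiest = extreme_subsets(scores, k=10)
print(scores[hardest].mean(), scores[easiest].mean())
```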

Impact of pre-test probability on AI-LVO detection: a systematic review of LVO prevalence across clinical contexts.

Olivé-Gadea M, Mayol J, Requena M, Rodrigo-Gisbert M, Rizzo F, Garcia-Tornel A, Simonetti R, Diana F, Muchada M, Pagola J, Rodriguez-Luna D, Rodriguez-Villatoro N, Rubiera M, Molina CA, Tomasello A, Hernandez D, de Dios Lascuevas M, Ribo M

PubMed · Aug 31, 2025
Rapid identification of large vessel occlusion (LVO) in acute ischemic stroke (AIS) is essential for reperfusion therapy. Screening tools, including Artificial Intelligence (AI)-based algorithms, have been developed to accelerate detection, but their performance depends heavily on pre-test LVO prevalence. This study aimed to review LVO prevalence across clinical contexts and analyze its impact on AI-algorithm performance. We systematically reviewed studies reporting consecutive suspected AIS cohorts. Cohorts were grouped into four clinical scenarios based on patient selection criteria: (a) high suspicion of LVO by stroke specialists (direct-to-angiosuite candidates); (b) high suspicion of LVO according to pre-hospital scales; (c) and (d) any suspected AIS without a severity cut-off, in a hospital or pre-hospital setting, respectively. We analyzed LVO prevalence in each scenario and assessed the false discovery rate (FDR), the proportion of positive results that are false positives, when applying eight commercially available LVO-detection algorithms. We included 87 cohorts from 80 studies. Median LVO prevalence was (a) 84% (77-87%), (b) 35% (26-42%), (c) 19% (14-25%), and (d) 14% (8-22%). In the high-prevalence scenario (a), FDR ranged between 0.007 (1 false positive in 142 positives) and 0.023 (1 in 43), whereas in the low-prevalence scenarios (c and d), FDR ranged between 0.168 (1 in 6) and 0.543 (over 1 in 2). To ensure meaningful clinical impact, AI algorithms must be evaluated within the specific populations and care pathways where they are applied.
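
To make the prevalence dependence concrete, the FDR follows directly from prevalence, sensitivity, and specificity; in the sketch below, the sensitivity and specificity of 0.90 are illustrative placeholders, not values from any of the eight algorithms.

```python
# FDR = FP / (FP + TP) as a function of pre-test prevalence,
# for a fixed (illustrative) algorithm sensitivity and specificity.
def fdr(prevalence: float, sensitivity: float, specificity: float) -> float:
    tp = sensitivity * prevalence
    fp = (1.0 - specificity) * (1.0 - prevalence)
    return fp / (fp + tp)

for prev in (0.84, 0.35, 0.19, 0.14):  # median prevalences, scenarios (a)-(d)
    print(f"prevalence {prev:.0%}: FDR = {fdr(prev, 0.90, 0.90):.3f}")
# At 84% prevalence the FDR stays in the low-percent range; at 14% it
# approaches the "1 false positive in every 2-3 positives" regime noted above.
```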

CXR-MultiTaskNet: a unified deep learning framework for joint disease localization and classification in chest radiographs.

Reddy KD, Patil A

PubMed · Aug 31, 2025
Chest X-ray (CXR) analysis is a challenging problem in automated medical diagnosis: complex visual patterns of thoracic diseases must be precisely identified through multi-label classification and lesion localization. Current approaches typically treat classification and localization in isolation, resulting in piecemeal systems that do not exploit shared representations, offer limited clinical interpretability, and handle multi-label disease poorly. Although multi-task learning frameworks such as DeepChest and CLN appear to meet this goal, they suffer from task interference and poor explainability, which limits their practical application in real-world clinical workflows. To address these limitations, we present a unified multi-task deep learning framework, CXR-MultiTaskNet, for simultaneously classifying thoracic diseases and localizing lesions in chest X-rays. Our framework comprises a standard ResNet50 feature extractor, two task-specific heads for multi-task learning, and a Grad-CAM-based explainability module that provides accurate predictions and enhances clinical explainability. We formulate a joint loss that balances the two objectives, giving greater weight to the detection term to counter extreme class imbalance and the varying detectability of different disease manifestation types. Deep learning methods have made promising advances in disease identification on chest X-rays, but their performance for complete analysis is limited by a lack of interpretability, inherent weaknesses of convolutional neural networks (CNNs), and the practice of learning image-level classification before disease localization. We therefore adopt a dual-attention-based hierarchical feature extraction approach: visual attention maps make the detection steps traceable, so the entire process is more interpretable than with a traditional CNN-embedding model. The framework produces both disease-level and pixel-level predictions, enabling explainable, comprehensive analysis of each image and aiding localization of each detected abnormality. The training objective was further tuned for X-ray images so that smaller lesions receive greater weight. Experimental evaluations on a benchmark chest X-ray dataset demonstrate the potential of the proposed approach, achieving a macro F1-score of 0.965 (micro F1-score of 0.968) for disease classification and a mean IoU of 0.851 for disease localization, consistently outperforming state-of-the-art single-task and multi-task baselines. The presented framework provides an integrated approach to chest X-ray analysis that is clinically useful, interpretable, and scalable for automation, allowing for efficient diagnostic pathways and enhanced clinical decision-making, and can serve as a basis for next-generation explainable AI in radiology.
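
The joint objective described above might look schematically like the weighted sum below; the loss choices and weights are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch: weighted joint loss for a shared-backbone model with a multi-label
# classification head and a localization head. Weights are illustrative.
import torch
import torch.nn as nn

cls_loss_fn = nn.BCEWithLogitsLoss()  # multi-label disease classification
loc_loss_fn = nn.SmoothL1Loss()       # stand-in for a localization loss

def joint_loss(cls_logits, cls_targets, loc_preds, loc_targets,
               w_cls: float = 1.0, w_loc: float = 2.0) -> torch.Tensor:
    # Upweighting the localization term emphasizes harder, smaller lesions.
    return (w_cls * cls_loss_fn(cls_logits, cls_targets)
            + w_loc * loc_loss_fn(loc_preds, loc_targets))

# Toy usage: 4 images, 14 disease labels, 4 box coordinates.
loss = joint_loss(torch.randn(4, 14), torch.randint(0, 2, (4, 14)).float(),
                  torch.randn(4, 4), torch.randn(4, 4))
print(loss.item())
```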

A Modality-agnostic Multi-task Foundation Model for Human Brain Imaging

Peirong Liu, Oula Puonti, Xiaoling Hu, Karthik Gopinath, Annabel Sorby-Adams, Daniel C. Alexander, W. Taylor Kimberly, Juan E. Iglesias

arXiv preprint · Aug 30, 2025
Recent learning-based approaches have made astonishing advances in calibrated medical imaging like computerized tomography (CT), yet they struggle to generalize in uncalibrated modalities -- notably magnetic resonance (MR) imaging, where performance is highly sensitive to the differences in MR contrast, resolution, and orientation. This prevents broad applicability to diverse real-world clinical protocols. Here we introduce BrainFM, a modality-agnostic, multi-task vision foundation model for human brain imaging. With the proposed "mild-to-severe" intra-subject generation and "real-synth" mix-up training strategy, BrainFM is resilient to the appearance of acquired images (e.g., modality, contrast, deformation, resolution, artifacts), and can be directly applied to five fundamental brain imaging tasks, including image synthesis for CT and T1w/T2w/FLAIR MRI, anatomy segmentation, scalp-to-cortical distance, bias field estimation, and registration. We evaluate the efficacy of BrainFM on eleven public datasets, and demonstrate its robustness and effectiveness across all tasks and input modalities. Code is available at https://github.com/jhuldr/BrainFM.

A Multimodal and Multi-centric Head and Neck Cancer Dataset for Tumor Segmentation and Outcome Prediction

Numan Saeed, Salma Hassan, Shahad Hardan, Ahmed Aly, Darya Taratynova, Umair Nawaz, Ufaq Khan, Muhammad Ridzuan, Vincent Andrearczyk, Adrien Depeursinge, Mathieu Hatt, Thomas Eugene, Raphaël Metz, Mélanie Dore, Gregory Delpon, Vijay Ram Kumar Papineni, Kareem Wahid, Cem Dede, Alaa Mohamed Shawky Ali, Carlos Sjogreen, Mohamed Naser, Clifton D. Fuller, Valentin Oreiller, Mario Jreige, John O. Prior, Catherine Cheze Le Rest, Olena Tankyevych, Pierre Decazes, Su Ruan, Stephanie Tanadini-Lang, Martin Vallières, Hesham Elhalawani, Ronan Abgral, Romain Floch, Kevin Kerleguer, Ulrike Schick, Maelle Mauguen, Arman Rahmim, Mohammad Yaqub

arXiv preprint · Aug 30, 2025
We describe a publicly available multimodal dataset of annotated Positron Emission Tomography/Computed Tomography (PET/CT) studies for head and neck cancer research. The dataset includes 1123 FDG-PET/CT studies from patients with histologically confirmed head and neck cancer, acquired from 10 international medical centers. All examinations consisted of co-registered PET/CT scans with varying acquisition protocols, reflecting real-world clinical diversity across institutions. Primary gross tumor volumes (GTVp) and involved lymph nodes (GTVn) were manually segmented by experienced radiation oncologists and radiologists following standardized guidelines and quality control measures. We provide anonymized NIfTI files of all studies, along with expert-annotated segmentation masks, radiotherapy dose distributions for a subset of patients, and comprehensive clinical metadata. This metadata includes TNM staging, HPV status, demographics (age and gender), long-term follow-up outcomes, survival times, censoring indicators, and treatment information. We demonstrate how this dataset can be used for three key clinical tasks: automated tumor segmentation, recurrence-free survival prediction, and HPV status classification, providing benchmark results using state-of-the-art deep learning models, including UNet, SegResNet, and multimodal prognostic frameworks.
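
For readers planning to work with the release, a minimal nibabel sketch shows how a co-registered PET/CT study and its segmentation mask could be loaded; the file names are hypothetical placeholders, not the dataset's actual naming scheme.

```python
# Sketch: load a co-registered PET/CT study and its GTV mask with nibabel.
# File names are hypothetical placeholders.
import nibabel as nib
import numpy as np

ct = nib.load("patient_001_ct.nii.gz").get_fdata()
pet = nib.load("patient_001_pet.nii.gz").get_fdata()
gtv = nib.load("patient_001_gtv.nii.gz").get_fdata().astype(bool)

print(ct.shape, pet.shape)  # should match after co-registration
print("mean PET intensity inside GTV:", float(np.mean(pet[gtv])))
```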

Diagnostic Performance of CT-Based Artificial Intelligence for Early Recurrence of Cholangiocarcinoma: A Systematic Review and Meta-Analysis.

Chen J, Xi J, Chen T, Yang L, Liu K, Ding X

PubMed · Aug 30, 2025
Despite AI models demonstrating high predictive accuracy for early cholangiocarcinoma (CCA) recurrence, their clinical application faces challenges such as reproducibility, generalizability, hidden biases, and uncertain performance across diverse datasets and populations, raising concerns about their practical applicability. This meta-analysis aims to systematically assess the diagnostic performance of artificial intelligence (AI) models utilizing computed tomography (CT) imaging to predict early recurrence of CCA. A systematic search was conducted in PubMed, Embase, and Web of Science for studies published up to May 2025. Studies were selected based on the PIRTOS framework. Participants (P): patients diagnosed with CCA (including intrahepatic and extrahepatic locations). Index test (I): AI techniques applied to CT imaging for early recurrence prediction (defined as within 1 year). Reference standard (R): pathological diagnosis or imaging follow-up confirming recurrence. Target condition (T): early recurrence of CCA (positive group: recurrence; negative group: no recurrence). Outcomes (O): sensitivity, specificity, diagnostic odds ratio (DOR), and area under the receiver operating characteristic curve (AUC), assessed in both internal and external validation cohorts. Setting (S): retrospective or prospective studies using hospital datasets. Methodological quality was assessed using an optimized version of the revised QUADAS-2 tool. Heterogeneity was assessed using the I² statistic. Pooled sensitivity, specificity, DOR, and AUC were calculated using a bivariate random-effects model. Nine studies with 30 datasets involving 1,537 patients were included. In internal validation cohorts, CT-based AI models showed a pooled sensitivity of 0.87 (95% CI: 0.81-0.92), specificity of 0.85 (95% CI: 0.79-0.89), DOR of 37.71 (95% CI: 18.35-77.51), and AUC of 0.93 (95% CI: 0.90-0.94). In external validation cohorts, pooled sensitivity was 0.87 (95% CI: 0.81-0.91), specificity was 0.82 (95% CI: 0.77-0.86), DOR was 30.81 (95% CI: 18.79-50.52), and AUC was 0.85 (95% CI: 0.82-0.88). The AUC was significantly lower in external validation cohorts than in internal validation cohorts (P < .001). Our results show that CT-based AI models predict early CCA recurrence with high performance in internal validation sets and moderate performance in external validation sets. However, the high heterogeneity observed may impact the robustness of these results. Future research should focus on prospective studies and establishing standardized reference standards to further validate the clinical applicability and generalizability of AI models.
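
As a sanity check on the pooled numbers, the DOR can be recomputed from pooled sensitivity and specificity; the sketch below reproduces the reported figures to within rounding (the bivariate model's pooled estimates differ slightly).

```python
# DOR = (sens / (1 - sens)) / ((1 - spec) / spec)
def diagnostic_odds_ratio(sens: float, spec: float) -> float:
    return (sens / (1.0 - sens)) / ((1.0 - spec) / spec)

print(diagnostic_odds_ratio(0.87, 0.85))  # ~37.9 vs. the pooled 37.71 (internal)
print(diagnostic_odds_ratio(0.87, 0.82))  # ~30.5 vs. the pooled 30.81 (external)
```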