
Mingzhe Hu, Zach Eidex, Shansong Wang, Mojtaba Safari, Qiang Li, Xiaofeng Yang

arXiv preprint · Aug 15, 2025
Radiology, radiation oncology, and medical physics require decision-making that integrates medical images, textual reports, and quantitative data under high-stakes conditions. With the introduction of GPT-5, it is critical to assess whether recent advances in large multimodal models translate into measurable gains in these safety-critical domains. We present a targeted zero-shot evaluation of GPT-5 and its smaller variants (GPT-5-mini, GPT-5-nano) against GPT-4o across three representative tasks: (1) VQA-RAD, a benchmark for visual question answering in radiology; (2) SLAKE, a semantically annotated, multilingual VQA dataset testing cross-modal grounding; and (3) a curated Medical Physics Board Examination-style dataset of 150 multiple-choice questions spanning treatment planning, dosimetry, imaging, and quality assurance. Across all datasets, GPT-5 achieved the highest accuracy, with substantial gains over GPT-4o: up to +20.00% in challenging anatomical regions such as the chest/mediastinum, +13.60% in lung-focused questions, and +11.44% in brain-tissue interpretation. On the board-style physics questions, GPT-5 attained 90.7% accuracy (136/150), exceeding the estimated human passing threshold, while GPT-4o trailed at 78.0%. These results demonstrate that GPT-5 delivers consistent and often pronounced performance improvements over GPT-4o in both image-grounded reasoning and domain-specific numerical problem-solving, highlighting its potential to augment expert workflows in medical imaging and therapeutic physics.
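
As a minimal sketch of how such a zero-shot multiple-choice evaluation might be scored: the loop below assumes a question format ('stem', 'choices', 'answer') and a hypothetical `query_model` callable standing in for an actual API call to the model under test; neither is specified in the abstract.

```python
# Minimal sketch of a zero-shot multiple-choice evaluation loop.
# `query_model` is a hypothetical stand-in for an actual model API call;
# the dataset record format is also an assumption.
from typing import Callable

def evaluate_zero_shot(questions: list[dict], query_model: Callable[[str], str]) -> float:
    """Return accuracy on multiple-choice questions. Each question dict
    is assumed to hold 'stem', 'choices' (letter -> text), and 'answer'
    (the correct letter)."""
    correct = 0
    for q in questions:
        options = "\n".join(f"{k}. {v}" for k, v in q["choices"].items())
        prompt = (
            f"{q['stem']}\n{options}\n"
            "Answer with the single letter of the correct choice."
        )
        prediction = query_model(prompt).strip().upper()[:1]
        correct += prediction == q["answer"]
    return correct / len(questions)

# Example: 136 correct of 150 board-style questions -> 0.907 accuracy,
# matching the 90.7% reported above for GPT-5.
```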

Schulz, M., Leha, A.

medRxiv preprint · Aug 15, 2025
Background: The inbuilt ability to adapt existing models to new applications has been one of the key drivers of the success of deep learning models. Sharing trained models is therefore crucial for their adaptation to different populations and domains. Not sharing models precludes validation and subsequent translation into clinical practice, and hinders scientific progress. In this paper we examine the current state of data and model sharing in the medical field, using cervical cancer staging on colposcopy images as a case example.

Methods: We conducted a comprehensive literature search in PubMed to identify studies employing machine learning techniques in the analysis of colposcopy images. For studies where raw data was not directly accessible, we systematically inquired about access to the pre-trained model weights and/or raw colposcopy image data by contacting the authors through various channels.

Results: We included 46 studies and one publicly available dataset in our study. We retrieved the data of the latter and inquired about data access for the 46 studies by contacting a total of 92 authors. We received 15 responses related to 14 studies (30%); the remaining 32 studies (70%) remained unresponsive. Of the 15 responses received, two redirected our inquiry to other authors, two were initially pending, and 11 declined data sharing. Despite our follow-up efforts on all responses received, none of the inquiries led to actual data sharing (0%). The only available data source remained the publicly available dataset.

Conclusions: Despite long-standing demands for reproducible research and efforts to incentivize data sharing, such as required data availability statements, our case study reveals a persistent lack of a data sharing culture. Reasons identified in this case study include a lack of resources to provide the data, data privacy concerns, ongoing trial registrations, and low response rates to inquiries. Potential routes for improvement include comprehensive data availability statements required by journals, data preparation and deposition in a repository as part of the publication process, an automatic maximal embargo time after which data become openly accessible, and data sharing rules set by funders.

Mauri C, Fritz R, Mora J, Billot B, Iglesias JE, Van Leemput K, Augustinack J, Greve DN

PubMed · Aug 15, 2025
The claustrum is a band-like gray matter structure located between the putamen and insula whose exact functions are still actively researched. Its sheet-like structure makes it barely visible in in vivo magnetic resonance imaging (MRI) scans at typical resolutions, and neuroimaging tools for its study, including methods for automatic segmentation, are currently very limited. In this paper, we propose a contrast- and resolution-agnostic method for claustrum segmentation at ultra-high resolution (0.35 mm isotropic). The method is based on the SynthSeg segmentation framework, which leverages synthetic training intensity images to achieve excellent generalization; in particular, SynthSeg requires only label maps for training, since corresponding intensity images are synthesized on the fly with random contrast and resolution. We trained a deep learning network for automatic claustrum segmentation using claustrum manual labels obtained from 18 ultra-high resolution MRI scans (mostly ex vivo). We demonstrated that the method works on these 18 high-resolution cases (Dice score = 0.632, mean surface distance = 0.458 mm, and volumetric similarity = 0.867, using 6-fold cross-validation) and also on in vivo T1-weighted MRI scans at typical resolutions (≈1 mm isotropic). We also demonstrated that the method is robust in a test-retest setting and when applied to multimodal imaging (T2-weighted, proton density, and quantitative T1 scans). To the best of our knowledge, this is the first accurate method for automatic ultra-high resolution claustrum segmentation that is robust against changes in contrast and resolution. The method is released at https://github.com/chiara-mauri/claustrum_segmentation and as part of the neuroimaging package FreeSurfer.
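
To illustrate the core SynthSeg idea described above (training images synthesized on the fly from label maps with random contrast and resolution), here is a hedged sketch: per-label Gaussian intensities plus a random blur as a stand-in for resolution degradation. The parameter ranges and the toy label map are assumptions, not the authors' settings.

```python
# Illustrative sketch of SynthSeg-style synthesis: an intensity image is
# generated from a label map alone. Parameter ranges are assumptions.
import numpy as np
from scipy import ndimage

def synthesize_image(labels: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    image = np.zeros(labels.shape, dtype=np.float32)
    # Random contrast: draw a Gaussian intensity distribution per label.
    for lab in np.unique(labels):
        mean, sd = rng.uniform(0, 255), rng.uniform(1, 25)
        mask = labels == lab
        image[mask] = rng.normal(mean, sd, size=int(mask.sum()))
    # Random resolution: blur with a random sigma to mimic a coarser
    # acquisition (a simple stand-in for down- and re-upsampling).
    image = ndimage.gaussian_filter(image, sigma=rng.uniform(0.5, 3.0))
    return image

rng = np.random.default_rng(0)
label_map = rng.integers(0, 4, size=(64, 64, 64))  # toy stand-in label map
training_image = synthesize_image(label_map, rng)
```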

Zheng CY, Zhang JM, Lin QS, Lian T, Shi LP, Chen JY, Cai YL

PubMed · Aug 15, 2025
Colorectal cancer stands among the most prevalent digestive system malignancies, and the microsatellite instability (MSI) profile plays a crucial role in determining patient outcomes and therapy responsiveness. Traditional MSI evaluation methods require invasive tissue sampling, are lengthy, and can be compromised by intratumoral heterogeneity. This study aimed to establish a non-invasive technique utilizing dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) radiomics and machine learning algorithms to determine MSI status in patients with intermediate-stage rectal cancer. This retrospective analysis examined 120 individuals diagnosed with stage II/III rectal cancer [30 MSI-high (MSI-H) and 90 microsatellite-stable (MSS)/MSI-low (MSI-L) cases]. We extracted comprehensive radiomics signatures from DCE-MRI scans, encompassing textural parameters that reflect tumor heterogeneity, shape-based metrics, and histogram-derived statistical values. Least absolute shrinkage and selection operator (LASSO) regression facilitated feature selection, while predictive frameworks were developed using various classification algorithms (logistic regression, support vector machine, and random forest). Performance assessment utilized separate training and validation cohorts. Our investigation uncovered distinctive imaging characteristics between MSI-H and MSS/MSI-L neoplasms: MSI-H tumors exhibited significantly elevated entropy values (7.84 ± 0.92 vs 6.39 ± 0.83, P = 0.004), enhanced surface-to-volume proportions (0.72 ± 0.14 vs 0.58 ± 0.11, P = 0.008), and heightened signal intensity variation (3642 ± 782 vs 2815 ± 645, P = 0.007). The random forest model demonstrated superior classification capability, with areas under the curve (AUCs) of 0.891 and 0.896 on the training and validation datasets, respectively. An integrated approach combining radiomics with clinical parameters further enhanced performance (AUC 0.923 and 0.914), achieving 88.5% sensitivity alongside 87.2% specificity. DCE-MRI radiomics features interpreted through machine learning frameworks offer an effective strategy for MSI status assessment in intermediate-stage rectal cancer.
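
A minimal sketch of the pipeline pattern described above (LASSO-style feature selection feeding a random forest classifier), assuming a radiomics feature matrix X and binary MSI labels y; the placeholder data, hyperparameters, and the L1-logistic selector are illustrative assumptions, not the study's exact configuration.

```python
# Hedged sketch: LASSO-style selection + random forest, scored by AUC.
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Placeholder data: 120 cases x 200 radiomics features, binary MSI label.
X, y = np.random.rand(120, 200), np.random.randint(0, 2, 120)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = make_pipeline(
    StandardScaler(),
    # L1-penalized logistic regression acts as the LASSO feature selector.
    SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.1)),
    RandomForestClassifier(n_estimators=500, random_state=0),
)
model.fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"validation AUC: {auc:.3f}")
```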

Ryan MV, Satzer D, Hu H, Litwiller DV, Rettmann DW, Tanabe J, Thompson JA, Ojemann SG, Kramer DR

PubMed · Aug 14, 2025
Neuromodulation of the centromedian nucleus (CM) of the thalamus has shown promise in treating refractory epilepsy, particularly for idiopathic generalized epilepsy and Lennox-Gastaut syndrome. However, precise targeting of CM remains challenging. The combination of deep learning reconstruction (DLR) and fast gray matter acquisition T1 inversion recovery (FGATIR) offers potential improvements in visualization of CM for deep brain stimulation (DBS) targeting. The goal of the study was to evaluate the visualization of the putative CM on DLR-FGATIR and its alignment with atlas-defined CM boundaries, with the aim of facilitating direct targeting of CM for neuromodulation. This retrospective study included 12 patients with drug-resistant epilepsy treated with thalamic neuromodulation by using DLR-FGATIR for direct targeting. Postcontrast T1-weighted MRI, DLR-FGATIR, and postoperative CT were coregistered and normalized into Montreal Neurological Institute (MNI) space and compared with the Morel histologic atlas. Contrast-to-noise ratios were measured between CM and neighboring nuclei. CM segmentations were compared between an experienced rater, a trainee rater, the Morel atlas, and the Thalamus Optimized Multi Atlas Segmentation (THOMAS) atlas (derived from expert segmentation of high-field MRI) by using the Sørensen-Dice coefficient (Dice score, a measure of overlap) and volume ratios. The number of electrode contacts within the Morel atlas CM was assessed. On DLR-FGATIR, CM was visible as an ovoid hypointensity in the intralaminar thalamus. Contrast-to-noise ratios were highest (P < .001) for the mediodorsal and medial pulvinar nuclei. The Dice score with the Morel atlas CM was higher (median 0.49, interquartile range 0.40-0.58) for the experienced rater (P < .001) than for the trainee rater (0.32, 0.19-0.46) and no different (P = .32) from the THOMAS atlas CM (0.56, 0.55-0.58). Both raters and the THOMAS atlas tended to under-segment the lateral portion of the Morel atlas CM, reflected by smaller segmentation volumes (P < .001). All electrodes targeting CM based on DLR-FGATIR traversed the Morel atlas CM. DLR-FGATIR permitted visualization and delineation of CM commensurate with a group atlas derived from high-field MRI. This technique provided reliable guidance for accurate electrode placement within CM, highlighting its potential use for direct targeting.
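
For reference, the two agreement metrics used above are straightforward to compute from binary segmentation masks; a minimal sketch (the mask inputs are assumed to be co-registered arrays of the same shape):

```python
# Minimal sketch of the Sørensen-Dice overlap and volume ratio used to
# compare CM segmentations; inputs are co-registered boolean masks.
import numpy as np

def dice_score(a: np.ndarray, b: np.ndarray) -> float:
    a, b = a.astype(bool), b.astype(bool)
    intersection = np.logical_and(a, b).sum()
    denom = a.sum() + b.sum()
    return 2.0 * intersection / denom if denom else 1.0

def volume_ratio(a: np.ndarray, b: np.ndarray) -> float:
    # Ratio of segmented volumes (e.g., rater CM vs. Morel atlas CM);
    # values < 1 indicate under-segmentation relative to the reference.
    return a.astype(bool).sum() / b.astype(bool).sum()
```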

Bdaiwi AS, Willmering MM, Hussain R, Hysinger E, Woods JC, Walkup LL, Cleveland ZI

PubMed · Aug 14, 2025
Reduced signal-to-noise ratio (SNR) in hyperpolarized ¹²⁹Xe MR images can affect accurate quantification for research and diagnostic evaluations. This study therefore explores supervised deep learning (DL) denoising, via traditional (Trad) and Noise2Noise (N2N) approaches, and unsupervised denoising via Noise2Void (N2V), for ¹²⁹Xe MR imaging. The DL denoising frameworks were trained and tested on 952 ¹²⁹Xe MRI data sets (421 ventilation, 125 diffusion-weighted, and 406 gas-exchange acquisitions) from healthy subjects and participants with cardiopulmonary conditions, and compared with the block-matching 3D (BM3D) denoising technique. Evaluation involved mean signal, noise standard deviation (SD), SNR, and sharpness. Ventilation defect percentage (VDP), apparent diffusion coefficient (ADC), membrane uptake, red blood cell (RBC) transfer, and the RBC:membrane ratio were also evaluated for ventilation, diffusion, and gas-exchange images, respectively. Denoising methods significantly reduced noise SDs and enhanced SNR (p < 0.05) across all imaging types. The traditional ventilation model (Trad_vent) improved sharpness in ventilation images but underestimated VDP (bias = -1.37%) relative to raw images, whereas N2N_vent overestimated VDP (bias = +1.88%). BM3D and N2V_vent showed minimal VDP bias (≤ 0.35%). Denoising significantly reduced ADC mean and SD (p < 0.05, bias ≤ -0.63 × 10⁻²). Trad_vent and N2N_vent increased mean membrane uptake and RBC transfer (p < 0.001) with no change in the RBC:membrane ratio, and denoising also reduced the SDs of all gas-exchange metrics (p < 0.01). Low SNR may impair the potential of ¹²⁹Xe MRI for clinical diagnosis and lung function assessment; the evaluated supervised and unsupervised DL denoising methods enhanced ¹²⁹Xe imaging quality, offering promise for improved clinical interpretation and diagnosis.
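
As a hedged sketch of the image-quality metrics reported above (mean signal, noise SD, SNR): one common convention estimates noise from a signal-free background region of the magnitude image. The choice of signal and background masks here is an assumption for illustration, not the authors' measurement protocol.

```python
# Illustrative computation of mean signal, noise SD, and SNR from a
# magnitude image; mask definitions are assumptions for this sketch.
import numpy as np

def snr_metrics(image: np.ndarray, signal_mask: np.ndarray,
                background_mask: np.ndarray) -> dict:
    mean_signal = image[signal_mask].mean()      # e.g., lung parenchyma
    noise_sd = image[background_mask].std()      # e.g., signal-free corner
    return {
        "mean_signal": float(mean_signal),
        "noise_sd": float(noise_sd),
        "snr": float(mean_signal / noise_sd),
    }
```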

Dong L, Gao W, Niu L, Deng Z, Gong Z, Li HY, Fang LJ, Shao L, Zhang RH, Zhou WD, Ma L, Wei WB

PubMed · Aug 14, 2025
This study evaluated the performance of artificial intelligence (AI) algorithms in predicting best-corrected visual acuity (BCVA) for patients with multiple retinal diseases, using multimodal medical imaging including macular optical coherence tomography (OCT), optic disc OCT and fundus images. The goal was to enhance clinical BCVA evaluation efficiency and precision. A retrospective study used data from 2545 patients (4028 eyes) for training, 896 (1006 eyes) for testing and 196 (200 eyes) for internal validation, with an external prospective dataset of 741 patients (1381 eyes). Single-modality analyses employed different backbone networks and feature fusion methods, while multimodal fusion combined modalities using average aggregation, concatenation/reduction and maximum feature selection. Predictive accuracy was measured by mean absolute error (MAE), root mean squared error (RMSE) and R² score. Macular OCT achieved better single-modality prediction than optic disc OCT, with MAE of 3.851 vs 4.977 and RMSE of 7.844 vs 10.026. Fundus images showed an MAE of 3.795 and RMSE of 7.954. Multimodal fusion significantly improved accuracy, with the best results using average aggregation, achieving an MAE of 2.865, RMSE of 6.229 and R² of 0.935. External validation yielded an MAE of 8.38 and RMSE of 10.62. Multimodal fusion provided the most accurate BCVA predictions, demonstrating AI's potential to improve clinical evaluation. However, challenges remain regarding disease diversity and applicability in resource-limited settings.
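
A minimal sketch of the best-performing fusion rule reported above, average aggregation of per-modality embeddings before a regression head, scored with MAE, RMSE, and R². The embedding dimensions, the ridge regressor, and the train/test split are illustrative assumptions, not the study's architecture.

```python
# Hedged sketch: average-aggregation fusion of modality embeddings for
# BCVA regression, with the paper's three reported metrics.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def fuse_average(*modalities: np.ndarray) -> np.ndarray:
    # Each modality: (n_eyes, d) embedding; fusion is the element-wise mean.
    return np.mean(np.stack(modalities), axis=0)

n, d = 1006, 128  # assumed sizes for the sketch
macular_oct, disc_oct, fundus = (np.random.rand(n, d) for _ in range(3))
bcva = np.random.rand(n) * 100  # placeholder BCVA targets

fused = fuse_average(macular_oct, disc_oct, fundus)
pred = Ridge().fit(fused[:800], bcva[:800]).predict(fused[800:])
print("MAE:", mean_absolute_error(bcva[800:], pred))
print("RMSE:", np.sqrt(mean_squared_error(bcva[800:], pred)))
print("R2:", r2_score(bcva[800:], pred))
```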

Veronese L, Moglia A, Pecco N, Della Rosa P, Scifo P, Mainardi LT, Cerveri P

PubMed · Aug 14, 2025
AI-based neural decoding reconstructs visual perception by leveraging generative models to map brain activity, measured through functional MRI (fMRI), into the observed visual stimulus. Traditionally, ridge linear models transform fMRI into a latent space, which is then decoded using variational autoencoders (VAE) or latent diffusion models (LDM). Owing to the complexity and noisiness of fMRI data, newer approaches split the reconstruction into two sequential stages: the first provides a rough visual approximation using a VAE, and the second incorporates semantic information through an LDM guided by contrastive language-image pre-training (CLIP) embeddings. This work addressed some key scientific and technical gaps of two-stage neural decoding by: 1) implementing a gated recurrent unit (GRU)-based architecture to establish a non-linear mapping between the fMRI signal and the VAE latent space, 2) optimizing the dimensionality of the VAE latent space, 3) systematically evaluating the contribution of the first reconstruction stage, and 4) analyzing the impact of different brain regions of interest (ROIs) on reconstruction quality. Experiments on the Natural Scenes Dataset, containing 73,000 unique natural images along with fMRI from eight subjects, demonstrated that the proposed architecture maintained competitive performance while reducing the complexity of its first stage by 85%. The sensitivity analysis showed that the first reconstruction stage is essential for preserving high structural similarity in the final reconstructions. Restricting analysis to semantic ROIs, while excluding early visual areas, diminished visual coherence while preserving semantic content. The inter-subject repeatability across ROIs was approximately 92% and 98% for visual and semantic metrics, respectively. This study represents a key step toward optimized neural decoding architectures leveraging non-linear models for stimulus prediction. Sensitivity analysis highlighted the interplay between the two reconstruction stages, while ROI-based analysis provided strong evidence that the two-stage AI model reflects the brain's hierarchical processing of visual information.
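
As a hedged sketch of the first-stage idea (a GRU providing a non-linear map from fMRI signal to the VAE latent space), the PyTorch module below treats the fMRI input as a sequence of ROI feature vectors and regresses the latent of the seen image; all dimensions and the sequence framing are assumptions for illustration, not the paper's architecture.

```python
# Hedged sketch: GRU-based non-linear mapping from fMRI to VAE latents.
import torch
import torch.nn as nn

class FMRIToLatent(nn.Module):
    def __init__(self, roi_dim: int = 256, hidden: int = 512, latent_dim: int = 1024):
        super().__init__()
        self.gru = nn.GRU(roi_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, latent_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_rois, roi_dim); regress from the final hidden state.
        _, h = self.gru(x)
        return self.head(h[-1])

model = FMRIToLatent()
fmri = torch.randn(8, 16, 256)      # batch of 8 assumed fMRI samples
latent = model(fmri)                # (8, 1024) predicted VAE latents
# Training would minimize a reconstruction loss against the true latents:
loss = nn.functional.mse_loss(latent, torch.randn(8, 1024))
```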

Ni J, You Y, Wu X, Chen X, Wang J, Li Y

PubMed · Aug 14, 2025
Thyroid cancer is one of the most common endocrine malignancies, and its incidence has steadily increased in recent years. Distinguishing between benign and malignant thyroid nodules (TNs) is challenging due to their overlapping imaging features. The rapid advancement of artificial intelligence (AI) in medical image analysis, particularly deep learning (DL) algorithms, has provided novel solutions for automated TN detection. However, existing studies exhibit substantial heterogeneity in diagnostic performance, and no systematic evidence-based research has comprehensively assessed the diagnostic performance of DL models in this field. This study aimed to conduct a systematic review and meta-analysis appraising the performance of DL algorithms in diagnosing TN malignancy, to identify key factors influencing their diagnostic efficacy, and to compare their accuracy with that of clinicians in image-based diagnosis. We systematically searched multiple databases, including PubMed, Cochrane, Embase, Web of Science, and IEEE, and identified 41 eligible studies for systematic review and meta-analysis. Based on task type, studies were categorized into segmentation (n=14) and detection (n=27) tasks, and the pooled sensitivity, specificity, and area under the receiver operating characteristic curve (AUC) were calculated for each group. Subgroup analyses examined the impact of transfer learning and compared model performance against clinicians. For segmentation tasks, the pooled sensitivity, specificity, and AUC were 82% (95% CI 79%-84%), 95% (95% CI 92%-96%), and 0.91 (95% CI 0.89-0.94), respectively. For detection tasks, they were 91% (95% CI 89%-93%), 89% (95% CI 86%-91%), and 0.96 (95% CI 0.93-0.97), respectively. Some studies demonstrated that DL models could achieve diagnostic performance comparable with, or even exceeding, that of clinicians in certain scenarios, and the application of transfer learning contributed to improved model performance. DL algorithms exhibit promising diagnostic accuracy in TN imaging, highlighting their potential as auxiliary diagnostic tools. However, current studies are limited by suboptimal methodological design, inconsistent image quality across datasets, and insufficient external validation, which may introduce bias. Future research should enhance methodological standardization, improve model interpretability, and promote transparent reporting to facilitate the sustainable clinical translation of DL-based solutions.
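
As an illustrative sketch of how per-study sensitivities like those above can be pooled: the function below performs a simplified inverse-variance pool on the logit scale with a continuity correction. This is a generic fixed-effect formulation, not necessarily the (likely bivariate or random-effects) model used in the review, and the example counts are made up.

```python
# Hedged sketch: inverse-variance pooling of proportions on the logit
# scale (simplified fixed-effect model; the review's exact model may differ).
import numpy as np

def pool_proportion(events: np.ndarray, totals: np.ndarray):
    p = (events + 0.5) / (totals + 1.0)                       # continuity-corrected
    logit = np.log(p / (1 - p))
    var = 1.0 / (events + 0.5) + 1.0 / (totals - events + 0.5)
    w = 1.0 / var                                             # inverse-variance weights
    pooled = (w * logit).sum() / w.sum()
    se = np.sqrt(1.0 / w.sum())
    lo, hi = pooled - 1.96 * se, pooled + 1.96 * se
    inv = lambda x: 1.0 / (1.0 + np.exp(-x))                  # back-transform to proportion
    return inv(pooled), (inv(lo), inv(hi))

# Example with made-up counts (true positives, positives) for three studies:
sens, ci = pool_proportion(np.array([90, 180, 45]), np.array([100, 200, 50]))
print(f"pooled sensitivity {sens:.2f}, 95% CI {ci[0]:.2f}-{ci[1]:.2f}")
```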

Kruper J, Richie-Halford A, Qiao J, Gilmore A, Chang K, Grotheer M, Roy E, Caffarra S, Gomez T, Chou S, Cieslak M, Koudoro S, Garyfallidis E, Satthertwaite TD, Yeatman JD, Rokem A

PubMed · Aug 14, 2025
Tractometry uses diffusion-weighted magnetic resonance imaging (dMRI) to assess physical properties of brain connections. Here, we present an integrative ecosystem of software that performs all steps of tractometry: post-processing of dMRI data, delineation of major white matter pathways, and modeling of the tissue properties within them. This ecosystem also provides a set of interoperable and extensible tools for visualization and interpretation of the results, extracting insights from these measurements. These include novel machine learning and statistical analysis methods adapted to the characteristic structure of tract-based data. We benchmark the performance of these statistical analysis methods in different datasets and analysis tasks, including hypothesis testing on group differences and predictive analysis of subject age. We also demonstrate that computational advances implemented in the software offer orders of magnitude of acceleration. Taken together, these open-source software tools, freely available at https://tractometry.org, provide a transformative environment for the analysis of dMRI data.
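
To make the group-difference testing on tract-based data concrete, here is a generic sketch of a tractometry-style analysis: node-wise tests along a tract profile with a multiple-comparisons correction. This illustrates the underlying statistical idea only; it is not the tractometry.org ecosystem's API, and requires SciPy ≥ 1.11 for the FDR helper.

```python
# Generic sketch of node-wise group comparison along a tract profile,
# with Benjamini-Hochberg FDR correction across nodes.
import numpy as np
from scipy import stats

def compare_profiles(group_a: np.ndarray, group_b: np.ndarray, alpha: float = 0.05):
    """group_a, group_b: (n_subjects, n_nodes) arrays of a tissue
    property (e.g., fractional anisotropy) sampled along one tract."""
    t, p = stats.ttest_ind(group_a, group_b, axis=0)
    p_adj = stats.false_discovery_control(p)   # BH-adjusted p-values (SciPy >= 1.11)
    return t, p_adj, p_adj < alpha

# Example with synthetic data: 20 subjects per group, 100 nodes.
rng = np.random.default_rng(0)
a = rng.normal(0.45, 0.05, (20, 100))
b = rng.normal(0.45, 0.05, (20, 100))
b[:, 40:60] += 0.04                            # simulated mid-tract group effect
t, p_adj, sig = compare_profiles(a, b)
print("significant nodes:", np.flatnonzero(sig))
```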