End-to-end Spatiotemporal Analysis of Color Doppler Echocardiograms: Application for Rheumatic Heart Disease Detection.

Roshanitabrizi P, Nath V, Brown K, Broudy TG, Jiang Z, Parida A, Rwebembera J, Okello E, Beaton A, Roth HR, Sable CA, Linguraru MG

PubMed | Sep 29, 2025
Rheumatic heart disease (RHD) represents a significant global health challenge, disproportionately affecting over 40 million people in low- and middle-income countries. Early detection through color Doppler echocardiography is crucial for treating RHD, but it requires specialized physicians who are often scarce in resource-limited settings. To address this disparity, artificial intelligence (AI)-driven tools for RHD screening can provide scalable, autonomous solutions to improve access to critical healthcare services in underserved regions. This paper introduces RADAR (Rapid AI-Assisted Echocardiography Detection and Analysis of RHD), a novel and generalizable AI approach for end-to-end spatiotemporal analysis of color Doppler echocardiograms, aimed at detecting early RHD in resource-limited settings. RADAR identifies key imaging views and employs convolutional neural networks to analyze diagnostically relevant phases of the cardiac cycle. It also localizes essential anatomical regions and examines blood flow patterns. It then integrates all findings into a cohesive analytical framework. RADAR was trained and validated on 1,022 echocardiogram videos from 511 Ugandan children, acquired using standard portable ultrasound devices. An independent set of 318 cases, acquired using a handheld ultrasound device with diverse imaging characteristics, was also tested. On the validation set, RADAR outperformed existing methods, achieving an average accuracy of 0.92, sensitivity of 0.94, and specificity of 0.90. In independent testing, it maintained high, clinically acceptable performance, with an average accuracy of 0.79, sensitivity of 0.87, and specificity of 0.70. These results highlight RADAR's potential to improve RHD detection and promote health equity for vulnerable children by enhancing timely, accurate diagnoses in underserved regions.

TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models

Junyi Zhang, Jia-Chen Gu, Wenbo Hu, Yu Zhou, Robinson Piramuthu, Nanyun Peng

arXiv preprint | Sep 29, 2025
Existing medical reasoning benchmarks for vision-language models primarily focus on analyzing a patient's condition based on an image from a single visit. However, this setting deviates significantly from real-world clinical practice, where doctors typically refer to a patient's historical conditions to provide a comprehensive assessment by tracking their changes over time. In this paper, we introduce TemMed-Bench, the first benchmark designed for analyzing changes in patients' conditions between different clinical visits, which challenges large vision-language models (LVLMs) to reason over temporal medical images. TemMed-Bench consists of a test set comprising three tasks - visual question-answering (VQA), report generation, and image-pair selection - and a supplementary knowledge corpus of over 17,000 instances. With TemMed-Bench, we conduct an evaluation of six proprietary and six open-source LVLMs. Our results show that most LVLMs lack the ability to analyze patients' condition changes over temporal medical images, and a large proportion perform only at a random-guessing level in the closed-book setting. In contrast, GPT o3, o4-mini, and Claude 3.5 Sonnet demonstrate comparatively decent performance, though they have yet to reach the desired level. Furthermore, we explore augmenting the input with both retrieved visual and textual modalities in the medical domain. We also show that multi-modal retrieval augmentation yields notably higher performance gains than either no retrieval or textual retrieval alone across most models on our benchmark, with the VQA task showing an average improvement of 2.59%. Overall, we compose a benchmark grounded in real-world clinical practice; it reveals LVLMs' limitations in temporal medical image reasoning and highlights multi-modal retrieval augmentation as a promising direction for addressing this challenge.
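
As a rough illustration of the retrieval-augmentation setup explored above, the sketch below retrieves the top-k most similar corpus items by embedding cosine similarity; the embeddings, dimensions, and retrieval rule are assumptions, not the benchmark's implementation.

```python
# A small sketch (assumptions only; not the benchmark's retrieval code) of
# multi-modal retrieval augmentation: retrieve the top-k most similar corpus
# items by embedding cosine similarity, then prepend them to the model input.
import numpy as np

def top_k_retrieval(query_emb, corpus_embs, k=3):
    """Return indices of the k corpus items most similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = c @ q
    return np.argsort(-sims)[:k]

# Toy embeddings standing in for image/report encoder outputs.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(17000, 512))
query = rng.normal(size=512)
print("retrieved corpus indices:", top_k_retrieval(query, corpus, k=3))
```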

Impact of Artificial Intelligence Triage on Radiologist Report Turnaround Time: Real-World Time Savings and Insights From Model Predictions.

Thompson YLE, Fergus J, Chung J, Delfino JG, Chen W, Levine GM, Samuelson FW

PubMed | Sep 29, 2025
To quantify the impact of workflow parameters on time savings in report turnaround time due to an AI triage device that prioritized pulmonary embolism (PE) in chest CT pulmonary angiography (CTPA) examinations. This retrospective study analyzed 11,252 adult CTPA examinations conducted for suspected PE at a single tertiary academic medical center. Data were divided into two periods: pre-artificial intelligence (AI) and post-AI. For PE-positive examinations, turnaround time (TAT)-defined as the duration from patient scan completion to the first preliminary report completion-was compared between the two periods. Time savings were reported separately for work-hour and off-hour cohorts. To characterize radiologist workflow, 527,234 records were retrieved from the PACS, and workflow parameters such as examination interarrival time and radiologist read time were extracted. These parameters were input into a computational model to predict time savings after deployment of an AI triage device and to study the impact of workflow parameters. The pre-AI dataset included 4,694 chest CTPA examinations, 13.3% of which were PE-positive. The post-AI dataset comprised 6,558 examinations, 16.2% of which were PE-positive. The mean TATs for pre-AI and post-AI during work hours were 68.9 (95% confidence interval 55.0-82.8) and 46.7 (38.1-55.2) min, respectively, and those during off-hours were 44.8 (33.7-55.9) and 42.0 (33.6-50.3) min. Clinically observed time savings during work hours (22.2 [95% confidence interval: 5.85-38.6] min) were significant (P = .004), while off-hour savings (2.82 [-11.1 to 16.7] min) were not (P = .345). Observed time savings aligned with model predictions (29.6 [95% range: 23.2-38.1] min for work hours; 2.10 [1.76-2.58] min for off-hours). Consideration and quantification of the clinical workflow contribute to the accurate assessment of the expected time savings in report TAT after deployment of an AI triage device.
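
The core comparison above is a difference in mean report turnaround time between the pre-AI and post-AI periods, each with a confidence interval. The sketch below is not the authors' code; the TAT values and the bootstrap CI method are assumptions for illustration only.

```python
# Minimal sketch of the core TAT comparison: mean turnaround time per period
# with a percentile bootstrap confidence interval, plus the observed savings.
# All data and the CI method are hypothetical, not the study's analysis.
import numpy as np

rng = np.random.default_rng(0)

def mean_with_ci(tat_minutes, n_boot=10_000, alpha=0.05):
    """Return the mean TAT and a percentile bootstrap 95% CI."""
    tat = np.asarray(tat_minutes, dtype=float)
    boot_means = np.array([
        rng.choice(tat, size=tat.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return tat.mean(), (lo, hi)

# Hypothetical PE-positive work-hour TATs (minutes) for illustration only.
pre_ai = rng.gamma(shape=2.0, scale=35.0, size=300)   # ~70 min mean
post_ai = rng.gamma(shape=2.0, scale=23.0, size=400)  # ~46 min mean

pre_mean, pre_ci = mean_with_ci(pre_ai)
post_mean, post_ci = mean_with_ci(post_ai)
print(f"pre-AI TAT  {pre_mean:.1f} min (95% CI {pre_ci[0]:.1f}-{pre_ci[1]:.1f})")
print(f"post-AI TAT {post_mean:.1f} min (95% CI {post_ci[0]:.1f}-{post_ci[1]:.1f})")
print(f"observed time savings: {pre_mean - post_mean:.1f} min")
```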

Novel multi-task learning for Alzheimer's stage classification using hippocampal MRI segmentation, feature fusion, and nomogram modeling.

Hu W, Du Q, Wei L, Wang D, Zhang G

PubMed | Sep 29, 2025
To develop and validate a comprehensive and interpretable framework for multi-class classification of Alzheimer's disease (AD) progression stages based on hippocampal MRI, integrating radiomic, deep, and clinical features. This retrospective multi-center study included 2956 patients across four AD stages (Non-Demented, Very Mild Demented, Mild Demented, Moderate Demented). T1-weighted MRI scans were processed through a standardized pipeline involving hippocampal segmentation using four models (U-Net, nnU-Net, Swin-UNet, MedT). Radiomic features (n = 215) were extracted using the SERA platform, and deep features (n = 256) were learned using an LSTM network with attention applied to hippocampal slices. Fused features were harmonized with ComBat and filtered by ICC (≥ 0.75), followed by LASSO-based feature selection. Classification was performed using five machine learning models: Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), Multilayer Perceptron (MLP), and eXtreme Gradient Boosting (XGBoost). Model interpretability was addressed using SHAP, and a nomogram and decision curve analysis (DCA) were developed. Additionally, an end-to-end 3D CNN-LSTM model and two transformer-based benchmarks (Vision Transformer, Swin Transformer) were trained for comparative evaluation. MedT achieved the best hippocampal segmentation (external Dice = 92.03%). Fused features yielded the highest classification performance with XGBoost (external accuracy = 92.8%, AUC = 94.2%). SHAP identified MMSE, hippocampal volume, and APOE ε4 as top contributors. The nomogram accurately predicted early-stage AD with clinical utility confirmed by DCA. The end-to-end model performed acceptably (AUC = 84.0%) but lagged behind the fused pipeline. Statistical tests confirmed significant performance advantages for feature fusion and MedT-based segmentation. This study demonstrates that integrating radiomics, deep learning, and clinical data from hippocampal MRI enables accurate and interpretable classification of AD stages. The proposed framework is robust, generalizable, and clinically actionable, representing a scalable solution for AD diagnostics.
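
As a rough illustration of the fusion-then-selection step described above, the sketch below applies an L1-penalized (LASSO-style) selector to a synthetic stand-in for the fused radiomic and deep feature matrix, then fits one of the reported classifier families. The data, hyperparameters, and use of scikit-learn are assumptions, not the authors' pipeline.

```python
# Minimal sketch (assumptions, not the published pipeline): LASSO-style
# feature selection on fused radiomic + deep features, then a classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for 215 radiomic + 256 deep features, 4 AD stages.
X, y = make_classification(n_samples=1000, n_features=471, n_informative=40,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = Pipeline([
    ("scale", StandardScaler()),
    # L1-penalized selector playing the role of the paper's LASSO step.
    ("select", SelectFromModel(
        LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000))),
    ("model", RandomForestClassifier(n_estimators=300, random_state=0)),
])
clf.fit(X_tr, y_tr)
print("held-out accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```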

A machine learning approach for non-invasive PCOS diagnosis from ultrasound and clinical features.

Agirsoy M, Oehlschlaeger MA

PubMed | Sep 29, 2025
This study investigates the use of machine learning (ML) algorithms to support faster and more accurate diagnosis of polycystic ovary syndrome (PCOS), with a focus on both predictive performance and clinical applicability. Multiple algorithms were evaluated, including Artificial Neural Networks (ANN), Support Vector Machines (SVM), Logistic Regression (LR), K-Nearest Neighbors (KNN), and Extreme Gradient Boosting (XGBoost). XGBoost consistently outperformed the other models and was selected for final development and validation. To align with the Rotterdam criteria, the dataset was structured into three feature categories: clinical, biochemical, and ultrasound (USG) data. The study explored various combinations of these feature subsets to identify the most efficient diagnostic pathways. Feature selection using the chi-square-based SelectKBest method revealed the top 10 predictive features, which were further validated through XGBoost's internal feature importance, SHAP analysis, and expert clinical assessment. The final XGBoost model demonstrated robust performance across multiple feature combinations:
• Clinical + USG + AMH: AUC = 0.9947, Precision = 0.9553, F1 Score = 0.9553, Accuracy = 0.9553.
• Clinical + USG: AUC = 0.9852, Precision = 0.9583, F1 Score = 0.9388, Accuracy = 0.9384.
The most influential features included follicle count on both ovaries, weight gain, Anti-Müllerian Hormone (AMH) levels, hair growth, menstrual irregularity, fast food consumption, pimples, and hair loss. External validation was performed using a publicly available dataset containing 320 instances and 18 diagnostic features. The XGBoost model trained on the top-ranked features achieved perfect performance on the test set (AUC = 1.0, Precision = 1.0, F1 Score = 1.0, Accuracy = 1.0), though further validation is necessary to rule out overfitting or data leakage. These findings suggest that combining clinical and ultrasound features enables highly accurate, non-invasive, and cost-effective PCOS diagnosis. This study demonstrates the potential of ML-driven tools to streamline clinical workflows, reduce reliance on invasive diagnostics, and support early intervention in women's health.
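
The chi-square-based SelectKBest step mentioned above can be sketched as follows; the data, feature names, and scaling choice are placeholders rather than the study's dataset or code.

```python
# Minimal sketch (assumed, simplified) of chi-square-based SelectKBest
# feature selection, keeping the top-10 features before model fitting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Placeholder data: 18 diagnostic features, binary PCOS label.
X, y = make_classification(n_samples=500, n_features=18, n_informative=8,
                           random_state=0)
feature_names = np.array([f"feature_{i}" for i in range(X.shape[1])])

# chi2 requires non-negative inputs, so scale features to [0, 1] first.
X_scaled = MinMaxScaler().fit_transform(X)
selector = SelectKBest(score_func=chi2, k=10).fit(X_scaled, y)
top10 = feature_names[selector.get_support()]
print("top-10 features by chi-square score:", list(top10))
```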

A radiomics-based machine learning model and SHAP for predicting spread through air spaces and its prognostic implications in stage I lung adenocarcinoma: a multicenter cohort study.

Wang Y, Liu X, Zhao X, Wang Z, Li X, Sun D

PubMed | Sep 29, 2025
Despite early detection via low-dose computed tomography and complete surgical resection for early-stage lung adenocarcinoma, postoperative recurrence remains high, particularly in patients with tumor spread through air spaces. A reliable preoperative prediction model is urgently needed to adjust the treatment modality. In this multicenter retrospective study, 609 patients with pathological stage I lung adenocarcinoma from 3 independent centers were enrolled. Regions of interest for the primary tumor and peritumoral areas (extended by three, six, and twelve voxel units) were manually delineated from preoperative CT imaging. Quantitative imaging features were extracted and filtered by correlation analysis and random forest ranking to yield 40 candidate features. Fifteen machine learning methods were evaluated, and a ten-fold cross-validated elastic net regression model was selected to construct the radiomics-based prediction model. A clinical model based on five key clinical variables and a combined model integrating imaging and clinical features were also developed. The radiomics model achieved accuracies of 0.801, 0.866, and 0.831 in the training set and two external test sets, with AUCs of 0.791, 0.829, and 0.807. In one external test set, the clinical model had an AUC of 0.689, significantly lower than the radiomics model (0.807, p < 0.05). The combined model achieved the highest performance, with AUCs of 0.834 in the training set and 0.894 in an external test set (p < 0.01 and p < 0.001, respectively). Interpretability analysis revealed that wavelet-transformed features dominated the model, with the highest contribution from a feature reflecting small high-intensity clusters within the tumor and the second highest from a feature representing low-intensity clusters in the six-voxel peritumoral region. Kaplan-Meier analysis demonstrated that patients with either pathologically confirmed or model-predicted spread had significantly shorter progression-free survival (p < 0.001). Our novel machine learning model, integrating imaging features from both tumor and peritumoral regions, preoperatively predicts tumor spread through air spaces in stage I lung adenocarcinoma. It outperforms traditional clinical models, highlighting the potential of quantitative imaging analysis in personalizing treatment. Future prospective studies and further optimization are warranted.
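
Below is a minimal sketch of a ten-fold cross-validated elastic net classifier like the one described above, applied to a synthetic 40-feature radiomics matrix; the implementation details (scikit-learn, hyperparameters, class balance) are assumptions, not the published model.

```python
# Rough sketch (not the published model) of a ten-fold cross-validated
# elastic-net classifier on radiomic features, reporting AUC as in the study.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 40 candidate radiomic features, binary STAS label.
X, y = make_classification(n_samples=609, n_features=40, n_informative=12,
                           weights=[0.7, 0.3], random_state=0)

model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="elasticnet", solver="saga",
                       l1_ratio=0.5, C=1.0, max_iter=10000),
)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
aucs = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"10-fold CV AUC: {aucs.mean():.3f} +/- {aucs.std():.3f}")
```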

Convolutional neural network models of structural MRI for discriminating categories of cognitive impairment: a systematic review and meta-analysis.

Dong X, Li Y, Hao J, Zhou P, Yang C, Ai Y, He M, Zhang W, Hu H

PubMed | Sep 29, 2025
Alzheimer's disease (AD) and mild cognitive impairment (MCI) pose significant challenges to public health and underscore the need for accurate and early diagnostic tools. Structural magnetic resonance imaging (sMRI) combined with advanced analytical techniques like convolutional neural networks (CNNs) offers a promising avenue for the diagnosis of these conditions. This systematic review and meta-analysis aimed to evaluate the diagnostic performance of CNN algorithms applied to sMRI data in differentiating between AD, MCI, and normal cognition (NC). Following the PRISMA-DTA guidelines, a comprehensive literature search was carried out in PubMed and Web of Science databases for studies published between 2018 and 2024. Studies were included if they employed CNNs for the diagnostic classification of sMRI data from participants with AD, MCI, or NC. The methodological quality of the included studies was assessed using the QUADAS-2 and METRICS tools. Data extraction and statistical analysis were performed to calculate pooled diagnostic accuracy metrics. A total of 21 studies, comprising 16,139 participants, were included in the analysis. The pooled sensitivity and specificity of CNN algorithms for differentiating AD from NC were 0.92 and 0.91, respectively. For distinguishing MCI from NC, the pooled sensitivity and specificity were 0.74 and 0.79, respectively. The algorithms also showed a moderate ability to differentiate AD from MCI, with a pooled sensitivity and specificity of 0.73 and 0.79, respectively. For classifying progressive MCI (pMCI) versus stable MCI (sMCI), the pooled sensitivity was 0.69 and the pooled specificity was 0.81. Heterogeneity across studies was significant, as indicated by meta-regression results. CNN algorithms demonstrated promising diagnostic performance in differentiating AD, MCI, and NC using sMRI data. The highest accuracy was observed in distinguishing AD from NC, and the lowest in distinguishing pMCI from sMCI. These findings suggest that CNN-based radiomics has the potential to serve as a valuable tool in the diagnostic armamentarium for neurodegenerative diseases. However, the heterogeneity among studies indicates a need for further methodological refinement and validation. This systematic review was registered in PROSPERO (Registration ID: CRD42022295408).
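
For intuition only, the sketch below pools per-study sensitivities with simple inverse-variance weighting on the logit scale; this is a deliberate simplification (diagnostic meta-analyses such as this one typically use a bivariate random-effects model), and the TP/FN counts are invented.

```python
# Simplified sketch of pooling per-study sensitivities (assumption: basic
# inverse-variance weighting on the logit scale, not the bivariate
# random-effects model usually applied). Counts are hypothetical.
import numpy as np

def pooled_sensitivity(tp, fn):
    """Inverse-variance pooled sensitivity with a 0.5 continuity correction."""
    tp, fn = np.asarray(tp, float), np.asarray(fn, float)
    sens = (tp + 0.5) / (tp + fn + 1.0)
    logit = np.log(sens / (1 - sens))
    var = 1 / (tp + 0.5) + 1 / (fn + 0.5)   # approximate variance of logit(sens)
    w = 1 / var
    pooled_logit = np.sum(w * logit) / np.sum(w)
    return 1 / (1 + np.exp(-pooled_logit))

# Hypothetical per-study AD-vs-NC counts for illustration only.
tp = [90, 120, 45, 200, 60]
fn = [8, 12, 6, 15, 4]
print(f"pooled sensitivity ~ {pooled_sensitivity(tp, fn):.2f}")
```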

MetaChest: Generalized few-shot learning of pathologies from chest X-rays

Berenice Montalvo-Lezama, Gibran Fuentes-Pineda

arXiv preprint | Sep 29, 2025
The limited availability of annotated data presents a major challenge for applying deep learning methods to medical image analysis. Few-shot learning methods aim to recognize new classes from only a small number of labeled examples. These methods are typically studied under the standard few-shot learning setting, where all classes in a task are new. However, medical applications such as pathology classification from chest X-rays often require learning new classes while simultaneously leveraging knowledge of previously known ones, a scenario more closely aligned with generalized few-shot classification. Despite its practical relevance, few-shot learning has been scarcely studied in this context. In this work, we present MetaChest, a large-scale dataset of 479,215 chest X-rays collected from four public databases. MetaChest includes a meta-set partition specifically designed for standard few-shot classification, as well as an algorithm for generating multi-label episodes. We conduct extensive experiments evaluating both a standard transfer learning approach and an extension of ProtoNet across a wide range of few-shot multi-label classification tasks. Our results demonstrate that increasing the number of classes per episode and the number of training examples per class improves classification performance. Notably, the transfer learning approach consistently outperforms the ProtoNet extension, despite not being tailored for few-shot learning. We also show that higher-resolution images improve accuracy at the cost of additional computation, while efficient model architectures achieve comparable performance to larger models with significantly reduced resource requirements.
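
As background for the ProtoNet extension mentioned above, the sketch below shows the basic prototypical-network classification rule on a toy episode; it is illustrative only and is not the MetaChest code or its multi-label episode logic.

```python
# Minimal sketch of the prototypical-network idea: class prototypes are the
# mean embedding of each class's support examples, and queries are classified
# by negative squared distance to the prototypes. Illustrative only.
import torch

def prototypes(support_emb, support_labels, n_classes):
    """Mean support embedding per class -> tensor of shape (n_classes, dim)."""
    return torch.stack([
        support_emb[support_labels == c].mean(dim=0) for c in range(n_classes)
    ])

def classify(query_emb, protos):
    """Predict the class whose prototype is nearest in squared Euclidean distance."""
    dists = torch.cdist(query_emb, protos) ** 2
    return (-dists).argmax(dim=1)

# Toy 3-way, 5-shot episode with 64-dimensional embeddings.
torch.manual_seed(0)
support = torch.randn(15, 64)
labels = torch.arange(3).repeat_interleave(5)
queries = torch.randn(6, 64)
protos = prototypes(support, labels, n_classes=3)
print(classify(queries, protos))
```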

Radiology's Last Exam (RadLE): Benchmarking Frontier Multimodal AI Against Human Experts and a Taxonomy of Visual Reasoning Errors in Radiology

Suvrankar Datta, Divya Buchireddygari, Lakshmi Vennela Chowdary Kaza, Mrudula Bhalke, Kautik Singh, Ayush Pandey, Sonit Sai Vasipalli, Upasana Karnwal, Hakikat Bir Singh Bhatti, Bhavya Ratan Maroo, Sanjana Hebbar, Rahul Joseph, Gurkawal Kaur, Devyani Singh, Akhil V, Dheeksha Devasya Shama Prasad, Nishtha Mahajan, Ayinaparthi Arisha, Rajesh Vanagundi, Reet Nandy, Kartik Vuthoo, Snigdhaa Rajvanshi, Nikhileswar Kondaveeti, Suyash Gunjal, Rishabh Jain, Rajat Jain, Anurag Agrawal

arXiv preprint | Sep 29, 2025
Generalist multimodal AI systems such as large language models (LLMs) and vision language models (VLMs) are increasingly accessed by clinicians and patients alike for medical image interpretation through widely available consumer-facing chatbots. Most evaluations claiming expert level performance are on public datasets containing common pathologies. Rigorous evaluation of frontier models on difficult diagnostic cases remains limited. We developed a pilot benchmark of 50 expert-level "spot diagnosis" cases across multiple imaging modalities to evaluate the performance of frontier AI models against board-certified radiologists and radiology trainees. To mirror real-world usage, the reasoning modes of five popular frontier AI models were tested through their native web interfaces, viz. OpenAI o3, OpenAI GPT-5, Gemini 2.5 Pro, Grok-4, and Claude Opus 4.1. Accuracy was scored by blinded experts, and reproducibility was assessed across three independent runs. GPT-5 was additionally evaluated across various reasoning modes. Reasoning quality errors were assessed and a taxonomy of visual reasoning errors was defined. Board-certified radiologists achieved the highest diagnostic accuracy (83%), outperforming trainees (45%) and all AI models (best performance shown by GPT-5: 30%). Reliability was substantial for GPT-5 and o3, moderate for Gemini 2.5 Pro and Grok-4, and poor for Claude Opus 4.1. These findings demonstrate that advanced frontier models fall far short of radiologists in challenging diagnostic cases. Our benchmark highlights the present limitations of generalist AI in medical imaging and cautions against unsupervised clinical use. We also provide a qualitative analysis of reasoning traces and propose a practical taxonomy of visual reasoning errors by AI models for better understanding their failure modes, informing evaluation standards and guiding more robust model development.

A Scalable Distributed Framework for Multimodal GigaVoxel Image Registration

Rohit Jena, Vedant Zope, Pratik Chaudhari, James C. Gee

arXiv preprint | Sep 29, 2025
In this work, we propose FFDP, a set of IO-aware non-GEMM fused kernels supplemented with a distributed framework for image registration at unprecedented scales. Image registration is an inverse problem fundamental to biomedical and life sciences, but algorithms have not scaled in tandem with image acquisition capabilities. Our framework complements existing model parallelism techniques proposed for large-scale transformer training by optimizing non-GEMM bottlenecks and enabling convolution-aware tensor sharding. We demonstrate unprecedented capabilities by performing multimodal registration of a 100 micron ex-vivo human brain MRI volume at native resolution - an inverse problem more than 570x larger than a standard clinical datum - in about a minute using only 8 A6000 GPUs. FFDP accelerates existing state-of-the-art optimization and deep learning registration pipelines by up to 6-7x while reducing peak memory consumption by 20-59%. Comparative analysis on a 250 micron dataset shows that FFDP can fit up to 64x larger problems than existing SOTA on a single GPU, and highlights both the performance and efficiency gains of FFDP compared to SOTA image registration methods.