Latest Papers on Radiology AI. Tags: Benchmark SOTA

Fine-tuned large language model for classifying CT-guided interventional radiology reports.

Yasaka K, Nishimura N, Fukushima T, Kubo T, Kiryu S, Abe O

•papers•Jun 23 2025

Background: Manual data curation was necessary to extract radiology reports due to the ambiguities of natural language. Purpose: To develop a fine-tuned large language model that classifies computed tomography (CT)-guided interventional radiology reports into technique categories and to compare its performance with that of the readers. Material and Methods: This retrospective study included patients who underwent CT-guided interventional radiology between August 2008 and November 2024. Patients were chronologically assigned to the training (n = 1142; 646 men; mean age = 64.1 ± 15.7 years), validation (n = 131; 83 men; mean age = 66.1 ± 16.1 years), and test (n = 332; 196 men; mean age = 66.1 ± 14.8 years) datasets. In establishing a reference standard, reports were manually classified into categories 1 (drainage), 2 (lesion biopsy within fat or soft tissue density tissues), 3 (lung biopsy), and 4 (bone biopsy). The Bidirectional Encoder Representations from Transformers (BERT) model was fine-tuned with the training dataset, and the model with the best performance in the validation dataset was selected. The performance and required time for classification in the test dataset were compared between the best-performing model and the two readers. Results: Categories 1/2/3/4 included 309/367/270/196, 30/42/40/19, and 75/124/78/55 patients for the training, validation, and test datasets, respectively. The model demonstrated an accuracy of 0.979 in the test dataset, which was significantly better than that of the readers (0.922-0.940) (P ≤ 0.012). The model classified reports within a 49.8-53.5-fold shorter time compared to readers. Conclusion: The fine-tuned large language model classified CT-guided interventional radiology reports into four categories, demonstrating high accuracy within a remarkably short time.
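
The chronological assignment described in the abstract can be illustrated with a minimal sketch. The abstract does not say whether the split used fixed counts or date cutoffs, so the count-based slicing below is an assumption, and the `(procedure_date, report_text)` record layout and the function name `chronological_split` are hypothetical.

```python
from datetime import date


def chronological_split(records, n_train, n_val):
    """Sort records by procedure date and slice them into train/val/test.

    `records` is a list of (procedure_date, report_text) tuples; this
    layout is hypothetical and only serves the illustration.
    """
    ordered = sorted(records, key=lambda r: r[0])
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]
    return train, val, test


# Toy data: ten made-up reports, one per year starting in 2008.
records = [(date(2008 + i, 8, 1), f"report {i}") for i in range(10)]
train, val, test = chronological_split(records, n_train=6, n_val=2)

# Every training report predates every validation and test report.
print(len(train), len(val), len(test))
```

Splitting by time rather than at random keeps later reports out of training, which is closer to how such a classifier would meet new reports in practice.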

CT LLM Radiology Report Retrospective Clinical In Silico Academic Lab Benchmark SOTA

Multimodal deep learning for predicting neoadjuvant treatment outcomes in breast cancer: a systematic review.

Krasniqi E, Filomeno L, Arcuri T, Ferretti G, Gasparro S, Fulvi A, Roselli A, D'Onofrio L, Pizzuti L, Barba M, Maugeri-Saccà M, Botti C, Graziano F, Puccica I, Cappelli S, Pelle F, Cavicchi F, Villanucci A, Paris I, Calabrò F, Rea S, Costantini M, Perracchio L, Sanguineti G, Takanen S, Marucci L, Greco L, Kayal R, Moscetti L, Marchesini E, Calonaci N, Blandino G, Caravagna G, Vici P

•papers•Jun 23 2025

Pathological complete response (pCR) to neoadjuvant systemic therapy (NAST) is an established prognostic marker in breast cancer (BC). Multimodal deep learning (DL), integrating diverse data sources (radiology, pathology, omics, clinical), holds promise for improving pCR prediction accuracy. This systematic review synthesizes evidence on multimodal DL for pCR prediction and compares its performance against unimodal DL. Following PRISMA, we searched PubMed, Embase, and Web of Science (January 2015-April 2025) for studies applying DL to predict pCR in BC patients receiving NAST, using data from radiology, digital pathology (DP), multi-omics, and/or clinical records, and reporting AUC. Data on study design, DL architectures, and performance (AUC) were extracted. A narrative synthesis was conducted due to heterogeneity. Fifty-one studies, mostly retrospective (90.2%, median cohort 281), were included. Magnetic resonance imaging and DP were common primary modalities. Multimodal approaches were used in 52.9% of studies, often combining imaging with clinical data. Convolutional neural networks were the dominant architecture (88.2%). Longitudinal imaging improved prediction over baseline-only (median AUC 0.91 vs. 0.82). Overall, the median AUC across studies was 0.88, with 35.3% achieving AUC ≥ 0.90. Multimodal models showed a modest but consistent improvement over unimodal approaches (median AUC 0.88 vs. 0.83). Omics and clinical text were rarely primary DL inputs. DL models demonstrate promising accuracy for pCR prediction, especially when integrating multiple modalities and longitudinal imaging. However, significant methodological heterogeneity, reliance on retrospective data, and limited external validation hinder clinical translation. Future research should prioritize prospective validation, integration of underutilized data (multi-omics, clinical), and explainable AI to advance DL predictors to the clinical setting.

MRI Classification Breast Review In Silico Academic Lab Benchmark SOTA Ethics

Comparative Analysis of Multimodal Large Language Models GPT-4o and o1 vs Clinicians in Clinical Case Challenge Questions

Jung, J., Kim, H., Bae, S., Park, J. Y.

•preprint•Jun 23 2025

Background: Generative Pre-trained Transformer 4 (GPT-4) has demonstrated strong performance in standardized medical examinations but has limitations in real-world clinical settings. The newly released multimodal GPT-4o model, which integrates text and image inputs to enhance diagnostic capabilities, and the multimodal o1 model, which incorporates advanced reasoning, may address these limitations. Objective: This study aimed to compare the performance of GPT-4o and o1 against clinicians in real-world clinical case challenges. Methods: This retrospective, cross-sectional study used Medscape case challenge questions from May 2011 to June 2024 (n = 1,426). Each case included text and images of patient history, physical examination findings, diagnostic test results, and imaging studies. Clinicians were required to choose one answer from among multiple options, with the most frequent response defined as the clinicians' decision. Data-based decisions were made using GPT models (3.5 Turbo, 4 Turbo, 4 Omni, and o1) to interpret the text and images, followed by a process to provide a formatted answer. We compared the performances of the clinicians and GPT models using mixed-effects logistic regression analysis. Results: Of the 1,426 questions, clinicians achieved an overall accuracy of 85.0%, whereas GPT-4o and o1 demonstrated higher accuracies of 88.4% and 94.3% (mean difference 3.4%; P = .005 and mean difference 9.3%; P < .001), respectively. In the multimodal performance analysis, which included cases involving images (n = 917), GPT-4o achieved an accuracy of 88.3%, and o1 achieved 93.9%, both significantly outperforming clinicians (mean difference 4.2%; P = .005 and mean difference 9.8%; P < .001). o1 showed the highest accuracy across all question categories, achieving 92.6% in diagnosis (mean difference 14.5%; P < .001), 97.0% in disease characteristics (mean difference 7.2%; P < .001), 92.6% in examination (mean difference 7.3%; P = .002), and 94.8% in treatment (mean difference 4.3%; P = .005), consistently outperforming clinicians. In terms of medical specialty, o1 achieved 93.6% accuracy in internal medicine (mean difference 10.3%; P < .001), 96.6% in major surgery (mean difference 9.2%; P = .030), 97.3% in psychiatry (mean difference 10.6%; P = .030), and 95.4% in minor specialties (mean difference 10.0%; P < .001), significantly surpassing clinicians. Across five trials, GPT-4o and o1 provided the correct answer 5/5 times in 86.2% and 90.7% of the cases, respectively. Conclusions: The GPT-4o and o1 models achieved higher accuracy than clinicians in clinical case challenge questions, particularly in disease diagnosis. GPT-4o and o1 could serve as valuable tools to assist healthcare professionals in clinical settings.
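
Two scoring rules in this abstract lend themselves to a short illustration: the clinicians' decision taken as the most frequent response, and the count of cases answered correctly in all five trials. The sketch below is a hypothetical reading of those rules, not the authors' code; the function names and toy votes are made up, and the abstract does not state how tied votes were handled.

```python
from collections import Counter


def majority_answer(responses):
    """Return the most frequent option among the responses."""
    return Counter(responses).most_common(1)[0][0]


def all_trials_correct(trial_answers, correct):
    """True only when every repeated trial gave the correct option."""
    return all(answer == correct for answer in trial_answers)


# Toy example: five clinician votes and five repeated model trials.
clinician_votes = ["B", "A", "B", "C", "B"]
model_trials = ["B", "B", "B", "B", "B"]

print(majority_answer(clinician_votes))
print(all_trials_correct(model_trials, "B"))
```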

Mixed Modality Classification Retrospective Clinical In Silico Academic Lab Benchmark SOTA GenAI

Deep learning-quantified body composition from positron emission tomography/computed tomography and cardiovascular outcomes: a multicentre study.

Miller RJH, Yi J, Shanbhag A, Marcinkiewicz A, Patel KK, Lemley M, Ramirez G, Geers J, Chareonthaitawee P, Wopperer S, Berman DS, Di Carli M, Dey D, Slomka PJ

•papers•Jun 23 2025

Positron emission tomography (PET)/computed tomography (CT) myocardial perfusion imaging (MPI) is a vital diagnostic tool, especially in patients with cardiometabolic syndrome. Low-dose CT scans are routinely performed with PET for attenuation correction and potentially contain valuable data about body tissue composition. Deep learning and image processing were combined to automatically quantify skeletal muscle (SM), bone and adipose tissue from these scans and then evaluate their associations with death or myocardial infarction (MI). In PET MPI from three sites, deep learning quantified SM, bone, epicardial adipose tissue (EAT), subcutaneous adipose tissue (SAT), visceral adipose tissue (VAT), and intermuscular adipose tissue (IMAT). Sex-specific thresholds for abnormal values were established. Associations with death or MI were evaluated using unadjusted and multivariable models adjusted for clinical and imaging factors. This study included 10 085 patients, with median age 68 (interquartile range 59-76) and 5767 (57%) male. Body tissue segmentations were completed in 102 ± 4 s. Higher VAT density was associated with an increased risk of death or MI in both unadjusted [hazard ratio (HR) 1.40, 95% confidence interval (CI) 1.37-1.43] and adjusted (HR 1.24, 95% CI 1.19-1.28) analyses, with similar findings for IMAT, SAT, and EAT. Patients with elevated VAT density and reduced myocardial flow reserve had a significantly increased risk of death or MI (adjusted HR 2.49, 95% CI 2.23-2.77). Volumetric body tissue composition can be obtained rapidly and automatically from standard cardiac PET/CT. This new information provides a detailed, quantitative assessment of sarcopenia and cardiometabolic health for physicians.

Mixed Modality Segmentation Cardiac Retrospective Clinical In Silico Academic Lab Benchmark SOTA

Open Set Recognition for Endoscopic Image Classification: A Deep Learning Approach on the Kvasir Dataset

Kasra Moazzami, Seoyoun Son, John Lin, Sun Min Lee, Daniel Son, Hayeon Lee, Jeongho Lee, Seongji Lee

•preprint•Jun 23 2025

Endoscopic image classification plays a pivotal role in medical diagnostics by identifying anatomical landmarks and pathological findings. However, conventional closed-set classification frameworks are inherently limited in open-world clinical settings, where previously unseen conditions can arise and compromise model reliability. To address this, we explore the application of Open Set Recognition (OSR) techniques on the Kvasir dataset, a publicly available and diverse endoscopic image collection. In this study, we evaluate and compare the OSR capabilities of several representative deep learning architectures, including ResNet-50, Swin Transformer, and a hybrid ResNet-Transformer model, under both closed-set and open-set conditions. OpenMax is adopted as a baseline OSR method to assess the ability of these models to distinguish known classes from previously unseen categories. This work represents one of the first efforts to apply open set recognition to the Kvasir dataset and provides a foundational benchmark for evaluating OSR performance in medical image analysis. Our results offer practical insights into model behavior in clinically realistic settings and highlight the importance of OSR techniques for the safe deployment of AI systems in endoscopy.
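
To make the open-set idea concrete, here is a minimal sketch of rejecting low-confidence predictions as "unknown". Note that this is maximum-softmax thresholding, a simpler technique than the OpenMax baseline the paper uses (OpenMax recalibrates activations with per-class Weibull models). The threshold value, logits, and class subset are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_open_set(logits, class_names, threshold=0.7):
    """Return the top class, or "unknown" if its probability is too low."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "unknown"
    return class_names[best]


# Illustrative subset of class labels; the logits are made up.
classes = ["polyps", "esophagitis", "normal-cecum"]
print(predict_open_set([4.0, 0.5, 0.2], classes))  # confident prediction
print(predict_open_set([1.0, 0.9, 1.1], classes))  # near-uniform scores
```

A closed-set classifier would always return one of the three labels; the rejection branch is what lets the model flag inputs that do not resemble any known class.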

Mixed Modality Classification Abdominal Methodology In Silico Benchmark SOTA Open Dataset

BrainSymphony: A Transformer-Driven Fusion of fMRI Time Series and Structural Connectivity

Moein Khajehnejad, Forough Habibollahi, Adeel Razi

•preprint•Jun 23 2025

Existing foundation models for neuroimaging are often prohibitively large and data-intensive. We introduce BrainSymphony, a lightweight, parameter-efficient foundation model that achieves state-of-the-art performance while being pre-trained on significantly smaller public datasets. BrainSymphony's strong multimodal architecture processes functional MRI data through parallel spatial and temporal transformer streams, which are then efficiently distilled into a unified representation by a Perceiver module. Concurrently, it models structural connectivity from diffusion MRI using a novel signed graph transformer to encode the brain's anatomical structure. These powerful, modality-specific representations are then integrated via an adaptive fusion gate. Despite its compact design, our model consistently outperforms larger models on a diverse range of downstream benchmarks, including classification, prediction, and unsupervised network identification tasks. Furthermore, our model revealed novel insights into brain dynamics using attention maps on a unique external psilocybin neuroimaging dataset (pre- and post-administration). BrainSymphony establishes that architecturally-aware, multimodal models can surpass their larger counterparts, paving the way for more accessible and powerful research in computational neuroscience.
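
The abstract does not specify how the adaptive fusion gate is built. As a rough illustration of the general idea only, the sketch below blends two equal-length embeddings with a learned sigmoid gate; the gate form, the names `gated_fusion`, `gate_weights`, and `gate_bias`, and the toy values are all assumptions rather than BrainSymphony's actual design.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(functional, structural, gate_weights, gate_bias):
    """Blend two equal-length embeddings with an element-wise gate.

    gate[i]  = sigmoid(w_i . [functional; structural] + b_i)
    fused[i] = gate[i] * functional[i] + (1 - gate[i]) * structural[i]
    """
    concat = functional + structural  # list concatenation
    fused = []
    for i, (f, s) in enumerate(zip(functional, structural)):
        pre = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        g = sigmoid(pre)
        fused.append(g * f + (1.0 - g) * s)
    return fused


# Toy embeddings; zero weights and bias give a gate of exactly 0.5.
functional = [1.0, 0.0]
structural = [0.0, 1.0]
weights = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
bias = [0.0, 0.0]
print(gated_fusion(functional, structural, weights, bias))  # [0.5, 0.5]
```

In a trained model the weights would be learned, so the gate could lean toward the functional or the structural representation per feature.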

Mixed Modality Classification Neurological Methodology In Silico Academic Lab Benchmark SOTA GenAI

MRI Radiomics and Automated Habitat Analysis Enhance Machine Learning Prediction of Bone Metastasis and High-Grade Gleason Scores in Prostate Cancer.

Yang Y, Zheng B, Zou B, Liu R, Yang R, Chen Q, Guo Y, Yu S, Chen B

•papers•Jun 23 2025

This study explored the value of machine learning models based on MRI radiomics and automated habitat analysis in predicting bone metastasis and high-grade pathological Gleason scores in prostate cancer. This retrospective study enrolled 214 patients with pathologically diagnosed prostate cancer from May 2013 to January 2025, including 93 cases with bone metastasis and 159 cases with high-grade Gleason scores. Clinical, pathological and MRI data were collected. An nnUNet model automatically segmented the prostate in MRI scans. K-means clustering identified habitat subregions within the entire prostate in T2-FS images. Senior radiologists manually segmented regions of interest (ROIs) in prostate lesions. Radiomics features were extracted from these habitat subregions and lesion ROIs. These features, combined with clinical features, were used to build multiple machine learning classifiers to predict bone metastasis and high-grade Gleason scores. Finally, the models underwent interpretable analysis based on feature importance. The nnUNet model achieved a mean Dice coefficient of 0.970 for segmentation. Habitat analysis using 2 clusters yielded the highest average silhouette coefficient (0.57). Machine learning models based on a combination of lesion radiomics, habitat radiomics, and clinical features achieved the best performance in both prediction tasks. The Extra Trees Classifier achieved the highest AUC (0.900) for predicting bone metastasis, while the CatBoost Classifier performed best (AUC 0.895) for predicting high-grade Gleason scores. The interpretability analysis of the optimal models showed that the PSA clinical feature was crucial for predictions, while both habitat radiomics and lesion radiomics also played important roles. The study proposed an automated prostate habitat analysis for prostate cancer, enabling a comprehensive analysis of tumor heterogeneity. The machine learning models developed achieved excellent performance in predicting the risk of bone metastasis and high-grade Gleason scores in prostate cancer. This approach overcomes the limitations of manual feature extraction and the inadequate analysis of heterogeneity often encountered in traditional radiomics, thereby improving model performance.
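
The Dice coefficient reported for the segmentation step measures overlap between a predicted mask and a reference mask, Dice = 2|A ∩ B| / (|A| + |B|). The sketch below computes it for small binary masks stored as nested lists; the masks are toy values, and treating two empty masks as perfect agreement is a convention chosen here, not something the abstract states.

```python
def dice_coefficient(pred, truth):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks of equal shape."""
    pred_flat = [v for row in pred for v in row]
    truth_flat = [v for row in truth for v in row]
    intersection = sum(p * t for p, t in zip(pred_flat, truth_flat))
    total = sum(pred_flat) + sum(truth_flat)
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement here
    return 2.0 * intersection / total


# Toy 2x3 masks: 3 predicted voxels, 2 reference voxels, 2 shared.
pred = [[0, 1, 1],
        [0, 1, 0]]
truth = [[0, 1, 1],
         [0, 0, 0]]
print(dice_coefficient(pred, truth))  # 2 * 2 / (3 + 2) = 0.8
```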

MRI Classification Abdominal Retrospective Clinical In Silico Academic Lab Benchmark SOTA

SafeClick: Error-Tolerant Interactive Segmentation of Any Medical Volumes via Hierarchical Expert Consensus

Yifan Gao, Jiaxi Sheng, Wenbin Wu, Haoyue Li, Yaoxian Dong, Chaoyang Ge, Feng Yuan, Xin Gao

•preprint•Jun 23 2025

Foundation models for volumetric medical image segmentation have emerged as powerful tools in clinical workflows, enabling radiologists to delineate regions of interest through intuitive clicks. While these models demonstrate promising capabilities in segmenting previously unseen anatomical structures, their performance is strongly influenced by prompt quality. In clinical settings, radiologists often provide suboptimal prompts, which affects segmentation reliability and accuracy. To address this limitation, we present SafeClick, an error-tolerant interactive segmentation approach for medical volumes based on hierarchical expert consensus. SafeClick operates as a plug-and-play module compatible with foundation models including SAM 2 and MedSAM 2. The framework consists of two key components: a collaborative expert layer (CEL) that generates diverse feature representations through specialized transformer modules, and a consensus reasoning layer (CRL) that performs cross-referencing and adaptive integration of these features. This architecture transforms the segmentation process from a prompt-dependent operation to a robust framework capable of producing accurate results despite imperfect user inputs. Extensive experiments across 15 public datasets demonstrate that our plug-and-play approach consistently improves the performance of base foundation models, with particularly significant gains when working with imperfect prompts. The source code is available at https://github.com/yifangao112/SafeClick.

Mixed Modality Segmentation Methodology In Silico Academic Lab Open Code Benchmark SOTA

Assessing accuracy and legitimacy of multimodal large language models on Japan Diagnostic Radiology Board Examination

Hirano, Y., Miki, S., Yamagishi, Y., Hanaoka, S., Nakao, T., Kikuchi, T., Nakamura, Y., Nomura, Y., Yoshikawa, T., Abe, O.

•preprint•Jun 23 2025

Purpose: To assess and compare the accuracy and legitimacy of multimodal large language models (LLMs) on the Japan Diagnostic Radiology Board Examination (JDRBE). Materials and methods: The dataset comprised questions from JDRBE 2021, 2023, and 2024, with ground-truth answers established through consensus among multiple board-certified diagnostic radiologists. Questions without associated images and those lacking unanimous agreement on answers were excluded. Eight LLMs were evaluated: GPT-4 Turbo, GPT-4o, GPT-4.5, GPT-4.1, o3, o4-mini, Claude 3.7 Sonnet, and Gemini 2.5 Pro. Each model was evaluated under two conditions: with image input (vision) and without (text-only). Performance differences between the conditions were assessed using McNemar's exact test. Two diagnostic radiologists (with 2 and 18 years of experience) independently rated the legitimacy of responses from four models (GPT-4 Turbo, Claude 3.7 Sonnet, o3, and Gemini 2.5 Pro) using a five-point Likert scale, blinded to model identity. Legitimacy scores were analyzed using Friedman's test, followed by pairwise Wilcoxon signed-rank tests with Holm correction. Results: The dataset included 233 questions. Under the vision condition, o3 achieved the highest accuracy at 72%, followed by o4-mini (70%) and Gemini 2.5 Pro (70%). Under the text-only condition, o3 topped the list with an accuracy of 67%. Addition of image input significantly improved the accuracy of two models (Gemini 2.5 Pro and GPT-4.5), but not the others. Both o3 and Gemini 2.5 Pro received significantly higher legitimacy scores than GPT-4 Turbo and Claude 3.7 Sonnet from both raters. Conclusion: Recent multimodal LLMs, particularly o3 and Gemini 2.5 Pro, have demonstrated remarkable progress on JDRBE questions, reflecting their rapid evolution in diagnostic radiology. Secondary abstract: Eight multimodal large language models were evaluated on the Japan Diagnostic Radiology Board Examination. OpenAI's o3 and Google DeepMind's Gemini 2.5 Pro achieved high accuracy rates (72% and 70%) and received good legitimacy scores from human raters, demonstrating steady progress.
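
McNemar's exact test, used above to compare the vision and text-only conditions on the same questions, depends only on the discordant pairs: questions answered correctly under one condition but not the other. A minimal sketch follows, assuming the standard two-sided binomial formulation; the counts in the example are invented for illustration and do not come from the study.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: questions correct only with images; c: correct only without.
    Under the null hypothesis the smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Invented counts: 15 questions fixed by adding images, 3 broken by it.
print(mcnemar_exact(15, 3))
```

Questions that both conditions get right, or both get wrong, do not enter the statistic, which is why a model can show a visible accuracy gain yet a non-significant result when few pairs are discordant.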

Mixed Modality LLM Radiology Report Whole Body Retrospective Clinical In Silico Academic Lab Benchmark SOTA

Decoding Federated Learning: The FedNAM+ Conformal Revolution

Sree Bhargavi Balija, Amitash Nanda, Debashis Sahoo

•preprint•Jun 22 2025

Federated learning has significantly advanced distributed training of machine learning models across decentralized data sources. However, existing frameworks often lack comprehensive solutions that combine uncertainty quantification, interpretability, and robustness. To address this, we propose FedNAM+, a federated learning framework that integrates Neural Additive Models (NAMs) with a novel conformal prediction method to enable interpretable and reliable uncertainty estimation. Our method introduces a dynamic level adjustment technique that utilizes gradient-based sensitivity maps to identify key input features influencing predictions. This facilitates both interpretability and pixel-wise uncertainty estimates. Unlike traditional interpretability methods such as LIME and SHAP, which do not provide confidence intervals, FedNAM+ offers visual insights into prediction reliability. We validate our approach through experiments on CT scan, MNIST, and CIFAR datasets, demonstrating high prediction accuracy with minimal loss (e.g., only 0.1% on MNIST), along with transparent uncertainty measures. Visual analysis highlights variable uncertainty intervals, revealing low-confidence regions where model performance can be improved with additional data. Compared to Monte Carlo Dropout, FedNAM+ delivers efficient and global uncertainty estimates with reduced computational overhead, making it particularly suitable for federated learning scenarios. Overall, FedNAM+ provides a robust, interpretable, and computationally efficient framework that enhances trust and transparency in decentralized predictive modeling.
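
For readers unfamiliar with conformal prediction, the sketch below shows standard split conformal prediction for classification. It is not the FedNAM+ method, which the abstract describes as a novel variant with dynamic level adjustment; the nonconformity score (1 minus the probability of the true class), the calibration values, and the function names are illustrative assumptions.

```python
import math


def conformal_threshold(calibration_scores, alpha=0.1):
    """Return the ceil((n + 1)(1 - alpha))-th smallest calibration score."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(calibration_scores)[rank - 1]


def prediction_set(class_probs, threshold):
    """Keep every class whose nonconformity score 1 - p is within threshold."""
    return [label for label, p in class_probs.items() if 1.0 - p <= threshold]


# Made-up calibration scores: 1 - probability assigned to the true class.
calibration = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.35, 0.40, 0.55]
q = conformal_threshold(calibration, alpha=0.2)
print(q)
print(prediction_set({"a": 0.7, "b": 0.25, "c": 0.05}, q))
```

Under the usual exchangeability assumption, sets built this way contain the true label with probability at least 1 - alpha; larger sets signal inputs the model is less sure about.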

CT Classification Methodology In Silico Academic Lab Benchmark SOTA
