
Recurrent Visual Feature Extraction and Stereo Attentions for CT Report Generation

Yuanhe Tian, Lei Mao, Yan Song

arXiv preprint · Jun 24, 2025
Generating reports for computed tomography (CT) images is a challenging task; while it shares much with medical image report generation in general, it has unique characteristics, such as the spatial encoding of multiple images and the alignment between the image volume and texts. Existing solutions typically use general 2D or 3D image processing techniques to extract features from a CT volume, first compressing the volume and then dividing the compressed CT slices into patches for visual encoding. These approaches do not explicitly account for the transformations among CT slices, nor do they effectively integrate multi-level image features, particularly those containing specific organ lesions, to instruct CT report generation (CTRG). Considering the strong correlation among consecutive slices in CT scans, in this paper we propose a large language model (LLM)-based CTRG method with recurrent visual feature extraction and stereo attentions for hierarchical feature modeling. Specifically, we use a vision Transformer to recurrently process each slice in a CT volume and employ a set of attentions over the encoded slices from different perspectives to selectively obtain important visual information and align it with textual features, so as to better instruct an LLM for CTRG. Experimental results and further analysis on the benchmark M3D-Cap dataset show that our method outperforms strong baseline models and achieves state-of-the-art results, demonstrating its validity and effectiveness.
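The recurrent-encoding-plus-attention idea can be sketched compactly. The PyTorch snippet below is a minimal illustration under simplifying assumptions, not the paper's architecture: a linear layer stands in for the vision Transformer, a GRU for the recurrent slice processing, and a single attention head for the stereo attentions.

```python
import torch
import torch.nn as nn

class RecurrentSliceEncoder(nn.Module):
    """Encode a CT volume slice by slice, carry state across slices with a
    recurrent layer, then attention-pool the slice features into one vector."""

    def __init__(self, feat_dim=256, hidden_dim=256):
        super().__init__()
        # Stand-in for the per-slice vision Transformer in the paper.
        self.slice_encoder = nn.Linear(3 * 64 * 64, feat_dim)
        # The GRU models transformations between consecutive slices.
        self.recurrence = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        # Attention scores pick out the most informative slices.
        self.attn = nn.Linear(hidden_dim, 1)

    def forward(self, volume):                             # (B, S, 3, 64, 64)
        feats = self.slice_encoder(volume.flatten(2))      # (B, S, D)
        states, _ = self.recurrence(feats)                 # (B, S, H)
        weights = torch.softmax(self.attn(states), dim=1)  # (B, S, 1)
        return (weights * states).sum(dim=1)               # (B, H)

volume = torch.randn(2, 16, 3, 64, 64)   # two toy volumes of 16 slices each
print(RecurrentSliceEncoder()(volume).shape)  # torch.Size([2, 256])
```

The pooled vector would serve as a visual prefix conditioning the LLM's report decoder.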

DeepSeek-assisted LI-RADS classification: AI-driven precision in hepatocellular carcinoma diagnosis.

Zhang J, Liu J, Guo M, Zhang X, Xiao W, Chen F

PubMed paper · Jun 24, 2025
The clinical utility of the DeepSeek-V3 (DSV3) model in enhancing the accuracy of Liver Imaging Reporting and Data System (LI-RADS, LR) classification remains underexplored. This study aimed to evaluate the diagnostic performance of DSV3 in LR classification compared to radiologists with varying levels of experience and to assess its potential as a decision-support tool in clinical practice. A dual-phase retrospective-prospective study analyzed 426 liver lesions (300 retrospective, 126 prospective) in high-risk hepatocellular carcinoma (HCC) patients who underwent Magnetic Resonance Imaging (MRI) or Computed Tomography (CT). Three radiologists (one junior, two seniors) independently classified lesions using LR v2018 criteria, while DSV3 analyzed unstructured radiology reports to generate corresponding classifications. In the prospective cohort, DSV3 processed inputs in both Chinese and English to evaluate language impact. Performance was compared using the chi-square test or Fisher's exact test, with pathology as the gold standard. In the retrospective cohort, DSV3 significantly outperformed junior radiologists in diagnostically challenging categories: LR-3 (17.8% vs. 39.7%, p<0.05), LR-4 (80.4% vs. 46.2%, p<0.05), and LR-5 (86.2% vs. 66.7%, p<0.05), while showing comparable accuracy in LR-1 (90.8% vs. 88.7%), LR-2 (11.9% vs. 25.6%), and LR-M (79.5% vs. 62.1%) classifications (all p>0.05). Prospective validation confirmed these findings, with DSV3 demonstrating superior performance for LR-3 (13.3% vs. 60.0%), LR-4 (93.3% vs. 66.7%), and LR-5 (93.5% vs. 67.7%) compared to junior radiologists (all p<0.05). Notably, DSV3 achieved diagnostic parity with senior radiologists across all categories (p>0.05) and maintained consistent performance between Chinese and English inputs. The DSV3 model effectively improves the diagnostic accuracy of LR-3 to LR-5 classifications relative to junior radiologists. Its language-independent performance and ability to match senior-level expertise suggest strong potential for clinical implementation to standardize HCC diagnosis and optimize treatment decisions.
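The statistical comparison described (chi-square or Fisher's exact test on classification accuracy against the pathology gold standard) reduces to a contingency-table test. A minimal SciPy sketch, using hypothetical counts rather than the study's data:

```python
from scipy.stats import chi2_contingency, fisher_exact

# Hypothetical counts: correct vs. incorrect LR-4 calls for model and reader.
table = [[86, 21],   # model:  correct, incorrect
         [49, 57]]   # junior: correct, incorrect

chi2, p, dof, _ = chi2_contingency(table)
print(f"chi-square p = {p:.4f}")

# Fisher's exact test is the usual fallback when expected cell counts are small.
odds_ratio, p_exact = fisher_exact(table)
print(f"Fisher exact p = {p_exact:.4f}, OR = {odds_ratio:.2f}")
```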

Non-invasive prediction of NSCLC immunotherapy efficacy and tumor microenvironment through unsupervised machine learning-driven CT radiomic subtypes: a multi-cohort study.

Guo Y, Gong B, Li Y, Mo P, Chen Y, Fan Q, Sun Q, Miao L, Li Y, Liu Y, Tan W, Yang L, Zheng C

PubMed paper · Jun 24, 2025
Radiomics analyzes quantitative features from medical images to reveal tumor heterogeneity, offering new insights for diagnosis, prognosis, and treatment prediction. This study explored radiomics-based biomarkers to predict immunotherapy response and its association with the tumor microenvironment in non-small cell lung cancer (NSCLC) using unsupervised machine learning models derived from CT imaging. The study included 1539 NSCLC patients from seven independent cohorts. K-means unsupervised clustering was applied to 1834 radiomic features extracted from 869 NSCLC patients to identify radiomic subtypes, and a random forest model extended the subtype classification to external cohorts; model accuracy, sensitivity, and specificity were evaluated. Bulk RNA sequencing (RNA-seq) and single-cell transcriptome sequencing (scRNA-seq) of tumors were used to characterize the tumor immune microenvironment and to evaluate the association between radiomic subtypes and immunotherapy efficacy, immune scores, and immune cell infiltration. Unsupervised clustering stratified NSCLC patients into two subtypes (Cluster 1 and Cluster 2). Principal component analysis confirmed significant distinctions between subtypes across all cohorts. Cluster 2 exhibited significantly longer median overall survival (35 vs. 30 months, P = 0.006) and progression-free survival (19 vs. 16 months, P = 0.020) than Cluster 1. Multivariate Cox regression identified radiomic subtype as an independent predictor of overall survival (HR: 0.738, 95% CI 0.583-0.935, P = 0.012), validated in two external cohorts. Bulk RNA-seq showed elevated interaction signaling and immune scores in Cluster 2, and scRNA-seq demonstrated higher proportions of T cells, B cells, and NK cells in Cluster 2. This study establishes a radiomic subtype associated with NSCLC immunotherapy efficacy and the tumor immune microenvironment. The findings provide a non-invasive tool for personalized treatment, enabling early identification of immunotherapy-responsive patients and optimized therapeutic strategies.
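The two-step design (unsupervised subtype discovery, then a supervised model to propagate subtypes to external cohorts) maps directly onto scikit-learn primitives. A minimal sketch on synthetic data of the stated dimensions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_discovery = rng.normal(size=(869, 1834))   # radiomic features (synthetic)
X_external = rng.normal(size=(300, 1834))    # external cohort (synthetic)

# 1) Discover radiomic subtypes without labels.
scaler = StandardScaler().fit(X_discovery)
subtype = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    scaler.transform(X_discovery))

# 2) Train a classifier on the discovered subtypes so they can be assigned
#    in external cohorts without re-running the clustering.
rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(scaler.transform(X_discovery), subtype)
external_subtype = rf.predict(scaler.transform(X_external))
print(np.bincount(external_subtype))   # patients per subtype in the new cohort
```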

Machine learning-based construction and validation of a radiomics model for predicting ISUP grading in prostate cancer: a multicenter radiomics study based on [68Ga]Ga-PSMA PET/CT.

Zhang H, Jiang X, Yang G, Tang Y, Qi L, Chen M, Hu S, Gao X, Zhang M, Chen S, Cai Y

PubMed paper · Jun 24, 2025
The International Society of Urological Pathology (ISUP) grading of prostate cancer (PCa) is a crucial factor in the management and treatment planning for PCa patients. An accurate and non-invasive assessment of the ISUP grade group could significantly improve biopsy decisions and treatment planning. The use of PSMA PET/CT radiomics for predicting ISUP grade has not been widely studied. The aim of this study is to investigate the role of 68Ga-PSMA PET/CT radiomics in predicting the ISUP grading of primary PCa. This study included 415 PCa patients who underwent 68Ga-PSMA PET/CT scans before prostate biopsy or radical prostatectomy. Patients were from three centers: Xiangya Hospital, Central South University (252 cases), Qilu Hospital of Shandong University (external validation 1, 108 cases), and Qingdao University Medical College (external validation 2, 55 cases). Xiangya Hospital cases were split into training and testing groups (1:1 ratio), with the other centers serving as external validation groups. Feature selection was performed using the Minimum Redundancy Maximum Relevance (mRMR) and Least Absolute Shrinkage and Selection Operator (LASSO) algorithms. Eight machine learning classifiers were trained and tested with ten-fold cross-validation. Sensitivity, specificity, and AUC were calculated for each model. Additionally, we combined the radiomic features with the maximum standardized uptake value (SUVmax) and prostate-specific antigen (PSA) to create prediction models and tested the corresponding performance. The best-performing model in the Xiangya Hospital training cohort achieved an AUC of 0.868 (sensitivity 72.7%, specificity 96.0%). Similar trends were seen in the testing cohort and external validation centers (AUCs: 0.860, 0.827, and 0.812). After incorporating PSA and SUVmax, a more robust model was developed, achieving an AUC of 0.892 (sensitivity 77.9%, specificity 96.0%) in the training group. This study established and validated a radiomics model based on 68Ga-PSMA PET/CT, offering an accurate, non-invasive method for predicting ISUP grades in prostate cancer. A multicenter design with external validation ensured the model's robustness and broad applicability. This is the largest study to date on PSMA radiomics for predicting ISUP grades. Notably, integrating SUVmax and PSA metrics with radiomic features significantly improved prediction accuracy, providing new insights and tools for personalized diagnosis and treatment.
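The feature-selection-plus-classifier pipeline with ten-fold cross-validated AUC is straightforward to assemble in scikit-learn. The sketch below uses L1-penalized logistic regression for the LASSO step and omits the mRMR pre-filter; the data and hyperparameters are synthetic placeholders, not the study's:

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(252, 500))      # PET/CT radiomic features (synthetic)
y = rng.integers(0, 2, size=252)     # high- vs. low-ISUP grade (synthetic)

# L1-penalized logistic regression performs the LASSO-style selection;
# only features with nonzero coefficients pass through to the classifier.
lasso_selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1))
clf = make_pipeline(StandardScaler(), lasso_selector,
                    LogisticRegression(max_iter=1000))

auc = cross_val_score(clf, X, y, cv=10, scoring="roc_auc")
print(f"10-fold CV AUC: {auc.mean():.3f} +/- {auc.std():.3f}")
```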

Benchmarking Foundation Models and Parameter-Efficient Fine-Tuning for Prognosis Prediction in Medical Imaging

Filippo Ruffini, Elena Mulero Ayllon, Linlin Shen, Paolo Soda, Valerio Guarrasi

arXiv preprint · Jun 23, 2025
Artificial Intelligence (AI) holds significant promise for improving prognosis prediction in medical imaging, yet its effective application remains challenging. In this work, we introduce a structured benchmark explicitly designed to evaluate and compare the transferability of Convolutional Neural Networks and Foundation Models in predicting clinical outcomes in COVID-19 patients, leveraging diverse publicly available chest X-ray datasets. Our experimental methodology extensively explores a wide set of fine-tuning strategies, encompassing traditional approaches such as Full Fine-Tuning and Linear Probing, as well as advanced Parameter-Efficient Fine-Tuning methods including Low-Rank Adaptation (LoRA), BitFit, VeRA, and IA3. The evaluations were conducted across multiple learning paradigms, including extensive full-data scenarios and more clinically realistic Few-Shot Learning settings, which are critical for modeling rare disease outcomes and rapidly emerging health threats. By implementing a large-scale comparative analysis involving a diverse selection of pretrained models, ranging from general-purpose architectures pretrained on large-scale datasets, such as CLIP and DINOv2, to biomedical-specific models like MedCLIP, BioMedCLIP, and PubMedCLIP, we rigorously assess each model's capacity to adapt and generalize to prognosis tasks, particularly under severe data scarcity and pronounced class imbalance. The benchmark was designed to capture critical conditions common in prognosis tasks, including variations in dataset size and class distribution, providing detailed insights into the strengths and limitations of each fine-tuning strategy. This extensive and structured evaluation aims to inform the practical deployment and adoption of robust, efficient, and generalizable AI-driven solutions in real-world clinical prognosis prediction workflows.
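Of the PEFT methods listed, LoRA is the most widely used; a minimal sketch with Hugging Face's peft library shows the pattern. The ViT checkpoint, rank, and target modules here are illustrative choices, not the paper's configuration:

```python
from transformers import ViTForImageClassification
from peft import LoraConfig, get_peft_model

# Load a pretrained backbone with a fresh 2-class head (e.g., outcome yes/no).
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k", num_labels=2)

# LoRA freezes the backbone and learns low-rank updates to the attention
# projections; r and lora_alpha below are illustrative hyperparameters.
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1,
                    target_modules=["query", "value"],
                    modules_to_save=["classifier"])
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically ~1% of all weights
```

Linear Probing is the limiting case of this setup: freeze every backbone parameter and train only the classification head.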

MedSeg-R: Medical Image Segmentation with Clinical Reasoning

Hao Shao, Qibin Hou

arXiv preprint · Jun 23, 2025
Medical image segmentation is challenging due to overlapping anatomies with ambiguous boundaries and a severe imbalance between the foreground and background classes, which particularly affects the delineation of small lesions. Existing methods, including encoder-decoder networks and prompt-driven variants of the Segment Anything Model (SAM), rely heavily on local cues or user prompts and lack integrated semantic priors, thus failing to generalize well to low-contrast or overlapping targets. To address these issues, we propose MedSeg-R, a lightweight, dual-stage framework inspired by clinical reasoning. Its cognitive stage interprets the medical report into structured semantic priors (location, texture, shape), which are fused via a transformer block. In the perceptual stage, these priors modulate the SAM backbone: spatial attention highlights likely lesion regions, dynamic convolution adapts feature filters to expected textures, and deformable sampling refines spatial support. By embedding this fine-grained guidance early, MedSeg-R disentangles inter-class confusion and amplifies minority-class cues, greatly improving sensitivity to small lesions. On challenging benchmarks, MedSeg-R produces large Dice improvements on overlapping and ambiguous structures, demonstrating plug-and-play compatibility with SAM-based systems.
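As a rough illustration of the perceptual stage's first ingredient, the sketch below shows how a text-derived prior vector can generate a spatial attention map that re-weights a feature grid. The module and its dimensions are hypothetical, and the dynamic convolution and deformable sampling components are omitted:

```python
import torch
import torch.nn as nn

class PriorGuidedAttention(nn.Module):
    """Modulate image features with a text-derived semantic prior: the prior
    produces a spatial attention map that re-weights the feature grid."""

    def __init__(self, feat_dim=256, prior_dim=512):
        super().__init__()
        self.to_query = nn.Linear(prior_dim, feat_dim)

    def forward(self, feats, prior):        # feats: (B, C, H, W), prior: (B, P)
        q = self.to_query(prior)            # (B, C)
        # Dot product of the prior query with every spatial location.
        logits = torch.einsum("bc,bchw->bhw", q, feats)
        attn = torch.sigmoid(logits).unsqueeze(1)   # (B, 1, H, W)
        return feats * attn                 # lesion-likely regions amplified

feats = torch.randn(2, 256, 64, 64)
prior = torch.randn(2, 512)   # e.g., encoded location/texture/shape priors
print(PriorGuidedAttention()(feats, prior).shape)  # torch.Size([2, 256, 64, 64])
```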

Development and validation of a SOTA-based system for biliopancreatic segmentation and station recognition in EUS.

Zhang J, Zhang J, Chen H, Tian F, Zhang Y, Zhou Y, Jiang Z

PubMed paper · Jun 23, 2025
Endoscopic ultrasound (EUS) is a vital tool for diagnosing biliopancreatic disease, offering detailed imaging to identify key abnormalities. Its interpretation demands expertise, which limits its accessibility for less trained practitioners. Thus, the creation of tools or systems to assist in interpreting EUS images is crucial for improving diagnostic accuracy and efficiency. We aimed to develop an AI-assisted EUS system for accurate pancreatic and biliopancreatic duct segmentation and to evaluate its impact on endoscopists' ability to identify biliary-pancreatic diseases during segmentation and anatomical localization. The EUS-AI system was designed to perform station positioning and anatomical structure segmentation. A total of 45,737 EUS images from 1852 patients were used for model training; 2881 images were reserved for internal testing, and 2747 images from 208 patients were used for external validation. An additional 340 images formed a man-machine competition test set. Several state-of-the-art (SOTA) deep learning algorithms were compared during development. For the station recognition task, the Mean Teacher algorithm achieved the highest accuracy compared with ResNet-50 and YOLOv8-CLS, averaging 95.60% (92.07%-99.12%) on the internal test set and 92.72% (88.30%-97.15%) on the external test set. For segmentation, the U-Net v2 algorithm was optimal compared with UNet++ and YOLOv8. The EUS-AI system was then constructed from the optimal model for each task, and a man-machine competition experiment was conducted. The EUS-AI system significantly outperformed mid-level endoscopists in both station recognition (p < 0.001) and pancreas and biliopancreatic duct segmentation (p < 0.001, p = 0.004). The EUS-AI system is expected to significantly shorten the learning curve for pancreatic EUS examination and enhance procedural standardization.
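The Mean Teacher algorithm that won the station-recognition comparison is a semi-supervised scheme in which a slowly updated copy of the model supervises the student on unlabeled images. A minimal PyTorch sketch of its two core pieces, the EMA weight update and the consistency loss, with a tiny linear model as a stand-in for the station classifier:

```python
import copy
import torch

def ema_update(student, teacher, alpha=0.99):
    """Mean Teacher: the teacher is an exponential moving average of the
    student's weights; its predictions supervise unlabeled images."""
    with torch.no_grad():
        for ps, pt in zip(student.parameters(), teacher.parameters()):
            pt.mul_(alpha).add_(ps, alpha=1 - alpha)

student = torch.nn.Linear(16, 4)      # stand-in for the station classifier
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)           # teacher is never trained directly

# One step's worth of the semi-supervised consistency signal:
x_unlabeled = torch.randn(8, 16)
consistency = torch.nn.functional.mse_loss(
    student(x_unlabeled).softmax(-1), teacher(x_unlabeled).softmax(-1))
ema_update(student, teacher)
print(consistency.item())
```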

Chest X-ray Foundation Model with Global and Local Representations Integration.

Yang Z, Xu X, Zhang J, Wang G, Kalra MK, Yan P

PubMed paper · Jun 23, 2025
Chest X-ray (CXR) is the most frequently ordered imaging test, supporting diverse clinical tasks from thoracic disease detection to postoperative monitoring. However, task-specific classification models are limited in scope, require costly labeled data, and lack generalizability to out-of-distribution datasets. To address these challenges, we introduce CheXFound, a self-supervised vision foundation model that learns robust CXR representations and generalizes effectively across a wide range of downstream tasks. We pretrained CheXFound on a curated CXR-987K dataset comprising approximately 987K unique CXRs from 12 publicly available sources. We propose a Global and Local Representations Integration (GLoRI) head for downstream adaptation, which incorporates fine- and coarse-grained disease-specific local features with global image features for enhanced performance in multilabel classification. Our experimental results showed that CheXFound outperformed state-of-the-art models in classifying 40 disease findings across different prevalence levels on the CXR-LT 24 dataset and exhibited superior label efficiency on downstream tasks with limited training data. Additionally, CheXFound achieved significant improvements on downstream tasks with out-of-distribution datasets, including opportunistic cardiovascular disease risk estimation, mortality prediction, malpositioned tube detection, and anatomical structure segmentation. These results demonstrate CheXFound's strong generalization capabilities, which should enable diverse downstream adaptations with improved label efficiency. The project source code is publicly available at https://github.com/RPIDIAL/CheXFound.
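The GLoRI idea of combining disease-specific local features with global image features can be illustrated with a generic multilabel head. The PyTorch sketch below is a simplified assumption, not the released CheXFound code; per-finding queries attending over patch tokens stand in for the paper's local-feature mechanism:

```python
import torch
import torch.nn as nn

class GlobalLocalHead(nn.Module):
    """Fuse a global image token with patch-level local features for
    multilabel findings (a simplified take on the GLoRI idea)."""

    def __init__(self, dim=768, num_findings=40):
        super().__init__()
        # One learnable query per finding attends to local patch tokens.
        self.finding_queries = nn.Parameter(torch.randn(num_findings, dim))
        self.cross_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.classifier = nn.Linear(2 * dim, 1)

    def forward(self, global_tok, patch_toks):   # (B, D), (B, N, D)
        b = patch_toks.size(0)
        q = self.finding_queries.unsqueeze(0).expand(b, -1, -1)  # (B, F, D)
        local, _ = self.cross_attn(q, patch_toks, patch_toks)    # (B, F, D)
        g = global_tok.unsqueeze(1).expand_as(local)             # (B, F, D)
        return self.classifier(torch.cat([local, g], -1)).squeeze(-1)

head = GlobalLocalHead()
logits = head(torch.randn(2, 768), torch.randn(2, 196, 768))
print(logits.shape)  # torch.Size([2, 40]), one logit per finding
```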

MOSCARD -- Causal Reasoning and De-confounding for Multimodal Opportunistic Screening of Cardiovascular Adverse Events

Jialu Pi, Juan Maria Farina, Rimita Lahiri, Jiwoong Jeong, Archana Gurudu, Hyung-Bok Park, Chieh-Ju Chao, Chadi Ayoub, Reza Arsanjani, Imon Banerjee

arXiv preprint · Jun 23, 2025
Major Adverse Cardiovascular Events (MACE) remain the leading cause of mortality globally, as reported in the Global Disease Burden Study 2021. Opportunistic screening leverages data collected during routine health check-ups, and multimodal data can play a key role in identifying at-risk individuals. Chest X-rays (CXR) provide insights into chronic conditions contributing to MACE, while the 12-lead electrocardiogram (ECG) directly assesses cardiac electrical activity and structural abnormalities. Integrating CXR and ECG could offer a more comprehensive risk assessment than conventional models, which rely on clinical scores, computed tomography (CT) measurements, or biomarkers and may be limited by sampling bias and single-modality constraints. We propose a novel predictive modeling framework, MOSCARD, which performs multimodal causal reasoning with co-attention to align the two modalities and simultaneously mitigate bias and confounders in opportunistic risk estimation. The primary technical contributions are (i) multimodal alignment of CXR with ECG guidance, (ii) integration of causal reasoning, and (iii) a dual back-propagation graph for de-confounding. Evaluated on internal data, distribution-shifted emergency department (ED) data, and the external MIMIC dataset, our model outperformed single-modality and state-of-the-art foundation models (AUC: 0.75, 0.83, and 0.71, respectively). The proposed cost-effective opportunistic screening enables early intervention, improving patient outcomes and reducing disparities.
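Co-attention between two token streams is a standard construction; the sketch below shows one plausible reading of the CXR/ECG alignment step. The module names, pooling, and dimensions are illustrative, and the causal-reasoning and de-confounding components are not modeled here:

```python
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    """Bidirectional cross-attention between CXR and ECG token sequences,
    a common way to realize co-attention alignment across modalities."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cxr_from_ecg = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ecg_from_cxr = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, 1)   # MACE risk logit

    def forward(self, cxr, ecg):            # (B, Nc, D), (B, Ne, D)
        cxr_attended, _ = self.cxr_from_ecg(cxr, ecg, ecg)  # ECG guides CXR
        ecg_attended, _ = self.ecg_from_cxr(ecg, cxr, cxr)  # CXR guides ECG
        pooled = torch.cat([cxr_attended.mean(1), ecg_attended.mean(1)], -1)
        return self.head(pooled)            # (B, 1)

risk = CoAttentionFusion()(torch.randn(2, 196, 256), torch.randn(2, 120, 256))
print(risk.shape)  # torch.Size([2, 1])
```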

GPT-4o and Specialized AI in Breast Ultrasound Imaging: A Comparative Study on Accuracy, Agreement, Limitations, and Diagnostic Potential.

Sanli DET, Sanli AN, Buyukdereli Atadag Y, Kurt A, Esmerer E

PubMed paper · Jun 23, 2025
This study aimed to evaluate the ability of ChatGPT and Breast Ultrasound Helper, a ChatGPT-based subprogram trained on ultrasound image analysis, to analyze and differentiate benign and malignant breast lesions on ultrasound images. Ultrasound images of histopathologically confirmed breast cancer and fibroadenoma patients were read by GPT-4o (the latest ChatGPT version) and by Breast Ultrasound Helper (BUH), a tool from the "Explore" section of ChatGPT. Both were prompted in English using ACR BI-RADS Breast Ultrasound Lexicon criteria: lesion shape, orientation, margin, internal echo pattern, echogenicity, posterior acoustic features, microcalcifications or hyperechoic foci, perilesional hyperechoic rim, edema or architectural distortion, lesion size, and BI-RADS category. Two experienced radiologists evaluated the images and the programs' responses in consensus. The outputs, BI-RADS category agreement, and benign/malignant discrimination were statistically compared. A total of 232 ultrasound images were analyzed, of which 133 (57.3%) were malignant and 99 (42.7%) benign. In the comparative analysis, BUH showed superior performance overall, with higher kappa values and statistically significant results across multiple features (P < .001). However, the overall level of agreement with the radiologists' consensus across all features was similar for BUH (κ: 0.387-0.755) and GPT-4o (κ: 0.317-0.803). On the other hand, BI-RADS category agreement was slightly higher for GPT-4o than for BUH (69.4% versus 65.9%), whereas BUH was slightly more successful at distinguishing benign from malignant lesions (65.9% versus 67.7%). Although both AI tools show moderate-to-good performance in ultrasound image analysis, their limited agreement with radiologists' evaluations and BI-RADS categorization suggests that their clinical application in breast ultrasound interpretation remains premature and unreliable.
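Agreement statistics like the kappa values reported here are one-liners once the categorical calls are collected; a minimal scikit-learn sketch on hypothetical BI-RADS assignments:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical BI-RADS category calls for ten lesions (not study data).
radiologists = ["4A", "5", "3", "4B", "5", "4C", "3", "4A", "5", "4B"]
model_calls  = ["4A", "5", "4A", "4B", "5", "4B", "3", "4A", "4C", "4B"]

kappa = cohen_kappa_score(radiologists, model_calls)
print(f"Cohen's kappa = {kappa:.3f}")  # chance-corrected agreement statistic
```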
