Page 14 of 91901 results

SAM-Med3D: A Vision Foundation Model for General-Purpose Segmentation on Volumetric Medical Images.

Wang H, Guo S, Ye J, Deng Z, Cheng J, Li T, Chen J, Su Y, Huang Z, Shen Y, Fu B, Zhang S, He J

PubMed · Jul 31, 2025
Existing volumetric medical image segmentation models are typically task-specific, excelling at specific targets but struggling to generalize across anatomical structures or modalities. This limitation restricts their broader clinical use. In this article, we introduce segment anything model (SAM)-Med3D, a vision foundation model (VFM) for general-purpose segmentation on volumetric medical images. Given only a few 3-D prompt points, SAM-Med3D can accurately segment diverse anatomical structures and lesions across various modalities. To achieve this, we gather and preprocess a large-scale 3-D medical image segmentation dataset, SA-Med3D-140K, from 70 public datasets and 8K licensed private cases from hospitals. This dataset includes 22K 3-D images and 143K corresponding masks. SAM-Med3D, a promptable segmentation model characterized by its fully learnable 3-D structure, is trained on this dataset using a two-stage procedure and exhibits impressive performance on both seen and unseen segmentation targets. We comprehensively evaluate SAM-Med3D on 16 datasets covering diverse medical scenarios, including different anatomical structures, modalities, targets, and zero-shot transferability to new/unseen tasks. The evaluation demonstrates the efficiency and efficacy of SAM-Med3D, as well as its promising application to diverse downstream tasks as a pretrained model. Our approach illustrates that substantial medical resources can be harnessed to develop a general-purpose medical AI for various potential applications. Our dataset, code, and models are available at: https://github.com/uni-medical/SAM-Med3D.

An interpretable CT-based machine learning model for predicting recurrence risk in stage II colorectal cancer.

Wu Z, Gong L, Luo J, Chen X, Yang F, Wen J, Hao Y, Wang Z, Gu R, Zhang Y, Liao H, Wen G

PubMed · Jul 31, 2025
This study aimed to develop an interpretable 3-year disease-free survival risk prediction tool to stratify patients with stage II colorectal cancer (CRC) by integrating CT images and clinicopathological factors. A total of 769 patients with pathologically confirmed stage II CRC and disease-free survival (DFS) follow-up information were recruited from three medical centers and divided into training (n = 442), test (n = 190), and validation cohorts (n = 137). CT-based tumor radiomics features were extracted, selected, and used to calculate a Radscore. A combined model was developed using an artificial neural network (ANN) algorithm, integrating the Radscore with significant clinicoradiological factors to classify patients into high- and low-risk groups. Model performance was assessed using the area under the curve (AUC), and feature contributions were quantified using the Shapley additive explanations (SHAP) algorithm. Kaplan-Meier survival analysis revealed the prognostic stratification value of the risk groups. Fourteen radiomics features and five clinicoradiological factors were selected to construct the radiomics and clinicoradiological models, respectively. The combined model demonstrated optimal performance, with AUCs of 0.811 and 0.846 in the test and validation cohorts, respectively. Kaplan-Meier curves confirmed effective patient stratification (p < 0.001) in both test and validation cohorts. A high Radscore, rough intestinal outer edge, and advanced age were identified as key prognostic risk factors using SHAP. The combined model effectively stratified patients with stage II CRC into different prognostic risk groups, aiding clinical decision-making. Integrating CT images with clinicopathological information can facilitate the identification of patients with stage II CRC who are most likely to benefit from adjuvant chemotherapy. The effectiveness of adjuvant chemotherapy for stage II colorectal cancer remains debated.
A combined model successfully identified high-risk stage II colorectal cancer patients. Shapley additive explanations enhance the interpretability of the model's predictions.
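The Radscore construction and risk dichotomization described in this abstract can be sketched in a few lines. The weights, intercept, threshold, and patient values below are illustrative assumptions, not figures from the study (which selected 14 features via its own pipeline):

```python
def radscore(features, weights, intercept=0.0):
    """Linear combination of selected radiomics features, a common
    Radscore construction; real weights come from a feature-selection
    step (e.g. LASSO), which this sketch does not perform."""
    return intercept + sum(w * f for w, f in zip(weights, features))

def risk_group(score, threshold=0.0):
    """Dichotomize patients into high-/low-risk groups by a Radscore cutoff."""
    return "high" if score > threshold else "low"

# Hypothetical patient with three selected features (toy values)
features = [1.2, -0.4, 0.9]
weights = [0.8, 0.5, -0.3]
s = radscore(features, weights, intercept=-0.2)
print(round(s, 3), risk_group(s))  # 0.29 high
```

In the study the risk groups feed a Kaplan-Meier analysis; here the cutoff of 0 simply splits scores by sign.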

Prognostication in patients with idiopathic pulmonary fibrosis using quantitative airway analysis from HRCT: a retrospective study.

Nan Y, Federico FN, Humphries S, Mackintosh JA, Grainge C, Jo HE, Goh N, Reynolds PN, Hopkins PMA, Navaratnam V, Moodley Y, Walters H, Ellis S, Keir G, Zappala C, Corte T, Glaspole I, Wells AU, Yang G, Walsh SL

PubMed · Jul 31, 2025
Predicting shorter life expectancy is crucial for prioritizing antifibrotic therapy in fibrotic lung diseases, where progression varies widely, from stability to rapid deterioration. This heterogeneity complicates treatment decisions, emphasizing the need for reliable baseline measures. This study leverages an artificial intelligence model to address heterogeneity in disease outcomes, with mortality as the ultimate measure of disease trajectory. This retrospective study included 1744 anonymised patients who underwent high-resolution CT scanning. The AI model, SABRE (Smart Airway Biomarker Recognition Engine), was developed using data from patients with various lung diseases (n=460, including lung cancer, pneumonia, emphysema, and fibrosis). Then, 1284 high-resolution CT scans with evidence of diffuse fibrotic lung disease (FLD) from the Australian IPF Registry and OSIC were used for clinical analyses. Airway branches were categorized and quantified by anatomic structure and volume, followed by multivariable analysis to explore the associations between these categories and patients' progression and mortality, adjusting for disease severity or traditional measurements. Cox regression identified SABRE-based variables as independent predictors of mortality and progression, even after adjusting for disease severity (fibrosis extent, traction bronchiectasis extent, and ILD extent), traditional measures (FVC%, DLCO%, and CPI), and previously reported deep learning algorithms for fibrosis quantification and morphological analysis. Combining SABRE with DLCO significantly improved prognostic utility, yielding an AUC of 0.852 at the first year and a C-index of 0.752. SABRE-based variables capture prognostic signals beyond those provided by traditional measurements, disease severity scores, and established AI-based methods, reflecting the progressiveness and pathogenesis of the disease.
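The C-index reported here (0.752) is Harrell's concordance index. A dependency-free sketch of how it is computed from predicted risks and censored survival data (toy numbers, not study data):

```python
def c_index(times, events, risks):
    """Harrell's concordance index: the fraction of comparable pairs in
    which the patient with the higher predicted risk has the shorter
    survival. A pair is comparable when the earlier of the two times
    corresponds to an observed event (event = 1), not a censoring."""
    concordant = tied = comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:  # patient i died first
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    tied += 1
    return (concordant + 0.5 * tied) / comparable

# Toy cohort: times in months; event = 1 means death observed, 0 censored
times  = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risks  = [0.9, 0.7, 0.4, 0.2]
print(c_index(times, events, risks))  # 1.0: risk ordering matches survival ordering
```

A C-index of 0.5 corresponds to random ranking, 1.0 to perfect discrimination; production survival analyses typically use a library implementation rather than this O(n²) loop.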

Identification and validation of an explainable machine learning model for vascular depression diagnosis in the older adults: a multicenter cohort study.

Zhang R, Li T, Fan F, He H, Lan L, Sun D, Xu Z, Peng S, Cao J, Xu J, Peng X, Lei M, Song H, Zhang J

PubMed · Jul 31, 2025
Vascular depression (VaDep) is a prevalent affective disorder in older adults that significantly impacts functional status and quality of life. Early identification and intervention are crucial but largely insufficient in clinical practice, mostly due to inconspicuous depressive symptoms, heterogeneous imaging manifestations, and the lack of definitive peripheral biomarkers. This study aimed to develop and validate an interpretable machine learning (ML) model for VaDep to serve as a clinical support tool. This study included 602 participants from Wuhan, China, divided into 236 VaDep patients and 366 controls for training and internal validation from July 2020 to October 2023. An independent dataset of 171 participants from surrounding areas was used for external validation. We collected clinical data, neuropsychological assessments, blood test results, and MRI scans to develop and refine ML models through cross-validation. Feature reduction was implemented to simplify the models without compromising their performance, with validation achieved through internal and external datasets. The SHapley Additive exPlanations (SHAP) method was used to enhance model interpretability. The Light Gradient Boosting Machine (LGBM) model performed best among the 6 selected ML algorithms based on performance metrics. An optimized, interpretable LGBM model with 8 key features, including white matter hyperintensities score, age, vascular endothelial growth factor, interleukin-6, brain-derived neurotrophic factor, tumor necrosis factor-alpha levels, lacune counts, and serotonin level, demonstrated high diagnostic accuracy in both internal (AUROC = 0.937) and external (AUROC = 0.896) validations. The final model also matched, and marginally exceeded, clinician-level diagnostic performance. Our research established a consistent and explainable ML framework for identifying VaDep in older adults, utilizing comprehensive clinical data.
The 8 characteristics identified in the final LGBM model provide new insights for further exploration of VaDep mechanisms and emphasize the need for enhanced focus on early identification and intervention in this vulnerable group. More attention needs to be paid to the affective health of older adults.
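Shapley values, which underpin the SHAP explanations used in this study, can be computed exactly by enumerating feature coalitions when the feature count is small (8 here). A minimal sketch with a hypothetical two-feature linear risk model; in practice one would use a SHAP library's tree explainer for an LGBM model rather than brute-force enumeration:

```python
from itertools import combinations
from math import factorial

def shapley_values(model, baseline, x):
    """Exact Shapley values by enumerating feature coalitions; features
    absent from a coalition are held at their baseline value. Feasible
    only for a handful of features (cost grows as 2^n)."""
    n = len(x)
    phi = [0.0] * n
    idx = list(range(n))
    for i in idx:
        others = [j for j in idx if j != i]
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                with_i = [x[j] if j in S or j == i else baseline[j] for j in idx]
                without = [x[j] if j in S else baseline[j] for j in idx]
                phi[i] += w * (model(with_i) - model(without))
    return phi

# Toy risk model over (age, WMH score); for a linear model the Shapley
# value reduces to weight * (feature - baseline)
model = lambda v: 0.02 * v[0] + 0.5 * v[1]
phi = shapley_values(model, baseline=[60, 1.0], x=[75, 3.0])
print([round(p, 3) for p in phi])  # [0.3, 1.0]
```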

Out-of-Distribution Detection in Medical Imaging via Diffusion Trajectories

Lemar Abdi, Francisco Caetano, Amaan Valiuddin, Christiaan Viviers, Hamdi Joudeh, Fons van der Sommen

arXiv preprint · Jul 31, 2025
In medical imaging, unsupervised out-of-distribution (OOD) detection offers an attractive approach for identifying pathological cases with extremely low incidence rates. In contrast to supervised methods, OOD-based approaches function without labels and are inherently robust to data imbalances. Current generative approaches often rely on likelihood estimation or reconstruction error, but these methods can be computationally expensive, unreliable, and require retraining if the inlier data changes. These limitations hinder their ability to distinguish nominal from anomalous inputs efficiently, consistently, and robustly. We propose a reconstruction-free OOD detection method that leverages the forward diffusion trajectories of a Stein score-based denoising diffusion model (SBDDM). By capturing trajectory curvature via the estimated Stein score, our approach enables accurate anomaly scoring with only five diffusion steps. A single SBDDM pre-trained on a large, semantically aligned medical dataset generalizes effectively across multiple Near-OOD and Far-OOD benchmarks, achieving state-of-the-art performance while drastically reducing computational cost during inference. Compared to existing methods, SBDDM achieves relative improvements of up to 10.43% and 18.10% for Near-OOD and Far-OOD detection, respectively, making it a practical building block for real-time, reliable computer-aided diagnosis.
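The idea of scoring anomalies from score magnitudes along a short forward trajectory can be illustrated with a 1-D toy: here the learned SBDDM is replaced by the analytic score of a standard-normal inlier model, and the five noise levels are arbitrary choices, so this is a conceptual sketch rather than the paper's method:

```python
def stein_score(x, sigma):
    """Score of a standard-normal inlier density blurred to noise level
    sigma: s(x) = -x / (1 + sigma^2). Stands in for the learned SBDDM."""
    return -x / (1.0 + sigma ** 2)

def anomaly_score(x0, sigmas):
    """Accumulate score magnitudes along a short forward trajectory
    (five steps, as in the paper); inputs far from the data manifold
    keep large scores across noise levels."""
    total = 0.0
    x = x0
    for s in sigmas:
        g = stein_score(x, s)
        total += abs(g)
        x = x + s * g  # deterministic drift toward the mode (no sampled noise)
    return total

sigmas = [0.1, 0.2, 0.4, 0.8, 1.6]  # illustrative noise schedule
print(anomaly_score(0.1, sigmas) < anomaly_score(5.0, sigmas))  # True
```

The inlier (0.1) accumulates a much smaller score than the outlier (5.0), which is the separation the anomaly score exploits.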

Consistent Point Matching

Halid Ziya Yerebakan, Gerardo Hermosillo Valadez

arXiv preprint · Jul 31, 2025
This study demonstrates that incorporating a consistency heuristic into the point-matching algorithm of Yerebakan et al. (2023) improves robustness in matching anatomical locations across pairs of medical images. We validated our approach on diverse longitudinal internal and public datasets spanning CT and MRI modalities. Notably, it surpasses state-of-the-art results on the Deep Lesion Tracking dataset. Additionally, we show that the method effectively addresses landmark localization. The algorithm operates efficiently on standard CPU hardware and allows configurable trade-offs between speed and robustness. The method enables high-precision navigation between medical images without requiring a machine learning model or training data.
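One plausible reading of a consistency heuristic is a forward-backward check: match a point from image A into image B, match back, and accept only if the round trip returns near the start. The nearest-point matcher below is a stand-in for the hierarchical matching of the underlying algorithm, so treat this as an illustration of the heuristic, not the paper's implementation:

```python
def nearest(point, candidates):
    """Index of the nearest candidate point (a stand-in for the
    hierarchical descriptor matcher used by the actual algorithm)."""
    return min(range(len(candidates)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(point, candidates[i])))

def consistent_match(p, pts_a, pts_b, tol=1.0):
    """Forward-backward consistency: match p from image A into image B,
    match that result back into A, and accept only if the round trip
    lands within tol of the starting point."""
    j = nearest(p, pts_b)                    # forward match A -> B
    back = pts_a[nearest(pts_b[j], pts_a)]   # backward match B -> A
    err = sum((a - b) ** 2 for a, b in zip(p, back)) ** 0.5
    return (pts_b[j], err) if err <= tol else (None, err)

# Toy 2-D landmarks in two images
pts_a = [(0, 0), (10, 10)]
pts_b = [(0.5, 0.2), (9.8, 10.1)]
match, err = consistent_match((0, 0), pts_a, pts_b)
print(match)  # (0.5, 0.2)
```

Rejecting inconsistent round trips filters out ambiguous matches, which is how such a heuristic buys robustness at modest extra cost.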

Effectiveness of Radiomics-Based Machine Learning Models in Differentiating Pancreatitis and Pancreatic Ductal Adenocarcinoma: Systematic Review and Meta-Analysis.

Zhang L, Li D, Su T, Xiao T, Zhao S

PubMed · Jul 31, 2025
Pancreatic ductal adenocarcinoma (PDAC) and mass-forming pancreatitis (MFP) share similar clinical, laboratory, and imaging features, making accurate diagnosis challenging. Nevertheless, PDAC is highly malignant with a poor prognosis, whereas MFP is an inflammatory condition typically responding well to medical or interventional therapies. Some investigators have explored radiomics-based machine learning (ML) models for distinguishing PDAC from MFP. However, systematic evidence supporting the feasibility of these models is insufficient, presenting a notable challenge for clinical application. This study aimed to review the diagnostic performance of radiomics-based ML models in differentiating PDAC from MFP, summarize the methodological quality of the included studies, and provide evidence-based guidance for optimizing radiomics-based ML models and advancing their clinical use. PubMed, Embase, Cochrane, and Web of Science were searched for relevant studies up to June 29, 2024. Eligible studies comprised English cohort, case-control, or cross-sectional designs that applied fully developed radiomics-based ML models, including traditional and deep radiomics, to differentiate PDAC from MFP, while also reporting their diagnostic performance. Studies without full text, limited to image segmentation, or with insufficient outcome metrics were excluded. Methodological quality was appraised by means of the radiomics quality score. Given the limited applicability of QUADAS-2 to radiomics-based ML studies, the risk of bias was not formally assessed. Pooled sensitivity, specificity, area under the curve of summary receiver operating characteristics (SROC), likelihood ratios, and diagnostic odds ratio were estimated through a bivariate mixed-effects model. Results were presented with forest plots, SROC curves, and Fagan's nomogram.
Subgroup analysis was performed to appraise the diagnostic performance of radiomics-based ML models across various imaging modalities, including computed tomography (CT), magnetic resonance imaging, positron emission tomography-CT, and endoscopic ultrasound. This meta-analysis included 24 studies with 14,406 cases, including 7635 PDAC cases. All studies adopted a case-control design, with 5 conducted across multiple centers. Most studies used CT as the primary imaging modality. Radiomics quality scores ranged from 5 points (14%) to 17 points (47%), with an average score of 9 (25%). The radiomics-based ML models demonstrated high diagnostic performance. Based on the independent validation sets, the pooled sensitivity, specificity, area under the curve of SROC, positive likelihood ratio, negative likelihood ratio, and diagnostic odds ratio were 0.92 (95% CI 0.91-0.94), 0.90 (95% CI 0.85-0.94), 0.94 (95% CI 0.74-0.99), 9.3 (95% CI 6.0-14.2), 0.08 (95% CI 0.07-0.11), and 110 (95% CI 62-194), respectively. Radiomics-based ML models demonstrate high diagnostic accuracy in differentiating PDAC from MFP, underscoring their potential as noninvasive tools for clinical decision-making. Nonetheless, the overall methodological quality was moderate due to limitations in external validation, standardized protocols, and reproducibility. These findings support the promise of radiomics in clinical diagnostics while highlighting the need for more rigorous, multicenter research to enhance model generalizability and clinical applicability.
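The likelihood ratios and diagnostic odds ratio reported above are simple functions of sensitivity and specificity. A quick sketch of those definitions; note the abstract's pooled values come from a bivariate mixed-effects model, so the naive point calculation from the pooled sensitivity and specificity differs slightly (LR+ of 9.2 here vs the pooled 9.3):

```python
def likelihood_ratios(sens, spec):
    """Positive and negative likelihood ratios from sensitivity and
    specificity: LR+ = sens / (1 - spec), LR- = (1 - sens) / spec."""
    return sens / (1 - spec), (1 - sens) / spec

def diagnostic_odds_ratio(sens, spec):
    """Diagnostic odds ratio: DOR = LR+ / LR-."""
    lr_pos, lr_neg = likelihood_ratios(sens, spec)
    return lr_pos / lr_neg

# Pooled sensitivity 0.92 and specificity 0.90 from the meta-analysis
lr_pos, lr_neg = likelihood_ratios(0.92, 0.90)
print(round(lr_pos, 1), round(lr_neg, 2), round(diagnostic_odds_ratio(0.92, 0.90), 1))
# 9.2 0.09 103.5
```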

External Validation of a Winning Artificial Intelligence Algorithm from the RSNA 2022 Cervical Spine Fracture Detection Challenge.

Harper JP, Lee GR, Pan I, Nguyen XV, Quails N, Prevedello LM

PubMed · Jul 31, 2025
The Radiological Society of North America has actively promoted artificial intelligence (AI) challenges since 2017. Algorithms emerging from the recent RSNA 2022 Cervical Spine Fracture Detection Challenge demonstrated state-of-the-art performance on the competition's data set, surpassing results from prior publications. However, their performance in real-world clinical practice is not known. As an initial step toward the goal of assessing feasibility of these models in clinical practice, we conducted a generalizability test using one of the leading algorithms of the competition. The deep learning algorithm was selected due to its performance, portability, and ease of use, and installed locally. One hundred examinations (50 consecutive cervical spine CT scans with at least 1 fracture present and 50 consecutive negative CT scans) from a level 1 trauma center not represented in the competition data set were processed at 6.4 seconds per examination. Ground truth was established based on the radiology report with retrospective confirmation of positive fracture cases. Sensitivity, specificity, F1 score, and area under the curve were calculated. The external validation data set comprised older patients than the competition set (58 ± 22.0 years versus 53.5 ± 21.8 years, respectively; P < .05). Sensitivity and specificity were 86% and 70% in the external validation group and 85% and 94% in the competition group, respectively. Fractures misclassified by the convolutional neural networks frequently had features of advanced degenerative disease, subtle nondisplaced fractures not easily identified on the axial plane, and malalignment. The model performed with a similar sensitivity on the test and external data set, suggesting that such a tool could be potentially generalizable as a triage tool in the emergency setting. Discordant factors such as age-associated comorbidities may affect accuracy and specificity of AI models when used in certain populations.
Further research should be encouraged to help elucidate the potential contributions and pitfalls of these algorithms in supporting clinical care.
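The external-validation metrics reported above follow directly from a binary confusion matrix. A sketch using counts consistent with the reported figures (50 fracture cases and 50 negatives at 86% sensitivity and 70% specificity implies TP=43, FN=7, TN=35, FP=15; the F1 value is derived here, not quoted from the paper):

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, and F1 score from a binary confusion matrix."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sens / (precision + sens)
    return sens, spec, f1

# Counts consistent with the reported external validation cohort
sens, spec, f1 = diagnostic_metrics(tp=43, fn=7, tn=35, fp=15)
print(sens, spec, round(f1, 3))  # 0.86 0.7 0.796
```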

CX-Mind: A Pioneering Multimodal Large Language Model for Interleaved Reasoning in Chest X-ray via Curriculum-Guided Reinforcement Learning

Wenjie Li, Yujie Zhang, Haoran Sun, Yueqi Li, Fanrui Zhang, Mengzhe Xu, Victoria Borja Clausich, Sade Mellin, Renhao Yang, Chenrun Wang, Jethro Zih-Shuo Wang, Shiyi Yao, Gen Li, Yidong Xu, Hanyu Wang, Yilin Huang, Angela Lin Wang, Chen Shi, Yin Zhang, Jianan Guo, Luqi Yang, Renxuan Li, Yang Xu, Jiawei Liu, Yao Zhang, Lei Liu, Carlos Gutiérrez SanRomán, Lei Wang

arXiv preprint · Jul 31, 2025
Chest X-ray (CXR) imaging is one of the most widely used diagnostic modalities in clinical practice, encompassing a broad spectrum of diagnostic tasks. Recent advancements have seen the extensive application of reasoning-based multimodal large language models (MLLMs) in medical imaging to enhance diagnostic efficiency and interpretability. However, existing multimodal models predominantly rely on "one-time" diagnostic approaches, lacking verifiable supervision of the reasoning process. This leads to challenges in multi-task CXR diagnosis, including lengthy reasoning, sparse rewards, and frequent hallucinations. To address these issues, we propose CX-Mind, the first generative model to achieve interleaved "think-answer" reasoning for CXR tasks, driven by curriculum-based reinforcement learning and verifiable process rewards (CuRL-VPR). Specifically, we constructed an instruction-tuning dataset, CX-Set, comprising 708,473 images and 2,619,148 samples, and generated 42,828 high-quality interleaved reasoning data points supervised by clinical reports. Optimization was conducted in two stages under the Group Relative Policy Optimization framework: initially stabilizing basic reasoning with closed-domain tasks, followed by transfer to open-domain diagnostics, incorporating rule-based conditional process rewards to bypass the need for pretrained reward models. Extensive experimental results demonstrate that CX-Mind significantly outperforms existing medical and general-domain MLLMs in visual understanding, text generation, and spatiotemporal alignment, achieving an average performance improvement of 25.1% over comparable CXR-specific models. On a real-world clinical dataset (Rui-CXR), CX-Mind achieves a mean recall@1 across 14 diseases that substantially surpasses the second-best results, with multi-center expert evaluations further confirming its clinical utility across multiple dimensions.

Adaptively Distilled ControlNet: Accelerated Training and Superior Sampling for Medical Image Synthesis

Kunpeng Qiu, Zhiying Zhou, Yongxin Guo

arXiv preprint · Jul 31, 2025
Medical image annotation is constrained by privacy concerns and labor-intensive labeling, significantly limiting the performance and generalization of segmentation models. While mask-controllable diffusion models excel in synthesis, they struggle with precise lesion-mask alignment. We propose Adaptively Distilled ControlNet, a task-agnostic framework that accelerates training and optimization through dual-model distillation. Specifically, during training, a teacher model, conditioned on mask-image pairs, regularizes a mask-only student model via predicted noise alignment in parameter space, further enhanced by adaptive regularization based on lesion-background ratios. During sampling, only the student model is used, enabling privacy-preserving medical image generation. Comprehensive evaluations on two distinct medical datasets demonstrate state-of-the-art performance: TransUNet improves mDice/mIoU by 2.4%/4.2% on KiTS19, while SANet achieves 2.6%/3.5% gains on Polyps, highlighting its effectiveness and superiority. Code is available at GitHub.