A self-supervised multimodal deep learning approach to differentiate post-radiotherapy progression from pseudoprogression in glioblastoma.

Gomaa A, Huang Y, Stephan P, Breininger K, Frey B, Dörfler A, Schnell O, Delev D, Coras R, Donaubauer AJ, Schmitter C, Stritzelberger J, Semrau S, Maier A, Bayer S, Schönecker S, Heiland DH, Hau P, Gaipl US, Bert C, Fietkau R, Schmidt MA, Putz F

PubMed · May 17, 2025
Accurate differentiation of pseudoprogression (PsP) from true progression (TP) following radiotherapy (RT) in glioblastoma patients is crucial for optimal treatment planning. However, this task remains challenging due to the overlapping imaging characteristics of PsP and TP. This study therefore proposes a multimodal deep-learning approach that utilizes complementary information from routine anatomical MR images, clinical parameters, and RT treatment planning information for improved predictive accuracy. The approach uses a self-supervised Vision Transformer (ViT) to encode multi-sequence MR brain volumes, effectively capturing both global and local context from the high-dimensional input. The encoder is trained in a self-supervised upstream task on unlabeled glioma MRI datasets from the open BraTS2021, UPenn-GBM, and UCSF-PDGM datasets (n = 2317 MRI studies) to generate compact, clinically relevant representations from FLAIR and T1 post-contrast sequences. These encoded MR inputs are then integrated with clinical data and RT treatment planning information through guided cross-modal attention, improving progression classification accuracy. The work was developed using two datasets from different centers: the Burdenko Glioblastoma Progression Dataset (n = 59) for training and validation, and the GlioCMV progression dataset from the University Hospital Erlangen (UKER) (n = 20) for testing. The proposed method achieved competitive performance, with an AUC of 75.3%, outperforming current state-of-the-art data-driven approaches. Importantly, the approach relies solely on readily available anatomical MRI sequences, clinical data, and RT treatment planning information, enhancing its clinical feasibility. It addresses the challenge of limited data availability for PsP and TP differentiation and could allow for improved clinical decision-making and optimized treatment plans for glioblastoma patients.
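
The guided cross-modal attention step can be pictured with a minimal PyTorch sketch in which the clinical and RT planning features act as the query over the ViT-encoded MR tokens. This is an illustrative assumption, not the paper's architecture: the abstract does not specify dimensions, head counts, or the exact fusion design, so `embed_dim`, `clinical_dim`, and the single-query formulation below are invented for clarity.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Toy cross-modal attention: clinical/RT features query ViT-encoded MR tokens."""
    def __init__(self, embed_dim=256, clinical_dim=16, num_heads=4):
        super().__init__()
        self.clinical_proj = nn.Linear(clinical_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(embed_dim, 2)  # PsP vs. TP logits

    def forward(self, mr_tokens, clinical):
        # mr_tokens: (B, N, embed_dim) token embeddings from the MR encoder
        # clinical:  (B, clinical_dim) clinical + RT planning features
        query = self.clinical_proj(clinical).unsqueeze(1)  # (B, 1, D)
        fused, _ = self.attn(query, mr_tokens, mr_tokens)  # clinical-guided attention
        return self.classifier(fused.squeeze(1))           # (B, 2)

# usage with random tensors standing in for real inputs
model = CrossModalFusion()
logits = model(torch.randn(4, 196, 256), torch.randn(4, 16))
```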

CorBenchX: Large-Scale Chest X-Ray Error Dataset and Vision-Language Model Benchmark for Report Error Correction

Jing Zou, Qingqiu Li, Chenyu Lian, Lihao Liu, Xiaohan Yan, Shujun Wang, Jing Qin

arXiv preprint · May 17, 2025
AI-driven models have shown great promise in detecting errors in radiology reports, yet the field lacks a unified benchmark for rigorous evaluation of error detection and subsequent correction. To address this gap, we introduce CorBenchX, a comprehensive suite for automated error detection and correction in chest X-ray reports, designed to advance AI-assisted quality control in clinical practice. We first synthesize a large-scale dataset of 26,326 chest X-ray error reports by injecting clinically common errors via prompting DeepSeek-R1, with each corrupted report paired with its original text, error type, and a human-readable description. Leveraging this dataset, we benchmark both open- and closed-source vision-language models (e.g., InternVL, Qwen-VL, GPT-4o, o4-mini, and Claude-3.7) for error detection and correction under zero-shot prompting. Among these models, o4-mini achieves the best performance, with 50.6% detection accuracy and correction scores of BLEU 0.853, ROUGE 0.924, BERTScore 0.981, SembScore 0.865, and CheXbert F1 0.954, which nonetheless remains below clinical-level accuracy, highlighting the challenge of precise report correction. To advance the state of the art, we propose a multi-step reinforcement learning (MSRL) framework that optimizes a multi-objective reward combining format compliance, error-type accuracy, and BLEU similarity. Applying MSRL to Qwen2.5-VL-7B, the top open-source model in our benchmark, yields improvements of 38.3% in single-error detection precision and 5.2% in single-error correction over the zero-shot baseline.
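
The multi-objective MSRL reward can be sketched as a weighted sum of its three named terms. The output template, parsing logic, and weights below are illustrative assumptions; the abstract does not give the exact reward specification.

```python
import re
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def msrl_reward(output: str, ref_report: str, true_type: str,
                w_fmt=0.2, w_type=0.3, w_bleu=0.5) -> float:
    """Scalar reward combining format compliance, error-type accuracy, and
    BLEU similarity. The '<error type>: <corrected report>' template and the
    weights are invented for illustration, not taken from the paper."""
    m = re.match(r"^\s*([\w\s-]+):\s*(.+)$", output, re.DOTALL)
    fmt = 1.0 if m else 0.0                                   # format compliance
    pred_type = m.group(1).strip().lower() if m else ""
    pred_report = m.group(2).strip() if m else output
    type_ok = 1.0 if pred_type == true_type.lower() else 0.0  # error-type accuracy
    smooth = SmoothingFunction().method1
    bleu = sentence_bleu([ref_report.split()], pred_report.split(),
                         smoothing_function=smooth)           # similarity to reference
    return w_fmt * fmt + w_type * type_ok + w_bleu * bleu
```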

MedSG-Bench: A Benchmark for Medical Image Sequences Grounding

Jingkun Yue, Siqi Zhang, Zinan Jia, Huihuan Xu, Zongbo Han, Xiaohong Liu, Guangyu Wang

arXiv preprint · May 17, 2025
Visual grounding is essential for precise perception and reasoning in multimodal large language models (MLLMs), especially in medical imaging domains. While existing medical visual grounding benchmarks focus primarily on single-image scenarios, real-world clinical applications often involve sequential images, where accurate lesion localization across different modalities and temporal tracking of disease progression (e.g., pre- vs. post-treatment comparison) require fine-grained cross-image semantic alignment and context-aware reasoning. To remedy the underrepresentation of image sequences in existing medical visual grounding benchmarks, we propose MedSG-Bench, the first benchmark tailored for Medical Image Sequences Grounding. It comprises eight VQA-style tasks organized under two grounding paradigms: 1) Image Difference Grounding, which focuses on detecting change regions across images, and 2) Image Consistency Grounding, which emphasizes detection of consistent or shared semantics across sequential images. MedSG-Bench covers 76 public datasets, 10 medical imaging modalities, and a wide spectrum of anatomical structures and diseases, totaling 9,630 question-answer pairs. We benchmark both general-purpose MLLMs (e.g., Qwen2.5-VL) and medical-domain specialized MLLMs (e.g., HuatuoGPT-vision), observing that even advanced models exhibit substantial limitations in medical sequential grounding tasks. To advance this field, we construct MedSG-188K, a large-scale instruction-tuning dataset tailored for sequential visual grounding, and further develop MedSeq-Grounder, an MLLM designed to facilitate future research on fine-grained understanding across medical sequential images. The benchmark, dataset, and model are available at https://huggingface.co/MedSG-Bench
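
The abstract does not publish the item schema, but a VQA-style Image Difference Grounding pair might look like the following hypothetical record; every field name, file name, and coordinate here is invented for illustration and is not the released MedSG-Bench format.

```python
# Hypothetical benchmark item (assumed structure, not the official schema)
item = {
    "task": "image_difference_grounding",
    "modality": "CT",
    "images": ["pre_treatment.png", "post_treatment.png"],
    "question": "Locate the region in the second image where the lesion "
                "has changed relative to the first image.",
    "answer_bbox": [112, 84, 167, 140],  # (x1, y1, x2, y2) in the second image
}
```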

Impact of test set composition on AI performance in pediatric wrist fracture detection in X-rays.

Till T, Scherkl M, Stranger N, Singer G, Hankel S, Flucher C, Hržić F, Štajduhar I, Tschauner S

PubMed · May 16, 2025
To evaluate how different test set sampling strategies (random selection versus balanced sampling) affect the performance of artificial intelligence (AI) models in pediatric wrist fracture detection using radiographs, aiming to highlight the need for standardization in test set design. This retrospective study utilized the open-source GRAZPEDWRI-DX dataset of 6091 pediatric wrist radiographs. Two test sets, each containing 4588 images, were constructed: one using a balanced approach based on case difficulty, projection type, and fracture presence, and the other using random selection. EfficientNet and YOLOv11 models were trained and validated on 18,762 radiographs and tested on both sets. Binary classification and object detection tasks were evaluated using metrics such as precision, recall, F1 score, AP50, and AP50-95. Statistical comparisons between test sets were performed using nonparametric tests. Performance metrics decreased significantly in the balanced test set containing more challenging cases. For example, the precision of the YOLOv11 models decreased from 0.95 in the random set to 0.83 in the balanced set. Similar trends were observed for recall, accuracy, and F1 score, indicating that models trained on easy-to-recognize cases performed poorly on more complex ones. These results were consistent across all model variants tested. AI models for pediatric wrist fracture detection exhibit reduced performance when tested on balanced datasets containing more difficult cases compared to randomly selected cases. This highlights the importance of constructing representative and standardized test sets that account for clinical complexity to ensure robust AI performance in real-world settings. Question: Do sampling strategies based on sample complexity influence deep learning model performance in fracture detection? Findings: AI performance in pediatric wrist fracture detection drops significantly when tested on balanced datasets with more challenging cases, compared to randomly selected cases. Clinical relevance: Without standardized and validated AI test datasets that reflect clinical complexity, performance metrics may be overestimated, limiting the utility of AI in real-world settings.
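
The two sampling strategies compared in the study can be sketched as follows. This is a generic illustration, not the authors' code; the stratification keys mirror the difficulty, projection-type, and fracture-presence factors described above, and `cases` is assumed to be a list of dicts carrying those attributes.

```python
import random
from collections import defaultdict

def random_test_set(cases, n):
    """Random strategy: draw n cases uniformly from the pool."""
    return random.sample(cases, n)

def balanced_test_set(cases, n, keys=("difficulty", "projection", "fracture")):
    """Balanced strategy: stratify by case difficulty, projection type, and
    fracture presence, then draw evenly from each stratum."""
    strata = defaultdict(list)
    for case in cases:
        strata[tuple(case[k] for k in keys)].append(case)
    per_stratum = max(1, n // len(strata))
    sample = []
    for group in strata.values():
        sample.extend(random.sample(group, min(per_stratum, len(group))))
    return sample[:n]
```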

Comparative analysis of deep learning methods for breast ultrasound lesion detection and classification.

Vallez N, Mateos-Aparicio-Ruiz I, Rienda MA, Deniz O, Bueno G

PubMed · May 16, 2025
Breast ultrasound (BUS) computer-aided diagnosis (CAD) systems aim to perform two major steps: detecting lesions and classifying them as benign or malignant. However, the impact of combining both steps has not been previously addressed. Moreover, the specific method employed can influence the final outcome of the system. In this work, a comparison of the effects of using object detection, semantic segmentation, and instance segmentation to detect lesions in BUS images was conducted. To this end, four approaches were examined: (a) multi-class object detection; (b) one-class object detection followed by localized region classification; (c) multi-class segmentation; and (d) one-class segmentation followed by segmented region classification. Additionally, a novel dataset for BUS segmentation, called BUS-UCLM, has been gathered, annotated, and shared publicly. The evaluation of the proposed methods was carried out on this new dataset and four publicly available datasets: BUSI, OASBUD, RODTOOK, and UDIAT. Among the four approaches compared, multi-class detection and multi-class segmentation achieved the best results when instance segmentation CNNs were used. The best detection results were obtained with a multi-class Mask R-CNN, with a COCO AP50 of 72.9%. In the multi-class segmentation scenario, Poolformer achieved the best results, with a Dice score of 77.7%. The analysis of detection and segmentation models in BUS highlights several key challenges, emphasizing the complexity of accurately identifying and segmenting lesions. Among the methods evaluated, instance segmentation proved to be the most effective for BUS images, offering superior performance in delineating individual lesions.
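
Approaches (b) and (d) above share a two-stage structure: a class-agnostic model proposes lesion regions, and a separate classifier labels each region as benign or malignant. Below is a minimal sketch of that pipeline; `detector` and `classifier` stand in for trained models, and the 224x224 crop size is an assumption.

```python
import torch
import torch.nn.functional as F

def two_stage_pipeline(image, detector, classifier):
    """One-class detection followed by per-region benign/malignant
    classification (a sketch of approach (b), not the authors' code).
    image: (C, H, W) tensor; detector returns (x1, y1, x2, y2, score) tuples."""
    results = []
    for x1, y1, x2, y2, score in detector(image):
        crop = image[:, int(y1):int(y2), int(x1):int(x2)]
        crop = F.interpolate(crop.unsqueeze(0), size=(224, 224), mode="bilinear")
        label = classifier(crop).argmax(dim=1).item()  # 0 = benign, 1 = malignant
        results.append(((x1, y1, x2, y2), label, score))
    return results
```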

Residual self-attention vision transformer for detecting acquired vitelliform lesions and age-related macular drusen.

Powroznik P, Skublewska-Paszkowska M, Nowomiejska K, Gajda-Deryło B, Brinkmann M, Concilio M, Toro MD, Rejdak R

PubMed · May 16, 2025
Retinal disease recognition remains a challenging task. Many deep learning classification methods and their modifications have been developed for medical imaging. Recently, Vision Transformers (ViT) have been applied to the classification of retinal diseases with great success. In this study, a novel method, the Residual Self-Attention Vision Transformer (RS-A ViT), is therefore proposed for automatic detection of acquired vitelliform lesions (AVL) and macular drusen, as well as for distinguishing them from healthy cases. A Residual Self-Attention module was applied in place of standard Self-Attention in order to improve the model's performance. The new tool outperforms classical deep learning methods such as EfficientNet, InceptionV3, ResNet50, and VGG16. The RS-A ViT method also exceeds the ViT algorithm, reaching 96.62%. For the purposes of this research, a new dataset was created that combines AVL data gathered from two research centers with drusen and normal cases from the OCT dataset. Augmentation methods were applied in order to enlarge the samples. The Grad-CAM interpretability method indicated that the model analyzes the appropriate areas of optical coherence tomography images when detecting retinal diseases. The results showed that the presented RS-A ViT model has great potential for classifying retinal disorders with high accuracy and may thus be applied as a supportive tool for ophthalmologists.
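
The abstract does not define the Residual Self-Attention module precisely. One common formulation (RealFormer-style residual attention) adds the previous layer's pre-softmax attention scores to the current layer's; whether this matches the authors' design is an assumption, and the sketch below only illustrates that formulation.

```python
import torch
import torch.nn as nn

class ResidualSelfAttention(nn.Module):
    """Self-attention with a residual path on the pre-softmax attention scores
    (RealFormer-style; assumed, not confirmed to match the paper's module)."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, prev_scores=None):
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, D // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each (B, H, N, d)
        scores = (q @ k.transpose(-2, -1)) * self.scale
        if prev_scores is not None:
            scores = scores + prev_scores              # residual attention path
        attn = scores.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out) + x, scores              # token-level residual too
```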

Pancreas segmentation using AI developed on the largest CT dataset with multi-institutional validation and implications for early cancer detection.

Mukherjee S, Antony A, Patnam NG, Trivedi KH, Karbhari A, Nagaraj M, Murlidhar M, Goenka AH

PubMed · May 16, 2025
Accurate and fully automated pancreas segmentation is critical for advancing imaging biomarkers in early pancreatic cancer detection and for biomarker discovery in endocrine and exocrine pancreatic diseases. We developed and evaluated a deep learning (DL)-based convolutional neural network (CNN) for automated pancreas segmentation using the largest single-institution dataset to date (n = 3031 CTs). Ground-truth segmentations performed by radiologists were used to train a 3D nnU-Net model through five-fold cross-validation, generating an ensemble of top-performing models. To assess generalizability, the model was externally validated on the multi-institutional AbdomenCT-1K dataset (n = 585), for which volumetric segmentations were newly generated by expert radiologists and will be made publicly available. On the test subset (n = 452), the CNN achieved a mean Dice Similarity Coefficient (DSC) of 0.94 (SD 0.05), demonstrating high spatial concordance with radiologist-annotated volumes (Concordance Correlation Coefficient [CCC]: 0.95). On the AbdomenCT-1K dataset, the model achieved a DSC of 0.96 (SD 0.04) and a CCC of 0.98, confirming its robustness across diverse imaging conditions. The proposed DL model establishes new performance benchmarks for fully automated pancreas segmentation, offering a scalable and generalizable solution for large-scale imaging biomarker research and clinical translation.
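
Both reported agreement metrics have standard closed forms, implemented generically below (not the authors' code): the DSC measures voxel-level mask overlap, while Lin's CCC measures agreement between paired volume measurements.

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice Similarity Coefficient: 2|A ∩ B| / (|A| + |B|) for binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())  # assumes non-empty masks

def ccc(x: np.ndarray, y: np.ndarray) -> float:
    """Lin's Concordance Correlation Coefficient between paired measurements:
    2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2)."""
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (x.var() + y.var() + (mx - my) ** 2)
```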

CheXGenBench: A Unified Benchmark For Fidelity, Privacy and Utility of Synthetic Chest Radiographs

Raman Dutt, Pedro Sanchez, Yongchen Yao, Steven McDonagh, Sotirios A. Tsaftaris, Timothy Hospedales

arXiv preprint · May 15, 2025
We introduce CheXGenBench, a rigorous and multifaceted evaluation framework for synthetic chest radiograph generation that simultaneously assesses fidelity, privacy risks, and clinical utility across state-of-the-art text-to-image generative models. Despite rapid advancements in generative AI for real-world imagery, medical domain evaluations have been hindered by methodological inconsistencies, outdated architectural comparisons, and disconnected assessment criteria that rarely address the practical clinical value of synthetic samples. CheXGenBench overcomes these limitations through standardised data partitioning and a unified evaluation protocol comprising over 20 quantitative metrics that systematically analyse generation quality, potential privacy vulnerabilities, and downstream clinical applicability across 11 leading text-to-image architectures. Our results reveal critical inefficiencies in the existing evaluation protocols, particularly in assessing generative fidelity, leading to inconsistent and uninformative comparisons. Our framework establishes a standardised benchmark for the medical AI community, enabling objective and reproducible comparisons while facilitating seamless integration of both existing and future generative models. Additionally, we release a high-quality, synthetic dataset, SynthCheX-75K, comprising 75K radiographs generated by the top-performing model (Sana 0.6B) in our benchmark to support further research in this critical domain. Through CheXGenBench, we establish a new state-of-the-art and release our framework, models, and SynthCheX-75K dataset at https://raman1121.github.io/CheXGenBench/

An Annotated Multi-Site and Multi-Contrast Magnetic Resonance Imaging Dataset for the Study of the Human Tongue Musculature.

Ribeiro FL, Zhu X, Ye X, Tu S, Ngo ST, Henderson RD, Steyn FJ, Kiernan MC, Barth M, Bollmann S, Shaw TB

PubMed · May 14, 2025
This dataset provides the first annotated, openly available MRI-based imaging dataset for investigations of the tongue musculature, including multi-contrast and multi-site MRI data from participants without disease. The present dataset includes 47 participants collated from three studies: BeLong (four participants; T2-weighted images), EATT4MND (19 participants; T2-weighted images), and BMC (24 participants; T1-weighted images). We provide manually corrected segmentations of five key tongue muscles: the superior longitudinal, combined transverse/vertical, genioglossus, and inferior longitudinal muscles. Other phenotypic measures, including age, sex, weight, height, and tongue muscle volume, are also available. This dataset will benefit researchers across domains interested in the structure and function of the tongue in health and disease. For instance, researchers can use this data to train new machine learning models for tongue segmentation, which can be leveraged for segmentation and tracking of the different tongue muscles engaged in speech formation in health and disease. Altogether, this dataset provides the scientific community with the means to investigate the intricate tongue musculature and its role in physiological processes and speech production.

Blockchain-enabled collective and combined deep learning framework for COVID-19 diagnosis.

Periyasamy S, Kaliyaperumal P, Thirumalaisamy M, Balusamy B, Elumalai T, Meena V, Jadoun VK

PubMed · May 13, 2025
The rapid spread of SARS-CoV-2 has highlighted the need for intelligent methodologies in COVID-19 diagnosis. Clinicians face significant challenges due to the virus's fast transmission rate and the lack of reliable diagnostic tools. Although artificial intelligence (AI) has improved image processing, conventional approaches still rely on centralized data storage and training. This reliance increases complexity and raises privacy concerns, which hinder global data exchange. It is therefore essential to develop collaborative models that balance accuracy with privacy protection. This research presents a novel framework that combines blockchain technology with a combined learning paradigm to ensure secure data distribution and reduced complexity. The proposed Combined Learning Collective Deep Learning Blockchain Model (CLCD-Block) aggregates data from multiple institutions and leverages a hybrid capsule learning network for accurate predictions. Extensive testing with lung CT images demonstrates that the model outperforms existing models, achieving an accuracy exceeding 97%. Specifically, on four benchmark datasets, CLCD-Block achieved up to 98.79% precision, 98.84% recall, 98.79% specificity, 98.81% F1-score, and 98.71% accuracy, showcasing its superior diagnostic capability. Although designed for COVID-19 diagnosis, the CLCD-Block framework is adaptable to other applications, integrating AI, decentralized training, privacy protection, and secure blockchain collaboration. It addresses challenges in diagnosing chronic diseases, facilitates cross-institutional research, and monitors infectious outbreaks. Future work will focus on enhancing scalability, optimizing real-time performance, and adapting the model to broader healthcare datasets.
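
The abstract does not give CLCD-Block's aggregation rule. Plain federated averaging (FedAvg) is shown below purely to illustrate the decentralized, privacy-preserving training idea the framework builds on; the uniform weighting and structure are assumptions, and the blockchain-verified exchange layer is omitted.

```python
import torch

def federated_average(state_dicts, weights=None):
    """Aggregate per-institution model weights into one global model without
    sharing raw data (plain FedAvg, shown as an illustration only).
    state_dicts: list of model.state_dict() from the participating sites."""
    n = len(state_dicts)
    weights = weights or [1.0 / n] * n
    avg = {}
    for key in state_dicts[0]:
        avg[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return avg
```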