Reliable Evaluation of MRI Motion Correction: Dataset and Insights

Kun Wang, Tobit Klug, Stefan Ruschke, Jan S. Kirschke, Reinhard Heckel

arXiv preprint, Jun 6 2025
Correcting motion artifacts in MRI is important, as they can hinder accurate diagnosis. However, evaluating deep learning-based and classical motion correction methods remains fundamentally difficult due to the lack of accessible ground-truth target data. To address this challenge, we study three evaluation approaches: real-world evaluation based on reference scans, simulated motion, and reference-free evaluation, each with its merits and shortcomings. To enable evaluation with real-world motion artifacts, we release PMoC3D, a dataset consisting of unprocessed Paired Motion-Corrupted 3D brain MRI data. To advance evaluation quality, we introduce MoMRISim, a feature-space metric trained for evaluating motion reconstructions. We assess each evaluation approach and find that real-world evaluation together with MoMRISim, while not perfect, is the most reliable. Evaluation based on simulated motion systematically exaggerates algorithm performance, and reference-free evaluation overrates oversmoothed deep learning outputs.
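The abstract does not spell out how MoMRISim scores a reconstruction, so the following is only a hedged sketch of the general idea behind a feature-space metric: embed both the reconstruction and a reference into a (trained) feature space and compare the embeddings. The function name and the assumption that feature maps are precomputed arrays are illustrative, not the actual MoMRISim.

```python
import numpy as np

def feature_space_similarity(feat_recon, feat_ref, eps=1e-8):
    """Cosine similarity between flattened feature maps of a motion
    reconstruction and a motion-free reference scan. The feature maps
    are assumed to come from some trained encoder (hypothetical here)."""
    a = feat_recon.ravel().astype(np.float64)
    b = feat_ref.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))
```

Identical feature maps score ~1.0 and orthogonal ones score 0, so higher values indicate a reconstruction closer to the reference in feature space.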

Prenatal detection of congenital heart defects using deep learning-based image and video analysis: protocol for Clinical Artificial Intelligence in Fetal Echocardiography (CAIFE), an international multicentre multidisciplinary study.

Patey O, Hernandez-Cruz N, D'Alberti E, Salovic B, Noble JA, Papageorghiou AT

PubMed, Jun 5 2025
Congenital heart defect (CHD) is a significant, rapidly emerging global problem in child health and a leading cause of neonatal and childhood death. Prenatal detection of CHDs with the help of ultrasound allows better perinatal management of such pregnancies, leading to reduced neonatal mortality, morbidity and developmental complications. However, there is a wide variation in reported fetal heart problem detection rates, from 34% to 85%, with some low- and middle-income countries detecting as few as 9.3% of cases before birth. Research has shown that deep learning-based or more general artificial intelligence (AI) models can support the detection of fetal CHDs more rapidly than humans performing ultrasound scans. Progress in this AI-based research depends on the availability of large, well-curated and diverse data of ultrasound images and videos of normal and abnormal fetal hearts. Currently, CHD detection based on AI models is not accurate enough for practical clinical use, in part due to the lack of ultrasound data available for machine learning (as CHDs are rare and heterogeneous), the retrospective nature of published studies, the lack of multicentre and multidisciplinary collaboration, and the use of mostly still images of standard planes of the fetal heart for AI models. Our aim is to develop AI models that could support clinicians in detecting fetal CHDs in real time, particularly in nonspecialist or low-resource settings where fetal echocardiography expertise is not readily available. We have designed the Clinical Artificial Intelligence Fetal Echocardiography (CAIFE) study as an international multicentre multidisciplinary collaboration led by a clinical and an engineering team at the University of Oxford. This study involves five hospital sites across two countries for data collection: Oxford, UK (n=1), London, UK (n=3) and Southport, Australia (n=1).
We plan to curate 14 000 retrospective ultrasound scans of fetuses with normal hearts (n=13 000) and fetuses with CHDs (n=1000), as well as 2400 prospective ultrasound cardiac scans, including the proposed research-specific CAIFE 10 s video sweeps, from fetuses with normal hearts (n=2000) and fetuses diagnosed with major CHDs (n=400). This gives a total of 16 400 retrospective and prospective ultrasound scans from the participating hospital sites. We will build, train and validate computational models capable of differentiating between normal fetal hearts and those diagnosed with CHDs, and of recognising specific types of CHDs. Data will be analysed using statistical metrics, namely sensitivity, specificity and accuracy, including positive and negative predictive values for each outcome, compared with manual assessment. We will disseminate the findings through regional, national and international conferences and through peer-reviewed journals. The study was approved by the Health Research Authority, Care Research Wales and the Research Ethics Committee (Ref: 23/EM/0023; IRAS Project ID: 317510) on 8 March 2023. All collaborating hospitals have obtained the local trust research and development approvals.
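The planned analysis metrics (sensitivity, specificity, accuracy, and positive/negative predictive values) all follow directly from confusion-matrix counts; a minimal sketch, independent of the study's actual analysis code:

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard screening metrics from confusion-matrix counts:
    tp/fp/tn/fn = true/false positives and negatives."""
    return {
        "sensitivity": tp / (tp + fn),          # detected among diseased
        "specificity": tn / (tn + fp),          # cleared among healthy
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "ppv": tp / (tp + fp),                  # positive predictive value
        "npv": tn / (tn + fn),                  # negative predictive value
    }
```

For example, 80 true positives, 20 false negatives, 90 true negatives and 10 false positives give a sensitivity of 0.80 and a specificity of 0.90.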

UltraBones100k: A reliable automated labeling method and large-scale dataset for ultrasound-based bone surface extraction.

Wu L, Cavalcanti NA, Seibold M, Loggia G, Reissner L, Hein J, Beeler S, Viehöfer A, Wirth S, Calvet L, Fürnstahl P

PubMed, Jun 4 2025
Ultrasound-based bone surface segmentation is crucial in computer-assisted orthopedic surgery. However, ultrasound images have limitations, including a low signal-to-noise ratio, acoustic shadowing, and speckle noise, which make interpretation difficult. Existing deep learning models for bone segmentation rely primarily on costly manual labeling by experts, limiting dataset size and model generalizability. Additionally, the complexity of ultrasound physics and acoustic shadowing makes the images difficult for humans to interpret, leading to incomplete labels in low-intensity and anechoic regions and limiting model performance. To advance the state of the art in ultrasound bone segmentation and establish effective model benchmarks, larger and higher-quality datasets are needed. We propose a methodology for collecting ex-vivo ultrasound datasets with automatically generated bone labels, including anechoic regions. The proposed labels are derived by accurately superimposing tracked bone Computed Tomography (CT) models onto the tracked ultrasound images. These initial labels are refined to account for ultrasound physics. To clinically evaluate the proposed method, an expert physician from our university hospital specialized in orthopedic sonography assessed the quality of the generated bone labels. A neural network for bone segmentation was trained on the collected dataset and its predictions were compared to expert manual labels, evaluating accuracy, completeness, and F1-score. We collected UltraBones100k, the largest known dataset of its kind, comprising 100k ex-vivo ultrasound images of human lower limbs with bone annotations, specifically targeting the fibula, tibia, and foot bones. A Wilcoxon signed-rank test with Bonferroni correction confirmed that the bone alignment after our optimization pipeline significantly improved the quality of bone labeling (p<0.001).
The model trained on UltraBones100k consistently outperforms manual labeling in all metrics, particularly in low-intensity regions (at a distance threshold of 0.5 mm: 320% improvement in completeness, 27.4% improvement in accuracy, and 197% improvement in F1 score). This work promises to facilitate research and clinical translation of ultrasound imaging in computer-assisted interventions, particularly for applications such as 2D bone segmentation, 3D bone surface reconstruction, and multi-modality bone registration.
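The completeness and accuracy figures are evaluated at a surface-distance threshold. A simplified numpy sketch of such an evaluation (brute-force pairwise distances between predicted and ground-truth surface points; the paper's exact evaluation code may differ):

```python
import numpy as np

def surface_f1(pred_pts, gt_pts, tau=0.5):
    """Precision (accuracy), completeness, and F1 for predicted vs.
    ground-truth bone surface points at distance threshold tau (mm).
    pred_pts and gt_pts are (N, 3) arrays of point coordinates."""
    d = np.linalg.norm(pred_pts[:, None, :] - gt_pts[None, :, :], axis=-1)
    precision = float((d.min(axis=1) <= tau).mean())     # pred points near GT
    completeness = float((d.min(axis=0) <= tau).mean())  # GT points covered
    if precision + completeness == 0:
        return precision, completeness, 0.0
    f1 = 2 * precision * completeness / (precision + completeness)
    return precision, completeness, f1
```

The brute-force distance matrix is O(N·M); a KD-tree would be used for realistically sized surfaces, but the metric definition is the same.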

ReXVQA: A Large-scale Visual Question Answering Benchmark for Generalist Chest X-ray Understanding

Ankit Pal, Jung-Oh Lee, Xiaoman Zhang, Malaikannan Sankarasubbu, Seunghyeon Roh, Won Jung Kim, Meesun Lee, Pranav Rajpurkar

arXiv preprint, Jun 4 2025
We present ReXVQA, the largest and most comprehensive benchmark for visual question answering (VQA) in chest radiology, comprising approximately 696,000 questions paired with 160,000 chest X-ray studies across training, validation, and test sets. Unlike prior efforts that rely heavily on template-based queries, ReXVQA introduces a diverse and clinically authentic task suite reflecting five core radiological reasoning skills: presence assessment, location analysis, negation detection, differential diagnosis, and geometric reasoning. We evaluate eight state-of-the-art multimodal large language models, including MedGemma-4B-it, Qwen2.5-VL, Janus-Pro-7B, and Eagle2-9B. The best-performing model (MedGemma) achieves 83.24% overall accuracy. To bridge the gap between AI performance and clinical expertise, we conducted a comprehensive human reader study involving 3 radiology residents on 200 randomly sampled cases. Our evaluation demonstrates that MedGemma achieved superior performance (83.84% accuracy) compared to human readers (best radiology resident: 77.27%), representing a significant milestone where AI performance exceeds expert human evaluation on chest X-ray interpretation. The reader study reveals distinct performance patterns between AI models and human experts, with strong inter-reader agreement among radiologists while showing more variable agreement patterns between human readers and AI models. ReXVQA establishes a new standard for evaluating generalist radiological AI systems, offering public leaderboards, fine-grained evaluation splits, structured explanations, and category-level breakdowns. This benchmark lays the foundation for next-generation AI systems capable of mimicking expert-level clinical reasoning beyond narrow pathology classification. Our dataset will be open-sourced at https://huggingface.co/datasets/rajpurkarlab/ReXVQA
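Inter-reader agreement of the kind reported in the reader study is commonly quantified with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The abstract does not name the statistic used, so treat this dependency-free sketch as illustrative:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two raters' answer lists of equal length.
    Returns 1.0 for perfect agreement, ~0 for chance-level agreement."""
    assert len(a) == len(b) and len(a) > 0
    labels = sorted(set(a) | set(b))
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n        # observed agreement
    p_exp = sum((a.count(l) / n) * (b.count(l) / n)      # chance agreement
                for l in labels)
    return 1.0 if p_exp == 1 else (p_obs - p_exp) / (1 - p_exp)
```

For example, two raters answering ['A','A','B','B'] and ['A','B','B','B'] agree on 3 of 4 items (75%), but after the chance correction kappa is 0.5.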

Open-PMC-18M: A High-Fidelity Large Scale Medical Dataset for Multimodal Representation Learning

Negin Baghbanzadeh, Sajad Ashkezari, Elham Dolatabadi, Arash Afkanpour

arXiv preprint, Jun 3 2025
Compound figures, which are multi-panel composites containing diverse subfigures, are ubiquitous in biomedical literature, yet large-scale subfigure extraction remains largely unaddressed. Prior work on subfigure extraction has been limited in both dataset size and generalizability, leaving a critical open question: How does high-fidelity image-text alignment via large-scale subfigure extraction impact representation learning in vision-language models? We address this gap by introducing a scalable subfigure extraction pipeline based on transformer-based object detection, trained on a synthetic corpus of 500,000 compound figures, and achieving state-of-the-art performance on both ImageCLEF 2016 and synthetic benchmarks. Using this pipeline, we release OPEN-PMC-18M, a large-scale high quality biomedical vision-language dataset comprising 18 million clinically relevant subfigure-caption pairs spanning radiology, microscopy, and visible light photography. We train and evaluate vision-language models on our curated datasets and show improved performance across retrieval, zero-shot classification, and robustness benchmarks, outperforming existing baselines. We release our dataset, models, and code to support reproducible benchmarks and further study into biomedical vision-language modeling and representation learning.
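Downstream of the object detector, subfigure extraction reduces to cropping score-filtered bounding boxes out of the compound figure. A minimal sketch; the detection format `((x0, y0, x1, y1), score)` in pixel coordinates is an assumption, not the pipeline's actual interface:

```python
import numpy as np

def extract_subfigures(image, detections, score_thresh=0.5):
    """Crop detected subfigure boxes out of a compound figure.
    image: HxWxC numpy array; detections: ((x0, y0, x1, y1), score)
    pairs with integer pixel coordinates, x1 > x0 and y1 > y0."""
    crops = []
    for (x0, y0, x1, y1), score in detections:
        if score >= score_thresh:          # drop low-confidence boxes
            crops.append(image[y0:y1, x0:x1].copy())
    return crops
```

Each crop would then be paired with its caption segment to form the subfigure-caption pairs the dataset is built from.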

Upper Airway Volume Predicts Brain Structure and Cognition in Adolescents.

Kanhere A, Navarathna N, Yi PH, Parekh VS, Pickle J, Cloak CC, Ernst T, Chang L, Li D, Redline S, Isaiah A

PubMed, Jun 3 2025
One in ten children experiences sleep-disordered breathing (SDB). Untreated SDB is associated with poor cognition, but the underlying mechanisms are less understood. We assessed the relationship between magnetic resonance imaging (MRI)-derived upper airway volume and children's cognition and regional cortical gray matter volumes. We used five-year data from the Adolescent Brain Cognitive Development study (n=11,875 children, 9-10 years at baseline). Upper airway volumes were derived using a deep learning model applied to 5,552,640 brain MRI slices. The primary outcome was the Total Cognition Composite score from the National Institutes of Health Toolbox (NIH-TB). Secondary outcomes included other NIH-TB measures and cortical gray matter volumes. The habitual snoring group had significantly smaller airway volumes than non-snorers (mean difference=1.2 cm<sup>3</sup>; 95% CI, 1.0-1.4 cm<sup>3</sup>; P<0.001). Deep learning-derived airway volume predicted the Total Cognition Composite score (estimated mean difference=3.68 points; 95% CI, 2.41-4.96; P<0.001) per one-unit increase in the natural log of airway volume (~2.7-fold raw volume increase). This airway volume increase was also associated with an average 0.02 cm<sup>3</sup> increase in right temporal pole volume (95% CI, 0.01-0.02 cm<sup>3</sup>; P<0.001). Airway volume similarly predicted most NIH-TB domain scores and multiple frontal and temporal gray matter volumes. These brain volumes mediated the relationship between airway volume and cognition. We demonstrate a novel application of deep learning-based airway segmentation in a large pediatric cohort. Upper airway volume is a potential biomarker for cognitive outcomes in pediatric SDB, offers insights into neurobiological mechanisms, and informs future studies on risk stratification.
This article is open access and distributed under the terms of the Creative Commons Attribution Non-Commercial No Derivatives License 4.0 (http://creativecommons.org/licenses/by-nc-nd/4.0/).
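The reported slope (3.68 points per one-unit increase in ln airway volume, i.e. per ~e ≈ 2.7-fold raw increase) can be applied illustratively to convert a volume change into a predicted score change. This sketch ignores the study's covariate adjustment and confidence interval, so it is interpretive arithmetic, not a reimplementation of the analysis:

```python
import math

def predicted_score_change(vol_before_cm3, vol_after_cm3, beta=3.68):
    """Predicted change in Total Cognition Composite score for a change
    in airway volume, using the reported slope beta (points per unit
    increase in the natural log of airway volume)."""
    return beta * math.log(vol_after_cm3 / vol_before_cm3)
```

An e-fold (~2.72x) increase in airway volume thus corresponds to the full +3.68 points, while a doubling corresponds to about +2.55 points.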

Multimodal Neuroimaging Based Alzheimer's Disease Diagnosis Using Evolutionary RVFL Classifier.

Goel T, Sharma R, Tanveer M, Suganthan PN, Maji K, Pilli R

PubMed, Jun 1 2025
Alzheimer's disease (AD) is one of the best-known causes of dementia, characterized by continuous deterioration in the cognitive skills of elderly people. It is an irreversible disorder that can be managed effectively only if detected early, at the stage known as mild cognitive impairment (MCI). The most common biomarkers to diagnose AD are structural atrophy and the accumulation of plaques and tangles, which can be detected using magnetic resonance imaging (MRI) and positron emission tomography (PET) scans. Therefore, the present paper proposes wavelet transform-based multimodal fusion of MRI and PET scans to incorporate structural and metabolic information for the early detection of this life-threatening neurodegenerative disease. The deep learning model ResNet-50 then extracts features from the fused images. A random vector functional link (RVFL) network with a single hidden layer classifies the extracted features. The weights and biases of the original RVFL network are optimized using an evolutionary algorithm to maximize accuracy. All experiments and comparisons are performed on the publicly available Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset to demonstrate the suggested algorithm's efficacy.
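An RVFL network keeps its input-to-hidden weights random and fixed and solves only the output weights, in closed form, over both the direct input links and the random hidden features. A minimal numpy sketch of the standard ridge-regression variant (the paper replaces the fixed random weights with evolutionary optimization, which is not reproduced here):

```python
import numpy as np

def rvfl_fit(X, y, n_hidden=64, reg=1e-4, seed=0):
    """Fit an RVFL: random fixed hidden layer, closed-form ridge
    solution for the output weights over [X | tanh(XW+b) | 1]."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    H = np.tanh(X @ W + b)                               # random features
    D = np.hstack([X, H, np.ones((X.shape[0], 1))])      # direct links + bias
    beta = np.linalg.solve(D.T @ D + reg * np.eye(D.shape[1]), D.T @ y)
    return W, b, beta

def rvfl_predict(X, W, b, beta):
    H = np.tanh(X @ W + b)
    D = np.hstack([X, H, np.ones((X.shape[0], 1))])
    return D @ beta
```

Because the hidden weights are never trained, fitting costs one linear solve; the evolutionary variant searches over W and b instead of drawing them once.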

Mexican dataset of digital mammograms (MEXBreast) with suspicious clusters of microcalcifications.

Lozoya RSL, Barragán KN, Domínguez HJO, Azuela JHS, Sánchez VGC, Villegas OOV

PubMed, Jun 1 2025
Breast cancer is one of the most prevalent cancers affecting women worldwide. Early detection and treatment are crucial in significantly reducing mortality rates. Microcalcifications (MCs) are of particular importance among the various breast lesions. These tiny calcium deposits within breast tissue are present in approximately 30% of malignant tumors and can serve as critical indirect indicators of early-stage breast cancer. Three or more MCs within an area of 1 cm² are considered a Microcalcification Cluster (MCC) and assigned BI-RADS category 4, indicating a suspicion of malignancy. Mammography is the most widely used technique for breast cancer detection. Approximately one in two mammograms showing MCCs is confirmed as cancerous through biopsy. MCCs are challenging to detect, even for experienced radiologists, underscoring the need for computer-aided detection tools such as Convolutional Neural Networks (CNNs). CNNs require large amounts of domain-specific data with consistent resolutions for effective training. However, most publicly available mammogram datasets either lack resolution information or are compiled from heterogeneous sources. Additionally, MCCs are often either unlabeled or sparsely represented in these datasets, limiting their utility for training CNNs. Here we present MEXBreast, an annotated Mexican digital mammogram database of MCCs, containing images at resolutions of 50, 70, and 100 microns. MEXBreast aims to support the training, validation, and testing of deep learning CNNs.
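The stated rule (three or more MCs within 1 cm²) can be checked directly once detections are in pixel coordinates, because the pixel size in microns fixes the cm-to-pixel conversion. A simplified sketch that approximates the 1 cm² region by a 1 cm x 1 cm bounding box (the clinical definition is applied by radiologists, not by this exact test):

```python
def is_mcc_candidate(mc_coords_px, resolution_um):
    """True if >= 3 microcalcifications fit inside a 1 cm x 1 cm box.
    mc_coords_px: (x, y) pixel coordinates of detected MCs;
    resolution_um: pixel pitch in microns (e.g. 50, 70, or 100)."""
    side_px = 10_000 / resolution_um          # 1 cm in pixels
    if len(mc_coords_px) < 3:
        return False
    xs = [x for x, _ in mc_coords_px]
    ys = [y for _, y in mc_coords_px]
    return (max(xs) - min(xs) <= side_px) and (max(ys) - min(ys) <= side_px)
```

At 50 microns per pixel, 1 cm spans 200 pixels, so three detections within a 200x200-pixel box would flag a candidate cluster.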

MedBookVQA: A Systematic and Comprehensive Medical Benchmark Derived from Open-Access Book

Sau Lai Yip, Sunan He, Yuxiang Nie, Shu Pui Chan, Yilin Ye, Sum Ying Lam, Hao Chen

arXiv preprint, Jun 1 2025
The accelerating development of general medical artificial intelligence (GMAI), powered by multimodal large language models (MLLMs), offers transformative potential for addressing persistent healthcare challenges, including workforce deficits and escalating costs. The parallel development of systematic evaluation benchmarks emerges as a critical imperative to enable performance assessment and provide technological guidance. Meanwhile, as an invaluable knowledge source, the potential of medical textbooks for benchmark development remains underexploited. Here, we present MedBookVQA, a systematic and comprehensive multimodal benchmark derived from open-access medical textbooks. To curate this benchmark, we propose a standardized pipeline for automated extraction of medical figures while contextually aligning them with corresponding medical narratives. Based on this curated data, we generate 5,000 clinically relevant questions spanning modality recognition, disease classification, anatomical identification, symptom diagnosis, and surgical procedures. A multi-tier annotation system categorizes queries through hierarchical taxonomies encompassing medical imaging modalities (42 categories), body anatomies (125 structures), and clinical specialties (31 departments), enabling nuanced analysis across medical subdomains. We evaluate a wide array of MLLMs, including proprietary, open-sourced, medical, and reasoning models, revealing significant performance disparities across task types and model categories. Our findings highlight critical capability gaps in current GMAI systems while establishing textbook-derived multimodal benchmarks as essential evaluation tools. MedBookVQA establishes textbook-derived benchmarking as a critical paradigm for advancing clinical AI, exposing limitations in GMAI systems while providing anatomically structured performance metrics across specialties.
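Category-level breakdowns like those enabled by MedBookVQA's multi-tier annotation reduce to grouping per-question correctness by taxonomy label (modality, anatomy, or specialty). A minimal sketch of that aggregation step:

```python
from collections import defaultdict

def accuracy_by_category(records):
    """records: iterable of (category_label, is_correct) pairs.
    Returns {category: accuracy} for per-category breakdowns."""
    totals = defaultdict(lambda: [0, 0])      # category -> [correct, total]
    for cat, ok in records:
        totals[cat][0] += int(ok)
        totals[cat][1] += 1
    return {cat: c / n for cat, (c, n) in totals.items()}
```

Running the same aggregation at each taxonomy tier (42 modalities, 125 anatomies, 31 specialties) yields the nuanced per-subdomain analysis the benchmark describes.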
