
Quantitative and automatic plan-of-the-day assessment to facilitate adaptive radiotherapy in cervical cancer.

Mason SA, Wang L, Alexander SE, Lalondrelle S, McNair HA, Harris EJ

pubmed logopapers · Jun 5, 2025
To facilitate implementation of plan-of-the-day (POTD) selection for treating locally advanced cervical cancer (LACC), we developed a POTD assessment tool for CBCT-guided radiotherapy (RT). A female pelvis segmentation model (U-Seg3) is combined with a quantitative standard operating procedure (qSOP) to identify optimal and acceptable plans. 

Approach: The planning CT [i], corresponding structure set [ii], and manually contoured CBCTs [iii] (n=226) from 39 LACC patients treated with POTD (n=11) or non-adaptive RT (n=28) were used to develop U-Seg3, an algorithm incorporating deep-learning and deformable image registration techniques to segment the low-risk clinical target volume (LR-CTV), high-risk CTV (HR-CTV), bladder, rectum, and bowel bag. A single-channel input model (input iii only, U-Seg1) was also developed. Contoured CBCTs from the POTD patients were (a) reserved for U-Seg3 validation/testing, (b) audited to determine optimal and acceptable plans, and (c) used to empirically derive a qSOP that maximised classification accuracy.

Main Results: The median [interquartile range] DSC between manual and U-Seg3 contours was 0.83 [0.80], 0.78 [0.13], 0.94 [0.05], 0.86 [0.09], and 0.90 [0.05] for the LR-CTV, HR-CTV, bladder, rectum, and bowel bag, respectively. These were significantly higher than those of U-Seg1 for all structures except the bladder. The qSOP classified plans as acceptable if they met target coverage thresholds (LR-CTV ≥99%, HR-CTV ≥99.8%), with lower LR-CTV coverage (≥95%) sometimes allowed. The acceptable plan minimising bowel irradiation was considered optimal unless substantial bladder sparing could be achieved. With U-Seg3 embedded in the qSOP, optimal and acceptable plans were identified in 46/60 and 57/60 cases, respectively.

Significance: U-Seg3 outperforms U-Seg1 and all known CBCT-based female pelvis segmentation models. The tool combining U-Seg3 and the qSOP identifies optimal plans with accuracy equivalent to that of two observers. In an implementation strategy whereby this tool serves as the second observer, plan-selection confidence could be improved and decision-making time reduced, whilst halving the required number of POTD-trained radiographers.
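
The qSOP rule quoted above lends itself to a compact implementation. The sketch below encodes the coverage thresholds from the abstract; the plan dictionary layout, the function names, and the 10% margin taken to mean "substantial bladder sparing" are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the qSOP plan-classification rule (assumed data layout).

def classify_plan(lr_cov, hr_cov):
    """Return 'acceptable', 'conditional', or 'unacceptable' for one plan."""
    if hr_cov < 99.8:
        return "unacceptable"
    if lr_cov >= 99.0:
        return "acceptable"
    return "conditional" if lr_cov >= 95.0 else "unacceptable"  # relaxed LR-CTV floor

def select_optimal(plans, bladder_margin=0.10):
    """Pick the acceptable plan minimising bowel dose, unless another
    acceptable plan spares the bladder substantially (assumed >10%)."""
    ok = [p for p in plans
          if classify_plan(p["lr_cov"], p["hr_cov"]) != "unacceptable"]
    if not ok:
        return None
    best = min(ok, key=lambda p: p["bowel_dose"])
    sparing = [p for p in ok
               if p["bladder_dose"] < (1 - bladder_margin) * best["bladder_dose"]]
    return min(sparing, key=lambda p: p["bladder_dose"]) if sparing else best

plans = [
    {"lr_cov": 99.4, "hr_cov": 99.9, "bowel_dose": 42.0, "bladder_dose": 38.0},
    {"lr_cov": 96.1, "hr_cov": 99.8, "bowel_dose": 40.5, "bladder_dose": 30.0},
]
print(select_optimal(plans))   # plan 2: lowest bowel dose among usable plans
```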


Clinical validation of a deep learning model for low-count PET image enhancement.

Long Q, Tian Y, Pan B, Xu Z, Zhang W, Xu L, Fan W, Pan T, Gong NJ

pubmed logopapers · Jun 5, 2025
To investigate the effects of the deep learning model RaDynPET on fourfold reduced-count whole-body PET examinations. A total of 120 patients (84 in the internal cohort and 36 in the external cohort) undergoing ¹⁸F-FDG PET/CT examinations were enrolled. PET images were reconstructed using the OSEM algorithm with 120-s (G120) and 30-s (G30) list-mode data. RaDynPET was developed to generate enhanced images (R30) from G30. Two experienced nuclear medicine physicians independently evaluated subjective image quality using a 5-point Likert scale. Standardized uptake values (SUV), standard deviations, liver signal-to-noise ratio (SNR), lesion tumor-to-background ratio (TBR), and contrast-to-noise ratio (CNR) were compared. Subgroup analyses evaluated performance across demographics, and lesion detectability was evaluated using the external dataset. RaDynPET was also compared to other deep learning methods. In the internal cohort, R30 demonstrated significantly higher image quality scores than G30 and G120. R30 showed excellent agreement with G120 for liver and lesion SUV values and surpassed G120 in liver SNR and CNR. The liver SNR and CNR of R30 were comparable to those of G120 in the thin subgroup, and the CNR of R30 was comparable to that of G120 in the young age subgroup. In the external cohort, R30 maintained strong SUV agreement with G120, with lesion-level sensitivity and specificity of 95.45% and 98.41%, respectively. There was no statistical difference in lesion detection between R30 and G120. RaDynPET achieved the highest PSNR and SSIM among the deep learning methods compared. The RaDynPET model effectively restored high image quality while maintaining SUV agreement for ¹⁸F-FDG PET scans acquired in 25% of the standard acquisition time.
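
The abstract compares liver SNR, lesion TBR, and CNR between reconstructions without giving formulas. The sketch below uses common definitions from the PET literature as a hedged illustration; the authors' exact measurement protocol may differ.

```python
# Common PET image-quality metrics, computed from ROI SUV statistics.
import numpy as np

def liver_snr(liver_suv):
    """SNR of a liver ROI: mean SUV divided by its standard deviation."""
    liver_suv = np.asarray(liver_suv, dtype=float)
    return liver_suv.mean() / liver_suv.std()

def lesion_tbr(lesion_suv_max, background_suv_mean):
    """Tumor-to-background ratio."""
    return lesion_suv_max / background_suv_mean

def lesion_cnr(lesion_suv_mean, background_suv_mean, background_suv_sd):
    """Contrast-to-noise ratio: lesion contrast over background noise."""
    return (lesion_suv_mean - background_suv_mean) / background_suv_sd

liver_roi = np.random.default_rng(0).normal(2.2, 0.25, size=500)  # synthetic SUVs
print(f"liver SNR: {liver_snr(liver_roi):.1f}")
print(f"TBR: {lesion_tbr(8.4, 2.2):.2f}, CNR: {lesion_cnr(6.9, 2.2, 0.25):.1f}")
```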

Matrix completion-informed deep unfolded equilibrium models for self-supervised k-space interpolation in MRI.

Luo C, Wang H, Liu Y, Xie T, Chen G, Jin Q, Liang D, Cui ZX

pubmed logopapers · Jun 5, 2025
Self-supervised methods for magnetic resonance imaging (MRI) reconstruction have garnered significant interest due to their ability to address the challenges of slow data acquisition and scarcity of fully sampled labels. Current regularization-based self-supervised techniques merge the theoretical foundations of regularization with the representational strengths of deep learning and enable effective reconstruction under higher acceleration rates, yet often fall short in interpretability and lack firm theoretical underpinnings. In this paper, we introduce a novel self-supervised approach that provides stringent theoretical guarantees and interpretable networks while circumventing the need for fully sampled labels. Our method exploits the intrinsic relationship between convolutional neural networks and the null space within structural low-rank models, effectively integrating network parameters into an iterative reconstruction process. Our network learns the gradient descent steps of the projected gradient descent algorithm without changing its convergence properties, implementing a fully interpretable unfolded model. We design a non-expansive mapping for the network architecture, ensuring convergence to a fixed point. This well-defined framework enables complete reconstruction of missing k-space data grounded in matrix completion theory, independent of fully sampled labels. Qualitative and quantitative experimental results on multi-coil MRI reconstruction demonstrate the efficacy of our self-supervised approach, showing marked improvements over existing self-supervised and traditional regularization methods and achieving results comparable to supervised learning in selected scenarios; the method also delivers competitive performance when applied in supervised settings. This work not only advances the state of the art in MRI reconstruction but also enhances interpretability in deep learning applications for medical imaging.
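
The core construction, alternating a non-expansive mapping with a data-consistency projection and iterating to a fixed point, can be illustrated in a few lines. In the toy sketch below, the learned network is replaced by a hand-crafted non-expansive map (a convex blend with a conjugate-symmetry projection), so it shows only the structure of the iteration, not the authors' model.

```python
# Toy fixed-point k-space interpolation: prior step + data consistency.
import numpy as np

def data_consistency(k, k_acq, mask):
    """Hard projection onto k-spaces that agree with the acquired samples
    (the matrix-completion constraint)."""
    return np.where(mask, k_acq, k)

def prior_step(k, alpha=0.5):
    """Stand-in for the learned non-expansive map: blend with a
    real-image (conjugate-symmetry) projection; convex combinations of
    non-expansive maps remain non-expansive."""
    img_real = np.real(np.fft.ifft2(k))
    return (1 - alpha) * k + alpha * np.fft.fft2(img_real)

rng = np.random.default_rng(1)
truth = np.fft.fft2(rng.standard_normal((64, 64)))   # fully sampled k-space
mask = rng.random((64, 64)) < 0.4                    # 40% sampling pattern
k_acq = np.where(mask, truth, 0)

k = k_acq.copy()
for _ in range(100):                                 # iterate to a fixed point
    k = data_consistency(prior_step(k), k_acq, mask)
err = np.linalg.norm(k - truth) / np.linalg.norm(truth)
print(f"relative k-space error: {err:.3f}")
```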

Long-Term Prognostic Implications of Thoracic Aortic Calcification on CT Using Artificial Intelligence-Based Quantification in a Screening Population: A Two-Center Study.

Lee JE, Kim NY, Kim YH, Kwon Y, Kim S, Han K, Suh YJ

pubmed logopapers · Jun 4, 2025
BACKGROUND. The importance of including thoracic aortic calcification (TAC), in addition to coronary artery calcification (CAC), in prognostic assessments has been difficult to determine, partly due to the greater challenge of performing standardized TAC assessments. OBJECTIVE. The purpose of this study was to evaluate the long-term prognostic implications of TAC assessed using artificial intelligence (AI)-based quantification on routine chest CT in a screening population. METHODS. This retrospective study included 7404 asymptomatic individuals (median age, 53.9 years; 5875 men, 1529 women) who underwent nongated noncontrast chest CT as part of a national general health screening program at one of two centers from January 2007 to December 2014. A commercial AI program quantified TAC and CAC using Agatston scores, which were stratified into categories. Radiologists manually quantified TAC and CAC in 2567 examinations. The role of AI-based TAC categories in predicting major adverse cardiovascular events (MACE) and all-cause mortality (ACM), independent of AI-based CAC categories as well as clinical and laboratory variables, was assessed by multivariable Cox proportional hazards models using data from both centers, and by concordance statistics from prognostic models developed and tested using center 1 and center 2 data, respectively. RESULTS. AI-based and manual quantification showed excellent agreement for TAC and CAC (concordance correlation coefficient: 0.967 and 0.895, respectively). The median observation periods were 7.5 years for MACE (383 events in 5342 individuals) and 11.0 years for ACM (292 events in 7404 individuals). When adjusted for AI-based CAC categories along with clinical and laboratory variables, the risk of MACE was not independently associated with any AI-based TAC category; the risk of ACM was independently associated with an AI-based TAC score of 1001-3000 (HR = 2.14, p = .02) but not with other AI-based TAC categories. When prognostic models were tested, the addition of AI-based TAC categories did not improve model fit relative to models containing clinical variables, laboratory variables, and AI-based CAC categories for MACE (concordance index [C-index] = 0.760 vs 0.760, p = .81) or ACM (C-index = 0.823 vs 0.830, p = .32). CONCLUSION. The addition of TAC to models containing CAC provided limited improvement in risk prediction in an asymptomatic screening population undergoing CT. CLINICAL IMPACT. AI-based quantification provides a standardized approach for better understanding the potential role of TAC as a predictive imaging biomarker.
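
The statistical backbone here is a multivariable Cox model with AI-derived calcification categories as covariates. Below is a hedged sketch using the lifelines package on synthetic data; the Agatston cut-points and covariate set are assumptions (the abstract names a 1001-3000 TAC category but not the full scheme).

```python
# Cox proportional hazards on synthetic screening data (lifelines).
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

def agatston_category(score):
    """Map an Agatston score to an ordinal category (assumed cut-points)."""
    for i, upper in enumerate([0, 100, 400, 1000, 3000]):
        if score <= upper:
            return i
    return 5

rng = np.random.default_rng(0)
n = 300
tac = rng.choice(5, size=n)                   # AI-based TAC category (toy)
cac = rng.choice(4, size=n)                   # AI-based CAC category (toy)
age = rng.normal(54, 8, size=n)
hazard = np.exp(0.3 * tac + 0.4 * cac + 0.03 * (age - 54))
time = rng.exponential(10 / hazard)
df = pd.DataFrame({"time": np.minimum(time, 12),          # censor at 12 y
                   "event": (time < 12).astype(int),
                   "tac_cat": tac, "cac_cat": cac, "age": age})

cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
print(cph.summary[["exp(coef)", "p"]])        # hazard ratios per covariate
```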

ReXVQA: A Large-scale Visual Question Answering Benchmark for Generalist Chest X-ray Understanding

Ankit Pal, Jung-Oh Lee, Xiaoman Zhang, Malaikannan Sankarasubbu, Seunghyeon Roh, Won Jung Kim, Meesun Lee, Pranav Rajpurkar

arxiv logopreprint · Jun 4, 2025
We present ReXVQA, the largest and most comprehensive benchmark for visual question answering (VQA) in chest radiology, comprising approximately 696,000 questions paired with 160,000 chest X-ray studies across training, validation, and test sets. Unlike prior efforts that rely heavily on template-based queries, ReXVQA introduces a diverse and clinically authentic task suite reflecting five core radiological reasoning skills: presence assessment, location analysis, negation detection, differential diagnosis, and geometric reasoning. We evaluate eight state-of-the-art multimodal large language models, including MedGemma-4B-it, Qwen2.5-VL, Janus-Pro-7B, and Eagle2-9B. The best-performing model (MedGemma) achieves 83.24% overall accuracy. To bridge the gap between AI performance and clinical expertise, we conducted a comprehensive human reader study involving three radiology residents on 200 randomly sampled cases. In this study, MedGemma achieved superior performance (83.84% accuracy) compared with human readers (best radiology resident: 77.27%), a milestone in which AI performance exceeded that of expert human readers on chest X-ray interpretation. The reader study reveals distinct performance patterns between AI models and human experts, with strong inter-reader agreement among radiologists but more variable agreement between human readers and AI models. ReXVQA establishes a new standard for evaluating generalist radiological AI systems, offering public leaderboards, fine-grained evaluation splits, structured explanations, and category-level breakdowns. This benchmark lays the foundation for next-generation AI systems capable of mimicking expert-level clinical reasoning beyond narrow pathology classification. Our dataset will be open-sourced at https://huggingface.co/datasets/rajpurkarlab/ReXVQA
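
Scoring such a benchmark reduces to multiple-choice accuracy plus agreement statistics. The sketch below shows one plausible way to compute them with scikit-learn on toy labels; whether the authors used Cohen's kappa specifically for inter-reader agreement is an assumption here.

```python
# Accuracy vs. an answer key and pairwise reader agreement on toy labels.
from sklearn.metrics import accuracy_score, cohen_kappa_score

gold      = ["A", "C", "B", "D", "A", "B"]     # toy answer key
medgemma  = ["A", "C", "B", "D", "B", "B"]     # toy model predictions
resident1 = ["A", "C", "D", "D", "A", "C"]
resident2 = ["A", "B", "D", "D", "A", "C"]

print("model accuracy:   ", accuracy_score(gold, medgemma))
print("resident accuracy:", accuracy_score(gold, resident1))
print("reader agreement: ", cohen_kappa_score(resident1, resident2))
```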

Enhanced risk stratification for stage II colorectal cancer using deep learning-based CT classifier and pathological markers to optimize adjuvant therapy decision.

Huang YQ, Chen XB, Cui YF, Yang F, Huang SX, Li ZH, Ying YJ, Li SY, Li MH, Gao P, Wu ZQ, Wen G, Wang ZS, Wang HX, Hong MP, Diao WJ, Chen XY, Hou KQ, Zhang R, Hou J, Fang Z, Wang ZN, Mao Y, Wee L, Liu ZY

pubmed logopapers · Jun 4, 2025
Current risk stratification for stage II colorectal cancer (CRC) has limited accuracy in identifying patients who would benefit from adjuvant chemotherapy, leading to potential over- or under-treatment. We aimed to develop a more precise risk stratification system by integrating artificial intelligence-based imaging analysis with pathological markers. We analyzed 2,992 stage II CRC patients from 12 centers. A deep learning classifier (Swin Transformer Assisted Risk-stratification for CRC, STAR-CRC) was developed using multi-planar CT images from 1,587 patients (training:internal validation = 7:3) and validated in 1,405 patients from 8 independent centers; it stratified patients into low-, uncertain-, and high-risk groups. To further refine the uncertain-risk group, a composite score based on pathological markers (pT4 stage, number of lymph nodes sampled, perineural invasion, and lymphovascular invasion) was applied, forming the intelligent risk integration system for stage II CRC (IRIS-CRC). IRIS-CRC was compared against the guideline-based risk stratification system (GRSS-CRC) for prediction performance in the external validation dataset. IRIS-CRC stratified patients into four prognostic groups with distinct 3-year disease-free survival rates (≥95%, 95-75%, 75-55%, ≤55%). Upon external validation, compared with GRSS-CRC, IRIS-CRC downstaged 27.1% of high-risk patients into the Favorable group, while upstaging 6.5% of low-risk patients into the Very Poor prognosis group, who might require more aggressive treatment. In the GRSS-CRC intermediate-risk group of the external validation dataset, IRIS-CRC reclassified 40.1% as Favorable prognosis and 7.0% as Very Poor prognosis. IRIS-CRC's performance generalized to both the chemotherapy and non-chemotherapy cohorts. IRIS-CRC offers a more precise and personalized risk assessment than current guideline-based risk factors, potentially sparing low-risk patients from unnecessary adjuvant chemotherapy while identifying high-risk individuals for more aggressive treatment. This novel approach holds promise for improving clinical decision-making and outcomes in stage II CRC.
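
The two-stage logic, CT classifier first and pathology composite only for the uncertain group, can be sketched as follows. The marker list comes from the abstract; the point weights, the 12-node adequacy cut-off, and the mapping to the four prognostic groups are illustrative assumptions, not the published scoring.

```python
# Hedged sketch of a two-stage risk-integration rule in the IRIS-CRC style.

def pathology_score(pT4, nodes_sampled, perineural, lymphovascular):
    """Composite score from the four pathological markers (assumed weights)."""
    return (2 * pT4                      # pT4 stage weighted most heavily
            + (nodes_sampled < 12)       # inadequate sampling (assumed cut-off)
            + perineural
            + lymphovascular)

def iris_crc(ct_risk_group, **markers):
    """Map the CT classifier's group plus pathology to a 4-level prognosis."""
    if ct_risk_group == "low":
        return "Favorable"
    if ct_risk_group == "high":
        return "Very Poor"
    score = pathology_score(**markers)   # refine the uncertain-risk group only
    if score == 0:
        return "Favorable"
    return "Very Poor" if score >= 4 else ("Poor" if score >= 2 else "Intermediate")

print(iris_crc("uncertain", pT4=False, nodes_sampled=15,
               perineural=True, lymphovascular=False))   # -> Intermediate
```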

Subgrouping autism and ADHD based on structural MRI population modelling centiles.

Pecci-Terroba C, Lai MC, Lombardo MV, Chakrabarti B, Ruigrok ANV, Suckling J, Anagnostou E, Lerch JP, Taylor MJ, Nicolson R, Georgiades S, Crosbie J, Schachar R, Kelley E, Jones J, Arnold PD, Seidlitz J, Alexander-Bloch AF, Bullmore ET, Baron-Cohen S, Bedford SA, Bethlehem RAI

pubmed logopapers · Jun 4, 2025
Autism and attention deficit hyperactivity disorder (ADHD) are two highly heterogeneous neurodevelopmental conditions with variable underlying neurobiology. Imaging studies have yielded varied results, and it is now clear that there is unlikely to be one characteristic neuroanatomical profile of either condition. Parsing this heterogeneity could allow us to identify more homogeneous subgroups, either within or across conditions, which may be more clinically informative. This has been a pivotal goal for neurodevelopmental research using both clinical and neuroanatomical features, though results thus far have again been inconsistent with regard to the number and characteristics of subgroups. Here, we use population modelling to cluster a multi-site dataset based on global and regional centile scores of cortical thickness, surface area, and grey matter volume. We use HYDRA, a novel semi-supervised machine learning algorithm that clusters based on differences from controls, and compare its performance to a traditional clustering approach. We identified distinct subgroups within autism and ADHD, as well as across diagnoses, often with opposite neuroanatomical alterations relative to controls. These subgroups were characterised by different combinations of increased and decreased morphometric patterns. We did not find significant clinical differences across subgroups. Crucially, however, the number of subgroups and their membership differed vastly depending on the chosen features and the algorithm used, highlighting the impact and importance of careful method selection. We highlight the importance of examining heterogeneity in autism and ADHD and demonstrate that population modelling is a useful tool for studying subgrouping in these conditions. We identified subgroups with distinct patterns of alterations relative to controls, but note that these results rely heavily on the algorithm used, and we encourage detailed reporting of the methods and features used in future studies.
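
As a concrete reference point, the "traditional clustering approach" the study compares HYDRA against might look like the sketch below: convert morphometrics to centile scores under a normative model, then apply k-means. The feature set and the simplified centile computation are assumptions; HYDRA itself is a distinct semi-supervised method not reproduced here.

```python
# Centile-score features + k-means clustering on toy morphometrics.
import numpy as np
from scipy.stats import norm
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
raw = rng.normal(0, 1, size=(200, 3))      # toy features: CT, SA, GMV
ref_mean, ref_sd = 0.0, 1.0                # normative-model parameters (toy)

z = (raw - ref_mean) / ref_sd              # deviation from the normative model
centiles = norm.cdf(z) * 100               # z-scores -> population centiles

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(centiles)
print(np.bincount(labels))                 # subgroup sizes
```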

Retrieval-Augmented Generation with Large Language Models in Radiology: From Theory to Practice.

Fink A, Rau A, Reisert M, Bamberg F, Russe MF

pubmed logopapers · Jun 4, 2025
Large language models (LLMs) hold substantial promise in addressing the growing workload in radiology, but recent studies also reveal limitations, such as hallucinations and opacity in the sources of LLM responses. Retrieval-augmented generation (RAG)-based LLMs offer a promising approach to streamlining radiology workflows by integrating reliable, verifiable, and customizable information. Ongoing refinement is critical to enable RAG models to manage large amounts of input data and to engage in complex multiagent dialogues. This report provides an overview of recent advances in LLM architecture, including few-shot and zero-shot learning, RAG integration, multistep reasoning, and agentic RAG, and identifies future research directions. Exemplary cases demonstrate the practical application of these techniques in radiology practice. ©RSNA, 2025.
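
The basic RAG pattern the report describes, retrieving verifiable passages and grounding the LLM's answer in them, is compact. In the sketch below, embed() and generate() are hypothetical placeholders standing in for a real embedding model and LLM call; only the retrieve-then-prompt structure is the point.

```python
# Minimal retrieve-then-prompt loop with placeholder embedding/LLM calls.
import numpy as np

def embed(texts):
    """Placeholder: hash-seeded pseudo-embeddings (vary across runs) that
    stand in for a real embedding model."""
    seed = lambda t: np.random.default_rng(abs(hash(t)) % (2**32))
    return np.stack([seed(t).standard_normal(64) for t in texts])

def retrieve(query, docs, doc_vecs, k=2):
    """Top-k documents by cosine similarity to the query embedding."""
    q = embed([query])[0]
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(sims)[::-1][:k]]

docs = ["ACR guideline on incidental lung nodules ...",
        "Departmental protocol for contrast reactions ...",
        "LI-RADS v2018 criteria summary ..."]
doc_vecs = embed(docs)

query = "How should a 7 mm incidental lung nodule be followed up?"
context = "\n".join(retrieve(query, docs, doc_vecs))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQ: {query}"
# response = generate(prompt)   # hypothetical LLM call
print(prompt[:200])
```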

A review on learning-based algorithms for tractography and human brain white matter tracts recognition.

Barati Shoorche A, Farnia P, Makkiabadi B, Leemans A

pubmed logopapers · Jun 4, 2025
Human brain fiber tractography using diffusion magnetic resonance imaging is a crucial stage in mapping brain white matter structures, pre-surgical planning, and extracting connectivity patterns. Accurate and reliable tractography, by providing detailed geometric information about the position of neural pathways, minimizes the risk of damage during neurosurgical procedures. Both tractography itself and its post-processing steps, such as bundle segmentation, are usually used in these contexts. Many approaches have been put forward in the past decades, and recently multiple data-driven tractography algorithms and automatic segmentation pipelines have been proposed to address the limitations of traditional methods. Several of these recent methods are based on learning algorithms that have demonstrated promising results. In this study, in addition to introducing diffusion MRI datasets, we review learning-based algorithms, including conventional machine learning, deep learning, reinforcement learning, and dictionary learning methods, that have been used for white matter tract, nerve, and pathway recognition, as well as for whole-brain streamline or tractogram creation. The contributions of this review are to discuss both tractography and tract recognition methods, to extend previous related reviews with the most recent methods (covering architectures as well as network details), to assess the efficiency of learning-based methods through a comprehensive comparison of the field, and to demonstrate the important role of learning-based methods in tractography.
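
For contrast with the learning-based methods reviewed, classical deterministic streamline tractography can be sketched in a few lines: step along the local principal diffusion direction until an anisotropy or turning-angle criterion stops the track. The synthetic vector field below is illustrative; real pipelines operate on tensor or fODF fits of the diffusion data.

```python
# Toy deterministic streamline tracking on a synthetic direction field.
import numpy as np

def track(seed, peak_dir, fa, step=0.5, fa_thresh=0.2, max_angle_deg=45, n_max=200):
    pts, d_prev = [np.asarray(seed, float)], None
    for _ in range(n_max):
        ijk = tuple(np.clip(pts[-1].round().astype(int), 0, np.array(fa.shape) - 1))
        if fa[ijk] < fa_thresh:               # stop in low-anisotropy tissue
            break
        d = peak_dir[ijk]
        if d_prev is not None:
            if d @ d_prev < 0:
                d = -d                        # keep a consistent orientation
            if d @ d_prev < np.cos(np.radians(max_angle_deg)):
                break                         # stop on sharp turns
        pts.append(pts[-1] + step * d)
        d_prev = d
    return np.array(pts)

fa = np.full((20, 20, 20), 0.6)                         # uniform anisotropy
peak = np.zeros((20, 20, 20, 3)); peak[..., 0] = 1.0    # x-oriented field
print(track([2, 10, 10], peak, fa).shape)               # streamline points
```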

Ultra-High-Resolution Photon-Counting-Detector CT with a Dedicated Denoising Convolutional Neural Network for Enhanced Temporal Bone Imaging.

Chang S, Benson JC, Lane JI, Bruesewitz MR, Swicklik JR, Thorne JE, Koons EK, Carlson ML, McCollough CH, Leng S

pubmed logopapers · Jun 3, 2025
Ultra-high-resolution (UHR) photon-counting-detector (PCD) CT improves image resolution but increases noise, necessitating the use of smoother reconstruction kernels that reduce resolution below the 0.125-mm maximum spatial resolution. To address this issue, a denoising convolutional neural network (CNN) was developed to reduce noise in images reconstructed with the sharpest available kernel while preserving resolution, for enhanced temporal bone visualization. With institutional review board approval, the CNN was trained on 6 patient cases of clinical temporal bone imaging (1885 images) and tested on 20 independent cases using a dual-source PCD-CT (NAEOTOM Alpha). Images were reconstructed using quantum iterative reconstruction at strength 3 (QIR3) with both a clinical routine kernel (Hr84) and the sharpest available head kernel (Hr96). The CNN was applied to images reconstructed with the Hr96 kernel at QIR strength 1 (QIR1). For each case, three series of images (Hr84-QIR3, Hr96-QIR3, and Hr96-CNN) were randomized for review by 2 neuroradiologists, who assessed overall quality and the delineation of the modiolus, stapes footplate, and incudomallear joint. The CNN reduced noise by 80% compared with Hr96-QIR3 and by 50% relative to Hr84-QIR3, while maintaining high resolution. Compared with the conventional method at the same kernel (Hr96-QIR3), Hr96-CNN significantly decreased image noise (from 204.63 to 47.35 HU) and improved the structural similarity index (from 0.72 to 0.99). Hr96-CNN images ranked higher than Hr84-QIR3 and Hr96-QIR3 in overall quality (P < .001). Readers preferred Hr96-CNN for all 3 structures. The proposed CNN significantly reduced image noise in UHR PCD-CT, enabling use of the sharpest kernel. This combination greatly enhanced diagnostic image quality and anatomic visualization.
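
The two headline metrics, ROI noise (HU standard deviation) and the structural similarity index, are straightforward to compute. The sketch below illustrates them on synthetic images with scikit-image; the ROI placement and measurement protocol are assumptions, as the abstract does not specify them.

```python
# ROI noise reduction and SSIM on synthetic "reconstructions".
import numpy as np
from skimage.metrics import structural_similarity as ssim

rng = np.random.default_rng(0)
reference = rng.normal(0, 10, size=(256, 256))           # stand-in for Hr84-QIR3
noisy = reference + rng.normal(0, 200, size=(256, 256))  # sharp-kernel noise
denoised = reference + rng.normal(0, 40, size=(256, 256))

roi = np.s_[100:140, 100:140]                            # assumed uniform region
noise_drop = 1 - denoised[roi].std() / noisy[roi].std()
print(f"noise reduction: {noise_drop:.0%}")

drange = float(max(noisy.max(), reference.max()) - min(noisy.min(), reference.min()))
print(f"SSIM noisy:    {ssim(reference, noisy, data_range=drange):.2f}")
print(f"SSIM denoised: {ssim(reference, denoised, data_range=drange):.2f}")
```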
