
Pillai, K., Nausheen, F.

medRxiv preprint · Oct 1, 2025
Aims: Given the advent of large language models (LLMs), the number of potential applications of artificial intelligence technologies in radiology has rapidly increased. Recently, several studies have evaluated the accuracy and quality of LLMs in characterizing CT and MRI scans. Yet, to our knowledge, few studies have reported the utility of these models in generating BI-RADS assessment categories. Methods: A breast ultrasound dataset including 256 images from 256 patients, manually interpreted and labeled by radiologists according to BI-RADS features and lexicon, was used to evaluate Gemini 2.0 Flash. We prompted the model to assess images in individual context windows and tested it with two variations of the original prompt (n = 3). Statistical analyses were then performed comparing the model's predictions to the ground truth, and receiver operating characteristic area under the curve (ROC-AUC) analysis was calculated for each classification type from individual replicates. Results: The overall accuracy of Gemini 2.0 was 19.01% in predicting the BI-RADS classification of the breast lesions, and per-category accuracies did not significantly differ from one another. In the ROC-AUC analysis, all category scores ranged from 0.5 to 0.6; the model performed slightly better at categorizing benign lesions (1-4a), while categories with a greater probability of malignancy (4b-5) were akin to random chance. Furthermore, among incorrect predictions, the model was generally within 1-2 categories of the true classification, demonstrating precision too low to be reliable for realistic clinical use. Conclusions: This work highlights the current limitations of artificial intelligence models in classifying clinical images, and further development of these technologies is required before translation into the clinical setting. To our knowledge, this is the first study to report the capabilities of LLMs in performing BI-RADS classification of breast lesions with replicates.
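
As a hedged sketch only (not the authors' pipeline), the per-category ROC-AUC and overall accuracy described above can be computed roughly as follows, assuming a hypothetical BI-RADS label set and hard model predictions:

```python
# Hedged sketch, not the authors' code: per-category (one-vs-rest) ROC-AUC and
# overall accuracy for BI-RADS predictions. Labels and data are hypothetical.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

categories = ["2", "3", "4a", "4b", "5"]           # hypothetical label set
y_true = np.array(["3", "4a", "5", "2", "4b"])     # radiologist ground truth
y_pred = np.array(["3", "3", "4a", "2", "5"])      # model predictions

true_bin = label_binarize(y_true, classes=categories)
pred_bin = label_binarize(y_pred, classes=categories)
for i, cat in enumerate(categories):
    # AUC from hard predictions; per-class probabilities would be preferable.
    print(f"BI-RADS {cat}: AUC = {roc_auc_score(true_bin[:, i], pred_bin[:, i]):.2f}")

print(f"overall accuracy: {np.mean(y_true == y_pred):.2%}")  # 19.01% in the study
```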

Lin R, Kebiri H, Gholipour A, Chen Y, Thiran JP, Karimi D, Bach Cuadra M

PubMed paper · Oct 1, 2025
The accurate estimation of fiber orientation distribution functions (fODFs) in diffusion magnetic resonance imaging (MRI) is crucial for understanding early brain development and its potential disruptions. Although supervised deep learning (DL) models have shown promise in fODF estimation from neonatal diffusion MRI (dMRI) data, the out-of-domain (OOD) performance of these models remains largely unexplored, especially under diverse domain shift scenarios. This study evaluated the robustness of three state-of-the-art DL architectures, multilayer perceptron (MLP), transformer, and U-Net/convolutional neural network (CNN), on fODF predictions derived from dMRI data. Using 488 subjects from the developing Human Connectome Project (dHCP) and the Baby Connectome Project (BCP) datasets, we reconstructed reference fODFs from the full dMRI series using single-shell three-tissue constrained spherical deconvolution (SS3T-CSD) and multi-shell multi-tissue CSD (MSMT-CSD) for model training, and systematically assessed the impact of age, scanner/protocol differences, and input dimensionality on model performance. Our findings reveal that U-Net consistently outperformed other models when fewer diffusion gradient directions were used, particularly with the SS3T-CSD-derived ground truth, which showed superior performance in capturing crossing fibers. However, as the number of input diffusion gradient directions increased, MLP and the transformer-based model exhibited steady gains in accuracy. Nevertheless, performance nearly plateaued from 28 to 45 input directions in all models. Age-related domain shifts showed asymmetric patterns, being less pronounced in late developmental stages (late neonates and babies), with SS3T-CSD demonstrating greater robustness to variability compared to MSMT-CSD. To address inter-site domain shifts, we implemented two adaptation strategies: the Method of Moments (MoM) and fine-tuning. Both strategies achieved significant improvements (p < 0.05) in over 95% of tested configurations, with fine-tuning consistently yielding superior results and U-Net benefiting the most from increased target subjects. This study represents the first systematic evaluation of OOD settings in DL applications to fODF estimation, providing critical insights into model robustness and adaptation strategies for diverse clinical and research applications.
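
As a loose illustration of one of the adaptation strategies named above, the sketch below matches the first two moments of target-site signals to source-site statistics. This is an assumption about what a Method of Moments (MoM) step could look like, not the paper's actual implementation:

```python
# Hedged sketch (assumption, not the paper's code): a Method-of-Moments style
# adaptation that shifts/scales target-site dMRI signals so their mean and
# standard deviation match the source-site statistics before inference.
import numpy as np

def match_moments(target: np.ndarray, source_mean: float, source_std: float) -> np.ndarray:
    """Standardize target data, then rescale to the source domain's moments."""
    t_mean, t_std = target.mean(), target.std()
    return (target - t_mean) / (t_std + 1e-8) * source_std + source_mean

rng = np.random.default_rng(0)
source = rng.normal(loc=1.0, scale=0.20, size=10_000)   # e.g. dHCP-like signals
target = rng.normal(loc=0.7, scale=0.35, size=10_000)   # e.g. BCP-like signals

adapted = match_moments(target, source.mean(), source.std())
print(adapted.mean(), adapted.std())   # approximately 1.0 and 0.2 after adaptation
```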

Zhenyue Qin, Yang Liu, Yu Yin, Jinyu Ding, Haoran Zhang, Anran Li, Dylan Campbell, Xuansheng Wu, Ke Zou, Tiarnan D. L. Keenan, Emily Y. Chew, Zhiyong Lu, Yih-Chung Tham, Ninghao Liu, Xiuzhen Zhang, Qingyu Chen

arXiv preprint · Sep 30, 2025
Vision-threatening eye diseases pose a major global health burden, with timely diagnosis limited by workforce shortages and restricted access to specialized care. While multimodal large language models (MLLMs) show promise for medical image interpretation, advancing MLLMs for ophthalmology is hindered by the lack of comprehensive benchmark datasets suitable for evaluating generative models. We present a large-scale multimodal ophthalmology benchmark comprising 32,633 instances with multi-granular annotations across 12 common ophthalmic conditions and 5 imaging modalities. The dataset integrates imaging, anatomical structures, demographics, and free-text annotations, supporting anatomical structure recognition, disease screening, disease staging, and demographic prediction for bias evaluation. This work extends our preliminary LMOD benchmark with three major enhancements: (1) nearly 50% dataset expansion with substantial enlargement of color fundus photography; (2) broadened task coverage including binary disease diagnosis, multi-class diagnosis, severity classification with international grading standards, and demographic prediction; and (3) systematic evaluation of 24 state-of-the-art MLLMs. Our evaluations reveal both promise and limitations. Top-performing models achieved ~58% accuracy in disease screening under zero-shot settings, and performance remained suboptimal for challenging tasks like disease staging. We will publicly release the dataset, curation pipeline, and leaderboard to potentially advance ophthalmic AI applications and reduce the global burden of vision-threatening diseases.
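
For context, a minimal, hypothetical scoring loop for the zero-shot disease-screening task might look like the sketch below; the benchmark's released evaluation pipeline may differ:

```python
# Hedged sketch (assumption): scoring zero-shot disease-screening answers from
# an MLLM by normalizing free-text outputs to yes/no and computing accuracy.
from typing import List

def normalize(answer: str) -> int:
    """Map a free-text model answer to 1 (disease present) or 0 (absent)."""
    a = answer.strip().lower()
    return 1 if a.startswith("yes") or "present" in a else 0

def screening_accuracy(model_answers: List[str], labels: List[int]) -> float:
    preds = [normalize(a) for a in model_answers]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

answers = ["Yes, signs of diabetic retinopathy.", "No abnormality detected.", "Yes."]
labels = [1, 0, 0]
print(f"accuracy = {screening_accuracy(answers, labels):.2%}")
```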

Jung M, Reisert M, Rieder H, Rospleszcz S, Lu MT, Bamberg F, Raghu VK, Weiss J

PubMed paper · Sep 30, 2025
Background: Current measures of adiposity have limitations. Artificial intelligence (AI) models may accurately and efficiently estimate body composition (BC) from routine imaging. Objective: To assess the association of AI-derived BC compartments from magnetic resonance imaging (MRI) with cardiometabolic outcomes. Design: Prospective cohort study. Setting: UK Biobank (UKB) observational cohort study. Participants: 33,432 UKB participants with no history of diabetes, myocardial infarction, or ischemic stroke (mean age, 65.0 years [SD, 7.8]; mean body mass index [BMI], 25.8 kg/m² [SD, 4.2]; 52.8% female) who underwent whole-body MRI. Measurements: An AI tool was applied to MRI to derive 3-dimensional (3D) BC measures, including subcutaneous adipose tissue (SAT), visceral adipose tissue (VAT), skeletal muscle (SM), and SM fat fraction (SMFF), and then calculate their relative distribution. Sex-stratified associations of these relative compartments with incident diabetes mellitus (DM) and major adverse cardiovascular events (MACE) were assessed using restricted cubic splines. Results: Adipose tissue compartments and SMFF increased and SM decreased with age. After adjustment for age, smoking, and hypertension, greater adiposity and lower SM proportion were associated with higher incidence of DM and MACE after a median follow-up of 4.2 years in sex-stratified analyses; however, after additional adjustment for BMI and waist circumference (WC), only elevated VAT proportions and high SMFF (top fifth percentile in the cohort for each) were associated with increased risk for DM (respective adjusted hazard ratios [aHRs], 2.16 [95% CI, 1.59 to 2.94] and 1.27 [CI, 0.89 to 1.80] in females and 1.84 [CI, 1.48 to 2.27] and 1.84 [CI, 1.43 to 2.37] in males) and MACE (1.37 [CI, 1.00 to 1.88] and 1.72 [CI, 1.23 to 2.41] in females and 1.22 [CI, 0.99 to 1.50] and 1.25 [CI, 0.98 to 1.60] in males). In addition, in males only, those in the bottom fifth percentile of SM proportion had increased risk for DM (aHR for the bottom fifth percentile of the cohort, 1.96 [CI, 1.45 to 2.65]) and MACE (aHR, 1.55 [CI, 1.15 to 2.09]). Limitation: Results may not be generalizable to non-Whites or people outside the United Kingdom. Conclusion: AI-derived BC proportions were strongly associated with cardiometabolic risk, but after BMI and WC were accounted for, only VAT proportion and SMFF (both sexes) and SM proportion (males only) added prognostic information. Primary funding source: None.
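
A minimal sketch of how adjusted hazard ratios of this kind are typically estimated is shown below, using a Cox model on synthetic data; the study additionally modeled associations with restricted cubic splines, which this toy example omits:

```python
# Hedged sketch (not the authors' analysis): Cox proportional hazards model for
# incident diabetes as a function of an AI-derived VAT proportion, with a few
# covariates. All data are synthetic and for illustration only.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(42)
n = 500
vat = rng.uniform(0.05, 0.45, n)             # VAT proportion of total volume
age = rng.normal(65, 8, n)
bmi = rng.normal(26, 4, n)

# Hypothetical event times with hazard increasing with VAT proportion,
# censored at a 4.2-year follow-up horizon (as in the abstract).
baseline = rng.exponential(scale=10.0, size=n)
time = baseline * np.exp(-3.0 * (vat - 0.2))
event = (time < 4.2).astype(int)
time = np.minimum(time, 4.2)

df = pd.DataFrame({"time": time, "event": event,
                   "vat_proportion": vat, "age": age, "bmi": bmi})
cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
cph.print_summary()   # the exp(coef) column gives the adjusted hazard ratios
```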

Dayan GS, Hénique G, Bahig H, Nelson K, Brodeur C, Christopoulos A, Filion E, Nguyen-Tan PF, O'Sullivan B, Ayad T, Bissada E, Tabet P, Guertin L, Desilets A, Kadoury S, Letourneau-Guillon L

PubMed paper · Sep 30, 2025
Although not included in the eighth edition of the American Joint Committee on Cancer Staging System, there is growing evidence suggesting that imaging-based extranodal extension (iENE) is associated with worse outcomes in HPV-associated oropharyngeal carcinoma (OPC). Key challenges with iENE include the lack of standardized criteria, reliance on radiological expertise, and interreader variability. To develop an artificial intelligence (AI)-driven pipeline for lymph node segmentation and iENE classification using pretreatment computed tomography (CT) scans, and to evaluate its association with oncologic outcomes in HPV-positive OPC. This was a single-center cohort study conducted at a tertiary oncology center in Montreal, Canada, of adult patients with HPV-positive cN+ OPC treated with up-front (chemo)radiotherapy from January 2009 to January 2020. Participants were followed up until January 2024. Data analysis was performed from March 2024 to April 2025. Pretreatment planning CT scans along with lymph node gross tumor volume segmentations performed by expert radiation oncologists were extracted. For lymph node segmentation, an nnU-Net model was developed. For iENE classification, radiomic and deep learning feature extraction methods were compared. iENE classification accuracy was assessed against 2 expert neuroradiologist evaluations using the area under the receiver operating characteristic curve (AUC). Subsequently, the association of AI-predicted iENE with oncologic outcomes, ie, overall survival (OS), recurrence-free survival (RFS), distant control (DC), and locoregional control (LRC), was assessed. Among 397 patients (mean [SD] age, 62.3 [9.1] years; 80 females [20.2%] and 317 males [79.8%]), AI-iENE classification using radiomics achieved an AUC of 0.81. Patients with AI-predicted iENE had worse 3-year OS (83.8% vs 96.8%), RFS (80.7% vs 93.7%), and DC (84.3% vs 97.1%), but similar LRC. AI-iENE had significantly higher concordance indices than radiologist-assessed iENE for OS (0.64 vs 0.55), RFS (0.67 vs 0.60), and DC (0.79 vs 0.68). In multivariable analysis, AI-iENE remained independently associated with OS (adjusted hazard ratio [aHR], 2.82; 95% CI, 1.21-6.57), RFS (aHR, 4.20; 95% CI, 1.93-9.11), and DC (aHR, 12.33; 95% CI, 4.15-36.67), adjusting for age, tumor category, node category, and number of lymph nodes. This single-center cohort study found that an AI-driven pipeline can successfully automate lymph node segmentation and iENE classification from pretreatment CT scans in HPV-associated OPC. Predicted iENE was independently associated with worse oncologic outcomes. External validation is required to assess generalizability and the potential for implementation in institutions without specialized imaging expertise.
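
As an illustration only (not the study's pipeline), a radiomics-based iENE classifier could be evaluated against expert consensus with ROC-AUC and against survival with a concordance index, roughly as follows:

```python
# Hedged sketch with synthetic data: ROC-AUC of a radiomics-based iENE
# classifier against expert labels, plus a concordance index versus survival.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from lifelines.utils import concordance_index

rng = np.random.default_rng(0)
n = 200
radiomic_features = rng.normal(size=(n, 50))      # hypothetical radiomic features
iene_consensus = rng.integers(0, 2, size=n)       # expert neuroradiologist labels

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(radiomic_features[:150], iene_consensus[:150])
scores = clf.predict_proba(radiomic_features[150:])[:, 1]
print("AUC vs expert consensus:", roc_auc_score(iene_consensus[150:], scores))

# Concordance of predicted iENE probability with overall survival; the score is
# negated because higher iENE probability should imply shorter survival.
survival_months = rng.exponential(60, size=50)
event_observed = rng.integers(0, 2, size=50)
print("C-index:", concordance_index(survival_months, -scores, event_observed))
```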

Saludar CJA, Tayebi M, Kwon E, McGeown J, Schierding W, Wang A, Fernandez J, Holdsworth S, Shim V

PubMed paper · Sep 30, 2025
Traumatic brain injury (TBI) is a global health concern, with mild TBI (mTBI) being the most common form. Despite its prevalence, accurately diagnosing mTBI remains a significant challenge. While advanced neuroimaging techniques like diffusion tensor imaging (DTI) offer promise for more robust diagnosis, their clinical application is limited by inconsistent and heterogeneous post-injury findings. Recently, machine learning (ML) techniques utilizing DTI metrics as features have shown increasing utility in mTBI research. This approach helps identify distinct between-group features, paving the way for more precise and efficient diagnostic and prognostic tools. This review aims to analyze studies employing ML techniques to assess changes in DTI metrics after mTBI. Study type: Systematic review. Population: Human subjects; adhering to PRISMA guidelines, the review identified 36 articles on the application of ML with DTI for mTBI diagnosis and prognosis. Field strength/sequence: N/A. Assessment: Study quality was assessed using the Modified QualSyst Assessment Tool. Statistical tests: N/A. Results: The review found that ML techniques using DTI metrics, either alone or in combination with other modalities (i.e., structural MRI, functional MRI, clinical scores, or demographics), can effectively classify mTBI patients from controls. These approaches have also demonstrated potential in classifying mTBI patients according to the degree of recovery and symptom severity. In addition, these ML models showed strong predictive power for cognitive scores and brain structural decline, as quantified by brain-predicted age difference. Conclusion: Larger, externally validated studies are needed to develop robust models for the diagnosis and prognosis of mTBI, using imaging biomarkers (including DTI) in conjunction with non-imaging, on-field, or clinical data. Despite the high predictive performance of ML algorithms, clinical application remains distant, likely due to the small sample sizes of studies and lack of external validation, which raise concerns about overfitting. Evidence level: 5. Technical efficacy: Stage 1.
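
For orientation, a typical pipeline of the kind reviewed, classifying mTBI from DTI metrics with a cross-validated classifier, might look like this hypothetical sketch:

```python
# Hedged, illustrative sketch only: mTBI vs control classification from DTI
# metrics (regional FA and MD values) with a cross-validated SVM. Data are
# synthetic; real studies would use features extracted from tractography atlases.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n_subjects, n_regions = 80, 48
fa = rng.normal(0.45, 0.05, size=(n_subjects, n_regions))   # fractional anisotropy
md = rng.normal(0.80, 0.05, size=(n_subjects, n_regions))   # mean diffusivity
X = np.hstack([fa, md])
y = rng.integers(0, 2, size=n_subjects)                      # 1 = mTBI, 0 = control

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(clf, X, y, cv=5)
print(f"cross-validated accuracy: {scores.mean():.2f} ± {scores.std():.2f}")
```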

Xu Y, Kang D, Shi D, Tham YC, Grzybowski A, Jin K

PubMed paper · Sep 30, 2025
Accurate ophthalmic imaging reports, including fundus fluorescein angiography (FFA) and ocular B-scan ultrasound, are essential for effective clinical decision-making. The current process, involving drafting by residents followed by review by ophthalmic technicians and ophthalmologists, is time-consuming and prone to errors. This study evaluates the effectiveness of ChatGPT-4o in auditing errors in FFA and ocular B-scan reports and assesses its potential to reduce time and costs within the reporting workflow. A preliminary set of 100 FFA and 80 ocular B-scan reports drafted by residents was analyzed using GPT-4o to identify errors in eye laterality (left vs right) and incorrect anatomical descriptions. The accuracy of GPT-4o was compared with that of retinal specialists, general ophthalmologists, and ophthalmic technicians. Additionally, a cost-effectiveness analysis was conducted to estimate time and cost savings from integrating GPT-4o into the reporting process, and a pilot real-world validation comparing GPT-4o with human reviewers on 20 erroneous reports was performed. GPT-4o demonstrated a detection rate of 79.0% (158 of 200; 95% CI 73.0-85.0) across all examinations, comparable to the average detection performance of general ophthalmologists (78.0% [155 of 200; 95% CI 72.0-83.0]; P ≥ 0.09). Integration of GPT-4o reduced the average report review time by 86%, completing 180 ophthalmic reports in approximately 0.27 h compared with 2.17-3.19 h for human ophthalmologists. Compared with human reviewers, GPT-4o also lowered the cost from $0.21 to $0.03 per report (a saving of $0.18). In the real-world evaluation, GPT-4o detected 18 of 20 errors with no false positives, compared with detection rates of 95-100% by human reviewers. GPT-4o effectively enhances the accuracy of ophthalmic imaging reports by identifying and correcting common errors. Its implementation can potentially alleviate the workload of ophthalmologists, streamline the reporting process, and reduce associated costs, thereby improving overall clinical workflow and patient outcomes.
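
The headline detection rate and per-report cost saving quoted above can be reproduced with simple arithmetic; the confidence-interval method below is an assumption, since the abstract does not state which one was used:

```python
# Hedged sketch: detection rate with a normal-approximation 95% CI (an
# assumption; it lands close to the reported 73.0-85.0), plus the per-report
# cost saving taken directly from the abstract.
from statsmodels.stats.proportion import proportion_confint

detected, total = 158, 200
rate = detected / total
low, high = proportion_confint(detected, total, alpha=0.05, method="normal")
print(f"detection rate: {rate:.1%} (95% CI {low:.1%}-{high:.1%})")

cost_human, cost_gpt = 0.21, 0.03        # USD per report, from the abstract
print(f"saving per report: ${cost_human - cost_gpt:.2f}")   # $0.18
```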

Pan T, Thomas MA, Lu Y, Luo D

PubMed paper · Sep 30, 2025
Misregistration between CT and PET can result in mis-localization and inaccurate quantification of tracer uptake in PET. Data-driven gated (DDG) CT can correct registration and quantification but requires a radiation dose of 1.3 mSv and 1 min of acquisition time. AI registration (AIR) does not require an additional CT and has been validated to improve registration and reduce the 'banana' misregistration artifacts around the diaphragm. We aimed to compare a validated AIR and DDG CT in registration and quantification of avid thoracic lesions misregistered in DDG PET scans. Thirty PET/CT patients (23 with 18F-FDG, 4 with 68Ga-Dotatate, and 3 with 18F-PSMA piflufolastat) with at least one misregistered avid lesion in the thorax were recruited. Patient studies were conducted on GE Discovery MI PET/CT scanners using DDG CT to correct misregistration with DDG PET data from the 30-80% respiratory phases. Non-attenuation-corrected DDG PET and the misregistered CT were input to AIR, and the AIR-corrected CT data were output to register and quantify the DDG PET data. Registration and quantification of lesion SUVmax and of the signal-to-background ratio (SBR), defined as lesion SUVmax over the 2-cm background mean SUV, were compared for each of the 51 avid lesions. DDG CT outperformed AIR in misregistration correction and quantification of avid thoracic lesions (1.16 ± 0.45 cm). Most lesions (46/51, 90%) showed improved registration with DDG CT relative to AIR, while 10% (5/51) were similar between AIR and DDG CT. Lesions in the baseline CT were on average 2.06 ± 1.0 cm from their corresponding lesions in the DDG CT, whereas those in the AIR CT were on average 0.97 ± 0.54 cm away. AIR significantly improved lesion registration compared with the baseline CT (P < 0.0001). SUVmax increased by 18.1 ± 15.3% with AIR, but a statistically significantly larger increase of 34.4 ± 25.4% was observed with DDG CT (P < 0.0001). A statistically significant increase in SBR was also observed, rising from 10.5 ± 12.1% with AIR to 21.1 ± 20.5% with DDG CT (P < 0.0001). Many lesions whose registration was improved by AIR still showed residual misregistration. AIR could mis-localize a lymph node to the lung parenchyma or the ribs, and could also mis-localize a lung nodule to the left atrium. AIR could also distort the rib cage and the circular shape of the aortic cross section. DDG CT outperformed AIR in both localization and quantification of avid thoracic lesions. AIR improved registration of the misregistered PET/CT, but lymph nodes that were already correctly registered could be newly misregistered by AIR, and AIR-induced distortion of the rib cage can also negatively impact image quality. Further research on AIR's accuracy in modeling true patient respiratory motion without introducing new misregistration or anatomical distortion is warranted.
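
As a hedged sketch, the signal-to-background ratio defined above (lesion SUVmax over the 2-cm background mean SUV) could be computed as follows; treating the background as a roughly 2-cm shell obtained by mask dilation is an assumption, not necessarily the authors' definition:

```python
# Hedged sketch: lesion SBR = SUVmax / mean SUV of a ~2-cm background shell
# around the lesion, given a SUV volume, a lesion mask, and the voxel size.
import numpy as np
from scipy.ndimage import binary_dilation

def lesion_sbr(suv: np.ndarray, lesion_mask: np.ndarray, voxel_mm: float) -> float:
    suv_max = suv[lesion_mask].max()
    # Dilate the lesion mask by roughly 20 mm to define the background shell.
    n_iter = max(1, int(round(20.0 / voxel_mm)))
    dilated = binary_dilation(lesion_mask, iterations=n_iter)
    background = dilated & ~lesion_mask
    return float(suv_max / suv[background].mean())

# Toy example: a hot 3-voxel lesion in a uniform warm background.
suv = np.full((20, 20, 20), 1.0)
mask = np.zeros_like(suv, dtype=bool)
mask[10, 10, 9:12] = True
suv[mask] = 8.0
print(lesion_sbr(suv, mask, voxel_mm=4.0))   # SUVmax 8.0 over background ~1.0
```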

Yip R, Jirapatnakul A, Avila R, Gutierrez JG, Naghavi M, Yankelevitz DF, Henschke CI

PubMed paper · Sep 30, 2025
The integration of artificial intelligence (AI) with low-dose computed tomography (LDCT) has the potential to transform lung cancer screening into a comprehensive approach to early detection of multiple diseases. Building on over 3 decades of research and global implementation by the International Early Lung Cancer Action Program (I-ELCAP), this paper reviews the development and clinical integration of AI for interpreting LDCT scans. We describe the historical milestones in AI-assisted lung nodule detection, emphysema quantification, and cardiovascular risk assessment using visual and quantitative imaging features. We also discuss challenges related to image acquisition variability, ground truth curation, and clinical integration, with a particular focus on the design and implementation of the open-source IELCAP-AIRS system and the ScreeningPLUS infrastructure, which enable AI training, validation, and deployment in real-world screening environments. AI algorithms for rule-out decisions, nodule tracking, and disease quantification have the potential to reduce radiologist workload and advance precision screening. With the ability to evaluate multiple diseases from a single LDCT scan, AI-enabled screening offers a powerful, scalable tool for improving population health. Ongoing collaboration, standardized protocols, and large annotated datasets are critical to advancing the future of integrated, AI-driven preventive care.

Rmr SS, Mb S, R D, M T, P V

PubMed paper · Sep 30, 2025
Kidney lesion subtype identification is essential for precise diagnosis and personalized treatment planning. However, achieving reliable classification remains challenging due to factors such as inter-patient anatomical variability, incomplete multi-phase CT acquisitions, and ill-defined or overlapping lesion boundaries. In addition, genetic and ethnic morphological variations introduce inconsistent imaging patterns, reducing the generalizability of conventional deep learning models. To address these challenges, we introduce a unified framework called Phase-aware Cross-Scale U-MAMba and Switch Atrous Bifovea EfficientNet B7 (PCU-SABENet), which integrates multi-phase reconstruction, fine-grained lesion segmentation, and robust subtype classification. The PhaseGAN-3D synthesizes missing CT phases using binary mask-guided inter-phase priors, enabling complete four-phase reconstruction even under partial acquisition conditions. The PCU segmentation module combines Contextual Attention Blocks, Cross-Scale Skip Connections, and uncertainty-aware pseudo-labeling to delineate lesion boundaries with high anatomical fidelity. These enhancements help mitigate low contrast and intra-class ambiguity. For classification, SABENet employs Switch Atrous Convolution for multi-scale receptive field adaptation, Hierarchical Tree Pooling for structure-aware abstraction, and Bi-Fovea Self-Attention to emphasize fine lesion cues and global morphology. This configuration is particularly effective in addressing morphological diversity across patient populations. Experimental results show that the proposed model achieves state-of-the-art performance, with 99.3% classification accuracy, 94.8% Dice similarity, 89.3% IoU, 98.8% precision, 99.2% recall, a phase-consistency score of 0.94, and a subtype confidence deviation of 0.08. Moreover, the model generalizes well on external datasets (TCIA) with 98.6% accuracy and maintains efficient computational performance, requiring only 0.138 GFLOPs and 8.2 ms inference time. These outcomes confirm the model's robustness in phase-incomplete settings and its adaptability to diverse patient cohorts. The PCU-SABENet framework sets a new standard in kidney lesion subtype analysis, combining segmentation precision with clinically actionable classification, thus offering a powerful tool for enhancing diagnostic accuracy and decision-making in real-world renal cancer management.
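
For reference, the Dice similarity and IoU figures quoted above are standard overlap metrics; a minimal sketch of their computation (not the authors' code) is shown below:

```python
# Hedged, illustrative sketch: Dice similarity coefficient and IoU between a
# predicted lesion mask and a reference mask (the abstract reports 94.8% Dice
# and 89.3% IoU for the proposed framework).
import numpy as np

def dice_and_iou(pred: np.ndarray, target: np.ndarray, eps: float = 1e-8):
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    dice = (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
    iou = (intersection + eps) / (np.logical_or(pred, target).sum() + eps)
    return float(dice), float(iou)

pred = np.zeros((64, 64), dtype=bool); pred[20:40, 20:40] = True
target = np.zeros((64, 64), dtype=bool); target[22:42, 22:42] = True
print(dice_and_iou(pred, target))   # Dice exceeds IoU for partial overlap
```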
