
Radiology's Last Exam (RadLE): Benchmarking Frontier Multimodal AI Against Human Experts and a Taxonomy of Visual Reasoning Errors in Radiology

Suvrankar Datta, Divya Buchireddygari, Lakshmi Vennela Chowdary Kaza, Mrudula Bhalke, Kautik Singh, Ayush Pandey, Sonit Sai Vasipalli, Upasana Karnwal, Hakikat Bir Singh Bhatti, Bhavya Ratan Maroo, Sanjana Hebbar, Rahul Joseph, Gurkawal Kaur, Devyani Singh, Akhil V, Dheeksha Devasya Shama Prasad, Nishtha Mahajan, Ayinaparthi Arisha, Rajesh Vanagundi, Reet Nandy, Kartik Vuthoo, Snigdhaa Rajvanshi, Nikhileswar Kondaveeti, Suyash Gunjal, Rishabh Jain, Rajat Jain, Anurag Agrawal

arXiv preprint · Sep 29, 2025
Generalist multimodal AI systems such as large language models (LLMs) and vision-language models (VLMs) are increasingly accessed by clinicians and patients alike for medical image interpretation through widely available consumer-facing chatbots. Most evaluations claiming expert-level performance rely on public datasets containing common pathologies; rigorous evaluation of frontier models on difficult diagnostic cases remains limited. We developed a pilot benchmark of 50 expert-level "spot diagnosis" cases across multiple imaging modalities to evaluate the performance of frontier AI models against board-certified radiologists and radiology trainees. To mirror real-world usage, the reasoning modes of five popular frontier AI models were tested through their native web interfaces: OpenAI o3, OpenAI GPT-5, Gemini 2.5 Pro, Grok-4, and Claude Opus 4.1. Accuracy was scored by blinded experts, and reproducibility was assessed across three independent runs; GPT-5 was additionally evaluated across its various reasoning modes. The quality of reasoning was assessed, and a taxonomy of visual reasoning errors was defined. Board-certified radiologists achieved the highest diagnostic accuracy (83%), outperforming trainees (45%) and all AI models (best: GPT-5 at 30%). Reliability was substantial for GPT-5 and o3, moderate for Gemini 2.5 Pro and Grok-4, and poor for Claude Opus 4.1. These findings demonstrate that advanced frontier models fall far short of radiologists on challenging diagnostic cases. Our benchmark highlights the present limitations of generalist AI in medical imaging and cautions against unsupervised clinical use. We also provide a qualitative analysis of reasoning traces and propose a practical taxonomy of visual reasoning errors made by AI models, to better understand their failure modes, inform evaluation standards, and guide more robust model development.

Latent Representation Learning from 3D Brain MRI for Interpretable Prediction in Multiple Sclerosis

Trinh Ngoc Huynh, Nguyen Duc Kien, Nguyen Hai Anh, Dinh Tran Hiep, Manuela Vaneckova, Tomas Uher, Jeroen Van Schependom, Stijn Denissen, Tran Quoc Long, Nguyen Linh Trung, Guy Nagels

arXiv preprint · Sep 28, 2025
We present InfoVAE-Med3D, a latent-representation learning approach for 3D brain MRI that targets interpretable biomarkers of cognitive decline. Standard statistical models and shallow machine learning often lack power, while most deep learning methods behave as black boxes. Our method extends InfoVAE to explicitly maximize mutual information between images and latent variables, producing compact, structured embeddings that retain clinically meaningful content. We evaluate on two cohorts: a large healthy-control dataset (n=6527) with chronological age, and a clinical multiple sclerosis dataset from Charles University in Prague (n=904) with age and Symbol Digit Modalities Test (SDMT) scores. The learned latents support accurate brain-age and SDMT regression, preserve key medical attributes, and form intuitive clusters that aid interpretation. Across reconstruction and downstream prediction tasks, InfoVAE-Med3D consistently outperforms other VAE variants, indicating stronger information capture in the embedding space. By uniting predictive performance with interpretability, InfoVAE-Med3D offers a practical path toward MRI-based biomarkers and more transparent analysis of cognitive deterioration in neurological disease.
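
For readers unfamiliar with the InfoVAE family, the sketch below shows the general form of the objective the paper builds on: a reconstruction term, a down-weighted per-sample KL, and an MMD divergence that keeps the aggregate posterior close to the prior while preserving image-latent mutual information. This is a minimal PyTorch sketch of the generic InfoVAE loss, not the paper's implementation; the weights and kernel bandwidth are illustrative.

```python
import torch
import torch.nn.functional as F

def rbf_mmd(z_q: torch.Tensor, z_p: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased MMD estimate between posterior and prior samples (RBF kernel)."""
    def kernel(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2 * sigma ** 2))
    return kernel(z_q, z_q).mean() + kernel(z_p, z_p).mean() - 2 * kernel(z_q, z_p).mean()

def infovae_loss(x, x_hat, mu, logvar, z, lam: float = 10.0, alpha: float = 0.0):
    """InfoVAE objective: recon + (1 - alpha) * KL + (alpha + lam - 1) * MMD.

    alpha < 1 softens the per-sample KL; the MMD term matches the aggregate
    posterior to the prior, preserving image-latent mutual information.
    """
    recon = F.mse_loss(x_hat, x, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    z_prior = torch.randn_like(z)  # standard Gaussian prior assumed
    return recon + (1 - alpha) * kl + (alpha + lam - 1) * rbf_mmd(z, z_prior)
```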

Predicting pathological complete response to chemoradiotherapy using artificial intelligence-based magnetic resonance imaging radiomics in esophageal squamous cell carcinoma.

Hirata A, Hayano K, Tochigi T, Kurata Y, Shiraishi T, Sekino N, Nakano A, Matsumoto Y, Toyozumi T, Uesato M, Ohira G

PubMed paper · Sep 28, 2025
Advanced esophageal squamous cell carcinoma (ESCC) has an extremely poor prognosis. Preoperative chemoradiotherapy (CRT) can significantly prolong survival, especially in patients who achieve pathological complete response (pCR); however, pretherapeutic prediction of pCR remains challenging. We aimed to predict pCR and survival in ESCC patients undergoing CRT using an artificial intelligence (AI)-based diffusion-weighted magnetic resonance imaging (DWI-MRI) radiomics model. We retrospectively analyzed 70 patients with ESCC who underwent curative surgery following CRT. For each patient, pre-treatment tumors were semi-automatically segmented in three dimensions from DWI-MRI images (b = 0 and 1000 s/mm²), and 76 radiomics features were extracted from each segmented tumor. Using these features as explanatory variables and pCR as the objective variable, machine learning models for predicting pCR were developed with AutoGluon, an automated machine learning library, and validated by stratified double cross-validation. pCR was achieved in 15 patients (21.4%). Among individual features, apparent diffusion coefficient skewness demonstrated the highest predictive performance (area under the curve [AUC] = 0.77). Gray-level co-occurrence matrix (GLCM) entropy (b = 1000 s/mm²) was an independent prognostic factor for relapse-free survival (RFS) (hazard ratio = 0.32, P = 0.009); in Kaplan-Meier analysis, patients with high GLCM entropy showed significantly better RFS (P < 0.001, log-rank). The best-performing machine learning model achieved an AUC of 0.85, and the predicted pCR-positive group showed significantly better RFS than the predicted pCR-negative group (P = 0.007, log-rank). AI-based radiomics analysis of DWI-MRI in ESCC has the potential to accurately predict the effect of CRT before treatment and to inform optimal treatment strategies.
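
As a rough illustration of the modeling setup described above (AutoGluon inside an outer stratified cross-validation loop), a minimal sketch follows. The file and column names are hypothetical, and AutoGluon's internal validation split stands in for the inner loop of the "double" cross-validation:

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from autogluon.tabular import TabularPredictor

# df: one row per patient; 76 radiomics feature columns plus a binary "pCR" label.
df = pd.read_csv("radiomics_features.csv")  # hypothetical file name

outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aucs = []
for train_idx, test_idx in outer.split(df, df["pCR"]):
    train, test = df.iloc[train_idx], df.iloc[test_idx]
    # AutoGluon handles model selection on its own inner validation split,
    # giving the nested ("double") cross-validation structure.
    predictor = TabularPredictor(label="pCR", eval_metric="roc_auc").fit(train)
    proba = predictor.predict_proba(test)[1]  # P(pCR-positive), assuming 0/1 labels
    aucs.append(roc_auc_score(test["pCR"], proba))

print(f"mean outer-fold AUC: {sum(aucs) / len(aucs):.3f}")
```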

EWC-Guided Diffusion Replay for Exemplar-Free Continual Learning in Medical Imaging

Anoushka Harit, William Prew, Zhongtian Sun, Florian Markowetz

arXiv preprint · Sep 28, 2025
Medical imaging foundation models must adapt over time, yet full retraining is often blocked by privacy constraints and cost. We present a continual learning framework that avoids storing patient exemplars by pairing class-conditional diffusion replay with Elastic Weight Consolidation (EWC). Using a compact Vision Transformer backbone, we evaluate across eight MedMNIST v2 tasks and CheXpert. On CheXpert our approach attains 0.851 AUROC, reduces forgetting by more than 30% relative to DER++, and approaches joint training at 0.869 AUROC, while remaining efficient and privacy-preserving. Analyses connect forgetting to two measurable factors, fidelity of replay and Fisher-weighted parameter drift, highlighting the complementary roles of diffusion replay and synaptic stability. The results indicate a practical route for scalable, privacy-aware continual adaptation of clinical imaging models.
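
A minimal sketch of how the two mechanisms combine in a training step: synthetic images sampled from the class-conditional diffusion model substitute for stored patient exemplars, while an EWC penalty anchors parameters that carried high Fisher information on earlier tasks. The function names and λ weighting are illustrative, not the paper's code:

```python
import torch

def ewc_penalty(model, fisher, old_params, lam: float = 100.0):
    """Elastic Weight Consolidation: quadratic drift penalty, Fisher-weighted."""
    loss = 0.0
    for name, p in model.named_parameters():
        if name in fisher:
            loss = loss + (fisher[name] * (p - old_params[name]).pow(2)).sum()
    return lam / 2 * loss

def training_step(model, criterion, x_new, y_new, x_syn, y_syn, fisher, old_params):
    """One step: new-task loss + diffusion-replay loss + EWC stability term.

    x_syn/y_syn are drawn from the class-conditional diffusion model in place
    of a stored exemplar buffer, keeping the method exemplar-free.
    """
    loss = criterion(model(x_new), y_new)
    loss = loss + criterion(model(x_syn), y_syn)          # replay term
    loss = loss + ewc_penalty(model, fisher, old_params)  # synaptic stability
    return loss
```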

Application of deep learning-based convolutional neural networks in gastrointestinal disease endoscopic examination.

Wang YY, Liu B, Wang JH

PubMed paper · Sep 28, 2025
Gastrointestinal (GI) diseases, including gastric and colorectal cancers, significantly impact global health, necessitating accurate and efficient diagnostic methods. Endoscopic examination is the primary diagnostic tool; however, its accuracy is limited by operator dependency and interobserver variability. Advances in deep learning, particularly convolutional neural networks (CNNs), show great potential for enhancing GI disease detection and classification. This review explores the application of CNNs in endoscopic imaging, focusing on polyp and tumor detection, disease classification, endoscopic ultrasound, and capsule endoscopy analysis. We compare the performance of CNN models with that of traditional diagnostic methods, highlighting their advantages in accuracy and real-time decision support. Despite promising results, challenges remain, including data availability, model interpretability, and clinical integration. Future directions include improving model generalization, enhancing explainability, and conducting large-scale clinical trials. With continued advancement, CNN-powered artificial intelligence systems could revolutionize GI endoscopy by enhancing early disease detection, reducing diagnostic errors, and improving patient outcomes.

Dementia-related volumetric assessments in neuroradiology reports: a natural language processing-based study.

Mayers AJ, Roberts A, Venkataraman AV, Booth C, Stewart R

PubMed paper · Sep 28, 2025
Structural MRI of the brain is routinely performed on patients referred to memory clinics; however, the resulting radiology reports, including volumetric assessments, are conventionally stored as unstructured free text. We sought to use natural language processing (NLP) to extract text relating to intracranial volumetric assessment from brain MRI reports, to enhance routine data availability for research purposes. Data were drawn from the electronic records of a large mental healthcare provider serving a geographic catchment of 1.3 million residents in four boroughs of south London, UK: a corpus of 4007 de-identified brain MRI reports from patients referred to memory assessment services. An NLP algorithm was developed, using a span categorisation approach, to extract six binary categories from the text reports: the presence or absence of each of (i) global volume loss, (ii) hippocampal/medial temporal lobe volume loss, and (iii) other lobar/regional volume loss. Distributions of these categories were evaluated. The overall F1 score across the six categories was 0.89 (precision 0.92, recall 0.86), with the following precision/recall for each category: presence of global volume loss 0.95/0.95, absence of global volume loss 0.94/0.77, presence of regional volume loss 0.80/0.58, absence of regional volume loss 0.91/0.93, presence of hippocampal volume loss 0.90/0.88, and absence of hippocampal volume loss 0.94/0.92. These results support the feasibility and accuracy of using NLP to extract volumetric assessments from radiology reports, and the potential for automated generation of novel metadata from dementia assessments in electronic health records.
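
As an illustration of the span-categorisation approach, here is a minimal sketch using spaCy's spancat component, reducing predicted spans to document-level binary flags of the kind evaluated above. The model path and label names are hypothetical, and the sketch assumes a pipeline already trained on annotated reports:

```python
import spacy

# Hypothetical trained pipeline containing a span-categorisation ("spancat") component.
nlp = spacy.load("mri_report_spancat")

LABELS = [
    "global_volume_loss_present", "global_volume_loss_absent",
    "hippocampal_volume_loss_present", "hippocampal_volume_loss_absent",
    "regional_volume_loss_present", "regional_volume_loss_absent",
]

report = ("There is generalised cerebral volume loss. "
          "Hippocampal volumes are preserved for age.")
doc = nlp(report)

# spancat stores predicted spans under a key in doc.spans (default "sc");
# collapse them to per-report binary flags for downstream research use.
flags = {label: False for label in LABELS}
spans = doc.spans["sc"] if "sc" in doc.spans else []
for span in spans:
    flags[span.label_] = True
print(flags)
```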

FedAgentBench: Towards Automating Real-world Federated Medical Image Analysis with Server-Client LLM Agents

Pramit Saha, Joshua Strong, Divyanshu Mishra, Cheng Ouyang, J. Alison Noble

arXiv preprint · Sep 28, 2025
Federated learning (FL) allows collaborative model training across healthcare sites without sharing sensitive patient data. However, real-world FL deployment is often hindered by complex operational challenges that demand substantial human effort, including: (a) selecting appropriate clients (hospitals); (b) coordinating between the central server and clients; (c) client-level data pre-processing; (d) harmonizing non-standardized data and labels across clients; and (e) selecting FL algorithms based on user instructions and cross-client data characteristics. Existing FL work largely overlooks these practical orchestration challenges. These operational bottlenecks motivate autonomous, agent-driven FL systems, in which intelligent agents at each hospital client and at the central server collaboratively manage FL setup and model training with minimal human intervention. To this end, we introduce an agent-driven FL framework that captures the key phases of real-world FL workflows, from client selection to training completion, and a benchmark, FedAgentBench, that evaluates the ability of LLM agents to autonomously coordinate healthcare FL. Our framework incorporates 40 FL algorithms, each tailored to address diverse task-specific requirements and cross-client characteristics. Furthermore, we introduce a diverse set of complex tasks across 201 carefully curated datasets, simulating six modality-specific real-world healthcare environments: dermatoscopy, ultrasound, fundus, histopathology, MRI, and X-ray. We assess the agentic performance of 14 open-source and 10 proprietary LLMs spanning small, medium, and large model scales. While some agent cores, such as GPT-4.1 and DeepSeek V3, can automate various stages of the FL pipeline, our results reveal that more complex, interdependent tasks based on implicit goals remain challenging for even the strongest models.
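
For context on what a server agent coordinates at aggregation time, below is a minimal sketch of FedAvg, the baseline aggregation rule in most FL pipelines (the benchmark itself spans 40 algorithms); it is illustrative rather than drawn from the FedAgentBench code:

```python
import torch

def fedavg(client_states, client_sizes):
    """Server-side FedAvg: average client weights, weighted by local dataset size.

    client_states: list of model state_dicts returned by hospital clients.
    client_sizes:  number of local training examples at each client.
    """
    total = sum(client_sizes)
    global_state = {}
    for key in client_states[0]:
        global_state[key] = sum(
            state[key].float() * (n / total)
            for state, n in zip(client_states, client_sizes)
        )
    return global_state  # broadcast back to clients for the next round
```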

A Novel Hybrid Deep Learning and Chaotic Dynamics Approach for Thyroid Cancer Classification

Nada Bouchekout, Abdelkrim Boukabou, Morad Grimes, Yassine Habchi, Yassine Himeur, Hamzah Ali Alkhazaleh, Shadi Atalla, Wathiq Mansoor

arXiv preprint · Sep 28, 2025
Timely and accurate diagnosis is crucial in addressing the global rise in thyroid cancer, ensuring effective treatment strategies and improved patient outcomes. We present an intelligent classification method that couples an Adaptive Convolutional Neural Network (CNN) with Cohen-Daubechies-Feauveau (CDF9/7) wavelets whose detail coefficients are modulated by an n-scroll chaotic system to enrich discriminative features. We evaluate on the public DDTI thyroid ultrasound dataset (n = 1,638 images; 819 malignant / 819 benign) using 5-fold cross-validation, where the proposed method attains 98.17% accuracy, 98.76% sensitivity, 97.58% specificity, 97.55% F1-score, and an AUC of 0.9912. A controlled ablation shows that adding chaotic modulation to CDF9/7 improves accuracy by +8.79 percentage points over a CDF9/7-only CNN (from 89.38% to 98.17%). To objectively position our approach, we trained state-of-the-art backbones on the same data and splits: EfficientNetV2-S (96.58% accuracy; AUC 0.987), Swin-T (96.41%; 0.986), ViT-B/16 (95.72%; 0.983), and ConvNeXt-T (96.94%; 0.987). Our method outperforms the best of these by +1.23 points in accuracy and +0.0042 in AUC, while remaining computationally efficient (28.7 ms per image; 1,125 MB peak VRAM). Robustness is further supported by cross-dataset testing on TCIA (accuracy 95.82%) and transfer to an ISIC skin-lesion subset (n = 28 unique images, augmented to 2,048; accuracy 97.31%). Explainability analyses (Grad-CAM, SHAP, LIME) highlight clinically relevant regions. Altogether, the wavelet-chaos-CNN pipeline delivers state-of-the-art thyroid ultrasound classification with strong generalization and practical runtime characteristics suitable for clinical integration.
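
A rough sketch of the chaos-modulated wavelet front end described above, assuming PyWavelets' "bior4.4" filter bank (the usual CDF 9/7 implementation) and substituting a simple logistic map for the paper's n-scroll chaotic system; the gain scaling is illustrative:

```python
import numpy as np
import pywt

def chaotic_gains(n: int, x0: float = 0.7, r: float = 3.99) -> np.ndarray:
    """Logistic-map sequence standing in for the paper's n-scroll chaotic system."""
    seq = np.empty(n)
    x = x0
    for i in range(n):
        x = r * x * (1 - x)
        seq[i] = x
    return 1.0 + 0.5 * (seq - 0.5)  # multiplicative gains fluctuating around 1

def chaos_modulated_wavelet(img: np.ndarray, level: int = 2) -> np.ndarray:
    """CDF 9/7 decomposition with chaotically modulated detail coefficients."""
    coeffs = pywt.wavedec2(img, "bior4.4", level=level)
    out = [coeffs[0]]  # approximation band left untouched
    for bands in coeffs[1:]:  # (horizontal, vertical, diagonal) detail bands
        out.append(tuple(b * chaotic_gains(b.size).reshape(b.shape) for b in bands))
    return pywt.waverec2(out, "bior4.4")  # enriched image fed to the CNN
```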

FedDAPL: Toward Client-Private Generalization in Federated Learning

Soroosh Safari Loaliyan, Jose-Luis Ambite, Paul M. Thompson, Neda Jahanshad, Greg Ver Steeg

arXiv preprint · Sep 28, 2025
Federated Learning (FL) trains models locally at each research center or clinic and aggregates only model updates, making it a natural fit for medical imaging, where strict privacy laws forbid raw data sharing. A major obstacle is scanner-induced domain shift: non-biological variations in hardware or acquisition protocols can cause models to fail on external sites. Most harmonization methods correct this shift by directly comparing data across sites, conflicting with FL's privacy constraints. Domain Generalization (DG) offers a privacy-friendly alternative, learning site-invariant representations without sharing raw data, but standard DG pipelines still assume centralized access to multi-site data, again violating FL's guarantees. This paper addresses these difficulties with a straightforward integration of a Domain-Adversarial Neural Network (DANN) into the FL process. After demonstrating that a naive federated DANN fails to converge, we propose a proximal regularization method that stabilizes adversarial training among clients. Experiments on T1-weighted 3D brain MRIs from the OpenBHB dataset, performing brain-age prediction on participants aged 6-64 years (mean 22 ± 6 years; 45% male) in training and 6-79 years (mean 19 ± 13 years; 55% male) in validation, show that training on 15 sites and testing on 19 unseen sites yields superior cross-site generalization over FedAvg and ERM while preserving data privacy.
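
To make the two ingredients concrete, the sketch below shows a gradient-reversal layer (the standard DANN mechanism) plus a FedProx-style proximal term pulling local weights toward the last global model, which is the kind of stabiliser the paper proposes; the exact form of the paper's regularizer may differ:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient-reversal layer: identity on the forward pass, negated
    (and scaled) gradient on the backward pass, as in DANN."""
    @staticmethod
    def forward(ctx, x, lambda_adv):
        ctx.lambda_adv = lambda_adv
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambda_adv * grad_output, None

def proximal_term(model, global_params, mu: float = 0.01):
    """FedProx-style anchor pulling local weights toward the last global model,
    stabilising adversarial training across federated clients."""
    return (mu / 2) * sum((p - g).pow(2).sum()
                          for p, g in zip(model.parameters(), global_params))

# Per-client step (sketch): features pass through GradReverse into the site
# classifier, so minimising site loss trains the discriminator while the
# encoder learns site-invariant features; the proximal term is added on top.
#   z = encoder(x)
#   site_logits = site_head(GradReverse.apply(z, 1.0))
#   loss = mse(age_head(z), age) + ce(site_logits, site) \
#          + proximal_term(model, global_params)
```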

Hepatocellular Carcinoma Risk Stratification for Cirrhosis Patients: Integrating Radiomics and Deep Learning Computed Tomography Signatures of the Liver and Spleen into a Clinical Model.

Fan R, Shi YR, Chen L, Wang CX, Qian YS, Gao YH, Wang CY, Fan XT, Liu XL, Bai HL, Zheng D, Jiang GQ, Yu YL, Liang XE, Chen JJ, Xie WF, Du LT, Yan HD, Gao YJ, Wen H, Liu JF, Liang MF, Kong F, Sun J, Ju SH, Wang HY, Hou JL

PubMed paper · Sep 28, 2025
Given the high burden of hepatocellular carcinoma (HCC), risk stratification in patients with cirrhosis is critical but remains inadequate. In this study, we aimed to develop and validate an HCC prediction model by integrating radiomics and deep learning features from liver and spleen computed tomography (CT) images into the established age-male-ALBI-platelet (aMAP) clinical model. Patients were enrolled between 2018 and 2023 from a Chinese multicenter, prospective, observational cirrhosis cohort, all of whom underwent 3-phase contrast-enhanced abdominal CT scans at enrollment. The aMAP clinical score was calculated, and radiomic (PyRadiomics) and deep learning (ResNet-18) features were extracted from liver and spleen regions of interest. Feature selection was performed using the least absolute shrinkage and selection operator. Among 2,411 patients (median follow-up: 42.7 months [IQR: 32.9-54.1]), 118 developed HCC (three-year cumulative incidence: 3.59%). Chronic hepatitis B virus infection was the main etiology, accounting for 91.5% of cases. The aMAP-CT model, which incorporates CT signatures, significantly outperformed existing models (area under the receiver-operating characteristic curve: 0.809-0.869 in three cohorts). It stratified patients into high-risk (three-year HCC incidence: 26.3%) and low-risk (1.7%) groups. Stepwise application (aMAP → aMAP-CT) further refined stratification (three-year incidences: 1.8% [93.0% of the cohort] vs. 27.2% [7.0%]). The aMAP-CT model improves HCC risk prediction by integrating CT-based liver and spleen signatures, enabling precise identification of high-risk cirrhosis patients. This approach personalizes surveillance strategies, potentially facilitating earlier detection and improved outcomes.
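
As a simple illustration of the LASSO-based feature selection step, the sketch below uses L1-penalised logistic regression over the concatenated CT signatures; the paper's outcome is time-to-event, so a LASSO-Cox variant is the more likely actual choice, and the file names and penalty strength here are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X: concatenated liver/spleen radiomics (PyRadiomics) and deep (ResNet-18)
# features; y: 1 if the patient developed HCC during follow-up, else 0.
X = np.load("ct_signatures.npy")   # hypothetical file names
y = np.load("hcc_outcome.npy")

# L1 penalty performs LASSO-style selection: coefficients shrunk exactly
# to zero drop the corresponding features from the final signature.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
)
model.fit(X, y)

coefs = model.named_steps["logisticregression"].coef_.ravel()
selected = np.flatnonzero(coefs)
print(f"{selected.size} features retained out of {coefs.size}")
```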