Latest Papers on Radiology AI. Tags: Ethics

Artificial Intelligence and Extended Reality in TAVR: Current Applications and Challenges.

Skalidis I, Sayah N, Benamer H, Amabile N, Laforgia P, Champagne S, Hovasse T, Garot J, Garot P, Akodad M

•papers•Aug 6 2025

Integration of AI and XR in TAVR is revolutionizing the management of severe aortic stenosis by enhancing diagnostic accuracy, risk stratification, and pre-procedural planning. Advanced algorithms now facilitate precise electrocardiographic, echocardiographic, and CT-based assessments that reduce observer variability and enable patient-specific risk prediction. Immersive XR technologies, including augmented, virtual, and mixed reality, improve spatial visualization of complex cardiac anatomy and support real-time procedural guidance. Despite these advancements, standardized protocols, regulatory frameworks, and ethical safeguards remain necessary for widespread clinical adoption.

CT Classification Cardiac Review In Silico Ethics Policy

Foundation models for radiology-the position of the AI for Health Imaging (AI4HI) network.

de Almeida JG, Alberich LC, Tsakou G, Marias K, Tsiknakis M, Lekadir K, Marti-Bonmati L, Papanikolaou N

•papers•Aug 6 2025

Foundation models are large models trained on big data which can be used for downstream tasks. In radiology, these models can potentially address several gaps in fairness and generalization, as they can be trained on massive datasets without labelled data and adapted to tasks requiring data with a small number of descriptions. This reduces one of the limiting bottlenecks in clinical model construction-data annotation-as these models can be trained through a variety of techniques that require little more than radiological images with or without their corresponding radiological reports. However, foundation models may be insufficient as they are affected-to a smaller extent when compared with traditional supervised learning approaches-by the same issues that lead to underperforming models, such as a lack of transparency/explainability, and biases. To address these issues, we advocate that the development of foundation models should not only be pursued but also accompanied by the development of a decentralized clinical validation and continuous training framework. This does not guarantee the resolution of the problems associated with foundation models, but it enables developers, clinicians and patients to know when, how and why models should be updated, creating a clinical AI ecosystem that is better capable of serving all stakeholders. CRITICAL RELEVANCE STATEMENT: Foundation models may mitigate issues like bias and poor generalization in radiology AI, but challenges persist. We propose a decentralized, cross-institutional framework for continuous validation and training to enhance model reliability, safety, and clinical utility. KEY POINTS: Foundation models trained on large datasets reduce annotation burdens and improve fairness and generalization in radiology. Despite improvements, they still face challenges like limited transparency, explainability, and residual biases. 
A decentralized, cross-institutional framework for clinical validation and continuous training can strengthen reliability and inclusivity in clinical AI.

Mixed Modality Classification Whole Body Review Concept Consortium Policy Ethics GenAI

Scaling Artificial Intelligence for Prostate Cancer Detection on MRI towards Population-Based Screening and Primary Diagnosis in a Global, Multiethnic Population (Study Protocol)

Anindo Saha, Joeran S. Bosma, Jasper J. Twilt, Alexander B. C. D. Ng, Aqua Asif, Kirti Magudia, Peder Larson, Qinglin Xie, Xiaodong Zhang, Chi Pham Minh, Samuel N. Gitau, Ivo G. Schoots, Martijn F. Boomsma, Renato Cuocolo, Nikolaos Papanikolaou, Daniele Regge, Derya Yakar, Mattijs Elschot, Jeroen Veltman, Baris Turkbey, Nancy A. Obuchowski, Jurgen J. Fütterer, Anwar R. Padhani, Hashim U. Ahmed, Tobias Nordström, Martin Eklund, Veeru Kasivisvanathan, Maarten de Rooij, Henkjan Huisman

•preprint•Aug 4 2025

In this intercontinental, confirmatory study, we include a retrospective cohort of 22,481 MRI examinations (21,288 patients; 46 cities in 22 countries) to train and externally validate the PI-CAI-2B model, i.e., an efficient, next-generation iteration of the state-of-the-art AI system that was developed for detecting Gleason grade group $\geq$2 prostate cancer on MRI during the PI-CAI study. Of these examinations, 20,471 cases (19,278 patients; 26 cities in 14 countries) from two EU Horizon projects (ProCAncer-I, COMFORT) and 12 independent centers based in Europe, North America, Asia and Africa, are used for training and internal testing. Additionally, 2010 cases (2010 patients; 20 external cities in 12 countries) from population-based screening (STHLM3-MRI, IP1-PROSTAGRAM trials) and primary diagnostic settings (PRIME trial) based in Europe, North and South Americas, Asia and Australia, are used for external testing. Primary endpoint is the proportion of AI-based assessments in agreement with the standard of care diagnoses (i.e., clinical assessments made by expert uropathologists on histopathology, if available, or at least two expert urogenital radiologists in consensus; with access to patient history and peer consultation) in the detection of Gleason grade group $\geq$2 prostate cancer within the external testing cohorts. Our statistical analysis plan is prespecified with a hypothesis of diagnostic interchangeability to the standard of care at the PI-RADS $\geq$3 (primary diagnosis) or $\geq$4 (screening) cut-off, considering an absolute margin of 0.05 and reader estimates derived from the PI-CAI observer study (62 radiologists reading 400 cases). Secondary measures comprise the area under the receiver operating characteristic curve (AUROC) of the AI system stratified by imaging quality, patient age and patient ethnicity to identify underlying biases (if any).

MRI Detection Abdominal Retrospective Clinical In Silico Consortium Benchmark SOTA Open Dataset Ethics

Explainable AI Methods for Neuroimaging: Systematic Failures of Common Tools, the Need for Domain-Specific Validation, and a Proposal for Safe Application

Nys Tjade Siegel, James H. Cole, Mohamad Habes, Stefan Haufe, Kerstin Ritter, Marc-André Schulz

•preprint•Aug 4 2025

Trustworthy interpretation of deep learning models is critical for neuroimaging applications, yet commonly used Explainable AI (XAI) methods lack rigorous validation, risking misinterpretation. We performed the first large-scale, systematic comparison of XAI methods on ~45,000 structural brain MRIs using a novel XAI validation framework. This framework establishes verifiable ground truth by constructing prediction tasks with known signal sources - from localized anatomical features to subject-specific clinical lesions - without artificially altering input images. Our analysis reveals systematic failures in two of the most widely used methods: GradCAM consistently failed to localize predictive features, while Layer-wise Relevance Propagation generated extensive, artifactual explanations that suggest incompatibility with neuroimaging data characteristics. Our results indicate that these failures stem from a domain mismatch, where methods with design principles tailored to natural images require substantial adaptation for neuroimaging data. In contrast, the simpler, gradient-based method SmoothGrad, which makes fewer assumptions about data structure, proved consistently accurate, suggesting its conceptual simplicity makes it more robust to this domain shift. These findings highlight the need for domain-specific adaptation and validation of XAI methods, suggest that interpretations from prior neuroimaging studies using standard XAI methodology warrant re-evaluation, and provide urgent guidance for practical application of XAI in neuroimaging.
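
As a rough illustration of the SmoothGrad idea named in the abstract above (averaging input gradients over noisy copies of the input), here is a minimal NumPy sketch. It is not the authors' pipeline: the toy model, its weights, the `grad_fn` callback, and the parameters `n_samples` and `noise_std` are all assumptions made for illustration.

```python
import numpy as np

def smoothgrad(grad_fn, x, n_samples=50, noise_std=0.1, seed=0):
    """Average the gradient of a model output over Gaussian-perturbed inputs.

    grad_fn is a hypothetical callback returning d(output)/d(input)
    with the same shape as its argument.
    """
    rng = np.random.default_rng(seed)
    total = np.zeros_like(x, dtype=float)
    for _ in range(n_samples):
        noisy = x + rng.normal(0.0, noise_std, size=x.shape)
        total += grad_fn(noisy)
    return total / n_samples

# Toy model (assumed): f(x) = tanh(w . x), so grad f = (1 - tanh(w . x)^2) * w.
w = np.array([2.0, 0.0, -1.0])

def toy_grad(x):
    return (1.0 - np.tanh(w @ x) ** 2) * w

saliency = smoothgrad(toy_grad, np.array([0.1, 0.5, 0.2]))
```

In this toy setup the second feature has zero weight, so its attribution is zero, while the other two keep the sign of their weights. A real neuroimaging model would obtain `grad_fn` from an autodiff framework rather than a closed-form gradient.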

MRI Classification Neurological Methodology In Silico Ethics Reproducibility

A Dual Radiomic and Dosiomic Filtering Technique for Locoregional Radiation Pneumonitis Prediction in Breast Cancer Patients

Zhenyu Yang, Qian Chen, Rihui Zhang, Manju Liu, Fengqiu Guo, Minjie Yang, Min Tang, Lina Zhou, Chunhao Wang, Minbin Chen, Fang-Fang Yin

•preprint•Aug 4 2025

Purpose: Radiation pneumonitis (RP) is a serious complication of intensity-modulated radiation therapy (IMRT) for breast cancer patients, underscoring the need for precise and explainable predictive models. This study presents an Explainable Dual-Omics Filtering (EDOF) model that integrates spatially localized dosiomic and radiomic features for voxel-level RP prediction. Methods: A retrospective cohort of 72 breast cancer patients treated with IMRT was analyzed, including 28 who developed RP. The EDOF model consists of two components: (1) dosiomic filtering, which extracts local dose intensity and spatial distribution features from planning dose maps, and (2) radiomic filtering, which captures texture-based features from pre-treatment CT scans. These features are jointly analyzed using the Explainable Boosting Machine (EBM), a transparent machine learning model that enables feature-specific risk evaluation. Model performance was assessed using five-fold cross-validation, reporting area under the curve (AUC), sensitivity, and specificity. Feature importance was quantified by mean absolute scores, and Partial Dependence Plots (PDPs) were used to visualize nonlinear relationships between RP risk and dual-omic features. Results: The EDOF model achieved strong predictive performance (AUC = 0.95 ± 0.01; sensitivity = 0.81 ± 0.05). The most influential features included dosiomic Intensity Mean, dosiomic Intensity Mean Absolute Deviation, and radiomic SRLGLE. PDPs revealed that RP risk increases beyond 5 Gy and rises sharply between 10-30 Gy, consistent with clinical dose thresholds. SRLGLE also captured structural heterogeneity linked to RP in specific lung regions. Conclusion: The EDOF framework enables spatially resolved, explainable RP prediction and may support personalized radiation planning to mitigate pulmonary toxicity.

CT Classification Chest Retrospective Clinical In Silico Academic Lab Ethics

AI-Driven Integration of Deep Learning with Lung Imaging, Functional Analysis, and Blood Gas Metrics for Perioperative Hypoxemia Prediction: Progress and Perspectives.

Huang K, Wu C, Fang J, Pi R

•papers•Aug 4 2025

This Perspective article explores the transformative role of artificial intelligence (AI) in predicting perioperative hypoxemia through the integration of deep learning (DL) with multimodal clinical data, including lung imaging, pulmonary function tests (PFTs), and arterial blood gas (ABG) analysis. Perioperative hypoxemia, defined as arterial oxygen partial pressure (PaO₂) <60 mmHg or oxygen saturation (SpO₂) <90%, poses significant risks of delayed recovery and organ dysfunction. Traditional diagnostic methods, such as radiological imaging and ABG analysis, often lack integrated predictive accuracy. AI frameworks, particularly convolutional neural networks (CNNs) and hybrid models like TD-CNNLSTM-LungNet, demonstrate exceptional performance in detecting pulmonary inflammation and stratifying hypoxemia risk, achieving up to 96.57% accuracy in pneumonia subtype differentiation and an AUC of 0.96 for postoperative hypoxemia prediction. Multimodal AI systems, such as DeepLung-Predict, unify CT scans, PFTs, and ABG parameters to enhance predictive precision, surpassing conventional methods by 22%. However, challenges persist, including dataset heterogeneity, model interpretability, and clinical workflow integration. Future directions emphasize multicenter validation, explainable AI (XAI) frameworks, and pragmatic trials to ensure equitable and reliable deployment. This AI-driven approach not only optimizes resource allocation but also mitigates financial burdens on healthcare systems by enabling early interventions and reducing ICU admission risks.

CT Classification Chest Review Concept Academic Lab Benchmark SOTA Ethics

ESR Essentials: common performance metrics in AI-practice recommendations by the European Society of Medical Imaging Informatics.

Klontzas ME, Groot Lipman KBW, Akinci D' Antonoli T, Andreychenko A, Cuocolo R, Dietzel M, Gitto S, Huisman H, Santinha J, Vernuccio F, Visser JJ, Huisman M

•papers•Aug 3 2025

This article provides radiologists with practical recommendations for evaluating AI performance in radiology, ensuring alignment with clinical goals and patient safety. It outlines key performance metrics, including overlap metrics for segmentation, test-based metrics (e.g., sensitivity, specificity, and area under the receiver operating characteristic curve), and outcome-based metrics (e.g., precision, negative predictive value, F1-score, Matthews correlation coefficient, and area under the precision-recall curve). Key recommendations emphasize local validation using independent datasets, selecting task-specific metrics, and considering deployment context to ensure real-world performance matches claimed efficacy. Common pitfalls, such as overreliance on a single metric, misinterpretation in low-prevalence settings, and failure to account for clinical workflow, are addressed with mitigation strategies. Additional guidance is provided on threshold selection, prevalence-adjusted evaluation, and AI-generated image quality assessment. This guide equips radiologists to critically evaluate both commercially available and in-house developed AI tools, ensuring their safe and effective integration into clinical practice. CLINICAL RELEVANCE STATEMENT: This review provides guidance on selecting and interpreting AI performance metrics in radiology, ensuring clinically meaningful evaluation and safe deployment of AI tools. By addressing common pitfalls and promoting standardized reporting, it supports radiologists in making informed decisions, ultimately improving diagnostic accuracy and patient outcomes. KEY POINTS: Radiologists must evaluate performance metrics as they reflect acceptable performance in specific datasets but do not guarantee clinical utility. Independent evaluation tailored to the clinical setting is essential. 
Performance metrics must align with the intended task of the AI application (segmentation, detection, or classification) and be selected based on domain knowledge and clinical context. Sensitivity, specificity, area under the ROC curve, and accuracy must be interpreted with prevalence-dependent metrics (e.g., precision, F1 score, and Matthews correlation coefficient) calculated for the target population to ensure safe and effective clinical use.
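
To make the point about prevalence-dependent metrics concrete, the following sketch computes the metrics named in the abstract from a 2x2 confusion matrix and then re-derives precision (PPV) and NPV for a different target prevalence using Bayes' rule. The function names and the example counts are illustrative assumptions, not taken from the article.

```python
import math

def _ratio(num, den):
    # Return NaN instead of raising when a denominator is empty.
    return num / den if den else float("nan")

def confusion_metrics(tp, fp, fn, tn):
    """Standard metrics from the counts of a binary confusion matrix."""
    sens = _ratio(tp, tp + fn)            # sensitivity (recall)
    spec = _ratio(tn, tn + fp)            # specificity
    ppv = _ratio(tp, tp + fp)             # precision
    npv = _ratio(tn, tn + fn)             # negative predictive value
    f1 = _ratio(2 * ppv * sens, ppv + sens)
    mcc = _ratio(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {"sensitivity": sens, "specificity": spec, "precision": ppv,
            "npv": npv, "f1": f1, "mcc": mcc}

def ppv_at_prevalence(sens, spec, prev):
    """Precision expected in a population with disease prevalence prev."""
    return _ratio(sens * prev, sens * prev + (1 - spec) * (1 - prev))

def npv_at_prevalence(sens, spec, prev):
    """NPV expected in a population with disease prevalence prev."""
    return _ratio(spec * (1 - prev), spec * (1 - prev) + (1 - sens) * prev)

# Hypothetical balanced test set: 100 diseased, 100 healthy cases.
balanced = confusion_metrics(tp=90, fp=10, fn=10, tn=90)
# Same sensitivity/specificity, but a screening population at 1% prevalence.
screening_ppv = ppv_at_prevalence(0.9, 0.9, 0.01)
```

With these assumed counts, precision is 0.90 on the balanced set but falls to roughly 0.08 at 1% prevalence, even though sensitivity and specificity are unchanged. This is the low-prevalence misinterpretation the recommendations warn about.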

Review Consortium Policy Ethics

Multimodal Attention-Aware Fusion for Diagnosing Distal Myopathy: Evaluating Model Interpretability and Clinician Trust

Mohsen Abbaspour Onari, Lucie Charlotte Magister, Yaoxin Wu, Amalia Lupi, Dario Creazzo, Mattia Tordin, Luigi Di Donatantonio, Emilio Quaia, Chao Zhang, Isel Grau, Marco S. Nobile, Yingqian Zhang, Pietro Liò

•preprint•Aug 2 2025

Distal myopathy represents a genetically heterogeneous group of skeletal muscle disorders with broad clinical manifestations, posing diagnostic challenges in radiology. To address this, we propose a novel multimodal attention-aware fusion architecture that combines features extracted from two distinct deep learning models, one capturing global contextual information and the other focusing on local details, representing complementary aspects of the input data. Uniquely, our approach integrates these features through an attention gate mechanism, enhancing both predictive performance and interpretability. Our method achieves a high classification accuracy on the BUSI benchmark and a proprietary distal myopathy dataset, while also generating clinically relevant saliency maps that support transparent decision-making in medical diagnosis. We rigorously evaluated interpretability through (1) functionally grounded metrics, coherence scoring against reference masks and incremental deletion analysis, and (2) application-grounded validation with seven expert radiologists. While our fusion strategy boosts predictive performance relative to single-stream and alternative fusion strategies, both quantitative and qualitative evaluations reveal persistent gaps in anatomical specificity and clinical usefulness of the interpretability. These findings highlight the need for richer, context-aware interpretability methods and human-in-the-loop feedback to meet clinicians' expectations in real-world diagnostic settings.
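
The "incremental deletion analysis" mentioned above can be sketched generically: features are removed in order of decreasing saliency and the model score is recorded after each removal, so a faithful saliency map should make the score drop quickly. This is a simplified illustration under assumed names (`predict_fn`, `baseline`) and a toy scoring function, not the evaluation protocol used in the paper.

```python
import numpy as np

def deletion_curve(predict_fn, x, saliency, baseline=0.0):
    """Model score after deleting features from most to least salient.

    predict_fn is a hypothetical callback mapping an input array to a scalar
    score; deleted features are replaced by the baseline value.
    """
    order = np.argsort(-saliency.ravel())      # most salient first
    work = x.astype(float).ravel().copy()
    scores = [predict_fn(work.reshape(x.shape))]
    for idx in order:
        work[idx] = baseline
        scores.append(predict_fn(work.reshape(x.shape)))
    return np.array(scores)

# Toy model (assumed): the score is simply the sum of the input values.
toy_predict = lambda v: float(v.sum())
x = np.array([3.0, 1.0, 2.0])

faithful = deletion_curve(toy_predict, x, saliency=x)
shuffled = deletion_curve(toy_predict, x, saliency=np.array([1.0, 3.0, 2.0]))
```

Here the faithful ranking yields a lower mean score along the curve than the shuffled one, which is the usual way such curves are compared (lower area under the deletion curve suggests a more faithful explanation).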

MRI Classification Musculoskeletal Retrospective Clinical In Silico Academic Lab GenAI Ethics

Rapid review: Growing usage of Multimodal Large Language Models in healthcare.

Gupta P, Zhang Z, Song M, Michalowski M, Hu X, Stiglic G, Topaz M

•papers•Aug 1 2025

Recent advancements in large language models (LLMs) have led to multimodal LLMs (MLLMs), which integrate multiple data modalities beyond text. Although MLLMs show promise, there is a gap in the literature that empirically demonstrates their impact in healthcare. This paper summarizes the applications of MLLMs in healthcare, highlighting their potential to transform health practices. A rapid literature review was conducted in August 2024 using World Health Organization (WHO) rapid-review methodology and PRISMA standards, with searches across four databases (Scopus, Medline, PubMed and ACM Digital Library) and top-tier conferences, including NeurIPS, ICML, AAAI, MICCAI, CVPR, ACL and EMNLP. Articles on healthcare applications of MLLMs were included for analysis based on inclusion and exclusion criteria. The search yielded 115 articles, of which 39 were included in the final analysis. Of these, 77% appeared online (preprints and published) in 2024, reflecting the emergence of MLLMs. 80% of studies were from Asia and North America (mainly China and US), with Europe lagging. Studies split evenly between evaluations of pre-built MLLMs (60% focused on GPT versions) and development of custom MLLMs/frameworks with task-specific customizations. About 81% of studies examined MLLMs for diagnosis and reporting in radiology, pathology, and ophthalmology, with additional applications in education, surgery, and mental health. Prompting strategies, used in 80% of studies, improved performance in nearly half. However, evaluation practices were inconsistent, with 67% reporting accuracy. Error analysis was mostly anecdotal, with only 18% categorizing failure types. Only 13% validated explainability through clinician feedback. Clinical deployment was demonstrated in just 3% of studies, and workflow integration, governance, and safety were rarely addressed. MLLMs offer substantial potential for healthcare transformation through multimodal data integration. 
Yet, methodological inconsistencies, limited validation, and underdeveloped deployment strategies highlight the need for standardized evaluation metrics, structured error analysis, and human-centered design to support safe, scalable, and trustworthy clinical adoption.

Mixed Modality LLM Radiology Report Review Concept Academic Lab GenAI Ethics

Emerging Applications of Feature Selection in Osteoporosis Research: From Biomarker Discovery to Clinical Decision Support.

Wang J, Wang Y, Ren J, Li Z, Guo L, Lv J

•papers•Aug 1 2025

Osteoporosis (OP), a systemic skeletal disease characterized by compromised bone strength and elevated fracture susceptibility, represents a growing global health challenge that necessitates early detection and accurate risk stratification. With the exponential growth of multidimensional biomedical data in OP research, feature selection has become an indispensable machine learning paradigm that improves model generalizability. At the same time, it preserves clinical interpretability and enhances predictive accuracy. This perspective article systematically reviews the transformative role of feature selection methodologies across three critical domains of OP investigation: 1) multi-omics biomarker identification, 2) diagnostic pattern recognition, and 3) fracture risk prognostication. In biomarker discovery, advanced feature selection algorithms systematically refine high-dimensional multi-omics datasets (genomic, proteomic, metabolomic) to isolate key molecular signatures correlated with bone mineral density (BMD) trajectories and microarchitectural deterioration. For clinical diagnostics, these techniques enable efficient extraction of discriminative patterns from multimodal imaging data, including dual-energy X-ray absorptiometry (DXA), quantitative computed tomography (CT), and emerging dental radiographic biomarkers. In prognostic modeling, strategic variable selection optimizes prognostic accuracy by integrating demographic, biochemical, and biomechanical predictors while mitigating overfitting in heterogeneous patient cohorts. Current challenges include heterogeneity in dataset quality and dimensionality, translational gaps between algorithmic outputs and clinical decision parameters, and limited reproducibility across diverse populations. 
Future directions should prioritize the development of adaptive feature selection frameworks capable of dynamic multi-omics data integration, coupled with hybrid intelligence systems that synergize machine-derived biomarkers with clinician expertise. Addressing these challenges requires coordinated interdisciplinary efforts to establish standardized validation protocols and create clinician-friendly decision support interfaces, ultimately bridging the gap between computational OP research and personalized patient care.

Mixed Modality Classification Musculoskeletal Review Concept Ethics GenAI
