Sort by:
Page 1 of 437 results
Next

Revisiting Performance Claims for Chest X-Ray Models Using Clinical Context

Andrew Wang, Jiashuo Zhang, Michael Oberst

arxiv logopreprintSep 24 2025
Public healthcare datasets of Chest X-Rays (CXRs) have long been a popular benchmark for developing computer vision models in healthcare. However, strong average-case performance of machine learning (ML) models on these datasets is insufficient to certify their clinical utility. In this paper, we use clinical context, as captured by prior discharge summaries, to provide a more holistic evaluation of current ``state-of-the-art'' models for the task of CXR diagnosis. Using discharge summaries recorded prior to each CXR, we derive a ``prior'' or ``pre-test'' probability of each CXR label, as a proxy for existing contextual knowledge available to clinicians when interpreting CXRs. Using this measure, we demonstrate two key findings: First, for several diagnostic labels, CXR models tend to perform best on cases where the pre-test probability is very low, and substantially worse on cases where the pre-test probability is higher. Second, we use pre-test probability to assess whether strong average-case performance reflects true diagnostic signal, rather than an ability to infer the pre-test probability as a shortcut. We find that performance drops sharply on a balanced test set where this shortcut does not exist, which may indicate that much of the apparent diagnostic power derives from inferring this clinical context. We argue that this style of analysis, using context derived from clinical notes, is a promising direction for more rigorous and fine-grained evaluation of clinical vision models.

Limitations of Public Chest Radiography Datasets for Artificial Intelligence: Label Quality, Domain Shift, Bias and Evaluation Challenges

Amy Rafferty, Rishi Ramaesh, Ajitha Rajan

arxiv logopreprintSep 18 2025
Artificial intelligence has shown significant promise in chest radiography, where deep learning models can approach radiologist-level diagnostic performance. Progress has been accelerated by large public datasets such as MIMIC-CXR, ChestX-ray14, PadChest, and CheXpert, which provide hundreds of thousands of labelled images with pathology annotations. However, these datasets also present important limitations. Automated label extraction from radiology reports introduces errors, particularly in handling uncertainty and negation, and radiologist review frequently disagrees with assigned labels. In addition, domain shift and population bias restrict model generalisability, while evaluation practices often overlook clinically meaningful measures. We conduct a systematic analysis of these challenges, focusing on label quality, dataset bias, and domain shift. Our cross-dataset domain shift evaluation across multiple model architectures revealed substantial external performance degradation, with pronounced reductions in AUPRC and F1 scores relative to internal testing. To assess dataset bias, we trained a source-classification model that distinguished datasets with near-perfect accuracy, and performed subgroup analyses showing reduced performance for minority age and sex groups. Finally, expert review by two board-certified radiologists identified significant disagreement with public dataset labels. Our findings highlight important clinical weaknesses of current benchmarks and emphasise the need for clinician-validated datasets and fairer evaluation frameworks.

Invisible Attributes, Visible Biases: Exploring Demographic Shortcuts in MRI-based Alzheimer's Disease Classification

Akshit Achara, Esther Puyol Anton, Alexander Hammers, Andrew P. King

arxiv logopreprintSep 11 2025
Magnetic resonance imaging (MRI) is the gold standard for brain imaging. Deep learning (DL) algorithms have been proposed to aid in the diagnosis of diseases such as Alzheimer's disease (AD) from MRI scans. However, DL algorithms can suffer from shortcut learning, in which spurious features, not directly related to the output label, are used for prediction. When these features are related to protected attributes, they can lead to performance bias against underrepresented protected groups, such as those defined by race and sex. In this work, we explore the potential for shortcut learning and demographic bias in DL based AD diagnosis from MRI. We first investigate if DL algorithms can identify race or sex from 3D brain MRI scans to establish the presence or otherwise of race and sex based distributional shifts. Next, we investigate whether training set imbalance by race or sex can cause a drop in model performance, indicating shortcut learning and bias. Finally, we conduct a quantitative and qualitative analysis of feature attributions in different brain regions for both the protected attribute and AD classification tasks. Through these experiments, and using multiple datasets and DL models (ResNet and SwinTransformer), we demonstrate the existence of both race and sex based shortcut learning and bias in DL based AD classification. Our work lays the foundation for fairer DL diagnostic tools in brain MRI. The code is provided at https://github.com/acharaakshit/ShortMR

RetinaGuard: Obfuscating Retinal Age in Fundus Images for Biometric Privacy Preserving

Zhengquan Luo, Chi Liu, Dongfu Xiao, Zhen Yu, Yueye Wang, Tianqing Zhu

arxiv logopreprintSep 7 2025
The integration of AI with medical images enables the extraction of implicit image-derived biomarkers for a precise health assessment. Recently, retinal age, a biomarker predicted from fundus images, is a proven predictor of systemic disease risks, behavioral patterns, aging trajectory and even mortality. However, the capability to infer such sensitive biometric data raises significant privacy risks, where unauthorized use of fundus images could lead to bioinformation leakage, breaching individual privacy. In response, we formulate a new research problem of biometric privacy associated with medical images and propose RetinaGuard, a novel privacy-enhancing framework that employs a feature-level generative adversarial masking mechanism to obscure retinal age while preserving image visual quality and disease diagnostic utility. The framework further utilizes a novel multiple-to-one knowledge distillation strategy incorporating a retinal foundation model and diverse surrogate age encoders to enable a universal defense against black-box age prediction models. Comprehensive evaluations confirm that RetinaGuard successfully obfuscates retinal age prediction with minimal impact on image quality and pathological feature representation. RetinaGuard is also flexible for extension to other medical image derived biomarkers. RetinaGuard is also flexible for extension to other medical image biomarkers.

RetinaGuard: Obfuscating Retinal Age in Fundus Images for Biometric Privacy Preserving

Zhengquan Luo, Chi Liu, Dongfu Xiao, Zhen Yu, Yueye Wang, Tianqing Zhu

arxiv logopreprintSep 7 2025
The integration of AI with medical images enables the extraction of implicit image-derived biomarkers for a precise health assessment. Recently, retinal age, a biomarker predicted from fundus images, is a proven predictor of systemic disease risks, behavioral patterns, aging trajectory and even mortality. However, the capability to infer such sensitive biometric data raises significant privacy risks, where unauthorized use of fundus images could lead to bioinformation leakage, breaching individual privacy. In response, we formulate a new research problem of biometric privacy associated with medical images and propose RetinaGuard, a novel privacy-enhancing framework that employs a feature-level generative adversarial masking mechanism to obscure retinal age while preserving image visual quality and disease diagnostic utility. The framework further utilizes a novel multiple-to-one knowledge distillation strategy incorporating a retinal foundation model and diverse surrogate age encoders to enable a universal defense against black-box age prediction models. Comprehensive evaluations confirm that RetinaGuard successfully obfuscates retinal age prediction with minimal impact on image quality and pathological feature representation. RetinaGuard is also flexible for extension to other medical image derived biomarkers. RetinaGuard is also flexible for extension to other medical image biomarkers.

SWiFT: Soft-Mask Weight Fine-tuning for Bias Mitigation

Junyu Yan, Feng Chen, Yuyang Xue, Yuning Du, Konstantinos Vilouras, Sotirios A. Tsaftaris, Steven McDonagh

arxiv logopreprintAug 26 2025
Recent studies have shown that Machine Learning (ML) models can exhibit bias in real-world scenarios, posing significant challenges in ethically sensitive domains such as healthcare. Such bias can negatively affect model fairness, model generalization abilities and further risks amplifying social discrimination. There is a need to remove biases from trained models. Existing debiasing approaches often necessitate access to original training data and need extensive model retraining; they also typically exhibit trade-offs between model fairness and discriminative performance. To address these challenges, we propose Soft-Mask Weight Fine-Tuning (SWiFT), a debiasing framework that efficiently improves fairness while preserving discriminative performance with much less debiasing costs. Notably, SWiFT requires only a small external dataset and only a few epochs of model fine-tuning. The idea behind SWiFT is to first find the relative, and yet distinct, contributions of model parameters to both bias and predictive performance. Then, a two-step fine-tuning process updates each parameter with different gradient flows defined by its contribution. Extensive experiments with three bias sensitive attributes (gender, skin tone, and age) across four dermatological and two chest X-ray datasets demonstrate that SWiFT can consistently reduce model bias while achieving competitive or even superior diagnostic accuracy under common fairness and accuracy metrics, compared to the state-of-the-art. Specifically, we demonstrate improved model generalization ability as evidenced by superior performance on several out-of-distribution (OOD) datasets.

State of Abdominal CT Datasets: A Critical Review of Bias, Clinical Relevance, and Real-world Applicability

Saeide Danaei, Zahra Dehghanian, Elahe Meftah, Nariman Naderi, Seyed Amir Ahmad Safavi-Naini, Faeze Khorasanizade, Hamid R. Rabiee

arxiv logopreprintAug 19 2025
This systematic review critically evaluates publicly available abdominal CT datasets and their suitability for artificial intelligence (AI) applications in clinical settings. We examined 46 publicly available abdominal CT datasets (50,256 studies). Across all 46 datasets, we found substantial redundancy (59.1\% case reuse) and a Western/geographic skew (75.3\% from North America and Europe). A bias assessment was performed on the 19 datasets with >=100 cases; within this subset, the most prevalent high-risk categories were domain shift (63\%) and selection bias (57\%), both of which may undermine model generalizability across diverse healthcare environments -- particularly in resource-limited settings. To address these challenges, we propose targeted strategies for dataset improvement, including multi-institutional collaboration, adoption of standardized protocols, and deliberate inclusion of diverse patient populations and imaging technologies. These efforts are crucial in supporting the development of more equitable and clinically robust AI models for abdominal imaging.

Trustworthy Medical Imaging with Large Language Models: A Study of Hallucinations Across Modalities

Anindya Bijoy Das, Shahnewaz Karim Sakib, Shibbir Ahmed

arxiv logopreprintAug 9 2025
Large Language Models (LLMs) are increasingly applied to medical imaging tasks, including image interpretation and synthetic image generation. However, these models often produce hallucinations, which are confident but incorrect outputs that can mislead clinical decisions. This study examines hallucinations in two directions: image to text, where LLMs generate reports from X-ray, CT, or MRI scans, and text to image, where models create medical images from clinical prompts. We analyze errors such as factual inconsistencies and anatomical inaccuracies, evaluating outputs using expert informed criteria across imaging modalities. Our findings reveal common patterns of hallucination in both interpretive and generative tasks, with implications for clinical reliability. We also discuss factors contributing to these failures, including model architecture and training data. By systematically studying both image understanding and generation, this work provides insights into improving the safety and trustworthiness of LLM driven medical imaging systems.

Transformer-Based Explainable Deep Learning for Breast Cancer Detection in Mammography: The MammoFormer Framework

Ojonugwa Oluwafemi Ejiga Peter, Daniel Emakporuena, Bamidele Dayo Tunde, Maryam Abdulkarim, Abdullahi Bn Umar

arxiv logopreprintAug 8 2025
Breast cancer detection through mammography interpretation remains difficult because of the minimal nature of abnormalities that experts need to identify alongside the variable interpretations between readers. The potential of CNNs for medical image analysis faces two limitations: they fail to process both local information and wide contextual data adequately, and do not provide explainable AI (XAI) operations that doctors need to accept them in clinics. The researcher developed the MammoFormer framework, which unites transformer-based architecture with multi-feature enhancement components and XAI functionalities within one framework. Seven different architectures consisting of CNNs, Vision Transformer, Swin Transformer, and ConvNext were tested alongside four enhancement techniques, including original images, negative transformation, adaptive histogram equalization, and histogram of oriented gradients. The MammoFormer framework addresses critical clinical adoption barriers of AI mammography systems through: (1) systematic optimization of transformer architectures via architecture-specific feature enhancement, achieving up to 13% performance improvement, (2) comprehensive explainable AI integration providing multi-perspective diagnostic interpretability, and (3) a clinically deployable ensemble system combining CNN reliability with transformer global context modeling. The combination of transformer models with suitable feature enhancements enables them to achieve equal or better results than CNN approaches. ViT achieves 98.3% accuracy alongside AHE while Swin Transformer gains a 13.0% advantage through HOG enhancements

Scaling Artificial Intelligence for Prostate Cancer Detection on MRI towards Population-Based Screening and Primary Diagnosis in a Global, Multiethnic Population (Study Protocol)

Anindo Saha, Joeran S. Bosma, Jasper J. Twilt, Alexander B. C. D. Ng, Aqua Asif, Kirti Magudia, Peder Larson, Qinglin Xie, Xiaodong Zhang, Chi Pham Minh, Samuel N. Gitau, Ivo G. Schoots, Martijn F. Boomsma, Renato Cuocolo, Nikolaos Papanikolaou, Daniele Regge, Derya Yakar, Mattijs Elschot, Jeroen Veltman, Baris Turkbey, Nancy A. Obuchowski, Jurgen J. Fütterer, Anwar R. Padhani, Hashim U. Ahmed, Tobias Nordström, Martin Eklund, Veeru Kasivisvanathan, Maarten de Rooij, Henkjan Huisman

arxiv logopreprintAug 4 2025
In this intercontinental, confirmatory study, we include a retrospective cohort of 22,481 MRI examinations (21,288 patients; 46 cities in 22 countries) to train and externally validate the PI-CAI-2B model, i.e., an efficient, next-generation iteration of the state-of-the-art AI system that was developed for detecting Gleason grade group $\geq$2 prostate cancer on MRI during the PI-CAI study. Of these examinations, 20,471 cases (19,278 patients; 26 cities in 14 countries) from two EU Horizon projects (ProCAncer-I, COMFORT) and 12 independent centers based in Europe, North America, Asia and Africa, are used for training and internal testing. Additionally, 2010 cases (2010 patients; 20 external cities in 12 countries) from population-based screening (STHLM3-MRI, IP1-PROSTAGRAM trials) and primary diagnostic settings (PRIME trial) based in Europe, North and South Americas, Asia and Australia, are used for external testing. Primary endpoint is the proportion of AI-based assessments in agreement with the standard of care diagnoses (i.e., clinical assessments made by expert uropathologists on histopathology, if available, or at least two expert urogenital radiologists in consensus; with access to patient history and peer consultation) in the detection of Gleason grade group $\geq$2 prostate cancer within the external testing cohorts. Our statistical analysis plan is prespecified with a hypothesis of diagnostic interchangeability to the standard of care at the PI-RADS $\geq$3 (primary diagnosis) or $\geq$4 (screening) cut-off, considering an absolute margin of 0.05 and reader estimates derived from the PI-CAI observer study (62 radiologists reading 400 cases). Secondary measures comprise the area under the receiver operating characteristic curve (AUROC) of the AI system stratified by imaging quality, patient age and patient ethnicity to identify underlying biases (if any).
Page 1 of 437 results
Show
per page

Ready to Sharpen Your Edge?

Join hundreds of your peers who rely on RadAI Slice. Get the essential weekly briefing that empowers you to navigate the future of radiology.

We respect your privacy. Unsubscribe at any time.