Sort by:
Page 1 of 212 results
Next

DREAM: A framework for discovering mechanisms underlying AI prediction of protected attributes

Gadgil, S. U., DeGrave, A. J., Janizek, J. D., Xu, S., Nwandu, L., Fonjungo, F., Lee, S.-I., Daneshjou, R.

medrxiv logopreprintJul 21 2025
Recent advances in Artificial Intelligence (AI) have started disrupting the healthcare industry, especially medical imaging, and AI devices are increasingly being deployed into clinical practice. Such classifiers have previously demonstrated the ability to discern a range of protected demographic attributes (like race, age, sex) from medical images with unexpectedly high performance, a sensitive task which is difficult even for trained physicians. In this study, we motivate and introduce a general explainable AI (XAI) framework called DREAM (DiscoveRing and Explaining AI Mechanisms) for interpreting how AI models trained on medical images predict protected attributes. Focusing on two modalities, radiology and dermatology, we are successfully able to train high-performing classifiers for predicting race from chest x-rays (ROC-AUC score of [~]0.96) and sex from dermoscopic lesions (ROC-AUC score of [~]0.78). We highlight how incorrect use of these demographic shortcuts can have a detrimental effect on the performance of a clinically relevant downstream task like disease diagnosis under a domain shift. Further, we employ various XAI techniques to identify specific signals which can be leveraged to predict sex. Finally, we propose a technique, which we callremoval via balancing, to quantify how much a signal contributes to the classification performance. Using this technique and the signals identified, we are able to explain [~]15% of the total performance for radiology and [~]42% of the total performance for dermatology. We envision DREAM to be broadly applicable to other modalities and demographic attributes. This analysis not only underscores the importance of cautious AI application in healthcare but also opens avenues for improving the transparency and reliability of AI-driven diagnostic tools.

Prediction of OncotypeDX recurrence score using H&E stained WSI images

Cohen, S., Shamai, G., Sabo, E., Cretu, A., Barshack, I., Goldman, T., Bar-Sela, G., Pearson, A. T., Huo, D., Howard, F. M., Kimmel, R., Mayer, C.

medrxiv logopreprintJul 21 2025
The OncotypeDX 21-gene assay is a widely adopted tool for estimating recurrence risk and informing chemotherapy decisions in early-stage, hormone receptor-positive, HER2-negative breast cancer. Although informative, its high cost and long turnaround time limit accessibility and delay treatment in low- and middle-income countries, creating a need for alternative solutions. This study presents a deep learning-based approach for predicting OncotypeDX recurrence scores directly from hematoxylin and eosin-stained whole slide images. Our approach leverages a deep learning foundation model pre-trained on 171,189 slides via self-supervised learning, which is fine-tuned for our task. The model was developed and validated using five independent cohorts, out of which three are external. On the two external cohorts that include OncotypeDX scores, the model achieved an AUC of 0.825 and 0.817, and identified 21.9% and 25.1% of the patients as low-risk with sensitivity of 0.97 and 0.95 and negative predictive value of 0.97 and 0.96, showing strong generalizability despite variations in staining protocols and imaging devices. Kaplan-Meier analysis demonstrated that patients classified as low-risk by the model had a significantly better prognosis than those classified as high-risk, with a hazard ratio of 4.1 (P<0.001) and 2.0 (P<0.01) on the two external cohorts that include patient outcomes. This artificial intelligence-driven solution offers a rapid, cost-effective, and scalable alternative to genomic testing, with the potential to enhance personalized treatment planning, especially in resource-constrained settings.

Detecting Fifth Metatarsal Fractures on Radiographs through the Lens of Smartphones: A FIXUS AI Algorithm

Taseh, A., Shah, A., Eftekhari, M., Flaherty, A., Ebrahimi, A., Jones, S., Nukala, V., Nazarian, A., Waryasz, G., Ashkani-Esfahani, S.

medrxiv logopreprintJul 18 2025
BackgroundFifth metatarsal (5MT) fractures are common but challenging to diagnose, particularly with limited expertise or subtle fractures. Deep learning shows promise but faces limitations due to image quality requirements. This study develops a deep learning model to detect 5MT fractures from smartphone-captured radiograph images, enhancing accessibility of diagnostic tools. MethodsA retrospective study included patients aged >18 with 5MT fractures (n=1240) and controls (n=1224). Radiographs (AP, oblique, lateral) from Electronic Health Records (EHR) were obtained and photographed using a smartphone, creating a new dataset (SP). Models using ResNet 152V2 were trained on EHR, SP, and combined datasets, then evaluated on a separate smartphone test dataset (SP-test). ResultsOn validation, the SP model achieved optimal performance (AUROC: 0.99). On the SP-test dataset, the EHR models performance decreased (AUROC: 0.83), whereas SP and combined models maintained high performance (AUROC: 0.99). ConclusionsSmartphone-specific deep learning models effectively detect 5MT fractures, suggesting their practical utility in resource-limited settings.

A clinically relevant morpho-molecular classification of lung neuroendocrine tumours

Sexton-Oates, A., Mathian, E., Candeli, N., Lim, Y., Voegele, C., Di Genova, A., Mange, L., Li, Z., van Weert, T., Hillen, L. M., Blazquez-Encinas, R., Gonzalez-Perez, A., Morrison, M. L., Lauricella, E., Mangiante, L., Bonheme, L., Moonen, L., Absenger, G., Altmuller, J., Degletagne, C., Brustugun, O. T., Cahais, V., Centonze, G., Chabrier, A., Cuenin, C., Damiola, F., de Montpreville, V. T., Deleuze, J.-F., Dingemans, A.-M. C., Fadel, E., Gadot, N., Ghantous, A., Graziano, P., Hofman, P., Hofman, V., Ibanez-Costa, A., Lacomme, S., Lopez-Bigas, N., Lund-Iversen, M., Milione, M., Muscarella, L

medrxiv logopreprintJul 18 2025
Lung neuroendocrine tumours (NETs, also known as carcinoids) are rapidly rising in incidence worldwide but have unknown aetiology and limited therapeutic options beyond surgery. We conducted multi-omic analyses on over 300 lung NETs including whole-genome sequencing (WGS), transcriptome profiling, methylation arrays, spatial RNA sequencing, and spatial proteomics. The integration of multi-omic data provides definitive proof of the existence of four strikingly different molecular groups that vary in patient characteristics, genomic and transcriptomic profiles, microenvironment, and morphology, as much as distinct diseases. Among these, we identify a new molecular group, enriched for highly aggressive supra-carcinoids, that displays an immune-rich microenvironment linked to tumour--macrophage crosstalk, and we uncover an undifferentiated cell population within supra-carcinoids, explaining their molecular and behavioural link to high-grade lung neuroendocrine carcinomas. Deep learning models accurately identified the Ca A1, Ca A2, and Ca B groups based on morphology alone, outperforming current histological criteria. The characteristic tumour microenvironment of supra-carcinoids and the validation of a panel of immunohistochemistry markers for the other three molecular groups demonstrates that these groups can be accurately identified based solely on morphological features, facilitating their implementation in the clinical setting. Our proposed morpho-molecular classification highlights group-specific therapeutic opportunities, including DLL3, FGFR, TERT, and BRAF inhibitors. Overall, our findings unify previously proposed molecular classifications and refine the lung cancer map by revealing novel tumour types and potential treatments, with significant implications for prognosis and treatment decision-making.

A Clinically-Informed Framework for Evaluating Vision-Language Models in Radiology Report Generation: Taxonomy of Errors and Risk-Aware Metric

Guan, H., Hou, P. C., Hong, P., Wang, L., Zhang, W., Du, X., Zhou, Z., Zhou, L.

medrxiv logopreprintJul 14 2025
Recent advances in vision-language models (VLMs) have enabled automatic radiology report generation, yet current evaluation methods remain limited to general-purpose NLP metrics or coarse classification-based clinical scores. In this study, we propose a clinically informed evaluation framework for VLM-generated radiology reports that goes beyond traditional performance measures. We define a taxonomy of 12 radiology-specific error types, each annotated with clinical risk levels (low, medium, high) in collaboration with physicians. Using this framework, we conduct a comprehensive error analysis of three representative VLMs, i.e., DeepSeek VL2, CXR-LLaVA, and CheXagent, on 685 gold-standard, expert-annotated MIMIC-CXR cases. We further introduce a risk-aware evaluation metric, the Clinical Risk-weighted Error Score for Text-generation (CREST), to quantify safety impact. Our findings reveal critical model vulnerabilities, common error patterns, and condition-specific risk profiles, offering actionable insights for model development and deployment. This work establishes a safety-centric foundation for evaluating and improving medical report generation models. The source code of our evaluation framework, including CREST computation and error taxonomy analysis, is available at https://github.com/guanharry/VLM-CREST.

Three-dimensional high-content imaging of unstained soft tissue with subcellular resolution using a laboratory-based multi-modal X-ray microscope

Esposito, M., Astolfo, A., Zhou, Y., Buchanan, I., Teplov, A., Endrizzi, M., Egido Vinogradova, A., Makarova, O., Divan, R., Tang, C.-M., Yagi, Y., Lee, P. D., Walsh, C. L., Ferrara, J. D., Olivo, A.

medrxiv logopreprintJul 14 2025
With increasing interest in studying biological systems across spatial scales--from centimetres down to nanometres--histology continues to be the gold standard for tissue imaging at cellular resolution, providing an essential bridge between macroscopic and nanoscopic analysis. However, its inherently destructive and two-dimensional nature limits its ability to capture the full three-dimensional complexity of tissue architecture. Here we show that phase-contrast X-ray microscopy can enable three-dimensional virtual histology with subcellular resolution. This technique provides direct quantification of electron density without restrictive assumptions, allowing for direct characterisation of cellular nuclei in a standard laboratory setting. By combining high spatial resolution and soft tissue contrast, with automated segmentation of cell nuclei, we demonstrated virtual H&E staining using machine learning-based style transfer, yielding volumetric datasets compatible with existing histopathological analysis tools. Furthermore, by integrating electron density and the sensitivity to nanometric features of the dark field contrast channel, we achieve stain-free, high-content imaging capable of distinguishing nuclei and extracellular matrix.

A Unified Platform for Radiology Report Generation and Clinician-Centered AI Evaluation

Ma, Z., Yang, X., Atalay, Z., Yang, A., Collins, S., Bai, H., Bernstein, M., Baird, G., Jiao, Z.

medrxiv logopreprintJul 8 2025
Generative AI models have demonstrated strong potential in radiology report generation, but their clinical adoption depends on physician trust. In this study, we conducted a radiology-focused Turing test to evaluate how well attendings and residents distinguish AI-generated reports from those written by radiologists, and how their confidence and decision time reflect trust. we developed an integrated web-based platform comprising two core modules: Report Generation and Report Evaluation. Using the web-based platform, eight participants evaluated 48 anonymized X-ray cases, each paired with two reports from three comparison groups: radiologist vs. AI model 1, radiologist vs. AI model 2, and AI model 1 vs. AI model 2. Participants selected the AI-generated report, rated their confidence, and indicated report preference. Attendings outperformed residents in identifying AI-generated reports (49.9% vs. 41.1%) and exhibited longer decision times, suggesting more deliberate judgment. Both groups took more time when both reports were AI-generated. Our findings highlight the role of clinical experience in AI acceptance and the need for design strategies that foster trust in clinical applications. The project page of the evaluation platform is available at: https://zachatalay89.github.io/Labsite.

Efficient Chest X-Ray Feature Extraction and Feature Fusion for Pneumonia Detection Using Lightweight Pretrained Deep Learning Models

Chandola, Y., Uniyal, V., Bachheti, Y.

medrxiv logopreprintJun 30 2025
Pneumonia is a respiratory condition characterized by inflammation of the alveolar sacs in the lungs, which disrupts normal oxygen exchange. This disease disproportionately impacts vulnerable populations, including young children (under five years of age) and elderly individuals (over 65 years), primarily due to their compromised immune systems. The mortality rate associated with pneumonia remains alarmingly high, particularly in low-resource settings where healthcare access is limited. Although effective prevention strategies exist, pneumonia continues to claim the lives of approximately one million children each year, earning its reputation as a "silent killer." Globally, an estimated 500 million cases are documented annually, underscoring its widespread public health burden. This study explores the design and evaluation of the CNN-based Computer-Aided Diagnostic (CAD) systems with an aim of carrying out competent as well as resourceful classification and categorization of chest radiographs into binary classes (Normal, Pneumonia). An augmented Kaggle dataset of 18,200 chest radiographs, split between normal and pneumonia cases, was utilized. This study conducts a series of experiments to evaluate lightweight CNN models--ShuffleNet, NASNet-Mobile, and EfficientNet-b0--using transfer learning that achieved accuracy of 90%, 88% and 89%, prompting the task for deep feature extraction from each of the networks and applying feature fusion to further pair it with SVM classifier and XGBoost classifier, achieving an accuracy of 97% and 98% resepectively. The proposed research emphasizes the crucial role of CAD systems in advancing radiological diagnostics, delivering effective solutions to aid radiologists in distinguishing between diagnoses by applying feature fusion, feature selection along with various machine learning algorithms and deep learning architectures.

Diagnostic Performance of Universal versus Stratified Computer-Aided Detection Thresholds for Chest X-Ray-Based Tuberculosis Screening

Sung, J., Kitonsa, P. J., Nalutaaya, A., Isooba, D., Birabwa, S., Ndyabayunga, K., Okura, R., Magezi, J., Nantale, D., Mugabi, I., Nakiiza, V., Dowdy, D. W., Katamba, A., Kendall, E. A.

medrxiv logopreprintJun 24 2025
BackgroundComputer-aided detection (CAD) software analyzes chest X-rays for features suggestive of tuberculosis (TB) and provides a numeric abnormality score. However, estimates of CAD accuracy for TB screening are hindered by the lack of confirmatory data among people with lower CAD scores, including those without symptoms. Additionally, the appropriate CAD score thresholds for obtaining further testing may vary according to population and client characteristics. MethodsWe screened for TB in Ugandan individuals aged [&ge;]15 years using portable chest X-rays with CAD (qXR v3). Participants were offered screening regardless of their symptoms. Those with X-ray scores above a threshold of 0.1 (range, 0 - 1) were asked to provide sputum for Xpert Ultra testing. We estimated the diagnostic accuracy of CAD for detecting Xpert-positive TB when using the same threshold for all individuals (under different assumptions about TB prevalence among people with X-ray scores <0.1), and compared this estimate to age- and/or sex-stratified approaches. FindingsOf 52,835 participants screened for TB using CAD, 8,949 (16.9%) had X-ray scores [&ge;]0.1. Of 7,219 participants with valid Xpert Ultra results, 382 (5.3%) were Xpert-positive, including 81 with trace results. Assuming 0.1% of participants with X-ray scores <0.1 would have been Xpert-positive if tested, qXR had an estimated AUC of 0.920 (95% confidence interval 0.898-0.941) for Xpert-positive TB. Stratifying CAD thresholds according to age and sex improved accuracy; for example, at 96.1% specificity, estimated sensitivity was 75.0% for a universal threshold (of [&ge;]0.65) versus 76.9% for thresholds stratified by age and sex (p=0.046). InterpretationThe accuracy of CAD for TB screening among all screening participants, including those without symptoms or abnormal chest X-rays, is higher than previously estimated. Stratifying CAD thresholds based on client characteristics such as age and sex could further improve accuracy, enabling a more effective and personalized approach to TB screening. FundingNational Institutes of Health Research in contextO_ST_ABSEvidence before this studyC_ST_ABSThe World Health Organization (WHO) has endorsed computer-aided detection (CAD) as a screening tool for tuberculosis (TB), but the appropriate CAD score that triggers further diagnostic evaluation for tuberculosis varies by population. The WHO recommends determining the appropriate CAD threshold for specific settings and population and considering unique thresholds for specific populations, including older age groups, among whom CAD may perform poorly. We performed a PubMed literature search for articles published until September 9, 2024, using the search terms "tuberculosis" AND ("computer-aided detection" OR "computer aided detection" OR "CAD" OR "computer-aided reading" OR "computer aided reading" OR "artificial intelligence"), which resulted in 704 articles. Among them, we identified studies that evaluated the performance of CAD for tuberculosis screening and additionally reviewed relevant references. Most prior studies reported area under the curves (AUC) ranging from 0.76 to 0.88 but limited their evaluations to individuals with symptoms or abnormal chest X-rays. Some prior studies identified subgroups (including older individuals and people with prior TB) among whom CAD had lower-than-average AUCs, and authors discussed how the prevalence of such characteristics could affect the optimal value of a population-wide CAD threshold; however, none estimated the accuracy that could be gained with adjusting CAD thresholds between individuals based on personal characteristics. Added value of this studyIn this study, all consenting individuals in a high-prevalence setting were offered chest X-ray screening, regardless of symptoms, if they were [&ge;]15 years old, not pregnant, and not on TB treatment. A very low CAD score cutoff (qXR v3 score of 0.1 on a 0-1 scale) was used to select individuals for confirmatory sputum molecular testing, enabling the detection of radiographically mild forms of TB and facilitating comparisons of diagnostic accuracy at different CAD thresholds. With this more expansive, symptom-neutral evaluation of CAD, we estimated an AUC of 0.920, and we found that the qXR v3 threshold needed to decrease to under 0.1 to meet the WHO target product profile goal of [&ge;]90% sensitivity and [&ge;]70% specificity. Compared to using the same thresholds for all participants, adjusting CAD thresholds by age and sex strata resulted in a 1 to 2% increase in sensitivity without affecting specificity. Implications of all the available evidenceTo obtain high sensitivity with CAD screening in high-prevalence settings, low score thresholds may be needed. However, countries with a high burden of TB often do not have sufficient resources to test all individuals above a low threshold. In such settings, adjusting CAD thresholds based on individual characteristics associated with TB prevalence (e.g., male sex) and those associated with false-positive X-ray results (e.g., old age) can potentially improve the efficiency of TB screening programs.

Lack of children in public medical imaging data points to growing age bias in biomedical AI

Hua, S. B. Z., Heller, N., He, P., Towbin, A. J., Chen, I., Lu, A., Erdman, L.

medrxiv logopreprintJun 7 2025
Artificial intelligence (AI) is rapidly transforming healthcare, but its benefits are not reaching all patients equally. Children remain overlooked with only 17% of FDA-approved medical AI devices labeled for pediatric use. In this work, we demonstrate that this exclusion may stem from a fundamental data gap. Our systematic review of 181 public medical imaging datasets reveals that children represent just under 1% of available data, while the majority of machine learning imaging conference papers we surveyed utilized publicly available data for methods development. Much like systematic biases of other kinds in model development, past studies have demonstrated the manner in which pediatric representation in data used for models intended for the pediatric population is essential for model performance in that population. We add to these findings, showing that adult-trained chest radiograph models exhibit significant age bias when applied to pediatric populations, with higher false positive rates in younger children. This work underscores the urgent need for increased pediatric representation in publicly accessible medical datasets. We provide actionable recommendations for researchers, policymakers, and data curators to address this age equity gap and ensure AI benefits patients of all ages. 1-2 sentence summaryOur analysis reveals a critical healthcare age disparity: children represent less than 1% of public medical imaging datasets. This gap in representation leads to biased predictions across medical image foundation models, with the youngest patients facing the highest risk of misdiagnosis.
Page 1 of 212 results
Show
per page
12»

Ready to Sharpen Your Edge?

Join hundreds of your peers who rely on RadAI Slice. Get the essential weekly briefing that empowers you to navigate the future of radiology.

We respect your privacy. Unsubscribe at any time.