Latest Papers on Radiology AI. Sources: medrxiv, Order: Best Match, Limit: 10.

An Unsupervised XAI Framework for Dementia Detection with Context Enrichment

Singh, D., Brima, Y., Levin, F., Becker, M., Hiller, B., Hermann, A., Villar-Munoz, I., Beichert, L., Bernhardt, A., Buerger, K., Butryn, M., Dechent, P., Duezel, E., Ewers, M., Fliessbach, K., D. Freiesleben, S., Glanz, W., Hetzer, S., Janowitz, D., Goerss, D., Kilimann, I., Kimmich, O., Laske, C., Levin, J., Lohse, A., Luesebrink, F., Munk, M., Perneczky, R., Peters, O., Preis, L., Priller, J., Prudlo, J., Prychynenko, D., Rauchmann, B.-S., Rostamzadeh, A., Roy-Kluth, N., Scheffler, K., Schneider, A., Droste zu Senden, L., H. Schott, B., Spottke, A., Synofzik, M., Wiltfang, J., Jessen, F., W

•preprint•Jun 4 2025

Introduction: Explainable Artificial Intelligence (XAI) methods enhance the diagnostic efficiency of clinical decision support systems by making the predictions of convolutional neural networks (CNNs) on brain imaging more transparent and trustworthy. However, their clinical adoption is limited by insufficient validation of explanation quality. Our study introduces a framework that evaluates XAI methods by integrating neuroanatomical morphological features with CNN-generated relevance maps for disease classification.
Methods: We trained a CNN using brain MRI scans from six cohorts: ADNI, AIBL, DELCODE, DESCRIBE, EDSD, and NIFD (N=3253), including participants who were cognitively normal or had amnestic mild cognitive impairment, dementia due to Alzheimer's disease, or frontotemporal dementia. Clustering analysis benchmarked different explanation space configurations by using morphological features as proxy ground truth. We implemented three post-hoc explanation methods: i) explanation by simplifying model decisions, ii) explanation-by-example, and iii) textual explanations. A qualitative evaluation by clinicians (N=6) was performed to assess their clinical validity.
Results: Clustering performance improved in morphology-enriched explanation spaces, with gains in both homogeneity and completeness of the clusters. Post-hoc explanations by model simplification largely delineated converters and stable participants, while explanation-by-example presented possible cognition trajectories. Textual explanations gave rule-based summarization of pathological findings. The clinicians' qualitative evaluation highlighted challenges and opportunities of XAI for different clinical applications.
Conclusion: Our study refines XAI explanation spaces and applies various approaches for generating explanations. Within the context of AI-based decision support systems in dementia research, we found the explanation methods to be promising for enhancing diagnostic efficiency, backed up by the clinical assessments.
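
The abstract reports clustering quality as homogeneity and completeness against proxy ground-truth labels. As a minimal sketch of how these two scores are commonly defined (entropy-based, following the usual V-measure formulation), the following may help; the function names and the toy labels are illustrative assumptions, not the authors' code or data.

```python
import math
from collections import Counter


def _entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target, given):
    """Conditional entropy H(target | given) for two parallel label lists."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items())


def homogeneity_completeness(classes, clusters):
    """Return (homogeneity, completeness).

    homogeneity  = 1 - H(class | cluster) / H(class)
    completeness = 1 - H(cluster | class) / H(cluster)
    Both are defined as 1.0 when the denominator entropy is zero.
    """
    h_class = _entropy(classes)
    h_cluster = _entropy(clusters)
    homogeneity = 1.0 if h_class == 0 else 1.0 - _conditional_entropy(classes, clusters) / h_class
    completeness = 1.0 if h_cluster == 0 else 1.0 - _conditional_entropy(clusters, classes) / h_cluster
    return homogeneity, completeness


# Toy example (hypothetical labels): every cluster is pure, but each
# proxy class is split across two clusters.
proxy_classes = ["atrophy", "atrophy", "normal", "normal"]
cluster_ids = [0, 1, 2, 3]
scores = homogeneity_completeness(proxy_classes, cluster_ids)
```

In this toy case the clusters are perfectly homogeneous, while completeness is penalized because members of the same proxy class are scattered over several clusters.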

MRI Classification Neurological Methodology In Silico Academic Lab GenAI

Deep Learning-Based Opportunistic CT Osteoporosis Screening and Establishment of Normative Values

Westerhoff, M., Gyftopoulos, S., Dane, B., Vega, E., Murdock, D., Lindow, N., Herter, F., Bousabarah, K., Recht, M. P., Bredella, M. A.

•preprint•Jun 3 2025

Background: Osteoporosis is underdiagnosed and undertreated, prompting the exploration of opportunistic screening using CT and artificial intelligence (AI).
Purpose: To develop a reproducible deep learning-based convolutional neural network to automatically place a 3D region of interest (ROI) in trabecular bone, develop a correction method to normalize attenuation across different CT protocols and scanner models, and establish thresholds for osteoporosis in a large, diverse population.
Methods: A deep learning-based method was developed to automatically quantify trabecular attenuation using a 3D ROI of the thoracic and lumbar spine on chest, abdomen, or spine CTs, adjusted for different tube voltages and scanner models. Normative values and osteoporosis thresholds for trabecular attenuation of the spine were established across a diverse population, stratified by age, sex, race, and ethnicity, using the prevalence of osteoporosis reported by the WHO.
Results: 538,946 CT examinations from 283,499 patients (mean age 65 ± 15 years, 51.2% women and 55.5% White), performed on 50 scanner models using six different tube voltages, were analyzed. Hounsfield Units at 80 kVp versus 120 kVp differed by 23%, and values differed by <10% across scanner models. Automated ROI placement in 1496 vertebrae was validated by manual radiologist review, demonstrating >99% agreement. Mean trabecular attenuation was higher in young women (<50 years) than young men (p<.001) and decreased with age, with a steeper decline in postmenopausal women. In patients older than 50 years, trabecular attenuation was higher in males than females (p<.001). Trabecular attenuation was highest in Blacks, followed by Asians, and lowest in Whites (p<.001). The threshold for L1 in diagnosing osteoporosis was 80 HU.
Conclusion: Deep learning-based automated opportunistic osteoporosis screening can identify patients with low bone mineral density who undergo CT scans for clinical purposes on different scanners and protocols.
Key Results: (1) In a study of 538,946 CT examinations performed in 283,499 patients using different scanner models and imaging protocols, an automated deep learning-based convolutional neural network was able to accurately place three-dimensional regions of interest within thoracic and lumbar vertebrae to measure trabecular attenuation. (2) Tube voltage had a larger influence on attenuation values (23%) than scanner model (<10%). (3) A threshold of 80 HU was identified for L1 to diagnose osteoporosis using an automated three-dimensional region of interest.

CT Segmentation Abdominal Retrospective Clinical In Silico Academic Lab Benchmark SOTA

Artificial Intelligence-Driven Innovations in Diabetes Care and Monitoring

Abdul Rahman, S., Mahadi, M., Yuliana, D., Budi Susilo, Y. K., Ariffin, A. E., Amgain, K.

•preprint•Jun 2 2025

This study explores Artificial Intelligence (AI)'s transformative role in diabetes care and monitoring, focusing on innovations that optimize patient outcomes. AI, particularly machine learning and deep learning, significantly enhances early detection of complications like diabetic retinopathy and improves screening efficacy. The methodology employs a bibliometric analysis using Scopus, VOSviewer, and Publish or Perish, analyzing 235 articles from 2023-2025. Results indicate a strong interdisciplinary focus, with Computer Science and Medicine being dominant subject areas (36.9% and 12.9% respectively). Bibliographic coupling reveals robust international collaborations led by the U.S. (1558.52 link strength), UK, and China, with key influential documents by Zhu (2023c) and Annuzzi (2023). This research highlights AI's impact on enhancing monitoring, personalized treatment, and proactive care, while acknowledging challenges in data privacy and ethical deployment. Future work should bridge technological advancements with real-world implementation to create equitable and efficient diabetes care systems.

OCT Detection Review Concept Academic Lab Ethics

Evaluating the performance and potential bias of predictive models for the detection of transthyretin cardiac amyloidosis

Hourmozdi, J., Easton, N., Benigeri, S., Thomas, J. D., Narang, A., Ouyang, D., Duffy, G., Upton, R., Hawkes, W., Akerman, A., Okwuosa, I., Kline, A., Kho, A. N., Luo, Y., Shah, S. J., Ahmad, F. S.

•preprint•Jun 2 2025

Background: Delays in the diagnosis of transthyretin amyloid cardiomyopathy (ATTR-CM) contribute to the significant morbidity of the condition, especially in the era of disease-modifying therapies. Screening for ATTR-CM with AI and other algorithms may improve timely diagnosis, but these algorithms have not been directly compared.
Objectives: The aim of this study was to compare the performance of four algorithms for ATTR-CM detection in a heart failure population and assess the risk for harms due to model bias.
Methods: We identified patients in an integrated health system from 2010-2022 with ATTR-CM and age- and sex-matched them to controls with heart failure to target 5% prevalence. We compared the performance of a claims-based random forest model (Huda et al. model), a regression-based score (Mayo ATTR-CM), and two deep learning echo models (EchoNet-LVH and EchoGo® Amyloidosis). We evaluated for bias using standard fairness metrics.
Results: The analytical cohort included 176 confirmed cases of ATTR-CM and 3192 control patients, with 79.2% self-identified as White and 9.0% as Black. The Huda et al. model performed poorly (AUC 0.49). Both deep learning echo models had a higher AUC when compared to the Mayo ATTR-CM Score (EchoNet-LVH 0.88; EchoGo Amyloidosis 0.92; Mayo ATTR-CM Score 0.79; DeLong P<0.001 for both). Bias auditing met fairness criteria for equal opportunity among patients who identified as Black.
Conclusions: Deep learning, echo-based models to detect ATTR-CM demonstrated the best overall discrimination when compared to two other models in external validation, with low risk of harms due to racial bias.
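
The abstract mentions a 5% target prevalence and an equal-opportunity fairness audit without giving formulas. The sketch below shows one common reading: equal opportunity compares true positive rates (sensitivity) between groups. The function names and the toy predictions are hypothetical, and the study's actual acceptance criterion is not stated in the abstract, so none is encoded here.

```python
def true_positive_rate(y_true, y_pred):
    """Fraction of actual positives (label 1) that were predicted positive."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    if not predictions_for_positives:
        raise ValueError("no positive cases in this group")
    return sum(predictions_for_positives) / len(predictions_for_positives)


def equal_opportunity_gap(y_true, y_pred, groups, group_a, group_b):
    """TPR(group_a) - TPR(group_b); a value near zero suggests parity."""
    def tpr_for(group):
        idx = [i for i, g in enumerate(groups) if g == group]
        return true_positive_rate([y_true[i] for i in idx], [y_pred[i] for i in idx])
    return tpr_for(group_a) - tpr_for(group_b)


# Cohort prevalence implied by the reported counts (176 cases, 3192 controls).
prevalence = 176 / (176 + 3192)

# Toy, made-up predictions for two groups "A" and "B" (not study data).
y_true = [1, 1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1]
groups = ["A", "A", "A", "B", "B", "B"]
gap = equal_opportunity_gap(y_true, y_pred, groups, "A", "B")
```

The reported counts give a prevalence of about 5.2%, consistent with the stated 5% target.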

Ultrasound Classification Cardiac Retrospective Clinical In Silico Academic Lab Benchmark SOTA Ethics

Synthetic Ultrasound Image Generation for Breast Cancer Diagnosis Using cVAE-WGAN Models: An Approach Based on Generative Artificial Intelligence

Mondillo, G., Masino, M., Colosimo, S., Perrotta, A., Frattolillo, V., Abbate, F. G.

•preprint•Jun 2 2025

The scarcity and imbalance of medical image datasets hinder the development of robust computer-aided diagnosis (CAD) systems for breast cancer. This study explores the application of advanced generative models, based on generative artificial intelligence (GenAI), for the synthesis of digital breast ultrasound images. Using a hybrid Conditional Variational Autoencoder-Wasserstein Generative Adversarial Network (CVAE-WGAN) architecture, we developed a system to generate high-quality synthetic images conditioned on the class (malignant vs. normal/benign). These synthetic images, generated from the low-resolution BreastMNIST dataset and filtered for quality, were systematically integrated with real training data at different mixing ratios (W). The performance of a CNN classifier trained on these mixed datasets was evaluated against a baseline model trained only on real data balanced with SMOTE. The optimal integration (mixing weight W=0.25) produced a significant performance increase on the real test set: +8.17% in macro-average F1-score and +4.58% in accuracy compared to using real data alone. Analysis confirmed the originality of the generated samples. This approach offers a promising solution for overcoming data limitations in image-based breast cancer diagnostics, potentially improving the capabilities of CAD systems.

Ultrasound Image Synthesis Breast Methodology In Silico Academic Lab GenAI

A Comparative Performance Analysis of Regular Expressions and an LLM-Based Approach to Extract the BI-RADS Score from Radiological Reports

Dennstaedt, F., Lerch, L., Schmerder, M., Cihoric, N., Cerghetti, G. M., Gaio, R., Bonel, H., Filchenko, I., Hastings, J., Dammann, F., Aebersold, D. M., von Tengg, H., Nairz, K.

•preprint•Jun 2 2025

Background: Different Natural Language Processing (NLP) techniques have demonstrated promising results for data extraction from radiological reports. Both traditional rule-based methods like regular expressions (Regex) and modern Large Language Models (LLMs) can extract structured information. However, comparison between these approaches for extraction of specific radiological data elements has not been widely conducted.
Methods: We compared accuracy and processing time between Regex and LLM-based approaches for extracting BI-RADS scores from 7,764 radiology reports (mammography, ultrasound, MRI, and biopsy). We developed a rule-based algorithm using Regex patterns and implemented an LLM-based extraction using the Rombos-LLM-V2.6-Qwen-14b model. A ground truth dataset of 199 manually classified reports was used for evaluation.
Results: There was no statistically significant difference in the accuracy of extracting BI-RADS scores between Regex and the LLM-based method (accuracy of 89.20% for Regex versus 87.69% for the LLM-based method; p=0.56). Compared to the LLM-based method, Regex processing was more efficient, completing the task 28,120 times faster (0.06 seconds vs. 1687.20 seconds). Further analysis revealed LLMs favored common classifications (particularly BI-RADS value of 2) while Regex more frequently returned "unclear" values. We could also confirm in our sample an already known laterality bias for breast cancer (BI-RADS 6) and detected a slight laterality skew for suspected breast cancer (BI-RADS 5) as well.
Conclusion: For structured, standardized data like BI-RADS, traditional NLP techniques seem to be superior, though future work should explore hybrid approaches combining Regex precision for standardized elements with LLM contextual understanding for more complex information extraction tasks.
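
To make the rule-based side of this comparison concrete, here is a minimal sketch of a Regex extractor for BI-RADS scores. The pattern, the English example phrasing, and the rule of returning "unclear" when zero or several distinct scores are found are all assumptions for illustration; the abstract does not publish the authors' patterns or the language of their reports.

```python
import re

# Hypothetical pattern: "BI-RADS", "BIRADS" or "BI RADS", an optional
# keyword such as "category", then a digit 0-6 with an optional a/b/c suffix.
BIRADS_PATTERN = re.compile(
    r"\bBI[-\s]?RADS\b[\s:]*(?:category|score|classification)?[\s:]*([0-6])([abc])?\b",
    re.IGNORECASE,
)


def extract_birads(report: str) -> str:
    """Return the single BI-RADS value found in the report, else "unclear".

    Assumption: a report mentioning several different scores (for example
    one per breast) is treated as "unclear" rather than resolved.
    """
    found = {
        match.group(1) + (match.group(2) or "").lower()
        for match in BIRADS_PATTERN.finditer(report)
    }
    if len(found) == 1:
        return found.pop()
    return "unclear"


example = extract_birads("Assessment: BI-RADS category 2, benign finding.")
```

A deterministic fallback such as "unclear" is one plausible reason a Regex approach reports more undecided cases than an LLM, which tends to produce some answer regardless.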

Mixed Modality LLM Radiology Report Breast Methodology In Silico Academic Lab

Physician-level classification performance across multiple imaging domains with a diagnostic medical foundation model and a large dataset of annotated medical images

Thieme, A. H., Miri, T., Marra, A. R., Kobayashi, T., Rodriguez-Nava, G., Li, Y., Barba, T., Er, A. G., Benzler, J., Gertler, M., Riechers, M., Hinze, C., Zheng, Y., Pelz, K., Nagaraj, D., Chen, A., Loeser, A., Ruehle, A., Zamboglou, C., Alyahya, L., Uhlig, M., Machiraju, G., Weimann, K., Lippert, C., Conrad, T., Ma, J., Novoa, R., Moor, M., Hernandez-Boussard, T., Alawad, M., Salinas, J. L., Mittermaier, M., Gevaert, O.

•preprint•May 31 2025

A diagnostic medical foundation model (MedFM) is an artificial intelligence (AI) system engineered to accurately determine diagnoses across various medical imaging modalities and specialties. To train MedFM, we created the PubMed Central Medical Images Dataset (PMCMID), the largest annotated medical image dataset to date, comprising 16,126,659 images from 3,021,780 medical publications. Using AI- and ontology-based methods, we identified 4,482,237 medical images (e.g., clinical photos, X-rays, ultrasounds) and generated comprehensive annotations. To optimize MedFM's performance and assess biases, 13,266 images were manually annotated to establish a multimodal benchmark. MedFM achieved physician-level performance in diagnosis tasks spanning radiology, dermatology, and infectious diseases without requiring specific training. Additionally, we developed the Image2Paper app, allowing clinicians to upload medical images and retrieve relevant literature. The correct diagnosis appeared within the top ten results in 88.4% of cases, and at least one relevant differential diagnosis appeared in 93.0%. MedFM and PMCMID were made publicly available.
Funding: Research reported here was partially supported by the National Cancer Institute (NCI) (R01 CA260271), the Saudi Company for Artificial Intelligence (SCAI) Authority, and the German Federal Ministry for Economic Affairs and Climate Action (BMWK) under the project DAKI-FWS (01MK21009E). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
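
The "top ten" figure is a top-k hit rate. As a rough sketch of how such a metric is typically computed, assuming each query yields a ranked list of candidate diagnoses (the function name and the toy lists below are hypothetical; the abstract does not describe how Image2Paper ranks its results):

```python
def top_k_hit_rate(ranked_results, correct_labels, k=10):
    """Share of queries whose correct label appears in the first k results."""
    if len(ranked_results) != len(correct_labels):
        raise ValueError("ranked_results and correct_labels must align")
    if not correct_labels:
        raise ValueError("at least one query is required")
    hits = sum(
        1
        for ranked, truth in zip(ranked_results, correct_labels)
        if truth in ranked[:k]
    )
    return hits / len(correct_labels)


# Toy, made-up retrieval output for three uploaded images.
ranked = [
    ["pneumonia", "tuberculosis"],
    ["melanoma", "nevus"],
    ["scabies"],
]
truth = ["tuberculosis", "psoriasis", "scabies"]
rate_at_1 = top_k_hit_rate(ranked, truth, k=1)
rate_at_2 = top_k_hit_rate(ranked, truth, k=2)
```

Widening k can only keep or raise the hit rate, which is why a top-ten figure should not be read as equivalent to top-one diagnostic accuracy.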

Mixed Modality Classification Methodology In Silico Academic Lab Breakthrough Open Dataset Open Code

Dharma: A novel machine learning framework for pediatric appendicitis--diagnosis, severity assessment and evidence-based clinical decision support.

Thapa, A., Pahari, S., Timilsina, S., Chapagain, B.

•preprint•May 29 2025

Background: Acute appendicitis remains a challenging diagnosis in pediatric populations, with high rates of misdiagnosis and negative appendectomies despite advances in imaging modalities. Current diagnostic tools, including clinical scoring systems like Alvarado and Pediatric Appendicitis Score (PAS), lack sufficient sensitivity and specificity, while reliance on CT scans raises concerns about radiation exposure, contrast hazards and sedation in children. Moreover, no established tool effectively predicts progression from uncomplicated to complicated appendicitis, creating a critical gap in clinical decision-making.
Objective: To develop and evaluate a machine learning model that integrates clinical, laboratory, and radiological findings for accurate diagnosis and complication prediction in pediatric appendicitis and to deploy this model as an interpretable web-based tool for clinical decision support.
Methods: We analyzed data from 780 pediatric patients (ages 0-18) with suspected appendicitis admitted to Children's Hospital St. Hedwig, Regensburg, between 2016 and 2021. For severity prediction, our dataset was augmented with 430 additional cases from published literature and only the confirmed cases of acute appendicitis (n=602) were used. After feature selection using statistical methods and recursive feature elimination, we developed a Random Forest model named Dharma, optimized through hyperparameter tuning and cross-validation. Model performance was evaluated on independent test sets and compared with conventional diagnostic tools.
Results: Dharma demonstrated superior diagnostic performance with an AUC-ROC of 0.96 (±0.02 SD) in cross-validation and 0.97-0.98 on independent test sets. At an optimal threshold of 64%, the model achieved specificity of 88%-98%, sensitivity of 89%-95%, and positive predictive value of 93%-99%. For complication prediction, Dharma attained a sensitivity of 93% (±0.05 SD) in cross-validation and 96% on the test set, with a negative predictive value of 98%. The model maintained strong performance even in cases where the appendix could not be visualized on ultrasonography (AUC-ROC 0.95, sensitivity 89%, specificity 87% at the threshold of 30%).
Conclusion: Dharma is a novel, interpretable machine learning-based clinical decision support tool designed to address the diagnostic challenges of pediatric appendicitis by integrating easily obtainable clinical, laboratory, and radiological data into a unified, real-time predictive framework. Unlike traditional scoring systems and imaging modalities, which may lack specificity or raise safety concerns in children, Dharma demonstrates high accuracy in diagnosing appendicitis and predicting progression from uncomplicated to complicated cases, potentially reducing unnecessary surgeries and CT scans. Its robust performance, even with incomplete imaging data, underscores its utility in resource-limited settings. Delivered through an intuitive, transparent, and interpretable web application, Dharma supports frontline providers--particularly in low- and middle-income settings--in making timely, evidence-based decisions, streamlining patient referrals, and improving clinical outcomes. By bridging critical gaps in current diagnostic and prognostic tools, Dharma offers a practical and accessible 21st-century solution tailored to real-world pediatric surgical care across diverse healthcare contexts. Furthermore, the underlying framework and concepts of Dharma may be adaptable to other clinical challenges beyond pediatric appendicitis, providing a foundation for broader applications of machine learning in healthcare.
Author Summary: Accurate diagnosis of pediatric appendicitis remains challenging, with current clinical scores and imaging tests limited by sensitivity, specificity, predictive values, and safety concerns. We developed Dharma, an interpretable machine learning model that integrates clinical, laboratory, and radiological data to assist in diagnosing appendicitis and predicting its severity in children. Evaluated on a large dataset supplemented by published cases, Dharma demonstrated strong diagnostic and prognostic performance, including in cases with incomplete imaging--making it potentially especially useful in resource-limited settings for early decision-making and streamlined referrals. Available as a web-based tool, it provides real-time support to healthcare providers in making evidence-based decisions that could reduce negative appendectomies while avoiding hazards associated with advanced imaging modalities such as sedation, contrast, or radiation exposure. Furthermore, the open-access concepts and framework underlying Dharma have the potential to address diverse healthcare challenges beyond pediatric appendicitis.
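
The results above quote sensitivity, specificity, and predictive values at specific probability thresholds (64% and 30%). The sketch below shows how those four quantities follow from a threshold applied to predicted probabilities. The function name, the "probability >= threshold counts as positive" rule, and the toy numbers are assumptions for illustration and do not reproduce the Dharma model or its data.

```python
def metrics_at_threshold(y_true, probabilities, threshold):
    """Confusion-matrix metrics when probability >= threshold means positive."""
    tp = fp = tn = fn = 0
    for truth, probability in zip(y_true, probabilities):
        predicted_positive = probability >= threshold
        if predicted_positive and truth == 1:
            tp += 1
        elif predicted_positive and truth == 0:
            fp += 1
        elif not predicted_positive and truth == 1:
            fn += 1
        else:
            tn += 1

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Toy, made-up cases: 1 = appendicitis confirmed, 0 = not confirmed.
y_true = [1, 1, 1, 0, 0, 0]
probabilities = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1]
strict = metrics_at_threshold(y_true, probabilities, 0.64)
lenient = metrics_at_threshold(y_true, probabilities, 0.30)
```

On the toy data, lowering the threshold raises sensitivity at the cost of specificity, which is the general trade-off behind choosing a lower cut-off when missing a case is the greater concern.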

Ultrasound Classification Abdominal Retrospective Clinical In Silico Academic Lab GenAI

ROC Analysis of Biomarker Combinations in Fragile X Syndrome-Specific Clinical Trials: Evaluating Treatment Efficacy via Exploratory Biomarkers

Norris, J. E., Berry-Kravis, E. M., Harnett, M. D., Reines, S. A., Reese, M., Auger, E. K., Outterson, A., Furman, J., Gurney, M. E., Ethridge, L. E.

•preprint•May 29 2025

Fragile X Syndrome (FXS) is a rare neurodevelopmental disorder caused by a trinucleotide repeat expansion in the 5' untranslated region of the FMR1 gene. FXS is characterized by intellectual disability, anxiety, sensory hypersensitivity, and difficulties with executive function. A recent phase 2 placebo-controlled clinical trial assessing BPN14770, a first-in-class phosphodiesterase 4D allosteric inhibitor, in 30 adult males (age 18-41 years) with FXS demonstrated cognitive improvements on the NIH Toolbox Cognitive Battery in domains related to language and caregiver reports of improvement in both daily functioning and language. However, individual physiological measures from electroencephalography (EEG) demonstrated only marginal significance for trial efficacy. A secondary analysis of resting state EEG data collected as part of the phase 2 clinical trial evaluating BPN14770 was conducted using a machine learning classification algorithm to classify trial conditions (i.e., baseline, drug, placebo) via linear EEG variable combinations. The algorithm identified a composite of peak alpha frequencies (PAF) across multiple brain regions as a potential biomarker demonstrating BPN14770 efficacy. Increased PAF from baseline was associated with drug but not placebo. Given the relationship between PAF and cognitive function among typically developed adults and those with intellectual disability, as well as previously reported reductions in alpha frequency and power in FXS, PAF represents a potential physiological measure of BPN14770 efficacy.

SPECT Classification Neurological Retrospective Clinical In Silico Academic Lab

Multi-class classification of central and non-central geographic atrophy using Optical Coherence Tomography

Siraz, S., Kamanda, H., Gholami, S., Nabil, A. S., Ong, S. S. Y., Alam, M. N.

•preprint•May 28 2025

Purpose: To develop and validate deep learning (DL)-based models for classifying geographic atrophy (GA) subtypes using Optical Coherence Tomography (OCT) scans across four clinical classification tasks.
Design: Retrospective comparative study evaluating three DL architectures on OCT data with two experimental approaches.
Subjects: 455 OCT volumes (258 Central GA [CGA], 74 Non-Central GA [NCGA], 123 no GA [NGA]) from 104 patients at Atrium Health Wake Forest Baptist. For GA versus age-related macular degeneration (AMD) classification, we supplemented our dataset with AMD cases from four public repositories.
Methods: We implemented ResNet50, MobileNetV2, and Vision Transformer (ViT-B/16) architectures using two approaches: (1) utilizing all B-scans within each OCT volume and (2) selectively using B-scans containing foveal regions. Models were trained using transfer learning, standardized data augmentation, and patient-level data splitting (70:15:15 ratio) for training, validation, and testing.
Main Outcome Measures: Area under the receiver operating characteristic curve (AUC-ROC), F1 score, and accuracy for each classification task (CGA vs. NCGA, CGA vs. NCGA vs. NGA, GA vs. NGA, and GA vs. other forms of AMD).
Results: ViT-B/16 consistently outperformed other architectures across all classification tasks. For CGA versus NCGA classification, ViT-B/16 achieved an AUC-ROC of 0.728 ± 0.083 and accuracy of 0.831 ± 0.006 using selective B-scans. In GA versus NGA classification, ViT-B/16 attained an AUC-ROC of 0.950 ± 0.002 and accuracy of 0.873 ± 0.012 with selective B-scans. All models demonstrated exceptional performance in distinguishing GA from other AMD forms (AUC-ROC>0.998). For multi-class classification, ViT-B/16 achieved an AUC-ROC of 0.873 ± 0.003 and accuracy of 0.751 ± 0.002 using selective B-scans.
Conclusions: Our DL approach successfully classifies GA subtypes with clinically relevant accuracy. ViT-B/16 demonstrates superior performance due to its ability to capture spatial relationships between atrophic regions and the foveal center. Focusing on B-scans containing foveal regions improved diagnostic accuracy while reducing computational requirements, better aligning with clinical practice workflows.
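
Patient-level splitting matters here because 455 volumes come from only 104 patients: volumes from one patient must not land in both training and test data. The following is a minimal sketch of such a split under stated assumptions. The function name, the seed, the rounding rule, and applying the 70:15:15 ratio to patients rather than volumes are all illustrative choices, not details taken from the study.

```python
import random


def patient_level_split(volume_to_patient, ratios=(0.70, 0.15, 0.15), seed=0):
    """Assign whole patients to train/val/test, then map their volumes.

    volume_to_patient: dict mapping a volume ID to its patient ID.
    The ratios are applied to the number of patients (an assumption).
    """
    patients = sorted(set(volume_to_patient.values()))
    random.Random(seed).shuffle(patients)  # reproducible shuffle

    n_train = round(ratios[0] * len(patients))
    n_val = round(ratios[1] * len(patients))

    assignment = {}
    for index, patient in enumerate(patients):
        if index < n_train:
            assignment[patient] = "train"
        elif index < n_train + n_val:
            assignment[patient] = "val"
        else:
            assignment[patient] = "test"

    splits = {"train": [], "val": [], "test": []}
    for volume, patient in volume_to_patient.items():
        splits[assignment[patient]].append(volume)
    return splits


# Toy cohort: 20 hypothetical patients with two OCT volumes each.
toy_volumes = {f"vol_{p}_{v}": f"patient_{p}" for p in range(20) for v in range(2)}
toy_splits = patient_level_split(toy_volumes)
```

Because assignment happens per patient, the resulting volume counts only approximate 70:15:15 when patients contribute different numbers of volumes.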

OCT Classification Retrospective Clinical In Silico Academic Lab
