Latest Papers on Radiology AI. Sources: medrxiv, Tags: Ethics.

A hybrid computer vision model to predict lung cancer in diverse populations

Zakkar, A., Perwaiz, N., Harikrishnan, V., Zhong, W., Narra, V., Krule, A., Yousef, F., Kim, D., Burrage-Burton, M., Lawal, A. A., Gadi, V., Korpics, M. C., Kim, S. J., Chen, Z., Khan, A. A., Molina, Y., Dai, Y., Marai, E., Meidani, H., Nguyen, R., Salahudeen, A. A.

•preprint•Aug 29 2025

PURPOSE Disparities in lung cancer incidence exist in Black populations, and screening criteria underserve Black populations due to disparately elevated risk in the screening-eligible population. Prediction models that integrate clinical and imaging-based features to individualize lung cancer risk are a potential means to mitigate these disparities. PATIENTS AND METHODS This multicenter (NLST) and catchment population-based (UIH, urban and suburban Cook County) study utilized participants at risk of lung cancer with available lung CT imaging and follow-up between the years 2015 and 2024. 53,452 participants in NLST and 11,654 in UIH were included based on age and tobacco use-based risk factors for lung cancer. Cohorts were used for training and testing of deep and machine learning models using clinical features alone or combined with CT image features (hybrid computer vision). RESULTS An optimized 7 clinical feature model achieved ROC-AUC values ranging 0.64-0.67 in NLST and 0.60-0.65 in UIH cohorts across multiple years. Incorporation of imaging features to form a hybrid computer vision model significantly improved ROC-AUC values to 0.78-0.91 in NLST, but performance deteriorated in UIH, with ROC-AUC values of 0.68-0.80, attributable to Black participants, in whom ROC-AUC values ranged from 0.63-0.72 across multiple years. Retraining the hybrid computer vision model by incorporating Black and other participants from the UIH cohort improved performance, with ROC-AUC values of 0.70-0.87 in a held-out UIH test set. CONCLUSION Hybrid computer vision predicted risk with improved accuracy compared to clinical risk models alone. However, potential biases in image training data reduced model generalizability in Black participants. Performance was improved upon retraining with a subset of the UIH cohort, suggesting that inclusive training and validation datasets can minimize racial disparities.
Future studies incorporating vision models trained on representative data sets may demonstrate improved health equity upon clinical use.
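
The subgroup comparison described in this abstract (ROC-AUC overall versus ROC-AUC within one demographic group) can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the helper names `roc_auc` and `auc_by_group`, the group labels, and the toy labels and scores are hypothetical and are not the authors' model, data, or code.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """ROC-AUC as the chance that a random positive case outranks a random negative one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_group(labels, scores, groups):
    """Compute ROC-AUC separately for each subgroup label."""
    buckets = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, groups):
        buckets[g][0].append(y)
        buckets[g][1].append(s)
    return {g: roc_auc(ys, ss) for g, (ys, ss) in buckets.items()}


# Hypothetical toy data: 1 = cancer diagnosed, 0 = no cancer.
labels = [0, 0, 1, 1, 0, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.5, 0.6, 0.9]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

overall = roc_auc(labels, scores)
per_group = auc_by_group(labels, scores, groups)
print(f"overall AUC = {overall:.4f}")
for name, value in sorted(per_group.items()):
    print(f"group {name} AUC = {value:.2f}")
```

Reporting the per-group values next to the pooled value is what makes a gap like the one described for the UIH cohort visible; a pooled ROC-AUC alone can hide a subgroup in which the model ranks cases poorly.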

CT Classification Chest Retrospective Clinical In Silico Academic Lab Ethics

Regulating Flexibility for Artificial Intelligence: FDA Experience with Predetermined Change Control Plans

Rosen, K. L., Mandl, K. D.

•preprint•Aug 27 2025

Importance: Predetermined Change Control Plans (PCCPs) are a recent regulatory innovation by the U.S. Food and Drug Administration (FDA) introduced to enable dynamic oversight of artificial intelligence and machine learning (AI/ML)-enabled medical devices. Objective: To characterize the FDA's PCCP program among AI/ML-enabled medical devices, including device characteristics, preapproval testing, planned modifications, and post-clearance update mechanisms. Design: This cross-sectional study reviewed FDA-cleared or approved AI/ML-enabled medical devices with authorized PCCPs. Setting: AI/ML-enabled devices approved or cleared prior to May 30, 2025 were identified from an FDA-maintained public list and their characteristics extracted from FDA approval databases. Participants: N/A. Main Outcome(s) and Measure(s): Primary outcomes included (1) prevalence and characteristics of devices with authorized PCCPs, (2) types of FDA-authorized modifications, (3) presence and nature of preapproval testing, such as study design and subgroup testing, and (4) postmarket device update mechanisms and transparency. Results: Among 26 identified AI/ML-enabled medical devices with authorized PCCPs, 92% were cleared via the 510(k) pathway, and all were classified as moderate risk. Devices were primarily intended for use in diagnosis or clinical assessment, and six had consumer-facing components. Authorized modifications spanned the product lifecycle, most commonly allowing model retraining (69% of devices), logic updates (42% of devices), and expansion of input sources (35% of devices). Preapproval testing was limited, with seven devices prospectively evaluated and thirteen undergoing human factors testing. Subgroup analyses were reported for eleven devices, and none included patient outcomes data. No postmarket studies or recalls were identified. User manuals could be identified online for 54% of devices, though many lacked performance details or any mention of PCCPs.
Conclusions and Relevance: FDA authorization of PCCPs grants manufacturers substantial flexibility to modify AI/ML-enabled devices postmarket, while preapproval testing and postmarket transparency are limited. These findings highlight the need for strengthened oversight mechanisms to ensure ongoing safety and effectiveness of rapidly evolving AI/ML-enabled technologies in clinical care.

Review FDA Cleared FDA 510(k) Consortium Policy Ethics

The Effectiveness of Large Language Models in Providing Automated Feedback in Medical Imaging Education: A Protocol for a Systematic Review

Al-Mashhadani, M., Ajaz, F., Guraya, S. S., Ennab, F.

•preprint•Aug 6 2025

Background: Large Language Models (LLMs) represent a rapidly evolving generative artificial intelligence (AI) modality with promising developments in the field of medical education. LLMs can provide automated feedback services to medical trainees (e.g., medical students, residents, and fellows) and possibly serve a role in medical imaging education. Aim: This systematic review aims to comprehensively explore the current applications and educational outcomes of LLMs in providing automated feedback on medical imaging reports. Methods: This study employs a comprehensive systematic review strategy, involving an extensive search of the literature (PubMed, Scopus, Embase, and Cochrane), data extraction, and synthesis of the data. Conclusion: This systematic review will highlight the best practices of LLM use in automated feedback on medical imaging reports and guide further development of these models.

Mixed Modality LLM Radiology Report Review Concept Ethics

Interpreting convolutional neural network explainability for head-and-neck cancer radiotherapy organ-at-risk segmentation

Strijbis, V. I. J., Gurney-Champion, O. J., Grama, D. I., Slotman, B. J., Verbakel, W. F. A. R.

•preprint•Jul 31 2025

Background: Convolutional neural networks (CNNs) have emerged to reduce clinical resources and standardize auto-contouring of organs-at-risk (OARs). Although CNNs perform adequately for most patients, understanding when the CNN might fail is critical for effective and safe clinical deployment. However, the limitations of CNNs are poorly understood because of their black-box nature. Explainable artificial intelligence (XAI) can expose CNNs' inner mechanisms for classification. Here, we investigate the inner mechanisms of CNNs for segmentation and explore a novel computational approach to flag potentially insufficient parotid gland (PG) contours a priori. Methods: First, 3D UNets were trained in three PG segmentation situations using (1) synthetic cases; (2) 1925 clinical computed tomography (CT) scans with typical and (3) more consistent contours curated through a previously validated auto-curation step. Then, we generated attribution maps for seven XAI methods and qualitatively assessed them for congruency between simulated and clinical contours, and for how much XAI agreed with expert reasoning. To objectify observations, we explored persistent homology intensity filtrations to capture essential topological characteristics of XAI attributions. Principal component (PC) eigenvalues of Euler characteristic profiles were correlated with spatial agreement (Dice-Sorensen similarity coefficient; DSC). Evaluation was done using sensitivity, specificity and the area under the receiver operating characteristic curve (AUROC) on an external AAPM dataset, where, as a proof of principle, we regarded the lowest 15% of DSC values as insufficient. Results: PatternNet attributions (PNet-A) focused on soft-tissue structures, whereas guided backpropagation (GBP) highlighted both soft-tissue and high-density structures (e.g. mandible bone), which was congruent with synthetic situations. Both methods typically had higher/denser activations in better auto-contoured medial and anterior lobes.
Curated models produced "cleaner" gradient class-activation mapping (GCAM) attributions. Quantitative analysis showed that PC λ1 of the guided GCAM (GGCAM) Euler characteristic (EC) profile had good predictive value (sensitivity > 0.85, specificity > 0.9) for DSC in AAPM cases, with AUROC = 0.66, 0.74, 0.94, and 0.83 for GBP, GCAM, GGCAM, and PNet-A, respectively. For λ1 < -1.8e3 of the GGCAM EC profile, 87% of cases were insufficient. Conclusions: GBP and PNet-A qualitatively agreed most with expert reasoning on directly (structure borders) and indirectly (proxies used for identifying structure borders) important features for PG segmentation. Additionally, this work investigated, as a proof of principle, how topological data analysis could be used for quantitative XAI signal analysis to mark potentially inadequate CNN segmentations a priori, using only features from inside the predicted PG. This work used PG as a well-understood segmentation paradigm and may extend to target volumes and other organs-at-risk.

CT Segmentation Neurological Methodology In Silico Academic Lab Ethics

The impacts of artificial intelligence on the workload of diagnostic radiology services: A rapid review and stakeholder contextualisation

Sutton, C., Prowse, J., Elshehaly, M., Randell, R.

•preprint•Jul 24 2025

Background: Advancements in imaging technology, alongside increasing longevity and co-morbidities, have led to heightened demand for diagnostic radiology services. However, there is a shortfall in radiology and radiography staff to acquire, read and report on such imaging examinations. Artificial intelligence (AI) has been identified, notably by AI developers, as a potential solution that could positively impact the workload of diagnostic radiology services and address this staffing shortfall. Methods: A rapid review complemented with data from interviews with UK radiology service stakeholders was undertaken. ArXiv, Cochrane Library, Embase, Medline and Scopus databases were searched for publications in English published between 2007 and 2022. Following screening, 110 full texts were included. Interviews with 15 radiology service managers, clinicians and academics were carried out between May and September 2022. Results: Most literature was published in 2021 and 2022, with a distinct focus on AI for diagnostics of lung and chest disease (n = 25), notably COVID-19 and respiratory system cancers, closely followed by AI for breast screening (n = 23). AI contributions to streamlining the workload of radiology services were categorised as autonomous, augmentative and assistive. However, percentage estimates of workload reduction varied considerably, with the most significant reduction identified in national screening programmes. AI was also recognised as aiding radiology services through providing a second opinion, assisting in prioritisation of images for reading and improved quantification in diagnostics. Stakeholders saw AI as having the potential to remove some of the laborious work and contribute to service resilience. Conclusions: This review has shown there is limited data on real-world experiences from radiology services for the implementation of AI in clinical production.
Autonomous, augmentative and assistive AI can, as noted in the article, decrease workload and aid reading and reporting; however, the governance surrounding these advancements lags.

Mixed Modality Classification Review Clinical Pilot Academic Lab Policy Ethics

DREAM: A framework for discovering mechanisms underlying AI prediction of protected attributes

Gadgil, S. U., DeGrave, A. J., Janizek, J. D., Xu, S., Nwandu, L., Fonjungo, F., Lee, S.-I., Daneshjou, R.

•preprint•Jul 21 2025

Recent advances in Artificial Intelligence (AI) have started disrupting the healthcare industry, especially medical imaging, and AI devices are increasingly being deployed into clinical practice. Such classifiers have previously demonstrated the ability to discern a range of protected demographic attributes (like race, age, sex) from medical images with unexpectedly high performance, a sensitive task which is difficult even for trained physicians. In this study, we motivate and introduce a general explainable AI (XAI) framework called DREAM (DiscoveRing and Explaining AI Mechanisms) for interpreting how AI models trained on medical images predict protected attributes. Focusing on two modalities, radiology and dermatology, we successfully train high-performing classifiers for predicting race from chest x-rays (ROC-AUC score of ~0.96) and sex from dermoscopic lesions (ROC-AUC score of ~0.78). We highlight how incorrect use of these demographic shortcuts can have a detrimental effect on the performance of a clinically relevant downstream task like disease diagnosis under a domain shift. Further, we employ various XAI techniques to identify specific signals which can be leveraged to predict sex. Finally, we propose a technique, which we call removal via balancing, to quantify how much a signal contributes to the classification performance. Using this technique and the signals identified, we are able to explain ~15% of the total performance for radiology and ~42% of the total performance for dermatology. We envision DREAM to be broadly applicable to other modalities and demographic attributes. This analysis not only underscores the importance of cautious AI application in healthcare but also opens avenues for improving the transparency and reliability of AI-driven diagnostic tools.

X-Ray Classification Chest Methodology In Silico Ethics

Artificial Intelligence for Early Detection and Prognosis Prediction of Diabetic Retinopathy

Budi Susilo, Y. K., Yuliana, D., Mahadi, M., Abdul Rahman, S., Ariffin, A. E.

•preprint•Jun 20 2025

This review explores the transformative role of artificial intelligence (AI) in the early detection and prognosis prediction of diabetic retinopathy (DR), a leading cause of vision loss in diabetic patients. AI, particularly deep learning and convolutional neural networks (CNNs), has demonstrated remarkable accuracy in analyzing retinal images, identifying early-stage DR with high sensitivity and specificity. These advancements address critical challenges such as intergrader variability in manual screening and the limited availability of specialists, especially in underserved regions. The integration of AI with telemedicine has further enhanced accessibility, enabling remote screening through portable devices and smartphone-based imaging. Economically, AI-based systems reduce healthcare costs by optimizing resource allocation and minimizing unnecessary referrals. Key findings highlight the dominance of Medicine (819 documents) and Computer Science (613 documents) in research output, reflecting the interdisciplinary nature of this field. Geographically, China, the United States, and India lead in contributions, underscoring global efforts to combat DR. Despite these successes, challenges such as algorithmic bias, data privacy, and the need for explainable AI (XAI) remain. Future research should focus on multi-center validation, diverse AI methodologies, and clinician-friendly tools to ensure equitable adoption. By addressing these gaps, AI can revolutionize DR management, reducing the global burden of diabetes-related blindness through early intervention and scalable solutions.

OCT Classification Review Concept Academic Lab Ethics

Radiologist-AI workflow can be modified to reduce the risk of medical malpractice claims

Bernstein, M., Sheppard, B., Bruno, M. A., Lay, P. S., Baird, G. L.

•preprint•Jun 16 2025

Background. Artificial Intelligence (AI) is rapidly changing the legal landscape of radiology. Results from a previous experiment suggested that providing AI error rates can reduce perceived radiologist culpability, as judged by mock jury members (4). The current study advances this work by examining whether the radiologist's behavior also impacts perceptions of liability. Methods. Participants (n=282) read about a hypothetical malpractice case where a 50-year-old who visited the Emergency Department with acute neurological symptoms received a brain CT scan to determine if bleeding was present. An AI system was used by the radiologist who interpreted imaging. The AI system correctly flagged the case as abnormal. Nonetheless, the radiologist concluded there was no evidence of bleeding, and the blood-thinner t-PA was administered. Participants were randomly assigned to either (1) a single-read condition, where the radiologist interpreted the CT once after seeing AI feedback, or (2) a double-read condition, where the radiologist interpreted the CT twice, first without AI and then with AI feedback. Participants were then told the patient suffered irreversible brain damage due to the missed brain bleed, resulting in the patient (plaintiff) suing the radiologist (defendant). Participants indicated whether the radiologist met their duty of care to the patient (yes/no). Results. Hypothetical jurors were more likely to side with the plaintiff in the single-read condition (106/142, 74.7%) than in the double-read condition (74/140, 52.9%), p=0.0002. Conclusion. This suggests that the penalty for disagreeing with correct AI can be mitigated when images are interpreted twice, or at least if a radiologist gives an interpretation before AI is used.
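
The juror counts reported above (106 of 142 versus 74 of 140) form a 2x2 table, so the comparison can be checked with a short calculation. The abstract does not name the statistical test that was used; the sketch below assumes a chi-square test with Yates continuity correction, and the function name `chi2_yates_2x2` is hypothetical. Recomputed from the counts, the two proportions come out at roughly 74.6% and 52.9%.

```python
import math


def chi2_yates_2x2(a, b, c, d):
    """Yates-corrected chi-square test for the 2x2 table [[a, b], [c, d]].

    Returns (chi2, p). With one degree of freedom the chi-square survival
    function equals erfc(sqrt(chi2 / 2)), so only the standard library is needed.
    """
    n = a + b + c + d
    corrected = max(abs(a * d - b * c) - n / 2, 0.0)
    chi2 = n * corrected ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


# Counts taken from the abstract: jurors siding with the plaintiff / total jurors.
single_plaintiff, single_total = 106, 142
double_plaintiff, double_total = 74, 140

chi2, p_value = chi2_yates_2x2(
    single_plaintiff, single_total - single_plaintiff,
    double_plaintiff, double_total - double_plaintiff,
)

print(f"single-read: {single_plaintiff / single_total:.1%}")
print(f"double-read: {double_plaintiff / double_total:.1%}")
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

Under this assumed test the p-value rounds to the reported 0.0002; an uncorrected chi-square or an exact test would give a somewhat different value, so agreement here does not establish which test the authors ran.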

CT Detection Neurological Retrospective Clinical Post Market Academic Lab Ethics Policy

Lack of children in public medical imaging data points to growing age bias in biomedical AI

Hua, S. B. Z., Heller, N., He, P., Towbin, A. J., Chen, I., Lu, A., Erdman, L.

•preprint•Jun 7 2025

Artificial intelligence (AI) is rapidly transforming healthcare, but its benefits are not reaching all patients equally. Children remain overlooked, with only 17% of FDA-approved medical AI devices labeled for pediatric use. In this work, we demonstrate that this exclusion may stem from a fundamental data gap. Our systematic review of 181 public medical imaging datasets reveals that children represent just under 1% of available data, while the majority of machine learning imaging conference papers we surveyed utilized publicly available data for methods development. As with other systematic biases in model development, past studies have demonstrated that pediatric representation in training data is essential for the performance of models intended for the pediatric population. We add to these findings, showing that adult-trained chest radiograph models exhibit significant age bias when applied to pediatric populations, with higher false positive rates in younger children. This work underscores the urgent need for increased pediatric representation in publicly accessible medical datasets. We provide actionable recommendations for researchers, policymakers, and data curators to address this age equity gap and ensure AI benefits patients of all ages. 1-2 sentence summary: Our analysis reveals a critical healthcare age disparity: children represent less than 1% of public medical imaging datasets. This gap in representation leads to biased predictions across medical image foundation models, with the youngest patients facing the highest risk of misdiagnosis.

X-Ray Classification Chest Review In Silico Academic Lab Ethics Policy Benchmark SOTA

Evaluating the performance and potential bias of predictive models for the detection of transthyretin cardiac amyloidosis

Hourmozdi, J., Easton, N., Benigeri, S., Thomas, J. D., Narang, A., Ouyang, D., Duffy, G., Upton, R., Hawkes, W., Akerman, A., Okwuosa, I., Kline, A., Kho, A. N., Luo, Y., Shah, S. J., Ahmad, F. S.

•preprint•Jun 2 2025

Background: Delays in the diagnosis of transthyretin amyloid cardiomyopathy (ATTR-CM) contribute to the significant morbidity of the condition, especially in the era of disease-modifying therapies. Screening for ATTR-CM with AI and other algorithms may improve timely diagnosis, but these algorithms have not been directly compared. Objectives: The aim of this study was to compare the performance of four algorithms for ATTR-CM detection in a heart failure population and assess the risk for harms due to model bias. Methods: We identified patients in an integrated health system from 2010-2022 with ATTR-CM and age- and sex-matched them to controls with heart failure to target 5% prevalence. We compared the performance of a claims-based random forest model (Huda et al. model), a regression-based score (Mayo ATTR-CM), and two deep learning echo models (EchoNet-LVH and EchoGo® Amyloidosis). We evaluated for bias using standard fairness metrics. Results: The analytical cohort included 176 confirmed cases of ATTR-CM and 3192 control patients, with 79.2% self-identified as White and 9.0% as Black. The Huda et al. model performed poorly (AUC 0.49). Both deep learning echo models had a higher AUC when compared to the Mayo ATTR-CM Score (EchoNet-LVH 0.88; EchoGo Amyloidosis 0.92; Mayo ATTR-CM Score 0.79; DeLong P<0.001 for both). Bias auditing met fairness criteria for equal opportunity among patients who identified as Black. Conclusions: Deep learning, echo-based models to detect ATTR-CM demonstrated the best overall discrimination when compared to two other models in external validation, with low risk of harms due to racial bias.
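
Equal opportunity, the fairness criterion named in the results above, is commonly defined as parity of true positive rates (sensitivity) across groups. The sketch below is a minimal illustration of that definition under stated assumptions: the function names, group labels, and toy predictions are hypothetical, and the abstract does not report which tooling or tolerance threshold the authors used. The only figures taken from the abstract are the case and control counts used for the prevalence check.

```python
def true_positive_rate(y_true, y_pred):
    """Fraction of actual positives that the model flags as positive."""
    flagged = [p for t, p in zip(y_true, y_pred) if t == 1]
    if not flagged:
        raise ValueError("no positive cases in this group")
    return sum(flagged) / len(flagged)


def equal_opportunity_gap(y_true, y_pred, groups, reference, comparison):
    """True positive rate of the comparison group minus that of the reference group."""
    def subset(name):
        rows = [(t, p) for t, p, g in zip(y_true, y_pred, groups) if g == name]
        return [t for t, _ in rows], [p for _, p in rows]

    ref_true, ref_pred = subset(reference)
    cmp_true, cmp_pred = subset(comparison)
    return true_positive_rate(cmp_true, cmp_pred) - true_positive_rate(ref_true, ref_pred)


# Hypothetical toy data: 1 = ATTR-CM (true label) or flagged by the model (prediction).
y_true = [1, 1, 1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0, 0, 0]
groups = ["ref"] * 6 + ["cmp"] * 4

gap = equal_opportunity_gap(y_true, y_pred, groups, reference="ref", comparison="cmp")
print(f"equal opportunity gap (cmp - ref) = {gap:+.2f}")

# Prevalence check using the counts reported in the abstract.
prevalence = 176 / (176 + 3192)
print(f"cohort prevalence = {prevalence:.1%}")
```

A gap near zero means affected patients in both groups are detected at similar rates; a fairness audit would typically compare the gap (or the ratio of the two rates) against a pre-specified tolerance.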

Ultrasound Classification Cardiac Retrospective Clinical In Silico Academic Lab Benchmark SOTA Ethics
