Improving discriminative ability in mammographic microcalcification classification using deep learning: a novel double transfer learning approach validated with an explainable artificial intelligence technique

Arlan, K., Bjornstrom, M., Makela, T., Meretoja, T. J., Hukkinen, K.

medRxiv preprint, Aug 11 2025
Background: Diagnosing breast microcalcifications is challenging because of their subtle presentation, overlap with benign findings, and high inter-reader variability, which often leads to unnecessary biopsies. While deep learning (DL) models, particularly deep convolutional neural networks (DCNNs), have shown potential to improve diagnostic accuracy, their clinical application remains limited by the need for large annotated datasets and the "black box" nature of their decision-making. Purpose: To develop and validate a DCNN using a double transfer learning (d-TL) strategy for classifying suspected mammographic microcalcifications, with explainable AI (XAI) techniques to support model interpretability. Material and methods: A retrospective dataset of 396 annotated regions of interest (ROIs) was collected from full-field digital mammography (FFDM) images of 194 patients who underwent stereotactic vacuum-assisted biopsy at the Women's Hospital radiological department, Helsinki University Hospital. The dataset was randomly split into training and test sets (24% test set, balanced for benign and malignant cases). A ResNeXt-based DCNN was developed using a d-TL approach: first pretrained on ImageNet, then adapted using an intermediate mammography dataset before fine-tuning on the target microcalcification data. Saliency maps were generated using Gradient-weighted Class Activation Mapping (Grad-CAM) to evaluate the visual relevance of model predictions. Diagnostic performance was compared to a radiologist's BI-RADS-based assessment, using final histopathology as the reference standard. Results: The ensemble DCNN achieved an area under the ROC curve (AUC) of 0.76, with 65% sensitivity, 83% specificity, 79% positive predictive value (PPV), and 70% accuracy. The radiologist achieved an AUC of 0.65 with 100% sensitivity but lower specificity (30%) and PPV (59%). Grad-CAM visualizations showed consistent activation of the correct ROIs, even in misclassified cases where confidence scores fell below the threshold. Conclusion: The DCNN model utilizing d-TL achieved performance comparable to radiologists, with higher specificity and PPV than BI-RADS. The approach addresses data limitation issues and may help reduce additional imaging and unnecessary biopsies.
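
As a rough arithmetic check on the figures in this abstract (not the authors' code): PPV can be derived from sensitivity, specificity, and prevalence. The sketch below assumes the "balanced" test set means a prevalence of exactly 0.5; the helper name `ppv` is hypothetical, and the abstract does not state how PPV was actually computed.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value from sensitivity, specificity and prevalence."""
    true_pos = sensitivity * prevalence                # P(test+, disease+)
    false_pos = (1 - specificity) * (1 - prevalence)   # P(test+, disease-)
    return true_pos / (true_pos + false_pos)


# Assumption: balanced test set, so prevalence = 0.5.
print(round(ppv(0.65, 0.83, 0.5), 2))  # DCNN values from the abstract -> 0.79
print(round(ppv(1.00, 0.30, 0.5), 2))  # radiologist values -> 0.59
```

Under that assumption, both results agree with the reported PPVs of 79% and 59%.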

Interpretable Deep Learning Approaches for Reliable GI Image Classification: A Study with the HyperKvasir Dataset

Wahid, S. B., Rothy, Z. T., News, R. K., Rieyan, S. A.

medRxiv preprint, Jul 23 2025
Deep learning has emerged as a promising tool for automating gastrointestinal (GI) disease diagnosis. However, multi-class GI disease classification remains underexplored. This study addresses this gap by presenting a framework that uses advanced models like InceptionNetV3 and ResNet50, combined with boosting algorithms (XGB, LGBM), to classify lower GI abnormalities. InceptionNetV3 with XGB achieved the best recall of 0.81 and an F1 score of 0.90. To assist clinicians in understanding model decisions, the Grad-CAM technique, a form of explainable AI, was employed to highlight the critical regions influencing predictions, fostering trust in these systems. This approach significantly improves both the accuracy and reliability of GI disease diagnosis.

Large Language Model-Based Entity Extraction Reliably Classifies Pancreatic Cysts and Reveals Predictors of Malignancy: A Cross-Sectional and Retrospective Cohort Study

Papale, A. J., Flattau, R., Vithlani, N., Mahajan, D., Ziemba, Y., Zavadsky, T., Carvino, A., King, D., Nadella, S.

medRxiv preprint, Jul 17 2025
Pancreatic cystic lesions (PCLs) are often discovered incidentally on imaging and may progress to pancreatic ductal adenocarcinoma (PDAC). PCLs have a high incidence in the general population, and adherence to screening guidelines can be variable. With the advent of technologies that enable automated text classification, we sought to evaluate various natural language processing (NLP) tools, including large language models (LLMs), for identifying and classifying PCLs from radiology reports. We correlated our classification of PCLs with clinical features to identify risk factors for a positive PDAC biopsy. We contrasted a previously described NLP classifier with LLMs for prospective identification of PCLs in radiology reports. We evaluated various LLMs for PCL classification into low-risk or high-risk categories based on published guidelines. We compared prompt-based PCL classification to specific entity-guided PCL classification. To this end, we developed tools to deidentify radiology reports and track patients longitudinally based on those reports. Additionally, we used our newly developed tools to evaluate a retrospective database of patients who underwent pancreas biopsy to determine associated factors, including those in their radiology reports and clinical features, using multivariable logistic regression modelling. Of 14,574 prospective radiology reports, 665 (4.6%) described a pancreatic cyst, including 175 (1.2%) high-risk lesions. Our Entity-Extraction Large Language Model tool achieved recall 0.992 (95% confidence interval [CI], 0.985-0.998), precision 0.988 (0.979-0.996), and F1-score 0.990 (0.985-0.995) for detecting cysts; F1-scores were 0.993 (0.987-0.998) for low-risk and 0.977 (0.952-0.995) for high-risk classification. Among 4,285 biopsy patients, 330 had pancreatic cysts documented ≥6 months before biopsy.
In the final multivariable model (AUC = 0.877), independent predictors of adenocarcinoma were change in duct caliber with upstream atrophy (adjusted odds ratio [AOR], 4.94; 95% CI, 1.30-18.79), mural nodules (AOR, 11.02; 1.81-67.26), older age (AOR, 1.10; 1.05-1.16), lower body mass index (AOR, 0.86; 0.76-0.96), and total bilirubin (AOR, 1.81; 1.18-2.77). Automated NLP-based analysis of radiology reports using LLM-driven entity extraction can accurately identify and risk-stratify PCLs and, when retrospectively applied, reveal factors predicting malignant progression. Widespread implementation may improve surveillance and enable earlier intervention.
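
The reported cyst-detection F1-score can be reproduced from the stated precision and recall, since F1 is their harmonic mean. A minimal sketch, with `f1_score` as a hypothetical helper rather than the authors' tooling:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision 0.988 and recall 0.992 are the cyst-detection values from the abstract.
print(round(f1_score(0.988, 0.992), 3))  # 0.99, i.e. the reported 0.990
```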

A conversational artificial intelligence based web application for medical conversations: a prototype for a chatbot

Pires, J. G.

medRxiv preprint, Jul 17 2025
Background: Artificial Intelligence (AI) has evolved through various trends, with different subfields gaining prominence over time. Currently, Conversational Artificial Intelligence (CAI), particularly Generative AI, is at the forefront. CAI models are primarily focused on text-based tasks and are commonly deployed as chatbots. Recent advancements by OpenAI have enabled the integration of external, independently developed models, allowing chatbots to perform specialized, task-oriented functions beyond general language processing. Objective: This study aims to develop a smart chatbot that integrates large language models (LLMs) from OpenAI with specialized domain-specific models, such as those used in medical image diagnostics. The system leverages transfer learning via Google's Teachable Machine to construct image-based classifiers and incorporates a diabetes detection model developed in TensorFlow.js. A key innovation is the chatbot's ability to extract relevant parameters from user input, trigger the appropriate diagnostic model, interpret the output, and deliver responses in natural language. The overarching goal is to demonstrate the potential of combining LLMs with external models to build multimodal, task-oriented conversational agents. Methods: Two image-based models were developed and integrated into the chatbot system. The first analyzes chest X-rays to detect viral and bacterial pneumonia. The second uses optical coherence tomography (OCT) images to identify ocular conditions such as drusen, choroidal neovascularization (CNV), and diabetic macular edema (DME). Both models were incorporated into the chatbot to enable image-based medical query handling. In addition, a text-based model was constructed to process physiological measurements for diabetes prediction using TensorFlow.js. The architecture is modular: new diagnostic models can be added without redesigning the chatbot, enabling straightforward functional expansion. Results: The findings demonstrate effective integration between the chatbot and the diagnostic models, with only minor deviations from expected behavior. Additionally, a stub function was implemented within the chatbot to schedule medical appointments based on the severity of a patient's condition, and it was specifically tested with the OCT and X-ray models. Conclusions: This study demonstrates the feasibility of developing advanced AI systems, including image-based diagnostic models and chatbot integration, by leveraging Artificial Intelligence as a Service (AIaaS). It also underscores the potential of AI to enhance user experiences in bioinformatics, paving the way for more intuitive and accessible interfaces in the field. Looking ahead, the modular nature of the chatbot allows for the integration of additional diagnostic models as the system evolves.

The Potential of ChatGPT as an Aiding Tool for the Neuroradiologist

Nikola, S., Paz, D.

medRxiv preprint, Jul 14 2025
Purpose: This study aims to explore whether ChatGPT can serve as an assistive tool for neuroradiologists in establishing a reasonable differential diagnosis of central nervous system tumors based on MRI image characteristics. Methods: This retrospective study included 50 patients aged 18-90 who underwent imaging and surgery at the Western Galilee Medical Center. ChatGPT was provided with the patients' demographic and radiological information to generate differential diagnoses. We compared ChatGPT's performance to that of an experienced neuroradiologist, using pathological reports as the gold standard. Quantitative data were described using means and standard deviations, median and range. Qualitative data were described using frequencies and percentages. The level of agreement between examiners (neuroradiologist versus ChatGPT) was assessed using Fleiss' kappa coefficient. A p-value below 5% was considered statistically significant. Statistical analysis was performed using IBM SPSS Statistics, version 27. Results: While ChatGPT demonstrated good performance, particularly in identifying common tumors such as glioblastoma and meningioma, its overall accuracy (48%) was lower than that of the neuroradiologist (70%). The AI tool showed moderate agreement with the neuroradiologist (kappa = 0.445) and with pathology results (kappa = 0.419). ChatGPT's performance varied across tumor types, performing better with common tumors but struggling with rarer ones. Conclusion: This study suggests that ChatGPT has the potential to serve as an assistive tool in neuroradiology for establishing a reasonable differential diagnosis of central nervous system tumors based on MRI image characteristics. However, its limitations and potential risks must be considered, and it should therefore be used with caution.

Three-dimensional high-content imaging of unstained soft tissue with subcellular resolution using a laboratory-based multi-modal X-ray microscope

Esposito, M., Astolfo, A., Zhou, Y., Buchanan, I., Teplov, A., Endrizzi, M., Egido Vinogradova, A., Makarova, O., Divan, R., Tang, C.-M., Yagi, Y., Lee, P. D., Walsh, C. L., Ferrara, J. D., Olivo, A.

medRxiv preprint, Jul 14 2025
With increasing interest in studying biological systems across spatial scales--from centimetres down to nanometres--histology continues to be the gold standard for tissue imaging at cellular resolution, providing an essential bridge between macroscopic and nanoscopic analysis. However, its inherently destructive and two-dimensional nature limits its ability to capture the full three-dimensional complexity of tissue architecture. Here we show that phase-contrast X-ray microscopy can enable three-dimensional virtual histology with subcellular resolution. This technique provides direct quantification of electron density without restrictive assumptions, allowing for direct characterisation of cellular nuclei in a standard laboratory setting. By combining high spatial resolution and soft tissue contrast, with automated segmentation of cell nuclei, we demonstrated virtual H&E staining using machine learning-based style transfer, yielding volumetric datasets compatible with existing histopathological analysis tools. Furthermore, by integrating electron density and the sensitivity to nanometric features of the dark field contrast channel, we achieve stain-free, high-content imaging capable of distinguishing nuclei and extracellular matrix.

A Clinically-Informed Framework for Evaluating Vision-Language Models in Radiology Report Generation: Taxonomy of Errors and Risk-Aware Metric

Guan, H., Hou, P. C., Hong, P., Wang, L., Zhang, W., Du, X., Zhou, Z., Zhou, L.

medRxiv preprint, Jul 14 2025
Recent advances in vision-language models (VLMs) have enabled automatic radiology report generation, yet current evaluation methods remain limited to general-purpose NLP metrics or coarse classification-based clinical scores. In this study, we propose a clinically informed evaluation framework for VLM-generated radiology reports that goes beyond traditional performance measures. We define a taxonomy of 12 radiology-specific error types, each annotated with clinical risk levels (low, medium, high) in collaboration with physicians. Using this framework, we conduct a comprehensive error analysis of three representative VLMs, i.e., DeepSeek VL2, CXR-LLaVA, and CheXagent, on 685 gold-standard, expert-annotated MIMIC-CXR cases. We further introduce a risk-aware evaluation metric, the Clinical Risk-weighted Error Score for Text-generation (CREST), to quantify safety impact. Our findings reveal critical model vulnerabilities, common error patterns, and condition-specific risk profiles, offering actionable insights for model development and deployment. This work establishes a safety-centric foundation for evaluating and improving medical report generation models. The source code of our evaluation framework, including CREST computation and error taxonomy analysis, is available at https://github.com/guanharry/VLM-CREST.

Multivariate whole brain neurodegenerative-cognitive-clinical severity mapping in the Alzheimer's disease continuum using explainable AI

Murad, T., Miao, H., Thakuri, D. S., Darekar, G., Chand, G.

medRxiv preprint, Jul 11 2025
Neurodegeneration and cognitive impairment are commonly reported in Alzheimer's disease (AD); however, their multivariate links are not well understood. To map the multivariate relationships between whole brain neurodegenerative (WBN) markers, global cognition, and clinical severity in the AD continuum, we developed explainable artificial intelligence (AI) methods, validated them on semi-simulated data, and applied the best-performing method systematically to large-scale experimental data (N=1,756). The best-performing explainable AI method showed robust performance in predicting cognition from regional WBN markers and identified the ground-truth simulated dominant brain regions contributing to cognition. This method also showed excellent performance on experimental data and identified several prominent WBN regions hierarchically and simultaneously associated with cognitive decline across the AD continuum. These multivariate regional features also correlated with clinical severity, suggesting their clinical relevance. Overall, this study mapped the multivariate regional WBN-cognitive-clinical severity relationships in the AD continuum, thereby advancing the understanding of AD-relevant neurobiological pathways.

A Unified Platform for Radiology Report Generation and Clinician-Centered AI Evaluation

Ma, Z., Yang, X., Atalay, Z., Yang, A., Collins, S., Bai, H., Bernstein, M., Baird, G., Jiao, Z.

medRxiv preprint, Jul 8 2025
Generative AI models have demonstrated strong potential in radiology report generation, but their clinical adoption depends on physician trust. In this study, we conducted a radiology-focused Turing test to evaluate how well attendings and residents distinguish AI-generated reports from those written by radiologists, and how their confidence and decision time reflect trust. We developed an integrated web-based platform comprising two core modules: Report Generation and Report Evaluation. Using this platform, eight participants evaluated 48 anonymized X-ray cases, each paired with two reports from one of three comparison groups: radiologist vs. AI model 1, radiologist vs. AI model 2, and AI model 1 vs. AI model 2. Participants selected the AI-generated report, rated their confidence, and indicated report preference. Attendings outperformed residents in identifying AI-generated reports (49.9% vs. 41.1%) and exhibited longer decision times, suggesting more deliberate judgment. Both groups took more time when both reports were AI-generated. Our findings highlight the role of clinical experience in AI acceptance and the need for design strategies that foster trust in clinical applications. The project page of the evaluation platform is available at: https://zachatalay89.github.io/Labsite.

Clinician-Led Code-Free Deep Learning for Detecting Papilloedema and Pseudopapilloedema Using Optic Disc Imaging

Shenoy, R., Samra, G. S., Sekhri, R., Yoon, H.-J., Teli, S., DeSilva, I., Tu, Z., Maconachie, G. D., Thomas, M. G.

medRxiv preprint, Jun 26 2025
Importance: Differentiating pseudopapilloedema from papilloedema is challenging, but critical for prompt diagnosis and to avoid unnecessary invasive procedures. Following diagnosis of papilloedema, objectively grading severity is important for determining urgency of management and therapeutic response. Automated machine learning (AutoML) has emerged as a promising tool for diagnosis in medical imaging and may provide accessible opportunities for consistent and accurate diagnosis and severity grading of papilloedema. Objective: This study evaluates the feasibility of AutoML models for distinguishing the presence and severity of papilloedema using near infrared reflectance (NIR) images obtained from standard optical coherence tomography (OCT), comparing the performance of different AutoML platforms. Design, setting and participants: A retrospective cohort study was conducted using data from University Hospitals of Leicester, NHS Trust. The study involved 289 adult and paediatric patients (813 images) who underwent optic nerve head-centred OCT imaging between 2021 and 2024. The dataset included patients with normal optic discs (69 patients, 185 images), papilloedema (135 patients, 372 images), and optic disc drusen (ODD) (85 patients, 256 images). Three AutoML platforms (Amazon Rekognition, Medic Mind (MM) and Google Vertex) were evaluated for their ability to classify and grade papilloedema severity. Main outcomes and measures: Two classification tasks were performed: (1) distinguishing papilloedema from normal discs and ODD; (2) grading papilloedema severity (mild/moderate vs. severe). Model performance was evaluated using area under the curve (AUC), precision, recall, F1 score, and confusion matrices for all six models. Results: Amazon Rekognition outperformed the other platforms, achieving the highest AUC (0.90) and F1 score (0.81) in distinguishing papilloedema from normal/ODD. For papilloedema severity grading, Amazon Rekognition also performed best, with an AUC of 0.90 and F1 score of 0.79. Google Vertex and Medic Mind demonstrated good performance but had slightly lower accuracy and higher misclassification rates. Conclusions and relevance: This evaluation of three widely available AutoML platforms using NIR images obtained from standard OCT shows promise in distinguishing and grading papilloedema. These models provide an accessible, scalable solution for clinical teams without coding expertise to develop intelligent diagnostic systems that recognise and characterise papilloedema. Further external validation and prospective testing are needed to confirm their clinical utility and applicability in diverse settings. Key Points. Question: Can clinician-led, code-free deep learning models using automated machine learning (AutoML) accurately differentiate papilloedema from pseudopapilloedema using optic disc imaging? Findings: Three widely available AutoML platforms were used to develop models that successfully distinguish the presence and severity of papilloedema on optic disc imaging, with Amazon Rekognition demonstrating the highest performance. Meaning: AutoML may assist clinical teams, even those with limited coding expertise, in diagnosing papilloedema, potentially reducing the need for invasive investigations.