Page 152 of 3993984 results

Multimodal Large Language Model With Knowledge Retrieval Using Flowchart Embedding for Forming Follow-Up Recommendations for Pancreatic Cystic Lesions.

Zhu Z, Liu J, Hong CW, Houshmand S, Wang K, Yang Y

PubMed | Jul 16, 2025
BACKGROUND. The American College of Radiology (ACR) Incidental Findings Committee (IFC) algorithm provides guidance for pancreatic cystic lesion (PCL) management. Its implementation using plain-text large language model (LLM) solutions is challenging, given that key components include multimodal data (e.g., figures and tables). OBJECTIVE. The purpose of this study was to evaluate a multimodal LLM approach incorporating knowledge retrieval using flowchart embedding for forming follow-up recommendations for PCL management. METHODS. This retrospective study included patients who underwent abdominal CT or MRI from September 1, 2023, to September 1, 2024, and whose report mentioned a PCL. The reports' Findings sections were input to a multimodal LLM (GPT-4o). For task 1 (198 patients: mean age, 69.0 ± 13.0 [SD] years; 110 women, 88 men), the LLM assessed PCL features (presence of PCL; PCL size and location; presence of main pancreatic duct communication; presence of worrisome features or high-risk stigmata) and formed a follow-up recommendation using three knowledge retrieval methods (default knowledge; plain-text retrieval-augmented generation [RAG] from the ACR IFC algorithm PDF document; and flowchart embedding, using the LLM's image-to-text conversion for in-context integration of the document's flowcharts and tables). For task 2 (85 patients: mean initial age, 69.2 ± 10.8 years; 48 women, 37 men), an additional relevant prior report was input; the LLM assessed for interval PCL change and provided an adjusted follow-up schedule accounting for prior imaging, using flowchart embedding. Three radiologists assessed LLM accuracy in task 1 for PCL findings in consensus and for follow-up recommendations independently; one radiologist assessed accuracy in task 2. RESULTS. For task 1, the LLM with flowchart embedding had an accuracy for PCL features of 98.0-99.0%.
The accuracy of the LLM's follow-up recommendations based on default knowledge, plain-text RAG, and flowchart embedding was 42.4%, 23.7%, and 89.9% (p < .001), respectively, for radiologist 1; 39.9%, 24.2%, and 91.9% (p < .001) for radiologist 2; and 40.9%, 25.3%, and 91.9% (p < .001) for radiologist 3. For task 2, the LLM using flowchart embedding showed an accuracy of 96.5% for interval PCL change and 81.2% for adjusted follow-up schedules. CONCLUSION. Multimodal flowchart embedding aided the LLM's automated provision of follow-up recommendations adherent to a clinical guidance document. CLINICAL IMPACT. The framework could be extended to other incidental findings through the use of other clinical guidance documents as model input.
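The flowchart-embedding step described above amounts to converting the guideline's flowcharts and tables to text once, then prepending those renderings to each report's Findings section for in-context use. A minimal sketch of that prompt assembly, with hypothetical function and argument names (the study's actual prompts are not given, and the image-to-text conversion itself would be a separate multimodal call):

```python
def build_prompt(findings: str, flowchart_texts: list[str]) -> str:
    """Assemble an in-context prompt: text renderings of the guideline's
    flowcharts/tables, followed by the report's Findings section."""
    context = "\n\n".join(
        f"[Flowchart {i + 1}]\n{text}" for i, text in enumerate(flowchart_texts)
    )
    return (
        "You are given the ACR IFC guidance for pancreatic cystic lesions, "
        "expressed as text renderings of its flowcharts and tables:\n\n"
        f"{context}\n\n"
        "Report findings:\n"
        f"{findings}\n\n"
        "State the recommended follow-up interval and modality."
    )
```

Only the context assembly is shown; retrieval in the plain-text RAG arm would instead select passages from the PDF text.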

Validation of artificial intelligence software for automatic calcium scoring in cardiac and chest computed tomography.

Hamelink II, Nie ZZ, Severijn TEJT, van Tuinen MM, van Ooijen PMAP, Kwee TCT, Dorrius MDM, van der Harst PP, Vliegenthart RR

PubMed | Jul 16, 2025
Coronary artery calcium scoring (CACS), i.e., quantification of the Agatston score (AS) or volume score (VS), can be time consuming. The aim of this study was to compare automated, artificial intelligence (AI)-based CACS to manual scoring in cardiac CT and in chest CT for lung cancer screening. We selected 684 participants (59 ± 4.8 years; 48.8% men) who underwent cardiac and non-ECG-triggered chest CT, including 484 participants with AS > 0 on cardiac CT. AI-based results were compared to manual AS and VS by assessing sensitivity and accuracy, intraclass correlation coefficient (ICC), Bland-Altman analysis, and Cohen's kappa for classification into AS strata (0; 1-99; 100-299; ≥300). AI showed a high CAC detection rate: 98.1% in cardiac CT (accuracy 97.1%) and 92.4% in chest CT (accuracy 92.1%). AI showed excellent agreement with manual AS (ICC: 0.997 and 0.992) and manual VS (ICC: 0.997 and 0.991) in cardiac CT and chest CT, respectively. In Bland-Altman analysis, the mean difference was 2.3 (limits of agreement [LoA]: -42.7, 47.4) for AS on cardiac CT; 1.9 (LoA: -36.4, 40.2) for VS on cardiac CT; -0.3 (LoA: -74.8, 74.2) for AS on chest CT; and -0.6 (LoA: -65.7, 64.5) for VS on chest CT. Cohen's kappa was 0.952 (95% CI: 0.934-0.970) for cardiac CT and 0.901 (95% CI: 0.875-0.926) for chest CT, with concordance in 95.9% and 91.4% of cases, respectively. AI-based CACS shows a high detection rate and strong correlation with manual CACS, with excellent risk classification agreement. AI may reduce evaluation time and enable opportunistic screening for CAC on low-dose chest CT.
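The agreement statistics above follow standard definitions: Bland-Altman limits of agreement are the mean paired difference ± 1.96 SD, and the kappa analysis classifies each score into the stated AS strata. A sketch with illustrative helper names (not the study's code):

```python
import math

def bland_altman(ai_scores, manual_scores):
    """Mean difference and 95% limits of agreement (mean +/- 1.96 SD)
    between paired AI-based and manual calcium scores."""
    diffs = [a - m for a, m in zip(ai_scores, manual_scores)]
    n = len(diffs)
    mean = sum(diffs) / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
    return mean, mean - 1.96 * sd, mean + 1.96 * sd

def as_stratum(agatston):
    """Risk stratum used for the Cohen's kappa classification agreement."""
    if agatston == 0:
        return "0"
    if agatston < 100:
        return "1-99"
    if agatston < 300:
        return "100-299"
    return ">=300"
```

With these helpers, per-patient AI and manual scores yield the mean difference, LoA, and strata labels from which kappa is computed.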

Illuminating radiogenomic signatures in pediatric-type diffuse gliomas: insights into molecular, clinical, and imaging correlations. Part II: low-grade group.

Kurokawa R, Hagiwara A, Ito R, Ueda D, Saida T, Sakata A, Nishioka K, Sugawara S, Takumi K, Watabe T, Ide S, Kawamura M, Sofue K, Hirata K, Honda M, Yanagawa M, Oda S, Iima M, Naganawa S

PubMed | Jul 16, 2025
The fifth edition of the World Health Organization classification of central nervous system tumors represents a significant advancement in the molecular-genetic classification of pediatric-type diffuse gliomas. This article comprehensively summarizes the clinical, molecular, and radiological imaging features of pediatric-type low-grade gliomas (pLGGs), including MYB- or MYBL1-altered tumors, polymorphous low-grade neuroepithelial tumor of the young (PLNTY), and diffuse low-grade glioma, MAPK pathway-altered. Most pLGGs harbor alterations in the RAS/MAPK pathway, functioning as a "one-pathway disease". Specific magnetic resonance imaging features, such as the T2-fluid-attenuated inversion recovery (FLAIR) mismatch sign in MYB- or MYBL1-altered tumors and the transmantle-like sign in PLNTYs, may serve as non-invasive biomarkers for underlying molecular alterations. Recent advances in radiogenomics have enabled the differentiation of BRAF fusion from BRAF V600E-mutant tumors based on magnetic resonance imaging characteristics. Machine learning approaches have further enhanced our ability to predict molecular subtypes from imaging features. These radiology-molecular correlations offer potential clinical utility in treatment planning and prognostication, especially as targeted therapies against the MAPK pathway emerge. Continued research is needed to refine our understanding of genotype-phenotype correlations in less common molecular alterations and to validate these imaging biomarkers in larger cohorts.

Deep learning for appendicitis: development of a three-dimensional localization model on CT.

Takaishi T, Kawai T, Kokubo Y, Fujinaga T, Ojio Y, Yamamoto T, Hayashi K, Owatari Y, Ito H, Hiwatashi A

PubMed | Jul 16, 2025
To develop and evaluate a deep learning model for detecting appendicitis on abdominal CT. This retrospective single-center study included 567 CTs of appendicitis patients (330 males; age range 20-96) obtained between 2011 and 2020, randomly split into training (n = 517) and validation (n = 50) sets. The validation set was supplemented with 50 control CTs performed for acute abdomen. For a test dataset, 100 appendicitis CTs and 100 control CTs were consecutively collected from a separate period after 2021. Exclusion criteria included age < 20, perforation, unclear appendix, and appendix tumors. Appendicitis CTs were annotated with three-dimensional bounding boxes encompassing the inflamed appendices. CT protocols were unenhanced, with a 5-mm slice thickness and a 512 × 512 pixel matrix. The deep learning algorithm was based on the Faster Region-based Convolutional Neural Network (Faster R-CNN). Two board-certified radiologists visually graded model predictions on the test dataset using a 5-point Likert scale (0: no detection, 1: false, 2: poor, 3: fair, 4: good), with scores ≥ 3 considered true positives. Inter-rater agreement was assessed using weighted kappa statistics. The effects of intra-abdominal fat, periappendiceal fat stranding, presence of an appendicolith, and appendix diameter on the model's recall were analyzed using binary logistic regression. The model showed a precision of 0.66 (87/132), a recall of 0.87 (87/100), and a false-positive rate per patient of 0.23 (45/200). The inter-rater agreement for Likert scores of 2-4 was κ = 0.76. The logistic regression analysis showed that only intra-abdominal fat had a significant impact on the model's recall (p = 0.02). We developed a model capable of detecting appendicitis on CT with a three-dimensional bounding box.
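The reported figures are simple ratios of the stated counts: 87 true positives among 132 detections, 100 appendicitis cases, and 200 test patients (so 45 false positives and 13 misses). A quick check, with hypothetical helper names:

```python
def detection_metrics(tp, fp, fn, n_patients):
    """Patient-level detection metrics for a localization model."""
    precision = tp / (tp + fp)        # 87 / 132
    recall = tp / (tp + fn)           # 87 / 100
    fp_per_patient = fp / n_patients  # 45 / 200
    return precision, recall, fp_per_patient
```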

Identifying Signatures of Image Phenotypes to Track Treatment Response in Liver Disease

Matthias Perkonigg, Nina Bastati, Ahmed Ba-Ssalamah, Peter Mesenbrink, Alexander Goehler, Miljen Martic, Xiaofei Zhou, Michael Trauner, Georg Langs

arXiv preprint | Jul 16, 2025
Quantifiable image patterns associated with disease progression and treatment response are critical tools for guiding individual treatment, and for developing novel therapies. Here, we show that unsupervised machine learning can identify a pattern vocabulary of liver tissue in magnetic resonance images that quantifies treatment response in diffuse liver disease. Deep clustering networks simultaneously encode and cluster patches of medical images into a low-dimensional latent space to establish a tissue vocabulary. The resulting tissue types capture differential tissue change and its location in the liver associated with treatment response. We demonstrate the utility of the vocabulary on a randomized controlled trial cohort of non-alcoholic steatohepatitis patients. First, we use the vocabulary to compare longitudinal liver change in a placebo and a treatment cohort. Results show that the method identifies specific liver tissue change pathways associated with treatment, and enables a better separation between treatment groups than established non-imaging measures. Moreover, we show that the vocabulary can predict biopsy derived features from non-invasive imaging data. We validate the method on a separate replication cohort to demonstrate the applicability of the proposed method.
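The core idea above is that clustering encoded image patches turns each patch into a discrete "tissue word" of the vocabulary. As a toy stand-in for the deep clustering step (the paper's networks encode and cluster jointly in a latent space; this sketch with hypothetical names runs plain k-means over already-encoded patch vectors):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: each patch's cluster index is its 'tissue word'."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # assign each encoded patch to its nearest center
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        # recompute centers as cluster means
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centers
```

Longitudinal change can then be summarized as a shift in the histogram of tissue-word frequencies per liver, which is the kind of measure compared between placebo and treatment arms.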

Hybrid Ensemble Approaches: Optimal Deep Feature Fusion and Hyperparameter-Tuned Classifier Ensembling for Enhanced Brain Tumor Classification

Zahid Ullah, Dragan Pamucar, Jihie Kim

arXiv preprint | Jul 16, 2025
Magnetic Resonance Imaging (MRI) is widely recognized as the most reliable tool for detecting tumors due to its capability to produce detailed images that reveal their presence. However, diagnostic accuracy can be compromised when human specialists evaluate these images: fatigue, limited expertise, and insufficient image detail can lead to errors. For example, small tumors might go unnoticed, or overlap with healthy brain regions could result in misidentification. To address these challenges and enhance diagnostic precision, this study proposes a novel double-ensembling framework, consisting of an ensemble of pre-trained deep learning (DL) models for feature extraction and an ensemble of hyperparameter-tuned machine learning (ML) models to classify brain tumors efficiently. Specifically, our method includes extensive preprocessing and augmentation, applies transfer learning by utilizing various pre-trained deep convolutional neural networks and vision transformer networks to extract deep features from brain MRI, and fine-tunes the hyperparameters of the ML classifiers. Our experiments used three publicly available Kaggle brain tumor MRI datasets to evaluate the pre-trained DL feature extractors, the ML classifiers, and the effectiveness of an ensemble of deep features combined with an ensemble of ML classifiers for brain tumor classification. Our results indicate that the proposed feature fusion and classifier fusion improve upon the state of the art, with hyperparameter fine-tuning providing a significant enhancement over the ensemble method alone. Additionally, we present an ablation study to illustrate how each component contributes to accurate brain tumor classification.

Comparative Analysis of CNN Performance in Keras, PyTorch and JAX on PathMNIST

Anida Nezović, Jalal Romano, Nada Marić, Medina Kapo, Amila Akagić

arXiv preprint | Jul 16, 2025
Deep learning has significantly advanced the field of medical image classification, particularly with the adoption of Convolutional Neural Networks (CNNs). Various deep learning frameworks such as Keras, PyTorch and JAX offer unique advantages in model development and deployment. However, their comparative performance in medical imaging tasks remains underexplored. This study presents a comprehensive analysis of CNN implementations across these frameworks, using the PathMNIST dataset as a benchmark. We evaluate training efficiency, classification accuracy and inference speed to assess their suitability for real-world applications. Our findings highlight the trade-offs between computational speed and model accuracy, offering valuable insights for researchers and practitioners in medical image analysis.
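When comparing training or inference speed across Keras, PyTorch, and JAX, a common pitfall is timing JIT compilation or lazy initialization as if it were steady-state throughput. A framework-agnostic sketch of the kind of harness such a comparison needs (hypothetical names; the paper's exact protocol is not given):

```python
import time

def benchmark(step_fn, n_steps=100, warmup=10):
    """Average wall-clock time per step, excluding warmup iterations
    (JIT-compiled frameworks such as JAX need warmup before timing)."""
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    return (time.perf_counter() - start) / n_steps
```

Here `step_fn` would wrap one training or inference step of the same CNN in each framework, with identical batch size and data, so that only the framework varies.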

Generate to Ground: Multimodal Text Conditioning Boosts Phrase Grounding in Medical Vision-Language Models

Felix Nützel, Mischa Dombrowski, Bernhard Kainz

arXiv preprint | Jul 16, 2025
Phrase grounding, i.e., mapping natural language phrases to specific image regions, holds significant potential for disease localization in medical imaging through clinical reports. While current state-of-the-art methods rely on discriminative, self-supervised contrastive models, we demonstrate that generative text-to-image diffusion models, leveraging cross-attention maps, can achieve superior zero-shot phrase grounding performance. Contrary to prior assumptions, we show that fine-tuning diffusion models with a frozen, domain-specific language model, such as CXR-BERT, substantially outperforms domain-agnostic counterparts. This setup achieves remarkable improvements, with mIoU scores doubling those of current discriminative methods. These findings highlight the underexplored potential of generative models for phrase grounding tasks. To further enhance performance, we introduce Bimodal Bias Merging (BBM), a novel post-processing technique that aligns text and image biases to identify regions of high certainty. BBM refines cross-attention maps, achieving even greater localization accuracy. Our results establish generative approaches as a more effective paradigm for phrase grounding in the medical imaging domain, paving the way for more robust and interpretable applications in clinical practice. The source code and model weights are available at https://github.com/Felix-012/generate_to_ground.
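Zero-shot grounding from cross-attention comes down to thresholding an attention map into a region and scoring it against the reference box with intersection-over-union (mIoU averages this over phrases). A minimal sketch with hypothetical names; BBM itself, which aligns text and image biases before this step, is not reproduced here:

```python
def attention_to_box(attn, threshold):
    """Threshold a 2D cross-attention map and return the bounding box
    (r0, c0, r1, c1, inclusive) of all supra-threshold cells."""
    rows = [r for r, row in enumerate(attn) for v in row if v >= threshold]
    cols = [c for row in attn for c, v in enumerate(row) if v >= threshold]
    if not rows:
        return None
    return min(rows), min(cols), max(rows), max(cols)

def iou(a, b):
    """Intersection-over-union of two inclusive (r0, c0, r1, c1) boxes."""
    r0, c0 = max(a[0], b[0]), max(a[1], b[1])
    r1, c1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, r1 - r0 + 1) * max(0, c1 - c0 + 1)
    area = lambda x: (x[2] - x[0] + 1) * (x[3] - x[1] + 1)
    return inter / (area(a) + area(b) - inter)
```

In the diffusion setting, `attn` would be the cross-attention map for the phrase's tokens, upsampled to image resolution before thresholding.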

Site-Level Fine-Tuning with Progressive Layer Freezing: Towards Robust Prediction of Bronchopulmonary Dysplasia from Day-1 Chest Radiographs in Extremely Preterm Infants

Sybelle Goedicke-Fritz, Michelle Bous, Annika Engel, Matthias Flotho, Pascal Hirsch, Hannah Wittig, Dino Milanovic, Dominik Mohr, Mathias Kaspar, Sogand Nemat, Dorothea Kerner, Arno Bücker, Andreas Keller, Sascha Meyer, Michael Zemlin, Philipp Flotho

arXiv preprint | Jul 16, 2025
Bronchopulmonary dysplasia (BPD) is a chronic lung disease affecting 35% of extremely low birth weight infants. Defined by oxygen dependence at 36 weeks postmenstrual age, it causes lifelong respiratory complications. However, preventive interventions carry severe risks, including neurodevelopmental impairment, ventilator-induced lung injury, and systemic complications. Early BPD prognosis and prediction of BPD outcome are therefore crucial to avoid unnecessary toxicity in low-risk infants. Admission radiographs of extremely preterm infants are routinely acquired within 24 h of life and could serve as a non-invasive prognostic tool. In this work, we developed and investigated a deep learning approach using chest X-rays from 163 extremely low birth weight infants (≤32 weeks gestation, 401-999 g) obtained within 24 hours of birth. We fine-tuned a ResNet-50 pretrained specifically on adult chest radiographs, employing progressive layer freezing with discriminative learning rates to prevent overfitting, and evaluated CutMix augmentation and linear probing. For moderate/severe BPD outcome prediction, our best-performing model, with progressive freezing, linear probing, and CutMix, achieved an AUROC of 0.78 ± 0.10, a balanced accuracy of 0.69 ± 0.10, and an F1-score of 0.67 ± 0.11. In-domain pre-training significantly outperformed ImageNet initialization (p = 0.031), confirming that domain-specific pretraining is important for BPD outcome prediction. Routine IRDS grades showed limited prognostic value (AUROC 0.57 ± 0.11), confirming the need for learned markers. Our approach demonstrates that domain-specific pretraining enables accurate BPD prediction from routine day-1 radiographs. Through progressive freezing and linear probing, the method remains computationally feasible for site-level implementation and future federated learning deployments.
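Progressive layer freezing with discriminative learning rates can be expressed as a per-stage schedule: earlier (more generic) layer groups are frozen first and, while still trainable, receive smaller learning rates than later groups. A framework-agnostic sketch with hypothetical names and parameter values (the paper's actual schedule is not specified here):

```python
def layer_group_lrs(n_groups, stage, base_lr=1e-4, decay=0.5):
    """Per-group learning rates at a given freezing stage: at stage s,
    the s earliest groups are frozen (lr 0.0); unfrozen groups get
    discriminative rates, smaller for earlier layers."""
    lrs = []
    for g in range(n_groups):
        if g < stage:
            lrs.append(0.0)  # frozen backbone group
        else:
            lrs.append(base_lr * decay ** (n_groups - 1 - g))
    return lrs
```

In a real training loop, each entry would become the learning rate of one optimizer parameter group, and `stage` would be incremented on a fixed epoch schedule.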

Benchmarking and Explaining Deep Learning Cortical Lesion MRI Segmentation in Multiple Sclerosis

Nataliia Molchanova, Alessandro Cagol, Mario Ocampo-Pineda, Po-Jui Lu, Matthias Weigel, Xinjie Chen, Erin Beck, Charidimos Tsagkas, Daniel Reich, Colin Vanden Bulcke, Anna Stolting, Serena Borrelli, Pietro Maggi, Adrien Depeursinge, Cristina Granziera, Henning Mueller, Pedro M. Gordaliza, Meritxell Bach Cuadra

arXiv preprint | Jul 16, 2025
Cortical lesions (CLs) have emerged as valuable biomarkers in multiple sclerosis (MS), offering high diagnostic specificity and prognostic relevance. However, their routine clinical integration remains limited due to their subtle magnetic resonance imaging (MRI) appearance, challenges in expert annotation, and a lack of standardized automated methods. We propose a comprehensive multi-centric benchmark of CL detection and segmentation in MRI. A total of 656 MRI scans, including clinical trial and research data from four institutions, were acquired at 3T and 7T using MP2RAGE and MPRAGE sequences with expert-consensus annotations. We rely on the self-configuring nnU-Net framework, designed for medical imaging segmentation, and propose adaptations tailored to improved CL detection. We evaluated model generalization through out-of-distribution testing, demonstrating strong lesion detection capabilities with F1-scores of 0.64 in-domain and 0.50 out-of-domain. We also analyze internal model features and model errors for a better understanding of AI decision-making. Our study examines how data variability, lesion ambiguity, and protocol differences impact model performance, and offers recommendations to address these barriers to clinical adoption. To support reproducibility, the implementation and models will be publicly accessible and ready to use at https://github.com/Medical-Image-Analysis-Laboratory/ and https://doi.org/10.5281/zenodo.15911797.
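The lesion-detection F1-scores quoted above combine lesion-wise precision and recall. As a sketch (the counts below are hypothetical, chosen so the result reproduces the in-domain value of 0.64):

```python
def lesion_f1(tp, fp, fn):
    """Lesion-wise F1 from counts of detected (TP), spurious (FP),
    and missed (FN) lesions."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```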
