Latest Papers on Radiology AI. Sources: medrxiv, Tags: Open Code.

Normative Modelling of Brain Volume for Diagnostic and Prognostic Stratification in Multiple Sclerosis

Korbmacher, M., Lie, I. A., Wesnes, K., Westman, E., Espeseth, T., Andreassen, O., Westlye, L., Wergeland, S., Harbo, H. F., Nygaard, G. O., Myhr, K.-M., Hogestol, E. A., Torkildsen, O.

•preprint•Sep 15 2025

Background: Brain atrophy is a hallmark of multiple sclerosis (MS). For clinical translatability and individual-level predictions, brain atrophy needs to be put into the context of the broader population, using reference or normative models. Methods: Reference models of MRI-derived brain volumes were established from a large healthy control (HC) multi-cohort dataset (N=63 115, 51% females). The reference models were applied to two independent MS cohorts (N=362, T1w-scans=953, follow-up time up to 12 years) to assess deviations from the reference, defined as Z-values. We assessed the overlap of deviation profiles and their stability over time using individual-level transitions towards or out of significant reference deviation states (|Z|>1·96). A negative binomial model was used for case-control comparisons of the number of extreme deviations. Linear models were used to assess differences in Z-score deviations between MS and propensity-matched HCs, and associations with clinical scores at baseline and over time. The utilized normative BrainReference models, scripts and usage instructions are freely available. Findings: We identified a temporally stable, brain morphometric phenotype of MS. The right and left thalami most consistently showed significantly lower-than-reference volumes in MS (25% and 26% overlap across the sample). The number of such extreme smaller-than-reference values was 2·70 times higher in MS than in HC (4·51 versus 1·67). Additional deviations were associated with greater disability (Expanded Disability Status Scale [EDSS]: β=0·22, 95% CI 0·12 to 0·32), lower Paced Auditory Serial Addition Test score (β=-0·27, 95% CI -0·52 to -0·02), and higher Fatigue Severity Score (β=0·29, 95% CI 0·05 to 0·53) at baseline, and with EDSS over time (β=0·07, 95% CI 0·02 to 0·13). 
We additionally provide detailed maps of reference deviations and their associations with clinical assessments. Interpretation: We present a heterogeneous brain phenotype of MS that is associated with clinical manifestations and particularly implicates the thalamus. The findings offer potential to aid diagnosis and prognosis of MS. Funding: Norwegian MS-union, Research Council of Norway (#223273; #324252); the South-Eastern Norway Regional Health Authority (#2022080); and the European Union's Horizon 2020 Research and Innovation Programme (#847776, #802998). Research in context. Evidence before this study: Reference values and normative models have yet to be widely applied to neuroimaging assessments of neurological disorders such as multiple sclerosis (MS). We conducted a literature search in PubMed and Embase (Jan 1, 2000-September 12, 2025) using the terms "MRI" AND "multiple sclerosis", with and without the keywords "normative model*" and "atrophy", without language restrictions. While normative models have been applied in psychiatric and developmental disorders, few studies have addressed their use in neurological conditions. Existing MS research has largely focused on global atrophy and has not provided regional reference charts or established links to clinical and cognitive outcomes. Added value of this study: We provide regionally detailed brain morphometry maps derived from a heterogeneous MS cohort spanning wide ranges of age, sex, clinical phenotype, disease duration, disability, and scanner characteristics. By leveraging normative modelling, our approach enables individualised brain phenotyping of MS in relation to a population-based normative sample. The analyses reveal clinically meaningful and spatially consistent patterns of smaller brain volumes, particularly in the thalamus and frontal cortical regions, which are linked to disability, cognitive impairment, and fatigue. 
Robustness across scanners, centres, and longitudinal follow-up supports the stability and generalisability of these findings to real-world MS populations. Implications of all the available evidence: Normative modelling offers an individualised, sensitive, and interpretable approach to quantifying brain structure in MS by providing individual-specific reference values, supporting earlier detection of neurodegeneration and improved patient stratification. A consistent pattern of thalamic and fronto-parietal deviations defines a distinct morphometric profile of MS, with potential utility for early and personalised diagnosis and disease monitoring in clinical practice and clinical trials.
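
The deviation scoring described in this abstract can be illustrated with a minimal Python sketch. It assumes a fixed reference mean and standard deviation per region; the paper's normative models presumably condition on covariates such as age and sex, which this sketch ignores. The `Reference` class, the function names, and all volumes are hypothetical values chosen for illustration, not data from the study.

```python
from dataclasses import dataclass

Z_THRESHOLD = 1.96  # two-sided cut-off quoted in the abstract (|Z| > 1.96)


@dataclass
class Reference:
    """Hypothetical reference mean and SD for one brain region."""
    mean: float
    sd: float


def z_score(volume: float, ref: Reference) -> float:
    """Deviation of an observed volume from the reference, in SD units."""
    return (volume - ref.mean) / ref.sd


def extreme_negative_regions(volumes, references):
    """Return the regions whose volume is significantly smaller than reference."""
    return [
        region
        for region, volume in volumes.items()
        if z_score(volume, references[region]) < -Z_THRESHOLD
    ]


# Illustrative numbers only (cubic millimetres); not taken from the paper.
references = {
    "left_thalamus": Reference(mean=7800.0, sd=600.0),
    "right_thalamus": Reference(mean=7700.0, sd=600.0),
    "left_hippocampus": Reference(mean=4200.0, sd=400.0),
}
patient = {
    "left_thalamus": 6400.0,
    "right_thalamus": 6500.0,
    "left_hippocampus": 4100.0,
}

flagged = extreme_negative_regions(patient, references)
print(flagged)  # ['left_thalamus', 'right_thalamus']
```

The length of `flagged` is the per-person count of extreme smaller-than-reference values, the quantity the abstract compares between MS and HC.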

MRI Classification Neurological Retrospective Clinical In Silico Academic Lab Open Code Open Dataset

Decoding Fibrosis: Transcriptomic and Clinical Insights via AI-Derived Collagen Deposition Phenotypes in MASLD

Wojciechowska, M. K., Thing, M., Hu, Y., Mazzoni, G., Harder, L. M., Werge, M. P., Kimer, N., Das, V., Moreno Martinez, J., Prada-Medina, C. A., Vyberg, M., Goldin, R., Serizawa, R., Tomlinson, J., Douglas Gaalsgard, E., Woodcock, D. J., Hvid, H., Pfister, D. R., Jurtz, V. I., Gluud, L.-L., Rittscher, J.

•preprint•Sep 2 2025

Histological assessment is foundational to multi-omics studies of liver disease, yet conventional fibrosis staging lacks resolution, and quantitative metrics like collagen proportionate area (CPA) fail to capture tissue architecture. While recent AI-driven approaches offer improved precision, they are proprietary and not accessible to academic research. Here, we present a novel, interpretable AI-based framework for characterising liver fibrosis from picrosirius red (PSR)-stained slides. By identifying data-driven collagen deposition phenotypes (CDPs) which capture distinct morphologies, our method substantially improves the sensitivity and specificity of downstream transcriptomic and proteomic analyses compared to CPA and traditional fibrosis scores. Pathway analysis reveals that CDPs 4 and 5 are associated with active extracellular matrix remodelling, while phenotype correlates highlight links to liver functional status. Importantly, we demonstrate that selected CDPs can predict clinical outcomes with similar accuracy to established fibrosis metrics. All models and tools are made freely available to support transparent and reproducible multi-omics pathology research. Highlights: (1) We present a set of data-driven collagen deposition phenotypes for analysing PSR-stained liver biopsies, offering a spatially informed alternative to conventional fibrosis staging and CPA, available as open-source code. (2) The identified collagen deposition phenotypes enhance transcriptomic and proteomic signal detection, revealing active ECM remodelling and distinct functional tissue states. (3) Selected phenotypes predict clinical outcomes with performance comparable to fibrosis stage and CPA, highlighting their potential as candidate quantitative indicators of fibrosis severity.

Mixed Modality Segmentation Abdominal Methodology In Silico Academic Lab Open Code Open Dataset

HONeYBEE: Enabling Scalable Multimodal AI in Oncology Through Foundation Model-Driven Embeddings

Tripathi, A. G., Waqas, A., Schabath, M. B., Yilmaz, Y., Rasool, G.

•preprint•Aug 27 2025

HONeYBEE (Harmonized ONcologY Biomedical Embedding Encoder) is an open-source framework that integrates multimodal biomedical data for oncology applications. It processes clinical data (structured and unstructured), whole-slide images, radiology scans, and molecular profiles to generate unified patient-level embeddings using domain-specific foundation models and fusion strategies. These embeddings enable survival prediction, cancer-type classification, patient similarity retrieval, and cohort clustering. Evaluated on 11,400+ patients across 33 cancer types from The Cancer Genome Atlas (TCGA), clinical embeddings showed the strongest single-modality performance with 98.5% classification accuracy and 96.4% precision@10 in patient retrieval. They also achieved the highest survival prediction concordance indices across most cancer types. Multimodal fusion provided complementary benefits for specific cancers, improving overall survival prediction beyond clinical features alone. Comparative evaluation of four large language models revealed that general-purpose models like Qwen3 outperformed specialized medical models for clinical text representation, though task-specific fine-tuning improved performance on heterogeneous data such as pathology reports.
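
The precision@10 retrieval metric mentioned above can be sketched in a few lines of Python. This is a generic illustration, not HONeYBEE's code: it assumes patients are ranked by cosine similarity between non-zero embedding vectors and that a retrieved patient counts as a hit when it shares the query's cancer type. The function names, embeddings, and labels are made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def precision_at_k(query, query_label, database, k):
    """Fraction of the k most similar database entries sharing the query label.

    database is a list of (embedding, label) pairs; the score divides by k.
    """
    ranked = sorted(database, key=lambda item: cosine(query, item[0]), reverse=True)
    hits = sum(1 for _, label in ranked[:k] if label == query_label)
    return hits / k


# Toy two-dimensional embeddings; real patient embeddings are much larger.
database = [
    ([1.0, 0.1], "LUAD"),
    ([0.9, 0.2], "LUAD"),
    ([0.1, 1.0], "BRCA"),
    ([0.8, 0.3], "LUAD"),
    ([0.2, 0.9], "BRCA"),
]

score = precision_at_k([1.0, 0.0], "LUAD", database, k=3)
print(score)  # 1.0
```

With `k=4` the same query scores 0.75, because the fourth-nearest neighbour belongs to a different cancer type.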

Mixed Modality Classification Methodology In Silico Open Source Open Code GenAI

Real-world federated learning for the brain imaging scientist

Denissen, S., Laton, J., Grothe, M., Vaneckova, M., Uher, T., Kudrna, M., Horakova, D., Baijot, J., Penner, I.-K., Kirsch, M., Motyl, J., De Vos, M., Chen, O. Y., Van Schependom, J., Sima, D. M., Nagels, G.

•preprint•Aug 22 2025

Background: Federated learning (FL) could boost deep learning in neuroimaging but is rarely deployed in a real-world scenario, where its true potential lies. Here, we propose FLightcase, a new FL toolbox tailored for brain research. We tested FLightcase on a real-world FL network to predict the cognitive status of patients with multiple sclerosis (MS) from brain magnetic resonance imaging (MRI). Methods: We first trained a DenseNet neural network to predict age from T1-weighted brain MRI on three open-source datasets, IXI (586 images), SALD (491 images) and CamCAN (653 images). These were distributed across the three centres in our FL network, Brussels (BE), Greifswald (DE) and Prague (CZ). We benchmarked this federated model against a centralised version. The best-performing brain age model was then fine-tuned to predict performance on the Symbol Digit Modalities Test (SDMT) of patients with MS (Brussels: 96 images, Greifswald: 756 images, Prague: 2424 images). Shallow transfer learning (TL) was compared with deep transfer learning, updating weights in the last layer or the entire network respectively. Results: Centralised training outperformed federated training, predicting age with a mean absolute error (MAE) of 6.00 versus 9.02. Federated training yielded a Pearson correlation (all p < .001) between true and predicted age of .78 (IXI, Brussels), .78 (SALD, Greifswald) and .86 (CamCAN, Prague). Fine-tuning of the centralised model to SDMT was most successful with a deep TL paradigm (MAE = 9.12) compared to shallow TL (MAE = 14.08), and respectively on Brussels, Greifswald and Prague predicted SDMT with an MAE of 11.50, 9.64 and 8.86, and a Pearson correlation between true and predicted SDMT of .10 (p = .668), .42 (p < .001) and .51 (p < .001). Conclusion: Real-world federated learning using FLightcase is feasible for neuroimaging research in MS, enabling access to a large MS imaging database without sharing this data. 
The federated SDMT-decoding model is promising and could be improved in the future by adopting FL algorithms that address the non-IID data issue and consider other imaging modalities. We hope our detailed real-world experiments and open-source distribution of FLightcase will prompt researchers to move beyond simulated FL environments.
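
The two evaluation measures reported in this abstract, mean absolute error and Pearson correlation, can be computed as in the following Python sketch. It is a generic illustration rather than FLightcase code; the function names and the ages are invented for the example.

```python
import math


def mean_absolute_error(true_values, predictions):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(true_values, predictions)) / len(true_values)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Made-up ages in years, for illustration only.
true_age = [25.0, 40.0, 55.0, 70.0]
predicted_age = [31.0, 38.0, 60.0, 63.0]

mae = mean_absolute_error(true_age, predicted_age)
r = pearson_r(true_age, predicted_age)
print(mae)  # 5.0
print(round(r, 2))
```

The two measures answer different questions: MAE reports the typical size of the error in the target's own units, while Pearson correlation reports whether predictions rank and scale with the truth, so a model can score well on one and poorly on the other.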

MRI Classification Neurological Methodology In Silico Academic Lab Open Code

Adapting Biomedical Foundation Models for Predicting Outcomes of Anti Seizure Medications

Pham, D. K., Mehta, D., Jiang, Y., Thom, D., Chang, R. S.-k., Foster, E., Fazio, T., Holper, S., Verspoor, K., Liu, J., Nhu, D., Barnard, S., O'Brien, T., Chen, Z., French, J., Kwan, P., Ge, Z.

•preprint•Aug 11 2025

Epilepsy affects over 50 million people worldwide, with anti-seizure medications (ASMs) as the primary treatment for seizure control. However, ASM selection remains a "trial and error" process due to the lack of reliable predictors of effectiveness and tolerability. While machine learning approaches have been explored, existing models are limited to predicting outcomes only for ASMs encountered during training and have not leveraged recent biomedical foundation models for this task. This work investigates ASM outcome prediction using only patient MRI scans and reports. Specifically, we leverage biomedical vision-language foundation models and introduce a novel contextualized instruction-tuning framework that integrates expert-built knowledge trees of MRI entities to enhance their performance. Additionally, by training only on the four most commonly prescribed ASMs, our framework enables generalization to predicting outcomes and effectiveness for unseen ASMs not present during training. We evaluate our instruction-tuning framework on two retrospective epilepsy patient datasets, achieving an average AUC of 71.39 and 63.03 in predicting outcomes for four primary ASMs and three completely unseen ASMs, respectively. Our approach improves the AUC by 5.53 and 3.51 compared to standard report-based instruction tuning for seen and unseen ASMs, respectively. Our code, MRI knowledge tree, prompting templates, and TREE-TUNE generated instruction-answer tuning dataset are available at the link.

MRI Classification Neurological Retrospective Clinical In Silico Academic Lab Benchmark SOTA Open Code

A Clinically-Informed Framework for Evaluating Vision-Language Models in Radiology Report Generation: Taxonomy of Errors and Risk-Aware Metric

Guan, H., Hou, P. C., Hong, P., Wang, L., Zhang, W., Du, X., Zhou, Z., Zhou, L.

•preprint•Jul 14 2025

Recent advances in vision-language models (VLMs) have enabled automatic radiology report generation, yet current evaluation methods remain limited to general-purpose NLP metrics or coarse classification-based clinical scores. In this study, we propose a clinically informed evaluation framework for VLM-generated radiology reports that goes beyond traditional performance measures. We define a taxonomy of 12 radiology-specific error types, each annotated with clinical risk levels (low, medium, high) in collaboration with physicians. Using this framework, we conduct a comprehensive error analysis of three representative VLMs, i.e., DeepSeek VL2, CXR-LLaVA, and CheXagent, on 685 gold-standard, expert-annotated MIMIC-CXR cases. We further introduce a risk-aware evaluation metric, the Clinical Risk-weighted Error Score for Text-generation (CREST), to quantify safety impact. Our findings reveal critical model vulnerabilities, common error patterns, and condition-specific risk profiles, offering actionable insights for model development and deployment. This work establishes a safety-centric foundation for evaluating and improving medical report generation models. The source code of our evaluation framework, including CREST computation and error taxonomy analysis, is available at https://github.com/guanharry/VLM-CREST.

X-Ray LLM Radiology Report Chest Methodology In Silico Open Code GenAI

An Open-Source Generalizable Deep Learning Framework for Automated Corneal Segmentation in Anterior Segment Optical Coherence Tomography Imaging

Kandakji, L., Liu, S., Balal, S., Moghul, I., Allan, B., Tuft, S., Gore, D., Pontikos, N.

•preprint•Jun 20 2025

Purpose: To develop a deep learning model - Cornea nnU-Net Extractor (CUNEX) - for full-thickness corneal segmentation of anterior segment optical coherence tomography (AS-OCT) images and evaluate its utility in artificial intelligence (AI) research. Methods: We trained and evaluated CUNEX using nnU-Net on 600 AS-OCT images (CSO MS-39) from 300 patients: 100 normal, 100 keratoconus (KC), and 100 Fuchs endothelial corneal dystrophy (FECD) eyes. To assess generalizability, we externally validated CUNEX on 1,168 AS-OCT images from an infectious keratitis dataset acquired from a different device (Casia SS-1000). We benchmarked CUNEX against two recent models, CorneaNet and ScLNet. We then applied CUNEX to our dataset of 194,599 scans from 37,499 patients as preprocessing for a classification model evaluating whether segmentation improves AI prediction, including age, sex, and disease staging (KC and FECD). Results: CUNEX achieved Dice similarity coefficient (DSC) and intersection over union (IoU) scores ranging from 94-95% and 90-99%, respectively, across healthy, KC, and FECD eyes. This was similar to ScLNet (within 3%) but better than CorneaNet (8-35% lower). On external validation, CUNEX maintained high performance (DSC 83%; IoU 71%) while ScLNet (DSC 14%; IoU 8%) and CorneaNet (DSC 16%; IoU 9%) failed to generalize. Unexpectedly, segmentation minimally impacted classification accuracy except for sex prediction, where accuracy dropped from 81 to 68%, suggesting sex-related features may lie outside the cornea. Conclusion: CUNEX delivers the first open-source generalizable corneal segmentation model using the latest framework, supporting its use in clinical analysis and AI workflows across diseases and imaging platforms. It is available at https://github.com/lkandakji/CUNEX.
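
The two overlap metrics quoted in this abstract follow standard definitions, shown here as a small Python sketch for binary masks. It is not taken from the CUNEX repository; the function name and the toy masks are assumptions made for illustration, and treating two empty masks as perfect agreement is one convention among several.

```python
def dice_and_iou(pred, truth):
    """Dice coefficient and IoU for two equally sized binary masks (nested 0/1 lists)."""
    intersection = pred_sum = truth_sum = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            intersection += p & t
            pred_sum += p
            truth_sum += t
    union = pred_sum + truth_sum - intersection
    if union == 0:
        # Both masks empty: treated here as perfect agreement (a convention).
        return 1.0, 1.0
    dice = 2 * intersection / (pred_sum + truth_sum)
    iou = intersection / union
    return dice, iou


# Toy 3x4 masks, for illustration only.
truth = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
pred = [
    [0, 1, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

dice, iou = dice_and_iou(pred, truth)
print(dice)  # 0.75
```

For a single pair of masks the two metrics are tied by IoU = DSC / (2 - DSC), which is why the reported external-validation values of DSC 83% and IoU 71% sit close to that relation (0.83 / 1.17 is about 0.71). Scores averaged over many images need not satisfy it exactly.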

OCT Segmentation Methodology In Silico Academic Lab Open Code

CEREBLEED: Automated quantification and severity scoring of intracranial hemorrhage on non-contrast CT

Cepeda, S., Esteban-Sinovas, O., Arrese, I., Sarabia, R.

•preprint•Jun 13 2025

Background: Intracranial hemorrhage (ICH), whether spontaneous or traumatic, is a neurological emergency with high morbidity and mortality. Accurate assessment of severity is essential for neurosurgical decision-making. This study aimed to develop and evaluate a fully automated, deep learning-based tool for the standardized assessment of ICH severity, based on the segmentation of the hemorrhage and intracranial structures, and the computation of an objective severity index. Methods: Non-contrast cranial CT scans from patients with spontaneous or traumatic ICH were retrospectively collected from public datasets and a tertiary care center. Deep learning models were trained to segment hemorrhages and intracranial structures. These segmentations were used to compute a severity index reflecting bleeding burden and mass effect through volumetric relationships. Segmentation performance was evaluated on a hold-out test cohort. In a prospective cohort, the severity index was assessed in relation to expert-rated CT severity, clinical outcomes, and the need for urgent neurosurgical intervention. Results: A total of 1,110 non-contrast cranial CT scans were analyzed, 900 from the retrospective cohort and 200 from the prospective evaluation cohort. The binary segmentation model achieved a median Dice score of 0.90 for total hemorrhage. The multilabel model yielded Dice scores ranging from 0.55 to 0.94 across hemorrhage subtypes. The severity index significantly correlated with expert-rated CT severity (p < 0.001), the modified Rankin Scale (p = 0.007), and the Glasgow Outcome Scale-Extended (p = 0.039), and independently predicted the need for urgent surgery (p < 0.001). A threshold of approximately 300 was identified as a decision point for surgical management (AUC = 0.83). Conclusion: We developed a fully automated and openly accessible pipeline for the analysis of non-contrast cranial CT in intracranial hemorrhage. 
It computes a novel index that objectively quantifies hemorrhage severity and is significantly associated with clinically relevant outcomes, including the need for urgent neurosurgical intervention.

CT Segmentation Neurological Retrospective Clinical Clinical Pilot Academic Lab Open Code

Cross-dataset Evaluation of Dementia Longitudinal Progression Prediction Models

Zhang, C., An, L., Wulan, N., Nguyen, K.-N., Orban, C., Chen, P., Chen, C., Zhou, J. H., Liu, K., Yeo, B. T. T., Alzheimer's Disease Neuroimaging Initiative, Australian Imaging Biomarkers and Lifestyle Study of Aging

•preprint•Jun 11 2025

Introduction: Accurately predicting Alzheimer's Disease (AD) progression is useful for clinical care. The 2019 TADPOLE (The Alzheimer's Disease Prediction Of Longitudinal Evolution) challenge evaluated 92 algorithms from 33 teams worldwide. Unlike typical clinical prediction studies, TADPOLE accommodates (1) a variable number of observed timepoints across patients, (2) missing data across modalities and visits, and (3) prediction over an open-ended time horizon, which better reflects real-world data. However, TADPOLE only used the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, so how well top algorithms generalize to other cohorts remains unclear. Methods: We tested five algorithms in three external datasets covering 2,312 participants and 13,200 timepoints. The algorithms included FROG, the overall TADPOLE winner, which utilized a unique Longitudinal-to-Cross-sectional (L2C) transformation to convert variable-length longitudinal histories into feature vectors of the same length across participants (i.e., same-length feature vectors). We also considered two FROG variants. One variant unified all XGBoost models from the original FROG with a single feedforward neural network (FNN), which we referred to as L2C-FNN. We also included minimal recurrent neural networks (MinimalRNN), which was ranked second at publication time, as well as AD Course Map (AD-Map), which outperformed MinimalRNN at publication time. All five models - three FROG variants, MinimalRNN and AD-Map - were trained on ADNI and tested on the external datasets. Results: L2C-FNN performed the best overall. In the case of predicting cognition and ventricle volume, L2C-FNN and AD-Map were the best. For clinical diagnosis prediction, L2C-FNN was the best, while AD-Map was the worst. L2C-FNN also maintained its edge over other models, regardless of the number of observed timepoints, and regardless of the prediction horizon from 0 to 6 years into the future. 
Conclusions: L2C-FNN shows strong potential for both short-term and long-term dementia progression prediction. Pretrained ADNI models are available: https://github.com/ThomasYeoLab/CBIG/tree/master/stable_projects/predict_phenotypes/Zhang2025_L2CFNN.
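
The general idea of a longitudinal-to-cross-sectional transformation, turning visit histories of different lengths into same-length feature vectors, can be sketched in Python as follows. The abstract does not list the features FROG or L2C-FNN actually use, so the six summary features below (last value, mean, minimum, maximum, slope, time since last observation) are hypothetical choices made purely for illustration, as are the function name and the scores.

```python
def l2c_features(visits, forecast_time):
    """Summarise one participant's history as a fixed-length feature vector.

    visits: list of (time_in_years, value_or_None) pairs sorted by time, where
    None marks a missing measurement. The feature set is illustrative only.
    """
    observed = [(t, v) for t, v in visits if v is not None]
    if not observed:
        raise ValueError("need at least one observed value")
    times = [t for t, _ in observed]
    values = [v for _, v in observed]
    last_time, last_value = observed[-1]
    if len(observed) > 1 and times[-1] != times[0]:
        slope = (values[-1] - values[0]) / (times[-1] - times[0])
    else:
        slope = 0.0  # a single observation carries no trend information
    return [
        last_value,
        sum(values) / len(values),
        min(values),
        max(values),
        slope,
        forecast_time - last_time,
    ]


# Two made-up participants with different numbers of visits and a missing value.
history_a = [(0.0, 28.0), (0.5, None), (1.0, 26.0), (2.0, 24.0)]
history_b = [(0.0, 30.0)]

features_a = l2c_features(history_a, forecast_time=3.0)
features_b = l2c_features(history_b, forecast_time=1.5)
print(len(features_a), len(features_b))  # 6 6
```

Because both participants end up with vectors of the same length, any standard cross-sectional model (gradient-boosted trees or a feedforward network, as named in the abstract) can be trained on them, and including the gap to the forecast time lets one model cover an open-ended prediction horizon.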

MRI Classification Neurological Retrospective Clinical In Silico Academic Lab Benchmark SOTA Open Code

Physician-level classification performance across multiple imaging domains with a diagnostic medical foundation model and a large dataset of annotated medical images

Thieme, A. H., Miri, T., Marra, A. R., Kobayashi, T., Rodriguez-Nava, G., Li, Y., Barba, T., Er, A. G., Benzler, J., Gertler, M., Riechers, M., Hinze, C., Zheng, Y., Pelz, K., Nagaraj, D., Chen, A., Loeser, A., Ruehle, A., Zamboglou, C., Alyahya, L., Uhlig, M., Machiraju, G., Weimann, K., Lippert, C., Conrad, T., Ma, J., Novoa, R., Moor, M., Hernandez-Boussard, T., Alawad, M., Salinas, J. L., Mittermaier, M., Gevaert, O.

•preprint•May 31 2025

A diagnostic medical foundation model (MedFM) is an artificial intelligence (AI) system engineered to accurately determine diagnoses across various medical imaging modalities and specialties. To train MedFM, we created the PubMed Central Medical Images Dataset (PMCMID), the largest annotated medical image dataset to date, comprising 16,126,659 images from 3,021,780 medical publications. Using AI- and ontology-based methods, we identified 4,482,237 medical images (e.g., clinical photos, X-rays, ultrasounds) and generated comprehensive annotations. To optimize MedFM's performance and assess biases, 13,266 images were manually annotated to establish a multimodal benchmark. MedFM achieved physician-level performance in diagnosis tasks spanning radiology, dermatology, and infectious diseases without requiring specific training. Additionally, we developed the Image2Paper app, allowing clinicians to upload medical images and retrieve relevant literature. The correct diagnosis appeared within the top ten results in 88.4% of cases, and at least one relevant differential diagnosis in 93.0%. MedFM and PMCMID were made publicly available. Funding: Research reported here was partially supported by the National Cancer Institute (NCI) (R01 CA260271), the Saudi Company for Artificial Intelligence (SCAI) Authority, and the German Federal Ministry for Economic Affairs and Climate Action (BMWK) under the project DAKI-FWS (01MK21009E). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Mixed Modality Classification Methodology In Silico Academic Lab Breakthrough Open Dataset Open Code
