Performance of GPT-4 for automated prostate biopsy decision-making based on mpMRI: a multi-center evidence study.

Shi MJ, Wang ZX, Wang SK, Li XH, Zhang YL, Yan Y, An R, Dong LN, Qiu L, Tian T, Liu JX, Song HC, Wang YF, Deng C, Cao ZB, Wang HY, Wang Z, Wei W, Song J, Lu J, Wei X, Wang ZC

PubMed paper · Jul 7 2025
Multiparametric magnetic resonance imaging (mpMRI) has significantly advanced prostate cancer (PCa) detection, yet decisions on invasive biopsy for moderate Prostate Imaging Reporting and Data System (PI-RADS) scores remain ambiguous. To explore the decision-making capacity of Generative Pretrained Transformer-4 (GPT-4) for automated prostate biopsy recommendations, we included 2299 individuals who underwent prostate biopsy from 2018 to 2023 at 3 large medical centers, with mpMRI available before biopsy and documented clinical-histopathological records. GPT-4 generated structured reports from given prompts. Its performance was quantified using confusion matrices, from which sensitivity, specificity, and area under the curve were calculated. Multiple human evaluation procedures were also conducted. Wilcoxon's rank sum test, Fisher's exact test, and Kruskal-Wallis tests were used for comparisons. In this cohort, the largest of its kind in a Chinese population, patients with moderate PI-RADS scores (scores 3 and 4) accounted for 39.7% (912/2299) and were defined as the subset-of-interest (SOI). The detection rates of clinically significant PCa for PI-RADS scores 2-5 were 9.4, 27.3, 49.2, and 80.1%, respectively. Nearly half of SOI patients (47.5%, 433/912) were histopathologically proven to have undergone unnecessary prostate biopsies. With the assistance of GPT-4, 20.8% (190/912) of the SOI population could have avoided unnecessary biopsies, and it performed even better [28.8% (118/410)] in the most heterogeneous subgroup, PI-RADS score 3. More than 90.0% of GPT-4-generated reports were rated comprehensive and easy to understand, although ratings for accuracy were lower (82.8%). GPT-4 also demonstrated cognitive potential for handling complex problems. Additionally, chain-of-thought prompting enabled us to better understand the decision-making logic behind GPT-4. Finally, we developed the ProstAIGuide platform to make these capabilities accessible to both doctors and patients. This multi-center study highlights the clinical utility of GPT-4 for prostate biopsy decision-making and advances our understanding of the latest artificial intelligence implementations in various medical scenarios.
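As an illustration of the evaluation metrics the abstract names (confusion matrix, sensitivity, specificity, AUC), here is a minimal scikit-learn sketch on synthetic stand-in data; the variables and threshold are invented, not the study's:

```python
# Illustrative computation of the metrics named in the abstract
# (confusion matrix, sensitivity, specificity, AUC) with scikit-learn.
# y_true / y_score are synthetic stand-ins, not the study's data.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                          # 1 = biopsy warranted
y_score = np.clip(y_true * 0.6 + rng.random(200) * 0.5, 0, 1)  # model scores
y_pred = (y_score >= 0.5).astype(int)                          # binary recommendation

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
auc = roc_auc_score(y_true, y_score)
print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} AUC={auc:.3f}")
```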

MedGemma Technical Report

Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, Justin Chen, Fereshteh Mahvar, Liron Yatziv, Tiffany Chen, Bram Sterling, Stefanie Anna Baby, Susanna Maria Baby, Jeremy Lai, Samuel Schmidgall, Lu Yang, Kejia Chen, Per Bjornsson, Shashir Reddy, Ryan Brush, Kenneth Philbrick, Howard Hu, Howard Yang, Richa Tiwari, Sunny Jansen, Preeti Singh, Yun Liu, Shekoofeh Azizi, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Riviere, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Elena Buchatskaya, Jean-Baptiste Alayrac, Dmitry Lepikhin, Vlad Feinberg, Sebastian Borgeaud, Alek Andreev, Cassidy Hardin, Robert Dadashi, Léonard Hussenot, Armand Joulin, Olivier Bachem, Yossi Matias, Katherine Chou, Avinatan Hassidim, Kavi Goel, Clement Farabet, Joelle Barral, Tris Warkentin, Jonathon Shlens, David Fleet, Victor Cotruta, Omar Sanseviero, Gus Martins, Phoebe Kirk, Anand Rao, Shravya Shetty, David F. Steiner, Can Kirmizibayrak, Rory Pilgrim, Daniel Golden, Lin Yang

arXiv preprint · Jul 7 2025
Artificial intelligence (AI) has significant potential in healthcare applications, but its training and deployment face challenges due to healthcare's diverse data, complex tasks, and the need to preserve privacy. Foundation models that perform well on medical tasks and require less task-specific tuning data are critical to accelerate the development of healthcare AI applications. We introduce MedGemma, a collection of medical vision-language foundation models based on Gemma 3 4B and 27B. MedGemma demonstrates advanced medical understanding and reasoning on images and text, significantly exceeding the performance of similar-sized generative models and approaching the performance of task-specific models, while maintaining the general capabilities of the Gemma 3 base models. For out-of-distribution tasks, MedGemma achieves 2.6-10% improvement on medical multimodal question answering, 15.5-18.1% improvement on chest X-ray finding classification, and 10.8% improvement on agentic evaluations compared to the base models. Fine-tuning MedGemma further improves performance in subdomains, reducing errors in electronic health record information retrieval by 50% and reaching comparable performance to existing specialized state-of-the-art methods for pneumothorax classification and histopathology patch classification. We additionally introduce MedSigLIP, a medically-tuned vision encoder derived from SigLIP. MedSigLIP powers the visual understanding capabilities of MedGemma and as an encoder achieves comparable or better performance than specialized medical image encoders. Taken together, the MedGemma collection provides a strong foundation of medical image and text capabilities, with potential to significantly accelerate medical research and development of downstream applications. The MedGemma collection, including tutorials and model weights, can be found at https://goo.gle/medgemma.
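A hedged sketch of how such a checkpoint might be queried through Hugging Face transformers; the model ID, image file, and prompt below are assumptions for illustration, so consult the linked project page for the actual released weights and usage:

```python
# Hedged sketch: querying a MedGemma checkpoint via the transformers
# image-text-to-text pipeline. The model ID "google/medgemma-4b-it" is an
# assumption based on the Gemma naming scheme; the image path and prompt
# are placeholders. See https://goo.gle/medgemma for the official release.
from transformers import pipeline
from PIL import Image

pipe = pipeline("image-text-to-text", model="google/medgemma-4b-it")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": Image.open("chest_xray.png")},  # placeholder file
        {"type": "text", "text": "Describe any abnormal findings."},
    ],
}]
out = pipe(text=messages, max_new_tokens=128)
print(out[0]["generated_text"])
```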

Uncovering Neuroimaging Biomarkers of Brain Tumor Surgery with AI-Driven Methods

Carmen Jimenez-Mesa, Yizhou Wan, Guilio Sansone, Francisco J. Martinez-Murcia, Javier Ramirez, Pietro Lio, Juan M. Gorriz, Stephen J. Price, John Suckling, Michail Mamalakis

arXiv preprint · Jul 7 2025
Brain tumor resection is a complex procedure with significant implications for patient survival and quality of life. Predictions of patient outcomes give clinicians and patients the opportunity to select the most suitable onco-functional balance. In this study, global features derived from structural magnetic resonance imaging in a clinical dataset of 49 pre- and post-surgery patients identified potential biomarkers associated with survival outcomes. We propose a framework that integrates Explainable AI (XAI) with neuroimaging-based feature engineering for survival assessment, offering guidance for surgical decision-making. We also introduce a global explanation optimizer that refines survival-related feature attribution in deep learning models, enhancing interpretability and reliability. Our findings suggest that survival is influenced by alterations in regions associated with cognitive and sensory functions, indicating the importance of preserving areas involved in decision-making and emotional regulation during surgery to improve outcomes. The global explanation optimizer improves both the fidelity and the comprehensibility of explanations compared to state-of-the-art XAI methods. It effectively identifies survival-related variability, underscoring its relevance in precision medicine for brain tumor treatment.
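The paper's setup, predicting survival from global imaging-derived features, can be illustrated with an ordinary Cox proportional-hazards baseline; the lifelines sketch below uses invented feature names and synthetic data and is not the authors' XAI framework:

```python
# Generic survival-modeling baseline for imaging-derived global features,
# using lifelines' Cox proportional-hazards model. Feature names and data
# are synthetic placeholders; this is not the paper's explanation optimizer.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(42)
n = 49  # matches the cohort size reported in the abstract
df = pd.DataFrame({
    "tumor_volume": rng.normal(30, 10, n),        # hypothetical global features
    "resection_extent": rng.uniform(0.5, 1.0, n),
    "survival_months": rng.exponential(20, n),
    "event_observed": rng.integers(0, 2, n),
})

cph = CoxPHFitter()
cph.fit(df, duration_col="survival_months", event_col="event_observed")
cph.print_summary()  # hazard ratios act as a crude global feature attribution
```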

Leveraging Large Language Models for Accurate AO Fracture Classification from CT Text Reports.

Mergen M, Spitzl D, Ketzer C, Strenzke M, Marka AW, Makowski MR, Bressem KK, Adams LC, Gassert FT

PubMed paper · Jul 7 2025
Large language models (LLMs) have shown promising potential in analyzing complex textual data, including radiological reports. These models can assist clinicians, particularly those with limited experience, by integrating and presenting diagnostic criteria within radiological classifications. However, before clinical adoption, LLMs must be rigorously validated by medical professionals to ensure accuracy, especially in the context of advanced radiological classification systems. This study evaluates the performance of four LLMs (ChatGPT-4o, AmbossGPT, Claude 3.5 Sonnet, and Gemini 2.0 Flash) in classifying fractures based on the AO classification system using CT reports. A dataset of 292 fictitious physician-generated CT reports, representing 310 fractures, was used to retrospectively assess each LLM's accuracy in AO fracture classification. Performance was evaluated by comparing the models' classifications to ground-truth labels, with accuracy rates analyzed across fracture types and subtypes. ChatGPT-4o and AmbossGPT achieved the highest overall accuracy (74.6 and 74.3%, respectively), outperforming Claude 3.5 Sonnet (69.5%) and Gemini 2.0 Flash (62.7%). Statistically significant differences were observed in fracture type classification, particularly between ChatGPT-4o and Gemini 2.0 Flash (Δ12%, p < 0.001). While all models demonstrated strong bone recognition rates (90-99%), their accuracy in fracture subtype classification remained lower (71-77%), indicating limitations in nuanced diagnostic categorization. LLMs show potential in assisting radiologists with initial fracture classification, particularly in high-volume or resource-limited settings. However, their performance remains inconsistent for detailed subtype classification, highlighting the need for further refinement and validation before clinical integration in advanced diagnostic workflows.
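A minimal sketch of the prompting pattern such a study implies, here with the OpenAI Python client as one example interface; the system prompt, report text, and model choice are illustrative assumptions, not the study's exact protocol:

```python
# Illustrative prompt for AO fracture classification from a CT report text,
# using the OpenAI Python client as one example LLM interface. The report
# text and system prompt are invented; the study's exact prompts are not shown here.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment
report = "CT of the right distal radius: complete articular fracture ..."

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "You are a radiology assistant. Return the AO/OTA "
                    "fracture classification code for the report, e.g. '2R3C1'."},
        {"role": "user", "content": report},
    ],
)
print(response.choices[0].message.content)
```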

Computed Tomography Visual Question Answering with Cross-modal Feature Graphing

Yuanhe Tian, Chen Su, Junwen Duan, Yan Song

arXiv preprint · Jul 6 2025
Visual question answering (VQA) in medical imaging aims to support clinical diagnosis by automatically interpreting complex imaging data in response to natural language queries. Existing studies typically rely on distinct visual and textual encoders to independently extract features from medical images and clinical questions, which are subsequently combined to generate answers. In computed tomography (CT) in particular, such approaches mirror conventional practice in medical image analysis. However, they pay little attention to the spatial continuity and inter-slice correlations in volumetric CT data, leading to fragmented and imprecise responses. In this paper, we propose a novel large language model (LLM)-based framework enhanced by a graph representation of salient features. Different from conventional multimodal encoding strategies, our approach constructs a cross-modal graph integrating both visual and textual features, treating individual CT slices and question tokens as nodes within the graph. We further leverage an attentive graph convolutional network to dynamically fuse information within this structure. The resulting aggregated graph features then serve as a soft prompt to guide a large language model in generating accurate answers. Extensive experiments on the M3D-VQA benchmark demonstrate that our approach consistently outperforms baselines across multiple evaluation metrics, offering more robust reasoning capabilities.
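A toy PyTorch sketch of the core idea: treat CT slice features and question-token features as nodes of one graph, fuse them with an attention-weighted graph convolution, and pool into a soft prompt. Dimensions and the single-layer design are illustrative, not the authors' code:

```python
# Minimal sketch: CT slices and question tokens as nodes of one cross-modal
# graph, fused by an attention-weighted graph convolution and mean-pooled
# into a soft prompt for an LLM. All sizes are illustrative placeholders.
import torch
import torch.nn.functional as F

d = 256
slice_feats = torch.randn(40, d)   # 40 CT slices encoded by a vision backbone
token_feats = torch.randn(12, d)   # 12 question tokens from a text encoder
nodes = torch.cat([slice_feats, token_feats], dim=0)      # (52, d) graph nodes

W = torch.nn.Linear(d, d, bias=False)
h = W(nodes)
attn = F.softmax(h @ h.T / d**0.5, dim=-1)                # dense attention graph
fused = F.relu(attn @ h)                                  # attentive graph-conv step
soft_prompt = fused.mean(dim=0, keepdim=True)             # (1, d) prompt vector
print(soft_prompt.shape)
```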

Unveiling knee morphology with SHAP: shaping personalized medicine through explainable AI.

Cansiz B, Arslan S, Gültekin MZ, Serbes G

PubMed paper · Jul 5 2025
This study aims to enhance personalized medical assessments and the early detection of knee-related pathologies by examining the relationship between knee morphology and demographic factors such as age, gender, and body mass index. Additionally, gender-specific reference values for knee morphological features were determined using explainable artificial intelligence (XAI). A retrospective analysis was conducted on MRI data of 500 healthy knees from individuals aged 20-40 years. The study included various knee morphological features such as Distal Femoral Width (DFW), Lateral Femoral Condylar Width (LFCW), Intercondylar Femoral Width (IFW), Anterior Cruciate Ligament Width (ACLW), and Anterior Cruciate Ligament Length (ACLL). Machine learning models, including Decision Trees, Random Forests, Light Gradient Boosting, Multilayer Perceptron, and Support Vector Machines, were employed to predict gender based on these features. SHapley Additive exPlanations (SHAP) were used to analyze feature importance. The models demonstrated high classification performance: 83.2% (±5.15) for classifying clusters based on morphological features and 88.06% (±4.8) for gender classification. These results confirmed the strong correlation between knee morphology and gender. The study found that DFW is the most significant feature for gender prediction, with values below the 78-79 mm range indicating females and values above it indicating males. LFCW, IFW, ACLW, and ACLL also showed significant gender-based differences. The findings establish gender-specific reference values for knee morphological features, highlighting the impact of gender on knee morphology. These reference values can improve the accuracy of diagnoses and treatment plans tailored to each gender, enhancing personalized medical care.
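A small sketch of the SHAP workflow the abstract describes, with a random forest and synthetic knee measurements; the data, label rule, and model settings are placeholders, not the study's MRI dataset:

```python
# Sketch of SHAP-based feature importance for gender classification from
# knee morphology, mirroring the abstract's setup with a random forest.
# Synthetic data and the toy label rule are placeholders, not the study's data.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "DFW": rng.normal(79, 4, n),    # distal femoral width (mm)
    "LFCW": rng.normal(33, 3, n),
    "IFW": rng.normal(21, 2, n),
    "ACLW": rng.normal(10, 1.5, n),
    "ACLL": rng.normal(36, 3, n),
})
y = (X["DFW"] > 79).astype(int)     # toy label: male if DFW above ~79 mm

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
vals = shap.TreeExplainer(model).shap_values(X)
# shap returns a list (older versions) or a 3D array (newer); take class-1 attributions
vals = vals[..., 1] if getattr(vals, "ndim", 0) == 3 else vals[1]
importance = np.abs(vals).mean(axis=0)  # mean |SHAP| per feature
for name, imp in sorted(zip(X.columns, importance), key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```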

Bridging Vision and Language: Optimal Transport-Driven Radiology Report Generation via LLMs

Haifeng Zhao, Yufei Zhang, Leilei Ma, Shuo Xu, Dengdi Sun

arXiv preprint · Jul 5 2025
Radiology report generation represents a significant application within medical AI, and has achieved impressive results. Concurrently, large language models (LLMs) have demonstrated remarkable performance across various domains. However, empirical validation indicates that general LLMs tend to focus more on linguistic fluency than clinical effectiveness, and lack the ability to effectively capture the relationship between X-ray images and their corresponding texts, resulting in poor clinical practicability. To address these challenges, we propose Optimal Transport-Driven Radiology Report Generation (OTDRG), a novel framework that leverages Optimal Transport (OT) to align image features with disease labels extracted from reports, effectively bridging the cross-modal gap. The core component of OTDRG is Alignment & Fine-Tuning, in which OT uses encoded label features and visual image features to minimize cross-modal distances before integrating image and text features for LLM fine-tuning. Additionally, we design a novel disease prediction module to predict the disease labels contained in X-ray images during validation and testing. Evaluated on the MIMIC-CXR and IU X-Ray datasets, OTDRG achieves state-of-the-art performance in both natural language generation (NLG) and clinical efficacy (CE) metrics, delivering reports that are not only linguistically coherent but also clinically accurate.
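The entropy-regularized optimal-transport step at the heart of such alignment can be sketched with a plain Sinkhorn iteration; the shapes, L2 cost, and uniform marginals below are illustrative choices, not the OTDRG implementation:

```python
# Minimal Sinkhorn sketch of the optimal-transport alignment idea: compute a
# transport plan between image patch features and disease-label embeddings,
# then use the transport cost as an alignment loss. Illustrative only.
import torch

def sinkhorn(cost, n_iters=50, eps=0.1):
    """Entropy-regularized OT plan for a cost matrix with uniform marginals."""
    K = torch.exp(-cost / eps)
    u = torch.ones(cost.size(0)) / cost.size(0)
    v = torch.ones(cost.size(1)) / cost.size(1)
    a, b = u.clone(), v.clone()
    for _ in range(n_iters):
        a = u / (K @ b)
        b = v / (K.T @ a)
    return torch.diag(a) @ K @ torch.diag(b)   # transport plan

img_feats = torch.randn(49, 128)     # e.g., X-ray patch embeddings
label_feats = torch.randn(14, 128)   # e.g., 14 disease-label embeddings
cost = torch.cdist(img_feats, label_feats)     # pairwise L2 cost
plan = sinkhorn(cost)
ot_loss = (plan * cost).sum()        # alignment loss to minimize during fine-tuning
print(ot_loss)
```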

Unveiling genetic architecture of white matter microstructure through unsupervised deep representation learning of fractional anisotropy images.

Zhao X, Xie Z, He W, Fornage M, Zhi D

PubMed paper · Jul 5 2025
Fractional anisotropy (FA) derived from diffusion MRI is a widely used marker of white matter (WM) integrity. However, conventional FA-based genetic studies focus on phenotypes representing tract- or atlas-defined averages, which may oversimplify spatial patterns of WM integrity and thus limit genetic discovery. Here, we propose a deep learning-based framework, termed unsupervised deep representation of white matter (UDR-WM), to extract brain-wide FA features, referred to as UDIP-FAs, that capture distributed microstructural variation without prior anatomical assumptions. UDIP-FAs exhibit enhanced sensitivity to aging and substantially higher SNP-based heritability than traditional FA phenotypes (P < 2.20e-16, Mann-Whitney U test; mean h² = 50.81%). Through multivariate GWAS, we identified 939 significant lead SNPs in 586 loci, mapped to 3480 genes, dubbed UDIP-FA-related genes (UFAGs). UFAGs are overexpressed in glial cells, particularly astrocytes and oligodendrocytes (Bonferroni-corrected P < 2e-6, Wald test), and show strong overlap with risk gene sets for schizophrenia and Parkinson disease (Bonferroni-corrected P < 7.06e-3, Fisher's exact test). UDIP-FAs are genetically correlated with multiple brain disorders and cognitive traits, including fluid intelligence and reaction time, and are associated with polygenic risk for bone mineral density. Network analyses reveal that UFAGs form disease-enriched modules across protein-protein interaction and co-expression networks, implicating core pathways in myelination and axonal structure. Notably, several UFAGs, including ACHE and ALDH2, are targets of existing neuropsychiatric drugs. Together, our findings establish UDIP-FA as a biologically and clinically informative brain phenotype, enabling high-resolution dissection of white matter genetic architecture and its genetic links to complex brain traits.
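A hedged sketch of the unsupervised-representation idea: a small 3D convolutional autoencoder whose bottleneck would serve as UDIP-FA-like phenotypes for downstream GWAS. The architecture and sizes are invented for illustration, not the UDR-WM model:

```python
# Toy 3D convolutional autoencoder: the bottleneck vector stands in for an
# unsupervised image-derived phenotype of an FA volume. Architecture, sizes,
# and data are illustrative assumptions, not the paper's UDR-WM framework.
import torch
import torch.nn as nn

class FAEncoder(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv3d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(latent_dim),          # bottleneck = image-derived phenotype
        )
        self.decode = nn.Sequential(
            nn.Linear(latent_dim, 16 * 8 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (16, 8, 8, 8)),
            nn.ConvTranspose3d(16, 8, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(8, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        z = self.encode(x)
        return self.decode(z), z

fa = torch.randn(2, 1, 32, 32, 32)          # toy FA volumes
recon, phenotypes = FAEncoder()(fa)
loss = nn.functional.mse_loss(recon, fa)    # unsupervised reconstruction objective
print(phenotypes.shape)                     # (2, 128) embeddings usable in GWAS
```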

Performance of open-source and proprietary large language models in generating patient-friendly radiology chest CT reports.

Prucker P, Busch F, Dorfner F, Mertens CJ, Bayerl N, Makowski MR, Bressem KK, Adams LC

PubMed paper · Jul 5 2025
Large Language Models (LLMs) show promise for generating patient-friendly radiology reports, but the performance of open-source versus proprietary LLMs needs assessment. To compare open-source and proprietary LLMs in generating patient-friendly radiology reports from chest CTs using quantitative readability metrics and qualitative assessments by radiologists, fifty chest CT reports were processed by seven LLMs: three open-source models (Llama-3-70b, Mistral-7b, Mixtral-8x7b) and four proprietary models (GPT-4, GPT-3.5-Turbo, Claude-3-Opus, Gemini-Ultra). Simplification was evaluated using five quantitative readability metrics. Three radiologists rated patient-friendliness on a five-point Likert scale across five criteria. Content and coherence errors were counted. Inter-rater reliability and differences among models were statistically assessed. Inter-rater reliability was substantial to near perfect (κ = 0.76-0.86). Qualitatively, Llama-3-70b was non-inferior to leading proprietary models in 4/5 categories. GPT-3.5-Turbo showed the best overall readability, outperforming GPT-4 in two metrics. Llama-3-70b outperformed GPT-3.5-Turbo on the Coleman-Liau Index (p = 0.006). Claude-3-Opus and Gemini-Ultra scored lower on readability but were rated highly in qualitative assessments. Claude-3-Opus maintained perfect factual accuracy. Claude-3-Opus and GPT-4 outperformed Llama-3-70b in emotional sensitivity (90.0% vs 46.0%, p < 0.001). Llama-3-70b shows strong potential for generating quality, patient-friendly radiology reports, challenging proprietary models. With further adaptation, open-source LLMs could advance patient-friendly reporting technology.
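The kind of quantitative readability scoring the study reports can be reproduced with the textstat package (whether the authors used this library is not stated); the sample text is invented:

```python
# Illustrative readability scoring of a simplified report passage with the
# textstat package; five common metrics, including the Coleman-Liau Index
# mentioned in the abstract. The sample text is an invented placeholder.
import textstat

simplified = ("Your chest scan shows a small spot on the right lung. "
              "It is likely harmless, but your doctor may suggest a follow-up scan.")

print("Flesch Reading Ease:", textstat.flesch_reading_ease(simplified))
print("Flesch-Kincaid Grade:", textstat.flesch_kincaid_grade(simplified))
print("Coleman-Liau Index:", textstat.coleman_liau_index(simplified))
print("Gunning Fog:", textstat.gunning_fog(simplified))
print("SMOG Index:", textstat.smog_index(simplified))
```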

Quantitative CT Imaging in Chronic Obstructive Pulmonary Disease.

Park S, Lee SM, Hwang HJ, Oh SY, Choe J, Seo JB

PubMed paper · Jul 4 2025
Chronic obstructive pulmonary disease (COPD) is a highly heterogeneous condition characterized by diverse pulmonary and extrapulmonary manifestations. Efforts to quantify its various components using CT imaging have advanced, aiming for more precise, objective, and reproducible assessment and management. Beyond emphysema and small airway disease, the two major components of COPD, CT quantification enables the evaluation of pulmonary vascular alteration, ventilation-perfusion mismatches, fissure completeness, and extrapulmonary features such as altered body composition, osteoporosis, and atherosclerosis. Recent advancements, including the application of deep learning techniques, have facilitated fully automated segmentation and quantification of CT parameters, while innovations such as image standardization hold promise for enhancing clinical applicability. Numerous studies have reported associations between quantitative CT parameters and clinical or physiologic outcomes in patients with COPD. However, barriers remain to the routine implementation of these technologies in clinical practice. This review highlights recent research on COPD quantification, explores technological advances, and discusses current challenges and potential solutions for improving quantification methods.
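One standard quantitative-CT parameter from this literature, the percentage of lung voxels below -950 HU (%LAA-950) used as an emphysema index, is easy to sketch; the volume and mask below are synthetic placeholders, and real use requires lung segmentation:

```python
# Sketch of a standard quantitative-CT emphysema index: the percentage of
# low-attenuation area below -950 HU (%LAA-950) within the lung mask.
# The CT volume and mask here are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(1)
ct_hu = rng.normal(-800, 120, size=(64, 256, 256))   # toy lung CT in Hounsfield units
lung_mask = np.ones_like(ct_hu, dtype=bool)          # placeholder lung segmentation

lung_voxels = ct_hu[lung_mask]
laa950 = 100.0 * np.mean(lung_voxels < -950)         # % voxels below -950 HU
print(f"%LAA-950: {laa950:.1f}%")
```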