Latest Papers on Radiology AI. Tags: GenAI

Foundation versus Domain-Specific Models for Left Ventricular Segmentation on Cardiac Ultrasound

Chao, C.-J., Gu, Y., Kumar, W., Xiang, T., Appari, L., Wu, J., Farina, J. M., Wraith, R., Jeong, J., Arsanjani, R., Garvan, K. C., Oh, J. K., Langlotz, C. P., Banerjee, I., Li, F.-F., Adeli, E.

•preprint•May 17 2025

The Segment Anything Model (SAM) was fine-tuned on the EchoNet-Dynamic dataset and evaluated on external transthoracic echocardiography (TTE) and Point-of-Care Ultrasound (POCUS) datasets from CAMUS (University Hospital of St Etienne) and Mayo Clinic (99 patients: 58 TTE, 41 POCUS). Fine-tuned SAM was superior or comparable to MedSAM. The fine-tuned SAM also outperformed EchoNet and U-Net models, demonstrating strong generalization, especially on apical 2-chamber (A2C) images (fine-tuned SAM vs. EchoNet: CAMUS-A2C: DSC 0.891 ± 0.040 vs. 0.752 ± 0.196, p<0.0001) and POCUS (DSC 0.857 ± 0.047 vs. 0.667 ± 0.279, p<0.0001). Additionally, the SAM-enhanced workflow reduced annotation time by 50% (11.6 ± 4.5 sec vs. 5.7 ± 1.7 sec, p<0.0001) while maintaining segmentation quality. We demonstrated an effective strategy for fine-tuning a vision foundation model for enhancing clinical workflow efficiency and supporting human-AI collaboration.

Ultrasound Segmentation Cardiac Retrospective Clinical In Silico Academic Lab GenAI
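
For readers unfamiliar with the DSC figures quoted in the abstract above, the following is a minimal sketch of the Dice similarity coefficient for two binary masks. The function name `dice_coefficient` and the toy masks are illustrative assumptions and are not taken from the paper.

```python
def dice_coefficient(pred, truth):
    """Dice similarity coefficient, DSC = 2|A ∩ B| / (|A| + |B|).

    `pred` and `truth` are 2D binary masks given as nested lists
    (hypothetical toy inputs, not the paper's data format).
    """
    flat_pred = [bool(v) for row in pred for v in row]
    flat_truth = [bool(v) for row in truth for v in row]
    if len(flat_pred) != len(flat_truth):
        raise ValueError("masks must have the same number of pixels")

    # Pixels labelled foreground in both masks.
    intersection = sum(p and t for p, t in zip(flat_pred, flat_truth))
    total = sum(flat_pred) + sum(flat_truth)

    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy 2x3 masks: 3 foreground pixels each, 2 of them shared.
predicted_mask = [[0, 1, 1],
                  [0, 1, 0]]
reference_mask = [[0, 1, 0],
                  [0, 1, 1]]
print(round(dice_coefficient(predicted_mask, reference_mask), 3))
```

A DSC of 1.0 means the two masks coincide exactly; 0.0 means they share no foreground pixels.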

Evaluating the Performance of Reasoning Large Language Models on Japanese Radiology Board Examination Questions.

Nakaura T, Takamure H, Kobayashi N, Shiraishi K, Yoshida N, Nagayama Y, Uetani H, Kidoh M, Funama Y, Hirai T

•papers•May 17 2025

This study evaluates the performance, cost, and processing time of OpenAI's reasoning large language models (LLMs) (o1-preview, o1-mini) and their base models (GPT-4o, GPT-4o-mini) on Japanese radiology board examination questions. A total of 210 questions from the 2022-2023 official board examinations of the Japan Radiological Society were presented to each of the four LLMs. Performance was evaluated by calculating the percentage of correctly answered questions within six predefined radiology subspecialties. The total cost and processing time for each model were also recorded. The McNemar test was used to assess the statistical significance of differences in accuracy between paired model responses. The o1-preview achieved the highest accuracy (85.7%), significantly outperforming GPT-4o (73.3%, P<.001). Similarly, o1-mini (69.5%) performed significantly better than GPT-4o-mini (46.7%, P<.001). Across all radiology subspecialties, o1-preview consistently ranked highest. However, reasoning models incurred substantially higher costs (o1-preview: $17.10, o1-mini: $2.58) compared to their base counterparts (GPT-4o: $0.496, GPT-4o-mini: $0.04), and their processing times were approximately 3.7 and 1.2 times longer, respectively. Reasoning LLMs demonstrated markedly superior performance in answering radiology board exam questions compared to their base models, albeit at a substantially higher cost and increased processing time.

LLM Radiology Report Retrospective Clinical In Silico Academic Lab GenAI Benchmark SOTA
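
The McNemar test mentioned in the abstract above compares two models on the same questions by looking only at the questions where they disagree. The sketch below uses the exact (binomial) form of the test; the abstract does not say which variant the authors used, so that choice, the function name `mcnemar_exact`, and the toy answer vectors are all assumptions for illustration.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test on paired correct/incorrect outcomes.

    Returns (b, c, p_value), where b counts items only model A answered
    correctly and c counts items only model B answered correctly.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must answer the same items")

    b = sum(1 for a, o in zip(correct_a, correct_b) if a and not o)
    c = sum(1 for a, o in zip(correct_a, correct_b) if o and not a)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, no evidence of a difference

    # Under H0 each discordant pair favours either model with probability 0.5.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (not the study's results): 10 questions, True = answered correctly.
model_a = [True] * 8 + [False] * 2
model_b = [False] * 6 + [True] * 4
b, c, p_value = mcnemar_exact(model_a, model_b)
print(b, c, round(p_value, 4))
```

Questions that both models get right, or both get wrong, do not affect the p-value; only the discordant counts `b` and `c` do.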

Exploring interpretable echo analysis using self-supervised parcels.

Majchrowska S, Hildeman A, Mokhtari R, Diethe T, Teare P

•papers•May 17 2025

The application of AI for predicting critical heart failure endpoints using echocardiography is a promising avenue to improve patient care and treatment planning. However, fully supervised training of deep learning models in medical imaging requires a substantial amount of labelled data, posing significant challenges due to the need for skilled medical professionals to annotate image sequences. Our study addresses this limitation by exploring the potential of self-supervised learning, emphasising interpretability, robustness, and safety as crucial factors in cardiac imaging analysis. We leverage self-supervised learning on a large unlabelled dataset, facilitating the discovery of features applicable to various downstream tasks. The backbone model not only generates informative features for training smaller models using simple techniques but also produces features that are interpretable by humans. The study employs a modified Self-supervised Transformer with Energy-based Graph Optimisation (STEGO) network on top of self-DIstillation with NO labels (DINO) as a backbone model, pre-trained on diverse medical and non-medical data. This approach facilitates the generation of self-segmented outputs, termed "parcels", which identify distinct anatomical sub-regions of the heart. Our findings highlight the robustness of these self-learned parcels across diverse patient profiles and phases of the cardiac cycle. Moreover, these parcels offer high interpretability and effectively encapsulate clinically relevant cardiac substructures. We conduct a comprehensive evaluation of the proposed self-supervised approach on publicly available datasets, demonstrating its adaptability to a wide range of requirements.
Our results underscore the potential of self-supervised learning to address labelled data scarcity in medical imaging, offering a path to improve cardiac imaging analysis and enhance the efficiency and interpretability of diagnostic procedures, thus positively impacting patient care and clinical decision-making.

Ultrasound Segmentation Cardiac Methodology In Silico Big Tech GenAI Ethics

MedSG-Bench: A Benchmark for Medical Image Sequences Grounding

Jingkun Yue, Siqi Zhang, Zinan Jia, Huihuan Xu, Zongbo Han, Xiaohong Liu, Guangyu Wang

•preprint•May 17 2025

Visual grounding is essential for precise perception and reasoning in multimodal large language models (MLLMs), especially in medical imaging domains. While existing medical visual grounding benchmarks primarily focus on single-image scenarios, real-world clinical applications often involve sequential images, where accurate lesion localization across different modalities and temporal tracking of disease progression (e.g., pre- vs. post-treatment comparison) require fine-grained cross-image semantic alignment and context-aware reasoning. To remedy the underrepresentation of image sequences in existing medical visual grounding benchmarks, we propose MedSG-Bench, the first benchmark tailored for Medical Image Sequences Grounding. It comprises eight VQA-style tasks, formulated into two paradigms of the grounding tasks, including 1) Image Difference Grounding, which focuses on detecting change regions across images, and 2) Image Consistency Grounding, which emphasizes detection of consistent or shared semantics across sequential images. MedSG-Bench covers 76 public datasets, 10 medical imaging modalities, and a wide spectrum of anatomical structures and diseases, totaling 9,630 question-answer pairs. We benchmark both general-purpose MLLMs (e.g., Qwen2.5-VL) and medical-domain specialized MLLMs (e.g., HuatuoGPT-vision), observing that even the advanced models exhibit substantial limitations in medical sequential grounding tasks. To advance this field, we construct MedSG-188K, a large-scale instruction-tuning dataset tailored for sequential visual grounding, and further develop MedSeq-Grounder, an MLLM designed to facilitate future research on fine-grained understanding across medical sequential images. The benchmark, dataset, and model are available at https://huggingface.co/MedSG-Bench

Mixed Modality Detection Whole Body Dataset Release In Silico Academic Lab Open Dataset Open Code GenAI

Feasibility of improving vocal fold pathology image classification with synthetic images generated by DDPM-based GenAI: a pilot study.

Khazrak I, Zainaee S, M Rezaee M, Ghasemi M, C Green R

•papers•May 17 2025

Voice disorders (VD) are often linked to vocal fold structural pathologies (VFSP). Laryngeal imaging plays a vital role in assessing VFSPs and VD in clinical and research settings, but challenges like scarce and imbalanced datasets can limit the generalizability of findings. Denoising Diffusion Probabilistic Models (DDPMs), a subtype of Generative AI, have gained attention for their ability to generate high-quality and realistic synthetic images to address these challenges. This study explores the feasibility of improving VFSP image classification by generating synthetic images using DDPMs. 404 laryngoscopic images depicting VF without and with VFSP were included. DDPMs were used to generate synthetic images to augment the original dataset. Two convolutional neural network architectures, VGG16 and ResNet50, were applied for model training. The models were initially trained only on the original dataset. Then, they were trained on the augmented datasets. Evaluation metrics were analyzed to assess the performance of the models for both binary classification (with/without VFSPs) and multi-class classification (seven specific VFSPs). Realistic and high-quality synthetic images were generated for dataset augmentation. The models initially failed to converge when trained only on the original dataset, but they successfully converged and achieved low loss and high accuracy when trained on the augmented datasets. The best performance was achieved for both binary and multi-class classification when the models were trained on an augmented dataset. Generating realistic images of VFSP using DDPMs is feasible, can enhance the classification of VFSPs by an AI model, and may support VD screening and diagnosis.

Mixed Modality Classification Methodology In Silico Academic Lab GenAI

High-Performance Prompting for LLM Extraction of Compression Fracture Findings from Radiology Reports.

Kanani MM, Monawer A, Brown L, King WE, Miller ZD, Venugopal N, Heagerty PJ, Jarvik JG, Cohen T, Cross NM

•papers•May 16 2025

Extracting information from radiology reports can provide critical data to empower many radiology workflows. For spinal compression fractures, these data can facilitate evidence-based care for at-risk populations. Manual extraction from free-text reports is laborious and error-prone. Large language models (LLMs) have shown promise; however, fine-tuning strategies to optimize performance in specific tasks can be resource intensive. A variety of prompting strategies have achieved similar results with fewer demands. Our study pioneers the use of Meta's Llama 3.1, together with prompt-based strategies, for automated extraction of compression fractures from free-text radiology reports, outputting structured data without model training. We tested performance on a time-based sample of CT exams covering the spine from 2/20/2024 to 2/22/2024 acquired across our healthcare enterprise (637 anonymized reports, age 18-102, 47% Female). Ground truth annotations were manually generated and compared against the performance of three models (Llama 3.1 70B, Llama 3.1 8B, and Vicuna 13B) with nine different prompting configurations for a total of 27 model/prompt experiments. The highest F1 score (0.91) was achieved by the 70B Llama 3.1 model when provided with a radiologist-written background, with similar results when the background was written by a separate LLM (0.86). The addition of few-shot examples to these prompts had variable impact on F1 measurements (0.89, 0.84 respectively). Comparable ROC-AUC and PR-AUC performance was observed. Our work demonstrated that an open-weights LLM excelled at extracting compression fracture findings from free-text radiology reports using prompt-based techniques without requiring extensive manually labeled examples for model training.

CT LLM Radiology Report Musculoskeletal Retrospective Clinical In Silico Academic Lab GenAI
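
The F1 scores reported in the abstract above combine precision and recall into a single number. The following is a minimal sketch of how F1 could be computed from per-report binary labels (fracture present or absent); the function name `precision_recall_f1` and the toy labels are hypothetical and do not reproduce the study's data or evaluation code.

```python
def precision_recall_f1(truth, predicted):
    """Precision, recall and F1 for binary labels (1 = finding present)."""
    if len(truth) != len(predicted):
        raise ValueError("label lists must have the same length")

    pairs = [(bool(t), bool(p)) for t, p in zip(truth, predicted)]
    tp = sum(1 for t, p in pairs if t and p)        # true positives
    fp = sum(1 for t, p in pairs if not t and p)    # false positives
    fn = sum(1 for t, p in pairs if t and not p)    # false negatives

    # Convention assumed here: undefined ratios are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1


# Toy labels for 8 reports: manual ground truth vs. model output.
ground_truth = [1, 1, 1, 0, 0, 0, 1, 0]
model_output = [1, 1, 0, 0, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(ground_truth, model_output)
print(precision, recall, f1)
```

Because F1 ignores true negatives, it is less flattering than accuracy when the finding is rare, which is one reason it is commonly used for extraction tasks like this one.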

Escarcitys: A framework for enhancing medical image classification performance in scarcity of trainable samples scenarios.

Wang T, Dai Q, Xiong W

•papers•May 16 2025

In the field of healthcare, the acquisition and annotation of medical images present significant challenges, resulting in a scarcity of trainable samples. This data limitation hinders the performance of deep learning models, creating bottlenecks in clinical applications. To address this issue, we construct a framework (EScarcityS) aimed at enhancing the success rate of disease diagnosis in scenarios where trainable medical images are scarce. Firstly, considering that Transformer-based deep learning networks rely on a large amount of trainable data, this study takes into account the unique characteristics of pathological regions. By extracting the feature representations of all particles in medical images at different granularities, a multi-granularity Transformer network (MGVit) is designed. This network leverages additional prior knowledge to assist the Transformer network during training, thereby reducing the data requirement to some extent. Next, the importance maps of particles at different granularities, generated by MGVit, are fused to construct disease probability maps corresponding to the images. Based on these maps, a disease probability map-guided diffusion generation model is designed to generate more realistic and interpretable synthetic data. Subsequently, authentic and synthetic data are mixed and used to retrain MGVit, aiming to enhance the accuracy of medical image classification when trainable medical images are scarce. Finally, we conducted detailed experiments on four real medical image datasets to validate the effectiveness of EScarcityS and its specific modules.

Classification Methodology In Silico Academic Lab GenAI

"MR Fingerprinting for Imaging Brain Hemodynamics and Oxygenation".

Coudert T, Delphin A, Barrier A, Barbier EL, Lemasson B, Warnking JM, Christen T

•papers•May 15 2025

Over the past decade, several studies have explored the potential of magnetic resonance fingerprinting (MRF) for the quantification of brain hemodynamics, oxygenation, and perfusion. Recent advances in simulation models and reconstruction frameworks have also significantly enhanced the accuracy of vascular parameter estimation. This review provides an overview of key vascular MRF studies, emphasizing advancements in geometrical models for vascular simulations, novel sequences, and state-of-the-art reconstruction techniques incorporating machine learning and deep learning algorithms. Both pre-clinical and clinical applications are discussed. Based on these findings, we outline future directions and development areas that need to be addressed to facilitate their clinical translation. EVIDENCE LEVEL: N/A. TECHNICAL EFFICACY: Stage 1.

MRI Reconstruction Neurological Review Concept Academic Lab GenAI

Predicting Immunotherapy Response in Unresectable Hepatocellular Carcinoma: A Comparative Study of Large Language Models and Human Experts.

Xu J, Wang J, Li J, Zhu Z, Fu X, Cai W, Song R, Wang T, Li H

•papers•May 15 2025

Hepatocellular carcinoma (HCC) is an aggressive cancer with limited biomarkers for predicting immunotherapy response. Recent advancements in large language models (LLMs) like GPT-4, GPT-4o, and Gemini offer the potential for enhancing clinical decision-making through multimodal data analysis. However, their effectiveness in predicting immunotherapy response, especially compared to human experts, remains unclear. This study assessed the performance of GPT-4, GPT-4o, and Gemini in predicting immunotherapy response in unresectable HCC, compared to radiologists and oncologists of varying expertise. A retrospective analysis of 186 patients with unresectable HCC utilized multimodal data (clinical and CT images). LLMs were evaluated with zero-shot prompting and two strategies: the 'voting method' and the 'OR rule method' for improved sensitivity. Performance metrics included accuracy, sensitivity, area under the curve (AUC), and agreement across LLMs and physicians. GPT-4o, using the 'OR rule method,' achieved 65% accuracy and 47% sensitivity, comparable to intermediate physicians but lower than senior physicians (accuracy: 72%, p = 0.045; sensitivity: 70%, p < 0.0001). Gemini-GPT, combining GPT-4, GPT-4o, and Gemini, achieved an AUC of 0.69, similar to senior physicians (AUC: 0.72, p = 0.35), with 68% accuracy, outperforming junior and intermediate physicians while remaining comparable to senior physicians (p = 0.78). However, its sensitivity (58%) was lower than senior physicians (p = 0.0097). LLMs demonstrated higher inter-model agreement (κ = 0.59-0.70) than inter-physician agreement, especially among junior physicians (κ = 0.15). This study highlights the potential of LLMs, particularly Gemini-GPT, as valuable tools in predicting immunotherapy response for HCC.

CT Classification Abdominal Retrospective Clinical In Silico Academic Lab GenAI
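
The abstract above names a 'voting method' and an 'OR rule method' without defining them. The sketch below assumes the common reading: voting means a strict majority of the individual predictions is positive, and the OR rule means any single positive prediction is enough. Those definitions, the function names, and the toy data are assumptions for illustration, not the authors' procedure.

```python
def majority_vote(predictions):
    """Positive only if more than half of the individual predictions are positive."""
    return sum(bool(p) for p in predictions) * 2 > len(predictions)


def or_rule(predictions):
    """Positive if at least one individual prediction is positive."""
    return any(bool(p) for p in predictions)


def sensitivity(truth, combined):
    """Fraction of truly positive cases that the combined call marks positive."""
    positives = [c for t, c in zip(truth, combined) if t]
    return sum(positives) / len(positives) if positives else 0.0


# Toy data: three binary predictions per patient (1 = predicted responder).
per_patient_predictions = [(1, 0, 0), (0, 0, 0), (1, 1, 0), (0, 1, 1), (0, 1, 0)]
true_response = [1, 0, 1, 1, 0]

voted = [majority_vote(p) for p in per_patient_predictions]
ored = [or_rule(p) for p in per_patient_predictions]
print(sensitivity(true_response, voted), sensitivity(true_response, ored))
```

In this toy example the OR rule recovers a responder that the majority vote misses, but it also flags the last non-responder as positive, which illustrates why an OR-style rule would trade specificity for sensitivity.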

Bridging Innovation and Practice: Critical Perspectives on IR-GPT's Role in Interventional Radiology.

Gunes YC, Cesur T, Çamur E

•papers•May 15 2025

Mixed Modality LLM Radiology Report Vascular Review Concept Academic Lab GenAI
