
SMFusion: Semantic-Preserving Fusion of Multimodal Medical Images for Enhanced Clinical Diagnosis

Haozhe Xiang, Han Zhang, Yu Cheng, Xiongwen Quan, Wanwan Huang

arXiv preprint · May 18, 2025
Multimodal medical image fusion plays a crucial role in medical diagnosis by integrating complementary information from different modalities to enhance image readability and clinical applicability. However, existing methods mainly follow computer vision standards for feature extraction and fusion strategy formulation, overlooking the rich semantic information inherent in medical images. To address this limitation, we propose a novel semantic-guided medical image fusion approach that, for the first time, incorporates medical prior knowledge into the fusion process. Specifically, we construct a publicly available multimodal medical image-text dataset, upon which text descriptions generated by BiomedGPT are encoded and semantically aligned with image features in a high-dimensional space via a semantic interaction alignment module. During this process, a cross attention based linear transformation automatically maps the relationship between textual and visual features to facilitate comprehensive learning. The aligned features are then embedded into a text-injection module for further feature-level fusion. Unlike traditional methods, we further generate diagnostic reports from the fused images to assess the preservation of medical information. Additionally, we design a medical semantic loss function to enhance the retention of textual cues from the source images. Experimental results on test datasets demonstrate that the proposed method achieves superior performance in both qualitative and quantitative evaluations while preserving more critical medical information.
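The cross-attention alignment step described above can be sketched in a few lines of PyTorch; this is an illustrative reconstruction, not the authors' SMFusion code, and all module names and dimensions are assumptions.

```python
# Minimal sketch: image features attend to text embeddings in a shared space.
# Module names, dimensions, and the residual connection are illustrative only.
import torch
import torch.nn as nn

class SemanticAlignment(nn.Module):
    def __init__(self, img_dim=256, txt_dim=768, dim=256, heads=8):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, dim)    # project image features into the shared space
        self.txt_proj = nn.Linear(txt_dim, dim)    # project text-encoder token embeddings
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, N_img, img_dim) flattened spatial features
        # txt_feats: (B, N_txt, txt_dim) token embeddings from a text encoder
        q = self.img_proj(img_feats)
        kv = self.txt_proj(txt_feats)
        aligned, _ = self.attn(q, kv, kv)          # image queries attend to text keys/values
        return aligned + q                         # residual: text-conditioned image features

out = SemanticAlignment()(torch.randn(2, 64, 256), torch.randn(2, 32, 768))
print(out.shape)  # torch.Size([2, 64, 256])
```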

Feasibility of improving vocal fold pathology image classification with synthetic images generated by DDPM-based GenAI: a pilot study.

Khazrak I, Zainaee S, M Rezaee M, Ghasemi M, C Green R

PubMed paper · May 17, 2025
Voice disorders (VD) are often linked to vocal fold structural pathologies (VFSP). Laryngeal imaging plays a vital role in assessing VFSPs and VD in clinical and research settings, but challenges like scarce and imbalanced datasets can limit the generalizability of findings. Denoising Diffusion Probabilistic Models (DDPMs), a subtype of Generative AI, have gained attention for their ability to generate high-quality, realistic synthetic images that address these challenges. This study explores the feasibility of improving VFSP image classification by generating synthetic images using DDPMs. 404 laryngoscopic images depicting vocal folds with and without VFSPs were included. DDPMs were used to generate synthetic images to augment the original dataset. Two convolutional neural network architectures, VGG16 and ResNet50, were applied for model training. The models were first trained only on the original dataset and then on the augmented datasets. Evaluation metrics were analyzed to assess the performance of the models for both binary classification (with/without VFSPs) and multi-class classification (seven specific VFSPs). Realistic and high-quality synthetic images were generated for dataset augmentation. The models failed to converge when trained only on the original dataset but converged successfully, with low loss and high accuracy, when trained on the augmented datasets. The best performance for both binary and multi-class classification was achieved when the models were trained on an augmented dataset. Generating realistic VFSP images with DDPMs is therefore feasible; it can enhance AI-based classification of VFSPs and may support VD screening and diagnosis.
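A minimal sketch of the augmentation-and-retraining loop the study describes, assuming the DDPM-generated images have already been written to disk; folder paths, class count, and hyperparameters are placeholders rather than the study's settings.

```python
# Pool real and synthetic images, then fine-tune a standard classifier.
# Paths and the 7-class head are hypothetical placeholders.
import torch
from torch.utils.data import ConcatDataset, DataLoader
from torchvision import datasets, models, transforms

tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
real = datasets.ImageFolder("data/original", transform=tfm)          # hypothetical folder layout
synthetic = datasets.ImageFolder("data/ddpm_synthetic", transform=tfm)
loader = DataLoader(ConcatDataset([real, synthetic]), batch_size=32, shuffle=True)

model = models.resnet50(weights="IMAGENET1K_V2")
model.fc = torch.nn.Linear(model.fc.in_features, 7)                  # e.g. seven VFSP classes
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(5):                                               # toy epoch count
    for x, y in loader:
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
```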

MedSG-Bench: A Benchmark for Medical Image Sequences Grounding

Jingkun Yue, Siqi Zhang, Zinan Jia, Huihuan Xu, Zongbo Han, Xiaohong Liu, Guangyu Wang

arXiv preprint · May 17, 2025
Visual grounding is essential for precise perception and reasoning in multimodal large language models (MLLMs), especially in medical imaging domains. While existing medical visual grounding benchmarks primarily focus on single-image scenarios, real-world clinical applications often involve sequential images, where accurate lesion localization across different modalities and temporal tracking of disease progression (e.g., pre- vs. post-treatment comparison) require fine-grained cross-image semantic alignment and context-aware reasoning. To remedy the underrepresentation of image sequences in existing medical visual grounding benchmarks, we propose MedSG-Bench, the first benchmark tailored for Medical Image Sequences Grounding. It comprises eight VQA-style tasks, formulated into two paradigms of the grounding tasks, including 1) Image Difference Grounding, which focuses on detecting change regions across images, and 2) Image Consistency Grounding, which emphasizes detection of consistent or shared semantics across sequential images. MedSG-Bench covers 76 public datasets, 10 medical imaging modalities, and a wide spectrum of anatomical structures and diseases, totaling 9,630 question-answer pairs. We benchmark both general-purpose MLLMs (e.g., Qwen2.5-VL) and medical-domain specialized MLLMs (e.g., HuatuoGPT-vision), observing that even the advanced models exhibit substantial limitations in medical sequential grounding tasks. To advance this field, we construct MedSG-188K, a large-scale instruction-tuning dataset tailored for sequential visual grounding, and further develop MedSeq-Grounder, an MLLM designed to facilitate future research on fine-grained understanding across medical sequential images. The benchmark, dataset, and model are available at https://huggingface.co/MedSG-Bench

Foundation versus Domain-Specific Models for Left Ventricular Segmentation on Cardiac Ultrasound

Chao, C.-J., Gu, Y., Kumar, W., Xiang, T., Appari, L., Wu, J., Farina, J. M., Wraith, R., Jeong, J., Arsanjani, R., Garvan, K. C., Oh, J. K., Langlotz, C. P., Banerjee, I., Li, F.-F., Adeli, E.

medRxiv preprint · May 17, 2025
The Segment Anything Model (SAM) was fine-tuned on the EchoNet-Dynamic dataset and evaluated on external transthoracic echocardiography (TTE) and Point-of-Care Ultrasound (POCUS) datasets from CAMUS (University Hospital of St Etienne) and Mayo Clinic (99 patients: 58 TTE, 41 POCUS). Fine-tuned SAM was superior or comparable to MedSAM. The fine-tuned SAM also outperformed EchoNet and U-Net models, demonstrating strong generalization, especially on apical 2-chamber (A2C) images (fine-tuned SAM vs. EchoNet: CAMUS-A2C: DSC 0.891 ± 0.040 vs. 0.752 ± 0.196, p<0.0001) and POCUS (DSC 0.857 ± 0.047 vs. 0.667 ± 0.279, p<0.0001). Additionally, the SAM-enhanced workflow reduced annotation time by 50% (11.6 ± 4.5 sec vs. 5.7 ± 1.7 sec, p<0.0001) while maintaining segmentation quality. We demonstrated an effective strategy for fine-tuning a vision foundation model to enhance clinical workflow efficiency and support human-AI collaboration.
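For reference, the Dice similarity coefficient (DSC) reported above is the standard overlap measure between a predicted and a ground-truth mask; a minimal NumPy version:

```python
# Dice similarity coefficient between two binary segmentation masks.
import numpy as np

def dice(pred: np.ndarray, truth: np.ndarray, eps: float = 1e-8) -> float:
    """2 * |pred ∩ truth| / (|pred| + |truth|)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    inter = np.logical_and(pred, truth).sum()
    return 2.0 * inter / (pred.sum() + truth.sum() + eps)

a = np.zeros((64, 64), dtype=np.uint8); a[16:48, 16:48] = 1
b = np.zeros((64, 64), dtype=np.uint8); b[20:52, 20:52] = 1
print(round(dice(a, b), 3))  # overlap of two shifted squares
```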

Evaluating the Performance of Reasoning Large Language Models on Japanese Radiology Board Examination Questions.

Nakaura T, Takamure H, Kobayashi N, Shiraishi K, Yoshida N, Nagayama Y, Uetani H, Kidoh M, Funama Y, Hirai T

PubMed paper · May 17, 2025
This study evaluates the performance, cost, and processing time of OpenAI's reasoning large language models (LLMs) (o1-preview, o1-mini) and their base models (GPT-4o, GPT-4o-mini) on Japanese radiology board examination questions. A total of 210 questions from the 2022-2023 official board examinations of the Japan Radiological Society were presented to each of the four LLMs. Performance was evaluated by calculating the percentage of correctly answered questions within six predefined radiology subspecialties. The total cost and processing time for each model were also recorded. The McNemar test was used to assess the statistical significance of differences in accuracy between paired model responses. The o1-preview achieved the highest accuracy (85.7%), significantly outperforming GPT-4o (73.3%, P<.001). Similarly, o1-mini (69.5%) performed significantly better than GPT-4o-mini (46.7%, P<.001). Across all radiology subspecialties, o1-preview consistently ranked highest. However, reasoning models incurred substantially higher costs (o1-preview: $17.10, o1-mini: $2.58) compared to their base counterparts (GPT-4o: $0.496, GPT-4o-mini: $0.04), and their processing times were approximately 3.7 and 1.2 times longer, respectively. Reasoning LLMs demonstrated markedly superior performance in answering radiology board exam questions compared to their base models, albeit at a substantially higher cost and increased processing time.
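The McNemar test used for the pairwise accuracy comparisons can be reproduced with statsmodels; the 2x2 contingency table below is invented for illustration, not the study's data.

```python
# McNemar test on paired correct/incorrect answers from two models.
from statsmodels.stats.contingency_tables import mcnemar

# rows: model A correct / incorrect; cols: model B correct / incorrect (made-up counts)
table = [[150, 30],
         [ 10, 20]]
result = mcnemar(table, exact=True)   # exact binomial test on the discordant pairs
print(result.statistic, result.pvalue)
```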

Exploring interpretable echo analysis using self-supervised parcels.

Majchrowska S, Hildeman A, Mokhtari R, Diethe T, Teare P

PubMed paper · May 17, 2025
The application of AI for predicting critical heart failure endpoints using echocardiography is a promising avenue to improve patient care and treatment planning. However, fully supervised training of deep learning models in medical imaging requires a substantial amount of labelled data, posing significant challenges due to the need for skilled medical professionals to annotate image sequences. Our study addresses this limitation by exploring the potential of self-supervised learning, emphasising interpretability, robustness, and safety as crucial factors in cardiac imaging analysis. We leverage self-supervised learning on a large unlabelled dataset, facilitating the discovery of features applicable to various downstream tasks. The backbone model not only generates informative features for training smaller models using simple techniques but also produces features that are interpretable by humans. The study employs a modified Self-supervised Transformer with Energy-based Graph Optimisation (STEGO) network on top of self-DIstillation with NO labels (DINO) as a backbone model, pre-trained on diverse medical and non-medical data. This approach facilitates the generation of self-segmented outputs, termed "parcels", which identify distinct anatomical sub-regions of the heart. Our findings highlight the robustness of these self-learned parcels across diverse patient profiles and phases of the cardiac cycle. Moreover, these parcels offer high interpretability and effectively encapsulate clinically relevant cardiac substructures. We conduct a comprehensive evaluation of the proposed self-supervised approach on publicly available datasets, demonstrating its adaptability to a wide range of requirements. Our results underscore the potential of self-supervised learning to address labelled data scarcity in medical imaging, offering a path to improve cardiac imaging analysis and enhance the efficiency and interpretability of diagnostic procedures, thus positively impacting patient care and clinical decision-making.
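As a rough illustration of the "parcel" idea (not the authors' DINO/STEGO pipeline), self-supervised patch embeddings of an echo frame can be clustered into a handful of sub-regions; the features below are random stand-ins for real DINO outputs.

```python
# Cluster per-patch embeddings into self-discovered sub-regions ("parcels").
import numpy as np
from sklearn.cluster import KMeans

h, w, d = 14, 14, 384                      # patch grid and embedding size (ViT-S-like assumption)
patch_feats = np.random.randn(h * w, d)    # stand-in for self-supervised patch features

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(patch_feats)
parcel_map = labels.reshape(h, w)          # each cell is assigned to one parcel
print(parcel_map)
```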

High-Performance Prompting for LLM Extraction of Compression Fracture Findings from Radiology Reports.

Kanani MM, Monawer A, Brown L, King WE, Miller ZD, Venugopal N, Heagerty PJ, Jarvik JG, Cohen T, Cross NM

PubMed paper · May 16, 2025
Extracting information from radiology reports can provide critical data to empower many radiology workflows. For spinal compression fractures, these data can facilitate evidence-based care for at-risk populations. Manual extraction from free-text reports is laborious and error-prone. Large language models (LLMs) have shown promise; however, fine-tuning strategies to optimize performance on specific tasks can be resource-intensive. A variety of prompting strategies have achieved similar results with fewer demands. Our study pioneers the use of Meta's Llama 3.1, together with prompt-based strategies, for automated extraction of compression fracture findings from free-text radiology reports, outputting structured data without model training. We tested performance on a time-based sample of CT exams covering the spine from 2/20/2024 to 2/22/2024 acquired across our healthcare enterprise (637 anonymized reports, age 18-102, 47% female). Ground truth annotations were manually generated and compared against the performance of three models (Llama 3.1 70B, Llama 3.1 8B, and Vicuna 13B) with nine different prompting configurations, for a total of 27 model/prompt experiments. The highest F1 score (0.91) was achieved by the 70B Llama 3.1 model when provided with a radiologist-written background, with similar results when the background was written by a separate LLM (0.86). The addition of few-shot examples to these prompts had a variable impact on F1 (0.89 and 0.84, respectively). Comparable ROC-AUC and PR-AUC performance was observed. Our work demonstrated that an open-weights LLM excelled at extracting compression fracture findings from free-text radiology reports using prompt-based techniques, without requiring extensive manually labeled examples for model training.
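A schematic of the prompting setup, assuming a background section plus optional few-shot examples prepended to each report and a yes/no answer scored against manual labels; `call_llm` is a placeholder rather than a real API, and the prompt text is illustrative.

```python
# Build a background + few-shot prompt per report and score predictions with F1.
from sklearn.metrics import f1_score

BACKGROUND = "You are a radiologist. State whether a vertebral compression fracture is present."
FEW_SHOT = [("...example report text...", "yes"), ("...example report text...", "no")]

def build_prompt(report: str, use_few_shot: bool = True) -> str:
    parts = [BACKGROUND]
    if use_few_shot:
        parts += [f"Report: {r}\nAnswer: {a}" for r, a in FEW_SHOT]
    parts.append(f"Report: {report}\nAnswer:")
    return "\n\n".join(parts)

def call_llm(prompt: str) -> str:          # placeholder: swap in the actual inference call
    return "yes"

reports, labels = ["report text ..."], [1]
preds = [1 if call_llm(build_prompt(r)).strip().lower().startswith("yes") else 0 for r in reports]
print(f1_score(labels, preds, zero_division=0))
```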

EScarcityS: A framework for enhancing medical image classification performance in scenarios with scarce trainable samples.

Wang T, Dai Q, Xiong W

PubMed paper · May 16, 2025
In the field of healthcare, the acquisition and annotation of medical images present significant challenges, resulting in a scarcity of trainable samples. This data limitation hinders the performance of deep learning models, creating bottlenecks in clinical applications. To address this issue, we construct a framework (EScarcityS) aimed at enhancing the success rate of disease diagnosis in scenarios with scarce trainable medical images. Firstly, considering that Transformer-based deep learning networks rely on a large amount of trainable data, this study takes into account the unique characteristics of pathological regions. By extracting the feature representations of all particles in medical images at different granularities, a multi-granularity Transformer network (MGVit) is designed. This network leverages additional prior knowledge to assist the Transformer network during training, thereby reducing the data requirement to some extent. Next, the importance maps of particles at different granularities, generated by MGVit, are fused to construct disease probability maps corresponding to the images. Based on these maps, a disease-probability-map-guided diffusion generation model is designed to generate more realistic and interpretable synthetic data. Subsequently, authentic and synthetic data are mixed and used to retrain MGVit, aiming to enhance the accuracy of medical image classification in scenarios with scarce trainable medical images. Finally, we conducted detailed experiments on four real medical image datasets to validate the effectiveness of EScarcityS and its specific modules.
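One step of the pipeline, fusing importance maps produced at different granularities into a single disease probability map, might look roughly like the following; the averaging-and-normalisation rule is an assumption, not the paper's exact fusion scheme.

```python
# Upsample coarse importance maps to a common size, average, and normalise to [0, 1].
import numpy as np

def fuse_importance_maps(maps, out_size=224):
    fused = np.zeros((out_size, out_size))
    for m in maps:
        scale = out_size // m.shape[0]
        fused += np.kron(m, np.ones((scale, scale)))   # nearest-neighbour upsampling
    fused /= len(maps)
    return (fused - fused.min()) / (fused.max() - fused.min() + 1e-8)

maps = [np.random.rand(7, 7), np.random.rand(14, 14), np.random.rand(28, 28)]  # toy granularities
prob_map = fuse_importance_maps(maps)
print(prob_map.shape)  # (224, 224)
```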

Predicting Immunotherapy Response in Unresectable Hepatocellular Carcinoma: A Comparative Study of Large Language Models and Human Experts.

Xu J, Wang J, Li J, Zhu Z, Fu X, Cai W, Song R, Wang T, Li H

PubMed paper · May 15, 2025
Hepatocellular carcinoma (HCC) is an aggressive cancer with limited biomarkers for predicting immunotherapy response. Recent advancements in large language models (LLMs) like GPT-4, GPT-4o, and Gemini offer the potential for enhancing clinical decision-making through multimodal data analysis. However, their effectiveness in predicting immunotherapy response, especially compared to human experts, remains unclear. This study assessed the performance of GPT-4, GPT-4o, and Gemini in predicting immunotherapy response in unresectable HCC, compared to radiologists and oncologists of varying expertise. A retrospective analysis of 186 patients with unresectable HCC utilized multimodal data (clinical and CT images). LLMs were evaluated with zero-shot prompting and two strategies: the 'voting method' and the 'OR rule method' for improved sensitivity. Performance metrics included accuracy, sensitivity, area under the curve (AUC), and agreement across LLMs and physicians. GPT-4o, using the 'OR rule method,' achieved 65% accuracy and 47% sensitivity, comparable to intermediate physicians but lower than senior physicians (accuracy: 72%, p = 0.045; sensitivity: 70%, p < 0.0001). Gemini-GPT, combining GPT-4, GPT-4o, and Gemini, achieved an AUC of 0.69, similar to senior physicians (AUC: 0.72, p = 0.35), with 68% accuracy, outperforming junior and intermediate physicians while remaining comparable to senior physicians (p = 0.78). However, its sensitivity (58%) was lower than senior physicians (p = 0.0097). LLMs demonstrated higher inter-model agreement (κ = 0.59-0.70) than inter-physician agreement, especially among junior physicians (κ = 0.15). This study highlights the potential of LLMs, particularly Gemini-GPT, as valuable tools in predicting immunotherapy response for HCC.
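The two aggregation strategies named above ('voting method' and 'OR rule method') reduce to a few lines; exact thresholds and tie-breaking in the study may differ.

```python
# Combine per-model responder predictions (1 = predicted responder).
def vote(preds):           # 'voting method': responder if a majority of models agree
    return int(sum(preds) > len(preds) / 2)

def or_rule(preds):        # 'OR rule method': responder if any model says so (boosts sensitivity)
    return int(any(preds))

model_preds = [0, 1, 0]    # e.g. GPT-4, GPT-4o, Gemini on one patient
print(vote(model_preds), or_rule(model_preds))   # 0 1
```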

Exploring the Potential of Retrieval Augmented Generation for Question Answering in Radiology: Initial Findings and Future Directions.

Mou Y, Siepmann RM, Truhnn D, Sowe S, Decker S

PubMed paper · May 15, 2025
This study explores the application of Retrieval-Augmented Generation (RAG) for question answering in radiology, an area where intelligent systems can significantly impact clinical decision-making. A preliminary experiment tested a naive RAG setup on nine radiology-specific questions with a textbook as the reference source, showing moderate improvements over baseline methods. The paper discusses lessons learned and potential enhancements for RAG in handling radiology knowledge, suggesting pathways for future research in integrating intelligent health systems into medical practice.
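A naive RAG setup of the kind described can be assembled from off-the-shelf parts: chunk the reference textbook, embed the chunks, retrieve the closest ones for each question, and prepend them to the prompt. The embedding model choice and the `generate` stub below are assumptions, not the study's configuration.

```python
# Naive RAG: embed pre-split textbook chunks, retrieve top-k by cosine similarity, prompt an LLM.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunks = ["textbook passage 1 ...", "textbook passage 2 ..."]    # pre-split reference text
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(question: str, k: int = 2):
    q = embedder.encode([question], normalize_embeddings=True)[0]
    idx = np.argsort(chunk_vecs @ q)[::-1][:k]                   # highest cosine similarity first
    return [chunks[i] for i in idx]

def generate(prompt: str) -> str:                                # placeholder for the answering LLM
    return "..."

question = "What is the most common cause of ...?"
prompt = "Context:\n" + "\n".join(retrieve(question)) + f"\n\nQuestion: {question}\nAnswer:"
print(generate(prompt))
```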