TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models

Junyi Zhang, Jia-Chen Gu, Wenbo Hu, Yu Zhou, Robinson Piramuthu, Nanyun Peng

arXiv preprint, Sep 29, 2025
Existing medical reasoning benchmarks for vision-language models primarily focus on analyzing a patient's condition based on an image from a single visit. However, this setting deviates significantly from real-world clinical practice, where doctors typically refer to a patient's history to provide a comprehensive assessment by tracking changes over time. In this paper, we introduce TemMed-Bench, the first benchmark designed for analyzing changes in patients' conditions between clinical visits, which challenges large vision-language models (LVLMs) to reason over temporal medical images. TemMed-Bench consists of a test set comprising three tasks - visual question answering (VQA), report generation, and image-pair selection - and a supplementary knowledge corpus of over 17,000 instances. With TemMed-Bench, we evaluate six proprietary and six open-source LVLMs. Our results show that most LVLMs lack the ability to analyze condition changes across temporal medical images, and a large proportion perform only at a random-guessing level in the closed-book setting. In contrast, GPT o3, o4-mini, and Claude 3.5 Sonnet demonstrate comparatively decent performance, though they have yet to reach the desired level. Furthermore, we explore augmenting the input with both retrieved visual and textual modalities in the medical domain. Multi-modal retrieval augmentation yields notably higher performance than either no retrieval or textual retrieval alone across most models on our benchmark, with the VQA task showing an average improvement of 2.59%. Overall, we compose a benchmark grounded in real-world clinical practice; it reveals LVLMs' limitations in temporal medical image reasoning and highlights multi-modal retrieval augmentation as a promising direction for addressing this challenge.
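
A minimal sketch of the multi-modal retrieval-augmentation idea described above, assuming precomputed image embeddings for the query and the knowledge corpus; the function names and prompt format are illustrative, not the authors' implementation:

```python
import numpy as np

def retrieve_multimodal_context(query_img_emb, corpus_img_embs, corpus_reports, k=3):
    """Return the top-k corpus entries most similar to the query image embedding.

    query_img_emb:   (d,) embedding of the current visit's image
    corpus_img_embs: (N, d) embeddings of the knowledge-corpus images
    corpus_reports:  list of N report strings paired with those images
    """
    # Cosine similarity between the query and every corpus image
    q = query_img_emb / np.linalg.norm(query_img_emb)
    c = corpus_img_embs / np.linalg.norm(corpus_img_embs, axis=1, keepdims=True)
    sims = c @ q
    top = np.argsort(-sims)[:k]
    return [(corpus_reports[i], float(sims[i])) for i in top]

def build_augmented_prompt(question, retrieved):
    """Prepend retrieved reports to the VQA question (the textual half of the augmentation)."""
    context = "\n".join(f"[Similar case, sim={s:.2f}] {r}" for r, s in retrieved)
    return f"Reference cases:\n{context}\n\nQuestion about the current image pair:\n{question}"
```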

InfiMed-Foundation: Pioneering Advanced Multimodal Medical Models with Compute-Efficient Pre-Training and Multi-Stage Fine-Tuning

Guanghao Zhu, Zhitian Hou, Zeyu Liu, Zhijie Sang, Congkai Xie, Hongxia Yang

arXiv preprint, Sep 26, 2025
Multimodal large language models (MLLMs) have shown remarkable potential in various domains, yet their application in the medical field is hindered by several challenges. General-purpose MLLMs often lack the specialized knowledge required for medical tasks, leading to uncertain or hallucinatory responses. Knowledge distillation from advanced models struggles to capture domain-specific expertise in radiology and pharmacology. Additionally, the computational cost of continual pretraining with large-scale medical data poses significant efficiency challenges. To address these issues, we propose InfiMed-Foundation-1.7B and InfiMed-Foundation-4B, two medical-specific MLLMs designed to deliver state-of-the-art performance in medical applications. We combine high-quality general-purpose and medical multimodal data and propose a novel five-dimensional quality assessment framework to curate high-quality multimodal medical datasets. We employ low-to-high image resolution training and multimodal sequence packing to enhance training efficiency, enabling the integration of extensive medical data. Furthermore, a three-stage supervised fine-tuning process ensures effective knowledge extraction for complex medical tasks. Evaluated on the MedEvalKit framework, InfiMed-Foundation-1.7B outperforms Qwen2.5VL-3B, while InfiMed-Foundation-4B surpasses HuatuoGPT-V-7B and MedGemma-27B-IT, demonstrating superior performance in medical visual question answering and diagnostic tasks. By addressing key challenges in data quality, training efficiency, and domain-specific knowledge extraction, our work paves the way for more reliable and effective AI-driven solutions in healthcare. The InfiMed-Foundation-4B model is available at \href{https://huggingface.co/InfiX-ai/InfiMed-Foundation-4B}{InfiMed-Foundation-4B}.
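
A minimal sketch of multimodal sequence packing as a training-efficiency technique, assuming a simple greedy first-fit policy; the function and parameters are hypothetical and not taken from the paper:

```python
def pack_sequences(sample_lengths, max_len=4096):
    """Greedy first-fit-decreasing packing: group samples so each packed sequence stays under max_len.

    sample_lengths: list of (sample_id, token_count) tuples.
    Returns a list of packs, each a list of sample_ids.
    Illustrative only; real packers also build attention masks that
    prevent cross-sample attention within a pack.
    """
    packs, remaining = [], []  # remaining[i] = free tokens left in packs[i]
    for sid, n in sorted(sample_lengths, key=lambda x: -x[1]):
        for i, free in enumerate(remaining):
            if n <= free:
                packs[i].append(sid)
                remaining[i] -= n
                break
        else:
            packs.append([sid])
            remaining.append(max_len - n)
    return packs

# Example: four samples packed into two 4096-token sequences instead of four padded ones
print(pack_sequences([("a", 3000), ("b", 900), ("c", 1200), ("d", 2500)]))
```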

Integrating CT image reconstruction, segmentation, and large language models for enhanced diagnostic insight.

Abbasi AA, Farooqi AH

PubMed paper, Sep 25, 2025
Deep learning has significantly advanced medical imaging, particularly computed tomography (CT), which is vital for diagnosing heart and cancer patients, evaluating treatments, and tracking disease progression. High-quality CT images enhance clinical decision-making, making image reconstruction a key research focus. This study develops a framework to improve CT image quality while minimizing reconstruction time. The proposed four-step medical image analysis framework includes reconstruction, preprocessing, segmentation, and image description. Initially, raw projection data undergoes reconstruction via the Radon transform to generate a sinogram, which is then used to construct a CT image of the pelvis. A convolutional neural network (CNN) ensures high-quality reconstruction. A bilateral filter reduces noise while preserving critical anatomical features. If required, a medical expert can review the image. The K-means clustering algorithm segments the preprocessed image, isolating the pelvis and removing irrelevant structures. Finally, the FuseCap model generates an automated textual description to assist radiologists. The framework's effectiveness is evaluated using peak signal-to-noise ratio (PSNR), normalized mean square error (NMSE), and structural similarity index measure (SSIM). The achieved values (PSNR 30.784, NMSE 0.032, SSIM 0.877) demonstrate superior performance compared to existing methods. The proposed framework reconstructs high-quality CT images from raw projection data, integrating segmentation and automated descriptions to provide a decision-support tool for medical experts. By enhancing image clarity, segmenting outputs, and providing descriptive insights, this research aims to reduce the workload of frontline medical professionals and improve diagnostic efficiency.
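
A rough sketch of the four-step pipeline's scaffolding using common open-source libraries (scikit-image, OpenCV, scikit-learn); the paper's CNN reconstruction and FuseCap captioning stages are replaced by placeholders since their details are not given here:

```python
import numpy as np
import cv2
from skimage.transform import radon, iradon
from skimage.metrics import peak_signal_noise_ratio, structural_similarity
from sklearn.cluster import KMeans

def reconstruct(sinogram, theta):
    # Placeholder for the paper's CNN reconstruction: filtered back-projection baseline
    return iradon(sinogram, theta=theta, filter_name="ramp")

def denoise(img):
    # Edge-preserving bilateral filter (OpenCV expects 8-bit or 32-bit float input)
    img8 = cv2.normalize(img, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return cv2.bilateralFilter(img8, d=9, sigmaColor=75, sigmaSpace=75)

def segment(img, k=3):
    # K-means on pixel intensities to isolate the region of interest
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(img.reshape(-1, 1))
    return labels.reshape(img.shape)

def evaluate(reference, reconstructed):
    data_range = reference.max() - reference.min()
    nmse = np.mean((reference - reconstructed) ** 2) / np.mean(reference ** 2)
    return {
        "PSNR": peak_signal_noise_ratio(reference, reconstructed, data_range=data_range),
        "NMSE": nmse,
        "SSIM": structural_similarity(reference, reconstructed, data_range=data_range),
    }

# Demo on a synthetic phantom: forward-project, reconstruct, score, then denoise and segment
phantom = np.zeros((128, 128)); phantom[40:90, 30:100] = 1.0
theta = np.linspace(0.0, 180.0, 128, endpoint=False)
sino = radon(phantom, theta=theta)
recon = reconstruct(sino, theta)
print(evaluate(phantom, recon))
seg = segment(denoise(recon))
```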

Scan-do Attitude: Towards Autonomous CT Protocol Management using a Large Language Model Agent

Xingjian Kang, Linda Vorberg, Andreas Maier, Alexander Katzmann, Oliver Taubmann

arXiv preprint, Sep 24, 2025
Managing scan protocols in Computed Tomography (CT), which includes adjusting acquisition parameters, configuring reconstructions, and selecting postprocessing tools in a patient-specific manner, is time-consuming and requires clinical as well as technical expertise. At the same time, we observe an increasing shortage of skilled workforce in radiology. To address this issue, a Large Language Model (LLM)-based agent framework is proposed to assist with the interpretation and execution of protocol configuration requests given in natural language or a structured, device-independent format, aiming to improve workflow efficiency and reduce technologists' workload. The agent combines in-context learning, instruction following, and structured tool-calling abilities to identify relevant protocol elements and apply accurate modifications. In a systematic evaluation, experimental results indicate that the agent can effectively retrieve protocol components, generate device-compatible protocol definition files, and faithfully implement user requests. Despite demonstrating feasibility in principle, the approach faces limitations regarding syntactic and semantic validity due to the lack of a unified device API, and challenges with ambiguous or complex requests. In summary, the findings show a clear path towards LLM-based agents for supporting scan protocol management in CT imaging.
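
A minimal sketch of the structured tool-calling pattern, assuming hypothetical tool names and protocol fields; this is not the paper's agent nor any vendor's device API:

```python
import json

# Hypothetical tool registry the agent can call; names and parameters are illustrative only.
def set_kvp(protocol, value):
    protocol["kVp"] = value
    return protocol

def set_reconstruction_kernel(protocol, kernel):
    protocol["recon_kernel"] = kernel
    return protocol

TOOLS = {"set_kvp": set_kvp, "set_reconstruction_kernel": set_reconstruction_kernel}

def apply_tool_calls(protocol, tool_calls_json):
    """Apply a list of structured tool calls (as the LLM would emit them) to a protocol dict."""
    for call in json.loads(tool_calls_json):
        fn = TOOLS.get(call["name"])
        if fn is None:
            raise ValueError(f"Unknown tool: {call['name']}")
        protocol = fn(protocol, **call["arguments"])
    return protocol

# Example: the agent translated "use a sharper kernel and 100 kVp" into structured calls
calls = '[{"name": "set_kvp", "arguments": {"value": 100}},' \
        ' {"name": "set_reconstruction_kernel", "arguments": {"kernel": "Br69"}}]'
print(apply_tool_calls({"kVp": 120, "recon_kernel": "Br40"}, calls))
```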

Performance Comparison of Cutting-Edge Large Language Models on the ACR In-Training Examination: An Update for 2025.

Young A, Paloka R, Islam A, Prasanna P, Hill V, Payne D

PubMed paper, Sep 24, 2025
This study represents a continuation of prior work by Payne et al. evaluating large language model (LLM) performance on radiology board-style assessments, specifically the ACR diagnostic radiology in-training examination (DXIT). Building upon earlier findings with GPT-4, we assess the performance of newer, cutting-edge models, such as GPT-4o, GPT-o1, GPT-o3, Claude, Gemini, and Grok, on standardized DXIT questions. In addition to overall performance, we compare model accuracy on text-based versus image-based questions to assess multi-modal reasoning capabilities. As a secondary aim, we investigate the potential impact of data contamination by comparing model performance on original versus revised image-based questions. Seven LLMs - GPT-4, GPT-4o, GPT-o1, GPT-o3, Claude 3.5 Sonnet, Gemini 1.5 Pro, and Grok 2.0 - were evaluated using 106 publicly available DXIT questions. Each model was prompted using a standardized instruction set to simulate a radiology resident answering board-style questions. For each question, the model's selected answer, rationale, and confidence score were recorded. Unadjusted accuracy (based on correct answer selection) and logic-adjusted accuracy (based on clinical reasoning pathways) were calculated. Subgroup analysis compared model performance on text-based versus image-based questions. Additionally, 63 image-based questions were revised to test novel reasoning while preserving the original diagnostic image, to assess the impact of potential training data contamination. Across 106 DXIT questions, GPT-o1 demonstrated the highest unadjusted accuracy (71.7%), followed closely by GPT-4o (69.8%) and GPT-o3 (68.9%). GPT-4 and Grok 2.0 scored lower (59.4% and 52.8%, respectively), and Claude 3.5 Sonnet had the lowest unadjusted accuracy (34.9%). Similar trends were observed for logic-adjusted accuracy, with GPT-o1 (60.4%), GPT-4o (59.4%), and GPT-o3 (59.4%) again outperforming other models, while Grok 2.0 and Claude 3.5 Sonnet lagged behind (34.0% and 30.2%, respectively). GPT-4o's performance was significantly higher on text-based questions than on image-based ones. Unadjusted accuracy on the revised DXIT questions was 49.2%, compared to 56.1% on matched original DXIT questions. Logic-adjusted accuracy on the revised DXIT questions was 40.0%, compared to 44.4% on matched original questions. No significant difference in performance was observed between original and revised questions. Modern LLMs, especially those from OpenAI, demonstrate strong and improved performance on board-style radiology assessments. Comparable performance on revised prompts suggests that data contamination may have played a limited role. As LLMs improve, they hold strong potential to support radiology resident learning through personalized feedback and board-style question review.
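
A small sketch of how unadjusted and logic-adjusted accuracy (plus the text/image subgroup split) could be tallied from the recorded responses; the field names are hypothetical, not the study's actual data schema:

```python
def score_model(responses):
    """responses: list of dicts with hypothetical fields
    {'correct_answer': bool, 'sound_reasoning': bool, 'has_image': bool}."""
    def acc(items, key):
        # Percentage of items where the flag is True
        return 100.0 * sum(r[key] for r in items) / len(items) if items else float("nan")

    image_qs = [r for r in responses if r["has_image"]]
    text_qs = [r for r in responses if not r["has_image"]]
    return {
        "unadjusted_accuracy": acc(responses, "correct_answer"),
        "logic_adjusted_accuracy": acc(responses, "sound_reasoning"),
        "image_unadjusted": acc(image_qs, "correct_answer"),
        "text_unadjusted": acc(text_qs, "correct_answer"),
    }
```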

Automated Resectability Classification of Pancreatic Cancer CT Reports with Privacy-Preserving Open-Weight Large Language Models: A Multicenter Study.

Lee JH, Min JH, Gu K, Han S, Hwang JA, Choi SY, Song KD, Lee JE, Lee J, Moon JE, Adetyan H, Yang JD

PubMed paper, Sep 24, 2025
Objective: To evaluate the effectiveness of open-weight large language models (LLMs) in extracting key radiological features and determining National Comprehensive Cancer Network (NCCN) resectability status from free-text radiology reports for pancreatic ductal adenocarcinoma (PDAC). Methods: Prompts were developed using 30 fictitious reports, internally validated on 100 additional fictitious reports, and tested using 200 real reports from two institutions (January 2022 to December 2023). Two radiologists established ground truth for 18 key features and resectability status. Gemma-2-27b-it and Llama-3-70b-instruct models were evaluated using recall, precision, F1-score, extraction accuracy, and overall resectability accuracy. Statistical analyses included McNemar's test and mixed-effects logistic regression. Results: In internal validation, Llama had significantly higher recall than Gemma (99% vs. 95%, p < 0.01) and slightly higher extraction accuracy (98% vs. 97%). Llama also demonstrated higher overall resectability accuracy (93% vs. 91%). In the internal test set, both models achieved 96% recall and 96% extraction accuracy. Overall resectability accuracy was 95% for Llama and 93% for Gemma. In the external test set, both models had 93% recall. Extraction accuracy was 93% for Llama and 95% for Gemma. Gemma achieved higher overall resectability accuracy (89% vs. 83%), but the difference was not statistically significant (p > 0.05). Conclusion: Open-weight models accurately extracted key radiological features and determined NCCN resectability status from free-text PDAC reports. While internal dataset performance was robust, performance on external data decreased, highlighting the need for institution-specific optimization.
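
A brief sketch of per-feature scoring against radiologist ground truth using scikit-learn metrics; the data layout is an assumption for illustration, not the study's actual pipeline:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

def score_feature_extraction(ground_truth, predictions):
    """ground_truth / predictions: dicts mapping feature name -> list of binary labels
    (one per report) indicating whether the feature was reported as present."""
    results = {}
    for feature, y_true in ground_truth.items():
        y_pred = predictions[feature]
        results[feature] = {
            "recall": recall_score(y_true, y_pred, zero_division=0),
            "precision": precision_score(y_true, y_pred, zero_division=0),
            "f1": f1_score(y_true, y_pred, zero_division=0),
        }
    return results

# Example with two hypothetical features over four reports
gt = {"arterial_involvement": [1, 0, 1, 0], "venous_involvement": [0, 0, 1, 1]}
pred = {"arterial_involvement": [1, 0, 0, 0], "venous_involvement": [0, 0, 1, 1]}
print(score_feature_extraction(gt, pred))
```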

Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning

Guoxin Wang, Jun Zhao, Xinyi Liu, Yanbo Liu, Xuyang Cao, Chao Li, Zhuoyun Liu, Qintian Sun, Fangru Zhou, Haoqiang Xing, Zhenhong Yang

arXiv preprint, Sep 23, 2025
Medical imaging provides critical evidence for clinical diagnosis, treatment planning, and surgical decisions, yet most existing imaging models are narrowly focused and require multiple specialized networks, limiting their generalization. Although large-scale language and multimodal models exhibit strong reasoning and multi-task capabilities, real-world clinical applications demand precise visual grounding, multimodal integration, and chain-of-thought reasoning. We introduce Citrus-V, a multimodal medical foundation model that combines image analysis with textual reasoning. The model integrates detection, segmentation, and multimodal chain-of-thought reasoning, enabling pixel-level lesion localization, structured report generation, and physician-like diagnostic inference in a single framework. We propose a novel multimodal training approach and release a curated open-source data suite covering reasoning, detection, segmentation, and document understanding tasks. Evaluations demonstrate that Citrus-V outperforms existing open-source medical models and expert-level imaging systems across multiple benchmarks, delivering a unified pipeline from visual grounding to clinical reasoning and supporting precise lesion quantification, automated reporting, and reliable second opinions.

Medical AI Consensus: A Multi-Agent Framework for Radiology Report Generation and Evaluation

Ahmed T. Elboardy, Ghada Khoriba, Essam A. Rashed

arXiv preprint, Sep 22, 2025
Automating radiology report generation poses a dual challenge: building clinically reliable systems and designing rigorous evaluation protocols. We introduce a multi-agent reinforcement learning framework that serves as both a benchmark and evaluation environment for multimodal clinical reasoning in the radiology ecosystem. The proposed framework integrates large language models (LLMs) and large vision models (LVMs) within a modular architecture composed of ten specialized agents responsible for image analysis, feature extraction, report generation, review, and evaluation. This design enables fine-grained assessment at both the agent level (e.g., detection and segmentation accuracy) and the consensus level (e.g., report quality and clinical relevance). We demonstrate an implementation using ChatGPT-4o on public radiology datasets, where LLMs act as evaluators alongside medical radiologist feedback. By aligning evaluation protocols with the LLM development lifecycle, including pretraining, finetuning, alignment, and deployment, the proposed benchmark establishes a path toward trustworthy deviance-based radiology report generation.
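
A minimal sketch of the modular multi-agent idea: agents pass a shared case state down a pipeline and a consensus step aggregates reviewer scores. The agent roles and scores below are placeholders, not the paper's ten agents:

```python
from statistics import mean

# Placeholder agents; each is a callable that takes and returns a shared case state.
def image_analysis_agent(state):
    state["findings"] = ["cardiomegaly suspected"]                  # stand-in for an LVM call
    return state

def report_generation_agent(state):
    state["report"] = "Findings: " + "; ".join(state["findings"])   # stand-in for an LLM call
    return state

def reviewer_agent(state):
    state.setdefault("scores", []).append(4.0)                      # stand-in for an LLM-as-judge score
    return state

def run_pipeline(state, agents):
    for agent in agents:
        state = agent(state)
    # Consensus-level evaluation: aggregate reviewer scores into one quality estimate
    state["consensus_score"] = mean(state.get("scores", [0.0]))
    return state

print(run_pipeline({}, [image_analysis_agent, report_generation_agent,
                        reviewer_agent, reviewer_agent]))
```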

Evaluating the role of LLMs in supporting patient education during the informed consent process for routine radiology procedures.

Einspänner E, Schwab R, Hupfeld S, Thormann M, Fuchs E, Gawlitza M, Borggrefe J, Behme D

PubMed paper, Sep 15, 2025
This study evaluated three LLM chatbots (GPT-3.5-turbo, GPT-4-turbo, and GPT-4o) on their effectiveness in supporting patient education by answering common patient questions during CT, MRI, and DSA informed consent, assessing their accuracy and clarity. Two radiologists formulated 90 questions categorized as general, clinical, or technical. Each LLM answered every question five times. Radiologists then rated the responses for medical accuracy and clarity, while medical physicists assessed technical accuracy using a Likert scale. Semantic similarity was analyzed with SBERT and cosine similarity. Ratings improved with newer model versions. Linear mixed-effects models revealed that GPT-4 models were rated significantly higher than GPT-3.5 (p < 0.001) by both physicians and physicists. However, physicians' ratings for GPT-4 models showed a significant performance decrease for complex modalities like DSA and MRI (p < 0.01), a pattern not observed in physicists' ratings. SBERT analysis revealed high internal consistency across all models. Variability in ratings revealed that while the models effectively handled general and technical questions, they struggled with contextually complex medical inquiries requiring personalized responses and nuanced understanding. Statistical analysis confirms that while newer models are superior, their performance is modality-dependent and perceived differently by clinical and technical experts. This study evaluates the potential of LLMs to enhance informed consent in radiology, highlighting strengths in general and technical questions while noting limitations with complex clinical inquiries, with performance varying significantly by model type and imaging modality.
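
A short sketch of the SBERT-based internal-consistency check: embed a model's five answers to one question and average their pairwise cosine similarities. The embedding model name is a common default, not necessarily the one used in the study:

```python
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

def internal_consistency(answers, model_name="all-MiniLM-L6-v2"):
    """Mean pairwise cosine similarity of repeated answers a model gave to one question."""
    model = SentenceTransformer(model_name)
    emb = model.encode(answers, convert_to_tensor=True, normalize_embeddings=True)
    sims = [float(util.cos_sim(emb[i], emb[j]))
            for i, j in combinations(range(len(answers)), 2)]
    return sum(sims) / len(sims)
```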

Large language models in radiology workflows: An exploratory study of generative AI for non-visual tasks in the German healthcare system.

Steinhauser S, Welsch S

PubMed paper, Sep 15, 2025
Large language models (LLMs) are gaining attention for their potential to enhance radiology workflows by addressing challenges such as increasing workloads and staff shortages. However, limited familiarity among radiologists and concerns about practical implementation and ethical implications hinder adoption. This study investigates radiologists' perspectives on the use of LLMs, exploring their potential benefits, challenges, and impact on workflows and professional roles. An exploratory, qualitative study was conducted using 12 semi-structured interviews with radiology experts. Data were analyzed to assess participants' awareness, attitudes, and perceived applications of LLMs in radiology. LLMs were identified as promising tools for reducing workloads by streamlining tasks like summarizing clinical histories and generating standardized reports, improving communication and efficiency. Participants expressed openness to LLM integration but noted concerns about their impact on human interaction, ethical standards, and liability. The role of radiologists is expected to evolve with LLM adoption, with a shift toward data stewardship and interprofessional collaboration. Barriers to implementation included limited awareness, regulatory constraints, and outdated infrastructure. The integration of LLMs is hindered by regulatory challenges, outdated infrastructure, and limited awareness among radiologists. Policymakers should establish clear, practical regulations to address liability and ethical concerns while ensuring compliance with privacy standards. Investments in modernizing clinical infrastructure and expanding training programs are critical to enable radiologists to use these tools effectively. By addressing these barriers, LLMs can enhance efficiency, reduce workloads, and improve patient care, while preserving the central role of radiologists in diagnostic and therapeutic processes.