Latest Papers on Radiology AI. Tags: Other, Order: Best Match, Limit: 10.

Generative AI enables medical image segmentation in ultra low-data regimes.

Zhang L, Jindal B, Alaa A, Weinreb R, Wilson D, Segal E, Zou J, Xie P

•papers•Jul 14 2025

Semantic segmentation of medical images is pivotal in applications like disease diagnosis and treatment planning. While deep learning automates this task effectively, it struggles in ultra low-data regimes due to the scarcity of annotated segmentation masks. To address this, we propose a generative deep learning framework that produces high-quality image-mask pairs as auxiliary training data. Unlike traditional generative models that separate data generation from model training, ours uses multi-level optimization for end-to-end data generation. This allows segmentation performance to guide the generation process, producing data tailored to improve segmentation outcomes. Our method demonstrates strong generalization across 11 medical image segmentation tasks and 19 datasets, covering various diseases, organs, and modalities. It improves performance by 10-20% (absolute) in both same- and out-of-domain settings and requires 8-20 times less training data than existing approaches. This greatly enhances the feasibility and cost-effectiveness of deep learning in data-limited medical imaging scenarios.

Mixed Modality Segmentation Methodology In Silico Academic Lab GenAI

Comparing large language models and text embedding models for automated classification of textual, semantic, and critical changes in radiology reports.

Lindholz M, Burdenski A, Ruppel R, Schulze-Weddige S, Baumgärtner GL, Schobert I, Haack AM, Eminovic S, Milnik A, Hamm CA, Frisch A, Penzkofer T

•papers•Jul 14 2025

Radiology reports can change during workflows, especially when residents draft preliminary versions that attending physicians finalize. We explored how large language models (LLMs) and embedding techniques can categorize these changes into textual, semantic, or clinically actionable types. We evaluated 400 adult CT reports drafted by residents against finalized versions by attending physicians. Changes were rated on a five-point scale from no changes to critical ones. We examined open-source LLMs alongside traditional metrics like normalized word differences, Levenshtein and Jaccard similarity, and text embedding similarity. Model performance was assessed using quadratic weighted Cohen's kappa (κ), (balanced) accuracy, F1, precision, and recall. Inter-rater reliability among evaluators was excellent (κ = 0.990). Of the reports analyzed, 1.3% contained critical changes. The tested methods showed significant performance differences (P < 0.001). The Qwen3-235B-A22B model with a zero-shot prompt aligned most closely with human assessments of changes in clinical reports, achieving a κ of 0.822 (SD 0.031). The best conventional metric, word difference, had a κ of 0.732 (SD 0.048); the difference between the two was statistically significant in unadjusted post-hoc tests (P = 0.038) but lost significance after adjusting for multiple testing (P = 0.064). Embedding models underperformed compared to LLMs and classical methods, showing statistical significance in most cases. Large language models like Qwen3-235B-A22B demonstrated moderate to strong alignment with expert evaluations of the clinical significance of changes in radiology reports. LLMs outperformed embedding methods and traditional string and word approaches, achieving statistical significance in most instances. This demonstrates their potential as tools to support peer review.

CT Classification Retrospective Clinical In Silico Academic Lab Benchmark SOTA
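
The agreement metric and one of the conventional text metrics named in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the whitespace tokenization, and the toy ratings are assumptions made for illustration, and the study's exact preprocessing is not described in the abstract.

```python
def quadratic_weighted_kappa(rater_a, rater_b, n_categories):
    """Quadratic weighted Cohen's kappa for integer ratings in 0..n_categories-1."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    k = n_categories
    n = len(rater_a)
    # Confusion matrix of observed rating pairs.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement penalty
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * hist_a[i] * hist_b[j] / n
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return 1.0 - weighted_observed / weighted_expected


def jaccard_similarity(text_a, text_b):
    """Jaccard similarity of the word sets of two texts (assumed whitespace tokenization)."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


if __name__ == "__main__":
    # Hypothetical five-point ratings (0 = no change, 4 = critical change).
    human = [0, 1, 2, 3, 4, 0, 1]
    model = [0, 1, 2, 4, 4, 0, 2]
    print(quadratic_weighted_kappa(human, model, n_categories=5))
    print(jaccard_similarity("no acute findings", "no acute fracture"))
```

The quadratic weights penalize a disagreement of several scale points far more than an adjacent-category disagreement, which suits an ordinal scale such as the five-point change rating.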

A Survey on Medical Image Compression: From Traditional to Learning-Based

Guofeng Tong, Sixuan Liu, Yang Lv, Hanyu Pei, Feng-Lei Fan

•preprint•Jul 13 2025

The exponential growth of medical imaging has created significant challenges in data storage, transmission, and management for healthcare systems. In this vein, efficient compression becomes increasingly important. Unlike natural image compression, medical image compression prioritizes preserving diagnostic details and structural integrity, imposing stricter quality requirements and demanding fast, memory-efficient algorithms that balance computational complexity with clinically acceptable reconstruction quality. Meanwhile, the medical imaging family includes a plethora of modalities, each possessing different requirements. For example, 2D medical image (e.g., X-rays, histopathological images) compression focuses on exploiting intra-slice spatial redundancy, while volumetric medical image compression requires handling both intra-slice and inter-slice spatial correlations, and 4D dynamic imaging (e.g., time-series CT/MRI, 4D ultrasound) additionally demands processing temporal correlations between consecutive time frames. Traditional compression methods, grounded in mathematical transforms and information theory principles, provide solid theoretical foundations, predictable performance, and high standardization levels, with extensive validation in clinical environments. In contrast, deep learning-based approaches demonstrate remarkable adaptive learning capabilities and can capture complex statistical characteristics and semantic information within medical images. This comprehensive survey establishes a two-facet taxonomy based on data structure (2D vs 3D/4D) and technical approaches (traditional vs learning-based), thereby systematically presenting the complete technological evolution, analyzing the unique technical challenges, and prospecting future directions in medical image compression.

Mixed Modality Reconstruction Review Concept GenAI

Disentanglement and Assessment of Shortcuts in Ophthalmological Retinal Imaging Exams

Leonor Fernandes, Tiago Gonçalves, João Matos, Luis Filipe Nakayama, Jaime S. Cardoso

•preprint•Jul 13 2025

Diabetic retinopathy (DR) is a leading cause of vision loss in working-age adults. While screening reduces the risk of blindness, traditional imaging is often costly and inaccessible. Artificial intelligence (AI) algorithms present a scalable diagnostic solution, but concerns regarding fairness and generalization persist. This work evaluates the fairness and performance of image-trained models in DR prediction, as well as the impact of disentanglement as a bias mitigation technique, using the diverse mBRSET fundus dataset. Three models, ConvNeXt V2, DINOv2, and Swin V2, were trained on macula images to predict DR and sensitive attributes (SAs) (e.g., age and gender/sex). Fairness was assessed between subgroups of SAs, and disentanglement was applied to reduce bias. All models achieved high DR prediction performance (up to 94% AUROC) and could reasonably predict age and gender/sex (91% and 77% AUROC, respectively). Fairness assessment suggests disparities, such as a 10% AUROC gap between age groups in DINOv2. Disentangling SAs from DR prediction had varying results, depending on the model selected. Disentanglement improved DINOv2 performance (2% AUROC gain), but led to performance drops in ConvNeXt V2 and Swin V2 (7% and 3%, respectively). These findings highlight the complexity of disentangling fine-grained features in fundus imaging and emphasize the importance of fairness in medical imaging AI to ensure equitable and reliable healthcare solutions.

OCT Classification Methodology In Silico Ethics

PanoDiff-SR: Synthesizing Dental Panoramic Radiographs using Diffusion and Super-resolution

Sanyam Jain, Bruna Neves de Freitas, Andreas Basse-OConnor, Alexandros Iosifidis, Ruben Pauwels

•preprint•Jul 12 2025

There has been increasing interest in the generation of high-quality, realistic synthetic medical images in recent years. Such synthetic datasets can mitigate the scarcity of public datasets for artificial intelligence research, and can also be used for educational purposes. In this paper, we propose a combination of diffusion-based generation (PanoDiff) and Super-Resolution (SR) for generating synthetic dental panoramic radiographs (PRs). The former generates a low-resolution (LR) seed of a PR (256 X 128) which is then processed by the SR model to yield a high-resolution (HR) PR of size 1024 X 512. For SR, we propose a state-of-the-art transformer that learns local-global relationships, resulting in sharper edges and textures. Experimental results demonstrate a Frechet inception distance score of 40.69 between 7243 real and synthetic images (in HR). Inception scores were 2.55, 2.30, 2.90 and 2.98 for real HR, synthetic HR, real LR and synthetic LR images, respectively. Among a diverse group of six clinical experts, all evaluating a mixture of 100 synthetic and 100 real PRs in a time-limited observation, the average accuracy in distinguishing real from synthetic images was 68.5% (with 50% corresponding to random guessing).

X-Ray Image Synthesis Methodology In Silico Open Dataset

Integrating Artificial Intelligence in Thyroid Nodule Management: Clinical Outcomes and Cost-Effectiveness Analysis.

Bodoque-Cubas J, Fernández-Sáez J, Martínez-Hervás S, Pérez-Lacasta MJ, Carles-Lavila M, Pallarés-Gasulla RM, Salazar-González JJ, Gil-Boix JV, Miret-Llauradó M, Aulinas-Masó A, Argüelles-Jiménez I, Tofé-Povedano S

•papers•Jul 12 2025

The increasing incidence of thyroid nodules (TN) raises concerns about overdiagnosis and overtreatment. This study evaluates the clinical and economic impact of KOIOS, an FDA-approved artificial intelligence (AI) tool for the management of TN. A retrospective analysis was conducted on 176 patients who underwent thyroid surgery between May 2022 and November 2024. Ultrasound images were evaluated independently by an expert and novice operators using the American College of Radiology Thyroid Imaging Reporting and Data System (ACR-TIRADS), while KOIOS provided AI-adapted risk stratification. Sensitivity, specificity, and Receiver-Operating Curve (ROC) analysis were performed. The incremental cost-effectiveness ratio (ICER) was defined based on the number of optimal care interventions (FNAB and thyroid surgery). Both deterministic and probabilistic sensitivity analyses were conducted to evaluate model robustness. KOIOS AI demonstrated similar diagnostic performance to the expert operator (AUC: 0.794, 95% CI: 0.718-0.871 vs. 0.784, 95% CI: 0.706-0.861; p = 0.754) and significantly outperformed the novice operator (AUC: 0.619, 95% CI: 0.526-0.711; p < 0.001). ICER analysis estimated the cost per additional optimal care decision at -€8,085.56, indicating KOIOS as a dominant and cost-saving strategy when considering a third-party payer perspective over a one-year horizon. Deterministic sensitivity analysis identified surgical costs as the main drivers of variability, while probabilistic analysis consistently favored KOIOS as the optimal strategy. KOIOS AI is a cost-effective alternative, particularly in reducing overdiagnosis and overtreatment for benign TNs. Prospective, real-life studies are needed to validate these findings and explore long-term implications.

Ultrasound Classification Retrospective Clinical FDA Cleared FDA 510(k) Startup Benchmark SOTA
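
The incremental cost-effectiveness ratio (ICER) reported in the abstract above follows the standard definition: the cost difference between two strategies divided by their difference in effect. The Python sketch below illustrates that arithmetic only. The function names and all input numbers are hypothetical and are not taken from the study, whose cost model is not given in the abstract.

```python
def icer(cost_new, cost_old, effect_new, effect_old):
    """Incremental cost per additional unit of effect (new strategy vs. old)."""
    delta_effect = effect_new - effect_old
    if delta_effect == 0:
        raise ValueError("ICER is undefined when both strategies have equal effect")
    return (cost_new - cost_old) / delta_effect


def is_dominant(cost_new, cost_old, effect_new, effect_old):
    """A strategy is dominant when it costs less and achieves more."""
    return cost_new < cost_old and effect_new > effect_old


if __name__ == "__main__":
    # Hypothetical totals: costs in euros, effects as counts of optimal care decisions.
    cost_ai, cost_standard = 90_000.0, 100_000.0
    optimal_ai, optimal_standard = 12, 10
    print(icer(cost_ai, cost_standard, optimal_ai, optimal_standard))         # -5000.0
    print(is_dominant(cost_ai, cost_standard, optimal_ai, optimal_standard))  # True
```

A negative ICER alone is ambiguous, because it also arises when a strategy costs more and achieves less. That is why the sketch checks the signs of both differences before calling a strategy dominant.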

Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models

Anita Kriz, Elizabeth Laura Janes, Xing Shen, Tal Arbel

•preprint•Jul 12 2025

Multimodal large language models (MLLMs) hold considerable promise for applications in healthcare. However, their deployment in safety-critical settings is hindered by two key limitations: (i) sensitivity to prompt design, and (ii) a tendency to generate incorrect responses with high confidence. As clinicians may rely on a model's stated confidence to gauge the reliability of its predictions, it is especially important that when a model expresses high confidence, it is also highly accurate. We introduce Prompt4Trust, the first reinforcement learning (RL) framework for prompt augmentation targeting confidence calibration in MLLMs. A lightweight LLM is trained to produce context-aware auxiliary prompts that guide a downstream task MLLM to generate responses in which the expressed confidence more accurately reflects predictive accuracy. Unlike conventional calibration techniques, Prompt4Trust specifically prioritizes aspects of calibration most critical for safe and trustworthy clinical decision-making. Beyond improvements driven by this clinically motivated calibration objective, our proposed method also improves task accuracy, achieving state-of-the-art medical visual question answering (VQA) performance on the PMC-VQA benchmark, which is composed of multiple-choice questions spanning diverse medical imaging modalities. Moreover, our framework trained with a small downstream task MLLM showed promising zero-shot generalization to larger MLLMs in our experiments, suggesting the potential for scalable calibration without the associated computational costs. This work demonstrates the potential of automated yet human-aligned prompt engineering for improving the trustworthiness of MLLMs in safety-critical settings. Our codebase can be found at https://github.com/xingbpshen/prompt4trust.

Mixed Modality LLM Radiology Report Methodology In Silico Academic Lab Benchmark SOTA Open Code

Accuracy of large language models in generating differential diagnosis from clinical presentation and imaging findings in pediatric cases.

Jung J, Phillipi M, Tran B, Chen K, Chan N, Ho E, Sun S, Houshyar R

•papers•Jul 12 2025

Large language models (LLMs) have shown promise in assisting medical decision-making. However, there is limited literature exploring the diagnostic accuracy of LLMs in generating differential diagnoses from text-based image descriptions and clinical presentations in pediatric radiology. The aim of this study was to examine the performance of multiple proprietary LLMs in producing accurate differential diagnoses for text-based pediatric radiological cases without imaging. One hundred sixty-four cases were retrospectively selected from a pediatric radiology textbook and converted into two formats: (1) image description only, and (2) image description with clinical presentation. The ChatGPT-4 V, Claude 3.5 Sonnet, and Gemini 1.5 Pro algorithms were given these inputs and tasked with providing a top 1 diagnosis and a top 3 differential diagnoses. Accuracy of responses was assessed by comparison with the original literature. Top 1 accuracy was defined as whether the top 1 diagnosis matched the textbook, and top 3 differential accuracy was defined as the number of diagnoses in the model-generated top 3 differential that matched any of the top 3 diagnoses in the textbook. McNemar's test, Cochran's Q test, Friedman test, and Wilcoxon signed-rank test were used to compare algorithms and assess the impact of added clinical information, respectively. There was no significant difference in top 1 accuracy between ChatGPT-4 V, Claude 3.5 Sonnet, and Gemini 1.5 Pro when only image descriptions were provided (56.1% [95% CI 48.4-63.5], 64.6% [95% CI 57.1-71.5], 61.6% [95% CI 54.0-68.7]; P = 0.11). Adding clinical presentation to image description significantly improved top 1 accuracy for ChatGPT-4 V (64.0% [95% CI 56.4-71.0], P = 0.02) and Claude 3.5 Sonnet (80.5% [95% CI 73.8-85.8], P < 0.001). For image description and clinical presentation cases, Claude 3.5 Sonnet significantly outperformed both ChatGPT-4 V and Gemini 1.5 Pro (P < 0.001). For top 3 differential accuracy, no significant differences were observed between ChatGPT-4 V, Claude 3.5 Sonnet, and Gemini 1.5 Pro, regardless of whether the cases included only image descriptions (1.29 [95% CI 1.16-1.41], 1.35 [95% CI 1.23-1.48], 1.37 [95% CI 1.25-1.49]; P = 0.60) or both image descriptions and clinical presentations (1.33 [95% CI 1.20-1.45], 1.52 [95% CI 1.41-1.64], 1.48 [95% CI 1.36-1.59]; P = 0.72). Only Claude 3.5 Sonnet performed significantly better when clinical presentation was added (P < 0.001). Commercial LLMs performed similarly on pediatric radiology cases in providing top 1 accuracy and top 3 differential accuracy when only a text-based image description was used. Adding clinical presentation significantly improved top 1 accuracy for ChatGPT-4 V and Claude 3.5 Sonnet, with Claude showing the largest improvement. Claude 3.5 Sonnet outperformed both ChatGPT-4 V and Gemini 1.5 Pro in top 1 accuracy when both image and clinical data were provided. No significant differences were found in top 3 differential accuracy across models in any condition.

Mixed Modality LLM Radiology Report Retrospective Clinical In Silico Academic Lab Benchmark SOTA
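
The two scoring rules defined in the abstract above (top 1 accuracy as a match with the textbook diagnosis, top 3 differential accuracy as the count of model diagnoses that appear in the textbook top 3) can be sketched in Python as follows. The exact-string matching after normalization is an assumption made for illustration, since the abstract does not say how matches were adjudicated, and the function names and example diagnoses are hypothetical.

```python
def normalize(diagnosis):
    """Lowercase and collapse whitespace so trivially different spellings compare equal."""
    return " ".join(diagnosis.lower().split())


def top1_correct(model_top1, reference_top1):
    """True when the model's single best diagnosis matches the reference diagnosis."""
    return normalize(model_top1) == normalize(reference_top1)


def top3_matches(model_top3, reference_top3):
    """Number of distinct model diagnoses (0 to 3) found in the reference top 3."""
    reference = {normalize(d) for d in reference_top3}
    model = {normalize(d) for d in model_top3}
    return len(model & reference)


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical cases: (model top 3, reference top 3); the first entry is the top 1.
    cases = [
        (["Croup", "Epiglottitis", "Foreign body"],
         ["croup", "bacterial tracheitis", "epiglottitis"]),
        (["Intussusception", "Appendicitis", "Volvulus"],
         ["Appendicitis", "Mesenteric adenitis", "Meckel diverticulum"]),
    ]
    top1 = [top1_correct(m[0], r[0]) for m, r in cases]
    top3 = [top3_matches(m, r) for m, r in cases]
    print(mean([1.0 if hit else 0.0 for hit in top1]))  # top 1 accuracy as a proportion
    print(mean(top3))                                   # mean matches per case, 0 to 3
```

Under this definition the top 3 score is a count per case rather than a percentage, which is why the abstract reports values such as 1.29 to 1.52 instead of proportions.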

Efficient needle guidance: multi-camera augmented reality navigation without patient-specific calibration.

Wei Y, Huang B, Zhao B, Lin Z, Zhou SZ

•papers•Jul 12 2025

Augmented reality (AR) technology holds significant promise for enhancing surgical navigation in needle-based procedures such as biopsies and ablations. However, most existing AR systems rely on patient-specific markers, which disrupt clinical workflows and require time-consuming preoperative calibrations, thereby hindering operational efficiency and precision. We developed a novel multi-camera AR navigation system that eliminates the need for patient-specific markers by utilizing ceiling-mounted markers mapped to fixed medical imaging devices. A hierarchical optimization framework integrates both marker mapping and multi-camera calibration. Deep learning techniques are employed to enhance marker detection and registration accuracy. Additionally, a vision-based pose compensation method is implemented to mitigate errors caused by patient movement, improving overall positional accuracy. Validation through phantom experiments and simulated clinical scenarios demonstrated an average puncture accuracy of 3.72 ± 1.21 mm. The system reduced needle placement time by 20 s compared to traditional marker-based methods. It also effectively corrected errors induced by patient movement, with a mean positional error of 0.38 pixels and an angular deviation of 0.51°. These results highlight the system's precision, adaptability, and reliability in realistic surgical conditions. This marker-free AR guidance system significantly streamlines surgical workflows while enhancing needle navigation accuracy. Its simplicity, cost-effectiveness, and adaptability make it an ideal solution for both high- and low-resource clinical environments, offering the potential for improved precision, reduced procedural time, and better patient outcomes.

Mixed Modality Registration Methodology Phantom/Animal Academic Lab Benchmark SOTA

Semi-supervised Medical Image Segmentation Using Heterogeneous Complementary Correction Network and Confidence Contrastive Learning.

Li L, Xue M, Li S, Dong Z, Liao T, Li P

•papers•Jul 11 2025

Semi-supervised medical image segmentation techniques have demonstrated significant potential and effectiveness in clinical diagnosis. The prevailing approaches using the mean-teacher (MT) framework achieve promising image segmentation results. However, due to the unreliability of the pseudo labels generated by the teacher model, existing methods still have some inherent limitations that must be considered and addressed. In this paper, we propose an innovative semi-supervised method for medical image segmentation by combining the heterogeneous complementary correction network and confidence contrastive learning (HC-CCL). Specifically, we develop a triple-branch framework by integrating a heterogeneous complementary correction (HCC) network into the MT framework. HCC serves as an auxiliary branch that corrects prediction errors in the student model and provides complementary information. To improve the capacity for feature learning in our proposed model, we introduce a confidence contrastive learning (CCL) approach with a novel sampling strategy. Furthermore, we develop a momentum style transfer (MST) method to narrow the gap between labeled and unlabeled data distributions. In addition, we introduce a Cutout-style augmentation for unsupervised learning to enhance performance. Three medical image datasets (the left atrial (LA) dataset, the NIH pancreas dataset, and the BraTS-2019 dataset) were employed to rigorously evaluate HC-CCL. Quantitative results demonstrate significant performance advantages over existing approaches, achieving state-of-the-art performance across all metrics. The implementation will be released at https://github.com/xxmmss/HC-CCL.

Mixed Modality Segmentation Methodology In Silico Academic Lab Benchmark SOTA Open Code
