Page 82 of 142 (1420 results)

QuarterMap: Efficient Post-Training Token Pruning for Visual State Space Models

Tien-Yu Chi, Hung-Yueh Chiang, Diana Marculescu, Kai-Chiang Wu

arXiv preprint, Jul 13 2025
State space models (SSMs) reduce the quadratic complexity of transformers by leveraging linear recurrence. Recently, VMamba has emerged as a strong SSM-based vision backbone, yet remains bottlenecked by spatial redundancy in its four-directional scan. We propose QuarterMap, a post-training activation pruning method that removes redundant spatial activations before scanning and restores dimensions via nearest-neighbor upsampling. Our method improves throughput without retraining. On ImageNet-1K, QuarterMap achieves up to 11% speedup on VMamba with less than 0.9% accuracy drop, and yields similar gains on ADE20K segmentation. Beyond VMamba, we validate QuarterMap on MedMamba, a domain-specific model that shares the same four-directional scanning structure, where it consistently improves throughput while preserving accuracy across multiple medical imaging tasks. Compared to token merging methods like ToMe, QuarterMap is tailored for SSMs and avoids costly merge-unmerge operations. Our method offers a plug-and-play tool for deployment-time efficiency without compromising transferability.
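The pruning idea described above can be sketched in a few lines. The following is an illustrative stand-in (our own construction, not the QuarterMap code): it keeps a stride-2 subsample of the spatial activations, i.e. one quarter of the positions, then restores the original resolution with nearest-neighbor upsampling.

```python
import numpy as np

def quarter_prune_restore(x: np.ndarray) -> np.ndarray:
    # Keep a stride-2 subsample of an (H, W, C) activation map -- one
    # quarter of the spatial positions -- then restore the original
    # resolution by nearest-neighbor upsampling (repeat each value 2x2).
    h, w, _ = x.shape
    kept = x[::2, ::2, :]                       # (ceil(H/2), ceil(W/2), C)
    restored = kept.repeat(2, axis=0).repeat(2, axis=1)
    return restored[:h, :w, :]                  # crop if H or W is odd

x = np.arange(16, dtype=float).reshape(4, 4, 1)
y = quarter_prune_restore(x)
```

In the actual method this subsampling would happen before the four-directional scan, so the recurrence processes a quarter of the tokens; the sketch only shows the shape-preserving prune-and-restore step.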

Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models

Anita Kriz, Elizabeth Laura Janes, Xing Shen, Tal Arbel

arXiv preprint, Jul 12 2025
Multimodal large language models (MLLMs) hold considerable promise for applications in healthcare. However, their deployment in safety-critical settings is hindered by two key limitations: (i) sensitivity to prompt design, and (ii) a tendency to generate incorrect responses with high confidence. As clinicians may rely on a model's stated confidence to gauge the reliability of its predictions, it is especially important that when a model expresses high confidence, it is also highly accurate. We introduce Prompt4Trust, the first reinforcement learning (RL) framework for prompt augmentation targeting confidence calibration in MLLMs. A lightweight LLM is trained to produce context-aware auxiliary prompts that guide a downstream task MLLM to generate responses in which the expressed confidence more accurately reflects predictive accuracy. Unlike conventional calibration techniques, Prompt4Trust specifically prioritizes aspects of calibration most critical for safe and trustworthy clinical decision-making. Beyond improvements driven by this clinically motivated calibration objective, our proposed method also improves task accuracy, achieving state-of-the-art medical visual question answering (VQA) performance on the PMC-VQA benchmark, which is composed of multiple-choice questions spanning diverse medical imaging modalities. Moreover, our framework trained with a small downstream task MLLM showed promising zero-shot generalization to larger MLLMs in our experiments, suggesting the potential for scalable calibration without the associated computational costs. This work demonstrates the potential of automated yet human-aligned prompt engineering for improving the trustworthiness of MLLMs in safety-critical settings. Our codebase can be found at https://github.com/xingbpshen/prompt4trust.
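Calibration of this kind is typically quantified with a binned expected calibration error (ECE). The sketch below is a generic illustration of that quantity (our own, not the paper's implementation): bin predictions by stated confidence and average the gap between confidence and accuracy, weighted by bin size.

```python
import numpy as np

def expected_calibration_error(confs, correct, n_bins=10):
    # confs: stated confidences in [0, 1]; correct: 0/1 outcomes.
    confs = np.asarray(confs, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confs > lo) & (confs <= hi)
        if mask.any():
            # bin weight * |bin accuracy - bin mean confidence|
            ece += mask.mean() * abs(correct[mask].mean() - confs[mask].mean())
    return ece
```

A perfectly calibrated model (e.g. 50% confidence, 50% accuracy) scores 0; a model that says 85% but is never right scores 0.85.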

Calibrated and Robust Foundation Models for Vision-Language and Medical Image Tasks Under Distribution Shift

Behraj Khan, Tahir Syed

arXiv preprint, Jul 12 2025
Foundation models like CLIP and SAM have transformed computer vision and medical imaging via low-shot transfer learning. However, deployment of these models is hindered by two key challenges: distribution shift between training and test data, and confidence misalignment that leads to overconfident incorrect predictions. These issues manifest differently in vision-language classification and medical segmentation tasks, yet existing solutions remain domain-specific. We propose StaRFM, a unified framework addressing both challenges. It introduces a Fisher information penalty (FIP), extended to 3D medical data via patch-wise regularization, to reduce covariate shift in CLIP and SAM embeddings. Additionally, a confidence misalignment penalty (CMP), reformulated for voxel-level predictions, calibrates uncertainty in segmentation tasks. We theoretically derive PAC-Bayes bounds showing that FIP controls generalization via the Fisher-Rao norm, while CMP minimizes calibration error through Brier score optimization. StaRFM shows consistent gains: +3.5% accuracy and 28% lower ECE on 19 vision datasets (e.g., ImageNet, Office-Home), 84.7% DSC and 4.8 mm HD95 in medical segmentation (e.g., BraTS, ATLAS), and a 40% lower cross-domain performance gap compared to prior benchmarking methods. The framework is plug-and-play, requiring minimal architectural changes for seamless integration with foundation models. Code and models will be released at https://anonymous.4open.science/r/StaRFM-C0CD/README.md
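The Brier score that the CMP penalty optimizes is simply the mean squared error between a predicted probability and the 0/1 outcome. A minimal standalone sketch (our illustration, not the paper's voxel-level formulation):

```python
import numpy as np

def brier_score(probs, labels):
    # Mean squared gap between predicted probability and binary outcome.
    # 0 is perfect; an uninformative constant 0.5 prediction scores 0.25.
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return float(np.mean((probs - labels) ** 2))
```

Because it penalizes both miscalibration and poor discrimination, driving the Brier score down tends to reduce the overconfident errors the abstract describes.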

Efficient needle guidance: multi-camera augmented reality navigation without patient-specific calibration.

Wei Y, Huang B, Zhao B, Lin Z, Zhou SZ

PubMed paper, Jul 12 2025
Augmented reality (AR) technology holds significant promise for enhancing surgical navigation in needle-based procedures such as biopsies and ablations. However, most existing AR systems rely on patient-specific markers, which disrupt clinical workflows and require time-consuming preoperative calibrations, thereby hindering operational efficiency and precision. We developed a novel multi-camera AR navigation system that eliminates the need for patient-specific markers by utilizing ceiling-mounted markers mapped to fixed medical imaging devices. A hierarchical optimization framework integrates both marker mapping and multi-camera calibration. Deep learning techniques are employed to enhance marker detection and registration accuracy. Additionally, a vision-based pose compensation method is implemented to mitigate errors caused by patient movement, improving overall positional accuracy. Validation through phantom experiments and simulated clinical scenarios demonstrated an average puncture accuracy of 3.72 ± 1.21 mm. The system reduced needle placement time by 20 s compared to traditional marker-based methods. It also effectively corrected errors induced by patient movement, with a mean positional error of 0.38 pixels and an angular deviation of 0.51°. These results highlight the system's precision, adaptability, and reliability in realistic surgical conditions. This marker-free AR guidance system significantly streamlines surgical workflows while enhancing needle navigation accuracy. Its simplicity, cost-effectiveness, and adaptability make it an ideal solution for both high- and low-resource clinical environments, offering the potential for improved precision, reduced procedural time, and better patient outcomes.

AI-powered disease progression prediction in multiple sclerosis using magnetic resonance imaging: a systematic review and meta-analysis.

Houshi S, Khodakarami Z, Shaygannejad A, Khosravi F, Shaygannejad V

PubMed paper, Jul 12 2025
Disability progression despite disease-modifying therapy remains a major challenge in multiple sclerosis (MS). Artificial intelligence (AI) models exploiting magnetic resonance imaging (MRI) promise personalized prognostication, yet their real-world accuracy is uncertain. To systematically review and meta-analyze MRI-based AI studies predicting future disability progression in MS. Five databases were searched from inception to 17 May 2025 following PRISMA. Eligible studies used MRI in an AI model to forecast changes in the Expanded Disability Status Scale (EDSS) or equivalent metrics. Two reviewers conducted study selection, data extraction, and QUADAS-2 assessment. Random-effects meta-analysis was applied when ≥3 studies reported compatible regression statistics. Twenty-one studies with 12,252 MS patients met inclusion criteria. Five used regression on continuous EDSS, fourteen classification, one time-to-event, and one both. Conventional machine learning predominated (57% of studies), followed by deep learning (38%). Median classification area under the curve (AUC) was 0.78 (range 0.57-0.86); median regression root-mean-square error (RMSE) was 1.08 EDSS points. Pooled RMSE across regression studies was 1.31 (95% CI 1.02-1.60; I<sup>2</sup> = 95%). Deep learning conferred only marginal, non-significant gains over classical algorithms. External validation appeared in six studies; calibration, decision-curve analysis, and code releases were seldom reported. QUADAS-2 indicated generally low patient-selection bias but frequent index-test concerns. MRI-driven AI models predict MS disability progression with moderate accuracy, but error margins that exceed one EDSS point limit individual-level utility. Harmonized endpoints, larger multicenter cohorts, rigorous external validation, and prospective clinician-in-the-loop trials are essential before routine clinical adoption.
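A pooled estimate like the RMSE of 1.31 above typically comes from the DerSimonian-Laird random-effects recipe. The sketch below is generic, with made-up inputs; the review's per-study effects and variances are not given in the abstract.

```python
import numpy as np

def dersimonian_laird(effects, variances):
    # Standard DerSimonian-Laird random-effects pooling:
    # estimate between-study variance tau^2 from Cochran's Q, then
    # re-weight studies by 1 / (within-variance + tau^2).
    y = np.asarray(effects, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = 1.0 / v                                   # fixed-effect weights
    y_fe = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - y_fe) ** 2)               # Cochran's Q
    k = len(y)
    tau2 = max(0.0, (q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    w_re = 1.0 / (v + tau2)                       # random-effects weights
    pooled = np.sum(w_re * y) / np.sum(w_re)
    se = np.sqrt(1.0 / np.sum(w_re))
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se)

pooled, ci = dersimonian_laird([1.0, 1.0, 1.0], [0.04, 0.04, 0.04])
```

With heterogeneity as high as the reported I² = 95%, tau² dominates the weights, widening the confidence interval around the pooled RMSE.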

Accuracy of large language models in generating differential diagnosis from clinical presentation and imaging findings in pediatric cases.

Jung J, Phillipi M, Tran B, Chen K, Chan N, Ho E, Sun S, Houshyar R

PubMed paper, Jul 12 2025
Large language models (LLMs) have shown promise in assisting medical decision-making. However, there is limited literature exploring the diagnostic accuracy of LLMs in generating differential diagnoses from text-based image descriptions and clinical presentations in pediatric radiology. To examine the performance of multiple proprietary LLMs in producing accurate differential diagnoses for text-based pediatric radiological cases without imaging. One hundred sixty-four cases were retrospectively selected from a pediatric radiology textbook and converted into two formats: (1) image description only, and (2) image description with clinical presentation. The ChatGPT-4V, Claude 3.5 Sonnet, and Gemini 1.5 Pro algorithms were given these inputs and tasked with providing a top 1 diagnosis and a top 3 differential. Accuracy of responses was assessed by comparison with the original literature. Top 1 accuracy was defined as whether the top 1 diagnosis matched the textbook, and top 3 differential accuracy was defined as the number of diagnoses in the model-generated top 3 differential that matched any of the top 3 diagnoses in the textbook. McNemar's test, Cochran's Q test, the Friedman test, and the Wilcoxon signed-rank test were used to compare algorithms and to assess the impact of added clinical information. There was no significant difference in top 1 accuracy between ChatGPT-4V, Claude 3.5 Sonnet, and Gemini 1.5 Pro when only image descriptions were provided (56.1% [95% CI 48.4-63.5], 64.6% [95% CI 57.1-71.5], 61.6% [95% CI 54.0-68.7]; P = 0.11). Adding clinical presentation to the image description significantly improved top 1 accuracy for ChatGPT-4V (64.0% [95% CI 56.4-71.0], P = 0.02) and Claude 3.5 Sonnet (80.5% [95% CI 73.8-85.8], P < 0.001). For cases with both image description and clinical presentation, Claude 3.5 Sonnet significantly outperformed both ChatGPT-4V and Gemini 1.5 Pro (P < 0.001). For top 3 differential accuracy, no significant differences were observed between ChatGPT-4V, Claude 3.5 Sonnet, and Gemini 1.5 Pro, regardless of whether the cases included only image descriptions (1.29 [95% CI 1.16-1.41], 1.35 [95% CI 1.23-1.48], 1.37 [95% CI 1.25-1.49]; P = 0.60) or both image descriptions and clinical presentations (1.33 [95% CI 1.20-1.45], 1.52 [95% CI 1.41-1.64], 1.48 [95% CI 1.36-1.59]; P = 0.72). Only Claude 3.5 Sonnet performed significantly better when clinical presentation was added (P < 0.001). Commercial LLMs performed similarly on pediatric radiology cases in top 1 accuracy and top 3 differential accuracy when only a text-based image description was used. Adding clinical presentation significantly improved top 1 accuracy for ChatGPT-4V and Claude 3.5 Sonnet, with Claude showing the largest improvement. Claude 3.5 Sonnet outperformed both ChatGPT-4V and Gemini 1.5 Pro in top 1 accuracy when both image and clinical data were provided. No significant differences were found in top 3 differential accuracy across models in any condition.
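The two endpoints defined in the abstract can be computed mechanically. The sketch below is a hedged illustration (the function name and exact string-matching rule are our assumptions): top 1 accuracy is the fraction of cases where the first model diagnosis matches the textbook's first diagnosis, and top 3 differential accuracy is the mean count of model diagnoses appearing anywhere in the textbook's top 3.

```python
def topk_scores(predictions, truths, k=3):
    # predictions / truths: per-case ranked lists of diagnosis labels.
    top1_hits = 0
    overlap_total = 0
    for pred, truth in zip(predictions, truths):
        top1_hits += int(pred[0] == truth[0])
        overlap_total += len(set(pred[:k]) & set(truth[:k]))
    n = len(predictions)
    # (top 1 accuracy, mean number of top-k matches per case in 0..k)
    return top1_hits / n, overlap_total / n

top1, top3 = topk_scores([["a", "b", "c"], ["x", "y", "z"]],
                         [["a", "d", "e"], ["q", "y", "r"]])
```

Note the second score ranges from 0 to 3 per case, which is why the reported values (e.g., 1.29 to 1.52) exceed 1.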

Integrating Artificial Intelligence in Thyroid Nodule Management: Clinical Outcomes and Cost-Effectiveness Analysis.

Bodoque-Cubas J, Fernández-Sáez J, Martínez-Hervás S, Pérez-Lacasta MJ, Carles-Lavila M, Pallarés-Gasulla RM, Salazar-González JJ, Gil-Boix JV, Miret-Llauradó M, Aulinas-Masó A, Argüelles-Jiménez I, Tofé-Povedano S

PubMed paper, Jul 12 2025
The increasing incidence of thyroid nodules (TN) raises concerns about overdiagnosis and overtreatment. This study evaluates the clinical and economic impact of KOIOS, an FDA-approved artificial intelligence (AI) tool for the management of TN. A retrospective analysis was conducted on 176 patients who underwent thyroid surgery between May 2022 and November 2024. Ultrasound images were evaluated independently by an expert and novice operators using the American College of Radiology Thyroid Imaging Reporting and Data System (ACR-TIRADS), while KOIOS provided AI-adapted risk stratification. Sensitivity, specificity, and receiver operating characteristic (ROC) curve analyses were performed. The incremental cost-effectiveness ratio (ICER) was defined based on the number of optimal care interventions (FNAB and thyroid surgery). Both deterministic and probabilistic sensitivity analyses were conducted to evaluate model robustness. KOIOS AI demonstrated similar diagnostic performance to the expert operator (AUC: 0.794, 95% CI: 0.718-0.871 vs. 0.784, 95% CI: 0.706-0.861; p = 0.754) and significantly outperformed the novice operator (AUC: 0.619, 95% CI: 0.526-0.711; p < 0.001). ICER analysis estimated the cost per additional optimal care decision at -€8,085.56, indicating KOIOS as a dominant and cost-saving strategy when considering a third-party payer perspective over a one-year horizon. Deterministic sensitivity analysis identified surgical costs as the main drivers of variability, while probabilistic analysis consistently favored KOIOS as the optimal strategy. KOIOS AI is a cost-effective alternative, particularly in reducing overdiagnosis and overtreatment for benign TNs. Prospective, real-life studies are needed to validate these findings and explore long-term implications.
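The ICER is a one-line formula: the difference in cost divided by the difference in effect, here the count of optimal care interventions. A minimal sketch with illustrative numbers (the study's actual cost and effect inputs are not given in the abstract):

```python
def icer(cost_new, cost_old, effect_new, effect_old):
    # Incremental cost-effectiveness ratio: extra cost per extra unit
    # of effect. A negative ICER when the new strategy is also more
    # effective means it is dominant (cheaper AND better), which is the
    # pattern reported for KOIOS.
    return (cost_new - cost_old) / (effect_new - effect_old)

# Illustrative only: new strategy costs less and yields more optimal decisions.
ratio = icer(cost_new=90_000, cost_old=100_000, effect_new=110, effect_old=100)
```

Interpreting the sign requires checking both differences: a negative ratio can also arise from a costlier, less effective strategy, which would instead be dominated.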

Vision-language model for report generation and outcome prediction in CT pulmonary angiogram.

Zhong Z, Wang Y, Wu J, Hsu WC, Somasundaram V, Bi L, Kulkarni S, Ma Z, Collins S, Baird G, Ahn SH, Feng X, Kamel I, Lin CT, Greineder C, Atalay M, Jiao Z, Bai H

PubMed paper, Jul 12 2025
Accurate and comprehensive interpretation of pulmonary embolism (PE) from Computed Tomography Pulmonary Angiography (CTPA) scans remains a clinical challenge due to the limited specificity and structure of existing AI tools. We propose an agent-based framework that integrates Vision-Language Models (VLMs) for detecting 32 PE-related abnormalities and Large Language Models (LLMs) for structured report generation. Trained on over 69,000 CTPA studies from 24,890 patients across Brown University Health (BUH), Johns Hopkins University (JHU), and the INSPECT dataset from Stanford, the model demonstrates strong performance in abnormality classification and report generation. For abnormality classification, it achieved AUROC scores of 0.788 (BUH), 0.754 (INSPECT), and 0.710 (JHU), with corresponding BERT-F1 scores of 0.891, 0.829, and 0.842. The abnormality-guided reporting strategy consistently outperformed the organ-based and holistic captioning baselines. For survival prediction, a multimodal fusion model that incorporates imaging, clinical variables, diagnostic outputs, and generated reports achieved concordance indices of 0.863 (BUH) and 0.731 (JHU), outperforming traditional PESI scores. This framework provides a clinically meaningful and interpretable solution for end-to-end PE diagnosis, structured reporting, and outcome prediction.
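The concordance index reported for survival prediction (0.863 BUH, 0.731 JHU) measures, over comparable patient pairs, how often the higher predicted risk belongs to the patient with the earlier observed event. A simplified O(n²) sketch of Harrell's c-index (our illustration, not the paper's evaluation code):

```python
def concordance_index(times, events, risks):
    # times: observed times; events: 1 if event observed, 0 if censored;
    # risks: model risk scores (higher = predicted earlier event).
    concordant, comparable = 0.0, 0.0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when i's event is observed and
            # precedes j's time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    concordant += 0.5      # ties get half credit
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so 0.863 indicates a strongly ordered risk score.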

Ensemble of Weak Spectral Total Variation Learners: a PET-CT Case Study

Anna Rosenberg, John Kennedy, Zohar Keidar, Yehoshua Y. Zeevi, Guy Gilboa

arXiv preprint, Jul 11 2025
When solving computer vision problems through machine learning, one often encounters a lack of sufficient training data. To mitigate this, we propose the use of ensembles of weak learners based on spectral total-variation (STV) features (Gilboa 2014). The features are related to nonlinear eigenfunctions of the total-variation subgradient and can characterize textures well at various scales. It was shown (Burger et al. 2016) that, in the one-dimensional case, orthogonal features are generated, whereas in two dimensions the features are empirically lowly correlated. Ensemble learning theory advocates the use of lowly correlated weak learners. We thus propose here to design ensembles using learners based on STV features. To show the effectiveness of this paradigm we examine a hard real-world medical imaging problem: the predictive value of computed tomography (CT) data for high uptake in positron emission tomography (PET) for patients suspected of skeletal metastases. The database consists of 457 scans with 1524 unique pairs of registered CT and PET slices. Our approach is compared to deep-learning methods and to Radiomics features, showing STV learners perform best (AUC=0.87), compared to neural nets (AUC=0.75) and Radiomics (AUC=0.79). We observe that fine STV scales in CT images are especially indicative of the presence of high uptake in PET.

A View-Agnostic Deep Learning Framework for Comprehensive Analysis of 2D-Echocardiography

Anisuzzaman, D. M., Malins, J. G., Jackson, J. I., Lee, E., Naser, J. A., Rostami, B., Bird, J. G., Spiegelstein, D., Amar, T., Ngo, C. C., Oh, J. K., Pellikka, P. A., Thaden, J. J., Lopez-Jimenez, F., Poterucha, T. J., Friedman, P. A., Pislaru, S., Kane, G. C., Attia, Z. I.

medRxiv preprint, Jul 11 2025
Echocardiography traditionally requires experienced operators to select and interpret clips from specific viewing angles. Clinical decision-making is therefore limited for handheld cardiac ultrasound (HCU), which is often collected by novice users. In this study, we developed a view-agnostic deep learning framework to estimate left ventricular ejection fraction (LVEF), patient age, and patient sex from any of several views containing the left ventricle. Model performance was: (1) consistently strong across retrospective transthoracic echocardiography (TTE) datasets; (2) comparable between prospective HCU versus TTE (625 patients; LVEF r2 0.80 vs. 0.86, LVEF (>40% vs. ≤40%) AUC 0.981 vs. 0.993, age r2 0.85 vs. 0.87, sex classification AUC 0.985 vs. 0.996); (3) comparable between prospective HCU data collected by experts versus novice users (100 patients; LVEF r2 0.78 vs. 0.66, LVEF AUC 0.982 vs. 0.966). This approach may broaden the clinical utility of echocardiography by lessening the need for user expertise in image acquisition.