Talati IA, Chaves JMZ, Das A, Banerjee I, Rubin DL

PubMed paper · Sep 10, 2025
Background: The increasing complexity and volume of radiology reports present challenges for the timely communication of critical findings. Purpose: To evaluate the performance of two out-of-the-box LLMs in detecting and classifying critical findings in radiology reports using various prompt strategies. Methods: The analysis included 252 radiology reports of varying modalities and anatomic regions extracted from the MIMIC-III database, divided into a prompt engineering tuning set of 50 reports, a holdout test set of 125 reports, and a pool of 77 remaining reports used as examples for few-shot prompting. An external test set of 180 chest radiography reports was extracted from the CheXpert Plus database. Reports were manually reviewed to identify critical findings and classify each into one of three categories (true critical finding, known/expected critical finding, equivocal critical finding). Following prompt engineering with various prompt strategies, a final prompt optimized for true critical finding detection was selected. Two general-purpose LLMs, GPT-4 and Mistral-7B, processed reports in the test sets using the final prompt. Evaluation included automated text similarity metrics (BLEU-1, ROUGE-F1, G-Eval) and manual performance metrics (precision, recall). Results: For true critical findings, zero-shot, few-shot static (five examples), and few-shot dynamic (five examples) prompting yielded BLEU-1 of 0.691, 0.778, and 0.748; ROUGE-F1 of 0.706, 0.797, and 0.773; and G-Eval of 0.428, 0.573, and 0.516. Precision and recall for true critical findings, known/expected critical findings, and equivocal critical findings in the holdout test set for GPT-4 were 90.1% and 86.9%, 80.9% and 85.0%, and 80.5% and 94.3%; in the holdout test set for Mistral-7B were 75.6% and 77.4%, 34.1% and 70.0%, and 41.3% and 74.3%; in the external test set for GPT-4 were 82.6% and 98.3%, 76.9% and 71.4%, and 70.8% and 85.0%; and in the external test set for Mistral-7B were 75.0% and 93.1%, 33.3% and 92.9%, and 34.0% and 80.0%. Conclusion: Out-of-the-box LLMs were able to detect and classify arbitrary numbers of critical findings in radiology reports, with the best performance on true critical findings obtained using the few-shot static prompting approach. Clinical Impact: The study shows a role for contemporary general-purpose models in adapting to specialized medical tasks with minimal data annotation.
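
The zero-shot versus few-shot static prompting strategies compared above can be illustrated with a minimal sketch. The instruction wording, example format, and helper names below are assumptions for illustration, not the study's actual prompt.

```python
# Minimal sketch of zero-shot vs. few-shot static prompting for critical-finding
# extraction. The instruction text and example format are illustrative assumptions,
# not the prompt engineered in the study.

INSTRUCTION = (
    "You are a radiology assistant. List every critical finding in the report and "
    "label each as true, known/expected, or equivocal. Return one finding per line."
)

def build_prompt(report, examples=None):
    """Assemble a prompt. `examples` is a list of (report, annotated findings)
    pairs for few-shot static prompting, or None for zero-shot."""
    parts = [INSTRUCTION]
    for ex_report, ex_findings in (examples or []):
        parts.append(f"Report:\n{ex_report}\nCritical findings:\n{ex_findings}")
    parts.append(f"Report:\n{report}\nCritical findings:")
    return "\n\n".join(parts)

# Zero-shot uses no examples; few-shot static reuses the same curated examples
# (five in the study) for every report.
zero_shot = build_prompt("CT head: acute subdural hematoma with midline shift.")
few_shot = build_prompt(
    "CT head: acute subdural hematoma with midline shift.",
    examples=[("CXR: tension pneumothorax on the left.",
               "tension pneumothorax - true critical finding")],
)
print(few_shot)
```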

Lauren H. Cooke, Matthias Jung, Jan M. Brendel, Nora M. Kerkovits, Borek Foldyna, Michael T. Lu, Vineet K. Raghu

arXiv preprint · Sep 10, 2025
Chest radiographs (CXRs) are among the most common tests in medicine. Automated image interpretation may reduce radiologists' workload and expand access to diagnostic expertise. Deep learning multi-task and foundation models have shown strong performance for CXR interpretation but are vulnerable to shortcut learning, where models rely on spurious and off-target correlations rather than clinically relevant features to make decisions. We introduce RoentMod, a counterfactual image editing framework that generates anatomically realistic CXRs with user-specified, synthetic pathology while preserving unrelated anatomical features of the original scan. RoentMod combines an open-source medical image generator (RoentGen) with an image-to-image modification model without requiring retraining. In reader studies with board-certified radiologists and radiology residents, RoentMod-produced images appeared realistic in 93% of cases, correctly incorporated the specified finding in 89-99% of cases, and preserved native anatomy comparable to real follow-up CXRs. Using RoentMod, we demonstrate that state-of-the-art multi-task and foundation models frequently exploit off-target pathology as shortcuts, limiting their specificity. Incorporating RoentMod-generated counterfactual images during training mitigated this vulnerability, improving model discrimination across multiple pathologies by 3-19% AUC in internal validation and by 1-11% for 5 out of 6 tested pathologies in external testing. These findings establish RoentMod as a broadly applicable tool for probing and correcting shortcut learning in medical AI. By enabling controlled counterfactual interventions, RoentMod enhances the robustness and interpretability of CXR interpretation models and provides a generalizable strategy for improving foundation models in medical imaging.
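
A minimal sketch of how counterfactual images could be folded into supervised training, in the spirit of the mitigation described above; the dataset interface and the `counterfactual_of` editor are hypothetical stand-ins, not the released RoentMod interface.

```python
# Hedged sketch: mixing synthetic counterfactual CXRs into supervised training.
# `real_ds` is assumed to yield (image, multilabel tensor); `counterfactual_of`
# is a hypothetical stand-in for a RoentMod-style editor.
import random
from torch.utils.data import Dataset

class CounterfactualMixDataset(Dataset):
    def __init__(self, real_ds, counterfactual_of, finding_indices, p_cf=0.5):
        self.real_ds = real_ds
        self.counterfactual_of = counterfactual_of
        self.finding_indices = finding_indices   # label indices that can be synthesized
        self.p_cf = p_cf                         # fraction of samples replaced by edits

    def __len__(self):
        return len(self.real_ds)

    def __getitem__(self, idx):
        image, labels = self.real_ds[idx]
        if random.random() < self.p_cf:
            f = random.choice(self.finding_indices)
            image = self.counterfactual_of(image, f)   # add synthetic finding f
            labels = labels.clone()
            labels[f] = 1.0                            # label vector reflects the edit
        return image, labels
```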

Wenjun Yu, Yinchen Zhou, Jia-Xuan Jiang, Shubin Zeng, Yuee Li, Zhong Wang

arXiv preprint · Sep 10, 2025
Multimodal models have achieved remarkable success in natural image segmentation, yet they often underperform when applied to the medical domain. Through extensive study, we attribute this performance gap to the challenges of multimodal fusion, primarily the significant semantic gap between abstract textual prompts and fine-grained medical visual features, as well as the resulting feature dispersion. To address these issues, we revisit the problem from the perspective of semantic aggregation. Specifically, we propose an Expectation-Maximization (EM) Aggregation mechanism and a Text-Guided Pixel Decoder. The former mitigates feature dispersion by dynamically clustering features into compact semantic centers to enhance cross-modal correspondence. The latter is designed to bridge the semantic gap by leveraging domain-invariant textual knowledge to effectively guide deep visual representations. The synergy between these two mechanisms significantly improves the model's generalization ability. Extensive experiments on public cardiac and fundus datasets demonstrate that our method consistently outperforms existing SOTA approaches across multiple domain generalization benchmarks.
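
A minimal sketch of an EM-style aggregation step that clusters pixel features into a small set of compact semantic centers, assuming standard EM-attention conventions; the iteration count, temperature, and reconstruction step are illustrative, not the paper's exact module.

```python
# EM-style feature aggregation: E-step assigns pixels to centers, M-step updates
# the centers, and pixels are reconstructed from the compact centers.
import torch
import torch.nn.functional as F

def em_aggregate(feats, num_centers=8, iters=3, tau=1.0):
    """feats: (B, C, H, W) visual features -> (B, C, H, W) re-aggregated features."""
    B, C, H, W = feats.shape
    x = feats.flatten(2).transpose(1, 2)                  # (B, N, C), N = H*W
    mu = x[:, torch.randperm(H * W)[:num_centers], :]     # init centers from pixels
    for _ in range(iters):
        resp = F.softmax(x @ mu.transpose(1, 2) / tau, dim=-1)   # E-step: (B, N, K)
        mu = resp.transpose(1, 2) @ x                             # M-step: (B, K, C)
        mu = mu / (resp.sum(dim=1).unsqueeze(-1) + 1e-6)          # normalize by cluster mass
    out = resp @ mu                                        # reconstruct pixels from centers
    return out.transpose(1, 2).reshape(B, C, H, W)
```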

Peter Gänz, Steffen Kieß, Guangpu Yang, Jajnabalkya Guhathakurta, Tanja Pienkny, Charls Clark, Paul Tafforeau, Andreas Balles, Astrid Hölzing, Simon Zabler, Sven Simon

arXiv preprint · Sep 10, 2025
Multispectral computed tomography (CT) enables advanced material characterization by acquiring energy-resolved projection data. However, since the incoming X-ray flux is distributed across multiple narrow energy bins, the photon count per bin is greatly reduced compared to standard energy-integrated imaging. This inevitably introduces substantial noise, which can either prolong acquisition times to infeasible scan durations or degrade image quality with strong noise artifacts. To address this challenge, we present a dedicated neural network-based denoising approach tailored for multispectral CT projections acquired at the BM18 beamline of the ESRF. The method exploits redundancies across angular, spatial, and spectral domains through specialized sub-networks combined via stacked generalization and an attention mechanism. Non-local similarities in the angular-spatial domain are leveraged alongside correlations between adjacent energy bands in the spectral domain, enabling robust noise suppression while preserving fine structural details. Training was performed exclusively on simulated data replicating the physical and noise characteristics of the BM18 setup, with validation conducted on CT scans of custom-designed phantoms containing both high-Z and low-Z materials. The denoised projections and reconstructions demonstrate substantial improvements in image quality compared to classical denoising methods and baseline CNN models. Quantitative evaluations confirm that the proposed method achieves superior performance across a broad spectral range, generalizing effectively to real-world experimental data while significantly reducing noise without compromising structural fidelity.
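
A hedged sketch of the stacked-generalization idea: several denoising sub-networks produce candidate estimates, and a learned attention (gating) map blends them per pixel. The placeholder experts and the 1x1 gating layer below are assumptions, not the BM18 models.

```python
# Combining specialized denoising sub-networks via a learned per-pixel gate.
import torch
import torch.nn as nn

class StackedDenoiser(nn.Module):
    def __init__(self, sub_nets):
        super().__init__()
        self.sub_nets = nn.ModuleList(sub_nets)            # e.g. angular/spatial/spectral experts
        self.gate = nn.Conv2d(len(sub_nets), len(sub_nets), kernel_size=1)

    def forward(self, x):                                   # x: (B, 1, H, W) noisy projection
        outs = torch.cat([net(x) for net in self.sub_nets], dim=1)   # (B, K, H, W)
        weights = torch.softmax(self.gate(outs), dim=1)               # per-pixel expert weights
        return (weights * outs).sum(dim=1, keepdim=True)              # blended estimate

# Example with trivial placeholder experts:
experts = [nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(8, 1, 3, padding=1)) for _ in range(3)]
model = StackedDenoiser(experts)
denoised = model(torch.randn(2, 1, 64, 64))
```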

Zhihao Zhao, Yinzheng Zhao, Junjie Yang, Xiangtong Yao, Quanmin Liang, Shahrooz Faghihroohi, Kai Huang, Nassir Navab, M. Ali Nasseri

arXiv preprint · Sep 10, 2025
Recent advancements in foundation models, such as the Segment Anything Model (SAM), have significantly impacted medical image segmentation, especially in retinal imaging, where precise segmentation is vital for diagnosis. Despite this progress, current methods face critical challenges: 1) modality ambiguity in textual disease descriptions, 2) a continued reliance on manual prompting for SAM-based workflows, and 3) a lack of a unified framework, with most methods being modality- and task-specific. To overcome these hurdles, we propose CLIP-unified Auto-Prompt Segmentation (CLAPS), a novel method for unified segmentation across diverse tasks and modalities in retinal imaging. Our approach begins by pre-training a CLIP-based image encoder on a large, multi-modal retinal dataset to handle data scarcity and distribution imbalance. We then leverage GroundingDINO to automatically generate spatial bounding box prompts by detecting local lesions. To unify tasks and resolve ambiguity, we use text prompts enhanced with a unique "modality signature" for each imaging modality. Ultimately, these automated textual and spatial prompts guide SAM to execute precise segmentation, creating a fully automated and unified pipeline. Extensive experiments on 12 diverse datasets across 11 critical segmentation categories show that CLAPS achieves performance on par with specialized expert models while surpassing existing benchmarks across most metrics, demonstrating its broad generalizability as a foundation model.
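
A hedged sketch of such an auto-prompt pipeline: a text prompt carrying a modality signature drives a box detector, and the resulting boxes prompt SAM. The `detect_boxes` stand-in for the GroundingDINO step is hypothetical; the SAM calls follow the public segment_anything API, but this is not the authors' code.

```python
# Auto-prompt segmentation sketch: text prompt -> lesion boxes -> SAM masks.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def modality_text_prompt(finding: str, modality: str) -> str:
    # Append a unique "modality signature" to disambiguate imaging modalities.
    return f"{finding} [modality: {modality}]"

def detect_boxes(image: np.ndarray, text_prompt: str) -> np.ndarray:
    # Hypothetical placeholder for a GroundingDINO-style open-vocabulary detector.
    raise NotImplementedError("plug in a text-conditioned box detector here")

def auto_segment(image: np.ndarray, finding: str, modality: str, sam_ckpt: str):
    sam = sam_model_registry["vit_b"](checkpoint=sam_ckpt)
    predictor = SamPredictor(sam)
    predictor.set_image(image)                              # RGB HxWx3 uint8
    boxes = detect_boxes(image, modality_text_prompt(finding, modality))
    # Each box (x1, y1, x2, y2) becomes a spatial prompt for SAM.
    masks = [predictor.predict(box=b, multimask_output=False)[0] for b in boxes]
    return masks
```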

Payal Varshney, Adriano Lucieri, Christoph Balada, Sheraz Ahmed, Andreas Dengel

arXiv preprint · Sep 10, 2025
Video-based AI systems are increasingly adopted in safety-critical domains such as autonomous driving and healthcare. However, interpreting their decisions remains challenging due to the inherent spatiotemporal complexity of video data and the opacity of deep learning models. Existing explanation techniques often suffer from limited temporal coherence, insufficient robustness, and a lack of actionable causal insights. Current counterfactual explanation methods typically do not incorporate guidance from the target model, reducing semantic fidelity and practical utility. We introduce Latent Diffusion for Video Counterfactual Explanations (LD-ViCE), a novel framework designed to explain the behavior of video-based AI models. Compared to previous approaches, LD-ViCE reduces the computational costs of generating explanations by operating in latent space using a state-of-the-art diffusion model, while producing realistic and interpretable counterfactuals through an additional refinement step. Our experiments demonstrate the effectiveness of LD-ViCE across three diverse video datasets, including EchoNet-Dynamic (cardiac ultrasound), FERV39k (facial expression), and Something-Something V2 (action recognition). LD-ViCE outperforms a recent state-of-the-art method, achieving an increase in R2 score of up to 68% while reducing inference time by half. Qualitative analysis confirms that LD-ViCE generates semantically meaningful and temporally coherent explanations, offering valuable insights into the target model behavior. LD-ViCE represents a valuable step toward the trustworthy deployment of AI in safety-critical domains.
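
A minimal sketch of target-model-guided counterfactual search in a latent space: latents are nudged by the classifier's gradient toward a desired label and then decoded. The encoder, decoder, loss weights, and optimizer settings are placeholders for illustration, not the LD-ViCE diffusion and refinement procedure.

```python
# Gradient-based counterfactual search in latent space, guided by the target model.
import torch
import torch.nn.functional as F

def latent_counterfactual(video, target_class, encoder, decoder, classifier,
                          steps=50, lr=0.05, dist_weight=0.1):
    with torch.no_grad():
        z0 = encoder(video)                       # latent code of the original clip
    z = z0.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = decoder(z)
        logits = classifier(recon)
        target = torch.full(logits.shape[:1], target_class, device=logits.device)
        # Push the prediction toward the target class while staying near the original.
        loss = F.cross_entropy(logits, target) + dist_weight * (z - z0).pow(2).mean()
        loss.backward()
        opt.step()
    return decoder(z).detach()                    # decoded counterfactual video
```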

Faisal Ahmed

arXiv preprint · Sep 10, 2025
Chest X-ray (CXR) imaging remains one of the most widely used diagnostic tools for detecting pulmonary diseases such as tuberculosis (TB) and pneumonia. Recent advances in deep learning, particularly Vision Transformers (ViTs), have shown strong potential for automated medical image analysis. However, most ViT architectures are pretrained on natural images and require three-channel inputs, while CXR scans are inherently grayscale. To address this gap, we propose RepViT-CXR, a channel replication strategy that adapts single-channel CXR images into a ViT-compatible format without introducing additional information loss. We evaluate RepViT-CXR on three benchmark datasets. On the TB-CXR dataset, our method achieved an accuracy of 99.9% and an AUC of 99.9%, surpassing prior state-of-the-art methods such as Topo-CXR (99.3% accuracy, 99.8% AUC). For the Pediatric Pneumonia dataset, RepViT-CXR obtained 99.0% accuracy, with 99.2% recall, 99.3% precision, and an AUC of 99.0%, outperforming strong baselines including DCNN and VGG16. On the Shenzhen TB dataset, our approach achieved 91.1% accuracy and an AUC of 91.2%, marking a performance improvement over previously reported CNN-based methods. These results demonstrate that a simple yet effective channel replication strategy allows ViTs to fully leverage their representational power on grayscale medical imaging tasks. RepViT-CXR establishes a new state of the art for TB and pneumonia detection from chest X-rays, showing strong potential for deployment in real-world clinical screening systems.
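
The channel-replication idea itself is a one-liner; a minimal sketch, assuming a standard (B, 1, H, W) tensor layout, is shown below. The shapes are illustrative, not tied to the paper's preprocessing.

```python
# Replicate the single grayscale channel so RGB-pretrained ViTs accept CXR input.
import torch

def replicate_channels(cxr: torch.Tensor) -> torch.Tensor:
    """cxr: (B, 1, H, W) grayscale -> (B, 3, H, W) by repeating the channel."""
    return cxr.repeat(1, 3, 1, 1)

x = torch.rand(4, 1, 224, 224)       # batch of grayscale chest X-rays
x3 = replicate_channels(x)           # now compatible with ImageNet-pretrained ViTs
assert x3.shape == (4, 3, 224, 224)
```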

Zheng Yang, Yanteng Zhang, Xupeng Kou, Yang Liu, Chao Ren

arXiv preprint · Sep 10, 2025
Structural magnetic resonance imaging (sMRI) combined with deep learning has achieved remarkable progress in the prediction and diagnosis of Alzheimer's disease (AD). Existing studies have used CNNs and transformers to build well-performing networks, but most rely on pretraining or ignore the asymmetry caused by brain disorders. We propose an end-to-end network for detecting disease-related asymmetry induced by left and right brain atrophy, consisting of a 3D CNN encoder and a Symmetry Interactive Transformer (SIT). Following an inter-equal grid block fetch operation, the corresponding left and right hemisphere features are aligned and subsequently fed into the SIT for diagnostic analysis. The SIT helps the model focus on regions of asymmetry caused by structural changes, thus improving diagnostic performance. We evaluated our method on the ADNI dataset, and the results show that it achieves better diagnostic accuracy (92.5%) compared to several CNN methods and CNNs combined with a general transformer. The visualization results show that our network pays more attention to regions of brain atrophy, especially the asymmetric pathological characteristics induced by AD, demonstrating the interpretability and effectiveness of the method.
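
A hedged sketch of hemisphere-aware cross-attention: features are split at the midline, the right half is mirrored so voxels align with their left counterparts, and the left tokens attend to the mirrored right tokens. This simplifies the Symmetry Interactive Transformer and omits the grid block fetch step; dimensions and head counts are assumptions.

```python
# Left/right hemisphere split, mirroring, and cross-attention over aligned voxels.
import torch
import torch.nn as nn

class HemisphereCrossAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):                           # feats: (B, C, D, H, W)
        B, C, D, H, W = feats.shape
        left, right = feats[..., : W // 2], feats[..., W // 2:]
        right = torch.flip(right, dims=[-1])            # mirror for voxel-wise alignment
        l = left.flatten(2).transpose(1, 2)              # (B, N, C) token sequences
        r = right.flatten(2).transpose(1, 2)
        out, attn_w = self.attn(query=l, key=r, value=r)  # left attends to mirrored right
        return out, attn_w                                # asymmetry-aware left features

sit = HemisphereCrossAttention(dim=32)
y, w = sit(torch.randn(1, 32, 8, 8, 8))
```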

Riera-Marín M, O K S, Rodríguez-Comas J, May MS, Pan Z, Zhou X, Liang X, Erick FX, Prenner A, Hémon C, Boussot V, Dillenseger JL, Nunes JC, Qayyum A, Mazher M, Niederer SA, Kushibar K, Martín-Isla C, Radeva P, Lekadir K, Barfoot T, Garcia Peraza Herrera LC, Glocker B, Vercauteren T, Gago L, Englemann J, Kleiss JM, Aubanell A, Antolin A, García-López J, González Ballester MA, Galdrán A

PubMed paper · Sep 10, 2025
Deep learning (DL) has become the dominant approach for medical image segmentation, yet ensuring the reliability and clinical applicability of these models requires addressing key challenges such as annotation variability, calibration, and uncertainty estimation. This is why we created the Calibration and Uncertainty for multiRater Volume Assessment in multiorgan Segmentation (CURVAS) challenge, which highlights the critical role of multiple annotators in establishing a more comprehensive ground truth, emphasizing that segmentation is inherently subjective and that leveraging inter-annotator variability is essential for robust model evaluation. Seven teams participated in the challenge, submitting a variety of DL models evaluated using metrics such as Dice Similarity Coefficient (DSC), Expected Calibration Error (ECE), and Continuous Ranked Probability Score (CRPS). By incorporating consensus and dissensus ground truth, we assess how DL models handle uncertainty and whether their confidence estimates align with true segmentation performance. Our findings reinforce the importance of well-calibrated models, as better calibration is strongly correlated with the quality of the results. Furthermore, we demonstrate that segmentation models trained on diverse datasets and enriched with pre-trained knowledge exhibit greater robustness, particularly in cases deviating from standard anatomical structures. Notably, the best-performing models achieved high DSC and well-calibrated uncertainty estimates. This work underscores the need for multi-annotator ground truth, thorough calibration assessments, and uncertainty-aware evaluations to develop trustworthy and clinically reliable DL-based medical image segmentation models.
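
A minimal sketch of one of the calibration metrics named above, Expected Calibration Error (ECE), for a binary segmentation map; the bin count and binary thresholding are assumptions, and the challenge's exact protocol may differ.

```python
# ECE: bin pixels by confidence and average the gap between mean confidence and
# observed accuracy, weighted by bin size.
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """probs, labels: arrays of predicted foreground probability and {0,1} truth."""
    probs, labels = probs.ravel(), labels.ravel()
    conf = np.where(probs >= 0.5, probs, 1.0 - probs)        # confidence in predicted class
    correct = (probs >= 0.5) == (labels == 1)
    edges = np.linspace(0.5, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf >= lo) & (conf < hi) if hi < 1.0 else (conf >= lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece
```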

Zhang G, Chen Z, Huo J, do Rio JN, Komninos C, Liu Y, Sparks R, Ourselin S, Bergeles C, Jackson TL

PubMed paper · Sep 10, 2025
Domain generalization techniques involve training a model on one set of domains and evaluating its performance on different, unseen domains. In contrast, test-time adaptation optimizes the model specifically for the target domain during inference. Both approaches improve diagnostic accuracy in medical imaging models. However, no research to date has leveraged the advantages of both approaches in an end-to-end fashion. Our paper introduces RetiGen, a test-time optimization framework designed to be integrated with existing domain generalization approaches. With an emphasis on the ophthalmic imaging domain, RetiGen leverages unlabeled multi-view color fundus photographs, a critical optical technology in retinal diagnostics. By utilizing information from multiple viewing angles, our approach significantly enhances the robustness and accuracy of machine learning models when applied across different domains. By integrating class balancing, test-time adaptation, and a multi-view optimization strategy, RetiGen effectively addresses the persistent issue of domain shift, which often hinders the performance of imaging models. Experimental results demonstrate that our method outperforms state-of-the-art techniques in both domain generalization and test-time optimization. Specifically, RetiGen improves generalization on the MFIDDR dataset, increasing the AUC from 0.751 to 0.872, a 0.121 improvement. Similarly, on the DRTiD dataset, the AUC increased from 0.794 to 0.879, a 0.085 improvement. The code for RetiGen is publicly available at https://github.com/RViMLab/RetiGen.
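
A hedged sketch of how multi-view test-time adaptation can work in principle: predictions from several views of the same eye are fused by averaging, after a Tent-style entropy-minimization step that updates only normalization parameters (the model is assumed to contain BatchNorm or LayerNorm layers). This illustrates the general idea under those assumptions, not RetiGen itself, which also includes class balancing.

```python
# Entropy-minimization test-time adaptation followed by multi-view fusion.
import torch
import torch.nn.functional as F

def adapt_and_predict(model, views, steps=1, lr=1e-3):
    """views: (V, C, H, W) multi-view images of one eye; returns fused class probabilities."""
    # Adapt only the affine parameters of normalization layers.
    params = [p for m in model.modules()
              if isinstance(m, (torch.nn.BatchNorm2d, torch.nn.LayerNorm))
              for p in m.parameters()]
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(steps):
        logits = model(views)                               # (V, num_classes)
        probs = logits.softmax(dim=1)
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()
        opt.zero_grad()
        entropy.backward()                                   # minimize prediction entropy
        opt.step()
    with torch.no_grad():
        return model(views).softmax(dim=1).mean(dim=0)       # fuse views by averaging
```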