Latest Papers on Radiology AI. Tags: Benchmark SOTA

Real-world clinical impact of three commercial AI algorithms on musculoskeletal radiography interpretation: A prospective crossover reader study.

Prucker P, Lemke T, Mertens CJ, Ziegelmayer S, Graf MM, Weller D, Kim SH, Gassert FT, Kader A, Dorfner FJ, Meddeb A, Makowski MR, Lammert J, Huber T, Lohöfer F, Bressem KK, Adams LC, Luiken I, Busch F

•papers•Sep 17 2025

To prospectively assess the diagnostic performance, workflow efficiency, and clinical impact of three commercial deep-learning tools (BoneView, Rayvolve, RBfracture) for routine musculoskeletal radiograph interpretation. From January to March 2025, two radiologists (4 and 5 years' experience) independently interpreted 1,037 adult musculoskeletal studies (2,926 radiographs) first unaided and, after 14-day washouts, with each AI tool in a randomized crossover design. Ground truth was established by confirmatory CT when available. Outcomes included sensitivity, specificity, accuracy, area under the receiver operating characteristic curve (AUC), interpretation time, diagnostic confidence (5-point Likert), and rates of additional CT recommendations and senior consultations. DeLong tests compared AUCs; Mann-Whitney U and χ² tests assessed secondary endpoints. AI assistance did not significantly change performance for fractures, dislocations, or effusions. For fractures, AUCs were comparable to baseline (Reader 1: 96.50 % vs. 96.30-96.50 %; Reader 2: 95.35 % vs. 95.97 %; all p > 0.11). For dislocations, baseline AUCs (Reader 1: 92.66 %; Reader 2: 90.68 %) were unchanged with AI (92.76-93.95 % and 92.00 %; p ≥ 0.280). For effusions, baseline AUCs (Reader 1: 92.52 %; Reader 2: 96.75 %) were similar with AI (93.12 % and 96.99 %; p ≥ 0.157). Median interpretation times decreased with AI (Reader 1: 34 s to 21-25 s; Reader 2: 30 s to 21-26 s; all p < 0.001). Confidence improved across tools: BoneView increased combined "very good/excellent" ratings versus unaided reads (Reader 1: 509 vs. 449, p < 0.001; Reader 2: 483 vs. 439, p < 0.001); Rayvolve (Reader 1: 456 vs. 449, p = 0.029; Reader 2: 449 vs. 439, p < 0.001) and RBfracture (Reader 1: 457 vs. 449, p = 0.017; Reader 2: 448 vs. 439, p = 0.001) yielded smaller but significant gains. Reader 1 recommended fewer CT scans with AI assistance (33 vs. 22-23, p = 0.007).
In a real-world clinical setting, AI-assisted interpretation of musculoskeletal radiographs reduced reading time and increased diagnostic confidence without materially affecting diagnostic performance. These findings support AI assistance as a lever for workflow efficiency and potential cost-effectiveness at scale.

X-Ray Detection Musculoskeletal Prospective Clinical Pilot Startup Benchmark SOTA

Automating classification of treatment responses to combined targeted therapy and immunotherapy in HCC.

Quan B, Dai M, Zhang P, Chen S, Cai J, Shao Y, Xu P, Li P, Yu L

•papers•Sep 17 2025

Tyrosine kinase inhibitors (TKIs) combined with immunotherapy regimens are now widely used for treating advanced hepatocellular carcinoma (HCC), but their clinical efficacy is limited to a subset of patients. Considering that the vast majority of advanced HCC patients lose the opportunity for liver resection and thus cannot provide tumor tissue samples, we leveraged the clinical and image data to construct a multimodal convolutional neural network (CNN)-Transformer model for predicting and analyzing tumor response to TKI-immunotherapy. An automatic liver tumor segmentation system, based on a two-stage 3D U-Net framework, delineates lesions by first segmenting the liver parenchyma and then precisely localizing the tumor. This approach effectively addresses the variability in clinical data and significantly reduces bias introduced by manual intervention. Thus, we developed a clinical model using only pre-treatment clinical information, a CNN model using only pre-treatment magnetic resonance imaging data, and an advanced multimodal CNN-Transformer model that fused imaging and clinical parameters using a training cohort (n = 181) and then validated them using an independent cohort (n = 30). In the validation cohort, the area under the curve (95% confidence interval) values were 0.720 (0.710-0.731), 0.695 (0.683-0.707), and 0.785 (0.760-0.810), respectively, indicating that the multimodal model significantly outperformed the single-modality baseline models across validations. Finally, single-cell sequencing with the surgical tumor specimens reveals tumor ecosystem diversity associated with treatment response, providing a preliminary biological validation for the prediction model. In summary, this multimodal model effectively integrates imaging and clinical features of HCC patients, has a superior performance in predicting tumor response to TKI-immunotherapy, and provides a reliable tool for optimizing personalized treatment strategies.

MRI Classification Abdominal Retrospective Clinical In Silico Academic Lab Benchmark SOTA

Accuracy of Foundation AI Models for Hepatic Macrovesicular Steatosis Quantification in Frozen Sections

Koga, S., Guda, A., Wang, Y., Sahni, A., Wu, J., Rosen, A., Nield, J., Nandish, N., Patel, K., Goldman, H., Rajapakse, C., Walle, S., Kristen, S., Tondon, R., Alipour, Z.

•preprint•Sep 17 2025

Introduction: Accurate intraoperative assessment of macrovesicular steatosis in donor liver biopsies is critical for transplantation decisions but is often limited by inter-observer variability and freezing artifacts that can obscure histological details. Artificial intelligence (AI) offers a potential solution for standardized and reproducible evaluation. The objective was to evaluate the diagnostic performance of two self-supervised learning (SSL)-based foundation models, Prov-GigaPath and UNI, for classifying macrovesicular steatosis in frozen liver biopsy sections, compared with assessments by surgical pathologists. Methods: We retrospectively analyzed 131 frozen liver biopsy specimens from 68 donors collected between November 2022 and September 2024. Slides were digitized into whole-slide images, tiled into patches, and used to extract embeddings with Prov-GigaPath and UNI; slide-level classifiers were then trained and tested. Intraoperative diagnoses by on-call surgical pathologists were compared with ground truth determined from independent reviews of permanent sections by two liver pathologists. Accuracy was evaluated for both five-category classification and a clinically significant binary threshold (<30% vs. ≥30%). Results: For binary classification, Prov-GigaPath achieved 96.4% accuracy, UNI 85.7%, and surgical pathologists 84.0% (P = .22). In five-category classification, accuracies were lower: Prov-GigaPath 57.1%, UNI 50.0%, and pathologists 58.7% (P = .70). Misclassification primarily occurred in intermediate categories (5%-<30% steatosis). Conclusions: SSL-based foundation models performed comparably to surgical pathologists in classifying macrovesicular steatosis at the clinically relevant <30% vs. ≥30% threshold. These findings support the potential role of AI in standardizing intraoperative evaluation of donor liver biopsies; however, the small sample size limits generalizability and requires validation in larger, balanced cohorts.

Mixed Modality Classification Abdominal Retrospective Clinical In Silico Academic Lab Benchmark SOTA

Video Transformer for Segmentation of Echocardiography Images in Myocardial Strain Measurement.

Huang KC, Lin CE, Lin DS, Lin TT, Wu CK, Jeng GS, Lin LY, Lin LC

•papers•Sep 17 2025

The adoption of left ventricular global longitudinal strain (LVGLS) is still restricted by variability among various vendors and observers, despite advancements from tissue Doppler to speckle tracking imaging, machine learning, and, more recently, convolutional neural network (CNN)-based segmentation strain analysis. While CNNs have enabled fully automated strain measurement, they are inherently constrained by restricted receptive fields and a lack of temporal consistency. Transformer-based networks have emerged as a powerful alternative in medical imaging, offering enhanced global attention. Among these, the Video Swin Transformer (V-SwinT) architecture, with its 3D-shifted windows and locality inductive bias, is particularly well suited for ultrasound imaging, providing temporal consistency while optimizing computational efficiency. In this study, we propose the DTHR-SegStrain model based on a V-SwinT backbone. This model incorporates contour regression and utilizes an FCN-style multiscale feature fusion. As a result, it can generate accurate and temporally consistent left ventricle (LV) contours, allowing for direct calculation of myocardial strain without the need for conversion from segmentation to contours or any additional postprocessing. Compared to EchoNet-dynamic and Unity-GLS, DTHR-SegStrain showed greater efficiency, reliability, and validity in LVGLS measurements. Furthermore, the hybridization experiments assessed the interaction between segmentation models and strain algorithms, reinforcing that consistent segmentation contours over time can simplify strain calculations and decrease measurement variability. These findings emphasize the potential of V-SwinT-based frameworks to enhance the standardization and clinical applicability of LVGLS assessments.
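The abstract notes that temporally consistent contours allow strain to be calculated directly. A minimal sketch of that idea, assuming each frame's LV contour is an open polyline of (x, y) points and using the standard Lagrangian strain definition relative to the end-diastolic frame; the function names are hypothetical and this is not the DTHR-SegStrain implementation:

```python
import math


def contour_length(points):
    """Length of an open polyline given as a list of (x, y) points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def longitudinal_strain(contours, ed_index=0):
    """Lagrangian strain (%) of every frame relative to the end-diastolic frame.

    contours: list of per-frame contours, each a list of (x, y) points.
    ed_index: index of the end-diastolic frame (assumed known here).
    """
    l0 = contour_length(contours[ed_index])
    if l0 <= 0:
        raise ValueError("end-diastolic contour must have positive length")
    return [100.0 * (contour_length(c) - l0) / l0 for c in contours]


def peak_gls(contours, ed_index=0):
    """Peak (most negative) longitudinal strain over the cycle, in percent."""
    return min(longitudinal_strain(contours, ed_index))
```

Because strain is a ratio of contour lengths, any frame-to-frame jitter in the contour feeds straight into the strain curve, which is why temporal consistency of the segmentation matters for measurement variability.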

Ultrasound Segmentation Cardiac Methodology In Silico Academic Lab Benchmark SOTA

A Deep Learning Framework for Synthesizing Longitudinal Infant Brain MRI during Early Development.

Fang Y, Xiong H, Huang J, Liu F, Shen Z, Cai X, Zhang H, Wang Q

•papers•Sep 17 2025

"Just Accepted" papers have undergone full peer review and have been accepted for publication in Radiology: Artificial Intelligence. This article will undergo copyediting, layout, and proof review before it is published in its final version. Please note that during production of the final copyedited article, errors may be discovered which could affect the content. Purpose: To develop a three-stage, age- and modality-conditioned framework to synthesize longitudinal infant brain MRI scans, and account for rapid structural and contrast changes during early brain development. Materials and Methods: This retrospective study used T1- and T2-weighted MRI scans (848 scans) from 139 infants in the Baby Connectome Project, collected since September 2016. The framework models three critical image cues: volumetric expansion, cortical folding, and myelination, predicting missing time points with age and modality as predictive factors. The method was compared with LGAN, CounterSyn, and a diffusion-based approach using peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and the Dice similarity coefficient (DSC). Results: The framework was trained on 119 participants (average age: 11.25 ± 6.16 months, 60 female, 59 male) and tested on 20 (average age: 12.98 ± 6.59 months, 11 female, 9 male). For T1-weighted images, PSNRs were 25.44 ± 1.95 and 26.93 ± 2.50 for forward and backward MRI synthesis, and SSIMs were 0.87 ± 0.03 and 0.90 ± 0.02. For T2-weighted images, PSNRs were 26.35 ± 2.30 and 26.40 ± 2.56, with SSIMs of 0.87 ± 0.03 and 0.89 ± 0.02, significantly outperforming competing methods (P < .001). The framework also excelled in tissue segmentation (P < .001) and cortical reconstruction, achieving DSC of 0.85 for gray matter and 0.86 for white matter, with intraclass correlation coefficients exceeding 0.8 in most cortical regions.
Conclusion: The proposed three-stage framework effectively synthesized age-specific infant brain MRI scans, outperforming competing methods in image quality and tissue segmentation with strong performance in cortical reconstruction, demonstrating potential for developmental modeling and longitudinal analyses. ©RSNA, 2025.
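For readers unfamiliar with two of the reported metrics, the sketch below shows the standard definitions of PSNR and the Dice similarity coefficient on flattened intensity and mask lists. It assumes intensities scaled to [0, max_value]; the study's own preprocessing and intensity range are not stated in the abstract, so the function names and defaults here are illustrative only:

```python
import math


def psnr(reference, synthesized, max_value=1.0):
    """Peak signal-to-noise ratio (dB) between two equal-length intensity lists."""
    if len(reference) != len(synthesized) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - s) ** 2 for r, s in zip(reference, synthesized)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def dice(mask_a, mask_b):
    """Dice similarity coefficient between two equal-length binary masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must be the same length")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Two empty masks are treated as a perfect match by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

PSNR depends on the assumed maximum intensity, so values are only comparable between methods evaluated with the same normalization.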

MRI Image Synthesis Neurological Retrospective Clinical In Silico Benchmark SOTA

Multimodal deep learning integration for predicting renal function outcomes in living donor kidney transplantation: a retrospective cohort study.

Kim JM, Jung H, Kwon HE, Ko Y, Jung JH, Shin S, Kim YH, Kim YH, Jun TJ, Kwon H

•papers•Sep 17 2025

Accurately predicting post-transplant renal function is essential for optimizing donor-recipient matching and improving long-term outcomes in kidney transplantation (KT). Traditional models using only structured clinical data often fail to account for complex biological and anatomical factors. This study aimed to develop and validate a multimodal deep learning model that integrates computed tomography (CT) imaging, radiology report text, and structured clinical variables to predict 1-year estimated glomerular filtration rate (eGFR) in living donor kidney transplantation (LDKT) recipients. A retrospective cohort of 1,937 LDKT recipients was selected from 3,772 KT cases. Exclusions included deceased donor KT, immunologic high-risk recipients (n = 304), missing CT imaging, early graft complications, and anatomical abnormalities. eGFR at 1 year post-transplant was classified into four categories: > 90, 75-90, 60-75, and 45-60 mL/min/1.73 m². Radiology reports were embedded using BioBERT, while CT videos were encoded using a CLIP-based visual extractor. These were fused with structured clinical features and input into ensemble classifiers including XGBoost. Model performance was evaluated using cross-validation and SHapley Additive exPlanations (SHAP) analysis. The full multimodal model achieved a macro F1 score of 0.675, micro F1 score of 0.704, and weighted F1 score of 0.698, substantially outperforming the clinical-only model (macro F1 = 0.292). CT imaging contributed more than text data (clinical + CT macro F1 = 0.651; clinical + text = 0.486). The model showed highest accuracy in the >90 (F1 = 0.7773) and 60-75 (F1 = 0.7303) categories. SHAP analysis identified donor age, BMI, and donor sex as key predictors. Dimensionality reduction confirmed internal feature validity. Multimodal deep learning integrating clinical, imaging, and textual data enhances prediction of post-transplant renal function.
This framework offers a robust and interpretable approach for individualized risk stratification in LDKT, supporting precision medicine in transplantation.
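The abstract reports macro, micro, and weighted F1 for a four-class problem. As a reminder of how these three averages differ, here is a minimal sketch using the standard definitions for single-label multiclass predictions; the function name and the toy labels in the usage note are hypothetical and unrelated to the study's data:

```python
from collections import Counter


def f1_averages(y_true, y_pred):
    """Return per-class, macro, micro, and support-weighted F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    per_class = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        per_class[label] = 2 * tp[label] / denom if denom else 0.0
    # Macro: unweighted mean over classes, so rare classes count equally.
    macro = sum(per_class.values()) / len(labels)
    # Micro: pool all counts first, so frequent classes dominate.
    pooled = 2 * sum(tp.values()) + sum(fp.values()) + sum(fn.values())
    micro = 2 * sum(tp.values()) / pooled if pooled else 0.0
    # Weighted: mean over classes weighted by true-class frequency.
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return {"per_class": per_class, "macro": macro, "micro": micro, "weighted": weighted}
```

For example, `f1_averages(["a", "a", "a", "b"], ["a", "a", "b", "b"])` gives a lower macro score than micro score because the minority class "b" is predicted less precisely. A large gap between macro and micro F1 in a real model usually points to uneven performance across classes.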

CT Classification Abdominal Retrospective Clinical In Silico Academic Lab Benchmark SOTA

Influence of Mammography Acquisition Parameters on AI and Radiologist Interpretive Performance.

Lotter W, Hippe DS, Oshiro T, Lowry KP, Milch HS, Miglioretti DL, Elmore JG, Lee CI, Hsu W

•papers•Sep 17 2025

"Just Accepted" papers have undergone full peer review and have been accepted for publication in Radiology: Artificial Intelligence. This article will undergo copyediting, layout, and proof review before it is published in its final version. Please note that during production of the final copyedited article, errors may be discovered which could affect the content. Purpose: To evaluate the impact of screening mammography acquisition parameters on the interpretive performance of AI and radiologists. Materials and Methods: The associations between seven mammogram acquisition parameters (mammography machine version, kVp, x-ray exposure delivered, relative x-ray exposure, paddle size, compression force, and breast thickness) and AI and radiologist performance in interpreting two-dimensional screening mammograms acquired by a diverse health system between December 2010 and 2019 were retrospectively evaluated. The top 11 AI models and the ensemble model from the Digital Mammography DREAM Challenge were assessed. The associations between each acquisition parameter and the sensitivity and specificity of the AI models and the radiologists' interpretations were separately evaluated using generalized estimating equations-based models at the examination level, adjusted for several clinical factors. Results: The dataset included 28,278 screening two-dimensional mammograms from 22,626 women (mean age 58.5 years ± 11.5 [SD]; 4913 women had multiple mammograms). Of these, 324 examinations resulted in breast cancer diagnosis within 1 year. The acquisition parameters were significantly associated with the performance of both AI and radiologists, with absolute effect sizes reaching 10% for sensitivity and 5% for specificity; however, the associations differed between AI and radiologists for several parameters. Increased exposure delivered reduced the specificity for the ensemble AI (-4.5% per 1 SD increase; P < .001) but not radiologists (P = .44).
Increased compression force reduced the specificity for radiologists (-1.3% per 1 SD increase; P < .001) but not for AI (P = .60). Conclusion: Screening mammography acquisition parameters impacted the performance of both AI and radiologists, with some parameters impacting performance differently. ©RSNA, 2025.

Mammography Classification Breast Retrospective Clinical In Silico Consortium Benchmark SOTA

Non-iterative and uncertainty-aware MRI-based liver fat estimation using an unsupervised deep learning method.

Meneses JP, Tejos C, Makalic E, Uribe S

•papers•Sep 17 2025

Liver proton density fat fraction (PDFF), the ratio between fat-only and overall proton densities, is an extensively validated biomarker associated with several diseases. In recent years, numerous deep learning-based methods for estimating PDFF have been proposed to optimize acquisition and post-processing times without sacrificing accuracy, compared to conventional methods. However, the lack of interpretability and the often poor generalizability of these DL-based models undermine the adoption of such techniques in clinical practice. In this work, we propose an Artificial Intelligence-based Decomposition of water and fat with Echo Asymmetry and Least-squares (AI-DEAL) method, designed to estimate both proton density fat fraction (PDFF) and the associated uncertainty maps. Once trained, AI-DEAL performs a one-shot MRI water-fat separation by first calculating the nonlinear confounder variables, R2∗ and off-resonance field. It then employs a weighted least squares approach to compute water-only and fat-only signals, along with their corresponding covariance matrix, which are subsequently used to derive the PDFF and its associated uncertainty. We validated our method using in vivo liver CSE-MRI, a fat-water phantom, and a numerical phantom. AI-DEAL demonstrated PDFF biases of 0.25% and -0.12% at two liver ROIs, outperforming state-of-the-art deep learning-based techniques. Although trained using in vivo data, our method exhibited PDFF biases of -3.43% in the fat-water phantom and -0.22% in the numerical phantom with no added noise. The latter bias remained approximately constant when noise was introduced. Furthermore, the estimated uncertainties showed good agreement with the observed errors and the variations within each ROI, highlighting their potential value for assessing the reliability of the resulting PDFF maps.
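The final step described above, deriving PDFF and its uncertainty from water-only and fat-only signals and their covariance, can be illustrated with first-order error propagation. This is a sketch under simplifying assumptions (real-valued, non-negative signal magnitudes at a single voxel and a 2x2 covariance supplied by the caller); the abstract does not give AI-DEAL's exact derivation, and the function name is hypothetical:

```python
import math


def pdff_with_uncertainty(water, fat, var_water, var_fat, cov_water_fat=0.0):
    """Return (PDFF, standard deviation) for one voxel.

    PDFF = fat / (water + fat). The standard deviation uses the
    first-order (delta method) approximation, which is an assumption
    of this sketch rather than the published method.
    """
    total = water + fat
    if total <= 0:
        raise ValueError("water + fat must be positive")
    pdff = fat / total
    # Partial derivatives of PDFF with respect to fat and water.
    d_fat = water / total ** 2
    d_water = -fat / total ** 2
    variance = (
        d_fat ** 2 * var_fat
        + d_water ** 2 * var_water
        + 2.0 * d_fat * d_water * cov_water_fat
    )
    return pdff, math.sqrt(max(variance, 0.0))
```

For instance, `pdff_with_uncertainty(80.0, 20.0, 4.0, 1.0)` yields a PDFF of 0.2 with a standard deviation of roughly 0.009. A per-voxel standard deviation of this kind is what makes it possible to flag regions where the PDFF map is unreliable.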

MRI Segmentation Abdominal Retrospective Clinical In Silico Academic Lab Benchmark SOTA

Habitat-aware radiomics and adaptive 2.5D deep learning predict treatment response and long-term survival in ESCC patients undergoing neoadjuvant chemoimmunotherapy.

Gao X, Yang L, She T, Wang F, Ding H, Lu Y, Xu Y, Wang Y, Li P, Duan X, Leng X

•papers•Sep 17 2025

Current radiomic approaches inadequately resolve spatial intratumoral heterogeneity (ITH) in esophageal squamous cell carcinoma (ESCC), limiting neoadjuvant chemoimmunotherapy (NACI) response prediction. We propose an interpretable multimodal framework to: (1) quantitatively map intra-/peritumoral heterogeneity via voxel-wise habitat radiomics; (2) model cross-sectional tumor biology using 2.5D deep learning; and (3) establish mechanism-driven biomarkers via SHAP interpretability to identify resistance-linked subregions. This dual-center retrospective study analyzed 269 treatment-naïve ESCC patients with baseline PET/CT (training: n = 144; validation: n = 62; test: n = 63). Habitat radiomics delineated tumor subregions via K-means clustering (Calinski-Harabasz-optimized) on PET/CT, extracting 1,834 radiomic features per modality. A multi-stage pipeline (univariate filtering, mRMR, LASSO regression) selected 32 discriminative features. The 2.5D model aggregated ± 4 peri-tumoral slices, fusing PET/CT via MixUp channels using a fine-tuned ResNet50 (ImageNet-pretrained), with multi-instance learning (MIL) translating slice-level features to patient-level predictions. Habitat features, MIL signatures, and clinical variables were integrated via five-classifier ensemble (ExtraTrees/SVM/RandomForest) and Crossformer architecture (SMOTE-balanced). Validation included AUC, sensitivity, specificity, calibration curves, decision curve analysis (DCA), survival metrics (C-index, Kaplan-Meier), and interpretability (SHAP, Grad-CAM). Habitat radiomics achieved superior validation AUC (0.865, 95% CI: 0.778-0.953), outperforming conventional radiomics (ΔAUC + 3.6%, P < 0.01) and clinical models (ΔAUC + 6.4%, P < 0.001). SHAP identified the invasive front (H2) as dominant predictor (40% of top features), with wavelet_LHH_firstorder_Entropy showing highest impact (SHAP = + 0.42). The 2.5D MIL model demonstrated strong generalizability (validation AUC: 0.861). 
The combined model achieved state-of-the-art test performance (AUC = 0.824, sensitivity = 0.875) with superior calibration (Hosmer-Lemeshow P > 0.800), effective survival stratification (test C-index: 0.809), and 23-41% net benefit improvement in DCA. Integrating habitat radiomics and 2.5D deep learning enables interpretable dual diagnostic-prognostic stratification in ESCC, advancing precision oncology by decoding spatial heterogeneity.

Mixed Modality Classification Abdominal Retrospective Clinical In Silico Academic Lab Benchmark SOTA

FunKAN: Functional Kolmogorov-Arnold Network for Medical Image Enhancement and Segmentation

Maksim Penkin, Andrey Krylov

•preprint•Sep 16 2025

Medical image enhancement and segmentation are critical yet challenging tasks in modern clinical practice, constrained by artifacts and complex anatomical variations. Traditional deep learning approaches often rely on complex architectures with limited interpretability. While Kolmogorov-Arnold networks offer interpretable solutions, their reliance on flattened feature representations fundamentally disrupts the intrinsic spatial structure of imaging data. To address this issue we propose a Functional Kolmogorov-Arnold Network (FunKAN) -- a novel interpretable neural framework, designed specifically for image processing, that formally generalizes the Kolmogorov-Arnold representation theorem onto functional spaces and learns inner functions using Fourier decomposition over the basis Hermite functions. We explore FunKAN on several medical image processing tasks, including Gibbs ringing suppression in magnetic resonance images, benchmarked on the IXI dataset. We also propose U-FunKAN as a state-of-the-art binary medical segmentation model with benchmarks on three medical datasets: BUSI (ultrasound images), GlaS (histological structures) and CVC-ClinicDB (colonoscopy videos), detecting breast cancer, glands and polyps, respectively. Experiments on those diverse datasets demonstrate that our approach outperforms other KAN-based backbones in both medical image enhancement (PSNR, TV) and segmentation (IoU, F1). Our work bridges the gap between theoretical function approximation and medical image analysis, offering a robust, interpretable solution for clinical applications.

Mixed Modality Segmentation Methodology In Silico Academic Lab Benchmark SOTA
