
Out-of-the-Box Large Language Models for Detecting and Classifying Critical Findings in Radiology Reports Using Various Prompt Strategies.

Talati IA, Chaves JMZ, Das A, Banerjee I, Rubin DL

pubmed | Sep 10 2025
Background: The increasing complexity and volume of radiology reports present challenges for timely communication of critical findings. Purpose: To evaluate the performance of two out-of-the-box large language models (LLMs) in detecting and classifying critical findings in radiology reports using various prompt strategies. Methods: The analysis included 252 radiology reports of varying modalities and anatomic regions extracted from the MIMIC-III database, divided into a prompt engineering tuning set of 50 reports, a holdout test set of 125 reports, and a pool of 77 remaining reports used as examples for few-shot prompting. An external test set of 180 chest radiography reports was extracted from the CheXpert Plus database. Reports were manually reviewed to identify critical findings and to classify each finding into one of three categories (true critical finding, known/expected critical finding, equivocal critical finding). Following prompt engineering with various prompt strategies, a final prompt optimized for true critical finding detection was selected. Two general-purpose LLMs, GPT-4 and Mistral-7B, processed the test set reports using the final prompt. Evaluation included automated text similarity metrics (BLEU-1, ROUGE-F1, G-Eval) and manual performance metrics (precision, recall). Results: For true critical findings, zero-shot, few-shot static (five examples), and few-shot dynamic (five examples) prompting yielded BLEU-1 of 0.691, 0.778, and 0.748; ROUGE-F1 of 0.706, 0.797, and 0.773; and G-Eval of 0.428, 0.573, and 0.516, respectively. Precision and recall for true critical findings, known/expected critical findings, and equivocal critical findings were 90.1% and 86.9%, 80.9% and 85.0%, and 80.5% and 94.3% for GPT-4 on the holdout test set; 75.6% and 77.4%, 34.1% and 70.0%, and 41.3% and 74.3% for Mistral-7B on the holdout test set; 82.6% and 98.3%, 76.9% and 71.4%, and 70.8% and 85.0% for GPT-4 on the external test set; and 75.0% and 93.1%, 33.3% and 92.9%, and 34.0% and 80.0% for Mistral-7B on the external test set. Conclusion: Out-of-the-box LLMs detected and classified arbitrary numbers of critical findings in radiology reports. The best-performing strategy for true critical findings was few-shot static prompting. Clinical Impact: The study shows a role for contemporary general-purpose models in adapting to specialized medical tasks with minimal data annotation.
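For illustration, the few-shot static strategy described above can be sketched in a few lines of Python: a fixed instruction plus a fixed set of exemplar report/answer pairs prepended to every query. The instruction text, exemplar reports, and label vocabulary below are hypothetical stand-ins, not the study's actual prompt.

```python
# Minimal few-shot static prompting sketch for critical-finding extraction;
# instruction wording and exemplars are hypothetical.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

INSTRUCTION = (
    "You are a radiology assistant. List every critical finding in the "
    "report and label each as TRUE, KNOWN/EXPECTED, or EQUIVOCAL."
)

# Static exemplars drawn once from a held-out pool (content invented here).
EXAMPLES = [
    ("CT head: acute subdural hematoma with midline shift.",
     "acute subdural hematoma - TRUE"),
    ("Chest radiograph: stable known right pleural effusion.",
     "right pleural effusion - KNOWN/EXPECTED"),
]

def build_messages(report: str) -> list:
    """Assemble system instruction, static exemplars, then the new report."""
    messages = [{"role": "system", "content": INSTRUCTION}]
    for example_report, example_answer in EXAMPLES:
        messages.append({"role": "user", "content": example_report})
        messages.append({"role": "assistant", "content": example_answer})
    messages.append({"role": "user", "content": report})
    return messages

response = client.chat.completions.create(
    model="gpt-4",
    messages=build_messages("CT abdomen: new free air under the diaphragm."),
    temperature=0,
)
print(response.choices[0].message.content)
```

A few-shot dynamic variant would instead select the exemplars per query, for example by retrieving the most similar reports from the example pool.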

An Explainable Deep Learning Model for Focal Liver Lesion Diagnosis Using Multiparametric MRI.

Shen Z, Chen L, Wang L, Dong S, Wang F, Pan Y, Zhou J, Wang Y, Xu X, Chong H, Lin H, Li W, Li R, Ma H, Ma J, Yu Y, Du L, Wang X, Zhang S, Yan F

pubmed | Sep 10 2025
<i>"Just Accepted" papers have undergone full peer review and have been accepted for publication in <i>Radiology: Artificial Intelligence</i>. This article will undergo copyediting, layout, and proof review before it is published in its final version. Please note that during production of the final copyedited article, errors may be discovered which could affect the content.</i> Purpose To assess the effectiveness of an explainable deep learning (DL) model, developed using multiparametric MRI (mpMRI) features, in improving diagnostic accuracy and efficiency of radiologists for classification of focal liver lesions (FLLs). Materials and Methods FLLs ≥ 1 cm in diameter at mpMRI were included in the study. nn-Unet and Liver Imaging Feature Transformer (LIFT) models were developed using retrospective data from one hospital (January 2018-August 2023). nnU-Net was used for lesion segmentation and LIFT for FLL classification. External testing was performed on data from three hospitals (January 2018-December 2023), with a prospective test set obtained from January 2024 to April 2024. Model performance was compared with radiologists and impact of model assistance on junior and senior radiologist performance was assessed. Evaluation metrics included the Dice similarity coefficient (DSC) and accuracy. Results A total of 2131 individuals with FLLs (mean age, 56 ± [SD] 12 years; 1476 female) were included in the training, internal test, external test, and prospective test sets. Average DSC values for liver and tumor segmentation across the three test sets were 0.98 and 0.96, respectively. Average accuracy for features and lesion classification across the three test sets were 93% and 97%, respectively. LIFT-assisted readings improved diagnostic accuracy (average 5.3% increase, <i>P</i> < .001), reduced reading time (average 34.5 seconds decrease, <i>P</i> < .001), and enhanced confidence (average 0.3-point increase, <i>P</i> < .001) of junior radiologists. Conclusion The proposed DL model accurately detected and classified FLLs, improving diagnostic accuracy and efficiency of junior radiologists. ©RSNA, 2025.

Artificial Intelligence in Breast Cancer Care: Transforming Preoperative Planning and Patient Education with 3D Reconstruction

Mustafa Khanbhai, Giulia Di Nardo, Jun Ma, Vivienne Freitas, Caterina Masino, Ali Dolatabadi, Zhaoxun "Lorenz" Liu, Wey Leong, Wagner H. Souza, Amin Madani

arxiv preprint | Sep 10 2025
Effective preoperative planning requires accurate algorithms for segmenting anatomical structures across diverse datasets, but traditional models struggle with generalization. This study presents a novel machine learning methodology to improve algorithm generalization for 3D anatomical reconstruction beyond breast cancer applications. We processed 120 retrospective breast MRIs (January 2018-June 2023) through three phases: anonymization and manual segmentation of T1-weighted and dynamic contrast-enhanced sequences; co-registration and segmentation of whole breast, fibroglandular tissue, and tumors; and 3D visualization using ITK-SNAP. A human-in-the-loop approach refined segmentations using U-Mamba, designed to generalize across imaging scenarios. The Dice similarity coefficient (DSC) assessed overlap between automated segmentation and ground truth. Clinical relevance was evaluated through clinician and patient interviews. U-Mamba showed strong performance, with DSC values of 0.97 (±0.013) for whole organs, 0.96 (±0.024) for fibroglandular tissue, and 0.82 (±0.12) for tumors on T1-weighted images. The model generated accurate 3D reconstructions enabling visualization of complex anatomical features. Clinician interviews indicated improved planning, intraoperative navigation, and decision support. Integration of 3D visualization enhanced patient education, communication, and understanding. This human-in-the-loop machine learning approach successfully generalizes algorithms for 3D reconstruction and anatomical segmentation across patient datasets, offering enhanced visualization for clinicians, improved preoperative planning, and more effective patient education, facilitating shared decision-making and empowering informed patient choices across medical applications.
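As a concrete illustration of how per-structure DSC figures like those above can be computed from binary masks, here is a minimal numpy sketch; the mask arrays, shapes, and case count are placeholders, not the study's data.

```python
# Per-case Dice computation and mean (+/- SD) summary over a batch;
# masks below are random placeholders for real segmentations.
import numpy as np

def dice(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice similarity coefficient between two binary masks."""
    intersection = np.logical_and(pred, truth).sum()
    denom = pred.sum() + truth.sum()
    return 2.0 * intersection / denom if denom > 0 else 1.0

def summarize(per_case_scores: list[float]) -> str:
    scores = np.asarray(per_case_scores)
    return f"{scores.mean():.2f} (+/- {scores.std():.3f})"

# e.g., one score per case for a structure (whole organ, tissue, or tumor)
scores = [dice(np.random.rand(64, 64, 64) > 0.5,
               np.random.rand(64, 64, 64) > 0.5)
          for _ in range(5)]  # placeholder volumes
print("structure DSC:", summarize(scores))
```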

Explainable Deep Learning Framework for Classifying Mandibular Fractures on Panoramic Radiographs.

Seo H, Lee JI, Park JU, Sung IY

pubmed | Sep 10 2025
This study aimed to develop a deep learning model for the automatic classification of mandibular fractures on panoramic radiographs. A pretrained convolutional neural network (CNN) was used to classify fractures based on a novel, clinically relevant classification system. The dataset comprised 800 panoramic radiographs obtained from patients with facial trauma. The model demonstrated robust classification performance across 8 fracture categories, achieving consistently high accuracy and F1-scores. Performance was evaluated using standard metrics, including accuracy, precision, recall, and F1-score. To enhance interpretability and clinical applicability, the explainable AI techniques Gradient-weighted Class Activation Mapping (Grad-CAM) and Local Interpretable Model-Agnostic Explanations (LIME) were used to visualize the model's decision-making process. These findings suggest that the proposed deep learning framework is a reliable and efficient tool for classifying mandibular fractures on panoramic radiographs. Its application may help reduce diagnostic time and improve decision-making in maxillofacial trauma care. Further validation using larger, multi-institutional datasets is recommended to ensure generalizability.
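The Grad-CAM technique mentioned above can be sketched compactly in PyTorch: pool the gradients of the predicted class over a late convolutional layer, weight that layer's activations by the pooled gradients, and upsample the result to a heatmap. The backbone, target layer, and input below are assumptions for illustration, not the study's model.

```python
# Minimal Grad-CAM sketch; resnet18 and layer4 stand in for the study's CNN.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
target_layer = model.layer4[-1]

activations, gradients = {}, {}

def fwd_hook(module, inp, out):
    activations["value"] = out.detach()          # feature maps of the layer

def bwd_hook(module, grad_in, grad_out):
    gradients["value"] = grad_out[0].detach()    # gradients w.r.t. those maps

target_layer.register_forward_hook(fwd_hook)
target_layer.register_full_backward_hook(bwd_hook)

x = torch.randn(1, 3, 224, 224)                  # stand-in radiograph tensor
logits = model(x)
logits[0, logits.argmax()].backward()            # gradient of predicted class

weights = gradients["value"].mean(dim=(2, 3), keepdim=True)  # pooled grads
cam = F.relu((weights * activations["value"]).sum(dim=1))    # weighted sum
cam = F.interpolate(cam.unsqueeze(1), size=x.shape[2:], mode="bilinear")
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)     # normalize 0-1
```

LIME, by contrast, needs no gradients: it perturbs interpretable regions of the input image and fits a local linear surrogate to the model's outputs.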

Leveraging GPT-4o for Automated Extraction and Categorization of CAD-RADS Features From Free-Text Coronary CT Angiography Reports: Diagnostic Study.

Chen Y, Dong M, Sun J, Meng Z, Yang Y, Muhetaier A, Li C, Qin J

pubmed | Sep 10 2025
Despite the Coronary Artery Disease Reporting and Data System (CAD-RADS) providing a standardized approach, radiologists continue to favor free-text reports. This preference creates significant challenges for data extraction and analysis in longitudinal studies, potentially limiting large-scale research and quality assessment initiatives. To evaluate the ability of the generative pre-trained transformer (GPT)-4o model to convert real-world coronary computed tomography angiography (CCTA) free-text reports into structured data and automatically identify CAD-RADS categories and P categories. This retrospective study analyzed CCTA reports from January 2024 to July 2024. A subset of 25 reports was used for prompt engineering to instruct the large language model (LLM) in extracting CAD-RADS categories, P categories, and the presence of myocardial bridges and noncalcified plaques. Reports were processed using the GPT-4o API (application programming interface) and custom Python scripts. The ground truth was established by radiologists based on the CAD-RADS 2.0 guidelines. Model performance was assessed using accuracy, sensitivity, specificity, and F1-score. Intrarater reliability was assessed using the Cohen κ coefficient. Among 999 patients (median age 66 years, range 58-74; 650 males), CAD-RADS categorization showed accuracy of 0.98-1.00 (95% CI 0.9730-1.0000), sensitivity of 0.95-1.00 (95% CI 0.9191-1.0000), specificity of 0.98-1.00 (95% CI 0.9669-1.0000), and F1-score of 0.96-1.00 (95% CI 0.9253-1.0000). P categories demonstrated accuracy of 0.97-1.00 (95% CI 0.9569-0.9990), sensitivity of 0.90-1.00 (95% CI 0.8085-1.0000), specificity of 0.97-1.00 (95% CI 0.9533-1.0000), and F1-score of 0.91-0.99 (95% CI 0.8377-0.9967). Myocardial bridge detection achieved an accuracy of 0.98 (95% CI 0.9680-0.9870), and noncalcified coronary plaque detection showed an accuracy of 0.98 (95% CI 0.9680-0.9870). Cohen κ values for all classifications exceeded 0.98. The GPT-4o model efficiently and accurately converts CCTA free-text reports into structured data, excelling in CAD-RADS classification, plaque burden assessment, and detection of myocardial bridges and noncalcified plaques.
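A minimal sketch of the report-structuring step is shown below, assuming a chat-completions call with JSON-mode output; the prompt wording, schema keys, and example report are hypothetical, not the study's prompt.

```python
# Hypothetical GPT-4o call that turns a free-text CCTA report into JSON;
# the schema and prompt text are invented for illustration.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

PROMPT = (
    "Extract from this coronary CTA report: the CAD-RADS category (0-5 or N), "
    "the plaque burden P category (P1-P4 or none), and whether a myocardial "
    "bridge or any noncalcified plaque is present. Reply as JSON with keys "
    "cad_rads, p_category, myocardial_bridge, noncalcified_plaque."
)

report = "60% stenosis of the proximal LAD; myocardial bridging of the mid LAD..."

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "system", "content": PROMPT},
              {"role": "user", "content": report}],
    response_format={"type": "json_object"},  # constrain output to JSON
    temperature=0,
)
structured = json.loads(response.choices[0].message.content)
print(structured["cad_rads"], structured["p_category"])
```

The structured output can then be compared field by field against radiologist-assigned ground truth to compute the accuracy, sensitivity, and specificity figures reported above.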

Exploring Women's Perceptions of Traditional Mammography and the Concept of AI-Driven Thermography to Improve the Breast Cancer Screening Journey: Mixed Methods Study.

Sirka Kacafírková K, Poll A, Jacobs A, Cardone A, Ventura JJ

pubmed | Sep 10 2025
Breast cancer is the most common cancer among women and a leading cause of mortality in Europe. Early detection through screening reduces mortality, yet participation in mammography-based programs remains suboptimal due to discomfort, radiation exposure, and accessibility issues. Thermography, particularly when driven by artificial intelligence (AI), is being explored as a noninvasive, radiation-free alternative. However, its acceptance, reliability, and impact on the screening experience remain underexplored. This study aimed to explore women's perceptions of AI-enhanced thermography (ThermoBreast) as an alternative to mammography, to identify barriers and motivators related to breast cancer screening, and to assess how ThermoBreast might improve the screening experience. A mixed methods approach was adopted, combining an online survey with follow-up focus groups. The survey captured women's knowledge, attitudes, and experiences related to breast cancer screening and was used to recruit participants for qualitative exploration. After the focus groups, the survey was relaunched to include additional respondents. Quantitative data were analyzed using SPSS (IBM Corp), and qualitative data were analyzed in MAXQDA (VERBI Software). Findings from both strands were synthesized to redesign the breast cancer screening journey. A total of 228 valid survey responses were analyzed. Of these, 154 women (68%) had previously undergone mammography, while 74 (32%) had not. The most reported motivators were belief in prevention (69/154, 45%), invitations from screening programs (68/154, 44%), and doctor recommendations (45/154, 29%). Among nonscreeners, key barriers included no recommendation from a doctor (39/74, 53%), absence of symptoms (27/74, 36%), and perceived age ineligibility (17/74, 23%). Pain, long appointment waits, and fear of radiation were also mentioned. In total, 18 women (mean age 45.3 years, SD 13.6) participated in 6 focus groups. Participants emphasized the importance of respectful and empathetic interactions with medical staff, clear communication, and emotional comfort, factors they perceived as more influential than the screening technology itself. ThermoBreast was positively received for being contactless, radiation-free, and potentially more comfortable. Participants described it as "less traumatic," "easier," and "a game changer." However, concerns were raised regarding its novelty, lack of clinical validation, and data privacy. Some participants expressed the need for human oversight in AI-supported procedures and requested more information on how AI is used. Based on these insights, an updated screening journey was developed, highlighting improvements in preparation, appointment booking, privacy, and communication of results. While AI-driven thermography shows promise as a noninvasive, user-friendly alternative to mammography, its adoption depends on trust, clinical validation, and effective communication from health care professionals. It may expand screening access for populations underserved by mammography, such as younger and immobile women, but it does not eliminate all participation barriers. Long-term studies and direct comparisons between mammography and thermography are needed to assess diagnostic accuracy, patient experience, and impact on screening participation and outcomes.

A multidimensional deep ensemble learning model predicts pathological response and outcomes in esophageal squamous cell carcinoma treated with neoadjuvant chemoradiotherapy from pretreatment CT imaging: A multicenter study.

Liu Y, Su Y, Peng J, Zhang W, Zhao F, Li Y, Song X, Ma Z, Zhang W, Ji J, Chen Y, Men Y, Ye F, Men K, Qin J, Liu W, Wang X, Bi N, Xue L, Yu W, Wang Q, Zhou M, Hui Z

pubmed | Sep 10 2025
Neoadjuvant chemoradiotherapy (nCRT) followed by esophagectomy remains the standard of care for locally advanced esophageal squamous cell carcinoma (ESCC). However, accurately predicting pathological complete response (pCR) and treatment outcomes remains challenging. This study aimed to develop and validate a multidimensional deep ensemble learning model (DELRN) using pretreatment CT imaging to predict pCR and stratify prognostic risk in ESCC patients undergoing nCRT. In this multicenter, retrospective cohort study, 485 ESCC patients were enrolled from four hospitals (May 2009-August 2023, December 2017-September 2021, May 2014-September 2019, and March 2013-July 2019). Patients were divided into a discovery cohort (n = 194), an internal validation cohort (n = 49), and three external validation cohorts (n = 242). The DELRN model, integrating radiomics and 3D convolutional neural networks, was developed from pretreatment CT images to predict pCR and clinical outcomes. Model performance was evaluated by discrimination, calibration, and clinical utility. Kaplan-Meier analysis assessed overall survival (OS) and disease-free survival (DFS) at two follow-up centers. The DELRN model demonstrated robust predictive performance for pCR across the discovery, internal, and three external validation cohorts, with area under the curve (AUC) values of 0.943 (95% CI: 0.912-0.973), 0.796 (95% CI: 0.661-0.930), 0.767 (95% CI: 0.646-0.887), 0.829 (95% CI: 0.715-0.942), and 0.782 (95% CI: 0.664-0.900), respectively, surpassing single-domain radiomics or deep learning models. DELRN effectively stratified patients into high-risk and low-risk groups for OS (log-rank P = 0.018 and P = 0.0053) and DFS (log-rank P = 0.00042 and P = 0.035). Multivariate analysis confirmed DELRN as an independent prognostic factor for OS and DFS. The DELRN model shows promise as an effective, non-invasive tool for predicting nCRT response and treatment outcomes in ESCC patients, enabling personalized treatment strategies and improving clinical decision-making, pending prospective multicenter validation.
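A late-fusion ensemble in the spirit of DELRN can be sketched as follows: average the pCR probabilities of a radiomics classifier and a 3D CNN, then score the fused prediction by AUC. The features, labels, and equal fusion weight below are placeholders; the paper's actual fusion scheme may differ.

```python
# Toy late-fusion ensemble of a radiomics model and CNN probabilities;
# all data here are random stand-ins for real cohort features and labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X_radiomics = rng.normal(size=(100, 50))   # stand-in radiomics features
p_cnn = rng.uniform(size=100)              # stand-in 3D-CNN probabilities
y = rng.integers(0, 2, size=100)           # stand-in pCR labels

radiomics_model = LogisticRegression(max_iter=1000).fit(X_radiomics, y)
p_radiomics = radiomics_model.predict_proba(X_radiomics)[:, 1]

alpha = 0.5                                # fusion weight (assumed equal)
p_ensemble = alpha * p_radiomics + (1 - alpha) * p_cnn
print("ensemble AUC:", roc_auc_score(y, p_ensemble))
```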

Deep-Learning System for Automatic Measurement of the Femorotibial Rotational Angle on Lower-Extremity Computed Tomography.

Lee SW, Lee GP, Yoon I, Kim YJ, Kim KG

pubmed | Sep 10 2025
To develop and validate a deep learning-based algorithm for automatic identification of anatomical landmarks and calculation of femoral and tibial version angles (FTT angles) on lower-extremity CT scans. In this IRB-approved, retrospective study, lower-extremity CT scans from 270 adult patients (median age, 69 years; 235 women, 35 men) were analyzed. CT data were preprocessed using contrast-limited adaptive histogram equalization and RGB superposition to enhance tissue boundary distinction. The Attention U-Net model was trained against the gold standard of manual labeling and landmark drawing, enabling it to segment bones, detect landmarks, construct reference lines, and automatically measure femoral version and tibial torsion angles. The model's performance was validated against manual segmentations by a musculoskeletal radiologist using a test dataset. The segmentation model demonstrated sensitivity of 92.16% ± 0.02, specificity of 99.96% ± <0.01, and HD95 of 2.14 ± 2.39, with a Dice similarity coefficient (DSC) of 93.12% ± 0.01. Automatic measurements of femoral and tibial torsion angles showed good correlation with radiologists' measurements, with correlation coefficients of 0.64 for femoral and 0.54 for tibial angles (p < 0.05). Automated segmentation significantly reduced the measurement time per leg compared with manual methods (57.5 ± 8.3 s vs. 79.6 ± 15.9 s, p < 0.05). We developed a method to automate the measurement of femorotibial rotation on continuous axial CT scans of patients with osteoarthritis (OA) using a deep learning approach. This method has the potential to expedite the analysis of patient data in busy clinical settings.
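Once landmarks are detected, the final measurement step reduces to plane geometry: each version or torsion angle is the angle between two landmark-defined reference lines on an axial plane. A minimal sketch with hypothetical coordinates:

```python
# Angle between two landmark-defined lines in the axial plane via atan2;
# landmark coordinates below are invented for illustration.
import math

def line_angle(p1, p2):
    """Orientation (degrees) of the line from p1 to p2."""
    return math.degrees(math.atan2(p2[1] - p1[1], p2[0] - p1[0]))

def version_angle(prox_a, prox_b, dist_a, dist_b):
    """Signed angle between a proximal and a distal reference line."""
    diff = line_angle(prox_a, prox_b) - line_angle(dist_a, dist_b)
    return (diff + 180) % 360 - 180   # wrap into [-180, 180)

# e.g., femoral neck axis vs. posterior condylar line (coordinates assumed)
print(version_angle((10, 40), (60, 55), (12, 10), (65, 8)))
```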

Fixed point method for PET reconstruction with learned plug-and-play regularization.

Savanier M, Comtat C, Sureau F

pubmed | Sep 10 2025
Objective: Deep learning has shown great promise for improving medical image reconstruction, including PET. However, concerns remain about the stability and robustness of these methods, especially when trained on limited data. This work aims to explore the use of the plug-and-play (PnP) framework in PET reconstruction to address these concerns. Approach: We propose a convergent PnP algorithm for low-count PET reconstruction based on the Douglas-Rachford splitting method. We consider several denoisers trained to satisfy fixed-point conditions, with convergence properties ensured either during training or by design, including a spectrally normalized network and a deep equilibrium model. We evaluate the bias-standard deviation tradeoff across clinically relevant regions and an unseen pathological case in a synthetic experiment and a real study. Comparisons are made with model-based iterative reconstruction, post-reconstruction denoising, a deep end-to-end unfolded network, and PnP with a Gaussian denoiser. Main Results: Our method achieves lower bias than post-reconstruction processing and reduced standard deviation at matched bias compared with model-based iterative reconstruction. While spectral normalization underperforms in generalization, the deep equilibrium model remains competitive with convolutional networks for PnP reconstruction and generalizes better to the unseen pathology. Compared with the end-to-end unfolded network, it also generalizes more consistently. Significance: This study demonstrates the potential of the PnP framework to improve image quality and quantification accuracy in PET reconstruction. It also highlights the importance of how convergence conditions are imposed on the denoising network to ensure robust and generalizable performance.
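One common form of the PnP Douglas-Rachford iteration for a data-fidelity term f (in PET, the negative Poisson log-likelihood) replaces the regularizer's proximal operator with a learned denoiser D_theta; this generic scheme is a sketch of the standard construction, not necessarily the authors' exact formulation:

```latex
\begin{aligned}
x^{k}   &= \operatorname{prox}_{\gamma f}\!\left(z^{k}\right),\\
y^{k}   &= D_\theta\!\left(2x^{k} - z^{k}\right),\\
z^{k+1} &= z^{k} + y^{k} - x^{k}.
\end{aligned}
```

If the denoiser satisfies a suitable fixed-point (nonexpansiveness) condition, imposed during training or by construction as in the spectrally normalized and deep equilibrium variants above, the iterates converge to a fixed point rather than drifting or oscillating.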

Few-shot learning for highly accelerated 3D time-of-flight MRA reconstruction.

Li H, Chiew M, Dragonu I, Jezzard P, Okell TW

pubmed | Sep 10 2025
To develop a deep learning-based reconstruction method for highly accelerated 3D time-of-flight MRA (TOF-MRA) that achieves high-quality reconstruction with robust generalization using extremely limited acquired raw data, addressing the challenge of time-consuming acquisition of high-resolution, whole-head angiograms. A novel few-shot learning-based reconstruction framework is proposed, featuring a 3D variational network specifically designed for 3D TOF-MRA that is pre-trained on simulated complex-valued, multi-coil raw k-space datasets synthesized from diverse open-source magnitude images and fine-tuned using only two single-slab experimentally acquired datasets. The proposed approach was evaluated against existing methods on acquired retrospectively undersampled in vivo k-space data from five healthy volunteers and on prospectively undersampled data from two additional subjects. The proposed method achieved superior reconstruction performance on experimentally acquired in vivo data over comparison methods, preserving most fine vessels with minimal artifacts with up to eight-fold acceleration. Compared to other simulation techniques, the proposed method generated more realistic raw k-space data for 3D TOF-MRA. Consistently high-quality reconstructions were also observed on prospectively undersampled data. By leveraging few-shot learning, the proposed method enabled highly accelerated 3D TOF-MRA relying on minimal experimentally acquired data, achieving promising results on both retrospective and prospective in vivo data while outperforming existing methods. Given the challenges of acquiring and sharing large raw k-space datasets, this holds significant promise for advancing research and clinical applications in high-resolution, whole-head 3D TOF-MRA imaging.
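The few-shot stage described above can be sketched as a standard transfer-learning loop: pre-train on simulated k-space, then adapt a small part of the network on the two acquired datasets. The toy architecture, frozen-layer choice, and random tensors below are stand-ins, not the paper's 3D variational network.

```python
# Toy few-shot fine-tuning loop: freeze most of a pre-trained unrolled
# network and adapt the last stage on two acquired datasets (stand-ins).
import torch
import torch.nn as nn

class TinyUnrolledNet(nn.Module):
    """Stand-in for the 3D variational network (the real model is larger)."""
    def __init__(self, cascades: int = 4):
        super().__init__()
        self.cascades = nn.ModuleList(
            nn.Conv3d(2, 2, kernel_size=3, padding=1) for _ in range(cascades))

    def forward(self, x):
        for layer in self.cascades:
            x = x + layer(x)        # residual refinement per cascade
        return x

net = TinyUnrolledNet()             # pretend this was pre-trained on sims

# Few-shot stage: freeze all but the last cascade (one plausible recipe).
for name, param in net.named_parameters():
    param.requires_grad = name.startswith("cascades.3")

optimizer = torch.optim.Adam(
    (p for p in net.parameters() if p.requires_grad), lr=1e-5)
loss_fn = nn.L1Loss()

# Two acquired single-slab datasets: random stand-ins for zero-filled
# inputs and reference targets, real/imaginary parts as two channels.
few_shot = [(torch.randn(1, 2, 8, 32, 32), torch.randn(1, 2, 8, 32, 32))
            for _ in range(2)]

for epoch in range(10):
    for zero_filled, target in few_shot:
        loss = loss_fn(net(zero_filled), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```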
