
Development and Temporal Validation of a Deep Learning Model for Automatic Fetal Biometry from Ultrasound Videos.

Goetz-Fu M, Haller M, Collins T, Begusic N, Jochum F, Keeza Y, Uwineza J, Marescaux J, Weingertner AS, Sananès N, Hostettler A

PubMed · Sep 22, 2025
The objective was to develop an artificial intelligence (AI)-based system, using deep neural network (DNN) technology, to automatically detect standard fetal planes during video capture, measure fetal biometry parameters and estimate fetal weight. A standard plane recognition DNN was trained to classify ultrasound images into four categories: head circumference (HC), abdominal circumference (AC), femur length (FL) standard planes, or 'other'. The recognized standard plane images were subsequently processed by three fetal biometry DNNs, automatically measuring HC, AC and FL. Fetal weight was then estimated with the Hadlock 3 formula. The training dataset consisted of 16,626 images. A prospective temporal validation was then conducted using an independent set of 281 ultrasound videos of healthy fetuses. Fetal weight and biometry measurements were compared against an expert sonographer. Two less experienced sonographers were used as controls. The AI system obtained a significantly lower absolute relative measurement error in fetal weight estimation than the controls (AI vs. medium-level: p = 0.032, AI vs. beginner: p < 1e-8), as well as in AC measurements (AI vs. medium-level: p = 1.72e-04, AI vs. beginner: p < 1e-06). Average absolute relative measurement errors of AI versus expert were: 0.96% (S.D. 0.79%) for HC, 1.56% (S.D. 1.39%) for AC, 1.77% (S.D. 1.46%) for FL and 3.10% (S.D. 2.74%) for fetal weight estimation. The AI system produced biometry measurements and fetal weight estimates similar to those of the expert sonographer. It is a promising tool to enhance non-expert sonographers' performance and reproducibility in fetal biometry measurements, and to reduce inter-operator variability.
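The fetal-weight step above is a closed-form regression rather than a learned model. Below is a minimal sketch of the Hadlock 3 formula (Hadlock et al., 1985) as commonly published, assuming HC, AC, and FL in centimeters and returning grams; the DNN detection and measurement stages are out of scope, and the example measurements are illustrative, not taken from the study.

```python
def hadlock3_efw(hc_cm: float, ac_cm: float, fl_cm: float) -> float:
    """Estimated fetal weight (grams) via the Hadlock 3 regression,
    which uses head circumference (HC), abdominal circumference (AC),
    and femur length (FL), all in centimeters."""
    log10_efw = (1.326
                 - 0.00326 * ac_cm * fl_cm
                 + 0.0107 * hc_cm
                 + 0.0438 * ac_cm
                 + 0.158 * fl_cm)
    return 10 ** log10_efw

# Illustrative third-trimester measurements (not study data):
print(round(hadlock3_efw(hc_cm=31.0, ac_cm=30.0, fl_cm=6.5)))  # ~2300 g
```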

Radiologist Interaction with AI-Generated Preliminary Reports: A Longitudinal Multi-Reader Study.

Hong EK, Suh CH, Nukala M, Esfahani A, Licaros A, Madan R, Hunsaker A, Hammer M

PubMed · Sep 20, 2025
To investigate the integration of multimodal AI-generated reports into radiology workflow over time, focusing on their impact on efficiency, acceptability, and report quality. A multicase, multireader study involved 756 publicly available chest radiographs interpreted by five radiologists using preliminary reports generated by a radiology-specific multimodal AI model, divided into seven sequential batches of 108 radiographs each. Two thoracic radiologists assessed the final reports using RADPEER criteria for agreement and a 5-point Likert scale for quality. Reading times, rates of acceptance without modification, agreement, and quality scores were measured, with statistical analyses evaluating trends across the seven sequential batches. Radiologists' reading times for chest radiographs decreased from 25.8 seconds in Batch 1 to 19.3 seconds in Batch 7 (p < .001). Acceptability increased from 54.6% to 60.2% (p < .001), with normal chest radiographs demonstrating higher acceptance rates (68.9%) than abnormal chest radiographs (52.6%; p < .001). Median agreement and quality scores remained stable for normal chest radiographs but varied significantly for abnormal chest radiographs (p < .05). The introduction of AI-generated reports improved the efficiency of chest radiograph interpretation, and acceptability increased over time. However, agreement and quality scores showed variability, particularly in abnormal cases, emphasizing the need for oversight in the interpretation of complex chest radiographs.
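As an illustration of the trend analyses mentioned above, a hedged sketch: fit a simple linear trend to per-batch acceptance rates. Only the Batch 1 and Batch 7 rates (54.6% and 60.2% of 108 reads) are anchored in the abstract; the intermediate counts are hypothetical, and a Cochran-Armitage test would be a more formal choice for binary acceptance data.

```python
import numpy as np
from scipy import stats

batch = np.arange(1, 8)                            # seven sequential batches
accepted = np.array([59, 60, 61, 62, 63, 64, 65])  # hypothetical counts of 108
rate = accepted / 108.0

# Crude linear trend of acceptance rate across batch order.
res = stats.linregress(batch, rate)
print(f"slope per batch: {res.slope:.4f}, p = {res.pvalue:.3f}")
```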

Deep Learning Integration of Endoscopic Ultrasound Features and Serum Data Reveals LTB4 as a Diagnostic and Therapeutic Target in ESCC.

Huo S, Zhang W, Wang Y, Qi J, Wang Y, Bai C

PubMed · Sep 18, 2025
Background: Early diagnosis and accurate prediction of treatment response in esophageal squamous cell carcinoma (ESCC) remain major clinical challenges due to the lack of reliable and noninvasive biomarkers. Recently, artificial intelligence-driven endoscopic ultrasound image analysis has shown great promise in revealing genomic features associated with imaging phenotypes. Methods: A prospective study of 115 patients with ESCC was conducted. Deep features were extracted from endoscopic ultrasound images using a ResNet50 convolutional neural network. Important features shared across three machine learning models (neural network [NN], generalized linear model [GLM], and decision tree [DT]) were used to construct an image-derived signature. Plasma levels of leukotriene B4 (LTB4) and other inflammatory markers were measured using enzyme-linked immunosorbent assay. Correlations between the signature and inflammation markers were analyzed, followed by logistic regression and subgroup analyses. Results: The endoscopic ultrasound image-derived signature, generated using deep learning algorithms, effectively distinguished esophageal cancer from normal esophageal tissue. Among all inflammatory markers, LTB4 exhibited the strongest negative correlation with the image signature and showed significantly higher expression in the healthy control group. Multivariate logistic regression analysis identified LTB4 as an independent risk factor for ESCC (odds ratio = 1.74, p = 0.037). Furthermore, LTB4 expression was significantly associated with patient sex, age, and chemotherapy response. Notably, higher LTB4 levels were linked to an increased likelihood of achieving a favorable therapeutic response. Conclusions: This study demonstrates that deep learning-derived endoscopic ultrasound image features can effectively distinguish ESCC from normal esophageal tissue. By integrating image features with serological data, the authors identified LTB4 as a key inflammation-related biomarker with significant diagnostic and therapeutic predictive value.
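Deep-feature extraction with a ResNet50 backbone typically means dropping the classification head and keeping the 2048-dimensional pooled embedding. A minimal torchvision sketch follows; the study's exact weights, preprocessing, and any fine-tuning are not specified in the abstract, so the ImageNet weights and 224 × 224 input here are assumptions.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# ImageNet-pretrained ResNet50 with the classifier removed, so the forward
# pass emits a 2048-dim pooled feature vector per image.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(pil_image):
    x = preprocess(pil_image).unsqueeze(0)  # (1, 3, 224, 224)
    return backbone(x).squeeze(0).numpy()   # (2048,)
```

The resulting per-image vectors would then feed the NN/GLM/DT feature-selection step described above.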

Optimized deep learning-accelerated single-breath-hold abdominal HASTE with and without fat saturation improves and accelerates abdominal imaging at 3 Tesla.

Tan Q, Kubicka F, Nickel D, Weiland E, Hamm B, Geisel D, Wagner M, Walter-Rittel TC

PubMed · Sep 18, 2025
Deep learning-accelerated single-shot turbo-spin-echo techniques (DL-HASTE) enable single-breath-hold T2-weighted abdominal imaging. However, studies evaluating the image quality of DL-HASTE with and without fat saturation (FS) remain limited. This study aimed to prospectively evaluate the technical feasibility and image quality of abdominal DL-HASTE with and without FS at 3 Tesla. DL-HASTE of the upper abdomen was acquired with variable sequence parameters regarding FS, flip angle (FA) and field of view (FOV) in 10 healthy volunteers and 50 patients. DL-HASTE sequences were compared to clinical sequences (HASTE, HASTE-FS and T2-TSE-FS BLADE). Two radiologists independently scored the sequences for overall image quality, delineation of abdominal organs, artifacts and fat saturation using a Likert scale (range: 1-5). Breath-hold time of DL-HASTE and DL-HASTE-FS was 21 ± 2 s with fixed FA and 20 ± 2 s with variable FA (p < 0.001), with no overall image quality difference (p > 0.05). DL-HASTE required a 10% larger FOV than DL-HASTE-FS to avoid aliasing artifacts from subcutaneous fat. Both DL-HASTE and DL-HASTE-FS had significantly higher overall image quality scores than standard HASTE acquisitions (DL-HASTE vs. HASTE: 4.8 ± 0.40 vs. 4.1 ± 0.50; DL-HASTE-FS vs. HASTE-FS: 4.6 ± 0.50 vs. 3.6 ± 0.60; p < 0.001). Compared to the T2-TSE-FS BLADE, DL-HASTE-FS provided higher overall image quality (4.6 ± 0.50 vs. 4.3 ± 0.63, p = 0.011). DL-HASTE achieved significantly higher image quality (p = 0.006) and higher organ sharpness scores than DL-HASTE-FS (p < 0.001). Deep learning-accelerated HASTE, both with and without fat saturation, was feasible at 3 Tesla and showed improved image quality compared to conventional sequences.
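Paired 5-point Likert ratings of the same subjects under two sequences are usually compared with a nonparametric paired test. The abstract does not name the exact test used, so the following is only a sketch, with invented placeholder scores rather than study data.

```python
import numpy as np
from scipy import stats

# Hypothetical paired image-quality scores (1-5) from one reader for the
# same 20 subjects under DL-HASTE and standard HASTE.
dl_haste = np.array([5, 5, 4, 5, 5, 4, 5, 5, 5, 4, 5, 5, 4, 5, 5, 5, 4, 5, 5, 5])
haste    = np.array([4, 4, 4, 5, 4, 4, 4, 4, 5, 4, 4, 4, 4, 4, 5, 4, 4, 4, 4, 4])

# Wilcoxon signed-rank test for paired ordinal data; tied pairs are dropped.
stat, p = stats.wilcoxon(dl_haste, haste)
print(f"W = {stat}, p = {p:.4f}")
```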

Real-world clinical impact of three commercial AI algorithms on musculoskeletal radiography interpretation: A prospective crossover reader study.

Prucker P, Lemke T, Mertens CJ, Ziegelmayer S, Graf MM, Weller D, Kim SH, Gassert FT, Kader A, Dorfner FJ, Meddeb A, Makowski MR, Lammert J, Huber T, Lohöfer F, Bressem KK, Adams LC, Luiken I, Busch F

PubMed · Sep 17, 2025
To prospectively assess the diagnostic performance, workflow efficiency, and clinical impact of three commercial deep-learning tools (BoneView, Rayvolve, RBfracture) for routine musculoskeletal radiograph interpretation. From January to March 2025, two radiologists (4 and 5 years' experience) independently interpreted 1,037 adult musculoskeletal studies (2,926 radiographs) first unaided and, after 14-day washouts, with each AI tool in a randomized crossover design. Ground truth was established by confirmatory CT when available. Outcomes included sensitivity, specificity, accuracy, area under the receiver operating characteristic curve (AUC), interpretation time, diagnostic confidence (5-point Likert), and rates of additional CT recommendations and senior consultations. DeLong tests compared AUCs; Mann-Whitney U and χ2 tests assessed secondary endpoints. AI assistance did not significantly change performance for fractures, dislocations, or effusions. For fractures, AUCs were comparable to baseline (Reader 1: 96.50 % vs. 96.30-96.50 %; Reader 2: 95.35 % vs. 95.97 %; all p > 0.11). For dislocations, baseline AUCs (Reader 1: 92.66 %; Reader 2: 90.68 %) were unchanged with AI (92.76-93.95 % and 92.00 %; p ≥ 0.280). For effusions, baseline AUCs (Reader 1: 92.52 %; Reader 2: 96.75 %) were similar with AI (93.12 % and 96.99 %; p ≥ 0.157). Median interpretation times decreased with AI (Reader 1: 34 s to 21-25 s; Reader 2: 30 s to 21-26 s; all p < 0.001). Confidence improved across tools: BoneView increased combined "very good/excellent" ratings versus unaided reads (Reader 1: 509 vs. 449, p < 0.001; Reader 2: 483 vs. 439, p < 0.001); Rayvolve (Reader 1: 456 vs. 449, p = 0.029; Reader 2: 449 vs. 439, p < 0.001) and RBfracture (Reader 1: 457 vs. 449, p = 0.017; Reader 2: 448 vs. 439, p = 0.001) yielded smaller but significant gains. Reader 1 recommended fewer CT scans with AI assistance (33 vs. 22-23, p = 0.007). In a real-world clinical setting, AI-assisted interpretation of musculoskeletal radiographs reduced reading time and increased diagnostic confidence without materially affecting diagnostic performance. These findings support AI assistance as a lever for workflow efficiency and potential cost-effectiveness at scale.
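SciPy has no built-in DeLong test (that AUC comparison needs a dedicated implementation), but the secondary endpoints map directly onto standard calls. A sketch with placeholder data; only the confidence counts (509 vs. 449 "very good/excellent" ratings out of 1,037 studies, Reader 1 with BoneView) are taken from the abstract.

```python
import numpy as np
from scipy import stats

# Hypothetical interpretation times (seconds), unaided vs. AI-assisted.
rng = np.random.default_rng(0)
unaided = rng.normal(34, 8, 200).clip(10)
assisted = rng.normal(23, 6, 200).clip(8)
u, p_time = stats.mannwhitneyu(unaided, assisted, alternative="two-sided")

# Confidence dichotomized as "very good/excellent" vs. lower (chi-square).
table = np.array([[509, 1037 - 509],   # AI-assisted reads
                  [449, 1037 - 449]])  # unaided reads
chi2, p_conf, dof, _ = stats.chi2_contingency(table)
print(f"Mann-Whitney p = {p_time:.2e}; chi-square p = {p_conf:.4f}")
```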

The HeartMagic prospective observational study protocol - characterizing subtypes of heart failure with preserved ejection fraction

Meyer, P., Rocca, A., Banus, J., Ogier, A. C., Georgantas, C., Calarnou, P., Fatima, A., Vallee, J.-P., Deux, J.-F., Thomas, A., Marquis, J., Monney, P., Lu, H., Ledoux, J.-B., Tillier, C., Crowe, L. A., Abdurashidova, T., Richiardi, J., Hullin, R., van Heeswijk, R. B.

medRxiv preprint · Sep 16, 2025
Introduction: Heart failure (HF) is a life-threatening syndrome with significant morbidity and mortality. While evidence-based drug treatments have effectively reduced morbidity and mortality in HF with reduced ejection fraction (HFrEF), few therapies have been demonstrated to improve outcomes in HF with preserved ejection fraction (HFpEF). The multifaceted clinical presentation is one of the main reasons why the current understanding of HFpEF remains limited. This may be caused by the existence of several HFpEF disease subtypes, each requiring different treatment. There is therefore an unmet need for a holistic approach that combines comprehensive imaging with metabolomic, transcriptomic and genomic mapping to subtype HFpEF patients. This protocol details the approach employed in the HeartMagic study to address this gap in understanding. Methods: This prospective multi-center observational cohort study will include 500 consecutive patients with current or recent hospitalization for treatment of HFpEF at two Swiss university hospitals, along with 50 age-matched HFrEF patients and 50 age-matched healthy controls. Diagnosis of heart failure is based on clinical signs and symptoms, and HF patients are subgrouped by left-ventricular ejection fraction. In addition to the routine clinical workup, participants undergo genomic, transcriptomic, and metabolomic analyses, while the anatomy, composition, and function of the heart are quantified by comprehensive echocardiography and magnetic resonance imaging (MRI). Quantitative MRI is also applied to characterize the kidney. The primary outcome is a composite of one-year cardiovascular mortality or rehospitalization. Machine learning (ML)-based multi-modal clustering will be employed to identify distinct HFpEF subtypes in the holistic data, and the clinical importance of these subtypes will be evaluated based on their association with the primary outcome. Statistical analysis will include group comparisons across modalities, survival analysis for the primary outcome, and integrative multi-modal clustering combining clinical, imaging, ECG, genomic, transcriptomic, and metabolomic data to identify and validate HFpEF subtypes. Discussion: The integration of comprehensive MRI with extensive genomic and metabolomic profiling in this study will result in an unprecedented panoramic view of HFpEF and should enable us to distinguish functional subgroups of HFpEF patients. This approach has the potential to provide new insights into HFpEF disease and a basis for personalized therapies. Beyond this, identifying HFpEF subtypes with specific molecular and structural characteristics could lead to new targeted pharmacological interventions, with the potential to improve patient outcomes.
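The protocol does not name a clustering algorithm, so the following is only an illustrative sketch of one common multi-modal recipe: scale each data block separately, concatenate, cluster, and choose the number of putative subtypes by silhouette score. All arrays are random placeholders shaped like the planned cohort.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
imaging = rng.normal(size=(500, 30))         # e.g., MRI-derived features
metabolomics = rng.normal(size=(500, 120))   # e.g., metabolite levels

# Scale each modality block before concatenating so no block dominates.
X = np.hstack([StandardScaler().fit_transform(b) for b in (imaging, metabolomics)])

for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```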

Development and Validation of a Multimodal-based Machine Learning Model for Diagnosis of Usual Interstitial Pneumonia: a Prospective Multicenter Study.

Wang H, Liu A, Ni Y, Wang J, Du J, Xi L, Qiang Y, Xie B, Ren Y, Wang S, Geng J, Deng Y, Huang S, Zhang R, Liu M, Dai H

PubMed · Sep 16, 2025
Usual interstitial pneumonia (UIP) indicates poor prognosis, and there is significant heterogeneity in the diagnosis of UIP, necessitating an auxiliary diagnostic tool. Can a machine learning (ML) classifier using radiomics features and clinical data accurately identify UIP among patients with interstitial lung disease (ILD)? This dataset from a prospective cohort consists of 5213 sets of high-resolution computed tomography (HRCT) images from 2901 patients with ILD (male: 63.5%, age: 61.7 ± 10.8 years) across three medical centers. Multimodal data, including whole-lung radiomics features on HRCT and demographics, smoking, lung function, and comorbidity data, were extracted. An eXtreme Gradient Boosting (XGBoost) classifier and logistic regression were used to design a nomogram predicting UIP status. Area under the receiver operating characteristic curve (AUC) and Cox regression for all-cause mortality were used to assess the diagnostic performance and prognostic value of the models, respectively. The 5213 HRCT image datasets were divided into a training group (n=3639), an internal testing group (n=785), and an external validation group (n=789). UIP prevalence was 43.7% across the whole dataset, and 42.7% and 41.3% in the internal testing and external validation sets, respectively. The radiomics-based classifier had an AUC of 0.790 in the internal testing set and 0.786 in the external validation dataset. Integrating multimodal data improved the AUCs to 0.802 and 0.794, respectively. The performance of the integrated model was comparable to that of pulmonologists with over 10 years of experience in ILD. Among the 522 patients who died during a median follow-up of 3.37 years, the multimodal-based ML model-predicted UIP status was associated with a high all-cause mortality risk (hazard ratio: 2.52, p<0.001). The classifier combining radiomics and clinical features showed strong performance across varied UIP prevalence. This multimodal-based ML model could serve as an adjunct in the diagnosis of UIP.
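A hedged sketch of the modeling skeleton described above: an XGBoost classifier for UIP, evaluated by AUC, followed by Cox regression of all-cause mortality on predicted UIP status. All data are random placeholders shaped like the study's splits, `lifelines` stands in for the survival model, and no hyperparameter here is from the paper.

```python
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(3639, 50)), rng.integers(0, 2, 3639)
X_test, y_test = rng.normal(size=(785, 50)), rng.integers(0, 2, 785)

clf = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05,
                    eval_metric="logloss")
clf.fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))

# Cox regression of mortality on model-predicted UIP status.
df = pd.DataFrame({
    "time_years": rng.exponential(3.4, 785),
    "death": rng.integers(0, 2, 785),
    "predicted_uip": clf.predict(X_test),
})
cph = CoxPHFitter().fit(df, duration_col="time_years", event_col="death")
cph.print_summary()  # hazard ratio = exp(coef) for predicted_uip
```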

Deep learning-based multi-shot breast diffusion MRI: Improving image quality and reducing distortion.

Chien N, Cho YH, Wang MY, Tsai LW, Yeh CY, Li CW, Lan P, Wang X, Liu KL, Chang YC

PubMed · Sep 15, 2025
To investigate the imaging performance of deep-learning reconstruction on multiplexed sensitivity encoding (MUSE DL) compared to single-shot diffusion-weighted imaging (SS-DWI) in the breast. In this prospective, institutional review board-approved study, both single-shot (SS-DWI) and multi-shot MUSE DWI were performed on patients. MUSE DWI was processed using deep-learning reconstruction (MUSE DL). Quantitative analysis included calculating apparent diffusion coefficients (ADCs) and signal-to-noise ratio (SNR) within fibroglandular tissue (FGT), adjacent pectoralis muscle, and breast tumors. The Hausdorff distance (HD) was used as a distortion index to compare breast contours between T2-weighted anatomical images, SS-DWI, and MUSE images. Subjective visual qualitative analysis was performed using a Likert scale. Quantitative analyses were assessed using Friedman's rank-based analysis with Bonferroni correction. Sixty-one female participants (mean age 49.07 years ± 11.0 [standard deviation]; age range 23-75 years) with 65 breast lesions were included in this study. All data were acquired using a 3 T MRI scanner. MUSE DL yielded significant improvement in image quality compared with non-DL MUSE in both 2-shot and 4-shot settings (SNR enhancement in FGT: 2-shot DL 207.8% [125.5-309.3], 4-shot DL 175.1% [102.2-223.5]). No significant difference was observed in the ADC between MUSE, MUSE DL, and SS-DWI in both benign (P = 0.154) and malignant tumors (P = 0.167). There was significantly less distortion in the 2- and 4-shot MUSE DL images (HD 3.11 mm, 2.58 mm) than in the SS-DWI images (4.15 mm, P < 0.001). MUSE DL enhances SNR, minimizes image distortion, and preserves lesion diagnosis accuracy and ADC values.
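The ADC values compared above come from a mono-exponential model of the diffusion signal; with two b-values it reduces to a single log-ratio per voxel. A minimal sketch (the b-values and simulated ADC are illustrative, not from the study; SciPy's `directed_hausdorff` could play the role of the distortion index in the same spirit):

```python
import numpy as np

def adc_map(s_low, s_high, b_high, b_low=0.0, eps=1e-6):
    """Per-voxel ADC (mm^2/s) from a two-point mono-exponential fit:
    S_b = S_0 * exp(-(b_high - b_low) * ADC)."""
    ratio = np.clip(s_low, eps, None) / np.clip(s_high, eps, None)
    return np.log(ratio) / (b_high - b_low)

# Illustrative: b = 0 and b = 800 s/mm^2 are common in breast DWI.
s0 = np.full((4, 4), 1000.0)
s800 = s0 * np.exp(-800 * 1.2e-3)            # simulate ADC = 1.2e-3 mm^2/s
print(adc_map(s0, s800, b_high=800).mean())  # recovers ~1.2e-3
```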

Exploring Women's Perceptions of Traditional Mammography and the Concept of AI-Driven Thermography to Improve the Breast Cancer Screening Journey: Mixed Methods Study.

Sirka Kacafírková K, Poll A, Jacobs A, Cardone A, Ventura JJ

PubMed · Sep 10, 2025
Breast cancer is the most common cancer among women and a leading cause of mortality in Europe. Early detection through screening reduces mortality, yet participation in mammography-based programs remains suboptimal due to discomfort, radiation exposure, and accessibility issues. Thermography, particularly when driven by artificial intelligence (AI), is being explored as a noninvasive, radiation-free alternative. However, its acceptance, reliability, and impact on the screening experience remain underexplored. This study aimed to explore women's perceptions of AI-enhanced thermography (ThermoBreast) as an alternative to mammography, to identify barriers and motivators related to breast cancer screening, and to assess how ThermoBreast might improve the screening experience. A mixed methods approach was adopted, combining an online survey with follow-up focus groups. The survey captured women's knowledge, attitudes, and experiences related to breast cancer screening and was used to recruit participants for qualitative exploration. After the focus groups, the survey was relaunched to include additional respondents. Quantitative data were analyzed using SPSS (IBM Corp), and qualitative data were analyzed in MAXQDA (VERBI Software). Findings from both strands were synthesized to redesign the breast cancer screening journey. A total of 228 valid survey responses were analyzed. Of the 228 respondents, 154 (68%) had previously undergone mammography, while 74 (32%) had not. The most frequently reported motivators were belief in prevention (69/154, 45%), invitations from screening programs (68/154, 44%), and doctor recommendations (45/154, 29%). Among nonscreeners, key barriers included no recommendation from a doctor (39/74, 53%), absence of symptoms (27/74, 36%), and perceived age ineligibility (17/74, 23%). Pain, long appointment waits, and fear of radiation were also mentioned. In total, 18 women (mean age 45.3 years, SD 13.6) participated in 6 focus groups. Participants emphasized the importance of respectful and empathetic interactions with medical staff, clear communication, and emotional comfort, factors they perceived as more influential than the screening technology itself. ThermoBreast was positively received for being contactless, radiation-free, and potentially more comfortable. Participants described it as "less traumatic," "easier," and "a game changer." However, concerns were raised regarding its novelty, lack of clinical validation, and data privacy. Some participants expressed the need for human oversight in AI-supported procedures and requested more information on how AI is used. Based on these insights, an updated screening journey was developed, highlighting improvements in preparation, appointment booking, privacy, and communication of results. While AI-driven thermography shows promise as a noninvasive, user-friendly alternative to mammography, its adoption depends on trust, clinical validation, and effective communication from health care professionals. It may expand screening access for populations underserved by mammography, such as younger and immobile women, but does not eliminate all participation barriers. Long-term studies and direct comparisons between mammography and thermography are needed to assess diagnostic accuracy, patient experience, and their impact on screening participation and outcomes.

An Explainable Deep Learning Model for Focal Liver Lesion Diagnosis Using Multiparametric MRI.

Shen Z, Chen L, Wang L, Dong S, Wang F, Pan Y, Zhou J, Wang Y, Xu X, Chong H, Lin H, Li W, Li R, Ma H, Ma J, Yu Y, Du L, Wang X, Zhang S, Yan F

PubMed · Sep 10, 2025
Purpose: To assess the effectiveness of an explainable deep learning (DL) model, developed using multiparametric MRI (mpMRI) features, in improving the diagnostic accuracy and efficiency of radiologists for classification of focal liver lesions (FLLs). Materials and Methods: FLLs ≥ 1 cm in diameter at mpMRI were included in the study. nnU-Net and Liver Imaging Feature Transformer (LIFT) models were developed using retrospective data from one hospital (January 2018-August 2023). nnU-Net was used for lesion segmentation and LIFT for FLL classification. External testing was performed on data from three hospitals (January 2018-December 2023), with a prospective test set obtained from January 2024 to April 2024. Model performance was compared with radiologists, and the impact of model assistance on junior and senior radiologist performance was assessed. Evaluation metrics included the Dice similarity coefficient (DSC) and accuracy. Results: A total of 2131 individuals with FLLs (mean age, 56 ± 12 [SD] years; 1476 female) were included in the training, internal test, external test, and prospective test sets. Average DSC values for liver and tumor segmentation across the three test sets were 0.98 and 0.96, respectively. Average accuracies for feature and lesion classification across the three test sets were 93% and 97%, respectively. LIFT-assisted readings improved diagnostic accuracy (average 5.3% increase, P < .001), reduced reading time (average 34.5-second decrease, P < .001), and enhanced confidence (average 0.3-point increase, P < .001) of junior radiologists. Conclusion: The proposed DL model accurately detected and classified FLLs, improving the diagnostic accuracy and efficiency of junior radiologists. ©RSNA, 2025
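The segmentation metric reported above is standard; for reference, a minimal Dice similarity coefficient on binary masks (the toy masks are purely illustrative):

```python
import numpy as np

def dice_coefficient(pred, truth):
    """DSC = 2|A ∩ B| / (|A| + |B|) for binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(pred, truth).sum() / denom

a = np.zeros((8, 8), dtype=bool); a[2:6, 2:6] = True
b = np.zeros((8, 8), dtype=bool); b[3:7, 3:7] = True
print(dice_coefficient(a, b))  # 2*9 / (16 + 16) = 0.5625
```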
