Latest Papers on Radiology AI. Tags: GenAI

Prompt Mechanisms in Medical Imaging: A Comprehensive Survey

Hao Yang, Xinlong Liang, Zhang Li, Yue Sun, Zheyu Hu, Xinghe Xie, Behdad Dashtbozorg, Jincheng Huang, Shiwei Zhu, Luyi Han, Jiong Zhang, Shanshan Wang, Ritse Mann, Qifeng Yu, Tao Tan

•preprint•Jun 28 2025

Deep learning offers transformative potential in medical imaging, yet its clinical adoption is frequently hampered by challenges such as data scarcity, distribution shifts, and the need for robust task generalization. Prompt-based methodologies have emerged as a pivotal strategy to guide deep learning models, providing flexible, domain-specific adaptations that significantly enhance model performance and adaptability without extensive retraining. This systematic review critically examines the burgeoning landscape of prompt engineering in medical imaging. We dissect diverse prompt modalities, including textual instructions, visual prompts, and learnable embeddings, and analyze their integration for core tasks such as image generation, segmentation, and classification. Our synthesis reveals how these mechanisms improve task-specific outcomes by enhancing accuracy, robustness, and data efficiency and reducing reliance on manual feature engineering while fostering greater model interpretability by making the model's guidance explicit. Despite substantial advancements, we identify persistent challenges, particularly in prompt design optimization, data heterogeneity, and ensuring scalability for clinical deployment. Finally, this review outlines promising future trajectories, including advanced multimodal prompting and robust clinical integration, underscoring the critical role of prompt-driven AI in accelerating the revolution of diagnostics and personalized treatment planning in medicine.

Review GenAI

Improving radiology reporting accuracy: use of GPT-4 to reduce errors in reports.

Mayes CJ, Reyes C, Truman ME, Dodoo CA, Adler CR, Banerjee I, Khandelwal A, Alexander LF, Sheedy SP, Thompson CP, Varner JA, Zulfiqar M, Tan N

•papers•Jun 27 2025

Radiology reports are essential for communicating imaging findings to guide diagnosis and treatment. Although most radiology reports are accurate, errors can occur in the final reports due to high workloads, use of dictation software, and human error. Advanced artificial intelligence models, such as GPT-4, show potential as tools to improve report accuracy. This retrospective study evaluated how GPT-4 performed in detecting and correcting errors in finalized radiology reports in real-world settings for abdominopelvic computed tomography (CT) reports. We evaluated finalized CT abdominopelvic reports from a tertiary health system by using GPT-4 with zero-shot learning techniques. Six radiologists each reviewed 100 of their finalized reports (randomly selected), evaluating GPT-4's suggested revisions for agreement, acceptance, and clinical impact. The radiologists' responses were compared by years in practice and sex. GPT-4 identified issues and suggested revisions for 91% of the 600 reports; most revisions addressed grammar (74%). The radiologists agreed with 27% of the revisions and accepted 23%. Most revisions were rated as having no (44%) or low (46%) clinical impact. Potential harm was rare (8%), with only 2 cases of potentially severe harm. Radiologists with less experience (≤ 7 years of practice) were more likely to agree with the revisions suggested by GPT-4 than those with more experience (34% vs. 20%, P = .003) and accepted a greater percentage of the revisions (32% vs. 15%, P = .003). Although GPT-4 showed promise in identifying errors and improving the clarity of finalized radiology reports, most errors were categorized as minor, with no or low clinical impact. Collectively, the radiologists accepted 23% of the suggested revisions in their finalized reports. This study highlights the potential of GPT-4 as a prospective tool for radiology reporting, with further refinement needed for consistent use in clinical practice.

CT LLM Radiology Report Abdominal Retrospective Clinical In Silico Academic Lab GenAI

White Box Modeling of Self-Determined Sequence Exercise Program Among Sarcopenic Older Adults: Uncovering a Novel Strategy Overcoming Decline of Skeletal Muscle Area.

Wei M, He S, Meng D, Lv Z, Guo H, Yang G, Wang Z

•papers•Jun 27 2025

Resistance exercise, Taichi exercise, and the hybrid exercise program consisting of the two aforementioned methods have been demonstrated to increase the skeletal muscle mass of older individuals with sarcopenia. However, the exercise sequence has not been comprehensively investigated. Therefore, we designed a self-determined sequence exercise program, incorporating resistance exercises, Taichi, and the hybrid exercise program to overcome the decline of skeletal muscle area and reverse sarcopenia in older individuals. Ninety-one older patients with sarcopenia between the ages of 60 and 75 completed this three-stage randomized controlled trial for 24 weeks, including the self-determined sequence exercise program group (n = 31), the resistance training group (n = 30), and the control group (n = 30). We used quantitative computed tomography to measure the effects of different intervention protocols on skeletal muscle mass in participants. Participants' demographic variables were analyzed using one-way analysis of variance and chi-square tests, and experimental data were examined using repeated-measures analysis of variance. Furthermore, we utilized the Markov model to explain the effectiveness of the exercise programs among the three-stage intervention and explainable artificial intelligence to predict whether intervention programs can reverse sarcopenia. Repeated-measures analysis of variance results indicated that there were statistically significant Group × Time interactions detected in the L3 skeletal muscle density, L3 skeletal muscle area, muscle fat infiltration, handgrip strength, and relative skeletal muscle mass index. The stacking model exhibited the best accuracy (84.5%) and the best F1-score (68.8%) compared to other algorithms. In the self-determined sequence exercise program group, strength training contributed most to the reversal of sarcopenia. One self-determined sequence exercise program can improve skeletal muscle area among sarcopenic older people. Based on our stacking model, we can predict whether sarcopenia in older people can be reversed accurately. The trial was registered in ClinicalTrials.gov. TRN:NCT05694117. Our findings indicate that such tailored exercise interventions can substantially benefit sarcopenic patients, and our stacking model provides an accurate predictive tool for assessing the reversibility of sarcopenia in older adults. This approach not only enhances individual health outcomes but also informs future development of targeted exercise programs to mitigate age-related muscle decline.

CT Classification Abdominal RCT Clinical Pilot Academic Lab GenAI

Association of Covert Cerebrovascular Disease With Falls Requiring Medical Attention.

Clancy Ú, Puttock EJ, Chen W, Whiteley W, Vickery EM, Leung LY, Luetmer PH, Kallmes DF, Fu S, Zheng C, Liu H, Kent DM

•papers•Jun 27 2025

The impact of covert cerebrovascular disease on falls in the general population is not well-known. Here, we determine the time to a first fall following incidentally detected covert cerebrovascular disease during a clinical neuroimaging episode. This longitudinal cohort study assessed computed tomography (CT) and magnetic resonance imaging from 2009 to 2019 of patients aged >50 years registered with Kaiser Permanente Southern California which is a healthcare organization combining health plan coverage with coordinated medical services, excluding those with before stroke/dementia. We extracted evidence of incidental covert brain infarcts (CBI) and white matter hyperintensities/hypoattenuation (WMH) from imaging reports using natural language processing. We examined associations of CBI and WMH with falls requiring medical attention, using Cox proportional hazards regression models with adjustment for 12 variables including age, sex, ethnicity multimorbidity, polypharmacy, and incontinence. We assessed 241 050 patients, mean age 64.9 (SD, 10.42) years, 61.3% female, detecting covert cerebrovascular disease in 31.1% over a mean follow-up duration of 3.04 years. A recorded fall occurred in 21.2% (51 239/241 050) during follow-up. On CT, single fall incidence rate/1000 person-years (p-y) was highest in individuals with both CBI and WMH on CT (129.3 falls/1000 p-y [95% CI, 123.4-135.5]), followed by WMH (109.9 falls/1000 p-y [108.0-111.9]). On magnetic resonance imaging, the incidence rate was the highest with both CBI and WMH (76.3 falls/1000 p-y [95% CI, 69.7-83.2]), followed by CBI (71.4 falls/1000 p-y [95% CI, 65.9-77.2]). The adjusted hazard ratio for single index fall in individuals with CBI on CT was 1.13 (95% CI, 1.09-1.17); versus magnetic resonance imaging 1.17 (95% CI, 1.08-1.27). On CT, the risk for single index fall incrementally increased for mild (1.37 [95% CI, 1.32-1.43]), moderate (1.57 [95% CI, 1.48-1.67]), or severe WMH (1.57 [95% CI, 1.45-1.70]). On magnetic resonance imaging, index fall risk similarly increased with increasing WMH severity: mild (1.11 [95% CI, 1.07-1.17]), moderate (1.21 [95% CI, 1.13-1.28]), and severe WMH (1.34 [95% CI, 1.22-1.46]). In a large population with neuroimaging, CBI and WMH are independently associated with greater risks of an index fall. Increasing severities of WMH are associated incrementally with fall risk across imaging modalities.

Mixed Modality Classification Neurological Retrospective Clinical In Silico Academic Lab GenAI

Enhancing Diagnostic Precision: Utilising a Large Language Model to Extract U Scores from Thyroid Sonography Reports.

Watts E, Pournik O, Allington R, Ding X, Boelaert K, Sharma N, Ghalichi L, Arvanitis TN

•papers•Jun 26 2025

This study evaluates the performance of ChatGPT-4, a Large Language Model (LLM), in automatically extracting U scores from free-text thyroid ultrasound reports collected from University Hospitals Birmingham (UHB), UK, between 2014 and 2024. The LLM was provided with guidelines on the U classification system and extracted U scores independently from 14,248 de-identified reports, without access to human-assigned scores. The LLM-extracted scores were compared to initial clinician-assigned and refined U scores provided by expert reviewers. The LLM achieved 97.7% agreement with refined human U scores, successfully identifying the highest U score in 98.1% of reports with multiple nodules. Most discrepancies (2.5%) were linked to ambiguous descriptions, multi-nodule reports, and cases with human-documented uncertainty. While the results demonstrate the potential for LLMs to improve reporting consistency and reduce manual workload, ethical and governance challenges such as transparency, privacy, and bias must be addressed before routine clinical deployment. Embedding LLMs into reporting workflows, such as Online Analytical Processing (OLAP) tools, could further enhance reporting quality and consistency.

Ultrasound LLM Radiology Report Abdominal Retrospective Clinical In Silico Academic Lab GenAI Ethics

Deep transfer learning radiomics combined with explainable machine learning for preoperative thymoma risk prediction based on CT.

Wu S, Fan L, Wu Y, Xu J, Guo Y, Zhang H, Xu Z

•papers•Jun 26 2025

To develop and validate a computerized tomography (CT)‑based deep transfer learning radiomics model combined with explainable machine learning for preoperative risk prediction of thymoma. This retrospective study included 173 pathologically confirmed thymoma patients from our institution in the training group and 93 patients from two external centers in the external validation group. Tumors were classified according to the World Health Organization simplified criteria as low‑risk types (A, AB, and B1) or high‑risk types (B2 and B3). Radiomics features and deep transfer learning features were extracted from venous‑phase contrast‑enhanced CT images by using a modified Inception V3 network. Principal component analysis and least absolute shrinkage and selection operator regression identified 20 key predictors. Six classifiers-decision tree, gradient boosting machine, k‑nearest neighbors, naïve Bayes, random forest (RF), and support vector machine-were trained on five feature sets: CT imaging model, radiomics feature model, deep transfer learning feature model, combined feature model, and combined model. Interpretability was assessed with SHapley Additive exPlanations (SHAP), and an interactive web application was developed for real‑time individualized risk prediction and visualization. In the external validation group, the RF classifier achieved the highest area under the receiver operating characteristic curve (AUC) value of 0.956. In the training group, the AUC values for the CT imaging model, radiomics feature model, deep transfer learning feature model, combined feature model, and combined model were 0.684, 0.831, 0.815, 0.893, and 0.910, respectively. The corresponding AUC values in the external validation group were 0.604, 0.865, 0.880, 0.934, and 0.956, respectively. SHAP visualizations revealed the relative contribution of each feature, while the web application provided real‑time individual prediction probabilities with interpretative outputs. We developed a CT‑based deep transfer learning radiomics model combined with explainable machine learning and an interactive web application; this model achieved high accuracy and transparency for preoperative thymoma risk stratification, facilitating personalized clinical decision‑making.

CT Classification Chest Retrospective Clinical In Silico Academic Lab GenAI

Clinician-Led Code-Free Deep Learning for Detecting Papilloedema and Pseudopapilloedema Using Optic Disc Imaging

Shenoy, R., Samra, G. S., Sekhri, R., Yoon, H.-J., Teli, S., DeSilva, I., Tu, Z., Maconachie, G. D., Thomas, M. G.

•preprint•Jun 26 2025

ImportanceDifferentiating pseudopapilloedema from papilloedema is challenging, but critical for prompt diagnosis and to avoid unnecessary invasive procedures. Following diagnosis of papilloedema, objectively grading severity is important for determining urgency of management and therapeutic response. Automated machine learning (AutoML) has emerged as a promising tool for diagnosis in medical imaging and may provide accessible opportunities for consistent and accurate diagnosis and severity grading of papilloedema. ObjectiveThis study evaluates the feasibility of AutoML models for distinguishing the presence and severity of papilloedema using near infrared reflectance images (NIR) obtained from standard optical coherence tomography (OCT), comparing the performance of different AutoML platforms. Design, setting and participantsA retrospective cohort study was conducted using data from University Hospitals of Leicester, NHS Trust. The study involved 289 adults and children patients (813 images) who underwent optic nerve head-centred OCT imaging between 2021 and 2024. The dataset included patients with normal optic discs (69 patients, 185 images), papilloedema (135 patients, 372 images), and optic disc drusen (ODD) (85 patients, 256 images). AutoML platforms - Amazon Rekognition, Medic Mind (MM) and Google Vertex were evaluated for their ability to classify and grade papilloedema severity. Main outcomes and measuresTwo classification tasks were performed: (1) distinguishing papilloedema from normal discs and ODD; (2) grading papilloedema severity (mild/moderate vs. severe). Model performance was evaluated using area under the curve (AUC), precision, recall, F1 score, and confusion matrices for all six models. ResultsAmazon Rekognition outperformed the other platforms, achieving the highest AUC (0.90) and F1 score (0.81) in distinguishing papilloedema from normal/ODD. For papilloedema severity grading, Amazon Rekognition also performed best, with an AUC of 0.90 and F1 score of 0.79. Google Vertex and Medic Mind demonstrated good performance but had slightly lower accuracy and higher misclassification rates. Conclusions and relevanceThis evaluation of three widely available AutoML platforms using NIR images obtained from standard OCT shows promise in distinguishing and grading papilloedema. These models provide an accessible, scalable solution for clinical teams without coding expertise to feasibly develop intelligent diagnostic systems to recognise and characterise papilloedema. Further external validation and prospective testing is needed to confirm their clinical utility and applicability in diverse settings. Key PointsQuestion: Can clinician-led, code-free deep learning models using automated machine learning (AutoML) accurately differentiate papilloedema from pseudopapilloedema using optic disc imaging? Findings: Three widely available AutoML platforms were used to develop models that successfully distinguish the presence and severity of papilloedema on optic disc imaging, with Amazon Rekognition demonstrating the highest performance. Meaning: AutoML may assist clinical teams, even those with limited coding expertise, in diagnosing papilloedema, potentially reducing the need for invasive investigations.

OCT Classification Retrospective Clinical In Silico Academic Lab GenAI

Self-supervised learning for MRI reconstruction: a review and new perspective.

Li X, Huang J, Sun G, Yang Z

•papers•Jun 26 2025

To review the latest developments in self-supervised deep learning (DL) techniques for magnetic resonance imaging (MRI) reconstruction, emphasizing their potential to overcome the limitations of supervised methods dependent on fully sampled k-space data. While DL has significantly advanced MRI, supervised approaches require large amounts of fully sampled k-space data for training-a major limitation given the impracticality and expense of acquiring such data clinically. Self-supervised learning has emerged as a promising alternative, enabling model training using only undersampled k-space data, thereby enhancing feasibility and driving research interest. We conducted a comprehensive literature review to synthesize recent progress in self-supervised DL for MRI reconstruction. The analysis focused on methods and architectures designed to improve image quality, reduce scanning time, and address data scarcity challenges, drawing from peer-reviewed publications and technical innovations in the field. Self-supervised DL holds transformative potential for MRI reconstruction, offering solutions to data limitations while maintaining image quality and accelerating scans. Key challenges include robustness across diverse anatomies, standardization of validation, and clinical integration. Future research should prioritize hybrid methodologies, domain-specific adaptations, and rigorous clinical validation. This review consolidates advancements and unresolved issues, providing a foundation for next-generation medical imaging technologies.

MRI Reconstruction Review Concept Academic Lab GenAI

MedPrompt: LLM-CNN Fusion with Weight Routing for Medical Image Segmentation and Classification

Shadman Sobhan, Kazi Abrar Mahmud, Abduz Zami

•preprint•Jun 26 2025

Current medical image analysis systems are typically task-specific, requiring separate models for classification and segmentation, and lack the flexibility to support user-defined workflows. To address these challenges, we introduce MedPrompt, a unified framework that combines a few-shot prompted Large Language Model (Llama-4-17B) for high-level task planning with a modular Convolutional Neural Network (DeepFusionLab) for low-level image processing. The LLM interprets user instructions and generates structured output to dynamically route task-specific pretrained weights. This weight routing approach avoids retraining the entire framework when adding new tasks-only task-specific weights are required, enhancing scalability and deployment. We evaluated MedPrompt across 19 public datasets, covering 12 tasks spanning 5 imaging modalities. The system achieves a 97% end-to-end correctness in interpreting and executing prompt-driven instructions, with an average inference latency of 2.5 seconds, making it suitable for near real-time applications. DeepFusionLab achieves competitive segmentation accuracy (e.g., Dice 0.9856 on lungs) and strong classification performance (F1 0.9744 on tuberculosis). Overall, MedPrompt enables scalable, prompt-driven medical imaging by combining the interpretability of LLMs with the efficiency of modular CNNs.

Mixed Modality Classification Chest Methodology In Silico Academic Lab GenAI

Harnessing Generative AI for Lung Nodule Spiculation Characterization.

Wang Y, Patel C, Tchoua R, Furst J, Raicu D

•papers•Jun 26 2025

Spiculation, characterized by irregular, spike-like projections from nodule margins, serves as a crucial radiological biomarker for malignancy assessment and early cancer detection. These distinctive stellate patterns strongly correlate with tumor invasiveness and are vital for accurate diagnosis and treatment planning. Traditional computer-aided diagnosis (CAD) systems are limited in their capability to capture and use these patterns given their subtlety, difficulty in quantifying them, and small datasets available to learn these patterns. To address these challenges, we propose a novel framework leveraging variational autoencoders (VAE) to discover, extract, and vary disentangled latent representations of lung nodule images. By gradually varying the latent representations of non-spiculated nodule images, we generate augmented datasets containing spiculated nodule variations that, we hypothesize, can improve the diagnostic classification of lung nodules. Using the National Institutes of Health/National Cancer Institute Lung Image Database Consortium (LIDC) dataset, our results show that incorporating these spiculated image variations into the classification pipeline significantly improves spiculation detection performance up to 7.53%. Notably, this enhancement in spiculation detection is achieved while preserving the classification performance of non-spiculated cases. This approach effectively addresses class imbalance and enhances overall classification outcomes. The gradual attenuation of spiculation characteristics demonstrates our model's ability to both capture and generate clinically relevant semantic features in an algorithmic manner. These findings suggest that the integration of semantic-based latent representations into CAD models not only enhances diagnostic accuracy but also provides insights into the underlying morphological progression of spiculated nodules, enabling more informed and clinically meaningful AI-driven support systems.

CT Classification Chest Methodology In Silico Academic Lab GenAI Open Dataset

Filter Papers

Tags

Prompt Mechanisms in Medical Imaging: A Comprehensive Survey

Improving radiology reporting accuracy: use of GPT-4 to reduce errors in reports.

White Box Modeling of Self-Determined Sequence Exercise Program Among Sarcopenic Older Adults: Uncovering a Novel Strategy Overcoming Decline of Skeletal Muscle Area.

Association of Covert Cerebrovascular Disease With Falls Requiring Medical Attention.

Enhancing Diagnostic Precision: Utilising a Large Language Model to Extract U Scores from Thyroid Sonography Reports.

Deep transfer learning radiomics combined with explainable machine learning for preoperative thymoma risk prediction based on CT.

Clinician-Led Code-Free Deep Learning for Detecting Papilloedema and Pseudopapilloedema Using Optic Disc Imaging

Self-supervised learning for MRI reconstruction: a review and new perspective.

MedPrompt: LLM-CNN Fusion with Weight Routing for Medical Image Segmentation and Classification

Harnessing Generative AI for Lung Nodule Spiculation Characterization.

Ready to Sharpen Your Edge?