Page 51 of 78779 results

Data extraction from free-text stroke CT reports using GPT-4o and Llama-3.3-70B: the impact of annotation guidelines.

Wihl J, Rosenkranz E, Schramm S, Berberich C, Griessmair M, Woźnicki P, Pinto F, Ziegelmayer S, Adams LC, Bressem KK, Kirschke JS, Zimmer C, Wiestler B, Hedderich D, Kim SH

PubMed | papers | Jun 19 2025
To evaluate the impact of an annotation guideline on the performance of large language models (LLMs) in extracting data from stroke computed tomography (CT) reports. The performance of GPT-4o and Llama-3.3-70B in extracting ten imaging findings from stroke CT reports was assessed in two datasets from a single academic stroke center. Dataset A (n = 200) was a stratified cohort including various pathological findings, whereas dataset B (n = 100) was a consecutive cohort. Initially, an annotation guideline providing clear data extraction instructions was designed based on a review of cases with inter-annotator disagreements in dataset A. For each LLM, data extraction was performed under two conditions: with the annotation guideline included in the prompt and without it. GPT-4o consistently demonstrated superior performance over Llama-3.3-70B under identical conditions, with micro-averaged precision ranging from 0.83 to 0.95 for GPT-4o and from 0.65 to 0.86 for Llama-3.3-70B. Across both models and both datasets, incorporating the annotation guideline into the LLM input resulted in higher precision rates, while recall rates largely remained stable. In dataset B, the precision of GPT-4o and Llama-3.3-70B improved from 0.83 to 0.95 and from 0.87 to 0.94, respectively. Overall classification performance with and without the annotation guideline was significantly different in five out of six conditions. GPT-4o and Llama-3.3-70B show promising performance in extracting imaging findings from stroke CT reports, although GPT-4o consistently outperformed Llama-3.3-70B. We also provide evidence that well-defined annotation guidelines can enhance LLM data extraction accuracy. Annotation guidelines can improve the accuracy of LLMs in extracting findings from radiological reports, potentially optimizing data extraction for specific downstream applications. LLMs have utility in data extraction from radiology reports, but the role of annotation guidelines remains underexplored.
Data extraction accuracy from stroke CT reports by GPT-4o and Llama-3.3-70B improved when well-defined annotation guidelines were incorporated into the model prompt. Well-defined annotation guidelines can improve the accuracy of LLMs in extracting imaging findings from radiological reports.
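For readers unfamiliar with micro-averaging, the sketch below shows one way to compute micro-averaged precision and recall: true positives, false positives and false negatives are pooled across all findings and reports before dividing. The function name, the label layout (one finding-to-boolean dict per report) and the toy data are illustrative assumptions, not the study's code or data.

```python
from typing import Dict, List, Tuple


def micro_precision_recall(
    truth: List[Dict[str, bool]],
    pred: List[Dict[str, bool]],
) -> Tuple[float, float]:
    """Pool TP/FP/FN over every (report, finding) pair, then divide once."""
    tp = fp = fn = 0
    for true_labels, pred_labels in zip(truth, pred):
        for finding, present in true_labels.items():
            # A finding missing from the prediction is treated as "not extracted".
            predicted = pred_labels.get(finding, False)
            if predicted and present:
                tp += 1
            elif predicted and not present:
                fp += 1
            elif present and not predicted:
                fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy labels for two reports and two findings.
truth = [
    {"hemorrhage": True, "infarct": False},
    {"hemorrhage": False, "infarct": True},
]
pred = [
    {"hemorrhage": True, "infarct": True},   # one false positive
    {"hemorrhage": False, "infarct": True},
]
precision, recall = micro_precision_recall(truth, pred)
```

In this toy case the single over-called finding lowers precision while recall is unaffected, which is the kind of error a stricter guideline in the prompt would be expected to reduce.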

Artificial Intelligence Language Models to Translate Professional Radiology Mammography Reports Into Plain Language - Impact on Interpretability and Perception by Patients.

Pisarcik D, Kissling M, Heimer J, Farkas M, Leo C, Kubik-Huch RA, Euler A

PubMed | papers | Jun 19 2025
This study aimed to evaluate the interpretability and patient perception of AI-translated mammography and sonography reports, focusing on comprehensibility, follow-up recommendations, and conveyed empathy using a survey. In this observational study, three fictional mammography and sonography reports with BI-RADS categories 3, 4, and 5 were created. These reports were repeatedly translated to plain language by three different large language models (LLMs: ChatGPT-4, ChatGPT-4o, Google Gemini). In a first step, the best of these repeatedly translated reports for each BI-RADS category and LLM was selected by two experts in breast imaging considering factual correctness, completeness, and quality. In a second step, female participants compared and rated the translated reports regarding comprehensibility, follow-up recommendations, conveyed empathy, and additional value of each report using a survey with Likert scales. Statistical analysis included cumulative link mixed models and the Plackett-Luce model for ranking preferences. Forty women participated in the survey. GPT-4 and GPT-4o were rated significantly higher than Gemini across all categories (P<.001). Participants >50 years of age rated the reports significantly higher than participants aged 18-29 years (P<.05). Higher education predicted lower ratings (P=.02). Having had no prior mammography was associated with higher scores (P=.03), and AI experience had no effect (P=.88). Ranking analysis showed GPT-4o as the most preferred (preference probability .48), followed by GPT-4 (.37), with Gemini ranked last (.15). Patient preference differed among AI-translated radiology reports. Compared to a traditional report using radiological language, AI-translated reports add value for patients, enhance comprehensibility and empathy, and therefore hold the potential to improve patient communication in breast imaging.
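As a rough illustration of how a Plackett-Luce model turns per-item "worth" values into ranking probabilities, the sketch below multiplies, at each rank position, the chosen item's worth by the reciprocal of the total worth still available. It assumes that the three values reported for the ranking analysis (.48, .37, .15, which sum to 1) are the model's estimated worths; that reading, and the function name, are assumptions rather than anything stated in the abstract.

```python
from typing import Sequence


def plackett_luce_ranking_probability(worths_in_rank_order: Sequence[float]) -> float:
    """Probability of observing exactly this ranking under Plackett-Luce.

    At each position the next item is picked with probability
    worth / (sum of worths of items not yet ranked).
    """
    remaining = sum(worths_in_rank_order)
    probability = 1.0
    for worth in worths_in_rank_order:
        probability *= worth / remaining
        remaining -= worth
    return probability


# Assumed worths, taken from the values reported in the abstract.
worths = {"GPT-4o": 0.48, "GPT-4": 0.37, "Gemini": 0.15}
p_reported_order = plackett_luce_ranking_probability(
    [worths["GPT-4o"], worths["GPT-4"], worths["Gemini"]]
)
```

Under that assumption, the full ordering GPT-4o, GPT-4, Gemini would be the single most likely ranking, but still only one of six possible orderings a participant could give.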

CLAIM: Clinically-Guided LGE Augmentation for Realistic and Diverse Myocardial Scar Synthesis and Segmentation

Farheen Ramzan, Yusuf Kiberu, Nikesh Jathanna, Shahnaz Jamil-Copley, Richard H. Clayton, Chen Chen

arXiv | preprint | Jun 18 2025
Deep learning-based myocardial scar segmentation from late gadolinium enhancement (LGE) cardiac MRI has shown great potential for accurate and timely diagnosis and treatment planning for structural cardiac diseases. However, the limited availability and variability of LGE images with high-quality scar labels restrict the development of robust segmentation models. To address this, we introduce CLAIM: Clinically-Guided LGE Augmentation for Realistic and Diverse Myocardial Scar Synthesis and Segmentation, a framework for anatomically grounded scar generation and segmentation. At its core is the SMILE module (Scar Mask generation guided by cLinical knowledgE), which conditions a diffusion-based generator on the clinically adopted AHA 17-segment model to synthesize images with anatomically consistent and spatially diverse scar patterns. In addition, CLAIM employs a joint training strategy in which the scar segmentation network is optimized alongside the generator, aiming to enhance both the realism of synthesized scars and the accuracy of scar segmentation. Experimental results show that CLAIM produces anatomically coherent scar patterns and achieves higher Dice similarity with real scar distributions compared to baseline models. Our approach enables controllable and realistic myocardial scar synthesis and has demonstrated utility for downstream medical imaging tasks.
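The Dice similarity mentioned above can be sketched as follows for two binary masks, using the standard definition 2|A∩B| / (|A| + |B|). The flattened-list mask representation, the function name and the toy masks are illustrative assumptions; the abstract does not describe how the authors computed it.

```python
from typing import Sequence


def dice_similarity(mask_a: Sequence[int], mask_b: Sequence[int]) -> float:
    """Dice coefficient between two flattened binary masks (1 = scar, 0 = background)."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Convention assumed here: two empty masks count as a perfect match.
    return 2.0 * intersection / total if total else 1.0


# Hypothetical 4-voxel masks: one voxel overlaps out of three labelled in total.
score = dice_similarity([1, 1, 0, 0], [1, 0, 0, 0])
```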

Implicit neural representations for accurate estimation of the standard model of white matter

Tom Hendriks, Gerrit Arends, Edwin Versteeg, Anna Vilanova, Maxime Chamberland, Chantal M. W. Tax

arXiv | preprint | Jun 18 2025
Diffusion magnetic resonance imaging (dMRI) enables non-invasive investigation of tissue microstructure. The Standard Model (SM) of white matter aims to disentangle dMRI signal contributions from intra- and extra-axonal water compartments. However, due to the model's high-dimensional nature, extensive acquisition protocols with multiple b-values and diffusion tensor shapes are typically required to mitigate parameter degeneracies. Even then, accurate estimation remains challenging due to noise. This work introduces a novel estimation framework based on implicit neural representations (INRs), which incorporate spatial regularization through the sinusoidal encoding of the input coordinates. The INR method is evaluated on both synthetic and in vivo datasets and compared to parameter estimates using cubic polynomials, supervised neural networks, and nonlinear least squares. Results demonstrate superior accuracy of the INR method in estimating SM parameters, particularly in low signal-to-noise conditions. Additionally, spatial upsampling of the INR can represent the underlying dataset in a continuous and anatomically plausible way, which is unattainable with linear or cubic interpolation. The INR is fully unsupervised, eliminating the need for labeled training data. It achieves fast inference (~6 minutes), is robust to both Gaussian and Rician noise, supports joint estimation of SM kernel parameters and the fiber orientation distribution function with spherical harmonics orders up to at least 8 and non-negativity constraints, and accommodates spatially varying acquisition protocols caused by magnetic gradient non-uniformities. The combination of these properties, along with the possibility to easily adapt the framework to other dMRI models, positions INRs as a potentially important tool for analyzing and interpreting diffusion MRI data.
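To make "sinusoidal encoding of the input coordinates" concrete, the sketch below shows one common form of such an encoding, in which each coordinate is mapped to sine and cosine features at doubling frequencies before being fed to a network. The specific frequency schedule (2^k · π), the function name and the number of frequencies are assumptions for illustration; the abstract does not specify which encoding the authors use.

```python
import math
from typing import List, Sequence


def sinusoidal_encoding(coords: Sequence[float], num_frequencies: int = 4) -> List[float]:
    """Map each coordinate x to [sin(2^k * pi * x), cos(2^k * pi * x)] for k = 0..K-1."""
    features: List[float] = []
    for x in coords:
        for k in range(num_frequencies):
            frequency = (2.0 ** k) * math.pi
            features.append(math.sin(frequency * x))
            features.append(math.cos(frequency * x))
    return features


# A hypothetical normalized voxel coordinate (x, y, z).
encoded = sinusoidal_encoding([0.0, 0.5, 1.0], num_frequencies=4)
```

Limiting the highest frequency caps how quickly the network output can vary in space, which is one intuition for why such an encoding acts as a spatial regularizer.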

Federated Learning for MRI-based BrainAGE: a multicenter study on post-stroke functional outcome prediction

Vincent Roca, Marc Tommasi, Paul Andrey, Aurélien Bellet, Markus D. Schirmer, Hilde Henon, Laurent Puy, Julien Ramon, Grégory Kuchcinski, Martin Bretzner, Renaud Lopes

arXiv | preprint | Jun 18 2025
Objective: Brain-predicted age difference (BrainAGE) is a neuroimaging biomarker reflecting brain health. However, training robust BrainAGE models requires large datasets, often restricted by privacy concerns. This study evaluates the performance of federated learning (FL) for BrainAGE estimation in ischemic stroke patients treated with mechanical thrombectomy, and investigates its association with clinical phenotypes and functional outcomes. Methods: We used FLAIR brain images from 1674 stroke patients across 16 hospital centers. We implemented standard machine learning and deep learning models for BrainAGE estimates under three data management strategies: centralized learning (pooled data), FL (local training at each site), and single-site learning. We reported prediction errors and examined associations between BrainAGE and vascular risk factors (e.g., diabetes mellitus, hypertension, smoking), as well as functional outcomes at three months post-stroke. Logistic regression evaluated BrainAGE's predictive value for these outcomes, adjusting for age, sex, vascular risk factors, stroke severity, time between MRI and arterial puncture, prior intravenous thrombolysis, and recanalisation outcome. Results: While centralized learning yielded the most accurate predictions, FL consistently outperformed single-site models. BrainAGE was significantly higher in patients with diabetes mellitus across all models. Comparisons between patients with good and poor functional outcomes, and multivariate predictions of these outcomes showed the significance of the association between BrainAGE and post-stroke recovery. Conclusion: FL enables accurate age predictions without data centralization. The strong association between BrainAGE, vascular risk factors, and post-stroke recovery highlights its potential for prognostic modeling in stroke care.
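The two ideas in this abstract can be sketched in a few lines: a federated aggregation step that combines locally trained parameters without pooling images, and the BrainAGE value itself as predicted minus chronological age. The aggregation shown is a FedAvg-style sample-weighted average, which is an assumption; the abstract does not say which FL algorithm was used, and the function names and numbers are hypothetical.

```python
from typing import List, Sequence


def federated_average(
    site_weights: Sequence[Sequence[float]],
    site_sizes: Sequence[int],
) -> List[float]:
    """Sample-weighted average of per-site model parameters (FedAvg-style, assumed)."""
    if len(site_weights) != len(site_sizes):
        raise ValueError("one sample count is needed per site")
    total = sum(site_sizes)
    n_params = len(site_weights[0])
    return [
        sum(weights[i] * size for weights, size in zip(site_weights, site_sizes)) / total
        for i in range(n_params)
    ]


def brain_age_gap(predicted_age: float, chronological_age: float) -> float:
    """BrainAGE: positive values mean the brain looks older than the patient is."""
    return predicted_age - chronological_age


# Hypothetical two-parameter models from two sites with 100 and 300 patients.
global_weights = federated_average([[1.0, 2.0], [3.0, 4.0]], [100, 300])
gap = brain_age_gap(predicted_age=72.0, chronological_age=65.0)
```

Only the parameter vectors and sample counts leave each site in this scheme, which is what allows training across hospitals without centralizing patient images.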

RadioRAG: Online Retrieval-augmented Generation for Radiology Question Answering.

Tayebi Arasteh S, Lotfinia M, Bressem K, Siepmann R, Adams L, Ferber D, Kuhl C, Kather JN, Nebelung S, Truhn D

PubMed | papers | Jun 18 2025
"Just Accepted" papers have undergone full peer review and have been accepted for publication in Radiology: Artificial Intelligence. This article will undergo copyediting, layout, and proof review before it is published in its final version. Please note that during production of the final copyedited article, errors may be discovered which could affect the content. Purpose: To evaluate diagnostic accuracy of various large language models (LLMs) when answering radiology-specific questions with and without access to additional online, up-to-date information via retrieval-augmented generation (RAG). Materials and Methods: The authors developed Radiology RAG (RadioRAG), an end-to-end framework that retrieves data from authoritative radiologic online sources in real time. RAG incorporates information retrieval from external sources to supplement the initial prompt, grounding the model's response in relevant information. Using 80 questions from the RSNA Case Collection across radiologic subspecialties and 24 additional expert-curated questions with reference standard answers, LLMs (GPT-3.5-turbo, GPT-4, Mistral-7B, Mixtral-8x7B, and Llama3 [8B and 70B]) were prompted with and without RadioRAG in a zero-shot inference scenario (temperature ≤ 0.1, top-p = 1). RadioRAG retrieved context-specific information from www.radiopaedia.org. Accuracy of LLMs with and without RadioRAG in answering questions from each dataset was assessed. Statistical analyses were performed using bootstrapping while preserving pairing. Additional assessments included comparison of model with human performance and comparison of time required for conventional versus RadioRAG-powered question answering. Results: RadioRAG improved accuracy for some LLMs, including GPT-3.5-turbo [74% (59/80) versus 66% (53/80), FDR = 0.03] and Mixtral-8x7B [76% (61/80) versus 65% (52/80), FDR = 0.02] on the RSNA-RadioQA dataset, with similar trends in the ExtendedQA dataset. Accuracy exceeded that of a human expert (63%, 50/80) for these LLMs (FDR ≤ 0.007), but not for Mistral-7B-instruct-v0.2, Llama3-8B, and Llama3-70B (FDR ≥ 0.21). RadioRAG reduced hallucinations for all LLMs (rates from 6-25%). RadioRAG increased estimated response time fourfold. Conclusion: RadioRAG shows potential to improve LLM accuracy and factuality in radiology question answering by integrating real-time domain-specific data. ©RSNA, 2025.
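"Bootstrapping while preserving pairing" can be sketched as resampling question indices, so that each resampled question contributes both its with-RAG and its without-RAG result. The percentile interval, the function name, the number of resamples and the toy correctness vectors below are assumptions for illustration; they are not the study's procedure or data.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_accuracy_diff(
    correct_a: Sequence[int],
    correct_b: Sequence[int],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Observed accuracy difference (A minus B) with a 95% percentile interval.

    Inputs are 0/1 correctness flags for the same questions in the same order.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both conditions must cover the same questions")
    rng = random.Random(seed)
    n = len(correct_a)
    observed = (sum(correct_a) - sum(correct_b)) / n
    diffs = []
    for _ in range(n_resamples):
        # Resample questions, not individual answers, to keep the pairing intact.
        indices = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in indices) / n)
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    return observed, lower, upper


# Hypothetical per-question results for eight questions.
with_rag = [1, 1, 1, 0, 1, 0, 1, 1]
without_rag = [1, 0, 1, 0, 0, 0, 1, 1]
observed, lower, upper = paired_bootstrap_accuracy_diff(with_rag, without_rag)
```

Resampling the two conditions independently would discard the fact that hard questions tend to be hard for both, which typically widens the interval.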

Toward general text-guided multimodal brain MRI synthesis for diagnosis and medical image analysis.

Wang Y, Xiong H, Sun K, Bai S, Dai L, Ding Z, Liu J, Wang Q, Liu Q, Shen D

PubMed | papers | Jun 17 2025
Multimodal brain magnetic resonance imaging (MRI) offers complementary insights into brain structure and function, thereby improving the diagnostic accuracy of neurological disorders and advancing brain-related research. However, the widespread applicability of MRI is substantially limited by restricted scanner accessibility and prolonged acquisition times. Here, we present TUMSyn, a text-guided universal MRI synthesis model capable of generating brain MRI specified by textual imaging metadata from routinely acquired scans. We ensure the reliability of TUMSyn by constructing a brain MRI database comprising 31,407 3D images across 7 MRI modalities from 13 worldwide centers and pre-training an MRI-specific text encoder to process text prompts effectively. Experiments on diverse datasets and physician assessments indicate that TUMSyn-generated images can be utilized along with acquired MRI scan(s) to facilitate large-scale MRI-based screening and diagnosis of multiple brain diseases, substantially reducing the time and cost of MRI in the healthcare system.

DGG-XNet: A Hybrid Deep Learning Framework for Multi-Class Brain Disease Classification with Explainable AI

Sumshun Nahar Eity, Mahin Montasir Afif, Tanisha Fairooz, Md. Mortuza Ahmmed, Md Saef Ullah Miah

arXiv | preprint | Jun 17 2025
Accurate diagnosis of brain disorders such as Alzheimer's disease and brain tumors remains a critical challenge in medical imaging. Conventional methods based on manual MRI analysis are often inefficient and error-prone. To address this, we propose DGG-XNet, a hybrid deep learning model integrating VGG16 and DenseNet121 to enhance feature extraction and classification. DenseNet121 promotes feature reuse and efficient gradient flow through dense connectivity, while VGG16 contributes strong hierarchical spatial representations. Their fusion enables robust multiclass classification of neurological conditions. Grad-CAM is applied to visualize salient regions, enhancing model transparency. Trained on a combined dataset from BraTS 2021 and Kaggle, DGG-XNet achieved a test accuracy of 91.33%, with precision, recall, and F1-score all exceeding 91%. These results highlight DGG-XNet's potential as an effective and interpretable tool for computer-aided diagnosis (CAD) of neurodegenerative and oncological brain disorders.

Appropriateness of acute breast symptom recommendations provided by ChatGPT.

Byrd C, Kingsbury C, Niell B, Funaro K, Bhatt A, Weinfurtner RJ, Ataya D

PubMed | papers | Jun 16 2025
We evaluated the accuracy of ChatGPT-3.5's responses to common questions regarding acute breast symptoms and explored whether using lay language, as opposed to medical language, affected the accuracy of the responses. Questions were formulated addressing acute breast conditions, informed by the American College of Radiology (ACR) Appropriateness Criteria (AC) and our clinical experience at a tertiary referral breast center. Of these, seven addressed the most common acute breast symptoms, nine addressed pregnancy-associated breast symptoms, and four addressed specific management and imaging recommendations for a palpable breast abnormality. Questions were submitted three times to ChatGPT-3.5 and all responses were assessed by five fellowship-trained breast radiologists. Evaluation criteria included clinical judgment and adherence to the ACR guidelines, with responses scored as: 1) "appropriate," 2) "inappropriate" if any response contained inappropriate information, or 3) "unreliable" if responses were inconsistent. A majority vote determined the appropriateness for each question. ChatGPT-3.5-generated responses were appropriate for 7/7 (100%) questions regarding common acute breast symptoms when phrased both colloquially and using standard medical terminology. In contrast, ChatGPT-3.5-generated responses were appropriate for 3/9 (33%) questions about pregnancy-associated breast symptoms and 3/4 (75%) questions about management and imaging recommendations for a palpable breast abnormality. ChatGPT-3.5 can automate healthcare information related to appropriate management of acute breast symptoms when prompted with either standard medical terminology or lay phrasing of the questions. However, physician oversight remains critical given the presence of inappropriate recommendations for pregnancy-associated breast symptoms and management of palpable abnormalities.
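The majority-vote step described above can be sketched as counting the five readers' labels for a question and keeping the label chosen by more than half of them. The function name, the strict-majority rule and the handling of a split vote (returning no label) are assumptions; the abstract does not say how ties were resolved.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(votes: Sequence[str]) -> Optional[str]:
    """Return the label chosen by more than half of the readers, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


# Hypothetical votes from five radiologists for one question.
votes = ["appropriate", "appropriate", "inappropriate", "appropriate", "unreliable"]
result = majority_label(votes)
```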