Evaluating Large Language Models for Enhancing Radiology Specialty Examination: A Comparative Study with Human Performance.

Liu HY, Chen SJ, Wang W, Lee CH, Hsu HH, Shen SH, Chiou HJ, Lee WJ

PubMed · May 27, 2025
The radiology specialty examination assesses clinical decision-making, image interpretation, and diagnostic reasoning. With the expansion of medical knowledge, traditional test design faces challenges in maintaining accuracy and relevance. Large language models (LLMs) demonstrate potential in medical education. This study evaluates LLM performance on radiology specialty exams, explores their role in assessing question difficulty, and investigates their reasoning processes, aiming to develop a more objective and efficient framework for exam design. This study compared the performance of LLMs and human examinees in a radiology specialty examination. Three LLMs (GPT-4o, o1-preview, and GPT-3.5-turbo-1106) were evaluated under zero-shot conditions. Exam accuracy, examinee accuracy, discrimination index, and point-biserial correlation were used to assess the LLMs' ability to predict question difficulty and their reasoning processes. Data provided by the Taiwan Radiological Society ensured comparability between AI and human performance. In terms of accuracy, GPT-4o (88.0%) and o1-preview (90.9%) outperformed human examinees (76.3%), whereas GPT-3.5-turbo-1106 showed significantly lower accuracy (50.2%). Question difficulty analysis revealed that the newer LLMs excelled at solving complex questions, while GPT-3.5-turbo-1106 exhibited greater performance variability. Discrimination index and point-biserial correlation analyses demonstrated that GPT-4o and o1-preview accurately identified key differentiating questions, closely mirroring human reasoning patterns. These findings suggest that advanced LLMs can assess medical examination difficulty, offering potential applications in exam standardization and question evaluation. This study evaluated the problem-solving capabilities of GPT-3.5-turbo-1106, GPT-4o, and o1-preview in a radiology specialty examination. LLMs should be used as tools for assessing exam question difficulty and assisting in the standardized development of medical examinations.
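
As a concrete illustration of the item-analysis statistics mentioned above, the sketch below computes a question's difficulty, discrimination index, and point-biserial correlation from a toy examinee-by-question response matrix. The data and the upper/lower-third grouping are assumptions for demonstration only, not the study's actual dataset or protocol.

```python
# Illustrative sketch (not from the paper): difficulty, discrimination index,
# and point-biserial correlation for one question in a binary response matrix.
import numpy as np
from scipy.stats import pointbiserialr

responses = np.array([  # rows = examinees, columns = questions; 1 = correct (toy data)
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 1],
])
total_scores = responses.sum(axis=1)

q = 2  # index of the question being analyzed
difficulty = responses[:, q].mean()  # proportion of examinees answering correctly

# Discrimination index: correct rate in the top-scoring group minus the bottom group.
order = np.argsort(total_scores)
k = max(1, len(order) // 3)          # use upper and lower thirds of examinees
low, high = order[:k], order[-k:]
discrimination = responses[high, q].mean() - responses[low, q].mean()

# Point-biserial correlation between the item score and the total score.
r_pb, p_value = pointbiserialr(responses[:, q], total_scores)
print(difficulty, discrimination, r_pb)
```

In practice the item-total correlation is often computed against a corrected total score that excludes the item itself; the uncorrected version above keeps the sketch short.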

Integrating Large language models into radiology workflow: Impact of generating personalized report templates from summary.

Gupta A, Hussain M, Nikhileshwar K, Rastogi A, Rangarajan K

PubMed · May 25, 2025
To evaluate the feasibility of using large language models (LLMs) to convert radiologist-generated report summaries into personalized report templates, and to assess the impact on scan reporting time and quality. In this retrospective study, 100 CT scans from oncology patients were randomly divided into two equal sets. Two radiologists generated conventional reports for one set and summary reports for the other, and vice versa. Three LLMs - GPT-4, Google Gemini, and Claude Opus - generated complete reports from the summaries using institution-specific generic templates. Two expert radiologists qualitatively evaluated the radiologist summaries and LLM-generated reports using the ACR RADPEER scoring system, using conventional radiologist reports as reference. Reporting time for conventional versus summary-based reports was compared, and LLM-generated reports were analyzed for errors. Quantitative similarity and linguistic metrics were computed to assess report alignment across models with the original radiologist-generated report summaries. Statistical analyses were performed using Python 3.0 to identify significant differences in reporting times, error rates, and quantitative metrics. The average reporting time was significantly shorter for the summary method (6.76 min) than for the conventional method (8.95 min) (p < 0.005). Among the 100 radiologist summaries, 10 received RADPEER scores worse than 1, with three deemed to have clinically significant discrepancies. Only one LLM-generated report received a worse RADPEER score than its corresponding summary. Error frequencies among LLM-generated reports showed no significant differences across models, with template-related errors being most common (χ² = 1.146, p = 0.564). Quantitative analysis indicated significant differences in similarity and linguistic metrics among the three LLMs (p < 0.05), reflecting unique generation patterns. Summary-based scan reporting, along with the use of LLMs to generate complete personalized report templates, can shorten reporting time while maintaining report quality. However, there remains a need for human oversight to address errors in the generated reports. Summary-based reporting of radiological studies, together with the use of large language models to generate tailored reports from generic templates, has the potential to make the workflow more efficient by shortening reporting time while maintaining reporting quality.
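
To make the summary-to-report step concrete, here is a minimal sketch of how a radiologist's summary might be expanded into a full report with a generic template via an LLM API. The template text, prompt wording, and model name are illustrative assumptions, not the study's actual materials.

```python
# Illustrative sketch only: expanding a radiologist's summary into a full report
# using a generic section template. Template, prompt, and model are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GENERIC_TEMPLATE = """CT CHEST/ABDOMEN/PELVIS
CLINICAL HISTORY:
TECHNIQUE:
FINDINGS:
  Lungs:
  Liver:
  Lymph nodes:
  Bones:
IMPRESSION:"""

def summary_to_report(summary: str) -> str:
    prompt = (
        "You are a radiology reporting assistant. Expand the summary below into a "
        "complete structured report using the template. Keep every positive finding "
        "from the summary, mark unmentioned organs as unremarkable, and do not add "
        "new findings.\n\n"
        f"TEMPLATE:\n{GENERIC_TEMPLATE}\n\nSUMMARY:\n{summary}"
    )
    response = client.chat.completions.create(
        model="gpt-4",  # the study also used Gemini and Claude via their own APIs
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

print(summary_to_report("Stable 2.1 cm RUL nodule. New 8 mm liver hypodensity, indeterminate."))
```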

Evaluating the Performance of Reasoning Large Language Models on Japanese Radiology Board Examination Questions.

Nakaura T, Takamure H, Kobayashi N, Shiraishi K, Yoshida N, Nagayama Y, Uetani H, Kidoh M, Funama Y, Hirai T

PubMed · May 17, 2025
This study evaluates the performance, cost, and processing time of OpenAI's reasoning large language models (LLMs) (o1-preview, o1-mini) and their base models (GPT-4o, GPT-4o-mini) on Japanese radiology board examination questions. A total of 210 questions from the 2022-2023 official board examinations of the Japan Radiological Society were presented to each of the four LLMs. Performance was evaluated by calculating the percentage of correctly answered questions within six predefined radiology subspecialties. The total cost and processing time for each model were also recorded. The McNemar test was used to assess the statistical significance of differences in accuracy between paired model responses. The o1-preview achieved the highest accuracy (85.7%), significantly outperforming GPT-4o (73.3%, P<.001). Similarly, o1-mini (69.5%) performed significantly better than GPT-4o-mini (46.7%, P<.001). Across all radiology subspecialties, o1-preview consistently ranked highest. However, reasoning models incurred substantially higher costs (o1-preview: $17.10, o1-mini: $2.58) compared to their base counterparts (GPT-4o: $0.496, GPT-4o-mini: $0.04), and their processing times were approximately 3.7 and 1.2 times longer, respectively. Reasoning LLMs demonstrated markedly superior performance in answering radiology board exam questions compared to their base models, albeit at a substantially higher cost and increased processing time.
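
For readers unfamiliar with the paired comparison used here, the sketch below applies McNemar's test to two models' per-question correctness on the same question set. The correctness vectors are randomly generated stand-ins, not the study's results.

```python
# Sketch with stand-in data: McNemar's test on paired per-question correctness
# for two models answering the same 210 questions.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
model_a = rng.random(210) < 0.857   # stand-in correctness vectors; the real study
model_b = rng.random(210) < 0.733   # uses each model's actual per-question results

# 2x2 table of agreement/disagreement between the paired results.
table = [
    [np.sum(model_a & model_b),  np.sum(model_a & ~model_b)],
    [np.sum(~model_a & model_b), np.sum(~model_a & ~model_b)],
]
result = mcnemar(table, exact=True)  # exact binomial test on the discordant pairs
print(result.statistic, result.pvalue)
```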

High-Performance Prompting for LLM Extraction of Compression Fracture Findings from Radiology Reports.

Kanani MM, Monawer A, Brown L, King WE, Miller ZD, Venugopal N, Heagerty PJ, Jarvik JG, Cohen T, Cross NM

PubMed · May 16, 2025
Extracting information from radiology reports can provide critical data to empower many radiology workflows. For spinal compression fractures, these data can facilitate evidence-based care for at-risk populations. Manual extraction from free-text reports is laborious and error-prone. Large language models (LLMs) have shown promise; however, fine-tuning strategies to optimize performance in specific tasks can be resource intensive. A variety of prompting strategies have achieved similar results with fewer demands. Our study pioneers the use of Meta's Llama 3.1, together with prompt-based strategies, for automated extraction of compression fractures from free-text radiology reports, outputting structured data without model training. We tested performance on a time-based sample of CT exams covering the spine from 2/20/2024 to 2/22/2024 acquired across our healthcare enterprise (637 anonymized reports, age 18-102, 47% female). Ground truth annotations were manually generated and compared against the performance of three models (Llama 3.1 70B, Llama 3.1 8B, and Vicuna 13B) with nine different prompting configurations for a total of 27 model/prompt experiments. The highest F1 score (0.91) was achieved by the 70B Llama 3.1 model when provided with a radiologist-written background, with similar results when the background was written by a separate LLM (0.86). The addition of few-shot examples to these prompts had variable impact on F1 measurements (0.89 and 0.84, respectively). Comparable ROC-AUC and PR-AUC performance was observed. Our work demonstrated that an open-weights LLM excelled at extracting compression fracture findings from free-text radiology reports using prompt-based techniques without requiring extensive manually labeled examples for model training.
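
A minimal sketch of what a background-plus-few-shot prompting configuration and the downstream F1 evaluation could look like is shown below; the background text, few-shot examples, output schema, and labels are hypothetical, not the study's actual prompts or data.

```python
# Sketch of a prompt-based extraction setup in the spirit described above; the
# background, few-shot examples, and JSON schema are hypothetical.
import json
from sklearn.metrics import f1_score

BACKGROUND = (
    "You are assisting with spine imaging research. A vertebral compression "
    "fracture is a loss of vertebral body height, often described as a wedge, "
    "biconcave, or crush deformity."
)

FEW_SHOT = [
    {"report": "Moderate compression deformity of L1, age-indeterminate.",
     "label": {"compression_fracture": True}},
    {"report": "No acute fracture. Degenerative changes of the lumbar spine.",
     "label": {"compression_fracture": False}},
]

def build_prompt(report_text: str) -> str:
    examples = "\n".join(
        f"Report: {ex['report']}\nAnswer: {json.dumps(ex['label'])}" for ex in FEW_SHOT
    )
    return (
        f"{BACKGROUND}\n\nReturn JSON with a single boolean key "
        f"'compression_fracture'.\n\n{examples}\n\nReport: {report_text}\nAnswer:"
    )

# After sending build_prompt(...) to the LLM and parsing its JSON answers:
ground_truth = [1, 0, 1, 1, 0]          # manual annotations (toy values)
predictions  = [1, 0, 1, 0, 0]          # parsed model outputs (toy values)
print(f1_score(ground_truth, predictions))
```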

Exploring the Potential of Retrieval Augmented Generation for Question Answering in Radiology: Initial Findings and Future Directions.

Mou Y, Siepmann RM, Truhn D, Sowe S, Decker S

PubMed · May 15, 2025
This study explores the application of Retrieval-Augmented Generation (RAG) for question answering in radiology, an area where intelligent systems can significantly impact clinical decision-making. A preliminary experiment tested a naive RAG setup on nine radiology-specific questions with a textbook as the reference source, showing moderate improvements over baseline methods. The paper discusses lessons learned and potential enhancements for RAG in handling radiology knowledge, suggesting pathways for future research in integrating intelligent health systems in medical practice.
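
For context, a naive RAG setup of the kind described can be sketched in a few lines: chunk a reference text, embed the chunks, retrieve the most similar ones for a question, and assemble a grounded prompt for the generator. The embedding model, chunks, and prompt below are assumptions for illustration, not the study's configuration.

```python
# Minimal naive-RAG sketch: embed chunks, retrieve by cosine similarity,
# and build a grounded prompt for a downstream generator LLM.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

textbook_chunks = [
    "Subdural hematomas are crescent-shaped collections that may cross sutures.",
    "Epidural hematomas are lentiform and typically do not cross suture lines.",
    "Diffuse axonal injury is best detected on susceptibility-weighted MRI.",
]
chunk_vecs = encoder.encode(textbook_chunks, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    q_vec = encoder.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q_vec           # cosine similarity (vectors are normalized)
    top = np.argsort(scores)[::-1][:k]
    return [textbook_chunks[i] for i in top]

question = "Which extra-axial hemorrhage does not cross suture lines?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only the context.\n\nContext:\n{context}\n\nQuestion: {question}"
print(prompt)  # this prompt would then be passed to the generator LLM
```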

Scientific Evidence for Clinical Text Summarization Using Large Language Models: Scoping Review.

Bednarczyk L, Reichenpfader D, Gaudet-Blavignac C, Ette AK, Zaghir J, Zheng Y, Bensahla A, Bjelogrlic M, Lovis C

PubMed · May 15, 2025
Information overload in electronic health records requires effective solutions to alleviate clinicians' administrative tasks. Automatically summarizing clinical text has gained significant attention with the rise of large language models. While individual studies show optimism, a structured overview of the research landscape is lacking. This study aims to present the current state of the art on clinical text summarization using large language models, evaluate the level of evidence in existing research, and assess the applicability of performance findings in clinical settings. This scoping review complied with the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) guidelines. Literature published between January 1, 2019, and June 18, 2024, was identified from 5 databases: PubMed, Embase, Web of Science, IEEE Xplore, and ACM Digital Library. Studies were excluded if they did not describe transformer-based models, did not focus on clinical text summarization, did not engage with free-text data, were not original research, were nonretrievable, were not peer-reviewed, or were not in English, French, Spanish, or German. Data related to study context and characteristics, scope of research, and evaluation methodologies were systematically collected and analyzed by 3 authors independently. A total of 30 original studies were included in the analysis. All used observational retrospective designs, mainly using real patient data (n=28, 93%). The research landscape demonstrated a narrow research focus, often centered on summarizing radiology reports (n=17, 57%), primarily involving data from the intensive care unit (n=15, 50%) of US-based institutions (n=19, 73%), in English (n=26, 87%). This focus aligned with the frequent reliance on the open-source Medical Information Mart for Intensive Care dataset (n=15, 50%). Summarization methodologies predominantly involved abstractive approaches (n=17, 57%) on single-document inputs (n=4, 13%) with unstructured data (n=13, 43%), yet reporting on methodological details remained inconsistent across studies. Model selection involved both open-source models (n=26, 87%) and proprietary models (n=7, 23%). Evaluation frameworks were highly heterogeneous. All studies conducted internal validation, but external validation (n=2, 7%), failure analysis (n=6, 20%), and patient safety risk analysis (n=1, 3%) were infrequent, and none reported bias assessment. Most studies used both automated metrics and human evaluation (n=16, 53%), while 10 (33%) used only automated metrics, and 4 (13%) only human evaluation. Key barriers hinder the translation of current research into trustworthy, clinically valid applications. Current research remains exploratory and limited in scope, with many applications yet to be explored. Performance assessments often lack reliability, and clinical impact evaluations are insufficient, raising concerns about model utility, safety, fairness, and data privacy. Advancing the field requires more robust evaluation frameworks, a broader research scope, and a stronger focus on real-world applicability.
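
As an example of the automated overlap metrics most of the reviewed studies rely on, the sketch below scores a generated summary against a reference with ROUGE, assuming the rouge-score package; the texts are invented.

```python
# Illustrative ROUGE scoring of a generated clinical summary against a reference,
# assuming the `rouge-score` package; the texts below are invented.
from rouge_score import rouge_scorer

reference = "No acute intracranial hemorrhage. Stable chronic microvascular changes."
generated = "No acute hemorrhage; chronic microvascular changes are stable."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)
print(scores["rouge1"].fmeasure, scores["rougeL"].fmeasure)
# Overlap metrics like these are exactly why the review calls for complementary
# human evaluation: a fluent but clinically wrong summary can still score well.
```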

Comparative performance of large language models in structuring head CT radiology reports: multi-institutional validation study in Japan.

Takita H, Walston SL, Mitsuyama Y, Watanabe K, Ishimaru S, Ueda D

PubMed · May 14, 2025
To compare the diagnostic performance of three proprietary large language models (LLMs)-Claude, GPT, and Gemini-in structuring free-text Japanese radiology reports for intracranial hemorrhage and skull fractures, and to assess the impact of three different prompting approaches on model accuracy. In this retrospective study, head CT reports from the Japan Medical Imaging Database between 2018 and 2023 were collected. Two board-certified radiologists established the ground truth regarding intracranial hemorrhage and skull fractures through independent review and consensus. Each radiology report was analyzed by three LLMs using three prompting strategies-Standard, Chain of Thought, and Self Consistency prompting. Diagnostic performance (accuracy, precision, recall, and F1-score) was calculated for each LLM-prompt combination and compared using McNemar's tests with Bonferroni correction. Misclassified cases underwent qualitative error analysis. A total of 3949 head CT reports from 3949 patients (mean age 59 ± 25 years, 56.2% male) were enrolled. Across all institutions, 856 patients (21.6%) had intracranial hemorrhage and 264 patients (6.6%) had skull fractures. All nine LLM-prompt combinations achieved very high accuracy. Claude demonstrated significantly higher accuracy for intracranial hemorrhage than GPT and Gemini, and also outperformed Gemini for skull fractures (p < 0.0001). Gemini's performance improved notably with Chain of Thought prompting. Error analysis revealed common challenges including ambiguous phrases and findings unrelated to intracranial hemorrhage or skull fractures, underscoring the importance of careful prompt design. All three proprietary LLMs exhibited strong performance in structuring free-text head CT reports for intracranial hemorrhage and skull fractures. While the choice of prompting method influenced accuracy, all models demonstrated robust potential for clinical and research applications. Future work should refine the prompts and validate these approaches in prospective, multilingual settings.
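
To illustrate the Self Consistency strategy evaluated above as it is commonly implemented, the sketch below samples several reasoned answers at nonzero temperature and majority-votes the structured result. The prompt wording, JSON schema, and model are assumptions, not the study's actual protocol.

```python
# Sketch of self-consistency prompting as commonly implemented: sample several
# chain-of-thought answers, then take a majority vote on the structured output.
import json
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes an API key in the environment; model choice is illustrative

def classify_report(report_text: str, n_samples: int = 5) -> dict:
    prompt = (
        "Think step by step, then answer with JSON of the form "
        '{"intracranial_hemorrhage": true/false, "skull_fracture": true/false} '
        f"on the last line.\n\nReport:\n{report_text}"
    )
    votes = []
    for _ in range(n_samples):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,  # nonzero temperature so the sampled answers differ
        )
        last_line = response.choices[0].message.content.strip().splitlines()[-1]
        votes.append(json.dumps(json.loads(last_line), sort_keys=True))
    majority = Counter(votes).most_common(1)[0][0]  # most frequent structured answer
    return json.loads(majority)

print(classify_report("Thin right convexity subdural hematoma. No skull fracture."))
```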