Sort by:
Page 1 of 321 results

Predicting Immunotherapy Response in Unresectable Hepatocellular Carcinoma: A Comparative Study of Large Language Models and Human Experts.

Xu J, Wang J, Li J, Zhu Z, Fu X, Cai W, Song R, Wang T, Li H

pubmed logopapersMay 15 2025
Hepatocellular carcinoma (HCC) is an aggressive cancer with limited biomarkers for predicting immunotherapy response. Recent advancements in large language models (LLMs) like GPT-4, GPT-4o, and Gemini offer the potential for enhancing clinical decision-making through multimodal data analysis. However, their effectiveness in predicting immunotherapy response, especially compared to human experts, remains unclear. This study assessed the performance of GPT-4, GPT-4o, and Gemini in predicting immunotherapy response in unresectable HCC, compared to radiologists and oncologists of varying expertise. A retrospective analysis of 186 patients with unresectable HCC utilized multimodal data (clinical and CT images). LLMs were evaluated with zero-shot prompting and two strategies: the 'voting method' and the 'OR rule method' for improved sensitivity. Performance metrics included accuracy, sensitivity, area under the curve (AUC), and agreement across LLMs and physicians.GPT-4o, using the 'OR rule method,' achieved 65% accuracy and 47% sensitivity, comparable to intermediate physicians but lower than senior physicians (accuracy: 72%, p = 0.045; sensitivity: 70%, p < 0.0001). Gemini-GPT, combining GPT-4, GPT-4o, and Gemini, achieved an AUC of 0.69, similar to senior physicians (AUC: 0.72, p = 0.35), with 68% accuracy, outperforming junior and intermediate physicians while remaining comparable to senior physicians (p = 0.78). However, its sensitivity (58%) was lower than senior physicians (p = 0.0097). LLMs demonstrated higher inter-model agreement (κ = 0.59-0.70) than inter-physician agreement, especially among junior physicians (κ = 0.15). This study highlights the potential of LLMs, particularly Gemini-GPT, as valuable tools in predicting immunotherapy response for HCC.

Scientific Evidence for Clinical Text Summarization Using Large Language Models: Scoping Review.

Bednarczyk L, Reichenpfader D, Gaudet-Blavignac C, Ette AK, Zaghir J, Zheng Y, Bensahla A, Bjelogrlic M, Lovis C

pubmed logopapersMay 15 2025
Information overload in electronic health records requires effective solutions to alleviate clinicians' administrative tasks. Automatically summarizing clinical text has gained significant attention with the rise of large language models. While individual studies show optimism, a structured overview of the research landscape is lacking. This study aims to present the current state of the art on clinical text summarization using large language models, evaluate the level of evidence in existing research and assess the applicability of performance findings in clinical settings. This scoping review complied with the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) guidelines. Literature published between January 1, 2019, and June 18, 2024, was identified from 5 databases: PubMed, Embase, Web of Science, IEEE Xplore, and ACM Digital Library. Studies were excluded if they did not describe transformer-based models, did not focus on clinical text summarization, did not engage with free-text data, were not original research, were nonretrievable, were not peer-reviewed, or were not in English, French, Spanish, or German. Data related to study context and characteristics, scope of research, and evaluation methodologies were systematically collected and analyzed by 3 authors independently. A total of 30 original studies were included in the analysis. All used observational retrospective designs, mainly using real patient data (n=28, 93%). The research landscape demonstrated a narrow research focus, often centered on summarizing radiology reports (n=17, 57%), primarily involving data from the intensive care unit (n=15, 50%) of US-based institutions (n=19, 73%), in English (n=26, 87%). This focus aligned with the frequent reliance on the open-source Medical Information Mart for Intensive Care dataset (n=15, 50%). Summarization methodologies predominantly involved abstractive approaches (n=17, 57%) on single-document inputs (n=4, 13%) with unstructured data (n=13, 43%), yet reporting on methodological details remained inconsistent across studies. Model selection involved both open-source models (n=26, 87%) and proprietary models (n=7, 23%). Evaluation frameworks were highly heterogeneous. All studies conducted internal validation, but external validation (n=2, 7%), failure analysis (n=6, 20%), and patient safety risks analysis (n=1, 3%) were infrequent, and none reported bias assessment. Most studies used both automated metrics and human evaluation (n=16, 53%), while 10 (33%) used only automated metrics, and 4 (13%) only human evaluation. Key barriers hinder the translation of current research into trustworthy, clinically valid applications. Current research remains exploratory and limited in scope, with many applications yet to be explored. Performance assessments often lack reliability, and clinical impact evaluations are insufficient raising concerns about model utility, safety, fairness, and data privacy. Advancing the field requires more robust evaluation frameworks, a broader research scope, and a stronger focus on real-world applicability.

Automated high precision PCOS detection through a segment anything model on super resolution ultrasound ovary images.

Reka S, Praba TS, Prasanna M, Reddy VNN, Amirtharajan R

pubmed logopapersMay 15 2025
PCOS (Poly-Cystic Ovary Syndrome) is a multifaceted disorder that often affects the ovarian morphology of women of their reproductive age, resulting in the development of numerous cysts on the ovaries. Ultrasound imaging typically diagnoses PCOS, which helps clinicians assess the size, shape, and existence of cysts in the ovaries. Nevertheless, manual ultrasound image analysis is often challenging and time-consuming, resulting in inter-observer variability. To effectively treat PCOS and prevent its long-term effects, prompt and accurate diagnosis is crucial. In such cases, a prediction model based on deep learning can help physicians by streamlining the diagnosis procedure, reducing time and potential errors. This article proposes a novel integrated approach, QEI-SAM (Quality Enhanced Image - Segment Anything Model), for enhancing image quality and ovarian cyst segmentation for accurate prediction. GAN (Generative Adversarial Networks) and CNN (Convolutional Neural Networks) are the most recent cutting-edge innovations that have supported the system in attaining the expected result. The proposed QEI-SAM model used Enhanced Super Resolution Generative Adversarial Networks (ESRGAN) for image enhancement to increase the resolution, sharpening the edges and restoring the finer structure of the ultrasound ovary images and achieved a better SSIM of 0.938, PSNR value of 38.60 and LPIPS value of 0.0859. Then, it incorporates the Segment Anything Model (SAM) to segment ovarian cysts and achieve the highest Dice coefficient of 0.9501 and IoU score of 0.9050. Furthermore, Convolutional Neural Network - ResNet 50, ResNet 101, VGG 16, VGG 19, AlexNet and Inception v3 have been implemented to diagnose PCOS promptly. Finally, VGG 19 has achieved the highest accuracy of 99.31%.

Synthetic Data-Enhanced Classification of Prevalent Osteoporotic Fractures Using Dual-Energy X-Ray Absorptiometry-Based Geometric and Material Parameters.

Quagliato L, Seo J, Hong J, Lee T, Chung YS

pubmed logopapersMay 14 2025
Bone fracture risk assessment for osteoporotic patients is essential for implementing early countermeasures and preventing discomfort and hospitalization. Current methodologies, such as Fracture Risk Assessment Tool (FRAX), provide a risk assessment over a 5- to 10-year period rather than evaluating the bone's current health status. The database was collected by Ajou University Medical Center from 2017 to 2021. It included 9,260 patients, aged 55 to 99, comprising 242 femur fracture (FX) cases and 9,018 non-fracture (NFX) cases. To model the association of the bone's current health status with prevalent FXs, three prediction algorithms-extreme gradient boosting (XGB), support vector machine, and multilayer perceptron-were trained using two-dimensional dual-energy X-ray absorptiometry (2D-DXA) analysis results and subsequently benchmarked. The XGB classifier, which proved most effective, was then further refined using synthetic data generated by the adaptive synthetic oversampler to balance the FX and NFX classes and enhance boundary sharpness for better classification accuracy. The XGB model trained on raw data demonstrated good prediction capabilities, with an area under the curve (AUC) of 0.78 and an F1 score of 0.71 on test cases. The inclusion of synthetic data improved classification accuracy in terms of both specificity and sensitivity, resulting in an AUC of 0.99 and an F1 score of 0.98. The proposed methodology demonstrates that current bone health can be assessed through post-processed results from 2D-DXA analysis. Moreover, it was also shown that synthetic data can help stabilize uneven databases by balancing majority and minority classes, thereby significantly improving classification performance.

AI-based metal artefact correction algorithm for radiotherapy patients with dental hardware in head and neck CT: Towards precise imaging.

Yu X, Zhong S, Zhang G, Du J, Wang G, Hu J

pubmed logopapersMay 14 2025
To investigate the clinical efficiency of an AI-based metal artefact correction algorithm (AI-MAC), for reducing dental metal artefacts in head and neck CT, compared to conventional interpolation-based MAC. We retrospectively collected 41 patients with non-removal dental hardware who underwent non-contrast head and neck CT prior to radiotherapy. All images were reconstructed with standard reconstruction algorithm (SRA), and were additionally processed with both conventional MAC and AI-MAC. The image quality of SRA, MAC and AI-MAC were compared by qualitative scoring on a 5-point scale, with scores ≥ 3 considered interpretable. This was followed by a quantitative evaluation, including signal-to-noise ratio (SNR) and artefact index (Idxartefact). Organ contouring accuracy was quantified via calculating the dice similarity coefficient (DSC) and hausdorff distance (HD) for oral cavity and teeth, using the clinically accepted contouring as reference. Moreover, the treatment planning dose distribution for oral cavity was assessed. AI-MAC yielded superior qualitative image quality as well as quantitative metrics, including SNR and Idxartefact, to SRA and MAC. The image interpretability significantly improved from 41.46% for SRA and 56.10% for MAC to 92.68% for AI-MAC (p < 0.05). Compared to SRA and MAC, the best DSC and HD for both oral cavity and teeth were obtained on AI-MAC (all p < 0.05). No significant differences for dose distribution were found among the three image sets. AI-MAC outperforms conventional MAC in metal artefact reduction, achieving superior image quality with high image interpretability for patients with dental hardware undergoing head and neck CT. Furthermore, the use of AI-MAC improves the accuracy of organ contouring while providing consistent dose calculation against metal artefacts in radiotherapy. AI-MAC is a novel deep learning-based technique for reducing metal artefacts on CT. This in-vivo study first demonstrated its capability of reducing metal artefacts while preserving organ visualization, as compared with conventional MAC.

Large language models for efficient whole-organ MRI score-based reports and categorization in knee osteoarthritis.

Xie Y, Hu Z, Tao H, Hu Y, Liang H, Lu X, Wang L, Li X, Chen S

pubmed logopapersMay 14 2025
To evaluate the performance of large language models (LLMs) in automatically generating whole-organ MRI score (WORMS)-based structured MRI reports and predicting osteoarthritis (OA) severity for the knee. A total of 160 consecutive patients suspected of OA were included. Knee MRI reports were reviewed by three radiologists to establish the WORMS reference standard for 39 key features. GPT-4o and GPT-4o-mini were prompted using in-context knowledge (ICK) and chain-of-thought (COT) to generate WORMS-based structured reports from original reports and to automatically predict the OA severity. Four Orthopedic surgeons reviewed original and LLM-generated reports to conduct pairwise preference and difficulty tests, and their review times were recorded. GPT-4o demonstrated perfect performance in extracting the laterality of the knee (accuracy = 100%). GPT-4o outperformed GPT-4o mini in generating WORMS reports (Accuracy: 93.9% vs 76.2%, respectively). GPT-4o achieved higher recall (87.3% s 46.7%, p < 0.001), while maintaining higher precision compared to GPT-4o mini (94.2% vs 71.2%, p < 0.001). For predicting OA severity, GPT-4o outperformed GPT-4o mini across all prompt strategies (best accuracy: 98.1% vs 68.7%). Surgeons found it easier to extract information and gave more preference to LLM-generated reports over the original reports (both p < 0.001) while spending less time on each report (51.27 ± 9.41 vs 87.42 ± 20.26 s, p < 0.001). GPT-4o generated expert multi-feature, WORMS-based reports from original free-text knee MRI reports. GPT-4o with COT achieved high accuracy in categorizing OA severity. Surgeons reported greater preference and higher efficiency when using LLM-generated reports. The perfect performance of generating WORMS-based reports and the high efficiency and ease of use suggest that integrating LLMs into clinical workflows could greatly enhance productivity and alleviate the documentation burden faced by clinicians in knee OA. GPT-4o successfully generated WORMS-based knee MRI reports. GPT-4o with COT prompting achieved impressive accuracy in categorizing knee OA severity. Greater preference and higher efficiency were reported for LLM-generated reports.

Comparative performance of large language models in structuring head CT radiology reports: multi-institutional validation study in Japan.

Takita H, Walston SL, Mitsuyama Y, Watanabe K, Ishimaru S, Ueda D

pubmed logopapersMay 14 2025
To compare the diagnostic performance of three proprietary large language models (LLMs)-Claude, GPT, and Gemini-in structuring free-text Japanese radiology reports for intracranial hemorrhage and skull fractures, and to assess the impact of three different prompting approaches on model accuracy. In this retrospective study, head CT reports from the Japan Medical Imaging Database between 2018 and 2023 were collected. Two board-certified radiologists established the ground truth regarding intracranial hemorrhage and skull fractures through independent review and consensus. Each radiology report was analyzed by three LLMs using three prompting strategies-Standard, Chain of Thought, and Self Consistency prompting. Diagnostic performance (accuracy, precision, recall, and F1-score) was calculated for each LLM-prompt combination and compared using McNemar's tests with Bonferroni correction. Misclassified cases underwent qualitative error analysis. A total of 3949 head CT reports from 3949 patients (mean age 59 ± 25 years, 56.2% male) were enrolled. Across all institutions, 856 patients (21.6%) had intracranial hemorrhage and 264 patients (6.6%) had skull fractures. All nine LLM-prompt combinations achieved very high accuracy. Claude demonstrated significantly higher accuracy for intracranial hemorrhage than GPT and Gemini, and also outperformed Gemini for skull fractures (p < 0.0001). Gemini's performance improved notably with Chain of Thought prompting. Error analysis revealed common challenges including ambiguous phrases and findings unrelated to intracranial hemorrhage or skull fractures, underscoring the importance of careful prompt design. All three proprietary LLMs exhibited strong performance in structuring free-text head CT reports for intracranial hemorrhage and skull fractures. While the choice of prompting method influenced accuracy, all models demonstrated robust potential for clinical and research applications. Future work should refine the prompts and validate these approaches in prospective, multilingual settings.

Improving AI models for rare thyroid cancer subtype by text guided diffusion models.

Dai F, Yao S, Wang M, Zhu Y, Qiu X, Sun P, Qiu C, Yin J, Shen G, Sun J, Wang M, Wang Y, Yang Z, Sang J, Wang X, Sun F, Cai W, Zhang X, Lu H

pubmed logopapersMay 13 2025
Artificial intelligence applications in oncology imaging often struggle with diagnosing rare tumors. We identify significant gaps in detecting uncommon thyroid cancer types with ultrasound, where scarce data leads to frequent misdiagnosis. Traditional augmentation strategies do not capture the unique disease variations, hindering model training and performance. To overcome this, we propose a text-driven generative method that fuses clinical insights with image generation, producing synthetic samples that realistically reflect rare subtypes. In rigorous evaluations, our approach achieves substantial gains in diagnostic metrics, surpasses existing methods in authenticity and diversity measures, and generalizes effectively to other private and public datasets with various rare cancers. In this work, we demonstrate that text-guided image augmentation substantially enhances model accuracy and robustness for rare tumor detection, offering a promising avenue for more reliable and widespread clinical adoption.

Evaluating the reference accuracy of large language models in radiology: a comparative study across subspecialties.

Güneş YC, Cesur T, Çamur E

pubmed logopapersMay 12 2025
This study aimed to compare six large language models (LLMs) [Chat Generative Pre-trained Transformer (ChatGPT)o1-preview, ChatGPT-4o, ChatGPT-4o with canvas, Google Gemini 1.5 Pro, Claude 3.5 Sonnet, and Claude 3 Opus] in generating radiology references, assessing accuracy, fabrication, and bibliographic completeness. In this cross-sectional observational study, 120 open-ended questions were administered across eight radiology subspecialties (neuroradiology, abdominal, musculoskeletal, thoracic, pediatric, cardiac, head and neck, and interventional radiology), with 15 questions per subspecialty. Each question prompted the LLMs to provide responses containing four references with in-text citations and complete bibliographic details (authors, title, journal, publication year/month, volume, issue, page numbers, and PubMed Identifier). References were verified using Medline, Google Scholar, the Directory of Open Access Journals, and web searches. Each bibliographic element was scored for correctness, and a composite final score [(FS): 0-36] was calculated by summing the correct elements and multiplying this by a 5-point verification score for content relevance. The FS values were then categorized into a 5-point Likert scale reference accuracy score (RAS: 0 = fabricated; 4 = fully accurate). Non-parametric tests (Kruskal-Wallis, Tamhane's T2, Wilcoxon signed-rank test with Bonferroni correction) were used for statistical comparisons. Claude 3.5 Sonnet demonstrated the highest reference accuracy, with 80.8% fully accurate references (RAS 4) and a fabrication rate of 3.1%, significantly outperforming all other models (<i>P</i> < 0.001). Claude 3 Opus ranked second, achieving 59.6% fully accurate references and a fabrication rate of 18.3% (<i>P</i> < 0.001). ChatGPT-based models (ChatGPT-4o, ChatGPT-4o with canvas, and ChatGPT o1-preview) exhibited moderate accuracy, with fabrication rates ranging from 27.7% to 52.9% and <8% fully accurate references. Google Gemini 1.5 Pro had the lowest performance, achieving only 2.7% fully accurate references and the highest fabrication rate of 60.6% (<i>P</i> < 0.001). Reference accuracy also varied by subspecialty, with neuroradiology and cardiac radiology outperforming pediatric and head and neck radiology. Claude 3.5 Sonnet significantly outperformed all other models in generating verifiable radiology references, and Claude 3 Opus showed moderate performance. In contrast, ChatGPT models and Google Gemini 1.5 Pro delivered substantially lower accuracy with higher rates of fabricated references, highlighting current limitations in automated academic citation generation. The high accuracy of Claude 3.5 Sonnet can improve radiology literature reviews, research, and education with dependable references. The poor performance of other models, with high fabrication rates, risks misinformation in clinical and academic settings and highlights the need for refinement to ensure safe and effective use.

Benchmarking Radiology Report Generation From Noisy Free-Texts.

Yuan Y, Zheng Y, Qu L

pubmed logopapersMay 12 2025
Automatic radiology report generation can enhance diagnostic efficiency and accuracy. However, clean open-source imaging scan-report pairs are limited in scale and variety. Moreover, the vast amount of radiological texts available online is often too noisy to be directly employed. To address this challenge, we introduce a novel task called Noisy Report Refinement (NRR), which generates radiology reports from noisy free-texts. To achieve this, we propose a report refinement pipeline that leverages large language models (LLMs) enhanced with guided self-critique and report selection strategies. To address the inability of existing radiology report generation metrics in measuring cleanliness, radiological usefulness, and factual correctness across various modalities of reports in NRR task, we introduce a new benchmark, NRRBench, for NRR evaluation. This benchmark includes two online-sourced datasets and four clinically explainable LLM-based metrics: two metrics evaluate the matching rate of radiology entities and modality-specific template attributes respectively, one metric assesses report cleanliness, and a combined metric evaluates overall NRR performance. Experiments demonstrate that guided self-critique and report selection strategies significantly improve the quality of refined reports. Additionally, our proposed metrics show a much higher correlation with noisy rate and error count of reports than radiology report generation metrics in evaluating NRR.
Page 1 of 321 results
Show
per page
Get Started

Upload your X-ray image and get interpretation.

Upload now →

Disclaimer: X-ray Interpreter's AI-generated results are for informational purposes only and not a substitute for professional medical advice. Always consult a healthcare professional for medical diagnosis and treatment.