Insufficient reporting quality in large language model studies in the field of radiology.
Affiliations (8)
- Department of Radiology, Research Institute of Radiological Science and Center for Clinical Imaging Data Science, Yonsei University College of Medicine, Seoul, Republic of Korea.
- Department of Radiology, Seoul National University Bundang Hospital, Seoul National University College of Medicine, Seongnam, Republic of Korea.
- Department of Artificial Intelligence, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan.
- Department of Radiology and Research Institute of Radiology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea.
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea.
- Department of Internal Medicine, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea.
- Department of Pulmonology, Shihwa Medical Center, Siheung, Republic of Korea.
- Department of Radiology and Research Institute of Radiology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea. [email protected].
Abstract
Our systematic review aimed to evaluate the quality of reporting in research articles involving LLMs in the radiology field. After searching the PubMed-MEDLINE and EMBASE databases, a total of 246 eligible studies published between November 30, 2022, and December 31, 2024, were included. The analysis assessed the percentage of studies adhering to key elements required for LLM research, based on the MInimum reporting items for CLear Evaluation of Accuracy Reports of Large Language Models in healthcare (MI-CLEAR-LLM) and the Transparent Reporting of a Multivariable Model for Individual Prognosis Or Diagnosis-large language models (TRIPOD-LLM) checklists. Studies published before and after July 25, 2024, were compared using a chi-square test. The most common topic was performance evaluation of LLMs using radiologic cases (44.3%, 109/246), followed by radiology reporting (37.8%, 93/246). Although all studies reported LLM's name, only 27.6% (68/246) specified the model version, 35.8% (88/246) mentioned access date, and 25.2% (62/246) mentioned application programming interface usage. Full prompts were provided in 41.1% (101/246) of studies. Output probability-related issues, including the number of attempts (22.8%, 56/246) and factors such as temperature (16.7%, 41/246), were under-reported. These reporting insufficiencies persisted in studies published before and after July 25, 2024. Most studies assessing large language models in radiology lacked sufficient reporting of key elements required for large language model research. We recommend that authors strive to adhere to these elements to ensure transparency and improve the reproducibility of future studies. Our study highlighted the need for improved reporting quality and adherence to key elements to ensure transparent reporting and improve the reproducibility of future studies using large language models. 
Numerous studies of large language models (LLMs) in radiology lack standardized methodologies, leading to high variability and inconsistent reporting. Our review demonstrated insufficient reporting of key elements for LLM research, particularly model details and output stochasticity settings. Better reporting and closer adherence to these key elements are essential to enhance transparency and reproducibility in future LLM research.