Large Language Model-Assisted Cleaning of Report-Derived Labels in a Large-Scale Chest CT Dataset
Authors
Abstract
Purpose: To evaluate whether large language model (LLM)-assisted label cleaning can identify label-report discordance in CT-RATE, a large-scale public chest CT dataset. Materials and Methods: After report-level deduplication, 24,446 unique radiology reports were identified. Twelve reports were excluded from the primary GPT-5.4 analysis because of Microsoft Azure AI Foundry content-safety filtering, leaving 24,434 reports and 439,812 label instances across 18 abnormality categories. GPT-5.4-derived binary labels were generated from report text using structured JSON output and compared with existing CT-RATE labels. Discordant instances were adjudicated by radiologists. In addition, 100 randomly sampled reports were manually annotated to compare CT-RATE labels, individual LLM-derived labels, and multi-LLM majority-vote labels against radiologist-annotated reference labels. Results: Overall agreement between GPT-5.4-derived and CT-RATE labels was 96.4%, with Cohen's kappa of 0.884. Lymphadenopathy showed the lowest agreement and kappa. In discordance review, radiologist adjudication supported GPT-5.4-derived labels in 72 of 97 (74.2%) general discordant instances and 91 of 99 (91.9%) targeted lymphadenopathy discordant instances. Against radiologist-annotated reference labels, multi-LLM majority-vote labels achieved the highest label-macro-averaged F1 score and Cohen's kappa. Conclusion: LLM-assisted label cleaning identified clinically meaningful label-report discordance in CT-RATE and may support scalable quality improvement of public imaging datasets. The cleaned dataset will be made publicly available to support future research.