Development of a Large-Scale Dataset of Chest Computed Tomography Reports in Japanese and a High-Performance Finding Classification Model: Dataset Development and Validation Study.
Authors
Affiliations (4)
- Division of Radiology and Biomedical Engineering, Graduate School of Medicine, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8655, Japan, 81 3-3815-5411.
- Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, Tokyo, Japan.
- Department of Radiology, School of Medicine, Jichi Medical University, Shimotsuke, Japan.
- Department of Diagnostic Radiology, Toranomon Hospital, Tokyo, Japan.
Abstract
Recent advances in large language models have highlighted the need for high-quality multilingual medical datasets. Although Japan is a global leader in computed tomography (CT) scanner deployment and use, the absence of large-scale Japanese radiology datasets has hindered the development of specialized language models for medical imaging analysis. Despite the emergence of multilingual models and language-specific adaptations, the development of Japanese-specific medical language models remains constrained by a lack of comprehensive datasets, particularly in radiology.

This study aimed to address this gap in Japanese medical natural language processing resources by developing a comprehensive Japanese CT report dataset through machine translation and using it to establish a specialized language model for structured classification of findings. In addition, a rigorously validated evaluation dataset was created through expert radiologist refinement to ensure a reliable assessment of model performance.

We translated the CT-RATE dataset (24,283 CT reports from 21,304 patients) into Japanese using GPT-4o mini. The training dataset consisted of 22,778 machine-translated reports, and the validation dataset included 150 reports revised by radiologists. We developed CT-BERT-JPN, a specialized Bidirectional Encoder Representations from Transformers (BERT) model for Japanese radiology text, based on the "tohoku-nlp/bert-base-japanese-v3" architecture, to extract 18 structured findings from reports. Translation quality was assessed with Bilingual Evaluation Understudy (BLEU) and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores and further evaluated by radiologists in a dedicated human-in-the-loop experiment. In that experiment, each report in a randomly selected subset was independently reviewed by 2 radiologists, 1 senior (postgraduate year [PGY] 6-11) and 1 junior (PGY 4-5), who used a 5-point Likert scale to rate (1) grammatical correctness, (2) medical terminology accuracy, and (3) overall readability. Interrater reliability was measured with the quadratic weighted kappa (QWK). Model performance was benchmarked against GPT-4o using accuracy, precision, recall, F1-score, area under the receiver operating characteristic curve (ROC-AUC), and average precision.

General text structure was preserved (BLEU: 0.731 findings, 0.690 impression; ROUGE: 0.770-0.876 findings, 0.748-0.857 impression), though expert review identified 3 categories of necessary refinements: contextual adjustment of technical terms, completion of incomplete translations, and localization of Japanese medical terminology. The radiologist-revised translations scored higher than the raw machine translations across all dimensions, and all improvements were statistically significant (P<.001). CT-BERT-JPN outperformed GPT-4o on 11 of 18 findings (61%), achieving perfect F1-scores for 4 conditions and F1-scores >0.95 for 14 conditions, despite varied sample sizes (7-82 cases).

Our study established a robust Japanese CT report dataset and demonstrated the effectiveness of a specialized language model in the structured classification of findings. The hybrid approach of machine translation followed by expert validation enabled the creation of a large-scale dataset while maintaining high quality standards. This study provides essential resources for advancing medical artificial intelligence research in Japanese health care settings, with the datasets and models made publicly available for research to facilitate further advancement in the field.
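
The quadratic weighted kappa used for interrater reliability can be illustrated with a short sketch. This is a generic textbook formulation on a 1-5 Likert scale, not the authors' code; the function name, its parameters, and the example ratings are all assumptions made for illustration.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating=1, max_rating=5):
    """Quadratic weighted kappa for two raters on an ordinal scale.

    Illustrative sketch only: the name, signature, and 1-5 default range
    are assumptions, not taken from the study.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rating lists must be non-empty and equal in length")
    if max_rating <= min_rating:
        raise ValueError("the scale needs at least two distinct ratings")

    k = max_rating - min_rating + 1
    n = len(rater_a)

    # Observed confusion matrix between the two raters.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    # Marginal rating counts for each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Disagreement weight grows with the squared rating distance.
            weight = (i - j) ** 2 / (k - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n  # count expected by chance
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        # Both raters gave the same single rating to every item.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical ratings from a senior and a junior reader (not study data).
senior = [5, 4, 4, 3, 5, 2]
junior = [5, 4, 3, 3, 4, 2]
print(round(quadratic_weighted_kappa(senior, junior), 3))
```

Because the weights are quadratic, a disagreement of 2 points is penalized 4 times as heavily as a disagreement of 1 point, which suits ordinal Likert ratings where near misses matter less than large gaps.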
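
Per-finding precision, recall, F1-score, and accuracy for a multi-label classifier can be computed as in the following minimal sketch. It assumes binary label matrices with one column per finding; the function name, the finding names, and the toy labels are hypothetical and do not reproduce the study's evaluation pipeline or data.

```python
def per_finding_metrics(y_true, y_pred, finding_names):
    """Compute binary classification metrics separately for each finding.

    y_true and y_pred are lists of rows, one row per report, holding
    0/1 labels with one column per finding. Illustrative sketch only.
    """
    results = {}
    for col, name in enumerate(finding_names):
        tp = fp = fn = tn = 0
        for true_row, pred_row in zip(y_true, y_pred):
            t, p = true_row[col], pred_row[col]
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
            else:
                tn += 1

        total = tp + fp + fn + tn
        # Fall back to 0.0 when a denominator is empty.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        accuracy = (tp + tn) / total if total else 0.0

        results[name] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "accuracy": accuracy,
            "support": tp + fn,  # number of positive reference cases
        }
    return results


# Toy example with two hypothetical findings and four reports.
names = ["finding_a", "finding_b"]
reference = [[1, 0], [1, 1], [0, 1], [0, 0]]
predicted = [[1, 0], [0, 1], [0, 1], [0, 1]]
for name, scores in per_finding_metrics(reference, predicted, names).items():
    print(name, {key: round(value, 3) for key, value in scores.items()})
```

Reporting the support next to each score helps when positive cases per finding are few, since a single misclassified report can shift the F1-score noticeably in small samples.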