Context-Aware Sentence Classification of Radiology Reports Using Synthetic Data: Development and Validation Study.
Affiliations (3)
- Jichi Medical University, 3311-1, Yakushiji, Shimotsuke, Tochigi, JP.
- The University of Tokyo, Tokyo, JP.
- Juntendo University, Tokyo, JP.
Abstract
Automated structuring of radiology reports is essential for data utilization and the development of medical artificial intelligence models. However, manual annotation by experts is labor-intensive, and processing real clinical data through commercial large language models (LLMs) poses significant privacy risks. These challenges are particularly pronounced for non-English languages such as Japanese, where specialized medical corpora are scarce. While synthetic data generation offers a potential privacy-preserving alternative, its effectiveness in capturing complex clinical nuances, such as negation and contextual dependencies, well enough to train robust classification models without any real-world training data has not been fully established. This study aimed to develop a context-aware sentence classification model for Japanese radiology reports using an entirely synthetic training pipeline, thereby eliminating reliance on real-world clinical data during the development phase. We also sought to evaluate the generalizability of this approach by validating model performance on diverse, multi-institutional real-world reports. Japanese radiology reports (n=3,104) were generated using GPT-4.1 and automatically annotated at the sentence level into four categories (background, positive finding, negative finding, and continuation) using GPT-4.1-mini. The synthetic data were partitioned into training (n=2,670), validation (n=334), and test (n=100) sets. We fine-tuned several models, including lightweight local LLMs (Qwen3 and Llama 3.2 series) using Low-Rank Adaptation (LoRA), and Japanese text classification models (BERT base Japanese v3, JMedRoBERTa-base, and ModernBERT-Ja-130M). External validation was performed on 280 real-world reports (3,477 sentences) from seven institutions in the Japan Medical Image Database (J-MID), with ground-truth labels established by board-certified radiologists.
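To make the annotation scheme concrete, the sentence-level records described above could be represented as shown in the minimal sketch below. The label names follow the four categories in the abstract; the field names (report_id, sentence, label), the integer label mapping, and the English example sentences are illustrative assumptions, not the paper's actual data format (the real reports are in Japanese).

```python
# Hypothetical record format for sentence-level annotation of synthetic reports.
# Label names come from the abstract; everything else is an assumed sketch.
LABELS = {
    "background": 0,
    "positive_finding": 1,
    "negative_finding": 2,
    "continuation": 3,
}

records = [
    {"report_id": "synth-0001", "sentence": "CT follow-up for lung cancer.", "label": "background"},
    {"report_id": "synth-0001", "sentence": "A 12 mm nodule in the right upper lobe.", "label": "positive_finding"},
    {"report_id": "synth-0001", "sentence": "No pleural effusion.", "label": "negative_finding"},
    {"report_id": "synth-0001", "sentence": "Unchanged from the prior study.", "label": "continuation"},
]

def encode(recs):
    """Map category names to integer ids, as a classifier fine-tuning step would require."""
    return [(r["sentence"], LABELS[r["label"]]) for r in recs]

pairs = encode(records)
```

Encoding labels to integers in this way is the usual preprocessing step before fine-tuning either a BERT-style classification head or an LLM with LoRA adapters.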
Evaluation metrics included accuracy, macro-averaged F1 (Macro F1) score, and positive predictive value for positive findings (PPV_1). All models achieved high performance on the synthetic test set (accuracy: 0.938-0.951; Macro F1 score: 0.924-0.940). Overall performance declined on the external validation dataset (accuracy: 0.783-0.813; Macro F1 score: 0.761-0.790), reflecting distributional differences between synthetic and real-world reports; however, PPV_1 remained stable and high across datasets (e.g., 0.957 on the synthetic test set vs. 0.952 on the external validation dataset for Qwen3 (4B)). Parsing errors occurred in the LLM-based approaches (19-260 sentences, 0.55%-7.48% of the external dataset). This study demonstrates the feasibility of developing context-aware sentence classification models for Japanese radiology reports using a training pipeline based entirely on synthetic data. The stability of PPV_1 indicates that the models successfully captured the essential clinical terminology and linguistic patterns required to identify positive findings in real-world reports, despite the observed performance degradation in external validation. This approach substantially reduces manual annotation requirements and privacy risks, providing a scalable foundation for constructing structured radiology datasets to support the development of clinically relevant medical artificial intelligence models.
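The three reported metrics can be sketched from scratch as below. This is a generic illustration of accuracy, macro-averaged F1, and PPV for the positive-finding class; the function names and the toy label arrays are assumptions, not the paper's evaluation code.

```python
def accuracy(y_true, y_pred):
    """Fraction of sentences whose predicted class matches the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores (each class counts equally)."""
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if (tp + fp) else 0.0
        rec = tp / (tp + fn) if (tp + fn) else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if (prec + rec) else 0.0)
    return sum(f1s) / len(f1s)

def ppv(y_true, y_pred, positive):
    """Positive predictive value: precision restricted to one class
    (here, the 'positive finding' label from the paper)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    predicted = sum(1 for p in y_pred if p == positive)
    return tp / predicted if predicted else 0.0

# Toy example with two classes (0 = other, 1 = positive finding).
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
```

Macro F1 is the appropriate headline metric here because the four sentence categories are likely imbalanced (e.g., continuation sentences are rarer than findings), and PPV_1 isolates how trustworthy a predicted positive finding is, which is the clinically critical case.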