Context-Aware Sentence Classification of Radiology Reports Using Synthetic Data: Development and Validation Study.
Affiliations (3)
- Jichi Medical University, 3311-1, Yakushiji, Shimotsuke, Tochigi, JP.
- The University of Tokyo, Tokyo, JP.
- Juntendo University, Tokyo, JP.
Abstract
Automated structuring of radiology reports is essential for data utilization and the development of medical artificial intelligence models. However, manual annotation by experts is labor-intensive, and processing real clinical data through commercial large language models (LLMs) poses significant privacy risks. These challenges are particularly pronounced for non-English languages such as Japanese, where specialized medical corpora are scarce. While synthetic data generation offers a potential privacy-preserving alternative, its effectiveness in capturing complex clinical nuances, such as negation and contextual dependencies, well enough to train robust classification models without any real-world training data has not been fully established. This study aimed to develop a context-aware sentence classification model for Japanese radiology reports using an entirely synthetic training pipeline, thereby eliminating reliance on real-world clinical data during the development phase. We also sought to evaluate the generalizability of this approach by validating model performance on diverse, multi-institutional real-world reports. Japanese radiology reports (n=3,104) were generated using GPT-4.1 and automatically annotated at the sentence level into four categories (background, positive finding, negative finding, and continuation) using GPT-4.1-mini. The synthetic data were partitioned into training (n=2,670), validation (n=334), and test (n=100) sets. We fine-tuned several models, including lightweight local LLMs (Qwen3 and Llama 3.2 series) using Low-Rank Adaptation (LoRA), and Japanese text classification models (BERT base Japanese v3, JMedRoBERTa-base, and ModernBERT-Ja-130M). External validation was performed on 280 real-world reports (3,477 sentences) from seven institutions in the Japan Medical Image Database (J-MID), with ground-truth labels established by board-certified radiologists.
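To make the annotation scheme concrete, the sentence-level records described above could be represented as shown in the minimal sketch below. The label names follow the four categories in the abstract; the field names (report_id, sentence, label), the integer label mapping, and the English example sentences are illustrative assumptions, not the paper's actual data format (the real reports are in Japanese).

```python
# Hypothetical record format for sentence-level annotation of synthetic reports.
# Label names come from the abstract; everything else is an assumed sketch.
LABELS = {
    "background": 0,
    "positive_finding": 1,
    "negative_finding": 2,
    "continuation": 3,
}

records = [
    {"report_id": "synth-0001", "sentence": "CT follow-up for lung cancer.", "label": "background"},
    {"report_id": "synth-0001", "sentence": "A 12 mm nodule in the right upper lobe.", "label": "positive_finding"},
    {"report_id": "synth-0001", "sentence": "No pleural effusion.", "label": "negative_finding"},
    {"report_id": "synth-0001", "sentence": "Unchanged from the prior study.", "label": "continuation"},
]

def encode(recs):
    """Map category names to integer ids, as a classifier fine-tuning step would require."""
    return [(r["sentence"], LABELS[r["label"]]) for r in recs]

pairs = encode(records)
```

Encoding labels to integers in this way is the usual preprocessing step before fine-tuning either a BERT-style classification head or an LLM with LoRA adapters.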
Evaluation metrics included accuracy, macro-averaged F1 (Macro F1) score, and positive predictive value for positive findings (PPV_1). All models achieved high performance on the synthetic test set (accuracy: 0.938-0.951; Macro F1 score: 0.924-0.940). Overall performance declined on the external validation dataset (accuracy: 0.783-0.813; Macro F1 score: 0.761-0.790), reflecting distributional differences between synthetic and real-world reports; however, PPV_1 remained stable and high across datasets (e.g., 0.957 on the synthetic test set vs. 0.952 on the external validation dataset for Qwen3 (4B)). Parsing errors occurred in the LLM-based approaches (19-260 sentences, 0.55%-7.48% of the external dataset). This study demonstrates the feasibility of developing context-aware sentence classification models for Japanese radiology reports using a training pipeline based entirely on synthetic data. The stability of PPV_1 indicates that the models successfully captured the essential clinical terminology and linguistic patterns required to identify positive findings in real-world reports, despite the observed performance degradation in external validation. This approach substantially reduces manual annotation requirements and privacy risks, providing a scalable foundation for constructing structured radiology datasets to support the development of clinically relevant medical artificial intelligence models.
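The three reported metrics can be sketched from scratch as below. This is a generic illustration of accuracy, macro-averaged F1, and PPV for the positive-finding class; the function names and the toy label arrays are assumptions, not the paper's evaluation code.

```python
def accuracy(y_true, y_pred):
    """Fraction of sentences whose predicted class matches the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores (each class counts equally)."""
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if (tp + fp) else 0.0
        rec = tp / (tp + fn) if (tp + fn) else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if (prec + rec) else 0.0)
    return sum(f1s) / len(f1s)

def ppv(y_true, y_pred, positive):
    """Positive predictive value: precision restricted to one class
    (here, the 'positive finding' label from the paper)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    predicted = sum(1 for p in y_pred if p == positive)
    return tp / predicted if predicted else 0.0

# Toy example with two classes (0 = other, 1 = positive finding).
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
```

Macro F1 is the appropriate headline metric here because the four sentence categories are likely imbalanced (e.g., continuation sentences are rarer than findings), and PPV_1 isolates how trustworthy a predicted positive finding is, which is the clinically critical case.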