Domain- and language-adaptive pre-training of BERT models for Korean-English bilingual clinical text analysis.
Authors
Affiliations (6)
- Department of Biomedical Informatics, Korea University College of Medicine, 73 Goryeodae-ro, Seongbuk-gu, Seoul 02841, Republic of Korea.
- Department of Linguistics, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul 02841, Republic of Korea.
- Department of Linguistics, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul 02841, Republic of Korea. [email protected].
- Department of Biomedical Informatics, Korea University College of Medicine, 73 Goryeodae-ro, Seongbuk-gu, Seoul 02841, Republic of Korea. [email protected].
- Korea University Research Institute for Medical Bigdata Science, Korea University, Seoul, Republic of Korea. [email protected].
- Department of Cardiology, Cardiovascular Center, Korea University College of Medicine, Seoul, Republic of Korea. [email protected].
Abstract
This study aimed to develop bilingual Korean-English medical language models through domain- and language-adaptive pre-training and to evaluate their performance on clinical text analysis tasks, specifically semantic similarity and multi-label classification. A bilingual corpus comprising Korean (medical textbooks and online health articles) and English (medical textbooks, health-related articles, and MIMIC-IV EHRs) clinical texts was constructed. Three BERT-based foundation models (Korean Medical [KM-BERT], English biomedical [BioBERT], and multilingual general-domain [M-BERT]) underwent additional pre-training using a newly created bilingual WordPiece vocabulary of 45,000 tokens. Model performance was assessed intrinsically on the medical semantic textual similarity (MedSTS) benchmark and extrinsically through multi-label classification of chest computed tomography (CT) reports from tertiary hospitals. Macro F1 scores and Pearson's correlation coefficients served as the primary evaluation metrics.

After bilingual pre-training, the Korean semantic similarity performance of bi-BioBERT improved markedly, with the Pearson correlation coefficient rising from 0.190 to 0.871. In the multi-label classification of chest CT reports, all bilingual models outperformed their respective foundation models; bi-KM-BERT achieved the highest Macro F1 score in both internal (0.9460 vs. 0.8902 for KM-BERT) and external validation (0.9288 vs. 0.8495 for KM-BERT). However, bi-KM-BERT and bi-M-BERT showed declines in Korean semantic similarity performance, indicating catastrophic forgetting. Gradient-based token-importance heatmaps confirmed that the bilingual models captured critical cross-lingual medical contexts more effectively.

These findings underscore that careful bilingual vocabulary curation and targeted domain-adaptive pre-training enhance natural language processing (NLP) performance in multilingual clinical environments, even with modest training resources. Continual-learning strategies should be explored to mitigate the minor forgetting effects. In sum, domain- and language-adaptive pre-training on bilingual medical corpora improves NLP model performance in multilingual clinical settings, providing a scalable strategy for enhancing clinical text analysis capabilities in resource-limited bilingual contexts.

The online version contains supplementary material available at 10.1186/s12911-025-03262-7.
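
As an illustration of the vocabulary-construction step described above, the following is a minimal sketch of training a 45,000-token bilingual WordPiece vocabulary with the Hugging Face tokenizers library; the corpus file names and all settings other than the vocabulary size are placeholders, not taken from the paper.

```python
# Sketch: training a 45,000-token bilingual WordPiece vocabulary.
# The corpus file names below are illustrative placeholders.
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(
    lowercase=False,      # preserve case in English medical terms
    strip_accents=False,  # keep Korean characters intact
)
tokenizer.train(
    files=["korean_medical_corpus.txt", "english_medical_corpus.txt"],
    vocab_size=45_000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model("bilingual_wordpiece")  # writes vocab.txt to the directory
```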
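The domain- and language-adaptive pre-training itself could, under the same assumptions, be sketched as continued masked-language-model training of an existing checkpoint with the new vocabulary; the checkpoint name, data paths, and training hyperparameters below are illustrative only and not the paper's configuration.

```python
# Sketch: continued masked-language-model (MLM) pre-training of a BERT
# checkpoint with the new bilingual vocabulary. Names are illustrative.
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

tokenizer = BertTokenizerFast(vocab_file="bilingual_wordpiece/vocab.txt", do_lower_case=False)
model = BertForMaskedLM.from_pretrained("bert-base-multilingual-cased")  # e.g. M-BERT
# Resize the embedding matrix to the 45k-token vocabulary; how embeddings for
# new tokens are initialised is glossed over in this sketch.
model.resize_token_embeddings(len(tokenizer))

dataset = load_dataset(
    "text",
    data_files={"train": ["korean_medical_corpus.txt", "english_medical_corpus.txt"]},
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_data = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bi-bert",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-5,
)

Trainer(model=model, args=args, train_dataset=train_data, data_collator=collator).train()
```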
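The two primary evaluation metrics named in the abstract, Pearson's correlation coefficient for MedSTS-style semantic similarity and the Macro F1 score for multi-label CT-report classification, can be computed as follows; the arrays are toy data, not results from the study.

```python
# Sketch: the two primary evaluation metrics (Pearson's r, Macro F1) on toy data.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import f1_score

# Semantic textual similarity: model scores vs. gold similarity ratings
predicted = np.array([4.2, 1.0, 3.5, 2.8])
gold = np.array([4.0, 0.5, 3.8, 2.5])
r, _ = pearsonr(predicted, gold)
print(f"Pearson r = {r:.3f}")

# Multi-label classification: binary indicator matrices (samples x labels)
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 1]])
print("Macro F1 =", f1_score(y_true, y_pred, average="macro"))
```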
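Finally, a gradient-based token-importance score of the kind visualised in the paper's heatmaps can be approximated as the norm of the gradient of the predicted class score with respect to each input token embedding; the checkpoint and example sentence below are assumptions, and in practice a fine-tuned classifier would be used rather than a randomly initialised classification head.

```python
# Sketch: gradient-based token-importance (saliency) scores per input token.
# Checkpoint and example text are illustrative, not from the paper.
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
model = BertForSequenceClassification.from_pretrained("bert-base-multilingual-cased")
model.eval()

text = "No evidence of pulmonary embolism on chest CT."
enc = tokenizer(text, return_tensors="pt")

# Embed the tokens explicitly so gradients on the embeddings can be retained.
embeddings = model.bert.embeddings.word_embeddings(enc["input_ids"])
embeddings.retain_grad()
outputs = model(inputs_embeds=embeddings, attention_mask=enc["attention_mask"])
outputs.logits[0, outputs.logits.argmax()].backward()  # score of the predicted class

scores = embeddings.grad.norm(dim=-1).squeeze(0)  # one saliency value per token
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
for tok, s in zip(tokens, scores.tolist()):
    print(f"{tok:>12s}  {s:.4f}")
```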