
Domain- and language-adaptive pre-training of BERT models for Korean-English bilingual clinical text analysis.

November 25, 2025

Authors

Jo E, Cho E, Lee Y, Song S, Joo HJ

Affiliations (6)

  • Department of Biomedical Informatics, Korea University College of Medicine, 73 Goryeodae-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea.
  • Department of Linguistics, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea.
  • Department of Linguistics, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea. [email protected].
  • Department of Biomedical Informatics, Korea University College of Medicine, 73 Goryeodae-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea. [email protected].
  • Korea University Research Institute for Medical Bigdata Science, Korea University, Seoul, Republic of Korea. [email protected].
  • Department of Cardiology, Cardiovascular Center, Korea University College of Medicine, Seoul, Republic of Korea. [email protected].

Abstract

To develop bilingual Korean-English medical language models through domain- and language-adaptive pre-training and evaluate their performance in clinical text analysis tasks, specifically semantic similarity and multi-label classification. A bilingual corpus comprising Korean (medical textbooks and online health articles) and English (medical textbooks, health-related articles, and MIMIC-IV EHRs) clinical texts was constructed. Three BERT-based foundation models (Korean Medical [KM-BERT], English Biomedical [BioBERT], and multilingual general-domain [M-BERT]) underwent additional pre-training using a newly created bilingual WordPiece vocabulary (45,000 tokens). Model performance was assessed intrinsically on the medical semantic textual similarity (MedSTS) benchmark and extrinsically through multi-label classification of chest computed tomography (CT) reports from tertiary hospitals. Macro F1 scores and Pearson's correlation coefficients were used as the primary evaluation metrics. After bilingual pre-training, the Korean semantic similarity performance of bi-BioBERT improved substantially, with the Pearson correlation coefficient rising from 0.190 to 0.871. In the multi-label classification of chest CT reports, all bilingual models outperformed their respective foundation models; bi-KM-BERT achieved the highest Macro F1 score in both internal (0.9460 vs. 0.8902 for KM-BERT) and external validation (0.9288 vs. 0.8495 for KM-BERT). However, bi-KM-BERT and bi-M-BERT showed declines in Korean semantic similarity performance, indicating catastrophic forgetting. Gradient-based token-importance heatmaps confirmed that the bilingual models captured critical cross-lingual medical context more effectively than their foundation counterparts. The findings underscore that careful bilingual vocabulary curation and targeted domain-adaptive pre-training enhance natural language processing (NLP) performance in multilingual clinical environments, even with modest training resources. Continual-learning strategies should be explored to mitigate minor forgetting effects. Domain- and language-adaptive pre-training on bilingual medical corpora improves NLP model performance in multilingual clinical settings, thereby providing a scalable strategy for enhancing clinical text analysis capabilities in resource-limited bilingual contexts. The online version contains supplementary material available at 10.1186/s12911-025-03262-7.
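
For readers who want a concrete picture of the workflow, the sketch below illustrates domain- and language-adaptive pre-training in the spirit described above: training a bilingual WordPiece vocabulary and then continuing masked-language-model pre-training of an existing BERT checkpoint on a mixed Korean-English medical corpus, using the Hugging Face tokenizers, transformers, and datasets libraries. The corpus file names, base checkpoint, and hyperparameters are illustrative assumptions, not the authors' published configuration; in particular, how embeddings for tokens in the new vocabulary are initialized is not specified in the abstract.

```python
# Illustrative sketch only: corpus files, base checkpoint, and hyperparameters
# are assumptions, not the configuration reported in the paper.
import os

from datasets import load_dataset
from tokenizers import BertWordPieceTokenizer
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# 1. Build a bilingual WordPiece vocabulary (the paper reports 45,000 tokens).
corpus_files = ["korean_medical.txt", "english_medical.txt"]  # hypothetical corpus files
os.makedirs("bilingual_vocab", exist_ok=True)
wp = BertWordPieceTokenizer(lowercase=False)
wp.train(files=corpus_files, vocab_size=45_000)
wp.save_model("bilingual_vocab")  # writes bilingual_vocab/vocab.txt

# 2. Load a foundation model (a BioBERT-style checkpoint is assumed here) and
#    resize its embedding matrix to the new bilingual vocabulary. A full
#    vocabulary swap changes which embedding rows correspond to which tokens;
#    the abstract does not state how the authors handle this step.
tokenizer = BertTokenizerFast(vocab_file="bilingual_vocab/vocab.txt", do_lower_case=False)
model = BertForMaskedLM.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
model.resize_token_embeddings(len(tokenizer))

# 3. Continue masked-language-model pre-training on the bilingual medical corpus.
raw = load_dataset("text", data_files={"train": corpus_files})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_data = raw["train"].map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bi-bert",
        num_train_epochs=1,
        per_device_train_batch_size=16,
    ),
    train_dataset=train_data,
    data_collator=collator,
)
trainer.train()
```

Downstream evaluation would then follow the standard pattern suggested by the abstract's metrics: Pearson's correlation between predicted and gold similarity scores for MedSTS, and macro-averaged F1 (e.g. scikit-learn's f1_score with average="macro") for the multi-label CT-report classification.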

Topics

Journal Article
