
Domain and language adaptive pre-training of BERT models for Korean-English bilingual clinical text analysis

November 25, 2025 · PubMed · Papers

Authors

Jo E, Cho E, Lee Y, Song S, Joo HJ

Affiliations (6)

  • Department of Biomedical Informatics, Korea University College of Medicine, Seongbuk-gu 73, Goryeodae-ro, Seoul, 02841, Republic of Korea.
  • Department of Linguistics, Korea University, Seongbuk-gu 145, Anam-ro, Seoul, 02841, Republic of Korea.
  • Department of Linguistics, Korea University, Seongbuk-gu 145, Anam-ro, Seoul, 02841, Republic of Korea. [email protected].
  • Department of Biomedical Informatics, Korea University College of Medicine, Seongbuk-gu 73, Goryeodae-ro, Seoul, 02841, Republic of Korea. [email protected].
  • Korea University Research Institute for Medical Bigdata Science, Korea University, Seoul, Republic of Korea. [email protected].
  • Department of Cardiology, Cardiovascular Center, Korea University College of Medicine, Seoul, Republic of Korea. [email protected].

Abstract

To develop bilingual Korean-English medical language models through domain- and language-adaptive pre-training and evaluate their performance in clinical text analysis tasks, specifically semantic similarity and multi-label classification.

A bilingual corpus comprising Korean (medical textbooks and online health articles) and English (medical textbooks, health-related articles, and MIMIC-IV EHRs) clinical texts was constructed. Three BERT-based foundation models (Korean Medical [KM-BERT], English biomedical [BioBERT], and multilingual general-domain [M-BERT]) underwent additional pre-training using a newly created bilingual WordPiece vocabulary (45,000 tokens). Model performance was assessed intrinsically on the medical semantic textual similarity (MedSTS) benchmark and extrinsically through multi-label classification of chest computed tomography (CT) reports from tertiary hospitals. Macro F1 scores and Pearson's correlation coefficients served as the primary evaluation metrics.

After bilingual pre-training, the Korean semantic similarity performance of bi-BioBERT improved markedly, with its Pearson correlation coefficient rising from 0.190 to 0.871. In the multi-label classification of chest CT reports, all bilingual models outperformed their respective foundation models; bi-KM-BERT achieved the highest Macro F1 score in both internal (0.9460 vs. 0.8902 for KM-BERT) and external validation (0.9288 vs. 0.8495 for KM-BERT). However, bi-KM-BERT and bi-M-BERT showed declines in Korean semantic similarity performance, indicating catastrophic forgetting. Gradient-based token-importance heatmaps confirmed that the bilingual models captured critical cross-lingual medical contexts more effectively.

The findings underscore that careful bilingual vocabulary curation and targeted domain-adaptive pre-training enhance natural language processing (NLP) performance in multilingual clinical environments, even with modest training resources. Continual-learning strategies should be explored to mitigate the minor forgetting effects. Domain- and language-adaptive pre-training on bilingual medical corpora improves NLP model performance in multilingual clinical settings, providing a scalable strategy for enhancing clinical text analysis in resource-limited bilingual contexts. The online version contains supplementary material available at 10.1186/s12911-025-03262-7.
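For readers who want a concrete picture of the training setup described above, the sketch below outlines the two core steps using the Hugging Face tokenizers, transformers, and datasets libraries: building a 45,000-token bilingual WordPiece vocabulary and continuing masked-language-model (MLM) pre-training of a foundation BERT checkpoint. The corpus file names, the bert-base-multilingual-cased checkpoint, and all hyperparameters are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch, assuming a mixed Korean-English medical corpus in plain-text files.
# (1) train a new 45,000-token WordPiece vocabulary, (2) continue MLM pre-training
# of an existing BERT checkpoint with that vocabulary.
from tokenizers import BertWordPieceTokenizer
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

CORPUS_FILES = ["korean_medical_corpus.txt", "english_medical_corpus.txt"]  # assumed paths

# 1) Train a bilingual WordPiece vocabulary (45k tokens, as reported in the abstract).
wp = BertWordPieceTokenizer(lowercase=False)
wp.train(files=CORPUS_FILES, vocab_size=45_000)
wp.save_model(".")  # writes vocab.txt to the current directory

# 2) Load a foundation model (multilingual BERT shown here) and swap in the new vocab.
tokenizer = BertTokenizerFast(vocab_file="vocab.txt", do_lower_case=False)
model = BertForMaskedLM.from_pretrained("bert-base-multilingual-cased")
# Note: after swapping vocabularies, the retained embedding rows no longer line up
# with the new token IDs; real pipelines usually re-map shared tokens or
# re-initialize the embedding matrix. This sketch simply resizes it.
model.resize_token_embeddings(len(tokenizer))

# 3) Continue masked-language-model pre-training on the bilingual medical corpus.
dataset = load_dataset("text", data_files={"train": CORPUS_FILES})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bi-mbert",
        num_train_epochs=3,
        per_device_train_batch_size=16,
    ),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
```

The resulting checkpoint can then be fine-tuned on downstream tasks such as MedSTS regression or multi-label CT-report classification; the paper's exact vocabulary-transfer and pre-training procedure may differ from this simplified outline.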

Topics

Journal Article
