
Adaptive, Privacy-Preserving Small Language Models for Multi-Task Clinical Assistance.

March 13, 2026

Authors

Zheng G, Kamel P, Pillai JJ, Akhbardeh A, Braverman V, Jacobs MA, Parekh VS

Affiliations (10)

  • Department of Computer Science, The Johns Hopkins University, Baltimore, MD, USA.
  • Department of Computer Science, Rice University, Houston, TX, USA.
  • Department of Neuroradiology, Division of Diagnostic Imaging, The University of Texas MD Anderson Cancer Center, Houston, TX, USA.
  • Department of Neurosurgery, The Johns Hopkins University School of Medicine, Baltimore, MD, USA.
  • Division of Neuroradiology, Department of Radiology, Mayo Clinic, Rochester, MN, USA.
  • Department of Diagnostic and Interventional Imaging, McGovern Medical School, UTHealth Houston, Houston, TX, USA.
  • GSBS, MD Anderson Cancer Center, The University of Texas, Houston, TX, USA.
  • The Russell H. Morgan Department of Radiology and Radiological Science and Sidney Kimmel Cancer Center, The Johns Hopkins University School of Medicine, Baltimore, MD, USA.
  • Department of Neurosurgery, The Johns Hopkins University School of Medicine, Baltimore, MD, USA. [email protected].
  • Department of Diagnostic and Interventional Imaging, McGovern Medical School, UTHealth Houston, Houston, TX, USA. [email protected].

Abstract

The purpose of this study is to evaluate whether a single, fine-tuned SLM can match or exceed the performance of LLMs across diverse clinical tasks, enabling hospitals to build tailored, privacy-preserving, efficient, and deployable language models that do not require managing multiple task-specific systems. We used SLMs of varying sizes and applied low-rank adaptation (LoRA) for fine-tuning across three clinical tasks: (1) medical report labeling, (2) DICOM series description harmonization, and (3) impression generation from findings. These tasks were constructed using two datasets: the public Open-i Indiana University Chest X-ray Dataset and an in-house brain MRI DICOM metadata dataset. We compared single-task SLMs, a multi-task SLM (representing our proposed configuration), and GPT-4o using zero-shot and few-shot prompting. We found OPT-350M to be the optimal SLM. In medical report labeling, the multi-task SLM achieved an F1 score of 0.894, compared to prompt-engineered GPT-4o's 0.728. In DICOM series description harmonization, the multi-task SLM achieved an accuracy of 0.975, compared to prompt-engineered GPT-4o's 0.878. In impression generation from findings, the multi-task SLM achieved an average Likert scale score of 4.39 ± 1.00, compared to GPT-4o's 3.65 ± 1.00 (p = 0.0008). This study demonstrates that a single fine-tuned SLM can serve as a general-purpose clinical assistant, offering performance on par with or better than larger models. With lower resource requirements, greater customizability, privacy protection, and strong task generalization, fine-tuning one SLM to support multiple clinical tasks meets the practical demands of clinical AI deployment in both high-resource and resource-limited healthcare settings.
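The LoRA fine-tuning used in the abstract freezes the pretrained weights and trains only a pair of low-rank matrices whose product is added to each adapted layer. A minimal NumPy sketch of that mechanism is below; the dimensions, rank, and scaling are illustrative choices, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 8, 8, 2, 16  # illustrative sizes, not the paper's settings

W0 = rng.standard_normal((d_out, d_in))    # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection (rank r)
B = np.zeros((d_out, r))                   # trainable up-projection, zero-initialized

def lora_forward(x):
    # frozen base path plus the scaled low-rank update (alpha / r) * B @ A;
    # during fine-tuning only A and B receive gradients
    return x @ W0.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.standard_normal((1, d_in))
# because B starts at zero, the adapted layer initially matches the frozen layer
assert np.allclose(lora_forward(x), x @ W0.T)
```

Zero-initializing B means training starts from the unmodified pretrained model, and only r * (d_in + d_out) parameters per layer are updated, which is why a single small model can be cheaply re-adapted to several clinical tasks.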

Topics

Journal Article
