Augmenting LLM with Prompt Engineering and Supervised Fine-Tuning in NSCLC TNM Staging: Framework Development and Validation.
Authors
Affiliations (8)
- Liangyihui Network Technology Co., Ltd, 9/F, Tower T2, Jinheshangcheng, 140 Tianlin Road, Xuhui District, Shanghai, CN.
- School of Medicine, Tongji University, Shanghai, CN.
- Department of Thoracic Surgery, Renmin Hospital of Wuhan University, Wuhan, CN.
- Department of Neuro-oncology, Neurosurgery Center, Beijing Tiantan Hospital, Capital Medical University, Beijing, CN.
- Department of Medical Oncology, Sir Run Run Shaw Hospital, School of Medicine, Zhejiang University, Hangzhou, CN.
- NoDesk AI, Hangzhou, CN.
- Zhipu AI Technology Co., Ltd, Beijing, CN.
- Department of Biomedical Engineering, Faculty of Engineering, National University of Singapore, Singapore, SG.
Abstract
Accurate TNM staging is fundamental for treatment planning and prognosis in non-small cell lung cancer (NSCLC). However, its complexity poses significant challenges, particularly in standardizing interpretations across diverse clinical settings. Traditional rule-based natural language processing methods are constrained by their reliance on manually crafted rules and are susceptible to inconsistencies in clinical reporting. This study aimed to develop and validate a robust, accurate, and operationally efficient artificial intelligence framework for the TNM staging of NSCLC by strategically enhancing a large language model, GLM-4-Air, through advanced prompt engineering and supervised fine-tuning (SFT). We constructed a curated dataset of 492 de-identified real-world medical imaging reports, with TNM staging annotations rigorously validated by senior physicians according to the American Joint Committee on Cancer (AJCC) 8th edition guidelines. The GLM-4-Air model was systematically optimized via a multi-phase process: iterative prompt engineering incorporating chain-of-thought reasoning and domain knowledge injection for all staging tasks, followed by parameter-efficient SFT using Low-Rank Adaptation (LoRA) for the reasoning-intensive T and N staging tasks. The final hybrid model was evaluated on a completely held-out internal (black-box) test set and benchmarked against GPT-4o using standard metrics, statistical tests, and a clinical impact analysis of staging errors. The optimized hybrid GLM-4-Air model demonstrated reliable performance, achieving staging accuracies on the held-out black-box test set of 92% (95% confidence interval (CI) 85.0%-95.9%) for T, 86% (95% CI 77.9%-91.5%) for N, 92% (95% CI 85.0%-95.9%) for M, and 90% for overall clinical staging; by comparison, GPT-4o attained 87% (95% CI 79.0%-92.2%), 70% (95% CI 60.4%-78.1%), 78% (95% CI 68.9%-85.0%), and 80%, respectively.
The model's robustness was further evidenced by its macro-average F1-scores of 0.914 (T), 0.815 (N), and 0.831 (M), consistently surpassing those of GPT-4o (0.836, 0.620, and 0.698). Analysis of the confusion matrices confirmed the model's proficiency in identifying critical staging features while effectively minimizing false negatives. Crucially, the clinical impact assessment showed a substantial reduction in severe Category I errors, defined as misclassifications that could significantly influence subsequent clinical decisions: our model committed zero Category I errors in M staging across both test sets and fewer Category I errors in T and N staging. The framework also demonstrated practical deployability, achieving efficient inference on consumer-grade hardware (e.g., four RTX 4090 GPUs) with latencies acceptable for clinical workflows. The proposed hybrid framework, which integrates structured prompt engineering with SFT applied to the reasoning-heavy T and N staging tasks, enables the GLM-4-Air model to serve as a highly accurate, clinically reliable, and cost-efficient solution for automated NSCLC TNM staging. This work demonstrates the efficacy and potential of a domain-optimized smaller model over an off-the-shelf generalist model, holding promise for enhancing diagnostic standardization in resource-aware healthcare environments.
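The parameter-efficient SFT stage relies on Low-Rank Adaptation (LoRA), which freezes the base weight matrix W and learns only a low-rank update ΔW = (α/r)·BA, where B and A have rank r much smaller than the weight dimensions. As a minimal, dependency-free numeric sketch of this mechanism (toy dimensions and values are illustrative only, not the paper's actual training configuration):

```python
# Numeric sketch of a LoRA-adapted linear layer:
# effective weight = W + (alpha / r) * B @ A, with W frozen
# and only A (r x d_in) and B (d_out x r) trainable.

def matmul(X, Y):
    """Plain-Python matrix multiply for nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_forward(x, W, A, B, alpha, r):
    """Forward pass: (W + (alpha / r) * B @ A) @ x for a column vector x."""
    scale = alpha / r
    base = matmul(W, x)                 # frozen pretrained path
    adapt = matmul(B, matmul(A, x))     # trainable low-rank path
    return [[b[0] + scale * a[0]] for b, a in zip(base, adapt)]

# Toy example: d_out = d_in = 2, rank r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen identity weight
A = [[1.0, 1.0]]               # r x d_in
B = [[0.5], [0.5]]             # d_out x r
x = [[2.0], [3.0]]             # input column vector

y = lora_forward(x, W, A, B, alpha=1.0, r=1)
# y == [[4.5], [5.5]]: base output [2, 3] plus low-rank update [2.5, 2.5]
```

Because only A and B are updated during fine-tuning, the number of trainable parameters is a small fraction of the full model, which is what makes SFT of a model like GLM-4-Air feasible on the consumer-grade hardware described above.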