Automated RECIST tumor response classification through prompt-guided large language models.

May 27, 2026 (PubMed)

Authors

Mergen M, Busch F, Sauter AP, Pfeiffer D, Makowski MR, Spitzl D, Gassert FT

Affiliations (4)

  • Department of Diagnostic and Interventional Radiology, Technical University of Munich, School of Medicine and Health, Klinikum rechts der Isar, TUM University Hospital, 81675, Munich, Germany.
  • Medical Clinic and Polyclinic II, TUM School of Medicine and Health, TUM University Hospital, Technical University Munich (TUM), Ismaningerstr. 22, 81675, Munich, Germany.
  • Department of Diagnostic and Interventional Radiology, Technical University of Munich, School of Medicine and Health, Klinikum rechts der Isar, TUM University Hospital, 81675, Munich, Germany.
  • Munich Institute for Advanced Study, Technical University of Munich, 85748, Garching, Germany.

Abstract

This study investigates whether an entirely offline, general-purpose large language model (LLM) can reliably automate the classification of routine radiology reports according to RECIST (Response Evaluation Criteria in Solid Tumors) guidelines. It focuses on how different prompting strategies support accurate, privacy-preserving tumor response assessment without additional model fine-tuning.

An offline, in-house implementation of LLaMA-3.3 (70B) was used to classify real-world CT imaging reports from oncology patients. Reports were authored following RECIST-structured reporting but had outcome labels programmatically withheld prior to processing. Three prompting strategies (zero-shot, few-shot, and chain-of-thought prompting) were tested to guide the model in assigning RECIST categories: Baseline (BL), Complete Response (CR), Partial Response (PR), Stable Disease (SD), and Progressive Disease (PD). Model outputs were benchmarked against original expert labels using accuracy, precision, recall, and F1 scores.

Across all tested prompting strategies, the LLaMA-3.3 model achieved strong classification performance. The best results were obtained with chain-of-thought prompting, reaching a micro F1 score of 0.81 across all RECIST categories. Overall, model predictions aligned well with human expert assessments. Operating entirely offline within hospital infrastructure, the system preserved full compliance with stringent data privacy requirements.

Prompt-driven large language models can accurately classify tumor response categories from real-world radiology reports in a scalable, reproducible, and privacy-preserving manner. Offline LLM deployment, combined with optimized prompting strategies, offers a promising approach for automating structured oncology report interpretation, potentially enhancing consistency and efficiency in clinical decision support workflows.
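As a rough illustration of the setup the abstract describes, the sketch below assembles prompts for the three strategies and scores predictions against expert labels with micro and macro F1. The abstract does not give the study's actual prompts, examples, or evaluation code, so the instruction text, the two few-shot examples, and the function names `build_prompt`, `micro_f1`, and `macro_f1` are all hypothetical. The call to the offline model is omitted.

```python
from collections import Counter

# RECIST categories named in the abstract.
LABELS = ["BL", "CR", "PR", "SD", "PD"]

# Hypothetical instruction text; the study's real prompts are not in the abstract.
TASK = "Classify the CT report into exactly one RECIST category: " + ", ".join(LABELS) + "."

# Invented few-shot examples for illustration only (not study data).
EXAMPLES = [
    ("Sum of target lesion diameters decreased by 40% versus baseline.", "PR"),
    ("New hepatic lesions since the prior examination.", "PD"),
]


def build_prompt(report: str, strategy: str) -> str:
    """Return a prompt for 'zero_shot', 'few_shot', or 'chain_of_thought'."""
    parts = [TASK]
    cue = "Category:"
    if strategy == "few_shot":
        parts += [f"Report: {text}\nCategory: {label}" for text, label in EXAMPLES]
    elif strategy == "chain_of_thought":
        parts.append(
            "Reason step by step about target lesions, non-target lesions, "
            "and new lesions, then state the category on the final line."
        )
        cue = "Reasoning:"
    elif strategy != "zero_shot":
        raise ValueError(f"unknown strategy: {strategy}")
    parts.append(f"Report: {report}\n{cue}")
    return "\n\n".join(parts)


def _counts(y_true, y_pred):
    """Per-class true positives, false positives, and false negatives."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1
    return tp, fp, fn


def _f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(y_true, y_pred) -> float:
    """Pool counts over all classes, then compute a single F1."""
    tp, fp, fn = _counts(y_true, y_pred)
    return _f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))


def macro_f1(y_true, y_pred, labels=LABELS) -> float:
    """Compute F1 per class, then take the unweighted mean."""
    tp, fp, fn = _counts(y_true, y_pred)
    return sum(_f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)


if __name__ == "__main__":
    expert = ["BL", "CR", "PR", "SD", "PD", "PD"]  # toy labels, not study data
    model = ["BL", "CR", "SD", "SD", "PD", "PR"]
    print(build_prompt("Target lesions unchanged in size.", "chain_of_thought"))
    print(f"micro F1 = {micro_f1(expert, model):.2f}")
    print(f"macro F1 = {macro_f1(expert, model):.2f}")
```

When every report receives exactly one predicted category, micro F1 is numerically equal to accuracy, because each misclassification counts once as a false positive and once as a false negative. Macro F1 weights each category equally, so it is more sensitive to performance on rare categories.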

Topics

  • Response Evaluation Criteria in Solid Tumors
  • Neoplasms
  • Journal Article
