Evaluation of Context-Aware Prompting Techniques for Classification of Tumor Response Categories in Radiology Reports Using a Large Language Model.
Authors
Affiliations (6)
- Department of Radiology, Research Institute of Radiological Science, and Center for Clinical Imaging Data Science (CCIDS), Yonsei University College of Medicine, 50-1 Yonsei-Ro, Seodaemun-Gu, Seoul, 03722, South Korea.
- Department of Biomedical Systems Informatics, Yonsei University College of Medicine, 50-1 Yonsei-Ro, Seodaemun-Gu, Seoul, 03722, South Korea.
- Department of Biomedical Systems Informatics, Yonsei University College of Medicine, 50-1 Yonsei-Ro, Seodaemun-Gu, Seoul, 03722, South Korea. [email protected].
- Institute for Innovation in Digital Healthcare, Yonsei University, Seoul, South Korea. [email protected].
- Department of Radiology, Research Institute of Radiological Science, and Center for Clinical Imaging Data Science (CCIDS), Yonsei University College of Medicine, 50-1 Yonsei-Ro, Seodaemun-Gu, Seoul, 03722, South Korea. [email protected].
- Institute for Innovation in Digital Healthcare, Yonsei University, Seoul, South Korea. [email protected].
Abstract
Radiology reports are essential for medical decision-making, providing crucial data for diagnosing diseases, devising treatment plans, and monitoring disease progression. While large language models (LLMs) have shown promise in processing free-text reports, research on effective prompting techniques for radiologic applications remains limited. To evaluate LLM-driven classification of tumor response category (TRC) from radiology reports, and to optimize the model by comparing four prompt engineering techniques for this classification task in clinical applications, we included 3062 whole-spine contrast-enhanced magnetic resonance imaging (MRI) radiology reports for prompt engineering and validation. TRCs were labeled by two radiologists using criteria modified from the Response Evaluation Criteria in Solid Tumors (RECIST) guidelines. The Llama3 instruct model (8B and 70B) was used to classify TRCs with four different prompts: General, In-Context Learning (ICL), Chain-of-Thought (CoT), and ICL with CoT. AUROC, accuracy, precision, recall, and F1-score were calculated for each prompt and model size on the test report dataset. The average AUROC for the ICL (0.96 internally, 0.93 externally) and ICL with CoT prompts (0.97 internally, 0.94 externally) outperformed the other prompts. Error rates increased with prompt complexity, including 0.8% incomplete-sentence errors and 11.3% probability-classification inconsistencies. This study demonstrates that context-aware LLM prompts substantially improved the efficiency and effectiveness of classifying TRCs from radiology reports, despite potential intrinsic hallucinations. While further improvements are required for real-world application, our findings suggest that context-aware prompts have significant potential for segmenting complex radiology reports and enhancing oncology clinical workflows.
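To make the four prompting strategies concrete, the following is a minimal sketch of how General, ICL, CoT, and ICL-with-CoT prompts could be constructed and sent to a Llama3 instruct model via the Hugging Face transformers text-generation pipeline. The model ID, label names, example reports, and prompt wording are illustrative assumptions, not the authors' exact prompts or data.

```python
# Hedged sketch of the four prompting strategies compared in the study.
# All prompts, labels, and example reports below are hypothetical.
from transformers import pipeline

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed checkpoint; the 70B variant is used analogously

TRC_LABELS = ["complete response", "partial response", "stable disease", "progressive disease"]

# General prompt: task instruction only.
GENERAL = (
    "Classify the tumor response category (TRC) of the following whole-spine MRI report "
    f"as one of: {', '.join(TRC_LABELS)}.\n\nReport:\n{{report}}\n\nAnswer:"
)

# ICL: prepend a few labeled example reports (hypothetical examples shown here).
ICL_EXAMPLES = (
    "Report: Previously noted L3 metastasis is no longer visible.\nAnswer: complete response\n\n"
    "Report: Interval enlargement of the T7 enhancing lesion with new sacral lesions.\n"
    "Answer: progressive disease\n\n"
)
ICL = ICL_EXAMPLES + GENERAL

# CoT: ask the model to reason step by step before committing to a label.
COT = GENERAL.replace(
    "Answer:",
    "Think step by step about lesion size changes and new lesions, then give the final label.\nAnswer:",
)

# ICL with CoT: few-shot examples plus the step-by-step instruction.
ICL_COT = ICL_EXAMPLES + COT


def classify(report: str, template: str, generator) -> str:
    """Fill the template with a report and return the model's TRC answer text."""
    messages = [{"role": "user", "content": template.format(report=report)}]
    out = generator(messages, max_new_tokens=128, do_sample=False)
    # For chat-style input, the pipeline returns the full message list; take the assistant reply.
    return out[0]["generated_text"][-1]["content"].strip()


if __name__ == "__main__":
    generator = pipeline("text-generation", model=MODEL_ID)
    report = "Decrease in size of the enhancing lesion at T10; no new lesions."
    for name, template in [("General", GENERAL), ("ICL", ICL), ("CoT", COT), ("ICL+CoT", ICL_COT)]:
        print(name, "->", classify(report, template, generator))
```

Keeping the General instruction as the shared core and layering few-shot examples (ICL) and a reasoning instruction (CoT) on top of it mirrors how the four conditions differ only in added context, which is what allows a like-for-like comparison of AUROC, accuracy, precision, recall, and F1-score across prompts and model sizes.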