Advanced Prompting Techniques Informed by Clinical Expertise Improve the Accuracy of LLM Data Extraction but Increase Non-Determinism.

March 11, 2026

Authors

Wang Y, Cisneros AA, Stewart C, Malekhedayat M, Luong J, Smith-Bindman R, Mongan J

Affiliations (6)

  • Department of Epidemiology and Biostatistics, University of California San Francisco, San Francisco, CA, USA.
  • Department of Radiology, San Antonio Uniformed Services Health Education Consortium, San Antonio, TX, USA.
  • Department of Epidemiology and Biostatistics, University of California San Francisco, San Francisco, CA, USA.
  • Department of Obstetrics, Gynecology and Reproductive Sciences, University of California San Francisco, San Francisco, CA, USA.
  • Philip R. Lee Institute for Health Policy Studies, University of California San Francisco, San Francisco, CA, USA.
  • Department of Radiology and Biomedical Imaging, Center for Intelligent Imaging, University of California San Francisco, San Francisco, CA, USA.

Abstract

Prompt engineering techniques that aid in the use of generative artificial intelligence to address classification tasks have expanded considerably in the last 2 years. The success of such methods varies with context, and their efficacy in extracting structured data from unstructured medical text is not well understood. In this paper, five large language model prompting strategies were evaluated on a structured categorical question about unstructured radiology reports. Three categories were typically explicit in the text, while one required extrapolation from medical knowledge. Each of the five prompting strategies combined one or more of the following techniques: an external knowledge source, recursive criticism and improvement, and chain-of-thought. The efficacy of each strategy was assessed by measuring accuracy and the rate of non-determinism. Accuracy was measured by overall correctness and by sensitivity to the non-explicit category. Non-determinism was assessed by running each strategy multiple times per exam, tracking both the frequency of exams with non-deterministic outputs and the number of runs needed before stabilization. The presence of an external knowledge source increased sensitivity to the non-explicit category from 10% to 66%, with minimal impact on overall accuracy. Sensitivity further increased to 78% with the introduction of recursive criticism and improvement and chain-of-thought, but at the cost of increased non-determinism: the proportion of exams with non-deterministic results rose from 7% with only an external knowledge source to 14%.
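The evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The category labels (`"A"` to `"D"`, with `"D"` standing in for the non-explicit category), the exam identifiers, and the sample outputs are all hypothetical. Two definitions are assumptions, since the abstract does not specify them: "stabilization" is taken to mean the first run after which the output never changes again, and repeated runs are collapsed into one prediction by majority vote.

```python
from collections import Counter


def overall_accuracy(predicted, truth):
    """Fraction of exams whose predicted category matches the reference label."""
    return sum(predicted[e] == truth[e] for e in truth) / len(truth)


def sensitivity(predicted, truth, category):
    """Of the exams truly in `category`, the fraction predicted as `category`."""
    positives = [e for e in truth if truth[e] == category]
    if not positives:
        raise ValueError(f"no exams with reference label {category!r}")
    return sum(predicted[e] == category for e in positives) / len(positives)


def non_determinism_rate(runs_by_exam):
    """Fraction of exams whose repeated runs produced more than one distinct output."""
    unstable = sum(len(set(runs)) > 1 for runs in runs_by_exam.values())
    return unstable / len(runs_by_exam)


def runs_to_stabilization(runs):
    """1-based index of the first run from which the output stays constant.

    ASSUMPTION: this definition is illustrative; the abstract does not say
    how stabilization was determined.
    """
    final = runs[-1]
    idx = len(runs)
    # Walk backwards while earlier runs still agree with the final output.
    while idx > 1 and runs[idx - 2] == final:
        idx -= 1
    return idx


def majority_vote(runs):
    """Most frequent output across runs (ASSUMPTION: one possible way to pick a final answer)."""
    return Counter(runs).most_common(1)[0][0]


# Hypothetical data: three runs of one prompting strategy on four exams.
runs_by_exam = {
    "exam1": ["A", "A", "A"],
    "exam2": ["A", "B", "B"],
    "exam3": ["D", "D", "D"],
    "exam4": ["C", "D", "C"],
}
truth = {"exam1": "A", "exam2": "B", "exam3": "D", "exam4": "D"}

predicted = {exam: majority_vote(runs) for exam, runs in runs_by_exam.items()}

print("overall accuracy:", overall_accuracy(predicted, truth))
print("sensitivity to D:", sensitivity(predicted, truth, "D"))
print("non-determinism rate:", non_determinism_rate(runs_by_exam))
print("runs to stabilization:",
      {exam: runs_to_stabilization(runs) for exam, runs in runs_by_exam.items()})
```

Reporting sensitivity for the non-explicit category separately from overall accuracy matters because a rare category can be missed almost entirely while overall accuracy barely moves, which is the pattern the abstract describes.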

Topics

Journal Article
