Evaluation of large language models with clinical guidance for vetting outpatient magnetic resonance imaging lumbar spine referrals.
Affiliations (3)
- Department of Radiology, NHS Tayside, Dundee, UK.
- Canon Medical Research Europe, Edinburgh, UK.
- University of Edinburgh, Edinburgh, UK.
Abstract
Objectives: Accurate triage of lumbar spine magnetic resonance imaging (MRI) referrals for sciatica is important for patient assessment, diagnosis and surgical planning. This study evaluates the accuracy and speed of large language models (LLMs) in automatically vetting lumbar spine MRI referrals from general practice.
Methods: Three LLMs (GPT-4, Claude Opus, Gemini) were tasked with assigning an outcome (Accept - Routine, Accept - Urgent, Reject) and flagging MRI contraindications for lumbar spine referrals. Three prompts of increasing detail, including clinical guidelines and training examples, were used. Two radiology registrars synthesised 120 referrals, vetted by two board-certified radiologists, with a third resolving disagreements. Performance was assessed using accuracy, precision, recall and F1 scores.
Results: Inter-rater agreement between radiologists was substantial for vetting outcome (Cohen's κ = 0.76) and contraindication detection (κ = 0.68). Claude Opus with the full prompt achieved the highest accuracy (0.86) for vetting outcomes. GPT-4 with the instruction-only prompt achieved the highest F1 score (0.88) for contraindication detection. LLMs completed the task substantially faster than radiologists (9.8 ± 1.0 vs 135.0 ± 45.0 min).
Conclusions: LLMs demonstrate promising performance in vetting radiological referrals for sciatica, particularly with detailed context. All models identified all urgent referrals, suggesting potential for prioritising vetting worklists and improving timeliness of care.
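
The metrics named in the abstract (accuracy, precision, recall, F1 and Cohen's κ) can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function names and the six toy referrals below are hypothetical, and the abstract does not state how scores were averaged across the three outcome classes, so the sketch reports per-class values only.

```python
from collections import Counter

# Outcome classes as named in the abstract.
LABELS = ["Accept - Routine", "Accept - Urgent", "Reject"]


def accuracy(truth, pred):
    """Fraction of referrals where the prediction matches the reference."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def precision_recall_f1(truth, pred, label):
    """One-vs-rest precision, recall and F1 for a single outcome class."""
    tp = sum(t == label and p == label for t, p in zip(truth, pred))
    fp = sum(t != label and p == label for t, p in zip(truth, pred))
    fn = sum(t == label and p != label for t, p in zip(truth, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if both raters labelled independently
    # at their own marginal rates.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical toy data: six referrals, NOT taken from the study.
reference = ["Accept - Urgent", "Accept - Routine", "Reject",
             "Accept - Routine", "Reject", "Accept - Urgent"]
model = ["Accept - Urgent", "Accept - Routine", "Accept - Routine",
         "Accept - Routine", "Reject", "Accept - Urgent"]

if __name__ == "__main__":
    print("accuracy:", round(accuracy(reference, model), 3))
    for label in LABELS:
        p, r, f = precision_recall_f1(reference, model, label)
        print(label, round(p, 3), round(r, 3), round(f, 3))
    print("kappa:", round(cohens_kappa(reference, model), 3))
```

In this toy set the model never misses an urgent referral, so recall for "Accept - Urgent" is 1.0, which is the property the conclusions highlight for worklist prioritisation. Passing two radiologists' label lists to `cohens_kappa` instead of a reference and a model output gives the inter-rater agreement statistic.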