Large language models as cost-conscious decision aids in emergency medicine: protocol support for imaging in lower back pain.
Authors
Affiliations (13)
- Harvard Medical School, Boston, MA, USA.
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Mass General Brigham, Boston, MA, USA.
- Harvard Business School, Boston, MA, USA.
- Massachusetts General Hospital, Boston, MA, USA.
- Réseau Radiologique Romand, Région de Genève, Genève, Suisse.
- Brown University Health, Providence, USA.
- Mass General Brigham AI, Boston, MA, USA.
- Brigham and Women's Hospital, Boston, MA, USA.
- Harvard Medical School, Boston, MA, USA.
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Mass General Brigham, Boston, MA, USA.
- Massachusetts General Hospital, Boston, MA, USA.
- Mass General Brigham Innovation, Boston, MA, USA.
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA.
Abstract
Appropriate utilization of imaging in the emergency department (ED) remains an important determinant of health care expenditures and patient throughput. Large language models (LLMs) have demonstrated potential as clinical decision support tools and may aid in cost-conscious imaging triage in the ED. We aim to evaluate the effectiveness of LLMs in providing accurate, cost-conscious imaging recommendations for ED patients with lower back pain. The cohort comprised 422 patients who presented between December 2017 and June 2018 to the ED of a ~1000-bed major urban academic medical center with a chief complaint of lower back pain and received a lumbar spine MRI. The primary outcomes were (1) alignment of Generative Pre-Trained Transformer 4 (GPT-4)-generated imaging recommendations with American College of Radiology (ACR) criteria, measured by raw accuracy and Cohen's κ, and (2) professional service resource utilization, quantified in work relative value units (wRVUs), under three scenarios: real-world clinical decisions, GPT-4 recommendations, and hypothetical 100% ACR adherence. GPT-4 was compared with real-world clinical decisions for imaging of lower back pain based on ED triage notes. Resource utilization was analyzed to assess potential savings from GPT-4 recommendations. GPT-4 generated ACR-concordant recommendations for 72.0% (304/422) of cases and demonstrated significant alignment with ACR criteria as measured by Cohen's κ (0.42, 95% CI: [0.35, 0.48], p < 0.05). Actual resource utilization was 629 wRVUs; GPT-4 recommendations would have used 481.74 wRVUs, and 100% adherence to the ACR criteria would have used 481.86 wRVUs. Our results support LLMs as possible tools for cost-conscious radiologic decision making in ED back pain evaluation.
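The two outcome measures named in the abstract (raw accuracy and Cohen's κ) and the wRVU comparison can be illustrated with a minimal Python sketch. The abstract does not describe the authors' analysis code, so the function names and the four-case toy labels below are hypothetical and are not study data; only the wRVU totals and the 304/422 count are taken from the abstract.

```python
from collections import Counter

def raw_accuracy(a, b):
    """Fraction of cases where two label sequences agree."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = raw_accuracy(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)

def relative_savings(actual, alternative):
    """Fractional reduction in resource use relative to actual use."""
    return (actual - alternative) / actual

# Hypothetical toy labels (NOT study data): model vs. guideline recommendation.
model = ["mri", "mri", "none", "none"]
guideline = ["mri", "none", "none", "none"]
print(raw_accuracy(model, guideline))   # 0.75
print(cohens_kappa(model, guideline))   # 0.5

# Figures reported in the abstract.
print(round(304 / 422 * 100, 1))                        # 72.0
print(round(relative_savings(629, 481.74) * 100, 1))    # 23.4
print(round(relative_savings(629, 481.86) * 100, 1))    # 23.4
```

Under these figures, both the GPT-4 and the full-ACR-adherence scenarios correspond to roughly 23.4% fewer wRVUs than actual utilization. The confidence interval and p-value reported for κ would require a separate procedure that this sketch does not attempt.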