A comparison of performance of DeepSeek-R1 model-generated responses to musculoskeletal radiology queries against ChatGPT-4 and ChatGPT-4o - A feasibility study.
Authors
Affiliations (8)
- Department of Musculoskeletal Radiology, Royal Orthopedic Hospital, Birmingham, UK.
- Department of Musculoskeletal Radiology, AIIMS, Rishikesh, India.
- Department of Musculoskeletal Radiology, University of North Carolina, Chapel Hill, USA.
- Department of Orthopaedics, Southport and Ormskirk Hospital, Southport, Mersey and West Lancashire Hospital NHS Trust, UK.
- Department of Orthopedics, Indraprastha Apollo Hospital, New Delhi, India.
- Sport Medicine Physiotherapist, The Board of Control for Cricket in India, Mumbai, India.
- Department of Spinal Surgery, University Hospitals Coventry and Warwickshire, Coventry, UK.
- Department of Musculoskeletal Radiology, Royal Orthopedic Hospital, Birmingham, UK.
Abstract
Artificial Intelligence (AI) has transformed society, and chatbots built on Large Language Models (LLMs) are playing an increasing role in scientific research. This study aims to assess and compare the efficacy of the newer DeepSeek-R1 model with that of the ChatGPT-4 and ChatGPT-4o models in answering scientific questions about recent research. We compared the output generated by ChatGPT-4, ChatGPT-4o, and DeepSeek-R1 in response to ten standardized questions in the setting of musculoskeletal (MSK) radiology. The responses were independently analyzed by one MSK radiologist and one final-year MSK radiology trainee and graded on a Likert scale from 1 to 5 (1 being inaccurate, 5 being accurate). Five DeepSeek-R1 answers were significantly inaccurate and supplied references, which were fictitious, only on prompting. All ChatGPT-4 and ChatGPT-4o answers were well written with good content, and the ChatGPT-4o answers included useful and comprehensive references. In all our cases, ChatGPT-4o generated structured answers to questions on recent MSK radiology research with useful references, enabling reliable usage. DeepSeek-R1, on the other hand, generates articles that may appear authentic to the unsuspecting eye but, in the current version, contain a greater amount of falsified and inaccurate information. Further iterations may improve this accuracy.