Could a New Method of Acromiohumeral Distance Measurement Emerge? Artificial Intelligence vs. Physician.
Authors
Affiliations (5)
- Department of Physical Medicine and Rehabilitation, Prof. Dr. Cemil Tascioglu City Hospital, Istanbul, Turkey.
- Department of Radiology, Basaksehir Cam and Sakura City Hospital, Istanbul, Turkey.
- Department of Physical Medicine and Rehabilitation, Istanbul Training and Research Hospital, Istanbul, Turkey.
- Department of Physical Medicine and Rehabilitation, Golcuk Necati Celik State Hospital, Kocaeli, Turkey.
- Department of Physical Medicine and Rehabilitation, Basaksehir Cam and Sakura City Hospital, Istanbul, Turkey.
Abstract
This study evaluated the reliability of ChatGPT-4 in measuring the acromiohumeral distance (AHD), a commonly used assessment in patients with shoulder pain. In this retrospective study, 71 registered shoulder magnetic resonance imaging (MRI) scans were included. AHD was measured on a coronal oblique T1 sequence with a clear view of the acromion and humerus. An experienced radiologist and ChatGPT-4 each performed the measurement twice, 3 days apart; the two ChatGPT-4 measurements were made in different sessions. The first, second, and mean AHD values measured by the physician were 7.6 ± 1.7, 7.5 ± 1.6, and 7.6 ± 1.7, respectively. The first, second, and mean values measured by ChatGPT-4 were 6.7 ± 0.8, 7.3 ± 1.1, and 7.1 ± 0.8, respectively. The physician and ChatGPT-4 differed significantly for the first and mean measurements (p < 0.0001 and p = 0.009, respectively), but not for the second measurements (p = 0.220). Intrarater reliability was excellent for the physician (ICC = 0.99) and poor for ChatGPT-4 (ICC = 0.41). Interrater reliability between the physician and ChatGPT-4 was poor (ICC = 0.45). In conclusion, ChatGPT-4 was less reliable than an experienced radiologist in measuring AHD. These findings may inform future work on the contribution of large language models to medical science.
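The reliability figures above are intraclass correlation coefficients (ICC). The abstract does not state which ICC form was used, so the following is only a minimal sketch of one common variant, the two-way random-effects, absolute-agreement, single-measure ICC(2,1) of Shrout and Fleiss. The function name `icc_2_1` and the sample values are hypothetical and are not the study data.

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    `ratings` is an (n_subjects, k_raters) array. For intrarater
    reliability the columns would be repeated sessions of one rater.
    Assumption: this ICC form is illustrative; the abstract does not
    say which form the authors used.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()

    # Sums of squares from a two-way ANOVA without replication
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()  # between subjects
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()  # between raters
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Hypothetical AHD readings (two columns = two raters or two sessions)
example = [[7.6, 6.9], [8.1, 7.2], [6.4, 6.8], [9.0, 7.5], [5.9, 6.5]]
print(round(icc_2_1(example), 2))
```

Because this form measures absolute agreement, a systematic offset between the two columns lowers the coefficient even when the readings are well correlated.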