Reporting efficiency in diagnostic imaging: Can plug-and-play general-purpose large language models outperform conventional speech recognition?
Authors
Affiliations (5)
- Université Paris Cité, PARCC UMRS 970, INSERM, AP-HP, Hôpital Saint-Louis, Department of Radiology, 75010, Paris, France. [email protected].
- Hôpital Fondation Ophtalmologique Adolphe de Rothschild, Department of Radiology, 75019, Paris, France.
- Université Paris Cité, PARCC UMRS 970, INSERM, AP-HP, Hôpital Européen Georges Pompidou, Department of Radiology, 75015, Paris, France.
- Université Paris Cité, PARCC UMRS 970, INSERM, 75015, Paris, France.
- Université de Lorraine, Inserm, IADI, and Guilloz Imaging Department, Central Hospital, University Hospital Center of Nancy, 54000, Nancy, France.
Abstract
To compare conventional speech recognition (CSR) and a general-purpose large language model (LLM) for radiology reporting, focusing on generation times and errors.

In this prospective, multicenter study, five radiologists produced 200 reports using CSR and 200 using a general-purpose LLM with built-in speech recognition during routine clinical practice. Generation times were recorded. Errors were evaluated qualitatively and quantitatively using the Levenshtein distance. The Mann-Whitney U-test was used to compare quantitative variables, and the chi-square test for categorical variables. No patient-identifying or clinical information was uploaded to the LLM.

301/400 (75.3%) CT and 99/400 (24.8%) MR reports were included. Overall, the median total generation time was shorter in the LLM group than in the CSR group (238 (154; 349) versus 318 (218; 478) seconds, p < 0.01). However, at the individual level, a time reduction with the LLM was observed in only 3 of 5 radiologists. Grammar/spelling and transcription errors were fewer in the LLM group (79 versus 293 and 225 versus 445, respectively, p < 0.01 for both). In the LLM group, 69 instances of rewording without loss of meaning, 99 instances of non-compliance with instructions, and 4 confabulations were observed. The Levenshtein distance at the character scale was higher in the LLM group (43 (8; 156) versus 20 (5; 43), p < 0.01).

Overall, reports produced with the general-purpose LLM were generated more quickly and with fewer grammar/spelling and transcription errors than CSR reports; however, time savings varied across radiologists and reporting practices, and new error patterns appeared.

Question
Does a general-purpose LLM with speech recognition and plug-and-play prompts improve the efficiency and accuracy of radiology reporting compared to conventional speech recognition in clinical practice?

Findings
Overall, reports generated with the LLM required significantly shorter generation times and contained fewer grammar and transcription errors than those produced with speech recognition.

Clinical relevance
A general-purpose LLM can be easily implemented in clinical radiological practice and can potentially reduce reporting time and minimize errors encountered with speech recognition; however, the time-saving effect is heterogeneous and depends on radiologists' dictation habits.