Diagnostic Performance of Large Language Models in Multimodal Analysis of Radiolucent Jaw Lesions.
Affiliations (2)
- Department of Oral and Maxillofacial Surgery, Daejeon Dental Hospital, Wonkwang University College of Dentistry, Daejeon 35233, Republic of Korea.
- Department of Oral and Maxillofacial Surgery, Daejeon Dental Hospital, Wonkwang University College of Dentistry, Daejeon 35233, Republic of Korea. Electronic address: [email protected].
Abstract
Large language models (LLMs), such as ChatGPT and Gemini, are increasingly being used in medical domains, including dental diagnostics. Despite advances in image-based deep learning systems, the diagnostic capabilities of LLMs in oral and maxillofacial surgery (OMFS), particularly their ability to process multimodal imaging inputs, remain underexplored. Radiolucent jaw lesions represent a particularly challenging diagnostic category because of their varied presentations and overlapping radiographic features. This study evaluated the diagnostic performance of ChatGPT-4o and Gemini 2.5 Pro on real-world OMFS radiolucent jaw lesion cases, presented in multiple-choice question (MCQ) and short-answer question (SAQ) formats across 3 imaging conditions: panoramic radiography only (P), panoramic + CT (P+C), and panoramic + CT + pathology (P+C+B). Data from 100 anonymized patients at Wonkwang University Daejeon Dental Hospital were analyzed, including demographics, panoramic radiographs, CBCT images, histopathology slides, and confirmed diagnoses. The sample size was determined by institutional case availability and the statistical power required for comparative analysis. ChatGPT and Gemini diagnosed each case under 6 conditions: the 3 imaging modalities (P, P+C, P+C+B) in both MCQ and SAQ formats. Model accuracy was scored against expert-confirmed diagnoses by 2 independent evaluators, and McNemar's and Cochran's Q tests assessed statistical differences across models and imaging modalities. For MCQ tasks, ChatGPT achieved accuracies of 66%, 73%, and 82% under the P, P+C, and P+C+B conditions, respectively, while Gemini achieved 57%, 62%, and 63%. In SAQ tasks, ChatGPT achieved 34%, 45%, and 48%; Gemini achieved 15%, 24%, and 28%. Accuracy improved significantly with additional imaging data for ChatGPT, and ChatGPT consistently outperformed Gemini across all conditions (P < .001 for MCQ; P = .008 to P < .001 for SAQ).
The MCQ format, which incorporates a human-in-the-loop (HITL) structure, showed higher overall performance than the SAQ format. ChatGPT demonstrated superior diagnostic performance compared with Gemini in OMFS diagnostic tasks when provided with richer multimodal inputs. Diagnostic accuracy increased with additional imaging data, especially in the MCQ format, suggesting that LLMs can effectively synthesize radiographic and pathological data. LLMs therefore have potential as diagnostic support tools for OMFS, especially in settings with limited specialist access. Presenting clinical cases in structured formats using curated imaging data enhances LLM accuracy and underscores the value of HITL integration. Although current LLMs show promising results, further validation with larger datasets and hybrid AI systems is necessary for broader clinical adoption.
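The pairwise model comparisons reported above use McNemar's test, which considers only the discordant cases, i.e., those where exactly one of the two models is correct on the same item. A minimal sketch of the exact two-sided version in Python is shown below; the discordant counts are hypothetical illustrations, not the study's actual data:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar test on discordant pair counts.

    b: cases where model A was correct and model B was wrong
    c: cases where model B was correct and model A was wrong
    Under the null hypothesis of equal accuracy, the smaller count
    follows Binomial(n = b + c, p = 0.5).
    """
    n = b + c
    k = min(b, c)
    # Two-sided exact p-value: double the lower binomial tail P(X <= k)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)  # cap at 1 when b == c

# Hypothetical example: of 100 paired cases, 20 are discordant,
# split 16 vs 4 in favor of model A.
print(round(mcnemar_exact(16, 4), 4))  # → 0.0118
```

For comparing accuracy across 3 or more correlated conditions (e.g., P vs P+C vs P+C+B on the same cases), Cochran's Q test generalizes this idea to multiple matched binary outcomes, which is why the study pairs the two tests.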