Exploring GPT-4o's multimodal reasoning capabilities with panoramic radiograph: the role of prompt engineering.

Authors

Xiong YT,Lian WJ,Sun YN,Liu W,Guo JX,Tang W,Liu C

Affiliations (4)

  • State Key Laboratory of Oral Diseases & National Center for Stomatology & National Clinical Research Center for Oral Diseases & Department of Oral and Maxillofacial Surgery, West China Hospital of Stomatology, Sichuan University, Chengdu, 610041, China.
  • College of Computer Science, Sichuan University, Chengdu, 610065, China.
  • State Key Laboratory of Oral Diseases & National Center for Stomatology & National Clinical Research Center for Oral Diseases & Department of Oral and Maxillofacial Surgery, West China Hospital of Stomatology, Sichuan University, Chengdu, 610041, China. [email protected].
  • State Key Laboratory of Oral Diseases & National Center for Stomatology & National Clinical Research Center for Oral Diseases & Department of Oral and Maxillofacial Surgery, West China Hospital of Stomatology, Sichuan University, Chengdu, 610041, China. [email protected].

Abstract

This study aimed to evaluate GPT-4o's multimodal reasoning ability to review panoramic radiographs (PRs) and verify their radiologic findings, and to explore the role of prompt engineering in enhancing its performance. The study included 230 PRs from West China Hospital of Stomatology in 2024, which were interpreted to generate the PR findings. A total of 300 interpretation errors were manually inserted into the PR findings. An ablation study was conducted to assess whether GPT-4o can perform reasoning on PRs under a zero-shot prompt. Prompt engineering was then employed to enhance GPT-4o's reasoning capability in identifying interpretation errors with PRs. The prompt strategies included chain-of-thought, self-consistency, in-context learning, multimodal in-context learning, and their systematic integration into a meta-prompt. Recall, accuracy, and F1 score were used to evaluate the outputs. Subsequently, the localization capability of GPT-4o and its influence on reasoning capability were evaluated. In the ablation study, GPT-4o's recall increased significantly from 2.67% to 43.33% when the PRs were provided (P < 0.001). GPT-4o with the meta-prompt demonstrated improvements in recall (43.33% vs. 52.67%, P = 0.022), accuracy (39.95% vs. 68.75%, P < 0.001), and F1 score (0.42 vs. 0.60, P < 0.001) compared to the zero-shot prompt and other prompt strategies. The localization accuracy of GPT-4o was 45.67% (137 out of 300, 95% CI: 40.00 to 51.34). A significant correlation was observed between its localization accuracy and reasoning capability under the meta-prompt (φ coefficient = 0.33, P < 0.001). Providing accurate localization cues within the meta-prompt increased the model's recall by 5.49% (P = 0.031). GPT-4o demonstrated a degree of multimodal capability for PRs, and prompt engineering enhanced its performance. Nevertheless, its performance remains inadequate for clinical requirements.
Future efforts will be necessary to identify additional factors influencing the model's reasoning capability or to develop more advanced models. GPT-4o's capability to interpret and reason through PRs should be evaluated, and potential methods to enhance its performance explored, before clinical application in assisting radiological assessments.
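
The reported figures can be sanity-checked with a short sketch. It rests on two assumptions that the abstract does not state: that the reported "accuracy" plays the role of precision in the F1 calculation, and that a normal-approximation (Wald) interval is a reasonable stand-in for whatever interval method the authors used. The function names are illustrative only.

```python
import math


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def wald_ci(successes: int, n: int, z: float = 1.96) -> tuple:
    """Normal-approximation 95% confidence interval for a proportion."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# ASSUMPTION: the abstract's "accuracy" is treated as precision here.
zero_shot_f1 = f1_score(precision=0.3995, recall=0.4333)
meta_prompt_f1 = f1_score(precision=0.6875, recall=0.5267)

# Localization accuracy: 137 correct out of 300 inserted errors.
# ASSUMPTION: Wald interval; the authors' method is not given.
ci_low, ci_high = wald_ci(137, 300)

print(f"zero-shot F1:   {zero_shot_f1:.2f}")
print(f"meta-prompt F1: {meta_prompt_f1:.2f}")
print(f"95% CI: {100 * ci_low:.2f}% to {100 * ci_high:.2f}%")
```

Under these assumptions the F1 values round to the reported 0.42 and 0.60, and the interval lands within a few hundredths of a percentage point of the reported 40.00 to 51.34, which suggests the authors used a slightly different interval method.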

Topics

  • Radiography, Panoramic
  • Clinical Reasoning
  • Journal Article