Comparative analysis of AI and human radiographer performance in radiographic image assessments: A pilot study using a large language model to simulate radiographer decision-making.

May 25, 2026

DOI: 10.1016/j.jmir.2026.102444 PMID: 42184684

Authors

Almulla M

Affiliations (1)

Department of Radiologic Sciences, Faculty of Allied Health, Kuwait University, Kuwait City, Kuwait.

Abstract

To determine whether a prompting-based, interpretable artificial intelligence (AI) system, specifically a large language model (LLM) that applies structured radiographic criteria derived from radiographer training, can approximate Kuwaiti radiographer acceptance-rejection decisions across varied radiographic examinations, and to compare findings with international benchmarks.

Thirty anonymized radiographs (chest=3, spine=3, abdomen/KUB=2, upper extremity=15, lower extremity=7) were evaluated by 43 radiographers (1290 decisions) and by the LLM, which was prompted 43 times per image to generate repeated model evaluations under controlled prompting conditions (1290 AI decisions). Decisions were categorized as "keep," "could keep," or "reject." Outcomes were: (i) "keep" vs "reject"; (ii) agreement across cases; (iii) examination-specific trends; and (iv) alignment with expert labels. International comparisons contextualized local thresholds. The AI system applied structured radiographic criteria to descriptive representations of radiographic features rather than directly interpreting image data.

Radiographers kept 52.6% of images vs 32.5% for the AI (χ²(1) = 105.6, p < 0.001; OR = 2.30). Case-level agreement was weak (r = 0.16). Radiographers accepted more images across all examinations, with the largest gap in chest radiographs (47 percentage points). Relative to expert labels, the AI aligned more with "reject," while radiographers aligned more with "keep." International comparisons showed that Kuwaiti radiographers applied stricter thresholds than those in European cohorts.

The lower acceptance rates and modest agreement indicate that the LLM produces more conservative evaluation outputs than radiographers, particularly in chest examinations. These differences reflect the contextual and experience-based judgments radiographers apply that AI cannot replicate. The international comparison further shows that local decision patterns influence acceptance thresholds and should inform future AI calibration. Although radiographic images were included in input, outputs relied on structured prompting and predefined criteria rather than direct visual interpretation.

A prompting-based LLM grounded in radiographer criteria can approximate radiographer decision-making patterns when applied within a structured prompting framework, but remains conservative in the absence of clinical context. These exploratory findings suggest potential applications in radiographer education, quality assurance, and standardization pending further validation.

Radiographers decide whether an X-ray image is good enough to keep or needs to be repeated, which affects care quality and safety. This study compared decisions made by radiographers in Kuwait with those from an artificial intelligence tool that follows written image quality rules. This study found that the artificial intelligence rejected more images than radiographers and showed different decision patterns, including differences from experts and from other countries. This matters because understanding these gaps can guide safer use of artificial intelligence in training, quality checks, and consistent imaging decisions.
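The keep-rate comparison reported in the abstract (52.6% vs 32.5% of 1290 decisions each, χ²(1) = 105.6, OR = 2.30) can be roughly checked from the published figures. The sketch below is not the author's analysis code. It assumes a 2×2 table of "keep" versus not "keep" (pooling "could keep" with "reject"), and it assumes raw counts of 678 and 419, which are reconstructed from the rounded percentages and are not stated in the abstract. The choice of Yates' continuity correction is also an assumption.

```python
import math

DECISIONS_PER_GROUP = 1290  # 30 images x 43 evaluations, as stated in the abstract

# ASSUMPTION: raw counts are not given in the abstract; these are
# reconstructed from the rounded percentages (52.6% and 32.5%).
RADIOGRAPHER_KEEP = 678
AI_KEEP = 419


def two_by_two_stats(keep_a, keep_b, n_per_group):
    """Return (chi2, p_value, odds_ratio) for keep vs not-keep in two equal groups."""
    a, b = keep_a, n_per_group - keep_a  # group A: keep, not keep
    c, d = keep_b, n_per_group - keep_b  # group B: keep, not keep
    n = a + b + c + d
    # Pearson chi-square with Yates' continuity correction (1 degree of freedom).
    numerator = n * max(abs(a * d - b * c) - n / 2, 0) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = numerator / denominator
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c)
    return chi2, p_value, odds_ratio


chi2, p_value, odds_ratio = two_by_two_stats(
    RADIOGRAPHER_KEEP, AI_KEEP, DECISIONS_PER_GROUP
)
print(f"chi2(1) = {chi2:.1f}, p = {p_value:.2e}, OR = {odds_ratio:.2f}")
```

Under these assumed counts the result lands close to the reported statistic and odds ratio. That agreement is suggestive only: other counts also round to the same percentages, and this simple test treats all 1290 decisions per group as independent even though each image was rated 43 times.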

Topics

Journal Article
