Comparison of AI-generated radiology impressions: a multi-stakeholder evaluation.
Authors
Affiliations (4)
- Rad AI, San Francisco, CA, USA.
- Moffitt Cancer Center, Tampa, FL, USA.
- Rad AI, San Francisco, CA, USA. [email protected].
- Moffitt Cancer Center, Tampa, FL, USA. [email protected].
Abstract
A retrospective, blinded evaluation of 200 oncologic computed tomography reports compared original radiologist-authored impressions, impressions generated by a custom domain-specific AI model fine-tuned on institutional data, and impressions generated by a general-purpose large language model. Ten clinicians, comprising original radiologists (n = 4), independent radiologists (n = 3), and oncologists (n = 3), rated impressions for completeness, correctness, conciseness, clarity, clinical utility, and patient harm. Original and independent radiologists assigned lower preference to generic model impressions (Cohen's h = 1.04-1.22 and 0.66-0.69, respectively; p < 0.001). Original radiologists slightly preferred their own impressions to the custom model's (h = 0.18, p = 0.0716), while independent radiologists showed no preference (h = -0.03, p = 0.78). Oncologists demonstrated no significant preference among impression types (h = 0.04-0.12, all p > 0.20). Custom model impressions achieved near parity with human impressions; original radiologists rated their own impressions slightly more complete (r = 0.22, p = 0.0016). Generic model impressions were longer (75.1 ± 20.4 words) and slightly more complete (r = 0.18-0.39, p values from <0.001 to 0.01), but significantly less concise (r = 0.85-0.87, p < 0.001). Patient harm ratings were uniformly low (likelihood 1.01-1.14; extent 1.05-1.21). Inter-rater reliability ranged from α = -0.09 to 0.67 (α = 0.67 for conciseness; α = -0.09 to 0.03 for clinical utility and correctness).
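For reference, Cohen's h, the effect size used above for the pairwise preference comparisons, is the difference between the arcsine-transformed square roots of two proportions: h = 2·arcsin(√p1) − 2·arcsin(√p2). A minimal sketch (the example proportions are illustrative, not taken from the study):

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size for the difference between two proportions.

    Uses the arcsine transformation phi = 2 * arcsin(sqrt(p)),
    so h = phi1 - phi2. Conventional interpretation thresholds:
    ~0.2 small, ~0.5 medium, ~0.8 large.
    """
    if not (0.0 <= p1 <= 1.0 and 0.0 <= p2 <= 1.0):
        raise ValueError("proportions must lie in [0, 1]")
    return 2.0 * math.asin(math.sqrt(p1)) - 2.0 * math.asin(math.sqrt(p2))

# Illustrative only: a 75% vs. 25% preference split yields a large effect.
print(round(cohens_h(0.75, 0.25), 4))  # ≈ 1.0472
```

By this convention, the reported h values of 1.04-1.22 for generic-model impressions indicate large preference effects, while h = 0.18 and h = -0.03 are small to negligible.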