AI-RADS: A Framework for Assessment of Artificial Intelligence Output in Radiology - Development and Multireader Evaluation.
Abstract
Background: Despite the growing number of artificial intelligence (AI)-based applications used in radiology, no structured framework exists to assess their case-level reliability or to document overridden outputs in reports.
Purpose: To develop and evaluate the Artificial Intelligence Reporting and Data System (AI-RADS), a structured framework for the objective, case-level assessment of AI output reliability, clinical utility, and recommended actions in radiology.
Materials and Methods: The AI-RADS framework was tested in a retrospective multireader study in which 5 board-certified radiologists independently evaluated 350 cases processed by 7 representative AI applications covering image-based and generative tasks. Each case was assigned one of 5 AI-RADS categories, any applicable modifiers, and an independent correctness rating as a reference standard. Interreader agreement was quantified using Krippendorff's α with 95% CIs.
Results: Interreader agreement for the core AI-RADS categories was substantial for both image-based (Krippendorff's α=0.87; 95% CI: 0.83-0.91) and generative AI tasks (Krippendorff's α=0.93; 95% CI: 0.91-0.95). Reader-assigned correctness aligned well with AI-RADS categories 1 and 2, which indicate outputs suitable for integration into clinical workflows, whereas outputs rated as "incorrect" were predominantly assigned to categories 4 and 5, warranting override or removal from display.
Conclusion: AI-RADS provides a structured framework for the case-level evaluation of AI output reliability, clinical utility, and consequences for report communication. This multireader study demonstrated substantial interreader agreement and applicability across a range of AI applications.
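For readers who want to reproduce the agreement statistic reported above, the following is a minimal Python sketch, not the authors' analysis code. It assumes the third-party krippendorff package (pip install krippendorff), a hypothetical 5-reader x 350-case ratings matrix, an ordinal treatment of the AI-RADS categories, and a percentile bootstrap over cases for the 95% CI; all of these are illustrative assumptions.

# Hedged sketch: Krippendorff's alpha with a bootstrap 95% CI.
# The ratings matrix below is simulated, not study data.
import numpy as np
import krippendorff

rng = np.random.default_rng(0)

# Hypothetical ratings: 5 readers x 350 cases, AI-RADS categories 1-5.
# np.nan would mark any missing ratings.
ratings = rng.integers(1, 6, size=(5, 350)).astype(float)

# Krippendorff's alpha, treating the 5 categories as ordinal.
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")

# Percentile bootstrap 95% CI, resampling cases (columns) with replacement.
n_cases = ratings.shape[1]
boot = []
for _ in range(1000):
    idx = rng.integers(0, n_cases, size=n_cases)
    boot.append(krippendorff.alpha(reliability_data=ratings[:, idx],
                                   level_of_measurement="ordinal"))
ci_lo, ci_hi = np.percentile(boot, [2.5, 97.5])
print(f"alpha = {alpha:.2f} (95% CI: {ci_lo:.2f}-{ci_hi:.2f})")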