
Assessing Statistical Practices of Existing Artificial Intelligence (AI) Models for Lung Cancer Detection, Prognosis, and Risk Prediction: A Cross-Sectional Meta-Research Study Supplemented by Human and Large Language Model (LLM)-Directed Quality Appraisal

December 30, 2025 · medRxiv preprint

Authors

Hou, Y., Ward, T., Yang, C.-H., Jernigan, E., Caturegli, G., Boffa, D., Mukherjee, B.

Affiliations (1)

  • Yale University

Abstract

Artificial intelligence (AI) models with medical images as input data are increasingly proposed to support clinical decisions in lung cancer screening. To assess how these models are developed, evaluated, and reported, and to identify gaps in best statistical practices, we conducted a cross-sectional meta-research study of OpenAlex-indexed studies (January 1, 2023, to June 30, 2025) that developed image-based AI tools to detect lung cancer, predict prognosis, or estimate future risk. Thirty-six studies met our inclusion criteria. Study quality and reporting were appraised using three approaches: subjective ratings from two statisticians and two clinicians, scoring from two AI agents (GPT-5 and Gemini 2.5 Pro), and a guideline-based checklist from the Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies (CHARMS). Convolutional neural networks were used in most of the included studies (69%). Area under the curve was the most frequently reported metric (81%). Our meta-research study also highlights common lapses in these 36 studies, including limited external test set use (39%), insufficient subgroup analyses (28%), and a substantial lack of adherence to established prediction-model reporting guidelines. AI-based quality scoring aligned better with CHARMS-based scores than did human scoring: Spearman correlations with CHARMS were weaker for the statisticians and clinicians (ρ ≤ 0.46) than for the two AI agents (GPT-5 ρ = 0.66; Gemini 2.5 Pro ρ = 0.56). Overall, future research should prioritize standardized reporting, use of external test sets, and model performance assessment across subpopulations. Large language models (LLMs) offer a supportive role in providing guideline-driven appraisals to complement human judgment in evaluating AI-based prediction models.
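The abstract compares appraisal approaches via Spearman rank correlation against CHARMS-based scores. As an illustrative sketch only (not the authors' code, and with made-up score values), a Spearman ρ between two sets of per-study quality scores can be computed as the Pearson correlation of their average ranks:

```python
def average_ranks(values):
    """Assign 1-based average ranks to values, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend the tie group while adjacent sorted values are equal
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the tied positions, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical appraisal scores for five studies (illustrative only):
charms_scores = [12, 18, 9, 15, 20]        # CHARMS checklist scores
llm_scores = [3.5, 4.0, 2.0, 4.0, 5.0]     # LLM agent ratings (with a tie)

print(round(spearman_rho(charms_scores, llm_scores), 3))  # → 0.975
```

In practice this is what `scipy.stats.spearmanr` computes; the pure-Python version above just makes the rank-then-correlate logic of the reported ρ values explicit.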
1-2 Sentence Description

This cross-sectional meta-research study synthesizes recent studies that developed image-based artificial intelligence (AI) prediction models to detect lung cancer, predict prognosis, or estimate future risk, highlighting methodological trends and limitations in model testing and subgroup analyses, and calling for greater transparency, reliability, quality assessment, and adherence to established reporting guidelines. Quality appraisals by LLMs, human statisticians, and clinicians indicate that the LLM-based scores align more closely with the recommended guideline-based (CHARMS) scores than the human scores do.

Topics

oncology
