Vision-Language Models for automated quality control: a benchmarking framework and comprehensive study.

May 26, 2026

Authors

Abdelkader M

Affiliations (1)

  • Robotics & Internet of Things Lab, Prince Sultan University, Riyadh, Saudi Arabia.

Abstract

Automated object detection systems require robust quality assessment mechanisms to maintain performance when deployed in environments that deviate from training distributions. While traditional monitoring relies on statistical drift detection, these approaches lack the semantic understanding necessary for triggering appropriate model adaptations. This paper presents the first comprehensive benchmarking of Vision-Language Models (VLMs) for semantic-level quality assessment of multi-domain object detection outputs. We systematically evaluate nine state-of-the-art VLMs across five diverse domains spanning medical imaging (Brain Tumor, HAM10000), aerial surveillance (VisDrone), industrial inspection (Carparts), and general detection (COCO) using ground-truth-annotated samples. Our rigorous statistical evaluation employs multi-class classification where VLMs assess the semantic correctness of detection outputs, with comprehensive analysis including accuracy metrics, coefficient of variation, and Kruskal-Wallis testing. Results reveal substantial performance heterogeneity across models and domains, with overall accuracy ranging from 8.5% to 82.8% (mean: 45.7%, SD: 18.5%). LLaVA-13B achieves the highest overall performance (48.6% accuracy, CV: 23.8%), while medical domains prove most challenging (HAM10000: 7.3% mean accuracy vs. VisDrone: 55.9%). Statistical analysis reveals significant inter-model differences within all domains (p<0.001, effect sizes [Formula: see text]=0.82-0.96), confirming meaningful performance distinctions despite substantial cross-domain variation. Based on deployment criticality requirements, we establish three operational tiers: production-assistants (medical ≥80%, industrial ≥70%, surveillance ≥60%), supervised deployment, and research-stage systems.
Our findings demonstrate that current VLMs are suitable for supervised rather than fully autonomous deployment, providing essential benchmarks and evidence-based guidelines for implementing VLM-based quality control in production computer vision systems. The benchmark code is open-source and available at https://github.com/mzahana/vlm-bench.
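
As a rough illustration of two quantities mentioned in the abstract, the sketch below computes a coefficient of variation (CV) over per-domain accuracies and checks an accuracy against the quoted production-tier thresholds. It is not taken from the benchmark repository: the function names, the use of the population standard deviation, and the example accuracies are assumptions made for illustration. The abstract does not state the boundaries of the supervised and research-stage tiers, so only the production threshold is checked.

```python
from statistics import mean, pstdev

# Production-tier accuracy thresholds (percent) as quoted in the abstract.
PRODUCTION_THRESHOLDS = {"medical": 80.0, "industrial": 70.0, "surveillance": 60.0}


def coefficient_of_variation(accuracies):
    """Return the CV in percent: standard deviation divided by the mean.

    Assumption: population standard deviation; the paper may use the
    sample standard deviation instead.
    """
    avg = mean(accuracies)
    if avg == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * pstdev(accuracies) / avg


def meets_production_threshold(category, accuracy):
    """True if accuracy (percent) reaches the quoted threshold for the category."""
    return accuracy >= PRODUCTION_THRESHOLDS[category]


# Hypothetical per-domain accuracies for one model (not results from the paper).
example_accuracies = [40.0, 50.0, 60.0]
cv = coefficient_of_variation(example_accuracies)
print(f"CV: {cv:.1f}%")
print("surveillance at 55.9%:", meets_production_threshold("surveillance", 55.9))
```

A lower CV indicates more consistent accuracy across domains, which is why it is reported alongside overall accuracy when comparing models.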

Topics

Journal Article
