Vision-Language Models for automated quality control: a benchmarking framework and comprehensive study.

May 26, 2026


DOI: 10.1038/s41598-026-55179-4 PMID: 42191872

Authors

Abdelkader M

Affiliations (1)

Robotics & Internet of Things Lab, Prince Sultan University, Riyadh, Saudi Arabia.

Abstract

Automated object detection systems require robust quality assessment mechanisms to maintain performance when deployed in environments that deviate from training distributions. While traditional monitoring relies on statistical drift detection, these approaches lack the semantic understanding necessary for triggering appropriate model adaptations. This paper presents the first comprehensive benchmarking of Vision-Language Models (VLMs) for semantic-level quality assessment of multi-domain object detection outputs. We systematically evaluate nine state-of-the-art VLMs across five diverse domains spanning medical imaging (Brain Tumor, HAM10000), aerial surveillance (VisDrone), industrial inspection (Carparts), and general detection (COCO) using ground-truth-annotated samples. Our rigorous statistical evaluation employs multi-class classification where VLMs assess the semantic correctness of detection outputs, with comprehensive analysis including accuracy metrics, coefficient of variation (CV), and Kruskal-Wallis testing. Results reveal substantial performance heterogeneity across models and domains, with overall accuracy ranging from 8.5% to 82.8% (mean: 45.7%, SD: 18.5%). LLaVA-13B achieves the highest overall performance (48.6% accuracy, CV: 23.8%), while medical domains prove most challenging (HAM10000: 7.3% mean accuracy vs. VisDrone: 55.9%). Statistical analysis reveals significant inter-model differences within all domains (p < 0.001, effect sizes of 0.82-0.96), confirming meaningful performance distinctions despite substantial cross-domain variation. Based on deployment criticality requirements, we establish three operational tiers: production-assistants (medical ≥80%, industrial ≥70%, surveillance ≥60%), supervised deployment, and research-stage systems.
Our findings demonstrate that current VLMs are suitable for supervised rather than fully autonomous deployment, providing essential benchmarks and evidence-based guidelines for implementing VLM-based quality control in production computer vision systems. The benchmark code is open-source and available at https://github.com/mzahana/vlm-bench.
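The statistics named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not taken from the paper or from the vlm-bench repository: the function names, the use of the sample standard deviation for the coefficient of variation, and the omission of a tie correction in the Kruskal-Wallis statistic are all assumptions made for illustration. Only the three production-assistant thresholds come from the abstract.

```python
from statistics import mean, stdev

# Production-assistant accuracy thresholds quoted in the abstract (percent).
PRODUCTION_THRESHOLDS = {"medical": 80.0, "industrial": 70.0, "surveillance": 60.0}


def coefficient_of_variation(values):
    """Return the sample standard deviation as a percentage of the mean.

    Assumption: the paper may use the population SD instead.
    """
    return 100.0 * stdev(values) / mean(values)


def kruskal_wallis_h(groups):
    """Return the Kruskal-Wallis H statistic for a list of samples.

    Tied values receive the average of the ranks they occupy.
    Assumption: no tie correction is applied to H.
    """
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    # Map each distinct value to the average of its 1-based ranks.
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2.0
        i = j
    # Sum over groups of (rank sum squared / group size).
    rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    return 12.0 / (n * (n + 1)) * rank_term - 3.0 * (n + 1)


def meets_production_threshold(domain, accuracy_pct):
    """Check an accuracy (percent) against the production-assistant threshold."""
    return accuracy_pct >= PRODUCTION_THRESHOLDS[domain]
```

With made-up inputs, `coefficient_of_variation([40.0, 50.0, 60.0])` measures how much one hypothetical model's accuracy varies across domains, `kruskal_wallis_h` compares per-sample scores across several models within one domain, and `meets_production_threshold("surveillance", 62.0)` checks a hypothetical accuracy against the surveillance threshold. The abstract does not give the cut-offs separating supervised deployment from research-stage systems, so the sketch does not model those tiers.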


Topics

Journal Article
