Efficient SqueezeViT: A lightweight vision transformer framework for chest X-ray image classification.
Authors
Affiliations (3)
- Department of Instrumentation and Control Engineering, Netaji Subhas University of Technology, Sector-3, Dwarka, New Delhi, India.
- Department of Computer Science & Engineering, Shri Madhwa Vadiraja Institute of Technology and Management, Bantakal, Karnataka, India.
- Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, 576104, Karnataka, India. [email protected].
Abstract
This work introduces SqueezeViT (Squeeze Vision Transformers), a compact yet effective architecture based on Vision Transformers (ViT) designed for chest X-ray (CXR) image classification. In contrast to traditional ViT architectures, which are computationally demanding, SqueezeViT employs a novel squeezing procedure that lowers token dimensions without discarding important visual features, leading to faster inference and reduced memory consumption. The model is evaluated on two widely used public datasets, NIH Chest X-ray and CheXpert, which together cover a diverse range of thoracic pathologies. SqueezeViT reduces the number of parameters by 43.2% compared to the baseline MobileViT [1], and by up to 95.4% compared to other state-of-the-art (SOTA) models. The proposed model achieves up to a 16.5% improvement in the area under the receiver operating characteristic curve (AUROC) over SOTA models, and it generally outperforms both the baseline and effective convolutional neural networks (CNNs) [2] across numerous tasks. These improvements make SqueezeViT an attractive option for a wide variety of applications. The findings indicate that SqueezeViT outperforms current SOTA classifiers while maintaining a lightweight model architecture. These results highlight the potential of SqueezeViT in real clinical environments, where computational resources can be constrained.
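To make the token-squeezing idea concrete, the sketch below shows one possible way a ViT-style block could project tokens to a lower embedding dimension before attention and expand them back afterwards. This is a minimal illustration written only from the abstract's description; the module name `TokenSqueeze`, the squeeze ratio, the head count, and the overall layer structure are assumptions, not the authors' actual SqueezeViT implementation.

```python
# Hypothetical sketch of a token-dimension "squeeze" step in a ViT-style block.
# Details (names, ratios, layer order) are assumptions, not the published model.
import torch
import torch.nn as nn


class TokenSqueeze(nn.Module):
    """Projects tokens to a lower embedding dimension before attention
    and expands them back afterwards, reducing compute and memory."""

    def __init__(self, dim: int, squeeze_ratio: float = 0.5, num_heads: int = 4):
        super().__init__()
        squeezed_dim = max(num_heads, int(dim * squeeze_ratio))
        squeezed_dim -= squeezed_dim % num_heads  # keep divisible by num_heads
        self.norm = nn.LayerNorm(dim)
        self.squeeze = nn.Linear(dim, squeezed_dim)   # lower token dimension
        self.attn = nn.MultiheadAttention(squeezed_dim, num_heads, batch_first=True)
        self.expand = nn.Linear(squeezed_dim, dim)    # restore token dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim)
        z = self.squeeze(self.norm(x))
        z, _ = self.attn(z, z, z, need_weights=False)
        return x + self.expand(z)                     # residual connection


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 256)   # e.g. 14x14 patches, 256-dim embeddings
    block = TokenSqueeze(dim=256, squeeze_ratio=0.5)
    print(block(tokens).shape)          # torch.Size([2, 196, 256])
```

Because attention runs in the reduced dimension, its cost per token drops roughly in proportion to the squeeze ratio, which is consistent with the parameter and memory savings the abstract reports, though the actual mechanism in the paper may differ.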