Vision transformer embeddings and quantum pyramidal circuits for biomedical image analysis.
Authors
Affiliations (5)
- BCN Medtech, Department of Engineering, Universitat Pompeu Fabra, Barcelona, Spain.
- Parc Tecnològic TecnoCampus Mataró-Maresme, Universitat Pompeu Fabra, Mataró, Spain.
- BCN Medtech, Department of Engineering, Universitat Pompeu Fabra, Barcelona, Spain.
- Quantic, Barcelona Supercomputing Center, Barcelona, Spain.
- ICREA, Barcelona, Spain.
Abstract
This work presents a novel hybrid quantum-classical pipeline for lung nodule classification in computed tomography (CT) scans, combining vision transformer (ViT) embeddings with quantum orthogonal pyramidal circuits (QOPCs). The approach was evaluated on 681 lung nodule CT scans across the axial, coronal, and sagittal planes. Two ViT configurations were tested: ViT<sub>1</sub> (1 head, 4 layers) and a Bayesian-optimized ViT<sub>2</sub> (4 heads, 8 layers). Features extracted from the ViT embedding layers were reduced via principal component analysis to 2-16 dimensions and classified using a QOPC built from reconfigurable beam splitter (RBS) gates. The proposed approach compressed the classifier by up to 1,470× (from 10,290 to 7 parameters) while preserving over 99% of baseline accuracy. It reached 83.7% accuracy (ViT<sub>2</sub>, k = 8, h = 8) with only 46 trainable parameters and a computational efficiency (CE) of up to 92.0. Training was accelerated by up to 28× (0.030 vs. 0.833 min) while maintaining robust diagnostic performance (F1: 0.77-0.82; receiver operating characteristic area under the curve (ROC-AUC): 0.87-0.90). Ablation studies showed that the quantum layer outperforms conventional multilayer perceptrons (MLPs) by +3.4% accuracy with 35% fewer parameters, and that late fusion of multi-view predictions further improved performance to 85.4% accuracy and 0.92 ROC-AUC. These results support hybrid ViT-QOPC architectures as a practical and resource-efficient framework for medical image analysis that substantially reduces computational cost without compromising clinical accuracy.
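To illustrate why a pyramidal circuit of RBS gates needs so few parameters, the following is a minimal classical sketch, not the authors' implementation. It assumes that each RBS gate, restricted to the unary (one-hot) basis, acts as a planar rotation on two adjacent coordinates, so that n(n-1)/2 gates compose into an n × n orthogonal matrix. The function names, the gate ordering, the sign convention, and the random input vector are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import numpy as np


def rbs_rotation(n, i, theta):
    """Planar rotation on coordinates (i, i+1) of an n-dimensional vector.

    Assumption: this models one RBS gate restricted to the unary basis,
    with an illustrative sign convention.
    """
    g = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    g[i, i] = c
    g[i, i + 1] = s
    g[i + 1, i] = -s
    g[i + 1, i + 1] = c
    return g


def pyramid_pairs(n):
    """Adjacent-pair indices in an illustrative pyramid-like ordering.

    Yields n(n-1)/2 entries; the exact layout in the paper may differ.
    """
    pairs = []
    for depth in range(n - 1):
        for i in range(depth, -1, -1):
            pairs.append(i)
    return pairs


def pyramidal_orthogonal(thetas, n):
    """Compose the rotations into a single n x n orthogonal matrix."""
    pairs = pyramid_pairs(n)
    if len(thetas) != len(pairs):
        raise ValueError("expected n(n-1)/2 angles")
    w = np.eye(n)
    for theta, i in zip(thetas, pairs):
        w = rbs_rotation(n, i, theta) @ w
    return w


# Hypothetical example: k = 8 PCA components, i.e. 8 coordinates.
n = 8
rng = np.random.default_rng(0)
thetas = rng.uniform(-np.pi, np.pi, size=n * (n - 1) // 2)
w = pyramidal_orthogonal(thetas, n)

# Stand-in for a PCA-reduced ViT embedding, normalized to unit length
# as unary amplitude encoding would require.
x = rng.normal(size=n)
x /= np.linalg.norm(x)
y = w @ x

# Compression ratio quoted in the abstract: 10,290 -> 7 parameters.
compression = 10290 / 7
```

In this sketch an 8-dimensional orthogonal layer uses 28 angles, whereas a dense 8 × 8 layer would use 64 weights. The layer also preserves the norm of its input, which is one reason orthogonal layers are attractive for stable training.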