Vision Transformer Autoencoders for Unsupervised Representation Learning: Revealing Novel Genetic Associations through Learned Sparse Attention Patterns
Authors
Affiliations (1)
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston
Abstract
The discovery of genetic loci associated with brain architecture can provide deeper insights into neuroscience and potentially lead to improved personalized medicine outcomes. Previously, we designed the Unsupervised Deep learning-derived Imaging Phenotypes (UDIPs) approach to extract phenotypes from brain imaging using a convolutional neural network (CNN)-based autoencoder, and conducted brain imaging GWAS on the UK Biobank (UKBB). In this work, we design a vision transformer (ViT)-based autoencoder, leveraging its distinct inductive bias and its ability to capture unique patterns through its pairwise attention mechanism. The encoder generates contextual embeddings for the input patches, from which we derive a 128-dimensional latent representation, interpreted as phenotypes, by applying average pooling. The GWAS on these 128 phenotypes discovered 10 loci not reported by the CNN-based UDIP model, 3 of which had no previously reported associations with brain structure in the GWAS Catalog. Our interpretation results suggest that these novel associations stem from the ViT's capability to learn sparse attention patterns, enabling it to capture non-local patterns such as left-right hemisphere symmetry within brain MRI data. Our results highlight the advantages of transformer-based architectures in feature extraction and representation learning for genetic discovery.
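To illustrate the latent-extraction step described above (contextual patch embeddings from a ViT encoder, average-pooled into a 128-dimensional phenotype vector), the following is a minimal PyTorch sketch. It is not the authors' implementation: the 2D input (the study uses 3D brain MRI), patch size, depth, head count, and class name are illustrative assumptions; only the 128-dimensional latent comes from the abstract.

```python
# Minimal sketch (assumed, not the authors' code): a ViT encoder whose
# patch-token embeddings are average-pooled into a 128-d latent phenotype.
import torch
import torch.nn as nn


class ViTEncoderWithPooledLatent(nn.Module):
    def __init__(self, img_size=128, patch_size=16, in_chans=1,
                 embed_dim=128, depth=6, num_heads=8):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patch embedding: split the image into non-overlapping patches
        # and linearly project each patch to the embedding dimension.
        self.patch_embed = nn.Conv2d(in_chans, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        # Transformer encoder: pairwise self-attention across patch tokens
        # produces contextual embeddings for every patch.
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        # x: (batch, channels, height, width)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, D)
        tokens = tokens + self.pos_embed
        tokens = self.encoder(tokens)   # contextual patch embeddings
        return tokens.mean(dim=1)       # average pooling -> 128-d latent


if __name__ == "__main__":
    model = ViTEncoderWithPooledLatent()
    latent = model(torch.randn(2, 1, 128, 128))
    print(latent.shape)  # torch.Size([2, 128])
```

In this sketch the embedding dimension is set equal to the latent dimension so that average pooling alone yields the 128-dimensional representation; in the full autoencoder, a decoder reconstructing the input from the patch embeddings would drive the training signal.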