Introduction#
Self-attention-based architectures, such as Transformers, were initially developed for natural language processing. In contrast, convolutional neural networks (CNNs) have long dominated computer vision tasks.
Earlier attempts to apply self-attention to images, whether combined with convolutions or as a replacement for them, did not scale effectively. The Vision Transformer (ViT) takes a different approach: it splits an image into fixed-size patches, linearly embeds each patch, and feeds the resulting sequence into a standard Transformer encoder.
- Patch \( \approx \) Token
- Underperforms comparably sized CNNs when trained on mid-sized datasets such as ImageNet
- Lacks inductive biases present in CNNs:
  - Translation equivariance: outputs shift consistently with translated inputs
  - Locality: emphasizes spatially adjacent features
However, with sufficiently large datasets, ViTs achieve strong performance.
Methodology#

Vision Transformer (ViT)#
Patch Embedding:
- Split the input image into non-overlapping fixed-size \( P \times P \) patches and flatten each patch.
- Linearly project each flattened patch into a vector.
- \( \mathbf{x} \in \mathbb{R}^{H \times W \times C} \rightarrow \mathbf{x}_p \in \mathbb{R}^{N \times (P^2 \cdot C)} \), where \( N = HW / P^2 \) is the number of patches.
- Each patch is projected to a constant latent dimension \( D \).
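A minimal PyTorch sketch of this step is shown below (not the authors' implementation; module names and sizes are illustrative, using the ViT-Base defaults of a 224×224 input, \( P = 16 \), \( D = 768 \)). It uses the common trick that flattening each patch and applying a shared linear projection is equivalent to a convolution whose kernel size and stride both equal \( P \).

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into P x P patches and project each to dimension D."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # N = HW / P^2
        # A Conv2d with kernel_size = stride = P is equivalent to flattening
        # each non-overlapping patch and applying a shared linear map E.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)     # (B, N, D)
        return x

# Example: two 224x224 RGB images -> 196 patch tokens of dimension 768
tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```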
Class Token:
- A learnable classification token ([CLS]) is prepended to the input sequence.
- Its final hidden state serves as the aggregate image representation, analogous to BERT's [class] token.
Positional Embedding:
- 1D learnable positional embeddings are added to retain spatial order.
- 2D-aware positional embeddings offer no significant additional benefit.
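A sketch of these two steps, assuming patch tokens of shape \( (B, N, D) \) from the previous step (module and parameter names are illustrative, not from the paper's codebase):

```python
import torch
import torch.nn as nn

class TokenPreparation(nn.Module):
    """Prepend a learnable [CLS] token and add 1D positional embeddings."""
    def __init__(self, num_patches=196, embed_dim=768):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # One learnable position vector per token, including the [CLS] slot.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, patch_tokens):                   # (B, N, D)
        B = patch_tokens.shape[0]
        cls = self.cls_token.expand(B, -1, -1)         # (B, 1, D)
        z0 = torch.cat([cls, patch_tokens], dim=1)     # (B, N+1, D)
        return z0 + self.pos_embed                     # Eq. (1) below
```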
Transformer Encoder:
- Composed of alternating layers of multi-headed self-attention (MSA) and MLP blocks.
- LayerNorm (LN) is applied before every block, with residual connections after every block.
Classification Head:
- The final representation of the [CLS] token is passed through an MLP for classification.
Architecture#
$$ z_0 = [x_{\text{class}}; x_p^1 E; x_p^2 E; \cdots; x_p^N E] + E_{\text{pos}} \tag{1} $$
$$ E \in \mathbb{R}^{(P^2 \cdot C) \times D}, \quad E_{\text{pos}} \in \mathbb{R}^{(N+1) \times D} $$
$$ z_{\ell}' = \mathrm{MSA}(\mathrm{LN}(z_{\ell - 1})) + z_{\ell - 1}, \quad \ell = 1, \ldots, L \tag{2} $$
$$ z_{\ell} = \mathrm{MLP}(\mathrm{LN}(z_{\ell}')) + z_{\ell}', \quad \ell = 1, \ldots, L \tag{3} $$
$$ y = \mathrm{LN}(z_L^0) \tag{4} $$
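Equations (2)–(4) translate fairly directly into code. The PyTorch sketch below uses the built-in `nn.MultiheadAttention` for MSA and a single linear classification head for brevity (the paper uses an MLP head with one hidden layer during pre-training); all module names and sizes are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One ViT encoder layer: pre-norm MSA and MLP with residual connections."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, z):
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # Eq. (2)
        z = z + self.mlp(self.norm2(z))                     # Eq. (3)
        return z

class ViTClassifier(nn.Module):
    """Stack of L encoder blocks, then LayerNorm on the [CLS] representation."""
    def __init__(self, dim=768, depth=12, num_classes=1000):
        super().__init__()
        self.blocks = nn.ModuleList([EncoderBlock(dim) for _ in range(depth)])
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, z0):              # z0: (B, N+1, D) from Eq. (1)
        for blk in self.blocks:
            z0 = blk(z0)
        y = self.norm(z0[:, 0])         # Eq. (4): LN of the [CLS] token
        return self.head(y)
```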
Inductive Biases in ViT#
Though ViT lacks the explicit image-specific inductive biases of CNNs, it learns both local and global structure from data:
- MSA layers are global and capture long-range dependencies, even in the lower layers.
- Only the MLP layers are local and translationally equivariant.
Hybrid ViT-CNN Architecture#
Instead of raw image patches, ViT can ingest patches of a CNN feature map as its input sequence (in the extreme case, 1×1 patches, i.e., one token per spatial position of the feature map), combining convolutional feature extraction with Transformer processing.
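A hedged sketch of this hybrid variant, using a torchvision ResNet-50 trunk as a stand-in backbone (the backbone choice and the `weights=None` argument are illustrative and depend on the torchvision version; the paper's hybrid uses a modified ResNet):

```python
import torch
import torch.nn as nn
import torchvision

class HybridEmbedding(nn.Module):
    """Use a CNN feature map instead of raw patches; each spatial position
    of the feature map becomes one input token (1x1 'patches')."""
    def __init__(self, embed_dim=768):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        # Drop the average-pool and classification layers, keep the conv trunk.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.proj = nn.Conv2d(2048, embed_dim, kernel_size=1)  # 1x1 patches

    def forward(self, x):                    # x: (B, 3, H, W)
        f = self.backbone(x)                 # (B, 2048, H/32, W/32)
        f = self.proj(f)                     # (B, D, H/32, W/32)
        return f.flatten(2).transpose(1, 2)  # (B, N, D) token sequence

# A 224x224 input yields a 7x7 feature map, i.e. 49 tokens.
print(HybridEmbedding()(torch.randn(1, 3, 224, 224)).shape)  # (1, 49, 768)
```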
Fine-Tuning on Higher Resolution#
ViTs pretrained on lower-resolution images can be fine-tuned on higher-resolution inputs (see the sketch after this list):
- Patch size remains fixed, increasing the sequence length.
- 2D interpolation aligns the learned positional embeddings with the new input resolution.
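One way this interpolation might look in PyTorch, assuming a square patch grid and a positional-embedding tensor with the [CLS] slot first (the function name, layout, and choice of bicubic interpolation are illustrative assumptions, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, new_grid_size):
    """Interpolate learned positional embeddings to a new patch grid.

    pos_embed: (1, 1 + N_old, D) with the [CLS] position first.
    new_grid_size: (H_new/P, W_new/P) patch grid after raising the resolution.
    """
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    old_side = int(patch_pos.shape[1] ** 0.5)   # assume a square patch grid
    d = patch_pos.shape[-1]
    # (1, N_old, D) -> (1, D, old_side, old_side) for spatial interpolation
    patch_pos = patch_pos.reshape(1, old_side, old_side, d).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=new_grid_size,
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, -1, d)
    return torch.cat([cls_pos, patch_pos], dim=1)   # (1, 1 + N_new, D)

# Example: 14x14 grid (224 px, P=16) resized to 24x24 (384 px, P=16)
new_pe = resize_pos_embed(torch.randn(1, 1 + 14 * 14, 768), (24, 24))
print(new_pe.shape)  # torch.Size([1, 577, 768])
```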
Experimental Results#
Experimental Setup#
- Pretraining Datasets:
  - ImageNet (1.3M images)
  - ImageNet-21k (14M images)
  - JFT-300M (303M images)
- Evaluation Benchmarks:
  - ImageNet, CIFAR-10/100, VTAB (19 tasks), Oxford Pets, Flowers-102
- Model Variants:
  - ViT-Base, ViT-Large, ViT-Huge
- Baselines:
  - Compared against ResNet (BiT) and Noisy Student (EfficientNet)
Key Findings#

Performance:
- ViTs match or surpass state-of-the-art CNNs when pretrained on large datasets.
- ViT-H/14 achieves:
  - 88.55% on ImageNet
  - 94.55% on CIFAR-100
  - 77.63% on VTAB
Compute Efficiency:
- ViT-H/14 outperforms ResNet-based models while using significantly fewer compute resources (2.5k vs. 9.9k TPUv3-core-days).
Data Requirements:
- ViTs underperform CNNs on small datasets.
- Performance improves markedly with larger-scale data.
Scalability:
- Increasing model depth and width, or decreasing patch size, improves performance.
- No clear signs of saturation even for very large models.
Few-Shot Learning:
- ViTs show strong performance on few-shot classification, benefiting from large-scale pretraining.
Self-Supervised Learning (Preliminary Results):
- Pretraining with masked patch prediction improves ImageNet accuracy by ~2% over training from scratch.
- Still lags behind supervised pretraining in overall performance.
Attention Visualization:
- ViTs exhibit both local and global attention patterns.
- Learned positional embeddings encode the image's 2D topology (nearby patches have similar embeddings) even though no explicit 2D structure is provided.
Conclusion#
Vision Transformers, though originally lacking image-specific inductive biases, achieve strong and often superior performance compared to CNNs when trained on large datasets. They scale effectively, are compute-efficient, and perform well in both few-shot and self-supervised settings. Their ability to model both local and global dependencies makes them a promising architecture for future vision tasks.
