[Paper Review] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT)

Introduction

Self-attention-based architectures, such as Transformers, were initially developed for natural language processing. In contrast, convolutional neural networks (CNNs) have long dominated computer vision tasks.

Earlier attempts to use self-attention in vision either combined it with convolutions or relied on specialized attention patterns that did not scale well in practice. The Vision Transformer (ViT) introduces a new approach: split an image into patches, linearly embed each patch, and feed the resulting sequence into a standard Transformer.

  • Patch \( \approx \) Token
  • Performs modestly on mid-sized datasets like ImageNet, trailing comparably sized ResNets
  • Lacks inductive biases present in CNNs:
    • Translation equivariance: outputs shift consistently with translated inputs
    • Locality: emphasizes spatially adjacent features

However, with sufficiently large datasets, ViTs achieve strong performance.

Methodology

Figure: ViT model overview (the image is split into patches, linearly projected, combined with positional embeddings, and fed to a Transformer encoder with an MLP classification head).

Vision Transformer (ViT)

  1. Patch Embedding:

    • Split the input image into non-overlapping fixed-size patches and flatten each one.
    • Linearly project each flattened patch into a vector.
    • \( \mathbf{x} \in \mathbb{R}^{H \times W \times C} \rightarrow \mathbf{x}_p \in \mathbb{R}^{N \times (P^2 \cdot C)} \), where \( N = HW / P^2 \) is the number of patches.
    • Each patch is projected to a constant latent dimension \( D \) (see the sketch after this list).
  2. Class Token:

    • A learnable classification token ([CLS]) is prepended to the input sequence.
    • Acts as an aggregate representation, similar to BERT.
  3. Positional Embedding:

    • 1D learnable positional embeddings are added to retain spatial order.
    • 2D positional embeddings offer limited additional benefit.
  4. Transformer Encoder:

    • Composed of alternating layers of multi-headed self-attention (MSA) and MLP blocks.
    • Includes residual connections and LayerNorm.
  5. Classification Head:

    • The final representation of the [CLS] token is passed through an MLP for classification.
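
To make steps 1–3 concrete, here is a minimal sketch of the patch embedding, [CLS] token, and 1D positional embeddings, assuming a PyTorch setting. The class name `PatchEmbedding` and the trick of using a \( P \times P \) strided convolution as the linear projection \( E \) are my own choices for brevity, not the paper's reference code.

```python
# Minimal sketch of ViT's input pipeline: patch embedding + [CLS] token +
# learnable 1D positional embeddings (illustrative, not the official code).
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2   # N = HW / P^2
        # A PxP convolution with stride P is equivalent to flattening each
        # patch and applying the linear projection E.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                     # x_class
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))  # E_pos

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)       # (B, N, D) sequence of patch tokens
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)         # prepend [CLS] -> (B, N+1, D)
        return x + self.pos_embed              # Eq. (1): z_0
```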

Architecture

$$ z_0 = [x_{\text{class}}; x_p^1 E; x_p^2 E; \cdots; x_p^N E] + E_{\text{pos}} \tag{1} $$

$$ E \in \mathbb{R}^{(P^2 \cdot C) \times D}, \quad E_{\text{pos}} \in \mathbb{R}^{(N+1) \times D} $$

$$ z_{\ell}' = \mathrm{MSA}(\mathrm{LN}(z_{\ell - 1})) + z_{\ell - 1}, \quad \ell = 1, \ldots, L \tag{2} $$

$$ z_{\ell} = \mathrm{MLP}(\mathrm{LN}(z_{\ell}')) + z_{\ell}', \quad \ell = 1, \ldots, L \tag{3} $$

$$ y = \mathrm{LN}(z_L^0) \tag{4} $$
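
A minimal sketch of one encoder layer implementing Eq. (2)–(3), again assuming PyTorch; `EncoderBlock` is an illustrative name and `nn.MultiheadAttention` stands in for MSA. Per Eq. (4), after \( L \) such layers the first token \( z_L^0 \) is layer-normalized and handed to the classification head.

```python
# One pre-norm Transformer encoder layer: LayerNorm -> MSA -> residual,
# then LayerNorm -> MLP -> residual (a sketch of Eq. (2)-(3)).
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, z):                      # z: (B, N+1, D)
        # Eq. (2): z'_l = MSA(LN(z_{l-1})) + z_{l-1}
        h = self.ln1(z)
        z = self.msa(h, h, h, need_weights=False)[0] + z
        # Eq. (3): z_l = MLP(LN(z'_l)) + z'_l
        return self.mlp(self.ln2(z)) + z
```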

Inductive Biases in ViT

Though ViT lacks the explicit inductive biases built into CNNs, the architecture still learns both local and global features:

  • MSA layers attend globally and capture long-range dependencies.
  • MLP layers act locally and are translationally equivariant.

Hybrid ViT-CNN Architecture

Instead of raw image patches, ViT can take the flattened spatial grid of a CNN feature map as its input sequence, combining the benefits of both architectures, as sketched below.
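
A rough illustration of the hybrid input pipeline, assuming PyTorch and a recent torchvision; the choice of ResNet stage and the 1×1 projection to width \( D \) are illustrative assumptions, not the paper's exact configuration.

```python
# Hybrid variant (sketch): tokens come from the spatial grid of a CNN feature
# map rather than raw patches; a 1x1 conv projects channels to the width D.
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = nn.Sequential(*list(resnet50(weights=None).children())[:-3])  # keep through stage 3
proj = nn.Conv2d(1024, 768, kernel_size=1)       # project feature channels to D

x = torch.randn(1, 3, 224, 224)
feat = proj(backbone(x))                         # (1, 768, 14, 14)
tokens = feat.flatten(2).transpose(1, 2)         # (1, 196, 768) sequence for the encoder
```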

Fine-Tuning on Higher Resolution

ViTs pretrained on lower-resolution images can be fine-tuned on higher-resolution inputs:

  • The patch size stays fixed, so higher resolution yields a longer patch sequence.
  • The learned positional embeddings are 2D-interpolated to match the new grid of patches (see the sketch below).
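
A hedged sketch of that interpolation step, assuming PyTorch. The helper name `resize_pos_embed` is mine, and bicubic resampling is one reasonable choice; the paper only specifies 2D interpolation.

```python
# Resize learned positional embeddings for a larger input resolution:
# keep the [CLS] embedding and 2D-interpolate the patch embeddings (sketch).
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    # pos_embed: (1, 1 + old_grid**2, D), with the [CLS] embedding first
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pos.shape[-1]
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)

# e.g. 224px/16 -> 384px/16: the 14x14 grid becomes 24x24
new_pos = resize_pos_embed(torch.zeros(1, 1 + 14 * 14, 768), 14, 24)
```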

Experimental Results

Experimental Setup

  • Pretraining Datasets:
    • ImageNet (1.3M images)
    • ImageNet-21k (14M images)
    • JFT-300M (303M images)
  • Evaluation Benchmarks:
    • ImageNet, CIFAR-10/100, VTAB (19 tasks), Oxford Pets, Flowers-102
  • Model Variants:
    • ViT-Base, ViT-Large, ViT-Huge
  • Baselines:
    • Compared against ResNet (BiT) and Noisy Student (EfficientNet)

Key Findings


  1. Performance:

    • ViTs match or surpass state-of-the-art CNNs when pretrained on large datasets.
    • ViT-H/14 achieves:
      • 88.55% on ImageNet
      • 94.55% on CIFAR-100
      • 77.63% on VTAB
  2. Compute Efficiency:

    • ViT-H/14 outperforms ResNet-based models while using significantly fewer compute resources (2.5k vs. 9.9k TPUv3-core-days).
  3. Data Requirements:

    • ViTs underperform CNNs on small datasets.
    • Performance improves markedly with larger-scale data.
  4. Scalability:

    • Increasing model depth or width, or decreasing the patch size, improves performance.
    • No clear signs of saturation even for very large models.
  5. Few-Shot Learning:

    • ViTs show strong performance on few-shot classification, benefiting from large-scale pretraining.
  6. Self-Supervised Learning (Preliminary Results):

    • Masked patch prediction improves ImageNet accuracy by ~2% over training from scratch.
    • Still lags behind supervised pretraining by roughly 4%.
  7. Attention Visualization:

    • ViTs exhibit both local and global attention patterns.
    • Positional embeddings capture spatial topology despite lacking explicit structure.

Conclusion

Vision Transformers, though originally lacking image-specific inductive biases, achieve strong and often superior performance compared to CNNs when trained on large datasets. They scale effectively, are compute-efficient, and perform well in both few-shot and self-supervised settings. Their ability to model both local and global dependencies makes them a promising architecture for future vision tasks.