Vision Transformers: A Review — Part II

  • A requirement of a large amount of data for pre-training.
  • High computational complexity, especially for dense prediction in high-resolution images.

1. A requirement of a large amount of data for pre-training

1.1 DeiT

Figure 1: Distillation process in DeiT (image from [3])

1.2 CaiT

Figure 2: Architecture comparison between ViT (left), a modified ViT in which the class token is inserted in a later stage (middle), and CaiT (right).

1.3 Tokens-to-Token ViT

  1. A sequence of tokens is passed into a self-attention module to model the relations between tokens. The output of this step is another sequence of the same length and token dimension as its input.
  2. The output sequence from the previous step is reshaped back into a 2D array of tokens.
  3. The 2D array of tokens is then divided into overlapping windows, and the neighboring tokens inside each window are concatenated into one longer token. The result is a shorter 1D sequence of higher-dimensional tokens.

Figure 3: The Tokens-to-Token process (image from [6])

Figure 4: The overview architecture of T2T-ViT (image from [6])
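
The three steps listed above can be sketched in plain Python. This is only an illustration under simplifying assumptions, not the T2T-ViT implementation: the self-attention uses identity query/key/value projections instead of learned ones, the windows are taken without padding, and the function names and the kernel/stride values are hypothetical choices made for this example.

```python
import math


def self_attention(tokens):
    """Step 1: single-head softmax self-attention.

    Assumption: identity Q/K/V projections (no learned weights).
    The output has the same length and token dimension as the input.
    """
    dim = len(tokens[0])
    output = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        output.append([sum(w * value[j] for w, value in zip(weights, tokens)) / total
                       for j in range(dim)])
    return output


def reshape_to_grid(tokens, height, width):
    """Step 2: reshape the 1D sequence back into a height x width grid."""
    assert len(tokens) == height * width
    return [tokens[r * width:(r + 1) * width] for r in range(height)]


def merge_windows(grid, kernel, stride):
    """Step 3: concatenate the tokens of each overlapping window.

    Windows overlap whenever stride < kernel. No padding is applied here.
    """
    height, width = len(grid), len(grid[0])
    merged = []
    for top in range(0, height - kernel + 1, stride):
        for left in range(0, width - kernel + 1, stride):
            long_token = []
            for r in range(top, top + kernel):
                for c in range(left, left + kernel):
                    long_token.extend(grid[r][c])
            merged.append(long_token)
    return merged


def tokens_to_token(tokens, height, width, kernel=3, stride=2):
    attended = self_attention(tokens)
    grid = reshape_to_grid(attended, height, width)
    return merge_windows(grid, kernel, stride)


# A 5 x 5 grid of 4-dimensional tokens, flattened into a sequence of 25.
tokens = [[(i + j) / 10 for j in range(4)] for i in range(25)]
out = tokens_to_token(tokens, height=5, width=5)
print(len(tokens), len(tokens[0]), "->", len(out), len(out[0]))
```

With a 3 x 3 window and a stride of 2, the 25 tokens of dimension 4 become 4 tokens of dimension 36: the sequence gets shorter while each token carries the information of its whole neighborhood.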

2. High computational complexity, especially for dense prediction in high-resolution images

2.1 Spatial-reduction attention (SRA)

Figure 5: Comparison between the regular attention (left) and SRA (right) (image from [5])
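
The comparison in Figure 5 can be illustrated with a small sketch. It assumes that SRA shrinks the keys and values spatially by a reduction ratio before attention is computed, while the queries keep their full length. In this sketch the learned projection over each block is replaced by a plain average, and normalization layers are left out; the function names and the ratio of 2 are hypothetical choices for the example.

```python
import math


def spatial_reduce(tokens, height, width, ratio):
    """Average every ratio x ratio block of the token grid into one token.

    Assumption: height and width are divisible by ratio. Averaging stands in
    for a learned projection over the concatenated block.
    """
    dim = len(tokens[0])
    reduced = []
    for top in range(0, height, ratio):
        for left in range(0, width, ratio):
            block = [tokens[r * width + c]
                     for r in range(top, top + ratio)
                     for c in range(left, left + ratio)]
            reduced.append([sum(t[j] for t in block) / len(block)
                            for j in range(dim)])
    return reduced


def attention(queries, keys, values):
    """Plain softmax attention; keys and values may be shorter than queries."""
    dim = len(queries[0])
    output = []
    for query in queries:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in keys]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        output.append([sum(w * v[j] for w, v in zip(weights, values)) / total
                       for j in range(len(values[0]))])
    return output


height, width, dim, ratio = 4, 4, 3, 2
tokens = [[float(i + j) for j in range(dim)] for i in range(height * width)]

reduced = spatial_reduce(tokens, height, width, ratio)
sra_output = attention(tokens, reduced, reduced)

print("scores in regular attention:", len(tokens) * len(tokens))
print("scores in SRA:", len(tokens) * len(reduced))
print("output shape:", len(sra_output), len(sra_output[0]))
```

The output keeps one token per query, so the resolution of the feature map is unchanged, while the number of query-key scores drops by a factor of ratio squared.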

2.2 FAVOR+

Figure 6: The computation order in FAVOR+ (right), compared with that in a regular attention module (left) (image from [7])
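
The change of computation order shown in Figure 6 can be checked numerically. The sketch below assumes that queries and keys are first mapped to positive random features, so that the attention matrix factorizes as Q'K'^T; the product with V can then be evaluated either as (Q'K'^T)V or as Q'(K'^T V). The random directions here are plain Gaussian samples (the orthogonalization used in FAVOR+ is omitted), and all function names and sizes are hypothetical choices for the example.

```python
import math
import random


def positive_features(x, omegas):
    """phi(x)_i = exp(w_i . x - |x|^2 / 2) / sqrt(m) for m random directions."""
    sq_norm = sum(v * v for v in x)
    m = len(omegas)
    return [math.exp(sum(w * v for w, v in zip(omega, x)) - sq_norm / 2) / math.sqrt(m)
            for omega in omegas]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def normalize(numerators, denominators):
    """Divide each output row by its normalizer (the row sum of the scores)."""
    return [[v / den[0] for v in num] for num, den in zip(numerators, denominators)]


def attention_quadratic(qf, kf, values):
    """(Q' K'^T) V: builds the full L x L score matrix first."""
    scores = matmul(qf, transpose(kf))                  # L x L
    ones = [[1.0] for _ in values]
    return normalize(matmul(scores, values), matmul(scores, ones))


def attention_linear(qf, kf, values):
    """Q' (K'^T V): only an m x d matrix is built, never an L x L one."""
    kv = matmul(transpose(kf), values)                  # m x d
    k_sum = matmul(transpose(kf), [[1.0] for _ in values])  # m x 1
    return normalize(matmul(qf, kv), matmul(qf, k_sum))


random.seed(0)
length, dim, num_features = 6, 4, 8
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(length)]
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(length)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(length)]
omegas = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_features)]

# Scale inputs by d^(-1/4) so that q . k carries the usual 1 / sqrt(d) factor.
scale = dim ** -0.25
qf = [positive_features([v * scale for v in q], omegas) for q in queries]
kf = [positive_features([v * scale for v in k], omegas) for k in keys]

out_quadratic = attention_quadratic(qf, kf, values)
out_linear = attention_linear(qf, kf, values)
max_diff = max(abs(a - b)
               for row_a, row_b in zip(out_quadratic, out_linear)
               for a, b in zip(row_a, row_b))
print("largest difference between the two orders:", max_diff)
```

Both orders give the same result up to floating-point rounding. The difference is in the cost: the first grows with the square of the sequence length L, while the second grows linearly in L for a fixed number of features m.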

3. Summary

Leading big data and AI-powered solution company https://www.sertiscorp.com/
