Vision Transformers: A Review — Part II

This series aims to explain the mechanism of Vision Transformers (ViT) [2], a pure Transformer model used as a visual backbone in computer vision tasks, to point out its limitations, and to summarize its recent improvements.

In Part I, we briefly introduced the concept of Transformers [1] and explained the mechanism of ViT and how it uses the attention module to achieve state-of-the-art performance on computer vision problems.

Although ViT has gained a lot of attention from researchers in the field, many studies have pointed out its weaknesses and proposed several techniques to improve ViT. This post, which is the second part of a three-part series, aims to describe the following key problems in the original ViT:

  • A requirement of a large amount of data for pre-training.
  • High computational complexity, especially for dense prediction in high-resolution images.

Additionally, this post introduces recently published papers that aim to cope with the above problems.

1. A requirement of a large amount of data for pre-training

Several approaches have been proposed to handle this problem. For example, [3] adopted a knowledge-distillation technique in the training process with only minimal modification of the ViT architecture; [6] proposed a more effective tokenization process for representing an input image; and [4] explored modifications to the architecture of ViT itself. The details of these approaches are explained in the following subsections.

1.1 DeiT

Figure 1: Distillation process in DeiT (image from [3])
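As Fig. 1 shows, DeiT [3] appends a learnable distillation token to the patch-token sequence: during training, the class token is supervised by the ground-truth label, while the distillation token is supervised by a teacher network (typically a convolutional model). In the hard-distillation variant, the teacher signal is simply its predicted class. A minimal plain-Python sketch of this loss (the function names are ours, not from [3]):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-likelihood of the correct class."""
    return -math.log(softmax(logits)[label])

def deit_hard_distillation_loss(cls_logits, dist_logits, true_label, teacher_logits):
    """Hard-label distillation as in DeiT: the class-token output is
    trained on the true label, and the distillation-token output is
    trained on the teacher's hard prediction (its argmax), with equal
    weight on the two terms."""
    teacher_label = max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])
    return (0.5 * cross_entropy(cls_logits, true_label)
            + 0.5 * cross_entropy(dist_logits, teacher_label))
```

At inference time, DeiT fuses the predictions of the two tokens; the sketch above concerns only the training loss.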

1.2 CaiT

Figure 2: Architecture comparison between ViT (left), a modified ViT in which the class token is inserted in a later stage (middle), and CaiT (right).
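In CaiT [4] (Fig. 2, right), the patch tokens are first processed by self-attention layers without any class token; the class token is inserted only near the end, where dedicated class-attention layers let it gather information from the patch tokens. Because only the class token forms a query, each such layer costs linear rather than quadratic time in the number of patches. A single-head sketch without the learned projections of the real model (an intentional simplification):

```python
import math

def class_attention(cls_token, patch_tokens):
    """One single-head class-attention step in the spirit of CaiT: only
    the class token forms a query, while keys and values come from the
    class token plus the patch tokens, so the cost is linear in the
    number of patches instead of quadratic."""
    d = len(cls_token)
    kv = [cls_token] + patch_tokens
    scores = [sum(q * k for q, k in zip(cls_token, t)) / math.sqrt(d) for t in kv]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # the updated class token is an attention-weighted average of the values
    return [sum(w * t[j] for w, t in zip(weights, kv)) for j in range(d)]
```

The real model uses multi-head attention with learned query/key/value projections and residual connections; this sketch keeps only the attention pattern that makes the class token the sole query.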

1.3 Tokens-to-Token ViT

To cope with this problem, the authors of [6] proposed a tokenization method, named the Tokens-to-Token (T2T) module, that iteratively aggregates neighboring tokens into one token through a process called the T2T process, as shown in Fig. 3. The T2T process works as follows:

  1. A sequence of tokens is passed through a self-attention module to model the relationships among tokens. The output of this step is another sequence of the same length as its input.
  2. The output sequence from the previous step is reshaped back into a 2D array of tokens.
  3. The 2D array of tokens is then divided into overlapping windows, and the neighboring tokens in each window are concatenated into one longer token. The result of this process is a shorter 1D sequence of higher-dimensional tokens.

The T2T process can be iterated to further improve the representation of the input image. In [6], it is performed twice in the T2T module.
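Steps 2 and 3 above can be sketched in plain Python; the self-attention pass of step 1 is omitted, and the window size and stride are illustrative rather than the exact soft-split settings (with padding) used in [6]:

```python
def t2t_soft_split(tokens_2d, window=3, stride=2):
    """Steps 2-3 of the T2T process: slide an overlapping window over a
    2D grid of tokens and concatenate the tokens inside each window into
    one longer token.  The result is a shorter sequence of
    higher-dimensional tokens."""
    h, w = len(tokens_2d), len(tokens_2d[0])
    out = []
    for i in range(0, h - window + 1, stride):
        for j in range(0, w - window + 1, stride):
            merged = []
            for di in range(window):
                for dj in range(window):
                    merged.extend(tokens_2d[i + di][j + dj])
            out.append(merged)
    return out
```

For a 7x7 grid of 4-dimensional tokens with a 3x3 window and stride 2, this yields 9 tokens of dimension 36: the sequence gets shorter while each token gets richer.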

Figure 3: The Tokens-to-token process (image from [6])

Apart from using the proposed T2T module to improve the representation of an input image, the authors also explored various architecture designs used in CNNs and applied them to the Transformer backbone. They found that a deep-narrow structure, which uses more Transformer layers (deeper) to improve feature richness while reducing the embedding dimension (narrower) to keep the computational cost in check, gave the best results among the compared designs. As shown in Fig. 4, the sequence of tokens generated by the T2T module is prepended with a classification token, as in the original ViT, and is then fed into this deep-narrow Transformer, named the T2T-ViT backbone, to make a prediction.

It is shown in [6] that when trained from scratch, T2T-ViT outperforms the original ViT on the ImageNet1k dataset while reducing the model size and the computation cost by half.

Figure 4: The overview architecture of T2T-ViT (image from [6])

2. High computational complexity, especially for dense prediction in high-resolution images

Several approaches to this problem aim mainly at improving the efficiency of the attention module. The following subsections describe two such approaches applied to ViT: spatial-reduction attention from PVT [5] and FAVOR+ from the Performer [7].

2.1 Spatial-reduction attention (SRA)

Figure 5: Comparison between the regular attention (left) and SRA (right) (Image from [5])
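As Fig. 5 illustrates, SRA [5] spatially reduces the key and value sequences by a reduction ratio R (neighboring tokens are merged and projected) before attention is computed, while the queries keep their full resolution. The quadratic attention term therefore shrinks by a factor of R². A rough multiply-accumulate count, ignoring the linear projections, illustrates the saving (the numbers below are illustrative):

```python
def attention_cost(n, d, reduction=1):
    """Approximate multiply-accumulate count of one attention layer over
    n tokens of dimension d: n * n_kv * d for Q @ K^T and the same again
    for the attention-weighted sum over V.  With spatial reduction, the
    key/value sequence is first shortened from n to n / reduction**2."""
    n_kv = n // (reduction ** 2)
    return 2 * n * n_kv * d

n, d = 56 * 56, 64                         # a high-resolution stage: 3136 tokens
regular = attention_cost(n, d)             # full self-attention
sra = attention_cost(n, d, reduction=8)    # spatial-reduction attention, R = 8
# the attention term becomes reduction**2 = 64x cheaper
```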

2.2 FAVOR+

The Performer architecture, which uses FAVOR+ internally, was also explored in T2T-ViT [6] and was found to be competitive in performance with the original Transformer while reducing the computation cost.

Figure 6: The computation order in FAVOR+ (right), compared with that in a regular attention module (left) (Image from [7])
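The reordering shown in Fig. 6 is plain matrix associativity: once the softmax is replaced by a kernel feature map φ, so that attention becomes approximately φ(Q)(φ(K)ᵀV), evaluating φ(K)ᵀV first avoids ever building the n × n attention matrix. A toy demonstration with the feature maps already applied (Qp and Kp are assumed inputs here; the feature map itself is not shown):

```python
def matmul(A, B):
    """Naive matrix product of two lists-of-lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Qp and Kp stand in for the random-feature maps phi(Q) and phi(K) of
# FAVOR+ (an assumption made for brevity).
Qp = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # n x m
Kp = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # n x m
V  = [[2.0], [4.0], [6.0]]                  # n x d

quadratic = matmul(matmul(Qp, transpose(Kp)), V)   # (Q'K'^T)V: builds an n x n matrix
linear    = matmul(Qp, matmul(transpose(Kp), V))   # Q'(K'^T V): never forms n x n
# both orders give exactly the same matrix; only the cost differs
```

For sequence length n and feature dimension m, the left order costs O(n²) while the right order costs O(n), which is what makes FAVOR+ attractive for long token sequences.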

3. Summary

In this post, we described two key problems of the original ViT: the requirement of a large amount of pre-training data and the high computational complexity of the attention module, especially for dense prediction in high-resolution images. We then reviewed recent approaches that address them: DeiT [3], CaiT [4], and T2T-ViT [6] improve data efficiency through distillation, architectural changes, and better tokenization, while SRA [5] and FAVOR+ [7] reduce the cost of the attention module itself.


[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, et al., “Attention is all you need,” arXiv preprint arXiv:1706.03762, 2017. (Transformer)

[2] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020. (ViT)

[3] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image Transformers & distillation through attention,” arXiv preprint arXiv:2012.12877, 2020. (DeiT)

[4] H. Touvron, M. Cord, A. Sablayrolles, G. Synnaeve, and H. Jégou, “Going deeper with image Transformers,” arXiv preprint arXiv:2103.17239, 2021. (CaiT)

[5] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, et al., “Pyramid Vision Transformer: A versatile backbone for dense prediction without convolutions,” arXiv preprint arXiv:2102.12122, 2021. (PVT)

[6] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z. Jiang, et al., “Tokens-to-token ViT: Training Vision Transformers from scratch on ImageNet,” arXiv preprint arXiv:2101.11986, 2021. (T2T-ViT)

[7] K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, et al., “Rethinking attention with Performers,” arXiv preprint arXiv:2009.14794, 2020. (Performer and FAVOR+)

Written by Ukrit Watchareeruethai, Senior AI Researcher, and Sertis Computer Vision team