Vision Transformers: A Review — Part III

  • The need for a large amount of data for pre-training.
  • High computational complexity, especially for dense prediction in high-resolution images.
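To make the second point concrete, here is a rough back-of-the-envelope sketch of how the cost of global self-attention grows with image resolution. It assumes a plain ViT with non-overlapping 16x16 patches (the patch size and the helper name `attention_cost` are illustrative assumptions, not taken from any specific implementation) and counts only the token pairs in one attention map, ignoring the class token and everything else in the network.

```python
def attention_cost(height, width, patch=16):
    """Return (num_tokens, num_token_pairs) for global self-attention.

    Assumes non-overlapping square patches and no class token.
    """
    tokens = (height // patch) * (width // patch)
    # Every token attends to every token, so the attention map has tokens^2 entries.
    return tokens, tokens * tokens


if __name__ == "__main__":
    for side in (224, 448, 896):
        n, pairs = attention_cost(side, side)
        print(f"{side}x{side}: {n} tokens, {pairs} attention entries per head")
```

Under these assumptions, doubling the image side quadruples the number of tokens and multiplies the size of each attention map by 16, which is why dense prediction on high-resolution inputs becomes expensive so quickly.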

1. Inability to generate multi-scale feature maps

1.1 Pyramid Vision Transformer (PVT)

Figure 1: Overview of the PVT architecture (Image from [3])

1.2 Swin Transformer

Figure 2: Overview of the Swin Transformer architecture (Image from [5])

Figure 3: An illustration of how W-MSA and SW-MSA compute self-attention locally (left) and the architecture of the Swin Transformer block (right) (Images from [5])
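The local attention pattern in Figure 3 can be sketched in a few lines. The snippet below is a minimal illustration, not the Swin reference code: it only shows which tokens end up grouped together when a grid is split into regular windows (as in W-MSA) and when the grid is cyclically shifted before the same split (as in SW-MSA). The function names, the tiny 4x4 grid, and the window size of 2 are assumptions chosen for readability, and the attention masking that Swin applies to wrapped-around windows is left out.

```python
def partition_windows(grid, m):
    """Split a 2-D grid (list of rows) into non-overlapping m x m windows.

    Each window is returned as a flat list of the token ids it contains.
    """
    h, w = len(grid), len(grid[0])
    if h % m or w % m:
        raise ValueError("grid size must be divisible by the window size")
    windows = []
    for top in range(0, h, m):
        for left in range(0, w, m):
            windows.append(
                [grid[r][c] for r in range(top, top + m) for c in range(left, left + m)]
            )
    return windows


def cyclic_shift(grid, s):
    """Roll the grid up and to the left by s positions, wrapping around."""
    h, w = len(grid), len(grid[0])
    return [[grid[(r + s) % h][(c + s) % w] for c in range(w)] for r in range(h)]


# Token ids 0..15 laid out row by row on a 4x4 grid.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
window = 2

regular = partition_windows(grid, window)                              # W-MSA grouping
shifted = partition_windows(cyclic_shift(grid, window // 2), window)   # SW-MSA grouping

if __name__ == "__main__":
    print("regular windows:", regular)
    print("shifted windows:", shifted)
```

In this toy case the first shifted window groups tokens 5, 6, 9 and 10, which belong to four different regular windows. Alternating the two groupings between consecutive blocks is what lets information flow across window boundaries while attention inside each block stays local.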

1.3 Pooling-based Vision Transformer (PiT)

Figure 4: Comparison of the network architectures: ViT (left) and PiT (right). (Images from [4])

2. Summary





