Vision Transformers: A Review — Part III

The original ViT has two main drawbacks:

  • It requires a large amount of data for pre-training.
  • Its computational complexity is high, especially for dense prediction on high-resolution images.
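The second drawback follows from simple token arithmetic: global self-attention compares every patch token with every other token, so cost grows quadratically with the number of tokens. A small sketch (the patch size of 16 and the two image resolutions are illustrative assumptions, not figures from any of the papers):

```python
# Token-count arithmetic: why global self-attention is costly at high resolution.
# Assumptions (illustrative): patch size 16; channel dimension ignored.

def num_tokens(h, w, patch=16):
    """Number of patch tokens for an h x w image."""
    return (h // patch) * (w // patch)

def attention_pairs(n):
    """Global self-attention compares every token with every token: O(n^2)."""
    return n * n

# Typical classification input vs. a dense-prediction input
n_small = num_tokens(224, 224)    # 14 * 14 = 196 tokens
n_large = num_tokens(1024, 1024)  # 64 * 64 = 4096 tokens

print(attention_pairs(n_small))   # 38,416 token pairs
print(attention_pairs(n_large))   # 16,777,216 token pairs, ~437x more
```

A ~21x increase in tokens thus yields a ~437x increase in attention cost, which is what motivates the locality and reduction tricks in the architectures below.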

1. Inability to generate multi-scale feature maps

1.1 Pyramid Vision Transformer (PVT)

Figure 1: The overall architecture of PVT (Image from [3])
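PVT's key efficiency device is spatial-reduction attention (SRA): queries keep the full token length, but keys and values attend over a spatially downsampled grid, shrinking the attention matrix by the square of the reduction ratio. A minimal single-head NumPy sketch, assuming average pooling as a stand-in for PVT's strided-convolution reduction:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_reduce(x, h, w, r):
    """Downsample the token grid by factor r.
    (Average pooling here; PVT uses a strided convolution.)"""
    c = x.shape[-1]
    g = x.reshape(h, w, c)
    g = g.reshape(h // r, r, w // r, r, c).mean(axis=(1, 3))
    return g.reshape(-1, c)

def sra(x, h, w, r):
    """Single-head spatial-reduction attention: queries keep full length,
    keys/values come from the reduced grid, so cost drops by a factor r^2."""
    q = x
    kv = spatial_reduce(x, h, w, r)                       # (h*w/r^2, c)
    attn = softmax(q @ kv.T / np.sqrt(x.shape[-1]))       # (h*w, h*w/r^2)
    return attn @ kv                                      # (h*w, c)

h = w = 8; c = 32
x = np.random.randn(h * w, c)
out = sra(x, h, w, r=2)
print(out.shape)  # (64, 32): same token count out, but a 64x16 attention matrix instead of 64x64
```

This sketch omits the learned Q/K/V projections and multi-head split for brevity; the point is only the shape of the attention matrix.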

1.2 Swin Transformer

Figure 2: The overall architecture of Swin Transformer (Image from [5])
Figure 3: An illustration of how W-MSA and SW-MSA compute self-attention locally (left) and the architecture of the Swin Transformer block (right) (Images from [5])
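The W-MSA and SW-MSA pattern in Figure 3 can be sketched with plain array operations: W-MSA partitions the feature map into non-overlapping windows and runs self-attention inside each one; the following SW-MSA block cyclically shifts the map by half a window before partitioning, so its windows straddle the previous boundaries. A simplified sketch (it shows only the partitioning and the cyclic shift, and omits the attention computation and the attention mask that the paper applies after the shift):

```python
import numpy as np

def window_partition(x, ws):
    """Split an (H, W, C) feature map into non-overlapping ws x ws windows."""
    h, w, c = x.shape
    x = x.reshape(h // ws, ws, w // ws, ws, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, c)

h = w = 8; c = 16; ws = 4
x = np.random.randn(h, w, c)

# W-MSA: self-attention runs independently inside each window,
# so cost is linear in the number of windows rather than quadratic in H*W.
wins = window_partition(x, ws)
print(wins.shape)  # (4, 16, 16): 4 windows of ws*ws = 16 tokens each

# SW-MSA: cyclically shift by ws//2 before partitioning so the next block's
# windows cross the previous window boundaries.
shifted = np.roll(x, shift=(-ws // 2, -ws // 2), axis=(0, 1))
wins_shifted = window_partition(shifted, ws)
print(wins_shifted.shape)  # (4, 16, 16)
```

Alternating the two block types gives every token a growing receptive field while keeping attention local and cheap.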

1.3 Pooling-based Vision Transformer (PiT)

Figure 4: Comparison of network architectures: ViT (left) and PiT (right) (Images from [4])
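PiT reintroduces the CNN-style pyramid shown on the right of Figure 4: between stages, a pooling layer halves the spatial token grid and widens the channel dimension. A rough NumPy sketch of one such stage transition (PiT uses a strided depthwise convolution; the average pooling and the random linear projection below are stand-ins, and the 14x14 grid and 64 channels are assumed values):

```python
import numpy as np

def pit_pool(x, h, w):
    """Sketch of a PiT pooling stage: halve the spatial grid, double the
    channels.  (PiT uses a strided depthwise conv; average pooling plus a
    random linear projection stand in here.)"""
    c = x.shape[-1]
    g = x.reshape(h, w, c)
    g = g.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))  # spatial / 2
    proj = np.random.randn(c, 2 * c) / np.sqrt(c)             # channels * 2
    return (g @ proj).reshape(-1, 2 * c)

x = np.random.randn(14 * 14, 64)   # tokens entering the pooling layer
y = pit_pool(x, 14, 14)
print(y.shape)  # (49, 128): fewer, wider tokens, like a CNN feature pyramid
```

In contrast, plain ViT (left of Figure 4) keeps the same token count and width through every block, which is exactly the "no multi-scale feature maps" limitation this section addresses.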

2. Summary




