Vision Transformers: A Review — Part III

The vanilla ViT has two major drawbacks:

  • A requirement for a large amount of pre-training data.
  • High computational complexity, especially for dense prediction on high-resolution images.
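The second drawback follows from global self-attention: every token attends to every other token, so the number of query-key pairs grows with the square of the token count, i.e. with the fourth power of image resolution. A small sketch (patch size 16 is an assumption, matching the original ViT) makes the scaling concrete:

```python
def attention_pairs(image_size: int, patch_size: int = 16) -> int:
    """Number of query-key pairs global self-attention must score."""
    tokens = (image_size // patch_size) ** 2
    return tokens ** 2

for size in (224, 448, 896):
    print(size, attention_pairs(size))
```

Doubling the resolution quadruples the token count and multiplies the attention cost by 16, which is why dense-prediction tasks on large images are the pain point.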

1. Incapability of generating multi-scale feature maps

1.1 Pyramid Vision Transformer (PVT)

Figure 1: The overall architecture of PVT (Image from [3])
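The key trick in PVT is spatial-reduction attention (SRA): queries keep the full token sequence, but keys and values are spatially downsampled by a ratio R before attention, cutting the score matrix by a factor of R². The numpy sketch below illustrates only the shape bookkeeping; average pooling stands in for the learned strided projection in the paper, and the sizes are hypothetical:

```python
import numpy as np

def sra(x, H, W, R):
    """Spatial-reduction attention sketch: queries keep all H*W tokens,
    keys/values are pooled down to (H/R)*(W/R) tokens before attention.
    Average pooling stands in for PVT's learned strided reduction."""
    N, C = x.shape                        # N = H * W tokens
    q = x                                 # queries: (N, C)
    kv = x.reshape(H // R, R, W // R, R, C).mean(axis=(1, 3))
    kv = kv.reshape(-1, C)                # reduced tokens: (N / R^2, C)
    attn = q @ kv.T / np.sqrt(C)          # (N, N / R^2) score matrix
    attn = np.exp(attn - attn.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ kv                      # (N, C): length unchanged

x = np.random.randn(56 * 56, 64)
out = sra(x, H=56, W=56, R=8)
print(out.shape)   # (3136, 64)
```

With R = 8 the score matrix shrinks from 3136 × 3136 to 3136 × 49, which is what lets PVT keep high-resolution feature maps in its early stages.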

1.2 Swin Transformer

Figure 2: The overall architecture of Swin Transformer (Image from [5])
Figure 3: An illustration of how W-MSA and SW-MSA compute self-attention locally (left) and the architecture of the Swin Transformer block (right) (Images from [5])
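W-MSA restricts self-attention to non-overlapping M × M windows, making the cost linear in image size; SW-MSA then shifts the partition by M/2 (implemented as a cyclic roll) so that successive blocks exchange information across window borders. A minimal numpy sketch of the partitioning step, with a hypothetical 8 × 8 feature map:

```python
import numpy as np

def window_partition(x, M):
    """Split an (H, W, C) feature map into non-overlapping M x M windows."""
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, M, M, C)

x = np.arange(8 * 8).reshape(8, 8, 1)

# W-MSA: 4 windows of 4x4; self-attention runs inside each window only.
windows = window_partition(x, M=4)
print(windows.shape)                  # (4, 4, 4, 1)

# SW-MSA: cyclically shift by M // 2 before partitioning, so the new
# windows straddle the boundaries of the previous layer's windows.
shifted = np.roll(x, shift=(-2, -2), axis=(0, 1))
shifted_windows = window_partition(shifted, M=4)
```

In the actual model the rolled layout also requires an attention mask so that tokens wrapped around from opposite edges do not attend to each other; that detail is omitted here.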

1.3 Pooling-based Vision Transformer (PiT)

Figure 4: Comparison of network architectures: ViT (left) and PiT (right). (Images from [4])
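Unlike ViT, which keeps the same token resolution throughout, PiT inserts pooling layers between transformer stages so the spatial grid of tokens shrinks while the channel width grows, mimicking a CNN-style pyramid. In the paper the pooling is a strided depthwise convolution; the sketch below substitutes an average pool plus a random linear projection purely to show the shapes, with hypothetical sizes:

```python
import numpy as np

def pit_pool(tokens, H, W, stride=2):
    """PiT-style pooling sketch: shrink the spatial token grid by
    `stride` in each dimension and double the channel width.  The paper
    uses a strided depthwise conv; average pooling plus a random linear
    projection stand in here for illustration only."""
    N, C = tokens.shape                            # N = H * W
    x = tokens.reshape(H // stride, stride, W // stride, stride, C)
    x = x.mean(axis=(1, 3)).reshape(-1, C)         # (N / stride^2, C)
    proj = np.random.randn(C, 2 * C) / np.sqrt(C)  # hypothetical projection
    return x @ proj                                # (N / stride^2, 2C)

tokens = np.random.randn(28 * 28, 64)
pooled = pit_pool(tokens, H=28, W=28)
print(pooled.shape)   # (196, 128)
```

Each pooling stage trades 4× fewer tokens for 2× wider channels, which is the same compute-budget trade-off CNN backbones make between stages.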

2. Summary




