Video Anomaly Detection: An Introduction

Nov 30, 2023

1. What is video anomaly detection?

Surveillance cameras are now widely used in both public and private areas. However, surveillance monitoring is usually done by humans, which is tedious and time-consuming. As the number of cameras grows rapidly, it becomes difficult for humans to monitor them all efficiently and effectively, creating a need to automate surveillance monitoring. The aim of video anomaly detection (VAD) is to use a computer vision system, instead of humans, to monitor surveillance cameras and detect abnormal or anomalous events.

Anomalous events are activities that occur at an unusual location and/or at an unusual time. Examples of anomalous events, especially in surveillance monitoring, include fighting, stealing, arson, and accidents. Note that whether an event is anomalous depends heavily on the scene and context, i.e., some events are considered normal at some places/times but anomalous at others. For example, riding a horse on an expressway is considered anomalous, but riding one in an open field is normal. This dependency on context makes anomaly detection more challenging.

Some kinds of anomalies can be detected at the image level, which is known as image anomaly detection. That is, given only an image as input, anomaly detection can be done either by detecting unusual objects (object-level anomaly detection) or by detecting unusual relationships between objects and scenes (scene-level anomaly detection). In the horse-riding example, detecting a horse and recognizing the background scene is sufficient for anomaly detection. However, image anomaly detection uses only spatial information and ignores temporal information.

For other anomalous events, such as sudden movement or loitering, both spatial and temporal information are crucial. Detecting these kinds of anomalies requires knowing how objects move and how the relationships among objects and scenes change, which is difficult to achieve at the image level. VAD, on the other hand, aims to analyze a video input to detect and localize unusual events/activities (event-level anomaly detection) using spatiotemporal information. Given an input video, the expected output of a VAD system is a list of start/end times of all anomalous events found. This tells us which frames do or do not contain anomalous events, enabling a VAD system to raise an alarm for human staff when an anomalous event is detected. Some VAD systems can further identify and localize the objects related to the detected anomalous events, allowing human staff to easily pinpoint the source of those events.
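As a rough illustration of this output format, the sketch below merges per-frame anomaly decisions into a list of start/end times. The function name `frames_to_events` and the `fps` parameter are hypothetical names chosen for this example, and the end time is taken to be the moment the first normal frame after the event begins.

```python
from typing import List, Tuple


def frames_to_events(flags: List[bool], fps: float) -> List[Tuple[float, float]]:
    """Merge runs of consecutive anomalous frames into (start, end) times in seconds."""
    events = []
    start = None  # index of the first frame of the current anomalous run
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i  # a new anomalous run begins
        elif not flag and start is not None:
            events.append((start / fps, i / fps))  # the run ended before frame i
            start = None
    if start is not None:
        # The video ended while an anomalous run was still open.
        events.append((start / fps, len(flags) / fps))
    return events


flags = [False, False, True, True, True, False, True, True]
print(frames_to_events(flags, fps=2.0))  # [(1.0, 2.5), (3.0, 4.0)]
```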

2. How does video anomaly detection work?

VAD methods can be divided into 2 main approaches based on the learning techniques exploited: 1) one-class classification approaches and 2) weakly-supervised learning approaches. Note that, due to the scarcity of frame-level annotations for large-scale real-world VAD datasets, a supervised learning approach has not gained attention from researchers in the field.

2.1 One-class classification approach

Early VAD methods are usually based on one-class classification, in which the prediction models are trained only on normal videos. These approaches usually aim to learn a dictionary of normal features, using either hand-crafted features (Lu et al., 2013) or deep autoencoder models (Xu et al., 2015). Once trained, a model can extract the essential features needed to reconstruct a normal input video, resulting in a low reconstruction error. During inference, given an unknown input, the trained model extracts those features and reconstructs the input. If the reconstruction error is larger than a pre-defined threshold, the input is considered an anomaly, since it likely deviates substantially from the normal videos used to train the model. However, these approaches usually do not generalize well to test datasets since no anomalous samples are presented to the model during training.
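The thresholding step can be sketched as follows. This is a toy illustration only: the "model" here simply returns the closest vector from a small dictionary of normal features, standing in for the sparse-coding or autoencoder reconstruction used in the cited methods. All names and values (`reconstruct`, `is_anomalous`, the dictionary, the threshold) are assumptions made for the example.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mean_squared_error(a: Vector, b: Vector) -> float:
    """Mean of the squared element-wise differences between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def reconstruct(x: Vector, dictionary: List[Vector]) -> Vector:
    """Toy reconstruction: pick the normal feature vector closest to x."""
    return min(dictionary, key=lambda atom: mean_squared_error(x, atom))


def is_anomalous(x: Vector, dictionary: List[Vector], threshold: float) -> bool:
    """Flag x when its reconstruction error exceeds the pre-defined threshold."""
    return mean_squared_error(x, reconstruct(x, dictionary)) > threshold


# Hypothetical features "learned" from normal videos only.
normal_dictionary = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.1]]
print(is_anomalous([0.1, 0.0], normal_dictionary, threshold=0.05))  # False
print(is_anomalous([2.0, 1.5], normal_dictionary, threshold=0.05))  # True
```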

2.2 Weakly-supervised learning approach

While one-class classification-based VAD approaches train a model using only normal videos, weakly-supervised learning VAD approaches utilize both normal and anomalous videos with supervised signals during training. These supervised signals, however, are not the ground truth labels for the task to solve, i.e., to predict if a frame contains an anomaly or not, as in the case of supervised learning. In weakly-supervised learning VAD, the supervised signals come from a related task, e.g., predicting if an entire video contains an anomaly or not. This kind of weak supervised signal does not provide localization information on when anomalies occur in a video; we only know if anomalies exist in the video or not. Therefore, some technique is required to leverage these weak supervised signals, i.e., video-level labels, to train a model to detect anomalies at the frame level. Compared to frame-level labels, video-level labels are much easier to obtain, enabling many research teams to construct their own large-scale datasets for weakly-supervised VAD (see Section 3 for more details of some large-scale VAD datasets).

Multiple instance learning

Given video-level labels as supervised signals, a weakly-supervised learning technique called multiple instance learning (MIL) has recently gained attention from many researchers in the field of VAD. As an example, Fig. 1 illustrates the flow diagram of a MIL-based VAD method proposed by Sultani et al. (2018). In MIL, an untrimmed input video is usually divided into short segments (also called snippets). A deep learning model consisting of a video processing backbone and a prediction head processes a video segment to predict an anomaly score, which tells us how likely the segment is to be anomalous. The MIL framework treats each video as a bag containing several video segments. A bag is called a positive bag if the input video contains an anomalous event; otherwise, it is a negative bag. Since there are no frame-level or segment-level supervised signals, i.e., only video-level signals are available, common loss functions such as mean squared error (MSE) or cross-entropy loss cannot be directly used. MIL exploits these weakly supervised signals and a ranking loss to train a model to make predictions at the segment level. During training, both positive and negative bags are required to compute a ranking loss that compares the maximum anomaly score of a positive bag with that of a negative bag. The model is penalized if the maximum anomaly score of the negative bag is higher. This top-1 ranking loss was designed based on the consideration that not all segments in an anomalous video are anomalous, but at least one of them should receive a higher anomaly score than all segments in a normal video.
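A minimal sketch of such a top-1 ranking loss is given below, written as a hinge loss over the two bag maxima. The hinge form and the `margin` parameter are assumptions of this sketch; with `margin=0`, the loss is non-zero only when the negative bag's maximum score is higher, as described above.

```python
from typing import Sequence


def mil_ranking_loss(pos_scores: Sequence[float],
                     neg_scores: Sequence[float],
                     margin: float = 1.0) -> float:
    """Hinge ranking loss between the top-scoring segment of each bag.

    pos_scores: anomaly scores of the segments of an anomalous video (positive bag).
    neg_scores: anomaly scores of the segments of a normal video (negative bag).
    """
    return max(0.0, margin - max(pos_scores) + max(neg_scores))


positive_bag = [0.1, 0.9, 0.2]  # at least one segment looks anomalous
negative_bag = [0.2, 0.3, 0.1]  # all segments come from a normal video
print(mil_ranking_loss(positive_bag, negative_bag))  # approximately 0.4
```

Because only the bag maxima enter the loss, no segment-level label is needed: the video-level label decides which bag a segment belongs to.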

During inference, a trained VAD model is used to process each video segment one by one to produce an anomaly score, which is then compared with a threshold to decide if the segment is normal or anomalous.

Fig. 1: Multiple instance learning (MIL) framework (image from Sultani et al., 2018)

While there are several MIL-based VAD methods proposed so far, many of them follow the key ideas of Sultani’s method but utilize different model architectures and training tricks to improve prediction accuracy. The following subsections highlight variations of notable weakly-supervised VAD methods.

Ranking loss and regularization terms

Apart from the top-1 ranking loss used by Sultani et al. (2018), several variations of ranking losses have been proposed and utilized in VAD research. For example, Dubey et al. (2019) compared the top-3 anomaly scores of an anomalous video with the maximum score of a normal video, while Kamoona et al. (2023) computed the average difference in anomaly scores between normal and abnormal bags and aimed at maximizing it.
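To make these variations concrete, the sketch below shows one possible reading of a top-k ranking loss and of a mean-score gap between bags. These are simplified illustrations under the assumption of a hinge formulation; they are not the exact losses defined by Dubey et al. (2019) or Kamoona et al. (2023), and the function names are hypothetical.

```python
from typing import Sequence


def topk_ranking_loss(pos_scores: Sequence[float],
                      neg_scores: Sequence[float],
                      k: int = 3,
                      margin: float = 1.0) -> float:
    """Average hinge loss between each of the k highest positive scores
    and the maximum negative score."""
    top_pos = sorted(pos_scores, reverse=True)[:k]
    max_neg = max(neg_scores)
    return sum(max(0.0, margin - s + max_neg) for s in top_pos) / len(top_pos)


def mean_score_gap(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Difference between the average scores of a positive and a negative bag.
    A training objective of this kind would try to maximize the gap."""
    return sum(pos_scores) / len(pos_scores) - sum(neg_scores) / len(neg_scores)


pos = [0.9, 0.8, 0.1, 0.7]
neg = [0.2, 0.3]
print(topk_ranking_loss(pos, neg))  # approximately 0.5
print(mean_score_gap(pos, neg))     # approximately 0.375
```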

Regularization terms are commonly added to the loss function to encourage desired properties of the model. Sultani et al. (2018) proposed to include temporal smoothness and sparsity terms in their ranking loss. The temporal smoothness term is the sum of the squared differences between the anomaly scores of adjacent video segments, which discourages rapid oscillations of the anomaly scores between normal and anomalous. The sparsity term, based on the assumption that anomalous events rarely occur, is the sum of the predicted anomaly scores of a video sequence, which penalizes a model when too many segments are predicted as anomalies. These two regularization terms have been commonly used in other MIL-based methods (for example, Dubey et al., 2019; Kamoona et al., 2023; Tian et al., 2021).
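The two terms can be written down directly from their descriptions. In the sketch below, the weights `lambda_smooth` and `lambda_sparse` and the example values are hypothetical, and the choice of which video's scores are regularized is left to the caller.

```python
from typing import Sequence


def temporal_smoothness(scores: Sequence[float]) -> float:
    """Sum of squared differences between scores of adjacent segments."""
    return sum((a - b) ** 2 for a, b in zip(scores, scores[1:]))


def sparsity(scores: Sequence[float]) -> float:
    """Sum of the anomaly scores; small when few segments are flagged."""
    return sum(scores)


def regularized_loss(ranking_loss: float,
                     scores: Sequence[float],
                     lambda_smooth: float,
                     lambda_sparse: float) -> float:
    """Ranking loss plus the weighted smoothness and sparsity penalties."""
    return (ranking_loss
            + lambda_smooth * temporal_smoothness(scores)
            + lambda_sparse * sparsity(scores))


scores = [0.1, 0.9, 0.2]
print(temporal_smoothness(scores))  # approximately 1.13
print(sparsity(scores))             # approximately 1.2
print(regularized_loss(0.4, scores, lambda_smooth=0.1, lambda_sparse=0.1))
```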

Model architectures

VAD models in MIL-based approaches usually consist of at least two modules: 1) a video processing backbone and 2) a prediction head.

A video processing backbone is used to convert a video segment input X ∈ Rʰ×ʷ×ᶜ×ᵗ, where h×w is the spatial resolution of each frame in the video segment, c is the number of channels, and t is the number of frames in the video segment, into an embedding f ∈ Rᵈ, where d is the embedding dimension. Typically, a 3D convolutional neural network (3D CNN) such as C3D (Tran et al., 2015) or I3D (Carreira and Zisserman, 2017) is used as a video processing backbone since it can handle both spatial and temporal information. This video backbone can be pre-trained on a large-scale action recognition dataset such as Kinetics (Smaira et al., 2020) to learn spatiotemporal features, and then transferred to a VAD task. Although the video backbone can be fine-tuned during the training phase of the downstream VAD task, many papers opt to freeze it to speed up the training process and reduce computational resources.

A prediction head, on the other hand, takes an embedding from the video backbone as input and predicts an anomaly score as output. While Sultani et al. (2018) designed a fully connected network with two hidden layers as a prediction head (Fig. 1), many variations of the prediction head have been proposed in VAD research. For example, Cao et al. (2022) designed a prediction head using a graph convolution network (GCN) architecture, as shown in Fig. 2, to model the contextual relationships among video segments. First, they use an I3D backbone to convert each video segment in a video into an embedding. Second, they construct a graph in which each node represents a video segment. Every node is connected to every other node in the graph, but the weight of each connection between two segments is computed based on their embedding similarity and their proximity in time. This graph is then fed into a GCN to predict an anomaly score for each node (segment), allowing the model to utilize the contextual relationships among video segments.
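The graph construction step can be sketched as below. How the two cues are combined is an assumption of this sketch: edge weights are the product of a non-negative cosine similarity and an exponential decay in temporal distance, controlled by a hypothetical parameter `tau`. The actual formulation of Cao et al. (2022) may differ.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_adjacency(embeddings: List[Sequence[float]], tau: float = 1.0) -> List[List[float]]:
    """Fully connected graph over segments; entry [i][j] is the weight of edge i-j."""
    n = len(embeddings)
    adjacency = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # no self-loops in this sketch
            similarity = max(0.0, cosine_similarity(embeddings[i], embeddings[j]))
            proximity = math.exp(-abs(i - j) / tau)  # closer in time -> larger weight
            adjacency[i][j] = similarity * proximity
    return adjacency


# Four hypothetical segment embeddings, ordered by time.
segments = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [1.0, 0.0]]
for row in build_adjacency(segments):
    print([round(w, 3) for w in row])
```

In this example, segment 0 is linked most strongly to segment 1 (similar and adjacent), weakly to segment 3 (identical but far away in time), and not at all to segment 2 (orthogonal embedding).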

Similarly, other network architectures, for example, 1D convolution layers as in Lv et al. (2020), or causal convolution layers and attention networks as in Pu et al. (2023), can be utilized to leverage contextual information among video segments. Moreover, Pu et al. (2023) applied a score-smoothing module at the end of their inference pipeline to reduce false alarms caused by frame jitter.
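As a simple illustration of score smoothing, the sketch below applies a centered moving average to a sequence of anomaly scores. The moving-average filter and the `window` parameter are assumptions for this example; the smoothing module of Pu et al. (2023) is not necessarily implemented this way.

```python
from typing import List, Sequence


def smooth_scores(scores: Sequence[float], window: int = 3) -> List[float]:
    """Centered moving average; the window is truncated at the sequence borders."""
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed


raw = [0.1, 0.1, 0.9, 0.1, 0.1]  # a one-frame spike, e.g. caused by jitter
print([round(s, 3) for s in smooth_scores(raw)])  # [0.1, 0.367, 0.367, 0.367, 0.1]
```

With a decision threshold of 0.5, the isolated spike in the raw scores would trigger an alarm, while the smoothed scores would not.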

Fig. 2: A VAD model with a GCN-based prediction head (image from Cao et al., 2022)

Data augmentation

Data augmentation has been shown to be crucial in training deep neural networks. For VAD research, several data augmentation techniques have been utilized to improve the generalization performance of models. For example, Lv et al. (2021) proposed a noise simulation strategy that randomly chooses a number of segments in a normal video and adds various kinds of noise, such as motion blur, jitter, and picture interruption, to those segments. This strategy does not change the label of the video; consequently, the augmented video segments are still treated as normal. It allows the model to learn changes in the raw videos that are not caused by anomalous events, leading to fewer false alarms.


Pseudo labels

Another technique to improve the performance of VAD models is to use pseudo labels, which are target labels for unlabelled data used for training as if they were ground truth labels. Pseudo labels can be obtained from a pseudo-label generation technique (e.g., Lv et al., 2021) or from predictions of a trained model (e.g., Zhong et al., 2019 and Feng et al., 2021).

Lv et al. (2021) introduced a technique to generate hand-crafted anomalies with segment-level pseudo labels. They randomly select a pair of normal and anomalous videos and fuse random segments of the anomalous video into the normal video. The locations of the fused segments in the normal video are labeled as anomalies. The resulting segment-level pseudo labels enhance their MIL framework and improve the anomaly localization capability of their model.
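The idea can be sketched as follows, with segments represented as opaque items in a list. Here "fusing" is simplified to replacing a normal segment with a randomly chosen segment from the anomalous video; this replacement rule, the function name `fuse_segments`, and the parameter `num_fused` are assumptions of the sketch rather than the exact procedure of Lv et al. (2021).

```python
import random
from typing import List, Sequence, Tuple


def fuse_segments(normal_segments: Sequence[str],
                  anomalous_segments: Sequence[str],
                  num_fused: int,
                  rng: random.Random) -> Tuple[List[str], List[int]]:
    """Insert anomalous segments at random positions of a normal video.

    Returns the fused video and segment-level pseudo labels
    (1 = fused/anomalous position, 0 = original normal segment).
    """
    fused = list(normal_segments)
    labels = [0] * len(fused)
    for position in rng.sample(range(len(fused)), num_fused):
        fused[position] = rng.choice(anomalous_segments)
        labels[position] = 1
    return fused, labels


rng = random.Random(0)  # fixed seed so the example is repeatable
normal = ["n0", "n1", "n2", "n3", "n4"]
anomalous = ["a0", "a1", "a2"]
fused_video, pseudo_labels = fuse_segments(normal, anomalous, num_fused=2, rng=rng)
print(fused_video)
print(pseudo_labels)
```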

On the other hand, Zhong et al. (2019) proposed a noise cleaner framework, as shown in Fig. 3, to refine and select high-confidence predictions as pseudo labels. A GCN was designed and added to their training pipeline to clean noisy predictions. Their action classifier and the label noise cleaner were trained alternately. Once training was finished, only the action classifier module was used for inference. With the pseudo labels from high-confidence predictions, they reframed weakly supervised learning as supervised learning under noisy labels and trained their model with a common cross-entropy loss.

Fig. 3: Graph convolutional noise cleaner framework (image from Zhong et al., 2019)

Prompt learning

Recently, a vision-language model called CLIP (Radford et al., 2021) has gained attention from researchers and has been applied in various applications. A CLIP model consists of a visual encoder and a text encoder that are jointly trained to align embeddings generated by both encoders when an input image and an input text description are related.

Pu et al. (2023) proposed a technique called prompt-enhanced learning (PEL) that utilizes a CLIP model during training to enhance their VAD model trained under a MIL framework (Fig. 4). With PEL, they leverage anomaly class names such as ‘fighting’ or ‘robbery’ as keywords to query related words from a knowledge graph called ConceptNet (Speer et al., 2017). Their VAD model was trained to align their visual embeddings of video segments that are likely anomalous with the text embeddings of the related words while pushing them away from the text embeddings of words related to ‘normal’ or ‘nonviolence’ (Fig. 5), and vice versa. This technique allows them to use contextual knowledge about anomalies to boost their model’s performance.

Fig. 4: Prompt-enhanced learning framework (image from Pu et al., 2023)
Fig. 5: The PEL module (image from Pu et al., 2023)

3. Benchmarking datasets for video anomaly detection

Large-scale benchmarking datasets are crucial for training robust deep-learning models. For VAD, several public benchmarking datasets have been proposed so far. These large-scale VAD datasets usually provide a set of training videos with only their corresponding video-level labels, i.e., normal or anomalous, enabling VAD models to be trained with weakly supervised learning. Frame-level annotations, however, are provided only for the test sets, allowing researchers to benchmark their models with frame-level evaluation metrics. This section introduces some notable public datasets and evaluation metrics for VAD.


UCF-Crime

The UCF-Crime dataset (Sultani et al., 2018) is an early large-scale VAD dataset capturing realistic anomalies. Apart from the normal class, there are 13 classes of real-world anomalies in the dataset: ‘abuse’, ‘arrest’, ‘arson’, ‘assault’, ‘burglary’, ‘explosion’, ‘fighting’, ‘road accidents’, ‘robbery’, ‘shooting’, ‘shoplifting’, ‘stealing’, and ‘vandalism’. It consists of 1,900 untrimmed videos obtained from real-world surveillance cameras. The videos total 128 hours, with an average of 7,247 frames per video. The dataset is divided into a training set of 1,610 videos (800 normal and 810 anomalous videos) with video-level labels and a test set of 290 videos (150 normal and 140 anomalous videos) with frame-level labels. Example snapshots of the dataset are shown in Fig. 6.

Fig. 6: UCF-Crime dataset (image from Sultani et al., 2018)


ShanghaiTech

The ShanghaiTech dataset (Luo et al., 2017) was collected under complex lighting conditions and camera angles (Fig. 7). It contains 13 real-world scenes, each consisting of several videos. The ShanghaiTech dataset was originally proposed for one-class classification since its training set contains only normal videos. It contains 270K frames of normal videos for training, while the test set includes 130 anomalous events in total. It also provides pixel-level annotations of abnormal regions. Later, Zhong et al. (2019) proposed a new training/test split for ShanghaiTech, setting up a new protocol to train and evaluate models on this dataset. Under this split, the training set contains 238 videos (175 normal and 63 anomalous videos) while the test set contains 199 videos (155 normal and 44 anomalous videos).

Fig. 7: ShanghaiTech dataset (image from Luo et al., 2017)


XD-Violence

Different from other datasets, XD-Violence (Wu et al., 2020) is a large-scale VAD dataset providing 4,754 untrimmed videos with audio signals, allowing models to leverage multimodal information for anomaly detection. The XD-Violence dataset includes six kinds of physical violence: ‘abuse’, ‘car accident’, ‘explosion’, ‘fighting’, ‘riot’, and ‘shooting’. In total, the dataset is 217 hours long. It is divided into a training set of 3,954 videos (2,049 normal and 1,905 anomalous videos) with video-level annotations and a test set of 800 videos (300 normal and 500 anomalous videos) with frame-level annotations. Each anomalous video contains 1–3 anomalous events.


TAD

While several VAD datasets mostly focus on abnormal human-related events, the traffic anomaly detection (TAD) dataset (Lv et al., 2021) is dedicated to anomaly detection in traffic scenes (Fig. 8). It consists of long untrimmed videos including seven kinds of anomalies on roads: ‘vehicle accidents’, ‘illegal turns’, ‘illegal occupations’, ‘retrograde motion’, ‘pedestrian on road’, ‘road spills’, and ‘the else’. There are 500 traffic surveillance videos in the dataset, totaling 25 hours. The training set consists of 400 videos with video-level annotations, while the test set consists of 100 videos with frame-level annotations. Both training and test sets contain normal videos and all kinds of anomalies.

Fig. 8: TAD dataset (image from Lv et al., 2021)

Street Scene

The Street Scene dataset (Ramachandra and Jones, 2020) provides bird's-eye-view videos of a two-lane street (Fig. 9). It includes 17 anomaly classes that occur on streets, such as ‘jaywalking’, ‘biker outside lane’, and ‘car illegally parked’. Different from several other datasets that provide only frame-level annotations for the test sets, the Street Scene dataset offers bounding boxes around anomalous events with track numbers. In this dataset, a single frame may contain more than one labeled anomaly.

Fig. 9: An example frame from Street Scene dataset (image from Ramachandra and Jones, 2020)


UBnormal

UBnormal (Acsintoae et al., 2022) is a recent benchmark dataset for supervised open-set VAD (Fig. 10). It is, however, a virtual dataset generated with Cinema4D, a 3D modeling software suite, by placing animated characters/objects in real-world backgrounds. Since it is generated, pixel-level annotations are provided for both training and test sets, allowing researchers to train VAD models in a supervised learning fashion. The dataset includes several kinds of objects such as people, cars, skateboards, bicycles, and motorcycles. The following events are considered normal: ‘walking’, ‘talking on the phone’, ‘walking while texting’, ‘standing’, ‘sitting’, ‘yelling’, and ‘talking with others’, while there are 22 abnormal events such as ‘running’, ‘falling’, ‘fighting’, and ‘sleeping’. These abnormal events are organized to form an open-set VAD problem, i.e., the types of abnormal events in the training (six types), validation (four types), and test (12 types) sets do not overlap.

Fig. 10: UBnormal dataset (image from Acsintoae et al., 2022)

Evaluation metrics

While there is a variety of benchmarking datasets with different characteristics, the most common evaluation metrics for VAD research are ROC-AUC and false alarm rate.

The ROC-AUC is the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, which is the plot of true positive rate (TPR) against false positive rate (FPR). This metric measures the overall performance of a VAD model under various thresholds. A VAD model with a higher ROC-AUC is considered better than a VAD model with a lower ROC-AUC.
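For intuition, ROC-AUC can also be computed without drawing the curve: it equals the probability that a randomly chosen anomalous frame receives a higher score than a randomly chosen normal frame, with ties counted as one half. The sketch below uses this pairwise formulation; it is quadratic in the number of frames and meant only for illustration. The function name `roc_auc` is chosen for this example.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC from frame-level labels (1 = anomalous, 0 = normal) and anomaly scores."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one anomalous and one normal frame")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # anomalous frame ranked above normal frame
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```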

False alarm rate is computed at a given threshold, measuring the number of times a VAD model makes a mistake in predicting a normal video frame/segment to be anomalous. The value is then normalized by the total number of normal frames/segments that the VAD model processes. A threshold of 0.5 is commonly used for anomaly scores ranging from 0 to 1.
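In code, the false alarm rate is the number of normal frames/segments scored above the threshold divided by the total number of normal frames/segments. The sketch below assumes that a score strictly greater than the threshold counts as an anomalous prediction; the function name is hypothetical.

```python
from typing import Sequence


def false_alarm_rate(labels: Sequence[int],
                     scores: Sequence[float],
                     threshold: float = 0.5) -> float:
    """Fraction of normal frames (label 0) that are predicted as anomalous."""
    normal_scores = [s for label, s in zip(labels, scores) if label == 0]
    if not normal_scores:
        raise ValueError("No normal frames to evaluate")
    false_alarms = sum(1 for s in normal_scores if s > threshold)
    return false_alarms / len(normal_scores)


labels = [0, 0, 0, 0, 1]
scores = [0.1, 0.6, 0.2, 0.7, 0.9]
print(false_alarm_rate(labels, scores))  # 0.5
```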

Apart from these quantitative measurements, many researchers also present qualitative results of the VAD models’ predictions by showing a prediction graph as shown in Fig. 11.

Fig. 11: Examples of qualitative results. Each graph is a plot of predicted anomaly scores (the blue curves) against the frame number in a video, while a pink curve indicates when an anomalous event occurs, i.e., the ground truth (image from Sultani et al., 2018).

4. Open problems and challenges

Although many researchers have dedicated their efforts to developing more robust VAD methods, there is still room for improvement, and several challenges remain. First, due to the lack of frame-level annotations for training, current state-of-the-art (SOTA) approaches are based on weakly supervised learning. Although they have been shown to be far superior to one-class classification-based approaches, SOTA performance on large-scale datasets has not yet reached 90%. For example, it is about 86–87% ROC-AUC on UCF-Crime and 85–86% average precision (AP) on XD-Violence, which leaves room for further improvement.

Second, anomalous events found in the real world can vary greatly and differ substantially from those in the training set. This creates a generalization problem when deploying a VAD model in a real-world setting, especially when the environment and context are very different. As described in Section 3, a dataset called UBnormal (Acsintoae et al., 2022) has recently been proposed for training and benchmarking open-set VAD. In an open-set setting, the types of anomalous events presented in the test set might not exist in the training set. As reported by Acsintoae et al. (2022), open-set VAD is much more challenging than closed-set VAD.

Third, as discussed in Section 1, whether an event is anomalous depends heavily on the context. While training a VAD model to detect anomalies under different scenes and contexts, i.e., scene-dependent VAD, is challenging, it is crucial for practical applications. A dataset for scene-dependent VAD, the NWPU Campus dataset (Cao et al., 2023), has recently been released, paving the way to train and evaluate VAD models that recognize scene-dependent anomalies.

Last but not least, developing an explainable VAD system, which enables a user to understand why a frame/segment is predicted as an anomaly, has recently gained attention from researchers (Singh et al., 2023 and Doshi and Yilmaz, 2023). For example, Singh et al. (2023) include in their VAD pipeline a module that learns interpretable high-level attributes, such as object types, moving direction, and speed. Once an anomaly is detected, these attributes make it much easier for human observers to understand who/what is involved in the anomalous event.


Written by: Sertis Vision Lab

Originally published at