This series of posts aims to talk about the concept and applications of graph neural networks (GNNs), which is a machine learning model applied to graph-structured data. In Part I, we have explained what graph-structured data is and how it is represented. We have also introduced the concept of graph machine learning and GNNs. This part, which is the second part of this three-part series, aims to provide more details on a variant of GNNs called graph convolutional networks (GCNs). Two types of GCNs, i.e., 1) spectral GCNs and 2) spatial GCNs, and their recent GCNs models are also introduced.
1. Graph convolutional networks (GCNs)
Graph convolutional networks (GCNs) are a special type of graph neural networks (GNNs) that use convolutional aggregations. Applications of the classic convolutional neural network (CNN) architectures in solving machine learning problems, especially computer vision problems, have been hugely successful. However, the set of problems that has a graph structure is hard to encode using regular neural networks. The use of GCNs to address some computer vision problems has been shown comparable or even better performance. As mentioned in the previous post, GCNs have been applied to solve many problems, for example, image classification [1], traffic forecasting [2], recommendation system [3], scene graph generation [4], and visual question answering [5].
GCNs can be split into two types, namely, 1) spectral GCNs and 2) spatial GCNs. Briefly, spectral GCNs are defined on the spectral domain of the data based on the graph Fourier transformation while spatial GCNs are defined on the spatial domain, i.e., the original graph domain, as an aggregation of the node representations of its neighborhood.
2. Spectral GCNs
Graph Laplacian
Given an undirected graph G ={V, E}, which consists of a set of nodes V with |V| = n and a set of edges E with |E| = m, we can obtain the adjacency matrix A where Aᵢ,ⱼ is 1 if there exists an edge between node i and node j, and 0 otherwise. The degree matrix of A is a diagonal matrix where, Dᵢ,ⱼ = ΣAᵢ,ⱼ is the sum of all the edges to node i. The Laplacian matrix of A is defined as L = D — A and the corresponding symmetrically normalized Laplacian matrix is
where I is the identity matrix. The normalization aims to make the influence of nodes that have a large degree more equal to that of other nodes.
Graph Fourier transformation
In the fields of signal processing or image processing, Fourier transformation is a transformation that decomposes an input signal from its original time or spatial domain into constituent basis functions in the frequency domain. The classic Fourier transformation of a 1-D signal f is computed by
where ξ is the frequency of f’ in the spectral domain and the complex exponential is the eigenfunction of the Laplace operator. For example, Fourier transformation in images is shown in Fig. 1 where we can transform an image into its constituting basis functions in the frequency domain. Please refer to [6] for more information on the 2-D Fourier transformation.
The graph Laplacian matrix L is the Laplacian operator defined on graphs. An eigenvector of L associated with its corresponding eigenvalue is an analogy to the complex exponential at a certain frequency. Given the eigenvalue decomposition of L` = UΛUᵀ where the l-th column of U is the eigenvector uₗ and Λₗ,ₗ is the corresponding eigenvalue λₗ, we can compute the Fourier transform of graph signals as:
The equation represents a graph signal defined in the so-called spectral domain. The inverse Fourier transformation, which converts graph signals in the spectral domain back into that in the original graph domain, can be written as:
The first notable spectral GCN was proposed by Bruna et al. [7] containing several convolutional layers that take a vector Xᵖ as the input feature map of p-th layer and output a feature map of Xᵖ⁺¹ by:
where Xᵖ(:, i) and Xᵖ⁺¹(:, j) are the i-th and j-th dimension of the input and output feature map, respectively; θᵖᵢ,ⱼ denotes a vector of learnable parameters of the filter of p-th layer; each column of V is the eigenvector of L, and σ(.) denotes the activation function. A drawback of this approach was that the time complexity of calculating the eigenvalue decomposition of the Laplacian matrix is O(n³). Even if eigenvalue decomposition is computed and stored internally, the time complexity of the above equation is still O(n²) and there are O(n) of learnable parameters for each layer which limits its usability. Besides, the non-parametric filters are localized in the vertex domain.
Defferard et al. [8] proposed the ChebNet that uses K-polynomial filters in convolutions for localization. Such a K-polynomial filter is represented by
It achieves good localization in the vertex domain in the vertex domain by integrating the node features within the K hop neighborhood and the number of the trainable parameters decreases to O(K) = O(1). In order to reduce the computational complexity, they used the Chebyshev polynomial approximation [9]. The convolutional layer is defined as follows:
where θᵖᵢ,ⱼ is the K-dimensional parameter vector for the i-th column of the input feature map and the j-th column of the output feature map at the p-th layer.
As a special variant, Kipf et al. [10] proposed a GCN in which they truncated the Chebyshev polynomial to the first-order (i.e., K = 2 in the above equation) and specifically set (θ)ᵢ,ⱼ(1) = — (θ)ᵢ,ⱼ(2) = θᵢ,ⱼ. The convolutional layer after the simplification is
where A` = I + A is equivalent to adding a self-loop to the original graph and D` is the diagonal degree matrix of A`, and θᵖ is the parameter matrix. The above form is essentially equivalent to aggregating node representations from their direct neighborhood and having a clear meaning of vertex localization, and thus is often considered as the bridging the gap between spectral-based and spatial-based methods. However, the memory space becomes a limiting factor for training such a network for large-scale graphs. In addition, the generalization of learning representation of unseen nodes of the same graph and nodes of an entirely different graph is difficult. FastGCN [11] addresses the training issue enabling faster training by assuming the input graph is an induced subgraph made of vertices that are i.i.d. sampled from a possibly infinite graph under some probability. For more details, please refer to [12].
A key criticism of spectral approaches is the fact that the spectral definition of convolution is dependent on the Fourier basis (Laplacian eigenbasis), which, in turn, is domain-dependent. It implies that a spectral GCN model learned on one graph cannot be trivially transferred to another graph that uses a different Fourier basis function, as it would be expressed in a ‘different language’ [1].
3. Spatial GCNs
The graph filtering of a signal x in the vertex domain is generally defined as a linear combination of the signal components in the nodes neighborhood:
where N (i, K) represents the K-hop neighborhood of node i in the graph and wᵢ,ⱼ are the weights used for the combination. According to graph filtering in the vertex domain, graph convolution can be viewed as an aggregation of graph signals within the node neighborhood.
A classic CNN models grid-like data such as images, exploiting the basic properties that include:
- the number of pixels in the neighborhood of a pixel is fixed;
- the spatial order of scanning, i.e, from left to right and top to bottom.
However, in an arbitrary graph, neither the number of neighbors of a node nor the spatial order among them is fixed. Many works have been proposed to build graph convolution directly upon the classical CNN to address this issue. Niepert et al. [13] proposed PATCHY-SAN model that first creates a fixed-length sequence of nodes and their order for generating a fixed-size neighborhood graph. The neighborhood graph is then normalized so that nodes of similar structural nodes are assigned similar relative positions. The normalized neighborhood serves as a receptive field for a node under consideration and feature learning components such as CNNs are applied.
Gao et al. [14] proposed Learnable Graph Convolutional Layer (LGCL) which selects a fixed number of neighboring nodes for each feature based on value ranking as compared to PATCHY-SAN that only depends on the structural information to transform the irregular graph data to grid-like data. They also proposed sub-graph model training to reduce memory and computational resource requirements. A way to extend classical CNNs that can only manage the data with the same topological structures to graph data is to develop a structure-aware convolution operation for both Euclidean and non-Euclidean data. Chang et al. [15] first built the connection between the classical filters and uni-variate functions (i.e., functional filters) and then modeled the graph structure into the generalized functional filters to be structural aware. The Chebyshev polynomial was used for approximation since the structure-aware convolution requires infinite parameters to learn.
Spatial graph convolutions based on propagated and aggregated node representation from neighboring nodes in the vertex domain have been extensively explored. Duvenaud et al. [16] designed the graph convolution of node u at p-th layer such that a new representation is an aggregation of information of its current representation and its neighborhood followed by multiplication with a weight matrix and activation. The weight matrix is unique for the nodes with the same degree in the p-th layer. For an arbitrarily large graph, the number of unique degrees of nodes is often very large, which implies that there are many unique weight matrices to be learned at each layer which possibly leads to an over-fitting problem.
Atwood et al. [17] proposed a graph diffusion-based convolution network (DCNN) that propagates and aggregates node representation using a graph diffusion process. The computational complexity of the method is O(n²K) where K is the number of hops of the diffusion. This limits its use for very large graphs. Monti et al. [1] proposed mixture model networks (MoNet), a general framework to design convolutional deep architectures on non-Euclidean domains such as graphs and manifolds. Following the general philosophy of spatial-domain methods of formulating convolution-like operations as template matching with local intrinsic patches on graphs, they employed parametric construction of patches as a function of a local graph and studied the family of functions represented as a mixture of Gaussian kernels. Such construction allows formulating GCN and DCNN as a particular instance of MoNet.
Hamilton et al. [18] proposed GraphSAGE (Sample and Aggregate), an aggregation-based inductive representation learning model that aggregates the neighboring nodes’ vector representation using some learnable aggregator. The node representation vector is concatenated with the aggregated representation and then fed into a fully connected layer with some non-linear activation function, followed by a normalization step.
where p denotes the layer index, u and v are nodes in the graph, and N(u) is a neighborhood of node u. There are several choices of learnable aggregator functions, including mean aggregator, LSTM aggregator, and pooling aggregator. For minibatch training, they provided a variant by uniformly sampling a fixed size of the neighboring nodes for each node.
4. Summary
We have introduced the concept of GCNs, in which information of nodes’ neighbors is aggregated using convolutional operators. We also introduced two main types of GCNs, i.e., 1) spectral GCNs which operates on the spectral domain of the graph data, and 2) spatial GCNs which directly operate on the original graph domains. Recent researches related to these two types of GCNs have also been discussed in this part. The next part, which is the final of this three-part series, will introduce various applications of GNNs to real-world problems.
References
Written by: Sertis Vision Lab
Originally published at https://www.sertiscorp.com/