M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, P. Vandergheynst (2016)
Geometric Deep Learning: Going beyond Euclidean data. IEEE Signal Process. Mag., 34
Jonathan Masci, D. Boscaini, M. Bronstein, P. Vandergheynst (2015)
Geodesic Convolutional Neural Networks on Riemannian Manifolds. 2015 IEEE International Conference on Computer Vision Workshop (ICCVW)
P. Sen, Galileo Namata, M. Bilgic, L. Getoor, Brian Gallagher, Tina Eliassi-Rad (2008)
Collective Classification in Network Data
N. Guttenberg, N. Virgo, O. Witkowski, H. Aoki, R. Kanai (2016)
Permutation-equivariant neural networks applied to dynamics prediction. ArXiv, abs/1612.04530
Yann LeCun, L. Bottou, Yoshua Bengio, P. Haffner (1998)
Gradient-based learning applied to document recognition. Proc. IEEE, 86
D. Lazer, A. Pentland, A. Adamić, Sinan Aral, A. Barabási, Devon Brewer, N. Christakis, N. Contractor, J. Fowler, M. Gutmann, T. Hebara, Gary King, M. Macy, D. Roy, Marshall Alstyne (2009)
Life in the network: the coming age of computational social science. Science, 323
S. Mallat (2011)
Group Invariant Scattering. Communications on Pure and Applied Mathematics, 65
Yi Yu, Tengyao Wang, R. Samworth (2014)
A useful variant of the Davis-Kahan theorem for statisticians. Biometrika, 102
Alex Nowak, Soledad Villar, A. Bandeira, Joan Bruna (2017)
A Note on Learning Algorithms for Quadratic Assignment with Graph Neural Networks. ArXiv, abs/1706.07450
D. Shuman, B. Ricaud, P. Vandergheynst (2013)
Vertex-Frequency Analysis on Graphs. ArXiv, abs/1307.5708
Joan Bruna, Wojciech Zaremba, Arthur Szlam, Yann LeCun (2013)
Spectral Networks and Locally Connected Networks on Graphs. CoRR, abs/1312.6203
Zhengdao Chen, Xiang Li, Joan Bruna (2017)
Supervised Community Detection with Hierarchical Graph Neural Networks
W. Czaja, Weilin Li (2016)
Analysis of time-frequency scattering transforms. Applied and Computational Harmonic Analysis
S. Fortunato (2009)
Community detection in graphs. ArXiv, abs/0906.0612
I. Daubechies (1992)
Ten Lectures on Wavelets. Computers in Physics, 6
J. Gilmer, S. Schoenholz, Patrick Riley, Oriol Vinyals, George Dahl (2017)
Neural Message Passing for Quantum Chemistry
Dongmian Zou, Gilad Lerman (2018)
Encoding robust representation for graph generation. 2019 International Joint Conference on Neural Networks (IJCNN)
S. Mahadevan, M. Maggioni (2005)
Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions
S. Sabour, Nicholas Frosst, Geoffrey Hinton (2017)
Dynamic Routing Between Capsules. ArXiv, abs/1710.09829
Yann LeCun, Yoshua Bengio, Geoffrey Hinton (2015)
Deep Learning. Nature, 521
David Hammond, P. Vandergheynst, Rémi Gribonval (2009)
Wavelets on Graphs via Spectral Graph Theory. ArXiv, abs/0912.3848
S. Mallat (1998)
A wavelet tour of signal processing
Joan Bruna, S. Mallat (2013)
Invariant Scattering Convolution Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35
Chandler Davis, W. Kahan (1970)
The Rotation of Eigenvectors by a Perturbation. III. SIAM Journal on Numerical Analysis, 7
E. Abbe (2017)
Community detection and stochastic block models: recent developments. J. Mach. Learn. Res., 18
Zhengdao Chen, Lisha Li, Joan Bruna (2017)
Supervised Community Detection with Line Graph Neural Networks. arXiv: Machine Learning
R. Coifman, Mauro Maggioni (2004)
Diffusion Wavelets
Joan Bruna, Xiang Li (2017)
Community Detection with Graph Neural Networks
D. Duvenaud, D. Maclaurin, J. Aguilera-Iparraguirre, Rafael Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, Ryan Adams (2015)
Convolutional Networks on Graphs for Learning Molecular Fingerprints. ArXiv, abs/1509.09292
Yotam Hechtlinger, Purvasha Chakravarti, Jining Qin (2017)
A Generalization of Convolutional Neural Networks to Graph-Structured Data. ArXiv, abs/1704.08165
A. Sandryhaila, José Moura (2013)
Discrete Signal Processing on Graphs: Frequency Analysis. IEEE Transactions on Signal Processing, 62
Xu Chen, Xiuyuan Cheng, S. Mallat (2014)
Unsupervised Deep Haar Scattering on Graphs
Michael Edwards, Xianghua Xie (2016)
Graph Based Convolutional Neural Network. ArXiv, abs/1609.08965
R. Kondor, H. Son, Horace Pan, Brandon Anderson, Shubhendu Trivedi (2018)
Covariant Compositional Networks For Learning Graphs. ArXiv, abs/1801.02144
Thomas Kipf, M. Welling (2016)
Semi-Supervised Classification with Graph Convolutional Networks. ArXiv, abs/1609.02907
Thomas Wiatowski, P. Grohs, H. Bölcskei (2017)
Energy Propagation in Deep Convolutional Neural Networks. IEEE Transactions on Information Theory, 64
M. Defferrard, X. Bresson, P. Vandergheynst (2016)
Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering

We generalize the scattering transform to graphs and consequently construct a convolutional neural network on graphs. We show that under certain conditions, any feature generated by such a network is approximately invariant to permutations and stable to graph manipulations. Numerical results demonstrate competitive performance on relevant datasets.

1 Introduction

Many interesting and modern datasets can be described by graphs. Examples include social [1], physical [2], and transportation [3] networks. The recent survey paper of Bronstein et al. [4] on geometric deep learning emphasizes the need to develop deep learning tools for such datasets and, even more importantly, to understand the mathematical properties of these tools, in particular, their invariances. They also mention two types of problems that may be addressed by such tools. The first problem is signal analysis on graphs, with applications such as classification, prediction and inference on graphs. The second problem is learning the graph structure, with applications such as graph clustering and graph matching.

Several recent works address the first problem [5, 6, 7, 8]. In these works, the filters of the networks are designed to be parametric functions of graph operators, such as the graph adjacency and Laplacian, and the parameters of those functions have to be trained. The second problem is often explored with random graphs generated according to two common models: Erdős-Rényi, which is used for graph matching, and the Stochastic Block Model (SBM), which is used for community detection. Some recent graph neural networks have obtained state-of-the-art performance for graph matching [9] and community detection [10, 11] with synthetic data generated from the respective graph models. As above, the filters in these works are parametric functions of either the graph adjacency or Laplacian, where the parameters are trained.

Despite the impressive progress in developing graph neural networks for solving these two problems, the performance of these methods is poorly understood. Of main interest is their invariance or stability to basic signal and graph manipulations. In the Euclidean case, the stability of a convolutional neural network [12] to rigid transformations and deformations is best understood in view of the scattering transform [13]. The scattering transform has a multilayer structure and uses wavelet filters to propagate signals. It can be viewed as a convolutional neural network where no training is required to design the filters. Training is only required for the classifiers given the transformed data. Nevertheless, there is freedom in the selection and design of the wavelets. The scattering transform is approximately invariant to translation and rotation. More precisely, under strong assumptions on the wavelet and scaling functions and as the coarsest scale $J$ approaches $\infty$, the scattering transform becomes invariant to translations and rotations. Moreover, it is Lipschitz continuous with respect to smooth deformation. These properties are shown in [13] for signals in $L^2(\mathbb{R}^d)$ and $L^2(H)$, where $H$ is a compact Lie group.

It is interesting to note that the design of filters in existing graph neural networks is related to the design of wavelets on graphs in the signal processing literature. Indeed, the construction of wavelets on graphs uses special operators on graphs such as the graph adjacency and Laplacian. As mentioned above, these operators are commonly used in graph neural networks. The earliest works on graph wavelets [14, 15] apply the normalized graph Laplacian to define the diffusion wavelets on graphs and use them to study multiresolution decomposition of graph signals. Hammond et al.
[16] use the unnormalized graph Laplacian to define analogous graph wavelets and study properties of these wavelets such as reconstructibility and locality. One can easily construct a graph scattering transform by using any of these wavelets. A main question is whether this scattering transform enjoys the desired invariance and stability properties.

In this work, we use a special instance of the graph wavelets of [16] to form a graph scattering network and establish its covariance and approximate invariance to permutations and stability to graph manipulations. We also demonstrate the practical effectiveness of this transform in solving the two types of problems discussed above.

The rest of the paper is organized as follows. The scattering transform on graphs is defined in Section 2. Section 3 shows that the full scattering transform preserves the energy of the input signal. This section also provides an absolute bound on the energy decay rate of components of the transform at each layer. Section 4 proves the permutation covariance and approximate invariance of the graph scattering transform. It also briefly discusses previously suggested candidates for the notion of translation or localization on graphs and the possible covariance and approximate invariance of the scattering transform with respect to them. Furthermore, it clarifies why some special permutations are good substitutes for Euclidean rigid transformations. Section 5 establishes the stability of the scattering transform with respect to graph manipulations. Section 6 demonstrates competitive performance of the proposed graph neural network in solving the two types of problems.

arXiv:1804.00099v2 [cs.IT] 18 Nov 2018

2 Wavelet graph convolutional neural network

We first review the graph wavelets of [16] in Section 2.1. We then use these wavelets and ideas of [13] to construct a graph scattering transform in Section 2.2.

2.1 Wavelets on graphs

We review the wavelet construction of Hammond et al.
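[16] and adapt it to our setting. Our general theory applies to what we call simple graphs, that is, weighted, undirected and connected graphs with no self-loops. We remark that we may also address self-loops, but for simplicity we exclude them. Throughout the paper we fix an arbitrary simple graph $G = (V, E)$ with $N$ vertices. We also consistently use uppercase boldface letters to denote matrices and lowercase boldface letters to denote vectors or vector-valued functions.

The weight matrix of $G$ is an $N \times N$ symmetric matrix $\mathbf{W}$ with zero diagonal, where $\mathbf{W}(n, m)$ denotes the weight assigned to the edge $\{n, m\}$ of $G$. The degree matrix of $G$ is an $N \times N$ diagonal matrix $\mathbf{D}$ with

$$\mathbf{D}(n, n) = \sum_{m=1}^{N} \mathbf{W}(n, m), \quad 1 \le n \le N. \tag{1}$$

The (unnormalized) Laplacian of $G$ is the $N \times N$ matrix

$$\mathbf{L} = \mathbf{D} - \mathbf{W}. \tag{2}$$

The eigenvalues of $\mathbf{L}$ are non-negative and the smallest one is 0. Since the graph is connected, the eigenspace of 0 (that is, the kernel of $\mathbf{L}$) has dimension one. It is spanned by a vector with equal nonzero entries for all vertices. This vector represents a signal of the lowest possible "frequency". The graph Laplacian $\mathbf{L}$ is symmetric and can be represented as

$$\mathbf{L} = \sum_{l=0}^{N-1} \lambda_l \mathbf{u}_l \mathbf{u}_l^{*}, \tag{3}$$

where $0 = \lambda_0 < \lambda_1 \le \cdots \le \lambda_{N-1}$ are the eigenvalues of $\mathbf{L}$, $\mathbf{u}_0, \ldots, \mathbf{u}_{N-1}$ are the corresponding eigenvectors, and $*$ denotes the conjugate transpose. We remark that the phases of the eigenvectors of $\mathbf{L}$ and their order within any eigenspace of dimension larger than 1 can be arbitrarily chosen without affecting our theory for the graph scattering transform formulated below.

Let $\mathbf{f} \in L^2(G)$ be a graph signal. Note that in our setting we can regard $L^2(G) \cong L^2(V) \cong \mathbb{C}^N$, and without further specification we shall consider $\mathbf{f} \in \mathbb{C}^N$. We define the Fourier transform $\mathcal{F} : \mathbb{C}^N \to \mathbb{C}^N$ by

$$\mathcal{F}\mathbf{f} = \hat{\mathbf{f}} := \left(\mathbf{u}_l^{*}\mathbf{f}\right)_{l=0}^{N-1}, \tag{4}$$

and the inverse Fourier transform $\mathcal{F}^{-1} : \mathbb{C}^N \to \mathbb{C}^N$ by

$$\mathcal{F}^{-1}\hat{\mathbf{f}} := \sum_{l=0}^{N-1} \hat{\mathbf{f}}(l)\,\mathbf{u}_l. \tag{5}$$

Let $\odot$ denote the Hadamard product, that is, for $\mathbf{g}_1, \mathbf{g}_2 \in \mathbb{C}^N$, $(\mathbf{g}_1 \odot \mathbf{g}_2)(l) = \mathbf{g}_1(l)\,\mathbf{g}_2(l)$, $l = 0, \ldots, N-1$. Define the convolution of $\mathbf{f}_1$ and $\mathbf{f}_2$ in $L^2(G)$ as the inverse Fourier transform of $\hat{\mathbf{f}}_1 \odot \hat{\mathbf{f}}_2$, that is,

$$\mathbf{f}_1 * \mathbf{f}_2 = \mathcal{F}^{-1}\big(\hat{\mathbf{f}}_1 \odot \hat{\mathbf{f}}_2\big) = \sum_{l=0}^{N-1} \mathbf{u}_l\, \hat{\mathbf{f}}_1(l)\, \hat{\mathbf{f}}_2(l) = \sum_{l=0}^{N-1} \mathbf{u}_l \mathbf{u}_l^{*} \mathbf{f}_1\, \hat{\mathbf{f}}_2(l) = \sum_{l=0}^{N-1} \mathbf{u}_l \mathbf{u}_l^{*} \mathbf{f}_1\, \mathbf{u}_l^{*} \mathbf{f}_2. \tag{6}$$

When emphasizing the dependence of $*$ on the graph $G$, we denote it by $*_G$.

Euclidean wavelets use shift and scale in Euclidean space. For signals defined on graphs, which are discrete, the notions of translation and dilation need to be defined in the spectral domain. Hammond et al. [16] view $\mathbb{R}$ as the spectral domain since it contains the eigenvalues of $\mathbf{L}$. Their procedure assumes a scaling function $\phi$ and a wavelet function $\psi$ [17, 18] with corresponding Fourier transforms $\hat{\phi}$ and $\hat{\psi}$. They have minimal assumptions on $\phi$ and $\psi$. In our construction, we consider dyadic wavelets, that is,

$$\hat{\psi}_j(\omega) = \hat{\psi}(2^{-j}\omega), \quad j \in \mathbb{Z}. \tag{7}$$

Also, we fix a scale $J \in \mathbb{Z}$ of coarsest resolution and assume that $\phi$ and $\psi$ can be constructed from multiresolution analysis, that is,

$$\hat{\phi}_{-J}^{2} + \sum_{j > -J} \hat{\psi}_{j}^{2} = 1. \tag{8}$$

The graph wavelets of [16] are constructed as follows in our setting. For $j > -J$, denote by $\hat{\boldsymbol{\psi}}_j$ the vector in $\mathbb{C}^N$ with the following entries: $\hat{\boldsymbol{\psi}}_j(l) = \hat{\psi}_j(\lambda_l) = \hat{\psi}(2^{-j}\lambda_l)$, $l = 0, \ldots, N-1$. Similarly, $\hat{\boldsymbol{\phi}}_{-J}(l) = \hat{\phi}_{-J}(\lambda_l) = \hat{\phi}(2^{J}\lambda_l)$. In view of (6),

$$\mathbf{f} * \boldsymbol{\psi}_j = \sum_{l=0}^{N-1} \mathbf{u}_l \mathbf{u}_l^{*} \mathbf{f}\, \hat{\psi}(2^{-j}\lambda_l) \ \text{ for } j > -J \quad \text{and} \quad \mathbf{f} * \boldsymbol{\phi}_{-J} = \sum_{l=0}^{N-1} \mathbf{u}_l \mathbf{u}_l^{*} \mathbf{f}\, \hat{\phi}(2^{J}\lambda_l). \tag{9}$$

Note that $\mathbf{f} * \boldsymbol{\psi}_j$ and $\mathbf{f} * \boldsymbol{\phi}_{-J}$ are both in $\mathbb{C}^N$. The graph wavelet coefficients of $\mathbf{f} \in L^2(G)$ are defined by

$$\mathbf{Q}_{-J}\mathbf{f} := \mathbf{f} * \boldsymbol{\phi}_{-J} \quad \text{and} \quad \mathbf{Q}_j\mathbf{f} := \mathbf{f} * \boldsymbol{\psi}_j, \quad j > -J. \tag{10}$$

We use boldface notation for $\{\mathbf{Q}_j\}_{j \ge -J}$ to emphasize that they are operators even though the wavelet coefficients $\mathbf{Q}_j\mathbf{f}(n)$, $n = 1, \ldots, N$, are scalars. At last, we note that (8) implies that $\hat{\psi}(0) = 0$. Combining this fact and (9) results in

$$\mathbf{u}_0^{*}\mathbf{Q}_j = \mathbf{0} \quad \text{for all } j > -J. \tag{11}$$

2.2 Scattering on graphs

Our construction of convolutional neural networks on graphs is inspired by Mallat's scattering network [13].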
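
Before turning to the scattering network, the wavelet operators of Section 2.1 can be sketched numerically. The snippet below is a minimal illustration, not the construction used in the paper's experiments: it assumes ideal (Shannon-type) dyadic filters, where the low-pass filter keeps eigenvalues in $[0, 2^{-J})$ and band $j$ keeps eigenvalues in $[2^{j-1}, 2^{j})$, so that the squared filters sum to one on the spectrum as in (8). The function names and the choice $J = 1$ are hypothetical; the 4-vertex weight matrix is the example graph that appears in Section 4.1.

```python
import numpy as np

def graph_laplacian(W):
    """Unnormalized Laplacian L = D - W of a weighted graph, as in (1)-(2)."""
    return np.diag(W.sum(axis=1)) - W

def shannon_wavelet_operators(W, J):
    """Return (Q_low, Q_bands) for ideal dyadic band-pass filters.

    Assumption (not from the paper): the low-pass filter keeps eigenvalues in
    [0, 2^-J) and band j keeps eigenvalues in [2^(j-1), 2^j) for j > -J.
    """
    lam, U = np.linalg.eigh(graph_laplacian(W))   # L = U diag(lam) U^T

    def spectral_operator(mask):
        # U diag(mask) U^T: multiply the l-th Fourier coefficient by mask[l]
        return (U * mask.astype(float)) @ U.T

    Q_low = spectral_operator(lam < 2.0 ** (-J))
    Q_bands = {}
    j = -J + 1
    while 2.0 ** (j - 1) <= lam.max():            # add bands until the spectrum is covered
        Q_bands[j] = spectral_operator((lam >= 2.0 ** (j - 1)) & (lam < 2.0 ** j))
        j += 1
    return Q_low, Q_bands

# Example graph with 4 vertices and a small test signal
W = np.array([[0, 2, 1, 1],
              [2, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)
f = np.array([2.0, 1.0, 0.0, 0.0])

Q_low, Q_bands = shannon_wavelet_operators(W, J=1)
energy = np.sum((Q_low @ f) ** 2) + sum(np.sum((Q @ f) ** 2) for Q in Q_bands.values())
print(np.isclose(energy, f @ f))                                        # energy split over the filters
print(all(np.allclose(np.ones(4) @ Q, 0.0) for Q in Q_bands.values()))  # property (11)
```

With these indicator filters each $\mathbf{Q}_j$ is an orthogonal projector onto a group of eigenvectors, which makes the two checks easy to interpret; smoother filters satisfying (8) would pass the same checks.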
As a feature extractor, the scattering transform defined in [13] is translation and rotation invariant when the coarsest scale approaches $\infty$ and the wavelet and scaling functions satisfy some strong admissibility conditions. It is also Lipschitz continuous with respect to small deformations. The neural network representation of the scattering transform is the scattering network. It has been successfully used in image and audio classification problems [19].

We form a scattering network on graphs in a similar way, while using the graph wavelets defined above and the following definitions. A path $p = (j_1, \ldots, j_m)$ is an ordering of the scales of wavelets $j_1, \ldots, j_m > -J$. The length of the path $p$ is $|p| = m$. The length of an empty path $p = \emptyset$ is zero. For a path $p = (j_1, \ldots, j_m)$ as above and a scale $j_{m+1} > -J$, we define the path $p + j_{m+1}$ as $p + j_{m+1} = (j_1, \ldots, j_m, j_{m+1})$. For a vector $\mathbf{v} \in \mathbb{R}^N$ we denote $|\mathbf{v}| = (|\mathbf{v}(n)|)_{n=1}^{N}$ and note that the vectors $\mathbf{v}$ and $|\mathbf{v}|$ have the same norm.

For a scale $j > -J$, the one-step propagator $U[j] : \mathbb{R}^N \to \mathbb{R}^N$ is defined by

$$U[j]\mathbf{f} = |\mathbf{Q}_j\mathbf{f}| = |\mathbf{f} * \boldsymbol{\psi}_j| = \big(|\mathbf{f} * \boldsymbol{\psi}_j(n)|\big)_{n=1}^{N}, \quad \forall\, \mathbf{f} \in \mathbb{R}^N. \tag{12}$$

For $p \ne \emptyset$, the scattering propagator $U[p] : \mathbb{R}^N \to \mathbb{R}^N$ is defined by

$$U[p] = U[j_m]\, U[j_{m-1}] \cdots U[j_1]. \tag{13}$$

For the empty path, we define $U[\emptyset]\mathbf{f} = \mathbf{f}$. We note that for any path $p$ and any scale $j_{m+1} > -J$,

$$U[p + j_{m+1}] = U[j_{m+1}]\, U[p]. \tag{14}$$

The windowed scattering transform for a path $p$ is defined by

$$S[p]\mathbf{f}(n) = \mathbf{Q}_{-J} U[p]\mathbf{f}(n) = U[p]\mathbf{f} * \boldsymbol{\phi}_{-J}(n) = \sum_{l=0}^{N-1} \hat{\phi}(2^{J}\lambda_l)\, \big(\mathbf{u}_l \mathbf{u}_l^{*} U[p]\mathbf{f}\big)(n). \tag{15}$$

Let $\Lambda^m$ denote the set of all paths of length $m \in \mathbb{N} \cup \{0\}$, i.e., $\Lambda^m = \{p : |p| = m\}$. The collection of all paths of finite length is denoted by $\mathcal{P}_J := \bigcup_{m=0}^{\infty} \Lambda^m$. The scattering propagator and the scattering transform with respect to $\mathcal{P}_J$, which we denote by $U[\mathcal{P}_J]$ and $S[\mathcal{P}_J] : \mathbb{C}^N \to (\mathbb{C}^N)^{|\mathcal{P}_J|}$ respectively, are defined as

$$U[\mathcal{P}_J]\mathbf{f} = \big(U[p]\mathbf{f}\big)_{p \in \mathcal{P}_J} \quad \text{and} \quad S[\mathcal{P}_J]\mathbf{f} = \big(S[p]\mathbf{f}\big)_{p \in \mathcal{P}_J}, \quad \forall\, \mathbf{f} \in \mathbb{C}^N. \tag{16}$$

When we emphasize the dependence of the scattering propagator and transform on the graph $G$, we denote them by $U[G][\mathcal{P}_J]$ and $S[G][\mathcal{P}_J]$ respectively. A naturally defined norm on $U[\mathcal{P}_J]\mathbf{f}$ and $S[\mathcal{P}_J]\mathbf{f}$ is

$$\|U[\mathcal{P}_J]\mathbf{f}\| = \Big(\sum_{p \in \mathcal{P}_J} \|U[p]\mathbf{f}\|^2\Big)^{1/2} \quad \text{and} \quad \|S[\mathcal{P}_J]\mathbf{f}\| = \Big(\sum_{p \in \mathcal{P}_J} \|S[p]\mathbf{f}\|^2\Big)^{1/2}, \tag{17}$$

where $\|\cdot\| = \|\cdot\|_2$ denotes the $\ell^2$-norm on $\mathbb{C}^N$.

In the terminology of deep learning, the scattering transform acts as a convolutional neural network on graphs. At the $m$-th layer, where $m \ge 0$, the propagated signal is $\{U[p]\mathbf{f} : p \in \Lambda^m\}$ and the extracted feature is $\{S[p]\mathbf{f} : p \in \Lambda^m\}$. This network is illustrated in Figure 1.

Figure 1: Network representation of the scattering transform.

In a similar way, we can define the scattering transform for matrices of signals on graphs. Let $\mathbf{F} = (\mathbf{f}_1, \mathbf{f}_2, \ldots, \mathbf{f}_D) \in \mathbb{C}^{N \times D}$, where for each $1 \le d \le D$, $\mathbf{f}_d$ is a complex signal of length $N$ on the same underlying graph. We define

$$S[\mathcal{P}_J]\mathbf{F} := \big(S[\mathcal{P}_J]\mathbf{f}_d\big)_{d=1}^{D} \tag{18}$$

and

$$\|S[\mathcal{P}_J]\mathbf{F}\|_F := \Big(\sum_{d=1}^{D} \|S[\mathcal{P}_J]\mathbf{f}_d\|^2\Big)^{1/2}. \tag{19}$$

Note that $\|S[p]\mathbf{F}\|_F$ is the Frobenius norm of the matrix $S[p]\mathbf{F} = (S[p]\mathbf{f}_d)_{d=1}^{D}$. Here and throughout the rest of the paper we denote by $\|\mathbf{A}\|_F$ the Frobenius norm of a matrix $\mathbf{A}$.

3 Energy preservation

We discuss the preservation of energy of a given signal by the scattering transform. The signal is either $\mathbf{f} \in \mathbb{C}^N$ with the energy $\|\mathbf{f}\|^2$ or $\mathbf{F} \in \mathbb{C}^{N \times D}$ with the energy $\|\mathbf{F}\|_F^2$. We first formulate our main result.

Theorem 3.1. The scattering transform is norm preserving. That is, for $\mathbf{f} \in \mathbb{C}^N$ or $\mathbf{F} \in \mathbb{C}^{N \times D}$,

$$\|S[\mathcal{P}_J]\mathbf{f}\|^2 = \|\mathbf{f}\|^2 \quad \text{and} \quad \|S[\mathcal{P}_J]\mathbf{F}\|_F^2 = \|\mathbf{F}\|_F^2. \tag{20}$$

The analog of Theorem 3.1 in the Euclidean case appears in [13, Theorem 2.6]. However, the proof is different for the graph case. One basic observation analogous to the Euclidean case is the following.

Proposition 3.2.
For $\mathbf{f} \in \mathbb{C}^N$ and $m \in \mathbb{N}$,

$$\sum_{p \in \Lambda^m} \|U[p]\mathbf{f}\|^2 = \sum_{p \in \Lambda^{m+1}} \|U[p]\mathbf{f}\|^2 + \sum_{p \in \Lambda^m} \|S[p]\mathbf{f}\|^2. \tag{21}$$

This proposition can be rephrased as follows: the propagated energy at the $m$-th layer splits into the propagated energy at the next layer and the output energy at the current layer.

In order to conclude Theorem 3.1 from Proposition 3.2, we quantify the decay rate of the propagated energy, which may be of independent interest. A fast decay rate means that few layers are sufficient to extract most of the energy of the signal. We define the decay rate of the scattering transform at a given layer as follows.

Definition 1. For $J \in \mathbb{N}$, $m \in \mathbb{N}$ and $r > 0$, the energy decay rate of $S[\mathcal{P}_J]$ at the $m$-th layer is $r$ if

$$\sum_{p \in \Lambda^{m+1}} \|U[p]\mathbf{f}\|^2 \le r \sum_{p \in \Lambda^m} \|U[p]\mathbf{f}\|^2. \tag{22}$$

In practice, different choices of graph $G$ and scale $J$ lead to different energy decay rates. Nevertheless, we establish the following generic result that applies to all graph scattering transforms under the construction in Section 2.2.

Proposition 3.3. The scattering transform $S[\mathcal{P}_J]$ has energy decay rate of at least $1 - 2/N$ at all layers but the first one.

This is the sharpest generic decay rate, though a better one can be obtained with additional assumptions on $J$, $\phi$, and $\mathbf{L}$. Note that in the Euclidean domain, no such generic result exists. Therefore, one has to choose the wavelets very carefully (see the admissibility condition in [13, Theorem 2.6]). Numerical results illustrating the energy decay in the Euclidean domain are given in [19]. Furthermore, theoretical rates are provided in [20] and [21], where [20] introduces additional assumptions on the smoothness of input signals and the bandwidth of filters and [21] studies time-frequency frames instead of wavelets.

In practice, the propagated energy seems to decrease much faster than the generic rate stated in Proposition 3.3. Figure 2 illustrates this claim. It considers 100 randomly selected images from the MNIST database [22]. A graph that represents a grid of pixels shared by these images is used.
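Details of the graph and the dataset are described in Section 6.1. The figure reports the box plots of the cumulative percentage of the output energy of the scattering transform with $J = 3$ for the first four layers and the 100 input images. That is, at layer $1 \le m \le 4$ the cumulative percentage for an image $\mathbf{f}$ is $\sum_{k=1}^{m} \sum_{p \in \Lambda^{k-1}} \|S[p]\mathbf{f}\|^2 / \|\mathbf{f}\|^2$. We see that in the third layer, the scattering transform already extracts almost all the energy of the signal. Therefore, in practice we can estimate the graph scattering transform with a small number of layers, which is also evident in practice for the Euclidean scattering transform [19].

Figure 2: Demonstration of fast energy decay rate for the graph scattering transform on MNIST. One hundred random images are drawn from the MNIST database, and the scattering transform is applied with the graph described in Section 6.1. The box plots summarize the distribution of the cumulative energy percentages for the random images.

3.1 Proof of Proposition 3.2

Application of (9) and later (8) implies that for any $\mathbf{f} \in \mathbb{C}^N$

$$\begin{aligned}
\sum_{j > -J} \|\mathbf{f} * \boldsymbol{\psi}_j\|^2 + \|\mathbf{f} * \boldsymbol{\phi}_{-J}\|^2
&= \sum_{j > -J} \Big\|\sum_{l=0}^{N-1} \mathbf{u}_l \mathbf{u}_l^{*} \mathbf{f}\, \hat{\psi}(2^{-j}\lambda_l)\Big\|^2 + \Big\|\sum_{l=0}^{N-1} \mathbf{u}_l \mathbf{u}_l^{*} \mathbf{f}\, \hat{\phi}(2^{J}\lambda_l)\Big\|^2 \\
&= \sum_{j > -J} \sum_{l=0}^{N-1} \big|\mathbf{u}_l^{*} \mathbf{f}\, \hat{\psi}(2^{-j}\lambda_l)\big|^2 + \sum_{l=0}^{N-1} \big|\mathbf{u}_l^{*} \mathbf{f}\, \hat{\phi}(2^{J}\lambda_l)\big|^2 \\
&= \sum_{l=0}^{N-1} |\mathbf{u}_l^{*} \mathbf{f}|^2 \Big(\sum_{j > -J} \hat{\psi}(2^{-j}\lambda_l)^2 + \hat{\phi}(2^{J}\lambda_l)^2\Big) \\
&= \sum_{l=0}^{N-1} |\mathbf{u}_l^{*} \mathbf{f}|^2 = \|\mathbf{f}\|^2.
\end{aligned} \tag{23}$$

Replacing $\mathbf{f}$ with $U[p]\mathbf{f}$, summing over all paths with length $m$ and applying (14) yields

$$\begin{aligned}
\sum_{p \in \Lambda^m} \|U[p]\mathbf{f}\|^2
&= \sum_{p \in \Lambda^m} \Big(\sum_{j > -J} \|U[p]\mathbf{f} * \boldsymbol{\psi}_j\|^2 + \|U[p]\mathbf{f} * \boldsymbol{\phi}_{-J}\|^2\Big) \\
&= \sum_{p \in \Lambda^m} \Big(\sum_{j > -J} \|\mathbf{Q}_j U[p]\mathbf{f}\|^2 + \|S[p]\mathbf{f}\|^2\Big) \\
&= \sum_{p \in \Lambda^m} \sum_{j > -J} \|U[p + j]\mathbf{f}\|^2 + \sum_{p \in \Lambda^m} \|S[p]\mathbf{f}\|^2 \\
&= \sum_{p \in \Lambda^{m+1}} \|U[p]\mathbf{f}\|^2 + \sum_{p \in \Lambda^m} \|S[p]\mathbf{f}\|^2.
\end{aligned} \tag{24}$$

3.2 Proof of Proposition 3.3

Recall that $\lambda_0 = 0$ and $\mathbf{u}_0 = \frac{\alpha}{\sqrt{N}} (1, \ldots, 1)^{T}$, where $\alpha \in \mathbb{C}$ with $|\alpha| = 1$. Note that (8) implies that $\hat{\phi}(0) = 1$. Note further that for any $p \in \Lambda^m$, $m \ge 1$, the entries of $U[p]\mathbf{f}$ are non-negative due to the absolute value in (12), and thus $|\mathbf{u}_0^{*} U[p]\mathbf{f}| = \|U[p]\mathbf{f}\|_1 / \sqrt{N}$.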
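Consequently,

$$\begin{aligned}
\|S[p]\mathbf{f}\|^2 = \|\mathbf{Q}_{-J} U[p]\mathbf{f}\|^2
&= \Big\|\sum_{l=0}^{N-1} \hat{\phi}(2^{J}\lambda_l)\, \mathbf{u}_l \mathbf{u}_l^{*} U[p]\mathbf{f}\Big\|^2 \\
&= \sum_{l=0}^{N-1} \big|\hat{\phi}(2^{J}\lambda_l)\, \mathbf{u}_l^{*} U[p]\mathbf{f}\big|^2 \ \ge\ \big|\hat{\phi}(2^{J}\lambda_0)\, \mathbf{u}_0^{*} U[p]\mathbf{f}\big|^2 = \frac{1}{N} \|U[p]\mathbf{f}\|_1^2.
\end{aligned} \tag{25}$$

Furthermore, we claim that

$$\|U[p]\mathbf{f}\|_1^2 \ge 2\, \|U[p]\mathbf{f}\|^2. \tag{26}$$

Indeed, in view of (11)-(13) and the form of $\mathbf{u}_0$, $U[p]\mathbf{f} = |\mathbf{g}|$, where $\mathbf{g} \in \mathbb{C}^N$ satisfies $(1, \ldots, 1)\,\mathbf{g} = 0$. One can easily show that the minimal value of $\|\mathbf{g}\|_1^2$, over all $\mathbf{g} \in \mathbb{C}^N$ satisfying $\|\mathbf{g}\| = 1$ and $(1, \ldots, 1)\,\mathbf{g} = 0$, equals 2 and this concludes (26). Combining (25) and (26) and summing the resulting inequality over $p \in \Lambda^m$ yields

$$\sum_{p \in \Lambda^m} \|S[p]\mathbf{f}\|^2 \ge \frac{2}{N} \sum_{p \in \Lambda^m} \|U[p]\mathbf{f}\|^2. \tag{27}$$

The combination of (21) and (27) concludes the proof as follows

$$\sum_{p \in \Lambda^{m+1}} \|U[p]\mathbf{f}\|^2 \le \Big(1 - \frac{2}{N}\Big) \sum_{p \in \Lambda^m} \|U[p]\mathbf{f}\|^2. \tag{28}$$

An improvement of this decay rate is possible if and only if one may strengthen the single inequality in (25) and the inequality in (26). We show that these inequalities can be equalities for special cases and thus the stated generic decay rate is sharp. We first note that equality occurs in the inequality of (25) if, for example, $\hat{\phi}$ is the indicator function of $[0, 2^{J}\lambda_1)$. Equality occurs in the second inequality when $U[p]\mathbf{f}$ has exactly two non-zero elements, for example, when $N = 2$. These two cases can be simultaneously satisfied. We comment that certain choices of $J$, $\phi$, and $\mathbf{L}$ imply different inequalities with stronger decay rates.

3.3 Proof of Theorem 3.1

We write (21) as

$$\sum_{p \in \Lambda^m} \|S[p]\mathbf{f}\|^2 = \sum_{p \in \Lambda^m} \|U[p]\mathbf{f}\|^2 - \sum_{p \in \Lambda^{m+1}} \|U[p]\mathbf{f}\|^2 \tag{29}$$

and sum over $m \ge 0$, while recalling that $U[\emptyset]\mathbf{f} := \mathbf{f}$, to obtain that

$$\begin{aligned}
\|S[\mathcal{P}_J]\mathbf{f}\|^2 &= \sum_{m \ge 0} \sum_{p \in \Lambda^m} \|S[p]\mathbf{f}\|^2 \\
&= \sum_{m \ge 0} \Big(\sum_{p \in \Lambda^m} \|U[p]\mathbf{f}\|^2 - \sum_{p \in \Lambda^{m+1}} \|U[p]\mathbf{f}\|^2\Big) \\
&= \lim_{m \to \infty} \Big(\sum_{p \in \Lambda^0} \|U[p]\mathbf{f}\|^2 - \sum_{p \in \Lambda^{m+1}} \|U[p]\mathbf{f}\|^2\Big) \\
&= \|\mathbf{f}\|^2 - \lim_{m \to \infty} \sum_{p \in \Lambda^{m+1}} \|U[p]\mathbf{f}\|^2.
\end{aligned} \tag{30}$$

Combining Proposition 3.3 and (23) yields

$$\sum_{p \in \Lambda^{m+1}} \|U[p]\mathbf{f}\|^2 \le \Big(1 - \frac{2}{N}\Big)^{m} \sum_{p \in \Lambda^{1}} \|U[p]\mathbf{f}\|^2 \le \Big(1 - \frac{2}{N}\Big)^{m} \|\mathbf{f}\|^2 \to 0, \quad \text{as } m \to \infty. \tag{31}$$

The first equality in (20) clearly follows from (30) and (31). The second equality in (20) is an immediate consequence of the first equality and the observation that for $\mathbf{F} = (\mathbf{f}_1, \ldots, \mathbf{f}_D)$, $\|\mathbf{F}\|_F^2 = \sum_{d=1}^{D} \|\mathbf{f}_d\|^2$.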
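
The energy statements of this section can be checked numerically on a small graph. The sketch below is only an illustration under stated assumptions: it uses ideal dyadic indicator filters (a hypothetical choice that satisfies (8) on the spectrum), a 4-vertex example graph, $J = 1$ and four layers, none of which are the settings of the MNIST experiment. It tracks the propagated energy $\sum_{p \in \Lambda^m} \|U[p]\mathbf{f}\|^2$ and the output energy $\sum_{p \in \Lambda^m} \|S[p]\mathbf{f}\|^2$ per layer and tests the splitting (21) and the decay bound (28).

```python
import numpy as np

def shannon_operators(W, J):
    """Low-pass and band-pass spectral projectors (an illustrative filter choice)."""
    L = np.diag(W.sum(axis=1)) - W
    lam, U = np.linalg.eigh(L)
    project = lambda mask: (U * mask.astype(float)) @ U.T
    Q_low = project(lam < 2.0 ** (-J))
    Q_bands, j = [], -J + 1
    while 2.0 ** (j - 1) <= lam.max():
        Q_bands.append(project((lam >= 2.0 ** (j - 1)) & (lam < 2.0 ** j)))
        j += 1
    return Q_low, Q_bands

def layer_energies(W, f, J, depth):
    """Propagated and output energy at layers 0..depth of the scattering network."""
    Q_low, Q_bands = shannon_operators(W, J)
    layer = [np.asarray(f, dtype=float)]          # U[p]f for the empty path
    propagated, output = [], []
    for m in range(depth + 1):
        propagated.append(sum(np.sum(u ** 2) for u in layer))
        output.append(sum(np.sum((Q_low @ u) ** 2) for u in layer))
        if m < depth:
            # one-step propagators (12): U[p + j]f = |Q_j U[p]f|
            layer = [np.abs(Q @ u) for u in layer for Q in Q_bands]
    return np.array(propagated), np.array(output)

W = np.array([[0, 2, 1, 1],
              [2, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)
f = np.array([2.0, 1.0, 0.0, 0.0])
N = len(f)

propagated, output = layer_energies(W, f, J=1, depth=4)
print(np.allclose(propagated[:-1], propagated[1:] + output[:-1]))         # splitting (21)
print(np.all(propagated[2:] <= (1 - 2 / N) * propagated[1:-1] + 1e-12))   # bound (28), layers m >= 1
print(np.cumsum(output) / propagated[0])                                  # cumulative output energy fraction
```

The last line prints the cumulative fraction of energy captured by the output up to each layer, which is the same quantity that Figure 2 summarizes for MNIST images.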

4 Permutation covariance and invariance

When applying a transformation $\Phi[G]$ to a graph signal it is natural to expect that relabeling the graph vertices and the corresponding signal's indices before applying the transformation has the same effect as relabeling the corresponding indices after applying the transformation. More precisely, let $\mathbf{P} \in S_N$ be a permutation, where $S_N$ denotes the symmetric group on $N$ letters, then it is natural to ask whether

$$\Phi[\mathbf{P}G](\mathbf{P}\mathbf{f}) = \mathbf{P}\,\Phi[G](\mathbf{f}). \tag{32}$$

In deep learning, the property expressed in (32) is referred to as covariance to permutations. On the other hand, invariance to permutations means that

$$\Phi[\mathbf{P}G](\mathbf{P}\mathbf{f}) = \Phi[G](\mathbf{f}). \tag{33}$$

Ideally, a graph-based classifier should not be sensitive to "graph-consistent relabeling" of the signal coordinates. The analog of this ideal request in the Euclidean setting is that a classifier of signals defined on $\mathbb{R}^d$ should not be sensitive to their rigid transformations. In the case of classifying graph signals by first applying a feature-extracting transformation and then a standard classifier, this ideal request translates to permutation invariance of the initial transformation. However, permutation invariance is a very strong property that often contradicts the necessary permutation covariance. We show here that the scattering transform is permutation covariant and, if the scaling function is sufficiently smooth and $J$ approaches infinity, then it becomes permutation invariant.

We first exemplify the basic notions of covariance and invariance in Section 4.1. Section 4.2 reviews the few existing results on permutation covariance and invariance of graph neural networks and then presents our results for the scattering network. Section 4.3 explains why some permutations are natural generalizations of rigid transformations and then discusses previous broad generalizations of the notion of "graph translation" and their possible covariance and invariance properties.
Sections 4.4 and 4.5 prove the main results formulated in Section 4.2.

4.1 Basic examples of graph permutations, covariance and invariance

For demonstration, we focus on the graph $G$ depicted in Figure 3a. In this graph, each drawn edge has weight one and thus the double edge between the first two nodes has the total weight 2. The weight matrix of the graph is

$$\mathbf{W} = \begin{bmatrix} 0 & 2 & 1 & 1 \\ 2 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix}. \tag{34}$$

The signal $\mathbf{f} = (2, 1, 0, 0)^{T}$ is depicted on the graph with different colors corresponding to different values. The following permutation is applied to the graph in Figure 3b:

$$\mathbf{P} = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{bmatrix}. \tag{35}$$

Figure 3c applies the permutation both to the signal and the graph.

Figure 3: Illustration of permutation for a particular example of a graph and signal discussed in this section. Panels: (a) $(G, \mathbf{f})$; (b) $(\mathbf{P}G, \mathbf{f})$; (c) $(\mathbf{P}G, \mathbf{P}\mathbf{f})$.

An example of a transformation $\Phi[G]$ can be the replacement of the signal values in the two vertices connected by the edge of weight 2. This transformation is independent of the labeling of the graph and is thus permutation covariant. This can be formally verified as follows, while using for simplicity the permutation $\mathbf{P}$ defined in (35). For a signal $\mathbf{f} = (f_1, f_2, f_3, f_4)^{T}$, $\mathbf{P}\mathbf{f} = (f_3, f_1, f_4, f_2)^{T}$. Furthermore, $\Phi[G]$ swaps the first two entries of a signal, while $\Phi[\mathbf{P}G]$ swaps the second and the fourth entries (the second claim is obvious from Figure 3b). Accordingly, $\Phi[G](\mathbf{f}) = (f_2, f_1, f_3, f_4)^{T}$ and $\Phi[\mathbf{P}G](\mathbf{P}\mathbf{f}) = (f_3, f_2, f_4, f_1)^{T}$. One can readily check that indeed $\mathbf{P}\,\Phi[G](\mathbf{f}) = (f_3, f_2, f_4, f_1)^{T} = \Phi[\mathbf{P}G](\mathbf{P}\mathbf{f})$.

Another example is the transformation $\Phi[G](\mathbf{f}) = \mathbf{W}[G]\mathbf{f}$, where $\mathbf{W}[G] \equiv \mathbf{W}$ is the weight matrix in (34). This transformation is also independent of the labeling of the graph and thus permutation covariant.
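
These examples can be replayed numerically with the matrices in (34) and (35). The sketch below is only a sanity check for this one choice of $\mathbf{P}$ and $\mathbf{f}$, not a proof; the function names are hypothetical, and the relabeled graph is represented by the weight matrix $\mathbf{P}\mathbf{W}\mathbf{P}^{T}$. It also includes a transformation that places the maximum of the signal in the first coordinate, as an instance that is invariant but not covariant.

```python
import numpy as np

W = np.array([[0, 2, 1, 1],
              [2, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]])          # weight matrix (34)
P = np.array([[0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 1, 0, 0]])          # permutation (35)
f = np.array([2, 1, 0, 0])

def relabel(W, P):
    """Weight matrix of the relabeled graph PG, i.e. P W P^T."""
    return P @ W @ P.T

def swap_heaviest_edge(W, f):
    """Swap the signal values at the two endpoints of the heaviest edge."""
    i, j = np.unravel_index(np.argmax(W), W.shape)
    g = f.copy()
    g[i], g[j] = f[j], f[i]
    return g

def apply_weights(W, f):
    """Multiply the signal by the weight matrix."""
    return W @ f

def apply_laplacian(W, f):
    """Multiply the signal by the unnormalized Laplacian D - W."""
    return (np.diag(W.sum(axis=1)) - W) @ f

def max_first(W, f):
    """Put max(f) in the first coordinate and zeros elsewhere (ignores the graph)."""
    g = np.zeros_like(f)
    g[0] = f.max()
    return g

def is_covariant(phi):
    # Phi[PG](Pf) == P Phi[G](f), as in (32)
    return np.array_equal(phi(relabel(W, P), P @ f), P @ phi(W, f))

def is_invariant(phi):
    # Phi[PG](Pf) == Phi[G](f), as in (33)
    return np.array_equal(phi(relabel(W, P), P @ f), phi(W, f))

for phi in (swap_heaviest_edge, apply_weights, apply_laplacian, max_first):
    print(phi.__name__, "covariant:", is_covariant(phi), "invariant:", is_invariant(phi))
```

The first three transformations read the graph and therefore move with the labels, while `max_first` writes to a fixed coordinate regardless of the labeling.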
This property can also be formally verified as follows:

$$\Phi[\mathbf{P}G](\mathbf{P}\mathbf{f}) = \mathbf{W}[\mathbf{P}G]\,\mathbf{P}\mathbf{f} = \mathbf{P}\mathbf{W}[G]\mathbf{P}^{*}\mathbf{P}\mathbf{f} = \mathbf{P}\mathbf{W}[G]\mathbf{f} = \mathbf{P}\,\Phi[G](\mathbf{f}). \tag{36}$$

Similarly, $\Phi[G](\mathbf{f}) = \mathbf{L}[G]\mathbf{f}$, where $\mathbf{L}[G]$ is the graph Laplacian, is permutation covariant.

The above three examples of permutation covariant transformations are not permutation invariant. An example of a transformation $\Phi[G]$ that is permutation invariant, but not permutation covariant, maps the signal $\mathbf{f} = (f_1, \ldots, f_4)^{T}$ to the signal $\Phi[G]\mathbf{f} = (\max_{i=1}^{4} f_i, 0, 0, 0)^{T}$. Clearly the output $\Phi[G]\mathbf{f}$ is not affected by permutation of the input signal and is thus permutation invariant. On the other hand, zeroing out three specified signal coordinates, instead of three vertices with unique graph properties (e.g., the vertices connected by at least two edges), violates permutation covariance.

The latter example demonstrates in a very simplistic way the value of invariance for classification. Indeed, assume that there are two types of signals with low and high values and a classifier tries to distinguish between the two classes according to the first coordinate of $\Phi[G](\mathbf{f})$ by checking whether it is larger than a certain threshold or not. Then this procedure can distinguish the two types of signals without getting confused by signal relabeling. Permutation covariance does not play any role in this simplistic setting, since the classifier only considers the first coordinate of $\Phi[G](\mathbf{f})$ and ignores the rest of them.

4.2 Permutation covariance and invariance of graph neural networks

The recent works of Gilmer et al. [23] and Kondor et al. [24] discuss permutation covariance and invariance for composition schemes on graphs, where message passing is a special case. Composition schemes are covariant to permutations since they do not depend on any labeling of the graph vertices. Moreover, if the aggregation function of the composition scheme is invariant to permutations, so is the whole scheme [24, Proposition 2].
However, aggregation leads to loss of local information, which might weaken the performance of the scheme.

Methods based on graph operators, such as the graph adjacency, weight or Laplacian, are not invariant to permutations (see demonstration in Section 4.1). Nevertheless, the scattering transform is approximately permutation invariant when the wavelet scaling function is sufficiently smooth. Furthermore, when $J$ approaches infinity it becomes invariant to permutations. We first formulate its permutation covariance and then its approximate permutation invariance.

Proposition 4.1. Let $G$ be a simple graph and $S[G][\mathcal{P}_J]$ be the graph scattering transform with respect to $G$. For any $\mathbf{f} \in \mathbb{C}^N$ and $\mathbf{P} \in S_N$,

$$S[\mathbf{P}G][\mathcal{P}_J]\,\mathbf{P}\mathbf{f} = \mathbf{P}\,S[G][\mathcal{P}_J]\,\mathbf{f}. \tag{37}$$

Theorem 4.2. Let $G$ be a simple graph and $S[G][\mathcal{P}_J]$ be the graph scattering transform with respect to $G$. Assume that the Fourier transform of the scaling function of $S[G][\mathcal{P}_J]$ decays as follows: $\hat{\phi}(\omega) \le C_\phi / |\omega|$, where $C_\phi$ is a constant depending on $\phi$. For any $\mathbf{f} \in \mathbb{C}^N$ and $\mathbf{P} \in S_N$,

$$\big\|S[\mathbf{P}G][\mathcal{P}_J]\,\mathbf{P}\mathbf{f} - S[G][\mathcal{P}_J]\,\mathbf{f}\big\| \le C_\phi\, 2^{-(J+0.5)} \sqrt{N+2}\; \lambda_1^{-1}\, \|\mathbf{P} - \mathbf{I}\|\, \|\mathbf{f}\|. \tag{38}$$

In particular, the scattering transform is invariant as $J$ approaches infinity. The result also holds if $\mathbf{f} \in \mathbb{C}^N$ is replaced with $\mathbf{F} \in \mathbb{C}^{N \times D}$ and the Euclidean norm is replaced with the Frobenius norm.

4.3 Generalized graph translations

Permutation invariance on graphs is an important notion, which is motivated by concrete applications [25, 26]. It can be seen as an analog of translation invariance in Euclidean domains, which is also essential for applications [13, 12]. A different line of research asks for the most natural notion of translation on a graph [3, 27].

We show here that very special permutations of signals on graphs naturally generalize the notion of translation or rigid transformation of a signal in a Euclidean domain. More precisely, there is a planar representation of the graph on which the permutation acts like a rigid transformation.
However, in general, there are many permutations that act very dierently than translations or rigid transformations in a Euclidean domain, though, they still preserve the graph topology. Indeed, the underlying geometry of general graphs is richer than that of the Euclidean domain. We later discuss previously suggested generalized notions of \translations" on graphs and the possible covariance and invariance of a modi ed graph scattering transform with respect to these. We rst present two examples of permutations of graphs that can be viewed as Euclidean translations or rigid transformations. We later provide examples of permutations of the same graphs that are dierent than rigid transformations. The rst example, demonstrated in Figure 4a, shows a periodic lattice graph G and signal f with two values denoted by white and blue. Note that the periodic graph can be embedded in a torus, whereas the gure only shows the projection of its 25 vertices into a 5 5 grid in the plane. The edges are not depicted in the gure, but they connect points to their four nearest neighbors on the torus. That is, including \periodic padding" for the 5 5 grid of vertices, each vertex in the plane is connected with its four nearest neighbors. For example vertex 21 is connected with vertices 1, 16, 22, 25. The graph signal obtains a non-zero constant value on the four vertices colored in blue (3, 4, 7 and 8) and is zero on the rest of them. Figure 4b demonstrates an application of a permutation P to both the graph and the signal. At last, Figure 4c depicts the permuted graph and signal of Figure 4b when the indices are rearranged so that the representation of the lattice in the plane is the same as that in Figure 4a (this is necessary as the lattice lives in the torus and may have more than one representation in the plane). The relation between the consistent representations of (G,f ) in Figure 4a and (PG, Pf ) in Figure 4c is obviously a translation. 
That is, graph and signal permutation in this example corresponds to translation. We remark that the fact that Figure 4c coincides with the description of $(G, Pf)$ is incidental for this particular example and does not occur in the next example.

[Figure 4, panels: (a) $(G, f)$; (b) $(PG, Pf)$, obtained by relabeling; (c) indices of $(PG, Pf)$ rearranged as in (a).]
Figure 4: Demonstration of graph permutation as Euclidean translation. Figure 4a shows a signal lying on a lattice in the torus embedded onto a $5 \times 5$ planar grid. Figure 4b demonstrates a permutation of the graph and signal. Figure 4c shows a planar representation of the permuted graph and signal that is consistent with the one of Figure 4a. The permutation clearly corresponds to a translation in a Euclidean space.

Figure 5 depicts a different example where a permutation of a graph signal can be viewed as a variant of a Euclidean rigid transformation. The graph $G$ and the signal $f$ are shown in Figure 5a, where $f$ is supported on the vertices marked in blue (indexed by 1, 2 and 3). Figure 5b demonstrates an application of a permutation $P$ (mapping $(1, 2, 3, 4, 5)$ to $(5, 4, 3, 2, 1)$) to the graph and signal. Figure 5c shows a different representation of $(PG, Pf)$, which is consistent with the one of $(G, f)$ presented in Figure 5a. The comparison between Figures 5a and 5c makes it clear that the graph and signal permutation corresponds to a Euclidean rigid transformation in the planar representation of the graph. Finally, Figure 5d demonstrates that, unlike the example in Figure 4, the rearrangement of $(PG, Pf)$ is generally different from the graph $(G, Pf)$. Indeed, the subgraph associated with the blue values of the signal is not a triangle and thus the topology is different.

We remark that many permutations on graphs do not act like translations or rigid transformations. We demonstrate this claim using the graphs of the previous two examples. In Figure 6, we consider the same graph as in Figure 4, but with a different permutation.
The difference in permutations can be noticed by comparing the second columns of the grids in Figures 4b and 6b. We note that the rearrangement of the vertices in Figure 6c does not yield an analog of a Euclidean translation. The reason is that the rearranged vertices do not form a grid. To demonstrate this claim, note that in Figure 6a, label 17 is connected to 22, but in Figure 6b, and consequently in the rearranged representation in Figure 6c, they are disconnected. Figure 7 demonstrates a permutation that does not act like a rigid transformation with respect to the graph of Figure 5. Clearly, the rearranged graph and signal in Figure 7c have a different planar geometry.

We remark that while the permutations demonstrated in Figures 6 and 7 do not preserve the planar geometry, they still preserve the topology of the graphs. Indeed, the notion of permutation invariance is richer than invariance to rigid transformations in the Euclidean domain.

In the signal processing community, some candidates were proposed for translating signals on graphs. Shuman et al. [3] defined a "graph translation" (or in retrospect, a graph localization procedure) as follows:
$$T_c f = \sqrt{N}\,(f * \delta_c) = \sqrt{N} \sum_{l=0}^{N-1} u_l^*(c)\, u_l u_l^* f. \tag{39}$$
They established useful localization properties of $T_c$, which justify a corresponding construction of a windowed graph Fourier transform. They also demonstrated the applicability of this tool for the Minnesota road network [3, Figure 7]. We remark that in their definition $u_l(c)$ may not be well-defined. To make it well-defined one needs to assume fixed choices of the phases of $u_l(c)$, $0 \le l \le N-1$, and that the algebraic multiplicities of all eigenvalues equal one.

[Figure 5, panels: (a) $(G, f)$; (b) $(PG, Pf)$, obtained by relabeling; (c) $(PG, Pf)$ with indices of $PG$ embedded the same way as in (a); (d) $(G, Pf)$.]
Figure 5: Another example where a graph signal permutation corresponds to Euclidean translation. Figures 5a-5c are created analogously to Figures 4a-4c.
Figure 5d shows $(G, Pf)$, which is different from the rearrangement procedure depicted in Figure 5c.

[Figure 6, panels: (a) $(G, f)$; (b) $(PG, Pf)$, obtained by relabeling; (c) indices of $(PG, Pf)$ rearranged as in (a).]
Figure 6: A different permutation of Figure 4a, which is not similar to a rigid motion, but still preserves the graph topology. Note that Figure 6c does not maintain the planar geometry of the graph: for instance, the vertices 19 and 24 are not connected by an edge.

[Figure 7, panels: (a) $(G, f)$; (b) $(PG, Pf)$, obtained by relabeling; (c) indices of $(PG, Pf)$ rearranged as in (a).]
Figure 7: A different permutation of Figure 5a, which is not similar to a rigid motion, but still preserves the graph topology.

Sandryhaila and Moura [27] define a "shift" of a graph signal $f$ by $Tf = Wf$, where $W$ is the weight matrix of the graph. This definition is motivated by the example of a directed cyclic graph, where an application of the weight matrix is equivalent to a shift by one vertex. Note that in this special case, the graph signal permutation $(PG, Pf)$ advocated in this section also results in a vertex shift. We remark that it is unclear to us why this notion of shift is useful for general graphs.

If one needs covariance and approximate invariance of a graph scattering transform to the graph localization procedure defined in [3], then one may modify the nonlinearity of the scattering transform as
$$\sigma(f) = \sum_{l=0}^{N-1} |u_l^* f|\, u_l$$
and redefine $U_j f = \sigma(Q_j f)$ for $j > -J$. Note that
$$\sigma(T_c f) = \sigma\Big(\sqrt{N} \sum_{l=0}^{N-1} u_l^*(c)\,(u_l^* f)\, u_l\Big) = \sqrt{N} \sum_{l=0}^{N-1} u_l^*(c)\,|u_l^* f|\, u_l \tag{40}$$
and
$$T_c\, \sigma(f) = \sqrt{N} \sum_{l=0}^{N-1} u_l^*(c)\, u_l u_l^* \sum_{l'=0}^{N-1} |u_{l'}^* f|\, u_{l'} = \sqrt{N} \sum_{l=0}^{N-1} u_l^*(c)\,|u_l^* f|\, u_l. \tag{41}$$
Therefore, the nonlinearity and the modified scattering transform are covariant to the localization operator $T_c$. Similarly, by following the proof of Theorem 4.2 one can show that the modified scattering transform is approximately invariant to $T_c$ as long as its energy decay is sufficiently fast.
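The scattering transform cannot be adjusted to be covariant and approximately invariant to the "shift" defined by $Tf = Wf$. The reason is that, unlike $L$, the weight matrix $W$ is in general not diagonalized by the eigenvectors $\{u_l\}$ of $L$.

4.4 Proof of Proposition 4.1

We need to show that for each path $p = (j_1, \cdots, j_m) \in \mathcal{P}_J$, where $j_1, \cdots, j_m > -J$,
$$S[PG][p]Pf = P\,S[G][p]f. \tag{42}$$
Note that the Laplacian of $PG$ is $L_{PG} = PLP^*$, which has the same eigenvalues as $L$ and has eigenvectors $\tilde{u}_l = Pu_l$, $l = 0, \cdots, N-1$. Equation (9) implies that for $j > -J$
$$f *_{PG} \psi_j = \sum_{l=0}^{N-1} Pu_l u_l^* P^* f\, \hat{\psi}(2^{-j}\lambda_l). \tag{43}$$
Therefore, for $j > -J$
$$\begin{aligned}
(Pf) *_{PG} \psi_j &= \sum_{l=0}^{N-1} Pu_l u_l^* P^* Pf\, \hat{\psi}(2^{-j}\lambda_l) \\
&= \sum_{l=0}^{N-1} Pu_l u_l^* f\, \hat{\psi}(2^{-j}\lambda_l) = P \sum_{l=0}^{N-1} u_l u_l^* f\, \hat{\psi}(2^{-j}\lambda_l) = P\,(f *_G \psi_j).
\end{aligned} \tag{44}$$
Consequently, applying the absolute value pointwise,
$$|(Pf) *_{PG} \psi_j| = |P\,(f *_G \psi_j)| = P\,|f *_G \psi_j|. \tag{45}$$
Similarly,
$$(Pf) *_{PG} \phi_{-J} = P\,(f *_G \phi_{-J}). \tag{46}$$
Application of (45) and (46) results in the identity
$$\big|\cdots\big|\,|Pf *_{PG} \psi_{j_1}| *_{PG} \psi_{j_2}\big| \cdots *_{PG} \psi_{j_m}\big| *_{PG} \phi_{-J} = P\Big(\big|\cdots\big|\,|f *_G \psi_{j_1}| *_G \psi_{j_2}\big| \cdots *_G \psi_{j_m}\big| *_G \phi_{-J}\Big). \tag{47}$$
In view of (13)-(15), (42) is equivalent to (47), and the proof is thus concluded.

4.5 Proof of Theorem 4.2

According to (15) and (37),
$$\|S[PG][\mathcal{P}_J]Pf - S[G][\mathcal{P}_J]f\| = \|PQ_{-J}U[G][\mathcal{P}_J]f - Q_{-J}U[G][\mathcal{P}_J]f\| \le \|PQ_{-J} - Q_{-J}\|\,\|U[\mathcal{P}_J]f\|. \tag{48}$$
We bound the right-hand side of (48) by a function that approaches zero as $J \to \infty$. We first bound $\|PQ_{-J} - Q_{-J}\|$. We apply (15) as well as the following facts, which hold since $G$ is connected: $\lambda_0 = 0$, $u_0$ is a constant vector, so that $(P - I)u_0 = 0$, and $\lambda_1 > 0$. We obtain that for $f \in \mathbb{C}^N$
$$\begin{aligned}
\|(PQ_{-J} - Q_{-J})f\|^2 &= \Big\|P \sum_{l=0}^{N-1} \hat{\phi}(2^J\lambda_l)\, u_l u_l^* f - \sum_{l=0}^{N-1} \hat{\phi}(2^J\lambda_l)\, u_l u_l^* f\Big\|^2 \\
&= \Big\|(P - I) \sum_{l=0}^{N-1} \hat{\phi}(2^J\lambda_l)\, u_l u_l^* f\Big\|^2 \\
&= \Big\|(P - I) \sum_{l=1}^{N-1} \hat{\phi}(2^J\lambda_l)\, u_l u_l^* f\Big\|^2 \\
&\le \|P - I\|^2 \sum_{l=1}^{N-1} |\hat{\phi}(2^J\lambda_l)|^2\, |u_l^* f|^2 \\
&\le \max_{l=1,\cdots,N-1} |\hat{\phi}(2^J\lambda_l)|^2\; \|P - I\|^2\, \|f\|^2 \\
&\le C_\phi^2\, \lambda_1^{-2}\, 2^{-2J}\, \|P - I\|^2\, \|f\|^2.
\end{aligned} \tag{49}$$
Hence
$$\|PQ_{-J} - Q_{-J}\| \le C_\phi\, 2^{-J}\, \lambda_1^{-1}\, \|P - I\|. \tag{50}$$
It remains to bound $\|U[\mathcal{P}_J]f\|$. The application of (17), Proposition 3.3 and (23) results in
$$\begin{aligned}
\|U[\mathcal{P}_J]f\|^2 &= \|f\|^2 + \sum_{m \ge 1} \sum_{p \in \Lambda^m} \|U[p]f\|^2 \\
&\le \|f\|^2 + \sum_{m \ge 1} \Big(1 - \frac{2}{N}\Big)^{m-1} \sum_{p \in \Lambda^1} \|U[p]f\|^2 \\
&= \|f\|^2 + \frac{N}{2} \sum_{p \in \Lambda^1} \|U[p]f\|^2 \\
&\le \|f\|^2 + \frac{N}{2}\,\|f\|^2 = \frac{N+2}{2}\,\|f\|^2.
\end{aligned} \tag{51}$$
Finally, the combination of (48), (50) and (51) implies (38).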
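The covariance (37) and the invariance in the limit of large $J$ can be checked numerically on a small graph. The following is a minimal sketch under stated assumptions: the low-pass kernel $e^{-\omega}$ and band-pass kernel $\omega e^{-\omega}$ are illustrative smooth choices (not the Shannon wavelets used in the experiments), only paths of length at most two are computed, and the helper names `spectral_filter` and `scattering` are hypothetical.

```python
import numpy as np

def spectral_filter(L, g):
    """Return g(L) = sum_l g(lambda_l) u_l u_l^T for a symmetric Laplacian L."""
    lam, U = np.linalg.eigh(L)
    return (U * g(lam)) @ U.T

def scattering(L, f, J=3, layers=2):
    """Toy scattering transform: low-pass of nested |band-pass| outputs.
    Assumption: illustrative smooth kernels, scales j = -J+1, ..., 0."""
    low = spectral_filter(L, lambda lam: np.exp(-(2.0 ** J) * lam))
    bands = [spectral_filter(L, lambda lam, s=2.0 ** (-j): s * lam * np.exp(-s * lam))
             for j in range(-J + 1, 1)]
    level, out = [f], []
    for depth in range(layers + 1):
        out.extend(low @ u for u in level)          # S[p]f for all paths of this length
        if depth < layers:
            level = [np.abs(Q @ u) for u in level for Q in bands]
    return np.stack(out, axis=1)                    # one column per path

rng = np.random.default_rng(0)
N = 8
W = np.triu(rng.uniform(0.5, 1.0, size=(N, N)), 1)
W = W + W.T                                         # connected weighted graph
L = np.diag(W.sum(axis=1)) - W
P = np.eye(N)[rng.permutation(N)]                   # random permutation matrix
f = rng.standard_normal(N)

# Covariance: scattering of (PG, Pf) equals P applied to scattering of (G, f).
cov_err = np.max(np.abs(scattering(P @ L @ P.T, P @ f) - P @ scattering(L, f)))

def gap(J):
    """Invariance gap ||S[PG]Pf - S[G]f|| for a given J."""
    return np.linalg.norm(scattering(P @ L @ P.T, P @ f, J=J) - scattering(L, f, J=J))
```

In this sketch `cov_err` is at the level of floating-point round-off, while `gap(J)` shrinks as `J` grows, since the low-pass filter then approaches the projection onto constant vectors.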
The generalization to $F \in \mathbb{C}^{N \times D}$ is immediate since $F = (f_1, \cdots, f_D)$ and $\|F\|_F^2 = \sum_{d=1}^{D} \|f_d\|^2$.

5 Stability to signal and graph manipulations

We establish the stability of the graph scattering transform to both signal and graph manipulations. The stability to signal manipulation is an immediate corollary of the energy preservation established in Theorem 3.1. It states that the graph scattering transform is Lipschitz continuous with respect to the graph signal in the following way.

Proposition 5.1. For two signals $f \in \mathbb{C}^N$ and $\tilde{f} \in \mathbb{C}^N$,
$$\|S[\mathcal{P}_J]f - S[\mathcal{P}_J]\tilde{f}\| \le \|f - \tilde{f}\|. \tag{52}$$
Similarly, for two signals $F \in \mathbb{C}^{N \times D}$ and $\tilde{F} \in \mathbb{C}^{N \times D}$,
$$\|S[\mathcal{P}_J]F - S[\mathcal{P}_J]\tilde{F}\|_F \le \|F - \tilde{F}\|_F. \tag{53}$$

In order to motivate the stability to graph manipulation, we discuss the problem of community detection [28]. Its setting assumes different groups of vertices that communicate more significantly with each other than with other groups. The goal is to identify these underlying groups. In some cases, such as for data generated by the stochastic block model (SBM) [29], the edge set of the graph is the only information one can work with. For other cases, such as bibliographic datasets [30], information on features of vertices is provided in addition to the edge set (the citations). For graph convolutional neural networks, if vertex-wise features are not given, it is natural to choose an artificial feature for each vertex. For instance, Kipf & Welling [11] use $f = (1, 1, \cdots, 1)^*$ and Bruna & Li [10] use $F = I$. In this problem, stability to graph manipulations can be formulated as follows when the number of vertices is sufficiently large: small changes of the edge weights should not affect the community structure. More specifically, one may consider graph manipulation as modification of edge weights and ask for the effect of such manipulation on important features. In the following, we establish such stability to graph manipulations, where the features are expressed by the output of the graph scattering transform.
This result is conditioned on a sufficiently fast decay rate of the energy as well as of $\phi$ and $\psi$ (equivalently, their Fourier transforms are sufficiently smooth).

Theorem 5.2. Let $G$ be a simple graph with $N$ vertices and weights $\{W(n, m)\}_{n,m=1}^{N}$, and let $\delta > 0$ denote the smallest gap of eigenvalues of its Laplacian:
$$\delta = \min_{l_1 \ne l_2} |\lambda_{l_1} - \lambda_{l_2}|.$$
Let $\tilde{G}$ be a perturbation of $G$ with weights $\{\tilde{W}(n, m)\}_{n,m=1}^{N}$, such that for some $0 < C_\sharp \le \delta N/2$
$$|\tilde{W}(n, m) - W(n, m)| \le C_\sharp N^{-2}. \tag{54}$$
Let $f \in \mathbb{C}^N$ be a fixed input signal for which the energy of the scattering transform decays fast in the sense that for some $M > 0$ and $C_0 > 0$
$$\sum_{m \ge M} \sum_{p \in \Lambda^m} \|S[p]f\|^2 \le \frac{C_0}{N}\,\|f\|^2. \tag{55}$$
Also, suppose $\hat{\phi}$ and $\hat{\psi}$ are both Lipschitz continuous functions with Lipschitz constant $C_1$. Then there exists a constant $C$ depending on $C_\sharp$, $C_0$, $C_1$, such that
$$\|S[G][\mathcal{P}_J]f - S[\tilde{G}][\mathcal{P}_J]f\| \le \frac{C}{\sqrt{N}}\,\|f\|. \tag{56}$$
The same result holds if $f \in \mathbb{C}^N$ is replaced with $F \in \mathbb{C}^{N \times D}$.

We remark that the assumption $\delta > 0$ in the above theorem implies that all eigenvalues have algebraic multiplicity one. In general, it is impossible to extend this theorem to higher multiplicity of eigenvalues. Indeed, assume for example that there are two zero eigenvalues, so the graph is disconnected. Then it is possible to make the graph connected by changing a certain edge weight from zero to an arbitrarily small positive number. Such a small change completely deforms the topology of the graph and we thus do not expect a general theorem that includes higher multiplicities.

We also remark that (54) only allows a very small change of weights and generally does not allow one to remove or add an edge. The latter more general graph manipulation is natural in some applications. For example, for the CORA dataset [30] of publications and citations, the lack of knowledge of the mutual citation between two specific publications should not significantly affect the detection of communities. We are unaware of previous theoretical results for stability with respect to this more general graph manipulation.
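To get a feel for how restrictive the weight perturbation (54) is, one can perturb a random weight matrix and compare the eigenvalue shifts of the Laplacian with the operator norm of the Laplacian difference (Weyl's inequality). The sketch below assumes that the bound in (54) scales like $N^{-2}$; the constant `c` and the graph size are illustrative values, not taken from the paper.

```python
import numpy as np

def laplacian(W):
    """Combinatorial graph Laplacian D - W."""
    return np.diag(W.sum(axis=1)) - W

rng = np.random.default_rng(1)
N, c = 30, 0.5                                   # c: illustrative constant in the bound c / N^2

W = np.triu(rng.uniform(0.5, 1.0, (N, N)), 1)
W = W + W.T                                      # symmetric weights, zero diagonal
D = np.triu(rng.uniform(-1.0, 1.0, (N, N)), 1) * c / N**2
W_tilde = W + D + D.T                            # entrywise |W_tilde - W| <= c / N^2

E = laplacian(W_tilde) - laplacian(W)
lam = np.linalg.eigvalsh(laplacian(W))           # eigenvalues in ascending order
lam_tilde = np.linalg.eigvalsh(laplacian(W_tilde))

max_shift = np.max(np.abs(lam - lam_tilde))      # largest eigenvalue displacement
E_norm = np.linalg.norm(E, 2)                    # spectral norm of the Laplacian difference
```

Weyl's inequality guarantees `max_shift <= E_norm`, and for a Laplacian difference with off-diagonal entries bounded by `c / N**2` the spectral norm is at most `c / N`, so both quantities vanish as the graph grows.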
In some cases, removing an edge from a graph can completely change the topology, no matter how large the graph is. For example, one can make some graphs disconnected by removing a single edge. Therefore, it is difficult to have a general result that can handle stability to removal or addition of edges. Nevertheless, the following theorem generalizes Theorem 5.2 by restricting the perturbation of the spectral decomposition of the graph Laplacian.

Theorem 5.3. Let $G$ and $\tilde{G}$ be two simple graphs with the same set of $N$ vertices. Let $f \in \mathbb{C}^N$ be a fixed input signal for which the energy of the scattering transform decays fast in the sense that for some $M > 0$ and $C_0 > 0$
$$\sum_{m \ge M} \sum_{p \in \Lambda^m} \|S[p]f\|^2 \le \frac{C_0}{N}\,\|f\|^2. \tag{57}$$
Also, suppose $\hat{\phi}$ and $\hat{\psi}$ are both Lipschitz continuous functions with Lipschitz constant $C_1$ and the Laplacian eigenpairs of $G$ and $\tilde{G}$ satisfy
$$|\lambda_l - \tilde{\lambda}_l| \le \frac{C_2}{N} \quad \text{and} \quad |\sin\angle(u_l, \tilde{u}_l)| \le \frac{C_3}{N}, \quad l = 0, \cdots, N-1. \tag{58}$$
Then there exists a constant $C$ depending on $C_0$, $C_1$, $C_2$, $C_3$ and $M$, for which
$$\|S[G][\mathcal{P}_J]f - S[\tilde{G}][\mathcal{P}_J]f\| \le \frac{C}{\sqrt{N}}\,\|f\|. \tag{59}$$
The same result holds if $f \in \mathbb{C}^N$ is replaced with $F \in \mathbb{C}^{N \times D}$.

Condition (58) might be strong for some practical applications, but we are unable to relax it. As mentioned above, we do not expect a general stability result with respect to addition or removal of edges. Nevertheless, for some stochastic models, it is rather unlikely that adding or removing an edge leads to a significant change in the graph topology. To demonstrate this claim, we numerically test the stability of the scattering transform in practice for a synthetic setting produced by SBM with arbitrary edge corruption. In this setting, the condition of (58) may not hold. For each $N = 5, 10, 20, 50, 100, 200, 400, 600, 800, 1{,}000$, we randomly sample 20 graphs from an SBM with two classes, both containing $N$ vertices. The probability of connecting two vertices within one class is $p = \min\{1, 6\log N/N\}$ and the probability of connecting two vertices from different classes is $q = \log N/N$.
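The sampling step described above could be sketched as follows. This is an illustration rather than the code used for the experiments: the helper names are hypothetical, and the within-class probability is read as capped at 1 so that it remains a valid probability for small $N$.

```python
import numpy as np

def sbm_probabilities(N):
    """Within-class and between-class connection probabilities.
    Assumption: the within-class value 6 log N / N is capped at 1."""
    return min(1.0, 6 * np.log(N) / N), np.log(N) / N

def sample_sbm(n_per_class, p, q, rng):
    """Sample a two-class SBM; returns a symmetric 0/1 weight matrix and the class labels."""
    labels = np.repeat([0, 1], n_per_class)
    same_class = labels[:, None] == labels[None, :]
    probs = np.where(same_class, p, q)
    upper = np.triu(rng.random(probs.shape) < probs, 1)   # independent edges above the diagonal
    W = (upper | upper.T).astype(float)
    return W, labels

rng = np.random.default_rng(0)
N = 50                                   # vertices per class
p, q = sbm_probabilities(N)
W, labels = sample_sbm(N, p, q, rng)
```

Repeating the call with a fresh random state yields independent samples from the same model, which can then be averaged over.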
For each model, we randomly choose two vertices of the graph $G$: if they are connected by an edge, we remove the edge to form $\tilde{G}$; if they are not connected by an edge, we add an edge to form $\tilde{G}$. We compute the relative error $\|S[G]f - S[\tilde{G}]f\|/\|f\|$ and average it over the 20 random samples. The results of the experiments are shown in Figure 8. We note that the ratio decays fast and the relative error is negligible for sufficiently large graphs.

[Figure 8: plot of the averaged relative error against $N$.]
Figure 8: The relative error $\|S[G]f - S[\tilde{G}]f\|/\|f\|$ for a graph $G$ and its perturbation $\tilde{G}$ generated by SBM. The perturbation $\tilde{G}$ is formed by randomly selecting two vertices of $G$ and reversing the connectivity by adding/deleting an edge between them. The average ratio from 20 randomly generated graphs is plotted for each $N$ (number of vertices in each class).

5.1 Proof of Proposition 5.1

Similarly to establishing (24),
$$\begin{aligned}
\sum_{p \in \Lambda^m} \|U[p]f - U[p]\tilde{f}\|^2 &= \sum_{p \in \Lambda^m} \Big( \big\|(U[p]f - U[p]\tilde{f}) * \phi_{-J}\big\|^2 + \sum_{j > -J} \big\|(U[p]f - U[p]\tilde{f}) * \psi_j\big\|^2 \Big) \\
&= \sum_{p \in \Lambda^m} \Big( \big\|U[p]f * \phi_{-J} - U[p]\tilde{f} * \phi_{-J}\big\|^2 + \sum_{j > -J} \big\|U[p]f * \psi_j - U[p]\tilde{f} * \psi_j\big\|^2 \Big) \\
&\ge \sum_{p \in \Lambda^m} \sum_{j > -J} \|U[p + j]f - U[p + j]\tilde{f}\|^2 + \sum_{p \in \Lambda^m} \|S[p]f - S[p]\tilde{f}\|^2 \\
&= \sum_{p \in \Lambda^{m+1}} \|U[p]f - U[p]\tilde{f}\|^2 + \sum_{p \in \Lambda^m} \|S[p]f - S[p]\tilde{f}\|^2.
\end{aligned} \tag{60}$$
We remark that the inequality of (60) follows from the triangle inequality $\|x - y\| \ge |\|x\| - \|y\||$. By summing the terms on the left and right hand sides of (60) over $m \ge 0$, one obtains, similarly to (30), that
$$\|S[\mathcal{P}_J]f - S[\mathcal{P}_J]\tilde{f}\|^2 \le \|f - \tilde{f}\|^2 - \lim_{m \to \infty} \sum_{p \in \Lambda^{m+1}} \|U[p]f - U[p]\tilde{f}\|^2. \tag{61}$$
Application of (31) to (61) yields (52). The proof of (53) is the same.

5.2 Proof of Theorem 5.3

Let $p \in \Lambda^m$ be an arbitrary path of length $m \ge 0$. We bound the squared norm of the difference of wavelet coefficients of $U[p]f$ with respect to $G$ and $\tilde{G}$. Recall that $\lambda_0 = \tilde{\lambda}_0 = 0$ and $u_0 = \tilde{u}_0 = (1/\sqrt{N}, \cdots, 1/\sqrt{N})^*$.
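The required bound for the coefficient with index $-J$ is as follows:
$$\begin{aligned}
&\big\|Q_{-J}[G]U[G][p]f - Q_{-J}[\tilde{G}]U[\tilde{G}][p]f\big\|^2 \\
&\quad= \Big\|\sum_{l=0}^{N-1} \hat{\phi}(2^J\lambda_l)\, u_l u_l^*\, U[G][p]f - \sum_{l=0}^{N-1} \hat{\phi}(2^J\tilde{\lambda}_l)\, \tilde{u}_l \tilde{u}_l^*\, U[\tilde{G}][p]f\Big\|^2 \\
&\quad= \Big\|\sum_{l=0}^{N-1} \big(\hat{\phi}(2^J\lambda_l) - \hat{\phi}(2^J\tilde{\lambda}_l)\big)\, u_l u_l^*\, U[G][p]f + \sum_{l=0}^{N-1} \hat{\phi}(2^J\tilde{\lambda}_l)\,(u_l u_l^* - \tilde{u}_l \tilde{u}_l^*)\, U[G][p]f \\
&\qquad\quad + \sum_{l=0}^{N-1} \hat{\phi}(2^J\tilde{\lambda}_l)\, \tilde{u}_l \tilde{u}_l^*\, \big(U[G][p]f - U[\tilde{G}][p]f\big)\Big\|^2 \\
&\quad= \Big\|\sum_{l=1}^{N-1} \big(\hat{\phi}(2^J\lambda_l) - \hat{\phi}(2^J\tilde{\lambda}_l)\big)\, u_l u_l^*\, U[G][p]f + \sum_{l=1}^{N-1} \hat{\phi}(2^J\tilde{\lambda}_l)\,(u_l u_l^* - \tilde{u}_l \tilde{u}_l^*)\, U[G][p]f \\
&\qquad\quad + \sum_{l=0}^{N-1} \hat{\phi}(2^J\tilde{\lambda}_l)\, \tilde{u}_l \tilde{u}_l^*\, \big(U[G][p]f - U[\tilde{G}][p]f\big)\Big\|^2 \\
&\quad\le 3\,\Big( \Big\|\sum_{l=1}^{N-1} \big(\hat{\phi}(2^J\lambda_l) - \hat{\phi}(2^J\tilde{\lambda}_l)\big)\, u_l u_l^*\, U[G][p]f\Big\|^2 + \Big\|\sum_{l=1}^{N-1} \hat{\phi}(2^J\tilde{\lambda}_l)\,(u_l u_l^* - \tilde{u}_l \tilde{u}_l^*)\, U[G][p]f\Big\|^2 \\
&\qquad\quad + \Big\|\sum_{l=0}^{N-1} \hat{\phi}(2^J\tilde{\lambda}_l)\, \tilde{u}_l \tilde{u}_l^*\, \big(U[G][p]f - U[\tilde{G}][p]f\big)\Big\|^2 \Big).
\end{aligned} \tag{62}$$
Similarly, for $j > -J$,
$$\begin{aligned}
&\big\|Q_j[G]U[G][p]f - Q_j[\tilde{G}]U[\tilde{G}][p]f\big\|^2 \\
&\quad\le 3\,\Big( \Big\|\sum_{l=1}^{N-1} \big(\hat{\psi}(2^{-j}\lambda_l) - \hat{\psi}(2^{-j}\tilde{\lambda}_l)\big)\, u_l u_l^*\, U[G][p]f\Big\|^2 + \Big\|\sum_{l=1}^{N-1} \hat{\psi}(2^{-j}\tilde{\lambda}_l)\,(u_l u_l^* - \tilde{u}_l \tilde{u}_l^*)\, U[G][p]f\Big\|^2 \\
&\qquad\quad + \Big\|\sum_{l=0}^{N-1} \hat{\psi}(2^{-j}\tilde{\lambda}_l)\, \tilde{u}_l \tilde{u}_l^*\, \big(U[G][p]f - U[\tilde{G}][p]f\big)\Big\|^2 \Big).
\end{aligned} \tag{63}$$
Both (62) and (63) bound the energy by the sum of three terms. We next bound each of these terms.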
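In order to bound the first term, note that since $\hat{\phi}$ and $\hat{\psi}$ are both $C_1$-Lipschitz,
$$\big|\hat{\phi}(2^J\lambda_l) - \hat{\phi}(2^J\tilde{\lambda}_l)\big| \le C_1 2^J\, |\lambda_l - \tilde{\lambda}_l|, \tag{64}$$
and
$$\big|\hat{\psi}(2^{-j}\lambda_l) - \hat{\psi}(2^{-j}\tilde{\lambda}_l)\big| \le C_1 2^{-j}\, |\lambda_l - \tilde{\lambda}_l|, \quad \text{for all } j > -J. \tag{65}$$
Therefore,
$$\begin{aligned}
\Big\|\sum_{l=1}^{N-1} \big(\hat{\phi}(2^J\lambda_l) - \hat{\phi}(2^J\tilde{\lambda}_l)\big)\, u_l u_l^*\, U[G][p]f\Big\|^2 &\le \max_{l=1,\cdots,N-1} \big|\hat{\phi}(2^J\lambda_l) - \hat{\phi}(2^J\tilde{\lambda}_l)\big|^2 \sum_{l=1}^{N-1} \|u_l u_l^*\, U[G][p]f\|^2 \\
&\le C_1^2\, 2^{2J}\, \frac{C_2^2}{N^2}\, \|U[G][p]f\|^2 = \frac{C_1^2 C_2^2\, 2^{2J}}{N^2}\, \|U[G][p]f\|^2.
\end{aligned} \tag{66}$$
Similarly,
$$\Big\|\sum_{l=1}^{N-1} \big(\hat{\psi}(2^{-j}\lambda_l) - \hat{\psi}(2^{-j}\tilde{\lambda}_l)\big)\, u_l u_l^*\, U[G][p]f\Big\|^2 \le \frac{C_1^2 C_2^2\, 2^{-2j}}{N^2}\, \|U[G][p]f\|^2. \tag{67}$$
The second term for $-J$ and $j > -J$ is bounded as follows:
$$\begin{aligned}
\Big\|\sum_{l=1}^{N-1} \hat{\phi}(2^J\tilde{\lambda}_l)\,(u_l u_l^* - \tilde{u}_l \tilde{u}_l^*)\, U[G][p]f\Big\|^2 &\le \sum_{l=1}^{N-1} \|u_l u_l^* - \tilde{u}_l \tilde{u}_l^*\|^2\, \big|\hat{\phi}(2^J\tilde{\lambda}_l)\big|^2\, \|U[G][p]f\|^2 \\
&\le \frac{C_3^2}{N^2} \sum_{l=1}^{N-1} \big|\hat{\phi}(2^J\tilde{\lambda}_l)\big|^2\, \|U[G][p]f\|^2
\end{aligned} \tag{68}$$
and
$$\Big\|\sum_{l=1}^{N-1} \hat{\psi}(2^{-j}\tilde{\lambda}_l)\,(u_l u_l^* - \tilde{u}_l \tilde{u}_l^*)\, U[G][p]f\Big\|^2 \le \frac{C_3^2}{N^2} \sum_{l=1}^{N-1} \big|\hat{\psi}(2^{-j}\tilde{\lambda}_l)\big|^2\, \|U[G][p]f\|^2. \tag{69}$$
The third term for $-J$ and $j > -J$ has the form
$$\Big\|\sum_{l=0}^{N-1} \hat{\phi}(2^J\tilde{\lambda}_l)\, \tilde{u}_l \tilde{u}_l^*\, \big(U[G][p]f - U[\tilde{G}][p]f\big)\Big\|^2 = \sum_{l=0}^{N-1} \big|\hat{\phi}(2^J\tilde{\lambda}_l)\big|^2\, \big|\tilde{u}_l^*\big(U[G][p]f - U[\tilde{G}][p]f\big)\big|^2 \tag{70}$$
and
$$\Big\|\sum_{l=0}^{N-1} \hat{\psi}(2^{-j}\tilde{\lambda}_l)\, \tilde{u}_l \tilde{u}_l^*\, \big(U[G][p]f - U[\tilde{G}][p]f\big)\Big\|^2 = \sum_{l=0}^{N-1} \big|\hat{\psi}(2^{-j}\tilde{\lambda}_l)\big|^2\, \big|\tilde{u}_l^*\big(U[G][p]f - U[\tilde{G}][p]f\big)\big|^2. \tag{71}$$
Applying (66)-(71),
$$\begin{aligned}
&\big\|Q_{-J}[G]U[G][p]f - Q_{-J}[\tilde{G}]U[\tilde{G}][p]f\big\|^2 + \sum_{j > -J} \big\|Q_j[G]U[G][p]f - Q_j[\tilde{G}]U[\tilde{G}][p]f\big\|^2 \\
&\quad\le 3\,\Big( \sum_{j \ge -J} \frac{C_1^2 C_2^2\, 2^{-2j}}{N^2}\, \|U[G][p]f\|^2 + \frac{C_3^2}{N^2} \sum_{l=1}^{N-1} \Big( \big|\hat{\phi}(2^J\tilde{\lambda}_l)\big|^2 + \sum_{j > -J} \big|\hat{\psi}(2^{-j}\tilde{\lambda}_l)\big|^2 \Big) \|U[G][p]f\|^2 \\
&\qquad\quad + \sum_{l=0}^{N-1} \Big( \big|\hat{\phi}(2^J\tilde{\lambda}_l)\big|^2 + \sum_{j > -J} \big|\hat{\psi}(2^{-j}\tilde{\lambda}_l)\big|^2 \Big) \big|\tilde{u}_l^*\big(U[G][p]f - U[\tilde{G}][p]f\big)\big|^2 \Big) \\
&\quad\le 3\,\Big( \frac{C_1^2 C_2^2\, 2^{2J+1}}{N^2}\, \|U[G][p]f\|^2 + \frac{(N-1)\,C_3^2}{N^2}\, \|U[G][p]f\|^2 + \big\|U[G][p]f - U[\tilde{G}][p]f\big\|^2 \Big) \\
&\quad\le 3\,\Big( \frac{C}{N}\, \|U[G][p]f\|^2 + \big\|U[G][p]f - U[\tilde{G}][p]f\big\|^2 \Big),
\end{aligned} \tag{72}$$
where $C = C_1^2 C_2^2\, 2^{2J+1} + C_3^2$. Summing over $p \in \Lambda^m$ yields
$$\begin{aligned}
&\sum_{p \in \Lambda^m} \Big( \big\|Q_{-J}[G]U[G][p]f - Q_{-J}[\tilde{G}]U[\tilde{G}][p]f\big\|^2 + \sum_{j > -J} \big\|Q_j[G]U[G][p]f - Q_j[\tilde{G}]U[\tilde{G}][p]f\big\|^2 \Big) \\
&\quad\le 3 \sum_{p \in \Lambda^m} \Big( \frac{C}{N}\, \|U[G][p]f\|^2 + \big\|U[G][p]f - U[\tilde{G}][p]f\big\|^2 \Big).
\end{aligned} \tag{73}$$
That is,
$$\begin{aligned}
&\sum_{p \in \Lambda^m} \big\|Q_{-J}[G]U[G][p]f - Q_{-J}[\tilde{G}]U[\tilde{G}][p]f\big\|^2 + \sum_{p \in \Lambda^{m+1}} \big\|U[G][p]f - U[\tilde{G}][p]f\big\|^2 \\
&\quad\le \frac{3C}{N} \sum_{p \in \Lambda^m} \|U[G][p]f\|^2 + 3 \sum_{p \in \Lambda^m} \big\|U[G][p]f - U[\tilde{G}][p]f\big\|^2.
\end{aligned} \tag{74}$$
To make the following estimation clear, we denote for $m \ge 1$
$$a_m = \sum_{p \in \Lambda^{m-1}} \big\|Q_{-J}[G]U[G][p]f - Q_{-J}[\tilde{G}]U[\tilde{G}][p]f\big\|^2, \qquad b_m = \sum_{p \in \Lambda^m} \|U[G][p]f\|^2, \qquad d_m = \sum_{p \in \Lambda^m} \big\|U[G][p]f - U[\tilde{G}][p]f\big\|^2.$$
Also, we denote $b_0 = \|f\|^2$ and $d_0 = 0$.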
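Note that $b_m \le b_0$ for all $m \in \mathbb{N} \cup \{0\}$. Now (74) can be written as
$$a_m + d_m \le \frac{3C}{N}\, b_{m-1} + 3 d_{m-1}, \quad m \ge 1. \tag{75}$$
Summing over $m = 1, \cdots, M$ yields
$$\begin{aligned}
\sum_{m=1}^{M} a_m &\le \frac{3C}{N} \sum_{m=1}^{M} b_{m-1} + 3 \sum_{m=1}^{M} d_{m-1} - \sum_{m=1}^{M} d_m \\
&\le \frac{3CM}{N}\, b_0 + 2 \sum_{m=1}^{M-1} d_m - d_M \\
&\le \frac{3CM}{N}\, b_0 + 2 \sum_{m=1}^{M-1} d_m.
\end{aligned} \tag{76}$$
Note that $d_m \le \frac{3C}{N}\, b_0 + 3 d_{m-1}$ for $m \ge 1$, and $d_0 = 0$, and hence
$$d_m \le \frac{3^m - 1}{2} \cdot \frac{3C}{N}\, b_0. \tag{77}$$
Therefore,
$$\begin{aligned}
\sum_{m=1}^{M} a_m &\le \frac{3CM}{N}\, b_0 + \frac{3C}{N}\, b_0 \sum_{m=1}^{M-1} (3^m - 1) \\
&= \frac{3C}{N} \Big( M + \frac{3^M - 3}{2} - (M - 1) \Big)\, b_0 \\
&= \frac{3C}{N} \cdot \frac{3^M - 1}{2}\, b_0 = \frac{C'}{N}\, b_0,
\end{aligned} \tag{78}$$
where $C' = 3C\,(3^M - 1)/2$. That is, for $\mathcal{P}_M = \cup_{m=0}^{M-1} \Lambda^m$, the collection of all paths of length smaller than $M$,
$$\sum_{p \in \mathcal{P}_M} \big\|S[G][p]f - S[\tilde{G}][p]f\big\|^2 \le \frac{C'}{N}\, \|f\|^2. \tag{79}$$
On the other hand, summation over paths in $\mathcal{P}_J \setminus \mathcal{P}_M$ results in
$$\sum_{p \in \mathcal{P}_J \setminus \mathcal{P}_M} \big\|S[G][p]f - S[\tilde{G}][p]f\big\|^2 \le 2 \sum_{p \in \mathcal{P}_J \setminus \mathcal{P}_M} \Big( \|S[G][p]f\|^2 + \big\|S[\tilde{G}][p]f\big\|^2 \Big) \le \frac{4C_0}{N}\, \|f\|^2. \tag{80}$$
Therefore,
$$\big\|S[G][\mathcal{P}_J]f - S[\tilde{G}][\mathcal{P}_J]f\big\|^2 \le \frac{C' + 4C_0}{N}\, \|f\|^2. \tag{81}$$
The generalization to $F \in \mathbb{C}^{N \times D}$ is immediate since $F = (f_1, \cdots, f_D)$ and $\|F\|_F^2 = \sum_{d=1}^{D} \|f_d\|^2$.

5.3 Proof of Theorem 5.2

This theorem is actually a corollary of Theorem 5.3. In order to prove it, we only need to show that (58) holds under the assumptions of Theorem 5.2. We define $E := \tilde{L} - L$ and conclude from (54) and the definition of the graph Laplacian that
$$|E(n, m)| \le C_\sharp N^{-2} \ \text{ for } n \ne m,\ 1 \le n, m \le N, \quad \text{and} \quad E(n, n) = -\sum_{m \ne n} E(n, m) \ \text{ for } 1 \le n \le N. \tag{82}$$
To derive a bound for the perturbation of the eigenvalues, we need to control
$$\|E\| = \max_{x \ne 0} \frac{|x^* E x|}{\|x\|^2}.$$
By applying basic algebraic manipulations as well as the two parts of (82), we obtain that
$$\begin{aligned}
x^* E x &= -\sum_{n=1}^{N} |x_n|^2 \sum_{m \ne n} E(n, m) + \sum_{n=1}^{N} \sum_{m \ne n} x_n^* E(n, m)\, x_m \\
&= -\frac{1}{2} \sum_{n=1}^{N} \sum_{m \ne n} |x_n - x_m|^2\, E(n, m),
\end{aligned}$$
and hence
$$|x^* E x| \le \max_{n \ne m} |E(n, m)| \cdot \frac{1}{2} \sum_{n \ne m} |x_n - x_m|^2 \le C_\sharp N^{-2} \cdot N \|x\|^2 = C_\sharp N^{-1} \|x\|^2,$$
where we used $\frac{1}{2} \sum_{n \ne m} |x_n - x_m|^2 = N\|x\|^2 - |\sum_n x_n|^2 \le N\|x\|^2$. Therefore, $\|E\| \le C_\sharp N^{-1}$. By Weyl's inequality,
$$|\lambda_l - \tilde{\lambda}_l| \le \|E\| \le C_\sharp N^{-1}, \quad l = 0, \cdots, N-1. \tag{83}$$
Next, we establish bounds for the perturbation of eigenvectors.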
First note that
$$\min_{l_1 \ne l_2} |\lambda_{l_1} - \tilde{\lambda}_{l_2}| \ge \min_{l_1 \ne l_2} \big( |\lambda_{l_1} - \lambda_{l_2}| - |\lambda_{l_2} - \tilde{\lambda}_{l_2}| \big) \ge \delta - C_\sharp N^{-1} \ge \delta/2.$$
Let $P_V$ and $P_{\tilde{V}}$ be the orthogonal projections onto the eigenspaces corresponding to $\lambda_l$ and $\tilde{\lambda}_l$, respectively. According to the Davis-Kahan theorem [31, 32],
$$\|P_V - P_{\tilde{V}}\| \le \frac{\|E\|}{\min_{l_1 \ne l_2} |\lambda_{l_1} - \tilde{\lambda}_{l_2}|} \le \frac{2C_\sharp}{\delta N}. \tag{84}$$
As a result,
$$|\sin\angle(u_l, \tilde{u}_l)| = \|u_l u_l^* - \tilde{u}_l \tilde{u}_l^*\| \le \|P_V - P_{\tilde{V}}\| \le 2C_\sharp\, \delta^{-1} N^{-1}. \tag{85}$$
We thus note that (58) is satisfied with $C_2 = C_\sharp$ and $C_3 = 2C_\sharp/\delta$ and consequently conclude the proof.

6 Numerical results

We demonstrate the effectiveness of the graph wavelet scattering transform by using the MNIST dataset for the problem of image classification and the CORA citation network for the problem of community detection. All tests in this section are executed on a PC with an Intel i7-6700 CPU, 8GB RAM and a GTX1060 6GB GPU.

6.1 Image Classification using MNIST

The MNIST dataset [22] contains $28 \times 28$ gray-scale images of digits from 0 to 9. There are 60,000 training images and 10,000 testing images in total, where the task is to classify the images according to the digits. In order to test a graph-based method on this dataset, we follow [7] and construct a graph representing the underlying grid of the images. More specifically, the vertices of this graph correspond to the $28 \times 28$ pixels of each image. For vertices $v_i$ and $v_j$ we let $\mathrm{dist}(v_i, v_j)$ denote the scaled Euclidean distance between the centers of the corresponding pixels so that if $v_i$ and $v_j$ are nearest pixels then $\mathrm{dist}(v_i, v_j) = 1$. Edges are drawn between any vertices $v_i$ and $v_j$ satisfying $\mathrm{dist}(v_i, v_j) \le \sqrt{2}$. That is, each pixel is connected to the nearest pixels in horizontal, vertical and diagonal directions. The weight $e^{-\mathrm{dist}(v_i, v_j)}$ is assigned to any pair of vertices $v_i$ and $v_j$ connected by an edge. The rest of the pairs have zero weight.

Using this graph, we apply our proposed graph scattering transform with $J = 3$, either two or three layers and the Shannon wavelets.
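The grid graph described above can be assembled directly. The following is a minimal sketch, assuming the edge weight is $e^{-\mathrm{dist}}$ and that no periodic padding is used at the image boundary; the function name `grid_graph` is hypothetical and a dense matrix is used only for simplicity.

```python
import numpy as np

def grid_graph(side=28):
    """Weight matrix of the pixel grid: each pixel is linked to its horizontal,
    vertical and diagonal neighbours (dist <= sqrt(2)) with weight exp(-dist).
    Assumption: pixels are indexed row by row and the boundary is not padded."""
    n = side * side
    W = np.zeros((n, n))
    for r in range(side):
        for c in range(side):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if (dr or dc) and 0 <= rr < side and 0 <= cc < side:
                        W[r * side + c, rr * side + cc] = np.exp(-np.hypot(dr, dc))
    return W

W = grid_graph()
L = np.diag(W.sum(axis=1)) - W     # graph Laplacian whose eigenpairs define the wavelets
```

Interior pixels end up with eight neighbors and corner pixels with three. The same Laplacian `L` is shared by all images, so its spectral decomposition only needs to be computed once.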
The dimension of the output of the scattering transform is $28 \times 28 \times (1 + 3 + 9) = 10{,}192$. It is reduced by PCA to 1,000. Figure 9 illustrates an input image of the digit 7 with the features obtained at the first few layers of the scattering transform with $J = 3$ applied to this image. Note that the "frequency" of these features increases with the number of layers.

[Figure 9, panels: (a) original image; (b) 1st layer feature; (c) 2nd layer feature; (d) 3rd layer feature; (e) 4th layer feature.]
Figure 9: Examples of features obtained in each layer of the graph scattering network. The input image $f$ is shown in 9a. The scattering transform is applied with $J = 3$. The features at the four different layers, that is, $S[\emptyset]f$, $S[(2)]f$, $S[(2,2)]f$ and $S[(2,2,2)]f$, are demonstrated in Subfigures 9b-9e. The pixels of the output images representing the features are arranged in the same manner as in the input image.

We use three different classifiers on the features generated via scattering: (1) support vector machine (SVM), (2) softmax layer and (3) fully-connected network (FCN). We use 6,000 images of the training set for validation (simple holdout validation). The accuracies of all three classifiers with and without the scattering transform with 2 and 3 layers are shown in Table 1. Note that the scattering transform is able to generate features that improve the classification results for all classifiers. Moreover, a three-layer network performs better than a two-layer network. Three layers almost exhaust the energy of the input signal, so a deeper network is not necessary. Indeed, we did not notice any improvement when using a fourth layer.

                                               SVM      Softmax   FCN
No initial data processing                     94.16%   91.78%    98.10%
Graph scattering transform with M = 2 layers   95.68%   94.31%    99.02%
Graph scattering transform with M = 3 layers   96.59%   94.62%    99.09%

Table 1: Classification results on MNIST with and without graph scattering. The first row shows the percentage of correct classification by direct application of three common classifiers.
The next two lines show classification percentages after preprocessing the data by the graph scattering transform with 2 and 3 layers.

Our best result does not compete with the state-of-the-art result for the MNIST dataset that obtains a 99.75% accuracy rate [33]. While the network structure of the method in [33] is very carefully designed and architected, the graph model used here is only able to encode the information for neighboring pixels. On the other hand, for a convolutional neural network, the convolution at the lowest level collects local information that is not restricted to direct neighbors and is thus able to learn more meaningful local relations.

Although the grid graph is not the best way to fully represent the image information, it is still a common benchmark for a sanity check of a graph neural network. Table 2 lists classification results of other graph-based methods. Results of the first three methods from [6, 8, 7] are indicated in parentheses since they are copied from their original works (codes were not available for the methods of [6] and [8] and the code for Spline filters [7] did not converge on our computer). We remark that codes for the methods of [7] in these experiments and the ones below were obtained from https://github.com/mdeff/cnn_graph. It is evident that in terms of accuracy the scattering transform is comparable with the best graph-based performer.

Method                      Accuracy
Laplacian eigenvalues [6]   (94.96%)
Intuitive convolution [8]   (98.55%)
Spline filters [7]          (97.15%)
Chebyshev filters [7]       99.12%
Scattering transform        99.09%

Table 2: Percentages of correct classification of different graph-based methods on the MNIST database. Results in parentheses are copied from their original publications as explained in the main text.

To further compare the methods in [7] and our method, we list the running time for each method on our machine. We note that although the method that uses Chebyshev filters is accurate, it is not computationally efficient.
Furthermore, the method that uses spline filters did not converge (DNC) on our computer. On the other hand, the scattering transform achieves a competitive accuracy with high efficiency.

                    Accuracy   Time for scattering   Time for training
Spline filters      DNC        not needed            11 s/epoch
Chebyshev filters   99.12%     not needed            56 s/epoch
Scattering M = 2    99.02%     17 s                  2 s/epoch
Scattering M = 3    99.09%     36 s                  2 s/epoch

Table 3: Accuracy and time needed for training MNIST on our machine.

6.2 Community detection using CORA

The CORA dataset [30] contains 2,708 research papers with 1,433 features describing each paper. There are also 5,429 citation links between the different papers. This dataset gives rise to a graph whose vertices correspond to the research papers and whose edges correspond to citations. We assume an undirected graph, where the weight between two papers is one if at least one of them cites the other, and zero otherwise. There are 7 communities of papers and the problem is to detect them. The dataset in [30] provides labels (in $\{1, 2, \cdots, 7\}$) of 140 vertices for training, 500 vertices for validation, and 1,000 vertices for testing. Due to the small fraction of training samples, the community detection problem in this setting can be considered as semi-supervised learning.

The graph scattering transform is applied to the $2{,}708 \times 1{,}433$ feature matrix with $J = 3$, three layers and the Shannon wavelets. The dimension of the output of the scattering transform is $1{,}433 \times (1 + 3 + 9) = 18{,}629$. The communities are detected by applying FCN to the features obtained by the scattering transform. We remark that since training with only 140 samples is fast, there is no need to reduce the dimension.

Table 4 lists the accuracy of the graph scattering transform compared with the state-of-the-art graph-based neural network methods. Note that they are comparable, where the scattering transform demonstrates a slight improvement.
All of them outperform the traditional methods listed in [11, Table 2] (the accuracies of those methods are in the range 59%-75%).

Method                      Accuracy
Chebyshev filters [7, 11]   79.5%
Renormalization [11]        81.5%
Graph scattering + FCN      81.9%

Table 4: Percentages of correct labels on CORA for graph scattering and two state-of-the-art methods.

7 Conclusion

We constructed a graph convolutional neural network by adapting the scattering transform to graphs. We showed that, with the proper choice of graph wavelets, the graph scattering transform is invariant to permutations and stable to signal and graph manipulations. These invariance and stability properties make the graph scattering transform effective for classification and community detection tasks.

Although we exemplified the performance of the graph scattering transform in only two particular instances, one of which is a bit artificial, it is a generic tool for feature extraction on graphs. Its potential use is thus not limited to the discriminative tasks illustrated in these two examples. Furthermore, the graph scattering transform does not require training. However, it can be adapted to different datasets by choosing different kinds of wavelets. In the numerical experiments of this paper we only used the simple Shannon wavelets.

In addition to our work, there are other models that try to use the idea of the scattering transform for graphs, for example, the deep Haar scattering [34]. We believe that our proposed graph scattering network has a more flexible design. Its established permutation invariance and stability to signal and graph manipulations make it a robust feature extractor that is natural for graph representation.

The convolutions with the wavelets of [16] used in our graph transform are somewhat similar to the ones used in trained graph convolutional neural networks such as [7] and [11].
Our work thus suggests some conceptual understanding of invariance and stability properties for other graph convolutional networks.

Despite the advocated properties of the graph scattering transform, it has some limitations. First, it is based on the full spectral decomposition of the graph Laplacian, and for very large graphs its computation is demanding. In order to improve efficiency for the training component, dimension reduction techniques can be used after computing the graph scattering transform. Second, the "high-frequency" information for the graph Laplacian is not as clear as the high-frequency information in the Euclidean case. Therefore, we do not sufficiently understand the kind of information being processed at deeper layers of the graph scattering transform. Finally, the graph scattering transform is a basic generic tool and it may take some time for practitioners to evaluate its potential use. The examples demonstrated here are very simple and the stylized application of classification of images via graph neural networks cannot yield sufficiently competitive results.

Acknowledgement

This research was partially supported by NSF awards DMS-14-18386, DMS-18-21266 and DMS-18-30418. We thank Radu Balan, Addison Bohannon and Maneesh Singh for helpful references and Loren Anderson, Vahan Huroyan and Tyler Maunu for commenting on an earlier version of this manuscript.

References

[1] D. Lazer, A. S. Pentland, L. Adamic, S. Aral, A. L. Barabasi, D. Brewer, N. Christakis, N. Contractor, J. Fowler, M. Gutmann, et al., "Life in the network: the coming age of computational social science," Science (New York, NY), vol. 323, no. 5915, p. 721, 2009.
[2] J. Masci, D. Boscaini, M. Bronstein, and P. Vandergheynst, "Geodesic convolutional neural networks on Riemannian manifolds," in Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 37-45, 2015.
[3] D. I. Shuman, B. Ricaud, and P. Vandergheynst, "Vertex-frequency analysis on graphs," Applied and Computational Harmonic Analysis, vol. 40, no. 2, pp. 260-291, 2016.
[4] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, "Geometric deep learning: going beyond Euclidean data," IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 18-42, 2017.
[5] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, "Spectral networks and locally connected networks on graphs," arXiv preprint arXiv:1312.6203, 2013.
[6] M. Edwards and X. Xie, "Graph based convolutional neural network," arXiv preprint arXiv:1609.08965.
[7] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in Advances in Neural Information Processing Systems, pp. 3844-3852.
[8] Y. Hechtlinger, P. Chakravarti, and J. Qin, "A generalization of convolutional neural networks to graph-structured data," arXiv preprint arXiv:1704.08165, 2017.
[9] A. Nowak, S. Villar, A. S. Bandeira, and J. Bruna, "A note on learning algorithms for quadratic assignment with graph neural networks," arXiv preprint arXiv:1706.07450, 2017.
[10] Z. Chen, X. Li, and J. Bruna, "Supervised community detection with hierarchical graph neural networks," arXiv preprint arXiv:1705.08415, 2017.
[11] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.
[12] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[13] S. Mallat, "Group invariant scattering," Communications on Pure and Applied Mathematics, vol. 65, no. 10, pp. 1331-1398, 2012.
[14] R. R. Coifman and M. Maggioni, "Diffusion wavelets," Applied and Computational Harmonic Analysis, vol. 21, no. 1, pp. 53-94, 2006.
[15] S. Mahadevan and M. Maggioni, "Value function approximation with diffusion wavelets and Laplacian eigenfunctions," in Advances in Neural Information Processing Systems, pp. 843-850, 2006.
[16] D. K. Hammond, P. Vandergheynst, and R. Gribonval, "Wavelets on graphs via spectral graph theory," Applied and Computational Harmonic Analysis, vol. 30, no. 2, pp. 129-150, 2011.
[17] I. Daubechies, Ten Lectures on Wavelets. SIAM, 1992.
[18] S. Mallat, A Wavelet Tour of Signal Processing. Academic Press, 1999.
[19] J. Bruna and S. Mallat, "Invariant scattering convolution networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1872-1886, 2013.
[20] T. Wiatowski, P. Grohs, and H. Bölcskei, "Energy propagation in deep convolutional neural networks," arXiv preprint arXiv:1704.03636, 2017.
[21] W. Czaja and W. Li, "Analysis of time-frequency scattering transforms," Applied and Computational Harmonic Analysis, 2017.
[22] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[23] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, "Neural message passing for quantum chemistry," arXiv preprint arXiv:1704.01212, 2017.
[24] R. Kondor, H. T. Son, H. Pan, B. Anderson, and S. Trivedi, "Covariant compositional networks for learning graphs," arXiv preprint arXiv:1801.02144, 2018.
[25] N. Guttenberg, N. Virgo, O. Witkowski, H. Aoki, and R. Kanai, "Permutation-equivariant neural networks applied to dynamics prediction," arXiv preprint arXiv:1612.04530, 2016.
[26] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, "Convolutional networks on graphs for learning molecular fingerprints," in Advances in Neural Information Processing Systems, pp. 2224-2232, 2015.
[27] A. Sandryhaila and J. M. F. Moura, "Discrete signal processing on graphs: Frequency analysis," IEEE Transactions on Signal Processing, vol. 62, pp. 3042-3054, June 2014.
[28] S. Fortunato, "Community detection in graphs," Physics Reports, vol. 486, no. 3-5, pp. 75-174, 2010.
[29] E. Abbe, "Community detection and stochastic block models: Recent developments," Journal of Machine Learning Research, vol. 18, no. 177, pp. 1-86, 2018.
[30] P. Sen, G. M. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad, "Collective classification in network data," AI Magazine, vol. 29, no. 3, pp. 93-106, 2008.
[31] C. Davis and W. M. Kahan, "The rotation of eigenvectors by a perturbation. III," SIAM Journal on Numerical Analysis, vol. 7, no. 1, pp. 1-46, 1970.
[32] Y. Yu, T. Wang, and R. J. Samworth, "A useful variant of the Davis-Kahan theorem for statisticians," Biometrika, vol. 102, no. 2, pp. 315-323, 2014.
[33] S. Sabour, N. Frosst, and G. E. Hinton, "Dynamic routing between capsules," in Advances in Neural Information Processing Systems, pp. 3856-3866, 2017.
[34] X. Chen, X. Cheng, and S. Mallat, "Unsupervised deep Haar scattering on graphs," in Advances in Neural Information Processing Systems, pp. 1709-1717, 2014.
Electrical Engineering and Systems Science – arXiv (Cornell University)
Published: Mar 31, 2018