{"title": "Structure-Aware Convolutional Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 11, "page_last": 20, "abstract": "Convolutional neural networks (CNNs) are inherently subject to invariable filters that can only aggregate local inputs with the same topological structures. It causes that CNNs are allowed to manage data with Euclidean or grid-like structures (e.g., images), not ones with non-Euclidean or graph structures (e.g., traffic networks). To broaden the reach of CNNs, we develop structure-aware convolution to eliminate the invariance, yielding a unified mechanism of dealing with both Euclidean and non-Euclidean structured data. Technically, filters in the structure-aware convolution are generalized to univariate functions, which are capable of aggregating local inputs with diverse topological structures. Since infinite parameters are required to determine a univariate function, we parameterize these filters with numbered learnable parameters in the context of the function approximation theory. By replacing the classical convolution in CNNs with the structure-aware convolution, Structure-Aware Convolutional Neural Networks (SACNNs) are readily established. 
Extensive experiments on eleven datasets strongly evidence that SACNNs outperform current models on various machine learning tasks, including image classification and clustering, text categorization, skeleton-based action recognition, molecular activity detection, and taxi flow prediction.", "full_text": "Structure-Aware Convolutional Neural Networks\n\nJianlong Chang1,2\n\nJie Gu1,2\n\nShiming Xiang1,2\n\n1NLPR, Institute of Automation, Chinese Academy of Sciences\n\n2School of Arti\ufb01cial Intelligence, University of Chinese Academy of Sciences\n\n{jianlong.chang, jie.gu, lfwang, gfmeng, smxiang, chpan}@nlpr.ia.ac.cn\n\nLingfeng Wang1\n\nChunhong Pan1\n\nGaofeng Meng1\n\nAbstract\n\nConvolutional neural networks (CNNs) are inherently subject to invariable \ufb01lters\nthat can only aggregate local inputs with the same topological structures. It causes\nthat CNNs are allowed to manage data with Euclidean or grid-like structures (e.g.,\nimages), not ones with non-Euclidean or graph structures (e.g., traf\ufb01c networks). To\nbroaden the reach of CNNs, we develop structure-aware convolution to eliminate\nthe invariance, yielding a uni\ufb01ed mechanism of dealing with both Euclidean and\nnon-Euclidean structured data. Technically, \ufb01lters in the structure-aware convolu-\ntion are generalized to univariate functions, which are capable of aggregating local\ninputs with diverse topological structures. Since in\ufb01nite parameters are required\nto determine a univariate function, we parameterize these \ufb01lters with numbered\nlearnable parameters in the context of the function approximation theory. By re-\nplacing the classical convolution in CNNs with the structure-aware convolution,\nStructure-Aware Convolutional Neural Networks (SACNNs) are readily estab-\nlished. 
Extensive experiments on eleven datasets strongly evidence that SACNNs\noutperform current models on various machine learning tasks, including image\nclassi\ufb01cation and clustering, text categorization, skeleton-based action recognition,\nmolecular activity detection, and taxi \ufb02ow prediction.\n\n1\n\nIntroduction\n\nConvolutional neural networks (CNNs) provide an effective and ef\ufb01cient framework to deal with\nEuclidean structured data, including speeches and images. As a core module in CNNs, the convolution\nunit explicitly allows to share parameters among the whole spatial domains to extremely reduce the\nnumber of parameters, without sacri\ufb01cing the expressive capability of networks [3]. Bene\ufb01ting from\nsuch artful modeling, signi\ufb01cant successes have been achieved in a multitude of \ufb01elds, including the\nimage classi\ufb01cation [15, 24] and clustering [5, 6], the object detection [9, 32], and amongst others.\nAlthough the achievements in the literature are brilliant, CNNs are still incompetent to handle non-\nEuclidean structured data, such as the traf\ufb01c \ufb02ow data on traf\ufb01c networks, the relational data on\nsocial networks, and the active data on molecule structure networks. The major limitation originates\nfrom that the classical \ufb01lters are invariant at each location. As a result, the \ufb01lters can only be applied\nto aggregate local inputs with the same topological structures, not with diverse topological structures.\nIn order to eliminate the limitation, we develop structure-aware convolution in which a single share-\nable \ufb01lter suf\ufb01ces to aggregate local inputs with diverse topological structures. For this purpose, we\ngeneralize the classical \ufb01lters to univariate functions that can be effectively and ef\ufb01ciently parameter-\nized under the guidance of the function approximation theory. 
Then, we introduce local structure representations to quantitatively encode topological structures. By modeling these representations into the generalized filters, the corresponding local inputs can be aggregated by the generalized filters. In practice, Structure-Aware Convolutional Neural Networks (SACNNs) can be readily established by replacing the classical convolution in CNNs with our structure-aware convolution. Since all the operations in our structure-aware convolution are differentiable, SACNNs can be trained end-to-end by the standard back-propagation.\nTo sum up, the key contributions of this paper are:\n\n\u2022 The structure-aware convolution is developed to establish SACNNs that uniformly deal with both Euclidean and non-Euclidean structured data, which broadens the reach of convolution.\n\u2022 We introduce learnable local structure representations, which endow SACNNs with the capability of capturing the latent structures of data in a purely data-driven way.\n\u2022 By taking advantage of the function approximation theory, SACNNs can be effectively and efficiently trained with the standard back-propagation, which guarantees their practicability.\n\u2022 Extensive experiments demonstrate that SACNNs are superior to current models in various machine learning tasks, including classification, clustering, and regression.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n2 Related work\n\n2.1 Convolutional neural networks (CNNs)\n\nTo elevate the performance of CNNs, much research has been devoted to designing the convolution units, which can be roughly divided into two classes, i.e., handcrafted and learnable ones.\nHandcrafted convolution units generally derive from professional knowledge. Early convolution units [24, 26] use large filters, e.g., 7 \u00d7 7 pixels in images. 
To increase the nonlinearity, stacking\nmultiple small \ufb01lters (e.g., 3 \u00d7 3 pixels) instead of using a single large \ufb01lter has become a common\ndesign in CNNs [38]. To obtain larger receptive \ufb01elds, the dilated convolution [41], whose receptive\n\ufb01eld size grows exponentially while the number of parameters grows linearly, is proposed. In addition,\nthe separable convolution [7] promotes performance by integrating various \ufb01lters with diverse sizes.\nAmong the latter, lots of efforts have been widely made to learn convolution units. By introducing\nadditional parameters named offsets, the active convolution [19] is explored to learn the shape of\nconvolution. To achieve dynamic offsets that vary with inputs, the deformable convolution [9] is\nproposed. Contrary to such modi\ufb01cations, some approaches have been devoted to directly capturing\nstructures of data to improve the performance of CNNs, such as the spatial transform networks [18].\nWhile these models have been successful on Euclidean domains, they can hardly be applied to\nnon-Euclidean domains. In contrast, our SACNNs can be utilized on these two domains uniformly.\n\n2.2 Graph convolutional neural networks (GCNNs)\n\nRecently, there has been a growing interest in applying CNNs to non-Euclidean domains [3, 29, 31,\n35]. Generally, existing methods can be summarized into two types, i.e., spectral and spatial methods.\nSpectral methods explore an analogical convolution operator over non-Euclidean domains on the basis\nof the spectral graph theory [4, 16, 27]. Relying on the eigenvectors of graph Laplacian, data with non-\nEuclidean structures can be \ufb01ltered on the corresponding spectral domain. 
To improve efficiency and obtain spectrum-free methods that avoid eigen-decomposition, polynomial-based networks are developed to execute convolution on non-Euclidean domains [10, 22].\nContrary to the spectral methods, spatial methods analogize the convolutional strategy based on local spatial filtering [1, 2, 30, 31, 37, 40]. The major difference between these methods lies in the intrinsic coordinate systems used for encoding local patches. Typically, the diffusion CNNs [1] encode local patches based on the random walk process on graphs, the anisotropic CNNs [2] employ an anisotropic patch-extraction method, and the geodesic CNNs [30] represent local patches in polar coordinates. In the mixture-model CNNs [31], learnable local pseudo-coordinates are developed to parameterize local patches in a general way. Additionally, a series of spatial methods without the classical convolutional strategy have also been explored, including the message passing neural networks [12, 28, 34] and the graph attention networks [39].\nIn spite of considerable achievements, both spectral and spatial methods partially rely on fixed structures (i.e., a fixed relationship matrix) in graphs. By comparison, benefiting from the proposed structure-aware convolution, the structures can be learned from data automatically in our SACNNs.\n\n3 Structure-aware convolution\n\nConvolution, intrinsically, is an aggregation operation between local inputs and filters. In practice, local inputs involve not only their input values but also topological structures. Accordingly, filters should be in a position to aggregate local inputs with diverse topological structures. 
To this end, we develop the structure-aware convolution by generalizing the filters in the classical convolution and modeling the local structure information into the generalized filters.\nThe filters in the classical convolution can be smoothly generalized to univariate functions. Without loss of generality and for simplicity, we elaborate such generalization with 1-dimensional data. Given an input $\\mathbf{x} \\in \\mathbb{R}^n$ and a filter $\\mathbf{w} \\in \\mathbb{R}^{2m-1}$, the output at the $i$-th vertex (location) is\n\n$$\\bar{y}_i = \\mathbf{w}^T \\mathbf{x}_i = \\sum_{i-m < j < i+m} w_{j-i+m} \\cdot x_j, \\quad i \\in \\{1, 2, \\cdots, n\\}, \\tag{1}$$\n\nwhere $\\mathbf{x}_i = [x_{i-m+1}, \\cdots, x_{i+m-1}]^T$ is the local input at the $i$-th vertex, $i-m < j < i+m$ indicates that the $j$-th vertex is a neighbor of the $i$-th vertex, and $w_{j-i+m}$ and $x_j$ signify the $(j-i+m)$-th and $j$-th elements of $\\mathbf{w}$ and $\\mathbf{x}$, respectively. For any univariate function $f(\\cdot)$ that satisfies $f(j-i+m) = w_{j-i+m}$ for every neighbor index $j$, Eq. (1) can be equivalently rewritten as\n\n$$\\bar{y}_i = \\mathbf{f}_{\\mathcal{R}}^T \\mathbf{x}_i = \\sum_{i-m < j < i+m} f(j-i+m) \\cdot x_j, \\quad i \\in \\{1, 2, \\cdots, n\\}, \\tag{2}$$\n\nwhere $f(\\cdot)$ is called a functional filter, $\\mathcal{R} = \\{j-i+m \\mid i-m < j < i+m\\} = \\{1, 2, \\cdots, 2m-1\\}$, and $\\mathbf{f}_{\\mathcal{R}} = \\{f(r) \\mid r \\in \\mathcal{R}\\}$. In effect, $\\mathcal{R}$ encodes the relationships between a vertex and its neighbors. For example, $r \\in \\mathcal{R}$ means that the $(i-m+r)$-th vertex is the $r$-th neighbor of the $i$-th vertex. Since the relationships in $\\mathcal{R}$ reflect the structure information around a vertex, we call $\\mathcal{R}$ a local structure representation. In the classical convolution, the local structure representation $\\mathcal{R}$ is constant, so the same $\\mathbf{f}_{\\mathcal{R}}$ is shared at each vertex. 
As a result, the classical convolution can only manage data with the same local topological structures, not with diverse ones.\nTo handle this limitation, we introduce general local structure representations to quantitatively encode any local topological structure, and then develop the structure-aware convolution by replacing the constant $\\mathcal{R}$ in the classical convolution with the introduced general ones. Technically, both Euclidean and non-Euclidean structured data can be represented by a graph $\\mathcal{G} = (\\mathcal{V}, \\mathcal{E}, \\mathbf{R})$, where the vertices in $\\mathcal{V}$ store the values of data, the edges in $\\mathcal{E}$ indicate whether two vertices are connected, and the relationship matrix $\\mathbf{R}$ signifies the structure information in the graph $\\mathcal{G}$. For a vertex $i \\in \\mathcal{V}$, the local structure representation at $i$ is encoded via the relationships with its neighbors, i.e.,\n\n$$\\mathcal{R}_i = \\{r_{ji} \\mid e_{ji} \\in \\mathcal{E}\\}, \\quad i \\in \\{1, 2, \\cdots, n\\}, \\tag{3}$$\n\nwhere $e_{ji} \\in \\mathcal{E}$ means that the $j$-th vertex is a neighbor of the $i$-th vertex, and $r_{ji}$ is the element of $\\mathbf{R}$ at $(j, i)$ and indicates the relationship from the $j$-th vertex to the $i$-th vertex. Note that $\\mathcal{S} = \\{\\mathcal{R}_i \\mid i \\in \\mathcal{V}\\}$ includes the whole structure information in the graph $\\mathcal{G}$ by integrating the local structure representations together. This implies that Eq. (3) is a reasonable formulation for local topological structures. Based on the introduced local structure representations, the structure-aware convolution is developed by modeling these representations into the generalized functional filters. Formally, given an input $\\mathbf{x}$ embedded on the graph $\\mathcal{G}$ and a functional filter $f(\\cdot)$, we define the structure-aware convolution as\n\n$$\\bar{y}_i = \\mathbf{f}_{\\mathcal{R}_i}^T \\mathbf{x}_i = \\sum_{e_{ji} \\in \\mathcal{E}} f(r_{ji}) \\cdot x_j, \\quad i \\in \\{1, 2, \\cdots, n\\}, \\tag{4}$$\n\nwhere $\\mathbf{f}_{\\mathcal{R}_i} = \\{f(r_{ji}) \\mid e_{ji} \\in \\mathcal{E}\\}$ varies with $\\mathcal{R}_i$. Benefiting from this modification, the structure-aware convolution is capable of aggregating local inputs with diverse topological structures.\n\n4 Structure-aware convolutional neural networks\n\nSACNNs are established by replacing the classical convolution in CNNs with the structure-aware convolution. A structure-aware convolutional layer is illustrated in Figure 1. However, two essential problems need to be tackled before training SACNNs. First, functional filters in the structure-aware convolution are univariate functions, which need infinitely many parameters to be determined. This implies that SACNNs cannot be learned in the usual way, and an effective and efficient strategy is required to learn these filters with a finite number of parameters. Second, local structure representations (or the relationship matrix $\\mathbf{R}$) can hardly be defined in advance, and thus a learning mechanism is needed. In the following, Sections 4.1 and 4.2 focus on tackling these two problems, respectively.\n\nFigure 1: A structure-aware convolutional layer. For clarity of exposition, the input $\\mathbf{x}$ has $c = 2$ channels with $n = 6$ vertices, the output $\\mathbf{y}$ has a single channel, and $\\bar{\\mathbf{x}}_j, \\bar{\\mathbf{x}}_i \\in \\mathbb{R}^c$ indicate the $j$-th and $i$-th rows of the input $\\mathbf{x}$, respectively. For each vertex $i$, its local structure representation is first captured from the input and represented as $\\mathcal{R}_i$, which is identically shared for each channel of the input $\\mathbf{x}$. Afterwards, the local inputs in the first and second channels are aggregated via the first filter $f_1(\\cdot)$ and the second filter $f_2(\\cdot)$ respectively, with the same $\\mathcal{R}_i$. 
Note that $f_1(\\cdot)$ and $f_2(\\cdot)$ are shared for every location in the first and second channels, respectively.\n\n4.1 Polynomial parametrization for functional filters\n\nWe parameterize the developed functional filters with a finite number of learnable parameters under the guidance of the function approximation theory. In mathematics, an arbitrary univariate function $h(x)$ can be composed of a group of basis functions $\\{h_1(x), h_2(x), \\cdots\\}$ with a set of coefficients $\\{v_1, v_2, \\cdots\\}$, denoted by $h(x) \\simeq \\sum_{k=1}^{t} v_k \\cdot h_k(x)$, where $h_k(x)$ and $v_k$ are the $k$-th basis function and the corresponding coefficient, respectively. The approximation becomes an equality as $t$ tends to infinity. Because of their high efficiency [14], our functional filters are parameterized based on the Chebyshev polynomials, which form an orthogonal basis for $L^2([-1, 1], dy/\\sqrt{1-y^2})$, the Hilbert space of square integrable functions with respect to the measure $dy/\\sqrt{1-y^2}$. Formally, the Chebyshev polynomial $h_k(x)$ of order $k-1$ ($k \\geq 3$) can be generated by the stable recurrence relation $h_k(x) = 2x h_{k-1}(x) - h_{k-2}(x)$, with $h_1(x) = 1$ and $h_2(x) = x$. In practice, the truncated expansion of Chebyshev polynomials is employed to approximate the functional filter $f(\\cdot)$ in Eq. (4), i.e.,\n\n$$y_i = \\sum_{e_{ji} \\in \\mathcal{E}} f(r_{ji}) \\cdot x_j = \\sum_{e_{ji} \\in \\mathcal{E}} \\left( \\sum_{k=1}^{t} v_k \\cdot h_k(r_{ji}) \\right) \\cdot x_j, \\quad i \\in \\{1, 2, \\cdots, n\\}, \\tag{5}$$\n\nwhere $t$ is the number of the truncated polynomials, and $\\{v_1, \\cdots, v_t\\}$ are $t$ learnable coefficients corresponding to the polynomials $\\{h_1(x), \\cdots, h_t(x)\\}$. 
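The truncated expansion in Eq. (5) can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' released implementation: the function names chebyshev_basis and structure_aware_conv, the list-based neighbors and relations layout, and the single-channel input are all hypothetical choices made for this example.

```python
import numpy as np


def chebyshev_basis(r, t):
    # Evaluate h_1(r), ..., h_t(r) with the recurrence
    # h_k = 2 r h_{k-1} - h_{k-2}, starting from h_1 = 1 and h_2 = r.
    r = np.asarray(r, dtype=float)
    basis = [np.ones_like(r)]
    if t > 1:
        basis.append(r)
    for _ in range(2, t):
        basis.append(2.0 * r * basis[-1] - basis[-2])
    return np.stack(basis, axis=0)  # shape (t,) + r.shape


def structure_aware_conv(x, neighbors, relations, v):
    # x: (n,) single-channel input (assumption of this sketch).
    # neighbors[i]: indices j with e_ji in E.
    # relations[i]: values r_ji in [-1, 1], aligned with neighbors[i].
    # v: (t,) learnable coefficients shared by every vertex.
    y = np.zeros(len(neighbors))
    for i, (nbrs, r) in enumerate(zip(neighbors, relations)):
        f_r = v @ chebyshev_basis(r, len(v))  # f(r_ji) for every neighbor j
        y[i] = f_r @ x[nbrs]                  # sum_j f(r_ji) * x_j
    return y


# Toy chain graph with 4 vertices; the relation values are made up.
x = np.array([1.0, 2.0, 3.0, 4.0])
neighbors = [[0, 1], [0, 1, 2], [1, 2, 3], [2, 3]]
relations = [[0.0, 0.5], [-0.5, 0.0, 0.5], [-0.5, 0.0, 0.5], [-0.5, 0.0]]
v = np.array([1.0, 0.0, 0.0])  # only h_1 = 1 is active, so f(r) = 1
y = structure_aware_conv(x, neighbors, relations, v)
print(y)
```

With v = [1, 0, 0] the functional filter is constant, so each output is the plain sum of the neighboring inputs (3, 6, 9, 7 for this toy graph). Vertices with different numbers of neighbors are handled by the same coefficient vector v, which is the point of the functional filter.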
Note that $f(r_{ji})$ can be cumulatively computed based on the recurrence relation, leading to an efficient computing strategy.\n\n4.2 Local structure representations learning\n\nTo eliminate the feature engineering, we learn local structure representations from data rather than using predefined ones. To preserve the structure consistency between channels, for every structure-aware convolutional layer, only a single local structure representation set $\\mathcal{S} = \\{\\mathcal{R}_i \\mid i \\in \\mathcal{V}\\}$ is learned and shared by all channels of the input. Formally, given a multi-channel input feature map $\\mathbf{x} \\in \\mathbb{R}^{n \\times c}$, where $n$ and $c$ denote the numbers of vertices and channels respectively, the local structure representation at each vertex is learned as\n\n$$\\mathcal{R}_i = \\{r_{ji} = T(\\bar{\\mathbf{x}}_j^T \\mathbf{M} \\bar{\\mathbf{x}}_i) \\mid e_{ji} \\in \\mathcal{E}\\}, \\quad i \\in \\{1, 2, \\cdots, n\\}, \\tag{6}$$\n\nwhere $\\bar{\\mathbf{x}}_j, \\bar{\\mathbf{x}}_i \\in \\mathbb{R}^c$ indicate the $j$-th and $i$-th rows of the input $\\mathbf{x}$ respectively, $\\mathbf{M} \\in \\mathbb{R}^{c \\times c}$ is a matrix with $c \\times c$ learnable parameters to measure relationships between local vertices, and $T(\\cdot)$ is the Tanh function, which normalizes the elements of local structure representations into $[-1, 1]$.\nThis local structure learning formulation has two good properties. First, $\\mathbf{M}$ is identically shared for each channel of the input, so every channel possesses the same structure in each structure-aware convolutional layer and the size of $\\mathbf{M}$ only depends on the number of channels. As a result, only a few additional parameters need to be learned, which can alleviate overfitting when training data is limited. Second, $\\mathbf{M}$ is not constrained to be a symmetric matrix, namely $r_{ji}$ may not be equal to $r_{ij}$. This implies that our approach is capable of modeling not only undirected structures but also directed structures, such as traffic networks and social networks.\n\n4.3 Understanding the structure-aware convolution\n\nIn this subsection, we give the following theorem to reveal the essence of our structure-aware convolution (the proof is reported in the supplementary material).\nTheorem 1. Under the Chebyshev polynomial basis, the structure-aware convolution is equivalent to\n\n$$y_i = \\mathbf{v}^T \\mathbf{P}_i \\mathbf{x}_i, \\quad i \\in \\{1, 2, \\cdots, n\\},$$\n\nwhere $\\mathbf{v} \\in \\mathbb{R}^t$ is the vector of polynomial coefficients, $\\mathbf{P}_i \\in \\mathbb{R}^{t \\times m}$ is a matrix determined by the local structure representation $\\mathcal{R}_i$ and the polynomials, and $\\mathbf{x}_i \\in \\mathbb{R}^m$ is the local input at the $i$-th vertex.\nTheorem 1 indicates that the structure-aware convolution can be split into two independent units, i.e., a transformation $\\mathbf{P}_i \\in \\mathbb{R}^{t \\times m}$ and a vector $\\mathbf{v} \\in \\mathbb{R}^t$. In the first unit, the transformation $\\mathbf{P}_i$ encodes the $m$-dimensional local inputs as $t$-dimensional vectors. Since the basis functions are fixed in the Chebyshev polynomial basis, $\\mathbf{P}_i$ depends purely on the corresponding local structure representation $\\mathcal{R}_i$, which varies with the vertex $i$ and can be learned according to Eq. (6). It is worth noting that this transformation $\\mathbf{P}_i$ is similar to a specific local spatial transformer in the spatial transformer networks [18]. 
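The factorization stated in Theorem 1 can be checked numerically on a toy example. The sketch below is not taken from the authors' code. It assumes that row k of P_i holds h_k(r_ji) over the m neighbors of vertex i (the exact construction is given in the supplementary material), and the names local_relations and chebyshev_matrix, the random sizes, and the neighbor list are hypothetical.

```python
import numpy as np


def local_relations(x_rows, i, nbrs, M):
    # Eq. (6): r_ji = tanh(x_j^T M x_i) for every neighbor j of vertex i.
    return np.tanh(x_rows[nbrs] @ M @ x_rows[i])


def chebyshev_matrix(r, t):
    # Assumed layout of P_i: row k holds h_k(r_ji) for the m neighbors of i.
    P = np.ones((t, len(r)))
    if t > 1:
        P[1] = r
    for k in range(2, t):
        P[k] = 2.0 * r * P[k - 1] - P[k - 2]
    return P


rng = np.random.default_rng(0)
n, c, t = 6, 2, 4
x = rng.normal(size=(n, c))       # multi-channel input, one row per vertex
M = rng.normal(size=(c, c))       # learnable, not constrained to be symmetric
v = rng.normal(size=t)            # coefficient vector shared by every vertex
i, nbrs = 2, np.array([1, 2, 3])  # vertex i and its m = 3 neighbors

r = local_relations(x, i, nbrs, M)
P = chebyshev_matrix(r, t)
x_local = x[nbrs, 0]              # local input of the first channel

y_factored = v @ P @ x_local                  # Theorem 1: v^T P_i x_i
f_r = np.polynomial.chebyshev.chebval(r, v)   # f(r_ji) evaluated directly
y_direct = f_r @ x_local                      # Eq. (5)
print(y_factored, y_direct)
```

Both values agree up to floating-point error. In this sketch only P depends on the vertex through r, while v is the same everywhere, which mirrors the split into a structure-dependent transformation and a shared vector.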
In the second unit, the learnable vector v is shared by every vertex to aggregate these encoded local inputs, which is akin to the classical convolution. By integrating these two learnable units together, the structure-aware convolution can simultaneously focus on local input values and local topological structures to capture high-level representations.\n\n5 Experiments\n\nIn this section, we carry out extensive experiments to verify the capability of SACNNs. Due to the space restriction, we report some experimental details in the supplement, such as the gradients during training, descriptions of datasets, and network architectures. Our core code will be released at https://github.com/vector-1127/SACNNs.\n\n5.1 Experimental settings\n\nWe perform experiments on six Euclidean and five non-Euclidean structured datasets to verify the capability of SACNNs. The six Euclidean structured datasets are the Mnist [26], Cifar-10 [23], Cifar-100 [23], STL-10 [8], Image10 [6], and ImageDog [6] image datasets. The five non-Euclidean structured datasets are the text categorization datasets 20News and Reuters [25], the action recognition dataset NTU [36], the molecular activity dataset DPP4 [20], and the taxi flow dataset TF-198 [42], which consists of the taxi flow data at 198 traffic intersections in a city.\nWith respect to the scope of applications, two types of methods are compared, i.e., CNNs and GCNNs. On Euclidean domains, popular CNN models, including the classical convolution (ClaCNNs) [26], the separable convolution (SepCNNs) [7], the active convolution (ActCNNs) [19], and the deformable convolution (DefCNNs) [9], are utilized for comparisons. 
On non-Euclidean domains, both spatial and spectral GCNNs are taken as competitors to SACNNs, including the local connected networks (LCNs) [4], the dynamic filters based networks (DFNs) [40], the edge-conditioned convolution (ECC) [37], the mixture-model networks (MoNets) [31] (which is a generalization of the diffusion CNNs [1], the anisotropic CNNs [2], and the geodesic CNNs [30]), the spectral networks (SCNs) [16], the Chebyshev based SCNs (ChebNets) [10], and the graph convolution networks (GCNs) [22]. Furthermore, SACNNs\u2020, which omit the structure learning in SACNNs, are used as a baseline of our method to show the effectiveness of structure learning. In SACNNs\u2020, $\\mathcal{R}_i$ is assigned by uniformly sampling on $[-1, 1]$, e.g., $\\mathcal{R}_i = \\{-\\frac{1}{2}, 0, \\frac{1}{2}\\}$ is predefined when a 3-dimensional filter is required.\nThe hyper-parameters in SACNNs are set as follows. In our experiments, max pooling and the Graclus method [11] are employed as the pooling operations to coarsen the feature maps in SACNNs when managing Euclidean and non-Euclidean structured data, respectively; the ReLU function [13] is used as the activation function; batch normalization [17] is employed to normalize the inputs of all layers; parameters are randomly initialized with a uniform distribution $U(-0.1, 0.1)$; and the order of polynomials $t$ is set to the maximum number of neighbors among the whole spatial domains (e.g., $t = 9$ if we attempt to learn 3 \u00d7 3 filters in images). During the training stage, the Adam optimizer [21] with an initial learning rate of 0.001 is utilized to train SACNNs, the mini-batch size is set to 32, the categorical cross entropy loss is used in the classification tasks, and the mean squared error loss is used in the regression tasks. During the testing stage, the squared correlation and the root mean square error are used to evaluate the results on DPP4 and TF-198 respectively, and the classification or clustering accuracy is used for the others. For a reasonable evaluation, we perform 5 random restarts and the average results are used for comparisons.\n\nTable 1: The classification or clustering accuracies on the experimental Euclidean structured datasets. For clarity, \u2021 indicates that DAC [6] is used to cluster the whole samples in each experimental dataset.\n\n| Datasets | Mnist | Cifar-10 | Cifar-100 | STL-10 | Image10\u2021 | ImageDog\u2021 | Time (s) |\n| --- | --- | --- | --- | --- | --- | --- | --- |\n| ClaCNNs [26] | 0.9953 | 0.9075 | 0.6629 | 0.6635 | 0.5272 | 0.2748 | 53\u00b11 |\n| SepCNNs [7] | 0.9910 | 0.9062 | 0.6643 | 0.6685 | 0.5637 | 0.2754 | 68\u00b11 |\n| ActCNNs [19] | 0.9926 | 0.9086 | 0.6648 | 0.6761 | 0.5478 | 0.2786 | 83\u00b12 |\n| DefCNNs [9] | 0.9908 | 0.8718 | 0.6349 | 0.6564 | 0.4853 | 0.2355 | 125\u00b13 |\n| SACNNs\u2020 | 0.9957 | 0.9091 | 0.6759 | 0.7175 | 0.5953 | 0.2801 | 78\u00b12 |\n| SACNNs | 0.9961 | 0.9167 | 0.6938 | 0.7358 | 0.6007 | 0.2913 | 136\u00b12 |\n\nFigure 2: Invariance properties of various CNNs. (a) Gaussian noises with mean 0 and variance \u03b4. (b) Rotation. (c) Shift. (d) Scale. (e) Normalized total variations at the initial stage. (f) Normalized total variations at the final stage. Large figures can be found in the supplementary material.\n\n5.2 Compared with various CNNs on Euclidean domains\n\nTo validate the capability of SACNNs on the Euclidean domains, several SACNNs are modeled to classify images in Mnist, Cifar-10, Cifar-100 and STL-10, and to cluster images in Image10 and ImageDog based on the DAC model [6]. 
In this experiment, images are recast as specific multi-channel graphs on 2-dimensional regular grids. In the graphs, each vertex is provided with 9 neighbors including itself, which is similar to the classical convolution with a 3 \u00d7 3 filter.\nIn Table 1, we report the quantitative results of the modeled networks with diverse convolution units on various Euclidean structured datasets. Note that SACNNs achieve superior performance on both classification and clustering tasks, which implies that SACNNs and SACNNs\u2020 are capable of managing Euclidean structured data effectively. In Figure 2, we empirically verify the invariance property of the compared CNNs on the Mnist dataset. In this experiment, we disturb the testing data in Mnist with four typical transformations, including Gaussian noise, rotation, shift, and scale. Then, the disturbed data are utilized to validate the trained networks with the evaluated convolution units. The results in Figure 2 indicate that SACNNs and SACNNs\u2020 are robust to such transformations. Furthermore, we analyze the learned filters via the normalized total variation [33], which can reveal the smoothness of filters. Figure 2 (e) and (f) show that smoother filters are obtained in SACNNs\u2020 at both the initial and final stages. 
Based on the conclusion in [33], higher deformation stability will be achieved when smoother filters are learned, which is in agreement with the results of our experiments in Figure 2 (a)-(d).\n\n[Figure 2 panels: (a)-(d) plot classification accuracy against the noise variance \u03b4, the rotation angle (degrees), the shift (pixels), and the scale for ClaCNNs, SepCNNs, ActCNNs, DefCNNs, SACNNs\u2020, and SACNNs; (e) and (f) plot the normalized total variation per layer for SACNN\u2020 and ClaCNN at the initial and final stages.]\n\nTable 2: The results on the experimental non-Euclidean structured datasets. For each dataset, \u2191 (\u2193) indicates that larger (smaller) values are better.\n\n| Datasets | Mnist\u2191 | 20News\u2191 | Reuters\u2191 | NTU\u2191 | DPP4\u2191 | TF-198\u2193 | Time (s) |\n| --- | --- | --- | --- | --- | --- | --- | --- |\n| LCNs [4] | 0.9914 | 0.6491 | 0.9162 | 0.5457 | 0.225 | 68.83 | 175\u00b12 |\n| DFNs [40] | 0.9840 | 0.7017 | 0.9046 | 0.6346 | 0.214 | 70.35 | 192\u00b13 |\n| ECC [37] | 0.9937 | 0.7003 | 0.9114 | 0.6416 | 0.249 | 65.35 | 238\u00b14 |\n| MoNets [31] | 0.9919 | 0.6929 | 0.9113 | 0.6354 | 0.256 | 69.35 | 252\u00b14 |\n| SCNs [16] | 0.9726 | 0.6453 | 0.8985 | 0.5818 | 0.248 | 75.83 | 1384\u00b111 |\n| ChebNets [10] | 0.9914 | 0.6826 | 0.9124 | 0.6384 | 0.265 | 65.86 | 673\u00b18 |\n| GCNs [22] | 0.9867 | 0.6278 | 0.8992 | 0.5983 | 0.258 | 71.54 | 341\u00b14 |\n| SACNNs\u2020 | 0.9957 | 0.7362 | 0.9365 | 0.6844 | 0.279 | 58.82 | 78\u00b12 |\n| SACNNs | 0.9961 | 0.7436 | 0.9452 | 0.6931 | 0.285 | 53.72 | 136\u00b12 |\n\n5.3 Compared with diverse GCNNs on non-Euclidean domains\n\nTo verify the versatility of SACNNs for non-Euclidean structured data, we build SACNNs to classify the texts in 20News and Reuters, recognize the skeleton-based actions in NTU, 
estimate the activities of molecules in DPP4, and predict the taxi flows in TF-198, respectively. In addition, Mnist is also used to see how these GCNNs perform on Euclidean structured data.\nTable 2 gives the results of this experiment, which show that SACNNs and SACNNs\u2020 outperform all the compared methods with significant margins. In addition, we have several observations from the table. First, substantial improvements are achieved by SACNNs on both Euclidean and non-Euclidean domains in numerous tasks. Such good performance verifies that SACNNs can effectively deal with data on different domains, without any human intervention. Second, Table 1, Table 2 and Figure 2 consistently show that SACNNs achieve better performance than SACNNs\u2020. These results empirically confirm that the local structure representation learning is capable of capturing the significant structure information from data, thus improving the capability of SACNNs with only a few additional learnable parameters. Furthermore, Table 1 and Table 2 report the time consumption of the evaluated methods when one epoch is executed on Mnist during training. From these tables, we observe that SACNNs are faster than the competing GCNN methods. Compared with the CNN methods, the time cost of SACNNs is tolerable, which ensures the practicability of SACNNs.\n\n5.4 Ablation study\n\nIn this subsection, we perform extensive ablation studies on diverse datasets to comprehensively analyze the developed SACNNs. All the results are illustrated in Figure 3. Due to the space limitation, the learned filters in SACNNs are presented in the supplementary material.\n\nImpact of polynomial order To show the impact of polynomial order t on the structure-aware convolution, we select t from {5, 40, 80, 120, 160} to generate 11 \u00d7 11 filters to classify STL-10. Figure 3 (a) illustrates the validation errors of SACNNs with different t. 
One can observe that the performance generally improves as the polynomial order t increases, and then saturates once the filters can be well approximated, i.e., when t \u2265 80. Moreover, it is worth noting that the developed SACNNs can utilize parameters more effectively than ClaCNNs. This is empirically supported by the observation that SACNNs with only 40 parameters per filter can achieve significantly better performance than ClaCNNs with 11 \u00d7 11 = 121 parameters per filter.\n\nInfluence of channels On the Cifar-10 dataset, we model SACNNs with different numbers of channels c (i.e., 8, 16, 32) to study their influence on the local structure representations learning. Specifically, we observe the following two tendencies from Figure 3 (b). First, the performance of both SACNNs and ClaCNNs benefits from an increase in the number of channels. This is reasonable since more parameters may improve the expressive capability of networks in general. Second, our SACNNs work consistently better than ClaCNNs, especially when the channel number is relatively large. One plausible reason is that more information can be exploited to model the latent structure information, which helps SACNNs achieve superior performance.\n\nFigure 3: Ablation studies on various datasets. (a) Impact of polynomial order. (b) Influence of channels. (c) Transfer learning from Reuters to 20News. (d) Impact of training samples. (e) Influence of basis functions. (f) Integration with recent networks. (g) Sensitivity to initialization. (h) Parameters distribution. Large figures can be found in the supplementary material.\n\nTransfer learning from Reuters to 20News To reveal the transferability of SACNNs, we fine-tune the SACNNs that are pre-trained on Reuters (denoted as *SACNNs), with a small number of labeled samples (i.e., 1k, 2k, 3k) in the 20News dataset. 
Figure 3 (c) shows that pre-training on\nReuters can significantly improve the performance of SACNNs on 20News and stabilize the training\nprocess at the same time, especially when labeled training samples are limited. This indicates that\nSACNNs learned on one domain can be transferred to similar domains.\n\nImpact of training samples We randomly sample three sub-datasets of various sizes (i.e., 10k,\n25k, 50k) from Cifar-100 to evaluate the impact of the number of training samples on SACNNs. As\nillustrated in Figure 3 (d), the performance of SACNNs improves when more training samples are\nused. Furthermore, the superiority of our SACNNs over ClaCNNs holds in all these cases, which\nmeans that SACNNs are capable of tackling machine learning tasks with both rich and limited data.\n\nInfluence of basis functions To investigate the influence of basis functions on SACNNs, the\nLegendre polynomials are employed as basis functions to learn filters on Cifar-10. Similar to the\nChebyshev polynomials, the Legendre polynomial h_k(x) of order k\u22121 (k \u2265 3) can be obtained based\non the recurrence relation h_k(x) = ((2k+1)/(k+1)) h_{k\u22121}(x) \u2212 (k/(k+1)) h_{k\u22122}(x), with h_1(x) = 1 and h_2(x) = x.\nFrom Figure 3 (e), the training processes are almost the same despite the different bases. The\nslight mismatch may come from the randomness in training, e.g., random mini-batch selection.\nThis demonstrates that the learnability of SACNNs is robust to the choice of basis functions.\n\nIntegration with recent networks A class of popular networks, i.e., ResNets [15], is employed\nto survey the range of applications of our structure-aware convolution. The results in Figure 3 (f)\nclearly indicate that improvements are achieved by replacing the classical convolution in\nResNets with the structure-aware convolution. 
This validates that the structure-aware\nconvolution can be applied to general ClaCNNs and is not confined to simple and shallow networks.\n\nSensitivity to initialization We carry out an experiment on Mnist to compare the\nsensitivity to initialization of SACNNs and ClaCNNs. In this experiment, parameters in the networks\nare randomly initialized from a uniform distribution U(\u2212\u03b1, \u03b1), where \u03b1 is randomly selected from\n[0, 1]. Figure 3 (g) illustrates the descent of the loss functions in ClaCNNs and SACNNs,\nindicating that SACNNs generally converge faster than ClaCNNs and are robust to initialization. A\npossible reason is that all the values in the generated discrete filters fRi = {f(rji) | eji \u2208 E} can\nbe modified together by adjusting each coefficient of the basis functions, which may yield more precise\ngradients to accelerate and stabilize the training process.\n\nParameters distribution Figure 3 (h) shows the distributions of parameters learned by ClaCNNs\nand SACNNs on Mnist in ten convolutional layers. From the figure, we make the following two\nobservations. First, the parameters in both SACNNs and ClaCNNs have almost the same standard\ndeviations. 
Second, the means of the parameters in SACNNs are closer to 0 than those in ClaCNNs.\nThese observations reveal that SACNNs have sparser parameters than ClaCNNs. As a result,\nmore robust models are obtained, which is in accordance with the results in Section 5.2.\n\n6 Conclusion\n\nWe present a conceptually simple yet powerful structure-aware convolution to establish SACNNs.\nIn the structure-aware convolution, filters are represented via univariate functions, which suffice to\naggregate local inputs with diverse topological structures. By virtue of function approximation\ntheory, a numerical strategy is proposed to learn these filters effectively and efficiently.\nFurthermore, rather than using the predefined local structures of data, we incorporate them into the\nstructure-aware convolution to learn the underlying structure information from data automatically.\nExtensive experimental results strongly demonstrate that the structure-aware convolution can be\nequipped in SACNNs to learn high-level representations and latent structures for both Euclidean and\nnon-Euclidean structured data. 
In the future, we plan to systematically investigate the interpretability\nof SACNNs based on their functional filters, i.e., univariate functions.\n\nAcknowledgments\n\nThis work was supported by the National Natural Science Foundation of China under Grants\n91646207, 61773377 and 61573352, and the Beijing Natural Science Foundation under Grant\nL172053. We would like to thank Lele Yu, Bin Fan, Cheng Da, Tingzhao Yu, Xue Ye, Hongfei Xiao,\nand Qi Zhang for their invaluable contributions in shaping the early stage of this work.\n\nReferences\n\n[1] James Atwood and Don Towsley. Diffusion-convolutional neural networks. In NIPS, pages 1993\u20132001, 2016.\n\n[2] Davide Boscaini, Jonathan Masci, Emanuele Rodol\u00e0, and M. M. Bronstein. Learning shape correspondence with anisotropic convolutional neural networks. In NIPS, pages 3189\u20133197, 2016.\n\n[3] M. M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: Going beyond euclidean data. IEEE Signal Process. Mag., 34(4):18\u201342, 2017.\n\n[4] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. CoRR, abs/1312.6203, 2013.\n\n[5] Jianlong Chang, Lingfeng Wang, Gaofeng Meng, Shiming Xiang, and Chunhong Pan. Deep unsupervised learning with consistent inference of latent representations. Pattern Recognition, 77:438\u2013453, 2017.\n\n[6] Jianlong Chang, Lingfeng Wang, Gaofeng Meng, Shiming Xiang, and Chunhong Pan. Deep adaptive image clustering. In ICCV, pages 5880\u20135888, 2017.\n\n[7] Fran\u00e7ois Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, pages 1800\u20131807, 2017.\n\n[8] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, pages 215\u2013223, 2011.\n\n[9] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. 
Deformable convolutional networks. In ICCV, pages 764\u2013773, 2017.\n\n[10] Micha\u00ebl Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, pages 3837\u20133845, 2016.\n\n[11] I. S. Dhillon, Yuqiang Guan, and Brian Kulis. Weighted graph cuts without eigenvectors: A multilevel approach. IEEE Trans. Pattern Anal. Mach. Intell., 29(11):1944\u20131957, 2007.\n\n[12] Justin Gilmer, S. S. Schoenholz, P. F. Riley, Oriol Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In ICML, pages 1263\u20131272, 2017.\n\n[13] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In AISTATS, pages 315\u2013323, 2011.\n\n[14] D. K. Hammond, Pierre Vandergheynst, and R\u00e9mi Gribonval. Wavelets on graphs via spectral graph theory. Applied & Computational Harmonic Analysis, 30(2):129\u2013150, 2009.\n\n[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770\u2013778, 2016.\n\n[16] Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data. CoRR, abs/1506.05163, 2015.\n\n[17] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448\u2013456, 2015.\n\n[18] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. In NIPS, pages 2017\u20132025, 2015.\n\n[19] Yunho Jeon and Junmo Kim. Active convolution: Learning the shape of convolution for image classification. In CVPR, pages 1846\u20131854, 2017.\n\n[20] Kaggle. Merck molecular activity challenge. https://www.kaggle.com/c/MerckActivity, 2012.\n\n[21] D. P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.\n\n[22] T. N. Kipf and Max Welling. 
Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907, 2016.\n\n[23] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Master\u2019s Thesis, Department of Computer Science, University of Toronto, 2009.\n\n[24] Alex Krizhevsky, Ilya Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1106\u20131114, 2012.\n\n[25] Ken Lang. Newsweeder: Learning to filter netnews. In ICML, pages 331\u2013339, 1995.\n\n[26] Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n\n[27] Ruoyu Li, Sheng Wang, Feiyun Zhu, and Junzhou Huang. Adaptive graph convolutional neural networks. CoRR, abs/1801.03226, 2018.\n\n[28] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and R. S. Zemel. Gated graph sequence neural networks. CoRR, abs/1511.05493, 2015.\n\n[29] Renjie Liao, Marc Brockschmidt, Daniel Tarlow, A. L. Gaunt, Raquel Urtasun, and R. S. Zemel. Graph partition neural networks for semi-supervised classification. CoRR, abs/1803.06272, 2018.\n\n[30] Jonathan Masci, Davide Boscaini, M. M. Bronstein, and Pierre Vandergheynst. Geodesic convolutional neural networks on riemannian manifolds. In ICCV Workshops, pages 832\u2013840, 2015.\n\n[31] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodol\u00e0, Jan Svoboda, and M. M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. In CVPR, pages 5425\u20135434, 2017.\n\n[32] Shaoqing Ren, Kaiming He, R. B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS, pages 91\u201399, 2015.\n\n[33] Avraham Ruderman, N. C. Rabinowitz, A. S. Morcos, and Daniel Zoran. Learned deformation stability in convolutional neural networks. 
CoRR, abs/1804.04438, 2018.\n\n[34] K. T. Sch\u00fctt, F. Arbabzadah, S. Chmiela, K. R. M\u00fcller, and A. Tkatchenko. Quantum-chemical insights from deep tensor neural networks. Nature Communications, 8(13890), 2017.\n\n[35] M. S. Schlichtkrull, T. N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. CoRR, abs/1703.06103, 2017.\n\n[36] Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. NTU RGB+D: A large scale dataset for 3d human activity analysis. In CVPR, pages 1010\u20131019, 2016.\n\n[37] Martin Simonovsky and Nikos Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In CVPR, pages 29\u201338, 2017.\n\n[38] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.\n\n[39] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Li\u00f2, and Yoshua Bengio. Graph attention networks. CoRR, abs/1710.10903, 2017.\n\n[40] Nitika Verma, Edmond Boyer, and Jakob Verbeek. Dynamic filters in graph convolutional networks. CoRR, abs/1706.05206, 2017.\n\n[41] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. CoRR, abs/1511.07122, 2015.\n\n[42] Qi Zhang, Qizhao Jin, Jianlong Chang, Shiming Xiang, and Chunhong Pan. Kernel-weighted graph convolutional network: A deep learning approach for traffic forecasting. 
In ICPR, pages 1018\u20131023, 2018.", "award": [], "sourceid": 33, "authors": [{"given_name": "Jianlong", "family_name": "Chang", "institution": "National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences"}, {"given_name": "Jie", "family_name": "Gu", "institution": "National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences"}, {"given_name": "Lingfeng", "family_name": "Wang", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"given_name": "GAOFENG", "family_name": "MENG", "institution": "Institute of Automation, Chinese Academy of Sciences"}, {"given_name": "SHIMING", "family_name": "XIANG", "institution": "Chinese Academy of Sciences, China"}, {"given_name": "Chunhong", "family_name": "Pan", "institution": "Institute of Automation, Chinese Academy of Sciences"}]}