{"title": "Unsupervised Deep Haar Scattering on Graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 1709, "page_last": 1717, "abstract": "The classification of high-dimensional data defined on graphs is particularly difficult when the graph geometry is unknown. We introduce a Haar scattering transform on graphs, which computes invariant signal descriptors. It is implemented with a deep cascade of additions, subtractions and absolute values, which iteratively compute orthogonal Haar wavelet transforms. Multiscale neighborhoods of unknown graphs are estimated by minimizing an average total variation, with a pair matching algorithm of polynomial complexity. Supervised classification with dimension reduction is tested on data bases of scrambled images, and for signals sampled on unknown irregular grids on a sphere.", "full_text": "Unsupervised Deep Haar Scattering on Graphs\n\nXu Chen1,2, Xiuyuan Cheng2, and St\u00b4ephane Mallat2\n\n1Department of Electrical Engineering, Princeton University, NJ, USA\n2D\u00b4epartement d\u2019Informatique, \u00b4Ecole Normale Sup\u00b4erieure, Paris, France\n\nAbstract\n\nThe classi\ufb01cation of high-dimensional data de\ufb01ned on graphs is particularly dif\ufb01-\ncult when the graph geometry is unknown. We introduce a Haar scattering trans-\nform on graphs, which computes invariant signal descriptors. It is implemented\nwith a deep cascade of additions, subtractions and absolute values, which itera-\ntively compute orthogonal Haar wavelet transforms. Multiscale neighborhoods of\nunknown graphs are estimated by minimizing an average total variation, with a\npair matching algorithm of polynomial complexity. Supervised classi\ufb01cation with\ndimension reduction is tested on data bases of scrambled images, and for signals\nsampled on unknown irregular grids on a sphere.\n\n1\n\nIntroduction\n\nThe geometric structure of a data domain can be described with a graph [11], where neighbor data\npoints are represented by vertices related by an edge. For sensor networks, this connectivity depends\nupon the sensor physical locations, but in social networks it may correspond to strong interactions\nor similarities between two nodes. In many applications, the connectivity graph is unknown and\nmust therefore be estimated from data. We introduce an unsupervised learning algorithm to classify\nsignals de\ufb01ned on an unknown graph.\nAn important source of variability on graphs results from displacement of signal values. It may\nbe due to movements of physical sources in a sensor network, or to propagation phenomena in so-\ncial networks. Classi\ufb01cation problems are often invariant to such displacements.\nImage pattern\nrecognition or characterization of communities in social networks are examples of invariant prob-\nlems. They require to compute locally or globally invariant descriptors, which are suf\ufb01ciently rich\nto discriminate complex signal classes.\nSection 2 introduces a Haar scattering transform which builds an invariant representation of graph\ndata, by cascading additions, subtractions and absolute values in a deep network. It can be factor-\nized as a product of Haar wavelet transforms on the graph. Haar wavelet transforms are \ufb02exible\nrepresentations which characterize multiscale signal patterns on graphs [6, 10, 11]. 
Haar scattering transforms are extensions to graphs of the wavelet scattering transforms previously introduced for uniformly sampled signals [1].

For unstructured signals defined on an unknown graph, recovering the full graph geometry is an NP-complete problem. We avoid this complexity by learning only connected multiresolution graph approximations, which are sufficient to compute Haar scattering representations. Multiscale neighborhoods are calculated by minimizing an average total signal variation over training examples. This involves a pair matching algorithm of polynomial complexity. We show that this unsupervised learning algorithm computes sparse scattering representations.

For classification, the dimension of unsupervised Haar scattering representations is reduced with supervised partial least squares regressions [12]. This amounts to computing a last layer of reduced dimensionality before applying a Gaussian kernel SVM classifier. The performance of Haar scattering classification is tested on scrambled images, whose graph geometry is unknown. Results are provided for the MNIST and CIFAR-10 image databases. Classification experiments are also performed on scrambled signals whose samples lie on an irregular grid on a sphere. All computations can be reproduced with software available at www.di.ens.fr/data/scattering/haar.

This work was supported by the ERC grant InvariantClass 320959.

Figure 1: A Haar scattering network computes each coefficient of a layer $S_{j+1}x$ by adding or subtracting a pair of coefficients in the previous layer $S_j x$.

2 Orthogonal Haar Scattering on a Graph

2.1 Deep Networks of Permutation Invariant Operators

We consider signals $x$ defined on an unweighted graph $G = (V, E)$, with $V = \{1, ..., d\}$. Edges relate neighboring vertices. We suppose that $d$ is a power of 2 to simplify explanations. A Haar scattering is calculated by iteratively applying the following permutation invariant operator:

$(\alpha, \beta) \longrightarrow (\alpha + \beta, |\alpha - \beta|) .$   (1)

Its values are not modified by a permutation of $\alpha$ and $\beta$, and both values are recovered by

$\max(\alpha, \beta) = \frac{1}{2}(\alpha + \beta + |\alpha - \beta|) \quad \text{and} \quad \min(\alpha, \beta) = \frac{1}{2}(\alpha + \beta - |\alpha - \beta|) .$   (2)

An orthogonal Haar scattering transform computes progressively more invariant signal descriptors by applying this invariant operator at multiple scales. It is implemented along the deep network illustrated in Figure 1. The network layer $j$ is a two-dimensional array $S_j x(n, q)$ of $2^{-j}d \times 2^j = d$ coefficients, where $n$ is a node index and $q$ is a feature type.

The input network layer is $S_0 x(n, 0) = x(n)$. We compute $S_{j+1}x$ by regrouping the $2^{-j}d$ nodes of $S_j x$ in $2^{-j-1}d$ pairs $(a_n, b_n)$, and applying the permutation invariant operator (1) to each pair $(S_j x(a_n, q), S_j x(b_n, q))$:

$S_{j+1}x(n, 2q) = S_j x(a_n, q) + S_j x(b_n, q)$   (3)

and

$S_{j+1}x(n, 2q + 1) = |S_j x(a_n, q) - S_j x(b_n, q)| .$   (4)

This transform is iterated up to a maximum depth $J \le \log_2(d)$. It computes $S_J x$ with $Jd/2$ additions, subtractions and absolute values. Since $S_j x \ge 0$ for $j > 0$, one can put an absolute value on the sum in (3) without changing $S_{j+1}x$. It follows that $S_{j+1}x$ is calculated from the previous layer $S_j x$ by applying a linear operator followed by a non-linearity, as in most deep neural network architectures. In our case this non-linearity is an absolute value, as opposed to the rectifiers used in most deep networks [4].
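The recursion (3)-(4) is simple enough to state in a few lines of code. The following NumPy sketch (the function names and array layout are ours, not taken from the paper's released software) computes the cascade given a pairing $(a_n, b_n)$ for each layer:

```python
import numpy as np

def haar_layer(S, pairs):
    # One Haar scattering layer, eq. (3)-(4). S has shape (nodes, features);
    # `pairs` lists the (a_n, b_n) node pairs of this layer.
    out = np.empty((len(pairs), 2 * S.shape[1]))
    for n, (a, b) in enumerate(pairs):
        out[n, 0::2] = S[a] + S[b]           # S_{j+1}x(n, 2q), eq. (3)
        out[n, 1::2] = np.abs(S[a] - S[b])   # S_{j+1}x(n, 2q+1), eq. (4)
    return out

def haar_scattering(x, pairings):
    # Cascade of J layers; pairings[j] pairs the 2^{-j} d nodes of layer j.
    S = np.asarray(x, dtype=float).reshape(-1, 1)  # S_0 x(n, 0) = x(n)
    for pairs in pairings:
        S = haar_layer(S, pairs)
    return S
```

For example, with $d = 8$ and neighbor pairings on a line graph, `haar_scattering(x, [[(0,1),(2,3),(4,5),(6,7)], [(0,1),(2,3)], [(0,1)]])` returns the $1 \times 8$ array of depth-3 scattering coefficients.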
For each $n$, the $2^j$ scattering coefficients $\{S_j x(n, q)\}_{0 \le q < 2^j}$ are calculated from the values of $x$ in a vertex set $V_{j,n}$ of size $2^j$. One can verify by induction on (3) and (4) that $V_{0,n} = \{n\}$ for $0 \le n < d$, and for any $j \ge 0$

$V_{j+1,n} = V_{j,a_n} \cup V_{j,b_n} .$   (5)

Figure 2: A connected multiresolution is a partition of vertices with embedded connected sets $V_{j,n}$ of size $2^j$. (a): Example of partition for the graph of a square image grid, for $1 \le j \le 3$. (b): Example on an irregular graph.

The embedded subsets $\{V_{j,n}\}_{j,n}$ form a multiresolution approximation of the vertex set $V$. At each scale $2^j$, different pairings $(a_n, b_n)$ define different multiresolution approximations. A small graph displacement propagates signal values from a node to its neighbors. To build representations which are nearly invariant to such displacements, a Haar scattering transform must regroup connected vertices. It is thus computed over multiresolution vertex sets $V_{j,n}$ which are connected in the graph $G$. It follows from (5) that a necessary and sufficient condition is that each pair $(a_n, b_n)$ regroups two connected sets $V_{j,a_n}$ and $V_{j,b_n}$.

Figure 2 shows two examples of connected multiresolution approximations. Figure 2(a) illustrates the graph of an image grid, where each pixel is connected to its 8 neighbors. In this example, each $V_{j+1,n}$ regroups two subsets $V_{j,a_n}$ and $V_{j,b_n}$ which are connected horizontally if $j$ is even and vertically if $j$ is odd. Figure 2(b) illustrates a second example of a connected multiresolution approximation on an irregular graph. There are many different connected multiresolution approximations resulting from different pairings at each scale $2^j$, and different multiresolution approximations correspond to different Haar scattering transforms. In the following, we compute several Haar scattering transforms of a signal $x$ by defining different multiresolution approximations.

The following theorem proves that a Haar scattering preserves the norm and that it is contractive up to a normalization factor $2^{j/2}$. The contraction is due to the absolute value, which suppresses the sign and hence reduces the amplitude of differences. The proof is in Appendix A.

Theorem 2.1. For any $j \ge 0$, and any $x, x'$ defined on $V$,

$\|S_j x - S_j x'\| \le 2^{j/2} \|x - x'\| \quad \text{and} \quad \|S_j x\| = 2^{j/2} \|x\| .$

2.2 Iterated Haar Wavelet Transforms

We show that a Haar scattering transform can be written as a cascade of orthogonal Haar wavelet transforms and absolute value non-linearities. It is a particular example of the scattering transforms introduced in [1]. It computes coefficients measuring signal variations at multiple scales and multiple orders. We prove that the signal can be recovered from Haar scattering coefficients computed over enough multiresolution approximations.

A scattering operator is contractive because of the absolute value. When coefficients have an arbitrary sign, suppressing the sign reduces by a factor 2 the volume of the signal space.
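This contraction, and the norm identity of Theorem 2.1, are easy to check numerically. The snippet below is our own test harness, reusing the `haar_scattering` sketch above with line-graph pairings; it is an illustration, not part of the paper's software:

```python
import numpy as np

rng = np.random.default_rng(0)
d, J = 16, 4
# Neighbor pairings on a line graph: layer j pairs consecutive nodes.
pairings = [[(2 * k, 2 * k + 1) for k in range(d // 2 ** (j + 1))]
            for j in range(J)]

x, y = rng.standard_normal(d), rng.standard_normal(d)
Sx, Sy = haar_scattering(x, pairings), haar_scattering(y, pairings)

# Norm preservation: ||S_J x|| = 2^{J/2} ||x||.
assert np.isclose(np.linalg.norm(Sx), 2 ** (J / 2) * np.linalg.norm(x))
# Contraction: ||S_J x - S_J y|| <= 2^{J/2} ||x - y||.
assert np.linalg.norm(Sx - Sy) <= 2 ** (J / 2) * np.linalg.norm(x - y) + 1e-9
```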
We say that $S_J x(n, q)$ is a coefficient of order $m$ if its computation includes $m$ absolute values of differences. The amplitude of scattering coefficients typically decreases exponentially when the scattering order $m$ increases, because of the contraction produced by the absolute value. We verify from (3) and (4) that $S_J x(n, q)$ is a coefficient of order $m = 0$ if $q = 0$, and of order $m > 0$ if

$q = \sum_{k=1}^{m} 2^{J - j_k} \quad \text{for} \quad 0 \le j_k < j_{k+1} \le J .$

It results that there are $\binom{J}{m} 2^{-J} d$ coefficients $S_J x(n, q)$ of order $m$.

We now show that Haar scattering coefficients of order $m$ are obtained by cascading $m$ orthogonal Haar wavelet transforms defined on the graph $G$. A Haar wavelet at a scale $2^j$ is defined over each $V_{j,n} = V_{j-1,a_n} \cup V_{j-1,b_n}$ by

$\psi_{j,n} = 1_{V_{j-1,a_n}} - 1_{V_{j-1,b_n}} .$

For any $J \ge 0$, one can verify [10, 6] that

$\{1_{V_{J,n}}\}_{0 \le n < 2^{-J}d} \cup \{\psi_{j,n}\}_{0 \le n < 2^{-j}d,\ 1 \le j \le J}$

is a non-normalized orthogonal Haar basis of the space of signals defined on $V$. Let us denote $\langle x, x' \rangle = \sum_{v \in V} x(v)\, x'(v)$. Order $m = 0$ scattering coefficients sum the values of $x$ in each $V_{J,n}$:

$S_J x(n, 0) = \langle x , 1_{V_{J,n}} \rangle .$

Order $m = 1$ scattering coefficients are sums of absolute values of orthogonal Haar wavelet coefficients. They measure the variation amplitude of $x$ at each scale $2^{j_1}$, in each $V_{J,n}$:

$S_J x(n, 2^{J - j_1}) = \sum_{p :\, V_{j_1,p} \subset V_{J,n}} |\langle x , \psi_{j_1,p} \rangle| .$

Appendix B proves that second order scattering coefficients $S_J x(n, 2^{J-j_1} + 2^{J-j_2})$ are computed by applying a second orthogonal Haar wavelet transform to first order scattering coefficients. A coefficient $S_J x(n, 2^{J-j_1} + 2^{J-j_2})$ is an averaged second order increment over $V_{J,n}$, calculated from the variations at the scale $2^{j_2}$ of the increments of $x$ at the scale $2^{j_1}$. More generally, Appendix B also proves that order $m$ coefficients measure multiscale variations of $x$ at the order $m$, and are obtained by applying a Haar wavelet transform to scattering coefficients of order $m - 1$.

A single Haar scattering transform loses information, since it applies a cascade of permutation invariant operators. However, the following theorem proves that $x$ can be recovered from scattering transforms computed over $2^J$ different multiresolution approximations.

Theorem 2.2. There exist $2^J$ multiresolution approximations such that almost all $x \in \mathbb{R}^d$ can be reconstructed from their scattering coefficients on these multiresolution approximations.

This theorem is proved in Appendix C. The key idea is that Haar scattering transforms are computed with permutation invariant operators. Inverting these operators allows one to recover the values of signal pairs but not their locations. However, recombining these values on enough overlapping sets allows one to recover their locations, and hence the original signal $x$. This is done with multiresolutions which are interlaced at each scale $2^j$, in the sense that if one multiresolution pairs $(a_n, b_n)$ and $(a'_n, b'_n)$, then another multiresolution approximation pairs $(a'_n, b_n)$. Connectivity conditions are needed on the graph $G$ to guarantee the existence of "interlaced" multiresolution approximations which are all connected.
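As an aside on the indexing introduced at the start of this subsection: since $q = \sum_{k=1}^m 2^{J-j_k}$, the order $m$ of $S_J x(n, q)$ is simply the number of nonzero bits in the binary expansion of $q$. A short sketch (the helper name is ours) together with a check of the $\binom{J}{m}$ count stated above:

```python
from math import comb

def scattering_order(q: int) -> int:
    # Order m of S_J x(n, q): the number of absolute values of differences in
    # its computation, i.e. the number of 1-bits in the binary expansion of q.
    return bin(q).count("1")

J = 4
for m in range(J + 1):
    # For each node n there are binom(J, m) feature indices q of order m.
    assert sum(scattering_order(q) == m for q in range(2 ** J)) == comb(J, m)
```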
3 Learning

3.1 Sparse Unsupervised Learning of Multiscale Connectivity

Haar scattering transforms compute multiscale signal variations of multiple orders, over non-overlapping sets of size $2^J$. To build signal descriptors which are nearly invariant to signal displacements on the graph, we want to compute scattering transforms over connected sets in the graph, which a priori requires knowing the graph connectivity. However, in many applications, the graph connectivity is unknown. For piecewise regular signals, the graph connectivity implies some form of correlation between neighboring signal values, and may thus be estimated from a training set of unlabeled examples $\{x_i\}_i$ [7].

Instead of estimating the full graph geometry, which is an NP-complete problem, we estimate multiresolution approximations which are connected. This is a hierarchical clustering problem [19]. A multiresolution approximation is connected if at each scale $2^j$, each pair $(a_n, b_n)$ regroups two vertex sets $(V_{j,a_n}, V_{j,b_n})$ which are connected. This connection is estimated by minimizing the total variation within each set $V_{j,n}$; these sets are clusters of size $2^j$ [19]. It is done with a fine-to-coarse aggregation strategy. Given $\{V_{j,n}\}_{0 \le n < 2^{-j}d}$, we compute $V_{j+1,n}$ at the next scale by finding an optimal pairing $\{a_n, b_n\}_n$ which minimizes the total variation of the scattering vectors, averaged over the training set $\{x_i\}_i$:

$\sum_{i} \sum_{n=0}^{2^{-j-1}d - 1} \sum_{q=0}^{2^j - 1} |S_j x_i(a_n, q) - S_j x_i(b_n, q)| .$   (6)

This is a weighted matching problem, which can be solved by the Blossom algorithm of Edmonds [8] with $O(d^3)$ operations. We use the implementation in [9]. Iterating this algorithm for $0 \le j < J$ thus computes a multiresolution approximation at the scale $2^J$, with a hierarchical aggregation of graph vertices.

Observe that

$\|S_{j+1}x\|_1 = \|S_j x\|_1 + \sum_{n} \sum_{q} |S_j x(a_n, q) - S_j x(b_n, q)| .$

Given $S_j x$, it results that the minimization of (6) is equivalent to the minimization of $\sum_i \|S_{j+1} x_i\|_1$. This can be interpreted as finding a multiresolution approximation which yields an optimally sparse scattering transform. It operates with a greedy layerwise strategy across the network layers, similarly to sparse autoencoders for unsupervised deep learning [4].

As explained in the previous section, several Haar scattering transforms are needed to obtain a complete signal representation. The unsupervised learning computes $N$ multiresolution approximations by dividing the training set $\{x_i\}_i$ into $N$ non-overlapping subsets, and learning a different multiresolution approximation from each training subset.
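A minimal sketch of one aggregation step is given below, under the assumption that a Blossom-style solver is available. Here we use networkx's `max_weight_matching`, which implements a variant of Edmonds' algorithm; the paper itself relies on the implementation of [9], and the function name is ours:

```python
import numpy as np
import networkx as nx

def learn_pairing(S_list):
    # One fine-to-coarse step: pair the nodes of layer j so as to minimize the
    # averaged total variation (6). S_list holds one (nodes, features) array
    # S_j x_i per training signal x_i.
    n_nodes = S_list[0].shape[0]
    G = nx.Graph()
    for a in range(n_nodes):
        for b in range(a + 1, n_nodes):
            cost = sum(np.abs(S[a] - S[b]).sum() for S in S_list)
            G.add_edge(a, b, weight=-cost)  # max-weight matching <=> min cost
    matching = nx.max_weight_matching(G, maxcardinality=True)
    return [tuple(sorted(pair)) for pair in matching]
```

Alternating `learn_pairing` with the `haar_layer` step sketched earlier yields one learned multiresolution approximation; repeating this on $N$ disjoint training subsets yields the $N$ transforms used below.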
3.2 Supervised Feature Selection and Classification

The unsupervised learning computes a vector of scattering coefficients which is typically much larger than the dimension $d$ of $x$. However, only a subset of these invariants is needed for any particular classification task. The classification is improved by a supervised dimension reduction which selects a subset of the scattering coefficients. In this paper, the feature selection is implemented with a partial least squares regression [12, 13, 14]. The final supervised classifier is a Gaussian kernel SVM.

Let us denote by $\Phi x = \{\phi_p x\}_p$ the set of all scattering coefficients at a scale $2^J$, computed from $N$ multiresolution approximations. We perform a feature selection adapted to each class $c$, with a partial least squares regression of the one-versus-all indicator function

$f_c(x) = \begin{cases} 1 & \text{if } x \text{ belongs to class } c \\ 0 & \text{otherwise} \end{cases} .$

A partial least squares regression greedily selects and orthogonalizes each feature, one at a time. At the $k$th iteration, it selects a $\phi_{p_k} x$, and a Gram-Schmidt orthogonalization yields a normalized $\tilde{\phi}_{p_k} x$ which is uncorrelated with all previously selected features:

$\forall r < k , \quad \sum_i \tilde{\phi}_{p_k}(x_i)\, \tilde{\phi}_{p_r}(x_i) = 0 \quad \text{and} \quad \sum_i |\tilde{\phi}_{p_k}(x_i)|^2 = 1 .$

The $k$th feature $\phi_{p_k} x$ is selected so that the linear regression of $f_c(x)$ on $\{\tilde{\phi}_{p_r} x\}_{1 \le r \le k}$ has a minimum mean-square error, computed on the training set. This is equivalent to finding $\phi_{p_k}$ so that $\sum_i f_c(x_i)\, \tilde{\phi}_{p_k}(x_i)$ is maximum.

The partial least squares regression thus selects and computes $K$ decorrelated scattering features $\{\tilde{\phi}_{p_k} x\}_{k < K}$ for each class $c$. For a total of $C$ classes, the union of all these feature sets defines a dictionary of size $M = KC$. They are linear combinations of the original Haar scattering coefficients $\{\phi_p x\}_p$. This dimension reduction can thus be interpreted as a last fully connected network layer, which outputs a vector of size $M$. The parameter $M$ allows one to optimize the bias versus variance trade-off. It can be adjusted from the decay of the regression error of each $f_c$ [12]. In our numerical experiments, it is set to a fixed size for all databases.
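The greedy selection above admits a compact implementation. The sketch below is our own reading of the procedure, written from the description in this section rather than from the paper's code; it selects $K$ features for one class indicator $f_c$:

```python
import numpy as np

def pls_select(Phi, f, K):
    # Greedy partial least squares feature selection.
    # Phi: (n_samples, n_features) matrix of scattering features phi_p(x_i);
    # f: (n_samples,) class indicator f_c(x_i). Returns the selected column
    # indices and the orthonormalized selected features.
    Phi = Phi - Phi.mean(axis=0)   # center features
    f = f - f.mean()               # center the indicator
    selected, Q = [], []
    R = Phi.copy()                 # residual features, kept orthogonal to Q
    for _ in range(K):
        norms = np.linalg.norm(R, axis=0)
        norms[norms < 1e-12] = np.inf          # skip exhausted features
        corr = np.abs(f @ R) / norms           # |sum_i f(x_i) phi~_p(x_i)|
        k = int(np.argmax(corr))
        q = R[:, k] / np.linalg.norm(R[:, k])  # normalized selected feature
        selected.append(k)
        Q.append(q)
        R = R - np.outer(q, q @ R)             # Gram-Schmidt deflation
    return selected, np.stack(Q, axis=1)
```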
Figure 3: MNIST images (left) and images after random pixel permutations (right).

4 Numerical Experiments

Unsupervised Haar scattering representations are tested on classification problems over scrambled images and scrambled data on a sphere, for which the geometry is therefore unknown. Classification results are compared with a Haar scattering algorithm computed over the known signal geometry, and with state-of-the-art algorithms.

A Haar scattering representation involves few parameters, which we review. The scattering scale $2^J \le d$ is the invariance scale. Scattering coefficients are computed up to a maximum order $m$, which is set to 4 in all experiments. Indeed, higher order scattering coefficients have a negligible relative energy, below 1%. The unsupervised learning algorithm computes $N$ multiresolution approximations, corresponding to $N$ different scattering transforms. Increasing $N$ decreases the classification error but increases computations; the error decay becomes negligible for $N \ge 40$. The supervised dimension reduction selects a final set of $M$ orthogonalized scattering coefficients. We set $M = 1000$ in all numerical experiments.

For signals defined on an unknown graph, the unsupervised learning computes an estimation of connected multiresolution sets by minimizing an average total variation. For each database of scrambled signals, the precision of this estimation is evaluated by computing the percentage of multiscale sets which are indeed connected in the original topology (an image grid, or a grid on the sphere).

4.1 MNIST Digit Recognition

MNIST is a database of $6 \times 10^4$ handwritten digit images of size $d \le 2^{10}$, with $5 \times 10^4$ images for training and $10^4$ for testing. Examples of MNIST images before and after pixel scrambling are shown in Figure 3. The best classification results are obtained with a maximum invariance scale $2^J = 2^{10}$. The classification error is 0.9%, with an unsupervised learning of $N = 40$ multiresolution approximations. Table 1 shows that this is close to (though slightly behind) the state-of-the-art results obtained with fully supervised deep networks, which are optimized with supervised backpropagation algorithms.

The unsupervised learning computes multiresolution sets $V_{j,n}$ from scrambled images. At scales $1 \le 2^j \le 2^3$, 100% of these multiresolution sets are connected in the original image grid, which proves that the geometry is well estimated at these scales. This is only evaluated on meaningful pixels, which do not remain zero on all training images. For $j = 4$ and $j = 5$, the percentages of connected sets are 85% and 67% respectively. The percentage of connected sets decreases because long range correlations are weaker.

One can reduce the Haar scattering classification error from 0.9% to 0.59% with a known image geometry. The Haar scattering transform is then computed over multiresolution approximations which are directly constructed from the image grid, as in Figure 2(a). Rotations and translations define $N = 64$ different connected multiresolution approximations, which yield a reduced error of 0.59%. State-of-the-art classification errors on MNIST, for a non-augmented database (without elastic deformations), are 0.46% with a Gabor scattering [2] and 0.53% with a supervised training of deep convolutional networks [5]. This shows that without any learning, a Haar scattering using geometry is close to the state of the art.

Maxout MLP + dropout [15] | Deep convex net [16] | DBM + dropout [17] | Haar Scattering
0.94 | 0.83 | 0.79 | 0.90

Table 1: Percentage of errors for the classification of scrambled MNIST images, obtained by different algorithms.

Figure 4: Images of digits mapped on a sphere.

4.2 CIFAR-10 Images

CIFAR-10 images are color images of $32 \times 32$ pixels, which are much more complex than MNIST digit images. There are 10 classes, such as "dogs", "cars" and "ships", with a total of $5 \times 10^4$ training examples and $10^4$ testing examples. The 3 color bands are represented with Y, U, V channels, and scattering coefficients are computed independently in each channel.

The Haar scattering is first applied to scrambled CIFAR images, whose geometry is unknown. The minimum classification error is obtained at the scale $2^J = 2^7$, which is below the maximum scale $d = 2^{10}$; this maintains some localization information on the image features. With $N = 10$ multiresolution approximations, a Haar scattering transform has an error of 27.3%. This is about 10 percentage points below the previous results on this database, given in Table 2.
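The connectivity percentages reported throughout these experiments can be computed with a check of the following form (a sketch with our own function name, assuming networkx, non-empty sets, and a reference graph encoding the true topology; this is not part of the paper's software):

```python
import networkx as nx

def fraction_connected(sets_at_scale, ref_graph):
    # Fraction of learned multiresolution sets V_{j,n} at one scale that
    # induce a connected subgraph of the reference topology (e.g. the grid).
    flags = [nx.is_connected(ref_graph.subgraph(V)) for V in sets_at_scale]
    return sum(flags) / len(flags)
```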
Nearly 100% of the multiresolution sets $V_{j,n}$ computed from scrambled images are connected in the original image grid for $1 \le j \le 4$, which shows that the multiscale geometry is well estimated at these fine scales. For $j = 5$, 6 and 7, the proportions of connected sets are 98%, 93% and 83% respectively. As for MNIST images, the connectivity is not as precisely estimated at large scales.

Fastfood [18] | Random Kitchen Sinks [18] | Haar Scattering
36.9 | 37.6 | 27.3

Table 2: Percentage of errors for the classification of scrambled CIFAR-10 images, with different algorithms.

The Haar scattering classification error is reduced from 27.3% to 21.3% if the image geometry is known. As for MNIST, we compute $N = 64$ multiresolution approximations obtained by translating and rotating the image grid. After dimension reduction, the classification error is 21.3%. This error is above the state of the art obtained by a supervised convolutional network [15] (11.68%), but the Haar scattering representation involves no learning.

4.3 Signals on a Sphere

A database of irregularly sampled signals on a sphere is constructed in [3] by projecting the MNIST digit images onto $d = 4096$ points randomly sampled on the 3D sphere, and by randomly rotating these images on the sphere. The random rotation is either uniformly distributed on the sphere, or restricted to a smaller variance (small rotations) [3]. The digit '9' is removed from the data set because it cannot be distinguished from a '6' after rotation. Examples from the dataset are shown in Figure 4.

The classification algorithms introduced in [3] use the known distribution of points on the sphere, by computing a representation based on the graph Laplacian. Table 3 gives the results reported in [3] for a fully connected neural network and a spectral graph Laplacian network.

As opposed to these algorithms, the Haar scattering algorithm uses no information on the positions of the points on the sphere. Computations are performed from a scrambled set of signal values, without any geometric information. Scattering transforms are calculated up to the maximum scale $2^J = d = 2^{12}$. A total of $N = 10$ multiresolution approximations are estimated by unsupervised learning, and the classification is performed from $M = 10^3$ selected coefficients. Despite the fact that the geometry is unknown, the Haar scattering reduces the error rate for both small and large 3D random rotations.

In order to evaluate the precision of our geometry estimation, we use the neighborhood information given by the 3D coordinates of the 4096 points on the sphere of radius 1. We say that two points are connected if their geodesic distance is smaller than 0.1; each point on the sphere then has on average 8 connected points. For small rotations, the percentage of learned multiresolution sets which are connected is 92%, 92%, 88% and 83% for $j$ going from 1 to 4. It is computed on meaningful points with non-negligible energy. For large rotations, it is 97%, 96%, 95% and 95%. This shows that the multiscale geometry on the sphere is well estimated.
Nearest Neighbors | Fully Connect. Net. [3] | Spectral Net. [3] | Haar Scattering
Small rotations: 19 | 5.6 | 6 | 2.2
Large rotations: 80 | 52 | 50 | 47.7

Table 3: Percentage of errors for the classification of MNIST images rotated and sampled on a sphere [3], with a nearest neighbor classifier, a fully connected two-layer neural network, a spectral network [3], and a Haar scattering.

5 Conclusion

A Haar scattering transform computes invariant data representations by iterating over a hierarchy of permutation invariant operators, calculated with additions, subtractions and absolute values. The geometry of unstructured signals is estimated with an unsupervised learning algorithm which minimizes the average total signal variation over multiscale neighborhoods. This shows that unsupervised deep learning can be implemented with a polynomial complexity algorithm. The supervised classification includes a feature selection implemented with a partial least squares regression. State-of-the-art results have been shown on scrambled images as well as on random signals sampled on a sphere. The two important parameters of this architecture are the network depth, which corresponds to the invariance scale, and the dimension reduction of the final layer, set to $10^3$ in all experiments. It can thus easily be applied to any data set.

This paper concentrates on scattering transforms of real-valued signals. For a boolean vector $x$, a boolean scattering transform is computed by replacing the operator (1) by a boolean permutation invariant operator which transforms $(\alpha, \beta)$ into $(\alpha \text{ and } \beta,\ \alpha \text{ xor } \beta)$. Iteratively applying this operator defines a boolean scattering transform $S_j x$ having similar properties.
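For concreteness, the boolean pairing operator reads as follows (a one-line sketch; the function name is ours):

```python
def boolean_pair(a: bool, b: bool) -> tuple[bool, bool]:
    # Boolean analogue of (1): permutation invariant, and the unordered pair
    # {a, b} is recoverable from the output, since (a and b, a xor b)
    # determines it, mirroring the max/min inversion formulas of (2).
    return a and b, a != b
```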
References

[1] S. Mallat, "Recursive interferometric representations," Proc. of EUSIPCO Conf. 2010, Denmark.

[2] J. Bruna and S. Mallat, "Invariant Scattering Convolution Networks," IEEE Trans. PAMI, 35(8):1872-1886, 2013.

[3] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, "Spectral Networks and Deep Locally Connected Networks on Graphs," ICLR 2014.

[4] Y. Bengio, A. Courville, and P. Vincent, "Representation Learning: A Review and New Perspectives," IEEE Trans. PAMI, 35(8):1798-1828, 2013.

[5] Y. LeCun, K. Kavukcuoglu, and C. Farabet, "Convolutional Networks and Applications in Vision," Proc. IEEE Int. Symp. Circuits and Systems, 2010.

[6] M. Gavish, B. Nadler, and R. R. Coifman, "Multiscale wavelets on trees, graphs and high dimensional data: Theory and applications to semi supervised learning," ICML, pages 367-374, 2010.

[7] N. L. Roux, Y. Bengio, P. Lamblin, M. Joliveau, and B. Kégl, "Learning the 2-D topology of images," NIPS, pages 841-848, 2008.

[8] J. Edmonds, "Paths, trees, and flowers," Canadian Journal of Mathematics, 1965.

[9] E. Rothberg's implementation of H. Gabow's "An Efficient Implementation of Edmonds' Algorithm for Maximum Matching on Graphs," JACM, 23, 1976.

[10] R. Rustamov and L. Guibas, "Wavelets on Graphs via Deep Learning," NIPS 2013.

[11] D. Shuman, S. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, "The Emerging Field of Signal Processing on Graphs," IEEE Signal Processing Magazine, May 2013.

[12] T. Mehmood, K. H. Liland, L. Snipen, and S. Sæbø, "A Review of Variable Selection Methods in Partial Least Squares Regression," Chemometrics and Intelligent Laboratory Systems, vol. 118, pages 62-69, 2012.

[13] H. Zhang, S. Kiranyaz, and M. Gabbouj, "Cardinal Sparse Partial Least Square Feature Selection and its Application in Face Recognition," Proceedings of the 22nd European Signal Processing Conference (EUSIPCO), Sep. 2014.

[14] W. R. Schwartz, A. Kembhavi, D. Harwood, and L. S. Davis, "Human Detection Using Partial Least Squares Analysis," ICCV 2009.

[15] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, "Maxout Networks," arXiv preprint arXiv:1302.4389, 2013.

[16] D. Yu and L. Deng, "Deep Convex Net: A Scalable Architecture for Speech Pattern Classification," Proc. INTERSPEECH, pp. 2285-2288, 2011.

[17] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," Technical report, arXiv:1207.0580, 2012.

[18] Q. Le, T. Sarlos, and A. Smola, "Fastfood - Approximating Kernel Expansions in Loglinear Time," ICML, 2013.

[19] M. Hein and S. Setzer, "Beyond Spectral Clustering - Tight Relaxations of Balanced Graph Cuts," NIPS 2011.
", "award": [], "sourceid": 901, "authors": [{"given_name": "Xu", "family_name": "Chen", "institution": "Princeton University"}, {"given_name": "Xiuyuan", "family_name": "Cheng", "institution": "École Normale Supérieure"}, {"given_name": "Stephane", "family_name": "Mallat", "institution": "Ecole Polytechnique Paris"}]}