{"title": "The Sparse Manifold Transform", "book": "Advances in Neural Information Processing Systems", "page_first": 10513, "page_last": 10524, "abstract": "We present a signal representation framework called the sparse manifold transform that combines key ideas from sparse coding, manifold learning, and slow feature analysis. It turns non-linear transformations in the primary sensory signal space into linear interpolations in a representational embedding space while maintaining approximate invertibility. The sparse manifold transform is an unsupervised and generative framework that explicitly and simultaneously models the sparse discreteness and low-dimensional manifold structure found in natural scenes. When stacked, it also models hierarchical composition. We provide a theoretical description of the transform and demonstrate properties of the learned representation on both synthetic data and natural videos.", "full_text": "The Sparse Manifold Transform\n\nYubei Chen1,2\n\nDylan M Paiton1,3\n\nBruno A Olshausen1,3,4\n\n1Redwood Center for Theoretical Neuroscience\n\n2Department of Electrical Engineering and Computer Science\n\nUniversity of California, Berkeley\n\nBerkeley, CA 94720\n\nyubeic@eecs.berkeley.edu\n\n3Vision Science Graduate Group\n\n4Helen Wills Neuroscience Institute & School of Optometry\n\nAbstract\n\nWe present a signal representation framework called the sparse manifold transform\nthat combines key ideas from sparse coding, manifold learning, and slow feature\nanalysis. It turns non-linear transformations in the primary sensory signal space\ninto linear interpolations in a representational embedding space while maintaining\napproximate invertibility. The sparse manifold transform is an unsupervised and\ngenerative framework that explicitly and simultaneously models the sparse dis-\ncreteness and low-dimensional manifold structure found in natural scenes. When\nstacked, it also models hierarchical composition. 
We provide a theoretical description\nof the transform and demonstrate properties of the learned representation on\nboth synthetic data and natural videos.\n\n1 Introduction\n\nInspired by Pattern Theory [40], we attempt to model three important and pervasive patterns in natural\nsignals: sparse discreteness, low-dimensional manifold structure and hierarchical composition.\nEach of these concepts has been individually explored in previous studies. For example, sparse\ncoding [43, 44] and ICA [5, 28] can learn sparse and discrete elements that make up natural signals.\nManifold learning [56, 48, 38, 4] was proposed to model and visualize low-dimensional continuous\ntransforms such as smooth 3D rotations or translations of a single discrete element. Deformable,\ncompositional models [60, 18] allow for a hierarchical composition of components into a more\nabstract representation. We seek to model these three patterns jointly as they are almost always\nentangled in real-world signals and their disentangling poses an unsolved challenge.\nIn this paper, we introduce an interpretable, generative and unsupervised learning model, the sparse\nmanifold transform (SMT), which has the potential to untangle all three patterns simultaneously and\nexplicitly. The SMT consists of two stages: dimensionality expansion using sparse coding followed\nby contraction using manifold embedding. Our SMT implementation is, to our knowledge, the first\nmodel to bridge sparse coding and manifold learning. Furthermore, an SMT layer can be stacked to\nproduce an unsupervised hierarchical learning network.\nThe primary contribution of this paper is to establish a theoretical framework for the SMT by\nreconciling and combining the formulations and concepts from sparse coding and manifold learning.\nIn the following sections we point out connections between three important unsupervised learning\nmethods: sparse coding, locally linear embedding and slow feature analysis. 
We then develop a single\nframework that utilizes insights from each method to describe our model. Although we focus here\non the application to image data, the concepts are general and may be applied to other types of data\nsuch as audio signals and text. All experiments performed on natural scenes used the same dataset,\ndescribed in Supplement D.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\n1.1 Sparse coding\nSparse coding attempts to approximate a data vector, x ∈ ℝ^n, as a sparse superposition of dictionary\nelements φ_i:\n\nx = Φα + ε (1)\n\nwhere Φ ∈ ℝ^{n×m} is a matrix with columns φ_i, α ∈ ℝ^m is a sparse vector of coefficients and ε is\na vector containing independent Gaussian noise samples, which are assumed to be small relative\nto x. Typically m > n so that the representation is overcomplete. For a given dictionary, Φ, the\nsparse code, α, of a data vector, x, can be computed in an online fashion by minimizing an energy\nfunction composed of a quadratic penalty on reconstruction error plus an L1 sparseness penalty on α\n(see Supplement A). The dictionary itself is adapted to the statistics of the data so as to maximize\nthe sparsity of α. The resulting dictionary often provides important insights about the structure of\nthe data. For natural images, the dictionary elements become ‘Gabor-like’ (i.e., spatially localized,\noriented and bandpass) and form a tiling over different locations, orientations and scales due to the\nnatural transformations of objects in the world.\nThe sparse code of an image provides a representation that makes explicit the structure contained\nin the image. However, the dictionary is typically unordered, and so the sparse code will lose the\ntopological organization that was inherent in the image. The pioneering works of Hyvärinen and\nHoyer [27], Hyvärinen et al. [29] and Osindero et al. 
[45] addressed this problem by specifying a \ufb01xed\n2D topology over the dictionary elements that groups them according to the co-occurrence statistics\nof their coef\ufb01cients. Other works learn the group structure from a statistical approach [37, 3, 32], but\ndo not make explicit the underlying topological structure. Some previous topological approaches\n[34, 11, 10] used non-parametric methods to reveal the low-dimensional geometrical structure in local\nimage patches, which motivated us to look for the connection between sparse coding and geometry.\nFrom this line of inquiry, we have developed what we believe to be the \ufb01rst mathematical formulation\nfor learning the general geometric embedding of dictionary elements when trained on natural scenes.\nAnother observation motivating this work is that the representation computed using overcomplete\nsparse coding can exhibit large variability for time-varying inputs that themselves have low variability\nfrom frame to frame [49]. While some amount of variability is to be expected as image features move\nacross different dictionary elements, the variation can appear unstructured without information about\nthe topological relationship of the dictionary. In section 3 and section 4, we show that considering the\njoint spatio-temporal regularity in natural scenes can allow us to learn the dictionary\u2019s group structure\nand produce a representation with smooth variability from frame to frame (Figure 3).\n\n1.2 Manifold Learning\n\nIn manifold learning, one assumes that the data occupy a low-dimensional, smooth manifold embed-\nded in the high-dimensional signal space. A smooth manifold is locally equivalent to a Euclidean\nspace and therefore each of the data points can be linearly reconstructed by using the neighboring data\npoints. 
The Locally Linear Embedding (LLE) algorithm [48] first finds the neighbors of each data\npoint in the whole dataset and then reconstructs each data point linearly from its neighbors. It then\nembeds the dataset into a low-dimensional Euclidean space by solving a generalized eigendecomposition\nproblem.\nThe first step of LLE has the same linear formulation as sparse coding (1), with Φ being the whole\ndataset rather than a learned dictionary, i.e., Φ = X, where X is the data matrix. The coefficients, α,\ncorrespond to the linear interpolation weights used to reconstruct a data point, x, from its K-nearest\nneighbors, resulting in a K-sparse code. (In other work [17], α is inferred by sparse approximation,\nwhich provides better separation between manifolds nearby in the same space.) Importantly, once\nthe embedding of the dataset X → Y is computed, the embedding of a new point x_NEW → y_NEW\nis obtained by a simple linear projection of its sparse coefficients. That is, if α_NEW is the K-sparse\ncode of x_NEW, then y_NEW = Y α_NEW. Viewed this way, the dictionary may be thought of as a discrete\nsampling of a continuous manifold, and the sparse code of a data point provides the interpolation\ncoefficients for determining its coordinates on the manifold. However, using the entire dataset as the\ndictionary is cumbersome and inefficient in practice.\nSeveral authors [12, 53, 58] have realized that it is unnecessary to use the whole dataset as a dictionary.\nA random subset of the data or a set of cluster centers can be good enough to preserve the manifold\nstructure, making learning more efficient. Going forward, we refer to these as landmarks. 
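As a rough illustration of the out-of-sample projection just described, the sketch below computes a K-sparse interpolation code for a new point and maps it through an embedding, following y_NEW = Y α_NEW. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name knn_interpolation_code, the sum-to-one weight construction with a small ridge term (the usual LLE recipe), the grid dataset and the linear stand-in embedding M are all hypothetical choices made for the example.

```python
import numpy as np

def knn_interpolation_code(x, X, k=3, reg=1e-6):
    # X holds one data point per column (n x N); x is a query point of shape (n,).
    # Returns a K-sparse vector alpha such that x is approximately X @ alpha.
    n_points = X.shape[1]
    dists = np.linalg.norm(X - x[:, None], axis=0)
    idx = np.argsort(dists)[:k]              # indices of the k nearest neighbors
    Z = X[:, idx] - x[:, None]               # neighbors centered on x
    G = Z.T @ Z                              # local Gram matrix (k x k)
    G = G + reg * np.trace(G) * np.eye(k)    # small ridge, since G can be singular
    w = np.linalg.solve(G, np.ones(k))
    w = w / w.sum()                          # interpolation weights sum to one
    alpha = np.zeros(n_points)
    alpha[idx] = w
    return alpha

# Toy dataset (assumption): a regular grid of points in the plane.
ticks = np.linspace(-1.0, 1.0, 21)
X = np.array([[a, b] for a in ticks for b in ticks]).T    # shape (2, 441)

# Stand-in for a learned embedding Y of the dataset (assumption): a fixed linear map.
M = np.array([[2.0, 0.5], [-1.0, 1.5]])
Y = M @ X

x_new = np.array([0.13, -0.22])
alpha_new = knn_interpolation_code(x_new, X, k=3)
y_new = Y @ alpha_new                        # y_NEW = Y alpha_NEW
```

Because the stand-in embedding is linear, y_new should land close to M @ x_new; with a genuinely non-linear embedding the same projection only interpolates between the embedded neighbors.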
In Locally Linear Landmarks (LLL) [58], the authors compute two linear interpolations for each data point x:\n\nx = Φ_LM α + n (2)\nx = Φ_DATA γ + n′ (3)\n\nwhere Φ_LM is a dictionary of landmarks and Φ_DATA is a dictionary composed of the whole dataset.\nAs in LLE, α and γ are coefficient vectors inferred using KNN solvers (where the coefficient\ncorresponding to x is forced to be 0). We can substitute the solutions to equation (2) into Φ_DATA,\ngiving Φ_DATA ≈ Φ_LM A, where the jth column of the matrix A is a unique vector α_j. This leads to an\ninterpolation relationship:\n\nΦ_LM α ≈ Φ_LM A γ (4)\n\nThe authors sought to embed the landmarks into a low-dimensional Euclidean space using an\nembedding matrix, P_LM, such that the interpolation relationship in equation (4) still holds:\n\nP_LM α ≈ P_LM A γ (5)\n\nwhere we use the same α and γ vectors that allowed for equality in equations (2) and (3). P_LM is\nan embedding matrix for Φ_LM such that each of the columns of P_LM represents an embedding of a\nlandmark. P_LM can be derived by solving a generalized eigendecomposition problem [58].\nThe similarity between equation (1) and equation (2) provides an intuition to bring sparse coding and\nmanifold learning closer together. However, LLL still has a difficulty in that it requires a nearest\nneighbor search. We posit that temporal information provides a more natural and efficient solution.\n\n1.3 Slow Feature Analysis (SFA)\nThe general idea of imposing a ‘slowness prior’ was initially proposed by [20] and [59] to extract\ninvariant or slowly varying features from temporal sequences rather than using static orderless data\npoints. While it is still common practice in both sparse coding and manifold learning to collect data\nin an orderless fashion, other work has used time-series data to learn spatiotemporal representations\n[57, 41, 30] or to disentangle form and motion [6, 9, 13]. 
Specifically, the combination of topography\nand temporal coherence in [30] provides a strong motivation for this work.\nHere, we utilize temporal adjacency to determine the nearest neighbors in the embedding space (eq. 3)\nby specifically minimizing the second-order temporal derivative, implying that video sequences form\nlinear trajectories in the manifold embedding space. A similar approach was recently used by\n[23] to linearize transformations in natural video. This is a variation of ‘slowness’ that makes the\nconnection to manifold learning more explicit. It also connects to the ideas of manifold flattening [14]\nor straightening [24], which are hypothesized to underlie perceptual representations in the brain.\n\n2 Functional Embedding: A Sensing Perspective\n\nThe SMT framework differs from the classical manifold learning approach in that it relies on the\nconcept of functional embedding as opposed to embedding individual data points. We explain this\nconcept here before turning to the sparse manifold transform in section 3.\nIn classical manifold learning [26], for an m-dimensional compact manifold, it is typical to solve\na generalized eigenvalue decomposition problem and preserve the 2nd to the (d + 1)th trailing\neigenvectors as the embedding matrix P_C ∈ ℝ^{d×N}, where d is as small as possible (parsimonious)\nsuch that the embedding preserves the topology of the manifold (usually, m ≤ d ≤ 2m due to the\nstrong Whitney embedding theorem [35]) and N is the number of data points or landmarks to embed.\nIt is conventional to view the columns of an embedding matrix, P_C, as an embedding to a Euclidean\nspace, which is (at least approximately) topologically equivalent to the data manifold. Each of the\nrows of P_C is treated as a coordinate of the underlying manifold. One may think of a point on the\nmanifold as a single, constant-amplitude delta function with the manifold as its domain. 
Classical\nmanifold embedding turns a non-linear transformation (i.e., a moving delta function on the manifold)\nin the original signal space into a simple linear interpolation in the embedding space. This approach is\neffective for visualizing data in a low-dimensional space and compactly representing the underlying\ngeometry, but less effective when the underlying function is not a single delta function.\nIn this work we seek to move beyond the single delta-function assumption, because natural images\nare not well described as a single point on a continuous manifold of fixed dimensionality. For any\nreasonably sized image region (e.g., a 16 × 16 pixel image patch), there could be multiple edges\nmoving in different directions, or the edge of one occluding surface may move over another, or\nthe overall appearance may change as features enter or exit the region. Such changes will cause\nthe manifold dimensionality to vary substantially, so that the signal structure is no longer\nwell-characterized as a manifold.\nWe propose instead to think of any given image patch as consisting of h discrete components\nsimultaneously moving over the same underlying manifold, i.e., as h delta functions, or an h-sparse\nfunction on the smooth manifold. This idea is illustrated in figure 1. First, let us organize the\nGabor-like dictionary learned from natural scenes on a 4-dimensional manifold according to the\nposition (x, y), orientation (θ) and scale (σ) of each dictionary element φ_i. Any given Gabor function\ncorresponds to a point with coordinates (x, y, θ, σ) on this manifold, and so the learned dictionary as\na whole may be conceptualized as a discrete tiling of the manifold. Then, the k-sparse code of an\nimage, α, can be viewed as a set of k delta functions on this manifold (illustrated as black arrows\nin figure 1C). 
Hyvärinen has pointed out that when the dictionary is topologically organized in a\nsimilar manner, the active coefficients α_i tend to form clusters, or “bubbles,” over this domain [30].\nEach of these clusters may be thought of as linearly approximating a “virtual Gabor” at the center of\nthe cluster (illustrated as red arrows in figure 1C), effectively performing a flexible “steering” of the\ndictionary to describe discrete components in the image, similar to steerable filters [21, 55, 54, 47].\nAssuming there are h such clusters, then the k-sparse code of the image can be thought of as a discrete\napproximation of an underlying h-sparse function defined on the continuous manifold domain, where\nh is generally greater than 1 but less than k.\n\nFigure 1: Dictionary elements learned from natural signals with sparse coding may be conceptualized\nas landmarks on a smooth manifold. A) A function defined on ℝ^2 (e.g. a gray-scale natural image) and\none local component from its reconstruction are represented by the black and red curves, respectively.\nB) The signal is encoded using sparse inference with a learned dictionary, Φ, resulting in a k-sparse\nvector (also a function) α, which is defined on an orderless discrete set {1, ···, N}. C) α can be\nviewed as a discrete k-sparse approximation to the true h-sparse function, α_TRUE(M), defined on the\nsmooth manifold (k = 8 and h = 3 in this example). Each dictionary element in Φ corresponds to\na landmark (black dot) on the smooth manifold, M. Red arrows indicate the underlying h-sparse\nfunction, while black arrows indicate the k non-zero coefficients of Φ used to interpolate the red\narrows. 
D) Since Φ only contains a finite number of landmarks, we must interpolate (i.e. “steer”)\namong a few dictionary elements to reconstruct each of the true image components.\n\nAn h-sparse function would not be recoverable from the d-dimensional projection employed in the\nclassical approach because the embedding is premised on there being only a single delta function\non the manifold. Hence the inverse will not be uniquely defined. Here we utilize a more general\nfunctional embedding concept that allows for better recovery capacity. A functional embedding of the\nlandmarks is obtained by taking the first f trailing eigenvectors from the generalized eigendecomposition solution\nas the embedding matrix P ∈ ℝ^{f×N}, where f is larger than d such that the h-sparse function can be\nrecovered from the linear projection. Empirically¹ we use f = O(h log(N)).\nTo illustrate the distinction between the classical view of a data manifold and the additional properties\ngained by a functional embedding, let us consider a simple example of a function over the 2D unit disc.\nAssume we are given 300 landmarks on this disc as a dictionary Φ_LM ∈ ℝ^{2×300}. We then generate\nmany short sequences of a point x moving along a straight line on the unit disc, with random starting\nlocations and velocities. At each time, t, we use a K-nearest-neighbor (KNN) solver to find a local linear\ninterpolation of the point’s location from the landmarks, that is x_t = Φ_LM α_t, with α_t ∈ ℝ^300 and\nα_t ⪰ 0 (the choice of sparse solver does not impact the demonstration). Now we seek to find an\nembedding matrix, P, which projects the α_t into an f-dimensional space via β_t = P α_t such that\nthe trajectories in β_t are as straight as possible, thus reflecting their true underlying geometry. 
This\nis achieved by performing an optimization that minimizes the second temporal derivative of β_t, as\nspecified in equation (8) below.\nFigure 2A shows the rows of P resulting from this optimization using f = 21. Interestingly, they\nresemble Zernike polynomials on the unit disc. We can think of these as functionals that “sense”\nsparse functions on the underlying manifold. Each row p′_i ∈ ℝ^300 (here the prime sign denotes a row\nof the matrix P) projects a discrete k-sparse approximation α of the underlying h-sparse function to\na real number, β_i. We define the full set of these linear projections β = Pα as a “manifold sensing”\nof α.\nWhen there is only a single delta function on the manifold, the second and third rows of P, which\nform simple linear ramp functions in two orthogonal directions, are sufficient to fully represent its\nposition. These two rows would constitute P_C ∈ ℝ^{2×300} as an embedding solution in the classical\nmanifold learning approach, since a unit disk is diffeomorphic to ℝ^2 and can be embedded in a\n2-dimensional space. The resulting embedding (β_2, β_3) closely resembles the 2-D unit disk manifold and\nallows for recovery of a one-sparse function, as shown in Figure 2B.\n\nFigure 2: Demonstration of functional embedding on the unit disc. A) The rows of P, visualized\nhere on the ground-truth unit disc. Each disc shows the weights in a row of P by coloring the\nlandmarks according to the corresponding value in that row of P. The color scale for each row is\nindividually normalized to emphasize its structure. The pyramidal arrangement of rows is chosen\nto highlight their strong resemblance to the Zernike polynomials. B) (Top) The classic manifold\nembedding perspective allows for low-dimensional data visualization using P_C, which in this case is\ngiven by the second and third rows of P (shown in dashed box in panel A). Each blue dot shows the\n2D projection of a landmark using P_C. 
Boundary effects cause the landmarks to cluster toward the\nperimeter. (Bottom) A 1-sparse function is recoverable when projected to the embedding space by P_C.\nC) (Top) A 4-sparse function (red arrows) and its discrete k-sparse approximation, α (black arrows)\non the unit disc. (Bottom) The recovery, α_REC (black arrows), is computed by solving the optimization\nproblem in equation (6). The estimate of the underlying function (red arrows) was computed by\ntaking a normalized local mean of the recovered k-sparse approximations for visualization purposes.\n\n¹This choice is inspired by the result from compressive sensing [15], though here h is different from k.\n\nRecovering more than a one-sparse function requires using additional rows of P with higher spatial\nfrequencies on the manifold, which together provide higher sensing capacity. Figure 2C demonstrates\nrecovery of an underlying 4-sparse function on the manifold using all 21 functionals, from p′_1 to p′_21.\nFrom this representation, we can recover an estimate of α with positive-only sparse inference:\n\nα_REC = g(β) ≡ argmin_α ‖β - Pα‖_F^2 + z^T α, s.t. α ⪰ 0, (6)\n\nwhere z = [‖p_1‖_2, ···, ‖p_N‖_2]^T and p_j ∈ ℝ^21 is the jth column of P. Note that although α_REC is\nnot an exact recovery of α, the 4-sparse structure is still well preserved, up to a local shift in the\nlocations of the delta functions. 
We conjecture this will lead to a recovery that is perceptually similar\nfor an image signal.\nThe functional embedding concept can be generalized beyond functionals defined on a single manifold\nand will still apply when the underlying geometrical domain is a union of several different manifolds.\nA thorough analysis of the capacity of this sensing method is beyond the scope of this paper, although\nwe recognize it as an interesting research topic for model-based compressive sensing.\n\n3 The Sparse Manifold Transform\n\nThe Sparse Manifold Transform (SMT) consists of a non-linear sparse coding expansion followed\nby a linear manifold sensing compression (dimension reduction). The manifold sensing step acts to\nlinearly pool the sparse codes, α, with a matrix, P, that is learned using the functional embedding\nconcept (sec. 2) in order to straighten trajectories arising from video (or other dynamical) data.\nThe SMT framework makes three basic assumptions:\n\n1. The dictionary Φ learned by sparse coding has an organization that is a discrete sampling of\na low-dimensional, smooth manifold, M (Fig. 1).\n\n2. The resulting sparse code α is a discrete k-sparse approximation of an underlying h-sparse\nfunction defined on M. There exists a functional manifold embedding, τ : Φ ↪ P, that\nmaps each of the dictionary elements to a new vector, p_j = τ(φ_j), where p_j is the jth\ncolumn of P s.t. both the topology of M and the h-sparse function’s structure are preserved.\n\n3. A continuous temporal transformation in the input (e.g., from natural movies) leads to a\nlinear flow on M and also in the geometrical embedding space.\n\nIn an image, the elements of the underlying h-sparse function correspond to discrete components\nsuch as edges, corners, blobs or other features that are undergoing some simple set of transformations.\nSince there are only a finite number of learned dictionary elements tiling the underlying manifold, they\nmust cooperate (or ‘steer’) to represent each of these components as they appear along a continuum.\nThe desired property of linear flow in the geometric embedding space may be stated mathematically\nas\n\nP α_t ≈ (1/2) P α_{t-1} + (1/2) P α_{t+1}, (7)\n\nwhere α_t denotes the sparse coefficient vector at time t. Here we exploit the temporal continuity\ninherent in the data to avoid the otherwise cumbersome nearest-neighbor search required of LLE or\nLLL. The embedding matrix P satisfying (7) may be derived by minimizing an objective function\nthat encourages the second-order temporal derivative of Pα to be zero:\n\nmin_P ‖P A D‖_F^2, s.t. P V P^T = I (8)\n\nwhere A is the coefficient matrix whose columns are the coefficient vectors, α_t, in temporal order, and\nD is the second-order differential operator matrix, with D_{t-1,t} = -0.5, D_{t,t} = 1, D_{t+1,t} = -0.5\nand D_{τ,t} = 0 otherwise. V is a positive-definite matrix for normalization, I is the identity matrix and\n‖•‖_F indicates the matrix Frobenius norm. We choose V to be the covariance matrix of α and thus\nthe optimization constraint makes the rows of P orthogonal in whitened sparse coefficient vector\nspace. 
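To make the objective in equation (8) concrete, the following sketch builds the second-order difference operator D, whitens with V, and takes trailing eigenvectors, which is one standard way to solve a trace minimization under a constraint of the form P V P^T = I. It is an illustration under stated assumptions rather than the authors' code: the function names, the restriction of D to interior frames, the small ridge added to V, the arrangement of the eigenvectors as rows of P (so that P has shape f x N), and the random placeholder matrix A are all hypothetical choices.

```python
import numpy as np

def second_difference_operator(T):
    # Column for frame t has -0.5 at t-1, 1 at t and -0.5 at t+1.
    # Only interior frames are used (assumption about boundary handling).
    D = np.zeros((T, T - 2))
    for col, t in enumerate(range(1, T - 1)):
        D[t - 1, col] = -0.5
        D[t, col] = 1.0
        D[t + 1, col] = -0.5
    return D

def solve_embedding(A, f, ridge=1e-6):
    # A is N x T: column t is the sparse code alpha_t, in temporal order.
    N, T = A.shape
    D = second_difference_operator(T)
    V = np.cov(A) + ridge * np.eye(N)            # covariance of the codes, plus a ridge (assumption)
    evals, evecs = np.linalg.eigh(V)
    V_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    AD = A @ D
    M = V_inv_sqrt @ AD @ AD.T @ V_inv_sqrt      # whitened second-difference energy
    eigvals, U = np.linalg.eigh(M)               # eigenvalues in ascending order
    P = U[:, :f].T @ V_inv_sqrt                  # f trailing eigenvectors, stored as rows
    return P, V, D, eigvals

# Placeholder codes (assumption): random non-negative values, not real sparse codes.
rng = np.random.default_rng(0)
A = np.abs(rng.standard_normal((12, 200)))
P, V, D, eigvals = solve_embedding(A, f=4)
objective = np.linalg.norm(P @ A @ D) ** 2       # squared Frobenius norm of P A D
```

In this construction the objective equals the sum of the f smallest eigenvalues of the whitened matrix, and P V P^T is the identity up to numerical error.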
Note that this formulation is qualitatively similar to applying SFA to sparse coefficients, but\nusing the second-order derivative instead of the first-order derivative.\nThe solution to this generalized eigen-decomposition problem is given [58] by P = V^{-1/2} U, where\nU is a matrix of f trailing eigenvectors (i.e. eigenvectors with the smallest eigenvalues) of the matrix\nV^{-1/2} A D D^T A^T V^{-1/2}. Some drawbacks of this analytic solution are that: 1) there is an unnecessary\nordering among different dimensions, 2) the learned functional embedding tends to be global, with\nsupport as large as the whole manifold, and 3) the solution is not online and does not allow other\nconstraints to be posed. In order to solve these issues, we modify the formulation slightly with a\nsparse regularization term on P and develop an online SGD (Stochastic Gradient Descent) solution,\nwhich is detailed in Supplement C.\nTo summarize, the SMT is performed on an input signal x by first computing a higher-dimensional\nrepresentation α via sparse inference with a learned dictionary, Φ, and second computing a contracted\ncode by sensing a manifold representation, β = Pα, with a learned pooling matrix, P.\n\n4 Results\n\nStraightening of video sequences. We applied the SMT optimization procedure on sequences\nof whitened 20 × 20 pixel image patches extracted from natural videos. We first learned a 10×\novercomplete spatial dictionary Φ ∈ ℝ^{400×4000} and coded each frame x_t as a 4000-dimensional\nsparse coefficient vector α_t. We then derived an embedding matrix P ∈ ℝ^{200×4000} by solving\nequation (8). Figure 3 shows that while the sparse code α_t exhibits high variability from frame to\nframe, the embedded representation β_t = P α_t changes in a more linear or smooth manner. 
It should be emphasized that finding such a smooth linear projection (embedding) is highly non-trivial, and is possible if and only if the sparse codes change in a locally linear manner in response to smooth transformations in the image. If the sparse code were to change in an erratic or random manner under these transformations, any linear projection would be non-smooth in time. Furthermore, we show that this embedding does not constitute a trivial temporal smoothing, as we can recover a good approximation of the image sequence via $\\hat{x}_t = g(\\beta_t)$, where $g(\\cdot)$ is the inverse embedding function (6). We can also use the functional embedding to regularize sparse inference, as detailed in Supplement B, which further increases the smoothness of both $\\alpha$ and $\\beta$.\n\nFigure 3: SMT encoding of an 80-frame image sequence. A) Rescaled activations for 80 randomly selected $\\alpha$ units. Each row depicts the temporal sequence of a different unit. B) The activity of 80 randomly selected $\\beta$ units. C) Frame samples from the 90 fps video input (top) and reconstructions computed from the $\\alpha_{REC}$ recovered from the sequence of $\\beta$ values (bottom).\n\nAffinity Groups and Dictionary Topology. Once a functional embedding is learned for the dictionary elements, we can compute the cosine similarity between their embedding vectors, $\\cos(p_j, p_k) = \\frac{p_j^T p_k}{\\|p_j\\|_2 \\|p_k\\|_2}$, to find the neighbors, or affinity group, of each dictionary element in the embedding space. In Figure 4A we show the affinity groups for a set of randomly sampled elements from the overcomplete dictionary learned from natural videos. As one can see, the topology of the embedding learned from the SMT reflects the structural similarity of the dictionary elements according to the properties of position, orientation, and scale.
Figure 4B shows that the nearest neighbors of each dictionary element in the embedding space are more 'semantically similar' than the nearest neighbors of the element in the pixel space. To measure the similarity, we chose the 500 best-fit dictionary elements and computed their lengths and orientations. For each of these elements, we find the top 9 nearest neighbors in both the embedding space and in pixel space and then compute the average difference in length ($\\Delta$ Length) and orientation ($\\Delta$ Angle). The results confirm that the embedding space is succeeding in grouping dictionary elements according to their structural similarity, presumably due to the continuous geometric transformations occurring in image sequences.\n\nFigure 4: A) Affinity groups learned using the SMT reveal the topological ordering of a sparse coding dictionary. Each box depicts as a needle plot the affinity group of a randomly selected dictionary element and its top 40 affinity neighbors. The length, position, and orientation of each needle reflect those properties of the dictionary element in the affinity group (see Supplement E for details). The color shade indicates the normalized strength of the cosine similarity between the dictionary elements. B) The properties of length and orientation (angle) are more similar among nearest neighbors in the embedding space (E) as compared to the pixel space (P).\n\nComputing the cosine similarity can be thought of as a hypersphere normalization on the embedding matrix $P$. In other words, if the embedding is normalized to be approximately on a hypersphere, the cosine distance is almost equivalent to the Gramian matrix, $P^T P$. Taking this perspective, the learned geometric embedding and affinity groups can explain the dictionary grouping results shown in previous work [25].
In that work, the layer 1 outputs are pooled by an affinity matrix given by $P = E^T E$, where $E$ is the eigenvector matrix computed from the correlations among layer 1 outputs. This PCA-based method can be considered an embedding that uses only spatial correlation information, while the SMT model uses both spatial correlation and temporal interpolation information.\n\nHierarchical Composition. An SMT layer is composed of two sublayers: a sparse coding sublayer that models sparse discreteness, and a manifold embedding sublayer that models simple geometrical transforms. It is possible to stack multiple SMT layers to form a hierarchical architecture, which addresses the third pattern from Mumford's theory: hierarchical composition. It also provides a way to progressively flatten image manifolds, as proposed by DiCarlo & Cox [14]. Here we demonstrate this process with a two-layer SMT model (Figure 5A) and we visualize the learned representations. The network is trained in a layer-by-layer fashion on a natural video dataset as above.\n\nFigure 5: SMT layers can be stacked to learn a hierarchical representation. A) The network architecture. Each layer contains a sparse coding sublayer (red) and a manifold sensing sublayer (green). B) Example dictionary element groups for $\\Phi^{(1)}$ (left) and $\\Phi^{(2)}$ (right). C) Each row shows an example of interpolation by combining layer-3 dictionary elements. From left to right, the first two columns are visualizations of two different layer-3 dictionary elements, each obtained by setting a single element of $\\alpha^{(3)}$ to one and the rest to zero. The third column is an image generated by setting both elements of $\\alpha^{(3)}$ to 0.5 simultaneously. The fourth column is a linear interpolation in image space between the first two images, for comparison. D) Information is approximately preserved at higher layers.
From left to right: The input image and the reconstructions from $\\alpha^{(1)}$, $\\alpha^{(2)}$ and $\\alpha^{(3)}$, respectively. The rows in C) and D) are unique examples. See section 2 for visualization details.\n\nWe can produce reconstructions and dictionary visualizations from any layer by repeatedly using the inverse operator, $g(\\cdot)$. Formally, we define $\\alpha^{(l)}_{REC} = g^{(l)}(\\beta^{(l)})$, where $l$ is the layer number. For example, the inverse transform from $\\alpha^{(2)}$ to the image space will be $x_{REC} = C\\Phi^{(1)} g^{(1)}(\\Phi^{(2)}\\alpha^{(2)})$, where $C$ is an unwhitening matrix. We can use this inverse transform to visualize any single dictionary element by setting $\\alpha^{(l)}$ to a 1-hot vector. Using this method of visualization, Figure 5B shows a comparison of some of the dictionary elements learned at layers 1 and 2. We can see that lower layer elements combine together to form more global and abstract dictionary elements in higher layers, e.g. layer-2 units tend to be more curved, many of them are corners, textures or larger blobs.\n\nAnother important property that emerges at higher levels of the network is that dictionary elements are steerable over a larger range, since they are learned from progressively more linearized representations. To demonstrate this, we trained a three-layer network and performed linear interpolation between two third-layer dictionary elements, resulting in a non-linear interpolation in the image space that shifts features far beyond what simple linear interpolation in the image space would accomplish (Figure 5C).
A thorough visualization of the dictionary elements and groups is provided in Supplement F.\n\n5 Discussion\n\nA key new perspective introduced in this work is to view both the signals (such as images) and their sparse representations as functions defined on a manifold domain. A gray-scale image is a function defined on a 2D plane, tiled by pixels. Here we propose that the dictionary elements should be viewed as the new 'pixels' and their coefficients are the corresponding new 'pixel values'. The pooling functions can be viewed as low pass filters defined on this new manifold domain. This perspective is strongly connected to recent developments in both signal processing on irregular domains [52] and geometric deep learning [7].\n\nPrevious approaches have learned the group structure of dictionary elements mainly from a statistical perspective [27, 29, 45, 32, 37, 39]. Additional unsupervised learning models [51, 46, 33, 62] combine sparse discreteness with hierarchical structure, but do not explicitly model the low-dimensional manifold structure of inputs. Our contribution here is to approach the problem from a geometric perspective to learn a topological embedding of the dictionary elements.\n\nThe functional embedding framework provides a new perspective on the pooling functions commonly used in convnets. In particular, it provides a principled framework for learning the pooling operators at each stage of representation based on the underlying geometry of the data, rather than being imposed in a 2D topology a priori as was done previously to learn linearized representations from video [23]. This could facilitate the learning of higher-order invariances, as well as equivariant representations [50], at higher stages.
In addition, since the pooling is approximately invertible due to the underlying sparsity, it is possible to have bidirectional flow of information between stages of representation to allow for hierarchical inference [36]. The invertibility of SMT is due to the underlying sparsity of the signal, and is related to prior works on the invertibility of deep networks [22, 8, 61, 16]. Understanding this relationship may bring further insights to these models.\n\nAcknowledgments\nWe thank Joan Bruna, Fritz Sommer, Ryan Zarcone, Alex Anderson, Brian Cheung and Charles Frye for many fruitful discussions; Karl Zipser for sharing computing resources; Eero Simoncelli and Chris Rozell for pointing us to some valuable references. This work is supported by NSF-IIS-1718991, NSF-DGE-1106400, and NIH/NEI T32 EY007043.\n\nReferences\n[1] Joseph J Atick and A Norman Redlich. Towards a theory of early visual processing. Neural Computation, 2(3):308\u2013320, 1990.\n\n[2] Joseph J Atick and A Norman Redlich. What does the retina know about natural scenes? Neural Computation, 4(2):196\u2013210, 1992.\n\n[3] Johannes Ball\u00e9, Valero Laparra, and Eero P Simoncelli. Density modeling of images using a generalized normalization transformation. arXiv preprint arXiv:1511.06281, 2015.\n\n[4] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in neural information processing systems, pages 585\u2013591, 2002.\n\n[5] Anthony J Bell and Terrence J Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129\u20131159, 1995.\n\n[6] Pietro Berkes, Richard E Turner, and Maneesh Sahani. A structured model of video reproduces primary visual cortical organisation.
PLoS computational biology, 5(9):e1000495, 2009.\n\n[7] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18\u201342, 2017.\n\n[8] Joan Bruna, Arthur Szlam, and Yann LeCun. Signal recovery from pooling representations. arXiv preprint arXiv:1311.4025, 2013.\n\n[9] Charles F Cadieu and Bruno A Olshausen. Learning intermediate-level representations of form and motion from natural movies. Neural Computation, 24(4):827\u2013866, 2012.\n\n[10] Gunnar Carlsson, Tigran Ishkhanov, Vin De Silva, and Afra Zomorodian. On the local behavior of spaces of natural images. International journal of computer vision, 76(1):1\u201312, 2008.\n\n[11] Vin De Silva and Gunnar E Carlsson. Topological estimation using witness complexes. SPBG, 4:157\u2013166, 2004.\n\n[12] Vin De Silva and Joshua B Tenenbaum. Sparse multidimensional scaling using landmark points. Technical report, Stanford University, 2004.\n\n[13] Emily Denton and Vighnesh Birodkar. Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems, pages 4417\u20134426, 2017.\n\n[14] James J DiCarlo and David D Cox. Untangling invariant object recognition. Trends in cognitive sciences, 11(8):333\u2013341, 2007.\n\n[15] David L Donoho. Compressed sensing. IEEE Transactions on information theory, 52(4):1289\u20131306, 2006.\n\n[16] Alexey Dosovitskiy and Thomas Brox. Inverting visual representations with convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4829\u20134837, 2016.\n\n[17] Ehsan Elhamifar and Ren\u00e9 Vidal. Sparse manifold clustering and embedding. In Advances in neural information processing systems, pages 55\u201363, 2011.\n\n[18] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan.
Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence, 32(9):1627\u20131645, 2010.\n\n[19] David J Field. Relations between the statistics of natural images and the response properties of cortical cells. JOSA A, 4(12):2379\u20132394, 1987.\n\n[20] Peter F\u00f6ldi\u00e1k. Learning invariance from transformation sequences. Neural Computation, 3(2):194\u2013200, 1991.\n\n[21] William T Freeman, Edward H Adelson, et al. The design and use of steerable filters. IEEE Transactions on Pattern analysis and machine intelligence, 13(9):891\u2013906, 1991.\n\n[22] Anna C Gilbert, Yi Zhang, Kibok Lee, Yuting Zhang, and Honglak Lee. Towards understanding the invertibility of convolutional neural networks. arXiv preprint arXiv:1705.08664, 2017.\n\n[23] Ross Goroshin, Michael F Mathieu, and Yann LeCun. Learning to linearize under uncertainty. In Advances in Neural Information Processing Systems, pages 1234\u20131242, 2015.\n\n[24] OJ Henaff, RLT Goris, and EP Simoncelli. Perceptual straightening of natural videos. In Computational and Systems Neuroscience, 2018.\n\n[25] Haruo Hosoya and Aapo Hyv\u00e4rinen. Learning visual spatial pooling by strong PCA dimension reduction. Neural Computation, 2016.\n\n[26] Xiaoming Huo, Xuelei Ni, and Andrew K Smith. A survey of manifold-based learning methods. Recent advances in data mining of enterprise data, pages 691\u2013745, 2007.\n\n[27] Aapo Hyv\u00e4rinen and Patrik Hoyer. Emergence of phase- and shift-invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12(7):1705\u20131720, 2000.\n\n[28] Aapo Hyv\u00e4rinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural networks, 13(4):411\u2013430, 2000.\n\n[29] Aapo Hyv\u00e4rinen, Patrik O Hoyer, and Mika Inki.
Topographic independent component analysis. Neural Computation, 13(7):1527\u20131558, 2001.\n\n[30] Aapo Hyv\u00e4rinen, Jarmo Hurri, and Jaakko V\u00e4yrynen. Bubbles: a unifying framework for low-level statistical properties of natural image sequences. JOSA A, 20(7):1237\u20131252, 2003.\n\n[31] Juha Karhunen, Erkki Oja, Liuyue Wang, Ricardo Vigario, and Jyrki Joutsensalo. A class of neural networks for independent component analysis. IEEE Transactions on Neural Networks, 8(3):486\u2013504, 1997.\n\n[32] Urs K\u00f6ster and Aapo Hyv\u00e4rinen. A two-layer model of natural stimuli estimated with score matching. Neural Computation, 22(9):2308\u20132333, 2010.\n\n[33] Quoc V Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S Corrado, Jeff Dean, and Andrew Y Ng. Building high-level features using large scale unsupervised learning. In Proceedings of the 29th International Conference on Machine Learning, pages 507\u2013514. Omnipress, 2012.\n\n[34] Ann B Lee, Kim S Pedersen, and David Mumford. The nonlinear statistics of high-contrast patches in natural images. International Journal of Computer Vision, 54(1):83\u2013103, 2003.\n\n[35] John Lee. Introduction to smooth manifolds. Springer, New York London, 2012. ISBN 978-1-4419-9982-5.\n\n[36] Tai Sing Lee and David Mumford. Hierarchical Bayesian inference in the visual cortex. JOSA A, 20(7):1434\u20131448, 2003.\n\n[37] Siwei Lyu and Eero P Simoncelli. Nonlinear image representation using divisive normalization. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1\u20138. IEEE, 2008.\n\n[38] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579\u20132605, 2008.\n\n[39] Jes\u00fas Malo and Juan Guti\u00e9rrez. V1 non-linear properties emerge from local-to-global non-linear ICA.
Network: Computation in Neural Systems, 17(1):85\u2013102, 2006.\n\n[40] David Mumford and Agn\u00e8s Desolneux. Pattern theory: the stochastic analysis of real-world signals. CRC Press, 2010.\n\n[41] Bruno A Olshausen. Learning sparse, overcomplete representations of time-varying natural images. In Image Processing, 2003. ICIP 2003. Proceedings. 2003 International Conference on, volume 1, pages I\u201341. IEEE, 2003.\n\n[42] Bruno A Olshausen. Highly overcomplete sparse coding. In Human Vision and Electronic Imaging XVIII, volume 8651, page 86510S. International Society for Optics and Photonics, 2013.\n\n[43] Bruno A Olshausen and David J Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607, 1996.\n\n[44] Bruno A Olshausen and David J Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision research, 37(23):3311\u20133325, 1997.\n\n[45] Simon Osindero, Max Welling, and Geoffrey E Hinton. Topographic product models applied to natural scene statistics. Neural Computation, 18(2):381\u2013414, 2006.\n\n[46] Dylan M Paiton, Sheng Lundquist, William Shainin, Xinhua Zhang, Peter Schultz, and Garrett Kenyon. A deconvolutional competitive algorithm for building sparse hierarchical representations. In Proceedings of the 9th EAI International Conference on Bio-inspired Information and Communications Technologies, pages 535\u2013542. ICST, 2016.\n\n[47] Pietro Perona. Deformable kernels for early vision. IEEE Transactions on pattern analysis and machine intelligence, 17(5):488\u2013499, 1995.\n\n[48] Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323\u20132326, 2000.\n\n[49] Christopher J Rozell, Don H Johnson, Richard G Baraniuk, and Bruno A Olshausen. Sparse coding via thresholding and local competition in neural circuits.
Neural Computation, 20(10):2526\u20132563, 2008.\n\n[50] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3859\u20133869, 2017.\n\n[51] Honghao Shan and Garrison Cottrell. Efficient visual coding: From retina to V2. arXiv preprint arXiv:1312.6077, 2013.\n\n[52] David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30(3):83\u201398, 2013.\n\n[53] Jorge Silva, Jorge Marques, and Jo\u00e3o Lemos. Selecting landmark points for sparse manifold learning. In Advances in neural information processing systems, pages 1241\u20131248, 2006.\n\n[54] Eero P Simoncelli and William T Freeman. The steerable pyramid: A flexible architecture for multi-scale derivative computation. In Proceedings of the International Conference on Image Processing, volume 3, pages 444\u2013447. IEEE, 1995.\n\n[55] Eero P Simoncelli, William T Freeman, Edward H Adelson, and David J Heeger. Shiftable multiscale transforms. IEEE transactions on Information Theory, 38(2):587\u2013607, 1992.\n\n[56] Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319\u20132323, 2000.\n\n[57] J Hans van Hateren and Dan L Ruderman. Independent component analysis of natural image sequences yields spatio-temporal filters similar to simple cells in primary visual cortex. Proceedings of the Royal Society of London B: Biological Sciences, 265(1412):2315\u20132320, 1998.\n\n[58] Max Vladymyrov and Miguel \u00c1 Carreira-Perpi\u00f1\u00e1n.
Locally linear landmarks for large-scale manifold learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 256\u2013271. Springer, 2013.\n\n[59] Laurenz Wiskott and Terrence J Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715\u2013770, 2002.\n\n[60] Ying Nian Wu, Zhangzhang Si, Haifeng Gong, and Song-Chun Zhu. Learning active basis model for object detection and recognition. International journal of computer vision, 90(2):198\u2013235, 2010.\n\n[61] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818\u2013833. Springer, 2014.\n\n[62] Matthew D Zeiler, Graham W Taylor, and Rob Fergus. Adaptive deconvolutional networks for mid and high level feature learning. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2018\u20132025. IEEE, 2011.", "award": [], "sourceid": 6729, "authors": [{"given_name": "Yubei", "family_name": "Chen", "institution": "EECS, UC Berkeley"}, {"given_name": "Dylan", "family_name": "Paiton", "institution": "University of California, Berkeley"}, {"given_name": "Bruno", "family_name": "Olshausen", "institution": "Redwood Center/UC Berkeley"}]}