{"title": "Structured sparse coding via lateral inhibition", "book": "Advances in Neural Information Processing Systems", "page_first": 1116, "page_last": 1124, "abstract": "This work describes a conceptually simple method for structured sparse coding and dictionary design. Supposing a dictionary with K atoms, we introduce a structure as a set of penalties or interactions between every pair of atoms. We describe modifications of standard sparse coding algorithms for inference in this setting, and describe experiments showing that these algorithms are efficient. We show that interesting dictionaries can be learned for interactions that encode tree structures or locally connected structures. Finally, we show that our framework allows us to learn the values of the interactions from the data, rather than having them pre-specified.", "full_text": "Structured sparse coding via lateral inhibition\n\nKarol Gregor\nJanelia Farm, HHMI\n19700 Helix Drive\nAshburn, VA, 20147\nkarol.gregor@gmail.com\n\nArthur Szlam\nThe City College of New York\nConvent Ave and 138th St\nNew York, NY, 10031\naszlam@courant.nyu.edu\n\nYann LeCun\nNew York University\n715 Broadway, Floor 12\nNew York, NY, 10003\nyann@cs.nyu.edu\n\nAbstract\n\nThis work describes a conceptually simple method for structured sparse coding and dictionary design. Supposing a dictionary with K atoms, we introduce a structure as a set of penalties or interactions between every pair of atoms. We describe modifications of standard sparse coding algorithms for inference in this setting, and describe experiments showing that these algorithms are efficient. We show that interesting dictionaries can be learned for interactions that encode tree structures or locally connected structures. 
Finally, we show that our framework allows us to learn the values of the interactions from the data, rather than having them pre-specified.\n\n1 Introduction\n\nSparse modeling (Olshausen and Field, 1996; Aharon et al., 2006) is one of the most successful recent signal processing paradigms. A set of N data points X in the Euclidean space R^d is written as the approximate product of a d \u00d7 k dictionary W and k \u00d7 N coefficients Z, where each column of Z is penalized for having many non-zero entries. If we take the approximation to X in the least squares sense, and the penalty on the coefficient matrix to be the l1 norm, we wish to find\n\nargmin_{Z,W} \u2211_k ||W z_k \u2212 x_k||^2 + \u03bb ||z_k||_1.   (1)\n\nIn (Olshausen and Field, 1996), this model is introduced as a possible explanation of the emergence of orientation selective cells in the primary visual cortex V1; the matrix W corresponds to neural connections.\n\nIt is sometimes appropriate to enforce more structure on Z than just sparsity. For example, we may wish to enforce a tree structure on Z, so that certain basis elements can be used by any data point while others are specific to a few data points; or, more generally, a graph structure on Z that specifies which elements can be used with which others. Various forms of structured sparsity are explored in (Kavukcuoglu et al., 2009; Jenatton et al., 2010; Kim and Xing, 2010; Jacob et al., 2009; Baraniuk et al., 2009). From an engineering perspective, structured sparse models allow us to access or enforce information about the dependencies between codewords, and to control the expressive power of the model without losing reconstruction accuracy. From a biological perspective, structured sparsity is interesting because structure and sparsity are present in neocortical representations. For example, neurons in the same mini-columns of V1 are receptive to similar orientations and activate together. 
Similarly, neurons within columns in the inferior temporal cortex activate together and correspond to object parts.\n\nIn this paper we introduce a new formulation of structured sparsity. The l1 penalty is replaced with a set of interactions between the coding units, corresponding to intralayer connections in the neocortex. For every pair of units there is an interaction weight that specifies the cost of simultaneously activating both units.\n\nFigure 1: Model (3) in the locally connected setting of subsection 3.3. Code units are placed in a two dimensional grid above the image (here represented in 1-d for clarity). A given unit connects to a small neighborhood of the input via W and to a small neighborhood of code units via S. The S entry is present and positive (inhibitory) if the distance d between units satisfies r1 < d < r2 for some radii.\n\nWe will describe several experiments with the model. In the first set of experiments we set the interactions to reflect a prespecified structure. In one example we create a locally connected network with inhibitory connections in a ring around every unit. Trained with natural images, this leads to dictionaries with Gabor-like edge elements of similar orientations placed in nearby locations, producing pinwheel patterns analogous to those observed in V1 of higher mammals. We also place the units on a tree and place inhibitory interactions between different branches of the tree, resulting in edges of similar orientation being placed in the same branch of the tree; see for example (Hyvarinen and Hoyer, 2001). In the second set of experiments we learn the values of the lateral connections instead of setting them, in effect learning the structure. 
When trained on images of faces, the system learns to place different facial features at correct locations in the image.\n\nThe rest of this paper is organized as follows: in the rest of this section, we introduce our model and describe its relationship to the structured sparsity approaches mentioned above. In section 2, we describe the algorithms we use for optimizing the model. Finally, in section 3, we display the results of experiments showing that the algorithms are efficient, and that we can effectively learn dictionaries with a given structure, and even learn the structure itself.\n\n1.1 Structured sparse models\n\nWe start with a model that creates a representation Z of data points X via W by specifying a set of disallowed index pairs of Z: U = {(i1, j1), (i2, j2), ..., (ik, jk)}, meaning that a representation Z is not allowed if both Zi \u2260 0 and Zj \u2260 0 for any pair (i, j) \u2208 U. Here we constrain Z \u2265 0. The inference problem can be formulated as\n\nmin_{Z \u2265 0} \u2211_{j=1}^{N} ||W Zj \u2212 Xj||^2, subject to (Z Z^T)(i, j) = 0 for (i, j) \u2208 U.\n\nThen the Lagrangian of the energy with respect to Z is\n\n\u2211_{j=1}^{N} ||W Zj \u2212 Xj||^2 + Zj^T S Zj,   (2)\n\nwhere the Sij are the dual variables for the constraints in U, and are 0 for the unconstrained pairs. A local minimum of the constrained problem is a saddle point of (2). At such a point, Sij can be interpreted as the weight of the inhibitory connection between Wi and Wj necessary to keep them from simultaneously activating. This observation is the starting point for this paper.\n\n1.2 Lateral inhibition model\n\nIn practice, it is useful to soften the constraints in U to a fixed, prespecified penalty, instead of a maximization over S as the Lagrangian form would suggest. This allows some points to use proscribed activations if they are especially important for the reconstruction. 
To use units with both positive and negative activations we take absolute values and obtain\n\nmin_{W,Z} \u2211_j ||W Zj \u2212 Xj||^2 + |Zj|^T S |Zj|, subject to ||Wj|| = 1 \u2200 j,   (3)\n\nwhere |Zj| denotes the vector obtained from Zj by taking the absolute value of each component, and Zj is the jth column of Z. S will usually be chosen to be symmetric with 0 on the diagonal. As before, instead of taking absolute values, we can constrain Z \u2265 0, allowing us to write the penalty as Zj^T S Zj. Finally, note that we can also allow S to be negative, implementing excitatory interactions between neurons. One then has to prevent the sparsity term from going to minus infinity by limiting the amount of excitation a given element can experience (see the algorithm section for details).\n\nThe Lagrangian optimization tries to increase the inhibition between a forbidden pair whenever it activates. If our goal is to learn the interactions, rather than enforce the ones we have chosen, then it makes sense to do the opposite, and decrease entries of S corresponding to pairs which are often activated simultaneously. To force a nontrivial solution and encourage S to economize a fixed amount of inhibitory power, we also propose the model\n\nmin_S min_{W,Z} \u2211_j ||W Zj \u2212 Xj||^2 + |Zj|^T S |Zj|,   (4)\n\nsubject to Z \u2265 0, ||Wj|| = 1 \u2200 j, 0 \u2264 S \u2264 \u03b2, S = S^T, and |Sj|_1 = \u03b1 \u2200 j.\n\nHere, \u03b1 and \u03b2 control the total inhibitory power of the activation of an atom in W, and how much the inhibitory power can be concentrated in a few interactions (i.e. the sparsity of the interactions). As above, one would usually also fix S to be 0 on the diagonal.\n\n1.3 Lateral inhibition and weighted l1\n\nSuppose we have fixed S and W, and are inferring z from a datapoint x. Furthermore, suppose that a subset I of the indices of z do not inhibit each other. 
Then if I^c is the complement of I, for any fixed value of z_{I^c} (the subscript refers to indices of the column vector z), the cost of using zI is given by\n\n||W_I z_I \u2212 x||^2 + \u2211_{i \u2208 I} \u03bbi |zi|,\n\nwhere \u03bbi = \u2211_{j \u2208 I^c} Sij |zj|. Thus for z_{I^c} fixed, we get a weighted lasso in zI.\n\n1.4 Relation with previous work\n\nAs mentioned above, there is a growing literature on structured dictionary learning and structured sparse coding. The works in (Baraniuk et al., 2009; Huang et al., 2009) use a greedy approach for structured sparse coding based on OMP or CoSaMP. These methods are fast when there is an efficient method for searching the allowable additions to the active set of coefficients at each greedy update, for example if the coefficients are constrained to lie on a tree. These works also have provable recovery properties when the true coefficients respect the structure and the dictionaries satisfy certain incoherence properties. A second popular basic framework is group sparsity (Kavukcuoglu et al., 2009; Jenatton et al., 2010; Kim and Xing, 2010; Jacob et al., 2009). In these works the coefficients are arranged into a predetermined set of groups, and the sparsity term penalizes the number of active groups, rather than the number of active elements. This approach has the advantage that the resulting inference problems are convex, and many of these works can guarantee convergence of their inference schemes to the minimal energy.\n\nIn our framework, the interactions in S can take any values, giving a different kind of flexibility. Although inference in our framework is not convex, the algorithms we propose efficiently find good codes in practice for every S we have tried. 
Also note that in this setting, recovery theorems with incoherence assumptions are not applicable: because we learn the dictionaries, there is no guarantee that they will satisfy such conditions. Finally, a major difference between the methods presented here and those in the other works is that we can learn S from the data simultaneously with the dictionary; as far as we know, this is not possible in the above mentioned frameworks.\n\nThe interaction between a set of units of the form z^T R z + \u03b8^T z was originally used in Hopfield nets (Hopfield, 1982); there the z are binary vectors and the inference is deterministic. Boltzmann machines (Ackley et al., 1985) have a similar term, but the z and the inference are stochastic, e.g. Markov chain Monte Carlo. With S fixed, one can consider our work a special case of real valued Hopfield nets with R = W^T W + S and \u03b8 = W^T x; because of the form of R and \u03b8, fast inference schemes from sparse coding can be used. When we learn S, the constraints on S serve the same purpose as the contrastive terms in the updates of a Boltzmann machine.\n\nIn (Garrigues and Olshausen, 2008) lateral connections were modeled as the connections of an Ising model, with the Ising units deciding which real valued units (from which the input was reconstructed) were on. The system learned to typically connect similar orientations at a given location. Our model is related but different: it has no second layer, the lateral connections control real instead of binary values, and the inference and learning are simpler, at the cost of a true generative model. In (Druckmann and Chklovskii, 2010) the lateral connections were trained so that solutions zt of a related ODE starting from the inferred code z0 of an input x would map via W to points close to x. 
In that work, the lateral connections were trained in response to the dictionary, rather than simultaneously with it, and did not participate in inference.\n\nIn (Garrigues and Olshausen, 2010) the coefficients were given by a Laplacian scale mixture prior, leading to multiplicative modulation, as in this work. However, in contrast, in our case the sparsity coefficients are modulated by units in the same layer, and we learn the modulation, as opposed to the fixed topology in (Garrigues and Olshausen, 2010).\n\n2 Algorithms\n\nIn this section we describe several algorithms to solve the problems in (3) and (4). The basic framework is to alternate between updates to Z, W, and, if desired, S. First we discuss methods for solving for Z with W and S fixed.\n\n2.1 Inferring Z from W, X, and S\n\nThe Z update is the most time sensitive, in the sense that the other variables are fixed after training, and only Z is inferred at test time. In general, any iterative algorithm that can be used for the weighted basis pursuit problem can be adapted to our setting; the weights just change at each iteration. We describe versions of FISTA (Beck and Teboulle, 2009) and coordinate descent (Wu and Lange, 2008; Li and Osher, 2009). While we cannot prove that the algorithms converge to the minimum, in all the applications we have tried, they perform very well.\n\n2.1.1 A FISTA like algorithm\n\nThe ISTA (Iterated Shrinkage Thresholding Algorithm) minimizes the energy ||W z \u2212 x||^2 + \u03bb|z|_1 by following gradient steps on the first term with a \u201cshrinkage\u201d; this can be thought of as gradient steps where any coordinate which crosses zero is thresholded. In equations:\n\nz_{t+1} = sh_{\u03bb/L}(z_t \u2212 (1/L) W^T (W z_t \u2212 x)),\n\nwhere sh_a(b) = sign(b) \u00b7 h_a(|b|), and h_a(b) = max(b \u2212 a, 0). In the case where z is constrained to be nonnegative, sh reduces to h. 
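As a concrete illustration (a minimal numerical sketch with function names of our choosing, not the authors' code), the shrinkage operator and one such thresholded gradient step, with the per-coordinate weights recomputed from the lateral interactions S and the current code, can be written as:

```python
import numpy as np

def shrink(b, a):
    # sh_a(b) = sign(b) * max(|b| - a, 0), applied elementwise.
    return np.sign(b) * np.maximum(np.abs(b) - a, 0.0)

def ista_step(W, x, z, S, L):
    # One ISTA step with lateral inhibition: the threshold vector
    # lambda = S|z| depends on the current code, instead of being
    # a fixed scalar as in ordinary basis pursuit.
    lam = S @ np.abs(z)
    return shrink(z - (W.T @ (W @ z - x)) / L, lam / L)
```

For a fixed scalar threshold this is exactly the standard ISTA step; the only change needed for the lateral inhibition model is recomputing the threshold vector from S at every iteration.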
In this paper, \u03bb is a vector depending on the current value of z, rather than a fixed scalar: after each update, it is recomputed as \u03bb_{t+1} = S|z_{t+1}|.\n\nNesterov\u2019s accelerated gradient descent has been found to be effective in the basis pursuit setting, where it is called FISTA (Beck and Teboulle, 2009). In essence, one adds to the z update a momentum that approaches one with appropriate speed. Specifically, the update on z becomes\n\ny_t = sh_{\u03bb/L}(z_t \u2212 (1/L) W^T (W z_t \u2212 x)), z_{t+1} = y_t + r_t (y_t \u2212 y_{t\u22121}),\n\nwhere r_t = (u_t \u2212 1)/u_{t+1}, u_{t+1} = (1 + \u221a(1 + 4 u_t^2))/2, and u_1 = 1. Although our problem is not convex and we do not have any of the usual guarantees, empirically the Nesterov acceleration works extremely well.\n\nAlgorithm 1 ISTA\nfunction ISTA(X, Z, W, S, L)\n  Require: L > largest eigenvalue of W^T W\n  Initialize: Z = 0\n  repeat\n    \u03bb = S|Z|\n    Z = sh_{\u03bb/L}(Z \u2212 (1/L) W^T (W Z \u2212 X))\n  until change in Z is below a threshold\nend function\n\nAlgorithm 2 Coordinate Descent\nfunction CoD(X, Z, W, S, \u00afS)\n  Require: \u00afS = I \u2212 W^T W\n  Initialize: Z = 0; B = W^T X; \u03bb = 0\n  repeat\n    \u00afZ = h_\u03bb(B)\n    k = argmax |Z \u2212 \u00afZ|\n    B = B + \u00afS_{\u00b7k}(\u00afZ_k \u2212 Z_k)\n    \u03bb = \u03bb + S_{\u00b7k}(\u00afZ_k \u2212 Z_k)\n    Z_k = \u00afZ_k\n  until change in Z is below a threshold\n  Z = h_\u03bb(B)\nend function\n\n2.1.2 Coordinate descent\n\nThe coordinate descent algorithm iteratively selects a single coordinate k of z and, fixing the other coordinates, does a line search to find the value of z(k) with the lowest energy. The coordinate selection can be done by picking the entry with the largest gradient (Wu and Lange, 2008), or by approximating the value of the energy after the line search (Li and Osher, 2009). Suppose at the tth step we have chosen to update the kth coordinate of zt. 
Because S is zero on its main diagonal, the penalty term is not quadratic in z_{t+1}(k), but is simply \u03bb(k) z_{t+1}(k), where \u03bb = S|zt| (which only depends on the currently fixed coordinates). Thus there is an explicit solution z_{t+1}(k) = h_{\u03bb(k)}(B(k)), where B = W^T x + (I \u2212 W^T W) zt is maintained incrementally as in Algorithm 2. Just as in the basis pursuit setting, this has the nice property that by updating B and \u03bb, and using a precomputed W^T W, each update only requires O(K) operations, where K is the number of atoms in the dictionary; in particular, the dictionary only needs to be multiplied by x once. In fact, when the actual solution is very sparse and the dictionary is large, the cost of all the iterations is often less than the cost of computing W^T x.\n\nWe will use coordinate descent for a bilinear model below; in this case, we alternate updates of the left coefficients with updates of the right coefficients.\n\n2.2 Updating W and S\n\nThe updates to W and S can be made after each new z is coded, or in batches, say after a pass through the data. In the case of per datapoint updates, we can proceed via gradient descent: the derivative of all of our models with respect to W for a fixed x and z is (W z \u2212 x)z^T. The batch updates to W can be done as in K-SVD (Aharon et al., 2006).\n\nIt is easier to update S in (4) in batch mode, because of the constraints. With W and Z fixed, the constrained minimization over S is a linear program. We have found it useful to average the current S with the minimizer of the linear program in the update.\n\n3 Experiments\n\nIn this section we test the models (3) and (4) in various experimental settings.\n\n3.1 Inference\n\nFirst we test the speed of convergence and the quality of the resulting state of the ISTA, FISTA and coordinate descent algorithms. We use the example of section 3.4, where the input consists of image patches and the connections in S define a tree. 
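As an illustration of the bookkeeping above (a minimal sketch assuming unit-norm dictionary columns and a nonnegative code, not the authors' implementation), the coordinate descent inference can be written as:

```python
import numpy as np

def cod_infer(W, x, S, n_iter=100):
    # Coordinate descent for ||Wz - x||^2 + z^T S z with z >= 0,
    # assuming the columns of W have unit norm and diag(S) = 0.
    K = W.shape[1]
    S_bar = np.eye(K) - W.T @ W     # precomputed correlation term
    B = W.T @ x                     # B = W^T x + (I - W^T W) z, here z = 0
    lam = np.zeros(K)               # lam = S z
    z = np.zeros(K)
    for _ in range(n_iter):
        z_bar = np.maximum(B - lam, 0.0)   # h_lambda(B), nonnegative case
        k = np.argmax(np.abs(z - z_bar))   # greedy coordinate selection
        if z[k] == z_bar[k]:
            break                          # no coordinate wants to move
        delta = z_bar[k] - z[k]
        B += S_bar[:, k] * delta           # O(K) incremental updates
        lam += S[:, k] * delta
        z[k] = z_bar[k]
    return np.maximum(B - lam, 0.0)
```

Each step exactly minimizes the energy over one coordinate, so the energy never increases; the O(K) cost per step comes from touching only one column of the precomputed matrices.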
Figure 2 (left) shows the energy after each iteration of the three methods, averaged over all data points.\n\nFigure 2: On the left: the energy values after each iteration of the 3 methods, averaged over all the data points. On the right: values of the average energy (1/N) \u2211_j ||W Zj \u2212 Xj||^2 + \u03b7 |Zj|^T S |Zj| and the average S sparsity (1/N) \u2211_j \u03b7 |Zj|^T S |Zj|. The \u201coracle\u201d best tree structured output, computed by an exhaustive search over the projections of each data point onto each branch of the tree, has average energy 20.58 and sparsity 0. S, W, and X are as in section 3.4.\n\nAverage energy (1/N) \u2211_j ||W Zj \u2212 Xj||^2 + \u03b7 |Zj|^T S |Zj|:\n\n\u03b7    | FISTA | ISTA  | CoD\n.8   | 21.79 | 21.67 | 21.67\n.4   | 21.44 | 21.43 | 21.79\n.2   | 21.68 | 21.12 | 21.12\n.1   | 20.63 | 20.67 | 21.19\n.05  | 19.64 | 19.67 | 19.94\n\nAverage sparsity (1/N) \u2211_j \u03b7 |Zj|^T S |Zj|:\n\n\u03b7    | FISTA | ISTA | CoD\n.8   | 1e-9  | 0    | .01\n.4   | .05   | .03  | .08\n.2   | .21   | .28  | .32\n.1   | .87   | .78  | .94\n.05  | 2.01  | 2.0  | 2.07\n\nWe can see that coordinate descent very quickly moves to its resting state (note that each of its iterations is also much cheaper, requiring only a few column operations), but on average it does not reach quite as good a code as ISTA or FISTA. We also see that FISTA reaches as good a code as ISTA in far fewer iterations.\n\nTo test the absolute quality of the methods, we also measure against the \u201coracle\u201d: the lowest possible energy when none of the constraints are broken, that is, when |z|^T S |z| = 0. This energy is obtained by exhaustive search over the projections of each data point onto each branch of the tree. The table in Figure 2 gives the values of the average energy and of the average sparsity for various values of \u03b7. 
Notice that for low values of \u03b7, the methods presented here give codes with better energy than the best possible code on the tree, because the penalty is small enough to allow deviations from the tree structure; but when \u03b7 is increased, the algorithms still compare well against the exhaustive search.\n\n3.2 Scaling\n\nAn interesting property of the models (3) and (4) is their scaling behavior: if the input is re-scaled by a constant factor, the optimal code is re-scaled by the same factor. Thus the model preserves scale information and the input does not need to be normalized. This is not the case in the standard l1 sparse coding model (1): for example, if the input becomes small, the optimal code is zero.\n\nIn this subsection we train the model (3) on image patches. In the first part of the experiment we preprocess each image patch by subtracting its mean, and we set the elements of S to be all equal and positive except for zeros on the diagonal. In the second part of the experiment we use the original image patches without any preprocessing. However, since the mean is the strongest component, we introduce the first example of structure: we select one of the components of z and disconnect it from all the other components. The resulting S is equal to a positive constant everywhere except on the diagonal, the first row, and the first column, where it is zero. In the first setting the result is simply a set of edge detectors. In the second setting we again obtain the usual edge detectors (see Figure 3a), except for the first component, which learns the mean. Experimentally, explicitly removing the mean before training is better, as the training converges much more quickly.\n\n3.3 Locally connected net\n\nIn this section we impose a structure motivated by the connectivity of cortical area V1. The cortical layer has a two dimensional structure (with depth), with locations corresponding to locations in the input image. 
Sublayer 4 contains simple cells with edge like receptive fields. Each such cell receives input from a small neighborhood of the input image at its corresponding location. We model this by placing units in a two dimensional grid above the image and connecting each unit to a small neighborhood of the input, Figure 1. We also tie the connection weights of units that are sufficiently far from each other, to reduce the number of parameters without affecting the local structure (Gregor and LeCun, 2010). Next we connect each unit by inhibitory interactions (the S matrix) to the units in a ring-shaped neighborhood: there is a connection between two units if their distance d satisfies r1 < d < r2 for some radii r1 and r2 (alternatively, we can put r1 = 0 and create excitatory interactions in a smaller neighborhood). With this arrangement, units that turn on simultaneously are typically either close to each other (within r1) or far from each other (more distant than r2).\n\nTraining on image patches results in the filters shown in Figure 4. We see that filters with similar orientations are placed together, as is observed in V1 (and in other experiments on group sparsity, for example (Hyvarinen and Hoyer, 2001)). Here we obtain these patterns through the presence of inhibitory connections.\n\nFigure 3: (a) Filters learned on the original unprocessed image patches. The S matrix was fully connected except for the unit corresponding to the upper left corner, which was not connected to any other unit and learned the mean. The other units typically learned edge detectors. (b) Filters learned in the tree structure. Sij = 0 if one of i and j is a descendant of the other, and Sij = S0 d(i, j) otherwise, where d(i, j) is the distance between the units in the tree. The filters in a given branch are of a similar orientation and get refined as we walk down the tree.\n\nFigure 4: (a-b) Filters learned on images in the locally connected framework with the local inhibition shown in Figure 1. The local inhibition matrix has positive value Sij = S0 > 0 if the distance between code units Zi and Zj satisfies r1 < d(i, j) < r2, and Sij = 0 otherwise. The input size was 40 \u00d7 40 pixels and the receptive field size was 10 \u00d7 10 pixels. The net learned to place filters of similar orientations close together. (a) Images were preprocessed by subtracting the local mean and dividing by the standard deviation, each of width 1.65 pixels. The resulting filters are sharp edge detectors and can therefore be naturally embedded in two dimensions. (b) Only the local mean, of width 5 pixels, was subtracted. This results in a larger range of frequencies that is harder to embed in two dimensions. (c-d) Filters trained on 10 \u00d7 10 image patches with the mean subtracted and then normalized. (c) The inhibition matrix was the same as in (a-b). (d) This time there was an l1 penalty on each code unit and the lateral interaction matrix S was excitatory: Sij < 0 if d(i, j) < r2 and zero otherwise.\n\n3.4 Tree structure\n\nIn this experiment we place the units z on a tree and desire that the units that are on for a given input lie on a single branch of the tree. We define Sij = 0 if i is a descendant of j or vice versa, and Sij = S0 d(i, j) otherwise, where S0 > 0 is a constant and d(i, j) is the distance between nodes i and j (the number of links it takes to get from one to the other).\n\nWe trained (3) on image patches. The model learns to place low frequency filters close to the root of the tree, and as we go down the branches the filters \u201crefine\u201d their parents, Figure 3b.\n\n3.5 A convolutional image parts model\n\nFigure 5: On the left: the dictionary of 16 \u00d7 16 filters learned by the convolutional model on faces. On the right: some low energy configurations, generated randomly as in Section 3.5. 
Each active filter has response 1.\n\nWe give an example of learning S in a convolutional setting. We use the centered faces from the Labeled Faces in the Wild dataset, available at http://vis-www.cs.umass.edu/lfw/. From each of the 13233 images we subsample by a factor of two and pick a random 48 \u00d7 48 patch. The 48 \u00d7 48 image x is then contrast normalized to x \u2212 b \u2217 x, where b is a 5 \u00d7 5 averaging box filter; the images are collected into the 48 \u00d7 48 \u00d7 13233 data set X.\n\nWe then train a model minimizing the energy\n\n\u2211_i || \u2211_{j=1}^{20} Wj \u2217 zji \u2212 Xi ||^2 + p(z)^T S p(z), subject to 0 \u2264 S \u2264 \u03b2, S = S^T, |Sj|_1 = \u03b1.\n\nHere the code vector z is written as a 48 \u00d7 48 \u00d7 20 feature map. The pooling operator p takes the average of the absolute value of each 8 \u00d7 8 patch on each of the 20 maps, and outputs a vector of size 6 \u00b7 6 \u00b7 20 = 720. \u03b2 is set to 72, and \u03b1 to .105; these two numbers roughly specify the number of zeros in the solution of the S problem to be 1600.\n\nThe energy is minimized via the batch procedure. The updates for Z are done via coordinate descent (coordinate descent in the convolutional setting works exactly as before), the updates for W via least squares, and at each update, S is averaged with .05 of the solution to the linear program in S with fixed Z and W. W is initialized with random patches from X, and S is initialized as the all ones matrix with zeros on the diagonal. The dictionary W is displayed in Figure 5.\n\nTo visualize the learned S, we will try to use it to generate new images. Without any data to reconstruct, the model would collapse to zero, so we constrain z to have a fixed number of unit entries and run a few steps of a greedy search to decide which entries should be on. That is, we initialize z to have 5 random entries set to one, and the rest zero. 
At each step, we pick one of the nonzero entries, set it to zero, and find the new entry of z which is cheapest to set to one, namely, the minimum of the entries of Sp(z) which are not currently turned on. We repeat this until the configuration is stable. Some results are displayed in Figure 5.\n\nThe interesting thing about this experiment is the fact that no filter is ever allowed to see global information, except through S. However, even though W is blind to anything larger than a 16 \u00d7 16 patch, through the inhibition of S the model is able to learn the placement of facial structures and long edges.\n\nReferences\n\nAckley, D., Hinton, G., and Sejnowski, T. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147\u2013169.\n\nAharon, M., Elad, M., and Bruckstein, A. (2006). K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311\u20134322.\n\nBaraniuk, R. G., Cevher, V., Duarte, M. F., and Hegde, C. (2009). Model-based compressive sensing.\n\nBeck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm with application to wavelet-based image deblurring. ICASSP\u201909, pages 693\u2013696.\n\nDruckmann, S. and Chklovskii, D. (2010). Over-complete representations on recurrent neural networks can support persistent percepts.\n\nGarrigues, P. and Olshausen, B. (2008). Learning horizontal connections in a sparse coding model of natural images. Advances in Neural Information Processing Systems, 20:505\u2013512.\n\nGarrigues, P. and Olshausen, B. (2010). Group sparse coding with a Laplacian scale mixture prior. In Lafferty, J., Williams, C. K. I., Shawe-Taylor, J., Zemel, R., and Culotta, A., editors, Advances in Neural Information Processing Systems 23, pages 676\u2013684.\n\nGregor, K. and LeCun, Y. (2010). 
Emergence of complex-like cells in a temporal product network with local receptive fields. arXiv preprint arXiv:1006.0448.\n\nHopfield, J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences of the United States of America, 79(8):2554.\n\nHuang, J., Zhang, T., and Metaxas, D. N. (2009). Learning with structured sparsity. In ICML, page 53.\n\nHyvarinen, A. and Hoyer, P. (2001). A two-layer sparse coding model learns simple and complex cell receptive fields and topography from natural images. Vision Research, 41(18):2413\u20132423.\n\nJacob, L., Obozinski, G., and Vert, J.-P. (2009). Group lasso with overlap and graph lasso. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML \u201909, pages 433\u2013440, New York, NY, USA. ACM.\n\nJenatton, R., Mairal, J., Obozinski, G., and Bach, F. (2010). Proximal methods for sparse hierarchical dictionary learning. In International Conference on Machine Learning (ICML).\n\nKavukcuoglu, K., Ranzato, M., Fergus, R., and LeCun, Y. (2009). Learning invariant features through topographic filter maps. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR\u201909). IEEE.\n\nKim, S. and Xing, E. P. (2010). Tree-guided group lasso for multi-task regression with structured sparsity. In ICML, pages 543\u2013550.\n\nLi, Y. and Osher, S. (2009). Coordinate descent optimization for l1 minimization with application to compressed sensing: a greedy algorithm. Inverse Problems and Imaging, 3(3):487\u2013503.\n\nOlshausen, B. and Field, D. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607\u2013609.\n\nWu, T. T. and Lange, K. (2008). 
Coordinate descent algorithms for lasso penalized regression. Annals of Applied Statistics, 2:224.", "award": [], "sourceid": 666, "authors": [{"given_name": "Arthur", "family_name": "Szlam", "institution": null}, {"given_name": "Karol", "family_name": "Gregor", "institution": null}, {"given_name": "Yann", "family_name": "LeCun", "institution": null}]}