{"title": "Cardinality Restricted Boltzmann Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 3293, "page_last": 3301, "abstract": "The Restricted Boltzmann Machine (RBM) is a popular density model that is also good for extracting features. A main source of tractability in RBM models is the model's assumption that given an input, hidden units activate independently from one another. Sparsity and competition in the hidden representation is believed to be beneficial, and while an RBM with competition among its hidden units would acquire some of the attractive properties of sparse coding, such constraints are not added due to the widespread belief that the resulting model would become intractable. In this work, we show how a dynamic programming algorithm developed in 1981 can be used to implement exact sparsity in the RBM's hidden units. We then expand on this and show how to pass derivatives through a layer of exact sparsity, which makes it possible to fine-tune a deep belief network (DBN) consisting of RBMs with sparse hidden layers. We show that sparsity in the RBM's hidden layer improves the performance of both the pre-trained representations and of the fine-tuned model.", "full_text": "Cardinality Restricted Boltzmann Machines\n\nKevin Swersky\n\nDaniel Tarlow\n\nDept. of Computer Science\n\nUniversity of Toronto\n\nIlya Sutskever\n\n[kswersky,dtarlow,ilya]@cs.toronto.edu\n\nRuslan Salakhutdinov\u2020,\u2021\n\nRichard S. Zemel\u2020\n\nDept. of Computer Science\u2020 and Statistics\u2021\n\nUniversity of Toronto\n\n[rsalakhu,zemel]@cs.toronto.edu\n\nRyan P. Adams\n\nSchool of Eng. and Appl. Sciences\n\nHarvard University\n\nrpa@seas.harvard.edu\n\nAbstract\n\nThe Restricted Boltzmann Machine (RBM) is a popular density model that is also\ngood for extracting features. 
A main source of tractability in RBM models is\nthat, given an input, the posterior distribution over hidden variables is factorizable\nand can be easily computed and sampled from. Sparsity and competition in the\nhidden representation are beneficial, and while an RBM with competition among\nits hidden units would acquire some of the attractive properties of sparse coding,\nsuch constraints are typically not added, as the resulting posterior over the\nhidden units seemingly becomes intractable. In this paper we show that a dynamic\nprogramming algorithm can be used to implement exact sparsity in the RBM\u2019s\nhidden units. We also show how to pass derivatives through the resulting posterior\nmarginals, which makes it possible to fine-tune a pre-trained neural network with\nsparse hidden layers.\n\n1 Introduction\n\nThe Restricted Boltzmann Machine (RBM) [1, 2] is an important class of probabilistic graphical\nmodels. Although it is a capable density estimator, it is most often used as a building block for\ndeep belief networks (DBNs). The benefit of using RBMs as building blocks for a DBN is that they\noften provide a good initialization for feed-forward neural networks, and they can effectively utilize\nlarge amounts of unlabeled data, which has led to success in a variety of application domains [3].\nDespite the benefits of this approach, there is a disconnect between the unsupervised nature of\nRBMs and the final discriminative task (e.g., classification) for which the learned features are used.\nThis disconnect has motivated the search for ways to improve task-specific performance, while still\nretaining the unsupervised nature of the original model [4, 5]. One effective method for improving\nperformance has been the incorporation of sparsity into the learned representation. 
Approaches that\nlearn and use sparse representations have achieved good results on a number of tasks [6], and in the\ncontext of computer vision, sparsity has been linked with learning features that are invariant to local\ntransformations [7]. Sparse features are also often more interpretable than dense representations\nafter unsupervised learning.\nFor directed models, such as sparse coding [8], sparsity can be enforced using a Laplace or spike\nand slab prior [9]. For undirected models, introducing hard sparsity constraints directly into the\nenergy function often results in non-trivial dependencies between hidden units that make inference\nintractable. The most common way around this is to encourage sparsity during training by way of a\npenalty function on the expected conditional hidden unit activations given data [10]. However, this\ntraining-time procedure is a heuristic and does not guarantee sparsity at test time.\nRecently, methods for efficiently dealing with highly structured global interactions within the\ngraphical modeling framework have received considerable interest. One class of these interactions is based\non assigning preferences to counts over subsets of binary variables [11, 12]. These are known as\ncardinality potentials. For example, the softmax distribution can be seen as arising from a\ncardinality potential that forces exactly one binary variable to be active. For general potentials over counts,\nit would seem that the cost of inference would grow exponentially with the number of binary\nvariables. However, efficient algorithms have been proposed that compute exact marginals for many\nhigher-order potentials of interest [12]. For achieving sparsity in RBMs, it turns out that a relatively\nsimple dynamic programming algorithm by Gail et al. [13] contains the key ingredients necessary\nto make inference and learning efficient. 
The main idea behind these algorithms is the introduction\nof auxiliary variables that store cumulative sums in the form of a chain or a tree.\nIn this paper, we show how to combine these higher-order potentials with RBMs by placing a\ncardinality potential directly over the hidden units to form a Cardinality-RBM (CaRBM) model. This will\nallow us to obtain genuinely sparse representations, where only a small number of units are allowed\nto be active. We further show how gradients can be backpropagated through inference using a\nrecently proposed finite-difference method [14]. On a benchmark suite of classification experiments,\nthe CaRBM is competitive with current approaches that do not enforce sparsity at test-time.\n\n2 Background\n\n2.1 Restricted Boltzmann Machines\n\nA Restricted Boltzmann Machine is a particular type of Markov random field that has a two-layer\narchitecture, in which the visible, stochastic units v \u2208 {0, 1}^{N_v} are connected to hidden stochastic\nunits h \u2208 {0, 1}^{N_h}. The probability of the joint configuration {v, h} is given by:\n\n\\[ P(v, h) = \\frac{1}{Z} \\exp\\left( v^\\top W h + v^\\top b_v + h^\\top b_h \\right), \\tag{1} \\]\n\nwhere Z is the normalizing constant, and {W \u2208 R^{N_v \u00d7 N_h}, b_v \u2208 R^{N_v}, b_h \u2208 R^{N_h}} are the model\nparameters, with W representing visible-to-hidden symmetric interaction terms, and b_v, b_h\nrepresenting visible and hidden biases respectively. The derivative of the log-likelihood with respect to\nthe model parameters\u00b9 W can be obtained from Eq. 1:\n\n\\[ \\frac{\\partial \\log P(v; \\theta)}{\\partial W} = E_{P_{\\text{data}}}[v h^\\top] - E_{P_{\\text{model}}}[v h^\\top], \\tag{2} \\]\n\nwhere E_{P_data}[\u00b7] denotes an expectation with respect to the data distribution\n\n\\[ P_{\\text{data}}(h, v; \\theta) = P(h \\mid v; \\theta) \\, P_{\\text{data}}(v), \\tag{3} \\]\n\nwhere P_data(v) = (1/N) \u2211_n \u03b4(v \u2212 v_n) represents the empirical distribution, and E_{P_model}[\u00b7] is an\nexpectation with respect to the distribution defined by the model, as in Eq. 1. 
Exact maximum likelihood\nlearning in this model is intractable because exact computation of the expectation E_{P_model}[\u00b7] takes\ntime that is exponential in the number of visible or hidden units. Instead, learning can be performed\nby following an approximation to the gradient, the \u201cContrastive Divergence\u201d (CD) objective [15].\nAfter learning, the hidden units of the RBM can be thought of as features extracted from the input\ndata. Quite often, they are used to initialize a deep belief network (DBN), or they can be used\ndirectly as inputs to some other learning system.\n\n\u00b9The derivatives with respect to the bias terms take a similar form.\n\n2.2 The Sparse RBM (SpRBM)\n\nFor many challenging tasks, such as object or speech recognition, a desirable property for the hidden\nvariables is to encode the data using sparse representations. That is, given an input vector v, we\nwould like the corresponding distribution P(h|v) to favour sparse configurations. The resulting\nfeatures are often more interpretable and tend to improve performance of the learning systems that\nuse these features as inputs. On its own, it is highly unlikely that the RBM will produce sparse\nfeatures. However, suppose we have some desired target expected sparsity \u03c1. If q_j represents a\nrunning average of the hidden unit marginals, q_j = (1/N) \u2211_n P(h_j = 1 | v_n), then we can add the\nfollowing penalty term to the log-likelihood objective [16]:\n\n\\[ \\lambda \\left( \\rho \\log q_j + (1 - \\rho) \\log(1 - q_j) \\right), \\tag{4} \\]\n\nwhere \u03bb represents the strength of the penalty. Up to an additive constant that does not depend on q_j,\nthis penalty is proportional to the negative of the KL divergence between the target sparsity probability\nand the hidden unit marginals. The derivative\nwith respect to the activity on any case n is proportional to \u03bb(\u03c1 \u2212 q_j). 
Note that this is applied to\neach hidden unit independently and has the intuitive property of encouraging each hidden unit to\nactivate with proportion \u03c1 across the dataset.\nIf the hidden unit activations are stored in a matrix where each row corresponds to a training\nexample, and each column corresponds to a hidden unit, then this is enforcing sparsity in the columns of\nthe matrix. This is also referred to as lifetime sparsity. When using the SpRBM model, the hope is\nthat each individual example will be encoded by a sparse vector, corresponding to sparsity across\nthe rows, or population sparsity.\n\n3 The Cardinality Potential\n\nConsider a distribution of the form\n\n\\[ q(x) = \\frac{1}{Z} \\, \\psi\\left( \\sum_{j=1}^{N} x_j \\right) \\prod_{j=1}^{N} \\phi_j(x_j), \\tag{5} \\]\n\nwhere x is a binary vector and Z is the normalizing constant. This distribution consists of\nnon-interacting terms, with the exception of the \u03c8(\u00b7) potential, which couples all of the variables together.\nThis is a cardinality potential (or \u201ccounts potential\u201d), because it depends only on the number of 1\u2019s\nin the vector x, but not on their identity. This distribution is useful for imposing sparsity because it\nallows us to represent the constraint that the vector x can have at most k elements set to one.\nThere is an efficient exact inference algorithm for computing the normalizing constant and marginals\nof this distribution. This can be interpreted as a dynamic programming algorithm [13, 17], or as an\ninstance of the sum-product algorithm [18]. We prefer the sum-product interpretation because it\nmakes clear how to compute marginal distributions over binary variables, how to compute marginal\ndistributions over total counts, and how to draw an exact joint sample from the model (pass messages\nforwards, then sample backwards) and also lends itself towards extensions. In this view, we create N\nauxiliary variables z_j \u2208 {1, . . . , N}. 
The auxiliary variables are then deterministically related to\nthe x variables by setting z_j = \u2211_{k=1}^{j} x_k, where z_j represents the cumulative sum of the first j\nbinary variables.\nMore formally, consider the following joint distribution q\u0302(x, z):\n\n\\[ \\hat{q}(x, z) = \\prod_{j=1}^{N} \\phi_j(x_j) \\cdot \\prod_{j=2}^{N} \\gamma(x_j, z_j, z_{j-1}) \\cdot \\psi(z_N). \\tag{6} \\]\n\nWe let \u03b3(x_j, z_j, z_{j\u22121}) be a deterministic \u201caddition potential\u201d, which assigns the value one to any\ntriplet (x, z, z\u2032) satisfying z = x + z\u2032 and zero otherwise. Note that the second product ranges\nfrom j = 2, and that z_1 is replaced with x_1. This notation represents the observation that z_j can\nbe computed either as z_j = \u2211_{k=1}^{j} x_k, or more simply as z_j = z_{j\u22121} + x_j. The latter is\npreferable, because it induces a chain-structured dependency graph amongst the z and x variables. Thus,\nthe distribution q\u0302(x, z) has two important properties. First, it is chain-structured, and therefore\nwe can perform exact inference using the sum-product algorithm. By leveraging the fact that at\nmost k variables are allowed to be on, the runtime can be made to be O(N k) by reducing the range of\neach z_j from {1, . . . , N} to {1, . . . , k + 1}. Second, the posterior q\u0302(z|x) assigns a probability of\n1 to the configuration z\u2217 that is given by z\u2217_j = \u2211_{k=1}^{j} x_k for all j. This is a direct consequence of\nthe sum-potentials \u03b3(\u00b7) enforcing the constraint z\u2217_j = x_j + z\u2217_{j\u22121}. Since z\u2217_N = \u2211_{j=1}^{N} x_j, it follows\nthat q(x) = q\u0302(x, z\u2217), and since q\u0302(z|x) concentrates all of its mass on z\u2217, we obtain:\n\n\\[ \\hat{q}(x) = \\sum_{z} \\hat{q}(x, z) = \\sum_{z} \\hat{q}(z \\mid x) \\, \\hat{q}(x) = \\hat{q}(x, z^*) = q(x). \\tag{7} \\]\n\nThis shows that q(x) is the marginal distribution of the chain-structured distribution q\u0302(x, z). 
By\nrunning the sum-product algorithm on q\u0302 we can recover the singleton marginals \u00b5_j(x_j), which\nare also the marginals of q(\u00b7). We can likewise sample from q by computing all of the\npairwise marginals \u00b5_{j+1,j}(z_{j+1}, z_j), computing the pairwise conditionals \u00b5_{j+1,j}(z_{j+1}|z_j), and\nsampling each z_j sequentially, given z_{j\u22121}, to obtain a sample z. The vector x can be recovered\nvia x_j = z_j \u2212 z_{j\u22121}. The basic idea behind this algorithm is given in [13] and the sum-product\ninterpretation is elaborated upon in [18].\nThere are many algorithmic extensions, such as performing summations in tree-structured\ndistributions, which allow for more efficient inference with very large N (e.g. N > 1000) using fast Fourier\ntransforms [19, 18]. But in this work we only use the chain-structured distribution q\u0302 described above\nwith the restriction that there are only k states.\n\n4 The Cardinality RBM (CaRBM)\n\nThe Cardinality Restricted Boltzmann Machine is defined as follows:\n\n\\[ P(v, h) = \\frac{1}{Z} \\exp\\left( v^\\top W h + v^\\top b_v + h^\\top b_h \\right) \\cdot \\psi_k\\left( \\sum_{j=1}^{N_h} h_j \\right), \\tag{8} \\]\n\nwhere \u03c8_k is a potential given by \u03c8_k(c) = 1 if c \u2264 k and 0 otherwise. Observe that the conditional\ndistribution P(h|v) assigns a non-zero probability mass to a vector h only if |h| \u2264 k. The\ncardinality potential implements competition in the hidden layer because now, a data vector v can be\nexplained by at most k hidden units. This form of competition is similar to sparse coding in that there\nmay be many non-sparse configurations that assign high probability to the data; however, only sparse\nconfigurations are allowed to be used. Unlike sparse coding, however, the CaRBM learning\nproblem involves maximizing the likelihood of the training data, rather than minimizing a reconstruction\ncost. 
Using the techniques from the previous section, computing the conditional distribution P(h|v)\nis tractable, allowing us to use learning algorithms like CD or stochastic maximum likelihood [20].\nThe conditional distribution P(v|h) is still factorial and easy to sample from.\nPerhaps the best way to view the effect of the cardinality potential is to consider the case of k = 1\nwith the further restriction that configurations with 0 active hidden units are disallowed. In this case,\nthe CaRBM reduces to an ordinary RBM with a single multinomial hidden unit. A similar model to\nthe CaRBM is the Boltzmann Perceptron [21], which also introduces a term in the energy function\nto promote competition between units; however, they do not provide a way to efficiently compute\nmarginals or draw joint samples from P(h|v). Another similar line of work is the Restricted\nBoltzmann Forest [22], which uses k groups of multinomial hidden units.\nWe should note that the actual marginal probabilities of the hidden units given the visible units are\nnot guaranteed to be sparse, but rather the distribution assigns zero mass to any hidden configuration\nthat is not sparse. In practice though, we find that after learning, the marginal probabilities do tend\nto have low entropy. Understanding this as a form of regularization is a topic left for future work.\n\n4.1 The Cardinality Marginal Nonlinearity\n\nOne of the most common ways to use an RBM is to consider it as a pre-training method for a\ndeep belief network [2]. After one or several RBMs are trained in a greedy layer-wise fashion,\nthe network is converted into a deterministic feed-forward neural network that is fine-tuned with\nthe backpropagation algorithm. The fine-tuning step is important for getting the best results with a\nDBN model [23]. 
While it is easy to convert a stack of standard RBMs into a feed-forward neural\nnetwork, turning a stack of CaRBMs into a feed-forward neural network is less obvious, because it\nis not clear what nonlinearity should be used.\nObserve that in the case of a standard, binary-binary RBM, the selected nonlinearity is the\nsigmoid \u03c3(x) \u2261 1/(1 + exp(\u2212x)). We can justify this choice by noticing that it is the expectation of the\nconditional distribution P(h|v), namely\n\n\\[ \\sigma(W^\\top v + b_h) = E_{P(h \\mid v)}[h], \\tag{9} \\]\n\nwhere the sigmoid is applied to the vector in an element-wise fashion. In particular, using the\nconditional expectation as the nonlinearity is a fundamental ingredient in the variational lower bound\nthat justifies the greedy layer-wise procedure [2]. It also appears naturally when the score matching\nestimator is applied to RBMs over Gaussian-distributed visible units [24, 25]. This justification\nsuggests that for the CaRBM, we should choose a nonlinearity \u00b5(\u00b7) which will satisfy the following\nequality:\n\n\\[ \\mu(W^\\top v + b_h) = E_{P(h \\mid v)}[h], \\tag{10} \\]\n\nwhere the conditional P(h|v) can be derived from Eq. 8. First note that such a nonlinear function\nexists, because the distribution P(h|v) is completely determined by the total input W\u22a4v + b_h.\nTherefore, the feed-forward neural network that is obtained from a stack of CaRBMs uses a\nmessage-passing algorithm to compute the nonlinearity \u00b5(\u00b7). We should note that \u00b5 depends on k, the number\nof units that can take on the value 1, but this is a constant that is independent of the input. In practice,\nwe keep k fixed to the k that was used in unsupervised training.\nTo compute gradients for learning the network, it is necessary to \u201cbackpropagate\u201d through \u00b5, which\nis equivalent to multiplying by the Jacobian of \u00b5. Analytic computation of the Jacobian, however,\nresults in an overly expensive O(N^2) algorithm. 
We also note that it is possible to manually\ndifferentiate the computational graph of \u00b5 by passing the derivatives back through the sum-product\nalgorithm. While this approach is correct, it is difficult to implement and can be numerically\nunstable.\nWe propose an alternative approach to multiplying by the Jacobian of \u00b5. Let x = W\u22a4v + b_h be the\ntotal input to the RBM\u2019s hidden units; then the Jacobian J(x) is given by:\n\n\\[ J(x) = E_{P(h \\mid v)}[h h^\\top] - E_{P(h \\mid v)}[h] \\, E_{P(h \\mid v)}[h^\\top] = E_{P(h \\mid v)}[h h^\\top] - \\mu(x) \\mu(x)^\\top. \\tag{11} \\]\n\nWe need to multiply by the transpose of the Jacobian from the right, since by the chain rule,\n\n\\[ \\frac{\\partial L}{\\partial x} = \\left( \\frac{\\partial \\mu}{\\partial x} \\right)^{\\top} \\frac{\\partial L}{\\partial \\mu} = J(x)^\\top \\frac{\\partial L}{\\partial \\mu}, \\tag{12} \\]\n\nwhere L is the corresponding loss function. One way to do this is to reuse the sample h \u223c P(h|v)\nin order to obtain a rank-one unbiased estimate of E_{P(h|v)}[hh\u22a4], but we found this to be\ninaccurate. Luckily, Domke [14] makes two critical observations. First, the Jacobian J(x) is symmetric\n(see Eq. 11). Second, it is easy to multiply by the Jacobian of any function using numerical\ndifferentiation, because multiplication by the Jacobian (without a transpose) is precisely a directional\nderivative.\nMore formally, let f(x) be any differentiable function and J be its Jacobian. 
For any vector \u2113, it can\nbe easily verified that:\n\n\\[ \\lim_{\\epsilon \\to 0} \\frac{f(x + \\epsilon \\ell) - f(x)}{\\epsilon} = \\lim_{\\epsilon \\to 0} \\frac{f(x) + \\epsilon J \\ell + o(\\epsilon) - f(x)}{\\epsilon} = \\lim_{\\epsilon \\to 0} \\left( \\frac{o(\\epsilon)}{\\epsilon} + \\frac{\\epsilon J \\ell}{\\epsilon} \\right) = J \\ell. \\tag{13} \\]\n\nSince \u00b5 is a differentiable function, we can compute J(x)\u2113 by a finite difference formula:\n\n\\[ J(x) \\ell \\approx \\frac{\\mu(x + \\epsilon \\ell) - \\mu(x - \\epsilon \\ell)}{2 \\epsilon}. \\tag{14} \\]\n\nUsing the symmetry of the Jacobian of \u00b5, we can backpropagate a vector of derivatives \u2202L/\u2202\u00b5 using\nEq. 14. Of the approaches we tried, we found this approach to provide the best combination of speed\nand accuracy.\n\n5 Experiments\n\nThe majority of our experiments were carried out on various binary datasets from Larochelle et\nal. [26], henceforth referred to as the Montreal datasets. Each model was trained using the CD-1\nalgorithm with stochastic gradient descent on mini-batches. For training the SpRBM, we followed the\nguidelines from Hinton [27].\n\n5.1 Training CaRBMs\n\nOne issue when training a model with lateral inhibition is that in the initial learning epochs, a small\ngroup of hidden units can learn global features of the data and effectively suppress the other hidden\nunits, leading to \u201cdead units\u201d. This effect has been noted before in energy-based models with\ncompetition [22]. One option is to augment the log-likelihood with the KL penalty given in Eq. 4. In the\nSpRBM, this penalty term is used to encourage each hidden unit to be active a small number of times\nacross the training set, which indirectly provides sparsity per-example. 
In the CaRBM it is used to\nensure that each hidden unit is used roughly equally across the training set, while the per-example\nsparsity is directly controlled.\nWe observed that dead units occurred only with a random initialization of the parameters and that\nthis was no longer an issue once the weights had been properly initialized. In our experiments, we\nused the KL penalty during unsupervised learning, but not during supervised fine-tuning.\nA related issue with SpRBMs is that if the KL penalty is set too high then it can create dead examples\n(examples that activate no hidden units). Note that the KL penalty will not penalize this case as long\nas the inter-example activations match the target probability \u03c1.\n\n5.2 Comparing CaRBM with SpRBM\n\nBoth the CaRBM and SpRBM models attempt to achieve the same goal of sparsity in the hidden\nunit activations. However, the way in which they accomplish this is fundamentally different.\nFor datasets such as MNIST, we found the two models to give qualitatively similar results. Indeed,\nthis seemed to be the case for several datasets. On the convex dataset, however, we noticed that the\nmodels produced quite different results. The convex dataset consists of binary 28 \u00d7 28-pixel images\nof polygons (sometimes with multiple polygons per image). Figure 1 (a) shows several examples\nfrom this dataset. Unlike the MNIST dataset, there is a large variation in the number of active pixels\nin the inputs. Figure 1 (e) shows the distribution of the number of pixels taking the value 1. In some\nexamples, barely any pixels are active, while in others virtually every pixel is on.\nFor both models, we set the target sparsity to 10%. We next performed a grid search over the\nstrength of the KL penalty until we found a setting that achieved an average hidden unit population\nsparsity that matched the target without creating dead examples (in the case of the SpRBM) or dead\nunits (in the case of the CaRBM). 
Figure 1 (d) and (h) show that both models achieve the desired\nmean population sparsity. However, the SpRBM exhibits a heavy-tailed distribution over activations,\nwith some examples activating over half of the hidden units. By comparison, all inputs activate the\nmaximum number of allowable hidden units in the CaRBM, generating a spike at 10%. Indeed, in\nthe CaRBM, the hidden units suppress each other through competition, while in the SpRBM there\nis no such direct competition. Figure 1 (b) and (f) display the learned weights. Both models appear\nto give qualitatively similar results, although the CaRBM weights appear to model slightly more\nlocalized features at this level of sparsity.\n\n5.3 Classification Performance\n\nTo evaluate the classification performance of CaRBMs, we performed a set of experiments on the\nMontreal datasets. We conducted a random search over hyperparameter settings as recommended\nby Bergstra & Bengio [28], and set the target sparsity to be between 2.5% and 10%. Table 1 shows\nthat the CaRBM and SpRBM achieve comparable performance. On this suite we found that the\nvalidation sets were quite small and prone to overfitting. For example, both the SpRBM and CaRBM\nachieve 0.5% validation error on the rectangles dataset. Interestingly, for the convex dataset, the\nSpRBM model, chosen by cross-validation, used a weak penalty strength and only achieved a\npopulation sparsity of 25%. As we increased the strength of the sparsity penalty, classification\nperformance in the SpRBM degraded, but the desired sparsity level was still not achieved.\n\n5.4 CIFAR-10 Patches\n\nWe extracted 16 \u00d7 16 whitened image patches from the CIFAR-10 dataset [29] and trained both\nmodels. 
Figure 2 (a) shows learned filters of the CaRBM model (both models behave similarly\nand so we just display the CaRBM weights). Observe that the learned weights resemble Gabor-like\nfilters. These features are often considered to be beneficial for classification when modeling images.\n\nFigure 1: (a),(e) Samples from the Convex dataset and the distribution of the number of pixels in\neach image with the value 1. (b),(f) Visualization of the incoming weights to 25 randomly selected\nhidden units in the SpRBM and CaRBM models respectively. (c),(g) The distribution of the mean\nlifetime activations (across examples) of the hidden units in the SpRBM and CaRBM respectively.\n(d),(h) The distribution of the mean population activations (within examples) of the hidden units in\nthe SpRBM and CaRBM respectively.\n\nDataset             RBM      SpRBM    CaRBM\nrectangles          4.05%    2.66%    5.60%\nbackground im       23.78%   23.49%   22.16%\nbackground im rot   58.21%   56.48%   56.39%\nrectangles im       24.24%   22.50%   22.56%\nconvex              20.66%   18.52%   21.13%\nmnist basic         4.42%    3.65%    3.84%\nmnist rot           14.83%   13.11%   12.40%\nbackground rand     12.96%   12.97%   12.67%\n\nTable 1: Test-set classification errors on the Montreal datasets.\n\n5.5 Topic Modeling with the NIPS Dataset\n\nOne form of data with highly variable inputs is text, because some words are used much more\nfrequently than others. We applied the SpRBM and CaRBM to the NIPS dataset\u00b2, which consists\nof 13649 words and 1740 papers from NIPS conferences from 1987 to 1999. Each row corresponds\nto a paper, each column corresponds to a word, and the entries are the number of times each word\nappears in each paper. 
We binarized the dataset by truncating the word counts and trained the SpRBM\nand CaRBM models with 50 hidden units, searching over learning rates and KL penalty strengths\nuntil 10% sparsity was achieved without dead units or examples. Once a model is learned, we define\na topic for a hidden unit by considering the 5 words with the highest connections to that unit. We\nconjecture that sparse RBMs should be beneficial in learning interpretable topics because there will\nbe fewer ways for hidden units to collude in order to model a given input.\nTable 2 shows the result of picking a general topic and finding the closest matching hidden unit from\neach model. While all models discover meaningful topics, we found that the grouping of words\nproduced by the RBM tended to be less cohesive than those produced by the SpRBM or CaRBM.\nFor example, many of the hidden units contain the words \u2018abstract\u2019 and \u2018reference\u2019, both of which\nappear in nearly every paper.\nFigure 2 (b)-(d) display the effect that the KL penalty \u03bb has on the population sparsity of the\nSpRBM. For a fairly narrow range, if \u03bb is too small then the desired sparsity level will not be met.\n\n\u00b2http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm\n\nModel   Computer Vision                              Neuroscience                                          Bayesian Inference\nRBM     images, pixel, computer, quickly, stanford   inhibitory, organization, neurons, synaptic, explain  probability, bayesian, priors, likelihood, covariance\nSpRBM   visual, object, objects, images, vision      neurons, biology, spike, synaptic, realistic          conditional, probability, bayesian, hidden, mackay\nCaRBM   image, images, pixels, objects, recognition  membrane, resting, inhibitory, physiol, excitatory    likelihood, hyperparameters, monte, variational, neal\n\nTable 2: Topics learned by each model on the NIPS dataset. 
Each column corresponds to a chosen\ntopic, and each cell corresponds to a single hidden unit. The hidden unit is chosen as the best match\nto the given topic from amongst all of the hidden units learned by the model in the row.\n\nFigure 2: (a) Weights of the CaRBM learned on 16 \u00d7 16 image patches sampled from the CIFAR-10\ndataset. (b)-(d) Change in population sparsity with increasing KL penalty \u03bb on the NIPS dataset,\nfor \u03bb = 0.1, \u03bb = 0.5, and \u03bb = 1 respectively.\nThe SpRBM is sensitive to \u03bb, and can fail to model certain examples if \u03bb is set too high.\n\nAs \u03bb is increased, the lifetime sparsity better matches the target but at the cost of an increasing\nnumber of dead examples. This may hurt the generative performance of the SpRBM.\n\n6 Conclusion\n\nWe have introduced cardinality potentials into the energy function of a Restricted Boltzmann\nMachine in order to enforce sparsity in the hidden representation. We showed how to use an auxiliary\nvariable representation in order to perform efficient posterior inference and sampling. Furthermore,\nwe showed how the marginal probabilities can be treated as nonlinearities, and how a simple\nfinite-difference trick from Domke [14] can be used to backpropagate through the network. We found\nthat the CaRBM performs similarly to an RBM that has been trained with a sparsity-encouraging\nregularizer, with the exception being datasets that exhibit a wide range of variability in the number\nof active inputs (e.g. text), where the SpRBM seems to have difficulty matching the target sparsity.\nIt is possible that this effect may be significant in other kinds of data, such as images with high\namounts of lighting variation.\nThere are a number of possible extensions to the CaRBM. For example, the cardinality potentials\ncan be relaxed to encourage sparsity, but not enforce it, and they can be learned along with the other\nmodel parameters. 
It would also be interesting to see if other high order potentials could be used\nwithin the RBM framework. Finally, it would be worth exploring the use of the sparse marginal\nnonlinearity in auto-encoder architectures and in the deeper layers of a deep belief network.\n\nReferences\n\n[1] P. Smolensky. Information processing in dynamical systems: foundations of harmony theory. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, pages 194\u2013281. MIT Press, 1986.\n\n[2] G.E. Hinton, S. Osindero, and Y.W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527\u20131554, 2006.\n\n[3] H. Lee, R. Grosse, R. Ranganath, and A.Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In International Conference on Machine Learning, 2009.\n\n[4] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems, 2007.\n\n[5] J. Snoek, R. P. Adams, and H. Larochelle. Nonparametric guidance of autoencoder representations using label information. Journal of Machine Learning Research, 13:2567\u20132588, 2012.\n\n[6] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In Computer Vision and Pattern Recognition, 2009.\n\n[7] I. Goodfellow, Q. Le, A. Saxe, H. Lee, and A.Y. Ng. Measuring invariances in deep networks. Advances in Neural Information Processing Systems, 2009.\n\n[8] B.A. Olshausen and D.J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311\u20133325, 1997.\n\n[9] I. Goodfellow, A. Courville, and Y. Bengio. Large-scale feature learning with spike-and-slab sparse coding. International Conference on Machine Learning, 2012.\n\n[10] H. Lee, C. Ekanadham, and A. Ng. 
Sparse deep belief net model for visual area V2. Advances in Neural\nInformation Processing Systems, 2007.\n\n[11] R. Gupta, A. Diwan, and S. Sarawagi. Efficient inference with cardinality-based clique potentials. In\nInternational Conference on Machine Learning, 2007.\n\n[12] D. Tarlow, I. Givoni, and R. Zemel. HOP-MAP: Efficient message passing for high order potentials. In\nArtificial Intelligence and Statistics, 2010.\n\n[13] M. H. Gail, J. H. Lubin, and L. V. Rubinstein. Likelihood calculations for matched case-control studies\nand survival studies with tied death times. Biometrika, 68:703\u2013707, 1981.\n\n[14] J. Domke. Implicit differentiation by perturbation. Advances in Neural Information Processing Systems,\n2010.\n\n[15] G.E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation,\n14(8):1771\u20131800, 2002.\n\n[16] V. Nair and G.E. Hinton. 3D object recognition with deep belief nets. Advances in Neural Information\nProcessing Systems, 2009.\n\n[17] R. E. Barlow and K. D. Heidtmann. Computing k-out-of-n system reliability. IEEE Transactions on\nReliability, 33:322\u2013323, 1984.\n\n[18] D. Tarlow, K. Swersky, R. Zemel, R.P. Adams, and B. Frey. Fast exact inference for recursive cardinality\nmodels. In Uncertainty in Artificial Intelligence, 2012.\n\n[19] L. Belfore. An O(n) log2(n) algorithm for computing the reliability of k-out-of-n:G and k-to-l-out-of-n:G\nsystems. IEEE Transactions on Reliability, 44(1), 1995.\n\n[20] T. Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In\nInternational Conference on Machine Learning, 2008.\n\n[21] H.J. Kappen. Deterministic learning rules for Boltzmann machines. Neural Networks, 8(4):537\u2013548,\n1995.\n\n[22] H. Larochelle, Y. Bengio, and J. Turian. Tractable multivariate binary density estimation and the restricted\nBoltzmann forest. 
Neural Computation, 22(9):2285\u20132307, 2010.\n\n[23] G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science,\n313(5786):504\u2013507, 2006.\n\n[24] K. Swersky, M. Ranzato, D. Buchman, B.M. Marlin, and N. de Freitas. On autoencoders and score\nmatching for energy based models. In International Conference on Machine Learning, 2011.\n\n[25] P. Vincent. A connection between score matching and denoising autoencoders. Neural Computation,\n23(7):1661\u20131674, 2011.\n\n[26] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures\non problems with many factors of variation. In International Conference on Machine Learning,\n2007.\n\n[27] G.E. Hinton. A practical guide to training restricted Boltzmann machines. Technical Report UTML-TR\n2010003, Department of Computer Science, University of Toronto, 2010.\n\n[28] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. The Journal of Machine\nLearning Research, 13:281\u2013305, 2012.\n\n[29] A. Krizhevsky. Learning multiple layers of features from tiny images. Master\u2019s thesis, University of\nToronto, 2009.\n", "award": [], "sourceid": 1531, "authors": [{"given_name": "Kevin", "family_name": "Swersky", "institution": null}, {"given_name": "Ilya", "family_name": "Sutskever", "institution": null}, {"given_name": "Daniel", "family_name": "Tarlow", "institution": null}, {"given_name": "Richard", "family_name": "Zemel", "institution": null}, {"given_name": "Russ", "family_name": "Salakhutdinov", "institution": null}, {"given_name": "Ryan", "family_name": "Adams", "institution": null}]}