{"title": "Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs", "book": "Advances in Neural Information Processing Systems", "page_first": 1340, "page_last": 1348, "abstract": "Sparse high-dimensional data vectors are common in many application domains where a very large number of rarely non-zero features can be devised. Unfortunately, this creates a computational bottleneck for unsupervised feature learning algorithms such as those based on auto-encoders and RBMs, because they involve a reconstruction step where the whole input vector is predicted from the current feature values. An algorithm was recently developed to successfully handle the case of auto-encoders, based on an importance sampling scheme stochastically selecting which input elements to actually reconstruct during training for each particular example. To generalize this idea to RBMs, we propose a stochastic ratio-matching algorithm that inherits all the computational advantages and unbiasedness of the importance sampling scheme. We show that stochastic ratio matching is a good estimator, allowing the approach to beat the state-of-the-art on two bag-of-words text classification benchmarks (20 Newsgroups and RCV1), while keeping computational cost linear in the number of non-zeros.", "full_text": "Stochastic Ratio Matching of RBMs for Sparse\n\nHigh-Dimensional Inputs\n\nYann N. Dauphin, Yoshua Bengio\n\nDépartement d’informatique et de recherche opérationnelle\n\nUniversité de Montréal\nMontréal, QC H3C 3J7\n\ndauphiya@iro.umontreal.ca,\nYoshua.Bengio@umontreal.ca\n\nAbstract\n\nSparse high-dimensional data vectors are common in many application domains\nwhere a very large number of rarely non-zero features can be devised. 
Unfortunately, this creates a computational bottleneck for unsupervised feature learning algorithms such as those based on auto-encoders and RBMs, because they involve a reconstruction step where the whole input vector is predicted from the current feature values. An algorithm was recently developed to successfully handle the case of auto-encoders, based on an importance sampling scheme stochastically selecting which input elements to actually reconstruct during training for each particular example. To generalize this idea to RBMs, we propose a stochastic ratio-matching algorithm that inherits all the computational advantages and unbiasedness of the importance sampling scheme. We show that stochastic ratio matching is a good estimator, allowing the approach to beat the state-of-the-art on two bag-of-words text classification benchmarks (20 Newsgroups and RCV1), while keeping computational cost linear in the number of non-zeros.\n\n1 Introduction\n\nUnsupervised feature learning algorithms have recently attracted much attention, with the promise of letting the data guide the discovery of good representations. In particular, unsupervised feature learning is an important component of many Deep Learning algorithms (Bengio, 2009), such as those based on auto-encoders (Bengio et al., 2007) and Restricted Boltzmann Machines or RBMs (Hinton et al., 2006). 
Deep Learning of representations involves the discovery of several levels of representation, with some algorithms able to exploit unlabeled examples and unsupervised or semi-supervised learning.\nWhereas Deep Learning has mostly been applied to computer vision and speech recognition, an important set of application areas involves high-dimensional sparse input vectors, for example in some Natural Language Processing tasks (such as the text categorization tasks tackled here), as well as in information retrieval and other web-related applications where a very large number of rarely non-zero features can be devised. We would like learning algorithms whose computational requirements grow with the number of non-zeros in the input but not with the total number of features. Unfortunately, auto-encoders and RBMs are computationally inconvenient when it comes to handling such high-dimensional sparse input vectors, because they require a form of reconstruction of the input vector, for all the elements of the input vector, even the ones that were zero.\nIn Section 2, we recapitulate the Reconstruction Sampling algorithm (Dauphin et al., 2011) that was proposed to handle that problem in the case of auto-encoder variants. The basic idea is to use an importance sampling scheme to stochastically select a subset of the input elements to reconstruct, and importance weights to obtain an unbiased estimator of the reconstruction error gradient.\nIn this paper, we are interested in extending these ideas to the realm of RBMs. In Section 3 we briefly review the basics of RBMs and the Gibbs chain involved in training them. Ratio matching (Hyvärinen, 2007) is an inductive principle and training criterion that can be applied to train RBMs but does not require a Gibbs chain. 
In Section 4, we present and justify a novel algorithm based on ratio matching in order to achieve our objective of taking advantage of highly sparse inputs. The new algorithm is called Stochastic Ratio Matching or SRM. In Section 6 we present a wide array of experimental results demonstrating the successful application of Stochastic Ratio Matching, both in terms of computational performance (flat growth of computation as the number of non-zeros is increased, linear speedup with respect to regular training) and in terms of generalization performance: the state-of-the-art on two text classification benchmarks is achieved and surpassed. An interesting and unexpected result is that we find the biased version of the algorithm (without reweighting) to yield more discriminative features.\n\n2 Reconstruction Sampling\n\nAn auto-encoder learns an encoder function f mapping inputs x to features h = f(x), and a decoding or reconstruction function g such that g(f(x)) ≈ x for training examples x. See Bengio et al. (2012) for a review. In particular, with the denoising auto-encoder, x is stochastically corrupted into x̃ (e.g. by flipping some bits) and the model is trained to make g(f(x̃)) ≈ x. To avoid the expensive reconstruction g(h) when the input is very high-dimensional, Dauphin et al. (2011) propose that for each example, a small random subset of the input elements be selected, for which g_i(h) and the associated reconstruction error are computed. To make the corresponding estimator of reconstruction error (and its gradient) unbiased, they propose to use an importance weighting scheme whereby the loss on the i-th input is weighted by the inverse of the probability that it be selected. To reduce the variance of the estimator, they propose to always reconstruct the i-th input if it was one of the non-zeros in x or in x̃, and to choose uniformly at random an equal number of zero elements. 
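This selection scheme can be sketched in a few lines of NumPy (a minimal illustration with hypothetical helper names, not the authors' implementation):

```python
import numpy as np

def sample_reconstruction_indices(x, x_tilde, rng):
    """Pick which input dimensions to reconstruct for one example.

    Always keep the non-zeros of the clean and corrupted inputs, plus an
    equal number of uniformly chosen zero dimensions.  Returns the selected
    indices and their importance weights (inverse selection probabilities),
    which make the subsampled reconstruction loss unbiased.
    """
    nonzero = np.flatnonzero((x != 0) | (x_tilde != 0))
    zero = np.flatnonzero((x == 0) & (x_tilde == 0))
    k = min(len(nonzero), len(zero))
    sampled_zeros = rng.choice(zero, size=k, replace=False)
    idx = np.concatenate([nonzero, sampled_zeros])
    # Selection probability is 1 for non-zeros and k/len(zero) for zeros,
    # so the importance weight is 1 for the former, len(zero)/k for the latter.
    weights = np.ones(len(idx))
    weights[len(nonzero):] = len(zero) / k
    return idx, weights

rng = np.random.default_rng(0)
x = np.zeros(1000); x[[3, 17, 42]] = 1.0
x_tilde = x.copy(); x_tilde[3] = 0.0        # corruption flipped one bit off
idx, w = sample_reconstruction_indices(x, x_tilde, rng)
```

Only 6 of the 1,000 dimensions are reconstructed here, yet the reweighted loss has the same expectation as the full reconstruction loss.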
They show that the unbiased estimator yields the expected linear speedup in training time compared to the deterministic gradient computation, while maintaining good performance for unsupervised feature learning. We would like to extend similar ideas to RBMs.\n\n3 Restricted Boltzmann Machines\n\nA restricted Boltzmann machine (RBM) is an undirected graphical model with binary variables (Hinton et al., 2006): observed variables x and hidden variables h. In this model, the hidden variables help uncover higher order correlations in the data.\nThe energy takes the form\n\n−E(x, h) = h^T W x + b^T h + c^T x\n\nwith parameters θ = (W, b, c).\nThe RBM can be trained by following the gradient of the negative log-likelihood\n\n−∂ log P(x)/∂θ = E_data[∂F(x)/∂θ] − E_model[∂F(x)/∂θ]\n\nwhere F(x) is the free energy (the unnormalized log-probability associated with P(x)). However, this gradient is intractable because the second expectation is combinatorial. Stochastic Maximum Likelihood or SML (Younes, 1999; Tieleman, 2008) estimates this expectation using sample averages taken from a persistent MCMC chain (Tieleman, 2008). Starting from x_i, a step in this chain is taken by sampling h_i ∼ P(h|x_i); then we have x_{i+1} ∼ P(x|h_i). SML-k is the variant where k is the number of steps between parameter updates, with SML-1 being the simplest and most common choice, although better results (at greater computational expense) can be achieved with more steps.\nTraining the RBM using SML-1 is on the order of O(dn) where d is the dimension of the input variables and n is the number of hidden variables. In the case of high-dimensional sparse vectors with p non-zeros, SML does not take advantage of the sparsity. 
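The free energy and a single SML-1 update can be sketched as follows (a minimal NumPy illustration under assumed shapes, with W of size n × d, b the hidden biases and c the visible biases; the function names are ours):

```python
import numpy as np

def free_energy(x, W, b, c):
    """F(x) = -c^T x - sum_j log(1 + exp(b_j + W_j . x)) for a binary RBM."""
    beta = W @ x + b
    return -(c @ x) - np.sum(np.logaddexp(0.0, beta))

def sml1_step(x_data, x_chain, W, b, c, rng):
    """One SML-1 update sketch: a single Gibbs step on the persistent chain.

    Returns the new chain state and a stochastic estimate of the
    log-likelihood gradient for W (positive phase minus negative phase);
    the b and c gradients are analogous.
    """
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    h_chain = (rng.random(b.shape) < sigmoid(W @ x_chain + b)).astype(float)
    x_new = (rng.random(c.shape) < sigmoid(W.T @ h_chain + c)).astype(float)
    h_data = sigmoid(W @ x_data + b)          # mean-field positive phase
    h_model = sigmoid(W @ x_new + b)
    grad_W = np.outer(h_data, x_data) - np.outer(h_model, x_new)
    return x_new, grad_W

rng = np.random.default_rng(0)
n, d = 4, 6                                   # tiny sizes for illustration
W = rng.normal(scale=0.1, size=(n, d))
b, c = np.zeros(n), np.zeros(d)
x = np.zeros(d); x[0] = 1.0
x_chain, grad_W = sml1_step(x, x.copy(), W, b, c, rng)
```

Note that the `np.outer(h_model, x_new)` term touches all d visible units, which is exactly the O(dn) reconstruction cost that does not shrink with sparsity.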
More precisely, sampling P(h|x) (inference) can take advantage of sparsity and costs O(pn) computations, while “reconstruction”, i.e., sampling from P(x|h), requires O(dn) computations. Thus scaling to larger input sizes d yields a linear increase in training time even if the number of non-zeros p in the input remains constant.\n\n4 Ratio Matching\n\nRatio matching (Hyvärinen, 2007) is an estimation method for statistical models where the normalization constant is not known. It is similar to score matching (Hyvärinen, 2005) but applies to discrete data whereas score matching is limited to continuous inputs; both are computationally simple and yield consistent estimators. The use of Ratio Matching in RBMs is of particular interest because their normalization constant is computationally intractable.\nThe core idea of ratio matching is to match ratios of probabilities between the data and the model. Thus Hyvärinen (2007) proposes to minimize the following objective function\n\nJ(x) = Σ_{i=1}^d [g(Px(x)/Px(x̄_i)) − g(P(x)/P(x̄_i))]^2 + [g(Px(x̄_i)/Px(x)) − g(P(x̄_i)/P(x))]^2   (1)\n\nwhere Px is the true probability distribution, P is the distribution defined by the model, g(u) = 1/(1 + u) is an activation function and x̄_i = (x_1, x_2, . . . , 1 − x_i, . . . , x_d). In this form, we can see the similarity between score matching and ratio matching. 
The normalization constant is canceled because P(x)/P(x̄_i) = e^{−F(x)}/e^{−F(x̄_i)}; however, this objective requires access to the true distribution Px, which is rarely available.\nHyvärinen (2007) shows that the Ratio Matching (RM) objective can be simplified to\n\nJ_RM(x) = Σ_{i=1}^d g^2(P(x)/P(x̄_i))   (2)\n\nwhich does not require knowledge of the true distribution Px. This objective can be described as ensuring that the training example x has the highest probability in the neighborhood of points at Hamming distance 1.\nWe propose to rewrite Eq. 2 in a form reminiscent of auto-encoders:\n\nJ_RM(x) = Σ_{i=1}^d (x_i − P(x_i = 1|x_{−i}))^2.   (3)\n\nThis will be useful for reasoning about this estimator. The main difference with auto-encoders is that each input variable is predicted by excluding it from the input.\nApplying Equation 2 to the RBM we obtain J_RM(x) = Σ_{i=1}^d (σ(F(x) − F(x̄_i)))^2. The gradients have the familiar form\n\n−∂J_RM(x)/∂θ = Σ_{i=1}^d 2η_i [∂F(x̄_i)/∂θ − ∂F(x)/∂θ]   (4)\n\nwith η_i = (σ(F(x) − F(x̄_i)))^2 − (σ(F(x) − F(x̄_i)))^3.\nA naive implementation of this objective is O(d^2 n) because it requires d computations of the free energy per example. This is much more expensive than SML as noted by Marlin et al. (2010). Thankfully, as we argue here, it is possible to greatly reduce this complexity by reusing computation and taking advantage of the parametrization of RBMs. This can be done by saving the results of the computations α = c^T x and β_j = Σ_i W_{ji} x_i + b_j when computing F(x). The computation of F(x̄_i) can then be reduced to O(n) with the formula −F(x̄_i) = α − (2x_i − 1)c_i + Σ_j log(1 + e^{β_j − (2x_i − 1)W_{ji}}).\nThis implementation is O(dn), which is the same complexity as SML. However, like SML, RM does not take advantage of sparsity in the input.\n\n5 Stochastic Ratio Matching\n\nWe propose Stochastic Ratio Matching (SRM) as a more efficient form of ratio matching for high-dimensional sparse distributions. The ratio matching objective requires the summation of d terms, each computable in O(n). The basic idea of SRM is to estimate this sum using a very small fraction of the terms, randomly chosen. If we rewrite the ratio matching objective as an expectation over a discrete distribution\n\nJ_RM(x) = d · (1/d) Σ_{i=1}^d g^2(P(x)/P(x̄_i)) = d E[g^2(P(x)/P(x̄_i))]   (5)\n\nwe can use Monte Carlo methods to estimate J_RM without computing all the terms in Equation 2. However, in practice this estimator has a high variance. Thus it is a poor estimator, especially if we want to use very few Monte Carlo samples. The solution proposed for SRM is to use an Importance Sampling scheme to obtain a lower variance estimator of J_RM. Combining Monte Carlo with importance sampling, we obtain the SRM objective\n\nJ_SRM(x) = Σ_{i=1}^d (γ_i / E[γ_i]) g^2(P(x)/P(x̄_i))   (6)\n\nwhere γ_i ∼ P(γ_i = 1|x) and P(γ_i = 1|x) is the so-called proposal distribution of our importance sampling scheme. The proposal distribution determines which terms will be used to estimate the objective since only the terms where γ_i = 1 are non-zero. 
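The reweighting by 1/E[γ_i] is what makes the estimator unbiased, which is easy to check numerically (a toy sketch with made-up per-term values standing in for the RBM quantities):

```python
import numpy as np

def srm_estimate(terms, gamma, p_select):
    """J_SRM-style estimate: sum_i (gamma_i / E[gamma_i]) * terms_i.

    `terms` stands in for the d ratio-matching terms g^2(P(x)/P(x_bar_i)),
    `gamma` is the 0/1 selection mask and `p_select` the inclusion
    probabilities E[gamma_i] used as importance weights.
    """
    return float(np.sum((gamma / p_select) * terms))

rng = np.random.default_rng(0)
d = 1000
terms = rng.random(d) * 0.01            # toy per-dimension losses
p_select = np.full(d, 0.02)             # keep each term with probability 2%
full_objective = float(terms.sum())     # what summing all d terms would give
# Average many stochastic estimates: they converge to the full objective,
# even though each one evaluates only ~20 of the 1000 terms.
estimates = [srm_estimate(terms, (rng.random(d) < p_select).astype(float),
                          p_select)
             for _ in range(20000)]
```

Each stochastic estimate touches about 2% of the terms, yet the mean of the estimates matches the full sum; dropping the 1/E[γ_i] weights would instead shrink the rarely selected terms, which is exactly the biased variant studied below.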
J_SRM(x) is an unbiased estimator of J_RM(x), i.e.,\n\nE[J_SRM(x)] = Σ_{i=1}^d (E[γ_i]/E[γ_i]) g^2(P(x)/P(x̄_i)) = J_RM(x)\n\nThe intuition behind importance sampling is that the variance of the estimator can be reduced by focusing sampling on the largest terms of the expectation. More precisely, it is possible to show that the variance of the estimator is minimized when P(γ_i = 1|x) ∝ g^2(P(x)/P(x̄_i)). Thus we would like the probability P(γ_i = 1|x) to reflect how large the error (x_i − P(x_i = 1|x_{−i}))^2 will be. The challenge is finding a good approximation for (x_i − P(x_i = 1|x_{−i}))^2 and defining a proposal distribution that is efficient to sample from.\nFollowing Dauphin et al. (2011), we propose such a distribution for high-dimensional sparse distributions. In these types of distributions the marginals Px(x_i = 1) are very small. They can easily be learned by the biases c of the model, and may even be initialized very close to their optimal value. Once the marginals are learned, the model will likely only make wrong predictions when Px(x_i = 1|x_{−i}) differs significantly from Px(x_i = 1). If x_i = 0 then the error (0 − P(x_i = 1|x_{−i}))^2 is likely small because the model has a high bias towards P(x_i = 0). Conversely, the error will be high when x_i = 1. In other words, the model will mostly make errors for terms where x_i = 1 and a small number of dimensions where x_i = 0. We can use this to define the heuristic proposal distribution\n\nP(γ_i = 1|x) = 1 if x_i = 1, and p/(d − Σ_j 1_{x_j > 0}) otherwise   (7)\n\nwhere p is the average number of non-zeros in the data. The idea is to always sample the terms where x_i = 1 and a subset of k of the (d − Σ_j 1_{x_j > 0}) remaining terms where x_i = 0. Note that if we sampled the γ_i independently, we would get E[k] = p. However, instead of sampling those γ_i bits independently, we find that much smaller variance is obtained by sampling a number of zeros k that is constant for all examples, i.e., k = p. A random k can cause very significant variance in the gradients and this makes stochastic gradient descent more difficult. In our experiments we set k = p = E[Σ_j 1_{x_j > 0}], which is a small number by definition of these sparse distributions, and guarantees that computation costs will remain constant as d increases for a fixed number of non-zeros. The computational cost of SRM per training example is O(pn), as opposed to O(dn) for RM. While simple, we find that this heuristic proposal distribution works well in practice, as shown below.\nFor comparison, we also perform experiments with a biased version of Equation 6\n\nJ_BiasedSRM(x) = Σ_{i=1}^d γ_i g^2(P(x)/P(x̄_i)).   (8)\n\nThis will allow us to gauge the effectiveness of our importance weights for unbiasing the objective. The biased objective can be thought of as down-weighting the ratios where x_i = 0 by a factor of E[γ_i].\nSRM is related to previous work (Dahl et al., 2012) on applying RBMs to high-dimensional sparse inputs, more precisely multinomial observations, e.g., one K-ary multinomial for each word in an n-gram window. A careful choice of Metropolis-Hastings transitions replaces Gibbs transitions and allows handling large vocabularies. In comparison, SRM is geared towards general sparse vectors and involves an extremely simple procedure without MCMC.\n\n6 Experimental Results\n\nIn this section, we demonstrate the effectiveness of SRM for training RBMs. 
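Concretely, the fixed-k selection scheme of Section 5 can be sketched as follows (a minimal NumPy illustration with hypothetical names, assuming a binary input vector):

```python
import numpy as np

def sample_srm_terms(x, k, rng):
    """Select which ratio-matching terms to evaluate for one example.

    All dimensions with x_i = 1 are always selected; exactly k of the zero
    dimensions are drawn without replacement, which gives lower gradient
    variance than sampling each zero independently.  Returns the selected
    indices and the inclusion probabilities E[gamma_i] used as importance
    weights.
    """
    ones = np.flatnonzero(x > 0)
    zeros = np.flatnonzero(x == 0)
    chosen_zeros = rng.choice(zeros, size=min(k, len(zeros)), replace=False)
    idx = np.concatenate([ones, chosen_zeros])
    # Inclusion probability: 1 for the non-zeros, k / (#zeros) for the zeros.
    p_inclusion = np.where(x > 0, 1.0,
                           len(chosen_zeros) / max(len(zeros), 1))
    return idx, p_inclusion[idx]

rng = np.random.default_rng(0)
x = np.zeros(50); x[[1, 7]] = 1.0
idx, p = sample_srm_terms(x, k=2, rng=rng)
```

With p non-zeros per example and k = p, only about 2p of the d free-energy ratios are evaluated, which is where the O(pn) per-example cost comes from.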
Additionally, we show that RBMs are useful feature extractors for topic classification.\n\nDatasets We have performed experiments with the Reuters Corpus Volume I (RCV1) and 20 Newsgroups (20 NG). RCV1 is a benchmark for document classification of over 800,000 newswire stories (Lewis et al., 2004). The documents are represented as bag-of-words vectors with 47,236 dimensions. The training set contains 23,149 documents and the test set has 781,265. While there are 3 types of labels for the documents, we focus on the task of predicting the topic. There is a set of 103 non-mutually exclusive topics for a document. We report the performance using the F1.0 measure for comparison with the state of the art. 20 Newsgroups is a collection of Usenet posts comprising a training set of 11,269 examples and 7,505 test examples. The bag-of-words vectors contain 61,188 dimensions. The postings are to be classified into one of 20 categories. We use the by-date train/test split, which ensures that the training set contains postings preceding the test examples in time. Following Larochelle et al. (2012), we report the classification error and for a fair comparison we use the same preprocessing1.\n\nMethodology We compare the different estimation methods for the RBM based on the log-likelihoods they achieve. To do this we use Annealed Importance Sampling or AIS (Salakhutdinov and Murray, 2008). For all models we average 100 AIS runs with 10,000 uniformly spaced reverse temperatures β_k. We compare RBMs trained with ratio matching, stochastic ratio matching and biased stochastic ratio matching. We include experiments with RBMs trained with SML-1 for comparison.\nAdditionally, we provide experiments to motivate the use of high-dimensional RBMs in NLP. 
We use the RBM to pretrain the hidden layers of a feed-forward neural network (Hinton et al., 2006). This acts as a regularizer for the network and it helps optimization by initializing the network close to a good local minimum (Erhan et al., 2010).\nThe hyper-parameters are cross-validated on a validation set consisting of 5% of the training set. In our experiments with AIS, we use the validation log-likelihood as the objective. For classification, we use the discriminative performance on the validation set. The hyper-parameters are found using random search (Bergstra and Bengio, 2012) with 64 trials per set of experiments. The learning rate for the RBMs is sampled from 10^{-[0,3]}, the number of hidden units from [500, 2000] and the number of training epochs from [5, 20]. The learning rate for the MLP is sampled from 10^{-[2,0]}. It is trained for 32 epochs using early stopping based on the validation set. We regularize the MLP by dropping out 50% of the hidden units during training (Hinton et al., 2012). We adapt the learning rate dynamically by multiplying it by 0.95 when the validation error increases.\nAll experiments are run on a cluster of double quad-core Intel Xeon E5345 machines running at 2.33 GHz with 2 GB of RAM.\n\nTable 1: Log-probabilities estimated by AIS for the RBMs trained with the different estimation methods. With a fixed budget of epochs, SRM achieves likelihoods on the test set comparable with RM and SML-1.\n\n                    log Ẑ     log(Ẑ ± σ̂)          Avg. log-prob. (train)  Avg. log-prob. (test)\nRCV1   Biased SRM   1084.96   (1079.66, 1085.65)   -758.73                 -793.20\n       SRM          325.26    (325.24, 325.27)     -139.79                 -151.30\n       RM           499.88    (499.48, 500.17)     -119.98                 -147.32\n       SML-1        323.33    (320.69, 323.99)     -138.90                 -153.50\n20 NG  Biased SRM   1723.94   (1718.65, 1724.63)   -960.34                 -1018.73\n       SRM          546.52    (546.55, 546.49)     -178.39                 -190.72\n       RM           975.42    (975.62, 975.18)     -159.92                 -185.61\n       SML-1        612.15    (611.68, 612.46)     -173.56                 -188.82\n\n6.1 Using SRM to train RBMs\n\nWe can measure the effectiveness of SRM by comparing it with various estimation methods for the RBM. As the RBM is a generative model, we must compare these methods based on the log-likelihoods they achieve. Note that Dauphin et al. (2011) rely on the classification error because there is no accepted performance measure for DAEs. As both RM and SML scale badly with input dimension, we restrict the dimension of the dataset to the p = 1,000 most frequent words. We will describe experiments with all dimensions in the next section.\nAs seen in Table 1, SRM is a good estimator for training RBMs and is a good approximation of RM. We see that with the same budget of epochs SRM achieves log-likelihoods comparable with RM on both datasets. The striking difference of more than 500 nats with Biased SRM shows that the importance weights successfully unbias the estimator. Interestingly, we observe that RM is able to learn better generative models than SML-1 for both datasets. This is similar to Marlin et al. (2010), where Pseudolikelihood achieves better log-likelihood than SML on a subset of 20 Newsgroups. We observe this is an optimization issue, since RM also reaches a higher training log-likelihood than SML-1. One explanation is that SML-1 might experience mixing problems (Bengio et al., 2013).\n\nFigure 1: Average speedup in the calculation of gradients by using the SRM objective compared to RM. The speed-up is linear and reaches up to 2 orders of magnitude.\n\nFigure 1 shows that as expected SRM achieves a linear speed-up compared to RM, reaching speed-ups of 2 orders of magnitude. 
In fact, we observed that the computation time of the gradients for RM scales linearly with the size of the input, while the computation time of SRM remains fairly constant because the number of non-zeros varies little. This is an important property of SRM which makes it suitable for very large scale inputs.\n\n1 http://qwone.com/~jason/20Newsgroups/20news-bydate-matlab.tgz\n\nFigure 2: Average norm of the gradients for the terms in Equation 2 where x_i = 1 and x_i = 0. Confirming the hypothesis behind the proposal distribution, the terms where x_i = 1 are 2 orders of magnitude larger.\n\nThe importance sampling scheme of SRM (Equation 7) relies on the hypothesis that terms where x_i = 1 produce a larger gradient than terms where x_i = 0. We can verify this by monitoring the average gradients during learning on RCV1. Figure 2 demonstrates that the average gradients for the terms where x_i = 1 are 2 orders of magnitude larger than those where x_i = 0. This confirms the hypothesis underlying the sampling scheme of SRM.\n\n6.2 Using RBMs as feature extractors for NLP\n\nHaving established that SRM is an efficient unbiased estimator of RM, we turn to the task of using RBMs not as generative models but as feature extractors. We find that keeping the bias in SRM is helpful for classification. This is similar to the known result that contrastive divergence, which is biased, yields better classification results than persistent contrastive divergence, which is unbiased. The bias increases the weight of non-zero features. The superior performance of the biased objective suggests that the non-zero features contain more information about the classification task. In other words, for these tasks it is more important to focus on what is there than on what is not there.\n\nTable 2: Classification results on RCV1 with all 47,236 dimensions. 
The DBN trained with SRM achieves state-of-the-art performance.\n\nMODEL                     TEST SET F1\nROCCHIO                   0.693\nk-NN                      0.765\nSVM                       0.816\nSDA-MLP (REC. SAMPLING)   0.831\nRBM-MLP (UNBIASED SRM)    0.816\nRBM-MLP (BIASED SRM)      0.829\nDBN-MLP (BIASED SRM)      0.836\n\nOn RCV1, we train our models on all 47,236 dimensions. The RBM trained with SRM improves on the state-of-the-art (Lewis et al., 2004), as shown in Table 2. The total training time for this RBM using SRM is 57 minutes. We also train a Deep Belief Net (DBN) by stacking an RBM trained with SML on top of the RBMs learned with SRM. This type of 2-layer deep architecture is able to significantly improve the performance on that task (Table 2). In particular the DBN does significantly better than a stack of denoising auto-encoders we trained using biased reconstruction sampling (Dauphin et al., 2011), which appears as SDA-MLP (Rec. Sampling) in Table 2.\nWe apply RBMs trained with SRM on 20 Newsgroups with all 61,188 dimensions. We see in Table 3 that this approach improves the previous state-of-the-art by over 1% (Larochelle et al., 2012), beating non-pretrained MLPs and SVMs by close to 10%. This result is closely followed by the DAE trained with reconstruction sampling, which in our experiments reaches 20.6% test error.\n\nTable 3: Classification results on 20 Newsgroups with all 61,188 dimensions. Prior results from Larochelle et al. (2012). The RBM trained with SRM achieves state-of-the-art results.\n\nMODEL                     TEST SET ERROR\nSVM                       32.8%\nMLP                       28.2%\nRBM                       24.9%\nHDRBM                     21.9%\nDAE-MLP (REC. SAMPLING)   20.6%\nRBM-MLP (BIASED SRM)      20.5%\n\nThe simpler RBM trained by SRM is able to beat the more powerful HDRBM model because it uses all the 61,188 dimensions.\n\n7 Conclusion\n\nWe have proposed a very simple algorithm called Stochastic Ratio Matching (SRM) to take advantage of sparsity in high-dimensional data when training discrete RBMs. 
It can be used to estimate gradients in O(np) computation where p is the number of non-zeros, yielding a linear speedup over the O(nd) of Ratio Matching (RM), where d is the input size. It does so while providing an unbiased estimator of the ratio matching gradient. Using this efficient estimator we train RBMs as feature extractors and achieve state-of-the-art results on two text classification benchmarks.\n\nReferences\nBengio, Y. (2009). Learning deep architectures for AI. Now Publishers.\nBengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of deep networks. In NIPS’2006.\nBengio, Y., Courville, A., and Vincent, P. (2012). Representation learning: A review and new perspectives. Technical report, arXiv:1206.5538.\nBengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2013). Better mixing via deep representations. In ICML’13.\nBergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13, 281–305.\nDahl, G., Adams, R., and Larochelle, H. (2012). Training restricted Boltzmann machines on word observations. In J. Langford and J. Pineau, editors, Proceedings of the 29th International Conference on Machine Learning (ICML-12), ICML ’12, pages 679–686, New York, NY, USA. Omnipress.\nDauphin, Y., Glorot, X., and Bengio, Y. (2011). Large-scale learning of embeddings with reconstruction sampling. In ICML’11.\nErhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., and Bengio, S. (2010). Why does unsupervised pre-training help deep learning? JMLR, 11, 625–660.\nHinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.\nHinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. 
Technical report, arXiv:1207.0580.\nHyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6, 695–709.\nHyvärinen, A. (2007). Some extensions of score matching. Computational Statistics and Data Analysis, 51, 2499–2512.\nLarochelle, H., Mandel, M. I., Pascanu, R., and Bengio, Y. (2012). Learning algorithms for the classification restricted Boltzmann machine. Journal of Machine Learning Research, 13, 643–669.\nLewis, D. D., Yang, Y., Rose, T. G., and Li, F. (2004). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5, 361–397.\nMarlin, B., Swersky, K., Chen, B., and de Freitas, N. (2010). Inductive principles for restricted Boltzmann machine learning. In AISTATS 2010, volume 9, pages 509–516.\nSalakhutdinov, R. and Murray, I. (2008). On the quantitative analysis of deep belief networks. In ICML 2008, volume 25, pages 872–879.\nTieleman, T. (2008). Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML’2008, pages 1064–1071.\nYounes, L. (1999). On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates. Stochastics and Stochastic Reports, 65(3), 177–228.\n", "award": [], "sourceid": 687, "authors": [{"given_name": "Yann", "family_name": "Dauphin", "institution": "University of Montreal"}, {"given_name": "Yoshua", "family_name": "Bengio", "institution": "University of Montreal"}]}