{"title": "The Recurrent Temporal Restricted Boltzmann Machine", "book": "Advances in Neural Information Processing Systems", "page_first": 1601, "page_last": 1608, "abstract": "The Temporal Restricted Boltzmann Machine (TRBM) is a probabilistic model for sequences that is able to successfully model (i.e., generate nice-looking samples of) several very high dimensional sequences, such as motion capture data and the pixels of low resolution videos of balls bouncing in a box. The major disadvantage of the TRBM is that exact inference is extremely hard, since even computing a Gibbs update for a single variable of the posterior is exponentially expensive. This difficulty has necessitated the use of a heuristic inference procedure, that nonetheless was accurate enough for successful learning. In this paper we introduce the Recurrent TRBM, which is a very slight modification of the TRBM for which exact inference is very easy and exact gradient learning is almost tractable. We demonstrate that the RTRBM is better than an analogous TRBM at generating motion capture and videos of bouncing balls.", "full_text": "The Recurrent Temporal Restricted Boltzmann\n\nMachine\n\nIlya Sutskever, Geoffrey Hinton, and Graham Taylor\n\nUniversity of Toronto\n\n{ilya, hinton, gwtaylor}@cs.utoronto.ca\n\nAbstract\n\nThe Temporal Restricted Boltzmann Machine (TRBM) is a probabilistic model for\nsequences that is able to successfully model (i.e., generate nice-looking samples\nof) several very high dimensional sequences, such as motion capture data and the\npixels of low resolution videos of balls bouncing in a box. The major disadvan-\ntage of the TRBM is that exact inference is extremely hard, since even computing\na Gibbs update for a single variable of the posterior is exponentially expensive.\nThis dif\ufb01culty has necessitated the use of a heuristic inference procedure, that\nnonetheless was accurate enough for successful learning. 
In this paper we intro-\nduce the Recurrent TRBM, which is a very slight modi\ufb01cation of the TRBM for\nwhich exact inference is very easy and exact gradient learning is almost tractable.\nWe demonstrate that the RTRBM is better than an analogous TRBM at generating\nmotion capture and videos of bouncing balls.\n\n1 Introduction\n\nModeling sequences is an important problem since there is a vast amount of natural data, such as\nspeech and videos, that is inherently sequential. A good model for these data sources could be useful\nfor \ufb01nding an abstract representation that is helpful for solving \u201cnatural\u201d discrimination tasks (see\n[4] for an example of this approach for the non-sequential case). In addition, it could be also used\nfor predicting the future of a sequence from its past, be used as a prior for denoising tasks, and be\nused for other applications such as tracking objects in video. The Temporal Restricted Boltzmann\nMachine [14, 13] is a recently introduced probabilistic model that has the ability to accurately model\ncomplex probability distributions over high-dimensional sequences.\nIt was shown to be able to\ngenerate realistic motion capture data [14], and low resolution videos of 2 balls bouncing in a box\n[13], as well as complete and denoise such sequences.\n\nAs a probabilistic model, the TRBM is a directed graphical model consisting of a sequence of Re-\nstricted Boltzmann Machines (RBMs) [3], where the state of one or more previous RBMs determines\nthe biases of the RBM in next timestep. This probabilistic formulation straightforwardly implies a\nlearning procedure where approximate inference is followed by learning. The learning consists of\nlearning a conditional RBM at each timestep, which is easily done with Contrastive Divergence\n(CD) [3]. Exact inference in TRBMs, on the other hand, is highly non-trivial, since computing even\na single Gibbs update requires computing the ratio of two RBM partition functions. 
The approx-\nimate inference procedure used in [13] was heuristic and was not even derived from a variational\nprinciple.\n\nIn this paper we introduce the Recurrent TRBM (RTRBM), which is a model that is very similar\nto the TRBM, and just as expressive. Despite the similarity, exact inference is very easy in the\nRTRBM and computing the gradient of the log likelihood is feasible (up to the error introduced\nby the use of Contrastive Divergence). We demonstrate that the RTRBM is able to generate more\nrealistic samples than an equivalent TRBM for the motion capture data and for the pixels of videos\n\n\fof bouncing balls. The RTRBM\u2019s performance is better than the TRBM mainly because it learns to\nconvey more information through its hidden-to-hidden connections.\n\n2 Restricted Boltzmann Machines\n\nThe building block of the TRBM and the RTRBM is the Restricted Boltzmann Machine [3]. An\nRBM de\ufb01nes a probability distribution over pairs of vectors, V \u2208 {0, 1}NV and H \u2208 {0, 1}NH (a\nshorthand for visible and hidden) by the equation\n\nP (v, h) = P (V = v, H = h) = exp(v\u22a4bV + h\u22a4bH + v\u22a4W h)/Z\n\n(1)\nwhere bV is a vector of biases for the visible vectors, bH is a vector of biases for the hidden vectors,\nand W is the matrix of connection weights. The quantity Z = Z(bV , bH , W ) is the value of the\npartition function that ensures that Eq. 1 is a valid probability distribution. The RBM\u2019s de\ufb01nition\nimplies that the conditional distributions P (H|v) and P (V |h) are factorial (i.e., all the compo-\nnents of H in P (H|v) are independent) and are given by P (H (j) = 1|v) = s(bH + W \u22a4v)(j) and\nP (V (i) = 1|h) = s(bV + W h)(i), where s(x)(j) = (1 + exp(\u2212x(j)))\u22121 is the logistic function\nand x(j) is the jth component of the vector x. In general, we use i to index visible vectors V and j\nto index hidden vectors H. 
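For concreteness, the factorial conditionals above can be written in a few lines of NumPy. This is an illustrative sketch, not the authors' code; the function names are ours, and W is N_V x N_H as in Eq. 1, so the hidden input is W^T v:

```python
import numpy as np

def sigmoid(x):
    # s(x) = 1 / (1 + exp(-x)), applied component-wise
    return 1.0 / (1.0 + np.exp(-x))

def rbm_h_given_v(v, W, b_H):
    # P(H^(j) = 1 | v) = s(b_H + W^T v)^(j); factorial over j
    return sigmoid(b_H + W.T @ v)

def rbm_v_given_h(h, W, b_V):
    # P(V^(i) = 1 | h) = s(b_V + W h)^(i); factorial over i
    return sigmoid(b_V + W @ h)

rng = np.random.default_rng(0)
NV, NH = 4, 3
W = 0.1 * rng.standard_normal((NV, NH))
v = rng.integers(0, 2, NV).astype(float)
p_h = rbm_v_given_h(np.zeros(NH), W, np.zeros(NV))  # all-zero h gives p = 1/2 everywhere
p_h = rbm_h_given_v(v, W, np.zeros(NH))
```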
The RBM can be slightly modified to allow the vector V to take real values; one way of achieving this is by the definition\n\nP(v, h) = exp(−‖v‖²/2 + v⊤bV + h⊤bH + v⊤W h)/Z.   (2)\n\nUsing this equation does not change the form of the gradients or the conditional distribution P(H|v). The only change it introduces is in the conditional distribution P(V|h), which becomes the multivariate Gaussian N(bV + W h, I). See [18, 14] for more details and generalizations.\n\nThe gradient of the average log probability given a dataset S, L = 1/|S| Σ_{v∈S} log P(v), has the following simple form:\n\n∂L/∂W = ⟨V·H⊤⟩_{P(H|V)P̃(V)} − ⟨V·H⊤⟩_{P(H,V)}   (3)\n\nwhere P̃(V) = 1/|S| Σ_{v∈S} δ_v(V) (here δ_x(X) is a distribution over real-valued vectors that is concentrated at x), and ⟨f(X)⟩_{P(X)} is the expectation of f(X) under the distribution P. Computing the exact values of the expectations ⟨·⟩_{P(H,V)} is computationally intractable, and much work has been done on methods for computing approximate values for the expectations that are good enough for practical learning and inference tasks (e.g., [16, 12, 19], including [15], which works well for the RBM).\n\nWe will approximate the gradients with respect to the RBM's parameters using the Contrastive Divergence [3] learning procedure, CDn, whose updates are computed by the following algorithm.\n\nAlgorithm 1 (CDn)\n\n1. Sample (v, h) ∼ P(H|V)P̃(V)\n2. Set ΔW to v·h⊤\n3. Repeat n times: sample v ∼ P(V|h), then sample h ∼ P(H|v)\n4. Decrease ΔW by v·h⊤\n\nModels learned by CD1 are often reasonable generative models of the data [3], but if learning is continued with CD25, the resulting generative models are much better [11]. 
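Algorithm 1 translates directly into code. The following is a hedged NumPy sketch of a single CD_n update for one binary visible vector (the helper names are ours; a practical implementation would also update the biases and average over a minibatch):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_n_update(v0, W, b_V, b_H, n, rng):
    """One CD_n estimate of dL/dW for a single binary visible vector v0."""
    # Step 1: sample h ~ P(H | v0)
    h = (rng.random(b_H.size) < sigmoid(b_H + W.T @ v0)).astype(float)
    # Step 2: positive statistics v . h^T
    dW = np.outer(v0, h)
    v = v0
    # Step 3: n steps of alternating Gibbs sampling
    for _ in range(n):
        v = (rng.random(b_V.size) < sigmoid(b_V + W @ h)).astype(float)
        h = (rng.random(b_H.size) < sigmoid(b_H + W.T @ v)).astype(float)
    # Step 4: subtract the negative statistics
    dW -= np.outer(v, h)
    return dW

rng = np.random.default_rng(0)
NV, NH = 5, 3
W = 0.01 * rng.standard_normal((NV, NH))
dW = cd_n_update(np.ones(NV), W, np.zeros(NV), np.zeros(NH), n=1, rng=rng)
```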
The RBM also plays a critical role in deep belief networks [4, 5], but we do not use this connection in this paper.\n\n3 The TRBM\n\nIt is easy to construct the TRBM from RBMs. The TRBM, as described in the introduction, is a sequence of RBMs arranged in such a way that in any given timestep, the RBM's biases depend only on the state of the RBM in the previous timestep. In its simplest form, the TRBM can be viewed as a Hidden Markov Model (HMM) [9] with an exponentially large state space that has an extremely compact parameterization of the transition and the emission probabilities.\n\n¹We use uppercase variables (as in P(H|v)) to denote distributions and lowercase variables (as in P(h|v)) to denote the (real-valued) probability P(H = h|v).\n\nFigure 1: The graphical structure of a TRBM: a directed sequence of RBMs.\n\nLet X_{tA}^{tB} = (X_{tA}, . . . , X_{tB}) denote a sequence of variables. The TRBM defines a probability distribution P(V_1^T = v_1^T, H_1^T = h_1^T) by the equation\n\nP(v_1^T, h_1^T) = ∏_{t=2}^T P(v_t, h_t | h_{t−1}) · P0(v_1, h_1)   (4)\n\nwhich is identical to the defining equation of the HMM. The conditional distribution P(V_t, H_t | h_{t−1}) is that of an RBM whose biases for H_t are a function of h_{t−1}. Specifically,\n\nP(v_t, h_t | h_{t−1}) = exp(v_t⊤bV + v_t⊤W h_t + h_t⊤(bH + W′h_{t−1})) / Z(h_{t−1})   (5)\n\nwhere bV, bH and W are as in Eq. 1, while W′ is the weight matrix of the connections from H_{t−1} to H_t, so that bH + W′h_{t−1} is the hidden bias of the RBM at time t. In this equation, V_t ∈ {0, 1}^{NV} and H_t ∈ {0, 1}^{NH}; it is easy to modify this definition to allow V to take real values as was done in Eq. 2. The RBM's partition function depends on h_{t−1}, because the parameters (i.e., the biases) of the RBM at time t depend on the value of the random variable H_{t−1}. 
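Since Z(h_{t−1}) is an RBM partition function and hence intractable, only the unnormalized part of Eq. 5 is ever computed in practice. A small illustrative sketch (all names are ours, not the paper's):

```python
import numpy as np

def trbm_unnorm_log_prob(v_t, h_t, h_prev, W, W_prime, b_V, b_H):
    # Unnormalized log of Eq. 5: v^T b_V + v^T W h + h^T (b_H + W' h_prev).
    # The normalizer Z(h_prev) is an RBM partition function and is not computed here.
    bias_H = b_H + W_prime @ h_prev   # the hidden bias of the RBM at time t
    return v_t @ b_V + v_t @ (W @ h_t) + h_t @ bias_H

rng = np.random.default_rng(0)
NV, NH = 4, 3
W = 0.1 * rng.standard_normal((NV, NH))
Wp = 0.1 * rng.standard_normal((NH, NH))
val = trbm_unnorm_log_prob(np.ones(NV), np.ones(NH), np.zeros(NH),
                           W, Wp, np.zeros(NV), np.zeros(NH))
```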
Finally, the distribution P0 is defined by an equation very similar to Eq. 5, except that the (undefined) term W′h0 is replaced by the term binit, so the hidden units receive a special initial bias at P0; we will often write P(V_1, H_1 | h0) for P0(V_1, H_1) and W′h0 for binit. It follows from these equations that the TRBM is a directed graphical model that has an (undirected) RBM at each timestep (a related directed sequence of Boltzmann Machines has been considered in [7]).\n\nAs in most probabilistic models, the weight update is computed by solving the inference problem and then computing the weight update as if the inferred variables were observed, which reduces learning to the fully-visible case. If the hidden variables are observed, equation 4 implies that the gradient of the log likelihood with respect to the TRBM's parameters is Σ_{t=1}^T ∇ log P(v_t, h_t | h_{t−1}), and each term, being the gradient of the log likelihood of an RBM, can be approximated using CDn. Thus the main computational difficulty of learning TRBMs is in obtaining samples from a distribution approximating the posterior P(H_1^T | v_1^T).\n\nInference in a TRBM\n\nUnfortunately, the TRBM's inference problem is harder than that of a typical undirected graphical model, because even computing the probability P(H_t^(j) = 1 | everything else) involves evaluating the exact ratio of two RBM partition functions, which can be seen from Eq. 5. This difficulty necessitated the use of a heuristic inference procedure [13], which is based on the observation that the distribution P(H_t | h_1^{t−1}, v_1^t) = P(H_t | h_{t−1}, v_t) is factorial by definition. 
This inference procedure does no smoothing from the future and only does approximate filtering from the past: it samples from the distribution ∏_{t=1}^T P(H_t | h_1^{t−1}, v_1^t) instead of the true posterior distribution ∏_{t=1}^T P(H_t | h_1^{t−1}, v_1^T), which is easy because each P(H_t | h_1^{t−1}, v_1^t) is factorial.²\n\n²This is a slightly simplified description of the inference procedure in [13].\n\n4 Recurrent TRBMs\n\nLet us start with notation. Consider an arbitrary factorial distribution P′(H). The statement h ∼ P′(H) means that h is sampled from the factorial distribution P′(H), so each h^(j) is set to 1 with probability P′(H^(j) = 1), and to 0 otherwise. In contrast, the statement h ← P′(H) means that each h^(j) is set to the real value P′(H^(j) = 1), so this is a “mean-field” update [8, 17]. The symbol P stands for the distribution of some TRBM, while the symbol Q stands for the distribution defined by an RTRBM. Note that the outcome of the operation · ← P(H_t | v_t, h_{t−1}) is s(W v_t + W′h_{t−1} + bH).\n\nFigure 2: The graphical structure of the RTRBM, Q. The variables H_t are real valued while the variables H′_t are binary. The conditional distribution Q(V_t, H′_t | h_{t−1}) is given by the equation Q(v_t, h′_t | h_{t−1}) = exp(v_t⊤W h′_t + v_t⊤bV + h′_t⊤(bH + W′h_{t−1})) / Z(h_{t−1}), which is essentially the same as the TRBM's conditional distribution P from equation 5. We will always integrate out H′_t and will work directly with the distribution Q(V_t | h_{t−1}). Notice that when V_1 is observed, H′_1 cannot affect H_1.\n\nAn RTRBM, Q(V_1^T, H_1^T), is defined by the equation\n\nQ(v_1^T, h_1^T) = ∏_{t=2}^T Q(v_t | h_{t−1}) Q(h_t | v_t, h_{t−1}) · Q0(v1) 
Q0(h_1 | v_1)   (6)\n\nThe terms appearing in this equation will be defined shortly.\n\nLet us contrast the generative processes of the two models. To sample from a TRBM P, we need to perform a directed pass, sampling from each RBM on every timestep. One way of doing this is described by the following algorithm.\n\nAlgorithm 2 (for sampling from the TRBM)\n\nfor 1 ≤ t ≤ T:\n\n1. sample v_t ∼ P(V_t | h_{t−1})\n2. sample h_t ∼ P(H_t | v_t, h_{t−1})³\n\nwhere step 1 requires sampling from the marginals of a Boltzmann Machine (by integrating out H_t), which involves running a Markov chain.\n\n³When t = 1, P(H_t | v_t, h_{t−1}) stands for P0(H_1 | v_1), and similarly for other conditional distributions. The same convention is used in all algorithms.\n\nBy definition, RTRBMs and TRBMs are parameterized in the same way, so from now on we will assume that P and Q have identical parameters, which are W, W′, bV, bH, and binit. The following algorithm samples from the RTRBM Q under this assumption.\n\nAlgorithm 3 (for sampling from the RTRBM)\n\nfor 1 ≤ t ≤ T:\n\n1. sample v_t ∼ P(V_t | h_{t−1})\n2. set h_t ← P(H_t | v_t, h_{t−1})\n\nWe can infer that Q(V_t | h_{t−1}) = P(V_t | h_{t−1}) because of step 1 in Algorithm 3, which is also consistent with the equation given in figure 2 where H′_t is integrated out. The only difference between Algorithm 2 and Algorithm 3 is in step 2. The difference may seem small, since the operations h_t ∼ P(H_t | v_t, h_{t−1}) and h_t ← P(H_t | v_t, h_{t−1}) appear similar. However, this difference significantly alters the inference and learning procedures of the RTRBM; in particular, it can already be seen that the H_t are real-valued for the RTRBM.\n\n4.1 Inference in RTRBMs\n\nInference in RTRBMs given v_1^T is very easy, which might be surprising in light of its similarity to the TRBM. 
The reason inference is easy is similar to the reason inference in square ICAs is easy [1]: there is a unique, easily computable value of the hidden variables that has nonzero posterior probability. Suppose, for example, that the value of V_1 is v_1, which means that v_1 was produced at the end of step 1 in Algorithm 3. Since step 2, the deterministic operation h_1 ← P0(H_1 | v_1), has been executed, the only value h_1 can take is the value assigned by that operation. Any other value for h_1 is never produced by a generative process that outputs v_1 and thus has posterior probability 0. Furthermore, by executing this operation, we can recover h_1. Thus, Q0(H_1 | v_1) = δ_{s(W v_1 + bH + binit)}(H_1). Note that H_1's value is completely independent of v_2^T.\n\nOnce h_1 is known, we can consider the generative process that produced v_2. As before, since v_2 was produced at the end of step 1, the fact that step 2 has been executed implies that h_2 can be computed by h_2 ← P(H_2 | v_2, h_1) (recall that at this point h_1 is known with absolute certainty). If the same reasoning is repeated t times, then all of h_1^t is uniquely determined and easily computed when v_1^t is known. There is no need for smoothing because V_t and H_{t−1} influence H_t with such strength that the knowledge of V_{t+1}^T cannot alter the model's belief about H_t. This is because Q(H_t | v_t, h_{t−1}) = δ_{s(W v_t + bH + W′h_{t−1})}(H_t).\n\nThe resulting inference algorithm is simple:\n\nAlgorithm 4 (inference in RTRBMs)\n\nfor 1 ≤ t ≤ T:\n\n1. h_t ← P(H_t | v_t, h_{t−1})\n\nLet h(v)_1^T denote the output of the inference algorithm on input v_1^T, in which case the posterior is described by\n\nQ(H_1^T | v_1^T) = δ_{h(v)_1^T}(H_1^T).   (7)\n\n4.2 Learning in RTRBMs\n\nLearning in RTRBMs may seem easy once inference is solved, since the main difficulty in learning TRBMs is the inference problem. 
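Algorithm 4 is a purely deterministic forward recurrence, which a short sketch makes concrete (an illustrative NumPy rendering under our own naming, with W being N_V x N_H as in Eq. 1):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rtrbm_infer(vs, W, W_prime, b_H, b_init):
    """Algorithm 4: exact RTRBM inference, h_t <- P(H_t | v_t, h_{t-1})."""
    hs = []
    for t, v in enumerate(vs):
        recur = b_init if t == 0 else W_prime @ hs[-1]  # W' h_{t-1}, or b_init at t = 1
        hs.append(sigmoid(W.T @ v + b_H + recur))
    return hs  # the deterministic posterior values h(v)_1 .. h(v)_T

rng = np.random.default_rng(0)
NV, NH, T = 4, 3, 5
W = 0.1 * rng.standard_normal((NV, NH))
Wp = 0.1 * rng.standard_normal((NH, NH))
vs = [rng.integers(0, 2, NV).astype(float) for _ in range(T)]
hs = rtrbm_infer(vs, W, Wp, np.zeros(NH), np.zeros(NH))
```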
However, the RTRBM does not allow EM-like learning, because the equation ∇ log Q(v_1^T) = ⟨∇ log Q(v_1^T, h_1^T)⟩_{h_1^T ∼ Q(H_1^T | v_1^T)} is not meaningful. To be precise, the gradient ∇ log Q(v_1^T, h_1^T) is undefined because δ_{s(W′h_{t−1} + bH + W⊤v_t)}(h_t) is not, in general, a continuous function of W. Thus, the gradient has to be computed differently.\n\nNotice that the RTRBM's log probability satisfies log Q(v_1^T) = Σ_{t=1}^T log Q(v_t | v_1^{t−1}), so we could try computing the sum ∇ Σ_{t=1}^T log Q(v_t | v_1^{t−1}). The key observation that makes the computation feasible is the equation\n\nQ(V_t | v_1^{t−1}) = ∫_{h′_{t−1}} Q(v_t | h′_{t−1}) Q(h′_{t−1} | v_1^{t−1}) dh′_{t−1} = Q(V_t | h(v)_{t−1})   (8)\n\nwhere h(v)_{t−1} is the value computed by the RTRBM inference algorithm with inputs v_1^{t−1}. This equation holds because the posterior distribution Q(H_{t−1} | v_1^{t−1}) = δ_{h(v)_{t−1}}(H_{t−1}) is a point-mass at h(v)_{t−1}, which follows from Eq. 7.\n\nThe equality Q(V_t | v_1^{t−1}) = Q(V_t | h(v)_{t−1}) allows us to define a recurrent neural network (RNN) [10] whose parameters are identical to those of the RTRBM, and whose cost function is equal to the log likelihood of the RTRBM. This is useful because it is easy to compute gradients with respect to the RNN's parameters using the backpropagation through time algorithm [10]. The RNN has a pair of variables at each timestep, {(v_t, r_t)}_{t=1}^T, where the v_t are the input variables and the r_t are the RNN's hidden variables (all of which are deterministic). The hiddens r_1^T are computed by the equation\n\nr_t = s(W v_t + bH + W′r_{t−1})   (9)\n\nwhere W′r_{t−1} is replaced with binit when t = 1. This definition was chosen so that the equation r_1^T = h(v)_1^T would hold. The RNN attempts to probabilistically predict the next timestep from its history using the marginal distribution of the RBM, Q(V_{t+1} | r_t), so its objective function at time t is defined to be log Q(v_{t+1} | r_t), where Q depends on the RNN's parameters in the same way it depends on the RTRBM's parameters (the two sets of parameters being identical). This is a valid definition of an RNN whose cumulative objective for the sequence v_1^T is\n\nO = Σ_{t=1}^T log Q(v_t | r_{t−1})   (10)\n\nwhere Q(v_1 | r_0) = Q0(v_1). But since r_t as computed in equation 9 on input v_1^T is identical to h(v)_t, the equality log Q(v_t | r_{t−1}) = log Q(v_t | v_1^{t−1}) holds. Substituting this identity into Eq. 10 yields\n\nO = Σ_{t=1}^T log Q(v_t | r_{t−1}) = Σ_{t=1}^T log Q(v_t | v_1^{t−1}) = log Q(v_1^T)   (11)\n\nwhich is the log probability of the corresponding RTRBM.\n\nThis means that ∇O = ∇ log Q(v_1^T) can be computed with the backpropagation through time algorithm [10], where the contribution of the gradient from each timestep is computed with Contrastive Divergence.\n\n4.3 Details of the backpropagation through time algorithm\n\nThe backpropagation through time algorithm is identical to the usual backpropagation algorithm where the feedforward neural network is turned “on its side”. Specifically, the algorithm maintains a term ∂O/∂r_t which is computed from ∂O/∂r_{t+1} and ∂ log Q(v_{t+1} | r_t)/∂r_t using the chain rule, by the equation\n\n∂O/∂r_t = W′⊤(r_{t+1}.(1 − r_{t+1}).∂O/∂r_{t+1}) + W′⊤ ∂ log Q(v_{t+1} | r_t)/∂bH   (12)\n\nwhere a.b denotes component-wise multiplication, the term r_t.(1 − r_t) arises from the derivative of the logistic function s′(x) = s(x).(1 − s(x)), and ∂ log Q(v_{t+1} | r_t)/∂bH is computed by CD. 
Once ∂O/∂r_t is computed for all t, the gradients of the parameters can be computed using the following equations:\n\n∂O/∂W′ = Σ_{t=2}^T r_{t−1} (r_t.(1 − r_t).∂O/∂r_t)⊤   (13)\n\n∂O/∂W = Σ_{t=1}^{T−1} v_t (W′⊤(r_{t+1}.(1 − r_{t+1}).∂O/∂r_{t+1}))⊤ + Σ_{t=1}^T ∂ log Q(v_t | r_{t−1})/∂W   (14)\n\nThe first summation in Eq. 14 arises from the use of W as weights for inference when computing r_t, and the second summation arises from the use of W as RBM parameters for computing log Q(v_t | r_{t−1}). Each term of the form ∂ log Q(v_{t+1} | r_t)/∂W is also computed with CD. Computing ∂O/∂r_t is done most conveniently with a single backward pass through the sequence. As always, Q(v_1 | r_0) = Q0(v_1). It is also seen that the gradient would be computed exactly if CD were to return the exact gradient of the RBM's log probability.\n\n5 Experiments\n\nWe report the results of experiments comparing an RTRBM to a TRBM. The results in [14, 13] were obtained using TRBMs that had several delay-taps, which means that each hidden unit could directly observe several previous timesteps. To demonstrate that the RTRBM learns to use the hidden units to store information, we did not use delay-taps for either the RTRBM or the TRBM, which causes the results to be worse (but not much worse) than in [14, 13]. If delay-taps are allowed, then the results of [14, 13] show that there is little benefit from the hidden-to-hidden connections (which are W′), making the comparison between the RTRBM and the TRBM uninteresting.\n\nIn all experiments, the RTRBM and the TRBM had the same number of hidden units, their parameters were initialized in the same manner, and they were trained for the same number of weight updates. 
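Equations 12-14 amount to one standard backward pass in which the CD estimates act as per-timestep error signals. Below is a schematic NumPy sketch of that pass: the CD statistics are passed in as precomputed inputs, all names are ours rather than the paper's, and Eq. 13 is accumulated in the transposed orientation so that the result matches a W′ used as W′h_{t−1}:

```python
import numpy as np

def rtrbm_bptt(rs, vs, dbH, dW_cd, W_prime):
    """Sketch of the backward pass of Eqs. 12-14 (0-indexed: rs[0] is r_1).

    dbH[t]   : CD estimate of d log Q(vs[t+1] | rs[t]) / d b_H, for t = 0 .. T-2
    dW_cd[t] : CD estimate of d log Q(vs[t] | rs[t-1]) / d W,  for t = 0 .. T-1
    """
    T, NH = len(rs), rs[0].size
    dO_dr = [np.zeros(NH) for _ in range(T)]  # r_T influences nothing, so dO/dr_T = 0
    dW = sum(dW_cd)                           # second summation of Eq. 14 (W as RBM parameter)
    dWp = np.zeros_like(W_prime)
    for t in range(T - 2, -1, -1):
        back = rs[t + 1] * (1 - rs[t + 1]) * dO_dr[t + 1]
        dO_dr[t] = W_prime.T @ back + W_prime.T @ dbH[t]  # Eq. 12
        dW += np.outer(vs[t], W_prime.T @ back)           # first summation of Eq. 14
    for t in range(1, T):
        # Eq. 13, accumulated in the W' h_{t-1} orientation (transpose of the paper's layout)
        dWp += np.outer(rs[t] * (1 - rs[t]) * dO_dr[t], rs[t - 1])
    return dW, dWp

rng = np.random.default_rng(0)
NV, NH, T = 4, 3, 5
Wp = 0.1 * rng.standard_normal((NH, NH))
rs = [rng.random(NH) for _ in range(T)]
vs = [rng.random(NV) for _ in range(T)]
dbH = [0.01 * rng.standard_normal(NH) for _ in range(T - 1)]
dW_cd = [0.01 * rng.standard_normal((NV, NH)) for _ in range(T)]
dW, dWp = rtrbm_bptt(rs, vs, dbH, dW_cd, Wp)
```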
When sampling from the TRBM, we would use the sampling procedure of the RTRBM\nusing the TRBM\u2019s parameters to eliminate the additional noise from its hidden units. If this is not\ndone, the samples produced by the TRBM are signi\ufb01cantly worse. Unfortunately, the evaluation\nmetric is entirely qualitative since computing the log probability on a test set is infeasible for both\nthe TRBM and the RTRBM. We provide the code for our experiments in [URL].\n\n\fFigure 3: This \ufb01gure shows the receptive \ufb01elds of the \ufb01rst 36 hidden units of the RTRBM on the\nleft, and the corresponding hidden-to-hidden weights between these units on the right: the ith row on\nthe right corresponds to the ith receptive \ufb01eld on the left, when counted left-to-right. Hidden units\n18 and 19 exhibit unusually strong hidden-to-hidden connections; they are also the ones with the\nweakest visible-hidden connections, which effectively makes them belong to another hidden layer.\n\n5.1 Videos of bouncing balls\nWe used a dataset consisting of videos of 3 balls bouncing in a box. The videos are of length 100\nand of resolution 30\u00d730. Each training example is synthetically generated, so no training sequence\nis seen twice by the model which means that over\ufb01tting is highly unlikely. The task is to learn to\ngenerate videos at the pixel level. This problem is high-dimensional, having 900 dimensions per\nframe, and the RTRBM and the TRBM are given no prior knowledge about the nature of the task\n(e.g., by convolutional weight matrices).\n\nBoth the RTRBM and the TRBM had 400 hidden units. Samples from these models are provided as\nvideos 1,2 (RTRBM) and videos 3,4 (TRBM). A sample training sequence is given in video 5. All\nthe samples can be found in [URL]. The real-values in the videos are the conditional probabilities\nof the pixels [13]. 
The RTRBM's samples are noticeably better than the TRBM's samples; a key difference between them is that the balls produced by the TRBM moved in a random walk, while those produced by the RTRBM moved in a more persistent direction. An examination of the visible-to-hidden connection weights of the RTRBM reveals a number of hidden units that are not connected to visible units. These units have the most active hidden-to-hidden connections, which must be used to propagate information through time. In particular, these units are the only units that do not have a strong self connection (i.e., W′_{i,i} is not large; see figure 3). No such separation of units is found in the TRBM, and all of its hidden units have large visible-to-hidden connections.\n\n5.2 Motion capture data\n\nWe used a dataset that represents human motion capture data by sequences of joint angles, translations, and rotations of the base of the spine [14]. The total number of frames in the dataset was 3000, from which the model learned on subsequences of length 50. Each frame has 49 dimensions, and both models have 200 hidden units. The data is real-valued, so the TRBM and the RTRBM were adapted to have Gaussian visible variables using equation 2. The samples produced by the RTRBM exhibit less sticking and foot-skate than those produced by the TRBM; samples from these models are provided as videos 6,7 (RTRBM) and videos 8,9 (TRBM); video 10 is a sample training sequence. Part of the Gaussian noise was removed in a manner described in [14] in both models.\n\n5.3 Details of the learning procedures\n\nEach problem was trained for 100,000 weight updates, with a momentum of 0.9, where the gradient was normalized by the length of the sequence for each gradient computation. The weights are updated after computing the gradient on a single sequence. The learning starts with CD10 for the first 1000 weight updates, which is then switched to CD25. 
The visible to hidden weights, W , were\ninitialized with static CD5 (without using the (R)TRBM learning rules) on 30 sequences (which\nresulted in 30 weight updates) with learning rate of 0.01 and momentum 0.9. These weights were\nthen given to the (R)TRBM learning procedure, where the learning rate was linearly reduced to-\nwards 0. The weights W \u2032 and the biases were initialized with a sample from spherical Gaussian of\nstandard-deviation 0.005. For the bouncing balls problem the initial learning rate was 0.01, and for\nthe motion capture data it was 0.005.\n\n\f6 Conclusions\n\nIn this paper we introduced the RTRBM, which is a probabilistic model as powerful as the intractable\nTRBM that has an exact inference and an almost exact learning procedure. The common disadvan-\ntage of the RTRBM is that it is a recurrent neural network, a type of model known to have dif\ufb01culties\nlearning to use its hidden units to their full potential [2]. However, this disadvantage is common to\nmany other probabilistic models, and it can be partially alleviated using techniques such as the long\nshort term memory RNN [6].\n\nAcknowledgments\nThis research was partially supported by the Ontario Graduate Scholarship and by the Natural Coun-\ncil of Research and Engineering of Canada. The mocap data used in this project was obtained\nfrom http://people.csail.mit.edu/ehsu/work/sig05stf/. For Matlab playback\nof motion and generation of videos, we have adapted portions of Neil Lawrence\u2019s motion capture\ntoolbox (http://www.dcs.shef.ac.uk/\u223cneil/mocap/).\n\nReferences\n[1] A.J. Bell and T.J. Sejnowski. An Information-Maximization Approach to Blind Separation and Blind\n\nDeconvolution. Neural Computation, 7(6):1129\u20131159, 1995.\n\n[2] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is dif\ufb01cult.\n\nNeural Networks, IEEE Transactions on, 5(2):157\u2013166, 1994.\n\n[3] G.E. Hinton. 
Training Products of Experts by Minimizing Contrastive Divergence. Neural Computation,\n\n14(8):1771\u20131800, 2002.\n\n[4] G.E. Hinton, S. Osindero, and Y.W. Teh. A Fast Learning Algorithm for Deep Belief Nets. Neural\n\nComputation, 18(7):1527\u20131554, 2006.\n\n[5] G.E. Hinton and R.R. Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks. Sci-\n\nence, 313(5786):504\u2013507, 2006.\n\n[6] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735\u20131780,\n\n1997.\n\n[7] S. Osindero and G. Hinton. Modeling image patches with a directed hierarchy of Markov random \ufb01elds.\n\nAdvances Neural Information Processing Systems, 2008.\n\n[8] C. Peterson and J.R. Anderson. A mean \ufb01eld theory learning algorithm for neural networks. Complex\n\nSystems, 1(5):995\u20131019, 1987.\n\n[9] L.R. Rabiner. A tutorial on hidden Markov models and selected applications inspeech recognition. Pro-\n\nceedings of the IEEE, 77(2):257\u2013286, 1989.\n\n[10] D.E. Rumelhart, G.E. Hinton, and R.J. Williams. Learning representations by back-propagating errors.\n\nNature, 323(6088):533\u2013536, 1986.\n\n[11] R. Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. In Proceedings of\n\nthe International Conference on Machine Learning, volume 25, 2008.\n\n[12] D. Sontag and T. Jaakkola. New Outer Bounds on the Marginal Polytope. Advances in Neural Information\n\nProcessing Systems, 2008.\n\n[13] I. Sutskever and G.E. Hinton. Learning multilevel distributed representations for high-dimensional se-\nquences. Proceeding of the Eleventh International Conference on Arti\ufb01cial Intelligence and Statistics,\npages 544\u2013551, 2007.\n\n[14] G.W. Taylor, G.E. Hinton, and S. Roweis. Modeling human motion using binary latent variables. Ad-\n\nvances in Neural Information Processing Systems, 19:1345\u20131352, 2007.\n\n[15] T. Tieleman. 
Training restricted boltzmann machines using approximations to the likelihood gradient. In\n\nProceedings of the International Conference on Machine Learning, volume 25, 2008.\n\n[16] M.J. Wainwright, T.S. Jaakkola, and A.S. Willsky. A new class of upper bounds on the log partition\n\nfunction. IEEE Transactions on Information Theory, 51(7):2313\u20132335, 2005.\n\n[17] M.J. Wainwright and M.I. Jordan. Graphical models, exponential families, and variational inference. UC\n\nBerkeley, Dept. of Statistics, Technical Report, 649, 2003.\n\n[18] M. Welling, M. Rosen-Zvi, and G. Hinton. Exponential family harmoniums with an application to infor-\n\nmation retrieval. Advances in Neural Information Processing Systems, 17:1481\u20131488, 2005.\n\n[19] J.S. Yedidia, W.T. Freeman, and Y. Weiss. Understanding belief propagation and its generalizations.\n\nExploring Arti\ufb01cial Intelligence in the New Millennium, pages 239\u2013236, 2003.\n\n\f", "award": [], "sourceid": 583, "authors": [{"given_name": "Ilya", "family_name": "Sutskever", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}, {"given_name": "Graham", "family_name": "Taylor", "institution": null}]}