{"title": "Deep Temporal Sigmoid Belief Networks for Sequence Modeling", "book": "Advances in Neural Information Processing Systems", "page_first": 2467, "page_last": 2475, "abstract": "Deep dynamic generative models are developed to learn sequential dependencies in time-series data. The multi-layered model is designed by constructing a hierarchy of temporal sigmoid belief networks (TSBNs), defined as a sequential stack of sigmoid belief networks (SBNs). Each SBN has a contextual hidden state, inherited from the previous SBNs in the sequence, and is used to regulate its hidden bias. Scalable learning and inference algorithms are derived by introducing a recognition model that yields fast sampling from the variational posterior. This recognition model is trained jointly with the generative model, by maximizing its variational lower bound on the log-likelihood. Experimental results on bouncing balls, polyphonic music, motion capture, and text streams show that the proposed approach achieves state-of-the-art predictive performance, and has the capacity to synthesize various sequences.", "full_text": "Deep Temporal Sigmoid Belief Networks for Sequence Modeling

Zhe Gan, Chunyuan Li, Ricardo Henao, David Carlson and Lawrence Carin
Department of Electrical and Computer Engineering
Duke University, Durham, NC 27708
{zhe.gan, chunyuan.li, r.henao, david.carlson, lcarin}@duke.edu

Abstract

Deep dynamic generative models are developed to learn sequential dependencies in time-series data. The multi-layered model is designed by constructing a hierarchy of temporal sigmoid belief networks (TSBNs), defined as a sequential stack of sigmoid belief networks (SBNs). Each SBN has a contextual hidden state, inherited from the previous SBNs in the sequence, which is used to regulate its hidden bias. Scalable learning and inference algorithms are derived by introducing a recognition model that yields fast sampling from the variational posterior. 
This recognition model is trained jointly with the generative model, by maximizing its variational lower bound on the log-likelihood. Experimental results on bouncing balls, polyphonic music, motion capture, and text streams show that the proposed approach achieves state-of-the-art predictive performance, and has the capacity to synthesize various sequences.

1 Introduction

Considerable research has been devoted to developing probabilistic models for high-dimensional time-series data, such as video and music sequences, motion capture data, and text streams. Among them, Hidden Markov Models (HMMs) [1] and Linear Dynamical Systems (LDS) [2] have been widely studied, but they may be limited in the type of dynamical structures they can model. An HMM is a mixture model that relies on a single multinomial variable to represent the history of a time-series; to represent N bits of information about the history, an HMM could require 2^N distinct states. Real-world sequential data, on the other hand, often contain complex non-linear temporal dependencies, while an LDS can only model simple linear dynamics.

Another class of time-series models, potentially better suited to modeling complex probability distributions over high-dimensional sequences, relies on Recurrent Neural Networks (RNNs) [3, 4, 5, 6] and variants of a well-known undirected graphical model, the Restricted Boltzmann Machine (RBM) [7, 8, 9, 10, 11]. One such variant is the Temporal Restricted Boltzmann Machine (TRBM) [8], which consists of a sequence of RBMs, where the state of one or more previous RBMs determines the biases of the RBM at the current time step. Learning and inference in the TRBM are non-trivial; the approximate procedure used in [8] is heuristic and not derived from a principled statistical formalism.

Recently, deep directed generative models [12, 13, 14, 15] have become popular. 
A directed graphical model that is closely related to the RBM is the Sigmoid Belief Network (SBN) [16]. In the work presented here, we introduce the Temporal Sigmoid Belief Network (TSBN), which can be viewed as a temporal stack of SBNs, where each SBN has a contextual hidden state that is inherited from the previous SBNs and is used to adjust its hidden-unit biases. Based on this, we further develop a deep dynamic generative model by constructing a hierarchy of TSBNs. This can be considered as a deep SBN [15] with temporal feedback loops on each layer. Both stochastic and deterministic hidden layers are considered.

Figure 1: Graphical model for the Deep Temporal Sigmoid Belief Network. (a,b) Generative and recognition model of the TSBN. (c,d) Generative and recognition model of a two-layer Deep TSBN.

Compared with previous work, our model: (i) can be viewed as a generalization of an HMM with distributed hidden-state representations and a deep architecture; (ii) can be seen as a generalization of an LDS with complex non-linear dynamics; (iii) can be considered a probabilistic construction of the traditionally deterministic RNN; (iv) is closely related to the TRBM, but has a fully generative process, where data are readily generated from the model using ancestral sampling; (v) can be utilized to model different kinds of data, e.g., binary, real-valued and counts.

The "explaining away" effect described in [17] makes inference slow, if one uses traditional inference methods. 
Another important contribution we present here is to develop fast and scalable learning and inference algorithms, by introducing a recognition model [12, 13, 14] that learns an inverse mapping from observations to hidden variables, based on a loss function derived from a variational principle. By utilizing the recognition model and variance-reduction techniques from [13], we achieve fast inference at both training and testing time.

2 Model Formulation

2.1 Sigmoid Belief Networks

Deep dynamic generative models are considered, based on the Sigmoid Belief Network (SBN) [16]. An SBN is a Bayesian network that models a binary visible vector v ∈ {0,1}^M, in terms of binary hidden variables h ∈ {0,1}^J and weights W ∈ R^{M×J} with

p(v_m = 1 | h) = σ(w_m^⊤ h + c_m),    p(h_j = 1) = σ(b_j),    (1)

where v = [v_1, ..., v_M]^⊤, h = [h_1, ..., h_J]^⊤, W = [w_1, ..., w_M]^⊤, c = [c_1, ..., c_M]^⊤, b = [b_1, ..., b_J]^⊤, and σ(x) ≜ 1/(1 + e^{−x}) is the logistic function. The parameters W, b and c characterize all data, while the hidden variables, h, are specific to particular visible data, v.

The SBN is closely related to the RBM [18], which is a Markov random field with the same bipartite structure as the SBN. The RBM defines a distribution over a binary vector that is proportional to the exponential of its energy, defined as −E(v, h) = v^⊤c + v^⊤Wh + h^⊤b. The conditional distributions, p(v|h) and p(h|v), in the RBM are factorial, which makes inference fast, while parameter estimation usually relies on an approximation technique known as Contrastive Divergence (CD) [18].

The energy function of an SBN may be written as −E(v, h) = v^⊤c + v^⊤Wh + h^⊤b − Σ_m log(1 + exp(w_m^⊤ h + c_m)). 
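As a concrete illustration of (1) (a minimal NumPy sketch, not the authors' code; sizes and parameter values are toy assumptions), ancestral sampling from an SBN draws the hidden units from their prior and then the visibles given the hiddens:

```python
import numpy as np

def sigmoid(x):
    # logistic function: sigma(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sbn_sample(W, b, c, rng):
    """Ancestral sample (v, h) from an SBN with weights W (M x J),
    hidden biases b (J,) and visible biases c (M,)."""
    # p(h_j = 1) = sigma(b_j): hidden units are independent a priori
    h = (rng.random(b.shape) < sigmoid(b)).astype(float)
    # p(v_m = 1 | h) = sigma(w_m^T h + c_m)
    v = (rng.random(c.shape) < sigmoid(W @ h + c)).astype(float)
    return v, h

rng = np.random.default_rng(0)
M, J = 6, 4                      # toy sizes (assumption)
W = rng.normal(0.0, 0.1, (M, J))
v, h = sbn_sample(W, np.zeros(J), np.zeros(M), rng)
```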
SBNs explicitly manifest the generative process to obtain data, in which the hidden layer provides a directed "explanation" for patterns generated in the visible layer. However, the "explaining away" effect described in [17] makes inference inefficient; this can be alleviated by exploiting recent advances in variational inference methods [13].

2.2 Temporal Sigmoid Belief Networks

The proposed Temporal Sigmoid Belief Network (TSBN) model is a sequence of SBNs arranged in such a way that at any given time step, the SBN's biases depend on the state of the SBNs at the previous time steps. Specifically, assume we have a length-T binary visible sequence, the t-th element of which is denoted v_t ∈ {0,1}^M. The TSBN describes the joint probability as

p_θ(V, H) = p(h_1) p(v_1|h_1) · ∏_{t=2}^T p(h_t|h_{t−1}, v_{t−1}) · p(v_t|h_t, v_{t−1}),    (2)

where V = [v_1, ..., v_T], H = [h_1, ..., h_T], and each h_t ∈ {0,1}^J represents the hidden state corresponding to time step t. For t = 1, ..., T, each conditional distribution in (2) is expressed as

p(h_{jt} = 1 | h_{t−1}, v_{t−1}) = σ(w_{1j}^⊤ h_{t−1} + w_{3j}^⊤ v_{t−1} + b_j),    (3)
p(v_{mt} = 1 | h_t, v_{t−1}) = σ(w_{2m}^⊤ h_t + w_{4m}^⊤ v_{t−1} + c_m),    (4)

where h_0 and v_0, needed for the prior model p(h_1) and p(v_1|h_1), are defined as zero vectors, for conciseness. The model parameters, θ, are specified as W_1 ∈ R^{J×J}, W_2 ∈ R^{M×J}, W_3 ∈ R^{J×M}, W_4 ∈ R^{M×M}. For i = 1, 2, 3, 4, w_{ij} is the transpose of the j-th row of W_i, and c = [c_1, ..., c_M]^⊤ and b = [b_1, ..., b_J]^⊤ are bias terms. 
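Since the TSBN is fully directed, the joint in (2)-(4) supports simple ancestral sampling over time. The following is a minimal NumPy sketch (toy sizes and random parameters are assumptions; an illustration, not the authors' implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tsbn_generate(W1, W2, W3, W4, b, c, T, rng):
    """Ancestral sampling from the TSBN in (2)-(4).
    W1: J x J, W2: M x J, W3: J x M, W4: M x M; h_0 = v_0 = 0."""
    J, M = W1.shape[0], W2.shape[0]
    h_prev, v_prev = np.zeros(J), np.zeros(M)
    V, H = [], []
    for t in range(T):
        # p(h_t | h_{t-1}, v_{t-1}), eq. (3)
        h = (rng.random(J) < sigmoid(W1 @ h_prev + W3 @ v_prev + b)).astype(float)
        # p(v_t | h_t, v_{t-1}), eq. (4)
        v = (rng.random(M) < sigmoid(W2 @ h + W4 @ v_prev + c)).astype(float)
        H.append(h)
        V.append(v)
        h_prev, v_prev = h, v
    return np.stack(V), np.stack(H)

rng = np.random.default_rng(1)
J, M, T = 5, 8, 10               # toy sizes (assumption)
W1 = rng.normal(0, 0.1, (J, J)); W2 = rng.normal(0, 0.1, (M, J))
W3 = rng.normal(0, 0.1, (J, M)); W4 = rng.normal(0, 0.1, (M, M))
V, H = tsbn_generate(W1, W2, W3, W4, np.zeros(J), np.zeros(M), T, rng)
```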
The graphical model for the TSBN is shown in Figure 1(a).

By setting W_3 and W_4 to be zero matrices, the TSBN can be viewed as a Hidden Markov Model [1] with an exponentially large state space that has a compact parameterization of the transition and emission probabilities. Specifically, each hidden state in the HMM is represented as a one-hot length-J vector, while in the TSBN the hidden states can be any length-J binary vector. We note that the transition matrix is highly structured, since the number of parameters is only quadratic in J. Compared with the TRBM [8], our TSBN is fully directed, which allows for fast sampling of "fantasy" data from the inferred model.

2.3 TSBN Variants

Modeling real-valued data  The model above can be readily extended to model real-valued sequence data, by substituting (4) with p(v_t|h_t, v_{t−1}) = N(μ_t, diag(σ_t^2)), where

μ_{mt} = w_{2m}^⊤ h_t + w_{4m}^⊤ v_{t−1} + c_m,    log σ_{mt}^2 = (w'_{2m})^⊤ h_t + (w'_{4m})^⊤ v_{t−1} + c'_m,    (5)

and μ_{mt} and σ_{mt}^2 are elements of μ_t and σ_t^2, respectively. W'_2 and W'_4 are of the same size as W_2 and W_4, respectively. Compared with the Gaussian TRBM [9], in which σ_{mt} is fixed to 1, our formalism uses a diagonal matrix to parameterize the variance structure of v_t.

Modeling count data  We also introduce an approach for modeling time-series data with count observations, by replacing (4) with p(v_t|h_t, v_{t−1}) = ∏_{m=1}^M y_{mt}^{v_{mt}}, where

y_{mt} = exp(w_{2m}^⊤ h_t + w_{4m}^⊤ v_{t−1} + c_m) / Σ_{m'=1}^M exp(w_{2m'}^⊤ h_t + w_{4m'}^⊤ v_{t−1} + c_{m'}).    (6)

This formulation is related to the Replicated Softmax Model (RSM) described in [19]; however, our approach uses a directed connection from the binary hidden variables to the visible counts, while also learning the dynamics in the count sequences.

Furthermore, rather than assuming that h_t and v_t only depend on h_{t−1} and v_{t−1}, in the experiments we also allow for connections from the past n time steps of the hidden and visible states to the current states, h_t and v_t. A sliding window is then used to go through the sequence to obtain n frames at each time. We refer to n as the order of the model.

2.4 Deep Architecture for Sequence Modeling with TSBNs

Learning the sequential dependencies with the shallow model in (2)-(4) may be restrictive. Therefore, we propose two deep architectures to improve its representational power: (i) adding stochastic hidden layers; (ii) adding deterministic hidden layers. The graphical model for the deep TSBN is shown in Figure 1(c). Specifically, we consider a deep TSBN with hidden layers h_t^{(ℓ)} for t = 1, ..., T and ℓ = 1, ..., L. Assume layer ℓ contains J^{(ℓ)} hidden units; denote the visible layer v_t = h_t^{(0)}, and let h_t^{(L+1)} = 0, for convenience. In order to obtain a proper generative model, the top hidden layer h^{(L)} contains stochastic binary hidden variables. For the middle layers, ℓ = 1, . . .
, L−1, if stochastic hidden layers are utilized, the generative process is expressed as

p(h_t^{(ℓ)} | h_t^{(ℓ+1)}, h_{t−1}^{(ℓ)}, h_{t−1}^{(ℓ−1)}) = ∏_{j=1}^{J^{(ℓ)}} p(h_{jt}^{(ℓ)} | h_t^{(ℓ+1)}, h_{t−1}^{(ℓ)}, h_{t−1}^{(ℓ−1)}),

where each conditional distribution is parameterized via a logistic function, as in (4). If deterministic hidden layers are employed, we obtain h_t^{(ℓ)} = f(h_t^{(ℓ+1)}, h_{t−1}^{(ℓ)}, h_{t−1}^{(ℓ−1)}), where f(·) is chosen to be a rectified linear function. Although the differences between these two approaches are minor, the learning and inference algorithms can be quite different, as shown in Section 3.3.

3 Scalable Learning and Inference

Computation of the exact posterior over the hidden variables in (2) is intractable. Approximate Bayesian inference, such as Gibbs sampling or mean-field variational Bayes (VB) inference, can be implemented [15, 16]. However, Gibbs sampling is very inefficient, due to the fact that the conditional posterior distribution of the hidden variables does not factorize. Mean-field VB does provide a fully factored variational posterior, but this technique increases the gap between the bound being optimized and the true log-likelihood, potentially resulting in a poor fit to the data. To allow for tractable and scalable inference and parameter learning, without losing the flexibility of the variational posterior, we apply the Neural Variational Inference and Learning (NVIL) algorithm described in [13].

3.1 Variational Lower Bound Objective

We are interested in training the TSBN model, p_θ(V, H), described in (2), with parameters θ. Given an observation V, we introduce a fixed-form distribution, q_φ(H|V), with parameters φ, that approximates the true posterior distribution, p(H|V). 
We then follow the variational principle to derive a lower bound on the marginal log-likelihood, expressed as¹

L(V, θ, φ) = E_{q_φ(H|V)}[log p_θ(V, H) − log q_φ(H|V)].    (7)

We construct the approximate posterior q_φ(H|V) as a recognition model. By using this, we avoid the need to compute variational parameters per data point; instead, we compute a set of parameters φ used for all V. In order to achieve fast inference, the recognition model is expressed as

q_φ(H|V) = q(h_1|v_1) · ∏_{t=2}^T q(h_t|h_{t−1}, v_t, v_{t−1}),    (8)

and each conditional distribution is specified as

q(h_{jt} = 1 | h_{t−1}, v_t, v_{t−1}) = σ(u_{1j}^⊤ h_{t−1} + u_{2j}^⊤ v_t + u_{3j}^⊤ v_{t−1} + d_j),    (9)

where h_0 and v_0, for q(h_1|v_1), are defined as zero vectors. The recognition parameters φ are specified as U_1 ∈ R^{J×J}, U_2 ∈ R^{J×M}, U_3 ∈ R^{J×M}. For i = 1, 2, 3, u_{ij} is the transpose of the j-th row of U_i, and d = [d_1, ..., d_J]^⊤ is the bias term. The graphical model is shown in Figure 1(b).

The recognition model defined in (9) has the same form as the approximate inference used for the TRBM [8]. Exact inference for our model consists of a forward and backward pass through the entire sequence, which requires traversing each possible hidden state. Our feedforward approximation allows the inference procedure to be fast and implemented in an online fashion.

3.2 Parameter Learning

To optimize (7), we utilize Monte Carlo methods to approximate expectations and stochastic gradient descent (SGD) for parameter optimization. 
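A single-sample Monte Carlo estimate of the bound in (7) samples H from the recognition model (8)-(9) and evaluates log p_θ(V, H) − log q_φ(H|V). The following NumPy sketch does this for the shallow TSBN under toy, assumed dimensions (an illustration, not the authors' implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bern_logpmf(x, p, eps=1e-12):
    # summed elementwise log Bernoulli mass
    return float(np.sum(x * np.log(p + eps) + (1 - x) * np.log(1 - p + eps)))

def elbo_sample(V, theta, phi, rng):
    """Single-sample estimate of (7): draw H ~ q_phi(H|V), return
    log p_theta(V, H) - log q_phi(H|V)."""
    W1, W2, W3, W4, b, c = theta
    U1, U2, U3, d = phi
    T, M = V.shape
    J = b.shape[0]
    h_prev, v_prev = np.zeros(J), np.zeros(M)
    log_p = log_q = 0.0
    for t in range(T):
        q_t = sigmoid(U1 @ h_prev + U2 @ V[t] + U3 @ v_prev + d)   # eq. (9)
        h = (rng.random(J) < q_t).astype(float)
        log_q += bern_logpmf(h, q_t)
        p_h = sigmoid(W1 @ h_prev + W3 @ v_prev + b)               # eq. (3)
        p_v = sigmoid(W2 @ h + W4 @ v_prev + c)                    # eq. (4)
        log_p += bern_logpmf(h, p_h) + bern_logpmf(V[t], p_v)
        h_prev, v_prev = h, V[t]
    return log_p - log_q

rng = np.random.default_rng(0)
J, M, T = 4, 6, 5                # toy sizes (assumption)
theta = (rng.normal(0, 0.1, (J, J)), rng.normal(0, 0.1, (M, J)),
         rng.normal(0, 0.1, (J, M)), rng.normal(0, 0.1, (M, M)),
         np.zeros(J), np.zeros(M))
phi = (rng.normal(0, 0.1, (J, J)), rng.normal(0, 0.1, (J, M)),
       rng.normal(0, 0.1, (J, M)), np.zeros(J))
V = (rng.random((T, M)) < 0.5).astype(float)
bound = elbo_sample(V, theta, phi, rng)
```

Averaging `elbo_sample` over many draws approximates the lower bound (7) itself.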
The gradients can be expressed as

∇_θ L(V) = E_{q_φ(H|V)}[∇_θ log p_θ(V, H)],    (10)
∇_φ L(V) = E_{q_φ(H|V)}[(log p_θ(V, H) − log q_φ(H|V)) × ∇_φ log q_φ(H|V)].    (11)

¹This lower bound is equivalent to the marginal log-likelihood if q_φ(H|V) = p(H|V).

Specifically, in the TSBN model, if we define v̂_{mt} = σ(w_{2m}^⊤ h_t + w_{4m}^⊤ v_{t−1} + c_m) and ĥ_{jt} = σ(u_{1j}^⊤ h_{t−1} + u_{2j}^⊤ v_t + u_{3j}^⊤ v_{t−1} + d_j), the gradients for w_{2m} and u_{2j} can be calculated as

∂ log p_θ(V, H)/∂w_{2mj} = Σ_{t=1}^T (v_{mt} − v̂_{mt}) · h_{jt},    ∂ log q_φ(H|V)/∂u_{2jm} = Σ_{t=1}^T (h_{jt} − ĥ_{jt}) · v_{mt}.    (12)

Other update equations, along with the learning details for the TSBN variants in Section 2.3, are provided in Supplementary Section B. We observe that the gradients in (10) and (11) share many similarities with the wake-sleep algorithm [20]. Wake-sleep alternates between updating θ in the wake phase and updating φ in the sleep phase. The update of θ is based on the samples generated from q_φ(H|V), and is identical to (10). However, in contrast to (11), the recognition parameters φ are estimated from samples generated by the model, i.e., ∇_φ L(V) = E_{p_θ(V,H)}[∇_φ log q_φ(H|V)]. This update does not optimize the same objective as in (10), hence the wake-sleep algorithm is not guaranteed to converge [13].

Inspecting (11), we see that we are using l_φ(V, H) = log p_θ(V, H) − log q_φ(H|V) as the learning signal for the recognition parameters φ. The expectation of this learning signal is exactly the lower bound (7), which is easy to evaluate. However, this tractability makes the estimated gradients of the recognition parameters very noisy. 
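The summed-outer-product form of (12) can be vectorized over time. A NumPy sketch (toy sizes and random values are assumptions) computes ∂ log p_θ(V, H)/∂W_2 as a single matrix product:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# toy quantities (assumptions): T steps of visibles V and sampled hiddens H
rng = np.random.default_rng(2)
T, M, J = 7, 6, 4
V = (rng.random((T, M)) < 0.5).astype(float)
H = (rng.random((T, J)) < 0.5).astype(float)
W2 = rng.normal(0, 0.1, (M, J))
W4 = rng.normal(0, 0.1, (M, M))
c = np.zeros(M)

# v_hat_t = sigma(W2 h_t + W4 v_{t-1} + c), with v_0 = 0
V_prev = np.vstack([np.zeros(M), V[:-1]])
V_hat = sigmoid(H @ W2.T + V_prev @ W4.T + c)

# eq. (12): d log p / d W2 = sum_t (v_t - v_hat_t) h_t^T
grad_W2 = (V - V_hat).T @ H
```

The gradient for U_2 in (12) has the same structure, with the roles of the hidden and visible units exchanged.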
In order to make the algorithm practical, we employ the variance-reduction techniques proposed in [13], namely: (i) centering the learning signal, by subtracting a data-independent baseline and a data-dependent baseline; (ii) variance normalization, dividing the centered learning signal by a running estimate of its standard deviation. The data-dependent baseline is implemented using a neural network. Additionally, RMSprop [21], a form of SGD where the gradients are adaptively rescaled by a running average of their recent magnitudes, was found in practice to be important for fast convergence, and was thus utilized throughout all the experiments. The outline of the NVIL algorithm is provided in Supplementary Section A.

3.3 Extension to Deep Models

The recognition model corresponding to the deep TSBN is shown in Figure 1(d). Two kinds of deep architectures are discussed in Section 2.4. We illustrate the differences in their learning algorithms in two respects: (i) the calculation of the lower bound; and (ii) the calculation of the gradients. The top hidden layer is stochastic. If the middle hidden layers are also stochastic, the calculation of the lower bound is more involved than for the shallow model; however, the gradient evaluation remains as simple as in (12). On the other hand, if deterministic middle hidden layers (i.e., recurrent neural networks) are employed, the lower bound objective stays the same as for a shallow model, since the only stochasticity in the generative process lies in the top layer; however, the gradients have to be calculated recursively, through the back-propagation through time algorithm [22]. All details are provided in Supplementary Section C.

4 Related Work

The RBM has been widely used as a building block to learn the sequential dependencies in time-series data, e.g., the conditional-RBM-related models [7, 23] and the temporal RBM [8]. 
To make exact inference possible, the recurrent temporal RBM was also proposed [9], and was further extended to learn the dependency structure within observations [11].

In the work reported here, we focus on modeling sequences based on the SBN [16], which has recently been shown to have the potential to build deep generative models [13, 15, 24]. Our work serves as another extension of the SBN that can be utilized to model time-series data. Similar ideas have also been considered in [25] and [26]. However, in [25] the authors focus on grammar learning and use a feed-forward approximation of mean-field VB to carry out the inference, while in [26] the wake-sleep algorithm was developed. We apply the model in a different scenario, and develop a fast and scalable inference algorithm, based on the idea of training a recognition model by leveraging the stochastic gradient of the variational bound.

There exist two main methods for the training of recognition models. The first, termed Stochastic Gradient Variational Bayes (SGVB), is based on a reparameterization trick [12, 14], which can only be employed in models with continuous latent variables, e.g., the variational auto-encoder [12] and all the recent recurrent extensions of it [27, 28, 29]. The second, called Neural Variational Inference and Learning (NVIL), is based on the log-derivative trick [13], which is more general and is also applicable to models with discrete random variables. 

Figure 2: (Left) Dictionaries learned using the HMSBN for the videos of bouncing balls. (Middle) Samples generated from the HMSBN trained on the polyphonic music. Each column is a sample vector of notes. (Right) Time evolution from 1790 to 2014 for three selected topics learned from the STU dataset. Plotted values represent normalized probabilities that the topic appears in a given year. Best viewed electronically.

The NVIL algorithm has previously been applied to the training of the SBN in [13]. Our approach serves as a new application of this algorithm to an SBN-based time-series model.

5 Experiments

We present experimental results on four publicly available datasets: bouncing balls [9], polyphonic music [10], motion capture [7] and State of the Union [30]. To assess the performance of the TSBN model, we show sequences generated from the model, and report the average log-probability that the model assigns to a test sequence, as well as the average squared one-step-ahead prediction error per frame. Code is available at https://github.com/zhegan27/TSBN_code_NIPS2015.

The TSBN model with W_3 = 0 and W_4 = 0 is denoted Hidden Markov SBN (HMSBN); the deep TSBN with stochastic hidden layers is denoted DTSBN-S, and the deep TSBN with deterministic hidden layers is denoted DTSBN-D.

Model parameters were initialized by sampling randomly from N(0, 0.001²I), except for the bias parameters, which were initialized to 0. The TSBN model is trained using a variant of RMSprop [6], with momentum of 0.9 and a constant learning rate of 10⁻⁴. The decay over the root mean squared gradients is set to 0.95. The maximum number of iterations we use is 10⁵. The gradient estimates were computed using a single sample from the recognition model. The only regularization we used was a weight decay of 10⁻⁴. The data-dependent baseline was implemented using a neural network with a single hidden layer of 100 tanh units.

For the prediction of v_t given v_{1:t−1}, we (i) first obtain a sample from q_φ(h_{1:t−1}|v_{1:t−1}); (ii) calculate the conditional posterior p_θ(h_t|h_{1:t−1}, v_{1:t−1}) of the current hidden state; (iii) make a prediction for v_t using p_θ(v_t|h_{1:t}, v_{1:t−1}). On the other hand, synthesizing samples is conceptually simpler. 
Sequences can be readily generated from the model using ancestral sampling.

5.1 Bouncing balls dataset

We conducted the first experiment on synthetic videos of 3 bouncing balls, where pixels are binary valued. We followed the procedure in [9], and generated 4000 videos for training and another 200 videos for testing. Each video is of length 100 and of resolution 30 × 30.

The dictionaries learned using the HMSBN are shown in Figure 2 (Left). Compared with previous work [9, 10], our learned bases are more spatially localized. In Table 1, we compare the average squared prediction error per frame over the 200 test videos with the recurrent temporal RBM (RTRBM) and the structured RTRBM (SRTRBM). As can be seen, our approach achieves better performance than the baselines in the literature. Furthermore, we observe that a high-order TSBN reduces the prediction error significantly, compared with an order-one TSBN. This is due to the fact that by using a high-order TSBN, more information about the past is conveyed. We also examine the advantage of employing deep models: using stochastic or deterministic hidden layers improves performance. More results, including log-likelihoods, are provided in Supplementary Section D.

Table 1: Average prediction error for the bouncing balls dataset. (♦) taken from [11].

MODEL      DIM       ORDER   PRED. ERR.
DTSBN-S    100-100   2       2.79 ± 0.39
DTSBN-D    100-100   2       2.99 ± 0.42
TSBN       100       4       3.07 ± 0.40
TSBN       100       1       9.48 ± 0.38
RTRBM♦     3750      1       3.88 ± 0.33
SRTRBM♦    3750      1       3.31 ± 0.33

5.2 Motion capture dataset

In this experiment, we used the CMU motion capture dataset, which consists of measured joint angles for different motion types. We used the 33 running and walking sequences of subject 35 (23 walking sequences and 10 running sequences). We followed the preprocessing procedure of [11], after which we were left with 58 joint angles. We partitioned the 33 sequences into a training set of 31 sequences and a testing set of 2 sequences (one walking and one running).

We averaged the prediction error over 100 trials, as reported in Table 2. The TSBN we implemented is of size 100 in each hidden layer and order 1. It can be seen that the TSBN-based models improve significantly over the Gaussian (G-)RTRBM and the spike-and-slab (SS-)SRTRBM.

Table 2: Average prediction error obtained for the motion capture dataset. (♦) taken from [11].

MODEL        RUNNING        WALKING
DTSBN-S      2.56 ± 0.40    4.40 ± 0.28
DTSBN-D      2.84 ± 0.01    4.62 ± 0.01
TSBN         4.85 ± 1.26    5.12 ± 0.50
HMSBN        7.39 ± 0.47    10.77 ± 1.15
SS-SRTRBM♦   5.88 ± 0.05    8.13 ± 0.06
G-RTRBM♦     10.91 ± 0.27   14.41 ± 0.38

Figure 3: Motion trajectories generated from the HMSBN trained on the motion capture dataset. (Left) Walking. (Middle) Running-running-walking. (Right) Running-walking.

Another popular motion capture dataset is the MIT dataset². To further demonstrate the directed, generative nature of our model, we give our trained HMSBN model different initializations, and show generated, synthetic data and the transitions between different motion styles in Figure 3. 
These generated data are readily produced from the model and demonstrate realistic behavior. The smooth trajectories are walking movements, while the vibrating ones are running. Corresponding video files (AVI) are provided as mocap 1, 2 and 3 in the Supplementary Material.

5.3 Polyphonic music dataset

The third experiment is based on four different polyphonic music sequences of piano [10], i.e., Piano-midi.de (Piano), Nottingham (Nott), MuseData (Muse) and JSB chorales (JSB). Each of these datasets is represented as a collection of 88-dimensional binary sequences, spanning the whole range of piano from A0 to C8.

The samples generated from the trained HMSBN model are shown in Figure 2 (Middle). As can be seen, different styles of polyphonic music are synthesized. The corresponding MIDI files are provided as music 1 and 2 in the Supplementary Material. Our model has the ability to learn basic harmony rules and local temporal coherence; however, long-term structure and musical melody remain elusive. The variational lower bound, along with the estimated log-likelihoods from [10], are presented in Table 3. The TSBN we implemented is of size 100 and order 1. Empirically, adding layers did not improve performance on this dataset, hence no such results are reported. The results of RNN-NADE and RTRBM [10] were obtained with only 100 runs of annealed importance sampling, which has the potential to overestimate the true log-likelihood; our variational lower bound provides a more conservative estimate. Still, our performance is better than that of the RNN.

²Quantitative results on the MIT dataset are provided in Supplementary Section D.

Table 3: Test log-likelihood for the polyphonic music dataset. (♦) taken from [10].

MODEL       PIANO.   NOTT.   MUSE.   JSB.
TSBN        -7.98    -3.67   -6.81   -7.48
RNN-NADE♦   -7.05    -2.31   -5.60   -5.56
RTRBM♦      -7.36    -2.62   -6.35   -6.35
RNN♦        -8.37    -4.46   -8.13   -8.71

Table 4: Average prediction precision for STU. (♦) taken from [31].

MODEL      DIM     MP              PP
HMSBN      25      0.327 ± 0.002   0.353 ± 0.070
DHMSBN-S   25-25   0.299 ± 0.001   0.378 ± 0.006
GP-DPFA♦   100     0.223 ± 0.001   0.189 ± 0.003
DRFM♦      25      0.217 ± 0.003   0.177 ± 0.010

5.4 State of the Union dataset

The State of the Union (STU) dataset contains the transcripts of T = 225 US State of the Union addresses, from 1790 to 2014. Two tasks are considered: prediction and dynamic topic modeling.

Prediction  The prediction task is concerned with estimating the held-out words. We employ the setup in [31]. After removing stop words and terms that occur fewer than 7 times in one document or fewer than 20 times overall, there are 2375 unique words. The entire data of the last year is held out. For the documents in the previous years, we randomly partition the words of each document into an 80%/20% split. The model is trained on the 80% portion, and the remaining 20% held-out words are used to test the prediction at each year. The words in both held-out sets are ranked according to the probability estimated from (6).

To evaluate the prediction performance, we calculate the precision @top-M as in [31], which is given by the fraction of the top-M words predicted by the model that match the true ranking of the word counts; M = 50 is used. Two recent works are compared: GP-DPFA [31] and DRFM [30]. The results are summarized in Table 4. Our model is of order 1. The column MP denotes the mean precision over all the years that appear in the training set. The column PP denotes the predictive precision for the final year. 
Our model achieves significant improvements in both scenarios.

Dynamic Topic Modeling  The setup described in [30] is employed, and the number of topics is 200. To understand the temporal dynamics per topic, three topics are selected, and the normalized probability that a topic appears in each year is shown in Figure 2 (Right). Their associated top 6 words per topic are shown in Table 5. The learned trajectories exhibit different temporal patterns across the topics. Clearly, we can identify jumps associated with some key historical events. For instance, for Topic 29, we observe a positive jump in 1986 related to the military and paramilitary activities in and against Nicaragua brought by the U.S. Topic 30 is related to war: the War of 1812, World War II and the Iraq War all spike up in their corresponding years. In Topic 130, we observe consistent positive jumps from 1890 to 1920, when the American revolution was taking place. Three other interesting topics are also shown in Table 5. Topic 64 appears to be related to education, Topic 70 is about Iraq, and Topic 74 concerns the Axis and World War II. We note that the words for these topics are explicitly related to these matters.

Table 5: Top 6 most probable words associated with the STU topics.

Topic #29   Topic #30   Topic #130   Topic #64     Topic #70   Topic #74
family      officer     government   generations   Iraqi       Philippines
budget      civilized   country      generation    Qaida       islands
Nicaragua   warfare     public       recognize     Iraq        axis
free        enemy       law          brave         Iraqis      Nazis
future      whilst      present      crime         AI          Japanese
freedom     gained      citizens     race          Saddam      Germans

6 Conclusion

We have presented Deep Temporal Sigmoid Belief Networks, an extension of the SBN, that models the temporal dependencies in high-dimensional sequences. To allow for scalable inference and learning, an efficient variational optimization algorithm is developed. 
Experimental results on several datasets show that the proposed approach obtains superior predictive performance, and synthesizes interesting sequences.

In this work, we have investigated the modeling of different types of data individually. One interesting direction for future work is to combine them into a unified framework for dynamic multi-modality learning. Furthermore, we can use high-order optimization methods to speed up inference [32].

Acknowledgements  This research was supported in part by ARO, DARPA, DOE, NGA and ONR.

References

[1] L. Rabiner and B. Juang. An introduction to hidden Markov models. In ASSP Magazine, IEEE, 1986.
[2] R. Kalman. Mathematical description of linear dynamical systems. In J. of the Society for Industrial & Applied Mathematics, Series A: Control, 1963.
[3] M. Hermans and B. Schrauwen. Training and analysing deep recurrent neural networks. In NIPS, 2013.
[4] J. Martens and I. Sutskever. Learning recurrent neural networks with Hessian-free optimization. In ICML, 2011.
[5] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In ICML, 2013.
[6] A. Graves. Generating sequences with recurrent neural networks. In arXiv:1308.0850, 2013.
[7] G. Taylor, G. Hinton, and S. Roweis. Modeling human motion using binary latent variables. In NIPS, 2006.
[8] I. Sutskever and G. Hinton. Learning multilevel distributed representations for high-dimensional sequences. In AISTATS, 2007.
[9] I. Sutskever, G. Hinton, and G. Taylor. The recurrent temporal restricted Boltzmann machine. In NIPS, 2009.
[10] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In ICML, 2012.
[11] R. Mittelman, B. Kuipers, S. Savarese, and H. Lee. Structured recurrent temporal restricted Boltzmann machines. In ICML, 2014.
[12] D. P. Kingma and M.
Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[13] A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. In ICML, 2014.
[14] D. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
[15] Z. Gan, R. Henao, D. Carlson, and L. Carin. Learning deep sigmoid belief networks with data augmentation. In AISTATS, 2015.
[16] R. Neal. Connectionist learning of belief networks. In Artificial Intelligence, 1992.
[17] G. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. In Neural Computation, 2006.
[18] G. Hinton. Training products of experts by minimizing contrastive divergence. In Neural Computation, 2002.
[19] G. Hinton and R. Salakhutdinov. Replicated softmax: an undirected topic model. In NIPS, 2009.
[20] G. Hinton, P. Dayan, B. Frey, and R. Neal. The "wake-sleep" algorithm for unsupervised neural networks. In Science, 1995.
[21] T. Tieleman and G. Hinton. Lecture 6.5-RMSprop: Divide the gradient by a running average of its recent magnitude. In COURSERA: Neural Networks for Machine Learning, 2012.
[22] P. Werbos. Backpropagation through time: what it does and how to do it. In Proc. of the IEEE, 1990.
[23] G. Taylor and G. Hinton. Factored conditional restricted Boltzmann machines for modeling motion style. In ICML, 2009.
[24] Z. Gan, C. Chen, R. Henao, D. Carlson, and L. Carin. Scalable deep Poisson factor analysis for topic modeling. In ICML, 2015.
[25] J. Henderson and I. Titov. Incremental sigmoid belief networks for grammar learning. In JMLR, 2010.
[26] G. Hinton, P. Dayan, A. To, and R. Neal. The Helmholtz machine through time. In Proc. of the ICANN, 1995.
[27] J. Bayer and C. Osendorfer. Learning stochastic recurrent networks. In arXiv:1411.7610, 2014.
[28] O. Fabius, J. R. van Amersfoort, and D. P. Kingma.
Variational recurrent auto-encoders. In arXiv:1412.6581, 2014.
[29] J. Chung, K. Kastner, L. Dinh, K. Goel, A. Courville, and Y. Bengio. A recurrent latent variable model for sequential data. In NIPS, 2015.
[30] S. Han, L. Du, E. Salazar, and L. Carin. Dynamic rank factor model for text streams. In NIPS, 2014.
[31] A. Acharya, J. Ghosh, and M. Zhou. Nonparametric Bayesian factor analysis for dynamic count matrices. In AISTATS, 2015.
[32] K. Fan, Z. Wang, J. Kwok, and K. Heller. Fast second-order stochastic backpropagation for variational inference. In NIPS, 2015.