{"title": "Stochastic variational inference for hidden Markov models", "book": "Advances in Neural Information Processing Systems", "page_first": 3599, "page_last": 3607, "abstract": "Variational inference algorithms have proven successful for Bayesian analysis in large data settings, with recent advances using stochastic variational inference (SVI). However, such methods have largely been studied in independent or exchangeable data settings. We develop an SVI algorithm to learn the parameters of hidden Markov models (HMMs) in a time-dependent data setting. The challenge in applying stochastic optimization in this setting arises from dependencies in the chain, which must be broken to consider minibatches of observations. We propose an algorithm that harnesses the memory decay of the chain to adaptively bound errors arising from edge effects. We demonstrate the effectiveness of our algorithm on synthetic experiments and a large genomics dataset where a batch algorithm is computationally infeasible.", "full_text": "Stochastic Variational Inference for Hidden Markov\n\nModels\n\nNicholas J. Foti\u2020, Jason Xu\u2020, Dillon Laird, and Emily B. Fox\n\n{nfoti@stat,jasonxu@stat,dillonl2@cs,ebfox@stat}.washington.edu\n\nUniversity of Washington\n\nAbstract\n\nVariational inference algorithms have proven successful for Bayesian analysis\nin large data settings, with recent advances using stochastic variational infer-\nence (SVI). However, such methods have largely been studied in independent or\nexchangeable data settings. We develop an SVI algorithm to learn the parameters\nof hidden Markov models (HMMs) in a time-dependent data setting. The chal-\nlenge in applying stochastic optimization in this setting arises from dependencies\nin the chain, which must be broken to consider minibatches of observations. We\npropose an algorithm that harnesses the memory decay of the chain to adaptively\nbound errors arising from edge effects. 
We demonstrate the effectiveness of our\nalgorithm on synthetic experiments and a large genomics dataset where a batch\nalgorithm is computationally infeasible.\n\n1\n\nIntroduction\n\nModern data analysis has seen an explosion in the size of the datasets available to analyze. Signi\ufb01-\ncant progress has been made scaling machine learning algorithms to these massive datasets based on\noptimization procedures [1, 2, 3]. For example, stochastic gradient descent employs noisy estimates\nof the gradient based on minibatches of data, avoiding a costly gradient computation using the full\ndataset [4]. There is considerable interest in leveraging these methods for Bayesian inference since\ntraditional algorithms such as Markov chain Monte Carlo (MCMC) scale poorly to large datasets,\nthough subset-based MCMC methods have been recently proposed as well [5, 6, 7, 8].\nVariational Bayes (VB) casts posterior inference as a tractable optimization problem by minimizing\nthe Kullback-Leibler divergence between the target posterior and a family of simpler variational\ndistributions. Thus, VB provides a natural framework to incorporate ideas from stochastic opti-\nmization to perform scalable Bayesian inference. Indeed, a scalable modi\ufb01cation to VB harnessing\nstochastic gradients\u2014stochastic variational inference (SVI)\u2014has recently been applied to a variety\nof Bayesian latent variable models [9, 10]. Minibatch-based VB methods have also proven effective\nin a streaming setting where data arrives sequentially [11].\nHowever, these algorithms have been developed assuming independent or exchangeable data. One\nexception is the SVI algorithm for the mixed-membership stochastic block model [12], but indepen-\ndence at the level of the generative model must be exploited. 
SVI for Bayesian time series including\nHMMs was recently considered in settings where each minibatch is a set of independent series [13],\nthough in this setting again dependencies do not need to be broken.\nIn contrast, we are interested in applying SVI to very long time series. As a motivating example,\nconsider the application in Sec. 4 of a genomics dataset consisting of T = 250 million observa-\ntions in 12 dimensions modeled via an HMM to learn human chromatin structure. An analysis of\nthe entire sequence is computationally prohibitive using standard Bayesian inference techniques for\n\n\u2020 Co-\ufb01rst authors contributed equally to this work.\n\n1\n\n\fHMMs due to a per-iteration complexity linear in T . Unfortunately, despite the simple chain-based\ndependence structure, applying a minibatch-based method is not obvious. In particular, there are two\npotential issues immediately arising in sampling subchains as minibatches: (1) the subsequences are\nnot mutually independent, and (2) updating the latent variables in the subchain ignores the data\noutside of the subchain introducing error. We show that for (1), appropriately scaling the noisy sub-\nchain gradients preserves unbiased gradient estimates. To address (2), we propose an approximate\nmessage-passing scheme that adaptively bounds error by accounting for memory decay of the chain.\nWe prove that our proposed SVIHMM algorithm converges to a local mode of the batch objective,\nand empirically demonstrate similar performance to batch VB in signi\ufb01cantly less time on syn-\nthetic datasets. We then consider our genomics application and show that SVIHMM allows ef\ufb01cient\nBayesian inference on this massive dataset where batch inference is computationally infeasible.\n\n2 Background\n\n2.1 Hidden Markov models\n\nHidden Markov models (HMMs) [14] are a class of discrete-time doubly stochastic processes con-\nsisting of observations yt and latent states xt \u2208 {1, . . . 
, K} generated by a discrete-valued Markov chain. Specifically, for y = (y1, . . . , yT) and x = (x1, . . . , xT), the joint distribution factorizes as

p(x, y) = π0(x1) p(y1|x1) ∏_{t=2}^{T} p(xt | xt−1, A) p(yt | xt, φ),   (1)

where A = [Aij]_{i,j=1}^{K} is the transition matrix with Aij = Pr(xt = j | xt−1 = i), φ = {φk}_{k=1}^{K} the emission parameters, and π0 the initial distribution. We denote the set of HMM parameters as θ = (π0, A, φ). We assume that the underlying chain is irreducible and aperiodic so that a stationary distribution π exists and is unique. Furthermore, we assume that we observe the sequence at stationarity so that π0 = π, where π is given by the leading left-eigenvector of A. As such, we do not seek to learn π0 in the setting of observing a single realization of a long chain.
We specify conjugate Dirichlet priors on the rows of the transition matrix as

p(A) = ∏_{j=1}^{K} Dir(Aj: | α_j^A).   (2)

Here, Dir(π | α) denotes a K-dimensional Dirichlet distribution with concentration parameters α. Although our methods are more broadly applicable, we focus on HMMs with multivariate Gaussian emissions where φk = {μk, Σk}, with conjugate normal-inverse-Wishart (NIW) prior:

yt | xt ∼ N(yt | μ_{xt}, Σ_{xt}),   φk = (μk, Σk) ∼ NIW(μ0, κ0, Σ0, ν0).   (3)

For simplicity, we suppress dependence on θ and write π(x1), p(xt|xt−1), and p(yt|xt) throughout.

2.2 Structured mean-field VB for HMMs

We are interested in the posterior distribution of the state sequence and parameters given an observation sequence, denoted p(x, θ|y). 
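Concretely, the generative process in Eqs. (1)-(3) can be sketched in a few lines. This is a minimal sketch with hypothetical toy parameters (a 2-state chain and fixed 1-D Gaussian emissions in place of draws from the NIW prior), not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy HMM: K = 2 latent states, 1-D Gaussian emissions (hypothetical parameters).
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])           # A[i, j] = Pr(x_t = j | x_{t-1} = i)
pi = np.array([2 / 3, 1 / 3])        # stationary distribution (leading left-eigenvector of A)
mu, sigma = np.array([-2.0, 2.0]), np.array([1.0, 1.0])

def sample_hmm(T):
    """Draw (x, y) from the joint distribution in Eq. (1), started at stationarity."""
    x = np.empty(T, dtype=int)
    x[0] = rng.choice(2, p=pi)
    for t in range(1, T):
        x[t] = rng.choice(2, p=A[x[t - 1]])
    y = rng.normal(mu[x], sigma[x])
    return x, y

def log_joint(x, y):
    """log p(x, y) = log pi(x_1) + sum_t log A_{x_{t-1}, x_t} + sum_t log N(y_t | mu_{x_t}, sigma_{x_t})."""
    lp = np.log(pi[x[0]])
    lp += np.log(A[x[:-1], x[1:]]).sum()
    lp += (-0.5 * ((y - mu[x]) / sigma[x]) ** 2
           - np.log(sigma[x] * np.sqrt(2 * np.pi))).sum()
    return lp
```

Note that pi here satisfies pi = pi A, matching the stationarity assumption above.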
While evaluating marginal likelihoods, p(y|θ), and most probable state sequences, arg max_x p(x|y, θ), are tractable via the forward-backward (FB) algorithm when parameter values θ are fixed [14], exact computation of the posterior is intractable for HMMs. Markov chain Monte Carlo (MCMC) provides a widely used sampling-based approach to posterior inference in HMMs [15, 16]. We instead focus on variational Bayes (VB), an optimization-based approach that approximates p(x, θ|y) by a variational distribution q(θ, x) within a simpler family. Typically, for HMMs a structured mean-field approximation is considered:

q(θ, x) = q(A)q(φ)q(x),   (4)

breaking dependencies only between the parameters θ = {A, φ} and latent state sequence x [17]. Note that making a full mean-field assumption in which q(x) = ∏_{i=1}^{T} q(xi) loses crucial information about the latent chain needed for accurate inference.

Each factor in Eq. (4) is endowed with its own variational parameter and is set to be in the same exponential family distribution as its respective complete conditional. The variational parameters are optimized to maximize the evidence lower bound (ELBO) L:

ln p(y) ≥ Eq[ln p(θ)] − Eq[ln q(θ)] + Eq[ln p(y, x|θ)] − Eq[ln q(x)] := L(q(θ), q(x)).   (5)

Maximizing L is equivalent to minimizing the KL divergence KL(q(x, θ) || p(x, θ|y)) [18]. In practice, we alternate updating the global parameters θ (those coupled to the entire set of observations) and the local variables {xt} (a variable corresponding to each observation yt). Details on computing the terms in the equations and algorithms that follow are in the Supplement.
The global update is derived by differentiating L with respect to the global variational parameters [17]. 
Assuming a conjugate exponential family leads to a simple coordinate ascent update [9]:

w = u + E_{q(x)}[t(x, y)].   (6)

Here, t(x, y) denotes the vector of sufficient statistics, and w = (wA, wφ) and u = (uA, uφ) the variational parameters and model hyperparameters, respectively, in natural parameter form.
The local update is derived analogously, yielding the optimal variational distribution over the latent sequence:

q*(x) ∝ exp( E_{q(A)}[ln π(x1)] + Σ_{t=2}^{T} E_{q(A)}[ln A_{xt−1,xt}] + Σ_{t=1}^{T} E_{q(φ)}[ln p(yt|xt)] ).   (7)

Compare with Eq. (1). Here, we have replaced probabilities by exponentiated expected log probabilities under the current variational distribution. To determine the optimal q*(x) in Eq. (7), define:

Ã_{j,k} := exp( E_{q(A)}[ln A_{j,k}] ),   p̃(yt|xt = k) := exp( E_{q(φ)}[ln p(yt|xt = k)] ).   (8)

We estimate π with π̂, the leading eigenvector of E_{q(A)}[A]. We then use π̂, Ã = (Ã_{j,k}), and p̃ = {p̃(yt|xt = k), k = 1, . . . , K, t = 1, . . . , T} to run a forward-backward algorithm, producing forward messages α and backward messages β which allow us to compute q*(xt = k) and q*(xt−1 = j, xt = k) [19, 17]. See the Supplement.

2.3 Stochastic variational inference for non-sequential models

Even in non-sequential models, the batch VB algorithm requires an entire pass through the dataset for each update of the global parameters. 
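The local step just described (Eqs. (7)-(8)) amounts to a standard forward-backward pass run with the sub-normalized quantities π̂, Ã, p̃. A minimal normalized-message sketch, with a Dirichlet expected-log helper for Ã (toy inputs; function names are ours, not the authors'):

```python
import numpy as np
from scipy.special import digamma

def expected_log_dirichlet(w):
    """E_q[ln A_{j,k}] for independent Dirichlet rows with parameters w (cf. Eq. (8))."""
    return digamma(w) - digamma(w.sum(axis=1, keepdims=True))

def forward_backward(pi_hat, A_tilde, p_tilde):
    """Smoothed marginals q*(x_t = k) from sub-normalized parameters:
    A_tilde = exp E[ln A] and p_tilde (T x K) = exp E[ln p(y_t | x_t)]."""
    T, K = p_tilde.shape
    alpha = np.empty((T, K)); beta = np.ones((T, K)); c = np.empty(T)
    alpha[0] = pi_hat * p_tilde[0]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):                        # forward (filtering) pass
        alpha[t] = (alpha[t - 1] @ A_tilde) * p_tilde[t]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    for t in range(T - 2, -1, -1):               # backward (smoothing) pass
        beta[t] = (A_tilde @ (p_tilde[t + 1] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```

Because Ã exponentiates an expectation of a log, its rows sum to less than one; the per-step normalizers c absorb this, so the returned marginals are proper distributions.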
This can be costly in large datasets, and wasteful when\nlocal-variable passes are based on uninformed initializations of the global parameters or when many\ndata points contain redundant information.\nTo cope with this computational challenge, stochastic variational inference (SVI) [9] leverages a\nRobbins-Monro algorithm [1] to optimize the ELBO via stochastic gradient ascent. When the data\nare independent, the ELBO in Eq. (5) can be expressed as\n\ni=1\n\nL = Eq(\u03b8) [ln p(\u03b8)] \u2212 Eq(\u03b8) [ln q(\u03b8)] +\n\nEq(xi) [ln p(yi, xi|\u03b8)] \u2212 Eq(x) [ln q(x)] .\n\n(9)\nIf a single observation index s is sampled uniformly s \u223c Unif(1, . . . , T ), the ELBO corresponding\nto (xs, ys) as if it were replicated T times is given by\n\nLs = Eq(\u03b8) [ln p(\u03b8)] \u2212 Eq(\u03b8) [ln q(\u03b8)] + T \u00b7(cid:0)Eq(xs) [ln p(ys, xs|\u03b8)] \u2212 Eq(xs) [ln q(xs)](cid:1) ,\n\n(10)\nand it is clear that Es[Ls] = L. At each iteration n of the SVI algorithm, a data point ys is sampled\nand its local q\u2217(xs) is computed given the current estimate of global variational parameters wn.\nNext, the global update is performed via a noisy, unbiased gradient step (Es[ \u02c6\u2207wLs] = \u2207wL).\nWhen all pairs of distributions in the model are conditionally conjugate, it is cheaper to compute the\n\nstochastic natural gradient,(cid:101)\u2207wLs, which additionally accounts for the information geometry of the\nWe show the form of (cid:101)\u2207wLs in Sec. 3.2, speci\ufb01cally in Eq. (13) with details in the Supplement.\n\nwn+1 = wn + \u03c1n(cid:101)\u2207wLs(wn).\n\ndistribution [9]. The resulting stochastic natural gradient step with step-size \u03c1n is:\n\n(11)\n\nT(cid:88)\n\n3\n\n\f3 Stochastic variational inference for HMMs\n\nThe batch VB algorithm of Sec. 2.2 becomes prohibitively expensive as the length of the chain T\nbecomes large. 
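For reference, one iteration of the independent-data SVI recipe of Sec. 2.3 (Eqs. (10)-(11)) is only a few lines; a minimal sketch in which the hypothetical array t_hats stands in for the per-observation expected sufficient statistics:

```python
import numpy as np

rng = np.random.default_rng(0)

def svi_step(w, u, t_hats, T, rho):
    """One stochastic natural-gradient step (Eqs. (10)-(11)).

    t_hats[s] stands in for E_{q(x_s)}[t(x_s, y_s)]; replicating a uniformly
    sampled term T times keeps the noisy natural gradient unbiased."""
    s = rng.integers(T)                    # s ~ Unif(1, ..., T)
    nat_grad = u + T * t_hats[s] - w       # stochastic natural gradient of L_s
    return w + rho * nat_grad              # equals w*(1 - rho) + rho*(u + T*t_hats[s])
```

The last comment shows why the step can be read as a convex combination of the old parameter and a noisy coordinate-ascent target, the form reused for HMMs below.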
In particular, the forward-backward algorithm in the local step takes O(K 2T ) time.\nInstead, we turn to a subsampling approach, but naively applying SVI from Sec. 2.3 fails in the\nHMM setting: decomposing the sum over local variables into a sum of independent terms as in\nEq. (9) ignores crucial transition counts, equivalent to making a full mean-\ufb01eld approximation.\nExtending SVI to HMMs requires additional considerations due to the dependencies between the ob-\nservations. It is clear that subchains of consecutive observations rather than individual observations\nare necessary to capture the transition structure (see Sec. 3.1). We show that if the local variables\nof each subchain can be exactly optimized, then stochastic gradients computed on subchains can be\nscaled to preserve unbiased estimates of the full gradient (see Sec. 3.2).\nUnfortunately, as we show in Sec. 3.3, the local step becomes approximate due to edge effects:\nlocal variables are incognizant of nodes outside of the subchain during the forward-backward pass.\nAlthough an exact scheme requires message passing along the entire chain, we harness the memory\ndecay of the latent Markov chain to guarantee that local state beliefs in each subchain form an \u0001-\napproximation q\u0001(x) to the full-data beliefs q\u2217(x). We achieve these approximations by adaptively\nbuffering the subchains with extra observations based on current global parameter estimates. We\nthen prove that for \u0001 suf\ufb01ciently small, the noisy gradient computed using q\u0001(x) corresponds to an\nascent direction in L, guaranteeing convergence of our algorithm to a local optimum. We refer to\nour algorithm, which is outlined in Alg. 
1, as SVIHMM.

Algorithm 1 Stochastic Variational Inference for HMMs (SVIHMM)
1: Initialize variational parameters (w0^A, w0^φ) and choose stepsize schedule ρn, n = 1, 2, . . .
2: while (convergence criterion is not met) do
3:   Sample a subchain yS ⊂ {y1, . . . , yT} with S ∼ p(S)
4:   Local step: Compute π̂, Ã, p̃S and run q(xS) = ForwardBackward(yS, π̂, Ã, p̃S).
5:   Global update: w_{n+1} = w_n(1 − ρn) + ρn(u + c^T E_{q(xS)}[t(xS, yS)])
6: end while

3.1 ELBO for subsets of data

Unlike the independent data case (Eq. (9)), the local term in the HMM setting decomposes as

ln p(y, x|θ) = ln π(x1) + Σ_{t=2}^{T} ln A_{xt−1,xt} + Σ_{t=1}^{T} ln p(yt|xt).   (12)

Because of the paired terms in the first sum, it is necessary to consider consecutive observations to learn transition structure. For the SVIHMM algorithm, we define our basic sampling unit as subchains yS = (y1^S, . . . , yL^S), where S refers to the associated indices. We denote the ELBO restricted to yS as LS, and the associated natural gradient as ∇̃_w LS.

3.2 Global update

We detail the global update assuming we have optimized q*(x) exactly (i.e., as in the batch setting), although this assumption will be relaxed as discussed in Sec. 3.3. Paralleling Sec. 2.3, the global SVIHMM step involves updating the global variational parameters w via stochastic (natural) gradient ascent based on q*(xS), the beliefs corresponding to our current subchain S.
Recall from Eq. (10) that the original SVI algorithm maintains E_s[∇̃_w Ls] = ∇̃_w L by scaling the gradient based on an individual observation s by the total number of observations T. 
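The sampling unit and the paired terms in Eq. (12) can be made concrete with a small sketch. Hard state assignments are used purely for illustration here; SVIHMM accumulates these statistics in expectation under q(xS):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_subchain(T, L):
    """Sample the indices S of a length-L subchain uniformly (S ~ p(S))."""
    start = rng.integers(0, T - L + 1)
    return np.arange(start, start + L)

def transition_counts(x_S, K):
    """Sufficient statistics for A from a subchain's state sequence: the L - 1
    paired terms ln A_{x_{t-1}, x_t} of Eq. (12) contribute one count each."""
    C = np.zeros((K, K))
    np.add.at(C, (x_S[:-1], x_S[1:]), 1.0)   # unbuffered accumulation over pairs
    return C
```

Sampling single observations instead of subchains would leave transition_counts empty, which is precisely why consecutive observations are required.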
In the HMM\ncase, we analogously derive a batch factor vector c = (cA, c\u03c6) such that\n\nES[(cid:101)\u2207wLS] = (cid:101)\u2207wL with\n\n(cid:101)\u2207wLS = u + cT Eq\u2217(xS )\n\n(13)\nThe speci\ufb01c form of Eq. (13) for Gaussian emissions is in the Supplement. Now, the Robbins-Monro\naverage in Eq. (11) can be written as\n\n(cid:2)t(xS, yS)(cid:3) \u2212 w.\n\nwn+1 = wn(1 \u2212 \u03c1n) + \u03c1n(u + cT Eq\u2217(xS )[t(xS, yS)]).\n\n(14)\n\n4\n\n\fWhen the noisy natural gradients (cid:101)\u2207wLS are independent and unbiased estimates of the true natural\nas long as step-sizes \u03c1n satisfy(cid:80)\ngradient, the iterates in Eq. (14) converge to a local maximum of L under mild regularity conditions\nn \u03c1n = \u221e [2, 9]. In our case, the noisy gradients\nare necessarily correlated even for independently sampled subchains due to dependence between\nobservations (y1, . . . , yT ). However, as detailed in [20], unbiasedness suf\ufb01ces for convergence of\nEq. (14) to a local mode.\n\nn < \u221e, and(cid:80)\n\nn \u03c12\n\nBatch factor Recalling our assumption of being at stationarity, Eq(\u03c0) ln \u03c0(x1) = Eq(\u03c0) ln \u03c0(xi)\nfor all i. If we sample subchains from the uniform distribution over subchains of length L, denoted\np(S), then we can write\n\n(cid:20)\n\n(cid:21)\n\n(cid:34)T\u2212L+1(cid:88)\n\nt=1\n\nT(cid:88)\n\nt=2\n\n(cid:35)\n\nT(cid:88)\n\nt=1\n\nEq ln p(yS, xS|\u03b8)\n\n\u2248 p(S)Eq\n\nES\n\nln \u03c0(xt) + (L \u2212 1)\n\nln Axt\u22121,xt + L\n\np(yt|xt)\n\n,\n\n(15)\nwhere the expectation is with respect to (\u03c0, A, \u03c6); this is detailed in the Supplement. The approx-\nimate equality in Eq. (15) arises because while most transitions appear in L \u2212 1 subchains, those\nnear the endpoints of the full chain do not, e.g., x1 and xT appear in only one subchain. This error\nbecomes negligible as the length of the HMM increases. 
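This counting argument admits a quick standalone check (our own illustration, not from the paper): each interior transition is covered by exactly L − 1 of the T − L + 1 length-L subchains, while transitions near the endpoints are covered fewer times, down to once.

```python
# Check of the counting argument behind Eq. (15) for a small toy chain.
T, L = 30, 5

def cover_count(t):
    """Number of subchain start positions s in {1, ..., T-L+1} whose window
    [s, s+L-1] contains both positions t and t+1 (1-indexed)."""
    lo = max(1, t - L + 2)          # earliest start covering the pair
    hi = min(t, T - L + 1)          # latest start covering the pair
    return max(0, hi - lo + 1)

counts = [cover_count(t) for t in range(1, T)]
# Each subchain contributes exactly L - 1 transitions, so the counts must
# total (L - 1)(T - L + 1); the per-transition deficit at the ends is the
# edge effect that becomes negligible as T grows.
```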
Since p(S) is uniform over all length L sub-\nchains, by linearity of expectation the batch factor c = (cA, c\u03c6) is given by cA = (T\u2212L+1)/(L\u22121),\nc\u03c6 = (T \u2212 L + 1)/L. Other choices of p(S) can be used by considering the appropriate version of\nEq. (15) analogously to [12], generally with a batch factor cS varying with each subset yS.\n\n3.3 Local update\n\nThe optimal SVIHMM local variational distribution arises just as in the batch case of Eq. (7), but\nwith time indices restricted to the length L subchain yS:\n\n(cid:32)\n\n(cid:2)ln \u03c0(xS\n1 )(cid:3) +\n\nL(cid:88)\n\n(cid:96)=2\n\n(cid:104)\n\nq\u2217(xS) \u221d exp\n\nEq(A)\n\n(cid:105)\n\nL(cid:88)\n\n(cid:96)=1\n\n(cid:2)ln p(yS\n\n(cid:96) |xS\n\n(cid:96) )(cid:3)(cid:33)\n\n. (16)\n\nEq(A)\n\nln AxS\n\n(cid:96)\u22121,xS\n(cid:96)\n\n+\n\nEq(\u03c6)\n\n(cid:96) |xS\n\nprevious subchains\u2014to form \u02c6\u03c0, (cid:101)A,(cid:101)pS = {(cid:101)p(yS\n\nTo compute these local beliefs, we use our current q(A), q(\u03c6)\u2014which have been informed by all\n(cid:96) = k),\u2200k, (cid:96) = 1, . . . , L}, with these parameters\nde\ufb01ned as in the batch case. We then use these parameters in a forward-backward algorithm detailed\nin the Supplement. However, this message passing produces only an approximate optimization due\nto loss of information incurred at the ends of the subchain. Speci\ufb01cally, for yS = (yt, . . . , yt+L),\nthe forward messages coming from y1, . . . , yt\u22121 are not available to yt, and similarly the backwards\nmessages from yt+L+1, . . . , yT are not available to yt+L.\nRecall our assumption in the global update step that q\u2217(xS) corresponds to a subchain of the full-\ndata optimal beliefs q\u2217(x). 
Here, we see that this assumption is assuredly false; instead, we analyze\nthe implications of using approximate local subchain beliefs and aim to ameliorate the edge effects.\n\nBuffering subchains To cope with the subchain edge effects, we augment the subchain S with\nenough extra observations on each end so that the local state beliefs, q(xi), i \u2208 S, are within an\n\u0001-ball of q\u2217(xi) \u2014 those had we considered the entire chain. The practicality of this approach arises\nfrom the approximate \ufb01nite memory of the process. In particular, consider performing a forward-\nL+\u03c4 ) leading to approximate beliefs \u02dcq\u03c4 (xi). Given \u0001 > 0, de\ufb01ne \u03c4\u0001\nbackward pass on (xS\nas the smallest buffer length \u03c4 such that\n\n1\u2212\u03c4 , . . . , xS\n\n||\u02dcq\u03c4 (xi) \u2212 q\u2217(xi)||1 \u2264 \u0001.\n\nmax\ni\u2208S\n\n(17)\n\nThe \u03c4 that satis\ufb01es Eq. (17) determines the number of observations used to buffer the subchain. After\nimproving subchain beliefs, we discard \u02dcq\u03c4 (xi), i \u2208 buffer, prior to the global update. As will be\nseen in Sec. 4, in practice the necessary \u03c4\u0001 is typically very small relative to the lengthy observation\nsequences of interest.\nBuffering subchains is related to splash belief propagation (BP) for parallel inference in undirected\ngraphical models, where the belief at any given node is monitored based on locally-aware message\npassing in order to maintain a good approximation to the true belief [21]. Unlike splash BP, we\n\n5\n\n\fembed the buffering scheme inside an iterative procedure for updating both the local latent structure\nand the global parameters, which affects the \u0001-approximation in future iterations. Likewise, we wish\nto maintain the approximation on an entire subchain, not just at a single node.\nEven in settings where parameters \u03b8 are known, as in splash BP, analytically choosing \u03c4\u0001 is generally\ninfeasible. 
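Since τ_ε cannot be chosen analytically, the buffer can instead be grown numerically until the beliefs on S stop changing. A minimal sketch with toy inputs (a likelihood array standing in for p̃ and helper names of our choosing; unlike the authors' subroutine, it rewalks the grown window rather than reusing messages):

```python
import numpy as np

def fb_marginals(pi0, A, lik):
    """Normalized forward-backward smoothed marginals on one (sub)chain."""
    T, K = lik.shape
    a = np.empty((T, K)); b = np.ones((T, K))
    a[0] = pi0 * lik[0]; a[0] /= a[0].sum()
    for t in range(1, T):
        a[t] = (a[t - 1] @ A) * lik[t]; a[t] /= a[t].sum()
    for t in range(T - 2, -1, -1):
        b[t] = A @ (lik[t + 1] * b[t + 1]); b[t] /= b[t].sum()
    g = a * b
    return g / g.sum(axis=1, keepdims=True)

def grow_buf(start, L, lik, pi0, A, u=2, eps=1e-6):
    """Extend the subchain [start, start+L) by u observations per side until
    the belief residual max_i ||q_new(x_i) - q_old(x_i)||_1 on S is <= eps
    (cf. Eq. (18)); buffer beliefs are discarded before returning."""
    T = lik.shape[0]
    lo, hi = start, start + L
    q_old = fb_marginals(pi0, A, lik[lo:hi])
    while True:
        lo, hi = max(0, lo - u), min(T, hi + u)
        q_new = fb_marginals(pi0, A, lik[lo:hi])[start - lo:start - lo + L]
        if np.abs(q_new - q_old).sum(axis=1).max() <= eps:
            return q_new
        q_old = q_new
```

In the worst case the window grows to the full chain, at which point the residual is exactly zero and the loop terminates; with a mixing chain it typically stops after a few small extensions.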
As such, we follow the approach of splash BP to select an approximate \u03c4\u0001. We then go\nfurther by showing that SVIHMM still converges using approximate messages within an uncertain\nparameter setting where \u03b8 is learned simultaneously with the state sequence x.\nSpeci\ufb01cally, we approximate \u03c4\u0001 by monitoring the change in belief residuals with a sub-routine\nGrowBuf, outlined in Alg. 2, that iteratively expands a buffer qold \u2192 qnew around a given subchain\nyS. Growbuf terminates when all belief residuals satisfy\n\n||q(xi)new \u2212 q(xi)old||1 \u2264 \u0001.\n\nmax\ni\u2208S\n\n(18)\n\nThe GrowBuf sub-routine can be computed ef\ufb01ciently due to (1) monotonicity of the forward and\nbackward messages so that only residuals at endpoints, q(xS\nL), need be considered, and\n(2) the reuse of computations. Speci\ufb01cally, the forward-backward pass can be rooted at the midpoint\nof yS so that messages to the endpoints can be ef\ufb01ciently propagated, and vice versa [22].\nFurthermore, choosing suf\ufb01ciently small \u0001 guarantees that the noisy natural gradient lies in the same\nhalf-plane as the true natural gradient, a suf\ufb01cient condition for maintaining convergence when using\napproximate gradients [23]; the proof is presented in the Supplement.\n\n1 ) and q(xS\n\nAlgorithm 2 GrowBuf procedure.\n1: Input: subchain S, min buffer length u \u2208 Z+, error tolerance \u0001 > 0.\n\n2: Initialize qold(xS) = ForwardBackward(yS, \u02c6\u03c0, (cid:101)A,(cid:101)pS) and set Sold = S.\n\nGrow buffer Snew by extending Sold by u observations in each direction.\nqnew(xSnew\n\n) = ForwardBackward(ySnew\n\n, \u02c6\u03c0, (cid:101)A,(cid:101)pSnew ), reusing messages from Sold.\n\n3: while true do\n4:\n5:\n6:\n7:\n8:\n9:\n10: end while\n\nif(cid:12)(cid:12)(cid:12)(cid:12)qnew(xS) \u2212 qold(xS)(cid:12)(cid:12)(cid:12)(cid:12) < \u0001 then\n\nreturn q\u2217(xS) = qnew(xS)\n\nend if\nSet Sold = Snew and qold = qnew.\n\n3.4 Minibatches 
for variance mitigation and their effect on computational complexity

Stochastic gradient algorithms often benefit from sampling multiple observations in order to reduce the variance of the gradient estimates at each iteration. We use a similar idea in SVIHMM by sampling a minibatch B = (yS1, . . . , ySM) consisting of M subchains. If the latent Markov chain tends to dwell in one component for extended periods, a single subchain may only contain information about the few states visited in that component, and increasing its length may only add redundant information from that component. In contrast, a minibatch of many smaller subchains may discover disparate components of the chain at comparable computational cost, accelerating learning and leading to a better local optimum. However, subchains must be sufficiently long to be informative of transition dynamics. In this setting, the local step on each subchain is identical; summing over subchains in the minibatch yields the gradient update:

ŵ_B = Σ_{S∈B} c^T E_{q(xS)}[t(xS, yS)],   w_{n+1} = w_n(1 − ρn) + ρn (u + ŵ_B / |B|).

We see that the computational complexity of SVIHMM is O(K²(L + 2τ_ε)M), leading to significant efficiency gains compared to O(K²T) in batch inference when (L + 2τ_ε)M << T.

4 Experiments

We evaluate the performance of SVIHMM compared to batch VB on synthetic experiments designed to illustrate the trade-off between the choice of subchain length L and the number of subchains per minibatch M. We also demonstrate the utility of GrowBuf. We then apply our algorithm to gene segmentation in a large human chromatin data set.

Table 1: Runtime and predictive log-probability (without GrowBuf) on RC data.

⌊L/2⌋  | Runtime (sec.)   | Avg. iter. time (sec.) | log-predictive
100    | 2.74 ± 0.001     | 0.03 ± 0.000           | −5.915 ± 0.004
500    | 11.79 ± 0.004    | 0.12 ± 0.000           | −5.850 ± 0.000
1000   | 23.17 ± 0.006    | 0.23 ± 0.000           | −5.850 ± 0.000
batch  | 1240.73 ± 0.370  | 248.15 ± 0.074         | −5.840 ± 0.000

Synthetic data We create two synthetic datasets with T = 10,000 observations and K = 8 latent states. The first, called diagonally dominant (DD), illustrates the potential benefit of large M, the number of sampled subchains per minibatch. The Markov chain heavily self-transitions so that most subchains contain redundant information with observations generated from the same latent state. Although transitions are rarely observed, the emission means are set to be distinct so that this example is likelihood-dominated and highly identifiable. Thus, fixing a computational budget, we expect large M to be preferable to large L, covering more of the observation sequence and avoiding poor local modes arising from redundant information.
The second dataset we consider contains two reversed cycles (RC): the Markov chain strongly transitions from states 1 → 2 → 3 → 1 and 5 → 7 → 6 → 5 with a small probability of transitioning between cycles via bridge states 4 and 8. The emission means for the two cycles are very similar but occur in reverse order with respect to the transitions. Transition information from observing long enough dynamics is thus crucial to distinguish states 1, 2, 3 from 5, 6, 7, so a large enough L is imperative. The Supplement contains details for generating both synthetic datasets.
We compare SVIHMM to batch VB on these two synthetic examples. For each parameter setting, we ran 20 random restarts of SVIHMM for 100 iterations and batch VB until convergence of the ELBO. 
A forgetting rate \u03ba parametrizes step sizes \u03c1n = (1 + n)\u2212\u03ba. We \ufb01x the total number of\nobservations L \u00d7 M used per iteration of SVIHMM such that increasing M implies decreasing L\n(and vice versa).\nIn Fig. 1(a) we compare || \u02c6A\u2212A||F , where A is the true transition matrix and \u02c6A its learned variational\nmean. We see trends one would expect: the small L, large M settings achieve better performance\nfor the DD example, but the opposite holds for RC, with (cid:98)L/2(cid:99) = 1 signi\ufb01cantly underperforming.\n(Of course, allowing large L and M is always preferable, except computationally.) Under appro-\npriate settings in both cases, we achieve comparable performance to batch VB. In Fig. 1(b), we see\nsimilar trends in terms of predictive log-probability holding out 10% of the observations as a test\nset and using 5-fold cross validation. Here, we actually notice that SVIHMM often achieves higher\npredictive log-probability than batch VB, which is attributed to the fact that stochastic algorithms\ncan \ufb01nd better local modes than their non-random counterparts.\nA timing comparison of SVIHMM to batch VB with T = 3 million is presented in Table 4. All\nsettings of SVIHMM run faster than even a single iteration of batch, with only a negligible change\nin predictive log-likelihood. Further discussion on these timing results is in the Supplement.\nMotivated by the demonstrated importance of choice of L, we now turn to examine the impact of\nthe GrowBuf routine via predictive log-probability. In Fig. 1(b), we see a noticeable improvement\nfor small L settings when GrowBuf is incorporated (the dashed lines in Fig. 1(b)). In particular,\nthe RC example is now learning dynamics of the chain even with (cid:98)L/2(cid:99) = 1, which was not\npossible without buffering. GrowBuf thus provides robustness by guarding against poor choice of\nL. 
We note that the buffer routine does not overextend subchains, on average growing by only \u2248 8\nobservations with \u0001 = 1\u00d710\u22126. Since the number of observations added is usually small, GrowBuf\ndoes not signi\ufb01cantly add to per-iteration computational cost (see the Supplement).\nHuman chromatin segmentation We apply the SVIHMM algorithm to a massive human chro-\nmatin dataset provided by the ENCODE project [24]. This data was studied in [25] with the goal\nof unsupervised pattern discovery via segmentation of the genome. Regions sharing the same labels\nhave certain common properties in the observed data, and because the labeling at each position is\nunknown but in\ufb02uenced by the label at the previous position, an HMM is a natural model [26].\n\n7\n\n\f(a)\n\n(b)\n\nFigure 1: (a) Transition matrix error varying L with L \u00d7 M \ufb01xed.\nGrowBuf. Batch results denoted by horizontal red line in both \ufb01gures.\n\n(b) Effect of incorporating\n\nWe were provided with 250 million observations consisting of twelve assays carried out in the\nchronic myeloid leukemia cell line K562. We analyzed the data using SVIHMM on an HMM with\n25 states and 12 dimensional Gaussian emissions. We compare our performance to the correspond-\ning segmentation learned by an expectation maximization (EM) algorithm applied to a more \ufb02exible\ndynamic Bayesian network model (DBN) [27]. Due to the size of the dataset, the analysis of [27]\nrequires breaking the chain into several blocks, severing long range dependencies.\nWe assess performance by comparing the false discovery rate (FDR) of predicting active promoter\nelements in the sequence. The lowest (best) FDR achieved with SVIHMM over 20 random restarts\ntrials was .999026 using (cid:98)L/2(cid:99) = 2000, M = 50, \u03ba = .51, comparable and slightly lower than\nthe .999038 FDR obtained using DBN-EM on the severed data [27]. 
We emphasize that even when restricted to a simpler HMM model, learning on the full data via SVIHMM attains results similar to those of [27] with significant gains in efficiency. In particular, our SVIHMM runs require under an hour for a fixed 100 iterations, the maximum iteration limit specified in the DBN-EM approach. In contrast, even with a parallelized implementation over the broken chain, the DBN-EM algorithm can take days. In conclusion, SVIHMM enables scaling to the entire dataset, allowing for a more principled approach by utilizing the data jointly.

5 Discussion

We have presented stochastic variational inference for HMMs, extending such algorithms from independent data settings to handle time dependence. We elucidated the complications that arise when sub-sampling dependent observations and proposed a scheme to mitigate the error introduced from breaking dependencies. Our approach provides an adaptive technique with provable guarantees for convergence to a local mode. Further extensions of the algorithm in the HMM setting include adaptively selecting the length of meta-observations and parallelizing the local step when the number of meta-observations is large. Importantly, these ideas generalize to other settings and can be applied to Bayesian nonparametric time series models, general state space models, and other graph structures with spatial dependencies.

Acknowledgements

This work was supported in part by the TerraSwarm Research Center sponsored by MARCO and DARPA, DARPA Grant FA9550-12-1-0406 negotiated by AFOSR, and NSF CAREER Award IIS-1350133. JX was supported by an NDSEG fellowship. We also appreciate the data, discussions, and guidance on the ENCODE project provided by Max Libbrecht and William Noble.

¹ Other parameter settings were explored.
References

[1] H. Robbins and S. Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
[2] L. Bottou. Online algorithms and stochastic approximations. In Online Learning and Neural Networks. Cambridge University Press, 1998.
[3] L. Bottou. Large-Scale Machine Learning with Stochastic Gradient Descent. In International Conference on Computational Statistics, pages 177–187, August 2010.
[4] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, January 2009.
[5] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning, pages 681–688, 2011.
[6] D. Maclaurin and R. P. Adams. Firefly Monte Carlo: Exact MCMC with subsets of data. CoRR, abs/1403.5693, 2014.
[7] X. Wang and D. B. Dunson. Parallelizing MCMC via Weierstrass sampler. CoRR, abs/1312.4605, 2014.
[8] W. Neiswanger, C. Wang, and E. Xing. Asymptotically exact, embarrassingly parallel MCMC. CoRR, abs/1311.4780, 2014.
[9] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, May 2013.
[10] M. Bryant and E. B. Sudderth. Truly nonparametric online variational inference for hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems, pages 2708–2716, 2012.
[11] T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. I. Jordan. Streaming variational Bayes. In Advances in Neural Information Processing Systems, pages 1727–1735, 2013.
[12] P. Gopalan, D. M. Mimno, S. Gerrish, M. J. Freedman, and D. M. Blei. Scalable inference of overlapping communities. In Advances in Neural Information Processing Systems, pages 2258–2266, 2012.
[13] M. J. Johnson and A. S. Willsky. Stochastic variational inference for Bayesian time series models. In International Conference on Machine Learning, 2014.
[14] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[15] S. Frühwirth-Schnatter. Finite Mixture and Markov Switching Models. Springer Verlag, 2006.
[16] S. L. Scott. Bayesian methods for hidden Markov models: Recursive computing in the 21st century. Journal of the American Statistical Association, 97(457):337–351, March 2002.
[17] M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. Ph.D. thesis, University College London, 2003.
[18] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, November 1999.
[19] C. M. Bishop. Pattern Recognition and Machine Learning. Springer Verlag, 2006.
[20] B. T. Polyak and Y. Tsypkin. Pseudo-gradient adaptation and learning algorithms. Automation and Remote Control, 3:45–68, 1973.
[21] J. Gonzalez, Y. Low, and C. Guestrin. Residual splash for optimally parallelizing belief propagation. In International Conference on Artificial Intelligence and Statistics, 2009.
[22] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Pearson Education, 2003.
[23] J. Nocedal and S. Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer, 2006.
[24] ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414):57–74, September 2012.
[25] M. M. Hoffman, O. J. Buske, J. Wang, Z. Weng, J. A. Bilmes, and W. S. Noble. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nature Methods, 9:473–476, 2012.
[26] N. Day, A. Hemmaplardh, R. E. Thurman, J. A. Stamatoyannopoulos, and W. S. Noble. Unsupervised segmentation of continuous genomic data. Bioinformatics, 23(11):1424–1426, 2007.
[27] M. M. Hoffman, J. Ernst, S. P. Wilder, A. Kundaje, R. S. Harris, M. Libbrecht, B. Giardine, P. M. Ellenbogen, J. A. Bilmes, E. Birney, R. C. Hardison, I. Dunham, M. Kellis, and W. S. Noble. Integrative annotation of chromatin elements from ENCODE data. Nucleic Acids Research, 41(2):827–841, 2013.