{"title": "Scalable Adaptation of State Complexity for Nonparametric Hidden Markov Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1198, "page_last": 1206, "abstract": "Bayesian nonparametric hidden Markov models are typically learned via fixed truncations of the infinite state space or local Monte Carlo proposals that make small changes to the state space. We develop an inference algorithm for the sticky hierarchical Dirichlet process hidden Markov model that scales to big datasets by processing a few sequences at a time yet allows rapid adaptation of the state space cardinality. Unlike previous point-estimate methods, our novel variational bound penalizes redundant or irrelevant states and thus enables optimization of the state space. Our birth proposals use observed data statistics to create useful new states that escape local optima. Merge and delete proposals remove ineffective states to yield simpler models with more affordable future computations. Experiments on speaker diarization, motion capture, and epigenetic chromatin datasets discover models that are more compact, more interpretable, and better aligned to ground truth segmentations than competitors. We have released an open-source Python implementation which can parallelize local inference steps across sequences.", "full_text": "Scalable Adaptation of State Complexity for\n\nNonparametric Hidden Markov Models\n\nMichael C. Hughes, William Stephenson, and Erik B. Sudderth\n\nDepartment of Computer Science, Brown University, Providence, RI 02912\n\nmhughes@cs.brown.edu, wtstephe@gmail.com, sudderth@cs.brown.edu\n\nAbstract\n\nBayesian nonparametric hidden Markov models are typically learned via \ufb01xed\ntruncations of the in\ufb01nite state space or local Monte Carlo proposals that make\nsmall changes to the state space. 
We develop an inference algorithm for the sticky hierarchical Dirichlet process hidden Markov model that scales to big datasets by processing a few sequences at a time yet allows rapid adaptation of the state space cardinality. Unlike previous point-estimate methods, our novel variational bound penalizes redundant or irrelevant states and thus enables optimization of the state space. Our birth proposals use observed data statistics to create useful new states that escape local optima. Merge and delete proposals remove ineffective states to yield simpler models with more affordable future computations. Experiments on speaker diarization, motion capture, and epigenetic chromatin datasets discover models that are more compact, more interpretable, and better aligned to ground truth segmentations than competitors. We have released an open-source Python implementation which can parallelize local inference steps across sequences.

1 Introduction

The hidden Markov model (HMM) [1, 2] is widely used to segment sequential data into interpretable discrete states. Human activity streams might use walking or dancing states, while DNA transcription might be understood via promoter or repressor states [3]. The hierarchical Dirichlet process HMM (HDP-HMM) [4, 5, 6] provides an elegant Bayesian nonparametric framework for reasoning about possible data segmentations with different numbers of states.

Existing inference algorithms for HMMs and HDP-HMMs have numerous shortcomings: they cannot efficiently learn from large datasets, do not effectively explore segmentations with varying numbers of states, and are often trapped at local optima near their initialization. Stochastic optimization methods [7, 8] are particularly vulnerable to these last two issues, since they cannot change the number of states instantiated during execution. The importance of removing irrelevant states has long been recognized [9].
Samplers that add or remove states via split and merge moves have been developed for HDP topic models [10, 11] and beta process HMMs [12]. However, these Monte Carlo proposals use the entire dataset and require all sequences to fit in memory, limiting scalability.

We propose an HDP-HMM learning algorithm that reliably transforms an uninformative, single-state initialization into an accurate yet compact set of states. Generalizing previous work on memoized variational inference for DP mixture models [13] and HDP topic models [14], we derive a variational bound for the HDP-HMM that accounts for sticky state persistence and can be used for effective Bayesian model selection. Our algorithm uses birth proposal moves to create new states and merge and delete moves to remove states with poor predictive power. State space adaptations are validated via a global variational bound, but by caching sufficient statistics our memoized algorithm efficiently processes subsets of sequences at each step. Extensive experiments demonstrate the reliability and scalability of our approach, which can be reproduced via Python code we have released online¹.

¹http://bitbucket.org/michaelchughes/x-hdphmm-nips2015/

[Figure 1 graphic: six panels showing segmentations of the same 6 motion capture sequences, time on the horizontal axis: (A) Initialization, K=1; (B) After first lap births, K=47; (C) After first lap merges, K=37; (D) After second lap, K=56; (E) After 100 laps, K=31; (F) Ground truth labels, K=12. Annotations mark an accepted birth and accepted merge pairs.]

Figure 1: Illustration of our new birth/merge/delete variational algorithm as it learns to segment motion capture sequences into common exercise types (Sec. 5). Each panel shows segmentations of the same 6 sequences, with time on the horizontal axis.
Starting from just one state (A), birth moves at the first sequence create useful states. Local updates to each sequence in turn can use existing states or birth new ones (B). After all sequences are updated once, we perform merge moves to clean up, and the lap is complete (C). After another complete lap of birth updates at each sequence followed by merges and deletes, the segmentation is further refined (D). After many laps, our final segmentation (E) aligns well to labels from a human annotator (F), with some true states aligning to multiple learned states that capture subject-specific variability in exercises.

2 Hierarchical Dirichlet Process Hidden Markov Models

We wish to jointly model N sequences, where sequence n has data x_n = [x_{n1}, x_{n2}, ..., x_{nT_n}] and observation x_{nt} is a vector representing interval or timestep t. For example, x_{nt} ∈ R^D could be the spectrogram for an instant of audio, or human limb positions during a 100ms interval.

The HDP-HMM explains this data by assigning each observation x_{nt} to a single hidden state z_{nt}. The chosen state comes from a countably infinite set of options k ∈ {1, 2, ...}, generated via Markovian dynamics with initial state distribution π_0 and transition distributions {π_k}_{k=1}^∞:

p(z_{n1} = k) = π_{0k},   p(z_{nt} = ℓ | z_{n,t−1} = k) = π_{kℓ}.   (1)

We draw data x_{nt} given assigned state z_{nt} = k from an exponential family likelihood F:

F: log p(x_{nt} | φ_k) = s_F(x_{nt})^T φ_k + c_F(φ_k),   H: log p(φ_k | τ̄) = φ_k^T τ̄ + c_H(τ̄).   (2)

The natural parameter φ_k for each state has conjugate prior H. Cumulant functions c_F, c_H ensure these distributions are normalized. The chosen exponential family is defined by its sufficient statistics s_F.
Our experiments consider Bernoulli, Gaussian, and auto-regressive Gaussian likelihoods.

Hierarchies of Dirichlet processes. Under the HDP-HMM prior and posterior, the number of states is unbounded; it is possible that every observation comes from a unique state. The hierarchical Dirichlet process (HDP) [5] encourages sharing states over time via a latent root probability vector β over the infinite set of states (see Fig. 2). The stick-breaking representation of the prior on β first draws independent variables u_k ∼ Beta(1, γ) for each state k, and then sets β_k = u_k ∏_{ℓ=1}^{k−1} (1 − u_ℓ). We interpret u_k as the conditional probability of choosing state k among states {k, k+1, k+2, ...}.

In expectation, the K most common states are first in stick-breaking order. We represent their probabilities via the vector [β_1 β_2 ... β_K β_{>K}], where β_{>K} = ∑_{k=K+1}^∞ β_k. Given this (K+1)-dimensional probability vector β, the HDP-HMM generates transition distributions π_k for each state k from a Dirichlet with mean equal to β and variance governed by concentration parameter α > 0:

[π_{k1} ... π_{kK} π_{k>K}] ∼ Dir(αβ_1, αβ_2, ..., αβ_{>K}).   (3)

We draw starting probability vector π_0 from a similar prior with much smaller variance, π_0 ∼ Dir(α_0 β) with α_0 ≫ α, because few starting states are observed.

Sticky self-transition bias. In many applications, we expect each segment to persist for many timesteps. The "sticky" parameterization of [4, 6] favors self-transition by placing extra prior mass on the transition probability π_{kk}. In particular, [π_{k1} ... π_{k>K}] ∼ Dir(αβ_1, ..., αβ_k + κ, ..., αβ_{>K}), where κ > 0 controls the degree of self-transition bias.
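The generative construction above can be sketched in a few lines of NumPy. This is a minimal illustration, not the released implementation; it assumes a fixed truncation K for display, and the function name is hypothetical:

```python
import numpy as np

def sample_hdp_hmm_transitions(K=10, gamma=5.0, alpha=0.5, kappa=100.0, seed=0):
    """Sketch of the sticky HDP-HMM prior: stick-breaking beta,
    then one sticky Dirichlet row per state."""
    rng = np.random.default_rng(seed)
    # Stick-breaking: u_k ~ Beta(1, gamma), beta_k = u_k * prod_{l<k} (1 - u_l).
    u = rng.beta(1.0, gamma, size=K)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - u)[:-1]))
    beta = u * remaining
    beta_rest = 1.0 - beta.sum()            # aggregate mass beta_{>K}
    beta_full = np.append(beta, beta_rest)
    # Each transition row: Dirichlet with mean beta plus sticky mass kappa on self.
    pi = np.empty((K, K + 1))
    for k in range(K):
        conc = alpha * beta_full.copy()
        conc[k] += kappa                    # extra prior mass on self-transition
        pi[k] = rng.dirichlet(conc)
    return beta_full, pi

beta, pi = sample_hdp_hmm_transitions()
```

With κ = 100 and α = 0.5 the sampled rows place nearly all mass on the self-transition entry, matching the long segment lengths discussed above.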
Choosing κ ≈ 100 leads to long segment lengths, while avoiding the computational cost of semi-Markov alternatives [7].

[Figure 2 graphic: graphical model over u_k, π_k, φ_k, assignments z_{n1} ... z_{nT}, and data x_{n1} ... x_{nT}, with variational parameters ρ̂_k, ω̂_k, θ̂_k, τ̂_k, ŝ_{nt} shown in red; two plots compare the sticky c_D against our lower bound as functions of α and of the number of states K.]

Figure 2: Left: Graphical representation of the HDP hidden Markov model. Variational parameters are shown in red. Center: Our surrogate bound for the sticky Dirichlet cumulant function c_D (Eq. 9) as a function of α, computed with κ = 100 and uniform β with K = 20 active states. Right: Surrogate bound vs. K, with fixed κ = 100, α = 0.5. This bound remains tight when our state adaptation moves insert or remove states.

3 Memoized and Stochastic Variational Inference

After observing data x, our inferential goal is posterior knowledge of top-level conditional probabilities u, HMM parameters π, φ, and assignments z. We refer to u, π, φ as global parameters because they generalize to new data sequences. In contrast, the states z_n are local to a specific sequence x_n.

3.1 A Factorized Variational Lower Bound

We seek a distribution q over the unobserved variables that is close to the true posterior, but lies in the simpler factorized family q(·) ≜ q(u)q(φ)q(π)q(z). Each factor has exponential family form with free parameters denoted by hats, and our inference algorithms update these parameters to minimize the Kullback-Leibler (KL) divergence KL(q || p).
Our chosen factorization for q is similar to [7], but includes a substantially more accurate approximation to q(u) as detailed in Sec. 3.2.

Factor q(z). For each sequence n, we use an independent factor q(z_n) with Markovian structure:

q(z_n) ≜ [∏_{k=1}^K r̂_{n1k}^{δ_k(z_{n1})}] ∏_{t=1}^{T−1} ∏_{k=1}^K ∏_{ℓ=1}^K (ŝ_{ntkℓ} / r̂_{ntk})^{δ_k(z_{nt}) δ_ℓ(z_{n,t+1})}   (4)

Free parameter vector ŝ_{nt} defines the joint assignment probabilities ŝ_{ntkℓ} ≜ q(z_{n,t+1} = ℓ, z_{nt} = k), so the K² non-negative entries of ŝ_{nt} sum to one. The parameter r̂_{nt} defines the marginal probability r̂_{ntk} = q(z_{nt} = k), and equals r̂_{ntk} = ∑_{ℓ=1}^K ŝ_{ntkℓ}. We can find the expected count of transitions from state k to ℓ across all sequences via the sufficient statistic M_{kℓ}(ŝ) ≜ ∑_{n=1}^N ∑_{t=1}^{T_n−1} ŝ_{ntkℓ}.

The truncation level K limits the total number of states to which data is assigned. Under our approximate posterior, only q(z_n) is constrained by this choice; no global factors are truncated. Indeed, if data is only assigned to the first K states, the conditional independence properties of the HDP-HMM imply that {φ_k, u_k | k > K} are independent of the data. Their optimal variational posteriors thus match the prior, and need not be explicitly computed or stored [15, 16]. Simple variational algorithms treat K as a fixed constant [7], but Sec. 4 develops novel algorithms that fit K to data.

Factor q(π). For the starting state (k = 0) and each state k ∈ 1, 2, ..., we define q(π_k) as a Dirichlet distribution: q(π_k) ≜ Dir(θ̂_{k1}, ..., θ̂_{kK}, θ̂_{k>K}). Free parameter θ̂_k is a vector of K+1 positive numbers, with one entry for each of the K active states and a final entry for the aggregate mass of all other states.
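The marginal and transition-count statistics just defined follow from the pairwise marginals ŝ by simple sums. A minimal sketch (the function and array names are hypothetical, not from the released code):

```python
import numpy as np

def transition_stats(s_hat_list):
    """Given per-sequence pairwise marginals s_hat[n] of shape (T_n - 1, K, K),
    return per-timestep marginals r_hat and the transition-count statistic M."""
    K = s_hat_list[0].shape[1]
    M = np.zeros((K, K))
    r_hat_list = []
    for s_hat in s_hat_list:
        # r_hat[t, k] = q(z_t = k): sum the joint over the next state.
        r_t = s_hat.sum(axis=2)             # shape (T_n - 1, K)
        r_last = s_hat[-1].sum(axis=0)      # marginal of the final timestep
        r_hat_list.append(np.vstack([r_t, r_last]))
        M += s_hat.sum(axis=0)              # expected transition counts M_{kl}
    return r_hat_list, M
```

For valid HMM posteriors, consecutive pairwise marginals are automatically consistent, so summing either axis of ŝ_{nt} recovers the same single-timestep marginal.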
The expected log transition probability between states k and ℓ, P_{kℓ}(θ̂) ≜ E_q[log π_{kℓ}] = ψ(θ̂_{kℓ}) − ψ(∑_{m=1}^{K+1} θ̂_{km}), is a key sufficient statistic.

Factor q(φ). Emission parameter φ_k for state k has factor q(φ_k) ≜ H(τ̂_k) conjugate to the likelihood F. The supplement provides details for Bernoulli, Gaussian, and auto-regressive F.

We score the approximation q via an objective function L that assigns a scalar value (higher is better) to each possible input of free parameters, data x, and hyperparameters γ, α, κ, τ̄:

L(·) ≜ E_q[log p(x, z, π, u, φ) − log q(z, π, u, φ)] = L_data + L_entropy + L_hdp-local + L_hdp-global.   (5)

This function provides a lower bound on the marginal evidence: log p(x | γ, α, κ, τ̄) ≥ L. Improving this bound is equivalent to minimizing KL(q || p). Its four component terms are defined as follows:

L_data(x, r̂, τ̂) ≜ E_q[log p(x | z, φ) + log (p(φ) / q(φ))],   L_entropy(ŝ) ≜ −E_q[log q(z)],
L_hdp-local(ŝ, θ̂, ρ̂, ω̂) ≜ E_q[log p(z | π) + log (p(π) / q(π))],   L_hdp-global(ρ̂, ω̂) ≜ E_q[log (p(u) / q(u))].   (6)

Detailed analytic expansions for each term are available in the supplement.

3.2 Tractable Posterior Inference for Global State Probabilities

Previous variational methods for the HDP-HMM [7], and for HDP topic models [16] and HDP grammars [17], used a zero-variance point estimate for the top-level state probabilities β. While this approximation simplifies inference, the variational objective no longer bounds the marginal evidence.
Such pseudo-bounds are unsuitable for model selection and can favor models with redundant states that do not explain any data, but nevertheless increase computational and storage costs [14].

Because we seek to learn compact and interpretable models, and automatically adapt the truncation level K to each dataset, we instead place a proper beta distribution on u_k, k ∈ 1, 2, ..., K:

q(u_k) ≜ Beta(ρ̂_k ω̂_k, (1 − ρ̂_k) ω̂_k), where ρ̂_k ∈ (0, 1), ω̂_k > 0.   (7)

Here ρ̂_k = E_{q(u)}[u_k], E_{q(u)}[β_k] = ρ̂_k E[β_{>k−1}], and E_{q(u)}[β_{>k}] = ∏_{ℓ=1}^k (1 − ρ̂_ℓ). The scalar ω̂_k controls the variance, where the zero-variance point estimate is recovered as ω̂_k → ∞.

The beta factorization in Eq. (7) complicates evaluation of the marginal likelihood bound in Eq. (6):

L_hdp-local(ŝ, θ̂, ρ̂, ω̂) = E_{q(u)}[c_D(α_0 β)] + ∑_{k=1}^K E_{q(u)}[c_D(αβ + κδ_k)] − ∑_{k=0}^K c_D(θ̂_k) + ∑_{k=0}^K ∑_{ℓ=1}^{K+1} (M_{kℓ}(ŝ) + α_k E_{q(u)}[β_ℓ] + κδ_k(ℓ) − θ̂_{kℓ}) P_{kℓ}(θ̂).   (8)

The Dirichlet cumulant function c_D maps K+1 positive parameters to a log-normalization constant. For a non-sticky HDP-HMM where κ = 0, previous work [14] established the following bound:

c_D(αβ) ≜ log Γ(α) − ∑_{k=1}^{K+1} log Γ(αβ_k) ≥ K log α + ∑_{ℓ=1}^{K+1} log β_ℓ.   (9)

Direct evaluation of E_{q(u)}[c_D(αβ)] is problematic because the expectations of log-gamma functions have no closed form, but the lower bound has a simple expectation given beta distributed q(u_k).

Developing a similar bound for sticky models with κ > 0 requires a novel contribution.
To begin, in the supplement we establish the following bound for any κ > 0, α > 0:

c_D(αβ + κδ_k) ≥ K log α − log(α + κ) + log(αβ_k + κ) + ∑_{ℓ=1, ℓ≠k}^{K+1} log β_ℓ.   (10)

To handle the intractable term E_{q(u)}[log(αβ_k + κ)], we leverage the concavity of the logarithm:

log(αβ_k + κ) ≥ β_k log(α + κ) + (1 − β_k) log κ.   (11)

Combining Eqs. (10) and (11) and taking expectations, we can evaluate a lower bound on Eq. (8) in closed form, and thereby efficiently optimize its parameters. As illustrated in Fig. 2, this rigorous lower bound on the marginal evidence log p(x) is quite accurate for practical hyperparameters.

3.3 Batch and Stochastic Variational Inference

Most variational inference algorithms maximize L via coordinate ascent optimization, where the best value of each parameter is found given fixed values for other variational factors. For the HDP-HMM this leads to the following updates, which when iterated converge to some local maximum.

Local update to q(z_n). The assignments for each sequence z_n can be updated independently via dynamic programming [18]. The forward-backward algorithm takes as input a T_n × K matrix of log-likelihoods E_q[log p(x_n | φ_k)] given the current τ̂, and log transition probabilities P_{jk} given the current θ̂. It outputs the optimal marginal state probabilities ŝ_n, r̂_n under objective L. This step has cost O(T_n K²) for sequence n, and we can process multiple sequences in parallel for efficiency.

Global update to q(φ). Conjugate priors lead to simple closed-form updates τ̂_k = τ̄ + S_k, where sufficient statistic S_k summarizes the data assigned to state k: S_k ≜ ∑_{n=1}^N ∑_{t=1}^{T_n} r̂_{ntk} s_F(x_{nt}).

Global update to q(π). For each state k ∈ {0, 1, 2, ..., K}, the positive vector θ̂_k defining the optimal Dirichlet posterior on transition probabilities from state k is θ̂_{kℓ} = M_{kℓ}(ŝ) + αβ_ℓ + κδ_k(ℓ). Statistic M_{kℓ}(ŝ) counts the expected number of transitions from state k to ℓ across all sequences.

Global update to q(u). Due to non-conjugacy, our surrogate objective L has no closed-form update to q(u). Instead, we employ numerical optimization to update vectors ρ̂, ω̂ simultaneously:

arg max_{ρ̂, ω̂} L_hdp-local(ρ̂, ω̂, θ̂, ŝ) + L_hdp-global(ρ̂, ω̂)   subject to ω̂_k > 0, ρ̂_k ∈ (0, 1) for k = 1, 2, ..., K.

Details are in the supplement. The update to q(u) requires expectations under q(π), and vice versa, so it can be useful to iteratively optimize q(π) and q(u) several times given fixed local statistics.

To handle large datasets, we can adapt these updates to perform stochastic variational inference (SVI) [19]. Stochastic algorithms perform local updates on random subsets of sequences (batches), and then perturb global parameters by following a noisy estimate of the natural gradient, which has a simple closed form. SVI has previously been applied to non-sticky HDP-HMMs with point-estimated β [7], and can be easily adapted to our more principled objective. One drawback of SVI is the requirement of a learning rate schedule, which must typically be tuned to each dataset.

3.4 Memoized Variational Inference

We now outline a memoized algorithm [13] for our sticky HDP-HMM variational objective. Before execution, each sequence is randomly assigned to one of B batches. The algorithm repeatedly visits batches one at a time in random order; we call each full pass through the complete set of B batches a lap.
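The lap structure can be sketched as follows: each batch visit runs a local step, swaps the batch's cached summary for a fresh one so the whole-dataset statistic stays exact, and then runs a global step. This is a minimal sketch with hypothetical function names; the local and global steps are abstracted as callbacks, and only one statistic (a transition-count matrix) is tracked:

```python
import numpy as np

def memoized_laps(batches, local_step, global_step, K, num_laps=2, seed=0):
    """Sketch of memoized variational inference: cache per-batch sufficient
    statistics so whole-dataset summaries stay exact at batch-level cost."""
    rng = np.random.default_rng(seed)
    cached = {b: np.zeros((K, K)) for b in range(len(batches))}  # per-batch M^b
    M_total = np.zeros((K, K))                # whole-dataset summary M
    params = global_step(M_total)             # initial global parameters
    for lap in range(num_laps):
        order = rng.permutation(len(batches))
        for b in order:
            M_b = local_step(batches[b], params)  # fresh summary for batch b
            M_total += M_b - cached[b]            # add new, subtract memoized old
            cached[b] = M_b
            params = global_step(M_total)         # global step after each visit
    return params, M_total
```

With a single batch this reduces to full-dataset coordinate ascent; with many batches, each visit costs only one batch's local step while M stays consistent with the latest assignments for all sequences.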
At each visit to batch b, we perform a local step for all sequences n in batch b and then a global step. With B = 1 batches, memoized inference reduces to the standard full-dataset algorithm, while with larger B we have more affordable local steps and faster overall convergence. With just one lap, memoized inference is equivalent to the synchronous version of streaming variational inference, presented in Alg. 3 of Broderick et al. [20]. We focus on regimes where dozens of laps are feasible, which we demonstrate dramatically improves performance.

Affordable, but exact, batch optimization of L is possible by exploiting the additivity of statistics M, S. For each statistic we track a batch-specific quantity M^b, and a whole-dataset summary M ≜ ∑_{b=1}^B M^b. After a local step at batch b yields ŝ^b, r̂^b, we update M^b(ŝ^b) and S^b(r̂^b), increment each whole-dataset statistic by adding the new batch summary and subtracting the summary stored in memory from the previous visit, and store (or memoize) the new statistics for future iterations. This update cycle makes M and S consistent with the most recent assignments for all sequences. Memoization does require O(BK²) more storage than SVI. However, this cost does not scale with the number of sequences N or length T. Sparsity in transition counts M may make storage cheaper.

At any point during memoized execution, we can evaluate L exactly for all data seen thus far. This is possible because nearly all terms in Eq. (6) are functions of only global parameters ρ̂, ω̂, θ̂, τ̂ and sufficient statistics M, S. The one exception that requires local values ŝ, r̂ is the entropy term L_entropy. To compute it, we track a (K+1) × K matrix H^b at each batch b:

H^b_{0ℓ} = −∑_n r̂_{n1ℓ} log r̂_{n1ℓ},   H^b_{kℓ} = −∑_n ∑_{t=1}^{T_n−1} ŝ_{ntkℓ} log (ŝ_{ntkℓ} / r̂_{ntk}),   (12)

where the sums aggregate sequences n that belong to batch b. Each entry of H^b is non-negative, and given the whole-dataset entropy matrix H = ∑_{b=1}^B H^b, we have L_entropy = ∑_{k=0}^K ∑_{ℓ=1}^K H_{kℓ}.

4 State Space Adaptation via Birth, Merge, and Delete Proposals

Reliable nonparametric inference algorithms must quickly identify and create missing states. Split-merge samplers for HDP topic models [10, 11] are limited because proposals can only split an existing state into two new states, require expensive traversal of all data points to evaluate an acceptance ratio, and often have low acceptance rates [12]. Some variational methods for HDP topic models also dynamically create new topics [16, 21], but do not guarantee improvement of the global objective and can be unstable. We instead interleave stochastic birth proposals with delete and merge proposals, and use memoization to efficiently verify proposals via the exact full-dataset objective.

Birth proposals. Birth moves can create many new states at once while maintaining the monotonic increase of the whole-dataset objective L. Each proposal happens within the local step by trying to improve q(z_n) for a single sequence n. Given current assignments ŝ_n, r̂_n with truncation K, the move proposes new assignments ŝ′_n, r̂′_n that include the K existing states and some new states with index k > K. If L improves under the proposal, we accept and use the expanded set of states for all remaining updates in the current lap. To compute L, we require candidate global parameters ρ̂′, ω̂′, θ̂′, τ̂′.
These are found via a global step from candidate summaries M′, S′, which combine the new batch statistics M′_b, S′_b and memoized statistics of the other batches M′_{\b}, S′_{\b} expanded by zeros for states k > K. See the supplement for details on handling multiple sequences within a batch.

[Figure 3 graphic: scatterplot of the 8-cluster toy data; trace plots of the number of states K and Hamming distance vs. passes through the data for the sampler, stochastic, memoized, delete-merge, and birth-delete-merge methods under non-sticky (κ = 0) and sticky (κ = 50) models; example segmentations labeled "stoch: K=47 after 2000 laps in 359 min.", "sampler: K=10 after 2000 laps in 74 min.", "delete,merge: K=8 after 100 laps in 5 min."]

Figure 3: Toy data experiments (Sec. 5). Top left: Data sequences contain 2D points from 8 well-separated Gaussians with sticky transitions. Top center: Trace plots from initialization with 50 redundant states. Our state-adaptation algorithms (red/purple) reach ideal K = 8 states and zero Hamming distance regardless of whether a sticky (solid) or non-sticky (dashed) model is used. Competitors converge slower, especially in the non-sticky case because non-adaptive methods are more sensitive to hyperparameters. Bottom: Segmentations of 4 sequences by SVI, the Gibbs sampler, and our method under the non-sticky model (κ = 0). Top half shows true state assignments; bottom shows aligned estimated states. Competitors are polluted by extra states (black).

The proposal for expanding ŝ′, r̂′ with new states can flexibly take any form, from very naïve to very data-driven.
For data with "sticky" state persistence, we recommend randomly choosing one interval [t, t+δ] of the current sequence to reassign when creating ŝ′, r̂′, leaving other timesteps fixed. We split this interval into two contiguous blocks (one may be empty), each completely assigned to a new state. In the supplement, we detail a linear-time search that finds the cut point that maximizes the objective L_data. Other proposals such as sub-cluster splits [11] could be easily incorporated in our variational algorithm, but we find this simple interval-based proposal to be fast and effective.

Merge proposals. Merge proposals try to find a less redundant but equally expressive model. Each proposal takes a pair of existing states i < j and constructs a candidate model where data from state j is reassigned to state i. Conceptually this reassignment gives a new value ŝ′, but instead statistics M′, S′ can be directly computed and used in a global update for candidate parameters τ̂′, ρ̂′, θ̂′:

S′_i = S_i + S_j,   M′_{:i} = M_{:i} + M_{:j},   M′_{i:} = M_{i:} + M_{j:},   M′_{ii} = M_{ii} + M_{jj} + M_{ji} + M_{ij}.

While most terms in L are linear functions of our cached sufficient statistics, the entropy L_entropy is not. Thus for each candidate merge pair (i, j), we use O(K) storage and computation to track column H′_{:i} and row H′_{i:} of the corresponding merged entropy matrix H′. Because all terms in the H′ matrix of Eq. (12) are non-negative, we can lower-bound L_entropy by summing a subset of H′. As detailed in the supplement, this allows us to rigorously bound the objective L′ for accepting multiple merges of distinct state pairs.
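The merged-statistics construction above can be computed directly from the cached summaries. A minimal sketch (hypothetical function name, not the released implementation):

```python
import numpy as np

def merge_stats(M, S, i, j):
    """Combine sufficient statistics when reassigning state j into state i.
    M: (K, K) expected transition counts; S: (K, D) emission statistics."""
    keep = [k for k in range(M.shape[0]) if k != j]
    M2 = M.copy()
    M2[:, i] += M2[:, j]          # incoming transitions: column j into column i
    M2[i, :] += M2[j, :]          # outgoing transitions: row j into row i
    M2 = M2[np.ix_(keep, keep)]   # drop merged state j
    S2 = S.copy()
    S2[i] += S2[j]                # pool emission statistics
    S2 = S2[keep]
    return M2, S2
```

Adding the column first and the row second leaves the diagonal entry equal to M_{ii} + M_{jj} + M_{ji} + M_{ij}, matching the update above, and the total transition mass is preserved.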
Because many entries of H′ are near-zero, this bound is very tight, and in practice enables us to scalably merge many redundant state pairs in each lap through the data.

To identify candidate merge pairs i, j, we examine all pairs of states and keep those that satisfy L′_data + L′_hdp-local + L′_hdp-global > L_data + L_hdp-local + L_hdp-global. Because entropy must decrease after any merge (L′_entropy < L_entropy), this test is guaranteed to find all possibly useful merges. It is much more efficient than the heuristic correlation score used in prior work on HDP topic models [14].

Deletes. Our proposal to delete a rarely-used state j begins by dropping row j and column j from M to create M′, and dropping S_j from S to create S′. Using a target dataset of sequences with non-trivial mass on state j, x′ = {x_n : ∑_{t=1}^{T_n} r̂_{ntj} > 0.01}, we run global and local parameter updates to reassign observations from former state j in a data-driven way. Rather than verifying on only the target dataset as in [14], we accept or reject the delete proposal via the whole-dataset bound L. To control computation, we only propose deleting states used in 10 or fewer sequences.

[Figure 4 graphic: trace plots of the objective (×100) and number of states K vs. passes through the data for stoch, memo, and birth/delete/merge runs with K = 50 and K = 100; wallclock times and speedup factors (up to 64x) vs. number of parallel workers.]

Figure 4: Segmentation of human epigenome: 15 million observations across 173 sequences (Sec. 5). Left: Adaptive runs started at 1 state grow to 70 states within one lap and reach better L scores than 100-state non-adaptive methods. Each run takes several days.
Right: Wallclock times and speedup factors for a parallelized local step on 1/3 of this dataset. 64 workers complete a local step with K = 50 states in under one minute.

5 Experiments

We compare our proposed birth-merge-delete memoized algorithm to memoized with delete and merge moves only, and without any moves. We further run a blocked Gibbs sampler [6] that was previously shown to mix faster than slice samplers [22], and our own implementation of SVI for objective L. These baselines maintain a fixed number of states K, though some states may have usage fall to zero. We start all fixed-K methods (including the sampler) from matched initializations. See the supplement for further discussion and all details needed to reproduce these experiments.

Toy data. In Fig. 3, we study 32 toy data sequences generated from 8 Gaussian states with sticky transitions [8]. From an abundant initialization with 50 states, the sampler and non-adaptive variational methods require hundreds of laps to remove redundant states, especially under a non-sticky model (κ = 0). In contrast, our adaptive methods reach the ideal of zero Hamming distance within a few dozen laps regardless of stickiness, suggesting less sensitivity to hyperparameters.

Speaker diarization. We study 21 unrelated audio recordings of meetings with an unknown number of speakers from the NIST 2007 speaker diarization challenge [23]. The sticky HDP-HMM previously achieved state-of-the-art diarization performance [6] using a sampler that required hours of computation. We ran methods from 10 matched initializations with 25 states and κ = 100, computing Hamming distance on non-speech segments as in the standard DER metric. Fig. 5 shows that within minutes, our algorithms consistently find segmentations better aligned to true speaker labels.

Labelled N = 6 motion capture. Fox et al.
[12] introduced a 6-sequence dataset with labels for 12 exercise types, illustrated in Fig. 1. Each sequence has 12 joint angles (wrist, knee, etc.) captured at 0.1-second intervals. Fig. 6 shows that non-adaptive methods struggle even when initialized abundantly with 30 (dashed lines) or 60 (solid) states, while our adaptive methods reach better values of the objective L and cleaner many-to-one alignment to true exercises.

Large N = 124 motion capture. Next, we apply scalable methods to the 124-sequence dataset of [12]. We lack ground truth here, but Fig. 7 shows deletes and merges making consistent reductions from abundant initializations and births growing from K = 1. Fig. 7 also shows estimated segmentations for 10 representative sequences, along with skeleton illustrations for the 10 most-used states in this subset. These segmentations align well with held-out text descriptions.

Chromatin segmentation. Finally, we study segmenting the human genome by the appearance patterns of regulatory proteins [24]. We observe 41 binary signals from [3] at 200 bp intervals throughout a white blood cell line (CD4T). Each binary value indicates the presence or absence of an acetylation or methylation that controls gene expression. We divide the whole epigenome into 173 sequences (one per batch) with total size T = 15.4 million. Fig. 4 shows our method can grow from 1 state to 70 states and compete favorably with non-adaptive competitors. We also demonstrate that our parallelized local step yields 25x speedups in processing such large datasets.

6 Conclusion

Our new variational algorithms adapt HMM state spaces to find clean segmentations driven by Bayesian model selection. Relative to prior work [14], our contributions include a new bound for the sticky HDP-HMM, births with guaranteed improvement, local step parallelization, and better merge selection rules. 
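As a concrete illustration of the merge selection rule described above, a candidate pair (i, j) is kept only when the non-entropy terms of the bound improve after the hypothetical merge; since the entropy term can only decrease, any pair failing this screen cannot improve the full bound L. The sketch below uses illustrative names (not taken from the released implementation) and assumes each bound term has already been evaluated for both configurations:

```python
# Minimal sketch of the merge screening test. Dict keys and function names
# here are illustrative assumptions, not the released code's API.

def is_merge_candidate(current_terms, merged_terms):
    """Keep pair (i, j) if the non-entropy bound terms improve.

    Each argument is a dict with keys 'data', 'hdp_local', 'hdp_global'
    holding the corresponding terms of the bound L before (current) and
    after (merged) the hypothetical merge of states i and j.
    """
    keys = ('data', 'hdp_local', 'hdp_global')
    return sum(merged_terms[k] for k in keys) > sum(current_terms[k] for k in keys)

# Toy usage: merging two redundant states barely changes the data term
# but simplifies the HDP terms, so the pair survives screening.
current = {'data': -1250.0, 'hdp_local': -40.0, 'hdp_global': -12.0}
merged = {'data': -1251.0, 'hdp_local': -33.0, 'hdp_global': -9.0}
print(is_merge_candidate(current, merged))  # True
```

Because the test compares only three precomputed scalars per pair, screening all K(K-1)/2 pairs each lap is cheap relative to a single local step.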
Our multiprocessing-based Python code is targeted at genome-scale applications.

Acknowledgments. This research was supported in part by NSF CAREER Award No. IIS-1349774. M. Hughes was supported in part by an NSF Graduate Research Fellowship under Grant No. DGE0228243.

[Figure 5 panels: scatterplot of sampler Hamming vs. delete-merge Hamming; objective and Hamming-distance traces vs. elapsed time for Meeting 11 (best), Meeting 16 (avg.), and Meeting 21 (worst); legend: sampler, memo, delete/merge, birth/delete/merge.]
Figure 5: Method comparison on speaker diarization from common K = 25 initializations (Sec. 5). Left: Scatterplot of final Hamming distance for our adaptive method and the sampler. Across 21 meetings (each with 10 initializations shown as individual dots) our method finds segmentations closer to ground truth. 
Right: Traces of objective L and Hamming distance for meetings representative of good, average, and poor performance.

[Figure 6 panels: train objective, number of states K, and Hamming distance vs. num passes thru data; legend: stoch, sampler, memo, delete/merge, birth/delete/merge. Segmentation panel titles: birth: Hdist=0.34 K=28 @ 100 laps; del/merge: Hdist=0.30 K=13 @ 100 laps; sampler: Hdist=0.49 K=29 @ 1000 laps.]
Figure 6: Comparison on 6 motion capture streams (Sec. 5). Top: Our adaptive methods reach better L values and lower distance from true exercise labels. Bottom: Segmentations from the best runs of birth/merge/delete (left), only deletes and merges from 30 initial states (middle), and the sampler (right). Each sequence shows true labels (top half) and estimates (bottom half) colored by the true state with highest overlap (many-to-one).

[Figure 7 panels: train objective and number of states K vs. num passes thru data; sequence panels 1-1: playground jump, 1-2: playground climb, 1-3: playground climb, 2-7: swordplay, 5-3: dance, 5-4: dance, 5-5: dance, 6-3: basketball dribble, 6-4: basketball dribble, 6-5: basketball dribble; skeleton panels: Walk, Climb, Sword, Arms, Swing, Dribble, Jump, Balance, Ballet Leap, Ballet Pose.]
Figure 7: Study of 124 motion capture sequences (Sec. 5). 
Top Left: Objective L and state count K as more data is seen. Solid lines have 200 initial states; dashed 100. Top Right: Final segmentation of 10 select sequences by our method, with id numbers and descriptions from mocap.cs.cmu.edu. The 10 most used states are shown in color, the rest with gray. Bottom: Time-lapse skeletons assigned to each highlighted state.

References

[1] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. of the IEEE, 77(2):257–286, 1989.
[2] Zoubin Ghahramani. An introduction to hidden Markov models and Bayesian networks. International Journal of Pattern Recognition and Machine Intelligence, 15(01):9–42, 2001.
[3] Jason Ernst and Manolis Kellis. Discovery and characterization of chromatin states for systematic annotation of the human genome. Nature Biotechnology, 28(8):817–825, 2010.
[4] Matthew J. Beal, Zoubin Ghahramani, and Carl E. Rasmussen. The infinite hidden Markov model. In Neural Information Processing Systems, 2001.
[5] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
[6] Emily B. Fox, Erik B. Sudderth, Michael I. Jordan, and Alan S. Willsky. A sticky HDP-HMM with application to speaker diarization. Annals of Applied Statistics, 5(2A):1020–1056, 2011.
[7] Matthew J. Johnson and Alan S. Willsky. Stochastic variational inference for Bayesian time series models. In International Conference on Machine Learning, 2014.
[8] Nicholas Foti, Jason Xu, Dillon Laird, and Emily Fox. Stochastic variational inference for hidden Markov models. In Neural Information Processing Systems, 2014.
[9] Andreas Stolcke and Stephen Omohundro. Hidden Markov model induction by Bayesian model merging. In Neural Information Processing Systems, 1993.
[10] Chong Wang and David M. Blei. 
A split-merge MCMC algorithm for the hierarchical Dirichlet process. arXiv preprint arXiv:1201.1657, 2012.
[11] Jason Chang and John W. Fisher III. Parallel sampling of HDPs using sub-cluster splits. In Neural Information Processing Systems, 2014.
[12] Emily B. Fox, Michael C. Hughes, Erik B. Sudderth, and Michael I. Jordan. Joint modeling of multiple time series via the beta process with application to motion capture segmentation. Annals of Applied Statistics, 8(3):1281–1313, 2014.
[13] Michael C. Hughes and Erik B. Sudderth. Memoized online variational inference for Dirichlet process mixture models. In Neural Information Processing Systems, 2013.
[14] Michael C. Hughes, Dae Il Kim, and Erik B. Sudderth. Reliable and scalable variational inference for the hierarchical Dirichlet process. In Artificial Intelligence and Statistics, 2015.
[15] Yee Whye Teh, Kenichi Kurihara, and Max Welling. Collapsed variational inference for HDP. In Neural Information Processing Systems, 2008.
[16] Michael Bryant and Erik B. Sudderth. Truly nonparametric online variational inference for hierarchical Dirichlet processes. In Neural Information Processing Systems, 2012.
[17] Percy Liang, Slav Petrov, Michael I. Jordan, and Dan Klein. The infinite PCFG using hierarchical Dirichlet processes. In Empirical Methods in Natural Language Processing, 2007.
[18] Matthew James Beal. Variational algorithms for approximate Bayesian inference. PhD thesis, University of London, 2003.
[19] Matt Hoffman, David Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1), 2013.
[20] Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan. Streaming variational Bayes. In Neural Information Processing Systems, 2013.
[21] Chong Wang and David Blei. Truncation-free online variational inference for Bayesian nonparametric models. 
In Neural Information Processing Systems, 2012.
[22] J. Van Gael, Y. Saatci, Y. W. Teh, and Z. Ghahramani. Beam sampling for the infinite hidden Markov model. In International Conference on Machine Learning, 2008.
[23] NIST. Rich transcriptions database. http://www.nist.gov/speech/tests/rt/, 2007.
[24] Michael M. Hoffman, Orion J. Buske, Jie Wang, Zhiping Weng, Jeff A. Bilmes, and William S. Noble. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nature Methods, 9(5):473–476, 2012.