{"title": "Effective Split-Merge Monte Carlo Methods for Nonparametric Models of Sequential Data", "book": "Advances in Neural Information Processing Systems", "page_first": 1295, "page_last": 1303, "abstract": "Applications of Bayesian nonparametric methods require learning and inference algorithms which efficiently explore models of unbounded complexity. We develop new Markov chain Monte Carlo methods for the beta process hidden Markov model (BP-HMM), enabling discovery of shared activity patterns in large video and motion capture databases. By introducing split-merge moves based on sequential allocation, we allow large global changes in the shared feature structure. We also develop data-driven reversible jump moves which more reliably discover rare or unique behaviors. Our proposals apply to any choice of conjugate likelihood for observed data, and we show success with multinomial, Gaussian, and autoregressive emission models. Together, these innovations allow tractable analysis of hundreds of time series, where previous inference required clever initialization and at least ten thousand burn-in iterations for just six sequences.", "full_text": "Effective Split-Merge Monte Carlo Methods for\n\nNonparametric Models of Sequential Data\n\nMichael C. Hughes1, Emily B. Fox2, and Erik B. Sudderth1\n\n1Department of Computer Science, Brown University, {mhughes,sudderth}@cs.brown.edu\n\n2Department of Statistics, University of Washington, ebfox@stat.washington.edu\n\nAbstract\n\nApplications of Bayesian nonparametric methods require learning and inference\nalgorithms which ef\ufb01ciently explore models of unbounded complexity. We de-\nvelop new Markov chain Monte Carlo methods for the beta process hidden\nMarkov model (BP-HMM), enabling discovery of shared activity patterns in large\nvideo and motion capture databases. By introducing split-merge moves based on\nsequential allocation, we allow large global changes in the shared feature struc-\nture. 
We also develop data-driven reversible jump moves which more reliably discover rare or unique behaviors. Our proposals apply to any choice of conjugate likelihood for observed data, and we show success with multinomial, Gaussian, and autoregressive emission models. Together, these innovations allow tractable analysis of hundreds of time series, where previous inference required clever initialization and lengthy burn-in periods for just six sequences.

1 Introduction

Bayesian nonparametric time series models, including various "infinite" Markov switching processes [1, 2, 3], provide a promising modeling framework for complex sequential data. We focus on the problem of discovering coherent, short-term activity patterns, or "behaviors", shared among related time series. For example, given collections of videos or human motion capture sequences, our goal is to (i) identify a concise global library of behaviors that explain the observed motions, (ii) annotate each sequence with the subset of behaviors used, and (iii) label each timestep with one active behavior. The beta process hidden Markov model (BP-HMM) [4] offers a promising solution to such problems, allowing an unbounded set of relevant behaviors to be learned from data.

Learning BP-HMMs from large datasets poses significant computational challenges. Fox et al. [4] considered a dataset containing only six motion capture sequences and proposed a Markov chain Monte Carlo (MCMC) method that required careful initialization and tens of thousands of burn-in iterations. Their sampler included innovations like block state sequence resampling [5] and marginalization of some variables. However, like most MCMC samplers, their proposals only modified small subsets of variables at each step. Additionally, the sampler relied on parameter proposals from priors, leading to low acceptance rates for high-dimensional data.
Alternative single-site MCMC moves, such as those based on slice sampling [6, 7], can exhibit similarly slow mixing. Our goal is to expose this pervasive issue with conventional MCMC, and develop new samplers that rapidly explore the structural uncertainty inherent in Bayesian nonparametric models. While our focus is on the BP-HMM, the technical innovations underlying our samplers are much more broadly applicable.

We make two complementary improvements to previous BP-HMM samplers [4]. First, we develop split-merge moves which change many variables simultaneously, allowing rapid improvements in the discovered behavior library. Our approach builds on previous work on restricted Gibbs proposals [8] and sequential allocation strategies [9], both of which were formulated for static Dirichlet process (DP) mixture models [10]. Second, we design data-driven [11] reversible jump moves [12] which efficiently discover behaviors unique to a single sequence. These data-driven proposals are especially important for high-dimensional observation sequences. Both innovations apply to any likelihood model with a conjugate prior; we show success with multinomial models of discrete video descriptors, and Gaussian autoregressive models of continuous motion capture trajectories.

Figure 1: Left: The BP-HMM as a directed graphical model. Right: Illustration of our split and merge proposals, which modify both binary feature assignment matrices F (white indicates present feature) and state sequences z. We show F, z before (top) and after (bottom) feature km (yellow) is split into ka, kb (red, orange). An item i with fi,km = 1 can have either ka, kb, or both after the split, and its new zi sequence can use any features available in fi. An item without km cannot possess ka, kb, and its state sequence zi does not change.

We begin in Sec. 2 by reviewing the BP-HMM model. We describe previous BP-HMM samplers [4] in Sec.
3.1, and then develop our novel split-merge and data-driven proposals in Sec. 3.3-3.4. We evaluate our contributions in Sec. 4, showing improvements over prior work modeling human motions captured both via a marker-based mocap system [4] and video cameras [13].

2 Beta Process Hidden Markov Models

Latent feature models intuitively capture the sparse sharing patterns occurring in collections of human action sequences. For example, one sequence may contain jogging and boxing, while another has jogging and dancing. We assign the ith sequence, or "item", a sparse binary vector fi = [fi1, fi2, . . .] indicating the presence or absence of each feature in the unbounded global collection. Given N items, corpus-wide assignments are denoted by a binary matrix F whose ith row is fi.¹ The feature matrix F is generated by an underlying stochastic process, the beta process (BP) [14]:

\[ B \mid B_0, \gamma, \beta \sim \mathrm{BP}(\beta, \gamma B_0), \qquad B = \sum_{k=1}^{\infty} b_k \, \delta_{\theta_k}. \tag{1} \]

A realization B of the BP contains infinitely many features k. For each feature, θk ∼ B0 marks its data-generation parameters, while bk ∈ (0, 1) denotes its inclusion probability. The binary feature vector for item i is determined by independent Bernoulli draws fik ∼ Ber(bk). Marginalizing over B, the number of active features in item i has distribution Poisson(γ), determined by mass parameter γ. The concentration parameter β controls how often features are shared between items [15].

To apply feature models to time series data, Fox et al. [4] combine the BP with a hidden Markov model to form the BP-HMM, shown in Fig. 1. The binary vector fi determines a finite set of states available for item i. Each timestep t is assigned a single state zit = k from the set {k : fik = 1}, determining which parameters θk generate data xit. Many different data-generation models are possible.
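Marginalizing B as described above induces an Indian-buffet-style sequential draw of F (cf. [14, 15]): item i inherits an existing feature k with probability proportional to the number of previous owners, then creates a Poisson-distributed number of new features, so that item 1 creates Poisson(γ) features. A minimal pure-Python sketch; the function names and the two-parameter update m_k / (β + i − 1) follow the standard beta-process IBP construction, not code from the paper:

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's inversion-by-multiplication Poisson sampler."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def sample_feature_matrix(num_items, gamma=2.0, beta=1.0, seed=0):
    """Draw a binary feature matrix F from the Indian-buffet-style
    process obtained by marginalizing the beta process B.

    Item i (1-indexed) inherits existing feature k with probability
    m_k / (beta + i - 1), where m_k counts previous owners, then
    creates Poisson(gamma * beta / (beta + i - 1)) new features.
    """
    rng = random.Random(seed)
    owners = []          # owners[k] = number of items possessing feature k
    rows = []
    for i in range(1, num_items + 1):
        fi = [1 if rng.random() < m / (beta + i - 1) else 0 for m in owners]
        n_new = sample_poisson(gamma * beta / (beta + i - 1), rng)
        fi.extend([1] * n_new)
        for prev in rows:                 # pad earlier rows for new columns
            prev.extend([0] * n_new)
        owners = [m + f for m, f in zip(owners, fi[:len(owners)])] + [1] * n_new
        rows.append(fi)
    return rows

F = sample_feature_matrix(num_items=5)
```

Every feature column is owned by at least the item that created it, matching the sparse sharing pattern the BP-HMM exploits.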
As in [4], for motion capture data we use a first-order Gaussian autoregressive process with parameters θk = (Ak, Σk) drawn from a matrix-normal inverse-Wishart conjugate prior:

\[ x_{it} \mid z_{it} = k,\, x_{i,t-1} \sim \mathcal{N}(A_k x_{i,t-1}, \Sigma_k), \qquad A_k, \Sigma_k \mid B_0 \sim \mathrm{MNW}^{-1}(\nu, S_0, R_0). \tag{2} \]

To study video, we use a Dirichlet-multinomial model for quantized interest point descriptors [13]:

\[ x_{it} \mid z_{it} = k \sim \mathrm{Multinomial}(\theta_k), \qquad \theta_k \mid B_0 \sim \mathrm{Dirichlet}(\lambda_0, \lambda_0, \ldots, \lambda_0). \tag{3} \]

¹ Throughout this paper, for variables wijk we use w to denote the vector or collection of wijk's over the entire set of subscripts, and wi for the collection over only the omitted subscripts j and k.

The BP-HMM allows each item independent transition dynamics. The transition distribution πij from each state j for the HMM of item i is built by drawing a set of transition weights ηi, and then normalizing these over the set of active features fi:

\[ \eta_{ijk} \sim \mathrm{Gamma}(\alpha + \kappa\, \delta_{jk},\, 1), \qquad \pi_{ijk} = \frac{\eta_{ijk}\, f_{ik}}{\sum_{\ell} f_{i\ell}\, \eta_{ij\ell}}. \tag{4} \]

Here, δjk = 1 if j = k, and 0 otherwise. This construction assigns positive transition mass πijk only to features k active in fi. The sticky parameter κ places extra expected mass on self-transitions [3], biasing the model to learn state sequences z with temporally persistent states.

3 MCMC Inference with Split-Merge Proposals

We first summarize the MCMC methods previously proposed for the BP-HMM [4]. We then present our novel contributions: split-merge moves and data-driven reversible jump proposals.
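As a concrete illustration, the sticky transition construction of Eq. (4) can be sampled directly. The sketch below is our own toy code with arbitrary hyperparameter values: it draws Gamma weights with the κ self-transition bonus and normalizes them over an item's active features:

```python
import random

def transition_matrix(f_i, alpha=2.0, kappa=10.0, seed=0):
    """Per-item transition probabilities following Eq. (4):
    eta_ijk ~ Gamma(alpha + kappa * delta_jk, 1), then normalize the
    weights over the active features of item i (those with f_ik = 1)."""
    rng = random.Random(seed)
    K = len(f_i)
    pi_i = []
    for j in range(K):
        eta_j = [rng.gammavariate(alpha + (kappa if j == k else 0.0), 1.0)
                 for k in range(K)]
        masked = [e * f for e, f in zip(eta_j, f_i)]   # zero out inactive states
        total = sum(masked)
        pi_i.append([w / total for w in masked])
    return pi_i

pi = transition_matrix([1, 0, 1])   # feature 1 is inactive for this item
```

Each row sums to one, and inactive features receive exactly zero transition mass, as the text describes.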
Full algorithmic details for all samplers are found in the supplement, and our code is available online.

3.1 Local Monte Carlo Proposals

Fox et al. [4]'s sampler alternates updates to HMM parameters θ and η, discrete state sequences z, and feature assignments F. Fixing F defines a collection of finite HMMs, so each zi can be block sampled by dynamic programming [5], and then θ, η drawn from conjugate posteriors.² Sampling each item's features requires separate updates to features shared with some other time series and features unique to item i. Both updates marginalize state sequences z and inclusion weights b.

For each shared feature k, Fox et al. propose flipping fik to the complementary binary value and accept or reject according to the Metropolis-Hastings (MH) rule. This local move alters only one entry in F while holding all others fixed; the split-merge moves of Sec. 3.3 improve upon it.

For unique features, Fox et al. [4] define a reversible pair of birth and death moves which add or delete features to single sequences. While this approach elegantly avoids approximating the infinite BP-HMM, their birth proposals use the (typically vague) prior to propose emission parameters θk∗ for new features k∗. We remedy this slow exploration with data-driven proposals in Sec. 3.4.

3.2 Split-Merge Proposals for Dirichlet Process Models

Split-merge MCMC methods were first applied to nonparametric models by Jain and Neal [8] in work focusing on DP mixture models with conjugate likelihoods. Conjugacy allows samplers to operate directly on discrete partitions of observations into clusters, marginalizing emission parameters. Jain and Neal present valid proposals that reversibly split a single cluster km into two (ka, kb), or merge two clusters into one.
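Jain and Neal's split construction can be made concrete with a toy conjugate example. The sketch below is our own illustration, using a Beta-Bernoulli mixture rather than any model from the paper: it seeds two sub-clusters from anchor items and assigns the remaining items one at a time with the collapsed predictive, in the spirit of the restricted Gibbs and sequential-allocation strategies discussed in this section:

```python
import random

def seq_alloc_split(x, anchor_a, anchor_b, a=1.0, b=1.0, seed=0):
    """Split the items of one cluster (binary observations x) into two
    sub-clusters seeded by anchor items, allocating one item at a time.

    Each remaining item joins sub-cluster c with probability proportional
    to n_c * predictive(x_i | data already in c), using the collapsed
    Beta(a, b)-Bernoulli predictive (heads_c + a) / (n_c + a + b)."""
    rng = random.Random(seed)
    clusters = {0: [anchor_a], 1: [anchor_b]}   # lists of indices into x
    rest = [i for i in range(len(x)) if i not in (anchor_a, anchor_b)]
    rng.shuffle(rest)
    for i in rest:
        weights = []
        for c in (0, 1):
            n_c = len(clusters[c])
            heads = sum(x[j] for j in clusters[c])
            p1 = (heads + a) / (n_c + a + b)    # predictive prob of x_i = 1
            weights.append(n_c * (p1 if x[i] == 1 else 1.0 - p1))
        pick = 0 if rng.random() * sum(weights) < weights[0] else 1
        clusters[pick].append(i)
    return clusters

split = seq_alloc_split([1, 1, 1, 0, 0, 0], anchor_a=0, anchor_b=5)
```

Because each assignment conditions on the data already allocated, the proposed partition tends to agree with the data after a single pass, which is exactly the advantage of sequential allocation over random initialization.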
Since merges are deterministic, the primary contribution of [8] is a generic technique, restricted Gibbs (RG) sampling, for proposing splits consistent with the data. To construct an initial split of km, the RG sampler first assigns items in cluster km at random to either ka or kb. Starting from this partition, the proposal is constructed by performing one-at-a-time Gibbs updates, forgetting an item's current cluster and reassigning to either ka or kb conditioned on the remaining partitioned data. After several sweeps, these Gibbs updates encourage proposed clusters ka and kb which agree with the data and thus are more likely to be accepted. For non-conjugate models, more complex RG proposals can be constructed which instantiate emission parameters [16].

Even in small datasets, there can be significant benefits from performing five or more sweeps for each RG proposal [8]. For large datasets, however, requiring many sweeps for a single proposal is computationally expensive. An alternative sequential allocation [9] method replaces the random initialization of RG by using two randomly chosen items to "anchor" the two new clusters ka, kb. Remaining items are then sequentially assigned to either ka or kb one-at-a-time, using RG moves conditioning only on previously assigned data. This creates a proposed partition in agreement with the data after only one sampling sweep. Recent work has shown some success with sequentially-allocated split-merge moves for a hierarchical DP topic model [17].

² Fox et al. [4] contains a small error in the resampling of η, as detailed and corrected in the supplement.

For nonparametric models not based on the DP, split-merge moves are not well studied. Several authors have considered RG split-merge proposals for beta process models [18, 19, 20].
However, these papers lack technical details, and do not contain experiments showing improved mixing.

3.3 Split-Merge Proposals for the BP-HMM

We now adapt RG and sequential allocation to define BP-HMM split-merge moves. In the mixture models considered by prior work [8, 9], each data item i is associated with a single cluster ki, so selecting two anchors i, j also identifies two cluster indices ki, kj. However, in feature-based models such as the BP-HMM, each data item i is associated with a collection of features indexed by fi. Therefore, our proposals require mechanisms for selecting anchors and for choosing candidate states to split or merge from fi, fj. Additionally, our proposals must allow changes to state sequences z to reflect changes in F. Our proposals thus jointly create a new configuration (F∗, z∗), collapsing away HMM parameters θ, η. Fig. 1 illustrates (F, z) before and after a split move.

Selecting Anchors: Following [9], we first randomly select distinct anchor data items i and j. The fixed choice of i, j defines a split-merge transition kernel satisfying detailed balance. Next, we select from each anchor one feature it possesses, denoted ki, kj. This choice determines the proposed move: we merge ki, kj if they are distinct, and split ki = kj into two new features otherwise.

Selecting ki, kj uniformly at random is problematic. First, in datasets with many features choosing ki = kj is unlikely, making split moves rare. We need to bias the selection process to propose splits more often. Second, in a well-trained model most feature pairs will not make a sensible merge. Selecting a pair that explains similar data is crucial for efficiency.
We thus develop a proposal distribution which first draws ki uniformly from fi, and then selects kj given fixed ki as follows:

\[ q_k(k_i, k_j) = \mathrm{Unif}(k_i \mid f_i)\, q(k_j \mid k_i, f_j), \qquad q(k_j = k \mid k_i, f_j) \propto \begin{cases} 2\, C_j f_{jk} & \text{if } k = k_i \\ f_{jk}\, \dfrac{m(x_{k_i}, x_k)}{m(x_{k_i})\, m(x_k)} & \text{otherwise,} \end{cases} \tag{5} \]

where \( C_j = \sum_{k \neq k_i} f_{jk}\, m(x_{k_i}, x_k) / \left[ m(x_{k_i})\, m(x_k) \right] \). Here, xk is the data assigned to k, and m(·) denotes the marginal likelihood of data, collapsing away emission parameters θ. This construction gives large mass (2/3) to a split move when possible, and also encourages choices ki ≠ kj for a merge that explain similar data via the marginal likelihood ratio. A large ratio means the model prefers to explain all data assigned to ki, kj together rather than separately, biasing selection towards promising merge candidates. We find higher acceptance rates for merges under this qk, which justifies the small cost of computing m(·) from cached sufficient statistics.

Once ki, kj are fixed, we construct the candidate state F∗, z∗. As shown in Fig. 1, we only alter fℓ, zℓ for items ℓ which possess either ki or kj. We call this set of items the active set S.

Split: Our split proposal is defined in Alg. 1. Sweeping through a random permutation of items ℓ in the active set S, we draw each item's assignment to new features ka, kb and resample its state sequence. We sample [fℓ,ka, fℓ,kb] from its conditional posterior given previously visited items in S, requiring that ℓ must possess at least one of the new features. We then block sample its state sequence zℓ given fℓ. The dynamic programming recursions underlying these proposals use non-random auxiliary parameters: η̂ℓ is set to its prior mean, and θ̂k to its posterior mean given the current z.
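Stepping back to the anchor-feature selection rule: given cached marginal likelihoods, Eq. (5) reduces to a handful of weights. A sketch, where `log_m_single` and `log_m_pair` are hypothetical stand-ins for the cached marginal-likelihood computations a real implementation would supply:

```python
import math

def select_kj_distribution(ki, f_j, log_m_single, log_m_pair):
    """Probabilities for choosing k_j given fixed k_i, following Eq. (5).

    Each candidate k != k_i active in f_j gets the merge weight
        m(x_ki, x_k) / (m(x_ki) m(x_k)),
    while k = k_i (a split, if f_j possesses k_i) gets weight 2 * C_j,
    C_j being the total merge weight, so splits receive mass 2/3.
    Assumes f_j activates at least one candidate feature."""
    merge_w = {}
    for k, active in enumerate(f_j):
        if active and k != ki:
            merge_w[k] = math.exp(
                log_m_pair(ki, k) - log_m_single(ki) - log_m_single(k))
    C_j = sum(merge_w.values())
    weights = dict(merge_w)
    if f_j[ki]:
        weights[ki] = 2.0 * C_j
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

# Toy check: when every merge ratio equals 1, a split is chosen w.p. 2/3.
probs = select_kj_distribution(
    ki=0, f_j=[1, 1, 1],
    log_m_single=lambda k: 0.0,
    log_m_pair=lambda a, b: 0.0)
```

Working in log space keeps the cached marginal likelihoods numerically stable for long sequences.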
For new states k∗ ∈ {ka, kb}, we initialize θ̂k∗ from anchor sequences and then update to account for new data assigned to k∗ at each item ℓ. This enables better matching of proposed features to data statistics. Finally, we sample f∗, z∗ for anchor items, enforcing f∗i,ka = 1 and f∗j,kb = 1 so the move remains reversible under a merge. This does not force z∗i to use ka nor z∗j to use kb.

Merge: For a merge move, constructing F∗ is deterministic: we set f∗ℓ,km = 1 for ℓ ∈ S, and 0 otherwise. We thus need only to sample z∗ℓ for items in S, using a block sampler as in Alg. 1. Again this requires auxiliary HMM parameters θ̂, η̂, which we emphasize are deterministic tools enabling collapsed proposals of discrete indicators F∗, z∗. We never sample θ, η.

Accept-Reject: After drawing a candidate value F∗, z∗, the final step is to compute a Metropolis-Hastings acceptance probability min(ρ, 1). We give the ratio for a split move which creates features

Alg.
1 Propose split of feature km into ka, kb, given F, z, x, anchor items i, j, and active set S = {ℓ : fℓ,km = 1}

1: fi,[ka,kb] ← [1 0];  zi,t : zi,t = km ← ka   (use anchor i to create ka)
2: fj,[ka,kb] ← [0 1];  zj,t : zj,t = km ← kb   (use anchor j to create kb)
3: θ̂ ← E[θ | x, z, λ];  η̂ℓ ← E[ηℓ | α, κ] for ℓ ∈ S   (set HMM params deterministically)
4: Sprev ← {i, j}   (initialize set of previously visited items)
5: for non-anchor items ℓ in a random permutation of the active set S:   (draw f, z and update θ̂ for each item)
6:   fℓ,[ka kb] ∼ {[1 0], [0 1], [1 1]}, with probability ∝ p(fℓ,[ka kb] | fSprev,[ka kb]) p(xℓ | fℓ, θ̂, η̂ℓ)   (condition on previously visited items)
7:   zℓ ∼ p(zℓ | xℓ, fℓ, θ̂, η̂ℓ)
8:   add ℓ to Sprev
9:   for k = ka, kb:  θ̂k ← E[θk | λ, {xnt : znt = k, n ∈ Sprev}]
10: fi,[ka kb] ∼ {[1 0], [1 1]};  fj,[ka kb] ∼ {[0 1], [1 1]}   (finish by sampling f, z for anchors)
11: zi ∼ p(zi | xi, fi, θ̂, η̂i);  zj ∼ p(zj | xj, fj, θ̂, η̂j)

ka, kb from km below. The acceptance ratio for a merge move is the reciprocal of Eq. (6).

\[ \rho_{\mathrm{split}} = \frac{p(x, z^*, F^*)}{p(x, z, F)} \cdot \frac{q_k(k_a, k_b \mid x, F^*, z^*, i, j)}{q_k(k_m, k_m \mid x, F, z, i, j)} \cdot \frac{q_{\mathrm{merge}}(F, z \mid x, F^*, z^*, k_a, k_b)}{q_{\mathrm{split}}(F^*, z^* \mid x, F, z, k_m)} \tag{6} \]

The joint probability p(x, z, F) is only tractable with conjugate likelihoods. Proposals which instantiate emission parameters θ, as in [16], would be required in the non-conjugate case.

3.4 Data-Driven Reversible Jump Birth and Death Proposals

Efficiently adding or deleting unique features is crucial for good mixing.
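Because every term in Eq. (6) is a ratio, the accept-reject test is most safely evaluated in log space. A small generic helper, our own scaffolding rather than the paper's code, with the proposal terms of Eq. (6) folded into forward and reverse log-proposal arguments:

```python
import math
import random

def mh_accept_log(log_joint_new, log_joint_old,
                  log_q_reverse, log_q_forward, rng=None):
    """Metropolis-Hastings test with acceptance probability min(rho, 1):
    log rho = [log p(x, z*, F*) - log p(x, z, F)]
            + [log q(reverse move) - log q(forward move)].
    Comparing against exp(min(0, log rho)) avoids log(0) edge cases."""
    rng = rng or random.Random()
    log_rho = (log_joint_new - log_joint_old) + (log_q_reverse - log_q_forward)
    return rng.random() < math.exp(min(0.0, log_rho))

# A proposal that improves the joint with symmetric proposal cost has
# log rho > 0 and is accepted with probability one.
accepted = mh_accept_log(-90.0, -100.0, -1.0, -1.0)
```

The same helper serves split, merge, and birth-death moves, since each reduces to a joint-probability ratio times a proposal ratio.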
To accept the birth of new feature k∗ = K + 1 for item i, this feature must explain some of the observed data xi at least as well as existing features 1, 2, . . . , K. High-dimensional emission parameters θk∗ drawn from a vague prior are unlikely to match the data at hand. Instead, we suggest a data-driven proposal [11, 13] for θk∗. First, select at random a subwindow W of the current sequence i. Next, use data in this subwindow xW = {xit : t ∈ W} to create a proposal distribution:

\[ q_\theta(\theta) = \tfrac{1}{2}\, p(\theta) + \tfrac{1}{2}\, p(\theta \mid x_W), \]

which is a mixture of θ's prior and posterior given xW. This mixture strikes a balance between proposing promising new features (via the posterior) while also making death moves possible, since the diffuse prior will place some mass on the reverse birth move.

Let Ui denote the number of unique features in fi, and ν = γβ/(N − 1 + β). The acceptance probability for a birth move to candidate state f∗i, η∗i, θ∗ is then min(ρbirth, 1), where

\[ \rho_{\mathrm{birth}} = \frac{p(x_i \mid f_i^*, \eta_i^*, \theta^*)}{p(x_i \mid f_i, \eta_i, \theta)} \cdot \frac{\mathrm{Poi}(U_i + 1 \mid \nu)}{\mathrm{Poi}(U_i \mid \nu)} \cdot \frac{p_\theta(\theta^*_{k^*})}{q_\theta(\theta^*_{k^*})} \cdot \frac{q_f(f_i \mid f_i^*)}{q_f(f_i^* \mid f_i)} \tag{7} \]

Eq. (7) is similar to the ratio for the birth proposal from the prior, adding only one term to account for the proposed θ∗k∗. Note that each choice of W defines a valid pair of birth-death moves satisfying detailed balance, so we need not account for this choice in the acceptance ratio [21].

4 Experimental Results

Our experiments compare competing inference algorithms for the BP-HMM; for comparisons to alternative models, see [4].
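Before turning to experiments, two pieces of Eq. (7) are worth making concrete: the log-density of the half-prior, half-posterior mixture q_θ, and the Poisson ratio, which simplifies algebraically to ν/(U_i + 1). A sketch with hypothetical helper names:

```python
import math

def log_q_theta(log_prior, log_posterior):
    """log of the data-driven mixture q(theta) = 0.5 p(theta) + 0.5 p(theta | x_W),
    combined stably via log-sum-exp."""
    hi = max(log_prior, log_posterior)
    return hi + math.log(0.5 * math.exp(log_prior - hi)
                         + 0.5 * math.exp(log_posterior - hi))

def log_poisson_ratio(num_unique, gamma, beta, num_items):
    """log[ Poi(U_i + 1 | nu) / Poi(U_i | nu) ] with nu = gamma*beta/(N - 1 + beta).
    The factorials and exp(-nu) cancel, leaving nu / (U_i + 1)."""
    nu = gamma * beta / (num_items - 1 + beta)
    return math.log(nu) - math.log(num_unique + 1)
```

Because the ratio shrinks as U_i grows, births of ever more unique features are automatically penalized, matching the Poisson prior on unique feature counts.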
To evaluate how well our novel moves explore the posterior distribution, we compare three possible methods for adding or removing features: split-merge moves (SM, Sec. 3.3), data-driven moves (DD, Sec. 3.4), and reversible jump moves using the prior (Prior [4], Sec. 3.1). All experiments interleave the chosen update with the standard MH updates to shared features of F and Gibbs updates to HMM parameters θ, η described in Sec. 3.1.

For each comparison, we run multiple initializations and rank the chains from "best" to "worst" according to joint probability. Each chain is allowed the same amount of computer time.

(Figure 2 panel labels: Gaussian, AR, Multinomial; Worst SM, Worst DD, Best Prior.)

Figure 2: Feature creation for synthetic data with Gaussian (left), AR (middle), or multinomial (right) likelihoods. Each run begins with one feature used by all items, and must add new features via split-merge proposals (SM), or reversible-jump moves using data-driven (DD) or prior (Prior) proposals. Top: Log joint probability versus computation time, for 10 random initializations of each sampling algorithm. Bottom: Emission parameters associated with the last sample after one hour of computation time. Gaussian θ = (μ, Σ) and AR θ = (A, Σ) shown as contour lines in first two dimensions, with location determined by μ, A. Multinomial θ shown as image where row k gives the emission distribution over vocabulary symbols for state k.

4.1 Synthetic Data

We examined toy datasets generated by a known set of 8 features (behaviors) θtrue. To validate that our contributions apply for many choices of likelihood, we create three datasets: multinomial "bag-of-words" emissions using 500 vocabulary words, 8-dimensional Gaussian emissions, and a first-order autoregressive (AR) process with 5 dimensions.
Each dataset has N = 100 sequences.

First, we study how well each method creates necessary features from scratch. We initialize the sampler with just one feature used by all items, and examine how many true states are recovered after one hour of computation time across 10 runs. We show trace plots as well as illustrations of recovered emission parameters θ in Fig. 2. All runs of both SM and DD moves find all true states within several minutes, while no Prior run recovers all true states, remaining stuck with merged versions of true features. DD moves add new features most rapidly due to low computational cost.

We next examine whether each inference method can remove unnecessary features. We consider a different toy dataset of several hundred sequences and a redundant initialization in which 2 copies of each true state exist. Half of the sequences are initialized with f, z set to corresponding true values in copy 1, and the second half using copy 2. Using Gaussian and AR likelihoods, all SM runs merge down to the 8 true states, at best within five minutes, but no DD or Prior run ever reaches this optimal configuration in the allotted hour. Merge moves enable critical global changes, while the one-at-a-time updates of [4] (and our DD variant) must take long random walks to completely delete a popular feature. Further details are provided in the supplementary material.

These results demonstrate the importance of DD birth and split moves for exploring new features, and merge moves for removing features via proposals of large assignment changes. As such, we consider a sampler that interleaves SM and DD moves in our subsequent analyses.

4.2 Motion Capture Data

We next apply our improved MCMC methods to motion capture (mocap) sequences from the CMU database [22]. First, we consider the small dataset examined by [4]: 6 sequences of physical exercises with motion measurements at 12 joint angles, modeled with an AR(1) likelihood. Human
Human\nannotation identi\ufb01es 12 actions, which we take as ground truth. Previous results [4] show that the\nBP-HMM outperforms competitors in segmenting these actions, although they report that some true\nactions like jogging are split across multiple recovered features (see their Fig. 5). We set likelihood\nhyperparameters similarly to [4], with further details provided in the supplement.\n\n6\n\n0500100015002000250030003500\u22124\u22122024x 104cpu time (sec)log joint prob. SMDDPrior0500100015002000250030003500\u22121.4\u22121.2\u22121\u22120.8x 106cpu time (sec)log joint prob. SMDDPrior0500100015002000250030003500\u22126.5\u22126\u22125.5\u22125\u22124.5x 105cpu time (sec)log joint prob. SMDDPrior\u2212101\u22121.5\u22121\u22120.500.511.5\u22121\u22120.500.51\u22120.200.21002003004005002468\u2212101\u22121.5\u22121\u22120.500.511.5\u22121\u22120.500.51\u22120.200.21002003004005002468\u2212101\u22121.5\u22121\u22120.500.511.5\u22121\u22120.500.51\u22120.200.2100200300400500246\fFigure 3: Analysis of 6 motion capture sequences previously considered by [4]. Left: Joint log probability\nand Hamming distance (from manually annotated behaviors) for 20 runs of each method over 10 hours. Right:\nExamples of arm circles and jogging from 3 sequences, along with estimated zi of last sample from the best\nrun of each method. SM+DD moves (top row started from one feature, middle row started with 5 unique\nstates per sequence) successfully explain each action with one primary state, while [4]\u2019s sampler (bottom row)\nstarted from 5 unique features remains stuck with multiple unique states for one true action.\n\nBallet\n\nWalk\n\nSquat\n\nSword\n\nLambada\n\nDribble Basketball\n\nBox\n\nFigure 4: Analysis of 124 mocap sequences, showing 10 of 33 recovered behaviors. Skeleton trajectories are\nbuilt from contiguous segments \u2265 1 sec long assigned to each behavior. Boxed groups contain segments from\ndistinct sequences assigned to the same state. 
Some states only appear in one sequence.

(Additional Figure 4 panel titles: Climb, Indian Dance, Tai Chi.)

In Fig. 3, we compare a sampler which interleaves SM and DD moves with [4]'s original method. We run 20 chains of each method for ten hours from two initializations: unique5, which assigns 5 unique features per sequence (as done in [4]), and one, using a single feature across all items. In both log probability and Hamming distance, SM+DD methods are noticeably superior. Most interestingly, SM+DD starting from a parsimonious one feature achieves best performance overall, showing that clever initialization is not necessary with our algorithm. The best run of SM+DD from one achieves Hamming distance of 0.22, compared to 0.3 reported in [4]. No Prior proposal run from one created any additional states, indicating the importance of using our improved methods of feature exploration even in moderate dimensions.

Our SM moves are critical for effectively creating and deleting features. Example segmentations of arm-circles and jogging actions in Fig. 3 show that SM+DD consistently use one dominant behavior across all segments where the action appears. In contrast, the Prior remains stuck with some unique behaviors used in different sequences, yielding lower probability and larger Hamming distance.

Next, we study a larger dataset of 124 sequences, all "Physical Activities & Sports" examples from the CMU mocap dataset. Analyzing a dataset of this size is computationally infeasible using the methods of [4]. Initializing from unique5 would create over 500 features, requiring a prohibitively long sampling run to merge related behaviors. When initialized from one, the Prior sampler creates no additional features. In contrast, starting from one, our SM+DD moves rapidly identify a diverse set of 33 behaviors. A set of 10 behaviors representative of this dataset is shown in Fig. 4.
Our improved algorithm robustly explores the posterior space, enabling this large-scale analysis.

4.3 CMU Kitchen: Activity Discovery from Video

Finally, we apply our new inference methods to discover common motion patterns from videos of recipe preparation in the CMU Kitchen dataset [23]. Each video is several minutes long and depicts a single actor in the same kitchen preparing either a pizza, a sandwich, a salad, brownies, or eggs. Our previous work [13] showed promising results in activity discovery with the BP-HMM using a

Figure 5: Activity discovery with 126 Kitchen videos, showing locations of select behaviors over time.
Each row summarizes zi for a single video, labeled at left by recipe type (label not provided to the BP-HMM). We show only behaviors assigned to at least two timesteps in a local window.

small collection of 30 videos from this collection. We compare our new SM moves on this small dataset, and then study a larger set of 126 Kitchen videos using our improved sampler.

Using only the 30 video subset, Fig. 6 compares the combined SM+DD sampler with just DD or Prior moves, using fixed hyperparameter settings as in [13] and starting with just one feature. DD proposals offer significant gains over the prior, and further interleaving DD and SM moves achieves the best overall configuration, showing the benefits of proposing global changes to F, z.

Finally, we run the SM+DD sampler on 126 Kitchen sequences, choosing the best of 4 chains after several days of computer time (trace plots show convergence in half this time). Fig. 5 maps behavior assignments over time across all five recipes, using the last MCMC sample. Several intuitive behavior sharing patterns exist: chopping happens with carrots (salad) and pepperoni (pizza), while stirring occurs when preparing brownies and eggs. Non-uniform behavior usage patterns within a category are due to differences in available cooking equipment across videos. Please see the supplement for more experimental details and results.

Figure 6: Joint log probability versus computation time for various samplers on the CMU Kitchen data [23] previously studied by [13].

5 Discussion

We have developed efficient MCMC inference methods for the BP-HMM. Our proposals do not require careful initialization or parameter tuning, and enable exploration of large datasets intractable under previous methods. Our approach makes efficient use of data and applies to any choice of conjugate emission model.
We expect the guiding principles of data-driven and sequentially-allocated proposals to apply to many other models, enabling new applications of nonparametric analysis.

Acknowledgments

M. Hughes was supported in part by an NSF Graduate Research Fellowship under Grant No. DGE0228243. E. Fox was funded in part by AFOSR Grant FA9550-12-1-0453.

[Figure 5 legend: rows grouped by recipe (Brownie, Pizza, Sandwich, Salad, Eggs) versus fraction of time elapsed (0.25, 0.5, 0.75, 1); behavior key: Light Switch, Open Fridge, Stir Bowl 1, Stir Bowl 2, Pour Bowl, Grate Cheese, Slice/Chop, Flip Omelette, 65 Others. Figure 6 axes: log joint prob. versus cpu time (sec) for SM+DD, DD, and Prior.]

References

[1] M. Beal, Z. Ghahramani, and C. Rasmussen. The infinite hidden Markov model. In NIPS, 2002.
[2] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
[3] E. B. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky. A sticky HDP-HMM with application to speaker diarization. Annals of Applied Statistics, 5(2A):1020–1056, 2011.
[4] E. B. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky. Sharing features among dynamical systems with beta processes. In NIPS, 2010.
[5] S. L. Scott. Bayesian methods for hidden Markov models: Recursive computing in the 21st century. Journal of the American Statistical Association, 97(457):337–351, 2002.
[6] J. Van Gael, Y. Saatci, Y. W. Teh, and Z. Ghahramani. Beam sampling for the infinite hidden Markov model. In ICML, 2008.
[7] C. Yau, O. Papaspiliopoulos, G. O. Roberts, and C. Holmes. Bayesian non-parametric hidden Markov models with applications in genomics. JRSS B, 73(1):37–57, 2011.
[8] S. Jain and R. M. Neal. A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics, 13(1):158–182, 2004.
[9] D. B. Dahl.
Sequentially-allocated merge-split sampler for conjugate and nonconjugate Dirichlet process mixture models. Submitted to Journal of Computational and Graphical Statistics, 2005.
[10] M. D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430):577–588, 1995.
[11] Z. Tu and S. C. Zhu. Image segmentation by data-driven Markov chain Monte Carlo. PAMI, 24(5):657–673, 2002.
[12] P. J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711–732, 1995.
[13] M. C. Hughes and E. B. Sudderth. Nonparametric discovery of activity patterns from video collections. In CVPR Workshop on Perceptual Organization in Computer Vision, 2012.
[14] R. Thibaux and M. I. Jordan. Hierarchical beta processes and the Indian buffet process. In AISTATS, 2007.
[15] T. L. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In NIPS, 2007.
[16] S. Jain and R. M. Neal. Splitting and merging components of a nonconjugate Dirichlet process mixture model (with invited discussion). Bayesian Analysis, 2(3):445–500, 2007.
[17] C. Wang and D. Blei. A split-merge MCMC algorithm for the hierarchical Dirichlet process. arXiv:1201.1657v1 [stat.ML], 2012.
[18] E. Meeds, R. Neal, Z. Ghahramani, and S. Roweis. Modeling dyadic data with binary latent factors. In NIPS, 2008.
[19] K. Miller, T. Griffiths, and M. Jordan. Nonparametric latent feature models for link prediction. In NIPS, 2009.
[20] M. Mørup, M. N. Schmidt, and L. K. Hansen. Infinite multiple membership relational modeling for complex networks. In IEEE International Workshop on Machine Learning for Signal Processing, 2011.
[21] L. Tierney. Markov chains for exploring posterior distributions (with discussion). Annals of Statistics, 22:1701–1762, 1994.
[22] Carnegie Mellon University.
Graphics lab motion capture database. http://mocap.cs.cmu.edu/.
[23] F. De la Torre et al. Guide to the Carnegie Mellon University Multimodal Activity (CMU-MMAC) database. Technical Report CMU-RI-TR-08-22, Robotics Institute, Carnegie Mellon University, 2009.
", "award": [], "sourceid": 636, "authors": [{"given_name": "Michael", "family_name": "Hughes", "institution": ""}, {"given_name": "Emily", "family_name": "Fox", "institution": ""}, {"given_name": "Erik", "family_name": "Sudderth", "institution": ""}]}