{"title": "On Mixtures of Markov Chains", "book": "Advances in Neural Information Processing Systems", "page_first": 3441, "page_last": 3449, "abstract": "We study the problem of reconstructing a mixture of Markov chains from the trajectories generated by random walks through the state space. Under mild non-degeneracy conditions, we show that we can uniquely reconstruct the underlying chains by only considering trajectories of length three, which represent triples of states. Our algorithm is spectral in nature, and is easy to implement.", "full_text": "On Mixtures of Markov Chains\n\nRishi Gupta\u2217\nStanford University\nStanford, CA 94305\n\nrishig@cs.stanford.edu\n\nRavi Kumar\n\nGoogle Research\n\nMountain View, CA 94043\n\nravi.k53@gmail.com\n\nSergei Vassilvitskii\nGoogle Research\n\nNew York, NY 10011\nsergeiv@google.com\n\nAbstract\n\nWe study the problem of reconstructing a mixture of Markov chains from the\ntrajectories generated by random walks through the state space. Under mild non-\ndegeneracy conditions, we show that we can uniquely reconstruct the underlying\nchains by only considering trajectories of length three, which represent triples of\nstates. Our algorithm is spectral in nature, and is easy to implement.\n\n1\n\nIntroduction\n\nMarkov chains are a simple and incredibly rich tool for modeling, and act as a backbone in numerous\napplications\u2014from Pagerank for web search to language modeling for machine translation. While\nthe true nature of the underlying behavior is rarely Markovian [6], it is nevertheless often a good\nmathematical assumption.\nIn this paper, we consider the case where we are given observations from a mixture of L Markov\nchains, each on the same n states, with n \u2265 2L. Each observation is a series of states, and is\ngenerated as follows: a Markov chain and starting state are selected from a distribution S, and then\nthe selected Markov chain is followed for some number of steps. 
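The generative process just described can be sketched in a few lines (a minimal sketch, not the authors' code; the array names are ours: `S` holds the joint chain/start-state probabilities and `Ms` the transition matrices):

```python
import numpy as np

def sample_trails(S, Ms, t, num_trails, seed=0):
    """Draw t-trails from a mixture of Markov chains.

    S  : (L, n) array; S[l, i] = probability of picking chain l and
         starting state i (all entries sum to 1).
    Ms : (L, n, n) array of row-stochastic transition matrices.
    """
    rng = np.random.default_rng(seed)
    L, n = S.shape
    trails = []
    for _ in range(num_trails):
        # Pick the (latent) chain and the starting state jointly.
        pick = rng.choice(L * n, p=S.ravel())
        l, state = divmod(pick, n)
        trail = [state]
        for _ in range(t - 1):  # t - 1 random-walk steps along chain l
            state = rng.choice(n, p=Ms[l, state])
            trail.append(state)
        trails.append(trail)
    return trails
```

Note that the chain index `l` is used only inside the sampler; the observer sees the state sequence alone, which is exactly what makes the recovery problem non-trivial.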
The goal is to recover S and the transition matrices of the L Markov chains from the observations.
When all of the observations follow from a single Markov chain (namely, when L = 1), recovering the mixture parameters is easy. A simple calculation shows that the empirical starting distribution and the empirical transition probabilities form the maximum likelihood Markov chain. So we are largely interested in the case when L > 1.
As a motivating example, consider the usage of a standard maps app on a phone. There are a number of different reasons one might use the app: to search for a nearby business, to get directions from one point to another, or just to orient oneself. However, the users of the app never specify an explicit intent; rather, they swipe, type, zoom, etc., until they are satisfied. Each one of the latent intents can be modeled by a Markov chain on a small state space of actions. If the assignment of each session to an intent were explicit, recovering these Markov chains would simply reduce to several instances of the L = 1 case. Here we are interested in the unsupervised setting of finding the underlying chains when this assignment is unknown. This allows for a better understanding of usage patterns. For example:
• Common uses for the app that the designers had not expected, or had not expected to be common. For instance, maybe a good fraction of users (or user sessions) simply use the app to check the traffic.
• Whether different types of users use the app differently.
For instance, experienced users might use the app differently than first-time users, either due to having different goals, or due to accomplishing the same tasks more efficiently.
• Undiscoverable flows, with users ignoring a simple, but hidden menu setting, and instead using a convoluted path to accomplish the same goal.

∗Part of this work was done while the author was visiting Google Research.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

The question of untangling mixture models has received a lot of attention in a variety of different situations, particularly in the case of learning mixtures of Gaussians; see for example the seminal work of [8], as well as later work by [5, 11, 15] and the references therein. This is, to the best of our knowledge, the first work that looks at unraveling mixtures of Markov chains.
There are two immediate approaches to solving this problem. The first is to use the Expectation-Maximization (EM) algorithm [9]. The EM algorithm starts by guessing an initial set of parameters for the mixture, and then performs local improvements that increase the likelihood of the proposed solution. The EM algorithm is a useful benchmark and will converge to some local optimum, but it may be slow to get there [12], and there are no guarantees on the quality of the final solution.
The second approach is to model the problem as a Hidden Markov Model (HMM), and employ the machinery for learning HMMs, particularly the recent tensor decomposition methods [2, 3, 10]. As in our case, this machinery relies on having more observed states than hidden states. Unfortunately, directly modeling a Markov chain mixture as an HMM (or as a mixture of HMMs, as in [13]) requires nL hidden states for n observed states. Given that, one could try adapting the tensor decomposition arguments from [3] to our problem, which is done in Section 4.3 of [14].
However, as the authors note, this requires accurate estimates for the distribution of trajectories (or trails) of length five, whereas our results only require estimates for the distribution of trails of length three. This is a large difference in the amount of data one might need to collect, as one would expect to need Θ(n^t) samples to estimate the distribution of trails of length t.
An entirely different approach is to assume a Dirichlet prior on the mixture, and model the problem as learning a mixture of Dirichlet distributions [14]. Besides requiring the Dirichlet prior, this method also requires very long trails. Finally, we would like to note a connection to the generic identifiability results for HMMs and various mixture models in [1]. Their results are existential rather than algorithmic, but dimension three also plays a central role.
Our contributions. We propose and study the problem of reconstructing a mixture of Markov chains from a set of observations, or trajectories. Let a t-trail be a trajectory of length t: a starting state chosen according to S along with t − 1 steps along the appropriate Markov chain.
(i) We identify a weak non-degeneracy condition on mixtures of Markov chains and show that under that non-degeneracy condition, 3-trails are sufficient for recovering the underlying mixture parameters. We prove that for random instances, the non-degeneracy condition holds with probability 1.
(ii) Under the non-degeneracy condition, we give an efficient algorithm for uniquely recovering the mixture parameters given the exact distribution of 3-trails.
(iii) We show that our algorithm outperforms the most natural EM algorithm for the problem in some regimes, despite EM being orders of magnitude slower.
Organization. In Section 2 we present the necessary background material that will be used in the rest of the paper.
In Section 3 we state and motivate the non-degeneracy condition that is sufficient for unique reconstruction. Using this assumption, in Section 4 we present our four-step algorithm for reconstruction. In Section 5 we present our experimental results on synthetic and real data. In Section 6 we show that random instances are non-degenerate with probability 1.

2 Preliminaries
Let [n] = {1, . . . , n} be a state space. We consider Markov chains defined on [n]. For a Markov chain given by its n × n transition matrix M, let M(i, j) denote the probability of moving from state i to state j. By definition, M is a stochastic matrix: M(i, j) ≥ 0 and Σ_j M(i, j) = 1. (In general we use A(i, j) to denote the (i, j)th entry of a matrix A.)
For a matrix A, let Ā denote its transpose. Every n × n matrix A of rank r admits a singular value decomposition (SVD) of the form A = UΣV where U and V are n × r orthogonal matrices and Σ is an r × r diagonal matrix with non-negative entries. For an L × n matrix B of full rank, its right pseudoinverse B⁻¹ is an n × L matrix of full rank such that BB⁻¹ = I; it is a standard fact that pseudoinverses exist and can be computed efficiently when n ≥ L.
We now formally define a mixture of Markov chains (M, S). Let L ≥ 1 be an integer. Let M = {M^1, . . . , M^L} be L transition matrices, all defined on [n]. Let S = {s^1, . . . , s^L} be a corresponding set of positive n-dimensional vectors of starting probabilities such that Σ_{ℓ,i} s^ℓ_i = 1. Given M and S, a t-trail is generated as follows: first pick the chain ℓ and the starting state i with probability s^ℓ_i, and then perform a random walk according to the transition matrix M^ℓ, starting from i, for t − 1 steps.
Throughout, we use i, j, k to denote states in [n] and ℓ to denote a particular chain.
Let 1n be a column vector of n 1's.
Definition 1 (Reconstructing a Mixture of Markov Chains). Given a (large enough) set of trails generated by a mixture of Markov chains and an L > 1, find the parameters M and S of the mixture.
Note that the number of parameters is O(n² · L). In this paper, we focus on a seemingly restricted version of the reconstruction problem, where all of the given trails are of length three, i.e., every trail is of the form i → j → k for some three states i, j, k ∈ [n]. Surprisingly, we show that 3-trails are sufficient for perfect reconstruction.
By the definition of mixtures, the probability of generating a given 3-trail i → j → k is

    Σ_ℓ s^ℓ_i · M^ℓ(i, j) · M^ℓ(j, k),    (1)

which captures the stochastic process of choosing a particular chain ℓ using S and taking two steps in M^ℓ. Since we only observe the trails, the choice of the chain ℓ in the above process is latent. For each j ∈ [n], let Oj be an n × n matrix such that Oj(i, k) equals the value in (1). It is easy to see that using O((n³ log n)/ε²) sample trails, every entry in Oj for every j is approximated to within an additive ±ε. For the rest of the paper, we assume we know each Oj(i, k) exactly, rather than an approximation of it from samples.
We now give a simple decomposition of Oj in terms of the transition matrices in M and the starting probabilities in S. Let Pj be the L × n matrix whose (ℓ, i)th entry denotes the probability of using chain ℓ, starting in state i, and transitioning to state j, i.e., Pj(ℓ, i) = s^ℓ_i · M^ℓ(i, j). In a similar manner, let Qj be the L × n matrix whose (ℓ, k)th entry denotes the probability of starting in state j, and transitioning to state k under chain ℓ, i.e., Qj(ℓ, k) = s^ℓ_j · M^ℓ(j, k). Finally, let Sj = diag(s^1_j, . . . , s^L_j) be the L × L diagonal matrix of starting probabilities in state j. Then,

    Oj = Pj · Sj⁻¹ · Qj.    (2)

This decomposition will form the key to our analysis.

3 Conditions for unique reconstruction
Before we delve into the details of the algorithm, we first identify a condition on the mixture (M, S) such that there is a unique solution to the reconstruction problem when we consider trails of length three. (To appreciate such a need, consider a mixture where two of the matrices M^ℓ and M^ℓ′ in M are identical. Then for a fixed vector v, any s^ℓ and s^ℓ′ with s^ℓ + s^ℓ′ = v will give the same observations, regardless of the length of the trails.) To motivate the condition we require, consider again the sets of L × n matrices P = {P1, . . . , Pn} and Q = {Q1, . . . , Qn} as defined in (2). Together these matrices capture the n²L − 1 parameters of the problem, namely, n − 1 for each of the n rows of each of the L transition matrices M^ℓ, and nL − 1 parameters defining S. However, together P and Q have 2n²L entries, implying algebraic dependencies between them.
Definition 2 (Shuffle pairs). Two ordered sets X = {X1, . . . , Xn} and Y = {Y1, . . . , Yn} of L × n matrices are shuffle pairs if the jth column of Xi is identical to the ith column of Yj for all i, j ∈ [n].
Note that P and Q are shuffle pairs. We state an equivalent way of specifying this definition. Consider a 2nL × n² matrix A(P, Q) that consists of a top and a bottom half. The top half is an nL × n² block diagonal matrix with Pi as the ith block.
The bottom half is a concatenation of n different nL × n block diagonal matrices; the ith block of the jth matrix is the jth column of −Qi. A representation of A is given in Figure 1. As intuition, note that in each column, the two blocks of L entries are the same up to negation. Let F be the L × 2nL matrix consisting of 2n L × L identity matrices in a row. It is straightforward to see that P and Q are shuffle pairs if and only if F · A(P, Q) = 0.
Let the co-kernel of a matrix X be the vector space comprising the vectors v for which vX = 0. We have the following definition.

Figure 1: A(P, Q) for L = 2, n = 4. When P and Q are shuffle pairs, each column has two copies of the same L-dimensional vector (up to negation). M is well-distributed if there are no non-trivial vectors v for which v · A(P, Q) = 0.

Definition 3 (Well-distributed). The set of matrices M is well-distributed if the co-kernel of A(P, Q) has rank L.
Equivalently, M is well-distributed if the co-kernel of A(P, Q) is spanned by the rows of F. Section 4 shows how to uniquely recover a mixture from the 3-trail probabilities Oj when M is well-distributed and S has only non-zero entries. Section 6 shows that nearly all M are well-distributed, or more formally, that the set of non well-distributed M has (Lebesgue) measure 0.

4 Reconstruction algorithm

We present an algorithm to recover a mixture from its induced distribution on 3-trails. We assume for the rest of the section that M is well-distributed (see Definition 3) and S has only non-zero entries, which also means Pj, Qj, and Oj have rank L for each j.
At a high level, the algorithm begins by performing an SVD of each Oj, thus recovering both Pj and Qj, as in (2), up to unknown rotation and scaling.
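The objects involved can be made concrete with a short numerical check (a minimal numpy sketch; the function names and array layout are ours, not the paper's): build an Oj from the mixture parameters as in (1), then verify that a rank-L SVD of Oj yields factors that multiply back to Oj exactly, leaving only the small change-of-basis ambiguity that the rest of the algorithm removes.

```python
import numpy as np

def trail3_matrix(S, Ms, j):
    """O_j(i, k) = sum_l S[l, i] * Ms[l, i, j] * Ms[l, j, k], as in (1)."""
    L, n = S.shape
    Oj = np.zeros((n, n))
    for l in range(L):
        # Row l of P_j is s^l * M^l(., j); row l of Q_j is s^l_j * M^l(j, .).
        Oj += np.outer(S[l] * Ms[l, :, j], Ms[l, j, :])
    return Oj

def initial_guesses(Oj, L):
    """Rank-L SVD factors P'_j, Q'_j (each L x n) with P'_j^T Q'_j = Oj.
    They equal the true factors only up to an unknown full-rank
    L x L change of basis (cf. Lemma 4)."""
    U, sig, Vt = np.linalg.svd(Oj)
    return U[:, :L].T, sig[:L, None] * Vt[:L]
```

Since Oj is a sum of L rank-one terms, its rank is at most L and the truncated SVD reconstructs it exactly; the steps that follow exist precisely to undo the remaining L × L ambiguity.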
The key to undoing the rotation will be the fact that the sets of matrices P and Q are shuffle pairs, and hence have algebraic dependencies.
More specifically, our algorithm consists of four high-level steps. We first list the steps and provide an informal overview; later we will describe each step in full detail.
(i) Matrix decomposition: Using SVD, we compute a decomposition Oj = UjΣjVj and let P′j = Uj and Q′j = ΣjVj. These are the initial guesses at (Pj, Qj). We prove in Lemma 4 that there exist L × L matrices Yj and Zj so that Pj = YjP′j and Qj = ZjQ′j for each j ∈ [n].
(ii) Co-kernel: Let P′ = {P′1, . . . , P′n} and Q′ = {Q′1, . . . , Q′n}. We compute the co-kernel of the matrix A(P′, Q′) as defined in Section 3, to obtain matrices Y′j and Z′j. We prove that there is a single matrix R for which Yj = RY′j and Zj = RZ′j for all j.
(iii) Diagonalization: Let R′ be the matrix of eigenvectors of (Z′1Y′1)⁻¹(Z′2Y′2). We prove that there is a permutation matrix Π and a diagonal matrix D such that R = DΠR′.
(iv) Two-trail matching: Given Oj it is easy to compute the probability distribution of the mixture over 2-trails. We use these to solve for D, and using D, compute R, Yj, Pj, and Sj for each j.

4.1 Matrix decomposition
From the definition, both P′j and Q′j are L × n matrices of full rank. The following lemma states that the SVD of the product of two matrices A and B returns the original matrices up to a change of basis.

Lemma 4. Let A, B, C, D be L × n matrices of full rank, such that AB = CD. Then there is an L × L matrix X of full rank such that C = X⁻¹A and D = XB.

Proof.
Note that A = ABB⁻¹ = CDB⁻¹ = CW for W = DB⁻¹. Since A has full rank, W must as well. We then get CD = AB = CWB, and since C has full column rank, D = WB. Setting X = W completes the proof.

Since Oj = Pj(Sj⁻¹Qj) and Oj = P′jQ′j, Lemma 4 implies that there exists an L × L matrix Xj of full rank such that Pj = Xj⁻¹P′j and Qj = SjXjQ′j. Let Yj = Xj⁻¹, and let Zj = SjXj. Note that both Yj and Zj have full rank, for each j. Once we have Yj and Zj, we can easily compute both Pj and Sj, so we have reduced our problem to finding Yj and Zj.

4.2 Co-kernel
Since (P, Q) is a shuffle pair, ((YjP′j)_{j∈[n]}, (ZjQ′j)_{j∈[n]}) is also a shuffle pair. We can write the latter fact as B(Y, Z)A(P′, Q′) = 0, where B(Y, Z) is the L × 2nL matrix comprising 2n matrices concatenated together; first Yj for each j, and then Zj for each j. We know A(P′, Q′) from the matrix decomposition step, and we are trying to find B(Y, Z). By well-distributedness, the co-kernel of A(P, Q) has rank L. Let D be the 2nL × 2nL block diagonal matrix with the diagonal entries (Y1⁻¹, Y2⁻¹, . . . , Yn⁻¹, Z1⁻¹, Z2⁻¹, . . . , Zn⁻¹). Then A(P′, Q′) = D A(P, Q). Since D has full rank, the co-kernel of A(P′, Q′) has rank L as well.
We compute an arbitrary basis of the co-kernel of A(P′, Q′),² and write it as an L × 2nL matrix as an initial guess B(Y′, Z′) for B(Y, Z). Since B(Y, Z) lies in the co-kernel of A(P′, Q′), and has exactly L rows, there exists an L × L matrix R such that B(Y, Z) = R B(Y′, Z′), or equivalently, such that Yj = RY′j and Zj = RZ′j for every j. Since Yj and Zj have full rank, so does R. Now our problem is reduced to computing R.

4.3 Diagonalization
Recall from the matrix decomposition step that there exist matrices Xj such that Yj = Xj⁻¹ and Zj = SjXj. Hence Z′jY′j = (R⁻¹Zj)(YjR⁻¹) = R⁻¹SjR⁻¹. It seems difficult to compute R directly from equations of the form R⁻¹SjR⁻¹, but we can multiply any two of them together to get, e.g., (Z′1Y′1)⁻¹(Z′2Y′2) = RS1⁻¹S2R⁻¹.
Since S1⁻¹S2 is a diagonal matrix, we can diagonalize RS1⁻¹S2R⁻¹ as a step towards computing R. Let R′ be the matrix of eigenvectors of RS1⁻¹S2R⁻¹. Now, R is determined up to a scaling and ordering of the eigenvectors. In other words, there is a permutation matrix Π and diagonal matrix D such that R = DΠR′.

4.4 Two-trail matching
First, Oj1n = PjSj⁻¹Qj1n = Pj1L for each j, since each row of Sj⁻¹Qj is simply the set of transition probabilities out of a particular Markov chain and state. Another way to see it is that both Oj1n and Pj1L are vectors whose ith coordinate is the probability of the trail i → j.
From the first three steps of the algorithm, we also have Pj = YjP′j = RY′jP′j = DΠR′Y′jP′j. Hence 1LDΠ = 1LP1(R′Y′1P′1)⁻¹ = O11n(R′Y′1P′1)⁻¹, where the inverse is a pseudoinverse. We arbitrarily fix Π, from which we can compute D, R, Yj, and finally Pj for each j. From the diagonalization step (Section 4.3), we can also compute Sj = R(Z′jY′j)R for each j.
Note that the algorithm implicitly includes a proof of uniqueness, up to a setting of Π.
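The diagonalization idea can be sanity-checked in isolation (a small numerical sketch with invented names, using numpy's `eig`; it is not the paper's implementation): if W = R S1⁻¹ S2 R⁻¹ with S1, S2 diagonal and distinct ratios, then the eigenvectors of W recover the columns of R up to scaling and order.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 3
R = rng.standard_normal((L, L)) + np.eye(L)       # an (unknown) change of basis
s1 = rng.random(L) + 0.5
s2 = (rng.random(L) + 0.5) * np.arange(1, L + 1)  # make the ratios s2/s1 distinct
W = R @ np.diag(s2 / s1) @ np.linalg.inv(R)       # = R S1^{-1} S2 R^{-1}

# W R = R diag(s2/s1), so every column of R is an eigenvector of W, and an
# eigendecomposition returns R up to column scaling and permutation.
eigvals, V = np.linalg.eig(W)
V = V.real
```

The residual scaling/ordering ambiguity is exactly the D and Π of step (iii), which the two-trail matching step then pins down.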
Different orderings of Π correspond to different orderings of M^ℓ in M.

²For instance, by taking the SVD of A(P′, Q′), and looking at the singular vectors.

5 Experiments

We have presented an algorithm for reconstructing a mixture of Markov chains from the observations, assuming the observation matrices are known exactly. In this section we demonstrate that the algorithm is efficient, and performs well even when we use empirical observations. In addition, we also compare its performance against the most natural EM algorithm for the reconstruction problem.
Synthetic data. We begin by generating well-distributed instances M and S. Let Dn be the uniform distribution over the n-dimensional unit simplex, namely, the uniform distribution over vectors in Rn whose coordinates are non-negative and sum to 1.
For a specific n and L, we generate an instance (M, S) as follows. For each state i and Markov chain M^ℓ, the set of transition probabilities leaving i is distributed as Dn. We draw each s^ℓ from Dn as well, and then divide by L, so that the sum over all s^ℓ(i) is 1. In other words, each trail is equally likely to come from any of the L Markov chains. This restriction has little effect on our algorithm, but is needed to make EM tractable. For each instance, we generate T samples of 3-trails. The results that we report are the medians of 100 different runs.
Metric for synthetic data. Our goal is exact recovery of the underlying instance M. Given two n × n matrices A and B, the error is the average total variation distance between the transition probabilities: error(A, B) = 1/(2n) · Σ_{i,j} |A(i, j) − B(i, j)|. Given a pair of instances M = {M^1, . . . , M^L} and N = {N^1, . . . , N^L} on the same state space [n], the recovery error is the minimum average error over all matchings of chains in N to M. Let σ be a permutation on [L]; then:

    recovery error(M, N) = min_σ (1/L) Σ_ℓ error(M^ℓ, N^σ(ℓ)).

Given all the pairwise errors error(M^ℓ, N^p), this minimum can be computed in time O(L³) by the Hungarian algorithm. Note that the recovery error ranges from 0 to 1.
Real data. We use the last.fm 1K dataset³, which contains the list of songs listened to by heavy users of Last.fm. We use the top 25 artist genres⁴ as the states of the Markov chain. We consider the ten heaviest users in the data set, and for each user, consider the first 3001 state transitions that change their state. We break each sequence into 3000 3-trails. Each user naturally defines a Markov chain on the genres, and the goal is to recover these individual chains from the observed mixture of 3-trails.
Metric for real data. Given a 3-trail from one of the users, our goal is to predict which user the 3-trail came from. Specifically, given a 3-trail t and a mixture of Markov chains (M, S), we assign t to the Markov chain most likely to have generated it. A recovered mixture (M, S) thereby partitions the observed 3-trails into L groups. The prediction error is the minimum over all matchings between groups and users of the fraction of trails that are matched to the wrong user. The prediction error ranges from 0 to 1 − 1/L.
Handling approximations. Because the algorithm operates on real data, rather than perfect observation matrices, we make two minor modifications to make it more robust. First, in the diagonalization step (Section 4.3), we sum (Z′iY′i)⁻¹(Z′i+1Y′i+1) over all i before diagonalizing to estimate R′, instead of just using i = 1. Second, due to noise, the matrices M that we recover at the end need not be stochastic. Following the work of [7] we normalize the values by first taking absolute values of all entries, and then normalizing so that each of the columns sums to 1.
Baseline. We turn to EM as a practical baseline for this reconstruction problem. In our implementation, we continue running EM until the log likelihood changes by less than 10⁻⁷ in each iteration; this corresponds to roughly 200-1000 iterations. Although EM continues to improve its solution past this point, even at the 10⁻⁷ cutoff, it is already 10-50x slower than the algorithm we propose.

5.1 Recovery and prediction error

³http://mtg.upf.edu/static/datasets/last.fm/lastfm-dataset-1K.tar.gz
⁴http://static.echonest.com/Lastfm-ArtistTags2007.tar.gz

Figure 3: (a) Performance of EM and our algorithm vs number of samples (b) Performance of EM and our algorithm vs L (synthetic data) (c) Performance of EM and our algorithm (real data)

For the synthetic data, we fix n = 6 and L = 3, and for each of the 100 instances generate a progressively larger set of samples. Recall that the number of unknown parameters grows as Θ(n²L), so even this relatively simple setting corresponds to over 100 unknown parameters. Figure 3(a) shows the median recovery error of both approaches. It is clear that the proposed method significantly outperforms the EM approach, routinely achieving errors 10-90% lower. Furthermore, while we did not make significant attempts to speed up EM, it is already over 10x slower than our algorithm at n = 6 and L = 3, and becomes even slower as n and L grow.
In Figure 3(b) we study the error as a function of L. Our approach is significantly faster, and easily outperforms EM at 100 iterations.
Running EM for 1000 iterations results in prediction error on par with our algorithm, but takes orders of magnitude more time to complete.
For the real data, there are n = 25 states, and we tried L = 4, . . . , 10 for the number of users. We run EM for 500 iterations and show the results in Figure 3(c). While our algorithm slightly underperforms EM, it is significantly faster in practice.

Figure 2: Performance of the algorithm as a function of n and L for a fixed number of samples.

5.2 Dependence on n and L

To investigate the dependence of our approach on the size of the input, namely n and L, we fix the number of samples to 10⁸ but vary both the number of states from 6 to 30, as well as the number of chains from 3 to 9. Recall that the number of parameters grows as n²L; therefore, the largest examples have almost 1000 parameters that we are trying to fit.
We plot the results in Figure 2. As expected, the error grows linearly with the number of chains. This is expected: since we are keeping the number of samples fixed, the relative error (from the true observations) grows as well. It is therefore remarkable that the error grows only linearly with L.
We see more interesting behavior with respect to n. Recall that the proofs required n ≥ 2L. Empirically we see that at n = 2L the approach is relatively brittle, and errors are relatively high. However, as n increases past that, we see that the recovery error stabilizes. Explaining this behavior formally is an interesting open question.

6 Analysis
We now show that nearly all M are well-distributed (see Definition 3), or more formally, that the set of non well-distributed M has (Lebesgue) measure 0 for every L > 1 and n ≥ 2L.
We first introduce some notation. All arrays and indices are 1-indexed.
In previous sections, we have interpreted i, j, k, and ℓ as states or as indices of a mixture; in this section we drop these interpretations and just use them as generic indices.
For vectors v1, . . . , vn ∈ R^L, let v[n] denote (v1, . . . , vn), and let ∗(v1, . . . , vn) denote the vi's concatenated together to form a vector in R^{nL}. Let vi[j] denote the jth coordinate of vector vi.
We first show that there exists at least one well-distributed P for each n and L.
Lemma 5 (Existence of a well-distributed P). For every n and L with n ≥ 2L, there exists a P for which the co-kernel of A(P, Q) has rank L.

Proof. It is sufficient to show it for n = 2L, since for larger n we can pad with zeros. Also, recall that F · A(P, Q) = 0 for any P, where F is the L × 2nL matrix consisting of 2n identity matrices concatenated together. So the co-kernel of any A(P, Q) has rank at least L, and we just need to show that there exists a P where the co-kernel of A(P, Q) has rank at most L.
Now, let e_ℓ be the ℓth basis vector in R^L. Let P∗ = (P∗1, . . . , P∗n), and let p∗ij denote the jth column of P∗i. We set p∗ij = e_{j−i+1} if i ≤ L or j ≤ L, and p∗ij = e_{j−i} if i, j > L, where subscripts are taken mod L. Note that, viewed as an array of columns, (p∗ij) splits into four L × L blocks (E E; E E′), where the (i, j)th column of E is e_{j−i+1} and E′ is a horizontal "rotation" of E.
Now, let a[n], b[n] be any vectors in R^L such that v = ∗(a1, . . . , an, b1, . . . , bn) ∈ R^{2nL} is in the co-kernel of A(P∗, Q∗). Recall this means v · A(P∗, Q∗) = 0. Writing out the matrix A, it is not too hard to see that this holds if and only if ⟨ai, p∗ij⟩ = ⟨bj, p∗ij⟩ for each i and j.
Consider the i and j where p∗ij = e1. For each k ∈ [L], we have ak[1] = bk[1] from the upper left quadrant, ak[1] = bL+k[1] from the upper right quadrant, aL+k[1] = bk[1] from the lower left quadrant, and aL+k[1] = bL+(k+1 (mod L))[1] from the lower right quadrant. It is easy to see that these combine to imply that ai[1] = bj[1] for all i, j ∈ [n].
A similar argument for each l ∈ [L] shows that ai[l] = bj[l] for all i, j and l. Equivalently, ai = bj for each i and j, which means that v lives in a subspace of dimension L, as desired.

We now bootstrap from our one example to show that almost all P are well-distributed.
Theorem 6 (Almost all P are well-distributed). The set of non-well-distributed P has Lebesgue measure 0 for every n and L with n ≥ 2L.

Proof. Let A′(P, Q) be all but the last L rows of A(P, Q). For any P, let h(P) = det(A′(P, Q)Ā′(P, Q)). Note that h(P) is non-zero if and only if P is well-distributed. Let P∗ be the P∗ from Lemma 5. Since A′(P∗, Q∗) has full row rank, h(P∗) ≠ 0. Since h is a polynomial function of the entries of P, and h is non-zero somewhere, h is non-zero almost everywhere [4].

7 Conclusions

In this paper we considered the problem of reconstructing Markov chain mixtures from given observation trails. We showed that unique reconstruction is algorithmically possible under a mild technical condition on the "well-separatedness" of the chains. While our condition is sufficient, we conjecture it is also necessary; proving this is an interesting research direction. Extending our analysis to work for the noisy case is also a plausible research direction, though we believe the corresponding analysis could be quite challenging.

References
[1] E. S. Allman, C. Matias, and J. A. Rhodes. Identifiability of parameters in latent structure models with many observed variables. The Annals of Statistics, pages 3099–3132, 2009.
[2] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. JMLR, 15(1):2773–2832, 2014.
[3] A. Anandkumar, D. Hsu, and S. M. Kakade. A method of moments for mixture models and hidden Markov models. In COLT, pages 33.1–33.34, 2012.
[4] R. Caron and T. Traynor. The zero set of a polynomial. WSMR Report, pages 05–02, 2005.
[5] K. Chaudhuri and S. Rao. Learning mixtures of product distributions using correlations and independence. In COLT, pages 9–20, 2008.
[6] F. Chierichetti, R. Kumar, P. Raghavan, and T. Sarlos. Are web users really Markovian? In WWW, pages 609–618, 2012.
[7] S. B. Cohen, K. Stratos, M. Collins, D. P. Foster, and L. Ungar. Experiments with spectral learning of latent-variable PCFGs. In NAACL, pages 148–157, 2013.
[8] S. Dasgupta. Learning mixtures of Gaussians. In FOCS, pages 634–644, 1999.
[9] A. P.
Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.
[10] D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. JCSS, 78(5):1460–1480, 2012.
[11] A. Moitra and G. Valiant. Settling the polynomial learnability of mixtures of Gaussians. In FOCS, pages 93–102, 2010.
[12] R. A. Redner and H. F. Walker. Mixture densities, maximum likelihood, and the EM algorithm. SIAM Review, 26:195–239, 1984.
[13] C. Subakan, J. Traa, and P. Smaragdis. Spectral learning of mixture of hidden Markov models. In NIPS, pages 2249–2257, 2014.
[14] Y. C. Sübakan. Probabilistic time series classification. Master's thesis, Boğaziçi University, 2011.
[15] S. Vempala and G. Wang. A spectral algorithm for learning mixture models. JCSS, 68(4):841–860, 2004.
", "award": [], "sourceid": 1717, "authors": [{"given_name": "Rishi", "family_name": "Gupta", "institution": "Stanford"}, {"given_name": "Ravi", "family_name": "Kumar", "institution": "Yahoo!"}, {"given_name": "Sergei", "family_name": "Vassilvitskii", "institution": "Google"}]}