{"title": "Learning Overcomplete HMMs", "book": "Advances in Neural Information Processing Systems", "page_first": 940, "page_last": 949, "abstract": "We study the basic problem of learning overcomplete HMMs---those that have many hidden states but a small output alphabet. Despite having significant practical importance, such HMMs are poorly understood with no known positive or negative results for efficient learning. In this paper, we present several new results---both positive and negative---which help define the boundaries between the tractable-learning setting and the intractable setting. We show positive results for a large subclass of HMMs whose transition matrices are sparse, well-conditioned and have small probability mass on short cycles. We also show that learning is impossible given only a polynomial number of samples for HMMs with a small output alphabet and whose transition matrices are random regular graphs with large degree. We also discuss these results in the context of learning HMMs which can capture long-term dependencies.", "full_text": "Learning Overcomplete HMMs\n\nVatsal Sharan\nStanford University\nvsharan@stanford.edu\n\nPercy Liang\nStanford University\npliang@cs.stanford.edu\n\nSham Kakade\nUniversity of Washington\nsham@cs.washington.edu\n\nGregory Valiant\nStanford University\nvaliant@stanford.edu\n\nAbstract\n\nWe study the problem of learning overcomplete HMMs—those that have many hidden states but a small output alphabet. Despite having significant practical importance, such HMMs are poorly understood with no known positive or negative results for efficient learning. In this paper, we present several new results—both positive and negative—which help define the boundaries between the tractable and intractable settings. 
Specifically, we show positive results for a large subclass of HMMs whose transition matrices are sparse, well-conditioned, and have small probability mass on short cycles. On the other hand, we show that learning is impossible given only a polynomial number of samples for HMMs with a small output alphabet and whose transition matrices are random regular graphs with large degree. We also discuss these results in the context of learning HMMs which can capture long-term dependencies.\n\n1 Introduction\n\nHidden Markov Models (HMMs) are commonly used for data with natural sequential structure (e.g., speech, language, video). This paper focuses on overcomplete HMMs, where the number of output symbols m is much smaller than the number of hidden states n. As an example, for an HMM that outputs natural language documents one character at a time, the number of characters m is quite small, but the number of hidden states n would need to be very large to encode the rich syntactic, semantic, and discourse structure of the document.\n\nMost algorithms for learning HMMs with provable guarantees assume the transition T ∈ R^(n×n) and observation O ∈ R^(m×n) matrices are full rank [2, 3, 20] and hence do not apply to the overcomplete regime. A notable exception is the recent work of Huang et al. [14], who studied this setting where m ≪ n and showed that generic HMMs can be learned in polynomial time given exact moments of the output process (which requires infinite data). Though understanding properties of generic HMMs is an important first step, in reality, HMMs with a large number of hidden states typically have structured, non-generic transition matrices—e.g., consider sparse transition matrices or transition matrices of factorial HMMs [12]. Huang et al. [14] also assume access to exact moments, which leaves open the question of when learning is possible with efficient sample complexity. 
Summarizing, we are interested in the following questions:\n\n1. What are the fundamental limitations for learning overcomplete HMMs?\n2. What properties of HMMs make learning possible with polynomial samples?\n3. Are there structured HMMs which can be learned in the overcomplete regime?\n\nOur contributions. We make progress on all three questions in this work, sharpening our understanding of the boundary between tractable and intractable learning. We begin by stating a negative result, which perhaps explains some of the difficulty of obtaining strong learning guarantees in the overcomplete setting.\n\nTheorem 1. The parameters of HMMs where i) the transition matrix encodes a random walk on a regular graph on n nodes with degree polynomial in n, ii) the output alphabet m = polylog(n), and iii) the output distribution for each hidden state is chosen uniformly and independently at random, cannot be learned (even approximately) using polynomially many samples over any window length polynomial in n, with high probability over the choice of the observation matrix.\n\nTheorem 1 is somewhat surprising, as parameters of HMMs with such transition matrices can be easily learned in the non-overcomplete (m ≥ n) regime. This is because such transition matrices are full-rank and their condition numbers are polynomial in n; hence spectral techniques such as Anandkumar et al. [3] can be applied. Theorem 1 is also fundamentally of a different nature as compared to lower bounds based on parity with noise reductions for HMMs [20], as ours is information-theoretic.1 It also seems far more damning, as the hard cases are seemingly innocuous classes such as random walks on dense graphs. The lower bound also shows that analyzing generic or random HMMs might not be the right framework to consider in the overcomplete regime, as these might not be learnable with polynomial samples even though they are identifiable. 
This further motivates the need for understanding HMMs with structured transition matrices. We provide a proof of Theorem 1 with more explicitly stated conditions in Appendix D.\n\nFor our positive results we focus on understanding properties of structured transition matrices which make learning tractable. To disentangle additional complications due to the choice of the observation matrix, we will assume that the observation matrix is drawn at random throughout the paper. Long-standing open problems on learning aliased HMMs (HMMs where multiple hidden states have identical output distributions) [7, 15, 23] hint that understanding learnability with respect to properties of the observation matrix is a daunting task in itself, and is perhaps best studied separately from understanding how properties of the transition matrix affect learning.\n\nOur positive result on learnability (Theorem 2) depends on two natural graph-theoretic properties of the transition matrix. We consider transition matrices which are i) sparse (hidden states have constant degree) and ii) have small probability mass on cycles shorter than 10 log_m n states—and show that these HMMs can be learned efficiently using tensor decomposition and the method of moments, given random observation matrices. The condition prohibiting short cycles might seem mysterious. Intuitively, we need this condition to ensure that the Markov chain visits a sufficiently large portion of the state space in a short interval of time, and in fact the condition stems from information-theoretic considerations. We discuss these further in Sections 2.4 and 3.1. We also discuss how our results relate to learning HMMs which capture long-term dependencies in their outputs, and introduce a new notion of how well an HMM captures long-term dependencies. These are discussed in Section 5.\n\nWe also show new identifiability results for sparse HMMs. 
These results provide a finer picture of identifiability than Huang et al. [14], as ours hold for sparse transition matrices which are not generic.\n\nTechnical contribution. To prove Theorem 2 we show that the Khatri-Rao product of dependent random vectors is well-conditioned under certain conditions. Previously, Bhaskara et al. [6] showed that the Khatri-Rao product of independent random vectors is well-conditioned in order to perform a smoothed analysis of tensor decomposition; their techniques, however, do not extend to the dependent case. For the dependent case, we show a similar result using a novel Markov chain coupling based argument which relates the condition number to the best coupling of output distributions of two random walks with disjoint starting distributions. The technique is outlined in Section 2.2.\n\nRelated work. Spectral methods for learning HMMs have been studied in Anandkumar et al. [3], Bhaskara et al. [5], Allman et al. [1], Hsu et al. [13], but these results require m ≥ n. In Allman et al. [1], the authors show that HMMs are identifiable given moments of contiguous observations over a time interval of length N = 2τ + 1 for some τ such that (τ+m−1 choose m−1) ≥ n. When m ≪ n this requires τ = O(n^(1/m)). Bhaskara et al. [5] give another bound on window size which requires τ = O(n/m). However, with an output alphabet of size m, specifying all moments in a contiguous time interval of length N requires m^N time and samples, and therefore all of these approaches lead to exponential runtimes when m is constant with respect to n. Also relevant is the work by Anandkumar et al. [4] on guarantees for learning certain latent variable models such as Gaussian mixtures in the overcomplete setting through tensor decomposition. As mentioned earlier, the work closest to ours is Huang et al. [14], who showed that generic HMMs are identifiable with τ = O(log_m n), which gives the first polynomial runtimes for the case when m is constant.\n\n1Parity with noise is information theoretically easy given observations over a window of length at least the number of inputs to the parity. This is linear in the number of hidden states of the parity with noise HMM, whereas Theorem 1 says that the sample complexity must be super-polynomial for any polynomial-sized window.\n\nOutline. Section 2 introduces the notation and setup. It also provides examples and a high-level overview of our proof approach. Section 3 states the learnability result, and discusses our assumptions and HMMs which satisfy these assumptions. Section 4 contains our identifiability results for sparse HMMs. Section 5 discusses natural measures of long-term dependencies in HMMs. We conclude in Section 6. Proof details are deferred to the Appendix.\n\n2 Setup and preliminaries\n\nIn this section we first introduce the required notation, and then outline the method of moments approach for parameter recovery. We also go over some examples to provide a better understanding of the classes of HMMs we aim to learn, and give a high-level proof strategy.\n\n2.1 Notation and preliminaries\n\nWe will denote the output at time t by y_t and the hidden state at time t by h_t. Let the number of hidden states be n and the number of observations be m. Assume that the output alphabet is {0, . . . , m − 1} without loss of generality. Let T be the transition matrix and O be the observation matrix of the HMM; both of these are defined so that the columns add up to one. For any matrix A, we refer to the ith column of A as A_i. 
T′ is defined as the transition matrix of the time-reversed Markov chain, but we do not assume reversibility and hence T may not equal T′. Let y_i^j = y_i, . . . , y_j denote the sequence of outputs from time i to time j. Let l_i^j = l_i, . . . , l_j refer to a string of length j − i + 1 over the output alphabet, denoting a particular output sequence from time i to j. Define a bijective mapping L which maps an output sequence l_1^τ ∈ {0, . . . , m − 1}^τ into an index L(l_1^τ) ∈ {1, . . . , m^τ}, and the associated inverse mapping L^(−1).\n\nThroughout the paper, we assume that the transition matrix T is ergodic, and hence has a stationary distribution. We also assume that every hidden state has stationary probability at least 1/poly(n). This is a necessary condition, as otherwise we might not even visit all states in poly(n) samples. We also assume that the output process of the HMM is stationary. A stochastic process is stationary if the distribution of any subset of random variables is invariant with respect to shifts in the time index—that is, P[y_{−τ}^{τ} = l_{−τ}^{τ}] = P[y_{−τ+T}^{τ+T} = l_{−τ}^{τ}] for any τ, T and string l_{−τ}^{τ}. This is true if the initial hidden state is chosen according to the stationary distribution.\n\nOur results depend on the conditioning of the matrix T with respect to the ℓ1 norm. We define σ_min^(1)(T) as the minimum ℓ1 gain of the transition matrix T over all vectors x having unit ℓ1 norm (not just non-negative vectors x, for which the ratio would always be 1):\n\nσ_min^(1)(T) = min_{x ∈ R^n} ||T x||_1 / ||x||_1\n\nσ_min^(1)(T) is also a natural parameter to measure the long-term dependence of the HMM—if σ_min^(1)(T) is large then T preserves significant information about the distribution of hidden states at time 0 at a future time t, for all initial distributions at time 0. We discuss this further in Section 5.\n\n2.2 Method of moments for learning HMMs\n\nOur algorithm for learning HMMs follows the method of moments based approach, outlined for example in Anandkumar et al. [2] and Huang et al. [14]. In contrast to the more popular Expectation-Maximization (EM) approach, which can suffer from slow convergence and local optima [21], the method of moments approach ensures guaranteed recovery of the parameters under mild conditions. More details about tensor decomposition and the method of moments approach to learning HMMs can be found in Appendix A.\n\nThe method of moments approach to learning HMMs has two high-level steps. In the first step, we write down a tensor of empirical moments of the data, such that the factors of the tensor correspond to parameters of the underlying model. In the second step, we perform tensor decomposition to recover the factors of the tensor—and then recover the parameters of the model from the factors. The key fact that enables the second step is that tensors have a unique decomposition under mild conditions on their factors; for example, tensors have a unique decomposition if all the factors are full rank. 
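The ℓ1 gain σ_min^(1)(T) defined above has no simple closed form for general T, but it can be probed numerically. The sketch below is ours, not the paper's (the function name and sampling scheme are our choices): it does a random search over signed vectors on the ℓ1 unit sphere, which can only overestimate the true minimum, so the result should be read as an upper bound on σ_min^(1)(T).

```python
import numpy as np

def l1_gain_upper_bound(T, trials=20000, seed=0):
    """Crude upper bound on sigma_min^(1)(T) = min_{||x||_1 = 1} ||T x||_1.

    Random search over signed vectors on the l1 unit sphere; sampling
    can only overestimate the true minimum.
    """
    rng = np.random.default_rng(seed)
    n = T.shape[1]
    best = np.inf
    for _ in range(trials):
        x = rng.standard_normal(n)
        x /= np.abs(x).sum()            # project onto the l1 unit sphere
        best = min(best, np.abs(T @ x).sum())
    return best

# A permutation preserves the l1 norm of every vector, so its gain is 1.
P = np.roll(np.eye(4), 1, axis=0)       # cyclic shift on 4 states (columns sum to 1)
print(l1_gain_upper_bound(P, trials=1000))   # approximately 1.0
```

For a permutation matrix every signed vector keeps its ℓ1 norm, so the estimate is 1 up to floating-point error; for ill-conditioned T the search quickly finds vectors with small gain.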
The uniqueness of tensor decomposition permits unique recovery of the parameters of the model.\n\nWe will learn the HMM using the moments of observation sequences y_{−τ}^{τ} from time −τ to τ. Since the output process is assumed to be stationary, the distribution of outputs is the same for any contiguous time interval of the same length, and we use the interval −τ to τ in our setup for convenience. We call the length of the observation sequences used for learning the window length N = 2τ + 1. Since the number of samples required to estimate moments over a window of length N is m^N, it is desirable to keep N small. Note that to ensure polynomial runtime and sample complexity for the method of moments approach, the window length N must be O(log_m n).\n\nWe will now define our moment tensor. Given moments over a window of length N = 2τ + 1, we can construct the third-order moment tensor M ∈ R^(m^τ × m^τ × m) using the mapping L from strings of outputs to indices in the tensor:\n\nM_{(L(l_1^τ), L(l_{−1}^{−τ}), l_0)} = P[y_{−τ}^{τ} = l_{−τ}^{τ}].\n\nM is simply the tensor of the moments of the HMM over a window length N, and can be estimated directly from data. We can write M as an outer product because of the Markov property:\n\nM = A ⊗ B ⊗ C\n\nwhere A ∈ R^(m^τ × n), B ∈ R^(m^τ × n), C ∈ R^(m × n) are defined as follows (here h_0 denotes the hidden state at time 0):\n\nA_{L(l_1^τ), i} = P[y_1^τ = l_1^τ | h_0 = i]\nB_{L(l_{−1}^{−τ}), i} = P[y_{−1}^{−τ} = l_{−1}^{−τ} | h_0 = i]\nC_{l_0, i} = P[y_0 = l_0, h_0 = i]\n\nT and O can be related in a simple manner to A, B and C. If we can decompose the tensor M into the factors A, B and C, we can recover T and O from A, B and C. 
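The factorization above can be sanity-checked numerically on a tiny HMM. In the sketch below (our illustration, with our own variable names), M = A ⊗ B ⊗ C is interpreted in the usual CP sense, M = Σ_i A_i ⊗ B_i ⊗ C_i, and compared against a brute-force sum over all hidden-state paths for the smallest window, τ = 1 (N = 3):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 3                              # hidden states, output symbols

# Random column-stochastic transition and observation matrices.
T = rng.random((n, n)); T /= T.sum(axis=0)
O = rng.random((m, n)); O /= O.sum(axis=0)

# Stationary distribution: eigenvector of T for eigenvalue 1.
w, v = np.linalg.eig(T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))]); pi /= pi.sum()

# Time-reversed chain: T_rev[j, i] = P[h_{t-1} = j | h_t = i].
T_rev = np.diag(pi) @ T.T @ np.diag(1.0 / pi)

# Factors for window length N = 3 (tau = 1):
A = O @ T                                # A[l, i] = P[y_1 = l | h_0 = i]
B = O @ T_rev                            # B[l, i] = P[y_{-1} = l | h_0 = i]
C = O * pi                               # C[l, i] = P[y_0 = l, h_0 = i]

# M = sum_i A_i (x) B_i (x) C_i, the CP reading of "M = A ⊗ B ⊗ C".
M = np.einsum('ai,bi,ci->abc', A, B, C)

# Brute force: sum over all hidden-state paths (h_{-1}, h_0, h_1).
M_brute = np.zeros((m, m, m))
for hm1 in range(n):
    for h0 in range(n):
        for h1 in range(n):
            p = pi[hm1] * T[h0, hm1] * T[h1, h0]
            M_brute += p * np.einsum('a,b,c->abc',
                                     O[:, h1], O[:, hm1], O[:, h0])
print(np.allclose(M, M_brute))           # True
```

The check works because, given h_0, the past and future outputs are conditionally independent, which is exactly what lets the joint moment tensor split into the three factors.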
Kruskal’s condition [18] guarantees that tensors have a unique decomposition whenever A and B are full rank and no two columns of C are the same. We refer the reader to Appendix A for more details, specifically Algorithm 1.\n\n2.3 High-level proof strategy\n\nAs the transition and observation matrices can be recovered from the factors of the tensors, our goal is to analyze the conditions under which the tensor decomposition step works provably. Note that the factor matrix A is the likelihood of observing each sequence of observations conditioned on starting at a given hidden state. We’ll refer to A as the likelihood matrix for this reason. B is the equivalent matrix for the time-reversed Markov chain. If we show that A and B are full rank and no two columns of C are the same, then the HMM can be learned given the exact moments using the simultaneous diagonalization algorithm, also known as Jennrich’s algorithm (see Algorithm 1). We show this property for our identifiability results. For our learnability results, we show that the matrices A and B are well-conditioned (have condition numbers polynomial in n), which implies learnability from polynomial samples. This is the main technical contribution of the paper, and requires analyzing the condition number of the Khatri-Rao product of dependent random vectors. Before sketching the argument, we first introduce some notation. We can define A^(t) as the likelihood matrix over t steps:\n\nA^(t)_{L(l_1^t), i} = P[y_1^t = l_1^t | h_0 = i].\n\nA^(t) can be written down recursively as follows:\n\nA^(1) = OT,  A^(t) = (O ⊙ A^(t−1))T  (1)\n\nwhere A ⊙ B denotes the Khatri-Rao product of the matrices A and B. If A and B are two matrices of size m_1 × r and m_2 × r then the Khatri-Rao product is an m_1m_2 × r matrix whose ith column is the outer product A_i ⊗ B_i flattened into a vector. Note that A^(τ) is the same as A. 
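Eq. 1 is straightforward to implement. The minimal sketch below (our code, assuming the column-stochastic conventions above; function names are ours) builds A^(t) by the Khatri-Rao recursion and checks that each column of A^(t) is a probability distribution over the m^t output strings:

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product: column i is A_i (x) B_i, flattened."""
    m1, r = A.shape
    m2, _ = B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(m1 * m2, r)

def likelihood_matrix(O, T, t):
    """A^(t)[L(l_1^t), i] = P[y_1^t = l_1^t | h_0 = i], via Eq. 1:
    A^(1) = O T,  A^(t) = (O ⊙ A^(t-1)) T."""
    A = O @ T
    for _ in range(t - 1):
        A = khatri_rao(O, A) @ T
    return A

rng = np.random.default_rng(0)
n, m, t = 5, 2, 3
T = rng.random((n, n)); T /= T.sum(axis=0)
O = rng.random((m, n)); O /= O.sum(axis=0)

A3 = likelihood_matrix(O, T, t)
print(A3.shape)                           # (8, 5): one row per length-3 output string
print(np.allclose(A3.sum(axis=0), 1.0))   # each column is a distribution -> True
```

Each recursion step conditions on the state at time 1: the Khatri-Rao factor pairs the first output with the remaining t−1 outputs, and the final multiplication by T averages over that state.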
We now sketch our argument for showing that A^(τ) is well-conditioned under appropriate conditions.\n\nCoupling random walks to analyze the Khatri-Rao product. As mentioned in the introduction, in this paper we are interested in the setting where the transition matrix is fixed but the observation matrix is drawn at random. If we could draw fresh random matrices O at each time step of the recursion in Eq. 1, then A would be well-conditioned by the smoothed analysis of the Khatri-Rao product due to Bhaskara et al. [6]. However, our setting is significantly more difficult, as we do not have access to fresh randomness at each time step, so the techniques of Bhaskara et al. [6] cannot be applied here. As pointed out earlier, the condition number of A in this scenario depends crucially on the transition matrix T, as A is not even full rank if T = I.\n\nFigure 1: Examples of transition matrices which we can learn; refer to Section 2.4 and Section 3.2. (a) Transition matrix is a cycle, or a permutation on the hidden states. (b) Transition matrix is a random walk on a graph with small degree and no short cycles.\n\nInstead, we analyze A by a coupling argument. To get some intuition for this, note that if A does not have full rank, then there are two disjoint sets of columns of A whose linear combinations are equal, and these combination weights can be used to set up the initial states of two random walks defined by the transition matrix T which have the same output distribution for τ time steps. More generally, if A is ill-conditioned then there are two random walks with disjoint starting states which have very similar output distributions. We show that if two random walks have very similar output distributions over τ time steps for a randomly chosen observation matrix O, then most of the probability mass in these random walks can be coupled. 
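The contrast between a mixing and a non-mixing chain can be seen directly in the columns of A. In the hypothetical example below (ours, reusing the recursion of Eq. 1; the function names are our own), a cyclic permutation keeps the output distributions of walks started at different hidden states far apart in total variation, while an instantly-mixing uniform chain collapses all columns of A onto a single distribution, so the two walks couple immediately and A is maximally ill-conditioned:

```python
import numpy as np

def khatri_rao(A, B):
    m1, r = A.shape
    m2, _ = B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(m1 * m2, r)

def likelihood_matrix(O, T, t):
    """A^(t) from Eq. 1: A^(1) = O T, A^(t) = (O ⊙ A^(t-1)) T."""
    A = O @ T
    for _ in range(t - 1):
        A = khatri_rao(O, A) @ T
    return A

def min_pairwise_tv(A):
    """Smallest total variation distance between the output distributions
    (columns of A) of walks started at two different hidden states."""
    n = A.shape[1]
    return min(0.5 * np.abs(A[:, i] - A[:, j]).sum()
               for i in range(n) for j in range(i + 1, n))

rng = np.random.default_rng(2)
n, m, tau = 6, 2, 4
O = rng.random((m, n)); O /= O.sum(axis=0)

P = np.roll(np.eye(n), 1, axis=0)        # cyclic permutation: no mixing
U = np.full((n, n), 1.0 / n)             # mixes in one step: walks couple at once

print(min_pairwise_tv(likelihood_matrix(O, P, tau)))   # bounded away from 0
print(min_pairwise_tv(likelihood_matrix(O, U, tau)))   # ~0: all columns coincide
```

For U the hidden state is uniform after one step regardless of where the walk started, so every column of A is the same distribution; for the permutation each start state induces its own output distribution, which is the well-conditioning the coupling argument quantifies.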
On the other hand, if (σ_min^(1)(T))^τ is sufficiently large, the total variation distance between random walks starting at two different starting states must be at least (σ_min^(1)(T))^τ after τ time steps, and so there cannot be a good coupling, and A is well-conditioned. We provide a sketch of the argument for a simple case in Appendix 1 before we prove Theorem 2.\n\n2.4 Illustrative examples\n\nWe now provide a few simple examples which will illustrate some classes of HMMs we can and cannot learn. We first provide an example of a class of simple HMMs which can be handled by our results, but which has non-generic transition matrices and hence does not fit into the framework of Huang et al. [14]. Consider an HMM where the transition matrix is a permutation or cyclic shift on the hidden states (see Fig. 1a). Our results imply that such HMMs are learnable in polynomial time from polynomial samples if the output distributions of the hidden states are chosen at random. We will try to provide some intuition about why an HMM with the transition matrix as in Fig. 1a should be efficiently learnable. Let us consider the simple case when the outputs are binary (so m = 2) and each hidden state deterministically outputs a 0 or a 1, and is labeled by a 0 or a 1 accordingly. If the labels are assigned at random, then with high probability the string of labels of any contiguous sequence of 2 log_2 n hidden states in the cycle in Fig. 1a will be unique. This means that the output distribution in a 2 log_2 n time window is unique for every initial hidden state, and it can be shown that this ensures that the moment tensor has a unique factorization. 
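This intuition is easy to check by simulation. The snippet below is ours, not the paper's; it labels each state of an n-cycle with a random bit and counts the distinct label windows. We use a window of 4 log_2 n rather than the 2 log_2 n of the argument above, purely to leave slack for the high-probability statement at this small n:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 128
labels = rng.integers(0, 2, n)           # each state deterministically emits 0 or 1

# Label string seen over a window starting at each state of the cycle;
# doubling the label array handles windows that wrap around.
w = 4 * int(np.log2(n))
doubled = np.concatenate([labels, labels])
windows = {tuple(doubled[i:i + w]) for i in range(n)}

print(len(windows))                      # n distinct windows <=> unique signatures
```

When all n windows are distinct, the deterministic output string over the window identifies the initial hidden state, which is what makes the factors of the moment tensor well-behaved.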
By showing that the output distribution in a 2 log_2 n time window is very different for different initial hidden states—in addition to being unique—we can show that the factors of the moment tensor are well-conditioned, which allows recovery with efficient sample complexity. As another slightly more complex example of an HMM we can learn, Fig. 1b depicts an HMM whose transition matrix is a random walk on a graph with small degree and no short cycles. Our learnability result can handle such HMMs having structured transition matrices.\n\nAs an example of an HMM which cannot be learned in our framework, consider an HMM with transition matrix T = I and binary observations (m = 2), see Fig. 2a. In this case, the probability of an output sequence only depends on the total number of zeros or ones in the sequence. Therefore, we only get t independent measurements from windows of length t; hence windows of length O(n) instead of O(log_2 n) are necessary for identifiability (also refer to Blischke [8] for more discussion of this case). More generally, we prove in Proposition 1 that for small m a transition matrix composed only of cycles of constant length (see Fig. 2b) requires the window length to be polynomial in n to become identifiable.\n\nProposition 1. Consider an HMM on n hidden states and m observations with the transition matrix being a permutation composed of cycles of length c. 
Then windows of length O(n^(1/m^c)) are necessary for the model to be identifiable, which is polynomial in n for constant c and m.\n\nThe root cause of the difficulty in learning HMMs having short cycles is that they do not visit a large enough portion of the state space in O(log_m n) steps, and hence moments over a O(log_m n) time window do not carry sufficient information for learning. Our results cannot handle such classes of transition matrices; also see Section 3.1 for more discussion.\n\nFigure 2: Examples of transition matrices which do not fit in our framework. Proposition 1 shows that such HMMs, where the transition matrix is composed of a union of cycles of constant length, are not even identifiable from short windows of length O(log_m n). (a) Transition matrix is the identity on 8 hidden states. (b) Transition matrix is a union of 4 cycles, each on 5 hidden states.\n\n3 Learnability results for overcomplete HMMs\n\nIn this section, we state our learnability result, discuss the assumptions and provide examples of HMMs which satisfy these assumptions. Our learnability results hold under the following conditions:\n\nAssumptions: For fixed constants c_1, c_2, c_3 > 1, the HMM satisfies the following properties for some c > 0:\n\n1. Transition matrix is well-conditioned: Both T and the transition matrix T′ of the time-reversed Markov chain are well-conditioned in the ℓ1-norm: σ_min^(1)(T), σ_min^(1)(T′) ≥ 1/m^(c/c_1).\n\n2. Transition matrix does not have short cycles: For both T and T′, every state visits at least 10 log_m n states in 15 log_m n time except with probability δ_1 ≤ 1/n^c.\n\n3. All hidden states have small “degree”: There exists δ_2 such that for every hidden state i, the transition distributions T_i and T′_i have cumulative mass at most δ_2 on all but d states, with d ≤ m^(1/c_2) and δ_2 ≤ 1/n^c. Hence this is a soft “degree” requirement.\n\n4. Output distributions are random and have small support: There exists δ_3 such that for every hidden state i the output distribution O_i has cumulative mass at most δ_3 on all but k outputs, with k ≤ m^(1/c_3) and δ_3 ≤ 1/n^c. Also, the output distribution O_i is drawn uniformly on these k outputs.\n\nThe constants c_1, c_2, c_3 can be made explicit; for example, c_1 = 20, c_2 = 16 and c_3 = 10 works. Under these conditions, we show that HMMs can be learned using polynomially many samples:\n\nTheorem 2. If an HMM satisfies the above conditions, then with high probability over the choice of O, the parameters of the HMM are learnable to within additive error ε with observations over windows of length 2τ + 1, τ = 15 log_m n, with sample complexity poly(n, 1/ε).\n\nAppendix C also states a corollary of Theorem 2 in terms of the minimum singular value σ_min(T) of the matrix T, instead of σ_min^(1)(T). We discuss the conditions for Theorem 2 next, and subsequently provide examples of HMMs which satisfy these conditions.\n\n3.1 Discussion of the assumptions\n\n1. Transition matrix is well-conditioned: Note that singular transition matrices might not even be identifiable. Moreover, Mossel and Roch [20] showed that learning HMMs with singular transition matrices is as hard as learning parity with noise, which is widely conjectured to be computationally hard. Hence, it is necessary to exclude at least some classes of ill-conditioned transition matrices.\n\n2. 
Transition matrix does not have short cycles: Due to Proposition 1, we know that an HMM might not even be identifiable from short windows if it is composed of a union of short cycles, hence we expect a similar condition for learning the HMM with polynomial samples; though there is a gap between the upper and lower bounds in terms of the probability mass which is allowed on the short cycles. We performed some simulations to understand how the length of cycles in the transition matrix and the probability mass assigned to short cycles affect the condition number of the likelihood matrix A; recall that the condition number of A determines the stability of the method of moments approach. We take the number of hidden states n = 128, and let P_128 be a cycle on the n hidden states (as in Fig. 1a). Let P_c be a union of short cycles of length c on the n states (refer to Fig. 2b for an example). We take the transition matrix to be T = εP_c + (1 − ε)P_128 for different values of c and ε. Fig. 3a shows that the condition number of A becomes worse, and hence learning requires more samples, if the cycles are shorter in length, and if more probability mass is assigned to the short cycles, hinting that our conditions are perhaps not too stringent.\n\nFigure 3: Experiments to study the effect of sparsity and short cycles on the learnability of HMMs. The condition number of the likelihood matrix A determines the stability or sample complexity of the method of moments approach. The condition numbers are averaged over 10 trials. (a) The conditioning becomes worse when cycles are smaller or when more probability mass ε is put on short cycles. (b) The conditioning becomes worse as the degree increases, and when more probability mass ε is put on the dense part of T.\n\n3. 
All hidden states have a small degree: Condition 3 in Theorem 2 can be reinterpreted as saying that the transition probabilities out of any hidden state must have mass at most 1/n^(1+c) on any hidden state except a set of d hidden states, for any c > 0. While this soft constraint is weaker than a hard constraint on the degree, it is natural to ask whether any sparsity is necessary to learn HMMs. As above, we carry out simulations to understand how the degree affects the condition number of the likelihood matrix A. We consider transition matrices on n = 128 hidden states which are a combination of a dense part and a cycle. Define P_128 to be a cycle as before. Define G_d as the adjacency matrix of a directed regular graph with degree d. We take the transition matrix T = εG_d + (1 − εd)P_128. Hence the transition distribution of every hidden state has mass ε on a set of d neighbors, and the residual probability mass is assigned to the permutation P_128. Fig. 3b shows that the condition number of A becomes worse as the degree d becomes larger, and as more probability mass ε is assigned to the dense part G_d of the transition matrix T, providing some weak evidence for the necessity of Condition 3. Also, recall that Theorem 1 shows that HMMs where the transition matrix is a random walk on an undirected regular graph with large degree (degree polynomial in n) cannot be learned using polynomially many samples if m is a constant with respect to n. However, such graphs have all eigenvalues except the first one less than O(1/√d), hence it is not clear if the hardness of learning depends on the large degree itself or is only due to T being ill-conditioned. More concretely, we pose the following open question:\n\nOpen question: Consider an HMM with a transition matrix T = (1 − ε)P + εU, where P is the cyclic permutation on n hidden states (such as in Fig. 
1a) and U is a random walk on an undirected, regular graph with large degree (polynomial in n), and ε > 0 is a constant. Can this HMM be learned using polynomial samples when m is small (constant) with respect to n? This example approximately preserves σ_min(T) through the addition of the permutation, and hence the difficulty is only due to the transition matrix having large degree.\n\n4. Output distributions are random and have small support: As discussed in the introduction, if we do not assume that the observation matrices are random, then even simple HMMs with a cycle or permutation as the transition matrix might require long windows even to become identifiable, see Fig. 4. Hence some assumptions on the output distribution do seem necessary for learning the model from short time windows, though our assumptions are probably not tight. For instance, the assumption that the output distributions have a small support makes learning easier as it leads to the outputs being more discriminative of the hidden states, but it is not clear that this is a necessary assumption. Ideally, we would like to prove our learnability results under a smoothed model for O, where an adversary is allowed to see the transition matrix T and pick any worst-case O, but random noise is then added to the output distributions, which limits the power of the adversary. We believe our results should hold under such a smoothed setting, but set this aside for future work.\n\nFigure 4: Consider two HMMs with transition matrices being cycles on n = 16 states with binary outputs, where outputs conditioned on the hidden states are deterministic. The states labeled as 0 always emit a 0 and the states labeled as 1 always emit a 1. 
The two HMMs are not distinguishable from windows of length less than 8. Hence, with worst-case O, even simple HMMs like the cycle can require long windows just to become identifiable.

3.2 Examples of transition matrices which satisfy our assumptions
We revisit the examples from Fig. 1a and Fig. 1b, showing that they satisfy our assumptions.
1. Transition matrices where the Markov chain is a permutation: If the Markov chain is a permutation with all cycles longer than 10 logm n, then the transition matrix obeys all the conditions for Theorem 2. This is because all the singular values of a permutation are 1, the degree is 1, and all hidden states visit 10 logm n different states in 15 logm n time steps.
2. Transition matrices which are random walks on graphs with small degree and large girth: For directed graphs, Condition 2 is equivalent to requiring that the graph representation of the transition matrix has large girth (the girth of a graph is the length of its shortest cycle).
3. Transition matrices of factorial HMMs: Factorial HMMs [12] factor the latent state at any time into D dimensions, each of which independently evolves according to a Markov process. For D = 2, this is equivalent to saying that the hidden states are indexed by two labels (i, j), and if T1 and T2 represent the transition matrices for the two dimensions, then P[(i1, j1) → (i2, j2)] = T1(i2, i1)T2(j2, j1). This naturally models settings where there are multiple latent concepts which evolve independently. The following properties are easy to show:
1. If either of T1 or T2 visits N different states in 15 logm n time steps with probability (1 − δ), then T visits N different states in 15 logm n time steps with probability (1 − δ).
2. σmin(T) = σmin(T1)σmin(T2)
3.
If all hidden states in T1 and T2 have mass at most δ on all but d1 states and d2 states respectively, then T has mass at most 2δ on all but d1d2 states.

Therefore, factorial HMMs are learnable with random O if the underlying processes obey conditions similar to the assumptions for Theorem 2: if both T1 and T2 are well-conditioned, at least one of them does not have short cycles, and either has small degree, then T is learnable with random O.

4 Identifiability of HMMs from short windows

As it is not obvious that some of the requirements for Theorem 2 are necessary, it is natural to attempt to derive stronger results for just identifiability of HMMs having structured transition matrices. In this section, we state our results for identifiability of HMMs from windows of size O(logm n). Huang et al. [14] showed that all HMMs except those belonging to a measure zero set become identifiable from windows of length 2τ + 1 with τ = 8⌈logm n⌉. However, the measure zero set itself might contain interesting classes of HMMs (see Fig. 1); for example, sparse HMMs also belong to a measure zero set. We refine the identifiability results in this section, and show that a natural sparsity condition on the transition matrix guarantees identifiability from short windows. Given any transition matrix T, we regard T as being supported by a set of indices S if the non-zero entries of T all lie in S. We now state our result for identifiability of sparse HMMs.
Theorem 3. Let S be a set of indices which supports a permutation where all cycles have at least 2⌈logm n⌉ hidden states.
Then the set T of all transition matrices with support S is identifiable from windows of length 4⌈logm n⌉ + 1 for all observation matrices O, except for a measure zero set of transition matrices in T and observation matrices O.

We hypothesize that excluding a measure zero set of transition matrices in Theorem 3 should not be necessary as long as the transition matrix is full rank, but are unable to show this. Note that our result on identifiability is more flexible than Theorem 2 in allowing short cycles in transition matrices, and is closer to the lower bound on identifiability in Proposition 1.
We also strengthen the result of Huang et al. [14] for identifiability of generic HMMs. Huang et al. [14] conjectured that windows of length 2⌈logm n⌉ + 1 are sufficient for generic HMMs to be identifiable. The constant 2 is the information-theoretic bound, as an HMM on n hidden states and m outputs has O(n^2 + nm) independent parameters and hence needs observations over a window of size 2⌈logm n⌉ + 1 to be uniquely identifiable. Proposition 2 settles this conjecture, proving the optimal window length requirement for generic HMMs to be identifiable. As the number of possible outputs over a window of length t is m^t, the size of the moment tensor in Section 2.2 is itself exponential in the window length. Therefore even a factor of 2 improvement in the window length requirement leads to a quadratic improvement in the sample and time complexity.
Proposition 2.
The set of all HMMs is identifiable from observations over windows of length 2⌈logm n⌉ + 1, except for a measure zero set of transition matrices T and observation matrices O.

5 Discussion on long-term dependencies in HMMs

In this section, we discuss long-term dependencies in HMMs, and show how our results on overcomplete HMMs improve the understanding of how HMMs can capture long-term dependencies, both with respect to the Markov chain and the outputs. Recall the definition of σ(1)min(T):

σ(1)min(T) = min over x ∈ Rn of ‖Tx‖1 / ‖x‖1

We claim that if σ(1)min(T) is large, then the transition matrix preserves significant information about the distribution of hidden states at time 0 at a future time t, for all initial distributions at time 0. Consider any two distributions p0 and q0 at time 0. Let pt and qt be the distributions of the hidden states at time t given that the distribution at time 0 is p0 and q0 respectively. Then the ℓ1 distance between pt and qt satisfies ‖pt − qt‖1 ≥ (σ(1)min(T))^t ‖p0 − q0‖1, verifying our claim. It is interesting to compare this notion with the mixing time of the transition matrix. Defining the mixing time as the time until the ℓ1 distance between any two starting distributions is at most 1/2, it follows that the mixing time τmix ≥ 1/log(1/σ(1)min(T)); therefore, if σ(1)min(T) is large then the chain is slowly mixing. However, the converse is not true: σ(1)min(T) might be small even if the chain never mixes, for example if the graph is disconnected but the connected components mix very quickly.
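The definition of σ(1)min(T) and the contraction bound above can be checked numerically. The following is a minimal sketch of our own (not from the paper), using NumPy; the function name and the Monte Carlo estimator are illustrative assumptions. Note that random search over zero-sum directions (differences of probability distributions are always zero-sum) can only overestimate a minimum, so the estimate is an upper bound on the true σ(1)min(T).

```python
import numpy as np

def sigma1_min_upper_bound(T, trials=2000, seed=0):
    """Monte Carlo upper bound on sigma^(1)_min(T) = min_x ||Tx||_1 / ||x||_1,
    searched over random zero-sum directions x."""
    rng = np.random.default_rng(seed)
    n = T.shape[0]
    best = np.inf
    for _ in range(trials):
        x = rng.standard_normal(n)
        x -= x.mean()  # project onto zero-sum directions
        best = min(best, np.abs(T @ x).sum() / np.abs(x).sum())
    return best

n = 16
# Cyclic permutation: ||Px||_1 = ||x||_1 for all x, so sigma^(1)_min(P) = 1.
P = np.roll(np.eye(n), 1, axis=0)
# Fully mixing chain (all columns uniform): Ux = 0 for zero-sum x, so sigma^(1)_min(U) = 0.
U = np.full((n, n), 1.0 / n)

print(sigma1_min_upper_bound(P))  # close to 1: the permutation preserves all information
print(sigma1_min_upper_bound(U))  # close to 0: one step erases the initial distribution

# Contraction claim: p_t - q_t = T^t (p_0 - q_0), hence
# ||p_t - q_t||_1 >= (sigma^(1)_min(T))^t ||p_0 - q_0||_1.
# For the permutation, the l1 distance is preserved exactly at every t.
rng = np.random.default_rng(1)
p0, q0 = rng.dirichlet(np.ones(n)), rng.dirichlet(np.ones(n))
Pt = np.linalg.matrix_power(P, 50)
assert np.isclose(np.abs(Pt @ p0 - Pt @ q0).sum(), np.abs(p0 - q0).sum())
```

The two extremes match the discussion above: the permutation is maximally slowly mixing with σ(1)min(T) = 1, while the uniform chain mixes in one step and has σ(1)min(T) = 0.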
Therefore, σ(1)min(T) is possibly a better notion of the long-term dependence of the transition matrix, as it requires that information about the past state is preserved “in all directions”.
Another reasonable notion of the long-term dependence of the HMM is the long-term dependence in the output process instead of in the hidden Markov chain, i.e., the utility of past observations when making predictions about the distant future (given outputs y−∞, . . . , y1, y2, . . . , yt, how far back into the past do we need to remember at time t to make a good prediction about future outputs?). This does not depend in a simple way on the T and O matrices, but we do note that if the Markov chain is fast mixing then the output process certainly cannot have long-term dependencies. We also note that, with respect to long-term dependencies in the output process, the setting m ≪ n seems to be much more interesting than when m is comparable to n. The reason is that in the small-output-alphabet setting we only receive a small amount of information about the true hidden state at each step, and hence longer windows are necessary to infer the hidden state and make a good prediction. We also refer the reader to Kakade et al. [16] for related discussions on the memory of output processes of HMMs.

6 Conclusion and Future Work

The setting where the output alphabet m is much smaller than the number of hidden states n is well-motivated in practice and raises several interesting theoretical questions about new lower bounds and algorithms. Though some of our results are obtained under more restrictive conditions than seem necessary, we hope the ideas and techniques pave the way for much sharper results in this setting.
Some open problems which we think might be particularly useful for improving our understanding are relaxing the condition that the observation matrix be random to some structural constraint on the observation matrix (such as on its Kruskal rank), and more thoroughly investigating the requirement that the transition matrix be sparse and not have short cycles.

References

[1] E. S. Allman, C. Matias, and J. A. Rhodes. Identifiability of parameters in latent structure models with many observed variables. Annals of Statistics, 37:3099–3132, 2009.
[2] A. Anandkumar, D. J. Hsu, and S. M. Kakade. A method of moments for mixture models and hidden Markov models. In COLT, volume 1, page 4, 2012.
[3] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. arXiv, 2013.
[4] A. Anandkumar, R. Ge, and M. Janzamin. Learning overcomplete latent variable models through tensor methods. In COLT, pages 36–112, 2015.
[5] A. Bhaskara, M. Charikar, and A. Vijayaraghavan. Uniqueness of tensor decompositions with applications to polynomial identifiability. CoRR, abs/1304.8087, 2013.
[6] A. Bhaskara, M. Charikar, A. Moitra, and A. Vijayaraghavan. Smoothed analysis of tensor decompositions. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 594–603. ACM, 2014.
[7] D. Blackwell and L. Koopmans. On the identifiability problem for functions of finite Markov chains. Annals of Mathematical Statistics, 28:1011–1015, 1957.
[8] W. Blischke. Estimating the parameters of mixtures of binomial distributions. Journal of the American Statistical Association, 59(306):510–528, 1964.
[9] J. T. Chang. Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Mathematical Biosciences, 137(1):51–73, 1996.
[10] A. Flaxman, A. W. Harrow, and G. B. Sorkin.
Strings with maximally many distinct subsequences and substrings. Electronic Journal of Combinatorics, 11(1):R8, 2004.
[11] J. Friedman. A proof of Alon's second eigenvalue conjecture. In Proceedings of the thirty-fifth Annual ACM Symposium on Theory of Computing, pages 720–724. ACM, 2003.
[12] Z. Ghahramani and M. Jordan. Factorial hidden Markov models. Machine Learning, 1:31, 1997.
[13] D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012.
[14] Q. Huang, R. Ge, S. Kakade, and M. Dahleh. Minimal realization problems for hidden Markov models. IEEE Transactions on Signal Processing, 64(7):1896–1904, 2016.
[15] H. Ito, S.-I. Amari, and K. Kobayashi. Identifiability of hidden Markov information sources and their minimum degrees of freedom. IEEE Transactions on Information Theory, 38(2):324–333, 1992.
[16] S. Kakade, P. Liang, V. Sharan, and G. Valiant. Prediction with a short memory. arXiv preprint arXiv:1612.02526, 2016.
[17] M. Krivelevich, B. Sudakov, V. H. Vu, and N. C. Wormald. Random regular graphs of high degree. Random Structures & Algorithms, 18(4):346–363, 2001.
[18] J. B. Kruskal. Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra and its Applications, 18(2), 1977.
[19] S. Leurgans, R. Ross, and R. Abel. A decomposition for three-way arrays. SIAM Journal on Matrix Analysis and Applications, 14(4):1064–1083, 1993.
[20] E. Mossel and S. Roch. Learning nonsingular phylogenies and hidden Markov models. In Proceedings of the thirty-seventh Annual ACM Symposium on Theory of Computing, pages 366–375. ACM, 2005.
[21] R. A. Redner and H. F. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2):195–239, 1984.
[22] E. Shamir and E. Upfal.
Large regular factors in random graphs. North-Holland Mathematics Studies, 87:271–282, 1984.
[23] R. Weiss and B. Nadler. Learning parametric-output HMMs with two aliased states. In ICML, pages 635–644, 2015.