{"title": "On Learning Markov Chains", "book": "Advances in Neural Information Processing Systems", "page_first": 648, "page_last": 657, "abstract": "The problem of estimating an unknown discrete distribution from its samples is a fundamental tenet of statistical learning. Over the past decade, it attracted significant research effort and has been solved for a variety of divergence measures.  Surprisingly, an equally important problem, estimating an unknown Markov chain from its samples, is still far from understood. We consider two problems related to the min-max risk (expected loss) of estimating an unknown k-state Markov chain from its n sequential samples: predicting the conditional distribution of the next sample with respect to the KL-divergence, and estimating the transition matrix with respect to a natural loss induced by KL or a more general f-divergence measure.\n\nFor the first measure, we determine the min-max prediction risk to within a linear factor in the alphabet size, showing it is \\Omega(k\\log\\log n/n) and O(k^2\\log\\log n/n). For the second, if the transition probabilities can be arbitrarily small, then only trivial uniform risk upper bounds can be derived. We therefore consider transition probabilities that are bounded away from zero, and resolve the problem for essentially all sufficiently smooth f-divergences, including KL-, L_2-, Chi-squared, Hellinger, and Alpha-divergences.", "full_text": "On Learning Markov Chains\n\nYi Hao\nDept. of Electrical and Computer Engineering\nUniversity of California, San Diego\nLa Jolla, CA 92093\nyih179@ucsd.edu\n\nAlon Orlitsky\nDept. of Electrical and Computer Engineering\nUniversity of California, San Diego\nLa Jolla, CA 92093\nalon@ucsd.edu\n\nVenkatadheeraj Pichapati\nDept. of Electrical and Computer Engineering\nUniversity of California, San Diego\nLa Jolla, CA 92093\ndheerajpv7@ucsd.edu\n\nAbstract\n\nThe problem of estimating an unknown discrete distribution from its samples is a fundamental tenet of statistical learning. Over the past decade, it attracted significant research effort and has been solved for a variety of divergence measures. Surprisingly, an equally important problem, estimating an unknown Markov chain from its samples, is still far from understood. We consider two problems related to the min-max risk (expected loss) of estimating an unknown k-state Markov chain from its n sequential samples: predicting the conditional distribution of the next sample with respect to the KL-divergence, and estimating the transition matrix with respect to a natural loss induced by KL or a more general f-divergence measure. For the first measure, we determine the min-max prediction risk to within a linear factor in the alphabet size, showing it is $\\Omega(k \\log\\log n/n)$ and $O(k^2 \\log\\log n/n)$. For the second, if the transition probabilities can be arbitrarily small, then only trivial uniform risk upper bounds can be derived. We therefore consider transition probabilities that are bounded away from zero, and resolve the problem for essentially all sufficiently smooth f-divergences, including KL-, $L_2$-, Chi-squared, Hellinger, and Alpha-divergences.\n\n1 Introduction\n\nMany natural phenomena are inherently probabilistic. With past observations at hand, probabilistic models can therefore help us predict, estimate, and understand future outcomes and trends. The two most fundamental probabilistic models for sequential data are i.i.d. processes and Markov chains. In an i.i.d. process, for each $i \\geq 1$, a sample $X_i$ is generated independently according to the same underlying distribution. 
In Markov chains, for each $i \\geq 2$, the distribution of sample $X_i$ is determined by just the value of $X_{i-1}$.\n\nLet us confine our discussion to random processes over finite alphabets, assumed without loss of generality to be $[k] := \\{1, 2, \\ldots, k\\}$. An i.i.d. process is defined by a single distribution $p$ over $[k]$, while a Markov chain is characterized by a transition probability matrix $M$ over $[k] \\times [k]$. We denote the initial and stationary distributions of a Markov model by $\\mu$ and $\\pi$, respectively. For notational consistency let $P = (p)$ denote an i.i.d. model and $P = (M)$ denote a Markov model.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nHaving observed a sample sequence $X^n := X_1, \\ldots, X_n$ from an unknown i.i.d. process or Markov chain, a natural problem is to predict the next sample point $X_{n+1}$. Since $X_{n+1}$ is a random variable, this task is typically interpreted as estimating the conditional probability distribution $P_{x^n} := \\Pr(X_{n+1} = \\cdot \\mid X^n = x^n)$ of the next sample point $X_{n+1}$.\n\nLet $[k]^*$ denote the collection of all finite-length sequences over $[k]$. Conditioning on $X^n = x^n$, our first objective is therefore to estimate the conditional distribution $P_{x^n}$. To be more precise, we would like to find an estimator $\\hat{P}$ that associates with every sequence $x^n \\in [k]^*$ a distribution $\\hat{P}_{x^n}$ over $[k]$ that approximates $P_{x^n}$ in a suitable sense.\n\nPerhaps a more classical problem is parameter estimation, which describes the underlying process. An i.i.d. process is completely characterized by $P_{x^n} = p$, hence this problem coincides with the previous one. For Markov chains, we seek to estimate the transition matrix $M$. 
Therefore, instead of producing a probability distribution $\\hat{P}_{x^n}$, the estimator $\\hat{M}$ maps every sequence $x^n \\in [k]^*$ to a transition matrix $\\hat{M}_{x^n}$ over $[k] \\times [k]$.\n\nFor two distributions $p$ and $q$ over $[k]$, let $L(p, q)$ be the loss when $p$ is approximated by $q$. For the prediction problem, we measure the performance of an estimator $\\hat{P}$ in terms of its prediction risk,\n\\[ \\rho_n^L(P, \\hat{P}) := E_{X^n \\sim P}[L(P_{X^n}, \\hat{P}_{X^n})] = \\sum_{x^n \\in [k]^n} P(x^n) L(P_{x^n}, \\hat{P}_{x^n}), \\]\nthe expected loss with respect to the sample sequence $X^n$, where $P(x^n) := \\Pr(X^n = x^n)$.\n\nFor the estimation problem, we quantify the performance of the estimator by estimation risk. We first consider the expected loss of $\\hat{M}$ with respect to a single state $i \\in [k]$:\n\\[ E_{X^n \\sim (M)}[L(M(i, \\cdot), \\hat{M}_{X^n}(i, \\cdot))]. \\]\nWe then define the estimation risk of $\\hat{M}$ given sample sequence $X^n$ as the maximum expected loss over all states,\n\\[ \\varepsilon_n^L(M, \\hat{M}) := \\max_{i \\in [k]} E_{X^n \\sim (M)}[L(M(i, \\cdot), \\hat{M}_{X^n}(i, \\cdot))]. \\]\nWhile the process $P$ we are trying to learn is unknown, it often belongs to a known collection $\\mathcal{P}$. The worst prediction risk of an estimator $\\hat{P}$ over all distributions in $\\mathcal{P}$ is\n\\[ \\rho_n^L(\\mathcal{P}, \\hat{P}) := \\max_{P \\in \\mathcal{P}} \\rho_n^L(P, \\hat{P}). \\]\nThe minimal possible worst-case prediction risk, or simply the minimax prediction risk, incurred by any estimator is\n\\[ \\rho_n^L(\\mathcal{P}) := \\min_{\\hat{P}} \\rho_n^L(\\mathcal{P}, \\hat{P}) = \\min_{\\hat{P}} \\max_{P \\in \\mathcal{P}} \\rho_n^L(P, \\hat{P}). \\]\nThe worst-case estimation risk $\\varepsilon_n^L(\\mathcal{P}, \\hat{M})$ and the minimax estimation risk $\\varepsilon_n^L(\\mathcal{P})$ are defined similarly. Given $\\mathcal{P}$, our goals are to approximate the minimax prediction/estimation risk to a universal constant-factor, and to devise estimators that achieve this performance.\n\nAn alternative definition of the estimation risk, considered in [1] and mentioned by a reviewer, is\n\\[ \\tilde{\\varepsilon}_n^L(M, \\hat{M}) := \\sum_{i \\in [k]} \\pi_i \\cdot E_{X^n \\sim (M)}[L(M(i, \\cdot), \\hat{M}_{X^n}(i, \\cdot))]. \\]\nWe denote the corresponding minimax estimation risk by $\\tilde{\\varepsilon}_n^L(\\mathcal{P})$.\n\nLet $o(1)$ represent a quantity that vanishes as $n \\to \\infty$. In the following, we use $a \\lesssim b$ to denote $a \\leq b(1 + o(1))$, and $a \\asymp b$ to denote $a \\leq b(1 + o(1))$ and $b \\leq a(1 + o(1))$.\n\nFor the collection $\\mathrm{IID}_k$ of all the i.i.d. processes over $[k]$, the above two formulations coincide and the problem is essentially the classical discrete distribution estimation problem. The problem of determining $\\rho_n^L(\\mathrm{IID}_k)$ was introduced by [2] and studied in a sequence of papers [3, 4, 5, 6, 7]. For fixed $k$ and KL-divergence loss, as $n$ goes to infinity, [7] showed that\n\\[ \\rho_n^{KL}(\\mathrm{IID}_k) \\asymp \\frac{k - 1}{2n}. \\]\nKL-divergence and many other important similarity measures between two distributions can be expressed as f-divergences [8]. Let $f$ be a convex function with $f(1) = 0$. The f-divergence between two distributions $p$ and $q$ over $[k]$, whenever well-defined, is $D_f(p, q) := \\sum_{i \\in [k]} q(i) f(p(i)/q(i))$. Call an f-divergence ordinary if $f$ is thrice continuously differentiable over $(0, \\infty)$, sub-exponential, namely, $\\lim_{x \\to \\infty} |f(x)|/e^{cx} = 0$ for all $c > 0$, and satisfies $f''(1) \\neq 0$.\n\nObserve that all the following notable measures are ordinary f-divergences: Chi-squared divergence [9] from $f(x) = (x - 1)^2$, KL-divergence [10] from $f(x) = x \\log x$, Hellinger divergence [11] from $f(x) = (\\sqrt{x} - 1)^2$, and Alpha-divergence [12] from $f_\\alpha(x) := 4(1 - x^{(1+\\alpha)/2})/(1 - \\alpha^2)$, where $\\alpha \\neq \\pm 1$.\n\nRelated Work For any f-divergence, we denote the corresponding minimax prediction risk for an n-element sample over set $\\mathcal{P}$ by $\\rho_n^f(\\mathcal{P})$. Researchers in [13] considered the problem of determining $\\rho_n^f(\\mathrm{IID}_k)$ for the ordinary f-divergences. Beyond the above minimax formulation, researchers have recently also considered formulations that are more adaptive to the underlying i.i.d. processes [14, 15]. Surprisingly, while the min-max risk of i.i.d. processes was addressed in a large body of work, the risk of Markov chains, which frequently arise in practice, was not studied until very recently.\n\nLet $\\mathbb{M}^k$ denote the collection of all the Markov chains over $[k]$. For prediction with KL-divergence, [16] showed that $\\rho_n^{KL}(\\mathbb{M}^k) = \\Theta_k(\\log\\log n/n)$, but did not specify the dependence on $k$. For estimation, [17] considered the class of Markov chains whose pseudo-spectral gap is bounded away from 0 and approximated the $L_1$ estimation risk to within a $\\log n$ factor. 
Some of their techniques, in particular the lower-bound construction in their displayed equation (4.3), are of similar nature and were derived independently of results in Section 5 in our paper.\n\nOur first main result determines the dependence of $\\rho_n^{KL}(\\mathbb{M}^k)$ on both $k$ and $n$, to within a factor of roughly $k$:\n\nTheorem 1 The minimax KL-prediction risk of Markov chains satisfies\n\\[ \\frac{(k - 1) \\log\\log n}{4en} \\lesssim \\rho_n^{KL}(\\mathbb{M}^k) \\lesssim \\frac{2k^2 \\log\\log n}{n}. \\]\nDepending on $M$, some states may be observed very infrequently, or not at all. This does not drastically affect the prediction problem as these states will also have small impact on $\\rho_n^{KL}(\\mathbb{M}^k)$ in the prediction risk $\\rho_n^L(P, \\hat{P})$. For estimation, however, rare and unobserved states still need to be well approximated, hence $\\varepsilon_n^L(\\mathbb{M}^k)$ does not decrease with $n$, and for example $\\varepsilon_n^{KL}(\\mathbb{M}^k) = \\log k$ for all $n$.\n\nWe therefore parametrize the risk by the lowest probability in the transition matrix. For $\\delta > 0$ let\n\\[ \\mathbb{M}^k_\\delta := \\{(M) : M_{i,j} \\geq \\delta, \\forall i, j\\} \\]\nbe the collection of Markov chains whose lowest transition probability is at least $\\delta$. Since $\\mathbb{M}^k_\\delta$ is trivial if $\\delta \\geq 1/k$, we only consider $\\delta \\in (0, 1/k)$. We characterize the minimax estimation risk of $\\mathbb{M}^k_\\delta$ almost precisely.\n\nTheorem 2 For all ordinary f-divergences and all $\\delta \\in (0, 1/k)$,\n\\[ \\tilde{\\varepsilon}_n^f(\\mathbb{M}^k_\\delta) \\asymp \\frac{(k - 1)k f''(1)}{2n} \\]\nand\n\\[ (1 - \\delta) \\frac{(k - 2) f''(1)}{2n\\delta} \\lesssim \\varepsilon_n^f(\\mathbb{M}^k_\\delta) \\lesssim \\frac{(k - 1) f''(1)}{2n\\delta}. \\]\nWe can further refine the estimation-risk bounds by partitioning $\\mathbb{M}^k_\\delta$ based on the smallest probability in the chain\u2019s stationary distribution $\\pi$. Clearly, $\\min_{i \\in [k]} \\pi_i \\leq 1/k$. For $0 < \\pi^* \\leq 1/k$ and $0 < \\delta < 1/k$, let\n\\[ \\mathbb{M}^k_{\\delta,\\pi^*} := \\{(M) : (M) \\in \\mathbb{M}^k_\\delta \\text{ and } \\min_{i \\in [k]} \\pi_i = \\pi^*\\} \\]\nbe the collection of all Markov chains in $\\mathbb{M}^k_\\delta$ whose lowest stationary probability is $\\pi^*$. We determine the minimax estimation risk over $\\mathbb{M}^k_{\\delta,\\pi^*}$ nearly precisely.\n\nTheorem 3 For all ordinary f-divergences,\n\\[ (1 - \\pi^*) \\frac{(k - 2)k f''(1)}{2n} \\lesssim \\tilde{\\varepsilon}_n^f(\\mathbb{M}^k_{\\delta,\\pi^*}) \\lesssim \\frac{(k - 1)k f''(1)}{2n} \\]\nand\n\\[ (1 - \\pi^*) \\frac{(k - 2) f''(1)}{2n\\pi^*} \\lesssim \\varepsilon_n^f(\\mathbb{M}^k_{\\delta,\\pi^*}) \\lesssim \\frac{(k - 1) f''(1)}{2n\\pi^*}. \\]\nFor $L_2$-distance corresponding to the squared Euclidean norm, we prove the following risk bounds.\n\nTheorem 4 For all $\\delta \\in (0, 1/k)$,\n\\[ \\tilde{\\varepsilon}_n^{L_2}(\\mathbb{M}^k_\\delta) \\asymp \\frac{k - 1}{n} \\]\nand\n\\[ (1 - \\delta)^2 \\frac{1 - \\frac{1}{k-1}}{n\\delta} \\lesssim \\varepsilon_n^{L_2}(\\mathbb{M}^k_\\delta) \\lesssim \\frac{1 - \\frac{1}{k}}{n\\delta}. \\]\nTheorem 5 For all $\\delta \\in (0, 1/k)$ and $\\pi^* \\in (0, 1/k]$,\n\\[ (1 - \\pi^*)^2 \\frac{k - \\frac{k}{k-1}}{n} \\lesssim \\tilde{\\varepsilon}_n^{L_2}(\\mathbb{M}^k_{\\delta,\\pi^*}) \\lesssim \\frac{k - 1}{n} \\]\nand\n\\[ (1 - \\pi^*)^2 \\frac{1 - \\frac{1}{k-1}}{n\\pi^*} \\lesssim \\varepsilon_n^{L_2}(\\mathbb{M}^k_{\\delta,\\pi^*}) \\lesssim \\frac{1 - \\frac{1}{k}}{n\\pi^*}. \\]\nThe rest of the paper is organized as follows. Section 2 introduces add-constant estimators and additional definitions and notation for Markov chains. Note that each of the above results consists of a lower bound and an upper bound. We prove the lower bound by constructing a suitable prior distribution over the relevant collection of processes. 
Sections 3 and 5 describe these prior distributions for the prediction and estimation problems, respectively. The upper bounds are derived via simple variants of the standard add-constant estimators. Sections 4 and 6 describe the estimators for the prediction and estimation bounds, respectively. For space considerations, we relegate all the proofs to the supplemental material.\n\n2 Definitions and Notation\n\n2.1 Add-constant estimators\n\nGiven a sample sequence $X^n$ from an i.i.d. process $(p)$, let $N'_i$ denote the number of times symbol $i$ appears in $X^n$. The classical empirical estimator estimates $p$ by\n\\[ \\hat{p}_{X^n}(i) := \\frac{N'_i}{n}, \\quad \\forall i \\in [k]. \\]\nThe empirical estimator performs poorly for loss measures such as KL-divergence. For example, if $p$ assigns a tiny probability to a symbol so that it is unlikely to appear in $X^n$, then with high probability the KL-divergence between $p$ and $\\hat{p}_{X^n}$ will be infinity.\n\nA common solution applies the Laplace smoothing technique [18] that assigns to each symbol $i$ a probability proportional to $N'_i + \\beta$, where $\\beta > 0$ is a fixed constant. The resulting add-$\\beta$ estimator is denoted by $\\hat{p}^{+\\beta}$. Due to their simplicity and effectiveness, add-$\\beta$ estimators are widely used in various machine learning algorithms such as naive Bayes classifiers [19]. As shown in [7], for the i.i.d. processes, a variant of the add-3/4 estimator achieves the minimax estimation risk $\\rho_n^{KL}(\\mathrm{IID}_k)$.\n\nAnalogously, given a sample sequence $X^n$ generated by a Markov chain, let $N_{ij}$ denote the number of times symbol $j$ appears right after symbol $i$ in $X^n$, and let $N_i$ denote the number of times that symbol $i$ appears in $X^{n-1}$. 
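As a rough illustration of the quantities just introduced, the following Python sketch computes the i.i.d. add-beta estimate and the Markov counts N_ij and N_i from a sample sequence. The function names add_beta_iid and transition_counts are hypothetical, states are assumed to be the integers 1..k, and this is only a sketch under those assumptions rather than the authors' implementation.

```python
from collections import Counter

def add_beta_iid(xs, k, beta):
    # Add-beta estimate of p from an i.i.d. sample over {1, ..., k}:
    # the probability of symbol i is proportional to N'_i + beta.
    counts = Counter(xs)
    n = len(xs)
    return [(counts[i] + beta) / (n + k * beta) for i in range(1, k + 1)]

def transition_counts(xs, k):
    # N[i][j]: number of times j appears right after i in the sequence.
    # Ni[i]: number of times i appears among the first n - 1 samples.
    # Index 0 is unused so that states can be addressed as 1..k.
    N = [[0] * (k + 1) for _ in range(k + 1)]
    Ni = [0] * (k + 1)
    for a, b in zip(xs, xs[1:]):
        N[a][b] += 1
        Ni[a] += 1
    return N, Ni
```

Setting beta = 0 in add_beta_iid recovers the empirical estimator, which assigns probability zero to unseen symbols; any beta > 0 keeps every estimated probability strictly positive.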
We define the add-$\\beta$ estimator $\\hat{M}^{+\\beta}$ as\n\\[ \\hat{M}^{+\\beta}_{X^n}(i, j) := \\frac{N_{ij} + \\beta}{N_i + k\\beta}, \\quad \\forall i, j \\in [k]. \\]\n\n2.2 More on Markov chains\n\nAdopting notation in [20], let $\\Delta_k$ denote the collection of discrete distributions over $[k]$. Let $[k]_e$ and $[k]_o$ be the collection of even and odd integers in $[k]$, respectively. By convention, for a Markov chain over $[k]$, we call each symbol $i \\in [k]$ a state. Given a Markov chain, the hitting time $\\tau(j)$ is the first time the chain reaches state $j$. We denote by $\\Pr_i(\\tau(j) = t)$ the probability that starting from $i$, the hitting time of $j$ is exactly $t$. For a Markov chain $(M)$, we denote by $P^t$ the distribution of $X_t$ if we draw $X^t \\sim (M)$. Additionally, for a fixed Markov chain $(M)$, the mixing time $t_{mix}$ denotes the smallest index $t$ such that $L_1(P^t, \\pi) < 1/2$. Finally, for notational convenience, we write $M_{ij}$ instead of $M(i, j)$ whenever appropriate.\n\n3 Minimax prediction: lower bound\n\nA standard lower-bound argument for minimax prediction risk uses the fact that\n\\[ \\rho_n^{KL}(\\mathcal{P}) = \\min_{\\hat{P}} \\max_{P \\in \\mathcal{P}} \\rho_n^{KL}(P, \\hat{P}) \\geq \\min_{\\hat{P}} E_{P \\sim \\Pi}[\\rho_n^{KL}(P, \\hat{P})] \\]\nfor any prior distribution $\\Pi$ over $\\mathcal{P}$. One advantage of this approach is that the optimal estimator that minimizes $E_{P \\sim \\Pi}[\\rho_n^{KL}(P, \\hat{P})]$ can often be computed explicitly.\n\nPerhaps the simplest prior is the uniform distribution $U(\\mathcal{P}_S)$ over a subset $\\mathcal{P}_S \\subset \\mathcal{P}$. Let $\\hat{P}^*$ be the optimal estimator minimizing $E_{P \\sim U(\\mathcal{P}_S)}[\\rho_n^{KL}(P, \\hat{P})]$. Computing $\\hat{P}^*$ for all the possible sample sequences $x^n$ may be unrealistic. 
Instead, letting $K_n$ be an arbitrary subset of $[k]^n$, we can lower bound\n\\[ \\rho_n^{KL}(P, \\hat{P}) = E_{X^n \\sim P}[D_{KL}(P_{X^n}, \\hat{P}_{X^n})] \\]\nby\n\\[ \\rho_n^{KL}(P, \\hat{P}; K_n) := E_{X^n \\sim P}[D_{KL}(P_{X^n}, \\hat{P}_{X^n}) 1_{X^n \\in K_n}]. \\]\nHence,\n\\[ \\rho_n^{KL}(\\mathcal{P}) \\geq \\min_{\\hat{P}} E_{P \\sim U(\\mathcal{P}_S)}[\\rho_n^{KL}(P, \\hat{P}; K_n)]. \\]\nThe key to applying the above arguments is to find a proper pair $(\\mathcal{P}_S, K_n)$.\n\nWithout loss of generality, assume that $k$ is even. Let $a := \\frac{1}{n}$ and $b := 1 - \\frac{k-2}{n}$, and define\n\\[ M_n(p_2, p_4, \\ldots, p_k) := \\begin{bmatrix} b - a & a & a & a & \\cdots & a & a \\\\\\\\ p_2 & b - p_2 & a & a & \\cdots & a & a \\\\\\\\ a & a & b - a & a & \\cdots & a & a \\\\\\\\ a & a & p_4 & b - p_4 & \\cdots & a & a \\\\\\\\ \\vdots & \\vdots & \\vdots & \\vdots & \\ddots & \\vdots & \\vdots \\\\\\\\ a & a & a & a & \\cdots & b - a & a \\\\\\\\ a & a & a & a & \\cdots & p_k & b - p_k \\end{bmatrix}. \\]\nIn addition, let\n\\[ V_n := \\left\\{ \\frac{1}{\\log^t n} : t \\in \\mathbb{N} \\text{ and } 1 \\leq t \\leq \\frac{\\log n}{2 \\log\\log n} \\right\\}, \\]\nand let $u_k$ denote the uniform distribution over $[k]$. Finally, given $n$, define\n\\[ \\mathcal{P}_S = \\{(M) \\in \\mathbb{M}^k : \\mu = u_k \\text{ and } M = M_n(p_2, p_4, \\ldots, p_k), \\text{ where } p_i \\in V_n, \\forall i \\in [k]_e\\}. \\]\nNext, let $K_n$ be the collection of sequences $x^n \\in [k]^n$ whose last appearing state did not transition to any other state. For example, 3132, or 31322, but not 21323. 
In other words, for any state $i \\in [k]$, let $\\bar{i}$ represent an arbitrary state in $[k] \\setminus \\{i\\}$. Then\n\\[ K_n = \\{x^n \\in [k]^n : x^n = \\bar{i}^{n-\\ell} i^{\\ell} : i \\in [k], n - 1 \\geq \\ell \\geq 1\\}. \\]\n\n4 Minimax prediction: upper bound\n\nFor the $K_n$ defined in the last section,\n\\[ \\rho_n^{KL}(P, \\hat{P}; K_n) = \\sum_{x^n \\in K_n} P(x^n) D_{KL}(P_{x^n}, \\hat{P}_{x^n}). \\]\nWe denote the partial minimax prediction risk over $K_n$ by\n\\[ \\rho_n^{KL}(\\mathcal{P}; K_n) := \\min_{\\hat{P}} \\max_{P \\in \\mathcal{P}} \\rho_n^{KL}(P, \\hat{P}; K_n). \\]\nLet $\\bar{K}_n := [k]^n \\setminus K_n$. Define $\\rho_n^{KL}(P, \\hat{P}; \\bar{K}_n)$ and $\\rho_n^{KL}(\\mathcal{P}; \\bar{K}_n)$ in the same manner. As a consequence of $\\hat{P}$ being a function from $[k]^n$ to $\\Delta_k$, we have the following triangle inequality,\n\\[ \\rho_n^{KL}(\\mathcal{P}) \\leq \\rho_n^{KL}(\\mathcal{P}; K_n) + \\rho_n^{KL}(\\mathcal{P}; \\bar{K}_n). \\]\nTurning back to Markov chains, let $\\hat{P}^{+\\frac{1}{2}}$ denote the estimator that maps $X^n \\sim (M)$ to $\\hat{M}^{+\\frac{1}{2}}(X_n, \\cdot)$. One can show that\n\\[ \\rho_n^{KL}(\\mathbb{M}^k; \\bar{K}_n) \\leq \\max_{P \\in \\mathbb{M}^k} \\rho_n^{KL}(P, \\hat{P}^{+\\frac{1}{2}}; \\bar{K}_n) \\leq O_k\\left(\\frac{1}{n}\\right). \\]\nRecall the following lower bound\n\\[ \\rho_n^{KL}(\\mathbb{M}^k) = \\Omega_k\\left(\\frac{\\log\\log n}{n}\\right). \\]\nThis together with the above upper bound on $\\rho_n^{KL}(\\mathbb{M}^k; \\bar{K}_n)$ and the triangle inequality shows that an upper bound on $\\rho_n^{KL}(\\mathbb{M}^k; K_n)$ also suffices to bound the leading term of $\\rho_n^{KL}(\\mathbb{M}^k)$. The following construction yields such an upper bound.\n\nWe partition $K_n$ according to the last appearing state and the number of times it transitions to itself,\n\\[ K_n = \\bigcup_{i \\in [k]} \\bigcup_{\\ell=1}^{n-1} K_\\ell(i), \\text{ where } K_\\ell(i) := \\{x^n \\in [k]^n : x^n = \\bar{i}^{n-\\ell} i^{\\ell}\\}. \\]\nFor any $x^n \\in K_n$, there is a unique $K_\\ell(i)$ such that $x^n \\in K_\\ell(i)$. 
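To make the definition of K_n concrete, here is a small Python sketch that checks whether a sequence belongs to K_n and, if so, returns the last state i together with the length l of its final run. The helper name kn_index is hypothetical and the sequence is assumed to be a non-empty list of integer states.

```python
def kn_index(xs):
    # Return (i, l) if xs consists of n - l states different from i
    # followed by l copies of i, with 1 <= l <= n - 1, so that xs lies
    # in K_l(i). Return None if xs is not in K_n.
    n = len(xs)
    i = xs[-1]
    l = 0
    while l < n and xs[n - 1 - l] == i:
        l += 1
    if l == n or i in xs[: n - l]:
        return None
    return i, l
```

On the examples from the text, 3132 and 31322 are accepted with final runs of length 1 and 2 of state 2, while 21323 is rejected because state 3 already appeared earlier and moved to another state.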
Consider the following estimator\n\\[ \\hat{P}_{x^n}(i) := \\begin{cases} 1 - \\frac{1}{\\ell \\log n}, & \\ell \\leq \\frac{n}{2}, \\\\\\\\ 1 - \\frac{1}{\\ell}, & \\ell > \\frac{n}{2}, \\end{cases} \\]\nand\n\\[ \\hat{P}_{x^n}(j) := \\frac{1 - \\hat{P}_{x^n}(i)}{k - 1}, \\quad \\forall j \\in [k] \\setminus \\{i\\}. \\]\nWe can show that\n\\[ \\rho_n^{KL}(\\mathbb{M}^k; K_n) \\leq \\max_{P \\in \\mathbb{M}^k} \\rho_n^{KL}(P, \\hat{P}; K_n) \\lesssim \\frac{2k^2 \\log\\log n}{n}. \\]\nThe upper-bound proof applies the following lemma that uniformly bounds the hitting probability of any k-state Markov chain.\n\nLemma 1 [21] For any Markov chain over $[k]$ and any two states $i, j \\in [k]$, if $n > k$, then\n\\[ \\Pr_i(\\tau(j) = n) \\leq \\frac{k}{n}. \\]\n\n5 Minimax estimation: lower bound\n\nAnalogous to Section 3, we use the following standard argument to lower bound the minimax risk\n\\[ \\varepsilon_n^L(\\mathbb{M}) = \\min_{\\hat{M}} \\max_{(M) \\in \\mathbb{M}} \\varepsilon_n^L(M, \\hat{M}) \\geq \\min_{\\hat{M}} E_{(M) \\sim U(\\mathbb{M}_S)}[\\varepsilon_n^L(M, \\hat{M})], \\]\nwhere $\\mathbb{M}_S \\subset \\mathbb{M}$ and $U(\\mathbb{M}_S)$ is the uniform distribution over $\\mathbb{M}_S$. Setting $\\mathbb{M} = \\mathbb{M}^k_{\\delta,\\pi^*}$, we outline the construction of $\\mathbb{M}_S$ as follows.\n\nLet $u_{k-1}$ be the uniform distribution over $[k - 1]$. As in [13], denote the $L_\\infty$ ball of radius $r$ around $u_{k-1}$ by\n\\[ B_{k-1}(r) := \\{p \\in \\Delta_{k-1} : L_\\infty(p, u_{k-1}) < r\\}, \\]\nwhere $L_\\infty(\\cdot, \\cdot)$ is the $L_\\infty$ distance between two distributions. Define\n\\[ p' := (p_1, p_2, \\ldots, p_{k-1}), \\]\n\\[ p^* := \\left( \\frac{\\bar{\\pi}^*}{k - 1}, \\frac{\\bar{\\pi}^*}{k - 1}, \\ldots, \\frac{\\bar{\\pi}^*}{k - 1}, \\pi^* \\right), \\]\nand\n\\[ M_n(p') := \\begin{bmatrix} \\frac{\\bar{\\pi}^*}{k-1} & \\frac{\\bar{\\pi}^*}{k-1} & \\cdots & \\frac{\\bar{\\pi}^*}{k-1} & \\pi^* \\\\\\\\ \\frac{\\bar{\\pi}^*}{k-1} & \\frac{\\bar{\\pi}^*}{k-1} & \\cdots & \\frac{\\bar{\\pi}^*}{k-1} & \\pi^* \\\\\\\\ \\vdots & \\vdots & \\ddots & \\vdots & \\vdots \\\\\\\\ \\frac{\\bar{\\pi}^*}{k-1} & \\frac{\\bar{\\pi}^*}{k-1} & \\cdots & \\frac{\\bar{\\pi}^*}{k-1} & \\pi^* \\\\\\\\ \\bar{\\pi}^* p_1 & \\bar{\\pi}^* p_2 & \\cdots & \\bar{\\pi}^* p_{k-1} & \\pi^* \\end{bmatrix}, \\]\nwhere $\\bar{\\pi}^* = 1 - \\pi^*$ and $\\sum_{i=1}^{k-1} p_i = 1$.\n\nGiven $n$ and $\\epsilon \\in (0, 1)$, let $n' := (n(1 + \\epsilon)\\pi^*)^{1/5}$. We set\n\\[ \\mathbb{M}_S = \\{(M) \\in \\mathbb{M}^k_{\\delta,\\pi^*} : \\mu = p^* \\text{ and } M = M_n(p'), \\text{ where } p' \\in B_{k-1}(1/n')\\}. \\]\nNote that the uniform distribution over $\\mathbb{M}_S$, $U(\\mathbb{M}_S)$, is induced by $U(B_{k-1}(1/n'))$, the uniform distribution over $B_{k-1}(1/n')$, and is thus well-defined.\n\nAn important property of the above construction is that for a sample sequence $X^n \\sim (M) \\in \\mathbb{M}_S$, $N_k$, the number of times that state $k$ appears in $X^n$, is a binomial random variable with parameters $n$ and $\\pi^*$. Therefore, by the following lemma, $N_k$ is highly concentrated around its mean $n\\pi^*$.\n\nLemma 2 [22] Let $Y$ be a binomial random variable with parameters $m \\in \\mathbb{N}$ and $p \\in [0, 1]$. Then for any $\\epsilon \\in (0, 1)$,\n\\[ \\Pr(Y \\geq (1 + \\epsilon)mp) \\leq \\exp\\left(-\\epsilon^2 mp/3\\right). \\]\nIn order to prove the lower bound on $\\tilde{\\varepsilon}_n^f(\\mathbb{M}^k_{\\delta,\\pi^*})$, we only need to modify the above construction as follows. Instead of drawing the last row of the transition matrix $M_n(p')$ uniformly from the distribution induced by $U(B_{k-1}(1/n'))$, we draw all rows independently in the same fashion. The proof is omitted due to similarity.\n\n6 Minimax estimation: upper bound\n\nThe proof of the upper bound relies on a concentration inequality for Markov chains in $\\mathbb{M}^k_\\delta$, which can be informally expressed as\n\\[ \\Pr(|N_i - (n - 1)\\pi(i)| > t) \\leq \\Theta_\\delta(\\exp(\\Theta_\\delta(-t^2/n))). \\]\nNote that this inequality is very similar to Hoeffding\u2019s inequality for i.i.d. processes.\n\nThe difficulty in analyzing the performance of the original add-$\\beta$ estimator is that the chain\u2019s initial distribution could be far away from its stationary distribution and finding a simple expression for $E[N_i]$ and $E[N_{ij}]$ could be hard. To overcome this difficulty, we ignore the first few sample points and construct a new add-$\\beta$ estimator based on the remaining sample points. Specifically, let $X^n$ be a length-$n$ sample sequence drawn from the Markov chain $(M)$. Removing the first $m$ sample points, $X^n_{m+1} := X_{m+1}, \\ldots, X_n$ can be viewed as a length-$(n - m)$ sample sequence drawn from $(M)$ whose initial distribution $\\mu'$ satisfies\n\\[ L_1(\\mu', \\pi) < 2(1 - \\delta)^{m-1}. \\]\nLet $m = \\sqrt{n}$. For sufficiently large $n$, $L_1(\\mu', \\pi) \\ll 1/n^2$ and $\\sqrt{n} \\ll n$. Hence without loss of generality, we assume that the original initial distribution $\\mu$ already satisfies $L_1(\\mu, \\pi) < 1/n^2$. 
If not, we can simply replace $X^n$ by $X^n_{\\sqrt{n}+1}$.\n\nTo prove the desired upper bound for ordinary f-divergences, it suffices to use the add-$\\beta$ estimator\n\\[ \\hat{M}^{+\\beta}_{X^n}(i, j) := \\frac{N_{ij} + \\beta}{N_i + k\\beta}, \\quad \\forall i, j \\in [k]. \\]\nFor the $L_2$-distance, instead of an add-constant estimator, we apply an add-$\\sqrt{N_i}/k$ estimator\n\\[ \\hat{M}^{+\\sqrt{N_i}/k}_{X^n}(i, j) := \\frac{N_{ij} + \\sqrt{N_i}/k}{N_i + \\sqrt{N_i}}, \\quad \\forall i, j \\in [k]. \\]\n\n7 Experiments\n\nWe augment the theory with experiments that demonstrate the efficacy of our proposed estimators and validate the functional form of the derived bounds.\n\nWe briefly describe the experimental setup. For the first three figures, $k = 6$, $\\delta = 0.05$, and $10,000 \\leq n \\leq 100,000$. For the last figure, $\\delta = 0.01$, $n = 100,000$, and $4 \\leq k \\leq 36$. In all the experiments, the initial distribution $\\mu$ of the Markov chain is drawn from the k-Dirichlet(1) distribution. For the transition matrix $M$, we first construct a transition matrix $M'$ where each row is drawn independently from the k-Dirichlet(1) distribution. To ensure that each element of $M$ is at least $\\delta$, let $J_k$ represent the $k \\times k$ all-ones matrix, and set $M = M'(1 - k\\delta) + \\delta J_k$. We generate a new Markov chain for each curve in the plots, and each data point on the curve shows the average loss of 100 independent restarts of the same Markov chain.\n\nThe plots use the following abbreviations: Theo for theoretical minimax-risk values; Real for experimental results using the estimators described in Sections 4 and 6; Pre for average prediction loss and Est for average estimation loss; Const for add-constant estimator; Prop for proposed add-$\\sqrt{N_i}/k$ estimator described in Section 6; Hell, Chi, and Alpha(c) for Hellinger divergence, Chi-squared divergence, and Alpha-divergence with parameter c. 
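The experimental setup and the add-sqrt(N_i)/k estimator described above can be sketched in Python roughly as follows. All function names are hypothetical, states are assumed to be numbered 0..k-1, Dirichlet(1) samples are generated by normalizing exponential variables, and the uniform fallback for a state that is never left (N_i = 0) is an assumption of this sketch rather than something the text specifies.

```python
import math
import random

def dirichlet_ones(k, rng):
    # Dirichlet(1, ..., 1) sample: normalized i.i.d. exponential variables.
    g = [rng.expovariate(1.0) for _ in range(k)]
    s = sum(g)
    return [x / s for x in g]

def random_chain(k, delta, rng):
    # Rows of M' are Dirichlet(1); M = M'(1 - k*delta) + delta*J_k.
    return [[(1 - k * delta) * p + delta for p in dirichlet_ones(k, rng)]
            for _ in range(k)]

def sample_path(M, mu, n, rng):
    # Draw X_1 from mu, then each next state from the current row of M.
    k = len(M)
    x = rng.choices(range(k), weights=mu)[0]
    path = [x]
    for _ in range(n - 1):
        x = rng.choices(range(k), weights=M[x])[0]
        path.append(x)
    return path

def add_sqrt_estimator(path, k):
    # Entry (i, j) is (N_ij + sqrt(N_i)/k) / (N_i + sqrt(N_i)).
    # Assumption of this sketch: a uniform row is used when N_i = 0.
    N = [[0] * k for _ in range(k)]
    for a, b in zip(path, path[1:]):
        N[a][b] += 1
    est = []
    for i in range(k):
        Ni = sum(N[i])
        if Ni == 0:
            est.append([1.0 / k] * k)
            continue
        r = math.sqrt(Ni)
        est.append([(N[i][j] + r / k) / (Ni + r) for j in range(k)])
    return est
```

A typical use would draw a chain with random_chain(6, 0.05, rng), sample a path of length n, and compare each estimated row with the true row under the chosen loss, averaging over independent restarts as in the setup above.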
In all three graphs, the theoretical min-max curves are precisely the upper bounds in the corresponding theorems, except that in the prediction curve in Figure 1a the constant factor 2 in the upper bound is adjusted to 1/2 to better fit the experiments. Note the excellent fit between the theoretical bounds and experimental results.\n\nFigure 1a shows the decay of the experimental and theoretical KL-prediction and KL-estimation losses with the sample size n. Figure 1b compares the $L_2$-estimation losses of our proposed estimator and the add-one estimator, and the theoretical minimax values. Figure 1c compares the experimental estimation losses and the theoretical minimax-risk values for different loss measures. Finally, Figure 1d presents an experiment on KL-learning losses that scales k up while n is fixed. All four plots demonstrate that our theoretical results are accurate and can be used to estimate the loss incurred in learning Markov chains. Additionally, Figure 1b shows that our proposed add-$\\sqrt{N_i}/k$ estimator is uniformly better than the traditional add-one estimator for different values of sample size n. We have also considered add-constant estimators with different constants varying from 2 to 10 and our proposed estimator outperformed all of them.\n\n8 Conclusions\n\nWe studied the problem of learning an unknown k-state Markov chain from its n sequential sample points. We considered two formulations: prediction and estimation. For prediction, we determined the minimax risk up to a multiplicative factor of k. For estimation, when the transition probabilities are bounded away from zero, we obtained nearly matching lower and upper bounds on the minimax risk for $L_2$ and ordinary f-divergences. The effectiveness of our proposed estimators was verified through experimental simulations. 
Future directions include closing the gap in the prediction problem in Section 1, extending the results on the min-max estimation problem to other classes of Markov chains, and extending the work from the classical setting $k \\ll n$ to general k and n.\n\nFigure 1: Experiments. (a) KL-prediction and estimation losses. (b) $L_2$-estimation losses for different estimators. (c) Hellinger, Chi-squared, and Alpha- estimation losses. (d) Fixed n and varying k.\n\nReferences\n\n[1] Moein Falahatgar, Mesrob I. Ohannessian, and Alon Orlitsky. Near-optimal smoothing of structured conditional probability matrices? In Advances in Neural Information Processing Systems (NIPS), pages 4860\u20134868, 2016.\n\n[2] Edgar Gilbert. Codes based on inaccurate source probabilities. IEEE Transactions on Information Theory, 17(3):304\u2013314, 1971.\n\n[3] Thomas Cover. Admissibility properties of Gilbert\u2019s encoding for unknown source probabilities (corresp.). IEEE Transactions on Information Theory, 18(1):216\u2013217, 1972.\n\n[4] Raphail Krichevsky and Victor Trofimov. The performance of universal encoding. IEEE Transactions on Information Theory, 27(2):199\u2013207, 1981.\n\n[5] Dietrich Braess, J\u00fcrgen Forster, Tomas Sauer, and Hans U. Simon. How to achieve minimax expected Kullback-Leibler distance from an unknown finite distribution. In International Conference on Algorithmic Learning Theory, pages 380\u2013394. Springer, 2002.\n\n[6] Liam Paninski. Variational minimax estimation of discrete distributions under KL loss. In Advances in Neural Information Processing Systems, pages 1033\u20131040, 2004.\n\n[7] Dietrich Braess and Thomas Sauer. Bernstein polynomials and learning theory. Journal of Approximation Theory, 128(2):187\u2013206, 2004.\n\n[8] I. Csisz\u00e1r. Information type measures of differences of probability distribution and indirect observations. Studia Math. Hungarica, 2:299\u2013318, 1967.\n\n[9] Frank Nielsen and Richard Nock. On the chi square and higher-order chi distances for approximating f-divergences. IEEE Signal Processing Letters, 21(1):10\u201313, 2014.\n\n[10] Solomon Kullback and Richard A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79\u201386, 1951.\n\n[11] Mikhail S. Nikulin. Hellinger distance. Encyclopedia of Mathematics, 151, 2001.\n\n[12] Gavin E. Crooks. On measures of entropy and information. Tech. Note 9 (2017): v4, 2017.\n\n[13] Sudeep Kamath, Alon Orlitsky, Dheeraj Pichapati, and Ananda Theertha Suresh. On learning distributions from their samples. In Annual Conference on Learning Theory (COLT), pages 1066\u20131100, 2015.\n\n[14] Alon Orlitsky and Ananda Theertha Suresh. Competitive distribution estimation: Why is Good-Turing good. In Advances in Neural Information Processing Systems, pages 2143\u20132151, 2015.\n\n[15] Gregory Valiant and Paul Valiant. Instance optimal learning of discrete distributions. In 48th Annual ACM Symposium on Theory of Computing, pages 142\u2013155, 2016.\n\n[16] Moein Falahatgar, Alon Orlitsky, Venkatadheeraj Pichapati, and Ananda Theertha Suresh. Learning Markov distributions: Does estimation trump compression? In IEEE International Symposium on Information Theory (ISIT), pages 2689\u20132693, 2016.\n\n[17] Geoffrey Wolfer and Aryeh Kontorovich. Minimax learning of ergodic Markov chains. arXiv:1809.05014, 2018.\n\n[18] Kai Lai Chung and Farid AitSahlia. Elementary probability theory: With stochastic processes and an introduction to mathematical finance. Springer Science & Business Media, 2012.\n\n[19] Christopher M. Bishop and Tom M. Mitchell. Pattern recognition and machine learning. Springer, 2006.\n\n[20] David A. Levin and Yuval Peres. Markov chains and mixing times. American Mathematical Soc., 2017.\n\n[21] James Norris, Yuval Peres, and Alex Zhai. Surprise probabilities in Markov chains. Combinatorics, Probability and Computing, 26(4):603\u2013627, 2017.\n\n[22] Fan RK Chung and Linyuan Lu. Complex graphs and networks. American Mathematical Soc., 2006.", "award": [], "sourceid": 377, "authors": [{"given_name": "Yi", "family_name": "Hao", "institution": "University of California, San Diego"}, {"given_name": "Alon", "family_name": "Orlitsky", "institution": "University of California, San Diego"}, {"given_name": "Venkatadheeraj", "family_name": "Pichapati", "institution": "UC San Diego"}]}