{"title": "Maximum Likelihood Learning With Arbitrary Treewidth via Fast-Mixing Parameter Sets", "book": "Advances in Neural Information Processing Systems", "page_first": 874, "page_last": 882, "abstract": "Inference is typically intractable in high-treewidth undirected graphical models, making maximum likelihood learning a challenge. One way to overcome this is to restrict parameters to a tractable set, most typically the set of tree-structured parameters. This paper explores an alternative notion of a tractable set, namely a set of “fast-mixing parameters” where Markov chain Monte Carlo (MCMC) inference can be guaranteed to quickly converge to the stationary distribution. While it is common in practice to approximate the likelihood gradient using samples obtained from MCMC, such procedures lack theoretical guarantees. This paper proves that for any exponential family with bounded sufficient statistics (not just graphical models), when parameters are constrained to a fast-mixing set, gradient descent with gradients approximated by sampling will approximate the maximum likelihood solution inside the set with high probability. When unregularized, to find a solution epsilon-accurate in log-likelihood requires a total amount of effort cubic in 1/epsilon, disregarding logarithmic factors. When ridge-regularized, strong convexity allows a solution epsilon-accurate in parameter distance with an effort quadratic in 1/epsilon. Both of these provide a fully-polynomial-time randomized approximation scheme.", "full_text": "Maximum Likelihood Learning With Arbitrary Treewidth via Fast-Mixing Parameter Sets\n\nJustin Domke\nNICTA, Australian National University\njustin.domke@nicta.com.au\n\nAbstract\n\nInference is typically intractable in high-treewidth undirected graphical models, making maximum likelihood learning a challenge. One way to overcome this is to restrict parameters to a tractable set, most typically the set of tree-structured parameters. 
This paper explores an alternative notion of a tractable set, namely a set of “fast-mixing parameters” where Markov chain Monte Carlo (MCMC) inference can be guaranteed to quickly converge to the stationary distribution. While it is common in practice to approximate the likelihood gradient using samples obtained from MCMC, such procedures lack theoretical guarantees. This paper proves that for any exponential family with bounded sufficient statistics (not just graphical models), when parameters are constrained to a fast-mixing set, gradient descent with gradients approximated by sampling will approximate the maximum likelihood solution inside the set with high probability. When unregularized, to find a solution ϵ-accurate in log-likelihood requires a total amount of effort cubic in 1/ϵ, disregarding logarithmic factors. When ridge-regularized, strong convexity allows a solution ϵ-accurate in parameter distance with effort quadratic in 1/ϵ. Both of these provide a fully-polynomial-time randomized approximation scheme.\n\n1 Introduction\n\nIn undirected graphical models, maximum likelihood learning is intractable in general. For example, Jerrum and Sinclair [1993] show that evaluation of the partition function (which can easily be computed from the likelihood) for an Ising model is #P-complete, and that even the existence of a fully-polynomial time randomized approximation scheme (FPRAS) for the partition function would imply that RP = NP.\n\nIf the model is well-specified (meaning that the target distribution falls in the assumed family) then there exist several methods that can efficiently recover correct parameters, among them the pseudolikelihood [3], score matching [16, 22], composite likelihoods [20, 30], Mizrahi et al.’s [2014] method based on parallel learning in local clusters of nodes, and Abbeel et al.’s [2006] method based on matching local probabilities. 
While often useful, these methods have some drawbacks. First, these methods typically have inferior sample complexity to the likelihood. Second, these all assume a well-specified model. If the target distribution is not in the assumed class, the maximum-likelihood solution will converge to the M-projection (minimum of the KL-divergence), but these estimators do not have similar guarantees. Third, even when these methods succeed, they typically yield a distribution in which inference is still intractable, and so it may be infeasible to actually make use of the learned distribution.\n\nGiven these issues, a natural approach is to restrict the graphical model parameters to a tractable set Θ, in which learning and inference can be performed efficiently. The gradient of the likelihood is determined by the marginal distributions, whose difficulty is typically determined by the treewidth of the graph. Thus, probably the most natural tractable family is the set of tree-structured distributions, where Θ = {θ : ∃ tree T, ∀(i, j) ∉ T, θij = 0}. The Chow-Liu algorithm [1968] provides an efficient method for finding the maximum likelihood parameter vector θ in this set, by computing the mutual information of all empirical pairwise marginals, and finding the maximum spanning tree. Similarly, Heinemann and Globerson [2014] give a method to efficiently learn high-girth models where correlation decay limits the error of approximate inference, though this will not converge to the M-projection when the model is mis-specified.\n\nThis paper considers a fundamentally different notion of tractability, namely a guarantee that Markov chain Monte Carlo (MCMC) sampling will quickly converge to the stationary distribution. Our fundamental result is that if Θ is such a set, and one can project onto Θ, then there exists a FPRAS for the maximum likelihood solution inside Θ. 
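The Chow-Liu construction mentioned above (empirical pairwise mutual information followed by a maximum-weight spanning tree) is simple enough to sketch. This is a minimal illustration under our own naming, not code from the paper:

```python
import numpy as np
from itertools import combinations

def mutual_information(xi, xj):
    """Empirical mutual information between two discrete data columns."""
    mi = 0.0
    for a in np.unique(xi):
        for b in np.unique(xj):
            pab = np.mean((xi == a) & (xj == b))
            pa, pb = np.mean(xi == a), np.mean(xj == b)
            if pab > 0:
                mi += pab * np.log(pab / (pa * pb))
    return mi

def chow_liu_tree(X):
    """Maximum-weight spanning tree on pairwise mutual information (Kruskal)."""
    n_vars = X.shape[1]
    weights = sorted(((mutual_information(X[:, i], X[:, j]), i, j)
                      for i, j in combinations(range(n_vars), 2)), reverse=True)
    parent = list(range(n_vars))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i
    tree = []
    for _, i, j in weights:
        ri, rj = find(i), find(j)
        if ri != rj:              # adding the edge keeps the graph acyclic
            parent[ri] = rj
            tree.append((i, j))
    return tree
```

On data where variables 0 and 1 are copies of each other, and likewise 2 and 3, the two strongly dependent pairs are forced into the tree.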
While inspired by graphical models, this result works entirely in the exponential family framework, and applies generally to any exponential family with bounded sufficient statistics.\n\nThe existence of a FPRAS is established by analyzing a common existing strategy for maximum likelihood learning of exponential families, namely gradient descent where MCMC is used to generate samples and approximate the gradient. It is natural to conjecture that, if the Markov chain is fast mixing, is run long enough, and enough gradient descent iterations are used, this will converge to nearly the optimum of the likelihood inside Θ, with high probability. This paper shows that this is indeed the case. A separate analysis is used for the ridge-regularized case (using strong convexity) and the unregularized case (which is merely convex).\n\n2 Setup\n\nThough notation is introduced when first used, the most important symbols are given here for ease of reference.\n\n• θ - parameter vector to be learned\n• Mθ - Markov chain operator corresponding to θ\n• θk - estimated parameter vector at k-th gradient descent iteration\n• qk = M^v_{θk−1} r - approximate distribution sampled from at iteration k (v iterations of the Markov chain corresponding to θk−1, from arbitrary starting distribution r)\n• Θ - constraint set for θ\n• f - negative log-likelihood on training data\n• L - Lipschitz constant for the gradient of f\n• θ* = arg min_{θ∈Θ} f(θ) - minimizer of likelihood inside of Θ\n• K - total number of gradient descent steps\n• M - total number of samples drawn via MCMC\n• N - length of vector x\n• v - number of Markov chain transitions applied for each sample\n• C, α - parameters determining the mixing rate of the Markov chain. 
(Equation 2)\n• Ra - sufficient statistics norm bound\n• ϵf - desired optimization accuracy for f\n• ϵθ - desired optimization accuracy for θ\n• δ - permitted probability of failure to achieve a given approximation accuracy\n\nThis paper is concerned with an exponential family of the form\n\np_θ(x) = exp(θ · t(x) − A(θ)),\n\nwhere t(x) is a vector of sufficient statistics, and the log-partition function A(θ) ensures normalization. An undirected model can be seen as an exponential family where t consists of indicator functions for each possible configuration of each clique [32]. While such graphical models motivate this work, the results are most naturally stated in terms of an exponential family and apply more generally.\n\nAlgorithm 1:\n• Initialize θ0 = 0.\n• For k = 1, 2, ..., K\n – Draw samples. For i = 1, ..., M, sample x^{k−1}_i ∼ q_{k−1} := M^v_{θk−1} r.\n – Estimate the gradient as f′(θk−1) + ek ← (1/M) Σ_{i=1}^M t(x^{k−1}_i) − t̄ + λθk−1.\n – Update the parameter vector as θk ← Π_Θ[θk−1 − (1/L)(f′(θk−1) + ek)].\n• Output θK or (1/K) Σ_{k=1}^K θk.\n\nFigure 1: Left: Algorithm 1, approximate gradient descent with gradients approximated via MCMC, analyzed in this paper. Right: A cartoon of the desired performance, stochastically finding a solution near θ*, the minimum of the regularized negative log-likelihood f(θ) in the set Θ.\n\nWe are interested in performing maximum-likelihood learning, i.e. minimizing, for a dataset z1, ..., zD,\n\nf(θ) = −(1/D) Σ_{i=1}^D log p_θ(zi) + (λ/2)‖θ‖₂² = A(θ) − θ · t̄ + (λ/2)‖θ‖₂², (1)\n\nwhere we define t̄ = (1/D) Σ_{i=1}^D t(zi). It is easy to see that the gradient of f takes the form\n\nf′(θ) = E_{p_θ}[t(X)] − t̄ + λθ.\n\nIf one would like to optimize f using a gradient-based method, computing the expectation of t(X) with respect to p_θ can present a computational challenge. With discrete graphical models, the expected value of t is determined by the marginal distributions of each factor in the graph. Typically, the computational difficulty of computing these marginal distributions is determined by the treewidth of the graph: if the graph is a tree (or close to a tree), the marginals can be computed by the junction-tree algorithm [18]. One option, with high treewidth, is to approximate the marginals with a variational method. This can be seen as exactly optimizing a “surrogate likelihood” approximation of Eq. 1 [31].\n\nAnother common approach is to use Markov chain Monte Carlo (MCMC) to compute a sample {xi}_{i=1}^M from a distribution close to p_θ, and then approximate E_{p_θ}[t(X)] by (1/M) Σ_{i=1}^M t(xi). This strategy is widely used, varying in the model type, the sampling algorithm, how samples are initialized, the details of optimization, and so on [10, 25, 27, 24, 7, 33, 11, 2, 29, 5]. Recently, Steinhardt and Liang [28] proposed learning in terms of the stationary distribution obtained from a chain with a nonzero restart probability, which is fast-mixing by design.\n\nWhile popular, such strategies generally lack theoretical guarantees. If one were able to exactly sample from p_θ, this could be understood simply as stochastic gradient descent. 
But, with MCMC, one can only sample from a distribution approximating p_θ, meaning the gradient estimate is not only noisy, but also biased. In general, one can ask how the step size, number of iterations, number of samples, and number of Markov chain transitions should be set to achieve a given convergence level.\n\nThe gradient descent strategy analyzed in this paper, in which one updates a parameter vector θk using approximate gradients, is outlined and shown as a cartoon in Figure 1. Here, and in the rest of the paper, we use pk as a shorthand for p_{θk}, and we let ek denote the difference between the estimated gradient and the true gradient f′(θk−1). The projection operator is defined by Π_Θ[φ] = arg min_{θ∈Θ} ‖θ − φ‖₂.\n\nWe assume that the parameter vector θ is constrained to a set Θ such that MCMC is guaranteed to mix at a certain rate (Section 3.1). With convexity, this assumption can bound the mean and variance of the errors at each iteration, leading to a bound on the sum of errors. With strong convexity, the error of the gradient at each iteration is bounded with high probability. Then, using results due to [26] for projected gradient descent with errors in the gradient, we show a schedule for the number of iterations K, the number of samples M, and the number of Markov transitions v such that with high probability,\n\nf((1/K) Σ_{k=1}^K θk) − f(θ*) ≤ ϵf or ‖θK − θ*‖₂ ≤ ϵθ,\n\nfor the convex or strongly convex cases, respectively, where θ* ∈ arg min_{θ∈Θ} f(θ). 
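To make Algorithm 1 concrete, here is a small self-contained sketch for a 3-node Ising chain, with single-site Gibbs sampling standing in for M_θ and the box projection |θij| ≤ β standing in for Π_Θ. The edge set, constants, and helper names are illustrative choices of ours, not prescribed by the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
EDGES = [(0, 1), (1, 2)]   # a 3-node chain; t(x) = (x0*x1, x1*x2)
BETA = 0.2                 # fast-mixing box constraint |theta_ij| <= BETA

def t(x):
    """Sufficient statistics: products of endpoint spins on each edge."""
    return np.array([x[i] * x[j] for i, j in EDGES], dtype=float)

def gibbs_sweep(x, theta):
    """One sweep of single-site Gibbs sampling for the Ising chain."""
    for s in range(len(x)):
        field = sum(th * x[j if i == s else i]
                    for th, (i, j) in zip(theta, EDGES) if s in (i, j))
        p_plus = 1.0 / (1.0 + np.exp(-2.0 * field))  # P(x_s = +1 | rest)
        x[s] = 1 if rng.random() < p_plus else -1
    return x

def sample_mean_statistics(theta, v, M):
    """Estimate E[t(X)]: M samples, each via v sweeps from a uniform start r."""
    ts = []
    for _ in range(M):
        x = rng.choice([-1, 1], size=3)
        for _ in range(v):
            x = gibbs_sweep(x, theta)
        ts.append(t(x))
    return np.mean(ts, axis=0)

def learn(t_bar, lam=0.1, L=5.0, K=30, M=50, v=10):
    """Algorithm 1: projected gradient descent with MCMC gradient estimates."""
    theta = np.zeros(len(EDGES))
    for _ in range(K):
        grad = sample_mean_statistics(theta, v, M) - t_bar + lam * theta
        theta = np.clip(theta - grad / L, -BETA, BETA)  # projection onto Theta
    return theta
```

With empirical statistics t̄ = (0.3, −0.3), the iterates are pushed toward matching correlations of those signs and stay inside the box, landing near its boundary.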
The total number of Markov transitions applied through the entire algorithm, KMv, grows as (1/ϵf)³ log(1/ϵf) for the convex case, as (1/ϵθ²) log(1/ϵθ²) for the strongly convex case, and polynomially in all other parameters of the problem.\n\n3 Background\n\n3.1 Mixing times and Fast-Mixing Parameter Sets\n\nThis section discusses some background on mixing times for MCMC. Typically, mixing times are defined in terms of the total-variation distance ‖p − q‖TV = max_A |p(A) − q(A)|, where the maximum ranges over the sample space. For discrete distributions, this can be shown to be equivalent to ‖p − q‖TV = (1/2) Σ_x |p(x) − q(x)|.\n\nWe assume that a sampling algorithm is known, a single iteration of which can be thought of as an operator Mθ that transforms some starting distribution into another. The stationary distribution is p_θ, i.e. lim_{v→∞} M^v_θ q = p_θ for all q. Informally, a Markov chain will be fast mixing if the total variation distance between the starting distribution and the stationary distribution decays rapidly in the length of the chain. This paper assumes that a convex set Θ and constants C and α are known such that for all θ ∈ Θ and all distributions q,\n\n‖M^v_θ q − p_θ‖TV ≤ Cα^v. (2)\n\nThis means that the distance between an arbitrary starting distribution q and the stationary distribution p_θ decays geometrically in terms of the number of Markov iterations v. This assumption is justified by the Convergence Theorem [19, Theorem 4.9], which states that if M is irreducible and aperiodic with stationary distribution p, then there exist constants α ∈ (0, 1) and C > 0 such that\n\nd(v) := sup_q ‖M^v q − p‖TV ≤ Cα^v. (3)\n\nMany results on mixing times in the literature, however, are stated in a less direct form. 
Given a constant ϵ, the mixing time is defined by τ(ϵ) = min{v : d(v) ≤ ϵ}. It often happens that bounds on mixing times are stated as something like τ(ϵ) ≤ a + b ln(1/ϵ) for some constants a and b. It follows from this that ‖M^v q − p‖TV ≤ Cα^v with C = exp(a/b) and α = exp(−1/b).\n\nA simple example of a fast-mixing exponential family is the Ising model, defined for x ∈ {−1, +1}^N as\n\np(x|θ) = exp( Σ_{(i,j)∈Pairs} θij xi xj + Σ_i θi xi − A(θ) ).\n\nA simple result for this model is that, if the maximum degree of any node is Δ and |θij| ≤ β for all (i, j), then for univariate Gibbs sampling with random updates, τ(ϵ) ≤ ⌈N log(N/ϵ) / (1 − Δ tanh(β))⌉ [19]. The algorithm discussed in this paper needs the ability to project some parameter vector φ onto Θ to find arg min_{θ∈Θ} ‖θ − φ‖₂. Projecting a set of arbitrary parameters onto this set of fast-mixing parameters is trivial: simply set θij ← β for θij > β and θij ← −β for θij < −β.\n\nFor more dense graphs, it is known [12, 9] that, for a matrix norm ‖·‖ that is the spectral norm ‖·‖₂, or an induced 1- or infinity-norm,\n\nτ(ϵ) ≤ ⌈N log(N/ϵ) / (1 − ‖R(θ)‖)⌉, (4)\n\nwhere Rij(θ) = |θij|. Domke and Liu [2013] show how to perform this projection for the Ising model when ‖·‖ is the spectral norm ‖·‖₂, with a convex optimization utilizing the singular value decomposition in each iteration.\n\nLoosely speaking, the above result shows that univariate Gibbs sampling on the Ising model is fast-mixing, as long as the interaction strengths are not too strong. 
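The conversion from a mixing-time bound of the form τ(ϵ) ≤ a + b ln(1/ϵ) to the geometric form of Equation 2, and the clipping projection just described, are both one-liners. A sketch, with function names of our choosing:

```python
import math

def geometric_constants(a, b):
    """Given tau(eps) <= a + b*ln(1/eps), return (C, alpha) with d(v) <= C * alpha**v."""
    return math.exp(a / b), math.exp(-1.0 / b)

def project_box(theta, beta):
    """Project Ising edge parameters onto {theta : |theta_ij| <= beta} by clipping."""
    return {edge: max(-beta, min(beta, value)) for edge, value in theta.items()}
```

Reading the Gibbs bound above as a + b ln(1/ϵ) with a = N log(N)/(1 − Δ tanh β) and b = N/(1 − Δ tanh β) gives C = exp(a/b) = N and α = exp(−(1 − Δ tanh β)/N).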
Conversely, Jerrum and Sinclair [1993] exhibited an alternative Markov chain for the Ising model that is rapidly mixing for arbitrary interaction strengths, provided the model is ferromagnetic, i.e. that all interaction strengths are positive with θij ≥ 0, and that the field is unidirectional. This Markov chain is based on sampling in a different “subgraphs world” state-space. Nevertheless, it can be used to estimate derivatives of the Ising model log-partition function with respect to parameters, which allows estimation of the gradient of the log-likelihood. Huber [2012] provided a simulation reduction to obtain an Ising model sample from a subgraphs-world sample.\n\nMore generally, Liu and Domke [2014] consider a pairwise Markov random field, defined as\n\np(x|θ) = exp( Σ_{i,j} θij(xi, xj) + Σ_i θi(xi) − A(θ) ),\n\nand show that, if one defines Rij(θ) = max_{a,b,c} (1/2)|θij(a, b) − θij(a, c)|, then again Equation 4 holds. An algorithm for projecting onto the set Θ = {θ : ‖R(θ)‖ ≤ c} exists.\n\nThere are many other mixing-time bounds for different algorithms and different types of models [19]. The most common algorithms are univariate Gibbs sampling (often called Glauber dynamics in the mixing-time literature) and Swendsen-Wang sampling. The Ising and Potts models are the most common distributions studied, either with a grid or fully-connected graph structure. Often, the motivation for studying these systems is to understand physical systems, or to mathematically characterize phase transitions in mixing time that occur as interaction strengths vary. As such, many existing bounds assume uniform interaction strengths. 
For all these reasons, these bounds typically require some adaptation for a learning setting.\n\n4 Main Results\n\n4.1 Lipschitz Gradient\n\nFor lack of space, detailed proofs are postponed to the appendix. However, informal proof sketches are provided to give some intuition for results that have longer proofs. Our first main result is that the regularized log-likelihood has a Lipschitz gradient.\n\nTheorem 1. The regularized log-likelihood gradient is L-Lipschitz with L = 4R2² + λ, i.e.\n\n‖f′(θ) − f′(φ)‖₂ ≤ (4R2² + λ)‖θ − φ‖₂.\n\nProof sketch. It is easy to show, by the triangle inequality, that ‖f′(θ) − f′(φ)‖₂ ≤ ‖dA/dθ − dA/dφ‖₂ + λ‖θ − φ‖₂. Next, using the assumption that ‖t(x)‖₂ ≤ R2, one can bound ‖dA/dθ − dA/dφ‖₂ ≤ 2R2‖p_θ − p_φ‖TV. Finally, some effort can bound ‖p_θ − p_φ‖TV ≤ 2R2‖θ − φ‖₂.\n\n4.2 Convex convergence\n\nNow, our first major result is a guarantee on convergence that is true both in the regularized case where λ > 0 and the unregularized case where λ = 0.\n\nTheorem 2. With probability at least 1 − δ, as long as M ≥ 3K / log(1/δ), Algorithm 1 will satisfy\n\nf((1/K) Σ_{k=1}^K θk) − f(θ*) ≤ (8R2² / (KL)) (L‖θ0 − θ*‖₂/(4R2) + log(1/δ) + K/√M + KCα^v)².\n\nProof sketch. First, note that f is convex, since the Hessian of f is the covariance of t(X) when λ = 0, and λ > 0 only adds a quadratic. Now, define the quantity dk = (1/M) Σ_{m=1}^M t(X^k_m) − E_{qk}[t(X)] to be the difference between the estimated expected value of t(X) under qk and the true value. 
An elementary argument can bound the expected value of ‖dk‖, while the Efron-Stein inequality can bound its variance. Using both of these bounds in Bernstein's inequality then shows that, with probability 1 − δ, Σ_{k=1}^K ‖dk‖ ≤ 2R2(K/√M + log(1/δ)). Finally, we can observe that Σ_{k=1}^K ‖ek‖ ≤ Σ_{k=1}^K ‖dk‖ + Σ_{k=1}^K ‖E_{qk}[t(X)] − E_{p_{θk}}[t(X)]‖₂. By the assumption on mixing speed, the last term is bounded by 2KR2Cα^v. And so, with probability 1 − δ, Σ_{k=1}^K ‖ek‖ ≤ 2R2(K/√M + log(1/δ)) + 2KR2Cα^v. Finally, a result due to Schmidt et al. [26] on the convergence of gradient descent with errors in estimated gradients gives the result.\n\nIntuitively, this result has the right character. If M grows on the order of K² and v grows on the order of log K/(− log α), then all terms inside the quadratic will be held constant, and so if we set K on the order of 1/ϵ, the sub-optimality will be on the order of ϵ, with a total computational effort roughly on the order of (1/ϵ)³ log(1/ϵ). The following results pursue this more carefully. Firstly, one can observe that a minimum amount of work must be performed.\n\nTheorem 3. For a, b, c, α > 0, if K, M, v > 0 are set so that (1/K)(a + bK/√M + Kcα^v)² ≤ ϵ, then\n\nKMv ≥ (a⁴b²/ϵ³) · log(ac/ϵ) / (− log α).\n\nSince it must be true that a/√K + b√(K/M) + √K cα^v ≤ √ϵ, each of these three terms must also be at most √ϵ, giving lower bounds on K, M, and v. Multiplying these gives the result.\n\nNext, an explicit schedule for K, M, and v is possible, in terms of a convex set of parameters β1, β2, β3. Comparing this to the lower bound above shows that this is not too far from optimal.\n\nTheorem 4. Suppose that a, b, c, α > 0. 
If β1 + β2 + β3 = 1, with β1, β2, β3 > 0, then setting K = a²/(β1²ϵ), M = (ab/(β1β2ϵ))², and v = log(ac/(β1β3ϵ))/(− log α) is sufficient to guarantee that (1/K)(a + bK/√M + Kcα^v)² ≤ ϵ, with a total work of\n\nKMv = (1/(β1⁴β2²)) · (a⁴b²/ϵ³) · log(ac/(β1β3ϵ)) / (− log α).\n\nSimply verify that the ϵ bound holds, and multiply the terms together.\n\nFor example, setting β1 = 0.66, β2 = 0.33 and β3 = 0.01 gives KMv ≈ 48.4 (a⁴b²/ϵ³)(log(ac/ϵ) + 5.03)/(− log α). Finally, we can give an explicit schedule for K, M, and v, and bound the total amount of work that needs to be performed.\n\nTheorem 5. If D ≥ max{‖θ0 − θ*‖₂, (4R2/L) log(1/δ)}, then for all ϵf there is a setting of K, M, v such that f((1/K) Σ_{k=1}^K θk) − f(θ*) ≤ ϵf with probability 1 − δ and\n\nKMv ≤ (32LR2²D⁴ / (β1⁴β2²ϵf³(1 − α))) · log(4DR2C/(β1β3ϵf)).\n\n[Proof sketch] This follows from setting K, M, and v as in Theorem 4 with a = L‖θ0 − θ*‖₂/(4R2) + log(1/δ), b = 1, c = C, and ϵ = ϵf L/(8R2²).\n\n4.3 Strongly Convex Convergence\n\nThis section gives the main result for convergence that is true only in the regularized case where λ > 0. Again, the main difficulty in this proof is showing that the sum of the errors of estimated gradients at each iteration is small. This is done by using a concentration inequality to show that the error of each estimated gradient is small, and then applying a union bound to show that the sum is small. The main result is as follows.\n\nTheorem 6. 
When the regularization constant obeys λ > 0, with probability at least 1 − δ Algorithm 1 will satisfy\n\n‖θK − θ*‖₂ ≤ (1 − λ/L)^K ‖θ0 − θ*‖₂ + (L/λ)( √(R2/(2M)) (1 + √(2 log(K/δ))) + 2R2Cα^v ).\n\nProof sketch. When λ = 0, f is convex (as in Theorem 2), and so it is strongly convex when λ > 0. The basic proof technique here is to decompose the error in a particular step as ‖e_{k+1}‖₂ ≤ ‖(1/M) Σ_{i=1}^M t(x^k_i) − E_{qk}[t(X)]‖₂ + ‖E_{qk}[t(X)] − E_{p_{θk}}[t(X)]‖₂. A multidimensional variant of Hoeffding's inequality can bound the first term, with probability 1 − δ′, by R2(1 + √(2 log(1/δ′)))/√M, while our assumption on mixing speed can bound the second term by 2R2Cα^v. Applying this to all iterations using δ′ = δ/K gives that all errors are simultaneously bounded as before. This can then be used in another result due to Schmidt et al. [26] on the convergence of gradient descent with errors in estimated gradients in the strongly convex case.\n\nA similar proof strategy could be used for the convex case where, rather than directly bounding the sum of the norms of the errors of all steps using the Efron-Stein inequality and Bernstein's bound, one could simply bound the error of each step using a multidimensional Hoeffding-type inequality, and then apply this with probability δ/K to each step. This yields a slightly weaker result than that shown in Theorem 2. 
The reason for applying a uniform bound on the errors in gradients here is that Schmidt et al.'s bound [26] on the convergence of proximal gradient descent on strongly convex functions depends not just on the sum of the norms of gradient errors, but on a non-uniform weighted variant of these.\n\nAgain, we consider how to set parameters to guarantee that θK is not too far from θ* with a minimum amount of work. Firstly, we show a lower bound.\n\nTheorem 7. Suppose a, b, c > 0. Then for any K, M, v such that γ^K a + (b/√M)√(log(K/δ)) + cα^v ≤ ϵ, it must be the case that\n\nKMv ≥ (b²/ϵ²) · (log(a/ϵ) log(c/ϵ) / ((− log γ)(− log α))) · log( log(a/ϵ) / (δ(− log γ)) ).\n\n[Proof sketch] This is established by noticing that γ^K a, (b/√M)√(log(K/δ)), and cα^v must each be less than ϵ, giving lower bounds on K, M, and v.\n\nNext, we can give an explicit schedule that is not too far off from this lower bound.\n\nTheorem 8. Suppose that a, b, c, α > 0. If β1 + β2 + β3 = 1, with βi > 0, then setting K = log(a/(β1ϵ))/(− log γ), M = (b²/(ϵ²β2²))(1 + √(2 log(K/δ)))², and v = log(c/(β3ϵ))/(− log α) is sufficient to guarantee that γ^K a + (b/√M)(1 + √(2 log(K/δ))) + cα^v ≤ ϵ, with a total work of at most\n\nKMv ≤ (b²/(ϵ²β2²)) · (log(a/(β1ϵ)) log(c/(β3ϵ)) / ((− log γ)(− log α))) · (1 + √(2 log( log(a/(β1ϵ)) / (δ(− log γ)) )))².\n\nFor example, if you choose β2 = 1/√2 and β1 = β3 = (1 − 1/√2)/2 ≈ 0.1464, then this varies from the lower bound in Theorem 7 by a factor of two, and a multiplicative factor of 1/β3 ≈ 6.84 inside the logarithmic terms.\n\nCorollary 9. 
If we choose K ≥ (L/λ) log(‖θ0 − θ*‖₂/(β1ϵθ)), M ≥ (L²R2/(2λ²β2²ϵθ²))(1 + √(2 log(K/δ)))², and v ≥ (1/(1 − α)) log(2LR2C/(β3ϵθλ)), then ‖θK − θ*‖₂ ≤ ϵθ with probability at least 1 − δ, and the total amount of work is bounded by\n\nKMv ≤ (L³R2 / (2λ³β2²ϵθ²(1 − α))) · log(‖θ0 − θ*‖₂/(β1ϵθ)) · log(2LR2C/(β3ϵθλ)) · (1 + √(2 log( (L/(λδ)) log(‖θ0 − θ*‖₂/(β1ϵθ)) )))².\n\n5 Discussion\n\nAn important detail in the previous results is that the convex analysis gives convergence in terms of the regularized log-likelihood, while the strongly-convex analysis gives convergence in terms of the parameter distance. If we drop logarithmic factors, the amount of work necessary for ϵf-optimality in the log-likelihood using the convex algorithm is of the order 1/ϵf³, while the amount of work necessary for ϵθ-optimality using the strongly convex analysis is of the order 1/ϵθ². Though these quantities are not directly comparable, the standard bounds on sub-optimality for λ-strongly convex functions with L-Lipschitz gradients are that λϵθ²/2 ≤ ϵf ≤ Lϵθ²/2. Thus, roughly speaking, when regularized, the strongly-convex analysis shows that ϵf-optimality in the log-likelihood can be achieved with an amount of work only linear in 1/ϵf.\n\nFigure 2: Ising Model Example. 
Left: The difference of the current test log-likelihood from the optimal log-likelihood on 5 random runs. Center: The distance of the current estimated parameters from the optimal parameters on 5 random runs. Right: The current estimated parameters on one run, as compared to the optimal parameters (far right).\n\n6 Example\n\nWhile this paper claims no significant practical contribution, it is useful to visualize an example. Take an Ising model p(x) ∝ exp( Σ_{(i,j)∈Pairs} θij xi xj ) for xi ∈ {−1, 1} on a 4 × 4 grid, with 5 random vectors as training data. The sufficient statistics are t(x) = {xi xj | (i, j) ∈ Pairs}, and with 24 pairs, ‖t(x)‖₂ ≤ R2 = √24. For a fast-mixing set, constrain |θij| ≤ .2 for all pairs. Since the maximum degree is 4, τ(ϵ) ≤ ⌈N log(N/ϵ)/(1 − 4 tanh(.2))⌉. Fix λ = 1, ϵθ = 2 and δ = 0.1. Though the theory above suggests the Lipschitz constant L = 4R2² + λ = 97, a lower value of L = 10 is used, which converged faster in practice (with exact or approximate gradients). Now, one can derive that ‖θ0 − θ*‖₂ ≤ D = √(24 × (2 × .2)²), C = log(16) and α = exp(−(1 − 4 tanh .2)/16). Applying Corollary 9 with β1 = .01, β2 = .9 and β3 = .1 gives K = 46, M = 1533 and v = 561. Fig. 2 shows the results. In practice, the algorithm finds a solution tighter than the specified ϵθ, indicating a degree of conservatism in the theoretical bound.\n\n7 Conclusions\n\nThis section discusses some weaknesses of the above analysis, and possible directions for future work. Analyzing complexity in terms of the total sampling effort ignores the complexity of projection itself. Since projection only needs to be done K times, this time will often be very small in comparison to sampling time. (This is certainly true in the above example.) 
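The fixed constants in the example above are mechanical to compute. A sketch, with variable names of our choosing, using the quantities defined in Section 6:

```python
import math

N = 16            # 4 x 4 grid
n_pairs = 24      # number of edges in Pairs
beta = 0.2        # box constraint |theta_ij| <= 0.2
max_degree = 4

R2 = math.sqrt(n_pairs)                    # ||t(x)||_2 <= sqrt(24)
D = math.sqrt(n_pairs * (2 * beta) ** 2)   # bound on ||theta_0 - theta*||_2
d = 1 - max_degree * math.tanh(beta)       # 1 - Delta * tanh(beta), positive here
alpha = math.exp(-d / N)                   # geometric mixing rate

def tau(eps):
    """Mixing-time bound for single-site Gibbs sampling on this model."""
    return math.ceil(N * math.log(N / eps) / d)
```

Tighter total-variation accuracies cost more Gibbs transitions, and all of the constants above feed directly into the schedule of Corollary 9.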
However, this might not be the case if the projection algorithm scales super-linearly in the size of the model.

Another issue to consider is how the samples are initialized. As far as the proof of correctness goes, the initial distribution r is arbitrary. In the above example, a simple uniform distribution was used. However, one might use the empirical distribution of the training data, which is equivalent to contrastive divergence [5]. It is reasonable to think that this will tend to reduce the mixing time when pθ is close to the distribution generating the data. However, the number of Markov chain transitions v prescribed above is larger than typically used with contrastive divergence, and Algorithm 1 does not reduce the step size over time. While it is common to regularize to encourage fast mixing with contrastive divergence [14, Section 10], this is typically done with simple heuristic penalties. Further, contrastive divergence is often used with hidden variables. Still, this provides a bound on how closely a variant of contrastive divergence could approximate the maximum likelihood solution.

The above analysis does not encompass the common strategy for maximum likelihood learning where one maintains a “pool” of samples between iterations, and initializes one Markov chain at each iteration from each element of the pool. The idea is that if the samples at the previous iteration were close to pk−1 and pk−1 is close to pk, then this provides an initialization close to the current solution. However, the proof technique used here is based on the assumption that the samples drawn at each iteration are independent, and so cannot be applied to this strategy.

Acknowledgements
Thanks to Ivona Bezáková, Aaron Defazio, Nishant Mehta, Aditya Menon, Cheng Soon Ong and Christfried Webers. NICTA is funded by the Australian Government through the Dept.
of Communications and the Australian Research Council through the ICT Centre of Excellence Program.

References

[1] Abbeel, P., Koller, D., and Ng, A. Learning factor graphs in polynomial time and sample complexity. Journal of Machine Learning Research, 7:1743–1788, 2006.
[2] Asuncion, A., Liu, Q., Ihler, A., and Smyth, P. Learning with blocks: composite likelihood and contrastive divergence. In AISTATS, 2010.
[3] Besag, J. Statistical analysis of non-lattice data. Journal of the Royal Statistical Society, Series D (The Statistician), 24(3):179–195, 1975.
[4] Boucheron, S., Lugosi, G., and Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
[5] Carreira-Perpiñán, M. A. and Hinton, G. On contrastive divergence learning. In AISTATS, 2005.
[6] Chow, C. I. and Liu, C. N. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14:462–467, 1968.
[7] Descombes, X., Morris, R., Zerubia, J., and Berthod, M. Estimation of Markov random field prior parameters using Markov chain Monte Carlo maximum likelihood. IEEE Transactions on Image Processing, 8(7):954–963, 1999.
[8] Domke, J. and Liu, X. Projecting Ising model parameters for fast mixing. In NIPS, 2013.
[9] Dyer, M. E., Goldberg, L. A., and Jerrum, M. Matrix norms and rapid mixing for spin systems. Annals of Applied Probability, 19:71–107, 2009.
[10] Geyer, C. Markov chain Monte Carlo maximum likelihood. In Symposium on the Interface, 1991.
[11] Gu, M. G. and Zhu, H.-T. Maximum likelihood estimation for spatial models by Markov chain Monte Carlo stochastic approximation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):339–355, 2001.
[12] Hayes, T. A simple condition implying rapid mixing of single-site dynamics on spin systems. In FOCS, 2006.
[13] Heinemann, U. and Globerson, A. Inferning with high girth graphical models. In ICML, 2014.
[14] Hinton, G. A practical guide to training restricted Boltzmann machines. Technical report, University of Toronto, 2010.
[15] Huber, M. Simulation reductions for the Ising model. Journal of Statistical Theory and Practice, 5(3):413–424, 2012.
[16] Hyvärinen, A. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6:695–709, 2005.
[17] Jerrum, M. and Sinclair, A. Polynomial-time approximation algorithms for the Ising model. SIAM Journal on Computing, 22:1087–1116, 1993.
[18] Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[19] Levin, D. A., Peres, Y., and Wilmer, E. L. Markov Chains and Mixing Times. American Mathematical Society, 2006.
[20] Lindsay, B. Composite likelihood methods. Contemporary Mathematics, 80(1):221–239, 1988.
[21] Liu, X. and Domke, J. Projecting Markov random field parameters for fast mixing. In NIPS, 2014.
[22] Marlin, B. and de Freitas, N. Asymptotic efficiency of deterministic estimators for discrete energy-based models: Ratio matching and pseudolikelihood. In UAI, 2011.
[23] Mizrahi, Y., Denil, M., and de Freitas, N. Linear and parallel learning of Markov random fields. In ICML, 2014.
[24] Papandreou, G. and Yuille, A. L. Perturb-and-MAP random fields: Using discrete optimization to learn and sample from energy models. In ICCV, 2011.
[25] Salakhutdinov, R. Learning in Markov random fields using tempered transitions. In NIPS, 2009.
[26] Schmidt, M., Le Roux, N., and Bach, F. Convergence rates of inexact proximal-gradient methods for convex optimization. In NIPS, 2011.
[27] Schmidt, U., Gao, Q., and Roth, S. A generative perspective on MRFs in low-level vision. In CVPR, 2010.
[28] Steinhardt, J. and Liang, P. Learning fast-mixing models for structured prediction. In ICML, 2015.
[29] Tieleman, T. Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML, 2008.
[30] Varin, C., Reid, N., and Firth, D. An overview of composite likelihood methods. Statistica Sinica, 21:5–24, 2011.
[31] Wainwright, M. Estimating the “wrong” graphical model: Benefits in the computation-limited setting. Journal of Machine Learning Research, 7:1829–1859, 2006.
[32] Wainwright, M. and Jordan, M. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
[33] Zhu, S. C., Wu, Y., and Mumford, D. Filters, random fields and maximum entropy (FRAME): Towards a unified theory for texture modeling. International Journal of Computer Vision, 27(2):107–126, 1998.