{"title": "Implicit Differentiation by Perturbation", "book": "Advances in Neural Information Processing Systems", "page_first": 523, "page_last": 531, "abstract": "This paper proposes a simple and efficient finite difference method for implicit differentiation of marginal inference results in discrete graphical models. Given an arbitrary loss function, defined on marginals, we show that the derivatives of this loss with respect to model parameters can be obtained by running the inference procedure twice, on slightly perturbed model parameters. This method can be used with approximate inference, with a loss function over approximate marginals. Convenient choices of loss functions make it practical to fit graphical models with hidden variables, high treewidth and/or model misspecification.", "full_text": "Implicit Di\ufb00erentiation by Perturbation\n\nJustin Domke\n\nRochester Institute of Technology\n\njustin.domke@rit.edu\n\nAbstract\n\nThis paper proposes a simple and e\ufb03cient \ufb01nite di\ufb00erence method for im-\nplicit di\ufb00erentiation of marginal inference results in discrete graphical mod-\nels. Given an arbitrary loss function, de\ufb01ned on marginals, we show that\nthe derivatives of this loss with respect to model parameters can be obtained\nby running the inference procedure twice, on slightly perturbed model pa-\nrameters. This method can be used with approximate inference, with a\nloss function over approximate marginals. Convenient choices of loss func-\ntions make it practical to \ufb01t graphical models with hidden variables, high\ntreewidth and/or model misspeci\ufb01cation.\n\n1 Introduction\n\nAs graphical models are applied to more complex problems, it is increasingly necessary to\nlearn parameters from data. 
Though the likelihood and conditional likelihood are the most widespread training objectives, these are sometimes undesirable and/or infeasible in real applications.\n\nWith low treewidth, if the data is truly distributed according to the chosen graphical model with some parameters, any consistent loss function will recover those true parameters in the high-data limit, and so one might select a loss function according to statistical convergence rates [1]. In practice, the model is usually misspecified to some degree, meaning no \"true\" parameters exist. In this case, different loss functions lead to different asymptotic parameter estimates. Hence, it is useful to consider the priorities of the user when learning. For low-treewidth graphs, several loss functions have been proposed that prioritize different types of accuracy (Section 2.2). For parameters \u03b8, these loss functions are given as a function L(\u00b5(\u03b8)) of marginals \u00b5(\u03b8). One can directly calculate \u2202L/\u2202\u00b5. The parameter gradient dL/d\u03b8 can be efficiently computed by loss-specific message-passing schemes [2, 3].\n\nThe likelihood may also be infeasible to optimize, due to the computational intractability of computing the log-partition function or its derivatives in high-treewidth graphs. On the other hand, if an approximate inference algorithm will be used at test time, it is logical to design the loss function to compensate for defects in inference. The surrogate likelihood (the likelihood with an approximate partition function) can give superior results to the true likelihood when approximate inference is used at test time [4].\n\nThe goal of this paper is to efficiently fit parameters to optimize an arbitrary function of predicted marginals, in a high-treewidth setting. 
If \u00b5(\u03b8) is the function mapping parameters to (approximate) marginals, and there is some loss function L(\u00b5) defined on those marginals, we desire to recover dL/d\u03b8. This enables the use of the marginal-based loss functions mentioned previously, but defined on approximate marginals.\n\nThere are two major existing approaches for calculating dL/d\u03b8. First, after performing inference, this gradient can be obtained by solving a large, sparse linear system [5]. The major disadvantage of this approach is that standard linear solvers can perform poorly on large graphs, meaning that calculating this gradient can be more expensive than performing inference (Section 4). A second option is the Back Belief Propagation (BBP) algorithm [6]. This is based on application of reverse-mode automatic differentiation (RAD) to message passing. Crucially, this can be done without storing all intermediate messages, avoiding the enormous memory requirements of a naive application of RAD. This is efficient, with running time in practice similar to inference. However, it is tied to a specific entropy approximation (Bethe) and algorithm (loopy belief propagation). Extension to similar message-passing algorithms appears possible, but extension to more complex inference algorithms [7, 8, 9] is unclear.\n\n[Figure 1: Example images from the Berkeley dataset, along with marginals for a conditional random field fit with various loss functions. Panels: True (y), Noisy (x), Surrogate likelihood, Clique likelihood, Univariate likelihood, Smooth class. error.]\n\nHere, we observe that the loss gradient can be calculated by far more straightforward means. Our basic result is extremely simple: dL/d\u03b8 \u2248 (1/r)(\u00b5(\u03b8 + r \u2202L/\u2202\u00b5) \u2212 \u00b5(\u03b8)), with equality in the limit r \u2192 0. 
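As a quick illustration of this identity (hypothetical code, not the paper's implementation), consider an exponential family small enough that \u00b5(\u03b8) can be computed by brute-force enumeration; the perturbation estimate of dL/d\u03b8 can then be checked against a naive componentwise finite difference:

```python
import numpy as np

def mu(theta, F):
    """Exact marginals mu(theta) = E[f(x)] for p(x) proportional to exp(theta . f(x))."""
    s = F @ theta
    p = np.exp(s - s.max())   # stable softmax over all configurations
    p /= p.sum()
    return F.T @ p

rng = np.random.default_rng(0)
F = rng.integers(0, 2, size=(8, 5)).astype(float)   # feature vectors of 8 configurations
theta = rng.normal(size=5)
target = mu(rng.normal(size=5), F)                  # marginals we would like to match
loss = lambda m: 0.5 * np.sum((m - target) ** 2)
dL_dmu = mu(theta, F) - target                      # partial derivative of the loss at mu(theta)

# Perturbation estimate: dL/dtheta ~ (mu(theta + r dL/dmu) - mu(theta)) / r,
# valid because dmu/dtheta^T (here the covariance of f) is symmetric.
r = 1e-6
g_pert = (mu(theta + r * dL_dmu, F) - mu(theta, F)) / r

# Naive check: perturb each component of theta separately.
g_naive = np.array([(loss(mu(theta + r * e, F)) - loss(mu(theta - r * e, F))) / (2 * r)
                    for e in np.eye(5)])
print(np.max(np.abs(g_pert - g_naive)))
```

Here "inference" is an exact sum over configurations; in the paper's setting the same two calls would instead go to an approximate inference routine.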
This result follows from, first, the well-known trick of approximating Jacobian-vector products by finite differences and, second, the special property that for marginal inference, the Jacobian matrix d\u00b5/d\u03b8^T is symmetric. This result applies when marginal inference takes place over the local polytope with an entropy that is concave and obeys a minor technical condition. It can also be used with non-concave entropies, with some assumptions on how inference recovers different local optima. It is easy to use this to compute the gradient of essentially any differentiable loss function defined on marginals. Effectively, all one needs to do is re-run the inference procedure on a set of parameters slightly \"perturbed\" in the direction \u2202L/\u2202\u00b5. Conditional training and tied or nonlinear parameters can also be accommodated.\n\nOne clear advantage of this approach is simplicity and ease of implementation. Aside from this, like the matrix inversion approach, it is independent of the algorithm used to perform inference, and applicable to a variety of different inference approximations. Like BBP, the method is efficient in that it makes only two calls to inference.\n\n2 Background\n\n2.1 Marginal Inference\n\nThis section briefly reviews the aspects of graphical models and marginal inference that are required for the rest of the paper. Let x denote a vector of discrete random variables. We use the exponential family representation\n\np(x; \u03b8) = exp(\u03b8 \u00b7 f(x) \u2212 A(\u03b8)),\n\n(1)\n\nwhere f(x) is the features of the observation x, and A(\u03b8) = log \u2211_x exp(\u03b8 \u00b7 f(x)) assures normalization. For graphical models, f is typically a vector of indicator functions for each possible configuration of each factor and variable. With a slight abuse of set notation to represent a vector, this can be written as f(x) = {I[x\u03b1]} \u222a {I[xi]}. 
It is convenient to refer to the components of vectors like those in Eq. 1 using function notation. Write \u03b8(x\u03b1) to refer to the component of \u03b8 corresponding to the indicator function I[x\u03b1], and similarly for \u03b8(xi). This gives an alternative representation for p, namely\n\np(x; \u03b8) = exp(\u2211_\u03b1 \u03b8(x\u03b1) + \u2211_i \u03b8(xi) \u2212 A(\u03b8)).\n\n(2)\n\nMarginal inference means recovering the expected value of f or, equivalently, the marginal probability that each factor or variable has a particular value:\n\n\u00b5(\u03b8) = \u2211_x p(x; \u03b8) f(x).\n\n(3)\n\nThough marginals could, in principle, be computed by the brute-force sum in Eq. 3, it is useful to consider the paired variational representation [10, Chapter 3]\n\nA(\u03b8) = max_{\u00b5\u2208M} \u03b8 \u00b7 \u00b5 + H(\u00b5),\n\n(4)\n\n\u00b5(\u03b8) = dA/d\u03b8 = arg max_{\u00b5\u2208M} \u03b8 \u00b7 \u00b5 + H(\u00b5),\n\n(5)\n\nin which A and \u00b5 can both be recovered from solving the same optimization problem. Here, M = {\u00b5(\u03b8) | \u03b8 \u2208 \u211c^n} is the marginal polytope \u2013 those marginals \u00b5 resulting from some parameter vector \u03b8. Similarly, H(\u00b5) is the entropy of p(x; \u03b8\u2032), where \u03b8\u2032 is the vector of parameters that produces the marginals \u00b5.\n\nAs M is a convex set, and H a concave function, Eq. 5 is equivalent to a convex optimization problem. Nevertheless, it is difficult to characterize M or compute H(\u00b5) in high-treewidth graphs. A variety of approximate inference methods can be seen as solving a modification of Eqs. 4 and 5, with the marginal polytope and entropy replaced with tractable approximations. 
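The pairing of A and \u00b5 in Eqs. 4 and 5 is easy to verify by enumeration on a tiny model; the following sketch (hypothetical code, not from the paper) checks that the brute-force marginals of Eq. 3 match the gradient dA/d\u03b8, estimated by central differences:

```python
import itertools
import numpy as np

# A toy 3-variable binary model with f(x) = x, small enough for brute force.
configs = np.array(list(itertools.product([0, 1], repeat=3)), dtype=float)
theta = np.array([0.3, -0.5, 1.1])

def log_A(theta):
    """Log-partition function A(theta) = log sum_x exp(theta . f(x))."""
    return np.log(np.sum(np.exp(configs @ theta)))

def marginals(theta):
    """Brute-force marginals mu(theta) = sum_x p(x; theta) f(x)  (Eq. 3)."""
    p = np.exp(configs @ theta - log_A(theta))
    return configs.T @ p

# mu(theta) = dA/dtheta: estimate the gradient of A by central differences.
eps = 1e-6
grad_A = np.array([(log_A(theta + eps * e) - log_A(theta - eps * e)) / (2 * eps)
                   for e in np.eye(3)])
print(np.max(np.abs(marginals(theta) - grad_A)))
```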
Notice that these are also paired; the approximate \u00b5 is the exact gradient of the approximate A.\n\nThe most common relaxation of M is the local polytope\n\nL = {\u00b5 \u2265 0 | \u00b5(xi) = \u2211_{x_{\u03b1\\i}} \u00b5(x\u03b1), \u2211_{xi} \u00b5(xi) = 1}.\n\n(6)\n\nThis underlies loopy belief propagation, as well as tree-reweighted belief propagation. Since a valid set of marginals must obey these constraints, L \u2287 M. Note that since the equality constraints are linear, there exists a matrix B and vector d such that\n\nL = {\u00b5 \u2265 0 | B\u00b5 = d}.\n\n(7)\n\nA variety of entropy approximations exist. The Bethe approximation implicit in loopy belief propagation [11] is non-concave in general, which results in sometimes failing to achieve the global optimum. Concave entropy functions include the tree-reweighted entropy [12], convexified Bethe entropies [13], and the class of entropies obeying Heskes\u2019 conditions [14].\n\n2.2 Loss Functions\n\nGiven some data, {\u02c6x}, we will pick the parameters \u03b8 to minimize the empirical risk\n\n\u2211_{\u02c6x} L(\u02c6x; \u03b8).\n\n(8)\n\nLikelihood. The (negative) likelihood is the classic loss function for training graphical models. Exploiting the fact that dA/d\u03b8 = \u00b5(\u03b8), the gradient is available in closed form:\n\nL(\u02c6x; \u03b8) = \u2212 log p(\u02c6x; \u03b8) = \u2212\u03b8 \u00b7 f(\u02c6x) + A(\u03b8),\n\n(9)\n\ndL/d\u03b8 = \u2212f(\u02c6x) + \u00b5(\u03b8).\n\n(10)\n\nSurrogate Likelihood. Neither A nor \u00b5 is tractable with high treewidth. However, if written in variational form (Eqs. 4 and 5), they can be approximated using approximate inference. The surrogate likelihood [4] is simply the likelihood as in Eq. 9 with an approximate A. It has the gradient as in Eq. 10, but with approximate marginals \u00b5.\n\nUnlike the losses below, the surrogate likelihood is convex when based on a concave inference method. 
See Ganapathi et al. [15] for a variant of this for inference with local optima.\n\nUnivariate Likelihood. If the application will only make use of univariate marginals at test time, one might fit parameters specifically to make these univariate marginals accurate. Kakade et al. [3] proposed the loss\n\nL(\u02c6x; \u03b8) = \u2212\u2211_i log \u00b5(\u02c6xi; \u03b8).\n\n(11)\n\nThis can be computed in treelike graphs, after running belief propagation to compute marginals. A message-passing scheme can efficiently compute the gradient.\n\nUnivariate Classification Error. Some applications only use the maximum probability marginals. Gross et al. [2] considered the loss\n\nL(\u02c6x; \u03b8) = \u2211_i S(max_{xi \u2260 \u02c6xi} \u00b5(xi; \u03b8) \u2212 \u00b5(\u02c6xi; \u03b8)),\n\n(12)\n\nwhere S is the step function. This loss measures the number of incorrect components of \u02c6x if each is predicted to be the \u201cmax marginal\u201d. However, since this is non-differentiable, it is suggested to approximate it by replacing S with a sigmoid function S(t) = 1/(1 + exp(\u2212\u03bbt)), where \u03bb controls the approximation quality. Our experiments use \u03bb = 50.\n\nAs with the univariate likelihood, this loss can be computed if exact marginals are available. Computing the gradient requires another message-passing scheme.\n\nClique loss functions. One can easily define clique versions of the previous two loss functions, where the summations are over \u03b1, rather than i. These measure the accuracy of clique-wise marginals, rather than univariate marginals.\n\n2.3 Implicit Differentiation\n\nAs noted in Eq. 7, the equality constraints in the local polytope are linear, and hence when the positivity constraint can be disregarded, approximate marginal inference algorithms can be seen as solving the optimization \u00b5(\u03b8) = arg max_{\u00b5: B\u00b5=d} \u03b8 \u00b7 \u00b5 + H(\u00b5). 
Domke showed [5], in our notation, that\n\ndL/d\u03b8 = (D^{-1} B^T (B D^{-1} B^T)^{-1} B D^{-1} \u2212 D^{-1}) dL/d\u00b5,\n\n(13)\n\nwhere D = \u2202^2H/\u2202\u00b5\u2202\u00b5^T is the (diagonal) Hessian of the entropy approximation.\n\nUnfortunately, this requires solving a sparse linear system for each training example and iteration. As we will see below, with large or poorly conditioned problems, the computational expense of this can far exceed that of inference. Note that B D^{-1} B^T is, in general, indefinite, restricting which solvers can be used. Another limitation is that D can be singular if any counting numbers (Eq. 16) are zero.\n\n2.4 Conditional training and nonlinear parameters.\n\nFor simplicity, all the above discussion was confined to fully parametrized models. Nonlinear and tied parameters are easily dealt with by considering \u03b8(\u03c6) to be a function of the \u201ctrue\u201d parameters \u03c6.\n\nAlgorithm 1 Calculating loss derivatives (two-sided).\n\n1. Do inference: \u00b5* \u2190 arg max_{\u00b5\u2208M} \u03b8 \u00b7 \u00b5 + H(\u00b5).\n\n2. At \u00b5*, calculate the partial derivative \u2202L/\u2202\u00b5.\n\n3. Calculate a perturbation size r.\n\n4. Do inference on perturbed parameters:\n\u00b5+ \u2190 arg max_{\u00b5\u2208M} (\u03b8 + r \u2202L/\u2202\u00b5) \u00b7 \u00b5 + H(\u00b5),\n\u00b5\u2212 \u2190 arg max_{\u00b5\u2208M} (\u03b8 \u2212 r \u2202L/\u2202\u00b5) \u00b7 \u00b5 + H(\u00b5).\n\n5. Recover the full derivative: dL/d\u03b8 \u2190 (1/2r)(\u00b5+ \u2212 \u00b5\u2212).\n\n
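Eq. 13 above can be sanity-checked in the smallest nontrivial case: with H(\u00b5) = \u2212\u2211 \u00b5 log \u00b5 and the single simplex constraint \u2211 \u00b5 = 1, the maximizer of \u03b8 \u00b7 \u00b5 + H(\u00b5) is the softmax of \u03b8, whose Jacobian diag(\u00b5) \u2212 \u00b5\u00b5^T is known in closed form. A hypothetical sketch (not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
theta = rng.normal(size=n)

# With H(mu) = -sum mu log mu and constraint B mu = d (the probability simplex),
# the maximizer of theta . mu + H(mu) is the softmax of theta.
B = np.ones((1, n))
mu = np.exp(theta - theta.max())
mu /= mu.sum()

dL_dmu = rng.normal(size=n)        # gradient of an arbitrary loss at mu

# Eq. 13: dL/dtheta = (D^-1 B^T (B D^-1 B^T)^-1 B D^-1 - D^-1) dL/dmu,
# where D = d^2H/dmu dmu^T = -diag(1/mu), so D^-1 = -diag(mu).
D_inv = -np.diag(mu)
M = D_inv @ B.T @ np.linalg.inv(B @ D_inv @ B.T) @ B @ D_inv - D_inv
g = M @ dL_dmu

# The same gradient via the closed-form softmax Jacobian diag(mu) - mu mu^T.
g_check = (np.diag(mu) - np.outer(mu, mu)) @ dL_dmu
print(np.max(np.abs(g - g_check)))
```

In real use B is the full local-polytope constraint matrix of Eq. 7 and the system is solved sparsely rather than by dense inversion.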
Once dL/d\u03b8 is known, dL/d\u03c6 can be recovered by a simple application of the chain rule, namely\n\ndL/d\u03c6 = (d\u03b8^T/d\u03c6) dL/d\u03b8.\n\n(14)\n\nConditional training is similar: define a distribution over a random variable y, parametrized by \u03b8(\u03c6; x); the derivative on a particular pair (x, y) is given again by Eq. 14. Examples of both of these are in the experiments.\n\n3 Implicit Differentiation by Perturbation\n\nThis section shows that when \u00b5(\u03b8) = arg max_{\u00b5\u2208L} \u03b8 \u00b7 \u00b5 + H(\u00b5), the loss gradient can be computed by Alg. 1 for a concave entropy approximation of the form\n\nH(\u00b5) = \u2212\u2211_\u03b1 c\u03b1 \u2211_{x\u03b1} \u00b5(x\u03b1) log \u00b5(x\u03b1) \u2212 \u2211_i ci \u2211_{xi} \u00b5(xi) log \u00b5(xi),\n\n(15)\n\nwhen the counting numbers c obey (as is true of most proposed entropies)\n\nc\u03b1 > 0, ci + \u2211_{\u03b1: i\u2208\u03b1} c\u03b1 > 0.\n\n(16)\n\nFor intuition, the following Lemma uses notation (\u00b5, \u03b8, H) suggesting the application to marginal inference. However, note that the result is true for any functions satisfying the stated conditions.\n\nLemma. If \u00b5(\u03b8) is implicitly defined by\n\n\u00b5(\u03b8) = arg max_\u00b5 \u00b5 \u00b7 \u03b8 + H(\u00b5)\n\n(17)\n\ns.t. B\u00b5 \u2212 d = 0,\n\n(18)\n\nwhere H(\u00b5) is strictly concave and twice differentiable, then d\u00b5/d\u03b8^T exists and is symmetric.\n\nProof. First, form a Lagrangian enforcing the constraints on the objective function:\n\nL = \u00b5 \u00b7 \u03b8 + H(\u00b5) + \u03bb^T(B\u00b5 \u2212 d).\n\n(19)\n\nThe solution is \u00b5 and \u03bb such that dL/d\u00b5 = 0 and dL/d\u03bb = 0:\n\n[\u03b8 + \u2202H(\u00b5)/\u2202\u00b5 + B^T\u03bb ; B\u00b5 \u2212 d] = [0 ; 0].\n\n(20)\n\nRecall the general implicit function theorem. 
If f(\u03b8) is implicitly defined by the constraint that h(\u03b8, f) = 0, then\n\ndf/d\u03b8^T = \u2212(\u2202h/\u2202f^T)^{-1} \u2202h/\u2202\u03b8^T.\n\n(21)\n\nUsing Eq. 20 as our definition of h, and differentiating with respect to both \u00b5 and \u03bb, we have\n\n[d\u00b5/d\u03b8^T ; d\u03bb/d\u03b8^T] = \u2212[\u2202^2H/\u2202\u00b5\u2202\u00b5^T, B^T ; B, 0]^{-1} [I ; 0].\n\n(22)\n\nWe see that \u2212d\u00b5/d\u03b8^T is the upper-left block of the inverse matrix in Eq. 22. The result follows, since the inverse of a symmetric matrix is symmetric.\n\nThe following is the main result driving this paper. Again, this uses notation suggesting the application to implicit differentiation and marginal inference, but holds true for any functions satisfying the stated conditions.\n\nTheorem. Let \u00b5(\u03b8) be defined as in the previous Lemma, and let L(\u03b8) be defined by L(\u03b8) = M(\u00b5(\u03b8)) for some differentiable function M(\u00b5). Then the derivative of L with respect to \u03b8 is given by\n\ndL/d\u03b8 = lim_{r\u21920} (1/r)(\u00b5(\u03b8 + r \u2202M/\u2202\u00b5) \u2212 \u00b5(\u03b8)).\n\n(23)\n\nProof. First note that, by the vector chain rule,\n\ndL/d\u03b8 = (d\u00b5^T/d\u03b8) \u2202M/\u2202\u00b5.\n\n(24)\n\nNext, take some vector v. By basic calculus, the derivative of \u00b5(\u03b8) in the direction of v is\n\n(d\u00b5/d\u03b8^T) v = lim_{r\u21920} (1/r)(\u00b5(\u03b8 + rv) \u2212 \u00b5(\u03b8)).\n\n(25)\n\nThe result follows from substituting \u2202M/\u2202\u00b5 for v, and using the previous lemma to establish that d\u00b5/d\u03b8^T = d\u00b5^T/d\u03b8.\n\nAlg. 1 follows from applying this theorem to marginal inference. However, notice that this does not enforce the constraint that \u00b5 \u2265 0. 
The following gives mild technical conditions under which \u00b5 will be strictly positive, and so the above theorem applies.\n\nTheorem. If H(\u00b5) = \u2211_\u03b1 c\u03b1 H(\u00b5_\u03b1) + \u2211_i ci H(\u00b5_i), and \u00b5* is a (possibly local) maximum of \u03b8 \u00b7 \u00b5 + H(\u00b5) under the local polytope L, then\n\nc\u03b1 > 0, ci + \u2211_{\u03b1: i\u2208\u03b1} c\u03b1 > 0 \u2192 \u00b5* > 0.\n\n(26)\n\nThis is an extension of a previous result [11, Theorem 9] for the Bethe entropy. However, extremely minor changes to the existing proof give this stronger result.\n\nMost proposed entropies satisfy these conditions, including the Bethe entropy (c\u03b1 = 1, ci + \u2211_{\u03b1: i\u2208\u03b1} c\u03b1 = 1), the TRW entropy (c\u03b1 = \u03c1(\u03b1), ci + \u2211_{\u03b1: i\u2208\u03b1} c\u03b1 = 1, where \u03c1(\u03b1) > 0 is the probability that \u03b1 appears in a randomly chosen tree) and any entropy satisfying the slightly strengthened versions of Heskes\u2019 conditions [14, 16, Section 2].\n\nWhat about non-concave entropies? The only place concavity was used above was in establishing that Eq. 20 has a unique solution. With a non-concave entropy this condition is still valid, but the solution is not unique, since there can be local optima. 
[Figure 2: Times to compute dL/d\u03b8 by perturbation, Back Belief Propagation (BBP), sparse matrix factorization (direct) and the iterative symmetric-LQ method (symmlq); legend entries: pert-BP, symmlq, BBP, direct, BP (Bethe entropy) and pert-TRWS, symmlq, direct, TRWS (TRW entropy); axes: running time (s) against grid size or interaction strength. Inference with BP and TRWS is shown for reference. As these results use two-sided differences, perturbation always takes twice the running time of the base inference algorithm. BBP takes time similar to BP. Results use a pairwise grid with xi \u2208 {1, 2, ..., 5}, with univariate terms \u03b8(xi) taken uniformly from [\u22121, +1] and interaction strengths \u03b8(xi, xj) from [\u2212a, +a] for varying a. Top Left: Bethe entropy for varying grid sizes, with a = 1. Matrix factorization is efficient on small problems, but scales poorly. Top Right: Bethe entropy with a grid size of 32 and varying interaction strengths a. High interaction strengths lead to poor conditioning, slowing iterative methods. Bottom Left: Varying grid sizes with the TRW entropy. Bottom Right: TRW entropy with a grid size of 32 and varying interactions.]\n\nBBP essentially calculates this derivative by \u201ctracking\u201d the local optima. 
If perturbed beliefs are calculated from constant initial messages with a small step, one obtains the same result. Thus, BBP and perturbation give the same gradient for the Bethe approximation. (This was also verified experimentally.)\n\nIt remains to select the perturbation size r. Though the gradient is exact in the limit r \u2192 0, numerical error eventually dominates. Following Andrei [17], the experiments here use r = \u221a\u01eb (1 + ||\u03b8||_\u221e)/||\u2202L/\u2202\u00b5||_\u221e, where \u01eb is machine epsilon.\n\n4 Experiments\n\nFor inference, we used either loopy belief propagation or tree-reweighted belief propagation. As these experiments take place on grids, we are able to make use of the convergent TRWS algorithm [18, Alg. 5], which we found to converge significantly faster than standard TRW. BP/TRWS were iterated until predicted beliefs changed less than 10^{-5} between iterations. BBP used a slightly looser convergence threshold of 10^{-4}, which was similarly accurate.\n\nBase code was implemented in Python, with C++ extensions for inference algorithms for efficiency. Sparse systems were solved directly using an interface to Matlab, which calls LAPACK. We selected the Symmetric LQ method as an iterative solver. Both solvers were the fastest among several tested on these problems. (Recall, the system is indefinite.) BBP results were computed by interfacing to the authors\u2019 implementation included in the libDAI toolkit [19]. We found the PAR mode, based on parallel updates [6, Eqs. 14-25], to be much slower than the more sophisticated SEQ_FIX mode, based on sequential updates [6, extended version, Fig. 5]. Hence, all results here use the latter. Other modes exceeded the available 12 GB memory. All experiments use a single core of a 2.26 GHz machine.\n\nTable 1: Binary denoising results, comparing the surrogate likelihood against three loss functions fit by implicit differentiation. All loss functions are per-pixel, based on tree-reweighted belief propagation with edge inclusion probabilities of .5. The \u201cBest Published\u201d results are the lowest previously reported pixelwise test errors using essentially loopy-belief-propagation-based surrogate likelihood. (For all losses, lower is better.)\n\nTraining Loss | Bimodal: Class. Error (Train/Test) | Gaussian: Class. Error (Train/Test) | Berkeley: Surrogate lik. (Train/Test) | Berkeley: Clique lik. (Train/Test) | Berkeley: Univariate lik. (Train/Test) | Berkeley: Class. Error (Train/Test)\nSurrogate likelihood | .0498/.0540 | .0286/.0239 | .251/.252 | 1.328/1.330 | .417/.416 | .141/.140\nClique likelihood | .0488/.0535 | .0278/.0236 | .275/.277 | 1.176/1.178 | .316/.315 | .127/.126\nUnivariate likelihood | .0493/.0541 | .0278/.0235 | .301/.303 | 1.207/1.210 | .305/.305 | .128/.127\nSmooth Class. Error | .0460/.0527 | .0273/.0241 | .281/.283 | 1.179/1.181 | .311/.310 | .127/.126\nBest Published [20] | \u2013/.0548 | \u2013/.0251 | | | |\n\nOur first experiment makes use of synthetically generated grid models. This allows systematic variation of graph size and parameter strength. With the TRW entropy, we use uniform edge appearance probabilities of \u03c1 = .49, to avoid singularity in D. Our results (Fig. 2) can be summarized as follows. Matrix inversion (Eq. 13) with a direct solver is very efficient on small problems, but scales poorly. The iterative solver is expensive, and extremely sensitive to conditioning. With the Bethe approximation, perturbation performs similarly to BBP. TRWS converges faster than BP on poorly conditioned problems.\n\nThe second experiment considers a popular dataset for learning in high-treewidth graphical models [21]. This consists of four base images, each corrupted with 50 random noise patterns (either Gaussian or bimodal). 
Following the original work, 10 corrupted versions of the first base image are used for training, and the remaining 190 for testing. This dataset has been used repeatedly [22, 23], though direct comparison is sometimes complicated by varying model types and training/test set divisions. This experiment uses a grid model over neighboring pairs (i, j),\n\np(y|x) = exp(\u2211_{i,j} \u03b8(yi, yj) + \u2211_i \u03b8(yi; xi) \u2212 A(\u03b8(x))),\n\n(27)\n\nwhere \u03b8(x) is a function of the input, with \u03b8(yi, yj) = a(yi, yj) fully parametrized (independent of x) and \u03b8(yi; xi) = b(yi)xi + c(yi) an affine function of xi. Enforcing translation invariance gives a total of eight free parameters: four for a(yi, yj), and two each for b(yi) and c(yi).\u00b9 Once dL/d\u03b8 is known, we can, following Eq. 14, recover derivatives with respect to tied parameters.\u00b2\n\nBecause the previous dataset is quite limited (only four base 64x64 images), all methods perform relatively well. Hence, we created a larger and more challenging dataset, consisting of 200 200x300 images from the Berkeley segmentation dataset, split half for training and testing. These are binarized by setting yi = 1 if a pixel is above the image mean, and yi = 0 otherwise. The noisy values xi are created by setting xi = yi(1 \u2212 ti^{1.25}) + (1 \u2212 yi) ti^{1.25}, for ti uniform on [0, 1].\n\nTable 1 shows results for all three datasets. All the results below use batch L-BFGS for learning, and uniform edge appearance probabilities of \u03c1 = .5. The surrogate likelihood performs well, in fact beating the best reported results on the bimodal and Gaussian data. However, the univariate and clique loss functions provide better univariate accuracy. Fig. 1 shows example results. 
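Under Eq. 14, gradients for the eight tied parameters are obtained by summing the untied gradient dL/d\u03b8 over the positions that share a parameter (footnote 2). A sketch with hypothetical array shapes, using a toy chain in place of the grid:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10                                   # pixels in a toy chain, labels y in {0, 1}
x = rng.normal(size=n)                   # noisy input values x_i

# Pretend these came from inference: the untied gradients dL/dtheta.
dL_dtheta_pair = rng.normal(size=(n - 1, 2, 2))  # dL/dtheta(y_i, y_j), one per edge
dL_dtheta_uni = rng.normal(size=(n, 2))          # dL/dtheta(y_i = y), one per pixel

# Chain rule for the tied parameters (the paper's footnote 2):
# dL/da(y,y') sums over edges; dL/db(y) weights each pixel term by x_i; dL/dc(y) sums.
dL_da = dL_dtheta_pair.sum(axis=0)                # shape (2, 2)
dL_db = (dL_dtheta_uni * x[:, None]).sum(axis=0)  # shape (2,)
dL_dc = dL_dtheta_uni.sum(axis=0)                 # shape (2,)
print(dL_da.shape, dL_db.shape, dL_dc.shape)
```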
The surrogate likelihood (which is convex) was used to initialize the univariate and clique likelihoods, while the univariate likelihood was used to initialize the smooth classification error.\n\n\u00b9There are two redundancies, as adding a constant to a(yi, yj) or c(yi) has no effect on p.\n\u00b2Specifically, dL/da(y, y\u2032) = \u2211_{(i,j)} dL/d\u03b8(yi=y, yj=y\u2032), dL/db(y) = \u2211_i dL/d\u03b8(yi=y) xi, and dL/dc(y) = \u2211_i dL/d\u03b8(yi=y).\n\nReferences\n\n[1] Percy Liang and Michael Jordan. An asymptotic analysis of generative, discriminative, and pseudolikelihood estimators. In ICML, 2008.\n\n[2] Samuel Gross, Olga Russakovsky, Chuong Do, and Serafim Batzoglou. Training conditional random fields for maximum labelwise accuracy. In NIPS, 2006.\n\n[3] Sham Kakade, Yee Whye Teh, and Sam Roweis. An alternate objective function for Markovian fields. In ICML, 2002.\n\n[4] Martin Wainwright. Estimating the \"wrong\" graphical model: Benefits in the computation-limited setting. Journal of Machine Learning Research, 7:1829\u20131859, 2006.\n\n[5] Justin Domke. Learning convex inference of marginals. In UAI, 2008.\n\n[6] Frederik Eaton and Zoubin Ghahramani. Choosing a variable to clamp. In AISTATS, 2009.\n\n[7] Max Welling and Yee Whye Teh. Belief optimization for binary networks: A stable alternative to loopy belief propagation. In UAI, 2001.\n\n[8] Tom Heskes, Kees Albers, and Bert Kappen. Approximate inference and constrained optimization. In UAI, 2003.\n\n[9] Alan Yuille. CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14:2002, 2002.\n\n[10] Martin Wainwright and Michael Jordan. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn., 1(1-2):1\u2013305, 2008.\n\n[11] Jonathan Yedidia, William Freeman, and Yair Weiss. 
Constructing free energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51:2282\u20132312, 2005.\n\n[12] Martin Wainwright, Tommi Jaakkola, and Alan Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7):2313\u20132335, 2005.\n\n[13] Ofer Meshi, Ariel Jaimovich, Amir Globerson, and Nir Friedman. Convexifying the Bethe free energy. In UAI, 2009.\n\n[14] Tom Heskes. Convexity arguments for efficient minimization of the Bethe and Kikuchi free energies. J. Artif. Intell. Res. (JAIR), 26:153\u2013190, 2006.\n\n[15] Varun Ganapathi, David Vickrey, John Duchi, and Daphne Koller. Constrained approximate maximum entropy learning of Markov random fields. In UAI, 2008.\n\n[16] Tamir Hazan and Amnon Shashua. Convergent message-passing algorithms for inference over general graphs with convex free energies. In UAI, pages 264\u2013273, 2008.\n\n[17] Neculai Andrei. Accelerated conjugate gradient algorithm with finite difference Hessian/vector product approximation for unconstrained optimization. J. Comput. Appl. Math., 230(2):570\u2013582, 2009.\n\n[18] Talya Meltzer, Amir Globerson, and Yair Weiss. Convergent message passing algorithms - a unifying view, 2009.\n\n[19] Joris M. Mooij et al. libDAI 0.2.4: A free/open source C++ library for Discrete Approximate Inference. http://www.libdai.org/, 2010.\n\n[20] Sanjiv Kumar, Jonas August, and Martial Hebert. Exploiting inference for approximate parameter learning in discriminative fields: An empirical study. In EMMCVPR, 2005.\n\n[21] Sanjiv Kumar and Martial Hebert. Discriminative random fields. International Journal of Computer Vision, 68(2):179\u2013201, 2006.\n\n[22] S. V. N. Vishwanathan, Nicol Schraudolph, Mark Schmidt, and Kevin Murphy. Accelerated training of conditional random fields with stochastic gradient methods. 
In ICML, 2006.\n\n[23] Patrick Pletscher, Cheng Soon Ong, and Joachim Buhmann. Spanning tree approximations for conditional random fields. In AISTATS, 2009.\n", "award": [], "sourceid": 426, "authors": [{"given_name": "Justin", "family_name": "Domke", "institution": null}]}