{"title": "Bounding errors of Expectation-Propagation", "book": "Advances in Neural Information Processing Systems", "page_first": 244, "page_last": 252, "abstract": "Expectation Propagation is a very popular algorithm for variational inference, but comes with few theoretical guarantees. In this article, we prove that the approximation errors made by EP can be bounded. Our bounds have an asymptotic interpretation in the number n of datapoints, which allows us to study EP's convergence with respect to the true posterior. In particular, we show that EP converges at a rate of $O(n^{-2})$ for the mean, up to an order of magnitude faster than the traditional Gaussian approximation at the mode. We also give similar asymptotic expansions for moments of order 2 to 4, as well as excess Kullback-Leibler cost (defined as the additional KL cost incurred by using EP rather than the ideal Gaussian approximation). All these expansions highlight the superior convergence properties of EP. Our approach for deriving those results is likely applicable to many similar approximate inference methods. In addition, we introduce bounds on the moments of log-concave distributions that may be of independent interest.", "full_text": "Bounding errors of Expectation-Propagation\n\nGuillaume Dehaene\nUniversity of Geneva\n\nguillaume.dehaene@gmail.com\n\nSimon Barthelm\u00e9\nCNRS, Gipsa-lab\n\nsimon.barthelme@gipsa-lab.fr\n\nAbstract\n\nExpectation Propagation is a very popular algorithm for variational inference, but\ncomes with few theoretical guarantees. In this article, we prove that the approx-\nimation errors made by EP can be bounded. Our bounds have an asymptotic in-\nterpretation in the number n of datapoints, which allows us to study EP\u2019s conver-\ngence with respect to the true posterior. In particular, we show that EP converges\nat a rate of O(n\u22122) for the mean, up to an order of magnitude faster than the tra-\nditional Gaussian approximation at the mode. 
We also give similar asymptotic expansions for moments of order 2 to 4, as well as excess Kullback-Leibler cost (defined as the additional KL cost incurred by using EP rather than the ideal Gaussian approximation). All these expansions highlight the superior convergence properties of EP. Our approach for deriving those results is likely applicable to many similar approximate inference methods. In addition, we introduce bounds on the moments of log-concave distributions that may be of independent interest.\n\nIntroduction\n\nExpectation Propagation (EP, [1]) is an efficient approximate inference algorithm that is known to give good approximations, to the point of being almost exact in certain applications [2, 3]. It is surprising that, while the method is empirically very successful, there are few theoretical guarantees on its behavior. Indeed, most work on EP has focused on efficiently implementing the method in various settings. Theoretical work on EP mostly consists of new justifications of the method which, while they offer intuitive insight, do not give mathematical proofs that the method behaves as expected. One recent breakthrough is due to Dehaene and Barthelm\u00e9 [4], who prove that, in the large-data limit, the EP iteration behaves like a Newton search and its approximation is asymptotically exact. However, it remains unclear how good we can expect the approximation to be when we have only finite data. In this article, we offer a characterization of the quality of the EP approximation in terms of the worst-case distance between the true and approximate mean and variance.\nWhen approximating a probability distribution p(x) that is, for some reason, close to being Gaussian, a natural approximation to use is the Gaussian with mean equal to the mode (or argmax) of p(x) and with variance the inverse log-Hessian at the mode. 
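As a concrete illustration (a minimal sketch on a toy target of our own choosing, not an example from the paper), this mode-plus-curvature approximation can be computed with a few Newton steps:

```python
# Sketch of the mode-based Gaussian approximation for a toy 1-D target
# p(x) proportional to exp(-phi(x)) with phi(x) = x^4/4 + x^2/2 - x.
# The target and iteration count are arbitrary illustrative choices.
def phi_prime(x):
    return x ** 3 + x - 1.0

def phi_second(x):
    return 3.0 * x ** 2 + 1.0

# Newton iteration for the mode x*, i.e. the root of phi'(x) = 0
x_star = 0.0
for _ in range(30):
    x_star -= phi_prime(x_star) / phi_second(x_star)

mu_cga = x_star                   # approximate mean: the mode of p
v_cga = 1.0 / phi_second(x_star)  # approximate variance: inverse curvature at the mode
print(mu_cga, v_cga)
```

For a posterior built from n datapoints, phi would be the negative log-posterior; the Newton search converges in a handful of iterations here because phi is strictly convex.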
We call it the Canonical Gaussian Approximation (CGA), and its use is usually justified by appealing to the Bernstein-von Mises theorem, which shows that, in the limit of a large amount of independent observations, posterior distributions tend towards their CGA. This powerful justification, and the ease with which the CGA is computed (finding the mode can be done using Newton methods), makes it a good reference point for any method like EP which aims to offer a better Gaussian approximation at a higher computational cost. In section 1, we introduce the CGA and the EP approximation. In section 2, we give our theoretical results bounding the quality of EP approximations.\n\n1 Background\n\nIn this section, we present the CGA and give a short introduction to the EP algorithm. In-depth descriptions of EP can be found in Minka [5], Seeger [6], Bishop [7], Raymond et al. [8].\n\n1.1 The Canonical Gaussian Approximation\n\nWhat we call here the CGA is perhaps the most common approximate inference method in the machine learning cookbook. It is often called the \u201cLaplace approximation\u201d, but this is a misnomer: the Laplace approximation refers to approximating the integral \u222bp from the integral of the CGA. The reason the CGA is so often used is its compelling simplicity: given a target distribution p(x) = exp(\u2212\u03c6(x)), we find the mode x* and compute the second derivatives of \u03c6 at x*:\n\nx* = argmin \u03c6(x)\n\u03b2* = \u03c6''(x*)\n\nto form a Gaussian approximation q(x) = N(x | x*, 1/\u03b2*) \u2248 p(x). The CGA is effectively just a second-order Taylor expansion, and its use is justified by the Bernstein-von Mises theorem [9], which essentially says that the CGA becomes exact in the large-data (large-n) asymptotic limit. Roughly, if pn(x) \u221d \u220f_{i=1}^{n} p(yi|x) p0(x), where y1 . . . yn represent independent datapoints, then lim_{n\u2192\u221e} pn(x) = N(x | x*_n, 1/\u03b2*_n) in total variation.\n\n1.2 CGA vs Gaussian EP\n\nGaussian EP, as its name indicates, provides an alternative way of computing a Gaussian approximation to a target distribution. There is broad overlap between the problems where EP can be applied and the problems where the CGA can be used, with EP coming at a higher cost. Our contribution is to show formally that the higher computational cost of EP may well be worth bearing, as EP approximations can outperform CGAs by an order of magnitude. To be specific, we focus on the moment estimates (mean and covariance) computed by EP and CGA, and derive bounds on their distance to the true mean and variance of the target distribution. Our bounds have an asymptotic interpretation, and under that interpretation we show for example that the mean returned by EP is within an order of O(n^\u22122) of the true mean, where n is the number of datapoints. For the CGA, which uses the mode as an estimate of the mean, we exhibit a O(n^\u22121) upper bound, and we compute the error term responsible for this O(n^\u22121) behavior. This enables us to show that, in the situations in which this error is indeed O(n^\u22121), EP is better than the CGA.\n\n1.3 The EP algorithm\n\nWe consider the task of approximating a probability distribution over a random variable X: p(x), which we call the target distribution. X can be high-dimensional, but for simplicity, we focus on the one-dimensional case. One important hypothesis that makes EP feasible is that p(x) factorizes into n simple factor terms:\n\np(x) = \u220f_i fi(x)\n\nEP proposes to approximate each fi(x) (usually referred to as sites) by a Gaussian function qi(x) (referred to as the site-approximations). 
It is convenient to use the parametrization of Gaussians in terms of natural parameters:\n\nqi(x | ri, \u03b2i) \u221d exp(ri x \u2212 \u03b2i x^2/2)\n\nwhich makes some of the further computations easier to understand. Note that EP could also be used with other exponential approximating families. These Gaussian approximations are computed iteratively. Starting from a current approximation q^t_i(x | r^t_i, \u03b2^t_i), we select a site i for update. We then:\n\n\u2022 Compute the cavity distribution q^t_{\u2212i}(x) \u221d \u220f_{j\u2260i} q^t_j(x). This is very easy in natural parameters:\n\nq^t_{\u2212i}(x) \u221d exp((\u2211_{j\u2260i} r^t_j) x \u2212 (\u2211_{j\u2260i} \u03b2^t_j) x^2/2)\n\n\u2022 Compute the hybrid distribution h^t_i(x) \u221d q^t_{\u2212i}(x) fi(x) and its mean and variance\n\u2022 Compute the Gaussian which minimizes the Kullback-Leibler divergence to the hybrid, ie the Gaussian with the same mean and variance:\n\nP(h^t_i) = argmin_q KL(h^t_i | q)\n\n\u2022 Finally, update the approximation of fi:\n\nq^{t+1}_i = P(h^t_i) / q^t_{\u2212i}\n\nwhere the division is simply computed as a subtraction between natural parameters.\n\nWe iterate these operations until a fixed point is reached, at which point we return a Gaussian approximation of p(x) \u2248 \u220f_i qi(x).\n\n1.4 The \u201cEP-approximation\u201d\n\nIn this work, we will characterize the quality of an EP approximation of p(x). We define this to be any fixed point of the iteration presented in section 1.3, any of which could be returned by the algorithm. It is known that EP will have at least one fixed-point [1], but it is unknown under which conditions the fixed-point is unique. 
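As a minimal illustration of the update loop above (a sketch on a toy one-dimensional model of our own choosing: the quartic sites, quadrature grid, initialization, and sweep count are all arbitrary, and a practical implementation would add damping and guard against negative precisions), the iteration can be written as:

```python
import math

# Toy 1-D EP sketch. Sites f_i(x) = exp(-(x - y_i)^4 / 4) are illustrative,
# unimodal and log-concave; the y values below are arbitrary.
ys = [-1.0, 0.0, 2.0]
n = len(ys)
grid = [-6.0 + 0.01 * k for k in range(1201)]  # quadrature grid for hybrid moments

def site(i, x):
    return math.exp(-((x - ys[i]) ** 4) / 4)

r = [0.0] * n     # natural parameters r_i of the site-approximations q_i
beta = [1.0] * n  # natural parameters beta_i (precisions), initialized arbitrarily

for sweep in range(50):
    for i in range(n):
        # cavity q_{-i}: remove site i from q by subtracting natural parameters
        r_cav = sum(r) - r[i]
        b_cav = sum(beta) - beta[i]
        # hybrid h_i(x) = q_{-i}(x) f_i(x) up to normalization; moments by quadrature
        w = [math.exp(r_cav * x - b_cav * x * x / 2) * site(i, x) for x in grid]
        z = sum(w)
        mean = sum(wi * x for wi, x in zip(w, grid)) / z
        var = sum(wi * (x - mean) ** 2 for wi, x in zip(w, grid)) / z
        # moment-match the hybrid (KL projection), then divide out the cavity
        beta[i] = 1.0 / var - b_cav
        r[i] = mean / var - r_cav

mu_ep = sum(r) / sum(beta)  # mean of the global Gaussian approximation q
v_ep = 1.0 / sum(beta)      # variance of q
print(mu_ep, v_ep)
```

Here the hybrid moments are obtained by brute-force quadrature; real implementations compute them analytically or with one-dimensional Gaussian quadrature adapted to each site.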
We conjecture that, when all sites are log-concave (one of our hypotheses to control the behavior of EP), it is in fact unique, but we can\u2019t offer a proof yet. If p(x) isn\u2019t log-concave, it is straightforward to construct examples in which EP has multiple fixed-points. These open questions won\u2019t matter for our result, because we will show that all fixed-points of EP (should there be more than one) produce a good approximation of p(x).\nFixed points of EP have a very interesting characterization. If we note q*_i the site-approximations at a given fixed-point, h*_i the corresponding hybrid distributions, and q* the global approximation of p(x), then the mean and variance of all the hybrids and of q* are the same(1). As we will show in section 2.2, this leads to a very tight bound on the possible positions of these fixed-points.\n\n1.5 Notation\n\nWe will use the following notation repeatedly. p(x) = \u220f_i fi(x) is the target distribution we want to approximate. The sites fi(x) are each approximated by a Gaussian site-approximation qi(x), yielding an approximation to p(x) \u2248 q(x) = \u220f_i qi(x). The hybrids hi(x) interpolate between q(x) and p(x) by replacing one site-approximation qi(x) with the true site fi(x).\nOur results make heavy use of the log-functions of the sites and of the target distribution. We note \u03c6i(x) = \u2212log(fi(x)) and \u03c6p(x) = \u2212log(p(x)) = \u2211_i \u03c6i(x). We will introduce in section 2 hypotheses on these functions. Parameter \u03b2m controls their minimum curvature and parameters Kd control the maximum dth derivative.\nWe will always consider fixed-points of EP, where the mean and variance under all hybrids and q(x) are identical. We will note these common values \u00b5EP and vEP. We will also refer to the third and fourth centered moments of the hybrids, denoted by m^i_3 and m^i_4, and to the fourth moment of q(x), which is simply 3 vEP^2. 
We will show how all these moments are related to the true moments of the target distribution, which we will note \u00b5, v for the mean and variance, and m^p_3, m^p_4 for the third and fourth moments. We also investigate the quality of the CGA: \u00b5 \u2248 x* and v \u2248 [\u03c6''_p(x*)]^\u22121, where x* is the mode of p(x).\n\n(1) For non-Gaussian approximations, the expected values of all sufficient statistics of the exponential family are equal.\n\n2 Results\n\nIn this section, we will give tight bounds on the quality of the EP approximation (ie: of fixed-points of the EP iteration). Our results lean on the properties of log-concave distributions [10]. In section 2.1, we introduce new bounds on the moments of log-concave distributions. The bounds show that those distributions are, in a certain sense, close to being Gaussian. We then apply these results to study fixed points of EP, where they enable us to compute bounds on the distance between the mean and variance of the true distribution p(x) and of the approximation given by EP, which we do in section 2.2.\nOur bounds require us to assume that all sites fi(x) are \u03b2m-strongly log-concave with slowly-changing log-functions. That is, if we note \u03c6i(x) = \u2212log(fi(x)):\n\n\u2200i \u2200x \u03c6''_i(x) \u2265 \u03b2m > 0 (1)\n\u2200i \u2200d \u2208 [3, 4, 5, 6] |\u03c6^(d)_i(x)| \u2264 Kd (2)\n\nThe target distribution p(x) then inherits those properties from the sites. 
Noting \u03c6p(x) = \u2212log(p(x)) = \u2211_i \u03c6i(x), \u03c6p is then n\u03b2m-strongly log-concave and its higher derivatives are bounded:\n\n\u2200x \u03c6''_p(x) \u2265 n\u03b2m (3)\n\u2200d \u2208 [3, 4, 5, 6] |\u03c6^(d)_p(x)| \u2264 nKd (4)\n\nA natural concern here is whether or not our conditions on the sites are of practical interest. Indeed, strongly log-concave likelihoods are rare. We picked these strong regularity conditions because they make the proofs relatively tractable (although still technical and long). The proof technique carries over to more complicated, but more realistic, cases. One such interesting generalization is the case in which p(x) and all hybrids at the fixed-point are log-concave with slowly changing log-functions (with possibly differing constants). In such a case, while the math becomes more unwieldy, bounds similar to ours can be found, greatly extending the scope of our results. The results we present here should thus be understood as a stepping stone and not as the final word on the quality of the EP approximation: we have focused on providing a rigorous but extensible proof.\n\n2.1 Log-concave distributions are strongly constrained\n\nLog-concave distributions have many interesting properties. They are of course unimodal, and the family is closed under both marginalization and multiplication. For our purposes, however, the most important property is a result due to Brascamp and Lieb [11], which bounds their even moments. We give here an extension to the case of log-concave distributions with slowly changing log-functions (as quantified by eq. (2)). 
Our results show that these are close to being Gaussian.\nThe Brascamp-Lieb inequality states that, if LC(x) \u221d exp(\u2212\u03c6(x)) is \u03b2m-strongly log-concave (ie: \u03c6''(x) \u2265 \u03b2m), then the centered even moments of LC are bounded by the corresponding moments of a Gaussian with variance \u03b2m^\u22121. If we note these moments m2k and \u00b5LC = E_LC(x) the mean of LC:\n\nm2k = E_LC((x \u2212 \u00b5LC)^{2k})\nm2k \u2264 (2k \u2212 1)!! \u03b2m^\u2212k (5)\n\nwhere (2k \u2212 1)!! is the double factorial: the product of all odd terms from 1 to 2k \u2212 1: 3!! = 3, 5!! = 15, 7!! = 105, etc. This result can be understood as stating that a log-concave distribution must have a small variance, but doesn\u2019t generally need to be close to a Gaussian.\nWith our hypothesis of slowly changing log-functions, we were able to improve on this result. Our improved results include a bound on odd moments, as well as first-order expansions of even moments (eqs. (6)-(9)).\nOur extension to the Brascamp-Lieb inequality is as follows. If \u03c6 is slowly changing in the sense that some of its higher derivatives are bounded, as per eq. (2), then we can give a bound on \u03c6'(\u00b5LC) (showing that \u00b5LC is close to the mode x* of LC, see eqs. (10) to (13)) and on m3 (showing that LC is mostly symmetric):\n\n|\u03c6'(\u00b5LC)| \u2264 K3/(2\u03b2m) (6)\n|m3| \u2264 2 K3/\u03b2m^3 (7)\n\nand we can compute the first-order expansions of m2 and m4, and bound the errors in terms of \u03b2m and the K\u2019s:\n\n|m2^\u22121 \u2212 \u03c6''(\u00b5LC)| \u2264 K3^2/\u03b2m^2 + K4/(2\u03b2m) (8)\n|\u03c6''(\u00b5LC) m4 \u2212 3 m2| \u2264 (19/2) K4/\u03b2m^3 + (5/2) K3^2/\u03b2m^4 (9)\n\nWith eq. (8) and (9), we see that m2 \u2248 [\u03c6''(\u00b5LC)]^\u22121 and m4 \u2248 3 [\u03c6''(\u00b5LC)]^\u22122 and, in that sense, that LC(x) is close to the Gaussian with mean \u00b5LC and inverse-variance \u03c6''(\u00b5LC).\nThese expansions could be extended to further orders, and similar formulas can be found for the other moments of LC(x): for example, any odd moment can be bounded by |m_{2k+1}| \u2264 Ck K3 \u03b2m^\u2212(k+1) (with Ck some constant) and any even moment can be found to have the first-order expansion m2k \u2248 (2k \u2212 1)!! [\u03c6''(\u00b5LC)]^\u2212k. The proof, as well as more detailed results, can be found in the Supplement.\nNote how our result relates to the Bernstein-von Mises theorem, which says that, in the limit of a large amount of observations, a posterior p(x) tends towards its CGA. If we consider the posterior obtained from n likelihood functions that are all log-concave and slowly changing, our results show the slightly different result that the moments of that posterior are close to those of a Gaussian with mean \u00b5LC (instead of x*_LC) and inverse-variance \u03c6''(\u00b5LC) (instead of \u03c6''(x*_LC)). This point is critical. While the CGA still ends up capturing the limit behavior of p, as \u00b5LC \u2192 x* in the large-data limit (see eq. (13) below), an approximation that would return the Gaussian approximation at \u00b5LC would be better. This is essentially what EP does, and this is how it improves on the CGA.\n\n2.2 Computing bounds on EP approximations\n\nIn this section, we consider a given EP fixed-point q*_i(x | ri, \u03b2i) and the corresponding approximation of p(x): q*(x | r = \u2211 ri, \u03b2 = \u2211 \u03b2i). We will show that the expected value and variance of q* (resp. \u00b5EP and vEP) are close to the true mean and variance of p (resp. \u00b5 and v), and also investigate the quality of the CGA (\u00b5 \u2248 x*, v \u2248 [\u03c6''_p(x*)]^\u22121).\nUnder our assumptions on the sites (eq. (1) and (2)), we are able to derive bounds on the quality of the EP approximation. The proof is quite involved and long, and we will only present it in the Supplement. In the main text, we give a partial version: we detail the first step of the demonstration, which consists of computing a rough bound on the distance between the true mean \u00b5, the EP approximation \u00b5EP and the mode x*, and give an outline of the rest of the proof.\nLet\u2019s show that \u00b5, \u00b5EP and x* are all close to one another. We start from eq. (6) applied to p(x):\n\n|\u03c6'_p(\u00b5)| \u2264 K3/(2\u03b2m) (10)\n\nwhich tells us that \u03c6'_p(\u00b5) \u2248 0. \u00b5 must thus be close to x*. 
Indeed:\n\n|\u03c6'_p(\u00b5)| = |\u03c6'_p(\u00b5) \u2212 \u03c6'_p(x*)| (11)\n= |\u03c6''_p(\u03be)| |\u00b5 \u2212 x*| for some \u03be \u2208 [\u00b5, x*] (12)\n\u2265 n\u03b2m |\u00b5 \u2212 x*|\n\nCombining eq. (10) and (12), we finally have:\n\n|\u00b5 \u2212 x*| \u2264 n^\u22121 K3/(2\u03b2m^2) (13)\n\nLet\u2019s now show that \u00b5EP is also close to x*. We proceed similarly, starting from eq. (6) but applied to all hybrids hi(x):\n\n\u2200i |\u03c6'_i(\u00b5EP) + \u03b2_{\u2212i}\u00b5EP \u2212 r_{\u2212i}| \u2264 n^\u22121 K3/(2\u03b2m) (14)\n\nwhich is not really equivalent to eq. (10) yet. Recall that q(x|r, \u03b2) has mean \u00b5EP: we thus have r = \u03b2\u00b5EP. Which gives:\n\n(\u2211_i \u03b2_{\u2212i}) \u00b5EP = ((n \u2212 1)\u03b2) \u00b5EP = (n \u2212 1)r = \u2211_i r_{\u2212i} (15)\n\nIf we sum all terms in eq. (14), the \u03b2_{\u2212i}\u00b5EP and r_{\u2212i} thus cancel, leaving us with:\n\n|\u03c6'_p(\u00b5EP)| \u2264 K3/(2\u03b2m) (16)\n\nwhich is equivalent to eq. (10) but for \u00b5EP instead of \u00b5. This shows that \u00b5EP is, like \u00b5, close to x*:\n\n|\u00b5EP \u2212 x*| \u2264 n^\u22121 K3/(2\u03b2m^2) (17)\n\nAt this point, we can show that, since they are both close to x* (eq. (13) and (17)), \u00b5 = \u00b5EP + O(n^\u22121), which constitutes the first step of our computation of bounds on the quality of EP.\nAfter computing this, the next step is evaluating the quality of the approximation of the variance, via computing |v^\u22121 \u2212 v_EP^\u22121| for EP and |v^\u22121 \u2212 \u03c6''_p(x*)| for the CGA, from eq. (8). In both cases, we find:\n\nv^\u22121 = v_EP^\u22121 + O(1) (18)\nv^\u22121 = \u03c6''_p(x*) + O(1) (19)\n\nSince v^\u22121 is of order n, because of eq. (5) (the Brascamp-Lieb upper bound on the variance), this is a decent approximation: the relative error is of order n^\u22121.\nWe can find similarly that both EP and the CGA do a good job of approximating the fourth moment of p, m^p_4. For EP this means that the fourth moment of each hybrid and of q are a close match:\n\n\u2200i m^p_4 \u2248 m^i_4 \u2248 3 vEP^2 (20)\n\u2248 3 [\u03c6''_p(\u00b5)]^\u22122 (21)\n\nIn contrast, the third moment of the hybrids doesn\u2019t match at all the third moment of p, but their sum does!\n\nm^p_3 \u2248 \u2211_i m^i_3 (22)\n\nFinally, we come back to the approximation of \u00b5 by \u00b5EP. These obey two very similar relationships:\n\n\u03c6'_p(\u00b5) + \u03c6^(3)_p(\u00b5) v/2 = O(n^\u22121) (23)\n\u03c6'_p(\u00b5EP) + \u03c6^(3)_p(\u00b5EP) vEP/2 = O(n^\u22121) (24)\n\nSince v = vEP + O(n^\u22122) (a slight rephrasing of eq. (18)), we finally have:\n\n\u00b5 = \u00b5EP + O(n^\u22122) (25)\n\nWe summarize the results in the following theorem:\nTheorem 1. Characterizing fixed-points of EP\nUnder the assumptions given by eq. 
(1) and (2) (log-concave sites with slowly changing log), we can bound the quality of the EP approximation and the CGA:\n\n|\u00b5 \u2212 x*| \u2264 n^\u22121 K3/(2\u03b2m^2)\n|\u00b5 \u2212 \u00b5EP| \u2264 B1(n) = O(n^\u22122)\n|v^\u22121 \u2212 \u03c6''_p(x*)| \u2264 2 K3^2/\u03b2m^2 + K4/(2\u03b2m)\n|v^\u22121 \u2212 v_EP^\u22121| \u2264 B2(n) = O(1)\n\nWe give the full expression for the bounds B1 and B2 in the Supplement.\nNote that the order of magnitude of the bound on |\u00b5 \u2212 x*| is the best possible, because it is attained for certain distributions. For example, consider a Gamma distribution with natural parameters (n\u03b1, n\u03b2), whose mean \u03b1/\u03b2 is approximated at order n^\u22121 by its mode \u03b1/\u03b2 \u2212 1/(n\u03b2). More generally, from eq. (23), we can compute the first order of the error:\n\n\u00b5 \u2212 x* \u2248 \u2212 (\u03c6^(3)_p(\u00b5)/\u03c6''_p(\u00b5)) v/2 \u2248 \u2212 (1/2) \u03c6^(3)_p(\u00b5)/[\u03c6''_p(\u00b5)]^2\n\nwhich is the term causing the order n^\u22121 error. Whenever this term is significant, it is thus safe to conclude that EP improves on the CGA.\nAlso note that, since v^\u22121 is of order n, the relative error of the v^\u22121 approximation is of order n^\u22121 for both methods. Despite having a convergence rate of the same order, the EP approximation is demonstrably better than the CGA, as we show next. Let us first see why the approximation of v^\u22121 is only of order 1 for both methods. The following relationship holds:\n\nv^\u22121 = \u03c6''_p(\u00b5) + \u03c6^(3)_p(\u00b5) m^p_3/(2v) + \u03c6^(4)_p(\u00b5) m^p_4/(3!v) + O(n^\u22121) (27)\n\nIn this relationship, \u03c6''_p(\u00b5) is an order n term while the rest are order 1. 
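As a quick numerical check of the Gamma example above (our own sketch; we read the natural parameters (n\u03b1, n\u03b2) as the shape and rate of the Gamma distribution, so that the mean is \u03b1/\u03b2 and the mode is \u03b1/\u03b2 \u2212 1/(n\u03b2), with \u03b1 = 2 and \u03b2 = 3 as arbitrary values):

```python
# Gamma with shape n*alpha and rate n*beta: mean = alpha/beta,
# mode = alpha/beta - 1/(n*beta). The mode, which the CGA uses as its
# mean estimate, is therefore off by exactly 1/(n*beta) = O(1/n).
alpha, beta = 2.0, 3.0

def gamma_mean_mode(n):
    shape, rate = n * alpha, n * beta
    return shape / rate, (shape - 1.0) / rate  # mean, mode of Gamma(shape, rate)

for n in (10, 100, 1000):
    mean, mode = gamma_mean_mode(n)
    print(n, mean - mode)  # decreases like 1/n
```

The gap between mean and mode is exactly 1/(n\u03b2), matching the O(n^\u22121) error attributed to the CGA mean estimate.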
If we now compare this to the CGA approximation of v^\u22121, we find that it fails at multiple levels. First, it completely ignores the two order 1 terms, and then, because it takes the value of \u03c6''_p at x*, which is at a distance of O(n^\u22121) from \u00b5, it adds another order 1 error term (since \u03c6^(3)_p = O(n)). The CGA is thus adding quite a bit of error, even if each component is of order 1.\nMeanwhile, vEP obeys a relationship similar to eq. (27):\n\nv_EP^\u22121 = \u03c6''_p(\u00b5EP) + [\u2211_i \u03c6^(3)_i(\u00b5EP) m^i_3/(2 vEP)] + \u03c6^(4)_p(\u00b5EP) 3 vEP^2/(3! vEP) + O(n^\u22121) (28)\n\nWe can see where the EP approximation produces errors. The \u03c6''_p term is well approximated: since |\u00b5 \u2212 \u00b5EP| = O(n^\u22122), we have \u03c6''_p(\u00b5) = \u03c6''_p(\u00b5EP) + O(n^\u22121). The term involving m4 is also well approximated, and we can see that the only term that fails is the m3 term. The order 1 error is thus entirely coming from this term, which shows that EP performance suffers more from the skewness of the target distribution than from its kurtosis.\nFinally, note that, with our result, we can get some intuitions about the quality of the EP approximation using other metrics. 
For example, if the most interesting metric is the KL divergence KL(p, q), the excess KL divergence from using the EP approximation q instead of the true minimizer qKL (which has the same mean \u00b5 and variance v as p) is given by:\n\n\u2206KL = \u222b p log(qKL/q) (29)\n= \u222b p(x) [ \u2212(x \u2212 \u00b5)^2/(2v) + (x \u2212 \u00b5EP)^2/(2 vEP) \u2212 (1/2) log(v/vEP) ] (30)\n= (1/2) [ v/vEP \u2212 1 \u2212 log(v/vEP) ] + (\u00b5 \u2212 \u00b5EP)^2/(2 vEP)\n\u2248 (1/4) [ (v \u2212 vEP)/vEP ]^2 + (\u00b5 \u2212 \u00b5EP)^2/(2 vEP) (31)\n\nwhich we recognize as KL(qKL, q). A similar formula gives the excess KL divergence from using the CGA instead of qKL. For both methods, the variance term is of order n^\u22122 (though it should be smaller for EP), but the mean term is of order n^\u22123 for EP while it is of order n^\u22121 for the CGA. Once again, EP is found to be the better approximation.\nFinally, note that our bounds are quite pessimistic: the true value might be a much better fit than we have predicted here.\nA first cause is the bounding of the derivatives of log(p) (eqs. (3), (4)): while those bounds are correct, they might prove to be very pessimistic. For example, if the contributions from the sites to the higher derivatives cancel each other out, a much lower bound than nKd might apply. Similarly, there might be another lower bound on the curvature much higher than n\u03b2m.\nAnother cause is the bounding of the variance from the curvature. 
While applying Brascamp-Lieb requires the distribution to have high log-curvature everywhere, a distribution with high curvature close to the mode and low curvature in the tails still has very low variance: in such a case, the Brascamp-Lieb bound is very pessimistic.\nIn order to improve on our bounds, we will thus need to use tighter bounds on the log-derivatives of the hybrids and of the target distribution, but we will also need an extension of the Brascamp-Lieb result that can deal with those cases where a distribution is strongly log-concave around its mode but, in the tails, the log-curvature is much lower.\n\n3 Conclusion\n\nEP has been used for quite some time now without any concrete theoretical guarantees on its performance. In this work, we provide explicit performance bounds and show that EP is superior to the CGA, in the sense of giving provably better approximations of the mean and variance. There are now theoretical arguments for substituting EP for the CGA in a number of practical problems where the gain in precision is worth the increased computational cost. This work tackled the first steps in proving that EP offers an appropriate approximation. Continuing in its tracks will most likely lead to more general and less pessimistic bounds, but it remains an open question how to quantify the quality of the approximation using other distance measures. For example, it would be highly useful for machine learning if one could show bounds on prediction error when using EP. We believe that our approach should extend to more general performance measures, and we plan to investigate this further in the future.\n\nReferences\n\n[1] Thomas P. Minka. Expectation Propagation for approximate Bayesian inference. In UAI \u201901: Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, pages 362\u2013369, San Francisco, CA, USA, 2001. 
Morgan Kaufmann Publishers Inc. ISBN 1-55860-800-1. URL http://portal.acm.org/citation.cfm?id=720257.\n\n[2] Malte Kuss and Carl E. Rasmussen. Assessing Approximate Inference for Binary Gaussian Process Classification. J. Mach. Learn. Res., 6:1679\u20131704, December 2005. ISSN 1532-4435. URL http://portal.acm.org/citation.cfm?id=1194901.\n\n[3] Hannes Nickisch and Carl E. Rasmussen. Approximations for Binary Gaussian Process Classification. Journal of Machine Learning Research, 9:2035\u20132078, October 2008. URL http://www.jmlr.org/papers/volume9/nickisch08a/nickisch08a.pdf.\n\n[4] Guillaume Dehaene and Simon Barthelm\u00e9. Expectation propagation in the large-data limit. Technical report, March 2015. URL http://arxiv.org/abs/1503.08060.\n\n[5] T. Minka. Divergence Measures and Message Passing. Technical report, 2005. URL http://research.microsoft.com/en-us/um/people/minka/papers/message-passing/minka-divergence.pdf.\n\n[6] M. Seeger. Expectation Propagation for Exponential Families. Technical report, 2005. URL http://people.mmci.uni-saarland.de/~mseeger/papers/epexpfam.pdf.\n\n[7] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 1st ed. 2006, corr. 2nd printing 2011 edition, October 2007. ISBN 0387310738.\n\n[8] Jack Raymond, Andre Manoel, and Manfred Opper. Expectation propagation, September 2014. URL http://arxiv.org/abs/1409.6179.\n\n[9] Anirban DasGupta. Asymptotic Theory of Statistics and Probability (Springer Texts in Statistics). Springer, 1 edition, March 2008. ISBN 0387759700.\n\n[10] Adrien Saumard and Jon A. Wellner. Log-concavity and strong log-concavity: A review. Statist. Surv., 8:45\u2013114, 2014. doi: 10.1214/14-SS107. URL http://dx.doi.org/10.1214/14-SS107.\n\n[11] Herm J. Brascamp and Elliott H. Lieb. Best constants in Young\u2019s inequality, its converse, and its generalization to more than three functions. Advances in Mathematics, 20(2):151\u2013173, May 1976. ISSN 00018708. doi: 10.1016/0001-8708(76)90184-5. URL http://dx.doi.org/10.1016/0001-8708(76)90184-5.\n", "award": [], "sourceid": 124, "authors": [{"given_name": "Guillaume", "family_name": "Dehaene", "institution": "University of Geneva"}, {"given_name": "Simon", "family_name": "Barthelm\u00e9", "institution": "Gipsa-lab CNRS"}]}