{"title": "Excess Risk Bounds for the Bayes Risk using Variational Inference in Latent Gaussian Models", "book": "Advances in Neural Information Processing Systems", "page_first": 5151, "page_last": 5161, "abstract": "Bayesian models are established as one of the main successful paradigms for complex problems in machine learning. To handle intractable inference, research in this area has developed new approximation methods that are fast and effective. However, theoretical analysis of the performance of such approximations is not well developed. The paper furthers such analysis by providing bounds on the excess risk of variational inference algorithms and related regularized loss minimization algorithms for a large class of latent variable models with Gaussian latent variables. We strengthen previous results for variational algorithms by showing they are competitive with any point-estimate predictor. Unlike previous work, we also provide bounds on the risk of the \emph{Bayesian} predictor and not just the risk of the Gibbs predictor for the same approximate posterior. The bounds are applied in complex models including sparse Gaussian processes and correlated topic models. Theoretical results are complemented by identifying novel approximations to the Bayesian objective that attempt to minimize the risk directly. An empirical evaluation compares the variational and new algorithms shedding further light on their performance.", "full_text":

Excess Risk Bounds for the Bayes Risk using Variational Inference in Latent Gaussian Models

Rishit Sheth and Roni Khardon
Department of Computer Science, Tufts University
Medford, MA, 02155, USA
rishit.sheth@tufts.edu | roni@cs.tufts.edu

Abstract

Bayesian models are established as one of the main successful paradigms for complex problems in machine learning.
To handle intractable inference, research in this area has developed new approximation methods that are fast and effective. However, theoretical analysis of the performance of such approximations is not well developed. The paper furthers such analysis by providing bounds on the excess risk of variational inference algorithms and related regularized loss minimization algorithms for a large class of latent variable models with Gaussian latent variables. We strengthen previous results for variational algorithms by showing that they are competitive with any point-estimate predictor. Unlike previous work, we provide bounds on the risk of the Bayesian predictor and not just the risk of the Gibbs predictor for the same approximate posterior. The bounds are applied in complex models including sparse Gaussian processes and correlated topic models. Theoretical results are complemented by identifying novel approximations to the Bayesian objective that attempt to minimize the risk directly. An empirical evaluation compares the variational and new algorithms, shedding further light on their performance.

1 Introduction

Bayesian models are established as one of the main successful paradigms for complex problems in machine learning. Since inference in complex models is intractable, research in this area is devoted to developing new approximation methods that are fast and effective (Laplace/Taylor approximation, variational approximation, expectation propagation, MCMC, etc.); these can be seen as algorithmic contributions. Much less is known about theoretical guarantees on the loss incurred by such approximations, either when the Bayesian model is correct or under model misspecification.

Several authors provide risk bounds for the Bayesian predictor (which aggregates predictions over its posterior and then predicts), e.g., see [15, 6, 12]. However, the analysis is specialized to certain classification or regression settings, and the results have not been shown to be applicable to complex Bayesian models and algorithms like the ones studied in this paper.

In recent work, [7] and [1] identified strong connections between variational inference [10] and PAC-Bayes bounds [14] and have provided oracle inequalities for variational inference. As we show in Section 3, similar results that are stronger in some aspects can be obtained by viewing variational inference as performing regularized loss minimization. These results are an exciting first step, but they are limited in two aspects. First, they hold for the Gibbs predictor (which samples a hypothesis and uses it to predict) and not the Bayesian predictor and, second, they are only meaningful against "weak" competitors. For example, the bounds go to infinity if the competitor is a point estimate with zero variance. In addition, these results do not explicitly address hierarchical Bayesian models where further development is needed to distinguish among different variational approximations in the literature. Another important result by [11] provides relative loss bounds for generalized linear models (GLM). These bounds can be translated to risk bounds and they hold against point estimates. However, they are limited to the predictions of the true Bayesian posterior, which is hard to compute.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

In this paper we strengthen these theoretical results and, motivated by them, make additional algorithmic and empirical contributions. In particular, we focus on latent Gaussian models (LGM) whose latent variables are normally distributed. We extend the technique of [11] to derive agnostic bounds for the excess risk of an approximate Bayesian predictor against any point-estimate competitor. We then apply these results to several models with two levels of latent variables, including generalized linear models (GLM), sparse Gaussian processes (sGP) [17, 26] and correlated topic models (CTM) [3], providing high-probability bounds on the risk. For CTM our results apply precisely to the variational algorithm, and for GLM and sGP they apply to a variant with a smoothed loss function.

Our results improve over [7, 1] by strengthening the bounds, showing that they can be applied directly to the variational algorithm, and showing that they apply to the Bayesian predictor. On the other hand, they improve over [11] in analyzing the approximate inference algorithms and in showing how to apply the bounds to a larger class of models.

Finally, viewing approximate inference as regularized loss minimization, our exploration of the hierarchical models shows that there is a mismatch between the objective being optimized by algorithms such as variational inference and the loss that defines our performance criterion. We identify three possible objectives corresponding respectively to a "simple variational approximation", the "collapsed variational approximation", and to a new algorithm performing direct regularized loss minimization instead of optimizing the variational objective. We explore these ideas empirically in CTM. Experimental results confirm that each variant is the "best" for optimizing its own implicit objective, and therefore direct loss minimization, for which we do not yet have a theoretical analysis, might be the algorithm of choice. However, they also show that the collapsed approximation comes close to direct loss minimization.
The concluding section of the paper further discusses the results.

2 Preliminaries

2.1 Learning Model, Hypotheses and Risk

We consider the standard PAC setting where $n$ samples are drawn i.i.d. according to an unknown joint distribution $D$ over the sample space $z$. This captures the supervised case where $z = (x, y)$ and the goal is to predict $y|x$. In the unsupervised case, $z = y$ and we are simply modeling the distribution. To treat both cases together we always include $x$ in the notation but fix it to a dummy value in the unsupervised case.

A learning algorithm outputs a hypothesis $h$ which induces a distribution $p_h(y|x)$. One would normally use this predictive distribution and an application-specific loss to pick the prediction. Following previous work, we primarily focus on log loss, i.e., the loss of $h$ on example $(x^*, y^*)$ is $\ell(h, (x^*, y^*)) = -\log p_h(y^*|x^*)$. In cases where this loss is not bounded, a smoothed and bounded variant of the log loss can be defined as $\tilde{\ell}(h, (x^*, y^*)) = -\log\big((1-\alpha) p_h(y^*|x^*) + \alpha\big)$, where $0 < \alpha < 1$. We state our results w.r.t. log loss, and demonstrate, by example, how the smoothed log loss can be used. Later, we briefly discuss how our results hold more generally for losses that are convex in $p$.

We start by considering one-level (1L) latent variable models given by $p(w) p(y|w, x)$ where $p(y|w, x) = \prod_i p(y_i|w, x_i)$. For example, in Bayesian logistic regression, $w$ is the hidden weight vector, the prior $p(w)$ is given by a Normal distribution $N(w|\mu, \Sigma)$ and the likelihood term is $p(y|w, x) = \sigma(y w^T x)$ where $\sigma()$ is the sigmoid function. A hypothesis $h$ represents a distribution $q(w)$ over $w$, where point estimates for $w$ are modeled as delta functions. Regardless of how $h$ is computed, the Bayesian predictor calculates a predictive distribution $p_h(y|x) = E_{q(w)}[p(y|w, x)]$ and accordingly its risk is defined as $r_{Bay}(q(w)) = E_{(x,y)\sim D}[-\log p_h(y|x)] = E_{(x,y)\sim D}[-\log E_{q(w)}[p(y|w, x)]]$.

Following previous work we also analyze the average risk of the Gibbs predictor which draws a random $w$ from $q(w)$ and predicts using $p(y|w, x)$. Although the Gibbs predictor is not an optimal strategy, its analysis has been found useful in previous work and it serves as an intermediate step in our results. Assuming the draw of $w$ is done independently for each $x$ we get: $r_{Gib}(q(w)) = E_{(x,y)\sim D}[E_{q(w)}[-\log p(y|w, x)]]$. Previous work has defined the Gibbs risk with expectations in reversed order. That is, the algorithm draws a single $w$ and uses it for prediction on all examples. We find the one given here more natural. Some of our results require the two definitions to be equivalent, i.e., the conditions for Fubini's theorem must hold. We make this explicit in

Assumption 1. $E_{(x,y)\sim D}[E_{q(w)}[-\log p(y|w, x)]] = E_{q(w)}[E_{(x,y)\sim D}[-\log p(y|w, x)]]$.

This is a relatively mild assumption. It clearly holds when $y$ takes discrete values, where $p(y|x, w) \le 1$ implies that the log loss is positive and Fubini's theorem applies. In the case of continuous $y$, upper bounded likelihood functions imply that a translation of the loss function satisfies the condition of Fubini's theorem. For example, if $p(y|x, w) = N(y|f(w, x), \sigma^2)$ where $\sigma^2$ is a hyperparameter, then $\log p(y|x, w) \le B = -\log(\sqrt{2\pi}) - \log(\sigma)$. Therefore, $-\log p(y|x, w) + B \ge 0$ so that if we redefine^1 the loss by adding the constant $B$, then the loss is positive and Fubini's theorem applies. More generally, we might need to enforce constraints on $D$, $q(w)$, and/or $p(y|x, w)$.

2.2 Variational Learners for Latent Variable Models

Approximate inference generally limits $q(w)$ to some fixed family of distributions $Q$ (e.g. the family of normal distributions, or the family of products of independent components in the mean-field approximation). Given a dataset $S = \{(x_i, y_i)\}_{i=1}^n$, we define the following general problem,

$$q^\star = \arg\min_{q \in Q} \left\{ \frac{1}{\eta} \mathrm{KL}\big(q(w)\,\|\,p(w)\big) + L(w, S) \right\}, \qquad (1)$$

where KL denotes Kullback-Leibler divergence. Standard variational inference uses $\eta = 1$ and $L(w, S) = -\sum_i E_{q(w)}[\log p(y_i|w, x_i)]$, and it is well known that (1) is the optimization of a lower bound on $p(y)$. If $-\log p(y_i|w, x_i)$ is replaced with a general loss function, then (1) may no longer correspond to a lower bound on $p(y)$. In any case, the output of (1), denoted by $q^\star_{Gib}$, is achieved via regularized cumulative-loss minimization (RCLM) which optimizes a sum of training set error and a regularization function. In particular, $q^\star_{Gib}$ uses a KL regularizer and optimizes the Gibbs risk $r_{Gib}$ in contrast to the Bayes risk $r_{Bay}$. This motivates some of the analysis in the paper.

Many interesting Bayesian models have two levels (2L) of latent variables given by $p(w)\, p(f|w, x) \prod_i p(y_i|f_i)$ where both $w$ and $f$ are latent. Of course one can treat $(w, f)$ as one set of parameters and apply the one-level model, but this does not capture the hierarchical structure of the model.
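To make the distinction between the two predictors concrete, the following sketch estimates the empirical Bayes and Gibbs log-loss risks by Monte Carlo for a toy Bayesian logistic regression with a Gaussian $q(w)$. This is a minimal illustration, not the paper's algorithm; the data, constants, and function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def mc_risks(m, v, X, y, n_samples=2000):
    """Monte Carlo estimates of the empirical risks for q(w) = N(m, v*I).

    Bayes risk: -log E_q[p(y|w,x)]  (average the likelihood, then -log).
    Gibbs risk:  E_q[-log p(y|w,x)] (-log per sample, then average).
    """
    M = m.shape[0]
    W = m + np.sqrt(v) * rng.standard_normal((n_samples, M))  # draws from q(w)
    probs = sigmoid(y[None, :] * (W @ X.T))   # p(y_i | w_s, x_i), shape (S, n)
    bayes = np.mean(-np.log(probs.mean(axis=0)))
    gibbs = np.mean(-np.log(probs))
    return bayes, gibbs

# Illustrative synthetic data with labels in {-1, +1}.
X = rng.standard_normal((50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = np.where(rng.random(50) < sigmoid(X @ w_true), 1.0, -1.0)

bayes, gibbs = mc_risks(m=np.zeros(3), v=0.5, X=X, y=y)
assert bayes <= gibbs + 1e-9   # Jensen: r_Bay(q) <= r_Gib(q)
```

By Jensen's inequality the Bayes estimate can never exceed the Gibbs estimate, which is exactly the relation $r_{Bay}(q) \le r_{Gib}(q)$ exploited later when converting Gibbs bounds into Bayes bounds.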
The standard approach in the literature infers a posterior on $w$ via a variational distribution $q(w) q(f|w)$, and assumes that $q(w)$ is sufficient for predicting $p(y^*|x^*)$. We refer to this structural assumption, i.e., $p(f^*, f|w, x, x^*) = p(f^*|w, x^*)\, p(f|w, x)$, as Conditional Independence. It holds in models where an additional factorization $p(f|w, x) = \prod_i p(f_i|w, x_i)$ holds, e.g., in GLM, CTM. In the case of sparse Gaussian processes (sGP), Conditional Independence does not hold, but it is required in order to reduce the cubic complexity of the algorithm, and it has been used in all prior work on sGP. Assuming Conditional Independence, the definition of risk extends naturally from the one-level model by writing $p(y|w, x) = E_{p(f|w,x)}[p(y|f)]$ to get:

$$r_{2Bay}(q(w)) = E_{(x,y)\sim D}\big[-\log E_{q(w)}[E_{p(f|w,x)}[p(y|f)]]\big], \qquad (2)$$

$$r_{2Gib}(q(w)) = E_{(x,y)\sim D}\big[E_{q(w)}[-\log E_{p(f|w,x)}[p(y|f)]]\big]. \qquad (3)$$

Even though Conditional Independence is used in prediction, the learning algorithm must decide how to treat $q(f|w)$ during the optimization of $q(w)$. The mean field approximation uses $q(w) q(f)$ in the optimization. We analyze two alternatives that have been used in previous work. The approximation $q(f|w) = p(f|w)$, used in sparse GP [26, 8, 23], is described by (1) with $L(w, S) = -\sum_i E_{q(w)}[E_{p(f_i|w,x_i)}[\log p(y_i|f_i)]]$. We denote this by $q^\star_{2A}$ and observe it is the RCLM solution for the risk defined as

$$r_{2A}(q(w)) = E_{(x,y)\sim D}\big[E_{q(w)}[E_{p(f|w,x)}[-\log p(y|f)]]\big]. \qquad (4)$$

As shown by [25, 9, 22], alternatively, for each $w$, we can pick the optimal $q(f|w) = p(f|w, S)$. Following [25] we call this a collapsed approximation. This leads to (1) with $L(w, S) = -E_{q(w)}[\log E_{p(f|w,x)}[\prod_i p(y_i|f_i)]]$ and is denoted by $q^\star_{2Bj}$ (joint expectation). For models where $p(f|w) = \prod_i p(f_i|w)$, this simplifies to $L(w, S) = -\sum_i E_{q(w)}[\log E_{p(f_i|w,x_i)}[p(y_i|f_i)]]$, and we denote the algorithm by $q^\star_{2Bi}$ (independent expectation). Note that $q^\star_{2Bi}$ performs RCLM for the risk given by $r_{2Gib}$ even if the factorization does not hold.

^1 For the smoothed log loss, the translation can be applied prior to the re-scaling, i.e., $-\log\big(\frac{1-\alpha}{\max_{w,x,y} p(y|w,x)}\, p(y|w, x) + \alpha\big)$.

Finally, viewing approximate inference as performing RCLM, we observe a discrepancy between our definition of risk in (2) and the loss function being optimized by existing algorithms, e.g., variational inference. This perspective suggests direct loss minimization described by the alternative $L(w, S) = -\sum_i \log E_{q(w)}[E_{p(f_i|w,x_i)}[p(y_i|f_i)]]$ in (1) and which we denote $q^\star_{2D}$. In this case, $q^\star_{2D}$ is a "posterior" but one for which we do not have a Bayesian interpretation.

Given the discussion so far, we can hope to get some analysis for regularized loss minimization where each of the algorithms implicitly optimizes a different definition of risk. Our goal is to identify good algorithms for which we can bound the definition of risk we care about, $r_{2Bay}$, as defined in (2).

3 RCLM

Regularized loss minimization has been analyzed for general hypothesis spaces and losses. For hypothesis space $H$ and hypothesis $h \in H$ we have loss function $\ell(h, (x, y))$, and associated risk $r(h) = E_{(x,y)\sim D}[\ell(h, (x, y))]$. Now, given a regularizer $R : H \to 0 \cup \mathbb{R}^+$, a non-negative scalar $\eta$, and sample $S$, regularized cumulative loss minimization is defined as

$$\mathrm{RCLM}(H, \ell, R, \eta, S) = \arg\min_{h \in H} \left( \frac{1}{\eta} R(h) + \sum_i \ell(h, (x_i, y_i)) \right). \qquad (5)$$

Theorem 1 ([20]^2). Assume that the regularizer $R(h)$ is $\sigma$-strong-convex in $h$ and the loss $\ell(h, (x, y))$ is $\rho$-Lipschitz and convex in $h$, and let $h^\star(S) = \mathrm{RCLM}(H, \ell, R, \eta, S)$. Then, for all $h \in H$, $E_{S\sim D^n}[r(h^\star(S))] \le r(h) + \frac{1}{\eta n} R(h) + \frac{4\rho^2\eta}{\sigma}$.

The theorem bounds the expectation of the risk. Using Markov's inequality we can get a high probability bound: with probability $\ge 1-\delta$, $r(h^\star(S)) \le r(h) + \frac{1}{\delta}\big(\frac{1}{\eta n} R(h) + \frac{4\rho^2\eta}{\sigma}\big)$. Tighter dependence on $\delta$ can be achieved for bounded losses using standard techniques. To simplify the presentation we keep the expectation version throughout the paper.

For this paper we specialize RCLM for Bayesian algorithms, that is, $H$ corresponds to the parameter space for a parameterized family of (possibly degenerate) distributions, denoted $Q$, where $q \in Q$ is a distribution over a base parameter space $w$.

We have already noted above that $q^\star_{Gib}(w)$, $q^\star_{2Bi}(w)$ and $q^\star_{2D}(w)$ are RCLM algorithms. We can therefore get immediate corollaries for the corresponding risks (see supplementary material). Such results are already useful, but the convexity and $\rho$-Lipschitz conditions are not always easy to analyze or guarantee. We next show how to use recent ideas from PAC-Bayes analysis to derive a similar result for Gibbs risk with less strong requirements. We first develop the result for the one-level model. Toward this, define the loss and risk for individual base parameters as $\ell_W(w, (x, y))$, and $r_W(w) = E_D[\ell_W(w, (x, y))]$, and the empirical estimate $\hat{r}_W(w, S) = \frac{1}{n}\sum_i \ell_W(w, (x_i, y_i))$. Following [7], let $\Psi(\lambda, n) = \log E_{S\sim D^n}[E_{p(w)}[e^{\lambda(r_W(w) - \hat{r}_W(w, S))}]]$ where $\lambda$ is an additional parameter. Combining arguments from [20] with the use of the compression lemma [2] as in [7] we can derive the following bound (proof in supplementary material):

Theorem 2. For all $q \in Q$, $E_{S\sim D^n}[r_{Gib}(q^\star_{Gib}(w))] \le r_{Gib}(q) + \frac{1}{\eta n}\mathrm{KL}(q\|p) + \frac{1}{\lambda}\max_{q\in Q}\mathrm{KL}(q\|p) + \frac{1}{\lambda}\Psi(\lambda, n)$.

The theorem applies to the two-level model by writing $p(y|w) = E_{p(f|w)}[p(y|f)]$. This yields

Corollary 3. For all $q \in Q$, $E_{S\sim D^n}[r_{2Gib}(q^\star_{2Bi}(w))] \le r_{2Gib}(q) + \frac{1}{\eta n}\mathrm{KL}(q\|p) + \frac{1}{\lambda}\max_{q\in Q}\mathrm{KL}(q\|p) + \frac{1}{\lambda}\Psi(\lambda, n)$.

^2 [20] analyzed regularized average loss but the same proof steps with minor modifications yield the statement for cumulative loss given here.

A similar result has already been derived by [1] without making the explicit connection to RCLM. However, the implied algorithm uses a "regularization factor" $\lambda$ which may not coincide with $\eta = 1$, whereas standard variational inference can be analyzed with Theorem 2 (or Corollary 3).

The work of [4, 7] showed how the $\Psi$ term can be bounded. Briefly, if $\ell_W(w, (x, y))$ is bounded in $[a, b]$, then $\Psi(\lambda, n) \le \frac{\lambda^2(b-a)^2}{2n}$; if $\ell_W(w, (x, y))$ is not bounded, but the random variable $r_W(w) - \ell_W(w, (x, y))$ is sub-Gaussian or sub-gamma, then $\Psi(\lambda, n)$ can be bounded with additional assumptions on the underlying distribution $D$. More details are in the supplementary material.

4 Concrete Bounds on Excess Risk in LGM

The LGM family is a special case of the two-level model where the prior $p(w)$ over the $M$-dimensional parameter $w$ is given by a Normal distribution. Following previous work we let $Q$ be a family of Normal distributions. For the analysis we further restrict $Q$ by placing bounds on the mean and covariance as follows: $Q = \{N(w|m, V) \text{ s.t. } \|m\|_2 \le B_m,\ \lambda_{min}(V) \ge \epsilon,\ \lambda_{max}(V) \le B_V\}$ for some $\epsilon > 0$. The KL divergence from $q(w) = N(w|m, V)$ to $p(w) = N(w|\mu, \Sigma)$ is given by

$$\mathrm{KL}(q\|p) = \frac{1}{2}\left( \mathrm{tr}(\Sigma^{-1}V) + (\mu - m)^T \Sigma^{-1}(\mu - m) + \log\frac{|\Sigma|}{|V|} - M \right).$$

4.1 General Bounds on Excess Risk in LGM Against Point Estimates

First, we note that $\mathrm{KL}(q\|p)$ is bounded under a lower bound on the minimum eigenvalue of $V$ (the proof in the supplementary material follows from linear algebra identities):

Lemma 4. Let $B'_R = \frac{1}{2}\left( \frac{M B_V + \|\mu\|_2^2 + B_m^2}{\lambda_{min}(\Sigma)} + M\log(\lambda_{max}(\Sigma)) - M \right)$. For $q \in Q$,

$$\mathrm{KL}(q\|p) \le B_R = \frac{1}{2}\left( \frac{M B_V + \|\mu\|_2^2 + B_m^2}{\lambda_{min}(\Sigma)} + M\log\left(\frac{\lambda_{max}(\Sigma)}{\epsilon}\right) - M \right) = B'_R - \frac{1}{2}M\log\epsilon. \qquad (6)$$

The risk bounds of the previous section do not allow for point-estimate competitors because the KL portion is not bounded. We next generalize a technique from [11] showing that adding a little variance to a point estimate does not hurt too much. This allows us to derive the promised bounds. In the following, $\epsilon > 0$ is a constant whose value is determined in the proof. For any $\hat{w}$, we consider the $\epsilon$-inflated distribution $q(w) = N(w|\hat{w}, \epsilon I)$ and calculate the distribution's Gibbs risk w.r.t. a generic loss $\ell : \mathbb{R}^M \times (X \times Y) \mapsto \mathbb{R}$.

Lemma 5.
Specifically, we consider the (1L or 2L) Gibbs risk $r(q) = E_{(x,y)\sim D}[E_{q(w)}[\ell(w, (x, y))]]$. If (i) $\ell(w, (x, y))$ is continuously differentiable in $w$ up to order 2, and (ii) $\lambda_{max}\big(\nabla^2_w \ell(w, (x, y))\big) \le B_H$, then for $\hat{w} \in \mathbb{R}^M$ and $q(w) = N(w|\hat{w}, \epsilon I)$

$$r_{Gib}\big(q(w)\big) = E_{(x,y)\sim D}[E_{q(w)}[\ell(w, (x, y))]] \le r_{Gib}\big(\delta(w - \hat{w})\big) + \frac{1}{2}\epsilon M B_H. \qquad (7)$$

Proof. By the multivariable Taylor's theorem, for $\hat{w} \in \mathbb{R}^M$

$$\ell(w, (x, y)) = \ell(\hat{w}, (x, y)) + \left(\nabla_w \ell(w, (x, y))\Big|_{w=\hat{w}}\right)^T (w - \hat{w}) + \frac{1}{2}(w - \hat{w})^T \left(\nabla^2_w \ell(w, (x, y))\Big|_{w=\tilde{w}}\right)(w - \hat{w})$$

where $\nabla_w \ell(w, (x, y))$ and $\nabla^2_w \ell(w, (x, y))$ denote the gradient and Hessian, and $\tilde{w} = (1-\alpha)\hat{w} + \alpha w$ for some $\alpha \in [0, 1]$ where $\alpha$ is a function of $w$. Taking the expectation results in

$$E_{q(w)}[\ell(w, (x, y))] = \ell(\hat{w}, (x, y)) + \frac{1}{2}E_{q(w)}\big[(w - \hat{w})^T\, \nabla^2_w \ell(w, (x, y))\big|_{w=\tilde{w}}\, (w - \hat{w})\big]. \qquad (8)$$

If the maximum eigenvalue of $\nabla^2_w \ell(w, (x, y))$ is bounded uniformly by some $B_H < \infty$, then the second term of (8) is bounded above by $\frac{1}{2}B_H\, E[(w - \hat{w})^T(w - \hat{w})] = \frac{1}{2}\epsilon M B_H$. Taking expectation w.r.t. $D$ yields the statement of the lemma.

Since $Q$ includes $\epsilon$-inflated distributions centered on $\hat{w}$ where $\|\hat{w}\|_2 \le B_m$, we have the following.

Theorem 6 (Bound on Gibbs Risk Against Point Estimate Competitors). If (i) $-\log E_{p(f|w)}[p(y|f)]$ is continuously differentiable in $w$ up to order 2, and (ii) $\lambda_{max}\Big(\nabla^2_w\big(-\log E_{p(f|w)}[p(y|f)]\big)\Big) \le B_H$, then, for all $\hat{w}$ with $\|\hat{w}\|_2 \le B_m$,

$$E_{S\sim D^n}[r_{2Gib}(q^\star_{2Bi}(w))] \le r_{2Gib}\big(\delta(w - \hat{w})\big) + \Delta(B_H) + \frac{1}{\lambda}\Psi(\lambda, n), \qquad (9)$$

$$\Delta(B_H) \triangleq \frac{1}{2}\left(\frac{1}{n} + \frac{1}{\lambda}\right) M \left( \frac{2}{M}B'_R + 1 + \log\left( B_H \frac{n\lambda}{n+\lambda} \right)\right).$$

Proof. Using the distribution $q = N(w|\hat{w}, \epsilon I)$ in the RHS of Corollary 3 yields

$$E_{S\sim D^n}[r_{2Gib}(q^\star_{2Bi}(w))] \le r_{2Gib}\big(\delta(w - \hat{w})\big) + \frac{1}{2}\epsilon M B_H - \frac{1}{2}A M \log\epsilon + A B'_R + \frac{1}{\lambda}\Psi(\lambda, n) \qquad (10)$$

where $A = \big(\frac{1}{\eta n} + \frac{1}{\lambda}\big)$ and we have used Lemma 4 and Lemma 5. Eq (10) is optimized when $\epsilon = \frac{A}{B_H}$. Re-substituting the optimal $\epsilon$ in (10) yields

$$E_{S\sim D^n}[r_{2Gib}(q^\star_{2Bi}(w))] \le r_{2Gib}\big(\delta(w - \hat{w})\big) + \frac{1}{2}\left(\frac{1}{\eta n} + \frac{1}{\lambda}\right) M \left( \frac{2}{M}B'_R + 1 - \log\left( \frac{1}{B_H}\left(\frac{1}{\eta n} + \frac{1}{\lambda}\right)\right)\right) + \frac{1}{\lambda}\Psi(\lambda, n). \qquad (11)$$

Setting $\eta = 1$ yields the result.

The theorem calls for running the variational algorithm with constraints on eigenvalues of $V$. The fixed-point characterization [21] of the optimal solution in linear LGM implies that such constraints hold for the optimal solution. Therefore, they need not be enforced explicitly in these models.

For any distribution $q(w)$ and function $f(w)$ we have $\min_w[f(w)] \le E_{q(w)}[f(w)]$. Therefore, the minimizer of the Gibbs risk is a point estimate, which with Theorem 6 implies:

Corollary 7. Under the conditions of Theorem 6, for all $q(w) = N(w|m, V)$ with $\|m\|_2 \le B_m$, $E_{S\sim D^n}[r_{2Gib}(q^\star_{2Bi}(w))] \le r_{2Gib}\big(q(w)\big) + \Delta(B_H) + \frac{1}{\lambda}\Psi(\lambda, n)$.

More importantly, as another immediate corollary, we have a bound for the Bayes risk:

Corollary 8 (Bound on Bayes Risk Against Point Estimate Competitors). Under the conditions of Theorem 6, for all $\hat{w}$ with $\|\hat{w}\|_2 \le B_m$, $E_{S\sim D^n}[r_{2Bay}(q^\star_{2Bi}(w))] \le r_{2Bay}\big(\delta(w - \hat{w})\big) + \Delta(B_H) + \frac{1}{\lambda}\Psi(\lambda, n)$.

Proof. Follows from (a) $\forall q$, $r_{2Bay}(q) \le r_{2Gib}(q)$ (Jensen's inequality), and (b) $\forall \hat{w} \in \mathbb{R}^M$, $r_{2Bay}(\delta(w - \hat{w})) = r_{2Gib}(\delta(w - \hat{w}))$.

The extension for Bayes risk in step (b) of the proof is only possible thanks to the extension to point estimates.
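The $\epsilon$-inflation step of Lemma 5 can be checked numerically: for a smooth loss, averaging over $N(\hat{w}, \epsilon I)$ costs at most $\frac{1}{2}\epsilon M B_H$ over the point estimate. A minimal sketch for the logistic log loss on one example; the data values and constants below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# One illustrative example (x, y) and a point estimate w_hat.
x = np.array([1.0, 2.0])
y = 1.0
w_hat = np.array([0.5, -0.3])
M = x.shape[0]
eps = 0.1
# Hessian of -log sigma(y w^T x) is sigma(1-sigma) x x^T, so its largest
# eigenvalue is at most ||x||^2 / 4; use that as B_H.
B_H = 0.25 * (x @ x)

def loss(F):
    # -log sigma(y * f) for an array of margins f, in a numerically stable form
    return np.logaddexp(0.0, -y * F)

# Monte Carlo estimate of the Gibbs risk of the inflated q(w) = N(w_hat, eps*I)
W = w_hat + np.sqrt(eps) * rng.standard_normal((200_000, M))
inflated = loss(W @ x).mean()
point = loss(w_hat @ x)

assert point <= inflated <= point + 0.5 * eps * M * B_H  # the bound of Eq. (7)
```

The lower inequality holds here because the logistic log loss is convex in $w$; the upper inequality is the content of the lemma.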
As stated in the previous section, for bounded losses, $\Psi(\lambda, n)$ is bounded as $\frac{\lambda^2(b-a)^2}{2n}$. As in [7], we can choose $\lambda = \sqrt{n}$ or $\lambda = n$ to obtain decay rates $\frac{\log n}{\sqrt{n}}$ or $\frac{\log n}{n}$ respectively, where the latter has a fixed non-decaying gap term $(b-a)^2/2$. However, unlike [7], in our proof both cases are achievable with $\eta = 1$, i.e., for the variational algorithm. For example, using $\eta = 1$, $\lambda = \sqrt{n}$, the prior with $\mu = 0$ and $\Sigma = \frac{1}{M}(M B_V + B_m^2) I$, and bounded loss,

$$\Delta(B_H) + \frac{1}{\lambda}\Psi(\lambda, n) \le \frac{M}{\sqrt{n}}\left( 1 + \log B_H + \log n + \log\big(B_V + \tfrac{1}{M}B_m^2\big) + \frac{(b-a)^2}{2M} \right).$$

The results above are developed for the log loss but we can apply them more generally. Toward this we note that Corollary 3 holds for an arbitrary loss, and Lemma 5 and Theorem 6 hold for a sufficiently smooth loss with bounded 2nd derivative w.r.t. $w$. The conversion to Bayes risk in Corollary 8 holds for any loss convex in $p$. Therefore, the result of Corollary 8 holds more generally for any sufficiently smooth loss that has bounded 2nd derivative in $w$ and that is convex in $p$. We provide an application of this more general result in the next section.

4.2 Applications in Concrete Models

This section develops bounds on $\Psi$ and $B_H$ for members of the 2L family.

CTM: For a document, the generative model for CTM first draws $w \sim N(\mu, \Sigma)$, $w \in \mathbb{R}^{K-1}$ where $\{\mu, \Sigma\}$ are model parameters, and then maps this vector to the $K$-simplex with the logistic transformation, $\theta = h(w)$. For each position $i$ in the document, the latent topic variable $f_i$ is drawn from $\mathrm{Discrete}(\theta)$, and the word $y_i$ is drawn from $\mathrm{Discrete}(\beta_{f_i,\cdot})$ where $\beta$ denotes the topics and is treated as a parameter of the model.
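Under this generative description, marginalizing the discrete topic assignment for one word position gives the per-word likelihood $\sum_k \beta_{k,y}\, \theta_k$ with $\theta = h(w)$. The sketch below evaluates the resulting per-word loss for toy sizes; the specific form of the logistic transformation $h$ (padding a zero reference coordinate before normalizing) and all sizes and constants are illustrative assumptions.

```python
import numpy as np

def h(w):
    """Map w in R^(K-1) to the K-simplex via a logistic transformation."""
    e = np.exp(np.append(w, 0.0))   # append a zero reference coordinate
    return e / e.sum()

rng = np.random.default_rng(2)
K, V = 4, 10                        # toy number of topics and vocabulary size
beta = rng.dirichlet(np.ones(V), size=K)    # each row is a topic over words
gamma = 1e-3
beta = (1 - V * gamma) * beta + gamma       # enforce beta[k, y] >= gamma > 0

w = rng.standard_normal(K - 1)
word = 7
# Per-word loss with the topic assignment integrated out analytically
loss = -np.log(beta[:, word] @ h(w))

assert loss <= -np.log(gamma)   # bounded loss once beta is bounded away from 0
```

Bounding every $\beta_{k,y}$ below by $\gamma$ caps the loss at $-\log\gamma$, which is what makes the $\Psi$ term controllable for CTM.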
In this case $p(f|w)$ can be integrated out analytically and the loss is $-\log\big(\sum_{k=1}^K \beta_{k,y}\, h_k(w)\big)$. We have (proof in supplementary material):

Corollary 9. For CTM models where the parameters $\beta_{k,y}$ are uniformly bounded away from 0, i.e., $\beta_{k,y} \ge \gamma > 0$, for all $\hat{w}$ with $\|\hat{w}\|_2 \le B_m$, $E_{S\sim D^n}[r_{2Bay}(q^\star_{2Bi}(w))] \le r_{2Bay}\big(\delta(w - \hat{w})\big) + \Delta(B_H) + \frac{\lambda(\log\gamma)^2}{2n}$, with $B_H = 5$.

The following lemma is expressed in terms of log loss but also holds for smoothed log loss (proof in supplementary material):

Lemma 10. When $f$ is a deterministic function of $w$, if (i) $-\log p\big(y|f(w, x)\big)$ is continuously differentiable in $f$ up to order 2, and $f(w, x)$ is continuously differentiable in $w$ up to order 2, (ii) $\left|\frac{\partial^2[-\log p(y|f)]}{\partial f^2}\right| \le c_2$, (iii) $\left|\frac{\partial[-\log p(y|f)]}{\partial f}\right| \le c_1$, (iv) $\|\nabla_w f(w, x)\|_2^2 \le c^f_1$, and (v) $\sigma_{max}\big(\nabla^2_w f(w, x)\big) \le c^f_2$ ($\sigma_{max}$ is the max singular value), then $B_H = c_2 c^f_1 + c_1 c^f_2$.

GLM: The bound of [11] for GLM was developed for exact Bayesian inference. The following corollary extends this to approximate inference through RCLM. In GLM, $f = w^T x$, $\|\nabla_w f\|_2 = \|x\|_2$, and $\nabla^2_w f = 0$, and a bound on $B_H$ is immediate from Lemma 10. In addition the smoothed loss is bounded: $0 \le \tilde{\ell} \le -\log\alpha$. This implies

Corollary 11. For GLM, if (i) $\tilde{\ell}(w, (x, y)) = -\log\big((1-\alpha)p(y|f(w, x)) + \alpha\big)$ is continuously differentiable in $f$ up to order 2, and (ii) $\frac{\partial^2\tilde{\ell}}{\partial f^2} \le c$, then, for all $\hat{w}$ with $\|\hat{w}\|_2 \le B_m$, $E_{S\sim D^n}[\tilde{r}_{2Bay}(\tilde{q}^\star_{2Bi}(w))] \le \tilde{r}_{2Bay}\big(\delta(w - \hat{w})\big) + \Delta(B_H) + \frac{\lambda(\log\alpha)^2}{2n}$, with $B_H = c\,\max_{x\in X}\|x\|_2^2$.

We develop the bound $c$ for the logistic and Normal likelihoods (see supplementary material). Let $\alpha' = \frac{\alpha}{1-\alpha}$. For the logistic likelihood $\sigma(yf)$, we have $c = \frac{1}{16}\frac{1}{(\alpha')^2} + \frac{\sqrt{3}}{18}\frac{1}{\alpha'}$. For the Gaussian likelihood $\frac{1}{\sqrt{2\pi}\sigma_Y}\exp\big(-\frac{(y-f)^2}{2\sigma_Y^2}\big)$, we have $c = \frac{1}{2\pi\sigma_Y^4 e}\frac{1}{(\alpha')^2} + \frac{1}{\sqrt{2\pi}\sigma_Y^3}\frac{1}{\alpha'}$.

The work of [7] has claimed^3 a bound on the Gibbs risk for linear regression which should be compared to our result for the Gaussian likelihood. Their result is developed under the assumption that the Bayesian model specification is correct and in addition that $x$ is generated from $x \sim N(0, \sigma_x^2 I)$. In contrast our result, using the smoothed loss, holds for arbitrary distributions $D$ without the assumption of correct model specification.

^3 Denoting $\Delta r_i(w) = r_W(w) - \hat{r}_W(w, (x_i, y_i))$ and $f_i(w, n, \lambda) = E\big[\exp\big(\frac{\lambda}{n}\Delta r_i(w)\big)\big]$, the proof of Corollary 5 in [7] erroneously replaces $E_{p(w)}[\prod_i f_i(w, n, \lambda)]$ with $\prod_i E_{p(w)}[f_i(w, n, \lambda)]$. We are not aware of a correction of this proof which yields a correct bound for $\Psi$ without using a smoothed loss.
Any such bound would, of course, be applicable with our Corollary 8.

Sparse GP: In the sparse GP model, the conditional is $p(f|w,x) = N\big(f \,|\, a(x)^T w + b(x), \sigma^2(x)\big)$ where $a(x)^T = K_{Ux}^T K_{UU}^{-1}$, $b(x) = \mu_x - K_{Ux}^T K_{UU}^{-1} \mu_U$ and $\sigma^2(x) = K_{xx} - K_{Ux}^T K_{UU}^{-1} K_{Ux}$, with $\mu$ denoting the mean function and $K_{Ux}, K_{UU}$ denoting the kernel matrix evaluated at inputs (U, x) and (U, U) respectively. In the conjugate case, the likelihood is given by $p(y|f) = N(y|f, \sigma_Y^2)$ and integrating f out yields $N\big(y \,|\, a(x)^T w + b(x), \sigma^2(x) + \sigma_Y^2\big)$. Using the smoothed loss, we obtain:

Corollary 12. For conjugate sparse GP, for all $\hat{w}$ with $\|\hat{w}\|_2 \le B_m$,
$$E_{S \sim D^n}[\tilde{r}_{2Bay}(\tilde{q}^\star_{2Bi(w)})] \le \tilde{r}_{2Bay}\big(\delta(w - \hat{w})\big) + \Delta(B_H) + \frac{\lambda (\log \alpha)^2}{2n}$$
with $B_H = c \max_{x \in X} \|a(x)\|_2^2$, where $c = \frac{1}{2\pi \sigma_Y^4 e\, (\alpha')^2} + \frac{1}{\sqrt{2\pi}\sigma_Y^3 \alpha'}$.

Proof. The Hessian is given by $\nabla^2_w \tilde{\ell} = \frac{1}{(N+\alpha')^2} \nabla_w N (\nabla_w N)^T - \frac{1}{N+\alpha'} \nabla^2_w N$ where $\tilde{\ell}(w,(x,y)) = -\log\big((1-\alpha)N + \alpha\big)$ and N denotes $N\big(y \,|\, f(w), \sigma^2(x) + \sigma_Y^2\big)$, with $f(w) = a(x)^T w + b(x)$. The gradient $\nabla_w N$ equals $\big(\frac{\partial N}{\partial f(w)}\big) a(x)$ and the Hessian $\nabla^2_w N$ equals $\big(\frac{\partial^2 N}{\partial (f(w))^2}\big) a(x) a(x)^T$. Therefore, $\nabla^2_w \tilde{\ell}$ equals $\Big(\frac{1}{(N+\alpha')^2}\big(\frac{\partial N}{\partial f(w)}\big)^2 - \frac{1}{N+\alpha'} \frac{\partial^2 N}{\partial (f(w))^2}\Big) a(x) a(x)^T$. The result of Corollary 11 for the Gaussian likelihood can be used to bound the 2nd derivative of the smoothed loss:
$$\Big|\frac{\partial^2 [-\log((1-\alpha)N+\alpha)]}{\partial (f(w))^2}\Big| \le \frac{1}{2\pi (\sigma^2(x)+\sigma_Y^2)^2 e\, (\alpha')^2} + \frac{1}{\sqrt{2\pi} (\sigma^2(x)+\sigma_Y^2)^{3/2} \alpha'} \le \frac{1}{2\pi \sigma_Y^4 e\, (\alpha')^2} + \frac{1}{\sqrt{2\pi}\sigma_Y^3 \alpha'} = c.$$
Finally, the eigenvalue of the rank-1 matrix $c\, a(x) a(x)^T$ is bounded by $c \max_{x \in X} \|a(x)\|_2^2$.

Remark 1. We noted above that, for sGP, $q^\star_{2Bi}$ does not correspond to a variational algorithm. The standard variational approach uses $q^\star_{2A}$ and the collapsed bound uses $q^\star_{2Bj}$ (but requires cubic time). It can be shown that $q^\star_{2Bi}$ corresponds exactly to the fully independent training conditional (FITC) approximation for sGP [24, 16] in that their optimal solutions are identical. Our result can be seen to justify the use of this algorithm, which is known to perform well empirically.

Finally, we consider binary classification in GLM with the convex loss function $\ell'(w,(x,y)) = \frac{1}{8}\big(y - (2p(y|w,x) - 1)\big)^2$. The proof of the following corollary is in the supplementary material:

Corollary 13.
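The sparse GP quantities $a(x)$, $b(x)$ and $\sigma^2(x)$ entering Corollary 12 are straightforward to construct from kernel matrices; a minimal sketch in our own notation, assuming an RBF kernel and a zero mean function (neither is prescribed by the paper):

```python
import numpy as np

# Construct the sparse GP conditional p(f|w,x) = N(f | a(x)^T w + b(x), sigma^2(x)):
#   a(x)^T  = K_{Ux}^T K_{UU}^{-1}
#   b(x)    = mu_x - K_{Ux}^T K_{UU}^{-1} mu_U   (zero here for a zero mean function)
#   sigma^2(x) = K_xx - K_{Ux}^T K_{UU}^{-1} K_{Ux}

def rbf(A, B, ls=1.0):
    """Squared-exponential kernel matrix between row-wise inputs A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

rng = np.random.default_rng(1)
U = rng.normal(size=(5, 1))           # inducing inputs
x = rng.normal(size=(1, 1))           # a single test input

K_UU = rbf(U, U) + 1e-8 * np.eye(5)   # jitter for numerical stability
K_Ux = rbf(U, x)                      # shape (5, 1)
K_xx = rbf(x, x)[0, 0]

a = np.linalg.solve(K_UU, K_Ux)[:, 0]                     # a(x) = K_UU^{-1} K_Ux
b = 0.0                                                   # zero mean function
sigma2 = K_xx - K_Ux[:, 0] @ np.linalg.solve(K_UU, K_Ux)[:, 0]

# The conditional variance is a Schur complement: nonnegative, at most K_xx
print(a.shape, sigma2)
assert -1e-9 <= sigma2 <= K_xx + 1e-9
```

The quantity $\|a(x)\|_2^2$ appearing in $B_H$ is then just `a @ a` for each candidate input.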
For GLM with $p(y|w,x) = \sigma(y w^T x)$, for all $\hat{w}$ with $\|\hat{w}\|_2 \le B_m$,
$$E_{S \sim D^n}[r'_{2Bay}(q'^\star_{2Bi(w)})] \le r'_{2Bay}\big(\delta(w - \hat{w})\big) + \Delta(B_H) + \frac{\lambda}{8n}$$
with $B_H = \frac{5}{16} \max_{x \in X} \|x\|_2^2$.

4.3 Direct Application of RCLM to Conjugate Linear LGM

In this section we derive a bound for an algorithm that optimizes a surrogate of the loss directly. In particular, we consider the Bayes loss for linear LGM with conjugate likelihood $p(y|f) = N(y|f, \sigma_Y^2)$ where $-\log E_{q(w)}[E_{p(f|w)}[p(y|f)]] = -\log N\big(y \,|\, a^T m + b, \sigma^2 + \sigma_Y^2 + a^T V a\big)$ and where a, b, and $\sigma^2$ are functions of x. This includes, for example, linear regression and conjugate sGP.

The proposed algorithm $q^\star_{2Ds}$ performs RCLM with competitor set $\Theta = \{(m,V) : \|m\|_2 \le B_m, V \in S_{++}, \|V\|_F \le B_V\}$, regularizer $R(m,V) = \frac{1}{2}\|m\|_2^2 + \frac{1}{2}\|V\|_F^2$, $\eta = \frac{1}{\sqrt{n}}$, and the surrogate loss $\ell_{surr}(m,V) = \frac{1}{2}\log(2\pi) + \frac{1}{2}\log\big(\sigma^2 + \sigma_Y^2 + a^T V a\big) + \frac{1}{2}\frac{(y - a^T m - b)^2}{\sigma^2 + \sigma_Y^2 + a^T V a}$. With these definitions we can apply Theorem 1 to get (proof in supplementary material):

Theorem 14. With probability at least $1-\delta$,
$$r_{2Bay}(q^\star_{2Ds}) \le \min_{q \in Q} r^{surr}_{2Bay}(q(w)) + \frac{1}{\sqrt{n}}\Big(B_m^2 + B_V^2 + 8(\rho_m^2 + \rho_V^2) \log\frac{2}{\delta}\Big),$$
where $\rho_m = \frac{1}{\sigma_Y^2} \max_{x \in X}\|a\|_2 \max_{x \in X, y \in Y, m} |y - a^T m - b|$ and $\rho_V = \frac{1}{2\sigma_Y^2} \max_{x \in X, y \in Y, m} \|a\|_2^2 \big(1 + \frac{(y - a^T m - b)^2}{\sigma_Y^2}\big)$.

5 Direct Loss Minimization

The results in this paper expose the fact that different algorithms are apparently implicitly optimizing criteria for different loss functions. In particular, $q^\star_{2A}$ optimizes for $r_{2A}$, $q^\star_{2Bi}$ optimizes for $r_{2Gib}$ and $q^\star_{2D}$ optimizes for $r_{2Bay}$. Even though we were able to bound $r_{2Bay}$ of the $q^\star_{2Bi}$ algorithm, it is interesting to check the performance of these algorithms in practice.

We present an experimental study comparing these algorithms on the correlated topic model (CTM) that was described in the previous section. To explore the relation between the algorithms and their performance we run the three algorithms and report their empirical risk on a test set, where the risk is also measured in three different ways. Figure 1 shows the corresponding learning curves on an artificial document generated from the model.

[Figure 1: Artificial data. Cumulative test set losses of different variational algorithms, one panel each for $\sum_{y_i \in test} \ell_{2A}(y_i)$, $\sum_{y_i \in test} \ell_{2Gib}(y_i)$ and $\sum_{y_i \in test} \ell_{2Bay}(y_i)$. x-axis is iteration. Mean $\pm 1\sigma$ of 30 trials are shown per objective. $q^\star_{2A}$ in blue, $q^\star_{2Bi}$ in green, $q^\star_{2D}$ in red.]
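The closed-form marginalization used in Section 4.3 can be verified by Monte Carlo; the following is our own sketch with made-up values, not code from the paper:

```python
import numpy as np

# Monte Carlo check of the conjugate-case identity from Section 4.3:
#   -log E_{q(w)}[E_{p(f|w)}[p(y|f)]] = -log N(y | a^T m + b, s2 + s2_Y + a^T V a)
# for q(w) = N(m, V), p(f|w) = N(f | a^T w + b, s2), p(y|f) = N(y | f, s2_Y).

def log_npdf(y, mu, var):
    """Log density of N(y | mu, var)."""
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (y - mu) ** 2 / var

rng = np.random.default_rng(2)
d = 3
a, b = rng.normal(size=d), 0.5
m = rng.normal(size=d)
L = rng.normal(size=(d, d))
V = 0.1 * L @ L.T + 0.1 * np.eye(d)   # a valid covariance matrix
s2, s2_Y, y = 0.3, 0.4, 1.2

# closed form
closed = -log_npdf(y, a @ m + b, s2 + s2_Y + a @ V @ a)

# Monte Carlo: sample w ~ q, then f ~ p(f|w), and average the likelihood
n_mc = 200_000
w = rng.multivariate_normal(m, V, size=n_mc)
f = w @ a + b + rng.normal(scale=np.sqrt(s2), size=n_mc)
mc = -np.log(np.mean(np.exp(log_npdf(y, f, s2_Y))))

print(closed, mc)
assert abs(closed - mc) < 0.05
```

Both quantities estimate the same Bayes log loss, so they should agree up to Monte Carlo error.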
Full experimental details and additional results on a real dataset are given in the supplementary material.

We observe that at convergence each algorithm is best at optimizing its own implicit criterion. However, considering $r_{2Bay}$, the differences between the outputs of the variational algorithm $q^\star_{2Bi}$ and direct loss minimization $q^\star_{2D}$ are relatively small. We also see that, at least in this case, $q^\star_{2Bi}$ takes longer to reach the optimal point for $r_{2Bay}$. Clearly, except for its own implicit criterion, $q^\star_{2A}$ should not be used. This agrees with prior empirical work on $q^\star_{2A}$ and $q^\star_{2Bi}$ [22]. The current experiment shows the potential of direct loss optimization for improved performance but justifies the use of $q^\star_{2Bi}$ both under correct model specification (artificial data) and when the model is incorrect (real data in supplement).

Preliminary experiments in sparse GP show similar trends. The comparison in that case is more complex because $q^\star_{2Bi}$ is not the same as the collapsed variational approximation, which in turn requires cubic time to compute, and we additionally have the surrogate optimizer $q^\star_{2Ds}$. We defer a full empirical exploration in sparse GP to future work.

6 Discussion

The paper provides agnostic learning bounds for the risk of the Bayesian predictor, which uses the posterior calculated by RCLM, against the best single predictor. The bounds apply for a wide class of Bayesian models, including GLM, sGP and CTM. For CTM our bound applies precisely to the variational algorithm with the collapsed variational bound. For sGP and GLM the bounds apply to bounded variants of the log loss. The results add theoretical understanding of why approximate inference algorithms are successful, even though they optimize the wrong objective, and therefore justify the use of such algorithms.
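The gap between the implicit criteria compared above has a simple structural source: the Gibbs loss takes the expectation over $q(w)$ outside the log, while the Bayes loss takes it inside, so by Jensen's inequality the Bayes loss never exceeds the Gibbs loss for the same posterior. A toy Monte Carlo illustration (our own, with made-up values) for a logistic likelihood and a factorized Gaussian $q$:

```python
import numpy as np

# Gibbs loss: E_q[-log p(y|w,x)]   vs.   Bayes loss: -log E_q[p(y|w,x)].
# Jensen's inequality guarantees bayes <= gibbs for the same q, which also
# holds exactly for any finite sample average, as checked below.

rng = np.random.default_rng(3)
x, y = np.array([1.0, -2.0]), 1
m, sd = np.array([0.3, 0.1]), 0.8      # mean and (shared) std of Gaussian q(w)

w = m + sd * rng.normal(size=(100_000, 2))   # samples from q(w)
p = 1.0 / (1.0 + np.exp(-y * (w @ x)))       # logistic p(y|w,x) per sample

gibbs = np.mean(-np.log(p))
bayes = -np.log(np.mean(p))
print(gibbs, bayes)
assert bayes <= gibbs + 1e-9
```

This is why bounding the Gibbs risk of an approximate posterior does not by itself bound its Bayes risk from below, and why the two losses can rank algorithms differently.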
In addition, we expose a discrepancy between the loss used in optimization and the loss typically used in evaluation, and propose alternative algorithms using regularized loss minimization. A preliminary empirical evaluation in CTM shows the potential of direct loss minimization, but also that the collapsed variational approximation $q^\star_{2Bi}$ has the advantage of strong theoretical guarantees and excellent empirical performance, both when the Bayesian model is correct and under model misspecification.

Our results can be seen as a first step toward full analysis of approximate Bayesian inference methods. One limitation is that the competitor class in our results is restricted to point estimates. While point estimate predictors are optimal for the Gibbs risk, they are not optimal for Bayes predictors. In addition, the bounds show that the Bayesian procedures will do almost as well as the best point estimator. However, they do not show an advantage over such estimators, whereas one would expect such an advantage. It would also be interesting to incorporate direct loss minimization within the Bayesian framework. These issues remain an important challenge for future work.

Acknowledgments

This work was partly supported by NSF under grant IIS-1714440.

References

[1] Pierre Alquier, James Ridgway, and Nicolas Chopin. On the properties of variational approximations of Gibbs posteriors. JMLR, 17:1-41, 2016.

[2] Arindam Banerjee. On Bayesian bounds. In ICML, pages 81-88, 2006.

[3] David M. Blei and John D. Lafferty. Correlated topic models. In NIPS, pages 147-154,
2006.

[4] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

[5] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, March 2004.

[6] Arnak S. Dalalyan and Alexandre B. Tsybakov. Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Machine Learning, 72:39-61, 2008.

[7] Pascal Germain, Francis Bach, Alexandre Lacoste, and Simon Lacoste-Julien. PAC-Bayesian theory meets Bayesian inference. In NIPS, pages 1876-1884, 2016.

[8] James Hensman, Alexander Matthews, and Zoubin Ghahramani. Scalable variational Gaussian process classification. In AISTATS, pages 351-360, 2015.

[9] Matthew D. Hoffman and David M. Blei. Structured stochastic variational inference. In AISTATS, pages 361-369, 2015.

[10] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37:183-233, 1999.

[11] Sham M. Kakade and Andrew Y. Ng. Online bounds for Bayesian algorithms. In NIPS, pages 641-648, 2004.

[12] Alexandre Lacasse, François Laviolette, Mario Marchand, Pascal Germain, and Nicolas Usunier. PAC-Bayes bounds for the risk of the majority vote and the variance of the Gibbs classifier. In NIPS, pages 769-776, 2006.

[13] Moshe Lichman. UCI machine learning repository, 2013. http://archive.ics.uci.edu/ml.

[14] David A. McAllester. Some PAC-Bayesian theorems. In COLT, pages 230-234, 1998.

[15] Ron Meir and Tong Zhang. Generalization error bounds for Bayesian mixture algorithms. JMLR, 4:839-860, 2003.

[16] Joaquin Quiñonero-Candela, Carl E. Rasmussen, and Ralf Herbrich. A unifying view of sparse approximate Gaussian process regression. JMLR, 6:1939-1959, 2005.

[17] Carl E.
Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[18] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pages 1278-1286, 2014.

[19] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4:107-194, 2012.

[20] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[21] Rishit Sheth and Roni Khardon. A fixed-point operator for inference in variational Bayesian latent Gaussian models. In AISTATS, pages 761-769, 2016.

[22] Rishit Sheth and Roni Khardon. Monte Carlo structured SVI for non-conjugate models. arXiv:1309.6835, 2016.

[23] Rishit Sheth, Yuyang Wang, and Roni Khardon. Sparse variational inference for generalized Gaussian process models. In ICML, pages 1302-1311, 2015.

[24] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In NIPS, pages 1257-1264, 2006.

[25] Yee Whye Teh, David Newman, and Max Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In NIPS, pages 1353-1360, 2006.

[26] Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In AISTATS, pages 567-574, 2009.

[27] Sheng-De Wang, Te-Son Kuo, and Chen-Fa Hsu. Trace bounds on the solution of the algebraic matrix Riccati and Lyapunov equation. IEEE Transactions on Automatic Control, 31:654-656, 1986.