{"title": "Reducing Reparameterization Gradient Variance", "book": "Advances in Neural Information Processing Systems", "page_first": 3708, "page_last": 3718, "abstract": "Optimization with noisy gradients has become ubiquitous in statistics and machine learning. Reparameterization gradients, or gradient estimates computed via the ``reparameterization trick,'' represent a class of noisy gradients often used in Monte Carlo variational inference (MCVI). However, when these gradient estimators are too noisy, the optimization procedure can be slow or fail to converge. One way to reduce noise is to generate more samples for the gradient estimate, but this can be computationally expensive. Instead, we view the noisy gradient as a random variable, and form an inexpensive approximation of the generating procedure for the gradient sample. This approximation has high correlation with the noisy gradient by construction, making it a useful control variate for variance reduction. We demonstrate our approach on a non-conjugate hierarchical model and a Bayesian neural net where our method attained orders of magnitude (20-2{,}000$\\times$) reduction in gradient variance resulting in faster and more stable optimization.", "full_text": "Reducing Reparameterization Gradient Variance\n\nAndrew C. Miller\u2217\nHarvard University\n\nacm@seas.harvard.edu\n\nNicholas J. Foti\n\nUniversity of Washington\n\nnfoti@uw.edu\n\nAlexander D\u2019Amour\n\nUC Berkeley\n\nalexdamour@berkeley.edu\n\nRyan P. Adams\n\nGoogle Brain and Princeton University\n\nrpa@princeton.edu\n\nAbstract\n\nOptimization with noisy gradients has become ubiquitous in statistics and ma-\nchine learning. Reparameterization gradients, or gradient estimates computed via\nthe \u201creparameterization trick,\u201d represent a class of noisy gradients often used in\nMonte Carlo variational inference (MCVI). However, when these gradient estima-\ntors are too noisy, the optimization procedure can be slow or fail to converge. 
One\nway to reduce noise is to generate more samples for the gradient estimate, but this\ncan be computationally expensive. Instead, we view the noisy gradient as a ran-\ndom variable, and form an inexpensive approximation of the generating procedure\nfor the gradient sample. This approximation has high correlation with the noisy\ngradient by construction, making it a useful control variate for variance reduc-\ntion. We demonstrate our approach on a non-conjugate hierarchical model and a\nBayesian neural net where our method attained orders of magnitude (20-2,000\u00d7)\nreduction in gradient variance resulting in faster and more stable optimization.\n\n1\n\nIntroduction\n\nRepresenting massive datasets with \ufb02exible probabilistic models has been central to the success of\nmany statistics and machine learning applications, but the computational burden of \ufb01tting these mod-\nels is a major hurdle. For optimization-based \ufb01tting methods, a central approach to this problem has\nbeen replacing expensive evaluations of the gradient of the objective function with cheap, unbiased,\nstochastic estimates of the gradient. For example, stochastic gradient descent using small mini-\nbatches of (conditionally) i.i.d. data to estimate the gradient at each iteration is a popular approach\nwith massive data sets. Alternatively, some learning methods sample directly from a generative\nmodel or approximating distribution to estimate the gradients of interest, for example, in learning\nalgorithms for implicit models [18, 30] and generative adversarial networks [2, 9].\nApproximate Bayesian inference using variational techniques (variational inference, or VI) has also\nmotivated the development of new stochastic gradient estimators, as the variational approach re-\nframes the integration problem of inference as an optimization problem [4]. 
VI approaches seek out the distribution from a well-understood variational family of distributions that best approximates an intractable posterior distribution. The VI objective function itself is often intractable, but recent work has shown that it can be optimized with stochastic gradient methods that use Monte Carlo estimates of the gradient [19, 14, 22, 25], which we call Monte Carlo variational inference (MCVI). In MCVI, generating samples from an approximate posterior distribution is the source of gradient stochasticity. Alternatively, stochastic variational inference (SVI) [11] and other stochastic optimization procedures induce stochasticity through data subsampling; MCVI can also be augmented with data subsampling to accelerate computation for large data sets.

*http://andymiller.github.io/

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

The two commonly used MCVI gradient estimators are the score function gradient [19, 22] and the reparameterization gradient [14, 25, 29, 8]. Broadly speaking, score function estimates can be applied to both discrete and continuous variables, but often have high variance and thus are frequently used in conjunction with variance reduction techniques. On the other hand, the reparameterization gradient often has lower variance, but is restricted to continuous random variables. See Ruiz et al. [28] for a unifying perspective on these two estimators. Like other stochastic gradient methods, the success of MCVI depends on controlling the variance of the stochastic gradient estimator.
In this work, we present a novel approach to controlling the variance of the reparameterization gradient estimator in MCVI. Existing MCVI methods control this variance naïvely by averaging several gradient estimates, which becomes expensive for large data sets and complex models, with error that only diminishes as O(1/√N). 
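The O(1/√N) decay of naive averaging is easy to verify numerically. Below is a minimal sketch (hypothetical names; a scalar stand-in for a gradient sample) showing that the spread of an N-sample average shrinks only at the Monte Carlo rate:

```python
import random
import statistics

def noisy_grad(rng):
    # one stochastic gradient sample: true value 2.0 plus unit Gaussian noise
    return 2.0 + rng.gauss(0.0, 1.0)

def averaged_grad(rng, n):
    # the naive fix: average n independent samples
    return sum(noisy_grad(rng) for _ in range(n)) / n

rng = random.Random(0)
for n in (1, 25, 625):
    draws = [averaged_grad(rng, n) for _ in range(2000)]
    # each 25x increase in samples only shrinks the spread ~5x: O(1/sqrt(N))
    print(n, round(statistics.stdev(draws), 3))
```

Each 25-fold increase in cost buys only a 5-fold reduction in standard error, which is the inefficiency the control-variate construction below is designed to avoid.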
Our approach exploits the fact that, in MCVI, the randomness in the\ngradient estimator is completely determined by a known Monte Carlo generating process; this al-\nlows us to leverage knowledge about this generative procedure to de-noise the gradient estimator.\nIn particular, we construct a computationally cheap control variate based on an analytical linear\napproximation to the gradient estimator. Taking a linear combination of a na\u00efve gradient estimate\nwith this control variate yields a new estimator for the gradient that remains unbiased but has lower\nvariance. Applying the idea to Gaussian approximating families, we observe a 20-2,000\u00d7 reduction\nin variance of the gradient norm under various conditions, and faster convergence and more stable\nbehavior of optimization traces.\n\n2 Background\nVariational Inference Given a model, p(z,D) = p(D|z)p(z), of data D and parameters/latent\nvariables z, the goal of VI is to approximate the posterior distribution p(z|D). VI approximates this\nintractable posterior distribution with one from a simpler family, Q = {q(z; \u03bb), \u03bb \u2208 \u039b}, parameter-\nized by variational parameters \u03bb. 
VI procedures seek out the member of that family, q(·; λ) ∈ Q, that minimizes some divergence between the approximation q and the true posterior p(z|D). Variational inference can be framed as an optimization problem, usually in terms of Kullback-Leibler (KL) divergence, of the following form

λ* = argmin_{λ∈Λ} KL(q(z; λ) || p(z|D)) = argmin_{λ∈Λ} E_{z∼q_λ}[ln q(z; λ) − ln p(z|D)].

The task is to find a setting of λ that makes q(z; λ) close to the posterior p(z|D) in KL divergence.² Directly computing the KL divergence requires evaluating the posterior itself; therefore, VI procedures use the evidence lower bound (ELBO) as the optimization objective

L(λ) = E_{z∼q_λ}[ln p(z, D) − ln q(z; λ)],   (1)

which, when maximized, minimizes the KL divergence between q(z; λ) and p(z|D). In special cases, parts of the ELBO can be expressed analytically (e.g., the entropy form or KL-to-prior form [10]) — we focus on the general form in Equation 1.
To maximize the ELBO with gradient methods, we need to compute the gradient of Eq. (1), ∂L/∂λ ≜ g_λ. The gradient inherits the ELBO's form as an expectation, which is in general an intractable quantity to compute. In this work, we focus on reparameterization gradient estimators (RGEs) computed using the reparameterization trick. The reparameterization trick exploits the structure of the variational data generating procedure — the mechanism by which z is simulated from q_λ(z). To compute the RGE, we first express the sampling procedure from q_λ(z) as a differentiable map applied to exogenous randomness

ε ∼ q₀(ε)    (independent of λ)    (2)
z = T(ε; λ)    (differentiable map)    (3)

where the initial distribution q₀ and T are jointly defined such that z ∼ q(z; λ) has the desired distribution. As a simple concrete example, if we set q(z; λ) to be a diagonal Gaussian, N(m_λ, diag(s²_λ)), with λ = [m_λ, s_λ], m_λ ∈ R^D, and s_λ ∈ R^D₊ the mean and variance, the sampling procedure could then be defined as

ε ∼ N(0, I_D),   z = T(ε; λ) = m_λ + s_λ ⊙ ε,   (4)

where s_λ ⊙ ε denotes an element-wise product.³ Given this map, the reparameterization gradient estimator is simply the gradient of a Monte Carlo ELBO estimate with respect to λ. For a single sample, this is

ĝ_λ ≜ ∇_λ [ln p(T(ε; λ), D) − ln q(T(ε; λ); λ)],

and similarly the L-sample approximation can be computed by averaging the single-sample estimator over the individual samples

ĝ_λ^(L) = (1/L) Σ_{ℓ=1}^{L} ĝ_λ(ε^(ℓ)).   (5)

Crucially, the reparameterization gradient is unbiased, E[ĝ_λ] = ∇_λ L(λ), guaranteeing the convergence of stochastic gradient optimization procedures that use it [26].

²We use q(z; λ) and q_λ(z) interchangeably.

Figure 1: Optimization traces for MCVI applied to a Bayesian neural network with various hyperparameter settings. Each trace is running adam [13]. The three lines in each plot correspond to three different numbers of samples, L, used to estimate the gradient at each step. (Left, (a)) step size = .01; (Right, (b)) step size = .1, ten times larger. Large step sizes allow for quicker progress, however noisier (i.e., small L) gradients combined with large step sizes result in chaotic optimization dynamics. The converging traces reach different ELBOs due to the illustrative constant learning rates; in practice, one decreases the step size over time to satisfy the convergence criteria in Robbins and Monro [26].

Gradient Variance and Convergence  The efficiency of Monte Carlo variational inference hinges on the magnitude of gradient noise and the step size chosen for the optimization procedure. When the gradient noise is large, smaller gradient steps must be taken to avoid unstable dynamics of the iterates. However, a smaller step size increases the number of iterations that must be performed to reach convergence.
We illustrate this trade-off in Figure 1, which shows realizations of an optimization procedure applied to a Bayesian neural network using reparameterization gradients. The posterior is over D = 653 parameters that we approximate with a diagonal Gaussian (see Appendix C.2). We compare the progress of the adam algorithm using various numbers of samples [13], fixing the learning rate. The noise present in the single-sample estimator causes extremely slow convergence, whereas the lower noise 50-sample estimator quickly converges, albeit at 50 times the cost.
The upshot is that with low noise gradients we are able to safely take larger steps, enabling faster convergence to a local optimum. A natural question is, how can we reduce the variance of gradient estimates without introducing too much extra computation? 
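To make Eqs. (4)-(5) concrete, here is a minimal sketch assuming a one-dimensional toy model ln p(z, D) = -0.5*(z - 1)^2 (our choice for illustration; the gradients are written out by hand rather than computed by automatic differentiation):

```python
import random

def grad_elbo_sample(eps, m, s):
    """Single-sample RGE for q = N(m, s^2) and toy model ln p(z, D) = -0.5*(z-1)^2.

    Hand-derived: with z = m + s*eps (the map T of Eq. (4)),
      d/dm [ln p(z) - ln q(z; m, s)] = 1 - z            (entropy term cancels in m)
      d/ds [ln p(z) - ln q(z; m, s)] = (1 - z)*eps + 1/s
    """
    z = m + s * eps
    return 1.0 - z, (1.0 - z) * eps + 1.0 / s

def rge(rng, m, s, L):
    # L-sample estimator of Eq. (5): average L single-sample gradients
    draws = [grad_elbo_sample(rng.gauss(0.0, 1.0), m, s) for _ in range(L)]
    return (sum(d[0] for d in draws) / L, sum(d[1] for d in draws) / L)

rng = random.Random(0)
for L in (1, 10, 100):
    gm, gs = rge(rng, 0.0, 0.5, L)
    print(L, round(gm, 3), round(gs, 3))
```

For this toy the exact gradients at m = 0, s = 0.5 are E[ĝ_m] = 1 − m = 1.0 and E[ĝ_s] = 1/s − s = 1.5, so larger L concentrates the estimate around those values, at L times the cost per step.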
Our approach is to use information about the variational model, q(·; λ), and carefully construct a control variate to the gradient.
Control Variates  Control variates are random quantities that are used to reduce the variance of a statistical estimator without introducing any bias by incorporating additional information into the estimator [7]. Given an unbiased estimator ĝ such that E[ĝ] = g (the quantity of interest), our goal is to construct another unbiased estimator with lower variance. We can do this by defining a control variate g̃ with known expectation m̃ and can write the new estimator as

g^(cv) = ĝ − C(g̃ − m̃),   (6)

where C ∈ R^{D×D} for D-dimensional ĝ. Clearly the new estimator has the same expectation as the original estimator, but has a different variance. We can attain optimal variance reduction by appropriately setting C. Intuitively, the optimal C is very similar to a regression coefficient — it is related to the covariance between the control variate and the original estimator. See Appendix A for further details on optimally setting C.

³We will also use x/y and x² to denote pointwise division and squaring, respectively.

3 Method: Modeling Reparameterization Gradients

In this section we develop our main contribution, a new gradient estimator that can dramatically reduce reparameterization gradient variance. 
In MCVI, the reparameterization gradient estimator (RGE) is a Monte Carlo estimator of the true gradient — the estimator itself is a random variable. This random variable is generated using the "reparameterization trick" — we first generate some randomness ε and then compute the gradient of the ELBO with respect to λ holding ε fixed. This results in a complex distribution from which we can generate samples, but in general cannot characterize due to the complexity of the term arising from the gradient of the model term.
However, we do have a lot of information about the sampling procedure — we know the variational distribution ln q(z; λ), the transformation T, and we can evaluate the model joint density ln p(z, D) pointwise. Furthermore, with automatic differentiation, it is often straightforward to obtain gradients and Hessian-vector products of our model ln p(z, D). We propose a scheme that uses the structure of q_λ and curvature of ln p(z, D) to construct a tractable approximation of the distribution of the RGE.⁴ This approximation has a known mean and is correlated with the RGE distribution, allowing us to use it as a control variate to reduce the RGE variance.
Given a variational family parameterized by λ, we can decompose the ELBO gradient into a few terms that reveal its "data generating procedure"

ε ∼ q₀,   z = T(ε; λ)   (7)

ĝ_λ ≜ ĝ(z; λ) = [∂ ln p(z, D)/∂z][∂z/∂λ] − [∂ ln q_λ(z)/∂z][∂z/∂λ] − ∂ ln q_λ(z)/∂λ,   (8)

where the three terms are referred to as the data term, the pathwise score, and the parameter score, respectively.
Certain terms in Eq. (8) have tractable distributions. 
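Before examining those terms, the generic recipe of Eq. (6) can be sketched with a scalar toy (all names hypothetical, unrelated to the paper's estimators): ĝ = exp(ε) estimates E[exp(ε)] = e^{1/2}, its first-order linearization g̃ = 1 + ε has known mean 1, and C is set from the empirical covariance, matching the regression-coefficient intuition above:

```python
import math
import random
import statistics

rng = random.Random(0)
eps = [rng.gauss(0.0, 1.0) for _ in range(20000)]

g_hat = [math.exp(e) for e in eps]    # noisy estimator of E[exp(eps)] = exp(0.5)
g_tilde = [1.0 + e for e in eps]      # cheap linearization with known mean 1.0
m_tilde = 1.0

n = len(eps)
mean_hat = sum(g_hat) / n
mean_til = sum(g_tilde) / n
cov = sum((a - mean_hat) * (b - mean_til) for a, b in zip(g_hat, g_tilde)) / (n - 1)
C = cov / statistics.variance(g_tilde)   # regression-coefficient choice of C

# Eq. (6): subtract the centered control variate
g_cv = [gh - C * (gt - m_tilde) for gh, gt in zip(g_hat, g_tilde)]

print("mean raw :", round(statistics.mean(g_hat), 3))
print("mean cv  :", round(statistics.mean(g_cv), 3))
print("var ratio:", round(statistics.variance(g_cv) / statistics.variance(g_hat), 3))
```

The mean is preserved up to sampling noise while the variance drops, because ĝ and g̃ share the same underlying ε, exactly the mechanism exploited below for gradients.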
The Jacobian of T(·; λ), given by ∂z/∂λ, is defined by our choice of q(z; λ). For some transformations T we can exactly compute the distribution of the Jacobian given the distribution of ε. The pathwise and parameter score terms are gradients of our approximate distribution with respect to λ (via z or directly). If our approximation is tractable (e.g., a multivariate Gaussian), we can exactly characterize the distribution for these components.⁵ However, the data term in Eq. (8) involves a potentially complicated function of the latent variable z (and therefore a complicated function of ε), resulting in a difficult-to-characterize distribution. Our goal is to construct an approximation to the distribution of ∂ ln p(z, D)/∂z and its interaction with ∂z/∂λ given a fixed distribution over ε. If the approximation yields random variables that are highly correlated with ĝ_λ, then we can use it to reduce the variance of that RGE sample.
Linearizing the data term  To simplify notation, we write the data term of the gradient as

f(z′) ≜ ∂ ln p(z, D)/∂z |_{z=z′},   (9)

where f : R^D ↦ R^D since z ∈ R^D. We then linearize f about some value z₀

f̃(z) = f(z₀) + [∂f/∂z (z₀)](z − z₀) = f(z₀) + H(z₀)(z − z₀),   (10)

where H(z₀) is the Hessian of the model, ln p(z, D), with respect to z evaluated at z₀,

H(z₀) = ∂f/∂z (z₀) = ∂² ln p(z, D)/∂z² (z₀).   (11)

⁴We require the model ln p(z, D) to be twice differentiable.
⁵In fact, we know that the expectation of the parameter score term is zero, and removing that term altogether can sometimes be a source of variance reduction that we do not explore here [27].

Note that even though this uses second-order information about the model, it is a first-order approximation of the gradient. We also view this as a transformation of the random ε for a fixed λ

f̃_λ(ε) = f(z₀) + H(z₀)(T(ε, λ) − z₀),   (12)

which is linear in z = T(ε, λ). For some forms of T we can analytically derive the distribution of the random variable f̃_λ(ε). In Eq. (8), the data term interacts with the Jacobian of T, given by

J_λ′(ε) ≜ ∂z/∂λ = ∂T(ε, λ)/∂λ |_{λ=λ′},   (13)

which importantly is a function of the same ε as in Eq. (12). We form our approximation of the first term in Eq. (8) by multiplying Eqs. (12) and (13), yielding

g̃_λ^(data)(ε) ≜ f̃_λ(ε) J_λ(ε).   (14)

The tractability of this approximation hinges on how Eq. (14) depends on ε. When q(z; λ) is multivariate normal, we show that this approximation has a computable mean and can be used to reduce variance in MCVI settings. 
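A scalar sketch of the linearization in Eqs. (10)-(12), using an illustrative non-quadratic density ln p(z, D) = -z^4/4 (our assumption, chosen so that f and H have closed forms):

```python
def f(z):
    # data term: gradient of the toy density ln p(z, D) = -z**4 / 4
    return -z ** 3

def hess(z):
    # model "Hessian" (scalar second derivative), df/dz
    return -3.0 * z ** 2

def f_lin(z, z0):
    # Eq. (10): first-order approximation of the *gradient* around z0
    return f(z0) + hess(z0) * (z - z0)

z0 = 0.5
for dz in (0.01, 0.1, 0.5):
    z = z0 + dz
    print(dz, round(f(z), 4), round(f_lin(z, z0), 4))
```

The approximation is exact at z₀ and degrades with distance, which is why z₀ is later chosen as m_λ, where q places most of its mass. Viewed through z = T(ε, λ) = m_λ + s_λ·ε, f_lin is linear in ε, so it is Gaussian whenever ε is, which is what makes its distribution tractable.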
In the following sections we describe and empirically test this variance reduction technique applied to diagonal Gaussian posterior approximations.

3.1 Gaussian Variational Families

Perhaps the most common choice of approximating distribution for MCVI is a diagonal Gaussian, parameterized by a mean m_λ ∈ R^D and scales s_λ ∈ R^D₊.⁶ The log probability density function is

ln q(z; m_λ, s²_λ) = −(1/2)(z − m_λ)ᵀ [diag(s²_λ)]⁻¹ (z − m_λ) − (1/2) Σ_d ln s²_{λ,d} − (D/2) ln(2π).   (15)

To generate a random variate z from this distribution, we use the sampling procedure in Eq. (4). We denote the Monte Carlo RGE as ĝ_λ ≜ [ĝ_{m_λ}, ĝ_{s_λ}]. From Eq. (15), it is straightforward to derive the distributions of the pathwise score, parameter score, and Jacobian terms in Eq. (8).
The Jacobian term of the sampling procedure has two straightforward components

∂z/∂m_λ = I_D,   ∂z/∂s_λ = diag(ε).   (16)

The pathwise score term is the partial derivative of Eq. (15) with respect to z, ignoring variation due to the variational distribution parameters and noting that z = m_λ + s_λ ⊙ ε:

∂ ln q/∂z = −diag(s²_λ)⁻¹(z − m_λ) = −ε/s_λ.   (17)

The parameter score term is the partial derivative of Eq. (15) with respect to variational parameters λ, ignoring variation due to z. 
The m_λ and s_λ components are given by

∂ ln q/∂m_λ = (z − m_λ)/s²_λ = ε/s_λ   (18)
∂ ln q/∂s_λ = −1/s_λ + (z − m_λ)²/s³_λ = (ε² − 1)/s_λ.   (19)

The data term, f(z), multiplied by the Jacobian of T is all that remains to be approximated in Eq. (8). We linearize f around z₀ = m_λ where the approximation is expected to be accurate

f̃_λ(ε) = f(m_λ) + H(m_λ)((m_λ + s_λ ⊙ ε) − m_λ)   (20)
       ∼ N(f(m_λ), H(m_λ) diag(s²_λ) H(m_λ)ᵀ).   (21)

⁶For diagonal Gaussian q, we define λ = [m_λ, s_λ].

Algorithm 1  Gradient descent with RV-RGE with a diagonal Gaussian variational family

1: procedure RV-RGE-OPTIMIZE(λ₁, ln p(z, D), L)
2:   f(z) ← ∇_z ln p(z, D)
3:   H(z_a, z_b) ← [∇²_z ln p(z_a, D)] z_b                        ▷ Define Hessian-vector product function
4:   for t = 1, . . . , T do
5:     ε^(ℓ) ∼ N(0, I_D) for ℓ = 1, . . . , L                     ▷ Base randomness q₀
6:     ĝ^(ℓ)_{λt} ← ∇_λ ln p(z(ε^(ℓ), λ_t), D)                    ▷ Reparameterization gradients
7:     g̃^(ℓ)_{m,λt} ← f(m_λt) + H(m_λt, s_λt ⊙ ε^(ℓ))            ▷ Mean approx
8:     g̃^(ℓ)_{s,λt} ← (f(m_λt) + H(m_λt, s_λt ⊙ ε^(ℓ))) ⊙ ε^(ℓ) + 1/s_λt   ▷ Scale approx
9:     E[g̃_{m,λt}] ← f(m_λt)                                     ▷ Mean approx expectation
10:    E[g̃_{s,λt}] ← diag(H(m_λt)) ⊙ s_λt + 1/s_λt               ▷ Scale approx expectation
11:    ĝ^(RV)_{λt} ← (1/L) Σ_ℓ (ĝ^(ℓ)_{λt} − (g̃^(ℓ)_{λt} − E[g̃_{λt}]))   ▷ Subtract control variate
12:    λ_{t+1} ← grad-update(λ_t, ĝ^(RV)_{λt})                    ▷ Gradient step (sgd, adam, etc.)
13:  return λ_T

Figure 2: Relationship between the base randomness ε, RGE ĝ, and approximation g̃. Arrows indicate deterministic functions. Sharing ε correlates the random variables. We know the distribution of g̃, which allows us to use it as a control variate for ĝ.

Putting It Together: Full RGE Approximation  We write the complete approximation of the RGE in Eq. (8) by combining Eqs. (16), (17), (18), (19), and (21), which results in two components that are concatenated, g̃_λ = [g̃_{m_λ}, g̃_{s_λ}]. Each component is defined as

g̃_{m_λ} = f̃_λ(ε) + ε/s_λ − ε/s_λ = f(m_λ) + H(m_λ)(s_λ ⊙ ε)   (22)
g̃_{s_λ} = f̃_λ(ε) ⊙ ε + (ε/s_λ) ⊙ ε − (ε² − 1)/s_λ = (f(m_λ) + H(m_λ)(s_λ ⊙ ε)) ⊙ ε + 1/s_λ.   (23)

To summarize, we have constructed an approximation, g̃_λ, of the reparameterization gradient, ĝ_λ, as a function of ε. Because both g̃_λ and ĝ_λ are functions of the same random variable ε, and because we have mimicked the random process that generates true gradient samples, the two gradient estimators will be correlated. 
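The construction in Eqs. (22)-(24) can be sketched end to end in one dimension. The model below is an illustrative stand-in (ln p(z, D) = -z^4/4 with hand-coded derivatives, and C fixed to the identity as suggested in Section 3.2), not the paper's experimental setup:

```python
import random
import statistics

def f(z):
    # data term of the toy model ln p(z, D) = -z**4 / 4
    return -z ** 3

def H(z):
    # scalar "Hessian" of the toy model
    return -3.0 * z ** 2

def mc_grad(eps, m, s):
    # standard single-sample RGE components (hand-derived from Eq. (8))
    z = m + s * eps
    return f(z), f(z) * eps + 1.0 / s

def cv_grad(eps, m, s):
    # linearized approximation, Eqs. (22)-(23)
    fl = f(m) + H(m) * s * eps
    return fl, fl * eps + 1.0 / s

def cv_mean(m, s):
    # known expectations, Eq. (24) (scalar case: diag(H) is just H)
    return f(m), H(m) * s + 1.0 / s

rng = random.Random(0)
m, s = 0.5, 0.2
Em, Es = cv_mean(m, s)
mc_m, rv_m, mc_s, rv_s = [], [], [], []
for _ in range(20000):
    eps = rng.gauss(0.0, 1.0)
    gm, gs = mc_grad(eps, m, s)
    tm, ts = cv_grad(eps, m, s)
    mc_m.append(gm)
    mc_s.append(gs)
    # Eq. (25) with C = I
    rv_m.append(gm - (tm - Em))
    rv_s.append(gs - (ts - Es))

print("mean-param var ratio :", statistics.variance(rv_m) / statistics.variance(mc_m))
print("scale-param var ratio:", statistics.variance(rv_s) / statistics.variance(mc_s))
```

Both components keep the Monte Carlo mean while the variance drops by a model- and λ-dependent factor; for this toy the reduction grows as s shrinks, since the residual after subtracting the control variate is the linearization error.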
This approximation yields two tractable distributions — a Gaussian for the mean parameter gradient, g_{m_λ}, and a location-shifted, scaled non-central χ² for the scale parameter gradient g_{s_λ}. Importantly, we can compute the mean of each component

E[g̃_{m_λ}] = f(m_λ),   E[g̃_{s_λ}] = diag(H(m_λ)) ⊙ s_λ + 1/s_λ.   (24)

We use g̃_λ (along with its expectation) as a control variate to reduce the variance of the RGE ĝ_λ.

3.2 Reduced Variance Reparameterization Gradient Estimators

Now that we have constructed a tractable gradient approximation, g̃_λ, with high correlation to the original reparameterization gradient estimator, ĝ_λ, we can use it as a control variate as in Eq. (6)

ĝ^(RV)_λ = ĝ_λ − C(g̃_λ − E[g̃_λ]).   (25)

The optimal value for C is related to the covariance between g̃_λ and ĝ_λ (see Appendix A). We can try to estimate the value of C (or a diagonal approximation to C) on the fly, or we can simply fix this value. In our case, because we are using an accurate linear approximation to the transformation of a spherical Gaussian, the optimal value of C will be close to the identity (see Appendix A.1).
High Dimensional Models  For models with high dimensional posteriors, direct manipulation of the Hessian is computationally intractable. However, our approximations in Eqs. (22) and (23) only require a Hessian-vector product, which can be computed nearly as efficiently as the gradient [21]. Modern automatic differentiation packages enable easy and efficient implementation of Hessian-vector products for nearly any differentiable model [1, 20, 15]. We note that the mean of the control variate g̃_{s_λ} (Eq. (24)) depends on the diagonal of the Hessian matrix. 
While computing the Hessian\ndiagonal may be tractable in some cases, in general it may cost the time equivalent of D function\nevaluations to compute [16]. Given a high dimensional problem, we can avoid this bottleneck in\nmultiple ways. The \ufb01rst is simply to ignore the random variation in the Jacobian term due to \u0001 \u2014\nif we \ufb01x z to be m\u03bb (as we do with the data term), the portion of the Jacobian that corresponds to\n\n6\n\n\fs\u03bb will be zero (in Eq. (16)). This will result in the same Hessian-vector-product-based estimator\nfor \u02dcgm\u03bb but will set \u02dcgs\u03bb = 0, yielding variance reduction for the mean parameter but not the scale.\nAlternatively, we can estimate the Hessian diagonal on the \ufb02y. If we use L > 1 samples at each iter-\nation, we can create a per-sample estimate of the s\u03bb-scaled diagonal of the Hessian using the other\nsamples [3]. As the scaled diagonal estimator is unbiased, we can construct an unbiased estimate\nof the control variate mean to use in lieu of the actual mean. We will see that the resulting variance\nis not much higher than when using full Hessian information, and is computationally tractable to\ndeploy on high-dimensional models. A similar local baseline strategy is used for variance reduction\nin Mnih and Rezende [17].\nRV-RGE Estimators We introduce three different estimators based on variations of the gradient\napproximation de\ufb01ned in Eqs. (22), (23), and (24), each adressing the Hessian operations differently:\n\u2022 The Full Hessian estimator implements the three equations as written and can be used when\n\u2022 The Hessian Diagonal estimator replaces the Hessian in (22) with a diagonal approxima-\n\u2022 The Hessian-vector product + local approximation (HVP+Local) uses an ef\ufb01cient Hessian-\nvector product in Eqs. (22) and (23), while approximating the diagonal term in Eq. (24)\nusing a local baseline. 
The HVP+Local approximation is geared toward models where\nHessian-vector products can be computed, but the exact diagonal of the Hessian cannot.\n\ntion, useful for models with a cheap Hessian diagonal.\n\nit is computationally feasible to use the full Hessian.\n\nWe detail the RV-RGE procedure in Algorithm 1 and compare properties of these three estimators\nto the pure Monte Carlo estimator in the following section.\n\n3.3 Related Work\n\nRecently, Roeder et al. [27] introduced a variance reduction technique for reparameterization gra-\ndients that ignores the parameter score component of the gradient and can be viewed as a type of\ncontrol variate for the gradient throughout the optimization procedure. This approach is comple-\nmentary to our method \u2014 our approximation is typically more accurate near the beginning of the\noptimization procedure, whereas the estimator in Roeder et al. [27] is low-variance near conver-\ngence. We hope to incorporate information from both control variates in future work. Per-sample\nestimators in a multi-sample setting for variational inference were used in Mnih and Rezende [17].\nWe employ this technique in a different way; we use it to estimate computationally intractable quan-\ntities needed to keep the gradient estimator unbiased. Black box variational inference used control\nvariates and Rao-Blackwellization to reduce the variance of score-function estimators [22]. Our\ndevelopment of variance reduction for reparameterization gradients complements their work. Other\nvariance reduction techniques for stochastic gradient descent have focused on stochasticity due to\ndata subsampling [12, 31]. Johnson and Zhang [12] cache statistics about the entire dataset at each\nepoch to use as a control variate for noisy mini-batch gradients.\nThe variance reduction method described in Paisley et al. [19] is conceptually similar to ours. 
This\nmethod uses \ufb01rst or second order derivative information to reduce the variance of the score function\nestimator. The score function estimator (and their reduced variance version) often has much higher\nvariance than the reparameterization gradient estimator that we improve upon in this work. Our\nvariance measurement experiments in Table 1 includes a comparison to the estimator featured in\n[19], which we found to be much higher variance than the baseline RGE.\n\n4 Experiments and Analysis\n\nIn this section we empirically examine the variance properties of RV-RGEs and stochastic optimiza-\ntion for two real-data examples \u2014 a hierarchical Poisson GLM and a Bayesian neural network.7\n\u2022 Hierarchical Poisson GLM: The frisk model is a hierarchical Poisson GLM, described in\n\u2022 Bayesian Neural Network: The non-conjugate bnn model is a Bayesian neural network\napplied to the wine dataset, (see Appendix C.2) and has a D = 653 dimensional posterior.\n\nAppendix C.1. This non-conjugate model has a D = 37 dimensional posterior.\n\n7Code is available at https://github.com/andymiller/ReducedVarianceReparamGradients.\n\n7\n\n\fTable 1: Comparison of variances for RV-RGEs with L = 10-sample estimators. Variance mea-\nsurements were taken for \u03bb values at three points during the optimization algorithm (early, mid,\nlate). The parenthetical rows labeled \u201cMC abs\u201d denote the absolute value of the standard Monte\nCarlo reparameterization gradient estimator. The other rows compare estimators relative to the pure\nMC RGE variance \u2014 a value of 100 indicates equal variation L = 10 samples, a value of 1 indi-\ncates a 100-fold decrease in variance (lower is better). Our new estimators (Full Hessian, Hessian\nDiag, HVP+Local) are described in Section 3.2. The Score Delta method is the gradient estimator\ndescribed in [19]. 
Additional variance measurement results are in Appendix D.

                             g_{m_λ}                  ln g_{s_λ}               g_λ
Iteration  Estimator         Ave V(·)    V(||·||)     Ave V(·)    V(||·||)     Ave V(·)    V(||·||)
early      (MC abs.)         (1.7e+02)   (5.4e+03)    (3e+04)     (2e+05)      (1.5e+04)   (5.9e+03)
           MC                100.000     100.000      100.000     100.000      100.000     100.000
           Full Hessian      1.279       1.139        0.001       0.002        0.008       1.039
           Hessian Diag      34.691      23.764       0.003       0.012        0.194       21.684
           HVP + Local       1.279       1.139        0.013       0.039        0.020       1.037
           Score Delta [19]  6069.668    718.430      1.395       0.931        34.703      655.105
mid        (MC abs.)         (3.8e+03)   (1.3e+05)    (18)        (3.3e+02)    (1.9e+03)   (1.3e+05)
           MC                100.000     100.000      100.000     100.000      100.000     100.000
           Full Hessian      0.075       0.068        0.113       0.143        0.076       0.068
           Hessian Diag      38.891      21.283       6.295       7.480        38.740      21.260
           HVP + Local       0.075       0.068        30.754      39.156       0.218       0.071
           Score Delta [19]  4763.246    523.175      2716.038    700.100      4753.752    523.532
late       (MC abs.)         (1.7e+03)   (1.3e+04)    (1.1)       (19)         (8.3e+02)   (1.3e+04)
           MC                100.000     100.000      100.000     100.000      100.000     100.000
           Full Hessian      0.042       0.030        1.686       0.431        0.043       0.030
           Hessian Diag      40.292      53.922       23.644      28.024       40.281      53.777
           HVP + Local       0.042       0.030        98.523      99.811       0.110       0.022
           Score Delta [19]  5183.885    1757.209     17355.120   3084.940     5192.270    1761.317

Quantifying Gradient Variance Reduction  We measure the variance reduction of the RGE observed at various iterates, λ_t, during execution of gradient descent. Both the gradient magnitude and the marginal variance of the gradient elements — using a sample of 1000 gradients — are reported. Further, we inspect both the mean, m_λ, and log-scale, ln s_λ, parameters separately. 
Ta-\nble 1 compares gradient variances for the frisk model for our four estimators: i) pure Monte Carlo\n(MC), ii) Full Hessian, iii) Hessian Diagonal, and iv) Hessian-vector product + local approxima-\ntion (HVP+Local). Additionally, we compare our methods to the estimator described in [19], based\non the score function estimator and a control variate method. We use a \ufb01rst order delta method\napproximation of the model term, which admits a closed form control variate term.\nEach entry in the table measures the percent of the variance of the pure Monte Carlo estimator. We\nshow the average variance over each component AveV(\u00b7), and the variance of the norm V(||\u00b7||). We\nseparate out variance in mean parameters, gm, log scale parameters, ln gs, and the entire vector g\u03bb.\nThe reduction in variance is dramatic. Using HVP+Local, in the norm of the mean parameters we\nsee between a 80\u00d7 and 3,000\u00d7 reduction in variance depending on the progress of the optimizer.\nThe importance of the full Hessian-vector product for reducing mean parameter variance is also\ndemonstrated as the Hessian diagonal only reduces mean parameter variance by a factor of 2-5\u00d7.\nFor the variational scale parameters, ln gs, we see that early on the HVP+Local approximation is\nable to reduce parameter variance by a large factor (\u2248 2,000\u00d7). However, at later iterates the\nHVP+Local scale parameter variance is on par with the Monte Carlo estimator, while the full Hes-\nsian estimator still enjoys huge variance reduction. This indicates that, by this point, most of the\nnoise is the local Hessian diagonal estimator. We also note that in this problem, most of the estima-\ntor variance is in the mean parameters. Because of this, the norm of the entire parameter gradient, g\u03bb\nis reduced by 100\u2212 5,000\u00d7. 
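The relative-variance entries of the kind reported in Table 1 are mechanical to compute: draw a batch of gradient samples from each estimator, take per-component marginal variances, and express them as a percentage of the plain Monte Carlo baseline. Below is a minimal sketch with synthetic gradient samples standing in for real estimator output; the function and variable names are illustrative, not the paper's released code.

```python
import numpy as np

def relative_variance(grads, grads_mc):
    """Average marginal variance of `grads`, as a percentage of the
    average marginal variance of the Monte Carlo baseline `grads_mc`.
    Rows are independent gradient samples, columns are components."""
    ave_v = np.var(grads, axis=0).mean()
    ave_v_mc = np.var(grads_mc, axis=0).mean()
    return 100.0 * ave_v / ave_v_mc

rng = np.random.default_rng(0)
D, n_samples = 5, 1000

# Toy stand-ins: a noisy baseline estimator and a variance-reduced one.
# Both are unbiased for `true_grad`; only their noise scales differ.
true_grad = np.ones(D)
g_mc = true_grad + rng.normal(scale=1.0, size=(n_samples, D))
g_rv = true_grad + rng.normal(scale=0.1, size=(n_samples, D))

print(relative_variance(g_mc, g_mc))  # 100.0 by construction
print(relative_variance(g_rv, g_mc))  # ~1, i.e. roughly a 100x reduction
```

The same recipe, applied to real estimator samples at a fixed λt, produces one table entry; V(||·||) replaces the per-component variance with the variance of the gradient norm.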
We found that the score function estimator (with the delta method control variate) typically has much higher variance than the baseline reparameterization gradient estimator (often by a factor of 10-50×). In Appendix D we report results for other values of L.

Optimizer Convergence and Stability   We compare the optimization traces for the frisk and bnn models for the MC and the HVP+Local estimators under various conditions. At each iteration we estimate the true ELBO value using 2,000 Monte Carlo samples. We optimize the ELBO objective using adam [13] for two step sizes, each trace starting at the same value of λ0.

(a) adam with step size = 0.05   (b) adam with step size = 0.10

Figure 3: MCVI optimization trace applied to the frisk model for two values of L and two step sizes. We run the standard MC gradient estimator (solid line) and the RV-RGE with L = 2 and 10 samples.

(a) adam with step size = 0.05   (b) adam with step size = 0.10

Figure 4: MCVI optimization for the bnn model applied to the wine data for various L and step sizes. The standard MC gradient estimator (dotted) was run with 2, 10, and 50 samples; RV-RGE (solid) was run with 2 and 10 samples. In 4b the 2-sample MC estimator falls below the frame.

Figure 3 compares ELBO optimization traces for L = 2 and L = 10 samples and step sizes 0.05 and 0.10 for the frisk model. We see that the HVP+Local estimators make early progress and converge quickly. We also see that the L = 2 pure MC estimator results in noisy optimization paths. Figure 4 shows objective value as a function of wall clock time under various settings for the bnn model.
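Each HVP+Local gradient sample costs one extra Hessian-vector product, which accounts for the per-iteration overhead visible in these wall-clock traces. As intuition for why the mean-parameter variance can collapse so dramatically, the control-variate construction can be sketched on a toy quadratic model, where the linearization of the gradient sample in the noise ε is exact. This is a hedged illustration with hypothetical names, using a finite-difference HVP in place of the exact Pearlmutter-style autodiff HVP [21]; it is not the released code.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4
A = np.diag(rng.uniform(0.5, 2.0, size=D))  # toy curvature; Hessian of f is -A
b = rng.normal(size=D)

def grad_f(z):
    # Gradient of the toy log-density f(z) = -0.5 z'Az + b'z.
    return b - A @ z

def hvp(z, v, h=1e-5):
    # Hessian-vector product by central differences of the gradient
    # (a dependency-free stand-in for one exact autodiff HVP pass).
    return (grad_f(z + h * v) - grad_f(z - h * v)) / (2 * h)

m, s = rng.normal(size=D), 0.5  # variational mean and (shared) scale

def mean_grad_sample(eps, use_cv):
    g = grad_f(m + s * eps)  # reparameterization gradient w.r.t. m
    if not use_cv:
        return g
    # Linearize the gradient sample around eps = 0; since E[eps] = 0,
    # the approximation's expectation is grad_f(m).
    g_tilde = grad_f(m) + s * hvp(m, eps)
    return g - g_tilde + grad_f(m)  # subtract approximation, add back its mean

noise = rng.normal(size=(1000, D))
v_mc = np.var([mean_grad_sample(e, False) for e in noise], axis=0).mean()
v_cv = np.var([mean_grad_sample(e, True) for e in noise], axis=0).mean()
print(v_mc, v_cv)  # v_cv is ~0: the linearization is exact for a quadratic
```

On a real ELBO the model term is not quadratic, so the linear approximation only partially tracks the gradient sample, and the residual variance is what the relative-variance measurements quantify.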
The HVP+Local estimator does more work per iteration; however, it tends to converge faster. We observe the L = 10 HVP+Local estimator outperforming the L = 50 MC estimator.

5 Conclusion

Variational inference reframes an integration problem as an optimization problem, with the caveat that each step of the optimization procedure solves an easier integration problem. For general models, each sub-integration problem is itself intractable and must be estimated, typically with Monte Carlo samples. Our work has shown that we can use more information about the variational family to create tighter estimators of the ELBO gradient, which leads to faster and more stable optimization. The efficacy of our approach relies on the complexity of the RGE distribution being well-captured by linear structure, which may not be true for all models. However, we found the idea effective for non-conjugate hierarchical Bayesian models and a neural network.

Our presentation is a specific instantiation of a more general idea: using cheap linear structure to remove variation from stochastic gradient estimates. The method described in this work is tailored to Gaussian approximating families for Monte Carlo variational inference, but could easily be extended to location-scale families. We plan to extend this idea to more flexible variational distributions, including flow distributions [24] and hierarchical distributions [23], which would require approximating different functional forms within the variational objective.
We also plan to adapt our technique to models and inference schemes with recognition networks [14], which would require back-propagating de-noised gradients into the parameters of an inference network.

Acknowledgements

The authors would like to thank Finale Doshi-Velez, Mike Hughes, Taylor Killian, Andrew Ross, and Matt Hoffman for helpful conversations and comments on this work. ACM is supported by the Applied Mathematics Program within the Office of Science Advanced Scientific Computing Research of the U.S. Department of Energy under contract No. DE-AC02-05CH11231. NJF is supported by a Washington Research Foundation Innovation Postdoctoral Fellowship in Neuroengineering and Data Science. RPA is supported by NSF IIS-1421780 and the Alfred P. Sloan Foundation.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[3] Costas Bekas, Effrosyni Kokiopoulou, and Yousef Saad. An estimator for the diagonal of a matrix. Applied Numerical Mathematics, 57(11):1214–1229, 2007.

[4] David M Blei, Alp Kucukelbir, and Jon D McAuliffe.
Variational inference: A review for statisticians. Journal of the American Statistical Association, 2017.

[5] Andrew Gelman and Jennifer Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2006.

[6] Andrew Gelman, Jeffrey Fagan, and Alex Kiss. An analysis of the NYPD's stop-and-frisk policy in the context of claims of racial bias. Journal of the American Statistical Association, 102:813–823, 2007.

[7] Paul Glasserman. Monte Carlo Methods in Financial Engineering, volume 53. Springer Science & Business Media, 2004.

[8] Paul Glasserman. Monte Carlo Methods in Financial Engineering, volume 53. Springer Science & Business Media, 2013.

[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[10] Matthew D Hoffman and Matthew J Johnson. ELBO surgery: yet another way to carve up the variational evidence lower bound. 2016.

[11] Matthew D Hoffman, David M Blei, Chong Wang, and John William Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.

[12] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[13] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, 2015.

[14] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations, 2014.

[15] Dougal Maclaurin, David Duvenaud, Matthew Johnson, and Ryan P. Adams. Autograd: Reverse-mode differentiation of native Python, 2015.
URL http://github.com/HIPS/autograd.

[16] James Martens, Ilya Sutskever, and Kevin Swersky. Estimating the Hessian by back-propagating curvature. In Proceedings of the International Conference on Machine Learning, 2012.

[17] Andriy Mnih and Danilo Rezende. Variational inference for Monte Carlo objectives. In Proceedings of the 33rd International Conference on Machine Learning, pages 2188–2196, 2016.

[18] Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.

[19] John Paisley, David M Blei, and Michael I Jordan. Variational Bayesian inference with stochastic search. In Proceedings of the 29th International Conference on Machine Learning, pages 1363–1370. Omnipress, 2012.

[20] Adam Paszke, Sam Gross, Soumith Chintala, and Gregory Chanan. PyTorch. https://github.com/pytorch/pytorch, 2017.

[21] Barak A Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–160, 1994.

[22] Rajesh Ranganath, Sean Gerrish, and David M Blei. Black box variational inference. In AISTATS, pages 814–822, 2014.

[23] Rajesh Ranganath, Dustin Tran, and David M Blei. Hierarchical variational models. In International Conference on Machine Learning, 2016.

[24] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1530–1538, 2015.

[25] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.

[26] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.

[27] Geoffrey Roeder, Yuhuai Wu, and David Duvenaud. Sticking the landing: An asymptotically zero-variance gradient estimator for variational inference. arXiv preprint arXiv:1703.09194, 2017.

[28] Francisco R Ruiz, Michalis Titsias RC AUEB, and David Blei. The generalized reparameterization gradient. In Advances in Neural Information Processing Systems, pages 460–468, 2016.

[29] Michalis Titsias and Miguel Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1971–1979, 2014.

[30] Dustin Tran, Matthew D Hoffman, Rif A Saurous, Eugene Brevdo, Kevin Murphy, and David M Blei. Deep probabilistic programming. In Proceedings of the International Conference on Learning Representations, 2017.

[31] Chong Wang, Xi Chen, Alexander J Smola, and Eric P Xing. Variance reduction for stochastic gradient optimization. In Advances in Neural Information Processing Systems, pages 181–189, 2013.