{"title": "Fast Black-box Variational Inference through Stochastic Trust-Region Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 2402, "page_last": 2411, "abstract": "We introduce TrustVI, a fast second-order algorithm for black-box variational inference based on trust-region optimization and the reparameterization trick. At each iteration, TrustVI proposes and assesses a step based on minibatches of draws from the variational distribution. The algorithm provably converges to a stationary point. We implemented TrustVI in the Stan framework and compared it to two alternatives: Automatic Differentiation Variational Inference (ADVI) and Hessian-free Stochastic Gradient Variational Inference (HFSGVI). The former is based on stochastic first-order optimization. The latter uses second-order information, but lacks convergence guarantees. TrustVI typically converged at least one order of magnitude faster than ADVI, demonstrating the value of stochastic second-order information. TrustVI often found substantially better variational distributions than HFSGVI, demonstrating that our convergence theory can matter in practice.", "full_text": "Fast Black-box Variational Inference\n\nthrough Stochastic Trust-Region Optimization\n\nJeffrey Regier\n\njregier@cs.berkeley.edu\n\nMichael I. Jordan\n\njordan@cs.berkeley.edu\n\nJon McAuliffe\n\njon@stat.berkeley.edu\n\nAbstract\n\nWe introduce TrustVI, a fast second-order algorithm for black-box variational\ninference based on trust-region optimization and the \u201creparameterization trick.\u201d At\neach iteration, TrustVI proposes and assesses a step based on minibatches of draws\nfrom the variational distribution. The algorithm provably converges to a stationary\npoint. We implemented TrustVI in the Stan framework and compared it to two\nalternatives: Automatic Differentiation Variational Inference (ADVI) and Hessian-\nfree Stochastic Gradient Variational Inference (HFSGVI). 
The former is based on stochastic first-order optimization. The latter uses second-order information, but lacks convergence guarantees. TrustVI typically converged at least one order of magnitude faster than ADVI, demonstrating the value of stochastic second-order information. TrustVI often found substantially better variational distributions than HFSGVI, demonstrating that our convergence theory can matter in practice.

1 Introduction

The "reparameterization trick" [1, 2, 3] has led to a resurgence of interest in variational inference (VI), making it applicable to essentially any differentiable model. This new approach, however, requires stochastic optimization rather than fast deterministic optimization algorithms like closed-form coordinate ascent. Some fast stochastic optimization algorithms exist, but variational objectives have properties that make them unsuitable: they are typically nonconvex, and the relevant expectations cannot usually be replaced by finite sums. Thus, to date, practitioners have used SGD and its variants almost exclusively. Automatic Differentiation Variational Inference (ADVI) [4] has been especially successful at making variational inference based on first-order stochastic optimization accessible. Stochastic first-order optimization, however, is slow in theory (sublinear convergence) and in practice (thousands of iterations), negating a key benefit of VI.
This article presents TrustVI, a fast algorithm for variational inference based on second-order trust-region optimization and the reparameterization trick. TrustVI routinely converges in tens of iterations for models that take thousands of ADVI iterations. TrustVI's iterations can be more expensive, but on a large collection of Bayesian models, TrustVI typically reduced total computation by an order of magnitude.
Usually TrustVI and ADVI find the same objective value, but when they differ, TrustVI is typically better.
TrustVI adapts to the stochasticity of the optimization problem, raising the sampling rate for assessing proposed steps based on a Hoeffding bound. It provably converges to a stationary point. TrustVI generalizes the Newton trust-region method [5], which converges quadratically and has performed well at optimizing analytic variational objectives even at an extreme scale [6]. With large enough minibatches, TrustVI iterations are nearly as productive as those of a deterministic trust region method. Fortunately, large minibatches make effective use of single-instruction multiple-data (SIMD) parallelism on modern CPUs and GPUs.
TrustVI uses either explicitly formed approximations of Hessians or approximate Hessian-vector products. Explicitly formed Hessians can be fast for low-dimensional problems or problems with sparse Hessians, particularly when expensive computations (e.g., exponentiation) already need to be performed to evaluate a gradient. But Hessian-vector products are often more convenient. They can be computed efficiently through forward-mode automatic differentiation, reusing the implementation for computing gradients [7, 8]. This is the approach we take in our experiments.
Fan et al. [9] also note the limitations of first-order stochastic optimization for variational inference: the learning rate is difficult to set, and convergence is especially slow for models with substantial curvature. Their approach is to apply Newton's method or L-BFGS to problems that are both stochastic and nonconvex. All stationary points—minima, maxima, and saddle points—act as attractors for Newton steps, however, so while Newton's method may converge quickly, it may also converge poorly.
Trust region methods, on the other hand, are not only unharmed by negative curvature, they exploit it: descent directions that become even steeper are among the most productive. In Section 5, we empirically compare TrustVI to Hessian-free Stochastic Gradient Variational Inference (HFSGVI) to assess the practical importance of our convergence theory.
TrustVI builds on work from the derivative-free optimization community [10, 11, 12]. The STORM framework [12] is general enough to apply to a derivative-free setting, as well as settings where higher-order stochastic information is available. STORM, however, requires that a quadratic model of the objective function can always be constructed such that, with non-trivial probability, the quadratic model's absolute error is uniformly bounded throughout the trust region. That requirement can be satisfied for the kind of low-dimensional problems one can optimize without derivatives, where the objective may be sampled throughout the trust region at a reasonable density, but not for most variational objective functions.

2 Background

Variational inference chooses an approximation to the posterior distribution from a class of candidate distributions through numerical optimization [13]. The candidate approximating distributions q_ω are parameterized by a real-valued vector ω. The variational objective function 𝓛, also known as the evidence lower bound (ELBO), is an expectation with respect to latent variables z that follow an approximating distribution q_ω:

$$\mathcal{L}(\omega) \triangleq \mathbb{E}_{q_\omega}\left[ \log p(x, z) - \log q_\omega(z) \right]. \tag{1}$$

Here x, the data, is fixed. If this expectation has an analytic form, 𝓛 may be maximized by deterministic optimization methods, such as coordinate ascent and Newton's method. Realistic Bayesian models, however, not selected primarily for computational convenience, seldom yield variational objective functions with analytic forms.
Stochastic optimization offers an alternative.
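As a concrete instance of the stochastic alternative (anticipating the reparameterized estimator of Equation 2), the following sketch estimates the ELBO of Equation 1 by Monte Carlo for a mean-field Gaussian variational distribution, where g_ω(ε) = μ + σ ⊙ ε. The toy model, function name, and sample size here are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def elbo_hat(log_joint, mu, log_sigma, n_samples, rng):
    """Reparameterized Monte Carlo ELBO estimate for a mean-field Gaussian.

    z = g_omega(eps) = mu + exp(log_sigma) * eps with eps ~ N(0, I), so the
    base distribution does not depend on the variational parameters.
    """
    eps = rng.standard_normal((n_samples, mu.size))
    z = mu + np.exp(log_sigma) * eps
    # log q_omega(z) for a diagonal Gaussian, evaluated at the drawn z.
    log_q = -0.5 * np.sum(eps**2 + 2.0 * log_sigma + np.log(2.0 * np.pi), axis=1)
    log_p = np.array([log_joint(zi) for zi in z])
    return float(np.mean(log_p - log_q))

# Toy target: unnormalized standard Gaussian posterior, with q = N(0, I).
rng = np.random.default_rng(0)
value = elbo_hat(lambda z: -0.5 * z @ z, np.zeros(2), np.zeros(2), 1024, rng)
```

For this particular toy pairing the integrand is constant, so the estimate is exact; in general its variance shrinks as the minibatch grows, which is the knob TrustVI turns throughout.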
For many common classes of approximating distributions, there exists a base distribution p₀ and a function g_ω such that, for ε ∼ p₀ and z ∼ q_ω, g_ω(ε) has the same distribution as z. In words: the random variable z, whose distribution depends on ω, is a deterministic function of a random variable ε whose distribution does not depend on ω. This alternative expression of the variational distribution is known as the "reparameterization trick" [1, 2, 3, 14]. At each iteration of an optimization procedure, ω is updated based on an unbiased Monte Carlo approximation to the objective function:

$$\hat{\mathcal{L}}(\omega; \varepsilon_1, \ldots, \varepsilon_N) \triangleq \frac{1}{N} \sum_{i=1}^{N} \left\{ \log p(x, g_\omega(\varepsilon_i)) - \log q_\omega(g_\omega(\varepsilon_i)) \right\} \tag{2}$$

for ε₁, …, ε_N sampled from the base distribution.

3 TrustVI

TrustVI performs stochastic optimization of the ELBO 𝓛 to find a distribution q_ω that approximates the posterior. For TrustVI to converge, the ELBO only needs to satisfy Condition 1. (Subsequent conditions apply to the algorithm specification, not the optimization problem.)

Condition 1. 𝓛 : ℝ^D → ℝ is a twice-differentiable function of ω that is bounded above. Its gradient has Lipschitz constant L.

Condition 1 is compatible with all models whose conditional distributions are in the exponential family. The ELBO for a model with categorical random variables, for example, is twice differentiable in its parameters when using a mean-field categorical variational distribution.

Algorithm 1 TrustVI
Require: Initial iterate ω₀ ∈ ℝ^D; initial trust region radius Δ₀ ∈ (0, Δ_max]; and settings for the parameters listed in Table 1.
for k = 0, 1, 2, … do
    Draw stochastic gradient g_k satisfying Condition 2.
    Select symmetric matrix H_k satisfying Condition 3.
    Solve for s_k ≜ argmax { g_kᵀ s + ½ sᵀ H_k s : ‖s‖ ≤ Δ_k }.
    Compute m′_k ≜ g_kᵀ s_k + ½ s_kᵀ H_k s_k.
    Select N_k satisfying Inequality 11 and Inequality 13.
    Draw ℓ′_{k1}, …, ℓ′_{kN_k} satisfying Condition 4.
    Compute ℓ′_k ≜ (1/N_k) Σ_{i=1}^{N_k} ℓ′_{ki}.
    if ℓ′_k ≥ η m′_k ≥ δ Δ_k² then
        ω_{k+1} ← ω_k + s_k
        Δ_{k+1} ← min(γ Δ_k, Δ_max)
    else
        ω_{k+1} ← ω_k
        Δ_{k+1} ← Δ_k / γ
    end if
end for

Table 1: User-selected parameters for TrustVI

name   | brief description                                          | allowable range
η      | model fitness threshold                                    | (0, 1/2]
γ      | trust region expansion factor                              | (1, ∞)
δ      | trust region radius constraint                             | (0, ∞)
α      | tradeoff between trust region radius and objective value   | (δ/(1 − 2η), ∞)
ν₁     | tradeoff between both sampling rates                       | (0, 1 − η)
ν₂     | accuracy of "good" stochastic gradients' norms             | (0, 1)
ν₃     | accuracy of "good" stochastic gradients' directions        | (0, 1 − η − ν₁)
ξ₀     | probability of "good" stochastic gradients                 | (1/2, 1)
ξ₁     | probability of accepting a "good" step                     | (1/(2ξ₀), 1)
κ_H    | maximum norm of the quadratic models' Hessians             | [0, ∞)
Δ̄      | maximum trust region radius for enforcing some conditions  | (0, ∞]
Δ_max  | maximum trust region radius                                | (0, ∞)

The domain of 𝓛 is taken to be all of ℝ^D. If instead the domain is a proper subset of a real coordinate space, the ELBO can often be reparameterized so that its domain is ℝ^D [4].
TrustVI iterations follow the form of common deterministic trust region methods: 1) construct a quadratic model of the objective function restricted to the current trust region; 2) find an approximate optimizer of the model function: the proposed step; 3) assess whether the proposed step leads to an improvement in the objective; and 4) update the iterate and the trust region radius based on the assessment. After introducing notation in Section 3.1, we describe proposing a step in Section 3.2 and assessing a proposed step in Section 3.3. TrustVI is summarized in Algorithm 1.

3.1 Notation

TrustVI's iteration number is denoted by k.
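Stripped of the sample-size machinery, the accept/reject and radius-update rule at the heart of Algorithm 1 can be sketched as follows. This is a deterministic, illustrative stand-in (a Cauchy-point step along the gradient and exact improvement, with parameter values of our own choosing), not the paper's Krylov-subspace implementation.

```python
import numpy as np

def trustvi_iteration(f, grad, hess, omega, delta_k,
                      eta=0.1, dconst=1e-8, gamma=2.0, delta_max=10.0):
    """One sketch of Algorithm 1's outer loop, with exact quantities.

    The proposed step maximizes the quadratic model along the gradient
    direction only (the Cauchy point); the improvement is computed exactly
    here, whereas the paper estimates it from minibatch draws.
    """
    g, H = grad(omega), hess(omega)
    gnorm = np.linalg.norm(g)
    if gnorm == 0.0:                       # already stationary
        return omega, delta_k
    gHg = g @ H @ g
    t_max = delta_k / gnorm
    # Maximizer of t*||g||^2 + 0.5*t^2*gHg over t in [0, t_max].
    t = min(t_max, gnorm**2 / -gHg) if gHg < 0 else t_max
    s = t * g
    model_improvement = g @ s + 0.5 * s @ H @ s
    observed = f(omega + s) - f(omega)
    if observed >= eta * model_improvement >= dconst * delta_k**2:
        return omega + s, min(gamma * delta_k, delta_max)   # accept, expand
    return omega, delta_k / gamma                           # reject, shrink

# Concave toy objective f(w) = -0.5 ||w||^2; one step from w = (4, 0).
f = lambda w: -0.5 * w @ w
omega_new, delta_new = trustvi_iteration(
    f, lambda w: -w, lambda w: -np.eye(2), np.array([4.0, 0.0]), 8.0)
```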
During iteration k, until variables are updated at its end, ω_k is the iterate, Δ_k is the trust region radius, and 𝓛(ω_k) is the objective-function value. As shorthand, let 𝓛_k ≜ 𝓛(ω_k).
During iteration k, a quadratic model m_k is formed based on a stochastic gradient g_k of 𝓛(ω_k), as well as a local Hessian approximation H_k. The maximizer of this model on the trust region, s_k, we call the proposed step. The maximum, denoted m′_k ≜ m_k(s_k), we refer to as the model improvement. We use the "prime" symbol to denote changes relating to a proposed step s_k that is not necessarily accepted; e.g., 𝓛′_k = 𝓛(ω_k + s_k) − 𝓛_k. We use the symbol ∆ to denote change across iterations; e.g., ∆𝓛_k = 𝓛_{k+1} − 𝓛_k. If a proposed step is accepted, then, for example, ∆𝓛_k = 𝓛′_k and ∆Δ_k = Δ′_k.
Each iteration k has two sources of randomness: m_k and ℓ′_k, an unbiased estimate of 𝓛′_k that determines whether to accept proposed step s_k. ℓ′_k is based on an iid random sample of size N_k (Section 3.3).
For the random sequence m₁, ℓ′₁, m₂, ℓ′₂, …, it is often useful to condition on the earlier variables when reasoning about the next. Let M_k⁻ refer to the σ-algebra generated by m₁, …, m_{k−1} and ℓ′₁, …, ℓ′_{k−1}. When we condition on M_k⁻, we hold constant all the outcomes that precede iteration k. Let M_k⁺ refer to the σ-algebra generated by m₁, …, m_k and ℓ′₁, …, ℓ′_{k−1}. When we condition on M_k⁺, we hold constant all the outcomes that precede drawing the sample that determines whether to accept the kth proposed step.
Table 1 lists the user-selected parameters that govern the behavior of the algorithm. TrustVI converges to a stationary point for any selection of parameters in the allowable range (column 3). As shorthand, we refer to a particular trust region radius, derived from the user-selected parameters, as

$$\Delta_k^* \triangleq \min\left( \bar{\Delta},\; \sqrt{\frac{\eta\, m'_k}{\delta}},\; \frac{\nu_2 \nu_3 \|\nabla \mathcal{L}_k\|}{\nu_2 L + \nu_2 \eta \kappa_H + 8 \kappa_H} \right). \tag{3}$$

3.2 Proposing a step

At each iteration, TrustVI proposes the step s_k that maximizes the local quadratic approximation

$$m_k(s) = \mathcal{L}_k + g_k^\top s + \tfrac{1}{2} s^\top H_k s, \qquad \|s\| \le \Delta_k, \tag{4}$$

to the function 𝓛 restricted to the trust region.
We set g_k to the gradient of 𝓛̂ at ω_k, where 𝓛̂ is evaluated using a freshly drawn sample ε₁, …, ε_N. From Equation 2 we see that g_k is a stochastic gradient constructed from a minibatch of size N. We must choose N large enough to satisfy the following condition:

Condition 2. If Δ_k ≤ Δ*_k, then, with probability ξ₀, given M_k⁻,

$$g_k^\top \nabla \mathcal{L}_k \ge (\nu_1 + \nu_3) \|\nabla \mathcal{L}_k\| \|g_k\| + \eta \|g_k\|^2 \tag{5}$$

and

$$\|g_k\| \ge \nu_2 \|\nabla \mathcal{L}_k\|. \tag{6}$$

Condition 2 is the only restriction on the stochastic gradients: they have to point in roughly the right direction most of the time, and they have to be of roughly the right magnitude when they do. By constructing the stochastic gradients from large enough minibatches of draws from the variational distribution, this condition can always be met.
In practice, we cannot observe ∇𝓛, and we do not explicitly set ν₁, ν₂, and ν₃. Fortunately, Condition 2 holds as long as our stochastic gradients remain large in relation to their variance. Because we base each stochastic gradient on at least one sizable minibatch, we always have many iid samples to inform us about the population of stochastic gradients. We use a jackknife estimator [15] to conservatively bound the standard deviation of the norm of the stochastic gradient. If the norm of a given stochastic gradient is small relative to its standard deviation, we double the next iteration's sampling rate. If it is large relative to its standard deviation, we halve it. Otherwise, we leave it unchanged.
The gradient observations may also include randomness from sources other than sampling the variational distribution.
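The double-or-halve heuristic for the gradient minibatch can be sketched as follows. The jackknife standard error follows the general recipe of [15], but the function name and the two threshold values are illustrative assumptions of ours, not constants from the paper.

```python
import numpy as np

def adapt_minibatch(per_sample_grads, batch_size, low=3.0, high=30.0):
    """Double or halve the next minibatch based on a jackknife error bound.

    per_sample_grads: (N, D) array of gradient draws whose mean is the
    stochastic gradient. We compare the gradient's norm to a jackknife
    standard error of that norm, built from leave-one-out means.
    """
    n = per_sample_grads.shape[0]
    g = per_sample_grads.mean(axis=0)
    loo = (n * g - per_sample_grads) / (n - 1)        # leave-one-out means
    loo_norms = np.linalg.norm(loo, axis=1)
    se = np.sqrt((n - 1) / n * np.sum((loo_norms - loo_norms.mean())**2))
    ratio = np.linalg.norm(g) / max(se, 1e-12)
    if ratio < low:        # norm small relative to its noise: sample more
        return 2 * batch_size
    if ratio > high:       # norm large relative to its noise: sample less
        return batch_size // 2
    return batch_size

# Strong, low-noise signal: the heuristic halves the minibatch.
rng = np.random.default_rng(0)
strong = rng.normal([100.0, 0.0], 0.01, size=(64, 2))
next_n = adapt_minibatch(strong, 256)
```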
In the "doubly stochastic" setting [3], for example, the data is also subsampled. This setting is fully compatible with our algorithm, though the size of the subsample may need to vary across iterations. To simplify our presentation, we henceforth consider only stochasticity from sampling the variational distribution.
Condition 3 is the only restriction on the quadratic models' Hessians.

Condition 3. There exists finite κ_H satisfying, for the spectral norm,

$$\|H_k\| \le \kappa_H \quad a.s. \tag{7}$$

for all iterations k with Δ_k ≤ Δ*_k.

For concreteness we bound the spectral norm of H_k, but a bound on any L_p norm suffices. The algorithm specification does not involve κ_H, but the convergence proof requires that κ_H be finite. This condition suffices to ensure that, when the trust region is small enough, the model's Hessian cannot interfere with finding a descent direction. With such mild conditions, we are free to use nearly arbitrary Hessians. Hessians may be formed like the stochastic gradients, by sampling from the variational distribution. The number of samples can be varied. The quadratic model's Hessian could even be set to the identity matrix if we prefer not to compute second-order information.
Low-dimensional models, and models with block-diagonal Hessians, may be optimized explicitly by inverting H_k + α_k I, where α_k is either zero for interior solutions, or just large enough that (H_k + α_k I)^{−1} g_k is on the boundary of the trust region [5]. Matrix inversion has cubic runtime, though, and even explicitly storing H_k is prohibitive for many variational objectives.
In our experiments, we instead maximize the model without explicitly storing the Hessian, through Hessian-vector multiplication, assembling Krylov subspaces through both conjugate gradient iterations and Lanczos iterations [16, 17].
We reuse our Hessian approximation for two consecutive iterations if the iterate does not change (i.e., the proposed step is rejected). A new stochastic gradient g_k is still drawn for each of these iterations.

3.3 Assessing the proposed step

Deterministic trust region methods only accept steps that improve the objective by enough. In a stochastic setting, we must ensure that accepting "bad" steps is improbable while accepting "good" steps is likely.
To assess steps, TrustVI draws new samples from the variational distribution—we may not reuse the samples that g_k and H_k are based on. The new samples are used to estimate both 𝓛(ω_k) and 𝓛(ω_k + s_k). Using the same sample to estimate both quantities is analogous to a matched-pairs experiment; it greatly reduces the variance of the improvement estimator. Formally, for i = 1, …, N_k, let ε_{ki} follow the base distribution and set

$$\ell'_{ki} \triangleq \hat{\mathcal{L}}(\omega_k + s_k; \varepsilon_{ki}) - \hat{\mathcal{L}}(\omega_k; \varepsilon_{ki}). \tag{8}$$

Let

$$\ell'_k \triangleq \frac{1}{N_k} \sum_{i=1}^{N_k} \ell'_{ki}. \tag{9}$$

Then, ℓ′_k is an unbiased estimate of 𝓛′_k—the quantity a deterministic trust region method would use to assess the proposed step.

3.3.1 Choosing the sample size

To pick the sample size N_k, we need additional control on the distribution of the ℓ′_{ki}. The next condition gives us that.

Condition 4. For each k, there exists finite σ_k such that the ℓ′_{ki} are σ_k-subgaussian.

Unlike the quantities we have introduced earlier, such as L and κ_H, the σ_k need to be known to carry out the algorithm. Because ℓ′_{k1}, ℓ′_{k2}, … are iid, σ_k² may be estimated—after the sample is drawn—by the usual variance formula, i.e., (1/(N_k − 1)) Σ_{i=1}^{N_k} (ℓ′_{ki} − ℓ′_k)². We discuss below, in the context of setting N_k, how to make use of such a "retrospective" estimate of σ_k in practice.
Two user-selected constants control which steps are accepted: η ∈ (0, 1/2) and δ > 0. The step is accepted iff 1) the observed improvement ℓ′_k exceeds the fraction η of the model improvement m′_k, and 2) the model improvement is at least the small fraction δ/η of the trust region radius squared. Formally, steps are accepted iff

$$\ell'_k \ge \eta\, m'_k \ge \delta \Delta_k^2. \tag{10}$$

If η m′_k < δ Δ_k², the step is rejected regardless of ℓ′_k: we set N_k = 0. Otherwise, we pick the smallest N_k such that

$$N_k \ge \frac{2\sigma_k^2}{(\eta m'_k + y)^2} \log\!\left(\frac{\tau_2 \Delta_k^2 + y}{\nu_1 \Delta_k^2}\right) \qquad \forall\, y > \max\!\left(-\frac{\eta m'_k}{2},\, -\tau_2 \Delta_k^2\right), \tag{11}$$

where

$$\tau_1 \triangleq \alpha(1 - \gamma^{-2}) - \nu_1 \quad \text{and} \quad \tau_2 \triangleq \alpha(\gamma^2 - \gamma^{-2}). \tag{12}$$

Finding the smallest such N_k is a one-dimensional optimization problem. We solve it via bisection.
Inequality 11 ensures that we sample enough to reject most steps that do not improve the objective sufficiently. If we knew exactly how a proposed step changed the objective, we could express in closed form how many samples would be needed to detect bad steps with sufficiently high probability. Since we do not know that, Inequality 11 must hold for all such change-values in a range. Nonetheless, N_k is rarely large in practice: the second factor lower bounding N_k is logarithmic in y, and in the first factor the denominator is bounded away from zero.
Finally, if Δ_k ≤ Δ*_k, we also ensure N_k is large enough that

$$N_k \ge \frac{-2\sigma_k^2 \log(1 - \xi_1)}{\nu_1^2 \|\nabla \mathcal{L}_k\|^2 \Delta_k^2}. \tag{13}$$

Selecting N_k this large ensures that we sample enough to detect most steps that improve the value of the objective sufficiently when the trust region is small. This bound is not high in practice: because of how the ℓ′_{ki} are collected (a "matched-pairs experiment"), as Δ_k becomes small, σ_k becomes small too, at roughly the same rate.
In practice, at the end of each iteration, we estimate whether N_k was large enough to meet the conditions. If not, we set N_{k+1} = 2N_k.
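A minimal sketch of the truncated-CG (Steihaug-Toint) approach to the trust-region subproblem, using only Hessian-vector products. This is the generic textbook variant [5, 16], not the paper's trlib-based implementation; the paper's maximization problem maps onto it by calling with b = −g_k and products by −H_k, and `hvp` itself could come from forward-mode automatic differentiation or from differencing gradients.

```python
import numpy as np

def steihaug_cg(hvp, b, delta, tol=1e-8, max_iter=250):
    """Truncated CG for min_s  b^T s + 0.5 s^T B s  subject to ||s|| <= delta.

    hvp(v) returns B @ v, so B never needs to be formed explicitly.
    """
    s = np.zeros_like(b)
    r = b.copy()                 # gradient of the quadratic at s = 0
    if np.linalg.norm(r) < tol:
        return s
    d = -r
    for _ in range(max_iter):
        Bd = hvp(d)
        dBd = d @ Bd
        if dBd <= 0:             # negative curvature: follow d to the boundary
            return s + _boundary_step(s, d, delta)
        alpha = (r @ r) / dBd
        if np.linalg.norm(s + alpha * d) >= delta:   # step leaves the region
            return s + _boundary_step(s, d, delta)
        s = s + alpha * d
        r_new = r + alpha * Bd
        if np.linalg.norm(r_new) < tol:
            return s
        d = -r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return s

def _boundary_step(s, d, delta):
    # Positive tau solving ||s + tau d|| = delta.
    a, b_, c = d @ d, 2 * s @ d, s @ s - delta**2
    return ((-b_ + np.sqrt(b_**2 - 4 * a * c)) / (2 * a)) * d

# Identity B, b = (-4, 0): interior solution (4, 0); radius 1 clips to (1, 0).
s_int = steihaug_cg(lambda v: v, np.array([-4.0, 0.0]), 10.0)
s_bd = steihaug_cg(lambda v: v, np.array([-4.0, 0.0]), 1.0)
```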
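The variance reduction from the matched-pairs construction of Equation 8 (evaluating both terms with the same draws) is easy to verify numerically. The one-dimensional toy objective and all names below are ours, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(omega, eps):
    # Toy one-sample objective estimate: omega shifts a N(omega, 1) family.
    z = omega + eps
    return -0.5 * z**2

omega, step, n = 1.0, -0.1, 1000
eps = rng.standard_normal(n)

# Matched pairs: the same eps at omega and omega + step (common random numbers).
paired = f(omega + step, eps) - f(omega, eps)
# Naive alternative: fresh draws for the shifted evaluation.
independent = f(omega + step, rng.standard_normal(n)) - f(omega, eps)
```

Here the paired differences reduce to an affine function of eps, so their variance is orders of magnitude below that of the independent estimator, which is why TrustVI can assess steps with small N_k.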
If N_k exceeds the size of the gradient's minibatch, and it is more than twice as large as necessary to meet the conditions, we set N_{k+1} = N_k/2. These N_k function evaluations require little computation compared to computing gradients and Hessian-vector products.

4 Convergence to a stationary point

To show that TrustVI converges to a stationary point, we reason about the stochastic process (Φ_k)_{k=1}^∞, where

$$\Phi_k \triangleq \mathcal{L}_k - \alpha \Delta_k^2. \tag{14}$$

In words, Φ_k is the objective function penalized by the weighted squared trust region radius.
Because TrustVI is stochastic, neither 𝓛_k nor Φ_k necessarily increases at every iteration. But Φ_k increases in expectation at each iteration (Lemma 1). That alone, however, does not suffice to show TrustVI reaches a stationary point; Φ_k must increase in expectation by enough at each iteration. Lemma 1 and Lemma 2 in combination show just that. The latter states that the trust region radius cannot remain small unless the gradient is small too, while the former states that the expected increase is a constant fraction of the squared trust region radius. Perhaps surprisingly, Lemma 1 does not depend on the quality of the quadratic model: rejecting a proposed step always leads to sufficient increase in Φ_k. Accepting a bad step, though possible, rapidly becomes less likely as the proposed step gets worse. No matter how bad a proposed step is, Φ_k increases in expectation.
Theorem 1 uses the lemmas to show convergence by contradiction. The structure of its proof, excluding the proofs of the lemmas, resembles the proof from [5] that a deterministic trust region method converges. The lemmas' proofs, on the other hand, more closely resemble the style of reasoning in the stochastic optimization literature [12].

Theorem 1. For Algorithm 1,

$$\lim_{k \to \infty} \|\nabla \mathcal{L}_k\| = 0 \quad a.s. \tag{15}$$

Proof. By Condition 1, 𝓛 is bounded above. The trust region radius Δ_k is positive almost surely by construction.
Therefore, Φ_k is bounded above almost surely by the constant sup 𝓛. Let the constant c ≜ sup 𝓛 − Φ₀. Then,

$$\sum_{k=1}^{\infty} \mathbb{E}[\Delta\Phi_k \mid \mathcal{M}_k^-] \le c \quad a.s. \tag{16}$$

By Lemma 1, E[ΔΦ_k | M_k⁺], and hence E[ΔΦ_k | M_k⁻], is almost surely nonnegative. Therefore, E[ΔΦ_k | M_k⁻] → 0 almost surely. By an additional application of Lemma 1, Δ_k² → 0 almost surely too.
Suppose there exist K₀ and ε > 0 such that ‖∇𝓛_k‖ ≥ ε for all k > K₀. Fix K ≥ K₀ such that Δ_k meets the conditions of Lemma 2 for all k ≥ K. By Lemma 2, (log Δ_k)_{k≥K} is a submartingale. A submartingale almost surely does not go to −∞, so Δ_k almost surely does not go to 0. The contradiction implies that ‖∇𝓛_k‖ < ε infinitely often. Because our choice of ε was arbitrary,

$$\liminf_{k \to \infty} \|\nabla \mathcal{L}_k\| = 0 \quad a.s. \tag{17}$$

Because Δ_k² → 0 almost surely, this limit point is unique.

Lemma 1.

$$\mathbb{E}[\Delta\Phi_k \mid \mathcal{M}_k^+] \ge \tau_1 \Delta_k^2 \quad a.s. \tag{18}$$

Proof. Let π denote the probability that the proposed step is accepted. Then,

$$\mathbb{E}[\Delta\Phi_k \mid \mathcal{M}_k^+] \ge (1 - \pi)\left[\alpha(1 - \gamma^{-2})\Delta_k^2\right] + \pi\left[\mathcal{L}'_k - \alpha(\gamma^2 - 1)\Delta_k^2\right] \tag{19}$$
$$= \pi\left[\mathcal{L}'_k - \tau_2 \Delta_k^2\right] + \alpha(1 - \gamma^{-2})\Delta_k^2. \tag{20}$$

By the lower bound on α, τ₁ ≥ 0. If η m′_k < δ Δ_k², the step is rejected regardless of ℓ′_k, so the lemma holds. Also, if 𝓛′_k ≥ τ₂Δ_k², then the lemma holds for any π ∈ [0, 1]. So, consider just 𝓛′_k < τ₂Δ_k² and η m′_k ≥ δ Δ_k².
The probability π of accepting this step is a tail bound on the sum of iid subgaussian random variables. By Condition 4, Hoeffding's inequality applies. Then, Inequality 11 lets us cancel some of the remaining iteration-specific variables:

$$\pi = P(\ell'_k \ge \eta m'_k \mid \mathcal{M}_k^+) \tag{21}$$
$$= P(\ell'_k - \mathcal{L}'_k \ge \eta m'_k - \mathcal{L}'_k \mid \mathcal{M}_k^+) \tag{22}$$
$$= P\left( \sum_{i=1}^{N_k} (\ell'_{ki} - \mathcal{L}'_k) \ge (\eta m'_k - \mathcal{L}'_k) N_k \,\middle|\, \mathcal{M}_k^+ \right) \tag{23}$$
$$\le \exp\left\{ -\frac{(\eta m'_k - \mathcal{L}'_k)^2 N_k}{2\sigma_k^2} \right\} \tag{24}$$
$$\le \frac{\nu_1 \Delta_k^2}{\tau_2 \Delta_k^2 - \mathcal{L}'_k}. \tag{25}$$

The lemma follows from substituting Inequality 25 into Equation 20.

Lemma 2. For each iteration k, on the event Δ_k ≤ Δ*_k, we have

$$P(\ell'_k \ge \eta m'_k \mid \mathcal{M}_k^-) \ge \xi_0 \xi_1 > \frac{1}{2}. \tag{26}$$

The proof appears in Appendix A of the supplementary material.

5 Experiments

Our experiments compare TrustVI to both Automatic Differentiation Variational Inference (ADVI) [4] and Hessian-free Stochastic Gradient Variational Inference (HFSGVI) [9]. We use the authors' Stan [21] implementation of ADVI, and implement the other two algorithms in Stan as well.
Our study set comprises 183 statistical models and datasets from [22], an online repository of open-source Stan models and datasets. For our trials, the variational distribution is always mean-field multivariate Gaussian. The dimensions of the ELBO domains range from 2 to 2012.

(a) A variance components model ("Dyes") from [18]. 18-dimensional domain.
(b) A bivariate normal hierarchical model ("Birats") from [19]. 132-dimensional domain.
(c) A multi-level linear model ("Electric Chr") from [20]. 100-dimensional domain.
(d) A multi-level linear model ("Radon Redundant Chr") from [20]. 176-dimensional domain.

Figure 1: Each panel shows optimization paths for five runs of ADVI, TrustVI, and HFSGVI, for a particular dataset and statistical model. Both axes are log scale.

In addition to the final objective value for each method, we compare the runtime each method requires to produce iterates whose ELBO values are consistently above a threshold.
As the threshold, for each pair of methods we compare, we take the ELBO value reached by the worse-performing method, and subtract one nat from it.
We measure runtime in "oracle calls" rather than wall-clock time so that the units are independent of the implementation. Stochastic gradients, stochastic Hessian-vector products, and estimates of change in ELBO value are assigned one, two, and one oracle calls, respectively, to reflect the number of floating point operations required to compute them. Each stochastic gradient is based on a minibatch of 256 samples of the variational distribution. The numbers of variational samples for stochastic Hessian-vector products and for estimates of change (85 and 128, respectively) are selected to match the degree of parallelism for stochastic gradient computations.
To make our comparison robust to outliers, for each method and each model, we optimize five times, but ignore all runs except the one that attains the median final objective value.

5.1 Comparison to ADVI

ADVI has two phases that contribute to runtime: during the first phase, a learning rate is selected based on progress made by SGD during trials of 50 (by default) "adaptation" SGD iterations, for as many as six learning rates. During the second phase, the variational objective is optimized with the learning rate that made the most progress during the trials. If the number of adaptation iterations is small relative to the number of iterations needed to optimize the variational objective, then the learning rate selected may be too large: what appears most productive at first may be overly "greedy" for a longer run. Conversely, a large number of adaptation iterations may leave little computational budget for the actual optimization.
We experimented with both more and fewer adaptation iterations than the default but did not find a setting that was uniformly better than the default. Therefore, we report on the default number of adaptation iterations for our experiments.
Case studies. Figure 1 and Appendix B show the optimization paths for several models, chosen to demonstrate typical performance. Often ADVI does not finish its adaptation phase before TrustVI converges. Once the adaptation phase ends, ADVI generally increased the objective function value more gradually than TrustVI did, despite having expended iterations to tune its learning rate.
Quality of optimal points. For 126 of the 183 models (69%), on sets of five runs, the median optimal values found by ADVI and TrustVI did not differ substantively. For 51 models (28%), TrustVI found better optimal values than ADVI. For 6 models (3%), ADVI found better optimal values than TrustVI.
Runtime. We excluded model-threshold pairs from the runtime comparison that did not require at least five iterations to solve; they were too easy to be representative of problems where the choice of optimization algorithm matters. For 136 of the 137 models (99%) remaining in our study set, TrustVI was faster than ADVI. For 69 models (50%), TrustVI was at least 12x faster than ADVI. For 34 models (25%), TrustVI was at least 36x faster than ADVI.

5.2 Comparison to HFSGVI

HFSGVI applies Newton's method—an algorithm that converges for convex and deterministic objective functions—to an objective function that is neither.
But do convergence guarantees matter in practice?
Often HFSGVI takes steps so large that numerical overflow occurs during the next iteration: the gradient "explodes" if we take a bad enough step. With TrustVI, we reject obviously bad steps (e.g., those causing numerical overflow) and try again with a smaller trust region. We tried several heuristics to work around this problem with HFSGVI, including shrinking the norm of the very large steps that would otherwise cause numerical overflow. But "large" is relative, depending on the problem, the parameter, and the current iterate; severely restricting step size would unfairly limit HFSGVI's rate of convergence. Ultimately, we excluded 23 of the 183 models from further analysis because HFSGVI consistently generated numerical overflow errors for them, leaving 160 models in our study set.
Case studies. Figure 1 and Appendix B show that even when HFSGVI does not step so far as to cause numerical overflow, it nonetheless often makes the objective value worse before it gets better. HFSGVI, however, sometimes makes faster progress during the early iterations, while TrustVI is rejecting steps as it searches for an appropriate trust region radius.
Quality of optimal points. For 107 of the 160 models (59%), on sets of five runs, the median optimal values found by TrustVI and HFSGVI did not differ substantively. For 51 models (28%), TrustVI found better optimal values than HFSGVI. For 1 model (0.5%), HFSGVI found a better optimal value than TrustVI.
Runtime. We excluded 45 model-threshold pairs from the runtime comparison that did not require at least five iterations to solve, as in Section 5.1. For the remainder of the study set, TrustVI was faster than HFSGVI for 61 models, whereas HFSGVI was faster than TrustVI for 54 models.
As a reminder, HFSGVI failed to converge on another 23 models that we excluded from the study set.

6 Conclusions

For variational inference, it is no longer necessary to choose between slow stochastic first-order optimization (e.g., ADVI) and fast-but-restrictive deterministic second-order optimization. The algorithm we propose, TrustVI, leverages stochastic second-order information, typically finding a solution at least one order of magnitude faster than ADVI. While HFSGVI also uses stochastic second-order information, it lacks convergence guarantees. For more than one-third of our experiments, HFSGVI terminated at substantially worse ELBO values than TrustVI, demonstrating that convergence theory matters in practice.

References

[1] Diederik Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
[2] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.
[3] Michalis Titsias and Miguel Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning, 2014.
[4] Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M. Blei. Automatic differentiation variational inference. Journal of Machine Learning Research, 18(14):1-45, 2017.
[5] Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer, 2nd edition, 2006.
[6] Jeffrey Regier et al. Learning an astronomical catalog of the visible universe through scalable Bayesian inference. arXiv preprint arXiv:1611.03404, 2016.
[7] Jeffrey Fike and Juan Alonso. Automatic differentiation through the use of hyper-dual numbers for second derivatives. In Recent Advances in Algorithmic Differentiation, pages 163-173. Springer, 2012.
[8] Barak A. Pearlmutter.
Fast exact multiplication by the Hessian. Neural Computation, 6(1):147-160, 1994.
[9] Kai Fan, Ziteng Wang, Jeffrey Beck, James Kwok, and Katherine Heller. Fast second-order stochastic backpropagation for variational inference. In Advances in Neural Information Processing Systems, 2015.
[10] Sara Shashaani, Susan Hunter, and Raghu Pasupathy. ASTRO-DF: Adaptive sampling trust-region optimization algorithms, heuristics, and numerical experience. In IEEE Winter Simulation Conference, 2016.
[11] Geng Deng and Michael C. Ferris. Variable-number sample-path optimization. Mathematical Programming, 117(1):81-109, 2009.
[12] Ruobing Chen, Matt Menickelly, and Katya Scheinberg. Stochastic optimization using a trust-region method and random models. Mathematical Programming, pages 1-41, 2017.
[13] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 2017.
[14] James Spall. Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control. John Wiley & Sons, 2005.
[15] Bradley Efron and Charles Stein. The jackknife estimate of variance. The Annals of Statistics, pages 586-596, 1981.
[16] Nicholas Gould, Stefano Lucidi, Massimo Roma, and Philippe Toint. Solving the trust-region subproblem using the Lanczos method. SIAM Journal on Optimization, 9(2):504-525, 1999.
[17] Felix Lenders, Christian Kirches, and Andreas Potschka. trlib: A vector-free implementation of the GLTR method for iterative solution of the trust region problem. arXiv preprint arXiv:1611.04718, 2016.
[18] OpenBugs developers. Dyes: Variance components model. http://www.openbugs.net/Examples/Dyes.html, 2017. [Online; accessed Oct 8, 2017].
[19] OpenBugs developers. Rats: A normal hierarchical model. http://www.openbugs.net/Examples/Rats.html, 2017.
[Online; accessed Oct 8, 2017].
[20] Andrew Gelman and Jennifer Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2006.
[21] Bob Carpenter et al. Stan: A probabilistic programming language. Journal of Statistical Software, 20, 2016.
[22] Stan developers. https://github.com/stan-dev/example-models, 2017. [Online; accessed Jan 3, 2017; commit 6fbbf36f9d14ed69c7e6da2691a3dbe1e3d55dea].
[23] OpenBugs developers. Alligators: Multinomial-logistic regression. http://www.openbugs.net/Examples/Aligators.html, 2017. [Online; accessed Oct 4, 2017].
[24] OpenBugs developers. Seeds: Random effect logistic regression. http://www.openbugs.net/Examples/Seeds.html, 2017. [Online; accessed Oct 4, 2017].
[25] David Lunn, Chris Jackson, Nicky Best, Andrew Thomas, and David Spiegelhalter. The BUGS Book: A Practical Introduction to Bayesian Analysis. CRC Press, 2012.