{"title": "Projected Stein Variational Newton: A Fast and Scalable Bayesian Inference Method in High Dimensions", "book": "Advances in Neural Information Processing Systems", "page_first": 15130, "page_last": 15139, "abstract": "We propose a projected Stein variational Newton (pSVN) method for high-dimensional Bayesian inference. To address the curse of dimensionality, we exploit the intrinsic low-dimensional geometric structure of the posterior distribution in the high-dimensional parameter space via its Hessian (of the log posterior) operator and perform a parallel update of the parameter samples projected into a low-dimensional subspace by an SVN method. The subspace is adaptively constructed using the eigenvectors of the averaged Hessian at the current samples. We demonstrate fast convergence of the proposed method, complexity independent of the parameter and sample dimensions, and parallel scalability.", "full_text": "Projected Stein Variational Newton: A Fast and\n\nScalable Bayesian Inference Method\n\nin High Dimensions\n\nPeng Chen, Keyi Wu, Joshua Chen, Thomas O\u2019Leary-Roseberry, Omar Ghattas\n\nOden Institute for Computational Engineering and Sciences\n\nThe University of Texas at Austin\n\nAustin, TX 78712.\n\n{peng, keyi, joshua, tom, omar}@oden.utexas.edu\n\nAbstract\n\nWe propose a projected Stein variational Newton (pSVN) method for high-\ndimensional Bayesian inference. To address the curse of dimensionality, we exploit\nthe intrinsic low-dimensional geometric structure of the posterior distribution in\nthe high-dimensional parameter space via its Hessian (of the log posterior) op-\nerator and perform a parallel update of the parameter samples projected into a\nlow-dimensional subspace by an SVN method. The subspace is adaptively con-\nstructed using the eigenvectors of the averaged Hessian at the current samples. 
We\ndemonstrate fast convergence of the proposed method, complexity independent of\nthe parameter and sample dimensions, and parallel scalability.\n\n1\n\nIntroduction\n\nBayesian inference provides an optimal probability formulation for learning complex models from\nobservational or experimental data under uncertainty by updating the model parameters from their\nprior distribution to a posterior distribution [30]. In Bayesian inference we typically face the task\nof drawing samples from the posterior probability distribution to compute various statistics of some\ngiven quantities of interest. However, this is often prohibitive when the posterior distribution is\nhigh-dimensional; many conventional methods for Bayesian inference suffer from the curse of\ndimensionality, i.e., computational complexity grows exponentially or convergence deteriorates with\nincreasing parameter dimension.\nTo address this curse of dimensionality, several ef\ufb01cient and dimension-independent methods have\nbeen developed that exploit the intrinsic properties of the posterior distribution, such as its smooth-\nness, sparsity, and intrinsic low-dimensionality. Markov chain Monte Carlo (MCMC) methods\nexploiting geometry of the log-likelihood function have been developed [16, 21, 24, 12, 3], providing\nmore effective sampling than black-box MCMC. For example, the DILI MCMC method [12] uses\nthe low rank structure of the Hessian of the negative log likelihood in conjunction with operator-\nweighted proposals that are well-de\ufb01ned on function space to yield a sampler whose performance is\ndimension-independent and effective at capturing information provided by the data. However, despite\nthese enhancements, MCMC methods remain prohibitive for problems with expensive-to-evaluate\nlikelihoods (i.e., involving complex models) and in high parameter dimensions. 
Deterministic sparse quadratures were developed in [28, 26, 8] and shown to converge rapidly with dimension-independent rates for smooth and sparse problems. However, the fast convergence is lost when the posterior has significant local variations, despite enhancements with Hessian-based transformations [27, 9].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Variational inference methods reformulate the sampling problem as an optimization problem that approximates the posterior by minimizing its Kullback–Leibler divergence with a transformed prior [22, 20, 4], which can be potentially much faster than MCMC. In particular, Stein variational methods, which seek a composition of a sequence of simple transport maps represented by kernel functions using gradient descent (SVGD) [20, 11, 19] and especially Newton (SVN) [14] optimization methods, are shown to achieve fast convergence in relatively low dimensions. However, the convergence and accuracy of these variational optimization methods can again deteriorate in high dimensions. The curse of dimensionality can be partially addressed by a localized SVGD on Markov blankets, which relies on a conditional independence structure of the target distribution [32, 31].
Contributions: In this work, we develop a projected Stein variational Newton method (pSVN) to tackle the challenge of high-dimensional Bayesian inference by exploiting the intrinsic low-dimensional geometric structure of the posterior distribution (where it departs from the prior), as characterized by the dominant spectrum of the prior-preconditioned Hessian of the negative log likelihood. This low-rank structure, or fast decay of the eigenvalues of the preconditioned Hessian, has been proven for some inference problems and commonly observed in many others with complex models [5, 6, 29, 18, 12, 9, 10, 2, 7].
By projecting the parameters into this data-informed low-dimensional subspace and applying the SVN in this subspace, we can effectively mitigate the curse of dimensionality. We demonstrate fast convergence of pSVN that is independent of the number of parameters and samples. In particular, in two experiments (one linear and one nonlinear) we show that the intrinsic dimension is a few (6) and a few tens (40), respectively, with the nominal dimension over 1K and 16K. We present a scalable parallel implementation of pSVN that yields rapid convergence, minimal communication, and low memory footprint, thanks to this low-dimensional projection.
Below, we present background on Bayesian inference and Stein variational methods in Section 2, develop the projected Stein variational Newton method in Section 3, and provide numerical experiments in Section 4.

2 Background

2.1 Bayesian inference

We consider a random parameter x ∈ R^d, d ∈ N, with a prior probability density function p0 : R^d → R, and noisy observational data y of a parameter-to-observable map f : R^d → R^s, s ∈ N, i.e.,

    y = f(x) + ξ,    (1)

where ξ ∈ R^s represents observation noise with probability density function p_ξ : R^s → R. The posterior density p(·|y) : R^d → R of x conditioned on the data y is given by Bayes' rule

    p(x|y) = (1/Z) p_y(x), where p_y(x) := p_ξ(y − f(x)) p0(x),    (2)

and the normalization constant Z, typically Z ≠ 1 if p_ξ or p0 is known up to a constant, is given by

    Z := E_{p0}[p_ξ(y − f(x))] = ∫_{R^d} p_ξ(y − f(x)) p0(x) dx.    (3)

In practice, Z is computationally intractable, especially for large d.

2.2 Stein variational methods

While sampling from the prior is tractable, sampling from the posterior is a great challenge.
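As a minimal illustration of the setup (1)-(3), the following Python sketch evaluates the unnormalized log posterior for a toy linear model with Gaussian prior and noise; all names here (f, log_posterior_unnorm, sigma) are hypothetical and not part of the paper:

```python
import numpy as np

# Minimal sketch of the Bayesian setup (1)-(3) for a toy model with
# standard Gaussian prior p0 and Gaussian noise xi ~ N(0, sigma^2 I).
# All names and sizes are illustrative assumptions.

def f(x):
    """A hypothetical parameter-to-observable map R^d -> R^s."""
    A = np.array([[1.0, 0.5], [0.0, 2.0]])   # s = d = 2 for illustration
    return A @ x

def log_posterior_unnorm(x, y, sigma=0.1):
    """log p_y(x) = log p_xi(y - f(x)) + log p0(x), up to additive constants.

    The normalization constant Z in (3) is never computed: sampling methods
    such as SVGD/SVN/pSVN only need gradients of log p_y.
    """
    misfit = y - f(x)
    log_like = -0.5 * misfit @ misfit / sigma**2   # Gaussian noise density
    log_prior = -0.5 * x @ x                       # standard Gaussian prior
    return log_like + log_prior

x_true = np.array([1.0, -0.5])
y = f(x_true)   # noiseless data for a quick sanity check
# The unnormalized log posterior is larger at x_true than far from it:
assert log_posterior_unnorm(x_true, y) > log_posterior_unnorm(x_true + 1.0, y)
```

Working with the unnormalized density is what makes the intractability of Z harmless for the transport methods described next.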
One method to sample from the posterior is to find a transport map T : R^d → R^d in a certain function class 𝒯 that pushes forward the prior to the posterior by minimizing a Kullback–Leibler (KL) divergence

    min_{T ∈ 𝒯} D_KL(T_∗ p0 | p_y).    (4)

Stein variational methods [20, 14] simplify the minimization of (4) for one possibly very complex and nonlinear transport map T to a sequence of simpler transport maps that are perturbations of the identity, i.e., T = T_L ∘ T_{L−1} ∘ ··· ∘ T_2 ∘ T_1, L ∈ N, where

    T_l(x) = I(x) + ε Q_l(x),  l = 1, . . . , L,    (5)

with I(x) = x, step size ε, and perturbation map Q_l : R^d → R^d. Let p_l denote the pushforward density p_l := (T_l ∘ ··· ∘ T_1)_∗ p0. For l = 1, 2, . . . , we define a cost functional J_l(Q) as

    J_l(Q) := D_KL((I + Q)_∗ p_{l−1} | p_y).    (6)

Then at step l, Stein variational methods lead to

    Q_l = −H_l^{−1} ∇J_l(0),    (7)

where ∇J_l(0) : R^d → R^d is the Fréchet derivative of J_l(Q) evaluated at Q = 0, and H_l is a preconditioner. For the SVGD method [20], H_l = I, while for the SVN method [14], H_l ≈ ∇²J_l(0), an approximation of the Hessian of the cost functional ∇²J_l(0).
Given basis functions k_n : R^d → R, n = 1, . . . , N, an ansatz representation of Q_l is defined as

    Q_l(x) = ∑_{n=1}^N c_n k_n(x),    (8)

where c_n ∈ R^d, n = 1, . . . , N, are unknown coefficient vectors. It is shown in [14] that the coefficient vector c = (c_1^⊤, . . . , c_N^⊤)^⊤ ∈ R^{Nd} is a solution of the linear system

    Hc = −g,    (9)

where g = (g_1^⊤, . . . , g_N^⊤)^⊤ ∈ R^{Nd} is the gradient vector given by

    g_m := E_{p_{l−1}}[−∇_x log(p_y) k_m − ∇_x k_m],  m = 1, . . . , N,    (10)

and H ∈ R^{Nd×Nd} is the Hessian matrix, specified as the identity for SVGD [20], which leads to c_n = −g_n, n = 1, . . . , N, while for SVN it is given with mn-block H_{mn} ∈ R^{d×d} by [14]

    H_{mn} := E_{p_{l−1}}[−∇²_x log(p_y) k_n k_m + ∇_x k_n (∇_x k_m)^⊤],  m, n = 1, . . . , N.    (11)

At each step l = 1, 2, . . . , the expectations E_{p_{l−1}}[·] in (10) and (11) are approximated by the sample average approximation with samples x_1^{l−1}, . . . , x_N^{l−1}, which are drawn from the prior at l = 1 and pushed forward by (5) once the coefficients c_1, . . . , c_N are obtained. We remark that in the original SVGD method [20], the samples are moved with the simplified perturbation Q_l(x_m) = c_m.
In both [20] and [14], the basis functions k_n(x) are specified by a suitable kernel function k_n(x) = k(x, x′) at x′ = x_n, n = 1, . . . , N, e.g., a Gaussian kernel given by

    k(x, x′) = exp(−(1/2) (x − x′)^⊤ M (x − x′)),    (12)

where M is a metric that measures the distance between x and x′ ∈ R^d. In [20], it is specified as a rescaled identity matrix αI for α > 0 depending on the samples, while in [14], M is given by M = E_{p_{l−1}}[−∇²_x log(p_y)]/d to account for the geometry of the posterior through averaged Hessian information.
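For concreteness, the following Python sketch implements one update step in the SVGD case (H = I in (9), so c_n = −g_n) with the Gaussian kernel (12); the target density, step size, and sample sizes are illustrative assumptions, not values from the paper:

```python
import numpy as np

# One SVGD step (the H = I case of (9)-(10)) with the Gaussian kernel (12).
# Minimal sketch; target, metric M, step size eps, and sizes are illustrative.

def svgd_step(X, grad_log_p, M, eps=0.1):
    """X: (N, d) samples; grad_log_p: (N, d) -> (N, d); M: (d, d) kernel metric."""
    N, d = X.shape
    diff = X[:, None, :] - X[None, :, :]   # diff[n, m] = x_n - x_m
    K = np.exp(-0.5 * np.einsum('nmi,ij,nmj->nm', diff, M, diff))  # k(x_n, x_m)
    gradK = -np.einsum('ij,nmj,nm->nmi', M, diff, K)  # grad_{x_n} k(x_n, x_m)
    G = grad_log_p(X)                                 # scores at current samples
    # Perturbation Q(x_m) = (1/N) sum_n [k(x_n, x_m) grad log p(x_n) + grad_{x_n} k]
    Q = (K.T @ G + gradK.sum(axis=0)) / N
    return X + eps * Q

# Target: standard Gaussian, so grad log p(x) = -x.
rng = np.random.default_rng(0)
X = rng.normal(3.0, 1.0, size=(64, 2))    # samples start far from the target
for _ in range(200):
    X = svgd_step(X, lambda X: -X, M=np.eye(2), eps=0.2)
# The sample mean should have moved close to the target mean 0:
assert np.linalg.norm(X.mean(axis=0)) < 0.5
```

The SVN case replaces the identity preconditioner by the block Hessian (11), coupling the coefficients through the linear system (9). Here M = I is used for simplicity; the Hessian-averaged choice of M described above can be passed in the same way.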
This was shown to accelerate convergence for both SVGD and SVN compared to αI. We remark that for high-dimensional complex models where a direct computation of the Hessian ∇²_x log(p_y) is not tractable, its low-rank decomposition by randomized algorithms can be applied.

3 Projected Stein variational Newton

3.1 Dimension reduction by projection

Stein variational methods suffer from the curse of dimensionality, i.e., the sample estimate (e.g., for variance) deteriorates considerably in high dimensions because the global kernel function (12) cannot represent the transport map well, as shown in [32, 31] for SVGD. This challenge can be alleviated in moderate dimensions by a suitable choice of the metric in (12), as demonstrated in [14]; however, it is still present when the dimension becomes high. An effective method to tackle this difficulty, which relies on conditional independence of the posterior density, uses local kernel functions defined over a Markov blanket with much lower dimension, thus achieving effective dimension reduction [32, 31].
In many applications, even when the nominal dimension of the parameter is very high, the intrinsic parameter dimension informed by the data is typically low, i.e., the posterior density is effectively different from the prior density only in a low-dimensional subspace [5, 6, 29, 18, 12, 9, 10, 2]. This is because: (i) the prior p0 may have correlation in different dimensions, (ii) the parameter-to-observable map f may be smoothing/regularizing, (iii) the data y may not be very informative, or a combined effect. Let Ψ = (ψ_1, . . . , ψ_r) ∈ R^{d×r} denote the basis of a subspace of dimension r ≪ d in R^d. Then we can project the parameter x with mean x̄ into this subspace as

    x^r = x̄ + P_r(x − x̄) = x̄ + ∑_{i=1}^r ψ_i (ψ_i, x − x̄)_H = x̄ + ∑_{i=1}^r ψ_i w_i = x̄ + Ψw,    (13)

where w = (w_1, . . . , w_r) ∈ R^r is a vector of coefficients w_i = (ψ_i, x − x̄)_H of the projection of x − x̄ to ψ_i in a suitable norm H, e.g., (ψ_i, x − x̄)_H = ψ_i^⊤ Γ_0^{−1} (x − x̄), where Γ_0 is the prior covariance of x and ψ_i^⊤ Γ_0^{−1} ψ_j = δ_ij. We define the projected posterior as

    p^r(x|y) = (1/Z^r) p^r_y(x), where p^r_y(x) = p_ξ(y − f(x^r)) p0(x) and Z^r = E_{p0}[p_ξ(y − f(x^r))].    (14)

Then we can establish convergence under the following assumption. We define ||·||_X as a suitable norm, e.g., ||x||²_X = x^⊤ X x with X = I, the identity matrix, or a mass matrix discretized from the identity operator in a finite-dimensional approximation space in our numerical experiments.
Assumption 1. Let the noise be Gaussian, ξ ∼ N(0, Γ), with s.p.d. covariance Γ ∈ R^{s×s}, and let ||v||_Γ := (v^⊤ Γ^{−1} v)^{1/2} for any v ∈ R^s. Assume there exists a constant C_f > 0 such that for any x^r in (13)

    E_{p0}[||f(x^r)||_Γ] ≤ C_f and E_{p0}[||f(x)||_Γ] ≤ C_f.    (15)

For every b > 0, assume there is C_b > 0 such that for all x_1, x_2 with max{||x_1||_X, ||x_2||_X} < b,

    ||f(x_1) − f(x_2)||_Γ ≤ C_b ||x_1 − x_2||_X.    (16)

We state the convergence result for the projected posterior density in the following theorem, whose proof is presented in Appendix A.
Theorem 1. Under Assumption 1, there exists a constant C independent of r such that

    D_KL(p(x|y) | p^r(x|y)) ≤ C ||x − x^r||_X.    (17)

Remark 1.
Theorem 1 indicates that the projected posterior converges to the full one as long as the projected parameter converges in the X-norm, and that the convergence of the former is bounded by the latter. In practical applications, the former may converge faster than the latter because it depends only on the data-informed subspace, while the latter is measured in the data-independent X-norm.

3.2 Projected Stein variational Newton

Let p^r_0 denote the prior density for x^r in (13), and let x^⊥ = x − x^r. Then the prior is decomposed as

    p0(x) = p^r_0(x^r) p^⊥_0(x^⊥|x^r),    (18)

where p^⊥_0(x^⊥|x^r) is a conditional density, which becomes p^⊥_0(x^⊥) if p0 is a Gaussian density. Then the projected posterior density p^r_y(x) in (14) becomes

    p^r_y(x) = p_ξ(y − f(x^r)) p^r_0(x^r) p^⊥_0(x^⊥|x^r),    (19)

so that sampling from p^r_y(x) can be realized by sampling from p^r_y(x^r) = p_ξ(y − f(x^r)) p^r_0(x^r) for x^r and from p^⊥_0(x^⊥|x^r) for x^⊥ conditioned on x^r (or from p^⊥_0(x^⊥) if p0 is Gaussian). To sample from the posterior, we can sample x from the prior, decompose it as x = x^r + x^⊥, freeze x^⊥, push x^r to x^r_y as a sample from p^r_y(x^r), and construct the posterior sample as x_y = x^r_y + x^⊥.
To sample from p^r_y(x^r) in the projection subspace, we seek a transport map T that pushes forward p^r_0(x^r) to p^r_y(x^r) by minimizing the KL divergence between them. Since the randomness of x^r = x̄ + Ψw is fully represented by w given the projection basis Ψ, we just need to find a transport map that pushes forward π_0(w) = p^r_0(x^r) to π_y(w) = p^r_y(x^r) in the (coefficient) parameter space R^r, where r ≪ d. As in the full space, we look for a composition of a sequence of maps T = T_L ∘ T_{L−1} ∘ ··· ∘ T_2 ∘ T_1, L ∈ N, with

    T_l(w) = I(w) + ε Q_l(w),  l = 1, . . . , L,    (20)

where the perturbation map Q_l is represented by the basis functions k_n : R^r → R, n = 1, . . . , N, as

    Q_l(w) = ∑_{n=1}^N c_n k_n(w).    (21)

Then the coefficient vector c = ((c_1)^⊤, . . . , (c_N)^⊤)^⊤ ∈ R^{Nr} is the solution of the linear system

    Hc = −g.    (22)

Here the m-th component of the gradient g is defined as

    g_m := E_{π_{l−1}}[−∇_w log(π_y) k_m − ∇_w k_m],    (23)

and the mn-th component of the Hessian H for pSVN is defined as

    H_{mn} := E_{π_{l−1}}[−∇²_w log(π_y) k_n k_m + ∇_w k_n (∇_w k_m)^⊤].    (24)

The expectations in (23) and (24) are evaluated by sample average approximation at samples w_1^{l−1}, . . . , w_N^{l−1}, which are drawn from π_0 for l = 1 and pushed forward by (20) as w_n^l = T_l(w_n^{l−1}), n = 1, . . . , N. By the definition of the projection (13), we have

    ∇_w log(π_y(w)) = Ψ^⊤ ∇_x log(p^r_y(x^r)), and ∇²_w log(π_y(w)) = Ψ^⊤ ∇²_x log(p^r_y(x^r)) Ψ.    (25)

For the basis functions k_n, n = 1, . . . , N, we use a Gaussian kernel k_n(w) = k(w, w_n) defined as in (12), with the metric M given by an averaged Hessian at the current samples w_1^{l−1}, . . . , w_N^{l−1}, i.e.,

    M = −(1/r) E_{π_{l−1}}[∇²_w log(π_y)] ≈ −(1/(rN)) ∑_{n=1}^N ∇²_w log(π_y(w_n^{l−1})).    (26)

We remark that the projected system (22) is of size Nr × Nr, which is a considerable reduction from the full system (9) of size Nd × Nd, since r ≪ d.
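The chain rule (25) and the resulting low-dimensional Newton update can be sketched in a few lines of Python; the quadratic target, the random orthonormal basis, and the dimensions below are illustrative assumptions:

```python
import numpy as np

# Sketch of the chain rule (25): gradients and Hessians of the projected
# log posterior in the coefficient space w are obtained from full-space
# quantities by applying the basis Psi. All names and sizes are illustrative.

rng = np.random.default_rng(1)
d, r = 200, 5                              # nominal vs. intrinsic dimension

# An orthonormal basis Psi in R^{d x r} (in pSVN it would come from the
# Hessian eigenproblem (29); here it is a random orthonormal stand-in).
Psi, _ = np.linalg.qr(rng.normal(size=(d, r)))

# A full-space quadratic log density: log p(x) = -0.5 x^T H x + b^T x + const.
H_full = np.eye(d)                         # s.p.d. Hessian of -log p
b = rng.normal(size=d)

def grad_x(x):
    return -H_full @ x + b                 # full-space gradient of log p

# Projected quantities via (25): grad_w = Psi^T grad_x, Hess_w = Psi^T Hess_x Psi.
x_bar = np.zeros(d)
w = rng.normal(size=r)
x = x_bar + Psi @ w
g_w = Psi.T @ grad_x(x)                    # r-dimensional gradient
H_w = Psi.T @ H_full @ Psi                 # r x r projected Hessian

# One Newton step in the r-dimensional coefficient space (cf. system (27)):
w_new = w + np.linalg.solve(H_w, g_w)
# For this quadratic target, Newton lands on the maximizer within the subspace:
assert np.allclose(Psi.T @ grad_x(x_bar + Psi @ w_new), 0.0, atol=1e-8)
```

All linear algebra above is of size r, not d, which is the source of the complexity reduction quantified in the surrounding text.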
To further reduce the size of the coupled system (22), we use a classical “mass-lumping” technique to decouple it into N systems of size r × r,

    H_m c_m = −g_m,  m = 1, . . . , N,    (27)

where g_m is given as in (25), and H_m is given by the lumped Hessian

    H_m := ∑_{n=1}^N H_{mn},  m = 1, . . . , N,    (28)

with H_{mn} defined in (24). We refer to [14] for this technique and a diagonalization H_m = H_{mm}. Moreover, to find a good step size ε in (20), we adopt a classical line search [23]; see Appendix B.

3.3 Hessian-based subspace

To construct a data-informed subspace of the parameter space, we exploit the geometry of the posterior density characterized by its Hessian. More specifically, we seek the basis functions ψ_i, i = 1, . . . , r, as the eigenvectors corresponding to the r largest eigenvalues of the generalized eigenvalue problem

    E[∇²_x η_y(x)] ψ_i = λ_i Γ_0^{−1} ψ_i,  i = 1, . . . , r,    (29)

where Γ_0 is the covariance of x under the prior distribution (not necessarily Gaussian), ψ_i^⊤ Γ_0^{−1} ψ_j = δ_ij, i, j = 1, . . . , r, and E[∇²_x η_y(x)], with η_y(x) := − log(p_ξ(y − f(x))), is the averaged Hessian of the negative log-likelihood function w.r.t. a certain distribution, e.g., the prior, posterior, or a Gaussian approximate distribution [13]. Here we propose to evaluate E[∇²_x η_y(x)] by an adaptive sample average approximation at the samples pushed from the prior to the posterior, and to adaptively construct the eigenvectors Ψ, as presented in the next section.
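A small dense stand-in for the subspace construction (29) can be written with a Cholesky transform and a standard symmetric eigensolver (the paper uses a randomized SVD [17] at scale; `numpy.linalg.eigh` is a small-scale substitute here, and the linear map, sizes, and prior are all illustrative assumptions):

```python
import numpy as np

# Sketch of constructing the Hessian-based subspace (29): average the
# negative log-likelihood Hessian over samples, then solve the generalized
# eigenproblem  H_bar psi = lambda Gamma0^{-1} psi  via Gamma0 = L L^T, since
# L^T H_bar L v = lambda v with psi = L v. Names and sizes are illustrative.

rng = np.random.default_rng(2)
d, s, N = 50, 10, 32

A = rng.normal(size=(s, d)) / np.sqrt(d)   # a hypothetical linear map f(x) = Ax
Gamma0 = np.eye(d)                         # prior covariance (identity here)

# For Gaussian noise with unit covariance, the Hessian of eta_y is A^T A at
# every sample; we still write a sample average to mirror the adaptive scheme.
samples = rng.normal(size=(N, d))
H_bar = np.mean([A.T @ A for _ in samples], axis=0)

L = np.linalg.cholesky(Gamma0)
lam, V = np.linalg.eigh(L.T @ H_bar @ L)   # standard symmetric eigenproblem
order = np.argsort(np.abs(lam))[::-1]      # sort by |lambda|, descending
lam, V = lam[order], V[:, order]

eps_lambda = 0.01                          # truncation tolerance from Sec. 3.3
r = int(np.sum(np.abs(lam) >= eps_lambda))
Psi = L @ V[:, :r]                         # basis with Psi^T Gamma0^{-1} Psi = I

# Generalized orthonormality, and r bounded by the rank s of A^T A:
assert np.allclose(Psi.T @ np.linalg.solve(Gamma0, Psi), np.eye(r), atol=1e-8)
assert r <= s
```

For nonlinear models one would replace A^T A by Hessian actions of η_y at the current samples, which is where the matrix-free randomized algorithm becomes essential.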
For linear Bayesian inference problems, with f(x) = Ax for A ∈ R^{s×d}, a Gaussian prior distribution x ∼ N(x̄, Γ_0) and a Gaussian noise ξ ∼ N(0, Γ_ξ) lead to a Gaussian posterior distribution given by N(x_MAP, Γ_post), where [30]

    Γ_post^{−1} = ∇²_x η_y + Γ_0^{−1},  x_MAP = x̄ + Γ_post A^⊤ Γ_ξ^{−1} (y − A x̄),  with ∇²_x η_y = A^⊤ Γ_ξ^{−1} A.    (30)

Therefore, the eigenvalue λ_i of the pair (∇²_x η_y, Γ_0^{−1}) measures the relative variation between the data-dependent log-likelihood and the prior in direction ψ_i. For λ_i ≪ 1, the data provides negligible information in direction ψ_i, so the difference between the posterior and the prior in ψ_i is negligible. In fact, it is shown in [29] that the subspace constructed by (29) is optimal for linear f.
Let (λ_i, ψ_i)_{1≤i≤r} denote the r largest eigenpairs such that |λ_1| ≥ |λ_2| ≥ ··· ≥ |λ_r| ≥ ε_λ > |λ_{r+1}| for some small tolerance ε_λ < 1. Then the Hessian-based subspace spanned by the eigenvectors Ψ = (ψ_1, . . . , ψ_r) captures the most variation of the parameter x informed by the data y. We remark that to solve the generalized Hermitian eigenvalue problem (29), we employ a randomized SVD algorithm [17], which requires O(N r C_h + d r²) flops, where C_h is the cost of a Hessian action in a direction.

3.4 Parallel and adaptive pSVN algorithm

Given the bases Ψ as the data-informed parameter directions, we can draw samples x_1, . . . , x_N from the prior distribution and push them by pSVN to match the posterior distribution in a low-dimensional subspace, while keeping the components of the samples in the complementary subspace unchanged. We set the stopping criterion as: (i) the maximum norm of the updates w_m^l − w_m^{l−1}, m = 1, . . . , N, is smaller than a given tolerance Tol_w; (ii) the maximum norm of the gradients g_m, m = 1, . . . , N, is smaller than a given tolerance Tol_g; or (iii) the number of iterations l reaches a preset number L. Moreover, we take advantage of pSVN's strengths in low-dimensional subspaces, including fast computation, lightweight communication, and low memory footprint, and provide an efficient parallel implementation using MPI communication in Algorithm 1, with analysis in Appendix C.

Algorithm 1 pSVN in parallel using MPI
1: Input: M prior samples, x_1, . . . , x_M, in each of K cores, bases Ψ, and density p_y in all cores.
2: Output: posterior samples x_1^y, . . . , x_M^y in each core.
3: Perform projection (13) to get x_m = x_m^r + x_m^⊥, m = 1, . . . , M, at l = 1.
4: Perform MPI_Allgather for the samples w_m^{l−1}, m = 1, . . . , M.
5: repeat
6:   Compute the gradient and Hessian by (25).
7:   Perform MPI_Allgather for the gradient and Hessian.
8:   Compute the kernel and its gradient by (12) and (26).
9:   Perform MPI_Allgather for k_m, m = 1, . . . , M, and MPI_Allreduce with sum for ∑_m k_m and ∑_m ∇_w k_m.
10:  Assemble and solve system (27) for c_1, . . . , c_M.
11:  Perform a line search to get w_1^l, . . . , w_M^l.
12:  Perform MPI_Allgather for w_m^l, m = 1, . . . , M.
13:  Update the samples x_m^r = Ψ w_m^l + x̄, m = 1, . . . , M.
14:  Set l ← l + 1.
15: until A stopping criterion is met.
16: Reconstruct samples x_m^y = x_m^r + x_m^⊥, m = 1, . . . , M.

In Algorithm 1, we assume that the bases Ψ for the projection are the data-informed parameter directions, which are obtained by the Hessian-based algorithm in Section 3.3 at the “representative” samples x_1, . . . , x_N. However, we do not have these samples but only the prior samples at the beginning.
To address this problem, we propose an adaptive algorithm that adaptively constructs the bases Ψ based on samples pushed forward from the prior to the posterior; see Algorithm 2.

Algorithm 2 Adaptive pSVN
1: Input: M prior samples, x_1, . . . , x_M, in each of K cores, and density p_y in all cores.
2: Output: posterior samples x_1^y, . . . , x_M^y in each core.
3: Set level l_2 = 1, x_m^{l_2−1} = x_m, m = 1, . . . , M.
4: repeat
5:   Perform the eigendecomposition (29) at samples x_1^{l_2−1}, . . . , x_M^{l_2−1}, and form the bases Ψ^{l_2}.
6:   Apply Algorithm 1 to update the samples
7:     [x_1^{l_2}, . . . , x_M^{l_2}] = pSVN([x_1^{l_2−1}, . . . , x_M^{l_2−1}], K, Ψ^{l_2}, p_y).
8:   Set l_2 ← l_2 + 1.
9: until A stopping criterion is met.

4 Numerical experiments

We demonstrate the convergence, accuracy, and dimension-independence of the pSVN method by two examples: a linear problem with Gaussian posterior, used to demonstrate the convergence and accuracy of pSVN in comparison with SVN and SVGD, and a nonlinear problem, used to demonstrate accuracy as well as the dimension- and sample-independent convergence of pSVN and its scalability w.r.t. the number of processor cores. The code is described in Appendix D.

4.1 A linear inference problem

For the linear inference problem, we have the parameter-to-observable map

    f(x) = Ax,    (31)

where the linear map A = OB is the composition of an observation map O : R^d → R^s and an inverse discrete differential operator B = (L + M)^{−1} : R^d → R^d, where L and M are the discrete Laplacian and mass matrices of the PDE model −Δu + u = x in (0, 1), u(0) = 0, u(1) = 1. s = 15 pointwise observations of u with 1% noise are distributed with equal distance in (0, 1).
The input x is a random field with Gaussian prior N(0, Γ_0), where Γ_0 is discretized from (I − 0.1Δ)^{−1} with identity I and Laplace operator Δ. We discretize this forward model by a finite element method with piecewise linear elements on a uniform mesh of size 2^n, which leads to the parameter dimension d = 2^n + 1.

Figure 1: Decay of the RMSE (with 10 trials in dashed lines) of the L2-norm of the mean (left) and pointwise variance (middle) of the parameter w.r.t. dimension d = 16, 64, 256, 1024 with N = 128 samples. Right: Decay of the RMSE of the L2-norm of the pointwise variance with N = 32, 512 samples in parameter dimension d = 256 w.r.t. # iterations. Comparison for SVGD, SVN, pSVN.

Figure 1 compares the convergence and accuracy of SVGD, SVN, and pSVN by the decay of the root mean square errors (RMSE) (using 10 trials and 10 iterations) of the sample mean and variance (with L2-norms of errors computed against the analytic values in (30)) w.r.t. parameter dimensions and iterations. We observe much faster convergence and greater accuracy of pSVN relative to SVGD and SVN, for both the mean and especially the variance, which measures the quality of the samples. In particular, we see from the middle figure that the SVN estimate of the variance deteriorates quickly with increasing dimension, while pSVN yields an equally good variance estimate across dimensions. Moreover, from the right figure we can see that pSVN converges very rapidly in a subspace of dimension 6 (at tolerance ε_λ = 0.01 in Section 3.3, i.e., |λ_7| < 0.01) and achieves higher accuracy with a larger number of samples, while SVN converges slowly and leads to large errors.
With the same number of iterations as SVN and pSVN, SVGD produces no evident error decay.

4.2 A nonlinear inference problem

We consider a nonlinear benchmark inference problem (which is often used for testing high-dimensional inference methods [30, 12, 3]), whose forward map is given by f(x) = O(S(x)), with observation map O : R^d → R^s and the nonlinear solution map u = S(x) ∈ R^d of the lognormal diffusion model −∇ · (e^x ∇u) = 0 in (0, 1)², with u = 1 on the top and u = 0 on the bottom boundaries, and zero Neumann conditions on the left and right boundaries. 49 pointwise observations of u are equally distributed in (0, 1)². We use 10% noise to test accuracy against a DILI MCMC method [12] with 10,000 MCMC samples as reference, and a challenging 1% noise for a dimension-independence test of pSVN. The input x is a random field with Gaussian prior N(0, Γ_0), where Γ_0 is a discretization of (I − 0.1Δ)^{−2}. We solve this forward model by a finite element method with piecewise linear elements on a uniform mesh of varying sizes, which leads to a sequence of parameter dimensions.
Figure 2 shows the comparison of the accuracy and convergence of pSVN and SVN by their sample estimates of the mean and variance. We can see that in high dimension, d = 1089, pSVN converges faster and achieves higher accuracy than SVN for both the mean and variance estimates. Moreover, SVN using the kernel (12) in high dimensions (involving a low-rank decomposition of the metric M for high-dimensional nonlinear problems) is more expensive than pSVN per iteration.
We next demonstrate pSVN's independence of the number of parameter and sample dimensions, and its scalability w.r.t. processor cores.
First, the dimension of the Hessian-based subspace r, which determines the computational cost of pSVN, depends on the decay of the absolute eigenvalues |λ_i|, as presented in Section 3.3.

Figure 2: Decay of the RMSE (with 10 trials in dashed lines) of the L2-norm of the mean (left) and pointwise variance (right) of the parameter with dimension d = 1089 and N = 32, 512 samples.

Figure 3: Left: Decay of eigenvalues log10(|λ_i|) with increasing dimension d. Middle: Decay of a stopping criterion, the averaged norm of the update w^l − w^{l−1}, w.r.t. the iteration number l, with increasing number of samples. Right: Decay of the wall clock time (seconds) of different computational components w.r.t. increasing number of processor cores on a log-log scale.

The left part of Figure 3 shows that as d increases from 289 to over 16K, r does not change, which implies that the convergence of pSVN is independent of the number of nominal parameter dimensions. Second, as shown in the middle part of Figure 3, with an increasing number of samples for a fixed parameter dimension d = 1089, the averaged norm of the update w^l − w^{l−1}, one of the convergence indicators presented in Section 3.4, decays similarly, which demonstrates the independence of the convergence of pSVN w.r.t. the number of samples.
Third, in\nthe right of Figure 3 we plot the total wall clock time of pSVN and the time for its computational\ncomponents in Algorithm 1 using different number of processor cores for the same work, i.e., the\nsame number of samples (256), including variation for forward model solve, gradient and Hessian\nevaluation, as well as eigendecomposition, kernel for kernel and its gradient evaluation, solve for\nsolving the Newton system (27), and sample for sample projection and reconstruction. We can\nobserve nearly perfect strong scaling w.r.t. increasing number of processor cores. Moreover, the time\nfor variation, which depends on parameter dimension d, dominates the time for all other components,\nin particular kernel and solve whose cost only depends on r, not d.\n\n5 Conclusion\n\nWe presented a fast and scalable variational method, pSVN, for Bayesian inference in high dimensions.\nThe method exploits the geometric structure of the posterior via its Hessian, and the intrinsic low-\ndimensionality of the change from prior to posterior characteristic of many high-dimensional inference\nproblems via low rank approximation of the averaged Hessian of the log likelihood, computed\nef\ufb01ciently using randomized matrix-free SVD. The fast convergence and higher accuracy of pSVN\nrelative to SVGD and SVN, its complexity that is independent of parameter and sample dimensions,\nand its scalability w.r.t. processor cores were demonstrated for linear and nonlinear inference problems.\nInvestigation of pSVN to tackle intrinsically high-dimensional inference problem (e.g., performed in\nlocal dimensions as the message passing scheme or combined with dimension-independent MCMC\nto update samples in complement subspace) is ongoing. 
Further development and application of pSVN to more general probability distributions, projection basis constructions, and forward models such as deep neural networks, as well as further analysis of the convergence and scalability of pSVN w.r.t. the number of samples, parameter dimension reduction, and data volume, are of great interest.

References

[1] Guillaume Alain, Nicolas Le Roux, and Pierre-Antoine Manzagol. Negative eigenvalues of the Hessian in deep neural networks. arXiv preprint arXiv:1902.02366, 2019.

[2] O. Bashir, K. Willcox, O. Ghattas, B. van Bloemen Waanders, and J. Hill. Hessian-based model reduction for large-scale systems with initial condition inputs. International Journal for Numerical Methods in Engineering, 73:844–868, 2008.

[3] Alexandros Beskos, Mark Girolami, Shiwei Lan, Patrick E. Farrell, and Andrew M. Stuart. Geometric MCMC for infinite-dimensional inverse problems. Journal of Computational Physics, 335:327–351, 2017.

[4] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.

[5] T. Bui-Thanh and O. Ghattas. Analysis of the Hessian for inverse scattering problems: I. Inverse shape scattering of acoustic waves. Inverse Problems, 28(5):055001, 2012.

[6] T. Bui-Thanh, O. Ghattas, J. Martin, and G. Stadler.
A computational framework for infinite-dimensional Bayesian inverse problems, Part I: The linearized case, with application to global seismic inversion. SIAM Journal on Scientific Computing, 35(6):A2494–A2523, 2013.

[7] Peng Chen and Omar Ghattas. Hessian-based sampling for high-dimensional model reduction. arXiv preprint arXiv:1809.10255, 2018.

[8] Peng Chen and Christoph Schwab. Sparse-grid, reduced-basis Bayesian inversion. Computer Methods in Applied Mechanics and Engineering, 297:84–115, 2015.

[9] Peng Chen, Umberto Villa, and Omar Ghattas. Hessian-based adaptive sparse quadrature for infinite-dimensional Bayesian inverse problems. Computer Methods in Applied Mechanics and Engineering, 327:147–172, 2017.

[10] Peng Chen, Umberto Villa, and Omar Ghattas. Taylor approximation and variance reduction for PDE-constrained optimal control problems under uncertainty. Journal of Computational Physics, 385:163–186, 2019.

[11] Wilson Ye Chen, Lester Mackey, Jackson Gorham, François-Xavier Briol, and Chris J. Oates. Stein points. arXiv preprint arXiv:1803.10161, 2018.

[12] Tiangang Cui, Kody J.H. Law, and Youssef M. Marzouk. Dimension-independent likelihood-informed MCMC. Journal of Computational Physics, 304:109–137, 2016.

[13] Tiangang Cui, Youssef Marzouk, and Karen Willcox. Scalable posterior approximations for large-scale Bayesian inverse problems via likelihood-informed parameter and state reduction. Journal of Computational Physics, 315:363–387, 2016.

[14] Gianluca Detommaso, Tiangang Cui, Youssef Marzouk, Alessio Spantini, and Robert Scheichl. A Stein variational Newton method. In Advances in Neural Information Processing Systems, pages 9187–9197, 2018.

[15] Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via Hessian eigenvalue density. arXiv preprint arXiv:1901.10159, 2019.

[16] Mark Girolami and Ben Calderhead.
Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.

[17] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.

[18] Tobin Isaac, Noemi Petra, Georg Stadler, and Omar Ghattas. Scalable and efficient algorithms for the propagation of uncertainty from data through inference to prediction for large-scale problems, with application to flow of the Antarctic ice sheet. Journal of Computational Physics, 296:348–368, 2015.

[19] Chang Liu and Jun Zhu. Riemannian Stein variational gradient descent for Bayesian inference. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[20] Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pages 2378–2386, 2016.

[21] J. Martin, L.C. Wilcox, C. Burstedde, and O. Ghattas. A stochastic Newton MCMC method for large-scale statistical inverse problems with application to seismic inversion. SIAM Journal on Scientific Computing, 34(3):A1460–A1487, 2012.

[22] Youssef Marzouk, Tarek Moselhy, Matthew Parno, and Alessio Spantini. Sampling via measure transport: An introduction. In Handbook of Uncertainty Quantification, pages 1–41. Springer, 2016.

[23] J. Nocedal and S. Wright. Numerical Optimization. Springer Science & Business Media, 2006.

[24] N. Petra, J. Martin, G. Stadler, and O. Ghattas. A computational framework for infinite-dimensional Bayesian inverse problems, Part II: Stochastic Newton MCMC with application to ice sheet flow inverse problems. SIAM Journal on Scientific Computing, 36(4):A1525–A1555, 2014.

[25] Levent Sagun, Leon Bottou, and Yann LeCun.
Eigenvalues of the Hessian in deep learning: Singularity and beyond. arXiv preprint arXiv:1611.07476, 2016.

[26] Claudia Schillings and Christoph Schwab. Sparse, adaptive Smolyak quadratures for Bayesian inverse problems. Inverse Problems, 29(6):065011, 2013.

[27] Claudia Schillings and Christoph Schwab. Scaling limits in computational Bayesian inversion. ESAIM: Mathematical Modelling and Numerical Analysis, 50(6):1825–1856, 2016.

[28] Ch. Schwab and A.M. Stuart. Sparse deterministic approximation of Bayesian inverse problems. Inverse Problems, 28(4):045003, 2012.

[29] A. Spantini, A. Solonen, T. Cui, J. Martin, L. Tenorio, and Y. Marzouk. Optimal low-rank approximations of Bayesian linear inverse problems. SIAM Journal on Scientific Computing, 37(6):A2451–A2487, 2015.

[30] A.M. Stuart. Inverse problems: A Bayesian perspective. Acta Numerica, 19(1):451–559, 2010.

[31] Dilin Wang, Zhe Zeng, and Qiang Liu. Stein variational message passing for continuous graphical models. In International Conference on Machine Learning, pages 5206–5214, 2018.

[32] Jingwei Zhuo, Chang Liu, Jiaxin Shi, Jun Zhu, Ning Chen, and Bo Zhang. Message passing Stein variational gradient descent. arXiv preprint arXiv:1711.04425, 2017.