{"title": "Fast Second Order Stochastic Backpropagation for Variational Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 1387, "page_last": 1395, "abstract": "We propose a second-order (Hessian or Hessian-free) based optimization method for variational inference inspired by Gaussian backpropagation, and argue that quasi-Newton optimization can be developed as well. This is accomplished by generalizing the gradient computation in stochastic backpropagation via a reparametrization trick with lower complexity. As an illustrative example, we apply this approach to the problems of Bayesian logistic regression and variational auto-encoder (VAE). Additionally, we compute bounds on the estimator variance of intractable expectations for the family of Lipschitz continuous function. Our method is practical, scalable and model free. We demonstrate our method on several real-world datasets and provide comparisons with other stochastic gradient methods to show substantial enhancement in convergence rates.", "full_text": "Fast Second-Order Stochastic Backpropagation for\n\nVariational Inference\n\nKai Fan\n\nDuke University\n\nkai.fan@stat.duke.edu\n\nZiteng Wang\u2217\n\nHKUST\u2020\n\nwangzt2012@gmail.com\n\nJeffrey Beck\nDuke University\n\njeff.beck@duke.edu\n\nJames T. Kwok\n\nHKUST\n\njamesk@cse.ust.hk\n\nKatherine Heller\nDuke University\n\nkheller@gmail.com\n\nAbstract\n\nWe propose a second-order (Hessian or Hessian-free) based optimization method\nfor variational inference inspired by Gaussian backpropagation, and argue that\nquasi-Newton optimization can be developed as well. This is accomplished\nby generalizing the gradient computation in stochastic backpropagation via a\nreparametrization trick with lower complexity. As an illustrative example, we ap-\nply this approach to the problems of Bayesian logistic regression and variational\nauto-encoder (VAE). 
Additionally, we compute bounds on the estimator variance of intractable expectations for the family of Lipschitz continuous functions. Our method is practical, scalable and model free. We demonstrate our method on several real-world datasets and provide comparisons with other stochastic gradient methods to show substantial enhancement in convergence rates.

1 Introduction

Generative models have become ubiquitous in machine learning and statistics and are now widely used in fields such as bioinformatics, computer vision, and natural language processing. These models benefit from being highly interpretable and easily extended. Unfortunately, inference and learning with generative models is often intractable, especially for models that employ continuous latent variables, and so fast approximate methods are needed. Variational Bayesian (VB) methods [1] deal with this problem by approximating the true posterior with a tractable parametric family and then identifying the set of parameters that maximizes a variational lower bound on the marginal likelihood. That is, VB methods turn an inference problem into an optimization problem that can be solved, for example, by gradient ascent.

Indeed, efficient stochastic gradient variational Bayesian (SGVB) estimators have been developed for auto-encoder models [17], and a number of papers have followed up on this approach [28, 25, 19, 16, 15, 26, 10]. Recently, [25] provided a complementary perspective by using stochastic backpropagation, which is equivalent to SGVB, and applied it to deep latent Gaussian models. Stochastic backpropagation overcomes many limitations of traditional inference methods such as the mean-field or wake-sleep algorithms [12], because it admits an efficient, unbiased estimate of the gradient of the variational lower bound. 
The resulting gradients can be used for parameter estimation via stochastic optimization methods such as stochastic gradient descent (SGD) or its adaptive variant AdaGrad [6].

∗ Equal contribution to this work.
† HKUST refers to the Hong Kong University of Science and Technology.

Unfortunately, methods such as SGD or AdaGrad converge slowly for some difficult-to-train models, such as untied-weights auto-encoders or recurrent neural networks. The common experience is that gradient descent often gets stuck near saddle points or local extrema. Meanwhile, the learning rate is difficult to tune. [18] gave a clear explanation of why Newton's method is preferred over gradient descent, which often suffers from under-fitting when the objective manifests pathological curvature. Newton's method is invariant to affine transformations, so it can take advantage of curvature information, but it has a higher computational cost due to its reliance on the inverse of the Hessian matrix. This issue was partially addressed in [18], where the authors introduced Hessian-free (HF) optimization and demonstrated its suitability for problems in machine learning.

In this paper, we continue this line of research into 2nd order variational inference algorithms. Inspired by the properties of location-scale families [8], we show how to reduce the computational cost of the Hessian or Hessian-vector product, thus allowing for a 2nd order stochastic optimization scheme for variational inference under a Gaussian approximation. In conjunction with HF optimization, we propose an efficient and scalable 2nd order stochastic Gaussian backpropagation for variational inference, called HFSGVI. 
Alternatively, an L-BFGS [3] version, a quasi-Newton method that uses only gradient information, is a natural generalization of 1st order variational inference.

The most immediate application is obtaining better optimization algorithms for variational inference. To our knowledge, the only model to which 2nd order information is currently applied is LDA [2, 14], where the Hessian is easy to compute [11]. In general, for non-linear factor models like non-linear factor analysis or deep latent Gaussian models, this is not the case. Indeed, to our knowledge, there has not been any systematic investigation into the properties of various optimization algorithms and how they might impact the solutions to the optimization problems arising from variational approximations.

The main contributions of this paper fill this gap for variational inference by introducing a novel 2nd order optimization scheme. First, we describe a clever approach to obtain curvature information with low computational cost, thus making Newton's method both scalable and efficient. Second, we show that the variance of the lower bound estimator can be bounded by a dimension-free constant, extending the work of [25], which discussed a specific bound for univariate functions. Third, we demonstrate the performance of our method for Bayesian logistic regression and the VAE model in comparison to commonly used algorithms; the convergence rate is shown to be competitive or faster.

2 Stochastic Backpropagation

In this section, we extend the Bonnet and Price theorems [4, 24] to develop 2nd order Gaussian backpropagation. Specifically, we consider how to optimize an expectation of the form E_{qθ}[f(z|x)], where z and x refer to latent variables and observed variables respectively, the expectation is taken w.r.t. the distribution qθ, and f is some smooth loss function (e.g. it can be derived from a standard variational lower bound [1]). 
Sometimes we abuse notation and write f(z), omitting x when no ambiguity exists. To optimize such an expectation, gradient descent methods require the 1st derivatives, while Newton's methods require both the gradient and the Hessian, involving 2nd order derivatives.

2.1 Second Order Gaussian Backpropagation

If the distribution q is a d_z-dimensional Gaussian N(z|μ, C), the required partial derivatives can be computed with low algorithmic cost O(d_z²) [25]. Using the properties of the Gaussian distribution, we can compute the 2nd order partial derivatives of E_{N(z|μ,C)}[f(z)] as follows:

∇²_{μ_i,μ_j} E_{N(z|μ,C)}[f(z)] = E_{N(z|μ,C)}[∇²_{z_i,z_j} f(z)] = 2 ∇_{C_{ij}} E_{N(z|μ,C)}[f(z)],   (1)

∇²_{C_{i,j},C_{k,l}} E_{N(z|μ,C)}[f(z)] = (1/4) E_{N(z|μ,C)}[∇⁴_{z_i,z_j,z_k,z_l} f(z)],   (2)

∇²_{μ_i,C_{k,l}} E_{N(z|μ,C)}[f(z)] = (1/2) E_{N(z|μ,C)}[∇³_{z_i,z_k,z_l} f(z)].   (3)

Eqs. (1), (2), (3) (proofs in the supplementary) have the nice property that a limited number of samples from q are sufficient to obtain unbiased gradient estimates. However, note that Eqs. (2), (3) require the third and fourth derivatives of f(z), which are highly computationally inefficient. To avoid calculating these high order derivatives, we use a coordinate transformation.

2.2 Covariance Parameterization for Optimization

By constructing the linear transformation (a.k.a. reparameterization) z = μ + Rε, where ε ∼ N(0, I_{d_z}), we can generate samples from any Gaussian distribution N(μ, C) by simulating data from a standard normal distribution, provided the decomposition C = RRᵀ holds. This fact allows us to derive the following theorem, indicating that the computation of 2nd order derivatives can be scalable and programmed to run in parallel.

Theorem 1 (Fast Derivative). 
If f is a twice differentiable function and z follows the Gaussian distribution N(μ, C), C = RRᵀ, where both the mean μ and R depend on a d-dimensional parameter θ = (θ_l)_{l=1}^d, i.e. μ(θ), R(θ), then ∇²_{μ,R} E_{N(μ,C)}[f(z)] = E_{ε∼N(0,I_{d_z})}[εᵀ ⊗ H] and ∇²_R E_{N(μ,C)}[f(z)] = E_{ε∼N(0,I_{d_z})}[(εεᵀ) ⊗ H]. This then implies

∇_{θ_l} E_{N(μ,C)}[f(z)] = E_{ε∼N(0,I)}[ gᵀ ∂(μ + Rε)/∂θ_l ],   (4)

∇²_{θ_{l1},θ_{l2}} E_{N(μ,C)}[f(z)] = E_{ε∼N(0,I)}[ (∂(μ + Rε)/∂θ_{l1})ᵀ H (∂(μ + Rε)/∂θ_{l2}) + gᵀ ∂²(μ + Rε)/(∂θ_{l1}∂θ_{l2}) ],   (5)

where ⊗ is the Kronecker product, and the gradient g and Hessian H of f are evaluated at μ + Rε.

If we consider the mean and covariance matrix as the variational parameters in variational inference, the first two results w.r.t. μ, R make parallelization possible and reduce the computational cost of the Hessian-vector multiplication, owing to the identity (Aᵀ ⊗ B)vec(V) = vec(BVA). If the model has few parameters or a large resource budget (e.g. GPU) is available, Theorem 1 lays the foundation for exact 2nd order derivative computation in parallel. In addition, note that the 2nd order gradient computation w.r.t. the model parameter θ only involves matrix-vector or vector-vector multiplications, leading to an algorithmic complexity of O(d_z²) for the 2nd order derivatives of θ, the same as for the 1st order gradient [25]. The derivative computation of f is needed only up to 2nd order, avoiding 3rd or 4th order derivatives. One practical parameterization assumes a diagonal covariance matrix C = diag{σ_1², ..., σ_{d_z}²}. 
This reduces the actual computational cost relative to Theorem 1, albeit with the same order of complexity, O(d_z²) (see supplementary material). Theorem 1 holds for a large class of distributions beyond the Gaussian, such as the Student's t-distribution. If the dimensionality d of the embedded parameter θ is large, computation of the gradient G_θ and Hessian H_θ (distinct from g, H above) will be linear and quadratic w.r.t. d respectively, which may be unacceptable. Therefore, in the next section we attempt to reduce the computational complexity w.r.t. d.

2.3 Applying Reparameterization to the Second Order Algorithm

In the standard Newton's method, we need to compute the Hessian matrix and its inverse, which is intractable with limited computing resources. [18] applied the Hessian-free (HF) optimization method in deep learning effectively and efficiently. This work largely relied on the technique of fast Hessian matrix-vector multiplication [23]. We combine the reparameterization trick with Hessian-free or quasi-Newton methods to circumvent the matrix inversion problem.

Hessian-free. Unlike quasi-Newton methods, HF does not make any approximation to the Hessian. HF needs to compute H_θv, where v is any vector of dimension matching H_θ, and then uses the conjugate gradient algorithm to solve the linear system H_θ p = −∇F(θ) for any objective function F. [18] gives a reasonable explanation of Hessian-free optimization. In short, unlike a pre-training method that places the parameters in a search region to regularize [7], HF solves issues of pathological curvature in the objective by taking advantage of the rescaling property of Newton's method. By definition, H_θv = lim_{γ→0} (∇F(θ + γv) − ∇F(θ))/γ, indicating that H_θv can be numerically computed using finite differences at γ. 
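As a minimal numerical sketch (our illustration, not the paper's code), this finite-difference rule can be checked on a quadratic objective whose Hessian is known exactly:

```python
import numpy as np

def grad_F(theta, A, b):
    # Gradient of the quadratic F(theta) = 0.5 * theta^T A theta + b^T theta,
    # whose Hessian is exactly A.
    return A @ theta + b

def hvp_fd(grad, theta, v, gamma=1e-5):
    # Finite-difference Hessian-vector product:
    # H v ~= (grad(theta + gamma * v) - grad(theta)) / gamma
    return (grad(theta + gamma * v) - grad(theta)) / gamma

rng = np.random.default_rng(0)
d = 5
M = rng.standard_normal((d, d))
A = M @ M.T + np.eye(d)        # symmetric positive definite Hessian
b = rng.standard_normal(d)
theta = rng.standard_normal(d)
v = rng.standard_normal(d)

approx = hvp_fd(lambda t: grad_F(t, A, b), theta, v)
exact = A @ v
assert np.allclose(approx, exact, atol=1e-3)
```

For a quadratic objective the approximation is exact up to rounding; for general objectives the choice of γ trades truncation error against cancellation.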
However, this numerical method is unstable for small γ. In this section, we focus on the calculation of H_θv by leveraging the reparameterization trick. Specifically, we apply an R-operator technique [23] for computing the product H_θv exactly. Let F = E_{N(μ,C)}[f(z)] and reparameterize z as in Sec. 2.2. We perform the variable substitution θ ← θ + γv after the gradient in Eq. (4) is obtained, and then differentiate w.r.t. γ. Thus we have the following analytical expression for the Hessian-vector multiplication:

H_θv = ∂/∂γ ∇F(θ + γv)|_{γ=0} = ∂/∂γ E_{N(0,I)}[ gᵀ ∂(μ(θ) + R(θ)ε)/∂θ |_{θ←θ+γv} ]|_{γ=0} = E_{N(0,I)}[ ∂/∂γ ( gᵀ ∂(μ(θ) + R(θ)ε)/∂θ |_{θ←θ+γv} ) ]|_{γ=0}.   (6)

Algorithm 1 Hessian-free Algorithm for Stochastic Gaussian Variational Inference (HFSGVI)
Parameters: minibatch size B; number of samples M used to estimate the expectation (M = 1 as default).
Input: observations X (and Y if required); lower bound function L = E_{N(μ,C)}[f_L].
Output: parameter θ after convergence.
1: for t = 1, 2, . . . do
2:   {x_b}_{b=1}^B ← randomly draw B datapoints from the full data set X;
3:   {ε_{mb}}_{m=1}^M ← sample M times from N(0, I) for each x_b;
4:   define the gradient G(θ) = (1/M) Σ_{b,m} g_{b,m}ᵀ ∂(μ + Rε_{mb})/∂θ, where g_{b,m} = ∇_z f_L(z|x_b)|_{z=μ+Rε_{mb}};
5:   define the function B(θ, v) = ∇_γ G(θ + γv)|_{γ=0}, where v is a d-dimensional vector;
6:   use the conjugate gradient algorithm to solve the linear system B(θ_t, p_t) = −G(θ_t);
7:   θ_{t+1} = θ_t + p_t;
8: end for

Eq. (6) is appealing since it does not require storing a dense matrix and provides an unbiased estimator of H_θv with a small sample size. In order to conduct 2nd order optimization for variational inference, once the computation of the gradient of the variational lower bound is complete, we only need to add one extra gradient evaluation via Eq. (6), which has the same computational complexity as Eq. (4). This leads to the Hessian-free variational inference method described in Algorithm 1.

In the worst case for HF, the conjugate gradient (CG) algorithm requires at most d iterations to terminate, meaning d evaluations of the H_θv product. However, the good news is that CG reaches good convergence after a reasonable number of iterations, and in practice we found that it may not be necessary to wait for CG to converge. In other words, even if we set the maximum number of CG iterations K to a small fixed number (e.g., 10 in our experiments, even with thousands of parameters), the performance does not deteriorate. This early stopping strategy may have an effect similar to a Wolfe condition, avoiding excessive step sizes in Newton's method. Therefore we successfully reduce the complexity of each iteration to O(Kd·d_z²), whereas O(d·d_z²) is the cost of one SGD iteration.

L-BFGS. Limited-memory BFGS utilizes the information gleaned from the gradient vector to approximate the Hessian matrix without explicit computation, and we can readily utilize it within our framework. The basic idea of BFGS is to approximate the Hessian by the iterative update B_{t+1} = B_t + ΔG_tΔG_tᵀ/(ΔG_tᵀΔθ_t) − B_tΔθ_tΔθ_tᵀB_t/(Δθ_tᵀB_tΔθ_t), where ΔG_t = G_t − G_{t−1} and Δθ_t = θ_t − θ_{t−1}. By Eq. (4), the gradient G_t at each iteration can be obtained without any difficulty. However, even though this low-rank approximation to the Hessian is easy to invert analytically via the Sherman-Morrison formula, we still need to store the matrix. L-BFGS further implicitly approximates this dense B_t or B_t^{−1} by tracking only a few gradient vectors and a short history of parameters, and therefore has a linear memory requirement. In general, L-BFGS performs a sequence of inner products with the K most recent Δθ_t and ΔG_t, where K is a predefined constant (10 or 15 in our experiments). Due to space limitations, we omit the details here but nonetheless present this algorithm in the experiments section.

2.4 Estimator Variance

The framework of stochastic backpropagation [16, 17, 19, 25] extensively uses the mean of very few samples (often just one) to approximate the expectation. Similarly, we approximate the left-hand sides of Eq. (4), (5), (6) by sampling a few points from the standard normal distribution. However, the magnitude of the variance of such estimators has not been seriously discussed. [25] explored the variance quantitatively only for separable functions. [19] merely borrowed a variance reduction technique from reinforcement learning, centering the learning signal in expectation and performing variance normalization. Here, we generalize the treatment of variance to a broader family, the Lipschitz continuous functions.

Theorem 2 (Variance Bound). If f is an L-Lipschitz differentiable function and ε ∼ N(0, I_{d_z}), then E[(f(ε) − E[f(ε)])²] ≤ L²π²/4.

The proof of Theorem 2 (see supplementary) employs the properties of sub-Gaussian distributions and the duplication trick that are commonly used in learning theory. 
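As a quick empirical illustration (our own choice of test function, not from the paper), the 1-Lipschitz function f(ε) = ‖ε‖₂ keeps its sample variance below the L²π²/4 bound at every dimensionality:

```python
import numpy as np

rng = np.random.default_rng(0)
L = 1.0                          # the Euclidean norm is 1-Lipschitz
bound = L**2 * np.pi**2 / 4      # dimension-free bound from Theorem 2

variances = []
for dz in [1, 10, 100, 300]:
    eps = rng.standard_normal((20000, dz))
    f = np.linalg.norm(eps, axis=1)   # f(eps) = ||eps||_2
    variances.append(f.var())

# The empirical variance stays below the bound in every dimension.
assert all(v < bound for v in variances)
```

The variance in fact stays near 0.5 as d_z grows, far below the bound, illustrating that the bound is dimension-free.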
Significantly, the result implies a variance bound independent of the dimensionality of the Gaussian variable. Note that from the proof we only obtain E[e^{λ(f(ε)−E[f(ε)])}] ≤ e^{L²λ²π²/8} for λ > 0. Though this result suffices to show that the variance is independent of d_z, it can in fact be tightened to a sharper bound by a constant scalar, i.e. e^{λ²L²/2}, improving the result of Theorem 2 to Var(f(ε)) ≤ L². If all the results above hold for smooth (twice continuously differentiable) functions with Lipschitz constant L, then they hold for all Lipschitz functions by a standard approximation argument; hence the condition can be relaxed to Lipschitz continuous functions.

Corollary 3 (Bias Bound). P( |(1/M) Σ_{m=1}^M f(ε_m) − E[f(ε)]| ≥ t ) ≤ 2 e^{−2Mt²/(π²L²)}.

It is also worth mentioning that this significant corollary of Theorem 2 is a probabilistic inequality measuring the convergence rate of the Monte Carlo approximation in our setting. This tail bound, together with the variance bound, provides a theoretical guarantee for stochastic backpropagation on Gaussian variables and explains why a single realization (M = 1) is enough in practice. By reparameterization, Eq. (4), (5), (6) can be formulated as expectations w.r.t. the isotropic Gaussian distribution with identity covariance matrix, leading to Algorithm 1. Thus we can rein in the number of samples for Monte Carlo integration regardless of the dimensionality of the latent variables z. This seems counter-intuitive; however, we note that a larger L may require more samples, and the Lipschitz constants of different models vary greatly.

3 Application to the Variational Auto-encoder

Note that our method is model free. 
If the loss function takes the form of an expectation of a function w.r.t. latent Gaussian variables, we can directly apply Algorithm 1. In this section, we emphasize the standard VAE model [17], which has been intensively studied; in particular, its objective takes a logarithmic form, bridging the gap between the Hessian and the Fisher information matrix in expectation (see the survey [22] and references therein).

3.1 Model Description

Suppose we have N i.i.d. observations X = {x^(i)}_{i=1}^N, where x^(i) ∈ R^D is a data vector that can take either continuous or discrete values. In contrast to a standard auto-encoder model constructed by a neural network with a bottleneck structure, the VAE describes the embedding process from the perspective of a Gaussian latent variable model. Specifically, each data point x follows the generative model p_ψ(x|z), where this process is actually a decoder, usually constructed as a non-linear transformation with unknown parameters ψ and a prior distribution p_ψ(z). The encoder or recognition model q_φ(z|x) is used to approximate the true posterior p_ψ(z|x), where φ plays the role of the variational distribution's parameter. As suggested in [16, 17, 25], multi-layer perceptrons (MLPs) are commonly used as both the probabilistic encoder and decoder. We will later see that this construction is equivalent to a variant of deep neural networks under the constraint of a unique realization of z. 
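A minimal numpy sketch of such an encoder/decoder pair (hypothetical layer sizes and initialization; a single-hidden-layer MLP with a diagonal Gaussian posterior):

```python
import numpy as np

rng = np.random.default_rng(0)
D, dh, dz = 784, 200, 20          # input, hidden, latent sizes (illustrative)

def mlp_params(n_in, n_out):
    # Small random weights and zero biases (a common, hypothetical init).
    return rng.standard_normal((n_out, n_in)) * 0.01, np.zeros(n_out)

# Encoder q_phi(z|x): x -> hidden -> (mu, log sigma^2)
W1, b1 = mlp_params(D, dh)
Wmu, bmu = mlp_params(dh, dz)
Wls, bls = mlp_params(dh, dz)
# Decoder p_psi(x|z): z -> hidden -> Bernoulli means y
W2, b2 = mlp_params(dz, dh)
Wy, by = mlp_params(dh, D)

def encode(x):
    h = np.tanh(W1 @ x + b1)
    return Wmu @ h + bmu, Wls @ h + bls        # mu, log sigma^2

def decode(z):
    h = np.tanh(W2 @ z + b2)
    return 1.0 / (1.0 + np.exp(-(Wy @ h + by)))  # Bernoulli means

x = rng.integers(0, 2, D).astype(float)          # a binary input vector
mu, logvar = encode(x)
eps = rng.standard_normal(dz)
z = mu + np.exp(0.5 * logvar) * eps              # reparameterization z = mu + R eps
y = decode(z)
assert y.shape == (D,) and np.all((y > 0) & (y < 1))
```

The reparameterized sample z = μ + Rε with R = diag σ is exactly the transformation of Sec. 2.2 specialized to a diagonal covariance.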
For this model and each datapoint, the variational lower bound on the marginal likelihood is

log p_ψ(x^(i)) ≥ E_{q_φ(z|x^(i))}[log p_ψ(x^(i)|z)] − D_KL(q_φ(z|x^(i)) ‖ p_ψ(z)) = L(x^(i)).   (7)

We can write the KL divergence as an expectation term and denote (ψ, φ) by θ. By the previous discussion, this means that our objective is to solve the optimization problem arg max_θ Σ_i L(x^(i)) for the variational lower bound of the full dataset. Thus the L-BFGS or HF SGVI algorithm can be implemented straightforwardly to estimate the parameters of both the generative and recognition models. Since the first term, the reconstruction error, appears in Eq. (7) as an expectation over the latent variable, [17, 25] used a small finite number M of samples as a Monte Carlo integration with the reparameterization trick to reduce the variance; this, in fact, amounts to drawing samples from the standard normal distribution. In addition, the second term is the KL divergence between the variational distribution and the prior distribution, which acts as a regularizer.

3.2 Deep Neural Networks with Hybrid Hidden Layers

In the experiments, setting M = 1 can not only achieve excellent performance but also speed up the program. In this special case, we discuss the relationship between the VAE and traditional deep auto-encoders. For binary inputs, denoting the output by y, we have log p_ψ(x|z) = Σ_{j=1}^D [x_j log y_j + (1 − x_j) log(1 − y_j)], which is exactly the negative cross-entropy. It is also apparent that log p_ψ(x|z) is equivalent to the negative squared error loss for continuous data. This means that maximizing the lower bound is roughly equal to minimizing the loss function of a deep neural network (see Figure 1 in the supplementary), except for the different regularizers. 
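Under a diagonal Gaussian posterior and a standard normal prior, the KL term in Eq. (7) has a well-known closed form, so a per-datapoint Monte Carlo estimate of L(x) can be sketched as follows (illustrative code, not the authors' implementation):

```python
import numpy as np

def elbo(x, mu, logvar, decode, M=1, rng=None):
    # Monte Carlo estimate of Eq. (7): E_q[log p(x|z)] - KL(q || p).
    rng = rng or np.random.default_rng(0)
    dz = mu.shape[0]
    # Closed-form KL( N(mu, diag sigma^2) || N(0, I) )
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    rec = 0.0
    for _ in range(M):
        eps = rng.standard_normal(dz)
        y = decode(mu + np.exp(0.5 * logvar) * eps)  # Bernoulli means
        # Negative cross-entropy reconstruction term for binary x
        rec += np.sum(x * np.log(y) + (1 - x) * np.log(1 - y))
    return rec / M - kl

# Tiny check: a standard normal posterior has zero KL, and a constant
# "decoder" (hypothetical) makes the reconstruction term exact.
x = np.array([1.0, 0.0, 1.0])
mu, logvar = np.zeros(2), np.zeros(2)
decode = lambda z: np.full(3, 0.5)
val = elbo(x, mu, logvar, decode)
assert np.isclose(val, 3 * np.log(0.5))
```

With M = 1 this is exactly the single-sample estimator discussed above; the KL term is the regularizer contributed by the prior.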
In other words, the prior in the VAE imposes a regularizer only on the encoder (recognition model), while in deep neural nets an L2 penalty on all parameters is typically used. From the perspective of a deep neural network with hybrid hidden nodes, the model consists of two Bernoulli layers and one Gaussian layer. The gradient computation can simply follow a variant of backpropagation, layer by layer (derivation given in the supplementary). To further see the rationale of setting M = 1, we investigate the upper bound of the Lipschitz constant under various activation functions in the next lemma. As Theorem 2 implies, the variance of the expectation approximated with finitely many samples mainly depends on the Lipschitz constant rather than the dimensionality. According to Lemma 4, imposing a prior or regularization on the parameters controls both the model complexity and the function smoothness. Lemma 4 also implies that we can obtain an upper bound on the Lipschitz constant for the estimators designed in our algorithm.

Lemma 4. Consider a sigmoid activation function g in a deep neural network with one Gaussian layer z, z ∼ N(μ, C), C = RRᵀ. Let z = μ + Rε; then the Lipschitz constant of g(W_{i,·}(μ + Rε) + b_i) is bounded by (1/4)‖W_{i,·}R‖₂, where W_{i,·} is the i-th row of the weight matrix and b_i is the i-th element of the bias. Similarly, for the hyperbolic tangent or softplus function, the Lipschitz constant is bounded by ‖W_{i,·}R‖₂.

4 Experiments

We apply our 2nd order stochastic variational inference to two different non-conjugate models. First, we consider the simple but widely used Bayesian logistic regression model, and compare with the most recent 1st order algorithm, doubly stochastic variational inference (DSVI) [28], designed for sparse variable selection with logistic regression. 
Then, we compare the performance of the VAE model under our algorithms.

4.1 Bayesian Logistic Regression

Given a dataset {x_i, y_i}_{i=1}^N, where each instance x_i ∈ R^D includes the default feature 1 and y_i ∈ {−1, 1} is the binary label, Bayesian logistic regression models the probability of the outputs conditional on the features and the coefficients β with an imposed prior. The likelihood and the prior can usually take the forms Π_{i=1}^N g(y_i x_iᵀ β) and N(0, Λ) respectively, where g is the sigmoid function and Λ is a diagonal covariance matrix for simplicity. We can propose a variational Gaussian distribution q(β|μ, C) to approximate the posterior of the regression parameter. If we further assume a diagonal C, the factorized form Π_{j=1}^D q(β_j|μ_j, σ_j) is both efficient and practical for inference. Instead of iteratively optimizing Λ and μ, C as in variational EM, [28] noticed that in the gradient calculation for the lower bound, the update of Λ can be worked out analytically in terms of the variational parameters, resulting in a new objective function for the lower bound that relies only on μ, C (for details refer to [28]). We apply our algorithm to this variational logistic regression on three appropriate datasets: DukeBreast and Leukemia, which are small but high-dimensional and suited to sparse logistic regression, and a9a, which is large. See Table 1 for additional dataset descriptions.

Fig. 1 shows the convergence of the Gaussian variational lower bound for Bayesian logistic regression in terms of running time. It is worth mentioning that the lower bound of HFSGVI converges within 3 iterations on the small datasets DukeBreast and Leukemia. This is because all data points are fed to all algorithms and HFSGVI uses a better approximation of the Hessian matrix to perform 2nd order optimization. 
L-BFGS-SGVI also takes less time to converge and yields a slightly larger lower bound than DSVI. In addition, as an SGD-based algorithm, DSVI is clearly less stable on the small datasets and fluctuates strongly even in the later stages of optimization. On the large a9a, we observe that HFSGVI also needs 1000 iterations to reach a good lower bound and becomes less stable than the other two algorithms. However, L-BFGS-SGVI performs the best both in terms of convergence rate and the final lower bound. The misclassification report in Table 1 reflects the similar advantages of our approach, indicating competitive prediction ability on various datasets. Finally, it is worth mentioning that all three algorithms learn a set of very sparse regression coefficients on the three datasets (see supplement for additional visualizations).

Table 1: Comparison of the number of misclassifications

Dataset (size: #train/test/feature) | DSVI train | DSVI test | L-BFGS-SGVI train | L-BFGS-SGVI test | HFSGVI train | HFSGVI test
DukeBreast (38/4/7129)  | 0    | 2    | 0    | 1    | 0    | 0
Leukemia (38/34/7129)   | 0    | 3    | 0    | 3    | 0    | 3
A9a (32561/16281/123)   | 4948 | 2455 | 4936 | 2427 | 4931 | 2468

Figure 1: Convergence rate on logistic regression (zoom out or see larger figures in the supplementary).

4.2 Variational Auto-encoder

We also apply 2nd order stochastic variational inference to train a VAE model (setting M = 1 for the Monte Carlo integration estimating the expectation), or equivalently deep neural networks with hybrid hidden layers. The datasets we use are images from Frey Face, Olivetti Face and MNIST. We mainly learn three tasks by maximizing the variational lower bound: parameter estimation, image reconstruction and image generation. Meanwhile, we compare the convergence rate (running time) of the three algorithms, where in this section the compared SGD is the Ada version [6] recommended for the VAE model in [17, 25]. 
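For reference, the AdaGrad-style baseline updates each coordinate with a step size shrunk by the accumulated squared gradients (a generic sketch with an illustrative step size, not the authors' implementation):

```python
import numpy as np

def adagrad_step(theta, g, hist, eta=0.5, eps=1e-8):
    # AdaGrad: per-coordinate step sizes shrink with accumulated gradient magnitude.
    hist += g * g
    return theta - eta * g / (np.sqrt(hist) + eps), hist

# Minimize F(theta) = 0.5 * ||theta||^2, whose gradient is g = theta.
theta = np.array([3.0, -2.0])
hist = np.zeros(2)
for _ in range(500):
    theta, hist = adagrad_step(theta, theta, hist)
assert np.linalg.norm(theta) < 1e-2
```

Unlike the 2nd order updates above, the only curvature adaptation here comes from the gradient history, which is one reason such baselines can need many more iterations on ill-conditioned objectives.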
The experimental settings are as follows. The initial weights are randomly drawn from N(0, 0.01²I) or N(0, 0.001²I), while all bias terms are initialized as 0. The variational lower bound only introduces regularization on the encoder parameters, so we add an L2 regularizer on the decoder parameters with a shrinkage parameter of 0.001 or 0.0001. The number of hidden nodes in the encoder and decoder is the same for all auto-encoder models, which is a reasonable and convenient way to construct a symmetric structure. This number is tuned from 200 to 800 in increments of 100. The mini-batch size is 100 for L-BFGS and Ada, while a larger mini-batch is recommended for HF, meaning it should vary with the training size.

The detailed results are shown in Figs. 2 and 3. Both Hessian-free and L-BFGS converge faster than Ada in terms of running time. HFSGVI also performs competitively with respect to generalization on test data. Ada takes at least four times as long to achieve a similar lower bound. Theoretically, Newton's method has a quadratic convergence rate in terms of iterations, but with a cubic algorithmic complexity at each iteration. However, we manage to lower the computation in each iteration to linear complexity. Thus, considering the number of evaluated training data points, the 2nd order algorithm needs far fewer steps than 1st order gradient descent (see visualization in the supplementary on MNIST). The Hessian matrix also replaces manually tuned learning rates, and the affine invariance property allows for automatic learning rate adjustment. Technically, if the program can run in parallel on a GPU, the speed advantages of the 2nd order algorithm should be even more obvious [21].

Fig. 2(b) and Fig. 3(b) show reconstruction results for input images. From the perspective of a deep neural network, the only difference is the Gaussian distributed latent variable z. 
By a corollary of Theorem 2, we can roughly interpret the mean µ as carrying the information in z, meaning this layer is effectively a linear transformation with noise, which resembles dropout training [5]. Specifically, Olivetti contains 64×64-pixel faces of various persons, which means more complicated models or preprocessing [13] (e.g. nearest-neighbor interpolation, patch sampling) is needed. However, even when simply learning a very bottlenecked auto-encoder, our approach can achieve acceptable results. Note that although we tuned the hyperparameters of Ada by cross-validation, its best result is still a set of mean faces. For manifold learning, Fig. 2(c) shows how the learned generative model can simulate images with HFSGVI.

Figure 2: (a) shows how the lower bound increases w.r.t. program running time for the different algorithms; (b) illustrates the reconstruction ability of this auto-encoder model when dz = 20 (the left 5 columns are randomly sampled from the dataset); (c) is the learned manifold of the generative model when dz = 2.

Figure 3: (a) shows the running-time comparison; (b) illustrates the reconstruction comparison without patch sampling, where dz = 100: the top 5 rows are original faces.

To visualize the results, we choose the 2D latent variable z in pψ(x|z), where the parameter ψ is estimated by the algorithm.
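The manifold in Fig. 2(c) is produced by decoding a grid of 2D latent codes drawn from the unit square through the Gaussian inverse CDF. A minimal sketch (the one-layer `decode` below is a hypothetical stand-in for the learned decoder pψ(x|z), not the paper's network):

```python
import numpy as np
from statistics import NormalDist  # stdlib inverse Gaussian CDF

def latent_grid(n=10):
    """n x n grid of 2D latent codes: equally spaced points on the unit
    square pushed through the inverse CDF of N(0, 1), so the codes cover
    the Gaussian prior with equal probability mass between grid lines."""
    probs = np.linspace(0.5 / n, 1 - 0.5 / n, n)
    ticks = np.array([NormalDist().inv_cdf(p) for p in probs])
    return np.array([[(zx, zy) for zy in ticks] for zx in ticks])

def decode(z, W, b):
    """Hypothetical stand-in for the learned decoder mean of p_psi(x|z):
    a single sigmoid layer mapping each 2D code to 64 'pixel' means."""
    return 1.0 / (1.0 + np.exp(-(z @ W + b)))

grid = latent_grid(10)                                  # shape (10, 10, 2)
rng = np.random.default_rng(0)
W, b = rng.standard_normal((2, 64)), np.zeros(64)
images = decode(grid.reshape(-1, 2), W, b).reshape(10, 10, 64)
```

Laying the decoded outputs on the grid then yields the kind of manifold tiling shown in Fig. 2(c).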
The two coordinates of z take values transformed through the inverse CDF of the Gaussian distribution from an equally spaced grid (10×10 or 20×20) on the unit square. Then we simply use the generative model to simulate the images. Beyond these learning tasks, denoising, imputation [25] and even generalization to semi-supervised learning [16] are possible applications of our approach.

5 Conclusions and Discussion

In this paper we proposed a scalable 2nd-order stochastic variational method for generative models with continuous latent variables. By developing Gaussian backpropagation through reparametrization, we introduced an efficient unbiased estimator for higher-order gradient information. Combined with an efficient technique for computing Hessian-vector products, we derived an efficient inference algorithm (HFSGVI) that allows for joint optimization of all parameters. The algorithmic complexity of each parameter update is quadratic w.r.t. the dimension of the latent variables for both 1st and 2nd derivatives. Furthermore, the overall computational complexity of our 2nd-order SGVI is linear w.r.t. the number of parameters in real applications, just like SGD or Ada. However, HFSGVI may not be as fast as Ada in some situations, e.g., when the pixel values of images are sparse, owing to the fast sparse matrix multiplication implementations in most software packages.
Future research will focus on difficult deep models such as RNNs [10, 27] or Dynamic SBN [9]. Because of the conditionally independent structure given the sampled latent variables, we may construct a block Hessian matrix to optimize such dynamic models. Another possible area of future work is reinforcement learning (RL) [20]. Many RL problems reduce to computing gradients of expectations (e.g., in policy gradient methods), and there has been a series of explorations of natural gradients in this area.
However, we would suggest that it might be interesting to consider where stochastic backpropagation fits in our framework and how 2nd-order computations can help.

Acknowledgement This research was supported in part by the Research Grants Council of the Hong Kong Special Administrative Region (Grant No. 614513).

References

[1] Matthew James Beal. Variational algorithms for approximate Bayesian inference. PhD thesis, 2003.
[2] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 2003.
[3] Joseph-Frédéric Bonnans, Jean Charles Gilbert, Claude Lemaréchal, and Claudia A Sagastizábal. Numerical Optimization: Theoretical and Practical Aspects. Springer Science & Business Media, 2006.
[4] Georges Bonnet. Transformations des signaux aléatoires à travers les systèmes non linéaires sans mémoire. Annals of Telecommunications, 19(9):203–220, 1964.
[5] George E Dahl, Tara N Sainath, and Geoffrey E Hinton. Improving deep neural networks for LVCSR using rectified linear units and dropout. In ICASSP, 2013.
[6] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
[7] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11:625–660, 2010.
[8] Thomas S Ferguson.
Location and scale parameters in exponential families of distributions. Annals of Mathematical Statistics, pages 986–1001, 1962.
[9] Zhe Gan, Chunyuan Li, Ricardo Henao, David Carlson, and Lawrence Carin. Deep temporal sigmoid belief networks for sequence modeling. In NIPS, 2015.
[10] Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. DRAW: A recurrent neural network for image generation. In ICML, 2015.
[11] James Hensman, Magnus Rattray, and Neil D Lawrence. Fast variational inference in the conjugate exponential family. In NIPS, 2012.
[12] Geoffrey E Hinton, Peter Dayan, Brendan J Frey, and Radford M Neal. The "wake-sleep" algorithm for unsupervised neural networks. Science, 268(5214):1158–1161, 1995.
[13] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[14] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.
[15] Mohammad E Khan. Decoupled variational Gaussian inference. In NIPS, 2014.
[16] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In NIPS, 2014.
[17] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[18] James Martens. Deep learning via Hessian-free optimization. In ICML, 2010.
[19] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In ICML, 2014.
[20] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[21] Jiquan Ngiam, Adam Coates, Ahbik Lahiri, Bobby Prochnow, Quoc V Le, and Andrew Y Ng.
On optimization methods for deep learning. In ICML, 2011.
[22] Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013.
[23] Barak A Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–160, 1994.
[24] Robert Price. A useful theorem for nonlinear devices having Gaussian inputs. IRE Transactions on Information Theory, 4(2):69–72, 1958.
[25] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
[26] Tim Salimans. Markov chain Monte Carlo and variational inference: Bridging the gap. In ICML, 2015.
[27] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
[28] Michalis Titsias and Miguel Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In ICML, 2014.