{"title": "Operator Variational Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 496, "page_last": 504, "abstract": "Variational inference is an umbrella term for algorithms which cast Bayesian inference as optimization. Classically, variational inference uses the Kullback-Leibler divergence to define the optimization. Though this divergence has been widely used, the resultant posterior approximation can suffer from undesirable statistical properties. To address this, we reexamine variational inference from its roots as an optimization problem. We use operators, or functions of functions, to design variational objectives. As one example, we design a variational objective with a Langevin-Stein operator. We develop a black box algorithm, operator variational inference (OPVI), for optimizing any operator objective. Importantly, operators enable us to make explicit the statistical and computational tradeoffs for variational inference. We can characterize different properties of variational objectives, such as objectives that admit data subsampling---allowing inference to scale to massive data---as well as objectives that admit variational programs---a rich class of posterior approximations that does not require a tractable density. We illustrate the benefits of OPVI on a mixture model and a generative model of images.", "full_text": "Operator Variational Inference\n\nRajesh Ranganath\nPrinceton University\n\nJaan Altosaar\n\nPrinceton University\n\nDustin Tran\n\nColumbia University\n\nDavid M. Blei\n\nColumbia University\n\nAbstract\n\nVariational inference is an umbrella term for algorithms which cast Bayesian infer-\nence as optimization. Classically, variational inference uses the Kullback-Leibler\ndivergence to de\ufb01ne the optimization. Though this divergence has been widely\nused, the resultant posterior approximation can su\ufb00er from undesirable statistical\nproperties. 
To address this, we reexamine variational inference from its roots as\nan optimization problem. We use operators, or functions of functions, to design\nvariational objectives. As one example, we design a variational objective with a\nLangevin-Stein operator. We develop a black box algorithm, operator variational\ninference (OPVI), for optimizing any operator objective. Importantly, operators enable\nus to make explicit the statistical and computational tradeoffs for variational\ninference. We can characterize different properties of variational objectives, such\nas objectives that admit data subsampling\u2014allowing inference to scale to massive\ndata\u2014as well as objectives that admit variational programs\u2014a rich class of posterior\napproximations that does not require a tractable density. We illustrate the\nbenefits of OPVI on a mixture model and a generative model of images.\n\n1 Introduction\n\nVariational inference is an umbrella term for algorithms that cast Bayesian inference as optimization\n[10]. Originally developed in the 1990s, recent advances in variational inference have scaled\nBayesian computation to massive data [7], provided black box strategies for generic inference in\nmany models [19], and enabled more accurate approximations of a model\u2019s posterior without sacrificing\nefficiency [21, 20]. These innovations have both scaled Bayesian analysis and removed the\nanalytic burdens that have traditionally taxed its practice.\nGiven a model of latent and observed variables $p(x, z)$, variational inference posits a family of distributions\nover its latent variables and then finds the member of that family closest to the posterior,\n$p(z \\mid x)$. This is typically formalized as minimizing a Kullback-Leibler (KL) divergence from the\napproximating family $q(\\cdot)$ to the posterior $p(\\cdot)$. 
However, while the $\\mathrm{KL}(q \\| p)$ objective offers many\nbeneficial computational properties, it is ultimately designed for convenience; it sacrifices many desirable\nstatistical properties of the resultant approximation.\nWhen optimizing KL, there are two issues with the posterior approximation that we highlight. First,\nit typically underestimates the variance of the posterior. Second, it can result in degenerate solutions\nthat zero out the probability of certain configurations of the latent variables. While both of these issues\ncan be partially circumvented by using more expressive approximating families, they ultimately\nstem from the choice of the objective. Under the KL divergence, we pay a large price when $q(\\cdot)$ is\nbig where $p(\\cdot)$ is tiny; this price becomes infinite when $q(\\cdot)$ has larger support than $p(\\cdot)$.\nIn this paper, we revisit variational inference from its core principle as an optimization problem. We\nuse operators\u2014mappings from functions to functions\u2014to design variational objectives, explicitly\ntrading off computational properties of the optimization with statistical properties of the approximation.\nWe use operators to formalize the basic properties needed for variational inference algorithms.\nWe further outline how to use them to define new variational objectives; as one example, we design\na variational objective using a Langevin-Stein operator.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nWe develop operator variational inference (OPVI), a black box algorithm that optimizes any operator\nobjective. In the context of OPVI, we show that the Langevin-Stein objective enjoys two good properties.\nFirst, it is amenable to data subsampling, which allows inference to scale to massive data.\nSecond, it permits rich approximating families, called variational programs, which do not require\nanalytically tractable densities. 
This greatly expands the class of variational families and the fidelity\nof the resulting approximation. (We note that the traditional KL is not amenable to using variational\nprograms.) We study OPVI with the Langevin-Stein objective on a mixture model and a generative\nmodel of images.\nRelated Work. There are several threads of research in variational inference with alternative divergences.\nAn early example is expectation propagation (EP) [16]. EP promises approximate minimization\nof the inclusive KL divergence $\\mathrm{KL}(p \\| q)$ to find overdispersed approximations to the posterior.\nEP hinges on local minimization with respect to subsets of data and connects to work on $\\alpha$-divergence\nminimization [17, 6]. However, it does not have convergence guarantees and typically does not minimize\nKL or an $\\alpha$-divergence because it is not a global optimization method. We note that these\ndivergences can be written as operator variational objectives, but they do not satisfy the tractability\ncriteria and thus require further approximations. Li and Turner [14] present a variant of $\\alpha$-divergences\nthat satisfies the full requirements of OPVI. Score matching [9], a method for estimating models by\nmatching the score function of one distribution to another that can be sampled, also falls into the\nclass of objectives we develop.\nHere we show how to construct new objectives, including some not yet studied. We make explicit the\nrequirements to construct objectives for variational inference. Finally, we discuss further properties\nthat make them amenable to both scalable and flexible variational inference.\n\n2 Operator Variational Objectives\n\nWe define operator variational objectives and the conditions needed for an objective to be useful\nfor variational inference. We develop a new objective, the Langevin-Stein objective, and show how\nto place the classical KL into this class. 
In the next section, we develop a general algorithm for\noptimizing operator variational objectives.\n\n2.1 Variational Objectives\nConsider a probabilistic model $p(x, z)$ of data $x$ and latent variables $z$. Given a data set $x$, approximate\nBayesian inference seeks to approximate the posterior distribution $p(z \\mid x)$, which is applied in\nall downstream tasks. Variational inference posits a family of approximating distributions $q(z)$ and\noptimizes a divergence function to find the member of the family closest to the posterior.\nThe divergence function is the variational objective, a function of both the posterior and the approximating\ndistribution. Useful variational objectives hinge on two properties: first, optimizing the\nfunction yields a good posterior approximation; second, the problem is tractable when the posterior\ndistribution is known up to a constant.\nThe classic construction that satisfies these properties is the evidence lower bound (ELBO),\n\n$$\\mathbb{E}_{q(z)}[\\log p(x, z) - \\log q(z)]. \\tag{1}$$\n\nIt is maximized when $q(z) = p(z \\mid x)$ and it only depends on the posterior distribution up to a\ntractable constant, $\\log p(x, z)$. The ELBO has been the focus in much of the classical literature. Maximizing\nthe ELBO is equivalent to minimizing the KL divergence to the posterior, and the expectations\nare analytic for a large class of models [4].\n\n2.2 Operator Variational Objectives\nWe define a new class of variational objectives, operator variational objectives. An operator objective\nhas three components. The first component is an operator $O^{p,q}$ that depends on $p(z \\mid x)$ and\n$q(z)$. (Recall that an operator maps functions to other functions.) The second component is a family\nof test functions $\\mathcal{F}$, where each $f(z) \\in \\mathcal{F}$ maps realizations of the latent variables to real vectors\n$\\mathbb{R}^d$. 
In the objective, the operator and a function will combine in an expectation $\\mathbb{E}_{q(z)}[(O^{p,q} f)(z)]$,\ndesigned such that values close to zero indicate that $q$ is close to $p$. The third component is a distance\nfunction $t(a) : \\mathbb{R} \\to [0, \\infty)$, which is applied to the expectation so that the objective is non-negative.\n(Our example uses the square function $t(a) = a^2$.)\nThese three components combine to form the operator variational objective. It is a non-negative\nfunction of the variational distribution,\n\n$$\\mathcal{L}(q; O^{p,q}, \\mathcal{F}, t) = \\sup_{f \\in \\mathcal{F}} t(\\mathbb{E}_{q(z)}[(O^{p,q} f)(z)]). \\tag{2}$$\n\nIntuitively, it is the worst-case expected value among all test functions $f \\in \\mathcal{F}$. Operator variational\ninference seeks to minimize this objective with respect to the variational family $q \\in \\mathcal{Q}$.\nWe use operator objectives for posterior inference. This requires two conditions on the operator and\nfunction family.\n1. Closeness. The minimum of the variational objective is at the posterior, $q(z) = p(z \\mid x)$. We\nmeet this condition by requiring that $\\mathbb{E}_{p(z \\mid x)}[(O^{p,p} f)(z)] = 0$ for all $f \\in \\mathcal{F}$. Thus, optimizing\nthe objective will produce $p(z \\mid x)$ if it is the only member of $\\mathcal{Q}$ with zero expectation (otherwise\nit will produce a distribution in the equivalence class: $q \\in \\mathcal{Q}$ with zero expectation). In practice,\nthe minimum will be the closest member of $\\mathcal{Q}$ to $p(z \\mid x)$.\n2. Tractability. We can calculate the variational objective up to a constant without involving the\nexact posterior $p(z \\mid x)$. In other words, we do not require calculating the normalizing constant of\nthe posterior, which is typically intractable. We meet this condition by requiring that the operator\n$O^{p,q}$\u2014originally in terms of $p(z \\mid x)$ and $q(z)$\u2014can be written in terms of $p(x, z)$ and $q(z)$.\nTractability also imposes conditions on $\\mathcal{F}$: it must be feasible to find the supremum. 
Below, we\nsatisfy this by defining a parametric family for $\\mathcal{F}$ that is amenable to stochastic optimization.\n\nEquation 2 and the two conditions provide a mechanism to design meaningful variational objectives\nfor posterior inference. Operator variational objectives try to match expectations with respect to $q(z)$\nto those with respect to $p(z \\mid x)$.\n\n2.3 Understanding Operator Variational Objectives\nConsider operators where $\\mathbb{E}_{q(z)}[(O^{p,q} f)(z)]$ only takes positive values. In this case, distance to zero\ncan be measured with the identity $t(a) = a$, so tractability implies the operator need only be known\nup to a constant. This family includes tractable forms of familiar divergences like the KL divergence\n(ELBO), R\u00e9nyi\u2019s $\\alpha$-divergence [14], and the $\\chi$-divergence [18].\nWhen the expectation can take positive or negative values, operator variational objectives are closely\nrelated to Stein divergences [2]. Consider a family of scalar test functions $\\mathcal{F}^*$ that have expectation\nzero with respect to the posterior, $\\mathbb{E}_{p(z \\mid x)}[f^*(z)] = 0$. Using this family, a Stein divergence is\n\n$$D_{\\mathrm{Stein}}(p, q) = \\sup_{f^* \\in \\mathcal{F}^*} |\\mathbb{E}_{q(z)}[f^*(z)] - \\mathbb{E}_{p(z \\mid x)}[f^*(z)]|.$$\n\nNow recall the operator objective of Equation 2. The closeness condition implies that\n\n$$\\mathcal{L}(q; O^{p,q}, \\mathcal{F}, t) = \\sup_{f \\in \\mathcal{F}} t(\\mathbb{E}_{q(z)}[(O^{p,q} f)(z)] - \\mathbb{E}_{p(z \\mid x)}[(O^{p,p} f)(z)]).$$\n\nIn other words, operators with positive or negative expectations lead to Stein divergences with a more\ngeneralized notion of distance.\n\n2.4 Langevin-Stein Operator Variational Objective\n\nWe developed the operator variational objective. It is a class of tractable objectives, each of which\ncan be optimized to yield an approximation to the posterior. An operator variational objective is\nbuilt from an operator, function class, and distance function to zero. 
We now use this construction\nto design a new type of variational objective.\nAn operator objective involves a class of functions that has known expectations with respect to an\nintractable distribution. There are many ways to construct such classes [1, 2]. Here, we construct an\noperator objective from the generator Stein\u2019s method applied to the Langevin diffusion.\nLet $\\nabla^\\top f$ denote the divergence of a vector-valued function $f$, that is, the sum of its individual gradients.\nApplying the generator method of Barbour [2] to Langevin diffusion gives the operator\n\n$$(O^{p}_{\\mathrm{LS}} f)(z) = \\nabla_z \\log p(x, z)^\\top f(z) + \\nabla^\\top f. \\tag{3}$$\n\nWe call this the Langevin-Stein (LS) operator. We obtain the corresponding variational objective by\nusing the squared distance function and substituting Equation 3 into Equation 2,\n\n$$\\mathcal{L}(q; O^{p}_{\\mathrm{LS}}, \\mathcal{F}) = \\sup_{f \\in \\mathcal{F}} (\\mathbb{E}_{q}[\\nabla_z \\log p(x, z)^\\top f(z) + \\nabla^\\top f])^2. \\tag{4}$$\n\nThe LS operator satisfies both conditions. First, it satisfies closeness because it has expectation zero\nunder the posterior (Appendix A) and its unique minimizer is the posterior (Appendix B). Second, it\nis tractable because it requires only the joint distribution. The functions $f$ will also be a parametric\nfamily, which we detail later.\nAdditionally, while the KL divergence finds variational distributions that underestimate the variance,\nthe LS objective does not suffer from that pathology. The reason is that KL is infinite when the support\nof $q$ is larger than $p$; here this is not the case.\nWe provided one example of a variational objective using operators, which is specific to continuous\nvariables. In general, operator objectives are not limited to continuous variables; Appendix C\ndescribes an operator for discrete variables.\n\n2.5 The KL Divergence as an Operator Variational Objective\nFinally, we demonstrate how classical variational methods fall inside the operator family. 
For example,\ntraditional variational inference minimizes the KL divergence from an approximating family to\nthe posterior [10]. This can be construed as an operator variational objective,\n\n$$(O^{p,q}_{\\mathrm{KL}} f)(z) = \\log q(z) - \\log p(z \\mid x) \\quad \\forall f \\in \\mathcal{F}. \\tag{5}$$\n\nThis operator does not use the family of functions\u2014it trivially maps all functions $f$ to the same\nfunction. Further, because KL is non-negative, we use the identity distance $t(a) = a$.\nThe operator satisfies both conditions. It satisfies closeness because $\\mathrm{KL}(p \\| p) = 0$. It satisfies\ntractability because it can be computed up to a constant when used in the operator objective of Equation 2.\nTractability comes from the fact that $\\log p(z \\mid x) = \\log p(z, x) - \\log p(x)$.\n\n3 Operator Variational Inference\n\nWe described operator variational objectives, a broad class of objectives for variational inference. We\nnow examine how they can be optimized. We develop a black box algorithm [27, 19] based on Monte\nCarlo estimation and stochastic optimization. Our algorithm applies to a general class of models and\nany operator objective.\nMinimizing the operator objective involves two optimizations: minimizing the objective with respect\nto the approximating family $\\mathcal{Q}$ and maximizing the objective with respect to the function class $\\mathcal{F}$\n(which is part of the objective).\nWe index the family $\\mathcal{Q}$ with variational parameters $\\lambda$ and require that it satisfies properties typically\nassumed by black box methods [19]: the variational distribution $q(z; \\lambda)$ has a known and tractable\ndensity; we can sample from $q(z; \\lambda)$; and we can tractably compute the score function $\\nabla_\\lambda \\log q(z; \\lambda)$.\nWe index the function class $\\mathcal{F}$ with parameters $\\theta$, and require that $f_\\theta(\\cdot)$ is differentiable. 
In the\nexperiments, we use neural networks, which are flexible enough to approximate a general family of\ntest functions [8].\nGiven parameterizations of the variational family and test family, operator variational inference\n(OPVI) seeks to solve a minimax problem,\n\n$$\\lambda^* = \\inf_\\lambda \\sup_\\theta t(\\mathbb{E}_\\lambda[(O^{p,q} f_\\theta)(z)]). \\tag{6}$$\n\nWe will use stochastic optimization [23, 13]. In principle, we can find stochastic gradients of $\\lambda$\nby rewriting the objective in terms of the optimized value of $\\theta$, $\\theta^*(\\lambda)$. In practice, however, we\nsimultaneously solve the maximization and minimization. Though computationally beneficial, this\nproduces saddle points. In our experiments we found it to be stable enough. We derive gradients for\nthe variational parameters $\\lambda$ and test function parameters $\\theta$. (We fix the distance function to be the\nsquare $t(a) = a^2$; the identity $t(a) = a$ also readily applies.)\n\nAlgorithm 1: Operator variational inference\nInput: Model $\\log p(x, z)$, variational approximation $q(z; \\lambda)$\nOutput: Variational parameters $\\lambda$\nInitialize $\\lambda$ and $\\theta$ randomly.\nwhile not converged do\n    Compute unbiased estimates of $\\nabla_\\lambda \\mathcal{L}_\\theta$ from Equation 7.\n    Compute unbiased estimates of $\\nabla_\\theta \\mathcal{L}_\\lambda$ from Equation 8.\n    Update $\\lambda$, $\\theta$ with unbiased stochastic gradients.\nend\n\nGradient with respect to $\\lambda$. For a fixed test function with parameters $\\theta$, denote the objective\n\n$$\\mathcal{L}_\\theta = t(\\mathbb{E}_\\lambda[(O^{p,q} f_\\theta)(z)]).$$\n\nThe gradient with respect to variational parameters $\\lambda$ is\n\n$$\\nabla_\\lambda \\mathcal{L}_\\theta = 2 \\mathbb{E}_\\lambda[(O^{p,q} f_\\theta)(z)] \\nabla_\\lambda \\mathbb{E}_\\lambda[(O^{p,q} f_\\theta)(z)].$$\n\nNow write the second expectation with the score function gradient [19]. This gradient is\n\n$$\\nabla_\\lambda \\mathcal{L}_\\theta = 2 \\mathbb{E}_\\lambda[(O^{p,q} f_\\theta)(z)] \\mathbb{E}_\\lambda[\\nabla_\\lambda \\log q(z; \\lambda) (O^{p,q} f_\\theta)(z) + \\nabla_\\lambda (O^{p,q} f_\\theta)(z)]. \\tag{7}$$\n\nEquation 7 lets us calculate unbiased stochastic gradients. We first generate two sets of independent\nsamples from $q$; we then form Monte Carlo estimates of the first and second expectations. For the\nsecond expectation, we can use the variance reduction techniques developed for black box variational\ninference, such as Rao-Blackwellization [19].\nWe described the score gradient because it is general. An alternative is to use the reparameterization\ngradient for the second expectation [11, 22]. It requires that the operator be differentiable with respect\nto $z$ and that samples from $q$ can be drawn as a transformation $r$ of a parameter-free noise source $\\epsilon$,\n$z = r(\\epsilon; \\lambda)$. In our experiments, we use the reparameterization gradient.\nGradient with respect to $\\theta$. Mirroring the notation above, the operator objective for fixed variational\n$\\lambda$ is\n\n$$\\mathcal{L}_\\lambda = t(\\mathbb{E}_\\lambda[(O^{p,q} f_\\theta)(z)]).$$\n\nThe gradient with respect to test function parameters $\\theta$ is\n\n$$\\nabla_\\theta \\mathcal{L}_\\lambda = 2 \\mathbb{E}_\\lambda[(O^{p,q} f_\\theta)(z)] \\mathbb{E}_\\lambda[\\nabla_\\theta (O^{p,q} f_\\theta)(z)]. \\tag{8}$$\n\nAgain, we can construct unbiased stochastic gradients with two sets of Monte Carlo estimates. Note\nthat gradients for the test function do not require score gradients (or reparameterization gradients)\nbecause the expectation does not depend on $\\theta$.\nAlgorithm. Algorithm 1 outlines OPVI. We simultaneously minimize the variational objective with\nrespect to the variational family $q_\\lambda$ while maximizing it with respect to the function class $f_\\theta$. 
Given\na model, operator, and function class parameterization, we can use automatic differentiation to calculate\nthe necessary gradients [3]. Provided the operator does not require model-specific computation,\nthis algorithm satisfies the black box criteria.\n\n3.1 Data Subsampling and OPVI\n\nWith stochastic optimization, data subsampling scales up traditional variational inference to massive\ndata [7, 26]. The idea is to calculate noisy gradients by repeatedly subsampling from the data set,\nwithout needing to pass through the entire data set for each gradient.\nAs an illustration, consider hierarchical models. Hierarchical models consist of global latent variables\n$\\beta$ that are shared across data points and local latent variables $z_i$, each of which is associated with\na data point $x_i$. The model\u2019s log joint density is\n\n$$\\log p(x_{1:n}, z_{1:n}, \\beta) = \\log p(\\beta) + \\sum_{i=1}^{n} \\left[ \\log p(x_i \\mid z_i, \\beta) + \\log p(z_i \\mid \\beta) \\right].$$\n\nHoffman et al. [7] calculate unbiased estimates of the log joint density (and its gradient) by subsampling\ndata and appropriately scaling the sum.\nWe can characterize whether OPVI with a particular operator supports data subsampling. OPVI relies\non evaluating the operator and its gradient at different realizations of the latent variables (Equation 7\nand Equation 8). Thus we can subsample data to calculate estimates of the operator when it derives\nfrom linear operators of the log density, such as differentiation and the identity. This follows as a\nlinear operator of sums is a sum of linear operators, so the gradients in Equation 7 and Equation 8\ndecompose into a sum. The Langevin-Stein and KL operators are both linear in the log density; both\nsupport data subsampling.\n\n3.2 Variational Programs\n\nGiven an operator and variational family, Algorithm 1 optimizes the corresponding operator objective.\nCertain operators require the density of $q$. 
For example, the KL operator (Equation 5) requires\nits log density. This potentially limits the construction of rich variational approximations for which\nthe density of $q$ is difficult to compute.$^1$\nSome operators, however, do not depend on having an analytic density; the Langevin-Stein (LS) operator\n(Equation 3) is an example. These operators can be used with a much richer class of variational\napproximations, those that can be sampled from but might not have analytically tractable densities.\nWe call such approximating families variational programs.\nInference with a variational program requires the family to be reparameterizable [11, 22]. (Otherwise\nwe need to use the score function, which requires the derivative of the density.) A reparameterizable\nvariational program consists of a parametric deterministic transformation $R$ of random noise $\\epsilon$.\nFormally, let\n\n$$\\epsilon \\sim \\mathrm{Normal}(0, 1), \\qquad z = R(\\epsilon; \\lambda). \\tag{9}$$\n\nThis generates samples for $z$, is differentiable with respect to $\\lambda$, and its density may be intractable. For\noperators that do not require the density of $q$, it can be used as a powerful variational approximation.\nThis is in contrast to the standard Kullback-Leibler (KL) operator.\nAs an example, consider the following variational program for a one-dimensional random variable.\nLet $\\lambda_i$ denote the $i$th dimension of $\\lambda$ and make the corresponding definition for $\\epsilon$:\n\n$$z = (\\epsilon_3 > 0) R(\\epsilon_1; \\lambda_1) - (\\epsilon_3 \\leq 0) R(\\epsilon_2; \\lambda_2). \\tag{10}$$\n\nWhen $R$ outputs positive values, this separates the parametrization of the density to the positive\nand negative halves of the reals; its density is generally intractable. In Section 4, we will use this\ndistribution as a variational approximation.\nEquation 9 contains many densities when the function class $R$ can approximate arbitrary continuous\nfunctions. We state it formally.\nTheorem 1. 
Consider a posterior distribution $p(z \\mid x)$ with a finite number of latent variables and\ncontinuous quantile function. Assume the operator variational objective has a unique root at the\nposterior $p(z \\mid x)$ and that $R$ can approximate continuous functions. Then there exists a sequence\nof parameters $\\lambda_1, \\lambda_2, \\ldots$ in the variational program, such that the operator variational objective\nconverges to 0, and thus $q$ converges in distribution to $p(z \\mid x)$.\nThis theorem says that we can use variational programs with an appropriate $q$-independent operator\nto approximate continuous distributions. The proof is in Appendix D.\n\n$^1$It is possible to construct rich approximating families with $\\mathrm{KL}(q \\| p)$, but this requires the introduction of an auxiliary distribution [15].\n\n4 Empirical Study\n\nWe evaluate operator variational inference on a mixture of Gaussians, comparing different choices\nin the objective. We then study logistic factor analysis for images.\n\n4.1 Mixture of Gaussians\n\nConsider a one-dimensional mixture of Gaussians as the posterior of interest,\n$p(z) = \\frac{1}{2} \\mathrm{Normal}(z; -3, 1) + \\frac{1}{2} \\mathrm{Normal}(z; 3, 1)$. The posterior contains multiple modes.\nWe seek to approximate it with three variational objectives: Kullback-Leibler (KL) with a Gaussian\napproximating family, Langevin-Stein (LS) with a Gaussian approximating family, and LS with a\nvariational program.\n\nFigure 1: The true posterior is a mixture of two Gaussians, in green. We approximate it with a\nGaussian using two operators (in blue). The density on the far right is a variational program given\nin Equation 10 and using the Langevin-Stein operator; it approximates the truth well. The density\nof the variational program is intractable. We plot a histogram of its samples and compare this to the\nhistogram of the true posterior.\n\nFigure 1 displays the posterior approximations. 
We find that the KL divergence and LS divergence\nchoose a single mode and have slightly different variances. These operators do not produce good\nresults because a single Gaussian is a poor approximation to the mixture. The remaining distribution\nin Figure 1 comes from the toy variational program described by Equation 10 with the LS operator.\nBecause this program captures different distributions for the positive and negative half of the real\nline, it is able to capture the posterior.\nIn general, the choice of an objective balances statistical and computational properties of variational\ninference. We highlight one tradeoff: the LS objective admits the use of a variational program;\nhowever, the objective is more difficult to optimize than the KL.\n\n4.2 Logistic Factor Analysis\n\nLogistic factor analysis models binary vectors $x_i$ with a matrix of parameters $W$ and biases $b$,\n\n$$z_i \\sim \\mathrm{Normal}(0, 1)$$\n$$x_{i,k} \\sim \\mathrm{Bernoulli}(\\sigma(w_k^\\top z_i + b_k)),$$\n\nwhere $z_i$ has fixed dimension $K$ and $\\sigma$ is the sigmoid function. This model captures correlations of\nthe entries in $x_i$ through $W$.\nWe apply logistic factor analysis to analyze the binarized MNIST data set [24], which contains 28x28\nbinary pixel images of handwritten digits. (We set the latent dimensionality to 10.) We fix the model\nparameters to those learned with variational expectation-maximization using the KL divergence, and\nfocus on comparing posterior inferences.\nWe compare the KL operator to the LS operator and study two choices of variational models: a fully\nfactorized Gaussian distribution and a variational program. The variational program generates samples\nby transforming a $K$-dimensional standard normal input with a two-layer neural network, using\nrectified linear activation functions and a hidden size of twice the latent dimensionality. 
Formally, the variational program we use generates samples of $z$ as follows:\n\n$$z_0 \\sim \\mathrm{Normal}(0, I)$$\n$$h_0 = \\mathrm{ReLU}({W^q_0}^\\top z_0 + b^q_0)$$\n$$h_1 = \\mathrm{ReLU}({W^q_1}^\\top h_0 + b^q_1)$$\n$$z = {W^q_2}^\\top h_1 + b^q_2.$$\n\nThe variational parameters are the weights $W^q$ and biases $b^q$. For $f$, we use a three-layer neural network\nwith the same hidden size as the variational program and hyperbolic tangent activations where\nunit activations were bounded to have norm two. Bounding the unit norm bounds the divergence.\nWe used the Adam optimizer [12] with learning rates $2 \\times 10^{-4}$ for $f$ and $2 \\times 10^{-5}$ for the variational\napproximation.\nThere is no standard for evaluating generative models and their inference algorithms [25]. Following\nRezende et al. [22], we consider a missing data problem. We remove half of the pixels in the test set\n(at random) and reconstruct them from a fitted posterior predictive distribution. Table 1 summarizes\nthe results on 100 test images; we report the log-likelihood of the completed image. LS with the\nvariational program performs best. It is followed by KL and the simpler LS inference. The LS performs\nbetter than KL even though the model parameters were learned with KL.\n\n[Figure 1 panels, left to right: KL, Langevin-Stein, Variational Program, each overlaid on the truth; x-axis: value of latent variable $z$, ticks at -5, 0, 5.]\n\nInference method              Completed data log-likelihood\nMean-field Gaussian + KL      -59.3\nMean-field Gaussian + LS      -75.3\nVariational Program + LS      -58.9\n\nTable 1: Benchmarks on logistic factor analysis for binarized MNIST. The same variational approximation\nwith LS performs worse than KL on likelihood performance. The variational program with\nLS performs better without directly optimizing for likelihoods.\n\n5 Summary\n\nWe present operator variational objectives, a broad yet tractable class of optimization problems for\napproximating posterior distributions. 
Operator objectives are built from an operator, a family of\ntest functions, and a distance function. We outline the connection between operator objectives and\nexisting divergences such as the KL divergence, and develop a new variational objective using the\nLangevin-Stein operator. In general, operator objectives produce new ways of posing variational\ninference.\nGiven an operator objective, we develop a black box algorithm for optimizing it and show which\noperators allow scalable optimization through data subsampling. Further, unlike the popular evidence\nlower bound, not all operators explicitly depend on the approximating density. This permits flexible\napproximating families, called variational programs, where the distributional form is not tractable.\nWe demonstrate this approach on a mixture model and a factor model of images.\nThere are several possible avenues for future directions such as developing new variational objectives,\nadversarially learning [5] model parameters with operators, and learning model parameters with\noperator variational objectives.\nAcknowledgments. This work is supported by NSF IIS-1247664, ONR N00014-11-1-0651,\nDARPA FA8750-14-2-0009, DARPA N66001-15-C-4032, Adobe, NSERC PGS-D, Porter Ogden\nJacobus Fellowship, Seibel Foundation, and the Sloan Foundation. The authors would like to thank\nDawen Liang, Ben Poole, Stephan Mandt, Kevin Murphy, Christian Naesseth, and the anonymous\nreviewers for their helpful feedback and comments.\n\nReferences\n[1] Assaraf, R. and Caffarel, M. (1999). Zero-variance principle for Monte Carlo algorithms. In Phys. Rev. Let.\n[2] Barbour, A. D. (1988). Stein\u2019s method and Poisson process convergence. Journal of Applied Probability.\n[3] Carpenter, B., Hoffman, M. D., Brubaker, M., Lee, D., Li, P., and Betancourt, M. (2015). The Stan Math Library: Reverse-mode automatic differentiation in C++. arXiv preprint arXiv:1509.07164.\n[4] Ghahramani, Z. and Beal, M. (2001). 
Propagation algorithms for variational Bayesian learning. In NIPS 13, pages 507\u2013513.\n[5] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Neural Information Processing Systems.\n[6] Hern\u00e1ndez-Lobato, J. M., Li, Y., Rowland, M., Hern\u00e1ndez-Lobato, D., Bui, T., and Turner, R. E. (2015). Black-box \u03b1-divergence Minimization. arXiv.org.\n[7] Hoffman, M., Blei, D., Wang, C., and Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research, 14:1303\u20131347.\n[8] Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359\u2013366.\n[9] Hyv\u00e4rinen, A. (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695\u2013709.\n[10] Jordan, M., Ghahramani, Z., Jaakkola, T., and Saul, L. (1999). Introduction to variational methods for graphical models. Machine Learning, 37:183\u2013233.\n[11] Kingma, D. and Welling, M. (2014). Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR).\n[12] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. CoRR, abs/1412.6980.\n[13] Kushner, H. and Yin, G. (1997). Stochastic Approximation Algorithms and Applications. Springer New York.\n[14] Li, Y. and Turner, R. E. (2016). R\u00e9nyi divergence variational inference. arXiv preprint arXiv:1602.02311.\n[15] Maal\u00f8e, L., S\u00f8nderby, C. K., S\u00f8nderby, S. K., and Winther, O. (2016). Auxiliary deep generative models. arXiv preprint arXiv:1602.05473.\n[16] Minka, T. P. (2001). Expectation propagation for approximate Bayesian inference. In UAI.\n[17] Minka, T. P. (2004). Power EP. Technical report, Microsoft Research, Cambridge.\n[18] Nielsen, F. and Nock, R. (2013). 
On the chi square and higher-order chi distances for approximating f-divergences. arXiv preprint arXiv:1309.3029.\n[19] Ranganath, R., Gerrish, S., and Blei, D. (2014). Black Box Variational Inference. In AISTATS.\n[20] Ranganath, R., Tran, D., and Blei, D. M. (2016). Hierarchical variational models. In International Conference on Machine Learning.\n[21] Rezende, D. J. and Mohamed, S. (2015). Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15).\n[22] Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning.\n[23] Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400\u2013407.\n[24] Salakhutdinov, R. and Murray, I. (2008). On the quantitative analysis of deep belief networks. In International Conference on Machine Learning.\n[25] Theis, L., van den Oord, A., and Bethge, M. (2016). A note on the evaluation of generative models. In International Conference on Learning Representations.\n[26] Titsias, M. and L\u00e1zaro-Gredilla, M. (2014). Doubly stochastic variational Bayes for non-conjugate inference. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1971\u20131979.\n[27] Wingate, D. and Weber, T. (2013). Automated variational inference in probabilistic programming. ArXiv e-prints.", "award": [], "sourceid": 274, "authors": [{"given_name": "Rajesh", "family_name": "Ranganath", "institution": "Princeton University"}, {"given_name": "Dustin", "family_name": "Tran", "institution": "Columbia University"}, {"given_name": "Jaan", "family_name": "Altosaar", "institution": "Princeton University"}, {"given_name": "David", "family_name": "Blei", "institution": "Columbia University"}]}