{"title": "Automatic Variational Inference in Stan", "book": "Advances in Neural Information Processing Systems", "page_first": 568, "page_last": 576, "abstract": "Variational inference is a scalable technique for approximate Bayesian inference. Deriving variational inference algorithms requires tedious model-specific calculations; this makes it difficult for non-experts to use. We propose an automatic variational inference algorithm, automatic differentiation variational inference (ADVI); we implement it in Stan (code available), a probabilistic programming system. In ADVI the user provides a Bayesian model and a dataset, nothing else. We make no conjugacy assumptions and support a broad class of models. The algorithm automatically determines an appropriate variational family and optimizes the variational objective. We compare ADVI to MCMC sampling across hierarchical generalized linear models, nonconjugate matrix factorization, and a mixture model. We train the mixture model on a quarter million images. With ADVI we can use variational inference on any model we write in Stan.", "full_text": "Automatic Variational Inference in Stan\n\nAlp Kucukelbir\nColumbia University\nalp@cs.columbia.edu\n\nRajesh Ranganath\nPrinceton University\nrajeshr@cs.princeton.edu\n\nAndrew Gelman\nColumbia University\ngelman@stat.columbia.edu\n\nDavid M. Blei\nColumbia University\ndavid.blei@columbia.edu\n\nAbstract\n\nVariational inference is a scalable technique for approximate Bayesian inference. Deriving variational inference algorithms requires tedious model-specific calculations; this makes it difficult for non-experts to use. We propose an automatic variational inference algorithm, automatic differentiation variational inference (ADVI); we implement it in Stan (code available), a probabilistic programming system. In ADVI the user provides a Bayesian model and a dataset, nothing else. We make no conjugacy assumptions and support a broad class of models. 
The algorithm automatically determines an appropriate variational family and optimizes the variational objective. We compare ADVI to MCMC sampling across hierarchical generalized linear models, nonconjugate matrix factorization, and a mixture model. We train the mixture model on a quarter million images. With ADVI we can use variational inference on any model we write in Stan.\n\n1 Introduction\n\nBayesian inference is a powerful framework for analyzing data. We design a model for data using latent variables; we then analyze data by calculating the posterior density of the latent variables. For machine learning models, calculating the posterior is often difficult; we resort to approximation.\nVariational inference (VI) approximates the posterior with a simpler distribution [1, 2]. We search over a family of simple distributions and find the member closest to the posterior. This turns approximate inference into optimization. VI has had a tremendous impact on machine learning; it is typically faster than Markov chain Monte Carlo (MCMC) sampling (as we show here too) and has recently scaled up to massive data [3].\nUnfortunately, VI algorithms are difficult to derive. We must first define the family of approximating distributions, and then calculate model-specific quantities relative to that family to solve the variational optimization problem. Both steps require expert knowledge. The resulting algorithm is tied to both the model and the chosen approximation.\nIn this paper we develop a method for automating variational inference, automatic differentiation variational inference (ADVI). Given any model from a wide class (specifically, probability models differentiable with respect to their latent variables), ADVI determines an appropriate variational family and an algorithm for optimizing the corresponding variational objective. We implement ADVI in Stan [4], a flexible probabilistic programming system. 
Stan provides a high-level language to define probabilistic models (e.g., Figure 2) as well as a model compiler, a library of transformations, and an efficient automatic differentiation toolbox. With ADVI we can now use variational inference on any model we write in Stan.1 (See Appendices F to J.)\n\n1 ADVI is available in Stan 2.8. See Appendix C.\n\n[Figure 1 appears here. Both panels plot average log predictive against seconds on a log scale: (a) ADVI versus NUTS [5] on a subset of 1000 images; (b) ADVI with minibatch sizes B = 50, 100, 500, and 1000 on the full dataset of 250 000 images.]\n\nFigure 1: Held-out predictive accuracy results | Gaussian mixture model (GMM) of the imageCLEF image histogram dataset. (a) ADVI outperforms the no-U-turn sampler (NUTS), the default sampling method in Stan [5]. (b) ADVI scales to large datasets by subsampling minibatches of size B from the dataset at each iteration [3]. We present more details in Section 3.3 and Appendix J.\n\nFigure 1 illustrates the advantages of our method. Consider a nonconjugate Gaussian mixture model for analyzing natural images; this is 40 lines in Stan (Figure 10). Figure 1a illustrates Bayesian inference on 1000 images. The y-axis is held-out likelihood, a measure of model fitness; the x-axis is time on a log scale. ADVI is orders of magnitude faster than NUTS, a state-of-the-art MCMC algorithm (and Stan's default inference technique) [5]. We also study nonconjugate factorization models and hierarchical generalized linear models in Section 3.\nFigure 1b illustrates Bayesian inference on 250 000 images, the size of data we more commonly find in machine learning. Here we use ADVI with stochastic variational inference [3], giving an approximate posterior in under two hours. For data like these, MCMC techniques cannot complete the analysis.\nRelated work. 
ADVI automates variational inference within the Stan probabilistic programming system [4]. This draws on two major themes.\nThe first is a body of work that aims to generalize VI. Kingma and Welling [6] and Rezende et al. [7] describe a reparameterization of the variational problem that simplifies optimization. Ranganath et al. [8] and Salimans and Knowles [9] propose a black-box technique, one that only requires the model and the gradient of the approximating family. Titsias and Lázaro-Gredilla [10] leverage the gradient of the joint density for a small class of models. Here we build on and extend these ideas to automate variational inference; we highlight technical connections as we develop the method.\nThe second theme is probabilistic programming. Wingate and Weber [11] study VI in general probabilistic programs, as supported by languages like Church [12], Venture [13], and Anglican [14]. Another probabilistic programming system is infer.NET, which implements variational message passing [15], an efficient algorithm for conditionally conjugate graphical models. Stan supports a more comprehensive class of nonconjugate models with differentiable latent variables; see Section 2.1.\n\n2 Automatic Differentiation Variational Inference\n\nAutomatic differentiation variational inference (ADVI) follows a straightforward recipe. First we transform the support of the latent variables to the real coordinate space. For example, the logarithm transforms a positive variable, such as a standard deviation, to the real line. Then we posit a Gaussian variational distribution to approximate the posterior. This induces a non-Gaussian approximation in the original variable space. Last we combine automatic differentiation with stochastic optimization to maximize the variational objective. We begin by defining the class of models we support.\n\n2.1 Differentiable Probability Models\n\nConsider a dataset X = x_{1:N} with N observations. 
Each x_n is a discrete or continuous random vector. The likelihood p(X | θ) relates the observations to a set of latent random variables θ. Bayesian analysis posits a prior density p(θ) on the latent variables. Combining the likelihood with the prior gives the joint density p(X, θ) = p(X | θ) p(θ).\n\ndata {\n  int N;    // number of observations\n  int x[N]; // discrete-valued observations\n}\nparameters {\n  // latent variable, must be positive\n  real<lower=0> theta;\n}\nmodel {\n  // non-conjugate prior for latent variable\n  theta ~ weibull(1.5, 1);\n  // likelihood\n  for (n in 1:N)\n    x[n] ~ poisson(theta);\n}\n\nFigure 2: Specifying a simple nonconjugate probability model in Stan. (The margin shows the corresponding graphical model: the latent variable θ, with Weibull hyperparameters 1.5 and 1, generates x_n for n = 1:N.)\n\nWe focus on approximate inference for differentiable probability models. These models have continuous latent variables θ. They also have a gradient of the log-joint with respect to the latent variables, ∇_θ log p(X, θ). The gradient is valid within the support of the prior, supp(p(θ)) = {θ | θ ∈ R^K and p(θ) > 0} ⊆ R^K, where K is the dimension of the latent variable space. We assume that the support of the posterior equals that of the prior. We make no assumptions about conjugacy, either full or conditional.2\nFor example, consider a model that contains a Poisson likelihood with unknown rate, p(x | θ). The observed variable x is discrete; the latent rate θ is continuous and positive. Place a Weibull prior on θ, defined over the positive real numbers. The resulting joint density describes a nonconjugate differentiable probability model. (See Figure 2.) Its partial derivative ∂/∂θ p(x, θ) is valid within the support of the Weibull distribution, supp(p(θ)) = R+ ⊂ R. 
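For concreteness, the log-joint of the Figure 2 model can be written out directly. The following is a minimal Python sketch (the function name is ours; this is not the code Stan generates):

```python
import math

def log_joint(theta, xs, k=1.5, lam=1.0):
    """log p(x, theta) = sum_n log Poisson(x_n | theta) + log Weibull(theta; k, lam)."""
    if theta <= 0.0:
        return float("-inf")  # outside supp(p(theta)) = R+
    # Poisson log-pmf: x * log(theta) - theta - log(x!)
    log_lik = sum(x * math.log(theta) - theta - math.lgamma(x + 1) for x in xs)
    # Weibull log-pdf with shape k and scale lam
    log_prior = (math.log(k / lam) + (k - 1.0) * math.log(theta / lam)
                 - (theta / lam) ** k)
    return log_lik + log_prior
```

The gradient ∂/∂θ log p(x, θ) exists everywhere on θ > 0, which is all the differentiability that this class of models requires.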
Because this model is nonconjugate, the posterior is not a Weibull distribution. This presents a challenge for classical variational inference. In Section 2.3, we will see how ADVI handles this model.\nMany machine learning models are differentiable. For example: linear and logistic regression, matrix factorization with continuous or discrete measurements, linear dynamical systems, and Gaussian processes. Mixture models, hidden Markov models, and topic models have discrete random variables. Marginalizing out these discrete variables renders these models differentiable. (We show an example in Section 3.3.) However, marginalization is not tractable for all models, such as the Ising model, sigmoid belief networks, and (untruncated) Bayesian nonparametric models.\n\n2.2 Variational Inference\n\nBayesian inference requires the posterior density p(θ | X), which describes how the latent variables vary when conditioned on a set of observations X. Many posterior densities are intractable because their normalization constants lack closed forms. Thus, we seek to approximate the posterior.\nConsider an approximating density q(θ; φ) parameterized by φ. We make no assumptions about its shape or support. We want to find the parameters of q(θ; φ) to best match the posterior according to some loss function. Variational inference (VI) minimizes the Kullback-Leibler (KL) divergence from the approximation to the posterior [2],\n\nφ* = arg min_φ KL(q(θ; φ) || p(θ | X)).  (1)\n\nTypically the KL divergence also lacks a closed form. Instead we maximize the evidence lower bound (ELBO), a proxy to the KL divergence,\n\nL(φ) = E_{q(θ)}[log p(X, θ)] − E_{q(θ)}[log q(θ; φ)].\n\nThe first term is an expectation of the joint density under the approximation, and the second is the entropy of the variational density. 
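The ELBO for the running Poisson-Weibull model can be estimated by simple Monte Carlo. Below is a hypothetical sketch (our own function; not part of ADVI, which instead optimizes the transformed objective of Section 2.5) that uses a log-normal q, so that supp(q) = R+ matches the posterior support:

```python
import math, random

def elbo_estimate(mu, sigma, xs, num_samples=2000, seed=0):
    """Monte Carlo estimate of L(phi) = E_q[log p(X, theta)] - E_q[log q(theta; phi)]
    for the Poisson-Weibull model, with q(theta; mu, sigma) a log-normal density."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)
        theta = math.exp(mu + sigma * z)  # a draw theta ~ q, always positive
        # log p(X, theta): Poisson likelihood plus Weibull(1.5, 1) prior
        log_p = (sum(x * math.log(theta) - theta - math.lgamma(x + 1) for x in xs)
                 + math.log(1.5) + 0.5 * math.log(theta) - theta ** 1.5)
        # log q(theta; mu, sigma): log-normal density
        log_q = (-math.log(theta) - math.log(sigma)
                 - 0.5 * math.log(2 * math.pi)
                 - (math.log(theta) - mu) ** 2 / (2 * sigma ** 2))
        total += log_p - log_q
    return total / num_samples
```

A variational parameter setting closer to the posterior yields a larger ELBO, which is what the optimization in Eq. (1) exploits.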
Maximizing the ELBO minimizes the KL divergence [1, 16].\n\n2 The posterior of a fully conjugate model is in the same family as the prior; a conditionally conjugate model has this property within the complete conditionals of the model [3].\n\nThe minimization problem from Eq. (1) becomes\n\nφ* = arg max_φ L(φ)  such that  supp(q(θ; φ)) ⊆ supp(p(θ | X)).  (2)\n\nWe explicitly specify the support-matching constraint implied in the KL divergence.3 We highlight this constraint, as we do not specify the form of the variational approximation; thus q(θ; φ) must remain within the support of the posterior, which we assume equal to the support of the prior.\nWhy is VI difficult to automate? In classical variational inference, we typically design a conditionally conjugate model. Then the optimal approximating family matches the prior. This satisfies the support constraint by definition [16]. When we want to approximate models that are not conditionally conjugate, we carefully study the model and design custom approximations. These depend on the model and on the choice of the approximating density.\nOne way to automate VI is to use black-box variational inference [8, 9]. If we select a density whose support matches the posterior, then we can directly maximize the ELBO using Monte Carlo (MC) integration and stochastic optimization. Another strategy is to restrict the class of models and use a fixed variational approximation [10]. For instance, we may use a Gaussian density for inference in unrestrained differentiable probability models, i.e. where supp(p(θ)) = R^K.\nWe adopt a transformation-based approach. First we automatically transform the support of the latent variables in our model to the real coordinate space. Then we posit a Gaussian variational density. The transformation induces a non-Gaussian approximation in the original variable space and guarantees that it stays within the support of the posterior. 
Here is how it works.\n\n2.3 Automatic Transformation of Constrained Variables\n\nBegin by transforming the support of the latent variables θ such that they live in the real coordinate space R^K. Define a one-to-one differentiable function T : supp(p(θ)) → R^K and identify the transformed variables as ζ = T(θ). The transformed joint density g(X, ζ) is\n\ng(X, ζ) = p(X, T^{-1}(ζ)) |det J_{T^{-1}}(ζ)|,\n\nwhere p is the joint density in the original latent variable space, and J_{T^{-1}} is the Jacobian of the inverse of T. Transformations of continuous probability densities require a Jacobian; it accounts for how the transformation warps unit volumes [17]. (See Appendix D.)\nConsider again our running example. The rate θ lives in R+. The logarithm ζ = T(θ) = log(θ) transforms R+ to the real line R. Its Jacobian adjustment is the derivative of the inverse of the logarithm, |det J_{T^{-1}}(ζ)| = exp(ζ). The transformed density is\n\ng(x, ζ) = Poisson(x | exp(ζ)) Weibull(exp(ζ); 1.5, 1) exp(ζ).\n\nFigures 3a and 3b depict this transformation.\nAs we describe in the introduction, we implement our algorithm in Stan to enable generic inference. Stan implements a model compiler that automatically handles transformations. It works by applying a library of transformations and their corresponding Jacobians to the joint model density.4 This transforms the joint density of any differentiable probability model to the real coordinate space. Now we can choose a variational distribution independent from the model.\n\n2.4 Implicit Non-Gaussian Variational Approximation\n\nAfter the transformation, the latent variables ζ have support on R^K. 
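The change of variables from Section 2.3 can be verified numerically for the running example: integrating the transformed density g over ζ must give the same marginal as integrating the original joint over θ. A sketch (function names are ours), with the log-Jacobian of T^{-1} = exp reducing to the single additive term ζ:

```python
import math

def log_p(x, theta):
    """log p(x, theta): Poisson(x | theta) times Weibull(theta; 1.5, 1), theta > 0."""
    return (x * math.log(theta) - theta - math.lgamma(x + 1)
            + math.log(1.5) + 0.5 * math.log(theta) - theta ** 1.5)

def log_g(x, zeta):
    """log g(x, zeta) = log p(x, T^{-1}(zeta)) + log |det J_{T^{-1}}(zeta)|,
    where T = log, T^{-1} = exp, and log |det J_{T^{-1}}(zeta)| = zeta."""
    return log_p(x, math.exp(zeta)) + zeta
```

A midpoint-rule quadrature of exp(log_p) over θ and of exp(log_g) over ζ returns the same value, which is exactly what the Jacobian factor guarantees.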
We posit a diagonal (mean-field) Gaussian variational approximation\n\nq(ζ; φ) = N(ζ; μ, σ) = ∏_{k=1}^K N(ζ_k; μ_k, σ_k).\n\n3 If supp(q) ⊄ supp(p) then outside the support of p we have KL(q || p) = E_q[log q] − E_q[log p] = ∞.\n4 Stan provides transformations for upper and lower bounds, simplex and ordered vectors, and structured matrices such as covariance matrices and Cholesky factors [4].\n\n[Figure 3 appears here: densities of the prior, posterior, and approximation in (a) the latent variable space, (b) the real coordinate space, and (c) the standardized space, connected by the maps T, T^{-1}, S_{μ,ω}, and S^{-1}_{μ,ω}.]\n\nFigure 3: Transformations for ADVI. The purple line is the posterior. The green line is the approximation. (a) The latent variable space is R+. (a→b) T transforms the latent variable space to R. (b) The variational approximation is a Gaussian. (b→c) S_{μ,ω} absorbs the parameters of the Gaussian. (c) We maximize the ELBO in the standardized space, with a fixed standard Gaussian approximation.\n\nThe vector φ = (μ_1, …, μ_K, σ_1, …, σ_K) contains the mean and standard deviation of each Gaussian factor. This defines our variational approximation in the real coordinate space. (Figure 3b.)\nThe transformation T maps the support of the latent variables to the real coordinate space; its inverse T^{-1} maps back to the support of the latent variables. This implicitly defines the variational approximation in the original latent variable space as q(T(θ); φ) |det J_T(θ)|. The transformation ensures that the support of this approximation is always bounded by that of the true posterior in the original latent variable space (Figure 3a). Thus we can freely optimize the ELBO in the real coordinate space (Figure 3b) without worrying about the support matching constraint.\nThe ELBO in the real coordinate space is\n\nL(μ, σ) = E_{q(ζ)}[ log p(X, T^{-1}(ζ)) + log |det J_{T^{-1}}(ζ)| ] + (K/2)(1 + log(2π)) + ∑_{k=1}^K log σ_k,\n\nwhere we plug in the analytic form of the Gaussian entropy. (The derivation is in Appendix A.)\nWe choose a diagonal Gaussian for efficiency. This choice may call to mind the Laplace approximation technique, where a second-order Taylor expansion around the maximum-a-posteriori estimate gives a Gaussian approximation to the posterior. However, using a Gaussian variational approximation is not equivalent to the Laplace approximation [18]. The Laplace approximation relies on maximizing the probability density; it fails with densities that have discontinuities on the boundary of their support. The Gaussian variational approximation considers probability mass; it does not suffer this degeneracy. Furthermore, our approach is distinct in another way: because of the transformation, the posterior approximation in the original latent variable space (Figure 3a) is non-Gaussian.\n\n2.5 Automatic Differentiation for Stochastic Optimization\n\nWe now maximize the ELBO in real coordinate space,\n\nμ*, σ* = arg max_{μ,σ} L(μ, σ)  such that  σ ≻ 0.  (3)\n\nWe use gradient ascent to reach a local maximum of the ELBO. Unfortunately, we cannot apply automatic differentiation to the ELBO in this form. This is because the expectation defines an intractable integral that depends on μ and σ; we cannot directly represent it as a computer program. Moreover, the standard deviations in σ must remain positive. 
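The Gaussian entropy that we plug into the ELBO is analytic, and it is easy to sanity-check the closed form K/2 (1 + log 2π) + Σ_k log σ_k against a direct quadrature of −∫ q log q for one factor. A numerical sketch under our own naming:

```python
import math

def entropy_term(sigmas):
    """Analytic entropy of a diagonal Gaussian with standard deviations sigma_k:
    K/2 * (1 + log(2*pi)) + sum_k log(sigma_k)."""
    K = len(sigmas)
    return 0.5 * K * (1.0 + math.log(2 * math.pi)) + sum(math.log(s) for s in sigmas)

def entropy_quadrature(sigma, half_width=10.0, n=200000):
    """-integral of q log q for a single N(0, sigma^2) factor, by midpoint rule."""
    step = 2 * half_width * sigma / n
    total = 0.0
    for i in range(n):
        z = -half_width * sigma + (i + 0.5) * step
        log_q = (-0.5 * math.log(2 * math.pi) - math.log(sigma)
                 - z * z / (2 * sigma * sigma))
        total -= math.exp(log_q) * log_q * step
    return total
```

Because the entropy is available in closed form, only the first expectation in L(μ, σ) needs Monte Carlo estimation.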
Thus, we employ one final transformation: elliptical standardization5 [19], shown in Figures 3b and 3c.\nFirst re-parameterize the Gaussian distribution with the log of the standard deviation, ω = log(σ), applied element-wise. The support of ω is now the real coordinate space and σ is always positive. Then define the standardization η = S_{μ,ω}(ζ) = diag(exp(ω))^{-1} (ζ − μ). The standardization encapsulates the variational parameters and gives the fixed density\n\nq(η; 0, I) = N(η; 0, I) = ∏_{k=1}^K N(η_k; 0, 1).\n\n5 Also known as a “co-ordinate transformation” [7], an “invertible transformation” [10], and the “re-parameterization trick” [6].\n\nAlgorithm 1: Automatic differentiation variational inference (ADVI)\nInput: Dataset X = x_{1:N}, model p(X, θ).\nSet iteration counter i = 0 and choose a stepsize sequence ρ(i).\nInitialize μ(0) = 0 and ω(0) = 0.\nwhile change in ELBO is above some threshold do\n  Draw M samples η_m ∼ N(0, I) from the standard multivariate Gaussian.\n  Invert the standardization: ζ_m = diag(exp(ω(i))) η_m + μ(i).\n  Approximate ∇_μ L and ∇_ω L using MC integration (Eqs. (4) and (5)).\n  Update μ(i+1) ← μ(i) + ρ(i) ∇_μ L and ω(i+1) ← ω(i) + ρ(i) ∇_ω L.\n  Increment iteration counter.\nend\nReturn μ* ← μ(i) and ω* ← ω(i).\n\nThe standardization transforms the variational problem from Eq. (3) into\n\nμ*, ω* = arg max_{μ,ω} L(μ, ω) = arg max_{μ,ω} E_{N(η; 0, I)}[ log p(X, T^{-1}(S^{-1}_{μ,ω}(η))) + log |det J_{T^{-1}}(S^{-1}_{μ,ω}(η))| ] + ∑_{k=1}^K ω_k,\n\nwhere we drop constant terms from the calculation. This expectation is with respect to a standard Gaussian and the parameters μ and ω are both unconstrained (Figure 3c). 
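The key property of the standardization is that any expectation under q(ζ; μ, σ) can be rewritten as an expectation under the fixed density N(0, I), with the variational parameters moved into a deterministic map. A minimal sketch (our own helper, with σ = exp(ω)):

```python
import math, random

def expect_reparam(f, mu, omega, num_samples=100000, seed=1):
    """Estimate E_{q(zeta; mu, sigma)}[f(zeta)] with sigma = exp(omega):
    draw eta ~ N(0, 1) and invert the standardization,
    zeta = S^{-1}_{mu,omega}(eta) = mu + exp(omega) * eta.
    The parameters now enter only through the deterministic map, so gradients
    with respect to mu and omega can pass inside the expectation."""
    rng = random.Random(seed)
    sigma = math.exp(omega)
    return sum(f(mu + sigma * rng.gauss(0.0, 1.0))
               for _ in range(num_samples)) / num_samples
```

For instance, the estimated mean and variance of the pushed-forward samples recover μ and exp(ω)², confirming that the fixed standard Gaussian plus the map S^{-1}_{μ,ω} reproduces q.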
We push the gradient inside the expectations and apply the chain rule to get\n\n∇_μ L = E_{N(η)}[ ∇_θ log p(X, θ) ∇_ζ T^{-1}(ζ) + ∇_ζ log |det J_{T^{-1}}(ζ)| ],  (4)\n∇_{ω_k} L = E_{N(η_k)}[ ( ∇_{θ_k} log p(X, θ) ∇_{ζ_k} T^{-1}(ζ) + ∇_{ζ_k} log |det J_{T^{-1}}(ζ)| ) η_k exp(ω_k) ] + 1.  (5)\n\n(The derivations are in Appendix B.)\nWe can now compute the gradients inside the expectation with automatic differentiation. The only thing left is the expectation. MC integration provides a simple approximation: draw M samples from the standard Gaussian and evaluate the empirical mean of the gradients within the expectation [20]. This gives unbiased noisy gradients of the ELBO for any differentiable probability model. We can now use these gradients in a stochastic optimization routine to automate variational inference.\n\n2.6 Automatic Variational Inference\n\nEquipped with unbiased noisy gradients of the ELBO, ADVI implements stochastic gradient ascent (Algorithm 1). We ensure convergence by choosing a decreasing step-size sequence. In practice, we use an adaptive sequence [21] with finite memory. (See Appendix E for details.)\nADVI has complexity O(NMK) per iteration, where M is the number of MC samples (typically between 1 and 10). Coordinate ascent VI has complexity O(NK) per pass over the dataset. We scale ADVI to large datasets using stochastic optimization [3, 10]. The adjustment to Algorithm 1 is simple: sample a minibatch of size B ≪ N from the dataset and scale the likelihood of the sampled minibatch by N/B [3]. 
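The N/B rescaling is what keeps the stochastic objective unbiased: averaging the scaled minibatch log-likelihood over all minibatches recovers the full-data log-likelihood exactly. A small sketch with our own toy numbers:

```python
from itertools import combinations

def scaled_minibatch_loglik(loglik_terms, batch, N):
    """Scale a minibatch's log-likelihood by N / B, as in the stochastic extension."""
    B = len(batch)
    return (N / B) * sum(loglik_terms[n] for n in batch)

# Toy per-datapoint log-likelihood contributions (hypothetical values).
terms = [-1.2, -0.3, -2.5, -0.9, -1.7]
N, B = len(terms), 2

# Averaging the scaled estimate over every possible minibatch of size B
# gives back sum(terms), the full-data log-likelihood.
batches = list(combinations(range(N), B))
avg = sum(scaled_minibatch_loglik(terms, b, N) for b in batches) / len(batches)
```

Each datapoint appears in the same fraction B/N of minibatches, so the N/B factor cancels it in expectation; the per-iteration cost nonetheless drops from O(NMK) to O(BMK).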
The stochastic extension of ADVI has per-iteration complexity O(BMK).\n\n[Figure 4 appears here: average log predictive against seconds on a log scale for ADVI (M=1), ADVI (M=10), NUTS, and HMC.]\n\nFigure 4: Hierarchical generalized linear models. (a) Linear regression with ARD. (b) Hierarchical logistic regression. Comparison of ADVI to MCMC: held-out predictive likelihood as a function of wall time.\n\n3 Empirical Study\n\nWe now study ADVI across a variety of models. We compare its speed and accuracy to two Markov chain Monte Carlo (MCMC) sampling algorithms: Hamiltonian Monte Carlo (HMC) [22] and the no-U-turn sampler (NUTS)6 [5]. We assess convergence by tracking the ELBO. To place ADVI and MCMC on a common scale, we report predictive likelihood on held-out data as a function of time. We approximate the posterior predictive likelihood using an MC estimate. For MCMC, we plug in posterior samples. For ADVI, we draw samples from the posterior approximation during the optimization. We initialize ADVI with a draw from a standard Gaussian.\nWe explore two hierarchical regression models, two matrix factorization models, and a mixture model. All of these models have nonconjugate prior structures. We conclude by analyzing a dataset of 250 000 images, where we report results across a range of minibatch sizes B.\n\n3.1 A Comparison to MCMC Sampling: Hierarchical Regression Models\n\nWe begin with two nonconjugate regression models: linear regression with automatic relevance determination (ARD) [16] and hierarchical logistic regression [23].\nLinear Regression with ARD. This is a sparse linear regression model with a hierarchical prior structure. (Details in Appendix F.) We simulate a dataset with 250 regressors such that half of the regressors have no predictive power. 
We use 10 000 training samples and hold out 1000 for testing.\nLogistic Regression with Spatial Hierarchical Prior. This is a hierarchical logistic regression model from political science. The prior captures dependencies, such as states and regions, in a polling dataset from the United States 1988 presidential election [23]. (Details in Appendix G.) We train using 10 000 data points and withhold 1536 for evaluation. The regressors contain age, education, state, and region indicators. The dimension of the regression problem is 145.\nResults. Figure 4 plots average log predictive accuracy as a function of time. For these simple models, all methods reach the same predictive accuracy. We study ADVI with two settings of M, the number of MC samples used to estimate gradients. A single sample per iteration is sufficient; it is also the fastest. (We set M = 1 from here on.)\n\n3.2 Exploring Nonconjugacy: Matrix Factorization Models\n\nWe continue by exploring two nonconjugate non-negative matrix factorization models: a constrained Gamma Poisson model [24] and a Dirichlet Exponential model. Here, we show how easy it is to explore new models using ADVI. In both models, we use the Frey Face dataset, which contains 1956 frames (28 × 20 pixels) of facial expressions extracted from a video sequence.\nConstrained Gamma Poisson. This is a Gamma Poisson factorization model with an ordering constraint: each row of the Gamma matrix goes from small to large values. (Details in Appendix H.)\n\n6 NUTS is an adaptive extension of HMC. 
It is the default sampler in Stan.\n\n[Figure 5 appears here: held-out predictive likelihood against seconds on a log scale for ADVI and NUTS, plus images of the recovered factors.]\n\nFigure 5: Non-negative matrix factorization of the Frey Faces dataset. Comparison of ADVI to MCMC: held-out predictive likelihood as a function of wall time. (a) Gamma Poisson predictive likelihood. (b) Dirichlet Exponential predictive likelihood. (c) Gamma Poisson factors. (d) Dirichlet Exponential factors.\n\nDirichlet Exponential. This is a nonconjugate Dirichlet Exponential factorization model with a Poisson likelihood. (Details in Appendix I.)\nResults. Figure 5 shows average log predictive accuracy as well as ten factors recovered from both models. ADVI provides an order of magnitude speed improvement over NUTS (Figure 5a). NUTS struggles with the Dirichlet Exponential model (Figure 5b). In both cases, HMC does not produce any useful samples within a budget of one hour; we omit HMC from the plots.\n\n3.3 Scaling to Large Datasets: Gaussian Mixture Model\n\nWe conclude with the Gaussian mixture model (GMM) example we highlighted earlier. This is a nonconjugate GMM applied to color image histograms. We place a Dirichlet prior on the mixture proportions, a Gaussian prior on the component means, and a lognormal prior on the standard deviations. (Details in Appendix J.) We explore the imageCLEF dataset, which has 250 000 images [25]. We withhold 10 000 images for evaluation.\nIn Figure 1a we randomly select 1000 images and train a model with 10 mixture components. NUTS struggles to find an adequate solution and HMC fails altogether. This is likely due to label switching, which can affect HMC-based techniques in mixture models [4].\nFigure 1b shows results on the full dataset. Here we use ADVI with stochastic subsampling of minibatches from the dataset [3]. 
We increase the number of mixture components to 30. With a minibatch size of 500 or larger, ADVI reaches high predictive accuracy. Smaller minibatch sizes lead to suboptimal solutions, an effect also observed in [3]. ADVI converges in about two hours.\n\n4 Conclusion\n\nWe develop automatic differentiation variational inference (ADVI) in Stan. ADVI leverages automatic transformations, an implicit non-Gaussian variational approximation, and automatic differentiation. This is a valuable tool. We can explore many models and analyze large datasets with ease. We emphasize that ADVI is currently available as part of Stan; it is ready for anyone to use.\n\nAcknowledgments\n\nWe thank Dustin Tran, Bruno Jacobs, and the reviewers for their comments. This work is supported by NSF IIS-0745520, IIS-1247664, IIS-1009542, SES-1424962, ONR N00014-11-1-0651, DARPA FA8750-14-2-0009, N66001-15-C-4032, Sloan G-2015-13987, IES DE R305D140059, NDSEG, Facebook, Adobe, Amazon, and the Siebel Scholar and John Templeton Foundations.\n\nReferences\n\n[1] Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183-233, 1999.\n[2] Martin J Wainwright and Michael I Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.\n[3] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303-1347, 2013.\n[4] Stan Development Team. Stan Modeling Language Users Guide and Reference Manual, 2015.\n[5] Matthew D Hoffman and Andrew Gelman. The No-U-Turn sampler. The Journal of Machine Learning Research, 15(1):1593-1623, 2014.\n[6] Diederik Kingma and Max Welling. Auto-encoding variational Bayes. arXiv:1312.6114, 2013.\n[7] Danilo J Rezende, Shakir Mohamed, and Daan Wierstra. 
Stochastic backpropagation and approximate inference in deep generative models. In ICML, pages 1278-1286, 2014.\n[8] Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. In AISTATS, pages 814-822, 2014.\n[9] Tim Salimans and David Knowles. On using control variates with stochastic approximation for variational Bayes. arXiv preprint arXiv:1401.1022, 2014.\n[10] Michalis Titsias and Miguel Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In ICML, pages 1971-1979, 2014.\n[11] David Wingate and Theophane Weber. Automated variational inference in probabilistic programming. arXiv preprint arXiv:1301.1299, 2013.\n[12] Noah D Goodman, Vikash K Mansinghka, Daniel Roy, Keith Bonawitz, and Joshua B Tenenbaum. Church: a language for generative models. In UAI, pages 220-229, 2008.\n[13] Vikash Mansinghka, Daniel Selsam, and Yura Perov. Venture: a higher-order probabilistic programming platform with programmable inference. arXiv:1404.0099, 2014.\n[14] Frank Wood, Jan Willem van de Meent, and Vikash Mansinghka. A new approach to probabilistic programming inference. In AISTATS, pages 2-46, 2014.\n[15] John M Winn and Christopher M Bishop. Variational message passing. Journal of Machine Learning Research, pages 661-694, 2005.\n[16] Christopher M Bishop. Pattern Recognition and Machine Learning. Springer New York, 2006.\n[17] David J Olive. Statistical Theory and Inference. Springer, 2014.\n[18] Manfred Opper and Cédric Archambeau. The variational Gaussian approximation revisited. Neural Computation, 21(3):786-792, 2009.\n[19] Wolfgang Härdle and Léopold Simar. Applied Multivariate Statistical Analysis. Springer, 2012.\n[20] Christian P Robert and George Casella. Monte Carlo Statistical Methods. Springer, 1999.\n[21] John Duchi, Elad Hazan, and Yoram Singer. 
Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121-2159, 2011.\n[22] Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B, 73(2):123-214, 2011.\n[23] Andrew Gelman and Jennifer Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2006.\n[24] John Canny. GaP: a factor model for discrete data. In ACM SIGIR, pages 122-129. ACM, 2004.\n[25] Mauricio Villegas, Roberto Paredes, and Bart Thomee. Overview of the ImageCLEF 2013 Scalable Concept Image Annotation Subtask. In CLEF Evaluation Labs and Workshop, 2013.\n", "award": [], "sourceid": 399, "authors": [{"given_name": "Alp", "family_name": "Kucukelbir", "institution": null}, {"given_name": "Rajesh", "family_name": "Ranganath", "institution": "Princeton University"}, {"given_name": "Andrew", "family_name": "Gelman", "institution": "Columbia University"}, {"given_name": "David", "family_name": "Blei", "institution": "Columbia University"}]}