{"title": "Copula-like Variational Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 2959, "page_last": 2971, "abstract": "This paper considers a new family of variational distributions motivated by Sklar's theorem. This family is based on new copula-like densities on the hypercube with non-uniform marginals which can be sampled efficiently, i.e. with a complexity linear in the dimension d of the state space. Then, the proposed variational densities that we suggest can be seen as arising from these copula-like densities used as base distributions on the hypercube with Gaussian quantile functions and sparse rotation matrices as normalizing flows. The latter correspond to a rotation of the marginals with complexity O(d log d). We provide some empirical evidence that such a variational family can also approximate non-Gaussian posteriors and can be beneficial compared to Gaussian approximations. Our method performs largely comparably to state-of-the-art variational approximations on standard regression and classification benchmarks for Bayesian Neural Networks.", "full_text": "Copula-like Variational Inference\n\nMarcel Hirt\n\nDepartment of Statistical Science\nUniversity College of London, UK\n\nmarcel.hirt.16@ucl.ac.uk\n\nPetros Dellaportas\n\nDepartment of Statistical Science\nUniversity College of London, UK\n\nDepartment of Statistics\n\nAthens University of Economics and Business, Greece\n\nand The Alan Turing Institute, UK\n\nAlain Durmus\n\nCMLA\n\n\u00b4Ecole normale sup\u00b4erieure Paris-Saclay,\n\nCNRS, Universit\u00b4e Paris-Saclay, 94235 Cachan, France.\n\nalain.durmus@cmla.ens-cachan.fr\n\nAbstract\n\nThis paper considers a new family of variational distributions motivated by Sklar\u2019s\ntheorem. This family is based on new copula-like densities on the hypercube with\nnon-uniform marginals which can be sampled ef\ufb01ciently, i.e. with a complexity\nlinear in the dimension d of the state space. 
The proposed variational densities can then be seen as arising from these copula-like densities used as base distributions on the hypercube, with Gaussian quantile functions and sparse rotation matrices as normalizing flows. The latter correspond to a rotation of the marginals with complexity O(d log d). We provide some empirical evidence that such a variational family can also approximate non-Gaussian posteriors and can be beneficial compared to Gaussian approximations. Our method performs largely comparably to state-of-the-art variational approximations on standard regression and classification benchmarks for Bayesian Neural Networks.

1 Introduction

Variational inference [29, 68, 4] aims at performing Bayesian inference by approximating an intractable posterior density π with respect to the Lebesgue measure on R^d, based on a family of distributions which can be easily sampled from. More precisely, this kind of inference posits some variational family Q of densities (q_ξ)_{ξ∈Ξ} with respect to the Lebesgue measure and intends to find a good approximation q_ξ⋆ belonging to Q by minimizing the Kullback-Leibler (KL) divergence with respect to π over Q, i.e. ξ⋆ ≈ arg min_{ξ∈Ξ} KL(q_ξ|π). Further, suppose that π(x) = e^{−U(x)}/Z with U : R^d → R measurable and Z = ∫_{R^d} e^{−U(x)} dx < ∞ an unknown normalising constant. Then, for any ξ ∈ Ξ,

KL(q_ξ|π) = −∫_{R^d} q_ξ(x) log[π(x)/q_ξ(x)] dx = −E_{q_ξ(x)}[−U(x) − log q_ξ(x)] + log Z .   (1)

Since Z does not depend on q_ξ, minimizing ξ ↦ KL(q_ξ|π) is equivalent to maximizing ξ ↦ log Z − KL(q_ξ|π). A standard example is Bayesian inference over latent variables x having a prior density π_0 for a given likelihood function L(y_{1:n}|x) and n observations y_{1:n} = (y_1, . . . , y_n). The target density is the posterior p(x|y_{1:n}) with U(x) = −log π_0(x) − log L(y_{1:n}|x), and the objective that is commonly maximized,

L(ξ) = E_{q_ξ(x)}[log π_0(x) + log L(y_{1:n}|x) − log q_ξ(x)] ,   (2)

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

is called a variational lower bound or ELBO. One of the main features of variational inference methods is their ability to be scaled to large datasets using stochastic approximation methods [24] and applied to non-conjugate models by using Monte Carlo estimators of the gradient [57, 35, 60, 63, 38]. However, the approximation quality hinges on the expressiveness of the distributions in Q, and restrictive assumptions on the variational family that allow for efficient computations, such as mean-field families, tend to be too restrictive to recover the target distribution. Constructing an approximation family Q that is both flexible enough to closely approximate the density of interest and at the same time computationally efficient has been an ongoing challenge. Much effort has been dedicated to finding flexible and rich enough variational approximations, for instance by assuming a Gaussian approximation with different types of covariance matrices. For example, full-rank covariance matrices have been considered in [1, 28, 63] and low-rank perturbations of diagonal matrices in [1, 46, 53, 47]. Furthermore, covariance matrices with a Kronecker structure have been proposed in [42, 70]. Besides, more complex variational families have been suggested, such as mixture models [18, 22, 46, 40, 39] and implicit models [45, 26, 67, 69, 64], where the density of the variational distribution is intractable. Finally, variational inference based on normalizing flows has been developed in [59, 34, 65, 43, 3]. 
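For concreteness, the bound (2) can be estimated by Monte Carlo with reparameterized samples and then maximized by stochastic gradient ascent. Below is a minimal sketch for a one-dimensional Gaussian toy target; all names and the target itself are illustrative, not the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target: pi(x) = exp(-U(x)) / Z with U(x) = x^2 / 2,
# so pi = N(0, 1) and log Z = 0.5 * log(2 * pi).
def neg_U(x):
    return -0.5 * x**2

def elbo(m, log_s, n_samples=10_000):
    # Reparameterization: x = m + s * eps with eps ~ N(0, 1),
    # so gradients w.r.t. (m, log_s) pass through the samples.
    s = np.exp(log_s)
    eps = rng.standard_normal(n_samples)
    x = m + s * eps
    log_q = -0.5 * np.log(2 * np.pi) - log_s - 0.5 * eps**2
    return np.mean(neg_U(x) - log_q)

log_Z = 0.5 * np.log(2 * np.pi)
# At the optimum q = pi the bound is tight: ELBO = log Z (~0.9189 here);
# any mismatched q gives a strictly smaller value.
print(elbo(0.0, 0.0), log_Z)
print(elbo(1.0, np.log(0.5)) < log_Z)
```

The gap log Z − ELBO is exactly KL(q_ξ|π), which is why maximizing the bound minimizes the divergence.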
As a special case, and motivated by Sklar's theorem [62], variational inference based on families of copula densities and one-dimensional marginal distributions has been considered by [66], where it is assumed that the copula is a vine copula [2], and by [23], where the copula is assumed to be a Gaussian copula together with non-parametric marginals using Bernstein polynomials. Recall that c : [0, 1]^d → R_+ is a copula density if and only if its marginals are uniform on [0, 1], i.e. ∫_{[0,1]^{d−1}} c(u_1, . . . , u_d) du_1 ··· du_{i−1} du_{i+1} ··· du_d = 1_{[0,1]}(u_i) for any i ∈ {1, . . . , d} and u_i ∈ R. In the present work, we pursue these ideas but propose, instead of a family of copula densities, simply a family of densities {c_θ : [0, 1]^d → R_+}_{θ∈Θ} on the hypercube [0, 1]^d. This idea is motivated by the fact that we are able to provide such a family which is both flexible and allows for efficient computations.

The paper is organised as follows. In Section 2, we recall how one can sample more expressive distributions and compute their densities using a sequence of bijective and continuously differentiable transformations. In particular, we illustrate how to apply this idea in order to sample from a target density by first sampling a random variable U from its copula density c and then applying the marginal quantile function to each component of U. A new family of copula-like densities on the hypercube is constructed in Section 3 that allows for some flexibility in the dependence structure, while enjoying linear complexity in the dimension of the state space for generating samples and evaluating log-densities. A flexible variational distribution on R^d is introduced in Section 4 by sampling from such a copula-like density and then applying a sequence of transformations that include (1/2) d log d rotations over pairs of coordinates. We illustrate in Section 6 that for some target densities, arising for instance as the posterior in a logistic regression model, the proposed density allows for a better approximation, as measured by the KL-divergence, compared to a Gaussian density. We conclude by applying the proposed methodology to Bayesian Neural Network models.

2 Variational Inference and Copulas

In order to obtain expressive variational distributions, the variational densities can be transformed through a sequence of invertible mappings, termed normalizing flows [60]. To be more specific, assume a series {T_t : R^d → R^d}_{t=1}^T of C¹-diffeomorphisms and a sample X_0 ∼ q_0, where q_0 is a density function on R^d. Then the random variable X_T = T_T ∘ T_{T−1} ∘ ··· ∘ T_1(X_0) has a density q_T that satisfies

log q_T(x_T) = log q_0(x) − Σ_{t=1}^T log |det ∂T_t(x_t)/∂x_t| ,   (3)

with x_t = T_t ∘ T_{t−1} ∘ ··· ∘ T_1(x). To allow for scalable inference with such densities, the transformations T_t must be chosen so that the determinants of their Jacobians can be computed efficiently. One possibility that satisfies this requirement is to choose volume-preserving flows that have a Jacobian-determinant of one. This can be achieved by considering transformations T_t : x ↦ H_t x where H_t is an orthogonal matrix, as proposed in [65] using a Householder-projection matrix H_t.

An alternative construction of the same form can be used to construct a density using Sklar's theorem [62, 48]. It establishes that, given a target density π on (R^d, B(R^d)), there exists a continuous function C : [0, 1]^d → [0, 1] and a probability space supporting a random variable U = (U_1, . . . , U_d) valued in [0, 1]^d, such that for any x ∈ R^d and u ∈ [0, 1]^d,

P(U_1 ≤ u_1, ··· , U_d ≤ u_d) = C(u_1, ··· , u_d) ,  ∫_{−∞}^{x_1} ··· ∫_{−∞}^{x_d} π(t) dt = C(F_1(x_1), . . . , F_d(x_d)) ,   (4)

where for any i ∈ {1, . . . , d}, F_i is the cumulative distribution function associated with the ith marginal π_i of π, so for any x_i ∈ R, F_i(x_i) = ∫_{−∞}^{x_i} π_i(t_i) dt_i and π_i(x_i) = ∫_{R^{d−1}} π(x) dx_1 ··· dx_{i−1} dx_{i+1} ··· dx_d. To illustrate how one can obtain such a continuous function C and random variable U, recall that π_i is assumed to be absolutely continuous with respect to the Lebesgue measure. Then for (X_1, . . . , X_d) ∼ π, the random variable U = G^{−1}(X) = (F_1(X_1), . . . , F_d(X_d)), where G : [0, 1]^d → R^d with

G : u ↦ (F_1^{−1}(u_1), . . . , F_d^{−1}(u_d)) ,   (5)

follows a law on the hypercube with uniform marginals. It can be readily shown that the cumulative distribution function C of U is continuous and satisfies (4). Note that taking the derivative of (4) yields

π(x) = c(F_1(x_1), . . . , F_d(x_d)) Π_{i=1}^d π_i(x_i) ,

where c(u_1, . . . , u_d) = ∂^d C(u_1, . . . , u_d)/∂u_1 ··· ∂u_d is a copula density function by definition of C. One possibility to approximate a target density π is then to consider a parametric family of copula density functions (c_θ)_{θ∈Θ} for Θ ⊂ R^{p_c} and a parametric family of d-dimensional vectors of density functions (f_1, . . . , f_d)_{φ∈Φ} for Φ ⊂ R^{p_f}, and try to estimate θ ∈ Θ and φ ∈ Φ to get a good approximation of π via variational Bayesian methods. 
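The copula-plus-quantile sampling scheme of (4)–(5) can be sketched in a few lines. A Gaussian copula and exponential marginals are used purely for illustration here (the paper's own hypercube density appears in Section 3); the particular numbers are arbitrary choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
d, rho, n = 2, 0.8, 50_000
corr = np.array([[1.0, rho], [rho, 1.0]])

# Step 1: a sample U on [0, 1]^2 with uniform marginals, here via a Gaussian
# copula: push correlated normals through their own CDF.
z = rng.multivariate_normal(np.zeros(d), corr, size=n)
u = norm.cdf(z)

# Step 2: the map G of (5), with Exp(1) marginals F_i^{-1}(u) = -log(1 - u).
x = -np.log1p(-u)

print(u.mean(axis=0))  # ~ [0.5, 0.5]: uniform marginals on [0, 1]
print(x.mean(axis=0))  # ~ [1.0, 1.0]: Exp(1) marginals, dependence from the copula
```

Changing only the quantile functions in step 2 changes the marginals while leaving the dependence structure fixed, which is exactly the factorization Sklar's theorem provides.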
This idea was proposed by [23] and [66], where Gaussian and vine copulas were used, respectively. The main hurdle for using such a family is its computational cost, which can be prohibitive since the dimension of Θ is of order d². We remark that for latent Gaussian models with certain likelihood functions, a Gaussian variational approximation can scale linearly in the number of observations by using dual variables, see [54, 31].

3 Copula-like Density

In this paper, we consider another approach which relies on a copula-like density function on [0, 1]^d. Indeed, instead of an exact copula density function on [0, 1]^d with uniform marginals, we consider simply a density function on [0, 1]^d, which allows for a certain degree of freedom in the number of parameters we want to use. The family of copula-like densities that we consider is given by

c_θ(v_1, . . . , v_d) = [Γ(α*)/B(a, b)] [Π_{ℓ=1}^d v_ℓ^{α_ℓ−1}/Γ(α_ℓ)] (v*)^{−α*} (max_{i∈{1,...,d}} v_i)^a (1 − max_{i∈{1,...,d}} v_i)^{b−1} ,   (6)

with the notation v* = Σ_{i=1}^d v_i and α* = Σ_{i=1}^d α_i. Therefore θ = (a, b, (α_i)_{i∈{1,...,d}}) ∈ R*_+ × R*_+ × (R*_+)^d = Θ. The following probabilistic construction is proven in Appendix A to allow for efficient sampling from the proposed copula-like density.

Proposition 1. Let θ ∈ Θ and suppose that

1. (W_1, . . . , W_d) ∼ Dirichlet(α_1, . . . , α_d);
2. G ∼ Beta(a, b);
3. (V_1, . . . , V_d) = (G W_1/W*, . . . , G W_d/W*), where W* = max_{i∈{1,...,d}} W_i.

Then the distribution of (V_1, . . . , V_d) has a density with respect to the Lebesgue measure given by (6).

The proposed distribution builds on Beta distributions, as they are the marginals of the Dirichlet-distributed random variable W ∼ Dir(α), which is then multiplied with an independent random variable G ∼ Beta(a, b). The resulting random variable Y = W G follows a Beta-Liouville distribution, which allows one to account for negative dependence, inherited from the Dirichlet distribution through a Beta stick-breaking construction, as well as positive dependence via a common Beta factor. More precisely, one obtains

Cor(Y_i, Y_j) = c_ij (E[G²]/(α⋆ + 1) − E[G]²/α⋆) ,

for some c_ij > 0 and α⋆ = Σ_{k=1}^d α_k, cf. [13]. Proposition 1 shows that one can transform the Beta-Liouville distribution living within the simplex to one that has support on the full hypercube, while also allowing for efficient sampling and log-density evaluations.

Now note that V^− = (1 − V_1, . . . , 1 − V_d) is also a sample on the hypercube if V ∼ c_θ, as is the convex combination U = (U_1, . . . , U_d), where U_i = δ_i V_i + (1 − δ_i)(1 − V_i) for any δ ∈ [0, 1]^d. Put differently, we can write U = H(V), where

H : v ↦ (1 − δ) + {diag(2δ) − I_d} v ,   (7)

and I_d is the identity operator. It is straightforward to see that H is a C¹-diffeomorphism for δ ∈ ([0, 1]\{0.5})^d from the hypercube into I_1 × ··· × I_d, where I_i = [δ_i, 1 − δ_i] if δ_i ∈ [0, 0.5) and I_i = [1 − δ_i, δ_i] if δ_i ∈ (0.5, 1]. 
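The three steps of Proposition 1 translate directly into code. The following is a minimal NumPy sketch (parameter values are illustrative); since max_i V_i = G by construction, the maximum of each sample should be Beta(a, b) distributed.

```python
import numpy as np

def sample_copula_like(alpha, a, b, n, rng):
    """Sample V ~ c_theta of (6) via the construction of Proposition 1."""
    w = rng.dirichlet(alpha, size=n)        # step 1: W ~ Dirichlet(alpha)
    g = rng.beta(a, b, size=(n, 1))         # step 2: G ~ Beta(a, b)
    w_star = w.max(axis=1, keepdims=True)   # W* = max_i W_i
    return g * w / w_star                   # step 3: V = G W / W*

rng = np.random.default_rng(2)
v = sample_copula_like(alpha=np.ones(5), a=2.0, b=2.0, n=10_000, rng=rng)
assert (v >= 0).all() and (v <= 1).all()    # support is the hypercube
# max_i V_i recovers G, so it should be Beta(2, 2) with mean 1/2.
print(v.max(axis=1).mean())                 # ~ 0.5
```

Both sampling and log-density evaluation of (6) cost O(d) per sample, which is the linear complexity claimed in the abstract.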
Note that the Jacobian-determinant of H is efficiently computable and is simply equal to |Π_{i=1}^d (2δ_i − 1)| for δ ∈ [0, 1]^d.

We suggest to draw δ ∈ [0, 1]^d for the transformation H initially at random such that

P(δ_i = ε) = p and P(δ_i = 1 − ε) = 1 − p ,   (8)

with p, ε ∈ (0, 1). In our experiments, we set ε = 0.01 and p = 1/2. We found that choosing a different (large enough) value of ε tends to yield no large difference, as this choice gets balanced by a different value of the standard deviation of the Gaussian marginal transformation. The motivation for considering U = H(V) with V ∼ c_θ was, first, numerical stability, since using this transformation we need to compute quantile functions only on the interval [ε, 1 − ε]. Second, this transformation can increase the flexibility of our proposed family. We found empirically that the components of V ∼ c_θ tend to be non-negatively correlated in higher dimensions. However, by sometimes using the antithetic component of V through U = H(V), the transformed density can also describe negative dependencies in high dimensions. A natural way to obtain a flexible density is then to either optimize over the parameter δ parametrising the transformation H, or to consider δ as an auxiliary variable in the variational density, resorting to techniques developed for such hierarchical families, see for instance [58, 69, 64]. However, this proved challenging in an initial attempt, since for δ_i = 0.5 the transformation H becomes non-invertible, while restricting δ to, say, δ ∈ {ε, 1 − ε}^d, ε ≈ 0, seemed less easy to optimize. Consequently, we keep δ fixed after sampling it initially according to (8). A sensible choice is p = 1/2, since it leads to a balanced proportion of components of δ equal to ε and 1 − ε. However, the sampled value of δ might not be optimal, and we illustrate in the next section how the variational density can be made more flexible.

4 Rotated Variational Density

We propose to apply rotations to the marginals in order to improve on the initial orientation that results from the sampled values of δ. Rotated copulas have been used before in low dimensions, see for instance [36]; however, the set of orthogonal matrices has d(d−1)/2 free parameters. We reduce the number of free parameters by considering only rotation matrices R_d that are given as a product of (d/2) log d Givens rotations, following the FFT-style butterfly architecture proposed in [16]; see also [44] and [49], where such an architecture was used for approximating Hessians and kernel functions, respectively. Recall that a Givens rotation matrix [21] is a sparse matrix with one angle as its parameter that rotates two dimensions by this angle. If we assume for the moment that d = 2^k, k ∈ N*, then we consider k rotation matrices denoted O_1, . . . , O_k, where for any i ∈ {1, . . . , k}, O_i contains d/2 independent rotations, i.e. is the product of d/2 independent Givens rotations. The Givens rotations are arranged in a butterfly architecture that provides a minimal number of rotations such that all coordinates can interact with one another in the rotation defined by R_d. For illustration, consider
For illustration, consider\n\n4\n\n\f\uf8ee\uf8ef\uf8f0c1 \u2212s1\n\ns1\n0\n0\n\nc1\n0\n0\n\n\uf8f9\uf8fa\uf8fb\n\n\uf8ee\uf8ef\uf8f0c2\n\n0\ns2\n0\n\n0\n0\n0\n0\nc3 \u2212s3\nc3\ns3\n\nO1O2 =\n\nthe case d = 4, where the rotation matrix is fully described using 4 \u2212 1 parameters \u03bd1, \u03bd2, \u03bd3 \u2208 R\nby R4 = O1O2 with\n\n\uf8f9\uf8fa\uf8fb =\n\n\uf8ee\uf8ef\uf8f0c1c2 \u2212s1c2 \u2212c1s2\n\ns1c2\nc3s2 \u2212s3s2\nc3s2\ns3s2\n\ns1s2\nc1c2 \u2212s1s2 \u2212c1ss\nc3c2 \u2212s3cs\nc3c2\ns3c2\n\n\uf8f9\uf8fa\uf8fb ,\n\n0 \u2212s2\n0\nc2\n0\nc2\n0\ns2\n\n0\n\u2212s2\n0\nc2\n\nwhere ci = cos(\u03bdi) and si = sin(\u03bdi). We provide a precise recursive de\ufb01nition of Rd in Appendix B\nwhere we also describe the case where d is not a power of two. In general, we have a computational\ncomplexity of O(d log d), due to the fact that Rd is a product of O(log d) matrices each requiring\nO(d) operations. Moreover, note that Rd is parametrized by d \u2212 1 parameters (\u03bdi)i\u2208{1...d\u22121} and\neach Oi can be implemented as a sparse matrix, which implies a memory complexity of O(d).\nFurthermore, since Oi is orthonormal, we have O\u22121\nTo construct an expressive variational distribution, we consider as a base distribution q0 the proposed\ncopula-like density c\u03b8. We then apply the transformations T1 = H and T2 = G . The operator G in\n(5) is de\ufb01ned via quantile functions of densities f1, . . . , fd, for which we choose Gaussian densities\nwith parameter \u03c6f = (\u00b51, . . . , \u00b5d, \u03c32\n+. As a \ufb01nal transformation, we apply the\nvolume-preserving operator\n(9)\nthat has parameter \u03c6R = (\u03bd1, . . . , \u03bdd\u22121) \u2208 Rd\u22121. Altogether, the parameter for the marginal-like\ndensities that we optimize over is \u03c6 = (\u03c6f , \u03c6R) and simulation from the variational density boils\ndown to the following algorithm.\n\n1, . . . 
, \u03c32\nT3 : x (cid:55)\u2192 O1 \u00b7\u00b7\u00b7Olog dx\n\ni and | detOi| = 1.\n\nd) \u2208 Rd \u00d7 Rd\n\ni = O(cid:62)\n\nAlgorithm 1 Sampling from the rotated copula-like density.\n1: Sample (V1, . . . , Vd) \u223c c\u03b8 using Proposition 1.\n2: Set U = H (V ) where H is de\ufb01ned in (7).\n3: Set X(cid:48) = G (U ), where G is de\ufb01ned in (5).\n4: Set X = T3, where T3 is de\ufb01ned in (9).\n\nNote that we apply the rotations after we have transformed samples from the hypercube into Rd, as\nthe hypercube is not closed under Givens rotations. The variational density can then be evaluated\nusing the normalizing \ufb02ow formula (3). We optimize the variational lower bound L in (2) using\nreparametrization gradients, proposed by [35, 60, 63], but with an implicit reparametrization, cf.\n[14], for Dirichlet and Beta distributions. Such reparametrized gradients for Dirichlet and Beta\ndistributions are readily available for instance in tensor\ufb02ow probability [9]. Using Monte Carlo\nsamples of unbiased gradient estimates, one can optimize the variational bound using some version\nof stochastic gradient descent. A more formal description is given in Appendix C.\nWe would like to remark that such sparse rotations can be similarly applied to proper copulas. While\nthere is no additional \ufb02exibility by rotating a full-rank Gaussian copula, applying such rotations to\na Gaussian copula with a low-rank correlation yields a Gaussian distribution with a more \ufb02exible\ncovariance structure if combined with Gaussian marginals. In our experiments, we therefore also\ncompare variational families constructed by sampling (V1, . . . , Vd) from an independence copula\nin step 1 in Algorithm 1, i.e. Vi are independent and uniformly distributed on [0, 1] for any i \u2208\n{1, . . . , d}, which results approximately in a Gaussian variational distribution if the effect of the\ntransformation H is neglected. 
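Algorithm 1 can be sketched end-to-end in NumPy/SciPy. This is a minimal illustration under stated assumptions, not the paper's implementation: each Givens rotation here gets its own angle (the paper ties angles so that R_d has only d − 1 parameters), the exact butterfly pairing is a plausible guess (the precise recursion is in the paper's Appendix B), and the dense Givens matrices would be sparse updates in practice.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
d = 8  # a power of two, for simplicity

def butterfly_rotation(nus, d):
    """Product O_1 ... O_{log2 d} of Givens rotations in an FFT-style butterfly."""
    R, idx, block = np.eye(d), 0, 1
    while block < d:                      # log2(d) layers with d/2 rotations each
        O = np.eye(d)
        for start in range(0, d, 2 * block):
            for off in range(block):
                i, j = start + off, start + off + block
                c, s = np.cos(nus[idx]), np.sin(nus[idx])
                G = np.eye(d)             # dense here for clarity; sparse in practice
                G[i, i], G[i, j], G[j, i], G[j, j] = c, -s, s, c
                O, idx = G @ O, idx + 1
        R, block = O @ R, 2 * block
    return R

# Steps 1-2: V ~ c_theta via Proposition 1, then U = H(V) as in (7),
# with delta fixed at values in {eps, 1 - eps} as in (8).
n, eps = 1000, 0.01
w = rng.dirichlet(np.ones(d), size=n)
v = rng.beta(2.0, 2.0, size=(n, 1)) * w / w.max(axis=1, keepdims=True)
delta = np.where(rng.random(d) < 0.5, eps, 1 - eps)
u = (1 - delta) + (2 * delta - 1) * v     # components now lie in [eps, 1 - eps]

# Step 3: Gaussian quantile functions, X' = G(U).
mu, sigma = np.zeros(d), np.ones(d)
x_prime = mu + sigma * norm.ppf(u)

# Step 4: volume-preserving rotation X = T_3(X') as in (9).
nus = rng.uniform(0.0, 2.0 * np.pi, size=(d // 2) * int(np.log2(d)))
R = butterfly_rotation(nus, d)
x = x_prime @ R.T

assert np.allclose(R @ R.T, np.eye(d))    # R is orthogonal, so |det R| = 1
```

Because R is orthogonal, step 4 contributes nothing to the log-determinant in (3); only H and the Gaussian quantile maps need Jacobian terms.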
However, a more thorough analysis of such families is left for future work. Similarly, transformations different from the sparse rotations in step 4 of Algorithm 1 can be used in combination with a copula-like base density. Whilst we include a comparison with a simple Inverse Autoregressive Flow [34] in our experiments, a more exhaustive study of non-linear transformations is beyond the scope of this work.

5 Related Work

Conceptually, our work is closely related to [66, 23]. It differs from [66] in that it can be applied in high dimensions without having to search first for the most correlated variables using, for instance, a sequential tree selection algorithm [11]. The approach in [23] uses a Gaussian dependence structure, but has only been considered in low-dimensional settings. On a more computational side, our approach is related to variational inference with normalizing flows [59, 34, 65, 43, 3]. In contrast to these works, which introduce a parameter-free base distribution, commonly on R^d as the latent state space, we also optimize over the parameters of the base distribution, which is supported on the hypercube instead, although distributions supported for instance on the hypersphere as a state space have been considered in [7]. Moreover, such approaches have often been used in the context of generative models using Variational Auto-Encoders (VAEs) [35], yet it is in principle possible to apply the proposed variational copula-like inference in an amortized fashion for VAEs.

A somewhat similar copula-like construction in the context of importance sampling has been proposed in [8]. However, sampling from this density requires a rejection step to ensure support on the hypercube, which would make optimization of the variational bound less straightforward. 
Lastly, [30] proposed a method to approximate copulas using mixture distributions, but these approximations have been analysed neither in high dimensions nor in the context of variational inference.

6 Experiments

6.1 Bayesian Logistic Regression

Consider the target distribution π on (R^d, B(R^d)) arising as the posterior of a d-dimensional logistic regression, assuming a Normal prior π_0 = N(0, τ^{−1} I), τ = 0.01, and likelihood function L(y_i|x) = f(y_i x^⊤ a_i), f(z) = 1/(1 + e^{−z}), with n observations y_i ∈ {−1, 1} and fixed covariates a_i ∈ R^d for i ∈ {1, . . . , n}. We analyse a previously considered synthetic dataset where the posterior distribution is non-Gaussian, yet it can be well approximated with our copula-like construction. Concretely, we consider the synthetic dataset with d = 2 as in [50], Section 8.4, and [32], generating 30 covariates a ∈ R² from a Gaussian N((1, 5)^⊤, I) for instances in the first class, and 30 covariates from N((−5, 1)^⊤, 1.1² I) for instances in the second class. Samples from the target distribution using a Hamiltonian Monte Carlo (HMC) sampler [12, 51] are shown in Figure 1a, and one observes non-Gaussian marginals that are positively correlated with heavy right tails. Using a Gaussian variational approximation with either independent marginals or a full covariance matrix as shown in Figure 1b does not adequately approximate the target distribution. Our copula-like construction is able to approximate the target more closely, both without any rotations (Figure 1c) and with a rotation of the marginals (Figure 1d). 
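The unnormalized log posterior targeted in this experiment is easy to write down explicitly. A sketch under the stated model (prior precision τ = 0.01, logistic likelihood), with synthetic data drawn as described; the helper name `log_post` is ours:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic two-class data as in Section 6.1: 30 covariates per class.
a1 = rng.multivariate_normal([1.0, 5.0], np.eye(2), size=30)
a2 = rng.multivariate_normal([-5.0, 1.0], 1.1**2 * np.eye(2), size=30)
A = np.vstack([a1, a2])
y = np.concatenate([np.ones(30), -np.ones(30)])

tau = 0.01

def log_post(x):
    """Unnormalized log posterior log pi_0(x) + sum_i log f(y_i x^T a_i), i.e. -U(x)."""
    logits = y * (A @ x)
    loglik = -np.logaddexp(0.0, -logits).sum()  # log f(z) = -log(1 + e^{-z}), stably
    logprior = -0.5 * tau * x @ x
    return loglik + logprior

print(log_post(np.zeros(2)))  # = 60 * log(1/2) ~ -41.59, independent of the draw
```

This −U(x) is the only model-specific ingredient needed to evaluate the bound (2) at samples from any of the variational families compared in Table 1.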
This is also supported by the ELBO obtained for the different variational families given in Table 1.

Table 1: Comparison of the ELBO between different variational families for the logistic regression experiment.
Variational family               ELBO
Mean-field Gaussian             -3.42
Full-covariance Gaussian        -2.97
Copula-like without rotations   -2.30
Copula-like with rotations      -2.19

Figure 1: Target density for logistic regression using a HMC sampler in 1a, with different variational approximations: Gaussian variational approximation with a full covariance matrix in 1b, copula-like variational approximation without any rotation in 1c, and copula-like variational approximation with a rotation in 1d.

6.2 Centred Horseshoe Priors

We illustrate our approach in a hierarchical Bayesian model that posits a priori a strong coupling of the latent parameters. As an example, we consider a Horseshoe prior [6] that has been considered in the variational Gaussian copula framework in [23]. To be more specific, consider the generative model y|λ ∼ N(0, λ) with λ ∼ C+(0, 1), where C+ is a half-Cauchy distribution, i.e. X ∼ C+(0, b) has the density p(x) ∝ 1_{R+}(x)/(x² + b²). Note that we can represent a half-Cauchy distribution with Inverse Gamma and Gamma distributions using X ∼ C+(0, b) ⟺ X²|Y ∼ IG(1/2, 1/Y), Y ∼ IG(1/2, 1/b²), see [52], with a rate parametrisation of the inverse gamma density p(x) ∝ 1_{R+}(x) x^{−a−1} e^{−b/x} for X ∼ IG(a, b). We revisit the toy model in [23], fixing y = 0.01. The model thus writes in a centred form as η ∼ G(1/2, 1) and λ|η ∼ IG(1/2, η). Following [23], we consider the posterior density on R² of the log-transformed variables (x_1, x_2) = (log η_1, log λ_1). 
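The unnormalized log posterior over the log-transformed variables follows by adding the log-Jacobian of the exponential reparametrisation to the Gamma, Inverse-Gamma and Gaussian log densities. A sketch under our reading of the stated model (constants dropped; the function name is ours):

```python
import numpy as np

y = 0.01

def log_post(x1, x2):
    """Unnormalized log posterior of (x1, x2) = (log eta, log lambda)."""
    eta, lam = np.exp(x1), np.exp(x2)
    lp  = -0.5 * x1 - eta                  # eta ~ Gamma(1/2, 1): (a-1) log eta - eta
    lp += 0.5 * x1 - 1.5 * x2 - eta / lam  # lambda | eta ~ IG(1/2, eta), rate form
    lp += -0.5 * x2 - 0.5 * y**2 / lam     # y | lambda ~ N(0, lambda)
    return lp + x1 + x2                    # log-Jacobian of the exp transform

print(log_post(0.0, 0.0))  # -> -2.00005 at eta = lambda = 1
```

Working on the log scale removes the positivity constraints, which is what makes the Gaussian-quantile and copula-like families of Table 2 directly applicable.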
In Figure 2, we show the approximate posterior distribution using a Gaussian family (2b) and a copula-like family (2c), together with samples from a HMC sampler (2a). A copula-like density yields a higher ELBO, see Table 2. The experiments in [23] have shown that a Gaussian copula with a non-parametric mixture model fits the marginals more closely. To illustrate that it is possible to arrive at a more flexible variational family by using a mixture of copula-like densities, we have used a mixture of 3 copula-like densities in Figure 2d. Note that it is possible to accommodate multi-modal marginals using a Gaussian quantile transformation with a copula-like density. Eventually, the flexibility of the variational approximation can be increased using complementary work. For instance, one could use the new density within a semi-implicit variational framework [69] whose parameters are the output of a neural network conditional on some latent mixing variable.

Table 2: Comparison of the ELBO between different variational families for the centred horseshoe model.
Variational family         ELBO
Mean-field Gaussian       -1.24
Full-covariance Gaussian  -0.04
Copula-like                0.04
3-mixture copula-like      0.08

Figure 2: Target density for the horseshoe model using a HMC sampler in 2a, with different variational approximations: Gaussian variational approximation with a full covariance matrix in 2b, copula-like variational approximation including a rotation in 2c, and a mixture of three copula-like densities, each with one rotation and marginal-like density, in 2d.

6.3 Bayesian Neural Networks with Normal Priors

We consider an L-hidden-layer fully-connected neural network where each layer l, 1 ≤ l ≤ L + 1, has width d_l and is parametrised by a weight matrix W^l ∈ R^{d_{l−1}×d_l} and bias vector b^l ∈ R^{d_l}. Let h¹ ∈ R^{d_0} denote the input to the network and f be a point-wise non-linearity such as the ReLU function f(a) = max{0, a}; define the activations a^l ∈ R^{d_l} by a^{l+1} = Σ_i h^l_i W^l_{i·} + b^l for l ≥ 1, and the post-activations as h^l = f(a^l) for l ≥ 2. We consider a regression likelihood function L(·|a^{L+2}, σ) = N(a^{L+2}, exp(0.5σ)), and denote the concatenation of all parameters W, b and σ as x. We assume independent Normal priors for the entries of the weight matrices and bias vectors with mean 0 and variance σ²_0. Furthermore, we assume that log σ ∼ N(0, 16). Inference with the proposed variational family is applied on commonly considered UCI regression datasets, repeating the experimental set-up used in [15]. In particular, we use neural networks with ReLU activation functions and one hidden layer of size 50 for all datasets, with the exception of the protein dataset, which utilizes a hidden layer of size 100. We choose the hyper-parameter σ²_0 ∈ {0.01, 0.1, 1, 10, 100} that performed best on a validation dataset in terms of its predictive log-likelihood. Optimization was performed using Adam [33] with a learning rate of 0.002. We compare the predictive performance of a copula-like density c_θ and an independent copula as base distributions in step 1 of Algorithm 1, and we apply different transformations T_3 in step 4 of Algorithm 1: a) the proposed sparse rotation defined in (9); b) T_3 = Id; c) an affine autoregressive transformation T_3(x) = {x − f_μ(x)} exp(−f_α(x)), see [34], also known as an inverse autoregressive flow (IAF). Here f_μ and f_α are autoregressive neural networks parametrized by μ and α satisfying ∂f_μ(x)_i/∂x_j = ∂f_α(x)_i/∂x_j = 0 for i ≤ j, and which can be computed in a single forward pass by properly masking the weights in the neural networks [17]. 
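The affine autoregressive transformation in c) can be mimicked with strictly lower-triangular linear maps standing in for the masked networks f_μ and f_α; this toy sketch (a simplification of the actual IAF, with illustrative weights) shows why the log-determinant reduces to a single sum.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 4

# Strictly lower-triangular linear maps stand in for the masked autoregressive
# networks: output i depends only on inputs j < i, so that
# d f(x)_i / d x_j = 0 for i <= j, as required.
mask = np.tril(np.ones((d, d)), k=-1)
W_mu = rng.normal(size=(d, d)) * mask
W_alpha = 0.1 * rng.normal(size=(d, d)) * mask

def iaf(x):
    """Affine autoregressive map T(x) = {x - f_mu(x)} exp(-f_alpha(x)) and its log|det|."""
    f_mu, f_alpha = x @ W_mu.T, x @ W_alpha.T
    y = (x - f_mu) * np.exp(-f_alpha)
    # The Jacobian is triangular with diagonal exp(-f_alpha(x)_i), so the
    # log-determinant is just -sum_i f_alpha(x)_i: one forward pass suffices.
    return y, -f_alpha.sum(axis=-1)

x = rng.normal(size=(3, d))
y, logdet = iaf(x)

# Sanity check against a finite-difference Jacobian for the first input.
h = 1e-6
J = np.column_stack([(iaf(x[0] + h * np.eye(d)[j])[0] - y[0]) / h for j in range(d)])
print(np.log(abs(np.linalg.det(J))), logdet[0])  # should agree to ~1e-4
```

With full networks in place of the linear maps, the weight matrices are of order d · d^{IAF}_1, which is the extra cost relative to the sparse rotation noted below.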
In our experiments, we use a one-hidden layer fully-connected network with width d^IAF_1 = 50 for f_mu and f_alpha. Note that for a d-dimensional target density, the size of the weight matrices is of order d * d^IAF_1, implying a higher complexity compared to the sparse rotation. We also compare the predictions against Bayes-by-Backprop [5] using a mean-field model, with the results as reported in [47] for a mean-field Bayes-by-Backprop and the low-rank Gaussian approximation proposed therein, called SLANG. Furthermore, we also report the results for Dropout inference [15]. The test root mean-squared errors are given in Table 3 and Table 4; the predictive test log-likelihoods can be found in Appendix E in Table 6 and Table 7. We can observe from Table 3 and Table 6 that using a copula-like base distribution instead of an independent copula improves the predictive performance, using either rotations or IAF as the final transformation. The same tables also indicate that, for a given base distribution, the IAF tends to outperform the sparse rotations slightly. Table 4 and Table 7 suggest that copula-like densities without any transformation in the last step can be a competitive alternative to a benchmark mean-field or Gaussian approximation. Dropout tends to perform slightly better. However, note that Dropout with a Normal prior and a variational mixture distribution that includes a Dirac delta function as one component gives rise to a different objective, since the prior is not absolutely continuous with respect to the approximate posterior, see [25].

Table 3: Variational approximations with transformations and different base distributions. Test root mean-squared error for UCI regression datasets. Standard errors in parentheses.

          Copula-like    Independent copula  Copula-like  Independent copula
          with rotation  with rotation       with IAF     with IAF
Boston    3.43 (0.22)    3.51 (0.30)         3.21 (0.27)  3.61 (0.28)
Concrete  5.76 (0.14)    6.00 (0.13)         5.41 (0.10)  5.82 (0.11)
Energy    0.55 (0.01)    2.28 (0.11)         0.53 (0.02)  1.30 (0.10)
Kin8nm    0.08 (0.00)    0.08 (0.00)         0.08 (0.00)  0.08 (0.00)
Naval     0.00 (0.00)    0.00 (0.00)         0.00 (0.00)  0.00 (0.00)
Power     4.02 (0.04)    4.19 (0.04)         4.05 (0.04)  4.15 (0.04)
Wine      0.64 (0.01)    0.64 (0.01)         0.64 (0.01)  0.64 (0.01)
Yacht     1.35 (0.08)    1.38 (0.12)         0.96 (0.06)  1.25 (0.09)
Protein   4.20 (0.01)    4.51 (0.04)         4.31 (0.01)  4.51 (0.03)

Table 4: Copula-like variational approximation without rotations and benchmark results. Test root mean-squared error for UCI regression datasets. Standard errors in parentheses.

          Copula-like       Bayes-by-Backprop  SLANG              Dropout
          without rotation  results from [47]  results from [47]  results from [47]
Boston    3.22 (0.25)       3.43 (0.20)        3.21 (0.19)        2.97 (0.19)
Concrete  5.64 (0.14)       6.16 (0.13)        5.58 (0.12)        5.23 (0.12)
Energy    0.52 (0.02)       0.97 (0.09)        0.64 (0.04)        1.66 (0.04)
Kin8nm    0.08 (0.00)       0.08 (0.00)        0.08 (0.00)        0.10 (0.01)
Naval     0.00 (0.00)       0.00 (0.00)        0.00 (0.00)        0.01 (0.01)
Power     4.05 (0.04)       4.21 (0.03)        4.16 (0.04)        4.02 (0.04)
Wine      0.65 (0.01)       0.64 (0.01)        0.65 (0.01)        0.62 (0.01)
Yacht     1.23 (0.08)       1.13 (0.06)        1.08 (0.09)        1.11 (0.09)
Protein   4.31 (0.02)       NA                 NA                 4.27 (0.01)

6.4 Bayesian Neural Networks with Structured Priors

We illustrate our approach on a larger Bayesian neural network.
To induce sparsity for the weights in the network, we consider a (regularised) Horseshoe prior [56] that has also been used increasingly as an alternative prior in Bayesian neural networks to allow for sparse variational approximations, see [41, 19] for mean-field models and [20] for a structured Gaussian approximation. We consider again an L-hidden layer fully-connected neural network where we assume that the weight matrix W^l in R^{d_{l-1} x d_l} for any l in {1, ..., L + 1} and any i in {1, ..., d_{l-1}} satisfies a priori

W^l_{i.} | lambda^l_i, tau^l, c ~ N(0, (tau^l lambdatilde^l_i)^2 I) prop N(0, (tau^l lambda^l_i)^2 I) N(0, c^2 I),    (10)

where (lambdatilde^l_i)^2 = c^2 (lambda^l_i)^2 / (c^2 + (tau^l)^2 (lambda^l_i)^2), lambda^l_i ~ C+(0, 1), tau^l ~ C+(0, b_tau) and c^2 ~ IG(nu/2, nu s^2/2) for some hyper-parameters b_tau, nu, s^2 > 0. The vector W^l_{i.} represents all weights that interact with the i-th input neuron. The first Normal factor in (10) is a standard Horseshoe prior with a per-layer global parameter tau^l that adapts to the overall sparsity in layer l and shrinks all weights in this layer to zero, due to the fact that C+(0, b_tau) allows for substantial mass near zero. The local shrinkage parameters lambda^l_i allow for signals in the i-th input neuron because C+(0, 1) is heavy-tailed. However, this can leave large weights un-shrunk, and the second Normal factor in (10) induces a Student-t_nu(0, s^2) regularisation for weights far from zero, see [56] for details. We can rewrite the model in a non-centred form [55], where the latent parameters are a priori independent, see also [41, 27, 19, 20] for similar variational approximations. We write the model as eta^l_i ~ IG(1/2, 1), lambdahat^l_i ~ G(1/2, 1), kappa^l ~ G(1/2, 1/b_tau^2), tauhat^l ~ IG(1/2, 1), beta^l_i ~ N(0, I) and W^l_{i.} = tau^l lambdatilde^l_i beta^l_i, where tau^l = sqrt(tauhat^l kappa^l), lambda^l_i = sqrt(lambdahat^l_i eta^l_i) and (lambdatilde^l_i)^2 = c^2 (lambda^l_i)^2 / (c^2 + (tau^l)^2 (lambda^l_i)^2). The target density is the posterior of these variables, after applying a log-transformation if their prior is an (inverse) Gamma law.

Table 5: MNIST prediction errors.

Variational approximation with Horseshoe prior and size 200 x 200         Error Rate
Copula-like with rotations                                                1.70 %
Copula-like without rotations                                             1.78 %
Copula-like with IAF                                                      2.04 %
Independent copula with IAF                                               2.88 %
Independent copula with rotations                                         2.90 %
Mean-field Gaussian                                                       3.82 %
Copula-like without rotations and delta_i = 0.99 for all i in {1, ..., d} 5.70 %

We performed classification on MNIST using a 2-hidden layer fully-connected network where the hidden layers are of size 200 each. Further algorithmic details are given in Appendix D. Prediction errors for the variational families considered in the preceding experiments are given in Table 5. We again find that a copula-like density outperforms the independent copula. Using a copula-like density without the rotation also performs competitively, as long as one uses a balanced amount of its antithetic component via the transformation H with parameter delta; ignoring the transformation H or setting delta_i = 0.99 for all i in {1, ..., d} can limit the representative power of the variational family and can result in high predictive errors. The neural network function for the IAF considered here has two hidden layers of size 100 x 100.
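As a concrete illustration of the non-centred horseshoe parametrisation above, the following sketch draws one weight matrix from the prior. This is our own illustrative code, not the authors' implementation: the Gamma shape/scale convention (so that G(1/2, 1/b_tau^2) denotes rate 1/b_tau^2), the function names, and holding c fixed are assumptions, whereas the model places an inverse-Gamma prior on c^2.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_horseshoe_weights(d_in, d_out, b_tau=1.0, c=1.0):
    """One draw of W^l under the non-centred regularised horseshoe prior:
    tau = sqrt(tau_hat * kappa) ~ C+(0, b_tau) and
    lambda_i = sqrt(lambda_hat_i * eta_i) ~ C+(0, 1)."""
    kappa = rng.gamma(0.5, b_tau ** 2)           # kappa ~ G(1/2) with rate 1/b_tau^2
    tau_hat = 1.0 / rng.gamma(0.5, 1.0)          # tau_hat ~ IG(1/2, 1)
    tau = np.sqrt(tau_hat * kappa)               # per-layer global scale
    lam_hat = rng.gamma(0.5, 1.0, size=d_in)     # lambda_hat_i ~ G(1/2, 1)
    eta = 1.0 / rng.gamma(0.5, 1.0, size=d_in)   # eta_i ~ IG(1/2, 1)
    lam = np.sqrt(lam_hat * eta)                 # per-input local scales
    # regularised scale: behaves like lambda_i for small weights,
    # soft-truncates large weights at roughly c/tau
    lam_tilde = np.sqrt(c ** 2 * lam ** 2 / (c ** 2 + tau ** 2 * lam ** 2))
    beta = rng.normal(size=(d_in, d_out))        # beta_i ~ N(0, I)
    return tau * lam_tilde[:, None] * beta       # W^l_{i.} = tau * lam_tilde_i * beta_i

W = sample_horseshoe_weights(200, 200)
```

The point of this parametrisation is that sqrt(G(1/2, 1) * IG(1/2, 1)) has a half-Cauchy law, so the latent variables eta, lambda_hat, kappa, tau_hat and beta are a priori independent, which is what the variational approximation targets after log-transforming the (inverse) Gamma variables.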
It can be seen that applying the rotations can be beneficial compared to the IAF for the copula-like density, whereas the two transformations perform similarly for the independent base distribution. We expect that further ad-hoc adjustments can be used to tailor the rotations to a given computational budget. For instance, one could include additional rotations for a group of latent variables such as those within one layer. Conversely, one could consider the series of sparse rotations O^1, ..., O^k, but with 2^k < d, thereby allowing for rotations of the more adjacent latent variables only.

Our experiment illustrates that the proposed approach can be used in high-dimensional structured Bayesian models without having to specify more model-specific dependency assumptions in the variational family. The prediction errors are in line with current work for fully connected networks using a Gaussian variational family with Normal priors, cf. [47]. Better predictive performance for a fully connected Bayesian network has been reported in [37], which uses RealNVP [10] as a normalising flow in a large network that is reparametrised using weight normalization [61]. It becomes scalable by performing variational inference only over the Euclidean norm of W^l_{i.} and point estimation for the direction W^l_{i.}/||W^l_{i.}||_2 of the weight vector. Such a parametrisation does not allow for a flexible dependence structure of the weights within one layer, and such a model architecture should be complementary to the proposed variational family in this work.

7 Conclusion

We have addressed the challenging problem of constructing a family of distributions that allows for some flexibility in its dependence structure, whilst also having a reasonable computational complexity.
It has been shown experimentally that the proposed family can constitute a useful replacement for a Gaussian approximation without requiring many algorithmic changes.

Acknowledgements

Alain Durmus acknowledges support from Chaire BayeScale "P. Laffitte" and from Polish National Science Center grant NCN UMO-2018/31/B/ST1/0025. This research has been partly financed by the Alan Turing Institute under the EPSRC grant EP/N510129/1. The authors acknowledge the use of the UCL Myriad High Throughput Computing Facility (Myriad@UCL), and associated support services, in the completion of this work.

References

[1] David Barber and Christopher M Bishop. Ensemble learning for multi-layer networks. In Advances in Neural Information Processing Systems, pages 395-401, 1998.

[2] Tim Bedford and Roger M Cooke. Probability density decomposition for conditionally dependent random variables modeled by vines. Annals of Mathematics and Artificial Intelligence, 32(1-4):245-268, 2001.

[3] Rianne van den Berg, Leonard Hasenclever, Jakub M Tomczak, and Max Welling. Sylvester normalizing flows for variational inference. arXiv preprint arXiv:1803.05649, 2018.

[4] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859-877, 2017.

[5] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In Proceedings of The 32nd International Conference on Machine Learning, pages 1613-1622, 2015.

[6] Carlos M Carvalho, Nicholas G Polson, and James G Scott. The horseshoe estimator for sparse signals. Biometrika, 97(2):465-480, 2010.

[7] Tim R Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M Tomczak. Hyperspherical variational auto-encoders.
arXiv preprint arXiv:1804.00891, 2018.

[8] Petros Dellaportas and Mike G Tsionas. Importance sampling from posterior distributions using copula-like approximations. Journal of Econometrics, 2018.

[9] Joshua V Dillon, Ian Langmore, Dustin Tran, Eugene Brevdo, Srinivas Vasudevan, Dave Moore, Brian Patton, Alex Alemi, Matt Hoffman, and Rif A Saurous. Tensorflow distributions. arXiv preprint arXiv:1711.10604, 2017.

[10] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. arXiv preprint arXiv:1605.08803, 2016.

[11] Jeffrey Dissmann, Eike C Brechmann, Claudia Czado, and Dorota Kurowicka. Selecting and estimating regular vine copulae and application to financial returns. Computational Statistics & Data Analysis, 59:52-69, 2013.

[12] Simon Duane, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216-222, 1987.

[13] Kai Wang Fang. Symmetric Multivariate and Related Distributions. Chapman and Hall/CRC, 2017.

[14] Mikhail Figurnov, Shakir Mohamed, and Andriy Mnih. Implicit reparameterization gradients. In Advances in Neural Information Processing Systems, pages 441-452, 2018.

[15] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

[16] Alan Genz. Methods for generating random orthogonal matrices. Monte Carlo and Quasi-Monte Carlo Methods, pages 199-213, 1998.

[17] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881-889, 2015.

[18] Samuel J Gershman, Matthew D Hoffman, and David M Blei. Nonparametric variational inference.
In Proceedings of the 29th International Conference on Machine Learning, pages 235-242. Omnipress, 2012.

[19] Soumya Ghosh and Finale Doshi-Velez. Model selection in Bayesian neural networks via horseshoe priors. arXiv preprint arXiv:1705.10388, 2017.

[20] Soumya Ghosh, Jiayu Yao, and Finale Doshi-Velez. Structured variational learning of Bayesian neural networks with horseshoe priors. arXiv preprint arXiv:1806.05975, 2018.

[21] Gene H Golub and Charles F Van Loan. Matrix Computations, volume 3. JHU Press, 2012.

[22] Fangjian Guo, Xiangyu Wang, Kai Fan, Tamara Broderick, and David B Dunson. Boosting variational inference. arXiv preprint arXiv:1611.05559, 2016.

[23] Shaobo Han, Xuejun Liao, David Dunson, and Lawrence Carin. Variational Gaussian copula inference. In Artificial Intelligence and Statistics, pages 829-838, 2016.

[24] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303-1347, 2013.

[25] J Hron, De G Matthews, Z Ghahramani, et al. Variational Bayesian dropout: Pitfalls and fixes. In 35th International Conference on Machine Learning, ICML 2018, volume 5, pages 3199-3219, 2018.

[26] Ferenc Huszár. Variational inference using implicit distributions. arXiv preprint arXiv:1702.08235, 2017.

[27] John Ingraham and Debora Marks. Variational inference for sparse and undirected models. In International Conference on Machine Learning, pages 1607-1616, 2017.

[28] Tommi Jaakkola and Michael Jordan. A variational approach to Bayesian logistic regression models and their extensions. In Sixth International Workshop on Artificial Intelligence and Statistics, volume 82, page 4, 1997.

[29] Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models.
Machine Learning, 37(2):183-233, 1999.

[30] Mohamad A Khaled and Robert Kohn. On approximating copulas by finite mixtures. arXiv preprint arXiv:1705.10440, 2017.

[31] Mohammad Emtiyaz Khan, Aleksandr Aravkin, Michael Friedlander, and Matthias Seeger. Fast dual variational inference for non-conjugate latent Gaussian models. In International Conference on Machine Learning, pages 951-959, 2013.

[32] Mohammad Emtiyaz Khan, Didrik Nielsen, Voot Tangkaratt, Wu Lin, Yarin Gal, and Akash Srivastava. Fast and scalable Bayesian deep learning by weight-perturbation in Adam. In International Conference on Machine Learning, pages 2616-2625, 2018.

[33] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[34] Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743-4751, 2016.

[35] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2014.

[36] Ioannis Kosmidis and Dimitris Karlis. Model-based clustering using copulas with applications. Statistics and Computing, 26(5):1079-1099, 2016.

[37] David Krueger, Chin-Wei Huang, Riashat Islam, Ryan Turner, Alexandre Lacoste, and Aaron Courville. Bayesian hypernetworks. arXiv preprint arXiv:1710.04759, 2017.

[38] Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M Blei. Automatic differentiation variational inference. The Journal of Machine Learning Research, 18(1):430-474, 2017.

[39] Francesco Locatello, Gideon Dresdner, Rajiv Khanna, Isabel Valera, and Gunnar Rätsch. Boosting black box variational inference.
In Advances in Neural Information Processing Systems, pages 3401-3411, 2018.

[40] Francesco Locatello, Rajiv Khanna, Joydeep Ghosh, and Gunnar Rätsch. Boosting variational inference: an optimization perspective. In International Conference on Artificial Intelligence and Statistics, pages 464-472, 2018.

[41] Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pages 3290-3300, 2017.

[42] Christos Louizos and Max Welling. Structured and efficient variational deep learning with matrix Gaussian posteriors. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

[43] Christos Louizos and Max Welling. Multiplicative normalizing flows for variational Bayesian neural networks. In International Conference on Machine Learning, pages 2218-2227, 2017.

[44] Michael Mathieu and Yann LeCun. Fast approximation of rotations and Hessians matrices. arXiv preprint arXiv:1404.7195, 2014.

[45] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In International Conference on Machine Learning (ICML), 2017.

[46] Andrew C Miller, Nicholas J Foti, and Ryan P Adams. Variational boosting: Iteratively refining posterior approximations. In International Conference on Machine Learning, pages 2420-2429, 2017.

[47] Aaron Mishkin, Frederik Kunstner, Didrik Nielsen, Mark Schmidt, and Mohammad Emtiyaz Khan. SLANG: Fast structured covariance approximations for Bayesian deep learning with natural gradient. In Advances in Neural Information Processing Systems, pages 6246-6256, 2018.

[48] David S Moore and Marcus C Spruill. Unified large-sample theory of general chi-squared statistics for tests of fit.
The Annals of Statistics, pages 599-616, 1975.

[49] Marina Munkhoeva, Yermek Kapushev, Evgeny Burnaev, and Ivan Oseledets. Quadrature-based features for kernel approximation. In Advances in Neural Information Processing Systems, pages 9165-9174, 2018.

[50] Kevin P Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.

[51] Radford M Neal et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.

[52] Sarah E Neville, John T Ormerod, MP Wand, et al. Mean field variational Bayes for continuous sparse signal shrinkage: pitfalls and remedies. Electronic Journal of Statistics, 8(1):1113-1151, 2014.

[53] Victor M-H Ong, David J Nott, and Michael S Smith. Gaussian variational approximation with a factor covariance structure. Journal of Computational and Graphical Statistics, 27(3):465-478, 2018.

[54] Manfred Opper and Cédric Archambeau. The variational Gaussian approximation revisited. Neural Computation, 21(3):786-792, 2009.

[55] Omiros Papaspiliopoulos, Gareth O Roberts, and Martin Skold. Non-centred parameterisations for hierarchical models and data augmentation. Bayesian Statistics, 7, 2003.

[56] Juho Piironen, Aki Vehtari, et al. Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11(2):5018-5051, 2017.

[57] Rajesh Ranganath, Sean Gerrish, and David M Blei. Black box variational inference. In AISTATS, pages 814-822, 2014.

[58] Rajesh Ranganath, Dustin Tran, and David M Blei. Hierarchical variational models. In International Conference on Machine Learning, 2016.

[59] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Proceedings of The 32nd International Conference on Machine Learning, pages 1530-1538, 2015.

[60] Danilo J Rezende, Shakir Mohamed, and Daan Wierstra.
Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1278-1286, 2014.

[61] Tim Salimans and Durk P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pages 901-909, 2016.

[62] M Sklar. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris, 8:229-231, 1959.

[63] Michalis Titsias and Miguel Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1971-1979, 2014.

[64] Michalis K Titsias and Francisco Ruiz. Unbiased implicit variational inference. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 167-176, 2019.

[65] Jakub M Tomczak and Max Welling. Improving variational auto-encoders using Householder flow. arXiv preprint arXiv:1611.09630, 2016.

[66] Dustin Tran, David Blei, and Edo M Airoldi. Copula variational inference. In Advances in Neural Information Processing Systems, pages 3564-3572, 2015.

[67] Dustin Tran, Rajesh Ranganath, and David M Blei. Deep and hierarchical implicit models. arXiv preprint arXiv:1702.08896, 2017.

[68] Martin J Wainwright and Michael I Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.

[69] Mingzhang Yin and Mingyuan Zhou. Semi-implicit variational inference. In International Conference on Machine Learning, pages 5646-5655, 2018.

[70] Guodong Zhang, Shengyang Sun, David Duvenaud, and Roger Grosse. Noisy natural gradient as variational inference.
In International Conference on Machine Learning, pages 5847-5856, 2018.