{"title": "Variational Inference with Tail-adaptive f-Divergence", "book": "Advances in Neural Information Processing Systems", "page_first": 5737, "page_last": 5747, "abstract": "Variational inference with \u03b1-divergences has been widely used in modern probabilistic\nmachine learning. Compared to Kullback-Leibler (KL) divergence, a major\nadvantage of using \u03b1-divergences (with positive \u03b1 values) is their mass-covering\nproperty. However, estimating and optimizing \u03b1-divergences require to use importance\nsampling, which could have extremely large or infinite variances due\nto heavy tails of importance weights. In this paper, we propose a new class of\ntail-adaptive f-divergences that adaptively change the convex function f with the\ntail of the importance weights, in a way that theoretically guarantee finite moments,\nwhile simultaneously achieving mass-covering properties. We test our methods\non Bayesian neural networks, as well as deep reinforcement learning in which our\nmethod is applied to improve a recent soft actor-critic (SAC) algorithm (Haarnoja\net al., 2018). Our results show that our approach yields significant advantages\ncompared with existing methods based on classical KL and \u03b1-divergences.", "full_text": "Variational Inference with Tail-adaptive f-Divergence\n\nDilin Wang\nUT Austin\n\ndilin@cs.utexas.edu\n\nHao Liu \u2217\nUESTC\n\nuestcliuhao@gmail.com\n\nQiang Liu\nUT Austin\n\nlqiang@cs.utexas.edu\n\nAbstract\n\nVariational inference with \u03b1-divergences has been widely used in modern proba-\nbilistic machine learning. Compared to Kullback-Leibler (KL) divergence, a major\nadvantage of using \u03b1-divergences (with positive \u03b1 values) is their mass-covering\nproperty. However, estimating and optimizing \u03b1-divergences require to use im-\nportance sampling, which may have large or in\ufb01nite variance due to heavy tails\nof importance weights. 
In this paper, we propose a new class of tail-adaptive f-divergences that adaptively change the convex function f with the tail distribution of the importance weights, in a way that theoretically guarantees finite moments, while simultaneously achieving mass-covering properties. We test our method on Bayesian neural networks, and apply it to improve a recent soft actor-critic (SAC) algorithm (Haarnoja et al., 2018) in deep reinforcement learning. Our results show that our approach yields significant advantages compared with existing methods based on classical KL and α-divergences.\n\n1 Introduction\n\nVariational inference (VI) (e.g., Jordan et al., 1999; Wainwright et al., 2008) has been established as a powerful tool in modern probabilistic machine learning for approximating intractable posterior distributions. The basic idea is to turn the approximation problem into an optimization problem, which finds the best approximation of an intractable distribution from a family of tractable distributions by minimizing a divergence objective function. Compared with Markov chain Monte Carlo (MCMC), which is known to be consistent but suffers from slow convergence, VI gives biased results but is often much faster in practice. Combined with techniques like stochastic optimization (Ranganath et al., 2014; Hoffman et al., 2013) and the reparameterization trick (Kingma & Welling, 2014), VI has become a major technical approach for advancing Bayesian deep learning, deep generative models and deep reinforcement learning (e.g., Kingma & Welling, 2014; Gal & Ghahramani, 2016; Levine, 2018).\nA key component of successful variational inference lies in choosing a proper divergence metric. Typically, closeness is defined by the KL divergence KL(q || p) (e.g., Jordan et al., 1999), where p is the intractable distribution of interest and q is a simpler distribution constructed to approximate p. 
However, VI with KL divergence often under-estimates the variance and may miss important local modes of the true posterior (e.g., Christopher, 2016; Blei et al., 2017). To mitigate this issue, alternative metrics have been studied in the literature, a large portion of which are special cases of the f-divergence (e.g., Csiszár & Shields, 2004):\n\nD_f(p || q) = E_{x∼q}[ f(p(x)/q(x)) ] − f(1),    (1)\n\nwhere f : R+ → R is any convex function. The most notable class of f-divergence that has been exploited in VI is the α-divergence, which takes f(t) = t^α/(α(α − 1)) for α ∈ R \\ {0, 1}. By choosing different α, we obtain a large number of well-known divergences as special cases, including the standard KL divergence objective KL(q || p) (α → 0), the KL divergence in the reverse direction KL(p || q) (α → 1), and the χ² divergence (α = 2).\n\n∗Work done at UT Austin\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\n
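For intuition, the definition in (1) can be estimated by plain Monte Carlo from samples of q. The sketch below is our own minimal illustration (all function and variable names are ours, not from the paper); it checks the estimator against the closed-form KL(q || p) = 1/2 of two unit-variance Gaussians.

```python
import numpy as np

def f_divergence_mc(f, log_p, log_q, x):
    """Plain Monte Carlo estimate of Eq (1):
    D_f(p || q) = E_{x~q}[ f(p(x)/q(x)) ] - f(1), given samples x ~ q."""
    ratio = np.exp(log_p(x) - log_q(x))   # density ratio p(x)/q(x)
    return np.mean(f(ratio)) - f(1.0)

# Worked check: KL(q || p) corresponds to f(t) = -log t.  For p = N(1, 1)
# and q = N(0, 1), the exact value is KL(q || p) = 1/2.
log_p = lambda x: -0.5 * (x - 1.0) ** 2 - 0.5 * np.log(2 * np.pi)
log_q = lambda x: -0.5 * x ** 2 - 0.5 * np.log(2 * np.pi)
x = np.random.default_rng(0).normal(size=200_000)   # x ~ q
kl_qp = f_divergence_mc(lambda t: -np.log(t), log_p, log_q, x)
# kl_qp is close to 0.5
```

The same function with f(t) = t log t estimates KL(p || q), which for this symmetric pair of Gaussians also equals 1/2.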
In particular, the use of general α-divergences in VI has been widely discussed (e.g., Minka et al., 2005; Hernández-Lobato et al., 2016; Li & Turner, 2016); the reverse KL divergence is used in expectation propagation (Minka, 2001; Opper & Winther, 2005), importance weighted auto-encoders (Burda et al., 2016), and the cross entropy method (De Boer et al., 2005); the χ²-divergence is exploited for VI (e.g., Dieng et al., 2017), but is more extensively studied in the context of adaptive importance sampling (IS) (e.g., Cappé et al., 2008; Ryu & Boyd, 2014; Cotter et al., 2015), since it coincides with the variance of the IS estimator with q as the proposal.\nA major motivation for using the α-divergence is its mass-covering property: when α > 0, the optimal approximation q tends to cover more modes of p, and hence better accounts for the uncertainty in p. Typically, larger values of α enforce stronger mass-covering properties. In practice, however, the α-divergence and its gradient need to be estimated empirically using samples from q. Using large α values may cause high or infinite variance in the estimation, because it involves estimating the α-th power of the density ratio p(x)/q(x), which is likely distributed with a heavy or fat tail (e.g., Resnick, 2007). In fact, when q is very different from p, the expectation of the ratio (p(x)/q(x))^α can be infinite (that is, the α-divergence does not exist). This makes it problematic to use large α values, despite the mass-covering property they promise. In addition, it is reasonable to expect that the optimal setting of α should vary across training processes and learning tasks. 
Therefore, it is desirable to design an approach that chooses α adaptively and automatically as q changes during the training iterations, according to the distribution of the ratio p(x)/q(x).\nBased on theoretical observations on f-divergences and fat-tailed distributions, we design a new class of f-divergences that is tail-adaptive in that it uses different f functions according to the tail distribution of the density ratio p(x)/q(x), to simultaneously obtain stable empirical estimation and the strongest possible mass-covering property. This allows us to derive a new adaptive f-divergence-based variational inference method by combining it with stochastic optimization and reparameterization gradient estimates. Our main method (Algorithm 1) has a simple form: it replaces the f function in (1) with a rank-based function of the empirical density ratios w = p(x)/q(x) at each gradient descent step on q, whose variation depends on the distribution of w and does not explode regardless of the tail of w.\nEmpirically, we show that our method can better recover multiple modes in variational inference. In addition, we apply our method to improve a recent soft actor-critic (SAC) algorithm (Haarnoja et al., 2018) in reinforcement learning (RL), showing that our method can be used to optimize multi-modal loss functions in RL more efficiently.\n\n2 f-Divergence and Friends\n\nGiven a distribution p(x) of interest, we want to approximate it with a simpler distribution from a family {q_θ(x) : θ ∈ Θ}, where θ is the variational parameter that we want to optimize. We approach this problem by minimizing the f-divergence between q_θ and p:\n\nmin_{θ∈Θ} D_f(p || q_θ) = E_{x∼q_θ}[ f(p(x)/q_θ(x)) ] − f(1),    (2)\n\nwhere f : R+ → R is any twice differentiable convex function. 
It can be shown by Jensen’s inequality that D_f(p || q) ≥ 0 for any p and q. Further, if f(t) is strictly convex at t = 1, then D_f(p || q) = 0 implies p = q. The optimization in (2) can be solved approximately in practice using stochastic optimization, by approximating the expectation E_{x∼q_θ}[·] with samples drawn from q_θ at each iteration.\nThe f-divergence includes a large spectrum of important divergence measures. It includes the KL divergence in both directions,\n\nKL(q || p) = E_{x∼q}[ log(q(x)/p(x)) ],    KL(p || q) = E_{x∼q}[ (p(x)/q(x)) log(p(x)/q(x)) ],    (3)\n\nwhich correspond to f(t) = − log t and f(t) = t log t, respectively. KL(q || p) is the typical objective function used in variational inference; the reverse direction KL(p || q) is also used in various settings (e.g., Minka, 2001; Opper & Winther, 2005; De Boer et al., 2005; Burda et al., 2016).\nMore generally, the f-divergence includes the class of α-divergences, which take f_α(t) = t^α/(α(α − 1)), α ∈ R \\ {0, 1}, and hence\n\nD_{f_α}(p || q) = (1/(α(α − 1))) ( E_{x∼q}[ (p(x)/q(x))^α ] − 1 ).    (4)\n\nOne can show that KL(q || p) and KL(p || q) are the limits of D_{f_α}(p || q) when α → 0 and α → 1, respectively. Further, one obtains the Hellinger distance and the χ²-divergence with α = 1/2 and α = 2, respectively. 
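As a worked illustration of (4) (ours, not from the paper), the α-divergence between two unit-variance Gaussians can be estimated by Monte Carlo and compared against its closed form; letting α approach 0 recovers KL(q || p) = 1/2 numerically.

```python
import numpy as np

def alpha_div_mc(alpha, w):
    """Monte Carlo estimate of Eq (4) from importance weights w = p(x)/q(x),
    x ~ q:  D_alpha = (E_q[w**alpha] - 1) / (alpha * (alpha - 1))."""
    return (np.mean(w ** alpha) - 1.0) / (alpha * (alpha - 1.0))

# For p = N(1, 1) and q = N(0, 1), the ratio is w = exp(x - 1/2), and
# E_q[w**alpha] = exp((alpha**2 - alpha) / 2) in closed form.
rng = np.random.default_rng(1)
x = rng.normal(size=400_000)        # x ~ q
w = np.exp(x - 0.5)                 # density ratio p(x)/q(x)

alpha = 0.5
exact = (np.exp((alpha**2 - alpha) / 2) - 1.0) / (alpha * (alpha - 1.0))
est = alpha_div_mc(alpha, w)
# est matches `exact`; small alpha (e.g. 0.01) approaches KL(q || p) = 1/2
```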
In particular, the χ²-divergence (α = 2) plays an important role in adaptive importance sampling, because it equals the variance of the importance weight w = p(x)/q(x), and minimizing the χ²-divergence corresponds to finding an optimal importance sampling proposal.\n\n3 α-Divergence and Fat Tails\n\nA major motivation for using α-divergences as the objective function for approximate inference is their mass-covering property (also known as the zero-avoiding behavior). This is because the α-divergence is proportional to the α-th moment of the density ratio p(x)/q(x). When α is positive and large, large values of p(x)/q(x) are strongly penalized, preventing the case of q(x) ≪ p(x). In fact, whenever D_{f_α}(p || q) < ∞, p(x) > 0 implies q(x) > 0. This means that the probability mass and local modes of p are properly taken into account in q.\nNote that the case α ≤ 0 exhibits the opposite property, that is, p(x) = 0 must imply q(x) = 0 to make D_{f_α}(p || q) finite when α ≤ 0; this includes the typical KL divergence KL(q || p) (α → 0), which is often criticized for its tendency to under-estimate the uncertainty.\nTypically, using larger values of α enforces stronger mass-covering properties. In practice, however, larger values of α also increase the variance of the empirical estimators, making the optimization highly challenging. In fact, the expectation in (4) may not even exist when α is too large. 
This is because the density ratio w := p(x)/q(x) often has a fat-tailed distribution.\nA non-negative random variable w is called fat-tailed² (e.g., Resnick, 2007) if its tail probability F̄_w(t) := Pr(w ≥ t) is asymptotically equivalent to t^{−α∗} as t → +∞ for some finite positive number α∗ (denoted by F̄_w(t) ∼ t^{−α∗}), which means that\n\nF̄_w(t) = t^{−α∗} L(t),\n\nwhere L is a slowly varying function that satisfies lim_{t→+∞} L(ct)/L(t) = 1 for any c > 0. Here α∗ determines the fatness of the tail and is called the tail index of w. For a fat-tailed distribution with index α∗, its α-th moment exists only if α < α∗, that is, E[w^α] < ∞ iff α < α∗. It turns out that the density ratio w := p(x)/q(x), when x ∼ q, tends to have a fat-tailed distribution when q is more peaked than p. The example below illustrates this with simple Gaussian distributions.\nExample 3.1. Assume p(x) = N(x; 0, σ_p^2) and q(x) = N(x; 0, σ_q^2). Let x ∼ q and w = p(x)/q(x) be the density ratio. If σ_p > σ_q, then w has a fat-tailed distribution with index α∗ = σ_p^2/(σ_p^2 − σ_q^2). On the other hand, if σ_p ≤ σ_q, then w is bounded and not fat-tailed (effectively, α∗ = +∞).\nBy the definition above, if the importance weight w = p(x)/q(x) has tail index α∗, the α-divergence D_{f_α}(p || q) exists only if α < α∗. Although it is desirable to use the α-divergence with large values of α as the VI objective function, it is important to keep α smaller than α∗ to ensure that the objective and gradient are well defined. The problem, however, is that the tail index α∗ is unknown in practice, and may change dramatically (e.g., even from finite to infinite) as q is updated during the optimization process. This makes it suboptimal to use a pre-fixed α value. 
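Example 3.1 can be checked by simulation. The sketch below is entirely our own illustration (not from the paper): it draws x ∼ q, forms the importance weights in log space, and estimates the tail index with the classical Hill estimator, which should land in the rough vicinity of α∗ = σ_p²/(σ_p² − σ_q²) = 4/3.

```python
import numpy as np

# Example 3.1 in simulation: p = N(0, sigma_p^2), q = N(0, sigma_q^2) with
# sigma_p > sigma_q makes w = p(x)/q(x), x ~ q, fat-tailed with index
# alpha* = sigma_p^2 / (sigma_p^2 - sigma_q^2).
rng = np.random.default_rng(0)
sigma_p, sigma_q = 2.0, 1.0
alpha_star = sigma_p**2 / (sigma_p**2 - sigma_q**2)     # = 4/3 here

n = 200_000
x = rng.normal(0.0, sigma_q, size=n)                    # x ~ q
log_w = (np.log(sigma_q / sigma_p)
         + 0.5 * x**2 * (1.0 / sigma_q**2 - 1.0 / sigma_p**2))
w = np.exp(log_w)                                       # E[w] = 1, but E[w**2] = inf

# Hill estimator of the tail index from the k largest weights; it lands near
# alpha* (typically slightly above, because the tail of a Gaussian ratio also
# carries a slowly varying logarithmic factor).
k = 2000
top = np.sort(log_w)[-(k + 1):]
hill = 1.0 / np.mean(top[1:] - top[0])
```

Since α∗ = 4/3 here, the first moment E[w] exists (the sample mean hovers near 1) while the second does not, which is exactly the regime where a pre-fixed α = 2 objective is undefined.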
One potential way to address this problem is to estimate the tail index α∗ empirically at each iteration using a tail index estimator (e.g., Hill et al., 1975; Vehtari et al., 2015). Unfortunately, tail index estimation is often challenging and requires a large number of samples, and the algorithm may become unstable if α∗ is over-estimated.\n\n²Fat-tailed distributions are a sub-class of heavy-tailed distributions, which are distributions whose tail probabilities decay more slowly than exponential functions, that is, lim_{t→+∞} exp(λt) F̄_w(t) = ∞ for all λ > 0.\n\n4 Hessian-based Representation of f-Divergence\n\nIn this work, we address the aforementioned problem by designing a generalization of the f-divergence in which f adaptively changes with p and q, in a way that always guarantees the existence of the expectation, while simultaneously achieving (theoretically) mass-covering as strong as that of the α-divergence with α = α∗.\nOne challenge in designing such an adaptive f is that the convexity constraint on f is difficult to express computationally. Our first key observation is that it is easier to specify a convex function f through its second-order derivative f′′, which can be any non-negative function. It turns out that the f-divergence, as well as its gradient, can be conveniently expressed using f′′, without explicitly defining the original f.\nProposition 4.1. 1) Any twice differentiable convex function f : R+ ∪ {0} → R with finite f(0) can be decomposed into linear and nonlinear components as follows:\n\nf(t) = (at + b) + ∫_0^∞ (t − μ)_+ h(μ) dμ,    (5)\n\nwhere h is a non-negative function, (t)_+ = max(0, t), and a, b ∈ R. In this case, h(t) = f′′(t), a = f′(0) and b = f(0). 
Conversely, any non-negative function h and a, b ∈ R specify a convex function f via (5).\n2) This allows us to derive an alternative representation of the f-divergence:\n\nD_f(p || q) = ∫_0^∞ f′′(μ) E_{x∼q}[ (p(x)/q(x) − μ)_+ ] dμ − c,    (6)\n\nwhere c := ∫_0^1 f′′(μ)(1 − μ) dμ = f(1) − f(0) − f′(0) is a constant.\nProof. If f(t) = (at + b) + ∫_0^∞ (t − μ)_+ h(μ) dμ, direct calculation shows\n\nf′(t) = a + ∫_0^t h(μ) dμ,    f′′(t) = h(t).\n\nTherefore, f is convex iff h is non-negative. See Appendix for the complete proof.\nEq (6) shows that all f-divergences are conical combinations of a set of special f-divergences of the form E_{x∼q}[(p(x)/q(x) − μ)_+ − f(1)] with f(t) = (t − μ)_+. Also, every f-divergence is completely specified by the Hessian f′′, meaning that adding any linear function at + b to f does not change D_f(p || q). Such an integral representation of the f-divergence is not new; see, e.g., Feldman & Osterreicher (1989); Osterreicher (2003); Liese & Vajda (2006); Reid & Williamson (2011); Sason (2018).\nFor the purpose of minimizing D_f(p || q_θ) (θ ∈ Θ) in variational inference, we are more concerned with calculating the gradient than the f-divergence itself. It turns out that the gradient of D_f(p || q_θ) is also directly related to the Hessian f′′ in a simple way.\nProposition 4.2. 1) Assume log q_θ(x) is differentiable w.r.t. θ, and f is a differentiable convex function. 
For the f-divergence defined in (2), we have\n\n∇_θ D_f(p || q_θ) = −E_{x∼q_θ}[ ρ_f(p(x)/q_θ(x)) ∇_θ log q_θ(x) ],    (7)\n\nwhere ρ_f(t) = f′(t)t − f(t) (equivalently, ρ′_f(t) = f′′(t)t if f is twice differentiable).\n2) Assume x ∼ q_θ is generated by x = g_θ(ξ), where ξ ∼ q_0 is a random seed and g_θ is a function that is differentiable w.r.t. θ. Assume f is twice differentiable and ∇_x log(p(x)/q_θ(x)) exists. We have\n\n∇_θ D_f(p || q_θ) = −E_{x=g_θ(ξ), ξ∼q_0}[ γ_f(p(x)/q_θ(x)) ∇_θ g_θ(ξ) ∇_x log(p(x)/q_θ(x)) ],    (8)\n\nwhere γ_f(t) = ρ′_f(t)t = f′′(t)t^2.\nThe result above shows that the gradient of the f-divergence depends on f only through ρ_f or γ_f. Taking the α-divergence (α ∉ {0, 1}) as an example, we have\n\nf(t) = t^α/(α(α − 1)),    ρ_f(t) = t^α/α,    γ_f(t) = t^α,\n\nall of which are proportional to the power function t^α. For KL(q || p), we have f(t) = − log t, yielding ρ_f(t) = log t − 1 and γ_f(t) = 1; for KL(p || q), we have f(t) = t log t, yielding ρ_f(t) = t and γ_f(t) = t.\nThe formulas in (7) and (8) are called the score-function gradient and the reparameterization gradient (Kingma & Welling, 2014), respectively. Both equal the gradient in expectation, but they are computationally different and yield empirical estimators with different variances. In particular, the score-function gradient in (7) is “gradient-free” in that it does not require calculating the gradient of the distribution p(x) of interest, while (8) is “gradient-based” in that it involves ∇_x log p(x). 
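To make the reparameterization form (8) concrete, here is a minimal numpy sketch (our own illustration; all names are ours) for a one-dimensional Gaussian q_θ = N(μ, σ²) with γ_f(t) = t^α, covering the α-divergence family; α = 0 gives γ_f ≡ 1 and recovers the KL(q || p) gradient.

```python
import numpy as np

def repar_grad(mu, log_sigma, log_p, grad_log_p, alpha, n=10_000, seed=0):
    """Monte Carlo version of Eq (8) for q = N(mu, sigma^2), with the
    reparameterization x = g(xi) = mu + sigma * xi and gamma_f(t) = t**alpha.
    Returns the estimated gradient of D_f(p || q) w.r.t. (mu, log_sigma)."""
    rng = np.random.default_rng(seed)
    sigma = np.exp(log_sigma)
    xi = rng.normal(size=n)
    x = mu + sigma * xi
    log_q = -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
    gamma = np.exp(alpha * (log_p(x) - log_q))     # gamma_f(w) = w**alpha
    score = grad_log_p(x) + (x - mu) / sigma**2    # d/dx log(p(x)/q(x))
    d_mu = -np.mean(gamma * score)                 # dg/dmu = 1
    d_log_sigma = -np.mean(gamma * (sigma * xi) * score)   # dg/dlog_sigma = sigma*xi
    return d_mu, d_log_sigma

# Sanity check: alpha = 0 is the KL(q || p) objective.  For p = N(2, 1) and
# q = N(0, 1), the exact gradient is (mu - 2, sigma^2 - 1) = (-2, 0).
log_p = lambda x: -0.5 * (x - 2.0) ** 2 - 0.5 * np.log(2 * np.pi)
grad_log_p = lambda x: -(x - 2.0)
g_mu, g_ls = repar_grad(0.0, 0.0, log_p, grad_log_p, alpha=0.0)
```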
It has been shown that optimizing with reparameterization gradients tends to give better empirical results, because it leverages the gradient information ∇_x log p(x) and yields a lower-variance estimator of the gradient (e.g., Kingma & Welling, 2014).\nOur key observation is that we can directly specify f through any increasing function ρ_f, or any non-negative function γ_f, in the gradient estimators, without explicitly defining f.\nProposition 4.3. Assume f : R+ → R is convex and twice differentiable. Then:\n1) ρ_f in (7) is a monotonically increasing function on R+. In addition, for any differentiable increasing function ρ, there exists a convex function f such that ρ_f = ρ;\n2) γ_f in (8) is non-negative on R+, that is, γ_f(t) ≥ 0, ∀t ∈ R+. In addition, for any non-negative function γ, there exists a convex function f such that γ_f = γ;\n3) if ρ_f(t) is strictly increasing at t = 1 (i.e., ρ′_f(1) > 0), or γ_f(t) is strictly positive at t = 1 (i.e., γ_f(1) > 0), then D_f(p || q) = 0 implies p = q.\nProof. Because f is convex (f′′(t) ≥ 0), we have γ_f(t) = f′′(t)t^2 ≥ 0 and ρ′_f(t) = f′′(t)t ≥ 0 on R+; that is, γ_f is non-negative and ρ_f is increasing on R+. If ρ_f is strictly increasing (or γ_f is strictly positive) at t = 1, then f is strictly convex at t = 1, which guarantees that D_f(p || q) = 0 implies p = q.\nFor a non-negative function γ(t) (or an increasing function ρ(t)) on R+, any convex function f whose second-order derivative equals γ(t)/t^2 (or ρ′(t)/t) satisfies γ_f = γ (resp. 
ρ_f = ρ).\n\n5 Safe f-Divergence with Inverse Tail Probability\n\nThe results above show that it is sufficient to find an increasing function ρ_f, or a non-negative function γ_f, to obtain an adaptive f-divergence with computable gradients. In order to make the f-divergence “safe”, we need to find ρ_f or γ_f that adaptively depend on p and q such that the expectations in (7) and (8) always exist. Because the magnitudes of ∇_θ log q_θ(x), ∇_x log(p(x)/q_θ(x)) and ∇_θ g_θ(ξ) are relatively small compared with the ratio p(x)/q(x), we mainly consider designing a function ρ (or γ) that yields a finite expectation E_{x∼q}[ρ(p(x)/q(x))] < ∞; meanwhile, we should also keep the function large, preferably of the same magnitude as t^{α∗}, to provide a strong mode-covering property. As it turns out, the inverse of the tail probability naturally achieves all these goals.\nProposition 5.1. For any random variable w with tail distribution F̄_w(t) := Pr(w ≥ t) and tail index α∗, we have\n\nE[F̄_w(w)^β] < ∞ for any β > −1.\n\nAlso, we have F̄_w(t)^β ∼ t^{−βα∗}, and F̄_w(t)^β is always non-negative and monotonically increasing when β < 0.\nProof. Simply note that E[F̄_w(w)^β] = ∫ F̄_w(t)^β dF_w(t) = ∫_0^1 t^β dt, which is finite only when β > −1. The non-negativity and monotonicity of F̄_w(t)^β are obvious. F̄_w(t)^β ∼ t^{−βα∗} directly follows from the definition of the tail index.\nThis motivates us to use F̄_w(t)^β to define ρ_f or γ_f, yielding two versions of “safe” tail-adaptive f-divergences. Note that here f is defined implicitly through ρ_f or γ_f. 
Although it is possible to derive the corresponding f and D_f(p || q), there is no computational need to do so, since optimizing the objective function only requires calculating the gradient, which is defined by ρ_f or γ_f.\n\nAlgorithm 1: Variational Inference with Tail-adaptive f-Divergence (with Reparameterization Gradient)\nGoal: find the best approximation of p(x) from {q_θ : θ ∈ Θ}. Assume x ∼ q_θ is generated by x = g_θ(ξ), where ξ is a random sample from a noise distribution q_0.\nInitialize θ; set an index β (e.g., β = −1).\nfor each iteration do\n  Draw {x_i}_{i=1}^n ∼ q_θ, generated by x_i = g_θ(ξ_i).\n  Let w_i = p(x_i)/q_θ(x_i), let F̂_w(t) = (1/n) Σ_{j=1}^n I(w_j ≥ t), and set γ_i = F̂_w(w_i)^β.\n  Update θ ← θ + ε Δθ, where ε is the step size and\n    Δθ = (1/z_γ) Σ_{i=1}^n γ_i ∇_θ g_θ(ξ_i) ∇_x log(p(x_i)/q_θ(x_i)),  with z_γ = Σ_{i=1}^n γ_i.\nend for\n\nIn practice, the explicit form of F̄_w(t)^β is unknown. We can approximate it based on empirical data drawn from q. Let {x_i} be drawn from q and w_i = p(x_i)/q(x_i); then we can approximate the tail probability with F̂_w(t) = (1/n) Σ_i I(w_i ≥ t). Intuitively, this corresponds to assigning each data point a weight according to the rank of its density ratio in the population. Substituting the empirical tail probability into the reparameterization gradient formula in (8) and running gradient descent with stochastic approximation yields our main algorithm, shown in Algorithm 1. The version with the score-function gradient is similar and is shown in Algorithm 2 in the Appendix. 
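The per-sample weight computation at the heart of Algorithm 1 is only a few lines. The numpy sketch below is our own rendering of that weighting step (not the authors' released code); note that only log density ratios are needed, since the empirical tail probability depends on the w_i through their ranks alone.

```python
import numpy as np

def tail_adaptive_weights(log_w, beta=-1.0):
    """Rank-based weights from Algorithm 1: gamma_i = hatF(w_i)**beta with the
    empirical tail probability hatF(t) = (1/n) * #{j : w_j >= t}, then
    self-normalized.  Working with log ratios avoids overflow for huge w."""
    log_w = np.asarray(log_w, dtype=float)
    n = log_w.size
    tail = np.array([(log_w >= v).sum() for v in log_w]) / n   # hatF(w_i)
    gamma = tail ** beta
    return gamma / gamma.sum()

# With beta = -1, gamma_i is proportional to n / rank(w_i), so the largest
# weight can exceed the smallest by at most a factor of n -- in contrast to
# the unbounded w_i**alpha weights of the alpha-divergence gradient.
g = tail_adaptive_weights(np.log([1.0, 2.0, 3.0, 4.0]))   # -> [0.12, 0.16, 0.24, 0.48]
```

In Algorithm 1, these weights multiply the per-sample reparameterization gradient terms ∇_θ g_θ(ξ_i) ∇_x log(p(x_i)/q_θ(x_i)).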
Both algorithms can be viewed as minimizing the implicitly constructed adaptive f-divergences, but they correspond to different choices of f.\nCompared with typical VI with reparameterization gradients, our method assigns each sample a weight γ_i = F̂_w(w_i)^β, which is proportional to (#w_i)^β, where #w_i denotes the rank of w_i in the population {w_i}. Taking −1 < β < 0 allows us to penalize places with a high ratio p(x)/q(x) while avoiding being overly aggressive. In practice, we find that simply taking β = −1 almost always yields the best empirical performance (despite needing β > −1 theoretically). By comparison, minimizing the classical α-divergence would assign a weight of w_i^α; if α is too large, the weight of a single data point becomes dominant, making the gradient estimate unstable.\n\n6 Experiments\n\nIn this section, we evaluate our adaptive f-divergence on different models. We use reparameterization gradients by default, since they have smaller variance (Kingma & Welling, 2014) and normally yield better performance than score-function gradients. Our code is available at https://github.com/dilinwang820/adaptive-f-divergence.\n\n6.1 Gaussian Mixture Toy Example\n\nWe first illustrate the approximation quality of our proposed adaptive f-divergence on Gaussian mixture models. In this case, we set our target distribution to be a Gaussian mixture p(x) = Σ_{i=1}^k (1/k) N(x; ν_i, 1), for x ∈ R^d, where the elements of each mean vector ν_i are drawn from Uniform([−s, s]). Here s can be viewed as controlling the Gaussianity of the target distribution: p reduces to a standard Gaussian distribution when s = 0 and is increasingly multi-modal as s increases. 
We fix the number of components to k = 10, and initialize the proposal distribution using q(x) = Σ_{i=1}^{20} w_i N(x; μ_i, σ_i^2), where Σ_{i=1}^{20} w_i = 1.\nWe evaluate how well q covers the modes of p using a “mode-shift distance” dist(p, q) := Σ_{i=1}^{10} min_j ||ν_i − μ_j||_2 / 10, which is the average distance of each mode in p to its nearest mode in q. The model is optimized using Adagrad with a constant learning rate of 0.05. We use a minibatch of size 256 to approximate the gradient in each iteration, and train the model for 10,000 iterations. To learn the component weights, we apply the Gumbel-Softmax trick (Jang et al., 2017; Maddison et al., 2017) with a temperature of 0.1. Figure 1 shows the results when we obtain random mixtures p using s = 5, with the dimension d of x equal to 2 and 10, respectively.\n\nFigure 1: (a) plots the mode-shift distance between p and q; (b-c) show the MSE of the mean and variance between the true posterior p and our approximation q, respectively. All results are averaged over 10 random trials.\n\nFigure 2: Results on randomly generated Gaussian mixture models. (a) plots the average mode-shift distance; (b-c) show the MSE of the mean and variance. All results are averaged over 10 random trials.\n\nWe can see that when the dimension is low (d = 2), all algorithms perform similarly well. 
However, as we increase the dimension to 10, our approach with the tail-adaptive f-divergence achieves the best performance.\nTo examine the performance of the variational approximation more closely, we show in Figure 2 the average mode-shift distance and the MSE of the estimated mean and variance as we gradually increase the non-Gaussianity of p(x) by changing s. We fix the dimension to 10. We can see from Figure 2 that when p is close to Gaussian (small s), all algorithms perform well; when p is highly non-Gaussian (large s), our approach with adaptive weights significantly outperforms the other baselines.\n\n6.2 Bayesian Neural Network\n\nWe evaluate our approach on Bayesian neural network regression tasks. The datasets are collected from the UCI dataset repository3. Similarly to Li & Turner (2016), we use a single-layer neural network with 50 hidden units and ReLU activation, except that we take 100 hidden units for the Protein and Year datasets, which are relatively large. We use a fully factorized Gaussian approximation to the true posterior and a Gaussian prior on the neural network weights. All datasets are randomly partitioned into 90% for training and 10% for testing. We use the Adam optimizer (Kingma & Ba, 2015) with a constant learning rate of 0.001. The gradient is approximated by n = 100 draws of x_i ∼ q_θ and a minibatch of size 32 from the training data points. All results are averaged over 20 random partitions, except for Protein and Year, for which 5 trials are repeated.\nWe summarize the average test RMSE and test log-likelihood with standard errors in Table 1. We compare our algorithm with α = 0 (KL divergence) and α = 0.5, which are reported as the best for this task in Li & Turner (2016). 
More comparisons with different choices of α are provided in the appendix. We can see from Table 1 that our approach takes advantage of an adaptive choice of f-divergence and achieves the best performance in terms of both test RMSE and test log-likelihood in most cases.\n\n3https://archive.ics.uci.edu/ml/datasets.html\n\nTable 1: Average test RMSE and log-likelihood for Bayesian neural network regression.\n\ndataset | Test RMSE: α = 0.0 | α = 0.5 | β = −1.0 | Test Log-likelihood: α = 0.0 | α = 0.5 | β = −1.0\nBoston | 2.956 ± 0.171 | 2.990 ± 0.173 | 2.828 ± 0.177 | −2.547 ± 0.171 | −2.506 ± 0.173 | −2.476 ± 0.177\nConcrete | 5.592 ± 0.124 | 5.381 ± 0.111 | 5.371 ± 0.115 | −3.149 ± 0.124 | −3.103 ± 0.111 | −3.099 ± 0.115\nEnergy | 1.431 ± 0.029 | 1.531 ± 0.047 | 1.377 ± 0.034 | −1.795 ± 0.029 | −1.854 ± 0.047 | −1.758 ± 0.034\nKin8nm | 0.085 ± 0.001 | 0.088 ± 0.001 | 0.083 ± 0.001 | 1.012 ± 0.001 | 1.055 ± 0.001 | 1.080 ± 0.001\nNaval | 0.001 ± 0.000 | 0.004 ± 0.000 | 0.001 ± 0.000 | 4.086 ± 0.000 | 5.269 ± 0.000 | 5.468 ± 0.000\nCombined | 4.161 ± 0.034 | 4.154 ± 0.042 | 4.116 ± 0.032 | −2.845 ± 0.034 | −2.843 ± 0.042 | −2.835 ± 0.032\nWine | 0.636 ± 0.008 | 0.634 ± 0.007 | 0.634 ± 0.008 | −0.962 ± 0.008 | −0.971 ± 0.008 | −0.959 ± 0.007\nYacht | 0.861 ± 0.056 | 1.146 ± 0.092 | 0.849 ± 0.059 | −1.751 ± 0.056 | −1.875 ± 0.092 | −1.711 ± 0.059\nProtein | 4.565 ± 0.026 | 4.564 ± 0.040 | 4.487 ± 0.019 | −2.938 ± 0.026 | −2.928 ± 0.040 | −2.921 ± 0.019\nYear | 8.859 ± 0.036 | 8.985 ± 0.042 | 8.831 ± 0.037 | −3.570 ± 0.037 | −3.600 ± 0.036 | −3.518 ± 0.042\n\n6.3 Application in Reinforcement Learning\n\nWe now demonstrate an application of our method in reinforcement learning, applying it as an inner loop to improve a recent soft actor-critic (SAC) algorithm (Haarnoja et al., 2018). We start with a brief introduction to the background of SAC, and then test our method in MuJoCo 4 environments.\n\nReinforcement Learning Background. Reinforcement learning considers the problem of finding optimal policies for agents that interact with uncertain environments to maximize the long-term cumulative reward. This is formally framed as a Markov decision process, in which agents iteratively take actions a based on observable states s, and receive a reward signal r(s, a) immediately following the action a performed at state s. 
The change of the states is governed by an unknown environmental dynamic defined by a transition probability T(s′ | s, a). The agent's action a is selected by a conditional probability distribution π(a|s) called the policy. In policy gradient methods, we consider a set of candidate policies π_θ(a|s) parameterized by θ and obtain the optimal policy by maximizing the expected cumulative reward

J(θ) = E_{s∼d_π, a∼π(a|s)} [r(s, a)],

where d_π(s) = Σ_{t=1}^∞ γ^{t−1} Pr(s_t = s) is the unnormalized discounted state visitation distribution with discount factor γ ∈ (0, 1).

Soft Actor-Critic (SAC) is an off-policy optimization algorithm derived from maximizing the expected reward with an entropy regularization. It iteratively updates a Q-function Q(a, s), which predicts the cumulative reward of taking action a at state s, as well as a policy π(a|s), which selects action a to maximize the expected value of Q(s, a). The update rule of Q(s, a) is based on a variant of Q-learning that matches the Bellman equation, whose details can be found in Haarnoja et al. (2018). At each iteration of SAC, the update of the policy π is achieved by minimizing a KL divergence:

π_new = arg min_π E_{s∼d} [KL(π(·|s) || p_Q(·|s))],    (9)

p_Q(a|s) = exp( (Q(a, s) − V(s)) / τ ),    (10)

where τ is a temperature parameter, and V(s) = τ log ∫_a exp(Q(a, s)/τ) da, serving as a normalization constant here, is a soft version of the value function and is also iteratively updated in SAC. Here, d(s) is a visitation distribution over states s, which is taken to be the empirical distribution of the states in the current replay buffer in SAC.
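As a concrete illustration of (9)-(10), for a single state with a finite action set the target p_Q(·|s) reduces to a softmax over Q(s, ·)/τ, and both the soft value and the KL objective can be computed in closed form. The sketch below is a minimal hypothetical example (the function names and toy Q-values are ours, not from the paper):

```python
import numpy as np

def soft_value(q, tau):
    """Soft value V(s) = tau * log sum_a exp(Q(s, a) / tau), i.e. the log of
    the normalization constant in Eq. (10), computed stably via log-sum-exp."""
    m = q.max()
    return m + tau * np.log(np.exp((q - m) / tau).sum())

def kl_to_soft_target(pi, q, tau):
    """KL(pi(.|s) || p_Q(.|s)) for one state, where
    p_Q(a|s) = exp((Q(a, s) - V(s)) / tau) is a softmax over Q / tau."""
    log_pq = (q - soft_value(q, tau)) / tau   # log p_Q(a|s)
    return float(np.sum(pi * (np.log(pi) - log_pq)))

q = np.array([1.0, 2.0, 0.5])                 # toy Q-values for 3 actions
tau = 0.5
p_q = np.exp((q - soft_value(q, tau)) / tau)  # the target distribution p_Q
```

The KL objective is zero exactly when π equals p_Q, which is why (9) drives the policy toward the soft-max of the Q-function, while a uniform π gives a strictly positive value.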
We can see that (9) can be viewed as a variational inference problem on the conditional distribution p_Q(a|s), with the typical KL objective function (α = 0).

SAC With Tail-adaptive f-Divergence To apply f-divergence, we first rewrite (9) to transform the conditional distributions into joint distributions. Define the joint distributions p_Q(a, s) = exp((Q(a, s) − V(s))/τ) d(s) and q_π(a, s) = π(a|s) d(s); then we can show that E_{s∼d}[KL(π(·|s) || p_Q(·|s))] = KL(q_π || p_Q). This motivates us to extend the objective function in (9) to more general f-divergences,

D_f(p_Q || q_π) = E_{s∼d} E_{a|s∼π} [ f( exp((Q(a, s) − V(s))/τ) / π(a|s) ) − f(1) ].

4 http://www.mujoco.org/

Figure 3: Soft Actor-Critic (SAC) with the policy updated by Algorithm 1 with β = −1, or by α-divergence VI with different α (α = 0 corresponds to the original SAC), on six MuJoCo environments: Ant, HalfCheetah, Humanoid (rllab), Walker, Hopper, and Swimmer (rllab); each panel plots average reward over training. The reparameterization gradient estimator is used in all cases. In the legend, "α = max" denotes setting α = +∞ in the α-divergence.

By using our tail-adaptive f-divergence, we can readily apply Algorithm 1 (or Algorithm 2 in the Appendix) to update π in SAC, allowing us to obtain a π that accounts for the multi-modality of Q(a, s) more efficiently. Note that the standard α-divergence with a fixed α also yields a new variant of SAC that has not yet been studied in the literature.

Empirical Results We follow the experimental setup of Haarnoja et al. (2018). The policy π, the value function V(s), and the Q-function Q(s, a) are neural networks with two fully-connected layers of 128 hidden units each.
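Before the remaining training details, the tail-adaptive reweighting applied inside the policy update can be sketched as follows. This is an illustrative reconstruction of the idea for the β = −1 setting used in the experiments (the helper name and the exact rank-based rule are our assumptions, not the paper's verbatim Algorithm 1): each sample's importance weight is replaced by the empirical tail probability of that weight raised to the power β, which keeps the resulting weights finite even when the raw ratios are heavy-tailed.

```python
import numpy as np

def tail_adaptive_weights(log_w, beta=-1.0):
    """Rank-based reweighting of importance weights (illustrative sketch).

    Instead of using w_i**alpha, which can have unbounded variance under
    heavy-tailed w, each sample is weighted by the empirical tail
    probability Pr(w >= w_i) raised to the power beta, then normalized.
    With beta = -1 the tail probability is at least 1/n, so no weight can
    exceed n before normalization.
    """
    w = np.exp(log_w - log_w.max())                 # stabilized ratios p/q
    tail = (w[None, :] >= w[:, None]).mean(axis=1)  # empirical Pr(w >= w_i)
    lam = tail ** beta
    return lam / lam.sum()

log_w = np.array([0.0, 1.0, 2.0, 10.0])  # one heavy outlier in log space
lam = tail_adaptive_weights(log_w)       # finite, rank-based weights
```

Note that the weights remain monotone in the raw importance weights, so the mass-covering preference for underweighted regions is preserved while the moments stay bounded.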
We use Adam (Kingma & Ba, 2015) with a constant learning rate of 0.0003 for optimization. The size of the replay buffer is 10^7 for HalfCheetah, and we fix it to 10^6 on the other environments, similar to Haarnoja et al. (2018).
We compare with the original SAC (α = 0), as well as other α-divergences, such as α = 0.5 and α = ∞ (the α = max setting suggested in Li & Turner (2016)). Figure 3 summarizes the total average reward of evaluation rollouts during training on various MuJoCo environments. Among the non-negative α settings, methods with larger α give higher average reward than the original KL-based SAC in most of the cases. Overall, our adaptive f-divergence substantially outperforms all other α-divergences on all of the benchmark tasks in terms of final performance, and learns faster than all the baselines in most environments. We find that our improvement is especially significant on high-dimensional and complex environments such as Ant and Humanoid.

7 Conclusion

In this paper, we presented a new class of tail-adaptive f-divergences and explored their application in variational inference and reinforcement learning. Compared to the classic α-divergence, our approach guarantees finite moments of the density ratio and provides more stable importance weights and gradient estimates. Empirical results on Bayesian neural networks and reinforcement learning indicate that our approach outperforms the standard α-divergence, especially for high-dimensional multi-modal distributions.

Acknowledgement

This work is supported in part by NSF CRII 1830161.
We would like to acknowledge Google Cloud for their support.

References

Blei, David M, Kucukelbir, Alp, and McAuliffe, Jon D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.

Burda, Yuri, Grosse, Roger, and Salakhutdinov, Ruslan. Importance weighted autoencoders. International Conference on Learning Representations (ICLR), 2016.

Cappé, Olivier, Douc, Randal, Guillin, Arnaud, Marin, Jean-Michel, and Robert, Christian P. Adaptive importance sampling in general mixture classes. Statistics and Computing, 18(4):447–459, 2008.

Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer-Verlag New York, 2016.

Cotter, Colin, Cotter, Simon, and Russell, Paul. Parallel adaptive importance sampling. arXiv preprint arXiv:1508.01132, 2015.

Csiszár, I. and Shields, P.C. Information theory and statistics: A tutorial. Foundations and Trends in Communications and Information Theory, 1(4):417–528, 2004.

De Boer, Pieter-Tjerk, Kroese, Dirk P, Mannor, Shie, and Rubinstein, Reuven Y. A tutorial on the cross-entropy method. Annals of Operations Research, 134(1):19–67, 2005.

Dieng, Adji Bousso, Tran, Dustin, Ranganath, Rajesh, Paisley, John, and Blei, David. Variational inference via χ upper bound minimization. In Advances in Neural Information Processing Systems (NIPS), pp. 2732–2741, 2017.

Feldman, Dorian and Osterreicher, Ferdinand. A note on f-divergences. Studia Sci. Math. Hungar., 24:191–200, 1989.

Gal, Yarin and Ghahramani, Zoubin. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning (ICML), pp. 1050–1059, 2016.

Haarnoja, Tuomas, Zhou, Aurick, Abbeel, Pieter, and Levine, Sergey. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. International Conference on Machine Learning (ICML), 2018.

Hernández-Lobato, José Miguel, Li, Yingzhen, Rowland, Mark, Hernández-Lobato, Daniel, Bui, Thang, and Turner, Richard Eric. Black-box α-divergence minimization. International Conference on Machine Learning (ICML), 2016.

Hill, Bruce M et al. A simple general approach to inference about the tail of a distribution. The Annals of Statistics, 3(5):1163–1174, 1975.

Hoffman, Matthew D, Blei, David M, Wang, Chong, and Paisley, John. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.

Jang, Eric, Gu, Shixiang, and Poole, Ben. Categorical reparameterization with Gumbel-softmax. International Conference on Learning Representations (ICLR), 2017.

Jordan, Michael I, Ghahramani, Zoubin, Jaakkola, Tommi S, and Saul, Lawrence K. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.

Kingma, Diederik P and Ba, Jimmy. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2015.

Kingma, Diederik P and Welling, Max. Auto-encoding variational Bayes. International Conference on Learning Representations (ICLR), 2014.

Levine, Sergey. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.

Li, Yingzhen and Turner, Richard E. Rényi divergence variational inference. In Advances in Neural Information Processing Systems (NIPS), pp.
1073\u20131081, 2016.\n\nLiese, Friedrich and Vajda, Igor. On divergences and informations in statistics and information theory.\n\nIEEE Transactions on Information Theory, 52(10):4394\u20134412, 2006.\n\nMaddison, Chris J, Mnih, Andriy, and Teh, Yee Whye. The concrete distribution: A continuous\nrelaxation of discrete random variables. International Conference on Learning Representations\n(ICLR), 2017.\n\nMinka, Thomas P. Expectation propagation for approximate Bayesian inference. In Proceedings of\nthe Seventeenth conference on Uncertainty in arti\ufb01cial intelligence (UAI), pp. 362\u2013369. Morgan\n\n10\n\n\fKaufmann Publishers Inc., 2001.\n\nMinka, Tom et al. Divergence measures and message passing. Technical report, Microsoft Research,\n\n2005.\n\nOpper, Manfred and Winther, Ole. Expectation consistent approximate inference. Journal of Machine\n\nLearning Research, 6(Dec):2177\u20132204, 2005.\n\nOsterreicher, Ferdinand. f-divergences\u2014representation theorem and metrizability. Inst. Math., Univ.\n\nSalzburg, Salzburg, Austria, 2003.\n\nRanganath, Rajesh, Gerrish, Sean, and Blei, David. Black box variational inference. In Arti\ufb01cial\n\nIntelligence and Statistics, pp. 814\u2013822, 2014.\n\nReid, Mark D and Williamson, Robert C. Information, divergence and risk for binary experiments.\n\nJournal of Machine Learning Research, 12(Mar):731\u2013817, 2011.\n\nResnick, Sidney I. Heavy-tail phenomena: probabilistic and statistical modeling. Springer Science\n\n& Business Media, 2007.\n\nRyu, Ernest K and Boyd, Stephen P. Adaptive importance sampling via stochastic convex program-\n\nming. arXiv preprint arXiv:1412.4845, 2014.\n\nSason, Igal. On f-divergences: Integral representations, local behavior, and inequalities. Entropy, 20\n\n(5):383, 2018.\n\nVehtari, Aki, Gelman, Andrew, and Gabry, Jonah. Pareto smoothed importance sampling. arXiv\n\npreprint arXiv:1507.02646, 2015.\n\nWainwright, Martin J, Jordan, Michael I, et al. 
Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.