{"title": "Asymptotic optimality of adaptive importance sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 3134, "page_last": 3144, "abstract": "\\textit{Adaptive importance sampling} (AIS) uses past samples to update the \\textit{sampling policy} $q_t$ at each stage $t$. Each stage $t$ is formed with two steps : (i) to explore the space with $n_t$ points according to $q_t$ and (ii) to exploit the current amount of information to update the sampling policy. The very fundamental question raised in this paper concerns the behavior of empirical sums based on AIS. Without making any assumption on the \\textit{allocation policy} $n_t$, the theory developed involves no restriction on the split of computational resources between the explore (i) and the exploit (ii) step. It is shown that AIS is asymptotically optimal : the asymptotic behavior of AIS is the same as some ``oracle'' strategy that knows the targeted sampling policy from the beginning. From a practical perspective, weighted AIS is introduced, a new method that allows to forget poor samples from early stages.", "full_text": "Asymptotic optimality of adaptive importance\n\nsampling\n\nBernard Delyon\n\nIRMAR\n\nUniversity of Rennes 1\n\nbernard.delyon@univ-rennes1.fr\n\nFran\u00e7ois Portier\nT\u00e9l\u00e9com ParisTech\n\nUniversity of Paris-Saclay\n\nfrancois.portier@gmail.com\n\nAbstract\n\nAdaptive importance sampling (AIS) uses past samples to update the sampling\npolicy qt. Each stage t is formed with two steps : (i) to explore the space with nt\npoints according to qt and (ii) to exploit the current amount of information to update\nthe sampling policy. The very fundamental question raised in this paper concerns\nthe behavior of empirical sums based on AIS. Without making any assumption\non the allocation policy nt, the theory developed involves no restriction on the\nsplit of computational resources between the explore (i) and the exploit (ii) step. 
It is shown that AIS is asymptotically optimal: the asymptotic behavior of AIS is the same as that of some "oracle" strategy that knows the targeted sampling policy from the beginning. From a practical perspective, weighted AIS is introduced, a new method that allows poor samples from early stages to be forgotten.

1 Introduction

The adaptive choice of a sampling policy lies at the heart of many fields of Machine Learning where former Monte Carlo experiments guide the forthcoming ones. This includes for instance reinforcement learning [19, 27, 30], where the optimal policy maximizes the reward; inference in Bayesian [6] or graphical models [21]; optimization based on stochastic gradient descent [34] or without using the gradient [18]; and rejection sampling [12]. Adaptive importance sampling (AIS) [25, 2], which extends the basic Monte Carlo integration approach, offers a natural probabilistic framework to describe the evolution of sampling policies. The present paper establishes, under fairly reasonable conditions, that AIS is asymptotically optimal, i.e., learning the sampling policy has no cost asymptotically.

Suppose we are interested in computing some integral $\int \varphi$, where $\varphi : \mathbb{R}^d \to \mathbb{R}$ is called the integrand. The importance sampling estimate of $\int \varphi$ based on the sampling policy $q$ is given by
$$ n^{-1} \sum_{i=1}^{n} \frac{\varphi(x_i)}{q(x_i)}, \qquad (1) $$
where $(x_1, \ldots, x_n) \overset{\text{i.i.d.}}{\sim} q$. The previous estimate is unbiased. It is well known, e.g., [16, 13], that the optimal sampling policy, regarding the variance, is $q$ proportional to $|\varphi|$. A slightly different context where importance sampling still applies is Bayesian estimation.
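Before turning to the Bayesian case, the basic estimator (1) is easy to check numerically. The sketch below is purely illustrative (the Gaussian integrand and the Cauchy sampling policy are our own choices, not ones from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    # Illustrative integrand: exp(-x^2/2), whose integral over R is sqrt(2*pi)
    return np.exp(-0.5 * x**2)

n = 100_000
x = rng.standard_cauchy(n)                 # sampling policy q: standard Cauchy
q = 1.0 / (np.pi * (1.0 + x**2))           # Cauchy density (heavy tails cover phi)
estimate = np.mean(phi(x) / q)             # the unbiased estimate (1)
print(estimate)                            # close to sqrt(2*pi) ~ 2.5066
```

The heavy-tailed policy keeps the weights $\varphi(x_i)/q(x_i)$ bounded here; a policy proportional to $|\varphi|$ would reduce the variance further, as discussed above.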
Here the targeted quantity is $\int \varphi\pi$ and we only have access to an unnormalized version $\pi_u$ of the density $\pi = \pi_u / \int \pi_u$. Estimators usually employed are
$$ \sum_{i=1}^{n} \frac{\varphi(x_i)\,\pi_u(x_i)}{q(x_i)} \Big/ \sum_{i=1}^{n} \frac{\pi_u(x_i)}{q(x_i)}. \qquad (2) $$
In this case, the optimal sampling policy $q$ is proportional to $|\varphi - \int \varphi\pi|\,\pi$ (see [9] or section B.3 in the supplementary material).

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Because appropriate policies naturally depend on $\varphi$ or $\pi$, we generally cannot simulate from them. They are then approximated adaptively, by densities from which we can simulate, using the information gathered from the past stages. This is the very spirit of AIS. At each stage $t$, the value $I_t$, standing for the current estimate, is updated using i.i.d. new samples $x_{t,1}, \ldots, x_{t,n_t}$ from $q_t$, where $q_t$ is a probability density function that might depend on the past stages $1, \ldots, t-1$. The distribution $q_t$, called the sampling policy, targets some optimal, or at least suitable, sampling policy. The sequence $(n_t) \subset \mathbb{N}^*$, called the allocation policy, contains the number of particles generated at each stage. The following algorithm describes the AIS scheme for the classical integration problem. For the Bayesian problem, it suffices to change the estimate according to (2). This is a generic representation of AIS as no explicit update rule is specified (this will be discussed just below).

Algorithm 1 (AIS).
Inputs: The number of stages $T \in \mathbb{N}^*$, the allocation policy $(n_t)_{t=1,\ldots,T} \subset \mathbb{N}^*$, the sampler update procedure, the initial density $q_0$.
Set $S_0 = 0$, $N_0 = 0$. For $t$ in $1, \ldots, T$:
(i) (Explore) Generate $(x_{t,1}, \ldots, x_{t,n_t})$ from $q_{t-1}$
(ii) (Exploit)
  (a) Update the estimate: $S_t = S_{t-1} + \sum_{i=1}^{n_t} \frac{\varphi(x_{t,i})}{q_{t-1}(x_{t,i})}$, $\quad N_t = N_{t-1} + n_t$, $\quad I_t = N_t^{-1} S_t$
  (b) Update the sampler $q_t$

Pioneering works on adaptive schemes include [20], where, within a two-stage procedure, the sampling policy is chosen out of a parametric family; this is further formalized in [14]; [25] introduces the idea of a multi-stage approach where all the previous stages are used to update the sampling policy (see also [29] regarding the choice of the loss function); [26] investigates the use of control variates coupled with importance sampling; the population Monte Carlo approach [3, 2] offers a general framework for AIS and has been further studied using parametric mixtures [8, 9]; see also [5, 32] for a variant called multiple adaptive importance sampling; see [11] for a recent review. In [33, 23], nonparametric importance sampling based on kernel smoothing is introduced. The approach of choosing $q_t$ out of a parametric family should also be contrasted with the nonparametric approach based on particles, often referred to as sequential Monte Carlo [6, 4, 10], whose context is different as traditionally the targeted distribution changes with $t$. The distribution $q_{t-1}$ is then a weighted sum of Dirac masses $\sum_i w_{t-1,i}\,\delta_{x_{t-1,i}}$, and updating $q_t$ follows from adjustment of the weights.

The theoretical properties of adaptive schemes are difficult to derive due to the recycling of the past samples at each stage and hence the lack of independence between samples. Among the updates based on a parametric family, the convergence properties of the Kullback-Leibler divergence between the estimated and the targeted distribution are studied in [8]. Properties related to the asymptotic variance are given in [9].
Among nonparametric updates, [33] establishes fast convergence rates in a two-stage strategy where the number of samples used in each stage goes to infinity. For sequential Monte Carlo, limit theorems are given for instance in [6, 4, 10]. All these results are obtained when $T$ is fixed and $n_T \to \infty$, and therefore miss the true nature of adaptive schemes, for which the asymptotics should be taken with respect to $T$.
Recently, a more realistic asymptotic regime was considered in [22], in which the allocation policy $(n_t)$ is a fixed growing sequence of integers. The authors establish the consistency of the estimate when the update is conducted with respect to a parametric family but depends only on the last stage. They focus on multiple adaptive importance sampling [5, 32], which is different from AIS (see Remark 2 below for more details).
In this paper, following the same spirit as [8, 9, 2], we study parametric AIS as presented in the AIS algorithm when the policy is chosen out of a parametric family of probability density functions. Our analysis focuses on the following three key points, which are new to the best of our knowledge.

• A central limit theorem is established for the AIS estimate $I_t$. It involves high-level conditions on the sampling policy estimate $q_t$ (which will be easily satisfied for parametric updates). Based on the martingale property associated to some sequences of interest, the asymptotics are not with $T$ fixed and $n_T \to \infty$, but with the number of samples $n_1 + \cdots + n_T \to \infty$. In particular, the allocation policy $(n_t)$ is not required to grow to infinity. This is presented in section 2.
• The high-level conditions are verified in the case of parametric sampling policies with updates taking place in a general framework inspired by the paradigm of empirical risk minimization (several concrete examples are provided).
This establishes the asymptotic optimality of AIS in the sense that the rate and the asymptotic variance coincide with those of some "oracle" procedure where the targeted policy is known from the beginning. The details are given in section 3.
• A new method, called weighted AIS (wAIS), is designed in section 4 to eventually forget bad samples drawn during the early stages of AIS. Our numerical experiments show that (i) wAIS accelerates significantly the convergence of AIS and (ii) small allocation policies $(n_t)$ (implying more frequent updates) give better results than large $(n_t)$ (at an equal number of requests to $\varphi$). This last point supports empirically the theoretical framework adopted in the paper.

All the proofs are given in the supplementary material.

2 Central limit theorems for AIS

The aim of this section is to provide conditions on the sampling policy $(q_t)$ under which a central limit theorem holds for AIS and normalized AIS.
For the sake of generality, and because it will be useful in the treatment of normalized estimators, we consider the multivariate case where $\varphi = (\varphi_1, \ldots, \varphi_p) : \mathbb{R}^d \to \mathbb{R}^p$. In the whole paper, $\int \varphi$ is with respect to the Lebesgue measure, $\|\cdot\|$ is the Euclidean norm, and $I_p$ is the identity matrix of size $(p, p)$.
To study the AIS algorithm, it is appropriate to work at the sample time scale as described below, rather than at the sampling policy scale as described in the introduction. The sample $x_{t,i}$ (resp. the policy $q_t$) of the previous section ($t$ is the block index and $i$ the sample index within the block) is now simply denoted $x_j$ (resp. $q_j$), where $j = n_1 + \cdots + n_{t-1} + i$ is the sample index in the whole sequence $1, \ldots, n$, with $n = N_T$.
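Before restating the algorithm at this sample scale, the generic AIS loop of Algorithm 1 can be sketched in code. The choices below (a Gaussian integrand, a Gaussian location-only policy family, a normalized-weight mean update) are our own illustrative ones, not prescribed by the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, nt = 2, 50, 200
mu_star = np.full(d, 3.0)

def phi(x):
    # Integrand: the N(mu_star, I_d) density, so its integral is exactly 1
    return np.exp(-0.5 * np.sum((x - mu_star)**2, axis=-1)) / (2*np.pi)**(d/2)

def q_pdf(x, mu, sig):
    # Gaussian sampling policy N(mu, sig^2 I_d)
    return np.exp(-0.5 * np.sum((x - mu)**2, axis=-1) / sig**2) / (2*np.pi*sig**2)**(d/2)

mu, sig = np.zeros(d), 3.0          # initial policy q_0
S, N = 0.0, 0                       # running sum and sample count
num, den = np.zeros(d), 0.0         # running sums for the policy update
for t in range(T):
    x = mu + sig * rng.standard_normal((nt, d))   # (i) explore with q_{t-1}
    w = phi(x) / q_pdf(x, mu, sig)                # importance weights
    S += w.sum(); N += nt                         # (ii-a) update the estimate
    num += (w[:, None] * x).sum(axis=0); den += w.sum()
    mu = num / den                                # (ii-b) update the sampler mean
I = S / N
print(I)        # close to 1, the integral of phi
```

The mean update here is a normalized estimate of $\int x\,\varphi(x)\,dx / \int \varphi$, so the policy drifts toward the mass of the integrand while every past sample still contributes to $I$.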
The following algorithm is the same as Algorithm 1 (no explicit update rule is provided) but is expressed at the sample scale.

Algorithm 2 (AIS at sample scale).
Inputs: The number of stages $T \in \mathbb{N}^*$, the allocation policy $(n_t)_{t=1,\ldots,T} \subset \mathbb{N}^*$, the sampler update procedure, the initial density $q_0$.
Set $S_0 = 0$. For $j$ in $1, \ldots, n$:
(i) (Explore) Generate $x_j$ from $q_{j-1}$
(ii) (Exploit)
  (a) Update the estimate: $S_j = S_{j-1} + \frac{\varphi(x_j)}{q_{j-1}(x_j)}$, $\quad I_j = j^{-1} S_j$
  (b) Update the sampler $q_j$ whenever $j \in \{N_t = \sum_{s=1}^{t} n_s : t \geq 1\}$

2.1 The martingale property

Define $\Delta_j$ as the $j$-th centered contribution to the sum $S_j$: $\Delta_j = \varphi(x_j)/q_{j-1}(x_j) - \int \varphi$. Define, for all $n \geq 1$,
$$ M_n = \sum_{j=1}^{n} \Delta_j. $$
The filtration we consider is given by $\mathcal{F}_n = \sigma(x_1, \ldots, x_n)$. The quadratic variation of $M$ is given by $\langle M \rangle_n = \sum_{j=1}^{n} \mathbb{E}[\Delta_j \Delta_j^T \mid \mathcal{F}_{j-1}]$. Set
$$ V(q, \varphi) = \int \frac{\big(\varphi(x) - q(x)\int\varphi\big)\big(\varphi(x) - q(x)\int\varphi\big)^T}{q(x)} \, dx. \qquad (3) $$
Lemma 1. Assume that for all $1 \leq j \leq n$, the support of $q_j$ contains the support of $\varphi$; then the sequence $(M_n, \mathcal{F}_n)$ is a martingale. In particular, $I_n$ is an unbiased estimate of $\int \varphi$. In addition, the quadratic variation of $M$ satisfies $\langle M \rangle_n = \sum_{j=1}^{n} V(q_{j-1}, \varphi)$.

2.2 A central limit theorem for AIS

The following theorem describes the asymptotic behavior of AIS. The conditions will be verified for parametric updates in section 3 (see Theorem 3), in which case the asymptotic variance $V_*$ will be explicitly given.
Theorem 1 (central limit theorem for AIS). Assume that the sequence $q_n$ satisfies
$$ V(q_n, \varphi) \to V_*, \quad \text{a.s.} \qquad (4) $$
for some $V_* \geq 0$, and that there exists $\eta > 0$ such that
$$ \sup_{j \in \mathbb{N}} \int \frac{\|\varphi\|^{2+\eta}}{q_j^{1+\eta}} < \infty, \quad \text{a.s.} \qquad (5) $$
Then we have
$$ \sqrt{n}\Big(I_n - \int \varphi\Big) \overset{d}{\to} \mathcal{N}(0, V_*). $$
Remark 1 (zero-variance estimate). Suppose that $p = 1$ (recalling that $\varphi : \mathbb{R}^d \to \mathbb{R}^p$). Theorem 1 includes the degenerate case $V_* = 0$. This happens when the integrand has constant sign and the sampling policy is well chosen, i.e., $q_n \to |\varphi| / \int |\varphi|$. In this case, we have that $\sqrt{n}\,(I_n - \int\varphi) = o_p(1)$, meaning that the standard Monte Carlo convergence rate ($1/\sqrt{n}$) has been improved. This is in line with the results presented in [33], where fast rates of convergence (compared to standard Monte Carlo) are obtained under restrictive conditions on the allocation policy $(n_t)$. Note that other techniques such as control variates, kernel smoothing or Gaussian quadrature can achieve fast convergence rates [24, 28, 7, 1].
Remark 2 (adaptive multiple importance sampling). Another way to compute the importance weights, called multiple adaptive importance sampling, has been introduced in [32] and has been successfully used in [26, 5]. It consists in replacing $q_{j-1}$ in the computation of $S_j$ by $\bar{q}_{j-1} = \sum_{i=1}^{j} q_{i-1}/j$, $x_j$ still being drawn under $q_{j-1}$. The intuition is that this averaging will reduce the effect of exceptional points $x_j$ for which $|\varphi(x_j)| \gg q_{j-1}(x_j)$ (but $|\varphi(x_j)| \not\gg \bar{q}_{j-1}(x_j)$). Our approach is not able to study this variant, simply because the martingale property described previously is no longer satisfied.

2.3 Normalized AIS

The normalization technique described in (2) is designed to compute $\int \varphi\pi$, where $\pi$ is a density. It is useful in the Bayesian context where $\pi$ is only known up to a constant. As this technique seems to provide substantial improvements compared to unnormalized estimates (i.e., (1) with $\varphi$ replaced by $\varphi\pi$), we recommend using it even when the normalizing constant of $\pi$ is known. Normalized estimators are given by
$$ I_n^{(\mathrm{norm})} = \frac{I_n(\varphi\pi)}{I_n(\pi)}, \qquad \text{with } I_n(\psi) = n^{-1} \sum_{j=1}^{n} \psi(x_j)/q_{j-1}(x_j). $$
Interestingly, normalized estimators are weighted least-squares estimates, as they minimize the function $a \mapsto \sum_{j=1}^{n} (\pi(x_j)/q_{j-1}(x_j))(\varphi(x_j) - a)^2$. In contrast with $I_n$, $I_n^{(\mathrm{norm})}$ has the following shift-invariance property: whenever $\varphi$ is shifted by $\mu$, $I_n^{(\mathrm{norm})}$ simply becomes $I_n^{(\mathrm{norm})} + \mu$. Because $I_n(\varphi\pi)$ and $I_n(\pi)$ are of the same kind as $I_n$ defined in the second AIS algorithm, a straightforward application of Theorem 1 (with $(\varphi^T\pi, \pi)^T$ in place of $\varphi$) yields the following corollary.

Corollary 1 (central limit theorem for normalized AIS). Suppose that (4) and (5) hold with $(\varphi^T\pi, \pi)^T$ (in place of $\varphi$). Then we have
$$ \sqrt{n}\Big(I_n^{(\mathrm{norm})} - \int \varphi\pi\Big) \overset{d}{\to} \mathcal{N}(0, U V_* U^T), $$
with $U = (I_p, -\int \varphi\pi)$.

3 Parametric sampling policy

From this point forward, the sampling policies $q_t$, $t = 1, \ldots, T$ (we are back again to the sampling policy scale as in Algorithm 1), are chosen out of a parametric family of probability density functions $\{q_\theta : \theta \in \Theta\}$.
All our examples fit the general framework of empirical risk minimization over the parameter space $\Theta \subset \mathbb{R}^q$, where $\theta_t$ is given by
$$ \theta_t \in \operatorname{argmin}_{\theta \in \Theta} R_t(\theta), \qquad R_t(\theta) = \sum_{s=1}^{t} \sum_{i=1}^{n_s} \frac{m_\theta(x_{s,i})}{q_{s-1}(x_{s,i})}, \qquad (6) $$
where $q_s$ is a shortcut for $q_{\theta_s}$ and $m_\theta : \mathbb{R}^d \to \mathbb{R}$ might be understood as a loss function (see the next section for examples). Note that $R_t/N_t$ is an unbiased estimate of the risk $r(\theta) = \int m_\theta$.

3.1 Examples of sampling policy

We start by introducing a particular case, which is one of the simplest ways to implement AIS. Then we will provide more general approaches. In what follows, the targeted policy, denoted by $f$, is chosen by the user and represents the distribution from which we wish to sample. It often reflects some prior knowledge on the problem of interest. If $\varphi : \mathbb{R}^d \to \mathbb{R}^p$, with $p = 1$, then (as discussed in the introduction) $f \propto |\varphi|$ is optimal for (1) and $f \propto |\varphi - \int \varphi\pi|\,\pi$ is optimal for (2). In the Bayesian context where many integrals $\int (\varphi_1, \ldots, \varphi_p)\,d\pi$ need to be computed, a usual choice is $f = \pi$. All the following methods only require calls to an unnormalized version of $f$.

Method of moments with Student distributions. In this case $(q_\theta)_{\theta \in \Theta}$ is just the family of multivariate Student distributions with $\nu > 2$ degrees of freedom (a fixed parameter). The parameter $\theta$ contains a location parameter $\mu$ and a scale parameter $\Sigma$.
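For reference, multivariate Student draws with location $\mu$, scale $\Sigma$ and $\nu$ degrees of freedom can be generated via the standard Gaussian/chi-square construction (a sketch with our own helper names, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

def rmvt(n, mu, Sigma, nu, rng):
    """Draw n samples from a multivariate Student with location mu,
    scale Sigma and nu degrees of freedom (Gaussian / chi-square mix)."""
    d = len(mu)
    g = rng.standard_normal((n, d)) @ np.linalg.cholesky(Sigma).T
    return mu + g * np.sqrt(nu / rng.chisquare(nu, size=n))[:, None]

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
nu = 5.0
x = rmvt(200_000, mu, Sigma, nu, rng)
print(x.mean(axis=0))                # ~ mu (the mean exists since nu > 1)
print(np.cov(x.T) * (nu - 2) / nu)   # ~ Sigma: the covariance is Sigma * nu/(nu-2)
```

The $(\nu-2)/\nu$ factor relating covariance and scale is the same one appearing in the scale estimate (8).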
This family has two advantages: the parameter $\nu$ allows tuning for heavy tails, and estimation is easy because moments of $q_\theta$ are explicitly related to $\theta$. A simple unbiased estimate for $\mu$ is $(1/N_t)\sum_{s=1}^{t}\sum_{i=1}^{n_s} x_{s,i}\,f(x_{s,i})/q_{s-1}(x_{s,i})$, but, as mentioned in section 2.3, we prefer to use the normalized estimates (using the shortcut $q_s$ for $q_{\theta_s}$):
$$ \mu_t = \sum_{s=1}^{t} \sum_{i=1}^{n_s} x_{s,i}\,\frac{f(x_{s,i})}{q_{s-1}(x_{s,i})} \Big/ \sum_{s=1}^{t} \sum_{i=1}^{n_s} \frac{f(x_{s,i})}{q_{s-1}(x_{s,i})}, \qquad (7) $$
$$ \Sigma_t = \Big(\frac{\nu-2}{\nu}\Big) \sum_{s=1}^{t} \sum_{i=1}^{n_s} (x_{s,i}-\mu_t)(x_{s,i}-\mu_t)^T\,\frac{f(x_{s,i})}{q_{s-1}(x_{s,i})} \Big/ \sum_{s=1}^{t} \sum_{i=1}^{n_s} \frac{f(x_{s,i})}{q_{s-1}(x_{s,i})}. \qquad (8) $$

Generalized method of moments (GMM). This approach includes the previous example. The policy is chosen according to a moment matching condition, i.e., $\int g\,q_\theta = \int g\,f$ for some function $g : \mathbb{R}^d \to \mathbb{R}^D$. For instance, $g$ might be given by $x \mapsto x$ or $x \mapsto xx^T$ (both are considered in the Student case). Following [17], choosing $\theta$ such that the empirical moments of $g$ coincide with $\int g\,q_\theta$ might be impossible. We rather compute $\theta_t$ as the minimizer of
$$ \Bigg\| \mathbb{E}_\theta(g) - \sum_{s=1}^{t} \sum_{i=1}^{n_s} g(x_{s,i})\,\frac{f(x_{s,i})}{q_{s-1}(x_{s,i})} \Big/ \sum_{s=1}^{t} \sum_{i=1}^{n_s} \frac{f(x_{s,i})}{q_{s-1}(x_{s,i})} \Bigg\|^2. $$
Equivalently,
$$ \theta_t \in \operatorname{argmin}_{\theta \in \Theta} \sum_{s=1}^{t} \sum_{i=1}^{n_s} \|\mathbb{E}_\theta(g) - g(x_{s,i})\|^2\,\frac{f(x_{s,i})}{q_{s-1}(x_{s,i})}, $$
which embraces the form given by (6), with $m_\theta = \|\mathbb{E}_\theta(g) - g\|^2 f$.

Kullback-Leibler approach.
Following [31, section 5.5], define the Kullback-Leibler risk as $r(\theta) = -\int \log(q_\theta)\,f$. The update of $\theta_t$ is done by minimizing the current estimator of $N_t\,r(\theta)$, given by
$$ R_t(\theta) = R_{t-1}(\theta) - \sum_{i=1}^{n_t} \frac{\log(q_\theta(x_{t,i}))\,f(x_{t,i})}{q_{t-1}(x_{t,i})}. \qquad (9) $$

Variance approach. Another approach, when $\varphi : \mathbb{R}^d \to \mathbb{R}^p$ with $p = 1$, consists in minimizing the variance over the class of sampling policies. In this case, define $r(\theta) = \int \varphi^2/q_\theta$, and follow a similar approach as before by minimizing at each stage
$$ R_t(\theta) = R_{t-1}(\theta) + \sum_{i=1}^{n_t} \frac{\varphi(x_{t,i})^2}{q_\theta(x_{t,i})\,q_{t-1}(x_{t,i})}. \qquad (10) $$
This case represents a different situation from the Kullback-Leibler approach and the GMM. Here, the sampling policy is selected optimally with respect to a particular function $\varphi$, whereas for KL and GMM the sampling policy is driven by a targeted distribution $f$.
Remark 3 (computational cost). The update rule (6) might be computationally costly, but alternatives exist. For instance, when $q_\theta$ is a family of Gaussian distributions, closed formulas are available for (9). In fact we are in the case of weighted maximum likelihood estimation, for which we find exactly (7) and (8), with $\nu = \infty$. This is computed online at no cost. Another strategy to reduce the computation time is to use online stochastic gradient descent in (6).
Remark 4 (block estimator). In [22], the authors suggest updating $\theta$ based only on the particles from the last stage. For the Kullback-Leibler update, (9) would be replaced by $R_t(\theta) = -\sum_{i=1}^{n_t} \log(q_\theta(x_{t,i}))\,f(x_{t,i})/q_{t-1}(x_{t,i})$.
While this update makes the theoretical analysis easier (assuming that $n_t \to \infty$), its main drawback is that most of the computing effort is forgotten at each stage, as the previous computations are not used.

3.2 Consistency of the sampling policy and asymptotic optimality of AIS

The updates described before using GMM, the Kullback-Leibler divergence or the variance all fit within the framework of empirical risk minimization, given by (6), which rewritten at the sample scale gives
$$ R_j(\theta) = R_{j-1}(\theta) + \frac{m_\theta(x_j)}{q_{j-1}(x_j)}; $$
- if $j \in \{N_t : t \geq 1\}$ then: $\theta_j \in \operatorname{argmin}_{\theta \in \Theta} R_j(\theta)$ and $q_j = q_{\theta_j}$;
- else: $q_j = q_{j-1}$.

The proof follows a standard approach from M-estimation theory [31, Theorem 5.7], but particular attention must be paid to the uniform law of large numbers because of the missing i.i.d. property of the sequences of interest.
Theorem 2 (consistency of the sampling policy). Set $M(x) = \sup_{\theta \in \Theta} m_\theta(x)$. Assume that $\Theta \subset \mathbb{R}^q$ is a compact set and that
$$ \int M(x)\,dx < \infty, \qquad \sup_{\theta \in \Theta} \int \frac{M(x)^2}{q_\theta(x)}\,dx < \infty, \qquad \text{and} \qquad \forall\,\theta \neq \theta_*,\; r(\theta) = \int m_\theta > \int m_{\theta_*}. \qquad (11) $$
If moreover, for any $x \in \mathbb{R}^d$, the function $\theta \mapsto m_\theta(x)$ is continuous on $\mathbb{R}^q$, then
$$ \theta_n \to \theta_*, \quad \text{a.s.} $$
The conclusion given in Theorem 2 permits checking the conditions of Theorem 1. This leads to the following result.
Theorem 3 (asymptotic optimality of AIS).
Under the assumptions of Theorem 2, if there exists $\eta > 0$ such that $\sup_{\theta \in \Theta} \int \|\varphi\|^{2+\eta}/q_\theta^{1+\eta} < \infty$, we have
$$ \sqrt{n}\Big(I_n - \int \varphi\Big) \overset{d}{\to} \mathcal{N}\big(0, V(q_{\theta_*}, \varphi)\big), $$
where $V(\cdot,\cdot)$ is defined in Equation (3).
Remark 5 (the oracle property). From (11), we deduce that $q_{\theta_*}$ is the unique minimizer of the risk function $r$. The risk function based on GMM or the Kullback-Leibler approach (described in section 3.1) is derived from a certain targeted density $f$ in such a way that if $q_\theta = f$, then $r(\theta)$ is a minimum. Hence, under the identifiability conditions of Theorem 2, if in addition $f \in \{q_\theta : \theta \in \Theta\}$, we have that $q_{\theta_*} = f$. This means that, asymptotically, AIS achieves the same variance as the "oracle" importance sampling method based on the (fixed) sampler $f$.
Corollary 2 (asymptotic optimality for normalized AIS). Under the assumptions of Theorem 2, if there exists $\eta > 0$ such that $\sup_{\theta \in \Theta} \int \|(\varphi^T\pi, \pi)\|^{2+\eta}/q_\theta^{1+\eta} < \infty$, we have
$$ \sqrt{n}\Big(I_n^{(\mathrm{norm})} - \int \varphi\pi\Big) \overset{d}{\to} \mathcal{N}\big(0, U\,V(q_{\theta_*}, (\varphi^T\pi, \pi)^T)\,U^T\big), $$
with $U$ defined in Corollary 1 and $V(\cdot,\cdot)$ defined in Equation (3).

4 Weighted AIS

We follow ideas from [9, section 4] to develop a novel method to estimate $\int \varphi\pi$. The method, called weighted adaptive importance sampling (wAIS), automatically re-weights each sample depending on its accuracy. In practice, it allows forgetting poor samples generated during the early stages. For clarity, suppose that $\varphi : \mathbb{R}^d \to \mathbb{R}^p$ with $p = 1$.
Define the weighted estimate, for any function $\psi$,
$$ I_T^{(\alpha)}(\psi) = N_T^{-1} \sum_{t=1}^{T} \alpha_{T,t} \sum_{i=1}^{n_t} \frac{\psi(x_{t,i})}{q_{t-1}(x_{t,i})}. $$
Note that for any sequence $(\alpha_{T,1}, \ldots, \alpha_{T,T})$ such that $\sum_{t=1}^{T} n_t\,\alpha_{T,t} = N_T$, $I_T^{(\alpha)}(\psi)$ is an unbiased estimate of $\int \psi$. Let $\sigma_t^2 = \mathbb{E}[V(q_{t-1}, \varphi)]$, where $V(\cdot,\cdot)$ is defined in Equation (3). The variance of $I_T^{(\alpha)}(\varphi)$ is $N_T^{-2} \sum_{t=1}^{T} \alpha_{T,t}^2\,n_t\,\sigma_t^2$, which, minimized w.r.t. $(\alpha)$, gives $\alpha_{T,t} \propto \sigma_t^{-2}$ for each $t = 1, \ldots, T$.
In [9], a re-weighting is proposed using estimates of $\sigma_t$ (based on the sample of the $t$-th stage). We propose the following weights,
$$ \alpha_{T,t}^{-1} \propto \sum_{i=1}^{n_t} \Bigg[ \bigg(\frac{\pi(x_{t,i})}{q_{t-1}(x_{t,i})}\bigg)^2 - 1 \Bigg], \qquad (12) $$
satisfying the constraint $\sum_{t=1}^{T} n_t\,\alpha_{T,t} = N_T$. The wAIS estimate is the (weighted and normalized) AIS estimate given by
$$ I_T^{(\alpha)}(\varphi\pi) / I_T^{(\alpha)}(\pi). \qquad (13) $$
In contrast with the approach in [9], because our weights are based on the estimated variance of $\pi/q_{t-1}$, our proposal is free from the integrand $\varphi$ and thus reflects the overall quality of the $t$-th sample. This makes sense whenever many functions need to be integrated, making a re-weighting that depends on a specific function inappropriate. Another difference with [9] is that we use the true expectation, $1$, in the estimate of the variance, rather than the estimate $(1/n_t)\sum_{i=1}^{n_t} \pi(x_{t,i})/q_{t-1}(x_{t,i})$. This permits avoiding the situation (common in high-dimensional settings) where a poor sampler $q_{t-1}$ is such that $\pi(x_{t,i})/q_{t-1}(x_{t,i}) \simeq 0$ for all $i = 1, \ldots, n_t$, implying that the classical estimate of the variance is near $0$, leading (unfortunately) to a large weight.

5 Numerical experiments

In this section, we study a toy Gaussian example to illustrate the practical behavior of AIS. Special interest is dedicated to the effect of the dimension $d$, the practical choice of $(n_t)$, and the gain given by wAIS introduced in the previous section. We set $N_T$ = 1e5 and consider $d = 2, 4, 8, 16$. The code is made available at https://github.com/portierf/AIS.

Figure 1: From left to right and top to bottom, $d = 2, 4, 8, 16$. AIS and wAIS are computed with $T = 50$ and a constant allocation policy $n_t$ = 2e3. Plotted is the logarithm of the MSE (computed for each method over 100 replicates) with respect to the number of requests to the integrand.

The aim is to compute $\mu_* = \int x\,\phi_{\mu_*,\sigma_*}(x)\,dx$, where $\phi_{\mu,\sigma} : \mathbb{R}^d \to \mathbb{R}$ is the probability density of $\mathcal{N}(\mu, \sigma^2 I_d)$, $\mu_* = (5, \ldots, 5)^T \in \mathbb{R}^d$, $\sigma_* = 1$. The sampling policy is taken in the collection of multivariate Student distributions with $\nu = 3$ degrees of freedom, denoted by $\{q_{\mu,\Sigma_0} : \mu \in \mathbb{R}^d\}$ with $\Sigma_0 = \sigma_0 I_d (\nu-2)/\nu$ and $\sigma_0 = 5$. The initial sampling policy is set with $\mu_0 = (0, \ldots, 0) \in \mathbb{R}^d$. The mean $\mu_t$ is updated at each stage $t = 1, \ldots, T$ following the GMM approach described in section 3, leading to the simple update formula
$$ \mu_t = \sum_{s=1}^{t} \sum_{i=1}^{n_s} x_{s,i}\,\frac{f(x_{s,i})}{q_{s-1}(x_{s,i})} \Big/ \sum_{s=1}^{t} \sum_{i=1}^{n_s} \frac{f(x_{s,i})}{q_{s-1}(x_{s,i})}, $$
with $f = \phi_{\mu_*,\sigma_*}$.
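This experimental scheme can be reproduced in a few lines. The sketch below is our own rendering of the same setup in dimension $d = 2$ (variable names are ours; the Student scale is kept fixed, only the location is adapted):

```python
import numpy as np
from math import gamma, pi

rng = np.random.default_rng(3)
d, nu, sigma0 = 2, 3.0, 5.0
mu_star = np.full(d, 5.0)
c = sigma0 * (nu - 2) / nu                 # Sigma_0 = c * I_d

def f_pdf(x):                              # target f = N(mu_star, I_d) density
    r2 = np.sum((x - mu_star)**2, axis=-1)
    return np.exp(-0.5 * r2) / (2*pi)**(d/2)

def q_pdf(x, mu):                          # Student t_nu(mu, c I_d) density
    m = np.sum((x - mu)**2, axis=-1) / c   # Mahalanobis distance
    k = gamma((nu + d)/2) / (gamma(nu/2) * (nu*pi)**(d/2) * c**(d/2))
    return k * (1.0 + m/nu)**(-(nu + d)/2)

def q_rvs(n, mu):                          # Gaussian / chi-square construction
    g = rng.standard_normal((n, d)) * np.sqrt(c)
    return mu + g * np.sqrt(nu / rng.chisquare(nu, size=n))[:, None]

mu, num, den = np.zeros(d), np.zeros(d), 0.0
for t in range(50):                        # T = 50 stages, n_t = 2000
    x = q_rvs(2000, mu)                    # explore with q_{t-1}
    w = f_pdf(x) / q_pdf(x, mu)            # importance weights f / q_{t-1}
    num += (w[:, None] * x).sum(axis=0); den += w.sum()
    mu = num / den                         # normalized GMM mean update
print(mu)    # close to mu_star = (5, 5)
```

Note that any constant factor in the weights cancels in the normalized update, which is one practical reason the method only needs an unnormalized version of $f$.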
In section C of the supplementary file, other results considering the update of the variance within the Student family are provided.
As the results for the unnormalized approaches were far from being competitive with the normalized ones, we consider only normalized estimators. We also tried the weights proposed in [9], but the results were not competitive. The (normalized) AIS estimate of $\mu_*$ is simply given by $\mu_t$ as displayed above. The wAIS estimate of $\mu_*$ is computed using (13) with weights (12).
We also include the adaptive MH proposed in [15], where the proposal, assuming that $X_{i-1} = x$, is given by $\mathcal{N}\big(x, (2.4)^2 (C_i + \epsilon I_d)/d\big)$ if $i > i_0$, and $\mathcal{N}(x, I_d)$ if $i \leq i_0$, with $C_i$ the empirical covariance matrix of $(X_0, X_1, \ldots, X_{i-1})$, $i_0 = 1000$ and $\epsilon = 0.05$ (other configurations, for instance using only half of the chain, have been tested without improving the results). Finally, we consider a so-called "oracle" method: importance sampling with the fixed policy $q_{\mu_*,\Sigma_*}$, with $\Sigma_* = \sigma_* I_d (\nu-2)/\nu$.

Figure 2: From left to right and top to bottom, $d = 2, 4, 8, 16$. AIS and wAIS are computed with $T = 5, 20, 50$, each with a constant allocation policy, resp. $n_t$ = 2e4, 5e3, 2e3. Plotted is the logarithm of the MSE (computed for each method over 100 replicates) with respect to the number of requests to the integrand.

For each method that returns $\mu$, the mean squared error (MSE) is computed as the average of $\|\mu - \mu_*\|^2$ over 100 replicates of $\mu$.
In Figure 1, we compare the evolution of all the mentioned algorithms with respect to the stages $t = 1, \ldots, T = 50$ with constant allocation policy $n_t$ = 2e3 (for AIS and wAIS). The clear winner is wAIS. Note that the oracle policy $q_{\mu_*,\Sigma_*}$, which is not the optimal one (see section B.3 in the supplementary material), seems to give worse results than the policy $q_{\mu_*,\Sigma_0}$, as wAIS with $\Sigma_0$ performs better than the "oracle" after some time.
In Figure 2, we examine 3 constant allocation policies given by $T = 50$ and $n_t$ = 2e3; $T = 20$ and $n_t$ = 5e3; $T = 5$ and $n_t$ = 2e4. We clearly notice that the rate of convergence is influenced by the number of update steps (at least at the beginning). The results call for updating the sampling policy as soon as possible.
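To make the wAIS computation used above concrete, the stage re-weighting can be sketched in one dimension. Everything below is our own illustration (the per-stage variance proxy uses the known expectation $\mathbb{E}[\pi/q_{t-1}] = 1$, in a nonnegative variant of (12)):

```python
import numpy as np

rng = np.random.default_rng(4)

def npdf(x, m, s):
    return np.exp(-0.5*((x - m)/s)**2) / (s*np.sqrt(2*np.pi))

# Target pi = N(3, 1), integrand phi(x) = x, so int phi*pi = 3.
# Policies q_{t-1} = N(mu_t, 1.5^2), with a deliberately poor start.
T, nt, mu_t = 20, 500, -2.0
stages = []
for t in range(T):
    x = mu_t + 1.5*rng.standard_normal(nt)
    w = npdf(x, 3.0, 1.0) / npdf(x, mu_t, 1.5)    # weights pi / q_{t-1}
    stages.append((w, x))
    mu_t = np.sum(w*x) / np.sum(w)                # crude mean update

# Stage weights: alpha_t inversely proportional to the estimated variance
# of pi/q_{t-1}, computed around the known expectation 1, so the poor
# early stages receive small alpha_t.
inv_a = np.array([np.sum((w - 1.0)**2) for w, _ in stages])
alpha = 1.0 / np.maximum(inv_a, 1e-12)
alpha *= T*nt / (nt*alpha.sum())     # enforce sum_t n_t alpha_t = N_T

num = sum(a*np.sum(w*x) for a, (w, x) in zip(alpha, stages))
den = sum(a*np.sum(w) for a, (w, x) in zip(alpha, stages))
print(num/den)   # weighted, normalized estimate of int x pi(x) dx, close to 3
```

The normalization constraint actually cancels in the ratio (13); it is enforced here only so that each $I_T^{(\alpha)}(\psi)$ taken alone remains unbiased.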
This empirical evidence supports the theoretical framework studied in the paper, which imposes no condition on the growth of (nt).

Acknowledgments

The authors are grateful to Rémi Bardenet for useful comments and additional references.

References

[1] Rémi Bardenet and Adrien Hardy. Monte Carlo with determinantal point processes. arXiv preprint arXiv:1605.00361, 2016.

[2] Olivier Cappé, Randal Douc, Arnaud Guillin, Jean-Michel Marin, and Christian P. Robert. Adaptive importance sampling in general mixture classes. Statistics and Computing, 18(4):447–459, 2008.

[3] Olivier Cappé, Arnaud Guillin, Jean-Michel Marin, and Christian P. Robert. Population Monte Carlo. Journal of Computational and Graphical Statistics, 13(4):907–929, 2004.

[4] Nicolas Chopin. Central limit theorem for sequential Monte Carlo methods and its application to Bayesian inference. The Annals of Statistics, 32(6):2385–2411, 2004.

[5] Jean Cornuet, Jean-Michel Marin, Antonietta Mira, and Christian P. Robert. Adaptive multiple importance sampling. Scandinavian Journal of Statistics, 39(4):798–812, 2012.

[6] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436, 2006.

[7] Bernard Delyon and François Portier. Integral approximation by kernel smoothing. Bernoulli, 22(4):2177–2208, 2016.

[8] Randal Douc, Arnaud Guillin, J.-M. Marin, and Christian P. Robert.
Convergence of adaptive mixtures of importance sampling schemes. The Annals of Statistics, pages 420–448, 2007.

[9] Randal Douc, Arnaud Guillin, J.-M. Marin, and Christian P. Robert. Minimum variance importance sampling via population Monte Carlo. ESAIM: Probability and Statistics, 11:427–447, 2007.

[10] Randal Douc and Eric Moulines. Limit theorems for weighted samples with applications to sequential Monte Carlo methods. The Annals of Statistics, pages 2344–2376, 2008.

[11] Víctor Elvira, Luca Martino, David Luengo, and Mónica F. Bugallo. Generalized multiple importance sampling. arXiv preprint arXiv:1511.03095, 2015.

[12] Akram Erraqabi, Michal Valko, Alexandra Carpentier, and Odalric Maillard. Pliable rejection sampling. In International Conference on Machine Learning, pages 2121–2129, 2016.

[13] Michael Evans and Tim Swartz. Approximating integrals via Monte Carlo and deterministic methods. Oxford Statistical Science Series. Oxford University Press, Oxford, 2000.

[14] John Geweke. Bayesian inference in econometric models using Monte Carlo integration. Econometrica: Journal of the Econometric Society, pages 1317–1339, 1989.

[15] Heikki Haario, Eero Saksman, and Johanna Tamminen. An adaptive Metropolis algorithm. Bernoulli, 7(2):223–242, 2001.

[16] John Michael Hammersley and David Christopher Handscomb. General principles of the Monte Carlo method. In Monte Carlo Methods, pages 50–75. Springer, 1964.

[17] Lars Peter Hansen. Large sample properties of generalized method of moments estimators. Econometrica: Journal of the Econometric Society, pages 1029–1054, 1982.

[18] Tatsunori B. Hashimoto, Steve Yadlowsky, and John C. Duchi. Derivative free optimization via repeated classification. arXiv preprint arXiv:1804.03761, 2018.

[19] Tang Jie and Pieter Abbeel. On a connection between importance sampling and the likelihood ratio policy gradient.
In Advances in Neural Information Processing Systems, pages 1000–1008, 2010.

[20] Tuen Kloek and Herman K. Van Dijk. Bayesian estimates of equation system parameters: an application of integration by Monte Carlo. Econometrica: Journal of the Econometric Society, pages 1–19, 1978.

[21] Qi Lou, Rina Dechter, and Alexander T. Ihler. Dynamic importance sampling for anytime bounds of the partition function. In Advances in Neural Information Processing Systems, pages 3199–3207, 2017.

[22] Jean-Michel Marin, Pierre Pudlo, and Mohammed Sedki. Consistency of the adaptive multiple importance sampling. arXiv preprint arXiv:1211.2548, 2012.

[23] Jan C. Neddermeyer. Computationally efficient nonparametric importance sampling. Journal of the American Statistical Association, 104(486):788–802, 2009.

[24] Chris J. Oates, Mark Girolami, and Nicolas Chopin. Control functionals for Monte Carlo integration. J. R. Statist. Soc. B, 79(3):695–718, 2017.

[25] Man-Suk Oh and James O. Berger. Adaptive importance sampling in Monte Carlo integration. J. Statist. Comput. Simulation, 41(3-4):143–168, 1992.

[26] Art Owen and Yi Zhou. Safe and effective importance sampling. J. Amer. Statist. Assoc., 95(449):135–143, 2000.

[27] Jan Peters, Katharina Mülling, and Yasemin Altun. Relative entropy policy search. In AAAI, pages 1607–1612. Atlanta, 2010.

[28] François Portier and Johan Segers. Monte Carlo integration with a growing number of control variates. arXiv preprint arXiv:1801.01797, 2018.

[29] Jean-François Richard and Wei Zhang. Efficient high-dimensional importance sampling. Journal of Econometrics, 141(2):1385–1411, 2007.

[30] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

[31] A. W. van der Vaart.
Asymptotic statistics, volume 3 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 1998.

[32] Eric Veach and Leonidas J. Guibas. Optimally combining sampling techniques for Monte Carlo rendering. In Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques, pages 419–428. ACM, 1995.

[33] Ping Zhang. Nonparametric importance sampling. J. Amer. Statist. Assoc., 91(435):1245–1253, 1996.

[34] Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In International Conference on Machine Learning, pages 1–9, 2015.