{"title": "Policy Optimization via Importance Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 5442, "page_last": 5454, "abstract": "Policy optimization is an effective reinforcement learning approach to solve continuous control tasks. Recent achievements have shown that alternating online and offline optimization is a successful choice for efficient trajectory reuse. However, deciding when to stop optimizing and collect new trajectories is non-trivial, as it requires to account for the variance of the objective function estimate. In this paper, we propose a novel, model-free, policy search algorithm, POIS, applicable in both action-based and parameter-based settings. We first derive a high-confidence bound for importance sampling estimation; then we define a surrogate objective function, which is optimized offline whenever a new batch of trajectories is collected. Finally, the algorithm is tested on a selection of continuous control tasks, with both linear and deep policies, and compared with state-of-the-art policy optimization methods.", "full_text": "Policy Optimization via Importance Sampling\n\nAlberto Maria Metelli\n\nPolitecnico di Milano, Milan, Italy\nalbertomaria.metelli@polimi.it\n\nMatteo Papini\n\nPolitecnico di Milano, Milan, Italy\n\nmatteo.papini@polimi.it\n\nFrancesco Faccio\n\nPolitecnico di Milano, Milan, Italy\n\nIDSIA, USI-SUPSI, Lugano, Switzerland\n\nMarcello Restelli\n\nPolitecnico di Milano, Milan, Italy\nmarcello.restelli@polimi.it\n\nfrancesco.faccio@mail.polimi.it\n\nAbstract\n\nPolicy optimization is an effective reinforcement learning approach to solve contin-\nuous control tasks. Recent achievements have shown that alternating online and\nof\ufb02ine optimization is a successful choice for ef\ufb01cient trajectory reuse. However,\ndeciding when to stop optimizing and collect new trajectories is non-trivial, as it\nrequires to account for the variance of the objective function estimate. 
In this paper,\nwe propose a novel, model-free, policy search algorithm, POIS, applicable in both\naction-based and parameter-based settings. We \ufb01rst derive a high-con\ufb01dence bound\nfor importance sampling estimation; then we de\ufb01ne a surrogate objective function,\nwhich is optimized of\ufb02ine whenever a new batch of trajectories is collected. Finally,\nthe algorithm is tested on a selection of continuous control tasks, with both linear\nand deep policies, and compared with state-of-the-art policy optimization methods.\n\n1\n\nIntroduction\n\nIn recent years, policy search methods [10] have proved to be valuable Reinforcement Learning\n(RL) [50] approaches thanks to their successful achievements in continuous control tasks [e.g.,\n23, 42, 44, 43], robotic locomotion [e.g., 53, 20] and partially observable environments [e.g., 28].\nThese algorithms can be roughly classi\ufb01ed into two categories: action-based methods [51, 34] and\nparameter-based methods [45]. The former, usually known as policy gradient (PG) methods, perform\na search in a parametric policy space by following the gradient of the utility function estimated\nby means of a batch of trajectories collected from the environment [50]. In contrast, in parameter-\nbased methods, the search is carried out directly in the space of parameters by exploiting global\noptimizers [e.g., 41, 16, 48, 52] or following a proper gradient direction like in Policy Gradients with\nParameter-based Exploration (PGPE) [45, 63, 46]. A major question in policy search methods is:\nhow should we use a batch of trajectories in order to exploit its information in the most ef\ufb01cient\nway? On one hand, on-policy methods leverage on the batch to perform a single gradient step, after\nwhich new trajectories are collected with the updated policy. 
Online PG methods are likely the most\nwidespread policy search approaches: starting from the traditional algorithms based on stochastic\npolicy gradient [51], like REINFORCE [64] and G(PO)MDP [4], moving toward more modern\nmethods, such as Trust Region Policy Optimization (TRPO) [42]. These methods, however, rarely\nexploit the available trajectories in an ef\ufb01cient way, since each batch is thrown away after just one\ngradient update. On the other hand, off-policy methods maintain a behavioral policy, used to explore\nthe environment and to collect samples, and a target policy which is optimized. The concept of off-\npolicy learning is rooted in value-based RL [62, 30, 27] and it was \ufb01rst adapted to PG in [9], using an\nactor-critic architecture. The approach has been extended to Deterministic Policy Gradient (DPG) [47],\nwhich allows optimizing deterministic policies while keeping a stochastic policy for exploration.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fMore recently, an ef\ufb01cient version of DPG coupled with a deep neural network to represent the policy\nhas been proposed, named Deep Deterministic Policy Gradient (DDPG) [23]. In the parameter-based\nframework, even though the original formulation [45] introduces an online algorithm, an extension has\nbeen proposed to ef\ufb01ciently reuse the trajectories in an of\ufb02ine scenario [67]. Furthermore, PGPE-like\napproaches allow overcoming several limitations of classical PG, like the need for a stochastic policy\nand the high variance of the gradient estimates.1\nWhile on-policy algorithms are, by nature, online, as they need to be fed with fresh samples whenever\nthe policy is updated, off-policy methods can take advantage of mixing online and of\ufb02ine optimization.\nThis can be done by alternately sampling trajectories and performing optimization epochs with the\ncollected data. 
A prime example of this alternating procedure is Proximal Policy Optimization\n(PPO) [44], that has displayed remarkable performance on continuous control tasks. Off-line op-\ntimization, however, introduces further sources of approximation, as the gradient w.r.t. the target\npolicy needs to be estimated (off-policy) with samples collected with a behavioral policy. A common\nchoice is to adopt an importance sampling (IS) [29, 17] estimator in which each sample is reweighted\nproportionally to the likelihood of being generated by the target policy. However, directly optimizing\nthis utility function is impractical since it displays a wide variance most of the times [29]. Intuitively,\nthe variance increases proportionally to the distance between the behavioral and the target policy;\nthus, the estimate is reliable as long as the two policies are close enough. Preventing uncontrolled\nupdates in the space of policy parameters is at the core of the natural gradient approaches [1] applied\neffectively both on PG methods [18, 33, 63] and on PGPE methods [26]. More recently, this idea\nhas been captured (albeit indirectly) by TRPO, which optimizes via (approximate) natural gradient a\nsurrogate objective function, derived from safe RL [18, 35], subject to a constraint on the Kullback-\nLeibler divergence between the behavioral and target policy.2 Similarly, PPO performs a truncation\nof the importance weights to discourage the optimization process from going too far. Although TRPO\nand PPO, together with DDPG, represent the state-of-the-art policy optimization methods in RL for\ncontinuous control, they do not explicitly encode in their objective function the uncertainty injected\nby the importance sampling procedure. A more theoretically grounded analysis has been provided for\npolicy selection [11], model-free [56] and model-based [54] policy evaluation (also accounting for\nsamples collected with multiple behavioral policies), and combined with options [15]. 
Subsequently, in [55] these methods have been extended for policy improvement, deriving a suitable concentration inequality for the case of truncated importance weights. Unfortunately, these methods are hardly scalable to complex control tasks. A more detailed review of the state-of-the-art policy optimization algorithms is reported in Appendix A.\n\nIn this paper, we propose a novel, model-free, actor-only, policy optimization algorithm, named Policy Optimization via Importance Sampling (POIS), that mixes online and offline optimization to efficiently exploit the information contained in the collected trajectories. POIS explicitly accounts for the uncertainty introduced by the importance sampling by optimizing a surrogate objective function. The latter captures the trade-off between the estimated performance improvement and the variance injected by the importance sampling. The main contributions of this paper are theoretical, algorithmic and experimental. After reviewing some notions about importance sampling (Section 3), we propose a concentration inequality, of independent interest, for high-confidence \u201coff-distribution\u201d optimization of objective functions estimated via importance sampling (Section 4). Then we show how this bound can be customized into a surrogate objective function in order to either search in the space of policies (Action-based POIS) or to search in the space of parameters (Parameter-based POIS). The resulting algorithm (in both the action-based and the parameter-based flavor) collects, at each iteration, a set of trajectories. 
These are used to perform offline optimization of the surrogate objective via gradient ascent (Section 5), after which a new batch of trajectories is collected using the optimized policy. Finally, we provide an experimental evaluation with both linear policies and deep neural policies to illustrate the advantages and limitations of our approach compared to state-of-the-art algorithms (Section 6) on classical control tasks [12, 57]. The proofs for all Theorems and Lemmas can be found in Appendix B. The implementation of POIS can be found at https://github.com/T3p/pois.\n\n$^{1}$Other solutions to these problems have been proposed in the action-based literature, like the aforementioned DPG algorithm, the gradient baselines [34] and the actor-critic architectures [21].\n\n$^{2}$Note that this regularization term appears in the performance improvement bound, which contains exact quantities only. Thus, it does not really account for the uncertainty derived from the importance sampling.\n\n2 Preliminaries\n\nA discrete-time Markov Decision Process (MDP) [37] is defined as a tuple $\\mathcal{M} = (\\mathcal{S}, \\mathcal{A}, P, R, \\gamma, D)$ where $\\mathcal{S}$ is the state space, $\\mathcal{A}$ is the action space, $P(\\cdot|s, a)$ is a Markovian transition model that assigns for each state-action pair $(s, a)$ the probability of reaching the next state $s'$, $\\gamma \\in [0, 1]$ is the discount factor, $R(s, a) \\in [-R_{\\max}, R_{\\max}]$ assigns the expected reward for performing action $a$ in state $s$ and $D$ is the distribution of the initial state. The behavior of an agent is described by a policy $\\pi(\\cdot|s)$ that assigns for each state $s$ the probability of performing action $a$. A trajectory $\\tau \\in \\mathcal{T}$ is a sequence of state-action pairs $\\tau = (s_{\\tau,0}, a_{\\tau,0}, \\dots, s_{\\tau,H-1}, a_{\\tau,H-1}, s_{\\tau,H})$, where $H$ is the actual trajectory horizon. 
The performance of an agent is evaluated in terms of the expected return, i.e., the expected discounted sum of the rewards collected along the trajectory: $\\mathbb{E}_{\\tau}[R(\\tau)]$, where $R(\\tau) = \\sum_{t=0}^{H-1} \\gamma^t R(s_{\\tau,t}, a_{\\tau,t})$ is the trajectory return.\n\nWe focus on the case in which the policy belongs to a parametric policy space $\\Pi_{\\Theta} = \\{\\pi_{\\theta} : \\theta \\in \\Theta \\subseteq \\mathbb{R}^p\\}$. In parameter-based approaches, the agent is equipped with a hyperpolicy $\\nu$ used to sample the policy parameters at the beginning of each episode. The hyperpolicy belongs itself to a parametric hyperpolicy space $\\mathcal{N}_{\\mathcal{P}} = \\{\\nu_{\\rho} : \\rho \\in \\mathcal{P} \\subseteq \\mathbb{R}^r\\}$. The expected return can be expressed, in the parameter-based case, as a double expectation: one over the policy parameter space $\\Theta$ and one over the trajectory space $\\mathcal{T}$:\n\n$$J_D(\\rho) = \\int_{\\Theta} \\int_{\\mathcal{T}} \\nu_{\\rho}(\\theta) p(\\tau|\\theta) R(\\tau) \\, \\mathrm{d}\\tau \\, \\mathrm{d}\\theta, \\tag{1}$$\n\nwhere $p(\\tau|\\theta) = D(s_0) \\prod_{t=0}^{H-1} \\pi_{\\theta}(a_t|s_t) P(s_{t+1}|s_t, a_t)$ is the trajectory density function. The goal of a parameter-based learning agent is to determine the hyperparameters $\\rho^*$ so as to maximize $J_D(\\rho)$. If $\\nu_{\\rho}$ is stochastic and differentiable, the hyperparameters can be learned according to the gradient ascent update: $\\rho' = \\rho + \\alpha \\nabla_{\\rho} J_D(\\rho)$, where $\\alpha > 0$ is the step size and $\\nabla_{\\rho} J_D(\\rho) = \\int_{\\Theta} \\int_{\\mathcal{T}} \\nu_{\\rho}(\\theta) p(\\tau|\\theta) \\nabla_{\\rho} \\log \\nu_{\\rho}(\\theta) R(\\tau) \\, \\mathrm{d}\\tau \\, \\mathrm{d}\\theta$. Since the stochasticity of the hyperpolicy is a sufficient source of exploration, deterministic action policies of the kind $\\pi_{\\theta}(a|s) = \\delta(a - u_{\\theta}(s))$ are typically considered, where $\\delta$ is the Dirac delta function and $u_{\\theta}$ is a deterministic mapping from $\\mathcal{S}$ to $\\mathcal{A}$. In the action-based case, on the contrary, the hyperpolicy $\\nu_{\\rho}$ is a deterministic distribution $\\nu_{\\rho}(\\theta) = \\delta(\\theta - g(\\rho))$, where $g(\\rho)$ is a deterministic mapping from $\\mathcal{P}$ to $\\Theta$. For this reason, the dependence on $\\rho$ is typically not represented and the expected return expression simplifies into a single expectation over the trajectory space $\\mathcal{T}$:\n\n$$J_D(\\theta) = \\int_{\\mathcal{T}} p(\\tau|\\theta) R(\\tau) \\, \\mathrm{d}\\tau. \\tag{2}$$\n\nAn action-based learning agent aims to find the policy parameters $\\theta^*$ that maximize $J_D(\\theta)$. In this case, we need to enforce exploration by means of the stochasticity of $\\pi_{\\theta}$. For stochastic and differentiable policies, learning can be performed via gradient ascent: $\\theta' = \\theta + \\alpha \\nabla_{\\theta} J_D(\\theta)$, where $\\nabla_{\\theta} J_D(\\theta) = \\int_{\\mathcal{T}} p(\\tau|\\theta) \\nabla_{\\theta} \\log p(\\tau|\\theta) R(\\tau) \\, \\mathrm{d}\\tau$.\n\n3 Evaluation via Importance Sampling\n\nIn off-policy evaluation [56, 54], we aim to estimate the performance of a target policy $\\pi_T$ (or hyperpolicy $\\nu_T$) given samples collected with a behavioral policy $\\pi_B$ (or hyperpolicy $\\nu_B$). More generally, we face the problem of estimating the expected value of a deterministic bounded function $f$ ($\\|f\\|_{\\infty} < +\\infty$) of a random variable $x$ taking values in $\\mathcal{X}$ under a target distribution $P$, after having collected samples from a behavioral distribution $Q$. The importance sampling estimator (IS) [7, 29] corrects the distribution with the importance weights (or Radon\u2013Nikodym derivative or likelihood ratio) $w_{P/Q}(x) = p(x)/q(x)$:\n\n$$\\widehat{\\mu}_{P/Q} = \\frac{1}{N} \\sum_{i=1}^{N} \\frac{p(x_i)}{q(x_i)} f(x_i) = \\frac{1}{N} \\sum_{i=1}^{N} w_{P/Q}(x_i) f(x_i), \\tag{3}$$\n\nwhere $\\mathbf{x} = (x_1, x_2, \\dots, x_N)^T$ is sampled from $Q$ and we assume $q(x) > 0$ whenever $f(x)p(x) \\neq 0$. This estimator is unbiased ($\\mathbb{E}_{\\mathbf{x} \\sim Q}[\\widehat{\\mu}_{P/Q}] = \\mathbb{E}_{x \\sim P}[f(x)]$) but it may exhibit an undesirable behavior due to the variability of the importance weights, showing, in some cases, infinite variance. Intuitively, the magnitude of the importance weights provides an indication of how much the probability measures $P$ and $Q$ are dissimilar. This notion can be formalized by the R\u00e9nyi divergence [40, 59], an information-theoretic dissimilarity index between probability measures.\n\nR\u00e9nyi divergence. Let $P$ and $Q$ be two probability measures on a measurable space $(\\mathcal{X}, \\mathcal{F})$ such that $P \\ll Q$ ($P$ is absolutely continuous w.r.t. $Q$) and $Q$ is $\\sigma$-finite. Let $P$ and $Q$ admit $p$ and $q$ as Lebesgue probability density functions (p.d.f.), respectively. The $\\alpha$-R\u00e9nyi divergence is defined as:\n\n$$D_{\\alpha}(P\\|Q) = \\frac{1}{\\alpha - 1} \\log \\int_{\\mathcal{X}} \\left( \\frac{\\mathrm{d}P}{\\mathrm{d}Q} \\right)^{\\alpha} \\mathrm{d}Q = \\frac{1}{\\alpha - 1} \\log \\int_{\\mathcal{X}} q(x) \\left( \\frac{p(x)}{q(x)} \\right)^{\\alpha} \\mathrm{d}x, \\tag{4}$$\n\nwhere $\\mathrm{d}P/\\mathrm{d}Q$ is the Radon\u2013Nikodym derivative of $P$ w.r.t. $Q$ and $\\alpha \\in [0, \\infty]$. Some remarkable cases are: $\\alpha = 1$, when $D_1(P\\|Q) = D_{\\mathrm{KL}}(P\\|Q)$, and $\\alpha = \\infty$, yielding $D_{\\infty}(P\\|Q) = \\log \\operatorname{ess\\,sup}_{\\mathcal{X}} \\mathrm{d}P/\\mathrm{d}Q$. Importing the notation from [8], we indicate the exponentiated $\\alpha$-R\u00e9nyi divergence as $d_{\\alpha}(P\\|Q) = \\exp(D_{\\alpha}(P\\|Q))$. With little abuse of notation, we will replace $D_{\\alpha}(P\\|Q)$ with $D_{\\alpha}(p\\|q)$ whenever possible within the context.\n\nThe R\u00e9nyi divergence provides a convenient expression for the moments of the importance weights: $\\mathbb{E}_{x \\sim Q}[w_{P/Q}(x)^{\\alpha}] = d_{\\alpha}(P\\|Q)^{\\alpha - 1}$. Moreover, $\\mathrm{Var}_{x \\sim Q}[w_{P/Q}(x)] = d_2(P\\|Q) - 1$ and $\\operatorname{ess\\,sup}_{x \\sim Q} w_{P/Q}(x) = d_{\\infty}(P\\|Q)$ [8]. To mitigate the variance problem of the IS estimator, we can resort to the self-normalized importance sampling estimator (SN) [7]:\n\n$$\\widetilde{\\mu}_{P/Q} = \\frac{\\sum_{i=1}^{N} w_{P/Q}(x_i) f(x_i)}{\\sum_{i=1}^{N} w_{P/Q}(x_i)} = \\sum_{i=1}^{N} \\widetilde{w}_{P/Q}(x_i) f(x_i), \\tag{5}$$\n\nwhere $\\widetilde{w}_{P/Q}(x) = w_{P/Q}(x) / \\sum_{i=1}^{N} w_{P/Q}(x_i)$ is the self-normalized importance weight. Differently from $\\widehat{\\mu}_{P/Q}$, $\\widetilde{\\mu}_{P/Q}$ is biased but consistent [29] and it typically displays a more desirable behavior because of its smaller variance.$^{3}$ Given the realization $x_1, x_2, \\dots, x_N$ we can interpret the SN estimator as the expected value of $f$ under an approximation of the distribution $P$ made by $N$ deltas, i.e., $\\widetilde{p}(x) = \\sum_{i=1}^{N} \\widetilde{w}_{P/Q}(x_i) \\delta(x - x_i)$. The problem of assessing the quality of the SN estimator has been extensively studied by the simulation community, producing several diagnostic indexes to indicate when the weights might display problematic behavior [29]. The effective sample size (ESS) was introduced in [22] as the number of samples drawn from $P$ so that the variance of the Monte Carlo estimator $\\widetilde{\\mu}_{P/P}$ is approximately equal to the variance of the SN estimator $\\widetilde{\\mu}_{P/Q}$ computed with $N$ samples. Here we report the original definition and its most common estimate:\n\n$$\\mathrm{ESS}(P\\|Q) = \\frac{N}{\\mathrm{Var}_{x \\sim Q}[w_{P/Q}(x)] + 1} = \\frac{N}{d_2(P\\|Q)}, \\qquad \\widehat{\\mathrm{ESS}}(P\\|Q) = \\frac{1}{\\sum_{i=1}^{N} \\widetilde{w}_{P/Q}(x_i)^2}. \\tag{6}$$\n\nThe ESS has an interesting interpretation: if $d_2(P\\|Q) = 1$, i.e., $P = Q$ almost everywhere, then $\\mathrm{ESS} = N$ since we are performing Monte Carlo estimation. Otherwise, the ESS decreases as the dissimilarity between the two distributions increases. In the literature, other ESS-like diagnostics have been proposed that also account for the nature of $f$ [24].\n\n$^{3}$Note that $|\\widetilde{\\mu}_{P/Q}| \\le \\|f\\|_{\\infty}$. Therefore, its variance is always finite.\n\n4 Optimization via Importance Sampling\n\nThe off-policy optimization problem [55] can be formulated as finding the best target policy $\\pi_T$ (or hyperpolicy $\\nu_T$), i.e., the one maximizing the expected return, having access to a set of samples collected with a behavioral policy $\\pi_B$ (or hyperpolicy $\\nu_B$). In a more abstract sense, we aim to determine the target distribution $P$ that maximizes $\\mathbb{E}_{x \\sim P}[f(x)]$ having samples collected from the fixed behavioral distribution $Q$. In this section, we analyze the problem of defining a proper objective function for this purpose. Directly optimizing the estimator $\\widehat{\\mu}_{P/Q}$ or $\\widetilde{\\mu}_{P/Q}$ is, in most cases, unsuccessful. With enough freedom in choosing $P$, the optimal solution would assign as much probability mass as possible to the maximum value among $f(x_i)$. Clearly, in this scenario, the estimator is unreliable and displays a large variance. For this reason, we adopt a risk-averse approach and we decide to optimize a statistical lower bound of the expected value $\\mathbb{E}_{x \\sim P}[f(x)]$ that holds with high confidence. We start by analyzing the behavior of the IS estimator and we provide the following result that bounds the variance of $\\widehat{\\mu}_{P/Q}$ in terms of the R\u00e9nyi divergence.\n\nLemma 4.1. Let $P$ and $Q$ be two probability measures on the measurable space $(\\mathcal{X}, \\mathcal{F})$ such that $P \\ll Q$. Let $\\mathbf{x} = (x_1, x_2, \\dots, x_N)^T$ be i.i.d. random variables sampled from $Q$ and $f : \\mathcal{X} \\to \\mathbb{R}$ be a bounded function ($\\|f\\|_{\\infty} < +\\infty$). Then, for any $N > 0$, the variance of the IS estimator $\\widehat{\\mu}_{P/Q}$ can be upper bounded as:\n\n$$\\mathrm{Var}_{\\mathbf{x} \\sim Q}[\\widehat{\\mu}_{P/Q}] \\le \\frac{1}{N} \\|f\\|_{\\infty}^2 d_2(P\\|Q). \\tag{7}$$\n\nWhen $P = Q$ almost everywhere, we get $\\mathrm{Var}_{\\mathbf{x} \\sim Q}[\\widehat{\\mu}_{Q/Q}] \\le \\frac{1}{N} \\|f\\|_{\\infty}^2$, a well-known bound on the variance of a Monte Carlo estimator. Recalling the definition of ESS (6) we can rewrite the previous bound as: $\\mathrm{Var}_{\\mathbf{x} \\sim Q}[\\widehat{\\mu}_{P/Q}] \\le \\|f\\|_{\\infty}^2 / \\mathrm{ESS}(P\\|Q)$, i.e., the variance scales with ESS instead of $N$. While $\\widehat{\\mu}_{P/Q}$ can have unbounded variance even if $f$ is bounded, the SN estimator $\\widetilde{\\mu}_{P/Q}$ is always bounded by $\\|f\\|_{\\infty}$ and therefore it always has a finite variance. Since the normalization term makes all the samples $\\widetilde{w}_{P/Q}(x_i) f(x_i)$ interdependent, an exact analysis of its bias and variance is more challenging. Several works adopted approximate methods to provide an expression for the variance [17]. We propose an analysis of bias and variance of the SN estimator in Appendix D.\n\n4.1 Concentration Inequality\n\nFinding a suitable concentration inequality for off-policy learning was studied in [56] for offline policy evaluation and subsequently in [55] for optimization. On one hand, fully empirical concentration inequalities, like Student-T, besides the asymptotic approximation, are not suitable in this case since the empirical variance needs to be estimated with importance sampling as well, injecting further uncertainty [29]. On the other hand, several distribution-free inequalities like Hoeffding require knowing the maximum of the estimator, which might not exist ($d_{\\infty}(P\\|Q) = \\infty$) for the IS estimator. Constraining $d_{\\infty}(P\\|Q)$ to be finite often introduces unacceptable limitations. 
For instance, in the case of univariate Gaussian distributions, it prevents a step that selects a target variance larger than the behavioral one from being performed (see Appendix C).$^{4}$ Even Bernstein inequalities [5] are hardly applicable since, for instance, in the case of univariate Gaussian distributions, the importance weights display a fat tail behavior (see Appendix C). We believe that a reasonable trade-off is to require the variance of the importance weights to be finite, which is equivalent to requiring $d_2(P\\|Q) < \\infty$, i.e., $\\sigma_P^2 < 2\\sigma_Q^2$ for univariate Gaussians. For this reason, we resort to Chebyshev-like inequalities and we propose the following concentration bound derived from Cantelli's inequality and customized for the IS estimator.\n\n$^{4}$Although the variance tends to be reduced in the learning process, there might be cases in which it needs to be increased (e.g., if we start with a behavioral policy with small variance, it might be beneficial to increase the variance to enforce exploration).\n\nTheorem 4.1. Let $P$ and $Q$ be two probability measures on the measurable space $(\\mathcal{X}, \\mathcal{F})$ such that $P \\ll Q$ and $d_2(P\\|Q) < +\\infty$. Let $x_1, x_2, \\dots, x_N$ be i.i.d. random variables sampled from $Q$, and $f : \\mathcal{X} \\to \\mathbb{R}$ be a bounded function ($\\|f\\|_{\\infty} < +\\infty$). Then, for any $0 < \\delta \\le 1$ and $N > 0$, with probability at least $1 - \\delta$ it holds that:\n\n$$\\mathbb{E}_{x \\sim P}[f(x)] \\ge \\frac{1}{N} \\sum_{i=1}^{N} w_{P/Q}(x_i) f(x_i) - \\|f\\|_{\\infty} \\sqrt{\\frac{(1 - \\delta) d_2(P\\|Q)}{\\delta N}}. \\tag{8}$$\n\nThe bound highlights the interesting trade-off between the estimated performance and the uncertainty introduced by changing the distribution. The latter enters in the bound as the 2-R\u00e9nyi divergence between the target distribution $P$ and the behavioral distribution $Q$. Intuitively, we should trust the estimator $\\widehat{\\mu}_{P/Q}$ as long as $P$ is not too far from $Q$. For the SN estimator, accounting for the bias, we are able to obtain a bound (reported in Appendix D), with a similar dependence on $P$ as in Theorem 4.1, albeit with different constants. Renaming all constants involved in the bound of Theorem 4.1 as $\\lambda = \\|f\\|_{\\infty} \\sqrt{(1 - \\delta)/\\delta}$, we get a surrogate objective function. The optimization can be carried out in different ways. The following section shows why using the natural gradient could be a successful choice in case $P$ and $Q$ can be expressed as parametric differentiable distributions.\n\n4.2 Importance Sampling and Natural Gradient\n\nWe can look at a parametric distribution $P_{\\omega}$, having $p_{\\omega}$ as a density function, as a point on a probability manifold with coordinates $\\omega \\in \\Omega$. If $p_{\\omega}$ is differentiable, the Fisher Information Matrix (FIM) [39, 2] is defined as: $\\mathcal{F}(\\omega) = \\int_{\\mathcal{X}} p_{\\omega}(x) \\nabla_{\\omega} \\log p_{\\omega}(x) \\nabla_{\\omega} \\log p_{\\omega}(x)^T \\, \\mathrm{d}x$. This matrix is, up to a scale, an invariant metric [1] on parameter space $\\Omega$, i.e., $\\kappa (\\omega' - \\omega)^T \\mathcal{F}(\\omega) (\\omega' - \\omega)$ is independent of the specific parameterization and provides a second order approximation of the distance between $p_{\\omega}$ and $p_{\\omega'}$ on the probability manifold up to a scale factor $\\kappa \\in \\mathbb{R}$. Given a loss function $\\mathcal{L}(\\omega)$, we define the natural gradient [1, 19] as $\\widetilde{\\nabla}_{\\omega} \\mathcal{L}(\\omega) = \\mathcal{F}^{-1}(\\omega) \\nabla_{\\omega} \\mathcal{L}(\\omega)$, which represents the steepest ascent direction in the probability manifold. Thanks to the invariance property, there is a tight connection between the geometry induced by the R\u00e9nyi divergence and the Fisher information metric [3].\n\nTheorem 4.2. Let $p_{\\omega}$ be a p.d.f. differentiable w.r.t. $\\omega \\in \\Omega$. Then, it holds that, for the R\u00e9nyi divergence: $D_{\\alpha}(p_{\\omega'}\\|p_{\\omega}) = \\frac{\\alpha}{2} (\\omega' - \\omega)^T \\mathcal{F}(\\omega) (\\omega' - \\omega) + o(\\|\\omega' - \\omega\\|_2^2)$, and for the exponentiated R\u00e9nyi divergence: $d_{\\alpha}(p_{\\omega'}\\|p_{\\omega}) = 1 + \\frac{\\alpha}{2} (\\omega' - \\omega)^T \\mathcal{F}(\\omega) (\\omega' - \\omega) + o(\\|\\omega' - \\omega\\|_2^2)$.\n\nThis result provides an approximate expression for the variance of the importance weights, as $\\mathrm{Var}_{x \\sim p_{\\omega}}[w_{\\omega'/\\omega}(x)] = d_2(p_{\\omega'}\\|p_{\\omega}) - 1 \\simeq \\frac{\\alpha}{2} (\\omega' - \\omega)^T \\mathcal{F}(\\omega) (\\omega' - \\omega)$ with $\\alpha = 2$. It also justifies the use of natural gradients in off-distribution optimization, since a step in natural gradient direction has a controllable effect on the variance of the importance weights.\n\nAlgorithm 1 Action-based POIS\n\nInitialize $\\theta_0^0$ arbitrarily\nfor $j = 0, 1, 2, \\dots$, until convergence do\n    Collect $N$ trajectories with $\\pi_{\\theta_0^j}$\n    for $k = 0, 1, 2, \\dots$, until convergence do\n        Compute $G(\\theta_k^j)$, $\\nabla_{\\theta_k^j} \\mathcal{L}(\\theta_k^j/\\theta_0^j)$ and $\\alpha_k$\n        $\\theta_{k+1}^j = \\theta_k^j + \\alpha_k G(\\theta_k^j)^{-1} \\nabla_{\\theta_k^j} \\mathcal{L}(\\theta_k^j/\\theta_0^j)$\n    end for\n    $\\theta_0^{j+1} = \\theta_k^j$\nend for\n\nAlgorithm 2 Parameter-based POIS\n\nInitialize $\\rho_0^0$ arbitrarily\nfor $j = 0, 1, 2, \\dots$, until convergence do\n    Sample $N$ policy parameters $\\theta_i^j$ from $\\nu_{\\rho_0^j}$\n    Collect a trajectory with each $\\pi_{\\theta_i^j}$\n    for $k = 0, 1, 2, \\dots$, until convergence do\n        Compute $G(\\rho_k^j)$, $\\nabla_{\\rho_k^j} \\mathcal{L}(\\rho_k^j/\\rho_0^j)$ and $\\alpha_k$\n        $\\rho_{k+1}^j = \\rho_k^j + \\alpha_k G(\\rho_k^j)^{-1} \\nabla_{\\rho_k^j} \\mathcal{L}(\\rho_k^j/\\rho_0^j)$\n    end for\n    $\\rho_0^{j+1} = \\rho_k^j$\nend for\n\n5 Policy Optimization via Importance Sampling\n\nIn this section, we discuss how to customize the bound provided in Theorem 4.1 for policy optimization, developing a novel model-free actor-only policy search algorithm, named Policy Optimization via Importance Sampling (POIS). We propose two versions of POIS: Action-based POIS (A-POIS), which is based on a policy gradient approach, and Parameter-based POIS (P-POIS), which adopts the PGPE framework. A more detailed description of the implementation aspects is reported in Appendix E.\n\n5.1 Action-based POIS\n\nIn Action-based POIS (A-POIS) we search for a policy that maximizes the performance index $J_D(\\theta)$ within a parametric space $\\Pi_{\\Theta} = \\{\\pi_{\\theta} : \\theta \\in \\Theta \\subseteq \\mathbb{R}^p\\}$ of stochastic differentiable policies. In this context, the behavioral (resp. target) distribution $Q$ (resp. $P$) becomes the distribution over trajectories $p(\\cdot|\\theta)$ (resp. $p(\\cdot|\\theta')$) induced by the behavioral policy $\\pi_{\\theta}$ (resp. 
target policy $\\pi_{\\theta'}$) and $f$ is the trajectory return $R(\\tau)$, which is uniformly bounded as $|R(\\tau)| \\le R_{\\max} \\frac{1 - \\gamma^H}{1 - \\gamma}$.$^{5}$ The surrogate loss function cannot be directly optimized via gradient ascent since computing $d_{\\alpha}\\left(p(\\cdot|\\theta') \\| p(\\cdot|\\theta)\\right)$ requires the approximation of an integral over the trajectory space and, for stochastic environments, knowledge of the transition model $P$, which is unknown in a model-free setting. Simple bounds to this quantity, like $d_{\\alpha}\\left(p(\\cdot|\\theta') \\| p(\\cdot|\\theta)\\right) \\le \\sup_{s \\in \\mathcal{S}} d_{\\alpha}\\left(\\pi_{\\theta'}(\\cdot|s) \\| \\pi_{\\theta}(\\cdot|s)\\right)^H$, besides being hard to compute due to the presence of the supremum, are extremely conservative since the R\u00e9nyi divergence is raised to the horizon $H$. We suggest the replacement of the R\u00e9nyi divergence with an estimate $\\widehat{d}_2\\left(p(\\cdot|\\theta') \\| p(\\cdot|\\theta)\\right) = \\frac{1}{N} \\sum_{i=1}^{N} \\prod_{t=0}^{H-1} d_2\\left(\\pi_{\\theta'}(\\cdot|s_{\\tau_i,t}) \\| \\pi_{\\theta}(\\cdot|s_{\\tau_i,t})\\right)$ defined only in terms of the policy R\u00e9nyi divergence (see Appendix E.2 for details). Thus, we obtain the following surrogate objective:\n\n$$\\mathcal{L}_{\\lambda}^{\\text{A-POIS}}(\\theta'/\\theta) = \\frac{1}{N} \\sum_{i=1}^{N} w_{\\theta'/\\theta}(\\tau_i) R(\\tau_i) - \\lambda \\sqrt{\\frac{\\widehat{d}_2\\left(p(\\cdot|\\theta') \\| p(\\cdot|\\theta)\\right)}{N}}, \\tag{9}$$\n\nwhere $w_{\\theta'/\\theta}(\\tau_i) = \\frac{p(\\tau_i|\\theta')}{p(\\tau_i|\\theta)} = \\prod_{t=0}^{H-1} \\frac{\\pi_{\\theta'}(a_{\\tau_i,t}|s_{\\tau_i,t})}{\\pi_{\\theta}(a_{\\tau_i,t}|s_{\\tau_i,t})}$. We consider the case in which $\\pi_{\\theta}(\\cdot|s)$ is a Gaussian distribution over actions whose mean depends on the state and whose covariance is state-independent and diagonal: $\\mathcal{N}(u_{\\mu}(s), \\mathrm{diag}(\\sigma^2))$, where $\\theta = (\\mu, \\sigma)$. The learning process mixes online and offline optimization. At each online iteration $j$, a dataset of $N$ trajectories is collected by executing in the environment the current policy $\\pi_{\\theta_0^j}$. These trajectories are used to optimize the surrogate loss function $\\mathcal{L}_{\\lambda}^{\\text{A-POIS}}$. At each offline iteration $k$, the parameters are updated via gradient ascent: $\\theta_{k+1}^j = \\theta_k^j + \\alpha_k G(\\theta_k^j)^{-1} \\nabla_{\\theta_k^j} \\mathcal{L}(\\theta_k^j/\\theta_0^j)$, where $\\alpha_k > 0$ is the step size, which is chosen via line search (see Appendix E.1), and $G(\\theta_k^j)$ is a positive semi-definite matrix (e.g., $\\mathcal{F}(\\theta_k^j)$, the FIM, for natural gradient).$^{6}$ The pseudo-code of POIS is reported in Algorithm 1.\n\n$^{5}$When $\\gamma \\to 1$ the bound becomes $H R_{\\max}$.\n\n5.2 Parameter-based POIS\n\nIn the Parameter-based POIS (P-POIS) we again consider a parametrized policy space $\\Pi_{\\Theta} = \\{\\pi_{\\theta} : \\theta \\in \\Theta \\subseteq \\mathbb{R}^p\\}$, but $\\pi_{\\theta}$ need not be differentiable. The policy parameters $\\theta$ are sampled at the beginning of each episode from a parametric hyperpolicy $\\nu_{\\rho}$ selected in a parametric space $\\mathcal{N}_{\\mathcal{P}} = \\{\\nu_{\\rho} : \\rho \\in \\mathcal{P} \\subseteq \\mathbb{R}^r\\}$. The goal is to learn the hyperparameters $\\rho$ so as to maximize $J_D(\\rho)$. In this setting, the distributions $Q$ and $P$ of Section 4 correspond to the behavioral $\\nu_{\\rho}$ and target $\\nu_{\\rho'}$ hyperpolicies, while $f$ remains the trajectory return $R(\\tau)$. The importance weights [67] must take into account all sources of randomness, derived from sampling a policy parameter $\\theta$ and a trajectory $\\tau$: $w_{\\rho'/\\rho}(\\theta) = \\frac{\\nu_{\\rho'}(\\theta) p(\\tau|\\theta)}{\\nu_{\\rho}(\\theta) p(\\tau|\\theta)} = \\frac{\\nu_{\\rho'}(\\theta)}{\\nu_{\\rho}(\\theta)}$. In practice, a Gaussian hyperpolicy $\\nu_{\\rho}$ with diagonal covariance matrix is often used, i.e., $\\mathcal{N}(\\mu, \\mathrm{diag}(\\sigma^2))$ with $\\rho = (\\mu, \\sigma)$. The policy is assumed to be deterministic: $\\pi_{\\theta}(a|s) = \\delta(a - u_{\\theta}(s))$, where $u_{\\theta}$ is a deterministic function of the state $s$ [e.g., 46, 14]. A first advantage over the action-based setting is that the distribution of the importance weights is entirely known, as it is the ratio of two Gaussians, and the R\u00e9nyi divergence $d_2(\\nu_{\\rho'}\\|\\nu_{\\rho})$ can be computed exactly [6] (see Appendix C). This leads to the following surrogate objective:\n\n$$\\mathcal{L}_{\\lambda}^{\\text{P-POIS}}(\\rho'/\\rho) = \\frac{1}{N} \\sum_{i=1}^{N} w_{\\rho'/\\rho}(\\theta_i) R(\\tau_i) - \\lambda \\sqrt{\\frac{d_2(\\nu_{\\rho'}\\|\\nu_{\\rho})}{N}}, \\tag{10}$$\n\nwhere each trajectory $\\tau_i$ is obtained by running an episode with action policy $\\pi_{\\theta_i}$, and the corresponding policy parameters $\\theta_i$ are sampled independently from hyperpolicy $\\nu_{\\rho}$, at the beginning of each episode. The hyperpolicy parameters are then updated offline as $\\rho_{k+1}^j = \\rho_k^j + \\alpha_k G(\\rho_k^j)^{-1} \\nabla_{\\rho_k^j} \\mathcal{L}(\\rho_k^j/\\rho_0^j)$ (see Algorithm 2 for the complete pseudo-code). A further advantage w.r.t. the action-based case is that the FIM $\\mathcal{F}(\\rho)$ can be computed exactly, and it is diagonal in the case of a Gaussian hyperpolicy with diagonal covariance matrix, turning a problematic inversion into a trivial division (the FIM is block-diagonal in the more general case of a Gaussian hyperpolicy, as observed in [26]). This makes natural gradient much more enticing for P-POIS.\n\n6 Experimental Evaluation\n\nIn this section, we present the experimental evaluation of POIS in its two flavors (action-based and parameter-based). 
We \ufb01rst provide a set of empirical comparisons on classical continuous control\ntasks with linearly parametrized policies; we then show how POIS can be also adopted for learning\ndeep neural policies. In all experiments, for the A-POIS we used the IS estimator, while for P-POIS\nwe employed the SN estimator. All experimental details are provided in Appendix F.\n\n6.1 Linear Policies\n\nLinear parametrized Gaussian policies proved their ability to scale on complex control tasks [38]. In\nthis section, we compare the learning performance of A-POIS and P-POIS against TRPO [42] and\n\n6The FIM needs to be estimated via importance sampling as well, as shown in Appendix E.3.\n\n7\n\n\fTask A-POIS P-POIS TRPO PPO\n(a)\n0.01\n(b)\n(c)\n(d)\n(e)\n\n0.01\n0.01 0.01\n\n0.4\n0.1\n0.2\n1\n0.8\n\n0.4\n0.1\n0.7\n0.9\n0.9\n\n0.1\n0.1\n1\n\n1\n1\n1\n\n(a) Cartpole\n\n(b) Inverted Double Pendulum\n\n(c) Acrobot\n\n(d) Mountain Car\n\n(e) Inverted Pendulum\n\nFigure 1: Average return as a function of the number of trajectories for A-POIS, P-POIS and TRPO\nwith linear policy (20 runs, 95% c.i.). The table reports the best hyperparameters found (\u03b4 for POIS\nand the step size for TRPO and PPO).\n\nPPO [44] on classical continuous control benchmarks [12]. In Figure 1, we can see that both versions\nof POIS are able to signi\ufb01cantly outperform both TRPO and PPO in the Cartpole environments,\nespecially the P-POIS. In the Inverted Double Pendulum environment the learning curve of P-POIS\nis remarkable while A-POIS displays a behavior comparable to PPO. In the Acrobot task, P-POIS\ndisplays a better performance w.r.t. TRPO and PPO, but A-POIS does not keep up. In Mountain Car,\nwe see yet another behavior: the learning curves of TRPO, PPO and P-POIS are almost one-shot (even\nif PPO shows a small instability), while A-POIS fails to display such a fast convergence. Finally, in\nthe Inverted Pendulum environment, TRPO and PPO outperform both versions of POIS. 
This example\nhighlights a limitation of our approach. Since POIS performs an importance sampling procedure at\ntrajectory level, it cannot assign credit to good actions in bad trajectories. On the contrary, weighting\neach sample, TRPO and PPO are able also to exploit good trajectory segments. In principle, this\nproblem can be mitigated in POIS by resorting to per-decision importance sampling [36], in which\nthe weight is assigned to individual rewards instead of trajectory returns. Overall, POIS displays\na performance comparable with TRPO and PPO across the tasks. In particular, P-POIS displays a\nbetter performance w.r.t. A-POIS. However, this ordering is not maintained when moving to more\ncomplex policy architectures, as shown in the next section.\nIn Figure 2 we show, for several metrics, the behavior of A-POIS when changing the \u03b4 parameter\nin the Cartpole environment. We can see that when \u03b4 is small (e.g., 0.2), the Effective Sample Size\n(ESS) remains large and, consequently, the variance of the importance weights (Var[w]) is small.\nThis means that the penalization term in the objective function discourages the optimization process\nfrom selecting policies which are far from the behavioral policy. As a consequence, the displayed\nbehavior is very conservative, preventing the policy from reaching the optimum. On the contrary,\nwhen \u03b4 approaches 1, the ESS is smaller and the variance of the weights tends to increase signi\ufb01cantly.\nAgain, the performance remains suboptimal as the penalization term in the objective function is too\nlight. The best behavior is obtained with an intermediate value of \u03b4, speci\ufb01cally 0.4.\n\n6.2 Deep Neural Policies\n\nIn this section, we adopt a deep neural network (3 layers: 100, 50, 25 neurons each) to represent\nthe policy. The experiment setup is fully compatible with the classical benchmark [12]. 
While A-POIS can be directly applied to deep neural networks, P-POIS exhibits some critical issues. A high-dimensional hyperpolicy (like a Gaussian from which the weights of an MLP policy are sampled) can make $d_2(\\nu_{\\rho'} \\| \\nu_\\rho)$ extremely sensitive to small parameter changes, leading to over-conservative updates.\u2077 A first practical variant comes from the insight that $d_2(\\nu_{\\rho'} \\| \\nu_\\rho)/N$ is the inverse of the effective sample size, as reported in Equation 6. We can obtain a less conservative (although approximate) surrogate function by replacing it with $1/\\widehat{\\mathrm{ESS}}(\\nu_{\\rho'} \\| \\nu_\\rho)$. Another trick is to model the hyperpolicy as a set of independent Gaussians, each defined over a disjoint subspace of $\\Theta$ (implementation details are provided in Appendix E.5). In Table 1, we augmented the results provided in [12] with the performance of POIS for the considered tasks. We can see that A-POIS is able to reach an overall behavior comparable with the best of the action-based algorithms, approaching TRPO and beating DDPG. Similarly, P-POIS exhibits a performance similar to CEM [52], the best performing among the parameter-based methods. The complete results are reported in Appendix F.\n\nFigure 2: Average return, Effective Sample Size (ESS) and variance of the importance weights (Var[w]) as a function of the number of trajectories for A-POIS for different values of the parameter \u03b4 in the Cartpole environment (20 runs, 95% c.i.).\n\nTable 1: Performance of POIS compared with [12] on deep neural policies (5 runs, 95% c.i.). In bold, the performances that are not statistically significantly different from the best algorithm in each task.\n\nAlgorithm | Cart-Pole Balancing | Mountain Car | Double Inverted Pendulum | Swimmer\nREINFORCE | 4693.7 \u00b1 14.0 | \u221267.1 \u00b1 1.0 | 4116.5 \u00b1 65.2 | 92.3 \u00b1 0.1\nTRPO | 4869.8 \u00b1 37.6 | \u221261.7 \u00b1 0.9 | 4412.4 \u00b1 50.4 | 96.0 \u00b1 0.2\nDDPG | 4634.4 \u00b1 87.6 | \u2212288.4 \u00b1 170.3 | 2863.4 \u00b1 154.0 | 85.8 \u00b1 1.8\nA-POIS | 4842.8 \u00b1 13.0 | \u221263.7 \u00b1 0.5 | 4232.1 \u00b1 189.5 | 88.7 \u00b1 0.55\nCEM | 4815.4 \u00b1 4.8 | \u221266.0 \u00b1 2.4 | 2566.2 \u00b1 178.9 | 68.8 \u00b1 2.4\nP-POIS | 4428.1 \u00b1 138.6 | \u221278.9 \u00b1 2.5 | 3161.4 \u00b1 959.2 | 76.8 \u00b1 1.6\n\n7 Discussion and Conclusions\n\nIn this paper, we presented a new actor-only policy optimization algorithm, POIS, which alternates online and offline optimization in order to efficiently exploit the collected trajectories, and can be used in combination with action-based and parameter-based exploration. In contrast to the state-of-the-art algorithms, POIS has a strong theoretical grounding, since its surrogate objective function derives from a statistical bound on the estimated performance, which is able to capture the uncertainty induced by importance sampling. The experimental evaluation showed that POIS, in both its versions (action-based and parameter-based), is able to achieve a performance comparable with TRPO, PPO and other classical algorithms on continuous control tasks. Natural extensions of POIS could focus on employing per-decision importance sampling, adaptive batch size, and trajectory reuse. Future work also includes scaling POIS to high-dimensional tasks and highly-stochastic environments. 
We believe that this work represents a valuable starting point for a deeper understanding of modern policy optimization and for the development of effective and scalable policy search methods.\n\n\u2077 This curse of dimensionality, related to dim(\u03b8), has some similarities with the dependence of the R\u00e9nyi divergence on the actual horizon H in the action-based case.\n\n[Figure 2 panels: average return, ESS and Var[w] versus number of trajectories (\u00d710\u2074), with curves for \u03b4 = 0.2, 0.4, 0.6, 0.8 and 1.]\n\nAcknowledgments\n\nThe study was partially funded by Lombardy Region (Announcement PORFESR 2014-2020). F. F. was partially funded through ERC Advanced Grant (no: 742870). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40cm, Titan XP and Tesla V100 used for this research.\n\nReferences\n[1] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251\u2013276, 1998.\n\n[2] Shun-ichi Amari. Differential-geometrical methods in statistics, volume 28. Springer Science & Business Media, 2012.\n\n[3] Shun-ichi Amari and Andrzej Cichocki. Information geometry of divergence functions. Bulletin of the Polish Academy of Sciences: Technical Sciences, 58(1):183\u2013195, 2010.\n\n[4] Jonathan Baxter and Peter L Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319\u2013350, 2001.\n\n[5] Bernard Bercu, Bernard Delyon, and Emmanuel Rio. Concentration inequalities for sums. In Concentration Inequalities for Sums and Martingales, pages 11\u201360. Springer, 2015.\n\n[6] Jacob Burbea. The convexity with respect to gaussian distributions of divergences of order \u03b1. Utilitas Mathematica, 26:171\u2013192, 1984.\n\n[7] William G Cochran. Sampling techniques. John Wiley & Sons, 2007.\n\n[8] Corinna Cortes, Yishay Mansour, and Mehryar Mohri. 
Learning bounds for importance weighting. In\n\nAdvances in neural information processing systems, pages 442\u2013450, 2010.\n\n[9] Thomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic.\n\narXiv:1205.4839, 2012.\n\narXiv preprint\n\n[10] Marc Peter Deisenroth, Gerhard Neumann, Jan Peters, et al. A survey on policy search for robotics.\n\nFoundations and Trends in Robotics, 2(1\u20132):1\u2013142, 2013.\n\n[11] Shayan Doroudi, Philip S Thomas, and Emma Brunskill. Importance sampling for fair policy selection.\n\nUncertainty in Arti\ufb01cial Intelligence, 2017.\n\n[12] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement\nlearning for continuous control. In International Conference on Machine Learning, pages 1329\u20131338,\n2016.\n\n[13] Xavier Glorot and Yoshua Bengio. Understanding the dif\ufb01culty of training deep feedforward neural\nnetworks. In Proceedings of the thirteenth international conference on arti\ufb01cial intelligence and statistics,\npages 249\u2013256, 2010.\n\n[14] Mandy Gr\u00fcttner, Frank Sehnke, Tom Schaul, and J\u00fcrgen Schmidhuber. Multi-dimensional deep memory\n\ngo-player for parameter exploring policy gradients. 2010.\n\n[15] Zhaohan Guo, Philip S Thomas, and Emma Brunskill. Using options and covariance testing for long\nIn Advances in Neural Information Processing Systems, pages\n\nhorizon off-policy policy evaluation.\n2489\u20132498, 2017.\n\n[16] Nikolaus Hansen and Andreas Ostermeier. Completely derandomized self-adaptation in evolution strategies.\n\nEvolutionary computation, 9(2):159\u2013195, 2001.\n\n[17] Timothy Classen Hesterberg. Advances in importance sampling. PhD thesis, Stanford University, 1988.\n\n[18] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning.\n\nInternational Conference on Machine Learning, volume 2, pages 267\u2013274, 2002.\n\nIn\n\n[19] Sham M Kakade. A natural policy gradient. 
In Advances in neural information processing systems, pages\n\n1531\u20131538, 2002.\n\n[20] Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The\n\nInternational Journal of Robotics Research, 32(11):1238\u20131274, 2013.\n\n10\n\n\f[21] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in neural information processing\n\nsystems, pages 1008\u20131014, 2000.\n\n[22] Augustine Kong. A note on importance sampling using standardized weights. University of Chicago, Dept.\n\nof Statistics, Tech. Rep, 348, 1992.\n\n[23] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David\nSilver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint\narXiv:1509.02971, 2015.\n\n[24] Luca Martino, V\u00edctor Elvira, and Francisco Louzada. Effective sample size for importance sampling based\n\non discrepancy measures. Signal Processing, 131:386\u2013401, 2017.\n\n[25] Takamitsu Matsubara, Tetsuro Morimura, and Jun Morimoto. Adaptive step-size policy gradients with\naverage reward metric. In Proceedings of 2nd Asian Conference on Machine Learning, pages 285\u2013298,\n2010.\n\n[26] Atsushi Miyamae, Yuichi Nagata, Isao Ono, and Shigenobu Kobayashi. Natural policy gradient methods\nwith parameter-based exploration for control tasks. In Advances in neural information processing systems,\npages 1660\u20131668, 2010.\n\n[27] R\u00e9mi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and ef\ufb01cient off-policy\nreinforcement learning. In Advances in Neural Information Processing Systems, pages 1054\u20131062, 2016.\n\n[28] Andrew Y Ng and Michael Jordan. Pegasus: A policy search method for large mdps and pomdps. In\nProceedings of the Sixteenth conference on Uncertainty in arti\ufb01cial intelligence, pages 406\u2013415. Morgan\nKaufmann Publishers Inc., 2000.\n\n[29] Art B. Owen. Monte Carlo theory, methods and examples. 
2013.\n\n[30] Jing Peng and Ronald J Williams. Incremental multi-step q-learning. In Machine Learning Proceedings\n\n1994, pages 226\u2013232. Elsevier, 1994.\n\n[31] Jan Peters, Katharina M\u00fclling, and Yasemin Altun. Relative entropy policy search. In AAAI, pages\n\n1607\u20131612. Atlanta, 2010.\n\n[32] Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space\ncontrol. In Proceedings of the 24th international conference on Machine learning, pages 745\u2013750. ACM,\n2007.\n\n[33] Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180\u20131190, 2008.\n\n[34] Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural\n\nnetworks, 21(4):682\u2013697, 2008.\n\n[35] Matteo Pirotta, Marcello Restelli, Alessio Pecorino, and Daniele Calandriello. Safe policy iteration. In\n\nInternational Conference on Machine Learning, pages 307\u2013315, 2013.\n\n[36] Doina Precup, Richard S Sutton, and Satinder P Singh. Eligibility traces for off-policy policy evaluation.\n\nIn International Conference on Machine Learning, pages 759\u2013766. Citeseer, 2000.\n\n[37] Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley &\n\nSons, 2014.\n\n[38] Aravind Rajeswaran, Kendall Lowrey, Emanuel V Todorov, and Sham M Kakade. Towards generalization\nand simplicity in continuous control. In Advances in Neural Information Processing Systems, pages\n6553\u20136564, 2017.\n\n[39] C Radhakrishna Rao. Information and the accuracy attainable in the estimation of statistical parameters. In\n\nBreakthroughs in statistics, pages 235\u2013247. Springer, 1992.\n\n[40] Alfr\u00e9d R\u00e9nyi. On measures of entropy and information. Technical report, Hungarian Academy of Sciences\n\nBudapest Hungary, 1961.\n\n[41] Reuven Rubinstein. The cross-entropy method for combinatorial and continuous optimization. 
Methodology\n\nand computing in applied probability, 1(2):127\u2013190, 1999.\n\n[42] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy\n\noptimization. In International Conference on Machine Learning, pages 1889\u20131897, 2015.\n\n11\n\n\f[43] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional\n\ncontinuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.\n\n[44] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy\n\noptimization algorithms. arXiv preprint arXiv:1707.06347, 2017.\n\n[45] Frank Sehnke, Christian Osendorfer, Thomas R\u00fcckstie\u00df, Alex Graves, Jan Peters, and J\u00fcrgen Schmidhuber.\nPolicy gradients with parameter-based exploration for control. In International Conference on Arti\ufb01cial\nNeural Networks, pages 387\u2013396. Springer, 2008.\n\n[46] Frank Sehnke, Christian Osendorfer, Thomas R\u00fcckstie\u00df, Alex Graves, Jan Peters, and J\u00fcrgen Schmidhuber.\n\nParameter-exploring policy gradients. Neural Networks, 23(4):551\u2013559, 2010.\n\n[47] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Determin-\n\nistic policy gradient algorithms. In International Conference on Machine Learning, 2014.\n\n[48] Kenneth O Stanley and Risto Miikkulainen. Evolving neural networks through augmenting topologies.\n\nEvolutionary computation, 10(2):99\u2013127, 2002.\n\n[49] Yi Sun, Daan Wierstra, Tom Schaul, and Juergen Schmidhuber. Ef\ufb01cient natural evolution strategies. In\nProceedings of the 11th Annual conference on Genetic and evolutionary computation, pages 539\u2013546.\nACM, 2009.\n\n[50] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT press\n\nCambridge, 1998.\n\n[51] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. 
Policy gradient methods\nfor reinforcement learning with function approximation. In Advances in neural information processing\nsystems, pages 1057\u20131063, 2000.\n\n[52] Istv\u00e1n Szita and Andr\u00e1s L\u00f6rincz. Learning tetris using the noisy cross-entropy method. Neural computation,\n\n18(12):2936\u20132941, 2006.\n\n[53] Russ Tedrake, Teresa Weirui Zhang, and H Sebastian Seung. Stochastic policy gradient reinforcement\nlearning on a simple 3d biped. In Intelligent Robots and Systems, 2004.(IROS 2004). Proceedings. 2004\nIEEE/RSJ International Conference on, volume 3, pages 2849\u20132854. IEEE, 2004.\n\n[54] Philip Thomas and Emma Brunskill. Data-ef\ufb01cient off-policy policy evaluation for reinforcement learning.\n\nIn International Conference on Machine Learning, pages 2139\u20132148, 2016.\n\n[55] Philip Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High con\ufb01dence policy improve-\n\nment. In International Conference on Machine Learning, pages 2380\u20132388, 2015.\n\n[56] Philip S Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High-con\ufb01dence off-policy\n\nevaluation. In AAAI, pages 3000\u20133006, 2015.\n\n[57] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In\nIntelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026\u20135033.\nIEEE, 2012.\n\n[58] George Tucker, Surya Bhupatiraju, Shixiang Gu, Richard E Turner, Zoubin Ghahramani, and Sergey Levine.\nThe mirage of action-dependent baselines in reinforcement learning. arXiv preprint arXiv:1802.10031,\n2018.\n\n[59] Tim Van Erven and Peter Harremos. R\u00e9nyi divergence and kullback-leibler divergence. IEEE Transactions\n\non Information Theory, 60(7):3797\u20133820, 2014.\n\n[60] Jay M Ver Hoef. Who invented the delta method? 
The American Statistician, 66(2):124\u2013127, 2012.\n\n[61] Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando\nde Freitas. Sample ef\ufb01cient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.\n\n[62] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-4):279\u2013292, 1992.\n\n[63] Daan Wierstra, Tom Schaul, Jan Peters, and Juergen Schmidhuber. Natural evolution strategies.\n\nIn\nEvolutionary Computation, 2008. CEC 2008.(IEEE World Congress on Computational Intelligence). IEEE\nCongress on, pages 3381\u20133387. IEEE, 2008.\n\n[64] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement\n\nlearning. In Reinforcement Learning, pages 5\u201332. Springer, 1992.\n\n12\n\n\f[65] Cathy Wu, Aravind Rajeswaran, Yan Duan, Vikash Kumar, Alexandre M Bayen, Sham Kakade, Igor\nMordatch, and Pieter Abbeel. Variance reduction for policy gradient with action-dependent factorized\nbaselines. arXiv preprint arXiv:1803.07246, 2018.\n\n[66] Tingting Zhao, Hirotaka Hachiya, Gang Niu, and Masashi Sugiyama. Analysis and improvement of policy\n\ngradient estimation. In Advances in Neural Information Processing Systems, pages 262\u2013270, 2011.\n\n[67] Tingting Zhao, Hirotaka Hachiya, Voot Tangkaratt, Jun Morimoto, and Masashi Sugiyama. Ef\ufb01cient\nsample reuse in policy gradients with parameter-based exploration. Neural computation, 25(6):1512\u20131547,\n2013.\n\n13\n\n\f", "award": [], "sourceid": 2613, "authors": [{"given_name": "Alberto Maria", "family_name": "Metelli", "institution": "Politecnico di Milano"}, {"given_name": "Matteo", "family_name": "Papini", "institution": "Politecnico di Milano"}, {"given_name": "Francesco", "family_name": "Faccio", "institution": "Politecnico di Milano -                 The Swiss AI Lab, IDSIA (USI & SUPSI)"}, {"given_name": "Marcello", "family_name": "Restelli", "institution": "Politecnico di Milano"}]}