{"title": "Fast Efficient Hyperparameter Tuning for Policy Gradient Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 4616, "page_last": 4626, "abstract": "The performance of policy gradient methods is sensitive to hyperparameter settings that must be tuned for any new application. Widely used grid search methods for tuning hyperparameters are sample inefficient and computationally expensive. More advanced methods like Population Based Training that learn optimal schedules for hyperparameters instead of fixed settings can yield better results, but are also sample inefficient and computationally expensive. In this paper, we propose Hyperparameter Optimisation on the Fly (HOOF), a gradient-free algorithm that requires no more than one training run to automatically adapt the hyperparameters that affect the policy update directly through the gradient. The main idea is to use existing trajectories sampled by the policy gradient method to optimise a one-step improvement objective, yielding a sample and computationally efficient algorithm that is easy to implement. Our experimental results across multiple domains and algorithms show that using HOOF to learn these hyperparameter schedules leads to faster learning with improved performance.", "full_text": "Fast Efficient Hyperparameter Tuning for Policy Gradient Methods\n\nSupratik Paul, Vitaly Kurin, Shimon Whiteson\n\n{supratik.paul,vitaly.kurin,shimon.whiteson}@cs.ox.ac.uk\n\nDepartment of Computer Science\n\nUniversity of Oxford\n\nAbstract\n\nThe performance of policy gradient methods is sensitive to hyperparameter settings that must be tuned for any new application. 
Widely used grid search methods for tuning hyperparameters are sample inefficient and computationally expensive. More advanced methods like Population Based Training (Jaderberg et al., 2017) that learn optimal schedules for hyperparameters instead of fixed settings can yield better results, but are also sample inefficient and computationally expensive. In this paper, we propose Hyperparameter Optimisation on the Fly (HOOF), a gradient-free algorithm that requires no more than one training run to automatically adapt the hyperparameters that affect the policy update directly through the gradient. The main idea is to use existing trajectories sampled by the policy gradient method to optimise a one-step improvement objective, yielding a sample and computationally efficient algorithm that is easy to implement. Our experimental results across multiple domains and algorithms show that using HOOF to learn these hyperparameter schedules leads to faster learning with improved performance.\n\n1 Introduction\n\nPolicy gradient methods (Williams, 1992; Sutton et al., 1999) optimise reinforcement learning policies by performing gradient ascent on the policy parameters and have shown considerable success in environments characterised by large or continuous action spaces (Mordatch et al., 2015; Schulman et al., 2016; Rajeswaran et al., 2017). However, like other gradient-based optimisation methods, their performance can be sensitive to a number of key hyperparameters.\nFor example, the performance of first order policy gradient methods can depend critically on the learning rate, the choice of which in turn often depends on the task, the particular policy gradient method in use, and even the optimiser, e.g., RMSProp (Tieleman and Hinton, 2012) and ADAM (Kingma and Ba, 2014) have narrow ranges for good learning rates (Henderson et al., 2018b) which may not be known a priori. 
Even for second order methods like Natural Policy Gradients (NPG) (Kakade, 2001) or Trust Region Policy Optimisation (TRPO) (Schulman et al., 2015), which are more robust to the KL divergence constraint (which can be interpreted as a learning rate), significant performance gains can often be obtained by tuning this parameter (Duan et al., 2016).\nSimilarly, variance reduction techniques such as Generalised Advantage Estimators (GAE) (Schulman et al., 2016), which trade variance for bias in policy gradient estimates, introduce key hyperparameters $(\gamma, \lambda)$ that can also greatly affect performance (Schulman et al., 2016; Mahmood et al., 2018).\nGiven such sensitivities, there is a great need for effective methods for tuning policy gradient hyperparameters. Perhaps the most popular hyperparameter optimiser is simply grid search (Schulman et al., 2015; Mnih et al., 2016; Duan et al., 2016; Igl et al., 2018; Farquhar et al., 2018). More sophisticated techniques such as Bayesian optimisation (BO) (Srinivas et al., 2010; Hutter et al., 2011; Snoek et al., 2012; Chen et al., 2018) have also proven effective, and new innovations such as Population Based Training (PBT) (Jaderberg et al., 2017) and meta-gradients (Xu et al., 2018) have shown considerable promise. Furthermore, a host of methods have been proposed for hyperparameter optimisation in supervised learning (see Section 4).\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nHowever, all these methods suffer from a major problem: they require performing many learning runs to identify good hyperparameters. This is particularly problematic in reinforcement learning, where it incurs not just computational costs but sample costs, as new learning runs typically require fresh interactions with the environment. This sample inefficiency is obvious in the case of grid search, BO based methods and PBT. 
However, even meta-gradients, which reuse samples collected by the underlying policy gradient method to train the meta-learner, require multiple training runs. This is because the meta-learner introduces its own set of hyperparameters, e.g., the meta learning rate and reference $(\gamma, \lambda)$, all of which need tuning to achieve good performance.\nFurthermore, grid search and BO based methods typically estimate only the best fixed values of the hyperparameters, which often actually need to change dynamically during learning (Jaderberg et al., 2017; Fran\u00e7ois-Lavet et al., 2015). This is particularly important in reinforcement learning, where the distribution of visited states, the need for exploration, and the cost of taking suboptimal actions can all vary greatly during a single learning run.\nTo make hyperparameter optimisation practical for reinforcement learning methods such as policy gradients, we need radically more efficient methods that can dynamically set key hyperparameters on the fly, not just find the best fixed values, and do so within a single run, using only the data that the baseline method would have gathered anyway, without introducing new hyperparameters that need tuning. This goal may seem ambitious, but in this paper we show that it is actually entirely feasible, using a surprisingly simple method we call Hyperparameter Optimisation on the Fly (HOOF).\nThe main idea is as follows: At each iteration, sample trajectories using the current policy. Next, generate some candidate policies and estimate their value sample efficiently by using an off-policy method. Finally, update the policy greedily with respect to the estimated value of the candidates. 
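The three steps just described can be written as a single iteration loop. Below is a minimal sketch; the helper names `sample_trajectories`, `policy_update`, and `off_policy_value` are our placeholders, not functions from the paper's released code:

```python
def hoof_iteration(policy, candidate_hyperparams,
                   sample_trajectories, policy_update, off_policy_value):
    """One HOOF step: sample once, evaluate many candidate updates off-policy,
    and keep the greedily best candidate policy."""
    trajectories = sample_trajectories(policy)           # reused for every candidate
    candidates = [policy_update(policy, trajectories, psi)
                  for psi in candidate_hyperparams]      # no new env samples needed
    values = [off_policy_value(c, policy, trajectories) for c in candidates]
    best = max(range(len(candidates)), key=values.__getitem__)
    return candidates[best], candidate_hyperparams[best]
```

Note that the environment is queried only once per iteration; all candidates are scored against the same batch of trajectories.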
In practice, HOOF uses the policy gradient method with different hyperparameter settings (e.g., the learning rate, $\gamma$, and $\lambda$) to generate candidate policies and then uses importance sampling (IS) to construct off-policy estimates of the value of each candidate policy.\nThe viability of such a simple approach is counter-intuitive, since off-policy evaluation using IS tends to have high variance that grows rapidly as the behaviour and evaluation policies diverge. However, HOOF is motivated by the insight that in second order methods such as NPG and TRPO, constraints on the magnitude of the update in policy space ensure that the IS estimates remain informative. While this is not the case for first order methods, we show that adding a simple KL constraint, without any of the complications of second order methods, suffices to keep IS estimates informative and enable effective hyperparameter optimisation. We further show that the performance of HOOF is robust to the setting of this KL constraint.\nHOOF is 1) sample efficient, requiring no more than one training run; 2) computationally efficient compared to sequential and parallel search methods; 3) able to learn a dynamic schedule for the hyperparameters that outperforms methods that learn fixed hyperparameter settings; and 4) simple to implement. Being gradient free, HOOF also avoids the limitations of gradient-based methods (Sutton, 1992; Luketina et al., 2016; Xu et al., 2018) for learning hyperparameters. While such methods can be more sample efficient than grid search or PBT, they can be sensitive to the choice of their own hyperparameters (see Sections 4 and 5.1) and thus still require multiple training runs to tune them.\nWe evaluate HOOF across a range of simulated continuous control tasks using the Mujoco OpenAI Gym environments (Brockman et al., 2016). 
First, we apply HOOF to A2C (Mnih et al., 2016), and show that using it to learn the learning rate can improve performance. We also perform a benchmarking exercise where we use HOOF to learn both the learning rate and the weighting for the entropy term and compare it against a grid search across these two hyperparameters. Next, we show that using HOOF to learn optimal hyperparameter schedules for NPG can outperform TRPO. This suggests that while strictly enforcing the KL constraint enables TRPO to outperform NPG, doing so becomes unnecessary once we can properly adapt NPG's hyperparameters.\n\n2 Background\n\nConsider the RL task where an agent interacts with its environment and tries to maximise its expected return. At timestep t, it observes the current state $s_t$, takes an action $a_t$, receives a reward $r_t = r(s_t, a_t)$, and transitions to a new state $s_{t+1}$ following some transition probability $P$. The value function of the state $s_t$ is $V(s_t) = \mathbb{E}_{a \sim \pi, s \sim P}[\sum_{i=0}^{\infty} \gamma^i r_{t+i}]$ for some discount rate $\gamma \in [0, 1)$. The undiscounted formulation of the objective is to find a policy that maximises the expected return $J(\pi) = \mathbb{E}_{a \sim \pi, s \sim P, s_0 \sim p(s_0)}[\sum_t r_t]$. In stochastic policy gradient algorithms, $a_t$ is sampled from a parametrised stochastic policy $\pi(a|s)$ that maps states to actions. These methods perform an update of the form\n\n$\pi' = \pi + f(\psi)$.  (1)\n\nHere $f(\psi)$ represents a step along the gradient direction for some objective function estimated from a batch of sampled trajectories $\{\tau^\pi_1, \tau^\pi_2, \ldots, \tau^\pi_K\}$, and $\psi$ is the set of hyperparameters. 
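As a concrete illustration of the discounted return inside the value-function expectation above, here is a minimal Python sketch (our own, not from the paper's code), accumulated backwards for O(T) cost:

```python
def discounted_return(rewards, gamma):
    """Compute sum_{i>=0} gamma^i * r_{t+i} for a finite reward sequence."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# e.g. rewards [1, 1, 1] with gamma = 0.5 give 1 + 0.5 + 0.25 = 1.75
```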
We use $\pi$ to denote both the policy and its parameters.\nFor policy gradient methods with GAE, $\psi = (\alpha, \gamma, \lambda)$, and the update takes the form:\n\n$f(\alpha, \gamma, \lambda) = \alpha \underbrace{\sum_t \nabla \log \pi(a_t|s_t) A_t^{GAE(\gamma,\lambda)}}_{g(\gamma,\lambda)}$  (2)\n\nwhere $A_t^{GAE(\gamma,\lambda)} = (1-\lambda)(A_t^{(1)} + \lambda A_t^{(2)} + \lambda^2 A_t^{(3)} + \ldots)$ with $A_t^{(k)} = -V(s_t) + r_t + \gamma r_{t+1} + \ldots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k})$. By discounting future rewards and bootstrapping off the value function, GAE reduces the variance due to rewards observed far in the future, but adds bias to the policy gradient estimate. Well chosen $(\gamma, \lambda)$ can significantly speed up learning (Schulman et al., 2016; Henderson et al., 2018a; Mahmood et al., 2018).\nIn first order methods, small updates in parameter space can lead to large changes in policy space, leading to large changes in performance. Second order methods like NPG address this by restricting the change to the policy through the constraint $KL(\pi'||\pi) \le \delta$. An approximate solution to this constrained optimisation problem leads to the update rule:\n\n$f(\delta, \gamma, \lambda) = \sqrt{\frac{2\delta}{g(\gamma,\lambda)^T I(\pi)^{-1} g(\gamma,\lambda)}} \, I(\pi)^{-1} g(\gamma,\lambda)$,  (3)\n\nwhere $I(\pi)$ is the estimated Fisher information matrix (FIM).\nSince the above is only an approximate solution, the $KL(\pi'||\pi)$ constraint can be violated in some iterations. Further, since $\delta$ is not adaptive, it might be too large for some iterations. TRPO addresses these issues by requiring an improvement in the surrogate $L_\pi(\pi') = \mathbb{E}_{a \sim \pi, s \sim P}[\frac{\pi'(a|s)}{\pi(a|s)} A^{GAE(\gamma,\lambda)}]$, as well as ensuring that the KL-divergence constraint is satisfied. 
It does this by performing a backtracking line search along the gradient direction. As a result, TRPO is more robust to the choice of $\delta$ (Schulman et al., 2015).\n\n3 Hyperparameter Optimisation on the Fly\n\nThe main idea behind HOOF is to automatically adapt the hyperparameters during training by greedily maximising the value of the updated policy, i.e., starting with policy $\pi_n$ at iteration n, HOOF sets\n\n$\psi_n = \mathrm{argmax}_\psi \, J(\pi_{n+1}) = \mathrm{argmax}_\psi \, J(\pi_n + f(\psi))$.  (4)\n\nGiven a set of sampled trajectories, $f(\psi)$ can be computed for any $\psi$, and thus we can generate different candidate $\pi_{n+1}$ without requiring any further samples. However, solving the optimisation problem in (4) requires evaluating $J(\pi_{n+1})$ for each such candidate. Any on-policy approach would have prohibitive sample requirements, so HOOF uses weighted importance sampling (WIS) to construct an off-policy estimate of $J(\pi_{n+1})$.\n\nAlgorithm 1 HOOF\ninput Initial policy $\pi_0$, number of policy iterations N, search space for $\psi$, KL constraint $\epsilon$ if using first order policy gradient method.\n1: for n = 0, 1, 2, 3, ..., N do\n2: Sample trajectories $\tau_{1:K}$ using $\pi_n$.\n3: for z = 1, 2, ..., Z do\n4: Generate candidate hyperparameters $\psi_z$ from the search space.\n5: Compute candidate policy $\pi_z$ using $\psi_z$ in (1).\n6: Estimate $J(\pi_z)$ using WIS (5).\n7: Compute $KL(\pi_z||\pi_n)$ if using first order policy gradient method.\n8: end for\n9: Select $\psi_n$, and hence $\pi_{n+1}$, according to (7) or (4).\n10: end for\n\nGiven sampled trajectories $\{\tau^{\pi_n}_1, \tau^{\pi_n}_2, \ldots, \tau^{\pi_n}_K\}$, with corresponding returns $\{R^{\pi_n}_1, R^{\pi_n}_2, \ldots, R^{\pi_n}_K\}$, the WIS estimate of $J(\pi_{n+1})$ is given by:\n\n$J(\pi_{n+1}) = \sum_{k=1}^{K} \frac{w_k}{\sum_{k=1}^{K} w_k} R^{\pi_n}_k$,  (5)\n\nwhere $w_k = \frac{P(\tau^{\pi_n}_k \sim \pi_{n+1})}{P(\tau^{\pi_n}_k \sim \pi_n)}$. Since $p(\tau|\pi) = p(s_0) \prod_{i=0}^{T} \pi(a_i|s_i) p(s_{i+1}|s_i, a_i)$, the transitions cancel out and we have:\n\n$w_k = \frac{\prod_{i=0}^{T} \pi_{n+1}(a_i|s^k_i)}{\prod_{i=0}^{T} \pi_n(a_i|s^k_i)}$.  (6)\n\nThe success of this approach depends critically on the quality of the WIS estimates, which can suffer from high variance that grows rapidly as the distributions of $\pi_{n+1}$ and $\pi_n$ diverge. Fortunately, for natural gradient methods like NPG, $KL(\pi_{n+1}||\pi_n)$ is automatically approximately bounded by the update, ensuring reasonable WIS estimates when HOOF directly uses (4). In the following, we consider the more challenging case of first order methods.\n\n3.1 First Order HOOF\n\nWithout a KL bound on the policy update, it may seem that WIS will not yield adequate estimates to solve (4). However, a key insight is that, while the estimated policy value can have high variance, the relative ordering of the policies, which HOOF solves for, has much lower variance (see Appendix E for an illustrative example). Nonetheless, HOOF could still fail if $KL(\pi_{n+1}||\pi_n)$ becomes too large, which can occur in first order methods. Hence, First Order HOOF modifies (4) by constraining $KL(\pi_{n+1}||\pi_n)$:\n\n$\psi_n = \mathrm{argmax}_\psi \, J(\pi_{n+1}) \quad \mathrm{s.t.} \quad KL(\pi_{n+1}||\pi_n) < \epsilon$.  (7)\n\nWhile this yields an update that superficially resembles that of natural gradient methods, the KL constraint is applied only during the search for the optimal hyperparameter settings using WIS. The direction of the update is determined solely by a first order gradient update rule, and estimation and inversion of the FIM is not required. 
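A minimal sketch of the WIS estimate (5)-(6) and the constrained selection (7) might look as follows. This is our reading, not the paper's released code: per-trajectory log-probabilities are assumed precomputed, and all names are illustrative.

```python
import numpy as np

def wis_value(logp_cand, logp_behav, returns):
    """Weighted importance sampling estimate of J for one candidate policy.
    logp_cand/logp_behav: per-trajectory sums of log pi(a_i|s_i) under the
    candidate and behaviour policies; the transition terms cancel (eq. (6))."""
    logw = np.asarray(logp_cand) - np.asarray(logp_behav)
    w = np.exp(logw - logw.max())    # normalised weights are shift-invariant
    return float(np.sum(w * np.asarray(returns)) / np.sum(w))

def first_order_hoof_select(candidates, kls, epsilon):
    """Eq. (7): among candidates [(psi, J_hat), ...], drop those whose sample
    KL(pi_z || pi_n) exceeds epsilon, then pick the best remaining J_hat.
    Assumes at least one candidate is feasible."""
    feasible = [i for i in range(len(candidates)) if kls[i] < epsilon]
    best = max(feasible, key=lambda i: candidates[i][1])
    return candidates[best][0]
```

Subtracting `logw.max()` before exponentiating avoids overflow without changing the normalised weights.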
From a practical perspective, this constraint is enforced by computing the KL for each candidate policy based on the observed trajectories, and the candidate is rejected if this sample KL is greater than the constraint.\nWhen learning the learning rate with HOOF, we can also use the KL constraint to dynamically adjust the search bounds: at each iteration, if none of the candidates violates the KL constraint, we increase the upper bound of the search space by a factor $\nu$, while if a large proportion of the candidates violate the KL constraint, we reduce the upper bound by $\nu$. This makes HOOF even more robust to the initial setting of the search space. Note that this is entirely optional, and is simply a means to reduce the number of candidates that would otherwise need to be generated and evaluated to ensure that a good solution of (4) is found.\n\n3.2 $(\gamma, \lambda)$ Conditioned Value Function\n\nIf we use HOOF to learn $(\gamma, \lambda)$, $g_n$ has to be computed for each setting of $(\gamma, \lambda)$. With neural net value functions, we modify our value function such that its inputs are $(s, \gamma, \lambda)$, similar to Universal Value Function Approximators (Schaul et al., 2015). Thus we learn a $(\gamma, \lambda)$-conditioned value function that can make value predictions for any candidate $(\gamma, \lambda)$ at the cost of a single forward pass. In Appendix D we present some experimental results to show that learning a $(\gamma, \lambda)$-conditioned value function is key to the success of HOOF.\n\n3.3 Robustness to HOOF Hyperparameters and Computational Costs\n\nHOOF introduces two types of hyperparameters of its own: the search spaces for the various hyperparameters it tunes, and the number of candidate policies generated for evaluation. 
Since the candidate policies are generated using random search, these hyperparameters express a direct trade-off between performance and computational cost: a larger search space and a larger number of candidates should lead to better solutions for (4), but incur higher computational cost. However, just like in random search, the generation and evaluation of the candidate policies can be performed in parallel to reduce wall clock time. Alternatively, Bayesian Optimisation could be used to solve (4) efficiently. Finally, we note that HOOF with random search is always more computationally efficient than grid/random search over the hyperparameters with the same number of candidates, as HOOF saves on the additional computational cost of sampling trajectories for each candidate incurred by grid/random search. HOOF additionally introduces the KL constraint hyperparameter for first order methods. We show experimentally that the performance of HOOF is robust to a wide range of settings for this.\n\n3.4 Choice of Optimiser\n\nThroughout this paper we use random search as the optimiser for (4) to show that the simplest methods suffice. However, any gradient-free optimiser could be used instead. For example, grid search, CMA-ES (Hansen and Ostermeier, 2001), or Bayesian Optimisation (Brochu et al., 2010) are all viable alternatives.\nGradient based methods are not viable for two reasons. First, they require that $J(\pi_{n+1})$ be differentiable w.r.t. the hyperparameters, which might be difficult or impossible to compute, e.g., with the TRPO update. 
Second, they introduce learning rate and initialisation hyperparameters, which require tuning at the expense of sample efficiency.\n\n4 Related Work\n\nMost hyperparameter search methods can be broadly classified into sequential search, parallel search, and gradient based methods.\nSequential search methods perform a training run with some candidate hyperparameters, and use the results to inform the choice of the next set of hyperparameters for evaluation. BO is a sample efficient global optimisation framework that models performance as a function of the hyperparameters, and is especially suited for sequential search as each training run is expensive. After each training run BO uses the observed performance to update the model in a Bayesian way, which then informs the choice of the next set of hyperparameters for evaluation. Several modifications have been suggested to further reduce the number of evaluations required: input warping (Snoek et al., 2014) to address nonstationary fitness landscapes; freeze-thaw BO (Swersky et al., 2014) to decide whether a new training run should be started and the current one discontinued based on interim performance; transferring knowledge about hyperparameters across similar tasks (Swersky et al., 2013); and modelling training time as a function of dataset size (Klein et al., 2016). To further speed up the wall clock time, some BO based methods use a hybrid mode wherein batches of hyperparameter settings are evaluated in parallel (Contal et al., 2013; Desautels et al., 2014; Shah and Ghahramani, 2015; Wang et al., 2016; Kandasamy et al., 2018).\nBy contrast, parallel search methods like grid search and random search run multiple training runs with different hyperparameter settings in parallel to reduce wall clock time, but require more parallel computational resources. 
These methods are easy to implement, and have been shown to perform well (Bergstra et al., 2011; Bergstra and Bengio, 2012).\nBoth sequential and parallel search suffer from two key disadvantages. First, they require performing multiple training runs to identify good hyperparameters. Not only is this computationally inefficient, but when applied to RL, also sample inefficient as each run requires fresh interactions with the environment. Second, these methods learn fixed values for the hyperparameters that are used throughout training instead of a schedule, which can lead to suboptimal performance (Luketina et al., 2016; Jaderberg et al., 2017; Xu et al., 2018).\nPBT (Jaderberg et al., 2017) is a hybrid of random and sequential search, with the added benefit of adapting hyperparameters during training. It starts by training a population of hyperparameters which are then updated periodically to further explore promising hyperparameter settings. However, by requiring multiple training runs, it inherits the sample inefficiency of random search.\nHOOF is much more sample efficient because it requires no more interactions with the environment than those gathered by the underlying policy gradient method for one training run. Consequently, it is also far more computationally efficient. However, while HOOF can only optimise hyperparameters that directly affect the policy update, these methods can tune other hyperparameters, e.g., policy architecture. Combining these complementary strengths is an interesting topic for future work.\nGradient based methods (Sutton, 1992; Bengio, 2000; Luketina et al., 2016; Pedregosa, 2016; Xu et al., 2018) adapt the hyperparameters by performing gradient descent on the policy gradient update function with respect to the hyperparameters. This raises the fundamental problem that the update function needs to be differentiable. 
For example, the update function for TRPO uses conjugate gradient to approximate $I(\pi)^{-1}g$, performs a backtracking line search to enforce the KL constraint, and introduces a surrogate improvement constraint, all of which introduce discontinuities in the update and make it non-differentiable.\nA second major disadvantage of these methods is that they introduce their own set of hyperparameters, which can make them sample inefficient if they require tuning. For example, the meta-gradient estimates can have high variance, which in turn significantly affects performance. To address this, the objective function of meta-gradients introduces reference $(\gamma', \lambda')$ hyperparameters to trade off bias and variance. As a result, its performance can be sensitive to these, as the experimental results of Xu et al. (2018) show. Furthermore, gradient based methods tend to be highly sensitive to the setting of the learning rate, and these methods introduce their own learning rate hyperparameter for the meta learner which requires tuning, as we show in our experiments. As a gradient-free method, HOOF does not require a differentiable objective and, while it introduces a few hyperparameters of its own, these do not affect sample efficiency, as mentioned in Section 3.3.\nOther work on non-gradient based methods includes that of Kearns and Singh (2000), who derive a theoretical schedule for the TD($\lambda$) hyperparameter that they show is better than any fixed value. Downey et al. (2010) learn a schedule for TD($\lambda$) using a Bayesian approach. White and White (2016) greedily adapt the TD($\lambda$) hyperparameter as a function of state. 
Unlike HOOF, these methods can only be applied to TD($\lambda$) and, in the case of Kearns and Singh (2000), are not compatible with function approximation.\n\n5 Experiments\n\nTo experimentally validate HOOF, we apply it to four simulated continuous control tasks from MuJoCo OpenAI Gym (Brockman et al., 2016): HalfCheetah, Hopper, Ant, and Walker. We start with A2C, and show that HOOF performs better than multiple baselines, and is also far more sample efficient. Next, we use NPG as the underlying policy gradient method and apply HOOF to learn $(\delta, \gamma, \lambda)$ and show that it outperforms TRPO.\nWe repeat all experiments across 10 random starts. In all figures solid lines represent the median, and shaded regions the quartiles. Similarly, all results in tables represent the median. Hyperparameters that are not tuned are held constant across HOOF and baselines to ensure comparability. Details about all hyperparameters can be found in the appendices, and code is available at https://github.com/supratikp/HOOF.\n\n5.1 HOOF with A2C\n\nIn the A2C framework, a neural net with parameters $\theta$ is commonly used to represent both the policy and the value function, usually with some shared layers. The update function (1) for A2C is a linear\n\nFigure 1 ((a) HalfCheetah, (b) Hopper, (c) Ant, (d) Walker): Performance of HOOF with $\epsilon$ = 0.03 compared to Baseline A2C and Tuned Meta-Gradients. The hyperparameters $(\alpha_0, \beta)$ of meta-gradients had to be tuned using grid search, which required 36x the samples used by HOOF.\n\nTable 1: Performance of HOOF with different values of the KL constraint ($\epsilon$ parameter). 
The results show that the performance is relatively robust to the setting of $\epsilon$.\n\nKL constraint | $\epsilon$=0.01 | $\epsilon$=0.02 | $\epsilon$=0.03 | $\epsilon$=0.04 | $\epsilon$=0.05 | $\epsilon$=0.06 | $\epsilon$=0.07\nHalfCheetah | 1,451 | 1,388 | 1,301 | 1,504 | 1,203 | 1,524 | 1,325\nHopper | 358 | 359 | 370 | 365 | 359 | 350 | 362\nAnt | 942 | 971 | 963 | 969 | 916 | 952 | 957\nWalker | 415 | 456 | 402 | 457 | 466 | 467 | 475\n\ncombination of the gradients of the policy loss, the value loss, and the policy entropy:\n\n$f_\theta(\alpha) = \alpha\{\nabla_\theta \log \pi_\theta(a|s)(R - V_\theta(s)) + c_1 \nabla_\theta (R - V_\theta(s))^2 + c_2 \nabla_\theta H(\pi_\theta(s))\}$,  (8)\n\nwhere we have omitted the dependence on the timestep and other hyperparameters for ease of notation. The performance of A2C is particularly sensitive to the choice of the learning rate $\alpha$ (Henderson et al., 2018b), which requires careful tuning.\nWe learn $\alpha$ using HOOF with the KL constraint $\epsilon$ = 0.03 ('HOOF'). We compare this against two baselines: (1) Baseline A2C, i.e., A2C with the initial learning rate set to the OpenAI Baselines default (0.0007), and (2) the learning rate learnt by meta-gradients ('Tuned Meta-Gradient'), where the hyperparameters introduced by meta-gradients were tuned using grid search.\nThe learning curves in Figure 1 show that across all environments HOOF learns faster than Baseline A2C, and also outperforms it in HalfCheetah and Walker, demonstrating that learning the learning rate online can yield significant gains.\nThe update rule for meta-gradients when learning $\alpha$ reduces to $\alpha' = \alpha + \beta \nabla_{\theta'} \log \pi_{\theta'}(a|s)(R - V_{\theta'}(s)) \frac{f_\theta(\psi)}{\alpha}$, where $\beta$ is the meta learning rate. This leads to two issues: what should the learning rate be initialised to ($\alpha_0$), and what should the meta learning rate be set to? 
Like all gradient based methods, the performance of meta-gradients can be sensitive to the choices of these two hyperparameters. When we set $\alpha_0$ to the OpenAI Baselines default setting and $\beta$ to 0.001 as per Xu et al. (2018), A2C fails to learn at all. Thus, we had to run a grid search over $(\alpha_0, \beta)$ to find the optimal settings across these hyperparameters. In Figure 1 we plot the best run from this grid search. Despite using 36 times as many samples (due to the grid search), meta-gradients still cannot outperform HOOF, and learns slower in 3 of the 4 tasks. The returns for each of the 36 points on the grid are presented in Appendix B.1, and they show that the performance of meta-gradients can be sensitive to these two hyperparameters.\nTo show that HOOF's performance is robust to $\epsilon$, its own hyperparameter quantifying the KL constraint, we repeated our experiments with different values of $\epsilon$. The results presented in Table 1 show that HOOF's performance is stable across different values of this parameter. This is not surprising: the sole purpose of the constraint is to ensure that the WIS estimates remain viable.\nFinally, to ascertain the sample efficiency of HOOF relative to grid search, we perform a benchmarking exercise. We used HOOF to learn both the learning rate and the entropy coefficient ($c_2$ in (8)). We split the search bounds for these across a grid with 11x11 points and ran A2C for each setting on the grid. For computational reasons we set the budget for each training run to 1 million timesteps. Given a budget of n training runs, we randomly subsample n points from the grid (without replacement) and note the best return. 
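This subsampling procedure can be sketched as follows (our own illustrative helper, assuming the 121 grid returns have already been computed):

```python
import numpy as np

def expected_best_return(grid_returns, n, trials=1000, seed=0):
    """Monte-Carlo estimate of the expected best return when n points are
    drawn without replacement from a completed hyperparameter grid."""
    rng = np.random.default_rng(seed)
    grid = np.asarray(grid_returns, dtype=float)
    bests = [rng.choice(grid, size=n, replace=False).max() for _ in range(trials)]
    return float(np.mean(bests))
```

When n equals the full grid size, the estimate reduces to the grid maximum; smaller budgets give the expected best of a partial grid search.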
We repeat this 1000 times to get an estimate of the expected best return of the grid search with a budget of n training runs.\n\nTable 2: Comparison of sample efficiency of HOOF over grid search.\n\n | HOOF Returns | Max return over subsampled grid of size n=1 | n=2 | n=5 | n=10\nHalfCheetah | 702 | -558 | -241 | 113 | 354\nHopper | 321 | 109 | 165 | 240 | 287\nAnt | 675 | -7561 | -272 | 177 | 476\nWalker | 175 | 99 | 153 | 224 | 279\n\nFigure 2 ((a) HalfCheetah, (b) Hopper, (c) Ant, (d) Walker): Performance of HOOF-TNPG vs TRPO baselines.\n\nThe results presented in Table 2 compare the returns of HOOF to the expected best return for grid search with different training budgets. The performance of grid search is much worse than that of HOOF with the same budget (i.e., only 1 training run). The results show that grid search can take more than 10 times as many samples to match HOOF's performance.\nAppendix A.3 contains further experimental details, including results confirming that the KL constraint is crucial to ensuring sound WIS estimates.\nIn Appendix A.4 we show that HOOF is also robust to the choice of the optimiser by running the experiments with SGD (instead of RMSProp) as the optimiser. In this case the difference in performance is highly significant, with Baseline A2C failing to learn at all.\n\n5.2 HOOF with Truncated Natural Policy Gradients (TNPG)\n\nA major disadvantage of natural policy gradient methods is that they require the inversion of the FIM in (3), which can be prohibitively expensive for large neural net policies with thousands of parameters. TNPG (Duan et al., 2016) and TRPO address this by using the conjugate gradient algorithm to efficiently approximate $I(\pi)^{-1}g$. 
TRPO has been shown to perform better than TNPG in continuous control tasks (Schulman et al., 2015), a result attributed to its stricter enforcement of the KL constraint. However, in this section, we show that stricter enforcement of the KL constraint becomes unnecessary once we properly adapt TNPG's learning rate. To do so, we apply HOOF to learn (δ, γ, λ) of TNPG ('HOOF-TNPG'), and compare it with TRPO with the OpenAI Baselines default settings of (δ = 0.01, γ = 0.99, λ = 0.98) ('Baseline TRPO').
Figure 2 shows the learning curves of HOOF-TNPG and Baseline TRPO. HOOF-TNPG learns much faster, and outperforms Baseline TRPO in all environments except Walker, where there is no significant difference. Figure 3 presents the learnt (δ, γ, λ). The results show that different KL constraints and GAE hyperparameters are needed for different domains. We could not compare with meta-gradients because the objective function is not differentiable, as discussed earlier in Section 4. We also could not perform a comparison against grid search similar to the one in Section 5.1, as the computational burden of performing a grid search over three hyperparameters was too large.

6 Conclusions & Future Work

The performance of a policy gradient method is highly dependent on its hyperparameters. However, the methods typically used to tune these hyperparameters are highly sample inefficient, computationally expensive, and learn only a fixed setting of the hyperparameters. In this paper we presented HOOF, a sample efficient method that automatically learns a schedule for the learning rate and GAE hyperparameters of policy gradient methods without requiring multiple training runs.
We believe that this, combined with its simplicity and ease of implementation, makes HOOF a compelling method for optimising policy gradient hyperparameters.

Figure 3: Hyperparameters learnt by HOOF-TNPG for HalfCheetah, Hopper, and Walker ((a) learnt δ, (b) learnt γ, (c) learnt λ).

While we have presented HOOF as a method to learn the hyperparameters of a policy gradient algorithm, the underlying principles are far more general. For example, one could compute a distribution for the gradient and generate candidate policies by sampling from that distribution, instead of just using the point estimate of the gradient. It has also been hypothesised that state/action dependent discount factors might help speed up learning (White, 2017; Fedus et al., 2019). This could be achieved by using HOOF to learn the parameters of a function that maps the states/actions to the discount factors.

Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement #637713), and Samsung R&D Institute UK. The experiments were made possible by a generous equipment grant from NVIDIA.

References

Bengio, Y. (2000). Gradient-based optimization of hyperparameters. Neural Computation, 12(8):1889–1900.

Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305.

Bergstra, J. S., Bardenet, R., Bengio, Y., and Kégl, B. (2011). Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems.

Brochu, E., Cora, V. M., and de Freitas, N. (2010).
A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). OpenAI Gym.

Chen, Y., Huang, A., Wang, Z., Antonoglou, I., Schrittwieser, J., Silver, D., and de Freitas, N. (2018). Bayesian optimization in AlphaGo. CoRR, abs/1812.06855.

Contal, E., Buffoni, D., Robicquet, A., and Vayatis, N. (2013). Parallel Gaussian process optimization with upper confidence bound and pure exploration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases.

Desautels, T., Krause, A., and Burdick, J. W. (2014). Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. The Journal of Machine Learning Research, 15(1):3873–3923.

Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y., and Zhokhov, P. (2017). OpenAI Baselines. https://github.com/openai/baselines.

Downey, C., Sanner, S., et al. (2010). Temporal difference Bayesian model averaging: A Bayesian perspective on adapting lambda. In International Conference on Machine Learning.

Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. (2016). Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning.

Farquhar, G., Rocktäschel, T., Igl, M., and Whiteson, S. (2018). TreeQN and ATreeC: Differentiable tree-structured models for deep reinforcement learning. In International Conference on Learning Representations.

Fedus, W., Gelada, C., Bengio, Y., Bellemare, M. G., and Larochelle, H. (2019). Hyperbolic discounting and learning over multiple horizons. arXiv preprint arXiv:1902.06865.

François-Lavet, V., Fonteneau, R., and Ernst, D. (2015).
How to discount deep reinforcement learning: Towards new dynamic strategies. In NIPS 2015 Workshop on Deep Reinforcement Learning.

Hansen, N. and Ostermeier, A. (2001). Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation.

Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. (2018a). Deep reinforcement learning that matters. In AAAI.

Henderson, P., Romoff, J., and Pineau, J. (2018b). Where did my optimum go?: An empirical analysis of gradient descent optimization in policy gradient methods. CoRR, abs/1810.02525.

Hutter, F., Hoos, H. H., and Leyton-Brown, K. (2011). Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization.

Igl, M., Zintgraf, L. M., Le, T. A., Wood, F., and Whiteson, S. (2018). Deep variational reinforcement learning for POMDPs. In International Conference on Machine Learning.

Jaderberg, M., Dalibard, V., Osindero, S., Czarnecki, W. M., Donahue, J., Razavi, A., Vinyals, O., Green, T., Dunning, I., Simonyan, K., et al. (2017). Population based training of neural networks. arXiv preprint arXiv:1711.09846.

Kakade, S. (2001). A natural policy gradient. In Advances in Neural Information Processing Systems.

Kandasamy, K., Krishnamurthy, A., Schneider, J., and Póczos, B. (2018). Parallelised Bayesian optimisation via Thompson sampling. In International Conference on Artificial Intelligence and Statistics.

Kearns, M. J. and Singh, S. P. (2000). Bias-variance error bounds for temporal difference updates. In Conference on Learning Theory.

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Klein, A., Falkner, S., Bartels, S., Hennig, P., and Hutter, F. (2016). Fast Bayesian optimization of machine learning hyperparameters on large datasets.
arXiv preprint arXiv:1605.07079.

Luketina, J., Raiko, T., Berglund, M., and Greff, K. (2016). Scalable gradient-based tuning of continuous regularization hyperparameters. In International Conference on Machine Learning.

Mahmood, A. R., Korenkevych, D., Vasan, G., Ma, W., and Bergstra, J. (2018). Benchmarking reinforcement learning algorithms on real-world robots. In Conference on Robot Learning.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning.

Mordatch, I., Lowrey, K., Andrew, G., Popovic, Z., and Todorov, E. V. (2015). Interactive control of diverse complex characters with neural networks. In Advances in Neural Information Processing Systems.

Pedregosa, F. (2016). Hyperparameter optimization with approximate gradient. arXiv preprint arXiv:1602.02355.

Rajeswaran, A., Lowrey, K., Todorov, E. V., and Kakade, S. M. (2017). Towards generalization and simplicity in continuous control. In Advances in Neural Information Processing Systems.

Schaul, T., Horgan, D., Gregor, K., and Silver, D. (2015). Universal value function approximators. In International Conference on Machine Learning.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning.

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. (2016). High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations.

Shah, A. and Ghahramani, Z. (2015). Parallel predictive entropy search for batch global optimization of expensive objective functions. In Advances in Neural Information Processing Systems, pages 3330–3338.

Snoek, J., Larochelle, H., and Adams, R. P. (2012).
Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems.

Snoek, J., Swersky, K., Zemel, R., and Adams, R. (2014). Input warping for Bayesian optimization of non-stationary functions. In International Conference on Machine Learning.

Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. (2010). Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning.

Sutton, R. S. (1992). Adapting bias by gradient descent: An incremental version of delta-bar-delta. In AAAI.

Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems.

Swersky, K., Snoek, J., and Adams, R. P. (2013). Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems.

Swersky, K., Snoek, J., and Adams, R. P. (2014). Freeze-thaw Bayesian optimization. arXiv preprint arXiv:1406.3896.

Tieleman, T. and Hinton, G. (2012). Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31.

Wang, J., Clark, S. C., Liu, E., and Frazier, P. I. (2016). Parallel Bayesian global optimization of expensive functions. arXiv preprint arXiv:1602.05149.

White, M. (2017). Unifying task specification in reinforcement learning. In International Conference on Machine Learning.

White, M. and White, A. (2016). A greedy approach to adapting the trace parameter for temporal difference learning. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pages 557–565.

Williams, R. J. (1992).
Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning.

Xu, Z., van Hasselt, H. P., and Silver, D. (2018). Meta-gradient reinforcement learning. In Advances in Neural Information Processing Systems.