{"title": "From Complexity to Simplicity: Adaptive ES-Active Subspaces for Blackbox Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 10299, "page_last": 10309, "abstract": "We present a new algorithm (ASEBO) for optimizing high-dimensional blackbox functions. ASEBO adapts to the geometry of the function and learns optimal sets of sensing directions, which are used to probe it, on-the-fly. It addresses the exploration-exploitation trade-off of blackbox optimization with expensive blackbox queries by continuously learning the bias of the lower-dimensional model used to approximate gradients of smoothings of the function via compressed sensing and contextual bandits methods. To obtain this model, it leverages techniques from the emerging theory of active subspaces in a novel ES blackbox optimization context. As a result, ASEBO learns the dynamically changing intrinsic dimensionality of the gradient space and adapts to the hardness of different stages of the optimization without external supervision. Consequently, it leads to more sample-efficient blackbox optimization than state-of-the-art algorithms. We provide theoretical results and test ASEBO advantages over other methods empirically by evaluating it on the set of reinforcement learning policy optimization tasks as well as functions from the recently open-sourced Nevergrad library.", "full_text": "From Complexity to Simplicity: Adaptive ES-Active Subspaces for Blackbox Optimization

Krzysztof Choromanski* (Google Brain Robotics, kchoro@google.com), Aldo Pacchiano* (UC Berkeley, pacchiano@berkeley.edu), Jack Parker-Holder* (University of Oxford, jackph@robots.ox.ac.uk), Yunhao Tang* (Columbia University, yt2541@columbia.edu), Vikas Sindhwani (Google Brain Robotics, sindhwani@google.com)

Abstract

We present a new algorithm (ASEBO) for optimizing high-dimensional blackbox functions.
ASEBO adapts to the geometry of the function and learns optimal sets of sensing directions, which are used to probe it, on-the-fly. It addresses the exploration-exploitation trade-off of blackbox optimization with expensive blackbox queries by continuously learning the bias of the lower-dimensional model used to approximate gradients of smoothings of the function via compressed sensing and contextual bandits methods. To obtain this model, it leverages techniques from the emerging theory of active subspaces [8] in a novel ES blackbox optimization context. As a result, ASEBO learns the dynamically changing intrinsic dimensionality of the gradient space and adapts to the hardness of different stages of the optimization without external supervision. Consequently, it leads to more sample-efficient blackbox optimization than state-of-the-art algorithms. We provide theoretical results and test ASEBO's advantages over other methods empirically by evaluating it on a set of reinforcement learning policy optimization tasks as well as functions from the recently open-sourced Nevergrad library.

1 Introduction

Consider a high-dimensional function F : R^d → R. We assume that querying it is expensive. Examples include reinforcement learning (RL) blackbox functions taking as inputs vectors θ encoding policies π : S → A mapping states to actions and outputting the total (expected/discounted) rewards obtained by agents applying π in given environments [6]. For this class of functions, evaluations usually require running a simulator. Other examples include wind configuration design optimization problems for high speed civil transport aircraft, optimizing computer codes (e.g.
the NASA synthetic tool FLOPS/ENGENN used to size the aircraft and propulsion system [2]), crash tests, and medical and chemical reaction experiments [37].

Evolution strategy (ES) methods have traditionally been used in low-dimensional regimes (e.g. hyperparameter tuning), and were considered ill-equipped for higher dimensional problems due to poor sampling complexity [27]. However, a flurry of recent work has shown they can scale better than previously believed [33, 11, 29, 25, 7, 30, 21]. This is thanks to a couple of reasons.

First of all, new ES methods apply several efficient heuristics (filtering, various normalization techniques as in [25] and new exploration strategies as in [11]) in order to substantially improve sampling complexity. Other recent methods [29, 7] are based on more accurate Quasi Monte Carlo (MC) estimators of the gradients of Gaussian smoothings of blackbox functions, with theoretical guarantees. These approaches provide better quality gradient sensing mechanisms. Additionally, in applications such as RL, new compact structured policy architectures (such as low-displacement rank neural networks from [7] or even linear policies [14]) are used to reduce the number of policies' parameters and the dimensionality of the optimization problem.

*Equal contribution.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Recent research also shows that ES-type blackbox optimization in RL leads to more stable policies than policy gradient methods, since ES methods search for parameters that are robust to perturbations [19].
Unlike policy gradient methods, ES aims to find parameters maximizing the expected reward (rather than just a reward) with respect to Gaussian perturbations.

Finally, pure ES methods, as opposed to state-of-the-art policy optimization techniques (TRPO, PPO or ARS [32, 15, 31, 25]), can also be applied to blackbox optimization problems that do not exhibit the MDP structure required for policy gradient methods and cannot benefit from the state normalization algorithm central to ARS. This has led to their recent popularity for non-differentiable tasks [17, 14].

In this paper we introduce a new adaptive sample-efficient blackbox optimization algorithm. ASEBO adapts to the geometry of blackbox functions and learns optimal sets of sensing directions, which are used to probe them, on-the-fly. To do this, it leverages techniques from the emerging theory of active subspaces [8, 10, 9, 20] in a novel ES blackbox optimization context. Active subspaces and their extensions are becoming popular as effective techniques for dimensionality reduction (see for instance: active manifolds [5] or ResNets for learning isosurfaces [36]). However, to the best of our knowledge we are the first to apply active subspace ideas in ES optimization.

ASEBO addresses the exploration-exploitation trade-off of blackbox optimization with expensive function queries by continuously learning the bias of the lower-dimensional model used to approximate gradients of smoothings of the function with compressed sensing and contextual bandits methods. This adaptiveness is what distinguishes it from some recently introduced guided ES methods such as [24] that rely on fixed hyperparameters that are hard to tune in advance (e.g. the length of the buffer defining the lower dimensional space for gradient search).
We provide theoretical results and empirically evaluate ASEBO on a set of RL blackbox optimization tasks as well as non-RL blackbox functions from the recently open-sourced Nevergrad library [34], showing that it consistently learns optimal inputs with fewer queries to a blackbox function than other methods.

ASEBO versus CMA-ES: There have been a variety of works seeking to reduce the sampling complexity of ES methods through the use of metric learning. The prominent class of covariance matrix adaptation evolution strategy (CMA-ES) methods provides state-of-the-art derivative-free blackbox optimization algorithms, which seek to learn and maintain a fully parameterized Gaussian distribution. CMA-ES suffers from quadratic time complexity for each evaluation, which can be limiting for high dimensional problems. As such, a series of attempts have been made to produce scalable variants of CMA-ES, by restricting the covariance matrix to the diagonal (sep-CMA-ES [28]) or to a low rank approximation as in VD-CMA-ES [3] and LM-CMA-ES [22]. Two recent algorithms, VkD-CMA-ES [4] and LM-MA-ES [23], seek to combine the above ideas and have been shown to be successful in large-scale settings, including RL policy learning [26]. Although these methods are able to quickly learn and adapt the covariance matrix, they are heavily dependent on hyperparameter selection [4, 35] and lack the means to avoid learning a bias. As our experiments show, this can severely hurt their performance. The best CMA-ES variants often struggle with RL tasks of challenging objective landscapes, displaying inconsistent performance across tasks. Furthermore, they require careful hyperparameter tuning for good performance (see: analysis in Section 4, Fig. 3).

2 Adaptive Sample-Efficient Blackbox Optimization

Before we describe ASEBO, we explain the key theoretical ideas behind the algorithm.
ASEBO uses online PCA to maintain and update on-the-fly subspaces, which we call ES-active subspaces L^ES_active, accurately approximating the gradient data space at a given phase of the algorithm. The bias of the obtained gradient estimators is measured by sensing the length of their component in the orthogonal complement L^ES,⊥_active via compressed sensing, or by computing optimal probabilities for exploration (i.e. sensing from L^ES,⊥_active) via contextual bandits methods [1]. The algorithm corrects its probabilistic distributions used for choosing directions for gradient sensing based on these measurements. As we show, we can measure that bias accurately using only a constant number of additional function queries, regardless of the dimensionality. This in turn determines an exploration strategy, as we explain later. Estimated gradients are then used to update parameters.

(a) HC: active subspace  (b) SW: active subspace  (c) HC: # of samples  (d) SW: # of samples

Figure 1: The motivation behind ASEBO. First two plots: ES baseline for the HalfCheetah and Swimmer tasks from the OpenAI Gym library for 212-dimensional policies; the plots show how the dimensionality of the space capturing a given percentage of the variance of the approximate gradient data depends on the iteration of the algorithm. This information is never exploited by the algorithm, even though 99.5% of the variance resides in a much lower-dimensional space (100 dimensions). Last two plots: ASEBO taking advantage of this information (the number of sample/sensing directions reflects the hardness of the optimization at each iteration and is strongly correlated with the PCA dimensionality).

2.1 Preliminaries

Consider a blackbox function F : R^d → R. We do not assume that F is differentiable. The Gaussian smoothing [27] F_σ of F, parameterized by a smoothing parameter σ > 0, is given as: F_σ(θ) = E_{g∼N(0,I_d)}[F(θ + σg)] = (2π)^{-d/2} ∫_{R^d} F(θ + σg) e^{-||g||²/2} dg.
The gradient of the Gaussian smoothing of F is given by the formula:

∇F_σ(θ) = (1/σ) E_{g∼N(0,I_d)}[F(θ + σg)g].   (1)

Formula 1 for ∇F_σ(θ) leads straightforwardly to several unbiased Monte Carlo (MC) estimators of ∇F_σ(θ), where the most popular ones are: the forward finite difference estimator [7], defined as: ∇̂^FD_MC F_σ(θ) = (1/(kσ)) Σ_{i=1}^k (F(θ + σg_i) − F(θ))g_i, and an antithetic ES gradient estimator [30], given as: ∇̂^AT_MC F_σ(θ) = (1/(2kσ)) Σ_{i=1}^k (F(θ + σg_i) − F(θ − σg_i))g_i, where typically g_1, ..., g_k are taken independently at random from N(0, I_d) or from more complex joint distributions for variance reduction (see: [7]). We call the samples g_1, ..., g_k the sensing directions, since they are used to sense gradients ∇F_σ(θ). The antithetic formula can alternatively be rationalized as giving the renormalized gradient of F (if F is smooth), if not taking into account cubic and higher-order terms of the Taylor expansion F(θ + v) = F(θ) + ∇F(θ)^⊤v + (1/2)v^⊤H(θ)v (where H(θ) stands for the Hessian of F at θ).

Standard ES methods apply different gradient-based techniques such as SGD or Adam, fed with the above MC estimators of ∇F_σ, to conduct blackbox optimization. The number of samples k per iteration of the optimization procedure is usually of the order O(d). This becomes a computational bottleneck for high-dimensional blackbox functions F (for instance, even for relatively small RL tasks with policies encoded by compact neural networks we still have d > 100 parameters).

2.2 ES-active subspaces via online PCA with decaying weights

The first idea leading to the ASEBO algorithm is that in practice one does not need to estimate the gradient of F accurately (after all, ES-type methods do not even aim to compute the gradient of F, but rather focus on ∇F_σ). Poor scalability of ES-type blackbox optimization algorithms is caused by the high dimensionality of the gradient vector.
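The antithetic estimator above is straightforward to implement. The sketch below is our illustration (not the authors' released code); the quadratic test function stands in for an expensive blackbox:

```python
import numpy as np

def antithetic_es_gradient(F, theta, k, sigma, rng):
    """Antithetic MC estimator of the gradient of the Gaussian smoothing F_sigma.

    Costs 2k queries to the blackbox F (antithetic pairs theta +/- sigma * g_i).
    """
    d = theta.shape[0]
    gs = rng.standard_normal((k, d))        # sensing directions g_1, ..., g_k
    f_plus = np.array([F(theta + sigma * g) for g in gs])
    f_minus = np.array([F(theta - sigma * g) for g in gs])
    # (1 / (2 k sigma)) * sum_i (F(theta + sigma g_i) - F(theta - sigma g_i)) g_i
    return (f_plus - f_minus) @ gs / (2.0 * k * sigma)
```

For a smooth F the estimator concentrates around ∇F_σ(θ) ≈ ∇F(θ); e.g. on F(θ) = −||θ||² it recovers −2θ up to MC noise.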
However, during the optimization process the space spanned by gradients may be locally well approximated by a lower-dimensional subspace L, and sensing the gradient in that subspace might be more effective. In some recent papers such as [24], such a subspace is defined simply as L = span{∇̂^AT_MC F_σ(θ_i), ∇̂^AT_MC F_σ(θ_{i−1}), ..., ∇̂^AT_MC F_σ(θ_{i−s+1})}, where {∇̂^AT_MC F_σ(θ_i), ..., ∇̂^AT_MC F_σ(θ_{i−s+1})} stands for the batch of the last s approximated gradients during the optimization process and s is a fixed hyperparameter. Even though L will dynamically change during the optimization, such an approach has several disadvantages in practice. Tuning the parameter s is very difficult or almost impossible, and the assumption that the dimensionality of L should be constant during optimization is usually false. In our approach, the dimensionality of L varies and depends on the hardness of the optimization at different optimization stages.

We apply Principal Component Analysis (PCA, [18]) to create a subspace L capturing a particular variance ε > 0 of the approximate gradients data. This data is either: the approximate gradients computed in previous iterations from the antithetic formula, or: the elements of the sum from that equation that are averaged over to obtain these gradients. For clarity of the exposition, from now on we will assume the former, but both variants are valid. Choosing ε is in practice much easier than choosing s and leads to subspaces L of varying dimensionalities throughout the optimization procedure, called by us from now on ES-active subspaces L^ES_active.

Algorithm 1 ASEBO Algorithm
Hyperparameters: number of iterations of full sampling l, smoothing parameter σ > 0, step size η, PCA threshold ε, decay rate λ, total number of iterations T.
Input: blackbox function F, vector θ_0 ∈ R^d where optimization starts. Cov_0 ∈ {0}^{d×d}, p_0 = 0.
Output: vector θ_T.
for t = 0, ..., T − 1 do
  if t < l then
    Take n_t = d. Sample g_1, ..., g_{n_t} from N(0, I_d) (independently).
  else
    1. Take the top r eigenvalues μ_i of Cov_t, where r is the smallest integer such that Σ_{i=1}^r μ_i ≥ ε Σ_{i=1}^d μ_i, using its SVD as described in the text, and take n_t = r.
    2. Take the corresponding eigenvectors u_1, ..., u_r ∈ R^d and let U ∈ R^{d×r} be obtained by stacking them together. Let U^act ∈ R^{d×r} be obtained from stacking together some orthonormal basis of L^ES_active := span{u_1, ..., u_r}. Let U^⊥ ∈ R^{d×(d−r)} be obtained from stacking together some orthonormal basis of the orthogonal complement L^ES,⊥_active of L^ES_active.
    3. Sample n_t vectors g_1, ..., g_{n_t} as follows: with probability 1 − p_t from N(0, U^⊥(U^⊥)^⊤) and with probability p_t from N(0, U^act(U^act)^⊤).
    4. Renormalize g_1, ..., g_{n_t} such that the marginal distributions of ||g_i||_2 are χ(d).
  1. Compute ∇̂^AT_MC F_σ(θ_t) = (1/(2n_tσ)) Σ_{j=1}^{n_t} (F(θ_t + σg_j) − F(θ_t − σg_j))g_j.
  2. Set Cov_{t+1} = λCov_t + (1 − λ)Γ, where Γ = ∇̂^AT_MC F_σ(θ_t)(∇̂^AT_MC F_σ(θ_t))^⊤.
  3. Set p_{t+1} = p_opt for the p_opt output by Algorithm 2, and: θ_{t+1} ← θ_t + η∇̂^AT_MC F_σ(θ_t).

These subspaces will be in turn applied to define covariance matrices encoding the probabilistic distributions used to construct sensing directions for estimating ∇F_σ(θ). An additional advantage of our approach is that PCA automatically filters out gradient noise.

We use our own online version of PCA with decaying weights (the decay rate is defined by a parameter λ > 0). By tuning λ we can define the rate at which historical approximate gradient data, used to choose the right sensing directions, continuously decays. We consider a stream of approximate gradients ∇̂^AT_MC F_σ(θ_0), ..., ∇̂^AT_MC F_σ(θ_i), ...
obtained during the optimization procedure. We maintain and update on-the-fly the covariance matrix Cov_t, where t stands for the number of completed iterations, in the form of its symmetric SVD decomposition Cov_t = Q_t^⊤Σ_tQ_t ∈ R^{d×d}. When the new approximate gradient ∇̂^AT_MC F_σ(θ_t) arrives, the update of the covariance matrix is driven by the following equation, reflecting the data decay process, where x_t = ∇̂^AT_MC F_σ(θ_t):

Q_{t+1}^⊤Σ_{t+1}Q_{t+1} = λQ_t^⊤Σ_tQ_t + (1 − λ)x_tx_t^⊤.   (2)

To conduct the update cheaply, it suffices to observe that the RHS of Equation 2 can be rewritten as: λQ_t^⊤Σ_tQ_t + (1 − λ)x_tx_t^⊤ = Q_t^⊤(λΣ_t + (1 − λ)Q_tx_t(Q_tx_t)^⊤)Q_t. Now, using the fact that for a matrix of the form D + uu^⊤ we can get its decomposition in time O(d²) [13], we obtain an algorithm performing updates in quadratic time. That in practice suffices, since the bottleneck of the computation is in querying F, and the additional overhead related to updating L^ES_active is negligible.

ES-active subspaces versus active subspaces: Our mechanism for constructing L^ES_active is inspired by the recent theory of active subspaces [8], developed to determine the most important directions in the space of input parameters of high-dimensional blackbox functions such as computer simulations. The active subspace of a differentiable function F : R^d → R, square-integrable with respect to a given probabilistic density function ρ : R^d → R, is given as the linear subspace L_active defined by the first r (for a fixed r < d) eigenvectors of the following d × d symmetric positive definite matrix:

Cov = ∫_{x∈R^d} ∇F(x)∇F(x)^⊤ ρ(x)dx.   (3)

The density function ρ determines where a compact representation of F is needed.
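The decaying-weight update of Equation 2 and the ε-threshold rule of Algorithm 1 can be sketched as follows (our illustration; for brevity it recomputes a full eigendecomposition rather than using the O(d²) rank-one update of [13]):

```python
import numpy as np

def update_covariance(cov, grad_estimate, lam):
    """Equation 2: Cov_{t+1} = lam * Cov_t + (1 - lam) * x_t x_t^T."""
    return lam * cov + (1.0 - lam) * np.outer(grad_estimate, grad_estimate)

def es_active_subspace(cov, eps):
    """Columns of the returned matrix span the smallest eigenspace of Cov_t
    capturing at least an eps-fraction of the total variance."""
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    cum = np.cumsum(eigvals)
    r = int(np.searchsorted(cum, eps * cum[-1])) + 1
    return eigvecs[:, :r]
```

Feeding each new ∇̂^AT_MC F_σ(θ_t) into update_covariance, a gradient stream confined to a low-dimensional subspace yields a correspondingly thin basis from es_active_subspace.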
In our approach we do not assume that ∇F exists, but the key difference between L^ES_active and L_active lies elsewhere. The goal of ASEBO is to avoid approximating the exact gradient ∇F(x) ∈ R^d, which is what makes standard ES methods very expensive and which is done in [9] via gradient sketching techniques combined with finite difference approaches (standard methods of choice for ES baselines). Instead, in ASEBO an ES-active subspace L^ES_active is itself used to define sensing directions, and the number of chosen samples k is given by the dimensionality of L^ES_active. This drastically reduces sampling complexity, but comes at the price of risking the optimization being trapped in a fixed lower-dimensional space that will not be representative for gradient data as optimization progresses. We propose a solution requiring only a constant number of extra queries to F in the next sections.

Algorithm 2 Explore estimator via exponentiated sampling
Hyperparameters: smoothing parameter σ, horizon C, learning rate α, probability regularizer γ, initial probability parameter q_0^t ∈ (0, 1).
Input: subspaces L^ES_active and L^ES,⊥_active, function F, vector θ_t.
Output: probability p_C.
for l = 1, ..., C + 1 do
  1. Compute p_l^t = (1 − 2γ)q_{l−1}^t + γ and sample a_l^t ∼ Ber(p_l^t).
  2. If a_l^t = 1, sample g_l ∼ N(0, I_{L^ES_active}); otherwise sample g_l ∼ N(0, I_{L^ES,⊥_active}).
  3. Compute v_l = (1/(2σ))(F(θ_t + σg_l) − F(θ_t − σg_l)).
  4. Set e_l = (1 − 2γ)² [a_l^t(dim(L^ES_active) + 2)/(p_l^t)³ ; (1 − a_l^t)(dim(L^ES,⊥_active) + 2)/(1 − p_l^t)³] v_l².
  5. Set q_l^t = q_{l−1}^t exp(−αe_l(1)) / (q_{l−1}^t exp(−αe_l(1)) + (1 − q_{l−1}^t) exp(−αe_l(2))).
Return: p_C.

2.3 Exploration-exploitation trade-off: Adaptive Exploration Mechanism

The procedure described above needs to be accompanied by an exploration strategy that determines how frequently to choose sensing directions outside the lower-dimensional ES-active subspace L^ES_active constructed on-the-fly. Our exploration strategies are encoded by hybrid probabilistic distributions for sampling sensing directions. The frequency of sensing from the distributions with covariance matrices obtained from L^ES_active (corresponding to exploitation) and from its orthogonal complement L^ES,⊥_active or the entire space (corresponding to exploration) is given by weights encoding the importance of exploitation versus exploration in any given iteration of the optimization. For a vector x ∈ R^d, denote by x_active its projection onto L^ES_active and by x_⊥ its projection onto L^ES,⊥_active.

A useful metric that can be used to update the above weights in an online manner in the t-th iteration of the algorithm is the ratio: r = ||(∇F_σ(θ_t))_active||_2 / ||(∇F_σ(θ_t))_⊥||_2. Smaller values of r indicate that the current active subspace is not representative enough for the gradient, and more aggressive exploration needs to be conducted. In practice, we do not compute r explicitly, but rather its approximate version r̂. One can simply take: r̂ = ||(∇̂^AT_MC F_σ(θ_{t−1}))_active||_2 / ||(∇̂^AT_MC F_σ(θ_{t−1}))_⊥||_2, where ∇̂^AT_MC F_σ(θ_{t−1}) is obtained in the previous iteration. But we can do better. It suffices to separately estimate ||(∇F_σ(θ_t))_active||_2 and ||(∇F_σ(θ_t))_⊥||_2.
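The hybrid sensing distribution and the ratio r̂ can be sketched as follows (our illustrative code, not the authors' implementation; U_act and U_perp are the stacked orthonormal bases from Algorithm 1):

```python
import numpy as np

def sample_sensing_directions(U_act, U_perp, n, p, rng):
    """Draw each direction from N(0, U_act U_act^T) with probability p
    (exploitation), otherwise from N(0, U_perp U_perp^T) (exploration),
    then rescale lengths to match the chi(d) marginals of N(0, I_d)."""
    d = U_act.shape[0]
    gs = []
    for _ in range(n):
        U = U_act if rng.random() < p else U_perp
        g = U @ rng.standard_normal(U.shape[1])   # Gaussian supported on the subspace
        g *= np.linalg.norm(rng.standard_normal(d)) / np.linalg.norm(g)
        gs.append(g)
    return np.stack(gs)

def exploitation_ratio(grad_estimate, U_act):
    """r_hat: length of the component in L^ES_active over the orthogonal one."""
    g_act = U_act @ (U_act.T @ grad_estimate)
    g_perp = grad_estimate - g_act
    return np.linalg.norm(g_act) / np.linalg.norm(g_perp)
```

With p = 1 all sensing directions lie in the active subspace (pure exploitation); with p = 0 they all lie in its orthogonal complement.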
However, we do not aim to estimate (∇F_σ(θ_t))_active and (∇F_σ(θ_t))_⊥. That would be equivalent to computing an exact estimate of ∇F_σ(θ_t), defeating the purpose of ASEBO. Instead, we note that estimating the length of an unknown high-dimensional vector is much simpler than estimating the vector itself, and can be done in a probabilistic manner with arbitrary precision via a set of dot-product queries whose size is independent of the dimensionality d, using compressed sensing methods. We refine this approach and propose a more accurate contextual bandits method that also relies on dot-product queries applied in the ES context, but aims to directly approximate the optimal probabilities of sampling from L^ES_active versus from outside L^ES_active, rather than approximating the lengths of the gradient components (see the Algorithm 2 box; the compressed sensing baseline is presented in the Appendix). The related computational overhead is measured in a constant number of extra function queries, negligible in practice.

2.4 The Algorithm

ASEBO is given in the Algorithm 1 box. The algorithm we apply to score the relative importance of sampling from the ES-active subspace L^ES_active is in the Algorithm 2 box. As we have already mentioned, it uses bandits methods to determine the optimal probability of sampling from L^ES_active. In the next section we show that by using these techniques we can substantially reduce the variance of ES blackbox gradient estimators if ES-active subspaces approximate the gradient data well (which is the case for RL blackbox functions, as presented in Fig. 1). The horizon length C in Algorithm 2, which determines the number of extra function queries, should in practice be chosen as a small constant.
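A compact rendering of Algorithm 2's exponentiated-sampling loop (ours, following the pseudocode above; the blackbox F, the point theta and the bases U_act, U_perp are placeholder inputs):

```python
import numpy as np

def explore_estimator(F, theta, U_act, U_perp, sigma=0.01, C=10,
                      alpha=0.01, gamma=0.1, q0=0.1, rng=None):
    """Estimate the exploitation probability p via exponentiated weights.

    Each of the C rounds costs only two extra queries to the blackbox F.
    """
    rng = np.random.default_rng() if rng is None else rng
    d_act, d_perp = U_act.shape[1], U_perp.shape[1]
    q = q0
    for _ in range(C):
        p = (1.0 - 2.0 * gamma) * q + gamma          # regularized probability
        a = rng.random() < p                         # a ~ Ber(p)
        U = U_act if a else U_perp
        g = U @ rng.standard_normal(U.shape[1])
        v = (F(theta + sigma * g) - F(theta - sigma * g)) / (2.0 * sigma)
        scale = (1.0 - 2.0 * gamma) ** 2 * v ** 2
        e1 = scale * a * (d_act + 2) / p ** 3        # importance-weighted losses
        e2 = scale * (1 - a) * (d_perp + 2) / (1.0 - p) ** 3
        w1 = q * np.exp(-alpha * e1)                 # exponentiated-weights step
        w2 = (1.0 - q) * np.exp(-alpha * e2)
        q = w1 / (w1 + w2)
    return (1.0 - 2.0 * gamma) * q + gamma
```

By construction the returned probability always lies in [γ, 1 − γ], so neither pure exploitation nor pure exploration can occur.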
In each iteration of Algorithm 1 the number of function queries is proportional to the dimensionality of the ES-active subspace L^ES_active rather than that of the original space.

3 Theoretical Results

We provide here theoretical guarantees for the ASEBO sampling mechanism (in Algorithm 1), where the sensing directions {g_i} at time t are sampled from the hybrid distribution P̂: with probability p_t from N(0, I_{L^ES_active}) and with probability 1 − p_t from N(0, I_{L^ES,⊥_active}).

Following the notation in Algorithm 1, let U^act ∈ R^{d×r} be obtained by stacking together vectors of some orthonormal basis of L^ES_active, where dim(L^ES_active) = r, and let U^⊥ ∈ R^{d×(d−r)} be obtained by stacking together vectors of some orthonormal basis of its orthogonal complement L^ES,⊥_active. Denote by σ a smoothing parameter. We make the following regularity assumptions on F:

Assumption 1. F is L-Lipschitz, i.e. for all θ, θ' ∈ R^d, |F(θ) − F(θ')| ≤ L||θ − θ'||_2.

Assumption 2. F has a τ-smooth third order derivative tensor with respect to σ > 0, so that F(θ + σg) = F(θ) + σ∇F(θ)^⊤g + (σ²/2)g^⊤H(θ)g + (σ³/6)F'''(θ)[v, v, v] for some v ∈ R^d (||v||_2 ≤ ||g||_2) satisfying |F'''(θ)[v, v, v]| ≤ τ||v||³_2 ≤ τ||g||³_2.

Observe that: E_{g∼P̂}[gg^⊤] = p_tU^act(U^act)^⊤ + (1 − p_t)U^⊥(U^⊥)^⊤. Define C = p_tU^act(U^act)^⊤ + (1 − p_t)U^⊥(U^⊥)^⊤. Let ∇̂^{AT,asebo}_{MC,k=1}F_σ(θ) = C^{−1}(F(θ + σg)g + F(θ − σg)(−g))/(2σ) be the gradient estimator corresponding to P̂. We will assume that σ is small enough, i.e. σ < (1/35)√(ε·min(p_t, 1 − p_t)/(τd³max(L, 1))) for some precision parameter ε > 0. Our first result shows that under these assumptions, the baseline and ASEBO estimators of ∇F_σ(θ) are also good estimators of ∇F(θ):

Lemma 3.1. If F satisfies Assumptions 1 and 2, the estimators ∇̂^{AT,base}_{MC,k=1}F_σ(θ) and ∇̂^{AT,asebo}_{MC,k=1}F_σ(θ) are close to the true gradient ∇F(θ), i.e.: ||E_{g∼N(0,I_d)}[∇̂^{AT,base}_{MC,k=1}F_σ(θ)] − ∇F(θ)|| ≤ ε and ||E_{g∼P̂}[∇̂^{AT,asebo}_{MC,k=1}F_σ(θ)] − ∇F(θ)|| ≤ ε.

3.1 Variance reduction via non-isotropic sampling

We show now that under the sampling strategy given by the distribution P̂, the variance of the gradient estimator can be made smaller by choosing the probability parameter p_t appropriately. Denote: d_active = dim(L^ES_active) and d_⊥ = dim(L^ES,⊥_active). Let Λ := ((d_active + 2)/p_t)s_{U^act} + ((d_⊥ + 2)/(1 − p_t))s_{U^⊥} − ||∇F(θ)||².

Theorem 3.2. The following holds for s_{U^act} = ||(U^act)^⊤∇F(θ)||²_2 and s_{U^⊥} = ||(U^⊥)^⊤∇F(θ)||²_2:

1. The variance of ∇̂^{AT,asebo}_{MC,k=1}F_σ(θ) is close to Λ, i.e. |Var[∇̂^{AT,asebo}_{MC,k=1}F_σ(θ)] − Λ| ≤ ε.

2. The choice of p_t that minimizes Λ satisfies p_t* = √(s_{U^act}(d_active + 2)) / (√(s_{U^act}(d_active + 2)) + √(s_{U^⊥}(d_⊥ + 2))), and the optimal variance Var_opt corresponding to p_t* satisfies: |Var_opt − Λ*| ≤ ε for Λ* = [√(s_{U^act}(d_active + 2)) + √(s_{U^⊥}(d_⊥ + 2))]² − ||∇F(θ)||².

3. Var_opt ≤ Var[∇̂^{AT,base}_{MC,k=1}F_σ(θ)] + ε − δ, where δ := |√(s_{U^⊥}(d_active + 2)) − √(s_{U^act}(d_⊥ + 2))|² − 2||∇F(θ)||². Furthermore, the slack variable δ is always nonnegative.

The theorem implies that when s_{U^act} = (1 − α)||∇F(θ)||² and s_{U^⊥} = α||∇F(θ)||² for some α ∈ (0, 1), we have: Var[∇̂^{AT,base}_{MC,k=1}F_σ(θ)] ≈ (d + 1)||∇F(θ)||², whereas Var_opt = O((1 − α)(d_active + 1) + α(d_⊥ + 1))||∇F(θ)||². When d_active << d and α << 1: Var_opt << Var[∇̂^{AT,base}_{MC,k=1}F_σ(θ)].

3.2 Adaptive Mirror Descent

In Theorem 3.2 we showed that for appropriate choices of L^ES_active and p_t, the gradient estimator ∇̂^{AT,asebo}_{MC,k=1}F_σ(θ) will have significantly smaller variance than ∇̂^{AT,base}_{MC,k=1}F_σ(θ).
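Point 2 of Theorem 3.2 is easy to verify numerically; the helpers below (ours) evaluate Λ(p) and its closed-form minimizer p_t*:

```python
import numpy as np

def variance_proxy(p, s_act, s_perp, d_act, d_perp, grad_sq):
    """Lambda(p) from Theorem 3.2 (leading term of the estimator's variance)."""
    return (d_act + 2) * s_act / p + (d_perp + 2) * s_perp / (1.0 - p) - grad_sq

def optimal_p(s_act, s_perp, d_act, d_perp):
    """Closed-form minimizer p_t* of Lambda(p)."""
    a = np.sqrt(s_act * (d_act + 2))
    b = np.sqrt(s_perp * (d_perp + 2))
    return a / (a + b)
```

At p = p_t*, Λ attains the value Λ* = (√(s_act(d_act+2)) + √(s_perp(d_perp+2)))² − ||∇F(θ)||² quoted in the theorem.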
In this section we show that Algorithm 2 provides an adaptive way to choose p_t. Using tools from online learning theory, we provide regret guarantees that quantify the rate at which Algorithm 2 minimizes the variance of ∇̂^{AT,asebo}_{MC,k=1}F_σ(θ) and converges to the optimal p_t*.

Let p_l^t = (p_l^t(1), p_l^t(2)), with p_l^t(2) = 1 − p_l^t(1). The main component of the variance of ∇̂^{AT,asebo}_{MC,k=1}F_σ(θ), as a function of p_l^t, equals Λ = ℓ(p_l^t), where ℓ(p_l^t) = ((d_active + 2)/p_l^t(1))s_{U^act} + ((d_⊥ + 2)/p_l^t(2))s_{U^⊥} − ||∇F(θ)||² (Theorem 3.2). We have:

Theorem 3.3. Let Δ_2 be the 2-d simplex. Under Assumptions 1 and 2, if σ < (1/35)√(ε·min(p_t, 1 − p_t)/(τd³max(L, 1))), α = √(2 log 2 / (C[(d_active + 2)²s²_{U^act} + (d_⊥ + 2)²s²_{U^⊥}])) and ε = γ³/(2C(d + 1)), Algorithm 2 satisfies:

(1/C) E[Σ_{l=1}^C ℓ(p_l^t)] − min_{p ∈ γ + (1−2γ)Δ_2} ℓ(p) ≤ Var_opt/(2√C) + 1/C.

4 Experiments

In our experiments we use different classes of high-dimensional blackbox functions: RL blackbox functions (where the input is a high-dimensional vector encoding a neural network policy π : S → A mapping states s to actions a, and the output is the cumulative reward obtained by an agent applying this policy in a particular environment) and functions from the recently open-sourced Nevergrad library [34]. In practice one can set up the hyperparameters used by Algorithm 2 as follows: σ = 0.01, C = 10, α = 0.01, γ = 0.1, q_0^t = 0.1. For each algorithm we used k = 5 seeds; the obtained curves are median-curves with inter-quartile ranges presented as shadowed regions.

Figure 2: Comparison of different blackbox optimization algorithms on OpenAI Gym tasks. All curves are median-curves obtained from k = 5 seeds and with inter-quartile ranges presented as shadowed regions.
For Reacher we present only 3 curves since LM-MA-ES and TRPO did not learn.

4.1 RL blackbox functions

We used the following environments from the OpenAI Gym library: Swimmer-v2, HalfCheetah-v2, Walker2d-v2, Reacher-v2, Pusher-v2 and Thrower-v2. In all experiments we used policies encoded by neural network architectures with two hidden layers and tanh nonlinearities, with > 100 parameters. For gradient-based optimization we used Adam. For this class of blackbox functions we compared ASEBO with other generic blackbox methods as well as those specializing in optimizing RL blackbox functions F, namely: (1) CMA-ES variants: we compare against two recently introduced algorithms designed for high-dimensional settings (we use the implementation of VkD-CMA-ES in the pycma open-source implementation from https://github.com/CMA-ES/pycma, and that of LM-MA-ES from [26]); (2) Augmented Random Search (ARS) [25] (we use the implementation released by the authors at http://github.com/modestyachts/ARS); (3) Proximal Policy Optimization (PPO) [32] and Trust Region Policy Optimization (TRPO) [31] (we use the OpenAI baselines implementation [12]). The results for four environments are in Fig. 2.
For each\nenvironment the top two performing algorithms are bolded, while the bottom two are shown in red.\n\nMedian reward after # timesteps\n\nEnvironment Timesteps ASEBO\n3821\nHalfCheetah\n358\nSwimmer\nWalker2d\n9941\n99949\nHopper\n11\nReacher\n46\nPusher\n89\nThrower\n\n5.107\n107\n5.107\n107\n105\n105\n105\n\nES\n1530\n36\n347\n626\n10\n-48\n-96\n\n-144\n367\n1\n42\n\nARS VkD-CMA LM-MA TRPO PPO\n1514\n2420\n52\n348\n1112\n2377\n1310\n1091\n-196\n-12\n-316\n45\n-90\n-175\n\n-1391\n-1001\n-796\n\n-512\n110\n3011\n1663\n-112\n-120\n85\n\n1632\n297\n\n18065\n100199\n\n-173\n-467\n-737\n\nSampling complexity is measured in the number of timesteps (environment transitions) used by the\nalgorithms. ASEBO is the only algorithm that performs consistently across all seven environments\n(see: Table 1), outperforming CMA-ES variants on all tasks aside from VkD-CMA-ES on Swimmer\nand LM-MA-ES on Walker2d. For environments such as Reacher, Thrower and Pusher, these\nmethods perform poorly, drastically underperforming even Vanilla ES. On Fig. 3, we demonstrate\nthe common problem of state-of-the-art CMA-ES methods: if the number of samples n is not carefully\ntuned, the algorithm does not learn. ASEBO does not have this problem since n is learned on-the-\ufb02y.\n\nFigure 3: Sensitivity analysis for CMA-ES variants on the HalfCheetah (HC) and Walker2d (WA)\ntasks. In each setting, we run k = 5 seeds, solely changing the number of samples per iteration (or\npopulation size) n.\n\nFigure 4: Comparison of median-curves obtained from k = 5 seeds for different algorithms on\nNevergrad functions [34]. Inter-quartile ranges are presented as shadowed regions.\n\n8\n\n\f4.2 Nevergrad blackbox functions\n\nWe tested functions:\nsphere, rastrigin, rosenbrock and lunacek (from the class of Bi-\nRastrigin/Lunacek\u2019s No.02 functions). All tested functions are 1000-dimensional. The results\nare presented on Fig. 4. 
ASEBO is the most reliable method across different functions.

5 Conclusion

We proposed a new algorithm for optimizing high-dimensional blackbox functions. ASEBO adjusts on-the-fly the strategy of choosing gradient sensing directions to the hardness of the problem at the current stage of optimization and can be applied to both RL and non-RL problems. We provided theoretical guarantees for our method and exhaustive empirical validation.

References

[1] S. Agrawal, N. R. Devanur, and L. Li. Contextual bandits with global constraints and objective. CoRR, abs/1506.03374, 2015.

[2] S. Ahmad and K. B. Thomas. Flight optimization system (FLOPS) hybrid electric aircraft design capability. 2013.

[3] Y. Akimoto, A. Auger, and N. Hansen. Comparison-based natural gradient optimization in high dimension. GECCO, 2014.

[4] Y. Akimoto and N. Hansen. Projection-based restricted covariance matrix adaptation for high dimension. GECCO, 2016.

[5] R. A. Bridges, A. D. Gruber, C. Felder, M. E. Verma, and C. Hoff. Active manifolds: A non-linear analogue to active subspaces. CoRR, abs/1904.13386, 2019.

[6] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym, 2016.

[7] K. Choromanski, M. Rowland, V. Sindhwani, R. E. Turner, and A. Weller. Structured evolution with compact architectures for scalable policy optimization. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 969–977, 2018.

[8] P. G. Constantine. Active Subspaces - Emerging Ideas for Dimension Reduction in Parameter Studies, volume 2 of SIAM Spotlights. SIAM, 2015.

[9] P. G. Constantine, A. Eftekhari, and M. B. Wakin. Computing active subspaces efficiently with gradient sketching.
In 6th IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing, CAMSAP 2015, Cancun, Mexico, December 13-16, 2015, pages 353–356, 2015.

[10] P. G. Constantine, C. Kent, and T. Bui-Thanh. Accelerating Markov chain Monte Carlo with active subspaces. SIAM J. Scientific Computing, 38(5), 2016.

[11] E. Conti, V. Madhavan, F. P. Such, J. Lehman, K. O. Stanley, and J. Clune. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 5032–5043, 2018.

[12] P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, Y. Wu, and P. Zhokhov. OpenAI Baselines. https://github.com/openai/baselines, 2017.

[13] G. H. Golub. Some modified matrix eigenvalue problems. SIAM, 15, 1973.

[14] D. Ha and J. Schmidhuber. Recurrent world models facilitate policy evolution. NeurIPS, 2018.

[15] P. Hämäläinen, A. Babadi, X. Ma, and J. Lehtinen. PPO-CMA: Proximal policy optimization with covariance matrix adaptation. CoRR, abs/1810.02541, 2018.

[16] N. Hansen and A. Ostermeier. Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation. In Evolutionary Computation, 1996, Proceedings of IEEE International Conference on, pages 312–317. IEEE, 1996.

[17] R. Houthooft, Y. Chen, P. Isola, B. Stadie, F. Wolski, O. Jonathan Ho, and P. Abbeel. Evolved policy gradients. NeurIPS, 2018.

[18] I. Jolliffe. Principal Component Analysis. Series: Springer Series in Statistics, XXIX, 2002.

[19] J. Lehman, J. Chen, J. Clune, and K. O. Stanley. ES is more than just a traditional finite-difference approximator.
In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2018, Kyoto, Japan, July 15-19, 2018, pages 450–457, 2018.

[20] C. Li, H. Farkhoor, R. Liu, and J. Yosinski. Measuring the intrinsic dimension of objective landscapes. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.

[21] G. Liu, L. Zhao, F. Yang, J. Bian, T. Qin, N. Yu, and T.-Y. Liu. Trust region evolution strategies. In AAAI, 2019.

[22] I. Loshchilov. A computationally efficient limited memory CMA-ES for large scale optimization. GECCO, 2014.

[23] I. Loshchilov, T. Glasmachers, and H. Beyer. Large scale black-box optimization by limited-memory matrix adaptation. IEEE Transactions on Evolutionary Computation, 2019.

[24] N. Maheswaranathan, L. Metz, G. Tucker, and J. Sohl-Dickstein. Guided evolutionary strategies: escaping the curse of dimensionality in random search. CoRR, abs/1806.10230, 2018.

[25] H. Mania, A. Guy, and B. Recht. Simple random search provides a competitive approach to reinforcement learning. CoRR, abs/1803.07055, 2018.

[26] N. Müller and T. Glasmachers. Challenges in high-dimensional reinforcement learning with evolution strategies. Parallel Problem Solving from Nature – PPSN XV, 2018.

[27] Y. Nesterov and V. Spokoiny. Random gradient-free minimization of convex functions. Found. Comput. Math., 17(2):527–566, Apr. 2017.

[28] R. Ros and N. Hansen. A simple modification in CMA-ES achieving linear time and space complexity. In G. Rudolph, T. Jansen, N. Beume, S. Lucas, and C. Poloni, editors, Parallel Problem Solving from Nature – PPSN X, pages 296–305, 2008.

[29] M. Rowland, K. Choromanski, F. Chalus, A. Pacchiano, T. Sarlos, R. E. Turner, and A. Weller. Geometrically coupled Monte Carlo sampling. In NeurIPS, 2018.

[30] T. Salimans, J. Ho, X. Chen, S.
Sidor, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. 2017.

[31] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 1889–1897, 2015.

[32] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[33] F. P. Such, V. Madhavan, E. Conti, J. Lehman, K. O. Stanley, and J. Clune. Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. CoRR, abs/1712.06567, 2017.

[34] O. Teytaud and J. Rapin. Nevergrad: An open source tool for derivative-free optimization. https://code.fb.com/ai-research/nevergrad/, 2018.

[35] K. Varelas, A. Auger, D. Brockhoff, N. Hansen, O. A. Elhara, Y. Semet, R. Kassab, and F. Barbaresco. A comparative study of large-scale variants of CMA-ES. PPSN XV 2018 - 15th International Conference on Parallel Problem Solving from Nature, 2018.

[36] G. Zhang and J. Hinkle. ResNet-based isosurface learning for dimensionality reduction in high-dimensional function approximation with limited data. CoRR, 2019.

[37] Z. Zhou, X. Li, and R. N. Zare. Optimizing chemical reactions with deep reinforcement learning. ACS Central Science, 3(12):1337–1344, 2017.
PMID: 29296675.