{"title": "Genetic-Gated Networks for Deep Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1747, "page_last": 1756, "abstract": "We introduce the Genetic-Gated Networks (G2Ns), simple neural networks that combine a gate vector composed of binary genetic genes in the hidden layer(s) of networks. Our method can take both advantages of gradient-free optimization and gradient-based optimization methods, of which the former is effective for problems with multiple local minima, while the latter can quickly find local minima. In addition, multiple chromosomes can define different models, making it easy to construct multiple models and can be effectively applied to problems that require multiple models. We show that this G2N can be applied to typical reinforcement learning algorithms to achieve a large improvement in sample efficiency and performance.", "full_text": "Genetic-Gated Networks\n\nfor Deep Reinforcement Learning\n\nSimyung Chang\n\nSeoul National University,\n\nSamsung Electronics\n\nSeoul, Korea\n\ntimelighter@snu.ac.kr\n\nJaeseok Choi\n\nSeoul National University\n\nSeoul, Korea\n\njaeseok.choi@snu.ac.kr\n\nJohn Yang\n\nSeoul National University\n\nSeoul, Korea\n\nyjohn@snu.ac.kr\n\nNojun Kwak\n\nSeoul National University\n\nSeoul, Korea\n\nnojunk@snu.ac.kr\n\nAbstract\n\nWe introduce the Genetic-Gated Networks (G2Ns), simple neural networks that\ncombine a gate vector composed of binary genetic genes in the hidden layer(s) of\nnetworks. Our method can take both advantages of gradient-free optimization and\ngradient-based optimization methods, of which the former is effective for problems\nwith multiple local minima, while the latter can quickly \ufb01nd local minima. In\naddition, multiple chromosomes can de\ufb01ne different models, making it easy to\nconstruct multiple models and can be effectively applied to problems that require\nmultiple models. 
We show that this G2N can be applied to typical reinforcement learning algorithms to achieve a large improvement in sample efficiency and performance.\n\n1 Introduction\n\nMany reinforcement learning algorithms such as policy gradient based methods [14, 17] suffer from the problem of getting stuck in local extrema. These phenomena are essentially caused by the updates of a function approximator that depend on the gradients of the current policy, which is usual for on-policy methods. Exploiting the short-sighted gradients should be balanced with adequate exploration. Exploration thus should be designed independently of policy gradients in order to guide the policy to unseen states. Heuristic exploration methods such as ε-greedy action selection and entropy regularization [20] are widely used, but are incapable of complex action-planning in many environments [7, 8]. While policy gradient-based methods such as Actor-Critic models [5, 6] explore a given state space typically by applying random Gaussian control noise in the action space, the mean and the standard deviation of the randomness remain as hyper-parameters to heuristically control the degree of exploration.\n\nWhile meaningful explorations can also be achieved by applying learned noise in the parameter space and thus perturbing neural policy models [2, 9, 10], there have been genetic evolution approaches for the exploration control of an optimal policy, considering that gradient-free methods are able to overcome confined local optima [11]. Such et al. [16] vectorize the weights of an elite policy network and mutate the vector with a Gaussian distribution to generate other candidate policy networks. This process is iterated until an elite parameter vector which yields the best fitness score is learned. 
32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\n(a) Generation 1 (b) Generation 2\n\nFigure 1: The optimization of Genetic-Gated Networks for a 2D toy problem with a population of 8. The black line depicts the gradient-based training path of a network with an elite chromosome implemented. While training an elite network model, networks perturbed by other chromosomes explore other regions of the surface by using the updated network parameters trained by the elite chromosome (gray lines). All chromosomes are evaluated by their fitness scores at the end of every generation, and the training continues with the best chromosome selected.\n\nWhile their method finds optimal parameters of a policy network purely based on a genetic algorithm, the ES algorithm in [12] further engages a zero-order optimization process; the parameter vector is updated every iteration with the sum of vectors that are weighted by the total score earned by their resulting policy networks. Such a structure partially frees their method from back-propagation operations, which thus yields an RL algorithm that learns faster in terms of wall-clock time, but carries an issue of low sampling efficiency.\n\nWe, in this paper, introduce the Genetic-Gated Networks (G2Ns), which are composed of two phases of iterative optimization, gradient-free and gradient-based. The genetic algorithm, during the gradient-free optimization, searches for an elite chromosome that optimally perturbs an engaging neural model, hopping over regions of a loss surface. Once a population of chromosomes is generated to be implemented in a model as a dropout mask, the perturbation on the model caused by the various chromosomes is evaluated based on the episodic rewards of each model. A next generation of the chromosome population is generated through selections, cross-overs, and mutations of the superior gene vectors after a period of episodes. 
An elite model can then be selected and simply optimized further with gradient-based learning to exploit a local minimum.\n\nA population of multiple chromosomes can quickly define unique neural models either asynchronously or synchronously, by which diverse exploration policies can be accomplished. This genetic formulation allows a model not only to explore over a multi-modal solution space as depicted in Figure 1 (the gray lines in the figures indicate multiple explorations while the black line exploits the gradient), but also to hop over the space whenever it finds a better region (in Figure 1(b), black is switched to one of the grays with a better solution, and then follows the gradient).\n\n2 Background\n\nGenetic Algorithm: A genetic algorithm [4] is a parallel and global search algorithm. It expresses the solution of the problem in the form of a specific data structure, and uses a method of gradually finding better solutions through crossovers and mutations. In practice, this data structure is usually expressed as a binary vector, each dimension of which decides the activation of each gene. A vector with a unique combination of these genes can be considered a chromosome. Evolution begins with a population of randomly generated individuals, and every time a generation is repeated, the fitness of each individual is evaluated. By selecting the best chromosomes and doing crossovers among them, better solutions are found. Through repeated selections of superior individuals and crossovers of them over generations, newly created genes in the next generation are more likely to inherit the characteristics of the superior predecessors. 
Additionally, the algorithm activates new genes with a fixed probability of mutation in every generation, and these random mutations allow the genetic algorithm to escape from local minima.\n\nActor-Critic methods: In actor-critic methods, an actor plays the role of learning a policy π_θ(a|s_t), which selects an action a ∈ A given that the state s is s_t ∈ S, and a critic carries a value estimate V_w(s) to lead the actor to learn the optimal policy. Here, θ and w respectively denote the network parameters of the actor and the critic. Training progresses towards maximizing the objective function based on cumulative rewards, J(θ) = E_{π_θ}[Σ_t γ^t r_t], where r_t is the instantaneous reward at time t and γ is a discount factor.\n\nFigure 2: An illustration of a Genetic-Gated Network variously gating the feed-forwarding flow of a neural network to create different models with chromosome vectors generated from a genetic algorithm.\n\nThe policy update gradient is defined as follows:\n\n∇_θ J(θ) = E_π[∇_θ log π_θ(s, a) A^π(s, a)]. (1)\n\nIn (1), A^π(s, a) is an advantage function, which can be defined in various ways. In this paper, it is defined as in the asynchronous advantage actor-critic method (A3C) [6]:\n\nA^π(s_t, a_t) = Σ_{i=0}^{k−1} γ^i r(s_{t+i}, a_{t+i}) + γ^k V^π_w(s_{t+k}) − V^π_w(s_t), (2)\n\nwhere k denotes the number of steps.\n\nDropout: Dropout [15] is one of various regularization methods for training neural networks to prevent over-fitting to the training data. Typical neural networks use all neurons to feed-forward, but in a network with dropout layers, each neuron is activated with probability p and deactivated with probability 1 − p. Using this method, dropout layers interfere in the encoding process, causing perturbations in neural models. 
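As a concrete reading of the k-step advantage in Eq. (2) above, here is a minimal sketch (the function name and test values are ours, not the authors' code):

```python
def k_step_advantage(rewards, values, bootstrap_value, gamma=0.99):
    """Eq. (2): A(s_t, a_t) = sum_{i=0}^{k-1} gamma^i r_{t+i}
    + gamma^k V(s_{t+k}) - V(s_t), computed for the first step of a
    k-step rollout. `values[0]` is V(s_t); `bootstrap_value` is V(s_{t+k})."""
    ret = bootstrap_value  # accumulate the k-step return backwards
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret - values[0]

# Two rewards of 1.0, gamma=0.5, bootstrap V=2.0, V(s_t)=0.5:
# return = 1 + 0.5*(1 + 0.5*2) = 2.0, so advantage = 1.5
adv = k_step_advantage([1.0, 1.0], [0.5], 2.0, gamma=0.5)
```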
In that sense, each unit in a dropout layer can be interpreted as exploring and re-adjusting its role in relation to other activated units. Regularizing a neural network by applying noise to its hidden units allows more robustness, enhancing generalization through training the deformed outputs to fit the target labels. We are motivated by the idea that this mechanism can be utilized as a way to control the exploration of a model for contemporary reinforcement learning problems.\n\n3 Genetic-Gated Networks\n\nIn this section, the mechanisms of Genetic-Gated Networks (G2Ns) are described, and how they are implemented within the framework of actor-critic methods is explained. The proposed actor-critic methods are named Genetic-Gated Actor-Critic (G2AC) and Genetic-Gated PPO (G2PPO).\n\n3.1 Genetic-Gated networks\n\nIn Genetic-Gated networks (G2Ns), hidden units are partially gated (opened or closed) by a combination of genes (a chromosome vector), and the optimal combination of genes is learned through a genetic algorithm. The element-wise feature multiplication with a binary vector appears to be similar to that of the dropout method, yet our detailed implementation differs in the following aspects:\n\n1. While a dropout vector is composed of Bernoulli random variables each of which has probability p of being 1, a chromosome vector in G2N is generated by a genetic algorithm.\n2. While a dropout vector is randomly generated for every batch, a chromosome vector stays fixed during several batches for evaluation of its fitness.\n3. 
While the dropout method is designed for generalization in gradient-based learning and thus performs back-propagation updates on all the 'dropped-out' models, the G2N only performs gradient-based updates on one elite model.\n\nFigure 3: Interaction of multiple agents (models) and environments during the GA+elite training phase. The actions are taken by multiple policy actors (multi-policy agent) with the minibatch, and θ is updated using back-propagation on the elite actor with the collected information. µ_elite is the chromosome vector of the elite and P denotes the Population Matrix.\n\nIf conventional neural networks can be expressed as f(x; θ), where x is an input and θ is a set of weight parameters, G2N can be defined as a function of θ and a chromosome µ, which is represented as a binary vector in our paper. The perturbed model with the i-th genotype, µ_i, can be expressed as its phenotype, or an Individual_i, f(x; θ, µ_i), and a Population is defined as a set of N individuals in a generation:\n\nIndividual_i = f(x; θ, µ_i),\nPopulation(N) = {f(x; θ, µ_0), ..., f(x; θ, µ_{N−1})}. (3)\n\nWe suggest a Population Minibatch to train a G2N efficiently using conventional neural network training methods. A 2D matrix called the Population Matrix, P, allows evaluating fitness in batches for a population of individuals. As depicted in Figure 2, since each row of this matrix is an individual chromosome, the outputs for all models of the entire population can be acquired with an element-wise multiplication of this matrix and every batch. The population in (3) can thus be expressed as:\n\nPopulation(N) = f(x; θ, P). (4)\n\nUsing the population matrix, we can evaluate fitness for multiple individuals simultaneously. This method is structurally simple, and can easily be parallelized with GPUs. 
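The element-wise gating behind Eq. (4) can be sketched in a few lines of NumPy (a minimal illustration under our own variable names; the gate probability of 0.8 mirrors the initialization used later in the experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
N, H = 8, 16  # population size, hidden-layer width

# Population Matrix P: each row is one binary chromosome (gate vector).
P = (rng.random((N, H)) < 0.8).astype(np.float32)

def gated_hidden(h, P):
    """Gate one hidden feature vector h with every chromosome at once.
    Broadcasting yields an (N, H) batch: row i is Individual_i's activation,
    i.e. f(x; theta, P) evaluated at this layer."""
    return h[None, :] * P

h = rng.standard_normal(H).astype(np.float32)
out = gated_hidden(h, P)  # one forward pass covers the whole population
```

In a real network this multiplication is applied to a minibatch of hidden activations, so all N perturbed models share one forward pass on the GPU.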
Furthermore, a generation can be defined as M iterations of minibatches during the learning of neural networks.\n\nImplementation of multiple chromosomes in a parameter space in the form of gate vectors enables generating various models through perturbations on the original neural model, to better search over a solution space. Not only so, but once in every generation the training model is allowed to switch to the gating vector that yields the best fitness score, which comes as a crucial feature of our optimization for highly multi-modal solution spaces.\n\nTherefore, among a population, an elite neural network model f(x; θ, µ_elite) is selected, where µ_elite denotes the chromosome that causes an individual to yield the best fitness score. More specifically, after each generation, an elite chromosome is selected based on the following equation:\n\nµ_elite = argmax_µ F(θ, µ), µ ∈ {µ_0, . . . , µ_{N−1}}, (5)\n\nwhere F(·) is a fitness-score function.\n\nThe procedures of 'training network parameters', 'elite selection' and 'generation of new population' are repeated until a training criterion is met. G2N learns the parameters θ to maximize F. F values are then recalculated for all the individuals based on the updated parameters θ for sorting top elite chromosomes. A generation ends and a new population is generated by genetic operations such as selections, mutations, and crossovers among chromosomes.\n\n3.2 Genetic-Gated Actor Critic Methods\n\nGenetic-Gated Actor Critic (G2AC) is a network that incorporates G2N and the conventional advantage actor-critic architecture. 
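The elite-selection rule of Eq. (5) above amounts to an argmax over per-individual fitness scores; a minimal sketch (names and values are ours, for illustration only):

```python
import numpy as np

def select_elite(P, fitness):
    """Eq. (5): return the chromosome whose individual achieved the best
    fitness score F under the current shared parameters theta."""
    return P[int(np.argmax(fitness))]

# Three chromosomes and their (hypothetical) average episodic scores.
P = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0]], dtype=np.float32)
fitness = np.array([10.0, 42.0, 7.0])
mu_elite = select_elite(P, fitness)  # -> the second chromosome
```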
For our back-bone model, we use the Synchronous Advantage Actor-Critic model (A2C) [21], which is structurally simple and publicly available. G2AC, in a later section, is evaluated in environments with a discrete action space. For continuous control, we have applied G2N to the Proximal Policy Optimization method (PPO) [13], denoting the result G2PPO. Our method can be embarked on existing actor-critic models without extra weight parameters, since the element-wise multiplication with the population matrix is negligible compared to the main actor-critic computations.\n\nAlgorithm 1 A Genetic-Gated Actor Critic Pseudo-Code\n\nInitialize Population Matrix P\nInitialize Fitness Table T\nµ_elite ← P[0]\nrepeat\n  while Elite Phase do\n    Set µ_elite to gate matrix for acting and updating\n    Do conventional Actor-Critic method\n  while GA+elite Phase do\n    Set P to gate matrix for acting\n    Set µ_elite to gate matrix for updating\n    Do conventional Actor-Critic method\n    if terminated episodes exist then\n      Store episode scores to T\n  Evaluate the fitness of the individuals using T\n  Select best individuals for the next generation\n  µ_elite ← GetBestChromosome(P)\n  Build P with genetic operations such as crossovers and mutations\nuntil stop criteria\n\nThe proposed Genetic-Gated actor-critic methods (G2AC and G2PPO) have the following main features:\n\nMulti-Model Race: A single critic evaluates the state values while multiple agents follow unique policies, so the population matrix can be applied to the last hidden layer of the actors. Therefore, our method is composed of multiple policy actors and a single critic. These multiple actors conduct exploration and compete with each other. Their fitness scores are then considered when generating new chromosomes in the following generation.\n\nStrong Elitism: Our method employs the strong elitism technique. 
This technique not only applies elitism as in conventional genetic algorithms, preserving the elite into the next generation without any crossover or mutation, but also performs back-propagation based on the model with the elite chromosome.\n\nTwo-way Synchronous Optimization: In every generation, our method consists of two training phases: a phase in which only the elite model interacts with environments to be trained, and another phase in which the elite and the other perturbed models are co-utilized. The first phase is purposed on exploiting the current policy to learn neural network parameters with relatively less exploration. By preventing value evaluations for multiple policies during a whole generation, the first phase also secures steps of training the value function with a single current policy. G2AC and G2PPO, in the second phase, use all the perturbed models to collect experiences, while updating the model parameters using the loss based on the elite model; this phase is intended to explore much more, so that the elite model can be switched to an actor with a better individual if one is found.\n\nFigure 3 and Algorithm 1 sum up the overall behavior of our method, which is applicable to both G2AC and G2PPO. All agents interact with multiple environments and collect observations, actions and rewards into a minibatch. The elite agent is then updated at every update interval. During a generation, chromosomes are fixed and the fitness score of each actor is evaluated. We define an actor's fitness to be the average episodic score of the actor during the generation. Figure 1 visualizes the optimization of Genetic-Gated Actor Critic in a 2D toy problem. It illustrates how perturbed policies (gray) behave while an elite (black) of G2N is optimized towards getting higher rewards (red), and how hopping occurs in the following generation. 
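The "Build P with genetic operations" step of Algorithm 1 could look like the following sketch. The paper does not specify its exact selection, crossover, or mutation operators, so the uniform crossover and bit-flip mutation below are illustrative assumptions; only strong elitism (the elite row surviving untouched) is taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def next_population(P, fitness, p_cross=0.8, p_mut=0.03):
    """One generation step (illustrative): keep the elite unchanged
    (strong elitism), then fill the rest by uniform crossover between
    two parents drawn from the fittest half, plus bit-flip mutation."""
    N, H = P.shape
    order = np.argsort(fitness)[::-1]       # best individual first
    new_P = np.empty_like(P)
    new_P[0] = P[order[0]]                  # elite survives untouched
    parents = P[order[: max(2, N // 2)]]    # selection: top half
    for i in range(1, N):
        a, b = parents[rng.integers(len(parents), size=2)]
        if rng.random() < p_cross:          # uniform crossover
            child = np.where(rng.random(H) < 0.5, a, b)
        else:
            child = a.copy()
        flip = rng.random(H) < p_mut        # bit-flip mutation
        child[flip] = 1 - child[flip]
        new_P[i] = child
    return new_P

P = (rng.random((8, 16)) < 0.8).astype(np.float32)
fitness = rng.random(8)                     # stand-in episodic scores
P2 = next_population(P, fitness)
```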
The details of the 2D toy problem and the implementations are included in the supplementary material.\n\nTable 1: The seventh column shows the average return of 10 episodes evaluated after training 200M frames from the Atari 2600 environments with G2AC (with 30 random initial no-ops). The results of DQN [19], A3C [6], HyperNEAT [3], ES and A2C [12] and GA [16] are all taken from the cited papers. Time refers to the time spent in training, where DQN is measured with 1 GPU (K40), A3C with a 16-core CPU, ES and Simple GA with 720 CPUs, and G2AC with 1 GPU (Titan X). Hyphen marks, \"-\", indicate no evaluation results reported.\n\nGAME | DQN | A3C FF | HYPERNEAT | ES | SIMPLE GA | G2AC | A2C FF\nFRAMES, TIME | 200M, 7-10D | 320M, 1D | - | 1B, 1H | 1B, 1H | 200M, 4H | 320M, -\nAMIDAR | 978.0 | 283.9 | 184.4 | 112.0 | 216 | 929.0 | 548.2\nASSAULT | 4,280.4 | 3,746.1 | 912.6 | 1,673.9 | 819 | 15,147.2 | 2,026.6\nATLANTIS | 279,987.0 | 772,392.0 | 61,260.0 | 1,267,410.0 | 79,793 | 3,491,170.9 | 2,872,644.8\nBEAM RIDER | 8,627.5 | 13,235.9 | 1,412.8 | 744.0 | - | 13,262.7 | 4,438.9\nBREAKOUT | 385.5 | 551.6 | 2.8 | 9.5 | - | 852.9 | 368.5\nGRAVITAR | 473.0 | 269.5 | 370.0 | 805.0 | 462 | 665.1 | 256.2\nPONG | 19.5 | 11.4 | 17.4 | 21.0 | - | 23.0 | 20.8\nQBERT | 13,117.3 | 13,752.3 | 695.0 | 147.5 | - | 15,836.5 | 15,758.6\nSEAQUEST | 5,860.6 | 2,300.2 | 716.0 | 1,390.0 | 807 | 1,772.6 | 1,763.7\nSPACE INVADERS | 1,692.3 | 2,214.7 | 1,251.0 | 678.5 | - | 2,976.4 | 951.9\nVENTURE | 54.0 | 19.0 | 0.0 | 760.0 | 810.0 | 0.0 | 0.0\n\n4 Experiments\n\nWe have experimented on the Atari environments and MuJoCo [18], representative RL problems, to verify the following: (1) Can G2N be effectively applied to Actor-Critic, a typical RL algorithm? (2) Does the genetic algorithm of Genetic-Gated RL models have advantages over a simple multi-policy model? 
(3) Are Genetic-Gated RL algorithms effective in terms of sample efficiency and computation? All the experiments were performed using OpenAI Gym [1].\n\n4.1 Performance on the Atari Domain\n\nFor the Atari experiments, we have adopted the same CNN architecture and hyper-parameters as A2C [21]. At the beginning of training, each gene is activated with a probability of 80%. G2AC uses 64 individuals (actors), with an 80% crossover probability and a 3% mutation probability for each genetic evolution. In every generation, the elite phase and the GA+elite phase are respectively set to persist 500 steps and 20 episodes for each actor.\n\nTable 1 shows the comparable performances of G2AC on Atari game environments. When trained for far fewer frames than A2C, our method already outperforms the baseline model, A2C, in all the environments. Considering that the performance comes with only a little additional matrix multiplication, G2N proves its competency compared to many RL algorithms. In the case of GRAVITAR, even though both A3C and A2C, which are policy gradient methods, have a lower score than the value-based DQN, G2AC achieves a higher score with fewer frames of training. 
However, for SEAQUEST, it seems that G2AC is not able to deviate from the local minima, scoring similar rewards to A2C. This is presumably caused by gate vectors that do not perturb the network enough to get out of the local minima, or by low diversity of the models due to insufficient crossovers and/or mutations. For the case of VENTURE, neither A2C nor G2AC has gained any reward, implying that the learning model, though augmented with G2N, is constrained within the mechanism of its baseline model.\n\nWe have also conducted experiments in 50 Atari games with the same settings as those of Table 1. Compared against the base model of A2C, which is trained over 320M frames, G2AC, trained only for 200M frames, has better results in 39 games, the same results in three games, and worse results in the remaining eight games. Considering that it can be implemented in a structurally simple manner with little computational overhead, applying G2N to the Actor-Critic method is considered very effective in the Atari environments. The detailed experimental results are included in the supplementary material.\n\n4.2 Continuous action control\n\nWe have further experimented with G2PPO in eight MuJoCo environments [18] to evaluate our method in environments with continuous action control. Except that the elite phase and the\n\nFigure 4: Performance of G2PPO on MuJoCo environments. PPO uses the same hyper-parameters as in its original paper, while PPO8 and G2PPO use eight actors (in 8 threads) and 512 steps. Compared to PPO8, which uses the same setting, the proposed G2PPO is slightly lower in performance at the beginning of learning in most environments, but performs better than the baseline (PPO8) as learning progresses. 
The score of G2PPO is the average episodic score of all eight candidate actors, while G2PPOelite and G2PPOtop are the score of the current elite and the top score among the eight actors, which will be the elite in the next generation. The curve for G2PPOelite clearly shows the hopping property of the proposed method.\n\nGA+elite phase are respectively set to persist for 10,240 steps and five episodes in every generation, the genetic update sequence is identical to that in the Atari domain.\n\nUnlike our method, which engages multiple actors, the original PPO is reported to use a single actor model and a batch size of 2,048 steps (horizon) when learning in the MuJoCo environments. Since the number of steps k is a significant hyperparameter, as noted earlier in Equation 2, we have been very careful in finding the right horizon size to reproduce the reported performance of PPO using multiple actor threads. We have found that PPO8 can be trained synchronously with eight actor threads and a horizon size of 512 and reproduce most of PPO's performance in the corresponding paper. G2PPO has thus been experimented with the same settings as those of PPO8 for a fair comparison. The results of the three models (PPO, PPO8, G2PPO) are shown in Figure 4. To monitor scores and changes of elite actors at the end of every generation, we have also indicated the scores of the current elite actor, G2PPOelite, and the score of the current candidate actor with the highest score, G2PPOtop, in Figure 4. Note that G2PPOtop is always above or equal to G2PPOelite.\n\nAs can be seen in Figure 4, the early learning progress of G2PPO is slower than that of PPO8 in most environments. Its final scores, however, are higher than or equal to the scores of the base model except for WALKER2D. 
The score difference between G2PPO and PPO is not significant, because PPO is one of the state-of-the-art methods for many MuJoCo environments, and some environments such as InvertedDoublePendulum, InvertedPendulum and Reacher have fixed limits on the maximum score. Not knowing the global optimum, it is difficult to be sure that G2PPO can achieve a significant performance gain over PPO. Instead, the superiority of G2PPO over PPO should be judged based on whether G2PPO learns to gain more rewards when PPO gets stuck at a certain score despite additional learning steps, as in the case of Ant. For the case of WALKER2D, the resulting score of G2PPO has not been able to exceed the scores of the baseline models when trained for 2M timesteps. When trained for an additional 2M steps, our method obtains an average score of 5,032.8 over ten episodes, while PPO remains at a highest score of 4,493.7. Considering that the graph of G2PPO follows the average episodic score of all candidate actors, the performance of G2PPOelite and G2PPOtop should be considered as well. If a single policy actor is to be examined in contrast to PPO, which trains one representative actor, the performance of G2PPOelite and G2PPOtop should be the direct comparison, since G2PPO always selects the policy with the highest score.\n\nAdditionally, Figure 4 implies that (1) if a candidate actor scores better than the current elite actor, it becomes the elite model in the next generation. This process is done as a gradient-free optimization, as depicted in Figure 1, hopping over various models. (2) And, if the scores of G2PPOelite and G2PPOtop are equal, the current elite actor continues to be the elite. 
For MuJoCo environments, the changes of the elite model occurred least frequently in INVERTEDPENDULUM (4 changes during 17 generations) and most frequently in REACHER (23 changes during 24 generations). The detailed results for the changes of the elite model are included in the supplementary material.\n\nFigure 5: Performance comparisons with ablation models. Random denotes the multi-model with a random binary vector as gates of the same hidden layer. Separated denotes the model with separated optimization of the GA and the neural network. Each curve is averaged across three trials with different random seeds. (best viewed in color)\n\n4.3 Additional Analysis\n\nComparison with random multi-policy: To compare with a simple multi-policy exploration, we have defined a multi-model structure with random binary vectors as gates and have compared its performance with our methods. The other conditions, such as the generation period and the number of models, are set to be the same as for G2AC and G2PPO. Random binary vectors are generated with the same probability as the genes' initial probability and used instead of chromosomes. 
By doing so, we have obtained a multi-policy model that does not use a genetic algorithm.\n\nAs shown in Figure 5, using a multi-policy model with random gates does not bring a significant improvement over the baseline and even results in worse performance in a few environments. The performance reductions are especially observed in experiments with the MuJoCo environments. This is due to the direct perturbation of the continuous action, which causes highly unstable results as the gate vector changes. On the other hand, the G2N models have achieved noticeable performance improvements. In Atari environments, G2AC with 40M frames of training has a higher average score than A2C with 320M learning frames. G2PPO in the MuJoCo experiments is shown to improve performance by effectively escaping local extrema. Furthermore, the learning curves of G2AC and G2PPO in Fig. 5 are drawn based on the average scores of all perturbed agents, not the best ones.\n\nTwo-way synchronous optimization: Gradient-free optimization methods are generally known to be slower than gradient-based optimization. The ES and GA algorithms have achieved comparable performances after training 1B frames, which is a much larger number than those of DQN, A3C and A2C with traditional deep neural networks. Separated in Figure 5 denotes the learning curves of the separated optimization of the GA and its neural model. This allows us to compare its efficiency with the two-way synchronous optimization. The graph clearly shows that learning with the two-way synchronous methods (G2AC, G2PPO) is much faster. The gap is larger in Atari, which has a relatively high proportion of the GA+elite phase. 
These results show that sampling efficiency can be improved by training neural networks while evaluating the fitness of individuals, rather than pausing the training.\n\nWall-clock time: As described earlier, G2AC does not require additional operations other than element-wise multiplication of a batch and the gate matrix and creating a new population at the end of each generation. Experiments on five Atari games have shown that G2AC takes 3,572 steps per second while A2C takes 3,668. G2PPO, which operates in parallel with multiple actors like PPO8, completes 1,110 steps per second while PPO8 and PPO complete 1,143 and 563 steps respectively. Our methods were slowed by only 3% or less compared to their direct baselines.\n\nHyper-parameters of the Genetic algorithm: We do not aim to find the highest-performance model by tuning the hyper-parameters of the genetic algorithm. However, we have experimented with some hyper-parameters to see what characteristics G2N shows with respect to them. G2AC is less sensitive to the hyper-parameters of the genetic algorithm, while G2PPO is more sensitive. This is as anticipated, considering that G2AC uses a softmax activation as its output, so the range of the perturbed output is limited. On the other hand, the outputs of G2PPO are boundless and therefore directly affected by the binary gates. 
In MuJoCo, as the mutation probability increases from 0.03 to 0.1 and then 0.3, performance decreases and becomes unstable. For the crossover probability, changing it from 0.8 to 0.4 makes a larger difference in Hopper and Swimmer, but the influence of crossover is not significant in the other environments. The detailed results of these experiments are included in the supplementary material.

5 Conclusions

We have proposed a newly defined network model, the Genetic-Gated Network (G2N). G2Ns can define multiple models without increasing the amount of computation, and gain the advantages of gradient-free optimization simply by gating a hidden layer with a chromosome evolved by a genetic algorithm. Incorporated with gradient-based optimization techniques, this gradient-free method is expected to find global optima effectively in multi-modal environments.
As applications of G2Ns, we have also proposed the Genetic-Gated Actor-Critic methods (G2AC, G2PPO), which can be applied to problems in the RL domain where multiple models are useful and local minima are an issue. Our experiments show that the performance of the base model is greatly improved by the proposed method. This is not merely an improvement of one RL algorithm; it shows that combining two completely different machine learning approaches can yield a better algorithm.
In future work, we intend to study whether G2Ns can be applied to other domains that involve multiple local minima or can benefit from using multiple models.
Also, we need to study how to overcome the initial learning slowdown caused by the additional exploration of perturbed policies.

Acknowledgement

This work was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) (2017M3C4A7077582).