{"title": "Distral: Robust multitask reinforcement learning", "book": "Advances in Neural Information Processing Systems", "page_first": 4496, "page_last": 4506, "abstract": "Most deep reinforcement learning algorithms are data inefficient in complex and rich environments, limiting their applicability to many scenarios. One direction for improving data efficiency is multitask learning with shared neural network parameters, where efficiency may be improved through transfer across related tasks. In practice, however, this is not usually observed, because gradients from different tasks can interfere negatively, making learning unstable and sometimes even less data efficient. Another issue is the different reward schemes between tasks, which can easily lead to one task dominating the learning of a shared model. We propose a new approach for joint training of multiple tasks, which we refer to as Distral (DIStill & TRAnsfer Learning). Instead of sharing parameters between the different workers, we propose to share a distilled policy that captures common behaviour across tasks. Each worker is trained to solve its own task while constrained to stay close to the shared policy, while the shared policy is trained by distillation to be the centroid of all task policies. Both aspects of the learning process are derived by optimizing a joint objective function. We show that our approach supports efficient transfer on complex 3D environments, outperforming several related methods. 
Moreover, the proposed learning process is more robust and more stable---attributes that are critical in deep reinforcement learning.", "full_text": "Distral: Robust Multitask Reinforcement Learning\n\nYee Whye Teh, Victor Bapst, Wojciech Marian Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, Razvan Pascanu\n\nDeepMind, London, UK\n\nAbstract\n\nMost deep reinforcement learning algorithms are data inefficient in complex and rich environments, limiting their applicability to many scenarios. One direction for improving data efficiency is multitask learning with shared neural network parameters, where efficiency may be improved through transfer across related tasks. In practice, however, this is not usually observed, because gradients from different tasks can interfere negatively, making learning unstable and sometimes even less data efficient. Another issue is the different reward schemes between tasks, which can easily lead to one task dominating the learning of a shared model. We propose a new approach for joint training of multiple tasks, which we refer to as Distral (distill & transfer learning). Instead of sharing parameters between the different workers, we propose to share a “distilled” policy that captures common behaviour across tasks. Each worker is trained to solve its own task while constrained to stay close to the shared policy, while the shared policy is trained by distillation to be the centroid of all task policies. Both aspects of the learning process are derived by optimizing a joint objective function. 
We show that our approach supports efficient transfer on complex 3D environments, outperforming several related methods. Moreover, the proposed learning process is more robust to hyperparameter settings and more stable---attributes that are critical in deep reinforcement learning.\n\n1 Introduction\n\nDeep Reinforcement Learning is an emerging subfield of Reinforcement Learning (RL) that relies on deep neural networks as function approximators that can scale RL algorithms to complex and rich environments. One key work in this direction was the introduction of DQN [21], which is able to play many games in the ATARI suite of games [1] at above human performance. However, the agent requires a fairly large amount of time and data to learn effective policies, and the learning process itself can be quite unstable, even with innovations introduced to improve wall clock time, data efficiency, and robustness by changing the learning algorithm [27, 33] or by improving the optimizer [20, 29]. A different approach was introduced by [12, 19, 14], whereby data efficiency is improved by training additional auxiliary tasks jointly with the RL task.\n\nWith the success of deep RL has come interest in increasingly complex tasks and a shift in focus towards scenarios in which a single agent must solve multiple related problems, either simultaneously or sequentially. Due to the large computational cost, making progress in this direction requires robust algorithms which do not rely on task-specific algorithmic design or extensive hyperparameter tuning. Intuitively, solutions to related tasks should facilitate learning since the tasks share common structure, and thus one would expect that individual tasks should require less data or achieve a higher asymptotic performance. 
Indeed this intuition has long been pursued in the multitask and transfer-learning literature [2, 31, 34, 5].\n\nSomewhat counter-intuitively, however, the above is often not the result encountered in practice, particularly in the RL domain [26, 23]. Instead, the multitask and transfer learning scenarios are frequently found to pose additional challenges to existing methods. Instead of making learning easier, it is often observed that training on multiple tasks can negatively affect performances on the individual tasks, and additional techniques have to be developed to counteract this [26, 23]. It is likely that gradients from other tasks behave as noise, interfering with learning, or, in another extreme, one of the tasks might dominate the others.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nIn this paper we develop an approach for multitask and transfer RL that allows effective sharing of behavioral structure across tasks, giving rise to several algorithmic instantiations. In addition to some instructive illustrations on a grid world domain, we provide a detailed analysis of the resulting algorithms via comparisons to A3C [20] baselines on a variety of tasks in a first-person, visually-rich, 3D environment. We find that the Distral algorithms learn faster and achieve better asymptotic performance, are significantly more robust to hyperparameter settings, and learn more stably than multitask A3C baselines.\n\n2 Distral: Distill and Transfer Learning\n\nWe propose a framework for simultaneous reinforcement learning of multiple tasks which we call Distral. Figure 1 provides a high-level illustration involving four tasks. The method is founded on the notion of a shared policy (shown in the centre) which distills (in the sense of Bucila and Hinton et al. [4, 11]) common behaviours or representations from task-specific policies [26, 23]. 
Crucially, the distilled policy is then used to guide task-specific policies via regularization using a Kullback-Leibler (KL) divergence. The effect is akin to a shaping reward which can, for instance, overcome random walk exploration bottlenecks. In this way, knowledge gained in one task is distilled into the shared policy, then transferred to other tasks.\n\nFigure 1: Illustration of the Distral framework. [The distilled policy π0 sits at the centre; each task policy π1, ..., π4 is distilled into π0 and regularised towards it.]\n\n2.1 Mathematical framework\n\nIn this section we describe the mathematical framework underlying Distral. A multitask RL setting is considered where there are n tasks, where for simplicity we assume an infinite horizon with discount factor γ.1 We will assume that the action A and state S spaces are the same across tasks; we use a ∈ A to denote actions, s ∈ S to denote states. The transition dynamics pi(s'|s, a) and reward functions Ri(a, s) are different for each task i. Let πi be task-specific stochastic policies. The dynamics and policies give rise to joint distributions over state and action trajectories starting from some initial state, which we will also denote by πi by an abuse of notation.\n\nOur mechanism for linking the policy learning across tasks is via optimising an objective which consists of expected returns and policy regularizations. We designate π0 to be the distilled policy which we believe will capture agent behaviour that is common across the tasks. We regularize each task policy πi towards the distilled policy using γ-discounted KL divergences E_πi[Σ_{t≥0} γ^t log(πi(at|st)/π0(at|st))]. In addition, we also use a γ-discounted entropy regularization to further encourage exploration. 
The resulting objective to be maximized is:\n\nJ(π0, {πi}_{i=1}^n) = Σ_i E_πi[ Σ_{t≥0} γ^t Ri(at, st) − c_KL γ^t log(πi(at|st)/π0(at|st)) − c_Ent γ^t log πi(at|st) ]\n= Σ_i E_πi[ Σ_{t≥0} γ^t Ri(at, st) + (γ^t α/β) log π0(at|st) − (γ^t/β) log πi(at|st) ]   (1)\n\nwhere c_KL, c_Ent ≥ 0 are scalar factors which determine the strengths of the KL and entropy regularizations, and α = c_KL/(c_KL + c_Ent) and β = 1/(c_KL + c_Ent). The log π0(at|st) term can be thought of as a reward shaping term which encourages actions which have high probability under the distilled policy, while the entropy term −log πi(at|st) encourages exploration. In the above we used the same regularization costs c_KL, c_Ent for all tasks. It is easy to generalize to using task-specific costs; this can be important if tasks differ substantially in their reward scales and amounts of exploration needed, although it does introduce additional hyperparameters that are expensive to optimize.\n\n1The method can be easily generalized to other scenarios like undiscounted finite horizon.\n\n2.2 Soft Q Learning and Distillation\n\nA range of optimization techniques in the literature can be applied to maximize the above objective, which we will expand on below. To build up intuition for how the method operates, we will start in the simple case of a tabular representation and an alternating maximization procedure which optimizes over πi given π0 and over π0 given πi. With π0 fixed, (1) decomposes into separate maximization problems for each task, each being an entropy regularized expected return with redefined (regularized) reward R'i(a, s) := Ri(a, s) + (α/β) log π0(a|s). 
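To make the objective concrete, the following minimal Python sketch evaluates a single-trajectory Monte Carlo sample of the per-task summand in (1); the function name and the trajectory format (lists of rewards and action log-probabilities) are illustrative, not the implementation used in our experiments.

```python
import math

def distral_return(rewards, logp_task, logp_distilled, gamma, c_kl, c_ent):
    """Sample of one task's term in Eq. (1): discounted reward minus the
    KL penalty c_kl * log(pi_i/pi_0) and entropy penalty c_ent * log pi_i,
    all evaluated on the actions actually taken along the trajectory."""
    total = 0.0
    for t, (r, lpi, lp0) in enumerate(zip(rewards, logp_task, logp_distilled)):
        kl_term = c_kl * (lpi - lp0)   # log pi_i(a_t|s_t) - log pi_0(a_t|s_t)
        ent_term = c_ent * lpi         # negative entropy contribution
        total += (gamma ** t) * (r - kl_term - ent_term)
    return total

# The reparameterization used in the paper: alpha = c_KL/(c_KL + c_Ent),
# beta = 1/(c_KL + c_Ent).
c_kl, c_ent = 0.5, 0.5
alpha = c_kl / (c_kl + c_ent)
beta = 1.0 / (c_kl + c_ent)
J = distral_return([1.0, 0.0], [-0.1, -0.2], [-0.3, -0.4],
                   gamma=0.99, c_kl=c_kl, c_ent=c_ent)
```

With c_KL = c_Ent as above, alpha = 0.5, recovering equal weighting of transfer and exploration terms.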
It can be optimized using soft Q learning [10], aka G learning [7], which are based on deriving the following “softened” Bellman updates for the state and action values (see also [25, 28, 22]):\n\nVi(st) = (1/β) log Σ_{at} π0^α(at|st) exp[β Qi(at, st)]   (2)\n\nQi(at, st) = Ri(at, st) + γ Σ_{st+1} pi(st+1|st, at) Vi(st+1)   (3)\n\nThe Bellman updates are softened in the sense that the usual max operator over actions for the state values Vi is replaced by a soft-max at inverse temperature β, which hardens into a max operator as β → ∞. The optimal policy πi is then a Boltzmann policy at inverse temperature β:\n\nπi(at|st) = π0^α(at|st) e^{β Qi(at,st) − β Vi(st)} = π0^α(at|st) e^{β Ai(at,st)}   (4)\n\nwhere Ai(a, s) = Qi(a, s) − Vi(s) is a softened advantage function. Note that the softened state values Vi(s) act as the log normalizers in the above. The distilled policy π0 can be interpreted as a policy prior, a perspective well-known in the literature on RL as probabilistic inference [32, 13, 25, 7]. However, unlike in past works, it is raised to a power of α ≤ 1. This softens the effect of the prior π0 on πi, and is the result of the additional entropy regularization beyond the KL divergence.\n\nAlso unlike past works, we will learn π0 instead of hand-picking it (typically as a uniform distribution over actions). In particular, notice that the only terms in (1) depending on π0 are:\n\n(α/β) Σ_i E_πi[ Σ_{t≥0} γ^t log π0(at|st) ]   (5)\n\nwhich is simply a log likelihood for fitting a model π0 to a mixture of γ-discounted state-action distributions, one for each task i under policy πi. A maximum likelihood (ML) estimator can be derived from state-action visitation frequencies under roll-outs in each task, with the optimal ML solution given by the mixture of state-conditional action distributions. 
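The softened updates (2)-(4) can be sketched directly in the tabular case. The toy interface below (nested lists for R, P, π0) is a hypothetical stand-in for a real environment; fixed-point iteration replaces an exact solve, and the Boltzmann policy normalizes exactly because Vi is its log normalizer.

```python
import math

def soft_backup(R, P, pi0, alpha, beta, gamma, n_iters=200):
    """Tabular softened Bellman updates, Eqs. (2)-(3):
      V(s) = (1/beta) * log sum_a pi0(a|s)^alpha * exp(beta * Q(a,s))
      Q(a,s) = R(a,s) + gamma * sum_s' P(s'|s,a) * V(s')
    followed by the Boltzmann policy of Eq. (4). R[s][a], P[s][a][s'],
    pi0[s][a] are toy nested lists (illustrative interface)."""
    nS, nA = len(R), len(R[0])
    V = [0.0] * nS
    Q = [[0.0] * nA for _ in range(nS)]
    for _ in range(n_iters):
        Q = [[R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(nS))
              for a in range(nA)] for s in range(nS)]
        V = [(1.0 / beta) * math.log(sum(pi0[s][a] ** alpha
                                         * math.exp(beta * Q[s][a])
                                         for a in range(nA)))
             for s in range(nS)]
    # Eq. (4): pi_i(a|s) = pi0(a|s)^alpha * exp(beta * (Q(a,s) - V(s)))
    pi = [[pi0[s][a] ** alpha * math.exp(beta * (Q[s][a] - V[s]))
           for a in range(nA)] for s in range(nS)]
    return V, Q, pi

# Two-state, two-action toy MDP with a uniform prior pi0.
R = [[1.0, 0.0], [0.0, 0.5]]
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
pi0 = [[0.5, 0.5], [0.5, 0.5]]
V, Q, pi = soft_backup(R, P, pi0, alpha=0.5, beta=5.0, gamma=0.9)
```

Raising beta hardens the soft-max towards the usual max operator, as described above.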
Alternatively, in the non-tabular case, stochastic gradient ascent can be employed, which leads precisely to an update which distills the task policies πi into π0 [4, 11, 26, 23]. Note however that in our case the distillation step is derived naturally from a KL regularized objective on the policies. Another difference from [26, 23] and from prior works on the use of distillation in deep learning [4, 11] is that the distilled policy is “fed back in” to improve the task policies when they are next optimized, and serves as a conduit in which common and transferable knowledge is shared across the task policies.\n\nIt is worthwhile here to take pause and ponder the effect of the extra entropy regularization. First suppose that there is no extra entropy regularisation, i.e. α = 1, and consider the simple scenario of only n = 1 task. Then (5) is maximized when the distilled policy π0 and the task policy π1 are equal, and the KL regularization term is 0. Thus the objective reduces to an unregularized expected return, and so the task policy π1 converges to a greedy policy which locally maximizes expected returns. Another way to view this line of reasoning is that the alternating maximization scheme is equivalent to trust-region methods like natural gradient or TRPO [24, 29], which use a KL ball centred at the previous policy, and which are understood to converge to greedy policies.\n\nIf α < 1, there is an additional entropy term in (1). So even with π0 = π1 and KL(π1‖π0) = 0, the objective (1) will no longer be maximized by greedy policies. Instead (1) reduces to an entropy regularized expected return with entropy regularization factor β' = β/(1 − α) = 1/c_Ent, so that the optimal policy is of the Boltzmann form with inverse temperature β' [25, 7, 28, 22]. 
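The distillation half of the optimization, Eq. (5), has a closed form in the tabular case: the optimal π0(a|s) is simply the discount-weighted state-conditional action frequency pooled over all tasks' rollouts. A minimal sketch, with a hypothetical rollout format of (state, action, weight) triples where weight = γ^t:

```python
from collections import defaultdict

def distill_tabular(rollouts):
    """Tabular ML fit of the distilled policy pi_0 (Eq. 5): normalize
    pooled, discount-weighted state-action visitation counts. The input
    format (state, action, weight) is an illustrative assumption."""
    counts = defaultdict(lambda: defaultdict(float))
    for s, a, w in rollouts:
        counts[s][a] += w
    return {s: {a: c / sum(acts.values()) for a, c in acts.items()}
            for s, acts in counts.items()}

# Visits pooled from two tasks' rollouts at state "s0".
pi0 = distill_tabular([("s0", "left", 1.0),
                       ("s0", "left", 0.9),
                       ("s0", "right", 0.9)])
```

In the non-tabular case the same fit is done by stochastic gradient ascent on the log likelihood, i.e. a standard distillation (cross-entropy) update towards the task policies.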
In conclusion, by including the extra entropy term, we can guarantee that the task policy will not turn greedy, and we can control the amount of exploration by adjusting c_Ent appropriately.\n\nThis additional control over the amount of exploration is essential when there is more than one task. To see this, imagine a scenario where one of the tasks is easier and is solved first, while other tasks are harder with much sparser rewards. Without the entropy term, and before rewards in other tasks are encountered, both the distilled policy and all the task policies can converge to the one that solves the easy task. Further, because this policy is greedy, it can insufficiently explore the other tasks to even encounter rewards, leading to sub-optimal behaviour. For single-task RL, the use of entropy regularization was recently popularized by Mnih et al. [20] to counter premature convergence to greedy policies, which can be particularly severe when doing policy gradient learning. This carries over to our multitask scenario as well, and is the reason for the additional entropy regularization.\n\n2.3 Policy Gradient and a Better Parameterization\n\nThe above method alternates between maximization of the distilled policy π0 and the task policies πi, and is reminiscent of the EM algorithm [6] for learning latent variable models, with π0 playing the role of parameters, while πi plays the role of the posterior distributions for the latent variables. Going beyond the tabular case, when both π0 and πi are parameterized by, say, deep networks, such an alternating maximization procedure can be slower than simply optimizing (1) with respect to task and distilled policies jointly by stochastic gradient ascent. 
In this case the gradient update for πi is simply given by policy gradient with an entropic regularization [20, 28], and can be carried out within a framework like advantage actor-critic [20].\n\nA simple parameterization of policies would be to use a separate network for each task policy πi, and another one for the distilled policy π0. An alternative parameterization, which we argue can result in faster transfer, can be obtained by considering the form of the optimal Boltzmann policy (4). Specifically, consider parameterising the distilled policy using a network with parameters θ0,\n\nπ̂0(at|st) = exp(h_θ0(at|st)) / Σ_{a'} exp(h_θ0(a'|st))   (6)\n\nand estimating the soft advantages2 using another network with parameters θi:\n\nÂi(at|st) = f_θi(at|st) − (1/β) log Σ_a π̂0^α(a|st) exp(β f_θi(a|st))   (7)\n\nWe used hat notation to denote parameterized approximators of the corresponding quantities. The policy for task i then becomes parameterized as,\n\nπ̂i(at|st) = π̂0^α(at|st) exp(β Âi(at|st)) = exp(α h_θ0(at|st) + β f_θi(at|st)) / Σ_{a'} exp(α h_θ0(a'|st) + β f_θi(a'|st))   (8)\n\nThis can be seen as a two-column architecture for the policy, with one column being the distilled policy, and the other being the adjustment required to specialize to task i.\n\nGiven the parameterization above, we can now derive the policy gradients. The gradient wrt the task-specific parameters θi is given by the standard policy gradient theorem [30],\n\n∇_θi J = E_π̂i[ (Σ_{t≥1} ∇_θi log π̂i(at|st)) (Σ_{u≥1} γ^u R^reg_i(au, su)) ]\n= E_π̂i[ Σ_{t≥1} ∇_θi log π̂i(at|st) (Σ_{u≥t} γ^u R^reg_i(au, su)) ]   (9)\n\nwhere R^reg_i(a, s) = Ri(a, s) + (α/β) log π̂0(a|s) − (1/β) log π̂i(a|s) is the regularized reward. 
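The two-column parameterization of Eq. (8) amounts to summing scaled logits from the two columns and applying a softmax. A minimal sketch (the function name and plain-list logits are illustrative, not our network code):

```python
import math

def task_policy(h0_logits, fi_logits, alpha, beta):
    """Two-column task policy, Eq. (8): softmax over
    alpha * h_theta0 (distilled column) + beta * f_thetai (task column)."""
    z = [alpha * h + beta * f for h, f in zip(h0_logits, fi_logits)]
    m = max(z)
    e = [math.exp(v - m) for v in z]  # numerically stable softmax
    s = sum(e)
    return [v / s for v in e]

# With these logits the two columns cancel, giving a uniform policy;
# setting alpha = 0 would ignore the distilled column entirely.
p = task_policy([2.0, 0.0], [0.0, 1.0], alpha=0.5, beta=1.0)
```

Because the distilled logits h_θ0 enter every task policy directly, behaviour distilled into π̂0 is immediately reflected in all tasks.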
Note that the partial derivative of the entropy in the integrand has expectation E_π̂i[∇_θi log π̂i(at|st)] = 0 because of the log-derivative trick. If a value baseline is estimated, it can be subtracted from the regularized returns as a control variate.\n\n2In practice, we do not actually use these as advantage estimates. Instead we use (8) to parameterize a policy which is optimized by policy gradients.\n\nFigure 2: Depiction of the different algorithms and baselines. On the left are two of the Distral algorithms and on the right are the three A3C baselines. Entropy is drawn in brackets as it is optional and only used for KL+ent 2col and KL+ent 1col.\n\nThe gradient wrt θ0 is more interesting:\n\n∇_θ0 J = Σ_i E_π̂i[ Σ_{t≥1} ∇_θ0 log π̂i(at|st) (Σ_{u≥t} γ^u R^reg_i(au, su)) ]\n+ (α/β) Σ_i E_π̂i[ Σ_{t≥1} γ^t Σ_{a't} (π̂i(a't|st) − π̂0(a't|st)) ∇_θ0 h_θ0(a't|st) ]   (10)\n\nNote that the first term is the same as for the policy gradient of θi. The second term tries to match the probabilities under the task policy π̂i and under the distilled policy π̂0. The second term would not be present if we simply parameterized πi using the same architecture π̂i, but did not use a KL regularization for the policy. 
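The matching behaviour of the second term in (10) can be seen in isolation. The sketch below evaluates that term at a single state, up to the α/β factor, with a toy scalar stand-in for the logit gradients (purely illustrative):

```python
def distilled_matching_grad(pi_i, pi_0, grad_h0):
    """Second term of Eq. (10) at one state, up to the alpha/beta factor:
    sum_a (pi_i(a|s) - pi_0(a|s)) * grad_theta0 h_theta0(a|s).
    grad_h0[a] is a toy scalar gradient of the distilled logit for action a.
    The term vanishes exactly when pi_0 already matches pi_i."""
    return sum((pi - p0) * g for pi, p0, g in zip(pi_i, pi_0, grad_h0))

g_matched = distilled_matching_grad([0.7, 0.3], [0.7, 0.3], [1.0, -1.0])
g_mismatch = distilled_matching_grad([0.8, 0.2], [0.5, 0.5], [1.0, -1.0])
```

When π̂0 diverges from a task policy, this term pushes the distilled logits towards that task's action distribution, which is what drives π̂0 to the centroid of the task policies.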
The presence of the KL regularization gets the distilled policy to learn to be the centroid of all task policies, in the sense that the second term would be zero if π̂0(a't|st) = (1/n) Σ_i π̂i(a't|st), and helps to transfer information quickly across tasks and to new tasks.\n\n2.4 Other Related Works\n\nThe centroid and star-shaped structure of Distral is reminiscent of ADMM [3], elastic-averaging SGD [35] and hierarchical Bayes [9]. Though a crucial difference is that while ADMM, EASGD and hierarchical Bayes operate in the space of parameters, in Distral the distilled policy learns to be the centroid in the space of policies. We argue that this is semantically more meaningful, and may contribute to the observed robustness of Distral by stabilizing learning. In our experiments we indeed find that absence of the KL regularization significantly affects the stability of the algorithm.\n\nAnother related line of work is guided policy search [17, 18, 15, 16]. These focus on single tasks, and use trajectory optimization (corresponding to task policies here) to guide the learning of a policy (corresponding to the distilled policy π0 here). This contrasts with Distral, which is a multitask setting, where a learnt π0 is used to facilitate transfer by sharing common task-agnostic behaviours, and the main outcome of the approach is instead the task policies.\n\nOur approach is also reminiscent of recent work on option learning [8], but with a few important differences. We focus on using deep neural networks as flexible function approximators, and applied our method to rich 3D visual environments, while Fox et al. [8] considered only the tabular case. We argue for the importance of an additional entropy regularization besides the KL regularization. This leads to an interesting twist in the mathematical framework, allowing us to separately control the amounts of transfer and of exploration. On the other hand, Fox et al. [8] focused on the interesting problem of learning multiple options (distilled policies here). 
Their approach treats the assignment of tasks to options as a clustering problem, which is not easily extended beyond the tabular case.\n\n3 Algorithms\n\nThe framework we just described allows for a number of possible algorithmic instantiations, arising as combinations of objectives, algorithms and architectures, which we describe below and summarize in Table 1 and Figure 2.\n\nTable 1: The seven different algorithms evaluated in our experiments. Each column describes a different architecture, with the column headings indicating the logits for the task policies. The rows define the relative amount of KL vs entropy regularization loss, with the first row comprising the A3C baselines (no KL loss).\n\n              | h_θ0(a|s)       | f_θi(a|s)     | α h_θ0(a|s) + f_θi(a|s)\n  α = 0       | A3C multitask   | A3C           | A3C 2col\n  α = 1       |                 | KL 1col       | KL 2col\n  0 < α < 1   |                 | KL+ent 1col   | KL+ent 2col\n\nKL divergence vs entropy regularization: With α = 0, we get a purely entropy-regularized objective which does not couple and transfer across tasks [20, 28]. With α = 1, we get a purely KL regularized objective, which does couple and transfer across tasks, but might prematurely stop exploration if the distilled and task policies become similar and greedy. With 0 < α < 1 we get both terms. Alternating vs joint optimization: We have the option of jointly optimizing both the distilled policy and the task policies, or optimizing one while keeping the other fixed. Alternating optimization leads to algorithms that resemble policy distillation/actor-mimic [23, 26], but are iterative in nature with the distilled policy feeding back into task policy optimization. Also, soft Q learning can be applied to each task, instead of policy gradients. While alternating optimization can be slower, evidence from policy distillation/actor-mimic indicates it might learn more stably, particularly for tasks which differ significantly. 
Separate vs two-column parameterization: Finally, the task policy can be parameterized to use the distilled policy (8) or not. If using the distilled policy, behaviour distilled into the distilled policy is “immediately available” to the task policies, so transfer can be faster. However, if the process of transfer occurs too quickly, it might interfere with effective exploration of individual tasks.\n\nFrom this spectrum of possibilities we consider four concrete instances which differ in the underlying network architecture and distillation loss, identified in Table 1. In addition, we compare against three A3C baselines. In initial experiments we explored two variants of A3C: the original method [20] and the variant of Schulman et al. [28] which uses entropy regularized returns. We did not find significant differences between the two variants in our setting, and chose to report only the original A3C results for clarity in Section 4. Further algorithmic details are provided in the Appendix.\n\n4 Experiments\n\nWe demonstrate the various algorithms derived from our framework, firstly using alternating optimization with soft Q learning and policy distillation on a set of simple grid world tasks. Then all seven algorithms will be evaluated on three sets of challenging RL tasks in partially observable 3D environments.\n\n4.1 Two room grid world\n\nTo give better intuition for the role of the distilled behaviour policy, we considered a set of tasks in a grid world domain with two rooms connected by a corridor (see Figure 3) [8]. Each task is distinguished by a different randomly chosen goal location, and each MDP state consists of the map location, the previous action and the previous reward. A Distral agent is trained using only the KL regularization and an optimization algorithm which alternates between soft Q learning and policy distillation. 
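The control flow of this alternating scheme can be sketched as follows; the two sub-procedures are passed in as placeholders (a deliberately loose, illustrative skeleton, not our training code):

```python
def distral_alternating(tasks, pi0, n_rounds, soft_q_step, distill_step):
    """Alternating optimization skeleton: each round, (i) every task runs
    soft Q learning against the current distilled policy pi_0 and returns
    its updated policy plus rollouts, then (ii) pi_0 is refit by
    distillation from the pooled rollouts. soft_q_step and distill_step
    are hypothetical callables standing in for the two sub-procedures."""
    task_policies = [None] * len(tasks)
    for _ in range(n_rounds):
        rollouts = []
        for i, task in enumerate(tasks):
            task_policies[i], ro = soft_q_step(task, pi0)
            rollouts.extend(ro)
        pi0 = distill_step(rollouts)
    return pi0, task_policies

# Smoke demo with stubbed sub-procedures (purely illustrative):
pi0, pis = distral_alternating(
    tasks=["maze_a", "maze_b"], pi0={}, n_rounds=2,
    soft_q_step=lambda task, p0: (f"policy_{task}", [(task, "act", 1.0)]),
    distill_step=lambda rollouts: {"num_rollouts": len(rollouts)})
```

In the grid-world experiment below, each soft Q learning step uses short rollouts, and the distillation step is the tabular ML fit described in Section 2.2.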
Each soft Q learning iteration learns using a rollout of length 10.\n\nTo determine the benefit of the distilled policy, we compared the Distral agent to one which soft Q learns a separate policy for each task. The learning curves are shown in Figure 3 (left). We see that the Distral agent is able to learn significantly faster than single-task agents. Figure 3 (right) visualizes the distilled policy (probability of next action given position and previous action), demonstrating that the agent has learnt a policy which guides the agent to move consistently in the same direction through the corridor in order to reach the other room. This allows the agent to reach the other room faster and helps exploration, if the agent is shown new test tasks. In Fox et al. [8] two separate options are learnt, while here we learn a single distilled policy which conditions on more past information (previous action and reward).\n\nFigure 3: Left: Learning curves on two room grid world. The Distral agent (blue) learns faster, converges towards better policies, and demonstrates more stable learning overall. Center: Four different examples of GridWorld tasks. Green is goal position which is uniformly sampled for each task. Starting position is uniformly sampled at the beginning of each episode. Right: depiction of learned distilled policy π0 in the corridor, conditioned on previous action being left/right and no previous reward. Sizes of arrows depict probabilities of actions. Note that up/down actions have negligible probabilities. The model learns to preserve direction of travel in the corridor.\n\n4.2 Complex Tasks\n\nTo assess Distral under more challenging conditions, we use a complex first-person partially observed 3D environment with a variety of visually-rich RL tasks. 
All agents were implemented with a distributed Python/TensorFlow code base, using 32 workers for each task and learnt using asynchronous RMSProp. The network columns contain convolutional layers and an LSTM and are uniform across experiments and algorithms. We tried three values for the entropy costs β and three learning rates ε. Four runs for each hyperparameter setting were used. All other hyperparameters were fixed to the single-task A3C defaults and, for the KL+ent 1col and KL+ent 2col algorithms, α was fixed at 0.5.\n\nMazes In the first experiment, each of n = 8 tasks is a different maze containing randomly placed rewards and a goal object. Figure 4.A1 shows the learning curves for all seven algorithms. Each curve is produced by averaging over all 4 runs and 8 tasks, and selecting the best settings for β and ε (as measured by the area under the learning curves). The Distral algorithms learn faster and achieve better final performance than all three A3C baselines. The two-column algorithms learn faster than the corresponding single column ones. The Distral algorithms without entropy learn faster but achieve lower final scores than those with entropy, which we believe is due to insufficient exploration towards the end of learning.\n\nWe found that both multitask A3C and two-column A3C can learn well on some runs, but are generally unstable—some runs did not learn well, while others may learn initially then suffer degradation later. We believe this is due to negative interference across tasks, which does not happen for Distral algorithms. The stability of Distral algorithms also increases their robustness to hyperparameter selection. Figure 4.A2 shows the final achieved average returns for all 36 runs for each algorithm, sorted in decreasing order. 
We see that Distral algorithms have a significantly higher proportion of runs achieving good returns, with KL+ent_2col being the most robust.\n\nDistral algorithms, along with multitask A3C, use a distilled or common policy which can be applied on all tasks. Panels B1 and B2 in Figure 4 summarize the performances of the distilled policies. Algorithms that use two columns (KL_2col and KL+ent_2col) obtain the best performance, because policy gradients are also directly propagated through the distilled policy in those cases. Moreover, panel B2 reveals that Distral algorithms exhibit greater stability as compared to traditional multitask A3C. We also observe that KL algorithms have better-performing distilled policies than KL+ent ones. We believe this is because the additional entropy regularisation allows task policies to diverge more substantially from the distilled policy. This suggests that annealing the entropy term or increasing the KL term throughout training could improve the distilled policy performance, if that is of interest.\n\nNavigation We experimented with n = 4 navigation and memory tasks. In contrast to the previous experiment, these tasks use random maps which are procedurally generated on every episode. The first task features reward objects which are randomly placed in a maze, and the second task requires the agent to return these objects to its start position. The third task has a single goal object which must be repeatedly found from different start positions, and on the fourth task doors are randomly opened and closed to force novel path-finding. Hence, these tasks are more involved than the previous navigation tasks.\n\nFigure 4: Panels A1, C1, D1 show task-specific policy performance (averaged across all the tasks) for the maze, navigation and laser-tag tasks, respectively. The x-axes are total numbers of training environment steps per task. Panel B1 shows the mean scores obtained with the distilled policies (A3C has no distilled policy, so it is represented by the performance of an untrained network). For each algorithm, results for the best set of hyperparameters (based on the area under curve) are reported. The bold line is the average over 4 runs, and the colored area the average standard deviation over the tasks. Panels A2, B2, C2, D2 show the corresponding final performances for the 36 runs of each algorithm, ordered from best to worst (9 hyperparameter settings and 4 runs).\n\nThe panels C1 and C2 of Figure 4 summarize the results. We observe again that Distral algorithms yield better final results while having greater stability (Figure 4.C2). The top-performing algorithms are, again, the 2 column Distral algorithms (KL_2col and KL+ent_2col).\n\nLaser-tag In the final set of experiments, we use n = 8 laser-tag levels. These tasks require the agent to learn to tag bots controlled by a built-in AI, and differ substantially: fixed versus procedurally generated maps, fixed versus procedural bots, and complexity of agent behaviour (e.g. learning to jump in some tasks). Corresponding to this greater diversity, we observe (see panels D1 and D2 of Figure 4) that the best baseline is the A3C algorithm that is trained independently on each task. Among the Distral algorithms, the single column variants perform better, especially initially, as they are able to learn task-specific features separately. We observe again the early plateauing phenomenon for algorithms that do not possess an additional entropy term. While not significantly better than the A3C baseline on these tasks, the Distral algorithms clearly outperform the multitask A3C.\n\nDiscussion Considering the 3 different sets of complex 3D experiments, we argue that the Distral algorithms are promising solutions to the multitask deep RL problem. 
Distral can perform significantly better than A3C baselines when tasks have sufficient commonalities for transfer (maze and navigation), while still being competitive with A3C when less transfer is possible. In terms of specific algorithmic proposals, the additional entropy regularization is important in encouraging continued exploration, while two-column architectures generally allow faster transfer (but can hurt performance when there is little transfer, due to task interference). The computational cost of Distral algorithms is at most twice that of the corresponding A3C algorithms, as each agent needs to process two network columns instead of one. In practice, however, the runtimes are only slightly longer than for A3C, because the cost of simulating the environments is significant and is the same whether learning is single-task or multitask.

5 Conclusion

We have proposed Distral, a general framework for distilling and transferring common behaviours in multitask reinforcement learning. In experiments we showed that the resulting algorithms learn faster, produce better final performance, and are more stable and robust to hyperparameter settings. We have found that Distral significantly outperforms the standard approach of sharing neural network parameters for multitask or transfer reinforcement learning.

Two ideas in Distral are worth re-emphasizing here. First, distillation arises naturally as one half of an optimization procedure when KL divergences are used to regularize the output of task models towards a distilled model; the other half corresponds to using the distilled model as a regularizer for training the task models. Second, parameters in deep networks do not typically have any semantic meaning by themselves, so instead of regularizing networks in parameter space, it is worthwhile to consider regularizing networks in a more semantically meaningful space, e.g.
of policies.

We would like to end with a discussion of the various difficulties faced by multitask RL methods. The first is that of positive transfer: when there are commonalities across tasks, how does the method exploit them to achieve better learning speed and better performance on new tasks in the same family? This is the core aim of Distral, where the commonalities are captured as shared common behaviours. The second is that of task interference, where the differences among tasks adversely affect agent performance by interfering with exploration and with the optimization of network parameters. Avoiding such interference is the core aim of the policy distillation and actor-mimic works [26, 23]. As in those works, Distral also learns a distilled policy, but this policy is further used to regularise the task policies in order to facilitate transfer. This means that Distral algorithms can still be affected by task interference. It would be interesting to explore ways to allow Distral (or other methods) to automatically balance increasing task transfer against reducing task interference.

Other possible directions of future research include: combining Distral with techniques that use auxiliary losses [12, 19, 14]; exploring the use of multiple distilled policies, or of latent variables in the distilled policy, to allow for more diversity of behaviours; exploring continual-learning settings where tasks are encountered sequentially; and exploring ways to adaptively adjust the KL and entropy costs to better control the amounts of transfer and exploration. Finally, theoretical analyses of Distral and other KL-regularization frameworks for deep RL would help deepen our understanding of these recent methods.

References

[1] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, June 2013.

[2] Yoshua Bengio.
Deep learning of representations for unsupervised and transfer learning. In JMLR: Workshop on Unsupervised and Transfer Learning, 2012.

[3] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1), January 2011.

[4] Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proc. of the Int'l Conference on Knowledge Discovery and Data Mining (KDD), 2006.

[5] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, July 1997.

[6] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1–38, 1977.

[7] R. Fox, A. Pakman, and N. Tishby. Taming the noise in reinforcement learning via soft updates. In Uncertainty in Artificial Intelligence (UAI), 2016.

[8] Roy Fox, Michal Moshkovitz, and Naftali Tishby. Principled option learning in Markov decision processes. In European Workshop on Reinforcement Learning (EWRL), 2016.

[9] Andrew Gelman, John B Carlin, Hal S Stern, and Donald B Rubin. Bayesian Data Analysis, volume 2. Chapman & Hall/CRC, Boca Raton, FL, USA, 2014.

[10] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.

[11] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. NIPS Deep Learning Workshop, 2014.

[12] Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. Int'l Conference on Learning Representations (ICLR), 2016.

[13] Hilbert J Kappen, Vicenç Gómez, and Manfred Opper.
Optimal control as a graphical model inference problem. Machine Learning, 87(2):159–182, 2012.

[14] Guillaume Lample and Devendra Singh Chaplot. Playing FPS games with deep reinforcement learning. Association for the Advancement of Artificial Intelligence (AAAI), 2017.

[15] Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems (NIPS), pages 1071–1079, 2014.

[16] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.

[17] Sergey Levine and Vladlen Koltun. Variational policy search via trajectory optimization. In Advances in Neural Information Processing Systems (NIPS), pages 207–215, 2013.

[18] Sergey Levine and Vladlen Koltun. Learning complex neural network policies with trajectory optimization. In Int'l Conference on Machine Learning (ICML), pages 829–837, 2014.

[19] Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J. Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, Dharshan Kumaran, and Raia Hadsell. Learning to navigate in complex environments. Int'l Conference on Learning Representations (ICLR), 2016.

[20] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Int'l Conference on Machine Learning (ICML), 2016.

[21] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis.
Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015.

[22] Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap between value and policy based reinforcement learning. arXiv:1702.08892, 2017.

[23] Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Actor-mimic: Deep multitask and transfer reinforcement learning. In Int'l Conference on Learning Representations (ICLR), 2016.

[24] Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. Int'l Conference on Learning Representations (ICLR), 2014.

[25] Konrad Rawlik, Marc Toussaint, and Sethu Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference. In Robotics: Science and Systems (RSS), 2012.

[26] Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. In Int'l Conference on Learning Representations (ICLR), 2016.

[27] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv:1511.05952, 2015.

[28] J. Schulman, P. Abbeel, and X. Chen. Equivalence between policy gradients and soft Q-learning. arXiv:1704.06440, 2017.

[29] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Int'l Conference on Machine Learning (ICML), 2015.

[30] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS), volume 99, pages 1057–1063, 1999.

[31] Matthew E. Taylor and Peter Stone. An introduction to inter-task transfer for reinforcement learning.
AI Magazine, 32(1):15–34, 2011.

[32] Marc Toussaint, Stefan Harmeling, and Amos Storkey. Probabilistic inference for solving (PO)MDPs. Technical Report EDI-INF-RR-0934, University of Edinburgh, School of Informatics, 2006.

[33] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. Association for the Advancement of Artificial Intelligence (AAAI), 2016.

[34] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems (NIPS), 2014.

[35] Sixin Zhang, Anna Choromanska, and Yann LeCun. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems (NIPS), 2015.