{"title": "Successor Features for Transfer in Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 4055, "page_last": 4065, "abstract": "Transfer in reinforcement learning refers to the notion that generalization should occur not only within a task but also across tasks. We propose a transfer framework for the scenario where the reward function changes between tasks but the environment's dynamics remain the same. Our approach rests on two key ideas: \"successor features\", a value function representation that decouples the dynamics of the environment from the rewards, and \"generalized policy improvement\", a generalization of dynamic programming's policy improvement operation that considers a set of policies rather than a single one. Put together, the two ideas lead to an approach that integrates seamlessly within the reinforcement learning framework and allows the free exchange of information across tasks. The proposed method also provides performance guarantees for the transferred policy even before any learning has taken place. We derive two theorems that set our approach in firm theoretical ground and present experiments that show that it successfully promotes transfer in practice, significantly outperforming alternative methods in a sequence of navigation tasks and in the control of a simulated robotic arm.", "full_text": "Successor Features for\n\nTransfer in Reinforcement Learning\n\nAndr\u00e9 Barreto, Will Dabney, R\u00e9mi Munos, Jonathan J. Hunt,\n\nTom Schaul, Hado van Hasselt, David Silver\n\n{andrebarreto,wdabney,munos,jjhunt,schaul,hado,davidsilver}@google.com\n\nDeepMind\n\nAbstract\n\nTransfer in reinforcement learning refers to the notion that generalization should\noccur not only within a task but also across tasks. We propose a transfer frame-\nwork for the scenario where the reward function changes between tasks but the\nenvironment\u2019s dynamics remain the same. 
Our approach rests on two key ideas:\nsuccessor features, a value function representation that decouples the dynamics of\nthe environment from the rewards, and generalized policy improvement, a general-\nization of dynamic programming\u2019s policy improvement operation that considers\na set of policies rather than a single one. Put together, the two ideas lead to an\napproach that integrates seamlessly within the reinforcement learning framework\nand allows the free exchange of information across tasks. The proposed method\nalso provides performance guarantees for the transferred policy even before any\nlearning has taken place. We derive two theorems that set our approach in \ufb01rm\ntheoretical ground and present experiments that show that it successfully promotes\ntransfer in practice, signi\ufb01cantly outperforming alternative methods in a sequence\nof navigation tasks and in the control of a simulated robotic arm.\n\n1\n\nIntroduction\n\nReinforcement learning (RL) provides a framework for the development of situated agents that learn\nhow to behave while interacting with the environment [21]. The basic RL loop is de\ufb01ned in an abstract\nway so as to capture only the essential aspects of this interaction: an agent receives observations\nand selects actions to maximize a reward signal. This setup is generic enough to describe tasks of\ndifferent levels of complexity that may unroll at distinct time scales. For example, in the task of\ndriving a car, an action can be to turn the wheel, make a right turn, or drive to a given location.\nClearly, from the point of view of the designer, it is desirable to describe a task at the highest level of\nabstraction possible. However, by doing so one may overlook behavioral patterns and inadvertently\nmake the task more dif\ufb01cult than it really is. The task of driving to a location clearly encompasses the\nsubtask of making a right turn, which in turn encompasses the action of turning the wheel. 
In learning how to drive an agent should be able to identify and exploit such interdependencies. More generally, the agent should be able to break a task into smaller subtasks and use knowledge accumulated in any subset of those to speed up learning in related tasks. This process of leveraging knowledge acquired in one task to improve performance on other tasks is called transfer [25, 11].\n\nIn this paper we look at one specific type of transfer, namely, when subtasks correspond to different reward functions defined in the same environment. This setup is flexible enough to allow transfer to happen at different levels. In particular, by appropriately defining the rewards one can induce different task decompositions. For instance, the type of hierarchical decomposition involved in the driving example above can be induced by changing the frequency at which rewards are delivered: a positive reinforcement can be given after each maneuver that is well executed or only at the final destination. Obviously, one can also decompose a task into subtasks that are independent of each other or whose dependency is strictly temporal (that is, when tasks must be executed in a certain order but no single task is clearly \u201ccontained\u201d within another).\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nThe types of task decomposition discussed above potentially allow the agent to tackle more complex problems than would be possible were the tasks modeled as a single monolithic challenge. However, in order to apply this divide-and-conquer strategy to its full extent the agent should have an explicit mechanism to promote transfer between tasks. Ideally, we want a transfer approach to have two important properties. 
First, the \ufb02ow of information between tasks should not be dictated by a rigid\ndiagram that re\ufb02ects the relationship between the tasks themselves, such as hierarchical or temporal\ndependencies. On the contrary, information should be exchanged across tasks whenever useful.\nSecond, rather than being posed as a separate problem, transfer should be integrated into the RL\nframework as much as possible, preferably in a way that is almost transparent to the agent.\nIn this paper we propose an approach for transfer that has the two properties above. Our method builds\non two conceptual pillars that complement each other. The \ufb01rst is a generalization of Dayan\u2019s [7]\nsuccessor representation. As the name suggests, in this representation scheme each state is described\nby a prediction about the future occurrence of all states under a \ufb01xed policy. We present a generaliza-\ntion of Dayan\u2019s idea which extends the original scheme to continuous spaces and also facilitates the\nuse of approximation. We call the resulting scheme successor features. As will be shown, successor\nfeatures lead to a representation of the value function that naturally decouples the dynamics of the\nenvironment from the rewards, which makes them particularly suitable for transfer.\nThe second pillar of our framework is a generalization of Bellman\u2019s [3] classic policy improvement\ntheorem that extends the original result from one to multiple decision policies. This novel result\nshows how knowledge about a set of tasks can be transferred to a new task in a way that is completely\nintegrated within RL. It also provides performance guarantees on the new task before any learning\nhas taken place, which opens up the possibility of constructing a library of \u201cskills\u201d that can be reused\nto solve previously unseen tasks. 
In addition, we present a theorem that formalizes the notion that an agent should be able to perform well on a task if it has seen a similar task before\u2014something clearly desirable in the context of transfer. Combined, the two results above not only set our approach on firm ground but also outline the mechanics of how to actually implement transfer. We build on this knowledge to propose a concrete method and evaluate it in two environments, one encompassing a sequence of navigation tasks and the other involving the control of a simulated two-joint robotic arm.\n\n2 Background and problem formulation\n\nAs usual, we assume that the interaction between agent and environment can be modeled as a Markov decision process (MDP, Puterman [19]). An MDP is defined as a tuple M \u2261 (S, A, p, R, \u03b3). The sets S and A are the state and action spaces, respectively; here we assume that S and A are finite whenever such an assumption facilitates the presentation, but most of the ideas readily extend to continuous spaces. For each s \u2208 S and a \u2208 A the function p(\u00b7|s, a) gives the next-state distribution upon taking action a in state s. We will often refer to p(\u00b7|s, a) as the dynamics of the MDP. The reward received at transition s \u2192a s' is given by the random variable R(s, a, s'); usually one is interested in the expected value of this variable, which we will denote by r(s, a, s') or by r(s, a) = E_{S'\u223cp(\u00b7|s,a)}[r(s, a, S')]. The discount factor \u03b3 \u2208 [0, 1) gives smaller weights to future rewards.\n\nThe objective of the agent in RL is to find a policy \u03c0\u2014a mapping from states to actions\u2014that maximizes the expected discounted sum of rewards, also called the return G_t = \u2211_{i=0}^{\u221e} \u03b3^i R_{t+i+1}, where R_t = R(S_t, A_t, S_{t+1}). One way to address this problem is to use methods derived from dynamic programming (DP), which heavily rely on the concept of a value function [19]. The action-value function of a policy \u03c0 is defined as\n\nQ^\u03c0(s, a) \u2261 E^\u03c0[G_t | S_t = s, A_t = a],   (1)\n\nwhere E^\u03c0[\u00b7] denotes expected value when following policy \u03c0. Once the action-value function of a particular policy \u03c0 is known, we can derive a new policy \u03c0' which is greedy with respect to Q^\u03c0(s, a), that is, \u03c0'(s) \u2208 argmax_a Q^\u03c0(s, a). Policy \u03c0' is guaranteed to be at least as good as (if not better than) policy \u03c0. The computation of Q^\u03c0(s, a) and \u03c0', called policy evaluation and policy improvement, defines the basic mechanics of RL algorithms based on DP; under certain conditions their successive application leads to an optimal policy \u03c0\u2217 that maximizes the expected return from every s \u2208 S [21].\n\nIn this paper we are interested in the problem of transfer, which we define as follows. Let T, T' be two sets of tasks such that T' \u2282 T, and let t be any task. Then there is transfer if, after training on T, the agent always performs as well as or better on task t than if it had only been trained on T'. Note that T' can be the empty set. In this paper a task will be defined as a specific instantiation of the reward function R(s, a, s') for a given MDP. In Section 4 we will revisit this definition and make it more formal.\n\n3 Successor features\n\nIn this section we present the concept that will serve as a cornerstone for the rest of the paper. 
We start by presenting a simple reward model and then show how it naturally leads to a generalization of Dayan\u2019s [7] successor representation (SR).\n\nSuppose that the expected one-step reward associated with transition (s, a, s') can be computed as\n\nr(s, a, s') = \u03c6(s, a, s')\u22a4w,   (2)\n\nwhere \u03c6(s, a, s') \u2208 R^d are features of (s, a, s') and w \u2208 R^d are weights. Expression (2) is not restrictive because we are not making any assumptions about \u03c6(s, a, s'): if we have \u03c6_i(s, a, s') = r(s, a, s') for some i, for example, we can clearly recover any reward function exactly. To simplify the notation, let \u03c6_t = \u03c6(s_t, a_t, s_{t+1}). Then, by simply rewriting the definition of the action-value function in (1) we have\n\nQ^\u03c0(s, a) = E^\u03c0[r_{t+1} + \u03b3 r_{t+2} + ... | S_t = s, A_t = a]\n          = E^\u03c0[\u03c6_{t+1}\u22a4w + \u03b3 \u03c6_{t+2}\u22a4w + ... | S_t = s, A_t = a]\n          = E^\u03c0[\u2211_{i=t}^{\u221e} \u03b3^{i\u2212t} \u03c6_{i+1} | S_t = s, A_t = a]\u22a4 w = \u03c8^\u03c0(s, a)\u22a4w.   (3)\n\nThe decomposition (3) has appeared before in the literature under different names and interpretations, as discussed in Section 6. Since here we propose to look at (3) as an extension of Dayan\u2019s [7] SR, we call \u03c8^\u03c0(s, a) the successor features (SFs) of (s, a) under policy \u03c0.\n\nThe ith component of \u03c8^\u03c0(s, a) gives the expected discounted sum of \u03c6_i when following policy \u03c0 starting from (s, a). In the particular case where S and A are finite and \u03c6 is a tabular representation of S \u00d7 A \u00d7 S\u2014that is, \u03c6(s, a, s') is a one-hot vector in R^{|S|\u00b2|A|}\u2014\u03c8^\u03c0(s, a) is the discounted sum of occurrences, under \u03c0, of each possible transition. 
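As a quick sanity check on the decomposition (3), the following sketch (ours, not code from the paper; all names are illustrative) computes the SFs of a fixed policy in a small random tabular MDP in closed form and verifies that \u03c8^\u03c0(s, a)\u22a4w matches Q^\u03c0 obtained by standard policy evaluation. For simplicity it assumes state-action features \u03c6(s, a) rather than transition features \u03c6(s, a, s'); the derivation is analogous.

```python
import numpy as np

# Illustrative sketch: in a tabular MDP, the SFs of a fixed policy pi solve
# psi = phi + gamma * P_pi psi, so psi = (I - gamma P_pi)^{-1} phi, and the
# decomposition Q^pi(s, a) = psi^pi(s, a) . w recovers the value function.
gamma = 0.9
n_states, n_actions, d = 3, 2, 2

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p(s'|s,a)
Phi = rng.standard_normal((n_states, n_actions, d))               # phi(s, a)
w = np.array([1.0, -0.5])                                         # task weights
pi = np.array([0, 1, 0])                                          # fixed policy

# Transition matrix of the chain induced by pi over (s, a) pairs.
SA = n_states * n_actions
P_pi = np.zeros((SA, SA))
for s in range(n_states):
    for a in range(n_actions):
        for s2 in range(n_states):
            P_pi[s * n_actions + a, s2 * n_actions + pi[s2]] = P[s, a, s2]

Phi_flat = Phi.reshape(SA, d)
Psi = np.linalg.solve(np.eye(SA) - gamma * P_pi, Phi_flat)  # SFs, Eq. (3)

Q_sf = Psi @ w                                              # Q via psi . w
r = Phi_flat @ w                                            # rewards, Eq. (2)
Q_direct = np.linalg.solve(np.eye(SA) - gamma * P_pi, r)    # standard evaluation
assert np.allclose(Q_sf, Q_direct)
```

Because both quantities are exact solutions of the same linear system, the equality holds to machine precision here; with sampled or approximated \u03c8 it would hold only up to the approximation error.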
This is essentially the concept of SR extended from the space S to the set S \u00d7 A \u00d7 S [7].\n\nOne of the contributions of this paper is precisely to generalize SR to be used with function approximation, but the exercise of deriving the concept as above provides insights already in the tabular case. To see this, note that in the tabular case the entries of w \u2208 R^{|S|\u00b2|A|} are the values of the function r(s, a, s'), and suppose that r(s, a, s') \u2260 0 in only a small subset W \u2282 S \u00d7 A \u00d7 S. From (2) and (3), it is clear that the cardinality of W, and not of S \u00d7 A \u00d7 S, is what effectively defines the dimension of the representation \u03c8^\u03c0, since there is no point in having d > |W|. Although this fact is hinted at by Dayan [7], it becomes more apparent when we look at SR as a particular case of SFs.\n\nSFs extend SR in two other ways. First, the concept readily applies to continuous state and action spaces without any modification. Second, by explicitly casting (2) and (3) as inner products involving feature vectors, SFs make it evident how to incorporate function approximation: as will be shown, these vectors can be learned from data.\n\nThe representation in (3) requires two components to be learned, w and \u03c8^\u03c0. Since the latter is the expected discounted sum of \u03c6 under \u03c0, we must either be given \u03c6 or learn it as well. Note that approximating r(s, a, s') \u2248 \u03c6(s, a, s')\u22a4\u02dcw is a supervised learning problem, so we can use well-understood techniques from the field to learn \u02dcw (and potentially \u02dc\u03c6, too) [9]. As for \u03c8^\u03c0, we note that\n\n\u03c8^\u03c0(s, a) = \u03c6_{t+1} + \u03b3 E^\u03c0[\u03c8^\u03c0(S_{t+1}, \u03c0(S_{t+1})) | S_t = s, A_t = a],   (4)\n\nthat is, SFs satisfy a Bellman equation in which the \u03c6_i play the role of rewards\u2014something also noted by Dayan [7] regarding SR. 
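Since (4) is just a Bellman equation with vector-valued \u201crewards\u201d \u03c6, \u03c8^\u03c0 can be learned with ordinary temporal-difference updates applied component-wise. The sketch below (ours, not the paper's implementation; names are illustrative) runs tabular TD(0) on a deterministic two-state loop and checks the fixed point of (4).

```python
import numpy as np

def td_update_sf(psi, s, a, phi_trans, s_next, a_next, gamma=0.9, alpha=0.1):
    """One TD(0) step on a table psi[s, a] of d-dimensional SFs.

    Target: phi(s, a, s') + gamma * psi(s', pi(s')), cf. Eq. (4).
    """
    target = phi_trans + gamma * psi[s_next, a_next]
    psi[s, a] += alpha * (target - psi[s, a])
    return psi

# Deterministic loop 0 -> 1 -> 0 with a single action; d = 2 features.
d = 2
psi = np.zeros((2, 1, d))          # 2 states, 1 action, d-dimensional SFs
phi = np.array([[1.0, 0.0],        # phi for the transition out of state 0
                [0.0, 1.0]])       # phi for the transition out of state 1
for _ in range(2000):
    for s in (0, 1):
        psi = td_update_sf(psi, s, 0, phi[s], 1 - s, 0)

# Fixed point of (4): psi(0) = phi0 + gamma*psi(1), psi(1) = phi1 + gamma*psi(0),
# hence psi(0) = (phi0 + gamma*phi1) / (1 - gamma^2).
gamma = 0.9
expected0 = (phi[0] + gamma * phi[1]) / (1 - gamma ** 2)
assert np.allclose(psi[0, 0], expected0, atol=1e-3)
```

Any other value-learning update (Q-learning-style off-policy variants, batch fitting, deep networks) could be substituted for the TD step, which is the point of the remark that follows.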
Therefore, in principle any RL method can be used to compute \u03c8^\u03c0 [24].\n\nThe SFs \u03c8^\u03c0 summarize the dynamics induced by \u03c0 in a given environment. As shown in (3), this allows for a modular representation of Q^\u03c0 in which the MDP\u2019s dynamics are decoupled from its rewards, which are captured by the weights w. One potential benefit of having such a decoupled representation is that only the relevant module must be relearned when either the dynamics or the reward changes, which may serve as an argument in favor of adopting SFs as a general approximation scheme for RL. However, in this paper we focus on a scenario where the decoupled value-function approximation provided by SFs is exploited to its full extent, as we discuss next.\n\n4 Transfer via successor features\n\nWe now return to the discussion about transfer in RL. As described, we are interested in the scenario where all components of an MDP are fixed, except for the reward function. One way to formalize this model is through (2): if we suppose that \u03c6 \u2208 R^d is fixed, any w \u2208 R^d gives rise to a new MDP. Based on this observation, we define\n\nM_\u03c6(S, A, p, \u03b3) \u2261 {M(S, A, p, r, \u03b3) | r(s, a, s') = \u03c6(s, a, s')\u22a4w},   (5)\n\nthat is, M_\u03c6 is the set of MDPs induced by \u03c6 through all possible instantiations of w. Since what differentiates the MDPs in M_\u03c6 is essentially the agent\u2019s goal, we will refer to M_i \u2208 M_\u03c6 as a task. The assumption is that we are interested in solving (a subset of) the tasks in the environment M_\u03c6.\n\nDefinition (5) is a natural way of modeling some scenarios of interest. Think, for example, of how the desirability of water or food changes depending on whether an animal is thirsty or hungry. 
One way to model this type of preference shifting is to suppose that the vector w appearing in (2) reflects the taste of the agent at any given point in time [17]. Later in the paper we will present experiments that reflect this scenario. For another illustrative example, imagine that the agent\u2019s goal is to produce and sell a combination of goods whose production line is relatively stable but whose prices vary considerably over time. In this case updating the prices of the products corresponds to picking a new w. A slightly different way of motivating (5) is to suppose that the environment itself is changing, that is, that the element w_i indicates not only the desirability, but also the availability, of feature \u03c6_i.\n\nIn the examples above it is desirable for the agent to build on previous experience to improve its performance on a new setup. More concretely, if the agent knows good policies for the set of tasks M \u2261 {M_1, M_2, ..., M_n}, with M_i \u2208 M_\u03c6, it should be able to leverage this knowledge to improve its behavior on a new task M_{n+1}\u2014that is, it should perform better than it would have had it been exposed to only a subset of the original tasks, M' \u2282 M. We can assess the performance of an agent on task M_{n+1} based on the value function of the policy followed after w_{n+1} has become available but before any policy improvement has taken place in M_{n+1}.\u00b9 More precisely, suppose that an agent has been exposed to each one of the tasks M_i \u2208 M'. Based on this experience, and on the new w_{n+1}, the agent computes a policy \u03c0' that will define its initial behavior in M_{n+1}. Now, if we repeat the experience replacing M' with M, the resulting policy \u03c0 should be such that Q^\u03c0(s, a) \u2265 Q^{\u03c0'}(s, a) for all (s, a) \u2208 S \u00d7 A.\n\nNow that our setup is clear we can start to describe our solution for the transfer problem discussed above. We do so in two stages. 
First, we present a generalization of DP\u2019s notion of policy improvement whose interest may go beyond the current work. We then show how SFs can be used to implement this generalized form of policy improvement in an efficient and elegant way.\n\n\u00b9Of course w_{n+1} can, and will be, learned, as discussed in Section 4.2 and illustrated in Section 5. Here we assume that w_{n+1} is given to make our performance criterion clear.\n\n4.1 Generalized policy improvement\n\nOne of the fundamental results in RL is Bellman\u2019s [3] policy improvement theorem. In essence, the theorem states that acting greedily with respect to a policy\u2019s value function gives rise to another policy whose performance is no worse than the former\u2019s. This is the driving force behind DP, and most RL algorithms that compute a value function are exploiting Bellman\u2019s result in one way or another.\n\nIn this section we extend the policy improvement theorem to the scenario where the new policy is to be computed based on the value functions of a set of policies. We show that this extension can be done in a natural way, by acting greedily with respect to the maximum over the value functions available. Our result is summarized in the theorem below.\n\nTheorem 1 (Generalized Policy Improvement). Let \u03c0_1, \u03c0_2, ..., \u03c0_n be n decision policies and let \u02dcQ^{\u03c0_1}, \u02dcQ^{\u03c0_2}, ..., \u02dcQ^{\u03c0_n} be approximations of their respective action-value functions such that\n\n|Q^{\u03c0_i}(s, a) \u2212 \u02dcQ^{\u03c0_i}(s, a)| \u2264 \u03b5 for all s \u2208 S, a \u2208 A, and i \u2208 {1, 2, ..., n}.   (6)\n\nDefine\n\n\u03c0(s) \u2208 argmax_a max_i \u02dcQ^{\u03c0_i}(s, a).   (7)\n\nThen,\n\nQ^\u03c0(s, a) \u2265 max_i Q^{\u03c0_i}(s, a) \u2212 (2 / (1 \u2212 \u03b3)) \u03b5   (8)\n\nfor any s \u2208 S and a \u2208 A, where Q^\u03c0 is the action-value function of \u03c0.\n\nThe proofs of our theoretical results are in the supplementary material. 
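A minimal illustration of the GPI operation (7) (our sketch, not code from the paper): given tabular action-value estimates for several policies, act greedily with respect to their pointwise maximum.

```python
import numpy as np

def gpi_policy(q_tables):
    """GPI: pi(s) in argmax_a max_i Q_i(s, a).

    q_tables: array of shape (n_policies, n_states, n_actions).
    """
    q_max = np.max(q_tables, axis=0)   # max_i Q^{pi_i}(s, a)
    return np.argmax(q_max, axis=1)    # greedy action per state

# Two policies, neither dominating the other.
Q = np.array([
    [[1.0, 0.0], [0.0, 0.0]],   # policy 1 is good in state 0
    [[0.0, 0.0], [0.0, 2.0]],   # policy 2 is good in state 1
])
pi = gpi_policy(Q)
assert pi.tolist() == [0, 1]    # the better policy's action is chosen per state
```

Note that the resulting policy follows different source policies in different states, which is exactly why max_i Q^{\u03c0_i} is in general not the value function of any single policy.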
As one can see, our theorem covers the case where the policies\u2019 value functions are not computed exactly, either because function approximation is used or because some exact algorithm has not been run to completion. This error is captured by \u03b5 in (6), which reappears as a penalty term in the lower bound (8). Such a penalty is inherent to the presence of approximation in RL, and in fact it is identical to the penalty incurred in the single-policy case (see, e.g., Bertsekas and Tsitsiklis\u2019s Proposition 6.1 [5]).\n\nIn order to contextualize generalized policy improvement (GPI) within the broader scenario of DP, suppose for a moment that \u03b5 = 0. In this case Theorem 1 states that \u03c0 will perform no worse than all of the policies \u03c0_1, \u03c0_2, ..., \u03c0_n. This is interesting because in general max_i Q^{\u03c0_i}\u2014the function used to induce \u03c0\u2014is not the value function of any particular policy. It is not difficult to see that \u03c0 will be strictly better than all previous policies if no single policy dominates all other policies, that is, if argmax_i max_a \u02dcQ^{\u03c0_i}(s, a) \u2229 argmax_i max_a \u02dcQ^{\u03c0_i}(s', a) = \u2205 for some s, s' \u2208 S. If one policy does dominate all others, GPI reduces to the original policy improvement theorem.\n\nIf we consider the usual DP loop, in which policies of increasing performance are computed in sequence, our result is not of much use because the most recent policy will always dominate all others. Another way of putting it is to say that after Theorem 1 is applied once, adding the resulting \u03c0 to the set {\u03c0_1, \u03c0_2, ..., \u03c0_n} will reduce the next improvement step to standard policy improvement, and thus the policies \u03c0_1, \u03c0_2, ..., \u03c0_n can simply be discarded. There are, however, two situations in which our result may be of interest. One is when we have many policies \u03c0_i being evaluated in parallel. 
In this case GPI provides a principled strategy for combining these policies. The other situation in which our result may be useful is when the underlying MDP changes, as we discuss next.\n\n4.2 Generalized policy improvement with successor features\n\nWe start this section by extending our notation slightly to make it easier to refer to the quantities involved in transfer learning. Let M_i be a task in M_\u03c6 defined by w_i \u2208 R^d. We will use \u03c0\u2217_i to refer to an optimal policy of MDP M_i and use Q^{\u03c0\u2217_i} to refer to its value function. The value function of \u03c0\u2217_i when executed in M_j \u2208 M_\u03c6 will be denoted by Q^{\u03c0\u2217_i}_j.\n\nSuppose now that an agent has computed optimal policies for the tasks M_1, M_2, ..., M_n \u2208 M_\u03c6. Suppose further that when presented with a new task M_{n+1} the agent computes {Q^{\u03c0\u2217_1}_{n+1}, Q^{\u03c0\u2217_2}_{n+1}, ..., Q^{\u03c0\u2217_n}_{n+1}}, the evaluation of each \u03c0\u2217_i under the new reward function induced by w_{n+1}. In this case, applying the GPI theorem to the newly-computed set of value functions will give rise to a policy that performs at least as well as a policy based on any subset of these, including the empty set. Thus, this strategy satisfies our definition of successful transfer.\n\nThere is a caveat, though. Why would one waste time computing the value functions of \u03c0\u2217_1, \u03c0\u2217_2, ..., \u03c0\u2217_n, whose performance in M_{n+1} may be mediocre, if the same amount of resources can be allocated to compute a sequence of n policies with increasing performance? This is where SFs come into play. Suppose that we have learned the functions Q^{\u03c0\u2217_i} using the representation scheme shown in (3). Now, if the reward changes to r_{n+1}(s, a, s') = \u03c6(s, a, s')\u22a4w_{n+1}, as long as we have w_{n+1} we can compute the new value function of \u03c0\u2217_i by simply making Q^{\u03c0\u2217_i}_{n+1}(s, a) = \u03c8^{\u03c0\u2217_i}(s, a)\u22a4w_{n+1}. This reduces the computation of all Q^{\u03c0\u2217_i}_{n+1} to the much simpler supervised problem of approximating w_{n+1}.\n\nOnce the functions Q^{\u03c0\u2217_i}_{n+1} have been computed, we can apply GPI to derive a policy \u03c0 whose performance on M_{n+1} is no worse than the performance of \u03c0\u2217_1, \u03c0\u2217_2, ..., \u03c0\u2217_n on the same task. A question that arises in this case is whether we can provide stronger guarantees on the performance of \u03c0 by exploiting the structure shared by the tasks in M_\u03c6. The following theorem answers this question in the affirmative.\n\nTheorem 2. Let M_i \u2208 M_\u03c6 and let Q^{\u03c0\u2217_j}_i be the action-value function of an optimal policy of M_j \u2208 M_\u03c6 when executed in M_i. Given approximations {\u02dcQ^{\u03c0\u2217_1}_i, \u02dcQ^{\u03c0\u2217_2}_i, ..., \u02dcQ^{\u03c0\u2217_n}_i} such that\n\n|Q^{\u03c0\u2217_j}_i(s, a) \u2212 \u02dcQ^{\u03c0\u2217_j}_i(s, a)| \u2264 \u03b5   (9)\n\nfor all s \u2208 S, a \u2208 A, and j \u2208 {1, 2, ..., n}, let \u03c0(s) \u2208 argmax_a max_j \u02dcQ^{\u03c0\u2217_j}_i(s, a). Finally, let \u03c6_max = max_{s,a} ||\u03c6(s, a)||, where ||\u00b7|| is the norm induced by the inner product adopted. Then,\n\nQ^{\u03c0\u2217_i}_i(s, a) \u2212 Q^\u03c0_i(s, a) \u2264 (2 / (1 \u2212 \u03b3)) (\u03c6_max min_j ||w_i \u2212 w_j|| + \u03b5).   (10)\n\nNote that we used M_i rather than M_{n+1} in the theorem\u2019s statement to remove any suggestion of order among the tasks. Theorem 2 is a specialization of Theorem 1 for the case where the set of value functions used to compute \u03c0 are associated with tasks in the form of (5). 
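To make the transfer protocol concrete, here is a minimal hypothetical sketch (ours, not the paper's implementation; shapes and names are illustrative): the stored SFs are kept as a single array, evaluating every previous policy on a new task reduces to inner products with the new w, and GPI then acts greedily over the pointwise maximum.

```python
import numpy as np

def transferred_policy(psi_stack, w_new):
    """SF transfer + GPI.

    psi_stack: (n_policies, n_states, n_actions, d) array of stored SFs.
    w_new: (d,) weight vector of the new task.
    """
    q_new = psi_stack @ w_new                    # Q^{pi_i}_{new}(s, a) = psi . w
    return np.argmax(q_new.max(axis=0), axis=1)  # GPI over the new Q's

# Two stored policies in a 2-state, 2-action, d=2 world:
# policy 0 accumulates feature 0 via action 0; policy 1 accumulates
# feature 1 via action 1.
psi_stack = np.zeros((2, 2, 2, 2))
psi_stack[0, :, 0, 0] = 1.0   # psi^{pi_0}(s, a=0) = (1, 0)
psi_stack[1, :, 1, 1] = 1.0   # psi^{pi_1}(s, a=1) = (0, 1)

w_new = np.array([0.2, 1.0])  # the new task rewards feature 1 most
pi = transferred_policy(psi_stack, w_new)
assert pi.tolist() == [1, 1]  # GPI follows pi_1's action everywhere
```

No learning happens in this step: the stored \u03c8 modules are reused as-is, and only the inner products with the new weight vector are recomputed, which is the computational point made above.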
As such, it provides stronger guarantees: instead of comparing the performance of \u03c0 with that of the previously-computed policies \u03c0_j, Theorem 2 quantifies the loss incurred by following \u03c0 as opposed to one of M_i\u2019s optimal policies. As shown in (10), the loss Q^{\u03c0\u2217_i}_i(s, a) \u2212 Q^\u03c0_i(s, a) is upper-bounded by two terms. The term 2\u03c6_max min_j ||w_i \u2212 w_j|| / (1 \u2212 \u03b3) is of more interest here because it reflects the structure of M_\u03c6. This term is a multiple of the distance between w_i, the vector describing the task we are currently interested in, and the closest w_j for which we have computed a policy. This formalizes the intuition that the agent should perform well in task w_i if it has solved a similar task before. More generally, the term in question relates the concept of distance in R^d to difference in performance in M_\u03c6. Note that this correspondence depends on the specific set of features \u03c6 used, which raises the interesting question of how to define \u03c6 such that tasks that are close in R^d induce policies that are also similar in some sense. Regardless of how exactly \u03c6 is defined, the bound (10) allows for powerful extrapolations. For example, by covering the relevant subspace of R^d with balls of appropriate radii centered at the w_j we can provide performance guarantees for any task w [14]. This corresponds to building a library of options (or \u201cskills\u201d) that can be used to solve any task in a (possibly infinite) set [22]. In Section 5 we illustrate this concept with experiments.\n\nAlthough Theorem 2 is inextricably related to the characterization of M_\u03c6 in (5), it does not depend on the definition of SFs in any way. Here SFs are the mechanism used to efficiently apply the protocol suggested by Theorem 2. 
When SFs are used the value function approximations are given by \u02dcQ^{\u03c0\u2217_j}_i(s, a) = \u02dc\u03c8^{\u03c0\u2217_j}(s, a)\u22a4\u02dcw_i. The modules \u02dc\u03c8^{\u03c0\u2217_j} are computed and stored when the agent is learning the tasks M_j; when faced with a new task M_i the agent computes an approximation of w_i, which is a supervised learning problem, and then uses the GPI policy \u03c0 defined in Theorem 2 to learn \u02dc\u03c8^{\u03c0\u2217_i}. Note that we do not assume that either \u03c8^{\u03c0\u2217_j} or w_i is computed exactly: the effects of errors in \u02dc\u03c8^{\u03c0\u2217_j} and \u02dcw_i are accounted for by the term \u03b5 appearing in (9). As shown in (10), if \u03b5 is small and the agent has seen enough tasks the performance of \u03c0 on M_i should already be good, which suggests that it may also speed up the process of learning \u02dc\u03c8^{\u03c0\u2217_i}.\n\nInterestingly, Theorem 2 also provides guidance for some practical algorithmic choices. Since in an actual implementation one wants to limit the number of SFs \u02dc\u03c8^{\u03c0\u2217_j} stored in memory, the corresponding vectors \u02dcw_j can be used to decide which ones to keep. For example, one can create a new \u02dc\u03c8^{\u03c0\u2217_i} only when min_j ||\u02dcw_i \u2212 \u02dcw_j|| is above a given threshold; alternatively, once the maximum number of SFs has been reached, one can replace \u02dc\u03c8^{\u03c0\u2217_k}, where k = argmin_j ||\u02dcw_i \u2212 \u02dcw_j|| (here w_i is the current task).\n\n5 Experiments\n\nIn this section we present our main experimental results. Additional details, along with further results and analysis, can be found in Appendix B of the supplementary material.\n\nThe first environment we consider involves navigation tasks defined over a two-dimensional continuous space composed of four rooms (Figure 1). 
The agent starts in one of the rooms and must reach a goal region located in the farthest room. The environment has objects that can be picked up by the agent by passing over them. Each object belongs to one of three classes determining the associated reward. The objective of the agent is to pick up the \u201cgood\u201d objects and navigate to the goal while avoiding \u201cbad\u201d objects. The rewards associated with the object classes change every 20 000 transitions, giving rise to very different tasks (Figure 1). The goal is to maximize the sum of rewards accumulated over a sequence of 250 tasks, with each task\u2019s rewards sampled uniformly from [\u22121, 1]^3.\n\nWe defined a straightforward instantiation of our approach in which both \u02dcw and \u02dc\u03c8^\u03c0 are computed incrementally in order to minimize losses induced by (2) and (4). Every time the task changes the current \u02dc\u03c8^{\u03c0_i} is stored and a new \u02dc\u03c8^{\u03c0_{i+1}} is created. We call this method SFQL as a reference to the fact that SFs are learned through an algorithm analogous to Q-learning (QL)\u2014which is used as a baseline in our comparisons [27]. As a more challenging reference point we report results for a transfer method called probabilistic policy reuse [8]. We adopt a version of the algorithm that builds on QL and reuses all policies learned. The resulting method, PRQL, is thus directly comparable to SFQL. The details of QL, PRQL, and SFQL, including their pseudo-codes, are given in Appendix B.\n\nWe compared two versions of SFQL. In the first one, called SFQL-\u03c6, we assume the agent has access to features \u03c6 that perfectly predict the rewards, as in (2). The second version of our agent had to learn an approximation \u02dc\u03c6 \u2208 R^h directly from data collected by QL in the first 20 tasks. 
Note that h may not coincide with the true dimension of φ, which in this case is 4; we refer to the different instances of our algorithm as SFQL-h. The process of learning φ̃ followed the multi-task learning protocol proposed by Caruana [6] and Baxter [2], and described in detail in Appendix B.

The results of our experiments can be seen in Figure 2. As shown, all versions of SFQL significantly outperform the other two methods, with an improvement on the average return of more than 100% when compared to PRQL, which itself improves on QL by around 100%. Interestingly, SFQL-h seems to achieve good overall performance faster than SFQL-φ, even though the latter uses features that allow for an exact representation of the rewards. One possible explanation is that, unlike their counterparts φ_i, the features φ̃_i are activated over most of the space S × A × S, which results in a dense pseudo-reward signal that facilitates learning.

The second environment we consider is a set of control tasks defined in the MuJoCo physics engine [26]. Each task consists in moving a two-joint torque-controlled simulated robotic arm to a

Figure 1: Environment layout and some examples of optimal trajectories associated with specific tasks. The shapes of the objects represent their classes; 'S' is the start state and 'G' is the goal.

Figure 2: Average and cumulative return per task in the four-room domain. SFQL-h receives no reward during the first 20 tasks while learning φ̃.
Error-bands show one standard error over 30 runs. (Curves: Q-Learning, PRQL, SFQL-φ / SFQL-4, and SFQL-8.)

Figure 3: Normalized return on the reacher domain: '1' corresponds to the average result achieved by DQN after learning each task separately and '0' corresponds to the average performance of a randomly-initialized agent (see Appendix B for details). (a) Performance on training tasks (faded dotted lines in the background are DQN's results). (b) Average performance on test tasks. (c) Colored and gray circles depict training and test targets, respectively. SFDQN's results were obtained using the GPI policies π_i(s) defined in the text. Shading shows one standard error over 30 runs.

specific target location; thus, we refer to this environment as "the reacher domain." We defined 12 tasks, but only allowed the agents to train on 4 of them (Figure 3c). This means that the agent must be able to perform well on tasks that it has never experienced during training.

In order to solve this problem, we adopted essentially the same algorithm as above, but we replaced QL with Mnih et al.'s DQN, both as a baseline and as the basic engine underlying the SF agent [15]. The resulting method, which we call SFDQN, is an illustration of how our method can be naturally combined with complex nonlinear approximators such as neural networks. The features φ_i used by SFDQN are the negations of the distances to the centers of the 12 target regions. As usual in experiments of this type, we give the agents a description of the current task: for DQN the target coordinates are given as inputs, while for SFDQN this is provided as a one-hot vector w_t ∈ R^12 [12]. Unlike in the previous experiment, in the current setup each transition was used to train all four ψ̃^{π_i} through losses derived from (4).
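Because the task vectors are one-hot, evaluating ψ̃_j(s, a)⊤w_i reduces to reading off the ith component of each SF prediction. A minimal sketch of the distance-based reward features and of this GPI action selection follows (shapes and function names are ours; the actual agent uses neural-network approximators rather than arrays of precomputed values):

```python
import numpy as np

def reacher_features(fingertip_xy, target_centers):
    """phi: negated Euclidean distances to the target centers (here 12)."""
    return -np.linalg.norm(target_centers - fingertip_xy, axis=1)

def gpi_policy(psi_values, task_id):
    """pi_i(s) = argmax_a max_j psi~_j(s, a)^T w_i, with w_i one-hot.

    psi_values : (n_policies, n_actions, n_tasks) SF predictions at state s.
    With w_i = e_i the dot product selects the task_id-th component.
    """
    q = psi_values[:, :, task_id]        # (n_policies, n_actions)
    return int(q.max(axis=0).argmax())   # greedy over the max across policies
```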
Here \u03c0i is the GPI policy on the ith task: \u03c0i(s) \u2208 argmaxa maxj\n\u02dc\u03c8j(s, a)(cid:62)wi.\nResults are shown in Figures 3a and 3b. Looking at the training curves, we see that whenever a\ntask is selected for training SFDQN\u2019s return on that task quickly improves and saturates at near-\noptimal performance. The interesting point to be noted is that, when learning a given task, SFDQN\u2019s\nperformance also improves in all other tasks, including the test ones, for which it does not have\nspecialized policies. This illustrates how the combination of SFs and GPI can give rise to \ufb02exible\nagents able to perform well in any task of a set of tasks with shared dynamics\u2014which in turn can be\nseen as both a form of temporal abstraction and a step towards more general hierarchical RL [22, 1].\n\n6 Related work\n\nMehta et al.\u2019s [14] approach for transfer learning is probably the closest work to ours in the literature.\nThere are important differences, though. First, Mehta et al. [14] assume that both \u03c6 and w are always\nobservable quantities provided by the environment. They also focus on average reward RL, in which\nthe quality of a decision policy can be characterized by a single scalar. This reduces the process of\nselecting a policy for a task to one decision made at the outset, which is in clear contrast with GPI.\n\n8\n\nTasks TrainedNormalized ReturnTask 1Task 2Task 3Task 4SFDQNDQNTasks TrainedNormalized Return\f\u03c0i from scratch at each new task.\n\nThe literature on transfer learning has other methods that relate to ours [25, 11]. Among the algorithms\ndesigned for the scenario considered here, two approaches are particularly relevant because they also\nreuse old policies. One is Fern\u00e1ndez et al.\u2019s [8] probabilistic policy reuse, adopted in our experiments\nand described in Appendix B. 
The other approach, by Bernstein [4], corresponds to using our method but relearning each ψ̃^{π_i} from scratch at each new task.

When we look at SFs strictly as a representation scheme, there are clear similarities with Littman et al.'s [13] predictive state representation (PSR). Unlike SFs, though, PSR tries to summarize the dynamics of the entire environment rather than those of a single policy π. A scheme that is perhaps closer to SFs is the value function representation sometimes adopted in inverse RL [18].

SFs are also related to Sutton et al.'s [23] general value functions (GVFs), which extend the notion of value function to also include "pseudo-rewards." If we see φ_i as a pseudo-reward, ψ^π_i(s, a) becomes a particular case of a GVF. Beyond the technical similarities, the connection between SFs and GVFs uncovers some principles underlying both lines of work that, when contrasted, may benefit both. On one hand, Sutton et al.'s [23] and Modayil et al.'s [16] hypothesis that relevant knowledge about the world can be expressed in the form of many predictions naturally translates to SFs: if φ is expressive enough, the agent should be able to represent any relevant reward function. Conversely, SFs not only provide a concrete way of using this knowledge, they also suggest a possible criterion to select the pseudo-rewards φ_i, since ultimately we are only interested in features that help in the approximation φ(s, a, s′)⊤ w̃ ≈ r(s, a, s′).

Another generalization of value functions that is related to SFs is Schaul et al.'s [20] universal value function approximators (UVFAs).
UVFAs extend the notion of value function to also include as an argument an abstract representation of a "goal," which makes them particularly suitable for transfer. The function max_j ψ̃^{π*_j}(s, a)⊤ w̃ used in our framework can be seen as a function of s, a, and w̃ (the latter a generic way of representing a goal), and thus in some sense this representation is a UVFA. The connection between SFs and UVFAs raises an interesting point: since under this interpretation w̃ is simply the description of a task, it can in principle be a direct function of the observations, which opens up the possibility of the agent determining w̃ even before seeing any rewards.

As discussed, our approach is also related to temporal abstraction and hierarchical RL: if we look at the ψ^π as instances of Sutton et al.'s [22] options, acting greedily with respect to the maximum over their value functions corresponds in some sense to planning at a higher level of temporal abstraction (that is, each ψ^π(s, a) is associated with an option that terminates after a single step). This is the view adopted by Yao et al. [28], whose universal option model closely resembles our approach in some aspects (the main difference being that they do not do GPI).

Finally, there have been previous attempts to combine the SR and neural networks. Kulkarni et al. [10] and Zhang et al. [29] propose similar architectures to jointly learn ψ̃^π(s, a), φ̃(s, a, s′), and w̃. Although neither work exploits SFs for GPI, both discuss other uses of SFs for transfer. In principle the proposed (or similar) architectures can also be used within our framework.

7 Conclusion

This paper builds on two concepts, both of which are generalizations of previous ideas.
The \ufb01rst\none is SFs, a generalization of Dayan\u2019s [7] SR that extends the original de\ufb01nition from discrete to\ncontinuous spaces and also facilitates the use of function approximation. The second concept is GPI,\nformalized in Theorem 1. As the name suggests, this result extends Bellman\u2019s [3] classic policy\nimprovement theorem from a single to multiple policies.\nAlthough SFs and GPI are of interest on their own, in this paper we focus on their combination to\ninduce transfer. The resulting framework is an elegant extension of DP\u2019s basic setting that provides a\nsolid foundation for transfer in RL. As a complement to the proposed transfer approach, we derived\na theoretical result, Theorem 2, that formalizes the intuition that an agent should perform well on\na novel task if it has seen a similar task before. We also illustrated with a comprehensive set of\nexperiments how the combination of SFs and GPI promotes transfer in practice.\nWe believe the proposed ideas lay out a general framework for transfer in RL. By specializing the\nbasic components presented one can build on our results to derive agents able to perform well across\na wide variety of tasks, and thus extend the range of environments that can be successfully tackled.\n\n9\n\n\fAcknowledgments\n\nThe authors would like to thank Joseph Modayil for the invaluable discussions during the development\nof the ideas described in this paper. We also thank Peter Dayan, Matt Botvinick, Marc Bellemare,\nand Guy Lever for the excellent comments, and Dan Horgan and Alexander Pritzel for their help with\nthe experiments. Finally, we thank the anonymous reviewers for their comments and suggestions to\nimprove the paper.\n\nReferences\n[1] Andrew G. Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement\n\nlearning. Discrete Event Dynamic Systems, 13(4):341\u2013379, 2003.\n\n[2] Jonathan Baxter. A model of inductive bias learning. 
Journal of Artificial Intelligence Research, 12:149–198, 2000.

[3] Richard E. Bellman. Dynamic Programming. Princeton University Press, 1957.

[4] Daniel S. Bernstein. Reusing old policies to accelerate learning on new MDPs. Technical report, Amherst, MA, USA, 1999.

[5] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

[6] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.

[7] Peter Dayan. Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5(4):613–624, 1993.

[8] Fernando Fernández, Javier García, and Manuela Veloso. Probabilistic policy reuse for inter-task transfer learning. Robotics and Autonomous Systems, 58(7):866–871, 2010.

[9] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2002.

[10] Tejas D. Kulkarni, Ardavan Saeedi, Simanta Gautam, and Samuel J. Gershman. Deep successor reinforcement learning. arXiv preprint arXiv:1606.02396, 2016.

[11] Alessandro Lazaric. Transfer in reinforcement learning: A framework and a survey. Reinforcement Learning: State-of-the-Art, pages 143–173, 2012.

[12] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[13] Michael L. Littman, Richard S. Sutton, and Satinder Singh. Predictive representations of state. In Advances in Neural Information Processing Systems (NIPS), pages 1555–1561, 2001.

[14] Neville Mehta, Sriraam Natarajan, Prasad Tadepalli, and Alan Fern. Transfer in variable-reward hierarchical reinforcement learning. Machine Learning, 73(3), 2008.

[15] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A.
Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[16] Joseph Modayil, Adam White, and Richard S. Sutton. Multi-timescale nexting in a reinforcement learning robot. Adaptive Behavior, 22(2):146–160, 2014.

[17] Sriraam Natarajan and Prasad Tadepalli. Dynamic preferences in multi-criteria reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), pages 601–608, 2005.

[18] Andrew Ng and Stuart Russell. Algorithms for inverse reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), pages 663–670, 2000.

[19] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.

[20] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International Conference on Machine Learning (ICML), pages 1312–1320, 2015.

[21] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[22] Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, 1999.

[23] Richard S. Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M. Pilarski, Adam White, and Doina Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In International Conference on Autonomous Agents and Multiagent Systems, pages 761–768, 2011.

[24] Csaba Szepesvári. Algorithms for Reinforcement Learning.
Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.

[25] Matthew E. Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(1):1633–1685, 2009.

[26] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5026–5033, 2012.

[27] Christopher Watkins and Peter Dayan. Q-learning. Machine Learning, 8:279–292, 1992.

[28] Hengshuai Yao, Csaba Szepesvári, Richard S. Sutton, Joseph Modayil, and Shalabh Bhatnagar. Universal option models. In Advances in Neural Information Processing Systems (NIPS), pages 990–998, 2014.

[29] Jingwei Zhang, Jost Tobias Springenberg, Joschka Boedecker, and Wolfram Burgard. Deep reinforcement learning with successor features for navigation across similar environments. CoRR, abs/1612.05533, 2016.