{"title": "Reward Design via Online Gradient Ascent", "book": "Advances in Neural Information Processing Systems", "page_first": 2190, "page_last": 2198, "abstract": "Recent work has demonstrated that when artificial agents are limited in their ability to achieve their goals, the agent designer can benefit by making the agent's goals different from the designer's. This gives rise to the optimization problem of designing the artificial agent's goals---in the RL framework, designing the agent's reward function. Existing attempts at solving this optimal reward problem do not leverage experience gained online during the agent's lifetime nor do they take advantage of knowledge about the agent's structure. In this work, we develop a gradient ascent approach with formal convergence guarantees for approximately solving the optimal reward problem online during an agent's lifetime. We show that our method generalizes a standard policy gradient approach, and we demonstrate its ability to improve reward functions in agents with various forms of limitations.", "full_text": "Reward Design via Online Gradient Ascent\n\nJonathan Sorg\n\nComputer Science and Eng.\n\nUniversity of Michigan\njdsorg@umich.edu\n\nSatinder Singh\n\nComputer Science and Eng.\n\nUniversity of Michigan\nbaveja@umich.edu\n\nRichard L. Lewis\n\nDepartment of Psychology\n\nUniversity of Michigan\nrickl@umich.edu\n\nAbstract\n\nRecent work has demonstrated that when arti\ufb01cial agents are limited in their\nability to achieve their goals, the agent designer can bene\ufb01t by making the agent\u2019s\ngoals different from the designer\u2019s. This gives rise to the optimization problem of\ndesigning the arti\ufb01cial agent\u2019s goals\u2014in the RL framework, designing the agent\u2019s\nreward function. Existing attempts at solving this optimal reward problem do not\nleverage experience gained online during the agent\u2019s lifetime nor do they take\nadvantage of knowledge about the agent\u2019s structure. 
In this work, we develop a\ngradient ascent approach with formal convergence guarantees for approximately\nsolving the optimal reward problem online during an agent\u2019s lifetime. We show that\nour method generalizes a standard policy gradient approach, and we demonstrate\nits ability to improve reward functions in agents with various forms of limitations.\n\n1 The Optimal Reward Problem\n\nIn this work, we consider the scenario of an agent designer building an autonomous agent. The\ndesigner has his or her own goals which must be translated into goals for the autonomous agent.\nWe represent goals using the Reinforcement Learning (RL) formalism of the reward function. This\nleads to the optimal reward problem of designing the agent\u2019s reward function so as to maximize the\nobjective reward received by the agent designer.\nTypically, the designer assigns his or her own reward to the agent. However, there is ample work\nwhich demonstrates the bene\ufb01t of assigning reward which does not match the designer\u2019s. For example,\nwork on reward shaping [11] has shown how to modify rewards to accelerate learning without altering\nthe optimal policy, and PAC-MDP methods [5, 20] including approximate Bayesian methods [7, 19]\nadd bonuses to the objective reward to achieve optimism under uncertainty. These approaches\nexplicitly or implicitly assume that the asymptotic behavior of the agent should be the same as that\nwhich would occur using the objective reward function. These methods do not explicitly consider the\noptimal reward problem; however, they do show improved performance through reward modi\ufb01cation.\nIn our recent work that does explicitly consider the optimal reward problem [18], we analyzed an\nexplicit hypothesis about the bene\ufb01t of reward design\u2014that it helps mitigate the performance loss\ncaused by computational constraints (bounds) on agent architectures. 
We considered various types of agent limitations\u2014limits on planning depth, failure to account for partial observability, and other erroneous modeling assumptions\u2014and demonstrated the benefits of good reward functions in each case empirically. Crucially, in bounded agents, the optimal reward function often leads to behavior that is different from the asymptotic behavior achieved with the objective reward function.\nIn this work, we develop an algorithm, Policy Gradient for Reward Design (PGRD), for improving reward functions for a family of bounded agents that behave according to repeated local (from the current state) model-based planning. We show that this algorithm is capable of improving the reward functions of agents whose computational limitations necessitate small bounds on the depth of planning, and also of agents that use an inaccurate model (which may be inaccurate due to computationally-motivated approximations). PGRD has few parameters, improves the reward function online during an agent\u2019s lifetime, takes advantage of knowledge about the agent\u2019s structure (through the gradient computation), and is linear in the number of reward function parameters.\nNotation. Formally, we consider discrete-time partially-observable environments with a finite number of hidden states s \u2208 S, actions a \u2208 A, and observations o \u2208 O; these finite set assumptions are useful for our theorems, but our algorithm can handle infinite sets in practice. The environment\u2019s dynamics are governed by a state-transition function P(s\u2032|s, a) that defines a distribution over next-states s\u2032 conditioned on current state s and action a, and an observation function \u03a9(o|s) that defines a distribution over observations o conditioned on current state s.\nThe agent designer\u2019s goals are specified via the objective reward function RO. At each time step, the designer receives reward RO(st) \u2208 [0, 1] based on the current state st of the environment, where the subscript denotes time. The designer\u2019s objective return is the expected mean objective reward obtained over an infinite horizon, i.e., lim_{N\u2192\u221e} E[(1/N) \u2211_{t=0}^{N} RO(st)]. In the standard view of RL, the agent uses the same reward function as the designer to align the interests of the agent and the designer. Here we allow for a separate agent reward function R(\u00b7). An agent\u2019s reward function can in general be defined in terms of the history of actions and observations, but is often more pragmatically defined in terms of some abstraction of history. We define the agent\u2019s reward function precisely in Section 2.\nOptimal Reward Problem. An RL agent attempts to act so as to maximize its own cumulative reward, or return. Crucially, as a result, the sequence of environment-states {st}_{t=0}^{\u221e} is affected by the choice of reward function; therefore, the agent designer\u2019s return is affected as well. The optimal reward problem arises from the fact that while the objective reward function is fixed as part of the problem description, the agent\u2019s reward function is a choice to be made by the designer. We capture this choice abstractly by letting the reward be parameterized by some vector of parameters \u03b8 chosen from a space of parameters \u0398. Each \u03b8 \u2208 \u0398 specifies a reward function R(\u00b7; \u03b8) which in turn produces a distribution over environment state sequences via whatever RL method the agent uses. The expected return obtained by the designer for choice \u03b8 is U(\u03b8) = lim_{N\u2192\u221e} E[(1/N) \u2211_{t=0}^{N} RO(st) | R(\u00b7; \u03b8)]. The optimal reward parameters are given by the solution to the optimal reward problem [16, 17, 18]:\n\n\u03b8\u2217 = arg max_{\u03b8\u2208\u0398} U(\u03b8) = arg max_{\u03b8\u2208\u0398} lim_{N\u2192\u221e} E[(1/N) \u2211_{t=0}^{N} RO(st) | R(\u00b7; \u03b8)].    (1)\n\nOur previous research on solving the optimal reward problem has focused primarily on the properties of the optimal reward function and its correspondence to the agent architecture and the environment [16, 17, 18]. This work has used inefficient exhaustive search methods for finding good approximations to \u03b8\u2217 (though there is recent work on using genetic algorithms to do this [6, 9, 12]). Our primary contribution in this paper is a new convergent online stochastic gradient method for finding approximately optimal reward functions. To our knowledge, this is the first algorithm that improves reward functions in an online setting\u2014during a single agent\u2019s lifetime.\nIn Section 2, we present the PGRD algorithm, prove its convergence, and relate it to OLPOMDP [2], a policy gradient algorithm. In Section 3, we present experiments demonstrating PGRD\u2019s ability to approximately solve the optimal reward problem online.\n\n2 PGRD: Policy Gradient for Reward Design\n\nPGRD builds on the following insight: the agent\u2019s planning algorithm procedurally converts the reward function into behavior; thus, the reward function can be viewed as a specific parameterization of the agent\u2019s policy. 
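As a concrete illustration of the optimization problem in Equation (1), U(\u03b8) can be approximated by running the agent under the reward R(\u00b7; \u03b8) and averaging the objective reward that results. The following sketch is ours: the step function, the toy two-state chain, and all names are illustrative assumptions, not part of the paper.

```python
import math
import random

def estimate_return(step_fn, objective_reward, theta, num_steps, seed=0):
    """Monte Carlo estimate of U(theta): the mean *objective* reward R_O
    obtained while the agent behaves under its *agent* reward R(.; theta)."""
    rng = random.Random(seed)
    state, total = 0, 0.0
    for _ in range(num_steps):
        state = step_fn(state, theta, rng)   # one agent-environment step
        total += objective_reward(state)     # designer's reward, in [0, 1]
    return total / num_steps

# Toy two-state chain: theta biases the agent toward state 1, and the
# designer is rewarded only in state 1.
def toy_step(state, theta, rng):
    p_go = 1.0 / (1.0 + math.exp(-theta))    # agent's preference for state 1
    return 1 if rng.random() < p_go else 0

u_low = estimate_return(toy_step, lambda s: float(s == 1), theta=-3.0, num_steps=5000)
u_high = estimate_return(toy_step, lambda s: float(s == 1), theta=3.0, num_steps=5000)
```

Searching over \u03b8 with rollout estimates of this kind is essentially the exhaustive approach the paper improves upon; PGRD instead follows a gradient.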
Using this insight, PGRD updates the reward parameters by estimating the gradient of the objective return with respect to the reward parameters, \u2207\u03b8U(\u03b8), from experience, using standard policy gradient techniques. In fact, we show that PGRD can be viewed as an (independently interesting) generalization of the policy gradient method OLPOMDP [2]. Specifically, we show that OLPOMDP is a special case of PGRD when the planning depth d is zero. In this section, we first present the family of local planning agents for which PGRD improves the reward function. Next, we develop PGRD and prove its convergence. Finally, we show that PGRD generalizes OLPOMDP and discuss how adding planning to OLPOMDP affects the space of policies available to the optimization method.\n\nInput: T, \u03b80, {\u03b1t}_{t=0}^{\u221e}, \u03b2, \u03b3\n1: o0, i0 = initializeStart();\n2: for t = 0, 1, 2, 3, . . . do\n3:    \u2200a Qt(a; \u03b8t) = plan(it, ot, T, R(it, \u00b7, \u00b7; \u03b8t), d, \u03b3);\n4:    at \u223c \u00b5(a|it; Qt);\n5:    rt+1, ot+1 = takeAction(at);\n6:    zt+1 = \u03b2zt + \u2207\u03b8t\u00b5(at|it; Qt) / \u00b5(at|it; Qt);\n7:    \u03b8t+1 = \u03b8t + \u03b1t(rt+1zt+1 \u2212 \u03bb\u03b8t);\n8:    it+1 = updateInternalState(it, at, ot+1);\n9: end\n\nFigure 1: PGRD (Policy Gradient for Reward Design) Algorithm\n\nA Family of Limited Agents with Internal State. Given a Markov model T defined over the observation space O and action space A, denote by T(o\u2032|o, a) the probability of next observation o\u2032 given that the agent takes action a after observing o. Our agents use the model T to plan. We do not assume that the model T is an accurate model of the environment. The use of an incorrect model is one type of agent limitation we examine in our experiments. 
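The loop of Figure 1 can be sketched as follows. The stubs below (a two-action bandit-style environment, a depth-0 plan, and all helper names) are our illustrative assumptions under a Boltzmann policy with temperature 1, not the paper's implementation.

```python
import numpy as np

def pgrd(plan, policy, grad_log_policy, take_action, update_internal_state,
         theta0, alpha, beta, lam, num_steps, rng):
    """A sketch of Figure 1. The reward parameters theta are the only learned
    parameters; z is a decaying average of likelihood-ratio terms (line 6)."""
    theta = np.array(theta0, dtype=float)
    z = np.zeros_like(theta)
    i, o = 0, 0                                    # initializeStart() stand-in
    for t in range(num_steps):
        q = plan(i, o, theta)                      # line 3: planning scores Q_t
        a = policy(q, rng)                         # line 4: a_t ~ mu(.|i; Q_t)
        r, o_next = take_action(a, rng)            # line 5: objective reward, obs
        z = beta * z + grad_log_policy(i, o, a, theta, q)   # line 6
        theta = theta + alpha(t) * (r * z - lam * theta)    # line 7
        i, o = update_internal_state(i, a, o_next), o_next  # line 8
    return theta

def softmax(q):
    e = np.exp(q - q.max())
    return e / e.sum()

# Tiny usage: depth-0 planning (Q_0 = R), two actions, and an environment that
# gives objective reward 1 for action 1; theta should learn to favor action 1.
theta = pgrd(
    plan=lambda i, o, th: th,                                # Q_0(a) = R(a; theta)
    policy=lambda q, rng: int(rng.random() < softmax(q)[1]),
    grad_log_policy=lambda i, o, a, th, q: np.eye(2)[a] - softmax(q),
    take_action=lambda a, rng: (1.0 if a == 1 else 0.0, 0),
    update_internal_state=lambda i, a, o: i,
    theta0=[0.0, 0.0], alpha=lambda t: 0.05, beta=0.0, lam=0.0,
    num_steps=2000, rng=np.random.default_rng(0))
```

With depth 0 and a Boltzmann policy this reduces to the OLPOMDP special case discussed in the text.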
In general, agents can use non-Markov models defined in terms of the history of observations and actions; we leave this for future work. The agent maintains an internal state feature vector it that is updated at each time step using it+1 = updateInternalState(it, at, ot+1). The internal state allows the agent to use reward functions that depend on the agent\u2019s history. We consider rewards of the form R(it, o, a; \u03b8t) = \u03b8t^T \u03c6(it, o, a), where \u03b8t is the reward parameter vector at time t, and \u03c6(it, o, a) is a vector of features based on internal state it, planning state o, and action a. Note that if \u03c6 is a vector of binary indicator features, this representation allows for arbitrary reward functions and thus the representation is completely general. Many existing methods use reward functions that depend on history. Reward functions based on empirical counts of observations, as in PAC-MDP approaches [5, 20], provide some examples; see [14, 15, 13] for others. We present a concrete example in our empirical section.\nAt each time step t, the agent\u2019s planning algorithm, plan, performs depth-d planning using the model T and reward function R(it, o, a; \u03b8t) with current internal state it and reward parameters \u03b8t. Specifically, the agent computes a d-step Q-value function Q^d(it, ot, a; \u03b8t) \u2200a \u2208 A, where Q^d(it, o, a; \u03b8t) = R(it, o, a; \u03b8t) + \u03b3 \u2211_{o\u2032\u2208O} T(o\u2032|o, a) max_{b\u2208A} Q^{d\u22121}(it, o\u2032, b; \u03b8t) and Q^0(it, o, a; \u03b8t) = R(it, o, a; \u03b8t). We emphasize that the internal state it and reward parameters \u03b8t are held invariant while planning. Note that the d-step Q-values are only computed for the current observation ot, in effect by building a depth-d tree rooted at ot. In the d = 0 special case, the planning procedure completely ignores the model T and returns Q^0(it, ot, a; \u03b8t) = R(it, ot, a; \u03b8t). 
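The depth-d recursion just described can be sketched for a tabular model as follows (function and variable names are ours; a real implementation would also cache repeated subproblems):

```python
def plan_depth_d(R, T, obs, actions, observations, d, gamma):
    """Depth-d planning rooted at the current observation `obs`:
    Q^0(o, a) = R(o, a) and
    Q^d(o, a) = R(o, a) + gamma * sum_o' T(o'|o, a) * max_b Q^{d-1}(o', b).
    R maps (o, a) -> reward; T maps (o, a, o') -> probability."""
    def q(o, a, depth):
        if depth == 0:
            return R[(o, a)]
        return R[(o, a)] + gamma * sum(
            T.get((o, a, o2), 0.0) * max(q(o2, b, depth - 1) for b in actions)
            for o2 in observations)
    return {a: q(obs, a, d) for a in actions}

# Two observations, two actions: action 1 moves to observation 1, action 0
# stays; reward only for taking action 0 in observation 1.
R = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 0.0}
T = {(0, 0, 0): 1.0, (0, 1, 1): 1.0, (1, 0, 1): 1.0, (1, 1, 1): 1.0}
q_d0 = plan_depth_d(R, T, 0, [0, 1], [0, 1], 0, 0.5)  # ignores the model
q_d1 = plan_depth_d(R, T, 0, [0, 1], [0, 1], 1, 0.5)  # looks one step ahead
```

At depth 0 both actions score zero from observation 0; at depth 1 the agent credits action 1 with the discounted reward reachable one step ahead.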
Regardless of the value of d, we treat the end result of planning as providing a scoring function Qt(a; \u03b8t) where the dependence on d, it and ot is dropped from the notation. To allow for gradient calculations, our agents act according to the Boltzmann (soft-max) stochastic policy parameterized by Q: \u00b5(a|it; Qt) def= e^{\u03c4 Qt(a;\u03b8t)} / \u2211_{b} e^{\u03c4 Qt(b;\u03b8t)}, where \u03c4 is a temperature parameter that determines how stochastically the agent selects the action with the highest score. When the planning depth d is small due to computational limitations, the agent cannot account for events beyond the planning depth. We examine this limitation in our experiments.\nGradient Ascent. To develop a gradient algorithm for improving the reward function, we need to compute the gradient of the objective return with respect to \u03b8: \u2207\u03b8U(\u03b8). The main insight is to break the gradient calculation into the calculation of two gradients. The first is the gradient of the objective return with respect to the policy \u00b5, and the second is the gradient of the policy with respect to the reward function parameters \u03b8. The first gradient is exactly what is computed in standard policy gradient approaches [2]. The second gradient is challenging because the transformation from reward parameters to policy involves a model-based planning procedure. We draw from the work of Neu and Szepesv\u00e1ri [10] which shows that this gradient computation resembles planning itself. We develop PGRD, presented in Figure 1, explicitly as a generalization of OLPOMDP, a policy gradient algorithm developed by Bartlett and Baxter [2], because of its foundational simplicity relative to other policy-gradient algorithms such as those based on actor-critic methods (e.g., [4]). 
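As an illustrative sketch of the action-selection step and of the second gradient (the chain rule through the soft-max, used in the subsection on computing the gradient of the policy with respect to reward), with names that are our own:

```python
import math

def boltzmann(q, tau):
    """mu(a) proportional to exp(tau * Q(a)); larger tau is greedier."""
    m = max(q)
    w = [math.exp(tau * (v - m)) for v in q]   # shift by max for stability
    s = sum(w)
    return [x / s for x in w]

def policy_grad_from_q_grad(q, dq, tau):
    """Chain rule through the soft-max: for each action a and parameter k,
    d mu(a)/d theta_k = tau * mu(a) * (dQ(a)/d theta_k
                                       - sum_b mu(b) * dQ(b)/d theta_k),
    where dq[a][k] holds dQ(a)/d theta_k."""
    mu = boltzmann(q, tau)
    nparams = len(dq[0])
    avg = [sum(mu[b] * dq[b][k] for b in range(len(q))) for k in range(nparams)]
    return [[tau * mu[a] * (dq[a][k] - avg[k]) for k in range(nparams)]
            for a in range(len(q))]

mu = boltzmann([1.0, 2.0], tau=1.0)
grads = policy_grad_from_q_grad([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], tau=1.0)
```

Because \u00b5 is a distribution, the gradients across actions sum to zero for each parameter, which is a useful sanity check.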
Notably, the reward parameters are the only parameters being learned in PGRD.\nPGRD follows the form of OLPOMDP (Algorithm 1 in Bartlett and Baxter [2]) but generalizes it in three places. In Figure 1 line 3, the agent plans to compute the policy, rather than storing the policy directly. In line 6, the gradient of the policy with respect to the parameters accounts for the planning procedure. In line 8, the agent maintains a general notion of internal state that allows for richer parameterization of policies than typically considered (similar to Aberdeen and Baxter [1]). The algorithm takes as parameters a sequence of learning rates {\u03b1k}, a decaying-average parameter \u03b2, and a regularization parameter \u03bb > 0 which keeps the reward parameters \u03b8 bounded throughout learning. Given a sequence of calculations of the gradient of the policy with respect to the parameters, \u2207\u03b8t\u00b5(at|it; Qt), the remainder of the algorithm climbs the gradient of objective return \u2207\u03b8U(\u03b8) using OLPOMDP machinery. In the next subsection, we discuss how to compute \u2207\u03b8t\u00b5(at|it; Qt).\nComputing the Gradient of the Policy with respect to Reward. For the Boltzmann distribution, the gradient of the policy with respect to the reward parameters is given by the equation \u2207\u03b8t\u00b5(a|it; Qt) = \u03c4 \u00b7 \u00b5(a|it; Qt)[\u2207\u03b8tQt(a; \u03b8t) \u2212 \u2211_{b\u2208A} \u00b5(b|it; Qt)\u2207\u03b8tQt(b; \u03b8t)], where \u03c4 is the Boltzmann temperature (see [10]). Thus, computing \u2207\u03b8t\u00b5(a|it; Qt) reduces to computing \u2207\u03b8tQt(a; \u03b8t). The value of Qt depends on the reward parameters \u03b8t, the model, and the planning depth. However, as we present below, the process of computing the gradient closely resembles the process of planning itself, and the two computations can be interleaved. Theorem 1 presented below is an adaptation of Proposition 4 from Neu and Szepesv\u00e1ri [10]. It presents the gradient computation for depth-d planning as well as for infinite-depth discounted planning. We assume that the gradient of the reward function with respect to the parameters is bounded: sup_{\u03b8,o,i,a} \u2016\u2207\u03b8R(i, o, a; \u03b8)\u2016 < \u221e. The proof of the theorem follows directly from Proposition 4 of Neu and Szepesv\u00e1ri [10].\nTheorem 1. Except on a set of measure zero, for any depth d, the gradient \u2207\u03b8Q^d(o, a; \u03b8) exists and is given by the recursion (where we have dropped the dependence on i for simplicity)\n\n\u2207\u03b8Q^d(o, a; \u03b8) = \u2207\u03b8R(o, a; \u03b8) + \u03b3 \u2211_{o\u2032\u2208O} T(o\u2032|o, a) \u2211_{b\u2208A} \u03c0^{d\u22121}(b|o\u2032)\u2207\u03b8Q^{d\u22121}(o\u2032, b; \u03b8),    (2)\n\nwhere \u2207\u03b8Q^0(o, a; \u03b8) = \u2207\u03b8R(o, a; \u03b8) and \u03c0^d(a|o) \u2208 arg max_a Q^d(o, a; \u03b8) is any policy that is greedy with respect to Q^d. The result also holds for \u2207\u03b8Q\u2217(o, a; \u03b8) = \u2207\u03b8 lim_{d\u2192\u221e} Q^d(o, a; \u03b8).\nThe Q-function will not be differentiable when there are multiple optimal policies. This is reflected in the arbitrary choice of \u03c0 in the gradient calculation. However, it was shown by Neu and Szepesv\u00e1ri [10] that even for values of \u03b8 which are not differentiable, the above computation produces a valid calculation of a subgradient; we discuss this below in our proof of convergence of PGRD.\nConvergence of PGRD (Figure 1). 
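Before the convergence discussion, here is a sketch of how the recursion in Equation (2) can be interleaved with planning itself, computing Q and its gradient together under a tabular model. All names are ours, and the reward gradient dR is supplied as data (here, one-hot indicator features), as an assumption of the sketch.

```python
def plan_with_gradient(R, dR, T, obs, actions, observations, d, gamma):
    """Returns {a: (Q^d(obs, a), grad Q^d(obs, a))}. Mirrors Equation (2):
    grad Q^d(o, a) = grad R(o, a)
                     + gamma * sum_o' T(o'|o, a) * grad Q^{d-1}(o', b*),
    with b* greedy for Q^{d-1}(o', .). R[(o, a)] is the reward and dR[(o, a)]
    its gradient vector with respect to the reward parameters."""
    nparams = len(next(iter(dR.values())))
    def rec(o, a, depth):
        if depth == 0:
            return R[(o, a)], list(dR[(o, a)])
        val, grad = R[(o, a)], list(dR[(o, a)])
        for o2 in observations:
            p = T.get((o, a, o2), 0.0)
            if p == 0.0:
                continue
            children = [rec(o2, b, depth - 1) for b in actions]
            best_val, best_grad = max(children, key=lambda vg: vg[0])  # greedy pi
            val += gamma * p * best_val
            for k in range(nparams):
                grad[k] += gamma * p * best_grad[k]
        return val, grad
    return {a: rec(obs, a, d) for a in actions}

# Tabular reward over 2 observations x 2 actions with one-hot gradients.
R = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 0.0}
dR = {(0, 0): [1.0, 0, 0, 0], (0, 1): [0, 1.0, 0, 0],
      (1, 0): [0, 0, 1.0, 0], (1, 1): [0, 0, 0, 1.0]}
T = {(0, 0, 0): 1.0, (0, 1, 1): 1.0, (1, 0, 1): 1.0, (1, 1, 1): 1.0}
val, grad = plan_with_gradient(R, dR, T, 0, [0, 1], [0, 1], 1, 0.5)[1]
```

Computing val and grad in one pass reflects the point above that the gradient computation resembles planning and that the two can be interleaved.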
Given a particular fixed reward function R(\u00b7; \u03b8), transition model T, and planning depth, there is a corresponding fixed randomized policy \u00b5(a|i; \u03b8)\u2014where we have explicitly represented the reward\u2019s dependence on the internal state vector i in the policy parameterization and dropped Q from the notation as it is redundant given that everything else is fixed. Denote the agent\u2019s internal-state update as a (usually deterministic) distribution \u03c8(i\u2032|i, a, o). Given a fixed reward parameter vector \u03b8, the joint environment-state\u2013internal-state transitions can be modeled as a Markov chain with a |S||I| \u00d7 |S||I| transition matrix M(\u03b8) whose entries are given by M_{\u27e8s,i\u27e9,\u27e8s\u2032,i\u2032\u27e9}(\u03b8) = p(\u27e8s\u2032, i\u2032\u27e9|\u27e8s, i\u27e9; \u03b8) = \u2211_{o,a} \u03c8(i\u2032|i, a, o)\u03a9(o|s\u2032)P(s\u2032|s, a)\u00b5(a|i; \u03b8). We make the following assumptions about the agent and the environment:\nAssumption 1. The transition matrix M(\u03b8) of the joint environment-state\u2013internal-state Markov chain has a unique stationary distribution \u03c0(\u03b8) = [\u03c0_{s1,i1}(\u03b8), \u03c0_{s2,i2}(\u03b8), . . . , \u03c0_{s|S|,i|I|}(\u03b8)] satisfying the balance equations \u03c0(\u03b8)M(\u03b8) = \u03c0(\u03b8), for all \u03b8 \u2208 \u0398.\nAssumption 2. During its execution, PGRD (Figure 1) does not reach values of it and \u03b8t at which \u00b5(at|it, Qt) is not differentiable with respect to \u03b8t.\nIt follows from Assumption 1 that the objective return, U(\u03b8), is independent of the start state. The original OLPOMDP convergence proof [2] has a similar condition that only considers environment states. Intuitively, this condition allows PGRD to handle history-dependence of a reward function in the same manner that it handles partial observability in an environment. Assumption 2 accounts for the fact that a planning algorithm may not be fully differentiable everywhere. However, Theorem 1 showed that infinite and bounded-depth planning is differentiable almost everywhere (in a measure theoretic sense). Furthermore, this assumption is perhaps stronger than necessary, as stochastic approximation algorithms, which provide the theory upon which OLPOMDP is based, have been shown to converge using subgradients [8].\nIn order to state the convergence theorem, we must define the approximate gradient which OLPOMDP calculates. Let the approximate gradient estimate be \u2207\u0303^\u03b2_\u03b8 U(\u03b8) def= lim_{T\u2192\u221e} (1/T) \u2211_{t=1}^{T} rt zt for a fixed \u03b8 and PGRD parameter \u03b2, where zt (in Figure 1) represents a time-decaying average of the \u2207\u03b8t\u00b5(at|it, Qt) calculations. It was shown by Bartlett and Baxter [2] that \u2207\u0303^\u03b2_\u03b8 U(\u03b8) is close to the true value \u2207\u03b8U(\u03b8) for large values of \u03b2. Theorem 2 proves that PGRD converges to a stable equilibrium point based on this approximate gradient measure. This equilibrium point will typically correspond to some local optimum in the return function U(\u03b8). Given our development and assumptions, the theorem is a straightforward extension of Theorem 6 from Bartlett and Baxter [2] (proof omitted).\nTheorem 2. Given \u03b2 \u2208 [0, 1), \u03bb > 0, and a sequence of step sizes \u03b1t satisfying \u2211_{t=0}^{\u221e} \u03b1t = \u221e and \u2211_{t=0}^{\u221e} (\u03b1t)^2 < \u221e, PGRD produces a sequence of reward parameters \u03b8t such that \u03b8t \u2192 L as t \u2192 \u221e a.s., where L is the set of stable equilibrium points of the differential equation \u2202\u03b8/\u2202t = \u2207\u0303^\u03b2_\u03b8 U(\u03b8) \u2212 \u03bb\u03b8.\nPGRD generalizes OLPOMDP. 
As stated above, OLPOMDP, when it uses a Boltzmann distribution\nin its policy representation (a common case), is a special case of PGRD when the planning depth\nis zero. First, notice that in the case of depth-0 planning, Q0(i, o, a; \u03b8) = R(i, o, a, \u03b8), regard-\nless of the transition model and reward parameterization. We can also see from Theorem 1 that\n\u2207\u03b8Q0(i, o, a; \u03b8) = \u2207\u03b8R(i, o, a; \u03b8). Because R(i, o, a; \u03b8) can be parameterized arbitrarily, PGRD\ncan be con\ufb01gured to match standard OLPOMDP with any policy parameterization that also computes\na score function for the Boltzmann distribution.\nIn our experiments, we demonstrate that choosing a planning depth d > 0 can be bene\ufb01cial over using\nOLPOMDP (d = 0). In the remainder of this section, we show theoretically that choosing d > 0\ndoes not hurt in the sense that it does not reduce the space of policies available to the policy gradient\nmethod. Speci\ufb01cally, we show that when using an expressive enough reward parameterization,\nPGRD\u2019s space of policies is not restricted relative to OLPOMDP\u2019s space of policies. We prove the\nresult for in\ufb01nite planning, but the extension to depth-limited planning is straightforward.\nTheorem 3. There exists a reward parameterization such that, for an arbitrary transition model\nT , the space of policies representable by PGRD with in\ufb01nite planning is identical to the space of\npolicies representable by PGRD with depth 0 planning.\n\nProof. Ignoring internal state for now (holding it constant), let C(o, a) be an arbitrary reward\nfunction used by PGRD with depth 0 planning. Let R(o, a; \u03b8) be a reward function for PGRD with\nin\ufb01nite (d = \u221e) planning. The depth-\u221e agent uses the planning result Q\u2217(o, a; \u03b8) to act, while the\ndepth-0 agent uses the function C(o, a) to act. 
Therefore, it suffices to show that one can always choose \u03b8 such that the planning solution Q\u2217(o, a; \u03b8) equals C(o, a). For all o \u2208 O, a \u2208 A, set R(o, a; \u03b8) = C(o, a) \u2212 \u03b3 \u2211_{o\u2032} T(o\u2032|o, a) max_{a\u2032} C(o\u2032, a\u2032). Substituting Q\u2217 for C, this is the Bellman optimality equation [22] for infinite-horizon planning. Setting R(o, a; \u03b8) as above is possible if it is parameterized by a table with an entry for each observation\u2013action pair.\nTheorem 3 also shows that the effect of an arbitrarily poor model can be overcome with a good choice of reward function. This is because a Boltzmann distribution can, allowing for an arbitrary scoring function C, represent any policy. We demonstrate this ability of PGRD in our experiments.\n\n3 Experiments\n\nThe primary objective of our experiments is to demonstrate that PGRD is able to use experience online to improve the reward function parameters, thereby improving the agent\u2019s obtained objective return. Specifically, we compare the objective return achieved by PGRD to the objective return achieved by PGRD with the reward adaptation turned off. In both cases, the reward function is initialized to the objective reward function. A secondary objective is to demonstrate that when a good model is available, adding the ability to plan\u2014even for small depths\u2014improves performance relative to the baseline algorithm of OLPOMDP (or equivalently PGRD with depth d = 0).\nForaging Domain for Experiments 1 to 3: The foraging environment illustrated in Figure 2(a) is a 3 \u00d7 3 grid world with 3 dead-end corridors (rows) separated by impassable walls. The agent (bird) has four available actions corresponding to each cardinal direction. Movement in the intended direction fails with probability 0.1, resulting in movement in a random direction. 
If the resulting direction is blocked by a wall or the boundary, the action results in no movement. There is a food source (worm) located in one of the three right-most locations at the end of each corridor. The agent has an eat action, which consumes the worm when the agent is at the worm\u2019s location. After the agent consumes the worm, a new worm appears randomly in one of the other two potential worm locations.\n\nFigure 2: A) Foraging Domain, B) Performance of PGRD with observation-action reward features, C) Performance of PGRD with recency reward features\n\nObjective Reward for the Foraging Domain: The designer\u2019s goal is to maximize the average number of worms eaten per time step. Thus, the objective reward function RO provides a reward of 1.0 when the agent eats a worm, and a reward of 0 otherwise. The objective return is defined as in Equation (1).\nExperimental Methodology: We tested PGRD for depth-limited planning agents of depths 0\u20136. Recall that PGRD for the agent with planning depth 0 is the OLPOMDP algorithm. For each depth, we jointly optimized over the PGRD algorithm parameters, \u03b1 and \u03b2 (we use a fixed \u03b1 throughout learning). We tested values for \u03b1 on an approximate logarithmic scale in the range (10^{\u22126}, 10^{\u22122}) as well as the special value of \u03b1 = 0, which corresponds to an agent that does not adapt its reward function. We tested \u03b2 values in the set {0, 0.4, 0.7, 0.9, 0.95, 0.99}. Following common practice [3], we set the \u03bb parameter to 0. We explicitly bounded the reward parameters and capped the reward function output, both to the range [\u22121, 1]. We used a Boltzmann temperature parameter of \u03c4 = 100 and planning discount factor \u03b3 = 0.95. 
Because we initialized \u03b8 so that the initial reward function was the objective reward function, PGRD with \u03b1 = 0 was equivalent to a standard depth-limited planning agent.\nExperiment 1: A fully observable environment with a correct model learned online. In this experiment, we improve the reward function in an agent whose only limitation is planning depth, using (1) a general reward parameterization based on the current observation and (2) a more compact reward parameterization which also depends on the history of observations.\nObservation: The agent observes the full state, which is given by the pair o = (l, w), where l is the agent\u2019s location and w is the worm\u2019s location.\nLearning a Correct Model: Although the convergence theorem for PGRD relies on the agent having a fixed model, the algorithm itself is readily applied to the case of learning a model online. In this experiment, the agent\u2019s model T is learned online based on empirical transition probabilities between observations (recall this is a fully observable environment). Let n_{o,a,o\u2032} be the number of times that o\u2032 was reached after taking action a after observing o. The agent models the probability of seeing o\u2032 as T(o\u2032|o, a) = n_{o,a,o\u2032} / \u2211_{o\u2032} n_{o,a,o\u2032}.\nReward Parameterizations: Recall that R(i, o, a; \u03b8) = \u03b8^T \u03c6(i, o, a), for some \u03c6(i, o, a). (1) In the observation-action parameterization, \u03c6(i, o, a) is a binary feature vector with one binary feature for each observation-action pair\u2014internal state is ignored. This is effectively a table representation over all reward functions indexed by (o, a). As shown in Theorem 3, the observation-action feature representation is capable of producing arbitrary policies over the observations. In large problems, such a parameterization would not be feasible. 
(2) The recency parameterization is a more compact representation which uses features that rely on the history of observations. The feature vector is \u03c6(i, o, a) = [RO(o, a), 1, \u03c6_{cl}(l, i), \u03c6_{cl,a}(l, a, i)], where RO(o, a) is the objective reward function defined as above. The feature \u03c6_{cl}(l, i) = 1 \u2212 1/c(l, i), where c(l, i) is the number of time steps since the agent has visited location l, as represented in the agent\u2019s internal state i. Its value is normalized to the range [0, 1) and is high when the agent has not been to location l recently. The feature \u03c6_{cl,a}(l, a, i) = 1 \u2212 1/c(l, a, i) is similarly defined with respect to the time since the agent has taken action a in location l. Features based on recency counts encourage persistent exploration [21, 18].\nResults & Discussion: Figure 2(b) and Figure 2(c) present results for agents that use the observation-action parameterization and the recency parameterization of the reward function respectively. The horizontal axis is the number of time steps of experience. The vertical axis is the objective return, i.e., the average objective reward per time step. Each curve is an average over 130 trials. The values of d and the associated optimal algorithm parameters for each curve are noted in the figures. First, note that with d = 6, the agent is unbounded, because food is never more than 6 steps away. 
Therefore,\nthe agent does not bene\ufb01t from adapting the reward function parameters (given that we initialize\nto the objective reward function). Indeed, the d = 6, \u03b1 = 0 agent performs as well as the best\nreward-optimizing agent. The performance for d = 6 improves with experience because the model\nimproves with experience (and thus from the curves it is seen that the model gets quite accurate in\nabout 1500 time steps). The largest objective return obtained for d = 6 is also the best objective\nreturn that can be obtained for any value of d.\nSeveral results can be observed in both Figures 2(b) and (c). 1) Each curve that uses \u03b1 > 0 (solid\nlines) improves with experience. This is a demonstration of our primary contribution, that PGRD is\nable to effectively improve the reward function with experience. That the improvement over time is\nnot just due to model learning is seen in the fact that for each value of d < 6 the curve for \u03b1 > 0\n(solid-line) which adapts the reward parameters does signi\ufb01cantly better than the corresponding curve\nfor \u03b1 = 0 (dashed-line); the \u03b1 = 0 agents still learn the model. 2) For both \u03b1 = 0 and \u03b1 > 0\nagents, the objective return obtained by agents with equivalent amounts of experience increases\nmonotonically as d is increased (though to maintain readability we only show selected values of\nd in each \ufb01gure). This demonstrates our secondary contribution, that the ability to plan in PGRD\nsigni\ufb01cantly improves performance over standard OLPOMDP (PGRD with d = 0).\nThere are also some interesting differences between the results for the two different reward function\nparameterizations. With the observation-action parameterization, we noted that there always exists a\nsetting of \u03b8 for all d that will yield optimal objective return. This is seen in Figure 2(b) in that all\nsolid-line curves approach optimal objective return. 
In contrast, the more compact recency reward parameterization does not afford this guarantee, and indeed, for small values of d (< 3), the solid-line curves in Figure 2(c) converge to less than optimal objective return. Notably, OLPOMDP (d = 0) does not perform well with this feature set. On the other hand, for planning depths 3 ≤ d < 6, the PGRD agents with the recency parameterization achieve optimal objective return faster than the corresponding PGRD agents with the observation–action parameterization. Finally, we note that this experiment validates our claim that PGRD can improve reward functions that depend on history.

Experiment 2: A fully observable environment and a poor given model. Our theoretical analysis showed that PGRD with an incorrect model and the observation–action reward parameterization should (modulo local-maxima issues) do just as well asymptotically as it would with a correct model. Here we illustrate this theoretical result empirically on the same foraging domain and objective reward function used in Experiment 1. We also test our hypothesis that a poor model slows the rate of learning relative to a correct model.

Poor Model: We gave the agents a fixed incorrect model of the foraging environment that assumes there are no internal walls separating the 3 corridors.

Reward Parameterization: We used the observation–action reward parameterization. With a poor model, it is no longer interesting to initialize θ so that the initial reward function is the objective reward function, because even for d = 6 such an agent would do poorly. Furthermore, we found that this initialization leads to excessively poor exploration and therefore poor learning of how to modify the reward. Thus, we initialize θ to uniform random values near 0, in the range (−10⁻³, 10⁻³).

Results: Figure 3(a) plots the objective return as a function of the number of steps of experience.
Each curve is an average over 36 trials. As hypothesized, the bad model slows learning by a factor of more than 10 (notice the difference in x-axis scale from Figure 2). Here, deeper planning results in slower learning, and indeed the d = 0 agent, which does not use the model at all, learns the fastest. However, also as hypothesized, because they use the expressive observation–action parameterization, agents of all planning depths mitigate the damage caused by the poor model and eventually converge to the optimal objective return.

Experiment 3: Partially observable foraging world. Here we evaluate PGRD's ability to learn in a partially observable version of the foraging domain. In addition, the agents learn a model under the erroneous (and computationally convenient) assumption that the domain is fully observable.

Figure 3: A) Performance of PGRD with a poor model, B) Performance of PGRD in a partially observable world with recency reward features, C) Performance of PGRD in Acrobot

Partial Observation: Instead of viewing the location of the worm at all times, the agent can now see the worm only when colocated with it: its observation is o = (l, f), where f indicates whether the agent is colocated with the food.

Learning an Incorrect Model: The model is learned just as in Experiment 1. Because of the erroneous full-observability assumption, the model hallucinates worms at all the corridor ends based on the empirical frequency of having encountered them there.

Reward Parameterization: We used the recency parameterization; due to the partial observability, agents with the observation–action feature set perform poorly in this environment. The parameters θ are initialized such that the initial reward function equals the objective reward function.

Results & Discussion: Figure 3(b) plots the mean of 260 trials.
As seen in the solid-line curves, PGRD improves the objective return at all depths (only a small amount for d = 0 and significantly more for d > 0). In fact, agents that do not adapt the reward are hurt by planning (relative to d = 0). This experiment demonstrates that the combination of planning and reward improvement can be beneficial even when the model is erroneous. Because of the partial observability, optimal behavior in this environment achieves less objective return than in Experiment 1.

Experiment 4: Acrobot. In this experiment we test PGRD in the Acrobot environment [22], a common benchmark task in the RL literature and one that has previously been used in the testing of policy gradient approaches [23]. This experiment demonstrates PGRD in an environment in which an agent must be limited due to the size of the state space, and it further demonstrates that adding model-based planning to policy gradient approaches can improve performance.

Domain: The version of Acrobot we use is as specified by Sutton and Barto [22]. It is a two-link robot arm in which the position of one shoulder joint is fixed and the agent's control is limited to 3 actions that apply torque to the elbow joint.

Observation: The fully observable state space is 4-dimensional, with two joint angles ψ1 and ψ2, and two joint velocities ψ̇1 and ψ̇2.

Objective Reward: The designer receives an objective reward of 1.0 when the tip is one arm's length above the fixed shoulder joint, after which the robot is reset to its initial resting position.

Model: We provide the agent with a perfect model of the environment. Because the environment is continuous, value iteration is intractable, and computational limitations prevent planning deep enough to compute the optimal action in every state. The feature vector contains 13 entries. One feature corresponds to the objective reward signal.
For each action, there are 5 features corresponding to each of the state features plus an additional feature representing the height of the tip: φ(i, o, a) = [RO(o), {ψ1(o), ψ2(o), ψ̇1(o), ψ̇2(o), h(o)}a]. The height feature has been used in previous work as an alternative definition of objective reward [23].

Results & Discussion: We plot the mean of 80 trials in Figure 3(c). Agents that use the fixed (α = 0) objective reward function with bounded-depth planning correspond to the bottom two curves. Allowing PGRD and OLPOMDP to adapt the parameters θ leads to improved objective return, as seen in the top two curves of Figure 3(c). Finally, the PGRD d = 6 agent outperforms the standard OLPOMDP agent (PGRD with d = 0), further demonstrating the benefit of model-based planning.

Overall Conclusion: We developed PGRD, a new method for approximately solving the optimal reward problem in bounded planning agents that can be applied in an online setting. We showed that PGRD is a generalization of OLPOMDP and demonstrated that it both improves reward functions in limited agents and outperforms the model-free OLPOMDP approach.

References

[1] Douglas Aberdeen and Jonathan Baxter. Scalable Internal-State Policy-Gradient Methods for POMDPs. In Proceedings of the Nineteenth International Conference on Machine Learning, 2002.

[2] Peter L. Bartlett and Jonathan Baxter.
Stochastic optimization of controlled partially observable Markov decision processes. In Proceedings of the 39th IEEE Conference on Decision and Control, 2000.

[3] Jonathan Baxter, Peter L. Bartlett, and Lex Weaver. Experiments with Infinite-Horizon, Policy-Gradient Estimation, 2001.

[4] Shalabh Bhatnagar, Richard S. Sutton, M. Ghavamzadeh, and Mark Lee. Natural actor-critic algorithms. Automatica, 2009.

[5] Ronen I. Brafman and Moshe Tennenholtz. R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning. Journal of Machine Learning Research, 3:213–231, 2001.

[6] S. Elfwing, Eiji Uchibe, K. Doya, and H. I. Christensen. Co-evolution of Shaping Rewards and Meta-Parameters in Reinforcement Learning. Adaptive Behavior, 16(6):400–412, 2008.

[7] J. Zico Kolter and Andrew Y. Ng. Near-Bayesian exploration in polynomial time. In Proceedings of the 26th International Conference on Machine Learning, pages 513–520, 2009.

[8] Harold J. Kushner and G. George Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer, 2nd edition, 2010.

[9] Çetin Meriçli, Tekin Meriçli, and H. Levent Akin. A Reward Function Generation Method Using Genetic Algorithms: A Robot Soccer Case Study (Extended Abstract). In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2010), pages 1513–1514, 2010.

[10] Gergely Neu and Csaba Szepesvári. Apprenticeship learning using inverse reinforcement learning and gradient methods. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence, pages 295–302, 2007.

[11] Andrew Y. Ng, Stuart J. Russell, and D. Harada. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the 16th International Conference on Machine Learning, pages 278–287, 1999.

[12] Scott Niekum, Andrew G.
Barto, and Lee Spector. Genetic Programming for Reward Function Search. IEEE Transactions on Autonomous Mental Development, 2(2):83–90, 2010.

[13] Pierre-Yves Oudeyer, Frederic Kaplan, and Verena V. Hafner. Intrinsic Motivation Systems for Autonomous Mental Development. IEEE Transactions on Evolutionary Computation, 11(2):265–286, April 2007.

[14] Jürgen Schmidhuber. Curious model-building control systems. In IEEE International Joint Conference on Neural Networks, pages 1458–1463, 1991.

[15] Satinder Singh, Andrew G. Barto, and Nuttapong Chentanez. Intrinsically Motivated Reinforcement Learning. In Proceedings of Advances in Neural Information Processing Systems 17 (NIPS), pages 1281–1288, 2005.

[16] Satinder Singh, Richard L. Lewis, and Andrew G. Barto. Where Do Rewards Come From? In Proceedings of the Annual Conference of the Cognitive Science Society, pages 2601–2606, 2009.

[17] Satinder Singh, Richard L. Lewis, Andrew G. Barto, and Jonathan Sorg. Intrinsically Motivated Reinforcement Learning: An Evolutionary Perspective. IEEE Transactions on Autonomous Mental Development, 2(2):70–82, 2010.

[18] Jonathan Sorg, Satinder Singh, and Richard L. Lewis. Internal Rewards Mitigate Agent Boundedness. In Proceedings of the 27th International Conference on Machine Learning, 2010.

[19] Jonathan Sorg, Satinder Singh, and Richard L. Lewis. Variance-Based Rewards for Approximate Bayesian Reinforcement Learning. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, 2010.

[20] Alexander L. Strehl and Michael L. Littman. An analysis of model-based Interval Estimation for Markov Decision Processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.

[21] Richard S. Sutton. Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming.
In Proceedings of the Seventh International Conference on Machine Learning, pages 216–224, 1990.

[22] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.

[23] Lex Weaver and Nigel Tao. The Optimal Reward Baseline for Gradient-Based Reinforcement Learning. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, pages 538–545, 2001.