{"title": "Policy Search by Dynamic Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 831, "page_last": 838, "abstract": "", "full_text": "Policy search by dynamic programming\n\nJ. Andrew Bagnell\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nSham Kakade\n\nUniversity of Pennsylvania\n\nPhiladelphia, PA 19104\n\nAndrew Y. Ng\n\nStanford University\nStanford, CA 94305\n\nJeff Schneider\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nAbstract\n\nWe consider the policy search approach to reinforcement learning. We\nshow that if a \u201cbaseline distribution\u201d is given (indicating roughly how\noften we expect a good policy to visit each state), then we can derive\na policy search algorithm that terminates in a \ufb01nite number of steps,\nand for which we can provide non-trivial performance guarantees. We\nalso demonstrate this algorithm on several grid-world POMDPs, a planar\nbiped walking robot, and a double-pole balancing problem.\n\n1\n\nIntroduction\n\nPolicy search approaches to reinforcement learning represent a promising method for solv-\ning POMDPs and large MDPs. In the policy search setting, we assume that we are given\nsome class \u03a0 of policies mapping from the states to the actions, and wish to \ufb01nd a good\npolicy \u03c0 \u2208 \u03a0. A common problem with policy search is that the search through \u03a0 can be\ndif\ufb01cult and computationally expensive, and is thus typically based on local search heuris-\ntics that do not come with any performance guarantees.\nIn this paper, we show that if we give the learning agent a \u201cbase distribution\u201d on states\n(speci\ufb01cally, one that indicates how often we expect it to be in each state; cf. [5, 4]), then we\ncan derive an ef\ufb01cient policy search algorithm that terminates after a polynomial number\nof steps. 
Our algorithm outputs a non-stationary policy, and each step in the algorithm requires only a minimization that can be performed or approximated via a call to a standard supervised learning algorithm. We also provide non-trivial guarantees on the quality of the policies found, and demonstrate the algorithm on several problems.\n\n2 Preliminaries\n\nWe consider an MDP with state space S; initial state s0 ∈ S; action space A; state transition probabilities {Psa(·)} (here, Psa is the next-state distribution on taking action a in state s); and reward function R : S ↦ R, which we assume to be bounded in the interval [0, 1].\nIn the setting in which the goal is to optimize the sum of discounted rewards over an infinite horizon, it is well known that an optimal policy which is both Markov and stationary (i.e., one where the action taken does not depend on the current time) always exists. For this reason, learning approaches to infinite-horizon discounted MDPs have typically focused on searching for stationary policies (e.g., [8, 5, 9]). In this work, we consider policy search in the space of non-stationary policies, and show how, with a base distribution, this allows us to derive an efficient algorithm.\nWe consider a setting in which the goal is to maximize the sum of undiscounted rewards over a T-step horizon: (1/T) E[R(s0) + R(s1) + . . . + R(sT−1)]. Clearly, by choosing T sufficiently large, a finite-horizon problem can also be used to approximate arbitrarily well an infinite-horizon discounted problem (e.g., [6]). Given a non-stationary policy (πt, πt+1, . . . , πT−1), where each πt : S ↦ A is a (stationary) policy, we define the value\n\nVπt,...,πT−1(s) ≡ (1/T) E[R(st) + R(st+1) + . . . + R(sT−1) | st = s; (πt, . . . , πT−1)]\n\nas the expected (normalized) sum of rewards attained by starting at state s with the “clock” at time t, taking one action according to πt, taking the next action according to πt+1, and so on. Note that\n\nVπt,...,πT−1(s) = (1/T) R(s) + Es′∼Psπt(s)[Vπt+1,...,πT−1(s′)],\n\nwhere the “s′ ∼ Psπt(s)” subscript indicates that the expectation is with respect to s′ drawn from the state transition distribution Psπt(s).\nIn our policy search setting, we consider a restricted class of deterministic, stationary policies Π, where each π ∈ Π is a map π : S ↦ A, and a corresponding class of non-stationary policies ΠT = {(π0, π1, . . . , πT−1) | for all t, πt ∈ Π}. In the partially observed, POMDP setting, we may restrict Π to contain policies that depend only on the observable aspects of the state, in which case we obtain a class of memoryless/reactive policies. Our goal is to find a non-stationary policy (π0, π1, . . . , πT−1) ∈ ΠT which performs well under the performance measure Vπ0,π1,...,πT−1(s0), which we abbreviate as Vπ(s0) when there is no risk of confusion.\n\n3 The Policy Search Algorithm\n\nFollowing [5, 4], we assume that we are given a sequence of base distributions μ0, μ1, . . . , μT−1 over the states. Informally, we think of μt as indicating to the algorithm approximately how often we think a good policy visits each state at time t.\nOur algorithm (also given in [4]), which we call Policy Search by Dynamic Programming (PSDP), is in the spirit of the traditional dynamic programming approach to solving MDPs, where values are “backed up.” In PSDP, it is the policy which is backed up. 
The algorithm begins by finding πT−1, then πT−2, and so on down to π0. Each policy πt is chosen from the stationary policy class Π. More formally, the algorithm is as follows:\nAlgorithm 1 (PSDP) Given T, μt, and Π:\n\nfor t = T−1, T−2, . . . , 0\n\nSet πt = arg max_{π′∈Π} E_{s∼μt}[Vπ′,πt+1,...,πT−1(s)]\n\nIn other words, we choose πt from Π so as to maximize the expected sum of future rewards for executing actions according to the policy sequence (πt, πt+1, . . . , πT−1) when starting from a random initial state s drawn from the baseline distribution μt.\nSince μ0, . . . , μT−1 provides the distribution over the state space that the algorithm is optimizing with respect to, we might hope that if a good policy tends to visit the state space in a manner comparable to this base distribution, then PSDP will return a good policy. The following theorem formalizes this intuition. The theorem also allows for the situation where the maximization step in the algorithm (the arg max over π′ ∈ Π) can be done only approximately. We later give specific examples showing settings in which this maximization can (approximately or exactly) be done efficiently.\nThe following definitions will be useful. For a non-stationary policy π = (π0, . . . , πT−1), define the future state distribution\n\nμπ,t(s) = Pr(st = s | s0, π).\n\nI.e., μπ,t(s) is the probability that we will be in state s at time t if picking actions according to π and starting from state s0. Also, given two T-step sequences of distributions over states μ = (μ0, . . . , μT−1) and μ′ = (μ′0, . . . , μ′T−1), define the average variational distance between them to be1\n\ndvar(μ, μ′) ≡ (1/T) Σ_{t=0}^{T−1} Σ_{s∈S} |μt(s) − μ′t(s)|\n\nHence, if πref is some policy, then dvar(μ, μπref) represents how much the base distribution μ differs from the future state distribution of the policy πref.\n\nTheorem 1 (Performance Guarantee) Let π = (π0, . . . , πT−1) be a non-stationary policy returned by an ε-approximate version of PSDP in which, on each step, the policy πt found comes within ε of maximizing the value. I.e.,\n\nE_{s∼μt}[Vπt,πt+1,...,πT−1(s)] ≥ max_{π′∈Π} E_{s∼μt}[Vπ′,πt+1,...,πT−1(s)] − ε.   (1)\n\nThen for all πref ∈ ΠT we have that\n\nVπ(s0) ≥ Vπref(s0) − Tε − T dvar(μ, μπref).\n\nProof. This proof may also be found in [4], but for the sake of completeness, we also provide it here. Let Pt(s) = Pr(st = s | s0, πref), πref = (πref,0, . . . , πref,T−1) ∈ ΠT, and π = (π0, . . . , πT−1) be the output of ε-PSDP. 
We have\n\nVπref(s0) − Vπ(s0) = (1/T) Σ_{t=0}^{T−1} E_{st∼Pt}[R(st)] − Vπ0,...,πT−1(s0)\n= Σ_{t=0}^{T−1} E_{st∼Pt}[(1/T) R(st) + Vπt,...(st) − Vπt,...(st)] − Vπ0,...(s0)\n= Σ_{t=0}^{T−1} E_{st∼Pt, st+1∼Pst πref,t(st)}[(1/T) R(st) + Vπt+1,...(st+1) − Vπt,...(st)]\n= Σ_{t=0}^{T−1} E_{st∼Pt}[Vπref,t,πt+1,...,πT−1(st) − Vπt,πt+1,...,πT−1(st)]\n\nwhere the last two steps use the telescoping of the value-to-go terms under πref's state distribution. It is well-known that for any function f bounded in absolute value by B, it holds true that |E_{s∼μ1}[f(s)] − E_{s∼μ2}[f(s)]| ≤ B Σs |μ1(s) − μ2(s)|. Since the values are bounded in the interval [0, 1] and since Pt = μπref,t,\n\nΣ_{t=0}^{T−1} E_{st∼Pt}[Vπref,t,πt+1,...,πT−1(st) − Vπt,πt+1,...,πT−1(st)]\n≤ Σ_{t=0}^{T−1} E_{s∼μt}[Vπref,t,πt+1,...,πT−1(s) − Vπt,πt+1,...,πT−1(s)] + Σ_{t=0}^{T−1} Σs |Pt(s) − μt(s)|\n≤ Σ_{t=0}^{T−1} max_{π′∈Π} E_{s∼μt}[Vπ′,πt+1,...,πT−1(s) − Vπt,πt+1,...,πT−1(s)] + T dvar(μπref, μ)\n≤ Tε + T dvar(μπref, μ)\n\nwhere we have used equation (1) and the fact that πref ∈ ΠT. The result now follows. □\nThis theorem shows that PSDP returns a policy with performance that competes favorably against those policies πref in ΠT whose future state distributions are close to μ. Hence, we expect our algorithm to provide a good policy if our prior knowledge allows us to choose a μ that is close to a future state distribution for a good policy in ΠT.\nIt is also shown in [4] that the dependence on dvar is tight in the worst case. 
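To make Algorithm 1 concrete, here is a minimal sketch of exact PSDP (ε = 0) for a small tabular MDP. We take Π to be the class of all deterministic stationary policies, so the per-step maximization decomposes state-by-state; the function and array names (`psdp_tabular`, `mu`, etc.) are our own scaffolding, not from the paper.

```python
import numpy as np

def psdp_tabular(P, R, mu, T):
    # P[s, a, s'] = transition probability, R[s] = reward in [0, 1],
    # mu[t, s] = baseline distribution mu_t.  With the full policy class,
    # the per-state argmax is exact for any mu with full support, so mu
    # only matters when Pi is restricted.
    S, A, _ = P.shape
    pis = []                            # collected backwards, reversed below
    V = np.zeros(S)                     # value-to-go of (pi_{t+1}, ..., pi_{T-1})
    for t in range(T - 1, -1, -1):
        Q = R[:, None] / T + P @ V      # Q[s, a] = (1/T) R(s) + E_{s'}[V(s')]
        pi_t = np.argmax(Q, axis=1)     # backed-up policy for time t
        V = Q[np.arange(S), pi_t]       # value of (pi_t, ..., pi_{T-1})
        pis.append(pi_t)
    pis.reverse()                       # pis[t] is now the policy for time t
    return pis, V                       # V[s] = value of the full policy from s
```

On a two-state chain with reward only in state 1, the backed-up policy moves toward state 1 from time 0 onward, matching the dynamic-programming intuition.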
Furthermore, it is straightforward to show (cf. [6, 8]) that the ε-approximate PSDP can be implemented using a number of samples that is linear in the VC dimension of Π, polynomial in T and 1/ε, but otherwise independent of the size of the state space. (See [4] for details.)\n\n4 Instantiations\n\nIn this section, we provide detailed examples showing how PSDP may be applied to specific classes of policies, where we can demonstrate computational efficiency.\n\n1If S is continuous and μt and μ′t are densities, the inner summation is replaced by an integral.\n\n4.1 Discrete observation POMDPs\n\nFinding memoryless policies for POMDPs is a difficult and important problem. Further, it is known that the best memoryless, stochastic, stationary policy can be arbitrarily better than the best memoryless, deterministic policy. This is frequently given as a reason for using stochastic policies. However, as we shortly show, there is no advantage to using stochastic (rather than deterministic) policies when we are searching for non-stationary policies.\nFour natural classes of memoryless policies to consider are as follows: stationary deterministic (SD), stationary stochastic (SS), non-stationary deterministic (ND), and non-stationary stochastic (NS). Let the operator opt return the value of the optimal policy in a class. The following specifies the relations among these classes.\n\nProposition 1 (Policy ordering) For any finite-state, finite-action POMDP,\n\nopt(SD) ≤ opt(SS) ≤ opt(ND) = opt(NS)\n\nWe now sketch a proof of this result. To see that opt(ND) = opt(NS), let μNS be the future state distribution of an optimal policy πNS ∈ NS. Consider running PSDP with base distribution μNS. After each update, the resulting policy (πNS,0, . . . , πNS,t−1, πt, . . . , πT−1) must be at least as good as πNS. 
Essentially, we can consider PSDP as sweeping through each timestep and modifying the stochastic policy to be deterministic, while never decreasing performance. A similar argument shows that opt(SS) ≤ opt(ND), while a simple example POMDP in the next section demonstrates that this inequality can be strict.\nThe potentially superior performance of non-stationary policies over stationary stochastic ones provides further justification for their use. Furthermore, the final relation, opt(ND) = opt(NS), suggests that considering only deterministic policies is sufficient in the non-stationary regime.\nUnfortunately, one can show that it is NP-hard to find, exactly or approximately, the best policy in any of these classes (this was shown for SD in [7]). While many search heuristics have been proposed, we now show that PSDP offers a viable, computationally tractable alternative for finding a good policy for POMDPs, one which offers performance guarantees in the form of Theorem 1.\n\nProposition 2 (PSDP complexity) For any POMDP, exact PSDP (ε = 0) runs in time polynomial in the size of the state and observation spaces and in the horizon time T.\n\nUnder PSDP, the policy update is as follows:\n\nπt(o) = arg max_a E_{s∼μt}[p(o|s) Va,πt+1,...,πT−1(s)],   (2)\n\nwhere p(o|s) gives the observation probabilities of the POMDP and the policy sequence (a, πt+1, . . . , πT−1) always begins by taking action a. It is clear that, given the policies from time t+1 onwards, Va,πt+1,...,πT−1(s) can be efficiently computed, and thus the update (2) can be performed in polynomial time in the relevant quantities. Intuitively, the distribution μ specifies here how to trade off the benefits of different underlying state-action pairs that share an observation. 
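As an illustration of Proposition 2, the exact update (2) for a small discrete POMDP can be sketched as below; the array layout and all names (`psdp_pomdp`, `Obs`, `mus`) are our own assumptions about how the problem data might be represented.

```python
import numpy as np

def psdp_pomdp(P, R, Obs, mus, T):
    # Exact PSDP (epsilon = 0) for a discrete POMDP, a sketch.
    # P[s, a, s'] = transition prob., R[s] = reward in [0, 1],
    # Obs[s, o] = p(o | s), mus[t, s] = baseline distribution mu_t.
    S, A, _ = P.shape
    O = Obs.shape[1]
    policies = np.zeros((T, O), dtype=int)  # policies[t, o] = action index
    V = np.zeros(S)                         # value-to-go of (pi_{t+1}, ...)
    for t in range(T - 1, -1, -1):
        Q = R[:, None] / T + P @ V          # Q[s, a] = V_{a, pi_{t+1}, ...}(s)
        # Update (2): pi_t(o) = argmax_a E_{s ~ mu_t}[ p(o|s) Q(s, a) ]
        score = np.einsum('s,so,sa->oa', mus[t], Obs, Q)
        policies[t] = np.argmax(score, axis=1)
        # Value of the chosen reactive policy, per underlying state:
        V = (Obs * Q[np.arange(S)[:, None], policies[t]]).sum(axis=1)
    return policies
```

Each backup touches only S, A, and O sized arrays, which is the source of the polynomial running time claimed above.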
Ideally, it is the distribution provided by an optimal policy for ND that optimally specifies this tradeoff.\nThis result does not contradict the NP-hardness results, because it requires that a good baseline distribution μ be provided to the algorithm. However, if μ is the future state distribution of the optimal policy in ND, then PSDP returns an optimal policy for this class in polynomial time.\nFurthermore, if the state space is prohibitively large to perform the exact update in equation (2), then Monte Carlo integration may be used to evaluate the expectation over the state space. This leads to an ε-approximate version of PSDP, where one can obtain an algorithm with no dependence on the size of the state space and a polynomial dependence on the number of observations, T, and 1/ε (see discussion in [4]).\n\n4.2 Action-value approximation\n\nPSDP can also be efficiently implemented if it is possible to efficiently find an approximate action-value function Ṽa,πt+1,...,πT−1(s), i.e., if at each timestep\n\nε ≥ E_{s∼μt}[max_{a∈A} |Ṽa,πt+1,...,πT−1(s) − Va,πt+1,...,πT−1(s)|].\n\n(Recall that the policy sequence (a, πt+1, . . . , πT−1) always begins by taking action a.) If the policy πt is greedy with respect to the action value Ṽa,πt+1,...,πT−1(s), then it follows immediately from Theorem 1 that our policy value differs from the optimal one by at most 2Tε plus the μ-dependent variational penalty term. It is important to note that this error is phrased in terms of an average error over the state space, as opposed to the worst-case errors over the state space that are more standard in RL. We can intuitively grasp this by observing that value-iteration-style algorithms may amplify any small error in the value function by pushing more probability mass through where these errors are. 
PSDP, however, as it does not use value function backups, cannot make this same error; the use of the computed policies in the future keeps it honest. There are numerous efficient regression algorithms that can minimize this average-case error, or approximations to it.\n\n4.3 Linear policy MDPs\n\nWe now examine in detail a particular policy search example in which we have a two-action MDP, and a linear policy class is used. This case is interesting because, if the term E_{s∼μt}[Vπ,πt+1,...,πT−1(s)] (from the maximization step in the algorithm) can be nearly maximized by some linear policy π, then a good approximation to π can be found.\nLet A = {a1, a2}, and Π = {πθ : θ ∈ Rn}, where πθ(s) = a1 if θᵀφ(s) ≥ 0, and πθ(s) = a2 otherwise. Here, φ(s) ∈ Rn is a vector of features of the state s. Consider the maximization step in the PSDP algorithm. Letting 1{·} be the indicator function (1{True} = 1, 1{False} = 0), we have the following algorithm for performing the maximization:\n\nAlgorithm 2 (Linear maximization) Given m1 and m2:\n\nfor i = 1 to m1\n\nSample s(i) ∼ μt.\nUse m2 Monte Carlo samples to estimate Va1,πt+1,...,πT−1(s(i)) and Va2,πt+1,...,πT−1(s(i)). Call the resulting estimates q1 and q2.\nLet y(i) = 1{q1 > q2}, and w(i) = |q1 − q2|.\n\nFind θ = arg min_θ Σ_{i=1}^{m1} w(i) 1{1{θᵀφ(s(i)) ≥ 0} ≠ y(i)}.\nOutput πθ.\n\nIntuitively, the algorithm does the following: It samples m1 states s(1), . . . , s(m1) from the distribution μt. Using m2 Monte Carlo samples, it determines whether action a1 or action a2 is preferable from each sampled state, and creates a “label” y(i) for that state accordingly. Finally, it tries to find a linear decision boundary separating the states from which a1 is better from the states from which a2 is better. 
Further, the “importance” or “weight” w(i) assigned to s(i) is proportional to the difference in the values of the two actions from that state.\nThe final maximization step can be approximated via a call to any standard supervised learning algorithm that tries to find linear decision boundaries, such as a support vector machine or logistic regression. In some of our experiments, we use a weighted logistic regression to perform this maximization. Alternatively, using linear programming, it is possible to approximate this maximization with a guarantee. Let\n\nT(θ) = Σ_{i=1}^{m1} w(i) 1{1{θᵀφ(s(i)) ≥ 0} ≠ y(i)}\n\nbe the objective in the minimization.\n\nFigure 1: Illustrations of mazes: (a) Hallway (b) McCallum’s Maze (c) Sutton’s Maze\n\nIf there is a value of θ that satisfies T(θ) = 0, then it can be found via linear programming. Specifically, for each value of i, we add the constraint\n\nθᵀφ(s(i)) > κ if y(i) = 1;  θᵀφ(s(i)) < −κ otherwise,\n\nwhere κ is any small positive constant. In the case in which these constraints cannot be simultaneously satisfied, it is NP-hard to find arg min_θ T(θ) [1]. However, the optimal value can be approximated. Specifically, if θ* = arg min_θ T(θ), then [1] presents a polynomial time algorithm that finds θ so that\n\nT(θ) ≤ (n + 1) T(θ*).\n\nHere, n is the dimension of θ. Therefore, if there is a linear policy that does well, we also find a policy that does well. 
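A minimal sketch of Algorithm 2 with the weighted logistic regression surrogate (cf. footnote 3) might look as follows. The helpers `sample_state` (draws s ∼ μt) and `value_estimate` (a Monte Carlo estimate of the action value) are hypothetical callables, and all names and hyperparameters here are our own, not the paper's.

```python
import numpy as np

def weighted_logistic_fit(Phi, y, w, lr=0.5, iters=2000):
    # Minimize -sum_i w_i log p(y_i | s_i, theta) with
    # p(y=1 | s, theta) = 1 / (1 + exp(-theta^T phi(s))),
    # a convex upper bound on the weighted 0/1 objective T(theta).
    # Plain gradient descent; step size and iteration count are arbitrary.
    theta = np.zeros(Phi.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Phi @ theta))
        theta -= lr * (Phi.T @ (w * (p - y))) / len(y)
    return theta

def linear_maximization(sample_state, value_estimate, m1):
    # Algorithm 2 sketch: label each sampled state by its preferred action
    # (action index 0 plays the role of a1) and weight it by the value gap.
    Phi, y, w = [], [], []
    for _ in range(m1):
        s = sample_state()
        q1, q2 = value_estimate(s, 0), value_estimate(s, 1)
        Phi.append(s)
        y.append(1.0 if q1 > q2 else 0.0)   # label: is a1 preferable?
        w.append(abs(q1 - q2))              # importance weight
    theta = weighted_logistic_fit(np.array(Phi), np.array(y), np.array(w))
    return theta                            # pi_theta(s) = a1 iff theta^T phi(s) >= 0
```

States where the two actions have nearly equal value get weight near zero, so misclassifying them barely matters, which is exactly the behavior the weighted 0/1 objective asks for.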
(Conversely, if there is no linear policy that does well, i.e., if T(θ*) above were large, then the bound would be very loose; however, in this setting there is no good linear policy, and hence we arguably should not be using a linear policy anyway, or should consider adding more features.)\n\n5 Experiments\n\nThe experiments below demonstrate each of the instantiations described previously.\n\n5.1 POMDP gridworld example\n\nHere we apply PSDP to some simple maze POMDPs (Figure 1) to demonstrate its performance. In each, the robot can move in any of the four cardinal directions. Except in Sutton’s maze (Figure 1c), the observation at each grid cell is simply the set of directions in which the robot can freely move. The goal in each maze is to reach the circled grid cell in the minimum total number of steps from each starting cell.\nFirst we consider the hallway maze in Figure 1a. The robot here is confounded by all the middle states appearing the same, and the optimal stochastic policy must take time at least quadratic in the length of the hallway to ensure it gets to the goal from both sides. PSDP deduces a non-stationary deterministic policy with much better performance: first clear the left half of the maze by always traveling right, and then the right half by always traveling left.\nMcCallum’s maze (Figure 1b) is discussed in the literature as admitting no satisficing deterministic reactive policy. When one allows non-stationary policies, however, solutions do exist: PSDP provides a policy with 55 total steps to goal. In our final benchmark, Sutton’s maze (Figure 1c), the observations are determined by the openness of all eight connected directions.\nBelow we summarize the total number of steps to goal of our algorithm as compared with optimality for two classes of policy. Column 1 denotes PSDP performance using a uniform baseline distribution. 
The next column lists the performance of iterating PSDP, starting initially with a uniform baseline μ and then computing with a new baseline μ′ based on the previously constructed policy.2 Column 3 corresponds to the optimal stationary deterministic policy, while the final column gives the best theoretically achievable performance given arbitrary memory. It is worthwhile to note that the PSDP computations are very fast in all of these problems, taking well under a second in an interpreted language.\n\n            μ uniform   μ iterated   Optimal SD   Optimal\nHallway         21          21           ∞           18\nMcCallum        55          48           ∞           39\nSutton         412         412          416        ≥ 408\n\n2It can be shown that this procedure of refining μ based on previously learned policies will never decrease performance.\n\n5.2 Robot walking\n\nOur work is related in spirit to Atkeson and Morimoto [2], which describes a differential dynamic programming (DDP) algorithm that learns quadratic value functions along trajectories. These trajectories, which serve as an analog of our μ distribution, are then refined using the resulting policies. A central difference is their use of value function backups as opposed to policy backups. In tackling the control problem presented in [2], we demonstrate ways in which PSDP extends that work.\n[2] considers a planar biped robot that walks along a bar. The robot has two legs and a motor that applies torque where they meet. As the robot lacks knees, it walks by essentially brachiating (upside-down); a simple mechanism grabs the bar as a foot swings into position. 
The robot (excluding the position horizontally along the bar) can be described by a 5-dimensional state space using angles and angular velocities from the foot grasping the bar. The control variable that needs to be determined is the hip torque.\nIn [2], significant manual “cost-function engineering” or “shaping” of the rewards was used to achieve walking at a fixed speed. Much of this is due to the limitations of differential dynamic programming, in which cost functions must always be locally quadratic. This rules out natural cost functions that directly penalize, for example, falling. As this limitation does not apply to our algorithm, we used a cost function that rewards the robot for each time-step it remains upright. In addition, we quadratically penalize deviation from the nominal horizontal velocity of 0.4 m/s and the control effort applied.\nSamples of μ are generated in the same way [2] generates initial trajectories, using a parametric policy search. For our policy, we approximated the action-value function with a locally-weighted linear regression. PSDP’s policy significantly improves performance over the parametric policy search; while both keep the robot walking, we note that PSDP incurs 31% less cost per step.\nDDP makes strong, perhaps unrealistic, assumptions about the observability of state variables. PSDP, in contrast, can learn policies with limited observability. By hiding state variables from the algorithm, this control problem demonstrates PSDP’s leveraging of non-stationarity and its ability to cope with partial observability. PSDP can make the robot walk without any observations; open-loop control is sufficient to propel the robot, albeit at a significant reduction in performance and robustness. In Figure 2 we see the signal generated by the learned open-loop controller. 
This complex torque signal would be identical for arbitrary initial conditions, modulo sign-reversals, as the applied torque at the hip is inverted from the control signal whenever the stance foot is switched.\n\n5.3 Double-pole balancing\n\nOur third problem, double-pole balancing, is similar to the standard inverted pendulum problem, except that two unactuated poles, rather than a single one, are attached to the cart, and it is our task to simultaneously keep both of them balanced. This makes the task significantly harder than the standard single-pole problem.\nUsing the simulator provided by [3], we implemented PSDP for this problem. The state variables were the cart position x; the cart velocity ẋ; the two poles’ angles φ1 and φ2; and the poles’ angular velocities φ̇1 and φ̇2. The two actions are to accelerate left and to accelerate right.\n\nFigure 2: (Left) Control signal from the open-loop learned controller. (Right) Resulting angle of one leg. The dashed line in each indicates which foot is grasping the bar at each time.\n\nWe used a linear policy class Π as described previously, and φ(s) = [x, ẋ, φ1, φ̇1, φ2, φ̇2]ᵀ. 
By symmetry of the problem, a constant intercept term was unnecessary; leaving out the intercept enforces that if a1 is the better action in some state s, then a2 should be taken in the state −s.\nThe algorithm we used for the optimization step was logistic regression.3 The baseline distribution μ that we chose was a zero-mean multivariate Gaussian distribution over all the state variables. Using a horizon of T = 2000 steps and 5000 Monte Carlo samples per iteration of the PSDP algorithm, we are able to successfully balance both poles.\nAcknowledgments. We thank Chris Atkeson and John Langford for helpful conversations. J. Bagnell is supported by an NSF graduate fellowship. This work was also supported by NASA, and by the Department of the Interior/DARPA under contract number NBCH1020014.\n\nReferences\n\n[1] E. Amaldi and V. Kann. On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems. Theoretical Comp. Sci., 1998.\n\n[2] C. Atkeson and J. Morimoto. Non-parametric representation of policies and value functions: A trajectory-based approach. In NIPS 15, 2003.\n\n[3] F. Gomez. http://www.cs.utexas.edu/users/nn/pages/software/software.html.\n\n[4] Sham Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, University College London, 2003.\n\n[5] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proc. 19th International Conference on Machine Learning, 2002.\n\n[6] Michael Kearns, Yishay Mansour, and Andrew Y. Ng. Approximate planning in large POMDPs via reusable trajectories. (Extended version of paper in NIPS 12), 1999.\n\n[7] M. Littman. Memoryless policies: theoretical limitations and practical results. In Proc. 3rd Conference on Simulation of Adaptive Behavior, 1994.\n\n[8] Andrew Y. Ng and Michael I. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Proc. 16th Conf. 
Uncertainty in Artificial Intelligence, 2000.\n\n[9] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.\n\n3In our setting, we use weighted logistic regression and minimize −ℓ(θ) = −Σi w(i) log p(y(i)|s(i), θ), where p(y = 1|s, θ) = 1/(1 + exp(−θᵀφ(s))). It is straightforward to show that this is a (convex) upper bound on the objective function T(θ).\n\n\f", "award": [], "sourceid": 2378, "authors": [{"given_name": "J.", "family_name": "Bagnell", "institution": null}, {"given_name": "Sham", "family_name": "Kakade", "institution": null}, {"given_name": "Jeff", "family_name": "Schneider", "institution": null}, {"given_name": "Andrew", "family_name": "Ng", "institution": null}]}