{"title": "Random Sampling of States in Dynamic Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 33, "page_last": 40, "abstract": "We combine three threads of research on approximate dynamic programming: sparse random sampling of states, value function and policy approximation using local models, and using local trajectory optimizers to globally optimize a policy and associated value function. This combination allows us to replace a dense multidimensional grid with a much sparser adaptive sampling of states. Our focus is on finding steady state policies for the deterministic time invariant discrete time control problems with continuous states and actions often found in robotics. In this paper we show that we can now solve problems we couldn't solve previously with regular grid-based approaches.", "full_text": "Random Sampling of States in Dynamic Programming

Christopher G. Atkeson and Benjamin Stephens
Robotics Institute, Carnegie Mellon University
cga@cmu.edu, bstephens@cmu.edu
www.cs.cmu.edu/~cga, www.cs.cmu.edu/~bstephe1

Abstract

We combine three threads of research on approximate dynamic programming: sparse random sampling of states, value function and policy approximation using local models, and using local trajectory optimizers to globally optimize a policy and associated value function. Our focus is on finding steady state policies for deterministic time invariant discrete time control problems with continuous states and actions often found in robotics. In this paper we show that we can now solve problems we couldn't solve previously.

1 Introduction

Optimal control provides a potentially useful methodology to design nonlinear control laws (policies) u = u(x) which give the appropriate action u for any state x. Dynamic programming provides a way to find globally optimal control laws, given a one step cost (a.k.a. "reward" or "loss") function and the dynamics of the problem to be optimized.
We focus on control problems with continuous states and actions, deterministic time invariant discrete time dynamics x_{k+1} = f(x_k, u_k), and a time invariant one step cost function L(x, u). Policies for such time invariant problems will also be time invariant. We assume we know the dynamics and one step cost function. Future work will address simultaneously learning a dynamic model, finding a robust policy, and performing state estimation with an erroneous partially learned model. One approach to dynamic programming is to approximate the value function V(x) (the optimal total future cost from each state, V(x) = ∑_{k=0}^∞ L(x_k, u_k)), and to repeatedly solve the Bellman equation V(x) = min_u (L(x, u) + V(f(x, u))) at sampled states x until the value function estimates have converged to globally optimal values. We explore approximating the value function and policy using many local models.

An example problem: We use one link pendulum swingup as an example problem in this introduction to provide the reader with a visualizable example of a value function and policy. In one link pendulum swingup a motor at the base of the pendulum swings a rigid arm from the downward stable equilibrium to the upright unstable equilibrium and balances the arm there (Figure 1). What makes this challenging is that the one step cost function penalizes the amount of torque used and the deviation of the current position from the goal. The controller must try to minimize the total cost of the trajectory. The one step cost function for this example is a weighted sum of the squared position error (θ: difference between the current angle and the goal angle) and the squared torque τ: L(x, u) = 0.1θ²T + τ²T, where 0.1 weights the position error relative to the torque penalty, and T is the time step of the simulation (0.01s).
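This one step cost is simple enough to state directly in code. The following is an illustrative sketch; the function name and signature are ours, not from the paper:

```python
T = 0.01  # simulation time step (seconds)

def one_step_cost(theta, theta_dot, tau):
    """One step cost L(x, u) = 0.1 * theta^2 * T + tau^2 * T.

    theta is the angle error from the upright goal; theta_dot carries
    no cost because no penalty is assigned to joint velocity.
    """
    return 0.1 * theta ** 2 * T + tau ** 2 * T

# The total cost of a trajectory is the sum of one step costs along it.
```

The total trajectory cost the controller minimizes is then the sum of `one_step_cost` over every time step of the swingup.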
There are no costs associated with the joint velocity. Figure 2 shows the value function and policy generated by dynamic programming.

Figure 1: Configurations from the simulated one link pendulum optimal trajectory every half a second and at the end of the trajectory.

One important thread of research on approximate dynamic programming is developing representations that adapt to the problem being solved and extend the range of problems that can be solved with a reasonable amount of memory and time. Random sampling of states has been proposed by a number of researchers [1, 2, 3, 4, 5, 6, 7]. In our case we add new randomly selected states as we solve the problem, allowing the "grid" that results to reflect the local complexity of the value function as we generate it.
Figure 2:right shows such a randomly generated set of states superimposed on a contour plot of the value function for one link swingup.

Another important thread in our work on applied dynamic programming is developing ways for grids or random samples to be as sparse as possible. One technique that we apply here is to represent full trajectories from each sampled state to the goal, and to refine each trajectory using local trajectory optimization [8]. Figure 2:right shows a set of optimized trajectories from the sampled states to the goal. One key aspect of the local trajectory optimizer we use is that it provides a local quadratic model of the value function and a local linear model of the policy at the sampled state. These local models help our function approximators handle sparsely sampled states. To obtain globally optimal solutions, we incorporate exchange of information between non-neighboring sampled states.

On what problems will the proposed approach work? We believe our approach can discover underlying simplicity in many typical problems. An example of a problem that appears complex but is actually simple is a problem with linear dynamics and a quadratic one step cost function. Dynamic programming can be done for linear quadratic regulator (LQR) problems even with hundreds of dimensions and it is not necessary to build a grid of states [9]. The cost of representing the value function is quadratic in the dimensionality of the state. The cost of performing a "sweep" or update of the value function is at most cubic in the state dimensionality. Continuous states and actions are easy to handle. Perhaps many problems, such as the examples in this paper, have simplifying characteristics similar to LQR problems. For example, problems that are only "slightly" nonlinear and have a locally quadratic cost function may be solvable with quite sparse representations.
One goal of our work is to develop methods that do not immediately build a hugely expensive representation if it is not necessary, and that attempt to harness simple and inexpensive parallel local planning to solve complex planning problems. Another goal of our work is to develop methods that can take advantage of situations where only a small amount of global interaction is necessary to enable local planners capable of solving local problems to find globally optimal solutions.

2 Related Work

Random state selection: Random grids and random sampling are well known in numerical integration, finite element methods, and partial differential equations. Rust applied random sampling of states to dynamic programming [1, 10]. He showed that random sampling of states can avoid the curse of dimensionality for stochastic dynamic programming problems with a finite set of discrete actions. This theoretical result focused on the cost of computing the expectation term in the stochastic version of the Bellman equation. [11] claim the assumptions used in [1] are unrealistically restrictive, and [12] point out that the complexity of Rust's approach is proportional to the Lipschitz constant of the problem data, which often increases exponentially with increasing dimensions. The practicality and usefulness of random sampling of states in deterministic dynamic programming with continuous actions (the focus of our paper) remains an open question. We note that deterministic problems are usually more difficult to solve since the random element in the stochastic dynamics smooths the dynamics and makes them easier to sample.
Alternatives to random sampling of states are irregular or adaptive grids [13], but in our experience they still require too many representational resources as the problem dimensionality increases.

In reinforcement learning random sampling of states is sometimes used to provide training data for function approximation of the value function. Reinforcement learning also uses random exploration for several purposes. In model-free approaches exploration is used to find actions and states that lead to better outcomes. This process is somewhat analogous to the random state sampling described in this paper for model-based approaches. In model-based approaches, exploration is also used to improve the model of the task. In our paper it is assumed a model of the task is available, so this type of exploration is not necessary.

Figure 2: Left and Middle: The value function and policy for a one link pendulum swingup. The optimal trajectory is shown as a yellow line in the value function plot, and as a black line with a yellow border in the policy plot.
The value function is cut off above 20 so we can see the details of the part of the value function that determines the optimal trajectory. The goal is at the state (0,0). Right: Random states (dots) and trajectories (black lines) used to plan one link swingup, superimposed on a contour map of the value function.

In the field of Partially Observable Markov Decision Processes (POMDPs) there has been some work on randomly sampling belief states, and also using local models of the value function and its first derivative at each randomly sampled belief state (for example [2, 3, 4, 5, 6, 7]). Thrun explored random sampling of belief states where the underlying states and actions were continuous [7]. He used a nearest neighbor scheme to perform value function interpolation, and a coverage test to decide whether to accept a new random state (is a new random state far enough from existing states?) rather than a surprise test (is the value of the new random state predicted incorrectly?).

In robot planning for obstacle avoidance random sampling of states is now quite popular [14]. Probabilistic Road Map (PRM) methods build a graph of plans between randomly selected states. Rapidly Exploring Random Trees (RRTs) grow paths or trajectories towards randomly selected states. In general it is difficult to modify PRM and RRT approaches to find optimal paths, and the resulting algorithms based on RRTs are very similar to A* search.

3 Combining Random State Sampling With Local Optimization

The process of using the Bellman equation to update a representation of the value function by minimizing over all actions at a state is referred to as value iteration. Standard value iteration represents the value function and associated policy using multidimensional tables, with each entry in the table corresponding to a particular state.
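Standard tabular value iteration can be sketched on a toy problem. The deterministic chain world below is a hypothetical stand-in for illustration, not the pendulum problem; sweeps of Bellman backups are repeated until the table stops changing:

```python
# Minimal tabular value iteration on a toy deterministic problem:
# a 1-D chain of states with a goal at state 0 (illustrative only).

N_STATES = 6
ACTIONS = (-1, +1)          # move left or right
GOAL = 0

def step(s, a):
    """Deterministic dynamics f(s, a), clipped to the chain."""
    return min(max(s + a, 0), N_STATES - 1)

def cost(s, a):
    """One step cost L(s, a): 1 per step until the goal is reached."""
    return 0.0 if s == GOAL else 1.0

V = [0.0] * N_STATES
for _ in range(100):        # sweeps until the values converge
    delta = 0.0
    for s in range(N_STATES):
        if s == GOAL:
            continue
        # Bellman equation: V(s) = min_a [ L(s, a) + V(f(s, a)) ]
        new_v = min(cost(s, a) + V[step(s, a)] for a in ACTIONS)
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < 1e-9:
        break

print(V)  # each entry is the number of steps from that state to the goal
```

The table-per-state storage shown here is exactly what the sampled local models in this section are designed to replace.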
In our approach we randomly select states, and associate with each state a local quadratic model of the value function and a local linear model of the policy. Our approach generalizes value iteration, and has the following components: 1. There is a "global" function approximator for both the value function and the policy. In our current implementation the value function and policy are represented through a combination of sampled and parametric representations, building global approximations by combining local models. 2. It is possible to estimate the value of a state in two ways. The first is to use the approximated value function. The second is our analog of using the Bellman equation: use the cost of a trajectory starting from the state under consideration and following the current global policy. The trajectory is optimized using local trajectory optimization. 3. As in a Bellman update, there is a way to globally optimize the value of a state by considering many possible "actions". In our approach we consider many local policies associated with different stored states.

Taking advantage of goal states: For problems with goal states there are several ways to speed up convergence. In cases where LQR techniques apply [9], we use the policy obtained by solving the corresponding LQR control problem at the goal as the default policy everywhere, to which the policy computed by dynamic programming is added. [15] plots an example of a default policy and the policy generated by dynamic programming for comparison. We limit the outputs of this default policy. In setting up the goal LQR controller, a radius is established and tested within which the goal LQR controller always works and achieves close to the predicted optimal cost. This has the effect of enlarging the goal. If the dynamic programming process can get within the LQR radius of the goal, it can use only the default policy to go the rest of the way.
If it is not possible to create a goal LQR controller due to a hard nonlinearity, or if there is no goal state, the goal controller can simply be omitted, since it merely accelerates the solution process. The proposed technique can be generalized in a straightforward way to use any default goal policy. In this paper the swingup problems use an LQR default policy, which was limited for each action dimension to ±5Nm. For the balance problem we did not use a default policy. We note that for the swingup problems shown here the default LQR policy is capable of balancing the inverted pendulum at the goal, but is not capable of swinging up the pendulum to the goal.

We also initially only generate the value function and policy in the region near the goal. This solved region is gradually increased in size by increasing a value function threshold. Examples of regions bounded by a constant value are shown by the value function contours in Figure 2. [16] describes how to handle periodic tasks which have no goal states, and also discontinuities in the dynamics.

Local models of the value function and policy: We need to represent value functions as sparsely as possible. We propose a hybrid tabular and parametric approach: parametric local models of the value function and policy are represented at sampled locations. This representation is similar to using many Taylor series approximations of a function at different points. At each sampled state x^p the local quadratic model for the value function is:

V^p(x) ≈ V^p_0 + V^p_x x̂ + (1/2) x̂^T V^p_xx x̂    (1)

where x̂ = x − x^p is the vector from the stored state x^p, V^p_0 is the constant term of the local model, V^p_x is the first derivative of the local model (and the value function) at x^p, and V^p_xx is the second derivative of the local model (and the value function) at x^p.
The local linear model for the policy is:

u^p(x) = u^p_0 − K^p x̂    (2)

where u^p_0 is the constant term of the local policy, and K^p is the first derivative of the local policy and also the gain matrix for a local linear controller.

Creating the local model: These local models of the value function can be created using Differential Dynamic Programming (DDP) [17, 18, 8, 16]. This local trajectory optimization process is similar to linear quadratic regulator design in that a local model of the value function is produced. In DDP, value function and policy models are produced at each point along a trajectory. Suppose at a point (x^i, u^i) we have 1) a local second order Taylor series approximation of the optimal value function:

V^i(x) ≈ V^i_0 + V^i_x x̂ + (1/2) x̂^T V^i_xx x̂

where x̂ = x − x^i; 2) a local second order Taylor series approximation of the robot dynamics, which can be learned using local models of the dynamics (f^i_x and f^i_u correspond to A and B of the linear plant model used in linear quadratic regulator (LQR) design):

x_{k+1} = f^i(x, u) ≈ f^i_0 + f^i_x x̂ + f^i_u û + (1/2) x̂^T f^i_xx x̂ + x̂^T f^i_xu û + (1/2) û^T f^i_uu û

where û = u − u^i; and 3) a local second order Taylor series approximation of the one step cost, which is often known analytically for human specified criteria (L_xx and L_uu correspond to Q and R of LQR design):

L^i(x, u) ≈ L^i_0 + L^i_x x̂ + L^i_u û + (1/2) x̂^T L^i_xx x̂ + x̂^T L^i_xu û + (1/2) û^T L^i_uu û

Given a trajectory, one can integrate the value function and its first and second spatial derivatives backwards in time to compute an improved value function and policy.
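Evaluating the local models of Equations (1) and (2) at a query state takes only a few lines. The sketch below uses plain Python lists and a scalar action; the function names are ours, not the paper's:

```python
# Evaluate a stored state's local quadratic value model (Eq. 1) and
# local linear policy (Eq. 2) at a query state x (illustrative sketch).

def vdot(a, b):
    """Dot product of two equal-length vectors stored as lists."""
    return sum(x * y for x, y in zip(a, b))

def local_value(x, xp, V0, Vx, Vxx):
    """V^p(x) ~ V0 + Vx . xhat + 0.5 * xhat^T Vxx xhat, xhat = x - xp."""
    xhat = [xi - xpi for xi, xpi in zip(x, xp)]
    n = len(xhat)
    quad = sum(xhat[i] * Vxx[i][j] * xhat[j]
               for i in range(n) for j in range(n))
    return V0 + vdot(Vx, xhat) + 0.5 * quad

def local_policy(x, xp, u0, K):
    """u^p(x) = u0 - K xhat (scalar action; K is a row vector)."""
    xhat = [xi - xpi for xi, xpi in zip(x, xp)]
    return u0 - vdot(K, xhat)
```

A global estimate at x can then be formed by querying the stored state nearest to x, which is the nearest-neighbor scheme used later in this section.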
We utilize the "Q function" notation from reinforcement learning: Q(x, u) = L(x, u) + V(f(x, u)). The backward sweep takes the following form (in discrete time):

Q^i_x = L^i_x + V^i_x f^i_x;    Q^i_u = L^i_u + V^i_x f^i_u    (3)

Q^i_xx = L^i_xx + V^i_x f^i_xx + (f^i_x)^T V^i_xx f^i_x;    Q^i_uu = L^i_uu + V^i_x f^i_uu + (f^i_u)^T V^i_xx f^i_u    (4)

Q^i_ux = L^i_ux + V^i_x f^i_ux + (f^i_u)^T V^i_xx f^i_x    (5)

Δu^i = (Q^i_uu)^{−1} Q^i_u;    K^i = (Q^i_uu)^{−1} Q^i_ux;    V^{i−1}_x = Q^i_x − Q^i_u K^i;    V^{i−1}_xx = Q^i_xx − Q^i_xu K^i    (6)

where subscripts indicate derivatives and superscripts indicate the trajectory index. After the backward sweep, forward integration can be used to update the trajectory itself: u^i_new = u^i − Δu^i − K^i(x^i_new − x^i). We note that the cost of this approach grows at most cubically rather than exponentially with respect to the dimensionality of the state.

In problems that have a goal state, we can generate a trajectory from each stored state all the way to the goal. The cost of this trajectory is an upper bound on the true value of the state, and is used to bound the estimated value for the old state.

Utilizing the local models: For the purpose of explaining our algorithm, let's assume we already have a set of sampled states, each of which has a local model of the value function and the policy. How should we use these multiple local models? The simplest approach is to just use the predictions of the nearest sampled state, which is what we currently do.
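For scalar state and action, one step of the backward sweep in Equations (3) to (6) reduces to ordinary arithmetic, since the matrix inverse becomes a division. The sketch below is a simplified illustration under that scalar assumption, with names of our choosing:

```python
# One backward-sweep step (Eqs. 3-6) for scalar state and action.
# Inputs are the local cost, dynamics, and value derivatives at step i;
# outputs include the value derivatives at step i-1 (illustrative sketch).

def backward_step(Lx, Lu, Lxx, Luu, Lux,
                  fx, fu, fxx, fuu, fux,
                  Vx, Vxx):
    Qx = Lx + Vx * fx                      # Eq. (3)
    Qu = Lu + Vx * fu
    Qxx = Lxx + Vx * fxx + fx * Vxx * fx   # Eq. (4)
    Quu = Luu + Vx * fuu + fu * Vxx * fu
    Qux = Lux + Vx * fux + fu * Vxx * fx   # Eq. (5)
    du = Qu / Quu                          # Eq. (6): open-loop correction
    K = Qux / Quu                          # feedback gain
    Vx_prev = Qx - Qu * K                  # value derivatives one step back
    Vxx_prev = Qxx - Qux * K               # (Q_xu = Q_ux in the scalar case)
    return du, K, Vx_prev, Vxx_prev
```

With zero second derivatives of the dynamics this collapses to the familiar discrete-time LQR/Riccati backward step, matching the LQR correspondence noted above.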
We use a kd-tree to efficiently find nearest neighbors, but there are many other approaches that will find nearby stored states efficiently. In the future we will investigate using other methods to combine local model predictions from nearby stored states: distance weighted averaging (kernel regression), linear locally weighted regression, and quadratic locally weighted regression for value functions.

Creating new random states: For tasks with a goal state, we initialize the set of stored states by storing the goal state itself. We have explored a number of distributions to select additional states from: uniform within bounds on the states; Gaussian with the mean at the goal; sampling near existing states; and sampling from an underlying low resolution regular grid. The uniform approach is a useful default, which we use in the swingup examples; the Gaussian approach provides a simple way to tune the distribution; sampling near existing states provides a way to efficiently sample while growing the solved region in high dimensions; and sampling from an underlying low resolution grid seems to perform well when only a small number of stored states are used (similar to using low dispersion sequences [1, 14]). A key point of our approach is that we do not generate the random states in advance but instead select them as the algorithm progresses. This allows us to apply an acceptance criterion to candidate states, which we describe below.
We have also explored changing the distribution we generate candidate states from as the algorithm progresses, for example using a mixture of Gaussians with the Gaussians centered on existing stored states. Another reasonable hybrid approach would be to initially sample from a grid, and then bias more general sampling to regions of higher value function approximation error.

Acceptance criteria for candidate states: We have several criteria to accept or reject states to be permanently stored. In the future we will explore "forgetting" or removing stored states, but at this point we apply all memory control techniques at the storage event. To focus the search and limit the volume considered, a steadily increasing value limit (V_limit) is maintained, which is increased slightly after each use. The approximated value function is used to predict the value of the candidate state. If the prediction is above V_limit, the candidate state is rejected. Otherwise, a trajectory is created from the candidate state using the current approximated policy, and then locally optimized. If the value of that trajectory is above V_limit, the candidate state is rejected. If the value of the trajectory is within 10% of the predicted value, the candidate state is rejected. Only "surprises" are stored. For problems with a goal state, if the trajectory does not reach the goal the candidate state is rejected. Other criteria, such as an A*-like criterion (cost-to-go(x) + cost-from-start(x) > threshold), can be used to reject candidate states. All of the thresholds mentioned can be changed as the algorithm progresses. For example, V_limit is gradually increased during the solution process, to increase the volume considered by the algorithm. We currently use a 10% "surprise" threshold.
In future work we will explore starting with a larger threshold and decreasing this threshold with time, to further reduce the number of samples accepted and stored while improving convergence. It is possible to take the distance to the nearest sampled state into account in the acceptance criteria for new samples. The common approach of accepting states beyond a distance threshold enforces a minimum resolution, and leads to potentially severe curse of dimensionality effects. Rejecting states that are too close to existing states will increase the error in representing the value function, but may be a way of preventing too many samples near complex regions of the value function that have little practical effect. For example, we often do not need much accuracy in representing the value function near policy discontinuities, where the value function has discontinuities in its spatial derivative and "creases". In these areas the trajectories typically move away from the discontinuities, and the details of the value function have little effect.

In the current implementation, after a candidate state is accepted, the state in the database whose local model was used to make the prediction is re-optimized including information from the newly added point, since the prediction was wrong and the new point's policy may lead to a better value for that state.

Creating a trajectory from a state: We create a trajectory from a candidate state or refine a trajectory from a stored state in the same way. The first step is to use the current approximated policy until the goal or a time limit is reached. In the current implementation this involves finding the stored state nearest to the current state in the trajectory and using its locally linear policy to compute the action on each time step. The second step is to locally optimize the trajectory.
Figure 3: Configurations from the simulated two link pendulum optimal swing up trajectory every fifth of a second and the end of the trajectory.

We use Differential Dynamic Programming (DDP) in the current implementation [17, 18, 8, 16]. In the current implementation we do not save the trajectory but only the local models from its start. If the cost of the trajectory is more than the currently stored value for the state, we reject the new value, as the values all come from actual trajectories and are upper bounds for the true value.
We always keep the lowest upper bound.

Combining parallel greedy local optimizers to perform global optimization: As currently described, the algorithm finds a locally optimal policy, but not necessarily a globally optimal policy. For example, the stored states could be divided into two sets of nearest neighbors. One set could have a suboptimal policy, but have no interaction with the other set of states that had a globally optimal policy since no nearest neighbor relations joined the two sets. We expect the locally optimal policies to be fairly good because we 1) gradually increase the solved volume and 2) use local optimizers. Given local optimization of actions, gradually increasing the solved volume will result in a globally optimal policy if the boundary of this volume never touches a non adjacent section of itself. Figure 2 shows the creases in the value function (discontinuities in the spatial derivative) and corresponding discontinuities in the policy that typically result when the constant cost contour touches a non adjacent section of itself as V_limit is increased.

In theory, the approach we have described will produce a globally optimal policy if it has infinite resolution and all the stored states form a densely connected set in terms of nearest neighbor relations [8]. By enforcing consistency of the local value function models across all nearest neighbor pairs, we can create a globally consistent value function estimate. Consistency means that any state's local model correctly predicts values of nearby states. If the value function estimate is consistent everywhere, the Bellman equation is solved and we have a globally optimal policy.
We can enforce consistency of nearest neighbor value functions by 1) using the policy of one state of a pair to reoptimize the trajectory of the other state of the pair and vice versa, and 2) adding more stored states in between nearest neighbors that continue to disagree [8]. This approach is similar to using the method of characteristics to solve partial differential equations and finding value functions for games.

In practice, we cannot achieve infinite resolution. To increase the likelihood of finding a globally optimal policy with a limited resolution of stored states, we need an analog to exploration and to global minimization with respect to actions found in the Bellman equation. We approximate this process by periodically reoptimizing each stored state using the policies of other stored states. As more and more states are stored, and many alternative stored states are considered in optimizing any given stored state, a wide range of actions are considered for each state. We run a reoptimization phase of the algorithm after every N (typically 100) states have been stored. There are several ways to design this reoptimization phase. Each state could use the policy of a nearest neighbor, or a randomly chosen neighbor with the distribution being distance dependent, or just choose another state randomly with no consideration of distance (what we currently do). [8] describes how to follow a policy of another stored state if its trajectory is stored, or can be recomputed as needed. In this work we explored a different approach that does not require each stored state to save its trajectory or recompute it. To "follow" the policy of another state, we follow the locally linear policy for that state until the trajectory begins to go away from the state. At that point we switch to following the globally approximated policy.
Since we apply this reoptimization process periodically with different randomly selected policies, over time we explore using a wide range of actions from each state.

Figure 4: Configurations from the simulated three link pendulum optimal trajectory every tenth of a second and at the end of the trajectory.

4 Results

In addition to the one link swingup example presented in the introduction, we present results on two link swingup (4 dimensional state) and three link swingup (6 dimensional state). A companion paper using these techniques to explore how multiple balance strategies can be generated from one optimization criterion is [19].
Further results, including some for a four link (8 dimensional state) standing robot, are presented.

One link pendulum swingup: For the one link swingup case, the random state approach found a globally optimal trajectory (the same trajectory found by our grid based approaches [15]) after adding only 63 random states. Figure 2 (right) shows the distribution of states and their trajectories superimposed on a contour map of the value function for one link swingup.

Two link pendulum swingup: For the two link swingup case, the random state approach finds what we believe is a globally optimal trajectory (the same trajectory found by our grid based approaches [15]) after storing an average of 12000 random states (Figure 3). In this case the state has four dimensions (a position and velocity for each joint) and the action has two dimensions (a torque at each joint). The one step cost function was a weighted sum of the squared position errors and the squared torques: L(x, u) = 0.1(θ1² + θ2²)T + (τ1² + τ2²)T, where 0.1 weights the position errors relative to the torque penalty, T is the time step of the simulation (0.01s), and there were no costs associated with joint velocities. The approximately 12000 sampled states should be compared to the millions of states used in grid-based approaches: a 60x60x60x60 grid with almost 13 million states failed to find a trajectory as good as this one, while a 100x100x100x100 grid with 100 million states did find the same trajectory. In 13 runs with different random number generator seeds, the mean number of states stored at convergence was 11430. All but two of the runs converged after storing fewer than 13000 states, and all runs converged after storing 27000 states.

Three link pendulum swingup: For the three link swingup case, the random state approach found a good trajectory after storing less than 22000 random states (Figure 4).
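The one step cost in all of these swingup examples has the same quadratic form, so it can be written once for any number of links. A minimal sketch, using the 0.1 position weight and T = 0.01s time step given above; the function name and argument names are ours, not the paper's:

```python
import numpy as np

# One step cost from the swingup examples: a weighted sum of squared joint
# position errors and squared torques, scaled by the simulation time step:
#   L(x, u) = 0.1 * sum(theta_i^2) * T + sum(tau_i^2) * T
# The same form covers the one, two, and three link cases.

def one_step_cost(theta_err, tau, position_weight=0.1, dt=0.01):
    theta_err = np.asarray(theta_err, dtype=float)
    tau = np.asarray(tau, dtype=float)
    return (position_weight * np.sum(theta_err**2) + np.sum(tau**2)) * dt
```

For example, `one_step_cost([1.0], [2.0])` gives (0.1·1 + 4)·0.01 = 0.041 for a one link system one radian from the goal applying 2 units of torque.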
We have not yet solved this problem a sufficient number of times to be convinced this is a global optimum, and we do not have a solution based on a regular grid available for comparison. We were not able to solve this problem using regular grid-based approaches due to limited state resolution: a 22x22x22x22x38x44 grid, with 391,676,032 states, filled our largest memory. As in the previous examples, the one step cost function was a weighted sum of the squared position errors and the squared torques: L(x, u) = 0.1(θ1² + θ2² + θ3²)T + (τ1² + τ2² + τ3²)T.

5 Conclusion

We have combined random sampling of states and local trajectory optimization to create a promising approach to practical dynamic programming for robot control problems. We are able to solve problems we couldn't solve before due to memory limitations. Future work will optimize aspects and variants of this approach.

Acknowledgments

This material is based upon work supported in part by the DARPA Learning Locomotion Program and the National Science Foundation under grants CNS-0224419, DGE-0333420, ECS-0325383, and EEC-0540865.

References

[1] J. Rust. Using randomization to break the curse of dimensionality. Econometrica, 65(3):487–516, 1997.

[2] M. Hauskrecht. Incremental methods for computing bounds in partially observable Markov decision processes. In Proceedings of the 14th National Conference on Artificial Intelligence (AAAI-97), pages 734–739, Providence, Rhode Island, 1997. AAAI Press / MIT Press.

[3] N. L. Zhang and W. Zhang. Speeding up the convergence of value iteration in partially observable Markov decision processes. Journal of Artificial Intelligence Research, 14:29–51, 2001.

[4] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In International Joint Conference on Artificial Intelligence (IJCAI), 2003.

[5] T. Smith and R. Simmons.
Heuristic search value iteration for POMDPs. In Uncertainty in Artificial Intelligence, 2004.

[6] M. T. J. Spaan and N. Vlassis. A point-based POMDP algorithm for robot planning. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 2399–2404, New Orleans, Louisiana, April 2004.

[7] S. Thrun. Monte Carlo POMDPs. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 1064–1070. MIT Press, 2000.

[8] C. G. Atkeson. Using local trajectory optimizers to speed up global optimization in dynamic programming. In Jack D. Cowan, Gerald Tesauro, and Joshua Alspector, editors, Advances in Neural Information Processing Systems, volume 6, pages 663–670. Morgan Kaufmann Publishers, Inc., 1994.

[9] F. L. Lewis and V. L. Syrmos. Optimal Control, 2nd Edition. Wiley-Interscience, 1995.

[10] C. Szepesvári. Efficient approximate planning in continuous space Markovian decision problems. AI Communications, 13(3):163–176, 2001.

[11] J. N. Tsitsiklis and B. Van Roy. Regression methods for pricing complex American-style options. IEEE Transactions on Neural Networks, 12:694–703, July 2001.

[12] V. D. Blondel and J. N. Tsitsiklis. A survey of computational complexity results in systems and control. Automatica, 36(9):1249–1274, 2000.

[13] R. Munos and A. W. Moore. Variable resolution discretization in optimal control. Machine Learning, 49:291–323, 2002.

[14] S. M. LaValle. Planning Algorithms. Cambridge University Press, 2006.

[15] C. G. Atkeson. Randomly sampling actions in dynamic programming. In IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL), 2007.

[16] C. G. Atkeson and J. Morimoto. Nonparametric representation of policies and value functions: A trajectory-based approach. In Advances in Neural Information Processing Systems 15. MIT Press, 2003.

[17] P. Dyer and S. R. McReynolds.
The Computation and Theory of Optimal Control. Academic Press, New York, NY, 1970.

[18] D. H. Jacobson and D. Q. Mayne. Differential Dynamic Programming. Elsevier, New York, NY, 1970.

[19] C. G. Atkeson and B. Stephens. Multiple balance strategies from one optimization criterion. In Humanoids, 2007.