{"title": "Robust, Efficient, Globally-Optimized Reinforcement Learning with the Parti-Game Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 961, "page_last": 967, "abstract": null, "full_text": "Robust, Efficient, Globally-Optimized Reinforcement Learning with the Parti-Game Algorithm \n\nMohammad A. Al-Ansari and Ronald J. Williams \n\nCollege of Computer Science, 161 CN \n\nNortheastern University \n\nBoston, MA 02115 \n\nalansar@ccs.neu.edu, rjw@ccs.neu.edu \n\nAbstract \n\nParti-game (Moore 1994a; Moore 1994b; Moore and Atkeson 1995) is a reinforcement learning (RL) algorithm that shows considerable promise in overcoming the curse of dimensionality that can plague RL algorithms applied to high-dimensional problems. In this paper we introduce modifications to the algorithm that further improve its performance and robustness. In addition, while parti-game solutions can be improved locally by standard local path-improvement techniques, we introduce an add-on algorithm in the same spirit as parti-game that instead tries to improve solutions in a non-local manner. \n\n1 INTRODUCTION \n\nParti-game operates on goal problems by dynamically partitioning the space into hyperrectangular cells of varying sizes, represented using a k-d tree data structure. It assumes the existence of a pre-specified local controller that can be commanded to proceed from the current state to a given state. The algorithm uses a game-theoretic approach to assign costs to cells based on past experiences using a minimax algorithm. A cell's cost can be either a finite positive integer or infinity. The former represents the number of cells that have to be traveled through to get to the goal cell and the latter represents the belief that there is no reliable way of getting from that cell to the goal. 
Cells with a cost of infinity are called losing cells while the others are called winning ones. \n\nThe algorithm starts out with one cell representing the entire space and another, contained within it, representing the goal region. In a typical step, the local controller is commanded to proceed to the center of the most promising neighboring cell. Upon entering a neighboring cell (whether the one aimed at or not), or upon failing to leave the current cell within a timeout period, the result of this attempt is added to the database of experiences the algorithm has collected, cell costs are recomputed based on the updated database, and the process repeats. The costs are computed using a Dijkstra-like, one-pass minimax version of dynamic programming. The algorithm terminates upon entering the goal cell. \n\nFigure 1: In these mazes, the agent is required to start from the point marked Start and reach the square goal cell. \n\nIf at any point the algorithm determines that it cannot proceed because the agent is in a losing cell, each cell lying on the boundary between losing and winning cells is split across the dimension in which it is largest, and all experiences involving cells that are split are discarded. Since parti-game assumes, in the absence of evidence to the contrary, that from any given cell every neighboring cell is reachable, discarding experiences in this way encourages exploration of the newly created cells. \n\n2 PARTITIONING ONLY LOSING CELLS \n\nThe win-lose boundary mentioned above represents a barrier the algorithm perceives to be preventing the agent from reaching the goal. 
The reason behind partitioning cells along this boundary is to increase the resolution in the areas that are crucial to reaching the goal, creating more regions along the boundary for the agent to try to get through. By partitioning on both sides of the boundary, parti-game guarantees that neighboring cells along the boundary remain close in size. Along with the strategy of aiming towards centers of neighboring cells, this produces pairings of winner-loser cells that form proposed \"corridors\" for the agent to try to go through to penetrate the barrier it perceives. \n\nIn this section we investigate doing away with partitioning on the winning side, partitioning only losing cells. Because partitioning can only be triggered with the agent on the losing side of the win-lose boundary, partitioning only losing cells still gives the agent the same kind of access to the boundary through the newly formed cells. However, this results in a size disparity between winner- and loser-side cells and thus does not produce the winner side of the pairings mentioned above. To produce an effect similar to the pairings of parti-game, we change the aiming strategy of the algorithm. Under the new strategy, when the agent decides to go from the cell it currently occupies to a neighboring one, it aims towards the center point of the common surface between the two cells. While this does not reproduce the line of motion of the original aiming strategy exactly, it achieves a very similar objective. \n\nParti-game's success in high-dimensional problems stems from its variable resolution strategy, which partitions finely only in regions where this is needed. By limiting partitioning to losing cells only, we hope to increase the resolution in even fewer parts of the state space and thereby make the algorithm even more efficient. 
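The new aiming target can be computed directly from the bounding coordinates of the two cells. The sketch below is our own illustration, not the authors' code; representing a cell by its low and high corner vectors is an assumption. Intersecting the two hyperrectangles dimension by dimension yields the shared face, and its midpoint is the aiming target:

```python
def face_center(a_lo, a_hi, b_lo, b_hi):
    """Center point of the common surface between two axis-aligned cells.

    Each cell is given by its low and high corner vectors (an assumed
    representation).  The overlap interval is computed per dimension; in the
    dimension along which the two cells abut it collapses to the shared
    face coordinate, so the midpoint lies on the common surface.
    """
    target = []
    for d in range(len(a_lo)):
        lo = max(a_lo[d], b_lo[d])   # overlap interval in dimension d
        hi = min(a_hi[d], b_hi[d])   # equals lo where the cells abut
        target.append(0.5 * (lo + hi))
    return target
```

For the unit cell [0,1] x [0,1] and a neighbor [1,2] x [0.5,2], this aims at (1.0, 0.75) on the shared face, whereas the original strategy would aim at the neighbor's center, (1.5, 1.25).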
\n\nTo compare the performance of parti-game to that of the modified algorithm, we applied both algorithms to the set of continuous mazes shown in Figure 1. For all maze problems we used a simple local controller that can move directly toward the specified target state. We also applied both algorithms to the non-linear dynamics problem of the ice puck on a hill, depicted in Figure 2, which has been studied extensively in the reinforcement learning literature. We used a local controller very similar to the one described in Moore and Atkeson (1995). Finally, we applied the algorithms to the nine-degree-of-freedom planar robot introduced in Moore and Atkeson (1995) and shown in Figure 3, using the same local controller described there. Additional results on the Acrobot problem (Sutton and Barto 1998) are not included here for space limitations but can be found in Al-Ansari and Williams (1998). \n\nFigure 2: An ice puck on a hill. The puck can thrust horizontally to the left and to the right with a maximum force of 1 Newton. The state space is two-dimensional, consisting of the horizontal position and velocity. The agent starts at the position marked Start at velocity zero and its goal is to reach the position marked Goal at velocity zero. Maximum thrust is not adequate to get the puck up the ramp, so it has to learn to move to the left first to build up momentum. \n\nFigure 3: A nine-degree-of-freedom, snake-like arm that moves in a plane and is fixed at one tip. The objective is to move the arm from the start configuration to the goal one, which requires curling and uncurling to avoid the barrier and the wall. \n\nWe applied both algorithms to each of these problems, in each case performing as many trials as were needed for the solution to stabilize. 
The agent was placed back in the start state at the end of each trial. In the puck problem, the agent was also reset to the start state whenever it hit either of the barriers at the bottom and top of the slope. The results are shown in Table 1. The table compares the number of trials needed, the number of partitions, the total number of steps taken in the world, and the length of the final trajectory. \n\nThe table shows that the new algorithm indeed resulted in fewer total partitions on all problems. It also matched or improved on every problem in the number of trials required to stabilize. It improved on all but one problem (maze d) in the length of the final trajectory, and there the difference in length is very small. Finally, it resulted in fewer total steps taken in three of the six problems, but the total steps taken increased in the remaining three. \n\nFigure 4: The final trial of applying the various algorithms to the maze of Figure 1(a): (a) parti-game, (b) parti-game with partitioning only losing cells, and (c) parti-game with partitioning only the largest losing cells. \n\nFigure 5: Parti-game needed 1194 partitions to reach the goal in the maze of Figure 1(d). \n\nTo see the effect of the modification in detail, we show the result of applying parti-game and the modified algorithm to the maze of Figure 1(a) in Figures 4(a) and 4(b), respectively. We can see how areas with higher resolution are more localized in Figure 4(b). 
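All of the variants compared here share the cost computation described in the introduction: a Dijkstra-like, one-pass minimax version of dynamic programming. The sketch below is our own illustration under an assumed experience format (for each cell, one set of observed outcome cells per aimed-at neighbor). An aiming action is finalized only once its worst observed outcome is finalized, so each cell's cost is the worst-case number of cell transitions to the goal; cells never reached keep infinite cost and are the losing cells:

```python
import heapq
from math import inf

def minimax_costs(goal, actions):
    """One-pass, Dijkstra-like minimax cell costs (a sketch, not the authors' code).

    actions: dict cell -> list of actions, each action being the list of cells
    that aiming attempt has been observed to lead to.  Returns dict cell ->
    worst-case number of cell transitions to the goal; cells absent from the
    result have cost infinity (losing cells).
    """
    cost = {goal: 0}
    # Outcomes of each (cell, action) not yet finalized, and the worst
    # finalized outcome cost seen so far for that action.
    remaining = {c: [set(outs) for outs in acts] for c, acts in actions.items()}
    worst = {c: [0] * len(acts) for c, acts in actions.items()}
    # Reverse index: which (cell, action) pairs list a given outcome cell.
    users = {}
    for c, acts in actions.items():
        for i, outs in enumerate(acts):
            for o in set(outs):
                users.setdefault(o, []).append((c, i))
    pq = [(0, goal)]
    while pq:
        d, cell = heapq.heappop(pq)
        if d > cost.get(cell, inf):
            continue  # stale queue entry
        for c, i in users.get(cell, []):
            rem = remaining[c][i]
            if cell not in rem:
                continue
            rem.discard(cell)
            worst[c][i] = max(worst[c][i], d)
            if not rem:  # the action's worst case is now known
                nd = 1 + worst[c][i]
                if nd < cost.get(c, inf):
                    cost[c] = nd
                    heapq.heappush(pq, (nd, c))
    return cost
```

A cell whose every action can leave the agent stuck (e.g. an action whose outcomes include the cell itself) is never finalized, which is exactly the minimax notion of a losing cell.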
\n\n3 BALANCED PARTITIONING \n\nUpon close observation of Figure 4(a), we see that parti-game partitions very finely along the right wall of the maze. This behavior is seen even more clearly in parti-game's solution to the maze of Figure 1(d), which is a simple maze with a single barrier between the start state and the goal. As we see in Table 1, parti-game has a very hard time reaching the goal in this maze. Figure 5 shows the 1194 partitions that parti-game generated in trying to reach the goal. We can see that partitioning along the barrier is very uneven, being extremely fine near the goal and growing coarser as the distance from the goal increases. Putting higher focus on places where the highest gain could be attained if a hole is found can be a desirable feature, but what happens in cases like this one is clearly excessive. \n\nOne of the factors contributing to this problem of continuing to search at ever-higher resolutions in the part of the barrier nearest the goal is that any version of parti-game searches for solutions using an implicit trade-off between the shortness of a potential solution path and the resolution required to find this path. Only when the resolution becomes so fine that the number of cells through which the agent would have to pass in this potential shortcut exceeds the number of cells to be traversed when traveling around the barrier is the algorithm forced to look elsewhere for the actual opening. \n\nA conceptually appealing way to bias this search is to maintain a more explicit coarse-to-fine search strategy. One way to do this is to try to keep the smallest cell size the algorithm generates as large as possible. In addition to achieving the balance we are seeking, this would tend to lower the total number of partitions and result in shallower tree structures needed to represent the state space, which, in turn, results in higher efficiency. 
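The coarse-to-fine rule can be sketched as follows. This is our own illustration, not the authors' code: splitting a cell across its largest dimension is as described in the introduction, but taking hypervolume as the measure of "maximum size", and the (lo, hi) corner-vector cell representation, are assumptions on our part.

```python
def volume(lo, hi):
    """Hypervolume of an axis-aligned cell given by its corner vectors."""
    v = 1.0
    for l, h in zip(lo, hi):
        v *= h - l
    return v

def split_largest_losing(cells, candidates):
    """Split only the maximum-size candidate cells, each across its longest dimension.

    cells: dict id -> (lo, hi) corner vectors; candidates: ids of the
    losing-side cells on the win-lose boundary (assumed non-empty).  Cells
    below the maximum size are kept intact, postponing any reduction of the
    minimum cell size for as long as possible.
    """
    vmax = max(volume(*cells[c]) for c in candidates)
    result = {}
    for c in candidates:
        lo, hi = cells[c]
        if volume(lo, hi) < vmax:
            result[c] = (lo, hi)  # postponed: not of maximum size
            continue
        # Split across the dimension in which the cell is largest.
        d = max(range(len(lo)), key=lambda i: hi[i] - lo[i])
        mid = 0.5 * (lo[d] + hi[d])
        hi_low = list(hi)
        hi_low[d] = mid   # upper face of the lower half
        lo_high = list(lo)
        lo_high[d] = mid  # lower face of the upper half
        result[(c, 0)] = (list(lo), hi_low)
        result[(c, 1)] = (lo_high, list(hi))
    return result
```

For example, given a 4 x 2 cell and a 1 x 1 cell as candidates, only the larger one is split, along its longer side, leaving the smaller cell untouched until everything else has been reduced to its size.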
\n\nTo achieve these goals, we modified the algorithm of the previous section so that whenever partitioning is required, instead of partitioning all losing cells, we partition only those among them that are of maximum size. This has the effect of postponing splits that would lower the minimum cell size for as long as possible. The results of applying the modified algorithm to the test problems are also shown in Table 1. \n\nFigure 6: The result of partitioning largest cells on the losing side in the maze of Figure 1(d). Only two trials are required to stabilize. The first requires 1304 steps and 21 partitions. The second trial adds no new partitions and produces a path of only 165 steps. \n\nProblem | Algorithm | Trials | Partitions | Total Steps | Final Trajectory Length \nmaze a | original parti-game | 3 | 444 | 35131 | 279 \nmaze a | partition losing side | 3 | 239 | 16652 | 256 \nmaze a | partition largest losing | 3 | 27 | 1977 | 270 \nmaze b | original parti-game | 6 | 98 | 5180 | 183 \nmaze b | partition losing side | 5 | 76 | 7187 | 175 \nmaze b | partition largest losing | 6 | 76 | 5635 | 174 \nmaze c | original parti-game | 3 | 176 | 7768 | 416 \nmaze c | partition losing side | 2 | 120 | 10429 | 165 \nmaze c | partition largest losing | 2 | 96 | 6803 | 165 \nmaze d | original parti-game | 2 | 1194 | 553340 | 149 \nmaze d | partition losing side | 2 | 350 | 18639 | 155 \nmaze d | partition largest losing | 2 | 21 | 1469 | 165 \npuck | original parti-game | 6 | 80 | 6764 | 240 \npuck | partition losing side | 2 | 18 | 3237 | 151 \npuck | partition largest losing | 2 | 18 | 3237 | 151 \nnine-joint arm | original parti-game | 25 | 104 | 2970 | 58 \nnine-joint arm | partition losing side | 17 | 61 | 3041 | 56 \nnine-joint arm | partition largest losing | 7 | 37 | 2694 | 112 \n\nTable 1: Results of applying parti-game, parti-game with partitioning only losing cells, and parti-game with partitioning the largest losing cells on the six problem domains. Smaller numbers are better. \n\nComparing the results of this version of the algorithm to those of partitioning all losing cells on the win-lose boundary shows that this algorithm improves on parti-game's performance even further. It outperforms the previous algorithm in four problems in the total number of partitions required, while tying it in the remaining two. It outperforms the previous algorithm in total steps taken in five problems and ties it in one. It improves in the number of trials needed to stabilize in one problem, ties the previous algorithm in four cases, and ties parti-game in the remaining one. In the length of the final trajectory, partitioning the largest losing cells does better in one case, ties partitioning only losing cells in two cases, and does worse in three. This latter result is due to the generally larger partition sizes that result from the lower resolution this algorithm produces. However, the increase in the number of steps is very minimal in all but the nine-joint arm problem. \n\nFigure 4(c) shows the result of applying the new algorithm to the maze of Figure 1(a). In contrast to the other two algorithms depicted in the same figure, we can see that the new algorithm partitions very uniformly around the barrier. In addition, it requires the fewest partitions and total steps of the three algorithms. Figure 6 shows that the new algorithm vastly outperforms parti-game on the maze of Figure 1(d). Here, too, it partitions very evenly around the barrier and finds the goal very quickly, requiring far fewer steps and partitions. \n\n4 GLOBAL PATH IMPROVEMENT \n\nParti-game does not claim to find optimal solutions. 
As we see in Figure 4, parti-game and the two modified algorithms settle on the longer of the two possible routes to the goal in this maze. In this section we investigate ways to improve parti-game so that it can find paths of optimal form. It is important to note that we are not seeking paths that are optimal, since that is not possible to achieve using the cell shapes and aiming strategies used here. By a path of optimal form we mean a path that can be continuously deformed into an optimal path. \n\n4.1 OTHER GRADIENTS \n\nAs mentioned above, parti-game partitions only when the agent has no winning cells to aim for, and the only cells partitioned are those that lie on the win-lose boundary. The win-lose boundary falls on the gradient between finite- and infinite-cost cells, and it appears when the algorithm knows of no reliable way to get to the goal. Consistently partitioning along this gradient guarantees that the algorithm will eventually find a path to the goal, if one exists. \n\nHowever, gradients across which the difference in cost is finite also exist in a state space partitioned by parti-game (or any of the variants introduced in this paper). Like the win-lose boundary, these gradients are boundaries through which the agent does not believe it can move directly. Although finding an opening in such a boundary is not essential to reaching the goal, these boundaries do represent potential shortcuts that might improve the agent's policy. Any gradient with a difference in cost of two or more is the location of such a potentially useful shortcut. \n\nBecause such gradients appear throughout the space, we need to be selective about which ones to partition along. There are many possible strategies one might consider for incorporating these ideas into parti-game. 
For example, since parti-game focuses on the highest gradients only, the first idea that comes to mind is to follow in parti-game's footsteps and assign partitioning priorities to cells along gradients based on the differences in values across those gradients. However, since the true cost function typically has discontinuities, the effect of such a strategy would be to continue refining the partitioning indefinitely along such a discontinuity in a vain search for a nonexistent shortcut. \n\n4.2 THE ALGORITHM \n\nA much better idea is to pick cells to partition in a way that achieves balanced partitioning, following the rationale introduced in section 3. Again, such a strategy results in a uniform coarse-to-fine search for better paths along those other gradients. \n\nThe following discussion could, in principle, apply to any of the three forms of parti-game studied up to this point. Because of the superior behavior of the version in which we partition the largest cells on the losing side, this is the specific version we report on here, and we use the term modified parti-game to refer to it. \n\nWe incorporated partitioning along other gradients as follows. At the end of any trial in which the agent is able to go from the start state to the goal without any unexpected results of any of its aiming attempts, we partition the largest \"losing cells\" (i.e., higher-cost cells) that fall on any gradient across which costs differ by more than one. Because data about experiences involving cells that are partitioned is discarded, the next time modified parti-game is run, the agent will try to go through the newly formed cells in search of a shortcut. \n\nThis algorithm amounts to simply running modified parti-game until a stable solution is 
Figure 7: The solution found by applying the global improvement algorithm to the maze of Figure 1(a). The solution proceeded exactly like that of the algorithm of section 3 until the solution in Figure 4(c) was reached. After that, eight additional iterations were needed to find the better trajectory, resulting in 22 additional partitions, for a total of 49. \n\nreached. At that point, it introduces new cells along some of the other gradients, and when it is subsequently run, modified parti-game is applied again until stabilization is achieved, and so on. The results of applying this algorithm to the maze of Figure 1(a) are shown in Figure 7. As we can see, the algorithm finds the better solution by increasing the resolution around the relevant part of the barrier above the start state. \n\nIn the absence of information about the form of the optimal trajectory, there is no natural termination criterion for this algorithm. It is designed to be run continually in search of better solutions. If, however, the form of the optimal solution is known in advance, the extra partitioning could be turned off after such a solution is found. \n\n5 CONCLUSIONS \n\nIn this paper we have presented three successive modifications to parti-game. The combination of the first two appears to improve its robustness and efficiency, sometimes dramatically, and generally yields better solutions. The third provides a novel way of performing non-local search for higher-quality solutions that are closer to optimal. \n\nAcknowledgments \n\nMohammad Al-Ansari acknowledges the continued support of King Saud University, Riyadh, Saudi Arabia and the Saudi Arabian Cultural Mission to the U.S.A. \n\nReferences \n\nAl-Ansari, M. A. and R. J. Williams (1998). Modifying the parti-game algorithm for increased robustness, higher efficiency and better policies. 
Technical Report NU-CCS-98-13, College of Computer Science, Northeastern University, Boston, MA. \n\nMoore, A. W. (1994a). Variable resolution reinforcement learning. In Proceedings of the Eighth Yale Workshop on Adaptive and Learning Systems. Center for Systems Science, Yale University. \n\nMoore, A. W. (1994b). The parti-game algorithm for variable resolution reinforcement learning in multidimensional state spaces. In Advances in Neural Information Processing Systems 6. Morgan Kaufmann. \n\nMoore, A. W. and C. G. Atkeson (1995). The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. Machine Learning 21. \n\nSutton, R. S. and A. G. Barto (1998). Reinforcement Learning: An Introduction. MIT Press. \n", "award": [], "sourceid": 1550, "authors": [{"given_name": "Mohammad", "family_name": "Al-Ansari", "institution": null}, {"given_name": "Ronald", "family_name": "Williams", "institution": null}]}