{"title": "On-line Policy Improvement using Monte-Carlo Search", "book": "Advances in Neural Information Processing Systems", "page_first": 1068, "page_last": 1074, "abstract": null, "full_text": "On-line Policy Improvement using \n\nMonte-Carlo Search \n\nGerald Tesauro \n\nIBM T. J. Watson Research Center \n\nP. O. Box 704 \n\nYorktown Heights, NY 10598 \n\nGregory R. Galperin \n\nMIT AI Lab \n\n545 Technology Square \nCambridge, MA 02139 \n\nAbstract \n\nWe present a Monte-Carlo simulation algorithm for real-time policy \nimprovement of an adaptive controller. In the Monte-Carlo sim(cid:173)\nulation, the long-term expected reward of each possible action is \nstatistically measured, using the initial policy to make decisions in \neach step of the simulation. The action maximizing the measured \nexpected reward is then taken, resulting in an improved policy. Our \nalgorithm is easily parallelizable and has been implemented on the \nIBM SP! and SP2 parallel-RISC supercomputers. \nWe have obtained promising initial results in applying this algo(cid:173)\nrithm to the domain of backgammon. Results are reported for a \nwide variety of initial policies, ranging from a random policy to \nTD-Gammon, an extremely strong multi-layer neural network. In \neach case, the Monte-Carlo algorithm gives a substantial reduction, \nby as much as a factor of 5 or more, in the error rate of the base \nplayers. The algorithm is also potentially useful in many other \nadaptive control applications in which it is possible to simulate the \nenvironment. \n\n1 \n\nINTRODUCTION \n\nPolicy iteration, a widely used algorithm for solving problems in adaptive con(cid:173)\ntrol, consists of repeatedly iterating the following policy improvement computation \n(Bertsekas, 1995): (1) First, a value function is computed that represents the long(cid:173)\nterm expected reward that would be obtained by following an initial policy. 
(This may be done in several ways, such as with the standard dynamic programming algorithm.) (2) An improved policy is then defined which is greedy with respect to that value function. Policy iteration is known to have rapid and robust convergence properties, and for Markov tasks with lookup-table state-space representations, it is guaranteed to converge to the optimal policy. \n\nIn typical uses of policy iteration, the policy improvement step is an extensive off-line procedure. For example, in dynamic programming, one performs a sweep through all states in the state space. Reinforcement learning provides another approach to policy improvement; recently, several authors have investigated using RL in conjunction with nonlinear function approximators to represent the value functions and/or policies (Tesauro, 1992; Crites and Barto, 1996; Zhang and Dietterich, 1996). These studies are based on following actual state-space trajectories rather than sweeps through the full state space, but are still too slow to compute improved policies in real time. Such function approximators typically need extensive off-line training on many trajectories before they achieve acceptable performance levels. \n\nIn contrast, we propose an on-line algorithm for computing an improved policy in real time. We use Monte-Carlo search to estimate V_P(z, a), the expected value of performing action a in state z and subsequently executing policy P in all successor states. Here, P is some given arbitrary policy, as defined by a \"base controller\" (we do not care how P is defined or was derived; we only need access to its policy decisions). In the Monte-Carlo search, many simulated trajectories starting from (z, a) are generated following P, and the expected long-term reward is estimated by averaging the results from each of the trajectories. 
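Under a hypothetical simulator interface (simulate_step, is_terminal, and reward are stand-ins, not part of the paper), the Monte-Carlo decision procedure just described can be sketched as follows: \n\n```python
import math

def monte_carlo_improved_action(state, actions, base_policy, simulate_step,
                                is_terminal, reward, n_trials=100):
    # One step of on-line policy improvement: for each candidate action a,
    # estimate V_P(z, a) by averaging n_trials simulated trajectories that
    # follow the base policy P after a, then return the argmax action.
    # simulate_step, is_terminal, and reward are assumed to be supplied by
    # a simulator of the environment (hypothetical interface).
    best_action, best_value = None, -math.inf
    for a in actions:
        total = 0.0
        for _ in range(n_trials):
            s = simulate_step(state, a)               # take the candidate action
            while not is_terminal(s):
                s = simulate_step(s, base_policy(s))  # follow P thereafter
            total += reward(s)                        # long-term reward of this trial
        value = total / n_trials                      # Monte-Carlo estimate of V_P(z, a)
        if value > best_value:
            best_action, best_value = a, value
    return best_action                                # the improved policy's choice at z
```
\nBecause each trial is independent, the inner loop is exactly the part that parallelizes trivially across processors, with only the final averaging requiring communication. \n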
(Note that Monte-Carlo sampling is needed only for non-deterministic tasks, because in a deterministic task, only one trajectory starting from (z, a) would need to be examined.) Having estimated V_P(z, a), the improved policy P' at state z is defined to be the action which produced the best estimated value in the Monte-Carlo simulation, i.e., P'(z) = argmax_a V_P(z, a). \n\n1.1 EFFICIENT IMPLEMENTATION \n\nThe proposed Monte-Carlo algorithm could be very CPU-intensive, depending on the number of initial actions that need to be simulated, the number of time steps per trial needed to obtain a meaningful long-term reward, the amount of CPU per time step needed to make a decision with the base controller, and the total number of trials needed to make a Monte-Carlo decision. The last factor depends on both the variance in expected reward per trial, and on how close the values of competing candidate actions are. \n\nWe propose two methods to address the potentially large CPU requirements of this approach. First, the power of parallelism can be exploited very effectively. The algorithm is easily parallelized with high efficiency: the individual Monte-Carlo trials can be performed independently, and the combining of results from different trials is a simple averaging operation. Hence there is relatively little communication between processors required in a parallel implementation. \n\nThe second technique is to continually monitor the accumulated Monte-Carlo statistics during the simulation, and to prune away both candidate actions that are sufficiently unlikely (outside some user-specified confidence bound) to be selected as the best action, as well as candidates whose values are sufficiently close to the value of the current best estimate that they are considered equivalent (i.e., choosing either would not make a significant difference). 
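A minimal sketch of this pruning rule, assuming approximately normal confidence intervals; the z and epsilon thresholds and the (n, mean, var) statistics interface are illustrative, not from the paper: \n\n```python
import math

def surviving_candidates(stats, z=2.0, epsilon=0.001):
    # stats maps each candidate action to (n, mean, var): the number of
    # Monte-Carlo trials so far, the running mean reward, and its variance.
    # Prune (a) candidates whose upper confidence bound falls below the
    # current best candidate's lower bound (sufficiently unlikely to be
    # selected as the best action), and (b) candidates whose means are so
    # close to the best that choosing either would make no real difference.
    best = max(stats, key=lambda a: stats[a][1])
    n_b, mean_b, var_b = stats[best]
    lower_b = mean_b - z * math.sqrt(var_b / n_b)   # best action's lower bound
    keep = [best]
    for a, (n, mean, var) in stats.items():
        if a == best:
            continue
        upper = mean + z * math.sqrt(var / n)       # candidate's upper bound
        if upper < lower_b:
            continue                                # unlikely to be best: prune
        if mean_b - mean < epsilon:
            continue                                # equivalent to best: prune
        keep.append(a)
    return keep
```
\nCalling this periodically during the simulation shrinks the set of actions still being rolled out, at the price of the extra statistics traffic discussed next. \n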
This technique requires more communication in a parallel implementation, but offers potentially large savings in the number of trials needed to make a decision. \n\n2 APPLICATION TO BACKGAMMON \n\nWe have initially applied the Monte-Carlo algorithm to making move decisions in the game of backgammon. This is an absorbing Markov process with perfect state-space information, and one has a perfect model of the nondeterminism in the system, as well as the mapping from actions to resulting states. \n\nIn backgammon parlance, the expected value of a position is known as the \"equity\" of the position, and estimating the equity by Monte-Carlo sampling is known as performing a \"rollout.\" This involves playing the position out to completion many times with different random dice sequences, using a fixed policy P to make move decisions for both sides. The sequences are terminated at the end of the game (when one side has borne off all 15 checkers), and at that time a signed outcome value (called \"points\") is recorded. The outcome value is positive if one side wins and negative if the other side wins, and the magnitude of the value can be either 1, 2, or 3, depending on whether the win was normal, a gammon, or a backgammon. With normal human play, games typically last on the order of 50-60 time steps. Hence if one is using the Monte-Carlo player to play out actual games, the Monte-Carlo trials will on average start out somewhere in the middle of a game, and take about 25-30 time steps to reach completion. \n\nIn backgammon there are on average about 20 legal moves to consider in a typical decision. The candidate plays frequently differ in expected value by on the order of .01. Thus in order to resolve the best play by Monte-Carlo sampling, one would need on the order of 10K or more trials per candidate, or a total of hundreds of thousands of Monte-Carlo trials to make one move decision. With extensive statistical pruning as discussed previously, this can be reduced to several tens of thousands of trials. Multiplying this by 25-30 decisions per trial with the base player, we find that about a million base-player decisions have to be made in order to make one Monte-Carlo decision. With typical human tournament players taking about 10 seconds per move, we need to parallelize to the point that we can achieve at least 100K base-player decisions per second. \n\nOur Monte-Carlo simulations were performed on the IBM SP1 and SP2 parallel-RISC supercomputers at IBM Watson and at Argonne National Laboratories. Each SP node is equivalent to a fast RS/6000, with floating-point capability on the order of 100 Mflops. Typical runs were on configurations of 16-32 SP nodes, with parallel speedup efficiencies on the order of 90%. \n\nWe have used a variety of base players in our Monte-Carlo simulations, with widely varying playing abilities and CPU requirements. The weakest (and fastest) of these is a purely random player. We have also used a few single-layer networks (i.e., no hidden units) with simple encodings of the board state, that were trained by backpropagation on an expert data set (Tesauro, 1989). These simple networks also make fast move decisions, and are much stronger than a random player, but in human terms are only at a beginner-to-intermediate level. Finally, we used some multi-layer nets with a rich input representation, encoding both the raw board state and many hand-crafted features, trained on self-play using the TD(λ) algorithm (Sutton, 1988; Tesauro, 1992). Such networks play at an advanced level, but are too slow to make Monte-Carlo decisions in real time based on full rollouts to completion. Results for all these players are presented in the following two sections. 
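The rollout procedure described above can be sketched as follows; the backgammon simulator interface (policy, roll_dice, game_over, outcome) is hypothetical, standing in for whatever move generator and dice model a real implementation provides: \n\n```python
def rollout_equity(position, policy, roll_dice, game_over, outcome, n_games=1000):
    # Estimate the equity of a backgammon position by a rollout: play it to
    # completion n_games times with random dice, using the same fixed policy
    # P for both sides, and average the signed outcomes (points: +/-1, 2, or
    # 3 for a normal win, gammon, or backgammon).
    # policy(pos, dice) is assumed to return the position after the chosen
    # move; roll_dice, game_over, and outcome come from the simulator.
    total = 0
    for _ in range(n_games):
        pos = position
        while not game_over(pos):
            pos = policy(pos, roll_dice())  # fixed policy moves for both sides
        total += outcome(pos)               # signed points at the end of the game
    return total / n_games
```
\nApplying this to each successor position of a candidate move, and picking the move whose successor has the best estimated equity, gives the Monte-Carlo move decision used throughout the experiments below. \n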
\n\n2.1 RESULTS FOR SINGLE-LAYER NETWORKS \n\nWe measured the game-playing strength of three single-layer base players, and of the corresponding Monte-Carlo players, by playing several thousand games against a common benchmark opponent. The benchmark opponent was TD-Gammon 2.1 (Tesauro, 1995), playing on its most basic playing level (1-ply search, i.e., no lookahead). Table 1 shows the results. Lin-1 is a single-layer neural net with only the raw board description (number of White and Black checkers at each location) as input. \n\nNetwork  Base player  Monte-Carlo player  Monte-Carlo CPU \nLin-1    -0.52 ppg    -0.01 ppg           5 sec/move \nLin-2    -0.65 ppg    -0.02 ppg           5 sec/move \nLin-3    -0.32 ppg    +0.04 ppg           10 sec/move \n\nTable 1: Performance of three simple linear evaluators, for both initial base players and corresponding Monte-Carlo players. Performance is measured in terms of expected points per game (ppg) vs. TD-Gammon 2.1 1-ply. Positive numbers indicate that the player here is better than TD-Gammon. Base player stats are the results of 30K trials (std. dev. about .005), and Monte-Carlo stats are the results of 5K trials (std. dev. about .02). CPU times are for the Monte-Carlo player running on 32 SP1 nodes. \n\nLin-2 uses the same network structure and weights as Lin-1, plus a significant amount of random noise was added to the evaluation function, in order to deliberately weaken its playing ability. These networks were highly optimized for speed, and are capable of making a move decision in about 0.2 msec on a single SP1 node. Lin-3 uses the same raw board input as the other two players, plus it has a few additional hand-crafted features related to the probability of a checker being hit; there is no noise added. This network is a significantly stronger player, but is about twice as slow in making move decisions. 
\n\nWe can see in Table 1 that the Monte-Carlo technique produces dramatic improvement in playing ability for these weak initial players. As base players, Lin-1 should be regarded as a bad intermediate player, while Lin-2 is substantially worse and is probably about equal to a human beginner. Both of these networks get trounced by TD-Gammon, which on its 1-ply level plays at a strong advanced level. Yet the resulting Monte-Carlo players from these networks appear to play about equal to TD-Gammon 1-ply. Lin-3 is a significantly stronger player, and the resulting Monte-Carlo player appears to be clearly better than TD-Gammon 1-ply. It is estimated to be about equivalent to TD-Gammon on its 2-ply level, which plays at a strong expert level. \n\nThe Monte-Carlo benchmarks reported in Table 1 involved substantial amounts of CPU time. At 10 seconds per move decision, and 25 move decisions per game, playing 5000 games against TD-Gammon required about 350 hours of 32-node SP machine time. We have also developed an alternative testing procedure, which is much less expensive in CPU time, but still seems to give a reasonably accurate measure of performance strength. We measure the average equity loss of the Monte-Carlo player on a suite of test positions. We have a collection of about 800 test positions, in which every legal play has been extensively rolled out by TD-Gammon 2.1 1-ply. We then use the TD-Gammon rollout data to grade the quality of a given player's move decisions. \n\nTest set results for the three linear evaluators, and for a random evaluator, are displayed in Table 2. It is interesting to note for comparison that the TD-Gammon 1-ply base player scores 0.0120 on this test set measure, comparable to the Lin-1 Monte-Carlo player, while the TD-Gammon 2-ply base player scores 0.00843, comparable to the Lin-3 Monte-Carlo player. 
These results are exactly in line with what we measured in Table 1 using full-game benchmarking, and thus indicate that the test-set methodology is in fact reasonably accurate. We also note that in each case, there is a huge error reduction of potentially a factor of 4 or more in using the Monte-Carlo technique. In fact, the rollouts summarized in Table 2 were done using fairly aggressive statistical pruning; we expect that rolling out decisions more extensively would give error reduction ratios closer to a factor of 5, albeit at a cost of increased CPU time. \n\nEvaluator  Base loss  Monte-Carlo loss  Ratio \nRandom     0.330      0.131             2.5 \nLin-1      0.040      0.0124            3.2 \nLin-2      0.0665     0.0175            3.8 \nLin-3      0.0291     0.00749           3.9 \n\nTable 2: Average equity loss per move decision on an 800-position test set, for both initial base players and corresponding Monte-Carlo players. Units are ppg; smaller loss values are better. Also computed is the ratio of base player loss to Monte-Carlo loss. \n\n2.2 RESULTS FOR MULTI-LAYER NETWORKS \n\nUsing large multi-layer networks to do full rollouts is not feasible for real-time move decisions, since the large networks are at least a factor of 100 slower than the linear evaluators described previously. We have therefore investigated an alternative Monte-Carlo algorithm, using so-called \"truncated rollouts.\" In this technique trials are not played out to completion, but instead only a few steps in the simulation are taken, and the neural net's equity estimate of the final position reached is used instead of the actual outcome. The truncated rollout algorithm requires much less CPU time, due to two factors: First, there are potentially many fewer steps per trial. 
Second, there is much less variance per trial, since only a few random steps are taken and a real-valued estimate is recorded, rather than many random steps and an integer final outcome. These two factors combine to give at least an order of magnitude speed-up compared to full rollouts, while still giving a large error reduction relative to the base player. \n\nTable 3 shows truncated rollout results for two multi-layer networks: TD-Gammon 2.1 1-ply, which has 80 hidden units, and a substantially smaller network with the same input features but only 10 hidden units. The first line of data for each network reflects very extensive rollouts and shows quite large error reduction ratios, although the CPU times are somewhat slower than acceptable for real-time play. (Also we should be somewhat suspicious of the 80 hidden unit result, since this was the same network that generated the data being used to grade the Monte-Carlo players.) The second line of data shows what happens when the rollout trials are cut off more aggressively. This yields significantly faster run-times, at the price of only slightly worse move decisions. \n\nThe quality of play of the truncated rollout players shown in Table 3 is substantially better than TD-Gammon 1-ply or 2-ply, and it is also substantially better than the full-rollout Monte-Carlo players described in the previous section. In fact, we estimate that the world's best human players would score in the range of 0.005 to 0.006 on this test set, so the truncated rollout players may actually be exhibiting superhuman playing ability, in reasonable amounts of SP machine time. \n\n3 DISCUSSION \n\nOn-line search may provide a useful methodology for overcoming some of the limitations of training nonlinear function approximators on difficult control tasks. 
The idea of using search to improve in real time the performance of a heuristic controller is an old one, going back at least to (Shannon, 1950). Full-width search algorithms have been extensively studied since the time of Shannon, and have produced tremendous success in computer games such as chess, checkers and Othello. Their main drawback is that the required CPU time increases exponentially with the depth of the search, i.e., T ~ B^D, where B is the effective branching factor and D is the search depth. In contrast, Monte-Carlo search provides a tractable alternative for doing very deep searches, since the CPU time for a full Monte-Carlo decision only scales as T ~ N \u00b7 B \u00b7 D, where N is the number of trials in the simulation. \n\nHidden Units  Base loss  Truncated Monte-Carlo loss     Ratio  M-C CPU \n10            0.0152     0.00318 (11-step, thorough)    4.8    25 sec/move \n10            0.0152     0.00433 (11-step, optimistic)  3.5    9 sec/move \n80            0.0120     0.00181 (7-step, thorough)     6.6    65 sec/move \n80            0.0120     0.00269 (7-step, optimistic)   4.5    18 sec/move \n\nTable 3: Truncated rollout results for two multi-layer networks, with number of hidden units and rollout steps as indicated. Average equity loss per move decision on an 800-position test set, for both initial base players and corresponding Monte-Carlo players. Again, units are ppg, and smaller loss values are better. Also computed is the ratio of base player loss to Monte-Carlo loss. CPU times are for the Monte-Carlo player running on 32 SP1 nodes. \n\nIn the backgammon application, for a wide range of initial policies, our on-line Monte-Carlo algorithm, which basically implements a single step of policy iteration, was found to give very substantial error reductions. Potentially 80% or more of the base player's equity loss can be eliminated, depending on how extensive the Monte-Carlo trials are. 
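The two scaling laws above can be compared concretely. B ~ 20 legal moves is from the text; the depth and trial counts below are illustrative values chosen only to show the gap between the two regimes: \n\n```python
# Full-width search costs T ~ B**D evaluations, while a full Monte-Carlo
# decision costs T ~ N * B * D. With backgammon-like numbers the difference
# is astronomical, which is why very deep Monte-Carlo search is tractable.
B = 20        # effective branching factor (about 20 legal moves, per the text)
D = 25        # search depth in time steps (illustrative)
N = 10_000    # Monte-Carlo trials per candidate action (illustrative)

full_width_cost = B ** D      # exponential in the search depth
monte_carlo_cost = N * B * D  # only linear in depth and branching factor
```
\n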
The magnitude of the observed improvement is surprising to us: while it is known theoretically that each step of policy iteration produces a strict improvement, there are no guarantees on how much improvement one can expect. We have also noted a rough trend in the data: as one increases the strength of the base player, the ratio of error reduction due to the Monte-Carlo technique appears to increase. This could reflect superlinear convergence properties of policy iteration. \n\nIn cases where the base player employs an evaluator that is able to estimate expected outcome, the truncated rollout algorithm appears to offer favorable tradeoffs relative to doing full rollouts to completion. While the quality of Monte-Carlo decisions is not as good using truncated rollouts (presumably because the neural net's estimates are biased), the degradation in quality is fairly small in at least some cases, and is compensated by a great reduction in CPU time. This allows more sophisticated (and thus slower) base players to be used, resulting in decisions which appear to be both better and faster. \n\nThe Monte-Carlo backgammon program as implemented on the SP offers the potential to achieve real-time move decision performance that exceeds human capabilities. In future work, we plan to augment the program with a similar Monte-Carlo algorithm for making doubling decisions. It is quite possible that such a program would be by far the world's best backgammon player. \n\nBeyond the backgammon application, we conjecture that on-line Monte-Carlo search may prove to be useful in many other applications of reinforcement learning and adaptive control. The main requirement is that it should be possible to simulate the environment in which the controller operates. Since basically all of the recent successful applications of reinforcement learning have been based on training in simulators, this doesn't seem to be an undue burden. 
Thus, for example, Monte-Carlo search may well improve decision-making in the domains of elevator dispatch (Crites and Barto, 1996) and job-shop scheduling (Zhang and Dietterich, 1996). \n\nWe are additionally investigating two techniques for training a controller based on the Monte-Carlo estimates. First, one could train each candidate position on its computed rollout equity, yielding a procedure similar in spirit to TD(1). We expect this to converge to the same policy as other TD(λ) approaches, perhaps more efficiently due to the decreased variance in the target values as well as the easily parallelizable nature of the algorithm. Alternately, the base position - the initial position from which the candidate moves are being made - could be trained with the best equity value from among all the candidates (corresponding to the move chosen by the rollout player). In contrast, TD(λ) effectively trains the base position with the equity of the move chosen by the base controller. Because the improved choice of move achieved by the rollout player yields an expectation closer to the true (optimal) value, we expect the learned policy to differ from, and possibly be closer to optimal than, the original policy. \n\nAcknowledgments \n\nWe thank Argonne National Laboratories for providing SP1 machine time used to perform some of the experiments reported here. Gregory Galperin acknowledges support under Navy-ONR grant N00014-96-1-0311. \n\nReferences \n\nD. P. Bertsekas, Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA (1995). \n\nR. H. Crites and A. G. Barto, \"Improving elevator performance using reinforcement learning.\" In: D. Touretzky et al., eds., Advances in Neural Information Processing Systems 8, 1017-1023, MIT Press (1996). \n\nC. E. Shannon, \"Programming a computer for playing chess.\" Philosophical Magazine 41, 265-275 (1950). \n\nR. S. Sutton, \"Learning to predict by the methods of temporal differences.\" Machine Learning 3, 9-44 (1988). \n\nG. Tesauro, \"Connectionist learning of expert preferences by comparison training.\" In: D. Touretzky, ed., Advances in Neural Information Processing Systems 1, 99-106, Morgan Kaufmann (1989). \n\nG. Tesauro, \"Practical issues in temporal difference learning.\" Machine Learning 8, 257-277 (1992). \n\nG. Tesauro, \"Temporal difference learning and TD-Gammon.\" Comm. of the ACM, 38:3, 58-67 (1995). \n\nW. Zhang and T. G. Dietterich, \"High-performance job-shop scheduling with a time-delay TD(λ) network.\" In: D. Touretzky et al., eds., Advances in Neural Information Processing Systems 8, 1024-1030, MIT Press (1996). \n"}, "award": [], "sourceid": 1302, "authors": [{"given_name": "Gerald", "family_name": "Tesauro", "institution": null}, {"given_name": "Gregory", "family_name": "Galperin", "institution": null}]}