{"title": "Bootstrapping from Game Tree Search", "book": "Advances in Neural Information Processing Systems", "page_first": 1937, "page_last": 1945, "abstract": "In this paper we introduce a new algorithm for updating the parameters of a heuristic evaluation function, by updating the heuristic towards the values computed by an alpha-beta search. Our algorithm differs from previous approaches to learning from search, such as Samuel's checkers player and the TD-Leaf algorithm, in two key ways. First, we update all nodes in the search tree, rather than a single node. Second, we use the outcome of a deep search, instead of the outcome of a subsequent search, as the training signal for the evaluation function. We implemented our algorithm in a chess program, Meep, using a linear heuristic function. After initialising its weight vector to small random values, Meep was able to learn high quality weights from self-play alone. When tested online against human opponents, Meep played at a master level, the best performance of any chess program with a heuristic learned entirely from self-play.", "full_text": "Bootstrapping from Game Tree Search\n\nJoel Veness\nUniversity of NSW and NICTA\nSydney, NSW, Australia 2052\njoelv@cse.unsw.edu.au\n\nDavid Silver\nUniversity of Alberta\nEdmonton, AB Canada T6G2E8\nsilver@cs.ualberta.ca\n\nWilliam Uther\nNICTA and the University of NSW\nSydney, NSW, Australia 2052\nWilliam.Uther@nicta.com.au\n\nAlan Blair\nUniversity of NSW and NICTA\nSydney, NSW, Australia 2052\nblair@cse.unsw.edu.au\n\nAbstract\n\nIn this paper we introduce a new algorithm for updating the parameters of a heuristic evaluation function, by updating the heuristic towards the values computed by an alpha-beta search. Our algorithm differs from previous approaches to learning from search, such as Samuel\u2019s checkers player and the TD-Leaf algorithm, in two key ways. 
First, we update all nodes in the search tree, rather than a single node.\nSecond, we use the outcome of a deep search, instead of the outcome of a subse-\nquent search, as the training signal for the evaluation function. We implemented\nour algorithm in a chess program Meep, using a linear heuristic function. After\ninitialising its weight vector to small random values, Meep was able to learn high\nquality weights from self-play alone. When tested online against human oppo-\nnents, Meep played at a master level, the best performance of any chess program\nwith a heuristic learned entirely from self-play.\n\n1 Introduction\n\nThe idea of search bootstrapping is to adjust the parameters of a heuristic evaluation function to-\nwards the value of a deep search. The motivation for this approach comes from the recursive nature\nof tree search: if the heuristic can be adjusted to match the value of a deep search of depth D, then\na search of depth k with the new heuristic would be equivalent to a search of depth k + D with the\nold heuristic.\nDeterministic, two-player games such as chess provide an ideal test-bed for search bootstrapping.\nThe intricate tactics require a signi\ufb01cant level of search to provide an accurate position evaluation;\nlearning without search has produced little success in these domains. Much of the prior work in\nlearning from search has been performed in chess or similar two-player games, allowing for clear\ncomparisons with existing methods.\nSamuel (1959) \ufb01rst introduced the idea of search bootstrapping in his seminal checkers player. In\nSamuel\u2019s work the heuristic function was updated towards the value of a minimax search in a sub-\nsequent position, after black and white had each played one move. His ideas were later extended\nby Baxter et al. (1998) in their chess program Knightcap. 
In their algorithm, TD-Leaf, the heuristic function is adjusted so that the leaf node of the principal variation produced by an alpha-beta search is moved towards the value of an alpha-beta search at a subsequent time step.\nSamuel\u2019s approach and TD-Leaf suffer from three main drawbacks. First, they only update one node after each search, which discards most of the information contained in the search tree. Second, their updates are based purely on positions that have actually occurred in the game, or which lie on the computed line of best play. These positions may not be representative of the wide variety of positions that must be evaluated by a search-based program; many of the positions occurring in large search trees come from sequences of unnatural moves that deviate significantly from sensible play. Third, the target search is performed at a subsequent time-step, after a real move and response have been played. Thus, the learning target is only accurate when both the player and opponent are already strong. In practice, these methods can struggle to learn effectively from self-play alone. Work-arounds exist, such as initializing a subset of the weights to expert-provided values, or by attempting to disable learning once an opponent has blundered, but these techniques are somewhat unsatisfactory if we have poor initial domain knowledge.\n\nFigure 1: Left: TD, TD-Root and TD-Leaf backups. Right: RootStrap(minimax) and TreeStrap(minimax).\n\nWe introduce a new framework for bootstrapping from game tree search that differs from prior work in two key respects. First, all nodes in the search tree are updated towards the recursive minimax values computed by a single depth-limited search from the root position. This makes full use of the information contained in the search tree. Furthermore, the updated positions are more representative of the types of positions that need to be accurately evaluated by a search-based player. 
Second, as the learning target is based on hypothetical minimax play, rather than positions that occur at subsequent time steps, our methods are less sensitive to the opponent\u2019s playing strength.\nWe applied our algorithms to learn a heuristic function for the game of chess, starting from random initial weights and training entirely from self-play. When applied to an alpha-beta search, our chess program learnt to play at a master level against human opposition.\n\n2 Background\n\nThe minimax search algorithm exhaustively computes the minimax value to some depth D, using a heuristic function H\u03b8(s) to evaluate non-terminal states at depth D, based on a parameter vector \u03b8. We use the notation V^D_{s0}(s) to denote the value of state s in a depth D minimax search from root state s0. We define T^D_{s0} to be the set of states in the depth D search tree from root state s0. We define the principal leaf, l^D(s), to be the leaf state of the depth D principal variation from state s. We use the notation \u03b8\u2190 to indicate a backup that updates the heuristic function towards some target value.\nTemporal difference (TD) learning uses a sample backup H\u03b8(st) \u03b8\u2190 H\u03b8(st+1) to update the estimated value at one time-step towards the estimated value at the subsequent time-step (Sutton, 1988). Although highly successful in stochastic domains such as Backgammon (Tesauro, 1994), direct TD performs poorly in highly tactical domains. Without search or prior domain knowledge, the target value is noisy and improvements to the value function are hard to distinguish. In the game of chess, using a naive heuristic and no search, it is hard to find checkmate sequences, meaning that most games are drawn.\nThe quality of the target value can be significantly improved by using a minimax backup to update the heuristic towards the value of a minimax search. 
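In code, a sample backup of this kind is a one-line weight change for a linear value function; a minimal sketch (the feature vectors and step size below are illustrative, not from the paper):

```python
# Sketch of the TD sample backup H(s_t) <- H(s_{t+1}) for a linear value
# function H(s) = phi(s) . theta. The successor's estimate is the target,
# held constant; only the current state's features receive an update.

def td_backup(theta, phi_t, phi_t1, eta):
    h_t = sum(w * f for w, f in zip(theta, phi_t))      # H(s_t)
    target = sum(w * f for w, f in zip(theta, phi_t1))  # H(s_{t+1}), fixed target
    delta = target - h_t
    return [w + eta * delta * f for w, f in zip(theta, phi_t)]

# Two illustrative features: one update moves H(s_t) halfway towards H(s_{t+1}).
theta = td_backup([0.5, 0.0], phi_t=[1.0, 0.0], phi_t1=[0.0, 1.0], eta=0.5)
```

A minimax backup changes only where the target comes from, not the shape of this update.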
Samuel\u2019s checkers player (Samuel, 1959) introduced this idea, using an early form of bootstrapping from search that we call TD-Root. The parameters of the heuristic function, \u03b8, were adjusted towards the minimax search value at the next complete time-step (see Figure 1), H\u03b8(st) \u03b8\u2190 V^D_{st+1}(st+1). This approach enabled Samuel\u2019s checkers program to achieve human amateur level play. Unfortunately, Samuel\u2019s approach was handicapped by tying his evaluation function to the material advantage, and not to the actual outcome from the position.\nThe TD-Leaf algorithm (Baxter et al., 1998) updates the value of a minimax search at one time-step towards the value of a minimax search at the subsequent time-step (see Figure 1). The parameters of the heuristic function are updated by gradient descent, using an update of the form V^D_{st}(st) \u03b8\u2190 V^D_{st+1}(st+1). The root value of minimax search is not differentiable in the parameters, as a small change in the heuristic value can result in the principal variation switching to a completely different path through the tree. The TD-Leaf algorithm ignores these non-differentiable boundaries by assuming that the principal variation remains unchanged, and follows the local gradient given that variation. This is equivalent to updating the heuristic function of the principal leaf, H\u03b8(l^D(st)) \u03b8\u2190 V^D_{st+1}(st+1). The chess program Knightcap achieved master-level play when trained using TD-Leaf against a series of evenly matched human opposition, whose strength improved at a similar rate to Knightcap\u2019s. A similar algorithm was introduced contemporaneously by Beal and Smith (1997), and was used to learn the material values of chess pieces. 
The world champion check-\ners program Chinook used TD-Leaf to learn an evaluation function that compared favorably to its\nhand-tuned heuristic function (Schaeffer et al., 2001).\nBoth TD-Root and TD-Leaf are hybrid algorithms that combine a sample backup with a minimax\nbackup, updating the current value towards the search value at a subsequent time-step. Thus the\naccuracy of the learning target depends both on the quality of the players, and on the quality of the\nsearch. One consequence is that these learning algorithms are not robust to variations in the training\nregime. In their experiments with the chess program Knightcap (Baxter et al., 1998), the authors\nfound that it was necessary to prune training examples in which the opponent blundered or made\nan unpredictable move. In addition, the program was unable to learn effectively from games of\nself-play, and required evenly matched opposition. Perhaps most signi\ufb01cantly, the piece values were\ninitialised to human expert values; experiments starting from zero or random weights were unable\nto exceed weak amateur level. Similarly, the experiments with TD-Leaf in Chinook also \ufb01xed the\nimportant checker and king values to human expert values.\nIn addition, both Samuel\u2019s approach and TD-Leaf only update one node of the search tree. This\ndoes not make ef\ufb01cient use of the large tree of data, typically containing millions of values, that\nis constructed by memory enhanced minimax search variants. Furthermore, the distribution of root\npositions that are used to train the heuristic is very different from the distribution of positions that are\nevaluated during search. 
This can lead to inaccurate evaluation of positions that occur infrequently during real games but frequently within a large search tree; these anomalous values have a tendency to propagate up through the search tree, ultimately affecting the choice of best move at the root.\nIn the following section, we develop an algorithm that attempts to address these shortcomings.\n\n3 Minimax Search Bootstrapping\n\nOur first algorithm, RootStrap(minimax), performs a minimax search from the current position st, at every time-step t. The parameters are updated so as to move the heuristic value of the root node towards the minimax search value, H\u03b8(st) \u03b8\u2190 V^D_{st}(st). We update the parameters by stochastic gradient descent on the squared error between the heuristic value and the minimax search value. We treat the minimax search value as a constant, to ensure that we move the heuristic towards the search value, and not the other way around.\n\n\u03b4t = V^D_{st}(st) \u2212 H\u03b8(st)\n\u2206\u03b8 = \u2212(\u03b7/2) \u2207\u03b8 \u03b4t^2 = \u03b7 \u03b4t \u2207\u03b8 H\u03b8(st)\n\nwhere \u03b7 is a step-size constant. RootStrap(\u03b1\u03b2) is equivalent to RootStrap(minimax), except it uses the more efficient \u03b1\u03b2-search algorithm to compute V^D_{st}(st).\nFor the remainder of this paper we consider heuristic functions that are computed by a linear combination H\u03b8(s) = \u03c6(s)^T \u03b8, where \u03c6(s) is a vector of features of position s, and \u03b8 is a parameter vector specifying the weight of each feature in the linear combination. Although simple, this form of heuristic has already proven sufficient to achieve super-human performance in the games of Chess (Campbell et al., 2002), Checkers (Schaeffer et al., 2001) and Othello (Buro, 1999). The gradient descent update for RootStrap(minimax) then takes the particularly simple form \u2206\u03b8t = \u03b7 \u03b4t \u03c6(st).\n\nAlgorithm            Backup\nTD                   H\u03b8(st) \u03b8\u2190 H\u03b8(st+1)\nTD-Root              H\u03b8(st) \u03b8\u2190 V^D_{st+1}(st+1)\nTD-Leaf              H\u03b8(l^D(st)) \u03b8\u2190 V^D_{st+1}(st+1)\nRootStrap(minimax)   H\u03b8(st) \u03b8\u2190 V^D_{st}(st)\nTreeStrap(minimax)   H\u03b8(s) \u03b8\u2190 V^D_{st}(s), \u2200s \u2208 T^D_{st}\nTreeStrap(\u03b1\u03b2)        H\u03b8(s) \u03b8\u2190 [b^D_{st}(s), a^D_{st}(s)], \u2200s \u2208 T^\u03b1\u03b2_t\n\nTable 1: Backups for various learning algorithms.\n\nAlgorithm 1 TreeStrap(minimax)\n  Randomly initialise \u03b8\n  Initialise t \u2190 1, s1 \u2190 start state\n  while st is not terminal do\n    \u2206\u03b8 \u2190 0\n    V \u2190 minimax(st, H\u03b8, D)\n    for s \u2208 search tree do\n      \u03b4 \u2190 V(s) \u2212 H\u03b8(s)\n      \u2206\u03b8 \u2190 \u2206\u03b8 + \u03b7\u03b4\u03c6(s)\n    end for\n    \u03b8 \u2190 \u03b8 + \u2206\u03b8\n    Select at = argmax_{a \u2208 A} V(st \u25e6 a)\n    Execute move at, receive st+1\n    t \u2190 t + 1\n  end while\n\nAlgorithm 2 DeltaFromTransTbl(s, d)\n  Initialise \u2206\u03b8 \u2190 0, t \u2190 probe(s)\n  if t is null or depth(t) < d then\n    return \u2206\u03b8\n  end if\n  if lowerbound(t) > H\u03b8(s) then\n    \u2206\u03b8 \u2190 \u2206\u03b8 + \u03b7(lowerbound(t) \u2212 H\u03b8(s))\u2207H\u03b8(s)\n  end if\n  if upperbound(t) < H\u03b8(s) then\n    \u2206\u03b8 \u2190 \u2206\u03b8 + \u03b7(upperbound(t) \u2212 H\u03b8(s))\u2207H\u03b8(s)\n  end if\n  for s' \u2208 succ(s) do\n    \u2206\u03b8 \u2190 \u2206\u03b8 + DeltaFromTransTbl(s', d)\n  end for\n  return \u2206\u03b8\n\nOur second algorithm, TreeStrap(minimax), also performs a minimax search from the current position st. 
However, TreeStrap(minimax) updates all interior nodes within the search tree. The parameters are updated, for each position s in the tree, towards the minimax search value of s, H\u03b8(s) \u03b8\u2190 V^D_{st}(s), \u2200s \u2208 T^D_{st}. This is again achieved by stochastic gradient descent,\n\n\u03b4t(s) = V^D_{st}(s) \u2212 H\u03b8(s)\n\u2206\u03b8 = \u2212(\u03b7/2) \u2207\u03b8 \u2211_{s \u2208 T^D_{st}} \u03b4t(s)^2 = \u03b7 \u2211_{s \u2208 T^D_{st}} \u03b4t(s)\u03c6(s)\n\nThe complete algorithm for TreeStrap(minimax) is described in Algorithm 1.\n\n4 Alpha-Beta Search Bootstrapping\n\nThe concept of minimax search bootstrapping can be extended to \u03b1\u03b2-search. Unlike minimax search, alpha-beta does not compute an exact value for the majority of nodes in the search tree. Instead, the search is cut off when the value of the node is sufficiently high or low that it can no longer contribute to the principal variation. We consider a depth D alpha-beta search from root position s0, and denote the upper and lower bounds computed for node s by a^D_{s0}(s) and b^D_{s0}(s) respectively, so that b^D_{s0}(s) \u2264 V^D_{s0}(s) \u2264 a^D_{s0}(s). Only one bound applies in cut off nodes: in the case of an alpha-cut we define b^D_{s0}(s) to be \u2212\u221e, and in the case of a beta-cut we define a^D_{s0}(s) to be \u221e. If no cut off occurs then the bounds are exact, i.e. a^D_{s0}(s) = b^D_{s0}(s) = V^D_{s0}(s).\nThe bounded values computed by alpha-beta can be exploited by search bootstrapping, by using a one-sided loss function. If the value from the heuristic evaluation is larger than the a-bound of the deep search value, then it is reduced towards the a-bound, H\u03b8(s) \u03b8\u2190 a^D_{st}(s). Similarly, if the value from the heuristic evaluation is smaller than the b-bound of the deep search value, then it is increased towards the b-bound, H\u03b8(s) \u03b8\u2190 b^D_{st}(s). We implement this idea by gradient descent on the sum of one-sided squared errors:\n\n\u03b4a_t(s) = a^D_{st}(s) \u2212 H\u03b8(s) if H\u03b8(s) > a^D_{st}(s), 0 otherwise\n\u03b4b_t(s) = b^D_{st}(s) \u2212 H\u03b8(s) if H\u03b8(s) < b^D_{st}(s), 0 otherwise\n\ngiving\n\n\u2206\u03b8t = \u2212(\u03b7/2) \u2207\u03b8 \u2211_{s \u2208 T^\u03b1\u03b2_t} (\u03b4a_t(s)^2 + \u03b4b_t(s)^2) = \u03b7 \u2211_{s \u2208 T^\u03b1\u03b2_t} (\u03b4a_t(s) + \u03b4b_t(s)) \u03c6(s)\n\nwhere T^\u03b1\u03b2_t is the set of nodes in the alpha-beta search tree at time t. We call this algorithm TreeStrap(\u03b1\u03b2), and note that the update for each node s is equivalent to the TreeStrap(minimax) update when no cut-off occurs.\n\n4.1 Updating Parameters in TreeStrap(\u03b1\u03b2)\n\nHigh performance \u03b1\u03b2-search routines rely on transposition tables for move ordering, reducing the size of the search space, and for caching previous search results (Schaeffer, 1989). A natural way to compute \u2206\u03b8 for TreeStrap(\u03b1\u03b2) from a completed \u03b1\u03b2-search is to recursively step through the transposition table, summing any relevant bound information. 
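Per node, these one-sided errors only generate a gradient when the heuristic violates a bound; a minimal sketch for a linear heuristic (names are illustrative, not Meep's code):

```python
# TreeStrap(alpha-beta) per-node errors: pull the heuristic down towards the
# upper (a) bound when it exceeds it, up towards the lower (b) bound when it
# falls below it, and leave it untouched when it lies between the bounds.

def one_sided_deltas(h, a_bound, b_bound):
    delta_a = a_bound - h if h > a_bound else 0.0
    delta_b = b_bound - h if h < b_bound else 0.0
    return delta_a, delta_b

def treestrap_ab_node_update(theta, phi, a_bound, b_bound, eta):
    h = sum(w * f for w, f in zip(theta, phi))
    delta_a, delta_b = one_sided_deltas(h, a_bound, b_bound)
    return [w + eta * (delta_a + delta_b) * f for w, f in zip(theta, phi)]
```

Summing such per-node contributions over the entries recorded in the transposition table yields the overall \u2206\u03b8.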
We call this procedure DeltaFromTransTbl, and give the pseudo-code for it in Algorithm 2.\nDeltaFromTransTbl requires a standard transposition table implementation providing the following routines:\n\n\u2022 probe(s), which returns the transposition table entry associated with state s.\n\u2022 depth(t), which returns the amount of search depth used to determine the bound estimates stored in transposition table entry t.\n\u2022 lowerbound(t), which returns the lower bound stored in transposition entry t.\n\u2022 upperbound(t), which returns the upper bound stored in transposition entry t.\n\nIn addition, DeltaFromTransTbl requires a parameter d \u2265 1, which limits updates to \u2206\u03b8 to transposition table entries searched to a depth of at least d. This can be used to control the number of positions that contribute to \u2206\u03b8 during a single update, or limit the computational overhead of the procedure.\n\n4.2 The TreeStrap(\u03b1\u03b2) algorithm\n\nThe TreeStrap(\u03b1\u03b2) algorithm can be obtained by two straightforward modifications to Algorithm 1. First, the call to minimax(st, H\u03b8, D) must be replaced with a call to \u03b1\u03b2-search(st, H\u03b8, D). Secondly, the inner loop computing \u2206\u03b8 is replaced by invoking DeltaFromTransTbl(st).\n\n5 Learning Chess Program\n\nWe implemented our learning algorithms in Meep, a modified version of the tournament chess engine Bodo. For our experiments, the hand-crafted evaluation function of Bodo was removed and replaced by a weighted linear combination of 1812 features. Given a position s, a feature vector \u03c6(s) can be constructed from the 1812 numeric values of each feature. The majority of these features are binary. \u03c6(s) is typically sparse, with approximately 100 features active in any given position. 
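Such sparsity makes both evaluation and weight updates cheap, since only the active features need to be touched; a minimal sketch (the feature indices and values below are hypothetical, not Meep's actual feature set):

```python
# Sparse linear evaluation H(s) = phi(s) . theta, with a position represented
# only by its active (feature index, value) pairs.

def evaluate(theta, active):
    """Sum weights over the ~100 active features instead of all 1812."""
    return sum(theta[i] * v for i, v in active)

def sparse_update(theta, active, delta, eta):
    """Apply theta += eta * delta * phi(s), touching only active entries."""
    for i, v in active:
        theta[i] += eta * delta * v

theta = [0.0] * 1812
active = [(0, 1.0), (7, 2.0)]  # hypothetical indices, e.g. two binary/count features
sparse_update(theta, active, delta=1.0, eta=0.5)
```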
Five well-known, chess-specific feature construction concepts: material, piece-square tables, pawn structure, mobility and king safety were used to generate the 1812 distinct features. These features were a strict subset of the features used in Bodo, which are themselves simplistic compared to a typical tournament engine (Campbell et al., 2002).\nThe evaluation function H\u03b8(s) was a weighted linear combination of the features, i.e. H\u03b8(s) = \u03c6(s)^T \u03b8. All components of \u03b8 were initialised to small random numbers. Terminal positions were evaluated as \u22129999.0, 0 and 9999.0 for a loss, draw and win respectively. In the search tree, mate scores were adjusted inward slightly so that shorter paths to mate were preferred when giving mate, and vice-versa. When applying the heuristic evaluation function in the search, the heuristic estimates were truncated to the interval [\u22129900.0, 9900.0].\nMeep contains two different modes: a tournament mode and a training mode. When in tournament mode, Meep uses an enhanced alpha-beta based search algorithm. Tournament mode is used for evaluating the strength of a weight configuration. In training mode, however, one of two different types of game tree search algorithms is used. The first is a minimax search that stores the entire game tree in memory. This is used by the TreeStrap(minimax) algorithm. The second is a generic alpha-beta search implementation that uses only three well-known alpha-beta search enhancements: transposition tables, killer move tables and the history heuristic (Schaeffer, 1989). This simplified search routine was used by the TreeStrap(\u03b1\u03b2) and RootStrap(\u03b1\u03b2) algorithms. In addition, to reduce the horizon effect, checking moves were extended by one ply. 
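The terminal scoring and score truncation described earlier in this section can be sketched as follows (a sketch of the stated conventions, not Meep's implementation):

```python
# Terminal positions score -9999/0/+9999 for loss/draw/win, while heuristic
# estimates are clamped to [-9900, 9900] so that a heuristic value can never
# be mistaken for a proven mate score inside the search.

LOSS, DRAW, WIN = -9999.0, 0.0, 9999.0

def truncate(h):
    """Clamp a heuristic estimate to the interval [-9900, 9900]."""
    return max(-9900.0, min(9900.0, h))
```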
During training, the transposition\ntable was cleared before the search routine was invoked.\nSimpli\ufb01ed search algorithms were used during training to avoid complicated interactions with the\nmore advanced heuristic search techniques (such as null move pruning) useful in tournament play.\nIt must be stressed that during training, no heuristic or move ordering techniques dependent on\nknowing properties of the evaluation weights were used by the search algorithms.\nFurthermore, a quiescence search (Beal, 1990) that examined all captures and check evasions was\napplied to leaf nodes. This was to improve the stability of the leaf node evaluations. Again, no\nknowledge based pruning was performed inside the quiescence search tree, which meant that the\nquiescence routine was considerably slower than in Bodo.\n\n6 Experimental Results\n\nWe describe the details of our training procedures, and then proceed to explore the performance\ncharacteristics of our algorithms, RootStrap(\u03b1\u03b2), TreeStrap(minimax) and TreeStrap(\u03b1\u03b2) through\nboth a large local tournament and online play. We present our results in terms of Elo ratings. This is\nthe standard way of quantifying the strength of a chess player within a pool of players. A 300 to 500\nElo rating point difference implies a winning rate of about 85% to 95% for the higher rated player.\n\n6.0.1 Training Methodology\n\nAt the start of each experiment, all weights were initialised to small random values. Games of self-\nplay were then used to train each player. To maintain diversity during training, a small opening book\nwas used. Once outside of the opening book, moves were selected greedily from the results of the\nsearch. Each training game was played within 1m 1s Fischer time controls. That is, both players\nstart with a minute on the clock, and gain an additional second every time they make a move. 
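The Elo-to-winning-rate conversion quoted at the start of Section 6 follows from the standard logistic rating model (a textbook formula, not something introduced in the paper):

```python
# Expected score of the higher-rated player under the logistic Elo model.

def expected_score(elo_diff):
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

# A 300 to 500 point gap corresponds to roughly an 85% to 95% winning rate.
```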
Each training game would last roughly five minutes.\nWe selected the best step-size for each learning algorithm from a series of preliminary experiments: \u03b7 = 1.0 \u00d7 10\u22125 for TD-Leaf and RootStrap(\u03b1\u03b2), \u03b7 = 1.0 \u00d7 10\u22126 for TreeStrap(minimax) and \u03b7 = 5.0 \u00d7 10\u22127 for TreeStrap(\u03b1\u03b2). The TreeStrap variants used a minimum search depth parameter of d = 1. This meant that the target values were determined by at least one ply of full-width search, plus a varying amount of quiescence search.\n\n6.1 Relative Performance Evaluation\n\nWe ran a competition between many different versions of Meep in tournament mode, each using a heuristic function learned by one of our algorithms. In addition, a player based on randomly initialised weights was included as a reference, and arbitrarily assigned an Elo rating of 250. The best ratings achieved by each training method are displayed in Table 2.\nWe also measured the performance of each algorithm at intermediate stages throughout training. Figure 2 shows the performance of each learning algorithm with increasing numbers of games on a single training run. As each training game is played using the same time controls, this shows the performance of each learning algorithm given a fixed amount of computation.\n\nFigure 2: Performance when trained via self-play starting from random initial weights. 95% confidence intervals are marked at each data point. The x-axis uses a logarithmic scale.\n\nAlgorithm            Elo\nTreeStrap(\u03b1\u03b2)        2157 \u00b1 31\nTreeStrap(minimax)   1807 \u00b1 32\nRootStrap(\u03b1\u03b2)        1362 \u00b1 59\nTD-Leaf              1068 \u00b1 36\nUntrained             250 \u00b1 63\n\nTable 2: Best performance when trained by self play. 95% confidence intervals given.\n
Importantly, the time\nused for each learning update also took away from the total thinking time.\nThe data shown in Table 2 and Figure 2 was generated by BayesElo, a freely available program that\ncomputes maximum likelihood Elo ratings. In each table, the estimated Elo rating is given along\nwith a 95% con\ufb01dence interval. All Elo values are calculated relative to the reference player, and\nshould not be compared with Elo ratings of human chess players (including the results of online\nplay, described in the next section). Approximately 16000 games were played in the tournament.\nThe results demonstrate that learning from many nodes in the search tree is signi\ufb01cantly more ef\ufb01-\ncient than learning from a single root node. TreeStrap(minimax) and TreeStrap(\u03b1\u03b2) learn effective\nweights in just a thousand training games and attain much better maximum performance within the\nduration of training. In addition, learning from alpha-beta search is more effective than learning\nfrom minimax search. Alpha-beta search signi\ufb01cantly boosts the search depth, by safely pruning\naway subtrees that cannot affect the minimax value at the root. Although the majority of nodes now\ncontain one-sided bounds rather than exact values, it appears that the improvements to the search\ndepth outweigh the loss of bound information.\nOur results demonstrate that the TreeStrap based algorithms can learn a good set of weights, starting\nfrom random weights, from self-play in the game of chess. Our experiences using TD-Leaf in this\nsetting were similar to those described in (Baxter et al., 1998); within the limits of our training\nscheme, learning occurred, but only to the level of weak amateur play. 
Our results suggest that TreeStrap based methods are potentially less sensitive to initial starting conditions, and allow for speedier convergence in self play; it will be interesting to see whether similar results carry across to domains other than chess.\n\n6.2 Evaluation by Internet Play\n\nWe also evaluated the performance of the heuristic function learned by TreeStrap(\u03b1\u03b2), by using it in Meep to play against predominantly human opposition at the Internet Chess Club. We evaluated two heuristic functions, the first using weights trained by self-play, and the second using weights trained against Shredder, a grandmaster-strength commercial chess program.\n\nAlgorithm        Training Partner   Rating\nTreeStrap(\u03b1\u03b2)    Self Play          1950-2197\nTreeStrap(\u03b1\u03b2)    Shredder           2154-2338\n\nTable 3: Blitz performance at the Internet Chess Club\n\nThe hardware used online was a 1.8GHz Opteron, with 256MB of RAM being used for the transposition table. Approximately 350K nodes per second were seen when using the learned evaluation function. A small opening book was used to make the engine play a variety of different opening lines. Compared to Bodo, the learned evaluation routine was approximately 3 times slower, even though the evaluation function contained fewer features. This was due to a less optimised implementation, and the heavy use of floating point arithmetic.\nApproximately 1000 games were played online, using 3m 3s Fischer time controls, for each heuristic function. Although the heuristic function was fixed, the online rating fluctuates significantly over time. 
This is due to the high K factor used by the Internet Chess Club to update Elo ratings, which is tailored to human players rather than computer engines.\nThe online rating of the heuristic learned by self-play corresponds to weak master level play. The heuristic learned from games against Shredder was roughly 150 Elo stronger, corresponding to master level performance. Like TD-Leaf, TreeStrap also benefits from a carefully chosen opponent, though the difference between self-play and ideal conditions is much less drastic. Furthermore, a total of 13.5/15 points were scored against registered members who had achieved the title of International Master.\nWe expect that these results could be further improved by using more powerful hardware, a more sophisticated evaluation function, or a better opening book. Furthermore, we used a generic alpha-beta search algorithm for learning. An interesting follow-up would be to explore the interaction between our learning algorithms and the more exotic alpha-beta search enhancements.\n\n7 Conclusion\n\nOur main result is demonstrating, for the first time, an algorithm that learns to play master level chess entirely through self-play, starting from random weights. To provide insight into the nature of our algorithms, we focused on a single non-trivial domain. However, the ideas that we have introduced are rather general, and may have applications beyond deterministic two-player game tree search.\nBootstrapping from search could, in principle, be applied to many other search algorithms. Simulation-based search algorithms, such as UCT, have outperformed traditional search algorithms in a number of domains. The TreeStrap algorithm could be applied, for example, to the heuristic function that is used to initialise nodes in a UCT search tree with prior knowledge (Gelly & Silver, 2007). 
Alternatively, in stochastic domains the evaluation function could be updated towards the value of an expectimax search, or towards the one-sided bounds computed by a *-minimax search (Hauk et al., 2004; Veness & Blair, 2007). This approach could be viewed as a generalisation of approximate dynamic programming, in which the value function is updated from a multi-ply Bellman backup.\n\nAcknowledgments\n\nNICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.\n\nReferences\n\nBaxter, J., Tridgell, A., & Weaver, L. (1998). Knightcap: a chess program that learns by combining TD(lambda) with game-tree search. Proc. 15th International Conf. on Machine Learning (pp. 28\u201336). Morgan Kaufmann, San Francisco, CA.\nBeal, D. F. (1990). A generalised quiescence search algorithm. Artificial Intelligence, 43, 85\u201398.\nBeal, D. F., & Smith, M. C. (1997). Learning piece values using temporal differences. Journal of the International Computer Chess Association.\nBuro, M. (1999). From simple features to sophisticated evaluation functions. First International Conference on Computers and Games (pp. 126\u2013145).\nCampbell, M., Hoane, A., & Hsu, F. (2002). Deep Blue. Artificial Intelligence, 134, 57\u201383.\nGelly, S., & Silver, D. (2007). Combining online and offline learning in UCT. 17th International Conference on Machine Learning (pp. 273\u2013280).\nHauk, T., Buro, M., & Schaeffer, J. (2004). Rediscovering *-minimax search. Computers and Games (pp. 35\u201350).\nSamuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3.\nSchaeffer, J. (1989). The history heuristic and alpha-beta search enhancements in practice. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-11, 1203\u20131212.\nSchaeffer, J., Hlynka, M., & Jussila, V. (2001). Temporal difference learning applied to a high performance game playing program. IJCAI, 529\u2013534.\nSutton, R. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3, 9\u201344.\nTesauro, G. (1994). TD-gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6, 215\u2013219.\nVeness, J., & Blair, A. (2007). Effective use of transposition tables in stochastic game tree search. IEEE Symposium on Computational Intelligence and Games (pp. 112\u2013116).\n", "award": [], "sourceid": 508, "authors": [{"given_name": "Joel", "family_name": "Veness", "institution": null}, {"given_name": "David", "family_name": "Silver", "institution": null}, {"given_name": "Alan", "family_name": "Blair", "institution": null}, {"given_name": "William", "family_name": "Uther", "institution": "NICTA & UNSW"}]}