{"title": "A Scalable Machine Learning Approach to Go", "book": "Advances in Neural Information Processing Systems", "page_first": 1521, "page_last": 1528, "abstract": null, "full_text": "A Scalable Machine Learning Approach to Go\n\nLin Wu and Pierre Baldi\n\nSchool of Information and Computer Sciences\n\nUniversity of California, Irvine\n\nIrvine, CA 92697-3435\n\nlwu,pfbaldi@ics.uci.edu\n\nAbstract\n\nGo is an ancient board game that poses unique opportunities and challenges for AI\nand machine learning. Here we develop a machine learning approach to Go, and\nrelated board games, focusing primarily on the problem of learning a good eval-\nuation function in a scalable way. Scalability is essential at multiple levels, from\nthe library of local tactical patterns, to the integration of patterns across the board,\nto the size of the board itself. The system we propose is capable of automatically\nlearning the propensity of local patterns from a library of games. Propensity and\nother local tactical information are fed into a recursive neural network, derived\nfrom a Bayesian network architecture. The network integrates local information\nacross the board and produces local outputs that represent local territory owner-\nship probabilities. The aggregation of these probabilities provides an effective\nstrategic evaluation function that is an estimate of the expected area at the end (or\nat other stages) of the game. Local area targets for training can be derived from\ndatasets of human games. A system trained using only 9 \u00d7 9 amateur game data\nperforms surprisingly well on a test set derived from 19 \u00d7 19 professional game\ndata. Possible directions for further improvements are brie\ufb02y discussed.\n\n1 Introduction\n\nGo is an ancient board game\u2013over 3,000 years old [6, 5]\u2013that poses unique opportunities and chal-\nlenges for arti\ufb01cial intelligence and machine learning. 
The rules of Go are deceptively simple: two opponents alternately place black and white stones on the empty intersections of an odd-sized square board, traditionally of size 19 \u00d7 19. The goal of the game, in simple terms, is for each player to capture as much territory as possible across the board by encircling the opponent\u2019s stones. This disarming simplicity, however, conceals a formidable combinatorial complexity [2]. On a 19 \u00d7 19 board, there are approximately 3^(19\u00d719) \u2248 10^172.24 possible board configurations and, on average, on the order of 200-300 possible moves at each step of the game, preventing any form of semi-exhaustive search. For comparison purposes, the game of chess has a much smaller branching factor, on the order of 35-40 [10, 7]. Today, computer chess programs, built essentially on search techniques and running on a simple PC, can rival or even surpass the best human players. In contrast, and in spite of several decades of significant research efforts and of progress in hardware speed, the best Go programs of today are easily defeated by an average human amateur.\n\nBesides the intrinsic challenge of the game, and the non-trivial market created by over 100 million players worldwide, Go raises other important questions for our understanding of natural or artificial intelligence in the distilled setting created by the simple rules of a game, uncluttered by the endless complexities of the \u201creal world\u201d. For example, to many observers, current computer solutions to chess appear \u201cbrute force\u201d, hence \u201cunintelligent\u201d. But is this perception correct, or an illusion\u2013is there something like true intelligence beyond \u201cbrute force\u201d and computational power? 
Where is Go situated in the apparent tug-of-war between intelligence and sheer computational power?\n\nAnother fundamental question that is particularly salient in the Go setting is the question of knowledge transfer. Humans learn to play Go on boards of smaller size\u2013typically 9 \u00d7 9\u2013and then \u201ctransfer\u201d their knowledge to the larger 19 \u00d7 19 standard size. How can we develop algorithms that are capable of knowledge transfer?\n\nHere we take modest steps towards addressing these challenges by developing a scalable machine learning approach to Go. Good evaluation functions and search algorithms are clearly essential ingredients of computer board-game systems. Here we focus primarily on the problem of learning a good evaluation function for Go in a scalable way. We do include simple search algorithms in our system, as many other programs do, but this is not the primary focus. By scalability we mean that a main goal is to develop the system more or less automatically, using machine learning approaches, with minimal human intervention and handcrafting. The system ought to be able to transfer information from one board size (e.g. 9 \u00d7 9) to another size (e.g. 19 \u00d7 19).\n\nWe take inspiration from three ingredients that seem essential to the human evaluation process in Go: the understanding of local patterns, the ability to combine patterns, and the ability to relate tactical and strategic goals. Our system is built to learn these three capabilities automatically and attempts to combine the strengths of existing systems while avoiding some of their weaknesses. The system is capable of automatically learning the propensity of local patterns from a library of games. Propensity and other local tactical information are fed into a recursive neural network, derived from a Bayesian network architecture. 
The network integrates local information across the board and produces local outputs that represent local territory ownership probabilities. The aggregation of these probabilities provides an effective strategic evaluation function that is an estimate of the expected area at the end (or at other stages) of the game. Local area targets for training can be derived from datasets of human games. The main results we present here are derived on a 19 \u00d7 19 board using a player trained using only 9 \u00d7 9 game data.\n\n2 Data\n\nBecause the approach to be described emphasizes scalability and learning, we are able to train our system at a given board size and use it to play at different sizes, both larger and smaller. Pure bootstrap approaches to Go, where computer players are initialized randomly and play large numbers of games, such as evolutionary approaches or reinforcement learning, have been tried [11]. We have implemented these approaches and used them for small board sizes 5 \u00d7 5 and 7 \u00d7 7. However, in our experience, these approaches do not scale up well to larger board sizes. For larger board sizes, better results are obtained using training data derived from records of games played by humans. We used available data at board sizes 9 \u00d7 9, 13 \u00d7 13, and 19 \u00d7 19.\n\nData for 9 \u00d7 9 Boards: This data set consists of 3,495 games. We randomly selected 3,166 games (90.6%) for training, and the remaining 328 games (9.4%) for validation. Most of the games in this data set are played by amateurs. A subset of 424 games (12.13%) have at least one player with an olf ranking of 29, corresponding to a very good amateur player.\n\nData for 13 \u00d7 13 Boards: This data set consists of 4,175 games. Most of the games, however, are played by rather weak players and therefore cannot be used for training. 
For validation purposes, however, we retained a subset of 91 games where both players have an olf ranking greater than or equal to 25\u2013the equivalent of a good amateur player.\n\nData for 19 \u00d7 19 Boards: This high-quality data set consists of 1,835 games played by professional players (at least 1 dan). A subset of 1,131 games (61.6%) are played by 9 dan players (the highest possible ranking). This is the dataset used in [12].\n\n3 System Architecture\n\n3.1 Evaluation Function, Outputs, and Targets\n\nBecause Go is a game about territory, it is sensible to have \u201cexpected territory\u201d be the evaluation function, and to decompose this expectation as a sum of local probabilities. More specifically, let Aij(t) denote the ownership of intersection ij on the board at time t during the game. At the end of a game, each intersection can be black, white, or both [1]. Black is represented as 1, white as 0, and both as 0.5. The same scheme with 0.5 for empty intersections, or more complicated schemes, can be used to represent ownership at various intermediate stages of the game. Let Oij(t) be the output of the learning system at intersection ij at time t in the game. Likewise, let Tij(t) be the corresponding training target. In the simplest case, we can use Tij(t) = Aij(T), where T denotes the end of the game. In this case, the output Oij(t) can be interpreted as the probability Pij(t), estimated at time t, of owning the ij intersection at the end of the game. Likewise, \u2211ij Oij(t) is the estimate, computed at time t, of the total expected area at the end of the game.\n\nPropagation of information provided by targets/rewards computed at the end of the game only, however, can be problematic. With a dataset of training examples, this problem can be addressed because intermediary area values Aij(t) are available for training for any t. 
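One natural way to exploit these intermediate values is a convex blend of end-of-game area and near-future area, which is exactly the scheme formalized just below as Equation (1). A minimal NumPy sketch of that blend; the function name and the toy 2 \u00d7 2 ownership grids are illustrative, not the paper's actual data pipeline:

```python
import numpy as np

def blended_target(area_final, area_future, w=0.25):
    """Target T(t) = (1 - w) * A(T) + w * A(t + k).

    area_final:  N x N ownership at the end of the game (1=black, 0=white, 0.5=shared)
    area_future: N x N ownership k moves ahead of the current position
    w:           weight of the near-future term (w=0 recovers end-of-game-only targets)
    """
    return (1.0 - w) * area_final + w * area_future

# Toy 2x2 example: the top-left intersection is black at the end of the game
# but still contested k moves ahead, so its target is pulled toward 0.5.
a_T = np.array([[1.0, 0.0], [1.0, 1.0]])
a_tk = np.array([[0.5, 0.0], [1.0, 1.0]])
print(blended_target(a_T, a_tk, w=0.25))  # top-left entry: 0.75*1 + 0.25*0.5 = 0.875
```

With w = 0 the future term drops out, matching the end-of-game-only case described above.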
In the simulations presented here, we use the simple scheme\n\nTij(t) = (1 \u2212 w) Aij(T) + w Aij(t + k)    (1)\n\nwhere w \u2265 0 is a parameter that controls the convex combination between the area at the end of the game and the area at some step t + k in the nearer future. w = 0 corresponds to the simple case described above where only the area at the end of the game is used in the target function. Other ways of incorporating target information from intermediary game positions are discussed briefly at the end.\n\nTo learn the evaluation function and the targets, we propose to use a graphical model (Bayesian network) which in turn leads to a directed acyclic graph recursive neural network (DAG-RNN) architecture.\n\n3.2 DAG-RNN Architectures\n\nThe architecture is closely related to an architecture originally proposed for a problem in a completely different area \u2013 the prediction of protein contact maps [8, 1]. As a Bayesian network, the architecture can be described in terms of the DAG in Figure 1 where the nodes are arranged in 6 lattice planes reflecting the Go board spatial organization. Each plane contains N \u00d7 N nodes arranged on the vertices of a square lattice. In addition to the input and output planes, there are four hidden planes for the lateral propagation and integration of information across the Go board. Within each hidden plane, the edges of the quadratic lattice are oriented towards one of the four cardinal directions (NE, NW, SE, and SW). Directed edges within a column of this architecture are given in Figure 1b. Thus each intersection ij in an N \u00d7 N board is associated with six units: an input unit Iij, four hidden units H^NE_ij, H^NW_ij, H^SW_ij, H^SE_ij, and an output unit Oij.\n\nIn a DAG-RNN the relationships between the variables are deterministic, rather than probabilistic, and implemented in terms of neural networks with weight sharing. Thus the previous architecture leads to a DAG-RNN architecture consisting of 5 neural networks of the form\n\nO_ij = N_O(I_ij, H^NE_ij, H^NW_ij, H^SW_ij, H^SE_ij)\nH^NE_ij = N_NE(I_ij, H^NE_{i-1,j}, H^NE_{i,j-1})\nH^NW_ij = N_NW(I_ij, H^NW_{i+1,j}, H^NW_{i,j-1})\nH^SW_ij = N_SW(I_ij, H^SW_{i+1,j}, H^SW_{i,j+1})\nH^SE_ij = N_SE(I_ij, H^SE_{i-1,j}, H^SE_{i,j+1})    (2)\n\nwhere, for instance, N_O is a single neural network that is shared across all spatial locations. In addition, since Go is \u201cisotropic\u201d we use a single network shared across the four hidden planes. Go however involves strong boundary effects and therefore we add one neural network N_C for the corners, shared across all four corners, and one neural network N_S for side positions, shared across all four sides. In short, the entire Go DAG-RNN architecture is described by four feedforward NNs (corner, side, lateral, output) that are shared at all corresponding locations. For each one of these feedforward neural networks, we have experimented with several architectures, but we typically use a single hidden layer. The DAG-RNN in the main simulation results uses 16 hidden nodes and 8 output nodes for the lateral propagation networks, and 16 hidden nodes and one output node for the output network. All transfer functions are logistic. The total number of free parameters is close to 6,000.\n\n[1] This is called \u201cseki\u201d. Seki is a situation where two live groups share liberties and where neither of them can fill them without dying.\n\nBecause the underlying graph is acyclic, these networks can be unfolded in space and training can proceed by simple gradient descent (back-propagation) taking into account relevant symmetries and weight sharing. 
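To make the unfolding concrete, here is a minimal NumPy sketch of one hidden plane of Equation (2), the NE plane, whose node at ij depends on its south and west neighbors. The single-layer network, its random weights, and the input encoding are placeholders for illustration, not the paper's trained networks:

```python
import numpy as np

def propagate_ne_plane(inputs, W_in, W_s, W_w, b):
    """Unfold H^NE_ij = logistic(W_in I_ij + W_s H^NE_{i-1,j} + W_w H^NE_{i,j-1} + b).

    inputs: (N, N, d_in) array of input vectors I_ij.
    Off-board predecessors contribute zero vectors (boundary condition).
    """
    N = inputs.shape[0]
    h = W_in.shape[0]
    H = np.zeros((N, N, h))
    for i in range(N):          # visit intersections in an order that guarantees
        for j in range(N):      # predecessors (i-1, j) and (i, j-1) are done
            south = H[i - 1, j] if i > 0 else np.zeros(h)
            west = H[i, j - 1] if j > 0 else np.zeros(h)
            pre = W_in @ inputs[i, j] + W_s @ south + W_w @ west + b
            H[i, j] = 1.0 / (1.0 + np.exp(-pre))  # logistic transfer function
    return H

rng = np.random.default_rng(0)
N, d_in, h = 9, 4, 8  # toy sizes: 9 x 9 board, 4 input features, 8 hidden units
H = propagate_ne_plane(rng.normal(size=(N, N, d_in)),
                       rng.normal(size=(h, d_in)) * 0.1,
                       rng.normal(size=(h, h)) * 0.1,
                       rng.normal(size=(h, h)) * 0.1,
                       np.zeros(h))
print(H.shape)  # (9, 9, 8)
```

The other three planes are the same computation with the predecessor offsets flipped, and, as noted above, the paper shares one lateral network across all four.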
Networks trained at one board size can be reused at any other board size, providing a simple mechanism for reusing and extending acquired knowledge. For a board of size N \u00d7 N, the training procedure scales like O(W M N^4) where W is the number of adjustable weights, and M is the number of training games. There are roughly N^2 board positions in a game and, for each position, N^2 outputs Oij to be trained, hence the O(N^4) scaling. Both game records and the positions within each selected game record are randomly selected during training. Weights are updated essentially online, once every 10 game positions. Training a single player on our 9 \u00d7 9 data takes on the order of a week on a current desktop computer, corresponding roughly to 50 training epochs at 3 hours per epoch.\n\n[Figure 1 here: a. Planar lattices of the architecture. b. Connection details within an ij column.]\n\nFigure 1: (a) The nodes of a DAG-RNN are regularly arranged in one input plane, one output plane, and four hidden planes. In each plane, nodes are arranged on a square lattice. The hidden planes contain directed edges associated with the square lattices. All the edges of the square lattice in each hidden plane are oriented in the direction of one of the four possible cardinal corners: NE, NW, SW, and SE. Additional directed edges run vertically in column from the input plane to each hidden plane and from each hidden plane to the output plane. (b) Connection details within one column of Figure 1a. The input node is connected to four corresponding hidden nodes, one for each hidden plane. The input node and the hidden nodes are connected to the output node. Iij is the vector of inputs at intersection ij. Oij is the corresponding output. 
Connections of each hidden node to its lattice neighbors within the same plane are also shown.\n\n3.3 Inputs\n\nAt a given board intersection, the input vector Iij has multiple components\u2013listed in Table 1. The first three components\u2013stone type, influence, and propensity\u2013are associated with the corresponding intersection and a fixed number of surrounding locations. Influence and propensity are described below in more detail. The remaining features correspond to group properties involving variable numbers of neighboring stones and are self-explanatory for those who are familiar with Go. The group Gij associated with a given intersection is the maximal set of stones of the same color that are connected to it. Neighboring (or connected) opponent groups of Gij are groups of the opposite color that are directly connected (adjacent) to Gij. The idea of using higher-order liberties is from Werf [13]. O1st and O2nd provide the number of true eyes and the number of liberties of the weakest and the second weakest neighboring opponent groups. Weakness here is defined in lexicographic order with respect to the number of eyes first, followed by the number of liberties.\n\nTable 1: Typical input features. The first three features\u2013stone type, influence, and propensity\u2013are properties associated with the corresponding intersection and a fixed number of surrounding locations. The other properties are group properties involving variable numbers of neighboring stones.\n\nFeature | Description\nb,w,e | the stone type: black, white, or empty\ninfluence | the influence from the stones of the same color and the opposing color\npropensity | a local statistic computed from 3 \u00d7 3 patterns in the training data (section 3.3)\nNeye | the number of true eyes\nN1st | the number of liberties, i.e. the number of empty intersections connected to a group of stones; also called the 1st-order liberties\nN2nd | the number of 2nd-order liberties, defined as the liberties of the 1st-order liberties\nN3rd | the number of 3rd-order liberties, defined as the liberties of the 2nd-order liberties\nN4th | the number of 4th-order liberties, defined as the liberties of the 3rd-order liberties\nO1st | features of the weakest connected opponent group (stone type, number of liberties, number of eyes)\nO2nd | features of the second weakest connected opponent group (stone type, number of liberties, number of eyes)\n\nInfluence: We use two types of influence calculation. Both algorithms are based on Chen\u2019s method [4]. One is an exact implementation of Chen\u2019s method. The other uses a stringent influence propagation rule. In Chen\u2019s exact method, any opponent stone can block the propagation of influence. With a stringent influence propagation rule, an opponent stone can block the propagation of influence if and only if it is stronger than the stone emitting the influence. Strength is again defined in lexicographic order with respect to the number of eyes first, followed by the number of liberties.\n\nPropensity\u2013Automated Learning and Scoring of a Pattern Library: We develop a method to learn local patterns and their value automatically from a database of games. The basic method is illustrated in the case of 3 \u00d7 3 patterns, which are used in the simulations. Considering rotation and mirror symmetries, there are 10 unique locations for a 3 \u00d7 3 window on a 9 \u00d7 9 board (see also [9]). Given any 3 \u00d7 3 pattern of stones on the board and a set of games, we then compute nine numbers, one for each intersection. 
These numbers are local indicators of strength or propensity. The propensity S^w_ij(p) of each intersection ij associated with stone pattern p and a 3 \u00d7 3 window w is defined as:\n\nS^w_ij(p) = (NB_ij(p) \u2212 NW_ij(p)) / (NB_ij(p) + NW_ij(p) + C)    (3)\n\nwhere NB_ij(p) is the number of times that pattern p ends with a black stone at intersection ij at the end of the games in the data, and NW_ij(p) is the same for a white stone. Both NB_ij(p) and NW_ij(p) are computed taking into account the location and the symmetries of the corresponding window w. C plays a regularizing role in the case of rare patterns and is set to 1 in the simulations. Thus S^w_ij(p) is an empirical normalized estimate of the local differential propensity towards conquering the corresponding intersection in the local context provided by the corresponding pattern and window.\n\nIn general, a given intersection ij on the board is covered by several 3 \u00d7 3 windows. Thus, for a given intersection ij on a given board, we can compute a value S^w_ij(p) for each different window that contains the intersection. In the following simulations, a single final value S_ij(p) is computed by averaging over the different w\u2019s. However, more complex schemes that retain more information can easily be envisioned by, for instance: (1) computing also the standard deviation of the S^w_ij(p) as a function of w; (2) using a weighted average, weighted by the importance of the window w; and (3) using the entire set of S^w_ij(p) values, as w varies around ij, to augment the input vector.\n\n3.4 Move Selection and Search\n\nFor a given position, the next move can be selected using one-level search by considering all possible legal moves and computing the estimate at time t of the total expected area E = \u2211ij Oij(t) at the end of the game, or some intermediate position, or a combination of both, where Oij(t) are the outputs (predicted probabilities) of the DAG-RNNs. 
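Both selection rules described next (greedy 1-ply search and Gibbs sampling) can be sketched in a few lines. The `evaluate` callable, which should return the summed output E for the position after a candidate move, and the toy scores are hypothetical stand-ins for the trained DAG-RNN:

```python
import math
import random

def select_move(legal_moves, evaluate, temp=None):
    """1-ply move selection from expected-area estimates E.

    temp=None: greedy, return the move maximizing E (1-ply search).
    temp>0:    Gibbs sampling, draw a move with probability proportional to exp(E/temp).
    """
    scores = [evaluate(m) for m in legal_moves]
    if temp is None:
        return legal_moves[scores.index(max(scores))]
    m = max(scores)  # subtract the max before exponentiating, for numerical stability
    weights = [math.exp((s - m) / temp) for s in scores]
    return random.choices(legal_moves, weights=weights, k=1)[0]

# Toy example: three candidate moves with made-up expected-area scores.
toy_scores = {"A1": 40.0, "B2": 42.5, "C3": 41.0}
print(select_move(list(toy_scores), toy_scores.get))  # greedy choice: B2
```

Lower temperatures concentrate the Gibbs distribution on the greedy choice; higher temperatures explore more.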
The next move can be chosen by maximizing this evaluation function (1-ply search). Alternatively, Gibbs sampling can be used to choose the next move among all the legal moves with a probability proportional to e^{E/Temp}, where Temp is a temperature parameter [3, 11, 12]. We have also experimented with a few other simple search schemes, such as 2-ply search (MinMax).\n\n4 Results\n\nWe trained a large number of players using the methods described above. In the absence of training data, we used pure bootstrap approaches (e.g. reinforcement learning) at sizes 5 \u00d7 5 and 7 \u00d7 7 with results that were encouraging but clearly insufficient. Not surprisingly, when used to play at larger board sizes, the RNNs trained at these small board sizes yield rather weak players. The quality of most 13 \u00d7 13 games available to us is too poor for proper training, although a small subset can be used for validation purposes. We do not have any data for sizes N = 11, 15, and 17. And because of the O(N^4) scaling, training systems directly at 19 \u00d7 19 takes many months and is currently in progress. Thus the most interesting results we report are derived by training the RNNs using the 9 \u00d7 9 game data, and using them to play at 9 \u00d7 9 and, more importantly, at larger board sizes. Several 9 \u00d7 9 players achieve comparable top performance. For conciseness, here we report the results obtained with one of them, trained with target parameters w = 0.25 and k = 2 in Equation 1.\n\n[Figure 2 here: a. Validation error vs. game phase (curves w1, w2, w30, w38). b. Percentage vs. game phase (curves rand, top1, top5, top10, top20, top30).]\n\nFigure 2: (a) Validation error vs. game phase. Phase is defined by the total number of stones on the board. 
The four curves respectively represent the validation errors of the neural network after 1, 2, 30, and 38 epochs of training. (b) Percentage of moves made by professional human players on boards of size 19 \u00d7 19 that are contained in the m top-ranked moves according to the DAG-RNN trained on 9 \u00d7 9 amateur data, for various values of m. The baseline associated with the red curve corresponds to a random uniform player.\n\nFigure 2a shows how the validation error changes as training progresses. Validation error here is defined as the relative entropy between the output probabilities produced by the RNN and the target probabilities, computed on the validation data. The validation error decreases quickly during the first epochs. In this case, no substantial decrease in validation error is observed after epoch 30. Note also how the error is smaller towards the end of the game, due both to the reduction in the number of possible moves and the strong end-of-game training signal.\n\nAn area, and hence a probability, can be assigned by the DAG-RNN to each move and used to rank moves, as described in section 3.4. Thus we can compute the average probability of moves played by good human players according to the DAG-RNN or other probabilistic systems such as [12]. In Table 2, we report such probabilities for several systems and at different board sizes. For size 19 \u00d7 19, we use the same test set used in [12]. Boltzmann5 and BoltzmannLiberties are the systems of [12], with results reported in the pre-published version of their NIPS paper. 
Table 2: Probabilities assigned by different systems to moves played by human players in test data.\n\nBoard Size | System | Log Probability | Probability\n9 \u00d7 9 | Random player | -4.13 | 1/62\n9 \u00d7 9 | RNN (1-ply search) | -1.86 | 1/7\n13 \u00d7 13 | Random player | -4.88 | 1/132\n13 \u00d7 13 | RNN (1-ply search) | -2.27 | 1/10\n19 \u00d7 19 | Random player | -5.64 | 1/281\n19 \u00d7 19 | Boltzmann5 | -5.55 | 1/254\n19 \u00d7 19 | BoltzmannLiberties | -5.27 | 1/194\n19 \u00d7 19 | RNN (1-ply search) | -2.70 | 1/15\n\nAt this size, the probabilities in the table are computed using the 80th-83rd moves of each game. For boards of size 19 \u00d7 19, a random player that selects moves uniformly at random among legal moves assigns a probability of 1/281 to the moves played by professional players in the data set. BoltzmannLiberties was able to improve this probability to 1/194. Our best DAG-RNNs trained using amateur data at 9 \u00d7 9 are capable of improving this probability further, to 1/15 (also a considerable improvement over our previous 1/42 performance presented in April 2006 at the Snowbird Learning Conference). 
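The two right-hand columns of Table 2 appear to be linked by exponentiation: exponentiating the average (natural) log probability and rounding to a 1/N form reproduces most rows of the table. A quick check for the 19 \u00d7 19 rows:

```python
import math

# Average natural-log probabilities reported in Table 2 for 19 x 19 boards.
log_probs = {"Random player": -5.64,
             "BoltzmannLiberties": -5.27,
             "RNN (1-ply search)": -2.70}

for system, lp in log_probs.items():
    p = math.exp(lp)  # back to a plain probability
    print(f"{system}: exp({lp}) = {p:.4f} ~ 1/{round(1 / p)}")
```

For example, exp(-2.70) is about 0.067, i.e. roughly 1/15, matching the table; a couple of rows (e.g. Boltzmann5) round slightly differently, presumably because of how the per-move probabilities were averaged.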
A remarkable example, where the top-ranked move according to the DAG-RNN coincides with the move actually played in a game between two very highly-ranked players, is given in Figure 3, illustrating also the underlying probabilistic territory calculations.\n\n[Figure 3 here: two 19 \u00d7 19 board diagrams with predicted territory overlays.]\n\nFigure 3: Example of an outstanding move based on territory predictions made by the DAG-RNN. For each intersection, the height of the green bar represents the estimated probability that the intersection will be owned by black at the end of the game. The figure on the left shows the predicted probabilities if black passes. The figure on the right shows the predicted probabilities if black makes the move at N12. N12 causes the greatest increase in green area and is the top-ranked move for the DAG-RNN. 
Indeed this is the move selected in the game played by Zhou, Heyang (black, 8 dan) and Chang, Hao (white, 9 dan) on 10/22/2000.\n\nFigure 2b provides a kind of ROC curve by displaying the percentage of moves made by professional human players on boards of size 19 \u00d7 19 that are contained in the m top-ranked moves according to the DAG-RNN trained on 9 \u00d7 9 amateur data, for various values of m across all phases of the game. For instance, when there are 80 stones on the board, and hence on the order of 300 legal moves available, there is a 50% chance that a move selected by a very highly ranked human player (9 dan) is found among the top 30 choices produced by the DAG-RNN.\n\n5 Conclusion\n\nWe have designed a DAG-RNN for the game of Go and demonstrated that it can learn territory predictions fairly well. Systems trained using only a set of 9 \u00d7 9 amateur games achieve surprisingly good performance on a 19 \u00d7 19 test set that contains 1,835 games played by professionals. The methods and results presented also point clearly to several possible directions for improvement that are currently under active investigation. These include: (1) obtaining larger data sets and training systems at sizes greater than 9 \u00d7 9; (2) exploiting patterns that are larger than 3 \u00d7 3, especially at the beginning of the game when the board is sparsely occupied and matching of large patterns is possible using, for instance, Zobrist hashing techniques [14]; (3) combining different players, such as players trained at different board sizes, or players trained on different phases of the game; and (4) developing better, non-exhaustive but deeper, search methods.\n\nAcknowledgments\n\nThe work of PB and LW has been supported by a Laurel Wilkening Faculty Innovation award and awards from NSF, BREP, and Sun Microsystems to PB. 
We would like to thank Jianlin Chen for developing a web-based Go graphical user interface, Nicol Schraudolph for providing the 9 \u00d7 9 and 13 \u00d7 13 data, and David Stern for providing the 19 \u00d7 19 data.\n\nReferences\n\n[1] P. Baldi and G. Pollastri. The principled design of large-scale recursive neural network architectures\u2013DAG-RNNs and the protein structure prediction problem. Journal of Machine Learning Research, 4:575\u2013602, 2003.\n\n[2] E. Berlekamp and D. Wolfe. Mathematical Go\u2013Chilling gets the last point. A K Peters, Wellesley, MA, 1994.\n\n[3] B. Brugmann. Monte Carlo Go. 1993. URL: ftp://www.joy.ne.jp/welcome/igs/Go/computer/mcgo.tex.Z.\n\n[4] Zhixing Chen. Semi-empirical quantitative theory of Go part 1: Estimation of the influence of a wall. ICGA Journal, 25(4):211\u2013218, 2002.\n\n[5] W. S. Cobb. The Book of GO. Sterling Publishing Co., New York, NY, 2002.\n\n[6] K. Iwamoto. GO for Beginners. Pantheon Books, New York, NY, 1972.\n\n[7] Aske Plaat, Jonathan Schaeffer, Wim Pijls, and Arie de Bruin. Exploiting graph properties of game trees. In 13th National Conference on Artificial Intelligence (AAAI\u201996), pages 234\u2013239, 1996.\n\n[8] G. Pollastri and P. Baldi. Prediction of contact maps by GIOHMMs and recurrent neural networks using lateral propagation from all four cardinal corners. Bioinformatics, 18:S62\u2013S70, 2002.\n\n[9] Liva Ralaivola, Lin Wu, and Pierre Baldi. SVM and pattern-enriched common fate graphs for the game of Go. In ESANN 2005, pages 485\u2013490, 2005.\n\n[10] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 2nd edition, 2002.\n\n[11] N. N. Schraudolph, P. Dayan, and T. J. Sejnowski. Temporal difference learning of position evaluation in the game of Go. In Advances in Neural Information Processing Systems 6, pages 817\u2013824, 1994.\n\n[12] David H. Stern, Thore Graepel, and David J. C. MacKay. 
Modelling uncertainty in the game of Go. In Advances in Neural Information Processing Systems 17, pages 1353\u20131360, 2005.\n\n[13] E. Werf, H. Herik, and J. Uiterwijk. Learning to score final positions in the game of Go. In Advances in Computer Games: Many Games, Many Challenges, pages 143\u2013158, 2003.\n\n[14] Albert L. Zobrist. A new hashing method with application for game playing. Technical Report 88, University of Wisconsin, April 1970. Reprinted in ICCA Journal, 13(2):69\u201373, 1990.", "award": [], "sourceid": 3094, "authors": [{"given_name": "Lin", "family_name": "Wu", "institution": null}, {"given_name": "Pierre", "family_name": "Baldi", "institution": null}]}