{"title": "Thinking Fast and Slow with Deep Learning and Tree Search", "book": "Advances in Neural Information Processing Systems", "page_first": 5360, "page_last": 5370, "abstract": "Sequential decision making problems, such as structured prediction, robotic control, and game playing, require a combination of planning policies and generalisation of those plans. In this paper, we present Expert Iteration (ExIt), a novel reinforcement learning algorithm which decomposes the problem into separate planning and generalisation tasks. Planning new policies is performed by tree search, while a deep neural network generalises those plans. Subsequently, tree search is improved by using the neural network policy to guide search, increasing the strength of new plans. In contrast, standard deep Reinforcement Learning algorithms rely on a neural network not only to generalise plans, but to discover them too. We show that ExIt outperforms REINFORCE for training a neural network to play the board game Hex, and our final tree search agent, trained tabula rasa, defeats MoHex1.0, the most recent Olympiad Champion player to be publicly released.", "full_text": "Thinking Fast and Slow\n\nwith Deep Learning and Tree Search\n\nThomas Anthony1, , Zheng Tian1, and David Barber1,2\n\n1University College London\n\n2Alan Turing Institute\n\nthomas.anthony.14@ucl.ac.uk\n\nAbstract\n\nSequential decision making problems, such as structured prediction, robotic control,\nand game playing, require a combination of planning policies and generalisation of\nthose plans. In this paper, we present Expert Iteration (EXIT), a novel reinforcement\nlearning algorithm which decomposes the problem into separate planning and\ngeneralisation tasks. Planning new policies is performed by tree search, while a\ndeep neural network generalises those plans. Subsequently, tree search is improved\nby using the neural network policy to guide search, increasing the strength of new\nplans. 
In contrast, standard deep Reinforcement Learning algorithms rely on a\nneural network not only to generalise plans, but to discover them too. We show that\nEXIT outperforms REINFORCE for training a neural network to play the board\ngame Hex, and our \ufb01nal tree search agent, trained tabula rasa, defeats MOHEX 1.0,\nthe most recent Olympiad Champion player to be publicly released.\n\n1\n\nIntroduction\n\nAccording to dual-process theory [1, 2], human reasoning consists of two different kinds of thinking.\nSystem 1 is a fast, unconscious and automatic mode of thought, also known as intuition or heuristic\nprocess. System 2, an evolutionarily recent process unique to humans, is a slow, conscious, explicit\nand rule-based mode of reasoning.\nWhen learning to complete a challenging planning task, such as playing a board game, humans exploit\nboth processes: strong intuitions allow for more effective analytic reasoning by rapidly selecting\ninteresting lines of play for consideration. Repeated deep study gradually improves intuitions.\nStronger intuitions feedback to stronger analysis, creating a closed learning loop. In other words,\nhumans learn by thinking fast and slow.\nIn deep Reinforcement Learning (RL) algorithms such as REINFORCE [3] and DQN [4], neural\nnetworks make action selections with no lookahead; this is analogous to System 1. Unlike human\nintuition, their training does not bene\ufb01t from a \u2018System 2\u2019 to suggest strong policies. In this paper,\nwe present Expert Iteration (EXIT), which uses a Tree Search as an analogue of System 2; this assists\nthe training of the neural network. In turn, the neural network is used to improve the performance of\nthe tree search by providing fast \u2018intuitions\u2019 to guide search.\nAt a low level, EXIT can be viewed as an extension of Imitation Learning (IL) methods to domains\nwhere the best known experts are unable to achieve satisfactory performance. 
In IL an apprentice\nis trained to imitate the behaviour of an expert policy. Within EXIT, we iteratively re-solve the IL\nproblem. Between each iteration, we perform an expert improvement step, where we bootstrap the\n(fast) apprentice policy to increase the performance of the (comparatively slow) expert.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fTypically, the apprentice is implemented as a deep neural network, and the expert by a tree search\nalgorithm. Expert improvement can be achieved either by using the apprentice as an initial bias in the\nsearch direction, or to assist in quickly estimating the value of states encountered in the search tree,\nor both.\nWe proceed as follows: in section 2, we cover some preliminaries. Section 3 describes the general\nform of the Expert Iteration algorithm, and discusses the roles performed by expert and apprentice.\nSections 4 and 5 dive into the implementation details of the Imitation Learning and expert improve-\nment steps of EXIT for the board game Hex. The performance of the resultant EXIT algorithm is\nreported in section 6. Sections 7 and 8 discuss our \ufb01ndings and relate the algorithm to previous\nworks.\n\n2 Preliminaries\n\n2.1 Markov Decision Processes\n\nWe consider sequential decision making in a Markov Decision Process (MDP). At each timestep\nt, an agent observes a state st and chooses an action at to take. In a terminal state sT , an episodic\nreward R is observed, which we intend to maximise.1 We can easily extend to two-player, perfect\ninformation, zero-sum games by learning policies for both players simultaneously, which aim to\nmaximise the reward for the respective player.\nWe call a distribution over the actions a available in state s a policy, and denote it \u03c0(a|s). The value\nfunction V \u03c0(s) is the mean reward from following \u03c0 starting in state s. 
By Q\u03c0(s, a) we mean the\nexpected reward from taking action a in state s, and following policy \u03c0 thereafter.\n\n2.2 Imitation Learning\n\nIn Imitation Learning (IL), we attempt to solve the MDP by mimicking an expert policy \u03c0\u2217 that\nhas been provided. Experts can arise from observing humans completing a task, or, in the context\nof structured prediction, can be calculated from labelled training data. The policy we learn through this\nmimicry is referred to as the apprentice policy.\nWe create a dataset of states of expert play, along with some target data drawn from the expert, which\nwe attempt to predict. Several choices of target data have been used. The simplest approach is to ask\nthe expert to name an optimal move \u03c0\u2217(a|s) [5]. Once we can predict expert moves, we can take\nthe action we think the expert would most probably have taken. Another approach is to estimate the\naction-value function Q\u03c0\u2217(s, a). We can then predict that function, and act greedily with respect\nto it. In contrast to direct action prediction, this target is cost-sensitive, meaning the apprentice can\ntrade off prediction errors against how costly they are [6].\n\n3 Expert iteration\n\nCompared to IL techniques, Expert Iteration (EXIT) is enriched by an expert improvement step.\nImproving the expert player and then re-solving the Imitation Learning problem allows us to exploit\nthe fast convergence properties of Imitation Learning even in contexts where no strong player was\noriginally known, including when learning tabula rasa. Previously, to solve such problems, researchers\nhave fallen back on RL algorithms that often suffer from slow convergence and high variance, and\ncan struggle with local minima.\nAt each iteration i, the algorithm proceeds as follows: we create a set Si of game states by self-play\nof the apprentice \u02c6\u03c0i\u22121. 
In each of these states, we use our expert to calculate an Imitation\nLearning target at s (e.g. the expert\u2019s action \u03c0\u2217_{i\u22121}(a|s)); the state-target pairs (e.g. (s, \u03c0\u2217_{i\u22121}(a|s)))\nform our dataset Di. We train a new apprentice \u02c6\u03c0i on Di (Imitation Learning). Then, we use our\nnew apprentice to update our expert \u03c0\u2217_i = \u03c0\u2217(a|s; \u02c6\u03c0i) (expert improvement). See Algorithm 1 for\npseudo-code.\n\n1This reward may be decomposed as a sum of intermediate rewards (i.e. R = \u2211_{t=0}^{T} r_t).\n\nThe expert policy is calculated using a tree search algorithm. By using the apprentice policy to\ndirect search effort towards promising moves, or by evaluating states encountered during search more\nquickly and accurately, we can help the expert find stronger policies. In other words, we bootstrap\nthe knowledge acquired by Imitation Learning back into the planning algorithm.\nThe Imitation Learning step is analogous to a human improving their intuition for the task by studying\nexample problems, while the expert improvement step is analogous to a human using their improved\nintuition to guide future analysis.\n\nAlgorithm 1 Expert Iteration\n1: \u02c6\u03c00 = initial_policy()\n2: \u03c0\u2217_0 = build_expert(\u02c6\u03c00)\n3: for i = 1; i \u2264 max_iterations; i++ do\n4:   Si = sample_self_play(\u02c6\u03c0i\u22121)\n5:   Di = {(s, imitation_learning_target(\u03c0\u2217_{i\u22121}(s))) | s \u2208 Si}\n6:   \u02c6\u03c0i = train_policy(Di)\n7:   \u03c0\u2217_i = build_expert(\u02c6\u03c0i)\n8: end for\n\n3.1 Choice of expert and apprentice\n\nThe learning rate of EXIT is controlled by two factors: the size of the performance gap between the\napprentice policy and the improved expert, and how close the performance of the new apprentice\nis to the performance of the expert it learns from. 
The former induces an upper bound on the new\napprentice\u2019s performance at each iteration, while the latter describes how closely we approach that\nupper bound. The choice of both expert and apprentice can have a signi\ufb01cant impact on both these\nfactors, so must be considered together.\nThe role of the expert is to perform exploration, and thereby to accurately determine strong move\nsequences, from a single position. The role of the apprentice is to generalise the policies that the\nexpert discovers across the whole state space, and to provide rapid access to that strong policy for\nbootstrapping in future searches.\nThe canonical choice of expert is a tree search algorithm. Search considers the exact dynamics of the\ngame tree local to the state under consideration. This is analogous to the lookahead human games\nplayers engage in when planning their moves. The apprentice policy can be used to bias search\ntowards promising moves, aid node evaluation, or both. By employing search, we can \ufb01nd strong\nmove sequences potentially far away from the apprentice policy, accelerating learning in complex\nscenarios. Possible tree search algorithms include Monte Carlo Tree Search [7], \u03b1-\u03b2 Search, and\ngreedy search [6].\nThe canonical apprentice is a deep neural network parametrisation of the policy. Such deep networks\nare known to be able to ef\ufb01ciently generalise across large state spaces, and they can be evaluated\nrapidly on a GPU. The precise parametrisation of the apprentice should also be informed by what\ndata would be useful for the expert. For example, if state value approximations are required, the\npolicy might be expressed implicitly through a Q function, as this can accelerate lookup.\n\n3.2 Distributed Expert Iteration\n\nBecause our tree search is orders of magnitude slower than the evaluations made during training of\nthe neural network, EXIT spends the majority of run time creating datasets of expert moves. 
Creating\nthese datasets is an embarrassingly parallel task, and the plans made can be summarised by a vector\nmeasuring well under 1 KB. This means that EXIT can be trivially parallelised across distributed\narchitectures, even with very low bandwidth.\n\n3.3 Online expert iteration\n\nIn each step of EXIT, Imitation Learning is restarted from scratch. This throws away our entire\ndataset. Since creating datasets is computationally intensive, this can add substantially to algorithm\nrun time.\nThe online version of EXIT mitigates this by aggregating all datasets generated so far at each iteration.\nIn other words, instead of training \u02c6\u03c0i on Di, we train it on D = \u222a_{j\u2264i} Dj. Such dataset aggregation is\nsimilar to the DAGGER algorithm [5]. Indeed, removing the expert improvement step from online\nEXIT reduces it to DAGGER.\nDataset aggregation in online EXIT allows us to request fewer move choices from the expert at each\niteration, while still maintaining a large dataset. By increasing the frequency at which improvements\ncan be made, the apprentice in online EXIT can generalise the expert moves sooner, and hence the\nexpert improves sooner also, which results in higher quality play appearing in the dataset.\n\n4 Imitation Learning in the game Hex\n\nWe now describe the implementation of EXIT for the board game Hex. In this section, we develop\nthe techniques for our Imitation Learning step, and test them for Imitation Learning of Monte Carlo\nTree Search (MCTS). We use this test because our intended expert is a version of Neural-MCTS,\nwhich will be described in section 5.\n\n4.1 Preliminaries\n\nHex\n\nHex is a two-player connection-based game played on an n \u00d7 n hexagonal grid. The players, denoted\nby colours black and white, alternate placing stones of their colour in empty cells. The black player\nwins if there is a sequence of adjacent black stones connecting the North edge of the board to the\nSouth edge. 
White wins if they achieve a sequence of adjacent white stones running from the West\nedge to the East edge. (See \ufb01gure 1).\n\nFigure 1: A 5 \u00d7 5 Hex game, won by white. Figure from Huang et al. [? ].\n\nHex requires complex strategy, making it challenging for deep Reinforcement Learning algorithms; its\nlarge action set and connection-based rules means it shares similar challenges for AI to Go. However,\ngames can be simulated ef\ufb01ciently because the win condition is mutually exclusive (e.g. if black has\na winning path, white cannot have one), its rules are simple, and permutations of move order are\nirrelevant to the outcome of a game. These properties make it an ideal test-bed for Reinforcement\nLearning. All our experiments are on a 9 \u00d7 9 board size.\n\nMonte Carlo Tree Search\n\nMonte Carlo Tree Search (MCTS) is an any-time best-\ufb01rst tree-search algorithm. It uses repeated\ngame simulations to estimate the value of states, and expands the tree further in more promising\nlines. When all simulations are complete, the most explored move is taken. It is used by the leading\nalgorithms in the AAAI general game-playing competition [8]. As such, it is the best known algorithm\nfor general game-playing without a long RL training procedure.\nEach simulation consists of two parts. First, a tree phase, where the tree is traversed by taking actions\naccording to a tree policy. Second, a rollout phase, where some default policy is followed until the\nsimulation reaches a terminal game state. The result returned by this simulation can then be used to\nupdate estimates of the value of each node traversed in the tree during the \ufb01rst phase.\nEach node of the search tree corresponds to a possible state s in the game. The root node corresponds\nto the current state, its children correspond to the states resulting from a single move from the current\nstate, etc. 
The edge from state s1 to s2 represents the action a taken in s1 to reach s2, and is identified\nby the pair (s1, a).\nAt each node we store n(s), the number of iterations in which the node has been visited so far. Each\nedge stores both n(s, a), the number of times it has been traversed, and r(s, a), the sum of all rewards\nobtained in simulations that passed through the edge. The tree policy depends on these statistics. The\nmost commonly used tree policy is to act greedily with respect to the upper confidence bounds for\ntrees formula [7]:\n\nUCT(s, a) = r(s, a) / n(s, a) + c_b \u221a(log n(s) / n(s, a))    (1)\n\nWhen an action a in a state sL is chosen that takes us to a position s\u2032 not yet in the search tree, the\nrollout phase begins. In the absence of domain-specific information, the default policy used is simply\nto choose actions uniformly from those available.\nTo build up the search tree, when the simulation moves from the tree phase to the rollout phase, we\nperform an expansion, adding s\u2032 to the tree as a child of sL.2 Once a rollout is complete, the reward\nsignal is propagated through the tree (a backup), with each node and edge updating statistics for visit\ncounts n(s), n(s, a) and total returns r(s, a).\nIn this work, all MCTS agents use 10,000 simulations per move, unless stated otherwise. All use a\nuniform default policy. We also use RAVE [9]. Full details are in the appendix.\n\n4.2 Imitation Learning from Monte Carlo Tree Search\n\nIn this section, we train a standard convolutional neural network3 to imitate an MCTS expert. Guo et\nal. [11] used a similar set up on Atari games. However, their results showed that the performance of\nthe learned neural network fell well short of the MCTS expert, even with a large dataset of 800,000\nMCTS moves. Our methodology described here improves on this performance.\n\nLearning Targets\n\nIn Guo et\nal. 
[11], the learning target used was simply the move chosen by MCTS. We refer to this\nas chosen-action targets (CAT), and optimise the Kullback\u2013Leibler divergence between the output\ndistribution of the network and this target. So the loss at position s is given by the formula:\n\nL_CAT = \u2212 log[\u03c0(a\u2217|s)]\n\nwhere a\u2217 = argmax_a(n(s, a)) is the move selected by MCTS.\nWe propose an alternative target, which we call tree-policy targets (TPT). The tree policy target is\nthe average tree policy of the MCTS at the root. In other words, we try to match the network output\nto the distribution over actions given by n(s, a)/n(s) where s is the position we are scoring (so\nn(s) = 10,000 in our experiments). This gives the loss:\n\nL_TPT = \u2212 \u2211_a (n(s, a) / n(s)) log[\u03c0(a|s)]\n\nUnlike CAT, TPT is cost-sensitive: when MCTS is less certain between two moves (because they\nare of similar strength), TPT penalises misclassifications less severely. Cost-sensitivity is a desirable\nproperty for an imitation learning target, as it induces the IL agent to trade off accuracy on less\nimportant decisions for greater accuracy on critical decisions.\nIn EXIT, there is additional motivation for such cost-sensitive targets, as our networks will be used\nto bias future searches. Accurate evaluations of the relative strength of actions never made by the\ncurrent expert are still important, since future experts will use the evaluations of all available moves\nto guide their search.\n\n2Sometimes multiple nodes are added to the tree per iteration, adding children to s\u2032 also. Conversely,\nsometimes an expansion threshold is used, so sL is only expanded after multiple visits.\n\n3Our network architecture is described in the appendix. 
We use Adam [10] as our optimiser.\n\n5\n\n\fSampling the position set\n\nCorrelations between the states in our dataset may reduce the effective dataset size, harming learning.\nTherefore, we construct all our datasets to consist of uncorrelated positions sampled using an\nexploration policy. To do this, we play multiple games with an exploration policy, and select a single\nstate from each game, as in Silver et al. [12]. For the initial dataset, the exploration policy is MCTS,\nwith the number of iterations reduced to 1,000 to reduce computation time and encourage a wider\ndistribution of positions.\nWe then follow the DAGGER procedure, expanding our dataset by using the most recent apprentice\npolicy to sample 100,000 more positions, again sampling one position per game to ensure that there\nwere no correlations in the dataset. This has two advantages over sampling more positions in the\nsame way: \ufb01rstly, selecting positions with the apprentice is faster, and secondly, doing so results in\npositions closer to the distribution that the apprentice network visits at test time.\n\n4.3 Results of Imitation Learning\n\nBased on our initial dataset of 100,000 MCTS moves, CAT and TPT have similar performance in the\ntask of predicting the move selected by MCTS, with average top-1 prediction errors of 47.0% and\n47.7%, and top-3 prediction errors of 65.4% and 65.7%, respectively.\nHowever, despite the very similar prediction errors, the TPT network is 50 \u00b1 13 Elo stronger than the\nCAT network, suggesting that the cost-awareness of TPT indeed gives a performance improvement. 4\nWe continued training the TPT network with the DAGGER algorithm, iteratively creating 3 more\nbatches of 100,000 moves. This additional data resulted in an improvement of 120 Elo over the \ufb01rst\nTPT network. 
Our final DAGGER TPT network achieved similar performance to the MCTS it was\ntrained to emulate, winning just over half of games played between them (87/162).\n\n5 Expert Improvement in Hex\n\nWe now have an Imitation Learning procedure that can train a strong apprentice network from MCTS.\nIn this section, we describe our Neural-MCTS (N-MCTS) algorithms, which use such apprentice\nnetworks to improve search quality.\n\n5.1 Using the Policy Network\n\nBecause the apprentice network has effectively generalised our policy, it gives us fast evaluations\nof action plausibility at the start of search. As search progresses, we discover improvements on this\napprentice policy, just as human players can correct inaccurate intuitions through lookahead.\nWe use our neural network policy to bias the MCTS tree policy towards moves we believe to be\nstronger. When a node is expanded, we evaluate the apprentice policy \u02c6\u03c0 at that state, and store it. We\nmodify the UCT formula by adding a bonus proportional to \u02c6\u03c0(a|s):\n\nUCT_{P\u2212NN}(s, a) = UCT(s, a) + w_a \u02c6\u03c0(a|s) / (n(s, a) + 1)\n\nwhere w_a weights the neural network against the simulations. This formula is adapted from one\nfound in Gelly & Silver [9]. Tuning of hyperparameters found that w_a = 100 was a good choice for\nthis parameter, which is close to the average number of simulations per action at the root when using\n10,000 iterations in the MCTS. Since this policy was trained using 10,000 iterations too, we would\nexpect that the optimal weight should be close to this average.\nThe TPT network\u2019s final layer uses a softmax output. 
Because there is no reason to suppose that\nthe optimal bonus in the UCT formula should be linear in the TPT policy probability, we view the\ntemperature of the TPT network\u2019s output layer as a hyperparameter for the N-MCTS and tune it to\nmaximise the performance of the N-MCTS.\n\n4When testing network performance, we greedily select the most likely move, because CAT and TPT may\notherwise induce different temperatures in the trained networks\u2019 policies.\n\nWhen using the strongest TPT network from section 4, N-MCTS using a policy network significantly\noutperforms our baseline MCTS, winning 97% of games. The neural network evaluations cause a\ntwo times slowdown in search. For comparison, a doubling of the number of iterations of the vanilla\nMCTS results in a win rate of 56%.\n\n5.2 Using a Value Network\n\nStrong value networks have been shown to be able to substantially improve the performance of MCTS\n[12]. Whereas a policy network allows us to narrow the search, value networks act to reduce the\nrequired search depth compared to using inaccurate rollout-based value estimation.\nHowever, our imitation learning procedure only learns a policy, not a value function. Monte Carlo\nestimates of V \u03c0\u2217(s) could be used to train a value function, but to train a value function without\nsevere overfitting requires more than 10^5 independent samples. 
Playing this many expert games is\nwell beyond our computational resources, so instead we approximate V \u03c0\u2217(s) with the value function\nof the apprentice, V \u02c6\u03c0(s), for which Monte Carlo estimates are cheap to produce.\nTo train the value network, we use a KL loss between V(s) and the sampled (binary) result z:\n\nL_V = \u2212z log[V(s)] \u2212 (1 \u2212 z) log[1 \u2212 V(s)]\n\nTo accelerate the tree search and regularise value prediction, we use a multitask network with\nseparate output heads for the apprentice policy and value prediction, and sum the losses L_V and\nL_TPT.\nTo use such a value network in the expert, whenever a leaf sL is expanded, we estimate V(s). This is\nbacked up through the tree to the root in the same way as rollout results are: each edge stores the\naverage of all evaluations made in simulations passing through it. In the tree policy, the value is\nestimated as a weighted average of the network estimate and the rollout estimate.5\n\n6 Experiments\n\n6.1 Comparison of Batch and Online EXIT to REINFORCE\n\nWe compare EXIT to the policy gradient algorithm found in Silver et al. [12], which achieved\nstate-of-the-art performance for a neural network player in the related board game Go. In Silver et al.\n[12], the algorithm was initialised by a network trained to predict human expert moves from a corpus\nof 30 million positions, and then REINFORCE [3] was used. We initialise with the best network\nfrom section 4. Such a scheme, Imitation Learning initialisation followed by Reinforcement Learning\nimprovement, is a common approach when known experts are not sufficiently strong.\nIn our batch EXIT, we perform 3 training iterations, each time creating a dataset of 243,000 moves.\nIn online EXIT, as the dataset grows, the supervised learning step takes longer, and in a na\u00efve\nimplementation would come to dominate run-time. 
We test two forms of online EXIT that avoid this.\nIn the first, we create 24,300 moves each iteration, and train on a buffer of the most recent 243,000\nexpert moves. In the second, we use all our data in training, and expand the size of the dataset by\n10% each iteration.\nFor this experiment we did not use any value networks, so that network architectures between the\npolicy gradient and EXIT are identical. All policy networks are warm-started to the best network\nfrom section 4.\nAs can be seen in figure 2, compared to REINFORCE, EXIT learns stronger policies faster. EXIT also\nshows no sign of instability: the policy improves consistently each iteration and there is little variation\nin the performance between each training run. Separating the tree search from the generalisation\nhas ensured that plans don\u2019t overfit to a current opponent, because the tree search considers multiple\npossible responses to the moves it recommends.\nOnline expert iteration substantially outperforms the batch mode, as expected. Compared to the\n\u2018buffer\u2019 version, the \u2018exponential dataset\u2019 version appears to be marginally stronger, suggesting that\nretaining a larger dataset is useful.\n\n5This is the same as the method used in Silver et al. [12].\n\nFigure 2: Elo ratings of policy gradient network and EXIT networks through training. Values are the\naverage of 5 training runs, shaded areas represent 90% confidence intervals. Time is measured by\nnumber of neural network evaluations made. Elo calculated with BayesElo [13].\n\n6.2 Comparison of Value and Policy EXIT\n\nWith sufficiently large datasets, a value network can be learnt to improve the expert further, as\ndiscussed in section 5.2. We ran asynchronous distributed online EXIT using only a policy network\nuntil our datasets contained \u223c550,000 positions. 
We then used our most recent apprentice to add a\nMonte Carlo value estimate from each of the positions in our dataset, and trained a combined policy\nand value apprentice, giving a substantial improvement in the quality of expert play.\nWe then ran EXIT with a combined value-and-policy network, creating another \u223c7,400,000 move\nchoices. For comparison, we continued the training run without using value estimation for equal time.\nOur results are shown in figure 3, which shows that value-and-policy-EXIT significantly outperforms\npolicy-only-EXIT. In particular, the improved plans from the better expert quickly manifest in a\nstronger apprentice.\nWe can also clearly see the importance of expert improvement, with later apprentices comfortably\noutperforming experts from earlier in training.\n\nFigure 3: Apprentices and experts in distributed online EXIT, with and without neural network value\nestimation. MOHEX\u2019s rating (10,000 iterations per move) is shown by the black dashed line.\n\n6.3 Performance Against MOHEX\n\nVersions of MOHEX have won every Computer Games Olympiad Hex tournament since 2009.\nMOHEX is a highly optimised algorithm, utilising a complex, hand-made theorem-proving algorithm\nwhich calculates provably suboptimal moves, to be pruned from search, and an improved rollout\npolicy; it also optionally uses a specialised end-game solver, particularly powerful for small board\nsizes. In contrast, our algorithm learns tabula rasa, without game-specific knowledge besides the rules\nof the game. Here we compare to the most recent available version, MOHEX 1.0 [14].\nTo fairly compare MOHEX to our experts with equal wall-clock times is difficult, as the relative\nspeeds of the algorithms are hardware dependent: MOHEX\u2019s theorem prover makes heavy use of the\nCPU, whereas for our experts, the GPU is the bottleneck. 
On our machine MOHEX is approximately\n50% faster.6 We tested EXIT against 10,000 iteration-MOHEX on default settings, 100,000 iteration-MOHEX,\nand against 4 second per move-MOHEX (with parallel solver switched on). EXIT won all\nmatches using just 10,000 iterations per move; results are in the table below:\n\nEXIT Setting     Time/move  EXIT win rate  MOHEX Setting    Solver  Time/move\n10^4 iterations  \u223c0.3s      75.3%          10^4 iterations  No      \u223c0.2s\n10^4 iterations  \u223c0.3s      59.3%          10^5 iterations  No      \u223c2s\n10^4 iterations  \u223c0.3s      55.6%          4s/move          Yes     4s\n\n7 Related work\n\nEXIT has several connections to existing RL algorithms, resulting from different choices of expert\nclass. For example, we can recover a version of Policy Iteration [15] by using Monte Carlo Search as\nour expert; in this case it is easy to see that Monte Carlo Tree Search gives stronger plans than Monte\nCarlo Search.\nPrevious works have also attempted to achieve Imitation Learning that outperforms the original\nexpert. Silver et al. [12] use Imitation Learning followed by Reinforcement Learning. Chang\net al. [16] use Monte Carlo estimates to calculate Q\u2217(s, a), and train an apprentice \u03c0 to maximise\n\u2211_a \u03c0(a|s)Q\u2217(s, a). At each iteration after the first, the rollout policy is changed to a mixture of the\nmost recent apprentice and the original expert. This too can be seen as blending an RL algorithm\nwith Imitation Learning: it combines Policy Iteration and Imitation Learning.\nNeither of these approaches is able to improve the original expert policy. They are useful when strong\nexperts exist, but only at the beginning of training. In contrast, because EXIT creates stronger experts\nfor itself, it is able to use experts throughout the training process.\nAlphaGo Zero (AG0) [17] presents an independently developed version of EXIT,7 and shows that\nit achieves state-of-the-art performance in Go. 
We include a detailed comparison of these closely\nrelated works in the appendix.\nUnlike standard Imitation Learning methods, EXIT can be applied to the Reinforcement Learning\nproblem: it makes no assumptions about the existence of a satisfactory expert. EXIT can be applied\nwith no domain speci\ufb01c heuristics available, as we demonstrate in our experiment, where we used a\ngeneral purpose search algorithm as our expert class.\n\n8 Conclusion\n\nWe have introduced a new Reinforcement Learning algorithm, Expert Iteration, motivated by the\ndual process theory of human thought. EXIT decomposes the Reinforcement Learning problem by\nseparating the problems of generalisation and planning. Planning is performed on a case-by-case basis,\nand only once MCTS has found a signi\ufb01cantly stronger plan is the resultant policy generalised. This\nallows for long-term planning, and results in faster learning and state-of-the-art \ufb01nal performance,\nparticularly for challenging problems.\nWe show that this algorithm signi\ufb01cantly outperforms a variant of the REINFORCE algorithm in\nlearning to play the board game Hex. The resultant tree search algorithm beats MoHex 1.0, indicating\ncompetitiveness with state-of-the-art heuristic search methods, despite being trained tabula rasa.\n\n6This machine has an Intel Xeon E5-1620 and nVidia Titan X (Maxwell), our tree search takes 0.3 seconds\n\nfor 10,000 iterations, while MOHEX takes 0.2 seconds for 10,000 iterations, with multithreading.\n\n7Our original version, with only policy networks, was published before AG0 was published, but after its\nsubmission. Our value networks were developed before AG0 was published, and published after Silver et al.[17]\n\n9\n\n\fAcknowledgements\n\nThis work was supported by the Alan Turing Institute under the EPSRC grant EP/N510129/1 and by\nAWS Cloud Credits for Research. 
We thank Andrew Clarke for help with efficiently parallelising\nthe generation of datasets, Alex Botev for assistance implementing the CNN, and Ryan Hayward for\nproviding a tool to draw Hex positions.\n\nReferences\n\n[1] J. St B. T. Evans. Heuristic and Analytic Processes in Reasoning. British Journal of Psychology,\n75(4):451\u2013468, 1984.\n\n[2] D. Kahneman. Maps of Bounded Rationality: Psychology for Behavioral Economics. The\nAmerican Economic Review, 93(5):1449\u20131475, 2003.\n\n[3] R. J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement\nLearning. Machine Learning, 8(3-4):229\u2013256, 1992.\n\n[4] V. Mnih et al. Human-Level Control through Deep Reinforcement Learning. Nature,\n518(7540):529\u2013533, 2015.\n\n[5] S. Ross, G. J. Gordon, and J. A. Bagnell. A Reduction of Imitation Learning and Structured\nPrediction to No-Regret Online Learning. AISTATS, 2011.\n\n[6] H. Daum\u00e9 III, J. Langford, and D. Marcu. Search-based Structured Prediction. Machine\nLearning, 2009.\n\n[7] L. Kocsis and C. Szepesv\u00e1ri. Bandit Based Monte-Carlo Planning. In European Conference on\nMachine Learning, pages 282\u2013293. Springer, 2006.\n\n[8] M. Genesereth, N. Love, and B. Pell. General Game Playing: Overview of the AAAI Competition.\nAI Magazine, 26(2):62, 2005.\n\n[9] S. Gelly and D. Silver. Combining Online and Offline Knowledge in UCT. In Proceedings of\nthe 24th International Conference on Machine Learning, pages 273\u2013280. ACM, 2007.\n\n[10] D. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. arXiv preprint\narXiv:1412.6980, 2014.\n\n[11] X. Guo, S. Singh, H. Lee, R. L. Lewis, and X. Wang. Deep Learning for Real-Time Atari Game\nPlay Using Offline Monte-Carlo Tree Search Planning. In Advances in Neural Information\nProcessing Systems, pages 3338\u20133346, 2014.\n\n[12] D. Silver et al. Mastering the Game of Go with Deep Neural Networks and Tree Search. 
Nature,\n529(7587):484\u2013489, 2016.\n\n[13] R. Coulom. Bayeselo. http://remi.coulom.free.fr/Bayesian-Elo/, 2005.\n\n[14] B. Arneson, R. Hayward, and P. Henderson. Monte Carlo Tree Search in Hex. In IEEE\nTransactions on Computational Intelligence and AI in Games, pages 251\u2013258. IEEE, 2010.\n\n[15] S. Ross and J. A. Bagnell. Reinforcement and Imitation Learning via Interactive No-Regret\nLearning. ArXiv e-prints, 2014.\n\n[16] K. Chang, A. Krishnamurthy, A. Agarwal, H. Daum\u00e9 III, and J. Langford. Learning to Search\nBetter Than Your Teacher. CoRR, abs/1502.02206, 2015.\n\n[17] D. Silver et al. Mastering the Game of Go without Human Knowledge. Nature, 550(7676):354\u2013359,\n2017.\n\n[18] K. Young, R. Hayward, and G. Vasan. NeuroHex: A Deep Q-learning Hex Agent. arXiv\npreprint arXiv:1604.07097, 2016.\n\n[19] Y. Goldberg and J. Nivre. Training Deterministic Parsers with Non-Deterministic Oracles.\nTransactions of the Association for Computational Linguistics, 1:403\u2013414, 2013.\n\n[20] D. Arpit, Y. Zhou, B. U. Kota, and V. Govindaraju. Normalization Propagation: A Parametric\nTechnique for Removing Internal Covariate Shift in Deep Networks. arXiv preprint\narXiv:1603.01431, 2016.\n\n[21] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and Accurate Deep Network Learning by\nExponential Linear Units (ELUs). CoRR, abs/1511.07289, 2015.\n", "award": [], "sourceid": 2771, "authors": [{"given_name": "Thomas", "family_name": "Anthony", "institution": "UCL"}, {"given_name": "Zheng", "family_name": "Tian", "institution": "UCL"}, {"given_name": "David", "family_name": "Barber", "institution": "University College London"}]}