{"title": "Maximum Entropy Monte-Carlo Planning", "book": "Advances in Neural Information Processing Systems", "page_first": 9520, "page_last": 9528, "abstract": "We develop a new algorithm for online planning in large scale sequential decision problems that improves upon the worst case efficiency of UCT.  The idea is to augment Monte-Carlo Tree Search (MCTS) with maximum entropy policy optimization, evaluating each search node by softmax values back-propagated from simulation.  To establish the effectiveness of this approach, we first investigate the single-step decision problem, stochastic softmax bandits, and show that softmax values can be estimated at an optimal convergence rate in terms of mean squared error.  We then extend this approach to general sequential decision making by developing a general MCTS algorithm, Maximum Entropy for Tree Search (MENTS).  We prove that the probability of MENTS failing to identify the best decision at the root decays exponentially, which fundamentally improves the polynomial convergence rate of UCT.  Our experimental results also demonstrate that MENTS is more sample efficient than UCT in both synthetic problems and Atari 2600 games.", "full_text": "Maximum Entropy Monte-Carlo Planning\n\nChenjun Xiao1, Jincheng Mei1, Ruitong Huang2, Dale Schuurmans1, Martin Müller1\n\n{chenjun, jmei2, daes, mmueller}@ualberta.ca, ruitong.huang@borealisai.com\n\n1University of Alberta\n\n2Borealis AI\n\nAbstract\n\nWe develop a new algorithm for online planning in large scale sequential decision problems that improves upon the worst case efficiency of UCT. The idea is to augment Monte-Carlo Tree Search (MCTS) with maximum entropy policy optimization, evaluating each search node by softmax values back-propagated from simulation. 
To establish the effectiveness of this approach, we first investigate the single-step decision problem, stochastic softmax bandits, and show that softmax values can be estimated at an optimal convergence rate in terms of mean squared error. We then extend this approach to general sequential decision making by developing a general MCTS algorithm, Maximum Entropy for Tree Search (MENTS). We prove that the probability of MENTS failing to identify the best decision at the root decays exponentially, which fundamentally improves the polynomial convergence rate of UCT. Our experimental results also demonstrate that MENTS is more sample efficient than UCT in both synthetic problems and Atari 2600 games.\n\n1 Introduction\n\nMonte Carlo planning algorithms have been widely applied in many challenging problems [12, 13]. One particularly powerful and general algorithm is Monte Carlo Tree Search (MCTS) [3]. The key idea of MCTS is to construct a search tree of states that are evaluated by averaging over outcomes from simulations. MCTS provides several major advantages over traditional online planning methods. It breaks the curse of dimensionality by simulating state-action trajectories using a domain generative model, and by building a search tree online that collects the information gathered during the simulations in an incremental manner. It can be combined with domain knowledge such as function approximations learned either online [17] or offline [12, 13]. It is highly selective: bandit algorithms are applied to balance between exploring the most uncertain branches and exploiting the most promising ones [9]. MCTS has demonstrated outstanding empirical performance in many game playing problems, but most importantly, it provably converges to the optimal policy if exploitation and exploration are balanced appropriately [9, 7].\nThe convergence property of MCTS relies heavily on the state value estimates. 
At each node of the search tree, the value estimate is also used to calculate the value of the action leading to that node. Hence, the convergence rate of the state value estimate influences the rate of convergence for states further up in the tree. However, the Monte Carlo value estimate (an average over simulation outcomes) used in MCTS does not enjoy an effective convergence guarantee when this value is back-propagated in the search tree, since for any given node, the sampling policy in the subtree is changing and the payoff sequences experienced will drift in time. In summary, the compounding error, caused by the structure of the search tree as well as the uncertainty of the Monte Carlo estimate, means that UCT can only guarantee a polynomial convergence rate of finding the best action at the root.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nIdeally, one would like to adopt a state value that can be efficiently estimated and back-propagated in a search tree. In this paper, we exploit softmax value estimates in MCTS, based on the maximum entropy policy optimization framework. To establish the effectiveness of this approach, we first propose a new stochastic softmax bandit framework for the single-step decision problem, and show that softmax values can be estimated in a sequential manner at an optimal convergence rate in terms of mean squared error. Our next contribution is to extend this approach to general sequential decision making by developing a general MCTS algorithm, Maximum Entropy for Tree Search (MENTS). We contribute new observations that the softmax state value can be efficiently back-propagated in the search tree, which enables the search algorithm to achieve a faster convergence rate towards finding the optimal action at the root. 
Our theoretical analysis shows that MENTS enjoys an exponential convergence rate to the optimal solution, fundamentally improving the polynomial convergence rate of UCT. Our experiments also demonstrate that MENTS is much more sample efficient than UCT in practice.\n\n2 Background\n\n2.1 Online Planning in Markov Decision Processes\n\nWe focus on the episodic Markov decision process (MDP) 1, which is formally defined as a 5-tuple {S, A, P, R, H}. S is the state space and A is the action space. H is the maximum number of steps in each episode, and P and R are the transition and reward functions, such that P(·|s, a) and R(s, a) give the next state distribution and the reward of taking action a at state s. We assume the transition and reward functions are deterministic for simplicity; all of our techniques easily generalize to the case of stochastic transitions and rewards, with an appropriate dependence on the variances of the transition and reward distributions. The solution of an MDP is a policy π that maps any state s to a probability distribution over actions. The optimal policy maximizes, in expectation, the cumulative sum of rewards, defined as\n\nGt = Σ_{k=0}^{H+1−t} Rt+k,    where Rt = R(st, at) for t ≤ H, and RH+1 = ν(sH+1),\n\nwhere we assume an oracle function ν that assigns stochastic evaluations to states at the end of an episode. We note that this definition can also be used as a general formulation for planning algorithms in infinite horizon MDPs, since H can be considered as the maximum search depth, with a stochastic evaluation function applied at the end. We assume ν is subgaussian with variance σ².\nFor a policy π, the state value function V^π(s) is defined to be the expected sum of rewards from s, V^π(s) = Eπ[Gt | st = s]. 
The state-action value function, also known as the Q-value, is defined similarly, Q^π(s, a) = Eπ[Gt | st = s, at = a]. The optimal value functions are the maximum values achievable by any policy, V*(s) = max_π V^π(s), Q*(s, a) = max_π Q^π(s, a). The optimal policy is defined by the greedy policy with respect to Q*, π*(s) = argmax_a Q*(s, a). It is well known that the optimal values can be recursively defined by the Bellman optimality equation,\n\nQ*(s, a) = R(s, a) + Es'|s,a [V*(s')],    V*(s) = max_a Q*(s, a).    (1)\n\nWe consider the online planning problem that uses a generative model of the MDP to compute the optimal policy at any input state, given a fixed sampling budget. The generative model is a randomized algorithm that can output the reward R(s, a) and sample a next state s' from P(·|s, a), given a state-action pair (s, a) as the input. For example, in the game of Go, if the rules of the game are known, the next board state can be predicted exactly after a move. To solve the online planning problem, an algorithm uses the generative model to sample an episode at each round, and proposes an action for the input state after the sampling budget is expended. The performance of an online planning algorithm can be measured by its probability of proposing the optimal action for the state of interest.\n\n2.2 Monte Carlo Tree Search and UCT\n\nTo solve the online planning task, Monte Carlo Tree Search (MCTS) builds a look-ahead tree T online in an incremental manner, and evaluates states with Monte Carlo simulations [3]. Each node in T is labeled by a state s, and stores a value estimate Q(s, a) and visit count N(s, a) for each action a. The estimate Q(s, a) is the mean return of all simulations starting from s and a. 
The root of T is labeled by the state of interest. At each iteration of the algorithm, one simulation starts from the root of the search tree, and proceeds in two stages: a tree policy is used to select actions within the tree until a leaf of T is reached, and an evaluation function is used at the leaf to obtain a simulation return. Typical choices of the evaluation function include function approximation with a neural network, and Monte Carlo simulations using a roll-out policy. The return is propagated upwards to all nodes along the path to the root. T is grown by expanding the leaf reached during the simulation.\n\n1All of our approaches can extend to infinite horizon MDPs.\n\nBandit algorithms are used to balance between exploring the most uncertain branches and exploiting the most promising ones. The UCT algorithm applies UCB1 as its tree policy to balance the growth of the search tree [9]. At each node of T , its tree policy selects an action with the maximum upper confidence bound\n\nUCB(s, a) = Q(s, a) + c √(log N(s) / N(s, a)),\n\nwhere N(s) = Σa N(s, a), and c is a parameter controlling exploration. The UCT algorithm has proven to be effective in many practical problems. The most famous example is its usage in AlphaGo [12, 13]. UCT is asymptotically optimal: the value estimated by UCT converges in probability to the optimal value, Q(s, a) →p Q*(s, a), ∀s ∈ S, ∀a ∈ A. The probability of finding a suboptimal action at the root converges to zero at a rate of O(1/t), where t is the simulation budget [9].\n\n2.3 Maximum Entropy Policy Optimization\n\nThe maximum entropy policy optimization problem, which augments the standard expected reward objective with an entropy regularizer, has recently drawn much attention in the reinforcement learning community [4, 5, 11]. 
Given K actions and the corresponding K-dimensional reward vector r ∈ R^K, the entropy regularized policy optimization problem finds a policy by solving\n\nmax_π { π · r + τ H(π) },    (2)\n\nwhere τ ≥ 0 is a user-specified temperature parameter which controls the degree of exploration. The most intriguing fact about this problem is that it has a closed form solution. Define the softmax Fτ and the soft indmax fτ functions,\n\nFτ(r) = τ log Σa exp(r(a)/τ),    fτ(r) = exp{(r − Fτ(r))/τ}.\n\nNote that the softmax Fτ outputs a scalar while the soft indmax fτ maps any reward vector r to a Boltzmann policy. Fτ(r), fτ(r) and (2) are connected as shown in [4, 11],\n\nFτ(r) = max_π { π · r + τ H(π) } = fτ(r) · r + τ H(fτ(r)).    (3)\n\nThis relation shows that the softmax value is an upper bound on the maximum value, and that the gap can be upper bounded by the product of τ and the maximum entropy. Note that as τ → 0, (2) approaches the standard expected reward objective, where the optimal solution is the hard-max policy. 
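The softmax Fτ, the soft indmax fτ, and the identity (3) can be illustrated with a small self-contained sketch (our own illustrative code, not from the paper; the max-subtraction trick is a standard numerical-stability detail, not part of the definitions):

```python
import math

def softmax_value(r, tau):
    """F_tau(r) = tau * log(sum_a exp(r(a)/tau)) -- a smoothed maximum."""
    m = max(r)  # subtract the max for numerical stability
    return m + tau * math.log(sum(math.exp((x - m) / tau) for x in r))

def soft_indmax(r, tau):
    """f_tau(r) = exp((r - F_tau(r))/tau) -- the Boltzmann policy over r."""
    F = softmax_value(r, tau)
    return [math.exp((x - F) / tau) for x in r]

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

r, tau = [0.9, 0.5, 0.1], 0.2
F = softmax_value(r, tau)
pi = soft_indmax(r, tau)
# Identity (3): F_tau(r) = f_tau(r) . r + tau * H(f_tau(r))
lhs = F
rhs = sum(p * x for p, x in zip(pi, r)) + tau * entropy(pi)
assert abs(lhs - rhs) < 1e-12
# F_tau upper-bounds the hard max, within tau times the maximum entropy log K
assert max(r) <= F <= max(r) + tau * math.log(len(r))
```

As τ shrinks, fτ(r) concentrates on the best arm and Fτ(r) approaches max_a r(a), matching the hard-max limit discussed above.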
Therefore, it is straightforward to generalize the entropy regularized optimization to define the softmax value functions, by replacing the hard-max operator in (1) with the softmax operators [4, 11],\n\nQ*sft(s, a) = R(s, a) + Es'|s,a [V*sft(s')],    V*sft(s) = τ log Σa exp{Q*sft(s, a)/τ}.    (4)\n\nFinally, according to (3), we can characterize the optimal softmax policy by\n\nπ*sft(a|s) = exp{(Q*sft(s, a) − V*sft(s))/τ}.    (5)\n\nIn this paper, we combine the maximum entropy policy optimization framework with MCTS, by estimating the softmax values backpropagated from simulations. Specifically, we show that the softmax values can be efficiently backpropagated in the search tree, which leads to a faster convergence rate to the optimal policy at the root.\n\n3 Softmax Value Estimation in Stochastic Bandits\n\nWe begin by introducing the stochastic softmax bandit problem. We provide an asymptotic lower bound for this problem, propose a new bandit algorithm for it, and show a tight upper bound on its convergence rate. Our upper bound matches the lower bound not only in order, but also in the coefficient of the dominating term. All proofs are provided in the supplementary material.\n\n3.1 The Stochastic Softmax Bandit\n\nConsider a stochastic bandit setting with arm set A. At each round t, a learner chooses an action At ∈ A. Next, the environment samples a random reward Rt and reveals it to the learner. Let r(a) be the expected value of the reward distribution of action a ∈ A. We assume r(a) ∈ [0, 1], and that all reward distributions are σ²-subgaussian 2. 
For round t, we define Nt(a) as the number of times a has been chosen so far, and r̂t(a) as the empirical estimate of r(a),\n\nNt(a) = Σ_{i=1}^t I{Ai = a},    r̂t(a) = Σ_{i=1}^t I{Ai = a} Ri / Nt(a),\n\nwhere I{·} is the indicator function. Let r ∈ [0, 1]^K be the vector of expected rewards, and r̂t be the empirical estimate of r at round t. We denote by π*sft = fτ(r) the optimal soft indmax policy defined by the mean reward vector r. The stochastic bandit setting can be considered as a special case of an episodic MDP with H = 1.\nIn a stochastic softmax bandit problem, instead of finding the policy with maximum expected reward as in original stochastic bandits [10], our objective is to estimate the softmax value V*sft = Fτ(r) for some τ > 0. We define U* = Σa exp{r(a)/τ} and Ut = Σa exp{r̂t(a)/τ}, and propose to use the estimator Vt = Fτ(r̂t) = τ log Ut. Our goal is to find a sequential sampling algorithm that minimizes the mean squared error, Et = E[(U* − Ut)²]. The randomness in Et comes from both the sampling algorithm and the observed rewards. Our first result gives a lower bound on Et.\nTheorem 1. In the stochastic softmax bandit problem, for any algorithm that achieves Et = O(1/t), there exists a problem setting such that\n\nlim_{t→∞} t Et ≥ (σ²/τ²) (Σa exp(r(a)/τ))².\n\nAlso, to achieve this lower bound, it must hold for any a ∈ A that lim_{t→∞} Nt(a)/t = π*sft(a).\nNote that in Theorem 1, we only assume Et = O(1/t), but not that the algorithm achieves (asymptotically) unbiased estimates for each arm. 
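The plug-in estimator Vt = Fτ(r̂t) defined above can be sketched as follows. This is an illustrative stub (a fixed round-robin sampling rule stands in for an adaptive bandit strategy, and all numbers are assumptions), not the paper's algorithm:

```python
import math
import random

def softmax_value(r_hat, tau):
    """V_t = F_tau(r_hat) = tau * log( sum_a exp(r_hat(a)/tau) )."""
    return tau * math.log(sum(math.exp(x / tau) for x in r_hat))

def estimate_softmax_value(true_means, tau, rounds, sigma=0.1, seed=0):
    """Pull arms (round-robin, for illustration only), keep empirical means
    r_hat, and return the plug-in estimate V_t = F_tau(r_hat)."""
    rng = random.Random(seed)
    k = len(true_means)
    counts, sums = [0] * k, [0.0] * k
    for t in range(rounds):
        a = t % k  # uniform round-robin sampling of the arms
        reward = true_means[a] + rng.gauss(0.0, sigma)  # sigma-subgaussian noise
        counts[a] += 1
        sums[a] += reward
    r_hat = [s / n for s, n in zip(sums, counts)]
    return softmax_value(r_hat, tau)

true_means, tau = [0.8, 0.5, 0.2], 0.5
v_star = softmax_value(true_means, tau)  # the target V*_sft = F_tau(r)
v_t = estimate_softmax_value(true_means, tau, rounds=30000)
assert abs(v_t - v_star) < 0.05  # the plug-in estimate approaches V*_sft
```

Theorem 1 says that no matter how the pulls are allocated, the mean squared error of such an estimator cannot beat the stated σ²/τ² constant asymptotically, and that an optimal allocation must match π*sft.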
Furthermore, this lower bound also reflects the consistency between the softmax value and the soft indmax policy (3): in order to achieve the lower bound on the mean squared error, the sampling policy must converge to π*sft asymptotically.\n\n3.2 E2W: an Optimal Sequential Sampling Strategy\n\nInspired by the lower bound, we propose an optimal algorithm, Empirical Exponential Weight (E2W), for the stochastic softmax bandit problem. The main idea is very intuitive: enforce enough exploration to guarantee good estimation of r̂, and make the policy converge to π*sft asymptotically, as suggested by the lower bound. Specifically, at round t, the algorithm selects an action by sampling from the distribution\n\nπt(a) = (1 − λt) fτ(r̂)(a) + λt / |A|.    (6)\n\nIn (6), λt = ε|A|/log(t + 1) is a decay rate for exploration, with exploration parameter ε > 0. Our next theorem provides an exact convergence rate for E2W.\nTheorem 2. For the stochastic softmax bandit problem, E2W can guarantee\n\nlim_{t→∞} t Et = (σ²/τ²) (Σa exp(r(a)/τ))².\n\nTheorem 2 shows that E2W is an asymptotically optimal sequential sampling strategy for estimating the softmax value in stochastic multi-armed bandits. The main contribution of the present paper is the introduction of the softmax bandit algorithm for the implementation of the tree policy in MCTS. In our proposed new algorithm, the softmax bandit is used as the fundamental tool both for estimating each state's softmax value and for balancing the growth of the search tree.\n\n2For prudent readers, we follow the finite horizon bandit setting in [10], where the probability space carries the tuple of random variables ST = {A0, R0, . . . , AT, RT}. 
For every time step t−1 the historical observations define a σ-algebra Ft−1 and At is Ft−1-measurable, the conditional distribution of At is our policy πt at time t, and the conditional distribution of the reward RAt − r(At) is a martingale difference sequence.\n\n4 Maximum Entropy MCTS\n\nWe now describe the main technical contributions of this paper, which combine maximum entropy policy optimization with MCTS. Our proposed method, MENTS (Maximum Entropy for Tree Search), applies a similar algorithmic design as UCT (see Section 2.2) with two innovations: using E2W as the tree policy, and evaluating each search node by softmax values back-propagated from simulations.\n\n4.1 Algorithmic Design\n\nLet T be a look-ahead search tree built online by the algorithm. Each node n(s) ∈ T is labeled by a state s, and contains a softmax value estimate Qsft(s, a) and a visit count N(s, a) for each action a. We use Qsft(s) to denote a |A|-dimensional vector with components Qsft(s, a). Let N(s) = Σa N(s, a) and Vsft(s) = Fτ(Qsft(s)). During the in-tree phase of a simulation, the tree policy selects an action according to\n\nπt(a|s) = (1 − λs) fτ(Qsft(s))(a) + λs / |A|,    (7)\n\nwhere λs = ε|A|/log(Σa N(s, a) + 1). Let {s0, a0, s1, a1, . . . , sT} be the state-action trajectory in a simulation, where n(sT) is a leaf node of T . An evaluation function is called on sT and returns an estimate R 3. T is then grown by expanding n(sT). Its statistics are initialized by Qsft(sT, a) = 0 and N(sT, a) = 0 for all actions a. 
For all nodes in the trajectory, we update the visit counts by N(st, at) = N(st, at) + 1, and the Q-values using a softmax backup,\n\nQsft(st, at) = r(st, at) + R  for t = T − 1,    Qsft(st, at) = r(st, at) + Fτ(Qsft(st+1))  for t < T − 1.    (8)\n\nThe algorithm MENTS can also be extended to use domain knowledge, such as function approximations learned offline. For instance, suppose that a policy network π̃(·|s) is available. Then the statistics can be initialized by Qsft(sT, a) = log π̃(a|sT) and N(sT, a) = 0 for all actions a during the expansion. Finally, at each time step t, MENTS proposes the action with the maximum estimated softmax value at the root s0; i.e., at = argmax_a Qsft(s0, a).\n\n4.2 Theoretical Analysis\n\nThis section provides the key steps in developing a theoretical analysis of the convergence properties of MENTS. We first show that for any node in the search tree, after its subtree has been fully explored, the estimated softmax value will converge to the optimal value at an exponential rate. Recall that by Theorem 1, an optimal sampling algorithm for the stochastic softmax bandit problem must guarantee lim_{t→∞} Nt(a)/t = π*sft(a) for any action a. Our first result shows that this holds true for E2W with high probability; it comes directly from the proof of Theorem 2.\nTheorem 3. Consider E2W applied to the stochastic softmax bandit problem. Let N*t(a) = π*sft(a)·t. Then there exist some constants C and C̃ such that\n\nP(|Nt(a) − N*t(a)| > Ct/log t) ≤ C̃|A| t exp{−t/(log t)³}.\n\nIn the bandit case, the reward distribution of each arm is assumed to be subgaussian. 
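The tree policy (7) and the softmax backup (8) can be sketched together as follows. This is a minimal illustrative sketch, not the authors' implementation: the node layout, the `+2` inside the log (so that λ is defined at an unvisited node), the clamp on λ, and all hyperparameter values are our own assumptions.

```python
import math
import random

def F_tau(q, tau):
    """Softmax value F_tau(Q_sft(s)), with max-subtraction for stability."""
    m = max(q)
    return m + tau * math.log(sum(math.exp((x - m) / tau) for x in q))

def f_tau(q, tau):
    """Soft indmax: Boltzmann policy over the Q_sft vector."""
    F = F_tau(q, tau)
    return [math.exp((x - F) / tau) for x in q]

def tree_policy_probs(node, tau, eps):
    """E2W-style in-tree distribution, eq. (7):
    (1 - lam) * f_tau(Q_sft(s)) + lam/|A|, lam = eps*|A|/log(N(s)+1)."""
    k = len(node["Q"])
    n_s = sum(node["N"])
    lam = min(1.0, eps * k / math.log(n_s + 2))  # +2 so log(.) > 0 at N(s) = 0
    pi = f_tau(node["Q"], tau)
    return [(1 - lam) * p + lam / k for p in pi]

def backup(trajectory, rewards, leaf_value, tau):
    """Softmax backup, eq. (8): walk the trajectory bottom-up, setting
    Q_sft(s_t, a_t) from the child's softmax value (or the leaf estimate R)."""
    value = leaf_value
    for (node, a), r in zip(reversed(trajectory), reversed(rewards)):
        node["N"][a] += 1
        node["Q"][a] = r + value
        value = F_tau(node["Q"], tau)

# Tiny demonstration on a hand-built two-level path root -> child.
root = {"Q": [0.0, 0.0], "N": [0, 0]}
child = {"Q": [0.0, 0.0], "N": [0, 0]}
tau, eps = 0.1, 0.2
rng = random.Random(0)
probs = tree_policy_probs(root, tau, eps)
assert abs(sum(probs) - 1.0) < 1e-12
a0 = rng.choices([0, 1], weights=probs)[0]
backup([(root, a0), (child, 1)], rewards=[0.5, 0.0], leaf_value=1.0, tau=tau)
assert root["N"][a0] == 1 and child["N"][1] == 1
```

Note how the backup never averages returns: each Q-value is refreshed from the child's current softmax value, which is what allows the estimate to track the drifting payoff sequence discussed in Theorem 4.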
However, when applying bandit algorithms at the internal nodes of a search tree, the payoff sequence experienced from each action will drift over time, since the sampling probability of the actions in the subtree is changing. The next result shows that even under this drift condition, the softmax value can still be efficiently estimated according to the backup scheme (8).\nTheorem 4. For any node n(s) ∈ T , define the event\n\nEs = { ∀a ∈ A, |N(s, a) − N*(s, a)| < N*(s, a)/2 },\n\nwhere N*(s, a) = π*sft(a|s)·N(s). For ε ∈ [0, 1), there exist some constants C and C̃ such that for sufficiently large t,\n\nP(|Vsft(s) − V*sft(s)| ≥ ε | Es) ≤ C̃ exp{−N(s)τ²ε²/(Cσ²)}.\n\nWithout loss of generality, we assume Q*(s, 1) ≥ Q*(s, 2) ≥ ··· ≥ Q*(s, |A|) for any n(s) ∈ T , and define Δ = Q*(s, 1) − Q*(s, 2). Recall that by (3), the gap between the softmax and maximum value is upper bounded by τ times the maximum entropy. Therefore, as long as τ is chosen small enough that this gap is smaller than Δ, the best action also has the largest softmax value. Finally, as we are interested in the probability that the algorithm fails to find the optimal arm at the root, we prove the following result.\nTheorem 5. Let at be the action returned by MENTS at iteration t. Then for large enough t and some constant C,\n\nP(at ≠ a*) ≤ C t exp{−t/(log t)³}.\n\n3We adapt a similar setting to Section 3, where Rt is replaced by the sample from the evaluation function, and the martingale assumption is extended to the selection policy and the evaluation function at the leaves.\n\nRemark. 
MENTS enjoys a fundamentally faster convergence rate than UCT. We highlight two main reasons behind this, both resulting from the new algorithmic design. First, MENTS applies E2W as the tree policy during simulations. This ensures that the softmax value functions used in MENTS are estimated at an optimal rate, and that the tree policy converges to the optimal softmax policy π*sft asymptotically, as suggested by Theorems 1 and 2. Second, Theorem 4 shows that the softmax value can also be efficiently back-propagated in the search tree. Due to these facts, the probability of MENTS failing to identify the best decision at the root decays exponentially, significantly improving on the polynomial rate of UCT.\n\n5 Related Work\n\nMaximum entropy policy optimization is a well studied topic in reinforcement learning [4, 5, 11]. The maximum entropy formulation provides a substantial improvement in exploration and robustness, by adopting a smoothed optimization objective and acquiring diverse policy behaviors. The proposed algorithm MENTS is built on the softmax Bellman operator (4), which is used as the value propagation formula in MCTS. To the best of our knowledge, MENTS is the first algorithm that applies the maximum entropy policy optimization framework to simulation-based planning algorithms.\nSeveral works have been proposed for improving UCT, since it is arguably “over-optimistic” [2] and does not explore sufficiently: UCT can take a long time to discover an optimal branch that initially looked inferior. Previous work has proposed to use flat-UCB, which enforces more exploration, as the tree policy for action selection at internal nodes [2]. Minimizing simple regret in MCTS is discussed in [1, 16]. 
Instead of using UCB1 as the tree policy at each node, these works employ a hybrid architecture, where a best-arm identification algorithm such as Sequential Halving [6] is applied at the upper levels, while the original UCT is used for the deeper levels of the tree.\nVarious value back-propagation strategies, in particular back-propagating the maximum estimated value over the children, were originally studied in [3]. It has been shown that the maximum backup is a poor option, since the Monte-Carlo estimation is too noisy when the number of simulations is low, which misguides the algorithm, particularly at the beginning of search. Complex back-propagation strategies in MCTS have been investigated in [8], where a mixture of the maximum backup with the well known TD(λ) operator [15] is proposed. In contrast to these approaches, MENTS exploits the softmax backup to achieve a faster convergence rate of value estimation.\n\n6 Experiments\n\nWe evaluate the proposed algorithm, MENTS, across several different benchmark problems against strong baseline methods. Our first test domain is a Synthetic Tree environment. The tree has branching factor (number of actions) k and depth d. At each leaf of the tree, a standard Gaussian distribution is assigned as an evaluation function; that is, each time a leaf is visited, the distribution is used to sample a stochastic return. The mean of each Gaussian distribution is determined in the following way: when initializing the environment each edge is assigned a random value, then the mean of the Gaussian distribution at a leaf is the sum of values along the path from the root to the leaf.\n\nFigure 1: Evaluation of softmax value estimation in the synthetic tree environment. The x-axis shows the number of simulations and the y-axis shows the value estimation error. The shaded area shows the standard error. We find that the softmax value can be efficiently estimated by MENTS.\n\n
This environment is similar to the P-game tree environment [9, 14] used to model two-player minimax games, while here we consider the single (max) player version. Finally, we normalize all the means to [0, 1].\nWe then test MENTS on five Atari games: BeamRider, Breakout, Q*bert, Seaquest and SpaceInvaders. For each game, we train a vanilla DQN and use it as an evaluation function for the tree search, as discussed in AlphaGo [12, 13]. In particular, the softmax of the Q-values is used as the state value estimate, and the Boltzmann distribution over the Q-values is used as the policy network to assign a probability prior to each action when expanding a node. The temperature is set to 0.1. The UCT algorithm adopts the following tree policy introduced in AlphaGo [13],\n\nPUCT(s, a) = Q(s, a) + c P(s, a) √(Σb N(s, b)) / (1 + N(s, a)),\n\nwhere P(s, a) is the prior probability. MENTS also applies the same evaluation function. The prior probability is used to initialize Qsft as discussed in Section 4.1. We note that the DQN is trained using a hard-max target. Training a neural network using softmax targets, such as soft Q-learning or PCL, might be more suitable for MENTS [4, 11]. However, in the experiments we still use DQN in MENTS to present a fair comparison with UCT, since both algorithms apply exactly the same evaluation function. The details of the experimental setup are provided in the Appendix.\n\n6.1 Results\n\nValue estimation in synthetic tree. As shown in Section 4.2, the main advantage of the softmax value is that it can be efficiently estimated and back-propagated in the search tree. To verify this observation, we compare the value estimation error of MENTS and UCT in both the bandit and tree search settings. For MENTS, the error is measured by the absolute difference between the estimated softmax value Vsft(s0) and the true softmax state value V*sft(s0) of the root s0. 
For UCT, the error is measured by the absolute difference between the Monte Carlo value estimate V(s0) and the optimal state value V*(s0) at the root. We report the results in Figure 1. Each data point is averaged over 5 × 5 independent experiments (5 runs on each of 5 randomly initialized environments). In all of the test environments, we observe that MENTS estimates the softmax values efficiently. By comparison, we find that the Monte Carlo estimate used in UCT converges far more slowly to the optimal state value, even in the bandit setting (d = 1).\nOnline planning in synthetic tree. We next compare MENTS with UCT for online planning in the synthetic tree environment. Both algorithms use Monte Carlo simulation with a uniform rollout policy as the evaluation function. The error is evaluated by V*(s0) − Q*(s0, at), where at is the action proposed by the algorithm at simulation step t, and s0 is the root of the synthetic tree. The optimal values Q* and V* are computed by back-propagating the true values from the leaves when the environment is initialized. Results are reported in Figure 2. As in the previous experiment, each data point is averaged over 5 × 5 independent experiments (5 runs on each of 5 randomly initialized environments). UCT converges faster than our method in the bandit environment (d = 1). This is because the main advantage of MENTS is the usage of softmax state values, which can be efficiently estimated and back-propagated in the search tree; in the bandit case such an advantage does not exist. In the tree case (d > 1), we find that MENTS significantly outperforms UCT, especially in the large domains. For example, in the synthetic tree with k = 8, d = 5, UCT fails to identify the optimal action at the root in some of the random environments, resulting in large regret given the simulation budgets. 
However, MENTS can continuously make progress towards the optimal solution in all random environments, confirming that MENTS scales with larger tree depth.\n\nFigure 2: Evaluation of online planning in the synthetic tree environment. The x-axis shows the number of simulations and the y-axis shows the planning error. The shaded area shows the standard error. We can observe that MENTS enjoys much smaller error than UCT, especially in the large domain.\n\nTable 1: Performance comparison of Atari game playing.\n\nAgent    BeamRider    Breakout    Q*bert    Seaquest    SpaceInvaders\nDQN      19280        345         14558     1142        625\nUCT      21952        367         16010     1129        656\nMENTS    18576        386         18336     1161        1503\n\nOnline planning in Atari 2600 games. Finally, we compare MENTS and UCT using Atari games. At each time step we use 500 simulations to generate a move. Results are provided in Table 1, where we highlight scores where MENTS significantly outperforms the baselines. Scores obtained by DQN are also provided. In Breakout, Q*bert and SpaceInvaders, MENTS significantly outperforms UCT as well as the DQN agent. In BeamRider and Seaquest all algorithms perform similarly, since the search algorithms only use the DQN as the evaluation function and only 500 simulations are applied to generate a move. We can expect better performance when a larger simulation budget is used.\n\n7 Conclusion\n\nWe propose a new online planning algorithm, Maximum Entropy for Tree Search (MENTS), for large scale sequential decision making. 
The main idea of MENTS is to augment MCTS with maximum entropy policy optimization, evaluating each node in the search tree using softmax values back-propagated from simulations. We contribute two new observations that are essential to establishing the effectiveness of MENTS: first, we study stochastic softmax bandits for single-step decision making and show that softmax values can be estimated at an optimal convergence rate in terms of mean squared error; second, the softmax values can be efficiently back-propagated from simulations in the search. We prove that the probability of MENTS failing to identify the best decision at the root decays exponentially, which fundamentally improves the worst case efficiency of UCT. Empirically, MENTS exhibits a significant improvement over UCT in both synthetic tree environments and Atari game playing.

Acknowledgement

The authors wish to thank Csaba Szepesvári for useful discussions, and the anonymous reviewers for their valuable advice. Part of this work was performed while the first two authors were interns at Borealis AI. This research was supported by NSERC, the Natural Sciences and Engineering Research Council of Canada, and AMII, the Alberta Machine Intelligence Institute.

References

[1] Tristan Cazenave. Sequential halving applied to trees. IEEE Transactions on Computational Intelligence and AI in Games, 7(1):102–105, 2015.

[2] Pierre-Arnaud Coquelin and Rémi Munos. Bandit algorithms for tree search. In Uncertainty in Artificial Intelligence, 2007.

[3] Rémi Coulom.
Ef\ufb01cient selectivity and backup operators in monte-carlo tree search.\n\nInternational conference on computers and games, pages 72\u201383. Springer, 2006.\n\nIn\n\n[4] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning\n\nwith deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.\n\n[5] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-\npolicy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint\narXiv:1801.01290, 2018.\n\n[6] Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed\n\nbandits. In International Conference on Machine Learning, pages 1238\u20131246, 2013.\n\n[7] Michael Kearns, Yishay Mansour, and Andrew Y Ng. A sparse sampling algorithm for near-\noptimal planning in large markov decision processes. Machine learning, 49(2-3):193\u2013208,\n2002.\n\n[8] Piyush Khandelwal, Elad Liebman, Scott Niekum, and Peter Stone. On the analysis of complex\nbackup strategies in monte carlo tree search. In International Conference on Machine Learning,\npages 1319\u20131328, 2016.\n\n[9] Levente Kocsis and Csaba Szepesv\u00b4ari. Bandit based monte-carlo planning.\n\nconference on machine learning, pages 282\u2013293. Springer, 2006.\n\nIn European\n\n[10] Tor Lattimore and Csaba Szepesv\u00b4ari. Bandit algorithms. 2018.\n\n[11] O\ufb01r Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap\nbetween value and policy based reinforcement learning. In Advances in Neural Information\nProcessing Systems, pages 2775\u20132785, 2017.\n\n[12] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driess-\nche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al.\nMastering the game of go with deep neural networks and tree search. 
Nature, 529(7587):484–489, 2016.

[13] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.

[14] Stephen JJ Smith and Dana S Nau. An analysis of forward pruning. In AAAI, 1994.

[15] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[16] David Tolpin and Solomon Eyal Shimony. MCTS based on simple regret. In Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.

[17] Chenjun Xiao, Jincheng Mei, and Martin Müller. Memory-augmented Monte Carlo tree search. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.