{"title": "Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search", "book": "Advances in Neural Information Processing Systems", "page_first": 1025, "page_last": 1033, "abstract": "Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, finding the resulting Bayes-optimal policies is notoriously taxing, since the search space becomes enormous. In this paper we introduce a tractable, sample-based method for approximate Bayes-optimal planning which exploits Monte-Carlo tree search. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems -- because it avoids expensive applications of Bayes rule within the search tree by lazily sampling models from the current beliefs. We illustrate the advantages of our approach by showing it working in an infinite state space domain which is qualitatively out of reach of almost all previous work in Bayesian exploration.", "full_text": "Ef\ufb01cient Bayes-Adaptive Reinforcement Learning\n\nusing Sample-Based Search\n\nArthur Guez\n\nDavid Silver\n\nPeter Dayan\n\naguez@gatsby.ucl.ac.uk\n\nd.silver@cs.ucl.ac.uk\n\ndayan@gatsby.ucl.ac.uk\n\nAbstract\n\nBayesian model-based reinforcement learning is a formally elegant approach to\nlearning optimal behaviour under model uncertainty, trading off exploration and\nexploitation in an ideal way. Unfortunately, \ufb01nding the resulting Bayes-optimal\npolicies is notoriously taxing, since the search space becomes enormous. In this\npaper we introduce a tractable, sample-based method for approximate Bayes-\noptimal planning which exploits Monte-Carlo tree search. 
Our approach outper-\nformed prior Bayesian model-based RL algorithms by a signi\ufb01cant margin on sev-\neral well-known benchmark problems \u2013 because it avoids expensive applications\nof Bayes rule within the search tree by lazily sampling models from the current\nbeliefs. We illustrate the advantages of our approach by showing it working in\nan in\ufb01nite state space domain which is qualitatively out of reach of almost all\nprevious work in Bayesian exploration.\n\n1\n\nIntroduction\n\nA key objective in the theory of Markov Decision Processes (MDPs) is to maximize the expected\nsum of discounted rewards when the dynamics of the MDP are (perhaps partially) unknown. The\ndiscount factor pressures the agent to favor short-term rewards, but potentially costly exploration\nmay identify better rewards in the long-term. This con\ufb02ict leads to the well-known exploration-\nexploitation trade-off. One way to solve this dilemma [3, 10] is to augment the regular state of the\nagent with the information it has acquired about the dynamics. One formulation of this idea is the\naugmented Bayes-Adaptive MDP (BAMDP) [18, 9], in which the extra information is the posterior\nbelief distribution over the dynamics, given the data so far observed. The agent starts in the belief\nstate corresponding to its prior and, by executing the greedy policy in the BAMDP whilst updating\nits posterior, acts optimally (with respect to its beliefs) in the original MDP. In this framework, rich\nprior knowledge about statistics of the environment can be naturally incorporated into the planning\nprocess, potentially leading to more ef\ufb01cient exploration and exploitation of the uncertain world.\nUnfortunately, exact Bayesian reinforcement learning is computationally intractable. Various algo-\nrithms have been devised to approximate optimal learning, but often at rather large cost. 
Here, we\npresent a tractable approach that exploits and extends recent advances in Monte-Carlo tree search\n(MCTS) [16, 20], while avoiding the problems associated with applying MCTS directly to the BAMDP.\nAt each iteration in our algorithm, a single MDP is sampled from the agent\u2019s current beliefs. This\nMDP is used to simulate a single episode whose outcome is used to update the value of each node of\nthe search tree traversed during the simulation. By integrating over many simulations, and therefore\nmany sample MDPs, the optimal value of each future sequence is obtained with respect to the agent\u2019s\nbeliefs. We prove that this process converges to the Bayes-optimal policy, given infinite samples. To\nincrease computational efficiency, we introduce a further innovation: a lazy sampling scheme that\nconsiderably reduces the cost of sampling.\nWe applied our algorithm to a representative sample of benchmark problems and competitive\nalgorithms from the literature. It consistently and significantly outperformed existing Bayesian RL\nmethods, and also recent non-Bayesian approaches, thus achieving state-of-the-art performance.\nOur algorithm is more efficient than previous sparse sampling methods for Bayes-adaptive planning\n[25, 6, 2], partly because it does not update the posterior belief state during the course of each\nsimulation. It thus avoids repeated applications of Bayes rule, which is expensive for all but the\nsimplest priors over the MDP. Consequently, our algorithm is particularly well suited to support\nplanning in domains with richly structured prior knowledge, a critical requirement for applications\nof Bayesian reinforcement learning to large problems. 
We illustrate this benefit by showing that our\nalgorithm can tackle a domain with an infinite number of states and a structured prior over the\ndynamics, a challenging (if not intractable) task for existing approaches.\n\n2 Bayesian RL\n\nWe describe the generic Bayesian formulation of optimal decision-making in an unknown MDP,\nfollowing [18] and [9]. An MDP is described as a 5-tuple M = \u27e8S, A, P, R, \u03b3\u27e9, where S is the\nset of states, A is the set of actions, P : S \u00d7 A \u00d7 S \u2192 \u211d is the state transition probability kernel,\nR : S \u00d7 A \u2192 \u211d is a bounded reward function, and \u03b3 is the discount factor [23]. When all the\ncomponents of the MDP tuple are known, standard MDP planning algorithms can be used to estimate\nthe optimal value function and policy off-line. In general, the dynamics are unknown, and we assume\nthat P is a latent variable distributed according to a distribution P(P). After observing a history\nof actions and states h_t = s_1 a_1 s_2 a_2 ... a_{t\u22121} s_t from the MDP, the posterior belief on P is updated\nusing Bayes\u2019 rule P(P|h_t) \u221d P(h_t|P) P(P). The uncertainty about the dynamics of the model can\nbe transformed into uncertainty about the current state inside an augmented state space S^+ = S \u00d7 H,\nwhere S is the state space in the original problem and H is the set of possible histories. The dynamics\nassociated with this augmented state space are described by\n\nP^+(\u27e8s, h\u27e9, a, \u27e8s\u2032, h\u2032\u27e9) = 1[h\u2032 = has\u2032] \u222b_P P(s, a, s\u2032) P(P|h) dP,    R^+(\u27e8s, h\u27e9, a) = R(s, a)    (1)\n\nTogether, the 5-tuple M^+ = \u27e8S^+, A, P^+, R^+, \u03b3\u27e9 forms the Bayes-Adaptive MDP (BAMDP) for\nthe MDP problem M. Since the dynamics of the BAMDP are known, it can in principle be solved\nto obtain the optimal value function associated with each action:\n\nQ^*(\u27e8s_t, h_t\u27e9, a) = max_\u03c0 E_\u03c0[ \u2211_{t\u2032=t}^{\u221e} \u03b3^{t\u2032\u2212t} r_{t\u2032} | a_t = a ]    (2)\n\nfrom which the optimal action for each state can be readily derived (see footnote 1). Optimal actions in the BAMDP\nare executed greedily in the real MDP M and constitute the best course of action for a Bayesian\nagent with respect to its prior belief over P. It is obvious that the expected performance of the\nBAMDP policy in the MDP M is bounded above by that of the optimal policy obtained with a fully\nobservable model, with equality occurring, for example, in the degenerate case in which the prior\nonly has support on the true model.\n\n3 The BAMCP algorithm\n3.1 Algorithm Description\nThe goal of a BAMDP planning method is to find, for each decision point \u27e8s, h\u27e9 encountered, the\naction a that maximizes Equation 2. Our algorithm, Bayes-adaptive Monte-Carlo Planning (BAMCP),\ndoes this by performing a forward-search in the space of possible future histories of the BAMDP\nusing a tailored Monte-Carlo tree search.\nWe employ the UCT algorithm [16] to allocate search effort to promising branches of the state-action\ntree, and use sample-based rollouts to provide value estimates at each node. For clarity, let us denote\nby Bayes-Adaptive UCT (BA-UCT) the algorithm that applies vanilla UCT to the BAMDP (i.e.,\nthe particular MDP with dynamics described in Equation 1). Sample-based search in the BAMDP\nusing BA-UCT requires the generation of samples from P^+ at every single node. This operation\nrequires integration over all possible transition models, or at least a sample of a transition model P,\nan expensive procedure for all but the simplest generative models P(P). 
We avoid this cost by\nonly sampling a single transition model P^i from the posterior at the root of the search tree at the\nstart of each simulation i, and using P^i to generate all the necessary samples during this simulation.\nSample-based tree search then acts as a filter, ensuring that the correct distribution of state successors\nis obtained at each of the tree nodes, as if it was sampled from P^+. This root sampling method was\noriginally introduced in the POMCP algorithm [20], developed to solve Partially Observable MDPs.\n\nFootnote 1: The redundancy in the state-history tuple notation (s_t is the suffix of h_t) is only present to ensure\nclarity of exposition.\n\n3.2 BA-UCT with Root Sampling\nThe root node of the search tree at a decision point represents the current state of the BAMDP.\nThe tree is composed of state nodes representing belief states \u27e8s, h\u27e9 and action nodes representing\nthe effect of particular actions from their parent state node. The visit counts, N(\u27e8s, h\u27e9) for\nstate nodes and N(\u27e8s, h\u27e9, a) for action nodes, are initialized to 0 and updated throughout search.\nA value Q(\u27e8s, h\u27e9, a), initialized to 0, is also maintained for each action node. Each simulation\ntraverses the tree without backtracking by following the UCT policy at state nodes, defined by\nargmax_a Q(\u27e8s, h\u27e9, a) + c \u221a(log(N(\u27e8s, h\u27e9)) / N(\u27e8s, h\u27e9, a)), where c is an exploration constant that needs\nto be set appropriately. Given an action, the transition distribution P^i corresponding to the current\nsimulation i is used to sample the next state. That is, at action node (\u27e8s, h\u27e9, a), s\u2032 is sampled from\nP^i(s, a, \u00b7), and the new state node is set to \u27e8s\u2032, has\u2032\u27e9. 
When a simulation reaches a leaf, the tree is\nexpanded by attaching a new state node with its connected action nodes, and a rollout policy \u03c0_ro is\nused to control the MDP defined by the current P^i to some fixed depth (determined using the\ndiscount factor). The rollout provides an estimate of the value Q(\u27e8s, h\u27e9, a) from the leaf action node.\nThis estimate is then used to update the value of all action nodes traversed during the simulation: if\nR is the sampled discounted return obtained from a traversed action node (\u27e8s, h\u27e9, a) in a given\nsimulation, then we update the value of the action node to Q(\u27e8s, h\u27e9, a) + (R \u2212 Q(\u27e8s, h\u27e9, a)) / N(\u27e8s, h\u27e9, a) (i.e.,\nthe mean of the sampled returns obtained from that action node over the simulations). A detailed\ndescription of the BAMCP algorithm is provided in Algorithm 1. A diagram example of BAMCP\nsimulations is presented in Figure S3.\nThe tree policy treats the forward search as a meta-exploration problem, preferring to exploit\nregions of the tree that currently appear better than others while continuing to explore unknown or\nless known parts of the tree. This leads to good empirical results even for a small number of\nsimulations, because effort is expended where search seems fruitful. Nevertheless, all parts of the tree\nare eventually visited infinitely often, and therefore the algorithm will eventually converge on the\nBayes-optimal policy (see Section 3.5).\nFinally, note that the history of transitions h is generally not the most compact sufficient statistic\nof the belief in fully observable MDPs. Indeed, it can be replaced with unordered transition counts\n\u03c8, considerably reducing the number of states of the BAMDP and, potentially, the complexity of\nplanning. Given an addressing scheme suitable to the resulting expanding lattice (rather than to a\ntree), BAMCP can search in this reduced space. 
We found this version of BAMCP to offer only a\nmarginal improvement. This is a common finding for UCT, stemming from its tendency to concentrate\nsearch effort on one of several equivalent paths (up to transposition), implying a limited effect\non performance of reducing the number of those paths.\n\n3.3 Lazy Sampling\nIn previous work on sample-based tree search, indeed including POMCP [20], a complete sample\nstate is drawn from the posterior at the root of the search tree. However, this can be computationally\nvery costly. Instead, we sample P lazily, creating only the particular transition probabilities that are\nrequired as the simulation traverses the tree, and also during the rollout.\nConsider P(s, a, \u00b7) to be parametrized by a latent variable \u03b8_{s,a} for each state and action pair. These\nmay depend on each other, as well as on an additional set of latent variables \u03c6. The posterior over\nP can be written as P(\u0398|h) = \u222b_\u03c6 P(\u0398|\u03c6, h) P(\u03c6|h), where \u0398 = {\u03b8_{s,a} | s \u2208 S, a \u2208 A}. Define\n\u0398_t = {\u03b8_{s_1,a_1}, ..., \u03b8_{s_t,a_t}} as the (random) set of \u03b8 parameters required during the course of a\nBAMCP simulation that starts at time 1 and ends at time t. Using the chain rule, we can rewrite\n\nP(\u0398|\u03c6, h) = P(\u03b8_{s_1,a_1}|\u03c6, h) P(\u03b8_{s_2,a_2}|\u0398_1, \u03c6, h) ... P(\u03b8_{s_T,a_T}|\u0398_{T\u22121}, \u03c6, h) P(\u0398 \\ \u0398_T|\u0398_T, \u03c6, h)\n\nwhere T is the length of the simulation and \u0398 \\ \u0398_T denotes the (random) set of parameters that\nare not required for a simulation. For each simulation i, we sample P(\u03c6|h_t) at the root and then\nlazily sample the \u03b8_{s_t,a_t} parameters as required, conditioned on \u03c6 and all \u0398_{t\u22121} parameters sampled\nfor the current simulation. 
This process is stopped at the end of the simulation, potentially before\nall \u03b8 parameters have been sampled. For example, if the transition parameters for different states\nand actions are independent, we can completely forgo sampling a complete P, and instead draw any\nnecessary parameters individually for each state-action pair. This leads to substantial performance\nimprovement, especially in large MDPs where a single simulation only requires a small subset of\nparameters (see for example the domain in Section 5.2).\n\nAlgorithm 1: BAMCP\n\nprocedure Search(\u27e8s, h\u27e9)\n    repeat\n        P \u223c P(P|h)\n        Simulate(\u27e8s, h\u27e9, P, 0)\n    until Timeout()\n    return argmax_a Q(\u27e8s, h\u27e9, a)\nend procedure\n\nprocedure Rollout(\u27e8s, h\u27e9, P, d)\n    if \u03b3^d R_max < \u03b5 then\n        return 0\n    end\n    a \u223c \u03c0_ro(\u27e8s, h\u27e9, \u00b7)\n    s\u2032 \u223c P(s, a, \u00b7)\n    r \u2190 R(s, a)\n    return r + \u03b3 Rollout(\u27e8s\u2032, has\u2032\u27e9, P, d+1)\nend procedure\n\nprocedure Simulate(\u27e8s, h\u27e9, P, d)\n    if \u03b3^d R_max < \u03b5 then return 0\n    if N(\u27e8s, h\u27e9) = 0 then\n        for all a \u2208 A do\n            N(\u27e8s, h\u27e9, a) \u2190 0, Q(\u27e8s, h\u27e9, a) \u2190 0\n        end\n        a \u223c \u03c0_ro(\u27e8s, h\u27e9, \u00b7)\n        s\u2032 \u223c P(s, a, \u00b7)\n        r \u2190 R(s, a)\n        R \u2190 r + \u03b3 Rollout(\u27e8s\u2032, has\u2032\u27e9, P, d)\n        N(\u27e8s, h\u27e9) \u2190 1, N(\u27e8s, h\u27e9, a) \u2190 1\n        Q(\u27e8s, h\u27e9, a) \u2190 R\n        return R\n    end\n    a \u2190 argmax_b Q(\u27e8s, h\u27e9, b) + c \u221a(log(N(\u27e8s, h\u27e9)) / N(\u27e8s, h\u27e9, b))\n    s\u2032 \u223c P(s, a, \u00b7)\n    r \u2190 R(s, a)\n    R \u2190 r + \u03b3 Simulate(\u27e8s\u2032, has\u2032\u27e9, P, d+1)\n    N(\u27e8s, h\u27e9) \u2190 N(\u27e8s, h\u27e9) + 1\n    N(\u27e8s, h\u27e9, a) \u2190 N(\u27e8s, h\u27e9, a) + 1\n    Q(\u27e8s, h\u27e9, a) \u2190 Q(\u27e8s, h\u27e9, a) + (R \u2212 Q(\u27e8s, h\u27e9, a)) / N(\u27e8s, h\u27e9, a)\n    return R\nend procedure\n\n3.4 Rollout Policy Learning\n\nThe choice of rollout policy \u03c0_ro is important if simulations are few, especially if the domain does\nnot display substantial locality or if rewards require a carefully selected sequence of actions to be\nobtained. Otherwise, a simple uniform random policy can be chosen to provide noisy estimates.\nIn this work, we learn Q_ro, the optimal Q-value in the real MDP, in a model-free manner (e.g.,\nusing Q-learning) from samples (s_t, a_t, r_t, s_{t+1}) obtained off-policy as a result of the interaction\nof the Bayesian agent with the environment. Acting greedily according to Q_ro translates to pure\nexploitation of gathered knowledge. A rollout policy in BAMCP following Q_ro could therefore\nover-exploit. Instead, similar to [13], we select an \u03b5-greedy policy with respect to Q_ro as our rollout\npolicy \u03c0_ro. This biases rollouts towards observed regions of high rewards. This method provides\nvaluable direction for the rollout policy at negligible computational cost. More complex rollout\npolicies can be considered, for example rollout policies that depend on the sampled model P^i.\nHowever, these usually incur computational overhead.\n\n3.5 Theoretical properties\nDefine V(\u27e8s, h\u27e9) = max_{a\u2208A} Q(\u27e8s, h\u27e9, a) \u2200\u27e8s, h\u27e9 \u2208 S \u00d7 H.\nTheorem 1. For all \u03b5 > 0 (the numerical precision, see Algorithm 1) and a suitably chosen\nc (e.g. c > R_max/(1\u2212\u03b3)), from state \u27e8s_t, h_t\u27e9, BAMCP constructs a value function at the root\nnode that converges in probability to an \u03b5\u2032-optimal value function, V(\u27e8s_t, h_t\u27e9) \u2192^p V^*_{\u03b5\u2032}(\u27e8s_t, h_t\u27e9),\nwhere \u03b5\u2032 = \u03b5/(1\u2212\u03b3). Moreover, for large enough N(\u27e8s_t, h_t\u27e9), the bias of V(\u27e8s_t, h_t\u27e9) decreases as\nO(log(N(\u27e8s_t, h_t\u27e9)) / N(\u27e8s_t, h_t\u27e9)). (Proof available in supplementary material)\n\nBy definition, Theorem 1 implies that BAMCP converges to the Bayes-optimal solution asymptotically.\nWe confirmed this result empirically using a variety of Bandit problems, for which the\nBayes-optimal solution can be computed efficiently using Gittins indices (see supplementary material).\n\n4 Related Work\n\nIn Section 5, we compare BAMCP to a set of existing Bayesian RL algorithms. Given limited\nspace, we do not provide a comprehensive list of planning algorithms for MDP exploration, but\nrather concentrate on related sample-based algorithms for Bayesian RL.\nBayesian DP [22] maintains a posterior distribution over transition models. At each step, a single\nmodel is sampled, and the action that is optimal in that model is executed. The Best Of Sampled Set\n(BOSS) algorithm generalizes this idea [1]. BOSS samples a number of models from the posterior\nand combines them optimistically. This drives sufficient exploration to obtain finite-sample\nperformance guarantees. BOSS is quite sensitive to its parameter that governs the sampling criterion.\nUnfortunately, this is difficult to select. Castro and Precup proposed an SBOSS variant, which\nprovides a more effective adaptive sampling criterion [5]. BOSS algorithms are generally quite robust,\nbut suffer from over-exploration.\nSparse sampling [15] is a sample-based tree search algorithm. The key idea is to sample successor\nnodes from each state, and apply a Bellman backup to update the value of the parent node from the\nvalues of the child nodes. Wang et al. applied sparse sampling to search over belief-state MDPs [25].\nThe tree is expanded non-uniformly according to the sampled trajectories. 
At each decision node, a\npromising action is selected using Thompson sampling \u2014 i.e., sampling an MDP from that belief-\nstate, solving the MDP and taking the optimal action. At each chance node, a successor belief-state\nis sampled from the transition dynamics of the belief-state MDP.\nAsmuth and Littman further extended this idea in their BFS3 algorithm [2], an adaptation of Forward\nSearch Sparse Sampling [24] to belief-MDPs. Although they described their algorithm as Monte-\nCarlo tree search, it in fact uses a Bellman backup rather than Monte-Carlo evaluation. Each Bellman\nbackup updates both lower and upper bounds on the value of each node. Like Wang et al., the tree\nis expanded non-uniformly according to the sampled trajectories, albeit using a different method for\naction selection. At each decision node, a promising action is selected by maximising the upper\nbound on value. At each chance node, observations are selected by maximising the uncertainty\n(upper minus lower bound).\nBayesian Exploration Bonus (BEB) solves the posterior mean MDP, but with an additional reward\nbonus that depends on visitation counts [17]. Similarly, Sorg et al. propose an algorithm with a\ndifferent form of exploration bonus [21]. These algorithms provide performance guarantees after\na polynomial number of steps in the environment. However, behavior in the early steps of explo-\nration is very sensitive to the precise exploration bonuses; and it turns out to be hard to translate\nsophisticated prior knowledge into the form of a bonus.\n\nTable 1: Experiment results summary. For each algorithm, we report the mean sum of rewards and con\ufb01dence\ninterval for the best performing parameter within a reasonable planning time limit (0.25 s/step for Double-loop,\n1 s/step for Grid5 and Grid10, 1.5 s/step for the Maze). For BAMCP, this simply corresponds to the number\nof simulations that achieve a planning time just under the imposed limit. 
* Results reported from [22] without\ntiming information.\n\nAlgorithm             Double-loop    Grid5       Grid10      Dearden\u2019s Maze\nBAMCP                 387.6 \u00b1 1.5    72.9 \u00b1 3    32.7 \u00b1 3    965.2 \u00b1 73\nBFS3 [2]              382.2 \u00b1 1.5    66 \u00b1 5      10.4 \u00b1 2    240.9 \u00b1 46\nSBOSS [5]             371.5 \u00b1 3      59.3 \u00b1 4    21.8 \u00b1 2    671.3 \u00b1 126\nBEB [17]              386 \u00b1 0        67.5 \u00b1 3    10 \u00b1 1      184.6 \u00b1 35\nBayesian DP* [22]     377 \u00b1 1        -           -           817.6 \u00b1 29\nBayes VPI+MIX* [8]    326 \u00b1 31       -           -           269.4 \u00b1 1\nIEQL+* [19]           264 \u00b1 1        -           -           195.2 \u00b1 20\nQL Boltzmann*         186 \u00b1 1        -           -           -\n\n5 Experiments\nWe first present empirical results of BAMCP on a set of standard problems with comparisons to\nother popular algorithms. Then we showcase BAMCP\u2019s advantages in a large scale task: an infinite\n2D grid with complex correlations between reward locations.\n5.1 Standard Domains\nAlgorithms\nThe following algorithms were run:\nBAMCP: the algorithm presented in Section 3, implemented\nwith lazy sampling. The algorithm was run for different numbers of simulations (10 to 10000) to\nspan different planning times. In all experiments, we set \u03c0_ro to be an \u03b5-greedy policy with \u03b5 = 0.5.\nThe UCT exploration constant was left unchanged for all experiments (c = 3); we experimented\nwith other values of c \u2208 {0.5, 1, 5} with similar results.\nSBOSS [5]: for each domain, we varied\nthe number of samples K \u2208 {2, 4, 8, 16, 32} and the resampling threshold parameter \u03b4 \u2208 {3, 5, 7}.\nBEB [17]: for each domain, we varied the bonus parameter \u03b2 \u2208 {0.5, 1, 1.5, 2, 2.5, 3, 5, 10, 15, 20}.\nBFS3 [2]: for each domain, we varied the branching factor C \u2208 {2, 5, 10, 15} and the number of\nsimulations (10 to 2000). The depth of search was set to 15 in all domains except for the larger grid\nand maze domain where it was set to 50. We also tuned the V_max parameter for each domain; V_min\nwas always set to 0. 
In addition, we report results from [22] for several other prior algorithms.\nDomains\nFor all domains, we fix \u03b3 = 0.95. The Double-loop domain is a 9-state deterministic MDP with 2\nactions [8]; 1000 steps are executed in this domain. Grid5 is a 5 \u00d7 5 grid with no reward anywhere\nexcept for a reward state opposite to the reset state. Actions with cardinal directions are executed\nwith small probability of failure for 1000 steps. Grid10 is a 10 \u00d7 10 grid designed like Grid5. We\ncollect 2000 steps in this domain. Dearden\u2019s Maze is a 264-state maze with 3 flags to collect [8].\nA special reward state gives the number of flags collected since the last visit as reward; 20000 steps\nare executed in this domain (see footnote 2).\nTo quantify the performance of each algorithm, we measured the total undiscounted reward over\nmany steps. We chose this measure of performance to enable fair comparisons to be drawn with\nprior work. In fact, we are optimising a different criterion, the discounted reward from the start\nstate, and so we might expect this evaluation to be unfavourable to our algorithm.\nOne major advantage of Bayesian RL is that one can specify priors about the dynamics. For the\nDouble-loop domain, the Bayesian RL algorithms were run with a simple Dirichlet-Multinomial\nmodel with symmetric Dirichlet parameter \u03b1 = 1/|S|. For the grids and the maze domain, the\nalgorithms were run with a sparse Dirichlet-Multinomial model, as described in [11]. For both of these\nmodels, efficient collapsed sampling schemes are available; they are employed for the BA-UCT and\nBFS3 algorithms in our experiments to compress the posterior parameter sampling and the transition\nsampling into a single transition sampling step. This considerably reduces the cost of belief updates\ninside the search tree when using these simple probabilistic models. 
In general, efficient collapsed\nsampling schemes are not available (see for example the model in Section 5.2).\nResults\nA summary of the results is presented in Table 1. Figure 1 reports the planning time/performance\ntrade-off for the different algorithms on the Grid5 and Maze domain.\nOn all the domains tested, BAMCP performed best. Other algorithms came close on some tasks,\nbut only when their parameters were tuned to that specific domain. This is particularly evident for\nBEB, which required a different value of exploration bonus to achieve maximum performance in\neach domain. BAMCP\u2019s performance is stable with respect to the choice of its exploration constant\nc and it did not require tuning to obtain the results.\nBAMCP\u2019s performance scales well as a function of planning time, as is evident in Figure 1. In\ncontrast, SBOSS follows the opposite trend. If more samples are employed to build the merged model,\nSBOSS actually becomes too optimistic and over-explores, degrading its performance. BEB cannot\ntake advantage of prolonged planning time at all. BFS3 generally scales up with more planning\ntime with an appropriate choice of parameters, but it is not obvious how to trade off the branching\nfactor, depth, and number of simulations in each domain. BAMCP greatly benefited from our lazy\nsampling scheme in the experiments, providing 35\u00d7 speed improvement over the naive approach in\nthe maze domain for example; this is illustrated in Figure 1(c).\nDearden\u2019s maze aptly illustrates a major drawback of forward search sparse sampling algorithms\nsuch as BFS3. Like many maze problems, all rewards are zero for at least k steps, where k is the\nsolution length. Without prior knowledge of the optimal solution length, all upper bounds will be\nhigher than the true optimal value until the tree has been fully expanded up to depth k, even if a\nsimulation happens to solve the maze. In contrast, once BAMCP discovers a successful simulation,\nits Monte-Carlo evaluation will immediately bias the search tree towards the successful trajectory.\n\nFootnote 2: The result reported for Dearden\u2019s maze with the Bayesian DP alg. in [22] is for a different version of the\ntask in which the maze layout is given to the agent.\n\n[Figure 1, panels (a)-(d): plots not reproduced. x-axis: average time per step (s); y-axis: sum of rewards after 1000 steps (a), undiscounted sum of rewards after 20000 steps (b). Legends: (a, b) BAMCP, BEB, BFS3, SBOSS; (c) BA-UCT + RL, BA-UCT; (d) BA-UCT + RS + LS + RL (BAMCP), BA-UCT + RS + LS, BA-UCT + RS + RL, BA-UCT + RS.]\nFigure 1: Performance of each algorithm on the Grid5 (a) and Maze domain (b-d) as a function of planning\ntime. Each point corresponds to a single run of an algorithm with an associated setting of the parameters.\nIncreasing brightness inside the points codes for an increasing value of a parameter (BAMCP and BFS3: number\nof simulations, BEB: bonus parameter \u03b2, SBOSS: number of samples K). A second dimension of variation\nis coded as the size of the points (BFS3: branching factor C, SBOSS: resampling parameter \u03b4). The range of\nparameters is specified in Section 5.1. a. Performance of each algorithm on the Grid5 domain. b. Performance\nof each algorithm on the Maze domain. c. On the Maze domain, performance of vanilla BA-UCT with and\nwithout rollout policy learning (RL). d. On the Maze domain, performance of BAMCP with and without the\nlazy sampling (LS) and rollout policy learning (RL) presented in Sections 3.3 and 3.4. Root sampling (RS) is\nincluded.\n\n5.2 Infinite 2D grid task\nFigure 2: Performance of BAMCP as a function of planning time on the Infinite 2D grid task of Section 5.2,\nfor \u03b3 = 0.97, where the grids are generated with Beta parameters \u03b1_1 = 1, \u03b2_1 = 2, \u03b1_2 = 2, \u03b2_2 = 1 (see\nsupp. Figure S4 for a visualization). The performance during the first 200 steps in the environment is averaged\nover 50 sampled environments (5 runs for each sample) and is reported both in terms of undiscounted (left) and\ndiscounted (right) sum of rewards. BAMCP is run either with the correct generative model as prior or with an\nincorrect prior (parameters for rows and columns are swapped); it is clear that BAMCP can take advantage of\ncorrect prior information to gain more rewards. The performance of a uniform random policy is also reported.\n\nWe also applied BAMCP to a much larger problem. The generative model for this infinite-grid\nMDP is as follows: each column i has an associated latent parameter p_i \u223c Beta(\u03b1_1, \u03b2_1) and each\nrow j has an associated latent parameter q_j \u223c Beta(\u03b1_2, \u03b2_2). The probability of grid cell ij having\na reward of 1 is p_i q_j, otherwise the reward is 0. The agent knows it is on a grid and is always free\nto move in any of the four cardinal directions. Rewards are consumed when visited; returning to the\nsame location subsequently results in a reward of 0. As opposed to the independent Dirichlet priors\nemployed in standard domains, here, dynamics are tightly correlated across states (i.e., observing\na state transition provides information about other state transitions). Posterior inference (of the\ndynamics P) in this model requires approximation because of the non-conjugate coupling of the\nvariables; the inference is done via MCMC (details in Supplementary). 
The domain is illustrated in\nFigure S4.\nPlanning algorithms that attempt to solve an MDP based on sample(s) (or the mean) of the posterior\n(e.g., BOSS, BEB, Bayesian DP) cannot directly handle the large state space. Prior forward-search\nmethods (e.g., BA-UCT, BFS3) can deal with the state space, but not the large belief space: at every\nnode of the search tree they must solve an approximate inference problem to estimate the posterior\nbeliefs. In contrast, BAMCP limits the posterior inference to the root of the search tree and is not\ndirectly affected by the size of the state space or belief space, which allows the algorithm to perform\nwell even with a limited planning time. Note that lazy sampling is required in this setup since a full\nsample of the dynamics involves infinitely many parameters.\nFigure 2 (and Figure S5) demonstrates the planning performance of BAMCP in this complex domain.\nPerformance improves with additional planning time, and the quality of the prior clearly\naffects the agent\u2019s performance. Supplementary videos contrast the behavior of the agent for different\nprior parameters.\n6 Future Work\nThe UCT algorithm is known to have several drawbacks. First, there are no finite-time regret bounds.\nIt is possible to construct malicious environments, for example in which the optimal policy is hidden\nin a generally low reward region of the tree, where UCT can be misled for long periods [7]. Second,\nthe UCT algorithm treats every action node as a multi-armed bandit problem. However, there is\nno actual benefit to accruing reward during planning, and so it is in theory more appropriate to use\npure exploration bandits [4]. Nevertheless, the UCT algorithm has produced excellent empirical\nperformance in many domains [12].\nBAMCP is able to exploit prior knowledge about the dynamics in a principled manner. In principle,\nit is possible to encode many aspects of domain knowledge into the prior distribution. 
An important\navenue for future work is to explore rich, structured priors about the dynamics of the MDP. If this\nprior knowledge matches the class of environments that the agent will encounter, then exploration\ncould be significantly accelerated.\n7 Conclusion\nWe suggested a sample-based algorithm for Bayesian RL called BAMCP that significantly surpassed\nthe performance of existing algorithms on several standard tasks. We showed that BAMCP can\ntackle larger and more complex tasks generated from a structured prior, where existing approaches\nscale poorly. In addition, BAMCP provably converges to the Bayes-optimal solution.\nThe main idea is to employ Monte-Carlo tree search to explore the augmented Bayes-adaptive search\nspace efficiently. The naive implementation of that idea is the proposed BA-UCT algorithm, which\ncannot scale for most priors due to expensive belief updates inside the search tree. We introduced\nthree modifications to obtain a computationally tractable sample-based algorithm: root sampling,\nwhich only requires beliefs to be sampled at the start of each simulation (as in [20]); a model-free\nRL algorithm that learns a rollout policy; and the use of a lazy sampling scheme to sample the\nposterior beliefs cheaply.\n\n[Figure 2 plots: undiscounted (left) and discounted (right) sum of rewards against planning time (s). Legend: BAMCP, BAMCP Wrong prior, Random.]\n\nReferences\n[1] J. Asmuth, L. Li, M.L. Littman, A. Nouri, and D. Wingate. A Bayesian sampling approach to exploration in reinforcement learning. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 19\u201326, 2009.\n[2] J. Asmuth and M. Littman. Approaching Bayes-optimality using Monte-Carlo tree search. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, pages 19\u201326, 2011.\n[3] R. Bellman and R. Kalaba. 
On adaptive control processes. Automatic Control, IRE Transactions on, 4(2):1\u20139, 1959.\n[4] S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in multi-armed bandits problems. In Proceedings of the 20th international conference on Algorithmic learning theory, pages 23\u201337. Springer-Verlag, 2009.\n[5] P. Castro and D. Precup. Smarter sampling in model-based Bayesian reinforcement learning. Machine Learning and Knowledge Discovery in Databases, pages 200\u2013214, 2010.\n[6] P.S. Castro. Bayesian exploration in Markov decision processes. PhD thesis, McGill University, 2007.\n[7] P.A. Coquelin and R. Munos. Bandit algorithms for tree search. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence, pages 67\u201374, 2007.\n[8] R. Dearden, N. Friedman, and S. Russell. Bayesian Q-learning. In Proceedings of the National Conference on Artificial Intelligence, pages 761\u2013768, 1998.\n[9] M.O.G. Duff. Optimal Learning: Computational Procedures For Bayes-Adaptive Markov Decision Processes. PhD thesis, University of Massachusetts Amherst, 2002.\n[10] A.A. Feldbaum. Dual control theory. Automation and Remote Control, 21(9):874\u20131039, 1960.\n[11] N. Friedman and Y. Singer. Efficient Bayesian parameter estimation in large discrete domains. Advances in Neural Information Processing Systems (NIPS), pages 417\u2013423, 1999.\n[12] S. Gelly, L. Kocsis, M. Schoenauer, M. Sebag, D. Silver, C. Szepesv\u00e1ri, and O. Teytaud. The grand challenge of computer Go: Monte Carlo tree search and extensions. Communications of the ACM, 55(3):106\u2013113, 2012.\n[13] S. Gelly and D. Silver. Combining online and offline knowledge in UCT. In Proceedings of the 24th International Conference on Machine learning, pages 273\u2013280, 2007.\n[14] J.C. Gittins, R. Weber, and K.D. Glazebrook. Multi-armed bandit allocation indices. Wiley Online Library, 1989.\n[15] M. Kearns, Y. Mansour, and A.Y. 
Ng. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. In Proceedings of the 16th international joint conference on Artificial intelligence-Volume 2, pages 1324\u20131331, 1999.\n[16] L. Kocsis and C. Szepesv\u00e1ri. Bandit based Monte-Carlo planning. Machine Learning: ECML 2006, pages 282\u2013293, 2006.\n[17] J.Z. Kolter and A.Y. Ng. Near-Bayesian exploration in polynomial time. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 513\u2013520, 2009.\n[18] J.J. Martin. Bayesian decision problems and Markov chains. Wiley, 1967.\n[19] N. Meuleau and P. Bourgine. Exploration of multi-state environments: Local measures and back-propagation of uncertainty. Machine Learning, 35(2):117\u2013154, 1999.\n[20] D. Silver and J. Veness. Monte-Carlo planning in large POMDPs. Advances in Neural Information Processing Systems (NIPS), pages 2164\u20132172, 2010.\n[21] J. Sorg, S. Singh, and R.L. Lewis. Variance-based rewards for approximate Bayesian reinforcement learning. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, 2010.\n[22] M. Strens. A Bayesian framework for reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning, pages 943\u2013950, 2000.\n[23] C. Szepesv\u00e1ri. Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.\n[24] T.J. Walsh, S. Goschin, and M.L. Littman. Integrating sample-based planning and model-based reinforcement learning. In Proceedings of the 24th Conference on Artificial Intelligence (AAAI), 2010.\n[25] T. Wang, D. Lizotte, M. Bowling, and D. Schuurmans. Bayesian sparse sampling for on-line reward optimization. 
In Proceedings of the 22nd International Conference on Machine learning, pages 956\u2013963, 2005.", "award": [], "sourceid": 495, "authors": [{"given_name": "Arthur", "family_name": "Guez", "institution": null}, {"given_name": "David", "family_name": "Silver", "institution": null}, {"given_name": "Peter", "family_name": "Dayan", "institution": null}]}