{"title": "Bayes-Adaptive POMDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 1225, "page_last": 1232, "abstract": null, "full_text": "Bayes-Adaptive POMDPs\n\nSt\u00b4ephane Ross\nMcGill University\n\nMontr\u00b4eal, Qc, Canada\n\nBrahim Chaib-draa\n\nLaval University\n\nQu\u00b4ebec, Qc, Canada\n\nJoelle Pineau\n\nMcGill University\n\nMontr\u00b4eal, Qc, Canada\n\nsross12@cs.mcgill.ca\n\nchaib@ift.ulaval.ca\n\njpineau@cs.mcgill.ca\n\nAbstract\n\nBayesian Reinforcement Learning has generated substantial interest recently, as it\nprovides an elegant solution to the exploration-exploitation trade-off in reinforce-\nment learning. However most investigations of Bayesian reinforcement learning\nto date focus on the standard Markov Decision Processes (MDPs). Our goal is\nto extend these ideas to the more general Partially Observable MDP (POMDP)\nframework, where the state is a hidden variable. To address this problem, we in-\ntroduce a new mathematical model, the Bayes-Adaptive POMDP. This new model\nallows us to (1) improve knowledge of the POMDP domain through interaction\nwith the environment, and (2) plan optimal sequences of actions which can trade-\noff between improving the model, identifying the state, and gathering reward. We\nshow how the model can be \ufb01nitely approximated while preserving the value func-\ntion. We describe approximations for belief tracking and planning in this model.\nEmpirical results on two domains show that the model estimate and agent\u2019s return\nimprove over time, as the agent learns better model estimates.\n\n1 Introduction\n\nIn many real world systems, uncertainty can arise in both the prediction of the system\u2019s behavior, and\nthe observability of the system\u2019s state. Partially Observable Markov Decision Processes (POMDPs)\ntake both kinds of uncertainty into account and provide a powerful model for sequential decision\nmaking under these conditions. 
However, most solution methods for POMDPs assume that the model is known a priori, which is rarely the case in practice. For instance, in robotics, the POMDP must reflect exactly the uncertainty on the robot's sensors and actuators. These parameters are rarely known exactly and must often be approximated by a human designer, such that even if this approximate POMDP could be solved exactly, the resulting policy may not be optimal. We therefore seek a decision-theoretic planner which can take into account the uncertainty over model parameters during the planning process, as well as learn the values of these unknown parameters from experience.

Bayesian Reinforcement Learning has investigated this problem in the context of fully observable MDPs [1, 2, 3]. An extension to POMDPs has recently been proposed [4], yet this method relies on heuristics to select actions that will improve the model, thus forgoing any theoretical guarantee on the quality of the approximation, and on an oracle that can be queried to provide the current state.

In this paper, we draw inspiration from the Bayes-Adaptive MDP framework [2], which is formulated to provide an optimal solution to the exploration-exploitation trade-off. To extend these ideas to POMDPs, we face two challenges: (1) how to update the Dirichlet parameters when the state is a hidden variable, and (2) how to approximate the infinite-dimensional belief space to perform belief monitoring and compute the optimal policy. This paper tackles both problems jointly. The first problem is solved by including the Dirichlet parameters in the state space and maintaining belief states over these parameters.
We address the second by bounding the space of Dirichlet parameters to a finite subspace necessary for $\epsilon$-optimal solutions.

We provide theoretical results for bounding the state space while preserving the value function, and we use these results to derive approximate solving and belief monitoring algorithms. We compare several belief approximations in two problem domains. Empirical results show that the agent is able to learn good POMDP models and improve its return as it learns better model estimates.

2 POMDP

A POMDP is defined by finite sets of states $S$, actions $A$ and observations $Z$. It has transition probabilities $\{T^{sas'}\}_{s,s' \in S, a \in A}$, where $T^{sas'} = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$, and observation probabilities $\{O^{saz}\}_{s \in S, a \in A, z \in Z}$, where $O^{saz} = \Pr(z_t = z \mid s_t = s, a_{t-1} = a)$. The reward function $R : S \times A \to \mathbb{R}$ specifies the immediate reward obtained by the agent. In a POMDP, the state is never observed. Instead the agent perceives an observation $z \in Z$ at each time step, which (along with the action sequence) allows it to maintain a belief state $b \in \Delta S$. The belief state specifies the probability of being in each state given the history of actions and observations experienced so far, starting from an initial belief $b_0$.
It can be updated at each time step using Bayes' rule:
$$b_{t+1}(s') = \frac{O^{s' a_t z_{t+1}} \sum_{s \in S} T^{s a_t s'} b_t(s)}{\sum_{s'' \in S} O^{s'' a_t z_{t+1}} \sum_{s \in S} T^{s a_t s''} b_t(s)}.$$

A policy $\pi : \Delta S \to A$ indicates how the agent should select actions as a function of the current belief. Solving a POMDP involves finding the optimal policy $\pi^*$ that maximizes the expected discounted return over the infinite horizon. The return obtained by following $\pi^*$ from a belief $b$ is defined by Bellman's equation:
$$V^*(b) = \max_{a \in A} \Big[ \sum_{s \in S} b(s) R(s,a) + \gamma \sum_{z \in Z} \Pr(z \mid b, a) V^*(\tau(b,a,z)) \Big],$$
where $\tau(b,a,z)$ is the new belief after performing action $a$ and observing $z$, and $\gamma \in [0,1)$ is the discount factor.

Exact solving algorithms [5] are usually intractable, except on small domains with only a few states, actions and observations. Various approximate algorithms, both offline [6, 7, 8] and online [9], have been proposed to tackle increasingly large domains. However, all these methods require full knowledge of the POMDP model, which is a strong assumption in practice. Some approaches do not require knowledge of the model, as in [10], but these approaches generally require a lot of data and do not address the exploration-exploitation trade-off.

3 Bayes-Adaptive POMDP

In this section, we introduce the Bayes-Adaptive POMDP (BAPOMDP) model, an optimal decision-theoretic algorithm for learning and planning in POMDPs under parameter uncertainty. Throughout, we assume that the state, action, and observation spaces are finite and known, but that the transition and observation probabilities are unknown or partially known.
We also assume that the reward function is known, as it is generally specified by the user for the specific task to accomplish, but the model can easily be generalized to learn the reward function as well.

To model the uncertainty on the transition $T^{sas'}$ and observation $O^{saz}$ parameters, we use Dirichlet distributions, which are probability distributions over the parameters of multinomial distributions. Given $\phi_i$, the number of times event $e_i$ has occurred over $n$ trials, the probabilities $p_i$ of each event follow a Dirichlet distribution, i.e. $(p_1, \dots, p_k) \sim \mathrm{Dir}(\phi_1, \dots, \phi_k)$. This distribution represents the probability that a discrete random variable behaves according to some probability distribution $(p_1, \dots, p_k)$, given that the counts $(\phi_1, \dots, \phi_k)$ have been observed over $n$ trials ($n = \sum_{i=1}^k \phi_i$). Its probability density function is defined by $f(p, \phi) = \frac{1}{B(\phi)} \prod_{i=1}^k p_i^{\phi_i - 1}$, where $B$ is the multinomial beta function. The expected value of $p_i$ is $E(p_i) = \frac{\phi_i}{\sum_{j=1}^k \phi_j}$.

3.1 The BAPOMDP Model

The BAPOMDP is constructed from the model of the POMDP with unknown parameters. Let $(S, A, Z, T, O, R, \gamma)$ be that model. The uncertainty on the distributions $T^{sa\cdot}$ and $O^{s'a\cdot}$ can be represented by experience counts: $\phi^a_{ss'}$ ($\forall s'$) represents the number of times the transition $(s, a, s')$ occurred; similarly, $\psi^a_{s'z}$ ($\forall z$) is the number of times observation $z$ was made in state $s'$ after doing action $a$. Let $\phi$ be the vector of all transition counts and $\psi$ be the vector of all observation counts.
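Since only the posterior mean of each Dirichlet enters the model estimate, recovering expected probabilities from the counts is a one-line computation. A minimal sketch (the `dirichlet_mean` helper is our own naming, not from the paper); the counts mirror the Tiger prior $\psi_0 = (5, 3, 3, 5)$ of Section 5.1, which encodes an expected sensor accuracy of $5/8 = 62.5\%$:

```python
import numpy as np

def dirichlet_mean(phi):
    """Posterior-mean multinomial parameters E[p_i] = phi_i / sum_j phi_j
    for a Dirichlet with count vector phi."""
    phi = np.asarray(phi, dtype=float)
    return phi / phi.sum()

# Counts (5, 3) for (hear left, hear right) in state tiger-left:
# the expected accuracy is 5/8 = 0.625, as in the Tiger prior.
mean = dirichlet_mean([5, 3])

# The counts also define a full distribution over the parameters,
# which can be sampled directly:
sample = np.random.dirichlet([5, 3])
```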
Given the count vectors $\phi$ and $\psi$, the expected transition probability for $T^{sas'}$ is $T^{sas'}_\phi = \frac{\phi^a_{ss'}}{\sum_{s'' \in S} \phi^a_{ss''}}$, and similarly for $O^{s'az}$: $O^{s'az}_\psi = \frac{\psi^a_{s'z}}{\sum_{z' \in Z} \psi^a_{s'z'}}$.

The objective of the BAPOMDP is to learn an optimal policy, such that actions are chosen to maximize reward, taking into account both state and parameter uncertainty. To model this, we follow the Bayes-Adaptive MDP framework and include the $\phi$ and $\psi$ vectors in the state of the BAPOMDP. Thus, the state space $S'$ of the BAPOMDP is defined as $S' = S \times \mathcal{T} \times \mathcal{O}$, where $\mathcal{T} = \{\phi \in \mathbb{N}^{|S|^2|A|} \mid \forall (s,a), \sum_{s' \in S} \phi^a_{ss'} > 0\}$ represents the space in which $\phi$ lies and $\mathcal{O} = \{\psi \in \mathbb{N}^{|S||A||Z|} \mid \forall (s,a), \sum_{z \in Z} \psi^a_{sz} > 0\}$ represents the space in which $\psi$ lies. The action and observation sets of the BAPOMDP are the same as in the original POMDP. The transition and observation functions of the BAPOMDP must capture how the state and count vectors $\phi, \psi$ evolve after every time step. Consider an agent in a given state $s$ with count vectors $\phi$ and $\psi$, which performs action $a$, causing it to move to state $s'$ and observe $z$. Then the vector $\phi'$ after the transition is defined as $\phi' = \phi + \delta^a_{ss'}$, where $\delta^a_{ss'}$ is a vector full of zeroes with a 1 for the count $\phi^a_{ss'}$, and the vector $\psi'$ after the observation is defined as $\psi' = \psi + \delta^a_{s'z}$, where $\delta^a_{s'z}$ is a vector full of zeroes with a 1 for the count $\psi^a_{s'z}$. Note that the probabilities of such transitions and observations occurring must be defined by considering all models and their probabilities as specified by the current Dirichlet distributions, which turn out to be their expectations.
Hence, we define $T'$ and $O'$ to be:
$$T'((s,\phi,\psi), a, (s',\phi',\psi')) = \begin{cases} T^{sas'}_\phi O^{s'az}_\psi, & \text{if } \phi' = \phi + \delta^a_{ss'} \text{ and } \psi' = \psi + \delta^a_{s'z} \\ 0, & \text{otherwise.} \end{cases} \quad (1)$$
$$O'((s,\phi,\psi), a, (s',\phi',\psi'), z) = \begin{cases} 1, & \text{if } \phi' = \phi + \delta^a_{ss'} \text{ and } \psi' = \psi + \delta^a_{s'z} \\ 0, & \text{otherwise.} \end{cases} \quad (2)$$

Note here that the observation probabilities are folded into the transition function, and that the observation function becomes deterministic. This happens because a state transition in the BAPOMDP automatically specifies which observation is acquired after the transition, via the way the counts are incremented. Since the counts do not affect the reward, the reward function of the BAPOMDP is defined as $R'((s,\phi,\psi), a) = R(s,a)$; the discount factor of the BAPOMDP remains the same. Using these definitions, the BAPOMDP has a known model specified by the tuple $(S', A, Z, T', O', R', \gamma)$.

The belief state of the BAPOMDP represents a distribution over both states and count values. The model is learned by simply maintaining this belief state, as the distribution will concentrate over the most likely models, given the prior and experience so far. If $b_0$ is the initial belief state of the unknown POMDP, and the count vectors $\phi_0 \in \mathcal{T}$ and $\psi_0 \in \mathcal{O}$ represent the prior knowledge on this POMDP, then the initial belief of the BAPOMDP is $b'_0(s, \phi, \psi) = b_0(s)$ if $(\phi, \psi) = (\phi_0, \psi_0)$, and $0$ otherwise. After actions are taken, the uncertainty on the POMDP model is represented by mixtures of Dirichlet distributions (i.e.
mixtures of count vectors).

Note that the BAPOMDP is in fact a POMDP with a countably infinite state space. Hence the belief update function and optimal value function are still defined as in Section 2, but these functions now require summations over $S' = S \times \mathcal{T} \times \mathcal{O}$. Maintaining the belief state is practical only if the number of states with non-zero probability is finite. We prove this in the following theorem:

Theorem 3.1. Let $(S', A, Z, T', O', R', \gamma)$ be a BAPOMDP constructed from the POMDP $(S, A, Z, T, O, R, \gamma)$. If $S$ is finite, then at any time $t$, the set $S'_{b'_t} = \{\sigma \in S' \mid b'_t(\sigma) > 0\}$ has size $|S'_{b'_t}| \leq |S|^{t+1}$.

Proof. Available in [11]. Proceeds by induction from $b'_0$.

The proof of this theorem suggests that it is sufficient to iterate over $S$ and $S'_{b'_{t-1}}$ in order to compute the belief state $b'_t$ when an action and observation are taken in the environment. Hence, Algorithm 3.1 can be used to update the belief state.

3.2 Exact Solution for BAPOMDP in Finite Horizons

The value function of a BAPOMDP for finite horizons can be represented by a finite set $\Gamma$ of functions $\alpha : S' \to \mathbb{R}$, as in standard POMDPs.
For example, an exact solution can be computed using dynamic programming (see [5] for more details):

    function τ(b, a, z):
        Initialize b′ as a 0 vector.
        for all (s, φ, ψ, s′) ∈ S′_b × S do
            b′(s′, φ + δ^a_{ss′}, ψ + δ^a_{s′z}) ← b′(s′, φ + δ^a_{ss′}, ψ + δ^a_{s′z}) + b(s, φ, ψ) T^{sas′}_φ O^{s′az}_ψ
        end for
        return normalized b′

    Algorithm 3.1: Exact Belief Update in BAPOMDP.

$$\begin{aligned}
\Gamma^a_1 &= \{\alpha^a \mid \alpha^a(s,\phi,\psi) = R(s,a)\},\\
\Gamma^{a,z}_t &= \{\alpha^{a,z}_i \mid \alpha^{a,z}_i(s,\phi,\psi) = \gamma \textstyle\sum_{s' \in S} T^{sas'}_\phi O^{s'az}_\psi\, \alpha'_i(s', \phi + \delta^a_{ss'}, \psi + \delta^a_{s'z}),\ \alpha'_i \in \Gamma_{t-1}\},\\
\Gamma^a_t &= \Gamma^a_1 \oplus \Gamma^{a,z_1}_t \oplus \Gamma^{a,z_2}_t \oplus \cdots \oplus \Gamma^{a,z_{|Z|}}_t \quad \text{(where } \oplus \text{ is the cross-sum operator)},\\
\Gamma_t &= \textstyle\bigcup_{a \in A} \Gamma^a_t. \qquad (3)
\end{aligned}$$

Note here that the definition of $\alpha^{a,z}_i(s,\phi,\psi)$ is obtained from the fact that $T'((s,\phi,\psi), a, (s',\phi',\psi'))\, O'((s,\phi,\psi), a, (s',\phi',\psi'), z) = 0$ except when $\phi' = \phi + \delta^a_{ss'}$ and $\psi' = \psi + \delta^a_{s'z}$. In practice, it will be impossible to compute $\alpha^{a,z}_i(s,\phi,\psi)$ for all $(s,\phi,\psi) \in S'$. In order to compute these more efficiently, we show in the next section that the infinite state space can be reduced to a finite state space, while still preserving the value function to arbitrary precision for any horizon $t$.
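Algorithm 3.1 translates almost directly into code. The sketch below is our own rendering, not the authors' implementation: it assumes the belief is a dictionary over hypotheses $(s, \phi, \psi)$, with the count vectors stored as nested tuples, indexed `phi[a][s][s']` and `psi[a][s'][z]`, so that they can serve as dictionary keys:

```python
import numpy as np

def to_tuple(a):
    """Nested-tuple view of an array, so count vectors are hashable keys."""
    return tuple(map(to_tuple, a)) if isinstance(a, np.ndarray) else int(a)

def bump(counts, idx):
    """Copy of a nested count tuple with the entry at idx incremented by 1."""
    arr = np.array(counts)
    arr[idx] += 1
    return to_tuple(arr)

def bapomdp_belief_update(b, a, z, n_states):
    """Exact BAPOMDP belief update (Algorithm 3.1): for each hypothesis,
    weight every successor state by the expected T and O probabilities,
    increment the corresponding counts, and renormalize."""
    b2 = {}
    for (s, phi, psi), p in b.items():
        for s2 in range(n_states):
            t = phi[a][s][s2] / sum(phi[a][s])     # expected T^{sas'}_phi
            o = psi[a][s2][z] / sum(psi[a][s2])    # expected O^{s'az}_psi
            if t * o == 0.0:
                continue
            key = (s2, bump(phi, (a, s, s2)), bump(psi, (a, s2, z)))
            b2[key] = b2.get(key, 0.0) + p * t * o
    norm = sum(b2.values())
    return {sigma: v / norm for sigma, v in b2.items()}

# Tiny demo (assumed numbers): 2 states, 1 action, 2 observations,
# uniform Dir(1, 1) priors on every row of T and O.
phi0 = (((1, 1), (1, 1)),)   # phi[a][s][s']
psi0 = (((1, 1), (1, 1)),)   # psi[a][s'][z]
b0 = {(0, phi0, psi0): 1.0}
b1 = bapomdp_belief_update(b0, a=0, z=0, n_states=2)
# The belief now spreads over one hypothesis per reachable s',
# consistent with the |S|^{t+1} growth of Theorem 3.1.
```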
The optimal policy is extracted as usual: $\pi_\Gamma(b) = \operatorname{argmax}_{\alpha \in \Gamma} \sum_{\sigma \in S'_b} \alpha(\sigma) b(\sigma)$.

4 Approximating the BAPOMDP: Theory and Algorithms

Solving a BAPOMDP exactly for all belief states is impossible in practice due to the dimensionality of the state space (in particular, to the fact that the count vectors can grow unbounded). We now show how to reduce this infinite state space to a finite state space. This allows us to compute an $\epsilon$-optimal value function over the resulting finite-dimensional belief space using standard POMDP techniques. Various methods for belief tracking in the infinite model are also presented.

4.1 Approximate Finite Model

We first present an upper bound on the value difference between two states that differ only by their model estimates $\phi$ and $\psi$. This bound uses the following definitions: given $\phi, \phi' \in \mathcal{T}$ and $\psi, \psi' \in \mathcal{O}$, define $D^{sa}_S(\phi, \phi') = \sum_{s' \in S} \left| T^{sas'}_\phi - T^{sas'}_{\phi'} \right|$, $D^{sa}_Z(\psi, \psi') = \sum_{z \in Z} \left| O^{saz}_\psi - O^{saz}_{\psi'} \right|$, $N^{sa}_\phi = \sum_{s' \in S} \phi^a_{ss'}$ and $N^{sa}_\psi = \sum_{z \in Z} \psi^a_{sz}$.

Theorem 4.1.
Given any $\phi, \phi' \in \mathcal{T}$, $\psi, \psi' \in \mathcal{O}$, and $\gamma \in (0,1)$, then for all $t$:
$$\sup_{\alpha_t \in \Gamma_t, s \in S} |\alpha_t(s,\phi,\psi) - \alpha_t(s,\phi',\psi')| \leq \frac{2\gamma \|R\|_\infty}{(1-\gamma)^2} \sup_{s,s' \in S, a \in A} \Big[ D^{sa}_S(\phi,\phi') + D^{s'a}_Z(\psi,\psi') + \frac{4}{\ln(\gamma^{-e})} \Big( \frac{\sum_{s'' \in S} |\phi^a_{ss''} - \phi'^a_{ss''}|}{(N^{sa}_\phi + 1)(N^{sa}_{\phi'} + 1)} + \frac{\sum_{z \in Z} |\psi^a_{s'z} - \psi'^a_{s'z}|}{(N^{s'a}_\psi + 1)(N^{s'a}_{\psi'} + 1)} \Big) \Big]$$

Proof. Available in [11]; finds a bound on a one-step backup and solves the recurrence.

We now use this bound on the $\alpha$-vector values to approximate the space of Dirichlet parameters within a finite subspace. We use the following definitions: given any $\epsilon > 0$, define $\epsilon' = \frac{\epsilon(1-\gamma)^2}{8\gamma\|R\|_\infty}$ and $\epsilon'' = \frac{\epsilon(1-\gamma)^2 \ln(\gamma^{-e})}{32\gamma\|R\|_\infty}$.

Theorem 4.2.
Given any $\epsilon > 0$ and $(s, \phi, \psi) \in S'$ such that $\exists a \in A, s' \in S$ with $N^{s'a}_\phi > N^\epsilon_S$ or $N^{s'a}_\psi > N^\epsilon_Z$, where $N^\epsilon_S = \max\big(\frac{|S|(1+\epsilon')}{\epsilon'}, \frac{1}{\epsilon''} - 1\big)$ and $N^\epsilon_Z = \max\big(\frac{|Z|(1+\epsilon')}{\epsilon'}, \frac{1}{\epsilon''} - 1\big)$, then $\exists (s, \phi', \psi') \in S'$ such that $\forall a \in A, s' \in S$: $N^{s'a}_{\phi'} \leq N^\epsilon_S$ and $N^{s'a}_{\psi'} \leq N^\epsilon_Z$, and $|\alpha_t(s,\phi,\psi) - \alpha_t(s,\phi',\psi')| < \epsilon$ holds for all $t$ and $\alpha_t \in \Gamma_t$.

Proof. Available in [11].

Theorem 4.2 suggests that if we want a precision of $\epsilon$ on the value function, we just need to restrict the space of Dirichlet parameters to count vectors $\phi \in \tilde{\mathcal{T}}_\epsilon = \{\phi \in \mathbb{N}^{|S|^2|A|} \mid \forall a \in A, s \in S,\ 0 < N^{sa}_\phi \leq N^\epsilon_S\}$ and $\psi \in \tilde{\mathcal{O}}_\epsilon = \{\psi \in \mathbb{N}^{|S||A||Z|} \mid \forall a \in A, s \in S,\ 0 < N^{sa}_\psi \leq N^\epsilon_Z\}$. Since $\tilde{\mathcal{T}}_\epsilon$ and $\tilde{\mathcal{O}}_\epsilon$ are finite, we can define a finite approximate BAPOMDP as the tuple $(\tilde{S}_\epsilon, A, Z, \tilde{T}_\epsilon, \tilde{O}_\epsilon, \tilde{R}_\epsilon, \gamma)$, where $\tilde{S}_\epsilon = S \times \tilde{\mathcal{T}}_\epsilon \times \tilde{\mathcal{O}}_\epsilon$ is the finite state space. To define the transition and observation functions over that finite state space, we need to make sure that when the count vectors are incremented, they stay within the finite space. To achieve this, we define a projection operator $P_\epsilon : S' \to \tilde{S}_\epsilon$ that simply projects every state in $S'$ to its closest state in $\tilde{S}_\epsilon$.

Definition 4.1.
Let $d : S' \times S' \to \mathbb{R}$ be defined such that:
$$d(s,\phi,\psi,s',\phi',\psi') = \begin{cases} \displaystyle \frac{2\gamma\|R\|_\infty}{(1-\gamma)^2} \sup_{s,s' \in S, a \in A} \Big[ D^{sa}_S(\phi,\phi') + D^{s'a}_Z(\psi,\psi') + \frac{4}{\ln(\gamma^{-e})} \Big( \frac{\sum_{s'' \in S} |\phi^a_{ss''} - \phi'^a_{ss''}|}{(N^{sa}_\phi + 1)(N^{sa}_{\phi'} + 1)} + \frac{\sum_{z \in Z} |\psi^a_{s'z} - \psi'^a_{s'z}|}{(N^{s'a}_\psi + 1)(N^{s'a}_{\psi'} + 1)} \Big) \Big], & \text{if } s = s' \\[1ex] \displaystyle \frac{8\gamma\|R\|_\infty}{(1-\gamma)^2}\Big(1 + \frac{4}{\ln(\gamma^{-e})}\Big) + \frac{2\|R\|_\infty}{1-\gamma}, & \text{otherwise.} \end{cases}$$

Definition 4.2. Let $P_\epsilon : S' \to \tilde{S}_\epsilon$ be defined as $P_\epsilon(s) = \operatorname{arg\,min}_{s' \in \tilde{S}_\epsilon} d(s, s')$.

The function $d$ uses the bound defined in Theorem 4.1 as a distance between states that differ only by their $\phi$ and $\psi$ vectors, and uses an upper bound on that value when the states differ. Thus $P_\epsilon$ always maps states $(s,\phi,\psi) \in S'$ to some state $(s,\phi',\psi') \in \tilde{S}_\epsilon$. Note that if $\sigma \in \tilde{S}_\epsilon$, then $P_\epsilon(\sigma) = \sigma$.
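For intuition, the $s = s'$ branch of this distance can be transcribed numerically from Theorem 4.1's bound. The sketch below is only illustrative and our own transcription (a full projection also needs the constant branch for $s \neq s'$ and an efficient search over $\tilde{S}_\epsilon$ rather than brute force); arrays are indexed `phi[a][s][s']` and `psi[a][s'][z]` as before:

```python
import numpy as np

def alpha_distance(phi1, psi1, phi2, psi2, gamma, R_inf):
    """Distance d of Definition 4.1 between two hypotheses that share the
    same state s, transcribed from Theorem 4.1's value-difference bound."""
    phi1, phi2 = np.asarray(phi1, float), np.asarray(phi2, float)
    psi1, psi2 = np.asarray(psi1, float), np.asarray(psi2, float)
    # Expected models T_phi and O_psi (row-normalized counts).
    T1 = phi1 / phi1.sum(axis=2, keepdims=True)
    T2 = phi2 / phi2.sum(axis=2, keepdims=True)
    O1 = psi1 / psi1.sum(axis=2, keepdims=True)
    O2 = psi2 / psi2.sum(axis=2, keepdims=True)
    c = 4.0 / np.log(gamma ** -np.e)
    best = 0.0
    n_actions, n_states = phi1.shape[0], phi1.shape[1]
    for a in range(n_actions):          # sup over (s, s', a)
        for s in range(n_states):
            for s2 in range(n_states):
                D_S = np.abs(T1[a, s] - T2[a, s]).sum()
                D_Z = np.abs(O1[a, s2] - O2[a, s2]).sum()
                n_phi = (phi1[a, s].sum() + 1) * (phi2[a, s].sum() + 1)
                n_psi = (psi1[a, s2].sum() + 1) * (psi2[a, s2].sum() + 1)
                term = (np.abs(phi1[a, s] - phi2[a, s]).sum() / n_phi
                        + np.abs(psi1[a, s2] - psi2[a, s2]).sum() / n_psi)
                best = max(best, D_S + D_Z + c * term)
    return 2.0 * gamma * R_inf / (1.0 - gamma) ** 2 * best

# Hypothetical counts: identical hypotheses are at distance 0, and adding
# extra (s=0, a=0, s'=0) transition counts moves the hypothesis away.
phi_a = [[[1, 1], [1, 1]]]
psi_a = [[[1, 1], [1, 1]]]
phi_b = [[[3, 1], [1, 1]]]
d_same = alpha_distance(phi_a, psi_a, phi_a, psi_a, gamma=0.9, R_inf=1.0)
d_diff = alpha_distance(phi_b, psi_a, phi_a, psi_a, gamma=0.9, R_inf=1.0)
```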
Using $P_\epsilon$, the transition and observation functions are defined as follows:
$$\tilde{T}_\epsilon((s,\phi,\psi), a, (s',\phi',\psi')) = \begin{cases} T^{sas'}_\phi O^{s'az}_\psi, & \text{if } (s',\phi',\psi') = P_\epsilon(s', \phi + \delta^a_{ss'}, \psi + \delta^a_{s'z}) \\ 0, & \text{otherwise.} \end{cases} \quad (4)$$
$$\tilde{O}_\epsilon((s,\phi,\psi), a, (s',\phi',\psi'), z) = \begin{cases} 1, & \text{if } (s',\phi',\psi') = P_\epsilon(s', \phi + \delta^a_{ss'}, \psi + \delta^a_{s'z}) \\ 0, & \text{otherwise.} \end{cases} \quad (5)$$

These definitions are the same as the ones in the infinite BAPOMDP, except that we now add an extra projection to make sure that the incremented count vectors stay in $\tilde{S}_\epsilon$. Finally, the reward function $\tilde{R}_\epsilon : \tilde{S}_\epsilon \times A \to \mathbb{R}$ is defined as $\tilde{R}_\epsilon((s,\phi,\psi), a) = R(s,a)$.

Theorem 4.3 bounds the value difference between $\alpha$-vectors computed with this finite model and the $\alpha$-vectors computed with the original model.

Theorem 4.3. Given any $\epsilon > 0$, $(s,\phi,\psi) \in S'$ and $\alpha_t \in \Gamma_t$ computed from the infinite BAPOMDP, let $\tilde{\alpha}_t$ be the $\alpha$-vector representing the same conditional plan as $\alpha_t$ but computed with the finite BAPOMDP $(\tilde{S}_\epsilon, A, Z, \tilde{T}_\epsilon, \tilde{O}_\epsilon, \tilde{R}_\epsilon, \gamma)$; then $|\tilde{\alpha}_t(P_\epsilon(s,\phi,\psi)) - \alpha_t(s,\phi,\psi)| < \frac{\epsilon}{1-\gamma}$.

Proof. Available in [11]. Solves a recurrence over the one-step approximation in Theorem 4.2.

Because the state space is now finite, solution methods from the literature on finite POMDPs could theoretically be applied. This includes in particular the equations for $\tau(b,a,z)$ and $V^*(b)$ that were presented in Section 2.
In practice, however, even though the state space is finite, it will generally be very large for small $\epsilon$, such that solving the finite model may still be intractable, even for small domains. We therefore favor a faster online solution approach, as described below.

4.2 Approximate Belief Monitoring

As shown in Theorem 3.1, the number of states with non-zero probability grows exponentially in the planning horizon, so exact belief monitoring can quickly become intractable. We now discuss different particle-based approximations that allow polynomial-time belief tracking.

Monte Carlo sampling: Monte Carlo sampling algorithms have been widely used for sequential state estimation [12]. Given a prior belief $b$, followed by action $a$ and observation $z$, the new belief $b'$ is obtained by first sampling $K$ states from the distribution $b$; then, for each sampled $s$, a new state $s'$ is sampled from $T(s, a, \cdot)$. Finally, the probability $O(s', a, z)$ is added to $b'(s')$ and the belief $b'$ is re-normalized. This captures at most $K$ states with non-zero probability. In the context of BAPOMDPs, we use a slight variation of this method, where $(s, \phi, \psi)$ is first sampled from $b$, and then a next state $s' \in S$ is sampled from the normalized distribution $T^{sa\cdot}_\phi O^{\cdot az}_\psi$. The probability $1/K$ is added directly to $b'(s', \phi + \delta^a_{ss'}, \psi + \delta^a_{s'z})$.

Most Probable: Alternately, we can do the exact belief update at a given time step, but then keep only the $K$ most probable states in the new belief $b'$, and renormalize $b'$.

Weighted Distance Minimization: The two previous methods only try to approximate the distribution $\tau(b,a,z)$. In practice, however, we care most about the agent's expected reward. Hence, instead of keeping the $K$ most likely states, we can keep $K$ states which best approximate the belief's value.
As in the Most Probable method, we do an exact belief update; in this case, however, we fit the posterior distribution using a greedy K-means procedure, where distance is defined as in Definition 4.1, weighted by the probability of the state to remove. See [11] for algorithmic details.

4.3 Online planning

While the finite model presented in Section 4.1 can be used to find provably near-optimal policies offline, this will likely be intractable in practice due to the very large state space required to ensure good precision. Instead, we turn to online lookahead search algorithms, which have been proposed for solving standard POMDPs [9]. Our approach simply performs dynamic programming over all the beliefs reachable within some fixed finite planning horizon from the current belief. The action with the highest return over that finite horizon is executed, and planning is then conducted again on the next belief. To further limit the complexity of the online planning algorithm, we use the approximate belief monitoring methods detailed above. The overall complexity is $O((|A||Z|)^D C_b)$, where $D$ is the planning horizon and $C_b$ is the complexity of updating the belief.

5 Empirical Results

We begin by evaluating the different belief approximations introduced above. To do so, we use a simple online d-step lookahead search, and compare the overall expected return and model accuracy in two different problems: the well-known Tiger [5] and a new domain called Follow.
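The Most Probable pruning of Section 4.2 and the d-step lookahead of Section 4.3 can be sketched together. This is a minimal illustration, not the authors' implementation; `tau`, `pr_z` and `reward` are placeholder callables (any exact or approximate belief update, observation likelihood, and expected immediate reward):

```python
import heapq

def most_probable(b, K):
    """'Most Probable' approximation: keep the K most likely hypotheses
    of a belief dictionary and renormalize."""
    top = heapq.nlargest(K, b.items(), key=lambda kv: kv[1])
    norm = sum(p for _, p in top)
    return {sigma: p / norm for sigma, p in top}

def lookahead(b, depth, actions, observations, tau, pr_z, reward, gamma):
    """D-step lookahead value of belief b: expand every action/observation
    sequence up to `depth` and back up expected rewards, giving the
    O((|A||Z|)^D C_b) online planner described above."""
    if depth == 0:
        return 0.0
    best = float('-inf')
    for a in actions:
        q = reward(b, a)
        for z in observations:
            pz = pr_z(b, a, z)
            if pz > 0.0:
                q += gamma * pz * lookahead(tau(b, a, z), depth - 1, actions,
                                            observations, tau, pr_z, reward,
                                            gamma)
        best = max(best, q)
    return best

# Sanity checks on degenerate toy inputs (assumed, not from the paper):
# prune a 3-hypothesis belief to K = 2, and plan in a trivial problem with
# one certain observation, a static belief, and reward(b, a) = a.
b_pruned = most_probable({'h1': 0.5, 'h2': 0.3, 'h3': 0.2}, K=2)
v = lookahead('b0', depth=2, actions=[0, 1], observations=['z'],
              tau=lambda b, a, z: b, pr_z=lambda b, a, z: 1.0,
              reward=lambda b, a: a, gamma=0.9)   # best plan: 1 + 0.9*1 = 1.9
```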
Given $T^{sas'}$ and $O^{s'az}$, the exact probabilities of the (unknown) POMDP, the model accuracy is measured in terms of the weighted sum of L1-distances, denoted $WL1$, between the exact model and the probable models in a belief state $b$:
$$\begin{aligned}
WL1(b) &= \textstyle\sum_{(s,\phi,\psi) \in S'_b} b(s,\phi,\psi)\, L1(\phi,\psi)\\
L1(\phi,\psi) &= \textstyle\sum_{a \in A} \sum_{s' \in S} \Big[ \sum_{s \in S} |T^{sas'}_\phi - T^{sas'}| + \sum_{z \in Z} |O^{s'az}_\psi - O^{s'az}| \Big]
\end{aligned} \qquad (6)$$

5.1 Tiger

In the Tiger problem [5], we consider the case where the transition and reward parameters are known, but the observation probabilities are not. Hence, there are four unknown parameters: $O_{Ll}, O_{Lr}, O_{Rl}, O_{Rr}$ ($O_{Lr}$ stands for $\Pr(z = hear\ right \mid s = tiger\ left, a = Listen)$). We define the observation count vector $\psi = (\psi_{Ll}, \psi_{Lr}, \psi_{Rl}, \psi_{Rr})$. We consider a prior of $\psi_0 = (5, 3, 3, 5)$, which specifies an expected sensor accuracy of 62.5% (instead of the correct 85%) in both states. Each simulation consists of 100 episodes. Episodes terminate when the agent opens a door, at which point the POMDP state (i.e. the tiger's position) is reset, but the distribution over count vectors is carried over to the next episode.

Figures 1 and 2 show how the average return and model accuracy evolve over the 100 episodes (results are averaged over 1000 simulations), using an online 3-step lookahead search with varying belief approximations and parameters. Returns obtained by planning directly with the prior and exact model (without learning) are shown for comparison. Model accuracy is measured on the initial belief of each episode. Figure 3 compares the average planning time per action taken by each approach.
[Figure 1: Return with different belief approximations. Figure 2: Model accuracy (WL1) with different belief approximations. Figure 3: Planning time per action (ms) with different belief approximations. Legends: Most Probable (2), Monte Carlo (64), Weighted Distance (2), with the exact and prior models shown for comparison.]

We observe from these figures that the results for the Most Probable and Weighted Distance approximations are very similar and perform well even with few particles (lines overlap in many places, making the Weighted Distance results hard to see). On the other hand, the performance of Monte Carlo is significantly affected by the number of particles, and it needed many more particles (64) to obtain an improvement over the prior. This may be due to the sampling error that is introduced when using fewer samples.

5.2 Follow

We propose a new POMDP domain, called Follow, inspired by an interactive human-robot task. Such domains are often particularly subject to parameter uncertainty (due to the difficulty of modelling human behavior), so this environment motivates the utility of the Bayes-Adaptive POMDP in a very practical way. The goal of the Follow task is for a robot to continuously follow one of two individuals in a 2D open area. The two subjects have different motion behavior, requiring the robot to use a different policy for each. At every episode, the target person is selected randomly with $\Pr = 0.5$ (and the other is not present).
The person's identity is not observable (except through their motion). The state space has two features: a binary variable indicating which person is being followed, and a position variable indicating the person's position relative to the robot (a 5 × 5 square grid with the robot always at the center). Initially, the robot and person are at the same position. Both the robot and the person can perform five motion actions {NoAction, North, East, South, West}. The person follows a fixed stochastic policy (stationary over space and time), but the parameters of this behavior are unknown. The robot perceives observations indicating the person's position relative to the robot: {Same, North, East, South, West, Unseen}. The robot perceives the correct observation with $\Pr = 0.8$ and $Unseen$ with $\Pr = 0.2$. The reward $R = +1$ if the robot and person are at the same position (central grid cell), $R = 0$ if the person is one cell away from the robot, and $R = -1$ if the person is two cells away. The task terminates if the person reaches a distance of 3 cells away from the robot, which also incurs a reward of -20. We use a discount factor of 0.9.

When formulating the BAPOMDP, the robot's motion model (deterministic), the observation probabilities and the rewards are assumed to be known. We maintain a separate count vector for each person, representing the number of times they move in each direction, i.e. $\phi^1 = (\phi^1_{NA}, \phi^1_N, \phi^1_E, \phi^1_S, \phi^1_W)$ and $\phi^2 = (\phi^2_{NA}, \phi^2_N, \phi^2_E, \phi^2_S, \phi^2_W)$. We assume a prior $\phi^1_0 = (2, 3, 1, 2, 2)$ for person 1 and $\phi^2_0 = (2, 1, 3, 2, 2)$ for person 2, while in reality person 1 moves with probabilities $\Pr = (0.3, 0.4, 0.2, 0.05, 0.05)$ and person 2 with $\Pr = (0.1, 0.05, 0.8, 0.03, 0.02)$. We run 200 simulations, each consisting of 100 episodes (of at most 10 time steps). The count vectors' distributions are reset after every simulation, and the target person is reset after every episode.
We use a 2-step lookahead search for planning in the BAPOMDP.

Figures 4 and 5 show how the average return and model accuracy evolve over the 100 episodes (averaged over the 200 simulations) with different belief approximations. Figure 6 compares the planning time taken by each approach. We observe from these figures that the results for the Weighted Distance approximation are much better in terms of both return and model accuracy, even with fewer particles (16). Monte Carlo fails to provide any improvement over the prior model, which indicates that it would require many more particles. Running Weighted Distance with 16 particles requires less time than either Monte Carlo or Most Probable with 64 particles, showing that it can be more time-efficient for the performance it provides in complex environments.

[Figures omitted. Figure 4: Return with different belief approximations (with exact- and prior-model baselines). Figure 5: Model accuracy (WL1) with different belief approximations. Figure 6: Planning time per action, in ms, with different belief approximations.]

6 Conclusion

The objective of this paper was to propose a preliminary decision-theoretic framework for learning and acting in POMDPs under parameter uncertainty.
This raises a number of interesting challenges, including (1) defining the appropriate model for POMDP parameter uncertainty, (2) approximating this model while maintaining performance guarantees, (3) performing tractable belief updating, and (4) planning action sequences which optimally trade off exploration and exploitation.

We proposed a new model, the Bayes-Adaptive POMDP, and showed that it can be approximated to ε-precision by a finite POMDP. We provided practical approaches for belief tracking and online planning in this model, and validated these using two experimental domains. Results on the Follow problem showed that our approach is able to learn the motion patterns of two (simulated) individuals. This suggests interesting applications in human-robot interaction, where it is often essential to be able to reason and plan under parameter uncertainty.

Acknowledgments

This research was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Fonds Québécois de la Recherche sur la Nature et les Technologies (FQRNT).

References
[1] R. Dearden, N. Friedman, and N. Andre. Model based Bayesian exploration. In UAI, 1999.
[2] M. Duff. Optimal Learning: Computational Procedures for Bayes-Adaptive Markov Decision Processes. PhD thesis, University of Massachusetts, Amherst, USA, 2002.
[3] P. Poupart, N. Vlassis, J. Hoey, and K. Regan. An analytic solution to discrete Bayesian reinforcement learning. In ICML, 2006.
[4] R. Jaulmes, J. Pineau, and D. Precup. Active learning in partially observable Markov decision processes. In ECML, 2005.
[5] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99–134, 1998.
[6] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: an anytime algorithm for POMDPs.
In IJCAI, pages 1025–1032, Acapulco, Mexico, 2003.
[7] M. Spaan and N. Vlassis. Perseus: randomized point-based value iteration for POMDPs. JAIR, 24:195–220, 2005.
[8] T. Smith and R. Simmons. Heuristic search value iteration for POMDPs. In UAI, Banff, Canada, 2004.
[9] S. Paquet, L. Tobin, and B. Chaib-draa. An online POMDP algorithm for complex multiagent environments. In AAMAS, 2005.
[10] J. Baxter and P. L. Bartlett. Infinite-horizon policy-gradient estimation. JAIR, 15:319–350, 2001.
[11] S. Ross, B. Chaib-draa, and J. Pineau. Bayes-adaptive POMDPs. Technical Report SOCS-TR-2007.6, McGill University, 2007.
[12] A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Springer, 2001.