{"title": "Theoretical Analysis of Heuristic Search Methods for Online POMDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 1233, "page_last": 1240, "abstract": "Planning in partially observable environments remains a challenging problem, despite significant recent advances in offline approximation techniques. A few online methods have also been proposed recently, and proven to be remarkably scalable, but without the theoretical guarantees of their offline counterparts. Thus it seems natural to try to unify offline and online techniques, preserving the theoretical properties of the former, and exploiting the scalability of the latter. In this paper, we provide theoretical guarantees on an anytime algorithm for POMDPs which aims to reduce the error made by approximate offline value iteration algorithms through the use of an efficient online searching procedure. The algorithm uses search heuristics based on an error analysis of lookahead search, to guide the online search towards reachable beliefs with the most potential to reduce error. We provide a general theorem showing that these search heuristics are admissible, and lead to complete and epsilon-optimal algorithms. This is, to the best of our knowledge, the strongest theoretical result available for online POMDP solution methods. 
We also provide empirical evidence showing that our approach is practical, and can find (provably) near-optimal solutions in reasonable time.", "full_text": "Theoretical Analysis of Heuristic Search Methods for Online POMDPs

Stéphane Ross
McGill University
Montréal, Qc, Canada
sross12@cs.mcgill.ca

Joelle Pineau
McGill University
Montréal, Qc, Canada
jpineau@cs.mcgill.ca

Brahim Chaib-draa
Laval University
Québec, Qc, Canada
chaib@ift.ulaval.ca

Abstract

Planning in partially observable environments remains a challenging problem, despite significant recent advances in offline approximation techniques. A few online methods have also been proposed recently, and proven to be remarkably scalable, but without the theoretical guarantees of their offline counterparts. Thus it seems natural to try to unify offline and online techniques, preserving the theoretical properties of the former, and exploiting the scalability of the latter. In this paper, we provide theoretical guarantees on an anytime algorithm for POMDPs which aims to reduce the error made by approximate offline value iteration algorithms through the use of an efficient online search procedure. The algorithm uses search heuristics based on an error analysis of lookahead search, to guide the online search towards reachable beliefs with the most potential to reduce error. We provide a general theorem showing that these search heuristics are admissible, and lead to complete and ε-optimal algorithms. This is, to the best of our knowledge, the strongest theoretical result available for online POMDP solution methods.
We also provide empirical evidence showing that our approach is practical, and can find (provably) near-optimal solutions in reasonable time.

1 Introduction

Partially Observable Markov Decision Processes (POMDPs) provide a powerful model for sequential decision making under state uncertainty. However, exact solutions are intractable in most domains featuring more than a few dozen actions and observations. Significant efforts have been devoted to developing approximate offline algorithms for larger POMDPs [1, 2, 3, 4]. Most of these methods compute a policy over the entire belief space. This is both an advantage and a liability. On the one hand, it allows good generalization to unseen beliefs, and this has been key to solving relatively large domains. Yet it makes these methods impractical for problems where the state space is too large to enumerate. A number of compression techniques have been proposed, which handle large state spaces by projecting into a sub-dimensional representation [5, 6]. Alternatively, online methods are also available [7, 8, 9, 10, 11]. These achieve scalability by planning only at execution time, thus allowing the agent to only consider belief states that can be reached over some (small) finite planning horizon. However, despite good empirical performance, both classes of approaches lack theoretical guarantees on the approximation. So it would seem we are constrained to either solving small to mid-size problems (near-)optimally, or solving large problems possibly badly.

This paper suggests otherwise, arguing that by combining offline and online techniques, we can preserve the theoretical properties of the former, while exploiting the scalability of the latter. In previous work [11], we introduced an anytime algorithm for POMDPs which aims to reduce the error made by approximate offline value iteration algorithms through the use of an efficient online search procedure.
The algorithm uses search heuristics based on an error analysis of lookahead search, to guide the online search towards reachable beliefs with the most potential to reduce error. In this paper, we formally derive the heuristics from our error minimization point of view and provide theoretical results showing that these search heuristics are admissible, and lead to complete and ε-optimal algorithms. This is, to the best of our knowledge, the strongest theoretical result available for online POMDP solution methods. Furthermore, the approach works well with factored state representations, thus further enhancing scalability, as suggested by earlier work [2]. We also provide empirical evidence showing that our approach is computationally practical, and can find (provably) near-optimal solutions within a smaller overall time than previous online methods.

2 Background: POMDP

A POMDP is defined by a tuple (S, A, Ω, T, R, O, γ) where S is the state space, A is the action set, Ω is the observation set, T : S × A × S → [0, 1] is the state-to-state transition function, R : S × A → ℝ is the reward function, O : Ω × A × S → [0, 1] is the observation function, and γ is the discount factor. In a POMDP, the agent often does not know the current state with full certainty, since observations provide only a partial indicator of state. To deal with this uncertainty, the agent maintains a belief state b(s), which expresses the probability that the agent is in each state at a given time step. After each step, the belief state b is updated using Bayes' rule.
We denote the belief update function b′ = τ(b, a, o), defined as b′(s′) = η O(o, a, s′) Σ_{s∈S} T(s, a, s′) b(s), where η is a normalization constant ensuring Σ_{s′∈S} b′(s′) = 1.

Solving a POMDP consists in finding an optimal policy, π* : ΔS → A, which specifies the best action a to perform in every belief state b so as to maximize the expected return (i.e., the expected sum of discounted rewards over the planning horizon) of the agent. We can find the optimal policy by computing the optimal value of a belief state over the planning horizon. For the infinite horizon, the optimal value function is defined as V*(b) = max_{a∈A} [R(b, a) + γ Σ_{o∈Ω} P(o|b, a) V*(τ(b, a, o))], where R(b, a) represents the expected immediate reward of doing action a in belief state b and P(o|b, a) is the probability of observing o after doing action a in belief state b. This probability can be computed according to P(o|b, a) = Σ_{s′∈S} O(o, a, s′) Σ_{s∈S} T(s, a, s′) b(s). We also denote the value Q*(b, a) of a particular action a in belief state b as the return we will obtain if we perform a in b and then follow the optimal policy: Q*(b, a) = R(b, a) + γ Σ_{o∈Ω} P(o|b, a) V*(τ(b, a, o)). Using this, we can define the optimal policy π*(b) = argmax_{a∈A} Q*(b, a).

While any POMDP problem has infinitely many belief states, it has been shown that the optimal value function of a finite-horizon POMDP is piecewise linear and convex. Thus we can define the optimal value function and policy of a finite-horizon POMDP using a finite set of |S|-dimensional hyperplanes, called α-vectors, over the belief state space.
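The belief update τ(b, a, o) and the observation probability P(o|b, a) defined above can be sketched in Python on dense arrays. This is a minimal illustration, not the paper's implementation (the factored representations mentioned later are more compact); the toy transition and observation numbers are hypothetical.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """tau(b, a, o): Bayes-rule belief update.

    T[a][s, s'] = P(s' | s, a); O[a][s', o] = P(o | a, s').
    Returns (b', P(o|b, a)), where b'(s') is the normalized posterior
    eta * O(o, a, s') * sum_s T(s, a, s') b(s).
    """
    unnormalized = O[a][:, o] * (b @ T[a])   # O(o,a,s') * sum_s T(s,a,s') b(s)
    p_o = unnormalized.sum()                 # P(o | b, a), the normalizer 1/eta
    if p_o == 0.0:
        raise ValueError("observation o has zero probability under (b, a)")
    return unnormalized / p_o, p_o

# Toy 2-state, 1-action, 2-observation POMDP (hypothetical numbers).
T = [np.array([[0.9, 0.1], [0.2, 0.8]])]
O = [np.array([[0.7, 0.3], [0.4, 0.6]])]
b = np.array([0.5, 0.5])
b_next, p_o = belief_update(b, a=0, o=0, T=T, O=O)
```

Observing o here shifts the posterior toward the state under which o is more likely, while the returned P(o|b, a) is exactly the quantity needed for the value backups below.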
As a result, exact offline value iteration algorithms are able to compute V* in a finite amount of time, but the complexity can be very high. Most approximate offline value iteration algorithms achieve computational tractability by selecting a small subset of belief states, and keeping only those α-vectors which are maximal at the selected belief states [1, 3, 4]. The precision of these algorithms depends on the number of belief points and their location in the space of beliefs.

3 Online Search in POMDPs

Contrary to offline approaches, which compute a complete policy determining an action for every belief state, an online algorithm takes as input the current belief state and returns the single action which is best for this particular belief state. The advantage of such an approach is that it only needs to consider belief states that are reachable from the current belief state. This naturally provides a small set of beliefs, which could be exploited as in offline methods. In addition, since online planning is done at every step (and thus generalization between beliefs is not required), it is sufficient to calculate only the maximal value for the current belief state, not the full optimal α-vector. A lookahead search algorithm can compute this value in two simple steps.

First, we build a tree of reachable belief states from the current belief state. The current belief is the top node in the tree. Subsequent belief states (as calculated by the τ(b, a, o) function) are represented using OR-nodes (at which we must choose an action), and actions are included in between each layer of belief nodes using AND-nodes (at which we must consider all possible observations). Note that in general the belief MDP could have a graph structure with cycles. Our algorithm simply handles such structures by unrolling the graph into a tree.
Hence, if we reach a belief that is already elsewhere in the tree, it will be duplicated.¹

Second, we estimate the value of the current belief state by propagating value estimates up from the fringe nodes, to their ancestors, all the way to the root. An approximate value function is generally used at the fringe of the tree to approximate the infinite-horizon value. We are particularly interested in the case where a lower bound and an upper bound on the value of the fringe belief states are available, as this allows us to get a bound on the error at any specific node. The lower and upper bounds can be propagated to parent nodes according to:

U_T(b) = U(b) if b is a leaf in T; U_T(b) = max_{a∈A} U_T(b, a) otherwise;  (1)
U_T(b, a) = R_B(b, a) + γ Σ_{o∈Ω} P(o|b, a) U_T(τ(b, a, o));  (2)
L_T(b) = L(b) if b is a leaf in T; L_T(b) = max_{a∈A} L_T(b, a) otherwise;  (3)
L_T(b, a) = R_B(b, a) + γ Σ_{o∈Ω} P(o|b, a) L_T(τ(b, a, o));  (4)

where U_T(b) and L_T(b) represent the upper and lower bounds on V*(b) associated to belief state b in the tree T, U_T(b, a) and L_T(b, a) represent corresponding bounds on Q*(b, a), and L(b) and U(b) are the bounds on fringe nodes, typically computed offline.

Performing a complete k-step lookahead search multiplies the error bound on the approximate value function used at the fringe by γ^k [13], and thus ensures better value estimates. However, it has complexity exponential in k, and may explore belief states that have very small probabilities of occurring (and an equally small impact on the value function), as well as exploring suboptimal actions (which have no impact on the value function). We would evidently prefer a more efficient online algorithm which can guarantee equivalent or better error bounds.
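The bound propagation of Equations 1-4 amounts to a simple recursive evaluation over the AND-OR tree. A minimal sketch follows; the node fields (`children[a][o]` mapping an action and observation to a successor, `R` for R_B(b, a), `P_o` for P(o|b, a), `L0`/`U0` for the offline bounds) are illustrative names, not the paper's data structures.

```python
GAMMA = 0.95  # assumed discount factor for the sketch

class Node:
    def __init__(self, L0, U0, R=None, P_o=None, children=None):
        self.L0, self.U0 = L0, U0          # offline bounds L(b), U(b), used at the fringe
        self.R = R or {}                   # R[a] = R_B(b, a)
        self.P_o = P_o or {}               # P_o[a][o] = P(o | b, a)
        self.children = children or {}     # children[a][o] = successor node tau(b, a, o)

def bounds(node):
    """Return (L_T(b), U_T(b)) by propagating Equations 1-4 up from the fringe."""
    if not node.children:                  # leaf in T: fall back to offline bounds
        return node.L0, node.U0
    L_a, U_a = [], []
    for a, succ in node.children.items():
        child = {o: bounds(c) for o, c in succ.items()}
        L_a.append(node.R[a] + GAMMA * sum(node.P_o[a][o] * child[o][0] for o in succ))
        U_a.append(node.R[a] + GAMMA * sum(node.P_o[a][o] * child[o][1] for o in succ))
    return max(L_a), max(U_a)              # max over actions at OR-nodes
```

Since L(b) ≤ V*(b) ≤ U(b) at every fringe node and the backup is monotone, the propagated interval [L_T(b), U_T(b)] still brackets V*(b) at every interior node.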
In particular, we believe that the best way to achieve this is to have a search algorithm which uses estimates of error reduction as a criterion to guide the search over the reachable beliefs.

4 Anytime Error Minimization Search

In this section, we review the Anytime Error Minimization Search (AEMS) algorithm we first introduced in [11] and present a novel mathematical derivation of the heuristics that we had suggested. We also provide new theoretical results describing sufficient conditions under which the heuristics are guaranteed to yield ε-optimal solutions.

Our approach uses a best-first search of the belief reachability tree, where error minimization (at the root node) is used as the search criterion to select which fringe nodes to expand next. Thus we need a way to express the error on the current belief (i.e. root node) as a function of the error at the fringe nodes. This is provided in Theorem 1. Let us denote (i) F(T), the set of fringe nodes of a tree T; (ii) e_T(b) = V*(b) − L_T(b), the error function for node b in the tree T; (iii) e(b) = V*(b) − L(b), the error at a fringe node b ∈ F(T); (iv) h_T^{b0,b}, the unique action/observation sequence that leads from the root b0 to belief b in tree T; (v) d(h), the depth of an action/observation sequence h (number of actions); and (vi) P(h|b0, π*) = Π_{i=1}^{d(h)} P(h_o^i | b_{h^{i−1}}, h_a^i) π*(b_{h^{i−1}}, h_a^i), the probability of executing the action/observation sequence h if we follow the optimal policy π* from the root node b0 (where h_a^i and h_o^i refer to the i-th action and observation in the sequence h, b_{h^i} is the belief obtained after taking the first i actions and observations from belief b0, and π*(b, a) is the probability that the optimal policy chooses action a in belief b). By abuse of notation, we will use b to represent both a belief node in the tree and its associated belief.²

¹We are considering using a technique proposed in the LAO* algorithm [12] to handle cycles, but we have not investigated this fully, especially in terms of how it affects the heuristic values presented below.
²e.g. Σ_{b∈F(T)} should be interpreted as a sum over all fringe nodes in the tree, while e(b) is the error associated to the belief in fringe node b.

Theorem 1. In any tree T, e_T(b0) ≤ Σ_{b∈F(T)} γ^{d(h_T^{b0,b})} P(h_T^{b0,b}|b0, π*) e(b).

Proof. Consider an arbitrary parent node b in tree T and let's denote â_T^b = argmax_{a∈A} L_T(b, a). We have e_T(b) = V*(b) − L_T(b). If â_T^b = π*(b), then e_T(b) = γ Σ_{o∈Ω} P(o|b, π*(b)) e_T(τ(b, π*(b), o)). On the other hand, when â_T^b ≠ π*(b), then we know that L_T(b, π*(b)) ≤ L_T(b, â_T^b) and therefore e_T(b) ≤ γ Σ_{o∈Ω} P(o|b, π*(b)) e_T(τ(b, π*(b), o)). Consequently, we have the following:

e_T(b) ≤ e(b) if b ∈ F(T); e_T(b) ≤ γ Σ_{o∈Ω} P(o|b, π*(b)) e_T(τ(b, π*(b), o)) otherwise.

Then e_T(b0) ≤ Σ_{b∈F(T)} γ^{d(h_T^{b0,b})} P(h_T^{b0,b}|b0, π*) e(b) can be easily shown by induction.

4.1 Search Heuristics

From Theorem 1, we see that the contribution of each fringe node to the error in b0 is simply the term γ^{d(h_T^{b0,b})} P(h_T^{b0,b}|b0, π*) e(b).
Consequently, if we want to minimize e_T(b0) as quickly as possible, we should expand the fringe nodes reached by the optimal policy π* that maximize the term γ^{d(h_T^{b0,b})} P(h_T^{b0,b}|b0, π*) e(b), as they offer the greatest potential to reduce e_T(b0). This suggests a sound heuristic to explore the tree in a best-first-search way. Unfortunately, we do not know V* nor π*, which are required to compute the terms e(b) and P(h_T^{b0,b}|b0, π*); nevertheless, we can approximate them. First, the term e(b) can be estimated by the difference between the lower and upper bound. We define ê(b) = U(b) − L(b) as an estimate of the error introduced by our bounds at fringe node b. Clearly, ê(b) ≥ e(b) since U(b) ≥ V*(b).

To approximate P(h_T^{b0,b}|b0, π*), we can view the term π*(b, a) as the probability that action a is optimal in belief b. Thus, we consider an approximate policy π̂_T that represents the probability that action a is optimal in belief state b given the bounds L_T(b, a) and U_T(b, a) that we have on Q*(b, a) in tree T. More precisely, to compute π̂_T(b, a), we consider Q*(b, a) as a random variable and make some assumptions about its underlying probability distribution. Once cumulative distribution functions F_T^{b,a}, s.t. F_T^{b,a}(x) = P(Q*(b, a) ≤ x), and their associated density functions f_T^{b,a} are determined for each (b, a) in tree T, we can compute the probability π̂_T(b, a) = P(Q*(b, a′) ≤ Q*(b, a) ∀a′ ≠ a) = ∫_{−∞}^{∞} f_T^{b,a}(x) Π_{a′≠a} F_T^{b,a′}(x) dx. Computing this integral may not be computationally efficient depending on how we define the functions f_T^{b,a}. We consider two approximations.

One possible approximation is to simply compute the probability that the Q-value of a given action is higher than its parent belief state's value (instead of all other actions' Q-values). In this case, we get π̂_T(b, a) = ∫_{−∞}^{∞} f_T^{b,a}(x) F_T^b(x) dx, where F_T^b is the cumulative distribution function for V*(b), given bounds L_T(b) and U_T(b) in tree T. Hence by considering both Q*(b, a) and V*(b) as random variables with uniform distributions between their respective lower and upper bounds, we get:

π̂_T(b, a) = η (U_T(b, a) − L_T(b))² / (U_T(b, a) − L_T(b, a)) if U_T(b, a) > L_T(b); 0 otherwise;  (5)

where η is a normalization constant such that Σ_{a∈A} π̂_T(b, a) = 1. Notice that if the density function is 0 outside the interval between the lower and upper bound, then π̂_T(b, a) = 0 for dominated actions; thus they are implicitly pruned from the search tree by this method.

A second practical approximation is:

π̂_T(b, a) = 1 if a = argmax_{a′∈A} U_T(b, a′); 0 otherwise;  (6)

which simply selects the action that maximizes the upper bound. This restricts exploration of the search tree to those fringe nodes that are reached by sequences of actions that maximize the upper bound of their parent belief state, as done in the AO* algorithm [14].
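The two approximations of π̂_T can be written down directly from Equations 5 and 6. In this sketch, `U_a` and `L_a` are dicts holding U_T(b, a) and L_T(b, a) for the actions of a single node; the function names are illustrative, not from the paper.

```python
def pi_hat_aems1(U_a, L_a):
    """Equation 5: uniform-distribution estimate of P(a is optimal in b)."""
    L_b = max(L_a.values())                 # L_T(b) = max_a L_T(b, a)
    w = {a: (U_a[a] - L_b) ** 2 / (U_a[a] - L_a[a]) if U_a[a] > L_b else 0.0
         for a in U_a}                      # dominated actions (U_T(b,a) <= L_T(b)) get 0
    z = sum(w.values())                     # normalization constant 1/eta
    return {a: w[a] / z for a in w} if z > 0 else w

def pi_hat_aems2(U_a):
    """Equation 6: all probability mass on the action maximizing the upper bound."""
    best = max(U_a, key=U_a.get)
    return {a: 1.0 if a == best else 0.0 for a in U_a}
```

Note how the guard `U_a[a] > L_b` implements the implicit pruning described above: an action whose upper bound does not exceed the node's lower bound receives probability 0 and its subtree is never explored.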
The nice property of this approximation is that these fringe nodes are the only nodes that can potentially reduce the upper bound in b0.

Using either of these two approximations for π̂_T, we can estimate the error contribution ê_T(b0, b) of a fringe node b on the value of root belief b0 in tree T as ê_T(b0, b) = γ^{d(h_T^{b0,b})} P(h_T^{b0,b}|b0, π̂_T) ê(b). Using this as a heuristic, the next fringe node b̃(T) to expand in tree T is defined as b̃(T) = argmax_{b∈F(T)} γ^{d(h_T^{b0,b})} P(h_T^{b0,b}|b0, π̂_T) ê(b). We use AEMS1³ to denote the heuristic that uses π̂_T as defined in Equation 5, and AEMS2⁴ to denote the heuristic that uses π̂_T as defined in Equation 6.

4.2 Algorithm

Algorithm 1 presents the anytime error minimization search. Since the objective is to provide a near-optimal action within a finite allowed online planning time, the algorithm accepts two input parameters: t, the online search time allowed per action, and ε, the desired precision on the value function.

Algorithm 1 AEMS: Anytime Error Minimization Search

Function SEARCH(t, ε)
Static: T : an AND-OR tree representing the current search tree.
t0 ← TIME()
while TIME() − t0 ≤ t and not SOLVED(ROOT(T), ε) do
    b* ← b̃(T)
    EXPAND(b*)
    UPDATEANCESTORS(b*)
end while
return argmax_{a∈A} L_T(ROOT(T), a)

The EXPAND function expands the tree one level under the node b* by adding the next action and belief nodes to the tree T and computing their lower and upper bounds according to Equations 1-4. After a node is expanded, the UPDATEANCESTORS function simply recomputes the bounds of its ancestors according to the equations determining b′(s′), V*(b), P(o|b, a) and Q*(b, a), as outlined in Section 2.
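The SEARCH loop of Algorithm 1 can be sketched as follows. The callables `expand`, `update_ancestors` and `best_fringe_node`, and the scalar bound fields on the root, are hypothetical stand-ins for the operations described above, so that the control flow is isolated from the tree bookkeeping.

```python
import time

def aems_search(root, t, eps, expand, update_ancestors, best_fringe_node):
    """Anytime Error Minimization Search (Algorithm 1), schematically.

    Repeatedly expands the fringe node with the largest estimated error
    contribution until the time budget t is spent or the root's bound gap
    is at most eps (the SOLVED test).
    """
    t0 = time.monotonic()
    while time.monotonic() - t0 <= t and root.U - root.L > eps:
        b_star = best_fringe_node(root)     # argmax of the AEMS heuristic
        expand(b_star)                      # grow tree one level, bound children
        update_ancestors(b_star)            # recompute bounds up to the root
    return max(root.L_a, key=root.L_a.get)  # action maximizing the lower bound
```

Returning the action with the best lower bound (rather than the best upper bound) is what makes the anytime behavior safe: the returned action's value is guaranteed to be at least L_T(b0, a) whenever the search is cut off early.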
It also recomputes the probabilities π̂_T(b, a) and the best actions for each ancestor node. To quickly find the node that maximizes the heuristic in the whole tree, each node in the tree contains a reference to the best node to expand in its subtree. These references are updated by the UPDATEANCESTORS function without adding more complexity, such that when this function terminates, we always know immediately which node to expand next, as its reference is stored in the root node. The search terminates whenever there is no more time available, or we have found an ε-optimal solution (verified by the SOLVED function). After an action is executed in the environment, the tree T is updated such that our new current belief state becomes the root of T; all nodes under this new root can be reused at the next time step.

4.3 Completeness and Optimality

We now provide some sufficient conditions under which our heuristic search is guaranteed to converge to an ε-optimal policy after a finite number of expansions. We show that the heuristics proposed in Section 4.1 satisfy those conditions, and therefore are admissible. Before we present the main theorems, we provide some useful preliminary lemmas.

Lemma 1. In any tree T, the approximate error contribution ê_T(b0, b_d) of a belief node b_d at depth d is bounded by ê_T(b0, b_d) ≤ γ^d sup_b ê(b).

Proof. P(h_T^{b0,b}|b0, π̂_T) ≤ 1 and ê(b) ≤ sup_{b′} ê(b′) for all b.
Thus ê_T(b0, b_d) ≤ γ^d sup_b ê(b).

For the following lemma and theorem, we will denote P(h_o|b0, h_a) = Π_{i=1}^{d(h)} P(h_o^i | b_{h^{i−1}}, h_a^i) the probability of observing the sequence of observations h_o in some action/observation sequence h, given that the sequence of actions h_a in h is performed from current belief b0, and F̂(T) ⊆ F(T) the set of all fringe nodes in T such that P(h_T^{b0,b}|b0, π̂_T) > 0, for π̂_T defined as in Equation 6 (i.e., the set of fringe nodes reached by a sequence of actions in which each action maximizes U_T(b, a) in its respective belief state).

³This heuristic is slightly different from the AEMS1 heuristic we had introduced in [11].
⁴This is the same as the AEMS2 heuristic we had introduced in [11].

Lemma 2. For any tree T, ε > 0, and D such that γ^D sup_b ê(b) ≤ ε, if for all b ∈ F̂(T), either d(h_T^{b0,b}) ≥ D or there exists an ancestor b′ of b such that ê_T(b′) ≤ ε, then ê_T(b0) ≤ ε.

Proof. Let's denote â_T^b = argmax_{a∈A} U_T(b, a). Notice that for any tree T, and parent belief b ∈ T, ê_T(b) = U_T(b) − L_T(b) ≤ U_T(b, â_T^b) − L_T(b, â_T^b) = γ Σ_{o∈Ω} P(o|b, â_T^b) ê_T(τ(b, â_T^b, o)). Consequently, the following recurrence is an upper bound on ê_T(b):

ê_T(b) ≤ ê(b) if b ∈ F(T); ê_T(b) ≤ ε if ê_T(b) ≤ ε; ê_T(b) ≤ γ Σ_{o∈Ω} P(o|b, â_T^b) ê_T(τ(b, â_T^b, o)) otherwise.

By unfolding the recurrence for b0, we get ê_T(b0) ≤ Σ_{b∈A(T)} γ^{d(h_T^{b0,b})} P(h_{T,o}^{b0,b}|b0, h_{T,a}^{b0,b}) ê(b) + ε Σ_{b′∈B(T)} γ^{d(h_T^{b0,b′})} P(h_{T,o}^{b0,b′}|b0, h_{T,a}^{b0,b′}), where B(T) is the set of parent nodes b′ having a descendant in F̂(T) such that ê_T(b′) ≤ ε, and A(T) is the set of fringe nodes b″ in F̂(T) not having an ancestor in B(T). Hence if for all b ∈ F̂(T), d(h_T^{b0,b}) ≥ D or there exists an ancestor b′ of b such that ê_T(b′) ≤ ε, then this means that for all b in A(T), d(h_T^{b0,b}) ≥ D, and therefore ê_T(b0) ≤ γ^D sup_b ê(b) Σ_{b′∈A(T)} P(h_{T,o}^{b0,b′}|b0, h_{T,a}^{b0,b′}) + ε Σ_{b′∈B(T)} P(h_{T,o}^{b0,b′}|b0, h_{T,a}^{b0,b′}) ≤ ε Σ_{b′∈A(T)∪B(T)} P(h_{T,o}^{b0,b′}|b0, h_{T,a}^{b0,b′}) ≤ ε.

Theorem 2. For any tree T and ε > 0, if π̂_T is defined such that inf_{b,T : ê_T(b)>ε} π̂_T(b, â_T^b) > 0 for â_T^b = argmax_{a∈A} U_T(b, a), then Algorithm 1 using b̃(T) is complete and ε-optimal.

Proof. If γ = 0, then the proof is immediate. Consider now the case where γ ∈ (0, 1). Clearly, since U is bounded above and L is bounded below, then ê is bounded above. Now using γ ∈ (0, 1), we can find a positive integer D such that γ^D sup_b ê(b) ≤ ε. Let's denote A_b^T the set of ancestor belief states of b in the tree T, and given a finite set A of belief nodes, let's define ê_T^min(A) = min_{b∈A} ê_T(b). Now let's define T_b = {T | T finite, b ∈ F̂(T), ê_T^min(A_b^T) > ε} and B = {b | ê(b) inf_{T∈T_b} P(h_T^{b0,b}|b0, π̂_T) > 0, d(h_T^{b0,b}) ≤ D}. Clearly, by the assumption that inf_{b,T : ê_T(b)>ε} π̂_T(b, â_T^b) > 0, B contains all belief states b within depth D such that ê(b) > 0, P(h_{T,o}^{b0,b}|b0, h_{T,a}^{b0,b}) > 0 and there exists a finite tree T where b ∈ F̂(T) and all ancestors b′ of b have ê_T(b′) > ε. Furthermore, B is finite since there are only finitely many belief states within depth D.
Hence there exists E_min = min_{b∈B} γ^{d(h_T^{b0,b})} ê(b) inf_{T∈T_b} P(h_T^{b0,b}|b0, π̂_T). Clearly, E_min > 0 and we know that for any tree T, all beliefs b in B ∩ F̂(T) have an approximate error contribution ê_T(b0, b) ≥ E_min. Since E_min > 0 and γ ∈ (0, 1), there exists a positive integer D′ such that γ^{D′} sup_b ê(b) < E_min. Hence by Lemma 1, this means that Algorithm 1 cannot expand any node at depth D′ or beyond before reaching a tree T where B ∩ F̂(T) = ∅. Because there are only finitely many nodes within depth D′, it is clear that Algorithm 1 will reach such a tree T after a finite number of expansions. Furthermore, for this tree T, since B ∩ F̂(T) = ∅, we have that for all beliefs b ∈ F̂(T), either d(h_T^{b0,b}) ≥ D or ê_T^min(A_b^T) ≤ ε. Hence by Lemma 2, this implies that ê_T(b0) ≤ ε, and consequently Algorithm 1 will terminate after a finite number of expansions (SOLVED(b0, ε) will evaluate to true) with an ε-optimal solution (since e_T(b0) ≤ ê_T(b0)).

From this last theorem, we notice that we can potentially develop many different admissible heuristics for Algorithm 1; the main sufficient condition being that π̂_T(b, a) > 0 for a = argmax_{a′∈A} U_T(b, a′). It also follows from this theorem that the two heuristics described above, AEMS1 and AEMS2, are admissible. The following corollaries prove this:

Corollary 1. Algorithm 1, using b̃(T), with π̂_T as defined in Equation 6, is complete and ε-optimal.

Proof. Immediate by Theorem 2 and the fact that π̂_T(b, â_T^b) = 1 for all b, T.

Corollary 2. Algorithm 1, using b̃(T), with π̂_T as defined in Equation 5, is complete and ε-optimal.

Proof. We first notice that (U_T(b, a) − L_T(b))²/(U_T(b, a) − L_T(b, a)) ≤ ê_T(b, a), since L_T(b) ≥ L_T(b, a) for all a. Furthermore, ê_T(b, a) ≤ sup_{b′} ê(b′). Therefore the normalization constant η ≥ (|A| sup_b ê(b))⁻¹. For â_T^b = argmax_{a∈A} U_T(b, a), we have U_T(b, â_T^b) = U_T(b), and therefore U_T(b, â_T^b) − L_T(b) = ê_T(b). Hence π̂_T(b, â_T^b) = η (ê_T(b))²/ê_T(b, â_T^b) ≥ (|A|(sup_{b′} ê(b′))²)⁻¹ (ê_T(b))² for all T, b. Hence, for any ε > 0, inf_{b,T : ê_T(b)>ε} π̂_T(b, â_T^b) ≥ (|A|(sup_b ê(b))²)⁻¹ ε² > 0. Hence the corollary follows from Theorem 2.

5 Experiments

In this section we present a brief experimental evaluation of Algorithm 1, showing that in addition to its useful theoretical properties, its empirical performance matches, and in some cases exceeds, that of other online approaches. The algorithm is evaluated in three large POMDP environments: Tag [1], RockSample [3] and FieldVisionRockSample (FVRS) [11]; all are implemented using a factored state representation. In each environment we compute the Blind policy⁵ to get a lower bound and the FIB algorithm [15] to get an upper bound. We then compare the performance of Algorithm 1 with both heuristics (AEMS1 and AEMS2) to the performance achieved by other online approaches (Satia [7], BI-POMDP [8], RTBSS [10]).
For all approaches we impose a real-time constraint of 1 sec/action, and measure the following metrics: average return, average error bound reduction⁶ (EBR), average lower bound improvement⁷ (LBI), number of belief nodes explored at each time step, percentage of belief nodes reused in the next time step, and the average online time per action (< 1s means the algorithm found an ε-optimal action)⁸. Satia, BI-POMDP, AEMS1 and AEMS2 were all implemented using the same algorithm, since they differ only in the choice of search heuristic used to guide the search. RTBSS served as a baseline for a complete k-step lookahead search using branch & bound pruning. All results were obtained on a Xeon 2.4 GHz with 4 GB of RAM, but the processes were limited to a maximum of 1 GB of RAM.

Table 1 shows the average value (over 1000+ runs) of the different statistics. As we can see from these results, AEMS2 provides the best average return, average error bound reduction and average lower bound improvement in all considered environments. The higher error bound reduction and lower bound improvement obtained by AEMS2 indicate that it can guarantee performance closer to the optimal. We can also observe that AEMS2 has the best average reuse percentage, which indicates that AEMS2 is able to guide the search toward the most probable nodes, and allows it to generally maintain a higher number of belief nodes in the tree. Notice that AEMS1 did not perform very well, except in FVRS[5,7]. This could be explained by the fact that our assumption that the values of the actions are uniformly distributed between the lower and upper bounds is not valid in the considered environments.

Finally, we also examined how fast the lower and upper bounds converge if we let the algorithm run up to 1000 seconds on the initial belief state. This gives an indication of which heuristic would be the best if we extended online planning time past 1 sec.
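The two bound-based metrics defined in footnotes 6 and 7 are direct functions of the root bounds before (L, U) and after (L_T, U_T) the search; a small helper with illustrative argument names:

```python
def error_bound_reduction(L0, U0, L_T, U_T):
    """EBR = 1 - (U_T(b0) - L_T(b0)) / (U(b0) - L(b0)): fraction of the
    initial bound gap eliminated by the online search."""
    return 1.0 - (U_T - L_T) / (U0 - L0)

def lower_bound_improvement(L0, L_T):
    """LBI = L_T(b0) - L(b0): how much the search raised the lower bound."""
    return L_T - L0
```

For example, a search that tightens an initial gap of [0, 10] down to [4, 6] achieves an EBR of 0.8 and an LBI of 4.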
Results for RockSample[7,8] are presented in Figure 2, showing that the bounds converge much more quickly for the AEMS2 heuristic.

6 Conclusion

In this paper we examined theoretical properties of online heuristic search algorithms for POMDPs. To this end, we described a general online search framework, and examined two admissible heuristics to guide the search. The first assumes that Q*(b, a) is distributed uniformly at random between the bounds (heuristic AEMS1); the second favors an optimistic point of view, and assumes that Q*(b, a) is equal to the upper bound (heuristic AEMS2). We provided a general theorem showing that AEMS1 and AEMS2 are admissible and lead to complete and ε-optimal algorithms. Our experimental work supports the theoretical analysis, showing that AEMS2 is able to outperform other online approaches. Yet it is equally interesting to note that AEMS1 did not perform nearly as well. This highlights the fact that not all admissible heuristics are equally useful.
Thus it will be interesting in the future to develop further guidelines and theoretical results describing which subclasses of heuristics are most appropriate.

5 The policy obtained by taking the combination of the |A| α-vectors that each represents the value of a policy performing the same action in every belief state.
6 The error bound reduction is defined as 1 − (U_T(b0) − L_T(b0)) / (U(b0) − L(b0)), when the search process terminates on b0.
7 The lower bound improvement is defined as L_T(b0) − L(b0), when the search process terminates on b0.
8 For RTBSS, the maximum search depth under the 1 sec time constraint is shown in parentheses.

Table 1: Comparison of the different online search algorithms in the different environments.

Heuristic / Algorithm | Return ±0.01 | EBR (%) ±0.1 | LBI ±0.01 | Belief Nodes ±1 | Reuse (%) ±0.1 | Time (ms)

Tag (|S| = 870, |A| = 5, |Ω| = 30)
RTBSS(5)      | -10.30 | 22.3 | 3.03 | 45067 |  0   | 580
Satia & Lave  |  -8.35 | 22.9 | 2.47 | 36908 | 10.0 | 856
AEMS1         |  -6.73 | 49.0 | 3.92 | 43693 | 25.1 | 814
BI-POMDP      |  -6.22 | 76.2 | 7.81 | 79508 | 54.6 | 622
AEMS2         |  -6.19 | 76.3 | 7.81 | 80250 | 54.8 | 623

RockSample[7,8] (|S| = 12545, |A| = 13, |Ω| = 2)
Satia & Lave  |   7.35 |  3.6 | 0    |   509 |  8.9 | 900
AEMS1         |  10.30 |  9.5 | 0.90 |   579 |  5.3 | 916
RTBSS(2)      |  10.30 |  9.7 | 1.00 |   439 |  0   | 896
BI-POMDP      |  18.43 | 33.3 | 4.33 |  2152 | 29.9 | 953
AEMS2         |  20.75 | 52.4 | 5.30 |  3145 | 36.4 | 859

FVRS[5,7] (|S| = 3201, |A| = 5, |Ω| = 128)
RTBSS(1)      |  20.57 |  7.7 | 2.07 |   516 |  0   | 254
BI-POMDP      |  22.75 | 11.1 | 2.08 |  4457 |  0.4 | 923
Satia & Lave  |  22.79 | 11.1 | 2.05 |  3683 |  0.4 | 947
AEMS1         |  23.31 | 12.4 | 2.24 |  3856 |  1.4 | 942
AEMS2         |  23.39 | 13.3 | 2.35 |  4070 |  1.6 | 944

Figure 2: Evolution of the upper /
lower bounds on the initial belief state in RockSample[7,8].

Acknowledgments

This research was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Fonds Québécois de la Recherche sur la Nature et les Technologies (FQRNT).

References

[1] J. Pineau. Tractable planning under uncertainty: exploiting structure. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, 2004.
[2] P. Poupart. Exploiting structure to efficiently solve large scale partially observable Markov decision processes. PhD thesis, University of Toronto, 2005.
[3] T. Smith and R. Simmons. Point-based POMDP algorithms: improved analysis and implementation. In UAI, 2005.
[4] M. T. J. Spaan and N. Vlassis. Perseus: randomized point-based value iteration for POMDPs. JAIR, 24:195–220, 2005.
[5] N. Roy and G. Gordon. Exponential family PCA for belief compression in POMDPs. In NIPS, 2003.
[6] P. Poupart and C. Boutilier. Value-directed compression of POMDPs. In NIPS, 2003.
[7] J. K. Satia and R. E. Lave. Markovian decision processes with probabilistic observation of states. Management Science, 20(1):1–13, 1973.
[8] R. Washington. BI-POMDP: bounded, incremental partially observable Markov model planning. In 4th Eur. Conf. on Planning, pages 440–451, 1997.
[9] D. McAllester and S. Singh. Approximate planning for factored POMDPs using belief state simplification. In UAI, 1999.
[10] S. Paquet, L. Tobin, and B. Chaib-draa. An online POMDP algorithm for complex multiagent environments. In AAMAS, 2005.
[11] S. Ross and B. Chaib-draa. AEMS: an anytime online search algorithm for approximate policy refinement in large POMDPs. In IJCAI, 2007.
[12] E. A. Hansen and S. Zilberstein. LAO*: a heuristic search algorithm that finds solutions with loops. Artificial Intelligence, 129(1-2):35–62, 2001.
[13] M. L. Puterman.
Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1994.
[14] N. J. Nilsson. Principles of Artificial Intelligence. Tioga Publishing, 1980.
[15] M. Hauskrecht. Value-function approximations for POMDPs. JAIR, 13:33–94, 2000.