{"title": "What makes some POMDP problems easy to approximate?", "book": "Advances in Neural Information Processing Systems", "page_first": 689, "page_last": 696, "abstract": "", "full_text": "What Makes Some POMDP Problems Easy to Approximate?\n\nDavid Hsu\u2217\n\nWee Sun Lee\u2217\n\n\u2217Department of Computer Science\nNational University of Singapore\nSingapore, 117590, Singapore\n\nNan Rong\u2020\n\n\u2020Department of Computer Science\n\nCornell University\n\nIthaca, NY 14853, USA\n\nAbstract\n\nPoint-based algorithms have been surprisingly successful in computing approx-\nimately optimal solutions for partially observable Markov decision processes\n(POMDPs) in high dimensional belief spaces. In this work, we seek to understand\nthe belief-space properties that allow some POMDP problems to be approximated\nef\ufb01ciently and thus help to explain the point-based algorithms\u2019 success often ob-\nserved in the experiments. We show that an approximately optimal POMDP so-\nlution can be computed in time polynomial in the covering number of a reachable\nbelief space, which is the subset of the belief space reachable from a given belief\npoint. We also show that under the weaker condition of having a small covering\nnumber for an optimal reachable space, which is the subset of the belief space\nreachable under an optimal policy, computing an approximately optimal solution\nis NP-hard. However, given a suitable set of points that \u201ccover\u201d an optimal reach-\nable space well, an approximate solution can be computed in polynomial time.\nThe covering number highlights several interesting properties that reduce the com-\nplexity of POMDP planning in practice, e.g., fully observed state variables, beliefs\nwith sparse support, smooth beliefs, and circulant state-transition matrices.\n\n1 Introduction\nComputing an optimal policy for a partially observable Markov decision process (POMDP) is an\nintractable problem [10, 9]. 
Intuitively, the intractability is due to the “curse of dimensionality”: the belief space B used in solving a POMDP typically has dimensionality equal to |S|, the number of states in the POMDP, and therefore the size of B grows exponentially with |S|. As a result, the number of states is often used in practice as an important measure of the complexity of POMDP planning. However, in recent years, point-based POMDP algorithms have made impressive progress in computing approximate solutions by sampling the belief space: POMDPs with hundreds of states have been solved in a matter of seconds [14, 4]. It seems surprising that even an approximate solution can be obtained in seconds in a space of hundreds of dimensions. Thus, we would like to investigate why these point-based algorithms work well, whether there are sub-classes of POMDPs that are computationally easier, and whether there are alternative measures that better capture the complexity of POMDP planning for point-based algorithms.\nOur work is motivated by a benchmark problem called Tag [11], in which a robot needs to search for and tag a moving target that tends to move away from it. The environment is modeled as a grid. The robot's position is fully observable. The target's position is not observable, i.e., unknown to the robot, unless the target is in the same grid position as the robot. The joint state of the robot and target positions is thus only partially observable. The problem has 870 states in total, resulting in a belief space of 870 dimensions. Tag was introduced in the work on Point-Based Value Iteration (PBVI) [11], one of the first point-based POMDP algorithms. At the time, it was among the largest POMDP problems ever attempted and was considered a challenge for fast, scalable POMDP algorithms [11]. 
Surprisingly, only two years later, another point-based algorithm [14] computed an approximate solution to Tag, a problem with an 870-dimensional belief space, in less than a minute!\nOne important feature underlying the success of many point-based algorithms is that they explore only a subset R(b0) ⊆ B, usually called the reachable space from b0. The reachable space R(b0) contains all points reachable from a given initial belief point b0 ∈ B under arbitrary sequences of actions and observations. One may then speculate that the reason for point-based algorithms' good performance on Tag is that its reachable space R(b0) has much lower dimensionality than B. This is, however, not true. By checking the dimensionality of a large set of points sampled from R(b0), we have found that the dimensionality of R(b0) is at least 860 and thus almost as large as that of B.\nIn this paper, we propose the covering number as an alternative measure of the complexity of POMDP planning (Section 4). Intuitively, the covering number of a space is the minimum number of balls of a given size needed to cover the space fully. We show that an approximately optimal POMDP solution can be computed in time polynomial in the covering number of R(b0). The covering number also reveals that the belief space for Tag behaves more like a union of 29-dimensional subspaces than an 870-dimensional space, because the robot's position is fully observed. 
Therefore, Tag is probably not as hard as it was once thought to be, and the covering number captures the complexity of the Tag problem better than the dimensionality of the belief space (the number of states) or the dimensionality of the reachable space.\nWe further ask whether it is possible to compute an approximate solution efficiently under the weaker condition of having a small covering number for an optimal reachable space R∗(b0), which contains only points in B reachable from b0 under an optimal policy. Unfortunately, we can show that this problem is NP-hard. The problem remains NP-hard even if the optimal policies have a compact piecewise-linear representation using α-vectors. However, we can also show that given a suitable set of points that “cover” R∗(b0) well, a good approximate solution can be computed in polynomial time. Together, the negative and the positive results indicate that using sampling to approximate an optimal reachable space, and not just the reachable space, may be a promising approach in practice. We have already obtained initial experimental evidence that supports this idea: through careful sampling and pruning, our new point-based algorithm solved the Tag problem in less than 5 seconds [4].\nThe covering number highlights several properties that reduce the complexity of POMDP planning in practice, and it helps to quantify their effects (Section 5). Highly informative observations usually result in beliefs with sparse support and substantially reduce the covering number. For example, fully observed state variables reduce the covering number by a doubly exponential factor. Interestingly, smooth beliefs, usually a result of imperfect actions and uninformative observations, also reduce the covering number. 
In addition, state-transition matrices with special structures, such as circulant matrices [1], restrict the space of reachable beliefs and reduce the covering number correspondingly.\n\n2 Related Work\n\nPOMDPs provide a principled mathematical framework for planning and decision-making under uncertainty [13, 5], but they are notoriously hard to solve [10, 7, 9, 8]. It has been shown that finding an optimal policy over the entire belief space for a finite-horizon POMDP is PSPACE-complete [10] and that finding an optimal policy over an infinite horizon is undecidable [9].\nAs a result, there has been much work on computing approximate POMDP solutions [2], including a number of point-based POMDP algorithms [16, 11, 15, 14, 3]. Some point-based algorithms are able to compute reasonably good policies for very large POMDPs with hundreds of thousands of states. The success of these algorithms motivated us to try to understand why and when they work well. The approximation errors of some point-based algorithms have been analyzed [11, 14], but these analyses do not address the general question of when an approximately optimal policy can be computed in polynomial time. We provide both positive and negative results on the difficulty of computing approximate POMDP solutions. The proof techniques used for Theorems 1 and 2 are similar to those used for analyzing an approximation algorithm for large (fully observable) MDPs [6]. While the algorithm in [6] handles large state spaces well, it does not run in polynomial time; it appears that additional assumptions, such as those made in this paper, are required for polynomial-time results. Our hardness result is closely related to that for finite-horizon POMDPs [8], but we give a direct reduction from the Hamiltonian cycle problem.\n\n3 Preliminaries\n\nA POMDP models an agent taking a sequence of actions under uncertainty to maximize its total reward. 
Formally, it is specified as a tuple (S, A, O, T, Z, R, γ), where S is a set of discrete states, A is a finite set of actions, and O is a set of discrete observations. At each time step, the agent takes some action a ∈ A and moves from a start state s to an end state s′. The end state s′ is given by a state-transition function T(s, a, s′) = p(s′|s, a), which gives the probability that the agent lies in s′ after taking action a in state s. The agent then makes an observation to gather information on its current state. The outcome of observing o ∈ O is given by an observation function Z(s, a, o) = p(o|s, a) for s ∈ S and a ∈ A. The reward function R gives the agent a real-valued reward R(s, a) if it takes action a in state s, and the goal of the agent is to maximize its expected total reward by choosing a suitable sequence of actions. In this paper, we consider only infinite-horizon POMDPs with discounted reward. Thus, the expected total reward is given by E[Σ_{t=0}^∞ γ^t R(s_t, a_t)], where γ ∈ (0, 1) is a discount factor, and s_t and a_t denote the agent's state and action at time t.\nSince the agent's state is only partially observable, we rely on the concept of a belief, which is simply a probability distribution over S, represented discretely as a vector.\nA POMDP solution is a policy π that specifies the action π(b) for every belief b. Our goal is to find an optimal policy π∗ that maximizes the expected total reward. A policy π induces a value function V^π that specifies the value V^π(b) of every belief b under π. It is known that V∗, the value function associated with the optimal policy π∗, can be approximated arbitrarily closely by a convex, piecewise-linear function V(b) = max_{α∈Γ}(α · b), where Γ is a finite set of vectors called α-vectors.\nThe optimal value function V∗ satisfies the following Lipschitz condition:\nLemma 1 For any two belief points b and b′, if ||b − b′|| ≤ δ, then |V∗(b) − V∗(b′)| ≤ (Rmax/(1 − γ)) δ.¹\nThroughout this paper, we always use the l1 metric to measure the distance between belief points: for b, b′ ∈ R^d, ||b − b′|| = Σ_{i=1}^d |b_i − b′_i|. The Lipschitz condition bounds the change of a value function using the distance between belief points. It provides the basis for approximating the value at a belief point by the values of other belief points nearby.\nTo find an approximately optimal policy, point-based algorithms explore only the reachable belief space R(b0) from a given initial belief point b0. Strictly speaking, these algorithms compute only a policy over R(b0), rather than over the entire belief space B. We can view the exploration of R(b0) as searching a belief tree T_R rooted at b0 (Figure 1). The nodes of T_R correspond to beliefs in R(b0). The edges correspond to action-observation pairs. Suppose that a child node b′ is connected to its parent b by an edge (a, o). We can compute b′ using the formula b′(s′) = τ(b, a, o) = η Z(s′, a, o) Σ_s T(s, a, s′) b(s), where η is a normalizing constant. After obtaining enough belief points from R(b0), point-based algorithms perform backup operations over them to compute an approximately optimal value function.\n\nFigure 1: The belief tree rooted at b0.\n\n4 The Covering Number and the Complexity of POMDP Planning\nOur first goal is to show that if the covering number of a reachable space R(b0) is small, then an approximately optimal policy in R(b0) can be computed efficiently. 
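As a concrete illustration, the belief update τ(b, a, o) used to expand the belief tree can be sketched in a few lines of Python (a minimal sketch; the dictionary-of-matrices representation of T and Z is our own choice, not one prescribed by the paper):

```python
def belief_update(b, a, o, T, Z):
    """tau(b, a, o): one-step Bayesian belief update for a discrete POMDP.

    b is a list of state probabilities, T[a][s][s2] = p(s2 | s, a), and
    Z[a][s2][o] = p(o | s2, a).  Returns the normalized successor belief b'.
    """
    n = len(b)
    # Unnormalized update: b'(s') = Z(s', a, o) * sum_s T(s, a, s') b(s)
    b_next = [Z[a][s2][o] * sum(T[a][s][s2] * b[s] for s in range(n))
              for s2 in range(n)]
    eta = sum(b_next)  # normalizing constant (probability of observing o)
    if eta == 0.0:
        raise ValueError("observation has zero probability under (b, a)")
    return [p / eta for p in b_next]
```

For instance, with a single "stay" action and a noisy sensor, an observation concentrates the belief on the states most consistent with it; repeated updates trace out exactly the tree of reachable beliefs described above.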
We start with the definition of the covering number:\nDefinition 1 Given a metric space X, a δ-cover of a set B ⊆ X is a set of points C ⊆ X such that for every point b ∈ B, there is a point c ∈ C with ||b − c|| < δ. If all the points in C also lie in B, then we say that C is a proper cover of B. The δ-covering number of B, denoted by C(δ), is the size of the smallest δ-cover of B.\n\nIntuitively, the covering number is the minimum number of balls of radius δ needed to cover the set B. A closely related notion is that of the packing number:\nDefinition 2 Given a metric space X, a δ-packing of a set B ⊆ X is a set of points P ⊆ B such that for any two points p1, p2 ∈ P, ||p1 − p2|| ≥ δ. The δ-packing number of a set B, denoted by P(δ), is the size of the largest δ-packing of B.\n\n¹The proofs of this and other results are available as an appendix at http://motion.comp.nus.edu.sg/papers/nips07.pdf.\n\nFor any set B, the following relationship holds between the packing and covering numbers.\nLemma 2 C(δ) ≤ P(δ) ≤ C(δ/2).\nWe are now ready to state our first main result. It shows that for any point b0 ∈ B, if the covering number of R(b0) grows polynomially with the parameters of interest, then a good approximation of the value at b0 can be computed in polynomial time.\nTheorem 1 For any b0 ∈ B, let C(δ) be the δ-covering number of R(b0). Given any constant ε > 0, an approximation V(b0) of V∗(b0), with error |V∗(b0) − V(b0)| ≤ ε, can be found in time\n\nO( C( (1 − γ)²ε / 4γRmax )² log_γ( (1 − γ)ε / 2Rmax ) ).\n\nProof. 
To prove the result, we give an algorithm that computes the required approximation. It performs a depth-first search on a depth-bounded belief tree and uses approximate memoization to avoid unnecessarily computing the values of very similar beliefs. Intuitively, to achieve a polynomial-time algorithm, we bound the height of the tree by exploiting the discount factor and bound the width of the tree by exploiting the covering number.\nWe perform the depth-first search recursively on a belief tree T_R that has root b0 and height h, while maintaining a δ-packing of R(b0) at every level of T_R. Suppose that the search encounters a new belief node b at level i of T_R. If b is within a distance δ of a point b′ in the current packing at level i, we set V(b) = V(b′), abort the recursion at b, and backtrack. Otherwise, we recursively search the children of b. When the search returns, we perform a backup operation to compute V(b) and add b to the packing at level i. If b is a leaf node of T_R, we set V(b) = 0. We build a separate packing at each level of T_R, as each level has a different approximation error.\nWe now calculate the values of h and δ required to achieve the given approximation bound ε at b0. Let ε_i = |V∗(b) − V(b)| denote the approximation error for a node b at level i of T_R, if the recursive search continues in the children of b. By convention, the leaf nodes are at level 0. Similarly, let ε′_i denote the error for b, if the search aborts at b and V(b) = V(b′) for some b′ in the packing at level i. Hence,\n\nε′_i = |V∗(b) − V(b′)| ≤ |V∗(b) − V∗(b′)| + |V∗(b′) − V(b′)| ≤ (Rmax/(1 − γ)) δ + ε_i,\n\nwhere the last inequality uses Lemma 1 and the definition of ε_i. Clearly, ε_0 ≤ Rmax/(1 − γ). To calculate ε_i for a node b at level i, we establish a recurrence. 
The children of b, which are at level i − 1, have error at most ε′_{i−1}. Since a backup operation is performed at b, we have ε_i ≤ γ ε′_{i−1}, and thus the recurrence ε_i ≤ γ(ε_{i−1} + (Rmax/(1 − γ)) δ). Expanding the recurrence, we find that the error ε_h at the root b0 is given by\n\n|V∗(b0) − V(b0)| ≤ (γRmax(1 − γ^h)/(1 − γ)²) δ + γ^h Rmax/(1 − γ) ≤ (γRmax/(1 − γ)²) δ + γ^h Rmax/(1 − γ).\n\nBy setting δ = (1 − γ)²ε / 2γRmax and h = log_γ( (1 − γ)ε / 2Rmax ), we can guarantee |V∗(b0) − V(b0)| ≤ ε.\nWe now work out the running time of the algorithm. For each node b in the packings, the algorithm expands it by calculating the beliefs and the corresponding values for all its children and performing a backup operation at b to compute V(b). It takes O(|S|²) time to calculate the belief at a child node. We then perform a nearest-neighbor search in O(P(δ)|S|) time to check whether the child node lies within a distance δ of any point in the packing at that level. Since b has |A||O| children, the expansion operation takes O(|A||O||S|(|S| + P(δ))) time. The backup operation then computes V(b) from its children's values, weighted by the probabilities specified by the observation function, and takes only O(|A||O|) time. Since there are h packings of size at most P(δ) each, and by Lemma 2, P(δ) ≤ C(δ/2), the total running time of our algorithm is\n\nO( h C(δ/2) |A||O||S| (|S| + C(δ/2)) ).\n\nWe assume that |S|, |A|, and |O| are constant to focus on the dependency on the covering number, and the above expression then becomes O(h C(δ/2)²). Substituting in the values of h and δ, we get the final result. 
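The depth-bounded search with approximate memoization used in the proof can be sketched in Python (our own simplified sketch; the `pomdp` interface with `actions`, `observations`, `gamma`, `reward(b, a)`, `obs_prob(b, a, o)`, and `update(b, a, o)` is hypothetical, and the nearest-neighbor test is done by brute force):

```python
import math

def l1(b1, b2):
    """l1 distance between two belief vectors."""
    return sum(abs(x - y) for x, y in zip(b1, b2))

def estimate_value(pomdp, b, depth, delta, packings):
    """Depth-bounded DFS with approximate memoization (sketch of Theorem 1).

    packings[i] holds (belief, value) pairs forming a delta-packing at level i.
    """
    if depth == 0:
        return 0.0                          # leaf nodes get value 0
    # Reuse the value of a nearby belief at this level, if one exists.
    for b2, v in packings[depth]:
        if l1(b, b2) <= delta:
            return v
    best = -math.inf
    for a in pomdp.actions:
        total = pomdp.reward(b, a)
        for o in pomdp.observations:
            p_o = pomdp.obs_prob(b, a, o)   # p(o | b, a)
            if p_o > 0.0:
                b_next = pomdp.update(b, a, o)   # tau(b, a, o)
                total += pomdp.gamma * p_o * estimate_value(
                    pomdp, b_next, depth - 1, delta, packings)
        best = max(best, total)
    packings[depth].append((b, best))       # grow the packing at this level
    return best
```

The per-level packings are what keep the width of the search bounded: once a belief within δ of a cached one is reached, the recursion stops immediately.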
□\n\nThe algorithm in the above proof can be used on-line to choose an approximately optimal action at b0. We first estimate the values for all the child nodes of b0 and then select the action resulting in the highest value. Suppose that at each belief point reachable from b0, we perform such an on-line search for action selection. Using the technique in [12], one can show that if the value function approximations at all the child nodes have error at most ε, then the policy π implicitly defined by the on-line search has approximation error |V∗(b) − V^π(b)| ≤ 2γε/(1 − γ) for all b in R(b0).\nInstead of performing the on-line search, one may want to precompute an approximately optimal value function over R(b0) and perform one-step look-ahead on it at runtime for action selection. The algorithm in Theorem 1 is not sufficient for this purpose, as it samples only enough points from R(b0) to give a good value estimate at b0; the sampled points do not form a cover of R(b0). One possibility would be to find a cover of R(b0) first and then apply PBVI [11] over the points in the cover. Unfortunately, we do not know how to find a cover of R(b0) efficiently. Instead, we give a randomized algorithm that computes an approximately optimal value function with high probability.\nRoughly, this algorithm incrementally builds a packing of R(b0) at each level of T_R. It first runs the algorithm in Theorem 1 to obtain an initial packing P_i for each level i and estimates the values of the belief points in P_i. Then, to test whether the current packing P_i covers R(b0) well, it runs a set of simulations of a fixed size. If the simulations encounter new points not covered by P_i, we estimate their values and insert them into P_i. The process repeats until no more new belief points are discovered within a set of simulations. 
We show that if the set of simulations is sufficiently large, then the probability that any future run of the policy encounters new belief points not covered by the final set of packings can be made arbitrarily small.\nTheorem 2 For any b0 ∈ B, let C(δ) be the δ-covering number of R(b0). Given constants β ∈ (0, 1) and ε > 0, a randomized algorithm can compute, with probability at least 1 − β, an approximately optimal value function in time\n\nO( (Rmax/(1 − γ)ε) ( C( (1 − γ)³ε / 16γRmax ) log_γ( (1 − γ)ε / 4Rmax ) )² log( (1/β) C( (1 − γ)³ε / 16γRmax ) log_γ( (1 − γ)ε / 4Rmax ) ) ),\n\nsuch that the one-step look-ahead policy π induced by this value function has error |V∗(b0) − V^π(b0)| ≤ ε. It takes O( C( (1 − γ)³ε / 16γRmax ) ) time to use this value function to select an action at runtime.\nBoth theorems above assume a small covering number of R(b0) for efficient computation. To relax this assumption, we may require only that the covering number of an optimal reachable space R∗(b0) is small, as R∗(b0) contains only points reachable under an optimal policy and can be much smaller than R(b0). Unfortunately, under the relaxed condition, approximating the value at b0 is NP-hard. We prove this by reduction from the Hamiltonian cycle problem. The main idea is to show that a Hamiltonian cycle exists in a given graph if and only if an approximation to V∗(b0), with a suitably chosen error, can be computed for a POMDP whose optimal reachable space R∗(b0) has a small covering number. 
The result is closely related to one for finite-horizon POMDPs [8].\nTheorem 3 Given constant ε > 0, computing an approximation V(b0) of V∗(b0), with error |V(b0) − V∗(b0)| ≤ ε|V∗(b0)|, is NP-hard, even if the covering number of R∗(b0) is polynomial-sized.\n\nThe result above assumes the standard encoding of the POMDP input, with the state-transition functions, observation functions, and reward functions all represented discretely by matrices of suitable sizes. By slightly extending the proof of Theorem 3, we can also show a related hardness result, which assumes that the optimal policy has a compact representation.\nTheorem 4 Given constant ε > 0, computing an approximation V(b0) of V∗(b0), with error |V(b0) − V∗(b0)| ≤ ε|V∗(b0)|, is NP-hard, even if the number of α-vectors required to represent an optimal policy is polynomial-sized.\nOn the other hand, if an oracle provides us with a proper cover of an optimal reachable space R∗(b0), then a good approximation of V∗(b0) can be found efficiently.\nTheorem 5 For any b0 ∈ B, given constant ε > 0 and a proper δ-cover C of R∗(b0) with δ = (1 − γ)²ε / 2γRmax, an approximation V(b0) of V∗(b0), with error |V∗(b0) − V(b0)| ≤ ε, can be found in time\n\nO( |C|² + |C| log_γ( (1 − γ)ε / 2Rmax ) ).\n\nTogether, the negative and the positive results (Theorems 3 to 5) indicate that a key difficulty 
for point-based algorithms lies in finding a cover of R∗(b0). In practice, to overcome the difficulty, one may use problem-specific knowledge or heuristics to approximate R∗(b0) through sampling.\nMost point-based POMDP algorithms [11, 15, 14] interpolate the value function using α-vectors. Although we use the nearest-neighbor approximation to simplify the proofs of Theorems 1, 2, and 5, we want to point out that very similar results can be obtained using the α-vector representation if we slightly modify the analysis of the approximation errors in the proofs.\n\n5 Bounding the Covering Number\nThe covering number highlights several properties that reduce the complexity of POMDP planning in practice. We describe them below and show how they affect the covering number.\n\n5.1 Fully Observed State Variables\nSuppose that there are d state variables, each of which has at most k possible values. If d0 of these variables are fully observed, then for every reachable belief point, its vector representation contains at most m = k^{d−d0} non-zero elements out of k^d elements in total. For a given initial belief b0, the belief vectors with the same non-zero pattern form a subspace of R(b0), and R(b0) is a union of these subspaces. We can compute a δ-cover for each subspace by discretizing each non-zero element of the belief vectors to an accuracy of δ/m, and the size of the resulting δ-cover is at most (m/δ)^m. There are k^{d0} such subspaces. So the δ-covering number of R(b0) is at most k^{d0} (m/δ)^m = k^{d0} (k^{d−d0}/δ)^{k^{d−d0}}. The fully observed variables thus give a doubly exponential reduction in the covering number: they reduce the exponent by a factor of k^{d0}, at the cost of a multiplicative factor of k^{d0}.\nProposition 1 Suppose that a POMDP has d state variables, each of which has at most k possible values. 
If d0 state variables are fully observed, then for any belief point b0, the δ-covering number of the reachable belief space R(b0) is at most k^{d0} (k^{d−d0}/δ)^{k^{d−d0}}.\n\nConsider again the Tag problem described in Section 1. The state consists of both the robot's and the target's positions, as well as a status indicating whether the target is tagged. The robot and the target can each occupy any position in an environment modeled as a grid of 29 cells. If the robot has the target tagged, they must be in the same position. So, there are 29 × 29 + 29 = 870 states in total, and the belief space B is 870-dimensional. However, the robot's position is fully observed. By Proposition 1, the δ-covering number is at most 30 · (30/δ)^30. Indeed, for Tag, any reachable belief space R(b0) is effectively a union of two sets. One set corresponds to the case when the target is not tagged and consists of the union of 29 subspaces of 29 dimensions. The other set corresponds to the case when the target is tagged and consists of exactly 29 points. Clearly, the covering number captures the underlying complexity of R(b0) more accurately than the dimensionality of R(b0).\n\n5.2 Sparse Beliefs\nHighly informative observations often result in sparse beliefs, i.e., beliefs whose vector representation is sparse. For example, in the Tag problem, the state is known exactly if the robot and the target are in the same position, leaving only a single non-zero element in the belief vector. Fully observed state variables usually result in very sparse beliefs and can be considered a special case.\nIf the beliefs are always sparse, we can exploit the sparsity to bound the covering number. Otherwise, sparsity may still give a hint that the covering number is smaller than what the dimensionality of the belief space would suggest. 
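To get a feel for how much smaller such covering numbers are, the bound of Proposition 1 can be evaluated numerically for a Tag-like problem (an illustrative back-of-envelope calculation of our own, working in log10 to avoid astronomically large integers; the actual Tag state space differs slightly because of the "tagged" status):

```python
import math

def log10_cover_bound(k, d, d0, delta):
    """log10 of the Proposition 1 bound k^d0 * (k^(d-d0)/delta)^(k^(d-d0))."""
    m = k ** (d - d0)          # non-zero entries per belief vector
    return d0 * math.log10(k) + m * math.log10(m / delta)

# Tag-like setting: 2 state variables with ~30 values each, one fully observed.
k, d, delta = 30, 2, 0.1
with_obs = log10_cover_bound(k, d, d0=1, delta=delta)  # ~ 30 * (30/0.1)^30
without  = log10_cover_bound(k, d, d0=0, delta=delta)  # ~ (900/0.1)^900
```

With one variable fully observed, the bound has roughly 76 decimal digits; with none observed, over 3500 digits, which is the doubly exponential gap the text describes.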
By exploiting the non-zero patterns of belief vectors in a way similar to that in Section 5.1, we can derive the following result:\nProposition 2 Let B be a set in an n-dimensional belief space. If every belief in B can be represented as a vector with at most m non-zero elements, then the δ-covering number of B is O(n^m (m/δ)^m).\n\n5.3 Smooth Beliefs\n\nSparse beliefs are often peaky. Interestingly, when the beliefs are sufficiently smooth, e.g., when their Fourier representations are sparse, the covering number is also small. Below we give a more general result, assuming that the beliefs can be represented as linear combinations of a small number of basis vectors.\nProposition 3 Let B be a set in an n-dimensional belief space. Assume that every belief b ∈ B can be represented as a linear combination of m basis vectors such that the magnitudes of both the elements of the basis vectors and the coefficients representing b are bounded by a constant C. The δ-covering number of B is O((2C²mn/δ)^m) when the basis vectors are real-valued, and O((4C²mn/δ)^{2m}) when they are complex-valued.\n\nSmooth beliefs are usually a result of actions with high uncertainty and uninformative observations.\n\n5.4 Circulant State-Transition Matrices\n\nLet us now shift our attention from observations to actions, in particular, actions that can be represented by state-transition matrices with special structures. We start with an example. A mobile robot scout needs to navigate from a known start position to a goal position in a large environment modeled as a grid. It must not enter certain danger zones, to avoid detection by enemies. The robot can take four actions to move in the {N, S, E, W} directions, but it has imperfect control. Since the environment is large, we assume that the robot always operates far away from the boundary and the boundary effect can be ignored. 
At each grid cell, the robot moves to the intended cell with probability 1 − p and moves diagonally to the two cells adjacent to the intended one, each with probability 0.5p. The robot can use its sensors to make highly accurate observations of its current position, but by doing so, it runs the risk of being detected.\nUnder our assumptions, the state-transition functions representing the robot's actions are invariant over the grid cells and can thus be represented by circulant matrices [1]. Circulant matrices are widely used in signal processing and control theory, as they can represent all discrete-time linear translation-invariant systems. In the context of POMDPs, if applying a state-transition matrix to a belief b corresponds to convolution with a suitable distribution, then the state-transition matrix is circulant. One of the key properties of circulant matrices is that they all share the same eigenvectors. Therefore, we can multiply them in any order and obtain the same result. In our example, this means that given a set of robot moves, we can apply them in any order and the resulting belief over the robot's position is the same. This greatly reduces the number of possible beliefs, and correspondingly the covering number, in open-loop POMDPs, where no observations are involved.\nProposition 4 Suppose that all ℓ state-transition matrices representing actions are circulant and that each matrix has at most m eigenvalues whose magnitudes are greater than ζ, with 0 < ζ < 1. In an open-loop POMDP, for any point b0 in an n-dimensional belief space, the δ-covering number of the reachable belief space R(b0) is O((8ℓmn/δ)^{2ℓm} + hℓ), where h = log_ζ(δ/2n).\nIn our example, suppose that the robot scout makes a sequence of moves and needs to decide when to take occasional observations along the way to localize itself. 
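The commuting property of circulant transition matrices, and the non-expansiveness of Markov matrices stated in Lemma 3 below, are easy to check numerically (a small pure-Python sketch; the 5-cell ring and the move kernels are hypothetical examples of ours, not taken from the paper):

```python
def circulant(c):
    """n x n circulant matrix M with M[i][j] = c[(i - j) % n]; applying M
    to a belief vector is circular convolution with the kernel c."""
    n = len(c)
    return [[c[(i - j) % n] for j in range(n)] for i in range(n)]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def apply_mat(M, b):
    return [sum(M[i][j] * b[j] for j in range(len(b))) for i in range(len(b))]

def l1(x, y):
    return sum(abs(u - v) for u, v in zip(x, y))

# Two noisy "move" actions on a 5-cell ring (hypothetical kernels):
M_east = circulant([0.1, 0.8, 0.0, 0.0, 0.1])  # mostly shift one cell
M_west = circulant([0.1, 0.0, 0.0, 0.8, 0.1])  # mostly shift back
```

Because both matrices are polynomials in the same cyclic-shift matrix, their products in either order agree entry by entry, which is exactly why only the multiset of moves, not their order, matters in the open-loop case.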
To bound the covering number, we divide the sequence of moves into subsequences such that each subsequence starts with an observation and ends right before the next observation. In each subsequence, the robot starts at a specific belief and moves without additional observations. So, within a subsequence, the beliefs encountered have a δ-cover of size O((8ℓmn/δ)^{2ℓm} + hℓ) by Proposition 4. Furthermore, since all the observations are highly informative, we assume that the initial beliefs of all subsequences can be represented as vectors with at most m0 non-zero elements. The set of all initial beliefs then has a δ-cover of size O(n^{m0} (m0/δ)^{m0}) by Proposition 2. From Lemma 3 below, we know that in an open-loop POMDP, two belief trajectories can only get closer to each other as they progress.\nLemma 3 Let M be a Markov matrix and ||b1 − b2|| ≤ δ. Then ||M b1 − M b2|| ≤ δ.\nTherefore, to get a δ-cover of the space R(b0) that the robot scout can reach from a given b0, it suffices to first compute a δ/2-cover C of the initial belief points for all possible subsequences of moves and then take the union of the δ/2-covers of the belief points traversed by the subsequences whose initial belief points lie in C. The δ-cover of R(b0) then has its size bounded by O(n^{m0} (2m0/δ)^{m0} (16ℓmn/δ)^{2ℓm} + hℓ), where h = log_ζ(δ/4n).\nThe requirement of translation invariance means that circulant matrices have some limitations in modeling certain phenomena. In mobile robot navigation, obstacles or boundaries in the environment often cause difficulties. 
However, if the environment is sufficiently large and the obstacles are sparse, the behaviors of some systems can be approximated by circulant matrices.

6 Conclusion

We propose the covering number as a measure of the complexity of POMDP planning. We believe that for point-based algorithms, the covering number captures the difficulty of computing approximate solutions to POMDPs better than other commonly used measures, such as the number of states. The covering number highlights several interesting properties that reduce the complexity of POMDP planning, and quantifies their effects. Using the covering number, we have shown several results that help to identify the main difficulty of POMDP planning using point-based algorithms. These results indicate that a promising approach in practice is to approximate an optimal reachable space through sampling. We are currently exploring this idea and have already obtained promising initial results [4]. On a set of standard test problems, our new point-based algorithm outperformed the fastest existing point-based algorithm by a factor of 5 to 10 on some problems, while remaining competitive on others.
Acknowledgements. We thank Leslie Kaelbling and Tomás Lozano-Pérez for many insightful discussions on POMDPs. This work is supported in part by NUS ARF grants R-252-000-240-112 and R-252-000-243-112.

References
[1] R.M. Gray. Toeplitz and Circulant Matrices: A Review. Now Publishers, 2006.
[2] M. Hauskrecht. Value-function approximations for partially observable Markov decision processes. J. Artificial Intelligence Research, 13:33–94, 2000.
[3] J. Hoey, A. von Bertoldi, P. Poupart, and A. Mihailidis. Assisting persons with dementia during handwashing using a partially observable Markov decision process. In Proc. Int. Conf. on Vision Systems, 2007.
[4] D. Hsu, W.S. Lee, and N. Rong.
Accelerating point-based POMDP algorithms through successive approximations of the optimal reachable space. Technical Report TRA4/07, School of Computing, National University of Singapore, April 2007.
[5] L.P. Kaelbling, M.L. Littman, and A.R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1–2):99–134, 1998.
[6] M. Kearns, Y. Mansour, and A.Y. Ng. A sparse sampling algorithm for near optimal planning in large Markov decision processes. Machine Learning, 49(2–3):193–208, 2002.
[7] M.L. Littman. Algorithms for sequential decision making. PhD thesis, Dept. of Computer Science, Brown University, 1996.
[8] C. Lusena, J. Goldsmith, and M. Mundhenk. Nonapproximability results for partially observable Markov decision processes. J. Artificial Intelligence Research, 14:83–103, 2002.
[9] O. Madani, S. Hanks, and A. Condon. On the undecidability of probabilistic planning and infinite-horizon partially observable Markov decision problems. In Proc. Nat. Conf. on Artificial Intelligence, pages 541–548, 1999.
[10] C. Papadimitriou and J.N. Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441–450, 1987.
[11] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In Proc. Int. Jnt. Conf. on Artificial Intelligence, pages 477–484, 2003.
[12] S.P. Singh and R.C. Yee. An upper bound on the loss from approximate optimal-value functions. Machine Learning, 16(3):227–233, 1994.
[13] R.D. Smallwood and E.J. Sondik. The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21:1071–1088, 1973.
[14] T. Smith and R. Simmons. Point-based POMDP algorithms: Improved analysis and implementation. In Proc. Uncertainty in Artificial Intelligence, 2005.
[15] M.T.J.
Spaan and N. Vlassis. A point-based POMDP algorithm for robot planning. In Proc. IEEE Int. Conf. on Robotics & Automation, 2004.
[16] N.L. Zhang and W. Zhang. Speeding up the convergence of value iteration in partially observable Markov decision processes. J. Artificial Intelligence Research, 14:29–51, 2002.