{"title": "Sampling Networks and Aggregate Simulation for Online POMDP Planning", "book": "Advances in Neural Information Processing Systems", "page_first": 9222, "page_last": 9232, "abstract": "The paper introduces a new algorithm for planning in partially observable Markov decision processes (POMDP) based on the idea of aggregate simulation. The algorithm uses product distributions to approximate the belief state and shows how to build a representation graph of an approximate action-value function over belief space. The graph captures the result of simulating the model in aggregate under independence assumptions, giving a symbolic representation of the value function. The algorithm supports large observation spaces using sampling networks, a representation of the process of sampling values of observations, which is integrated into the graph representation. Following previous work in MDPs this approach enables action selection in POMDPs through gradient optimization over the graph representation. This approach complements recent algorithms for POMDPs which are based on particle representations of belief states and an explicit search for action selection. Our approach enables scaling to large factored action spaces in addition to large state spaces and observation spaces. An experimental evaluation demonstrates that the algorithm provides excellent performance relative to state of the art in large POMDP problems.", "full_text": "Sampling Networks and Aggregate Simulation for\n\nOnline POMDP Planning\n\nDepartment of Computer Science\n\nDepartment of Computer Science\n\nRoni Khardon\n\nIndiana University\n\nBloomington, IN, USA\n\nrkhardon@iu.edu\n\nHao Cui\n\nTufts University\n\nMedford, MA 02155, USA\n\nhao.cui@tufts.edu\n\nAbstract\n\nThe paper introduces a new algorithm for planning in partially observable Markov\ndecision processes (POMDP) based on the idea of aggregate simulation. 
The algorithm uses product distributions to approximate the belief state and shows how to build a representation graph of an approximate action-value function over belief space. The graph captures the result of simulating the model in aggregate under independence assumptions, giving a symbolic representation of the value function. The algorithm supports large observation spaces using sampling networks, a representation of the process of sampling values of observations, which is integrated into the graph representation. Following previous work in MDPs this approach enables action selection in POMDPs through gradient optimization over the graph representation. This approach complements recent algorithms for POMDPs which are based on particle representations of belief states and an explicit search for action selection. Our approach enables scaling to large factored action spaces in addition to large state spaces and observation spaces. An experimental evaluation demonstrates that the algorithm provides excellent performance relative to the state of the art in large POMDP problems.

1 Introduction

Planning in partially observable Markov decision processes is a central problem in AI which is known to be computationally hard. Work over the last two decades has produced significant algorithmic progress that affords some scalability for solving large problems. Off-line approaches, typically aiming for exact solutions, rely on the structure of the optimal value function to construct and prune such representations [22, 10, 2], and PBVI and related algorithms (see [20]) carefully control this process, yielding significant speedups over early algorithms. In contrast, online algorithms interleave planning and execution and are not allowed sufficient time to produce an optimal global policy. Instead they focus on search for the best action for the current step. 
Many approaches in online planning rely on an explicit search tree over the belief space of the POMDP and use sampling to reduce the size of the tree [11], and the most effective recent algorithms further use a particle based representation of the belief states to facilitate fast search [21, 25, 23, 7].

Our work is motivated by the idea of aggregate simulation in MDPs [5, 4, 3]. This approach builds an explicit symbolic computation graph that approximates the evolution of the distribution of state and reward variables over time, conditioned on the current action and future rollout policy. The algorithm then optimizes the choice of actions by gradient based search, using automatic differentiation [8] over the explicit function represented by the computation graph. As recently shown [6] this is equivalent to solving a marginal MAP inference problem where the expectation step is evaluated by belief propagation (BP) [17], and the maximization step is performed using gradients.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

We introduce a new algorithm SNAP (Sampling Networks and Aggregate simulation for POMDP) that expands the scope of aggregate simulation. The algorithm must tackle two related technical challenges. The solution in [5, 4] requires a one-pass forward computation of marginal probabilities. Viewed from the perspective of BP, this does not allow for downstream observations (observed descendants of action variables) in the corresponding Bayesian network. But this conflicts with the standard conditioning on observation variables in belief update. Our proposed solution explicitly enumerates all possible observations, which are then numerical constants, and reorders the computation steps to allow for aggregate simulation. The second challenge is that enumerating all possible observations is computationally expensive. 
To resolve this, our algorithm must use explicit sampling for problems with large observation spaces. Our second contribution is a construction of sampling networks, showing how observations z can be sampled symbolically and how both z and p(z) can be integrated into the computation graph so that potential observations are sampled correctly for any setting of the current action. This allows full integration with gradient based search and yields the SNAP algorithm.

We evaluate SNAP on problems from the international planning competition (IPC) 2011, the latest IPC with publicly available challenge POMDP problems, comparing its performance to POMCP [21] and DESPOT [25]. The results show that the algorithm is competitive on a large range of problems and that it has a significant advantage on large problems.

2 Background

2.1 MDPs and POMDPs

An MDP [18] is specified by {S, A, T, R, γ}, where S is a finite state space, A is a finite action space, T(s, a, s') = p(s'|s, a) defines the transition probabilities, R(s, a) is the immediate reward and γ is the discount factor. For MDPs (where the state is observed) a policy π : S → A is a mapping from states to actions, indicating which action to choose at each state. Given a policy π, the value function V^π(s) is the expected discounted total reward E[Σ_i γ^i R(s_i, π(s_i)) | π], where s_i is the ith state visited by following π and s_0 = s. The action-value function Q^π : S × A → R is the expected discounted total reward when taking action a at state s and following π thereafter.

In POMDPs the agent cannot observe the state. The MDP model is augmented with an observation space O and the observation probability function O(z, s', a) = p(z|s', a) where s' is the state reached and z is the observation in the transition T(s, a, s'). 
That is, in the transition from s to s', observation probabilities depend on the next state s'. The belief state, a distribution over states, provides a sufficient statistic of the information from the initial state distribution and history of actions and observations. The belief state can be calculated iteratively from the history. More specifically, given the current belief state b_t(s), action a_t and no observations, we expect to be in

b^{a_t}_{t+1}(s') = p(s'|b_t, a_t) = E_{s~b_t(s)}[p(s'|s, a_t)].    (1)

Given b_t(s), a_t and observation z_t the new belief state is b^{a_t,z_t}_{t+1}(s'') = p(s''|b_t, a_t, z_t):

b^{a_t,z_t}_{t+1}(s'') = p(s'', z_t|b_t, a_t) / p(z_t|b_t, a_t) = b^{a_t}_{t+1}(s'') p(z_t|s'', a_t) / p(z_t|b_t, a_t)    (2)

where the denominator in the last equation requires a double sum over states: p(z_t|b_t, a_t) = Σ_s Σ_{s'} b_t(s) p(s'|s, a_t) p(z_t|s', a_t). Algorithms for POMDPs typically condition action selection either directly on the history or on the belief state. The description above assumed an atomic representation of states, actions and observations. In factored spaces each of these is specified by a set of variables, where in this paper we assume the variables are binary. In this case, the number of states (actions, observations) is exponential in the number of variables, implying that the state enumeration which is implicit above is not feasible. One way to address this challenge is by using a particle based representation for the belief state as in [20, 21]. 
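As a concrete reference point, the exact updates in Eq (1) and (2) can be implemented by enumeration when the spaces are small. The following sketch is illustrative only (the function and variable names are ours, not from the SNAP implementation):

```python
# Exact belief update by enumeration, following Eq (1)-(2).
# b: dict state -> probability; T[(s, a)]: dict s' -> p(s'|s, a);
# O[(s2, a)]: dict z -> p(z|s2, a).  Illustrative sketch, not SNAP code.

def belief_update(b, a, z, states, T, O):
    # Eq (1): predicted belief after action a, before seeing the observation
    b_pred = {s2: sum(b[s] * T[(s, a)].get(s2, 0.0) for s in states)
              for s2 in states}
    # Eq (2): weight by the observation likelihood and renormalize;
    # the normalizer is the double sum over states from the text
    unnorm = {s2: b_pred[s2] * O[(s2, a)].get(z, 0.0) for s2 in states}
    p_z = sum(unnorm.values())
    return {s2: (unnorm[s2] / p_z if p_z > 0 else 0.0) for s2 in states}

# Tiny 2-state example: 'stay' keeps the state, and the observation
# reveals the state correctly with probability 0.85.
states = ['L', 'R']
T = {(s, 'stay'): {s: 1.0} for s in states}
O = {('L', 'stay'): {'hearL': 0.85, 'hearR': 0.15},
     ('R', 'stay'): {'hearL': 0.15, 'hearR': 0.85}}
b = {'L': 0.5, 'R': 0.5}
b2 = belief_update(b, 'stay', 'hearL', states, T, O)
# posterior on L: 0.5*0.85 / (0.5*0.85 + 0.5*0.15) = 0.85
```

The double sum over states in the normalizer is exactly what becomes infeasible in factored spaces, motivating the approximations below.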
In contrast, our approach approximates the belief state as a product distribution which allows for further computational simplifications.

2.2 MDP planning by aggregate simulation

Aggregate simulation follows the general scheme of the rollout algorithm [24] with some modifications. The core idea in aggregate simulation is to represent a distribution over states at every step of planning. Recall that the rollout algorithm [24] estimates the state-action value function Q^π(s, a) by applying a in s and then simulating forward using action selection with π, where the policy π maps states to actions. This yields a trajectory s, a, s_1, a_1, ..., and the average of the cumulative reward over multiple trajectories is used to estimate Q^π(s, a). The lifted-conformant SOGBOFA algorithm of [3] works in factored spaces. For the rollout process, it uses an open-loop policy (a.k.a. a straight line plan, or a sequential plan) where the sequence of actions is pre-determined and the actions used do not depend on the states visited in the trajectory. We refer to this below as a sequential rollout plan p. In addition, instead of performing explicit simulations it calculates a product distribution over state and reward variables at every step, conditioned on a and p. Finally, while rollout uses a fixed π, lifted-conformant SOGBOFA optimizes p at the same time it optimizes a and therefore it can improve over the initial rollout scheme. In order to perform this, the algorithm approximates the corresponding distributions as product distributions over the variables.

SOGBOFA accepts a high level description of the MDP, where our implementation works with the RDDL language [19], and compiles it into a computation graph. Consider a Dynamic Bayesian Network (DBN) which captures the finite horizon planning objective conditioned on p and a. 
The conditional distribution of each state variable x at any time step is first translated into a disjoint sum form "if(c_1) then p_1, if(c_2) ... if(c_n) then p_n" where p_i is p(x=T), T stands for true, and the conditions c_i are conjunctions of parent values which are mutually exclusive and exhaustive. The last condition implies that the probability that the variable is true is equal to Σ_i p(c_i) p_i. This representation is always possible because we work with discrete random variables and the expression can be obtained from the conditional probability of x given its parents. In practice the expressions can be obtained directly from the RDDL description. Similarly the expected value of reward variables is translated into a disjoint sum form Σ_i p(c_i) v_i with v_i ∈ R. The probabilities for the conditions c_i are approximated by assuming that their parents are independent, that is, p(c_i) is approximated by p̂(c_i) = Π_{w_k∈c_i} p̂(w_k) Π_{w̄_k∈c_i} (1 − p̂(w_k)), where w_k and w̄_k are positive and negative literals in the conjunction respectively. To avoid size explosion when translating expressions with many parents, SOGBOFA skips the translation to disjoint sum form and directly translates from the logical form of expressions into a numerical form using the standard translation from logical to numerical constructs (a ∧ b is a*b, a ∨ b is 1-(1-a)*(1-b), ¬a is 1-a). 
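The logical-to-numerical translation under the independence assumption can be written out directly. A minimal sketch (helper names are ours, chosen for illustration):

```python
# Standard logical-to-numerical translation over marginal probabilities,
# assuming the arguments are independent:
#   a AND b -> a*b,  a OR b -> 1-(1-a)*(1-b),  NOT a -> 1-a.
# Illustrative helpers, not the SOGBOFA implementation.

def AND(*ps):
    """Probability that all conjuncts hold, under independence."""
    out = 1.0
    for p in ps:
        out *= p
    return out

def OR(*ps):
    """Probability that at least one disjunct holds, under independence."""
    out = 1.0
    for p in ps:
        out *= (1.0 - p)
    return 1.0 - out

def NOT(p):
    return 1.0 - p

# p_hat(c) for the conjunction c = w1 AND (NOT w2), with marginals
# p_hat(w1)=0.7 and p_hat(w2)=0.4, is 0.7 * (1-0.4) = 0.42.
p_c = AND(0.7, NOT(0.4))
```

Each call corresponds to one internal node of the computation graph, so the graph size is linear in the size of the logical expressions rather than exponential in the number of parents.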
These expressions are combined to build an explicit computation graph that approximates the marginal probability for every variable in the DBN.

To illustrate this process consider the following example from [4] with three state variables s(1), s(2) and s(3), three action variables a(1), a(2), a(3) and two intermediate variables cond1 and cond2. The MDP model is given by the following RDDL [19] program where primed variants of variables represent the value of the variable after performing the action.

cond1 = Bernoulli(0.7)
cond2 = Bernoulli(0.5)
s'(1) = if (cond1) then ~a(3) else false
s'(2) = if (s(1)) then a(2) else false
s'(3) = if (cond2) then s(2) else false
reward = s(1) + s(2) + s(3)

The model is translated into disjoint sum expressions as s'(1) = (1-a(3))*0.7, s'(2) = s(1)*a(2), s'(3) = s(2)*0.5, r = s(1) + s(2) + s(3). The corresponding computation graph, for horizon 3, is shown in the right portion of Figure 1. The bottom layer represents the current state and action variables. In the second layer action variables represent the conformant policy, and state variables are computed from values in the first layer where each node represents the corresponding expression. The reward variables are computed at each layer and summed to get the cumulative Q value. The graph enables computation of Q^p(s, a) by plugging in values for p, s and a. For the purpose of our POMDP algorithm it is important to notice that the computation graph in SOGBOFA replaces each random variable in the graph with its approximate marginal probability.

Now given that we have an explicit computation graph we can use it for optimizing a and p using gradient based search. This is done by using automatic differentiation [8] to compute gradients w.r.t. all variables in a and p and using gradient ascent. 
To achieve this, for each action variable, e.g., a_{t,ℓ}, we optimize p(a_{t,ℓ}=T), and we optimize the joint setting Π_ℓ p(a_{t,ℓ}=T) using gradient ascent.

SOGBOFA includes several additional heuristics including dynamic control of simulation depth (trying to make sure we have enough time for n gradient steps, we make the simulation shallower if the graph size gets too large), dynamic selection of gradient step size, maintaining domain constraints, and a balance between gradient search and random restarts. In addition, the graph construction simplifies obvious numerical operations (e.g., 1*x = x and 0*x = 0) and uses dynamic programming to avoid regenerating identical node computations, achieving an effect similar to lifting in probabilistic inference. All these heuristics are inherited by our POMDP solver, but they are not important for understanding the ideas in this paper. We therefore omit the details and refer the reader to [4, 3].

3 Aggregate simulation for POMDP

This section describes a basic version of SNAP which assumes that the observation space is small and can be enumerated. Like SOGBOFA, our algorithm performs aggregate rollout with a rollout plan p. The estimation is based on an appropriate definition of the Q() function over belief states:

Q^p(b_t, a_t) = E_{b_t(s)}[R(s, a_t)] + Σ_{z_t} p(z_t|b_t, a_t) V^{p_{z_t}}(b^{a_t,z_t}_{t+1})    (3)

where V^p(b) is the cumulative value obtained by using p to choose actions starting from belief state b. Notice that we use a different rollout plan p_{z_t} for each value of the observation variables, which can be crucial for calculating an informative value for each b^{a_t,z_t}_{t+1}. The update for belief states was given above in Eq (1) and (2). 
Our algorithm implements approximations of these equations by assuming factoring through independence and by substituting variables with their marginal probabilities.

A simple approach to upgrade SOGBOFA to this case would attempt to add observation variables to the computation graph and perform the calculations in the same manner. However, this approach does not work. Note that observations are descendants of current state and action variables. However, as pointed out by [6] the main computational advantage in SOGBOFA results from the fact that there are no downstream observed variables in the computation graph. As a result belief propagation does not have backward messages and the computation can be done in one pass. To address this difficulty we reorder the computations by grounding all possible values for observations, conditioning the computation of probabilities and values on the observations and combining the results.

We start by enforcing factoring over the representation of belief states:

b̂_t(s) = Π_i b̂_t(s_i);    b̂^{a_t}_{t+1}(s) = Π_i b̂^{a_t}_{t+1}(s_i);    b̂^{a_t,z_t}_{t+1}(s) = Π_i b̂^{a_t,z_t}_{t+1}(s_i)

We then approximate Eq (1) as

b^{a_t}_{t+1}(s'_i=T) = E_{s~b_t(s)}[p(s'_i=T|s, a_t)] ≈ b̂^{a_t}_{t+1}(s'_i=T) = p̂(s'_i=T|{b̂_t(s_i)}, a_t)

where the notation p̂ indicates that conditioning on the factored set of beliefs {b̂_t(s_i)} is performed by replacing each occurrence of s_j in the expression for p(s'_i=T|{s_j}, a_t) with its marginal probability b̂_t(s_j=T). We use the same notation with the intended meaning for substitution by marginal probabilities when conditioning on b̂ in other expressions below. 
Note that since variables are binary, for any variable x it suffices to calculate p̂(x=T) where 1 − p̂(x=T) is used when the complement is needed. We use this implicitly in the following. Similarly, the reward portion of Eq (3) is approximated as

E_{b_t(s)}[R(s, a_t)] ≈ R̂({b̂_t(s_i)}, a_t).    (4)

The term p(z_t|b_t, a_t) from Eq (2) and (3) is approximated by enforcing factoring as p(z_t|b_t, a_t) ≈ Π_k p(z_{t,k}|b_t, a_t) where for each factor we have

p(z_{t,k}=T|b_t, a_t) = E_{b^{a_t}_{t+1}(s')}[p(z_{t,k}=T|s', a_t)] ≈ p̂(z_{t,k}=T|b_t, a_t) = p̂(z_{t,k}=T|{b̂^{a_t}_{t+1}(s'_i)}, a_t).

Next, to facilitate computations with factored representations, we replace Eq (2) with

b^{a_t,z_t}_{t+1}(s''_i=T) = p(s''_i=T, z_t|b_t, a_t) / p(z_t|b_t, a_t) = b^{a_t}_{t+1}(s''_i=T) p(z_t|s''_i=T, b_t, a_t) / p(z_t|b_t, a_t).    (5)

Notice that because we condition on a single variable s''_i, the last term in the numerator must retain the conditioning on b_t. This term is approximated by enforcing factoring p(z_t|s''_i=T, b_t, a_t) ≈ Π_k p(z_{t,k}|s''_i=T, b_t, a_t) where each component is

p(z_{t,k}=T|s''_i=T, b_t, a_t) = E_{b^{a_t}_{t+1}(s''|s''_i=T)}[p(z_{t,k}=T|s'', a_t)] ≈ p̂(z_{t,k}=T|s''_i=T, {b̂^{a_t}_{t+1}(s''_ℓ)}_{ℓ≠i}, a_t)

and Eq (5) is approximated as:

b̂^{a_t,z_t}_{t+1}(s''_i=T) = [ b̂^{a_t}_{t+1}(s''_i=T) Π_k p̂(z_{t,k}|s''_i=T, {b̂^{a_t}_{t+1}(s''_ℓ)}_{ℓ≠i}, a_t) ] / [ Π_k p̂(z_{t,k}|{b̂^{a_t}_{t+1}(s'_i)}, a_t) ].    (6)

Figure 1: Left: demonstration of the structure of the computation graph in SNAP when there are two possible values for observations z_t = z_1 or z_t = z_2. Right: demonstration of a three-step simulation in SOGBOFA including the representation of conformant actions.

The basic version of our algorithm enumerates all observations and constructs a computation graph to capture an approximate version of Eq (3) as follows:

Q̂^p(b̂_t, a_t) = R̂({b̂_t(s_i)}, a_t) + Σ_{z_t} ( Π_k p̂(z_{t,k}|{b̂^{a_t}_{t+1}(s'_i)}, a_t) ) V̂^{p_{z_t}}(b̂^{a_t,z_t}_{t+1}).    (7)

The overall structure has a sum of the reward portion and the next state portion. The next state portion has a sum over all concrete values for observations. For each concrete observation value we have a product between two portions: the probability for z_t and the approximate future value obtained from b̂^{a_t,z_t}_{t+1}. To construct this portion, we first build a graph that computes b̂^{a_t,z_t}_{t+1} and then apply V̂ to the belief state which is the output of this graph. 
This value V̂^p(b) is replaced by the SOGBOFA graph which rolls out p on the belief state. This is done using the computation of {b̂^{a_t}_{t+1}(s'_i)} which is correct because actions in p are not conditioned on states. As explained above, the computation in SOGBOFA already handles product distributions over state variables so no change is needed for this part. Figure 1 shows the high level structure of the computation graph for POMDPs.

Example: Tiger Problem: To illustrate the details of this construction consider the well known Tiger domain with horizon 2, i.e. where the rollout portion is just an estimate of the reward at the second step. In Tiger we have one state variable L (true when the tiger is on the left), three actions listen, openLeft and openRight, and one observation variable H (relevant on listen; true when we hear noise on the left, false when we hear noise on the right). If we open the door where the tiger is, the reward is −100 and the trajectory ends. If we open the other door where there is gold the reward is +10 and the game ends. The cost of taking a listen action is −10. If we listen then we hear noise on the correct side with probability 0.85 and on the other side with probability 0.15. The initial belief state is p(L=T) = 0.5. Note that the state always remains the same in this problem: p(L'=v|L=v) = 1. We have p(H=T|L', listen) = if L' then 0.85 else 0.15, which is translated to L' * 0.85 + (1-L') * 0.15. The reward is R = ((1-L) * openRight + L * openLeft) * (−100) + ((1-L) * openLeft + L * openRight) * 10 + listen * (−10). We first calculate the approximated Q̂^p(b̂_t, a_t=listen). The reward expectation of taking the action listen is −10. According to Eq (6), the belief state after hearing noise is b̂^{a_t=listen,H=T}_{t+1}(L=T) = 0.85. 
With the approximation in Eq (4), the reward expectation at step t+1 is then calculated as

E_{b̂^{a_t=listen,H=T}_{t+1}}[R(s, a_{t+1})] ≈ (0.15 * openRight¹_{t+1} + 0.85 * openLeft¹_{t+1}) * (−100) + (0.15 * openLeft¹_{t+1} + 0.85 * openRight¹_{t+1}) * 10 + listen¹_{t+1} * (−10),

where the superscript o of an action denotes that it works with the belief state obtained after seeing the o-th observation. Similarly we have b̂^{a_t=listen,H=F}_{t+1}(L=T) = 0.15, and the reward expectation on this belief state is calculated as

E_{b̂^{a_t=listen,H=F}_{t+1}}[R(s, a_{t+1})] ≈ (0.85 * openRight²_{t+1} + 0.15 * openLeft²_{t+1}) * (−100) + (0.85 * openLeft²_{t+1} + 0.15 * openRight²_{t+1}) * 10 + listen²_{t+1} * (−10).

Note that we have p̂(H=T) = p̂(H=F) = 0.5. Now with horizon 2, we have Q̂^p(b̂_t, a_t=listen) = −10 + 0.5 * E_{b̂^{a_t=listen,H=T}_{t+1}}[R(s, a¹_{t+1})] + 0.5 * E_{b̂^{a_t=listen,H=F}_{t+1}}[R(s, a²_{t+1})]. Note that the conformant actions for step t+1 on different belief states are different. With openLeft¹_{t+1}=T and openRight²_{t+1}=T, the total Q estimate is −26.5. Similar computations for openLeft and openRight yield Q̂ = −90. Maximizing over a_t and p we have an optimal conformant path listen_t, openLeft_{t+1}|H=T, openRight_{t+1}|H=F.

4 Sampling networks for large observation spaces

The algorithm of the previous section is too slow when there are many observations because we generate a sub-graph of the simulation for every possible value. Like other algorithms, when the observation space is large we can resort to sampling observations and aggregating values only for the observations sampled. 
Our construction already computes a node in the graph representing an approximation of p(z_{t,k}|b_t, a_t). Therefore we can sample from the product space of observations conditioned on a_t. Once a set of observations is sampled we can produce the same type of graph as before, replacing the explicit calculation of expectation with an average over the sample as in the following equation, where N is the total number of samples and z^n_t is the nth sampled observation:

Q̂^p(b̂_t, a_t) = R̂({b̂_t(s_i)}, a_t) + (1/N) Σ_{n=1}^{N} V̂^{p_{z^n_t}}(b̂^{a_t,z^n_t}_{t+1}).    (8)

Figure 2: Sampling network structure.

However, to implement this idea we must deal with two difficulties. The first is that during gradient search a_t is not a binary action but instead it represents a product of Bernoulli distributions Π_ℓ p(a_{t,ℓ}) where each p(a_{t,ℓ}) determines our choice for setting action variable a_{t,ℓ} to true. This is easily dealt with by replacing variables with their expectations as in previous calculations. The second is more complex because of the gradient search. We can correctly sample as above, calculate derivatives and update Π_ℓ p(a_{t,ℓ}). But once this is done, a_t has changed and the sampled observations no longer reflect p(z_{t,k}|b_t, a_t). The computation graph is still correct, but the observations may not be a representative sample for the updated action. To address this we introduce the idea of sampling networks. This provides a static construction that samples observations with correct probabilities for any setting of a_t. Since we deal with product distributions we can deal with each variable separately. Consider a specific variable z_{t,k} and let x_1 be the node in the graph representing p̂(z_{t,k}=T|{b̂^{a_t}_{t+1}(s'_i)}, a_t). Our algorithm draws C ∈ [0, 1] from the uniform distribution at graph construction time. Note that p(C ≤ x_1) = x_1. 
Therefore we can sample a value for z_{t,k} at construction time by setting z_{t,k}=T iff x_1 − C ≥ 0. To avoid the use of a non-differentiable condition (≥ 0) in our graph we replace this with ẑ_{t,k} = σ(A(x_1 − C)) where σ(x) = 1/(1 + e^{−x}) is the sigmoid function and A is a constant (A = 10 in our experiments). This yields a node in the graph representing ẑ_{t,k} whose value is ≈ 0 or ≈ 1. The only problem is that at graph construction time we do not know whether this value is 0 or 1. We therefore need to modify the portion of the graph that uses p̂(z_{t,k}| . . .) where the construction has two variants of this with different conditioning events, and we use the same solution in both cases. For concreteness let x_2 be the node in the graph representing p̂(z_{t,k}|s''_i=T, {b̂^{a_t}_{t+1}(s''_ℓ)}_{ℓ≠i}, a_t). The value computed by node x_2 is used as input to other nodes. We replace these inputs with

ẑ_{t,k} * x_2 + (1 − ẑ_{t,k}) * (1 − x_2).

Now, when ẑ_{t,k} ≈ 1 we get x_2 and when it is ≈ 0 we get 1 − x_2 as desired. We use the same construction with x_1 to calculate the probability with the second type of conditioning. Therefore, the sampling networks are produced at graph construction time but they produce symbolic nodes representing concrete samples for z_t which are correctly sampled from the distribution conditioned on a_t. Figure 2 shows the sampling network for one ẑ_{t,k} and the calculation of the probability.

SNAP tests if the observation space is smaller than some fixed constant (S = 10 in the experiments). If so it enumerates all observations. 
Otherwise, it integrates sampling networks for up to S observations into the previous construction to yield a sampled graph. The process for the dynamic setting of simulation depth from SOGBOFA is used for the rollout from all samples. If the algorithm finds that there is insufficient time it generates fewer than S samples with the goal of achieving at least n = 200 gradient updates. Optimization proceeds in the same manner as before with the new graph.

5 Discussion

SNAP has two main assumptions or sources of potential limitations. The first is the fact that the rollout plans do not depend on observations beyond the first step. Our approximation is distinct from the QMDP approximation [13] which ignores observations altogether. It is also different from the FIB approximation of [9] which uses observations from the first step but uses a state based approximation thereafter, in contrast with our use of a conformant plan over the factored belief state. The second limitation is the factoring into products of independent variables. Factoring is not new and has been used before for POMDP planning (e.g. [14, 15, 16]) where authors have shown practical success across different problems and some theoretical guarantees. However, the manner in which factoring is used in our algorithm, through symbolic propagation with gradient based optimization, is new and is the main reason for efficiency and improved search.

POMCP [21] and DESPOT [25] perform search in belief space and develop a search tree which optimizes the action at every branch in the tree. Very recently these algorithms were improved to handle large, even continuous, observation spaces [23, 7]. Compared to these, the rollout portion in SNAP is more limited because we use a single conformant sequential plan (i.e., not a policy) for rollout and do not expand a tree. On the other hand the aggregate simulation in SNAP provides a significant speedup. 
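The sampling-network construction of Section 4 is what keeps sampled observations inside the differentiable graph. A minimal numeric sketch (helper names are ours; A = 10 as in the paper):

```python
import math
import random

# Sketch of a sampling network for one observation variable:
# C is drawn once at "graph construction time", and z_hat = sigmoid(A*(x1-C))
# acts as a nearly 0/1 node (soft when x1 is close to C) that remains a
# correct sample of Bernoulli(x1) however x1 later changes, e.g. after a
# gradient step on the action marginals.  Illustrative, not SNAP code.

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

A = 10.0
C = random.random()                    # fixed at construction time

def z_hat(x1):
    """Soft sample of z ~ Bernoulli(x1); differentiable in x1."""
    return sigmoid(A * (x1 - C))

def select(z, x2):
    """z*x2 + (1-z)*(1-x2): picks the branch of p_hat(z|...) for the sample."""
    return z * x2 + (1.0 - z) * (1.0 - x2)

# Sanity check: since sigmoid(A*(x1 - C)) > 0.5 iff C < x1, over many
# independent draws of C the hard-thresholded sample is true about x1
# of the time, here with x1 = 0.3.
n, x1 = 100000, 0.3
hits = sum(sigmoid(A * (x1 - random.random())) > 0.5 for _ in range(n))
```

The select combination is the same trick the paper applies to both conditioning variants, so the whole sampled branch stays a smooth function of the action marginals.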
The other main advantage of SNAP is the fact that it samples and computes its values symbolically, because this allows for effective gradient based search, in contrast with unstructured sampling of actions in these algorithms. Finally, [21, 25] use a particle based representation of the belief space, whereas SNAP uses a product distribution. These represent different approximations which may work well in different problems.

In terms of limitations, note that deterministic transitions are not necessarily bad for factored representations because a belief focused on one state is both deterministic and factored and this can be preserved by the transition function. For example, the work of [15] has already shown this for the well known rocksample domain. The same is true for the T-maze domain of [1]. Simple experiments (details omitted) show that SNAP solves this problem correctly and that it scales better than other systems to large mazes. SNAP can be successful in these problems because one step of observation is sufficient and the reward does not depend in a sensitive manner on correlation among variables.

On the other hand, we can illustrate the limitations of SNAP with two simple domains. The first has 2 state variables x_1, x_2, 3 action variables a_1, a_2, a_3 and one observation variable o_1. The initial belief state is uniform over all 4 assignments, which when factored is b_0 = (0.5, 0.5), i.e., p(x_1 = 1) = 0.5 and p(x_2 = 1) = 0.5. The reward is if (x_1 == x_2) then 1 else −1. The actions a_1, a_2 are deterministic where a_1 deterministically flips the value of x_1, that is: x'_1 = if (a_1 ∧ x_1) then 0 elseif (a_1 ∧ ¬x_1) then 1 else x_1. Similarly, a_2 deterministically flips the value of x_2. The action a_3 gives a noisy observation testing if x_1 == x_2 as follows: p(o = 1) = if (a_3 ∧ x'_1 ∧ x'_2) ∨ (a_3 ∧ ¬x'_1 ∧ ¬x'_2) then 0.9 elseif a_3 then 0.1 else 0. In this case, starting with b_0 = (0.5, 0.5) it is obvious that the belief is not changed by a_1, a_2, and calculating for a_3 we see that p(x'_1 = 1|o = 1) = (0.5·0.9 + 0.5·0.1) / ((0.5·0.9 + 0.5·0.1) + (0.5·0.9 + 0.5·0.1)) = 0.5, so the belief does not change. In other words we always have the same belief and same expected reward (which is zero) and the search will fail. On the other hand, a particle based representation of the belief state will be able to concentrate on the correct two particles (00,11 or 01,10) using the observations.

Figure 3: Top Left: The size of state, action, and observation spaces for the three IPC domains. Other Panels: Average reward of algorithms normalized relative to SNAP (score=1) and Random (score=0).

The second problem has the same state and action variables, same reward, and a_1, a_2 have the same dynamics. We have two sensing actions a_3 and a_4 and two observation variables. Action a_3 gives a noisy observation of the value of x_1 as follows: p(o_1 = 1) = if (a_3 ∧ x'_1) then 0.9 elseif (a_3 ∧ ¬x'_1) then 0.1 else 0. Action a_4 does the same w.r.t. x_2. In this case the observation from a_3 does change the belief, for example: p(x'_1 = 1|o_1 = 1) = 0.5·0.9 / (0.5·0.9 + 0.5·0.1) = 0.9. That is, if we observe o_1 = 1 then the belief is (0.9, 0.5). But the expected reward is still 0.9·0.5 + 0.1·0.5 − 0.9·0.5 − 0.1·0.5 = 0, so the new belief state is not distinguishable from the original one, unless one uses the additional sensing action a_4 to identify the value of x_2. In other words for this problem we must develop a search tree because one level of observations does not suffice. 
If we were to develop such a\ntree we can reach belief states like (0.9, 0.9) that identi\ufb01es the correct action and we can succeed\ndespite factoring, but SNAP will fail because the search is limited to one level of observations. Here\ntoo a particle based representation will succeed because it retains the correlation between x1, x2.\n\n1 = 1|o1 = 1) =\n\n0.5\u00b70.9\n\n6 Experimental evaluation\n\nWe compare the performance of SNAP to the state-of-the-art online planners for POMDP. Speci\ufb01cally,\nwe compare to POMCP [21] and DESPOT [25]. For DESPOT, we use the original implementation\nfrom https://github.com/AdaCompNUS/despot/. For POMCP we use the implementation from the\nwinner of IPC2011 Boolean POMDP track, POMDPX NUS. POMDPX NUS is a combination of an\nof\ufb02ine algorithm SARSOP [12] and POMCP. It triggers different algorithms depending on the size of\nthe problem. Here, we only use their POMCP implementation. DESPOT and POMCP are domain\nindependent planners, but previous work has used manually speci\ufb01ed domain knowledge to improve\ntheir performance in speci\ufb01c domains. Here we test all algorithms without domain knowledge.\nWe compare the planners on 3 IPC domains. In CrossingTraffic, the robot tries to move from\none side of a river to the other side, with a penalty at every step when staying in the river. Floating\nobstacles randomly appear upstream in the river and \ufb02oat downstream. If running into an obstacle,\nthe robot will be trapped and cannot move anymore. The robot has partial observation of whether and\nwhere the obstacles appear. The sysadmin domain models a network of computers. A computer has\na probability of failure which depends on the proportion of all other computers connected to it that\nare running. The agent can reboot one or more computers, which has a cost but makes sure that the\ncomputer is running in the next step. The goal is to keep as many computers as possible running for\nthe entire horizon. 
The agent has a stochastic observation of whether each computer is still running. In the traffic domain, there are multiple traffic lights that operate at different road intersections. Cars flow from the roads to the intersections and the goal is to minimize the number of cars waiting at intersections. The agent can only observe whether cars are entering each intersection and from which direction, but not their number. For each domain, we use the original 10 problems from the competition, but add two larger problems, where, roughly speaking, the problems get harder as their
index increases. The size of the largest problem for each domain is shown in Figure 3. Note that the action spaces are relatively small. Similar to SOGBOFA [3], SNAP can handle much larger action spaces, whereas we expect POMCP and DESPOT to do less well as the action space increases.

Figure 4: Left: performance analysis of SNAP given different amounts of running time. Right: performance analysis of SNAP given different numbers of sampled observations.

For the main experiment we use 2 seconds of planning time per step for all planners. We first show the normalized cumulative reward that each planner gets from 100 runs on each problem. The raw scores for individual problems vary, making visualization of results for many problems difficult. For visual clarity of comparison across problems, we normalize the total reward of each planner by linear scaling such that SNAP is always 1 and the random policy is always 0. We do not include standard deviations in the plots because it is not clear how to calculate these for normalized ratios. Raw scores and standard deviations of the mean estimate for each problem are given in the supplementary materials. Given these scores, visible differences in the plots are statistically significant, so the trends in the plots are indicative of performance. The results are shown in Fig 3.
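The linear rescaling described above can be sketched in a few lines. The raw reward values below are hypothetical, chosen only to illustrate the mapping (they are not taken from the paper's results):

```python
def normalize(score, random_score, snap_score):
    """Linear scaling so that the random policy maps to 0 and SNAP maps to 1."""
    return (score - random_score) / (snap_score - random_score)

# Hypothetical average cumulative rewards on one problem instance.
raw = {"Random": -40.0, "SNAP": -10.0, "DESPOT": -25.0, "POMCP": -55.0}

normalized = {name: normalize(s, raw["Random"], raw["SNAP"])
              for name, s in raw.items()}
# Random maps to 0.0 and SNAP to 1.0 by construction; a planner midway
# between them maps to 0.5, and one worse than Random maps below 0.
```

Under this scaling, relative position between the two anchors is all that is plotted, which is why standard deviations of the normalized ratios are not straightforward to report.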
First, we can observe that SNAP has competitive performance on all domains and is significantly better on most problems. Note that the observation space in sysadmin is large and the basic algorithm would not be able to handle it, showing the importance of sampling networks. Second, we can observe that the larger the problem is, the easier it is to distinguish our planner from the others. This illustrates that SNAP has an advantage in dealing with large combinatorial state, action and observation spaces.
To further analyze the performance of SNAP we explore its sensitivity to the settings of the experiments. First, we compare the planners with different planning times. We arbitrarily picked one of the largest problems, sysadmin 10, for this experiment. We vary the running time from 1 to 30 seconds per step. The results are in Fig 4, left. We observe that SNAP dominates the other planners regardless of the running time and that the difference between SNAP and the other planners is maintained across the range.
Next, we evaluate the sensitivity of SNAP to the number of observation samples. In this experiment, in order to isolate the effect of the number of samples, we fix the values of dynamically set parameters and do not limit the run time of SNAP. In particular, we fix the search depth (to 5) and the number of updates (to 200) and repeat the experiment 100 times. The number of observations is varied from 1 to 20. We run the experiments on a relatively small problem, sysadmin 3, to control the run time of the experiment. The results are in the right plot of Fig 4. We first observe that on this problem allowing more samples improves the performance of the algorithm. The improvement is dramatic up to 5 samples, and from 5 to 20 it is more moderate.
This illustrates that more samples are better, but also shows the potential of small sample sizes to yield good performance.

7 Conclusion

The paper introduces SNAP, a new algorithm for solving POMDPs using sampling networks and aggregate simulation. The algorithm is not guaranteed to find the optimal solution even if it is given unlimited time, because it uses independence assumptions together with inference using belief propagation (through the graph construction) for portions of its computation. On the other hand, as illustrated in the experiments, when time is limited the algorithm provides a good tradeoff as compared to state-of-the-art anytime exact solvers. This allows scaling POMDP solvers to factored domains where state, observation and action spaces are all large. SNAP performs well across a range of problem domains without the need for domain-specific heuristics.

Acknowledgments

This work was partly supported by NSF under grant IIS-1616280 and by an Adobe Data Science Research Award. Some of the experiments in this paper were performed on the Tufts Linux Research Cluster supported by Tufts Technology Services.

References

[1] Bram Bakker. Reinforcement learning with long short-term memory. In Proceedings of the 14th International Conference on Neural Information Processing Systems, pages 1475-1482, 2001.

[2] A.R. Cassandra, M.L. Littman, and N.L. Zhang. Incremental pruning: A simple, fast, exact method for partially observable Markov Decision Processes.
In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 54-61, 1997.

[3] Hao Cui, Thomas Keller, and Roni Khardon. Stochastic planning with lifted symbolic trajectory optimization. In Proceedings of the International Conference on Automated Planning and Scheduling, 2019.

[4] Hao Cui and Roni Khardon. Online symbolic gradient-based optimization for factored action MDPs. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 3075-3081, 2016.

[5] Hao Cui, Roni Khardon, Alan Fern, and Prasad Tadepalli. Factored MCTS for large scale stochastic planning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3261-3267, 2015.

[6] Hao Cui, Radu Marinescu, and Roni Khardon. From stochastic planning to marginal MAP. In Proceedings of Advances in Neural Information Processing Systems, pages 3085-3095, 2018.

[7] Neha Priyadarshini Garg, David Hsu, and Wee Sun Lee. DESPOT-alpha: Online POMDP planning with large state and observation spaces. In Robotics: Science and Systems, 2019.

[8] Andreas Griewank and Andrea Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation (2nd ed.). SIAM, 2008.

[9] M. Hauskrecht. Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13:33-94, 2000.

[10] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99-134, 1998.

[11] Michael Kearns, Yishay Mansour, and Andrew Y. Ng. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning, 49(2-3):193-208, 2002.

[12] Hanna Kurniawati, David Hsu, and Wee Sun Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces.
In Robotics: Science and Systems, 2008.

[13] M.L. Littman, A.R. Cassandra, and L.P. Kaelbling. Learning policies for partially observable environments: Scaling up. In Proceedings of the International Conference on Machine Learning, pages 362-370, 1995.

[14] David A. McAllester and Satinder P. Singh. Approximate planning for factored POMDPs using belief state simplification. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 409-416, 1999.

[15] Joni Pajarinen, Jaakko Peltonen, Ari Hottinen, and Mikko A. Uusitalo. Efficient planning in large POMDPs through policy graph based factorized approximations. In Proceedings of the European Conference on Machine Learning, pages 1-16, 2010.

[16] Sébastien Paquet, Ludovic Tobin, and Brahim Chaib-draa. An online POMDP algorithm for complex multiagent environments. In International Joint Conference on Autonomous Agents and Multiagent Systems, pages 970-977, 2005.

[17] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1989.

[18] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 1994.

[19] Scott Sanner. Relational dynamic influence diagram language (RDDL): Language description. Unpublished manuscript, Australian National University, 2010.

[20] Guy Shani, Joelle Pineau, and Robert Kaplow. A survey of point-based POMDP solvers. Autonomous Agents and Multi-Agent Systems, 27(1):1-51, 2013.

[21] David Silver and Joel Veness. Monte-Carlo planning in large POMDPs. In Proceedings of the Conference on Neural Information Processing Systems, pages 2164-2172, 2010.

[22] R. D. Smallwood and E. J. Sondik. The optimal control of partially observable Markov processes over a finite horizon.
Operations Research, 21:1071-1088, 1973.

[23] Zachary N. Sunberg and Mykel J. Kochenderfer. Online algorithms for POMDPs with continuous state, action, and observation spaces. In International Conference on Automated Planning and Scheduling, pages 259-263, 2018.

[24] G. Tesauro and G. Galperin. On-line policy improvement using Monte-Carlo search. In Proceedings of Advances in Neural Information Processing Systems, 1996.

[25] Nan Ye, Adhiraj Somani, David Hsu, and Wee Sun Lee. DESPOT: Online POMDP planning with regularization. Journal of Artificial Intelligence Research, 58(1), 2017.