{"title": "The Infinite Partially Observable Markov Decision Process", "book": "Advances in Neural Information Processing Systems", "page_first": 477, "page_last": 485, "abstract": "The Partially Observable Markov Decision Process (POMDP) framework has proven useful in planning domains that require balancing actions that increase an agent's knowledge and actions that increase an agent's reward. Unfortunately, most POMDPs are complex structures with a large number of parameters. In many real-world problems, both the structure and the parameters are difficult to specify from domain knowledge alone. Recent work in Bayesian reinforcement learning has made headway in learning POMDP models; however, this work has largely focused on learning the parameters of the POMDP model. We define an infinite POMDP (iPOMDP) model that does not require knowledge of the size of the state space; instead, it assumes that the number of visited states will grow as the agent explores its world, and it explicitly models only visited states. We demonstrate the iPOMDP's utility on several standard problems.", "full_text": "The Infinite Partially Observable Markov Decision Process

Finale Doshi-Velez
Cambridge University
Cambridge, CB2 1PZ, UK
finale@alum.mit.edu

Abstract

The Partially Observable Markov Decision Process (POMDP) framework has proven useful in planning domains where agents must balance actions that provide knowledge and actions that provide reward. Unfortunately, most POMDPs are complex structures with a large number of parameters. In many real-world problems, both the structure and the parameters are difficult to specify from domain knowledge alone. Recent work in Bayesian reinforcement learning has made headway in learning POMDP models; however, this work has largely focused on learning the parameters of the POMDP model.
We define an infinite POMDP (iPOMDP) model that does not require knowledge of the size of the state space; instead, it assumes that the number of visited states will grow as the agent explores its world, and it models only visited states explicitly. We demonstrate the iPOMDP on several standard problems.

1 Introduction

The Partially Observable Markov Decision Process (POMDP) model has proven attractive in domains where agents must reason in the face of uncertainty because it provides a framework for agents to compare the values of actions that gather information and actions that provide immediate reward. Unfortunately, modelling real-world problems as POMDPs typically requires a domain expert to specify both the structure of the problem and a large number of associated parameters, both of which are often difficult tasks. Current methods in reinforcement learning (RL) focus on learning the parameters online, that is, while the agent is acting in its environment. Bayesian RL [1, 2, 3] has recently received attention because it allows the agent to reason about both the uncertainty in its model of the environment and the uncertainty within the environment itself. However, these methods also tend to focus on learning the parameters of an environment rather than its structure.

In the context of POMDP learning, several algorithms [4, 5, 6, 7] have applied Bayesian methods to reason about the unknown model parameters. All of these approaches provide the agent with the size of the underlying state space and focus on learning the transition and observation¹ dynamics for each state. Even when the size of the state space is known, however, making the agent reason about a large number of unknown parameters at the beginning of the learning process is fraught with difficulties. The agent has insufficient experience to fit a large number of parameters, and therefore much of the model will be highly uncertain.
Trying to plan under vast model uncertainty often requires significant computational resources; moreover, the computation is often wasted effort when the agent has very little data. Using a point estimate of the model instead (that is, ignoring the model uncertainty) can be highly inaccurate if the expert's prior assumptions are a poor match for the true model.

¹[7] also learns rewards.

We propose a nonparametric approach to modelling the structure of the underlying space, specifically the number of states in the agent's world, which allows the agent to start with a simple model and grow it with experience. Building on the infinite hidden Markov model (iHMM) [8], the infinite POMDP (iPOMDP) model posits that the environment contains an unbounded number of states. The agent is expected to stay in a local region; however, as time passes, it may explore states that it has not visited before. Initially, the agent will infer simple, local models of the environment corresponding to its limited experience (which are also conducive to fast planning). It will dynamically add structure as it accumulates evidence for more complex models. Finally, a data-driven approach to structure discovery allows the agent to agglomerate states with identical dynamics (see section 4 for a toy example).

2 The Infinite POMDP Model

A POMDP consists of the n-tuple {S, A, O, T, Ω, R, γ}. S, A, and O are sets of states, actions, and observations. The transition function T(s′|s, a) defines the distribution over next states s′ to which the agent may transition after taking action a from state s. The observation function Ω(o|s′, a) is a distribution over observations o that may occur in state s′ after taking action a. The reward function R(s, a) specifies the immediate reward for each state-action pair (see figure 1 for a slice of the graphical model).
The factor γ ∈ [0, 1) weighs the importance of current and future rewards. We focus on discrete state and observation spaces (generalising to continuous observations is straightforward) and finite action spaces. The size of the state space is unknown and potentially unbounded. The transitions, observations, and rewards are modelled with an iHMM.

Figure 1: A time-slice of the POMDP model (variables s_{t−1}, s_t, a_t, o_t, r_t).

The Infinite Hidden Markov Model  A standard hidden Markov model (HMM) consists of the n-tuple {S, O, T, Ω}, where the transition T(s′|s) and observation Ω(o|s′) distributions depend only on the hidden state. When the number of hidden states is finite and discrete, Dirichlet distributions may be used as priors over the transition and observation distributions. The iHMM [9] uses a hierarchical Dirichlet process (HDP) to define a prior over HMMs in which the number of underlying states is unbounded.² To generate a model from the iHMM prior, we:

1. Draw the mean transition distribution T̄ ∼ Stick(λ).
2. Draw observations Ω(·|s) ∼ H for each s.
3. Draw transitions T(·|s) ∼ DP(α, T̄) for each s.

where λ is the DP concentration parameter and H is a prior over observation distributions. For example, if the observations are discrete, then H could be a Dirichlet distribution.

Intuitively, the first two steps define the observation distribution and an overall popularity for each state. The third step uses these overall state popularities to define individual state transition distributions.
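For concreteness, the finite special case of this tuple can be held in a few arrays. A minimal sketch, where the class name and the array-shape conventions are our own rather than the paper's:

```python
import numpy as np

class DiscretePOMDP:
    """Finite container for the tuple {S, A, O, T, Omega, R, gamma}.

    Shape conventions (ours):
      T[a, s, s']     = T(s' | s, a)
      Omega[a, s', o] = Omega(o | s', a)
      R[a, s]         = R(s, a)
    """
    def __init__(self, T, Omega, R, gamma):
        self.T, self.Omega, self.R, self.gamma = T, Omega, R, gamma
        self.num_actions, self.num_states, _ = T.shape
        self.num_obs = Omega.shape[-1]
        # Every row of T and Omega must be a probability distribution.
        assert np.allclose(T.sum(-1), 1.0) and np.allclose(Omega.sum(-1), 1.0)

# A 2-state, 2-action, 2-observation toy instance.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.5, 0.5]]])
Omega = np.array([[[0.85, 0.15], [0.15, 0.85]]] * 2)
R = np.array([[1.0, -1.0], [0.0, 0.0]])
pomdp = DiscretePOMDP(T, Omega, R, gamma=0.95)
```

In the iPOMDP setting, only the rows for visited states are ever instantiated; the arrays above are what a single finite model sample looks like.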
More formally, the first two steps involve a draw G0 ∼ DP(λ, H), where the atoms of G0 are the Ω, and T̄ are the associated stick lengths.³ Recall that in the stick-breaking procedure, the sth stick length is T̄_s = v_s ∏_{i=1}^{s−1} (1 − v_i), where v_i ∼ Beta(1, λ). While the number of states is unbounded, T̄_s decreases exponentially with s, meaning that "later" states are less popular. This construction also ensures that ∑_{s=1}^{∞} T̄_s = 1. The top part of figure 2 shows a cartoon of a few elements of T̄ and Ω.

The third step of the iHMM construction involves defining the transition distributions T(·|s) ∼ DP(α, T̄) for each state s, where α, the concentration parameter for the DP, determines how closely the sampled distribution T(·|s) matches the mean transition distribution T̄. Because T̄ puts higher probabilities on states with smaller indices, T(s′|s) will also generally put more mass on earlier s′ (see the lower rows of figure 2). Thus, the generating process encodes the notion that the agent will spend most of its time in some local region. However, the longer the agent acts in this infinite space, the more likely it is to transition somewhere new.

²The iHMM models in [8] and [9] are formally equivalent [10].
³A detailed description of DPs and HDPs is beyond the scope of this paper; please refer to [11] for background on Dirichlet processes and [9] for an overview of HDPs.

Infinite POMDPs  To extend the iHMM framework to iPOMDPs, we must incorporate actions and rewards into the generative model. To incorporate actions, we draw an observation distribution Ω(·|s, a) ∼ H for each action a and each state s.
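A truncated version of this construction is straightforward to simulate. The sketch below (truncation level, seed, and variable names are our own choices) draws T̄ ∼ Stick(λ) and then one DP(α, T̄) transition row per state, using the fact that a DP draw over a finite truncated support reduces to a Dirichlet with pseudo-counts α·T̄:

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(lam, trunc):
    """Truncated Stick(lam) draw: Tbar_s = v_s * prod_{i<s} (1 - v_i)."""
    v = rng.beta(1.0, lam, size=trunc)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    weights = v * remaining
    # Fold the leftover stick mass into the last atom so the truncated
    # weights still sum to one.
    weights[-1] += 1.0 - weights.sum()
    return weights

lam, alpha, trunc = 2.0, 5.0, 50
Tbar = stick_breaking(lam, trunc)            # mean transition distribution
# Each state's transition row is a DP(alpha, Tbar) draw; under truncation
# this is a Dirichlet with concentration vector alpha * Tbar.
T = rng.dirichlet(alpha * Tbar, size=trunc)  # T[s, s'] = T(s' | s)
```

Because the stick weights decay geometrically in expectation, rows of `T` concentrate their mass on low-index ("popular") states, matching the locality argument above.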
Similarly, during the third step of the generative process, we draw a transition distribution T(s′|s, a) ∼ DP(α, T̄) for each state-action pair.⁴

HMMs have one output, observations, while POMDPs also output rewards. We treat rewards as a secondary set of observations. For this work, we assume that the set of possible reward values is given, and we use a multinomial distribution to describe the probability R(r|s, a) of observing reward r after taking action a in state s. As with the observations, the reward distributions R are drawn from a Dirichlet distribution H_R. We use multinomial distributions for convenience; however, other reward distributions (such as Gaussians) are easily incorporated into this framework.

Figure 2: iHMM: The first row shows each state's observation distribution Ω_s and the mean transition distribution T̄. Later rows show each state's transition distribution.

In summary, the iPOMDP prior requires that we specify

• a set of actions A and observations O,
• a generating distribution H for the observation distributions and H_R for the rewards (these generating distributions can have any form; the choice will depend on the application),
• a mean transition concentration factor λ and a state transition concentration factor α, and
• a discount factor γ.

To sample a model from the iPOMDP prior, we first sample the mean transition distribution T̄ ∼ Stick(λ).
Next, for each state s and action a, we sample

• T(·|s, a) ∼ DP(α, T̄),
• Ω(·|s, a) ∼ H,
• R(·|s, a) ∼ H_R.

Samples from the iPOMDP prior have an infinite number of states, but fortunately not all of these states need to be explicitly represented. During a finite lifetime the agent can only visit a finite number of states, and thus the agent can only make inferences about a finite number of states. The remaining (infinitely many) states are equivalent from the agent's perspective, since, in expectation, they will exhibit the mean dynamics of the prior. Thus, the only parts of the infinite model that need to be initialised are those corresponding to the states the agent has visited, along with a catch-all state representing all other states. In reality, of course, the agent does not know which states it has visited: we discuss joint inference over the unknown state history and the model in section 3.1.

3 Planning

As in the standard Bayesian RL framework, we recast the problem of POMDP learning as planning in a larger 'model-uncertainty' POMDP in which both the true model and the true state are unknown. We outline below our procedure for planning in this joint space of POMDP models and unknown states, and then detail each step, belief monitoring and action selection, in sections 3.1 and 3.2.

Because the true state is hidden, the agent must choose its actions based only on past actions and observations. In general, the best action to take at time t depends on the entire history of actions and observations that the agent has seen so far. However, the probability distribution over current states, known as the belief, is a sufficient statistic for the history of actions and observations.
In discrete state spaces, the belief at time t+1 can be computed from the previous belief b_t, the last action a, and observation o, by the following application of Bayes' rule:

b^{a,o}_{t+1}(s) = Ω(o|s, a) ∑_{s′∈S} T(s|s′, a) b_t(s′) / Pr(o|b, a),    (1)

where Pr(o|b, a) = ∑_{s′∈S} Ω(o|s′, a) ∑_{s∈S} T(s′|s, a) b_t(s). However, it is intractable to express the joint belief b over models and states in closed form. We approximate the belief b with a set of sampled models m = {T, Ω, R}, each with weight w(m). Each model sample m maintains a belief over states b_m(s). The states are discrete, and thus the belief b_m(s) can be updated using equation 1. Details for sampling the models m are described in section 3.1.

⁴We use the same base measure H to draw all observation distributions; however, a separate measure H_a could be used for each action if one had prior knowledge about the expected observation distribution for each action. Likewise, one could also draw a separate T̄_a for each action.

Given the belief, the agent must choose which action to take next. One approach is to solve the planning problem offline, that is, to determine a good action for every possible belief. If the goal is to maximize the expected discounted reward, then the optimal policy is given by:

V_t(b) = max_{a∈A} Q_t(b, a),    (2)

Q_t(b, a) = R(b, a) + γ ∑_{o∈O} Pr(o|b, a) V_t(b^{a,o}),    (3)

where the value function V(b) is the expected discounted reward that an agent will receive if its current belief is b, and Q(b, a) is the value of taking action a in belief b.
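The discrete Bayes filter of equation 1 is a few lines of array arithmetic. A minimal sketch, where the array-shape conventions (T[a, s, s′] = T(s′|s, a), Omega[a, s′, o] = Ω(o|s′, a)) are our own rather than the paper's:

```python
import numpy as np

def belief_update(b, a, o, T, Omega):
    """Equation (1): return (b', Pr(o | b, a)) for a discrete POMDP.

    b[s]            : current belief over states
    T[a, s, s']     : T(s' | s, a)
    Omega[a, s', o] : Omega(o | s', a)
    """
    predicted = b @ T[a]                  # sum_s T(s'|s, a) b(s)
    unnormalized = Omega[a, :, o] * predicted
    pr_o = unnormalized.sum()             # Pr(o | b, a)
    return unnormalized / pr_o, pr_o

# Toy check: two persistent states observed through 15% observation noise.
T = np.array([[[1.0, 0.0], [0.0, 1.0]]])          # one action, states persist
Omega = np.array([[[0.85, 0.15], [0.15, 0.85]]])  # o tends to reveal s'
b0 = np.array([0.5, 0.5])
b1, pr = belief_update(b0, a=0, o=0, T=T, Omega=Omega)  # b1 -> [0.85, 0.15]
```

Starting from a uniform belief, one observation shifts the posterior to exactly the observation model's accuracy, as expected.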
The exact solution to equation 3 is only tractable for tiny problems, but many approximation methods [12, 13, 14] have been developed to solve POMDPs offline.

While we might hope to solve equation 3 over the state space of a single model, it is intractable to solve over the joint space of states and infinite models: the model space is so large that standard point-based approximations will generally fail. Moreover, it makes little sense to find the optimal policy for all models when only a few models are likely. Therefore, instead of solving equation 3 offline, we build a forward-looking search tree at each time step (see [15] for a review of forward search in POMDPs). The tree computes the value of each action by investigating a number of steps into the future. The details of the action selection are discussed in section 3.2.

3.1 Belief Monitoring

As outlined in section 3, we approximate the joint belief over states and models with a set of samples. In this section, we describe a procedure for sampling a set of models m = {T, Ω, R} from the true belief, or posterior, over models.⁵ These samples can then be used to approximate various integrations over models that occur during planning; in the limit of infinitely many samples, the approximations are guaranteed to converge to their true values. To simplify matters, we assume that given a model m, it is tractable to maintain a closed-form belief b_m(s) over states using equation 1. Thus, models need to be sampled, but beliefs do not.

Suppose we have a set of models m that have been drawn from the belief at time t. To get a set of models drawn from the belief at time t+1, we can either draw the models directly from the new belief or adjust the weights on the model set from time t so that they provide an accurate representation of the belief at time t+1.
Adjusting the weights is computationally the most straightforward option: directly following belief update equation 1, the importance weight w(m) on model m is given by

w^{a,o}_{t+1}(m) ∝ Ω(o|m, a) w_t(m),    (4)

where Ω(o|m, a) = ∑_{s∈S} Ω(o|s, m, a) b_m(s), and we have used T(m′|m, a) = δ_m(m′) because the true model does not change.

The advantage of simply reweighting the samples is that the belief update is extremely fast. However, new experience may quickly render all of the current model samples unlikely. Therefore, we must periodically resample a new set of models directly from the current belief. The beam-sampling approach of [16] is an efficient method for drawing samples from an iHMM posterior. We adapt this approach to allow for observations with different temporal shifts (since the reward r_t depends on the state s_t, whereas the observation o_t is conditioned on the state s_{t+1}) and for transitions indexed by both the current state and the most recent action. The correctness of our sampler follows directly from the correctness of the beam sampler [16].

⁵We will use the words posterior and belief interchangeably; both refer to the probability distribution over the hidden state given some initial belief (or prior) and the history of actions and observations.

The beam sampler is an auxiliary-variable method that draws samples from the iPOMDP posterior. A detailed description of beam sampling is beyond the scope of this paper; however, we outline the general procedure below. The inference alternates between three phases:

• Sampling slice variables to limit trajectories to a finite number of hidden states. Given a transition model T and a state trajectory {s_1, s_2, . . .}, an auxiliary variable u_t ∼ Uniform([0, min(T(·|s_t, a))]) is sampled for each time t.
The final column k of the transition matrix is extended via additional stick-breaking until max_s T(s_k|s, a) < u_t. Only transitions with T(s′|s, a) > u_t are considered for inference at time t.⁶

• Sampling a hidden state trajectory. Now that we have a finite model, we apply forward filtering-backward sampling (FFBS) [18] to sample the underlying state sequence.

• Sampling a model. Given a trajectory over hidden states, transition, observation, and reward distributions are sampled for the visited states (it only makes sense to sample distributions for visited states, as we have no information about unvisited states). In this finite setting, we can resample the transitions T(·|s, a) using standard Dirichlet posteriors:

T(·|s, a) ∼ Dirichlet(T^{sa}_1 + n^{sa}_1, T^{sa}_2 + n^{sa}_2, ..., T^{sa}_k + n^{sa}_k, ∑_{i=k+1}^{∞} T^{sa}_i),    (5)

where k is the number of active or used states, T^{sa}_i is the prior probability of transitioning to state i from state s after taking action a, and n^{sa}_i is the number of observed transitions to state i from s after a. The observations and rewards are resampled in a similar manner; for example, if the observations are discrete with Dirichlet priors:

Ω(·|s, a) ∼ Dirichlet(H_1 + n^{o_1 sa}, H_2 + n^{o_2 sa}, ..., H_{|O|} + n^{o_{|O|} sa}).    (6)

As with all MCMC methods, initial samples (from the burn-in period) are biased by the sampler's starting position; only after the sampler has mixed will the samples be representative of the true posterior.

Finally, we emphasize that the approach outlined above is a sampling approach, not a maximum-likelihood estimator; thus the samples, drawn from the agent's belief, capture the variation over possible models.
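The conjugate update in equation 5 can be sketched directly. In the snippet below, the function name and the reading of the prior terms as pseudo-counts α·T̄ (with the unvisited states aggregated into one catch-all entry, as in the final term of equation 5) are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def resample_transition_row(alpha_Tbar, counts, k):
    """Equation (5): posterior Dirichlet draw for one (s, a) transition row.

    alpha_Tbar : prior pseudo-counts over the (truncated) infinite state set
    counts     : observed transition counts into the k active states
    k          : number of active (visited) states
    Returns a distribution over the k active states plus one catch-all
    entry aggregating all remaining (unvisited) states.
    """
    prior = np.concatenate([alpha_Tbar[:k], [alpha_Tbar[k:].sum()]])
    posterior = prior + np.concatenate([counts, [0.0]])  # no counts beyond k
    return rng.dirichlet(posterior)

# Three active states out of a truncated prior over five.
alpha_Tbar = 5.0 * np.array([0.5, 0.25, 0.125, 0.0625, 0.0625])
row = resample_transition_row(alpha_Tbar, counts=np.array([3.0, 0.0, 1.0]), k=3)
```

The observation update of equation 6 has the same shape, with the Dirichlet prior H in place of α·T̄ and no catch-all entry.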
The representation of the belief is necessarily approximate due to our use of samples, but the samples are drawn from the true current belief; no other approximations have been made. Specifically, we are not filtering: each run of the beam sampler produces samples from the current belief. Because they are drawn from the true posterior, all samples have equal weight.

3.2 Action Selection

Given a set of models, we apply a stochastic forward search in the model space to choose an action. The general idea behind forward search [15] is to use a forward-looking tree to compute action values. Starting from the agent's current belief, the tree branches on each action the agent might take and each observation the agent might see. At each action node, the agent computes its expected immediate reward R(a) = E_m[E_{s|m}[R(·|s, a)]].

From equation 3, the value of taking action a in belief b is

Q(a, b) = R(a, b) + γ ∑_o Ω(o|b, a) max_{a′} Q(a′, b^{ao}),    (7)

where b^{ao} is the agent's belief after taking action a and seeing observation o from belief b. Because action selection must be completed online, we use equation 4 to update the belief over models via the weights w(m). Equation 7 is evaluated recursively for each Q(a′, b^{ao}) up to some depth D. The number of evaluations grows as (|A||O|)^D, so a full expansion is feasible only for very small problems. We approximate the true value stochastically by sampling only a few observations from the distribution P(o|a) = ∑_m P(o|a, m) w(m). Equation 7 then reduces to

Q(a, b) = R(a, b) + γ (1/N_O) ∑_i max_{a′} Q(a′, b^{a o_i}),    (8)

where N_O is the number of sampled observations and o_i is the ith sampled observation.

Once we reach a prespecified depth in the tree, we must approximate the value of the leaves.
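The sampled-observation recursion of equation 8 can be sketched over a set of weighted model samples. The dict-based model format is our own convention, and for simplicity this sketch uses the exact expected immediate reward at depth 0 where the paper instead plugs in an offline leaf solution:

```python
import numpy as np

rng = np.random.default_rng(2)

def q_value(models, weights, beliefs, a, depth, n_obs, gamma):
    """Equation (8): stochastic forward search over weighted model samples."""
    # Expected immediate reward R(a, b), averaged over models and states.
    r = sum(w * (b @ m["R"][a]) for m, w, b in zip(models, weights, beliefs))
    if depth == 0:
        return r
    future = 0.0
    for _ in range(n_obs):
        # Sample o from P(o|a) = sum_m P(o|a, m) w(m): pick a model, then o.
        i = rng.choice(len(models), p=weights)
        p_o = (beliefs[i] @ models[i]["T"][a]) @ models[i]["Omega"][a]
        o = rng.choice(p_o.size, p=p_o)
        # Per-model state-belief update (eq. 1) and model reweighting (eq. 4).
        new_beliefs, likelihood = [], np.empty(len(models))
        for j, (mj, bj) in enumerate(zip(models, beliefs)):
            unnorm = mj["Omega"][a][:, o] * (bj @ mj["T"][a])
            likelihood[j] = unnorm.sum()
            new_beliefs.append(unnorm / max(likelihood[j], 1e-12))
        new_w = weights * likelihood
        new_w = new_w / new_w.sum()
        n_actions = models[0]["T"].shape[0]
        future += max(q_value(models, new_w, new_beliefs, a2, depth - 1,
                              n_obs, gamma) for a2 in range(n_actions))
    return r + gamma * future / n_obs

# One toy 2-state model sample: the single action keeps the state,
# observations are 90% accurate, and only state 0 is rewarding.
model = {"T": np.array([[[1.0, 0.0], [0.0, 1.0]]]),
         "Omega": np.array([[[0.9, 0.1], [0.1, 0.9]]]),
         "R": np.array([[1.0, 0.0]])}
q = q_value([model], np.array([1.0]), [np.array([0.6, 0.4])], a=0,
            depth=2, n_obs=2, gamma=0.95)
```

Cost grows with (|A| · N_O)^D rather than (|A||O|)^D, which is the point of sampling observations.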
For each model m in the leaves, we can compute the value Q(a, b_m, m) of the action a by approximately solving offline the POMDP model that m represents. We then approximate the value of action a as

Q(a, b) ≈ ∑_m w(m) Q(a, b_m, m).    (9)

⁶For an introduction to slice sampling, refer to [17].

This approximation is always an overestimate of the value, as it assumes that the uncertainty over models (but not the uncertainty over states) will be resolved in the following time step, similar to the QMDP [19] assumption.⁷ As the iPOMDP posterior becomes peaked and the uncertainty over models decreases, the approximation becomes more exact.

The quality of the action selection largely follows from the bounds presented in [20] for planning through forward search. The key difference is that our belief representation is now particle-based; during the forward search, we approximate an expected reward over all possible models with rewards from the particles in our set. Because we can guarantee that our models are drawn from the true posterior over models, this approach is a standard Monte Carlo approximation of the expectation. Thus, we can apply the central limit theorem to state that the estimated expected rewards will be distributed around the true expectation with approximately normal noise N(0, σ²/n), where n is the number of POMDP samples and σ² is a problem-specific variance.

4 Experiments

We begin with a series of illustrative examples demonstrating the properties of the iPOMDP. In all experiments, the observations were given vague hyperparameters (1.0 Dirichlet counts per element), and rewards were given hyperparameters that encouraged peaked distributions (0.1 Dirichlet counts per element).
The small counts on the reward hyperparameters encoded the prior belief that R(·|s, a) is highly peaked, that is, each state-action pair likely has one associated reward value. Beliefs were approximated with a sample set of 10 models. Models were resampled between episodes and reweighted during episodes. A burn-in of 500 iterations was used for the beam sampler when drawing these models directly from the belief. The forward search was expanded to a depth of 3.

Figure 3: Various comparisons of the lineworld and loopworld models ((a) cartoon of the models; (b) evolution of state-space size over 100 episodes; (c) total reward, learned versus optimal). Loopworld infers only necessary states, ignoring the more complex (but irrelevant) structure.

Avoiding unnecessary structure: Lineworld and Loopworld.  We designed a pair of simple environments to show how the iPOMDP infers states only as it can distinguish them. The first, lineworld, was a length-six corridor in which the agent could travel either left or right. Loopworld consisted of a corridor with a series of loops (see figure 3(a)); here the agent could travel through the upper or lower branches.
In both environments, only the two ends of the corridors had unique observations. Actions produced the desired effect with probability 0.95, and observations were correct with probability 0.85 (that is, 15% of the time the agent saw an incorrect observation). The agent started at the left end of the corridor and received a reward of -1 until it reached the opposite end (reward 10).

⁷We also experimented with approximating Q(a, b) ≈ 80th-percentile({w(m) Q(a, b_m, m)}). Taking a higher percentile ranking as the approximate value places a higher value on actions with larger uncertainty. As the values of the actions become better known and the discrepancies between the models decrease, this criterion reduces to the true value of the action.

The agent eventually infers that the lineworld environment consists of six states, based on the number of steps it requires to reach the goal, although in the early stages of learning it infers distinct states only for the ends of the corridor and groups the middle region as one state. The loopworld agent also shows a growth in the number of states over time (see figure 3(b)), but it never infers separate states for the identical upper and lower branches. By inferring states as they were needed to explain its observations, instead of relying on a prespecified number of states, the agent avoided the need to consider irrelevant structure in the environment. Figure 3(c) shows that the agent (unsurprisingly) learns optimal performance in both environments.

Adapting to new situations: Tiger-3.  The iPOMDP's flexibility also lets it adapt to new situations. In the tiger-3 domain, a variant of the tiger problem [19], the agent had to choose one of three doors to open. Two doors had tigers behind them (r = -100) and one door had a small reward (r = 10). At each time step, the agent could either open a door or listen for the "quiet" door.
It heard the correct door with probability 0.85.

The reward was unlikely to be behind the third door (p = .2), and during the first 100 episodes we artificially ensured that the reward was always behind doors 1 or 2. The improving rewards in figure 4 show the agent steadily learning the dynamics of its world; it learned never to open door 3. The dip in figure 4 following episode 100 occurs when we next allowed the reward to be behind all three doors, but the agent quickly adapts to the new possible state of its environment. The iPOMDP enabled the agent to first adapt quickly to its simplified environment and then to add complexity when it was needed.

Broader Evaluation.  We next completed a set of experiments on POMDP problems from the literature. Tests had 200 episodes of learning, which interleaved acting and resampling models, and 100 episodes of testing with the models fixed. During learning, actions were chosen stochastically based on their value with probability 0.05 and completely randomly with probability 0.01; otherwise, they were chosen greedily (we found this small amount of randomness was needed for exploration to overcome our very small sample set and search depths). We compared accrued rewards and running times for the iPOMDP agent against (1) an agent that knew the state count and used EM to train its model, (2) an agent that knew the state count and used the same forward filtering-backward sampling (FFBS) algorithm used in the beam-sampling inner loop to sample models, and (3) an agent that used FFBS with ten times the true number of states.
For situations where the number of states is not known, the last case is particularly interesting: we show that simply overestimating the number of states is not necessarily the most efficient solution.

Figure 4: Evolution of the averaged reward from tiger-3 over 250 episodes.

Table 1 summarises the results. We see that the iPOMDP often infers a smaller number of states than the true count, ignoring distinctions that the history does not support. The middle three columns show the speeds of the three controls relative to the iPOMDP. Because the iPOMDP generally uses smaller state spaces, most of these values are greater than 1, indicating that the iPOMDP is faster. (In the largest problem, dialog, the oversized FFBS model did not complete running in several days.) The latter four columns show accumulated rewards; we see that the iPOMDP is generally on par with or better than the methods that have access to the true state space size. Finally, figure 5 plots the learning curve for one of the problems, shuttle.

5 Discussion

Recent work in learning POMDP models includes [23], which uses a set of Gaussian approximations to allow for analytic value function updates in the POMDP space, and [5], which jointly reasons over the space of Dirichlet parameters and states when planning in discrete POMDPs. Sampling-based approaches include Medusa [4], which learns using state queries, and [7], which learns using policy

Figure 5: Evolution of reward for shuttle. During training (left), we see that the agent makes fewer mistakes toward the end of the period.
The boxplots on the right show rewards for 100 trials after learning has stopped; we see that the iPOMDP agent's reward distribution over these 100 trials is almost identical to that of an agent who had access to the correct model.

Table 1: Inferred states and performance for various problems. The iPOMDP agent (FFBS-Inf) often performs nearly as well as the agents who had knowledge of the true number of states (EM-true, FFBS-true), learning the necessary number of states much faster than an agent for which we overestimate the number of states (FFBS-big).

                          States            Relative Training Time        Performance
Problem                   True  FFBS-Inf    EM-true  FFBS-true  FFBS-big  EM-true  FFBS-true  FFBS-big  FFBS-Inf
Tiger [19]                2     2.1         1.50     0.70       0.41      -277     0.49       4.24      4.06
Shuttle [21]              8     2.1         3.56     1.02       1.82      10       10         10        10
Network [19]              7     4.36        4.82     1.09       1.56      1857     7267       6843      6508
Gridworld [19] (adapted)  26    7.36        3.57     2.48       59.1      -25      -51        -67       -13
Dialog [22] (adapted)     51    2           0.67     5.15       -         -3023    -1326      -         -1009

queries. All of these approaches assume that the number of underlying states is known; all but [7] focus on learning only the transition and observation models.

In many problems, however, the underlying number of states may not be known, or may require significant prior knowledge to model, and, from the perspective of performance, it is irrelevant. The iPOMDP model allows the agent to adaptively choose the complexity of the model; any expert knowledge is incorporated into the prior: for example, the Dirichlet counts on observation parameters can be used to give preference to certain observations as well as to encode whether we expect observations to have low or high noise. As seen in the results, the iPOMDP allows the complexity of the model to scale gracefully with the agent's experience.
Future work remains to tailor the planning to unbounded spaces and to refine the inference for POMDP resampling.

Past work has attempted to take advantage of structure in POMDPs [24, 25], but learning that structure has remained an open problem. By giving the agent an unbounded state space\u2014but strong locality priors\u2014the iPOMDP provides one principled framework for learning POMDP structure. Moreover, the hierarchical Dirichlet process construction described in section 2 can be extended to include more structure and deeper hierarchies in the transitions.

6 Conclusion

We presented the infinite POMDP, a new model for Bayesian RL in partially observable domains. The iPOMDP provides a principled framework for an agent to posit more complex models of its world as it gains more experience. By linking the complexity of the model to the agent's experience, the agent is not forced to consider large uncertainties\u2014which can be computationally prohibitive\u2014near the beginning of the planning process, but it can later come up with accurate models of the world when it requires them. An interesting direction may also be to apply these methods to learning large MDP models within the Bayes-Adaptive MDP framework [26].

References

[1] R. Dearden, N. Friedman, and D. Andre, "Model based Bayesian exploration," in UAI, pp. 150-159, 1999.
[2] M. Strens, "A Bayesian framework for reinforcement learning," in ICML, 2000.
[3] P. Poupart, N. Vlassis, J. Hoey, and K. Regan, "An analytic solution to discrete Bayesian reinforcement learning," in ICML, (New York, NY, USA), pp. 697-704, ACM Press, 2006.
[4] R. Jaulmes, J. Pineau, and D. Precup, "Learning in non-stationary partially observable Markov decision processes," ECML Workshop, 2005.
[5] S. Ross, B. Chaib-draa, and J.
Pineau, "Bayes-adaptive POMDPs," in Neural Information Processing Systems (NIPS), 2008.
[6] S. Ross, B. Chaib-draa, and J. Pineau, "Bayesian reinforcement learning in continuous POMDPs with application to robot navigation," in ICRA, 2008.
[7] F. Doshi, J. Pineau, and N. Roy, "Reinforcement learning with limited reinforcement: Using Bayes risk for active learning in POMDPs," in International Conference on Machine Learning, vol. 25, 2008.
[8] M. J. Beal, Z. Ghahramani, and C. E. Rasmussen, "The infinite hidden Markov model," in Advances in Neural Information Processing Systems 14, MIT Press, 2002.
[9] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, "Hierarchical Dirichlet processes," Journal of the American Statistical Association, vol. 101, no. 476, pp. 1566-1581, 2006.
[10] J. van Gael and Z. Ghahramani, Inference and Learning in Dynamic Models, ch. Nonparametric Hidden Markov Models. Cambridge University Press, 2010.
[11] Y. W. Teh, "Dirichlet processes." Submitted to Encyclopedia of Machine Learning, 2007.
[12] J. Pineau, G. Gordon, and S. Thrun, "Point-based value iteration: An anytime algorithm for POMDPs," IJCAI, 2003.
[13] M. T. J. Spaan and N. Vlassis, "Perseus: Randomized point-based value iteration for POMDPs," Journal of Artificial Intelligence Research, vol. 24, pp. 195-220, 2005.
[14] T. Smith and R. Simmons, "Heuristic search value iteration for POMDPs," in Proc. of UAI 2004, (Banff, Alberta), 2004.
[15] S. Ross, J. Pineau, S. Paquet, and B. Chaib-Draa, "Online planning algorithms for POMDPs," Journal of Artificial Intelligence Research, vol. 32, pp. 663-704, July 2008.
[16] J. van Gael, Y. Saatci, Y. W. Teh, and Z. Ghahramani, "Beam sampling for the infinite hidden Markov model," in ICML, vol. 25, 2008.
[17] R.
Neal, "Slice sampling," Annals of Statistics, vol. 31, pp. 705-767, 2003.
[18] C. K. Carter and R. Kohn, "On Gibbs sampling for state space models," Biometrika, vol. 81, pp. 541-553, September 1994.
[19] M. L. Littman, A. R. Cassandra, and L. P. Kaelbling, "Learning policies for partially observable environments: scaling up," ICML, 1995.
[20] D. McAllester and S. Singh, "Approximate planning for factored POMDPs using belief state simplification," in UAI 15, 1999.
[21] L. Chrisman, "Reinforcement learning with perceptual aliasing: The perceptual distinctions approach," in Proceedings of the Tenth National Conference on Artificial Intelligence, pp. 183-188, AAAI Press, 1992.
[22] F. Doshi and N. Roy, "Efficient model learning for dialog management," in Proceedings of Human-Robot Interaction (HRI 2007), (Washington, DC), March 2007.
[23] P. Poupart and N. Vlassis, "Model-based Bayesian reinforcement learning in partially observable domains," in ISAIM, 2008.
[24] J. Hoey, R. St-Aubin, A. Hu, and C. Boutilier, "SPUDD: Stochastic planning using decision diagrams," in UAI, pp. 279-288, 1999.
[25] A. P. Wolfe, "POMDP homomorphisms," in NIPS RL Workshop, 2006.
[26] M. O. Duff, Optimal learning: Computational procedures for Bayes-adaptive Markov decision processes. PhD thesis, 2002.