{"title": "Bayesian Policy Learning with Trans-Dimensional MCMC", "book": "Advances in Neural Information Processing Systems", "page_first": 665, "page_last": 672, "abstract": null, "full_text": "Trans-dimensional MCMC for Bayesian Policy Learning

Matt Hoffman, Dept. of Computer Science, University of British Columbia, hoffmanm@cs.ubc.ca
Arnaud Doucet, Depts. of Statistics and Computer Science, University of British Columbia, arnaud@cs.ubc.ca
Nando de Freitas, Dept. of Computer Science, University of British Columbia, nando@cs.ubc.ca
Ajay Jasra, Dept. of Mathematics, Imperial College London, ajay.jasra@imperial.ac.uk

Abstract

A recently proposed formulation of the stochastic planning and control problem as one of parameter estimation for suitable artificial statistical models has led to the adoption of inference algorithms for this notoriously hard problem. At the algorithmic level, the focus has been on developing Expectation-Maximization (EM) algorithms. In this paper, we begin by making the crucial observation that the stochastic control problem can be reinterpreted as one of trans-dimensional inference. With this new interpretation, we are able to propose a novel reversible jump Markov chain Monte Carlo (MCMC) algorithm that is more efficient than its EM counterparts. Moreover, it enables us to implement full Bayesian policy search, without the need for gradients and with one single Markov chain. The new approach involves sampling directly from a distribution that is proportional to the reward and, consequently, performs better than classic simulation methods in situations where the reward is a rare event.

1 Introduction

Continuous state-space Markov Decision Processes (MDPs) are notoriously difficult to solve. Except for a few rare cases, including linear Gaussian models with quadratic cost, there is no closed-form solution and approximations are required [4]. 
A large number of methods relying on value function approximation and policy search have been proposed in the literature, including [3, 10, 14, 16, 18]. In this paper, we follow the policy learning approach because of its promise and remarkable success in complex domains; see for example [13, 15]. Our work is strongly motivated by a recent formulation of stochastic planning and control problems as inference problems. This line of work appears to have been initiated in [5], where the authors used EM as an alternative to standard stochastic gradient algorithms to maximize an expected cost. In [2], a planning problem under uncertainty was solved using a Viterbi algorithm. This was later extended in [21]. In these works, the number of time steps to reach the goal was fixed and the plans were not optimal in expected reward. An important step toward surmounting these limitations was taken in [20, 19]. In these works, the standard discounted reward control problem was expressed in terms of an infinite mixture of MDPs. To make the problem tractable, the authors proposed to truncate the infinite horizon time.

Here, we make the observation that, in this probabilistic interpretation of stochastic control, the objective function can be written as the expectation of a positive function with respect to a trans-dimensional probability distribution, i.e. a probability distribution defined on a union of subspaces of different dimensions. By reinterpreting this function as an (artificial) marginal likelihood, it is easy to see that it can also be maximized using an EM-type algorithm in the spirit of [5]. However, the observation that we are dealing with a trans-dimensional distribution enables us to go beyond EM. 
We believe it creates many opportunities for exploiting a large body of sophisticated inference algorithms in the decision-making context.

In this paper, we propose a full Bayesian policy search alternative to the EM algorithm. In this approach, we set a prior distribution on the set of policy parameters and derive an artificial posterior distribution which is proportional to the prior times the expected reward. In the simpler context of myopic Bayesian experimental design, a similar method was developed in [11] and applied successfully to high-dimensional problems [12]. Our method can be interpreted as a trans-dimensional extension of [11]. We sample from the resulting artificial posterior distribution using a single trans-dimensional MCMC algorithm, which only involves a simple modification of the MCMC algorithm developed to implement the EM.

Although the Bayesian policy search approach can benefit from gradient information, it does not require gradients. Moreover, since the target is proportional to the expected reward, the simulation is guided to areas of high reward automatically. In the fixed policy case, the value function is often computed using importance sampling. In this context, our algorithm could be reinterpreted as an MCMC algorithm sampling from the optimal importance distribution.

2 Model formulation

We consider the following class of discrete-time Markov decision processes (MDPs):

$X_1 \sim \mu(\cdot)$
$X_n \mid (X_{n-1} = x, A_{n-1} = a) \sim f_a(\cdot \mid x)$
$R_n \mid (X_n = x, A_n = a) \sim g_a(\cdot \mid x)$
$A_n \mid (X_n = x, \theta) \sim \pi_\theta(\cdot \mid x),$   (1)

where $n = 1, 2, \ldots$ 
is a discrete-time index, $\mu(\cdot)$ is the initial state distribution, $\{X_n\}$ is the $\mathcal{X}$-valued state process, $\{A_n\}$ is the $\mathcal{A}$-valued action process, $\{R_n\}$ is a positive real-valued reward process, $f_a$ denotes the transition density, $g_a$ the reward density and $\pi_\theta$ is a randomized policy. If we have a deterministic policy then $\pi_\theta(a \mid x) = \delta_{\varphi_\theta(x)}(a)$. In this case, the transition model $f_a(\cdot \mid x)$ assumes the parametrization $f_\theta(\cdot \mid x)$. The reward model could also be parameterized as $g_\theta(\cdot \mid x)$. It should be noted that for this work we will be working within a model-based framework and as a result will require knowledge of the transition model (although it could be learned).

We are here interested in maximizing, with respect to the parameters $\theta$ of the policy, the expected future reward

$V^\pi_\mu(\theta) = E\left[\sum_{n=1}^{\infty} \gamma^{n-1} R_n\right],$

where $0 < \gamma < 1$ is a discount factor and the expectation is with respect to the probabilistic model defined in (1). As shown in [20], it is possible to re-write this objective of optimizing an infinite horizon discounted reward MDP (where the reward happens at each step) as one of optimizing an infinite mixture of finite horizon MDPs (where the reward only happens at the last time step).

In particular, we note that by introducing the trans-dimensional probability distribution on $\bigcup_k \{k\} \times \mathcal{X}^k \times \mathcal{A}^k \times \mathbb{R}^+$ given by

$p_\theta(k, x_{1:k}, a_{1:k}, r_k) = (1 - \gamma)\,\gamma^{k-1}\,\mu(x_1)\,g_{a_k}(r_k \mid x_k) \prod_{n=2}^{k} f_{a_{n-1}}(x_n \mid x_{n-1}) \prod_{n=1}^{k} \pi_\theta(a_n \mid x_n),$   (2)

we can easily rewrite $V^\pi_\mu(\theta)$ as an infinite mixture model of finite horizon MDPs, with the reward happening at the last horizon step, namely at $k$. Specifically we have

$V^\pi_\mu(\theta) = (1 - \gamma)^{-1} E_{p_\theta}[R_K] = (1 - \gamma)^{-1} \sum_{k=1}^{\infty} \int r_k\, p_\theta(k, x_{1:k}, a_{1:k}, r_k)\, dx_{1:k}\, da_{1:k}\, dr_k$   (3)

for a randomized policy. Similarly, for a deterministic policy, the representation (3) also holds for the trans-dimensional probability distribution defined on $\bigcup_k \{k\} \times \mathcal{X}^k \times \mathbb{R}^+$ given by

$p_\theta(k, x_{1:k}, r_k) = (1 - \gamma)\,\gamma^{k-1}\,\mu(x_1)\,g_\theta(r_k \mid x_k) \prod_{n=2}^{k} f_\theta(x_n \mid x_{n-1}).$   (4)

The representation (3) was also used in [6] to compute the value function through MCMC for a fixed $\theta$. In [20], this representation is exploited to maximize $V^\pi_\mu(\theta)$ using the EM algorithm which, applied to this problem, proceeds as follows at iteration $i$:

$\theta_i = \arg\max_{\theta \in \Theta} Q(\theta_{i-1}, \theta),$

where

$Q(\theta_{i-1}, \theta) = E_{\tilde{p}_{\theta_{i-1}}}[\log(R_K\, p_\theta(K, X_{1:K}, A_{1:K}, R_K))],$

$\tilde{p}_\theta(k, x_{1:k}, a_{1:k}, r_k) = \frac{r_k\, p_\theta(k, x_{1:k}, a_{1:k}, r_k)}{E_{p_\theta}[R_K]}.$

Unlike [20], we are interested in problems with potentially nonlinear and non-Gaussian properties. In these situations, the Q function cannot be calculated exactly. The standard Monte Carlo EM approach consists of sampling from $\tilde{p}_\theta(k, x_{1:k}, a_{1:k}, r_k)$ using MCMC to obtain a Monte Carlo estimate of the Q function. As $\tilde{p}_\theta(k, x_{1:k}, a_{1:k}, r_k)$ is proportional to the reward, the samples will consequently be drawn in regions of high reward. This is a particularly interesting feature in situations where the reward function is concentrated in a region of low probability mass under $p_\theta(k, x_{1:k}, r_k)$, which is often the case in high-dimensional control settings. Note that if we wanted to estimate $V^\pi_\mu(\theta)$ using importance sampling, then the distribution $\tilde{p}_\theta(k, x_{1:k}, a_{1:k}, r_k)$ corresponds to the optimal zero-variance importance distribution.

Alternatively, instead of sampling from $\tilde{p}_\theta(k, x_{1:k}, a_{1:k}, r_k)$ using MCMC, we could proceed as in [20] to derive forward-backward algorithms to implement the E-step, which can be implemented here using Sequential Monte Carlo (SMC) techniques. We have in fact done this using the smoothing algorithms proposed in [9]. However, we will focus the discussion on a different MCMC approach based on trans-dimensional simulation. As shown in the experiments, the latter does considerably better.

Finally, we remark that for a deterministic policy, we can introduce the trans-dimensional distribution

$\tilde{p}_\theta(k, x_{1:k}, r_k) = \frac{r_k\, p_\theta(k, x_{1:k}, r_k)}{E_{p_\theta}[R_K]}.$

In addition, and for ease of presentation only, we focus the discussion on deterministic policies and reward functions $g_\theta(r_n \mid x_n) = \delta_{r(x_n)}(r_n)$; the extension of our algorithms to the randomized case is straightforward.

3 Bayesian policy exploration

The EM algorithm is particularly sensitive to initialization and might get trapped in a severe local maximum of $V^\pi_\mu(\theta)$. Moreover, in the general state-space setting that we are considering, the particle smoothers in the E-step can be very expensive computationally.

To address these concerns, we propose an alternative full Bayesian approach. In the simpler context of experimental design, this approach was successfully developed in [11], [12]. The idea consists of introducing a vague prior distribution $p(\theta)$ on the parameters of the policy $\theta$. 
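As an aside, the mixture representation (3) is easy to check numerically. The sketch below uses a toy one-dimensional model with linear-Gaussian dynamics and an unnormalized Gaussian reward; the drift, noise scale, and function names are illustrative assumptions, not values from the paper. It estimates the value both directly as a truncated discounted sum and via the geometric-horizon mixture $(1 - \gamma)^{-1} E[r(X_K)]$, and the two estimates agree up to Monte Carlo error:

```python
import numpy as np

def simulate_chains(rng, n, T, drift=0.2, sigma=0.4):
    """Simulate n chains X_1..X_T: X_1 ~ N(0, 1), X_{t+1} = X_t + drift + sigma * noise.
    A toy one-dimensional model; the constants are illustrative assumptions."""
    X = np.empty((n, T))
    X[:, 0] = rng.normal(size=n)
    for t in range(1, T):
        X[:, t] = X[:, t - 1] + drift + sigma * rng.normal(size=n)
    return X

def value_estimates(gamma=0.8, n=200_000, seed=0):
    """Estimate V two ways: directly as a truncated discounted sum, and via the
    mixture representation V = (1 - gamma)^{-1} E[r(X_K)], K ~ Geometric(1 - gamma)."""
    rng = np.random.default_rng(seed)
    T = 80                                # truncation horizon: gamma**T is negligible
    X = simulate_chains(rng, n, T)
    R = np.exp(-(X - 1.0) ** 2)           # unnormalized Gaussian reward about m = 1
    v_direct = float((R * gamma ** np.arange(T)).sum(axis=1).mean())
    # P(K = k) = (1 - gamma) * gamma**(k - 1) on {1, 2, ...}, clipped to the horizon.
    K = np.minimum(rng.geometric(1.0 - gamma, size=n), T)
    v_mixture = float(R[np.arange(n), K - 1].mean() / (1.0 - gamma))
    return v_direct, v_mixture
```

Running `value_estimates()` returns two estimates of the same quantity; with this many samples they typically differ by well under 0.05.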
We then define the new artificial probability distribution on $\Theta \times \bigcup_k \{k\} \times \mathcal{X}^k$ given by

$p(\theta, k, x_{1:k}) \propto r(x_k)\, p_\theta(k, x_{1:k})\, p(\theta).$

By construction, this target distribution admits the following marginal in $\theta$:

$p(\theta) \propto V^\pi_\mu(\theta)\, p(\theta),$

and we can select an improper prior distribution $p(\theta) \propto 1$ if $\int_\Theta V^\pi_\mu(\theta)\, d\theta < \infty$.

If we could sample from $p(\theta)$, then the generated samples $\{\theta^{(i)}\}$ would concentrate themselves in regions where $V^\pi_\mu(\theta)$ is large. We cannot sample from $p(\theta)$ directly, but we can develop a trans-dimensional MCMC algorithm which will asymptotically generate samples from $p(\theta, k, x_{1:k})$, and hence samples from $p(\theta)$.

Our algorithm proceeds as follows. Assume the current state of the Markov chain targeting $p(\theta, k, x_{1:k})$ is $(\theta, k, x_{1:k})$. We first propose to update the components $(k, x_{1:k})$ conditional upon $\theta$ using a combination of birth, death and update moves within the reversible jump MCMC framework [7, 8, 17]. Then we propose to update $\theta$ conditional upon the current value of $(k, x_{1:k})$. This can be achieved using a simple Metropolis-Hastings algorithm or a more sophisticated dynamic Monte Carlo scheme. For example, if gradient information is available, one could adopt Langevin diffusions or the hybrid Monte Carlo algorithm [1]. The overall algorithm is depicted in Figure 1. The details of the reversible jump algorithm are presented in the following section.

1. Initialization: set $(k^{(0)}, x^{(0)}_{1:k^{(0)}}, \theta^{(0)})$.

2. 
For $i = 0$ to $N - 1$:

- Sample $u \sim \mathcal{U}_{[0,1]}$.
- If $u \le b_k$, then carry out a "birth" move: increase the horizon length of the MDP, say $k^{(i)} = k^{(i-1)} + 1$, and insert a new state.
- Else if $u \le b_k + d_k$, then carry out a "death" move: decrease the horizon length of the MDP, say $k^{(i)} = k^{(i-1)} - 1$, and remove an existing state.
- Else let $k^{(i)} = k^{(i-1)}$ and generate samples $x^{(i)}_{1:k^{(i)}}$ of the MDP states.
- Sample the policy parameters $\theta^{(i)}$ conditional on the samples $(x^{(i)}_{1:k^{(i)}}, k^{(i)})$.

Figure 1: Generic reversible jump MCMC for Bayesian policy learning.

We note that for a given $\theta$ the samples of the states and horizon generated by this Markov chain will also be distributed (asymptotically) according to the trans-dimensional distribution $\tilde{p}_\theta(k, x_{1:k})$. Hence, they can be easily adapted to generate a Monte Carlo estimate of $Q(\theta_{i-1}, \theta)$. This allows us to side-step the need for expensive smoothing algorithms in the E-step. The trans-dimensional simulation approach has the advantage that the samples will concentrate themselves automatically in regions where $\tilde{p}_\theta(k)$ has high probability mass. Moreover, unlike in the EM framework, it is no longer necessary to truncate the time domain.

4 Trans-Dimensional Markov chain Monte Carlo

We present a simple reversible jump method composed of two reversible moves (birth and death) and several update moves. Assume the current state of the Markov chain targeting $\tilde{p}_\theta(k, x_{1:k})$ is $(k, x_{1:k})$. With probability $b_k$, we propose a birth move; that is, we sample a location uniformly, i.e. $J \sim \mathcal{U}\{1, \ldots, k+1\}$, and propose the candidate $(k+1, x_{1:j-1}, x^*, x_{j:k})$ where $X^* \sim q_\theta(\cdot \mid x_{j-1:j})$. 
This candidate is accepted with probability $A_{\text{birth}} = \min\{1, \alpha_{\text{birth}}\}$, where for $j \in \{2, \ldots, k-1\}$

$\alpha_{\text{birth}} = \frac{\tilde{p}_\theta(k+1, x_{1:j-1}, x^*, x_{j:k})\, d_{k+1}}{\tilde{p}_\theta(k, x_{1:k})\, b_k\, q_\theta(x^* \mid x_{j-1:j})} = \frac{\gamma\, f_\theta(x^* \mid x_{j-1})\, f_\theta(x_j \mid x^*)\, d_{k+1}}{f_\theta(x_j \mid x_{j-1})\, b_k\, q_\theta(x^* \mid x_{j-1:j})},$

for $j = 1$

$\alpha_{\text{birth}} = \frac{\gamma\, \mu(x^*)\, f_\theta(x_1 \mid x^*)\, d_{k+1}}{\mu(x_1)\, b_k\, q_\theta(x^* \mid x_1)},$

and for $j = k + 1$

$\alpha_{\text{birth}} = \frac{\gamma\, r(x^*)\, f_\theta(x^* \mid x_k)\, d_{k+1}}{r(x_k)\, b_k\, q_\theta(x^* \mid x_k)}.$

With probability $d_k$, we propose a death move; that is, $J \sim \mathcal{U}\{1, \ldots, k\}$ and we propose the candidate $(k-1, x_{1:j-1}, x_{j+1:k})$, which is accepted with probability $A_{\text{death}} = \min\{1, \alpha_{\text{death}}\}$, where for $j \in \{2, \ldots, k-1\}$

$\alpha_{\text{death}} = \frac{\tilde{p}_\theta(k-1, x_{1:j-1}, x_{j+1:k})\, b_{k-1}\, q_\theta(x_j \mid x_{j-1:j+1})}{\tilde{p}_\theta(k, x_{1:k})\, d_k} = \frac{f_\theta(x_{j+1} \mid x_{j-1})\, b_{k-1}\, q_\theta(x_j \mid x_{j-1:j+1})}{\gamma\, f_\theta(x_{j+1} \mid x_j)\, f_\theta(x_j \mid x_{j-1})\, d_k},$

for $j = 1$

$\alpha_{\text{death}} = \frac{\mu(x_2)\, q_\theta(x_1 \mid x_2)\, b_{k-1}}{\gamma\, \mu(x_1)\, f_\theta(x_2 \mid x_1)\, d_k},$

and for $j = k$

$\alpha_{\text{death}} = \frac{r(x_{k-1})\, q_\theta(x_k \mid x_{k-1})\, b_{k-1}}{\gamma\, r(x_k)\, f_\theta(x_k \mid x_{k-1})\, d_k}.$

(In practice we can set the birth and death probabilities such that $b_k = d_k = u_k = 1/3$.)

The $\alpha_{\text{birth}}$ and $\alpha_{\text{death}}$ terms derived above can be thought of as ratios between the distribution over the newly proposed state of the chain (i.e. after the birth/death) and the current state. 
These terms must also ensure reversibility and the dimension-matching requirement for reversible jump MCMC. For more information see [7, 8].

Finally, with probability $u_k = 1 - b_k - d_k$, we propose a standard (fixed dimensional) move where we update all or a subset of the components $x_{1:k}$ using, say, Metropolis-Hastings or Gibbs moves. There are many design possibilities for these moves. In general, one should block some of the variables so as to improve the mixing time of the Markov chain. If one adopts a simple one-at-a-time Metropolis-Hastings scheme with proposals $q_\theta(x^* \mid x_{j-1:j+1})$ to update the $j$-th term, then the candidate is accepted with probability $A_{\text{update}} = \min\{1, \alpha_{\text{update}}\}$, where for $j \in \{2, \ldots, k-1\}$

$\alpha_{\text{update}} = \frac{\tilde{p}_\theta(k, x_{1:j-1}, x^*, x_{j+1:k})\, q_\theta(x_j \mid x_{j-1}, x^*, x_{j+1})}{\tilde{p}_\theta(k, x_{1:k})\, q_\theta(x^* \mid x_{j-1:j+1})} = \frac{f_\theta(x^* \mid x_{j-1})\, f_\theta(x_{j+1} \mid x^*)\, q_\theta(x_j \mid x_{j-1}, x^*, x_{j+1})}{f_\theta(x_j \mid x_{j-1})\, f_\theta(x_{j+1} \mid x_j)\, q_\theta(x^* \mid x_{j-1:j+1})},$

for $j = 1$

$\alpha_{\text{update}} = \frac{\mu(x^*)\, f_\theta(x_2 \mid x^*)\, q_\theta(x_1 \mid x^*, x_2)}{\mu(x_1)\, f_\theta(x_2 \mid x_1)\, q_\theta(x^* \mid x_{1:2})},$

and for $j = k$

$\alpha_{\text{update}} = \frac{r(x^*)\, f_\theta(x^* \mid x_{k-1})\, q_\theta(x_k \mid x^*, x_{k-1})}{r(x_k)\, f_\theta(x_k \mid x_{k-1})\, q_\theta(x^* \mid x_{k-1:k})}.$

Under weak assumptions on the model, the Markov chain $\{K^{(i)}, X^{(i)}_{1:K}\}$ generated by this transition kernel will be irreducible and aperiodic, and hence will asymptotically generate samples from the target distribution $\tilde{p}_\theta(k, x_{1:k})$.

We emphasize that in many applications the structure of the distributions $\tilde{p}_\theta(x_{1:k} \mid k)$ will not vary significantly with $k$, and we often have $\tilde{p}_\theta(x_{1:k} \mid k) \approx \tilde{p}_\theta(x_{1:k} \mid k+1)$. Hence the probability of having the reversible moves accepted will be reasonable. 
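To make the mechanics of these moves concrete, here is a minimal sketch of a birth/death/update sampler targeting a toy one-dimensional instance of $\tilde{p}_\theta(k, x_{1:k})$ with $\theta$ held fixed (the $\theta$-update of Figure 1 is omitted). The Gaussian dynamics, reward, proposal scale, and function names are illustrative assumptions rather than choices made in the paper. With $b_k = d_k = 1/3$ the $b/d$ factors cancel, and because the location $J$ is drawn uniformly in both directions the location probabilities cancel too, so the log-acceptance ratios reduce to target and proposal terms as in the formulas above:

```python
import numpy as np

# Toy instance of the target p~_theta(k, x_{1:k}) for a fixed theta:
# mu = N(0, 1), f(x'|x) = N(x' | x + DRIFT, SIGMA^2), r(x) = exp(-(x - 1)^2).
# All constants are illustrative assumptions, not values from the paper.
GAMMA, DRIFT, SIGMA, PROP_S = 0.8, 0.2, 0.4, 0.4

def log_target(x):
    """log p~(k, x_{1:k}) up to a k-independent constant, with k = len(x)."""
    k = len(x)
    lp = -(x[-1] - 1.0) ** 2                            # log r(x_k)
    lp += (k - 1) * np.log(GAMMA) - 0.5 * x[0] ** 2     # gamma^{k-1} and mu(x_1)
    for n in range(1, k):                               # transition densities; the
        lp += (-0.5 * ((x[n] - x[n - 1] - DRIFT) / SIGMA) ** 2
               - np.log(SIGMA * np.sqrt(2 * np.pi)))    # normalizer matters: it varies with k
    return lp

def anchor(x, j):
    """Mean of the Gaussian birth proposal when inserting at slot j of x."""
    if j == 0:
        return x[0]
    if j == len(x):
        return x[-1]
    return 0.5 * (x[j - 1] + x[j])

def log_q(x_star, a):
    return -0.5 * ((x_star - a) / PROP_S) ** 2 - np.log(PROP_S * np.sqrt(2 * np.pi))

def rjmcmc(n_iters=8000, seed=1):
    rng = np.random.default_rng(seed)
    x, ks = [rng.normal()], []                   # start with horizon k = 1
    for _ in range(n_iters):
        u, k = rng.random(), len(x)
        if u < 1 / 3:                            # birth move (b_k = 1/3)
            j = int(rng.integers(0, k + 1))      # uniform location probabilities cancel
            xs = anchor(x, j) + PROP_S * rng.normal()
            x_new = x[:j] + [xs] + x[j:]
            log_a = log_target(x_new) - log_target(x) - log_q(xs, anchor(x, j))
        elif u < 2 / 3 and k > 1:                # death move (d_k = 1/3)
            j = int(rng.integers(0, k))
            x_new = x[:j] + x[j + 1:]
            log_a = log_target(x_new) - log_target(x) + log_q(x[j], anchor(x_new, j))
        else:                                    # fixed-dimension random-walk update
            j = int(rng.integers(0, k))
            x_new = list(x)
            x_new[j] = x[j] + PROP_S * rng.normal()
            log_a = log_target(x_new) - log_target(x)
        if np.log(rng.random()) < log_a:
            x = x_new
        ks.append(len(x))
    return ks
```

Note that the transition-density normalizer is kept inside `log_target` because the number of transition factors changes with $k$; dropping it would bias the marginal over the horizon.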
Standard Bayesian applications of reversible jump MCMC usually do not enjoy this property, and it makes it more difficult to design fast mixing algorithms. In this respect, our problem is easier.

5 Experiments

It should be noted from the outset that the results presented in this paper are preliminary, and serve mainly as an illustration of the Monte Carlo algorithms presented earlier. With that note aside, even these simple examples will give us some intuition about the algorithms' performance and behavior.

Figure 2: This figure shows an illustration of the 2d state-space described in section 5. Ten sample points are shown distributed according to $\mu$, the initial distribution, and the contour plot corresponds to the reward function $r$. The red line denotes the policy parameterized by some angle $\theta$, while a path is drawn in blue sampled from this policy.

We are also very optimistic as to the possible applications of analytic expressions for linear Gaussian models, but space has not allowed us to present simulations for this class of models here.

We will consider state- and action-spaces $\mathcal{X} = \mathcal{A} = \mathbb{R}^2$ such that each state $x \in \mathcal{X}$ is a 2d position and each action $a \in \mathcal{A}$ is a vector corresponding to a change in position. A new state at time $n$ is given by $X_n = X_{n-1} + A_{n-1} + \nu_{n-1}$, where $\nu_{n-1}$ denotes zero-mean Gaussian noise. Finally, we will let $\mu$ be a normal distribution about the origin, and consider a reward (as in [20]) given by an unnormalized Gaussian about some point $m$, i.e. $r(x) = \exp(-\frac{1}{2}(x - m)^T \Sigma^{-1}(x - m))$. An illustration of this space can be seen in Figure 2, where $m = (1, 1)$.

For these experiments we chose a simple, stochastic policy parameterized by $\theta \in [0, 2\pi]$. 
Under this policy, an action $A_n = (w + \delta) \cdot (\cos(\theta + \omega), \sin(\theta + \omega))$ is taken, where $\delta$ and $\omega$ are normally distributed random variables and $w$ is some (small) constant step-length. Intuitively, this policy corresponds to choosing a direction $\theta$ in which the agent will walk. While unrealistic from a real-world perspective, this gives us a way to easily evaluate and plot the convergence of our algorithm. For a state-space with initial distribution and reward function defined as in Figure 2, the optimal policy corresponds to $\theta = \pi/4$.

We first implemented a simple SMC-based extension of the EM algorithm described in [20], wherein a particle filter was used for the forwards/backwards filters. The plots in Figure 3 compare the SMC-based and trans-dimensional approaches on this synthetic example. Here the inferred value of $\theta$ is shown against CPU time, averaged over 5 runs. The first thing of note is the terrible performance of the SMC-based algorithm; in fact, we had to make the reward broader and closer to the initial position in order to ensure that the algorithm converges in a reasonable amount of time. This comes as no surprise considering the $O(N^2 k_{\max}^2)$ time complexity necessary for computing the importance weights. While there do exist methods [9] for reducing this complexity to $O(N \log N\, k_{\max}^2)$, the discrepancy between this and the reversible jump MCMC method suggests that the MCMC approach may be better suited to this class of problems. In the finite/discrete case it is also possible, as shown by Toussaint et al. (2006), to reduce the $k_{\max}^2$ term to $k_{\max}$ by calculating updates only using messages from the backwards recursion. The SMC method might further be improved by better choices for the artificial distribution $\eta_n(x_n)$ in the backwards filter. 
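The dynamics and directional policy described above can be sketched as follows. The specific noise scales, step length $w$, horizon, and function names are illustrative assumptions (the paper does not report these constants); evaluating the mean final-step reward at a few angles confirms that $\theta = \pi/4$ steers the agent toward $m = (1, 1)$:

```python
import numpy as np

def rollout(theta, k, rng, w=0.1, noise=0.02, m=(1.0, 1.0)):
    """Roll out the directional policy for k steps.

    A_n = (w + delta) * (cos(theta + omega), sin(theta + omega)) with delta, omega
    Gaussian, and X_n = X_{n-1} + A_{n-1} + nu.  Returns the final state and its
    reward r(x) = exp(-0.5 * ||x - m||^2) (Sigma = I, an illustrative assumption)."""
    x = rng.normal(0.0, 0.1, size=2)          # X_1 ~ mu, concentrated near the origin
    for _ in range(k):
        delta, omega = 0.2 * w * rng.normal(), 0.1 * rng.normal()
        a = (w + delta) * np.array([np.cos(theta + omega), np.sin(theta + omega)])
        x = x + a + noise * rng.normal(size=2)
    d = x - np.asarray(m)
    return x, float(np.exp(-0.5 * d @ d))

def mean_reward(theta, k=15, n=2000, seed=0):
    """Monte Carlo estimate of the expected final-step reward under angle theta."""
    rng = np.random.default_rng(seed)
    return float(np.mean([rollout(theta, k, rng)[1] for _ in range(n)]))
```

With these settings, `mean_reward(np.pi / 4)` is close to 1, while angles pointing away from $m$ score progressively lower.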
In this problem we used a vague Gaussian centered on the relevant state-space. It is however possible that any added benefit from a more informative $\eta$ distribution is counterbalanced by the time required to calculate this $\eta$, for example by simulating particles forward in order to find the invariant distribution, etc.

Also shown in Figure 3 is the performance of a Monte Carlo EM algorithm using reversible jump MCMC in the E-step. Both this and the fully Bayesian approach perform comparably, although the fully Bayesian approach shows less in-run variance, as well as less variance between runs. The EM algorithm was also more sensitive, and we were forced to increase the number of samples N used by the E-step as the algorithm progressed, as well as controlling the learning rate with a smoothing parameter. For higher dimensional and/or larger models it is not inconceivable that this could have an adverse effect on the algorithm's performance.

Figure 3: The left figure shows estimates for the policy parameter $\theta$ as a function of the CPU time used to calculate that value. This data is shown for the three discussed Monte Carlo algorithms as applied to a synthetic example and has been averaged over five runs; error bars are shown for the SMC-based EM algorithm. Because of the poor performance of the SMC-based algorithm it is difficult to compare the performance of the other two algorithms using only this plot. The right figure shows a smoothed and "zoomed" version of the left plot in order to show the reversible-jump EM algorithm and the fully Bayesian algorithm in more detail. In both plots a red line denotes the known optimal policy parameter of $\pi/4$.

Finally, we also compared the proposed Bayesian policy exploration method to the PEGASUS [14] approach using a local search method. We initially tried using a policy-gradient approach, but because of the very highly-peaked rewards the gradients become very poorly scaled and would have required more tuning. As shown in Figure 4, the Bayesian strategy is more efficient in this rare event setting. As the dimension of the state-space increases, we expect this difference to become even more pronounced.

6 Discussion

We believe that formulating stochastic control as a trans-dimensional inference problem is fruitful. This formulation relies on minimal assumptions and allows us to apply modern inference algorithms to solve control problems. We have focused here on Monte Carlo methods and have presented, to the best of our knowledge, the first application of reversible jump MCMC to policy search. Our results, on an illustrative example, showed that this trans-dimensional MCMC algorithm is more effective than standard policy search methods and alternative Monte Carlo methods relying on particle filters. However, this methodology remains to be tested on high-dimensional problems. For such scenarios, we expect that it will be necessary to develop more efficient MCMC strategies to explore the policy space efficiently.

References

[1] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50:5-43, 2003.

[2] H. Attias. Planning by probabilistic inference. In Uncertainty in Artificial Intelligence, 2003.

[3] J. 
Baxter and P. L. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319-350, 2001.

[4] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 1995.

[5] P. Dayan and G. E. Hinton. Using EM for reinforcement learning. Neural Computation, 9:271-278, 1997.

Figure 4: Convergence of PEGASUS and our Bayesian policy search algorithm when started from $\theta = 0$ and converging to the optimum of $\theta^* = \pi/4$. The plots are averaged over 10 runs. For our algorithm we plot samples taken directly from the MCMC algorithm itself: plotting the empirical average would produce an estimate whose convergence is almost immediate, but we also wanted to show the "burn-in" period. For both algorithms lines denoting one standard deviation are shown and performance is plotted against the number of samples taken from the transition model.

[6] A. Doucet and V. B. Tadic. On solving integral equations using Markov chain Monte Carlo methods. Technical Report CUED-F-INFENG 444, Cambridge University Engineering Department, 2004.

[7] P. J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82:711-732, 1995.

[8] P. J. Green. Trans-dimensional Markov chain Monte Carlo. In Highly Structured Stochastic Systems, 2003.

[9] M. Klaas, M. Briers, N. de Freitas, A. Doucet, and S. Maskell. Fast particle smoothing: If I had a million particles. In International Conference on Machine Learning, 2006.

[10] G. Lawrence, N. Cowan, and S. Russell. 
Efficient gradient estimation for motor control learning. In Uncertainty in Artificial Intelligence, pages 354-36, 2003.

[11] P. Müller. Simulation based optimal design. Bayesian Statistics, 6, 1999.

[12] P. Müller, B. Sansó, and M. De Iorio. Optimal Bayesian design by inhomogeneous Markov chain simulation. J. American Stat. Assoc., 99:788-798, 2004.

[13] A. Ng, A. Coates, M. Diel, V. Ganapathi, J. Schulte, B. Tse, E. Berger, and E. Liang. Inverted autonomous helicopter flight via reinforcement learning. In International Symposium on Experimental Robotics, 2004.

[14] A. Y. Ng and M. I. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Uncertainty in Artificial Intelligence, 2000.

[15] J. Peters and S. Schaal. Policy gradient methods for robotics. In IEEE International Conference on Intelligent Robots and Systems, 2006.

[16] M. Porta, N. Vlassis, M. T. J. Spaan, and P. Poupart. Point-based value iteration for continuous POMDPs. Journal of Machine Learning Research, 7:2329-2367, 2006.

[17] S. Richardson and P. J. Green. On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society B, 59(4):731-792, 1997.

[18] S. Thrun. Monte Carlo POMDPs. In S. Solla, T. Leen, and K.-R. Müller, editors, Neural Information Processing Systems, pages 1064-1070. MIT Press, 2000.

[19] M. Toussaint, S. Harmeling, and A. Storkey. Probabilistic inference for solving (PO)MDPs. Technical Report EDI-INF-RR-0934, University of Edinburgh, School of Informatics, 2006.

[20] M. Toussaint and A. Storkey. Probabilistic inference for solving discrete and continuous state Markov decision processes. In International Conference on Machine Learning, 2006.

[21] D. Verma and R. P. N. Rao. Planning and acting in uncertain environments using probabilistic inference. In IEEE/RSJ Int. Conf. 
on Intelligent Robots and Systems, 2006.", "award": [], "sourceid": 1121, "authors": [{"given_name": "Matthew", "family_name": "Hoffman", "institution": null}, {"given_name": "Arnaud", "family_name": "Doucet", "institution": null}, {"given_name": "Nando", "family_name": "Freitas", "institution": null}, {"given_name": "Ajay", "family_name": "Jasra", "institution": null}]}