{"title": "A Bayesian Approach for Policy Learning from Trajectory Preference Queries", "book": "Advances in Neural Information Processing Systems", "page_first": 1133, "page_last": 1141, "abstract": "We consider the problem of learning control policies via trajectory preference queries to an expert. In particular, the learning agent can present an expert with short runs of a pair of policies originating from the same state and the expert then indicates the preferred trajectory. The agent's goal is to elicit a latent target policy from the expert with as few queries as possible. To tackle this problem we propose a novel Bayesian model of the querying process and introduce two methods that exploit this model to actively select expert queries. Experimental results on four benchmark problems indicate that our model can effectively learn policies from trajectory preference queries and that active query selection can be substantially more efficient than random selection.", "full_text": "A Bayesian Approach for Policy Learning from\n\nTrajectory Preference Queries\n\nAaron Wilson \u2217\nSchool of EECS\n\nOregon State University\n\nAlan Fern \u2020\nSchool of EECS\n\nOregon State University\n\nPrasad Tadepalli \u2021\nSchool of EECS\n\nOregon State University\n\nAbstract\n\nWe consider the problem of learning control policies via trajectory preference\nqueries to an expert. In particular, the agent presents an expert with short runs of\na pair of policies originating from the same state and the expert indicates which\ntrajectory is preferred. The agent\u2019s goal is to elicit a latent target policy from\nthe expert with as few queries as possible. To tackle this problem we propose\na novel Bayesian model of the querying process and introduce two methods that\nexploit this model to actively select expert queries. 
Experimental results on four benchmark problems indicate that our model can effectively learn policies from trajectory preference queries and that active query selection can be substantially more efficient than random selection.\n\n1 Introduction\n\nDirectly specifying desired behaviors for automated agents is a difficult and time-consuming process. Successful implementation requires expert knowledge of the target system and a means of communicating control knowledge to the agent. One way the expert can communicate the desired behavior is to directly demonstrate it and have the agent learn from the demonstrations, e.g. via imitation learning [15, 3, 13] or inverse reinforcement learning [12]. However, in some cases, like the control of complex robots or simulation agents, it is difficult to generate demonstrations of the desired behaviors. In these cases an expert may still recognize when an agent\u2019s behavior matches a desired behavior, or is close to it, even if it is difficult to demonstrate that behavior directly. In such cases an expert may also be able to judge how close each of a pair of example trajectories is to the desired behavior and express a preference for one or the other.\nGiven this motivation, we study the problem of learning expert policies via trajectory preference queries to an expert. A trajectory preference query (TPQ) is a pair of short state trajectories originating from a common state. Given a TPQ the expert is asked to indicate which trajectory is most similar to the target behavior. The goal of our learner is to infer the target policy using as few TPQs as possible. Our first contribution (Section 3) is to introduce a Bayesian model of the querying process along with an inference approach for sampling policies from the posterior given a set of TPQs and their expert responses. 
Our second contribution (Section 4) is to describe two active query strategies that attempt to leverage the model in order to minimize the number of queries required. Finally, our third contribution (Section 5) is to empirically demonstrate the effectiveness of the model and querying strategies on four benchmark problems.\nWe are not the first to examine preference learning for sequential decision making. In the work of Cheng et al. [5] action preferences were introduced into the classification-based policy iteration framework. In this framework preferences explicitly rank state-action pairs according to their relative payoffs. There is no explicit interaction between the agent and a domain expert. Further, the approach relies on knowledge of the reward function, while our work derives all information about the target policy by actively querying an expert. In work more closely related to ours, Akrour et al. [1] consider the problem of learning a policy from expert queries. Similar to our proposal, this work suggests presenting trajectory data to an informed expert. However, their queries require the expert to express preferences over approximate state visitation densities and to possess knowledge of the expected performance of demonstrated policies. Necessarily, the trajectories must be long enough to adequately approximate the visitation density. We remove this requirement and only require short demonstrations; our expert assesses trajectory snippets, not whole solutions. We believe this is valuable because pairs of short demonstrations are an intuitive and manageable object for experts to assess.\n\n\u2217wilsonaa@eecs.oregonstate.edu\n\u2020afern@eecs.oregonstate.edu\n\u2021tadepall@eecs.oregonstate.edu\n\n2 Preliminaries\nWe explore policy learning from expert preferences in the framework of Markov Decision Processes (MDP). 
An MDP is a tuple (S, A, T, P0, R) with state space S, action space A, and state transition distribution T, which gives the probability T(s, a, s\u2032) of transitioning to state s\u2032 given that action a is taken in state s. The initial state distribution P0(s0) gives a probability distribution over initial states s0. Finally, the reward function R(s) gives the reward for being in state s. Note that in this work the agent will not be able to observe rewards and rather must gather all information about the quality of policies via interaction with an expert. We consider agents that select actions using a policy \u03c0\u03b8 parameterized by \u03b8, which is a stochastic mapping from states to actions P\u03c0(a|s, \u03b8). For example, in our experiments, we use a log-linear policy representation, where the parameters correspond to coefficients of features defined over state-action pairs.\nAgents acting in an MDP experience the world as a sequence of state-action pairs called a trajectory. We denote a K-length trajectory as \u03be = (s0, a0, ..., aK\u22121, sK), beginning in state s0 and terminating after K steps. It follows from the definitions above that the probability of generating a K-length trajectory given that the agent executes policy \u03c0\u03b8 starting from state s0 is P(\u03be|\u03b8, s0) = \u220f_{t=1}^{K} T(st\u22121, at\u22121, st) P\u03c0(at\u22121|st\u22121, \u03b8). Trajectories are an important part of our query process. They are an intuitive means of communicating policy information. Trajectories have the advantage that the expert need not share a language with the agent. Instead the expert is only required to recognize differences in physical performances presented by the agent. 
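As a concrete illustration, the rollout process implied by this trajectory distribution can be sketched in Python. This is a minimal sketch, not code from the paper: `sample_trajectory`, `policy`, and `transition` are hypothetical stand-ins for sampling from P\u03c0(a|s, \u03b8) and T(s, a, .), and the toy dynamics are purely illustrative.

```python
import numpy as np

def sample_trajectory(s0, policy, transition, K, rng):
    """Roll out a K-step trajectory xi = (s0, a0, ..., a_{K-1}, sK).

    `policy(s, rng)` samples an action from P_pi(a|s, theta) and
    `transition(s, a, rng)` samples s' from T(s, a, .); both are
    hypothetical stand-ins for the paper's generative model.
    """
    states, actions = [np.asarray(s0, dtype=float)], []
    for _ in range(K):
        a = policy(states[-1], rng)
        actions.append(a)
        states.append(transition(states[-1], a, rng))
    return states, actions

# Toy example: 1-D noisy random-walk dynamics with a two-action policy.
rng = np.random.default_rng(0)
toy_policy = lambda s, rng: rng.choice([-1.0, 1.0])
toy_transition = lambda s, a, rng: s + a + 0.1 * rng.standard_normal(s.shape)
states, actions = sample_trajectory(np.zeros(1), toy_policy, toy_transition, K=10, rng=rng)
```

A K-step rollout yields K+1 states and K actions, matching the (s0, a0, ..., aK\u22121, sK) convention above.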
For purposes of generating trajectories we assume that our learner is provided with a strong simulator (or generative model) of the MDP dynamics, which takes as input a start state s, a policy \u03c0, and a value K, and outputs a sampled length-K trajectory of \u03c0 starting in s.\nIn this work, we evaluate policies in an episodic setting where an episode starts by drawing an initial state from P0 and then executing the policy for a finite horizon T. A policy\u2019s value is the expected total reward of an episode. The goal of the learner is to select a policy whose value is close to that of an expert\u2019s policy. Note that our work is not limited to finite-horizon problems, but can also be applied to infinite-horizon formulations.\nIn order to learn a policy, the agent presents trajectory preference queries (TPQs) to the expert and receives responses back. A TPQ is a pair of length-K trajectories (\u03bei, \u03bej) that originate from a common state s. Typically K will be much smaller than the horizon T, which is important from the perspective of expert usability. Having been provided with a TPQ, the expert gives a response y indicating which trajectory is preferred. Thus, each TPQ results in a training data tuple (\u03bei, \u03bej, y). Intuitively, the preferred trajectory is the one that is most similar to what the expert\u2019s policy would have produced from the same starting state. As detailed in the next section, this is modeled by assuming that the expert has a (noisy) evaluation function f(.) on trajectories and the response is then given by y = I(f(\u03bei) > f(\u03bej)) (a binary indicator). 
We assume that the expert\u2019s evaluation function is a function of the observed trajectories and a latent target policy \u03b8\u2217.\n\n3 Bayesian Model and Inference\n\nIn this section we first describe a Bayesian model of the expert response process, which will be used to: 1) infer policies based on expert responses to TPQs, and 2) guide the selection of TPQs. Next, we describe a posterior sampling method for this model which is used for both policy inference and TPQ selection.\n\n3.1 Expert Response Model\nThe model for the expert response y given a TPQ (\u03bei, \u03bej) decomposes as follows\n\nP(y|(\u03bei, \u03bej), \u03b8\u2217)P(\u03b8\u2217)\n\nwhere P(\u03b8\u2217) is a prior over the latent expert policy, and P(y|(\u03bei, \u03bej), \u03b8\u2217) is a response distribution conditioned on the TPQ and expert policy. In our experiments we use a ridge prior in the form of a Gaussian over \u03b8\u2217 with diagonal covariance, which penalizes policies with large parameter values.\nResponse Distribution. The conditional response distribution is represented in terms of an expert evaluation function f\u2217(\u03bei, \u03bej, \u03b8\u2217), described in detail below, which translates a TPQ and a candidate expert policy \u03b8\u2217 into a measure of preference for trajectory \u03bei over \u03bej. Intuitively, f\u2217 measures the degree to which the policy \u03b8\u2217 agrees with \u03bei relative to \u03bej. To translate the evaluation into an expert response we borrow from previous work [6]. In particular, we assume the expert response is given by the indicator I(f\u2217(\u03bei, \u03bej, \u03b8\u2217) > \u03b5) where \u03b5 \u223c N(0, \u03c3r\u00b2). The indicator simply returns 1 if the condition is true, indicating \u03bei is preferred, and zero otherwise. 
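The noisy-indicator response model just described can be sketched and sanity-checked in a few lines. This is an illustrative sketch, not the paper's code: `f_star` is treated as a given scalar preference value, and the Monte-Carlo version simply averages the indicator over draws of the noise \u03b5.

```python
import numpy as np
from math import erf, sqrt

def response_prob(f_star, sigma_r=1.0):
    """P(y=1) = Phi(f*/sigma_r): probability that the noisy indicator
    I(f* > eps), with eps ~ N(0, sigma_r^2), returns 1."""
    return 0.5 * (1.0 + erf(f_star / (sigma_r * sqrt(2.0))))

def response_prob_mc(f_star, sigma_r=1.0, n=200_000, seed=0):
    """Monte-Carlo check: draw eps ~ N(0, sigma_r^2) and average the
    indicator; this should agree with the closed form above."""
    eps = np.random.default_rng(seed).normal(0.0, sigma_r, size=n)
    return float(np.mean(f_star > eps))
```

Averaging the indicator over the noise recovers the Gaussian CDF form of the response distribution derived next.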
It follows that the conditional response distribution is given by:\n\nP(y = 1|(\u03bei, \u03bej), \u03b8\u2217) = \u222b_{\u2212\u221e}^{+\u221e} I(f\u2217(\u03bei, \u03bej, \u03b8\u2217) > \u03b5) N(\u03b5|0, \u03c3r\u00b2) d\u03b5 = \u03a6(f\u2217(\u03bei, \u03bej, \u03b8\u2217)/\u03c3r),\n\nwhere \u03a6(.) denotes the cumulative distribution function of the normal distribution. This formulation allows the expert to err when demonstrated trajectories are difficult to distinguish as measured by the magnitude of the evaluation function f\u2217. We now describe the evaluation function in more detail.\nEvaluation Function. Intuitively the evaluation function must combine distances between the query trajectories and trajectories generated by the latent target policy. We say that a latent policy and a query trajectory are in agreement when they produce similar trajectories. The dissimilarity between two trajectories \u03bei and \u03bej is measured by the trajectory dissimilarity function\n\nf(\u03bei, \u03bej) = \u2211_{t=0}^{K} k([si,t, ai,t], [sj,t, aj,t])\n\nwhere the variables [si,t, ai,t] represent the values of the state-action pair at time step t in trajectory i (similarly for [sj,t, aj,t]) and the function k computes distances between state-action pairs. In our experiments, states and actions are represented by real-valued vectors and we use a simple function of the form k([s, a], [s\u2032, a\u2032]) = \u2016s \u2212 s\u2032\u2016 + \u2016a \u2212 a\u2032\u2016, though other more sophisticated comparison functions could easily be used in the model.\nGiven the trajectory comparison function, we now encode a dissimilarity measure between the latent target policy and an observed trajectory \u03bei. To do this let \u03be\u2217 be a random variable ranging over length-K trajectories generated by target policy \u03b8\u2217 starting in the start state of \u03bei. 
The dissimilarity measure is given by:\n\nd(\u03bei, \u03b8\u2217) = E[f(\u03bei, \u03be\u2217)].\n\nThis function computes the expected dissimilarity between a query trajectory \u03bei and the K-length trajectories generated by the latent policy from the same initial state.\nFinally, the comparison function value f\u2217(\u03bei, \u03bej, \u03b8\u2217) = d(\u03bej, \u03b8\u2217) \u2212 d(\u03bei, \u03b8\u2217) is the difference in computed values between the ith and jth trajectory. Larger values of f\u2217 indicate stronger preferences for trajectory \u03bei.\n3.2 Posterior Inference\nGiven the definition of the response model, the prior distribution, and an observed data set D = {(\u03bei, \u03bej, y)} of TPQs and responses, the posterior distribution is\n\nP(\u03b8\u2217|D) \u221d P(\u03b8\u2217) \u220f_{(\u03bei,\u03bej,y)\u2208D} \u03a6(z)^y (1 \u2212 \u03a6(z))^{1\u2212y},\n\nwhere z = (d(\u03bej, \u03b8\u2217) \u2212 d(\u03bei, \u03b8\u2217))/\u03c3r. This posterior distribution does not have a simple closed form and we must approximate it.\nWe approximate the posterior distribution using a set of posterior samples which we generate using a stochastic simulation algorithm called Hybrid Monte Carlo (HMC) [8, 2]. The HMC algorithm is an example of a Markov Chain Monte Carlo (MCMC) algorithm. MCMC algorithms output a sequence of samples from the target distribution. 
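Before turning to HMC, the unnormalized log posterior that such samplers target can be sketched as follows. This is a hedged sketch, not the paper's implementation: `rollout` and `dissim` are hypothetical stand-ins for the simulator and the trajectory dissimilarity f, and the default hyperparameters mirror the values used later in the experiments (\u03c3r\u00b2 = 1, \u03c3\u00b2 = 2, \u00b5 = 0).

```python
import numpy as np
from math import erf, sqrt, log

def log_norm_cdf(z):
    # log Phi(z) for the probit likelihood (adequate for moderate |z|).
    return log(0.5 * (1.0 + erf(z / sqrt(2.0))) + 1e-300)

def expected_dissimilarity(xi, theta, rollout, dissim, n_samples, rng):
    """Monte-Carlo estimate of d(xi, theta) = E[f(xi, xi*)], where xi*
    is rolled out from theta at xi's start state. `rollout` and `dissim`
    are hypothetical stand-ins for the paper's simulator and f."""
    return np.mean([dissim(xi, rollout(theta, rng)) for _ in range(n_samples)])

def log_posterior(theta, data, rollout, dissim, sigma_r=1.0,
                  sigma=sqrt(2.0), mu=0.0, n_samples=20, rng=None):
    """Unnormalized log P(theta|D): Gaussian (ridge) prior plus the
    probit preference likelihood over tuples (xi_i, xi_j, y)."""
    rng = rng or np.random.default_rng(0)
    lp = -np.sum((theta - mu) ** 2) / (2.0 * sigma ** 2)  # ridge prior term
    for xi_i, xi_j, y in data:
        z = (expected_dissimilarity(xi_j, theta, rollout, dissim, n_samples, rng)
             - expected_dissimilarity(xi_i, theta, rollout, dissim, n_samples, rng)) / sigma_r
        # Phi(z)^y (1 - Phi(z))^(1-y), using 1 - Phi(z) = Phi(-z)
        lp += log_norm_cdf(z) if y == 1 else log_norm_cdf(-z)
    return lp
```

With an empty data set the expression reduces to the Gaussian prior term, which gives a quick consistency check.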
HMC has an advantage in our setting because it introduces auxiliary momentum variables proportional to the gradient of the posterior which guide the sampling process toward the modes of the posterior distribution.\nTo apply the HMC algorithm we must derive the gradient of the energy function \u2207\u03b8\u2217 log(P(D|\u03b8\u2217)P(\u03b8\u2217)) as follows:\n\n\u2202/\u2202\u03b8\u2217_i log[P(\u03b8\u2217|D)] = \u2202/\u2202\u03b8\u2217_i log[P(\u03b8\u2217)] + \u2211_{(\u03bei,\u03bej,y)\u2208D} \u2202/\u2202\u03b8\u2217_i log[\u03a6(z)^y (1 \u2212 \u03a6(z))^{1\u2212y}].\n\nThe energy function decomposes into prior and likelihood components. Using our assumption of a Gaussian prior with diagonal covariance on \u03b8\u2217, the partial derivative of the prior component with respect to \u03b8\u2217_i is\n\n\u2202/\u2202\u03b8\u2217_i log[P(\u03b8\u2217)] = \u2212(\u03b8\u2217_i \u2212 \u00b5)/\u03c3\u00b2.\n\nNext, consider the gradient of the data log likelihood,\n\n\u2211_{(\u03bei,\u03bej,y)\u2208D} \u2202/\u2202\u03b8\u2217_i log[\u03a6(z)^y (1 \u2212 \u03a6(z))^{1\u2212y}],\n\nwhich decomposes into |D| components, each of which has a value dependent on y. In what follows we will assume that y = 1 (it is straightforward to derive the second case). Recall that the function \u03a6(.) is the cumulative distribution function of N(z; 0, \u03c3r\u00b2). Therefore, the gradient of log(\u03a6(z)) is\n\n\u2202/\u2202\u03b8\u2217_i log[\u03a6(z)] = (1/\u03a6(z)) \u2202\u03a6(z)/\u2202\u03b8\u2217_i = (N(z; 0, \u03c3r\u00b2)/\u03a6(z)) \u2202z/\u2202\u03b8\u2217_i = (N(z; 0, \u03c3r\u00b2)/\u03a6(z)) (1/\u03c3r) (\u2202/\u2202\u03b8\u2217_i d(\u03bej, \u03b8\u2217) \u2212 \u2202/\u2202\u03b8\u2217_i d(\u03bei, \u03b8\u2217)).\n\nRecall the definition of d(\u03be, \u03b8\u2217) from above. After moving the derivative inside the integral, the gradient of this function is\n\n\u2202/\u2202\u03b8\u2217_i d(\u03be, \u03b8\u2217) = \u2202/\u2202\u03b8\u2217_i \u222b f(\u03be, \u03be\u2217) P(\u03be\u2217|\u03b8\u2217) d\u03be\u2217 = \u222b f(\u03be, \u03be\u2217) (\u2202/\u2202\u03b8\u2217_i P(\u03be\u2217|\u03b8\u2217)) d\u03be\u2217 = \u222b f(\u03be, \u03be\u2217) P(\u03be\u2217|\u03b8\u2217) \u2202/\u2202\u03b8\u2217_i log(P(\u03be\u2217|\u03b8\u2217)) d\u03be\u2217 = \u222b f(\u03be, \u03be\u2217) P(\u03be\u2217|\u03b8\u2217) \u2211_{k=1}^{K} \u2202/\u2202\u03b8\u2217_i log(P\u03c0(ak|sk, \u03b8\u2217)) d\u03be\u2217.\n\nThe final step follows from the definition of the trajectory density, which decomposes under the log transformation. For purposes of approximating the gradient this integral must be estimated. We do this by generating N sample trajectories from P(\u03be\u2217|\u03b8\u2217) and then computing the Monte-Carlo estimate (1/N) \u2211_{l=1}^{N} f(\u03be, \u03be\u2217_l) \u2211_{k=1}^{K} \u2202/\u2202\u03b8\u2217_i log(P\u03c0(ak|sk, \u03b8\u2217)). 
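This likelihood-ratio (score-function) form of the gradient can be checked on a toy problem where the expectation has a closed form. Everything below is illustrative and not from the paper: a single softmax action choice plays the role of a one-step "trajectory", and a fixed per-action cost stands in for f(\u03be, \u03be\u2217).

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.array([0.3, -0.2])   # logits of a toy 2-action softmax "policy"
costs = np.array([1.0, 3.0])    # stand-in for f(xi, xi*): a cost per sampled action

def probs(th):
    e = np.exp(th - th.max())
    return e / e.sum()

def exact_gradient(th):
    # d/d(theta) E[cost] = sum_i costs[i] * grad p_i, using
    # p_i * grad log p_i = grad p_i and grad log p_i = e_i - p.
    p = probs(th)
    grad = np.zeros_like(th)
    for i in range(len(th)):
        g = -p.copy()
        g[i] += 1.0
        grad += costs[i] * p[i] * g
    return grad

def score_gradient(th, n=50_000):
    # Monte-Carlo likelihood-ratio estimate, mirroring the form
    # (1/N) sum_l f(...) * sum_k grad log P_pi(a_k | s_k, theta).
    p = probs(th)
    a = rng.choice(len(th), size=n, p=p)
    onehot = np.eye(len(th))[a]
    return (costs[a][:, None] * (onehot - p)).mean(axis=0)
```

The sampled estimate converges to the exact gradient as N grows, which is the property the HMC energy gradient relies on.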
We leave the definition of log(P\u03c0(ak|sk, \u03b8\u2217)) for the experimental results section, where we describe a specific kind of stochastic policy space.\nGiven this gradient calculation, we can apply HMC in order to sample policy parameter vectors from the posterior distribution. This can be used for policy selection in a number of ways. For example, a policy could be formed via Bayesian averaging. In our experiments, we select a policy by generating a large set of samples and then selecting the sample maximizing the energy function.\n\n4 Active Query Selection\n\nGiven the ability to perform posterior inference, the question now is how to collect a data set of TPQs and their responses. Unlike many learning problems, there is no natural distribution over TPQs to draw from, and thus active selection of TPQs is essential. In particular, we want the learner to select TPQs for which the responses will be most useful toward the goal of learning the target policy. This selection problem is difficult due to the high-dimensional continuous space of TPQs, where each TPQ is defined by an initial state and two trajectories originating from that state. To help overcome this complexity our algorithm assumes the availability of a distribution \u02c6P0 over candidate start states of TPQs. This distribution is intended to generate start states that are feasible and potentially relevant to a target policy. The distribution may incorporate domain knowledge to rule out unimportant parts of the space (e.g. avoiding states where the bicycle has crashed) or simply specify bounds on each dimension of the state space and generate states uniformly within the bounds. Given this distribution, we consider two approaches to actively generating TPQs for the expert.\n\n4.1 Query by Disagreement\nOur first approach, Query by Disagreement (QBD), is similar to the well-known query-by-committee approach to active learning of classifiers [17, 9]. The main idea behind the basic query-by-committee approach is to generate a sequence of unlabeled examples from a given distribution and for each example sample a pair of classifiers from the current posterior. If the sampled classifiers disagree on the class of the example, then the algorithm queries the expert for the class label. This simple approach is often effective and has theoretical guarantees on its efficiency.\nWe can apply this general idea to select TPQs in a straightforward way. In particular, we generate a sequence of potential initial TPQ states from \u02c6P0 and for each draw two policies \u03b8i and \u03b8j from the current posterior distribution P(\u03b8\u2217|D). If the policies \u201cdisagree\u201d on the state, then a query is posed based on trajectories generated by the policies. Disagreement on an initial state s0 is measured according to the expected difference between K-length trajectories generated by \u03b8i and \u03b8j starting at s0. In particular, the disagreement measure is g = \u222b_{(\u03bei,\u03bej)} P(\u03bei|\u03b8i, s0, K) P(\u03bej|\u03b8j, s0, K) f(\u03bei, \u03bej), which we estimate by sampling a set of K-length trajectories from each policy. If this measure exceeds a threshold then a TPQ is generated and given to the expert by running each policy for K steps from the initial state. Otherwise a new initial state is generated. If no query is posed after a specified number of initial states, then the state and policy pair that generated the most disagreement are used to generate the TPQ. 
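A Monte-Carlo estimate of the disagreement measure g can be sketched as follows. This is an illustrative sketch, not the paper's code: `rollout` and `dissim` are hypothetical stand-ins for the simulator and the trajectory dissimilarity f, and the deterministic drifting toy policies exist only to make the estimate checkable.

```python
import numpy as np

def estimate_disagreement(s0, theta_i, theta_j, rollout, dissim, n_pairs, rng):
    """Monte-Carlo estimate of g = E[f(xi_i, xi_j)] for K-step
    trajectories of theta_i and theta_j starting at s0."""
    return float(np.mean([dissim(rollout(theta_i, s0, rng),
                                 rollout(theta_j, s0, rng))
                          for _ in range(n_pairs)]))

# Toy check: deterministic 1-D "policies" that drift at rate theta
# (rng is accepted but unused, to keep the rollout signature uniform).
def toy_rollout(theta, s0, rng, K=5):
    return [s0 + theta * t for t in range(K + 1)]

toy_dissim = lambda xi, xj: sum(abs(a - b) for a, b in zip(xi, xj))

rng = np.random.default_rng(0)
g_same = estimate_disagreement(0.0, 1.0, 1.0, toy_rollout, toy_dissim, 10, rng)
g_diff = estimate_disagreement(0.0, 1.0, -1.0, toy_rollout, toy_dissim, 10, rng)
```

Identical policies yield zero disagreement, while policies drifting in opposite directions yield a large value, which is exactly the signal QBD thresholds on.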
We set the threshold t so that \u03a6(t/\u03c3r) = .95.\nThis query strategy has the benefit of generating TPQs such that \u03bei and \u03bej are significantly different. This is important from a usability perspective, since making preference judgements between similar trajectories can be difficult and error-prone for an expert. In practice we observe that the QBD strategy often generates TPQs based on policy pairs that are from different modes of the distribution, which is an intuitively appealing property.\n\n4.2 Expected Belief Change\nAnother class of active learning approaches for classifiers is more selective than traditional query-by-committee. In particular, they either generate or are given an unlabeled dataset and then use a heuristic to select the most promising example to query from the entire set. Such approaches often outperform less selective approaches such as the traditional query-by-committee. In this same way, our second active learning approach for TPQs attempts to be more selective than the above QBD approach by generating a set of candidate TPQs and heuristically selecting the best among those candidates.\nA set of candidate TPQs is generated by first drawing an initial state from \u02c6P0, sampling a pair of policies from the posterior, and then running the policies for K steps from the initial state. It remains to define the heuristic used to select the TPQ for presentation to the expert.\nA truly Bayesian heuristic selection strategy should account for the overall change in belief about the latent target policy after adding a new data point. 
To represent the difference in posterior beliefs we use the variational distance between the posterior based on the current data D and the posterior based on the updated data D \u222a {(\u03bei, \u03bej, y)}:\n\nV(P(\u03b8|D) \u2016 P(\u03b8|D \u222a {(\u03bei, \u03bej, y)})) = \u222b |P(\u03b8|D) \u2212 P(\u03b8|D \u222a {(\u03bei, \u03bej, y)})| d\u03b8.\n\nBy integrating over the entire latent policy space it accounts for the total impact of the query on the agent\u2019s beliefs.\nThe value of the variational distance depends on the response to the TPQ, which is unobserved at query selection time. Therefore, the agent computes the expected variational distance,\n\nH(d) = \u2211_{y\u2208{0,1}} P(y|\u03bei, \u03bej, D) V(P(\u03b8|D) \u2016 P(\u03b8|D \u222a {(\u03bei, \u03bej, y)})),\n\nwhere P(y|\u03bei, \u03bej, D) = \u222b P(y|\u03bei, \u03bej, \u03b8\u2217) P(\u03b8\u2217|D) d\u03b8\u2217 is the predictive distribution and is straightforwardly estimated using a set of posterior samples.\nFinally, we specify a simple method of estimating the variational distance given a particular response. For this, we re-express the variational distance as an expectation with respect to P(\u03b8|D). Writing d for the candidate tuple (\u03bei, \u03bej, y),\n\nV(P(\u03b8|D) \u2016 P(\u03b8|D \u222a d)) = \u222b |P(\u03b8|D) \u2212 P(\u03b8|D \u222a d)| d\u03b8 = \u222b P(\u03b8|D) |1 \u2212 P(\u03b8|D \u222a d)/P(\u03b8|D)| d\u03b8 = \u222b P(\u03b8|D) |1 \u2212 P(d|\u03b8) z1/z2| d\u03b8,\n\nwhere z1 and z2 are the normalizing constants of the two posterior distributions. The final expression is a likelihood-weighted estimate of the variational distance. We can estimate this value using Monte-Carlo over a set S of policies sampled from the posterior,\n\nV(P(\u03b8|D) \u2016 P(\u03b8|D \u222a (\u03bei, \u03bej, y))) \u2248 (1/|S|) \u2211_{\u03b8\u2208S} |1 \u2212 P(d|\u03b8) z1/z2|.\n\nThis leaves the computation of the ratio of normalizing constants z1/z2, which we estimate using Monte-Carlo based on a sample set of policies from the prior distribution, hence avoiding further posterior sampling.\nOur basic strategy of using an information theoretic selection heuristic is similar to early work using Kullback-Leibler divergence [7] to measure the quality of experiments [11, 4]. Our approach differs in that we use a symmetric measure which directly computes differences in probability instead of expected differences in code lengths. The key disadvantage of this form of look-ahead query strategy (shared by other strategies of this kind) is the computational cost.\n\n5 Empirical Results\n\nBelow we outline our experimental setup and present our empirical results on four standard RL benchmark domains.\n5.1 Setup\nIf the posterior distribution focuses mass on the expert policy parameters, the expected value of the MAP parameters will converge to the expected value of the expert\u2019s policy. Therefore, to examine the speed of convergence to the desired expert policy we report the performance of the MAP policy in the MDP task. We choose the MAP policy, maximizing P(D|\u03b8)P(\u03b8), from the sample generated by our HMC routine. The expected return of the selected policy is estimated and reported. Note that no reward information is given to the learner; it is used for evaluation only.\nWe produce an automated expert capable of responding to the queries produced by our agent. 
The expert knows a target policy, and compares, as described above, the query trajectories generated by the agent to the trajectories generated by the target policy. The expert stochastically produces a response based on its evaluations. Target policies are hand designed and produce near-optimal performance in each domain.\nIn all experiments the agent executes a simple parametric policy, P(a|s, \u03b8) = exp(\u03c6(s)\u00b7\u03b8a) / \u2211_{b\u2208A} exp(\u03c6(s)\u00b7\u03b8b). The function \u03c6(s) is a set of features derived from the current state s. The complete parameter vector \u03b8 is decomposed into components \u03b8a associated with each action a. The policy is executed by sampling an action from this distribution. The gradient of this action selection policy can be derived straightforwardly and substituted into the gradient of the energy function required by our HMC procedure.\nWe use the following values for the unspecified model parameters: \u03c3r\u00b2 = 1, \u03c3\u00b2 = 2, \u00b5 = 0. The value of K used for TPQ trajectories was set to 10 for each domain except for Bicycle, for which we used K = 300. The Bicycle simulator uses a fine time scale, so that even K = 300 only corresponds to a few seconds of bike riding, which is quite reasonable for a TPQ.\nFor purposes of comparison we implement a simple random TPQ selection strategy (denoted Random in the graphs below). The random strategy draws an initial TPQ state from \u02c6P0 and then generates a trajectory pair by executing two policies drawn i.i.d. from the prior distribution P(\u03b8). Thus, this approach does not use information about past query responses when selecting TPQs.\n\nDomains. We consider the following benchmark domains.\nAcrobot. The acrobot task simulates a two-link under-actuated robot. One joint end, the \u201chands\u201d of the robot, rotates around a fixed point. 
The mid joint, associated with the \u201chips\u201d, attaches the upper and lower links of the robot. To change the joint angle between the upper and lower links the agent applies torque at the hip joint. The lower link swings freely. Our expert knows a policy for swinging the acrobot into a balanced handstand. The acrobot system is defined by four continuous state variables (\u03b81, \u03b82, \u02d9\u03b81, \u02d9\u03b82) representing the arrangement of the acrobot\u2019s joints and the velocities of the joint angles. The acrobot is controlled by a 12-dimensional softmax policy selecting between positive, negative, and zero torque to be applied at the hip joint. The feature vector \u03c6(s) returns the vector of state variables. The acrobot receives a penalty on each step proportional to the distance between the foot and the target position for the foot.\nMountain Car. The mountain car domain simulates an underpowered vehicle which the agent must drive to the top of a steep hill. The state of the mountain car system is described by the location of the car x and its velocity v. The goal of the agent controlling the mountain car system is to utilize the hills surrounding the car to generate sufficient energy to escape a basin. Our expert knows a policy for performing this escape. The agent\u2019s softmax control policy space has 16 dimensions and selects between positive and negative accelerations of the car. The feature vector \u03c6(s) returns a polynomial expansion (x, v, x\u00b2, x\u00b3, xv, x\u00b2v, x\u00b3v, v\u00b2) of the state. The agent receives a penalty for every step taken to reach the goal.\nCart Pole. In the cart-pole domain the agent attempts to balance a pole fixed to a movable cart while maintaining the cart\u2019s location in space. Episodes terminate if the pole falls or the cart leaves its specified boundary. 
The state space is composed of the cart velocity v, the change in cart velocity v\u2032, the angle of the pole \u03c9, and the angular velocity of the pole \u03c9\u2032. The control policy has 12 dimensions and selects the magnitude of the change in velocity (positive or negative) applied to the base of the cart. The feature vector returns the state of the cart-pole. The agent is penalized for pole positions deviating from upright and for movement away from the midpoint.\nBicycle Balancing. Agents in the bicycle balancing task must keep the bicycle balanced for 30000 steps. For our experiments we use the simulator originally introduced in [14]. The state of the bicycle is defined by four variables (\u03c9, \u02d9\u03c9, \u03bd, \u02d9\u03bd). The variable \u03c9 is the angle of the bicycle with respect to vertical, and \u02d9\u03c9 is its angular velocity. The variable \u03bd is the angle of the handlebars with respect to neutral, and \u02d9\u03bd is the angular velocity. The goal of the agent is to keep the bicycle from falling. Falling occurs when |\u03c9| > \u03c0/15. We borrow the same implementation used in [10], including the discrete action set, the 20-dimensional feature space, and the 100-dimensional policy. The agent selects from a discrete set of five actions. Each discrete action has two components. The first component is the torque applied to the handlebars T \u2208 {\u22121, 0, 1}, and the second component is the displacement of the rider in the saddle p \u2208 {\u2212.02, 0, .02}. From these components five action tuples are composed: a \u2208 {(\u22121, 0), (1, 0), (0, \u2212.02), (0, .02), (0, 0)}. The agent is penalized proportionally to the magnitude of \u03c9 at each step and receives a fixed penalty for falling.\nWe report the results of our experiments in Figure 1. 
Each graph gives the results for the TPQ selection strategies Random, Query-by-Disagreement (QBD), and Expected Belief Change (EBC). The average reward versus the number of queries is plotted for each selection strategy, where curves are averaged over 20 runs of learning.

Figure 1: Results: We report the expected return of the MAP policy, sampled during Hybrid MCMC simulation of the posterior, as a function of the number of expert queries. Results are averaged over 50 runs. Query trajectory lengths: Acrobot K = 10, Mountain-Car K = 10, Cart-Pole K = 20, Bicycle Balancing K = 300.

5.2 Experiment Results

In all domains the learning algorithm successfully learns the target policy, independent of the query selection procedure used. As can be seen, our algorithm can successfully learn even from queries posed by Random, which demonstrates the effectiveness of our HMC inference approach. Importantly, in some cases the active query selection heuristics significantly improve the rate of convergence compared to Random. The value of the query selection procedures is particularly high in the Mountain Car and Cart Pole domains. In the Mountain Car domain more than 500 Random queries were needed to match the performance of 50 EBC queries. In both of these domains, examining the generated query trajectories shows that the Random strategy tended to produce difficult-to-distinguish trajectory data, and later queries tended to resemble earlier queries. This is due to "plateaus" in the policy space which produce nearly identical behaviors. Intuitively, the information content of queries selected by Random decreases rapidly, leading to slower convergence. By comparison, the selection heuristics ensure that selected queries have a high impact on the posterior distribution and exhibit high query diversity.
The benefits of the active selection procedure diminish in the Acrobot and Bicycle domains.
In both of these domains active selection performs only slightly better than Random. This is not the first time active selection procedures have shown performance similar to passive methods [16]. In Acrobot all of the query selection procedures quickly converge to the target policy (only 25 queries are needed for Random to identify the target), so little improvement is possible over this result. Similarly, in the bicycle domain the performance results are difficult to distinguish. We believe this is due to the length of the query trajectories (300) and the importance of the initial state distribution. Most bicycle configurations lead to out-of-control spirals from which no policy can return the bicycle to balanced. In these configurations inputs from the agent have little impact on the observed state trajectory, making policies difficult to distinguish. To avoid these cases in Bicycle, the start state distribution P̂0 generated only initial states close to a balanced configuration. In these configurations poor balancing policies are easily distinguished from better policies, and the better policies are not rare. These factors lead Random to be quite effective in this domain.
Finally, comparing the active learning strategies, we see that EBC has a slight advantage over QBD in all domains other than Bicycle. This agrees with prior active learning work, where more selective strategies tend to be superior in practice. The price that EBC pays for the improved performance is computation time: it is about an order of magnitude slower.

6 Summary

We examined the problem of learning a target policy via trajectory preference queries. We formulated a Bayesian model for the problem and an algorithm for sampling from the posterior over policies. Two query selection methods were introduced, which heuristically select queries with the aim of efficiently identifying the target.
Experiments in four RL benchmarks indicate that our model and inference approach are able to infer high-quality policies and that the query selection methods are generally more effective than random selection.

Acknowledgments
We gratefully acknowledge the support of ONR under grant number N00014-11-1-0106.

References
[1] R. Akrour, M. Schoenauer, and M. Sebag. Preference-based policy learning. In Dimitrios Gunopulos, Thomas Hofmann, Donato Malerba, and Michalis Vazirgiannis, editors, Proc. ECML/PKDD'11, Part I, volume 6911 of Lecture Notes in Computer Science, pages 12–27. Springer, 2011.
[2] Christophe Andrieu, Nando de Freitas, Arnaud Doucet, and Michael I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50(1-2):5–43, 2003.
[3] Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, May 2009.
[4] J. M. Bernardo. Expected information as expected utility. Annals of Statistics, 7(3):686–690, 1979.
[5] Weiwei Cheng, Johannes Fürnkranz, Eyke Hüllermeier, and Sang-Hyeun Park. Preference-based policy iteration: Leveraging preference learning for reinforcement learning. In Proceedings of the 22nd European Conference on Machine Learning (ECML 2011), pages 312–327. Springer, 2011.
[6] Wei Chu and Zoubin Ghahramani. Preference learning with Gaussian processes. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pages 137–144, New York, NY, USA, 2005. ACM.
[7] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, New York, NY, USA, 1991.
[8] Simon Duane, A. D. Kennedy, Brian J. Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.
[9] Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133–168, 1997.
[10] Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4, 2003.
[11] D. V. Lindley. On a measure of the information provided by an experiment. The Annals of Mathematical Statistics, 27(4):986–1005, 1956.
[12] Andrew Y. Ng and Stuart J. Russell. Algorithms for inverse reinforcement learning. In ICML, pages 663–670, 2000.
[13] Bob Price and Craig Boutilier. Accelerating reinforcement learning through implicit imitation. Journal of Artificial Intelligence Research (JAIR), 19:569–629, 2003.
[14] Jette Randløv and Preben Alstrøm. Learning to drive a bicycle using reinforcement learning and shaping. In ICML, pages 463–471, 1998.
[15] Stefan Schaal. Learning from demonstration. In NIPS, pages 1040–1046, 1996.
[16] Andrew I. Schein and Lyle H. Ungar. Active learning for logistic regression: an evaluation. Machine Learning, 68(3):235–265, October 2007.
[17] H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT '92, pages 287–294, New York, NY, USA, 1992. ACM.