{"title": "Algorithms for Learning Markov Field Policies", "book": "Advances in Neural Information Processing Systems", "page_first": 2177, "page_last": 2185, "abstract": "We present a new graph-based approach for incorporating domain knowledge in reinforcement learning applications. The domain knowledge is given as a weighted graph, or a kernel matrix, that loosely indicates which states should have similar optimal actions. We first introduce a bias into the policy search process by deriving a distribution on policies such that policies that disagree with the provided graph have low probabilities. This distribution corresponds to a Markov Random Field. We then present a reinforcement and an apprenticeship learning algorithms for finding such policy distributions. We also illustrate the advantage of the proposed approach on three problems: swing-up cart-balancing with nonuniform and smooth frictions, gridworlds, and teaching a robot to grasp new objects.", "full_text": "Algorithms for Learning Markov Field Policies\n\nAbdeslam Boularias\n\nMax Planck Institute for Intelligent Systems\n\nboularias@tuebingen.mpg.de\n\nOliver Kr\u00a8omer, Jan Peters\n\nTechnische Universit\u00a8at Darmstadt\n\n{oli,jan}@robot-learning.de\n\nAbstract\n\nWe use a graphical model for representing policies in Markov Decision Processes.\nThis new representation can easily incorporate domain knowledge in the form of\na state similarity graph that loosely indicates which states are supposed to have\nsimilar optimal actions. A bias is then introduced into the policy search process\nby sampling policies from a distribution that assigns high probabilities to policies\nthat agree with the provided state similarity graph, i.e. smoother policies. This\ndistribution corresponds to a Markov Random Field. We also present forward\nand inverse reinforcement learning algorithms for learning such policy distribu-\ntions. 
We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects.\n\n1 Introduction\n\nMarkov Decision Processes (MDPs) provide a rich and elegant mathematical framework for solving sequential decision-making problems. In practice, significant domain knowledge is often necessary for finding a near-optimal policy in a reasonable amount of time. For example, one needs a suitable set of basis functions, or features, to approximate the value functions in reinforcement learning and the reward functions in inverse reinforcement learning. Designing value or reward features can itself be a challenging problem. The features can be noisy, misspecified or insufficient, particularly in certain complex robotic tasks such as grasping and manipulating objects. In this type of application, the features are mainly acquired through vision, which is inherently noisy. Many features are also nontrivial, such as the features related to the shape of an object, used for calculating grasp stability.\nIn this paper, we show how to overcome the difficult problem of designing precise value or reward features. We draw our inspiration from computer vision, wherein similar problems have been efficiently solved using a family of graphical models known as Markov Random Fields (MRFs) (Kohli et al., 2007; Munoz et al., 2009). We start by specifying a graph that loosely indicates which pairs of states are supposed to have similar actions under an optimal policy. In an object manipulation task, for example, the states correspond to the points of contact between the robot hand and the object surface. A state similarity graph can be created by sampling points on the surface of the object and connecting each point to its k nearest neighbors using the geodesic or the Euclidean distance. 
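To make the graph construction above concrete, here is a minimal sketch, in Python with numpy, of building the adjacency matrix of a state similarity graph from sampled points. The function name and the simple symmetrized k-NN rule are our illustrative choices, not the authors' implementation:\n\n```python
import numpy as np

def knn_adjacency(points, k):
    """Adjacency matrix of a k-nearest-neighbor similarity graph,
    built from Euclidean distances and symmetrized."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Pairwise Euclidean distances between all sampled points.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)         # a point is not its own neighbor
    K = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(dist[i])[:k]:  # the k nearest neighbors of i
            K[i, j] = K[j, i] = 1.0        # symmetrize the adjacency matrix
    return K
```\n\nSuch a binary matrix K is exactly the kind of loose domain knowledge the approach requires; a geodesic distance along the object surface could be substituted for the Euclidean one.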
The adjacency matrix of this graph can be interpreted as the Gram matrix of a kernel that can be used to approximate the optimal value function. Kernels have been widely used before in reinforcement learning (Ormoneit & Sen, 1999); however, they were used for approximating the values of different policies in a search for an optimal policy. Therefore, the kernels should span not only the optimal value function, but also the values of intermediate policies.\nIn this paper, kernels will be used for a different purpose. We only require that the kernel spans the value function of an optimal policy. Therefore, the value function of an optimal policy is assumed to have a low approximation error, measured by the Bellman error, using that kernel. Subsequently, we derive a distribution on policies, wherein the probability of a policy is proportional to its estimated value, and inversely proportional to its Bellman error. In other terms, the Bellman error is used as a surrogate function for measuring how close a policy is to an optimal one. We show that this probability distribution is an MRF, and use a Markov chain Monte Carlo algorithm for sampling policies from it. We also describe an apprenticeship learning algorithm based on the same principle. A preliminary version of some parts of this work was presented in (Boularias et al., 2012).\n\n2 Notations\n\nFormally, a finite-horizon Markov Decision Process (MDP) is a tuple (S, A, T, R, H, γ), where S is a set of states and A is a set of actions, T is a transition function with T(s, a, s′) = P(s_{t+1} = s′ | s_t = s, a_t = a) for s, s′ ∈ S, a ∈ A, and R is a reward function where R(s, a) is the reward given for action a in state s. To ease notation and without loss of generality, we restrict our theoretical analysis to the case where rewards depend only on states, and denote by R an |S| × 1 vector. H is the planning horizon and γ ∈ [0, 1] is a discount factor. A deterministic policy π is a function that returns an action a = π(s) for each state s. T_π is defined as T_π(s, s′) = T(s, π(s), s′). We denote by π_{t:H} a non-stationary policy (π_t, π_{t+1}, ..., π_H), where π_i is a policy at time-step i. The value of policy π_{t:H} is the expected sum of rewards received by following π_{t:H}, starting from a state s at time t: V_{π_{t:H}}(s) = Σ_{i=t}^{H} γ^{i−t} E_{s_i}[R(s_i) | s_t = s, T_{π_{t:i}}]. An optimal policy π*_{t:H} is one satisfying π*_{t:H} ∈ arg max_{π_{t:H}} V_{π_{t:H}}(s), ∀s ∈ S. Searching for an optimal policy is generally an iterative process with two phases: policy evaluation and policy improvement.\nWhen the state space S is large or continuous, the value function V_{π_{t:H}} is approximated by a linear combination of n basis functions, or features. Let f_i be a |S||A| × 1 vector corresponding to the i-th basis function, and let F be the |S||A| × n matrix with columns f_i. Let Π_{π_t} be an |S| × |S||A| action-selection matrix defined as Π_{π_t}(s, (s, π_t(s))) = 1 and 0 otherwise. Then V_{π_{t:H}} = F_{π_t} w, where w is an n × 1 weight vector and F_{π_t} = Π_{π_t} F. We define the Bellman error of two consecutive policies π_t and π_{t+1}, using the feature matrix F and the weights w_t, w_{t+1} ∈ R^n, as BE(F, w_{t:t+1}, π_{t:t+1}) = ‖F_{π_t} w_t − γ T_{π_t} F_{π_{t+1}} w_{t+1} − R‖_1. Similarly, we define the Bellman error of a distribution P on policies π_t and π_{t+1} as BE(F, w_{t:t+1}, P) = ‖E_{π_{t:t+1}∼P}[F_{π_t} w_t − γ T_{π_t} F_{π_{t+1}} w_{t+1}] − R‖_1. We also define the minimum Bellman error as BE*(F, π_{t:t+1}) = min_{w_{t:t+1}} BE(F, w_{t:t+1}, π_{t:t+1}), and the total Bellman error as BE(F, w_{0:H}, π_{0:H}) = Σ_{t=0}^{H−1} BE(F, w_{t:t+1}, π_{t:t+1}).\n\n3 Markov Random Field Policies for Reinforcement Learning\n\nWe now present the reinforcement learning approach using the Bellman error as a structure penalty.\n\n3.1 Structure penalty\n\nOptimal policies of many real-world problems are structured and change smoothly over the state space. Therefore, the optimal value function can often be approximated by simple features, compared to the value functions of arbitrary policies. We exploit this property and propose to indirectly use these features, provided as domain knowledge, for accelerating the search for an optimal policy. Specifically, we restrain the policy search to a set of policies that have a low estimated Bellman error when their values are approximated using the provided features, knowing that the optimal policy has a low Bellman error. Note that our approach is complementary to function approximation methods. We only use the features for calculating Bellman errors; the value functions can be approximated by other methods, such as LSTD (Boyan, 2002).\nLet K_π be the Gram matrix defined as K_π = Π_π K Π_π^T, where K = F F^T. Matrix K is the adjacency matrix of a graph that indicates which states and actions are similar under an optimal policy. The feature matrix F is not explicitly required, as only the matrix K will be used later. Therefore, the user needs only to provide a similarity measure between states, such as the Euclidean distance.\nLet w_t, w_{t+1} ∈ R^{|S|} and ε ∈ R. If ‖E_{π_{t:t+1}∼P}[K_{π_t} w_t − γ T_{π_t} K_{π_{t+1}} w_{t+1}] − R‖_1 ≤ ε, then BE*(F, P) ≤ ε. This result is obtained by setting F^T Π_{π_t}^T w_t and F^T Π_{π_{t+1}}^T w_{t+1} as the weight vectors of the values of policies π_t and π_{t+1}. 
The condition above implies that the policy distribution\nP has a value function that can be approximated by using F . Enforcing this condition results in a\nbias favoring policies with a low Bellman error. Thus, we are interested in learning a distribution\nP (\u03c00:H ) that satis\ufb01es this condition, while maximizing its expected value.\n\n\u03c0 wt and F T \u03a0T\n\n2\n\n\fDistribution P can be decomposed using the chain rule as P (\u03c00:H ) = P (\u03c0H )(cid:81)H\u22121\n\nt=0 P (\u03c0t|\u03c0t+1:H ).\nWe start by calculating a distribution over deterministic policies \u03c0H that will be executed at the last\ntime-step H. Then, for each step t \u2208 {H \u2212 1, . . . , 0}, we calculate a distribution P (\u03c0t|\u03c0t+1:H ) over\ndeterministic policies \u03c0t given policies \u03c0t+1:H that we sample from P (\u03c0t+1:H ). In the following,\nwe show how to calculate P (\u03c0t|\u03c0t+1:H ).\n\n3.2 Primal problem\nLet \u03c1 \u2208 R be a lower bound on the entropy of a distribution P on deterministic policies \u03c0t, condi-\ntioned on \u03c0t+1:H. \u03c1 is used for tuning the exploration. Our problem can then be formulated as\n\nEP [V \u03c0t:H ](s)\n\n, subject to\n\ng1(P ) = 1, g2(P ) \u2265 \u03c1,(cid:107)g3(P ) \u2212 R(cid:107)1 \u2264 \u0001\n\n,\n\n(cid:17)\n\nwhere\n\ng1(P ) =\n\nP (\u03c0t|\u03c0t+1:H )\n\n,\n\nP (\u03c0t|\u03c0t+1:H ) log P (\u03c0t|\u03c0t+1:H )\n\n,\n\nP (\u03c0t|\u03c0t+1:H )[K\u03c0twt \u2212 \u03b3T\u03c0tK\u03c0t+1wt+1]\n\n,\n\nP (\u03c0t|\u03c0t+1:H )V \u03c0t:H\n\n(cid:17)\n\nmax\n\nP\n\n(cid:16)\n\n(cid:16)(cid:88)\n\ns\u2208S\n\n(cid:88)\n\ng3(P ) =\n\n\u03c0t\u2208A|S|\n\n(cid:16)\n\n(cid:88)\n\n\u03c0t\u2208A|S|\n\n(cid:17)\n\n(cid:16)\n(cid:16)\ng2(P ) = \u2212 (cid:88)\n(cid:17)\n\n\u03c0t\u2208A|S|\n\n(cid:16)EP [V \u03c0t:H ] =\n\n(cid:88)\n\n\u03c0t\u2208A|S|\n\n(1)\n\n(cid:17)\n\n(cid:17)\n\n.\n\nThe objective function in Equation 1 is linear and its constraints de\ufb01ne a convex set. 
Therefore, the\noptimal solution to Problem 1 can be found by solving its Lagrangian dual.\n\n3.3 Dual problem\n\nThe Lagrangian dual is given by\n\nL(P, \u03c4, \u03b7, \u03bb)=\n\nEP [V \u03c0t:H ](s)\n\n(cid:17) \u2212 \u03b7\n\n(cid:16)\n\n(cid:17)\n\ng1(P )\u22121\n\n(cid:16)\n\n+ \u03c4\n\ng2(P )\u2212\u03c1\n\n(cid:17)\n\n+ \u03bbT(cid:16)\n\ng3(P )\u2212R\n\n(cid:17)\n\n+ \u0001(cid:107)\u03bb(cid:107)1,\n\nwhere \u03b7, \u03c4 \u2208 R and \u03bb \u2208 R|S|. We refer the reader to Dudik et al. (2004) for a detailed derivation.\nV \u03c0t:H (s) + \u03bbT [K\u03c0twt \u2212 \u03b3T\u03c0tK\u03c0t+1wt+1] \u2212 \u03c4 log P (\u03c0t|\u03c0t+1:H )) \u2212 \u03b7 \u2212 1.\n\u2202L(P, \u03c4, \u03b7, \u03bb)\n\u2202P (\u03c0t|\u03c0t+1:H )\n\n=\n\n(cid:16)(cid:88)\n(cid:88)\n\ns\u2208S\n\ns\u2208S\n\nBy setting \u2202L(P,\u03c4,\u03b7,\u03bb)\n\n\u2202P (\u03c0t|\u03c0t+1:H ) = 0 (Karush-Kuhn-Tucker condition), we get the solution\n\n(cid:16) 1\n\u03c4(cid:124)(cid:123)(cid:122)(cid:125)\n\n(cid:122)\n(cid:123)\n(cid:88)\n(cid:0)expected sum of rewards\n\n(cid:125)(cid:124)\n(cid:125)\nV \u03c0t:H (s) + \u03bbT [K\u03c0twt \u2212 \u03b3T\u03c0tK\u03c0t+1wt+1]\n\n(cid:123)(cid:122)\n\ns\u2208S\n\n(cid:124)\n\n(cid:1)(cid:17)\n\n.\n\nsmoothness term\n\nexploration factor\n\nP (\u03c0t|\u03c0t+1:H ) \u221d exp\n\nThis distribution on joint actions is a Markov Random Field.\nIn fact, the kernel K = F F T\nis the adjacency matrix of a graph (E,S), where (si, sj) \u2208 E if and only if \u2203ai, aj \u2208 A :\nK((si, ai), (sj, aj)) (cid:54)= 0. Local Markov property is veri\ufb01ed, \u2200si \u2208 S :\nP (\u03c0t(si)|\u03c0t+1:H ,{\u03c0t(sj) : sj \u2208 S, sj (cid:54)= si})=P (\u03c0t(si)|\u03c0t+1:H ,{\u03c0t(sj) : (si, sj) \u2208 E, sj (cid:54)= si}).\nIn other terms, the probability of selecting an action in a given state depends on the expected long\nterm reward of the action, as well as on the selected actions in the neighboring states. 
Dependencies between neighboring states are due to the smoothness term in the distribution.\n\n3.4 Learning parameters\n\nOur goal now is to learn the distribution P, which is parameterized by τ, λ, w_{t:t+1} and V^{π_{t:H}}. Given that the transition function T is unknown, we use samples D = {(s_t, a_t, r_t, s_{t+1})} for approximating the gradients of the parameters and the value function V^{π_{t:H}}. We also restrain K_{π_t} to states and actions that appear in the samples, and denote by T̂_{π_t} the empirical transition matrix of the sampled states. Since P(π_{0:H}) = P(π_H) Π_{t=0}^{H−1} P(π_t | π_{t+1:H}), then\n\nP(π_{0:H}) ∝ exp( (1/τ) Σ_{t=0}^{H} ( Σ_{s∈D} V^{π_{t:H}}(s) + λ_t^T [K_{π_t} w_t − γ T̂_{π_t} K_{π_{t+1}} w_{t+1}] ) ).   (2)\n\nThe value function V^{π_{t:H}} is empirically calculated from the samples by using a standard value function approximation algorithm, such as LSTD (Boyan, 2002). The temperature τ determines the entropy of the distribution P; τ is initially set to a high value and gradually decreased over time as more samples are collected. One can use the same temperature for all time-steps within the same episode, or a different one for each step. Since the Lagrangian L is convex, the parameters λ_t can be learned by a simple gradient descent. Algorithm 1 summarizes the principal steps of the proposed approach. The algorithm iterates between two main steps: (i) sampling and executing policies from Equation 2, and (ii) updating the value functions and the parameters λ_t using the samples. The weight vectors w_{0:H} are the ones that minimize the empirical Bellman error on the samples D; they are also found by a gradient descent, wherein ∂_{w_{0:H}} BE(K, w_{0:H}, π_{0:H}) is estimated from D.\n\nAlgorithm 1 Episodic Policy Search with Markov Random Fields\n\nInitialize the temperature τ with a large value, and λ_{0:H} with 0.\nrepeat\n1. Sample policies π_{0:H} from P (Equation 2).\n2. Discard policies π_{0:H} that have an empirical Bellman error higher than ε.\n3. Execute π_{0:H} and collect D = {(s_t, a_t, r_t, s_{t+1})}.\n4. Update the value functions V^{π_{t:H}} by using LSTD with D.\n5. Find λ_{0:H} that minimizes the dual L by a gradient descent; ∂_λ L is estimated from D.\n6. Decrease the temperature τ.\nuntil τ ≤ ε_τ\n\nThe main assumption behind this algorithm is that the kernel K approximates the optimal value function sufficiently well. What happens when this is not the case? The introduced bias will favor suboptimal policies. However, this problem can be solved by setting the threshold ε to a high value when the user is uncertain about the domain knowledge provided by K. Our experiments confirm that even a binary matrix K, corresponding to a k-NN graph, can yield an improved performance.\nThis approach is straightforward to extend to samples of continuous states and actions, in which case a policy is represented by a vector Θ_t ∈ R^N of continuous parameters (for instance, the center and the width of a Gaussian). Therefore, Equation 2 defines a distribution P(Θ_{0:H}). 
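Sampling from a distribution of the form P(Θ) ∝ exp(score(Θ)/τ), as in Equation 2, does not require the partition function, since only score differences enter the acceptance test. A generic random-walk Metropolis-Hastings sketch, in Python with numpy, is given below; the toy quadratic score and all names are illustrative assumptions, not the paper's code:\n\n```python
import numpy as np

def metropolis_hastings(log_score, theta0, n_samples, step=0.5, tau=1.0, seed=0):
    """Sample parameter vectors from P(theta) ∝ exp(log_score(theta) / tau)
    with a Gaussian random-walk proposal.  Only score differences are
    needed, so the partition function never has to be computed."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    samples = []
    for _ in range(n_samples):
        proposal = theta + step * rng.standard_normal(theta.shape)
        # Accept with probability min(1, P(proposal) / P(theta)).
        if np.log(rng.random()) < (log_score(proposal) - log_score(theta)) / tau:
            theta = proposal
        samples.append(theta.copy())
    return np.array(samples)
```\n\nWith log_score set to the bracketed term of Equation 2 (estimated value plus smoothness term), the accepted parameter vectors are draws from the MRF policy distribution.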
In our experiments, we use the Metropolis-Hastings algorithm for sampling Θ_{0:H} from P.\n\n4 Markov Random Field Policies for Apprenticeship Learning\n\nWe now derive a policy shaping approach for apprenticeship learning using Markov Random Fields.\n\n4.1 Apprenticeship learning\n\nThe aim of apprenticeship learning is to find a policy π that is nearly as good as a policy π̂ demonstrated by an expert, i.e., V_π(s) ≥ V_π̂(s) − ε, ∀s ∈ S. Abbeel & Ng (2004) proposed to learn a reward function, assuming that the expert is optimal, and to use it to recover the expert's generalized policy. The process of learning a reward function is known as inverse reinforcement learning. The reward function is assumed to be a linear combination of m feature vectors φ_k with weights θ_k: ∀s ∈ S : R(s) = Σ_{k=1}^{m} θ_k φ_k(s). The expected discounted sum of feature φ_k, given policy π_{t:H} and starting from s, is defined as φ_k^{π_{t:H}}(s) = Σ_{i=t}^{H} γ^{i−t} E_{s_{t:H}}[φ_k(s_i) | s_t = s, T_{π_{t:i}}]. Using this definition, the expected return of a policy π can be written as a linear function of the feature expectations: V_{π_{t:H}}(s) = Σ_{k=1}^{m} θ_k φ_k^{π_{t:H}}(s). Since this problem is ill-posed, Ziebart et al. (2008) proposed to use the maximum entropy regularization, while matching the expected return of the examples. This latter constraint can be satisfied by ensuring that ∀k, s : φ_k^π(s) = φ̂_k, where φ̂_k denotes the empirical expectation of feature φ_k calculated from the demonstration.\n\n4.2 Structure matching\n\nThe classical framework of apprenticeship learning is based on designing features φ of the reward and learning corresponding weights θ. In practice, as we show in the experiments, it is often difficult to find an appropriate set of reward features. Moreover, the values of the reward features are usually obtained from empirical data and are subject to measurement errors. However, most real-world problems exhibit a structure wherein states that are close together tend to have the same optimal action. This information about the structure of the expert's policy can be used to partially overcome the problem of finding reward features. The structure is given by a kernel that measures similarities between states. Given an expert's policy π̂_{0:H} and a feature matrix F, we are interested in finding a distribution P on policies π_{0:H} that has a Bellman error similar to that of the expert's policy. The following proposition states sufficient conditions for solving this problem.\n\nProposition 1. Let F be a feature matrix, K = F F^T, and K_{π_t} = Π_{π_t} K Π_{π_t}^T. Let P be a distribution on policies π_t and π_{t+1} such that E_{π_{t:t+1}∼P}[K_{π_t}] = K_{π̂_t} and E_{π_{t:t+1}∼P}[γ T_{π_t} K_{π_{t+1}} T_{π_t}^T] = γ T_{π̂_t} K_{π̂_{t+1}} T_{π̂_t}^T; then BE*(F, π̂_{t:t+1}) = BE*(F, P).\n\nProof. We prove that BE*(F, P) ≤ BE*(F, π̂_{t:t+1}). The same argument can be used for proving that BE*(F, π̂_{t:t+1}) ≤ BE*(F, P). This proof borrows the orthogonality technique used for proving the Representer Theorem (Schölkopf et al., 2001). Let ŵ_t, ŵ_{t+1} ∈ R^{|S|} be the weight vectors that minimize the Bellman error of the expert's policy, i.e. ‖Π_{π̂_t} F ŵ_t − γ T_{π̂_t} Π_{π̂_{t+1}} F ŵ_{t+1} − R‖_1 = BE*(F, π̂_{t:t+1}). Let us write ŵ_t = ŵ_t^∥ + ŵ_t^⊥, where ŵ_t^∥ is the projection of ŵ_t on the rows of Π_{π̂_t} F, i.e. ∃α̂_t ∈ R^{|S|} : ŵ_t^∥ = F^T Π_{π̂_t}^T α̂_t, and ŵ_t^⊥ is orthogonal to the rows of Π_{π̂_t} F. Thus, Π_{π̂_t} F ŵ_t = Π_{π̂_t} F (ŵ_t^∥ + ŵ_t^⊥) = Π_{π̂_t} F ŵ_t^∥ = K_{π̂_t} α̂_t. Similarly, one can show that γ T_{π̂_t} Π_{π̂_{t+1}} F ŵ_{t+1} = γ T_{π̂_t} K_{π̂_{t+1}} T_{π̂_t}^T α̂_{t+1}. Let w_t = F^T Π_{π_t}^T α̂_t and w_{t+1} = F^T Π_{π_{t+1}}^T T_{π_t}^T α̂_{t+1}; then we have BE*(F, P) ≤ ‖E_{π_{t:t+1}}[Π_{π_t} F w_t − γ T_{π_t} Π_{π_{t+1}} F w_{t+1}] − R‖_1 = ‖E_{π_{t:t+1}}[K_{π_t} α̂_t − γ T_{π_t} K_{π_{t+1}} T_{π_t}^T α̂_{t+1}] − R‖_1 = ‖K_{π̂_t} α̂_t − γ T_{π̂_t} K_{π̂_{t+1}} T_{π̂_t}^T α̂_{t+1} − R‖_1 = BE*(F, π̂_{t:t+1}).\n\n4.3 Problem statement\n\nOur problem now is to find a distribution P on deterministic policies that satisfies the conditions stated in Proposition 1, in addition to the feature matching conditions φ_k^π(s) = φ̂_k. The conditions of Proposition 1 ensure that P assigns high probabilities to policies that have a structure similar to the expert's policy π̂. The feature matching constraints ensure that the expected value under P is the same as the value of the expert's policy. 
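The empirical feature expectations φ̂_k used in the matching constraints are simple discounted sums over the demonstrated trajectories. A minimal sketch in Python with numpy, where the data layout (a list of state sequences plus a feature function phi) is our assumption:\n\n```python
import numpy as np

def empirical_feature_expectations(trajectories, phi, gamma):
    """phi_hat = average over demonstrations of sum_t gamma^t * phi(s_t),
    where phi(s) returns the m-dimensional feature vector of state s."""
    total = None
    for states in trajectories:
        # Discounted feature count along one demonstrated trajectory.
        acc = sum(gamma ** t * np.asarray(phi(s), dtype=float)
                  for t, s in enumerate(states))
        total = acc if total is None else total + acc
    return total / len(trajectories)
```\n\nFor example, with a single scalar feature phi(s) = s, two demonstrations [1, 1] and [3, 1], and γ = 0.5, the estimate is (1.5 + 3.5) / 2 = 2.5.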
Given that there are infinitely many solutions to this problem, we select a distribution P that has maximal entropy (Ziebart et al., 2008).\n\n4.4 Solution\n\nmax_P − Σ_{π_t∈A^{|S|}} P(π_t | π_{t+1:H}) log P(π_t | π_{t+1:H})\n\nsubject to\n\nΣ_{π_t∈A^{|S|}} P(π_t | π_{t+1:H}) = 1,\nΣ_{π_t∈A^{|S|}} P(π_t | π_{t+1:H}) φ^{π_{t:H}} = φ̂,\nΣ_{π_t∈A^{|S|}} P(π_t | π_{t+1:H}) K_{π_t} = K_{π̂_t},\nΣ_{π_t∈A^{|S|}} P(π_t | π_{t+1:H}) γ T_{π_t} K_{π_{t+1}} T_{π_t}^T = γ T_{π̂_t} K_{π̂_{t+1}} T_{π̂_t}^T,\n\nwhere φ^{π_{t:H}}(s, k) := φ_k^{π_{t:H}}(s) (defined in subsection 4.1). The objective function of this problem is concave and the constraints are linear. Note that the three last equalities are between matrices.\nBy setting the derivatives of the Lagrangian to zero (as in subsection 3.3), we derive the distribution\n\nP(π_t | π_{t+1:H}) ∝ exp( Σ_k Σ_{s∈S} θ_k^s φ_k^{π_{t:H}}(s) + Σ_{(s_i,s_j)∈S²} λ_{i,j} K_{π_t}(s_i, s_j) + γ Σ_{(s_i,s_j)∈S²} ξ_{i,j} (T_{π_t} K_{π_{t+1}} T_{π_t}^T)(s_i, s_j) ).\n\nAgain, this distribution is a Markov Random Field. The parameters θ, λ and ξ are learned by maximizing the likelihood P(π̂_{t:H}) of the expert's policy π̂_{t:H}. The learned parameters can then be used for sampling policies that have the same expected value (from the second constraint) and the same Bellman error (from the last two constraints and Proposition 1) as the expert's policy. If the kernel K is inaccurate, then the learned λ and ξ will take low values to maximize the likelihood of the expert's policy. Hence, our approach reduces to MaxEnt IRL (Ziebart et al., 2008).\nFor simplicity, we consider an approximate solution with fewer parameters in our experiments, where each θ_k^s is replaced by θ_k ∈ R. This simplification is based on the fact that the reward function is independent of the initial state. We also replace λ_{i,j} by λ ∈ R, and ξ_{i,j} by ξ ∈ R.\nFor a sparse matrix K, one can create a corresponding graph (S, E), where (s_i, s_j) ∈ E if and only if ∃a_i, a_j ∈ A : K((s_i, a_i), (s_j, a_j)) ≠ 0, or ∃a_i, a_j ∈ A, (s′_i, s′_j) ∈ E : γ T(s_i, a_i, s′_i) T(s_j, a_j, s′_j) ≠ 0. Finally, the policy distribution can be rewritten as\n\nP(π_t | π_{t+1:H}) ∝ exp( Σ_{s∈S} V_θ^{π_{t:H}}(s) + λ Σ_{(s_i,s_j)∈E} K_{π_t}(s_i, s_j) + γ ξ Σ_{(s_i,s_j)∈E} (T_{π_t} K_{π_{t+1}} T_{π_t}^T)(s_i, s_j) ),   (3)\n\nwhere V_θ^{π_{t:H}}(s) = Σ_k θ_k φ_k(s) + γ Σ_{s′∈S} T_{π_t}(s, s′) V_θ^{π_{t+1:H}}(s′).\nThe distribution given by Equation 3 is a Markov Random Field. The probability of choosing action a in a given state s depends on the expected value of (s, a) and the actions chosen in neighboring states. There is a clear similarity between this distribution of joint actions and the distribution of joint labels in Associative Markov Networks (AMN) (Taskar, 2004). In fact, the proposed framework generalizes AMN to sequential decision-making problems. 
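The trade-off encoded by Equation 3 can be illustrated on a toy graph. The unnormalized log-probability of a joint action assignment is a per-state value term plus a pairwise term rewarding agreeing actions at neighboring states; this is a simplified sketch in Python with numpy, using a binary kernel and omitting the ξ transition term, with all names illustrative:\n\n```python
import numpy as np

def log_potential(actions, Q, edges, lam):
    """Unnormalized log-probability of a joint action assignment:
    sum of per-state action values Q[i, a], plus lam times the number
    of neighboring state pairs that choose the same action."""
    value_term = sum(Q[i, a] for i, a in enumerate(actions))
    smooth_term = sum(1.0 for (i, j) in edges if actions[i] == actions[j])
    return value_term + lam * smooth_term

# Two states connected by one edge; each state slightly prefers a
# different action.
Q = np.array([[1.0, 0.0],
              [0.0, 1.0]])
edges = [(0, 1)]
```\n\nWith λ = 0 the value-only assignment (0, 1) scores highest, while a large enough λ makes the smooth assignment (0, 0) more probable; this is exactly the bias that distinguishes the MRF approach from a purely value-driven one.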
Also, the MaxEnt method (Ziebart et al., 2008) can be derived from Equation 3 by setting λ = 0.\n\nTable 1: Relation between Apprenticeship Learning with MRFs (AL-MRF) and other methods.\n\n      | λ = 0                             | λ ≠ 0\nγ = 0 | Logistic regression               | AMN (Taskar, 2004)\nγ ≠ 0 | MaxEnt IRL (Ziebart et al., 2008) | AL-MRF\n\n4.5 Learning procedure\n\nIn the learning phase, Equation 3 is used for finding parameters θ, λ and ξ that maximize the likelihood of the expert's policy π̂. Since this likelihood function is concave, a global optimum can be found by using standard optimization methods, such as BFGS. A main drawback of our approach is the high computational cost of calculating the partition function of Equation 3, which is O(|A|^{|S|} |S|²). In practice, this problem can be addressed by several possible tricks. For instance, we reuse the values calculated for a given policy π as the initial values of all the policies that differ from π in one state only. We also decompose the state space into a set of weakly connected components, and separately calculate the partition of each component. One can also use recent efficient learning techniques for MRFs, such as (Krähenbühl & Koltun, 2011).\n\n4.6 Planning procedure\n\nAlgorithm 2 describes a dynamic programming procedure for finding a policy (π*_0, π*_1, ..., π*_H) that satisfies ∀t ∈ [0, H] : π*_t ∈ arg max_{π_t∈A^{|S|}} P(π_t | π*_{t+1:H}). The planning problem is reduced to a sequence of inference problems in Markov Random Fields. The inference problem itself can also be efficiently solved using techniques such as graph min-cut (Boykov et al., 1999), α-expansions and linear programming relaxation (Taskar, 2004). We use the α-expansions for our experiments.\n\nAlgorithm 2 Dynamic Programming for Markov Random Field Policies\n\n∀(s, a) ∈ S × A : Q_{H+1}(s, a) = 0.\nfor t = H : 0 do\n1. ∀(s, a) ∈ S × A : Q_t(s, a) = Σ_k θ_k φ_k(s) + γ Σ_{s′} T(s, a, s′) Q_{t+1}(s′, π*_{t+1}(s′)).\n2. Use an inference algorithm (such as the α-expansions) in the MRF defined on the graph (S, E) to label states with actions: the cost of labeling s with a is −Q_t(s, a), and the potential of (s_i, a_i, s_j, a_j) is λ K(s_i, a_i, s_j, a_j) + γ ξ Σ_{(s′_i,s′_j)∈E} T(s_i, a_i, s′_i) T(s_j, a_j, s′_j) K_{π*_{t+1}}(s′_i, s′_j).\n3. Denote by π*_t the labeling policy returned by the inference algorithm.\nend for\nReturn the policy π* = (π*_0, π*_1, ..., π*_H).\n\n5 Experimental Results\n\nWe present experiments on two problems: learning to swing up and balance an inverted pendulum on a cart, and learning to grasp unknown objects.\n\n5.1 Swing-up cart-balancing\n\nThe simulated swing-up cart-balancing system (Figure 1) consists of a 6 kg cart running on a 2 m track and a freely-swinging 1 kg pendulum attached to the cart with a 50 cm rod. The state of the system is the position and velocity of the cart (x, ẋ), as well as the angle and angular velocity of the pendulum (θ, θ̇). An action a ∈ R is a horizontal force applied to the cart. The dynamics of the system are nonlinear. States and actions are continuous, but time is discretized to steps of 0.1 s. The objective is to learn, in a series of 5 s episodes, a policy that swings the pendulum up and balances it in the inverted position. 
Since the pendulum falls down after hitting one of the two\ntrack limits, the policy should also learn to maintain the cart in the middle of the track. Moreover,\nthe track has a nonuniform friction modeled as a force slowing down the cart. Part of the track has\na friction of 30 N, while the remaining part has no friction. This variant is more dif\ufb01cult than the\nstandard ones (Deisenroth & Rasmussen, 2011).\n\nWe consider parametric policies of the form \u03c0(x, \u02d9x, \u03b8, \u02d9\u03b8) = (cid:80)\n\ni piqi(x, \u02d9x, \u03b8, \u02d9\u03b8), where pi are real\nweights and qi are basis functions corresponding to the signs of the angle and the angular velocity\nand an exponential function centered at the middle of the track. Moreover, we discretize the track\ninto 10 segments, and use 10 binary basis functions for friction compensation, each one is nonzero\nonly in a particular segment. A reward of 1 is given for each step the pendulum is above the horizon.\n\n|\n\n.(cid:126)a\n.\n.\n\n|\n\n\u02d9\u03b8i\n\nK(cid:0)(cid:104)xi, \u02d9xi, \u03b8i, \u02d9\u03b8i, ui(cid:105),(cid:104)xj, \u02d9xj, \u03b8j, \u02d9\u03b8j, uj(cid:105)(cid:1) = 1 iff\n\nSince the friction changes smoothly along the\ntrack (domain knowledge), we use the adjacency\nmatrix of a nearest-neighbor graph as the MRF\nSpeci\ufb01cally, we set\nkernel K in Equation 2.\n|xi \u2212 xj| \u2264 0.2m, \u03b8i\u03b8j \u2265 0,\n\u02d9\u03b8j \u2265 0, and\n|ui \u2212 uj| \u2264 5N, otherwise K is set to 0. Fig-\nure 1 shows the average reward per time-step of\nthe learned policies as a function of the learning\ntime. Our attempts to solve this variant using differ-\nent policy gradient methods, e.g. (Kober & Peters,\n2008), mainly resulted in poor policies. We report\nthe values of the policies sampled with Metropolis-\nHastings using Equation 2, and compare to the case\nwhere the policies are sampled solely according to\ntheir expected values, i.e. \u03bbt = 0. 
The expected values are estimated from the samples. The results, averaged over 50 independent trials, show that the convergence is faster when the MRF is used. Moreover, the performance increases as the threshold on the maximum Bellman error in Algorithm 1 is decreased. In fact, policies that change smoothly have a lower Bellman error, as their values can be better approximated with kernel K.

[Figure 1: Swing-up cart-balancing. The friction is nonuniform; the red area has a higher friction than the blue one. However, the friction changes only at one point of the track. Consequently, restraining the search to smooth policies yields faster convergence. The plot shows the average reward per time-step as a function of the learning time in seconds, for the optimal policy, for Metropolis-Hastings with MRF at Bellman-error thresholds of 0.6, 1, and 2, and for plain Metropolis-Hastings.]

5.2 Precision grasps of unknown objects

From a high-level point of view, grasping an object can be seen as an MDP with three steps: reaching, preshaping, and grasping. At any step, the robot can either proceed to the next step or restart from the beginning and get a reward of 0. At t = 0, the robot always starts from the same initial state s_0, and the set of actions corresponds to the set of points on the surface of the object. Given a grasping point, we set the approach direction to the surface normal vector. At t = 1, the state is given by a surface point and an approach direction, and the set of actions corresponds to the set of all possible hand orientations. At t = 2, the state is given by a surface point, an approach direction and a hand orientation. There are two possible last actions: closing the fingers or restarting.

In this experiment, we are interested in learning to grasp objects from their handles. The reward of each step depends on the current state. There is no reward at t = 0. The reward R_1, defined at t = 1, is a function of the first three eigenvalues of the scatter matrix defined by the 3D coordinates of the points inside a small ball centered on the selected point (Boularias et al., 2011). The reward R_2, defined at t = 2, is a function of collision features. We simulate the trajectories of 10 equidistant points on each finger of a Barrett hand (a three-fingered gripper). The collision features are binary variables indicating whether or not the corresponding finger points will make contact with the object.

Based on the domain knowledge that points that are close to each other should have the same action (i.e. the same approach direction and hand orientation), the kernel K is given by the k-nearest neighbors graph, using the Euclidean distance with k = 6 in the state space of positions (or surface points), and the angular distance with k = 2 in the discretized state space of hand orientations. We also use a quadratic kernel for learning R_1, and the Hamming distance between the feature vectors as a kernel for learning R_2. We also use a single constant feature for all the edges.

[Table 2: Learned Q-values at t = 0 for each method (Regression, AMN, MaxEnt IRL, AL-MRF). Each point on an object corresponds to a reaching action. Blue indicates low values and red indicates high values. The black arrow indicates the approach direction in the optimal policy according to the learned reward function.]

We used one object for training and provided six trajectories leading to a successful grasp from its handle. For testing, we compared our approach (Apprenticeship Learning with MRF) with MaxEnt IRL, AMN, and Logistic Regression, which is equivalent to AMN without the graph structure.
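The position-space kernel described above, a symmetrized k-nearest-neighbor graph over surface points under the Euclidean distance, can be sketched as follows. This is our own minimal helper (the paper provides no code), with random points standing in for points sampled on an object's surface:

```python
import numpy as np

def knn_adjacency(points, k):
    """Adjacency matrix of the symmetrized k-nearest-neighbor graph over
    3D surface points, using the Euclidean distance, serving as the MRF
    kernel K over the position state space."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    # Pairwise Euclidean distances; a point is not its own neighbor.
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    K = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d[i])[:k]:
            K[i, j] = K[j, i] = 1.0  # connect i to its k nearest neighbors
    return K

rng = np.random.default_rng(0)
surface = rng.normal(size=(50, 3))  # stand-in for sampled surface points
K = knn_adjacency(surface, k=6)
```

Symmetrizing the graph means each point ends up with at least k neighbors; the same construction applies to the hand-orientation space with the angular distance and k = 2.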
For AMN and Logistic Regression, only the reward R_1 at time-step 1 is learned, since these are classification methods and do not consider subsequent rewards.

Table 2 shows the Q-values at t = 0 and the approach directions at optimal grasping points. AL-MRF improves over the other methods by generally giving high values to handle points only. The values of the other points are zeros because the optimal action at these points is to restart rather than to grasp. The confusion in the other methods comes from noisy point coordinates and self-occlusions. More importantly, AL-MRF improves over AMN, a structured supervised learning technique, by considering the reward at t = 2 while making a decision at t = 1. This can be seen as a type of object recognition by functionality. Figure 2 shows the percentage of successful grasps using the objects in Table 2. A grasp is successful if it is located on a handle and the hand orientation is orthogonal to the handle and the approach direction.

[Figure 2: Percentage of grasps located on a handle with a correct approach direction and hand orientation, for Regression, AMN, MaxEnt IRL, and AL-MRF.]

6 Conclusion

Based on the observation that the value function of an optimal policy is often smooth and can be approximated with a simple kernel, we introduced a general framework for incorporating this type of domain knowledge in forward and inverse reinforcement learning. Our approach uses Markov Random Fields for defining distributions on deterministic policies, and assigns high probabilities to smooth policies. We also provided strong empirical evidence of the advantage of this approach.

Acknowledgement

This work was partly supported by the EU-FP7 grant 248273 (GeRT).

References

Abbeel, Pieter and Ng, Andrew Y. Apprenticeship Learning via Inverse Reinforcement Learning.
In Proceedings of the Twenty-first International Conference on Machine Learning (ICML'04), pp. 1-8, 2004.

Boularias, Abdeslam, Krömer, Oliver, and Peters, Jan. Learning robot grasping from 3-D images with Markov Random Fields. In Proceedings of the 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'11), pp. 1548-1553, 2011.

Boularias, Abdeslam, Krömer, Oliver, and Peters, Jan. Structured Apprenticeship Learning. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML-PKDD'12), pp. 227-242, 2012.

Boyan, Justin A. Technical Update: Least-Squares Temporal Difference Learning. Machine Learning, 49:233-246, November 2002. ISSN 0885-6125.

Boykov, Yuri, Veksler, Olga, and Zabih, Ramin. Fast Approximate Energy Minimization via Graph Cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222-1239, 2001.

Deisenroth, Marc Peter and Rasmussen, Carl Edward. PILCO: A Model-Based and Data-Efficient Approach to Policy Search. In Proceedings of the Twenty-Eighth International Conference on Machine Learning (ICML'11), pp. 465-472, 2011.

Dudik, Miroslav, Phillips, Steven J., and Schapire, Robert E. Performance guarantees for regularized maximum entropy density estimation. In Proceedings of the 17th Annual Conference on Computational Learning Theory (COLT'04), pp. 472-486, 2004.

Kober, Jens and Peters, Jan. Policy search for motor primitives in robotics. In NIPS, pp. 849-856, 2008.

Kohli, Pushmeet, Kumar, Pawan, and Torr, Philip. P3 and beyond: Solving energies with higher order cliques. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'07), 2007.

Krähenbühl, Philipp and Koltun, Vladlen. Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials.
In Advances in Neural Information Processing Systems 24, pp. 109-117, 2011.

Munoz, Daniel, Vandapel, Nicolas, and Hebert, Martial. Onboard contextual classification of 3-D point clouds with learned high-order Markov random fields. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA'09), 2009.

Ormoneit, Dirk and Sen, Saunak. Kernel-based reinforcement learning. Machine Learning, pp. 161-178, 1999.

Schölkopf, Bernhard, Herbrich, Ralf, and Smola, Alex. A Generalized Representer Theorem. Computational Learning Theory, 2111:416-426, 2001.

Taskar, Ben. Learning Structured Prediction Models: A Large Margin Approach. PhD thesis, Stanford University, CA, USA, 2004.

Ziebart, B., Maas, A., Bagnell, A., and Dey, A. Maximum Entropy Inverse Reinforcement Learning. In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (AAAI'08), pp. 1433-1438, 2008.