{"title": "A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 2717, "page_last": 2725, "abstract": "Parametric policy search algorithms are one of the methods of choice for the optimisation of Markov Decision Processes, with Expectation Maximisation and natural gradient ascent being considered the current state of the art in the field. In this article we provide a unifying perspective of these two algorithms by showing that their step-directions in the parameter space are closely related to the search direction of an approximate Newton method. This analysis leads naturally to the consideration of this approximate Newton method as an alternative gradient-based method for Markov Decision Processes. We are able show that the algorithm has numerous desirable properties, absent in the naive application of Newton's method, that make it a viable alternative to either Expectation Maximisation or natural gradient ascent. Empirical results suggest that the algorithm has excellent convergence and robustness properties, performing strongly in comparison to both Expectation Maximisation and natural gradient ascent.", "full_text": "A Unifying Perspective of Parametric Policy Search\n\nMethods for Markov Decision Processes\n\nThomas Furmston\n\nDepartment of Computer Science\n\nUniversity College London\n\nT.Furmston@cs.ucl.ac.uk\n\nDavid Barber\n\nDepartment of Computer Science\n\nUniversity College London\n\nD.Barber@cs.ucl.ac.uk\n\nAbstract\n\nParametric policy search algorithms are one of the methods of choice for the opti-\nmisation of Markov Decision Processes, with Expectation Maximisation and nat-\nural gradient ascent being popular methods in this \ufb01eld. 
In this article we provide a unifying perspective of these two algorithms by showing that their search-directions in the parameter space are closely related to the search-direction of an approximate Newton method. This analysis leads naturally to the consideration of this approximate Newton method as an alternative optimisation method for Markov Decision Processes. We are able to show that the algorithm has numerous desirable properties, absent in the naive application of Newton's method, that make it a viable alternative to either Expectation Maximisation or natural gradient ascent. Empirical results suggest that the algorithm has excellent convergence and robustness properties, performing strongly in comparison to both Expectation Maximisation and natural gradient ascent.\n\n1 Markov Decision Processes\n\nMarkov Decision Processes (MDPs) are the most commonly used model for the description of sequential decision making processes in a fully observable environment, see e.g. [5]. An MDP is described by the tuple $\{\mathcal{S}, \mathcal{A}, H, p_1, p, \pi, R\}$, where $\mathcal{S}$ and $\mathcal{A}$ are sets known respectively as the state and action space, $H \in \mathbb{N}$ is the planning horizon, which can be either finite or infinite, and $\{p_1, p, \pi, R\}$ are functions that are referred to as the initial state distribution, transition dynamics, policy (or controller) and the reward function. In general the state and action spaces can be arbitrary sets, but we restrict our attention to either discrete sets or subsets of $\mathbb{R}^n$, where $n \in \mathbb{N}$. We use boldface notation to represent a vector and also use the notation $z = (s, a)$ to denote a state-action pair. 
Given an MDP the trajectory of the agent is determined by the following recursive procedure: given the agent's state, $s_t$, at a given time-point, $t \in \mathbb{N}_H$, an action is selected according to the policy, $a_t \sim \pi(\cdot|s_t)$; the agent then transitions to a new state according to the transition dynamics, $s_{t+1} \sim p(\cdot|a_t, s_t)$; this process is iterated sequentially through all of the time-points in the planning horizon, where the state of the initial time-point is determined by the initial state distribution, $s_1 \sim p_1(\cdot)$. At each time-point the agent receives a (scalar) reward that is determined by the reward function, where this function depends on the current action and state of the environment. Typically the reward function is assumed to be bounded, but as the objective is linear in the reward function we assume w.l.o.g. that it is non-negative.\nThe most widely used objective in the MDP framework is to maximise the total expected reward of the agent over the course of the planning horizon. This objective can take various forms, including an infinite planning horizon, with either discounted or average rewards, or a finite planning horizon. The theoretical contributions of this paper are applicable to all three frameworks, but for notational ease and for reasons of space we concern ourselves with the infinite horizon framework with discounted rewards. In this framework the boundedness of the objective function is ensured by the introduction of a discount factor, $\gamma \in [0, 1)$, which scales the rewards of the various time-points in a geometric manner. 
Writing the objective function and trajectory distribution directly in terms of the parameter vector then, for any $w \in \mathcal{W}$, the objective function takes the form\n\n$$U(w) = \sum_{t=1}^{\infty} \mathbb{E}_{p_t(a,s;w)}\big[\gamma^{t-1} R(a,s)\big], \qquad (1)$$\n\nwhere we have denoted the parameter space by $\mathcal{W}$ and have used the notation $p_t(a,s;w)$ to represent the marginal $p(s_t = s, a_t = a; w)$ of the joint state-action trajectory distribution\n\n$$p(a_{1:H}, s_{1:H}; w) = \pi(a_H|s_H; w)\bigg\{\prod_{t=1}^{H-1} p(s_{t+1}|a_t, s_t)\,\pi(a_t|s_t; w)\bigg\}\, p_1(s_1), \qquad H \in \mathbb{N}. \qquad (2)$$\n\nNote that the policy is now written in terms of its parametric representation, $\pi(a|s;w)$.\nIt is well known that the global optimum of (1) can be obtained through dynamic programming, see e.g. [5]. However, due to various issues, such as prohibitively large state-action spaces or highly non-linear transition dynamics, it is not possible to find the global optimum of (1) in most real-world problems of interest. Instead most research in this area focuses on obtaining approximate solutions, for which there exist numerous techniques, such as approximate dynamic programming methods [6], Monte-Carlo tree search methods [19] and policy search methods, both parametric [27, 21, 16, 18] and non-parametric [2, 25].\nThis work is focused solely on parametric policy search methods, by which we mean gradient-based methods, such as steepest and natural gradient ascent [23, 1], along with Expectation Maximisation [11], which is a bound optimisation technique from the statistics literature. Since their introduction [14, 31, 10, 16] these methods have been the centre of a large amount of research, with much of it focusing on gradient estimation [21, 4], variance reduction techniques [30, 15], function approximation techniques [27, 8, 20] and real-world applications [18, 26]. 
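The discounted objective in (1) can be approximated by Monte Carlo roll-outs of the recursive sampling procedure described above. The following is a minimal tabular sketch; the function names, the array layout, and the truncation of the infinite horizon at `T` steps are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def sample_return(p1, p, pi, R, gamma, T, rng):
    """Sample one truncated discounted return from a small tabular MDP.
    p1: (S,) initial state distribution; p: (S, A, S) transition kernel;
    pi: (S, A) policy; R: (S, A) reward; T truncates the infinite horizon."""
    s = rng.choice(len(p1), p=p1)
    g, discount = 0.0, 1.0
    for _ in range(T):
        a = rng.choice(pi.shape[1], p=pi[s])
        g += discount * R[s, a]
        discount *= gamma
        s = rng.choice(p.shape[2], p=p[s, a])
    return g

def estimate_objective(p1, p, pi, R, gamma, T=200, n=1000, seed=0):
    """Monte Carlo estimate of the discounted objective U(w) in (1)."""
    rng = np.random.default_rng(seed)
    return np.mean([sample_return(p1, p, pi, R, gamma, T, rng) for _ in range(n)])
```

A quick sanity check: with a constant reward $R \equiv 1$ the objective is the geometric series $1/(1-\gamma)$, so the estimator can be validated exactly up to the truncation error $\gamma^T/(1-\gamma)$.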
While steepest gradient ascent has enjoyed some success it is known to suffer from some substantial issues that often make it unattractive in practice, such as slow convergence and susceptibility to poor scaling of the objective function [23]. Various optimisation methods have been introduced as an alternative, most notably natural gradient ascent [16, 24, 3] and Expectation Maximisation [18, 28], which are currently the methods of choice among parametric policy search algorithms. In this paper our primary focus is on the search-direction (in the parameter space) of these two methods.\n\n2 Search Direction Analysis\n\nIn this section we will perform a novel analysis of the search-direction of both natural gradient ascent and Expectation Maximisation. In gradient-based algorithms of Markov Decision Processes the update of the policy parameters takes the form\n\n$$w^{\text{new}} = w + \alpha \mathcal{M}(w) \nabla_w U(w), \qquad (3)$$\n\nwhere $\alpha \in \mathbb{R}^+$ is the step-size parameter and $\mathcal{M}(w)$ is some positive-definite matrix that possibly depends on $w$. It is well-known that such an update will increase the total expected reward, provided that $\alpha$ is sufficiently small, and this process will converge to a local optimum of (1) provided the step-size sequence is appropriately selected. While EM doesn't have an update of the form given in (3) we shall see that the algorithm is closely related to such an update. It is convenient for later reference to note that the gradient $\nabla_w U(w)$ can be written in the following form\n\n$$\nabla_w U(w) = \mathbb{E}_{p_\gamma(z;w) Q(z;w)}\big[\nabla_w \log \pi(a|s;w)\big], \qquad (4)$$\n\nwhere we use the expectation notation $\mathbb{E}[\cdot]$ to denote the integral/summation w.r.t. a non-negative function. 
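The gradient expression (4) is an expectation of the score function $\nabla_w \log \pi(a|s;w)$. For the Gibbs (softmax) policy considered later in the paper this score has a simple closed form, sketched below for a single state; the function names and the per-state feature matrix `phi` are our own illustrative choices, while the zero-mean property of the score under the policy is a standard fact.

```python
import numpy as np

def softmax_policy(w, phi):
    """Gibbs/softmax policy pi(a|s) proportional to exp(w . phi[a]) for one
    state, where phi is an (A, d) matrix of action features."""
    logits = phi @ w
    p = np.exp(logits - logits.max())  # subtract max for numerical stability
    return p / p.sum()

def score(w, phi):
    """Row a holds the score grad_w log pi(a|s;w) = phi[a] - E_pi[phi]."""
    p = softmax_policy(w, phi)
    return phi - p @ phi
```

A useful check on any such implementation is that the score averages to zero under the policy itself, $\sum_a \pi(a|s;w)\nabla_w \log \pi(a|s;w) = 0$, which is also the property that makes baselines possible in sample-based gradient estimators.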
The term $p_\gamma(z;w)$ is a geometrically weighted average of state-action occupancy marginals given by\n\n$$p_\gamma(z;w) = \sum_{t=1}^{\infty} \gamma^{t-1} p_t(z;w),$$\n\nwhile the term $Q(z;w)$ is referred to as the state-action value function and is equal to the total expected future reward from the current time-point onwards, given the current state-action pair, $z$, and parameter vector, $w$, i.e.\n\n$$Q(z;w) = \sum_{t=1}^{\infty} \mathbb{E}_{p_t(z';w)}\big[\gamma^{t-1} R(z') \,\big|\, z_1 = z\big].$$\n\nThis is a standard result and due to reasons of space we have omitted the details, but see e.g. [27] or section(6.1) of the supplementary material for more details.\nAn immediate issue concerning updates of the form (3) is in the selection of the 'optimal' choice of the matrix $\mathcal{M}(w)$, which clearly depends on the sense in which optimality is defined. There are numerous reasonable properties that are desirable of such an update, including the numerical stability and computational complexity of the parameter update, as well as the rate of convergence of the overall algorithm resulting from these updates. While these are all reasonable criteria, the rate of convergence is of such importance in an optimisation algorithm that it is a logical starting point in our analysis. For this reason we concern ourselves with relating these two parametric policy search algorithms to the Newton method, which has the highly desirable property of having a quadratic rate of convergence in the vicinity of a local optimum. The Newton method is well-known to suffer from problems that make it either infeasible or unattractive in practice, but in terms of forming a basis for theoretical comparisons it is a logical starting point. We shall discuss some of the issues with the Newton method in more detail in section(3). 
In the Newton method the matrix $\mathcal{M}(w)$ is set to the negative inverse Hessian, i.e.\n\n$$\mathcal{M}(w) = -\mathcal{H}^{-1}(w), \quad \text{where} \quad \mathcal{H}(w) = \nabla_w \nabla_w^T U(w),$$\n\nwhere we have denoted the Hessian by $\mathcal{H}(w)$. Using methods similar to those used to calculate the gradient, it can be shown that the Hessian takes the form\n\n$$\mathcal{H}(w) = \mathcal{H}_1(w) + \mathcal{H}_2(w), \qquad (5)$$\n\nwhere\n\n$$\mathcal{H}_1(w) = \sum_{t=1}^{\infty} \mathbb{E}_{p(z_{1:t};w)}\big[\gamma^{t-1} R(z_t)\, \nabla_w \log p(z_{1:t};w)\, \nabla_w^T \log p(z_{1:t};w)\big], \qquad (6)$$\n\n$$\mathcal{H}_2(w) = \sum_{t=1}^{\infty} \mathbb{E}_{p(z_{1:t};w)}\big[\gamma^{t-1} R(z_t)\, \nabla_w \nabla_w^T \log p(z_{1:t};w)\big]. \qquad (7)$$\n\nWe have omitted the details of the derivation, but these can be found in section(6.2) of the supplementary material, with a similar derivation of a sample-based estimate of the Hessian given in [4].\n\n2.1 Natural Gradient Ascent\n\nTo overcome some of the issues that can hinder steepest gradient ascent an alternative, natural gradient, was introduced in [16]. Natural gradient ascent techniques originated in the neural network and blind source separation literature, see e.g. [1], and take the perspective that the parameter space has a Riemannian manifold structure, as opposed to a Euclidean structure. Deriving the steepest ascent direction of $U(w)$ w.r.t. a local norm defined on this parameter manifold (as opposed to w.r.t. the Euclidean norm, which is the case in steepest gradient ascent) results in natural gradient ascent. We denote the quadratic form that induces this local norm on the parameter manifold by $G(w)$, i.e. $d(w)^2 = w^T G(w) w$. The derivation for natural gradient ascent is well-known, see e.g. [1], and its application to the objective (1) results in a parameter update of the form\n\n$$w_{k+1} = w_k + \alpha_k G^{-1}(w_k) \nabla_w U(w_k).$$\n\nIn terms of (3) this corresponds to $\mathcal{M}(w) = G^{-1}(w)$. 
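The natural gradient update above can be sketched numerically. A minimal version, assuming the Fisher matrix $G$ and the gradient have already been estimated; the `damping` term is our own addition for numerical safety when $G$ is estimated from few samples, and is not part of the paper's derivation.

```python
import numpy as np

def natural_gradient_step(w, grad, G, alpha=0.1, damping=1e-6):
    """One natural-gradient update w + alpha * G^{-1} grad.
    Solves the linear system G d = grad rather than forming the inverse;
    the small damping term (an implementation choice, not from the paper)
    keeps the estimated Fisher matrix positive-definite."""
    d = np.linalg.solve(G + damping * np.eye(len(w)), grad)
    return w + alpha * d
```

Solving the system rather than inverting $G$ is the usual implementation choice, since it is cheaper and numerically better conditioned than an explicit inverse.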
In the case of MDPs the most commonly used local norm is given by the Fisher information matrix of the trajectory distribution, see e.g. [3, 24], and due to the Markovian structure of the dynamics it is given by\n\n$$G(w) = -\mathbb{E}_{p_\gamma(z;w)}\big[\nabla_w \nabla_w^T \log \pi(a|s;w)\big]. \qquad (8)$$\n\nWe note that there is an alternate, but equivalent, form of writing the Fisher information matrix, see e.g. [24], but we do not use it in this work.\nIn order to relate natural gradient ascent to the Newton method we first rewrite the matrix (7) into the following form\n\n$$\mathcal{H}_2(w) = \mathbb{E}_{p_\gamma(z;w) Q(z;w)}\big[\nabla_w \nabla_w^T \log \pi(a|s;w)\big]. \qquad (9)$$\n\nFor reasons of space the details of this reformulation of (7) are left to section(6.2) of the supplementary material. Comparing the Fisher information matrix (8) with the form of $\mathcal{H}_2(w)$ given in (9) it is clear that natural gradient ascent has a relationship with the approximate Newton method that uses $\mathcal{H}_2(w)$ in place of $\mathcal{H}(w)$. In terms of (3) this approximate Newton method corresponds to setting $\mathcal{M}(w) = -\mathcal{H}_2^{-1}(w)$. In particular it can be seen that the difference between the two methods lies in the non-negative function w.r.t. which the expectation is taken in (8) and (9). (It also appears that there is a difference in sign, but observing the form of $\mathcal{M}(w)$ for each algorithm shows that this is not the case.) In the Fisher information matrix the expectation is taken w.r.t. the geometrically weighted summation of state-action occupancy marginals of the trajectory distribution, while in $\mathcal{H}_2(w)$ there is an additional weighting from the state-action value function. 
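The relationship between (8) and (9) can be made concrete in a toy single-state example. For a softmax policy the log-policy Hessian is the same matrix for every action (it is minus the feature covariance under the policy), so the Fisher matrix and $\mathcal{H}_2$ differ only in how the occupancy and value weights scale it. This is a sketch under these simplifying assumptions; all function names and the per-action arrays `p_gamma` and `Q` are our own illustrative choices.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def logpi_hessian(w, phi):
    """grad grad^T log pi(a|s;w) for a softmax policy with action features
    phi (A, d): equal to -Cov_pi[phi], the same for every action a, and
    negative semi-definite."""
    p = softmax(phi @ w)
    centred = phi - p @ phi
    return -centred.T @ (p[:, None] * centred)

def fisher_and_H2(w, phi, p_gamma, Q):
    """Single-state reduction of (8) and (9): the Fisher matrix weights the
    log-policy Hessian by the occupancy weights p_gamma alone, while H2
    additionally weights by the state-action value Q(z)."""
    Hlog = logpi_hessian(w, phi)       # identical for every action here
    G = -np.sum(p_gamma) * Hlog        # equation (8), toy case
    H2 = np.sum(p_gamma * Q) * Hlog    # equation (9), toy case
    return G, H2
```

In this reduction the sign relationship discussed in the text is visible directly: with $Q \equiv 1$ the two matrices coincide up to the sign absorbed by $\mathcal{M}(w)$, and $\mathcal{H}_2$ stays negative semi-definite whenever the weights are non-negative.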
Hence, $\mathcal{H}_2(w)$ incorporates information about the reward structure of the objective function, whereas the Fisher information matrix does not, and so it will generally contain more information about the curvature of the objective function.\n\n2.2 Expectation Maximisation\n\nThe Expectation Maximisation algorithm, or EM-algorithm, is a powerful optimisation technique from the statistics literature, see e.g. [11], that has recently been the centre of much research in the planning and reinforcement learning communities, see e.g. [10, 28, 18]. A quantity of central importance in the EM-algorithm for MDPs is the following lower-bound on the log-objective\n\n$$\log U(w) \geq H_{\text{entropy}}(q(z_{1:t}, t)) + \mathbb{E}_{q(z_{1:t}, t)}\big[\log \gamma^{t-1} R(z_t)\, p(z_{1:t};w)\big], \qquad (10)$$\n\nwhere $H_{\text{entropy}}$ is the entropy function and $q(z_{1:t}, t)$ is known as the 'variational distribution'. Further details of the EM-algorithm for MDPs and a derivation of (10) are given in section(6.3) of the supplementary material or can be found in e.g. [18, 28]. The parameter update of the EM-algorithm is given by the maximum (w.r.t. $w$) of the 'energy' term,\n\n$$Q(w, w_k) = \mathbb{E}_{p_\gamma(z;w_k) Q(z;w_k)}\big[\log \pi(a|s;w)\big]. \qquad (11)$$\n\nNote that $Q$ is a two-parameter function, where the first parameter occurs inside the expectation and the second parameter defines the non-negative function w.r.t. which the expectation is taken. This decoupling allows the maximisation over $w$ to be performed explicitly in many cases of interest. For example, when the log-policy is quadratic in $w$ the maximisation problem is equivalent to a least-squares problem and the optimum can be found through solving a linear system of equations.\nIt has previously been noted, again see e.g. [18], that the parameter update of steepest gradient ascent and the EM-algorithm can be related through this 'energy' term. 
In particular the gradient (4) evaluated at $w_k$ can also be written as follows: $\nabla_w U(w)|_{w=w_k} = \nabla_w^{10} Q(w, w_k)|_{w=w_k}$, where we use the notation $\nabla_w^{10}$ to denote the first derivative w.r.t. the first parameter, while the update of the EM-algorithm is given by $w_{k+1} = \text{argmax}_{w \in \mathcal{W}} Q(w, w_k)$. In other words, steepest gradient ascent moves in the direction that most rapidly increases $Q(w, w_k)$ w.r.t. the first variable, while the EM-algorithm maximises $Q(w, w_k)$ w.r.t. the first variable. While this relationship is true, it is also quite a negative result. It states that in situations where it is not possible to explicitly perform the maximisation over $w$ in (11) then the alternative, in terms of the EM-algorithm, is this generalised EM-algorithm, which is equivalent to steepest gradient ascent. Considering that algorithms such as EM are typically considered because of the negative aspects related to steepest gradient ascent this is an undesirable alternative. It is possible to find the optimum of (11) numerically, but this is also undesirable as it results in a double-loop algorithm that could be computationally expensive. Finally, this result provides no insight into the behaviour of the EM-algorithm, in terms of the direction of its parameter update, when the maximisation over $w$ in (11) can be performed explicitly.\nInstead we provide the following result, which shows that the step-direction of the EM-algorithm has an underlying relationship with the Newton method. In particular we show that, under suitable regularity conditions, the direction of the EM-update, i.e. $w_{k+1} - w_k$, is the same, up to first order, as the direction of an approximate Newton method that uses $\mathcal{H}_2(w)$ in place of $\mathcal{H}(w)$.\nTheorem 1. Suppose we are given a Markov Decision Process with objective (1) and Markovian trajectory distribution (2). 
Consider the update of the parameter through Expectation Maximisation at the kth iteration of the algorithm, i.e.\n\n$$w_{k+1} = \text{argmax}_{w \in \mathcal{W}} Q(w, w_k).$$\n\nProvided that $Q(w, w_k)$ is twice continuously differentiable in the first parameter we have that\n\n$$w_{k+1} - w_k = -\mathcal{H}_2^{-1}(w_k)\, \nabla_w U(w)|_{w=w_k} + O(\|w_{k+1} - w_k\|^2). \qquad (12)$$\n\nAdditionally, in the case where the log-policy is quadratic the relation to the approximate Newton method is exact, i.e. the second term on the r.h.s. of (12) is zero.\nProof. The idea of the proof is simple and only involves performing a Taylor expansion of $\nabla_w^{10} Q(w, w_k)$. As $Q$ is assumed to be twice continuously differentiable in the first component this Taylor expansion is possible and gives\n\n$$\nabla_w^{10} Q(w_{k+1}, w_k) = \nabla_w^{10} Q(w_k, w_k) + \nabla_w^{20} Q(w_k, w_k)(w_{k+1} - w_k) + O(\|w_{k+1} - w_k\|^2). \qquad (13)$$\n\nAs $w_{k+1} = \text{argmax}_{w \in \mathcal{W}} Q(w, w_k)$ it follows that $\nabla_w^{10} Q(w_{k+1}, w_k) = 0$. This means that, upon ignoring higher order terms in $w_{k+1} - w_k$, the Taylor expansion (13) can be rewritten into the form\n\n$$w_{k+1} - w_k = -\nabla_w^{20} Q(w_k, w_k)^{-1}\, \nabla_w^{10} Q(w_k, w_k). \qquad (14)$$\n\nThe proof is completed by observing that $\nabla_w^{10} Q(w_k, w_k) = \nabla_w U(w)|_{w=w_k}$ and $\nabla_w^{20} Q(w_k, w_k) = \mathcal{H}_2(w_k)$. The second statement follows because in the case where the log-policy is quadratic the higher order terms in the Taylor expansion vanish.\n\n2.3 Summary\n\nIn this section we have provided a novel analysis of both natural gradient ascent and Expectation Maximisation when applied to the MDP framework. Previously, while both of these algorithms have proved popular methods for MDP optimisation, there has been little understanding of them in terms of their search-direction in the parameter space or their relation to the Newton method. 
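The exact case of Theorem 1 (quadratic log-policy, hence a quadratic energy term) is easy to check numerically: for an energy of the form $Q(w, w_k) = b^T w - \tfrac{1}{2} w^T A w$, the EM update $\text{argmax}_w Q(w, w_k)$ and the approximate Newton step with $\mathcal{H}_2 = \nabla^{20} Q = -A$ and step-size one coincide. The quadratic form below is our own illustrative stand-in for the energy term, with $A$ and $b$ playing the roles of $A(w_k)$ and $b(w_k)$.

```python
import numpy as np

def em_update_quadratic(wk, A, b):
    """argmax_w of the quadratic energy Q(w, wk) = b.w - 0.5 w.A.w,
    i.e. the closed-form EM update in the quadratic log-policy case.
    (wk is unused here precisely because the argmax ignores the start point.)"""
    return np.linalg.solve(A, b)

def approx_newton_step(wk, A, b):
    """wk - H2^{-1} grad with H2 = -A and grad = b - A wk (step-size one)."""
    grad = b - A @ wk
    return wk - np.linalg.solve(-A, grad)
```

Algebraically the second function returns $w_k + A^{-1}(b - A w_k) = A^{-1} b$, which is exactly the first, mirroring the claim that the higher-order remainder in (12) vanishes.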
Firstly, our analysis shows that the Fisher information matrix, which is used in natural gradient ascent, is similar to $\mathcal{H}_2(w)$ in (5), with the exception that the information about the reward structure of the problem is not contained in the Fisher information matrix, while such information is contained in $\mathcal{H}_2(w)$. Additionally we have shown that the step-direction of the EM-algorithm is, up to first order, an approximate Newton method that uses $\mathcal{H}_2(w)$ in place of $\mathcal{H}(w)$ and employs a constant step-size of one.\n\n3 An Approximate Newton Method\n\nA natural follow-on from the analysis in section(2) is the consideration of using $\mathcal{M}(w) = -\mathcal{H}_2^{-1}(w)$ in (3), a method we call the full approximate Newton method from this point onwards. In this section we show that this method has many desirable properties that make it an attractive alternative to other parametric policy search methods. Additionally, denoting the diagonal matrix formed from the diagonal elements of $\mathcal{H}_2(w)$ by $\mathcal{D}_2(w)$, we shall also consider the method that uses $\mathcal{M}(w) = -\mathcal{D}_2^{-1}(w)$ in (3). We call this second method the diagonal approximate Newton method.\nRecall that in (3) it is necessary that $\mathcal{M}(w)$ is positive-definite (in the Newton method this corresponds to requiring the Hessian to be negative-definite) to ensure an increase of the objective. In general the objective (1) is not concave, which means that the Hessian will not be negative-definite over the entire parameter space. In such cases the Newton method can actually lower the objective and this is an undesirable aspect of the Newton method. An attractive property of the approximate Hessian, $\mathcal{H}_2(w)$, is that it is always negative-definite when the policy is log-concave in the policy parameters. This fact follows from the observation that in such cases $\mathcal{H}_2(w)$ is a non-negative mixture of negative-definite matrices, which again is negative-definite [9]. 
Additionally, the diagonal terms of a negative-definite matrix are negative and so $\mathcal{D}_2(w)$ is also negative-definite when the controller is log-concave.\nTo motivate this result we now briefly consider some widely used policies that are either log-concave or blockwise log-concave. Firstly, consider the Gibbs policy, $\pi(a|s;w) \propto \exp(w^T \phi(a,s))$, where $\phi(a,s) \in \mathbb{R}^{n_w}$ is a feature vector. This policy is widely used in discrete systems and is log-concave in $w$, which can be seen from the fact that $\log \pi(a|s;w)$ is the sum of a linear term and a negative log-sum-exp term, both of which are concave [9]. In systems with a continuous state-action space a common choice of controller is $\pi(a|s; w_{\text{mean}}, \Sigma) = \mathcal{N}(a|K\phi(s) + m, \Sigma(s))$, where $w_{\text{mean}} = \{K, m\}$ and $\phi(s) \in \mathbb{R}^{n_w}$ is a feature vector. The notation $\Sigma(s)$ is used because there are cases where it is beneficial to have state dependent noise in the controller. This controller is not jointly log-concave in $w_{\text{mean}}$ and $\Sigma$, but it is blockwise log-concave in $w_{\text{mean}}$ and $\Sigma^{-1}$. In terms of $w_{\text{mean}}$ the log-policy is quadratic and the coefficient matrix of the quadratic term is negative-definite. In terms of $\Sigma^{-1}$ the log-policy consists of a linear term and a log-determinant term, both of which are concave.\nIn terms of evaluating the search direction it is clear from the forms of $\mathcal{D}_2(w)$ and $\mathcal{H}_2(w)$ that many of the pre-existing gradient evaluation techniques can be extended to the approximate Newton framework with little difficulty. In particular, gradient evaluation requires calculating the expectation of the derivative of the log-policy w.r.t. $p_\gamma(z;w)Q(z;w)$. In terms of inference the only additional calculation necessary to implement either the full or diagonal approximate Newton methods is the calculation of the expectation (w.r.t. the same function) of the Hessian of the log-policy, or its diagonal terms. As an example, in section(6.5) of the supplementary material we detail the extension of the recurrent state formulation of gradient evaluation in the average reward framework, see e.g. [31], to the approximate Newton method. We use this extension in the Tetris experiment that we consider in section(4). Given $n_s$ samples and $n_w$ parameters the complexity of these extensions scales as $O(n_s n_w)$ for the diagonal approximate Newton method, while it scales as $O(n_s n_w^2)$ for the full approximate Newton method.\nAn issue with the Newton method is the inversion of the Hessian matrix, which scales with $O(n_w^3)$ in the worst case. In the standard application of the Newton method this inversion has to be performed at each iteration and in large parameter systems this becomes prohibitively costly. In general $\mathcal{H}(w)$ will be dense and no computational savings will be possible when performing this matrix inversion. The same is not true, however, of the matrices $\mathcal{D}_2(w)$ and $\mathcal{H}_2(w)$. Firstly, as $\mathcal{D}_2(w)$ is a diagonal matrix it is trivial to invert. Secondly, there is an immediate source of sparsity that comes from taking the second derivative of the log-trajectory distribution in (7). This property ensures that any (product) sparsity over the control parameters in the log-trajectory distribution will correspond to sparsity in $\mathcal{H}_2(w)$. For example, in a partially observable Markov Decision Process where the policy is modelled through a finite state controller, see e.g. [22], there are three functions to be optimised, namely the initial belief distribution, the belief transition dynamics and the policy. When the parameters of these three functions are independent $\mathcal{H}_2(w)$ will be block-diagonal (across the parameters of the three functions) and the matrix inversion can be performed more efficiently by inverting each of the block matrices individually. 
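The block-diagonal structure described above can be exploited directly when computing the approximate Newton direction: each block is solved independently, reducing an $O(n_w^3)$ solve to the sum of the per-block costs. A sketch, where the block layout (a plain list of square matrices covering contiguous parameter groups) is our own illustrative assumption.

```python
import numpy as np

def blockwise_newton_direction(blocks, grad):
    """Solve H2 d = grad when H2 is block-diagonal (e.g. independent
    parameter groups of a finite state controller), solving each block
    separately instead of factorising the full matrix."""
    d, i = np.empty_like(grad), 0
    for B in blocks:
        n = B.shape[0]
        d[i:i + n] = np.linalg.solve(B, grad[i:i + n])
        i += n
    return d
```

The result agrees with solving against the assembled full matrix, which is a convenient correctness check for any block layout.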
The reason that $\mathcal{H}(w)$ does not exhibit any such sparsity properties is due to the term $\mathcal{H}_1(w)$ in (5), which consists of the non-negative mixture of outer-product matrices. The vector in these outer-products is the derivative of the log-trajectory distribution and this typically produces a dense matrix.\nAn undesirable aspect of steepest gradient ascent is that its performance is affected by the choice of basis used to represent the parameter space. An important and desirable property of the Newton method is that it is invariant to non-singular linear (affine) transformations of the parameter space, see e.g. [9]. This means that given a non-singular linear (affine) mapping $T \in \mathbb{R}^{n_w \times n_w}$, the Newton update of the objective $\tilde{U}(w) = U(Tw)$ is related to the Newton update of the original objective through the same linear (affine) mapping, i.e. $v + \Delta v_{\text{nt}} = T(w + \Delta w_{\text{nt}})$, where $v = Tw$ and $\Delta v_{\text{nt}}$ and $\Delta w_{\text{nt}}$ denote the respective Newton steps. In other words running the Newton method on $U(w)$ and $\tilde{U}(T^{-1}w)$ will give identical results. An important point to note is that this desirable property is maintained when using $\mathcal{H}_2(w)$ in an approximate Newton method, while using $\mathcal{D}_2(w)$ results in a method that is invariant to rescaling of the parameters, i.e. where $T$ is a diagonal matrix with non-zero elements along the diagonal. This can be seen by using the linearity of the expectation operator and a proof of this statement is provided in section(6.4) of the supplementary material.\n\nFigure 1: (a) An empirical illustration of the affine invariance of the approximate Newton method, performed on the two state MDP of [16]. The plot shows the trace of the policy during training for the two different parameter spaces, where the results of the latter have been mapped back into the original parameter space for comparison. 
The plot shows the two steepest gradient ascent traces (blue cross and blue circle) and the two traces of the full approximate Newton method (red cross and red circle). (b) Results of the Tetris problem for steepest gradient ascent (black), natural gradient ascent (green), the diagonal approximate Newton method (blue) and the full approximate Newton method (red).\n\n4 Experiments\n\nThe first experiment we performed was an empirical illustration that the full approximate Newton method is invariant to linear transformations of the parameter space. We considered the simple two state example of [16] as it allows us to plot the trace of the policy during training, since the policy has only two parameters. The policy was trained using both steepest gradient ascent and the full approximate Newton method and in both the original and linearly transformed parameter space. The policy traces of the two algorithms are plotted in figure(1.a). As expected steepest gradient ascent is affected by such mappings, whilst the full approximate Newton method is invariant to them.\nThe second experiment was aimed at demonstrating the scalability of the full and diagonal approximate Newton methods to problems with a large state space. We considered the Tetris domain, which is a popular computer game designed by Alexey Pajitnov in 1985. See [12] for more details. Firstly, we compared the performance of the full and diagonal approximate Newton methods to other parametric policy search methods. Tetris is typically played on a $20 \times 10$ grid, but due to computational costs we considered a $10 \times 10$ grid in the experiment. This results in a state space with roughly $7 \times 2^{100}$ states. We modelled the policy through a Gibbs distribution, where we considered a feature vector with the following features: the heights of each column, the difference in heights between adjacent columns, the maximum height and the number of 'holes'. 
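A feature vector of the kind just listed can be sketched for a binary board encoding. This is our own plausible reading of the listed features (column heights, adjacent height differences, maximum height, holes); the paper's exact feature construction may differ.

```python
import numpy as np

def board_features(board):
    """Features for a binary Tetris board (rows indexed top to bottom,
    1 = filled cell): per-column heights, absolute differences between
    adjacent column heights, the maximum height, and the number of holes
    (empty cells lying below the top filled cell of their column)."""
    rows, cols = board.shape
    heights = np.zeros(cols, dtype=int)
    holes = 0
    for c in range(cols):
        filled = np.flatnonzero(board[:, c])
        if filled.size:
            heights[c] = rows - filled[0]
            holes += int(np.sum(board[filled[0]:, c] == 0))
    diffs = np.abs(np.diff(heights))
    return np.concatenate([heights, diffs, [heights.max(), holes]])
```

For a $10 \times 10$ board this encoding gives $10 + 9 + 1 + 1 = 21$ features, one weight per feature under the Gibbs parameterisation.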
Under this policy it is not possible to obtain the explicit maximum over $w$ in (11) and so a straightforward application of EM is not possible in this problem. We therefore compared the diagonal and full approximate Newton methods with steepest and natural gradient ascent. Due to reasons of space the exact implementation of the experiment is detailed in section(6.6) of the supplementary material. We ran 100 repetitions of the experiment, each consisting of 100 training iterations, and the mean and standard error of the results are given in figure(1.b). It can be seen that the full approximate Newton method outperforms all of the other methods, while the performance of the diagonal approximate Newton method is comparable to natural gradient ascent. We also ran several training runs of the full approximate Newton method on the full-sized $20 \times 10$ board and were able to obtain a score in the region of 14,000 completed lines, which was obtained after roughly 40 training iterations. An approximate dynamic programming based method has previously been applied to the Tetris domain in [7]. The same set of features were used and a score of roughly 4,500 completed lines was obtained after around 6 training iterations, after which the solution then deteriorated.\nIn the third experiment we considered a finite horizon (controlled) linear dynamical system. This allowed the search-directions of the various algorithms to be computed exactly using [13] and removed any issues of approximate inference from the comparison. In particular we considered a 3-link rigid manipulator, linearized through feedback linearisation, see e.g. [17]. 
Figure 2: (a) The normalised total expected reward plotted against training time, in seconds, for the 3-link rigid manipulator. The plot shows the results for steepest gradient ascent (black), EM (blue), natural gradient ascent (green) and the approximate Newton method (red), where the plot shows the mean and standard error of the results. (b) The normalised total expected reward plotted against training iterations for the synthetic non-linear system of [29]. The plot shows the results for EM (blue), steepest gradient ascent (black), natural gradient ascent (green) and the approximate Newton method (red), where the plot shows the mean and standard error of the results.\n\nThis system has a 6-dimensional state space, 3-dimensional action space and a 22-dimensional parameter space. Further details of the system can be found in section(6.7) of the supplementary material. We ran the experiment 100 times and the mean and standard error of the results are plotted in figure(2.a). In this experiment the approximate Newton method found substantially better solutions than either steepest gradient ascent, natural gradient ascent or Expectation Maximisation. The superiority of the results in comparison to either steepest or natural gradient ascent can be explained by the fact that $\mathcal{H}_2(w)$ gives a better estimate of the curvature of the objective function. Expectation Maximisation performed poorly in this experiment, exhibiting sub-linear convergence. Steepest gradient ascent performed $3684 \pm 314$ training iterations in this experiment which, in comparison to the $203 \pm 34$ and $310 \pm 40$ iterations of natural gradient ascent and the approximate Newton method respectively, illustrates the susceptibility of this method to poor scaling. 
In the final experiment we considered the synthetic non-linear system of [29]. Full details of the system and the experiment can be found in section(6.8) of the supplementary material. We ran the experiment 100 times and the mean and standard error of the results are plotted in figure(2.b). Again the approximate Newton method outperforms both steepest and natural gradient ascent. In this example only the mean parameters of the Gaussian controller are optimised, while the parameters of the noise are held fixed, which means that the log-policy is quadratic in the policy parameters. Hence, in this example the EM-algorithm is a particular (less general) version of the approximate Newton method in which a fixed step-size of one is used throughout. The marked difference in performance between the EM-algorithm and the approximate Newton method shows the benefit of being able to tune the step-size sequence. In this experiment we considered five different step-size sequences for the approximate Newton method and all of them obtained superior results to the EM-algorithm. In contrast, only one of the seven step-size sequences considered for steepest and natural gradient ascent outperformed the EM-algorithm.

5 Conclusion

The contributions of this paper are twofold. Firstly, we have given a novel analysis of Expectation Maximisation and natural gradient ascent when applied to the MDP framework, showing that both have close connections to an approximate Newton method. Secondly, prompted by this analysis we have considered the direct application of this approximate Newton method to the optimisation of MDPs, showing that it has numerous desirable properties that are not present in the naive application of Newton's method.
In terms of empirical performance we have found the approximate Newton method to perform consistently well in comparison to EM and natural gradient ascent, highlighting its viability as an alternative to either of these methods. At present we have only considered actor-type implementations of the approximate Newton method, and the extension to actor-critic methods is a point of future research.

References

[1] S. Amari. Natural Gradient Works Efficiently in Learning. Neural Computation, 10:251–276, 1998.

[2] M. Azar, V. Gómez, and H. Kappen. Dynamic Policy Programming with Function Approximation. Journal of Machine Learning Research - Proceedings Track, 15:119–127, 2011.

[3] J. Bagnell and J. Schneider. Covariant Policy Search. IJCAI, 18:1019–1024, 2003.

[4] J. Baxter and P. Bartlett. Infinite Horizon Policy Gradient Estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.

[5] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, second edition, 2000.

[6] D. P. Bertsekas. Approximate Policy Iteration: A Survey and Some New Methods. Research report, Massachusetts Institute of Technology, 2010.

[7] D. P. Bertsekas and S. Ioffe. Temporal Differences-Based Policy Iteration and Applications in Neuro-Dynamic Programming. Research report, Massachusetts Institute of Technology, 1997.

[8] S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee. Natural Actor-Critic Algorithms. Automatica, 45:2471–2482, 2009.

[9] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[10] P. Dayan and G. E. Hinton. Using Expectation-Maximization for Reinforcement Learning. Neural Computation, 9:271–278, 1997.

[11] A. P. Dempster, N. M. Laird, and D. B. Rubin.
Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977.

[12] C. Fahey. Tetris AI, Computers Play Tetris. http://colinfahey.com/tetris/tetris_en.html, 2003.

[13] T. Furmston and D. Barber. Efficient Inference for Markov Control Problems. UAI, 29:221–229, 2011.

[14] P. W. Glynn. Likelihood Ratio Gradient Estimation for Stochastic Systems. Communications of the ACM, 33(10):75–84, 1990.

[15] E. Greensmith, P. Bartlett, and J. Baxter. Variance Reduction Techniques for Gradient Based Estimates in Reinforcement Learning. Journal of Machine Learning Research, 5:1471–1530, 2004.

[16] S. Kakade. A Natural Policy Gradient. NIPS, 14:1531–1538, 2002.

[17] H. Khalil. Nonlinear Systems. Prentice Hall, 2001.

[18] J. Kober and J. Peters. Policy Search for Motor Primitives in Robotics. Machine Learning, 84(1-2):171–203, 2011.

[19] L. Kocsis and C. Szepesvári. Bandit Based Monte-Carlo Planning. European Conference on Machine Learning (ECML), 17:282–293, 2006.

[20] V. R. Konda and J. N. Tsitsiklis. On Actor-Critic Algorithms. SIAM J. Control Optim., 42(4):1143–1166, 2003.

[21] P. Marbach and J. Tsitsiklis. Simulation-Based Optimisation of Markov Reward Processes. IEEE Transactions on Automatic Control, 46(2):191–209, 2001.

[22] N. Meuleau, L. Peshkin, K. Kim, and L. Kaelbling. Learning Finite-State Controllers for Partially Observable Environments. UAI, 15:427–436, 1999.

[23] J. Nocedal and S. Wright. Numerical Optimisation. Springer, 2006.

[24] J. Peters and S. Schaal. Natural Actor-Critic. Neurocomputing, 71(7-9):1180–1190, 2008.

[25] K. Rawlik, M. Toussaint, and S. Vijayakumar. On Stochastic Optimal Control and Reinforcement Learning by Approximate Inference. International Conference on Robotics: Science and Systems, 2012.

[26] S. Richter, D. Aberdeen, and J.
Yu. Natural Actor-Critic for Road Traffic Optimisation. NIPS, 19:1169–1176, 2007.

[27] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy Gradient Methods for Reinforcement Learning with Function Approximation. NIPS, 13:1057–1063, 2000.

[28] M. Toussaint, S. Harmeling, and A. Storkey. Probabilistic Inference for Solving (PO)MDPs. Research Report EDI-INF-RR-0934, University of Edinburgh, School of Informatics, 2006.

[29] N. Vlassis, M. Toussaint, G. Kontes, and S. Piperidis. Learning Model-Free Robot Control by a Monte Carlo EM Algorithm. Autonomous Robots, 27(2):123–130, 2009.

[30] L. Weaver and N. Tao. The Optimal Reward Baseline for Gradient Based Reinforcement Learning. UAI, 17:538–545, 2001.

[31] R. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8:229–256, 1992.