{"title": "Advantage Updating Applied to a Differential Game", "book": "Advances in Neural Information Processing Systems", "page_first": 353, "page_last": 360, "abstract": "", "full_text": "Advantage Updating Applied to \n\na Differential Game \n\nMance E. Harmon \n\nWright Laboratory \n\nWL/AAAT Bldg. 635 2185 Avionics Circle \n\nWright-Patterson Air Force Base, OH 45433-7301 \n\nharmonme@aa.wpafb.mil \n\nLeemon C. Baird III\u00b7 \n\nWright Laboratory \n\nbaird@cs.usafa.af.mil \n\nA. Harry Klopr \nWright Laboratory \n\nklopfah@aa.wpafb.mil \n\nCategory: Control, Navigation, and Planning \n\nKeywords: Reinforcement Learning, Advantage Updating, \n\nDynamic Programming, Differential Games \n\nAbstract \n\nAn application of reinforcement learning to a linear-quadratic, differential \ngame is presented. The reinforcement learning system uses a recently \ndeveloped algorithm, the residual gradient form of advantage updating. \nThe game is a Markov Decision Process (MDP) with continuous time, \nstates, and actions, linear dynamics, and a quadratic cost function. The \ngame consists of two players, a missile and a plane; the missile pursues \nthe plane and the plane evades the missile. The reinforcement learning \nalgorithm for optimal control is modified for differential games in order to \nfind the minimax point, rather than the maximum. Simulation results are \ncompared to the optimal solution, demonstrating that the simulated \nreinforcement learning system converges to the optimal answer. The \nperformance of both the residual gradient and non-residual gradient forms \nof advantage updating and Q-learning are compared. The results show that \nadvantage updating converges faster than Q-learning in all simulations. \nThe results also show advantage updating converges regardless of the time \nstep duration; Q-learning is unable to converge as the time step duration \n~rows small. \n\nU.S .A.F. Academy, 2354 Fairchild Dr. 
Suite 6K41, USAFA, CO 80840-6234 \n\n1 ADVANTAGE UPDATING \n\nThe advantage updating algorithm (Baird, 1993) is a reinforcement learning algorithm in which two types of information are stored. For each state x, the value V(x) is stored, representing an estimate of the total discounted return expected when starting in state x and performing optimal actions. For each state x and action u, the advantage, A(x,u), is stored, representing an estimate of the degree to which the expected total discounted reinforcement is increased by performing action u rather than the action currently considered best. The optimal value function V*(x) represents the true value of each state. The optimal advantage function A*(x,u) will be zero if u is the optimal action (because u confers no advantage relative to itself) and A*(x,u) will be negative for any suboptimal u (because a suboptimal action has a negative advantage relative to the best action). The optimal advantage function A* can be defined in terms of the optimal value function V*: \n\nA*(x,u) = (1/Δt)[R(x,u) - V*(x) + γ^Δt V*(x')]    (1) \n\nThe definition of an advantage includes a 1/Δt term to ensure that, for small time step duration Δt, the advantages will not all go to zero. \n\nBoth the value function and the advantage function are needed during learning, but after convergence to optimality, the policy can be extracted from the advantage function alone. The optimal policy for state x is any u that maximizes A*(x,u). The notation \n\nAmax(x) = max_u A(x,u)    (2) \n\ndefines Amax(x). If Amax converges to zero in every state, the advantage function is said to be normalized. Advantage updating has been shown to learn faster than Q-learning (Watkins, 1989), especially for continuous-time problems (Baird, 1993). 
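The relationship in equation (1) can be sketched as a tabular computation. The example below is illustrative only, not the paper's code: the two-state MDP, its rewards, and the V* values are invented placeholders.

```python
# Tabular sketch of equation (1): A*(x,u) = (1/dt)[R(x,u) - V*(x) + gamma**dt * V*(x')].
# The two-state MDP, rewards, and V* values are invented for illustration.

dt = 0.1
gamma = 0.9

# Deterministic dynamics and per-step reinforcement (made up).
next_state = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}
reward = {(0, 0): 0.1, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.2}

# Placeholder "optimal" values; in practice V* would come from learning.
V_star = {0: 5.0, 1: 7.0}

def advantage(x, u):
    x_next = next_state[(x, u)]
    return (reward[(x, u)] - V_star[x] + gamma ** dt * V_star[x_next]) / dt

A = {(x, u): advantage(x, u) for x in (0, 1) for u in (0, 1)}

# The 1/dt factor keeps advantages from collapsing toward zero as dt shrinks.
# A normalized advantage function would satisfy max_u A*(x,u) = 0 in every state.
A_max = {x: max(A[(x, 0)], A[(x, 1)]) for x in (0, 1)}
print(A_max)
```

After convergence, the greedy policy is simply argmax_u A(x,u), so the value function need not be consulted at decision time.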
\nIf advantage updating (Baird, 1993) is used to control a deterministic system, there are two equations that are the equivalent of the Bellman equation in value iteration (Bertsekas, 1987). These are a pair of simultaneous equations (Baird, 1993): \n\nA(x,u) - max_u' A(x,u') = (R + γ^Δt V(x') - V(x)) (1/Δt)    (3) \n\nmax_u A(x,u) = 0    (4) \n\nwhere a time step is of duration Δt, and performing action u in state x results in a reinforcement of R and a transition to state x_{t+Δt}. The optimal advantage and value functions will satisfy these equations. For a given A and V function, the Bellman residual errors, E, as used in Williams and Baird (1993) and defined here as equations (5) and (6), are the degrees to which the two equations are not satisfied: \n\nE1(x_t,u) = (R(x_t,u) + γ^Δt V(x_{t+Δt}) - V(x_t)) (1/Δt) - A(x_t,u) + max_u' A(x_t,u')    (5) \n\nE2(x,u) = -max_u A(x,u)    (6) \n\n2 RESIDUAL GRADIENT ALGORITHMS \n\nDynamic programming algorithms can be guaranteed to converge to optimality when used with look-up tables, yet be completely unstable when combined with function-approximation systems (Baird & Harmon, in preparation). It is possible to derive an algorithm that has guaranteed convergence for a quadratic function approximation system (Bradtke, 1993), but that algorithm is specific to quadratic systems. One solution to this problem is to derive a learning algorithm to perform gradient descent on the mean squared Bellman residuals given in (5) and (6). This is called the residual gradient form of an algorithm. \n\nThere are two Bellman residuals, (5) and (6), so the residual gradient algorithm must perform gradient descent on the sum of the two squared Bellman residuals. It has been found to be useful to combine reinforcement learning algorithms with function approximation systems (Tesauro, 1990 & 1992). 
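The two residuals in equations (5) and (6) are straightforward to compute for tabular A and V. The sketch below uses invented numbers and is illustration only, not the paper's code.

```python
# Sketch of the Bellman residuals (5) and (6) for tabular A and V.
# All numbers below are invented placeholders.

dt = 0.1
gamma = 0.9

V = {0: 5.0, 1: 7.0}
A = {(0, 0): -0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): -1.2}

def e1(x, u, r, x_next):
    """E1: degree to which equation (3) is violated at (x, u)."""
    a_max = max(A[(x, 0)], A[(x, 1)])
    return (r + gamma ** dt * V[x_next] - V[x]) / dt - A[(x, u)] + a_max

def e2(x):
    """E2: degree to which equation (4) (normalization) is violated at x."""
    return -max(A[(x, 0)], A[(x, 1)])

# This particular A happens to be normalized (its per-state maximum is zero),
# so E2 vanishes even though E1 generally does not.
print(e1(0, 1, 0.0, 1), e2(0), e2(1))
```

Driving both residuals to zero in every state is exactly what the residual gradient algorithm's squared-error descent attempts.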
If function approximation systems are used for the advantage and value functions, and if the function approximation systems are parameterized by a set of adjustable weights, and if the system being controlled is deterministic, then, for incremental learning, a given weight W in the function-approximation system could be changed according to equation (7) on each time step: \n\nΔW = -(α/2) ∂[E1²(x_t,u_t) + E2²(x_t,u_t)]/∂W \n\n   = -α E1(x_t,u_t) ∂E1(x_t,u_t)/∂W - α E2(x_t,u_t) ∂E2(x_t,u_t)/∂W \n\n   = -α ((1/Δt)(R + γ^Δt V(x_{t+Δt}) - V(x_t)) - A(x_t,u_t) + max_u A(x_t,u)) ((1/Δt)(γ^Δt ∂V(x_{t+Δt})/∂W - ∂V(x_t)/∂W) - ∂A(x_t,u_t)/∂W + ∂max_u A(x_t,u)/∂W) - α max_u A(x_t,u) ∂max_u A(x_t,u)/∂W    (7) \n\nAs a simple, gradient-descent algorithm, equation (7) is guaranteed to converge to the correct answer for a deterministic system, in the same sense that backpropagation (Rumelhart, Hinton, & Williams, 1986) is guaranteed to converge. However, if the system is nondeterministic, then it is necessary to independently generate two different possible \"next states\" x_{t+Δt} for a given action u_t performed in a given state x_t. One x_{t+Δt} must be used to evaluate V(x_{t+Δt}), and the other must be used to evaluate ∂V(x_{t+Δt})/∂W. This ensures that the weight change is an unbiased estimator of the true Bellman-residual gradient, but requires a system such as in Dyna (Sutton, 1990) to generate the second x_{t+Δt}. The differential game in this paper was deterministic, so this was not needed here. \n\n3 THE SIMULATION \n\n3.1 GAME DEFINITION \n\nWe employed a linear-quadratic, differential game (Isaacs, 1965) for comparing Q-learning to advantage updating, and for comparing the algorithms in their residual gradient forms. 
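The descent on the summed squared residuals in equation (7) can be sketched with scalar stand-ins. Everything below is assumed for illustration: the quadratic forms, dynamics, and constants are invented, and the gradient is estimated numerically rather than derived analytically.

```python
# A sketch of one residual-gradient step: gradient descent on E1**2 + E2**2
# for a deterministic system, with a finite-difference gradient estimate.
# Approximators, dynamics, and constants are invented scalar stand-ins.

dt, gamma, alpha = 0.1, 0.9, 0.01

def step_env(x, u):
    """Made-up deterministic dynamics and quadratic cost."""
    return x + u * dt, -(x * x + u * u) * dt

def make_residuals(w):
    v, b, c = w               # V(x) = v*x**2, A(x,u) = -c*(u - b*x)**2
    def V(x): return v * x * x
    def A(x, u): return -c * (u - b * x) ** 2
    def residuals(x, u):
        x2, r = step_env(x, u)
        a_max = 0.0           # max_u A(x,u) = 0 by construction (c > 0)
        e1 = (r + gamma ** dt * V(x2) - V(x)) / dt - A(x, u) + a_max
        e2 = -a_max
        return e1, e2
    return residuals

def loss(w, x, u):
    e1, e2 = make_residuals(w)(x, u)
    return 0.5 * (e1 * e1 + e2 * e2)

def residual_gradient_step(w, x, u, eps=1e-6):
    """Central-difference estimate of the gradient, then one descent step."""
    grad = []
    for i in range(len(w)):
        wp, wm = list(w), list(w)
        wp[i] += eps
        wm[i] -= eps
        grad.append((loss(wp, x, u) - loss(wm, x, u)) / (2 * eps))
    return [wi - alpha * g for wi, g in zip(w, grad)]

w = [1.0, 0.5, 1.0]
w_new = residual_gradient_step(w, x=1.0, u=-0.3)
print(w_new)
```

Because the step descends the squared-residual surface directly, a small enough learning rate reduces the loss at the sampled transition, which is the stability property the residual gradient form buys over the direct form.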
\nThe game has two players, a missile and a plane, as in games described by Rajan, Prasad, and Rao (1980) and Millington (1991). The state x is a vector (x_m,x_p) composed of the state of the missile and the state of the plane, each of which is composed of the position and velocity of the player in two-dimensional space. The action u is a vector (u_m,u_p) composed of the action performed by the missile and the action performed by the plane, each of which is the acceleration of the player in two-dimensional space. The dynamics of the system are linear; the next state x_{t+1} is a linear function of the current state x_t and action u_t. The reinforcement function R is a quadratic function of the accelerations and the distance between the players. \n\nR(x,u) = [distance² + (missile acceleration)² - 2(plane acceleration)²]Δt    (8) \n\nR(x,u) = [(x_m - x_p)² + u_m² - 2u_p²]Δt    (9) \n\nIn equation (9), squaring a vector is equivalent to taking the dot product of the vector with itself. The missile seeks to minimize the reinforcement, and the plane seeks to maximize reinforcement. The plane receives twice as much punishment for acceleration as does the missile, thus allowing the missile to accelerate twice as easily as the plane. \n\nThe value function V is a quadratic function of the state. In equation (10), D_m and D_p are weight matrices that change during learning. \n\nV(x) = x_m^T D_m x_m + x_p^T D_p x_p    (10) \n\nThe advantage function A is a quadratic function of the state x and action u. The actions are accelerations of the missile and plane in two dimensions. \n\nA(x,u) = x_m^T A_m x_m + x_m^T B_m C_m u_m + u_m^T C_m u_m + x_p^T A_p x_p + x_p^T B_p C_p u_p + u_p^T C_p u_p    (11) \n\nThe matrices A, B, and C are the adjustable weights that change during learning. Equation (11) is the sum of two general quadratic functions. This would still be true if the second and fifth terms were xBu instead of xBCu. The latter form was used to simplify the calculation of the policy. Using the xBu form, the gradient is zero when u = -C⁻¹Bx/2. 
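The quadratic reinforcement of equation (9) and the linear point-mass dynamics can be sketched directly. The Euler integration step and the sample numbers below are assumptions for illustration, not the paper's simulation code.

```python
# Sketch of the game's linear dynamics and the quadratic reinforcement of
# equation (9). The Euler step and sample numbers are invented.

dt = 0.05

def step(pos, vel, acc):
    """One Euler step of linear point-mass dynamics (assumed integrator)."""
    new_vel = [v + a * dt for v, a in zip(vel, acc)]
    new_pos = [p + v * dt for p, v in zip(pos, new_vel)]
    return new_pos, new_vel

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reinforcement(xm, xp, um, up):
    """R = [(xm - xp)^2 + um^2 - 2*up^2] * dt; squaring = self dot product."""
    diff = [m - p for m, p in zip(xm, xp)]
    return (dot(diff, diff) + dot(um, um) - 2.0 * dot(up, up)) * dt

# The missile minimizes R (close the distance cheaply); the plane maximizes R
# but pays double for its own acceleration.
r = reinforcement(xm=[0.0, 0.0], xp=[3.0, 4.0], um=[1.0, 0.0], up=[0.0, 1.0])
print(r)
```

With the players 5 units apart and unit accelerations, the distance term dominates, so the missile's incentive to close the gap outweighs its acceleration penalty.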
Using the xBCu form, the gradient of A(x,u) with respect to u is zero when u = -Bx/2, which avoids the need to invert a matrix while calculating the policy. \n\n3.2 THE BELLMAN RESIDUAL AND UPDATE EQUATIONS \n\nEquations (5) and (6) define the Bellman residuals when maximizing the total discounted reinforcement for an optimal control problem; equations (12) and (13) modify the algorithm to solve differential games rather than optimal control problems, replacing the maximization over actions with a minimax (minimization over the missile's action and maximization over the plane's). \n\nE1(x_t,u_t) = (R(x_t,u_t) + γ^Δt V(x_{t+Δt}) - V(x_t)) (1/Δt) - A(x_t,u_t) + minimax_u A(x_t,u)    (12) \n\nE2(x_t,u_t) = -minimax_u A(x_t,u)    (13) \n\nThe resulting weight update equation is: \n\nΔW = -α ((R + γ^Δt V(x_{t+Δt}) - V(x_t)) (1/Δt) - A(x_t,u_t) + minimax_u A(x_t,u)) ((γ^Δt ∂V(x_{t+Δt})/∂W - ∂V(x_t)/∂W) (1/Δt) - ∂A(x_t,u_t)/∂W + ∂minimax_u A(x_t,u)/∂W) - α minimax_u A(x_t,u) ∂minimax_u A(x_t,u)/∂W    (14) \n\nFor Q-learning, the residual-gradient form of the weight update equation is: \n\nΔW = -α (R + γ^Δt minimax_u Q(x_{t+Δt},u) - Q(x_t,u_t)) (γ^Δt ∂minimax_u Q(x_{t+Δt},u)/∂W - ∂Q(x_t,u_t)/∂W)    (15) \n\n4 RESULTS \n\n4.1 RESIDUAL GRADIENT ADVANTAGE UPDATING RESULTS \n\nThe optimal weight matrices A*, B*, C*, and D* were calculated numerically with Mathematica for comparison. The residual gradient form of advantage updating learned the correct policy weights, B, to three significant digits after extensive training. Very interesting behavior was exhibited by the plane under certain initial conditions. The plane learned that in some cases it is better to turn toward the missile in the short term to increase the distance between the two in the long term, a tactic sometimes used by pilots. Figure 1 gives an example. \n\n[Figure 1: trajectory plot and log-scale distance vs. time] 
Figure 1: Simulation of a missile (dotted line) pursuing a plane (solid line), each having learned optimal behavior. The graph of distance vs. time shows the effects of the plane's maneuver in turning toward the missile. \n\n4.2 COMPARATIVE RESULTS \n\nThe error in the policy of a learning system was defined to be the sum of the squared errors in the B matrix weights. The optimal policy weights in this problem are the same for both advantage updating and Q-learning, so this metric can be used to compare results for both algorithms. Four different learning algorithms were compared: advantage updating, Q-learning, residual gradient advantage updating, and residual gradient Q-learning. Advantage updating in the non-residual-gradient form was unstable to the point that no meaningful results could be obtained, so simulation results cannot be given for it. \n\n4.2.1 Experiment Set 1 \n\nThe learning rates for both forms of Q-learning were optimized to one significant digit for each simulation. A single learning rate was used for residual-gradient advantage updating in all four simulations. It is possible that advantage updating would have performed better with different learning rates. For each algorithm, the error was calculated after learning for 40,000 iterations. The process was repeated 10 times using different random number seeds and the results were averaged. This experiment was performed for four different time step durations: 0.05, 0.005, 0.0005, and 0.00005. The non-residual-gradient form of Q-learning appeared to work better when the weights were initialized to small numbers. 
Therefore, the initial weights were chosen randomly between 0 and 1 for the residual-gradient forms of the algorithms, and between 0 and 10⁻⁸ for the non-residual-gradient form of Q-learning. For small time steps, non-residual-gradient Q-learning performed so poorly that the error was lower for a learning rate of zero (no learning) than it was for a learning rate of 10⁻⁸. Table 1 gives the learning rates used for each simulation, and figure 2 shows the resulting error after learning. \n\n[Figure 2: final error vs. time step duration for Q, RQ, and RAU] \n\nFigure 2: Error vs. time step size comparison for Q-learning (Q), residual-gradient Q-learning (RQ), and residual-gradient advantage updating (RAU), using rates optimal to one significant figure for both forms of Q-learning, and not optimized for advantage updating. The final error is the sum of squared errors in the B matrix weights after 40,000 time steps of learning. The final error for advantage updating was lower than both forms of Q-learning in every case. The errors increased for Q-learning as the time step size decreased. \n\nTime step duration, Δt:   5×10⁻²   5×10⁻³   5×10⁻⁴   5×10⁻⁵ \nQ:                        0.02     0.06     0.2      0.4 \nRQ:                       0.08     0.09     0        0 \nRAU:                      0.005    0.005    0.005    0.005 \n\nTable 1: Learning rates used for each simulation. Learning rates are optimal to one significant figure for both forms of Q-learning, but are not necessarily optimal for advantage updating. \n\n4.2.2 Experiment Set 2 \n\nFigure 3 shows a comparison of the three algorithms' ability to converge to the correct policy. The figure shows the total squared error in each algorithm's policy weights as a function of learning time. 
This simulation ran for a much longer period than the simulations in table 1 and figure 2. The learning rates used for this simulation were identical to the rates that were found to be optimal for the shorter run. The weights for the non-residual-gradient form of Q-learning grew without bound in all of the long experiments, even after the learning rate was reduced by an order of magnitude. Residual gradient advantage updating was able to learn the correct policy, while Q-learning was unable to learn a policy that was better than the initial, random weights. \n\n[Figure 3: Learning Ability Comparison; error (log scale) vs. time steps in millions for RAU and RQ] \n\nFigure 3 \n\n5 CONCLUSION \n\nThe experimental data shows residual-gradient advantage updating to be superior to the three other algorithms in all cases. As the time step grows small, Q-learning is unable to learn the correct policy. Future research will include the use of more general networks and implementation of the wire fitting algorithm proposed by Baird and Klopf (1994) to calculate the policy from a continuous choice of actions in more general networks. \n\nAcknowledgments \n\nThis research was supported under Task 2312R1 by the Life and Environmental Sciences Directorate of the United States Air Force Office of Scientific Research. \n\nReferences \n\nBaird, L. C. (1993). Advantage updating. Wright-Patterson Air Force Base, OH. (Wright Laboratory Technical Report WL-TR-93-1146, available from the Defense Technical Information Center, Cameron Station, Alexandria, VA 22304-6145). \n\nBaird, L. C., & Harmon, M. E. (In preparation). Residual gradient algorithms. Wright-Patterson Air Force Base, OH. (Wright Laboratory Technical Report). \n\nBaird, L. C., & Klopf, A. H. (1993). Reinforcement learning with high-dimensional, 
continuous actions. Wright-Patterson Air Force Base, OH. (Wright Laboratory Technical Report WL-TR-93-1147, available from the Defense Technical Information Center, Cameron Station, Alexandria, VA 22304-6145). \n\nBertsekas, D. P. (1987). Dynamic programming: Deterministic and stochastic models. Englewood Cliffs, NJ: Prentice-Hall. \n\nBradtke, S. J. (1993). Reinforcement learning applied to linear quadratic regulation. Proceedings of the 5th Annual Conference on Neural Information Processing Systems. \n\nIsaacs, R. (1965). Differential games. New York: John Wiley and Sons, Inc. \n\nMillington, P. J. (1991). Associative reinforcement learning for optimal control. Unpublished master's thesis, Massachusetts Institute of Technology, Cambridge, MA. \n\nRajan, N., Prasad, U. R., & Rao, N. J. (1980). Pursuit-evasion of two aircraft in a horizontal plane. Journal of Guidance and Control, 3(3), May-June, 261-267. \n\nRumelhart, D., Hinton, G., & Williams, R. (1986). Learning representations by back-propagating errors. Nature, 323, 9 October, 533-536. \n\nSutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. Proceedings of the Seventh International Conference on Machine Learning. \n\nTesauro, G. (1990). Neurogammon: A neural-network backgammon program. Proceedings of the International Joint Conference on Neural Networks, 3 (pp. 33-40). San Diego, CA. \n\nTesauro, G. (1992). Practical issues in temporal difference learning. Machine Learning, 8(3/4), 279-292. \n\nWatkins, C. J. C. H. (1989). Learning from delayed rewards. Doctoral thesis, Cambridge University, Cambridge, England. \n", "award": [], "sourceid": 912, "authors": [{"given_name": "Mance", "family_name": "Harmon", "institution": null}, {"given_name": "Leemon", "family_name": "Baird", "institution": null}, {"given_name": "A.", "family_name": "Klopf", "institution": null}]}