{"title": "Robust Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1061, "page_last": 1067, "abstract": null, "full_text": "Robust Reinforcement Learning \n\nJun Morimoto \nGraduate School of Information Science \nNara Institute of Science and Technology; \nKawato Dynamic Brain Project, JST \n2-2 Hikaridai Seika-cho Soraku-gun \nKyoto 619-0288 JAPAN \nxmorimo@erato.atr.co.jp \n\nKenji Doya \nATR International; \nCREST, JST \n2-2 Hikaridai Seika-cho Soraku-gun \nKyoto 619-0288 JAPAN \ndoya@isd.atr.co.jp \n\nAbstract \n\nThis paper proposes a new reinforcement learning (RL) paradigm that explicitly takes into account input disturbance as well as modeling errors. The use of environmental models in RL is quite popular both for off-line learning by simulation and for on-line action planning. However, the difference between the model and the real environment can lead to unpredictable, often unwanted results. Based on the theory of H∞ control, we consider a differential game in which a 'disturbing' agent (disturber) tries to make the worst possible disturbance while a 'control' agent (actor) tries to make the best control input. The problem is formulated as finding a min-max solution of a value function that takes into account the norm of the output deviation and the norm of the disturbance. We derive on-line learning algorithms for estimating the value function and for calculating the worst disturbance and the best control in reference to the value function. We tested the paradigm, which we call \"Robust Reinforcement Learning (RRL),\" in an inverted pendulum task. In the linear domain, the policy and the value function learned by the on-line algorithms coincided with those derived analytically from linear H∞ theory. 
For a fully nonlinear swing-up task, the control by RRL achieved robust performance against changes in the pendulum weight and friction, while a standard RL controller could not deal with such environmental changes. \n\n1 Introduction \n\nIn this study, we propose a new reinforcement learning paradigm that we call \"Robust Reinforcement Learning (RRL).\" Plain, model-free reinforcement learning (RL) is far too slow to be applied to on-line learning of real-world problems. Thus the use of environmental models has been quite common, both for on-line action planning [3] and for off-line learning by simulation [4]. However, no model can be perfect, and modeling errors can cause unpredictable results, sometimes worse than with no model at all. In fact, robustness against model uncertainty has been the main subject of research in the control community for the last twenty years, and the result is formalized as \"H∞\" control theory [6]. \n\nIn general, a modeling error causes a deviation of the real system state from the state predicted by the model. This can be re-interpreted as a disturbance to the model. However, the problem is that the disturbance due to a modeling error can have a strong correlation, and thus a standard Gaussian assumption may not be valid. The basic strategy to achieve robustness is to keep the sensitivity of the feedback control loop against a disturbance input below a bound γ, so that any disturbance due to the modeling error can be suppressed as long as the gain of the mapping from the state error to the disturbance is bounded by 1/γ. In the H∞ paradigm, those 'disturbance-to-error' and 'error-to-disturbance' gains are measured by the max norms of the functional mappings in order to assure stability for any mode of disturbance. 
\n\nIn the following, we briefly introduce the H∞ paradigm and show that the design of a robust controller can be achieved by finding a min-max solution of a value function, which is formulated as the Hamilton-Jacobi-Isaacs (HJI) equation. We then derive on-line algorithms for estimating the value function and for simultaneously deriving the worst disturbance and the best control that, respectively, maximizes and minimizes the value function. \n\nWe test the validity of the algorithms first in a linear inverted pendulum task. It is verified that the value function as well as the disturbance and control policies derived by the on-line algorithm coincide with the solution of the Riccati equations given by H∞ theory. We then compare the performance of the robust RL algorithm with a standard model-based RL in a nonlinear task of pendulum swing-up [3]. It is shown that the robust RL controller can accommodate changes in the weight and the friction of the pendulum, which a standard RL controller cannot cope with. \n\n2 H∞ Control \n\nFigure 1: (a) Generalized Plant and Controller, (b) Small Gain Theorem \n\nThe standard H∞ control [6] deals with a system shown in Fig. 1(a), where G is the plant, K is the controller, u is the control input, y is the measurement available to the controller (in the following, we assume all the states are observable, i.e. y = x), w is an unknown disturbance, and z is the error output that is desired to be kept small. In general, the controller K is designed to stabilize the closed-loop system based on a model of the plant G. However, when there is a discrepancy between the model and the actual plant dynamics, the feedback loop could be unstable. The effect of modeling error can be equivalently represented as a disturbance w generated by an unknown mapping Δ of the plant output z, as shown in Fig. 1(b). 
\n\nThe goal of the H∞ control problem is to design a controller K that brings the error z to zero while minimizing the H∞ norm of the closed-loop transfer function from the disturbance w to the output z: \n\n||T_zw||∞ = sup_ω σ_max(T_zw(jω)) = sup_w ||z||_2 / ||w||_2.   (1) \n\nHere || · ||_2 denotes the L2 norm and σ_max denotes the maximum singular value. The small gain theorem assures that if ||T_zw||∞ ≤ γ, then the system shown in Fig. 1(b) will be stable for any stable mapping Δ : z → w with ||Δ||∞ < 1/γ. \n\n2.1 Min-max Solution to the H∞ Problem \n\nWe consider a dynamical system ẋ = f(x, u, w). The H∞ control problem is equivalent to finding a control output u that satisfies the constraint \n\nV = ∫_0^∞ (z^T(t)z(t) - γ^2 w^T(t)w(t)) dt ≤ 0   (2) \n\nagainst all possible disturbances w with x(0) = 0, because it implies \n\nsup_w ||z||_2 / ||w||_2 ≤ γ.   (3) \n\nWe can consider this problem as a differential game [5] in which the best control output u that minimizes V is sought while the worst disturbance w that maximizes V is chosen. Thus an optimal value function V* is defined as \n\nV* = min_u max_w ∫_0^∞ (z^T(t)z(t) - γ^2 w^T(t)w(t)) dt.   (4) \n\nThe condition for the optimal value function is given by \n\n0 = min_u max_w [ z^T z - γ^2 w^T w + (∂V*/∂x) f(x, u, w) ],   (5) \n\nwhich is known as the Hamilton-Jacobi-Isaacs (HJI) equation. From (5), we can derive the optimal control output u_op and the worst disturbance w_op by solving \n\n∂(z^T z)/∂u + (∂V/∂x) ∂f(x, u, w)/∂u = 0   and   ∂(z^T z)/∂w - 2γ^2 w^T + (∂V/∂x) ∂f(x, u, w)/∂w = 0.   (6) \n\n3 Robust Reinforcement Learning \n\nHere we consider a continuous-time formulation of reinforcement learning [3] with the system dynamics ẋ = f(x, u) and the reward r(x, u). The basic goal is to find a policy u = g(x) that maximizes the cumulative future reward ∫_t^∞ e^{-(s-t)/τ} r(x(s), u(s)) ds for any given state x(t), where τ is the time constant of evaluation. However, a particular policy that was optimized for a certain environment may perform badly when the environmental setting changes. 
In order to assure robust performance under a changing environment or unknown disturbance, we introduce the notion of the worst disturbance in H∞ control to the reinforcement learning paradigm. \n\nIn this framework, we consider an augmented reward \n\nq(t) = r(x(t), u(t)) + s(w(t)),   (7) \n\nwhere s(w(t)) is an additional reward for withstanding a disturbing input, for example, s(w) = γ^2 w^T w. The augmented value function is then defined as \n\nV(x(t)) = ∫_t^∞ e^{-(s-t)/τ} q(x(s), u(s), w(s)) ds.   (8) \n\nThe optimal value function is given by the solution of a variant of the HJI equation \n\n(1/τ) V*(x) = max_u min_w [ r(x, u) + s(w) + (∂V*/∂x) f(x, u, w) ].   (9) \n\nNote that we cannot find appropriate policies (i.e. solutions of the HJI equation) if we choose γ too small. In the robust reinforcement learning (RRL) paradigm, the value function is updated by using the temporal difference (TD) error [3] δ(t) = q(t) - (1/τ)V(t) + dV(t)/dt, while the best action and the worst disturbance are generated by maximizing and minimizing, respectively, the right-hand side of the HJI equation (9). We use a function approximator to implement the value function V(x(t); v), where v is a parameter vector. As in standard continuous-time RL, we define the eligibility trace for a parameter v_i as e_i(s) = ∫_0^s e^{-(s-t)/κ} (∂V(t)/∂v_i) dt, with the update rule de_i(t)/dt = -(1/κ) e_i(t) + ∂V(t)/∂v_i, where κ is the time constant of the eligibility trace [3]. We can then derive the learning rule for the value function approximator [3] as dv_i/dt = η δ(t) e_i(t), where η denotes the learning rate. Note that we do not assume f(x = 0) = 0, because the error output z is generalized as the reward r(x, u) in the RRL framework. \n\n3.1 Actor-disturber-critic \n\nWe propose the actor-disturber-critic architecture, by which we can implement robust RL in a model-free fashion, analogous to the actor-critic architecture [1]. 
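The critic update just described can be sketched in a few lines. This is a minimal illustration, assuming a linear value-function approximator V(x; v) = v·phi(x) and a simple Euler discretization; the features and the constants tau, kappa, eta, dt are our own illustrative choices, not values from the paper.

```python
# Sketch of the RRL critic: TD error with evaluation time constant tau,
# eligibility traces with time constant kappa, learning rate eta.
tau, kappa, eta, dt = 1.0, 0.1, 0.01, 0.01

def critic_step(v, e, phi, q, V_now, V_prev):
    # TD error: delta(t) = q(t) - V(t)/tau + dV(t)/dt
    delta = q - V_now / tau + (V_now - V_prev) / dt
    # Trace update: de_i/dt = -e_i/kappa + dV/dv_i, where dV/dv_i = phi_i
    # for a linear critic V(x; v) = sum_i v_i phi_i(x)
    e = [ei + dt * (-ei / kappa + fi) for ei, fi in zip(e, phi)]
    # Parameter update: dv_i/dt = eta * delta * e_i
    v = [vi + dt * eta * delta * ei for vi, ei in zip(v, e)]
    return v, e, delta
```

Note that q here is the augmented reward of (7), so the same critic serves both the actor and the disturber.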
We define the policies of the actor and the disturber, implemented as u(t) = A_u(x(t); v^u) + n_u(t) and w(t) = A_w(x(t); v^w) + n_w(t), respectively, where A_u(x(t); v^u) and A_w(x(t); v^w) are function approximators with parameter vectors v^u and v^w, and n_u(t) and n_w(t) are noise terms for exploration. The parameters of the actor and the disturber are updated by \n\ndv_i^u/dt = η_u δ(t) n_u(t) ∂A_u(x(t); v^u)/∂v_i^u   and   dv_i^w/dt = -η_w δ(t) n_w(t) ∂A_w(x(t); v^w)/∂v_i^w,   (10) \n\nwhere η_u and η_w denote the learning rates. \n\n3.2 Robust Policy by Value Gradient \n\nNow we assume that an input-affine model of the system dynamics and quadratic models of the costs for the inputs are available: \n\nẋ = f(x) + g_1(x)w + g_2(x)u,   r(x, u) = Q(x) - u^T R(x) u,   s(w) = γ^2 w^T w. \n\nIn this case, we can derive the best action and the worst disturbance in reference to the value function V as \n\nu_op = (1/2) R(x)^{-1} g_2^T(x) (∂V/∂x)^T   and   w_op = -(1/(2γ^2)) g_1^T(x) (∂V/∂x)^T.   (11) \n\nWe can use the policy (11) with the value gradient ∂V/∂x derived from the value function approximator. \n\n3.3 Linear Quadratic Case \n\nHere we consider a case in which a linear dynamic model and quadratic reward models are available: \n\nẋ = Ax + B_1 w + B_2 u,   r(x, u) = -x^T Q x - u^T R u,   s(w) = γ^2 w^T w. \n\nIn this case, the value function is given by the quadratic form V = -x^T P x, where P is the solution of a Riccati equation \n\nA^T P + P A + P ((1/γ^2) B_1 B_1^T - B_2 R^{-1} B_2^T) P + Q = (1/τ) P.   (12) \n\nThus we can derive the best action and the worst disturbance as \n\nu_op = -R^{-1} B_2^T P x   and   w_op = (1/γ^2) B_1^T P x.   (13) \n\n4 Simulation \n\nWe tested the robust RL algorithm in the task of swinging up a pendulum. The dynamics of the pendulum is given by m l^2 θ'' = -μ θ' + m g l sin θ + T, where θ is the angle from the upright position, T is the input torque, μ = 0.01 is the coefficient of friction, m = 1.0 [kg] is the weight of the pendulum, l = 1.0 [m] is the length of the pendulum, and g = 9.8 [m/s^2] is the gravity acceleration. The state vector is defined as x = (θ, θ')^T. 
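As a sanity check of these dynamics, the model can be integrated with a simple Euler scheme. The sketch below is illustrative only; the step size, horizon, and initial condition are our own choices.

```python
import math

# Pendulum parameters from the text: m l^2 theta'' = -mu theta' + m g l sin(theta) + T
m, l, mu, g = 1.0, 1.0, 0.01, 9.8

def pendulum_step(theta, theta_dot, torque, dt=0.01):
    # One Euler step; theta is the angle from the upright position.
    theta_ddot = (-mu * theta_dot + m * g * l * math.sin(theta) + torque) / (m * l * l)
    return theta + dt * theta_dot, theta_dot + dt * theta_ddot

# With zero torque a small deviation from upright grows, confirming that
# x = (0, 0) is an unstable equilibrium.
theta, theta_dot = 0.01, 0.0
for _ in range(200):  # 2 seconds of simulated time
    theta, theta_dot = pendulum_step(theta, theta_dot, torque=0.0)
```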
\n\n4.1 Linear Case \n\nWe first considered a linear problem in order to test whether the value function and the policy learned by robust RL coincide with the analytic solution of the H∞ control problem. Thus we use the locally linearized dynamics near the unstable equilibrium point x = (0, 0)^T. The matrices for the linear model are given by \n\nA = ( 0  1 ; g/l  -μ/(m l^2) ),   B_1 = B_2 = ( 0 ; 1/(m l^2) ),   with a 2x2 state-cost matrix Q and R = 1.   (14) \n\nThe reward function is given by q(t) = -x^T Q x - u^2 + γ^2 w^2, where the robustness criterion is γ = 2.0. \n\nThe value function, V = -x^T P x, is parameterized by a symmetric matrix P. For on-line estimation of P, we define the vectors x̃ = (x_1^2, 2 x_1 x_2, x_2^2)^T and p = (p_11, p_12, p_22)^T and reformulate V as V = -p^T x̃. Each element of p is updated using the recursive least-squares method [2]. Note that we used a pre-designed stabilizing controller as the initial setting of the RRL controller for stable learning [2]. \n\n4.1.1 Learning of the value function \n\nHere we used the policy by value gradient shown in section 3.2. Figure 2(a) shows that each element of the vector p converged to the solution of the Riccati equation (12). \n\n4.1.2 Actor-disturber-critic \n\nHere we used robust RL implemented by the actor-disturber-critic shown in section 3.1. In the linear case, the actor and the disturber are represented as the linear controllers A_u(x; v^u) = v^u x and A_w(x; v^w) = v^w x, respectively. The actor and the disturber converged close to the policies in (13) derived from the Riccati equation (12) (Fig. 2(b)). 
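The analytic solution that the on-line algorithms should approach can also be computed directly. The sketch below integrates the differential form of the game Riccati equation (12) to a fixed point, taking the undiscounted limit 1/τ → 0; since the exact cost matrix of (14) is not reproduced here, Q is assumed to be the identity, so the numbers are illustrative rather than the paper's.

```python
import numpy as np

# Linearized pendulum (14): A = [[0, 1], [g/l, -mu/(m l^2)]], B1 = B2
A = np.array([[0.0, 1.0], [9.8, -0.01]])
B = np.array([[0.0], [1.0]])   # B1 = B2 for this plant
Q = np.eye(2)                  # assumed state-cost matrix (not from the paper)
R, gamma = 1.0, 2.0

# Integrate dP/dt = A^T P + P A + P ((1/gamma^2) B B^T - B (1/R) B^T) P + Q
# until the right-hand side vanishes, i.e. P solves (12) with 1/tau -> 0.
M = (1.0 / gamma**2 - 1.0 / R) * (B @ B.T)
P = np.zeros((2, 2))
for _ in range(200000):
    dP = A.T @ P + P @ A + P @ M @ P + Q
    P += 0.001 * dP
    if np.abs(dP).max() < 1e-10:
        break

# Gains of the best action and the worst disturbance, eq. (13):
K_u = (1.0 / R) * (B.T @ P)         # u_op = -K_u x
K_w = (1.0 / gamma**2) * (B.T @ P)  # w_op = +K_w x
```

Because gamma = 2 makes the quadratic term negative semidefinite here, the forward Euler iteration from P = 0 converges to the positive definite stabilizing solution.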
\n, \n\n-5 \n\nv, \n\n~~~e::::;;=:;';::~250~300 \n\nP22 \n\n-25 __ \u2022 \u2022\u2022 __ \u2022 _________ \u2022\u2022 _ \u2022\u2022 ___ ~~ ___ _ \n\n-3\u00b00 \n\n50 \n\n100 \n\n150 \nTrials \n\n200 \n\n250 \n\n300 \n\n(a) Elements of p \n\n(b) Elements of v \n\nFigure 2: Time course of (a)elements of vector p = (Pll,P12,P22) and (b) elements \nof gain vector of the actor v\" = (vf, v~) and the disturber VW = (vi\", v2\"). The \ndash-dotted lines show the solution of the Ricatti equation. \n\n4.2 Applying Robust RL to a Non-linear Dynamics \nWe consider non-linear dynamical system (11), where \n\nf(x) = ( ~ sine _ ~e ) ,gt{x) = ( ~ ) ,g2(X) = ( ~ ) \nQ(x) = cos(e) - 1, R(x) = 0.04. \n\n(15) \nFrom considering (7) and (15), the reward function is given by q(t) = cos(e) - 1 -\n0.04u2 + \"'\u00b7?w 2 , where robustness criteria 'Y = 0.22. For approximating the value \nfunction, we used Normalized Gaussian Network (NGnet)[3]. Note that the input \ngain g(x) was also learned[3]. \n\nFig.3 shows the value functions acquired by robust RL and standard model-based \nRL[3]. The value function acquired by robust RL has a shaper ridge (Fig.3(a)) \nattracts swing up trajectories than that learned with standard RL. \n\nIn FigA, we compared the robustness between the robust RL and the standard RL. \nBoth robust RL controller and the standard RL controller learned to swing up and \nhold a pendulum with the weight m = 1.0[m] and the coefficient of friction J-t = 0.01 \n(FigA(a)) . \n\nThe robust RL controller could successfully swing up pendulums with different \nweight m = 3.0[kg] and the coefficient of friction J-t = 0.3 (FigA(b)). This result \nshowed robustness of the robust RL controller. The standard RL controller could \nachieve the task in fewer swings for m = 1.0[kg] and J-t = 0.01 (FigA(a)). However, \nthe standard RL controller could not swing up the pendulum with different weight \nand friction (FigA(b)). 
\n\nFigure 3: Shape of the value function after 1000 learning trials with m = 1.0 [kg], l = 1.0 [m], and μ = 0.01. (a) Robust RL, (b) Standard RL. \n\nFigure 4: Swing-up trajectories of pendulums with different weight and friction. The dash-dotted lines show the upright position. (a) m = 1.0, μ = 0.01, (b) m = 3.0, μ = 0.3. \n\n5 Conclusions \n\nIn this study, we proposed a new RL paradigm called \"Robust Reinforcement Learning (RRL).\" We showed that RRL can learn the analytic solution of the H∞ controller for the linearized inverted pendulum dynamics, and also that RRL can deal with modeling errors which standard RL cannot, as demonstrated in the non-linear inverted pendulum swing-up example. We will apply RRL to more complex tasks such as learning stand-up behavior [4]. \n\nReferences \n\n[1] A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13:834-846, 1983. \n\n[2] S. J. Bradtke. Reinforcement Learning Applied to Linear Quadratic Regulation. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 295-302. Morgan Kaufmann, San Mateo, CA, 1993. \n\n[3] K. Doya. Reinforcement Learning in Continuous Time and Space. Neural Computation, 12(1):219-245, 2000. \n\n[4] J. Morimoto and K. Doya. Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 623-630, San Francisco, CA, 2000. Morgan Kaufmann. \n\n[5] S. Weiland. 
Linear Quadratic Games, H∞, and the Riccati Equation. In Proceedings of the Workshop on the Riccati Equation in Control, Systems, and Signals, pages 156-159, 1989. \n\n[6] K. Zhou, J. C. Doyle, and K. Glover. Robust Optimal Control. Prentice Hall, New Jersey, 1996. \n", "award": [], "sourceid": 1841, "authors": [{"given_name": "Jun", "family_name": "Morimoto", "institution": null}, {"given_name": "Kenji", "family_name": "Doya", "institution": null}]}