{"title": "Efficient Nonlinear Control with Actor-Tutor Architecture", "book": "Advances in Neural Information Processing Systems", "page_first": 1012, "page_last": 1018, "abstract": null, "full_text": "Efficient Nonlinear Control with Actor-Tutor Architecture \n\nKenji Doya* \n\nATR Human Information Processing Research Laboratories \n2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan \n\nAbstract \n\nA new reinforcement learning architecture for nonlinear control is proposed. A direct feedback controller, or the actor, is trained by a value-gradient based controller, or the tutor. This architecture enables both efficient use of the value function and simple computation for real-time implementation. Good performance was verified in multi-dimensional nonlinear control tasks using Gaussian soft-max networks. \n\n1 INTRODUCTION \n\nIn the study of temporal difference (TD) learning in continuous time and space (Doya, 1996b), an optimal nonlinear feedback control law was derived using the gradient of the value function and a local linear model of the system dynamics. It was demonstrated in the simulation of a pendulum swing-up task that the value-gradient based control scheme requires many fewer learning trials than the conventional \"actor-critic\" control scheme (Barto et al., 1983). \n\nIn the actor-critic scheme, the actor, a direct feedback controller, improves its control policy stochastically using the TD error as the effective reinforcement (Figure 1a). Despite its relatively slow learning, the actor-critic architecture has the virtue of simple computation in generating control commands. In order to train a direct controller while making efficient use of the value function, we propose a new reinforcement learning scheme which we call the \"actor-tutor\" architecture (Figure 1b). \n\n*Current address: Kawato Dynamic Brain Project, JSTC. 
2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan. E-mail: doya@erato.atr.co.jp \n\nIn the actor-tutor scheme, the optimal control command based on the current estimate of the value function is used as the target output of the actor. With the use of supervised learning algorithms (e.g., LMSE), learning of the actor is expected to be faster than in the actor-critic scheme, which uses stochastic search algorithms (e.g., A_{R-P}). The simulation results below confirm this prediction. This hybrid control architecture provides a model of functional integration of motor-related brain areas, especially the basal ganglia and the cerebellum (Doya, 1996a). \n\n2 CONTINUOUS TD LEARNING \n\nFirst, we summarize the theory of TD learning in continuous time and space (Doya, 1996b), which is the basis for the derivation of the proposed control scheme. \n\n2.1 CONTINUOUS TD ERROR \n\nLet us consider a continuous-time, continuous-state dynamical system \n\ndx(t)/dt = f(x(t), u(t)),    (1) \n\nwhere x ∈ X ⊂ R^n is the state and u ∈ U ⊂ R^m is the control input (or the action). The reinforcement is given as a function of the state and the control \n\nr(t) = r(x(t), u(t)).    (2) \n\nFor a given control law (or policy) \n\nu(t) = μ(x(t)),    (3) \n\nwe define the \"value function\" of the state x(t) as \n\nV^μ(x(t)) = ∫_t^∞ (1/τ) e^{-(s-t)/τ} r(x(s), u(s)) ds,    (4) \n\nwhere x(s) and u(s) (t ≤ s < ∞) follow the system dynamics (1) and the control law (3). Our goal is to find an optimal control law μ* that maximizes V^μ(x) for any state x ∈ X. Note that τ is the time constant of imminence-weighting, which is related to the discount factor γ of discrete-time TD as γ = 1 - Δt/τ. 
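The correspondence γ = 1 - Δt/τ can be checked numerically. The sketch below (the exponentially decaying reward signal is a hypothetical choice, and the Euler discretization step `dt` is illustrative) compares the integral in (4) with a discrete-time discounted sum:

```python
import math

# Euler approximation of the value integral (4) for a scalar reward signal:
#   V = integral_t^inf (1/tau) * exp(-(s-t)/tau) * r(s) ds, with t = 0.
def continuous_value(r, tau, horizon=50.0, dt=1e-3):
    v, s = 0.0, 0.0
    while s < horizon:
        v += (1.0 / tau) * math.exp(-s / tau) * r(s) * dt
        s += dt
    return v

# The same quantity as a discrete-time discounted sum with gamma = 1 - dt/tau.
def discrete_value(r, tau, horizon=50.0, dt=1e-3):
    gamma = 1.0 - dt / tau
    v, s, g = 0.0, 0.0, 1.0
    while s < horizon:
        v += g * r(s) * dt / tau
        g *= gamma
        s += dt
    return v

r = lambda s: math.exp(-0.1 * s)   # hypothetical reward signal
vc = continuous_value(r, tau=1.0)
vd = discrete_value(r, tau=1.0)
```

For this reward, the integral has the closed form 1/1.1, and the two approximations agree to order Δt.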
\nBy differentiating (4) by t, we have a local consistency condition for the value function \n\nτ dV^μ(x(t))/dt = V^μ(x(t)) - r(t).    (5) \n\nLet P(x(t)) be the prediction of the value function V^μ(x(t)) from x(t) by a neural network, or some function approximator with sufficient capability of generalization. The prediction should be adjusted to minimize the inconsistency \n\nr̂(t) = r(t) - P(x(t)) + τ dP(x(t))/dt,    (6) \n\nwhich is a continuous version of the TD error. Because the boundary condition for the value function is given on the attractor set of the state space, correction of P(x(t)) should be made backward in time. The correspondence between continuous-time TD algorithms and discrete-time TD(λ) algorithms (Sutton, 1988) is shown in (Doya, 1996b). \n\nFigure 1: (a) Actor-critic. (b) Actor-tutor. \n\n2.2 OPTIMAL CONTROL BY VALUE GRADIENT \n\nAccording to the principle of dynamic programming (Bryson and Ho, 1975), the local constraint on the value function V* for the optimal control law μ* is given by the Hamilton-Jacobi-Bellman equation \n\nV*(x(t)) = max_{u(t)∈U} [ r(x(t), u(t)) + τ (∂V*(x(t))/∂x) f(x(t), u(t)) ].    (7) \n\nThe optimal control μ* is given by solving the maximization problem in the HJB equation, i.e., \n\n∂r(x, u)/∂u + τ (∂V*(x)/∂x) (∂f(x, u)/∂u) = 0.    (8) \n\nWhen the cost for each control variable is given by a convex potential function G_j(), \n\nr(x, u) = R(x) - Σ_j G_j(u_j),    (9) \n\nequation (8) can be solved using the monotonic function g_j = (G_j')^{-1} as \n\nu_j = g_j( τ (∂V*(x)/∂x) (∂f(x, u)/∂u_j) ).    (10) \n\nIf the system is linear with respect to the input, which is the case with many mechanical systems, ∂f(x, u)/∂u_j is independent of u and the above equation gives a closed-form optimal feedback control law u = μ*(x). 
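The continuous TD error (6) can be sketched with a finite-difference estimate of dP/dt; the scalar interface below is a hypothetical stand-in for whatever function approximator computes P:

```python
# Continuous TD error (6): r_hat(t) = r(t) - P(x(t)) + tau * dP(x(t))/dt,
# with dP/dt approximated by a backward difference over one sampling step dt.
def td_error(r_t, P_prev, P_curr, tau, dt):
    dP_dt = (P_curr - P_prev) / dt
    return r_t - P_curr + tau * dP_dt
```

When the prediction is locally consistent with (5), this quantity is zero; its sign indicates whether P should be raised or lowered along the trajectory.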
\nIn practice, the optimal value function is unknown and we replace V*(x) with the current estimate of the value function P(x): \n\nu = g( τ (∂P(x)/∂x) (∂f(x, u)/∂u) ).    (11) \n\nWhile the system evolves under the above control law, the value function P(x) is updated to minimize the TD error (6). In (11), the vector ∂P(x)/∂x represents the desired motion direction in the state space and the matrix ∂f(x, u)/∂u transforms it into the action space. The function g, which is specified by the control cost, determines the amplitude of the control output. For example, if the control cost G is quadratic, then (11) reduces to linear feedback control. A practically important case is when g is a sigmoid, because this gives a feedback control law for a system with limited control amplitude, as in the examples below. \n\n3 ACTOR-TUTOR ARCHITECTURE \n\nIt was shown in a pendulum swing-up task with limited torque (Doya, 1996b) that the above value-gradient based control scheme (11) can learn the task in many fewer trials than the actor-critic scheme. However, computation of the feedback command by (11) requires an on-line calculation of the gradient of the value function ∂P(x)/∂x and its multiplication with the local linear model of the system dynamics ∂f(x, u)/∂u, which can be too demanding for real-time implementation. \n\nOne solution to this problem is to use a simple direct controller network, as in the case of the actor-critic architecture. The training of the direct controller, or the actor, can be performed by supervised learning instead of trial-and-error learning because the target output of the controller is explicitly given by (11). 
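A minimal sketch of the feedback law (11) for a single control input: the value gradient ∂P/∂x is estimated by central differences, ∂f/∂u is assumed given, and the scaled tanh standing in for the sigmoid g is an illustrative choice, not the paper's specific cost-derived function:

```python
import numpy as np

# Value-gradient feedback law (11): u = g(tau * dP/dx . df/du).
# g(z) = u_max * tanh(z / u_max) bounds the output to |u| <= u_max,
# as for a system with limited control amplitude.
def value_gradient_control(P, x, dfdu, tau, u_max, eps=1e-4):
    n = len(x)
    # central-difference estimate of the value gradient dP/dx
    grad = np.array([(P(x + eps * np.eye(n)[i]) - P(x - eps * np.eye(n)[i]))
                     / (2 * eps) for i in range(n)])
    z = tau * grad @ dfdu   # desired motion direction mapped into action space
    return u_max * np.tanh(z / u_max)
```

With a quadratic control cost, g would instead be linear and (11) would reduce to linear feedback, as noted above.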
Although computation of the target output may involve a processing time that is not acceptable for immediate feedback control, it is still possible to use its output for training the direct controller provided that there is some mechanism of short-term memory (e.g., an eligibility trace in the connection weights). \n\nFigure 1(b) is a schematic diagram of this \"actor-tutor\" architecture. The critic monitors the performance of the actor and estimates the value function. The \"tutor\" is a cascade of the critic, its gradient estimator, the local linear model of the system, and the differential model of the control cost. The actor is trained to minimize the difference between its output and the tutor's output. \n\n4 SIMULATION \n\nWe tested the performance of the actor-tutor architecture in two nonlinear control tasks: a pendulum swing-up task (Doya, 1996b) and a global version of the cart-pole balancing task (Barto et al., 1983). \n\nThe network architecture we used for both the actor and the critic was a Gaussian soft-max network. The output of the network is given by \n\ny = Σ_{k=1}^{K} w_k b_k(x),    b_k(x) = exp[-Σ_{i=1}^{n} ((x_i - c_{ki})/s_{ki})^2] / Σ_{l=1}^{K} exp[-Σ_{i=1}^{n} ((x_i - c_{li})/s_{li})^2], \n\nwhere (c_{k1}, ..., c_{kn}) and (s_{k1}, ..., s_{kn}) are the center and the size of the k-th basis function. It is in general possible to adjust the centers and sizes of the basis functions, but in order to assure predictable transient behaviors, we fixed them on a grid. In this case, computation can be drastically reduced by factorizing the activation of the basis functions in each input dimension. \n\n4.1 PENDULUM SWING-UP TASK \n\nThe first task was to swing up a pendulum with a limited torque |T| ≤ T_max, which was about one fifth of the torque required to statically bring the pendulum up (Figure 2(a)). 
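The Gaussian soft-max network above, and the factorization that makes grid-aligned centers cheap, can be sketched as follows (a sketch under the stated grid assumption; function names are illustrative):

```python
import numpy as np

# Direct evaluation of the Gaussian soft-max network: y = sum_k w_k b_k(x),
# where b_k is a Gaussian activation normalized over all K basis functions.
def gaussian_softmax(x, centers, sizes, w):
    a = np.exp(-np.sum(((x - centers) / sizes) ** 2, axis=1))
    return w @ (a / a.sum())

# Factorized activations for grid-aligned centers: 1-D Gaussian activations
# are computed per input dimension and combined by outer products, reducing
# K = K_1 * ... * K_n exponentials to only K_1 + ... + K_n of them.
def grid_activations(x, axes_centers, axes_sizes):
    acts = [np.exp(-((xi - c) / s) ** 2)
            for xi, c, s in zip(x, axes_centers, axes_sizes)]
    a = acts[0]
    for ai in acts[1:]:
        a = np.outer(a, ai).ravel()
    return a / a.sum()
```

The factorized activations equal the direct ones whenever the full center grid is the Cartesian product of the per-dimension centers.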
This is a nonlinear control task in which the controller has to swing the pendulum several times at the bottom to build up enough momentum. \n\nFigure 2: Pendulum swing-up task. The dynamics of the pendulum (a) are given by ml^2 d^2θ/dt^2 = -μ dθ/dt + mgl sin θ + T. The parameters were m = l = 1, g = 9.8, μ = 0.01, and T_max = 2.0. (b), (c), and (d): learning curves for value-gradient based optimal control, actor-critic, and actor-tutor, respectively; t_up is the time during which |θ| < 45°. \n\nThe state space for the pendulum x = (θ, ω) was 2D and we used 12 x 12 basis functions to cover the range |θ| ≤ 180° and |ω| ≤ 180°/s. The reinforcement for the state was given by the height of the tip of the pendulum, i.e., R(x) = cos θ, and the cost for control G and the corresponding output sigmoid function g were selected to match the maximal output torque T_max. \n\nFigures 2(b), (c), and (d) show the learning curves for the value-gradient based control (11), actor-critic, and actor-tutor control schemes, respectively. As we expected, the learning of the actor-tutor was much faster than that of the actor-critic and was comparable to the value-gradient based optimal control scheme. \n\n4.2 CART-POLE SWING-UP TASK \n\nNext we tested the learning scheme in a higher-dimensional nonlinear control task, namely, a cart-pole swing-up task (Figure 3). In the pioneering work of Barto et al. (1983), the actor-critic system successfully learned the task of balancing the pole within ±12° of the upright position while avoiding collision with the ends of the cart track. The task we chose was to swing up the pole from an arbitrary angle and to balance it upright. The physical parameters of the cart-pole were the same as in (Barto et al., 1983) except that the length of the track was doubled to provide enough room for swinging. 
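The pendulum plant of Figure 2(a) can be exercised with a simple Euler step; this is a sketch of the model only (the integration step and the sign convention, with θ measured from the upright position, are assumptions), not of the learned swing-up controller:

```python
import math

# One Euler step of the pendulum dynamics in Figure 2(a):
#   m l^2 d2theta/dt2 = -mu dtheta/dt + m g l sin(theta) + T,
# with theta measured from upright and torque clipped to |T| <= T_max = 2.0,
# using the simulation parameters m = l = 1, g = 9.8, mu = 0.01.
def pendulum_step(theta, omega, T, dt=0.01, m=1.0, l=1.0, g=9.8, mu=0.01):
    T = max(-2.0, min(2.0, T))
    domega = (-mu * omega + m * g * l * math.sin(theta) + T) / (m * l * l)
    omega = omega + domega * dt
    theta = theta + omega * dt
    return theta, omega
```

With this convention the hanging position θ = 180° is a stable equilibrium under zero torque, while the upright position θ = 0 (where R(x) = cos θ is maximal) is unstable, which is why momentum must be built up by swinging.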
\n\nFigure 3: Cart-pole swing-up task. (a) An example of a swing-up trajectory. (b) Value function learned by the critic. (c) Feedback force learned by the actor. Each square in the plot shows a slice of the 4D state space parallel to the (θ, ω) plane. \n\nFigure 3(a) shows an example of a successful swing-up after 1500 learning trials with the actor-tutor architecture. We could not achieve comparable performance with the actor-critic scheme within 3000 learning trials. Figures 3(b) and (c) show the value function and the feedback force field, respectively, in the 4D state space x = (x, v, θ, ω), which were implemented in 6 x 6 x 12 x 12 Gaussian soft-max networks. We imposed symmetry constraints on both the actor and critic networks to facilitate generalization. It can be seen that the paths to the upright position in the center of the track are represented as ridges in the value function. \n\n5 DISCUSSION \n\nThe biggest problem in applying TD or DP to real-world control tasks is the curse of dimensionality, which makes both the computation for each data point and the number of data points necessary for training very large. The actor-tutor architecture provides a partial solution to the former problem in real-time implementation. The grid-based Gaussian soft-max basis function network was successfully used in a 4D state space. However, a more flexible algorithm that allocates basis functions only in the relevant parts of the state space may be necessary for dealing with higher-dimensional systems (Schaal and Atkeson, 1996). \n\nIn the above simulations, we assumed that the local linear model of the system dynamics ∂f(x, u)/∂u was available. In preliminary experiments, it was verified that the critic, the system model, and the actor can be trained simultaneously. 
The actor-tutor architecture resembles \"feedback error learning\" (Kawato et al., 1987) in the sense that a nonlinear controller is trained by the output of another controller. However, the actor-tutor scheme can be applied to highly nonlinear control tasks for which it is difficult to prepare a simple linear feedback controller. \n\nMotivated by the performance of the actor-tutor architecture and recent physiological and fMRI experiments on brain activity during the course of motor learning (Hikosaka et al., 1996; Imamizu et al., 1996), we proposed a framework of functional integration of the basal ganglia, the cerebellum, and cerebral motor areas (Doya, 1996a). In this framework, the basal ganglia learn the value function P(x) (Houk et al., 1994) and generate the desired motion direction based on its gradient ∂P(x)/∂x. This is transformed into a motor command by the \"transpose model\" of the motor system (∂f(x, u)/∂u)^T in the lateral cerebellum (cerebrocerebellum). In early stages of learning, this output is used for control, although its feedback latency is long. As the subject repeats the same task, a direct controller is constructed in the medial and intermediate cerebellum (spinocerebellum) with the above motor command as the teacher. The direct controller enables quick, near-automatic performance with less cognitive load on other parts of the brain. \n\nReferences \n\nBarto, A. G., Sutton, R. S., and Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13:834-846. \n\nBryson, Jr., A. E. and Ho, Y.-C. (1975). Applied Optimal Control. Hemisphere Publishing, New York, 2nd edition. \n\nDoya, K. (1996a). An integrated model of basal ganglia and cerebellum in sequential control tasks. Society for Neuroscience Abstracts, 22:2029. \n\nDoya, K. (1996b). 
Temporal difference learning in continuous time and space. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems 8, pages 1073-1079. MIT Press, Cambridge, MA. \n\nHikosaka, O., Miyachi, S., Miyashita, K., and Rand, M. K. (1996). Procedural learning in monkeys - possible roles of the basal ganglia. In Ono, T., McNaughton, B. L., Molotchnikoff, S., Rolls, E. T., and Nishijo, H., editors, Perception, Memory and Emotion: Frontiers in Neuroscience, pages 403-420. Pergamon, Oxford. \n\nHouk, J. C., Adams, J. L., and Barto, A. G. (1994). A model of how the basal ganglia generate and use neural signals that predict reinforcement. In Houk, J. C., Davis, J. L., and Beiser, D. G., editors, Models of Information Processing in the Basal Ganglia, pages 249-270. MIT Press, Cambridge, MA. \n\nImamizu, H., Miyauchi, S., Sasaki, Y., Takino, R., Putz, B., and Kawato, M. (1996). A functional MRI study on internal models of dynamic transformations during learning a visuomotor task. Society for Neuroscience Abstracts, 22:898. \n\nKawato, M., Furukawa, K., and Suzuki, R. (1987). A hierarchical neural network model for control and learning of voluntary movement. Biological Cybernetics, 57:169-185. \n\nSchaal, S. and Atkeson, C. G. (1996). From isolation to cooperation: An alternative view of a system of experts. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems 8, pages 605-611. MIT Press, Cambridge, MA. \n\nSutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44.", "award": [], "sourceid": 1228, "authors": [{"given_name": "Kenji", "family_name": "Doya", "institution": null}]}