{"title": "A Practice Strategy for Robot Learning Control", "book": "Advances in Neural Information Processing Systems", "page_first": 335, "page_last": 341, "abstract": null, "full_text": "A Practice Strategy for Robot Learning Control \n\nTerence D. Sanger \n\nDepartment of Electrical Engineering and Computer Science \n\nMassachusetts Institute of Technology, room E25-534 \n\nCambridge, MA 02139 \n\ntds@ai.mit.edu \n\nAbstract \n\n\"Trajectory Extension Learning\" is a new technique for Learning Control in Robots which assumes that there exists some parameter of the desired trajectory that can be smoothly varied from a region of easy solvability of the dynamics to a region of desired behavior which may have more difficult dynamics. By gradually varying the parameter, practice movements remain near the desired path while a Neural Network learns to approximate the inverse dynamics. For example, the average speed of motion might be varied, and the inverse dynamics can be \"bootstrapped\" from slow movements with simpler dynamics to fast movements. This provides an example of the more general concept of a \"Practice Strategy\" in which a sequence of intermediate tasks is used to simplify learning a complex task. I show an example of the application of this idea to a real 2-joint direct drive robot arm. \n\n1 INTRODUCTION \n\nThe most general definition of Adaptive Control is one which includes any controller whose behavior changes in response to the controlled system's behavior. In practice, this definition is usually restricted to modifying a small number of controller parameters in order to maintain system stability or global asymptotic stability of the errors during execution of a single trajectory (Sastry and Bodson 1989, for review). 
\n\nLearning Control represents a second level of operation, since it uses Adaptive Control to modify parameters during repeated performance trials of a desired trajectory so that future trials result in greater accuracy (Arimoto et al. 1984). In this paper I present a third level called a \"Practice Strategy\", in which Learning Control is applied to a sequence of intermediate trajectories leading ultimately to the true desired trajectory. I claim that this can significantly increase learning speed and make learning possible for systems which would otherwise become unstable. \n\n1.1 LEARNING CONTROL \n\nDuring repeated practice of a single desired trajectory, the actual trajectory followed by the robot may be significantly different. Many Learning Control algorithms modify the commands stored in a sequence memory to minimize this difference (Atkeson 1989, for review). However, the performance errors are usually measured in a sensory coordinate system, while command corrections must be made in the motor coordinate system. If the relationship between these two coordinate systems is not known, then command corrections might be in the wrong direction and inadvertently worsen performance. However, if the practice trajectory is close to the desired trajectory, then the errors will be small and the relationship between command and sensory errors can be approximated by the system Jacobian. \n\nAn alternative to a stored command sequence is to use a Neural Network to learn an approximation to the inverse dynamics in the region of interest (Sanner and Slotine 1992, Yabuta and Yamada 1991, Atkeson 1989). In this case, the commands and results from the actual movement are used as training data for the network, and smoothness properties are assumed such that the error on the desired trajectory will decrease. 
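The training scheme just described can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the scalar `plant` function, the polynomial `features` basis, and all numeric ranges are assumptions chosen for clarity. It fits an inverse model to (actual outcome, command) pairs gathered from a practice movement, and shows that the fit is trustworthy only where practice data exists.

```python
import numpy as np

# Hypothetical scalar "plant": command u produces outcome y.
# The form is assumed for illustration; real arm dynamics are far more complex.
def plant(u):
    return np.tanh(u) + 0.1 * u

# A small polynomial "network" mapping outcomes back to commands.
def features(y):
    return np.stack([np.ones_like(y), y, y**2, y**3], axis=-1)

# A practice movement: commands are executed, and the resulting pairs
# (actual outcome y, command u) become the training data for the inverse model.
u_practice = np.linspace(-1.5, 1.5, 100)
y_actual = plant(u_practice)

# Fit the inverse dynamics u ~ N(y) on the practice data by least squares.
w, *_ = np.linalg.lstsq(features(y_actual), u_practice, rcond=None)

# Inside the practiced region the learned inverse reproduces a desired outcome.
u_near = features(np.array([0.5])) @ w
err_near = abs(plant(u_near)[0] - 0.5)

# Far outside it, the inverse extrapolates poorly: practice far from the
# desired trajectory provides little useful inverse-dynamics information.
u_far = features(np.array([3.0])) @ w
err_far = abs(plant(u_far)[0] - 3.0)
print(err_near, err_far)
```

The gap between the two errors is the point of the next paragraph: a network can be nearly perfect on the movements it actually made while remaining badly wrong on the movement that was wanted.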
However, a significant problem with this method is that if the actual practice trajectory is far from the desired trajectory, then its inverse dynamics information will be of little use in training the inverse dynamics for the desired trajectory. In fact, the network may achieve perfect approximation on the actual trajectory while still making significant errors on the desired trajectory. In this case, learning will stop (since the training error is zero), leading to the phenomenon of \"learning lock-up\" (An et al. 1988). So whether Learning Control uses a sequence memory or a Neural Network, learning may proceed poorly if large errors are made during the initial practice movements. \n\n1.2 PRACTICE STRATEGIES \n\nI define a \"practice strategy\" as a sequence of trajectories such that the first element in the sequence is any previously learned trajectory, and the last element in the sequence is the ultimate desired trajectory. A well designed practice strategy will result in a sequence for which learning control of the trajectory for any particular step is simplified if prior steps have already been learned. This will occur if learning of prior trajectories reduces the initial performance error for subsequent trajectories, so that a network will be less likely to experience learning lock-up. \n\nOne example of a practice strategy is a three-step sequence in which the intermediate step is a set of independently executable subtasks which partition the desired trajectory into discrete pieces. Another example is a multi-step sequence in which intermediate steps are a set of trajectories which are somehow related to the desired trajectory. In this paper I present a multi-step sequence which gradually \n\nFigure 1: Training signals for network learning. 
\n\ntransforms some known trajectory into the desired trajectory by varying a single parameter. This method has the advantage of not requiring detailed knowledge of the task structure in order to break it up into meaningful subtasks, and conditions for convergence can be stated explicitly. It has a close relationship to Continuation Methods for solving differential equations, and can be considered to be a particular application of the Banach Extension Theorem. \n\n2 METHODS \n\nAs in (Sanger 1992), we need to specify 4 aspects of the use of a neural network within a control system: \n\n1. the network's function in the control system, \n\n2. the network learning algorithm which modifies the connection weights, \n\n3. the training signals used for network learning, and \n\n4. the practice strategy used to generate sample movements. \n\nThe network's function is to learn the inverse dynamics of an equilibrium-point controlled plant (Shadmehr 1990). The LMS-tree learning algorithm trains the network (Sanger 1991b, Sanger 1991a). The training signals are determined from the actual practice data using either \"Actual Trajectory Training\" or \"Desired Trajectory Training\", as defined below. And the practice strategy is \"Trajectory Extension Learning\", in which a parameter of the movement is gradually modified during training. \n\n2.1 TRAINING SIGNALS \n\nFigure 1 shows the general structure of the network and training signals. A desired trajectory y is fed into the network N to yield an estimated command û. This command is then applied to the plant P_α, where the subscript indicates that the plant is parameterized by the variable α. 
Although the true command u which achieves y is unknown, we do know that the estimated command û produces the actual trajectory ŷ, so these signals are used for training: the network response to ŷ, given by N ŷ, is compared to the known value û, and subtracting these yields the training error δ_ŷ = û − N ŷ. Normally, network training would use this error signal to modify the network output for inputs near ŷ, and I refer to this as \"Actual Trajectory Training\". However, if ŷ is far from y then no change in response may occur at y, and this may lead even more quickly to learning lock-up. Therefore an alternative is to use the error δ_ŷ to train the network output for inputs near y. I refer to this as \"Desired Trajectory Training\", and in the figure it is represented by the dotted arrow. \n\nThe following discussion will summarize the convergence conditions and theorems presented in (Sanger 1992). \n\nDefine \n\nR u = (1 − N P(x)) u = u − û \n\nto be an operator which maps commands into command errors for states x on the desired trajectory. Similarly, let \n\nR̂ u = (1 − N P(x)) u = u − ũ \n\nmap commands into command errors for states x on the actual trajectory. \n\nConvergence depends upon the following assumptions: \n\nA1: The plant P is smooth and invertible with respect to both the state x and the input u, with Lipschitz constants k_x and k_u, and it has stable zero-dynamics. \n\nA2: The network N is smooth with Lipschitz constant k_N. \n\nA3: Network learning reduces the error in response to a pair (y, δ_y). \n\nA4: The change in network output in response to training is smooth with Lipschitz constant k_L. \n\nA5: There exists a smoothly controllable parameter α such that an inverse dynamics solution is available at α = α_0, and the desired performance occurs when α = α_d. 
\n\nA6: The change in command required to produce a desired output after any change in α is bounded by the change in α multiplied by a constant k_α. \n\nA7: The change in plant response for any fixed input is bounded by the change in α multiplied by a constant k_P. \n\nUnder assumptions A1-A3 we can prove convergence of Desired Trajectory Training: \n\nTheorem 1: If there exists a k_R such that \n\n||R u − R û|| < k_R ||u − û|| \n\nand the learning rate satisfies 0 < γ ≤ 1, then if k_R < 1 the network output û approaches the correct command u. \n\nUnder assumptions A1-A4, we can prove convergence of Actual Trajectory Training: \n\nTheorem 2: If there exists a k_R̂ such that \n\n||R̂ u − R̂ ũ|| < k_R̂ ||u − ũ|| \n\nand the learning rate satisfies 0 < γ ≤ 1, then if k_R̂ < 1 the network output again approaches the correct command u. \n\n2.2 TRAJECTORY EXTENSION LEARNING \n\nLet α be some modifiable parameter of the plant such that for α = α_0 there exists a simple inverse dynamics solution, and we seek a solution when α = α_d. For example, if the plant uses Equilibrium Point Control (Shadmehr 1990), then at low speeds the inverse dynamics behave like a perfect servo controller, yielding desired trajectories without the need to solve the dynamics. We can continue to train a learning controller as the average speed of movement (α) is gradually increased. The inverse dynamics learned at one speed provide an approximation to the inverse dynamics for a slightly faster speed, and thus the performance errors remain small during practice. This leads to significantly faster learning rates and greater likelihood that the conditions for convergence at any given speed will be satisfied. 
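The bootstrapping loop of Trajectory Extension Learning can be sketched as follows. This is a toy sketch, not the paper's robot experiment: the plant P_α is an assumed scalar map that acts as a perfect servo at α = 0 and grows a strong nonlinearity as α increases, and the \"network\" is a small odd-polynomial fit refit after each practice movement.

```python
import numpy as np

# Hypothetical plant P_a: at a = 0 it acts as a perfect servo (y = u), and the
# dynamics become harder to invert as a grows. Purely illustrative.
def plant(u, a):
    return u + a * u**3

def features(y):
    return np.stack([y, y**3, y**5], axis=-1)   # odd-polynomial "network"

y_desired = np.linspace(-0.8, 0.8, 60)

# At a = a0 = 0 no dynamics need be solved: commanding the desired trajectory
# directly succeeds, which provides the first practice movement.
u = y_desired.copy()

# Trajectory Extension Learning: increase a in small steps, retraining the
# inverse model on each executed movement so practice errors stay small.
for a in np.linspace(0.0, 1.0, 21):
    y_actual = plant(u, a)                       # execute the current command
    # Refit the inverse model on the (actual outcome, command) pairs.
    w, *_ = np.linalg.lstsq(features(y_actual), u, rcond=None)
    u = features(y_desired) @ w                  # next practice command

final_error = np.max(np.abs(plant(u, 1.0) - y_desired))
print(final_error)
```

Because each step changes α only slightly, every practice movement stays near the desired path, so the training data always covers the region where the inverse model is needed; jumping directly to α = 1 from an untrained model would reproduce the lock-up problem of Section 1.1.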
Note that unlike traditional learning schemes, the error does not decrease monotonically with practice, but instead maintains a steady magnitude as the speed increases, until the network is no longer able to approximate the inverse dynamics. \n\nThe following is a summary of a result from (Sanger 1992). Let α change from α_1 to α_2, and let P = P_α1 and P' = P_α2. Then under assumptions A1-A7 we can prove convergence of Trajectory Extension Learning: \n\nTheorem 3: If there exists a k_R such that for α = α_1 \n\n||R u − R û|| < k_R ||u − û|| \n\nthen for α = α_2 \n\n||R' u' − R' û|| < k_R ||u' − û|| + (2 k_α + k_N k_P) |α_2 − α_1| \n\nThis shows that given the smoothness assumptions and a small enough change in α, the error will continue to decrease. \n\n3 EXAMPLE \n\nFigure 2 shows the result of 15 learning trials performed by a real direct-drive two-joint robot arm on a sampled desired trajectory. The initial trial required 11.5 seconds to execute, and the speed was gradually increased until the final trial required only 4.5 seconds. Simulated equilibrium point control was used (Bizzi et al. 1984) with stiffness and damping coefficients of 15 Nm/rad and 1.5 Nm/rad/sec, respectively. The grey line in figure 2 shows the equilibrium point control signal which generated the actual movement represented by the solid line. The difference between these two indicates the nontrivial nature of the dynamics calculations required to derive the control signal from the desired trajectory. Note that without Trajectory Extension Learning, the network does not converge and the arm becomes unstable. The neural network was an LMS tree (Sanger 1991b, Sanger 1991a) with 10 Gaussian basis functions for each of the 6 input dimensions, and a total of 15 subtrees were grown per joint (see (Sanger 1992) for further explanation). 
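The function approximator used above is the LMS tree of the references; the sketch below shows only the separable Gaussian basis idea it builds on, with the sizes taken from the example (10 Gaussian basis functions on each of 6 input dimensions). The centers, kernel width, and the particular product term are assumptions for illustration; the actual algorithm grows subtrees (products of one-dimensional kernels) selectively rather than enumerating all 10^6 products (see Sanger 1991a,b).

```python
import numpy as np

# Separable Gaussian basis in the spirit of the LMS tree: 10 one-dimensional
# Gaussians on each of 6 input dimensions. Centers and width are assumed.
CENTERS = np.linspace(-1.0, 1.0, 10)   # per-dimension centers (assumed range)
WIDTH = 0.25                            # assumed kernel width

def per_dim_gaussians(x):
    """x: (6,) state vector -> (6, 10) per-dimension Gaussian activations."""
    return np.exp(-((x[:, None] - CENTERS[None, :]) ** 2) / (2 * WIDTH**2))

# A full tensor product of 10^6 basis functions would be intractable; the LMS
# tree grows only the product terms that reduce error. Here we form a single
# hypothetical product feature (one kernel per dimension) as an illustration.
x = np.zeros(6)
g = per_dim_gaussians(x)
feature = np.prod(g[np.arange(6), [4, 5, 4, 5, 4, 5]])  # product of 6 kernels
print(g.shape, feature)
```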
\n\n4 CONCLUSION \n\nTrajectory Extension Learning is one example of the way in which a practice strategy can be used to improve convergence for Learning Control. This or other types of practice strategies might be able to increase the performance of many different types of learning algorithms both within and outside the Control domain. Such strategies may also provide a theoretical model for the practice strategies used by humans to learn complex tasks, and the theoretical analysis and convergence conditions could potentially lead to a deeper understanding of human motor learning and successful techniques for optimizing performance. \n\nAcknowledgements \n\nThanks are due to Simon Giszter, Reza Shadmehr, Sandro Mussa-Ivaldi, Emilio Bizzi, and many people at the NIPS conference for their comments and criticisms. This report describes research done within the laboratory of Dr. Emilio Bizzi in the department of Brain and Cognitive Sciences at MIT. The author was supported during this work by a National Defense Science and Engineering Graduate Fellowship, and by NIH grants 5R37AR26710 and 5R01NS09343 to Dr. Bizzi. \n\nReferences \n\nAn C. H., Atkeson C. G., Hollerbach J. M., 1988, Model-Based Control of a Robot Manipulator, MIT Press, Cambridge, MA. \n\nArimoto S., Kawamura S., Miyazaki F., 1984, Bettering operation of robots by learning, Journal of Robotic Systems, 1(2):123-140. \n\nAtkeson C. G., 1989, Learning arm kinematics and dynamics, Ann. Rev. Neurosci., 12:157-183. \n\nBizzi E., Accornero N., Chapple W., Hogan N., 1984, Posture control and trajectory formation during arm movement, J. Neurosci., 4:2738-2744. \n\nSanger T. D., 1991a, A tree-structured adaptive network for function approximation in high dimensional spaces, IEEE Trans. Neural Networks, 2(2):285-293. \n\nSanger T. 
D., 1991b, A tree-structured algorithm for reducing computation in networks with separable basis functions, Neural Computation, 3(1):67-78. \n\nSanger T. D., 1992, Neural network learning control of robot manipulators using gradually increasing task difficulty, submitted to IEEE Trans. Robotics and Automation. \n\nSanner R. M., Slotine J.-J. E., 1992, Gaussian networks for direct adaptive control, IEEE Trans. Neural Networks, in press. Also MIT NSL Reports 910303 and 910503, March 1991, and Proc. American Control Conference, Boston, pages 2153-2159, June 1991. \n\nSastry S., Bodson M., 1989, Adaptive Control: Stability, Convergence, and Robustness, Prentice Hall, New Jersey. \n\nShadmehr R., 1990, Learning virtual equilibrium trajectories for control of a robot arm, Neural Computation, 2:436-446. \n\nYabuta T., Yamada T., 1991, Learning control using neural networks, Proc. IEEE Int'l Conf. on Robotics and Automation, Sacramento, pages 740-745. \n\nFigure 2: Dotted line is the desired trajectory, solid line is the actual trajectory, and the grey line is the equilibrium point control trajectory. \n", "award": [], "sourceid": 699, "authors": [{"given_name": "Terence", "family_name": "Sanger", "institution": null}]}