{"title": "Using Local Models to Control Movement", "book": "Advances in Neural Information Processing Systems", "page_first": 316, "page_last": 323, "abstract": null, "full_text": "316 \n\nAtkeson \n\nUsing Local Models to Control Movement \n\nChristopher G. Atkeson \n\nDepartment of Brain and Cognitive Sciences \n\nand the Artificial Intelligence Laboratory \n\nMassachusetts Institute of Technology \n\nNE43-771, 545 Technology Square \n\nCambridge, MA 02139 \n\ncga@ai.mit.edu \n\nABSTRACT \n\nThis paper explores the use of a model neural network for motor \nlearning. Steinbuch and Taylor presented neural network designs to \ndo nearest neighbor lookup in the early 1960s. In this paper their \nnearest neighbor network is augmented with a local model network, \nwhich fits a local model to a set of nearest neighbors. The network \ndesign is equivalent to local regression. This network architecture \ncan represent smooth nonlinear functions, yet has simple training \nrules with a single global optimum. The network has been used \nfor motor learning of a simulated arm and a simulated running \nmachine. \n\nINTRODUCTION \n\n1 \nA common problem in motor learning is approximating a continuous function from \nsamples of the function's inputs and outputs. This paper explores a neural net(cid:173)\nwork architecture that simply remembers experiences (samples) and builds a local \nmodel to answer any particular query (an input for which the function's output is \ndesired). This network design can represent smooth nonlinear functions, yet has \nsimple training rules with a single global optimum for building a local model in \nresponse to a query. Our approach is to model complex continuous functions us(cid:173)\ning simple local models. This approach avoids the difficult problem of finding an \nappropriate structure for a global model. A key idea is to form a training set for \nthe local model network after a query to be answered is known. 
This approach allows us to include in the training set only relevant experiences (nearby samples). The local model network, which may be a simple network architecture such as a perceptron, forms a model of the portion of the function near the query point. This local model is then used to predict the output of the function, given the input. The local model network is retrained with a new training set to answer the next query. This approach minimizes interference between old and new data, and allows the range of generalization to depend on the density of the samples. \n\nSteinbuch (Steinbuch 1961, Steinbuch and Piske 1963) and Taylor (Taylor 1959, Taylor 1960) independently proposed neural network designs that used a local representation to do nearest neighbor lookup and pointed out that this approach could be used for control. They used a layer of hidden units to compute an inner product of each stored vector with the input vector. A winner-take-all circuit then selected the hidden unit with the highest activation. This type of network can find nearest neighbors or best matches using a Euclidean distance metric (Kazmierczak and Steinbuch 1963). In this paper their nearest neighbor lookup network (which I will refer to as the memory network) is augmented with a local model network, which fits a local model to a set of nearest neighbors. \n\nThe ideas behind the network design used in this paper have a long history. Approaches which represent previous experiences directly and use a similar experience or similar experiences to form a local model are often referred to as nearest neighbor or k-nearest neighbor approaches. Local models (often polynomials) have been used for many years to smooth time series (Sheppard 1912, Sherriff 1920, Whittaker and Robinson 1924, Macauley 1931) and interpolate and extrapolate from limited data.
\nLancaster and Salkauskas (1986) refer to nearest neighbor approaches as \"moving least squares\" and survey their use in fitting surfaces to arbitrarily spaced points. Eubank (1988) surveys the use of nearest neighbor estimators in nonparametric regression. Farmer and Sidorowich (1988) survey the use of nearest neighbor and local model approaches in modeling chaotic dynamic systems. \n\nCrain and Bhattacharyya (1967), Falconer (1971), and McLain (1974) suggested using a weighted regression to fit a local polynomial model at each point a function evaluation was desired. All of the available data points were used. Each data point was weighted in the regression by a function of its distance to the desired point. McIntyre, Pollard, and Smith (1968), Pelto, Elkins, and Boyd (1968), Legg and Brent (1969), Palmer (1969), Walters (1969), Lodwick and Whittle (1970), Stone (1975) and Franke and Nielson (1980) suggested fitting a polynomial surface to a set of nearest neighbors, also using distance weighted regression. Cleveland (1979) proposed using robust regression procedures to eliminate outlying or erroneous points in the regression process. A program implementing a refined version of this approach (LOESS) is available by sending electronic mail containing the single line, send dloess from a, to the address netlib@research.att.com (Grosse 1989). Cleveland, Devlin and Grosse (1988) analyze the statistical properties of the LOESS algorithm and Cleveland and Devlin (1988) show examples of its use. Stone (1977, 1982), Devroye (1981), Cheng (1984), Li (1984), Farwig (1987), and M\u00fcller (1987) provide analyses of nearest neighbor approaches. Franke (1982) compares the performance of nearest neighbor approaches with other methods for fitting surfaces to data.
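The distance-weighted regression idea surveyed above can be sketched for a single input dimension as follows. This is an illustrative reconstruction, not code from any of the cited papers: the tricube weight function (a common LOESS-style choice) and the function name `weighted_local_fit` are my assumptions.

```python
import numpy as np

def weighted_local_fit(x_query, X, y, degree=1):
    """Fit a local polynomial at x_query by distance-weighted least squares.

    Each data point is weighted by a function of its distance to the query
    point; here a tricube weight over the span of the data (an illustrative
    choice) is used."""
    d = np.abs(X - x_query)
    dmax = d.max() if d.max() > 0 else 1.0
    w = (1.0 - (d / dmax) ** 3) ** 3          # tricube weights in [0, 1]
    # design matrix for a polynomial centered on the query point
    A = np.vander(X - x_query, degree + 1, increasing=True)
    W = np.diag(w)
    # weighted least squares: minimize ||W (A c - y)||
    coef, *_ = np.linalg.lstsq(W @ A, W @ y, rcond=None)
    return coef[0]                            # polynomial value at x_query
```

Because the polynomial is centered on the query, its constant term is the fitted value there; refitting at each query is what makes the estimate "moving".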
\n\n2 THE NETWORK ARCHITECTURE \n\nThe memory network of Steinbuch and Taylor is used to find the nearest stored vectors to the current input vector. The memory network computes a measure of the distance between each stored vector and the input vector in parallel, and then a \"winner take all\" network selects the nearest vector (nearest neighbor). Euclidean distance has been chosen as the distance metric, because the Euclidean distance is invariant under rotation of the coordinates used to represent the input vector. \n\nThe memory network consists of three layers of units: input units, hidden or memory units, and output units. The squared Euclidean distance between the input vector (i) and a weight vector (w_k) for the connections of the input units to hidden unit k is given by: \n\nd_k^2 = (i - w_k)^T (i - w_k) = i^T i - 2 i^T w_k + w_k^T w_k \n\nSince the quantity i^T i is the same for all hidden units, minimizing the distance between the input vector and the weight vector for each hidden unit is equivalent to maximizing: \n\ni^T w_k - (1/2) w_k^T w_k \n\nThis quantity is the inner product of the input vector and the weight vector for hidden unit k, biased by half the squared length of the weight vector. \n\nDynamics of the memory network neurons allow the memory network to output a sequence of nearest neighbors. These nearest neighbors form the selected training sequence for the local model network. Memory unit dynamics can be used to allocate \"free\" memory units to new experiences, and to forget old training points when the capacity of the memory network is fully utilized. \n\nThe local model network consists of only one layer of modifiable weights preceded by any number of layers with fixed connections. There may be arbitrary preprocessing of the inputs of the local model, but the local model is linear in the parameters used to form the fit.
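The inner-product-plus-bias selection in the memory network can be sketched as follows. This is a minimal illustration of the equivalence, assuming a NumPy implementation; the function name is mine, not the paper's.

```python
import numpy as np

def nearest_by_inner_product(i, W):
    """Select the stored vector nearest (in Euclidean distance) to input i
    using only one inner product per hidden unit plus a fixed bias.

    W holds one stored weight vector per row. The bias for unit k is half
    the squared length of its weight vector, so maximizing the activation
    i^T w_k - (1/2) w_k^T w_k minimizes ||i - w_k||^2."""
    bias = 0.5 * np.sum(W * W, axis=1)   # half squared length of each weight vector
    activation = W @ i - bias            # i^T w_k - (1/2) w_k^T w_k
    return int(np.argmax(activation))    # winner-take-all selection
```

Note that the biases depend only on the stored vectors, so they can be set once at storage time; no search for weights is involved.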
The local model network using the LMS training algorithm performs a linear regression of the transformed inputs against the desired outputs. Thus, the local model network can be used to fit a linear regression model to the selected training set. With multiplicative interactions between inputs the local model network can be used to fit a polynomial surface (such as a quadratic) to the selected training set. An alternative implementation of the local model network could use a single layer of \"sigma-pi\" units. \n\nThis network design has simple training rules. In the memory network the weights are simply the values of the components of input and output vectors, and the bias for each memory unit is just half the squared length of the corresponding input weight vector. No search for weights is necessary, since the weights are directly given by the data to be stored. The local model network is linear in the weights, leading to a single optimum which can be found by linear regression or gradient descent. Thus, convergence to the global optimum is guaranteed when forming a local model to answer a particular query. \n\nFigure 1: Simulated Planar Two-joint Arm \n\nThis network architecture was simulated using k-d tree data structures (Friedman, Bentley, and Finkel 1977) on a standard serial computer and also using parallel search on a massively parallel computer, the Connection Machine (Hillis 1985). A special purpose computer is being built to implement this network in real time. \n\n3 APPLICATIONS \n\nThe network has been used for motor learning of a simulated arm and a simulated running machine. The network performed surprisingly well in these simple evaluations. The simulated arm was able to follow a desired trajectory after only a few practice movements.
Performance of the simulated running machine in following a series of desired velocities was also improved. This paper will report only on the arm trajectory learning. \n\nFigure 1 shows the simulated 2-joint planar arm. The problem faced in this simulation is to learn the correct joint torques to drive the arm along the desired trajectory (the inverse dynamics problem). In addition to the feedforward control signal produced by the network described in this paper, a feedback controller was also used. \n\nFigure 2 shows several learning curves for this problem. The first point in each of the curves shows the performance generated by the feedback controller alone. The error measure is the RMS torque error during the movement. The highest curve shows the performance of a nearest neighbor method without a local model. The nearest point was used to generate the torques for the feedforward command, which were then summed with the output from the feedback controller. The second curve shows the performance using a linear local model. The third curve shows the performance using a quadratic local model. Adding the local model network greatly speeds up learning. The network with the quadratic local model learned more quickly than the one with the local linear model. \n\nFigure 2: Learning curves from 3 different network designs on the two joint arm trajectory learning problem (RMS joint torque error vs. movement number; curves: nearest neighbor, linear local model, quadratic local model).
\n\n4 WHY DOES IT WORK? \n\nIn this learning paradigm the feedback controller serves as the teacher, or source of new data for the network. If the feedback controller is of poor quality, the nearest neighbor function approximation method tends to get \"stuck\" with a non-zero error level. The use of a local model seems to eliminate this stuck state, and reduce the dependence on the quality of the feedback controller. \n\nFast training is achieved by modularizing the network: the memory network does not need to search for weights in order to store the samples, and local models can be linear in the unknown parameters, leading to a single optimum which can be found by linear regression or gradient descent. \n\nThe combination of storing all the data and only using a certain number of nearby samples to form a local model minimizes interference between old and new data, and allows the range of generalization to depend on the density of the samples. \n\nThere are many issues left to explore. A disadvantage of this approach is the limited capacity of the memory network. In this version of the proposed neural network architecture, every experience is stored. Eventually all the memory units will be used up. To use memory units more sparingly, only the experiences which are sufficiently different from previous experiences could be stored. Memory requirements could also be reduced by \"forgetting\" certain experiences, perhaps those that have not been referenced for a long time, or a randomly selected experience. It is an empirical question as to how large a memory capacity is necessary for this network design to be useful. \n\nHow should the distance metric be chosen? So far distance metrics have been devised by hand. Better distance metrics may be based on the stored data and a particular query. How far will this approach take us?
Experiments using more complex systems and actual physical implementations, with the inevitable noise and high order dynamics, need to be done. \n\nAcknowledgments \n\nB. Widrow and J. D. Cowan made the author aware of the work of Steinbuch and Taylor (Steinbuch and Widrow 1965, Cowan and Sharp 1988). \n\nThis paper describes research done at the Whitaker College, Department of Brain and Cognitive Sciences, Center for Biological Information Processing and the Artificial Intelligence Laboratory of the Massachusetts Institute of Technology. Support was provided under Office of Naval Research contract N00014-88-K-0321 and under Air Force Office of Scientific Research grant AFOSR-89-0500. Support for CGA was provided by a National Science Foundation Engineering Initiation Award and Presidential Young Investigator Award, an Alfred P. Sloan Research Fellowship, the W. M. Keck Foundation Assistant Professorship in Biomedical Engineering, and a Whitaker Health Sciences Fund MIT Faculty Research Grant. \n\nReferences \n\nCheng, P.E. (1984), \"Strong Consistency of Nearest Neighbor Regression Function Estimators\", Journal of Multivariate Analysis, 15:63-72. \n\nCleveland, W.S. (1979), \"Robust Locally Weighted Regression and Smoothing Scatterplots\", Journal of the American Statistical Association, 74:829-836. \n\nCleveland, W.S. and S.J. Devlin (1988), \"Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting\", Journal of the American Statistical Association, 83:596-610. \n\nCleveland, W.S., S.J. Devlin and E. Grosse (1988), \"Regression by Local Fitting: Methods, Properties, and Computational Algorithms\", Journal of Econometrics, 37:87-114. \n\nCowan, J.D. and D.H. Sharp (1988), \"Neural Nets\", Quarterly Reviews of Biophysics, 21(3):365-427. \n\nCrain, I.K. and B.K. 
Bhattacharyya (1967), \"Treatment of nonequispaced two dimensional data with a digital computer\", Geoexploration, 5:173-194. \n\nDevroye, L.P. (1981), \"On the Almost Everywhere Convergence of Nonparametric Regression Function Estimates\", The Annals of Statistics, 9(6):1310-1319. \n\nEubank, R.L. (1988), Spline Smoothing and Nonparametric Regression, Marcel Dekker, New York, pp. 384-387. \n\nFalconer, K.J. (1971), \"A general purpose algorithm for contouring over scattered data points\", Nat. Phys. Lab. Report NAC 6. \n\nFarmer, J.D., and J.J. Sidorowich (1988), \"Predicting Chaotic Dynamics\", in Dynamic Patterns in Complex Systems, J.A.S. Kelso, A.J. Mandell, and M.F. Shlesinger (eds.), World Scientific, New Jersey, pp. 265-292. \n\nFarwig, R. (1987), \"Multivariate Interpolation of Scattered Data by Moving Least Squares Methods\", in J.C. Mason and M.G. Cox (eds.), Algorithms for Approximation, Clarendon Press, Oxford, pp. 193-211. \n\nFranke, R. (1982), \"Scattered Data Interpolation: Tests of Some Methods\", Mathematics of Computation, 38(157):181-200. \n\nFranke, R. and G. Nielson (1980), \"Smooth Interpolation of Large Sets of Scattered Data\", International Journal for Numerical Methods in Engineering, 15:1691-1704. \n\nFriedman, J.H., J.L. Bentley, and R.A. Finkel (1977), \"An Algorithm for Finding Best Matches in Logarithmic Expected Time\", ACM Transactions on Mathematical Software, 3(3):209-226. \n\nGrosse, E. (1989), \"LOESS: Multivariate Smoothing by Moving Least Squares\", in C.K. Chui, L.L. Schumaker, and J.D. Ward (eds.), Approximation Theory VI, Academic Press, Boston, pp. 1-4. \n\nHillis, D. (1985), The Connection Machine, MIT Press, Cambridge, Mass. \n\nKazmierczak, H. and K. Steinbuch (1963), \"Adaptive Systems in Pattern Recognition\", IEEE Transactions on Electronic Computers, EC-12:822-835. \n\nLancaster, P. and K. 
Salkauskas (1986), Curve and Surface Fitting, Academic Press, New York. \n\nLegg, M.P.C. and R.P. Brent (1969), \"Automatic Contouring\", Proc. 4th Australian Computer Conference, 467-468. \n\nLi, K.C. (1984), \"Consistency for Cross-Validated Nearest Neighbor Estimates in Nonparametric Regression\", The Annals of Statistics, 12:230-240. \n\nLodwick, G.D., and J. Whittle (1970), \"A technique for automatic contouring field survey data\", Australian Computer Journal, 2:104-109. \n\nMacauley, F.R. (1931), The Smoothing of Time Series, National Bureau of Economic Research, New York. \n\nMcIntyre, D.B., D.D. Pollard, and R. Smith (1968), \"Computer Programs For Automatic Contouring\", Kansas Geological Survey Computer Contributions 23, University of Kansas, Lawrence, Kansas. \n\nMcLain, D.H. (1974), \"Drawing Contours From Arbitrary Data Points\", The Computer Journal, 17(4):318-324. \n\nM\u00fcller, H.G. (1987), \"Weighted Local Regression and Kernel Methods for Nonparametric Curve Fitting\", Journal of the American Statistical Association, 82:231-238. \n\nPalmer, J.A.B. (1969), \"Automated mapping\", Proc. 4th Australian Computer Conference, 463-466. \n\nPelto, C.R., T.A. Elkins, and H.A. Boyd (1968), \"Automatic contouring of irregularly spaced data\", Geophysics, 33:424-430. \n\nSheppard, W.F. (1912), \"Reduction of Errors by Means of Negligible Differences\", Proceedings of the Fifth International Congress of Mathematicians, E.W. Hobson and A.E.H. Love (eds.), Cambridge University Press, 11:348-384. \n\nSherriff, C.W.M. (1920), \"On a Class of Graduation Formulae\", Proceedings of the Royal Society of Edinburgh, XL:112-128. \n\nSteinbuch, K. (1961), \"Die lernmatrix\", Kybernetik, 1:36-45. \n\nSteinbuch, K. and U.A.W. Piske (1963), \"Learning Matrices and Their Applications\", IEEE Transactions on Electronic Computers, EC-12:846-862. \n\nSteinbuch, K. 
and B. Widrow (1965), \"A Critical Comparison of Two Kinds of Adaptive Classification Networks\", IEEE Transactions on Electronic Computers, EC-14:737-740. \n\nStone, C.J. (1975), \"Nearest Neighbor Estimators of a Nonlinear Regression Function\", Proc. of Computer Science and Statistics: 8th Annual Symposium on the Interface, pp. 413-418. \n\nStone, C.J. (1977), \"Consistent Nonparametric Regression\", The Annals of Statistics, 5:595-645. \n\nStone, C.J. (1982), \"Optimal Global Rates of Convergence for Nonparametric Regression\", The Annals of Statistics, 10(4):1040-1053. \n\nTaylor, W.K. (1959), \"Pattern Recognition By Means Of Automatic Analogue Apparatus\", Proceedings of The Institution of Electrical Engineers, 106B:198-209. \n\nTaylor, W.K. (1960), \"A parallel analogue reading machine\", Control, 3:95-99. \n\nTaylor, W.K. (1964), \"Cortico-thalamic organization and memory\", Proc. Royal Society B, 159:466-478. \n\nWalters, R.F. (1969), \"Contouring by Machine: A User's Guide\", American Association of Petroleum Geologists Bulletin, 53(11):2324-2340. \n\nWhittaker, E., and G. Robinson (1924), The Calculus of Observations, Blackie & Son, London. \n", "award": [], "sourceid": 289, "authors": [{"given_name": "Christopher", "family_name": "Atkeson", "institution": null}]}