{"title": "From Isolation to Cooperation: An Alternative View of a System of Experts", "book": "Advances in Neural Information Processing Systems", "page_first": 605, "page_last": 611, "abstract": null, "full_text": "From Isolation to Cooperation: \n\nAn Alternative View of a System of Experts \n\nStefan Schaal‡* \n\nsschaal@cc.gatech.edu \n\nChristopher C. Atkeson‡ \n\ncga@cc.gatech.edu \n\nhttp://www.cc.gatech.edu/fac/Stefan.Schaal \n\nhttp://www.cc.gatech.edu/fac/Chris.Atkeson \n\n‡College of Computing, Georgia Tech, 801 Atlantic Drive, Atlanta, GA 30332-0280 \n\n*ATR Human Information Processing, 2-2 Hikaridai, Seika-cho, Soraku-gun, 619-02 Kyoto \n\nAbstract \n\nWe introduce a constructive, incremental learning system for regression problems that models data by means of locally linear experts. In contrast to other approaches, the experts are trained independently and do not compete for data during learning. Only when a prediction for a query is required do the experts cooperate by blending their individual predictions. Each expert is trained by minimizing a penalized local cross validation error using second order methods. In this way, an expert is able to find a local distance metric by adjusting the size and shape of the receptive field in which its predictions are valid, and also to detect relevant input features by adjusting its bias on the importance of individual input dimensions. We derive asymptotic results for our method. In a variety of simulations the properties of the algorithm are demonstrated with respect to interference, learning speed, prediction accuracy, feature detection, and task oriented incremental learning. \n\n1. INTRODUCTION \n\nDistributing a learning task among a set of experts has become a popular method in computational learning. One approach is to employ several experts, each with a global domain of expertise (e.g., Wolpert, 1990). 
When an output for a given input is to be predicted, every expert gives a prediction together with a confidence measure. The individual predictions are combined into a single result, for instance, based on a confidence weighted average. Another approach, the one pursued in this paper, is to create experts with local domains of expertise. In contrast to the global experts, the local experts have little or no overlap. To assign a local domain of expertise to each expert, it is necessary to learn an expert selection system in addition to the experts themselves. This classifier determines which expert models are used in which part of the input space. For incremental learning, competitive learning methods are usually applied. Here the experts compete for data such that they change their domains of expertise until a stable configuration is achieved (e.g., Jacobs, Jordan, Nowlan, & Hinton, 1991). The advantage of local experts is that they can have simple parameterizations, such as locally constant or locally linear models. This offers benefits in terms of analyzability, learning speed, and robustness (e.g., Jordan & Jacobs, 1994). For simple experts, however, a large number of experts is necessary to model a function. As a result, the expert selection system has to be more complicated and, thus, runs a higher risk of getting stuck in local minima and/or of learning rather slowly. In incremental learning, another potential danger arises when the input distribution of the data changes. The expert selection system usually makes either implicit or explicit prior assumptions about the input data distribution. For example, in the classical mixture model (McLachlan & Basford, 1988), which was employed in several local expert approaches, the prior probabilities of each mixture component can be interpreted as \n\n606 \n\nS. SCHAAL, C. C. 
ATKESON \n\nthe fraction of data points each expert expects to experience. Therefore, a change in input distribution will cause all experts to change their domains of expertise in order to fulfill these prior assumptions. This can lead to catastrophic interference. \n\nIn order to avoid these problems and to cope with interference during incremental learning due to changes in input distribution, we suggest eliminating the competition among experts and instead isolating them during learning. Whenever some new data is experienced which is not accounted for by one of the current experts, a new expert is created. Since the experts do not compete for data with their peers, there is no reason for them to change the location of their domains of expertise. However, when it comes to making a prediction at a query point, all the experts cooperate by giving a prediction of the output together with a confidence measure. A blending of the predictions of all experts results in the final prediction. It should be noted that these local experts combine properties of both the global and local experts mentioned previously. They act like global experts by learning independently of each other and by blending their predictions, but they act like local experts by confining themselves to a local domain of expertise, i.e., their confidence measures are large only in a local region. \n\nThe topic of data fitting with structurally simple local models (or experts) has received a great deal of attention in nonparametric statistics (e.g., Nadaraya, 1964; Cleveland, 1979; Scott, 1992; Hastie & Tibshirani, 1990). 
In this paper, we will demonstrate how a nonparametric approach can be applied to obtain the isolated expert network (Section 2.1), how its asymptotic properties can be analyzed (Section 2.2), and what characteristics such a learning system possesses in terms of the avoidance of interference, feature detection, dimensionality reduction, and incremental learning of motor control tasks (Section 3). \n\n2. RECEPTIVE FIELD WEIGHTED REGRESSION \n\nThis paper focuses on regression problems, i.e., the learning of a map from R^n -> R^m. Each expert in our learning method, Receptive Field Weighted Regression (RFWR), consists of two elements, a locally linear model to represent the local functional relationship, and a receptive field which determines the region in input space in which the expert's knowledge is valid. As a result, a given data set will be modeled by piecewise linear elements, blended together. For 1000 noisy data points drawn from the unit interval of the function z = max[exp(-10x^2), exp(-50y^2), 1.25 exp(-5(x^2 + y^2))], Figure 1 illustrates an example of function fitting with RFWR. This function consists of a narrow and a wide ridge which are perpendicular to each other, and a Gaussian bump at the origin. Figure 1b shows the receptive fields which the system created during the learning process. Each expert's location is at the center of its receptive field, marked by a small circle in Figure 1b. \n\nFigure 1: (a) result of function approximation with RFWR; (b) contour lines of 0.1 iso-activation of each expert in input space (the experts' centers are marked by small circles). \n\nThe receptive fields are modeled by Gaussian functions, and their 0.1 iso-activation lines are shown in Figure 1b as well. As can be seen, each expert focuses on a certain region of the input space, and the shape and orientation of this region reflect the function's complexity, or more precisely, the function's curvature, in this region. It should be noticed that there is a certain amount of overlap among the experts, and that the placement of experts occurred on a greedy basis during learning and is not globally optimal. The approximation result (Figure 1a) is a faithful reconstruction of the real function (MSE = 0.0025 on a test set, 30 epochs of training, about 1 minute of computation on a SPARC10). As a baseline comparison, a similar result with a sigmoidal 3-layer neural network required about 100 hidden units and 10000 epochs of annealed standard backpropagation (about 4 hours on a SPARC10). \n\n2.1 THE ALGORITHM \n\nRFWR can be sketched in network form as shown in Figure 2. All inputs connect to all expert networks, and new experts can be added as needed. Each expert is an independent entity. It consists of a two layer linear subnet and a receptive field subnet. The receptive field subnet has a single unit with a bell-shaped activation profile, centered at the fixed location c in input space. The maximal output of this unit is \"1\" at the center, and it decays to zero as a function of the distance from the center. For analytical convenience, we choose this unit to be Gaussian: \n\nw = exp(-0.5 (x - c)^T D (x - c))   (1) \n\nFigure 2: The RFWR network. \n\nx is the input vector, and D the distance metric, a positive definite matrix that is generated from the upper triangular matrix M, i.e., D = M^T M. 
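As a minimal illustration of Eq. (1), the activation of a single receptive field can be sketched as follows. This is our own Python sketch under our own naming, not the authors' implementation:

```python
import numpy as np

def rf_activation(x, c, M):
    """Gaussian receptive field activation, Eq. (1), centered at c.

    The distance metric D = M^T M is positive (semi-)definite by construction.
    """
    D = M.T @ M
    d = x - c
    return float(np.exp(-0.5 * d @ D @ d))

c = np.zeros(2)
M = np.diag([3.0, 1.0])   # elongated receptive field: narrow along x, wide along y
print(rf_activation(c, c, M))                      # 1.0 at the center
print(rf_activation(np.array([1.0, 0.0]), c, M))   # decays with distance from c
```

Because `M` is upper triangular, adapting its entries by gradient descent (as done later in Eq. (6)) can reshape and reorient the receptive field while `D` stays positive semi-definite.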
The output of the linear subnet is: \n\nŷ = x^T b + b_0 = x̃^T β   (2) \n\nThe connection strengths b of the linear subnet and its bias b_0 will be denoted by the d-dimensional vector β from now on, and the tilde sign will indicate that a vector has been augmented by a constant \"1\", e.g., x̃ = (x^T, 1)^T. In generating the total output, the receptive field units act as a gating component on the output, such that the total prediction is: \n\nŷ = (Σ_k w_k ŷ_k) / (Σ_k w_k)   (3) \n\nThe parameters β and M are the primary quantities which have to be adjusted in the learning process: β forms the locally linear model, while M determines the shape and orientation of the receptive field. Learning is achieved by incrementally minimizing the cost function: \n\nJ = (1 / Σ_i w_i) Σ_i w_i (y_i - ŷ_{i,-i})^2 + γ Σ_{n,m} D_{nm}^2   (4) \n\nThe first term of this function is the weighted mean squared cross validation error over all experienced data points (ŷ_{i,-i} denotes the prediction at the i-th point with that point excluded from training), a local cross validation measure (Schaal & Atkeson, 1994). The second term is a regularization or penalty term. Local cross validation by itself is consistent, i.e., with an increasing amount of data, the size of the receptive field of an expert would shrink to zero. This would require the creation of an ever increasing number of experts during the course of learning. The penalty term introduces some non-vanishing bias in each expert such that its receptive field size does not shrink to zero. By penalizing the squared coefficients of D, we are essentially penalizing the second derivatives of the function at the site of the expert. This is similar to the approaches taken in spline fitting (de Boor, 1978) and acts as a low-pass filter: the higher the second derivatives, the more smoothing (and thus bias) will be introduced. This will be analyzed further in Section 2.2. 
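The prediction phase of Eqs. (1)-(3) can be sketched as follows. Again a hedged sketch with our own naming, not the authors' code: each expert produces a locally linear prediction, and the receptive field activations gate the blend.

```python
import numpy as np

def blended_prediction(x, experts):
    """Blend locally linear experts, Eq. (3).

    experts: list of (c, M, beta) tuples; beta carries the slope(s) followed
    by the offset b0, matching the augmented input x_tilde = (x^T, 1)^T.
    """
    num, den = 0.0, 0.0
    for c, M, beta in experts:
        D = M.T @ M
        d = x - c
        w = np.exp(-0.5 * d @ D @ d)      # receptive field activation, Eq. (1)
        x_tilde = np.append(x, 1.0)       # input augmented by a constant "1"
        num += w * (x_tilde @ beta)       # locally linear prediction, Eq. (2)
        den += w
    return num / den

# two 1-D experts that both happen to model y = x exactly
experts = [(np.zeros(1), np.eye(1), np.array([1.0, 0.0])),
           (np.ones(1),  np.eye(1), np.array([1.0, 0.0]))]
print(blended_prediction(np.array([0.5]), experts))   # -> 0.5
```

Note that the blend is a convex combination of the experts' outputs, so wherever neighboring experts agree, the prediction is unaffected by their overlap.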
\nThe update equations for the linear subnet are the standard weighted recursive least squares equations with forgetting factor λ (Ljung & Söderström, 1986): \n\nβ^{n+1} = β^n + w P^{n+1} x̃ e_cv,  where  P^{n+1} = (1/λ) (P^n - (P^n x̃ x̃^T P^n) / (λ/w + x̃^T P^n x̃))  and  e_cv = y - x̃^T β^n   (5) \n\nThis is a Newton method, and it requires maintaining the matrix P, which is symmetric and thus has 0.5 d (d + 1) independent elements. The update of the receptive field subnet is a gradient descent in J: \n\nM^{n+1} = M^n - α ∂J/∂M   (6) \n\nDue to space limitations, the derivation of the derivative in (6) will not be explained here. The major ingredient is to take this derivative as in a batch update, and then to reformulate the result as an iterative scheme. The derivatives in batch mode can be calculated exactly due to the Sherman-Morrison-Woodbury theorem (Belsley, Kuh, & Welsch, 1980; Atkeson, 1992). The derivative for the incremental update is a very good approximation to the batch update and realizes incremental local cross validation. \n\nA new expert is initialized with a default M_def and all other variables set to zero, except the matrix P. P is initialized as a diagonal matrix with elements 1/r_i^2, where the r_i are usually small quantities, e.g., 0.01. The r_i are ridge regression parameters. From a probabilistic view, they are Bayesian priors that the β vector is the zero vector. From an algorithmic view, they are fake data points of the form [x̃ = (0, ..., r_i, ..., 0)^T, y = 0] (Atkeson, Moore, & Schaal, submitted). Using the update rule (5), the influence of the ridge regression parameters would fade away due to the forgetting factor λ. However, it is useful to make the ridge regression parameters adjustable. As in (6), the r_i can be updated by gradient descent: \n\nr_i^{n+1} = r_i^n - α ∂J/∂r_i   (7) \n\nThere are d ridge regression parameters, one for each diagonal element of the P matrix. 
In order to add in the update of the ridge parameters as well as to compensate for the forgetting factor, an iterative procedure based on (5) can be devised which we omit here. The computational complexity of this update is much reduced in comparison to (5) since many computations involve multiplications by zero. \n\nInitialize the RFWR network with no expert; \nFor every new training sample (x, y): \n  a) For k = 1 to #experts: \n     - calculate the activation from (1) \n     - update the expert's parameters according to (5), (6), and (7) \n     end; \n  b) If no expert was activated by more than w_gen: \n     - create a new expert with c = x \n     end; \n  c) If two experts are activated by more than w_prune: \n     - erase the expert with the smaller receptive field \n     end; \n  d) calculate the mean, err_mean, and standard deviation, err_std, of the incrementally accumulated errors err_k of all experts; \n  e) For k = 1 to #experts: \n     If (|err_k - err_mean| > θ err_std): reinitialize expert k with M = 2 M_def \n     end; \nend; \n\nIn sum, an RFWR expert consists of three sets of parameters, one for the locally linear model, one for the size and shape of the receptive field, and one for the bias. The linear model parameters are updated by a Newton method, while the other parameters are updated by gradient descent. In our implementations, we actually use second order gradient descent based on Sutton (1992), since, with minor extra effort, we can obtain estimates of the second derivatives of the cost function with respect to all parameters. Finally, the logic of RFWR becomes as shown in the pseudo-code above. Points c) and e) of the algorithm introduce a pruning facility. Pruning takes place either when two experts overlap too much, or when an expert has an exceptionally large mean squared error. The latter method corresponds to a simple form of outlier detection. 
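The outer loop above can be sketched in a few lines of Python. This is a rough sketch of ours showing only the expert-creation rule of step b); the parameter updates (5)-(7), pruning, and outlier reinitialization are elided, and all names are our own:

```python
import numpy as np

W_GEN = 0.1   # creation threshold, as in Section 3

def activation(x, c, D):
    """Receptive field activation, Eq. (1); here D is used directly
    (in the paper D = M^T M)."""
    d = x - c
    return np.exp(-0.5 * d @ D @ d)

def train_sample(experts, x, D_def):
    """experts: list of dicts with center 'c' and distance metric 'D'."""
    acts = [activation(x, e["c"], e["D"]) for e in experts]
    # ... per-expert updates via (5), (6), (7) would go here ...
    if not acts or max(acts) <= W_GEN:         # step b): no expert covers x
        experts.append({"c": x.copy(), "D": D_def.copy()})
    return experts

experts = []
D_def = 50.0 * np.eye(1)   # default distance metric, as in Section 3.1
for x in np.linspace(-1.0, 1.0, 21):
    train_sample(experts, np.array([x]), D_def)
print(len(experts))        # a handful of experts now tile the interval [-1, 1]
```

The creation rule makes the allocation of experts greedy and input-driven: experts appear exactly where data arrives, which is why changes in input distribution leave existing experts untouched.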
\nLocal optimization of a distance metric always has a minimum for a very large receptive field size. In our case, this would mean that an expert favors global instead of locally linear regression. Such an expert will accumulate a very large error which can easily be detected in the given way. The mean squared error term, err, on which this outlier detection is based, is a bias-corrected mean squared error, as will be explained below. \n\n2.2 ASYMPTOTIC BIAS AND PENALTY SELECTION \n\nThe penalty term in the cost function (4) introduces bias. In order to assess the asymptotic value of this bias, the real function f(x), which is to be learned, is assumed to be represented as a Taylor series expansion at the center of an expert's receptive field. Without loss of generality, the center is assumed to be at the origin in input space. We furthermore assume that the size and shape of the receptive field are such that terms of order higher than two are negligible. Thus, the cost (4) can be written as: \n\nJ ≈ (∫ w (f_0 + f^T x + 0.5 x^T F x - b_0 - b^T x)^2 dx) / (∫ w dx) + γ Σ_{n,m} D_{nm}^2   (8) \n\nwhere f_0, f, and F denote the constant, linear, and quadratic terms of the Taylor series expansion, respectively. Inserting Equation (1), the integrals can be solved analytically after the input space is rotated by an orthonormal matrix transforming F to the diagonal matrix F'. 
Subsequently, b_0, b, and D can be determined such that J is minimized: \n\nb_0' = f_0 + bias = f_0 + 0.5 (2γ)^{1/4} Σ_n sgn(F'_{nn}) √|F'_{nn}|,  b' = f,  D'_{nn} = (F'_{nn}^2 / (2γ))^{1/4}   (9) \n\nThis states that the linear model will asymptotically acquire the correct locally linear model, while the constant term will have a bias proportional to the sum of the signed square roots of the eigenvalues of F, i.e., the F'_{nn}. The distance metric D, whose diagonalized counterpart is D', will be a scaled image of the Hessian F with an additional square root distortion. Thus, the penalty term accomplishes the intended task: it introduces more smoothing the higher the curvature at an expert's location is, and it prevents the receptive field of an expert from shrinking to zero size (which would obviously happen for γ -> 0). Additionally, Equation (9) shows how to determine γ for a given learning problem from an estimate of the eigenvalues and a permissible bias. Finally, it is possible to derive estimates of the bias and the mean squared error of each expert from the current distance metric D: \n\nbias_est = √(0.5 γ) Σ_n |eigenvalues(D)|_n,  err_bias = γ Σ_{n,m} D_{nm}^2   (10) \n\nThe latter term was incorporated in the mean squared error, err, in Section 2.1. Empirical evaluations (not shown here) verified the validity of these asymptotic results. \n\n3. SIMULATION RESULTS \n\nThis section will demonstrate some of the properties of RFWR. In all simulations, the threshold parameters of the algorithm were set to θ = 3.5, w_prune = 0.9, and w_gen = 0.1. These quantities determine the overlap of the experts as well as the outlier removal threshold; the results below are not affected by moderate changes in these parameters. 
\n\n3.1 AVOIDING INTERFERENCE \n\nIn order to test RFWR's sensitivity with respect to changes in input data distribution, the data of the example of Figure 1 was partitioned into three separate training sets: T1 = {(x, y, z) | -1.0 < x < -0.2}, T2 = {(x, y, z) | -0.4 < x < 0.4}, T3 = {(x, y, z) | 0.2 < x < 1.0}. These data sets correspond to three overlapping stripes of data, each having about 400 uniformly distributed samples. From scratch, a RFWR network was trained first on T1 for 20 epochs, then on T2 for 20 epochs, and finally on T3 for 20 epochs. The penalty was chosen as in the example of Figure 1 to be γ = 1e-7, which corresponds to an asymptotic bias of 0.1 at the sharp ridge of the function. The default distance metric D was 50 I, where I is the identity matrix. Figure 3 shows the results of this experiment. Very little interference can be found. The MSE on the test set increased from 0.0025 (of the original experiment of Figure 1) to 0.003, which is still an excellent reconstruction of the real function. \n\nFigure 3: Reconstructed function after training on (a) T1, (b) then T2, (c) and finally T3. \n\n3.2 LOCAL FEATURE DETECTION \n\nThe examples of RFWR given so far did not require ridge regression parameters. Their importance, however, becomes obvious when dealing with locally rank deficient data or with irrelevant input dimensions. A learning system should be able to recognize irrelevant input dimensions. It is important to note that this cannot be accomplished by a distance metric. The distance metric is only able to decide to what spatial extent averaging over data in a certain dimension should be performed. However, the distance metric has no means to exclude an input dimension. In contrast, bias learning with ridge regression parameters is able to exclude input dimensions. 
To demonstrate this, we added 8 purely noisy inputs (N(0, 0.3)) to the data drawn from the function of Figure 1. After 30 epochs of training on a 10000 data point training set, we analyzed histograms of the order of magnitude of the ridge regression parameters in all 10 input dimensions and the bias input over all the 79 experts that had been generated by the learning algorithm. All experts recognized that the input dimensions 3 to 10 did not contain relevant information, and correctly increased the corresponding ridge parameters to large values. The effect of a large ridge regression parameter is that the associated regression coefficient becomes zero. In contrast, the ridge parameters of the inputs 1, 2, and the bias input remained very small. The MSE on the test set was 0.0026, basically identical to the experiment with the original training set. \n\n3.3 LEARNING AN INVERSE DYNAMICS MODEL OF A ROBOT ARM \n\nRobot learning is one of the domains where incremental learning plays an important role. A real movement system experiences data at a high rate, and it should incorporate this data immediately to improve its performance. As learning is task oriented, input distributions will also be task oriented and interference problems can easily arise. Additionally, a real movement system does not sample data from a training set but rather has to move in order to receive new data. Thus, training data is always temporally correlated, and learning must be able to cope with this. An example of such a learning task is given in Figure 4, where a simulated 2 DOF robot arm has to learn to draw the figure \"8\" in two different regions of the work space at a moderate speed (1.5 sec duration). In this example, we assume that the correct movement plan exists, but that the inverse dynamics model which is to be used to control this movement has not been acquired. 
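The effect described above, that a large ridge parameter drives the associated regression coefficient to zero, can be illustrated in batch form (our own sketch, not the incremental rule (7); names and data are invented for the example):

```python
import numpy as np

def ridge_fit(X, y, r):
    """Least squares with a separate ridge penalty r_i per input dimension.

    Each r_i acts like a fake data point (0, ..., r_i, ..., 0) -> 0, so a
    large r_i pins the corresponding coefficient near zero.
    """
    R = np.diag(np.asarray(r, dtype=float) ** 2)
    return np.linalg.solve(X.T @ X + R, X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # dim 0 relevant, dims 1-2 pure noise
y = 2.0 * X[:, 0] + 0.01 * rng.normal(size=200)

beta = ridge_fit(X, y, [0.01, 100.0, 100.0])  # large r_i on the noise dimensions
print(beta)   # coefficient 0 near 2, coefficients 1 and 2 pinned near 0
```

The distance metric could only average more broadly over the noisy dimensions; the ridge parameters remove them from the local model altogether, which is exactly the behavior observed in the histogram analysis above.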
The robot is first trained for 10 minutes (real movement time) in the region of the lower target trajectory where it performs a variety of rhythmic movements under simple PID control. The initial performance of this controller is shown in the bottom part of Figure 4a. This training enables the robot to learn the locally appropriate inverse dynamics model, a continuous mapping from R^6 to R^2. Subsequent performance using this inverse model for control is depicted in the bottom part of Figure 4b. Afterwards, the same training takes place in the region of the upper target trajectory in order to acquire the inverse model in this part of the world. The figure \"8\" can then equally well be drawn there (upper part of Figure 4a,b). Switching back to the bottom part of the work space (Figure 4c), the first task can still be performed as before. No interference is recognizable. Thus, the robot could learn fast and reliably to fulfill the two tasks. It is important to note that the data generated by the training movements did not always have locally full rank. All the parameters of RFWR were necessary to acquire the local inverse model appropriately. A total of 39 locally linear experts were generated. \n\nFigure 4: Learning to draw the figure \"8\" with a 2-joint arm: (a) Performance of a PID controller before learning (the dimmed lines denote the desired trajectories, the solid lines the actual performance); (b) Performance after learning using a PD controller with feedforward commands from the learned inverse model; (c) Performance of the learned controller after training on the upper \"8\" of (b) (see text for more explanations). \n\n4. DISCUSSION \n\nWe have introduced an incremental learning algorithm, RFWR, which constructs a network of isolated experts for supervised learning of regression tasks. Each expert determines a locally linear model, a local distance metric, and local bias parameters by incrementally minimizing a penalized local cross validation error. Our algorithm differs from other local learning techniques by entirely avoiding competition among the experts, and by being based on nonparametric instead of parametric statistics. The resulting properties of RFWR are a) avoidance of interference in the case of changing input distributions, b) fast incremental learning by means of Newton and second order gradient descent methods, c) analyzable asymptotic properties which facilitate the selection of the fit parameters, and d) local feature detection and dimensionality reduction. The isolated experts are also ideally suited for parallel implementations. Future work will investigate computationally less costly delta-rule implementations of RFWR, and how well RFWR scales in higher dimensions. \n\n5. REFERENCES \n\nAtkeson, C. G., Moore, A. W., & Schaal, S. (submitted). \"Locally weighted learning.\" Artificial Intelligence Review. \nAtkeson, C. G. (1992). \"Memory-based approaches to approximating continuous functions.\" In: Casdagli, M., & Eubank, S. (Eds.), Nonlinear Modeling and Forecasting, pp. 503-521. Addison Wesley. \nBelsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression diagnostics: Identifying influential data and sources of collinearity. New York: Wiley. \nCleveland, W. S. (1979). \"Robust locally weighted regression and smoothing scatterplots.\" J. American Stat. Association, 74, pp. 829-836. \nde Boor, C. (1978). 
A practical guide to splines. New York: Springer. \nHastie, T. J., & Tibshirani, R. J. (1990). Generalized additive models. London: Chapman and Hall. \nJacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). \"Adaptive mixtures of local experts.\" Neural Computation, 3, pp. 79-87. \nJordan, M. I., & Jacobs, R. (1994). \"Hierarchical mixtures of experts and the EM algorithm.\" Neural Computation, 6, pp. 181-214. \nLjung, L., & Söderström, T. (1986). Theory and practice of recursive identification. Cambridge: MIT Press. \nMcLachlan, G. J., & Basford, K. E. (1988). Mixture models. New York: Marcel Dekker. \nNadaraya, E. A. (1964). \"On estimating regression.\" Theor. Prob. Appl., 9, pp. 141-142. \nSchaal, S., & Atkeson, C. G. (1994). \"Assessing the quality of learned local models.\" In: Cowan, J. D., Tesauro, G., & Alspector, J. (Eds.), Advances in Neural Information Processing Systems 6. Morgan Kaufmann. \nScott, D. W. (1992). Multivariate Density Estimation. New York: Wiley. \nSutton, R. S. (1992). \"Gain adaptation beats least squares.\" In: Proc. of 7th Yale Workshop on Adaptive and Learning Systems, New Haven, CT. \nWolpert, D. H. (1990). \"Stacked generalization.\" Los Alamos Technical Report LA-UR-90-3460. \n", "award": [], "sourceid": 1058, "authors": [{"given_name": "Stefan", "family_name": "Schaal", "institution": null}, {"given_name": "Christopher", "family_name": "Atkeson", "institution": null}]}