{"title": "Generalization Properties of Radial Basis Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 707, "page_last": 713, "abstract": null, "full_text": "Generalization Properties of Radial Basis \n\nFunctions \n\nSherif M. Botros \n\nChristopher G. Atkeson \n\nBrain and Cognitive Sciences Department \nand the Artificial Intelligence Laboratory \n\nMassachusetts Institute of Technology \n\nCambridge, MA 02139 \n\nAbstract \n\nWe examine the ability of radial basis functions (RBFs) to generalize. We \ncompare the performance of several types of RBFs. We use the inverse dy(cid:173)\nnamics of an idealized two-joint arm as a test case. We find that without \na proper choice of a norm for the inputs, RBFs have poor generalization \nproperties. A simple global scaling of the input variables greatly improves \nperformance. We suggest some efficient methods to approximate this dis(cid:173)\ntance metric. \n\n1 \n\nINTRODUCTION \n\nThe Radial Basis Functions (RBF) approach to approximating functions consists of \nmodeling an input-output mapping as a linear combination of radially symmetric \nfunctions (Powell, 1987; Poggio and Girosi, 1990; Broomhead and Lowe, 1988; \nMoody and Darken, 1989). The RBF approach has some properties which make \nit attractive as a function interpolation and approximation tool. The coefficients \nthat multiply the different basis functions can be found with a linear regression. \nMany RBFs are derived from regularization principles which optimize a criterion \ncombining fitting error and the smoothness of the approximated function. However, \nthe optimality criteria may not always be appropriate, especially when the input \nvariables have different measurement units and affect the output differently. A \nnatural extension to RBFs is to vary the distance metric (equivalent to performing \na linear transformation on the input variables). 
This can be viewed as changing the cost function to be optimized (Poggio and Girosi, 1990). We first use an exact interpolation approach with RBFs centered at the data in the training set. We then explore the effect of optimizing the distance metric for Gaussian RBFs, using a smaller number of functions than data points in the training set. We also suggest and test several methods for approximating this metric for Gaussian RBFs, which work well for the two-joint arm example we examined. We refer the reader to several other studies addressing the generalization performance of RBFs (Franke, 1982; Casdagli, 1989; Renals and Rohwer, 1989). \n\n2 EXACT INTERPOLATION APPROACH \n\nIn the exact interpolation model the number of RBFs is equal to the number of experiences. The centers of the RBFs are chosen to be at the locations of the experiences. We used an idealized horizontal planar two-joint arm model with no friction and no noise (perfect measurements) to test the performance of RBFs: \n\n\tau_1 = \ddot{\theta}_1 (I_1 + I_2 + 2 m_2 c_{x2} l_1 \cos\theta_2 - 2 m_2 c_{y2} l_1 \sin\theta_2) + \ddot{\theta}_2 (I_2 + m_2 c_{x2} l_1 \cos\theta_2 - m_2 c_{y2} l_1 \sin\theta_2) - 2 l_1 \dot{\theta}_1 \dot{\theta}_2 (m_2 c_{x2} \sin\theta_2 + m_2 c_{y2} \cos\theta_2) - l_1 \dot{\theta}_2^2 (m_2 c_{x2} \sin\theta_2 + m_2 c_{y2} \cos\theta_2) \n\n\tau_2 = \ddot{\theta}_1 (m_2 c_{x2} l_1 \cos\theta_2 - m_2 c_{y2} l_1 \sin\theta_2 + I_2) + \ddot{\theta}_2 I_2 + l_1 \dot{\theta}_1^2 (m_2 c_{x2} \sin\theta_2 + m_2 c_{y2} \cos\theta_2) \n\n(1) \n\nwhere \theta_i, \dot{\theta}_i, \ddot{\theta}_i are the angular position, velocity and acceleration of joint i, and \tau_i is the torque at joint i. I_i, m_i, l_i, c_{xi} and c_{yi} are respectively the moment of inertia, mass, length and the x and y components of the center of mass location of link i. The input vector is (\theta_1, \theta_2, \dot{\theta}_1, \dot{\theta}_2, \ddot{\theta}_1, \ddot{\theta}_2). The training and test sets are formed of one thousand random experiences each, uniformly distributed across the space of the inputs. 
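Equation 1 and the data-generation step can be sketched in numpy as follows. This is our sketch, not the authors' code: the arm's physical parameter values (`I1`, `I2`, `m2`, `l1`, `cx2`, `cy2`) are illustrative assumptions, since the paper does not report them, and the sampling ranges are the ones described next.

```python
import numpy as np

def arm_torques(q1, q2, dq1, dq2, ddq1, ddq2,
                I1=0.1, I2=0.1, m2=1.0, l1=0.3, cx2=0.15, cy2=0.0):
    """Inverse dynamics of the idealized horizontal two-joint arm (equation 1).
    Physical parameters are illustrative assumptions, not values from the paper."""
    # Terms shared between the two joint torques.
    a = m2 * cx2 * l1 * np.cos(q2) - m2 * cy2 * l1 * np.sin(q2)
    b = m2 * cx2 * np.sin(q2) + m2 * cy2 * np.cos(q2)
    tau1 = (ddq1 * (I1 + I2 + 2 * a) + ddq2 * (I2 + a)
            - 2 * l1 * dq1 * dq2 * b - l1 * dq2**2 * b)
    tau2 = ddq1 * (a + I2) + ddq2 * I2 + l1 * dq1**2 * b
    return tau1, tau2

# 1000 random experiences, uniformly distributed over the stated input ranges.
rng = np.random.default_rng(0)
q = rng.uniform(-4, 4, (1000, 2))       # joint angles
dq = rng.uniform(-20, 20, (1000, 2))    # joint angular velocities
ddq = rng.uniform(-100, 100, (1000, 2)) # joint angular accelerations
tau1, tau2 = arm_torques(q[:, 0], q[:, 1], dq[:, 0], dq[:, 1],
                         ddq[:, 0], ddq[:, 1])
```

Note that, with the arm horizontal (no gravity), both torques vanish at rest and neither depends on the first joint angle, which foreshadows the scaling results in table 1.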
The different inputs were selected from the following ranges: [-4, 4] for the joint angles, [-20, 20] for the joint angular velocities and [-100, 100] for the joint angular accelerations. For the exact interpolation case, we scaled the input variables such that the input space is limited to the six-dimensional hypercube [-1, 1]^6. This improved the results we obtained. The torque to be estimated at each joint is modeled by the following equation: \n\n\hat{T}_k(x_i) = \sum_{j=1}^{n} c_{kj} \phi(\|x_i - x_j\|) + \sum_{j=1}^{p} \mu_{kj} P_j^m(x_i) \n\n(2) \n\nwhere \hat{T}_k, k = 1, 2, is the estimated torque at the kth joint, n is the number of experiences/RBFs, x_i is the i-th input vector, \| \cdot \| is the Euclidean norm and P_j^m(\cdot), j = 1, ..., p, is a basis of the space of polynomials of order m. The polynomial terms are not always added, and it was found that adding the polynomial terms by themselves does not improve the performance significantly, which is in agreement with the conclusion made by Franke (Franke, 1982). When a polynomial is present in the equation, we add the following extra constraints (Powell, 1987): \n\n\sum_{j=1}^{n} c_{kj} P_i^m(x_j) = 0, \quad i = 1, ..., p \n\n(3) \n\n[Figure 1 appeared here: normalized RMS error on the test set plotted against the width parameter c^2 for LS, CS, TPS, Gaussian, HMQ and HIMQ basis functions.] \n\nFigure 1: Normalized errors for the different RBFs using exact interpolation. c^2 is the width parameter when relevant. 
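The fit of equations 2 and 3 reduces to one linear solve of a block system: the RBF matrix bordered by the polynomial terms, with the constraints filling the bottom rows. A minimal numpy sketch (ours, not the authors' code), shown here with a first order polynomial and the Hardy multiquadric with c^2 = 4:

```python
import numpy as np

def fit_exact_rbf(X, y, phi):
    """Exact interpolation (equations 2 and 3): one RBF per experience plus a
    first order polynomial, with the constraints sum_j c_j P_i(x_j) = 0.
    A sketch of the standard bordered linear system, not the authors' code."""
    n, d = X.shape
    r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    P = np.hstack([np.ones((n, 1)), X])          # first order polynomial terms
    p = P.shape[1]
    # Bordered system: [phi(r)  P; P^T  0] [c; mu] = [y; 0]
    A = np.block([[phi(r), P], [P.T, np.zeros((p, p))]])
    coef = np.linalg.solve(A, np.concatenate([y, np.zeros(p)]))
    c, mu = coef[:n], coef[n:]
    def predict(Xq):
        rq = np.linalg.norm(Xq[:, None, :] - X[None, :, :], axis=-1)
        Pq = np.hstack([np.ones((len(Xq), 1)), Xq])
        return phi(rq) @ c + Pq @ mu
    return predict

# Hardy multiquadric with c^2 = 4, the best case reported in this section.
hmq = lambda r: np.sqrt(r**2 + 4.0)
```

Since the number of parameters equals the number of experiences, the fit reproduces the training targets exactly; generalization is then judged on the held-out test set.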
\n\nTo find the coefficients Ckj and J-lkj \n, we have to invert a square matrix which \nis nonsingular for distinct inputs for the basis functions we considered (Micchelli, \n1986). We used the training set to find the parameters Ckj, j = 1, n, and when \nrelevant J-lkj, j = 1,p, for the following RBFs: \n\n4>(r) = vr2 + c2 \n\n\u2022 Gaussians 4>( r) = exp( ~~2) \n\u2022 Hardy Multiquadrics [HMQ] \n\u2022 Hardy Inverse Multiquadrics [HIMQ] \u00a2(r) = vr21+c'J \n\u2022 Thin Plate Splines [TPS] \u00a2(r) = r 2 10g r \n\u2022 Cubic Splines [CS] \u00a2(r) = r3 \n\u2022 Linear Splines [LS] \u00a2(r) = r \n\nwhere r = Ilxi - Xj II . For the last three RBFs, we added polynomials of different \norders, subject to the constraints in equation 3 above. Since the number of inde(cid:173)\npendent parameters is equal to the number of points in the training set, we can \ntrain the system so that it exactly reproduces the training set. We then tested its \nperformance on the test set. The error was measured by equation 4 below: \n\nE = \n\n2:~=1 2:?=1 (Tki - Tki)2 \n\n2:~=1 2:?=1 Tli \n\n(4) \n\nThe normalized error obtained on the test set for the different RBFs are shown \nin figure 1. The results for LS and CS shown in this figure are obtained after \nthe addition of a first order polynomial to the RBFs. We also tried adding a \nthird order polynomial for TPS. As shown in this figure, the normalized error was \nmore sensitive to the width parameter (i.e. c2 ) for the Gaussian RBFs than for \n\n\f710 \n\nBotros and Atkeson \n\nHardy multiquadrics and inverse multiquadrics. This is in agreement with Franke's \nobservation (Franke, 1982). The best normalized error for any RBF that we tested \nwas 0.338 for HMQ with a value of c2 = 4 . Also, contrary to our expectations and \nto results reported by others (Franke, 1982), the TPS with a third order polynomial \nhad a normalized error of 0.5003. 
This error value did not change significantly when only lower order polynomials were added to the (r^2 \log r) RBFs. Using Generalized Cross Validation (Bates et al., 1987) to optimize the tradeoff between smoothness and fitting the data, we got a similar normalized error for TPS. \n\n3 GENERALIZED RBF \n\nThe RBF approach has been generalized (Poggio and Girosi, 1990) to have adjustable center locations, fewer centers than data, and to use a different distance metric. Instead of using a Euclidean norm, we can use a general weighted norm: \n\n\|x_i - x_j\|_W^2 = (x_i - x_j)^T W^T W (x_i - x_j) \n\n(5) \n\nwhere W is a square matrix. This approach is also referred to as Hyper Basis Functions (Poggio and Girosi, 1990). The problem of finding the weight matrix and the location of the centers is nonlinear. We simplified the problem by only considering a diagonal matrix W and fixing the locations of the centers of the RBFs. The center locations were chosen randomly and were uniformly distributed over the input space. We tested three different methods to find the different parameters for Gaussian RBFs, which we describe in the next three subsections. \n\n3.1 NONLINEAR OPTIMIZATION \n\nWe used a Levenberg-Marquardt nonlinear optimization routine to find the coefficients of the RBFs {c_{kj}} and the diagonal scaling matrix W that minimized the sum of the squares of the errors in estimating the training set. We were able to find a set of parameters that reduced the normalized error to less than 0.01 in both the training and the test sets, using 500 Gaussian RBFs randomly and uniformly spaced over the input space. One disadvantage we found with this method is the possible convergence to local minima and the long time it takes to converge using general purpose optimization programs. The diagonal elements of the matrix W are shown in the L-M columns of table 1. 
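The model being fit in this section, a Gaussian RBF under the weighted norm of equation 5 with diagonal W, separates into a linear subproblem once the metric is fixed. A sketch of that structure (ours; in the paper the coefficients and W are optimized jointly with Levenberg-Marquardt, which this does not reproduce):

```python
import numpy as np

def gaussian_rbf_design(X, centers, w_diag):
    """Design matrix for Gaussian RBFs under equation 5 with diagonal W:
    ||x - t||_W^2 = sum_i (w_ii * (x_i - t_i))^2."""
    diff = (X[:, None, :] - centers[None, :, :]) * w_diag
    return np.exp(-np.sum(diff**2, axis=-1))

def fit_coefficients(X, y, centers, w_diag):
    """For a fixed metric, the RBF coefficients are a linear least squares fit,
    so only the diagonal of W (and, in general, the centers) is nonlinear."""
    G = gaussian_rbf_design(X, centers, w_diag)
    c, *_ = np.linalg.lstsq(G, y, rcond=None)
    return c
```

An outer nonlinear optimizer would then adjust `w_diag` while this inner solve supplies the coefficients, which is one common way to organize the search this section describes.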
As expected, \theta_1 has a very small scale for both joints compared to \theta_2, since \theta_1 does not affect the output of either joint in the horizontal model described by equation 1. Also, the scaling of \theta_2 is much larger than the scaling of the other variables. This suggests that the scaling could depend on both the range of the input variables and the sensitivity of the output to the different input variables. We found empirically that a formula of the form of equation 6 approximates reasonably well the scaling weights found using nonlinear optimization: \n\nw_{ii} = \frac{|\nabla f|_i}{\|\nabla f\|} \cdot \frac{k}{\sqrt{E\{(x_i - t_i)^2\}}} \n\n(6) \n\nwhere |\nabla f|_i / \|\nabla f\| is the normalized average absolute value of the gradient of the correct model of the function to be approximated. The term k / \sqrt{E\{(x_i - t_i)^2\}} normalizes the density of the input variables in each direction by taking into account the expected distances from the RBF centers t to the data x. The constant k in this equation is inversely proportional to the width of the Gaussian used in the RBF. \n\nTable 1: Scaling Weights Obtained Using Different Methods. \n\nW | L-M Alg. (Joint 1) | L-M Alg. (Joint 2) | True Func. (Joint 1) | True Func. (Joint 2) | Grad. Approx. (Joint 1) | Grad. Approx. (Joint 2) \nw_{11} (\theta_1) | 0.000021 | 5.48237e-06 | 0.000000 | 0.000000 | 0.047010 | 0.005450 \nw_{22} (\theta_2) | 0.382014 | 0.443273 | 0.456861 | 0.456449 | 0.400615 | 0.409277 \nw_{33} (\dot{\theta}_1) | 0.004177 | 0.0871921 | 0.005531 | 0.010150 | 0.009898 | 0.038288 \nw_{44} (\dot{\theta}_2) | 0.004611 | 0.000120948 | 0.007490 | 0.000000 | 0.028477 | 0.008948 \nw_{55} (\ddot{\theta}_1) | 0.000433 | 0.00134168 | 0.000271 | 0.000110 | 0.006365 | 0.002166 \nw_{66} (\ddot{\theta}_2) | 0.000284 | 0.000955884 | 0.000059 | 0.000116 | 0.000556 | 0.001705 \n\n
For the inverse dynamics problem we tested, using 500 Gaussian functions randomly and uniformly distributed over the entire input space, a k between 1 and 2 was found to be good and results in scaling parameters which approximate those obtained by optimization. The scaling weights obtained using equation 6, based on knowledge of the function to be approximated, are shown in the TRUE FUNC. columns of table 1. Using these weight values, the error on the test set was about 0.0001. \n\n3.2 AVERAGE GRADIENT APPROXIMATION \n\nIn the previous section we showed that the scaling weights could be approximated using the derivatives of the function to be approximated in the different directions. If we can approximate these derivatives, we can then approximate the scaling weights using equation 6. A change in the output \Delta y can be approximated by a first order Taylor series expansion, as shown in equation 7 below: \n\n\Delta y \approx \sum_i \frac{\partial y}{\partial x_i} \Delta x_i \n\n(7) \n\nWe first scaled the input variables so that they have the same range, then selected all pairs of points from the training set that are below a prespecified distance (since equation 7 is only valid for nearby points), and then computed \Delta x and \Delta y for each pair. We used least squares regression to estimate the values of \partial y / \partial x_i. Using the estimated derivatives and equation 6, we got the scaling weights shown in the last two columns of table 1. Note the similarity between these weights and the ones obtained using the nonlinear optimization or the derivatives of the true function. The normalized error in this case was found to be 0.012 for the training set and 0.033 for the test set. One advantage of this method is that it is much faster than the nonlinear optimization method. However, it is less accurate. 
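The steps of this subsection, collecting nearby pairs, regressing output differences on input differences per equation 7, and plugging the resulting gradient estimate into equation 6, can be sketched as follows. This is our sketch under assumed defaults; the paper does not specify its pair-distance threshold, and `max_dist` and `k` here are illustrative choices.

```python
import numpy as np

def average_gradient_weights(X, y, centers, max_dist=0.2, k=1.5):
    """Estimate the equation 6 scaling weights without the true function:
    regress dy on dx over nearby pairs (equation 7), then normalize each
    direction by the expected center-to-data distance."""
    n = len(X)
    dX, dy = [], []
    for i in range(n):
        d = np.linalg.norm(X[i+1:] - X[i], axis=1)
        for j in np.nonzero(d < max_dist)[0]:   # only nearby pairs
            dX.append(X[i + 1 + j] - X[i])
            dy.append(y[i + 1 + j] - y[i])
    # Least squares fit of the average gradient: dy ~ dX @ grad.
    grad, *_ = np.linalg.lstsq(np.asarray(dX), np.asarray(dy), rcond=None)
    g = np.abs(grad) / np.linalg.norm(grad)
    # sqrt(E{(x_i - t_i)^2}) per input direction, over all data-center pairs.
    spread = np.sqrt(np.mean((X[:, None, :] - centers[None, :, :])**2,
                             axis=(0, 1)))
    return g * k / spread
```

On a function that ignores one input, the weight for that direction collapses toward zero, mirroring the near-zero \theta_1 entries in table 1.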
\n[Figure 2 appeared here: normalized RMS error on the test set plotted against iteration number (1 to 5) for the recursive method, with and without initial scaling of the inputs.] \n\nFigure 2: Normalized error vs. the number of iterations using the recursive method with and without initial scaling. \n\n3.3 RECURSIVE METHOD \n\nAnother possible method to approximate the RMS values of the derivatives is to first approximate the function using RBFs of \"reasonable\" widths, use this first approximation to compute the derivatives, use the derivatives to modify the Gaussian widths in the different directions, and then repeat the procedure. We used 100 Gaussian units, randomly and uniformly distributed, to find the coefficients of the RBFs. We explored two different scalings of the input data. In the first case we used the raw data without scaling, and in the second case the different input variables were scaled so that they have the same range [-1, 1]. The width of the Gaussians, as specified by the variance c^2, was equal to 200 in the first case, and 2 in the second case. We then used the estimated values of the derivatives to change the width of the Gaussians in the different directions and iterated the procedure. The normalized error is plotted versus the number of iterations for both cases in figure 2. As shown in this figure, the test set error dropped to around 0.001 in only about 4 iterations. This technique is also much faster than the nonlinear optimization approach. It can also easily be made local, which is desirable if the dependence of the function to be approximated on the input variables changes from one region of the input space to another. 
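The fit / differentiate / rescale loop can be sketched as below. This is our reading of the procedure, not the authors' exact implementation: the defaults (`iters`, `k`, the finite-difference step `eps`) are assumptions, and the derivative estimate here uses finite differences of the fitted model.

```python
import numpy as np

def recursive_metric(X, y, centers, w0, iters=4, k=1.5, eps=1e-3):
    """Recursive method (section 3.3): fit with the current per-direction
    widths, estimate average |partial derivatives| of the *fitted* model by
    finite differences, rescale the widths via equation 6, and repeat."""
    w = np.asarray(w0, float)
    spread = np.sqrt(np.mean((X[:, None, :] - centers[None, :, :])**2,
                             axis=(0, 1)))
    for _ in range(iters):
        # Linear fit of the coefficients for the current metric.
        G = np.exp(-np.sum(((X[:, None, :] - centers[None, :, :]) * w)**2,
                           axis=-1))
        c, *_ = np.linalg.lstsq(G, y, rcond=None)
        def f(Z):
            Gz = np.exp(-np.sum(((Z[:, None, :] - centers[None, :, :]) * w)**2,
                                axis=-1))
            return Gz @ c
        # Average absolute partial derivatives of the current approximation.
        base = f(X)
        grad = np.empty(X.shape[1])
        for i in range(X.shape[1]):
            Xp = X.copy()
            Xp[:, i] += eps
            grad[i] = np.mean(np.abs(f(Xp) - base)) / eps
        # Equation 6 with the estimated gradient.
        w = (grad / np.linalg.norm(grad)) * k / spread
    return w
```

Each pass needs only a linear solve and a few evaluations of the fitted model, which is why the method is so much cheaper than full nonlinear optimization.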
One disadvantage of this approach is that it is not guaranteed to converge, especially if the initial approximation of the function is very bad. \n\n4 CONCLUSION \n\nIn this paper we tested the ability of RBFs to generalize from the training set. We found that the choice of the distance metric may be crucial for good generalization. For the problem we tested, a bad choice of distance metric resulted in very poor generalization. However, the performance of Gaussian RBFs improved significantly when we optimized the distance metric. We also tested some empirical methods for efficiently estimating this metric that worked well in our test problem. Additional work has to be done to identify the conditions under which the techniques we presented here may or may not work. Although a simple global scaling of the input variables worked well in our test example, it may not work in general. One problem that we found when we optimized the distance metric is that the values of the coefficients c_{kj} become very large, even when we imposed a penalty on their values. The reason for this, we think, is that the estimation problem was close to singular. Choice of the training set and optimizing the centers of the RBFs may solve this problem. The recursive method we described could probably be modified to approximate a complete linear coordinate transformation and local scaling. \n\nAcknowledgments \n\nSupport was provided under Office of Naval Research contract N00014-88-K-0321 and under Air Force Office of Scientific Research grant AFOSR-89-0500. Support for CGA was provided by a National Science Foundation Engineering Initiation Award and Presidential Young Investigator Award, an Alfred P. Sloan Research Fellowship, the W. M. Keck Foundation Assistant Professorship in Biomedical Engineering, and a Whitaker Health Sciences Fund MIT Faculty Research Grant. 
\n\nReferences \n\nD. M. Bates, M. J. Lindstrom, G. Wahba and B. S. Yandell (1987) \"GCVPACK - Routines for generalized cross validation\". Commun. Statist.-Simulat. 16(1):263-297. \n\nD. S. Broomhead and D. Lowe (1988) \"Multivariable functional interpolation and adaptive networks\". Complex Systems 2:321-323. \n\nM. Casdagli (1989) \"Nonlinear prediction of chaotic time series\". Physica D 35:335-356. \n\nR. Franke (1982) \"Scattered data interpolation: Tests of some methods\". Math. Comp. 38:181-200. \n\nC. A. Micchelli (1986) \"Interpolation of scattered data: distance matrices and conditionally positive definite functions\". Constr. Approx. 2:11-22. \n\nJ. Moody and C. Darken (1989) \"Fast learning in networks of locally tuned processing units\". Neural Computation 1(2):281-294. \n\nT. Poggio and F. Girosi (1990) \"Networks for approximation and learning\". Proceedings of the IEEE 78(9):1481-1497. \n\nM. J. D. Powell (1987) \"Radial basis functions for multivariable interpolation: A review\". In J. C. Mason and M. G. Cox (eds.), Algorithms for Approximation, 143-167. Clarendon Press, Oxford. \n\nS. Renals and R. Rohwer (1989) \"Phoneme classification experiments using radial basis functions\". In Proceedings of the International Joint Conference on Neural Networks, I-462 - I-467, Washington, D.C., IEEE TAB Neural Network Committee. \n", "award": [], "sourceid": 294, "authors": [{"given_name": "Sherif", "family_name": "Botros", "institution": null}, {"given_name": "Christopher", "family_name": "Atkeson", "institution": null}]}