{"title": "Bayesian Kernel Shaping for Learning Control", "book": "Advances in Neural Information Processing Systems", "page_first": 1673, "page_last": 1680, "abstract": "In kernel-based regression learning, optimizing each kernel individually is useful when the data density, curvature of regression surfaces (or decision boundaries) or magnitude of output noise (i.e., heteroscedasticity) varies spatially. Unfortunately, it presents a complex computational problem as the danger of overfitting is high and the individual optimization of every kernel in a learning system may be overly expensive due to the introduction of too many open learning parameters. Previous work has suggested gradient descent techniques or complex statistical hypothesis methods for local kernel shaping, typically requiring some amount of manual tuning of meta parameters. In this paper, we focus on nonparametric regression and introduce a Bayesian formulation that, with the help of variational approximations, results in an EM-like algorithm for simultaneous estimation of regression and kernel parameters. The algorithm is computationally efficient (suitable for large data sets), requires no sampling, automatically rejects outliers and has only one prior to be specified. It can be used for nonparametric regression with local polynomials or as a novel method to achieve nonstationary regression with Gaussian Processes. Our methods are particularly useful for learning control, where reliable estimation of local tangent planes is essential for adaptive controllers and reinforcement learning. We evaluate our methods on several synthetic data sets and on an actual robot which learns a task-level control law.", "full_text": "Bayesian Kernel Shaping for Learning Control\n\nJo-Anne Ting1, Mrinal Kalakrishnan1, Sethu Vijayakumar2 and Stefan Schaal1,3\n\n1Computer Science, U. 
of Southern California, Los Angeles, CA 90089, USA\n\n2School of Informatics, University of Edinburgh, Edinburgh, EH9 3JZ, UK\n\n3ATR Computational Neuroscience Labs, Kyoto 619-02, Japan\n\nAbstract\n\nIn kernel-based regression learning, optimizing each kernel individually is useful\nwhen the data density, curvature of regression surfaces (or decision boundaries)\nor magnitude of output noise varies spatially. Previous work has suggested gradi-\nent descent techniques or complex statistical hypothesis methods for local kernel\nshaping, typically requiring some amount of manual tuning of meta parameters.\nWe introduce a Bayesian formulation of nonparametric regression that, with the\nhelp of variational approximations, results in an EM-like algorithm for simulta-\nneous estimation of regression and kernel parameters. The algorithm is computa-\ntionally ef\ufb01cient, requires no sampling, automatically rejects outliers and has only\none prior to be speci\ufb01ed. It can be used for nonparametric regression with local\npolynomials or as a novel method to achieve nonstationary regression with Gaus-\nsian processes. Our methods are particularly useful for learning control, where\nreliable estimation of local tangent planes is essential for adaptive controllers and\nreinforcement learning. We evaluate our methods on several synthetic data sets\nand on an actual robot which learns a task-level control law.\n\n1 Introduction\n\nKernel-based methods have been highly popular in statistical learning, starting with Parzen windows,\nkernel regression, locally weighted regression and radial basis function networks, and leading to\nnewer formulations such as Reproducing Kernel Hilbert Spaces, Support Vector Machines, and\nGaussian process regression [1]. Most algorithms start with parameterizations that are the same for\nall kernels, independent of where in data space the kernel is used, but later recognize the advantage\nof locally adaptive kernels [2, 3, 4]. 
Such locally adaptive kernels are useful in scenarios where the\ndata characteristics vary greatly in different parts of the workspace (e.g., in terms of data density,\ncurvature and output noise). For instance, in Gaussian process (GP) regression, using a nonstationary\ncovariance function, e.g., [5], allows for such a treatment. Performing optimizations individually for\nevery kernel, however, becomes rather complex and is prone to over\ufb01tting due to a \ufb02ood of open\nparameters. Previous work has suggested gradient descent techniques with cross-validation methods\nor involved statistical hypothesis testing for optimizing the shape and size of a kernel in a learning\nsystem [6, 7].\n\nIn this paper, we consider local kernel shaping by averaging over data samples with the help of\nlocally polynomial models and formulate this approach, in a Bayesian framework, for both function\napproximation with piecewise linear models and nonstationary GP regression. Our local kernel\nshaping algorithm is computationally ef\ufb01cient (capable of handling large data sets), can deal with\nfunctions of strongly varying curvature, data density and output noise, and even rejects outliers\nautomatically. Our approach to nonstationary GP regression differs from previous work by avoiding\nMarkov Chain Monte Carlo (MCMC) sampling [8, 9] and by exploiting the full nonparametric\ncharacteristics of GPs in order to accommodate nonstationary data.\n\n\fOne of the core application domains for our work is learning control, where computationally ef\ufb01cient\nfunction approximation and highly accurate local linearizations from data are crucial for deriving\ncontrollers and for optimizing control along trajectories [10]. The high variations from \ufb01tting noise,\nseen in Fig. 
3, are harmful to the learning system, potentially causing the controller to be unstable. Our final evaluations illustrate such a scenario by learning an inverse kinematics model for a real robot arm.\n\n2 Bayesian Local Kernel Shaping\n\nWe develop our approach in the context of nonparametric locally weighted regression with locally linear polynomials [11], assuming, for notational simplicity, only a one-dimensional output—extensions to multi-output settings are straightforward. We assume a training set of N samples, D = {x_i, y_i}_{i=1}^N, drawn from a nonlinear function y = f(x) + ε that is contaminated with mean-zero (but potentially heteroscedastic) noise ε. Each data sample consists of a d-dimensional input vector x_i and an output y_i. We wish to approximate a locally linear model of this function at a query point x_q ∈ R^d. The model introduces hidden variables z_i, where each z_im is a noisy local linear prediction with mean b_m^T x_im and variance ψ_zm, and the output is generated as y_i = 1^T z_i plus noise with variance σ². Each sample i additionally carries an indicator-like scalar weight w_i = ∏_{m=1}^d w_im, where each w_im is Bernoulli-distributed with parameter given by the weighting kernel q_im = 1/(1 + (x_im − x_qm)^{2r} h_m), h_m is the bandwidth for input dimension m, and r > 0 is a positive integer¹. As pointed out in [11], the particular mathematical formulation of a weighting kernel is largely computationally irrelevant for locally weighted learning. Our choice of function for q_im was dominated by the desire to obtain analytically tractable learning updates. We place a Gamma prior over the bandwidth h_m, i.e., p(h_m) ∼ Gamma(a_hm0, b_hm0), where a_hm0 and b_hm0 are parameters of the Gamma distribution, to ensure a positive weighting kernel width.\n\n2.2 Inference\n\nWe can treat the entire regression problem as an EM learning problem [14, 15] and maximize the log likelihood log p(y|X) for generating the observed data. We can maximize this incomplete log likelihood by maximizing the expected value of the complete log likelihood p(y, Z, b, w, h, σ², ψ_z|X) = ∏_{i=1}^N p(y_i, z_i, b, w_i, h, σ², ψ_z|x_i). In our model, each data sample i has an indicator-like scalar weight w_i associated with it, allowing us to express the complete log likelihood L, in a similar fashion to mixture models, as:\n\nL = \log \prod_{i=1}^N \Big[ \big[ p(y_i|z_i, \sigma^2)\, p(z_i|x_i, b, \psi_z) \big]^{w_i} \prod_{m=1}^d p(w_{im}) \Big] \prod_{m=1}^d p(b_m|\psi_{zm})\, p(\psi_{zm})\, p(h_m)\, p(\sigma^2)\n\nExpanding the log p(w_im) term from the expression above results in a problematic −log(1 + (x_im − x_qm)^{2r} h_m) term that prevents us from deriving an analytically tractable expression for the posterior of h_m. To address this, we use a variational approach on concave/convex functions suggested by [16] to produce analytically tractable expressions. We can find a lower bound on the term so that −log(1 + (x_im − x_qm)^{2r} h_m) ≥ −λ_im (x_im − x_qm)^{2r} h_m, where λ_im is a variational parameter to be optimized in the M-step of our final EM-like algorithm. Our choice of weighting kernel allows us to find a lower bound to L in this manner. We explored the use of other weighting kernels (e.g., a quadratic negative exponential), but had issues with finding a lower bound to the problematic terms in log p(w_im) such that analytically tractable inference for h_m could be done. The resulting lower bound to L is L̂; due to lack of space, we give the expression for L̂ in the appendix. The expectation of L̂ should be taken with respect to the true posterior distribution of all hidden variables Q(b, ψ_z, z, h). Since this is analytically intractable, a lower bound can be formulated using a technique from variational calculus where we make a factorial approximation of the true posterior, e.g., Q(b, ψ_z, z, h) = Q(b, ψ_z)Q(h)Q(z) [15], that allows resulting posterior distributions over hidden variables to become analytically tractable. 
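As an illustration, the multiplicative weighting kernel used in this section, w_i = ∏_m q_im with q_im = 1/(1 + (x_im − x_qm)^{2r} h_m), can be sketched as follows. This is a minimal sketch, not the authors' implementation; the function name and array shapes are illustrative.

```python
import numpy as np

def kernel_weights(X, x_q, h, r=2):
    """Multiplicative weighting kernel for a query point x_q:
    w_i = prod_m q_im, with q_im = 1 / (1 + (x_im - x_qm)^(2r) * h_m).
    X: (N, d) inputs, x_q: (d,) query, h: (d,) per-dimension bandwidths."""
    q = 1.0 / (1.0 + (np.atleast_2d(X) - x_q) ** (2 * r) * h)  # (N, d)
    return q.prod(axis=1)  # product over dimensions gives the scalar weight

# Samples at, near and far from the query point x_q = 0:
w = kernel_weights(np.array([[0.0], [0.5], [2.0]]), np.array([0.0]),
                   h=np.array([10.0]))
```

A small bandwidth h_m keeps the kernel broad (distant points retain weight), while a large h_m localizes it sharply around x_q.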
The posterior of w_im, p(w_im = 1|y_i, z_i, x_i, θ, w_{i,k≠m}), is inferred using Bayes' rule:\n\np(w_{im} = 1|y_i, z_i, x_i, \theta, w_{i,k \neq m}) = \frac{ p(y_i, z_i|x_i, \theta, w_{i,k \neq m}, w_{im} = 1) \prod_{t=1, t \neq m}^d \langle w_{it} \rangle \, p(w_{im} = 1) }{ p(y_i, z_i|x_i, \theta, w_{i,k \neq m}, w_{im} = 1) \prod_{t=1, t \neq m}^d \langle w_{it} \rangle \, p(w_{im} = 1) + p(w_{im} = 0) } \quad (1)\n\nwhere θ = {b, ψ_z, h} and w_{i,k≠m} denotes the set of weights {w_ik} for k = 1, ..., d, k ≠ m. For the dimension m, we account for the effect of weights in the other d − 1 dimensions. This is a result of w_i being defined as the product of weights in all dimensions. The posterior mean of w_im is then ⟨w_im⟩ = ⟨p(w_im = 1|y_i, z_i, x_i, θ, w_{i,k≠m})⟩, and ⟨w_i⟩ = ∏_{m=1}^d ⟨w_im⟩, where ⟨·⟩ denotes the expectation operator. We omit the full set of posterior EM update equations (please refer to the appendix for this) and list only the posterior updates for h_m, w_im, b_m and z_i:\n\n\Sigma_{b_m} = \Big( \Sigma_{b_{m,0}}^{-1} + \sum_{i=1}^N \langle w_i \rangle x_{im} x_{im}^T \Big)^{-1}\n\n\langle b_m \rangle = \Sigma_{b_m} \Big( \Sigma_{b_{m,0}}^{-1} b_{m,0} + \sum_{i=1}^N \langle w_i \rangle \langle z_{im} \rangle x_{im} \Big)\n\n\langle w_{im} \rangle = \frac{ q_{im} A_i \prod_{k=1, k \neq m}^d \langle w_{ik} \rangle }{ q_{im} A_i \prod_{k=1, k \neq m}^d \langle w_{ik} \rangle + 1 - q_{im} }\n\n\Sigma_{z_i|y_i,x_i} = \frac{\Psi_{zN}}{\langle w_i \rangle} - \frac{1}{s_i} \frac{\Psi_{zN}}{\langle w_i \rangle} \mathbf{1}\mathbf{1}^T \frac{\Psi_{zN}}{\langle w_i \rangle}\n\n\langle z_i \rangle = \frac{\Psi_{zN} \mathbf{1} y_i}{s_i \langle w_i \rangle} + \Big( I_{d,d} - \frac{\Psi_{zN} \mathbf{1}\mathbf{1}^T}{s_i \langle w_i \rangle} \Big) \hat{b}_{x_i}\n\n\langle h_m \rangle = \frac{ a_{hm0} + N - \sum_{i=1}^N \langle w_{im} \rangle }{ b_{hm0} + \sum_{i=1}^N \lambda_{im} (x_{im} - x_{qm})^{2r} }\n\nwhere I_{d,d} is a d × d identity matrix, b̂_{x_i} is a d by 1 vector with coefficients ⟨b_m⟩^T x_im, ⟨w_i⟩ = ∏_{m=1}^d ⟨w_im⟩, Ψ_zN is a diagonal matrix with ψ_zN on its diagonal, s_i = σ² + 1^T Ψ_zN 1/⟨w_i⟩ (to avoid division by zero, ⟨w_i⟩ needs to be capped to some small non-zero value), q_im = λ_im = 1/(1 + (x_im − x_qm)^{2r} ⟨h_m⟩), and A_i = N(y_i; 1^T ⟨z_i⟩, σ²) ∏_{m=1}^d N(z_im; ⟨b_m⟩^T x_im, ψ_zm).\n\n¹(x_im − x_qm) is taken to the power 2r in order to ensure that the resulting expression is positive. Adjusting r affects how long the tails of the kernel are. We use r = 2 for all our experiments.\n\n
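The bandwidth update above is simply the mean of a Gamma posterior, which can be sketched as follows. This is an illustrative sketch of that single update under the reconstructed equations, not the authors' code; variable names are made up.

```python
import numpy as np

def bandwidth_posterior_mean(w_im, lam, x_m, xq_m, a0=1e-6, b0=1e-6, r=2):
    """Mean of the Gamma posterior over the bandwidth h_m:
    <h_m> = (a0 + N - sum_i <w_im>) / (b0 + sum_i lam_i * (x_im - x_qm)^(2r))."""
    N = len(w_im)
    a_post = a0 + N - np.sum(w_im)                        # posterior shape
    b_post = b0 + np.sum(lam * (x_m - xq_m) ** (2 * r))   # posterior rate
    return a_post / b_post                                 # Gamma mean a/b

x_m, lam = np.array([0.0, 0.1, 1.0]), np.ones(3)
h_all_in = bandwidth_posterior_mean(np.array([1.0, 1.0, 1.0]), lam, x_m, 0.0)
h_outlier = bandwidth_posterior_mean(np.array([1.0, 1.0, 0.1]), lam, x_m, 0.0)
```

A sample with low posterior weight (e.g., an outlier) increases N − Σ⟨w_im⟩ and hence ⟨h_m⟩, narrowing the kernel around the query point.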
Closer examination of the expression for ⟨b_m⟩ shows that it is a standard Bayesian weighted regression update [13], i.e., a data sample i with lower weight w_i will be downweighted in the regression. Since the weights are influenced by the residual error at each data point (see posterior update for ⟨w_im⟩), an outlier will be downweighted appropriately and eliminated from the local model. Fig. 2 shows how local kernel shaping is able to ignore outliers that a classical GP fits.\n\n[Figure 2 plots training data with a stationary GP fit and the kernel shaping fit.]\n\nFigure 2: Effect of outliers (in black circles)\n\nA few remarks should be made regarding the initialization of priors used in the posterior EM updates. Σ_bm,0 can be set to 10⁶I to reflect a large uncertainty associated with the prior distribution of b. The initial noise variance, ψ_zm,0, should be set to the best guess on the noise variance. To adjust the strength of this prior, n_m0 can be set to the number of samples one believes to have seen with noise variance ψ_zm,0. Finally, the initial h of the weighting kernel should be set so that the kernel is broad and wide. We use values of a_hm0 = b_hm0 = 10⁻⁶ so that h_m0 = 1 with high uncertainty. Note that some sort of initial belief about the noise level is needed to distinguish between noise and structure in the training data. Aside from the initial prior on ψ_zm, we used the same priors for all data sets in our evaluations.\n\n2.3 Computational Complexity\n\nFor one local model, the EM update equations have a computational complexity of O(Nd) per EM iteration, where d is the number of input dimensions and N is the size of the training set. 
This efficiency arises from the introduction of the hidden random variables z_i, which allows ⟨z_i⟩ and Σ_{z_i|y_i,x_i} to be computed in O(d) and avoids a d × d matrix inversion, which would typically require O(d³). Some nonstationary GP methods, e.g., [5], require O(N³) + O(N²) for training and prediction, while other more efficient stationary GP methods, e.g., [17], require O(M²N) + O(M²) training and prediction costs (where M << N is the number of pseudoinputs used in [17]). Our algorithm requires O(N d I_EM), where I_EM is the number of EM iterations—with a maximal cap of 1000 iterations used. Our algorithm also does not require any MCMC sampling as in [8, 9], making it more appealing to real-time applications.\n\n3 Extension to Gaussian Processes\n\nWe can apply the algorithm in section 2 not only to locally weighted learning with linear models, but also to derive a nonstationary GP method. A GP is defined by a mean and a covariance function, where the covariance function K captures dependencies between any two points as a function of the corresponding inputs, i.e., k(x_i, x_j) = cov(f(x_i), f(x_j)), where i, j = 1, ..., N. Standard GP models use a stationary covariance function, where the covariance between any two points in the training data is a function of the distances |x_i − x_j|, not of their locations. Stationary GPs perform suboptimally for functions that have different properties in various parts of the input space (e.g., discontinuous functions), where the stationary assumption fails to hold. Various methods have been proposed to specify nonstationary GPs. These include defining a nonstationary Matérn covariance function [5], adopting a mixture of local experts approach [18, 8, 9] to use independent GPs to cover data in different regions of the input space, and using multidimensional scaling to map a nonstationary spatial GP into a latent space [19].\n\nGiven the data set D drawn from the function y = f(x) + ε, as previously introduced in section 2, we propose an approach to specify a nonstationary covariance function. Assuming the use of a quadratic negative exponential covariance function, the covariance function of a stationary GP is\n\nk(x_i, x_j) = v_1^2 \exp\Big( -0.5 \sum_{m=1}^d h_m (x_{im} - x_{jm})^2 \Big) + v_0\n\nwhere the hyperparameters {h_1, h_2, ..., h_d, v_0, v_1} are optimized. In a nonstationary GP, the covariance function could then take the form²\n\nk(x_i, x_j) = v_1^2 \exp\Big( -0.5 \sum_{m=1}^d (x_{im} - x_{jm})^2 \frac{h_{im} h_{jm}}{h_{im} + h_{jm}} \Big) + v_0\n\nwhere h_im is the bandwidth of the local model centered at x_im and h_jm is the bandwidth of the local model centered at x_jm. We learn first the values of {h_im}, m = 1, ..., d, for all training data samples i = 1, ..., N using our proposed local kernel shaping algorithm and then optimize the hyperparameters v_0 and v_1. To make a prediction for a test sample x_q, we learn also the values of {h_qm}, m = 1, ..., d, i.e., the bandwidths of the local model centered at x_q. Importantly, since the covariance function of the GP is derived from locally constant models, we learn with locally constant, instead of locally linear, polynomials. We use r = 1 for the weighting kernel in order to keep the degree of nonlinearity consistent with that in the covariance function (i.e., quadratic). Even though the weighting kernel used in the local kernel shaping algorithm is not a quadratic negative exponential, it has a similar bell shape, but with a flatter top and shorter tails. Because of this, our augmented GP is an approximated form of a nonstationary GP. 
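The per-sample-bandwidth covariance described above, k(x_i, x_j) = v_1^2 exp(−0.5 Σ_m (x_im − x_jm)^2 h_im h_jm/(h_im + h_jm)) + v_0, can be sketched as follows. This is a sketch under assumed hyperparameter values, not the authors' implementation.

```python
import numpy as np

def nonstationary_cov(xi, xj, hi, hj, v0=0.1, v1=1.0):
    """Covariance built from per-sample bandwidths learned by local kernel
    shaping. hi, hj: (d,) bandwidth vectors of the local models centered at
    xi and xj. v0, v1 are hyperparameters optimized separately (illustrative
    values here)."""
    s = hi * hj / (hi + hj)                       # symmetric bandwidth combination
    d2 = np.sum((xi - xj) ** 2 * s)               # bandwidth-weighted sq. distance
    return v1 ** 2 * np.exp(-0.5 * d2) + v0

xi, xj = np.array([0.0]), np.array([1.0])
k_ij = nonstationary_cov(xi, xj, np.array([2.0]), np.array([8.0]))
k_ji = nonstationary_cov(xj, xi, np.array([8.0]), np.array([2.0]))
```

Note that the combination h_im h_jm/(h_im + h_jm) is symmetric in i and j, and when h_im = h_jm = h it reduces to h/2, recovering a stationary squared-exponential form up to rescaling.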
Nonetheless, it is able to capture nonstationary properties of the function f without needing MCMC sampling, unlike previously proposed nonstationary GP methods [8, 9].\n\n4 Experimental Results\n\n4.1 Synthetic Data\n\nFirst, we show our local kernel shaping algorithm's bandwidth adaptation abilities on several synthetic data sets, comparing it to a stationary GP and our proposed augmented nonstationary GP. For ease of visualization, we consider the following one-dimensional functions, similar to those in [5]: i) a function with a discontinuity, ii) a spatially inhomogeneous function, and iii) a straight line function. The data set for function i) consists of 250 training samples, 201 test inputs (evenly spaced across the input space) and output noise with σ² = 0.3025; the data set for function ii) consists of 250 training samples, 101 test inputs and an output signal-to-noise ratio (SNR) of 10; and the data set for function iii) has 50 training samples, 21 test inputs and an output SNR of 100.\n\nFig. 3 shows the predicted outputs of a stationary GP, augmented nonstationary GP and the local kernel shaping algorithm for data sets i)-iii). The local kernel shaping algorithm smoothes over regions where a stationary GP overfits and yet still manages to capture regions of highly varying curvature, as seen in Figs. 3(a) and 3(b). It correctly adjusts the bandwidths h with the curvature of the function. When the data looks linear, the algorithm opens up the weighting kernel so that all data samples are considered, as Fig. 3(c) shows. Our proposed augmented nonstationary GP can also handle the nonstationary nature of the data sets, and its performance is quantified in Table 1. Returning to our motivation to use these algorithms to obtain linearizations for learning control, it is important to realize that the high variations from fitting noise, as shown by the stationary GP in Fig. 
3, are detrimental for learning algorithms, as the slope (or tangent hyperplane, for high-dimensional data) would be wrong.\n\nTable 1 reports the normalized mean squared prediction error (nMSE) values for function i) and function ii) data sets, averaged over 20 random data sets. Fig. 4 shows results of the local kernel shaping algorithm and the proposed augmented nonstationary GP on the “real-world” motorcycle data set [20] consisting of 133 samples (with 80 equally spaced input query points used for prediction). We also show results from a previously proposed MCMC-based nonstationary GP method: an alternate infinite mixture of GP experts [9]. We can see that the augmented nonstationary GP and the local kernel shaping algorithm both capture the leftmost flatter region of the function, as well as some of the more nonlinear and noisier regions after 30 msec.\n\n4.2 Robot Data\n\nNext, we move on to an example application: learning an inverse kinematics model for a 3 degree-of-freedom (DOF) haptic robot arm (manufactured by SensAble, shown in Fig. 5(a)) in order to control the end-effector along a desired trajectory. This will allow us to verify that the kernel shaping algorithm can successfully deal with a large, noisy real-world data set with outliers and non-stationary properties—typical characteristics of most control learning problems.\n\n²This is derived from the definition of K as a positive semi-definite matrix: the covariance is written as an integral of the product of two quadratic negative exponentials—one with parameter h_im and the other with parameter h_jm.\n\n[Figure 3 shows three panels: (a) Function i), (b) Function ii), (c) Function iii). Each panel plots training data with stationary GP, augmented GP and kernel shaping fits (top) and the learnt bandwidths h and weighting kernels w over x (bottom).]\n\nFigure 3: Predicted outputs using a stationary GP, our augmented nonstationary GP and local kernel shaping. Figures on the bottom show the bandwidths learnt by local kernel shaping and the corresponding weighting kernels (in dotted black lines) for input query points (shown in red circles).\n\nWe collected 60,000 data samples from the arm while it performed random sinusoidal movements within a constrained box volume of Cartesian space. Each sample consists of the arm's joint angles q, joint velocities q̇, end-effector position in Cartesian space x, and end-effector velocities ẋ. From this data, we first learn a forward kinematics model: ẋ = J(q)q̇, where J(q) is the Jacobian matrix. The transformation from q̇ to ẋ can be assumed to be locally linear at a particular configuration q of the robot arm. 
We learn the forward model using kernel shaping, building a local model around each training point only if that point is not already sufficiently covered by an existing local model (e.g., having an activation weight of less than 0.2). Using insights into robot geometry, we localize the models only with respect to q, while the regression of each model is trained only on a mapping from q̇ to ẋ—these geometric insights are easily incorporated as priors in the Bayesian model. This procedure resulted in 56 models being built to cover the entire space of training data.\n\nWe artificially introduce a redundancy in our inverse kinematics problem on the 3-DOF arm by specifying the desired trajectory (x, ẋ) only in terms of x, z positions and velocities, i.e., the movement is supposed to be in a vertical plane in front of the robot. Analytically, the inverse kinematics equation is q̇ = J#(q)ẋ − α(I − J#J) ∂g/∂q, where J#(q) is the pseudo-inverse of the Jacobian. The second term is an optimal solution to the redundancy problem, specified here by a cost function g in terms of joint angles q. To learn a model for J#, we can reuse the local regions of q from the forward model, where J# is also locally linear. The redundancy issue can be solved by applying an additional weight to each data point according to a reward function [21]. In our case, the task is specified in terms of {ẋ, ż}, so we define a reward based on a desired y coordinate, y_des, that we would like to enforce as a soft constraint. Our reward function is g = exp(−(1/2) h (k(y_des − y) − ẏ)²), where k is a gain and h specifies the steepness of the reward. This ensures that the learnt inverse model chooses a solution which produces a ẏ that pushes the y coordinate toward y_des. 
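The soft-constraint reward g = exp(−(1/2) h (k(y_des − y) − ẏ)²) described above can be sketched directly. This is a minimal sketch; the function name and example values are illustrative.

```python
import numpy as np

def redundancy_reward(y, ydot, y_des, k=1.0, h=1.0):
    """Reward g = exp(-0.5 * h * (k*(y_des - y) - ydot)^2): maximal when the
    velocity ydot drives y toward y_des at rate k*(y_des - y); gain k and
    steepness h are tuning parameters."""
    return np.exp(-0.5 * h * (k * (y_des - y) - ydot) ** 2)

g_good = redundancy_reward(y=0.0, ydot=1.0, y_des=1.0)   # moving toward y_des
g_bad = redundancy_reward(y=0.0, ydot=-1.0, y_des=1.0)   # moving away
```

Each data point in the inverse regression is then weighted by its forward-model activation multiplied by this reward, so solutions that push y toward y_des dominate the fit.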
We invert each forward local model using a weighted linear regression, where each data point is weighted by the weight from the forward model and additionally weighted by the reward.\n\nWe test the performance of this inverse model (Learnt IK) in a figure-eight tracking task as shown in Fig. 5(b). As seen, the learnt model performs as well as the analytical inverse kinematics solution (IK), with root mean squared tracking errors in positions and velocities very close to that of the analytical solution. \n\nTable 1: Average normalized mean squared prediction error values for a stationary GP model, our augmented nonstationary GP, and local kernel shaping, averaged over 20 random data sets.\n\nMethod | Function i) | Function ii)\nStationary GP | 0.1251 ± 0.013 | 0.0230 ± 0.0047\nAugmented nonstationary GP | 0.0110 ± 0.0078 | 0.0212 ± 0.0067\nLocal Kernel Shaping | 0.0092 ± 0.0068 | 0.0217 ± 0.0058\n\n[Figure 4 shows three panels of the motorcycle data, acceleration (g) versus time (ms): (a) the alternate infinite mixture of GP experts (AiMoGPE) versus a single GP, (b) our augmented nonstationary GP versus a stationary GP, (c) local kernel shaping versus a stationary GP.]\n\nFigure 4: Motorcycle impact data set from [20], with predicted results shown for our augmented GP and local kernel shaping algorithms. Results from the alternate infinite mixture of GP experts (AiMoGPE) are taken from [9].\n\n
This demonstrates that kernel shaping is an effective learning algorithm for use in robot control learning applications.\n\nApplying any arbitrary nonlinear regression method (such as a GP) to the inverse kinematics problem would, in fact, lead to unpredictably bad performance. The inverse kinematics problem is a one-to-many mapping and requires careful design of a learning problem to avoid problems with non-convex solution spaces [22]. Our suggested method of learning linearizations with a forward mapping (which is a proper function), followed by learning an inverse mapping within the local region of the forward mapping, is one of the few clean approaches to the problem. Instead of using locally linear methods, one could also use density-based estimation techniques like mixture models [23]. However, these methods must select the correct mode in order to arrive at a valid solution, and this final step may be computationally intensive or involve heuristics. For these reasons, applying an MCMC-type approach or a GP-based method to the inverse kinematics problem was omitted as a comparison.\n\n5 Discussion\n\nWe presented a full Bayesian treatment of nonparametric local multi-dimensional kernel adaptation that simultaneously estimates the regression and kernel parameters. The algorithm can also be integrated into nonlinear algorithms, offering a valuable and flexible tool for learning. 
We show that our local kernel shaping method is particularly useful for learning control, demonstrating results on an inverse kinematics problem, and envision extensions to more complex problems with redundancy, e.g., learning inverse dynamics models of complete humanoid robots.\n\n[Figure 5: panel (a) shows the robot arm; panel (b) shows desired versus actual x-z trajectories for the analytical IK and the learnt IK.]\n\nFigure 5: Desired versus actual trajectories for SensAble Phantom robot arm\n\nNote that our algorithm requires only one prior to be set by the user, i.e., the prior on the output noise. All other priors are initialized the same for all data sets and kept uninformative. In its current form, our Bayesian kernel shaping algorithm is built for high-dimensional inputs due to its low computational complexity—it scales linearly with the number of input dimensions. However, numerical problems may arise in case of redundant and irrelevant input dimensions. Future work will address this issue through the use of an automatic relevance determination feature. Other future extensions include an online implementation of the local kernel shaping algorithm.\n\nReferences\n\n[1] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8. MIT Press, 1995.\n\n[2] J. H. Friedman. A variable span smoother. Technical report, Stanford University, 1984.\n\n[3] T. Poggio and F. Girosi. Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247:213–225, 1990.\n\n[4] J. Fan and I. Gijbels. Local polynomial modeling and its applications. Chapman and Hall, 1996.\n\n[5] C. 
J. Paciorek and M. J. Schervish. Nonstationary covariance functions for Gaussian process regression. In Advances in Neural Information Processing Systems 16. MIT Press, 2004.\n\n[6] J. Fan and I. Gijbels. Data-driven bandwidth selection in local polynomial fitting: Variable bandwidth and spatial adaptation. Journal of the Royal Statistical Society B, 57:371–395, 1995.\n\n[7] S. Schaal and C. G. Atkeson. Assessing the quality of learned local models. In G. Tesauro, J. Cowan, and J. Alspector, editors, Advances in Neural Information Processing Systems, pages 160–167. Morgan Kaufmann, 1994.\n\n[8] C. E. Rasmussen and Z. Ghahramani. Infinite mixtures of Gaussian processes. In Advances in Neural Information Processing Systems 14. MIT Press, 2002.\n\n[9] E. Meeds and S. Osindero. An alternative infinite mixture of Gaussian process experts. In Advances in Neural Information Processing Systems 17. MIT Press, 2005.\n\n[10] C. Atkeson and S. Schaal. Robot learning from demonstration. In Proceedings of the 14th International Conference on Machine Learning, pages 12–20. Morgan Kaufmann, 1997.\n\n[11] C. Atkeson, A. Moore, and S. Schaal. Locally weighted learning. AI Review, 11:11–73, April 1997.\n\n[12] A. D'Souza, S. Vijayakumar, and S. Schaal. The Bayesian backfitting relevance vector machine. In Proceedings of the 21st International Conference on Machine Learning. ACM Press, 2004.\n\n[13] A. Gelman, J. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman and Hall, 2000.\n\n[14] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.\n\n[15] Z. Ghahramani and M. J. Beal. Graphical models and variational methods. In D. Saad and M. Opper, editors, Advanced Mean Field Methods - Theory and Practice. MIT Press, 2000.\n\n[16] T. S. Jaakkola and M. I. Jordan. Bayesian parameter estimation via variational methods. Statistics and Computing, 10:25–37, 2000.\n\n[17] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18. MIT Press, 2006.\n\n[18] V. Tresp. Mixtures of Gaussian processes. In Advances in Neural Information Processing Systems 13. MIT Press, 2000.\n\n[19] A. M. Schmidt and A. O'Hagan. Bayesian inference for nonstationary spatial covariance structure via spatial deformations. Journal of the Royal Statistical Society, Series B, 65:745–758, 2003.\n\n[20] B. W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society, Series B, 47:1–52, 1985.\n\n[21] J. Peters and S. Schaal. Learning to control in operational space. International Journal of Robotics Research, 27:197–212, 2008.\n\n[22] M. I. Jordan and D. E. Rumelhart. Internal world models and supervised learning. In Machine Learning: Proceedings of the Eighth International Workshop, pages 70–85. Morgan Kaufmann, 1991.\n\n[23] Z. Ghahramani. Solving inverse problems using an EM approach to density estimation. In Proceedings of the 1993 Connectionist Models Summer School, pages 316–323. Erlbaum Associates, 1994.\n\n\f", "award": [], "sourceid": 178, "authors": [{"given_name": "Jo-anne", "family_name": "Ting", "institution": null}, {"given_name": "Mrinal", "family_name": "Kalakrishnan", "institution": null}, {"given_name": "Sethu", "family_name": "Vijayakumar", "institution": null}, {"given_name": "Stefan", "family_name": "Schaal", "institution": null}]}