{"title": "Sequential Bayesian Kernel Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 113, "page_last": 120, "abstract": "", "full_text": "Sequential Bayesian Kernel Regression\n\nJaco Vermaak, Simon J. Godsill, Arnaud Doucet\n\nCambridge University Engineering Department\n{jv211, sjg, ad2}@eng.cam.ac.uk\n\nCambridge, CB2 1PZ, U.K.\n\nAbstract\n\nWe propose a method for sequential Bayesian kernel regression. As is\nthe case for the popular Relevance Vector Machine (RVM) [10, 11], the\nmethod automatically identi\ufb01es the number and locations of the kernels.\nOur algorithm overcomes some of the computational dif\ufb01culties related\nto batch methods for kernel regression. It is non-iterative, and requires\nonly a single pass over the data. It is thus applicable to truly sequen-\ntial data sets and batch data sets alike. The algorithm is based on a\ngeneralisation of Importance Sampling, which allows the design of in-\ntuitively simple and ef\ufb01cient proposal distributions for the model param-\neters. Comparative results on two standard data sets show our algorithm\nto compare favourably with existing batch estimation strategies.\n\n1 Introduction\n\nBayesian kernel methods, including the popular Relevance Vector Machine (RVM) [10,\n11], have proved to be effective tools for regression and classi\ufb01cation. For the RVM the\nsparsity constraints are elegantly formulated within a Bayesian framework, and the result of\nthe estimation is a mixture of kernel functions that rely on only a small fraction of the data\npoints. In this sense it bears resemblance to the popular Support Vector Machine (SVM)\n[13]. Contrary to the SVM, where the support vectors lie on the decision boundaries, the\nrelevance vectors are prototypical of the data. 
Furthermore, the RVM does not require any constraints on the types of kernel functions, and provides a probabilistic output rather than a hard decision.

Standard batch methods for kernel regression suffer from a computational drawback in that they are iterative in nature, with a computational complexity that is normally cubic in the number of data points at each iteration. A large proportion of the research effort in this area is devoted to the development of estimation algorithms with reduced computational complexity. For the RVM, for example, a strategy is proposed in [12] that exploits the structure of the marginal likelihood function to significantly reduce the number of computations.

In this paper we propose a full Bayesian formulation for kernel regression on sequential data. Our algorithm is non-iterative, and requires only a single pass over the data. It is equally applicable to batch data sets by presenting the data points one at a time, with the order of presentation being unimportant. The algorithm is especially effective for large data sets. As opposed to batch strategies that attempt to find the optimal solution conditional on all the data, the sequential strategy includes the data one at a time, so that the posterior exhibits a tempering effect as the amount of data increases. Thus, the difficult global estimation problem is effectively decomposed into a series of easier estimation problems.

The algorithm itself is based on a generalisation of Importance Sampling, and recursively updates a sample based approximation of the posterior distribution as more data points become available. The proposal distribution is defined on an augmented parameter space, and is formulated in terms of model moves, reminiscent of the Reversible Jump Markov Chain Monte Carlo (RJ-MCMC) algorithm [5]. 
For kernel regression these moves may include update moves to refine the kernel locations, birth moves to add new kernels to better explain the increasing data, and death moves to eliminate erroneous or redundant kernels.

The remainder of the paper is organised as follows. In Section 2 we outline the details of the model for sequential Bayesian kernel regression. In Section 3 we present the sequential estimation algorithm. Although we focus on regression, the method extends straightforwardly to classification. It can, in fact, be applied to any model for which the posterior can be evaluated up to a normalising constant. We illustrate the performance of the algorithm on two standard regression data sets in Section 4, before concluding with some remarks in Section 5.

2 Model Description

The data is assumed to arrive sequentially as input-output pairs (x_t, y_t), t = 1, 2, ..., with x_t ∈ R^d and y_t ∈ R. For kernel regression the output is assumed to follow the model

y_t = β_0 + Σ_{i=1}^k β_i K(x_t, μ_i) + v_t,  v_t ~ N(0, σ_y²),

where k is the number of kernel functions, which we will consider to be unknown, β_k = (β_0 ... β_k) are the regression coefficients, U_k = (μ_1 ... μ_k) are the kernel centres, and σ_y² is the variance of the Gaussian observation noise. Assuming independence, the likelihood for all the data points observed up to time t, denoted by Y_t = (y_1 ... y_t), can be written as

p(Y_t | k, β_k, U_k, σ_y²) = N(Y_t | K_k β_k, σ_y² I_t),  (1)

where K_k denotes the t × (k + 1) kernel matrix with [K_k]_{s,1} = 1 and [K_k]_{s,l} = K(x_s, μ_{l−1}) for l > 1, and I_n denotes the n-dimensional identity matrix. 
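The design matrix and the Gaussian likelihood in (1) are straightforward to compute directly. The following is a minimal NumPy sketch, assuming a Gaussian kernel K(x, μ) = exp(−||x − μ||² / 2w²) with the width w = 1.6 used in the sinc experiment of Section 4; the function names are illustrative, not from the paper.

```python
import numpy as np

def kernel_matrix(X, centres, width=1.6):
    """t x (k+1) design matrix K_k from (1): a leading column of ones,
    then one Gaussian kernel column per centre."""
    X = np.atleast_2d(X).reshape(len(X), -1)
    C = np.atleast_2d(centres).reshape(len(centres), -1)
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)   # squared distances
    return np.hstack([np.ones((len(X), 1)), np.exp(-d2 / (2.0 * width**2))])

def log_likelihood(Y, K, beta, sigma2_y):
    """Gaussian log-likelihood log N(Y | K beta, sigma2_y I) from (1)."""
    e = Y - K @ beta
    t = len(Y)
    return -0.5 * (t * np.log(2.0 * np.pi * sigma2_y) + e @ e / sigma2_y)
```

Evaluating the likelihood at the true coefficients with zero residual returns just the Gaussian normalising term, which gives a quick sanity check of the implementation.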
For the unknown model parameters θ_k = (β_k, U_k, σ_y², σ_β²) we assume a hierarchical prior that takes the form

p(k, θ_k) = p(k) p(β_k, σ_β²) p(U_k) p(σ_y²),  (2)

with

p(k) ∝ λ^k exp(−λ)/k!,  k ∈ {1 ... k_max}
p(β_k, σ_β²) = N(β_k | 0, σ_β² I_{k+1}) IG(σ_β² | a_β, b_β)
p(U_k) = Π_{l=1}^k Σ_{s=1}^t δ_{x_s}(μ_l)/t
p(σ_y²) = IG(σ_y² | a_y, b_y),

where δ_x(·) denotes the Dirac delta function with mass at x, and IG(· | a, b) denotes the Inverted Gamma distribution with parameters a and b. The prior on the number of kernels is set to be a truncated Poisson distribution, with the mean λ and the maximum number of kernels k_max assumed to be fixed and known. The regression coefficients are drawn from an isotropic Gaussian prior with variance σ_β² in each direction. This variance is, in turn, drawn from an Inverted Gamma prior. This is in contrast with the Automatic Relevance Determination (ARD) prior [8], where each coefficient has its own associated variance. The prior for the kernel centres is assumed to be uniform over the grid formed by the input data points available at the current time step. Note that the support for this prior increases with time. Finally, the noise variance is assumed to follow an Inverted Gamma prior. The parameters of the Inverted Gamma priors are assumed to be fixed and known.

Given the likelihood and prior in (1) and (2), respectively, it is straightforward to obtain an expression for the full posterior distribution p(k, θ_k | Y_t). 
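As an illustration, the hierarchical prior in (2) can be sampled ancestrally. The sketch below is hedged: `sample_prior` is a hypothetical helper, the truncated Poisson is drawn by simple rejection, Inverted Gamma draws use the identity 1/Z ~ IG(a, b) when Z ~ Gamma(shape a, scale 1/b), and proper illustrative shape/scale values are substituted for the improper Jeffreys' limit a = b = 0 used in the experiments. Centres are drawn without replacement, matching the distinct-centre convention of the birth and death moves rather than the literal product form of p(U_k).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prior(X_seen, lam=1.0, k_max=50, a_y=2.0, b_y=1.0, a_b=2.0, b_b=1.0):
    """Ancestral draw of (k, beta_k, U_k, sigma2_y, sigma2_beta) from (2).
    The IG parameters here are illustrative stand-ins (the paper uses the
    improper limit a = b = 0)."""
    # truncated Poisson over {1, ..., k_max}, by rejection
    k = 0
    while not (1 <= k <= k_max):
        k = int(rng.poisson(lam))
    # IG(a, b): if Z ~ Gamma(shape=a, scale=1/b) then 1/Z ~ IG(a, b)
    sigma2_beta = 1.0 / rng.gamma(a_b, 1.0 / b_b)
    sigma2_y = 1.0 / rng.gamma(a_y, 1.0 / b_y)
    # isotropic Gaussian on the k+1 coefficients, including beta_0
    beta = rng.normal(0.0, np.sqrt(sigma2_beta), size=k + 1)
    # centres uniform over observed inputs, without replacement (assumption)
    centres = rng.choice(X_seen, size=k, replace=False)
    return k, beta, centres, sigma2_y, sigma2_beta
```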
Due to conjugacy this expression can be marginalised over the regression coefficients, so that the marginal posterior for the kernel centres can be written as

p(k, U_k | σ_y², σ_β², Y_t) ∝ |B_k|^{1/2} exp(−Y_tᵀ P_k Y_t / 2σ_y²) p(k) p(U_k) / [(2πσ_y²)^{t/2} (σ_β²)^{(k+1)/2}],  (3)

with B_k = (K_kᵀ K_k / σ_y² + I_{k+1} / σ_β²)^{−1} and P_k = I_t − K_k B_k K_kᵀ / σ_y². It will be our objective to approximate this distribution recursively in time as more data becomes available, using Monte Carlo techniques. Once we have samples for the kernel centres, we will require new samples for the unknown parameters (σ_y², σ_β²) at the next time step. We can obtain these by first sampling for the regression coefficients from the posterior

p(β_k | k, U_k, σ_y², σ_β², Y_t) = N(β_k | β̂_k, B_k),  (4)

with β̂_k = B_k K_kᵀ Y_t / σ_y², and, conditional on these values, sampling for the unknown parameters from the posteriors

p(σ_y² | k, β_k, U_k, Y_t) = IG(σ_y² | a_y + t/2, b_y + e_tᵀ e_t / 2)
p(σ_β² | k, β_k) = IG(σ_β² | a_β + (k + 1)/2, b_β + β_kᵀ β_k / 2),  (5)

with e_t = Y_t − K_k β_k the model approximation error.

Since the number of kernel functions to use is unknown, the marginal posterior in (3) is defined over a discrete space of variable dimension. In the next section we will present a generalised importance sampling strategy to obtain Monte Carlo approximations for distributions of this nature recursively as more data becomes available.

3 Sequential Estimation

Recall that it is our objective to recursively update a Monte Carlo representation of the posterior distribution for the kernel regression parameters as more data becomes available. The method we propose here is based on a generalisation of the popular importance sampling technique. 
Its application extends to any model for which the posterior can be evaluated up to a normalising constant. We will thus first present the general strategy, before outlining the details for sequential kernel regression.

3.1 Generalised Importance Sampling

Our aim is to recursively update a sample based approximation of the posterior p(k, θ_k | Y_t) of a model parameterised by θ_k as more data becomes available. The efficiency of importance sampling hinges on the ability to design a good proposal distribution, i.e. one that approximates the target distribution sufficiently well. Designing an efficient proposal distribution to generate samples directly in the target parameter space is difficult. This is mostly due to the fact that the dimension of the parameter space is generally high and variable. To circumvent these problems we augment the target parameter space with an auxiliary parameter space, which we will associate with the parameters at the previous time step. We now define the target distribution over the resulting joint space as

π_t(k, θ_k; k′, θ′_{k′}) = p(k, θ_k | Y_t) q′_t(k′, θ′_{k′} | k, θ_k).  (6)

This joint clearly admits the desired target distribution as a marginal. Apart from some weak assumptions, which we will discuss shortly, the distribution q′_t is entirely arbitrary, and may depend on the data and the time step. In fact, in the application to the RVM we consider here we will set it to q′_t(k′, θ′_{k′} | k, θ_k) = δ_{(k,θ_k)}(k′, θ′_{k′}), so that it effectively disappears from the expression above. A similar strategy of augmenting the space to simplify the importance sampling procedure has been exploited before in [7] to develop efficient Sequential Monte Carlo (SMC) samplers for a wide range of models. To generate samples in this joint space we define the proposal for importance sampling to be of the form

Q_t(k, θ_k; k′, θ′_{k′}) = p(k′, θ′_{k′} | Y_{t−1}) q_t(k, θ_k | k′, θ′_{k′}),  (7)

where q_t may again depend on the data and the time step. This proposal embodies the sequential character of our algorithm. Similar to SMC methods [3] it generates samples for the parameters at the current time step by incrementally refining the posterior at the previous time step through the distribution q_t. Designing efficient incremental proposals is much easier than constructing proposals that generate samples directly in the target parameter space, since the posterior is unlikely to undergo dramatic changes over consecutive time steps. To compensate for the discrepancy between the proposal in (7) and the joint posterior in (6) the importance weight takes the form

W_t(k, θ_k; k′, θ′_{k′}) = p(k, θ_k | Y_t) q′_t(k′, θ′_{k′} | k, θ_k) / [p(k′, θ′_{k′} | Y_{t−1}) q_t(k, θ_k | k′, θ′_{k′})].  (8)

Due to the construction of the joint in (6), marginal samples in the target parameter space associated with this weighting will indeed be distributed according to the target posterior p(k, θ_k | Y_t). As might be expected, the importance weight in (8) is similar in form to the acceptance ratio for the RJ-MCMC algorithm [5]. One notable difference is that the reversibility condition is not required, so that for a given q_t, q′_t may be arbitrary, as long as the ratio in (8) is well-defined.

In practice it is often necessary to design a number of candidate moves to obtain an efficient algorithm. 
Examples include update moves to refine the model parameters in the light of the new data, birth moves to add new parameters to better explain the new data, death moves to remove redundant or erroneous parameters, and many more. We will denote the set of candidate moves at time t by {α_{t,i}, q_{t,i}, q′_{t,i}}_{i=1}^M, where α_{t,i} is the probability of choosing move i, with Σ_{i=1}^M α_{t,i} = 1. For each move i the importance weight is computed by substituting the corresponding q_{t,i} and q′_{t,i} into (8). Note that the probability of choosing a particular move may depend on the old state and the time step, so that moves may be included or excluded as is appropriate.

3.2 Sequential Kernel Regression

We will now present the details for sequential kernel regression. Our main concern will be the recursive estimation of the marginal posterior for the kernel centres in (3). This distribution is conditional on the parameters (σ_y², σ_β²), for which samples can be obtained at each time step from the corresponding posteriors in (4) and (5).

To sample for the new kernel centres we will consider three kinds of moves: a zero move q_{t,1}, a birth move q_{t,2}, and a death move q_{t,3}. The zero move leaves the kernel centres unchanged. The birth move adds a new kernel at a uniformly randomly chosen location over the grid of unoccupied input data points. The death move removes a uniformly randomly chosen kernel. For k = 0 only the birth move is possible, whereas the birth move is impossible for k = k_max or k = t. Similar to [5] we set the move probabilities to

α_{t,2} = c min{1, p(k + 1)/p(k)}
α_{t,3} = c min{1, p(k − 1)/p(k)}
α_{t,1} = 1 − α_{t,2} − α_{t,3}

in all other cases. In the above, c ∈ (0, 1) is a parameter that tunes the relative frequency of the dimension changing moves to the zero move. 
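For a Poisson prior the ratios reduce to p(k + 1)/p(k) = λ/(k + 1) and p(k − 1)/p(k) = k/λ, so the move probabilities can be computed in a few lines. A sketch, with a hypothetical helper name; the edge cases follow the text above, and c ≤ 1/2 is assumed so that the zero-move probability stays non-negative:

```python
def move_probabilities(k, t, lam=1.0, k_max=50, c=0.25):
    """Probabilities (zero, birth, death) for the move choice at time t.
    Poisson prior ratios: p(k+1)/p(k) = lam/(k+1), p(k-1)/p(k) = k/lam.
    Edge cases: only birth for k = 0; no birth for k = k_max or k = t."""
    if k == 0:
        return (0.0, 1.0, 0.0)
    birth = c * min(1.0, lam / (k + 1))
    death = c * min(1.0, k / lam)
    if k >= k_max or k >= t:
        birth = 0.0
    zero = 1.0 - birth - death
    return (zero, birth, death)
```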
For these choices the importance weight in (8) becomes

W_{t,i}(k, U_k; k′, U′_{k′}) ∝ [|B_k|^{1/2} exp(−(Y_tᵀ P_k Y_t − Y_{t−1}ᵀ P′_{k′} Y_{t−1}) / 2σ_y²) / (|B′_{k′}|^{1/2} (2πσ_y²)^{1/2} (σ_β²)^{(k−k′)/2})] × [λ^{k−k′} (t − 1)^{k′} k′! / (t^k k! q_{t,i}(k, U_k | k′, U′_{k′}))],

where the primed variables are those corresponding to the posterior at time t − 1. For the zero move the parameters are left unchanged, so that the expression for q_{t,1} in the importance weight becomes unity. This is often a good move to choose, and captures the notion that the posterior rarely changes dramatically over consecutive time steps. For the birth move one new kernel is added, so that k = k′ + 1. The centre for this kernel is uniformly randomly chosen from the grid of unoccupied input data points. This means that the expression for q_{t,2} in the importance weight reduces to 1/(t − k′), since there are t − k′ such data points. Similarly, the death move removes a uniformly randomly chosen kernel, so that k = k′ − 1. In this case the expression for q_{t,3} in the importance weight reduces to 1/k′. It is straightforward to design numerous other moves, e.g. an update move that perturbs existing kernel centres. 
However, we found that the simple moves presented yield satisfactory results while keeping the computational complexity acceptable.

We conclude this section with a summary of the algorithm.

Algorithm 1: Sequential Kernel Regression

Inputs:
• Kernel function K(·,·), model parameters (λ, k_max, a_y, b_y, a_β, b_β), fraction of dimension change moves c, number of samples to approximate the posterior N.

Initialisation: t = 0
• For i = 1 ... N, set k^(i) = 0, β_k^(i) = ∅, U_k^(i) = ∅, and sample σ_y^2(i) ~ p(σ_y²), σ_β^2(i) ~ p(σ_β²).

Generalised Importance Sampling Step: t > 0
• For i = 1 ... N,
  – Sample a move j^(i) so that P(j^(i) = l) = α_{t,l}.
  – If j^(i) = 1 (zero move), set Ũ^(i) = U^(i) and k̃^(i) = k^(i).
    Else if j^(i) = 2 (birth move), form Ũ^(i) by uniformly randomly adding a kernel at one of the unoccupied data points, and set k̃^(i) = k^(i) + 1.
    Else if j^(i) = 3 (death move), form Ũ^(i) by uniformly randomly deleting one of the existing kernels, and set k̃^(i) = k^(i) − 1.
• For i = 1 ... N, compute the importance weights W_t^(i) ∝ W_t(k̃^(i), Ũ^(i); k^(i), U^(i)), and normalise.
• For i = 1 ... N, sample the nuisance parameters β̃^(i) ~ p(β_k | k̃^(i), Ũ^(i), σ_y^2(i), σ_β^2(i), Y_t), σ̃_y^2(i) ~ p(σ_y² | k̃^(i), β̃^(i), Ũ^(i), Y_t), σ̃_β^2(i) ~ p(σ_β² | k̃^(i), β̃^(i)).

Resampling Step: t > 0
• Multiply / discard samples {k̃^(i), θ̃^(i)} with respect to high / low importance weights {W_t^(i)} to obtain N samples {k^(i), θ^(i)}.

Each of the samples is initialised to be empty, i.e. no kernels are included. Initial values for the variance parameters are sampled from their corresponding prior distributions. Using the samples before resampling, a Minimum Mean Square Error (MMSE) estimate of the clean data can be obtained as

Ẑ_t = Σ_{i=1}^N W_t^(i) K̃_k^(i) β̃_k^(i).

The resampling step is required to avoid degeneracy of the sample based representation. It can be performed by standard procedures such as multinomial resampling [4], stratified resampling [6], or minimum entropy resampling [2]. All these schemes are unbiased, so that the number of times N_i the sample (k̃^(i), θ̃^(i)) appears after resampling satisfies E(N_i) = N W_t^(i). Thus, resampling essentially multiplies samples with high importance weights, and discards those with low importance weights.

The algorithm requires only a single pass through the data. The computational complexity at each time step is O(N). For each sample the computations are dominated by the computation of the matrix B_k, which requires a (k + 1)-dimensional matrix inverse. However, this inverse can be incrementally updated from the inverse at the previous time step using the techniques described in [12], leading to substantial computational savings.

4 Experiments and Results

In this section we illustrate the performance of the proposed sequential estimation algorithm on two standard regression data sets.

4.1 Sinc Data

This experiment is described in [1]. The training data is taken to be the sinc function, i.e. sinc(x) = sin(x)/x, corrupted by additive Gaussian noise of standard deviation σ_y = 0.1, for 50 evenly spaced points in the interval x ∈ [−10, 10]. In all the runs we presented these points to the sequential estimation algorithm in random order. 
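This training set is easy to reproduce. A sketch in NumPy, noting that np.sinc uses the normalised convention sinc(z) = sin(πz)/(πz), so sin(x)/x is obtained as np.sinc(x/π):

```python
import numpy as np

rng = np.random.default_rng(42)

# 50 evenly spaced training inputs on [-10, 10]; targets are sinc(x)
# corrupted by Gaussian noise with standard deviation 0.1, as in [1]
x_train = np.linspace(-10.0, 10.0, 50)
clean = np.sinc(x_train / np.pi)          # sin(x)/x under NumPy's convention
y_train = clean + rng.normal(0.0, 0.1, size=x_train.shape)

# present the points to the sequential algorithm in random order
order = rng.permutation(len(x_train))
stream = list(zip(x_train[order], y_train[order]))
```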
For the test data we used 1000 points over the same interval. We used a Gaussian kernel of width 1.6, and set the fixed parameters of the model to (λ, k_max, a_y, b_y, a_β, b_β) = (1, 50, 0, 0, 0, 0). For these settings the prior on the variances reduces to the uninformative Jeffreys' prior. The fraction of dimension change moves was set to c = 0.25. It should be noted that the algorithm proved to be relatively insensitive to reasonable variations in the values of the fixed parameters.

The left side of Figure 1 shows the test error as a function of the number of samples N. These results were obtained by averaging over 25 random generations of the training data for each value of N. As expected, the error decreases with an increase in the number of samples. No significant decrease is obtained beyond N = 250, and we adopt this value for subsequent comparisons. A typical MMSE estimate of the clean data is shown on the right side of Figure 1.

In Table 1 we compare the results of the proposed sequential estimation algorithm with a number of batch strategies for the SVM and RVM. The results for the batch algorithms are duplicated from [1, 9]. The error for the sequential algorithm is slightly higher. This is due to the stochastic nature of the algorithm, and the fact that it uses only very simple moves that take no account of the characteristics of the data during the move proposition. This increase should be offset against the algorithm's simplicity and efficiency. The error could be further decreased by designing more complex moves.

Figure 1: Results for the sinc experiment. Test error as a function of the number of samples (left), and example fit (right), showing the uncorrupted data (blue circles), noisy data (red crosses) and MMSE estimate (green squares). For this example the test error was 0.0309 and an average of 6.18 kernels were used.

Method          Test Error  # Kernels  Noise Estimate
Figueiredo      0.0455      7.0        −
SVM             0.0519      28.0       −
RVM             0.0494      6.9        0.0943
VRVM            0.0494      7.4        0.0950
MCMC            0.0468      6.5        −
Sequential RVM  0.0591      4.5        0.1136

Table 1: Comparative performance results for the sinc data. The batch results are reproduced from [1, 9].

4.2 Boston Housing Data

We also applied our algorithm to the popular Boston housing data set. We considered random train / test partitions of the data of size 300 / 206. We again used a Gaussian kernel, and set the width parameter to 5. For the model and algorithm parameters we used values similar to those for the sinc experiment, except for setting λ = 5 to allow a larger number of kernels. The results are summarised in Table 2. These were obtained by averaging over 10 random partitions of the data, and setting the number of samples to N = 250. The test error is comparable to those for the batch strategies, but far fewer kernels are required.

Method          Test Error  # Kernels
SVM             8.04        142.8
RVM             7.46        39.0
Sequential RVM  7.18        25.29

Table 2: Comparative performance results for the Boston housing data. The batch results are reproduced from [10].

5 Conclusions

In this paper we proposed a sequential estimation strategy for Bayesian kernel regression. Our algorithm is based on a generalisation of importance sampling, and incrementally updates a Monte Carlo representation of the target posterior distribution as more data points become available. It achieves this through simple and intuitive model moves, reminiscent of the RJ-MCMC algorithm. 
It is further non-iterative, and requires only a single pass over the data, thus overcoming some of the computational difficulties associated with batch estimation strategies for kernel regression. Our algorithm is more general than the kernel regression problem considered here. Its application extends to any model for which the posterior can be evaluated up to a normalising constant. Initial experiments on two standard regression data sets showed our algorithm to compare favourably with existing batch estimation strategies for kernel regression.

Acknowledgements

The authors would like to thank Mike Tipping for helpful comments during the experimental procedure. The work of Vermaak and Godsill was partially funded by QinetiQ under the project 'Extended and Joint Object Tracking and Identification', CU006-14890.

References

[1] C. M. Bishop and M. E. Tipping. Variational relevance vector machines. In C. Boutilier and M. Goldszmidt, editors, Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pages 46–53. Morgan Kaufmann, 2000.

[2] D. Crisan. Particle filters – a theoretical perspective. In A. Doucet, J. F. G. de Freitas, and N. J. Gordon, editors, Sequential Monte Carlo Methods in Practice, pages 17–38. Springer-Verlag, 2001.

[3] A. Doucet, J. F. G. de Freitas, and N. J. Gordon, editors. Sequential Monte Carlo Methods in Practice. Springer-Verlag, New York, 2001.

[4] N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings-F, 140(2):107–113, 1993.

[5] P. J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711–732, 1995.

[6] G. Kitagawa. 
Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 5(1):1–25, 1996.

[7] P. Del Moral and A. Doucet. Sequential Monte Carlo samplers. Technical Report CUED/F-INFENG/TR.443, Signal Processing Group, Cambridge University Engineering Department, 2002.

[8] R. M. Neal. Assessing relevance determination methods using DELVE. In C. M. Bishop, editor, Neural Networks and Machine Learning, pages 97–129. Springer-Verlag, 1998.

[9] S. S. Tham, A. Doucet, and R. Kotagiri. Sparse Bayesian learning for regression and classification using Markov chain Monte Carlo. In Proceedings of the International Conference on Machine Learning, pages 634–643, 2002.

[10] M. E. Tipping. The relevance vector machine. In S. A. Solla, T. K. Leen, and K. R. Müller, editors, Advances in Neural Information Processing Systems, volume 12, pages 652–658. MIT Press, 2000.

[11] M. E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211–244, 2001.

[12] M. E. Tipping and A. C. Faul. Fast marginal likelihood maximisation for sparse Bayesian models. In C. M. Bishop and B. J. Frey, editors, Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, 2003.

[13] V. N. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998.
", "award": [], "sourceid": 2362, "authors": [{"given_name": "Jaco", "family_name": "Vermaak", "institution": null}, {"given_name": "Simon", "family_name": "Godsill", "institution": null}, {"given_name": "Arnaud", "family_name": "Doucet", "institution": null}]}