{"title": "Approximate Analytical Bootstrap Averages for Support Vector Classifiers", "book": "Advances in Neural Information Processing Systems", "page_first": 1189, "page_last": 1196, "abstract": "", "full_text": "Approximate Analytical Bootstrap Averages for\n\nSupport Vector Classi\ufb01ers\n\nD\u00a8orthe Malzahn1;2\n\nManfred Opper3\n\n1 Informatics and Mathematical Modelling, Technical University of Denmark,\n\nR.-Petersens-Plads, Building 321, Lyngby DK-2800, Denmark\n2 Institute of Mathematical Stochastics, University of Karlsruhe,\n\nEnglerstr. 2, Karlsruhe D-76131, Germany\n\n3 Neural Computing Research Group, School of Engineering and Applied Science,\n\nAston University, Birmingham B4 7ET, United Kingdom\n\nmalzahnd@isp.imm.dtu.dk\n\nopperm@aston.ac.uk\n\nAbstract\n\nWe compute approximate analytical bootstrap averages for support vec-\ntor classi\ufb01cation using a combination of the replica method of statistical\nphysics and the TAP approach for approximate inference. We test our\nmethod on a few datasets and compare it with exact averages obtained\nby extensive Monte-Carlo sampling.\n\n1\n\nIntroduction\n\nThe bootstrap method [1, 2] is a widely applicable approach to assess the expected qualities\nof statistical estimators and predictors. Say, for example, in a supervised learning problem,\nwe are interested in measuring the expected error of our favorite prediction method on test\npoints 1 which are not contained in the training set D0. If we have no hold out data, we\ncan use the bootstrap approach to create arti\ufb01cial bootstrap data sets D by resampling with\nreplacement training data from the original set D0. Each data point is taken with equal\nprobability, i.e., some of the examples will appear several times in the bootstrap sample\nand others not at all. 
A proxy for the true average test error can be obtained by retraining the model on each bootstrap training set $D$, calculating the test error only on those points which are not contained in $D$, and finally averaging over all possible sets $D$.

^1 The average is over the unknown distribution of training data sets.

While bootstrap averages can in general be approximated to any desired accuracy by the Monte-Carlo method, by generating a large enough number of random samples, it is useful to also have analytical approximations which avoid the time-consuming retraining of the model for each new sample. Existing analytical approximations (based on asymptotic techniques), such as the delta method and the saddle point method, usually require explicit analytical formulas for the parameter estimators of a trained model (see e.g. [3]). These may not be easily obtained for more complex models in machine learning such as support vector machines (SVMs). Recently, we introduced a novel approach to the approximate calculation of bootstrap averages [4] which avoids explicit formulas for parameter estimates. Instead, we define statistical estimators and predictors implicitly as expectations with suitably defined pseudo-posterior Gibbs distributions over model parameters. Within this formulation, it becomes possible to perform the averages over bootstrap samples analytically using the so-called "replica trick" of statistical physics [5]. The latter involves a specific analytic continuation of the original statistical model. After the average, we are left with a typically intractable inference problem for an effective Bayesian probabilistic model. As a final step, we use techniques for approximate inference to treat the probabilistic model.
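The Monte-Carlo baseline against which the analytical approach is compared can be sketched as follows; `train` and `loss` are placeholders of our own for an arbitrary learning algorithm and loss function, not an API from the paper:

```python
import numpy as np

def mc_bootstrap_error(X, y, train, loss, n_boot, rng):
    """Monte-Carlo estimate of the bootstrapped test error: retrain on each
    bootstrap set D and average the loss over the points left out of D."""
    n = len(y)
    errs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # bootstrap set D (with replacement)
        oob = np.setdiff1d(np.arange(n), idx)     # test points: occupation number s_i = 0
        if oob.size == 0:
            continue                              # rare: the sample covered every point
        model = train(X[idx], y[idx])             # the costly retraining step
        errs.append(np.mean([loss(model, X[i], y[i]) for i in oob]))
    return float(np.mean(errs))
```

Each bootstrap set requires one full retraining, which is exactly the cost the analytical approximation avoids.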
This combination of techniques allows us to obtain approximate bootstrap averages by solving a set of nonlinear equations rather than by explicit sampling.

Our method passed a first test successfully on the simple case of Gaussian process (GP) regression, where explicit predictions are still cheaply computed. Also, since the original model is a smooth probabilistic one, the success of approximate inference techniques may not be too surprising. In this paper, we address a more challenging problem, that of the support vector machine. In this case, the connection to a probabilistic model (a type of GP) can only be established by introducing a further parameter which must eventually diverge to obtain the SVM predictor. In this limit, the probabilistic model becomes highly nonregular and approaches a deterministic model. Hence it is not clear a priori whether our framework would survive these delicate limiting manipulations and still be able to give good approximate answers.

2 Hard Margin Support Vector Classifiers

The hard margin SVM is a classifier which predicts binary class labels $y = \mathrm{sign}[\hat f_{D_0}(x)] \in \{-1, 1\}$ for inputs $x \in \mathbb{R}^d$, based on a set of training points $D_0 = (z_1, z_2, \dots, z_N)$ with $z_i = (x_i, y_i)$ (for details see [6]). The usually nonlinear activation function $\hat f_{D_0}(x)$ (which we will call the "internal field") is expressed as $\hat f_{D_0}(x) = \sum_{i=1}^N y_i \alpha_i K(x, x_i)$, where $K(x, x')$ is a positive definite kernel and the set of $\alpha_i$'s is computed from $D_0$ by solving a certain convex optimization problem.

For bootstrap problems, we fix the pool of training data $D_0$ and consider the statistics of the vectors $\hat f_D = (\hat f_D(x_1), \dots, \hat f_D(x_N))$ at all inputs $x_i \in D_0$, when the predictor $\hat f$ is computed on randomly chosen subsets $D$ of $D_0$.
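For illustration, the internal field and the resulting classifier can be written down directly once the $\alpha_i$'s are known. The isotropic RBF kernel and the function names below are our own choices; in the paper the $\alpha_i$ come from the convex optimization problem just mentioned:

```python
import numpy as np

def rbf_kernel(x, xp, gamma=1.0):
    """One common positive definite kernel: K(x, x') = exp(-gamma * ||x - x'||^2)."""
    d = np.asarray(x, float) - np.asarray(xp, float)
    return np.exp(-gamma * d @ d)

def internal_field(x, X, y, alpha, kernel=rbf_kernel):
    """Internal field f(x) = sum_i y_i * alpha_i * K(x, x_i)."""
    return sum(a * yi * kernel(x, xi) for a, yi, xi in zip(alpha, y, X))

def predict(x, X, y, alpha, kernel=rbf_kernel):
    """SVM label prediction y = sign(f(x))."""
    return np.sign(internal_field(x, X, y, alpha, kernel))
```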
Unfortunately, we do not have an explicit analytical expression for $\hat f_D$; it is obtained implicitly as the vector $f = (f_1, \dots, f_N)$ which solves the constrained optimization problem

\[ \text{minimize } f^T K^{-1} f \quad \text{subject to } f_i y_i \ge 1 \text{ for all } i \text{ such that } (x_i, y_i) \in D, \]  (1)

where $K$ is the kernel matrix with elements $K(x_i, x_j)$.

3 Deriving Predictors from Gibbs Distributions

In this section, we show how to obtain the SVM predictor $\hat f_D$ formally as the expectation over a certain type of Gibbs distribution over possible $f$'s, in the form

\[ \hat f_D = \langle f \rangle = \int df\, f\, P[f|D], \]  (2)

with respect to a density $P[f|D] = \frac{1}{Z}\, \mu[f]\, P(D|f)$ which is constructed from a suitable prior distribution $\mu[f]$, a certain type of "likelihood" $P(D|f)$, and a normalizing partition function

\[ Z = \int df\, \mu[f]\, P(D|f). \]  (3)

Our general notation suggests that this principle applies to a variety of estimators and predictors of the MAP type.

To represent the SVM in this framework, we use a well-established relation between SVMs and Gaussian process (GP) models (see e.g. [7, 8]). We choose the GP prior

\[ \mu[f] = \frac{1}{\sqrt{(2\pi)^N \beta^{-N} \det(K)}} \exp\!\left(-\frac{\beta}{2}\, f^T K^{-1} f\right). \]  (4)

The pseudo-likelihood^2 is defined by

\[ P(D|f) = \prod_{j:\, z_j \in D} P(z_j|f_j) = \prod_{j:\, z_j \in D} \Theta(y_j f_j - 1), \]  (5)

where $\Theta(u) = 1$ for $u > 0$ and $0$ otherwise. In the limit $\beta \to \infty$, the measure $P[f|D] \propto \mu[f] P(D|f)$ obviously concentrates at the vector $\hat f$ which solves Eq. (1).

4 Analytical Bootstrap Averages Using the Replica Trick

With the bootstrap method, we would like to compute average properties of the estimator $\hat f_D$, Eq. (2), when the datasets $D$ are random subsamples of $D_0$.
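The implicit definition via Eq. (1) above can be made concrete numerically: it is a small convex quadratic program that, for illustration, can be handed to a general-purpose solver. This is a naive sketch with our own function name, using scipy's SLSQP; production SVM solvers work on the dual problem instead:

```python
import numpy as np
from scipy.optimize import minimize

def hard_margin_field(K, y):
    """Solve Eq. (1): minimize f^T K^{-1} f subject to y_i f_i >= 1.
    Naive illustration only; not how SVM packages actually solve it."""
    Kinv = np.linalg.inv(K)
    cons = {"type": "ineq",
            "fun": lambda f: y * f - 1.0,          # y_i f_i - 1 >= 0
            "jac": lambda f: np.diag(y)}
    res = minimize(lambda f: f @ Kinv @ f, x0=y.astype(float),
                   jac=lambda f: 2.0 * Kinv @ f,
                   constraints=[cons], method="SLSQP")
    return res.x
```

The starting point $f = y$ is always feasible, since $y_i^2 = 1$ satisfies every constraint with equality.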
An important class of such averages are generalization errors $\varepsilon$, i.e., expectations of a loss function $g(\hat f_D(x_i), x_i, y_i)$ over test points $i$, that is, over those examples which are in $D_0$ but not contained in the bootstrap training set $D$. Hence, we define

\[ \varepsilon \doteq \frac{1}{N} \sum_{i=1}^N \frac{E_D\!\left[ \delta_{s_i,0}\; g(\hat f_D(x_i), x_i, y_i) \right]}{E_D\!\left[ \delta_{s_i,0} \right]}, \]  (6)

where $E_D[\cdots]$ denotes the expectation over random bootstrap samples $D$ created from the original training set $D_0$. Each sample $D$ is represented by a vector of "occupation" numbers $s = (s_1, \dots, s_N)$, where $s_i$ is the number of times example $z_i$ appears in the set $D$ and $\sum_{i=1}^N s_i = S$. The Kronecker symbol, defined by $\delta_{s_i,0} = 1$ for $s_i = 0$ and $0$ otherwise, guarantees that only those realizations of bootstrap training sets $D$ which do not contain the test point contribute to Eq. (6). For fixed bootstrap sample size $S$, the distribution of the $s_i$'s is multinomial. It is simpler (and does not make a big difference when $S$ is sufficiently large) to work instead with a Poisson distribution for the size of the set $D$, with $S$ as the mean number of data points in the sample. Then we get the simpler, factorizing joint distribution

\[ P(s) = \prod_{i=1}^N \frac{(S/N)^{s_i}\, e^{-S/N}}{s_i!} \]  (7)

for the occupation numbers $s_i$. From Eq. (7) we get $E_D[\delta_{s_i,0}] = e^{-S/N}$.

Since we can represent general loss functions $g$ by their Taylor expansions in powers of $\hat f_D$ (or by polynomial approximations in the case of non-smooth losses), it is sufficient to consider only monomials $g(\hat f_D(x), x, y) = (\hat f_D(x))^r$ for arbitrary $r$ in the following, and to regain the general case at the end by resumming the series. Using the definition of the estimator $\hat f_D$, Eq. (2), the bootstrap expectation Eq.
(6) can be rewritten as

\[ \varepsilon(S) = \frac{1}{N} \sum_{i=1}^N \frac{E_D\!\left[ \delta_{s_i,0}\; Z^{-r} \int \prod_{a=1}^r \left\{ df^a\, \mu[f^a]\, f_i^a \right\} \prod_{j=1}^N \left( P(z_j|f_j^a) \right)^{s_j} \right]}{E_D\!\left[ \delta_{s_i,0} \right]}, \]  (8)

which involves $r$ copies^3, i.e., replicas $f^1, \dots, f^r$, of the parameter vector $f$. If the partition functions $Z$ in the numerator of Eq. (8) were raised to positive powers rather than negative ones, one could perform the bootstrap average over the distribution Eq. (7) analytically. To enable such an analytical average over the vector $s$ (which is the "quenched disorder" in the language of statistical physics), one introduces the following "trick", used extensively in the statistical physics of amorphous systems [5]. We introduce the auxiliary quantity

\[ \varepsilon_n(S) = \frac{1}{e^{-S/N}\, N} \sum_{i=1}^N E_D\!\left[ \delta_{s_i,0}\; Z^{n-r} \int \prod_{a=1}^r df^a\, \mu[f^a]\, f_i^a \left\{ \prod_{j=1}^N \left( P(z_j|f_j^a) \right)^{s_j} \right\} \right] \]  (9)

for arbitrary real $n$, which allows us to write

\[ \varepsilon(S) = \lim_{n \to 0} \varepsilon_n(S). \]  (10)

The advantage of this definition is that for integers $n \ge r$, $\varepsilon_n(S)$ can be represented in terms of $n$ replicas $f^1, f^2, \dots, f^n$ of the original variable $f$, for which an explicit average over the $s_i$'s is possible. At the end of all calculations, an analytical continuation to arbitrary real $n$ and the limit $n \to 0$ must be performed. For integer $n \ge r$, we use the definition of the partition function, Eq. (3), exchange the expectation over datasets with the expectation over the $f$'s, and use the explicit form of the distribution Eq. (7) to perform the average over bootstrap sets.

^2 It does not allow a full probabilistic interpretation [8].
^3 The superscripts should NOT be confused with powers of the variables.
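The Poisson representation Eq. (7) and the identity $E_D[\delta_{s_i,0}] = e^{-S/N}$ are easy to check numerically; a self-contained illustration with arbitrary values of $N$ and $S$:

```python
import numpy as np

# Occupation numbers under the factorizing Poisson model, Eq. (7)
rng = np.random.default_rng(1)
N, S, draws = 50, 50, 20000
s = rng.poisson(lam=S / N, size=(draws, N))

p_empty = np.mean(s == 0)                      # empirical E_D[delta_{s_i,0}]
assert abs(p_empty - np.exp(-S / N)) < 0.01    # analytical value e^{-S/N}
assert abs(s.sum(axis=1).mean() - S) < 0.5     # mean bootstrap sample size is S
```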
The resulting expressions can be rewritten as^4

\[ \varepsilon_n(S) = \frac{1}{N} \sum_{i=1}^N \Xi_n^{\setminus i} \left\langle\!\!\left\langle \prod_{a=1}^r f_i^a \right\rangle\!\!\right\rangle_{\setminus i}, \]  (11)

where $\langle\langle \cdots \rangle\rangle_{\setminus i}$ denotes an average with respect to the so-called cavity distribution $P_{\setminus i}$ for the replicated variables $\tilde f_i = (f_i^1, \dots, f_i^n)$, defined by

\[ P_{\setminus i}(\tilde f_i) \propto \frac{1}{L_i(\tilde f_i)} \int \prod_{j=1,\, j \ne i}^N d\tilde f_j\; P(\tilde f_1, \dots, \tilde f_N). \]  (12)

The joint distribution of the replica variables, $P(\tilde f_1, \dots, \tilde f_N) \propto \prod_{a=1}^n \mu[f^a] \prod_{j=1}^N L_j(\tilde f_j)$, is defined by the new likelihoods

\[ L_j(\tilde f_j) = \exp\!\left[ -\frac{S}{N} \left( 1 - \prod_{a=1}^n P(z_j|f_j^a) \right) \right]. \]  (13)

^4 $P_{\setminus i}(\tilde f_i)$, Eq. (12), has the normalizing partition function $\Xi_n^{\setminus i}$, where $\Xi_n^{\setminus i} \to 1$ for $n \to 0$.

5 TAP Approximation

We have mapped the original bootstrap problem onto an inference problem for an effective Bayesian probabilistic model (the hidden variables have dimensionality $N \times n$) for which we have to find a tractable approximation that allows the analytical continuations $n \to 0$ and $\beta \to \infty$. We use the adaptive TAP approach of Opper and Winther [9], which is often found to give more accurate results than, e.g., a simple mean field or a variational Gaussian approximation. The ADATAP approach replaces the analytically intractable cavity distribution Eq. (12) by a Gaussian distribution. In our case this can be written as

\[ P_{\setminus i}(\tilde f_i) \propto e^{-\frac{1}{2} \tilde f^T \Lambda_c(i)\, \tilde f + \gamma_c(i)^T \tilde f}, \]  (14)

where the parameters $\Lambda_c$ and $\gamma_c$ are computed self-consistently from the dataset $D_0$ by solving a set of coupled nonlinear equations. Details are given in the appendix.

The form Eq. (14) allows a simple way of dealing with the parameters $n$ and $\beta$. We utilize the exchangeability of the variables $f_i^1, \dots, f_i^n$, assume replica symmetry, and further introduce an explicit scaling of all parameters with $\beta$.
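Replica symmetry means the $n \times n$ matrix $\Lambda_c(i)$ carries one value on its diagonal and a second value everywhere else. Such matrices have only two distinct eigenvalues, with all $n$-dependence explicit, which is what makes the continuation $n \to 0$ tractable. A small numerical check of this standard linear algebra fact (the numbers are arbitrary, our own illustration):

```python
import numpy as np

# A replica-symmetric n x n matrix: Lambda0 on the diagonal, Lambda elsewhere.
n, Lambda0, Lambda = 5, 3.0, 1.2      # arbitrary illustrative values
M = Lambda * np.ones((n, n)) + (Lambda0 - Lambda) * np.eye(n)

eig = np.sort(np.linalg.eigvalsh(M))
# Spectrum: Lambda0 + (n-1)*Lambda once, and Lambda0 - Lambda with multiplicity n-1.
assert np.allclose(eig[-1], Lambda0 + (n - 1) * Lambda)
assert np.allclose(eig[:-1], Lambda0 - Lambda)
```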
This scaling was found to make all final expressions finite in the limit $\beta \to \infty$. We set

\[ \Lambda_c^{ab}(i) = \Lambda_c(i) = \beta^2 \lambda_c(i) \ \text{for } a \ne b, \qquad \Lambda_c^{aa}(i) = \Lambda_c^0(i) = \beta^2 \lambda_c^0(i), \qquad \gamma_c^a(i) = \beta \gamma_c(i) \ \text{for all } a = 1, \dots, n. \]  (15)

We also assume that $\Delta\lambda_c(i) \doteq \beta^{-1}(\Lambda_c^0(i) - \Lambda_c(i))$ remains finite for $\beta \to \infty$. The ansatz Eq. (15) keeps the number of adjustable parameters independent of $n$ and allows us to perform the "replica limit" $n \to 0$ and the "SVM limit" $\beta \to \infty$ in all equations analytically, before we start the final numerical parameter optimization.

Computing the expectation Eq. (11) with Eqs. (14) and (15) and resumming the power series over $r$ yields the final theoretical expression for Eq. (6),

\[ \varepsilon(S) = \frac{1}{N} \sum_{i=1}^N \int dG(u)\; g\!\left( \frac{\gamma_c(i) + u \sqrt{-\lambda_c(i)}}{\Delta\lambda_c(i)},\, x_i, y_i \right), \]  (16)

where $dG(u) = du\, (2\pi)^{-1/2} e^{-u^2/2}$ and $g$ is an arbitrary loss function. With $g(\hat f_D(x_i), x_i, y_i) = \Theta(-y_i \hat f_D(x_i))$ we obtain the bootstrapped classification error

\[ \varepsilon(S) = \frac{1}{N} \sum_{i=1}^N \Phi\!\left( -\frac{y_i \gamma_c(i)}{\sqrt{-\lambda_c(i)}} \right), \]  (17)

where $\Phi(x) = \int_{-\infty}^x dG(u)$.

Besides the computation of generalization errors, we can use our method to quantify the uncertainty of the SVM prediction at test points. This uncertainty is captured by the bootstrap distribution of the internal field $\hat f_D(x_i)$ at a test input $x_i$. This is obtained from Eq.
(16) by inserting $g(\hat f_D(x_i), x_i, y_i) = \delta(\hat f_D(x_i) - h)$, using the Dirac $\delta$-function:

\[ \rho_i(h) = \frac{\Delta\lambda_c(i)}{\sqrt{-2\pi \lambda_c(i)}} \exp\!\left( -\frac{\left( h\, \Delta\lambda_c(i) - \gamma_c(i) \right)^2}{2\left( -\lambda_c(i) \right)} \right), \]  (18)

i.e., $m_i^c = \gamma_c(i)/\Delta\lambda_c(i)$ and $V_{ii}^c = -\lambda_c(i)/(\Delta\lambda_c(i))^2$ are the predicted mean and variance of the internal field. (The predicted posterior variance of the internal field is $(\beta\, \Delta\lambda_c(i))^{-1}$ and goes to zero as $\beta \to \infty$, indicating the transition to a deterministic model.) It is possible to extend the result Eq. (18) to "real" test inputs $x \notin D_0$, which is of greater importance for applications. This replaces $\Delta\lambda_c(i)$, $\gamma_c(i)$, $\lambda_c(i)$ by

\[ \Delta\lambda_c(x) = \left( K(x,x) - \sum_{i=1}^N K(x, x_i)\, \Delta\lambda(i)\, T_i(x) \right)^{-1}, \qquad \gamma_c(x) = \Delta\lambda_c(x) \sum_{i=1}^N T_i(x)\, \gamma(i), \qquad \lambda_c(x) = (\Delta\lambda_c(x))^2 \sum_{i=1}^N (T_i(x))^2\, \lambda(i), \]  (19)

with $T_i(x) = \sum_{j=1}^N K(x, x_j) \left[ (I + \mathrm{diag}(\Delta\lambda)\, K)^{-1} \right]_{ji}$.
The parameters $\Delta\lambda(i)$, $\gamma(i)$, $\lambda(i)$ are determined from $D_0$ according to Eqs. (22), (23).

[Figure 1: left panel, bootstrapped classification error versus bootstrap sample size S for the Wisconsin (N=683), Pima (N=532), Sonar (N=208) and Crabs (N=200) data sets, with an inset for Wisconsin; right panel, density of the bootstrapped local field at a test input x, with an inset comparing simulated and theoretical p(-1|x) (here S: 0.376, T: 0.405).]

Figure 1: Left: Average bootstrapped generalization error for hard margin support vector classification on different data sets (simulation: symbols, theory: lines). Right: Bootstrapped distribution of the internal field for Sonar data at a test input $x \notin D_0$. Most distributions are Gaussian-like and in good agreement with the theory, Eq. (18). We show an atypical case (simulation: histogram, theory: line) which nevertheless predicts the relative weights of both class labels fairly well. The inset shows true versus estimated values of the probability $p(-1|x)$ for predicting label $y = -1$.

6 Results for Bootstrap of Hard Margin Support Vector Classifiers

We determined the set of theoretical parameters by solving Eqs. (21)-(23) for four benchmark data sets $D_0$ [10] and different sample sizes $S$, using an RBF kernel $K(x, x') = \exp\!\left( -\frac{1}{2} \sum_{k=1}^d v_k (x_k - x'_k)^2 \right)$ with individually customized hyperparameters $v_k$. The left panel of Fig. 1 compares our theoretical results for the bootstrapped learning curves obtained by Eq. (17) (lines) with results from Monte-Carlo simulations (symbols).
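The RBF kernel used in these experiments can be evaluated for a whole dataset at once; a small sketch (the helper name `rbf_gram` and the test values are ours):

```python
import numpy as np

def rbf_gram(X, v):
    """Gram matrix of K(x, x') = exp(-0.5 * sum_k v_k (x_k - x'_k)^2)
    with per-dimension hyperparameters v_k."""
    d = X[:, None, :] - X[None, :, :]                  # pairwise coordinate differences
    return np.exp(-0.5 * np.einsum('ijk,k->ij', d ** 2, np.asarray(v, float)))
```

Broad kernels (small $v_k$) couple each training input to many neighbors, which is the regime where the Gaussian cavity approximation discussed next is expected to work well.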
The Gaussian approximation of the cavity distribution is based on the assumption that the model prediction at a training input is influenced by a sufficiently large number of neighboring inputs. We therefore expect it to work well for sufficiently broad kernel functions. This was the case for the Crabs and Wisconsin data sets, where our theory is very accurate. It correctly predicts the interesting non-monotonic learning curve for the Wisconsin data (inset of Fig. 1, left). In comparison, the Sonar and Pima data sets were learnt with narrow RBF kernels. Here, the quality of the TAP approximation becomes less good. However, our results still provide a reasonable estimate of the bootstrapped generalization error at sample size $S = N$. While the case $S = N$ is of main importance for the practical application of estimating the "true" generalization error with Efron's 0.632 bootstrap estimator, it is also interesting to discuss the limit of extreme oversampling, $S \to \infty$. Since the hard margin SVM gains no additional information from multiple presentations of the same data point, in this limit all bootstrap sets $D$ supply exactly the same information as the data set $D_0$, and the data average $E_D[\dots]$ becomes trivial. Variances with respect to $E_D[\dots]$ go to zero. With Eqs. (21)-(23), we can write the average prediction $m_i$ at input $x_i \in D_0$ as $m_i = \sum_{j=1}^N y_j \alpha_j K(x_i, x_j)$ with weights $\alpha_j = \frac{\Delta\lambda(j)\, \Delta\lambda_c(j)}{\Delta\lambda(j) + \Delta\lambda_c(j)} \left( y_j m_j - y_j m_j^c \right)$, and recover for $S \to \infty$ the Kuhn-Tucker conditions $\alpha_i \ge 0$ and $\alpha_i \Theta(y_i m_i - 1) = 0$. The bootstrapped generalization error Eq.
(17) is found to converge to the approximate leave-one-out error of Opper and Winther [8],

\[ \lim_{S \to \infty} \varepsilon(S) = \frac{1}{N} \sum_{i=1}^N \Theta\!\left( -y_i m_i^c \right) = \frac{1}{N} \sum_{i}^{\mathrm{SV}} \Theta\!\left( \frac{\alpha_i}{[K_{\mathrm{SV}}^{-1}]_{ii}} - 1 \right), \]  (20)

where the weights $\alpha_i$ are given by the SVM algorithm on $D_0$ and $K_{\mathrm{SV}}$ is the kernel matrix on the set of support vectors. While the leave-one-out estimate is a non-smooth function of the model parameters, Efron's 0.632 bootstrap estimate $\varepsilon(N)$ [2] of the generalization error, approximated within our theory, results in the differentiable expression Eq. (17), which may be used for kernel hyperparameter estimation. Preliminary results are promising.

The right panel of Fig. 1 shows results for the bootstrapped distribution of the internal field at test inputs $x \notin D_0$. The data set $D_0$ contained $N = 188$ Sonar data points and the bootstrap is at sample size $S = N$. We find that the true distribution is often very Gaussian-like and well described by the theory, Eq. (18). Figure 1 (right) shows a rare case where a bi-modal distribution (histogram) is found. Nevertheless, the Gaussian (line) predicted by our theory estimates the probability $p(-1|x)$ of a negative output quite accurately in comparison with the probability obtained from the simulation.

Both SVM training and the computation of our approximate SVM bootstrap require running iterative algorithms. We compared the time $t_{\mathrm{train}}$ for training a single SVM on each of the four benchmark data sets $D_0$ with the time $t_{\mathrm{theo}}$ needed to solve our theory for SVM bootstrap estimates on these data for $S = N$. For sufficiently broad kernels we find $t_{\mathrm{train}} \ge t_{\mathrm{theo}}$ and our theory is reliable.
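Once a single SVM has been trained on $D_0$, the limiting estimator Eq. (20) is cheap to evaluate; a minimal sketch (the function name and the support-vector threshold are our own choices):

```python
import numpy as np

def approx_loo_error(alpha, K, sv_tol=1e-8):
    """Approximate leave-one-out error of Eq. (20):
    (1/N) * sum over support vectors of Theta(alpha_i / [K_SV^{-1}]_ii - 1)."""
    N = len(alpha)
    sv = np.flatnonzero(alpha > sv_tol)        # support vectors have alpha_i > 0
    Kinv = np.linalg.inv(K[np.ix_(sv, sv)])    # kernel matrix restricted to the SVs
    return float(np.sum(alpha[sv] / np.diag(Kinv) > 1.0) / N)
```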
The exceptions are extremely narrow kernels. For these (the Pima example in Fig. 1, left) we find $t_{\mathrm{theo}} > t_{\mathrm{train}}$: our theory is still faster to compute than a good Monte-Carlo estimate of the bootstrap, but less reliable.

7 Outlook

Our experiments on SVMs show that the approximate replica bootstrap approach appears to be highly robust when applied to models which only fit into our framework after some delicate limiting process. The SVM is also an important application because the prediction for each dataset requires the solution of a costly optimization problem. Experiments on benchmark data showed that our theory is appreciably faster to compute than a good Monte-Carlo estimate of the bootstrap and yields reliable results for kernels which are sufficiently broad. It will be interesting to apply our approach to other kernel methods such as kernel PCA. Since our method is based on a fairly general framework, we will also investigate whether it can be applied to models where the bootstrapped parameters have a more complicated structure, e.g., trees or hidden Markov models.

Acknowledgments

DM gratefully acknowledges financial support from the Copenhagen Image and Signal Processing Graduate School and from the Postgraduate Programme "Natural Disasters" at the University of Karlsruhe.

Appendix: TAP Equations

The ADATAP approach computes the set of parameters $\Lambda_c(i)$, $\gamma_c(i)$ by constructing an alternative set of tractable likelihoods $\hat L_j(\tilde f) = e^{-\frac{1}{2} \tilde f^T \Lambda(j) \tilde f + \gamma(j)^T \tilde f}$, defining an auxiliary Gaussian joint distribution $P_G(\tilde f_1, \dots, \tilde f_N) \propto \prod_{a=1}^n \mu(f^a) \prod_{j=1}^N \hat L_j(\tilde f_j)$. We use replica symmetry and a specific scaling of the parameters with $\beta$: $\gamma^a(j) = \beta \gamma(j)$, $\Lambda^{aa}(j) = \Lambda^0(j) = \beta^2 \lambda^0(j)$ for all $a$, $\Lambda^{ab}(j) = \Lambda(j) = \beta^2 \lambda(j)$ for $a \ne b$, and $\Delta\lambda(j) = \beta^{-1}(\Lambda^0(j) - \Lambda(j))$.
All unknown parameters are found by moment matching: we require that the first two marginal moments $m_i = \lim_{n \to 0} \langle\langle f_i^a \rangle\rangle$, $V_{ii} = \lim_{n \to 0} \langle\langle f_i^a f_i^b \rangle\rangle - (m_i)^2$, and $\chi_{ii} = \beta \lim_{n \to 0} \langle\langle f_i^a f_i^a - f_i^a f_i^b \rangle\rangle$ of the variables $\tilde f_i$ can be computed equivalently 1) by marginalizing $P_G$ and 2) by using the relations between the cavity distribution and the marginal distributions, $P(\tilde f_i) \propto L_i(\tilde f_i) P_{\setminus i}(\tilde f_i)$ as well as $P_G(\tilde f_i) \propto \hat L_i(\tilde f_i) P_{\setminus i}(\tilde f_i)$, for all $i = 1, \dots, N$. This yields

\[ \chi_{ii} = \chi_{ii}^c \left( 1 - (1 - e^{-S/N})\, \Phi(\Delta_i^c) \right) \]
\[ m_i = m_i^c \left( 1 - (1 - e^{-S/N})\, \Phi(\Delta_i^c) \right) + y_i (1 - e^{-S/N}) \sqrt{\frac{V_{ii}^c}{2\pi}}\; e^{-\frac{1}{2}(\Delta_i^c)^2} \]  (21)
\[ V_{ii} = V_{ii}^c \left( 1 - (1 - e^{-S/N})\, \Phi(\Delta_i^c) \right) + (1 - y_i m_i)(y_i m_i - y_i m_i^c) \]

where $m_i^c = \frac{\gamma_c(i)}{\Delta\lambda_c(i)}$, $V_{ii}^c = -\frac{\lambda_c(i)}{(\Delta\lambda_c(i))^2}$, $\chi_{ii}^c = \frac{1}{\Delta\lambda_c(i)}$, and $\Delta_i^c = \frac{1 - y_i m_i^c}{\sqrt{V_{ii}^c}}$. Further,

\[ \chi_{ii} = (G)_{ii}, \qquad m_i = (G \gamma)_i, \qquad V_{ii} = -(G\, \mathrm{diag}(\lambda)\, G)_{ii} \]  (22)

with the $N \times N$ matrix $G = (K^{-1} + \mathrm{diag}(\Delta\lambda))^{-1}$, and

\[ \chi_{ii} = \frac{1}{\Delta\lambda(i) + \Delta\lambda_c(i)}, \qquad m_i = \frac{\gamma(i) + \gamma_c(i)}{\Delta\lambda(i) + \Delta\lambda_c(i)}, \qquad V_{ii} = -\frac{\lambda(i) + \lambda_c(i)}{(\Delta\lambda(i) + \Delta\lambda_c(i))^2}. \]  (23)

We solve Eqs. (21)-(23) by iteration, using Eqs. (21) and (22) to evaluate the moments $\{m_i, V_{ii}, \chi_{ii}\}$ and Eq. (23) to update the sets of parameters $\{\gamma_c(i), \Delta\lambda_c(i), \lambda_c(i)\}$ and $\{\gamma(i), \Delta\lambda(i), \lambda(i)\}$, respectively.
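The Gaussian part of this scheme, Eqs. (22) and (23), translates directly into a few lines of linear algebra. The sketch below omits the SVM-specific moment step, Eq. (21), and uses our own function names:

```python
import numpy as np

def gaussian_moments(K, dlam, gamma, lam):
    """Marginal moments of the auxiliary Gaussian model, Eq. (22):
    G = (K^-1 + diag(dlam))^-1, chi_ii = G_ii, m = G gamma,
    V_ii = -(G diag(lam) G)_ii."""
    G = np.linalg.inv(np.linalg.inv(K) + np.diag(dlam))
    return np.diag(G).copy(), G @ gamma, -np.diag(G @ np.diag(lam) @ G).copy()

def cavity_update(chi, m, V, dlam, gamma, lam):
    """Invert Eq. (23) for the cavity parameters:
    dlam_c = 1/chi - dlam, gamma_c = m/chi - gamma, lam_c = -V/chi^2 - lam."""
    return 1.0 / chi - dlam, m / chi - gamma, -V / chi ** 2 - lam
```

Alternating these two maps with the moment equations Eq. (21) gives the fixed-point iteration described in the text.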
Reasonable starting values are $\Delta\lambda(i) = \Delta\lambda$, $\lambda(i) = -\Delta\lambda$, $\gamma(i) = y_i \Delta\lambda$, where $\Delta\lambda$ is obtained as the root of

\[ 0 = 1 - \frac{1}{N} \sum_{i=1}^N \frac{\omega_i \Delta\lambda}{1 + \omega_i \Delta\lambda} - \left( 1 - (1 - e^{-S/N})\, \Phi(\Delta^c) \right) \]

with $\Delta^c = -0.5$, and the $\omega_i$ are the eigenvalues of the kernel matrix $K$.

References

[1] B. Efron. Ann. Statist., 7: 1-26, 1979.
[2] B. Efron, R. J. Tibshirani. An Introduction to the Bootstrap. Monographs on Statistics and Applied Probability 57, Chapman & Hall, 1993.
[3] J. Shao, D. Tu. The Jackknife and Bootstrap. Springer Series in Statistics, Springer, 1995.
[4] D. Malzahn, M. Opper. A statistical mechanics approach to approximate analytical bootstrap averages. In NIPS 15, S. Becker, S. Thrun, K. Obermayer, eds., MIT Press, 2003.
[5] M. Mézard, G. Parisi, M. A. Virasoro. Spin Glass Theory and Beyond. Lecture Notes in Physics 9, World Scientific, 1987.
[6] B. Schölkopf, C. J. C. Burges, A. J. Smola (eds.). Advances in Kernel Methods: Support Vector Learning. MIT Press, Cambridge, MA, 1999.
[7] P. Sollich. Probabilistic interpretation and Bayesian methods for support vector machines. In ICANN99, pp. 91-96, Springer, 1999.
[8] M. Opper, O. Winther. Neural Computation, 12: 2655-2684, 2000.
[9] M. Opper, O. Winther. Phys. Rev. Lett., 86: 3695, 2001.
[10] From http://www1.ics.uci.edu/~mlearn/MLSummary.html and http://www.stats.ox.ac.uk/pub/PRNN/.
", "award": [], "sourceid": 2483, "authors": [{"given_name": "Dörthe", "family_name": "Malzahn", "institution": null}, {"given_name": "Manfred", "family_name": "Opper", "institution": null}]}