{"title": "Kernel Dependency Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 897, "page_last": 904, "abstract": null, "full_text": "Kernel Dependency Estimation \n\nJason Weston, Olivier Chapelle, Andre Elisseeff, Bernhard Scholkopf and Vladimir Vapnik* \n\nMax Planck Institute for Biological Cybernetics, 72076 Tubingen, Germany \n\n*NEC Research Institute, Princeton, NJ 08540 USA \n\nAbstract \n\nWe consider the learning problem of finding a dependency between a general class of objects and another, possibly different, general class of objects. The objects can be, for example, vectors, images, strings, trees or graphs. Such a task is made possible by employing similarity measures in both input and output spaces using kernel functions, thus embedding the objects into vector spaces. We experimentally validate our approach on several tasks: mapping strings to strings, pattern recognition, and reconstruction from partial images. \n\n1 Introduction \n\nIn this article we consider the rather general learning problem of finding a dependency between inputs x ∈ X and outputs y ∈ Y given a training set (x_1, y_1), ..., (x_m, y_m) ∈ X × Y, where X and Y are nonempty sets. This includes conventional pattern recognition and regression estimation. It also encompasses more complex dependency estimation tasks, e.g. the mapping of a certain class of strings to a certain class of graphs (as in text parsing) or the mapping of text descriptions to images. In this setting, we define learning as estimating the function f(x, α*) from the set of functions {f(·, α), α ∈ Λ} which provides the minimum value of the risk function \n\nR(α) = ∫_{X × Y} L(y, f(x, α)) dP(x, y)    (1) \n\nwhere P is the (unknown) joint distribution of x and y, and L(y, η) is a loss function, a measure of distance between the estimate η and the true output y at a point x.
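The risk (1) is an expectation over the unknown distribution P; in practice one evaluates its empirical counterpart on the training sample. As a small illustration, the following sketch computes that sample average for a plug-in loss and predictor (the helper names and toy data are ours, not the paper's):

```python
def empirical_risk(loss, predict, data):
    # (1/m) * sum_i L(y_i, f(x_i)): the sample version of the risk (1).
    return sum(loss(y, predict(x)) for x, y in data) / len(data)

# Toy sample where y = 2x exactly, scored with squared loss.
data = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]
risk = empirical_risk(lambda y, p: (y - p) ** 2, lambda x: 2.0 * x, data)
```

Any loss function can be plugged in, which is the point of the setting described above: the choice of L encodes the problem-specific cost.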
\nHence in this setting one is given a priori knowledge of the similarity measure used in the space Y in the form of a loss function. In pattern recognition this is often the zero-one loss; in regression, squared loss is often chosen. However, for other types of outputs, for example if one were required to learn a mapping to images, or to a mixture of drugs (a drug cocktail) to prescribe to a patient, more complex costs would apply. We would like to be able to encode these costs into the method of estimation we choose. \n\nThe framework we attempt to address is rather general. Few algorithms have been constructed which can work in such a domain; in fact the only algorithm that we are aware of is k-nearest neighbors. Most algorithms have focused on the pattern recognition and regression problems and cannot deal with more general outputs. Conversely, specialist algorithms have been made for structured outputs, for example algorithms which calculate parse trees for natural language sentences; however, these algorithms are specialized for their tasks. Recently, kernel methods [12, 11] have been extended to deal with inputs that are structured objects such as strings or trees, by linearly embedding the objects using the so-called kernel trick [5, 7]. These objects are then used in pattern recognition or regression domains. In this article we show how to construct a general algorithm for dealing with dependencies between both general inputs and general outputs. The algorithm ends up in a formulation which has a kernel function for the inputs and a kernel function (which will correspond to choosing a particular loss function) for the outputs. This also enables us (in principle) to encode specific prior information about the outputs (such as special cost functions and/or invariances) in an elegant way, although this is not experimentally validated in this work.
\n\nThe paper is organized as follows. In Section 2 it is shown how to use kernel functions to measure similarity between outputs as well as inputs. This leads to the derivation of the Kernel Dependency Estimation (KDE) algorithm in Section 3. Section 4 validates the method experimentally and Section 5 concludes. \n\n2 Loss functions and kernels \n\nAn informal way of looking at the learning problem consists of the following. Generalization occurs when, given a previously unseen x ∈ X, we find a suitable y ∈ Y such that (x, y) should be \"similar\" to (x_1, y_1), ..., (x_m, y_m). For outputs one is usually given a loss function for measuring similarity (this can be, but is not always, inherent to the problem domain). For inputs, one way of measuring similarity is by using a kernel function. A kernel k is a symmetric function which is an inner product in some Hilbert space F, i.e., there exists a map Φ_k : X → F such that k(x, x') = (Φ_k(x) · Φ_k(x')). We can think of the patterns as Φ_k(x), Φ_k(x'), and carry out geometric algorithms in the inner product space (\"feature space\") F. Many successful algorithms are now based on this approach, see e.g. [12, 11]. Typical kernel functions are polynomials k(x, x') = (x · x' + 1)^p and RBFs k(x, x') = exp(-||x - x'||² / (2σ²)), although many other types (including ones which take into account prior information about the learning problem) exist. \n\nNote that, like distances between examples in input space, it is also possible to think of the loss function as a distance measure in output space; we will denote this space ℒ. We can measure inner products in this space using a kernel function. We will denote this as ℓ(y, y') = (Φ_ℓ(y) · Φ_ℓ(y')), where Φ_ℓ : Y → ℒ. This map makes it possible to consider a large class of nonlinear loss functions.¹
As in the traditional kernel trick for the inputs, the nonlinearity is only taken into account when computing the kernel matrix. The rest of the training is \"simple\" (e.g., a convex program, or methods of linear algebra such as matrix diagonalization). It also makes it possible to consider structured objects as outputs, such as the ones described in [5]: strings, trees, graphs and so forth. One embeds the output objects in the space ℒ using a kernel. \n\nLet us define some kernel functions for output spaces. \n\n¹ For instance, assuming the outputs live in R^n, using an RBF kernel one obtains a loss function ||Φ_ℓ(y) - Φ_ℓ(y')||² = 2 - 2 exp(-||y - y'||² / (2σ²)). This is a nonlinear loss function which takes the value 0 if y and y' coincide, and 2 if they are maximally different. The rate of increase in between (i.e., the \"locality\") is controlled by σ. \n\nIn M-class pattern recognition, given Y = {1, ..., M}, one often uses the distance L(y, y') = 1 - [y = y'], where [y = y'] is 1 if y = y' and 0 otherwise. To construct a corresponding inner product it is necessary to embed this distance into a Euclidean space, which can be done using the following kernel: \n\nℓ_pat(y, y') = (1/2) [y = y'],    (2) \n\nas L(y, y')² = ||Φ_ℓ(y) - Φ_ℓ(y')||² = ℓ(y, y) + ℓ(y', y') - 2 ℓ(y, y') = 1 - [y = y']. It corresponds to embedding into an M-dimensional Euclidean space via the map Φ_ℓ(y) = (0, 0, ..., 1/√2, ..., 0), where the yth coordinate is the only nonzero one. It is also possible to describe multi-label classification (where any one example belongs to an arbitrary subset of the M classes) in a similar way. \n\nFor regression estimation, one can use the usual inner product \n\nℓ_reg(y, y') = (y · y').    (3) \n\nFor outputs such as strings and other structured objects we require the corresponding string kernels and kernels for structured objects [5, 7].
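Before turning to string kernels, the classification embedding behind (2) can be checked numerically: the sketch below (Python/NumPy; the helper names are ours, not from the paper) builds Φ_ℓ explicitly and verifies that the induced squared distance reproduces the zero-one loss.

```python
import numpy as np

def phi_pat(y, M):
    # Explicit embedding for the M-class output kernel (2):
    # the y-th coordinate is 1/sqrt(2), all others are 0.
    v = np.zeros(M)
    v[y] = 1.0 / np.sqrt(2.0)
    return v

def l_pat(y, yp):
    # l_pat(y, y') = (1/2) [y = y']
    return 0.5 if y == yp else 0.0

M = 3
for y in range(M):
    for yp in range(M):
        # inner product matches the kernel (2) ...
        assert np.isclose(np.dot(phi_pat(y, M), phi_pat(yp, M)), l_pat(y, yp))
        # ... and the squared distance equals the zero-one loss 1 - [y = y']
        d2 = np.sum((phi_pat(y, M) - phi_pat(yp, M)) ** 2)
        assert np.isclose(d2, 0.0 if y == yp else 1.0)
```

The 1/√2 scaling is exactly what makes ℓ(y, y) = 1/2 and hence the induced distance equal to the zero-one loss.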
We give one example here, the string subsequence kernel employed in [7] for text categorization. This kernel is an inner product in a feature space consisting of all ordered subsequences of length r, denoted Σ^r. The subsequences, which do not have to be contiguous, are weighted by an exponentially decaying factor λ of their full length in the text: \n\nk(s, t) = Σ_{u ∈ Σ^r} Σ_{i: u = s[i]} Σ_{j: u = t[j]} λ^{l(i) + l(j)},    (4) \n\nwhere u = s[i] denotes that u is the subsequence of s with indices 1 ≤ i_1 ≤ ... ≤ i_|u|, and l(i) = i_|u| - i_1 + 1. A fast way to compute this kernel is described in [7]. \n\nSometimes one would also like to apply the loss given by an (arbitrary) distance matrix D of the loss between training examples, i.e. where D_ij = L(y_i, y_j). In general it is not always obvious how to find an embedding of such data in a Euclidean space (in order to apply kernels). However, one such method is to compute the inner product with [11, Proposition 2.27]: \n\nℓ(y_i, y_j) = -(1/2) ( |D_ij|² - Σ_{p=1}^m c_p |D_ip|² - Σ_{q=1}^m c_q |D_qj|² + Σ_{p,q=1}^m c_p c_q |D_pq|² )    (5) \n\nwhere the coefficients c_i satisfy Σ_i c_i = 1 (e.g. using c_i = 1/m for all i; this amounts to using the centre of mass as an origin). See also [3] for ways of dealing with problems of embedding distances when equation (5) will not suffice. \n\n3 Algorithm \n\nNow we will describe the algorithm for performing KDE. We wish to minimize the risk function (1) using the feature space F induced by the kernel k and the loss function measured in the space ℒ induced by the kernel ℓ. To do this we must learn the mapping from Φ_k(x) to Φ_ℓ(y). Our solution is the following: decompose Φ_ℓ(y) into p orthogonal directions using kernel principal components analysis (KPCA) (see, e.g. [11, Chapter 14]). One can then learn the mapping from Φ_k(x) to each direction independently using a standard kernel regression method, e.g. SVM regression [12] or kernel ridge regression [9].
Finally, to output an estimate y given a test example x, one must solve a pre-image problem, as the solution of the algorithm is initially a solution in the space ℒ. We will now describe each step in detail. \n\n1) Decomposition of outputs. Let us construct the kernel matrix L on the training data such that L_ij = ℓ(y_i, y_j), and perform kernel principal components analysis on L. This can be achieved by centering the data in feature space using V = (I - (1/m) 1_m 1_m^T) L (I - (1/m) 1_m 1_m^T), where I is the m × m identity matrix and 1_m is an m-dimensional vector of ones. One then solves the eigenvalue problem λα = Vα, where α^n is the nth eigenvector of V, which we normalize such that 1 = (α^n · Vα^n) = λ_n (α^n · α^n). We can then compute the projection of Φ_ℓ(y) onto the nth principal component v^n = Σ_{i=1}^m α_i^n Φ_ℓ(y_i) by (v^n · Φ_ℓ(y)) = Σ_{i=1}^m α_i^n ℓ(y_i, y). \n\n2) Learning the map. We can now learn the map from Φ_k(x) to ((v^1 · Φ_ℓ(y)), ..., (v^p · Φ_ℓ(y))), where p is the number of principal components. One can learn the map by estimating each output independently. In our experiments we use kernel ridge regression [9]; note that this requires only a single matrix inversion to learn all p directions. That is, we minimize with respect to w the function (1/m) Σ_{i=1}^m (y_i - (w · Φ_k(x_i)))² + γ ||w||² in its dual form. We thus learn each output direction (v^n · Φ_ℓ(y)) using the kernel matrix K_ij = k(x_i, x_j) and the training labels y_i = (v^n · Φ_ℓ(y_i)), with estimator \n\nf_n(x) = Σ_{i=1}^m β_i k(x_i, x).    (6) \n\n3) Solving the pre-image problem. During the testing phase, to obtain the estimate y for a given x, it is now necessary to find the pre-image of the given output Φ_ℓ(y). This can be achieved by finding \n\ny(x) = argmin_{y ∈ Y} || ((v^1 · Φ_ℓ(y)), ..., (v^p · Φ_ℓ(y))) - (f_1(x), ..., f_p(x)) ||. \n\nFor the kernel (3) it is possible to compute the solution explicitly. For other problems, searching a set of candidate solutions may be enough, e.g. the set of training set outputs y_1, ..., y_m; in our experiments we use this set. When more accurate solutions are required, several algorithms exist for finding approximate pre-images, e.g. via fixed-point iteration methods; see [10] or [11, Chapter 18] for an overview. \n\nFor the simple case of vectorial outputs with linear kernel (3), if the output is only one dimension the method of KDE boils down to the same solution as using ridge regression, since the matrix L is rank 1 in this case. However, when there are d outputs, the rank of L is d and the method trains ridge regression d times, but the kernel PCA step first decorrelates the outputs. Thus, in the special case of multiple output regression with a linear kernel, the method is also related to the work of [2] (see [4, page 73] for an overview of other multiple output regression methods). In the case of classification, the method is related to Kernel Fisher Discriminant Analysis (KFD) [8]. \n\n4 Experiments \n\nIn the following we validate our method with several experiments. In the experiments we chose the parameters of KDE from the following sets: σ ∈ {10^-3, 10^-2, 10^-1, 10^0, 10^1, 10^2, 10^3} and the ridge parameter γ ∈ {10^-4, 10^-3, 10^-2, 10^-1, 10^0, 10^1}. We chose them by five fold cross validation. \n\n4.1 Mapping from strings to strings \n\nToy problem. Three classes of strings consist of letters from the same alphabet of 4 letters (a, b, c, d), and strings from all classes are generated with a random length between 10 and 15. Strings from the first class are generated by a model where transitions from any letter to any other letter are equally likely. The output is the string abad, corrupted with the following noise.
There is a probability of 0.3 of a random insertion of a random letter, and a probability of 0.15 of two random insertions. After the potential insertions there is a probability of 0.3 of a random deletion, and a probability of 0.15 of two random deletions. In the second class, transitions from one letter to itself (so the next letter is the same as the last) have probability 0.7, and all other transitions have probability 0.1. The output is the string dbbd, corrupted with the same noise as for class one. In the third class only the letters c and d are used; transitions from one letter to itself have probability 0.7. The output is the string aabc, corrupted with the same noise as for class one. For classes one and two any starting letter is equally likely; for the third class only c and d are (equally probable) starting letters. \n\ninput string → output string \nccdddddddd → aabc \ndccccdddcd → abc \nadddccccccccc → bb \nbbcdcdadbad → aebad \ncdaaccadcbccdd → abad \n\nFigure 1: Five examples from our artificial task (mapping strings to strings). \n\nThe task is to predict the output string given the input string. Note that this is almost like a classification problem with three classes, apart from the noise on the outputs. This construction was employed so we can also calculate classification error as a sanity check. We use the string subsequence kernel (4) from [7] for both inputs and outputs, normalized, i.e. replacing k(x, x') by k(x, x') / √(k(x, x) k(x', x')). We chose the parameters r = 3 and λ = 0.01. In the space induced by the input kernel k we then chose a further nonlinear map using an RBF kernel: exp(-(k(x, x) + k(x', x') - 2 k(x, x')) / (2σ²)). \n\nWe generated 200 such strings and measured the success by calculating the mean and standard error of the loss (computed via the output kernel) over 4 fold cross validation.
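The subsequence kernel (4) can be evaluated naively by enumerating all length-r index tuples and weighting each by its span; this is exponential in r and only practical for short strings (the fast dynamic-programming recursion is the one of [7]), but it makes the weighting explicit. A sketch (function names are ours; strings are assumed to have length at least r):

```python
import math
from itertools import combinations

def subseq_kernel(s, t, r=3, lam=0.01):
    # Naive evaluation of kernel (4): each occurrence of u as a
    # (non-contiguous) subsequence with index tuple idx contributes
    # lam ** l(idx), where l(idx) = idx[-1] - idx[0] + 1 is its span.
    def weights(x):
        w = {}
        for idx in combinations(range(len(x)), r):
            u = ''.join(x[i] for i in idx)
            w[u] = w.get(u, 0.0) + lam ** (idx[-1] - idx[0] + 1)
        return w
    ws, wt = weights(s), weights(t)
    return sum(v * wt.get(u, 0.0) for u, v in ws.items())

def normalized_subseq_kernel(s, t, r=3, lam=0.01):
    # k(s, t) / sqrt(k(s, s) k(t, t)), as used in the experiments.
    return subseq_kernel(s, t, r, lam) / math.sqrt(
        subseq_kernel(s, s, r, lam) * subseq_kernel(t, t, r, lam))
```

The defaults r = 3 and lam = 0.01 match the parameters chosen above; the normalization makes every string have self-similarity 1.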
We chose σ (the width of the RBF kernel) and γ (the ridge parameter) on each trial via a further level of 5 fold cross validation. We compare our method to an adaptation of k-nearest neighbors for general outputs: if k = 1 it returns the output of the nearest neighbor, otherwise it returns the linear combination (in the space of outputs) of the k nearest neighbors (in input space). In the case of k > 1, as well as for KDE, we find a pre-image by finding the closest training example output to the given solution. We choose k again via a further level of 5 fold cross validation. The mean results, and their standard errors, are given in Table 1. \n\n                     KDE             k-NN \nstring loss          0.676 ± 0.030   0.985 ± 0.029 \nclassification loss  0.125 ± 0.012   0.205 ± 0.026 \n\nTable 1: Performance of KDE and k-NN on the string to string mapping problem. \n\n4.2 Multi-class classification problem \n\nWe next tried a multi-class classification problem, a simple special case of the general dependency estimation problem. We performed 5-fold cross validation on 1000 digits (the first 100 examples of each digit) of the USPS handwritten 16x16 pixel digit database, training with a single fold (200 examples) and testing on the remainder. We used an RBF kernel for the inputs and the zero-one multi-class classification loss for the outputs using kernel (2). We again compared to k-NN and also to 1-vs-rest Support Vector Machines (SVMs) (see, e.g. [11, Section 7.6]). We found k for k-NN, and σ and γ for the other methods (we employed a ridge also in the SVM method, resulting in a squared error penalization term), by another level of 5-fold cross validation. The results are given in Table 2. SVMs and KDE give similar results (this is not too surprising since KDE gives a rather similar solution to KFD, whose similarity to SVMs in terms of performance has been shown before [8]). Both SVM and KDE outperform k-NN.
\n\nclassification loss 0.0798 \u00b1 0.0067 0.0847 \u00b1 0.0064 0.1250 \u00b1 0.0075 \n\nk-NN \n\nKDE \n\n1-vs-rest SVM \n\nTable 2: Performance of KDE, 1-vs-rest SVMs and k-NN on a classification problem \nof handwritten digits. \n\n4.3 \n\nImage reconstruction \n\nWe then considered a problem of image reconstruction: given the top half (the first \n8 pixel lines) of a USPS postal digit, it is required to estimate what the bottom \nhalf will be (we thus ignored the original labels of the data).2 The loss function we \nchoose for the outputs is induced by an RBF kernel. The reason for this is that \na penalty that is only linear in y would encourage the algorithm to choose images \nthat are \"inbetween\" clearly readable digits. Hence, the difficulty in this task is \nboth choosing a good loss function (to reflect the end user's objectives) as well as \nan accurate estimator. We chose the width a' of the output RBF kernel which \nmaximized the kernel alignment [1] with a target kernel generated via k-means \nclustering. We chose k=30 clusters and the target kernel is K ij = 1 if Xi and Xj \nare in the same cluster, and 0 otherwise. Kernel alignment is then calculated via: \nA(K 1 ,K2 ) = (K l,K2)F/J(Kl, Kl) F(K2,K2)F where (K , K')F = 2:7,'j=l KijK~j \nis the Frobenius dot product, which gave a' = 0.35. For the inputs we use an RBF \nkernel of width a . \n\nWe again performed 5-fold cross validation on the first 1000 digits of the USPS \nhandwritten 16x16 pixel digit database, training with a single fold (200 examples) \nand testing on the remainder, comparing KDE to k-NN and a Hopfield net. 3 The \nHopfield network we used was the one of [6] implemented in the Neural Network \nToolbox for Matlab. 
It is a generalization of standard Hopfield nets that has a nonlinear transfer function and can thus deal with scalars between -1 and +1; after building the network based on the (complete) digits of the training set, we present the top half of test digits, fill the bottom half with zeros, and then find the network's equilibrium point. We then chose as output the pre-image from the training data that is closest to this solution (thus the possible outputs are the same as for the competing algorithms). We found σ and γ for KDE and k for k-NN by another level of 5-fold cross validation. The results are given in Table 3. \n\n           KDE               k-NN              Hopfield net \nRBF loss   0.8384 ± 0.0077   0.8960 ± 0.0052   1.2190 ± 0.0072 \n\nTable 3: Performance of KDE, k-NN and a Hopfield network on an image reconstruction problem of handwritten digits. \n\nKDE outperforms k-NN and Hopfield nets on average; see Figure 2 for a comparison with k-NN. \n\nFigure 2: Errors in the digit database image reconstruction problem. Images have to be estimated using only the top half (first 8 rows of pixels) of the original image (top row) by KDE (middle row) and k-NN (bottom row). We show all the test examples on the first fold of cross validation where k-NN makes an error in estimating the correct digit whilst KDE does not (73 mistakes) and vice-versa (23 mistakes). We chose them by viewing the complete results by eye (and they are thus somewhat subjective). The complete results can be found at http://www.kyb.tuebingen.mpg.de/bs/people/weston/kde/kde.html. \n\n² A similar problem, of higher dimensionality, would be to learn the mapping from top half to complete digit. \n\n³ Note that training a naive regressor on each pixel output independently would not take into account that the combination of pixel outputs should resemble a digit.
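The alignment-based selection of the output kernel width described above can be reproduced in a few lines; the target kernel comes from cluster labels exactly as in the text, and the function names are ours, not from the paper:

```python
import numpy as np

def frobenius(K1, K2):
    # <K1, K2>_F = sum_ij K1_ij * K2_ij
    return float(np.sum(K1 * K2))

def alignment(K1, K2):
    # A(K1, K2) = <K1, K2>_F / sqrt(<K1, K1>_F <K2, K2>_F)
    return frobenius(K1, K2) / np.sqrt(frobenius(K1, K1) * frobenius(K2, K2))

def cluster_target_kernel(labels):
    # Target kernel: K_ij = 1 if examples i and j share a cluster, else 0.
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)
```

One would then evaluate alignment(K_sigma, cluster_target_kernel(labels)) over a grid of candidate widths σ' and keep the maximizer, which is how σ' = 0.35 was obtained above.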
Note that we cannot easily compare classification rates on this problem using the pre-images selected, since KDE outputs are not correlated well with the labels. For example, it will use the bottom stalk of a digit \"7\" or a digit \"9\" equally if they are identical, whereas k-NN will not: in the region of the input space which is the top half of \"9\"s it will only output the bottom half of \"9\"s. This explains why measuring the class of the pre-images compared to the true class as a classification problem yields a lower loss for k-NN, 0.2345 ± 0.0058, compared to KDE, 0.2985 ± 0.0147, and Hopfield nets, 0.5910 ± 0.0137. Note that if we performed classification as in Section 4.2 but using only the first 8 pixel rows, then k-NN yields 0.2345 ± 0.0058, but KDE yields 0.1878 ± 0.0098 and 1-vs-rest SVMs yield 0.1942 ± 0.0097, so k-NN does not adapt well to the given learning task (loss function). \n\nFinally, we note that nothing was stopping us from incorporating known invariances into our loss function in KDE via the kernel. For example we could have used a kernel which takes into account local patches of pixels rendering spatial information, or jittered kernels which take into account chosen transformations (translations, rotations, and so forth). It may also be useful to add virtual examples to the output matrix L before the decomposition step. For an overview of incorporating invariances see [11, Chapter 11] or [12]. \n\n5 Discussion \n\nWe have introduced a kernel method of learning general dependencies. We also gave some first experiments indicating the usefulness of the approach. There are many applications of KDE to explore: problems with complex outputs (natural language parsing, image interpretation/manipulation, ...), applying it to special cost functions (e.g. ROC scores), and settings where prior knowledge can be encoded in the outputs.
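As a computational summary of the method, the three steps of Section 3 (KPCA on the output kernel matrix, kernel ridge regression onto each principal direction, and a candidate-search pre-image over the training outputs) can be sketched end-to-end in NumPy. This is a minimal illustration for vectorial outputs with the linear output kernel (3); the function names, the toy data, and the fixed hyperparameters are our own choices, not the paper's, and the centering of test projections is glossed over as in the summary above:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # Gram matrix of the Gaussian RBF kernel between rows of A and rows of B.
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

def kde_fit(K, L, gamma=1e-6, p=2):
    # Step 1: KPCA on the output kernel matrix L.
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m          # centering matrix
    V = H @ L @ H                                 # centered output Gram matrix
    lam, A = np.linalg.eigh(V)                    # ascending eigenvalues
    lam, A = lam[::-1][:p], A[:, ::-1][:, :p]    # keep top p components
    A = A / np.sqrt(np.maximum(lam, 1e-12))      # scale so lam_n (a.a) = 1
    # Targets T[j, n] = (v^n . Phi(y_j)) = sum_i a_i^n l(y_i, y_j).
    T = L @ A
    # Step 2: dual kernel ridge regression, all p directions at once.
    B = np.linalg.solve(K + m * gamma * np.eye(m), T)
    return A, B

def kde_predict(Kx, A, B, L):
    # Step 3: candidate-search pre-image over the training outputs.
    # Kx: kernel rows between test and training inputs, shape (t, m).
    F = Kx @ B                                    # predicted projections, (t, p)
    P = L @ A                                     # candidate projections, (m, p)
    d = ((P[None, :, :] - F[:, None, :]) ** 2).sum(axis=2)
    return np.argmin(d, axis=1)                   # index of best candidate

# Toy usage: 1-D inputs, 2-D vector outputs, linear output kernel (3).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = rbf_kernel(X, X)
L = Y @ Y.T
A, B = kde_fit(K, L)
pred = kde_predict(rbf_kernel(X, X), A, B, L)    # recovers each training output
```

With a tiny ridge and well-separated points, predicting on the training inputs returns each point's own output index, which is the sanity check one would expect from the argument above.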
\n\nIn terms of further research, we feel there are also still many possibilities to explore in terms of algorithm development. We admit that in this work we have a very simplified algorithm for the pre-image part (just choosing the closest image from the training sample). To make the approach work on more complex problems (where a test output is not so trivially close to a training output), improved pre-image approaches should be applied. Although one can apply techniques such as [10] for vector based pre-images, efficiently finding pre-images for structured objects such as strings is an open problem. Finally, the algorithm should be extended to deal with non-Euclidean loss functions directly, e.g. for classification with a general cost matrix. One naive way is to use a distance matrix directly, ignoring the PCA step. \n\nReferences \n\n[1] N. Cristianini, A. Elisseeff, and J. Shawe-Taylor. On optimizing kernel alignment. Technical Report 2001-087, NeuroCOLT, 2001. \n\n[2] I. Frank and J. Friedman. A Statistical View of Some Chemometrics Regression Tools. Technometrics, 35(2):109-147, 1993. \n\n[3] T. Graepel, R. Herbrich, P. Bollmann-Sdorra, and K. Obermayer. Classification on pairwise proximity data. NIPS, 11:438-444, 1999. \n\n[4] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer-Verlag, New York, 2001. \n\n[5] D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, Computer Science Department, University of California at Santa Cruz, 1999. \n\n[6] J. Li, A. N. Michel, and W. Porod. Analysis and synthesis of a class of neural networks: linear systems operating on a closed hypercube. IEEE Trans. on Circuits and Systems, 36(11):1405-1422, 1989. \n\n[7] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419-444, 2002.
\n\n[8] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller. Fisher discriminant analysis with kernels. In Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas, editors, Neural Networks for Signal Processing IX, pages 41-48. IEEE, 1999. \n\n[9] C. Saunders, V. Vovk, and A. Gammerman. Ridge regression learning algorithm in dual variables. In J. Shavlik, editor, Machine Learning: Proceedings of the Fifteenth International Conference (ICML '98), San Francisco, CA, 1998. Morgan Kaufmann. \n\n[10] B. Schölkopf, S. Mika, C. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, and A. J. Smola. Input space vs. feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5):1000-1017, 1999. \n\n[11] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002. \n\n[12] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998.", "award": [], "sourceid": 2297, "authors": [{"given_name": "Jason", "family_name": "Weston", "institution": null}, {"given_name": "Olivier", "family_name": "Chapelle", "institution": null}, {"given_name": "Vladimir", "family_name": "Vapnik", "institution": null}, {"given_name": "Andr\u00e9", "family_name": "Elisseeff", "institution": null}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}]}