{"title": "Support Vector Machine Classification with Indefinite Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 953, "page_last": 960, "abstract": "In this paper, we propose a method for support vector machine classification using indefinite kernels. Instead of directly minimizing or stabilizing a nonconvex loss function, our method simultaneously finds the support vectors and a proxy kernel matrix used in computing the loss. This can be interpreted as a robust classification problem where the indefinite kernel matrix is treated as a noisy observation of the true positive semidefinite kernel. Our formulation keeps the problem convex and relatively large problems can be solved efficiently using the analytic center cutting plane method. We compare the performance of our technique with other methods on several data sets.", "full_text": "Support Vector Machine Classification with Indefinite Kernels\n\nRonny Luss\nORFE, Princeton University\nPrinceton, NJ 08544\nrluss@princeton.edu\n\nAlexandre d'Aspremont\nORFE, Princeton University\nPrinceton, NJ 08544\naspremon@princeton.edu\n\nAbstract\n\nIn this paper, we propose a method for support vector machine classification using indefinite kernels. Instead of directly minimizing or stabilizing a nonconvex loss function, our method simultaneously finds the support vectors and a proxy kernel matrix used in computing the loss. This can be interpreted as a robust classification problem where the indefinite kernel matrix is treated as a noisy observation of the true positive semidefinite kernel. Our formulation keeps the problem convex and relatively large problems can be solved efficiently using the analytic center cutting plane method. 
We compare the performance of our technique with other methods on several data sets.\n\n1 Introduction\n\nHere, we present an algorithm for support vector machine (SVM) classification using indefinite kernels. Our interest in indefinite kernels is motivated by several observations. First, certain similarity measures take advantage of application-specific structure in the data and often display excellent empirical classification performance. Unlike popular kernels used in support vector machine classification, these similarity matrices are often indefinite and so do not necessarily correspond to a reproducing kernel Hilbert space (see [1] for a discussion).\n\nAn application of classification with indefinite kernels to image classification using Earth Mover's Distance was discussed in [2]. Similarity measures for protein sequences such as the Smith-Waterman and BLAST scores are indefinite yet have provided hints for constructing useful positive semidefinite kernels such as those described in [3] or have been transformed into positive semidefinite kernels (see [4] for example). Here instead, our objective is to directly use these indefinite similarity measures for classification.\n\nOur work also closely follows recent results on kernel learning (see [5] or [6]), where the kernel matrix is learned as a linear combination of given kernels, and the resulting kernel is explicitly constrained to be positive semidefinite (the authors of [7] have adapted the SMO algorithm to solve the case where the kernel is written as a positively weighted combination of other kernels). 
In our case, however, we never explicitly optimize the kernel matrix because this part of the problem can be solved explicitly, which means that the complexity of our method is substantially lower than that of classical kernel learning methods and closer in spirit to the algorithm used in [8], which formulates the multiple kernel learning problem of [7] as a semi-infinite linear program and solves it with a column generation technique similar to the analytic center cutting plane method we use here.\n\nFinally, it is sometimes impossible to prove that some kernels satisfy Mercer's condition, or the numerical complexity of evaluating the exact positive semidefinite kernel is too high and a proxy (and not necessarily positive semidefinite) kernel has to be used instead (see [9] for example). In both cases, our method allows us to bypass these limitations.\n\n1.1 Current results\n\nSeveral methods have been proposed for dealing with indefinite kernels in SVMs. A first direction embeds data in a pseudo-Euclidean (pE) space: [10], for example, formulates the classification problem with an indefinite kernel as that of minimizing the distance between convex hulls formed from the two categories of data embedded in the pE space. The nonseparable case is handled in the same manner using reduced convex hulls (see [11] for a discussion of SVM geometric interpretations).\n\nAnother direction applies direct spectral transformations to indefinite kernels: flipping the negative eigenvalues or shifting the kernel's eigenvalues and reconstructing the kernel with the original eigenvectors in order to produce a positive semidefinite kernel (see [12] and [2]).\n\nYet another option is to reformulate either the maximum margin problem or its dual in order to use the indefinite kernel in a convex optimization problem (see [13]). 
An equivalent formulation of SVM with the same objective but where the kernel appears in the constraints can be modified into a convex problem by eliminating the kernel from the objective. Directly solving the nonconvex problem sometimes gives good results as well (see [14] and [10]).\n\n1.2 Contribution\n\nHere, instead of directly transforming the indefinite kernel, we simultaneously learn the support vector weights and a proxy positive semidefinite kernel matrix, while penalizing the distance between this proxy kernel and the original, indefinite one. Our main result is that the kernel learning part of that problem can be solved explicitly, meaning that the classification problem with indefinite kernels can simply be formulated as a perturbation of the positive semidefinite case.\n\nOur formulation can also be interpreted as a worst-case robust classification problem with uncertainty on the kernel matrix. In that sense, indefinite similarity matrices are seen as noisy observations of an unknown positive semidefinite kernel. From a complexity standpoint, while the original SVM classification problem with an indefinite kernel is nonconvex, the robustification we detail here is a convex problem, and hence can be solved efficiently with guaranteed complexity bounds.\n\nThe paper is organized as follows. In Section 2 we formulate our main classification problem and detail its interpretation as a robust SVM. In Section 3 we describe an algorithm for solving this problem. 
Finally, in Section 4, we test the numerical performance of these methods on various applications.\n\n2 SVM with indefinite kernels\n\nHere, we introduce our robustification of the SVM classification problem with indefinite kernels.\n\n2.1 Robust classification\n\nLet K ∈ S^n be a given kernel matrix and y ∈ R^n be the vector of labels, with Y = diag(y) the matrix with diagonal y, where S^n is the set of symmetric matrices of size n and R^n is the set of n-vectors of real numbers. We can write the dual of the SVM classification problem with hinge loss and quadratic penalty as:\n\nmaximize   α^T e − Tr(K(Yα)(Yα)^T)/2\nsubject to α^T y = 0, 0 ≤ α ≤ C    (1)\n\nin the variable α ∈ R^n and where e is an n-vector of ones. When K is positive semidefinite, this problem is a convex quadratic program. Suppose now that we are given an indefinite kernel matrix K_0 ∈ S^n. We formulate a robust version of problem (1) by restricting K to be a positive semidefinite kernel matrix in some given neighborhood of the original (indefinite) kernel matrix K_0:\n\nmax_{α^T y = 0, 0 ≤ α ≤ C}  min_{K ⪰ 0, ||K − K_0||_F^2 ≤ β}  α^T e − (1/2) Tr(K(Yα)(Yα)^T)    (2)\n\nin the variables K ∈ S^n and α ∈ R^n, where the parameter β > 0 controls the distance between the original matrix K_0 and the proxy kernel K. This can be interpreted as a worst-case robust classification problem with bounded uncertainty on the kernel matrix K. The above problem is infeasible for some values of β, so we replace here the hard constraint on K by a penalty on the distance between the proxy positive semidefinite kernel and the given indefinite matrix. 
The problem we solve is now:\n\nmax_{α^T y = 0, 0 ≤ α ≤ C}  min_{K ⪰ 0}  α^T e − (1/2) Tr(K(Yα)(Yα)^T) + ρ ||K − K_0||_F^2    (3)\n\nin the variables K ∈ S^n and α ∈ R^n, where the parameter ρ > 0 controls the magnitude of the penalty on the distance between K and K_0. The inner minimization problem is a convex conic program on K. Also, as the pointwise minimum of a family of concave quadratic functions of α, the solution to the inner problem is a concave function of α, and hence the outer optimization problem is also convex (see [15] for further details). Thus, (3) is a concave maximization problem subject to linear constraints and is therefore a convex problem in α.\n\nOur key result here is that the inner kernel learning optimization problem can be solved in closed form. For a fixed α, the inner minimization problem is equivalent to the following problem:\n\nminimize   ||K − (K_0 + (1/4ρ)(Yα)(Yα)^T)||_F^2\nsubject to K ⪰ 0\n\nin the variable K ∈ S^n. This is the projection of the matrix K_0 + (1/4ρ)(Yα)(Yα)^T on the cone of positive semidefinite matrices. The optimal solution to this problem is then given by:\n\nK* = (K_0 + (1/4ρ)(Yα)(Yα)^T)_+    (4)\n\nwhere X_+ is the positive part of the matrix X, i.e. X_+ = Σ_i max(0, λ_i) x_i x_i^T, where λ_i and x_i are the ith eigenvalue and eigenvector of the matrix X. Plugging this solution into (3), we get:\n\nmax_{α^T y = 0, 0 ≤ α ≤ C}  α^T e − (1/2) Tr(K*(Yα)(Yα)^T) + ρ ||K* − K_0||_F^2\n\nin the variable α ∈ R^n, where (Yα)(Yα)^T is the rank one matrix with coefficients y_i α_i α_j y_j, i, j = 1, . . . , n. We can rewrite this as an eigenvalue optimization problem by using the eigenvalue representation of X_+. 
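As an aside, the closed-form update (4) is straightforward to compute with one dense eigenvalue decomposition. The sketch below is ours (the function name `proxy_kernel` does not come from the paper, and this is an illustration, not the authors' Matlab implementation):

```python
import numpy as np

def proxy_kernel(K0, alpha, y, rho):
    """Closed-form optimal proxy kernel from equation (4):
    K* = (K0 + (1/(4*rho)) * (Y a)(Y a)^T)_+ ,
    i.e. the projection of the rank-one update onto the PSD cone."""
    Ya = y * alpha                        # Y*alpha, elementwise since Y = diag(y)
    M = K0 + np.outer(Ya, Ya) / (4.0 * rho)
    w, V = np.linalg.eigh(M)              # M is symmetric
    w_plus = np.maximum(w, 0.0)           # zero out the negative eigenvalues
    return (V * w_plus) @ V.T             # reassemble the positive part
```

As ρ grows, the rank-one term vanishes and the result approaches the positive part (K_0)_+ of the indefinite kernel, consistent with the limit discussed in Section 2.3.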
Letting the eigenvalue decomposition of K_0 + (1/4ρ)(Yα)(Yα)^T be V D V^T, we get K* = V D_+ V^T and, with v_i the ith column of V, we can write:\n\nTr(K*(Yα)(Yα)^T) = (Yα)^T V D_+ V^T (Yα) = Σ_{i=1}^n max(0, λ_i(K_0 + (1/4ρ)(Yα)(Yα)^T)) (α^T Y v_i)^2\n\nwhere λ_i(X) is the ith eigenvalue of the matrix X. Using the same technique, we can also rewrite the term ||K* − K_0||_F^2 using this eigenvalue decomposition. Our original optimization problem (3) finally becomes:\n\nmaximize   α^T e − (1/2) Σ_i max(0, λ_i(K_0 + (Yα)(Yα)^T/4ρ)) (α^T Y v_i)^2\n           + ρ Σ_i (max(0, λ_i(K_0 + (Yα)(Yα)^T/4ρ)))^2\n           − 2ρ Σ_i Tr((v_i v_i^T) K_0) max(0, λ_i(K_0 + (Yα)(Yα)^T/4ρ)) + ρ Tr(K_0 K_0)\nsubject to α^T y = 0, 0 ≤ α ≤ C    (5)\n\nin the variable α ∈ R^n.\n\n2.2 Dual problem\n\nBecause problem (3) is convex with at least one compact feasible set, we can formulate the dual problem to (5) by simply switching the max and the min. The inner maximization is a quadratic program in α, and hence has a quadratic program as its dual. We then get the dual by plugging this inner dual quadratic program into the outer minimization, to get the following problem:\n\nminimize   Tr(K^{-1}(Y(e − λ + μ + yν))(Y(e − λ + μ + yν))^T)/2 + C μ^T e + ρ ||K − K_0||_F^2\nsubject to K ⪰ 0, λ, μ ≥ 0    (6)\n\nin the variables K ∈ S^n, λ, μ ∈ R^n and ν ∈ R. This dual problem is a quadratic program in the variables λ and μ, which correspond to the primal constraints 0 ≤ α ≤ C, and ν, which is the dual variable for the constraint α^T y = 0. 
As we have seen earlier, any feasible solution to the primal problem produces a corresponding kernel in (4), and plugging this kernel into the dual problem in (6) allows us to calculate a dual feasible point by solving a quadratic program, which gives a dual objective value, i.e. an upper bound on the optimum of (5). This bound can then be used to compute a duality gap and track convergence.\n\n2.3 Interpretation\n\nWe noted that our problem can be viewed as a worst-case robust classification problem with uncertainty on the kernel matrix. Our explicit solution of the optimal worst-case kernel given in (4) is the projection of a penalized rank-one update to the indefinite kernel on the cone of positive semidefinite matrices. As ρ tends to infinity, the rank-one update has less effect and, in the limit, the optimal kernel is the kernel given by zeroing out the negative eigenvalues of the indefinite kernel. This means that if the indefinite kernel contains a very small amount of noise, the best positive semidefinite kernel to use with SVM in our framework is the positive part of the indefinite kernel.\n\nThis limit as ρ tends to infinity also motivates a heuristic for the transformation of the kernel on the testing set. Since the negative eigenvalues of the training kernel are thresholded to zero in the limit, the same transformation should occur for the test kernel. Hence, we update the entries of the full kernel corresponding to training instances by the rank-one update resulting from the optimal solution to (3) and threshold the negative eigenvalues of the full kernel matrix to zero. We then use the test kernel values from the resulting positive semidefinite matrix.\n\n3 Algorithms\n\nWe now detail two algorithms that can be used to solve problem (5). The optimization problem is the maximization of a nondifferentiable concave function subject to convex constraints. 
An optimal point always exists since the feasibility set is bounded and nonempty. For numerical stability, in both algorithms, we quadratically smooth our objective to calculate a gradient instead. We first describe a simple projected gradient method which has numerically cheap iterations but no convergence bound. We then show how to apply the much more efficient analytic center cutting plane method, whose iterations are slightly more complex but which converges linearly.\n\nSmoothing. Our objective contains terms of the form max{0, f(x)} for some function f(x) (described in the section below), which are not differentiable. These functions are easily smoothed out by a regularization technique (see [16] for example). We replace them by a continuously differentiable ε/2-approximation as follows:\n\nφ_ε(f(x)) = max_{0 ≤ u ≤ 1} (u f(x) − (ε/2) u^2)\n\nand the gradient is given by ∇φ_ε(f(x)) = u*(x) ∇f(x), where u*(x) = argmax_{0 ≤ u ≤ 1} (u f(x) − (ε/2) u^2).\n\nGradient. Calculating the gradient of our objective requires a full eigenvalue decomposition to compute the gradient of each eigenvalue. Given a matrix X(α), the derivative of the ith eigenvalue with respect to α is given by:\n\n∂λ_i(X(α))/∂α = v_i^T (∂X(α)/∂α) v_i    (7)\n\nwhere v_i is the ith eigenvector of X(α). We can then combine this expression with the smooth approximation above to get the gradient.\n\nWe note that eigenvalues of symmetric matrices are not differentiable when some of them have multiplicities greater than one (see [17] for a discussion). In practice however, most tested kernels were of full rank with distinct eigenvalues so we ignore this issue here. 
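The smoothed max above has a closed-form maximizer, u*(x) = min(max(f(x)/ε, 0), 1), which makes both the value and the gradient weight cheap to evaluate. A minimal sketch (the function name is ours, for illustration only):

```python
import numpy as np

def smooth_max0(t, eps):
    """Quadratic eps/2-smoothing of max(0, t):
    phi_eps(t) = max_{0 <= u <= 1} (u*t - (eps/2)*u**2).
    The maximizer is u* = clip(t/eps, 0, 1), which is also d(phi_eps)/dt."""
    u = np.clip(np.asarray(t, dtype=float) / eps, 0.0, 1.0)
    return u * t - 0.5 * eps * u**2, u
```

Note that for t ≤ 0 the value is exactly 0, for t ≥ ε it equals t − ε/2, and in between it is the quadratic t^2/(2ε), so the approximation error never exceeds ε/2.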
One may also consider projected subgradient methods, which are much slower, or use subgradients within the analytic center cutting plane method (which does not affect its complexity).\n\n3.1 Projected gradient method\n\nThe projected gradient method takes a steepest descent step, then projects the new point back onto the feasible region (see [18] for example). In order to use this method the objective function must be differentiable, and the method is only efficient if the projection step is numerically cheap. We choose an initial point α_0 ∈ R^n and the algorithm proceeds as follows:\n\nProjected gradient method\n\n1. Compute α_{i+1} = α_i + t ∇f(α_i).\n2. Set α_{i+1} = p_A(α_{i+1}).\n3. If gap ≤ ε stop, otherwise go back to step 1.\n\nThe complexity of each iteration breaks down as follows.\n\nStep 1. This requires an eigenvalue decomposition and costs O(n^3). We note that a line search would be costly because it would require multiple eigenvalue decompositions to recalculate the objective multiple times.\n\nStep 2. This is a projection onto the region A = {α : α^T y = 0, 0 ≤ α ≤ C} and can be solved explicitly by sorting the vector of entries, with cost O(n log n).\n\nStopping Criterion. We can compute a duality gap using the results of Section 2.2: let K_i = (K_0 + (Yα_i)(Yα_i)^T/4ρ)_+ (the kernel at iteration i); then solving problem (1), which is just an SVM with the convex kernel K_i, produces an upper bound on (5), and hence a bound on the suboptimality of the current solution.\n\nComplexity. 
The number of iterations required by this method to reach a target precision of ε is typically in O(1/ε^2).\n\n3.2 Analytic center cutting plane method\n\nThe analytic center cutting plane method (ACCPM) reduces the feasible region on each iteration using a new cut of the feasible region computed by evaluating a subgradient of the objective function at the analytic center of the current set, until the volume of the reduced region converges to the target precision. This method does not require differentiability. We set A_0 = {α : α^T y = 0, 0 ≤ α ≤ C}, which we can write as {α : A_0 α ≤ b_0}, to be our first localization set for the optimal solution. The method then works as follows (see [18] for a more complete reference on cutting plane methods):\n\nAnalytic center cutting plane method\n\n1. Compute α_{i+1} as the analytic center of A_i by solving:\n\nα_{i+1} = argmin_{y ∈ R^n} − Σ_{j=1}^m log(b_j − a_j^T y)\n\nwhere a_j^T represents the jth row of coefficients from the left-hand side of the inequalities describing A_i.\n\n2. Compute ∇f(α_{i+1}) at the center α_{i+1} and update the (polyhedral) localization set:\n\nA_{i+1} = A_i ∪ {α : ∇f(α_{i+1})^T (α − α_{i+1}) ≥ 0}\n\n3. If gap ≤ ε stop, otherwise go back to step 1.\n\nThe complexity of each iteration breaks down as follows.\n\nStep 1. This step computes the analytic center of a polyhedron and can be solved in O(n^3) operations using interior point methods, for example.\n\nStep 2. This simply updates the polyhedral description.\n\nStopping Criterion. An upper bound is computed by maximizing a first order Taylor approximation of f(α) at α_{i+1} over all points in an ellipsoid that covers A_i, which can be done explicitly.\n\nComplexity. ACCPM is provably convergent in O(n log(1/ε)^2) iterations when using cut elimination, which keeps the complexity of the localization set bounded. 
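For concreteness, the projection p_A onto A = {α : α^T y = 0, 0 ≤ α ≤ C} used in Step 2 of the projected gradient method above admits a simple implementation. The sketch below uses bisection on the equality-constraint multiplier rather than the O(n log n) sorting scheme the paper mentions (names and structure are ours, an illustration only):

```python
import numpy as np

def project_onto_A(z, y, C, iters=100):
    """Euclidean projection of z onto {a : y^T a = 0, 0 <= a <= C}.
    KKT conditions give a_i = clip(z_i - nu*y_i, 0, C) for some multiplier nu,
    and g(nu) = y^T a is nonincreasing in nu, so bisection finds the root."""
    g = lambda nu: y @ np.clip(z - nu * y, 0.0, C)
    bound = np.abs(z).max() + C
    lo, hi = -bound, bound          # for labels in {-1,+1}: g(lo) >= 0 >= g(hi)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return np.clip(z - 0.5 * (lo + hi) * y, 0.0, C)
```

Each call costs O(n) per bisection step, which is cheap relative to the O(n^3) eigenvalue decomposition in Step 1.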
Other schemes are available with slightly different complexities: an O(n^2/ε^2) complexity is achieved in [19] using (cheaper) approximate centers, for example.\n\n4 Experiments\n\nIn this section we compare the generalization performance of our technique to other methods of applying SVM classification given an indefinite similarity measure. We also test SVM classification performance on positive semidefinite kernels using the LIBSVM library. We finish with experiments showing convergence of our algorithms. Our algorithms were implemented in Matlab.\n\n4.1 Generalization\n\nWe compare our method for SVM classification with indefinite kernels to several of the kernel preprocessing techniques discussed earlier. The first three techniques perform spectral transformations on the indefinite kernel. The first, denoted denoise, thresholds the negative eigenvalues to zero. The second transformation, called flip, takes the absolute value of all eigenvalues. The last transformation, shift, adds a constant to each eigenvalue, making them all positive. See [12] for further details. We finally also compare with using SVM on the original indefinite kernel (SVM converges but the solution is only a stationary point and is not guaranteed to be optimal).\n\nWe experiment on data from the USPS handwritten digits database (described in [20]) using the indefinite Simpson score (SS) to compare two digits and on two data sets from the UCI repository (see [21]) using the indefinite Epanechnikov (EP) kernel. The data is randomly divided into training and testing data. We apply 5-fold cross validation and use an accuracy measure (described below) to determine the optimal parameters C and ρ. 
We then train a model with the full training set and optimal parameters and test on the independent test set.\n\nTable 1: Statistics for various data sets.\n\nData Set    | # Train | # Test | λ_min     | λ_max\nUSPS-3-5-SS | 767     | 773    | -34.76    | 453.58\nUSPS-4-6-SS | 829     | 857    | -37.30    | 413.17\nDiabetes-EP | 614     | 154    | -0.27     | 18.17\nLiver-EP    | 276     | 69     | -1.38e-15 | 3.74\n\nTable 1 provides statistics including the minimum and maximum eigenvalues of the training kernels. The main observation is that the USPS data uses highly indefinite kernels while the UCI data use kernels that are nearly positive semidefinite. Table 2 displays performance by comparing accuracy and recall. Accuracy is defined as the percentage of total instances predicted correctly. Recall is the percentage of true positives that were correctly predicted positive.\n\nOur method is referred to as Indefinite SVM. We see that our method performs favorably on the USPS data. Both measures of performance are quite high for all methods. Our method does not perform as well on the UCI data sets but is still favorable on one of the measures in each experiment. Notice though that recall is not good in the liver data set overall, which could be the result of overfitting one of the classification categories. The liver data set uses a kernel that is almost positive semidefinite - this is an example where the input is almost a true kernel and Indefinite SVM finds one slightly better. 
We postulate that our method will perform better in situations where the similarity measure is highly indefinite, as in the USPS data sets, while measures that are almost positive semidefinite may be seen as having a small amount of noise.\n\nTable 2: Performance measures for various data sets.\n\nData Set    | Measure  | Denoise | Flip  | Shift | SVM   | Indefinite SVM\nUSPS-3-5-SS | Accuracy | 95.47   | 95.73 | 90.43 | 74.90 | 96.25\nUSPS-3-5-SS | Recall   | 94.50   | 95.45 | 92.11 | 72.73 | 96.65\nUSPS-4-6-SS | Accuracy | 97.78   | 97.90 | 94.28 | 90.08 | 97.90\nUSPS-4-6-SS | Recall   | 98.42   | 98.65 | 93.68 | 88.49 | 98.87\nDiabetes-EP | Accuracy | 75.32   | 74.68 | 68.83 | 75.32 | 68.83\nDiabetes-EP | Recall   | 90.00   | 90.00 | 92.00 | 90.00 | 95.00\nLiver-EP    | Accuracy | 63.77   | 63.77 | 57.97 | 63.77 | 65.22\nLiver-EP    | Recall   | 22.58   | 22.58 | 25.81 | 22.58 | 22.58\n\n4.2 Algorithm Convergence\n\nWe ran our two algorithms on data sets created by randomly perturbing the four USPS data sets used above. The average results, with one standard deviation above and below the mean, are displayed in Figure 1 with the duality gap in log scale (note that the codes were not stopped here and that the target gap improvement is usually much smaller than 10^-8). As expected, ACCPM converges much faster (in fact linearly) to a higher precision, while each iteration requires solving a linear program of size n. 
The gradient projection method converges faster in the beginning but stalls at a higher precision; however, each iteration only requires sorting the current point.\n\nFigure 1: Convergence plots for ACCPM (left) & projected gradient method (right) on randomly perturbed USPS data sets (average duality gap versus iteration number, dashed lines at plus and minus one standard deviation).\n\n5 Conclusion\n\nWe have proposed a technique for incorporating indefinite kernels into the SVM framework without any explicit transformations. We have shown that if we view the indefinite kernel as a noisy instance of a true kernel, we can learn an explicit solution for the optimal kernel with a tractable convex optimization problem. We give two convergent algorithms for solving this problem on relatively large data sets. Our initial experiments show that our method can at least fare comparably with other methods handling indefinite kernels in the SVM framework but provides a much clearer interpretation for these heuristics.\n\nReferences\n\n[1] C. S. Ong, X. Mary, S. Canu, and A. J. Smola. Learning with non-positive kernels. Proceedings of the 21st International Conference on Machine Learning, 2004.\n\n[2] A. Zamolotskikh and P. Cunningham. An assessment of alternative strategies for constructing EMD-based kernel functions for use in an SVM for image classification. Technical Report UCD-CSI-2007-3, 2004.\n\n[3] H. Saigo, J. P. Vert, N. Ueda, and T. Akutsu. Protein homology detection using string alignment kernels. Bioinformatics, 20(11):1682-1689, 2004.\n\n[4] G. R. G. Lanckriet, N. Cristianini, M. I. Jordan, and W. S. Noble. 
Kernel-based integration of genomic data using semidefinite programming. 2003. citeseer.ist.psu.edu/648978.html.\n\n[5] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27-72, 2004.\n\n[6] C. S. Ong, A. J. Smola, and R. C. Williamson. Learning the kernel with hyperkernels. Journal of Machine Learning Research, 6:1043-1071, 2005.\n\n[7] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. Proceedings of the 21st International Conference on Machine Learning, 2004.\n\n[8] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531-1565, 2006.\n\n[9] M. Cuturi. Permanents, transport polytopes and positive definite kernels on histograms. Proceedings of the Twentieth International Joint Conference on Artificial Intelligence, 2007.\n\n[10] B. Haasdonk. Feature space interpretation of SVMs with indefinite kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(4), 2005.\n\n[11] K. P. Bennett and E. J. Bredensteiner. Duality and geometry in SVM classifiers. Proceedings of the 17th International Conference on Machine Learning, pages 57-64, 2000.\n\n[12] G. Wu, E. Y. Chang, and Z. Zhang. An analysis of transformation on non-positive semidefinite similarity matrix for kernel machines. Proceedings of the 22nd International Conference on Machine Learning, 2005.\n\n[13] H.-T. Lin and C.-J. Lin. A study on sigmoid kernel for SVM and the training of non-PSD kernels by SMO-type methods. 2003.\n\n[14] A. Woźnica, A. Kalousis, and M. Hilario. Distances and (indefinite) kernels for sets of objects. 
Proceedings of the 6th International Conference on Data Mining, pages 1151-1156, 2006.\n\n[15] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n\n[16] C. Gigola and S. Gomez. A regularization method for solving the finite convex min-max problem. SIAM Journal on Numerical Analysis, 27(6):1621-1634, 1990.\n\n[17] M. Overton. Large-scale optimization of eigenvalues. SIAM Journal on Optimization, 2(1):88-120, 1992.\n\n[18] D. Bertsekas. Nonlinear Programming, 2nd Edition. Athena Scientific, 1999.\n\n[19] J.-L. Goffin and J.-P. Vial. Convex nondifferentiable optimization: A survey focused on the analytic center cutting plane method. Optimization Methods and Software, 17(5):805-867, 2002.\n\n[20] J. J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5), 1994.\n\n[21] A. Asuncion and D. J. Newman. UCI Machine Learning Repository. School of Information and Computer Sciences, University of California, Irvine, 2007. http://www.ics.uci.edu/~mlearn/MLRepository.html.\n", "award": [], "sourceid": 50, "authors": [{"given_name": "Ronny", "family_name": "Luss", "institution": null}, {"given_name": "Alexandre", "family_name": "D'aspremont", "institution": null}]}