{"title": "Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1641, "page_last": 1648, "abstract": null, "full_text": "Nonparametric Transforms of Graph Kernels\n\nfor Semi-Supervised Learning\n\nXiaojin Zhu\u2020\n\nJaz Kandola\u2021 Zoubin Ghahramani\u2021\u2020\n\nJohn Lafferty\u2020\n\n\u2020School of Computer Science\nCarnegie Mellon University\n\n5000 Forbes Avenue\n\nPittsburgh, PA 15213 USA\n\n\u2021Gatsby Computational Neuroscience Unit\n\nUniversity College London\n\n17 Queen Square\n\nLondon WC1N 3AR UK\n\nAbstract\n\nWe present an algorithm based on convex optimization for constructing\nkernels for semi-supervised learning. The kernel matrices are derived\nfrom the spectral decomposition of graph Laplacians, and combine la-\nbeled and unlabeled data in a systematic fashion. Unlike previous work\nusing diffusion kernels and Gaussian random \ufb01eld kernels, a nonpara-\nmetric kernel approach is presented that incorporates order constraints\nduring optimization. This results in \ufb02exible kernels and avoids the need\nto choose among different parametric forms. Our approach relies on\na quadratically constrained quadratic program (QCQP), and is compu-\ntationally feasible for large datasets. We evaluate the kernels on real\ndatasets using support vector machines, with encouraging results.\n\n1 Introduction\nSemi-supervised learning has been the focus of considerable recent research. In this learn-\ning problem the data consist of a set of points, with some of the points labeled and the\nremaining points unlabeled. The task is to use the unlabeled data to improve classi\ufb01cation\nperformance. 
Semi-supervised methods have the potential to improve many real-world problems, since unlabeled data are often far easier to obtain than labeled data.\n\nKernel-based methods are increasingly being used for data modeling and prediction because of their conceptual simplicity and good performance on many tasks. A promising family of semi-supervised learning methods can be viewed as constructing kernels by transforming the spectrum of a “local similarity” graph over labeled and unlabeled data. These kernels, or regularizers, penalize functions that are not smooth over the graph [7]. Informally, a smooth eigenvector has the property that two elements of the vector have similar values if there are many large-weight paths between them on the graph. This results in the desirable behavior of the labels varying smoothly over the graph, as sought by, e.g., spectral clustering approaches [2], diffusion kernels [5], and the Gaussian random field approach [9]. However, the modification to the spectrum, called a spectral transformation, is often a function chosen from some parameterized family. For example, for the diffusion kernel the spectral transformation is an exponential function, and for the Gaussian field kernel the transformation is a smoothed inverse function.\n\nIn using a parametric approach one faces the difficult problem of choosing an appropriate family of spectral transformations. For many families the number of degrees of freedom in the parameterization may be insufficient to accurately model the data. In this paper we propose an effective nonparametric method to find an optimal spectral transformation using kernel alignment. The main advantage of using kernel alignment is that it gives us a convex optimization problem, and does not suffer from poor convergence to local minima. 
A key assumption of a spectral transformation is monotonicity, so that unsmooth functions over the data graph are penalized more severely. We realize this property by imposing order constraints. The optimization problem in general is solved using semi-definite programming (SDP) [1]; however, in our approach the problem can be formulated in terms of quadratically constrained quadratic programming (QCQP), which can be solved more efficiently than a general SDP.\n\nThis paper is structured as follows. In Section 2 we review some graph-theoretic concepts and relate them to the construction of kernels for semi-supervised learning. In Section 3 we introduce convex optimization via QCQP and relate it to the more familiar linear and quadratic programming commonly used in machine learning. Section 4 poses the problem of kernel-based semi-supervised learning as a QCQP with order constraints. Experimental results using the proposed optimization framework are presented in Section 5. The results indicate that the semi-supervised kernels constructed from the learned spectral transformations perform well in practice.\n\n2 Semi-supervised Kernels from Graph Spectra\n\nWe are given a labeled dataset consisting of input-output pairs {(x1, y1), . . . , (xl, yl)} and a (typically much larger) unlabeled dataset {xl+1, . . . , xn}, where x is in some general input space and y is potentially from multiple classes. Our objective is to construct a kernel that is appropriate for the classification task. Since our methods use both the labeled and unlabeled data, we will refer to the resulting kernels as semi-supervised kernels. More specifically, we restrict ourselves to the transductive setting, where the unlabeled data also serve as the test data. As such, we only need to find a good Gram matrix on the points {x1, . . . , xn}. 
For this approach to be effective, such kernel matrices must also take into account the distribution of unlabeled data, so that the unlabeled data can aid in the classification task. Once these kernel matrices have been constructed, they can be deployed in standard kernel methods, for example support vector machines.\n\nIn this paper we motivate the construction of semi-supervised kernel matrices from a graph-theoretic perspective. A graph is constructed where the nodes are the data instances {1, . . . , n} and an edge connects nodes i and j if a “local similarity” measure between xi and xj suggests they may have the same label. For example, the local similarity measure can be the Euclidean distance between feature vectors if x ∈ Rm, and each node can connect to its k nearest neighbors with weight value equal to 1. The intuition underlying the graph is that even if two nodes are not directly connected, they should be considered similar as long as there are many paths between them. Several semi-supervised learning algorithms have been proposed under this general graph-theoretic theme, based on techniques such as random walks [8], diffusion kernels [5], and Gaussian fields [9]. Many of these methods can be unified into the regularization framework proposed by [7], which forms the basis of this paper.\n\nThe graph can be represented by an n × n weight matrix W = [wij], where wij is the edge weight between nodes i and j, with wij = 0 if there is no edge. We require the entries of W to be non-negative, and assume that it forms a symmetric matrix; it is not necessary for W itself to be positive semi-definite. In semi-supervised learning W is an essential quantity; we assume it is provided by domain experts, and hence do not study its construction. Let D be a diagonal matrix where dii = Σj wij is the degree of node i. 
This allows us to define the combinatorial graph Laplacian as L = D − W (the normalized Laplacian L̃ = D^{-1/2} L D^{-1/2} can be used as well). We denote L's eigensystem by {λi, φi}, so that L = Σ_{i=1}^n λi φi φi^T, where we assume the eigenvalues are sorted in non-decreasing order. The matrix L has many interesting properties [3]; for instance, it is always positive semi-definite, even if W is not. Perhaps the most important property of the Laplacian related to semi-supervised learning is the following: a smaller eigenvalue λ corresponds to a smoother eigenvector φ over the graph; that is, the value Σij wij (φ(i) − φ(j))² is small. In a physical system the smoother eigenvectors correspond to the major vibration modes. Assuming the graph structure is correct, from a regularization perspective we want to encourage smooth functions, to reflect our belief that labels should vary slowly over the graph. Specifically, [2] and [7] suggest a general principle for creating a semi-supervised kernel K from the graph Laplacian L: transform the eigenvalues λ into r(λ), where the spectral transformation r is a non-negative and decreasing function¹:\n\nK = Σ_{i=1}^n r(λi) φi φi^T    (1)\n\nNote that r may reverse the order of the eigenvalues, so that smooth φi's have larger eigenvalues in K. A “soft labeling” function f = Σi ci φi in a kernel machine has a penalty term in the RKHS norm given by Ω(‖f‖²_K) = Ω(Σi ci²/r(λi)). Since r is decreasing, a greater penalty is incurred for those terms of f corresponding to eigenfunctions that are less smooth. 
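The Laplacian construction and the spectral transform of equation (1) can be sketched in a few lines of numpy. This is an illustrative sketch, not the authors' code: the 4-node chain graph and the particular decreasing transform r(λ) = 1/(λ + 0.01) are stand-in choices.

```python
import numpy as np

# Toy weight matrix W for a 4-node chain graph (illustrative choice).
W = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])

D = np.diag(W.sum(axis=1))     # degree matrix, d_ii = sum_j w_ij
L = D - W                      # combinatorial graph Laplacian

# eigh returns eigenvalues sorted in non-decreasing order, as assumed in the text.
lam, phi = np.linalg.eigh(L)

# One illustrative non-negative, decreasing spectral transformation.
def r(x):
    return 1.0 / (x + 0.01)

# K = sum_i r(lambda_i) phi_i phi_i^T, built by scaling the eigenvector columns.
K = (phi * r(lam)) @ phi.T
```

Since φi^T L φi = λi and Σij wij (φ(i) − φ(j))² = 2 φi^T L φi, the sorted eigenvalues directly rank the eigenvectors by the smoothness penalty above.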
In previous work r has often been chosen from a parametric family. For example, the diffusion kernel [5] corresponds to r(λ) = exp(−σ²λ/2) and the Gaussian field kernel [10] corresponds to r(λ) = 1/(λ + ε). Cross validation has been used to find the hyperparameters σ or ε for these spectral transformations. Although the general principle of equation (1) is appealing, it does not address the question of which parametric family to use for r. Moreover, the number of degrees of freedom (or the number of hyperparameters) may not suit the task at hand, resulting in overly constrained kernels. The contribution of the current paper is to address these limitations using a convex optimization approach, by imposing an ordering constraint on r but otherwise not assuming any parametric form for the kernels.\n\n3 Convex Optimization using QCQP\n\nLet Ki = φi φi^T, i = 1, . . . , n, be the outer product matrices of the eigenvectors. The semi-supervised kernel K is a linear combination K = Σ_{i=1}^n μi Ki, where μi ≥ 0. We formulate the problem of finding the spectral transformation as one that finds the interpolation coefficients {r(λi) = μi} by optimizing some convex objective function on K. To maintain the positive semi-definiteness constraint on K, one in general needs to invoke SDPs [1]. Semi-definite optimization can be described as the problem of optimizing a linear function of a symmetric matrix subject to linear equality constraints and the condition that the matrix be positive semi-definite. 
The well-known linear programming problem can be generalized to a semi-definite optimization by replacing the vector of variables with a symmetric matrix, and replacing the non-negativity constraints with a positive semi-definite constraint. This generalization inherits several properties: it is convex, has a rich duality theory, and allows theoretically efficient solution algorithms based on iterating interior point methods to either follow a central path or decrease a potential function. However, a limitation of SDPs is their computational complexity [1], which has restricted their application to small-scale problems [6]. An important special case of SDPs are quadratically constrained quadratic programs (QCQP), which are computationally more efficient. Here both the objective function and the constraints are quadratic, as illustrated below:\n\nminimize  (1/2) x^T P0 x + q0^T x + r0    (2)\nsubject to  (1/2) x^T Pi x + qi^T x + ri ≤ 0,  i = 1, . . . , m    (3)\nAx = b    (4)\n\nwhere Pi ∈ S^n_+, i = 1, . . . , m, and S^n_+ denotes the set of square symmetric positive semi-definite matrices. In a QCQP, we minimize a convex quadratic function over a feasible region that is the intersection of ellipsoids. The number of iterations required to reach the solution is comparable to the number required for linear programs, making the approach feasible for large datasets. However, as observed in [1], not all SDPs can be relaxed to QCQPs. \n\n¹We use a slightly different notation where r is the inverse of that in [7].
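As a concrete instance of the form (2)-(4), the following sketch solves a tiny QCQP: minimize ‖x‖² over a single ellipsoidal (here, disk) constraint. scipy's SLSQP solver is an illustrative stand-in for the interior point methods discussed above, and the problem data are made up.

```python
import numpy as np
from scipy.optimize import minimize

# minimize   (1/2) x^T P0 x          with P0 = 2I, i.e. ||x||^2
# subject to (x1 - 1)^2 + (x2 - 1)^2 <= 1/4   (one quadratic constraint)
P0 = 2.0 * np.eye(2)

def objective(x):
    return 0.5 * x @ P0 @ x

def constraint(x):
    # SLSQP inequality convention: feasible iff fun(x) >= 0
    return 0.25 - np.sum((x - 1.0) ** 2)

res = minimize(objective, x0=np.array([1.0, 1.0]),
               constraints=[{"type": "ineq", "fun": constraint}],
               method="SLSQP")

# The optimum is the point of the disk closest to the origin,
# x* = (1 - 0.5/sqrt(2)) * (1, 1).
x_star = (1.0 - 0.5 / np.sqrt(2.0)) * np.ones(2)
```

The feasible region is a single ellipsoid (a disk), so the minimizer lies on its boundary along the line from the origin to the disk center.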
For the semi-supervised kernel learning task presented here, solving an SDP would be computationally infeasible.\n\nRecent work [4, 6] has proposed kernel target alignment, which can be used not only to assess the relationship between the feature spaces generated by two different kernels, but also to assess the similarity between the space induced by a kernel and that induced by the labels themselves. Desirable properties of the alignment measure can be found in [4]. The crucial aspect of alignment for our purposes is that its optimization can be formulated as a QCQP. The objective function is the empirical kernel alignment score:\n\nÂ(Ktr, T) = ⟨Ktr, T⟩_F / √(⟨Ktr, Ktr⟩_F ⟨T, T⟩_F)    (5)\n\nwhere Ktr is the kernel matrix restricted to the training points, ⟨M, N⟩_F denotes the Frobenius product between two square matrices, ⟨M, N⟩_F = Σij mij nij = Tr(M N^T), and T is the target matrix on the training data, with entry Tij set to +1 if yi = yj and −1 otherwise. Note that for binary {+1, −1} training labels y this is simply the rank-one matrix T = y y^T. K is guaranteed to be positive semi-definite by constraining μi ≥ 0. Previous work using kernel alignment did not take into account that the Ki's were derived from the graph Laplacian with the goal of semi-supervised learning. As such, the μi's can take arbitrary values and there is no preference to penalize components that do not vary smoothly over the graph. This can be rectified by requiring smoother eigenvectors to receive larger coefficients, as shown in the next section.\n\n4 Semi-Supervised Kernels with Order Constraints\n\nAs stated above, we would like to maintain a decreasing order on the spectral transformation μi = r(λi) to encourage smooth functions over the graph. 
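Equation (5) is straightforward to compute directly. The sketch below (illustrative, with made-up labels) evaluates the alignment score for the binary case T = y y^T.

```python
import numpy as np

def alignment(K_tr, T):
    """Empirical kernel alignment score of equation (5):
    <K_tr, T>_F / sqrt(<K_tr, K_tr>_F <T, T>_F)."""
    frob = lambda A, B: np.sum(A * B)   # Frobenius product <A,B>_F
    return frob(K_tr, T) / np.sqrt(frob(K_tr, K_tr) * frob(T, T))

# Binary {+1,-1} labels: T = y y^T has T_ij = +1 iff y_i = y_j.
y = np.array([1.0, 1.0, -1.0, -1.0])
T = np.outer(y, y)

print(alignment(T, T))          # a kernel perfectly matched to the labels scores 1
print(alignment(np.eye(4), T))  # the identity kernel scores trace(T)/8 = 0.5 here
```

Note that the score is invariant to positive rescaling of K_tr, which is why Definition 1 below needs a trace constraint to fix the scale.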
This motivates the set of order constraints\n\nμi ≥ μi+1,  i = 1, . . . , n − 1    (6)\n\nand we can specify the desired semi-supervised kernel as follows.\n\nDefinition 1 An order constrained semi-supervised kernel K is the solution to the following convex optimization problem:\n\nmax_K  Â(Ktr, T)    (7)\nsubject to  K = Σ_{i=1}^n μi Ki    (8)\nμi ≥ 0    (9)\ntrace(K) = 1    (10)\nμi ≥ μi+1,  i = 1, . . . , n − 1    (11)\n\nwhere T is the training target matrix, Ki = φi φi^T, and the φi's are the eigenvectors of the graph Laplacian.\n\nThe formulation is an extension to [6] with order constraints, and with special components Ki from the graph Laplacian. Since μi ≥ 0 and the Ki's are outer products, K will automatically be positive semi-definite and hence a valid kernel matrix. The trace constraint is needed to fix the scale invariance of kernel alignment. It is important to notice that the order constraints are convex, and as such the whole problem is convex. Let vec(A) be the column vectorization of a matrix A. Defining M = [vec(K1,tr) · · · vec(Km,tr)], it is not hard to show that the problem can then be expressed as\n\nmax_μ  vec(T)^T M μ    (12)\nsubject to  ‖M μ‖ ≤ 1    (13)\nμi ≥ μi+1,  i = 1, . . . , n − 1    (14)\nμi ≥ 0    (15)\n\nThe objective function is linear in μ, and there is a simple cone constraint, making it a quadratically constrained quadratic program (QCQP).\n\nAn improvement of the above order constrained semi-supervised kernel can be obtained by studying the Laplacian eigenvectors with zero eigenvalues. For a graph Laplacian there will be k zero eigenvalues if the graph has k connected subgraphs. The k eigenvectors are piecewise constant over individual subgraphs, and zero elsewhere. 
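The program (12)-(15) is small enough to sketch end to end. The toy graph, the labeled subset, and the use of scipy's SLSQP in place of the SeDuMi solver used in the paper are all illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

# Toy graph: two triangles joined by one bridge edge (illustrative).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
W = np.zeros((n, n))
for a, b in edges:
    W[a, b] = W[b, a] = 1.0
L = np.diag(W.sum(1)) - W
lam, phi = np.linalg.eigh(L)

tr = [0, 1, 3, 4]                        # indices of the labeled points
y = np.array([1.0, 1.0, -1.0, -1.0])
T = np.outer(y, y)                       # target matrix on the training data

# K_i = phi_i phi_i^T; M stacks vec(K_i,tr) as columns.
K_full = [np.outer(phi[:, i], phi[:, i]) for i in range(n)]
M = np.stack([Ki[np.ix_(tr, tr)].ravel() for Ki in K_full], axis=1)
t = T.ravel()

objective = lambda mu: -t @ (M @ mu)     # maximize vec(T)^T M mu
cons = [{"type": "ineq", "fun": lambda mu: 1.0 - np.linalg.norm(M @ mu)}]
cons += [{"type": "ineq", "fun": (lambda i: lambda mu: mu[i] - mu[i + 1])(i)}
         for i in range(n - 1)]          # order constraints mu_i >= mu_{i+1}

res = minimize(objective, np.full(n, 0.1), constraints=cons,
               bounds=[(0.0, None)] * n, method="SLSQP")
mu = res.x
K = sum(m * Ki for m, Ki in zip(mu, K_full))   # the learned semi-supervised kernel
```

Because μi ≥ 0 and each Ki is an outer product, the resulting K is positive semi-definite by construction; the norm constraint ‖Mμ‖ ≤ 1 fixes the scale in place of the trace constraint of Definition 1.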
This is desirable when k > 1, with the hope that the subgraphs correspond to different classes. However if k = 1, the graph is connected and the first eigenvector φ1 is a constant vector. The corresponding K1 is then a constant matrix, and acts as a bias term. In this situation we do not want to impose the order constraint μ1 ≥ μ2 on the constant bias term. Instead we let μ1 vary freely during optimization.\n\nDefinition 2 An improved order constrained semi-supervised kernel K is the solution to the same problem as in Definition 1, but the order constraints (11) apply only to non-constant eigenvectors:\n\nμi ≥ μi+1,  i = 1, . . . , n − 1, and φi not constant    (16)\n\nIn practice we do not need all n eigenvectors of the graph Laplacian, or equivalently all n Ki's. The first m < n eigenvectors with the smallest eigenvalues work well empirically. Also note that we could have used the fact that the Ki's come from orthogonal eigenvectors φi to further simplify the expression. However, we neglect this observation, making it easier to incorporate other kernel components if necessary.\n\nIt is instructive to compare and contrast the order constrained semi-supervised kernels to other semi-supervised kernels with different spectral transformations. We call the original kernel alignment solution in [6] a maximal-alignment kernel. It is the solution to Definition 1 without the order constraints (11). Because it does not have the additional constraints, it maximizes kernel alignment among all spectral transformations. The hyperparameters σ and ε of the diffusion kernel and Gaussian field kernel (described earlier) can also be learned by maximizing the alignment score, although the optimization problem is not necessarily convex. These kernels use different information from the original Laplacian eigenvalues λi. The maximal-alignment kernels ignore λi altogether. 
The order constrained semi-supervised kernels use only the order of the λi and ignore their actual values; the diffusion and Gaussian field kernels use the actual values. In terms of the degrees of freedom in choosing the spectral transformation μi, the maximal-alignment kernels are completely free, while the diffusion and Gaussian field kernels are restrictive, since they have an implicit parametric form and only one free parameter. The order constrained semi-supervised kernels incorporate desirable features from both approaches.\n\n5 Experimental Results\n\nWe evaluate the order constrained kernels on seven datasets. baseball-hockey (1993 instances / 2 classes), pc-mac (1943/2) and religion-atheism (1427/2) are document categorization tasks taken from the 20-newsgroups dataset. The distance measure is the standard cosine similarity between tf.idf vectors. one-two (2200/2), odd-even (4000/2) and ten digits (4000/10) are handwritten digit recognition tasks. one-two is digits “1” vs. “2”; odd-even is the artificial task of classifying odd “1, 3, 5, 7, 9” vs. even “0, 2, 4, 6, 8” digits, such that each class has several well-defined internal clusters; ten digits is 10-way classification. isolet (7797/26) is isolated spoken English alphabet recognition from the UCI repository. For these datasets we use Euclidean distance on raw features. We use 10NN unweighted graphs on all datasets except isolet, which uses 100NN. For all datasets, we use the smallest m = 200 eigenvalue and eigenvector pairs from the graph Laplacian. These values are set arbitrarily without optimizing and do not create an unfair advantage for the proposed kernels. For each dataset we test five different labeled set sizes. For a given labeled set size, we perform 30 random trials in which a labeled set is randomly sampled from the whole dataset. All classes must be present in the labeled set. 
The rest is used as the unlabeled (test) set in that trial. We compare 5 semi-supervised kernels (improved order constrained kernel, order constrained kernel, Gaussian field kernel, diffusion kernel², and maximal-alignment kernel), and 3 standard supervised kernels (RBF (bandwidth learned using 5-fold cross validation), linear, and quadratic). We compute the spectral transformation for the order constrained kernels and maximal-alignment kernels by solving the QCQP using standard solvers (SeDuMi/YALMIP). To compute accuracy we use a standard SVM. We choose the bound on slack variables C with cross validation for all tasks and kernels. For multiclass classification we perform one-against-all and pick the class with the largest margin.\n\nThe results³ are shown in Table 1, which has two rows for each cell: the upper row is the average test set accuracy with one standard deviation; the lower row is the average training set kernel alignment, and in parentheses the average run time in seconds for SeDuMi/YALMIP on a 3GHz Linux computer. Each number is averaged over 30 random trials. To assess the statistical significance of the results, we perform a paired t-test on test accuracy. We highlight the best accuracy in each row, and those that cannot be determined as different from the best by a paired t-test at significance level 0.05. The semi-supervised kernels tend to outperform the standard supervised kernels. The improved order constrained kernels are consistently among the best. Figure 1 shows the spectral transformation μi of the semi-supervised kernels for different tasks. These are for the 30 trials with the largest labeled set size in each task. The x-axis is in increasing order of λi (the original eigenvalues of the Laplacian). The mean (thick lines) and ±1 standard deviation (dotted lines) of only the top 50 μi's are plotted for clarity. 
The μi values are scaled vertically for easy comparison among kernels. As expected, the maximal-alignment kernels' spectral transformation is zigzagged, the diffusion and Gaussian field kernels' are very smooth, while the order constrained kernels' are in between. The order constrained kernels have a large μ1 because of the order constraint. This seems to be disadvantageous: the spectral transformation tries to balance it out by increasing the value of the other μi's so that the constant K1's relative influence is smaller. On the other hand, the improved order constrained kernels allow μ1 to be small. As a result the remaining μi's decay fast, which is desirable.\n\n6 Conclusions\n\nWe have proposed and evaluated a novel approach for semi-supervised kernel construction using convex optimization. The method incorporates order constraints, and the resulting convex optimization problem can be solved efficiently using a QCQP. In this work the base kernels were derived from the graph Laplacian, and no parametric form for the spectral transformation was imposed, making the approach more general than previous approaches. Experiments show that the method is both computationally feasible and results in improvements to classification performance when used with support vector machines.\n\n²The hyperparameters σ² and ε are learned with the fminbnd() function in Matlab to maximize kernel alignment.\n\n³Results on baseball-hockey and odd-even are similar and omitted for space. Full results can be found at http://www.cs.cmu.edu/~zhuxj/pub/ocssk.pdf\n\n[Figure 1: Comparison of spectral transformations for the 5 semi-supervised kernels (Improved order, Order, Max-align, Gaussian field, Diffusion). Four panels: PC vs. MAC, Religion vs. Atheism, Ten Digits (10 classes), ISOLET (26 classes); x-axis: rank of λi, y-axis: scaled μ.]\n\nReferences\n\n[1] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.\n\n[2] O. Chapelle, J. Weston, and B. Schölkopf. Cluster kernels for semi-supervised learning. In Advances in Neural Information Processing Systems 15, 2002.\n\n[3] F. R. K. Chung. Spectral Graph Theory, Regional Conference Series in Mathematics, No. 92. American Mathematical Society, 1997.\n\n[4] N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. Kandola. On kernel-target alignment. In Advances in Neural Information Processing Systems, 2001.\n\n[5] R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input spaces. In Proc. 19th International Conf. on Machine Learning, 2002.\n\n[6] G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semidefinite programming. 
Journal of Machine Learning Research, 5:27-72, 2004.\n\n[7] A. Smola and R. Kondor. Kernels and regularization on graphs. In Conference on Learning Theory, COLT/KW, 2003.\n\n[8] M. Szummer and T. Jaakkola. Partially labeled classification with Markov random walks. In Advances in Neural Information Processing Systems 14, 2001.\n\n[9] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML-03, 20th International Conference on Machine Learning, 2003.\n\n[10] X. Zhu, J. Lafferty, and Z. Ghahramani. Semi-supervised learning: From Gaussian fields to Gaussian processes. Technical Report CMU-CS-03-175, Carnegie Mellon University, 2003.\n\n[Table 1: Accuracy, alignment scores, and run times on the datasets (pc-mac, religion-atheism, one-two, ten digits, isolet), comparing the 8 kernels at five labeled set sizes each. Each cell has two rows: the upper row is test set accuracy with standard error; the lower row is training set alignment, with SeDuMi/YALMIP run time in seconds in parentheses. All numbers are averaged over 30 random trials. Accuracies in boldface are the best as determined by a paired t-test at the 0.05 significance level.]", "award": [], "sourceid": 2702, "authors": [{"given_name": "Jerry", "family_name": "Zhu", "institution": null}, {"given_name": "Jaz", "family_name": "Kandola", "institution": null}, {"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}, {"given_name": "John", "family_name": "Lafferty", "institution": null}]}