{"title": "Analysis of Spectral Kernel Design based Semi-supervised Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1601, "page_last": 1608, "abstract": "", "full_text": "Analysis of Spectral Kernel Design based Semi-supervised Learning\n\nTong Zhang\nYahoo! Inc.\nNew York City, NY 10011\n\nRie Kubota Ando\nIBM T. J. Watson Research Center\nYorktown Heights, NY 10598\n\nAbstract\n\nWe consider a framework for semi-supervised learning using spectral decomposition based unsupervised kernel design. This approach subsumes a class of previously proposed semi-supervised learning methods on data graphs. We examine various theoretical properties of such methods. In particular, we derive a generalization performance bound, and obtain the optimal kernel design by minimizing the bound. Based on the theoretical analysis, we are able to demonstrate why spectral kernel design based methods can often improve the predictive performance. Experiments are used to illustrate the main consequences of our analysis.\n\n1 Introduction\n\nSpectral graph methods have been used both in clustering and in semi-supervised learning. This paper focuses on semi-supervised learning, where a classifier is constructed from both labeled and unlabeled training examples. Although previous studies showed that this class of methods works well for certain concrete problems (for example, see [1, 4, 5, 6]), there is no satisfactory theory demonstrating why (and under what circumstances) such methods should work.\n\nThe purpose of this paper is to develop a more complete theoretical understanding of graph based semi-supervised learning. In Theorem 2.1, we present a transductive formulation of kernel learning on graphs which is equivalent to supervised kernel learning. This new kernel learning formulation includes some of the previously proposed graph semi-supervised learning methods as special cases. 
A consequence is that we can view such graph-based semi-supervised learning methods as kernel design methods that utilize unlabeled data; the designed kernel is then used in the standard supervised learning setting. This insight allows us to prove useful results concerning the behavior of graph based semi-supervised learning from the more general view of spectral kernel design. Similar spectral kernel design ideas also appeared in [2]. However, they did not present a graph-based learning formulation (Theorem 2.1 in this paper); nor did they study the theoretical properties of such methods. We focus on two issues for graph kernel learning formulations based on Theorem 2.1. First, we establish the convergence of graph based semi-supervised learning (as the number of unlabeled data points increases). Second, we obtain a learning bound, which can be used to compare the performance of different kernels. This analysis gives insight into what makes a good kernel, and why graph-based spectral kernel design is often helpful in various applications. Examples are given to justify the theoretical analysis. Due to space limitations, proofs will not be included in this paper.\n\n2 Transductive Kernel Learning on Graphs\n\nWe shall start with notation for supervised learning. Consider the problem of predicting a real-valued output Y based on its corresponding input vector X. In the standard machine learning formulation, we assume that the data (X, Y) are drawn from an unknown underlying distribution D. Our goal is to find a predictor p(x) so that the expected true loss of p given below is as small as possible: R(p(·)) = E_{(X,Y)∼D} L(p(X), Y), where we use E_{(X,Y)∼D} to denote the expectation with respect to the true (but unknown) underlying distribution D. 
Typically, one needs to restrict the size of the hypothesis function family so that a stable estimate within the function family can be obtained from a finite number of samples. We are interested in learning in Hilbert spaces. For notational simplicity, we assume that there is a feature representation ψ(x) ∈ H, where H is a high (possibly infinite) dimensional feature space. We denote ψ(x) by column vectors, so that the inner product in the Hilbert space H is the vector product. A linear classifier p(x) on H can be represented by a vector w ∈ H such that p(x) = w^T ψ(x).\n\nLet the training samples be (X_1, Y_1), ..., (X_n, Y_n). We consider the following regularized linear prediction method on H:\n\n  p̂(x) = ŵ^T ψ(x),   ŵ = arg min_{w∈H} [ (1/n) Σ_{i=1}^n L(w^T ψ(X_i), Y_i) + λ w^T w ].   (1)\n\nIf H is an infinite dimensional space, then it is not feasible to solve (1) directly. A remedy is to use kernel methods. Given a feature representation ψ(x), we can define the kernel k(x, x′) = ψ(x)^T ψ(x′). It is well-known (the so-called representer theorem) that the solution of (1) can be represented as p̂(x) = Σ_{i=1}^n α̂_i k(X_i, x), where [α̂_i] is given by\n\n  [α̂_i] = arg min_{[α_i]∈R^n} [ (1/n) Σ_{i=1}^n L( Σ_{j=1}^n α_j k(X_i, X_j), Y_i ) + λ Σ_{i,j=1}^n α_i α_j k(X_i, X_j) ].   (2)\n\nThe above formulations of kernel methods are standard. In the following, we present an equivalence of supervised kernel learning to a specific semi-supervised formulation. Although this representation is implicit in some earlier papers, the explicit form of this method is not well-known. 
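For the least squares loss L(p, y) = (p − y)² (the loss used in the experiments of Section 6), both (1) and (2) have closed-form solutions, and the representer theorem can be checked numerically. The following sketch (our own illustrative setup, not code from the paper; linear kernel k(x, x′) = x^T x′) verifies that the primal weight vector and the kernel coefficients give identical predictions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 5, 0.1
X = rng.normal(size=(n, d))          # rows are psi(X_i) = X_i (linear kernel)
y = rng.normal(size=n)

# Primal form (1): w = argmin (1/n)||Xw - y||^2 + lam ||w||^2
w = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

# Kernelized form (2): for least squares, alpha = (K + lam*n*I)^{-1} y
K = X @ X.T
alpha = np.linalg.solve(K + lam * n * np.eye(n), y)

# Representer theorem: p(x) = w^T x = sum_i alpha_i k(X_i, x) on the sample
assert np.allclose(X @ w, K @ alpha)
```

The identity ŵ = X^T α̂ (the solution lies in the span of the training features) is exactly what the representer theorem asserts.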
As we shall see later, this new kernel learning formulation is critical for analyzing a class of graph-based semi-supervised learning methods. In this framework, the data graph consists of nodes that are the data points X_j. The edge connecting two nodes X_i and X_j is weighted by k(X_i, X_j). The following theorem, which establishes the graph kernel learning formulation we will study in this paper, essentially implies that graph-based semi-supervised learning is equivalent to the supervised learning method which employs the same kernel.\n\nTheorem 2.1 (Graph Kernel Learning) Consider labeled data {(X_i, Y_i)}_{i=1,...,n} and unlabeled data X_j (j = n+1, ..., m). Consider real-valued vectors f = [f_1, ..., f_m]^T ∈ R^m, and the following semi-supervised learning method:\n\n  f̂ = arg inf_{f∈R^m} [ (1/n) Σ_{i=1}^n L(f_i, Y_i) + λ f^T K^{-1} f ],   (3)\n\nwhere K (often called the gram matrix in kernel learning or the affinity matrix in graph learning) is an m × m matrix with K_{i,j} = k(X_i, X_j) = ψ(X_i)^T ψ(X_j). Let p̂ be the solution of (1); then f̂_j = p̂(X_j) for j = 1, ..., m.\n\nThe kernel gram matrix K is always positive semi-definite. However, if K is not full rank (singular), then the correct interpretation of f^T K^{-1} f is lim_{μ→0+} f^T (K + μ I_{m×m})^{-1} f, where I_{m×m} is the m × m identity matrix. If we start with a given kernel k and let K = [k(X_i, X_j)], then a semi-supervised learning method of the form (3) is equivalent to the supervised method (1). It follows that with a formulation like (3), the only way to utilize unlabeled data is to replace K by a kernel K̄ in (3), or k by k̄ in (2), where K̄ (or k̄) depends on the unlabeled data. 
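Theorem 2.1 can be verified numerically for the least squares loss. The sketch below (our own illustrative setup, not from the paper) solves the transductive problem (3) over all m points directly and checks that it matches the supervised kernel solution evaluated on the graph nodes:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, d, lam = 30, 8, 60, 0.05
X = rng.normal(size=(m, d))
y = rng.normal(size=n)                      # labels for the first n nodes
K = X @ X.T                                 # full m x m gram matrix (full rank since d > m)

# Semi-supervised form (3): min_f (1/n) sum_{i<=n} (f_i - Y_i)^2 + lam f^T K^{-1} f
E = np.eye(m)[:n]                           # selects the labeled coordinates
f_hat = np.linalg.solve(E.T @ E / n + lam * np.linalg.inv(K), E.T @ y / n)

# Supervised method (1)-(2) trained on the n labeled points, evaluated at all m nodes
alpha = np.linalg.solve(K[:n, :n] + lam * n * np.eye(n), y)
p_hat = K[:, :n] @ alpha

assert np.allclose(f_hat, p_hat)            # Theorem 2.1: f_hat_j = p_hat(X_j)
```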
In other words, the only benefit of unlabeled data in this setting is to construct a good kernel based on unlabeled data.\n\nSome of the previous graph-based semi-supervised learning methods employ the same formulation (3) with K^{-1} replaced by the graph Laplacian operator L (which we will describe in Section 5). However, the equivalence of this formulation and supervised kernel learning (with kernel matrix K = L^{-1}) was not obtained in these earlier studies. This equivalence is important for good theoretical understanding, as we will see later in this paper. Moreover, by treating graph-based semi-supervised learning as unsupervised kernel design (see Figure 1), the scope of this paper is more general than graph Laplacian based methods.\n\n  Input: labeled data [(X_i, Y_i)]_{i=1,...,n}, unlabeled data X_j (j = n+1, ..., m),\n    shrinkage factors s_j ≥ 0 (j = 1, ..., m), kernel function k(·,·)\n  Output: predictive values f̂′_j on X_j (j = 1, ..., m)\n  Form the kernel matrix K = [k(X_i, X_j)] (i, j = 1, ..., m)\n  Compute the kernel eigen-decomposition: K = m Σ_{j=1}^m μ_j v_j v_j^T,\n    where (μ_j, v_j) are eigenpairs of K (v_j^T v_j = 1)\n  Modify the kernel matrix as: K̄ = m Σ_{j=1}^m s_j μ_j v_j v_j^T   (*)\n  Compute f̂′ = arg min_{f∈R^m} [ (1/n) Σ_{i=1}^n L(f_i, Y_i) + λ f^T K̄^{-1} f ]\n\nFigure 1: Spectral kernel design based semi-supervised learning on graph\n\nIn Figure 1, we consider a general formulation of semi-supervised learning on a data graph through spectral kernel design. This is the method we will analyze in this paper. As a special case, we can let s_j = g(μ_j) in Figure 1, where g is a rational function; then K̄ = g(K/m)K. In this special case, we do not have to compute the eigen-decomposition of K. 
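The procedure in Figure 1 is straightforward to implement for the least squares loss. A minimal sketch (function and variable names are ours, not the paper's; the final step uses the equivalent form f̂′ = K̄[:, 1..n] α, which avoids inverting a possibly singular K̄):

```python
import numpy as np

def spectral_kernel_ssl(K, y_labeled, shrink, lam):
    """Figure 1 with least squares loss: rescale the spectrum of K by
    shrinkage factors s_j = shrink(mu_j), then solve the regularized
    problem with the modified kernel K_bar."""
    m = K.shape[0]
    n = len(y_labeled)
    mu, V = np.linalg.eigh(K / m)               # K = m * sum_j mu_j v_j v_j^T
    K_bar = m * (V * (shrink(mu) * mu)) @ V.T   # step (*): K_bar = m * sum_j s_j mu_j v_j v_j^T
    # least squares solution of min_f (1/n) sum_{i<=n} (f_i - Y_i)^2 + lam f^T K_bar^{-1} f
    alpha = np.linalg.solve(K_bar[:n, :n] + lam * n * np.eye(n), y_labeled)
    return K_bar[:, :n] @ alpha
```

For example, `shrink=lambda mu: mu` gives K̄ ∝ K² (the special case of (*) with g(μ) = μ), while `shrink=lambda mu: np.ones_like(mu)` leaves the kernel unchanged.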
Therefore we obtain a simpler algorithm, with the (*) in Figure 1 replaced by\n\n  K̄ = g(K/m)K.   (4)\n\nAs mentioned earlier, the idea of using spectral kernel design appeared in [2], although they did not base their method on the graph formulation (3). However, we believe our analysis also sheds light on their methods. The semi-supervised learning method described in Figure 1 is useful only when f̂′ is a better predictor than f̂ in Theorem 2.1 (which uses the original kernel K) – in other words, only when the new kernel K̄ is better than K.\n\nIn the next few sections, we will investigate the following issues concerning the theoretical behavior of this algorithm: (a) the limiting behavior of f̂′ as m → ∞; that is, whether f̂′_j converges for each j; (b) the generalization performance of (3); (c) optimal kernel design by minimizing the generalization error, and its implications; (d) statistical models under which spectral kernel design based semi-supervised learning is effective.\n\n3 The Limiting Behavior of Graph-based Semi-supervised Learning\n\nWe want to show that as m → ∞, the semi-supervised algorithm in Figure 1 is well-behaved; that is, f̂′_j converges as m → ∞. This is one of the most fundamental issues.\n\nUsing the feature space representation, we have k(x, x′) = ψ(x)^T ψ(x′). Therefore a change of kernel can be regarded as a change of feature mapping. In particular, we consider a feature transformation of the form ψ̄(x) = S^{1/2} ψ(x), where S is an appropriate positive semi-definite operator on H. The following result establishes an equivalent feature space formulation of the semi-supervised learning method in Figure 1.\n\nTheorem 3.1 Using the notation in Figure 1, assume k(x, x′) = ψ(x)^T ψ(x′). Consider S = Σ_{j=1}^m s_j u_j u_j^T, where u_j = Ψ v_j/√μ_j and Ψ = [ψ(X_1), ..., ψ(X_m)]; then (μ_j, u_j) is an eigenpair of ΨΨ^T/m. Let\n\n  p̂′(x) = ŵ′^T S^{1/2} ψ(x),   ŵ′ = arg min_{w∈H} [ (1/n) Σ_{i=1}^n L(w^T S^{1/2} ψ(X_i), Y_i) + λ w^T w ].\n\nThen f̂′_j = p̂′(X_j) (j = 1, ..., m).\n\nThe asymptotic behavior of Figure 1 when m → ∞ can be easily understood from Theorem 3.1. In this case, we just replace ΨΨ^T/m = (1/m) Σ_{j=1}^m ψ(X_j)ψ(X_j)^T by E_X ψ(X)ψ(X)^T. The spectral decomposition of E_X ψ(X)ψ(X)^T corresponds to feature space PCA. It is clear that if S converges, then the feature space algorithm in Theorem 3.1 also converges. In general, S converges if the eigenvectors u_j converge and the shrinkage factors s_j are bounded. As a special case, we have the following result.\n\nTheorem 3.2 Consider a sequence of data X_1, X_2, ... drawn from a distribution, with only the first n points labeled. Assume that as m → ∞, Σ_{j=1}^m ψ(X_j)ψ(X_j)^T/m converges to E_X ψ(X)ψ(X)^T almost surely, and g is a continuous function on the spectral range of E_X ψ(X)ψ(X)^T. Then in Figure 1 with (*) given by (4) and kernel k(x, x′) = ψ(x)^T ψ(x′), f̂′_j converges almost surely for each fixed j.\n\n4 Generalization analysis on graph\n\nWe study the generalization behavior of the graph based semi-supervised learning algorithm (3), and use it to compare different kernels. We will then use this bound to justify the kernel design method given in Section 2. To measure the sample complexity, we consider m points (X_j, Y_j) for j = 1, ..., m. We randomly pick n distinct integers i_1, ..., i_n from {1, ..., m} uniformly (sampling without replacement), and regard them as the n labeled training data. 
We obtain predictive values f̂_j on the graph using the semi-supervised learning method (3) with the labeled data, and test it on the remaining m − n data points. We are interested in the average predictive performance over all random draws.\n\nTheorem 4.1 Consider (X_j, Y_j) for j = 1, ..., m. Assume that we randomly pick n distinct integers i_1, ..., i_n from {1, ..., m} uniformly (sampling without replacement), and denote the set by Z_n. Let f̂(Z_n) be the semi-supervised learning method (3) using the training data in Z_n: f̂(Z_n) = arg min_{f∈R^m} [ (1/n) Σ_{i∈Z_n} L(f_i, Y_i) + λ f^T K^{-1} f ]. If |∂L(p, y)/∂p| ≤ γ and L(p, y) is convex with respect to p, then we have\n\n  E_{Z_n} (1/(m−n)) Σ_{j∉Z_n} L(f̂_j(Z_n), Y_j) ≤ inf_{f∈R^m} [ (1/m) Σ_{j=1}^m L(f_j, Y_j) + λ f^T K^{-1} f + γ² tr(K)/(2λnm) ].\n\nThe bound depends on the regularization parameter λ in addition to the kernel K. In order to compare different kernels, it is reasonable to compare the bound with the optimal λ for each K. That is, in addition to minimizing over f, we also minimize over λ on the right hand side of the bound. Note that in practice, it is usually not difficult to find a nearly-optimal λ through cross validation, so it is reasonable to assume that we can choose the optimal λ in the bound. With the optimal λ, we obtain:\n\n  E_{Z_n} (1/(m−n)) Σ_{j∉Z_n} L(f̂_j(Z_n), Y_j) ≤ inf_{f∈R^m} [ (1/m) Σ_{j=1}^m L(f_j, Y_j) + (γ/√(2n)) √(R(f, K)) ],\n\nwhere R(f, K) = tr(K/m) f^T K^{-1} f is the complexity of f with respect to kernel K. If we define K̄ as in Figure 1, then the complexity of a function f with respect to K̄ is given by R(f, K̄) = (Σ_{j=1}^m s_j μ_j)(Σ_{j=1}^m α_j²/(s_j μ_j)). If we believe that a good approximate target function f can be expressed as f = Σ_j α_j v_j with |α_j| ≤ β_j for some known β_j, then based on this belief, the optimal choice of the shrinkage factor becomes s_j = β_j/μ_j. That is, the kernel that optimizes the bound is K̄ = Σ_j β_j v_j v_j^T, where v_j are the normalized eigenvectors of K. In this case, we have R(f, K̄) ≤ (Σ_j β_j)². The eigenvalues of the optimal kernel are thus independent of K, and depend only on the range β_j of the spectral coefficients of the approximate target function.\n\nThere is no reason to believe that the eigenvalues μ_j of the original kernel K are proportional to the target spectral coefficient range. If we have some guess of the spectral coefficients of the target, then one may use this knowledge to obtain a better kernel. This justifies why spectral kernel design based algorithms can potentially be helpful (when we have some information on the target spectral coefficients). In practice, it is usually difficult to have a precise guess of β_j. However, for many application problems, we observe in practice that the eigenvalues of the kernel K decay more slowly than the target spectral coefficients. In this case, our analysis implies that we should use an alternative kernel with faster eigenvalue decay: for example, using K² instead of K. This has a dimension reduction effect. 
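The comparison can be made concrete with a small numerical example using the complexity term R(f, K̄) = (Σ_j s_jμ_j)(Σ_j α_j²/(s_jμ_j)) from the bound. In the toy spectra below (our own illustrative choice, not from the paper), the target coefficients α_j = j^{−2} decay faster than the eigenvalues μ_j = j^{−1}; squaring the kernel then strictly lowers the complexity, and here it even matches the oracle choice μ̄_j ∝ β_j, since μ_j² ∝ α_j:

```python
import numpy as np

def complexity_R(mu_bar, alpha):
    """R(f, K_bar) for a kernel with spectrum mu_bar and a target f whose
    spectral coefficients in the eigenbasis are alpha (scale invariant)."""
    return np.sum(mu_bar) * np.sum(alpha**2 / mu_bar)

j = np.arange(1, 101).astype(float)
mu = 1.0 / j            # kernel eigenvalues: slow decay
alpha = 1.0 / j**2      # target spectral coefficients: faster decay

R_K = complexity_R(mu, alpha)          # original kernel K
R_K2 = complexity_R(mu**2, alpha)      # kernel K^2: faster eigenvalue decay
R_opt = np.sum(np.abs(alpha))**2       # oracle bound (sum_j beta_j)^2

assert R_K2 < R_K
assert np.isclose(R_K2, R_opt)         # here mu^2 is proportional to |alpha|
```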
That is, we effectively project the data onto the principal components of the data. The intuition is also quite clear: if the dimension of the target function is small (the spectral coefficients decay fast), then we should project the data onto those dimensions by removing the remaining noisy dimensions (corresponding to fast kernel eigenvalue decay).\n\n5 Spectral analysis: the effect of input noise\n\nWe provide a justification of why the spectral coefficients of the target function often decay faster than the eigenvalues of a natural kernel K. In essence, this is due to the fact that the input vector X is often corrupted with noise. Together with the results in the previous section, we know that in order to achieve optimal performance, we need to use a kernel with faster eigenvalue decay. We will demonstrate this phenomenon under a statistical model, using the feature space notation of Section 3. For simplicity, we assume that ψ(x) = x.\n\nWe consider a two-class classification problem in R^∞ (with the standard 2-norm inner product), where the label Y = ±1. We first start with a noise free model, where the data can be partitioned into p clusters. Each cluster ℓ is composed of a single center point x̄_ℓ (having zero variance) with label ȳ_ℓ = ±1. In this model, assume that the centers are well separated so that there is a weight vector w_* with w_*^T w_* < ∞ such that w_*^T x̄_ℓ = ȳ_ℓ. Without loss of generality, we may assume that x̄_ℓ and w_* belong to a p-dimensional subspace V_p. Let V_p^⊥ be its orthogonal complement. Assume now that the observed input data are corrupted with noise. We first generate a center index ℓ, and then noise δ (which may depend on ℓ). The observed input data is the corrupted data X = x̄_ℓ + δ, and the observed output is Y = w_*^T x̄_ℓ. In this model, let ℓ(X_i) be the center corresponding to X_i; the observation can be decomposed as X_i = x̄_{ℓ(X_i)} + δ(X_i) and Y_i = w_*^T x̄_{ℓ(X_i)}. Given noise δ, we decompose it as δ = δ_1 + δ_2, where δ_1 is the orthogonal projection of δ onto V_p, and δ_2 is the orthogonal projection of δ onto V_p^⊥. We assume that δ_1 is a small noise component; the component δ_2 can be large but has small variance in every direction.\n\nTheorem 5.1 Consider the data generation model in this section, with observation X = x̄_ℓ + δ and Y = w_*^T x̄_ℓ. Assume that δ is conditionally zero-mean given ℓ: E_{δ|ℓ} δ = 0. Let E XX^T = Σ_j μ_j u_j u_j^T be the spectral decomposition with decreasing eigenvalues μ_j (u_j^T u_j = 1), and let σ_1² ≥ σ_2² ≥ ··· be the eigenvalues of E δ_2 δ_2^T. Then the following claims are valid: μ_j ≥ σ_j²; if ‖δ_1‖_2 ≤ b/‖w_*‖_2, then |w_*^T X_i − Y_i| ≤ b; and ∀t ≥ 0, Σ_{j≥1} (w_*^T u_j)² μ_j^{−t} ≤ w_*^T (E x̄_ℓ x̄_ℓ^T)^{−t} w_*.\n\nConsider m points X_1, ..., X_m. Let Ψ = [X_1, ..., X_m] and let K = Ψ^T Ψ = m Σ_j μ_j v_j v_j^T be the kernel spectral decomposition. Let u_j = Ψ v_j/√(mμ_j), f_i = w_*^T X_i, and f = Σ_j α_j v_j. Then it is not difficult to verify that α_j = √(mμ_j) w_*^T u_j. If we assume that asymptotically (1/m) Σ_{i=1}^m X_i X_i^T → E XX^T, then we have the following consequences:\n\n• f_i = w_*^T X_i is a good approximate target when b is small. In particular, if b < 1, then this function always gives the correct class label.\n• For all t > 0, the spectral coefficients α_j of f decay as (1/m) Σ_{j=1}^m α_j²/μ_j^{1+t} ≤ w_*^T (E x̄_ℓ x̄_ℓ^T)^{−t} w_*.\n• The eigenvalue μ_j decays slowly when the noise spectrum decays slowly: μ_j ≥ σ_j².\n\nIf the clean data are well behaved in that we can find a weight vector such that w_*^T (E_X x̄_{ℓ(X)} x̄_{ℓ(X)}^T)^{−t} w_* is bounded for some t > 1, then when the data are corrupted with noise, we can find a good approximate target whose spectral coefficients decay faster (on average) than the kernel eigenvalues. This analysis implies that if the feature representation associated with the original kernel is corrupted with noise, then it is often helpful to use a kernel with faster spectral decay. For example, instead of using K, we may use K̄ = K². However, it may not be easy to estimate the exact decay rate of the target spectral coefficients. In practice, one may use cross validation to optimize the kernel.\n\nA kernel with fast spectral decay projects the data onto the most prominent principal components. Therefore we are interested in designing kernels which can achieve a dimension reduction effect. Although one may use direct eigenvalue computation, an alternative is to use a function g(K/m)K for such an effect, as in (4). For example, we may consider a normalized kernel such that K/m = Σ_j μ_j u_j u_j^T with 0 ≤ μ_j ≤ 1. A standard normalization method is to use D^{−1/2} K D^{−1/2}, where D is the diagonal matrix with each entry corresponding to the row sums of K. 
It follows that g(K/m)K = m Σ_j g(μ_j) μ_j u_j u_j^T. We are interested in a function g such that g(μ)μ ≈ 1 when μ ∈ [α, 1] for some α, and g(μ)μ ≈ 0 when μ < α (where α is close to 1). One such function is given by g(μ)μ = (1 − α)/(1 − αμ). This is the function used in various graph Laplacian formulations with a normalized Gaussian kernel as the initial kernel K; for example, see [5]. Our analysis suggests that it is the dimension reduction effect of this function that is important, rather than the connection to the graph Laplacian. As we shall see in the empirical examples, other kernels such as K², which achieve a similar dimension reduction effect (but have nothing to do with the graph Laplacian), also improve performance.\n\n6 Empirical Examples\n\nThis section shows empirical examples to demonstrate some consequences of our theoretical analysis. We use the MNIST data set (http://yann.lecun.com/exdb/mnist/), consisting of hand-written digit images (representing 10 classes, from digit “0” to digit “9”). In the following experiments, we randomly draw m = 2000 samples. We regard n = 100 of them as labeled data, and the remaining m − n = 1900 as unlabeled test data.\n\nFigure 2: Left: spectral coefficients (normalized 25NN, MNIST); right: classification accuracy (25NN, MNIST), as a function of the dimension cut-off d, for the methods Y, K, K², K³, K⁴, Inverse, [1,..,1,0,..], and the original K.\n\nThroughout the experiments, we use the least squares loss L(p, y) = (p − y)² for simplicity. 
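For the 10-class experiments, the least squares loss can be applied one-vs-all. A sketch of the kind of evaluation loop involved (our own hypothetical helper, not code from the paper): fit (3) with the designed kernel against each column of a one-hot label matrix and predict by argmax.

```python
import numpy as np

def graph_least_squares_classify(K_bar, Y_onehot, n, lam):
    """One-vs-all least squares on the graph: solve (3) with kernel K_bar
    for each class column (rows 0..n-1 are labeled), then label every
    node by its highest-scoring class."""
    alpha = np.linalg.solve(K_bar[:n, :n] + lam * n * np.eye(n), Y_onehot[:n])
    scores = K_bar[:, :n] @ alpha
    return scores.argmax(axis=1)

# toy check on two well-separated clusters with a linear kernel
rng = np.random.default_rng(0)
m, n = 40, 10
labels = np.arange(m) % 2
X = np.array([[3.0, 0.0], [-3.0, 0.0]])[labels] + 0.3 * rng.normal(size=(m, 2))
pred = graph_least_squares_classify(X @ X.T, np.eye(2)[labels], n, lam=0.01)
assert (pred == labels).mean() > 0.9
```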
We study the performance of various kernel design methods obtained by changing the spectral coefficients of the initial gram matrix K, as in Figure 1. Below we write μ̄_i for the new spectral coefficients of the new gram matrix K̄: i.e., K̄ = Σ_{i=1}^m μ̄_i v_i v_i^T. We study the following kernel design methods (also see [2]), with a dimension cut-off parameter d, so that μ̄_i = 0 when i > d. (a) [1, ..., 1, 0, ..., 0]: μ̄_i = 1 if i ≤ d, and 0 otherwise. This was used in spectral clustering [3]. (b) K: μ̄_i = μ_i if i ≤ d; 0 otherwise. This method is essentially kernel principal component analysis, which keeps the d most significant principal components of K. (c) K^p: μ̄_i = μ_i^p if i ≤ d; 0 otherwise. We set p = 2, 3, 4. This accelerates the decay of the eigenvalues of K. (d) Inverse: μ̄_i = 1/(1 − ρμ_i) if i ≤ d; 0 otherwise; ρ is a constant close to 1 (we used 0.999). This is essentially graph-Laplacian based semi-supervised learning for a normalized kernel (e.g. see [5]). Note that the standard graph-Laplacian formulation sets d = m. (e) Y: μ̄_i = |Y^T v_i| if i ≤ d; 0 otherwise. This is the oracle kernel that optimizes our generalization bound. The purpose of testing this oracle method is to validate our analysis by checking whether a good kernel in our theory produces good classification performance on real data. Note that in the experiments, we use Y averaged over the ten classes. Therefore the resulting kernel will not be the best possible kernel for each specific class, and thus its performance may not always be optimal.\n\nFigure 2 shows the spectral coefficients of the above mentioned kernel design methods and the corresponding classification performance. 
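The five spectral schedules (a)-(e) amount to simple transformations of the sorted eigenvalues. A sketch (our own function and method names; μ is assumed sorted in decreasing order, and Yv holds the projections Y^T v_i needed by the oracle):

```python
import numpy as np

def modify_spectrum(mu, method, d, p=2, rho=0.999, Yv=None):
    """Return the new coefficients mu_bar for methods (a)-(e) of Section 6,
    with the dimension cut-off mu_bar_i = 0 for i > d."""
    if method == "ones":        # (a) [1,...,1,0,...,0], as in spectral clustering
        mu_bar = np.ones_like(mu)
    elif method == "kpca":      # (b) keep the top d principal components of K
        mu_bar = mu.copy()
    elif method == "power":     # (c) K^p: accelerate the eigenvalue decay
        mu_bar = mu ** p
    elif method == "inverse":   # (d) graph-Laplacian style reweighting
        mu_bar = 1.0 / (1.0 - rho * mu)
    elif method == "oracle":    # (e) |Y^T v_i|, which optimizes the bound
        mu_bar = np.abs(Yv)
    else:
        raise ValueError(method)
    mu_bar = mu_bar.astype(float)
    mu_bar[d:] = 0.0
    return mu_bar
```

For instance, `modify_spectrum(mu, "power", d=50, p=3)` corresponds to method K³ with cut-off dimension 50.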
The initial kernel is the normalized 25-NN kernel, defined as K = D^{−1/2} W D^{−1/2} (see the previous section), where W_{ij} = 1 if either the i-th example is one of the 25 nearest neighbors of the j-th example or vice versa, and 0 otherwise. As expected, the results demonstrate that the target spectral coefficients Y decay faster than those of the original kernel K. Therefore it is useful to use kernel design methods that accelerate the eigenvalue decay. The accuracy plot on the right is consistent with our theory. The near oracle kernel 'Y' performs well especially when the dimension cut-off is large. With an appropriate dimension d, all methods perform better than the supervised baseline (original K), which is below 65%. With an appropriate dimension cut-off, all methods perform similarly (over 80%). However, K^p with p = 2, 3, 4 is less sensitive to the cut-off dimension d than the kernel principal component dimension reduction method K. Moreover, the hard threshold method used in spectral clustering ([1, ..., 1, 0, ..., 0]) is not stable. Similar behavior can also be observed with other initial kernels. Figure 3 shows the classification accuracy with the standard Gaussian kernel as the initial kernel K, both with and without normalization. We also used different bandwidths t to illustrate that the behavior of the different methods is similar for different t (in a reasonable range). The result shows that normalization is not critical for achieving high performance, at least for this data. Again, we observe that the near oracle method performs extremely well. The spectral clustering kernel is sensitive to the cut-off dimension, while K^p with p = 2, 3, 4 is quite stable. The standard kernel principal component dimension reduction (method K) performs very well with an appropriately chosen dimension cut-off. 
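The normalized 25-NN initial kernel described above can be built directly. A small sketch (our own implementation, not the paper's code; brute-force pairwise distances, which is fine at m = 2000):

```python
import numpy as np

def normalized_knn_kernel(X, k=25):
    """K = D^{-1/2} W D^{-1/2}, where W_ij = 1 iff i is among the k nearest
    neighbors of j or vice versa (symmetrized k-NN graph)."""
    m = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                  # no self-edges
    nn = np.argsort(d2, axis=1)[:, :k]            # k nearest neighbors of each point
    W = np.zeros((m, m))
    W[np.repeat(np.arange(m), k), nn.ravel()] = 1.0
    W = np.maximum(W, W.T)                        # "either ... or vice versa"
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
```

By construction the spectrum of this kernel lies in [−1, 1], matching the normalization assumed in Section 5.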
The experiments are consistent with our theoretical analysis.\n\nFigure 3: Classification accuracy with Gaussian kernel k(i, j) = exp(−‖x_i − x_j‖₂²/t). Left: normalized Gaussian (t = 0.1); right: unnormalized Gaussian (t = 0.3).\n\n7 Conclusion\n\nWe investigated a class of graph-based semi-supervised learning methods. By establishing a graph-based formulation of kernel learning, we showed that this class of semi-supervised learning methods is equivalent to supervised kernel learning with unsupervised kernel design (explored in [2]). We then obtained a generalization bound, which implies that the eigenvalues of the optimal kernel should decay at the same rate as the target spectral coefficients. Moreover, we showed that input noise can cause the target spectral coefficients to decay faster than the kernel spectral coefficients. The analysis explains why it is often helpful to modify the original kernel eigenvalues to achieve a dimension reduction effect.\n\nReferences\n\n[1] Mikhail Belkin and Partha Niyogi. Semi-supervised learning on Riemannian manifolds. Machine Learning, Special Issue on Clustering:209–239, 2004.\n\n[2] Olivier Chapelle, Jason Weston, and Bernhard Schölkopf. Cluster kernels for semi-supervised learning. In NIPS, 2003.\n\n[3] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, pages 849–856, 2001.\n\n[4] M. Szummer and T. Jaakkola. Partially labeled classification with Markov random walks. In NIPS, 2002.\n\n[5] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In NIPS, pages 321–328, 2004.\n\n[6] Xiaojin Zhu, Zoubin Ghahramani, and John Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, 2003.\n", "award": [], "sourceid": 2759, "authors": [{"given_name": "Tong", "family_name": "Zhang", "institution": null}, {"given_name": "Rie", "family_name": "Ando", "institution": null}]}