{"title": "Learning with Fredholm Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 2951, "page_last": 2959, "abstract": "In this paper we propose a framework for supervised and semi-supervised learning based on reformulating the learning problem as a regularized Fredholm integral equation. Our approach fits naturally into the kernel framework and can be interpreted as constructing new data-dependent kernels, which we call Fredholm kernels. We proceed to discuss the \"noise assumption\" for semi-supervised learning and provide evidence, both theoretical and experimental, that Fredholm kernels can effectively utilize unlabeled data under the noise assumption. We demonstrate that methods based on Fredholm learning show very competitive performance in the standard semi-supervised learning setting.", "full_text": "Learning with Fredholm Kernels

Qichao Que, Mikhail Belkin, Yusu Wang

Department of Computer Science and Engineering
The Ohio State University
Columbus, OH 43210
{que,mbelkin,yusu}@cse.ohio-state.edu

Abstract

In this paper we propose a framework for supervised and semi-supervised learning based on reformulating the learning problem as a regularized Fredholm integral equation. Our approach fits naturally into the kernel framework and can be interpreted as constructing new data-dependent kernels, which we call Fredholm kernels. We proceed to discuss the “noise assumption” for semi-supervised learning and provide both theoretical and experimental evidence that Fredholm kernels can effectively utilize unlabeled data under the noise assumption. We demonstrate that methods based on Fredholm learning show very competitive performance in the standard semi-supervised learning setting.

1 Introduction

Kernel methods and methods based on integral operators have become one of the central areas of machine learning and learning theory. 
These methods combine rich mathematical foundations with strong empirical performance. In this paper we propose a framework for supervised and semi-supervised learning as an inverse problem based on solving the integral equation known as the Fredholm problem of the first kind. We develop regularization-based algorithms for solving these systems, leading to what we call Fredholm kernels.
In the basic setting of supervised learning we are given the data set (xi, yi), where xi ∈ X, yi ∈ R. We would like to construct a function f : X → R, such that f(xi) ≈ yi and f is “nice enough” to generalize to new data points. This is typically done by choosing f from a class of functions (a Reproducing Kernel Hilbert Space (RKHS) corresponding to a positive definite kernel for kernel methods) and optimizing a certain loss function, such as the square loss or hinge loss.
In this paper we formulate a new framework for learning based on interpreting the learning problem as a Fredholm integral equation. This formulation shares some similarities with the usual kernel learning framework but, unlike the standard methods, also allows for easy incorporation of unlabeled data. We also show how to interpret the resulting algorithm as a standard kernel method with a non-standard data-dependent kernel (somewhat resembling the approach taken in [14]).
We discuss reasons why incorporation of unlabeled data may be desirable, concentrating in particular on what may be termed “the noise assumption” for semi-supervised learning, which is related to but distinct from the manifold and cluster assumptions popular in the semi-supervised learning literature. We provide both theoretical and empirical results showing that the Fredholm formulation allows for efficient denoising of classifiers.
To summarize, the main contributions of the paper are as follows:
(1) We formulate a new framework based on solving a regularized Fredholm equation. 
The framework naturally combines labeled and unlabeled data. We show how this framework can be expressed as a kernel method with a non-standard data-dependent kernel.
(2) We discuss “the noise assumption” in semi-supervised learning and provide some theoretical evidence that Fredholm kernels are able to improve performance of classifiers under this assumption. More specifically, we analyze the behavior of several versions of Fredholm kernels, based on combining linear and Gaussian kernels. We demonstrate that for some models of the noise assumption, the Fredholm kernel provides better estimators than the traditional data-independent kernel and thus unlabeled data provably improves inference.
(3) We show that Fredholm kernels perform well on synthetic examples designed to illustrate the noise assumption as well as on a number of real-world datasets. We also indicate how random feature approximations can be used to deal with large datasets.

1.1 Related work

Applications of kernel and integral methods in machine learning have a large and diverse literature (e.g., [13, 12]). The work most directly related to our approach is [10], where Fredholm integral equations were introduced to address the problem of density ratio estimation and covariate shift. In that work the problem of density ratio estimation was expressed as a Fredholm integral equation and solved using regularization in RKHS. This setting also relates to a line of work on kernel mean embedding, where data points are embedded in Reproducing Kernel Hilbert Spaces using integral operators, with applications to density ratio estimation and other tasks [15, 4, 5]. A very interesting recent work [9] explores a shrinkage estimator for estimating means in RKHS, following the Stein-James estimator originally used for estimating the mean in a Euclidean space. The results obtained in [9] show how such estimators can reduce variance. 
There is some similarity between that work and our theoretical results presented in Section 4, which also show variance reduction for certain estimators of the kernel, although in a different setting.
Another line of connected work is the class of semi-supervised learning techniques related to manifold regularization [1], where an additional graph Laplacian regularizer is added to take advantage of the geometric/manifold structure of the data. Our reformulation of Fredholm learning as a kernel, addressing what we call the “noise assumption”, parallels the data-dependent kernels for manifold regularization proposed in [14].

2 Fredholm Kernels

We start by formulating the learning framework proposed in this paper.
Suppose we are given l labeled pairs (x1, y1), . . . , (xl, yl) from the data distribution p(x, y) defined on X × Y and u unlabeled points xl+1, . . . , xl+u from the marginal distribution pX(x) on X. For simplicity we will assume that the feature space X is a Euclidean space R^D, and the label set Y is either {−1, 1} for binary classification or the real line R for regression. Semi-supervised learning algorithms aim to construct a (predictor) function f : X → Y by incorporating the information of the unlabeled data distribution.
To this end, we introduce the integral operator KpX associated with a kernel function k(x, z). We note that k(x, z) does not have to be a positive semi-definite kernel:

KpX : L2 → L2 and KpX f(x) = ∫ k(x, z) f(z) pX(z) dz,   (1)

where L2 is the space of square-integrable functions. As usual, by the law of large numbers, the above operator can be approximated using the unlabeled data from pX as follows:

Kp̂X f(x) = (1/(l+u)) Σ_{i=1}^{l+u} k(x, xi) f(xi).   (2)

This approximation provides a natural way of incorporating unlabeled data into algorithms. 
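The empirical operator in Eqn 2 is straightforward to implement; the following is a minimal Python/NumPy sketch. The Gaussian choice of the outer kernel k and the kernel width t are illustrative assumptions (Eqn 1 does not even require k to be positive semi-definite), and the function names are ours, not the paper's.

```python
import numpy as np

def gaussian_k(x, z, t=1.0):
    """An illustrative outer kernel k(x, z); any kernel works in Eqn 2."""
    return np.exp(-((x - z) ** 2).sum(-1) / (2 * t))

def empirical_operator(x, X_unlabeled, f, k=gaussian_k):
    """(K_phat f)(x) = (1/(l+u)) * sum_i k(x, x_i) f(x_i), as in Eqn 2."""
    return float(np.mean(k(x[None, :], X_unlabeled) * f(X_unlabeled)))
```

For instance, with a constant kernel k ≡ 1 the operator simply averages f over the (unlabeled) sample, which is the law-of-large-numbers intuition behind Eqn 2.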
In our Fredholm learning framework, we will use functions in KpX H = {KpX f : f ∈ H}, where H is an appropriate Reproducing Kernel Hilbert Space (RKHS), as classification or regression functions. Note that unlike an RKHS, this space of functions, KpX H, is density dependent. In particular, this now allows us to formulate the following optimization problem for semi-supervised classification/regression in a way similar to many supervised learning algorithms.

The Fredholm learning framework solves the following optimization problem¹:

f* = arg min_{f ∈ H} (1/l) Σ_{i=1}^{l} ((Kp̂X f)(xi) − yi)² + λ ||f||²_H,   (3)

The final classifier is c(x) = (Kp̂X f*)(x), where Kp̂X is the operator defined above. Eqn 3 is a discretized and regularized version of the Fredholm integral equation KpX f = y, thus giving the Fredholm learning framework its name.
Even though at first glance this setting looks similar to conventional kernel methods, the extra layer introduced by Kp̂X makes a significant difference, in particular by allowing the integration of information from the unlabelled data distribution. In contrast, solutions of kernel methods for most kernels, e.g., linear, polynomial or Gaussian kernels, are completely independent of the unlabeled data. We note that our approach is closely related to [10], where a Fredholm equation is used to estimate the density ratio for two probability distributions.
Our Fredholm learning framework is a generalization of the standard kernel framework. In fact, if the kernel k is the δ-function, then our formulation above is equivalent to the standard Regularized Least Squares problem f* = arg min_{f ∈ H} (1/l) Σ_{i=1}^{l} (f(xi) − yi)² + λ ||f||²_H. 
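Using the representer form of the solution, Eqn 3 reduces to a finite linear system. The sketch below is our own illustration under stated assumptions: Gaussian kernels for both k and kH with a shared width t, square loss, and sample-size constants kept explicit in the normal equation (the paper's Eqn 4 absorbs such constants into the regularization parameter). The names `fredholm_fit`, `lam`, and `t` are ours.

```python
import numpy as np

def gauss(A, B, t=1.0):
    """Pairwise Gaussian kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * t))

def fredholm_fit(X_lab, y, X_all, lam=0.1, t=1.0):
    """Solve the discretized Fredholm problem (Eqn 3) with
    f(x) = (1/n) * sum_j k_H(x, x_j) v_j, where n = l + u."""
    l, n = X_lab.shape[0], X_all.shape[0]
    K = gauss(X_lab, X_all, t)      # l x (l+u) matrix of "outer" kernel values
    K_H = gauss(X_all, X_all, t)    # (l+u) x (l+u) "inner" kernel matrix
    # Normal equation of Eqn 3 in the coefficients v (constants explicit):
    v = np.linalg.solve(K.T @ K @ K_H + lam * l * n**2 * np.eye(n),
                        n**2 * (K.T @ y))
    def c(x):                       # final classifier c(x) = (K_phat f*)(x)
        kx = gauss(np.atleast_2d(np.asarray(x, float)), X_all, t)
        return (kx @ K_H @ v).item() / n**2
    return c
```

With a small regularization weight, the classifier approximately interpolates the labels at the labeled points while still being a density-dependent function of all l + u samples.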
We could also replace the L2 loss in Eqn 3 by other loss functions, such as the hinge loss, resulting in an SVM-like classifier. Finally, even though Eqn 3 is an optimization problem in a potentially infinite-dimensional function space H, we have the following lemma that allows us to apply the Representer Theorem to get a computationally accessible solution.

Lemma 1. Given the definition of Kp̂X in Eqn 2, the solution to Eqn 3 is of the form

f*(x) = (1/(l+u)) Σ_{j=1}^{l+u} kH(x, xj) vj,

for some v ∈ R^{l+u}.

As the proof of the above lemma is similar to that of the standard representer theorem, we put the proof in the appendix. Using the above Representer Theorem, we can transform Eqn 3 into a quadratic optimization problem over a finite-dimensional space and obtain a closed-form solution for Eqn 3 as follows:

f*(x) = (1/(l+u)) Σ_{j=1}^{l+u} kH(x, xj) vj,  v = (K^T_{l+u} K_{l+u} KH + λI)^{−1} K^T_{l+u} y,   (4)

where (K_{l+u})_{ij} = k(xi, xj) for 1 ≤ i ≤ l, 1 ≤ j ≤ l+u, and (KH)_{ij} = kH(xi, xj) for 1 ≤ i, j ≤ l+u. Note that K_{l+u} is an l × (l+u) matrix.
Fredholm kernels: a convenient reformulation. Interestingly, this Fredholm learning problem actually induces a new data-dependent kernel, which we will refer to as the Fredholm kernel². To show this connection, first observe the following identity, which can be easily verified:

Claim 2 (Matrix Inversion Identity). (K^T_{l+u} K_{l+u} KH + λI)^{−1} K^T_{l+u} = K^T_{l+u} (K_{l+u} KH K^T_{l+u} + λI)^{−1}.

Define KF = K_{l+u} KH K^T_{l+u} to be the l × l kernel matrix associated with a new kernel defined by

k̂F(x, z) = (1/(l+u)²) Σ_{i,j=1}^{l+u} k(x, xi) kH(xi, xj) k(z, xj),   (5)

where the unlabeled data are considered fixed when computing this new kernel. 
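The identity in Claim 2 is easy to check numerically. The sketch below verifies it on a random instance; the dimensions, the regularization weight, and the random matrices standing in for the kernel matrices are arbitrary illustrative choices (K plays the role of the l × (l+u) matrix, and K_H is positive semi-definite, as required of kH).

```python
import numpy as np

rng = np.random.default_rng(0)
l, n, lam = 3, 7, 0.5
K = rng.standard_normal((l, n))    # stand-in for the l x (l+u) outer-kernel matrix
A = rng.standard_normal((n, n))
K_H = A @ A.T                      # PSD stand-in for the inner-kernel matrix

# Claim 2: (K^T K K_H + lam I)^{-1} K^T  ==  K^T (K K_H K^T + lam I)^{-1}
lhs = np.linalg.solve(K.T @ K @ K_H + lam * np.eye(n), K.T)
rhs = K.T @ np.linalg.inv(K @ K_H @ K.T + lam * np.eye(l))
```

This identity is what turns the (l+u)-dimensional solve in Eqn 4 into an l-dimensional solve involving only the small matrix KF.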
Using this new kernel\n\u02c6kF , the \ufb01nal classifying function c\u21e4 de\ufb01ned using the solution given in Eqn 4 can be rewritten as:\n\nk(x, xi)f\u21e4(xi) =\n\n\u02c6kF (x, xs)\u21b5s, \u21b5 = (KF + I)1 y.\n\nc\u21e4(x) =\n\n1\n\nl + u\n\nl+uXi=1\n\n1We will be using the square loss to simplify the exposition. Other loss functions can also be used in Eqn 3.\n2We note that the term \u201cFredholm Kernel\u201d has also been used before in a different context, see page 103, [6]\n\nand [16] in the studies of Fredholm operator. But our usage and the previous one represent different object.\n\n\fBecause of Eqn 5 we will sometimes refer to the kernels kH and k as the \u201cinner\u201d and \u201couter\u201d kernels\nrespectively.\nIt can be observed that this learning algorithm can be considered as a case of the standard kernel\nmethod, but using a new data dependent kernel \u02c6kF , which we will call the Fredholm kernel, since it\nis induced from the Fredholm problem formulated in Eqn 3. And the following proposition shows\nthat this de\ufb01nition gives a positive semi-de\ufb01nite kernel.\nProposition 3. The Fredholm kernel de\ufb01ned in Eqn 5 is positive semi-de\ufb01nite if kH is a positive\nsemi-de\ufb01nite kernel.\n\nThe proof is given in the appendix. The \u201couter\u201d kernel k does not have to be either positive de\ufb01nite\nor even symmetric. When using Gaussian kernel for k, discrete approximation in Eqn 5 might be\nunstable when the kernel width is small, so we also introduce the normalized Fredholm kernel,\n\n\u02c6kN\nF (x, z) =\n\n1\n\n(l + u)2\n\nl+uXi,j=1\n\nk(x, xi)\n\nPn k(x, xn)\n\nkH(xi, xj)\n\nk(z, xj)\n\nPn k(z, xn)\n\nIt is easy to check that the resulting Fredholm kernel \u02c6kN\nF is still symmetric and positive semi-de\ufb01nite.\nUsing Hinge Loss Other than L2 loss we use above, hinge loss can also be used for our Fredholm\nlearning framework. 
In this section, we explain how Fredholm kernel could be derived when using\nhinge loss. Plugging the hinge loss into Eqn 3, we have\n\n.\n\n(6)\n\nf\u21e4 = arg min\nf2H\n\n1\nl\n\nlXi=1\n\nmax(0, 1 yi \u00b7 (K\u02c6pX f )(xi)) + kfk2\nH.\n\n(7)\n\nLike the Representer Theorem, we proved in Lemma 1, the solution function f is always of the form\n\nf (x) =\n\nvikH(x, xi).\n\nl+uXi=1\n\nH = vT KHv, where KH is the kernel matrix.\n\nThus, kfk2\nAnd we only consider the evaluation of f at the data points, let f = [f (x1), . . . , f (xl+u)] = KHv.\nNow we can vectorize (K\u02c6pX f )(xi) as well, by letting ki = [ 1\nl+u k(xi, xl+u)].\nThus K\u02c6pX f (xi) = 1\ni KHv.\nAnd the optimization problem using hinge loss in Eqn 7 is equivalent to the following problem with\nslack variables \u21e0i,\n\nl+uPl+u\n\nj=1 k(xi, xj)f (xj) = kT\n\nl+u k(xi, x1), . . . , 1\n\ni f = kT\n\nTo solve the above problem, we introduce the Lagrangian multiplier,\n\nL(v,\u21e0,\u21b5, ) =\n\n1\n2\n\n\u21e0i Xi\n\n\u21b5i(yi \u00b7 (kiKHv) 1 + \u21e0i) Xi\n\ni\u21e0i\n\nBy the KKT condition in the theory of convex optimization, we have\n\nUsing this, we have the dual problem of the original problem in Eqn 7,\n\n\u21e0i\n\nvT KHv + CXi\n\n1\nmin\n2\nf2H\nyi \u00b7 (kT\ni KHv) 1 \u21e0i\nfor i = 1, . . . , l\n\u21e0i 0\n\ns.t.\n\nvT KHv + CXi\nv =Xi\n\u21b5 Xi\n\ns.t.\n\nmax\n\n\u21b5i \n0 \uf8ff \u21b5i \uf8ff C.\n\n\u21b5iyiki,\u21b5 i = C i\n\n1\n\n2Xi,j\n\n\u21b5i\u21b5jyiyjkT\n\ni KHkj\n\n4\n\nIt is equivalent to using Fredholm kernel for regular support vector machine, because kT\nkF (xi, xj) according to the de\ufb01nition of Fredholm kernel in Eqn 5.\n\ni KHkj =\n\n\f3 The Noise Assumption and Semi-supervised Learning\n\nIn order for unlabeled data to be useful in classi\ufb01cation tasks, it is necessary for the marginal distri-\nbution of the features to contain information about the conditional distribution of the labels. 
Several ways in which such information can be encoded have been proposed, including the “cluster assumption” [2] and the “manifold assumption” [1]. The cluster assumption states that a cluster (or a high-density area) contains only (or mostly) points belonging to the same class; that is, if x1 and x2 belong to the same cluster, the corresponding labels y1, y2 should be the same. The manifold assumption states that the regression function is smooth with respect to the underlying manifold structure of the data, which can be interpreted as saying that the geodesic distance should be used instead of the ambient distance for optimal classification. The success of algorithms based on these ideas indicates that these assumptions do capture certain characteristics of real data. Still, a better understanding of the data distribution may lead to further progress in data analysis.
The noise assumption. We now propose a new assumption, the “noise assumption”: in the neighborhood of every point, the directions with low variance (of the feature distribution) are uninformative with respect to the class labels and can be regarded as noise. While intuitive, as far as we know it has not been explicitly formulated in the context of semi-supervised learning algorithms, nor applied to theoretical analysis.
Note that even if the noise variance is small along each single direction, it can still significantly decrease the performance of supervised learning algorithms when the noise is high-dimensional. These accumulated non-informative variations increase the difficulty of learning a good classifier, in particular when the amount of labeled data is small. Figure 1 illustrates the issue of noise with two labeled points. The seemingly optimal classification boundary (the red line) differs from the correct one (in black) due to the noisy variation along the vertical axis for the two labeled points. Intuitively, the unlabeled data shown in the right panel of Figure 1 can be helpful in this setting, as the low-variance directions can be estimated locally, so that algorithms can suppress the influence of the noisy variation when learning a classifier.
Connection to cluster and manifold assumptions. The noise assumption is compatible with the manifold assumption within the “manifold+noise” model. Specifically, we can assume that the functions of interest vary along the manifold and are constant in the orthogonal directions. Alternatively, we can think of directions with high variance as “signal/manifold” and directions with low variance as “noise”. We note that the noise assumption does not require the data to conform to a low-dimensional manifold in the strict mathematical sense of the word. The noise assumption is orthogonal to the cluster assumption. For example, Figure 1 illustrates a situation where the data has no clusters but the noise assumption applies. For more examples and experimental results see Section 5.1.

Figure 1: Left: only labelled points. Right: with unlabelled points.

4 Theoretical Results for Fredholm Kernels

Non-informative variation in the data can degrade the performance of traditional supervised learning algorithms. We will now show that Fredholm kernels can be used in place of traditional kernels to inject them with “noise-suppression” power with the help of unlabelled data. In this section we present two views to illustrate how such noise suppression can be achieved. Specifically, in Section 4.1 we show that in a certain setup the linear Fredholm kernel suppresses principal components with small variance. 
In Section 4.2) we prove that under certain conditions Fredholm kernels are\nable to provide good approximations to the \u201ctrue\u201d kernel on the hidden underlying space.\nTo make our arguments more clear, in what follows, we assume that there is in\ufb01nite amount of\nunlabelled data; that is, we know the marginal distribution of data exactly. We will then consider the\nfollowing continuous versions of the un-normalized and normalized Fredholm kernels as in Eqn 5\n\n5\n\n\fand 6:\n\nand\n\nkU\n\nF (x, z) =Z Z k(x, u)kH(u, v)k(z, v)p(u)p(v)dudv\nR k(z, w)p(w)dw\n\nR k(x, w)p(w)dw\n\nkH(u, v)\n\nk(x, u)\n\nk(z, v)\n\nkN\n\nF (x, z) =Z Z\n\nNote, in the above equations and in what follows, we sometimes write p instead of pX for the\nmarginal distribution when its choice is clear from context. We will typically use kF to denote\nappropriate normalized or unnormalized kernels depending on the context.\n\n(8)\n\n(9)\n\np(u)p(v)dudv.\n\n4.1 Linear Fredholm kernels and inner products\n\nFor this section, we consider the unormalized Fredholm kernel, that is kF = kU\nF . If the \u201couter\u201d\nkernel k(u, v) is linear, i.e. k(u, v) = uT v, the resulting Fredholm kernel can be viewed as an inner\nproduct. Speci\ufb01cally, the un-normalized Fredholm kernel from Eqn 8 can be rewitten as\n\nkF (x, z) =Z Z (xT u)(zT v)kH(u, v)p(u)p(v)dudv = xT \u2303F z, where\n\n\u2303F =Z Z uvT kH(u, v)p(u)p(v)dudv =Z Z ukH(u, v)vT p(u)p(v)dudv.\n\n(10)\n\nThus kF (x, z) is simply an inner product which depends on both the data distribution p(x) and\nthe \u201cinner\u201d kernel kH. This inner product re-weights the standard norm in feature space based on\nvariances along the principal directions of the matrix \u2303F . We will show that for the model when\ndata is sampled from a normal distribution this kernel can be viewed as a \u201csoft thresholding\u201d PCA,\nsuppressing the directions with low variance.\nMore strictly, we have the following\n\nTheorem 4. 
Let kH(x, z) = exp\u21e3kxzk2\n\na single multi-variate normal distribution, N (\u00b5, diag(2\n\n2t \u2318 and assume the marginal distribution pX for data is\n\nd)). We have\n\n1, . . . , 2\n\nd + t!\u2713\u00b5\u00b5T + diag\u2713 4\n\n22\n\n1\n1 + t\n\n, . . . ,\n\n4\nD\n22\n\nD + t\u25c6\u25c6 .\n\n22\n\n\u2303F = DYd=1s t\nthe ith principal direction isq 4\n\ni\n22\n\nAssuming that the data is mean-subtracted, i.e. \u00b5 = 0, we see that xT \u2303F z re-scales the projections\nalong the principal components when computing the inner product; that is, the rescaling factor for\n\ni +t.\n4\ni +t \u21e1 0 when 2\ni\n22\n\n4\ni\n22\n\ni t, we\nNote that this rescaling factor\ni +t \u21e1 2\nhave that\n2 . Hence t can be considered as a soft threshold that eliminates the effects of\nprincipal components with small variances. When t is small the rescaling factors are approximately\nproportional to diag(2\nD), in which case \u2303F is is porportional to the covariance matrix\nof the data XX T .\n\ni \u2327 t. On the other hand when 2\n\n2, . . . , 2\n\n1, 2\n\ni\n\n4.2 Kernel Approximation With Noise\n\nWe have seen that one special case of Fredholm kernel could achieve the effect of principal compo-\nnents re-scaling by using linear kernel as the \u201couter\u201d kernel k. In this section we give a more general\ninterpretation of noise suppression by the Fredholm kernel.\nFirst, we give a simple senario to provide some intuition behind the de\ufb01nition of Fred-\nholm kernle. Consider a standard supervised learning setting which uses the solution f\u21e4 =\ndenote the ideal kernel that\narg minf2H\nwe intend to use on the clean data, which we call the target kernel from now on. Now suppose what\nwe have are two noisy labeled points xe and ze for \u201ctrue\u201d data \u00afx and \u00afz, i.e. xe = \u00afx+\"x, ze = \u00afz +\"z.\n\ni=1(f (xi) yi)2 + kfk2\nH\n\nas the classi\ufb01er. 
Let ktarget\n\nlPl\n\nH\n\n1\n\n6\n\n\fThe evaluation of ktarget\nH (xe, ze) can be quite different from the\ntrue signal ktarget\nH (\u00afx, \u00afz), leading to a suboptimal \ufb01nal classi\ufb01er\n(the red line in Figure 1 (a)). On the other hand, now con-\nsider the Fredholm kernel from Eqn 8 (or similarly from Eqn 9):\nkF (xe, ze) = RR k(xe, u)p(u) \u00b7 kH(u, v) \u00b7 k(ze, v)p(v)dudv,\nand set the outer kernel k to be the Gaussian kernel, and the in-\nner kernel kH to be the same as target kernel ktarget\n. We can think\nH\nof kF (xe, ze) as an averaging of kH(u, v) over all possible pairs\nof data u, v, weighted by k(xe, u)p(u) and k(ze, v)p(v) respec-\ntively. Speci\ufb01cally, points that are close to xe (resp. ze) with\nhigh density will receive larger weights. Hence the weighted\naverages will be biased towards \u00afx and \u00afz respectively (which pre-\nsumably lie in high density regions around xe and ze). The value of kF (xe, ze) tends to provide a\nmore accurate estimate of kH(\u00afx, \u00afz). See the right \ufb01gure for an illustration where the arrows indicate\npoints with stronger in\ufb02uences in the computation of kF (xe, ze) than kH(xe, ze). As a result, the\nclassi\ufb01er obtained using the Fredholm kernel will also be more resilient to noise and closer to the\noptimum.\nThe Fredholm learning framework is rather \ufb02exible in terms of the choices of kernels k and kH.\nIn the remainder of this section, we will consider a few speci\ufb01c scenarios and provide quantitative\nanalysis to show the noise-resilliency of the Fredholm kernel. In particular, for Section 4.2.1 and\n4.2.2, we will assume the following setup for data.\nProblem setup. Assume that we have a ground-truth distribution over the subspace spanned by the\n\ufb01rst d dimension of the Euclidean space RD. We will assume that this ground-truth distribution is\na single Gaussian N (0, 2Id). 
Now imagine that this ground-truth distribution is corrupted with\nGaussian noise along the orthogonal subspace of dimension D d. That is, for any observed point\nxe, it could be decomposed into \u00afx + \"x, where the signal \u00afx is drawn from N (0, 2Id), and the noise\n\"x is drawn from N (0, 2IDd) over the orthogonal space. Thus any observed point, labelled or\nunlabelled, is sampled from pX = N (0, diag(2Id, 2IDd), with the \ufb01rst d dimensions as signals\nand the rest corrupted by noises.\nWe will show that Fredholm kernel provides a better approximation to the \u201coriginal\u201d kernel given\nboth labeled and unlabeled data than directly computing the kernel evaluation at noisy labeled points.\nWe choose this simple setting so as to be able to state the theoretical results in a clean manner. Even\nthough this is just a Gaussian distribution over a linear subspace with noise this framework can be\ngeneralized since local neighborhoods of a Riemannian manifold can be approximated by linear\nspaces.\nNote. In this section, we use the normalized Fredholm kernel given in Eqn 9 for simplicity, that\nis kF = kN\nF for now on. Un-normalized Fredholm kernel displays similar behavior, however, the\ntheoretical bounds are more complicated.\n\nH (u, v) is the linear kernel, ktarget\n\n4.2.1 Linear Kernel\nFirst we consider the case where the target kernel ktarget\nH (u, v) = uT v.\nWe will set kH in Fredholm kernel to also be linear, and k to be the Gaussian kernel k(u, v) =\ne kuvk2\n2t We will compare kF (xe, ze) with the target kernel on the two observed points, that is,\nwith ktarget\nH (xe, ze). The goal is to estimate ktarget\nH (\u00afx, \u00afz). We will see that (1) both kF (xe, ze) and\n(appropriately scaled) kH(xe, ze) are unbiased estimators of ktarget\nH (\u00afx, \u00afz), however (2) the variance\nof kF (xe, ze) is smaller than that of ktarget\nTheorem 5. 
Suppose the probability distribution for the data is pX = N (0, diag(2Id, 2IDd)).\nFor Fredholm kernel de\ufb01ned in Eqn 9, we have\n\nH (xe, ze), making it a more precise estimator.\n\nExe,ze(ktarget\n\nH (xe, ze)) = Exe,ze \u2713 t + 2\n2 \u25c62\n\nkF (xe, ze)! = \u00afxT \u00afz\n\nMoreover, when > , Varxe,ze\u2713\u21e3 t+2\n2 \u23182\n\nkF (xe, ze)\u25c6 < Varxe,ze(ktarget\n\nH (xe, ze)).\n\n7\n\n\fRemark: Note that we have a normalization constant for the Fredholm kernel to make it an unbiased\nestimator of \u00afxT \u00afz. In practice, choosing normalization is subsumed in selecting the regularization\nparameter for kernel methods.\nWe will give a sketch of the proof, complete details can be found in the appendix.\nFirst, we have the following lemma regarding the estimator ktarget\nLemma 6. Given two samples xe \u21e0 N (\u00afx, diag([0d, 2IDd])), ze \u21e0 N (\u00afz, diag([0d, 2IDd])),\nlet kH(xe, ze) = xT\n\nH (xe, ze).\n\nExe,ze(ktarget\n\ne ze. We have:\nH (xe, ze)) = \u00afxT \u00afz and Varxe,ze(ktarget\n\nH (xe, ze)) = (D d)4.\n\nNow we consider the Fredholm kernel with the help of unlabelled points from the distribution p =\nN (0, diag(2Id, 2IDd)). Substituting kH(u, v) by the linear kernel uT v in Eqn 9, we have:\n\nkF (xe, ze) =Z Z\n\nk(xe, u)\n\nk(ze, v)\n\nuT vp(u)p(v)dudv\n\nR k(xe, w)p(w)dw\n=\u2713R k(xe, u)up(u)du\nR k(xe, w)p(w)dw\u25c6T\u2713Z\n\nR k(ze, w)p(w)dw\nR k(ze, w)p(w)dw\u25c6\n2t \u2318. Note R k(xe,u)up(u)du\nR k(xe,w)p(w)dw (resp. R k(ze,v)vp(v)dv\n\nk(ze, v)vp(v)dv\n\nR k(ze,w)p(w)dw ) is the\n\n(11)\n\nwhere recall k(u, v) = exp\u21e3kuvk2\n\nweighted mean of the unlabeled data, with the weight function being the normalized Gaussian kernel\ncentered at xe (resp. ze). Hence by Eqn 11, kF (xe, ze) is the linear kernel between these two means\n(instead of the linear kernel for xe and ze). Thus it is not too surprising that kF (xe, ze) should\nbe more stable than the straightforward approximation kH(xe, ze). 
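This stability can also be checked numerically. The sketch below is our own Monte Carlo illustration of Theorem 5 under the stated noise model, with illustrative parameter values (d, D, σ, ε, t); it relies on the standard Gaussian-integral identity that, for Gaussian k and p = N(0, S), the weighted mean in Eqn 11 equals S (S + tI)^{-1} x.

```python
import numpy as np

rng = np.random.default_rng(1)
d, D = 5, 20                    # signal and ambient dimensions (illustrative)
sig, eps, t = 2.0, 1.0, 1.0     # signal std, noise std (sig > eps), kernel width
xbar = np.zeros(D); xbar[:d] = rng.standard_normal(d)   # clean point x-bar
zbar = np.zeros(D); zbar[:d] = rng.standard_normal(d)   # clean point z-bar

# Closed form of the weighted mean in Eqn 11: m(x) = S (S + t I)^{-1} x.
S = np.concatenate([sig**2 * np.ones(d), eps**2 * np.ones(D - d)])
a = S / (S + t)                 # diagonal of S (S + t I)^{-1}
scale = ((t + sig**2) / sig**2) ** 2    # normalization from Theorem 5

N = 20000                       # draws of the noisy observations xe, ze
ex = np.zeros((N, D)); ex[:, d:] = eps * rng.standard_normal((N, D - d))
ez = np.zeros((N, D)); ez[:, d:] = eps * rng.standard_normal((N, D - d))
xe, ze = xbar + ex, zbar + ez

plain = (xe * ze).sum(1)                  # target linear kernel on noisy data
fred = scale * (a * xe * a * ze).sum(1)   # scaled Fredholm kernel, Eqn 11
```

Both estimators average to x̄ᵀz̄, but the Fredholm one concentrates much more tightly around it when σ > ε, which is exactly the variance reduction the following lemma quantifies.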
Indeed, we have the following\nlemma (proof in appendix).\nLemma 7. Given two samples xe \u21e0 N (\u00afx, diag([0d, 2IDd])), ze \u21e0 N (\u00afz, diag([0d, 2IDd])),\nlet kH(xe, ze) = xT\ne ze and p = N (0, diag(2Id, 2IDd)). Let kF be as de\ufb01ned in Eqn 11. We\nhave:\n\nand\n\nExe,ze \u2713 t + 2\n2 \u25c62\nkF (xe, ze)! = (D d)\u2713 2(t + 2)\nVarxe,ze \u2713 t + 2\n2 \u25c62\n2(t + 2)\u25c64\n\nkF (xe, ze)! = \u00afxT \u00afz\n\n4\n\nWith Lemma 6 and 7, we can now compare the variances. Since 2(t+2)\nTheorem 5 follows.\nThus we can see the Fredholm kernel provides an approximation of the \u201ctrue\u201d linear kernel, but with\nsmaller variance than the linear kernel on noisy data.\n\n2(t+2) < 1 when 2 > 2,\n\nH (u, v) =\n\n4.2.2 Gaussian Kernel\nWe now consider the case where the target kernel is the Gaussian kernel: ktarget\n\nexp\u21e3kuvk2\n\n2r \u2318. To approximate this kernel, we will set both k and kH to be Gaussian kernels.\n\nTo simplify the presentation of results, we assume that k and kH have the same kernel width t. The\nresulting Fredholm kernel turns out to also be a Gaussian kernel, whose kernel width depends on the\nchoice of t.\nOur main result is the following. Again, similar to the case of linear kernel, the Fredholm estimator\nkF (xe, ze) and the vanilla one ktarget\nH (\u00afx, \u00afz)\nupto a constant; but kF (xe, ze) has a smaller variance.\nTheorem 8. Suppose\nthe probability distribution for\n=\nN (0, diag(2Id, 2IDd)). 
Given the target kernel ktarget\nnel width r > 0, we can choose t, given by the equation t(t+2)(t+32)\nconstants c1, c2, such that\nExe,ze(c1\n\nH (xe, ze) are both unbiased estimator for the target ktarget\nthe unlabeled data pX\n\nH (u, v) = exp\u21e3kuvk2\n\n2r \u2318 with ker-\n\n2 kF (xe, ze)) = ktarget\n\n= r, and two scaling\n\n1 ktarget\n\nH (xe, ze)) = Exe,ze(c1\n\nH (\u00afx, \u00afz).\n\n4\n\n8\n\n\f1 ktarget\n\nH (xe, ze)) > Varxe,ze(c1\n\nand when 2 > 2, we have Varxe,ze(c1\nRemark.\nIn practice, when applying kernel methods for real world applications, optimal kernel\nwidth r is usually unknown and chosen by cross-validation or other methods. Similarly, for our\nFredholm kernel, one can also use cross-validation to choose the optimal t for kF .\nThe proof of Theorem 8 is more complicated than in the linear case, and can be found in the ap-\npendix.\n\n2 kF (xe, ze)).\n\n5 Experiments\n\nIn this section, we will demonstrate our Fredholm kernel empirically using both synthetic examples\nand data sets of text categorization and handwriting recognition. In section 5.1, we will use several\nexamples to illustrate the effect of reducing variances using Fredholm kernel and how noise assump-\ntion is distinguished from the conventional assumptions in semi-supervised learning, such as cluster\nassumption and manifold assumption. 
In Section 5.2, we show how classifiers based on the Fredholm kernel perform on real-world data sets such as handwritten digit recognition and text categorization, compared with other semi-supervised algorithms.
First recall the Fredholm kernel defined in the previous section:

k̂F(x, z) = (1/(l+u)²) Σ_{i,j=1}^{l+u} k(x, xi) kH(xi, xj) k(z, xj).

Using linear and Gaussian kernels for k and kH, we can define three instances of the Fredholm kernel as follows.

(1) FredLin1: k(x, z) = x^T z and kH(x, z) = exp(−||x − z||²/2r).
(2) FredLin2: k(x, z) = exp(−||x − z||²/2r) and kH(x, z) = x^T z.
(3) FredGauss: k(x, z) = kH(x, z) = exp(−||x − z||²/2r).

For the kernels in (2) and (3), which use the Gaussian kernel as the outer kernel k, we can also define their normalized versions:

k̂F,n(x, z) = (1/(l+u)²) Σ_{i,j=1}^{l+u} [k(x, xi)/Σ_n k(x, xn)] kH(xi, xj) [k(z, xj)/Σ_n k(z, xn)].

The resulting kernels are denoted by FredLin2(N) and FredGauss(N) respectively.

5.1 Synthetic Examples

Using specially designed toy examples, we can empirically verify the behavior of the Fredholm kernel characterized by the theoretical results in the last section.

5.1.1 Principal Component Regression
As we pointed out in Theorem 4, the Fredholm kernel and the associated Fredholm inner product space stress the principal components with larger variances while suppressing the ones with smaller variances. Instead of the hard cut-off used by many PCA-based methods, it provides a soft thresholding algorithm for feature selection. To demonstrate our method, we consider the principal component regression problem [8], which assumes that the regressor X and response Y have the following relationship:

Y = αXu1,

where u1 is the first principal component. In this experiment, the data distribution is a Gaussian distribution N (0, diag([10, 1, . . . 
, 1])). Note that the axes themselves are the principal components. We will compare our method with linear regression using (1) all the dimensions; and (2) the first k principal components, while the Fredholm kernel does not need any hard thresholding. Figure 2", "award": [], "sourceid": 1548, "authors": [{"given_name": "Qichao", "family_name": "Que", "institution": "The Ohio State University"}, {"given_name": "Mikhail", "family_name": "Belkin", "institution": "Ohio State University"}, {"given_name": "Yusu", "family_name": "Wang", "institution": "Ohio State University"}]}