{"title": "Large-Scale Sparsified Manifold Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 1401, "page_last": 1408, "abstract": null, "full_text": "Large-Scale Sparsified Manifold Regularization\nIvor W. Tsang, James T. Kwok\nDepartment of Computer Science and Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong\n{ivor,jamesk}@cse.ust.hk\n\nAbstract\nSemi-supervised learning can be more powerful than supervised learning because it uses both labeled and unlabeled data. In particular, the manifold regularization framework, together with kernel methods, leads to the Laplacian SVM (LapSVM) that has demonstrated state-of-the-art performance. However, the LapSVM solution typically involves kernel expansions of all the labeled and unlabeled examples, and is slow on testing. Moreover, existing semi-supervised learning methods, including the LapSVM, can only handle a small number of unlabeled examples. In this paper, we integrate manifold regularization with the core vector machine, which has been used for large-scale supervised and unsupervised learning. By using a sparsified manifold regularizer and formulating the problem as a center-constrained minimum enclosing ball problem, the proposed method produces sparse solutions with low time and space complexities. Experimental results show that it is much faster than the LapSVM, and can handle a million unlabeled examples on a standard PC, while the LapSVM can only handle several thousand patterns.\n\n1 Introduction\n\nIn many real-world applications, collection of labeled data is both time-consuming and expensive. On the other hand, a large amount of unlabeled data is often readily available. While traditional supervised learning methods can only learn from the limited amount of labeled data, semi-supervised learning [2] aims at improving the generalization performance by utilizing both the labeled and unlabeled data. 
The label dependencies among patterns are captured by exploiting the intrinsic geometric structure of the data. The underlying smoothness assumption is that two nearby patterns in a high-density region should share similar labels [2]. When the data lie on a manifold, it is common to approximate this manifold by a weighted graph, leading to graph-based semi-supervised learning methods. However, many of these are designed for transductive learning, and thus cannot be easily extended to out-of-sample patterns. Recently, attention has been drawn to the development of inductive methods, such as harmonic mixtures [15] and Nyström-based methods [3]. In this paper, we focus on the manifold regularization framework proposed in [1]. By defining a data-dependent reproducing kernel Hilbert space (RKHS), manifold regularization incorporates an additional regularizer to ensure that the learned function is smooth on the manifold. Kernel methods, which have been highly successful in supervised learning, can then be integrated with this RKHS. The resultant Laplacian SVM (LapSVM) demonstrates state-of-the-art semi-supervised learning performance [10]. However, a deficiency of the LapSVM is that its solution, unlike that of the SVM, is not sparse and so is much slower on testing. Moreover, while the original motivation of semi-supervised learning is to utilize the large amount of unlabeled data available, existing algorithms are only capable of handling a small to moderate amount of unlabeled data. Recently, attempts have been made to scale up these methods. Sindhwani et al. [9] sped up manifold regularization by restricting it to linear models, which, however, may not be flexible enough for complicated target functions. Garcke and Griebel [5] proposed to use discretization with a sparse grid. Though it scales linearly with the sample size, its time complexity grows exponentially with data dimensionality. 
As reported in a recent survey [14], most semi-supervised learning methods can only handle 100 to 10,000 unlabeled examples. More recently, Gärtner et al. [6] presented a solution in the more restrictive transductive setting. The largest graph they worked with involves 75,888 labeled and unlabeled examples. Thus, no one has yet experimented on massive data sets with, say, one million unlabeled examples. On the other hand, the Core Vector Machine (CVM) was recently proposed for scaling up kernel methods in both supervised (including classification [12] and regression [13]) and unsupervised learning (e.g., novelty detection). Its main idea is to formulate the learning problem as a minimum enclosing ball (MEB) problem in computational geometry, and then use a (1 + ε)-approximation algorithm to obtain a close-to-optimal solution efficiently. Given m samples, the CVM has an asymptotic time complexity that is only linear in m and a space complexity that is even independent of m for a fixed ε. Experimental results on real-world data sets with millions of patterns demonstrated that the CVM is much faster than existing SVM implementations and can handle much larger data sets. In this paper, we extend the CVM to semi-supervised learning. To restore sparsity of the LapSVM solution, we first introduce a sparsified manifold regularizer based on the ε-insensitive loss. Then, we incorporate manifold regularization into the CVM. It turns out that the resultant QP can be cast as a center-constrained MEB problem introduced in [13]. The rest of this paper is organized as follows. In Section 2, we first give a brief review on manifold regularization. Section 3 then describes the proposed algorithm for semi-supervised classification and regression. Experimental results on very large data sets are presented in Section 4, and the last section gives some concluding remarks.\n\n2 Manifold Regularization\n\nConsider a training set {(x_i, y_i)}_{i=1}^m with input x_i ∈ X and output y_i ∈ R. 
The regularized risk functional is the sum of the empirical risk (corresponding to a loss function ℓ) and a regularizer Ω. Given a kernel k and its RKHS H_k, we minimize the regularized risk over functions f in H_k:\n\nmin_{f ∈ H_k} (1/m) Σ_{i=1}^m ℓ(x_i, y_i, f(x_i)) + λ Ω(‖f‖_{H_k}). (1)\n\nHere, ‖·‖_{H_k} denotes the RKHS norm and λ > 0 is a regularization parameter. By the representer theorem, the minimizer f admits the representation f(x) = Σ_{i=1}^m α_i k(x_i, x), where α_i ∈ R. Therefore, the problem is reduced to an optimization over the finite-dimensional space of α_i's.\n\nIn semi-supervised learning, we have both labeled examples {(x_i, y_i)}_{i=1}^m and unlabeled examples {x_i}_{i=m+1}^{m+n}. Manifold regularization uses an additional regularizer ‖f‖_I^2 to ensure that the function f is smooth on the intrinsic structure of the input. The objective function in (1) is then modified to:\n\nmin_{f ∈ H_k} (1/m) Σ_{i=1}^m ℓ(x_i, y_i, f(x_i)) + λ Ω(‖f‖_{H_k}) + λ_I ‖f‖_I^2, (2)\n\nwhere λ_I is another tradeoff parameter. It can be shown that the minimizer f is of the form f(x) = Σ_{i=1}^m α_i k(x_i, x) + ∫_M α(x') k(x, x') dP_X(x'), where M is the support of the marginal distribution P_X of X [1].\n\nIn practice, we do not have access to P_X. Now, assume that the support M of P_X is a compact submanifold, and take ‖f‖_I^2 = ∫_M ⟨∇f, ∇f⟩, where ∇f is the gradient of f. It is common to approximate this manifold by a weighted graph defined on all the labeled and unlabeled data, as G = (V, E) with V and E being the sets of vertices and edges respectively. Denote the weight function by w and the degree by d(u) = Σ_{v∼u} w(u, v). Here, v ∼ u means that u, v are adjacent. Then, ‖f‖_I^2 is approximated as¹\n\n‖f‖_I^2 = Σ_{e∈E} w(u_e, v_e) ( f(x_{u_e})/s(u_e) - f(x_{v_e})/s(v_e) )^2, (3)\n\n¹ When the set of labeled and unlabeled data is small, a function that is smooth on this small set may not be interesting. 
However, this is not an issue here as our focus is on massive data sets.\n\nwhere u_e and v_e are the vertices of the edge e, and s(u) = √d(u) when the normalized graph Laplacian is used, and s(u) = 1 with the unnormalized one. As shown in [1], the minimizer of (2) becomes f(x) = Σ_{i=1}^{m+n} α_i k(x_i, x), which depends on both labeled and unlabeled examples.\n\n2.1 Laplacian SVM\n\nUsing the hinge loss ℓ(x_i, y_i, f(x_i)) = max(0, 1 - y_i f(x_i)) in (2), we obtain the Laplacian SVM (LapSVM) [1]. Its training involves two steps. First, solve the quadratic program (QP)\n\nmax β'1 - (1/2) β'Qβ : β'y = 0, 0 ≤ β ≤ (1/m) 1\n\nto obtain β*. Here, β = [β_1, ..., β_m]', 1 = [1, ..., 1]', Q_{m×m} = YJK(2λI + 2λ_I LK)^{-1} J'Y, Y_{m×m} is the diagonal matrix with Y_ii = y_i, K_{(m+n)×(m+n)} is the kernel matrix over both the labeled and unlabeled data, L_{(m+n)×(m+n)} is the graph Laplacian, and J_{m×(m+n)} has J_ij = 1 if i = j and x_i is a labeled example, and J_ij = 0 otherwise. The optimal α = [α_1, ..., α_{m+n}]' is then obtained by solving the linear system α* = (2λI + 2λ_I LK)^{-1} J'Yβ*. Note that the matrix 2λI + 2λ_I LK is of size (m + n) × (m + n), and so its inversion can be very expensive when n is large. Moreover, unlike the standard SVM, the α* obtained is not sparse and so evaluation of f(x) is slow.\n\n3 Proposed Algorithm\n\n3.1 Sparsified Manifold Regularizer\n\nTo restore sparsity of the LapSVM solution, we replace the square function in the manifold regularizer (3) by the ε̄-insensitive loss function², as\n\n‖f‖_I^2 = Σ_{e∈E} w(u_e, v_e) | f(x_{u_e})/s(u_e) - f(x_{v_e})/s(v_e) |_ε̄^2, (4)\n\nwhere |z|_ε̄ = 0 if |z| ≤ ε̄; and |z| - ε̄ otherwise. Obviously, it reduces to (3) when ε̄ = 0. As will be shown in Section 3.3, the α solution obtained will be sparse. Substituting (4) into (2), we have:\n\nmin_{f ∈ H_k} { (1/m) Σ_{i=1}^m ℓ(x_i, y_i, f(x_i)) + λ_I Σ_{e∈E} w(u_e, v_e) | f(x_{u_e})/s(u_e) - f(x_{v_e})/s(v_e) |_ε̄^2 } + λ Ω(‖f‖_{H_k}).\n\nBy treating the terms inside the braces as the \"loss function\", this can be regarded as regularized risk minimization and, using the standard representer theorem, the minimizer f then admits the form f(x) = Σ_{i=1}^{m+n} α_i k(x_i, x), same as that of the original manifold regularization.\n\nMoreover, putting f(x) = w'φ(x) + b into (4), we obtain ‖f‖_I^2 = Σ_{e∈E} |w'ψ_e + bτ_e|_ε̄^2, where ψ_e = √w(u_e, v_e) ( φ(x_{u_e})/s(u_e) - φ(x_{v_e})/s(v_e) ) and τ_e = √w(u_e, v_e) ( 1/s(u_e) - 1/s(v_e) ). The primal of the LapSVM can then be formulated as:\n\nmin ‖w‖^2 + b^2 + (C/(μm)) Σ_{i=1}^m ξ_i^2 + 2Cε̄ + (Cθ/(|E|μ)) Σ_{e∈E} (ζ_e^2 + ζ_e^{*2}) (5)\ns.t. y_i(w'φ(x_i) + b) ≥ 1 - ε̄ - ξ_i, i = 1, ..., m, (6)\n-(w'ψ_e + bτ_e) ≤ ε̄ + ζ_e, w'ψ_e + bτ_e ≤ ε̄ + ζ_e*, e ∈ E. (7)\n\nHere, |E| is the number of edges in the graph, ξ_i is the slack variable for the error, ζ_e, ζ_e* are slack variables for edge e, and C, μ, θ are user-defined parameters. As in previous CVM formulations [12, 13], the bias b is penalized and the two-norm errors (ξ_i^2, ζ_e^2 and ζ_e^{*2}) are used. Moreover, the constraints ξ_i, ζ_e, ζ_e*, ε̄ ≥ 0 are automatically satisfied. When ε̄ = 0, (5) reduces to the original LapSVM (using two-norm errors). When θ is also zero, it becomes the Lagrangian SVM.\n\nThe dual can be easily obtained as the following QP:\n\nmax [β' γ' γ*'] [(2/C)1' 0' 0']' - [β' γ' γ*'] K̃ [β' γ' γ*']' : [β' γ' γ*'] 1 = 1, β, γ, γ* ≥ 0, (8)\n\nwhere β = [β_1, ..., β_m]', γ = [γ_1, ..., γ_|E|]', γ* = [γ_1*, ..., γ_|E|*]' are the dual variables, and\n\nK̃ = [ (K + 11') ⊙ yy' + (μm/C) I,  V,  -V ;\n      V',  U + (|E|μ/(Cθ)) I,  -U ;\n      -V',  -U,  U + (|E|μ/(Cθ)) I ]\n\nis the transformed \"kernel matrix\". Here, K is the kernel matrix defined using kernel k on the m labeled examples, U_{|E|×|E|} = [ψ_e'ψ_f + τ_e τ_f], and V_{m×|E|} = [y_i (φ(x_i)'ψ_e + τ_e)]. Note that while each entry of the matrix Q in LapSVM (Section 2.1) requires O((m + n)^2) kernel k(x_i, x_j) evaluations, each entry in K̃ here takes only O(1) kernel evaluations. This is particularly favorable to decomposition methods such as SMO as most of the CPU computations are typically dominated by kernel evaluations.\n\nMoreover, it can be shown that μ is a parameter that controls the size of ε̄, analogous to the ν parameter in ν-SVR. Hence, only μ, but not ε̄, appears in (8). Moreover, the primal variables can be easily recovered from the dual variables by the KKT conditions. In particular, w = C ( Σ_{i=1}^m β_i y_i φ(x_i) + Σ_{e∈E} (γ_e - γ_e*) ψ_e ) and b = C ( Σ_{i=1}^m β_i y_i + Σ_{e∈E} (γ_e - γ_e*) τ_e ). Subsequently, the decision function f(x) = w'φ(x) + b is a linear combination of k(x_i, x)'s defined on both the labeled and unlabeled examples, as in standard manifold regularization.\n\n² To avoid confusion with the ε in the (1 + ε)-approximation, we add a bar to the ε here.\n\n3.2 Transforming to a MEB Problem\n\nWe now show that CVM can be used for solving the possibly very large QP in (8). In particular, we will transform this QP to the dual of a center-constrained MEB problem [13], which is of the form:\n\nmax α'(diag(K) + Δ - η1) - α'Kα : α ≥ 0, α'1 = 1, (9)\n\nfor some 0 ≤ Δ ∈ R^m and η ∈ R. From the variables in (8), define α̃ = [β' γ' γ*']' and Δ̃ = -diag(K̃) + η1 + (2/C)[1' 0' 0']' s.t. Δ̃ ≥ 0 for some sufficiently large η. (8) can then be written as max α̃'(diag(K̃) + Δ̃ - η1) - α̃'K̃α̃ : α̃ ≥ 0, α̃'1 = 1, which is of the form in (9).\n\nThe above formulation can be easily extended to the regression case, with the pattern output changed from ±1 to y_i ∈ R, and the hinge loss replaced by the ε-insensitive loss. Converting the resultant QP to the form in (9) is also straightforward.\n\n3.3 Sparsity\n\nIn Section 3.3.1, we first explain why a sparse solution can be obtained by using the KKT conditions. Alternatively, by building on [7], we show in Section 3.3.2 that the ε̄-insensitive loss achieves a similar effect as the ℓ_1 penalty in LASSO [11], which is known to produce sparse approximation.\n\n3.3.1 KKT Perspective\n\nBasically, this follows from the standard argument as for sparse solutions with the ε-insensitive loss in SVR. From the KKT condition associated with (6): β_i (y_i(w'φ(x_i) + b) - 1 + ε̄ + ξ_i) = 0. 
As for the SVM, most patterns are expected to lie outside the margin (i.e. y_i(w'φ(x_i) + b) > 1 - ε̄) and so most β_i's are zero. Similarly, manifold regularization finds an f that is locally smooth. Hence, from the definition of ψ_e and τ_e, many values of (w'ψ_e + bτ_e)'s will be inside the ε̄-tube. Using the KKT conditions associated with (7), the corresponding γ_e's and γ_e*'s are zero. As f(x) is a linear combination of the k(x_i, x)'s weighted by β_i and γ_e - γ_e* (Section 3.1), f is thus sparse.\n\n3.3.2 LASSO Perspective\n\nOur exposition will be along the line pioneered by Girosi [7], who established a connection between the ε-insensitive loss in SVR and sparse approximation. Given a predictor f(x) = Σ_{i=1}^m α_i k(x_i, x), we consider minimizing the error between f = [f(x_1), ..., f(x_m)]' = Kα and y = [y_1, ..., y_m]'. While sparse approximation techniques such as basis pursuit typically use the L2 norm for the error, Girosi argued that the norm of the RKHS H_k is a better measure of smoothness. However, the RKHS norm operates on functions, while here we have vectors f and y w.r.t. x_1, ..., x_m. Hence, we will use the kernel PCA map with ‖y - f‖_K^2 ≡ (y - f)'K^{-1}(y - f).\n\nFirst, consider the simpler case where the manifold regularizer is replaced by a simple regularizer ‖α‖_2^2. As in LASSO, we also add an ℓ_1 penalty on α. The optimization problem is formulated as:\n\nmin ‖y - f‖_K^2 + (μm/C) ‖α‖_2^2 : ‖α‖_1 = C, (10)\n\nwhere C and μ are constants. As in [7], we decompose α as β - β*, where β, β* ≥ 0 and β_i β_i* = 0. Then, (10) can be rewritten as:\n\nmax [β' β*'] [2y' -2y']' - [β' β*'] K̃ [β' β*']' : β, β* ≥ 0, β'1 + β*'1 = C, (11)\n\nwhere³\n\nK̃ = [ K + (μm/C) I,  -K ;\n      -K,  K + (μm/C) I ].\n\nOn the other hand, consider the following variant of SVR using the ε-insensitive loss:\n\nmin ‖w‖^2 + (C/(μm)) Σ_{i=1}^m (ξ_i^2 + ξ_i^{*2}) + 2Cε̄ : y_i - w'φ(x_i) ≤ ε̄ + ξ_i, w'φ(x_i) - y_i ≤ ε̄ + ξ_i*. (12)\n\nIt can be shown that its dual is identical to (11), with β, β* as dual variables. 
Moreover, the LASSO penalty (i.e., the equality constraint in (11)) is induced from the ε̄ in (12). Hence, the ε-insensitive loss in SVR achieves a similar effect as using the error ‖y - f‖_K^2 and the LASSO penalty.\n\nWe now add back the manifold regularizer. The derivation is similar, though more involved, and so details are skipped. As above, the key steps are replacing the ℓ_2 norm by the kernel PCA map, and adding an ℓ_1 penalty on the variables. It can then be shown that the sparsified manifold regularizer (based on the ε̄-insensitive loss) can again be recovered by using the LASSO penalty.\n\n3.4 Complexities\n\nAs the proposed algorithm is an extension of the CVM, its properties are analogous to those in [12]. For example, its approximation ratio is (1 + ε)^2, and so the approximate solution obtained is very close to the exact optimal solution. As for the computational complexities, it can be shown that the SLapCVM only takes O(1/ε^8) time and O(1/ε^2) space when probabilistic speedup is used. (Here, we ignore the O(m + |E|) space required for storing the m training patterns and 2|E| edge constraints, as these may be stored outside the core memory.) They are thus independent of the numbers of labeled and unlabeled examples for a fixed ε. In contrast, LapSVM involves an expensive matrix inversion for K_{(m+n)×(m+n)} and requires O((m + n)^3) time and O((m + n)^2) space.\n\n3.5 Remarks\n\nThe reduced SVM [8] has been used to scale up the standard SVM. Hence, another natural alternative is to extend it for the LapSVM. This \"reduced LapSVM\" solves a smaller optimization problem that involves a random r × (m + n) rectangular subset of the kernel matrix, where the r patterns are chosen from both the labeled and unlabeled data. It can be easily shown that it requires O((m + n)^2 r) time and O((m + n)r) space. Experimental comparisons based on this will be made in Section 4. 
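The rectangular kernel subset used by the reduced LapSVM can be pictured with a short sketch. The code below is only an illustration under assumptions that are not in the paper: the helper names gaussian_kernel and reduced_kernel, the NumPy implementation, the fixed random seeds and the toy data are all hypothetical. It draws r patterns at random from the pooled labeled and unlabeled data and evaluates the kernel only between those r patterns and all m + n patterns, so the stored block has r (m + n) entries instead of (m + n)^2.

```python
import numpy as np

def gaussian_kernel(A, B, beta):
    # Pairwise Gaussian kernel exp(-||a - b||^2 / beta) between rows of A and B.
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / beta)

def reduced_kernel(X, r, beta, seed=0):
    # Hypothetical sampling scheme: pick r of the m + n pooled patterns
    # uniformly at random, without replacement.
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=r, replace=False)
    # Only an r x (m + n) block of the full kernel matrix is ever formed.
    return idx, gaussian_kernel(X[idx], X, beta)

# Toy pooled data set with m + n = 1000 patterns in two dimensions.
X = np.random.default_rng(1).normal(size=(1000, 2))
idx, K_red = reduced_kernel(X, r=50, beta=1.0)
print(K_red.shape)  # (50, 1000)
```

With r fixed, the memory for this block grows linearly in m + n, which matches the O((m + n)r) space figure quoted above; the sketch says nothing about how the reduced problem is then solved.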
Note that the CVM [12] is in many aspects similar to the column generation technique [4] commonly used in large-scale linear or integer programs. Both start with only a small number of nonzero variables, and the restricted master problem in column generation corresponds to the inner QP that is solved at each CVM iteration. Moreover, both can be regarded as primal methods that maintain primal⁴ feasibility and work towards dual feasibility. Also, as is typical in column generation, the dual variable whose KKT condition is most violated is added at each iteration. The key difference⁵, however, is that CVM exploits the \"approximateness\" as in other approximation algorithms. Instead of requiring the dual solution to be strictly feasible, CVM only requires it to be feasible within a factor of (1 + ε). This, together with the fact that its dual is a MEB problem, allows its number of iterations for convergence to be bounded and thus the total time and space complexities guaranteed. On the other hand, we are not aware of any similar results for column generation.\n\n³ For simplicity, here we have only considered the case where f does not have a bias. In the presence of a bias, it can be easily shown that K (in the expression of K̃) has to be replaced by K + 11'.\n⁴ By convention, column generation takes the optimization problem to be solved as the primal. Hence, in this section, we also regard the QP to be solved as CVM's primal, and the MEB problem as its dual. Note that each dual variable then corresponds to a training pattern.\n⁵ Another difference is that an entire column is added at each iteration of column generation. However, in CVM, the dual variable added is just a pattern and the extra space required for the QP is much smaller. Besides, there are other implementation tricks (such as probabilistic speedup) that further improve the speed of CVM.\n\n
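The role of the (1 + ε) tolerance can be illustrated on a plain Euclidean minimum enclosing ball. The sketch below is not the CVM algorithm of [12]: it swaps in the simple iterative update of Bădoiu and Clarkson, which repeatedly moves the center towards the current furthest point, and it works on input vectors rather than in a kernel-induced feature space. The function name approx_meb and the toy data are hypothetical. What it shares with the discussion above is that the number of iterations depends only on ε and not on the number of points.

```python
import math
import numpy as np

def approx_meb(P, eps):
    # Start from an arbitrary input point and take ceil(1 / eps^2) steps;
    # the step count depends on eps only, not on the number of points.
    c = P[0].astype(float)
    n_iter = math.ceil(1.0 / eps ** 2)
    for i in range(1, n_iter + 1):
        d = np.linalg.norm(P - c, axis=1)
        far = P[np.argmax(d)]          # point furthest from the current center
        c = c + (far - c) / (i + 1)    # shrinking step towards that point
    radius = np.linalg.norm(P - c, axis=1).max()
    return c, radius

# Four corners of a square: the exact ball has center (0, 0) and radius sqrt(2).
P = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
c, radius = approx_meb(P, eps=0.1)
print(radius <= (1 + 0.1) * math.sqrt(2))
```

The returned ball encloses every point by construction, and its radius stays within the (1 + ε) factor of the optimum on this example. CVM itself maintains a core set and re-solves a small QP at each iteration, which this simplified update does not attempt to reproduce.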
By regarding the CVM as the approximation algorithm counterpart of column generation, this suggests that the CVM can also be used in the same way as column generation in speeding up other optimization problems. For example, the CVM can also be used for SVM training with other loss functions (e.g. 1-norm error). However, as the dual may no longer be a MEB problem, the downside is that its convergence bound and complexity results in Section 3.4 may no longer be available.\n\n4 Experiments\n\nIn this section, we perform experiments on some massive data sets⁶ (Table 1). The graph (for the manifold) is constructed by using the 6 nearest neighbors of each pattern, and the weight w(u_e, v_e) in (3) is defined as exp(-‖x_{u_e} - x_{v_e}‖^2/g), where g = (1/|E|) Σ_{e∈E} ‖x_{u_e} - x_{v_e}‖^2. For simplicity, we use the unnormalized Laplacian and so all s(·)'s in (3) are 1. The value of μm in (5) is always fixed at 1, and the other parameters are tuned by a small validation set. Unless otherwise specified, we use the Gaussian kernel exp(-‖x - z‖^2/β), with β = (1/m) Σ_{i=1}^m ‖x_i - x̄‖^2. For comparison, we also run the LapSVM⁷ and another LapSVM implementation based on the reduced SVM [8] (Section 3.5). All the experiments are performed on a 3.2GHz Pentium 4 PC with 1GB RAM.\n\nTable 1: A summary of the data sets used.\n\ndata set | #attrib | class | #training patns (labeled) | #training patns (unlabeled) | #test patns\ntwo-moons | 2 | + | 1 | 500,000 | 2,500\ntwo-moons | 2 | - | 1 | 500,000 | 2,500\nextended USPS | 676 | + | 1 | 144,473 | 43,439\nextended USPS | 676 | - | 1 | 121,604 | 31,944\nextended MIT face | 361 | + | 5 | 408,067 | 472\nextended MIT face | 361 | - | 5 | 481,909 | 23,573\n\n4.1 Two-Moons Data Set\n\nWe first perform experiments on the popular two-moons data set, and use one labeled example for each class (Figure 1(a)). To better illustrate the scaling behavior, we vary the number of unlabeled patterns used for training (from 1,000 up to a maximum of 1 million). Following [1], the width of the Gaussian kernel is set to 0.25. 
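The graph construction described at the start of Section 4 can be sketched as follows. This is a minimal illustration under assumptions that are not in the paper: a brute-force neighbor search that only suits small samples, edges stored as unordered pairs so that mutual neighbors are counted once, and hypothetical names such as knn_graph. The weights follow the stated form exp(-‖x_u - x_v‖^2/g), with g the mean squared edge length.

```python
import numpy as np

def knn_graph(X, k=6):
    # Squared Euclidean distances between all pairs (brute force, O(n^2) memory).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(sq, np.inf)            # a pattern is not its own neighbor
    nbrs = np.argsort(sq, axis=1)[:, :k]    # indices of the k nearest neighbors
    # Store each edge once as an unordered pair (an assumption of this sketch).
    pairs = {(min(u, int(v)), max(u, int(v)))
             for u in range(X.shape[0]) for v in nbrs[u]}
    edges = np.array(sorted(pairs))
    d2 = sq[edges[:, 0], edges[:, 1]]       # squared length of every edge
    g = d2.mean()                           # mean squared edge length
    weights = np.exp(-d2 / g)
    return edges, weights

# Toy sample of 200 patterns in two dimensions.
X = np.random.default_rng(0).normal(size=(200, 2))
edges, weights = knn_graph(X, k=6)
print(len(edges), float(weights.min()), float(weights.max()))
```

For data sets of the size used here, the all-pairs distance matrix would not fit in memory, so a practical implementation would need an approximate or tree-based neighbor search; the sketch only fixes the definitions of the edge set and of the weights.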
For the reduced LapSVM implementation, we fix r = 200.\n\n[Figure 1 panels: (a) Data distribution. (b) Typical decision boundary obtained by SLapCVM. (c) CPU time (in seconds) vs. number of unlabeled points. (d) Number of kernel expansions vs. number of unlabeled points.]\n\nFigure 1: Results on the two-moons data set (some abscissas and ordinates are in log scale). The two labeled examples are shown in red in Figure 1(a).\n\nResults are shown in Figure 1. Both the LapSVM and SLapCVM always attain 100% accuracy on the test set, even with only two labeled examples (Figure 1(b)). However, SLapCVM is faster than LapSVM (Figure 1(c)). Moreover, as mentioned in Section 2.1, the LapSVM solution is non-sparse and all the labeled and unlabeled examples are involved in the solution (Figure 1(d)). On the other hand, SLapCVM uses only a small fraction of the examples. As can be seen from Figures 1(c) and 1(d), both the time and space required by the SLapCVM are almost constant, even when the unlabeled data set gets very large. The reduced LapSVM, though also fast, is slightly inferior to both the SLapCVM and LapSVM. Moreover, note that both the standard and reduced LapSVMs cannot be run on the full data set on our PC because of their large memory requirements.\n\n⁶ Both the USPS and MIT face data sets are downloaded from http://www.cs.ust.hk/ivor/cvm.html.\n⁷ http://manifold.cs.uchicago.edu/manifold regularization/.\n\n4.2 Extended USPS Data Set\n\nThe second experiment is performed on the USPS data from [12]. One labeled example is randomly sampled from each class for training. To achieve comparable accuracy, we use r = 2,000 for the reduced LapSVM. For comparison, we also train a standard SVM with the two labeled examples. 
Results are shown in Figure 2. As can be seen, the SLapCVM is again faster (Figure 2(a)) and produces a sparser solution than LapSVM (Figure 2(b)). For the SLapCVM, both the time required and number of kernel expansions involved grow only sublinearly with the number of unlabeled examples. Figure 2(c) demonstrates that semi-supervised learning (using either the LapSVMs or SLapCVM) can have much better generalization performance than supervised learning using the labeled examples only. Note that although the use of the 2-norm error in SLapCVM could in theory be less robust than the use of the 1-norm error in LapSVM, the SLapCVM solution is indeed always more accurate than that of LapSVM. On the other hand, the reduced LapSVM has comparable speed with the SLapCVM, but its performance is inferior and it cannot handle large data sets.\n\n[Figure 2 panels: (a) CPU time (in seconds) vs. number of unlabeled points. (b) Number of kernel expansions vs. number of unlabeled points. (c) Test error rate (in %) vs. number of unlabeled points.]\n\nFigure 2: Results on the extended USPS data set (some abscissas and ordinates are in log scale).\n\n4.3 Extended MIT Face Data Set\n\nIn this section, we perform face detection using the extended MIT face database in [12]. Five labeled examples are randomly sampled from each class and used in training. Because of the imbalanced nature of the test set (Table 1), the classification error is inappropriate for performance evaluation here. 
Instead, we will use the area under the ROC curve (AUC) and the balanced loss 1 - (TP + TN)/2, where TP and TN are the true positive and true negative rates respectively. Here, faces are treated as positives while non-faces as negatives. For the reduced LapSVM, we again use r = 2,000. For comparison, we also train two SVMs: one uses the 10 labeled examples only while the other uses all the labeled examples (a total of 889,986) in the original training set of [12]. Figure 3 shows the results. Again, the SLapCVM is faster and produces a sparser solution than LapSVM. Note that the SLapCVM, using only 10 labeled examples, can attain comparable AUC and even better balanced loss than the SVM trained on the original, massive training set (Figures 3(c) and 3(d)). This clearly demonstrates the usefulness of semi-supervised learning when a large amount of unlabeled data can be utilized. On the other hand, note that the LapSVM again cannot be run with more than 3,000 unlabeled examples on our PC because of its high space requirement. The reduced LapSVM performs very poorly here, possibly because this data set is highly imbalanced.\n\n[Figure 3 panels: (a) CPU time (in seconds) vs. number of unlabeled points. (b) Number of kernel expansions vs. number of unlabeled points. (c) Balanced loss (in %) vs. number of unlabeled points. (d) AUC vs. number of unlabeled points.]\n\nFigure 3: Results on the extended MIT face data (some abscissas and ordinates are in log scale).\n\n5 Conclusion\n\nIn this paper, we addressed two issues associated with the Laplacian SVM: 1) How to obtain a sparse solution for fast testing? 2) How to handle data sets with millions of unlabeled examples? For the first issue, we introduced a sparsified manifold regularizer based on the ε-insensitive loss. For the second issue, we integrated manifold regularization with the CVM. The resultant algorithm has low time and space complexities. Moreover, by avoiding the underlying matrix inversion in the original LapSVM, a sparse solution can also be recovered. Experiments on a number of massive data sets show that the SLapCVM is much faster than the LapSVM. Moreover, while the LapSVM can only handle several thousand unlabeled examples, the SLapCVM can handle one million unlabeled examples on the same machine. On one data set, this produces comparable or even better performance than the (supervised) CVM trained on 900K labeled examples. This clearly demonstrates the usefulness of semi-supervised learning when a large amount of unlabeled data can be utilized.\n\nReferences\n[1] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399-2434, 2006.\n[2] O. 
Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. MIT Press, Cambridge, MA, USA, 2006.\n[3] O. Delalleau, Y. Bengio, and N. L. Roux. Efficient non-parametric function induction in semi-supervised learning. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, Barbados, January 2005.\n[4] G. Desaulniers, J. Desrosiers, and M.M. Solomon. Column Generation. Springer, 2005.\n[5] J. Garcke and M. Griebel. Semi-supervised learning with sparse grids. In Proceedings of the ICML Workshop on Learning with Partially Classified Training Data, Bonn, Germany, August 2005.\n[6] T. Gärtner, Q.V. Le, S. Burton, A. Smola, and S.V.N. Vishwanathan. Large-scale multiclass transduction. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18. MIT Press, Cambridge, MA, 2006.\n[7] F. Girosi. An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6):1455-1480, 1998.\n[8] Y.-J. Lee and O.L. Mangasarian. RSVM: Reduced support vector machines. In Proceeding of the First SIAM International Conference on Data Mining, 2001.\n[9] V. Sindhwani, M. Belkin, and P. Niyogi. The geometric basis of semi-supervised learning. In Semi-supervised Learning. MIT Press, 2005.\n[10] V. Sindhwani, P. Niyogi, and M. Belkin. Beyond the point cloud: from transductive to semi-supervised learning. In Proceedings of the Twenty-Second International Conference on Machine Learning, pages 825-832, Bonn, Germany, August 2005.\n[11] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B, 58:267-288, 1996.\n[12] I. W. Tsang, J. T. Kwok, and P.-M. Cheung. Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 6:363-392, 2005.\n[13] I. W. Tsang, J. T. Kwok, and K. T. Lai. Core vector regression for very large regression problems. In Proceedings of the Twenty-Second International Conference on Machine Learning, pages 913-920, Bonn, Germany, August 2005.\n[14] X. Zhu. Semi-supervised learning literature survey. Technical Report 1530, Department of Computer Sciences, University of Wisconsin - Madison, 2005.\n[15] X. Zhu and J. Lafferty. Harmonic mixtures: Combining mixture models and graph-based methods. In Proceedings of the Twenty-Second International Conference on Machine Learning, Bonn, Germany, August 2005.\n", "award": [], "sourceid": 3005, "authors": [{"given_name": "Ivor", "family_name": "Tsang", "institution": null}, {"given_name": "James", "family_name": "Kwok", "institution": null}]}