{"title": "Efficient Convex Relaxation for Transductive Support Vector Machine", "book": "Advances in Neural Information Processing Systems", "page_first": 1641, "page_last": 1648, "abstract": "We consider the problem of Support Vector Machine transduction, which involves a combinatorial problem with exponential computational complexity in the number of unlabeled examples. Although several studies are devoted to Transductive SVM, they suffer either from the high computation complexity or from the solutions of local optimum. To address this problem, we propose solving Transductive SVM via a convex relaxation, which converts the NP-hard problem to a semi-definite programming. Compared with the other SDP relaxation for Transductive SVM, the proposed algorithm is computationally more efficient with the number of free parameters reduced from O(n2) to O(n) where n is the number of examples. Empirical study with several benchmark data sets shows the promising performance of the proposed algorithm in comparison with other state-of-the-art implementations of Transductive SVM.", "full_text": "Ef\ufb01cient Convex Relaxation for\n\nTransductive Support Vector Machine\n\nZenglin Xu\n\nDept. of Computer Science & Engineering\n\nThe Chinese University of Hong Kong\n\nShatin, N.T., Hong Kong\n\nRong Jin\n\nDept. of Computer Science & Engineering\n\nMichigan State University\nEast Lansing, MI, 48824\n\nzlxu@cse.cuhk.edu.hk\n\nrongjin@cse.msu.edu\n\nJianke Zhu\n\nIrwin King\n\nMichael R. Lyu\n\nDept. of Computer Science & Engineering\n\nThe Chinese University of Hong Kong\n\nShatin, N.T., Hong Kong\n\nfjkzhu,king,lyug@cse.cuhk.edu.hk\n\nAbstract\n\nWe consider the problem of Support Vector Machine transduction, which involves\na combinatorial problem with exponential computational complexity in the num-\nber of unlabeled examples. 
Although several studies have been devoted to Transductive SVM, they suffer either from high computational complexity or from locally optimal solutions. To address this problem, we propose solving Transductive SVM via a convex relaxation, which converts the NP-hard problem to a semi-definite program. Compared with the other SDP relaxation for Transductive SVM, the proposed algorithm is computationally more efficient, with the number of free parameters reduced from O(n^2) to O(n), where n is the number of examples. An empirical study with several benchmark data sets shows the promising performance of the proposed algorithm in comparison with other state-of-the-art implementations of Transductive SVM.\n\n1 Introduction\n\nSemi-supervised learning has attracted an increasing amount of research interest recently [3, 15]. An important semi-supervised learning paradigm is the Transductive Support Vector Machine (TSVM), which maximizes the margin in the presence of unlabeled data and keeps the decision boundary traversing low-density regions, while respecting the labels in the input space.\n\nSince TSVM requires solving a combinatorial optimization problem, extensive research efforts have been devoted to efficiently finding an approximate solution. The popular version of TSVM proposed in [8] uses a label-switching-retraining procedure to speed up the computation. In [5], the hinge loss in TSVM is replaced by a smooth loss function, and a gradient descent method is used to find the decision boundary in a region of low density. Chapelle et al. [2] employ an iterative approach for TSVM: it begins by minimizing an easy convex objective function, and then gradually approximates the objective of TSVM with more complicated functions, using the solution of each simpler function as the initialization for the next, more complicated one. 
Other iterative methods, such as deterministic annealing [11] and the concave-convex procedure (CCCP) method [6], are also employed to solve the optimization problem related to TSVM. The main drawback of the approximation methods listed above is that they are susceptible to local optima, and are therefore sensitive to the initialization of solutions. To address this problem, in [4], a branch-and-bound search method is developed to find the exact solution.\n\nFigure 1: Computation time of the proposed convex relaxation approach for TSVM (i.e., CTSVM) and the semi-definite relaxation approach for TSVM (i.e., RTSVM) versus the number of unlabeled examples. The Course data set is used, and the number of labeled examples is 20.\n\nIn [14], the authors approximate TSVM by a semi-definite programming problem, which leads to a relaxed solution to TSVM (denoted RTSVM) that avoids local optima. However, both approaches suffer from high computational cost and can only be applied to small data sets.\n\nTo this end, we present the convex relaxation for Transductive SVM (CTSVM). The key idea of our method is to approximate the non-convex optimization problem of TSVM by its dual problem. The advantage of doing so is twofold:\n\n- Unlike the semi-definite relaxation [14] that approximates TSVM by dropping the rank constraint, the proposed approach approximates TSVM by its dual problem. As a basic result of convex analysis, the conjugate of the conjugate of any function f(x) is the convex envelope of f(x), and therefore provides a tighter convex relaxation for f(x) [7]. 
Hence, the proposed approach provides a better convex relaxation than that in [14] for the optimization problem in TSVM.\n\n- Compared to the semi-definite relaxation of TSVM, the proposed algorithm involves fewer free parameters and therefore significantly improves the efficiency by reducing the worst-case computational complexity from O(n^6.5) to O(n^4.5). Figure 1 shows the running time of both the semi-definite relaxation of TSVM in [14] and the proposed convex relaxation for TSVM versus an increasing number of unlabeled examples. The data set used in this example is the Course data set (see the experiment section), and the number of labeled examples is 20. We clearly see that the proposed convex relaxation approach is considerably more efficient than the semi-definite approach.\n\nThe rest of this paper is organized as follows. Section 2 reviews the related work on the semi-definite relaxation for TSVM. Section 3 presents the convex relaxation approach for Transductive SVM. Section 4 presents the empirical studies that verify the effectiveness of the proposed relaxation for TSVM. Section 5 sets out the conclusion.\n\n2 Related Work\n\nIn this section, we review the key formulae for Transductive SVM, followed by the semi-definite programming relaxation for TSVM.\n\nLet X = (x_1, ..., x_n) denote the entire data set, including both the labeled examples and the unlabeled ones. We assume that the first l examples within X are labeled by y^l = (y^l_1, y^l_2, ..., y^l_l), where y^l_i ∈ {−1, +1} represents the binary class label assigned to x_i. We further denote by y = (y_1, y_2, ..., y_n) ∈ {−1, +1}^n the binary class labels predicted for all the data points in X. 
The goal of TSVM is to estimate y by using both the labeled examples and the unlabeled ones.\n\nFollowing the maximum margin framework, TSVM aims to identify the classification model that results in the maximum classification margin for both labeled and unlabeled examples, which amounts to solving the following optimization problem:\n\n  min_{w, b, y ∈ {−1,+1}^n, ε}  (1/2)||w||^2 + C Σ_{i=1}^n ε_i\n  s.t.  y_i (w^T x_i − b) ≥ 1 − ε_i,  ε_i ≥ 0,  i = 1, 2, ..., n,\n        y_i = y^l_i,  i = 1, 2, ..., l,\n\nwhere C ≥ 0 is the trade-off parameter between the complexity of the function w and the margin errors. The prediction function can be formulated as f(x) = w^T x − b.\n\nEvidently, the above problem is a non-convex optimization problem due to the product term y_i w_j in the constraint. In order to approximate the above problem by a convex program, we first rewrite it in the following form using the Lagrange theorem:\n\n  min_{ν, y ∈ {−1,+1}^n, δ, λ}  (1/2)(e + ν − δ + λy)^T D(y) K^{−1} D(y) (e + ν − δ + λy) + C δ^T e     (1)\n  s.t.  ν ≥ 0,  δ ≥ 0,  y_i = y^l_i,  i = 1, 2, ..., l,\n\nwhere ν, δ and λ are the dual variables, e is the n-dimensional column vector of all ones, and K is the kernel matrix. D(y) represents a diagonal matrix whose diagonal elements form the vector y. A detailed derivation can be found in [9, 13]. Using the Schur complement, the above formulation can be further written as follows:\n\n  min_{y ∈ {−1,+1}^n, t, ν, δ, λ}  t     (2)\n  s.t.  [ yy^T ∘ K             e + ν − δ + λy ]\n        [ (e + ν − δ + λy)^T   t − 2C δ^T e   ]  ⪰ 0,\n        ν ≥ 0,  δ ≥ 0,  y_i = y^l_i,  i = 1, 2, ..., l,\n\nwhere the operator ∘ represents the element-wise product.\n\nTo convert the above problem into a convex optimization problem, the key idea is to replace the quadratic term yy^T by a linear variable. Based on the result that the set S_a = {M = yy^T | y ∈ {−1,+1}^n} is equivalent to the set S_b = {M | M_{i,i} = 1, rank(M) = 1}, we can approximate the problem in (2) as follows:\n\n  min_{M, t, ν, δ, λ}  t     (3)\n  s.t.  [ M ∘ K         e + ν − δ   ]\n        [ (e + ν − δ)^T  t − 2C δ^T e ]  ⪰ 0,\n        ν ≥ 0,  δ ≥ 0,\n        M ⪰ 0,  M_{i,i} = 1,  i = 1, 2, ..., n,\n\nwhere M_{i,j} = y^l_i y^l_j for 1 ≤ i, j ≤ l.\n\nNote that the key differences between (2) and (3) are that (a) the rank constraint rank(M) = 1 is removed, and (b) the variable λ is set to zero, which is equivalent to setting b = 0. The above approximation is often referred to as the Semi-Definite Programming (SDP) relaxation. As revealed by previous studies [14, 1], the SDP problem resulting from this approximation is computationally expensive. More specifically, there are O(n^2) parameters in the SDP cone and O(n) linear inequality constraints, which implies a worst-case computational complexity of O(n^6.5). To avoid the high computational complexity, we present a different approach for relaxing TSVM into a convex problem. 
Compared to the SDP relaxation approach, it is advantageous in that (1) it produces a tighter convex approximation for TSVM, and (2) it is computationally more efficient than the previous SDP relaxation.\n\n3 Relaxed Transductive Support Vector Machine\n\nIn this section, we follow the work of generalized maximum margin clustering [13] by first studying the case of hard margin, and then extending it to the case of soft margin.\n\n3.1 Hard Margin TSVM\n\nIn the hard margin case, SVM does not penalize the classification error, which corresponds to δ = 0 in (1). The resulting formulation of TSVM becomes\n\n  min_{ν, y, λ}  (1/2)(e + ν + λy)^T D(y) K^{−1} D(y) (e + ν + λy)     (4)\n  s.t.  ν ≥ 0,\n        y_i = y^l_i,  i = 1, 2, ..., l,\n        y_i^2 = 1,  i = l+1, l+2, ..., n.\n\nInstead of employing the SDP relaxation as in [14], we follow the work in [13] and introduce a variable z = D(y)(e + ν) = y ∘ (e + ν). Given that ν ≥ 0, the constraints in (4) can be written as y^l_i z_i ≥ 1 for the labeled examples, and z_i^2 ≥ 1 for all the unlabeled examples. Hence, z can be used as the prediction function, i.e., f* = z. Using this new notation, the optimization problem in (4) can be rewritten as follows:\n\n  min_{z, λ}  (1/2)(z + λe)^T K^{−1} (z + λe)     (5)\n  s.t.  y^l_i z_i ≥ 1,  i = 1, 2, ..., l,\n        z_i^2 ≥ 1,  i = l+1, l+2, ..., n.\n\nOne problem with Transductive SVMs is that it is possible to classify all the unlabeled data into one of the classes with a very large margin, due to the high dimensionality and the few labeled data. This will lead to poor generalization ability. 
To solve this problem, we introduce the following balance constraint to ensure that no class takes all the unlabeled examples:\n\n  −ε ≤ (1/l) Σ_{i=1}^l z_i − (1/(n−l)) Σ_{i=l+1}^n z_i ≤ ε,     (6)\n\nwhere ε ≥ 0 is a constant. Through the above constraint, we aim to ensure that the difference between the labeled data and the unlabeled data in their class assignments is small.\n\nTo simplify the expression, we further define w = (z, λ) ∈ R^{n+1} and P = (I_n, e) ∈ R^{n×(n+1)}. Then, the problem in (5) becomes:\n\n  min_w  w^T P^T K^{−1} P w     (7)\n  s.t.  y^l_i w_i ≥ 1,  i = 1, 2, ..., l,\n        w_i^2 ≥ 1,  i = l+1, l+2, ..., n,\n        −ε ≤ (1/l) Σ_{i=1}^l w_i − (1/(n−l)) Σ_{i=l+1}^n w_i ≤ ε.\n\nWhen this problem is solved, the label vector y can be directly determined by the sign of the prediction function, i.e., sign(w). This is because w_i = (1 + ν_i) y_i for i = l+1, ..., n and ν ≥ 0.\n\nThe following theorem shows that the problem in (7) can be relaxed to a semi-definite program.\n\nTheorem 1. 
Given a sample X = {x_1, ..., x_n} and a partial set of labels y^l = (y^l_1, y^l_2, ..., y^l_l) where 1 ≤ l ≤ n, the variable w that optimizes (7) can be calculated by\n\n  w = (1/2) [A − D(γ ∘ b)]^{−1} (γ ∘ a − (α − β)c),     (8)\n\nwhere a = (y^l, 0_{n−l}, 0) ∈ R^{n+1}, b = (0_l, 1_{n−l}, 0) ∈ R^{n+1}, c = ((1/l) 1_l, −(1/(n−l)) 1_{n−l}, 0) ∈ R^{n+1}, A = P^T K^{−1} P, and γ is determined by the following semi-definite program:\n\n  max_{γ, t, α, β}  −(1/4) t + Σ_{i=1}^n γ_i − ε(α + β)     (9)\n  s.t.  [ A − D(γ ∘ b)            γ ∘ a − (α − β)c ]\n        [ (γ ∘ a − (α − β)c)^T    t                ]  ⪰ 0,\n        α ≥ 0,  β ≥ 0,  γ_i ≥ 0,  i = 1, 2, ..., n.\n\nProof Sketch. We define the Lagrangian of the minimization problem (7) as follows:\n\n  min_w max_γ  F(w, γ) = w^T P^T K^{−1} P w + Σ_{i=1}^l γ_i (1 − y^l_i w_i) + Σ_{i=l+1}^n γ_i (1 − w_i^2) + α(c^T w − ε) + β(−c^T w − ε),\n\nwhere γ_i ≥ 0 for i = 1, ..., n. It can be derived from duality that min_w max_γ F(w, γ) = max_γ min_w F(w, γ).\n\nAt the optimum, the derivative of F with respect to the variable w is\n\n  ∂F/∂w = 2[A − D(γ ∘ b)] w − γ ∘ a + (α − β)c = 0,\n\nwhere A = P^T K^{−1} P. The inverse of A − D(γ ∘ b) can be computed by adding a regularization parameter. 
Therefore, w can be calculated as\n\n  w = (1/2) [A − D(γ ∘ b)]^{−1} (γ ∘ a − (α − β)c).\n\nThus, the dual form of the problem becomes:\n\n  max_γ  L(γ) = −(1/4) (γ ∘ a − (α − β)c)^T [A − D(b ∘ γ)]^{−1} (γ ∘ a − (α − β)c) + Σ_{i=1}^n γ_i − ε(α + β).\n\nWe introduce a variable t such that\n\n  −(1/4) (γ ∘ a − (α − β)c)^T [A − D(b ∘ γ)]^{−1} (γ ∘ a − (α − β)c) ≥ −t.\n\nAccording to the Schur complement, we obtain a semi-definite cone constraint, from which the optimization problem (9) can be formulated. □\n\nRemark I. The problem in (9) is a convex optimization problem, more specifically a semi-definite programming problem, and can be efficiently solved by the interior-point method [10] implemented in optimization packages such as SeDuMi [12]. Moreover, our relaxation algorithm has O(n) parameters in the SDP cone and O(n) linear equality constraints, which implies a worst-case computational complexity of O(n^4.5). In the previous relaxation algorithms [1, 14], by contrast, there are approximately O(n^2) parameters in the SDP cone, which implies a worst-case computational complexity on the order of O(n^6.5). Therefore, our proposed convex relaxation algorithm is more efficient. In addition, as analyzed in Section 2, the approximation in [1, 14] drops the rank constraint on the matrix yy^T, which does not lead to a tight approximation. On the other hand, our prediction function f* implements the conjugate of the conjugate of the prediction function f(x), which is the convex envelope of f(x) [7]. Thus, our proposed convex approximation method provides a tighter approximation than the previous method.\n\nRemark II. 
It is interesting to discuss the connection between the solution of the proposed algorithm and that of harmonic functions. We consider a special case of (8), where λ = 0 (which implies no bias term in the primal SVM) and there is no balance constraint. Then the solution of (9) can be expressed as follows:\n\n  z = (1/2) [K^{−1} − D(γ ∘ (0_l, 1_{n−l}))]^{−1} (γ ∘ (y^l, 0_{n−l})).     (10)\n\nIt can be further derived that\n\n  z = (I_n − Σ_{i=l+1}^n γ_i K I^i_n)^{−1} (Σ_{i=1}^l γ_i y^l_i K(x_i, ·)),     (11)\n\nwhere I^i_n is defined as an n × n matrix with all elements being zero except the i-th diagonal element, which is 1, and K(x_i, ·) is the i-th column of K. Similar to the solution of the harmonic function, we first propagate the class labels from the labeled examples to the unlabeled ones by the term Σ_{i=1}^l γ_i y^l_i K(x_i, ·), and then adjust the predicted labels by the factor (I_n − Σ_{i=l+1}^n γ_i K I^i_n)^{−1}. The key differences in our solution are that (1) different weights (i.e., γ_i) are assigned to the labeled examples, and (2) the adjustment factor is different from that in the harmonic function [16].\n\n3.2 Soft Margin TSVM\n\nWe extend TSVM to the case of soft margin by considering the following problem:\n\n  min_{ν, y, δ, λ}  (1/2)(e + ν − δ + λy)^T D(y) K^{−1} D(y) (e + ν − δ + λy) + C_l Σ_{i=1}^l δ_i^2 + C_u Σ_{i=l+1}^n δ_i^2\n  s.t.  ν ≥ 0,  δ ≥ 0,\n        y_i = y^l_i,  1 ≤ i ≤ l,\n        y_i^2 = 1,  l+1 ≤ i ≤ n,\n\nwhere δ_i is related to the margin error. 
Note that we distinguish the labeled examples from the unlabeled examples by introducing different penalty constants for margin errors: C_l for labeled examples and C_u for unlabeled examples. Similarly, we introduce the slack variable z and then derive the following dual problem:\n\n  max_{γ, t, α, β}  −(1/4) t + Σ_{i=1}^n γ_i − ε(α + β)     (12)\n  s.t.  [ A − D(γ ∘ b)            γ ∘ a − (α − β)c ]\n        [ (γ ∘ a − (α − β)c)^T    t                ]  ⪰ 0,\n        0 ≤ γ_i ≤ C_l,  i = 1, 2, ..., l,\n        0 ≤ γ_i ≤ C_u,  i = l+1, l+2, ..., n,\n        α ≥ 0,  β ≥ 0,\n\nwhich is also a semi-definite programming problem and can be solved similarly.\n\n4 Experiments\n\nIn this section, we report an empirical study of the proposed method on several benchmark data sets.\n\n4.1 Data Sets Description\n\nTo make the evaluation comprehensive, we have collected four UCI data sets and three text data sets as our experimental testbeds. The UCI data sets include Iono, Sonar, Banana, and Breast, which are widely used in data classification. The WinMac data set consists of the classes mswindows and mac of the Newsgroup20 data set. The IBM data set contains the classes IBM and non-IBM of the Newsgroup20 data set. The Course data set is made of the course pages and non-course pages of the WebKb corpus. For each text data set, we randomly sample the data with sample sizes of 60, 300 and 1000, respectively. Each resulting sample is denoted by the suffix \u201c-s\u201d, \u201c-m\u201d, or \u201c-l\u201d depending on whether the sample size is small, medium or large. 
Table 1 describes the information of these data sets, where d represents the data dimensionality, l the number of labeled data points, and n the total number of examples.\n\nTable 1: Data sets used in the experiments, where d represents the data dimensionality, l the number of labeled data points, and n the total number of examples.\n\n  Data set   d      l   n      Data set   d      l   n\n  Iono       34     20  351    WinMac-m   7511   20  300\n  Sonar      60     20  208    IBM-m      11960  20  300\n  Banana     4      20  400    Course-m   1800   20  300\n  Breast     9      20  300    WinMac-l   7511   50  1000\n  IBM-s      11960  10  60     IBM-l      11960  50  1000\n  Course-s   1800   10  60     Course-l   1800   50  1000\n\n4.2 Experimental Protocol\n\nTo evaluate the effectiveness of the proposed CTSVM method, we choose the conventional SVM as our baseline method. In our experiments, we also make comparisons with three state-of-the-art methods: the SVM-light algorithm [8], the Gradient Descent TSVM (rTSVM) algorithm [5], and the Concave Convex Procedure (CCCP) [6]. Since the SDP relaxation of TSVM [14] has a very high time complexity of O(n^6.5), it is difficult to apply to data sets with hundreds of examples; thus it is only evaluated on the smaller data sets, i.e., \u201cIBM-s\u201d and \u201cCourse-s\u201d.\n\nThe experimental setup is as follows. For each data set, we conduct 10 trials. In each trial, the training set contains labeled examples from each class, and the remaining data are used as the unlabeled (test) data. Moreover, the RBF kernel is used for \u201cIono\u201d, \u201cSonar\u201d and \u201cBanana\u201d, and the linear kernel is used for the other data sets, because the linear kernel performs better than the RBF kernel on those data sets. The kernel width of the RBF kernel is chosen by 5-fold cross-validation on the labeled data. The margin parameter C_l is tuned by using the labeled data in all algorithms. 
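The model-selection step described above, choosing the RBF kernel width by 5-fold cross-validation on the labeled data only, can be sketched as follows; this is an illustrative reconstruction with synthetic data and our own parameter grid, not the authors' protocol code:

```python
# Illustrative sketch of the model-selection step: pick the RBF kernel width
# (and the margin parameter C) by 5-fold cross-validation using only the
# labeled examples. The synthetic data and the grid are our own assumptions.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_lab = rng.standard_normal((40, 4))          # hypothetical labeled sample
y_lab = np.where(X_lab[:, 0] > 0, 1, -1)      # hypothetical binary labels

grid = {"gamma": [0.01, 0.1, 1.0, 10.0],      # candidate RBF kernel widths
        "C": [0.1, 1.0, 10.0]}                # candidate margin parameters
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)
search.fit(X_lab, y_lab)
print(search.best_params_)
```

The selected parameters would then be held fixed while the transductive method is run on the labeled and unlabeled data together, since only the labeled points can be used for tuning.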
Due to the small number of labeled examples, for CTSVM and CCCP the margin parameter for unlabeled data, C_u, is set equal to C_l. Other parameters in these algorithms are set to the default values according to the relevant literature.\n\n4.3 Experimental Results\n\nTable 2: The classification performance of Transductive SVMs on benchmark data sets.\n\n  Data Set    SVM           SVM-light     rTSVM         CTSVM         CCCP\n  Iono        78.55±4.83    78.25±0.36    81.72±4.50    80.09±2.63    82.11±3.83\n  Sonar       51.76±5.05    55.26±5.88    69.36±4.69    67.39±6.26    56.01±6.70\n  Banana      58.45±7.15    -             71.54±7.28    79.51±3.02    79.33±4.22\n  Breast      96.46±1.18    95.68±1.82    97.17±0.35    97.79±0.23    96.89±0.67\n  IBM-s       52.75±15.01   67.60±9.29    65.80±6.56    75.25±7.49    65.62±14.83\n  Course-s    63.52±5.82    76.82±4.78    75.80±12.87   79.75±8.45    74.20±11.50\n  WinMac-m    57.64±9.58    79.42±4.60    81.03±8.23    84.82±2.12    84.28±8.84\n  IBM-m       53.00±6.83    67.55±6.74    64.65±13.38   73.17±0.89    69.62±11.03\n  Course-m    80.18±1.27    93.89±1.49    90.35±3.59    92.92±2.28    88.78±2.87\n  WinMac-l    60.86±10.10   89.81±2.10    90.19±2.65    91.25±2.67    91.00±2.42\n  IBM-l       61.82±7.26    75.40±2.26    73.11±1.99    73.42±3.23    74.80±1.87\n  Course-l    83.56±3.10    92.35±3.02    93.58±2.68    94.62±0.97    91.32±4.08\n\nTable 2 summarizes the classification accuracy and the standard deviations of the proposed algorithm, the baseline method, and the state-of-the-art methods. It can be observed that our proposed algorithm performs significantly better than the standard SVM across all the data sets. 
Moreover, on the small data sets, i.e., \u201cIBM-s\u201d and \u201cCourse-s\u201d, the results of the SDP-relaxation method are 68.57±22.73 and 64.03±7.65, which are worse than those of the proposed CTSVM method. In addition, the proposed CTSVM algorithm performs much better than the other TSVM methods on \u201cWinMac-m\u201d and \u201cCourse-l\u201d. As shown in Table 2, the SVM-light algorithm achieves the best results on \u201cCourse-m\u201d and \u201cIBM-l\u201d; however, it fails to converge on \u201cBanana\u201d. On the remaining data sets, our proposed algorithm obtains comparable results. Overall, the empirical evaluation indicates that our proposed CTSVM method achieves promising classification results compared with the state-of-the-art methods.\n\n5 Conclusion and Future Work\n\nThis paper presents a novel method for Transductive SVM that relaxes the unknown labels to continuous variables. In contrast to the previous relaxation method, which involves O(n^2) free parameters in the semi-definite matrix, our method reduces the number of free parameters to O(n) and can solve the optimization problem more efficiently. In addition, the proposed approach provides a tighter convex relaxation for the optimization problem in TSVM. Empirical studies on benchmark data sets demonstrate that the proposed method is more efficient than the previous semi-definite relaxation method and achieves promising classification results compared with the state-of-the-art methods.\n\nAs the current model is only designed for binary classification, we plan to develop a multi-class Transductive SVM model in the future. Moreover, it is desirable to extend the current model to classify new incoming data.\n\nAcknowledgments\n\nThe work described in this paper is supported by a CUHK Internal Grant (No. 
2050346) and a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CUHK4150/07E).\n\nReferences\n\n[1] T. D. Bie and N. Cristianini. Convex methods for transduction. In S. Thrun, L. Saul, and B. Sch\u00f6lkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.\n\n[2] O. Chapelle, M. Chi, and A. Zien. A continuation method for semi-supervised SVMs. In ICML \u201906: Proceedings of the 23rd International Conference on Machine Learning, pages 185\u2013192, New York, NY, USA, 2006. ACM Press.\n\n[3] O. Chapelle, B. Sch\u00f6lkopf, and A. Zien. Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006.\n\n[4] O. Chapelle, V. Sindhwani, and S. Keerthi. Branch and bound for semi-supervised support vector machines. In B. Sch\u00f6lkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA, 2007.\n\n[5] O. Chapelle and A. Zien. Semi-supervised classification by low density separation. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 57\u201364, 2005.\n\n[6] R. Collobert, F. Sinz, J. Weston, and L. Bottou. Large scale transductive SVMs. Journal of Machine Learning Research, 7:1687\u20131712, 2006.\n\n[7] J.-B. Hiriart-Urruty and C. Lemarechal. Convex Analysis and Minimization Algorithms II: Advanced Theory and Bundle Methods (2nd part edition). Springer-Verlag, New York, 1993.\n\n[8] T. Joachims. Transductive inference for text classification using support vector machines. In ICML \u201999: Proceedings of the Sixteenth International Conference on Machine Learning, pages 200\u2013209, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.\n\n[9] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. 
Journal of Machine Learning Research, 5:27\u201372, 2004.\n\n[10] Y. Nesterov and A. Nemirovsky. Interior-Point Polynomial Methods in Convex Programming: Theory and Applications. Studies in Applied Mathematics. Philadelphia, 1994.\n\n[11] V. Sindhwani, S. S. Keerthi, and O. Chapelle. Deterministic annealing for semi-supervised kernel machines. In ICML \u201906: Proceedings of the 23rd International Conference on Machine Learning, pages 841\u2013848, New York, NY, USA, 2006. ACM Press.\n\n[12] J. F. Sturm. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11:625\u2013653, 1999.\n\n[13] H. Valizadegan and R. Jin. Generalized maximum margin clustering and unsupervised kernel learning. In B. Sch\u00f6lkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA, 2007.\n\n[14] L. Xu and D. Schuurmans. Unsupervised and semi-supervised multi-class support vector machines. In AAAI, pages 904\u2013910, 2005.\n\n[15] X. Zhu. Semi-supervised learning literature survey. Technical report, Computer Sciences, University of Wisconsin-Madison, 2005.\n\n[16] X. Zhu, Z. Ghahramani, and J. D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), pages 912\u2013919, 2003.\n", "award": [], "sourceid": 534, "authors": [{"given_name": "Zenglin", "family_name": "Xu", "institution": null}, {"given_name": "Rong", "family_name": "Jin", "institution": null}, {"given_name": "Jianke", "family_name": "Zhu", "institution": null}, {"given_name": "Irwin", "family_name": "King", "institution": null}, {"given_name": "Michael", "family_name": "Lyu", "institution": null}]}