{"title": "An Analysis of Inference with the Universum", "book": "Advances in Neural Information Processing Systems", "page_first": 1369, "page_last": 1376, "abstract": null, "full_text": "An Analysis of Inference with the Universum\n\nFabian H. Sinz\n\nMax Planck Institute for biological Cybernetics\nSpemannstrasse 41, 72076, T\u00fcbingen, Germany\n\nfabee@tuebingen.mpg.de\n\nOlivier Chapelle\nYahoo! Research\n\nSanta Clara, California\n\nchap@yahoo-inc.com\n\nAlekh Agarwal\n\nUniversity of California Berkeley\n\n387 Soda Hall Berkeley, CA 94720-1776\n\nalekh@eecs.berkeley.edu\n\nBernhard Sch\u00f6lkopf\n\nMax Planck Institute for biological Cybernetics\nSpemannstrasse 38, 72076, T\u00fcbingen, Germany\n\nbs@tuebingen.mpg.de\n\nAbstract\n\nWe study a pattern classification algorithm which has recently been proposed by\nVapnik and coworkers. It builds on a new inductive principle which assumes that\nin addition to positive and negative data, a third class of data is available, termed\nthe Universum. We assay the behavior of the algorithm by establishing links with\nFisher discriminant analysis and oriented PCA, as well as with an SVM in a projected\nsubspace (or, equivalently, with a data-dependent reduced kernel). We also\nprovide experimental results.\n\n1 Introduction\n\nLearning algorithms need to make assumptions about the problem domain in order to generalise\nwell. These assumptions are usually encoded in the regulariser or the prior. A generic learning algorithm\nusually makes rather weak assumptions about the regularities underlying the data. An example\nof this is smoothness. More elaborate prior knowledge, often needed for a good performance, can\nbe hard to encode in a regulariser or a prior that is computationally efficient too.\nInteresting hybrids between both extremes are regularisers that depend on an additional set of data\navailable to the learning algorithm. 
A prominent example of data-dependent regularisation is semi-supervised\nlearning [1], where an additional set of unlabelled data, assumed to follow the same\ndistribution as the training inputs, is tied to the regulariser using the so-called cluster assumption.\nA novel form of data-dependent regularisation was recently proposed by [11]. The additional dataset\nfor this approach is explicitly not from the same distribution as the labelled data, but represents a\nthird (neither) class. This kind of dataset was first proposed by Vapnik [10] under the name\nUniversum, owing its name to the intuition that the Universum captures a general backdrop against\nwhich a problem at hand is solved. According to Vapnik, a suitable set for this purpose can be\nthought of as a set of examples that belong to the same problem framework, but about which the\nresulting decision function should not make a strong statement.\nAlthough the Universum was initially proposed for transductive inference, the authors of [11] proposed an inductive\nclassifier where the decision surface is chosen such that the Universum examples are located close\nto it. When this idea was implemented in an SVM, different choices of Universa proved to be helpful in\nvarious classification tasks. Although the authors showed that different choices of Universa and loss\nfunctions lead to certain known regularisers as special cases of their implementation, there are still\na few unanswered questions. On the one hand it is not clear whether the good performance of their\nalgorithm is due to the underlying original idea, or just a consequence of the employed algorithmic\nrelaxation. 
On the other hand, except in special cases, the influence of the Universum data on the\nresulting decision hyperplane, and therefore criteria for a good choice of a Universum, are not known.\nIn the present paper we would like to address the second question by analysing the influence of the\nUniversum data on the resulting function in the implementation of [11] as well as in a least squares\nversion of it which we derive in section 2. Clarifying the regularising influence of the Universum on\nthe solution of the SVM can give valuable insight into which set of data points might be a helpful\nUniversum and how to obtain it.\nThe paper is structured as follows. After briefly deriving the algorithms in section 2 we show\nin section 3 that the algorithm of [11] pushes the normal of the hyperplane into the orthogonal\ncomplement of the subspace spanned by the principal directions of the Universum set. Furthermore,\nwe demonstrate that the least squares version of the Universum algorithm is equivalent to a hybrid\nbetween kernel Fisher Discriminant Analysis and kernel Oriented Principal Component Analysis. In\nsection 4, we validate our analysis on toy experiments and give an example of how to use the geometric\nand algorithmic intuition gained from the analysis to construct a Universum set for a real world\nproblem.\n\n2 The Universum Algorithms\n\n2.1 The Hinge Loss U-SVM\n\nWe start with a brief review of the implementation proposed in [11]. Let\n$L = \\{(x_1, y_1), ..., (x_m, y_m)\\}$ be the set of labelled examples and let $U = \\{z_1, ..., z_q\\}$ denote the set\nof Universum examples. 
Using the hinge loss $H_a[t] = \\max\\{0, a - t\\}$ and $f_{w,b}(x) = \\langle w, x \\rangle + b$, a\nstandard SVM can compactly be formulated as\n\n$$\\min_{w,b} \\frac{1}{2} \\|w\\|^2 + C_L \\sum_{i=1}^{m} H_1[y_i f_{w,b}(x_i)].$$\n\nIn the implementation of [11] the goal of bringing the Universum examples close to the separating\nhyperplane is realised by also minimising the cumulative \u03b5-insensitive loss $I_\\varepsilon[t] = \\max\\{0, |t| - \\varepsilon\\}$\non the Universum points\n\n$$\\min_{w,b} \\frac{1}{2} \\|w\\|^2 + C_L \\sum_{i=1}^{m} H_1[y_i f_{w,b}(x_i)] + C_U \\sum_{j=1}^{q} I_\\varepsilon[\\,|f_{w,b}(z_j)|\\,]. \\qquad (1)$$\n\nNoting that $I_\\varepsilon[t] = H_{-\\varepsilon}[t] + H_{-\\varepsilon}[-t]$, one can use the simple trick of adding the Universum\nexamples twice with opposite labels and obtain an SVM like formulation which can be solved with\na standard SVM optimiser.\n\n2.2 The Least Squares U-SVM\n\nThe derivation of the least squares U-SVM starts with the same general regularised error minimisation\nproblem\n\n$$\\min_{w,b} \\frac{1}{2} \\|w\\|^2 + \\frac{C_L}{2} \\sum_{i=1}^{m} Q_{y_i}[f_{w,b}(x_i)] + \\frac{C_U}{2} \\sum_{j=1}^{q} Q_0[f_{w,b}(z_j)]. \\qquad (2)$$\n\nInstead of using the hinge loss, we employ the quadratic loss $Q_a[t] = \\|t - a\\|_2^2$ which is used in the\nleast squares versions of SVMs [9]. Expanding (2) in terms of slack variables $\\xi$ and $\\vartheta$ yields\n\n$$\\min_{w,b} \\frac{1}{2} \\|w\\|^2 + \\frac{C_L}{2} \\sum_{i=1}^{m} \\xi_i^2 + \\frac{C_U}{2} \\sum_{j=1}^{q} \\vartheta_j^2 \\qquad (3)$$\n$$\\text{s.t.} \\quad \\langle w, x_i \\rangle + b = y_i - \\xi_i \\text{ for } i = 1, ..., m$$\n$$\\langle w, z_j \\rangle + b = 0 - \\vartheta_j \\text{ for } j = 1, ..., q.$$\n\nMinimising the Lagrangian of (3) with respect to the primal variables $w$, $b$, $\\xi$ and $\\vartheta$, and substituting\ntheir optimal values back into (3) yields a dual maximisation problem in terms of the Lagrange\nmultipliers $\\alpha$. 
Since this dual problem is still convex, we can set its derivative to zero and thereby\nobtain the following linear system\n\n$$\\begin{pmatrix} 0 & 1^\\top \\\\ 1 & K + C \\end{pmatrix} \\begin{pmatrix} b \\\\ \\alpha \\end{pmatrix} = \\begin{pmatrix} 0 \\\\ y \\\\ 0 \\end{pmatrix}.$$\n\nHere, $K = \\begin{pmatrix} K_{L,L} & K_{L,U} \\\\ K_{L,U}^\\top & K_{U,U} \\end{pmatrix}$ denotes the kernel matrix between the input points in the sets L and\nU, and $C = \\begin{pmatrix} \\frac{1}{C_L} I & 0 \\\\ 0 & \\frac{1}{C_U} I \\end{pmatrix}$ an identity matrix of appropriate size scaled with $\\frac{1}{C_L}$ in dimensions\nassociated with labelled examples and $\\frac{1}{C_U}$ for dimensions corresponding to Universum examples.\n\nThe solution $(\\alpha, b)$ can then be obtained by a simple matrix inversion. In the remaining part of this\npaper we denote the least squares SVM by Uls-SVM.\n\n2.3 Related Ideas\n\nAlthough [11] proposed the first algorithm that explicitly refers to Vapnik\u2019s Universum idea, there\nexist related approaches that we shall mention briefly. The authors of [12] describe an algorithm\nfor the one-vs-one strategy in multiclass learning that additionally minimises the distance of the\nseparating hyperplane to the examples that are in neither of the classes. Although this is algorithmically\nequivalent to the U-SVM formulation above, their motivation is merely to sharpen the contrast\nbetween the different binary classifiers. In particular, they do not consider using a Universum for\nbinary classification problems.\nThere are also two Bayesian algorithms that refer to non-examples or neither class in the binary\nclassification setting. [8] gives a probabilistic interpretation for a standard hinge loss SVM by establishing\nthe connection between the MAP estimate of a Gaussian process with a Gaussian prior using\na covariance function k and a hinge loss based noise model. 
In order to deal with the problem that\nthe proposed likelihood does not integrate to one the author introduces a third class, the neither class.\nA similar idea is used by [4], introducing a third class to tackle the problem that unlabelled examples\nused in semi-supervised learning do not contribute to discriminative models $P_{Y|X}(y_i|x_i)$ since the\nparameters of the label distribution are independent of input points with unknown, i.e., marginalised\nvalue of the label. To circumvent this problem, the authors of [4] introduce an additional (neither)\nclass to create a stochastic dependence between the parameter and the unobserved label in the\ndiscriminative model. However, neither of the Bayesian approaches actually assigns an observed\nexample to the introduced third class.\n\n3 Analysis of the Algorithm\n\nThe following two sections analyse the geometrical relation of the decision hyperplane learnt with\none of the Universum SVMs to the Universum set. It will turn out that in both cases the optimal\nsolutions tend to make the normal vector orthogonal to the principal directions of the Universum.\nThe extreme case where w is completely orthogonal to U makes the decision function defined by w\ninvariant to transformations that act on the subspace spanned by the elements of U. Therefore, the\nUniversum should contain directions the resulting function should be invariant against.\nIn order to increase the readability we state all results for the linear case. However, our results\ngeneralise to the case where the $x_i$ and $z_j$ live in an RKHS spanned by some kernel.\n\n3.1 U-SVM and Projection Kernel\n\nFor this section we start by considering a U-SVM with hard margin on the elements of U. Furthermore,\nwe use \u03b5 = 0 for the \u03b5-insensitive loss. 
After showing the equivalence to using a standard\nSVM trained on the orthogonal complement of the subspace spanned by the $z_j$, we extend the result\nto the cases with soft margin on U.\nLemma A U-SVM with $C_U = \\infty$, \u03b5 = 0 is equivalent to training a standard SVM with the training\npoints projected onto the orthogonal complement of $\\mathrm{span}\\{z_j - z_0, z_j \\in U\\}$, where $z_0$ is an arbitrary\nelement of U.\n\nProof: Since $C_U = \\infty$ and \u03b5 = 0, any w yielding a finite value of (1) must fulfil $\\langle w, z_j \\rangle + b = 0$ for\nall $j = 1, ..., q$. So $\\langle w, z_j - z_0 \\rangle = 0$ and w is orthogonal to $\\mathrm{span}\\{z_j - z_0, z_j \\in U\\}$. Let $P_{U^\\perp}$ denote\nthe projection operator onto the orthogonal complement of that set. From the previous argument, we\ncan replace $\\langle w, x_i \\rangle$ by $\\langle P_{U^\\perp} w, x_i \\rangle$ in the solution of (1) without changing it. Indeed, the optimal\nw in (1) will satisfy $w = P_{U^\\perp} w$. Since $P_{U^\\perp}$ is an orthogonal projection we have that $P_{U^\\perp} = P_{U^\\perp}^\\top$\nand hence $\\langle P_{U^\\perp} w, x_i \\rangle = \\langle w, P_{U^\\perp}^\\top x_i \\rangle = \\langle w, P_{U^\\perp} x_i \\rangle$. Therefore, the optimisation problem in (1) is\nthe same as a standard SVM where the $x_i$ have been replaced by $P_{U^\\perp} x_i$. $\\square$\n\nThe special case the lemma refers to clarifies the role of the Universum in the U-SVM. Since the\nresulting w is orthogonal to an affine space spanned by the Universum points, it is invariant against\nfeatures implicitly specified by directions of large variance in that affine space. Picturing the $\\langle \\cdot, z_j \\rangle$\nas filters that extract certain features from given labelled or test examples x, using the Universum\nalgorithms means suppressing the features specified by the $z_j$.\nFinally, we generalise the result of the lemma by dropping the hard constraint assumption on the\nUniversum examples, i.e. we consider the case $C_U < \\infty$. 
Let $w^*$ and $b^*$ be the optimal solution of (1).\nWe have that\n\n$$C_U \\sum_{j=1}^{q} |\\langle w^*, z_j \\rangle + b^*| \\geq C_U \\min_b \\sum_{j=1}^{q} |\\langle w^*, z_j \\rangle + b|.$$\n\nThe right hand side can be interpreted as an \u201cL1 variance\u201d. So the algorithm tries to find a direction\n$w^*$ such that the variance of the projection of the Universum points on that direction is small. As\n$C_U$ approaches infinity this variance approaches 0 and we recover the result of the above lemma.\n\n3.2 Uls-SVM, Fisher Discriminant Analysis and Principal Component Analysis\n\nIn this section we present the relation of the Uls-SVM to two classic learning algorithms: (kernel)\noriented Principal Component Analysis (koPCA) and (kernel) Fisher discriminant analysis (kFDA)\n[5]. As it will turn out, the Uls-SVM is equivalent to a hybrid between both up to a linear equality\nconstraint. Since koPCA and kFDA can both be written as maximisation of a Rayleigh quotient we\nstart with the Rayleigh quotient of the hybrid\n\n$$\\max_w \\frac{\\overbrace{w^\\top (c^+ - c^-)(c^+ - c^-)^\\top w}^{\\text{from FDA}}}{w^\\top \\Big( \\underbrace{C_L \\sum_{k=\\pm} \\sum_{i \\in I_k} (x_i - c^k)(x_i - c^k)^\\top}_{\\text{from FDA}} + \\underbrace{C_U \\sum_{j=1}^{q} (z_j - \\tilde{c})(z_j - \\tilde{c})^\\top}_{\\text{from oPCA}} \\Big) w}.$$\n\nHere, $c^\\pm$ denote the class means of the labelled examples and $\\tilde{c} = \\frac{1}{2}(c^+ + c^-)$ is the point between\nthem. As indicated in the equation, the numerator is exactly the same as in kFDA, i.e. the inter-class\nvariance, while the denominator is a linear combination of the denominators from kFDA and\nkoPCA, i.e. the inner class variances from kFDA and the noise variance from koPCA.\nAs noted in [6] the numerator is just a rank one matrix. For optimising the quotient it can be fixed\nto an arbitrary value while the denominator is minimised. 
Since the denominator might not have\nfull rank it needs to be regularised [6]. Choosing the regulariser to be $\\|w\\|^2$, the problem can be\nrephrased as\n\n$$\\min_w \\|w\\|^2 + w^\\top \\Big( C_L \\sum_{k=\\pm} \\sum_{i \\in I_k} (x_i - c^k)(x_i - c^k)^\\top + C_U \\sum_{j=1}^{q} (z_j - \\tilde{c})(z_j - \\tilde{c})^\\top \\Big) w \\qquad (4)$$\n$$\\text{s.t.} \\quad w^\\top (c^+ - c^-) = 2.$$\n\nAs we will see below this problem can further be transformed into a quadratic program\n\n$$\\min_{w,b} \\|w\\|^2 + C_L \\|\\xi\\|^2 + C_U \\|\\vartheta\\|^2 \\qquad (5)$$\n$$\\text{s.t.} \\quad \\langle w, x_i \\rangle + b = y_i + \\xi_i \\text{ for all } i = 1, ..., m$$\n$$\\langle w, z_j \\rangle + b = \\vartheta_j \\text{ for all } j = 1, ..., q$$\n$$\\xi^\\top 1_k = 0 \\text{ for } k = \\pm.$$\n\nIgnoring the constraint $\\xi^\\top 1_k = 0$, this program is equivalent to the quadratic program (3) of the\nUls-SVM. The following lemma establishes the relation of the Uls-SVM to kFDA and koPCA.\n\nLemma For given $C_L$ and $C_U$ the optimisation problems (4) and (5) are equivalent.\nProof: Let $w$, $b$, $\\xi$ and $\\vartheta$ be the optimal solution of (5). Combining the first and last constraint, we get\n$w^\\top c^\\pm + b \\mp 1 = 0$. This gives us $w^\\top (c^+ - c^-) = 2$ as well as $b = -w^\\top \\tilde{c}$. Plugging $\\xi$ and $\\vartheta$ in (5)\nand using this value of b, we obtain the objective function (4). So we have proved that the minimum\nvalue of (4) is not larger than the one of (5).\nConversely, let w be the optimal solution of (4). Let us choose $b = -w^\\top \\tilde{c}$, $\\xi_i = w^\\top x_i + b - y_i$ and\n$\\vartheta_j = w^\\top z_j + b$. Again both objective functions are equal. We just have to check that $\\sum_{i:\\, y_i = \\pm 1} \\xi_i = 0$.\nBut because $w^\\top (c^+ - c^-) = 2$, we have\n\n$$\\frac{1}{m_\\pm} \\sum_{i:\\, y_i = \\pm 1} \\xi_i = w^\\top c^\\pm + b \\mp 1 = w^\\top c^\\pm - \\frac{w^\\top (c^+ + c^-)}{2} \\mp 1 = \\frac{w^\\top (c^\\pm - c^\\mp)}{2} \\mp 1 = 0. \\qquad \\square$$\n\nThe above lemma establishes a relation of the Uls-SVM to two classic learning algorithms. This\nfurther clarifies the role of the Universum set in the algorithmic implementation of Vapnik\u2019s idea\nas proposed by [11]. Since the noise covariance matrix of koPCA is given by the covariance of the\nUniversum points centered on the average of the labelled class means, the role of the Universum as\na data-dependent specification of principal directions of invariance is affirmed.\nThe koPCA term also shows that both the position and covariance structure are crucial to a good\nUniversum. To see this, we rewrite $\\sum_{j=1}^{q} (z_j - \\tilde{c})(z_j - \\tilde{c})^\\top$ as $\\sum_{j=1}^{q} (z_j - \\tilde{z})(z_j - \\tilde{z})^\\top + q (\\tilde{z} - \\tilde{c})(\\tilde{z} - \\tilde{c})^\\top$,\nwhere $\\tilde{z} = \\frac{1}{q} \\sum_{j=1}^{q} z_j$ is the Universum mean. The additive relationship between the\ncovariance of the Universum about its mean and the distance between the Universum and training sample\nmeans projected onto w shows that either quantity can dominate depending on the data at hand.\nIn the next section, we demonstrate the theoretical results of this section on toy problems and give\nan example of how to use the insight gained from this section to construct an appropriate Universum.\n\n4 Experiments\n\n4.1 Toy Experiments\n\nThe theoretical results of section 3 show that the covariance structure of the Universum as well as\nits absolute position influence the result of the learning process. To validate this insight on toy data,\nwe sample ten labelled sets of size 20, 50, 100 and 500 from two fifty-dimensional Gaussians. Both\nGaussians have a diagonal covariance that has low standard deviation ($\\sigma_{1,2} = 0.08$) in the first two\ndimensions and high standard deviation ($\\sigma_{3,...,50} = 10$) in the remaining 48. 
The two Gaussians are\ndisplaced such that the mean $\\mu_i^\\pm = \\pm 0.3$ exceeds the standard deviation by a factor of 3.75 in\nthe first two dimensions but is 125 times smaller in the remaining ones. The values are chosen\nsuch that the Bayes risk is approx. 5%. Note that by construction the first two dimensions are most\ndiscriminative.\nWe construct two kinds of Universa for this toy problem. For the first kind we use a mean zero\nGaussian with the same covariance structure as the Gaussians for the labelled data ($\\sigma_{3,...,50} = 10$),\nbut with varied degree of anisotropy in the first two dimensions ($\\sigma_{1,2} = 0.1, 1.0, 10$). According to\nthe results of section 3 the Universa should be more helpful for larger anisotropy. For the second\nkind of Universa we use the same covariance as the labelled classes but shift them along the line\nbetween the means of the labelled Gaussians. This kind of Universa should have a positive effect\non the accuracy for small displacements but that effect should vanish with increasing amount of\ntranslation.\nFigure 1 shows the performance of linear U-SVMs for different amounts of training and Universum\ndata. In the top row, the degree of isotropy increases from left to right, where \u03c3 = 10 refers to the\ncompletely isotropic case. In the bottom row, the amount of translation increases from left to right.\nAs expected, performance converges to the performance of an SVM for high isotropy \u03c3 and large\ntranslations t. Note that large translations do not affect the accuracy as much as a high isotropy.\nHowever, this might be due to the fact that the variance along the principal components of the Universum\nis much larger in magnitude than the applied shift. 
We obtained similar results for the Uls-SVM.\nAlso, the effect remains when employing an RBF kernel.\n\nFigure 1: Learning curves of linear U-SVMs for different degrees of isotropy \u03c3 and different amounts of\ntranslation $z \\mapsto z + \\frac{t}{2} \\cdot (c^+ - c^-)$. With increasing isotropy and translation the performance of the U-SVMs\nconverges to the performance of a normal SVM.\n\nUniversum         0       1       2       3       4       6       7       9\nTest error (%)    1.234   1.313   1.399   1.051   1.246   1.111   1.338   1.226\nMean output       0.406  -0.708  -0.539  -0.031  -0.256   0.063  -0.165  -0.360\nAngle            81.99   85.57   79.49   69.74   79.75   81.02   82.72   77.98\n\nTable 1: See text for details. Without Universum, test error is 1.419%. The correlation between the test error\nand the absolute value of the mean output (resp. angle) is 0.71 (resp. 0.64); the p-value (i.e. the probability of\nobserving such a correlation by chance) is 3% (resp. 5.5%). Note, for instance, that digits 3 and 6 are the\nbest Universa and they are also the closest to the decision boundary.\n\n4.2 Results on MNIST\n\nFollowing the experimental work from [11], we took up the task of distinguishing between the\ndigits 5 and 8 on MNIST data. Training sets of size 1000 were used, and other digits served as\nUniversum data. Using different digits as Universa, we recorded the test error (in percentage) of\nU-SVM. We also computed the mean output (i.e. $\\langle w, x \\rangle + b$) of a normal SVM trained for binary\nclassification between the digits 5 and 8, measured on the points from the Universum class.\nAnother quantity of interest measured was the angle between covariance matrices of training and\nUniversum data in the feature space. Note that for two covariance matrices $C_X$ and $C_Y$ corresponding\nto matrices X and Y (centered about their means), the cosine of the angle is defined\nas $\\mathrm{trace}(C_X C_Y) / \\sqrt{\\mathrm{trace}(C_X^2)\\, \\mathrm{trace}(C_Y^2)}$. This quantity can be computed in feature space as\n$\\mathrm{trace}(K_{XY} K_{XY}^\\top) / \\sqrt{\\mathrm{trace}(K_{XX}^2)\\, \\mathrm{trace}(K_{YY}^2)}$, with $K_{XY}$ the kernel matrix between the sets X\nand Y. These quantities have been documented in Table 1. All the results reported are averaged\nover 10 folds of cross-validation, with $C = C_U = 100$, and \u03b5 = 0.01.\n\n4.3 Classification of Imagined Movements in Brain Computer Interfaces\n\nBrain computer interfaces (BCI) are devices that allow a user to control a computer by merely\nusing his brain activity [3]. The user indicates different states to a computer system by deliberately\nchanging his state of mind according to different experimental paradigms. These states are to be\ndetected by a classifier. In our experiments, we used data from electroencephalographic recordings\n(EEG) with an imagined-movement paradigm. In this paradigm the patient imagines the movement\nof his left or right hand for indicating the respective state. 
In order to reverse the spatial blurring of\nthe brain activity by the intermediate tissue of the skull, the signals from all sensors are demixed via\nan independent component analysis (ICA) applied to the concatenated lowpass filtered time series\nof all recording channels [2].\n\n[Figure 1 panels: mean error against m. Top row: \u03c3 = 0.1, \u03c3 = 1.0, \u03c3 = 10.0, with curves SVM (q=0), q = 100, q = 500. Bottom row: t = 0.1, t = 0.5, t = 0.9, with curves SVM (q=0), q = 100, q = 1000.]\n\nDATA I\nAlgorithm  U     FS                     JH                     JL\nSVM        \u2205     40.00 \u00b1 7.70           40.00 \u00b1 11.32          30.00 \u00b1 15.54\nU-SVM      UC3   41.33 \u00b1 7.06 (0.63)    34.58 \u00b1 9.22 (0.07)    30.56 \u00b1 17.22 (1.00)\nU-SVM      Unm   39.67 \u00b1 8.23 (1.00)    37.08 \u00b1 11.69 (0.73)   30.00 \u00b1 16.40 (1.00)\nLS-SVM     \u2205     41.00 \u00b1 7.04           40.42 \u00b1 11.96          30.56 \u00b1 15.77\nUls-SVM    UC3   40.67 \u00b1 7.04 (1.00)    37.08 \u00b1 7.20 (0.18)    31.11 \u00b1 17.01 (1.00)\nUls-SVM    Unm   40.67 \u00b1 6.81 (1.00)    37.92 \u00b1 12.65 (1.00)   30.00 \u00b1 15.54 (1.00)\n\nDATA II\nAlgorithm  U     S1                     S2                     S3\nSVM        \u2205     12.35 \u00b1 6.82           35.29 \u00b1 13.30          35.26 \u00b1 14.05\nU-SVM      UC3   13.53 \u00b1 6.83 (0.63)    32.94 \u00b1 11.83 (0.63)   35.26 \u00b1 14.05 (1.00)\nU-SVM      Unm   12.35 \u00b1 7.04 (1.00)    27.65 \u00b1 14.15 (0.13)   36.84 \u00b1 13.81 (1.00)\nLS-SVM     \u2205     13.53 \u00b1 8.34           33.53 \u00b1 13.60          34.21 \u00b1 12.47\nUls-SVM    UC3   12.94 \u00b1 6.68 (1.00)    32.35 \u00b1 10.83 (0.38)   35.79 \u00b1 15.25 (1.00)\nUls-SVM    Unm   16.47 \u00b1 7.74 (0.50)    31.18 \u00b1 13.02 (0.69)   35.79 \u00b1 15.25 (1.00)\n\nTable 2: Mean zero-one test error scores for the BCI experiments. The mean was taken over ten single error\nscores. The p-values for a two-sided sign test against the SVM error scores are given in brackets.\n\nIn the experiments below we used two BCI datasets. For the first set (DATA I) we recorded the EEG\nactivity from three healthy subjects for an imagined movement paradigm as described by [3]. The\nsecond set (DATA II) contains EEG signals from a similar paradigm [7].\nWe constructed two kinds of Universa. The first Universum, UC3, consists of recordings from a third\ncondition in the experiments that is not related to imagined movements. Since variations in signals\nfrom this condition should not carry any useful information about the imagined movement task, the\nclassifier should be invariant against them. The second Universum Unm is physiologically motivated.\nIn the case of the imagined-movement paradigm the relevant signal is known to be in the so\ncalled \u03b1-band from approximately 10\u201312 Hz and spatially located over the motor cortices. Unfortunately,\nsignals in the \u03b1-band are also related to visual activity and independent components can be\nfound that have a strong influence from sensors over the visual cortex. However, since ICA is unsupervised,\nthose independent components could still contain discriminative information. In order\nto make the learning algorithm prefer the signals from the motor cortex, we construct a Universum\nUnm by projecting the labelled data onto the independent components that have a strong influence\nfrom the visual cortex.\nThe machine learning experiments were carried out in two nested cross validation loops, where\nthe inner loop was used for model selection and the outer for testing. We exclusively used a linear\nkernel. 
Table 2 shows the mean zero-one loss for DATA I and DATA II and the constructed Universa.\nOn the DATA I dataset, there is no improvement in the error rates for the subjects FS and JL compared\nto an SVM without Universum. Therefore, we must assume that the employed Universa did\nnot provide helpful information in those cases. For subject JH, UC3 and Unm yield an improvement\nfor both Universum algorithms. However, the differences to the SVM error scores are not significant\naccording to a two-sided sign test. The Uls-SVM performs worse than the U-SVM in\nalmost all cases.\nOn the DATA II dataset, there was an improvement only for subject S2 using the U-SVM with the\nUnm and UC3 Universum (8% and 3% improvement respectively). However, those differences\nare not significant either. As already observed for the DATA I dataset, the Uls-SVM performs consistently\nworse than its hinge loss counterpart.\nThe better performance of the Unm Universum on the subjects JH and S2 indicates that additional\ninformation about the usefulness of features might in fact help to increase the accuracy of the classifier.\nThe regularisation constant $C_U$ for the Universum points was chosen as $C = C_U = 0.1$ in\nboth cases. This means that the non-orthogonality of w on the Universum points was only weakly\npenalised, but had equal priority to classifying the labelled examples correctly. This could indicate\nthat the spatial filtering by the ICA is not perfect and discriminative information might be spread\nover several independent components, even over those that are mainly non-discriminative. Using\nthe Unm Universum and therefore gently penalising the use of these non-discriminative features can\nhelp to improve the classification accuracy, although the factual usefulness seems to vary with the\nsubject.\n\n5 Conclusion\n\nIn this paper we analysed two algorithms for inference with a Universum as proposed by Vapnik\n[10]. 
We demonstrated that the U-SVM as implemented in [11] is equivalent to searching for a\nhyperplane which has its normal lying in the orthogonal complement of the space spanned by Universum\nexamples. We also showed that the corresponding least squares Uls-SVM can be seen as a\nhybrid between the two well known learning algorithms kFDA and koPCA where the Universum\npoints, centered between the means of the labelled classes, play the role of the noise covariance in\nkoPCA. Ideally the covariance matrix of the Universum should thus contain some important invariant\ndirections for the problem at hand.\nThe position of the Universum set also plays an important role and both our theoretical and experimental\nanalysis show that the behaviour of the algorithm depends on the difference between the\nmeans of the labelled set and of the Universum set. The question of whether the main influence\nof the Universum comes from the position or the covariance does not have a clear answer and is\nprobably problem dependent.\nFrom a practical point of view, the main contribution of this paper is to suggest how to select a good Universum\nset: it should be such that it contains invariant directions and is positioned \u201cin between\u201d the\ntwo classes. Therefore, as can be partly seen from the BCI experiments, a good Universum dataset\nneeds to be carefully chosen and cannot be an arbitrary backdrop as the name might suggest.\n\nReferences\n[1] O. Chapelle, B. Sch\u00f6lkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, MA,\n2006.\n\n[2] N. J. Hill, T. N. Lal, M. Schr\u00f6der, T. Hinterberger, B. Wilhelm, F. Nijboer, U. Mochty, G. Widman, C. E.\nElger, B. Sch\u00f6lkopf, A. K\u00fcbler, and N. Birbaumer. 
Classifying EEG and ECoG signals without subject\ntraining for fast BCI implementation: Comparison of non-paralysed and completely paralysed subjects.\nIEEE Transactions on Neural Systems and Rehabilitation Engineering, 14(2):183\u2013186, 06 2006.\n\n[3] T. N. Lal. Machine Learning Methods for Brain-Computer Interfaces. PhD thesis, University Darmstadt,\n09 2005. Logos Verlag Berlin MPI Series in Biological Cybernetics, Bd. 12 ISBN 3-8325-1048-6.\n\n[4] Neil D. Lawrence and Michael I. Jordan. Gaussian processes and the null-category noise model. In\nO. Chapelle, B. Sch\u00f6lkopf, and A. Zien, editors, Semi-Supervised Learning, chapter 8, pages 137\u2013150.\nMIT Press, 2006.\n\n[5] S. Mika, G. R\u00e4tsch, J. Weston, B. Sch\u00f6lkopf, A. Smola, and K. M\u00fcller. Invariant feature extraction and\nclassification in kernel spaces. In Advances in Neural Information Processing Systems 12, pages 526\u2013532,\n2000.\n\n[6] Sebastian Mika, Gunnar R\u00e4tsch, and Klaus-Robert M\u00fcller. A mathematical programming approach to the\nkernel Fisher algorithm. In Advances in Neural Information Processing Systems, NIPS, 2000.\n\n[7] J. del R. Mill\u00e1n. On the need for on-line learning in brain-computer interfaces. IDIAP-RR 30, IDIAP,\nMartigny, Switzerland, 2003. Published in \u201cProc. of the Int. Joint Conf. on Neural Networks\u201d, 2004.\n\n[8] P. Sollich. Probabilistic methods for support vector machines. In Advances in Neural Information Processing\nSystems, 1999.\n\n[9] J. A. K. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Processing\nLetters, 9(3):293\u2013300, 1999.\n\n[10] V. Vapnik. Transductive Inference and Semi-Supervised Learning. In O. Chapelle, B. Sch\u00f6lkopf, and\nA. Zien, editors, Semi-Supervised Learning, chapter 24, pages 454\u2013472. MIT Press, 2006.\n\n[11] J. Weston, R. Collobert, F. Sinz, L. Bottou, and V. Vapnik. Inference with the universum. 
In Proceedings\nof the 23rd International Conference on Machine Learning, page 127, 06/25/2006.\n\n[12] P. Zhong and M. Fukushima. A new support vector algorithm. Optimization Methods and Software,\n21:359\u2013372, 2006.\n", "award": [], "sourceid": 780, "authors": [{"given_name": "Olivier", "family_name": "Chapelle", "institution": null}, {"given_name": "Alekh", "family_name": "Agarwal", "institution": null}, {"given_name": "Fabian", "family_name": "Sinz", "institution": null}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}]}