{"title": "Duality, Geometry, and Support Vector Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 593, "page_last": 600, "abstract": null, "full_text": "Duality, Geometry, and Support Vector Regression\n\nJinbo Bi and Kristin P. Bennett\nDepartment of Mathematical Sciences\nRensselaer Polytechnic Institute\nTroy, NY 12180\nbij2@rpi.edu, bennek@rpi.edu\n\nAbstract\n\nWe develop an intuitive geometric framework for support vector regression (SVR). By examining when $\\epsilon$-tubes exist, we show that SVR can be regarded as a classification problem in the dual space. Hard and soft $\\epsilon$-tubes are constructed by separating the convex or reduced convex hulls, respectively, of the training data with the response variable shifted up and down by $\\epsilon$. A novel SVR model is proposed based on choosing the max-margin plane between the two shifted datasets. Maximizing the margin corresponds to shrinking the effective $\\epsilon$-tube. In the proposed approach, the effects of the choices of all parameters become clear geometrically.\n\n1 Introduction\n\nSupport Vector Machines (SVMs) [6] are a very robust methodology for inference with minimal parameter choices. Intuitive geometric formulations exist for the classification case addressing both the error metric and capacity control [1, 2]. For linearly separable classification, the primal SVM finds the separating plane with maximum hard margin between two sets. The equivalent dual SVM computes the closest points in the convex hulls of the data from each class. For the inseparable case, the primal SVM optimizes the soft margin of separation between the two classes. The corresponding dual SVM finds the closest points in the reduced convex hulls. 
In this paper, we derive analogous arguments for SVM regression (SVR).\n\nWe provide a geometric explanation for SVR with the $\\epsilon$-insensitive loss function. From the primal perspective, a linear function with no residuals greater than $\\epsilon$ corresponds to an $\\epsilon$-tube constructed about the data in the space of the data attributes and the response variable [6] (see, e.g., Figure 1(a)). The primary contribution of this work is a novel geometric interpretation of SVR from the dual perspective along with a mathematically rigorous derivation of the geometric concepts. In Section 2, for a fixed $\\epsilon > 0$ we examine the question \"When does a 'perfect' or 'hard' $\\epsilon$-tube exist?\" Duality analysis shows that the existence of a hard $\\epsilon$-tube depends on the separability of two sets. The two sets consist of the training data augmented with the response variable shifted up and down by $\\epsilon$. In the dual space, regression becomes the classification problem of distinguishing between these two sets. The geometric formulations developed for the classification case [1] become applicable to the regression case. We call the resulting formulation convex SVR (C-SVR) since it is based on convex hulls of the augmented training data. Much like in SVM classification, to compute a hard $\\epsilon$-tube, C-SVR computes the nearest points in the convex hulls of the augmented classes. The corresponding maximum margin (max-margin) planes define the effective $\\epsilon$-tube. The size of the margin determines how much the effective $\\epsilon$-tube shrinks. Similarly, to compute a soft $\\epsilon$-tube, reduced-convex SVR (RC-SVR) finds the closest points in the reduced convex hulls of the two augmented sets.\n\nThis paper introduces the geometrically intuitive RC-SVR formulation, which is a variation of the classic $\\epsilon$-SVR [6] and $\\nu$-SVR models [5]. 
If parameters are properly tuned, the methods perform similarly although not necessarily identically. RC-SVR eliminates the pesky parameter $C$ used in $\\epsilon$-SVR and $\\nu$-SVR. The geometric role or interpretation of $C$ is not known for these formulations. The geometric roles of the two parameters of RC-SVR, $\\nu$ and $\\epsilon$, are very clear, facilitating model selection, especially for nonexperts. Like $\\nu$-SVR, RC-SVR shrinks the $\\epsilon$-tube and has a parameter $\\nu$ controlling the robustness of the solution. The parameter $\\epsilon$ acts as an upper bound on the size of the allowable $\\epsilon$-insensitive error function. In addition, RC-SVR can be solved by fast and scalable nearest-point algorithms such as those used in [3] for SVM classification.\n\n2 When does a hard $\\epsilon$-tube exist?\n\nFigure 1: The (a) primal hard $\\epsilon_0$-tube, and dual cases: (b) dual strictly separable $\\epsilon > \\epsilon_0$, (c) dual separable $\\epsilon = \\epsilon_0$, and (d) dual inseparable $\\epsilon < \\epsilon_0$. [Four panels with axes x and y; panels (b)-(d) show the sets $D^+$ and $D^-$.]\n\nSVR constructs a regression model that minimizes some empirical risk measure regularized to control capacity. Let $x$ be the $n$ predictor variables and $y$ the dependent response variable. In [6], Vapnik proposed using the $\\epsilon$-insensitive loss function $L_\\epsilon(x, y, f) = |y - f(x)|_\\epsilon = \\max(0, |y - f(x)| - \\epsilon)$, in which an example is in error if its residual $|y - f(x)|$ is greater than $\\epsilon$. Plotting the points in $(x, y)$ space as in Figure 1(a), we see that for a \"perfect\" regression model the data fall in a hard $\\epsilon$-tube about the regression line. Let $(X_i, y_i)$ be an example where $i = 1, 2, \\ldots, m$, $X_i$ is the $i$th predictor vector, and $y_i$ is its response. 
The training data are then $(X, y)$ where $X_i$ is a row of the matrix $X \\in \\mathbb{R}^{m \\times n}$ and $y \\in \\mathbb{R}^m$ is the response. A hard $\\epsilon$-tube for a fixed $\\epsilon > 0$ is defined as a plane $y = w'x + b$ satisfying $-\\epsilon e \\le y - Xw - be \\le \\epsilon e$, where $e$ is an $m$-dimensional vector of ones.\n\nWhen does a hard $\\epsilon$-tube exist? Clearly, for $\\epsilon$ large enough such a tube always exists for finite data. The smallest tube, the $\\epsilon_0$-tube, can be found by optimizing:\n\n$$\\min_{w, b, \\epsilon} \\; \\epsilon \\quad \\text{s.t.} \\quad -\\epsilon e \\le y - Xw - be \\le \\epsilon e \\qquad (1)$$\n\nNote that the smallest tube is typically not the $\\epsilon$-SVR solution. Let $D^+$ and $D^-$ be formed by augmenting the data with the response variable respectively increased and decreased by $\\epsilon$, i.e., $D^+ = \\{(X_i, y_i + \\epsilon), \\; i = 1, \\ldots, m\\}$ and $D^- = \\{(X_i, y_i - \\epsilon), \\; i = 1, \\ldots, m\\}$. Consider the simple problem in Figure 1(a). For any fixed $\\epsilon > 0$, there are three possible cases: $\\epsilon > \\epsilon_0$, in which strict hard $\\epsilon$-tubes exist; $\\epsilon = \\epsilon_0$, in which only $\\epsilon_0$-tubes exist; and $\\epsilon < \\epsilon_0$, in which no hard $\\epsilon$-tubes exist. A strict hard $\\epsilon$-tube with no points on the edges of the tube only exists for $\\epsilon > \\epsilon_0$. Figure 1(b-d) illustrates what happens in the dual space for each case. The convex hulls of $D^+$ and $D^-$ are drawn along with the max-margin plane in (b) and the supporting plane in (c) for separating the convex hulls.\n\nClearly, the existence of the tube is directly related to the separability of $D^+$ and $D^-$. If $\\epsilon > \\epsilon_0$ then a strict tube exists and the convex hulls of $D^+$ and $D^-$ are strictly separable (see footnote 1). There are infinitely many possible $\\epsilon$-tubes when $\\epsilon > \\epsilon_0$. 
One can see that the max-margin plane separating $D^+$ and $D^-$ corresponds to one such $\\epsilon$. In fact this plane forms an $\\hat{\\epsilon}$-tube where $\\epsilon > \\hat{\\epsilon} \\ge \\epsilon_0$. If $\\epsilon = \\epsilon_0$, then the convex hulls of $D^+$ and $D^-$ are separable but not strictly separable. The plane that separates the two convex hulls forms the $\\epsilon_0$-tube. In the last case, where $\\epsilon < \\epsilon_0$, the convex hulls of $D^+$ and $D^-$ intersect. No $\\epsilon$-tubes or max-margin planes exist.\n\nIt is easy to show by construction that if a hard $\\epsilon$-tube exists for a given $\\epsilon > 0$ then the convex hulls of $D^+$ and $D^-$ will be separable. If a hard $\\epsilon$-tube exists, then there exists $(w, b)$ such that\n\n$$(y + \\epsilon e) - Xw - be \\ge 0, \\qquad (y - \\epsilon e) - Xw - be \\le 0. \\qquad (2)$$\n\nFor any convex combination $\\begin{pmatrix} X' \\\\\\\\ (y + \\epsilon e)' \\end{pmatrix} u$, where $e'u = 1$, $u \\ge 0$, of the points $(X_i, y_i + \\epsilon)$, $i = 1, 2, \\ldots, m$, of $D^+$, we have $(y + \\epsilon e)'u - w'(X'u) - b \\ge 0$. Similarly, for any convex combination $\\begin{pmatrix} X' \\\\\\\\ (y - \\epsilon e)' \\end{pmatrix} v$, where $e'v = 1$, $v \\ge 0$, of the points $(X_i, y_i - \\epsilon)$, $i = 1, 2, \\ldots, m$, of $D^-$, we have $(y - \\epsilon e)'v - w'(X'v) - b \\le 0$. Then the plane $y = w'x + b$ in the $\\epsilon$-tube separates the two convex hulls. Note that the separating plane and the $\\epsilon$-tube plane are the same. If no separating plane exists, then there is no tube. 
Gale\u2019s Theorem of the alternative (see footnote 2) can be used to precisely characterize the $\\epsilon$-tube.\n\nTheorem 2.1 (Conditions for existence of hard $\\epsilon$-tube) A hard $\\epsilon$-tube exists for a given $\\epsilon > 0$ if and only if the following system in $(u, v)$ has no solution:\n\n$$X'u = X'v, \\quad e'u = e'v = 1, \\quad (y + \\epsilon e)'u - (y - \\epsilon e)'v < 0, \\quad u \\ge 0, \\; v \\ge 0. \\qquad (3)$$\n\nProof A hard $\\epsilon$-tube exists if and only if system (2) has a solution. By Gale\u2019s Theorem of the alternative [4], system (2) has a solution if and only if the following alternative system has no solution: $X'u = X'v$, $e'u = e'v$, $(y + \\epsilon e)'u - (y - \\epsilon e)'v = -1$, $u \\ge 0$, $v \\ge 0$. Rescaling by $1/\\sigma$ where $\\sigma = e'u = e'v > 0$ yields the result.\n\nFootnote 1: We use the following definitions of separation of convex sets. Let $D^+$ and $D^-$ be nonempty convex sets. A plane $H = \\{x : w'x = \\alpha\\}$ is said to separate $D^+$ and $D^-$ if $w'x \\ge \\alpha, \\; \\forall x \\in D^+$ and $w'x \\le \\alpha, \\; \\forall x \\in D^-$. $H$ is said to strictly separate $D^+$ and $D^-$ if $w'x \\ge \\alpha + \\Delta$ for each $x \\in D^+$ and $w'x \\le \\alpha - \\Delta$ for each $x \\in D^-$, where $\\Delta$ is a positive scalar.\n\nFootnote 2: The system $Ax \\le c$ has a (or has no) solution if and only if the alternative system $A'y = 0$, $c'y = -1$, $y \\ge 0$ has no (or has a) solution.\n\nNote that if $\\epsilon \\ge \\epsilon_0$ then $(y + \\epsilon e)'u - (y - \\epsilon e)'v \\ge 0$ for any $(u, v)$ such that $X'u = X'v$, $e'u = e'v = 1$, $u, v \\ge 0$. So as a consequence of this theorem, if $D^+$ and $D^-$ are separable, then a hard $\\epsilon$-tube exists.\n\n3 Constructing the $\\epsilon$-tube\n\nFor any $\\epsilon > \\epsilon_0$ infinitely many possible $\\epsilon$-tubes exist. Which $\\epsilon$-tube should be used? The linear program (1) can be solved to find the smallest $\\epsilon_0$-tube. 
But this corresponds to just doing empirical risk minimization and may result in poor generalization due to overfitting. We know capacity control or structural risk minimization is fundamental to the success of SVM classification and regression.\n\nWe take our inspiration from SVM classification. In hard-margin SVM classification, the dual SVM formulation constructs the max-margin plane by finding the two nearest points in the convex hulls of the two classes. The max-margin plane is the plane that perpendicularly bisects the segment joining these two points. We know that the existence of the tube is linked to the separability of the shifted sets, $D^+$ and $D^-$. The key insight is that the regression problem can be regarded as a classification problem between $D^+$ and $D^-$. The two sets $D^+$ and $D^-$ defined as in Section 2 both contain the same number of data points. The only significant difference occurs along the $y$ dimension, as the response variable $y$ is shifted up by $\\epsilon$ in $D^+$ and down by $\\epsilon$ in $D^-$. For $\\epsilon > \\epsilon_0$, the max-margin separating plane corresponds to a hard $\\hat{\\epsilon}$-tube where $\\epsilon > \\hat{\\epsilon} \\ge \\epsilon_0$. The resulting tube is smaller than $\\epsilon$ but not necessarily the smallest tube. Figure 1(b) shows the max-margin plane found for $\\epsilon > \\epsilon_0$. Figure 1(a) shows that the corresponding linear regression function for this simple example turns out to be the $\\epsilon_0$-tube. As in classification, we will have a hard and a soft $\\epsilon$-tube case. The soft $\\epsilon$-tube with $\\epsilon \\le \\epsilon_0$ is used to obtain good generalization when there are outliers.\n\n3.1 The hard $\\epsilon$-tube case\n\nWe now apply the dual convex hull method to constructing the max-margin plane for our augmented sets $D^+$ and $D^-$ assuming they are strictly separable, i.e., $\\epsilon > \\epsilon_0$. The problem is illustrated in detail in Figure 2. 
The closest points of $D^+$ and $D^-$ can be found by solving the following dual C-SVR quadratic program:\n\n$$\\min_{u, v} \\; \\frac{1}{2} \\left\\| \\begin{pmatrix} X' \\\\\\\\ (y + \\epsilon e)' \\end{pmatrix} u - \\begin{pmatrix} X' \\\\\\\\ (y - \\epsilon e)' \\end{pmatrix} v \\right\\|^2 \\quad \\text{s.t.} \\quad e'u = 1, \\; e'v = 1, \\; u \\ge 0, \\; v \\ge 0. \\qquad (4)$$\n\nLet the closest points in the convex hulls of $D^+$ and $D^-$ be $c = \\begin{pmatrix} X' \\\\\\\\ (y + \\epsilon e)' \\end{pmatrix} \\hat{u}$ and $d = \\begin{pmatrix} X' \\\\\\\\ (y - \\epsilon e)' \\end{pmatrix} \\hat{v}$ respectively. The max-margin separating plane bisects these two points. The normal $(\\hat{w}, \\hat{\\delta})$ of the plane is the difference between them, i.e., $\\hat{w} = X'\\hat{u} - X'\\hat{v}$, $\\hat{\\delta} = (y + \\epsilon e)'\\hat{u} - (y - \\epsilon e)'\\hat{v}$. The threshold, $\\hat{b}$, is the distance from the origin to the point halfway between the two closest points along the normal: $\\hat{b} = \\hat{w}' \\left( \\frac{X'\\hat{u} + X'\\hat{v}}{2} \\right) + \\hat{\\delta} \\left( \\frac{y'\\hat{u} + y'\\hat{v}}{2} \\right)$. The separating plane has the equation $\\hat{w}'x + \\hat{\\delta} y - \\hat{b} = 0$. Rescaling this plane yields the regression function.\n\nFigure 2: The solution $\\hat{\\epsilon}$-tube found by C-SVR can have $\\hat{\\epsilon} < \\epsilon$. Squares are original data. Dots are in $D^+$. Triangles are in $D^-$. Support vectors are circled.\n\nDual C-SVR (4) is in the dual space. The corresponding primal C-SVR is:\n\n$$\\min_{w, \\delta, \\alpha, \\beta} \\; \\frac{1}{2} \\|w\\|^2 + \\frac{1}{2} \\delta^2 - (\\alpha - \\beta) \\quad \\text{s.t.} \\quad Xw + \\delta (y + \\epsilon e) - \\alpha e \\ge 0, \\quad Xw + \\delta (y - \\epsilon e) - \\beta e \\le 0. \\qquad (5)$$\n\nDual C-SVR (4) can be derived by taking the Wolfe or Lagrangian dual [4] of primal C-SVR (5) and simplifying.\n\nWe prove that the optimal plane from C-SVR bisects the $\\hat{\\epsilon}$-tube. The supporting planes for class $D^+$ and class $D^-$ determine the lower and upper edges of the $\\hat{\\epsilon}$-tube respectively. 
The support vectors from $D^+$ and $D^-$ correspond to the points along the lower and upper edges of the $\\hat{\\epsilon}$-tube. See Figure 2.\n\nTheorem 3.1 (C-SVR constructs $\\hat{\\epsilon}$-tube) Let the max-margin plane obtained by C-SVR (4) be $\\hat{w}'x + \\hat{\\delta} y - \\hat{b} = 0$ where $\\hat{w} = X'\\hat{u} - X'\\hat{v}$, $\\hat{\\delta} = (y + \\epsilon e)'\\hat{u} - (y - \\epsilon e)'\\hat{v}$, and $\\hat{b} = \\hat{w}' \\left( \\frac{X'\\hat{u} + X'\\hat{v}}{2} \\right) + \\hat{\\delta} \\left( \\frac{y'\\hat{u} + y'\\hat{v}}{2} \\right)$. If $\\epsilon > \\epsilon_0$, then the plane $y = w'x + b$ corresponds to an $\\hat{\\epsilon}$-tube of training data $(X_i, y_i)$, $i = 1, 2, \\ldots, m$, where $w = -\\frac{\\hat{w}}{\\hat{\\delta}}$, $b = \\frac{\\hat{b}}{\\hat{\\delta}}$, and $\\hat{\\epsilon} = \\epsilon - \\frac{\\hat{\\alpha} - \\hat{\\beta}}{2\\hat{\\delta}} < \\epsilon$.\n\nProof First, we show $\\hat{\\delta} > 0$. By the Wolfe duality theorem [4], the optimal values $\\hat{\\alpha}$, $\\hat{\\beta}$ of primal C-SVR (5) satisfy $\\hat{\\alpha} - \\hat{\\beta} > 0$, since the objective value of (5) and the negative of the objective value of (4) are equal at optimality. By complementarity, the closest points are right on the margin planes $\\hat{w}'x + \\hat{\\delta} y - \\hat{\\alpha} = 0$ and $\\hat{w}'x + \\hat{\\delta} y - \\hat{\\beta} = 0$ respectively, so $\\hat{\\alpha} = \\hat{w}'X'\\hat{u} + \\hat{\\delta} (y + \\epsilon e)'\\hat{u}$ and $\\hat{\\beta} = \\hat{w}'X'\\hat{v} + \\hat{\\delta} (y - \\epsilon e)'\\hat{v}$. Hence $\\hat{b} = \\frac{\\hat{\\alpha} + \\hat{\\beta}}{2}$, and $\\hat{w}$, $\\hat{\\delta}$, $\\hat{\\alpha}$, and $\\hat{\\beta}$ satisfy the constraints of problem (5), i.e., $X\\hat{w} + \\hat{\\delta} (y + \\epsilon e) - \\hat{\\alpha} e \\ge 0$, $X\\hat{w} + \\hat{\\delta} (y - \\epsilon e) - \\hat{\\beta} e \\le 0$. Subtracting the second inequality from the first gives $2\\hat{\\delta}\\epsilon - \\hat{\\alpha} + \\hat{\\beta} \\ge 0$, that is, $\\hat{\\delta} \\ge \\frac{\\hat{\\alpha} - \\hat{\\beta}}{2\\epsilon} > 0$ because $\\epsilon > \\epsilon_0 \\ge 0$. Rescale the constraints by $-\\hat{\\delta} < 0$, and reverse the signs. 
Let $w = -\\frac{\\hat{w}}{\\hat{\\delta}}$; then the inequalities become $Xw - y \\le \\epsilon e - \\frac{\\hat{\\alpha}}{\\hat{\\delta}} e$, $Xw - y \\ge -\\epsilon e - \\frac{\\hat{\\beta}}{\\hat{\\delta}} e$. Let $b = \\frac{\\hat{b}}{\\hat{\\delta}}$; then $\\frac{\\hat{\\alpha}}{\\hat{\\delta}} = b + \\frac{\\hat{\\alpha} - \\hat{\\beta}}{2\\hat{\\delta}}$ and $\\frac{\\hat{\\beta}}{\\hat{\\delta}} = b - \\frac{\\hat{\\alpha} - \\hat{\\beta}}{2\\hat{\\delta}}$. Substituting into the previous inequalities yields $Xw - y \\le \\left( \\epsilon - \\frac{\\hat{\\alpha} - \\hat{\\beta}}{2\\hat{\\delta}} \\right) e - be$, $Xw - y \\ge -\\left( \\epsilon - \\frac{\\hat{\\alpha} - \\hat{\\beta}}{2\\hat{\\delta}} \\right) e - be$. Denote $\\hat{\\epsilon} = \\epsilon - \\frac{\\hat{\\alpha} - \\hat{\\beta}}{2\\hat{\\delta}} < \\epsilon$. These inequalities become $Xw + be - y \\le \\hat{\\epsilon} e$, $Xw + be - y \\ge -\\hat{\\epsilon} e$. Hence the plane $y = w'x + b$ is in the middle of the $\\hat{\\epsilon}$-tube, with $\\hat{\\epsilon} < \\epsilon$.\n\n3.2 The soft $\\epsilon$-tube case\n\nFigure 3: Soft $\\hat{\\epsilon}$-tube found by RC-SVR: left: dual, right: primal space. [Axes x and y; the tube width $2\\hat{\\epsilon}$ is marked.]\n\nFor $\\epsilon < \\epsilon_0$, a hard $\\epsilon$-tube does not exist. Making $\\epsilon$ large to fit outliers may result in poor overall accuracy. In soft-margin classification, outliers were handled in the dual space by using reduced convex hulls. The same strategy works for soft $\\epsilon$-tubes; see Figure 3. Instead of taking the full convex hulls of $D^+$ and $D^-$, we reduce the convex hulls away from the difficult boundary cases. 
RC-SVR computes the closest points in the reduced convex hulls:\n\n$$\\min_{u, v} \\; \\frac{1}{2} \\left\\| \\begin{pmatrix} X' \\\\\\\\ (y + \\epsilon e)' \\end{pmatrix} u - \\begin{pmatrix} X' \\\\\\\\ (y - \\epsilon e)' \\end{pmatrix} v \\right\\|^2 \\quad \\text{s.t.} \\quad e'u = 1, \\; e'v = 1, \\; 0 \\le u \\le De, \\; 0 \\le v \\le De. \\qquad (6)$$\n\nParameter $D$ determines the robustness of the solution by reducing the convex hull. $D$ limits the influence of any single point. As in $\\nu$-SVM, we can parameterize $D$ by $\\nu$. Let $D = \\frac{1}{\\nu m}$ where $m$ is the number of points. Figure 3 illustrates the case for $m = 6$ points, $\\nu = 2/6$, and $D = 1/2$. In this example, every point in the reduced convex hull must depend on at least two data points since $\\sum_{i=1}^{m} u_i = 1$ and $0 \\le u_i \\le 1/2$. In general, every point in the reduced convex hull can be written as the convex combination of at least $\\lceil 1/D \\rceil = \\lceil \\nu m \\rceil$ data points. Since these points are exactly the support vectors and there are two reduced convex hulls, $2 \\lceil \\nu m \\rceil$ is a lower bound on the number of support vectors in RC-SVR. By choosing $\\nu$ sufficiently large, the inseparable case with $\\epsilon \\le \\epsilon_0$ is transformed into a separable case where once again our nearest-points-in-the-convex-hull problem is well defined.\n\nAs in classification, the dual reduced convex hull problem corresponds to computing a soft $\\epsilon$-tube in the primal space. 
Consider the following soft-tube version of the primal C-SVR, problem (7), whose Wolfe dual is RC-SVR (6):\n\n$$\\min_{w, \\delta, \\alpha, \\beta, \\xi, \\eta} \\; \\frac{1}{2} \\|w\\|^2 + \\frac{1}{2} \\delta^2 - (\\alpha - \\beta) + C (e'\\xi + e'\\eta) \\quad \\text{s.t.} \\quad Xw + \\delta (y + \\epsilon e) - \\alpha e + \\xi \\ge 0, \\; \\xi \\ge 0, \\quad Xw + \\delta (y - \\epsilon e) - \\beta e - \\eta \\le 0, \\; \\eta \\ge 0. \\qquad (7)$$\n\nThe results of Theorem 3.1 can be easily extended to soft $\\epsilon$-tubes.\n\nTheorem 3.2 (RC-SVR constructs soft $\\hat{\\epsilon}$-tube) Let the soft max-margin plane obtained by RC-SVR (6) be $\\hat{w}'x + \\hat{\\delta} y - \\hat{b} = 0$ where $\\hat{w} = X'\\hat{u} - X'\\hat{v}$, $\\hat{\\delta} = (y + \\epsilon e)'\\hat{u} - (y - \\epsilon e)'\\hat{v}$, and $\\hat{b} = \\left( \\frac{X'\\hat{u} + X'\\hat{v}}{2} \\right)' \\hat{w} + \\left( \\frac{y'\\hat{u} + y'\\hat{v}}{2} \\right) \\hat{\\delta}$. If $0 < \\epsilon \\le \\epsilon_0$, then the plane $y = w'x + b$ corresponds to a soft $\\hat{\\epsilon}$-tube of training data $(X_i, y_i)$, $i = 1, 2, \\ldots, m$, with $\\hat{\\epsilon} = \\epsilon - \\frac{\\tilde{\\alpha} - \\tilde{\\beta}}{2\\hat{\\delta}} < \\epsilon$, i.e., an $\\hat{\\epsilon}$-tube of the reduced convex hull of the training data, where $w = -\\frac{\\hat{w}}{\\hat{\\delta}}$, $b = \\frac{\\hat{b}}{\\hat{\\delta}}$, and $\\tilde{\\alpha} = \\hat{w}'X'\\hat{u} + \\hat{\\delta} (y + \\epsilon e)'\\hat{u}$, $\\tilde{\\beta} = \\hat{w}'X'\\hat{v} + \\hat{\\delta} (y - \\epsilon e)'\\hat{v}$.\n\nNotice that $\\tilde{\\alpha}$ and $\\tilde{\\beta}$ determine the planes parallel to the regression plane and through the closest points in each reduced convex hull of shifted data. In the inseparable case, these planes are parallel but not necessarily identical to the planes obtained by the primal RC-SVR (7).\n\nNonlinear C-SVR and RC-SVR can be achieved by using the usual kernel trick. Let $\\Phi$ be a nonlinear mapping of $x$ such that $k(X_i, X_j) = \\Phi(X_i) \\cdot \\Phi(X_j)$. 
The objective function of C-SVR (4) and RC-SVR (6) applied to the mapped data becomes\n\n$$\\frac{1}{2} \\sum_{i=1}^{m} \\sum_{j=1}^{m} (u_i - v_i)(u_j - v_j) \\left( \\Phi(X_i) \\cdot \\Phi(X_j) + y_i y_j \\right) + 2\\epsilon \\sum_{i=1}^{m} y_i (u_i - v_i)$$\n$$= \\frac{1}{2} \\sum_{i=1}^{m} \\sum_{j=1}^{m} (u_i - v_i)(u_j - v_j) \\left( k(X_i, X_j) + y_i y_j \\right) + 2\\epsilon \\sum_{i=1}^{m} y_i (u_i - v_i) \\qquad (8)$$\n\nwhere the additive constant $2\\epsilon^2$, which does not affect the minimizer, has been omitted.\n\nThe final regression model after optimizing C-SVR or RC-SVR with kernels takes the form $f(x) = \\sum_{i=1}^{m} (\\bar{v}_i - \\bar{u}_i) k(X_i, x) + \\bar{b}$, which follows from $w = -\\hat{w}/\\hat{\\delta}$ with $\\hat{w} = X'\\hat{u} - X'\\hat{v}$, where $\\bar{u}_i = \\frac{\\hat{u}_i}{\\hat{\\delta}}$, $\\bar{v}_i = \\frac{\\hat{v}_i}{\\hat{\\delta}}$, $\\hat{\\delta} = (\\hat{u} - \\hat{v})'y + 2\\epsilon$, and the intercept term is $\\bar{b} = \\frac{(\\hat{u} + \\hat{v})'K(\\hat{u} - \\hat{v})}{2\\hat{\\delta}} + \\frac{(\\hat{u} + \\hat{v})'y}{2}$, where $K_{ij} = k(X_i, X_j)$.\n\n4 Computational Results\n\nWe illustrate the difference between RC-SVR and $\\epsilon$-SVR on a toy linear problem (see footnote 3). Figure 4 depicts the functions constructed by RC-SVR and $\\epsilon$-SVR for different values of $\\epsilon$. For large $\\epsilon$, $\\epsilon$-SVR produces undesirable results. RC-SVR constructs the same function for $\\epsilon$ sufficiently large. Too small an $\\epsilon$ can result in poor generalization.\n\nFigure 4: Regression lines from (a) $\\epsilon$-SVR and (b) RC-SVR with distinct $\\epsilon$. [Panel (a) shows separate lines labelled $\\epsilon$ = 0.75, 0.45, 0.25, and 0.15; panel (b) shows a single line labelled $\\epsilon$ = 0.75, 0.45, 0.25.]\n\nIn Table 1, we compare RC-SVR, $\\epsilon$-SVR and $\\nu$-SVR on the Boston Housing problem. Following the experimental design in [5] we used an RBF kernel with $2\\sigma^2 = 3.9$, $C = 500 \\cdot m$ for $\\epsilon$-SVR and $\\nu$-SVR, and $\\epsilon = 3.0$ for RC-SVR. RC-SVR, $\\epsilon$-SVR, and $\\nu$-SVR are computationally similar for good parameter choices. In $\\epsilon$-SVR, $\\epsilon$ is fixed. 
In RC-SVR, $\\epsilon$ is the maximum allowable tube width. Choosing $\\epsilon$ is critical for $\\epsilon$-SVR but less so for RC-SVR. Both RC-SVR and $\\nu$-SVR can shrink or grow the tube according to desired robustness. But $\\nu$-SVR has no upper bound on $\\epsilon$.\n\nFootnote 3: The data consist of $(x, y)$: (0, 0), (1, 0.1), (2, 0.7), (2.5, 0.9), (3, 1.1) and (5, 2). The CPLEX 6.6 optimization package was used.\n\nTable 1: Testing results for Boston Housing. MSE: average of mean squared errors of 25 testing points over 100 trials; STD: standard deviation.\n\nRC-SVR         | $2\\nu$      | 0.1  | 0.2  | 0.3  | 0.4  | 0.5  | 0.6  | 0.7  | 0.8\nRC-SVR         | MSE        | 37.3 | 11.2 | 10.7 | 9.6  | 8.9  | 10.6 | 11.5 | 12.5\nRC-SVR         | STD        | 72.3 | 7.6  | 7.3  | 7.4  | 8.4  | 9.1  | 9.3  | 9.8\n$\\epsilon$-SVR | $\\epsilon$ | 0    | 1    | 2    | 3    | 4    | 5    | 6    | 7\n$\\epsilon$-SVR | MSE        | 11.2 | 10.8 | 9.5  | 10.3 | 11.6 | 13.6 | 15.6 | 17.2\n$\\epsilon$-SVR | STD        | 8.3  | 8.2  | 8.2  | 7.3  | 5.8  | 5.8  | 5.9  | 5.8\n$\\nu$-SVR      | $\\nu$      | 0.1  | 0.2  | 0.3  | 0.4  | 0.5  | 0.6  | 0.7  | 0.8\n$\\nu$-SVR      | MSE        | 9.6  | 8.9  | 9.5  | 10.8 | 10.9 | 11.0 | 11.2 | 11.1\n$\\nu$-SVR      | STD        | 5.8  | 7.9  | 8.3  | 8.2  | 8.3  | 8.4  | 8.5  | 8.4\n\n5 Conclusion and Discussion\n\nBy examining when $\\epsilon$-tubes exist, we showed that in the dual space SVR can be regarded as a classification problem. Hard and soft $\\epsilon$-tubes are constructed by separating the convex or reduced convex hulls, respectively, of the training data with the response variable shifted up and down by $\\epsilon$. We proposed RC-SVR based on choosing the soft max-margin plane between the two shifted datasets. Like $\\nu$-SVR, RC-SVR shrinks the $\\epsilon$-tube. The max-margin determines how much the tube can shrink. Domain knowledge can be incorporated into the RC-SVR parameters $\\epsilon$ and $\\nu$. The parameter $C$ in $\\nu$-SVR and $\\epsilon$-SVR has been eliminated. Computationally, no one method is superior for good parameter choices. RC-SVR alone has a geometrically intuitive framework that allows users to easily grasp the model and its parameters. 
Also, RC-SVR can be solved by fast nearest point algorithms. Considering regression as a classification problem suggests other interesting SVR formulations. We can show $\\epsilon$-SVR is equivalent to finding closest points in a reduced convex hull problem for certain $C$, but the equivalent problem utilizes a different metric in the objective function than RC-SVR. Perhaps other variations would yield even better formulations.\n\nAcknowledgments\n\nThanks to referees and Bernhard Sch\u00f6lkopf for suggestions to improve this work. This work was supported by NSF IRI-9702306, NSF IIS-9979860.\n\nReferences\n\n[1] K. Bennett and E. Bredensteiner. Duality and Geometry in SVM Classifiers. In P. Langley, ed., Proc. of Seventeenth Intl. Conf. on Machine Learning, pp. 57-64, Morgan Kaufmann, San Francisco, 2000.\n\n[2] D. Crisp and C. Burges. A Geometric Interpretation of $\\nu$-SVM Classifiers. In S. Solla, T. Leen, and K. Muller, eds., Advances in Neural Info. Proc. Sys., Vol. 12, pp. 244-251, MIT Press, Cambridge, MA, 1999.\n\n[3] S.S. Keerthi, S.K. Shevade, C. Bhattacharyya and K.R.K. Murthy. A Fast Iterative Nearest Point Algorithm for Support Vector Machine Classifier Design. IEEE Transactions on Neural Networks, Vol. 11, pp. 124-136, 2000.\n\n[4] O. Mangasarian. Nonlinear Programming. SIAM, Philadelphia, 1994.\n\n[5] B. Sch\u00f6lkopf, P. Bartlett, A. Smola and R. Williamson. Shrinking the Tube: A New Support Vector Regression Algorithm. In M. Kearns, S. Solla, and D. Cohn, eds., Advances in Neural Info. Proc. Sys., Vol. 12, MIT Press, Cambridge, MA, 1999.\n\n[6] V. Vapnik. The Nature of Statistical Learning Theory. Wiley, New York, 1995.\n", "award": [], "sourceid": 2132, "authors": [{"given_name": "J.", "family_name": "Bi", "institution": null}, {"given_name": "Kristin", "family_name": "Bennett", "institution": null}]}