{"title": "q-OCSVM: A q-Quantile Estimator for High-Dimensional Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 503, "page_last": 511, "abstract": "In this paper we introduce a novel method that can efficiently estimate a family of hierarchical dense sets in high-dimensional distributions. Our method can be regarded as a natural extension of the one-class SVM (OCSVM) algorithm that finds multiple parallel separating hyperplanes in a reproducing kernel Hilbert space. We call our method q-OCSVM, as it can be used to estimate $q$ quantiles of a high-dimensional distribution. For this purpose, we introduce a new global convex optimization program that finds all estimated sets at once and show that it can be solved efficiently. We prove the correctness of our method and present empirical results that demonstrate its superiority over existing methods.", "full_text": "q-OCSVM: A q-Quantile Estimator for High-Dimensional Distributions

Assaf Glazer    Michael Lindenbaum    Shaul Markovitch
Department of Computer Science, Technion - Israel Institute of Technology
{assafgr,mic,shaulm}@cs.technion.ac.il

Abstract

In this paper we introduce a novel method that can efficiently estimate a family of hierarchical dense sets in high-dimensional distributions. Our method can be regarded as a natural extension of the one-class SVM (OCSVM) algorithm that finds multiple parallel separating hyperplanes in a reproducing kernel Hilbert space. We call our method q-OCSVM, as it can be used to estimate q quantiles of a high-dimensional distribution. For this purpose, we introduce a new global convex optimization program that finds all estimated sets at once and show that it can be solved efficiently.
We prove the correctness of our method and present empirical results that demonstrate its superiority over existing methods.

1 Introduction

One-class SVM (OCSVM) [14] is a kernel-based learning algorithm that is often considered the method of choice for set estimation in high-dimensional data, due to its generalization power, efficiency, and nonparametric nature. Let X be a training set of examples sampled i.i.d. from a continuous distribution F with Lebesgue density f in R^d. The OCSVM algorithm takes X and a parameter 0 < ν < 1, and returns a subset of the input space with a small volume while bounding the fraction of examples in X outside the subset by ν. Asymptotically, the probability mass of the returned subset converges to α = 1 − ν. Furthermore, when a Gaussian kernel with a bandwidth tending to zero is used, the solution also converges to the minimum-volume set (MV-set) at level α [19], which is a subset of the input space with the smallest volume and probability mass of at least α.
In light of the above properties, the popularity of the OCSVM algorithm is not surprising. It appears, however, that in some applications we are not actually interested in estimating a single MV-set but in estimating multiple hierarchical MV-sets, which reveal more information about the distribution. For instance, in cluster analysis [5], we are interested in learning hierarchical MV-sets to construct a cluster tree of the distribution. In outlier detection [6], hierarchical MV-sets can be used to classify examples as outliers at different levels of significance. In statistical tests, hierarchical MV-sets are used for generalizing univariate tests to high-dimensional data [12, 4]. We are thus interested in a method that generalizes the OCSVM algorithm for approximating hierarchical MV-sets.
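The ν-property described above is easy to check empirically. The following is a minimal sketch, assuming scikit-learn (whose OneClassSVM implements the ν-parameterized program of [14]) and synthetic Gaussian data; ν should upper-bound the fraction of training outliers and lower-bound the fraction of support vectors:

```python
# Empirical check of the nu-property of a single OCSVM.
# scikit-learn and the synthetic data are our illustrative choices.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = rng.randn(200, 5)                  # i.i.d. sample standing in for F
nu = 0.2

clf = OneClassSVM(kernel="rbf", nu=nu, gamma=1.0 / X.shape[1]).fit(X)
pred = clf.predict(X)                  # +1 inside the estimated dense set, -1 outside

outlier_frac = float(np.mean(pred == -1))
sv_frac = len(clf.support_) / len(X)
print(f"outliers: {outlier_frac:.3f}  support vectors: {sv_frac:.3f}  nu: {nu}")
```

On typical draws the outlier fraction sits just below ν and the support-vector fraction just above it, mirroring the bound proven for q-OCSVM in Section 4.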
By doing so we would leverage the advantages of the OCSVM algorithm in high-dimensional data and take it a step forward by extending its solution to a broader range of applications.
Unfortunately, the straightforward approach of training a set of OCSVMs, one for each MV-set, does not necessarily satisfy the hierarchy requirement. Let q be the number of hierarchical MV-sets we would like to approximate. A naive approach would be to train q OCSVMs independently and enforce hierarchy by intersection operations on the resulting sets. However, we find two major drawbacks in this approach: (1) the ν-property of the OCSVM algorithm, which provides us with bounds on the number of examples in X lying outside or on the boundary of each set, is no longer guaranteed after the intersection operations; (2) MV-sets of a distribution, which are also level sets of the distribution's density f (under sufficient regularity conditions), are hierarchical by definition. Hence, by learning q OCSVMs independently, we ignore an important property of the correct solution, and thus are less likely to reach a generalized global solution.
In this paper we introduce a generalized version of the OCSVM algorithm for approximating hierarchical MV-sets in high-dimensional distributions. As in the naive approach, approximated MV-sets in our method are represented as dense sets captured by separating hyperplanes in a reproducing kernel Hilbert space. However, our method does not suffer from the two drawbacks mentioned above. To preserve the ν-property of the solution while fulfilling the hierarchy constraint, we require the resulting hyperplanes to be parallel to one another. To provide a generalized global solution, we introduce a new convex optimization program that finds all approximated MV-sets at once.
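The reason parallel hyperplanes automatically satisfy the hierarchy requirement can be seen directly: half-spaces that share a normal vector w and differ only in their bias are totally ordered by inclusion. A minimal numpy illustration with arbitrary made-up data and weights:

```python
import numpy as np

rng = np.random.RandomState(1)
X = rng.randn(500, 3)               # arbitrary sample points
w = np.array([0.5, -1.0, 2.0])      # a single shared normal vector
rhos = [1.5, 0.5, -0.5]             # decreasing biases -> growing half-spaces

sets = [(X @ w) >= rho for rho in rhos]
# C_1 subset of C_2 subset of C_3: membership is only gained as rho decreases
for small, big in zip(sets, sets[1:]):
    assert np.all(big[small])       # every point of the smaller set is in the bigger
print([int(s.sum()) for s in sets])
```

No such containment holds in general for half-spaces with different normals, which is exactly why the naive intersection-based construction is needed there and why it breaks the ν-property.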
Furthermore, we expect our method to have better generalization ability due to the parallelism constraint imposed on the hyperplanes, which also acts as a regularization term on the solution.

Figure 1: An approximation of 4 hierarchical MV-sets

We call our method q-OCSVM, as it can be used by statisticians to generalize q-quantiles to high-dimensional distributions. Figure 1 shows an example of 4-quantiles estimated for two-dimensional data. We show that our method can be solved efficiently, and provide theoretical results showing that it preserves both the density assumption for each approximated set, in the same sense suggested by [14], and the hierarchy of the estimated sets. In addition, we empirically compare our method to existing methods on a variety of real high-dimensional data and show its advantages in the examined domains.

2 Background

In one-dimensional settings, q-quantiles, which are points dividing a cumulative distribution function (CDF) into equal-sized subsets, are widely used to understand the distribution of values. These points are well defined as the inverse of the CDF, that is, the quantile function. It would be useful to have the same representation of q-quantiles in high-dimensional settings. However, it appears that generalizing quantile functions beyond one dimension is hard, since the number of ways to define them grows exponentially with the dimension [3]. Furthermore, while various quantile regression methods [7, 16, 9] can be used to estimate a single quantile of a high-dimensional distribution, extending them to estimate q-quantiles is mostly non-trivial.
Let us first examine the exponential complexity involved in estimating a generalized quantile function in high-dimensional data. Let 0 < α1 < α2 < · · · < αq < 1 be a sequence of equally-spaced q quantiles.
When d = 1, the quantile transforms F^{-1}(αj) are uniquely defined as the points xj ∈ R satisfying F(xj) = Pr(X ≤ xj) = αj, where X is a random variable drawn from F. Equivalently, F^{-1}(αj) can be identified with the unique hierarchical intervals (−∞, xj]. However, when d > 1, intervals are replaced by sets C1 ⊂ C2 ⊂ · · · ⊂ Cq that satisfy F(Cj) = αj but are not uniquely defined. Assume for a moment that these sets are defined only by imposing directions on d − 1 dimensions (the direction of the first dimension can be chosen arbitrarily). Hence, we are left with 2^{d−1} possible ways of defining a generalized quantile function for the data.
Hypothetically, any arbitrary hierarchical sets satisfying F(Cj) = αj can be used to define a valid generalized quantile function. Nevertheless, we would like the distribution to be dense in these sets so that the estimation will be informative enough. Motivated in this direction, Polonik [12] suggested using hierarchical MV-sets to generalize quantile functions. Let C(α) be the MV-set at level α with respect to F and the Lebesgue density f. Let Lf(c) = {x : f(x) ≥ c} be the level set of f at level c. Polonik observed that, under sufficient regularity conditions on f, Lf(c) is an MV-set of F at level α = F(Lf(c)). He thus suggested that level sets can be used as approximations of the MV-sets of a distribution. Since level sets are hierarchical by nature, a density estimator over X would be sufficient to construct a generalized quantile function.
Polonik's work was largely theoretical. In high-dimensional data, not only is density estimation hard, but extracting level sets of the estimated density is not always feasible either.
Furthermore, in high-dimensional settings, even attempting to estimate q hierarchical MV-sets of a distribution might be too optimistic an objective, due to the exponential growth of the search space, which may lead to overfitted estimates, especially when the sample is relatively small. Consequently, various methods were proposed for estimating q-quantiles in multivariate settings without an intermediate density estimation step [3, 21, 2, 20]. However, these methods are usually efficient only up to a few dimensions. For a detailed discussion of generalized quantile functions, see Serfling [15].
One prominent method that uses a variant of the OCSVM algorithm for approximating level sets of a distribution was proposed by Lee and Scott [8]. Their method, called nested OCSVM (NOC-SVM), is based on a new quadratic program that simultaneously finds a global solution of multiple nested half-space decision functions. An efficient decomposition method is introduced to solve this program for large-scale problems. The program uses the C-SVM formulation of the OCSVM algorithm [18], where ν is replaced by a different parameter, C ≥ 0, and incorporates nesting constraints into the dual quadratic program of each approximated function. Due to these differences in formulation, our method converges to predefined q-quantiles of a distribution, while theirs converges to approximated sets with unpredictable probability masses. The probability masses in their solution are even harder to track because the constraints imposed by the NOC-SVM program on the dual variables change the geometric interpretation of the primal variables in a non-intuitive way. An improved quantile regression variant of the OCSVM algorithm that uses "non-crossing" constraints to estimate non-crossing quantiles of a distribution was proposed by Takeuchi et al. [17].
However, similar to the NOC-SVM method, after enforcing these constraints, the ν-property of the solution is no longer guaranteed.
Recently, a greedy hierarchical MV-set estimator (HMVE) that uses OCSVMs as a basic component was introduced by Glazer et al. [4]. This method approximates the MV-sets iteratively by training a sequence of OCSVMs, from the largest set to the smallest. The superiority of HMVE was shown over a density-based estimation method and over a different hierarchical MV-set estimator, also introduced in that paper, that is based on the one-class neighbor machine (OCNM) algorithm [11]. However, as we shall see in the experiments, approximations in this greedy approach tend to become less accurate as the required number of MV-sets increases, especially for approximated MV-sets with small α in the last iterations.
In contrast to the naive approach of training q OCSVMs independently¹, our q-OCSVM estimator preserves the ν-property of the solution and converges to a generalized global solution. In contrast to the NOC-SVM algorithm, q-OCSVM converges to predefined q-quantiles of a distribution. In contrast to the HMVE estimator, q-OCSVM provides global and stable solutions. As will be seen, we support these advantages through theoretical and empirical analysis.

3 The q-OCSVM Estimator

In the following we introduce our q-OCSVM method, which generalizes the OCSVM algorithm so that its advantages can be applied to a broader range of applications. Here q stands for the number of MV-sets we would like our method to approximate.
Let X = {x1, . . . , xn} be a set of feature vectors sampled i.i.d. with respect to F. Consider a function Φ : R^d → F mapping the feature vectors in X to a hypersphere in an infinite-dimensional Hilbert space F.
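For the Gaussian kernel used below, this hypersphere picture can be verified numerically: k(x, x) = 1 places every mapped point on the unit sphere, and k > 0 everywhere makes the mapped sample separable from the origin (Definition 1 below). A small sketch with synthetic data:

```python
import numpy as np

def gaussian_gram(X, gamma):
    # k(x_i, x_s) = exp(-gamma * ||x_i - x_s||^2)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

rng = np.random.RandomState(2)
X = rng.randn(50, 10)
K = gaussian_gram(X, gamma=1.0 / X.shape[1])

# every Phi(x_i) lies on the unit hypersphere: k(x, x) = 1
assert np.allclose(np.diag(K), 1.0)
# k > 0 everywhere, so w = sum_i Phi(x_i) gives
# (Phi(x_i) . w) = sum_s K[i, s] > 0, i.e. the sample is separable
assert np.all(K > 0)
print(K.sum(axis=1).min())
```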
Let H be a hypothesis space of half-space decision functions fC(x) = sgn((w · Φ(x)) − ρ) such that fC(x) = +1 if x ∈ C, and −1 otherwise. The OCSVM algorithm returns a function fC ∈ H that maximizes the margin between the half-space decision boundary and the origin in F, while bounding the fraction of examples in X satisfying fC(x) = −1. This bound is predefined by a parameter 0 < ν < 1, and it is also called the ν-property of the OCSVM algorithm. The function is specified by the solution of this quadratic program:

$$\min_{w \in \mathcal{F},\, \xi \in \mathbb{R}^n,\, \rho \in \mathbb{R}} \; \frac{1}{2}\|w\|^2 - \rho + \frac{1}{\nu n}\sum_i \xi_i, \quad \text{s.t.} \;\; (w \cdot \Phi(x_i)) \ge \rho - \xi_i,\; \xi_i \ge 0, \qquad (1)$$

where ξ is the vector of slack variables. All training examples xi for which (w · Φ(xi)) − ρ ≤ 0 are called support vectors (SVs). Outliers are examples that strictly satisfy (w · Φ(xi)) − ρ < 0. By solving the program with ν = 1 − α, we can use the OCSVM to approximate C(α).
Let 0 < α1 < α2 < · · · < αq < 1 be a sequence of q quantiles. Our goal is to generalize the OCSVM algorithm to approximate a set of MV-sets {C1, . . . , Cq} such that the hierarchy constraint Ci ⊆ Cj is satisfied for i < j. Given X, our q-OCSVM algorithm solves this primal program:

$$\min_{w,\, \xi_j,\, \rho_j} \; \frac{q}{2}\|w\|^2 - \sum_{j=1}^{q}\rho_j + \sum_{j=1}^{q}\frac{1}{\nu_j n}\sum_i \xi_{j,i}, \quad \text{s.t.} \;\; (w \cdot \Phi(x_i)) \ge \rho_j - \xi_{j,i},\; \xi_{j,i} \ge 0,\; j \in [q],\; i \in [n], \qquad (2)$$

where νj = 1 − αj.

¹In the following we call this method I-OCSVM (independent one-class SVMs).
This program generalizes Equation (1) to the case of finding multiple parallel half-space decision functions by searching for a global minimum over the sum of their objective functions: the coupling between the q half-spaces is achieved by summing q OCSVM programs while forcing them to share the same w. As a result, the q half-spaces in the solution of Equation (2) differ only in their bias terms, and are thus parallel to each other. The program is convex, and a global minimum can therefore be found in polynomial time.
It is important to note that even with an ideal, unbounded number of examples, this program does not necessarily converge to the exact MV-sets but to approximated MV-sets of the distribution. As we shall see in Section 4, all decision functions returned by this program preserve the ν-property. We argue that the stability of these approximated MV-sets benefits from the parallelism constraint imposed on the half-spaces in H, which acts as a regularizer.
In the following we show that our program can be solved efficiently in its dual form. Using multipliers ηj,i ≥ 0, βj,i ≥ 0, the Lagrangian of this program is

$$L(w, \xi, \rho_1, \dots, \rho_q, \eta, \beta) = \frac{q}{2}\|w\|^2 - \sum_{j=1}^{q}\rho_j + \sum_{j=1}^{q}\frac{1}{\nu_j n}\sum_i \xi_{j,i} - \sum_{j=1}^{q}\sum_i \eta_{j,i}\big((\Phi(x_i) \cdot w) - \rho_j + \xi_{j,i}\big) - \sum_{j=1}^{q}\sum_i \beta_{j,i}\,\xi_{j,i}. \qquad (3)$$

Setting the derivatives with respect to the primal variables w, ρj, ξj to zero yields

$$w = \frac{1}{q}\sum_{j,i} \eta_{j,i}\,\Phi(x_i), \qquad \sum_i \eta_{j,i} = 1, \qquad 0 \le \eta_{j,i} \le \frac{1}{n\nu_j}, \quad i \in [n],\; j \in [q]. \qquad (4)$$

Substituting Equation (4) into Equation (3), and replacing the dot product (Φ(xi) · Φ(xs))_F with a kernel function k(xi, xs) (a Gaussian kernel k(xi, xs) = e^{−γ‖xi−xs‖²} is used in the following), we obtain the dual program

$$\min_{\eta} \; \frac{1}{2q}\sum_{j,p \in [q]}\sum_{i,s \in [n]} \eta_{j,i}\,\eta_{p,s}\,k(x_i, x_s), \quad \text{s.t.} \;\; \sum_i \eta_{j,i} = 1,\; 0 \le \eta_{j,i} \le \frac{1}{n\nu_j},\; i \in [n],\; j \in [q]. \qquad (5)$$

Similar to the dual objective function of the original OCSVM algorithm, our dual program depends only on the η multipliers, and hence can be solved more efficiently than the primal one. The resulting decision function for the j'th estimate is

$$f_{C_j}(x) = \operatorname{sgn}\Big(\frac{1}{q}\sum_i \eta^*_i\,k(x_i, x) - \rho_j\Big), \qquad (6)$$

where η*_i = Σ_{j=1}^{q} η_{j,i}. This efficient formulation of the decision function, which derives from the fact that the parallel half-spaces share the same w, allows us to compute the outputs of all q decision functions simultaneously.
As in the OCSVM algorithm, the ρj are recovered by identifying points Φ(xi) lying strictly on the j'th decision boundary.
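Putting the pieces together, the dual (5), the decision functions (6), and the bias recovery can be exercised on toy data. The sketch below uses scipy's general-purpose SLSQP solver, which is our illustrative choice rather than the authors' implementation, and substitutes an equivalent order-statistic rule for the marginal-SV recovery of Equation (7) for numerical robustness:

```python
# Toy end-to-end sketch of the q-OCSVM dual (5) on synthetic data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.RandomState(3)
n, q = 40, 3
X = rng.randn(n, 2)
alphas = np.array([0.25, 0.50, 0.75])
nus = 1.0 - alphas                       # nu_j = 1 - alpha_j

sq = np.sum(X ** 2, axis=1)              # Gaussian Gram matrix, gamma = 0.5
K = np.exp(-0.5 * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

def objective(eta_flat):
    # (1/2q) * sum_{j,p} sum_{i,s} eta_{j,i} eta_{p,s} k(x_i, x_s)
    eta_star = eta_flat.reshape(q, n).sum(axis=0)
    return eta_star @ K @ eta_star / (2.0 * q)

def grad(eta_flat):
    g = K @ eta_flat.reshape(q, n).sum(axis=0) / q
    return np.tile(g, q)                 # identical gradient for every block j

cons = [{"type": "eq", "fun": lambda e, j=j: e.reshape(q, n)[j].sum() - 1.0}
        for j in range(q)]               # sum_i eta_{j,i} = 1 for each j
bounds = [(0.0, 1.0 / (n * nu)) for nu in nus for _ in range(n)]
eta0 = np.full(q * n, 1.0 / n)           # feasible starting point

res = minimize(objective, eta0, jac=grad, method="SLSQP",
               bounds=bounds, constraints=cons,
               options={"maxiter": 1000, "ftol": 1e-10})
eta = res.x.reshape(q, n)
eta_star = eta.sum(axis=0)

scores = K @ eta_star / q                # (w . Phi(x_i)) for every training point
# Bias recovery: Equation (7) reads rho_j off a marginal SV; here we use the
# equivalent primal rule that, for fixed w, the optimal rho_j is the
# ceil(nu_j * n)-th smallest training score (a robust numerical stand-in).
order = np.sort(scores)
rhos = np.array([order[int(np.ceil(nu * n)) - 1] for nu in nus])

for j in range(q):
    outside = np.mean(scores < rhos[j])  # fraction with f_{C_j}(x_i) = -1
    print(f"alpha={alphas[j]:.2f}  rho_j={rhos[j]:.4f}  outside={outside:.3f}")
```

With α increasing the recovered biases decrease, so the estimated sets are nested, and the fraction of training points strictly outside each set stays below νj, matching the ν-property proven in Section 4.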
These points are identified by the condition 0 < ηj,i < 1/(nνj). Therefore, ρj can be recovered from a point sv satisfying this condition by

$$\rho_j = (w \cdot \Phi(sv)) = \frac{1}{q}\sum_i \eta^*_i\,k(x_i, sv). \qquad (7)$$

Figure 1 shows the resulting estimates of our q-OCSVM method for 4 hierarchical MV-sets with α = 0.2, 0.4, 0.6, 0.8.³ 100 training examples drawn i.i.d. from a bimodal distribution are marked with black dots. It can be seen that the number of bounded SVs (outliers) at each level is no higher than 100(1 − αj), as expected according to the properties of our q-OCSVM estimator, which are proven in the following section.

4 Properties of the q-OCSVM Estimator

In this section we provide theoretical results for the q-OCSVM estimator. The program we solve is different from the one in Equation (1). Hence, we cannot rely on the properties of OCSVM to prove the properties of our method. We instead provide similar proofs, in the spirit of Schölkopf et al. [14] and Glazer et al. [4], with some additional required extensions.
Definition 1. A set X = {x1, . . . , xn} is separable if there exists some w such that (Φ(xi) · w) > 0 for all i ∈ {1, . . . , n}.
Note that if a Gaussian kernel is used (which implies k(xi, xs) > 0), as in our case, then X is separable.
Theorem 1. If X is separable, then a feasible solution exists for Equation (2) with ρj > 0 for all j ∈ {1, . . . , q}.
Proof. Define M as the convex hull of Φ(x1), · · · , Φ(xn). Note that since X is separable, M does not contain the origin. Then, by the supporting hyperplane theorem [10], there exists a hyperplane (w, ρ) that contains M on one side of it and does not contain the origin. Hence, (0⃗ · w) − ρ < 0, which leads to ρ > 0. Note that the solution ρj = ρ for all j ∈ [q] is a feasible solution for Equation (2).
The following theorem shows that the regions specified by the decision functions fC1, . . . , fCq are (a) approximations of the MV-sets in the same sense suggested by Schölkopf et al., and (b) hierarchically nested.
Theorem 2. Let fC1, . . . , fCq be the decision functions returned by the q-OCSVM estimator with parameters {α1, . . . , αq}, X, k(·, ·). Assume X is separable. Let SVoj be the set of SVs lying strictly outside Cj, and SVbj be the set of SVs lying exactly on the boundary of Cj. Then the following statements hold: (1) Cj ⊆ Ck for αj < αk. (2) |SVoj| / |X| ≤ 1 − αj ≤ (|SVbj| + |SVoj|) / |X|. (3) Suppose X is drawn i.i.d. from a distribution F that does not contain discrete components, and k(·, ·) is analytic and non-constant. Then |SVoj| / |X| is asymptotically equal to 1 − αj.
Proof. Cj and Ck are associated with two parallel half-spaces in H with the same w. Therefore, statement (1) can be proven by showing that ρj ≥ ρk. αj < αk leads to ρj ≥ ρk, since otherwise the optimality of Equation (2) would be contradicted. For statement (2), assume by negation that νj = 1 − αj > (|SVbj| + |SVoj|) / |X| for some j ∈ [q] in the optimal solution of Equation (2). Note that when parallel-shifting the optimal hyperplane by slightly increasing ρj, the term Σi ξj,i in the objective changes proportionally to |SVbj| + |SVoj|.
However, since (|SVbj| + |SVoj|) / (|X| νj) < 1, a slight increase in ρj will result in a decrease in the objective function, which contradicts the optimality of the hyperplane. The same goes for the other direction: assume by negation that |SVoj| / |X| > 1 − αj for some j ∈ [q] in the optimal solution of Equation (2). Then a slight decrease in ρj will result in a decrease in the objective function, which again contradicts the optimality of the hyperplane. We are now left to prove statement (3): the covering number of the class of fCj functions (which are induced by k) is well-behaved. Hence, asymptotically, the probability of points lying exactly on the hyperplanes converges to zero (cf. [13]).

³Detailed setup parameters are discussed in Section 5.

5 Empirical Results

We extensively evaluated the effectiveness of our q-OCSVM method on a variety of real high-dimensional data from the UCI repository and the 20-Newsgroup document corpus, and compared its performance to competing methods.

5.1 Experiments on the UCI Repository

We first evaluated our method on datasets taken from the UCI repository.⁴ From each examined dataset, a random set of 100 examples from the most frequent label was used as the training set X. The remaining examples from the same label were used as the test set. We used all UCI datasets with more than 50 test examples — a total of 61 datasets. The average number of features for a dataset is 113.⁵
We compared the performance of our q-OCSVM method to three alternative methods that generalize the OCSVM algorithm: HMVE (hierarchical minimum-volume estimator) [4], I-OCSVM (independent one-class SVMs), and NOC-SVM (nested one-class SVM) [8]. For the NOC-SVM method, we used the implementation provided by the authors.⁶ The LibSVM package [1] was used to implement the HMVE and I-OCSVM methods. An implementation of our q-OCSVM estimator is available from: http://www.cs.technion.ac.il/˜assafgr/articles/q-ocsvm.html. All experiments were carried out with a Gaussian kernel (γ = 1/(2σ²) = 2.5/#features).
For each dataset, we trained the reference methods to approximate hierarchical MV-sets at levels α1 = 0.05, α2 = 0.1, . . . , α19 = 0.95 (19-quantiles).⁷ Then we evaluated the estimated q-quantiles on the test set. Since the correct MV-sets are not known for the data, the quality of the approximated MV-sets was evaluated by the coverage ratio (CR): let α′ be the empirical proportion of test examples falling inside an approximated MV-set. The expected proportion of examples that lie within the MV-set C(α) is α. The coverage ratio is defined as α′/α. A perfect MV-set approximation method would yield a coverage ratio of 1.0 for all approximated MV-sets.⁸ An advantage of choosing this measure for evaluation is that it gives more weight to differences between α and α′ in small quantiles, which are associated with regions of high probability mass.
Results on test data for each approximated MV-set are shown in Figure 2. The left graph displays in bars the empirical proportion of test examples in the approximated MV-sets (α′) as a function of the expected proportion (α), averaged over all 61 datasets. The right graph displays the coverage ratio of test examples as a function of α, averaged over all 61 datasets. It can be seen that our q-OCSVM method dominates the others with the best average α′ and average coverage ratio behaviors. For each quantile separately, we tested the significance of the advantage of q-OCSVM over the competitors using the Wilcoxon statistical test on the absolute difference between the expected and empirical coverage ratios (|1.0 − CR|).
The superiority of our method over the three competitors was found significant, with P < 0.01, for each of the 19 quantiles separately.
The I-OCSVM method shows performance inferior to that of q-OCSVM. We ascribe this behavior to the fact that it trains q OCSVMs independently, and thus reaches a local solution. Furthermore, we believe that by ignoring the fundamental hierarchical structure of MV-sets, the I-OCSVM method is more likely than ours to reach an overfitted solution.
The HMVE method shows a decrease in performance from the largest to the smallest α. We assume this is due to the greedy nature of the method. HMVE approximates the MV-sets iteratively by training a sequence of OCSVMs, from the largest α to the smallest. OCSVMs trained later in the sequence are thus more constrained in their approximations by solutions from previous iterations, so that the error in the approximations accumulates over time. This is in contrast to q-OCSVM, which converges to a global minimum, and hence is more scalable than HMVE with respect to the number of approximated MV-sets (q). The NOC-SVM method performs poorly in comparison to the other methods. This is not surprising since, unlike with the other methods, we cannot set the parameters of NOC-SVM to converge to predefined q-quantiles.

⁴archive.ics.uci.edu/ml/datasets.html
⁵Nominal features were transformed into numeric ones using binary encoding; missing values were replaced by their features' average values.
⁶http://web.eecs.umich.edu/˜cscott
⁷The equivalent C(λ) parameters of the NOC-SVM were initialized as suggested by the authors.
⁸In outlier detection, this measure reflects the ratio between expected and empirical false alarm rates.

Figure 2: The q-OCSVM, HMVE, I-OCSVM, and NOC-SVM methods were trained to estimate 19-quantiles for the distribution of the most frequent label on the 61 UCI datasets.
Left: α′ as a function of α averaged over all datasets. Right: the coverage ratio as a function of α averaged over all datasets.
Interestingly, the solutions produced by the HMVE and I-OCSVM methods for the largest approximated MV-set (associated with α19 = 0.95) are equal to the solution of a single OCSVM trained with ν = 1 − α19 = 0.05. This equality derives from the definition of the HMVE and I-OCSVM methods. Therefore, in this setup, we claim that q-OCSVM also outperforms the OCSVM algorithm in the approximation of a single MV-set, and it does so with an average coverage ratio of 0.871 versus 0.821. We believe this improved performance is due to the parallelism constraint imposed by the q-OCSVM method on the hyperplanes, which acts as a regularization term on the solution. This observation is an interesting direction to address in our future studies.
In terms of training runtime, our q-OCSVM method has higher computational complexity than HMVE and I-OCSVM, because we solve a global optimization problem rather than a series of smaller localized subproblems. However, with regard to the runtime on test samples, our method is more efficient than HMVE and I-OCSVM by a factor of q, since the distances from the q half-spaces differ only in their bias terms (ρj).
With regard to the choice of the Gaussian kernel width, parameter tuning for one-class classifiers, in particular for OCSVMs, is an ongoing research area. Unlike in binary classification tasks, negative examples are not available to estimate the optimality of the solution. Consequently, we employed the common practice [1] of using a fixed width, divided by the number of features. However, in future studies it would be interesting to consider alternative optimization criteria that allow tuning parameters with cross-validation.
For instance, one could use the average coverage ratio over all quantiles as an optimality criterion.

5.2 Experiments on Text Data

We evaluated our method on an additional setup of high-dimensional text data. We used the 20-Newsgroup document corpus.⁹ The 500 words with the highest frequency count were picked to generate 500 bag-of-words features. We use the sorted-by-date version of the corpus, with 18846 documents associated with 20 news categories. From this series of documents, the first 100 documents from each category were used as the training set X. The subsequent documents from the same category were used as the test set. We trained the reference methods with X to estimate 19-quantiles of a distribution, and evaluated the estimated q-quantiles on the test set.
Results on test data for each approximated MV-set are shown in Figure 3 in the same manner as in Figure 2.¹⁰ Unlike the experiments on the UCI repository, the results in these experiments are not as close to the optimum, but they can still provide useful information about the distributions.

⁹The 20-Newsgroup corpus is at http://people.csail.mit.edu/jrennie/20Newsgroups.

Figure 3: The q-OCSVM, HMVE, and I-OCSVM methods were trained to estimate 19-quantiles for the distribution of the 20 categories in the 20-Newsgroup document corpus. Left: α′ as a function of α averaged over all 20 categories. Right: the coverage ratio as a function of α averaged over all 20 categories.
Again, our q-OCSVM method dominates the others, with the best average α′ and average coverage ratio behaviors. According to the Wilcoxon statistical test with P < 0.01, our method performs significantly better than the other competitors for each of the 19 quantiles separately.
It can be seen that the differences in coverage ratios between q-OCSVM and I-OCSVM in the largest quantile (associated with α19 = 0.95) are relatively high: the average coverage ratio is 0.555 for q-OCSVM versus 0.452 for I-OCSVM. Recall that the solution of I-OCSVM in the largest quantile is equal to the solution of a single OCSVM trained with ν = 0.05. These results are aligned with our conclusions from the UCI repository experiments: the parallelism constraint, which acts as a regularizer, may lead to improved performance even for the approximation of a single MV-set.

6 Summary

The q-OCSVM method introduced in this paper can be regarded as a generalized OCSVM, as it finds multiple parallel separating hyperplanes in a reproducing kernel Hilbert space. Theoretical properties of our method are analyzed, showing that it can be used to approximate a family of hierarchical MV-sets while preserving the guaranteed separation properties (ν-property), in the same sense suggested by Schölkopf et al.
Our q-OCSVM method is empirically evaluated on a variety of high-dimensional data from the UCI repository and the 20-Newsgroup document corpus, and its advantage is verified in this setup. We believe that our method will benefit practitioners whose goal is to model distributions by q-quantiles in complex settings where density estimation is hard to apply. An interesting direction for future research would be to evaluate our method on problems in specific domains that utilize q-quantiles for distribution representation.
These domains include cluster analysis, outlier detection, and statistical tests.

10Results for NOC-SVM were omitted from the graphs due to the method's limitation in q-quantile estimation, which resulted in inferior performance in this setup as well.

References
[1] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001.
[2] Yixin Chen, Xin Dang, Hanxiang Peng, and Henry L. Bart. Outlier detection with the kernelized spatial depth function. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2):288–305, 2009.
[3] G. Fasano and A. Franceschini. A multidimensional version of the Kolmogorov-Smirnov test. Monthly Notices of the Royal Astronomical Society, 225:155–170, 1987.
[4] A. Glazer, M. Lindenbaum, and S. Markovitch. Learning high-density regions for a generalized Kolmogorov-Smirnov test in high-dimensional data. In NIPS, pages 737–745, 2012.
[5] John A. Hartigan. Clustering Algorithms. John Wiley & Sons, Inc., 1975.
[6] V. Hodge and J. Austin. A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2):85–126, 2004.
[7] Roger Koenker. Quantile Regression. Cambridge University Press, 2005.
[8] Gyemin Lee and Clayton Scott. Nested support vector machines. IEEE Transactions on Signal Processing, 58(3):1648–1660, 2010.
[9] Youjuan Li, Yufeng Liu, and Ji Zhu. Quantile regression in reproducing kernel Hilbert spaces. Journal of the American Statistical Association, 102(477):255–268, 2007.
[10] D.G. Luenberger and Y. Ye. Linear and Nonlinear Programming. Springer, 3rd edition, 2008.
[11] A. Munoz and J.M. Moguerza.
Estimation of high-density regions using one-class neighbor machines. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(3):476–480, 2006.
[12] W. Polonik. Concentration and goodness-of-fit in higher dimensions: (asymptotically) distribution-free methods. The Annals of Statistics, 27(4):1210–1229, 1999.
[13] B. Schölkopf, A.J. Smola, R.C. Williamson, and P.L. Bartlett. New support vector algorithms. Neural Computation, 12(5):1207–1245, 2000.
[14] Bernhard Schölkopf, John C. Platt, John Shawe-Taylor, Alex J. Smola, and Robert C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.
[15] R. Serfling. Quantile functions for multivariate analysis: approaches and applications. Statistica Neerlandica, 56(2):214–232, 2002.
[16] Ingo Steinwart, Don R. Hush, and Clint Scovel. A classification framework for anomaly detection. JMLR, 6:211–232, 2005.
[17] Ichiro Takeuchi, Quoc V. Le, Timothy D. Sears, and Alexander J. Smola. Nonparametric quantile estimation. JMLR, 7:1231–1264, 2006.
[18] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 2nd edition, 1998.
[19] R. Vert and J.-P. Vert. Consistency and convergence rates of one-class SVMs and related algorithms. JMLR, 7:817–854, 2006.
[20] W. Zhang, X. Lin, M.A. Cheema, Y. Zhang, and W. Wang. Quantile-based KNN over multi-valued objects. In ICDE, pages 16–27. IEEE, 2010.
[21] Yijun Zuo and Robert Serfling. General notions of statistical depth function.
The Annals of Statistics, 28(2):461–482, 2000.