{"title": "One-Class LP Classifiers for Dissimilarity Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 777, "page_last": 784, "abstract": null, "full_text": "One-Class LP Classi\ufb01er for Dissimilarity\n\nRepresentations\n\nEl\u02d9zbieta P\u02dbekalska1, David M.J.Tax2 and Robert P.W. Duin1\n\n1Delft University of Technology, Lorentzweg 1, 2628 CJ Delft, The Netherlands\n\n2Fraunhofer Institute FIRST.IDA, Kekul\u00e9str.7, D-12489 Berlin, Germany\nela@ph.tn.tudelft.nl,davidt@first.fraunhofer.de\n\nAbstract\n\nProblems in which abnormal or novel situations should be detected can\nbe approached by describing the domain of the class of typical exam-\nples. These applications come from the areas of machine diagnostics,\nfault detection, illness identi\ufb01cation or, in principle, refer to any prob-\nlem where little knowledge is available outside the typical class. In this\npaper we explain why proximities are natural representations for domain\ndescriptors and we propose a simple one-class classi\ufb01er for dissimilarity\nrepresentations. By the use of linear programming an ef\ufb01cient one-class\ndescription can be found, based on a small number of prototype objects.\nThis classi\ufb01er can be made (1) more robust by transforming the dissimi-\nlarities and (2) cheaper to compute by using a reduced representation set.\nFinally, a comparison to a comparable one-class classi\ufb01er by Campbell\nand Bennett is given.\n\n1 Introduction\n\nThe problem of describing a class or a domain has recently gained a lot of attention, since it\ncan be identi\ufb01ed in many applications. The area of interest covers all the problems, where\nthe speci\ufb01ed targets have to be recognized and the anomalies or outlier instances have to\nbe detected. Those might be examples of any type of fault detection, abnormal behavior,\nrare illnesses, etc. 
One possible approach to class description problems is to construct one-class classifiers (OCCs) [13]. Such classifiers are concept descriptors, i.e. they refer to all possible knowledge that one has about the class.

An efficient OCC built in a feature space can be found by determining a minimal-volume hypersphere around the data [14, 13] or by determining a hyperplane that separates the data from the origin as well as possible [11, 12]. By the use of kernels [15] the data is implicitly mapped into a higher-dimensional inner product space and, as a result, an OCC in the original space can yield a nonlinear and non-spherical boundary; see e.g. [15, 11, 12, 14].

Those approaches are convenient for data already represented in a feature space. In some cases, however, good or suitable features are lacking because they are difficult to define, e.g. for strings, graphs or shapes. To avoid the definition of an explicit feature space, we have already proposed to address kernels as general proximity measures [10] and not only as symmetric, (conditionally) positive definite functions of two variables [2]. Such a proximity should directly arise from an application; see e.g. [8, 7]. Therefore, our reasoning starts not from a feature space, as in the other methods [15, 11, 12, 14], but from a given proximity representation. Here, we address general dissimilarities.

The basic assumption is that an instance belongs to a class if it is similar to examples within this class. The identification procedure is realized by a proximity function equipped with a threshold, which determines whether an instance is a class member or not. This proximity function can be e.g. a distance to an average representative or to a set of selected prototypes. Data represented by proximities is thus natural for building concept descriptors, i.e. 
OCCs, since the proximity function can be built on them directly.

In this paper, we propose a simple and efficient OCC for general dissimilarity representations, discussed in Section 2, found by the use of linear programming (LP). Section 3 presents our method together with a dissimilarity transformation that makes it more robust against objects with large dissimilarities. Section 4 describes the experiments conducted and discusses the results. Conclusions are summarized in Section 5.

2 Dissimilarity representations

Although a dissimilarity measure D provides a flexible way to represent the data, there are some constraints. Reflexivity and positivity conditions are essential to define a proper measure; see also [10]. For our convenience, we also adopt the symmetry requirement. We do not require that D is a strict metric, since non-metric dissimilarities may naturally be found when shapes or objects in images are compared, e.g. in computer vision [4, 7]. Let z and p_i refer to objects to be compared. A dissimilarity representation can now be seen as a dissimilarity kernel based on the representation set R = {p_1, ..., p_N} and realized by a mapping D(z, R) : F -> R^N, defined as D(z, R) = [D(z, p_1), ..., D(z, p_N)]^T. R controls the dimensionality of the dissimilarity space D(., R). Note also that F expresses a conceptual space of objects, not necessarily a feature space. Therefore, to emphasize that objects, like z or p_i, might not be feature vectors, they are not printed in bold.

The compactness hypothesis (CH) [5] is the basis for object recognition. It states that similar objects are close in their representations. For a dissimilarity measure D, this means that D(r, s) is small if the objects r and s are similar. If we demand that D(r, s) = 0 if and only if the objects r and s are identical, this implies that they belong to the same class. 
This can be extended by assuming that all objects s such that D(r, s) < ε, for a sufficiently small ε, are so similar to r that they are members of the same class. Consequently, D(r, t) ≈ D(s, t) for other objects t. Therefore, for dissimilarity representations satisfying the above continuity, the reverse of the CH holds: objects similar in their representations are similar in reality and thereby belong to the same class [6, 10].

Objects with large distances are assumed to be dissimilar. When the set R contains objects from the class of interest, then objects z with large D(z, R) are outliers and should be remote from the origin in this dissimilarity space. This characteristic will be used in our OCC. If the dissimilarity measure D is a metric, then all vectors D(z, R) lie in an open prism (unbounded from above(1)), bounded from below by a hyperplane on which the objects from R lie. In principle, z may be placed anywhere in the dissimilarity space D(., R) only if the triangle inequality is completely violated. This is, however, not possible from a practical point of view, because then both the CH and its reverse would not be fulfilled. Consequently, this would mean that D has lost its discriminating property of being small for similar objects. Therefore, the measure D, if not a metric, has to be only slightly non-metric (i.e. the triangle inequalities are only somewhat violated), and thereby D(z, R) will still lie either in the prism or in its close neighbourhood.

(1) The prism is bounded if D is bounded.

3 The linear programming dissimilarity data description

To describe a class in a non-negative dissimilarity space, one could minimize the volume of the prism, cut by a hyperplane P: w^T D(z, R) = ρ that bounds the data from above(2) (note that non-negative dissimilarities impose both ρ ≥ 0 and w_i ≥ 0). However, this might not be a feasible task. 
A natural extension is to minimize the volume of a simplex with the main vertex being the origin and the other vertices v_j resulting from the intersection of P and the axes of the dissimilarity space (v_j is a vector of all zero elements except for v_ji = ρ/w_i, given that w_i ≠ 0). Assume now that there are M non-zero weights of the hyperplane P, so effectively, P is constructed in R^M. From geometry we know that the volume V of such a simplex can be expressed as V = (V_Base / M!) · (ρ / ||w||_2), where V_Base is the volume of the base, defined by the vertices v_j. The minimization of h = ρ / ||w||_2, i.e. the Euclidean distance from the origin to P, is then related to the minimization of V.

Let {D(p_i, R)}, i = 1, ..., N, N = |R|, be a dissimilarity representation, bounded by a hyperplane P, i.e. w^T D(p_i, R) ≤ ρ for i = 1, ..., N, such that the L_q distance to the origin d_q(0, P) = ρ / ||w||_p is the smallest (i.e. q satisfies 1/p + 1/q = 1 for p ≥ 1) [9]. This means that P can be determined by minimizing ρ − ||w||_p. However, when we require ||w||_p = 1 (to avoid any arbitrary scaling of w), the construction of P can be solved by the minimization of ρ only. The mathematical programming formulation of such a problem is [9, 1]:

    min  ρ
    s.t. w^T D(p_i, R) ≤ ρ,   i = 1, 2, ..., N,                          (1)
         ||w||_p = 1,  ρ ≥ 0.

If p = 2, then P is found such that h is minimized, yielding a quadratic optimization problem. A much simpler LP formulation, realized for p = 1, is of our interest. Knowing that ||w||_2 ≤ ||w||_1 ≤ √M ||w||_2 and by the assumption ||w||_1 = 1, after simple calculations, we find that ρ ≤ h = ρ / ||w||_2 ≤ √M ρ. 
Therefore, by minimizing d_1(0, P) = ρ (with ||w||_1 = 1), h is bounded, and so is the volume of the considered simplex.

By the above reasoning and (1), a class represented by dissimilarities can be characterized by a linear proximity function with the weights w and the threshold ρ. Our one-class classifier C_LPDD, the Linear Programming Dissimilarity-data Description, is then defined as:

    C_LPDD(D(z, .)) = I( sum_{w_j ≠ 0} w_j D(z, p_j) ≤ ρ ),              (2)

where I is the indicator function. The proximity function is found as the solution to a soft-margin formulation (a straightforward extension of the hard-margin case) with ν ∈ (0, 1] being the upper bound on the outlier fraction for the target class:

    min  ρ + (1/(ν N)) sum_{i=1}^{N} ξ_i
    s.t. w^T D(p_i, R) ≤ ρ + ξ_i,   i = 1, 2, ..., N,                    (3)
         sum_j w_j = 1,  w_j ≥ 0,  ρ ≥ 0,  ξ_i ≥ 0.

In LP formulations, sparse solutions are obtained, meaning that only some w_j are positive. Objects corresponding to such non-zero weights will be called support objects (SOs).

The left plot of Fig. 1 is a 2D illustration of the LPDD. The data is represented in a metric dissimilarity space, and by the triangle inequality the data can only lie inside the prism indicated by the dashed lines. The LPDD boundary is given by the hyperplane, placed as close to the origin as possible (by minimizing ρ), while still accepting (most) target objects. By the discussion in Section 2, the outliers should be remote from the origin.

Proposition. In (3), ν ∈ (0, 1] is the upper bound on the outlier fraction for the target class, i.e. the fraction of objects that lie outside the boundary; see also [11, 12]. 
This means that (1/N) sum_{i=1}^{N} (1 − C_LPDD(D(p_i, .))) ≤ ν.

(2) P is not expected to be parallel to the prism's bottom hyperplane.

Figure 1: Illustrations of the LPDD in the dissimilarity space (left) and the LPSD in the similarity space (right). The dashed lines indicate the boundary of the area which contains the genuine objects. The LPDD tries to minimize the max-norm distance from the bounding hyperplane to the origin (min ρ under ||w||_1 = 1), while the LPSD tries to attract the hyperplane towards the average of the distribution (min (1/N) sum_k (w^T K(x_k, S) + ρ)).

The proof goes analogously to the proofs given in [11, 12]. Intuitively, it follows this line: assume we have found a solution of (3). If ρ is increased slightly, the term sum_i ξ_i in the objective function changes proportionally to the number of points that have non-zero ξ_i (i.e. the outlier objects). At the optimum of (3) it has to hold that N ν ≥ #outliers.

Scaling dissimilarities. If D is unbounded, then some atypical objects of the target class (i.e. with large dissimilarities) might badly influence the solution of (3). Therefore, we propose a nonlinear, monotonic transformation of the dissimilarities to the interval [0, 1] such that locally the dissimilarities are scaled linearly while, globally, all large dissimilarities become close to 1. A function with such properties is the sigmoid (the hyperbolic tangent can also be used), i.e. Sigm(x) = 2/(1 + e^{−x/s}) − 1, where s controls the 'slope' of the function, i.e. the size of the local neighborhoods. 
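As a sanity check of this rescaling, here is a short NumPy sketch applying Sigm element-wise to a dissimilarity matrix (the function name is our own choice, not from the paper):

```python
import numpy as np

def sigm_scale(D, s):
    """Element-wise sigmoidal rescaling Sigm(x) = 2 / (1 + exp(-x/s)) - 1.

    Maps non-negative dissimilarities into [0, 1): roughly linear
    (slope 1/(2s)) for D much smaller than s, saturating near 1 for
    large D, so atypical objects cannot dominate the LP solution.
    """
    D = np.asarray(D, dtype=float)
    return 2.0 / (1.0 + np.exp(-D / s)) - 1.0
```

Note that Sigm(x) equals tanh(x/(2s)), so the transform is strictly monotonic and preserves the ordering of dissimilarities while bounding them.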
Now, the transformation can be applied element-wise to the dissimilarity representation such that D_s(z, p_i) = Sigm(D(z, p_i)). Unless stated otherwise, the C_LPDD will be trained on D_s.

A linear programming OCC on similarities. Recently, Campbell and Bennett have proposed an LP formulation for novelty detection [3]. They start their reasoning from a feature space in the spirit of positive definite kernels K(S, S) based on the set S = {x_1, ..., x_N}. They restricted themselves to (modified) RBF kernels, i.e. K(x_i, x_j) = e^{−D(x_i, x_j)^2 / (2 s^2)}, where D is either the Euclidean or the L_1 (city block) distance. In general, we will refer to RBF_p as the 'Gaussian' kernel based on the L_p distance. Here, to be consistent with our LPDD method, we rewrite their soft-margin LP formulation (a hard-margin formulation is then obvious) to include a trade-off parameter ν (which lacks, however, the interpretation it has in the LPDD), as follows:

    min  (1/N) sum_{i=1}^{N} (w^T K(x_i, S) + ρ) + (1/(ν N)) sum_{i=1}^{N} ξ_i
    s.t. w^T K(x_i, S) + ρ ≥ −ξ_i,   i = 1, 2, ..., N,                   (4)
         sum_j w_j = 1,  w_j ≥ 0,  ξ_i ≥ 0.

Since K can be any similarity representation, for simplicity we will call this method the Linear Programming Similarity-data Description (LPSD). The C_LPSD is then defined as:

    C_LPSD(K(z, .)) = I( sum_{w_j ≠ 0} w_j K(z, x_j) + ρ ≥ 0 ).          (5)

In the right plot of Fig. 1, a 2D illustration of the LPSD is shown. Here, the data is represented in a similarity space, such that all objects lie in a hypercube between 0 and 1. Objects remote from the representation objects will be close to the origin. The hyperplane is optimized to have minimal average output for the whole target set. 
This does not necessarily mean a good separation from the origin or a small volume of the OCC, possibly resulting in an unnecessarily high outlier acceptance rate.

Figure 2: One-class hard-margin LP classifiers for an artificial 2D dataset (top row: LPDD on the Euclidean representation; bottom row: LPSD based on RBF_2). From left to right, s takes the values 0.3d, 0.4d, 0.5d, d, 3d, where d is the average distance. Support objects are marked by squares.

Extensions. Until now, the LPDD and LPSD were defined for square (dis)similarity matrices. If the computation of (dis)similarities is very costly, one can consider a reduced representation set R_red ⊂ R, consisting of n ≪ N objects. The dissimilarity or similarity representations are then given as rectangular matrices D(R, R_red) or K(S, S_red), respectively. Both formulations (3) and (4) remain the same, with the only change that R/S is replaced by R_red/S_red. Another reason to consider reduced representations is robustness against outliers. How to choose such a set is beyond the scope of this paper.

4 Experiments

Artificial datasets. 
First, we illustrate the LPDD and LPSD methods on two artificial datasets, both originally created in a 2D feature space. The first dataset contains two clusters with objects represented by Euclidean distances. The second dataset contains one uniform, square cluster and is contaminated with three outliers. The objects are represented by a slightly non-metric L_0.95 dissimilarity, i.e. d_0.95(x, y) = [sum_i |x_i − y_i|^0.95]^{1/0.95}. In Fig. 2, the first dataset together with the decision boundaries of the LPDD and the LPSD in the theoretical input space is shown. The parameter s used in all plots refers either to the scaling parameter in the sigmoid function for the LPDD (based on D_s) or to the scaling parameter in the RBF kernel. The pictures show similar behavior of both the LPDD and the LPSD; the LPDD tends to be just slightly tighter around the target class.

Figure 3: One-class LP classifiers, trained with ν = 0.1, for artificial uniformly distributed 2D data with 3 outliers (top row: LPDD on the Euclidean representation; bottom row: LPSD based on RBF_2). From left to right, s takes the values 0.7d_m, d_m, 1.6d_m, 3d_m, 8d_m, where d_m is the median of all distances. e refers to the error on the target set. 
Support objects are marked by squares.

Figure 4: One-class LP classifiers for artificial 2D data. The same setting as in Fig. 3 is used, only with the L_0.95 non-metric dissimilarities instead of the Euclidean ones (top row: LPDD on the L_0.95 representation; bottom row: LPSD based on RBF_0.95). Note that the median distance has changed and, consequently, the s values as well.

Figure 5: One-class LP classifiers, trained with ν = 0.1, for artificial uniformly distributed 2D data with 3 outliers, given by the L_0.95 non-metric rectangular 50×6 dissimilarity representations. The upper row shows the LPDD's results and the bottom row shows the LPSD's results with the kernel RBF_0.95. The objects of the reduced sets R_red and S_red are marked by triangles. Note that they differ from left to right. 
e refers to the error on the target set. Support objects are marked by squares.

This becomes clearer in Figs. 3 and 4, where three outliers lying outside a single uniformly distributed cluster should be ignored when an OCC with a soft margin is trained. From these figures, we can observe that the LPDD gives a tighter class description, which is more robust against the scaling parameter as well as against outliers. The same is observed when the L_0.95 dissimilarity is used instead of the Euclidean distance.

Fig. 5 presents some results for the reduced representations, in which just 6 objects are randomly chosen for the set R_red. In the left four plots, R_red contains objects from the uniform cluster only, and both methods perform equally well. In the right four plots, on the other hand, R_red contains an outlier. It can be seen that for a suitable scaling s, no outliers become support objects in the LPDD, which is often the case for the LPSD; see also Figs. 3 and 4. Also, a crucial difference between the LPDD and the LPSD can be observed w.r.t. the support objects. In case of the LPSD (applied to a non-reduced representation), they lie on the boundary, while in case of the LPDD, they tend to lie 'inside' the class.

Condition monitoring. Fault detection is an important problem in machine diagnostics: failure to detect faults can lead to machine damage, while false alarms can lead to unnecessary expenses. As an example, we consider the detection of four types of fault in ball-bearing cages, a dataset [16] considered in [3]. Each data instance consists of 2048 samples of acceleration taken with a Brüel & Kjær vibration analyser. After pre-processing with a discrete Fast Fourier Transform, each signal is characterized by 32 attributes. 
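The Euclidean and L_1 representations used in these experiments, as well as the non-metric L_0.95 dissimilarity used for the artificial data, are all instances of the L_p dissimilarity; a minimal NumPy sketch (the function name and array layout are our own choices):

```python
import numpy as np

def minkowski_dissim(X, R, p=2.0):
    """L_p dissimilarity matrix D(X, R) between row-objects of X and R.

    p >= 1 gives a metric (p=1: city block, p=2: Euclidean); p < 1,
    e.g. p = 0.95, is slightly non-metric: the triangle inequality
    may be violated, as exploited for the artificial data above.
    """
    X, R = np.atleast_2d(X), np.atleast_2d(R)
    diff = np.abs(X[:, None, :] - R[None, :, :])   # |x_ik - r_jk| per feature
    return (diff ** p).sum(axis=2) ** (1.0 / p)
```

With a reduced representation set, R simply holds the n ≪ N prototypes, yielding the rectangular matrix D(X, R_red) used in the Extensions paragraph.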
The dataset consists of five categories: normal behavior (NB), corresponding to measurements made from new ball-bearings, and four types of anomalies, say T1 - T4, corresponding either to damaged outer races or cages or to a badly worn ball-bearing. To compare our LPDD method with the LPSD method, we performed the experiments in the same way as described in [3], making use of the same training set and independent validation and test sets; see Fig. 6.

Table 1: The errors of the first and second kind (in %) of the LPDD and LPSD on two dissimilarity representations for the ball-bearing data. The reduced representations are based on 180 objects.

    Euclidean representation
    Method         Optimal s   # of SO   NB    T1    T2     T3     T4
    LPDD           200.4       10        1.4   0.0   45.0   69.8   70.0
    LPDD-reduced   65.3        17        1.1   0.0   20.2   47.5   50.9
    LPSD           320.0       8         1.3   0.0   46.7   71.7   74.5
    LPSD-reduced   211.2       6         0.6   0.0   39.9   67.1   69.5

    L1 dissimilarity representation
    Method         Optimal s   # of SO   NB    T1    T2     T3     T4
    LPDD           566.3       12        1.3   0.0   1.6    20.9   19.8
    LPDD-reduced   329.5       10        1.3   0.0   2.3    18.7   16.9
    LPSD           1019.3      8         0.9   0.0   2.2    27.9   27.2
    LPSD-reduced   965.7       5         0.3   0.0   3.5    26.3   27.5

Figure 6 (training/validation sizes): Train: 913 (NB); Valid.: 913 (NB), 747 (T1), 913 (T2).

The optimal values of s were found for both the LPDD and LPSD methods by the use of the validation set on the Euclidean and L1 dissimilarity representations. The results are presented in Table 1. It can be concluded that the L1 representation is far more convenient for fault detection, especially if we look at the fault types T3 and T4, which were unseen in the validation process. The LPSD performs better on normal instances (yields a smaller error) than the LPDD. This is to be expected, since its boundary is less tight, by which fewer support objects (SOs) are needed. On the other hand, the LPSD method deteriorates w.r.t. the outlier detection. 
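For reference, the soft-margin training problem (3) behind the LPDD results above is an ordinary linear program and can be prototyped with a generic LP solver; a minimal sketch with SciPy (the variable stacking, function names and the toy data in comments are our own, not from the paper):

```python
import numpy as np
from scipy.optimize import linprog

def train_lpdd(D, nu=0.1):
    """Fit the soft-margin LPDD of Eq. (3) on a square NxN dissimilarity matrix D.

    Decision variables are stacked as [w (N), rho (1), xi (N)], all >= 0.
    Returns the sparse weight vector w and the threshold rho.
    """
    N = D.shape[0]
    # objective: rho + (1/(nu*N)) * sum_i xi_i
    c = np.concatenate([np.zeros(N), [1.0], np.full(N, 1.0 / (nu * N))])
    # w^T D(p_i, R) - rho - xi_i <= 0 for every training object p_i
    A_ub = np.hstack([D, -np.ones((N, 1)), -np.eye(N)])
    b_ub = np.zeros(N)
    # sum_j w_j = 1 fixes the scale of the hyperplane
    A_eq = np.concatenate([np.ones(N), [0.0], np.zeros(N)])[None, :]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=(0, None), method="highs")
    return res.x[:N], res.x[N]

def lpdd_accept(w, rho, d_z, tol=1e-8):
    """Eq. (2): accept object z, given its dissimilarities d_z to R."""
    return float(w @ d_z) <= rho + tol
```

Objects p_j with w_j > 0 are the support objects; only their dissimilarities to a new object z are needed at test time, which is what makes the evaluation cheap.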
Note also that the reduced representation, based on 180 randomly chosen target objects (≈ 20%), mostly yields significantly better performance in outlier detection for the LPDD, and in target acceptance for the LPSD. Therefore, we can conclude that if a failure in fault detection has higher costs than the cost of misclassifying target objects, then our approach should be recommended.

Figure 6: Fault detection data (test set sizes: 913 NB, 747 T1, 996 T2, 996 T3, 996 T4).

5 Conclusions

We have proposed the Linear Programming Dissimilarity-data Description (LPDD) classifier, built directly on dissimilarity representations. This method is efficient in the sense that only some objects are needed for the computation of dissimilarities in the test phase. The novelty of our approach lies in its reformulation for general dissimilarity measures, which, we think, is more natural for class descriptors. Since dissimilarity measures might be unbounded, we have also proposed to transform the dissimilarities by the sigmoid function, which makes the LPDD more robust against objects with large dissimilarities. We have emphasized the possibility of using LP procedures for rectangular dissimilarity/similarity representations, which is especially useful when (dis)similarities are costly to compute.

The LPDD has been applied to artificial and real-world datasets and compared to the LPSD detector proposed in [3]. For all considered datasets, the LPDD yields a more compact target description than the LPSD. The LPDD is more robust against outliers in the training set, in particular when only some objects are considered for a reduced representation. Moreover, with a proper scaling parameter s of the sigmoid function, the support objects of the LPDD do not contain outliers, while it seems difficult for the LPSD to achieve the same. 
In the original formulation, the support objects of the LPSD tend to lie on the boundary, while for the LPDD, they lie mostly 'inside' the boundary. This means that the removal of such an object will not impose a drastic change on the LPDD.

In summary, our LPDD method can be recommended when the failure to detect outliers is more expensive than the cost of a false alarm. It is also possible to enlarge the description of the LPDD by adding a small constant to ρ. Such a constant should be related to the dissimilarity values in the neighborhood of the boundary. How to choose it remains an open issue for further research.

Acknowledgements. This work is partly supported by the Dutch Organization for Scientific Research (NWO) and the European Community Marie Curie Fellowship. The authors are solely responsible for the information communicated and the European Commission is not responsible for any views or results expressed.

References

[1] K.P. Bennett and O.L. Mangasarian. Combining support vector and mathematical programming methods for induction. In B. Schölkopf, C.J.C. Burges, and A.J. Smola, editors, Advances in Kernel Methods, Support Vector Learning, pages 307-326. MIT Press, Cambridge, MA, 1999.
[2] C. Berg, J.P.R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer-Verlag, 1984.
[3] C. Campbell and K.P. Bennett. A linear programming approach to novelty detection. In Neural Information Processing Systems, pages 395-401, 2000.
[4] M.P. Dubuisson and A.K. Jain. A modified Hausdorff distance for object matching. In 12th Internat. Conference on Pattern Recognition, volume 1, pages 566-568, 1994.
[5] R.P.W. Duin. Compactness and complexity of pattern recognition problems. In Internat. Symposium on Pattern Recognition 'In Memoriam Pierre Devijver', pages 124-128, Royal Military Academy, Brussels, 1999.
[6] R.P.W. Duin and E. Pękalska. 
Complexity of dissimilarity based pattern classes. In Scandinavian Conference on Image Analysis, 2001.
[7] D.W. Jacobs, D. Weinshall, and Y. Gdalyahu. Classification with non-metric distances: Image retrieval and class representation. IEEE Trans. on PAMI, 22(6):583-600, 2000.
[8] A.K. Jain and D. Zongker. Representation and recognition of handwritten digits using deformable templates. IEEE Trans. on PAMI, 19(12):1386-1391, 1997.
[9] O.L. Mangasarian. Arbitrary-norm separating plane. Operations Research Letters, 24(1-2):15-23, 1999.
[10] E. Pękalska, P. Paclik, and R.P.W. Duin. A generalized kernel approach to dissimilarity-based classification. Journal of Machine Learning Research, 2(2):175-211, 2001.
[11] B. Schölkopf, J.C. Platt, A.J. Smola, and R.C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13:1443-1471, 2001.
[12] B. Schölkopf, R.C. Williamson, A.J. Smola, J. Shawe-Taylor, and J.C. Platt. Support vector method for novelty detection. In Neural Information Processing Systems, 2000.
[13] D.M.J. Tax. One-class classification. PhD thesis, Delft University of Technology, The Netherlands, 2001.
[14] D.M.J. Tax and R.P.W. Duin. Support vector data description. Machine Learning, 2002. Accepted.
[15] V. Vapnik. The Nature of Statistical Learning. Springer, N.Y., 1995.
[16] http://www.sidanet.org.
", "award": [], "sourceid": 2163, "authors": [{"given_name": "Elzbieta", "family_name": "Pekalska", "institution": null}, {"given_name": "David", "family_name": "Tax", "institution": null}, {"given_name": "Robert", "family_name": "Duin", "institution": null}]}