{"title": "On Integrated Clustering and Outlier Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 1359, "page_last": 1367, "abstract": "We model the joint clustering and outlier detection problem using an extension of the facility location formulation. The advantages of combining clustering and outlier selection include: (i) the resulting clusters tend to be compact and semantically coherent; (ii) the clusters are more robust against data perturbations; and (iii) the outliers are contextualised by the clusters and more interpretable. We provide a practical subgradient-based algorithm for the problem and also study the theoretical properties of the algorithm in terms of approximation and convergence. Extensive evaluation on synthetic and real data sets attests to both the quality and scalability of our proposed method.", "full_text": "On Integrated Clustering and Outlier Detection\n\nLionel Ott\nUniversity of Sydney\nlott4241@uni.sydney.edu.au\n\nLinsey Pang\nUniversity of Sydney\nqlinsey@it.usyd.edu.au\n\nFabio Ramos\nUniversity of Sydney\nfabio.ramos@sydney.edu.au\n\nSanjay Chawla\nUniversity of Sydney\nsanjay.chawla@sydney.edu.au\n\nAbstract\n\nWe model the joint clustering and outlier detection problem using an extension of the facility location formulation. The advantages of combining clustering and outlier selection include: (i) the resulting clusters tend to be compact and semantically coherent; (ii) the clusters are more robust against data perturbations; and (iii) the outliers are contextualised by the clusters and more interpretable. We provide a practical subgradient-based algorithm for the problem and also study the theoretical properties of the algorithm in terms of approximation and convergence. 
Extensive evaluation on synthetic and real data sets attests to both the quality and scalability of our proposed method.\n\n1 Introduction\n\nClustering and outlier detection are often studied as separate problems [1]. However, it is natural to consider them simultaneously. For example, outliers can have a disproportionate impact on the location and shape of clusters, which in turn can help identify, contextualize and interpret the outliers. Pelillo [2] proposed a game theoretic definition of clustering algorithms which emphasizes the need for methods that require as little information as possible while being capable of dealing with outliers.\nThe area of “robust statistics” studies the design of statistical methods which are less sensitive to the presence of outliers [3]. For example, the median and trimmed mean estimators are less sensitive to outliers than the mean. Similarly, versions of Principal Component Analysis (PCA) have been proposed [4, 5, 6] which are more robust against model mis-specification and outliers. An important primitive in the area of robust statistics is the notion of the Minimum Covariance Determinant (MCD): given a set of n multivariate data points and a parameter ℓ, the objective is to identify a subset of points which minimizes the determinant of the variance-covariance matrix over all subsets of size n − ℓ. The resulting variance-covariance matrix can be integrated into the Mahalanobis distance and used as part of a chi-square test to identify multivariate outliers [7].\nIn the theoretical computer science literature, similar problems have been studied in the context of clustering and facility location. 
For example, Chen [8] has considered and proposed a constant factor approximation algorithm for the k-median with outliers problem: given n data points and parameters k and ℓ, the objective is to remove a set of ℓ points such that the cost of k-median clustering on the remaining n − ℓ points is minimized. Our model is similar to the one proposed by Charikar et al. [9], who have used a primal-dual formulation to derive an approximation algorithm for the facility location with outliers problem.\nMore recently, Chawla and Gionis [10] have proposed k-means--, a practical and scalable algorithm for the k-means with outliers problem. k-means-- is a simple extension of the k-means algorithm and is guaranteed to converge to a local optimum. However, the algorithm inherits the weaknesses of the classical k-means algorithm. These are: (i) the requirement of setting the number of clusters k and (ii) the initial specification of the k centroids. It is well known that the choice of k and the initial set of centroids can have a disproportionate impact on the result.\nIn this paper we model clustering and outlier detection as an integer programming optimization task and then propose a Lagrangian relaxation to design a scalable subgradient-based algorithm. The resulting algorithm discovers the number of clusters and requires as input: the distance (discrepancy) between pairs of points, the cost of creating a new cluster and the number ℓ of outliers to select.\nThe remainder of the paper is structured as follows. In Section 2 we formally describe the problem as an integer program. In Section 3, we describe the Lagrangian relaxation and details of the subgradient algorithm. The approximation properties of the relaxation and the convergence of the subgradient algorithm are discussed in Section 4. Experiments on synthetic and real data sets are the focus of Section 5 before concluding with Section 6. 
The supplementary section derives an extension of the affinity propagation algorithm [11] to detect outliers (APOC), which will be used for comparison.\n\n2 Problem Formulation\n\nThe Facility Location with Outliers (FLO) problem is defined as follows [9]. Given a set of data points with distances D = {d_ij}, the cluster creation costs c_j and the number of outliers ℓ, we define the task of clustering and outlier detection as the problem of finding the assignments to the binary exemplar indicators y_j, outlier indicators o_i and point assignments x_ij that minimize the following objective function:\n\nFLO ≡ min Σ_j c_j y_j + Σ_i Σ_j d_ij x_ij,   (1)\n\nsubject to\n\nx_ij ≤ y_j,   (2)\no_i + Σ_j x_ij = 1,   (3)\nΣ_i o_i = ℓ,   (4)\nx_ij, y_j, o_i ∈ {0, 1}.   (5)\n\nIn order to obtain a valid solution a set of constraints has been imposed:\n\n• points can only be assigned to valid exemplars, Eq. (2);\n• every point must be assigned to exactly one other point or declared an outlier, Eq. (3);\n• exactly ℓ outliers have to be selected, Eq. (4);\n• only integer solutions are allowed, Eq. (5).\n\nThese constraints describe the facility location problem with outlier detection. This formulation will allow the algorithm to select the number of clusters automatically and implicitly defines outliers as those points whose presence in the dataset has the biggest negative impact on the overall solution.\nThe problem is known to be NP-hard and, while approximation algorithms have been proposed when distances are assumed to be a metric, there is no known algorithm which is practical, scalable, and comes with solution guarantees [9]. For example, a linear relaxation of the problem and a solution using a linear programming solver is not scalable to large data sets as the number of variables is O(n²). 
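As a concrete reading of the integer program (1)–(5), the objective and constraints can be checked directly for a candidate solution. The following is an illustrative Python sketch using the notation above; it is not part of the original paper, and the function name and data layout are our assumptions:

```python
import numpy as np

def flo_objective(d, c, x, y, o, ell):
    """Objective (1) of a candidate FLO solution, with feasibility checks (2)-(5).

    d: (n, n) pairwise distances, c: (n,) cluster creation costs,
    x: (n, n) binary assignments, y: (n,) exemplar indicators,
    o: (n,) outlier indicators, ell: required number of outliers.
    """
    assert np.all(x <= y[None, :])           # (2): assign only to open exemplars
    assert np.all(o + x.sum(axis=1) == 1)    # (3): assigned once or declared outlier
    assert o.sum() == ell                    # (4): exactly ell outliers
    return float(c @ y + (d * x).sum())      # (1): creation cost + assignment cost
```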
In fact we will show that the Lagrangian relaxation of the problem is exactly equivalent to a linear relaxation and the corresponding subgradient algorithm scales to large data sets, has a small memory footprint, can be easily parallelized, and does not require access to a linear programming solver.\n\n3 Lagrangian Relaxation of FLO\n\nThe Lagrangian relaxation is based on the following recipe and observations: (i) relax (or dualize) “tough” constraints of the original FLO problem by moving them to the objective; (ii) associate a Lagrange multiplier (λ) with the relaxed constraints, which intuitively captures the price of constraints not being satisfied; (iii) for any non-negative λ, FLO(λ) is a lower bound on the FLO problem, and as a function of λ, FLO(λ) is concave but non-differentiable; (iv) use a subgradient algorithm to maximize FLO(λ) as a function of λ in order to close the gap between the primal and the dual.\nMore specifically, we relax the constraint o_i + Σ_j x_ij = 1 for each i and associate a Lagrange multiplier λ_i with each constraint. Rearranging the terms yields:\n\nFLO(λ) = min Σ_i (1 − o_i) λ_i + Σ_j c_j y_j + Σ_i Σ_j (d_ij − λ_i) x_ij,   (6)\n\nwhere the first term accounts for the outliers and the remaining two for the clustering, subject to\n\nx_ij ≤ y_j,   (7)\nΣ_i o_i = ℓ,   (8)\nx_ij, y_j, o_i ∈ {0, 1}   ∀ i, j.   (9)\n\nWe can now solve the relaxed problem with a heuristic finding valid assignments that attempt to minimize Eq. (6) without optimality guarantees [12]. The Lagrange multipliers λ act as a penalty incurred for constraint violations which we try to minimize. From Eq. (6) we see that the penalty influences two parts: outlier selection and clustering. 
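The heuristic described next (Eqs. (10)–(12)) is easy to state in code. The following is a minimal illustrative sketch in Python, not the authors' implementation; the function names, tie-breaking, and data layout are our assumptions:

```python
import numpy as np

def solve_flo_lambda(d, c, lam, ell):
    """Heuristic for the relaxed subproblem FLO(lambda).

    d: (n, n) distances, c: (n,) cluster creation costs,
    lam: (n,) Lagrange multipliers, ell: number of outliers.
    Returns assignments x, exemplar indicators y, outlier indicators o.
    """
    n = len(c)
    o = np.zeros(n, dtype=int)
    o[np.argsort(-lam)[:ell]] = 1                 # ell largest lambda_i become outliers
    slack = d - lam[:, None]                      # d_ij - lambda_i
    slack[o == 1] = 0.0                           # outliers take part in no assignment
    mu = c + np.minimum(slack, 0.0).sum(axis=0)   # amortized exemplar cost, Eq. (10)
    y = (mu < 0).astype(int)                      # open exemplar j iff mu_j < 0
    x = ((slack < 0) & (y[None, :] == 1)).astype(int)  # x_ij = y_j if d_ij - lambda_i < 0
    return x, y, o

def subgradient_step(x, o, lam, theta):
    """Subgradient s and multiplier update, Eqs. (11)-(12)."""
    s = 1 - x.sum(axis=1) - o
    return np.maximum(lam + theta * s, 0.0), s
```

Iterating these two functions with a decaying step size reproduces the loop of Algorithm 1. Note that the relaxed solution may temporarily assign a point to several exemplars; that violation of the dualized constraint is exactly what the subgradient penalizes.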
The heuristic starts by selecting good outliers by designating the ℓ points with largest λ as outliers, as this removes a large part of the penalty. For the remaining N − ℓ points clustering assignments are found by setting x_ij = 0 for all pairs for which d_ij − λ_i ≥ 0. To select the exemplars we compute:\n\nμ_j = c_j + Σ_{i : d_ij − λ_i < 0} (d_ij − λ_i),   (10)\n\nwhich represents the amortized cost of selecting point j as exemplar and assigning points to it. Thus, if μ_j < 0 we select point j as an exemplar and set y_j = 1, otherwise we set y_j = 0. Finally, we set x_ij = y_j if d_ij − λ_i < 0. From this complete assignment found by the heuristic we compute a new subgradient s^t and update the Lagrange multipliers λ^t as follows:\n\ns_i^t = 1 − Σ_j x_ij − o_i,   (11)\nλ_i^t = max(λ_i^{t−1} + θ_t s_i, 0),   (12)\n\nwhere θ_t is the step size at time t computed as\n\nθ_t = θ_0 α^t,   α ∈ (0, 1).   (13)\n\nTo obtain the final solution we repeat the above steps until the changes become small enough, at which point we extract a feasible solution. This is guaranteed to converge if a step function is used for which the following holds [12]:\n\nlim_{n→∞} Σ_{t=1}^n θ_t = ∞   and   lim_{t→∞} θ_t = 0.   (14)\n\nA high level algorithm description is given in Algorithm 1.\n\n4 Analysis of Lagrangian Relaxation\n\nIn this section, we analyze the solution obtained from using the Lagrangian relaxation (LR) method. Our analysis has two parts. In the first part, we will show that the Lagrangian relaxation is exactly equivalent to solving the linear relaxation of the FLO problem. 
Thus if FLO(IP), FLO(LP) and FLO(LR) are the optimal values of the integer program, the linear programming relaxation and the Lagrangian relaxation, respectively, we will show that FLO(LR) = FLO(LP). In the second part, we will analyze the convergence rate of the subgradient method and the impact of outliers.\n\nAlgorithm 1: LagrangianRelaxation()\nInitialize λ^0, x^0, t\nwhile not converged do\n  s^t ← ComputeSubgradient(x^{t−1})\n  λ^t ← ComputeLambda(s^t)\n  x^t ← FLO(λ^t)   (solve via heuristic)\n  t ← t + 1\nend\n\nFigure 1: Visualization of the building blocks of the A matrix. The top left is an n² × n² identity matrix, which is followed by n row-stacked blocks of n × n negative identity matrices. To the right of those is another n² × n block of zeros. The final row in the block matrix consists of n² + n zeros followed by n ones.\n\n4.1 Quality of the Lagrangian Relaxation\n\nConsider the constraint set L = {(x, y, o) ∈ Z^{n²+2n} | x_ij ≤ y_j ∧ Σ_i o_i ≤ ℓ ∀ i, j}. Then it is well known that the optimal value FLO(LR) of the Lagrangian relaxation is equal to the cost of the following optimization problem [12]:\n\nmin Σ_j c_j y_j + Σ_i Σ_j x_ij d_ij   (15)\nsubject to o_i + Σ_j x_ij = 1,   (16)\n(x, y, o) ∈ conv(L),   (17)\n\nwhere conv(L) is the convex hull of the set L. We now show that L is integral and therefore\n\nconv(L) = {(x, y, o) ∈ R^{n²+2n} | x_ij ≤ y_j ∧ Σ_i o_i ≤ ℓ ∀ i, j}.\n\nThis in turn will imply that FLO(LR) = FLO(LP). In order to show that L is integral, we will establish that the constraint matrix corresponding to the set L is totally unimodular (TU). For completeness, we recall several important definitions and theorems from integer programming theory [12]:\nDefinition 1. A matrix A is totally unimodular if every square submatrix of A has determinant in the set {−1, 0, 1}.\nProposition 1. 
Given a linear program min{c^T x : Ax ≥ b, x ∈ R^n_+}, let b be an integer vector for which the problem instance has finite value. Then the linear program has an integral optimal solution if A is totally unimodular.\n\nAn equivalent definition of total unimodularity (TU), often easier to establish, is captured in the following theorem.\nTheorem 1. Let A be a matrix. Then A is TU iff for any subset of rows X of A, there exists a coloring of the rows of X with 1 or −1 such that the weighted sum of every column (while restricting the sum to rows in X) is −1, 0 or 1.\n\nWe are now ready to state and prove the main theorem in this section.\n\nTheorem 2. The matrix corresponding to the constraint set L is totally unimodular.\n\nProof. We need to consider the constraints\n\nx_ij ≤ y_j   ∀ i, j,   (18)\nΣ_{i=1}^n o_i ≤ ℓ.   (19)\n\nWe can express the above constraints in the form Au = b where u is the vector\n\nu = [x_11, . . . , x_1n, . . . , x_n1, . . . , x_nn, y_1, . . . , y_n, o_1, . . . , o_n]^T.   (20)\n\nThe block matrix A is of the form\n\nA = [ I  B  0 ; 0  0  1 ].   (21)\n\nHere I is an n² × n² identity matrix, B is a stack of n matrices of size n × n where each element of the stack is a negative identity matrix, and 1 is a 1 × n block of 1's. See Figure 1 for a detailed visualization.\nNow to prove that A is TU, we will use Theorem 1. Take any subset X of rows of A. Whether we color the rows of X by 1 or −1, the column sum (within X) of a column of I will be in {−1, 0, 1}. A similar argument holds for columns of the block matrix 1. Now consider the submatrix B. 
We can express X as\n\nX = ∪_{i=1}^n X_i,   (22)\n\nwhere each X_i = {r ∈ X | X(r, i) = −1}. Given that B is a stack of negative diagonal matrices, X_i ∩ X_j = ∅ for i ≠ j. Now consider a column j of B. If X_j has an even number of −1's, then split the elements of X_j evenly and color one half as 1 and the other as −1. Then the sum of column j (for rows in X) will be 0. On the other hand, if another set of rows X_k has an odd number of −1's, color the rows of X_k alternately with 1 and −1. Since X_j and X_k are disjoint, their colorings can be carried out independently. Then the sum of column j will be 1 or −1. Thus we satisfy the condition of Theorem 1 and conclude that A is TU.\n\n4.2 Convergence of Subgradient Method\n\nAs noted above, the Lagrangian dual is given by max{FLO(λ) | λ ≥ 0}. Furthermore, we use a gradient ascent method to update the λ's as [λ_i^t]_{i=1}^n = max(λ_i^{t−1} + θ_t s_i, 0), where s_i^t = 1 − Σ_j x_ij − o_i and θ_t is the step size.\nNow, assuming that the norm of the subgradients is bounded, i.e., ‖s‖₂ ≤ G, and that the distance between the initial point and the optimal set is bounded, ‖λ^1 − λ*‖₂ ≤ R, it is known that [13]:\n\n|Z(λ^t) − Z(λ*)| ≤ (R² + G² Σ_{i=1}^t θ_i²) / (2 Σ_{i=1}^t θ_i).\n\nThis can be used to show that to obtain ε accuracy (for any step size), the number of iterations is lower bounded by O(RG/ε²). We examine the impact of integrating clustering and outliers on the convergence rate. We make the following observations:\nObservation 1. At a given iteration t and for a given data point i, if o_i^t = 1 then Σ_j x_ij^t = 0, and therefore s_i^t = 0 and λ_i^{t+1} = λ_i^t.\nObservation 2. At a given iteration t and for a given data point i, if o_i^t = 0 and the point i is assigned to exactly one exemplar, then Σ_j x_ij^t = 1, and therefore s_i^t = 0 and λ_i^{t+1} = λ_i^t.\n\nIn conjunction with the algorithm for solving FLO(λ) and the above observations we can draw important conclusions regarding the behavior of the algorithm, including: (i) the λ values associated with outliers will be relatively larger and stabilize earlier; and (ii) the λ values of the exemplars will be relatively smaller and will take longer to stabilize.\n\n5 Experiments\n\nIn this section we evaluate the proposed method on both synthetic and real data and compare it to other methods. We first present experiments using synthetic data to show quantitative analysis of the methods in a controlled environment. Then, we present clustering and outlier results obtained on the MNIST image data set.\nWe compare our Lagrangian Relaxation (LR) based method to two other methods, k-means-- and an extension of affinity propagation [11] to outlier clustering (APOC), whose details can be found in the supplementary material. Both LR and APOC require a cost for creating clusters. We obtain this value as α · median(d_ij), i.e. the median of all distances multiplied by a scaling factor α which typically is in the range [1, 30]. The initial centroids required by k-means-- are found using k-means++ [14] and unless specified otherwise k-means-- is provided with the correct number of clusters k.\n\n5.1 Synthetic Data\n\nWe use synthetic datasets for controlled performance evaluation and comparison between the different methods. The data is generated by randomly sampling k clusters with m points, each from d-dimensional normal distributions N(μ, Σ) with randomly selected μ and Σ. 
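A minimal sketch of this sampling scheme follows (illustrative Python; the ranges for μ and Σ and the use of NumPy's generator are our assumptions, not the paper's exact protocol):

```python
import numpy as np

def make_clusters(k, m, d, rng):
    """Sample k clusters of m points each from random d-dimensional Gaussians."""
    points, labels = [], []
    for j in range(k):
        mu = rng.uniform(-10.0, 10.0, size=d)      # random cluster centre
        a = rng.normal(size=(d, d))
        sigma = a @ a.T + 0.1 * np.eye(d)          # random positive-definite covariance
        points.append(rng.multivariate_normal(mu, sigma, size=m))
        labels += [j] * m
    return np.vstack(points), np.array(labels)
```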
To these clusters we add ℓ additional outlier points that have a low probability of belonging to any of the selected clusters. The distance between points is computed using the Euclidean distance. We focus on 2D distributions as they are more challenging than higher dimensional data due to the separability of the data.\nTo assess the performance of the methods we use the following three metrics:\n\n1. Normalized Jaccard index, which measures how accurately a method selects the ground truth outliers. It is a coefficient computed between selected outliers O and ground-truth outliers O*. The final coefficient is normalized with regard to the best possible coefficient obtainable in the following way:\n\nJ(O, O*) = (|O ∩ O*| / |O ∪ O*|) / (min(|O|, |O*|) / max(|O|, |O*|)).   (23)\n\n2. Local outlier factor [15] (LOF), which measures the outlier quality of a point. We compute the ratio between the average LOF of O and O*, which indicates the quality of the set of selected outliers.\n\n3. V-Measure [16], which indicates the quality of the overall clustering solution. The outliers are considered as an additional class for this measure.\n\nFor the Jaccard index and V-Measure a value of 1 is optimal, while for the LOF ratio a larger value is better.\nSince the number of outliers ℓ, required by all methods, is typically not known exactly, we explore how its misspecification affects the results. We generate 2D datasets with 2000 inliers and 200 outliers and vary the number of outliers ℓ selected by the methods. The results in Figure 2 show that in general none of the methods fail completely if the value of ℓ is misspecified. Looking at the Jaccard index, which indicates the percentage of true outliers selected, we see that if ℓ is smaller than the true number of outliers all methods pick only outliers. 
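The normalized Jaccard index of Eq. (23) amounts to the following (an illustrative Python sketch over index sets, not the paper's code):

```python
def normalized_jaccard(selected, truth):
    """Jaccard index of selected vs. ground-truth outliers, normalized by the
    best coefficient attainable given the two set sizes (Eq. (23))."""
    selected, truth = set(selected), set(truth)
    jaccard = len(selected & truth) / len(selected | truth)
    best = min(len(selected), len(truth)) / max(len(selected), len(truth))
    return jaccard / best
```

With this normalization, selecting a strict superset (or subset) of the true outliers still scores 1, so the index isolates wrong selections from a merely misspecified ℓ.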
When ℓ is greater than the true number of outliers we can see that LR and APOC improve with larger ℓ while k-means-- does so only sometimes. This is due to the formulation of LR, which selects the largest outliers, and APOC does so to some extent as well. This means that if some outliers are initially missed they are more likely to be selected if ℓ is larger than the true number of outliers. Looking at the LOF ratio we can see that selecting more outliers than present in the data set reduces the score somewhat but not dramatically, which provides the method with robustness. Finally, the V-Measure results show that the overall clustering results remain accurate, even if the number of outliers is misspecified.\nWe experimentally investigate the quality of the solution by comparing with the results obtained by solving the LP relaxation using CPLEX. This comparison indicates what quality can typically be expected from the different methods. Additionally, we can evaluate the speed of these approximations. We evaluate 100 datasets, consisting of 2D Gaussian clusters and outliers, with varying numbers of points.\n\nFigure 2: The impact of the number of outliers specified (ℓ) on the quality of the clustering and outlier detection performance. LR and APOC perform similarly, with more stability and better outlier choices compared to k-means--. We can see that overestimating ℓ is more detrimental to the overall performance, as indicated by the LOF ratio and V-Measure, than underestimating it.\n\n(a) Speedup over LP. (b) Total Runtime. (c) Time per Iteration.\nFigure 3: The graphs show how the number of points influences different measures. In (a) we compare the speedup of both LR and APOC over LP. (b) compares the total runtime needed to solve the clustering problem for LR and APOC. Finally, (c) plots the time required (on a log scale) for a single iteration for LR and APOC.\n\n
On average LR obtains 94% ± 5% of the LP objective value, APOC obtains an energy that is 95% ± 4% of the optimal solution found by LP, and k-means--, with correct k, obtains 86% ± 12% of the optimum. These results reinforce the previous analysis; LR and APOC perform similarly while outperforming k-means--. Next we look at the speed-up of LR and APOC over LP. Figure 3 a) shows both methods are significantly faster, with the speed-up increasing as the number of points increases. Overall, for a small price in quality the two methods obtain a significantly faster solution. k-means-- outperforms the other two methods easily with regards to speed but has neither the accuracy nor the ability to infer the number of clusters directly from the data.\nNext we compare the runtime of LR and APOC. Figure 3 b) shows the overall runtime of both methods for varying numbers of data points. Here we observe that APOC is faster than LR; however, by observing the time a single iteration takes, shown in Figure 3 c), we see that LR is much faster on a per iteration basis compared to APOC. In practice LR requires several times the number of iterations of APOC, which is affected by the step size function used. Using a more sophisticated method of computing the step size would provide large gains for LR. Finally, the biggest difference between LR and APOC is that the latter requires all messages and distances to be held in memory. This obviously scales poorly for large datasets. Conversely, LR computes the distances at runtime and only needs to store indicator vectors and a sparse assignment matrix, thus using much less memory. This makes LR amenable to processing large scale datasets. For example, with single precision floating point numbers, dense matrices and 10 000 points APOC requires around 2200 MB of memory while LR only needs 370 MB. 
Further gains can be obtained by using sparse matrices, which is straightforward in the case of LR but complicated for APOC.\n\n5.2 MNIST Data\n\nThe MNIST dataset, introduced by LeCun et al. [17], contains 28 × 28 pixel images of handwritten digits. We extract features from these images by representing them as 784 dimensional vectors which are reduced to 25 dimensions using PCA. The distance between these vectors is computed using the L2 norm. In Figure 4 we show exemplary results obtained when processing 10 000 digits with the LR method with α = 5 and ℓ = 500.\n\n(a) Digit 1. (b) Digit 4. (c) Outliers.\nFigure 4: Each row in (a) and (b) shows a different appearance of a digit captured by a cluster. The outliers shown in (c) tend to have a heavier than usual stroke, are incomplete or are not recognizable as a digit.\n\nTable 1: Evaluation of clustering results on the MNIST data set with different cost scaling values α for LR and APOC as well as different settings for k-means--. We can see that increasing the cost results in fewer clusters but as a trade-off reduces the homogeneity of the clusters.\n\nMethod:        LR (α=5)  LR (α=15)  LR (α=25)  APOC (α=15)  k-means--  k-means--\nV-Measure      0.52      0.67       0.54       0.53         0.58       0.51\nHomogeneity    0.78      0.74       0.65       0.72         0.75       0.50\nCompleteness   0.39      0.61       0.46       0.42         0.47       0.52\nClusters       120       13         27         51           10         40\n\nEach row in Figure 4 a) and b) shows examples of clusters representing the digits 1 and 4, respectively. This illustrates how different the same digit can appear and the separation induced by the clusters. 
Figure 4 c) contains a subset of the outliers selected by the method. These outliers have different characteristics that make them sensible outliers, such as: thick stroke, incompleteness, being unrecognizable or having ambiguous meaning.\nTo investigate the influence the cluster creation cost has, we run the experiment with different values of α. In Table 1 we show results for LR with cost scaling factor values α = {5, 15, 25}, APOC with α = 15 and k-means-- with k = {10, 40}. We can see that LR obtains the best V-Measure score out of all methods with α = 15. The homogeneity and completeness scores reflect this as well; while homogeneity is similar to the other settings, the completeness value is much better. Looking at APOC we see that it struggles to obtain the same quality as LR. In the case of k-means-- we can observe how providing the algorithm with the actual number of clusters results in worse performance compared to a larger number of clusters, which highlights the advantage of methods capable of automatically selecting the number of clusters from the data.\n\n6 Conclusion\n\nIn this paper we presented a novel approach to joint clustering and outlier detection formulated as an integer program. The method only requires pairwise distances and the number of outliers as input and detects the number of clusters directly from the data. Using a Lagrangian relaxation of the problem formulation, which is solved using a subgradient method, we obtain a method that is provably equivalent to a linear programming relaxation. Our proposed algorithm is simple to implement, highly scalable, and has a small memory footprint. The clusters and outliers found by the algorithm are meaningful and easily interpretable.\n\nReferences\n\n[1] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys, 2009.\n[2] M. Pelillo. What is a Cluster? Perspectives from Game Theory. In Proc. 
of Advances in Neural Information Processing Systems, 2009.\n[3] P. Huber and E. Ronchetti. Robust Statistics. Wiley, 2008.\n[4] C. Croux and A. Ruiz-Gazen. A Fast Algorithm for Robust Principal Components Based on Projection Pursuit. In Proc. in Computational Statistics, 1996.\n[5] J. Wright, A. Ganesh, S. Rao, Y. Peng, and Y. Ma. Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Matrices by Convex Optimization. In Proc. of Advances in Neural Information Processing Systems, 2009.\n[6] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? J. ACM, 58(3):11:1–11:37, June 2011. ISSN 0004-5411.\n[7] P.J. Rousseeuw and K.V. Driessen. A fast algorithm for the minimum covariance determinant estimator. Technometrics, 1999.\n[8] K. Chen. A constant factor approximation algorithm for k-median clustering with outliers. In Proc. of the ACM-SIAM Symposium on Discrete Algorithms, 2008.\n[9] M. Charikar, S. Khuller, D. M. Mount, and G. Narasimhan. Algorithms for Facility Location Problems with Outliers. In Proc. of the ACM-SIAM Symposium on Discrete Algorithms, 2001.\n[10] S. Chawla and A. Gionis. k-means--: A Unified Approach to Clustering and Outlier Detection. In SIAM International Conference on Data Mining, 2013.\n[11] B. Frey and D. Dueck. Clustering by Passing Messages Between Data Points. Science, 2007.\n[12] D. Bertsimas and R. Weismantel. Optimization over Integers. Dynamic Ideas, Belmont, 2005.\n[13] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004. ISBN 0521833787.\n[14] D. Arthur and S. Vassilvitskii. k-means++: The Advantages of Careful Seeding. In ACM-SIAM Symposium on Discrete Algorithms, 2007.\n[15] M. Breunig, H. Kriegel, R. Ng, and J. Sander. LOF: Identifying Density-Based Local Outliers. In Int. Conf. on Management of Data, 2000.\n[16] A. Rosenberg and J. 
Hirschberg. V-Measure: A conditional entropy-based external cluster evaluation measure. In Proc. of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2007.\n[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.", "award": [], "sourceid": 753, "authors": [{"given_name": "Lionel", "family_name": "Ott", "institution": "University of Sydney"}, {"given_name": "Linsey", "family_name": "Pang", "institution": "University of Sydney"}, {"given_name": "Fabio", "family_name": "Ramos", "institution": "University of Sydney"}, {"given_name": "Sanjay", "family_name": "Chawla", "institution": "QCRI"}]}