{"title": "Multi-criteria Anomaly Detection using Pareto Depth Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 845, "page_last": 853, "abstract": "We consider the problem of identifying patterns in a data set that exhibit anomalous behavior, often referred to as anomaly detection. In most anomaly detection algorithms, the dissimilarity between data samples is calculated by a single criterion, such as Euclidean distance. However, in many cases there may not exist a single dissimilarity measure that captures all possible anomalous patterns. In such a case, multiple criteria can be defined, and one can test for anomalies by scalarizing the multiple criteria by taking some linear combination of them. If the importance of the different criteria are not known in advance, the algorithm may need to be executed multiple times with different choices of weights in the linear combination. In this paper, we introduce a novel non-parametric multi-criteria anomaly detection method using Pareto depth analysis (PDA). PDA uses the concept of Pareto optimality to detect anomalies under multiple criteria without having to run an algorithm multiple times with different choices of weights. The proposed PDA approach scales linearly in the number of criteria and is provably better than linear combinations of the criteria.", "full_text": "Multi-criteria Anomaly Detection using Pareto Depth Analysis\n\nKo-Jen Hsiao, Kevin S. Xu, Jeff Calder, and Alfred O. Hero III\n\nUniversity of Michigan, Ann Arbor, MI, USA 48109\n\n{coolmark,xukevin,jcalder,hero}@umich.edu\n\nAbstract\n\nWe consider the problem of identifying patterns in a data set that exhibit anomalous\nbehavior, often referred to as anomaly detection. In most anomaly detection\nalgorithms, the dissimilarity between data samples is calculated by a single criterion,\nsuch as Euclidean distance. 
However, in many cases there may not exist a\nsingle dissimilarity measure that captures all possible anomalous patterns. In such\na case, multiple criteria can be defined, and one can test for anomalies by scalarizing\nthe multiple criteria using a linear combination of them. If the importance\nof the different criteria is not known in advance, the algorithm may need to be\nexecuted multiple times with different choices of weights in the linear combination.\nIn this paper, we introduce a novel non-parametric multi-criteria anomaly\ndetection method using Pareto depth analysis (PDA). PDA uses the concept of\nPareto optimality to detect anomalies under multiple criteria without having to\nrun an algorithm multiple times with different choices of weights. The proposed\nPDA approach scales linearly in the number of criteria and is provably better than\nlinear combinations of the criteria.\n\n1 Introduction\n\nAnomaly detection is an important problem that has been studied in a variety of areas and used in diverse\napplications including intrusion detection, fraud detection, and image processing [1, 2]. Many\nmethods for anomaly detection have been developed using both parametric and non-parametric approaches.\nNon-parametric approaches typically involve the calculation of dissimilarities between\ndata samples. For complex high-dimensional data, multiple dissimilarity measures corresponding\nto different criteria may be required to detect certain types of anomalies. For example, consider the\nproblem of detecting anomalous object trajectories in video sequences. Multiple criteria, such as\ndissimilarity in object speeds or trajectory shapes, can be used to detect a greater range of anomalies\nthan any single criterion. In order to perform anomaly detection using these multiple criteria, one\ncould first combine the dissimilarities using a linear combination. However, in many applications,\nthe importance of the criteria is not known in advance. 
It is difficult to determine how much weight\nto assign to each dissimilarity measure, so one may have to choose multiple weights using, for example,\na grid search. Furthermore, when the weights are changed, the anomaly detection algorithm\nneeds to be re-executed using the new weights.\nIn this paper we propose a novel non-parametric multi-criteria anomaly detection approach using\nPareto depth analysis (PDA). PDA uses the concept of Pareto optimality to detect anomalies without\nhaving to choose weights for different criteria. Pareto optimality is the typical method for defining\noptimality when there may be multiple conflicting criteria for comparing items. An item is said to\nbe Pareto-optimal if there does not exist another item that is at least as good in all of the criteria and strictly better in at least one. An\nitem that is Pareto-optimal is optimal in the usual sense under some combination, not necessarily\nlinear, of the criteria. Hence, PDA is able to detect anomalies under multiple combinations of the\ncriteria without explicitly forming these combinations.\n\nFigure 1: Left: Illustrative example with 40 training samples (blue x\u2019s) and 2 test samples (red circle\nand triangle) in $\\mathbb{R}^2$. Center: Dyads for the training samples (black dots) along with the first 20 Pareto\nfronts (green lines) under two criteria: |\u2206x| and |\u2206y|. The Pareto fronts induce a partial ordering on\nthe set of dyads. Dyads associated with the test sample marked by the red circle concentrate around\nshallow fronts (near the lower left of the figure). Right: Dyads associated with the test sample\nmarked by the red triangle concentrate around deep fronts.\n\nThe PDA approach involves creating dyads corresponding to dissimilarities between pairs of data\nsamples under all of the dissimilarity measures. Sets of Pareto-optimal dyads, called Pareto fronts,\nare then computed. The first Pareto front (depth one) is the set of non-dominated dyads. 
The second\nPareto front (depth two) is obtained by removing these non-dominated dyads, i.e. peeling off the\nfirst front, and recomputing the first Pareto front of those remaining. This process continues until\nno dyads remain. In this way, each dyad is assigned to a Pareto front at some depth (see Fig. 1 for\nillustration). Nominal and anomalous samples are located near different Pareto front depths; thus\ncomputing the front depths of the dyads corresponding to a test sample can discriminate between\nnominal and anomalous samples. The proposed PDA approach scales linearly in the number of criteria,\nwhich is a significant improvement compared to selecting multiple weights via a grid search,\nwhich scales exponentially in the number of criteria. Under the assumption that the multi-criteria dyads\ncan be modeled as realizations from a smooth K-dimensional density, we provide a mathematical\nanalysis of the behavior of the first Pareto front. This analysis shows in a precise sense that PDA\ncan outperform a test that uses a linear combination of the criteria. Furthermore, this theoretical prediction\nis experimentally validated by comparing PDA to several state-of-the-art anomaly detection\nalgorithms in two experiments involving both synthetic and real data sets.\nThe rest of this paper is organized as follows. We discuss related work in Section 2. In Section 3 we\nprovide an introduction to Pareto fronts and present a theoretical analysis of the properties of the first\nPareto front. Section 4 relates Pareto fronts to the multi-criteria anomaly detection problem, which\nleads to the PDA anomaly detection algorithm. Finally, we present two experiments in Section 5 to\nevaluate the performance of PDA.\n\n2 Related work\n\nSeveral machine learning methods utilizing Pareto optimality have previously been proposed; an\noverview can be found in [3]. 
These methods typically formulate machine learning problems as\nmulti-objective optimization problems where finding even the first Pareto front is quite difficult.\nThese methods differ from our use of Pareto optimality because we consider multiple Pareto fronts\ncreated from a finite set of items, so we do not need to employ sophisticated methods in order to find\nthese fronts.\nHero and Fleury [4] introduced a method for gene ranking using Pareto fronts that is related to our\napproach. The method ranks genes, in order of interest to a biologist, by creating Pareto fronts of\nthe data samples, i.e. the genes. In this paper, we consider Pareto fronts of dyads, which correspond\nto dissimilarities between pairs of data samples rather than the samples themselves, and use the\ndistribution of dyads in Pareto fronts to perform multi-criteria anomaly detection rather than ranking.\nAnother related area is multi-view learning [5, 6], which involves learning from data represented by\nmultiple sets of features, commonly referred to as \u201cviews\u201d. In such a case, training in one view helps to\nimprove learning in another view. The problem of view disagreement, where samples take different\nclasses in different views, has recently been investigated [7]. The views are similar to criteria in\nour problem setting. However, in our setting, different criteria may be orthogonal and could even\ngive contradictory information; hence there may be severe view disagreement. Thus training in one\nview could actually worsen performance in another view, so the problem we consider differs from\nmulti-view learning. A similar area is that of multiple kernel learning [8], which is typically applied\nto supervised learning problems, unlike the unsupervised anomaly detection setting we consider.\nFinally, many other anomaly detection methods have previously been proposed. 
Hodge and Austin\n[1] and Chandola et al. [2] both provide extensive surveys of different anomaly detection methods\nand applications. Nearest neighbor-based methods are closely related to the proposed PDA approach.\nByers and Raftery [9] proposed to use the distance between a sample and its kth-nearest\nneighbor as the anomaly score for the sample; similarly, Angiulli and Pizzuti [10] and Eskin et al.\n[11] proposed to use the sum of the distances between a sample and its k nearest neighbors.\nBreunig et al. [12] used an anomaly score based on the local density of the k nearest neighbors\nof a sample. Hero [13] and Sricharan and Hero [14] introduced non-parametric adaptive anomaly\ndetection methods using geometric entropy minimization, based on random k-point minimal spanning\ntrees and bipartite k-nearest neighbor (k-NN) graphs, respectively. Zhao and Saligrama [15]\nproposed an anomaly detection algorithm k-LPE using local p-value estimation (LPE) based on a\nk-NN graph. These k-NN anomaly detection schemes only depend on the data through the pairs of\ndata points (dyads) that define the edges in the k-NN graphs.\nAll of the aforementioned methods are designed for single-criteria anomaly detection. In the multi-criteria\nsetting, the single-criteria algorithms must be executed multiple times with different weights,\nunlike the PDA anomaly detection algorithm that we propose in Section 4.\n\n3 Pareto depth analysis\n\nThe PDA method proposed in this paper utilizes the notion of Pareto optimality, which has been\nstudied in many application areas in economics, computer science, and the social sciences among\nothers [16]. We introduce Pareto optimality and define the notion of a Pareto front.\nConsider the following problem: given n items, denoted by the set S, and K criteria for evaluating\neach item, denoted by functions f1, . . . , fK, select x \u2208 S that minimizes [f1(x), . . . , fK(x)]. 
In\nmost settings, it is not possible to identify a single item x that simultaneously minimizes fi(x)\nfor all i \u2208 {1, . . . , K}. A minimizer can be found by combining the K criteria using a linear\ncombination of the fi\u2019s and finding the minimum of the combination. Different choices of (non-negative)\nweights in the linear combination could result in different minimizers; a set of items that\nare minimizers under some linear combination can then be created by using a grid search over the\nweights, for example.\nA more powerful approach involves finding the set of Pareto-optimal items. An item x is said to\nstrictly dominate another item x\u2217 if x is no greater than x\u2217 in each criterion and x is less than\nx\u2217 in at least one criterion. This relation can be written as x \u227b x\u2217 if fi(x) \u2264 fi(x\u2217) for each i\nand fi(x) < fi(x\u2217) for some i. The set of Pareto-optimal items, called the Pareto front, is the set\nof items in S that are not strictly dominated by another item in S. It contains all of the minimizers\nthat are found using linear combinations, but also includes other items that cannot be found by linear\ncombinations. Denote the Pareto front by F1, which we call the first Pareto front. The second Pareto\nfront can be constructed by finding items that are not strictly dominated by any of the remaining\nitems, which are members of the set S \\ F1. More generally, define the ith Pareto front by\n\n$$F_i = \\text{Pareto front of the set } S \\setminus \\left( \\bigcup_{j=1}^{i-1} F_j \\right).$$\n\nFor convenience, we say that a Pareto front Fi is deeper than Fj if i > j.\n\n3.1 Mathematical properties of Pareto fronts\n\nThe distribution of the number of points on the first Pareto front was first studied by Barndorff-Nielsen\nand Sobel in their seminal work [17]. The problem has garnered much attention since; for a\nsurvey of recent results see [18]. 
We will be concerned here with properties of the first Pareto front\nthat are relevant to the PDA anomaly detection algorithm and thus have not yet been considered in\nthe literature. Let Y1, . . . , Yn be independent and identically distributed (i.i.d.) on $\\mathbb{R}^d$ with density\nfunction $f : \\mathbb{R}^d \\to \\mathbb{R}$. For a measurable set $A \\subset \\mathbb{R}^d$, we denote by $F_A$ the points on the first Pareto\nfront of Y1, . . . , Yn that belong to A. For simplicity, we will denote F1 by F and use |F| for the\ncardinality of F. In the general Pareto framework, the points Y1, . . . , Yn are the images in $\\mathbb{R}^d$ of n\nfeasible solutions to some optimization problem under a vector of objective functions of length d.\nIn the context of this paper, each point Yl corresponds to a dyad Dij, which we define in Section 4,\nand d = K is the number of criteria. A common approach in multi-objective optimization is linear\nscalarization [16], which constructs a new single criterion as a convex combination of the d criteria.\nIt is well-known, and easy to see, that linear scalarization will only identify Pareto points on the\nboundary of the convex hull of $\\bigcup_{x \\in F} (x + \\mathbb{R}^d_+)$, where $\\mathbb{R}^d_+ = \\{x \\in \\mathbb{R}^d \\mid x_i \\geq 0,\\ i = 1, \\ldots, d\\}$.\nAlthough this is a common motivation for Pareto methods, there are, to the best of our knowledge,\nno results in the literature regarding how many points on the Pareto front are missed by scalarization.\nWe present such a result here. We define\n\n$$L = \\bigcup_{\\alpha \\in \\mathbb{R}^d_+} \\operatorname*{argmin}_{x \\in S_n} \\left\\{ \\sum_{i=1}^{d} \\alpha_i x_i \\right\\}, \\quad S_n = \\{Y_1, \\ldots, Y_n\\}.$$\n\nThe subset L \u2282 F contains all Pareto-optimal points that can be obtained by some selection of\nweights for linear scalarization. 
We aim to study how large L can get, compared to F, in expectation.\nIn the context of this paper, if some Pareto-optimal points are not identified, then the anomaly\nscore (defined in Section 4.2) will be artificially inflated, making it more likely that a non-anomalous\nsample will be rejected. Hence the size of F \\ L is a measure of how much the anomaly score is\ninflated and the degree to which Pareto methods will outperform linear scalarization.\nPareto points in F \\ L are a result of non-convexities in the Pareto front. We study two kinds of\nnon-convexities: those induced by the geometry of the domain of Y1, . . . , Yn, and those induced by\nrandomness. We first consider the geometry of the domain. Let $\\Omega \\subset \\mathbb{R}^d$ be bounded and open with\na smooth boundary \u2202\u2126 and suppose the density f vanishes outside of \u2126. For a point z \u2208 \u2202\u2126 we\ndenote by \u03bd(z) = (\u03bd1(z), . . . , \u03bdd(z)) the unit inward normal to \u2202\u2126. For T \u2282 \u2202\u2126, define $T_h \\subset \\Omega$ by\n$T_h = \\{z + t\\nu \\mid z \\in T,\\ 0 < t \\leq h\\}$. Given h > 0 it is not hard to see that all Pareto-optimal points\nwill almost surely lie in $\\partial\\Omega_h$ for large enough n, provided the density f is strictly positive on $\\partial\\Omega_h$.\nHence it is enough to study the asymptotics for $E|F_{T_h}|$ for T \u2282 \u2202\u2126 and h > 0.\nTheorem 1. Let $f \\in C^1(\\Omega)$ with $\\inf_\\Omega f > 0$. Let T \u2282 \u2202\u2126 be open and connected such that\n\n$$\\inf_{z \\in T} \\min(\\nu_1(z), \\ldots, \\nu_d(z)) \\geq \\delta > 0, \\quad \\text{and} \\quad \\{y \\in \\Omega : y \\preceq x\\} = \\{x\\} \\quad \\text{for } x \\in T.$$\n\nThen for h > 0 sufficiently small, we have\n\n$$E|F_{T_h}| = \\gamma n^{\\frac{d-1}{d}} + \\delta^{-d-1} O\\left(n^{\\frac{d-2}{d}}\\right) \\quad \\text{as } n \\to \\infty,$$\n\nwhere\n\n$$\\gamma = d^{-1} (d!)^{\\frac{1}{d}} \\Gamma(d^{-1}) \\int_T f(z)^{\\frac{d-1}{d}} (\\nu_1(z) \\cdots \\nu_d(z))^{\\frac{1}{d}} \\, dz.$$\n\nThe proof of Theorem 1 is postponed to Section 1 of the supplementary material. Theorem 1 shows\nasymptotically how many Pareto points are contributed on average by the segment T \u2282 \u2202\u2126. The\nnumber of points contributed depends only on the geometry of \u2202\u2126 through the direction of its normal\nvector \u03bd and is otherwise independent of the convexity of \u2202\u2126. Hence, by using Pareto methods, we\nwill identify significantly more Pareto-optimal points than linear scalarization when the geometry\nof \u2202\u2126 includes non-convex regions. For example, if T \u2282 \u2202\u2126 is non-convex (see left panel of\nFigure 2) and satisfies the hypotheses of Theorem 1, then for large enough n, all Pareto points in\na neighborhood of T will be unattainable by scalarization. Quantitatively, if f \u2265 C on T, then\n$E|F \\setminus L| \\geq \\gamma n^{\\frac{d-1}{d}} + \\delta^{-d-1} O(n^{\\frac{d-2}{d}})$ as $n \\to \\infty$, where $\\gamma \\geq d^{-1} (d!)^{\\frac{1}{d}} \\Gamma(d^{-1}) |T| \\delta C^{\\frac{d-1}{d}}$ and |T|\nis the (d \u2212 1)-dimensional Hausdorff measure of T. It has recently come to our attention that Theorem\n1 appears in a more general form in an unpublished manuscript of Baryshnikov and Yukich [19].\nWe now study non-convexities in the Pareto front which occur due to inherent randomness in the\nsamples. We show that, even in the case where \u2126 is convex, there are still numerous small-scale\nnon-convexities in the Pareto front that can only be detected by Pareto methods. We illustrate this in\nthe case of the Pareto box problem for d = 2.\n\nFigure 2: Left: Non-convexities in the Pareto front induced by the geometry of the domain \u2126 (Theorem 1).\nRight: Non-convexities due to randomness in the samples (Theorem 2). In each case, the\nlarger points are Pareto-optimal, and the large black points cannot be obtained by scalarization.\n\nTheorem 2. Let Y1, . . . , Yn be independent and uniformly distributed on $[0, 1]^2$. Then\n\n$$\\frac{1}{2} \\ln n + O(1) \\leq E|L| \\leq \\frac{5}{6} \\ln n + O(1) \\quad \\text{as } n \\to \\infty.$$\n\nThe proof of Theorem 2 is also postponed to Section 1 of the supplementary material. A proof that\nE|F| = ln n + O(1) as n \u2192 \u221e can be found in [17]. Hence Theorem 2 shows that, asymptotically\nand in expectation, only between 1/2 and 5/6 of the Pareto-optimal points can be obtained by linear\nscalarization in the Pareto box problem. Experimentally, we have observed that the true fraction of\npoints is close to 0.7. This means that at least 1/6 (and likely more) of the Pareto points can only be\nobtained via Pareto methods even when \u2126 is convex. Figure 2 gives an example of the sets F and L\nfrom the two theorems.\n\n4 Multi-criteria anomaly detection\nAssume that a training set XN = {X1, . . . , XN} of nominal data samples is available. Given a test\nsample X, the objective of anomaly detection is to declare X to be an anomaly if X is significantly\ndifferent from samples in XN. Suppose that K > 1 different evaluation criteria are given. Each criterion\nis associated with a measure for computing dissimilarities. Denote the dissimilarity between\nXi and Xj computed using the measure corresponding to the lth criterion by dl(i, j).\nWe define a dyad by $D_{ij} = [d_1(i, j), \\ldots, d_K(i, j)]^T \\in \\mathbb{R}^K_+$, $i \\in \\{1, \\ldots, N\\}$, $j \\in \\{1, \\ldots, N\\} \\setminus i$.\nEach dyad Dij corresponds to a connection between samples Xi and Xj. 
Therefore, there are in\ntotal $\\binom{N}{2}$ different dyads. For convenience, denote the set of all dyads by $\\mathcal{D}$ and the space of all\ndyads $\\mathbb{R}^K_+$ by $\\mathbb{D}$. By the definition of strict dominance in Section 3, a dyad Dij strictly dominates\nanother dyad Di\u2217j\u2217 if dl(i, j) \u2264 dl(i\u2217, j\u2217) for all l \u2208 {1, . . . , K} and dl(i, j) < dl(i\u2217, j\u2217) for some\nl. The first Pareto front F1 corresponds to the set of dyads from $\\mathcal{D}$ that are not strictly dominated by\nany other dyads from $\\mathcal{D}$. The second Pareto front F2 corresponds to the set of dyads from $\\mathcal{D} \\setminus F_1$\nthat are not strictly dominated by any other dyads from $\\mathcal{D} \\setminus F_1$, and so on, as defined in Section 3.\nRecall that we refer to Fi as a deeper front than Fj if i > j.\n\n4.1 Pareto fronts of dyads\nFor each sample Xn, there are N \u2212 1 dyads corresponding to its connections with the other N \u2212 1\nsamples. Define the set of N \u2212 1 dyads associated with Xn by $\\mathcal{D}_n$. If most dyads in $\\mathcal{D}_n$ are located\nat shallow Pareto fronts, then the dissimilarities between Xn and the other N \u2212 1 samples are small\nunder some combination of the criteria. Thus, Xn is likely to be a nominal sample. This is the basic\nidea of the proposed multi-criteria anomaly detection method using PDA.\nWe construct Pareto fronts F1, . . . , FM of the dyads from the training set, where the total number\nof fronts M is the required number of fronts such that each dyad is a member of a front. When a test\nsample X is obtained, we create new dyads corresponding to connections between X and training\nsamples, as illustrated in Figure 1. Similar to many other anomaly detection methods, we connect\neach test sample to its k nearest neighbors. k could be different for each criterion, so we denote ki\nas the choice of k for criterion i. We create $s = \\sum_{i=1}^{K} k_i$ new dyads, which we denote by the set\n$D^{\\text{new}} = \\{D^{\\text{new}}_1, D^{\\text{new}}_2, \\ldots, D^{\\text{new}}_s\\}$, corresponding to the connections between X and the union of the\nki nearest neighbors in each criterion i. In other words, we create a dyad between X and Xj if Xj\nis among the ki nearest neighbors\u00b9 of X in any criterion i. We say that $D^{\\text{new}}_i$ is below a front Fl if\n$D^{\\text{new}}_i \\succ D_l$ for some Dl \u2208 Fl, i.e. $D^{\\text{new}}_i$ strictly dominates at least a single dyad in Fl. Define the\ndepth of $D^{\\text{new}}_i$ by\n\n$$e_i = \\min\\{l \\mid D^{\\text{new}}_i \\text{ is below } F_l\\}.$$\n\nTherefore if ei is large, then $D^{\\text{new}}_i$ will be near deep fronts, and the distance between X and the\ncorresponding training sample is large under all combinations of the K criteria. If ei is small, then\n$D^{\\text{new}}_i$ will be near shallow fronts, so the distance between X and the corresponding training sample\nis small under some combination of the K criteria.\n\nAlgorithm 1 PDA anomaly detection algorithm.\nTraining phase:\n1: for l = 1 \u2192 K do\n2:     Calculate pairwise dissimilarities dl(i, j) between all training samples Xi and Xj\n3: Create dyads Dij = [d1(i, j), . . . , dK(i, j)] for all training samples\n4: Construct Pareto fronts on set of all dyads until each dyad is in a front\nTesting phase:\n1: nb \u2190 [ ] {empty list}\n2: for l = 1 \u2192 K do\n3:     Calculate dissimilarities between test sample X and all training samples in criterion l\n4:     nbl \u2190 kl nearest neighbors of X\n5:     nb \u2190 [nb, nbl] {append neighbors to list}\n6: Create s new dyads $D^{\\text{new}}_i$ between X and training samples in nb\n7: for i = 1 \u2192 s do\n8:     Calculate depth ei of $D^{\\text{new}}_i$\n9: Declare X an anomaly if $v(X) = (1/s) \\sum_{i=1}^{s} e_i > \\sigma$\n\n4.2 Anomaly detection using depths of dyads\n\nIn k-NN based anomaly detection algorithms such as those mentioned in Section 2, the anomaly\nscore is a function of the k nearest neighbors to a test sample. 
With multiple criteria, one could define\nan anomaly score by scalarization. From the probabilistic properties of Pareto fronts discussed\nin Section 3.1, we know that Pareto methods identify more Pareto-optimal points than linear scalarization\nmethods and significantly more Pareto-optimal points than a single weight for scalarization\u00b2.\nThis motivates us to develop a multi-criteria anomaly score using Pareto fronts. We start with the\nobservation from Figure 1 that dyads corresponding to a nominal test sample are typically located\nnear shallower fronts than dyads corresponding to an anomalous test sample. Each test sample is\nassociated with s new dyads, where the ith dyad $D^{\\text{new}}_i$\nhas depth ei. For each test sample X, we\ndefine the anomaly score v(X) to be the mean of the ei\u2019s, which corresponds to the average depth\nof the s dyads associated with X. Thus the anomaly score can be easily computed and compared to\nthe decision threshold \u03c3 using the test\n\n$$v(X) = \\frac{1}{s} \\sum_{i=1}^{s} e_i \\underset{H_0}{\\overset{H_1}{\\gtrless}} \\sigma.$$\n\nPseudocode for the PDA anomaly detector is shown in Algorithm 1. In Section 3 of the supplementary\nmaterial we provide details of the implementation as well as an analysis of the time complexity\nand a heuristic for choosing the ki\u2019s that performs well in practice. Both the training time and the\ntime required to test a new sample using PDA are linear in the number of criteria K. To handle\nmultiple criteria, other anomaly detection methods, such as the ones mentioned in Section 2, need\nto be re-executed multiple times using different (non-negative) linear combinations of the K criteria.\nIf a grid search is used for selection of the weights in the linear combination, then the required\ncomputation time would be exponential in K. Such an approach presents a computational problem\nunless K is very small. Since PDA scales linearly with K, it does not encounter this problem.\n\n\u00b9 If a training sample is one of the ki nearest neighbors in multiple criteria, then multiple copies of the dyad\ncorresponding to the connection between the test sample and the training sample are created.\n\n\u00b2 Theorems 1 and 2 require i.i.d. samples, but dyads are not independent. However, there are $O(N^2)$ dyads,\nand each dyad is only dependent on O(N) other dyads. This suggests that the theorems should also hold for the\nnon-i.i.d. dyads, and it is supported by experimental results presented in Section 2 of the supplementary\nmaterial.\n\nTable 1: AUC comparison of different methods for both experiments. Best AUC is shown in bold.\nPDA does not require selecting weights so it has a single AUC. The median and best AUCs (over all\nchoices of weights selected by grid search) are shown for the other four methods. PDA outperforms\nall of the other methods, even for the best weights, which are not known in advance.\n\n(a) Four-criteria simulation (\u00b1 standard error)\n\nMethod      AUC by weight (Median)    AUC by weight (Best)\nPDA         0.948 \u00b1 0.002 (single AUC, no weights)\nk-NN        0.848 \u00b1 0.004             0.919 \u00b1 0.003\nk-NN sum    0.854 \u00b1 0.003             0.916 \u00b1 0.003\nk-LPE       0.847 \u00b1 0.004             0.919 \u00b1 0.003\nLOF         0.845 \u00b1 0.003             0.932 \u00b1 0.003\n\n(b) Pedestrian trajectories\n\nMethod      AUC by weight (Median)    AUC by weight (Best)\nPDA         0.915 (single AUC, no weights)\nk-NN        0.883                     0.906\nk-NN sum    0.894                     0.911\nk-LPE       0.893                     0.908\nLOF         0.839                     0.863\n\n5 Experiments\n\nWe compare the PDA method with four other nearest neighbor-based single-criterion anomaly detection\nalgorithms mentioned in Section 2. For these methods, we use linear combinations of the\ncriteria with different weights selected by grid search to compare performance with PDA.\n\n5.1 Simulated data with four criteria\n\nFirst we present an experiment on a simulated data set. 
The nominal distribution is given by the\nuniform distribution on the hypercube $[0, 1]^4$. The anomalous samples are located just outside of\nthis hypercube. There are four classes of anomalous distributions. Each class differs from the\nnominal distribution in one of the four dimensions; the distribution in the anomalous dimension is\nuniform on [1, 1.1]. We draw 300 training samples from the nominal distribution followed by 100\ntest samples from a mixture of the nominal and anomalous distributions with a 0.05 probability of\nselecting any particular anomalous distribution. The four criteria for this experiment correspond to\nthe squared differences in each dimension. If the criteria are combined using linear combinations,\nthe combined dissimilarity measure reduces to weighted squared Euclidean distance.\nThe different methods are evaluated using the receiver operating characteristic (ROC) curve and\nthe area under the curve (AUC). The mean AUCs (with standard errors) over 100 simulation runs\nare shown in Table 1(a). A grid of six points between 0 and 1 in each criterion, corresponding to\n$6^4 = 1296$ different sets of weights, is used to select linear combinations for the single-criterion\nmethods. Note that PDA is the best performer, outperforming even the best linear combination.\n\n5.2 Pedestrian trajectories\n\nWe now present an experiment on a real data set that contains thousands of pedestrians\u2019 trajectories\nin an open area monitored by a video camera [20]. Each trajectory is approximated by a cubic spline\ncurve with seven control points [21]. We represent a trajectory with l time samples by\n\n$$T = \\begin{bmatrix} x_1 & x_2 & \\cdots & x_l \\\\ y_1 & y_2 & \\cdots & y_l \\end{bmatrix},$$\n\nwhere [xt, yt] denotes a pedestrian\u2019s position at time step t.\n\nFigure 3: Left: ROC curves for PDA and attainable region for k-LPE over 100 choices of weights.\nPDA outperforms k-LPE even under the best choice of weights. 
Right: A subset of the dyads for the\ntraining samples along with the first 100 Pareto fronts. The fronts are highly non-convex, partially\nexplaining the superior performance of PDA.\n\nWe use two criteria for computing the dissimilarity between trajectories. The first criterion is to\ncompute the dissimilarity in walking speed. We compute the instantaneous speed at all time steps\nalong each trajectory by finite differencing, i.e. the speed of trajectory T at time step t is given\nby $\\sqrt{(x_t - x_{t-1})^2 + (y_t - y_{t-1})^2}$. A histogram of speeds for each trajectory is obtained in this\nmanner. We take the dissimilarity between two trajectories to be the squared Euclidean distance\nbetween their speed histograms. The second criterion is to compute the dissimilarity in shape. For\neach trajectory, we select 100 points, uniformly positioned along the trajectory. The dissimilarity\nbetween two trajectories T and T\u2032 is then given by the sum of squared Euclidean distances between\nthe positions of T and T\u2032 over all 100 points.\nThe training sample for this experiment consists of 500 trajectories, and the test sample consists of\n200 trajectories. Table 1(b) shows the performance of PDA as compared to the other algorithms\nusing 100 uniformly spaced weights for linear combinations. Notice that PDA has higher AUC than\nthe other methods under all choices of weights for the two criteria. For a more detailed comparison,\nthe ROC curve for PDA and the attainable region for k-LPE (the region between the ROC curves\ncorresponding to weights resulting in the best and worst AUCs) are shown in Figure 3 along with\nthe first 100 Pareto fronts for PDA. 
k-LPE performs slightly better at low false positive rates when\nthe best weights are used, but PDA performs better in all other situations, resulting in higher AUC.\nAdditional discussion on this experiment can be found in Section 4 of the supplementary material.\n\n6 Conclusion\n\nIn this paper we proposed a new multi-criteria anomaly detection method. The proposed method\nuses Pareto depth analysis to compute the anomaly score of a test sample by examining the Pareto\nfront depths of dyads corresponding to the test sample. Dyads corresponding to an anomalous\nsample tended to be located at deeper fronts than dyads corresponding to a nominal sample.\nInstead of choosing a specific weighting or performing a grid search on the weights for different\ndissimilarity measures, the proposed method can efficiently detect anomalies in a manner that scales\nlinearly in the number of criteria. We also provided a theorem establishing that the Pareto approach\nis asymptotically better than using linear combinations of criteria. Numerical studies validated our\ntheoretical predictions of PDA\u2019s performance advantages on simulated and real data.\n\nAcknowledgments\n\nWe thank Zhaoshi Meng for his assistance in labeling the pedestrian trajectories. We also thank\nDaniel DeWoskin for suggesting a fast algorithm for computing Pareto fronts in two criteria. This\nwork was supported in part by ARO grant W911NF-09-1-0310.\n\n[Figure 3 plots. Left: ROC curves, false positive rate vs. true positive rate, for the PDA method, k-LPE with best AUC weight, k-LPE with worst AUC weight, and the attainable region of k-LPE. Right: walking speed dissimilarity vs. shape dissimilarity.]\n\nReferences\n\n[1] V. J. Hodge and J. Austin (2004). A survey of outlier detection methodologies. Artificial Intelligence Review 22(2):85\u2013126.\n\n[2] V. Chandola, A. Banerjee, and V. Kumar (2009). Anomaly detection: A survey. 
ACM Computing Surveys 41(3):1\u201358.\n\n[3] Y. Jin and B. Sendhoff (2008). Pareto-based multiobjective machine learning: An overview and case studies. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 38(3):397\u2013415.\n\n[4] A. O. Hero III and G. Fleury (2004). Pareto-optimal methods for gene ranking. The Journal of VLSI Signal Processing 38(3):259\u2013275.\n\n[5] A. Blum and T. Mitchell (1998). Combining labeled and unlabeled data with co-training. In Proceedings of the 11th Annual Conference on Computational Learning Theory.\n\n[6] V. Sindhwani, P. Niyogi, and M. Belkin (2005). A co-regularization approach to semi-supervised learning with multiple views. In Proceedings of the Workshop on Learning with Multiple Views, 22nd International Conference on Machine Learning.\n\n[7] C. M. Christoudias, R. Urtasun, and T. Darrell (2008). Multi-view learning in the presence of view disagreement. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.\n\n[8] M. G\u00f6nen and E. Alpayd\u0131n (2011). Multiple kernel learning algorithms. Journal of Machine Learning Research 12(Jul):2211\u20132268.\n\n[9] S. Byers and A. E. Raftery (1998). Nearest-neighbor clutter removal for estimating features in spatial point processes. Journal of the American Statistical Association 93(442):577\u2013584.\n\n[10] F. Angiulli and C. Pizzuti (2002). Fast outlier detection in high dimensional spaces. In Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery.\n\n[11] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo (2002). A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. In Applications of Data Mining in Computer Security. Kluwer: Norwell, MA.\n\n[12] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander (2000). LOF: Identifying density-based local outliers. 
In Proceedings of the ACM SIGMOD International Conference on Management of Data.\n\n[13] A. O. Hero III (2006). Geometric entropy minimization (GEM) for anomaly detection and localization. In Advances in Neural Information Processing Systems 19.\n\n[14] K. Sricharan and A. O. Hero III (2011). Efficient anomaly detection using bipartite k-NN graphs. In Advances in Neural Information Processing Systems 24.\n\n[15] M. Zhao and V. Saligrama (2009). Anomaly detection with score functions based on nearest neighbor graphs. In Advances in Neural Information Processing Systems 22.\n\n[16] M. Ehrgott (2000). Multicriteria optimization. Lecture Notes in Economics and Mathematical Systems 491. Springer-Verlag.\n\n[17] O. Barndorff-Nielsen and M. Sobel (1966). On the distribution of the number of admissible points in a vector random sample. Theory of Probability and its Applications 11(2):249\u2013269.\n\n[18] Z.-D. Bai, L. Devroye, H.-K. Hwang, and T.-H. Tsai (2005). Maxima in hypercubes. Random Structures & Algorithms 27(3):290\u2013309.\n\n[19] Y. Baryshnikov and J. E. Yukich (2005). Maximal points and Gaussian fields. Unpublished. URL http://www.math.illinois.edu/~ymb/ps/by4.pdf.\n\n[20] B. Majecka (2009). Statistical models of pedestrian behaviour in the Forum. Master\u2019s thesis, University of Edinburgh.\n\n[21] R. R. Sillito and R. B. Fisher (2008). Semi-supervised learning for anomalous trajectory detection. In Proceedings of the 19th British Machine Vision Conference.", "award": [], "sourceid": 395, "authors": [{"given_name": "Ko-jen", "family_name": "Hsiao", "institution": null}, {"given_name": "Kevin", "family_name": "Xu", "institution": null}, {"given_name": "Jeff", "family_name": "Calder", "institution": null}, {"given_name": "Alfred", "family_name": "Hero", "institution": null}]}