{"title": "PIDForest: Anomaly Detection via Partial Identification", "book": "Advances in Neural Information Processing Systems", "page_first": 15809, "page_last": 15819, "abstract": "We consider the problem of detecting anomalies in a large dataset. We propose a framework called Partial Identification which captures the intuition that anomalies are easy to distinguish from the overwhelming majority of points by relatively few attribute values. Formalizing this intuition, we propose a geometric anomaly measure for a point that we call PIDScore, which measures the minimum density of data points over all subcubes containing the point. We present PIDForest: a random forest based algorithm that finds anomalies based on this definition. We show that it performs favorably in comparison to several popular anomaly detection methods, across a broad range of benchmarks. PIDForest also provides a succinct explanation for why a point is labelled anomalous, by providing a set of features and ranges for them which are relatively uncommon in the dataset.", "full_text": "PIDForest: Anomaly Detection via Partial\n\nIdenti\ufb01cation\n\nParikshit Gopalan\nVMware Research\n\npgopalan@vmware.com\n\nVatsal Sharan\n\nStanford University\n\nvsharan@stanford.edu\n\nUdi Wieder\n\nVMware Research\n\nuwieder@vmware.com\n\nAbstract\n\nWe consider the problem of detecting anomalies in a large dataset. We propose a\nframework called Partial Identi\ufb01cation which captures the intuition that anomalies\nare easy to distinguish from the overwhelming majority of points by relatively\nfew attribute values. Formalizing this intuition, we propose a geometric anomaly\nmeasure for a point that we call PIDScore, which measures the minimum density\nof data points over all subcubes containing the point. We present PIDForest: a\nrandom forest based algorithm that \ufb01nds anomalies based on this de\ufb01nition. 
We show that it performs favorably in comparison to several popular anomaly detection methods, across a broad range of benchmarks. PIDForest also provides a succinct explanation for why a point is labelled anomalous, by providing a set of features and ranges for them which are relatively uncommon in the dataset.1\n\n1 Introduction\n\nAn anomaly in a dataset is a point that does not conform to what is normal or expected. Anomaly detection is a ubiquitous machine learning task with diverse applications including network monitoring, medicine and finance [1, Section 3]. There is an extensive body of research devoted to it, see [1, 2] and the references therein. Our work is primarily motivated by the emergence of large distributed systems, like the modern data center, which produce massive amounts of heterogeneous data. Operators need to constantly monitor this data, and use it to identify and troubleshoot problems. The volumes of data involved are so large that much of the analysis has to be automated. Here we highlight some of the challenges that an anomaly detection algorithm must face.\n\n1. High dimensional, heterogeneous data: The data collected could contain measurements of metrics like CPU usage, memory, bandwidth and temperature, in addition to categorical data such as day of the week, geographic location and OS type. This makes finding an accurate generative model for the data challenging. The metrics might be captured in different units, hence algorithms that are unit-agnostic are preferable. The algorithm needs to scale to high dimensional data.\n\n2. Scarce labels: Most of the data are unlabeled. Generating labels is time and effort intensive and requires domain knowledge. Hence supervised methods are a non-starter, and even tuning too many hyper-parameters of unsupervised algorithms could be challenging.\n\n3. 
Irrelevant attributes: Often an anomaly manifests itself in a relatively small number of attributes among the large number being monitored. For instance, a single machine in a large datacenter might be compromised and behave abnormally.\n\n4. Interpretability of results: When we alert a datacenter administrator to a potential anomaly, it helps to point to a few metrics that might have triggered it, to help in troubleshooting.\n\n1The full version of this paper is available at https://arxiv.org/abs/1912.03582.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nIn the generative model setting, anomalies come with a simple explanation: a model that fits the data, under which the anomalous observation is unlikely. Interpretability is more challenging for algorithms that do not assume a generative model. In this work, we are particularly interested in random forest based methods for anomaly detection, namely the influential work on Isolation Forests [3] (we refer to this algorithm as iForest) and subsequent work [4, 5]. iForest is a remarkably simple and efficient algorithm that has been found to outperform other anomaly detection methods in several domains [6]. Yet, there is no crisp definition of ground truth for what constitutes an anomaly: the anomaly score is more or less defined as the output of the algorithm. We believe that a necessary step towards interpretability is a clear articulation of what an anomaly is, separate from the algorithmic question of how it is found.\n\nOur contributions. We summarize the main contributions of this work:\n\n1. In Section 2, we motivate and propose a new anomaly measure that we call PIDScore. Our definition corresponds to an intuitive notion of what is an anomaly and has a natural geometric interpretation. 
It is inspired by the notion of Partial Identification introduced by Wigderson and Yehudayoff [7], and can be viewed as a natural generalization of teaching dimension [8, 9].\n\n2. Our definition sheds light on the types of points likely to be labeled as anomalies by the iForest algorithm, and also on the types of points it might miss. We build on this intuition to design an efficient random forest based algorithm, PIDForest, which finds anomalies according to PIDScore, in Section 3.\n\n3. We present extensive experiments on real and synthetic data sets showing that our algorithm consistently outperforms or matches six popular anomaly detection algorithms. PIDForest is the top performing algorithm in 6 out of 12 benchmark real-world datasets, while no other algorithm is the best in more than 3. PIDForest is also resilient to noise and irrelevant attributes. These results are in Sections 4 and 5.\n\nWe begin by describing our proposed anomaly measure, PIDScore, at a high level. Let the sparsity of a dataset T in a subcube of the attribute space be the volume of the subcube divided by the number of points from T that it contains. For a dataset T and a point x, PIDScore(x, T) measures the maximum sparsity of T over all subcubes C containing x. A point x is labelled anomalous if it belongs to a region of the attribute space where data points are sparse.\n\nGiven this definition, one could aim for an algorithm that preprocesses T, then takes a point x and computes PIDScore(x, T). Such an algorithm is likely to suffer from the curse of dimensionality, like Nearest Neighbor based methods, and not scale to high volumes of data. Instead we adopt the approach of iForest [3], which focuses on what is anomalous rather than on the entire dataset. We call the resulting algorithm PIDForest. Like iForest, PIDForest builds a collection of decision trees that partition space into subcubes. 
In PIDForest, the choice of the splits at each node favors partitions of greatly varying sparsity; the variance in the sparsity is explicitly the quantity we optimize when choosing a split. In contrast, previous works either choose splits randomly [3] or based on the range [4]. Choosing coordinates that have greater variance in their marginal distribution lets us hone in on the important coordinates, and makes our algorithm robust to irrelevant/noisy attributes, which are unlikely to be chosen. Secondly, we label each leaf by its sparsity rather than its depth in the tree. The score of a point is the maximum sparsity over all leaves reached in the forest.\n\nWhile notions of density have been used in previous works on clustering and anomaly detection, our approach differs from prior work in important ways.\n\n1. Dealing with heterogeneous attributes: Dealing with subcubes and volumes allows us to handle heterogeneous data where some columns are real, and some are categorical and possibly unordered. All we need is to specify two things for each coordinate: what constitutes an interval, and how length is measured. Subcubes and volumes are then defined as products over coordinates. This is in sharp contrast to methods that assume a metric. Notions like ℓ1/ℓ2 distance add different coordinates and might not be natural in heterogeneous settings.\n\n2. Scale invariance: For a subcube, we only care about the ratio of its volume to the volume of the entire attribute space. Hence we are not sensitive to the units of measurement.\n\n3. Considering subcubes at all scales: In previous works, density is computed using balls of a fixed radius; this radius is typically a hyperparameter. This makes the algorithm susceptible to masking, since there may be a dense cluster of points, all of which are anomalous. We take the minimum over subcubes at all scales.\n\nOrganization. 
We present the definition of PIDScore in Section 2, and the PIDForest algorithm (with a detailed comparison to iForest) in Section 3. We present experiments on real world data in Section 4 and synthetic data in Section 5. We further discuss related work in Section A in the Appendix (which is included in the supplementary material).\n\n2 Partial Identification and PIDScore\n\nA motivating example: Anomalous Animals. Imagine a tabular data set that contains a row for every animal on the planet. Each row then contains attribute information about the animal such as the species, color, weight, age and so forth. The rows are ordered. Say that Alice wishes to identify a particular animal in the table unambiguously to Bob, using the fewest number of bits.\n\nIf the animal happens to be a white elephant, then Alice is in luck. Just specifying the species and color narrows the list of candidates down to about fifty (as per Wikipedia). At this point, specifying one more attribute like weight or age will probably pin the animal down uniquely. Or she can just specify its order in the list.\n\nIf the animal in question happens to be a white rabbit, then it might be far harder to uniquely identify, since there are tens of millions of white rabbits, unless that animal happens to have some other distinguishing features. Since weight and age are numeric rather than categorical attributes, if one could measure them to arbitrary precision, one might be able to uniquely identify each specimen. However, the higher the precision, the more bits Alice needs to communicate to specify the animal.\n\nWe will postulate a formal definition of anomaly score, drawing on the following intuitions:\n\n1. Anomalies have short descriptions. The more exotic/anomalous the animal Alice has in mind, the more it stands out from the crowd and the easier it is for her to convey it to Bob. 
Constraining just a few carefully chosen attributes sets anomalies apart from the vast majority of the population.\n\n2. Precision matters in real values. For real-valued attributes, it makes sense to specify a range in which the value lies. For anomalous points, this range might not need to be very narrow, but for normal points, we might need more precision.\n\n3. Isolation may be overkill. The selected attributes need not suffice for complete isolation. Partial identification, i.e. narrowing the space down to a small list, can be a good indicator of an anomaly.\n\nFirst some notation: let T denote a dataset of n points in d dimensions. Given indices S ⊆ [d] and x ∈ R^d, let x_S denote the projection of x onto the coordinates in S. Logarithms are to base 2.\n\n2.1 The Boolean setting\n\nWe first consider the Boolean setting where the set of points is T ⊆ {0,1}^d. Assume that T has no duplicates. We define idLength(x, T) to be the minimum number of coordinates that must be revealed to uniquely identify x among all points in T. Since there are no duplicates, revealing all coordinates suffices, so idLength(x, T) ≤ d.\n\nDefinition 1 (IDs for a point). We say that S ⊆ [d] is an ID for x ∈ T if x_S ≠ y_S for all y ∈ T \\ {x}. Let ID(x, T) be the smallest ID for x, breaking ties arbitrarily. Let idLength(x, T) = |ID(x, T)|.\n\nWhile on first thought idLength is an appealing measure of anomaly, it does not deal with duplicates, and further, the requirement of unique identification is fairly stringent. Even in simple settings points might not have short IDs. For example, if H is the Hamming ball consisting of 0^d and all d unit vectors, then idLength(0^d, H) = d, since we need to reveal all the coordinates to separate 0^d from every unit vector. 
One can construct examples where even the average value of idLength(x, T) over all points can be surprisingly high [9].\n\nWe relax the definition to allow for partial identification. Given x ∈ T and S ⊆ [d], the set of impostors of x in T are all points that equal x on all coordinates in S. Formally, Imp(x, T, S) = {y ∈ T s.t. x_S = y_S}. We penalize sets that do not identify x uniquely by the logarithm of the number of impostors. The intuition is that this penalty measures how many bits it costs Alice to specify x from the list of impostors.\n\nDefinition 2 (Partial ID). We define\n\nPID(x, T) = arg min_{S ⊆ [d]} (|S| + log2(|Imp(x, T, S)|)),   (1)\n\npidLength(x, T) = min_{S ⊆ [d]} (|S| + log2(|Imp(x, T, S)|)).   (2)\n\nIt follows from the definition that pidLength(x, T) ≤ min(log2(n), idLength(x, T)). The first inequality follows by taking S to be empty so that every point in T is an impostor, the second by taking S = ID(x, T) so that the only impostor is x itself. Returning to the Hamming ball example, it follows that pidLength(0^d, H) = log2(d + 1), where we take the empty set as the PID.\n\nWe present an alternate geometric view of pidLength, which generalizes naturally to other settings. A subcube C of {0,1}^d is the set of points obtained by fixing some subset S ⊆ [d] of coordinates to values in {0, 1}. The sparsity of T in a subcube C is ρ_{0,1}(T, C) = |C|/|C ∩ T|. The notation C ∋ x means that C contains x, hence max_{C ∋ x} is the maximum over all C that contain x. One can show that for x ∈ T, max_{C ∋ x} ρ_{0,1}(T, C) = 2^{d − pidLength(x, T)}; see Appendix D for a proof. This characterization motivates using 2^{−pidLength(x, T)} as an anomaly score: anomalies are points that lie in relatively sparse subcubes. 
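For intuition, pidLength as in Definition 2 can be computed by brute force on tiny Boolean datasets. The sketch below is purely illustrative (it enumerates all coordinate subsets, which is exponential in d, unlike the PIDForest heuristic of Section 3; function and variable names are ours, not from the paper):

```python
from itertools import combinations
from math import log2

def impostors(x, T, S):
    """Points of T that agree with x on every coordinate in S (Imp(x, T, S))."""
    return [y for y in T if all(y[j] == x[j] for j in S)]

def pid_length(x, T):
    """pidLength(x, T) = min over subsets S of |S| + log2(|Imp(x, T, S)|).

    Brute force over all 2^d subsets; only feasible for small d.
    """
    d = len(x)
    best = float("inf")
    for k in range(d + 1):
        for S in combinations(range(d), k):
            best = min(best, k + log2(len(impostors(x, T, S))))
    return best

# Hamming ball example: the origin plus all d unit vectors.
d = 7
H = [tuple(0 for _ in range(d))]
H += [tuple(1 if i == j else 0 for i in range(d)) for j in range(d)]

# The empty set is an optimal PID for the origin: pidLength = log2(d + 1).
print(pid_length(H[0], H))  # log2(8) = 3.0
```

By contrast, idLength of the origin in this dataset is d = 7, so partial identification is much cheaper than full isolation here.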
Low scores come with a natural witness: a sparse subcube PID(x, T) containing relatively few points from T.\n\n2.2 The continuous setting\n\nNow assume that all the coordinates are real-valued and bounded. Without loss of generality, we may assume that they lie in the range [0, 1], hence T is a collection of n points from [0, 1]^d. An interval I = [a, b], 0 ≤ a ≤ b ≤ 1, has length len(I) = b − a. A subcube C is specified by a subset of coordinates S and intervals I_j for each j ∈ S. It consists of all points such that x_j ∈ I_j for all j ∈ S. To simplify our notation, we let C be I_1 × I_2 × ··· × I_d, where I_j = [0, 1] for j ∉ S. Note that vol(C) = ∏_j len(I_j). Define the sparsity of T in C as ρ(T, C) = vol(C)/|C ∩ T|. PIDScore(x, T) is the maximum sparsity over all subcubes of [0, 1]^d containing x.\n\nDefinition 3. For x ∈ T, let\n\nPID(x, T) = arg max_{C ∋ x} ρ(T, C),   PIDScore(x, T) = max_{C ∋ x} ρ(T, C).\n\nTo see the analogy to the Boolean case, define pidLength(x, T) = −log(PIDScore(x, T)). Fix C = PID(x, T). Since vol(C) = ∏_{j ∈ [d]} len(I_j), we can write\n\npidLength(x, T) = log(|C ∩ T|/vol(C)) = ∑_{j ∈ [d]} log(1/len(I_j)) + log(|C ∩ T|).   (3)\n\nThis exposes the similarities to Equation (2): C ∩ T is exactly the set of impostors for x, whereas ∑_{j ∈ [d]} log(1/len(I_j)) is the analog of |S|. In the Boolean setting, we pay 1 for each coordinate from S; here the cost ranges in [0, ∞) depending on the length of the interval. In the continuous setting, j ∉ S iff I_j = [0, 1], hence log(1/len(I_j)) = 0, so we pay nothing for coordinates outside S. Restricting to an interval of length p costs log(1/p). If p = 1/2, we pay 1, which is analogous to the Boolean case where we pay 1 to cut the domain in half. 
This addresses the issue of having to pay more for higher precision. Note also that the definition is scale-invariant, as multiplying a coordinate by a constant changes the volume of all subcubes by the same factor.\n\nOther attributes: To handle attributes over a domain D, we need to specify what subsets of D are intervals and how we measure their length. For discrete attributes, it is natural to define len(I) = |I|/|D|. When the domain is ordered, intervals are naturally defined; for instance, the months between April and September form an interval of length 1/2. We could also allow wraparound in intervals, say months between November and March. For unordered discrete values, the right definition of interval could be singleton sets, like country = Brazil, or certain subsets, like continent = the Americas. The right choice will depend on the dataset. Our definition is flexible enough to handle this: we can make independent choices for each coordinate, subcubes and volumes are then defined as products, and PIDScore can be defined using Definition 3.\n\nIDs and PIDs. The notion of IDs for a point is natural and has been studied in the computational learning literature under various names: the teaching dimension of a hypothesis class [8], discriminant [10], specifying set [11] and witness set [9]. Our work is inspired by the work of Wigderson and Yehudayoff [7] on population recovery, which is the task of learning mixtures of certain discrete distributions on the Boolean cube. Their work introduces the notion of partial identification in the Boolean setting. The terminology of IDs and impostors is from their work. They also consider PIDs, but with a different goal in mind (to minimize the depth of a certain graph constructed using the PID relation). 
Our definition of pidLength, the extension of partial identification to the continuous setting, and its application to anomaly detection are new contributions.\n\n3 The PIDForest algorithm\n\nWe do not know how to compute PIDScore exactly, or even a provable approximation of it, in a way that scales well with both d and n. The PIDForest algorithm described below is a heuristic designed to approximate PIDScore. As with iForest, the PIDForest algorithm builds an ensemble of decision trees; each tree is built using a sample of the data set and partitions the space into subcubes. However, the way the trees are constructed and the criteria by which a point is declared anomalous are very different. Each node of a tree corresponds to a subcube C, and the children of C represent a disjoint partition of C along some axis i ∈ [d] (iForest always splits C into two; here we allow for a finer partition). The goal is to have large variance in the sparsity of the subcubes. Recall that the sparsity of a subcube C with respect to a data set T is ρ(C, T) = vol(C)/|C ∩ T|. Ultimately, the leaves with large ρ values will point to regions with anomalies.\n\nFor each tree, we pick a random sample P ⊆ T of m points, and use that subset to build the tree. Each node v in the tree corresponds to a subcube C(v) and a set of points P(v) = C(v) ∩ P. For the root, C(v) = [0, 1]^d and P(v) = P. At each internal node, we pick a coordinate j ∈ [d] and breakpoints t_1 ≤ ··· ≤ t_{k−1} which partition I_j into k intervals, and split C into k subcubes. The number of partitions k is a hyper-parameter (taking k < 5 works well in practice). We then partition the points P(v) into these subcubes. The partitioning stops when the tree reaches some specified maximum depth or when |P(v)| ≤ 1. The key algorithmic problem is how to choose the coordinate j and the breakpoints by which it should be partitioned. 
Intuitively we want to partition the cube into some sparse regions and some dense regions. This intuition is formalized next.\n\nLet I_j ⊆ [0, 1] be the projection of C onto coordinate j. Say the breakpoints are chosen so that we partition I_j into I_j^1, ..., I_j^k. This partitions C into C^1, ..., C^k, where the intervals corresponding to the other coordinates stay the same. We first observe that in any partition of C, the average sparsity of the subcubes, when weighted by the number of points, is the same. Let\n\np_i := len(I_j^i)/len(I_j) = vol(C^i)/vol(C),   q_i := |P ∩ C^i|/|P|\n\n⟹ ρ(C^i) = vol(C^i)/|P ∩ C^i| = p_i vol(C)/(q_i |P|) = (p_i/q_i) ρ(C).\n\nSince a q_i fraction of points in P have sparsity ρ(C^i), the expected sparsity for a randomly chosen point from P is\n\n∑_{i ∈ [k]} q_i ρ(C^i) = ∑_{i ∈ [k]} p_i ρ(C) = ρ(C).\n\nIn words, in any partition of C, if we pick a point randomly from P(v) and measure the sparsity of its subcube, in expectation we get ρ(C). Recall that our goal is to split C into sparse and dense subcubes. Hence a natural objective is to maximize the variance in the sparsity:\n\nVar(P, {C^i}_{i ∈ [k]}) = ∑_{i ∈ [k]} q_i (ρ(C^i) − ρ(C))^2 = ∑_{i ∈ [k]} q_i ρ(C^i)^2 − ρ(C)^2.   (4)\n\nA partition that produces large variance in the sparsity needs to partition space into some sparse regions and other dense regions, which will correspond to outliers and normal regions respectively. Alternately, one might choose partitions to optimize the maximum sparsity of any interval in the partition, or some higher moment of the sparsity. Maximizing the variance has the advantage that it turns out to be equivalent to a well-studied problem about histograms, and admits a very efficient streaming algorithm. We continue splitting until a certain predefined depth is reached, or points are isolated. 
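The split-selection step above can be sketched in a few lines. The code below is our simplified illustration, not the authors' implementation: it scans a uniform grid of two-way thresholds per coordinate (rather than the streaming histogram algorithm and up-to-k-way splits used by PIDForest) and scores each candidate partition by the variance of Equation (4):

```python
import numpy as np

def sparsity_variance(p, q, rho):
    """Var of Eq. (4): sum_i q_i * rho_i^2 - rho^2, where rho_i = (p_i / q_i) * rho."""
    rho_i = (p / q) * rho
    return float(np.sum(q * rho_i ** 2) - rho ** 2)

def best_split(P, intervals, n_grid=16):
    """Pick the coordinate and threshold (2-way split) maximizing Eq. (4).

    P: (m, d) array of sample points; intervals: list of (lo, hi) per
    coordinate defining the current subcube C. Simplification: we try a
    uniform grid of thresholds and skip splits that leave a side empty
    (an empty cell would have infinite sparsity).
    """
    m, d = P.shape
    vol = np.prod([hi - lo for lo, hi in intervals])
    rho = vol / m  # sparsity of C with respect to the sample P
    best = (-np.inf, None, None)  # (variance, coordinate, threshold)
    for j in range(d):
        lo, hi = intervals[j]
        for t in np.linspace(lo, hi, n_grid + 2)[1:-1]:
            left = P[:, j] <= t
            q = np.array([left.mean(), 1 - left.mean()])
            if q.min() == 0:
                continue
            p = np.array([(t - lo) / (hi - lo), (hi - t) / (hi - lo)])
            var = sparsity_variance(p, q, rho)
            if var > best[0]:
                best = (var, j, float(t))
    return best

# Toy check: coordinate 0 carries signal (points crowd into [0, 0.1]),
# coordinate 1 is uniform noise; the split should land on coordinate 0.
rng = np.random.default_rng(0)
P = np.column_stack([rng.uniform(0, 0.1, 200), rng.uniform(0, 1, 200)])
var, j, t = best_split(P, [(0.0, 1.0), (0.0, 1.0)])
print(j)  # 0: the noisy coordinate yields near-zero variance and is ignored
```

This mirrors the robustness argument above: a coordinate whose marginal is uniform gives p_i ≈ q_i, hence ρ(C^i) ≈ ρ(C) and negligible variance, so it is unlikely to be chosen.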
Each leaf is labeled with the sparsity of its subcube.\n\nPIDForest Fit\nParams: Num of trees t, Samples m, Max degree k, Max depth h.\nRepeat t times:\n  Create root node v.\n  Let C(v) = [0, 1]^d, P(v) ⊆ T be a random subset of size m.\n  Split(v)\n\nSplit(v):\n  For j ∈ [d], compute the best split into k intervals.\n  Pick j that maximizes variance, split C along j into {C^i}_{i=1}^k.\n  For i ∈ [k] create child v_i s.t. C(v_i) = C^i, P(v_i) = P(v) ∩ C^i.\n  If depth(v_i) ≤ h and |P(v_i)| > 1 then Split(v_i).\n  Else, set PIDScore(v_i) = vol(C(v_i))/|P(v_i)|.\n\nIn Appendix C, we show that the problem of finding the partition along a coordinate j that maximizes the variance in the sparsity can be reduced to the problem of finding a k-histogram for a discrete function f : [n] → R which minimizes the squared ℓ2 error. This is a well-studied problem [12, 13, 14], and there is an efficient one-pass streaming algorithm for computing near-optimal histograms due to Guha et al. [15]. We use this algorithm to compute the best split along each coordinate, and then choose the coordinate that offers the most variance reduction. Using their algorithm, finding the optimal split for a node takes time O(dm log(m)). This is repeated at most k^h times for each tree (typically much fewer, since the trees we build tend to be unbalanced), and t times to create the forest. We typically choose m ≤ 200, k ≤ 5, h ≤ 10 and t ≤ 50.\n\nProducing an anomaly score for each point is fairly straightforward. Say we want to compute a score for y ∈ [0, 1]^d. Each tree in the forest maps y to a leaf node v and gives it a score PIDScore(v). 
We take the 75th percentile score over the trees as our final score (any robust analog of the max will do).\n\nComparison to Isolation Forest. iForest repeatedly samples a set S of m points from T and builds a random tree with those points as leaves. The tree is built by choosing a random coordinate x_i, and a random value in its range about which to split. The intuition is that anomalous points will be easy to separate from the rest, and will be isolated at small depth. What kind of points are likely to be labeled anomalous by iForest?\n\nIn one direction, if a point is isolated at relatively low depth k in a tree, then it probably belongs in a sparse subcube. Indeed, a node at depth k corresponds to a subcube C of expected volume 2^{−k}, which is large for small k. The fact that the sample contains no points from C implies that C ∩ T is small, with high probability (this can be made precise using a VC-dimension argument). Hence ρ(C, T) = vol(C)/|C ∩ T| is fairly large.\n\nBeing in a sparse subcube is necessary but not sufficient. This is because iForest chooses the coordinate we split on, as well as the breakpoint, at random. Thus to be isolated at small depth frequently, a point needs to lie in an abundant number of low-density subspaces: picking splits at random should have a good chance of defining such a subspace. Requiring such an abundance of sparse subcubes can be problematic. Going back to the animals example, isolating white elephants is hard unless both Color and Type are used as attributes, as there is no shortage of elephants or white animals. Moreover, which attributes are relevant can depend on the point: weight might be irrelevant in isolating a white elephant, but it might be crucial to isolating a particularly large elephant. This causes iForest to perform poorly in the presence of irrelevant attributes, see for instance [5].\n\nThis is the fundamental difference between PIDForest and iForest and its variants. 
PIDForest zooms in on coordinates with signal, where a split is most beneficial. Attributes with little signal are unlikely to be chosen for splitting. For concrete examples, see Section 5. The tradeoff is that we incur a slightly higher cost at training time, while the cost of prediction stays pretty much the same.\n\n4 Real-world Datasets\n\nWe show that PIDForest performs favorably in comparison to several popular anomaly detection algorithms on real-world benchmarks. We select datasets from varying domains, and with different numbers of datapoints, percentages of anomalies and dimensionality. The code and data for all experiments is available online.2 Detailed parameters of the datasets are in Table 2 in the appendix.\n\nDataset Descriptions: The first set of datasets are classification datasets from the UCI [16] and OpenML [17] repositories (they are also available at [18]). Three of the datasets, Thyroid, Mammography and Seismic, are naturally suited to anomaly detection, as they are binary classification tasks where one of the classes has a much smaller occurrence rate (around 5%) and hence can be treated as anomalous. Thyroid and Mammography have real-valued attributes whereas Seismic has categorical attributes as well. Three other datasets, Satimage-2, Musk and Vowels, are classification datasets with multiple classes, and we combine the classes and divide them into inliers and outliers as in [19]. Two of the datasets, http and smtp, are derived from the KDD Cup 1999 network intrusion detection task, and we preprocess them as in [20]. These two datasets have significantly more datapoints (about 500k and 100k respectively) and a smaller percentage of outliers (less than 0.5%).\n\nThe next set of real-world datasets, NYC taxicab, CPU utilization, Machine temperature (M.T.) 
and Ambient temperature (A.T.), are time series datasets from the Numenta anomaly detection benchmark, which have been hand-labeled with anomalies rooted in real-world causes [21]. The length of the time series is 10k-20k, with about 10% of the points marked anomalous. We use the standard technique of shingling with a sliding window of width 10, hence each data point becomes a 10 dimensional vector of 10 consecutive measurements from the time series.\n\nMethodology: We compare PIDForest with six popular anomaly detection algorithms: Isolation Forest (iForest), Robust Random Cut Forest (RRCF), one-class SVM (SVM), Local Outlier Factor (LOF), k-Nearest Neighbour (kNN) and Principal Component Analysis (PCA). We implement PIDForest in Python; it takes about 500 lines of code. For iForest, SVM and LOF we used the scikit-learn implementations, for kNN and PCA we used the implementations in PyOD [22], and for RRCF we used the implementation from [23]. Except for RRCF, we ran each algorithm with the default hyperparameter setting, as varying the hyperparameters from their default values did not change the results significantly. For RRCF, we use 500 trees instead of the default 100, since it yielded significantly better performance. For PIDForest, we fix the hyperparameters of depth to 10, number of trees to 50, and the number of samples used to build each tree to 100. We use the area under the ROC curve (AUC) as the performance metric. As iForest, PIDForest and RRCF are randomized, we repeat these algorithms for 5 runs and report the mean and standard deviation. SVM, LOF, kNN and PCA are deterministic, hence we report a single AUC number for them.\n\nResults: We report the results in Table 1. PIDForest is the top performing or jointly top performing algorithm in 6 out of the 12 datasets, and iForest, kNN and PCA are top performing or jointly top performing algorithms in 3 datasets each. 
Detailed ROC performance curves of the algorithms are given in Fig. 6 and 7 in the Appendix. While the running time of our fit procedure is slower than iForest's, it is comparable to RRCF's and faster than many other methods. Even our vanilla Python implementation on a laptop computer takes only about 5 minutes to fit a model to our largest dataset, which has half a million points.\n\nRecall from Section 3 that PIDForest differs from iForest in two ways: it optimizes for the axis to split on, and it uses sparsity instead of depth as the anomaly measure. To further examine the factors which contribute to the favorable performance of PIDForest, we do an ablation study through two additional experiments.\n\nChoice of split: Optimizing for the choice of split rather than choosing one at random seems valuable in the presence of irrelevant dimensions. To measure this effect, we added 50 additional random dimensions sampled uniformly in the range [0, 1] to two low-dimensional datasets from Table 1, Mammography and Thyroid (both datasets are 6 dimensional). On the Mammography dataset, PIDForest (and many other algorithms as well) suffers only a small 2% drop in performance, whereas the performance of iForest drops by 15%. On the Thyroid dataset, the performance of all algorithms drops appreciably. However, PIDForest has a 13% drop in performance, compared to a 20% drop for iForest. The detailed results are given in Table 3 in the Appendix.\n\nUsing sparsity instead of depth: In this experiment, we test the hypothesis that the sparsity of the leaf is a better anomaly score than depth for the PIDForest algorithm. 
The performance of PIDForest deteriorates noticeably when depth is used as the score: the AUC for Thyroid drops from 0.876 to 0.847, while the AUC for Mammography drops from 0.840 to 0.783.

²https://github.com/vatsalsharan/pidforest

Data set   PIDForest        iForest          RRCF             LOF     SVM     kNN     PCA
Thyroid    0.876 ± 0.013    0.819 ± 0.013    0.739 ± 0.004    0.737   0.547   0.751   0.673
Mammo.     0.840 ± 0.010    0.862 ± 0.008    0.830 ± 0.002    0.720   0.872   0.839   0.886
Seismic    0.733 ± 0.006    0.698 ± 0.004    0.701 ± 0.004    0.553   0.601   0.740   0.682
Satimage   0.987 ± 0.001    0.994 ± 0.001    0.991 ± 0.002    0.540   0.421   0.936   0.977
Vowels     0.741 ± 0.008    0.736 ± 0.026    0.813 ± 0.007    0.943   0.778   0.975   0.606
Musk       1.000 ± 0.000    0.998 ± 0.003    0.998 ± 0.000    0.416   0.373   0.573   1.000
http       0.986 ± 0.004    1.000 ± 0.000    0.993 ± 0.000    0.353   0.994   0.231   0.996
smtp       0.923 ± 0.003    0.908 ± 0.003    0.886 ± 0.017    0.905   0.841   0.895   0.823
NYC        0.564 ± 0.004    0.550 ± 0.005    0.543 ± 0.004    0.671   0.500   0.697   0.511
A.T.       0.810 ± 0.005    0.780 ± 0.006    0.695 ± 0.004    0.563   0.670   0.634   0.792
CPU        0.935 ± 0.003    0.917 ± 0.002    0.785 ± 0.002    0.560   0.794   0.724   0.858
M.T.       0.813 ± 0.006    0.828 ± 0.002    0.752 ± 0.003    0.501   0.759   0.796   0.834

Table 1: Results on real-world datasets. We bold the algorithm(s) which get the best AUC.

Figure 1: Synthetic experiments on masked anomalies and Gaussian data. (a) Masking: accuracy is the fraction of true outliers in the top 5% of reported anomalies. (b) The y-axis measures how many of the 100 true anomalies were reported by the algorithm in its top 100 anomalies.

5 Synthetic Data

We compare PIDForest with popular anomaly detection algorithms on synthetic benchmarks.
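The accuracy metric from Fig. 1a—the fraction of true outliers among the top 5% of points, ranked by anomaly score—can be computed as in the following sketch; the function name and interface are ours:

```python
import numpy as np

def top_frac_accuracy(scores, is_outlier, frac=0.05):
    """Fraction of the true outliers that land in the top `frac` of points
    when ranked by anomaly score (higher score = more anomalous)."""
    scores = np.asarray(scores, dtype=float)
    is_outlier = np.asarray(is_outlier, dtype=bool)
    k = max(1, int(frac * len(scores)))
    top_k = np.argsort(scores)[::-1][:k]   # indices of the k highest scores
    return is_outlier[top_k].sum() / is_outlier.sum()

# Toy example: 5 planted outliers receive the highest scores among 100 points.
scores = np.concatenate([np.full(5, 10.0), np.linspace(0.0, 1.0, 95)])
labels = np.array([True] * 5 + [False] * 95)
acc = top_frac_accuracy(scores, labels, frac=0.05)  # -> 1.0
```

The same helper, with `frac` adjusted, recovers the "true anomalies in the top 100" count used in the Gaussian-mixture experiment.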
The first experiment checks how the algorithms handle duplicates in the data. The second experiment uses data from a mixture of Gaussians, and highlights the importance of the choice of coordinates to split on in PIDForest. The third experiment tests the ability of the algorithms to detect anomalies in time series (see Appendix B). In all these experiments, PIDForest outperforms prior art.

Masking and sample size. It is often the case that anomalies repeat multiple times in the data. This phenomenon is called masking, and it is a challenge for many algorithms. iForest relies on sampling to counter masking: not too many repetitions occur in the sample. But its performance is sensitive to the sampling rate; see [4, 5]. To demonstrate this, we create a dataset of 1000 points in 10 dimensions. 970 of these points are sampled randomly from {−1, 1}^10 (hence most of these points are unique). The remaining 30 are the all-zeros vector; these constitute a masked anomaly. We test whether the zero points are declared anomalies by PIDForest and iForest under varying sample sizes. The results are reported in Fig. 1a. Whereas PIDForest consistently reports these points as anomalies, the performance of iForest depends heavily on the sample size: when it is small, masking is negated and the anomalies are caught, but the zero points become hard to isolate as the sample size increases.

Figure 2: Robustness to choice of hyperparameters. (a) Varying number of buckets. (b) Varying number of samples.

Mixtures of Gaussians and random noise. We use a generative model where the ground truth anomalies are the points of least likelihood.
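This notion of ground truth—the lowest-likelihood points under the generative model—can be instantiated as below. This is a sketch under assumed parameters: the particular means and covariances are ours for illustration, whereas the experiment draws random eigenvectors with eigenvalues {1, 2}.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
n, d = 1000, 6                     # d noisy dimensions (the experiment varies d)

# Mixture of two 2-D Gaussians with different means and covariances.
means = [np.array([-2.0, 0.0]), np.array([2.0, 0.0])]
covs = [np.array([[1.0, 0.3], [0.3, 2.0]]),
        np.array([[2.0, -0.3], [-0.3, 1.0]])]
comp = rng.integers(0, 2, size=n)
signal = np.stack([rng.multivariate_normal(means[c], covs[c]) for c in comp])

# Remaining d coordinates are uniform noise in [-2, 2].
X = np.hstack([signal, rng.uniform(-2.0, 2.0, size=(n, d))])

# Ground truth: the 100 points of least likelihood. The uniform noise has
# constant density, so only the mixture density over the first two
# coordinates matters for the ranking.
density = 0.5 * multivariate_normal(means[0], covs[0]).pdf(signal) \
        + 0.5 * multivariate_normal(means[1], covs[1]).pdf(signal)
ground_truth = np.argsort(density)[:100]
```

An algorithm's score in this experiment is then the overlap between its 100 most anomalous points and `ground_truth`.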
The \ufb01rst two coordinates of the 1000 data points are\nsampled by taking a mixture of two 2\u2212dimensional Gaussians with different means and covariances\n(the eigenvalues are {1, 2}, the eigenvectors are chosen at random). The remaining d dimensions are\nsampled uniformly in the range [\u22122, 2]. We run experiments varying d from 0 to 19. In each case we\ntake the 100 points smallest likelihood points as the ground truth anomalies. For each algorithm we\nexamine the 100 most anomalous points and calculate how many of these belong to the ground truth.\nWe compare PIDForest with Isolation Forest, Local Outlier Factor (LOF), an algorithm that uses\nExpectation Maximization (EM) to \ufb01t the data to a mixture of Gaussians, RRCF, SVM, kNN and\nPCA. For clarity, the results for the \ufb01rst four algorithms are reported in Fig. 1b and the rest appear in\nFig. 4b in the Appendix. Note that even without noise (d = 0) PIDForest is among the best generic\nalgorithms. As the number of noisy dimensions increase PIDForest focuses on the dimensions with\nsignal, so it performs better. Some observations:\n1. The performance of iForest degrades rapidly with d, once d \u2265 6 it effectively outputs a random\nset. Noticeably, PIDForest performs better than iForest even when d = 0. This is mainly due to\nthe points between the two centers being classi\ufb01ed as normal by iForest. PIDForest classi\ufb01es\nthem correctly as anomalous even though they are assigned to leaves that are deep in the tree.\n\n2. The EM algorithm is speci\ufb01cally designed to \ufb01t a mixture of two Gaussians, so it does best for\n\nsmall or zero noise, i.e. when d is small. PIDForest beats it once d > 2.\n\n3. Apart from PIDForest, a few other algorithms such as RRCF, SVM and kNN also do well (see\nFig. 4b in the Appendix)\u2014but their performance is crucially dependent on the fact that the noise\nis of the same scale as the Gaussians. 
If we change the scale of the noise (which could happen if the measuring unit changes), then the performance of all algorithms except PIDForest drops significantly. In Figs. 5a and 5b in the Appendix, we repeat the same experiment as above, but with the noisy dimensions uniform in [−10, 10] (instead of [−2, 2]). The performance of PIDForest is essentially unchanged, and it does markedly better than all the other algorithms.

Robustness to choice of hyperparameters. One of the appealing properties of PIDForest is that the quality of its output is relatively insensitive to the exact values of the hyperparameters. We tested multiple settings and found that each hyperparameter has a moderate value above which the quality of the output remains stable. Figure 2a shows precision-recall curves for the synthetic time-series experiment, where we vary the parameter k (the number of buckets). We see that k = 2 is too small, but there is relatively little variation between k = 3, 4, 5, 6. Similar behavior was observed for the number of samples m (see Figure 2b), the number of trees t and the depth of each tree h, and also in the mixture of Gaussians experiment. Since these parameters directly affect the running time, we set them to the smallest values for which we got good results.
6 Conclusion

We believe that PIDForest is arguably one of the best off-the-shelf algorithms for anomaly detection on a large, heterogeneous dataset. It inherits many of the desirable features of Isolation Forests, while also improving on them in important ways.
Developing provable and scalable approximations to PIDScore is an interesting algorithmic challenge.

Acknowledgment

VS's contributions were supported by NSF award 1813049.

References
[1] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Comput. Surv., 41(3):15:1–15:58, 2009.

[2] Charu C. Aggarwal. Outlier Analysis. Springer Publishing Company, Incorporated, 2nd edition, 2013. ISBN 9783319475783.

[3] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In Proceedings of the 8th IEEE International Conference on Data Mining (ICDM 2008), December 15-19, 2008, Pisa, Italy, pages 413–422, 2008.

[4] Sudipto Guha, Nina Mishra, Gourav Roy, and Okke Schrijvers. Robust random cut forest based anomaly detection on streams. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 2712–2721, 2016.

[5] Tharindu R. Bandaragoda, Kai Ming Ting, David W. Albrecht, Fei Tony Liu, Ye Zhu, and Jonathan R. Wells. Isolation-based anomaly detection using nearest-neighbor ensembles. Computational Intelligence, 34(4):968–998, 2018.

[6] Andrew F Emmott, Shubhomoy Das, Thomas Dietterich, Alan Fern, and Weng-Keen Wong. Systematic construction of anomaly detection benchmarks from real data. In Proceedings of the ACM SIGKDD Workshop on Outlier Detection and Description, pages 16–21. ACM, 2013.

[7] Avi Wigderson and Amir Yehudayoff. Population recovery and partial identification. Mach. Learn., 102(1):29–56, January 2016. ISSN 0885-6125. doi: 10.1007/s10994-015-5489-9. URL http://dx.doi.org/10.1007/s10994-015-5489-9.

[8] S.A. Goldman and M.J.
Kearns. On the complexity of teaching. J. Comput. Syst. Sci., 50(1):20–31, February 1995. ISSN 0022-0000. doi: 10.1006/jcss.1995.1003. URL http://dx.doi.org/10.1006/jcss.1995.1003.

[9] Eyal Kushilevitz, Nathan Linial, Yuri Rabinovich, and Michael E. Saks. Witness sets for families of binary vectors. J. Comb. Theory, Ser. A, 73(2):376–380, 1996. doi: 10.1006/jcta.1996.0031. URL https://doi.org/10.1006/jcta.1996.0031.

[10] Balas K. Natarajan. Machine Learning: A Theoretical Approach. Morgan Kaufmann Publishers, Inc., 1991.

[11] Martin Anthony, Graham Brightwell, Dave Cohen, and John Shawe-Taylor. On exact specification by examples. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT '92, pages 311–318, New York, NY, USA, 1992. ACM. ISBN 0-89791-497-X. doi: 10.1145/130385.130420. URL http://doi.acm.org/10.1145/130385.130420.

[12] Nick Koudas, S Muthukrishnan, and Divesh Srivastava. Optimal histograms for hierarchical range queries. In PODS, pages 196–204, 2000.

[13] Anna C Gilbert, Sudipto Guha, Piotr Indyk, Yannis Kotidis, S. Muthukrishnan, and Martin J Strauss. Fast, small-space algorithms for approximate histogram maintenance. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, pages 389–398. ACM, 2002.

[14] Sudipto Guha, Piotr Indyk, S Muthukrishnan, and Martin J Strauss. Histogramming data streams with fast per-item processing. In International Colloquium on Automata, Languages, and Programming, pages 681–692. Springer, 2002.

[15] Sudipto Guha, Nick Koudas, and Kyuseok Shim. Approximation and streaming algorithms for histogram construction problems. ACM Trans. Database Syst., 31(1):396–438, March 2006. ISSN 0362-5915. doi: 10.1145/1132863.1132873. URL http://doi.acm.org/10.1145/1132863.1132873.

[16] Arthur Asuncion and David Newman.
UCI machine learning repository, 2007.

[17] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: Networked science in machine learning. SIGKDD Explorations, 15(2):49–60, 2013. doi: 10.1145/2641190.2641198. URL http://doi.acm.org/10.1145/2641190.2641198.

[18] Shebuti Rayana. ODDS library, 2016. URL http://odds.cs.stonybrook.edu.

[19] Charu C Aggarwal and Saket Sathe. Theoretical foundations and algorithms for outlier ensembles. ACM SIGKDD Explorations Newsletter, 17(1):24–47, 2015.

[20] Kenji Yamanishi, Jun-Ichi Takeuchi, Graham Williams, and Peter Milne. On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms. Data Mining and Knowledge Discovery, 8(3):275–300, 2004.

[21] Subutai Ahmad, Alexander Lavin, Scott Purdy, and Zuha Agha. Unsupervised real-time anomaly detection for streaming data. Neurocomputing, 262:134–147, 2017.

[22] Yue Zhao, Zain Nasrullah, and Zheng Li. PyOD: A Python toolbox for scalable outlier detection. arXiv preprint arXiv:1901.01588, 2019.

[23] Matthew D. Bartos, Abhiram Mullapudi, and Sara C. Troutman. rrcf: Implementation of the robust random cut forest algorithm for anomaly detection on streams. Journal of Open Source Software, 4(35):1336, 2019.

[24] Victoria J. Hodge and Jim Austin. A survey of outlier detection methodologies. Artif. Intell. Rev., 22(2):85–126, 2004.

[25] Animesh Patcha and Jung-Min Jerry Park. An overview of anomaly detection techniques: Existing solutions and latest technological trends. Computer Networks, 51:3448–3470, 2007.

[26] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD '00, pages 93–104, New York, NY, USA, 2000. ACM. ISBN 1-58113-217-4.
doi: 10.1145/342009.335388. URL http://doi.acm.org/10.1145/342009.335388.

[27] Fabrizio Angiulli and Clara Pizzuti. Fast outlier detection in high dimensional spaces. In Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery, PKDD '02, pages 15–26, London, UK, 2002. Springer-Verlag. ISBN 3-540-44037-2. URL http://dl.acm.org/citation.cfm?id=645806.670167.

[28] Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. Efficient algorithms for mining outliers from large data sets. SIGMOD Rec., 29(2):427–438, May 2000. ISSN 0163-5808. doi: 10.1145/335191.335437. URL http://doi.acm.org/10.1145/335191.335437.

[29] Tao Shi and Steve Horvath. Unsupervised learning with random forest predictors. Journal of Computational and Graphical Statistics, 15(1):118–138, 2006.

[30] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD'96, pages 226–231. AAAI Press, 1996. URL http://dl.acm.org/citation.cfm?id=3001460.3001507.

[31] Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD Rec., 27(2):94–105, June 1998. ISSN 0163-5808. doi: 10.1145/276305.276314. URL http://doi.acm.org/10.1145/276305.276314.