{"title": "Efficient Anomaly Detection via Matrix Sketching", "book": "Advances in Neural Information Processing Systems", "page_first": 8069, "page_last": 8080, "abstract": "We consider the problem of finding anomalies in high-dimensional data using popular PCA based anomaly scores.  The naive algorithms for computing these scores explicitly compute the PCA of the covariance matrix which uses space quadratic in the dimensionality of the data. We give the first streaming algorithms that use space that is linear or sublinear in the dimension. We prove general results showing that \\emph{any} sketch of a matrix that satisfies a certain operator norm guarantee can be used to approximate these scores. We instantiate these results with powerful matrix sketching techniques such as Frequent Directions and random projections to derive efficient and practical algorithms for these problems, which we validate over real-world data sets. Our main technical contribution is to prove matrix perturbation inequalities for operators arising in the computation of these measures.", "full_text": "Ef\ufb01cient Anomaly Detection via Matrix Sketching\n\nVatsal Sharan\n\nStanford University\u2217\n\nvsharan@stanford.edu\n\nParikshit Gopalan\nVMware Research\n\npgopalan@vmware.com\n\nAbstract\n\nUdi Wieder\n\nVMware Research\n\nuwieder@vmware.com\n\nWe consider the problem of \ufb01nding anomalies in high-dimensional data using\npopular PCA based anomaly scores. The naive algorithms for computing these\nscores explicitly compute the PCA of the covariance matrix which uses space\nquadratic in the dimensionality of the data. We give the \ufb01rst streaming algorithms\nthat use space that is linear or sublinear in the dimension. We prove general results\nshowing that any sketch of a matrix that satis\ufb01es a certain operator norm guarantee\ncan be used to approximate these scores. 
We instantiate these results with powerful\nmatrix sketching techniques such as Frequent Directions and random projections to\nderive ef\ufb01cient and practical algorithms for these problems, which we validate over\nreal-world data sets. Our main technical contribution is to prove matrix perturbation\ninequalities for operators arising in the computation of these measures.\n\n1\n\nIntroduction\n\nAnomaly detection in high-dimensional numeric data is a ubiquitous problem in machine learning\n[1, 2]. A typical scenario is where we have a constant stream of measurements (say parameters\nregarding the health of machines in a data-center), and our goal is to detect any unusual behavior. An\nalgorithm to detect anomalies in such high dimensional settings faces computational challenges: the\ndimension of the data matrix A \u2208 Rn\u00d7d may be very large both in terms of the number of data points\nn and their dimensionality d (in the datacenter example, d could be 106 and n (cid:29) d). The desiderata\nfor an algorithm to be ef\ufb01cient in such settings are\u2014\n1. As n is too large for the data to be stored in memory, the algorithm must work in a streaming\nfashion where it only gets a constant number of passes over the dataset.\n2. As d is also very large, the algorithm should ideally use memory linear or even sublinear in d.\nIn this work we focus on two popular subspace based anomaly scores: rank-k leverage scores and\nrank-k projection distance. The key idea behind subspace based anomaly scores is that real-world\ndata often has most of its variance in a low-dimensional rank k subspace, where k is usually much\nsmaller than d. In this section, we assume k = O(1) for simplicity. These scores are based on\nidentifying this principal k subspace using Principal Component Analyis (PCA) and then computing\nhow \u201cnormal\u201d the projection of a point on the principal k subspace looks. 
Rank-k leverage scores\ncompute the normality of the projection of the point onto the principal k subspace using Mahalanobis\ndistance, and rank-k projection distance compute the (cid:96)2 distance of the point from the principal k\nsubspace (see Fig. 1 for an illustration). These scores have found widespread use for detection of\nanomalies in many applications such as \ufb01nding outliers in network traf\ufb01c data [3, 4, 5, 6], detecting\nanomalous behavior in social networks [7, 8], intrusion detection in computer security [9, 10, 11], in\nindustrial systems for fault detection [12, 13, 14] and for monitoring data-centers [15, 16].\nThe standard approach to compute principal k subspace based anomaly scores in a streaming setting\nis by computing AT A, the (d \u00d7 d) covariance matrix of the data, and then computing the top k\n\n\u2217Part of the work was done while the author was an intern at VMware Research.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: Illustration of subspace based anomaly scores. Here, the data lies mostly in the k = 2\ndimensional principal subspace shaded in red. For a point a(i) the rank-k projection distance equals\n(cid:107)x(cid:107)2, where x is the component of a(i) orthogonal to the principal subspace. The rank-k leverage\nscore measures the normality of the projection y onto the principal subspace.\n\nprincipal components. This takes space O(d2) and time O(nd2). The quadratic dependence on d\nrenders this approach inef\ufb01cient in high dimensions. It raises the natural question of whether better\nalgorithms exist.\n\n1.1 Our Results\n\nIn this work, we answer the above question af\ufb01rmatively, by giving algorithms for computing these\nanomaly scores that require space linear and even sublinear in d. 
Our algorithms use popular matrix\nsketching techniques while their analysis uses new matrix perturbation inequalities that we prove.\nBrie\ufb02y, a sketch of a matrix produces a much smaller matrix that preserves some desirable properties\nof the large matrix (formally, it is close in some suitable norm). Sketching techniques have found\nnumerous applications to numerical linear algebra. Several ef\ufb01cient sketching algorithms are known\nin the streaming setting [17].\n\nPointwise guarantees with linear space: We show that any sketch \u02dcA of A with the property that\n(cid:107)AT A \u2212 \u02dcAT \u02dcA(cid:107) is small, can be used to additively approximate the rank-k leverage scores and\nrank-k projection distances for each row. By instantiating this with suitable sketches such as the\nFrequent Directions sketch [18], row-sampling [19] or a random projection of the columns of the\ninput, we get a streaming algorithm that uses O(d) memory and O(nd) time.\n\nA matching lower bound:\nCan we get such an additive approximation using memory only\no(d)?2 The answer is no, we show a lower bound saying that any algorithm that computes such an\napproximation to the rank-k leverage scores or the rank-k projection distances for all the rows of a\nmatrix must use \u2126(d) working space, using techniques from communication complexity. Hence our\nalgorithm has near-optimal dependence on d for the task of approximating the outlier scores for every\ndata point.\n\nAverage-case guarantees with logarithmic space:\nPerhaps surprisingly, we show that it is\nactually possible to circumvent the lower bound by relaxing the requirement that the outlier scores\nbe preserved for each and every point to only preserving the outlier scores on average. For this we\nrequire sketches where (cid:107)AAT \u2212 \u02dcA \u02dcAT(cid:107) is small: this can be achieved via random projection of the\nrows of the input matrix or column subsampling [19]. 
Using any such sketch, we give a streaming\nalgorithm that can preserve the outlier scores for the rows up to small additive error on average, and\nhence preserve most outliers. The space required by this algorithm is only poly(k) log(d), and hence\nwe get signi\ufb01cant space savings in this setting (recall that we assume k = O(1)).\n\nTechnical contributions. A sketch of a matrix A is a signi\ufb01cantly smaller matrix \u02dcA which ap-\nproximates it well in some norm, say for instance (cid:107)AT A \u2212 \u02dcAT \u02dcA(cid:107) is small. We can think of such\na sketch as a noisy approximation of the true matrix. In order to use such sketches for anomaly\n\n2Note that even though each row is d dimensional an algorithm need not store the entire row in memory, and\n\ncould instead perform computations as each coordinate of the row streams in.\n\n2\n\n\fdetection, we need to understand how the noise affects the anomaly scores of the rows of the matrix.\nMatrix perturbation theory studies the effect of adding noise to the spectral properties of a matrix,\nwhich makes it the natural tool for us. The basic results here include Weyl\u2019s inequality [20] and\nWedin\u2019s theorem [21], which respectively give such bounds for eigenvalues and eigenvectors. We use\nthese results to derive perturbation bounds on more complex projection operators that arise while\ncomputing outlier scores, these operators involve projecting onto the top-k principal subspace, and\nrescaling each co-ordinate by some function of the corresponding singular values. We believe these\nresults could be of independent interest.\n\nExperimental results. Our results have a parameter (cid:96) that controls the size and the accuracy of the\nsketch. While our theorems imply that (cid:96) can be chosen independent of d, they depend polynomially\non k, the desired accuracy and other parameters, and are probably pessimistic. We validate both our\nalgorithms on real world data. 
In our experiments, we found that choosing ℓ to be a small multiple of k was sufficient to get good results. Our results show that one can get outcomes comparable to running full-blown SVD using sketches which are significantly smaller in memory footprint, faster to compute and easy to implement (literally a few lines of Python code).
This contributes to a line of work that aims to make SVD/PCA scale to massive datasets [22]. We give simple and practical algorithms for anomaly score computation, that give SVD-like guarantees at a significantly lower cost in terms of memory, computation and communication.

2 Notation and Setup

Given a matrix A ∈ R^{n×d}, we let a_(i) ∈ R^d denote its ith row and a^(i) ∈ R^n denote its ith column. Let UΣV^T be the SVD of A, where Σ = diag(σ_1, . . . , σ_d) for σ_1 ≥ · · · ≥ σ_d > 0. Let κ_k be the condition number of the top-k subspace of A, defined as κ_k = σ_1^2/σ_k^2. We consider all vectors as column vectors (that includes a_(i)). We denote by ‖A‖_F the Frobenius norm of the matrix, and by ‖A‖ the operator norm (which is equal to the largest singular value). Subspace based measures of anomalies have their origins in a classical metric in statistics known as the Mahalanobis distance, denoted by L(i) and defined as

    L(i) = ∑_{j=1}^{d} (a_(i)^T v_(j))^2 / σ_j^2,    (1)

where a_(i) and v_(j) are the ith row of A and the jth column of V respectively. L(i) is also known as the leverage score [23, 24]. If the data is drawn from a multivariate Gaussian distribution, then L(i) is proportional to the negative log likelihood of the data point, and hence is the right anomaly metric in this case.
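As a side note (our illustration, not part of the paper's algorithms), Equation (1) has a convenient closed form: since AV = UΣ, each term (a_(i)^T v_(j))^2/σ_j^2 equals U_{ij}^2, so L(i) is just the squared norm of the ith row of U. A minimal numpy sketch:

```python
import numpy as np

def leverage_scores(A):
    """Full leverage scores of Equation (1):
    L(i) = sum_j (a_(i)^T v_(j))^2 / sigma_j^2.
    Since A V = U Sigma, each term equals U_ij^2, so L(i) is the
    squared norm of the ith row of U."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return np.sum(U ** 2, axis=1)

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))
L = leverage_scores(A)
# The scores sum to rank(A): here 20, since A is full rank almost surely.
```

Of course, this baseline needs a full SVD and hence all of A in memory; avoiding exactly that cost in the streaming setting is the subject of this paper.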
Note that higher leverage scores correspond to outliers in the data.
However, L(i) depends on the entire spectrum of singular values and is highly sensitive to the smaller singular values, whereas real-world data sets often have most of their signal in the top singular values. Therefore the above sum is often limited to only the k largest singular values (for some appropriately chosen k ≪ d) [1, 25]. This measure is called the rank-k leverage score L^k(i), where

    L^k(i) = ∑_{j=1}^{k} (a_(i)^T v_(j))^2 / σ_j^2.

The rank-k leverage score is concerned with the mass which lies within the principal subspace, but to catch anomalies that are far from the principal subspace a second measure of anomaly is the rank-k projection distance T^k(i), which is simply the distance of the data point a_(i) to the rank-k principal subspace:

    T^k(i) = ∑_{j=k+1}^{d} (a_(i)^T v_(j))^2.

Assumptions. We now discuss the assumptions needed for our anomaly scores to be meaningful.
(1) Separation assumption. If there is degeneracy in the spectrum of the matrix, namely σ_k^2 = σ_{k+1}^2, then the k-dimensional principal subspace is not unique, and the quantities L^k and T^k are not well defined, since their value will depend on the choice of principal subspace. This suggests that we are using the wrong value of k, since the choice of k ought to be such that the directions orthogonal to the principal subspace have markedly less variance than those in the principal subspace. Hence we require that k is such that there is a gap in the spectrum at k.

Assumption 1. We define a matrix A as being (k, Δ)-separated if σ_k^2 − σ_{k+1}^2 ≥ Δσ_1^2. Our results assume that the data are (k, Δ)-separated for Δ > 0.

This assumption manifests itself as an inverse polynomial dependence on Δ in our bounds.
This\ndependence is probably pessimistic: in our experiments, we have found our algorithms do well on\ndatasets which are not degenerate, but where the separation \u2206 is not particularly large.\n(2) Approximate low-rank assumption. We assume that the top-k principal subspace captures a\nconstant fraction (at least 0.1) of the total variance in the data, formalized as follows.\n\nAssumption 2. We assume the matrix A is approximately rank-k, i.e.,(cid:80)k\n(cid:80)d\n\ni .\ni=1 \u03c32\nFrom a technical standpoint, this assumption is not strictly needed: if Assumption 2 is not true, our\nresults still hold, but in this case they depend on the stable rank sr(A) of A, de\ufb01ned as sr(A) =\n\ni \u2265 (1/10)(cid:80)d\n\n1 (we state these general forms of our results in the appendix).\n\ni=1 \u03c32\n\ni=1 \u03c32\n\ni /\u03c32\n\nFrom a practical standpoint though, this assumption captures the setting where the scores Lk and T k,\nand our guarantees are most meaningful. Indeed, our experiments suggest that our algorithms work\nbest on data sets where relatively few principal components explain most of the variance.\n\nSetup. We work in the row-streaming model, where rows appear one after the other in time. Note\nthat the leverage score of a row depends on the entire matrix, and hence computing the anomaly\nscores in the streaming model requires care, since if the rows are seen in streaming order, when row\ni arrives we cannot compute its leverage score without seeing the rest of the input. Indeed, 1-pass\nalgorithms are not possible (unless they output the entire matrix of scores at the end of the pass,\nwhich clearly requires a lot of memory). Hence we will aim for 2-pass algorithms.\nNote that there is a simple 2-pass algorithm which uses O(d2) memory to compute the covariance\nmatrix in one pass, then computes its SVD, and using this computes Lk(i) and T k(i) in a second\npass using memory O(dk). 
This requires O(d2) memory and O(nd2) time, and our goal would be to\nreduce this to linear or sublinear in d.\nAnother reasonable way to de\ufb01ne leverage scores and projection distances in the streaming model is\nto de\ufb01ne them with respect to only the input seen so far. We refer to this as the online scenario, and\nrefer to these scores as the online scores. Our result for sketches which preserve row spaces also hold\nin this online scenario. We defer more discussion of this online scenario to the appendix, and focus\nhere only on the scores de\ufb01ned with respect to the entire matrix for simplicity.\n\n3 Guarantees for anomaly detection via sketching\nOur main results say that given \u00b5 > 0 and a (k, \u2206)-separated matrix A \u2208 Rn\u00d7d with top singular\nvalue \u03c31, any sketch \u02dcA \u2208 R(cid:96)\u00d7d satisfying\n\n(cid:107)AT A \u2212 \u02dcAT \u02dcA(cid:107) \u2264 \u00b5\u03c32\n1,\n\n(2)\n\nor a sketch \u02dcA \u2208 Rn\u00d7(cid:96) satisfying\n\n(cid:107)AAT \u2212 \u02dcA \u02dcAT(cid:107) \u2264 \u00b5\u03c32\n1,\n\n(3)\ncan be used to approximate rank k leverage scores and the projection distance from the principal\nk-dimensional subspace. The quality of the approximation depends on \u00b5, the separation \u2206, k and the\ncondition number \u03bak of the top k subspace.3 In order for the sketches to be useful, we also need them\nto be ef\ufb01ciently computable in a streaming fashion. We show how to use such sketches to design\nef\ufb01cient algorithms for \ufb01nding anomalies in a streaming fashion using small space and with fast\nrunning time. 
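To make the operator norm guarantee of Equation (2) concrete, here is a small numerical illustration of ours (synthetic near-low-rank data, arbitrary sizes): sketching the rows down with a scaled random sign matrix S gives Ã = SA ∈ R^{ℓ×d} whose Gram matrix approximates AᵀA in operator norm relative to σ_1^2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, ell, k = 2000, 300, 200, 10

# Synthetic data with strong rank-k structure: the first k coordinate
# directions carry most of the variance, the rest are small noise.
scales = np.concatenate([np.ones(k), 0.01 * np.ones(d - k)])
A = rng.standard_normal((n, d)) * scales

# Sketch: A_tilde = S A with S an (ell x n) scaled random sign matrix.
S = rng.choice([-1.0, 1.0], size=(ell, n)) / np.sqrt(ell)
A_tilde = S @ A

sigma1 = np.linalg.norm(A, 2)                 # top singular value of A
err = np.linalg.norm(A.T @ A - A_tilde.T @ A_tilde, 2)
mu = err / sigma1 ** 2                        # the mu of Equation (2)
print(mu)  # empirically small for this near-low-rank A, despite ell << n
```

The achievable μ here degrades gracefully with the stable rank of A relative to ℓ, which is why the near-low-rank structure matters in this toy setup.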
The actual guarantees (and the proofs) for the two cases are different and incomparable.\nThis is to be expected as the sketch guarantees are very different in the two cases: Equation (2) can\nbe viewed as an approximation to the covariance matrix of the row vectors, whereas Equation (3)\n\n3The dependence on \u03bak only appears for showing guarantees for rank-k leverage scores Lk in Theorem 1.\n\n4\n\n\fgives an approximation for the covariance matrix of the column vectors. Since the corresponding\nsketches can be viewed as preserving the row/column space of A respectively, we will refer to them\nas row/column space approximations.\n\nPointwise guarantees from row space approximations. Sketches which satisfy Equation (2) can\nbe computed in the row streaming model using random projections of the columns, subsampling\nthe rows of the matrix proportional to their squared lengths [19] or deterministically by using the\nFrequent Directions algorithm [26]. Our streaming algorithm is stated as Algorithm 1, and is very\nsimple. In Algorithm 1, any other sketch such as subsampling the rows of the matrix or using a\nrandom projection can also be used instead of Frequent Directions.\n\nAlgorithm 1: Algorithm to approximate anomaly scores using Frequent Directions\n\nInput: Choice of k, sketch size (cid:96) for Frequent Directions [26]\nFirst Pass:\nUse Frequent Directions to compute a sketch \u02dcA \u2208 R(cid:96)\u00d7d\n\nSVD:\n\nCompute the top k right singular vectors of \u02dcAT \u02dcA\n\nSecond Pass: As each row a(i) streams in,\n\nUse estimated right singular vectors to compute leverage scores and projection distances\n\nWe state our results here, see Section B for precise statements and general results for any sketches\nwhich satisfy the guarantee in Eq. (2). All our proofs are deferred to the appendix in the supplementary\nmaterial.\nTheorem 1. Assume that A is (k, \u2206)-separated. 
There exists ℓ = k^2 · poly(ε^{-1}, κ_k, Δ) such that the above algorithm computes estimates T̃^k(i) and L̃^k(i) where

    |T^k(i) − T̃^k(i)| ≤ ε ‖a_(i)‖_2^2,
    |L^k(i) − L̃^k(i)| ≤ ε k ‖a_(i)‖_2^2 / ‖A‖_F^2.

The algorithm uses memory O(dℓ) and has running time O(ndℓ).

The key is that while ℓ depends on k and other parameters, it is independent of d. In the setting where all these parameters are constants independent of d, our memory requirement is O(d), improving on the trivial O(d^2) bound.
Our approximations are additive rather than multiplicative. But for anomaly detection, the candidate anomalies are ones where L^k(i) or T^k(i) is large, and in this regime, we argue below that our additive bounds also translate to good multiplicative approximations. The additive error in computing L^k(i) is about εk/n when all the rows have roughly equal norm. Note that the average rank-k leverage score of all the rows of any matrix with n rows is k/n, hence a reasonable threshold on L^k(i) to regard a point as an anomaly is L^k(i) ≫ k/n, so the guarantee for L^k(i) in Theorem 1 preserves anomaly scores up to a small multiplicative error for candidate anomalies, and ensures that points which were not anomalies before are not mistakenly classified as anomalies. For T^k(i), the additive error for row a_(i) is ε‖a_(i)‖_2^2. Again, for points that are anomalies, T^k(i) is a constant fraction of ‖a_(i)‖_2^2, so this guarantee is meaningful.
Next we show that substantial savings are unlikely for any algorithm with strong pointwise guarantees: there is an Ω(d) lower bound for any approximation that lets you distinguish L^k(i) = 1 from L^k(i) = ε for any constant ε.
Theorem 2.
Any streaming algorithm which takes a constant number of passes over the data and can compute a 0.1-error additive approximation to the rank-k leverage scores or the rank-k projection distances for all the rows of a matrix must use Ω(d) working space.

Average-case guarantees from column space approximations. We derive smaller-space algorithms, albeit with weaker guarantees, using sketches that give column space approximations satisfying Equation (3). Even though the sketch gives column space approximations, our goal is still to compute the row anomaly scores, so it is not just a matter of working with the transpose. Many sketches are known which approximate AA^T and satisfy Equation (3); for instance, a low-dimensional projection by a random matrix R ∈ R^{d×ℓ} (e.g., each entry of R could be a scaled i.i.d. uniform {±1} random variable) satisfies Equation (3) for ℓ = O(k/µ^2) [27].
On first glance it is unclear how such a sketch should be useful: the matrix ÃÃ^T is an n × n matrix, and since n ≫ d this matrix is too expensive to store. Our streaming algorithm avoids this problem by only computing Ã^TÃ, which is an ℓ × ℓ matrix; the larger matrix ÃÃ^T is only used for the analysis. Instantiated with the sketch above, the resulting algorithm is simple to describe (although the analysis is subtle): we pick a random matrix R ∈ R^{d×ℓ} as above and return the anomaly scores for the sketch Ã = AR instead. Doing this in a streaming fashion using even the naive algorithm requires computing only the small covariance matrix Ã^TÃ, which is O(ℓ^2) space.
But notice that we have not accounted for the space needed to store the (d × ℓ) matrix R.
This is a subtle (but mainly theoretical) concern, which can be addressed by using powerful results from the theory of pseudorandomness [28]. Constructions of pseudorandom Johnson-Lindenstrauss matrices [29, 30] imply that the matrix R can be pseudorandom, meaning that it has a succinct description using only O(log(d)) bits, from which each entry can be efficiently computed on the fly.

Algorithm 2: Algorithm to approximate anomaly scores using random projection

Input: Choice of k, random projection matrix R ∈ R^{d×ℓ}
Initialization:
    Set covariance Ã^TÃ ← 0
First Pass: As each row a_(i) streams in,
    Project by R to get R^T a_(i)
    Update covariance Ã^TÃ ← Ã^TÃ + (R^T a_(i))(R^T a_(i))^T
SVD:
    Compute the top k right singular vectors of Ã^TÃ
Second Pass: As each row a_(i) streams in,
    Project by R to get R^T a_(i)
    For each projected row, use the estimated right singular vectors to compute the leverage scores and projection distances

Theorem 3. For ε sufficiently small, there exists ℓ = k^3 · poly(ε^{-1}, Δ) such that the algorithm above produces estimates L̃^k(i) and T̃^k(i) in the second pass, such that with high probability,

    ∑_{i=1}^{n} |T^k(i) − T̃^k(i)| ≤ ε ‖A‖_F^2,
    ∑_{i=1}^{n} |L^k(i) − L̃^k(i)| ≤ ε ∑_{i=1}^{n} L^k(i).

The algorithm uses space O(ℓ^2 + log(d) log(k)) and has running time O(ndℓ).

This gives an average case guarantee.
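A compact Python rendering of the two-pass structure of Algorithm 2 (our sketch; for simplicity it stores R explicitly rather than generating it pseudorandomly on the fly, and uses a dense eigendecomposition of the small ℓ × ℓ covariance):

```python
import numpy as np

def algorithm2_scores(rows, d, k, ell, seed=0):
    """Two-pass scoring in the projected space: accumulate only the small
    ell x ell covariance A_tilde^T A_tilde, then score the projected rows
    against its top-k eigenvectors."""
    rng = np.random.default_rng(seed)
    R = rng.choice([-1.0, 1.0], size=(d, ell)) / np.sqrt(ell)
    C = np.zeros((ell, ell))
    for a in rows:                          # first pass
        y = R.T @ a                         # projected row R^T a_(i)
        C += np.outer(y, y)                 # update covariance
    evals, evecs = np.linalg.eigh(C)
    order = np.argsort(evals)[::-1][:k]
    sig2, Vk = evals[order], evecs[:, order]
    Lk, Tk = [], []
    for a in rows:                          # second pass
        y = R.T @ a
        p = Vk.T @ y
        Lk.append(np.sum(p ** 2 / sig2))            # approximate L^k(i)
        Tk.append(np.dot(y, y) - np.sum(p ** 2))    # approximate T^k(i)
    return np.array(Lk), np.array(Tk)

# Near-low-rank synthetic data: rank-5 signal plus small noise.
rng = np.random.default_rng(2)
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 500)) \
    + 0.01 * rng.standard_normal((200, 500))
Lk, Tk = algorithm2_scores(list(A), d=500, k=5, ell=50)
```

Nothing of size d × d or n × n is ever materialized here: the working state is the ℓ × ℓ covariance plus R, matching the structure (though not the pseudorandom storage of R) described above.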
We note that Theorem 3 shows a new property of random\nprojections\u2014that on average they can preserve leverage scores and distances from the principal\nsubspace, with the projection dimension (cid:96) being only poly(k, \u03b5\u22121, \u2206), independent of both n and d.\nWe can obtain similar guarantees as in Theorem 3 for other sketches which preserve the column\nspace, such as sampling the columns proportional to their squared lengths [19, 31], at the price of\none extra pass. Again the resulting algorithm is very simple: it maintains a carefully chosen (cid:96) \u00d7 (cid:96)\nsubmatrix of the full d \u00d7 d covariance matrix AT A where (cid:96) = O(k3). We state the full algorithm in\nSection C.3.\n\n4 Experimental evaluation\n\nThe aim of our experiments is to test whether our algorithms give comparable results to exact anomaly\nscore computation based on full SVD. So in our experiments, we take the results of SVD as the\n\n6\n\n\fground truth and see how close our algorithms get to it. In particular, the goal is to determine how\nlarge the parameter (cid:96) that determines the size of the sketch needs to be to get close to the exact scores.\nOur results suggest that for high dimensional data sets, it is possible to get good approximations to the\nexact anomaly scores even for fairly small values of (cid:96) (a small multiple of k), hence our worst-case\ntheoretical bounds (which involve polynomials in k and other parameters) are on the pessimistic side.\n\nDatasets: We ran experiments on three publicly available datasets: p53 mutants [32], Dorothea\n[33] and RCV1 [34], all of which are available from the UCI Machine Learning Repository, and are\nhigh dimensional (d > 5000). The original RCV1 dataset contains 804414 rows, we took every tenth\nelement from it. 
The sizes of the datasets are listed in Table 1.\n\nGround Truth:\nTo establish the ground truth, there are two parameters: the dimension k (typically\nbetween 10 and 125) and a threshold \u03b7 (typically between 0.01 and 0.1). We compute the anomaly\nscores for this k using a full SVD, and then label the \u03b7 fraction of points with the highest anomaly\nscores to be outliers. k is chosen by examining the explained variance of the datatset as a function of\nk, and \u03b7 by examining the histogram of the anomaly score.\n\nOur Algorithms: We run Algorithm 1 using random column projections in place of Frequent\nDirections.4 The relevant parameter here is the projection dimension (cid:96), which results in a sketch\nmatrix of size d \u00d7 (cid:96). We run Algorithm 2 with random row projections. If the projection dimension is\n(cid:96), the resulting sketch size is O((cid:96)2) for the covariance matrix. For a given (cid:96), the time complexity of\nboth algorithms is similar, however the size of the sketches are very different: O(d(cid:96)) versus O((cid:96)2).\nMeasuring accuracy: We ran experiments with a range of (cid:96)s, in the range (2k, 20k) for each\ndataset (hence the curves have different start/end points). The algorithm is given just the points\n(without labels or \u03b7) and computes anomaly scores for them. We then declare the points with the\ntop \u03b7(cid:48) fraction of scores to be anomalies, and then compute the F1 score (de\ufb01ned as the harmonic\nmean of the precision and the recall). We choose the value of \u03b7(cid:48) which maximizes the F1 score. This\nmeasures how well the proposed algorithms can approximate the exact outlier scores. Note that in\norder to get both good precision and recall, \u03b7(cid:48) cannot be too far from \u03b7. We report the average F1\nscore over 5 runs.\nFor each dataset, we run both algorithms, approximate both the leverage and projection scores, and\ntry three different values of k. 
For each of these settings, we run over roughly 10 values for (cid:96). The\nresults are plotted in Figs. 2, 3 and 4. Here are some takeaways from our experiments:\n\n\u2022 Taking (cid:96) = Ck with a fairly small C \u2248 10 suf\ufb01ces to get F1 scores > 0.75 in most settings.\n\u2022 Algorithm 1 generally outperforms Algorithm 2 for a given value of (cid:96). This should not be\ntoo surprising given that it uses much more memory, and is known to give pointwise rather\nthan average case guarantees. However, Algorithm 2 does surprisingly well for an algorithm\nwhose memory footprint is essentially independent of the input dimension d.\n\n\u2022 The separation assumption (Assumption (1)) does hold to the extent that the spectrum is not\n\ndegenerate, but not with a large gap. The algorithms seem fairly robust to this.\n\n\u2022 The approximate low-rank assumption (Assumption (2)) seems to be important in practice.\nOur best results are for the p53 data set, where the top 10 components explain 87% of the\ntotal variance. The worst results are for the RCV1 data set, where the top 100 and 200\ncomponents explain only 15% and 25% of the total variance respectively.\n\nPerformance. While the main focus of this work is on the streaming model and memory con-\nsumption, our algorithms offer considerable speedups even in the of\ufb02ine/batch setting. Our timing\nexperiments were run using Python/Jupyter notebook on a linux VM with 8 cores and 32 Gb of\nRAM, the times reported are total CPU times in seconds as measured by the % time function, and are\nreported in Table 1. We focus on computing projection distances using SVD (the baseline), Random\nColumn Projection (Algorithm 1) and Random Row Projection (Algorithm 2). All SVD computations\nuse the randomized_svd function from scikit.learn. The baseline computes only the top k\nsingular values and vectors (not the entire SVD). The results show consistent speedups between 2\u00d7\nand 6\u00d7. 
Which algorithm is faster depends on which dimension of the input matrix is larger.\n\n4Since the existing implementation of Frequent Directions [35] does not seem to handle sparse matrices.\n\n7\n\n\fTable 1: Running times for computing rank-k projection distance. Speedups between 2\u00d7 and 6\u00d7.\n\nDataset\n\np53 mutants\nDorothea\nRCV1\n\nSize (n \u00d7 d)\n16772 \u00d7 5409\n1950 \u00d7 100000\n80442 \u00d7 47236\n\nk\n\n20\n20\n50\n\n(cid:96)\n\n200\n200\n500\n\nSVD\n\n29.2s\n17.7s\n39.6s\n\nColumn\nProjection\n\nRow\n\nProjection\n\n6.88s\n9.91s\n17.5s\n\n7.5s\n2.58s\n20.8s\n\nFigure 2: Results for P53 Mutants. We get F1 score > 0.8 with > 10\u00d7 space savings.\n\n5 Related work\nIn most anomaly detection settings, labels are hard to come by and unsupervised learning methods\nare preferred: the algorithm needs to learn what the bulk of the data looks like and then detect any\ndeviations from this. Subspace based scores are well-suited to this, but various other anomaly scores\nhave also been proposed such as those based on approximating the density of the data [36, 37] and\nattribute-wise analysis [38], we refer to surveys on anomaly detection for an overview [1, 2].\nLeverage scores have found numerous applications in numerical linear algebra, and hence there has\nbeen signi\ufb01cant interest in improving the time complexity of computing them. For the problem of\napproximating the (full) leverage scores (L(i) in Eq. (1), note that we are concerned with the rank-k\nleverage scores Lk(i)), Clarkson and Woodruff [39] and Drineas et al. [40] use sparse subspace\nembeddings and Fast Johnson Lindenstrauss Transforms (FJLT [41]) to compute the leverage scores\nusing O(nd) time instead of the O(nd2) time required by the baseline\u2014but these still need O(d2)\nmemory. With respect to projection distance, the closest work to ours is Huang and Kasiviswanathan\n[42] which uses Frequent Directions to approximate projection distances in O(kd) space. 
In contrast\nto these approaches, our results hold both for rank-k leverage scores and projection distances, for any\nmatrix sketching algorithm\u2014not just FJLT or Frequent Directions\u2014and our space requirement can\nbe as small as log(d) for average case guarantees. However, Clarkson and Woodruff [39] and Drineas\net al. [40] give multiplicative guarantees for approximating leverage scores while our guarantees for\nrank-k leverage scores are additive, but are nevertheless suf\ufb01cient for the task of detecting anomalies.\n6 Conclusion\nWe show that techniques from sketching can be used to derive simple and practical algorithms\nfor computing subspace-based anomaly scores which provably approximate the true scores at a\nsigni\ufb01cantly lower cost in terms of time and memory. A promising direction of future work is to use\nthem in real-world high-dimensional anomaly detection tasks.\n\nAcknowledgments\n\nThe authors thank David Woodruff for suggestions on using communication complexity tools to show\nlower bounds on memory usage for approximating anomaly scores and Weihao Kong for several\nuseful discussions on estimating singular values and vectors using random projections. We also thank\nSteve Mussmann, Neha Gupta, Yair Carmon and the anonymous reviewers for detailed feedback on\n\n8\n\n\fFigure 3: Results for the Dorothea dataset. Column projections give more accurate approximations,\nbut they use more space.\n\nFigure 4: Results for the RCV1 dataset. Our results here are worse than for the other datasets, we\nhypothesize this is due to this data having less pronounced low-rank structure.\n\ninitial versions of the paper. VS\u2019s contribution was partially supported by NSF award 1813049, and\nONR award N00014-18-1-2295.\n\nReferences\n[1] Charu C. Aggarwal. Outlier Analysis. Springer Publishing Company, Incorporated, 2nd edition,\n\n2013. ISBN 9783319475783.\n\n[2] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. 
ACM Computing Surveys, 41(3):15:1–15:58, 2009.

[3] Anukool Lakhina, Mark Crovella, and Christophe Diot. Diagnosing network-wide traffic anomalies. In ACM SIGCOMM Computer Communication Review, volume 34, pages 219–230. ACM, 2004.

[4] Anukool Lakhina, Mark Crovella, and Christophe Diot. Mining anomalies using traffic feature distributions. In ACM SIGCOMM Computer Communication Review, volume 35, pages 217–228. ACM, 2005.

[5] Ling Huang, XuanLong Nguyen, Minos Garofalakis, Joseph M. Hellerstein, Michael I. Jordan, Anthony D. Joseph, and Nina Taft. Communication-efficient online detection of network-wide anomalies. In INFOCOM 2007, 26th IEEE International Conference on Computer Communications, pages 134–142. IEEE, 2007.

[6] Ling Huang, XuanLong Nguyen, Minos Garofalakis, Michael I. Jordan, Anthony Joseph, and Nina Taft. In-network PCA and anomaly detection. In Advances in Neural Information Processing Systems, pages 617–624, 2007.

[7] Bimal Viswanath, Muhammad Ahmad Bashir, Mark Crovella, Saikat Guha, Krishna P. Gummadi, Balachander Krishnamurthy, and Alan Mislove. Towards detecting anomalous user behavior in online social networks. In USENIX Security Symposium, pages 223–238, 2014.

[8] Rebecca Portnoff. The Dark Net: De-Anonymization, Classification and Analysis. PhD thesis, EECS Department, University of California, Berkeley, March 2018.

[9] Mei-ling Shyu, Shu-ching Chen, Kanoksri Sarinnapakorn, and Liwu Chang. A novel anomaly detection scheme based on principal component classifier. In Proceedings of the IEEE Foundations and New Directions of Data Mining Workshop, in conjunction with the Third IEEE International Conference on Data Mining (ICDM'03). Citeseer, 2003.

[10] Wei Wang, Xiaohong Guan, and Xiangliang Zhang. A novel intrusion detection method based on principle component analysis in computer security.
In International Symposium on Neural Networks, pages 657–662. Springer, 2004.

[11] Jonathan J. Davis and Andrew J. Clark. Data preprocessing for anomaly based network intrusion detection: A review. Computers & Security, 30(6-7):353–375, 2011.

[12] Leo H. Chiang, Evan L. Russell, and Richard D. Braatz. Fault Detection and Diagnosis in Industrial Systems. Springer Science & Business Media, 2000.

[13] Evan L. Russell, Leo H. Chiang, and Richard D. Braatz. Fault detection in industrial processes using canonical variate analysis and dynamic principal component analysis. Chemometrics and Intelligent Laboratory Systems, 51(1):81–93, 2000.

[14] S. Joe Qin. Statistical process monitoring: basics and beyond. Journal of Chemometrics, 17(8-9):480–502, 2003.

[15] Wei Xu, Ling Huang, Armando Fox, David Patterson, and Michael I. Jordan. Detecting large-scale system problems by mining console logs. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, pages 117–132. ACM, 2009.

[16] Haibo Mi, Huaimin Wang, Yangfan Zhou, Michael Rung-Tsong Lyu, and Hua Cai. Toward fine-grained, unsupervised, scalable performance diagnosis for production cloud computing systems. IEEE Transactions on Parallel and Distributed Systems, 24(6):1245–1255, 2013.

[17] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends® in Theoretical Computer Science, 10(1–2):1–157, 2014.

[18] Edo Liberty. Simple and deterministic matrix sketching. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 581–588. ACM, 2013.

[19] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication. SIAM Journal on Computing, 36(1):132–157, 2006.

[20] Roger A. Horn and Charles R. Johnson. Topics in Matrix Analysis.
Corrected reprint of the 1991 original. Cambridge University Press, Cambridge, 1994.

[21] Per-Åke Wedin. Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics, 12(1):99–111, March 1972.

[22] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.

[23] Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, and David P. Woodruff. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13(1):3475–3506, December 2012. ISSN 1532-4435.

[24] Michael W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends® in Machine Learning, 3(2):123–224, 2011. ISSN 1935-8237. doi: 10.1561/2200000035.

[25] H.E.T. Holgersson and Peter S. Karlsson. Three estimators of the Mahalanobis distance in high-dimensional data. Journal of Applied Statistics, 39(12):2713–2720, 2012.

[26] Mina Ghashami, Edo Liberty, Jeff M. Phillips, and David P. Woodruff. Frequent directions: Simple and deterministic matrix sketching. SIAM Journal on Computing, 45(5):1762–1792, 2016. doi: 10.1137/15M1009718.

[27] Vladimir Koltchinskii and Karim Lounici. Concentration inequalities and moment bounds for sample covariance operators. Bernoulli, 23(1):110–133, February 2017. doi: 10.3150/15-BEJ730.

[28] Salil P. Vadhan. Pseudorandomness. Foundations and Trends® in Theoretical Computer Science, 7(1–3):1–336, 2012.

[29] Michael B. Cohen, Jelani Nelson, and David P. Woodruff. Optimal approximate matrix product in terms of stable rank. arXiv preprint arXiv:1507.02268, 2015.

[30] Michael B. Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Madalina Persu. Dimensionality reduction for k-means clustering and low rank approximation.
In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 163–172. ACM, 2015.

[31] Avner Magen and Anastasios Zouzias. Low rank matrix-valued Chernoff bounds and approximate matrix multiplication. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1422–1436. SIAM, 2011.

[32] S. A. Danziger et al. Functional census of mutation sequence spaces: the example of p53 cancer rescue mutants. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2006.

[33] Isabelle Guyon, Steve R. Gunn, Asa Ben-Hur, and Gideon Dror. Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems, 2004.

[34] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.

[35] Edo Liberty and Mina Ghashami. Frequent Directions implementation. https://github.com/edoliberty/frequent-directions.

[36] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. LOF: identifying density-based local outliers. In ACM SIGMOD Record, volume 29, pages 93–104. ACM, 2000.

[37] Markus Schneider, Wolfgang Ertel, and Fabio Ramos. Expected similarity estimation for large-scale batch and streaming anomaly detection. Machine Learning, 105(3):305–333, 2016.

[38] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In Eighth IEEE International Conference on Data Mining (ICDM'08), pages 413–422. IEEE, 2008.

[39] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pages 81–90. ACM, 2013.

[40] Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, and David P. Woodruff. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13(Dec):3475–3506, 2012.

[41] Nir Ailon and Bernard Chazelle.
The fast Johnson–Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39(1):302–322, 2009.

[42] Hao Huang and Shiva Prasad Kasiviswanathan. Streaming anomaly detection using randomized matrix sketching. Proceedings of the VLDB Endowment, 9(3), November 2015.

[43] Amit Chakrabarti, Subhash Khot, and Xiaodong Sun. Near-optimal lower bounds on the multi-party communication complexity of set disjointness. In Proceedings of the 18th IEEE Annual Conference on Computational Complexity, pages 107–117. IEEE, 2003.

[44] Ahmed El Alaoui and Michael W. Mahoney. Fast randomized kernel ridge regression with statistical guarantees. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 775–783, Cambridge, MA, USA, 2015. MIT Press.

[45] Michael B. Cohen, Cameron Musco, and Christopher Musco. Input sparsity time low-rank approximation via ridge leverage score sampling. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '17, pages 1758–1777, Philadelphia, PA, USA, 2017. Society for Industrial and Applied Mathematics.