{"title": "Robust Kernel Density Estimation by Scaling and Projection in Hilbert Space", "book": "Advances in Neural Information Processing Systems", "page_first": 433, "page_last": 441, "abstract": "While robust parameter estimation has been well studied in parametric density estimation, there has been little investigation into robust density estimation in the nonparametric setting. We present a robust version of the popular kernel density estimator (KDE). As with other estimators, a robust version of the KDE is useful since sample contamination is a common issue with datasets. What ``robustness'' means for a nonparametric density estimate is not straightforward and is a topic we explore in this paper. To construct a robust KDE we scale the traditional KDE and project it to its nearest weighted KDE in the $L^2$ norm. This yields a scaled and projected KDE (SPKDE). Because the squared $L^2$ norm penalizes point-wise errors superlinearly, this causes the weighted KDE to allocate more weight to high density regions. We demonstrate the robustness of the SPKDE with numerical experiments and a consistency result which shows that asymptotically the SPKDE recovers the uncontaminated density under sufficient conditions on the contamination.", "full_text": "Robust Kernel Density Estimation by Scaling and Projection in Hilbert Space

Robert A. Vandermeulen
Department of EECS
University of Michigan
Ann Arbor, MI 48109
rvdm@umich.edu

Clayton D. Scott
Department of EECS
University of Michigan
Ann Arbor, MI 48109
clayscot@umich.edu

Abstract

While robust parameter estimation has been well studied in parametric density estimation, there has been little investigation into robust density estimation in the nonparametric setting. We present a robust version of the popular kernel density estimator (KDE). As with other estimators, a robust version of the KDE is useful since sample contamination is a common issue with datasets.
What “robustness” means for a nonparametric density estimate is not straightforward and is a topic we explore in this paper. To construct a robust KDE we scale the traditional KDE and project it to its nearest weighted KDE in the $L^2$ norm. This yields a scaled and projected KDE (SPKDE). Because the squared $L^2$ norm penalizes point-wise errors superlinearly, this causes the weighted KDE to allocate more weight to high density regions. We demonstrate the robustness of the SPKDE with numerical experiments and a consistency result which shows that asymptotically the SPKDE recovers the uncontaminated density under sufficient conditions on the contamination.

1 Introduction

The estimation of a probability density function (pdf) from a random sample is a ubiquitous problem in statistics. Methods for density estimation can be divided into parametric and nonparametric, depending on whether parametric models are appropriate. Nonparametric density estimators (NDEs) offer the advantage of working under more general assumptions, but they also have disadvantages with respect to their parametric counterparts. One of these disadvantages is the apparent difficulty in making NDEs robust, which is desirable when the data follow not the density of interest, but rather a contaminated version thereof. In this work we propose a robust version of the KDE, which serves as the workhorse among NDEs [11, 10].

We consider the situation where most observations come from a target density $f_{tar}$ but some observations are drawn from a contaminating density $f_{con}$, so our observed samples come from the density $f_{obs} = (1 - \varepsilon) f_{tar} + \varepsilon f_{con}$. It is not known which component a given observation comes from. When considering this scenario in the infinite sample setting we would like to construct some transform that, when applied to $f_{obs}$, yields $f_{tar}$.
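The two-component sampling scenario just described is easy to simulate. The sketch below is our own illustration (the function and its sampler arguments are hypothetical, not code from the paper): it draws from $f_{obs}$ by choosing the contaminating component with probability $\varepsilon$.

```python
import random

def sample_fobs(sample_target, sample_contam, eps, n, rng=random):
    """Draw n observations from f_obs = (1 - eps) f_tar + eps f_con by
    picking the contaminating sampler with probability eps. Which
    component produced a given draw is not recorded, matching the
    setting described in the text."""
    return [sample_contam() if rng.random() < eps else sample_target()
            for _ in range(n)]
```

For instance, a Gaussian `sample_target` mixed with a wide uniform `sample_contam` produces the kind of contaminated sample the estimators below are meant to handle.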
We introduce a new formalism to describe transformations that “decontaminate” $f_{obs}$ under sufficient conditions on $f_{tar}$ and $f_{con}$. We focus on a specific nonparametric condition on $f_{tar}$ and $f_{con}$ that reflects the intuition that the contamination manifests in low density regions of $f_{tar}$. In the finite sample setting, we seek a NDE that converges to $f_{tar}$ asymptotically. Thus, we construct a weighted KDE where the kernel weights are lower in low density regions and higher in high density regions. To do this we multiply the standard KDE by a real value greater than one (scale) and then find the closest pdf to the scaled KDE in the $L^2$ norm (project), resulting in a scaled and projected kernel density estimator (SPKDE). Because the squared $L^2$ norm penalizes point-wise differences between functions quadratically, this causes the SPKDE to draw weight from the low density areas of the KDE and move it to high density areas to get a more uniform difference to the scaled KDE. The asymptotic limit of the SPKDE is a scaled and shifted version of $f_{obs}$. Given our proposed sufficient conditions on $f_{tar}$ and $f_{con}$, the SPKDE can asymptotically recover $f_{tar}$.

A different construction for a robust kernel density estimator, the aptly named “robust kernel density estimator” (RKDE), was developed by Kim & Scott [6]. In that paper the RKDE was analytically and experimentally shown to be robust, but no consistency result was presented. Vandermeulen & Scott [15] proved that a certain version of the RKDE converges to $f_{obs}$.
To our knowledge the convergence of the SPKDE to a transformed version of $f_{obs}$, which is equal to $f_{tar}$ under sufficient conditions on $f_{tar}$ and $f_{con}$, is the first result of its type.

In this paper we present a new formalism for nonparametric density estimation, necessary and sufficient conditions for decontamination, the construction of the SPKDE, and a proof of consistency. We also include experimental results applying the algorithm to benchmark datasets with comparisons to the RKDE, the traditional KDE, and an alternative robust KDE implementation. Many of our results and proof techniques are novel in the KDE literature. Proofs are contained in the supplemental material.

2 Nonparametric Contamination Models and Decontamination Procedures for Density Estimation

What assumptions on a target and contaminating density are necessary and sufficient in order to theoretically recover the target density is a question that, to the best of our knowledge, is completely unexplored in the nonparametric setting. We will approach this problem in the infinite sample setting, where we know $f_{obs} = (1 - \varepsilon) f_{tar} + \varepsilon f_{con}$ and $\varepsilon$, but do not know $f_{tar}$ or $f_{con}$. To this end we introduce a new formalism. Let $\mathcal{D}$ be the set of all pdfs on $\mathbb{R}^d$. We use the term contamination model to refer to any subset $\mathcal{V} \subset \mathcal{D} \times \mathcal{D}$, i.e. a set of pairs $(f_{tar}, f_{con})$. Let $R_\varepsilon : \mathcal{D} \to \mathcal{D}$ be a family of transformations on $\mathcal{D}$ indexed by $\varepsilon \in [0, 1)$. We say that $R_\varepsilon$ decontaminates $\mathcal{V}$ if for all $(f_{tar}, f_{con}) \in \mathcal{V}$ and $\varepsilon \in [0, 1)$ we have $R_\varepsilon((1 - \varepsilon) f_{tar} + \varepsilon f_{con}) = f_{tar}$.

One may wonder whether there exists some set of contaminating densities, $\mathcal{D}_{con}$, and a transformation, $R_\varepsilon$, such that $R_\varepsilon$ decontaminates $\mathcal{D} \times \mathcal{D}_{con}$. In other words, does there exist some set of contaminating densities for which we can recover any target density?
It turns out this is impossible if $\mathcal{D}_{con}$ contains at least two elements.

Proposition 1. Let $\mathcal{D}_{con} \subset \mathcal{D}$ contain at least two elements. There does not exist any transformation $R_\varepsilon$ which decontaminates $\mathcal{D} \times \mathcal{D}_{con}$.

Proof. Let $f \in \mathcal{D}$ and $g, g' \in \mathcal{D}_{con}$ such that $g \neq g'$. Let $\varepsilon \in (0, \frac{1}{2})$. Clearly

$$f_{tar} \triangleq \frac{f(1 - 2\varepsilon) + \varepsilon g}{1 - \varepsilon} \quad \text{and} \quad f'_{tar} \triangleq \frac{f(1 - 2\varepsilon) + \varepsilon g'}{1 - \varepsilon}$$

are both elements of $\mathcal{D}$. Note that

$$(1 - \varepsilon) f_{tar} + \varepsilon g' = (1 - \varepsilon) f'_{tar} + \varepsilon g.$$

In order for $R_\varepsilon$ to decontaminate $\mathcal{D}$ with respect to $\mathcal{D}_{con}$, we need $R_\varepsilon((1 - \varepsilon) f_{tar} + \varepsilon g') = f_{tar}$ and $R_\varepsilon((1 - \varepsilon) f'_{tar} + \varepsilon g) = f'_{tar}$, which is impossible since $f_{tar} \neq f'_{tar}$.

This proposition imposes significant limitations on what contamination models can be decontaminated. For example, suppose we know that $f_{con}$ is Gaussian with known covariance matrix and unknown mean. Proposition 1 says we cannot design $R_\varepsilon$ so that it can decontaminate $(1 - \varepsilon) f_{tar} + \varepsilon f_{con}$ for all $f_{tar} \in \mathcal{D}$. In other words, it is impossible to design an algorithm capable of removing Gaussian contamination (for example) from arbitrary target densities. Furthermore, if $R_\varepsilon$ decontaminates $\mathcal{V}$ and $\mathcal{V}$ is fully nonparametric (i.e. for all $f \in \mathcal{D}$ there exists some $f' \in \mathcal{D}$ such that $(f, f') \in \mathcal{V}$) then for each $(f_{tar}, f_{con})$ pair, $f_{con}$ must satisfy some properties which depend on $f_{tar}$.

2.1 Proposed Contamination Model

For a function $f : \mathbb{R}^d \to \mathbb{R}$ let $\mathrm{supp}(f)$ denote the support of $f$. We introduce the following contamination assumption:

Assumption A.
For the pair $(f_{tar}, f_{con})$, there exists $u$ such that $f_{con}(x) = u$ for almost all (in the Lebesgue sense) $x \in \mathrm{supp}(f_{tar})$ and $f_{con}(x') \leq u$ for almost all $x' \notin \mathrm{supp}(f_{tar})$.

See Figure 1 for an example of a density satisfying this assumption. Because $f_{con}$ must be uniform over the support of $f_{tar}$, a consequence of Assumption A is that $\mathrm{supp}(f_{tar})$ has finite Lebesgue measure. Let $\mathcal{V}_A$ be the contamination model containing all pairs of densities which satisfy Assumption A. Note that $\bigcup_{(f_{tar}, f_{con}) \in \mathcal{V}_A} f_{tar}$ is exactly all densities whose support has finite Lebesgue measure, which includes all densities with compact support.

The uniformity assumption on $f_{con}$ is a common “noninformative” assumption on the contamination. Furthermore, this assumption is supported by connections to one-class classification. In that problem, only one class (corresponding to our $f_{tar}$) is observed for training, but the testing data is drawn from $f_{obs}$ and must be classified. The dominant paradigm for nonparametric one-class classification is to estimate a level set of $f_{tar}$ from the one observed training class [14, 7, 13, 16, 12, 9], and classify test data according to that level set. Yet level sets only yield optimal classifiers (i.e. likelihood ratio tests) under the uniformity assumption on $f_{con}$, so these methods are implicitly adopting this assumption. Furthermore, a uniform contamination prior has been shown to optimize the worst-case detection rate among all choices for the unknown contamination density [5]. Finally, our experiments demonstrate that the SPKDE works well in practice, even when Assumption A is significantly violated.

2.2 Decontamination Procedure

Under Assumption A, $f_{tar}$ is present in $f_{obs}$ and its shape is left unmodified (up to a multiplicative factor) by $f_{con}$.
To recover $f_{tar}$ it is necessary to first scale $f_{obs}$ by $\beta = \frac{1}{1-\varepsilon}$, yielding

$$\frac{1}{1-\varepsilon}\left((1-\varepsilon) f_{tar} + \varepsilon f_{con}\right) = f_{tar} + \frac{\varepsilon}{1-\varepsilon} f_{con}. \quad (1)$$

After scaling we would like to slice off $\frac{\varepsilon}{1-\varepsilon} f_{con}$ from the bottom of $f_{tar} + \frac{\varepsilon}{1-\varepsilon} f_{con}$. This transform is achieved by

$$\max\left\{0,\; f_{tar} + \frac{\varepsilon}{1-\varepsilon} f_{con} - \alpha\right\}, \quad (2)$$

where $\alpha$ is set such that (2) is a pdf (which in this case is achieved with $\alpha = u \frac{\varepsilon}{1-\varepsilon}$, with $u$ the uniform level of Assumption A). We will now show that this transform is well defined in a general sense. Let $f$ be a pdf and let

$$g_{\beta,\alpha} = \max\{0, \beta f(\cdot) - \alpha\},$$

where the max is defined pointwise. The following lemma shows that it is possible to slice off the bottom of any scaled pdf to get a transformed pdf and that the transformed pdf is unique.

Lemma 1. For fixed $\beta > 1$ there exists a unique $\alpha' > 0$ such that $\|g_{\beta,\alpha'}\|_{L^1} = 1$.

Figure 2 demonstrates this transformation applied to a pdf. We define the transform $R^A_\varepsilon : \mathcal{D} \to \mathcal{D}$ by

$$R^A_\varepsilon(f) = \max\left\{\frac{1}{1-\varepsilon} f(\cdot) - \alpha,\; 0\right\},$$

where $\alpha$ is such that $R^A_\varepsilon(f)$ is a pdf.

Proposition 2. $R^A_\varepsilon$ decontaminates $\mathcal{V}_A$.

The proof of this proposition is an intermediate step in the proof of Theorem 2. For any two subsets $\mathcal{V}, \mathcal{V}' \subset \mathcal{D} \times \mathcal{D}$, $R_\varepsilon$ decontaminates $\mathcal{V}$ and $\mathcal{V}'$ iff $R_\varepsilon$ decontaminates $\mathcal{V} \cup \mathcal{V}'$. Because of this, every decontaminating transform has a maximal set which it can decontaminate. Assumption A is both sufficient and necessary for decontamination by $R^A_\varepsilon$, i.e. the set $\mathcal{V}_A$ is maximal.

Proposition 3.
Let $(q, q') \in \mathcal{D} \times \mathcal{D}$ with $(q, q') \notin \mathcal{V}_A$. Then $R^A_\varepsilon$ cannot decontaminate $\{(q, q')\}$.

The proof of this proposition is in the supplementary material.

Figure 1: Density with contamination satisfying Assumption A (legend: $\varepsilon f_{con}$, $(1-\varepsilon) f_{tar}$).

2.3 Other Possible Contamination Models

The model described previously is just one of many possible models. An obvious approach to robust kernel density estimation is to use an anomaly detection algorithm and construct the KDE using only non-anomalous samples. We will investigate this model under a couple of anomaly detection schemes and describe their properties.

Figure 2: Infinite sample SPKDE transform (panels: original density; scaled density; shifted to pdf). Arrows indicate the area under the line.

One of the most common methods for anomaly detection is the level set method. For a probability measure $\mu$ this method attempts to find the set $S$ with smallest Lebesgue measure such that $\mu(S)$ is above some threshold, $t$, and declares samples outside of that set as being anomalous. For a density $f$ this is equivalent to finding $\lambda$ such that $\int_{\{x \mid f(x) \geq \lambda\}} f(y)\,dy = t$ and declaring samples where $f(X) < \lambda$ as being anomalous. Let $X_1, \ldots, X_n$ be iid samples from $f_{obs}$. Using the level set method for a robust KDE, we would construct a density $\widehat{f}_{obs}$ which is an estimate of $f_{obs}$. Next we would select some threshold $\lambda > 0$ and declare a sample, $X_i$, as being anomalous if $\widehat{f}_{obs}(X_i) < \lambda$. Finally we would construct a KDE using the non-anomalous samples. Let $\chi_{\{\cdot\}}$ be the indicator function. Applying this method in the infinite sample situation, i.e. $\widehat{f}_{obs} = f_{obs}$, would cause our non-anomalous samples to come from the density $p(x) = f_{obs}(x)\,\chi_{\{f_{obs}(x) > \lambda\}}/\tau$, where $\tau = \int \chi_{\{f_{obs}(y) > \lambda\}} f_{obs}(y)\,dy$. See Figure 3.
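As a finite-sample sketch of this level-set rejection scheme (our own illustration, not the paper's exact implementation: a percentile of the pilot KDE scores stands in for the threshold $\lambda$, and the function names are ours):

```python
import numpy as np

def kde(x_eval, X, sigma):
    """One-dimensional Gaussian KDE evaluated at the points x_eval."""
    d2 = (x_eval[:, None] - X[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2)).mean(axis=1) / np.sqrt(2 * np.pi * sigma ** 2)

def level_set_reject(X, sigma, pct=10.0):
    """Score each sample under a pilot KDE of f_obs and keep only the
    samples above the pct-th percentile of the scores; the rejected
    low-density tail plays the role of the anomalous samples."""
    scores = kde(X, X, sigma)
    return X[scores >= np.percentile(scores, pct)]
```

The kept samples would then be fed into a fresh KDE, which is how the "rejKDE" baseline evaluated later in the experiments is built.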
Perfect recovery of $f_{tar}$ using this method requires $\varepsilon f_{con}(x) \leq (1-\varepsilon) f_{tar}(x)$ for all $x$ and that $f_{con}$ and $f_{tar}$ have disjoint supports. The first assumption means that this density estimator can only recover $f_{tar}$ if it has a drop off on the boundary of its support, whereas Assumption A only requires that $f_{tar}$ have finite support. See the last diagram in Figure 3. Although these assumptions may be reasonable in certain situations, we find them less palatable than Assumption A. We also evaluate this approach experimentally later and find that it performs poorly.

Another approach based on anomaly detection would be to find the connected components of $f_{obs}$ and declare those that are, in some sense, small as being anomalous. A “small” connected component may be one that integrates to a small value, or which has a small mode. Unfortunately this approach also assumes that $f_{tar}$ and $f_{con}$ have disjoint supports. There are also computational issues with this anomaly detection scheme; finding connected components, finding modes, and numerical integration are computationally difficult.

To some degree, $R^A_\varepsilon$ actually achieves the objectives of the previous two robust KDEs. For the first model, $R^A_\varepsilon$ does indeed set those regions of the pdf that are below some threshold to zero. For the second, if the magnitude of the level at which we choose to slice off the bottom of the contaminated density is larger than the mode of the anomalous component then the anomalous component will be eliminated.

Figure 3: Infinite sample version of the level set rejection KDE (panels: original density; threshold at $\lambda$; set density under threshold to 0; normalize to integrate to 1).

3 Scaled Projection Kernel Density Estimator

Here we consider approximating $R^A_\varepsilon$ in a finite sample situation.
Let $f \in L^2(\mathbb{R}^d)$ be a pdf and let $X_1, \ldots, X_n$ be iid samples from $f$. Let $k_\sigma(x, x')$ be a radial smoothing kernel with bandwidth $\sigma$, such that $k_\sigma(x, x') = \sigma^{-d} q(\|x - x'\|_2/\sigma)$, where $q(\|\cdot\|_2) \in L^2(\mathbb{R}^d)$ and is a pdf. The classic kernel density estimator is

$$\bar{f}^n_\sigma := \frac{1}{n} \sum_{i=1}^n k_\sigma(\cdot, X_i).$$

In practice $\varepsilon$ is usually not known and Assumption A is violated. Because of this we will scale our density by $\beta > 1$ rather than $\frac{1}{1-\varepsilon}$. For a density $f$ define

$$Q_\beta(f) \triangleq \max\{\beta f(\cdot) - \alpha, 0\},$$

where $\alpha = \alpha(\beta)$ is set such that the right-hand side is a pdf. $\beta$ can be used to tune robustness, with larger $\beta$ corresponding to more robustness (setting $\beta$ to 1 in all the following transforms simply yields the KDE). Given a KDE we would ideally like to apply $Q_\beta$ directly and search over $\alpha$ until $\max\{\beta \bar{f}^n_\sigma(\cdot) - \alpha, 0\}$ integrates to 1. Such an estimate requires multidimensional numerical integration and is not computationally tractable. The SPKDE is an alternative approach that always yields a density and manifests the transformed density in its asymptotic limit.

We now introduce the construction of the SPKDE. Let $D^n_\sigma$ be the convex hull of $k_\sigma(\cdot, X_1), \ldots, k_\sigma(\cdot, X_n)$ (the space of weighted kernel density estimators). The SPKDE is defined as

$$f^n_{\sigma,\beta} := \arg\min_{g \in D^n_\sigma} \left\|\beta \bar{f}^n_\sigma - g\right\|_{L^2},$$

which is guaranteed to have a unique minimizer since $D^n_\sigma$ is closed and convex and we are projecting in a Hilbert space ([1], Theorem 3.14).
If we represent $f^n_{\sigma,\beta}$ in the form

$$f^n_{\sigma,\beta} = \sum_{i=1}^n a_i k_\sigma(\cdot, X_i),$$

then the minimization problem is a quadratic program over the vector $a = [a_1, \ldots, a_n]^T$, with $a$ restricted to the probabilistic simplex, $\Delta^n$. Let $G$ be the Gram matrix of $k_\sigma(\cdot, X_1), \ldots, k_\sigma(\cdot, X_n)$, that is

$$G_{ij} = \langle k_\sigma(\cdot, X_i), k_\sigma(\cdot, X_j) \rangle_{L^2} = \int k_\sigma(x, X_i)\, k_\sigma(x, X_j)\, dx.$$

Let $\mathbf{1}$ be the ones vector and $b = G \mathbf{1} \frac{\beta}{n}$; then the quadratic program is

$$\min_{a \in \Delta^n} a^T G a - 2 b^T a.$$

Since $G$ is a Gram matrix, and therefore positive semidefinite, this quadratic program is convex. Furthermore, the integral defining $G_{ij}$ can be computed in closed form for many kernels of interest. For example, for the Gaussian kernel

$$k_\sigma(x, x') = (2\pi\sigma^2)^{-\frac{d}{2}} \exp\left(\frac{-\|x - x'\|^2}{2\sigma^2}\right) \implies G_{ij} = k_{\sqrt{2}\sigma}(X_i, X_j),$$

and for the Cauchy kernel [2]

$$k_\sigma(x, x') = \frac{\Gamma\left(\frac{1+d}{2}\right)}{\pi^{(d+1)/2}\, \sigma^d} \left(1 + \frac{\|x - x'\|^2}{\sigma^2}\right)^{-\frac{1+d}{2}} \implies G_{ij} = k_{2\sigma}(X_i, X_j).$$

We now present some results on the asymptotic behavior of the SPKDE. Let $\mathcal{D}$ be the set of all pdfs in $L^2(\mathbb{R}^d)$. The infinite sample version of the SPKDE is

$$f'_\beta = \arg\min_{h \in \mathcal{D}} \|\beta f - h\|^2_{L^2}.$$

It is worth noting that projection operators in Hilbert space, like the one above, are known to be well defined if the convex set we are projecting onto is closed and convex.
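To make the quadratic program concrete, here is a small one-dimensional sketch (our own code, assuming a Gaussian kernel and plain projected gradient descent; `project_simplex` is a standard implementation of simplex projection in the spirit of [4], and the function names are ours):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of the vector v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def gauss(x, y, sigma):
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def spkde_weights(X, sigma, beta, iters=3000):
    """Solve min_{a in simplex} a^T G a - 2 b^T a by projected gradient
    descent, using the closed form G_ij = k_{sqrt(2) sigma}(X_i, X_j)
    for the 1-d Gaussian kernel."""
    n = len(X)
    G = gauss(X[:, None], X[None, :], np.sqrt(2.0) * sigma)
    b = G @ np.full(n, beta / n)
    step = 1.0 / (2.0 * np.linalg.eigvalsh(G).max())  # 1/L for this quadratic
    a = np.full(n, 1.0 / n)
    for _ in range(iters):
        a = project_simplex(a - step * 2.0 * (G @ a - b))
    return a
```

On clustered data with one far outlier, the weight on the outlier's kernel is driven toward zero, mirroring the weight reallocation the $L^2$ projection is designed to produce.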
$\mathcal{D}$ is not closed in $L^2(\mathbb{R}^d)$, but this turns out not to be an issue because of the form of $\beta f$. For details see the proof of Lemma 2 in the supplemental material.

Lemma 2. $f'_\beta = \max\{\beta f(\cdot) - \alpha, 0\}$ where $\alpha$ is set such that $\max\{\beta f(\cdot) - \alpha, 0\}$ is a pdf.

Given the same rate on the bandwidth necessary for consistency of the traditional KDE, the SPKDE converges to its infinite sample version in its asymptotic limit.

Theorem 1. Let $f \in L^2(\mathbb{R}^d)$. If $n \to \infty$ and $\sigma \to 0$ with $n\sigma^d \to \infty$ then

$$\left\|f^n_{\sigma,\beta} - f'_\beta\right\|_{L^2} \overset{p}{\to} 0.$$

Because $f^n_{\sigma,\beta}$ is a sequence of pdfs and $f'_\beta \in L^2(\mathbb{R}^d)$, it is possible to show that $L^2$ convergence implies $L^1$ convergence.

Corollary 1. Given the conditions in the previous theorem statement,

$$\left\|f^n_{\sigma,\beta} - f'_\beta\right\|_{L^1} \overset{p}{\to} 0.$$

To summarize, the SPKDE converges to a transformed version of $f$. In the next section we will show that under Assumption A and with $\beta = \frac{1}{1-\varepsilon}$, the SPKDE converges to $f_{tar}$.

3.1 SPKDE Decontamination

Let $f_{tar} \in L^2(\mathbb{R}^d)$ be a pdf having support with finite Lebesgue measure and let $f_{tar}$ and $f_{con}$ satisfy Assumption A. Let $X_1, X_2, \ldots, X_n$ be iid samples from $f_{obs} = (1-\varepsilon) f_{tar} + \varepsilon f_{con}$ with $\varepsilon \in [0, 1)$. Finally let $f^n_{\sigma,\beta}$ be the SPKDE constructed from $X_1, \ldots, X_n$, having bandwidth $\sigma$ and robustness parameter $\beta$. We have

Theorem 2.
Let $\beta = \frac{1}{1-\varepsilon}$. If $n \to \infty$ and $\sigma \to 0$ with $n\sigma^d \to \infty$ then

$$\left\|f^n_{\sigma,\beta} - f_{tar}\right\|_{L^1} \overset{p}{\to} 0.$$

To our knowledge this result is the first of its kind, wherein a nonparametric density estimator is able to asymptotically recover the underlying density in the presence of contaminated data.

4 Experiments

For all of the experiments optimization was performed using projected gradient descent. The projection onto the probabilistic simplex was done using the algorithm developed in [4] (which was actually originally discovered a few decades earlier [3, 8]).

4.1 Synthetic Data

To show that the SPKDE's theoretical properties are manifested in practice we conducted an idealized experiment where the contamination is uniform and the contamination proportion is known. Figure 4 exhibits the ability of the SPKDE to compensate for uniform noise. Samples for the density estimator came from a mixture of the “Target” density with a uniform contamination on $[-2, 2]$, sampling from the contamination with probability $\varepsilon = 0.2$. This experiment used 500 samples and the robustness parameter $\beta$ was set to $\frac{1}{1-\varepsilon} = \frac{5}{4}$ (the value for perfect asymptotic decontamination). The SPKDE performs well in this situation and yields a scaled and shifted version of the standard KDE. This scale and shift is especially evident in the preservation of the bump on the right hand side of Figure 4.

4.2 Datasets

In our remaining experiments we investigate two performance metrics for different amounts of contamination. We perform our experiments on 12 classification datasets (names given in the supplemental material) where the 0 label is used as the target density and the 1 label is the anomalous contamination. This experimental setup does not satisfy Assumption A.
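The infinite-sample limit behind the synthetic experiment of Section 4.1 (and behind Theorem 2) can be sketched on a grid: scale $f_{obs}$ by $\beta = 1/(1-\varepsilon)$ and bisect for the slice level $\alpha$. This is our own toy illustration, with a triangular target and uniform contamination on $[-2, 2]$ standing in for the densities of Figure 4:

```python
import numpy as np

def slice_to_pdf(f_vals, dx, beta, tol=1e-12):
    """Return max(beta * f - alpha, 0) on a uniform grid with spacing dx,
    with alpha found by bisection so the result integrates to 1
    (Lemma 1 guarantees a unique such alpha > 0 when beta > 1)."""
    def mass(alpha):
        return np.maximum(beta * f_vals - alpha, 0.0).sum() * dx
    lo, hi = 0.0, beta * f_vals.max()
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if mass(mid) > 1.0 else (lo, mid)
    alpha = 0.5 * (lo + hi)
    return np.maximum(beta * f_vals - alpha, 0.0), alpha

# Uniform contamination over a triangular target, with eps known.
x = np.linspace(-2.0, 2.0, 4001)
dx = x[1] - x[0]
f_tar = np.maximum(1.0 - np.abs(x), 0.0)   # triangle density on [-1, 1]
f_con = np.full_like(x, 0.25)              # uniform density on [-2, 2]
eps = 0.2
f_obs = (1 - eps) * f_tar + eps * f_con
g, alpha = slice_to_pdf(f_obs, dx, beta=1.0 / (1.0 - eps))
```

With $\beta = 1/(1-\varepsilon)$ the recovered $g$ matches $f_{tar}$ up to grid error, and $\alpha$ comes out at $u\,\varepsilon/(1-\varepsilon) = 0.0625$, illustrating the decontamination statement.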
The training datasets are constructed with $n_0$ samples from label 0 and $\frac{\varepsilon}{1-\varepsilon} n_0$ samples from label 1, thus making an $\varepsilon$ proportion of our samples come from the contaminating density. For our experiments we use the values $\varepsilon = 0, 0.05, 0.1, 0.15, 0.20, 0.25, 0.30$. Given some dataset we are interested in how well our density estimators $\widehat{f}$ estimate the density of the 0 class of our dataset, $f_{tar}$. Each test is performed on 15 permutations of the dataset. The experimental setup here is similar to the setup in Kim & Scott [6], the most significant difference being that $\sigma$ is set differently.

4.3 Performance Criteria

First we investigate the Kullback-Leibler (KL) divergence

$$D_{KL}\left(\widehat{f}\,\|\, f_0\right) = \int \widehat{f}(x) \log\left(\frac{\widehat{f}(x)}{f_0(x)}\right) dx.$$

This KL divergence is large when $\widehat{f}$ estimates $f_0$ to have mass where it does not. For example, in our context, $\widehat{f}$ makes mistakes because of outlying contamination. We estimate this KL divergence as follows. Since we do not have access to $f_0$, it is estimated from the testing samples using a KDE, $\widetilde{f}_0$. The bandwidth for $\widetilde{f}_0$ is set using the testing data with a LOOCV line search minimizing $D_{KL}(f_0 \| \widetilde{f}_0)$, which is described in more detail below.
We then approximate the integral using a sample mean by generating samples $\{x'_i\}_{1}^{n'}$ from $\widehat{f}$ and using the estimate

$$D_{KL}\left(\widehat{f}\,\|\, f_0\right) \approx \frac{1}{n'} \sum_{i=1}^{n'} \log\left(\frac{\widehat{f}(x'_i)}{\widetilde{f}_0(x'_i)}\right).$$

The number of generated samples $n'$ is set to double the number of training samples.

Since the KL divergence isn't symmetric we also investigate

$$D_{KL}\left(f_0 \,\|\, \widehat{f}\right) = \int f_0(x) \log\left(\frac{f_0(x)}{\widehat{f}(x)}\right) dx = C - \int f_0(y) \log\left(\widehat{f}(y)\right) dy,$$

where $C$ is a constant not depending on $\widehat{f}$. This KL divergence is large when $f_0$ has mass where $\widehat{f}$ does not. The final term is easy to estimate using expectation. Let $\{x''_i\}_{1}^{n''}$ be testing samples from $f_0$ (not used for training). The following is a reasonable approximation:

$$-\int f_0(y) \log\left(\widehat{f}(y)\right) dy \approx -\frac{1}{n''} \sum_{i=1}^{n''} \log\left(\widehat{f}(x''_i)\right).$$

Figure 4: KDE and SPKDE in the presence of uniform noise (legend: KDE, SPKDE, Target).

For a given performance metric and contamination amount, we compare the mean performance of two density estimators across datasets using the Wilcoxon signed rank test [17]. Given $N$ datasets we first rank the datasets according to the absolute difference between performance criteria, with $h_i$ being the rank of the $i$th dataset. For example if the $j$th dataset has the largest absolute difference we set $h_j = N$, and if the $k$th dataset has the smallest absolute difference we set $h_k = 1$. We let $R_1$ be the sum of the $h_i$ where method one's metric is greater than method two's and $R_2$ be the sum of the $h_i$ where method two's metric is larger.
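The ranking procedure just described is compact enough to state in code. This sketch (the function name is ours; it operates on per-dataset metric values and ignores zero-difference ties) computes the rank sums $R_1$ and $R_2$:

```python
def signed_rank_sums(metric1, metric2):
    """Rank datasets by the absolute difference of the two metrics
    (rank 1 = smallest difference, rank N = largest) and sum the ranks
    according to which method's metric was larger."""
    diffs = [m1 - m2 for m1, m2 in zip(metric1, metric2)]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    r1 = r2 = 0
    for rank, i in enumerate(order, start=1):
        if diffs[i] > 0:
            r1 += rank
        elif diffs[i] < 0:
            r2 += rank
    return r1, r2
```

For example, `signed_rank_sums([3, 1, 5], [1, 2, 1])` gives `(5, 1)`: the differences are 2, -1, 4, the ranks by absolute difference are 2, 1, 3, and only the middle dataset favors method two. With $N = 12$ datasets, $R_1 + R_2$ is at most $78$.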
The test statistic is $\min(R_1, R_2)$, which we do not report. Instead we report $R_1$ and $R_2$ and the p-value that the two methods do not perform the same on average. $R_i < R_j$ is indicative of method $i$ performing better than method $j$.

4.4 Methods

The data were preprocessed by scaling to fit in the unit cube. This scaling technique was chosen over whitening because of issues with singular covariance matrices. The Gaussian kernel was used for all density estimates. For each permutation of each dataset, the bandwidth parameter is set using the training data with a LOOCV line search minimizing $D_{KL}(f_{obs} \| \widehat{f})$, where $\widehat{f}$ is the KDE based on the contaminated data and $f_{obs}$ is the observed density. This metric was used in order to maximize the performance of the traditional KDE in the KL divergence metrics. For the SPKDE the parameter $\beta$ was chosen to be 2 for all experiments. This choice of $\beta$ is based on a few preliminary experiments for which it yielded good results over various sample contamination amounts. The construction of the RKDE follows exactly the methods outlined in the “Experiments” section of Kim & Scott [6]. It is worth noting that the RKDE depends on the loss function used and that the Hampel loss used in these experiments very aggressively suppresses the kernel weights on the tails. Because of this we expect that the RKDE performs well on the $D_{KL}(\widehat{f} \| f_0)$ metric. We also compare the SPKDE to a kernel density estimator constructed from samples declared non-anomalous by a level set anomaly detection as described in Section 2.3. To do this we first construct the classic KDE, $\bar{f}^n_\sigma$, and then reject those samples in the lower 10th percentile of $\bar{f}^n_\sigma(X_i)$. Those samples not rejected are used in a new KDE, the “rejKDE”, using the same $\sigma$ parameter.

Table 1: Wilcoxon signed rank test results. [For each $\varepsilon \in \{0, 0.05, \ldots, 0.3\}$ and for each of the two KL metrics, the table reports $R_1$, $R_2$, and the p-value for the comparisons SPKDE vs. KDE, SPKDE vs. RKDE, and SPKDE vs. rejKDE; the tabular layout did not survive text extraction.]

4.5 Results

We present the results of the Wilcoxon signed rank tests in Table 1. Experimental results for each dataset can be found in the supplemental material. From the results it is clear that the SPKDE is effective at compensating for contamination in the $D_{KL}(\widehat{f} \| f_0)$ metric, albeit not quite as well as the RKDE. The main advantage of the SPKDE over the RKDE is that it significantly outperforms the RKDE in the $D_{KL}(f_0 \| \widehat{f})$ metric. The rejKDE performs significantly worse than the SPKDE on almost every experiment. Remarkably, the SPKDE outperforms the KDE in the situation with no contamination ($\varepsilon = 0$) for both performance metrics.

5 Conclusion

Robustness in the setting of nonparametric density estimation is a topic that has received little attention despite extensive study of robustness in the parametric setting.
In this paper we introduced a robust version of the KDE, the SPKDE, and developed a new formalism for the analysis of robust density estimation. With this new formalism we proposed a contamination model and a decontaminating transform to recover a target density in the presence of noise. The contamination model allows the target and contaminating densities to have overlapping support while the basic shape of the target density is left unmodified by the contaminating density. The proposed transform is computationally prohibitive to apply directly to the finite sample KDE, and the SPKDE is used to approximate the transform. The SPKDE was shown to asymptotically converge to the desired transform. Experiments have shown that the SPKDE is more effective than the RKDE at minimizing $D_{KL}(f_0 \| \widehat{f})$. Furthermore, the p-values for these experiments were smaller than the p-values for the $D_{KL}(\widehat{f} \| f_0)$ experiments where the RKDE outperforms the SPKDE.

Acknowledgements

This work was supported in part by NSF Awards 0953135, 1047871, 1217880, 1422157. We would also like to thank Samuel Brodkey for his assistance with the simulation code.

References

[1] H.H. Bauschke and P.L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. CMS Books in Mathematics, Ouvrages de mathématiques de la SMC. Springer New York, 2011.

[2] D.A. Berry, K.M. Chaloner, J.K. Geweke, and A. Zellner. Bayesian Analysis in Statistics and Econometrics: Essays in Honor of Arnold Zellner. A Wiley Interscience publication. Wiley, 1996.

[3] Peter Brucker. An O(n) algorithm for quadratic knapsack problems. Operations Research Letters, 3(3):163–166, 1984.

[4] John C. Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In ICML, pages 272–279, 2008.

[5] R. El-Yaniv and M. Nisenson.
Optimal single-class classification strategies. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA, 2007.

[6] J. Kim and C. Scott. Robust kernel density estimation. Journal of Machine Learning Research, 13:2529–2565, 2012.

[7] G. Lanckriet, L. El Ghaoui, and M. I. Jordan. Robust novelty detection with single-class MPM. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 905–912. MIT Press, Cambridge, MA, 2003.

[8] P.M. Pardalos and N. Kovoor. An algorithm for a singly constrained class of quadratic programs subject to upper and lower bounds. Mathematical Programming, 46(1-3):321–328, 1990.

[9] B. Schölkopf, J. Platt, J. Shawe-Taylor, A. Smola, and R. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1472, 2001.

[10] D. W. Scott. Multivariate Density Estimation. Wiley, New York, 1992.

[11] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, London, 1986.

[12] K. Sricharan and A. Hero. Efficient anomaly detection using bipartite k-nn graphs. In J. Shawe-Taylor, R.S. Zemel, P. Bartlett, F.C.N. Pereira, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 478–486, 2011.

[13] I. Steinwart, D. Hush, and C. Scovel. A classification framework for anomaly detection. Journal of Machine Learning Research, 6:211–232, 2005.

[14] J. Theiler and D. M. Cai. Resampling approach for anomaly detection in multispectral images. In Proc. SPIE, volume 5093, pages 230–240, 2003.

[15] R. Vandermeulen and C. Scott. Consistency of robust kernel density estimators. COLT, 30, 2013.

[16] R. Vert and J.-P. Vert. Consistency and convergence rates of one-class SVM and related algorithms. Journal of Machine Learning Research, pages 817–854, 2006.

[17] F. Wilcoxon.
Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, 1945.", "award": [], "sourceid": 285, "authors": [{"given_name": "Robert", "family_name": "Vandermeulen", "institution": "University of Michigan"}, {"given_name": "Clayton", "family_name": "Scott", "institution": "University of Michigan"}]}