{"title": "Density-Difference Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 683, "page_last": 691, "abstract": "We address the problem of estimating the difference between two probability densities. A naive approach is a two-step procedure of first estimating two densities separately and then computing their difference. However, such a two-step procedure does not necessarily work well because the first step is performed without regard to the second step and thus a small estimation error incurred in the first stage can cause a big error in the second stage. In this paper, we propose a single-shot procedure for directly estimating the density difference without separately estimating two densities. We derive a non-parametric finite-sample error bound for the proposed single-shot density-difference estimator and show that it achieves the optimal convergence rate. We then show how the proposed density-difference estimator can be utilized in L2-distance approximation. Finally, we experimentally demonstrate the usefulness of the proposed method in robust distribution comparison such as class-prior estimation and change-point detection.", "full_text": "Density-Difference Estimation\n\nMasashi Sugiyama (Tokyo Institute of Technology, Japan), Takafumi Kanamori (Nagoya University, Japan), Taiji Suzuki (University of Tokyo, Japan), Song Liu (Tokyo Institute of Technology, Japan), Marthinus Christoffel du Plessis (Tokyo Institute of Technology, Japan), Ichiro Takeuchi (Nagoya Institute of Technology, Japan)\n\nAbstract\n\nWe address the problem of estimating the difference between two probability densities. A naive approach is a two-step procedure of first estimating two densities separately and then computing their difference. However, such a two-step procedure does not necessarily work well because the first step is performed without regard to the second step and thus a small estimation error incurred in the first stage can cause a big error in the second stage. 
In this paper, we propose a single-shot procedure for directly estimating the density difference without separately estimating two densities. We derive a non-parametric finite-sample error bound for the proposed single-shot density-difference estimator and show that it achieves the optimal convergence rate. We then show how the proposed density-difference estimator can be utilized in L2-distance approximation. Finally, we experimentally demonstrate the usefulness of the proposed method in robust distribution comparison such as class-prior estimation and change-point detection.\n\n1 Introduction\n\nWhen estimating a quantity consisting of two elements, a two-stage approach of first estimating the two elements separately and then approximating the target quantity based on the estimates of the two elements often performs poorly, because the first stage is carried out without regard to the second stage and thus a small estimation error incurred in the first stage can cause a big error in the second stage. To cope with this problem, it would be more appropriate to directly estimate the target quantity in a single-shot process without separately estimating the two elements.\n\nA seminal example that follows this general idea is pattern recognition by the support vector machine [1]: Instead of separately estimating two probability distributions of patterns for positive and negative classes, the support vector machine directly learns the boundary between the two classes that is sufficient for pattern recognition. 
More recently, the problem of estimating the ratio of two probability densities was tackled in a similar fashion [2, 3]: The ratio of two probability densities is directly estimated without going through separate estimation of the two probability densities.\n\nIn this paper, we further explore this line of research, and propose a method for directly estimating the difference between two probability densities in a single-shot process. Density differences would be more desirable than density ratios because density ratios can diverge to infinity even under a mild condition (e.g., two Gaussians [4]), whereas density differences are always finite as long as each density is bounded. Density differences can be used for solving various machine learning tasks such as class-balance estimation under class-prior change [5] and change-point detection in time series [6].\n\nFor this density-difference estimation problem, we propose a single-shot method, called the least-squares density-difference (LSDD) estimator, that directly estimates the density difference without separately estimating two densities. LSDD is derived within the framework of kernel regularized least-squares estimation, and thus it inherits various useful properties: For example, the LSDD solution can be computed analytically in a computationally efficient and stable manner, and all tuning parameters such as the kernel width and the regularization parameter can be systematically and objectively optimized via cross-validation. We derive a finite-sample error bound for the LSDD estimator and show that it achieves the optimal convergence rate in a non-parametric setup.\n\nWe then apply LSDD to L2-distance estimation and show that it is more accurate than the difference of KDEs, which tends to severely under-estimate the L2-distance [7]. 
Because the L2-distance is more robust against outliers than the Kullback-Leibler divergence [8], the proposed L2-distance estimator can lead to the paradigm of robust distribution comparison. We experimentally demonstrate the usefulness of LSDD in semi-supervised class-prior estimation and unsupervised change detection.\n\n2 Density-Difference Estimation\n\nIn this section, we propose a single-shot method for estimating the difference between two probability densities from samples, and analyze its theoretical properties.\n\nProblem Formulation and Naive Approach: First, we formulate the problem of density-difference estimation. Suppose that we are given two sets of independent and identically distributed samples X := {x_i}_{i=1}^{n} and X' := {x'_{i'}}_{i'=1}^{n'} from probability distributions on ℝᵈ with densities p(x) and p'(x), respectively. Our goal is to estimate the density difference,\n\nf(x) := p(x) − p'(x),\n\nfrom the samples X and X'.\n\nA naive approach to density-difference estimation is to use kernel density estimators (KDEs). However, we argue that the KDE-based density-difference estimator is not the best approach because of its two-step nature. Intuitively, good density estimators tend to be smooth, and thus the difference between such smooth density estimators tends to be over-smoothed as a density-difference estimator [9]. 
To overcome this weakness, we give a single-shot procedure of directly estimating the density difference f(x) without separately estimating the densities p(x) and p'(x).\n\nLeast-Squares Density-Difference Estimation: In our proposed approach, we fit a density-difference model g(x) to the true density-difference function f(x) under the squared loss:\n\nargmin_g ∫ (g(x) − f(x))² dx.\n\nWe use the following Gaussian kernel model as g(x):\n\ng(x) = Σ_{ℓ=1}^{n+n'} θ_ℓ exp(−‖x − c_ℓ‖² / (2σ²)),   (1)\n\nwhere (c_1, …, c_n, c_{n+1}, …, c_{n+n'}) := (x_1, …, x_n, x'_1, …, x'_{n'}) are Gaussian kernel centers. If n + n' is large, we may use only a subset of {x_1, …, x_n, x'_1, …, x'_{n'}} as Gaussian kernel centers. For the model (1), the optimal parameter θ* is given by\n\nθ* := argmin_θ ∫ (g(x) − f(x))² dx = argmin_θ [θᵀHθ − 2hᵀθ]
= H⁻¹h,\n\nwhere H is the (n+n') × (n+n') matrix and h is the (n+n')-dimensional vector defined as\n\nH_{ℓ,ℓ'} := ∫ exp(−‖x − c_ℓ‖²/(2σ²)) exp(−‖x − c_{ℓ'}‖²/(2σ²)) dx = (πσ²)^{d/2} exp(−‖c_ℓ − c_{ℓ'}‖²/(4σ²)),\n\nh_ℓ := ∫ exp(−‖x − c_ℓ‖²/(2σ²)) p(x) dx − ∫ exp(−‖x' − c_ℓ‖²/(2σ²)) p'(x') dx'.\n\nReplacing the expectations in h by empirical estimators and adding an ℓ₂-regularizer to the objective function, we arrive at the following optimization problem:\n\nθ̂ := argmin_θ [θᵀHθ − 2ĥᵀθ + λθᵀθ],   (2)\n\nwhere λ (≥ 0) is the regularization parameter and ĥ is the (n+n')-dimensional vector defined as\n\nĥ_ℓ := (1/n) Σ_{i=1}^{n} exp(−‖x_i − c_ℓ‖²/(2σ²)) − (1/n') Σ_{i'=1}^{n'} exp(−‖x'_{i'} − c_ℓ‖²/(2σ²)).\n\nTaking the derivative of the objective function in Eq.(2) and equating it to zero, we can obtain the solution analytically as\n\nθ̂ = (H + λI)⁻¹ĥ,\n\nwhere I denotes the identity matrix.\n\nFinally, a density-difference estimator f̂(x), which we call the least-squares density-difference (LSDD) estimator, is given as\n\nf̂(x) = Σ_{ℓ=1}^{n+n'} θ̂_ℓ exp(−‖x − c_ℓ‖²/(2σ²)).
Non-Parametric Error Bound: Here, we theoretically analyze the estimation error of LSDD. We assume n' = n, and let H_γ be the reproducing kernel Hilbert space (RKHS) corresponding to the Gaussian kernel with width γ: k_γ(x, x') = exp(−‖x − x'‖²/γ²). Let us consider a slightly modified LSDD estimator that is more suitable for non-parametric error analysis (see the footnote at the end of this analysis):\n\nf̂ := argmin_{g ∈ H_γ} [ ‖g‖²_{L₂(ℝᵈ)} − 2((1/n) Σ_{i=1}^{n} g(x_i) − (1/n) Σ_{i'=1}^{n} g(x'_{i'})) + λ‖g‖²_{H_γ} ].\n\nThen we have the following theorem:\n\nTheorem 1. Suppose that there exists a constant M such that ‖p‖_∞ ≤ M and ‖p'‖_∞ ≤ M. Suppose also that the density difference f = p − p' is a member of the Besov space with regularity α. That is, f ∈ B^α_{2,∞}, where B^α_{2,∞} is the Besov space with regularity α and\n\n‖f‖_{B^α_{2,∞}} := ‖f‖_{L₂(ℝᵈ)} + sup_{t>0} (t^{−α} ω_{r,L₂(ℝᵈ)}(f, t)) < c for r = ⌊α⌋ + 1,\n\nwhere ⌊α⌋ denotes the largest integer less than or equal to α and ω_{r,L₂(ℝᵈ)} is the r-th modulus of smoothness (see [10] for the definitions). 
Then, for all ε > 0 and p ∈ (0, 1), there exists a constant K > 0 depending on M, c, ε, and p such that for all n ≥ 1, τ ≥ 1, and λ > 0, the LSDD estimator f̂ in H_γ satisfies\n\n‖f̂ − f‖²_{L₂(ℝᵈ)} + λ‖f̂‖²_{H_γ} ≤ K [ λγ^{−d} + γ^{2α} + γ^{−(1−p)(1+ε)d}/(λᵖn) + γ^{−(2(1−p)d/(1+p))(1+ε+(1−p)/4)}/(λ^{(3p−p²)/(1+p)} n^{2/(1+p)}) + τ/(n²λ) + τ/n ]\n\nwith probability not less than 1 − 4e^{−τ}.\n\nIf we set λ = n^{−(2α+d)/((2α+d)(1+p)+(ε−p+εp))} and γ = n^{−1/((2α+d)(1+p)+(ε−p+εp))}, and take ε and p sufficiently small, then we immediately have the following corollary.\n\nCorollary 1. Suppose that the same assumptions as in Theorem 1 hold. Then, for all ρ, ρ' > 0, there exists a constant K > 0 depending on M, c, ρ, and ρ' such that, for all n ≥ 1 and τ ≥ 1, the density-difference estimator f̂ with an appropriate choice of γ and λ satisfies\n\n‖f̂ − f‖²_{L₂(ℝᵈ)} + λ‖f̂‖²_{H_γ} ≤ K (n^{−2α/(2α+d)+ρ} + τn^{−1+ρ'})\n\nwith probability not less than 1 − 4e^{−τ}.\n\nFootnote 1: More specifically, the squared ℓ₂-norm of the parameters in the regularizer is replaced with the squared RKHS-norm of the learned function, which is necessary to establish consistency. Nevertheless, we use the squared ℓ₂-norm of the parameters in experiments because it is simpler and seems to perform well in practice.\n\nNote that n^{−2α/(2α+d)} is the optimal learning rate for estimating a function in B^α_{2,∞}. 
Therefore, the density-difference estimator with a Gaussian kernel achieves the optimal learning rate by appropriately choosing the regularization parameter and the Gaussian width. Because the learning rate depends on α, the LSDD estimator is adaptive to the smoothness of the true function.\n\nIt is known that, if the naive KDE with a Gaussian kernel is used for estimating a probability density with regularity α > 2, the optimal learning rate cannot be achieved [11, 12]. To achieve the optimal rate by KDE, we should choose a kernel function specifically tailored to each regularity α [13]. However, such a kernel function is not non-negative and is difficult to implement in practice. On the other hand, our LSDD estimator always achieves the optimal learning rate for a Gaussian kernel without regard to the regularity α.\n\nModel Selection by Cross-Validation: The above theoretical analysis showed the superiority of LSDD. In practice, however, the performance of LSDD depends on the choice of models (i.e., the kernel width σ and the regularization parameter λ). Here, we show that the model can be optimized by cross-validation (CV). More specifically, we first divide the samples X = {x_i}_{i=1}^{n} and X' = {x'_{i'}}_{i'=1}^{n'} into T disjoint subsets {X_t}_{t=1}^{T} and {X'_t}_{t=1}^{T}, respectively. Then we obtain a density-difference estimate f̂_t(x) from X∖X_t and X'∖X'_t (i.e., all samples except X_t and X'_t), and compute its hold-out error for X_t and X'_t as\n\nCV(t) := ∫ f̂_t(x)² dx − (2/|X_t|) Σ_{x ∈ X_t} f̂_t(x) + (2/|X'_t|) Σ_{x' ∈ X'_t} f̂_t(x'),\n\nwhere |X| denotes the number of elements in the set X. We repeat this hold-out validation procedure for t = 1, …, T, and compute the average hold-out error. 
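In code, the analytic solution and this hold-out score can be sketched as follows. This is a minimal NumPy sketch added for illustration; it is not the authors' MATLAB implementation, and all function names and parameter values are our own:

```python
import numpy as np

def lsdd_fit(X, Xp, sigma, lam):
    """Analytic LSDD solution theta = (H + lam*I)^{-1} h_hat for the
    Gaussian kernel model of Eq.(1), using all samples as centers."""
    C = np.vstack([X, Xp])  # kernel centers c_1, ..., c_{n+n'}
    d = C.shape[1]

    def kernel(A):
        # k(a, c_l) = exp(-||a - c_l||^2 / (2 sigma^2)) for all rows a of A
        sq = np.sum((A[:, None, :] - C[None, :, :]) ** 2, axis=2)
        return np.exp(-sq / (2.0 * sigma ** 2))

    # H_{l,l'} = (pi sigma^2)^{d/2} exp(-||c_l - c_{l'}||^2 / (4 sigma^2))
    sq_cc = np.sum((C[:, None, :] - C[None, :, :]) ** 2, axis=2)
    H = (np.pi * sigma ** 2) ** (d / 2) * np.exp(-sq_cc / (4.0 * sigma ** 2))

    # h_hat: empirical averages replacing the expectations in h
    h = kernel(X).mean(axis=0) - kernel(Xp).mean(axis=0)

    theta = np.linalg.solve(H + lam * np.eye(len(C)), h)
    f_hat = lambda Z: kernel(Z) @ theta  # the LSDD estimate f_hat(x)
    return theta, H, h, f_hat

def holdout_score(theta, H, f_hat, X_te, Xp_te):
    """CV(t)-style hold-out error; the integral of f_hat^2 equals
    theta^T H theta for the Gaussian kernel model."""
    return theta @ H @ theta - 2.0 * f_hat(X_te).mean() + 2.0 * f_hat(Xp_te).mean()
```

A full cross-validation loop evaluates this score for each candidate pair (σ, λ) over the T folds and keeps the pair with the smallest average.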
Finally, we choose the model that minimizes the average hold-out error.\n\n3 L2-Distance Estimation by LSDD\n\nIn this section, we consider the problem of approximating the L2-distance between p(x) and p'(x),\n\nL2(p, p') := ∫ (p(x) − p'(x))² dx,\n\nfrom their independent and identically distributed samples X := {x_i}_{i=1}^{n} and X' := {x'_{i'}}_{i'=1}^{n'}. For an equivalent expression L2(p, p') = ∫ f(x)p(x) dx − ∫ f(x')p'(x') dx', if we replace f(x) with an LSDD estimator f̂(x) and approximate the expectations by empirical averages, we obtain L2(p, p') ≈ ĥᵀθ̂. Similarly, for another expression L2(p, p') = ∫ f(x)² dx, replacing f(x) with an LSDD estimator f̂(x) gives L2(p, p') ≈ θ̂ᵀHθ̂.\n\nAlthough ĥᵀθ̂ and θ̂ᵀHθ̂ themselves give approximations to L2(p, p'), we argue that the use of their combination, defined by\n\nL̂2(X, X') := 2ĥᵀθ̂ − θ̂ᵀHθ̂,   (3)\n\nis more sensible. To explain the reason, let us consider a generalized L2-distance estimator of the form βĥᵀθ̂ + (1 − β)θ̂ᵀHθ̂, where β is a real scalar. If the regularization parameter λ (≥ 0) is small, this can be expressed as\n\nβĥᵀθ̂ + (1 − β)θ̂ᵀHθ̂ = ĥᵀH⁻¹ĥ − λ(2 − β)ĥᵀH⁻²ĥ + o_p(λ),   (4)\n\nwhere o_p denotes the probabilistic order. Thus, up to O_p(λ), the bias introduced by regularization (i.e., the second term on the right-hand side of Eq.(4), which depends on λ) can be eliminated if β = 2, which yields Eq.(3). 
Note that, if no regularization is imposed (i.e., λ = 0), both ĥᵀθ̂ and θ̂ᵀHθ̂ yield ĥᵀH⁻¹ĥ, the first term on the right-hand side of Eq.(4).\n\nEq.(3) is actually equivalent to the negative of the optimal objective value of the LSDD optimization problem without regularization (i.e., Eq.(2) with λ = 0). This can be naturally interpreted through a lower bound of L2(p, p') obtained by Legendre-Fenchel convex duality [14]:\n\nL2(p, p') = sup_g [ 2(∫ g(x)p(x) dx − ∫ g(x')p'(x') dx') − ∫ g(x)² dx ],\n\nwhere the supremum is attained at g = f. If the expectations are replaced by empirical estimators and the Gaussian kernel model (1) is used as g, the above optimization problem is reduced to the LSDD objective function without regularization (see Eq.(2)). Thus, LSDD corresponds to approximately maximizing the above lower bound, and Eq.(3) is its maximum value.\n\nThrough eigenvalue decomposition of H, we can show that 2ĥᵀθ̂ − θ̂ᵀHθ̂ ≥ ĥᵀθ̂ ≥ θ̂ᵀHθ̂. Thus, our approximator (3) is not less than the plain approximators ĥᵀθ̂ and θ̂ᵀHθ̂.\n\n4 Experiments\n\nIn this section, we experimentally demonstrate the usefulness of LSDD. 
A MATLAB implementation of LSDD used for the experiments is available from http://sugiyama-www.cs.titech.ac.jp/~sugi/software/LSDD/.\n\nIllustration: Let N(x; μ, Σ) be the multi-dimensional normal density with mean vector μ and variance-covariance matrix Σ with respect to x, and let\n\np(x) = N(x; (μ, 0, …, 0)ᵀ, (4π)⁻¹I_d) and p'(x) = N(x; (0, 0, …, 0)ᵀ, (4π)⁻¹I_d).\n\nWe first illustrate how LSDD behaves under d = 1 and n = n' = 200. We compare LSDD with KDEi (KDE with two Gaussian widths chosen independently by least-squares cross-validation [15]) and KDEj (KDE with two Gaussian widths chosen jointly to minimize the LSDD criterion [9]). The number of folds in cross-validation is set to 5 for all methods.\n\nFigure 1 depicts density-difference estimation results obtained by LSDD, KDEi, and KDEj for μ = 0 (i.e., f(x) = p(x) − p'(x) = 0). The figure shows that LSDD and KDEj give accurate estimates of the density difference f(x) = 0. On the other hand, the estimate obtained by KDEi fluctuates considerably, although both densities are reasonably well approximated by KDEs. This illustrates an advantage of directly estimating the density difference without going through separate estimation of each density. Figure 2 depicts the results for μ = 0.5 (i.e., f(x) ≠ 0), showing again that LSDD performs well. KDEi and KDEj give the same estimation result for this dataset, which slightly underestimates the peaks.\n\nNext, we compare the performance of L2-distance approximation based on LSDD, KDEi, and KDEj. For μ = 0, 0.2, 0.4, 0.6, 0.8 and d = 1, 5, we draw n = n' = 200 samples from the above p(x) and p'(x). Figure 3 depicts the mean and standard error of estimated L2-distances over 1000 runs as functions of mean μ. 
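For reference, the Eq.(3)-style combination can be computed for this illustrative setup as follows. This is our own minimal NumPy sketch; the kernel width and regularization parameter are fixed by hand here rather than chosen by cross-validation as in the reported experiments:

```python
import numpy as np

def l2_distance_lsdd(X, Xp, sigma=0.4, lam=1e-3):
    """L2-hat(X, X') = 2 h_hat^T theta_hat - theta_hat^T H theta_hat (Eq.(3))."""
    C = np.vstack([X, Xp])
    d = C.shape[1]
    sq = lambda A, B: np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
    H = (np.pi * sigma ** 2) ** (d / 2) * np.exp(-sq(C, C) / (4.0 * sigma ** 2))
    h = (np.exp(-sq(X, C) / (2.0 * sigma ** 2)).mean(axis=0)
         - np.exp(-sq(Xp, C) / (2.0 * sigma ** 2)).mean(axis=0))
    theta = np.linalg.solve(H + lam * np.eye(len(C)), h)
    return 2.0 * h @ theta - theta @ H @ theta

# the two densities of the illustration: N(mu, (4 pi)^{-1}) vs. N(0, (4 pi)^{-1})
rng = np.random.default_rng(1)
std = (4.0 * np.pi) ** -0.5
same = l2_distance_lsdd(rng.normal(0.0, std, (200, 1)),
                        rng.normal(0.0, std, (200, 1)))      # mu = 0
shifted = l2_distance_lsdd(rng.normal(0.5, std, (200, 1)),
                           rng.normal(0.0, std, (200, 1)))   # mu = 0.5
```

The shifted pair should yield a clearly larger estimate than the identical pair.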
When d = 1 (Figure 3(a)), the LSDD-based L2-distance estimator gives the most accurate estimates of the true L2-distance, whereas the KDEi-based L2-distance estimator slightly underestimates the true L2-distance when μ is large. This is caused by the fact that KDE tends to provide smooth density estimates (see Figure 2(b) again): Such smooth density estimates are accurate as density estimates, but the difference of smooth density estimates yields a small L2-distance estimate [7]. The KDEj-based L2-distance estimator tends to mitigate this drawback of KDEi, but it still slightly underestimates the true L2-distance when μ is large.\n\nWhen d = 5 (Figure 3(b)), the KDE-based L2-distance estimators severely underestimate the true L2-distance when μ is large. On the other hand, the LSDD-based L2-distance estimator still gives reasonably accurate estimates of the true L2-distance even when d = 5. However, we note that LSDD also slightly underestimates the true L2-distance when μ is large, because slight underestimation tends to yield smaller variance, and thus such stabilized solutions are more accurate in terms of the bias-variance trade-off.\n\nSemi-Supervised Class-Balance Estimation: In real-world pattern recognition tasks, changes in class balance between the training and test phases are often observed. 
In such cases, naive classifier\n\nFigure 1: Estimation of density difference when μ = 0 (i.e., f(x) = p(x) − p'(x) = 0). Panels: (a) LSDD, (b) KDEi, (c) KDEj; each panel compares the true f(x) with the estimate f̂(x) (and, for the KDE-based panels, each density with its estimate).\n\nFigure 2: Estimation of density difference when μ = 0.5 (i.e., f(x) = p(x) − p'(x) ≠ 0). Panels: (a) LSDD, (b) KDEi, (c) KDEj.\n\nFigure 3: L2-distance estimation by LSDD, KDEi, and KDEj for n = n' = 200 as functions of the Gaussian mean μ. Panels: (a) d = 1, (b) d = 5. 
Means and standard errors over 1000 runs are plotted.\n\ntraining produces significant estimation bias because the class balance in the training dataset does not properly reflect that of the test dataset.\n\nHere, we consider a binary pattern recognition task of classifying pattern x ∈ ℝᵈ to class y ∈ {+1, −1}. Our goal is to learn the class balance of a test dataset in a semi-supervised learning setup where unlabeled test samples are provided in addition to labeled training samples [16]. The class balance in the test set can be estimated by matching a mixture of class-wise training input densities,\n\nq_test(x; π) := π p_train(x|y = +1) + (1 − π) p_train(x|y = −1),\n\nto the test input density p_test(x) [5], where π ∈ [0, 1] is a mixing coefficient to learn. See Figure 4 for a schematic illustration. Here, we use the L2-distance estimated by LSDD and the difference of KDEs for this distribution matching. Note that, when LSDD is used to estimate the L2-distance, separate estimation of p_train(x|y = ±1) is not involved; the difference between p_test(x) and q_test(x; π) is directly estimated.\n\nWe use four UCI benchmark datasets (http://archive.ics.uci.edu/ml/), where we randomly choose 10 labeled training samples from each class and 50 unlabeled test samples following the true class-prior π* = 0.1, 0.2, …, 0.9. Figure 6 plots the mean and standard error of the squared difference between the true and estimated class balances, and the misclassification error by a weighted ℓ₂-regularized least-squares classifier [17] with weighted cross-validation [18], over 1000 runs. 
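The LSDD-based matching can be sketched as follows. This is our own minimal sketch: a grid search over π in which the mixture enters only through the weighted empirical vector ĥ, so no density is estimated separately; the kernel width, regularization parameter, and grid are illustrative choices, not the paper's cross-validated ones:

```python
import numpy as np

def estimate_class_prior(X_pos, X_neg, X_test, sigma=1.0, lam=1e-3):
    """Pick pi in [0, 1] minimizing the LSDD-estimated L2 distance between
    q(x; pi) = pi*p(x|y=+1) + (1-pi)*p(x|y=-1) and the test input density."""
    C = np.vstack([X_pos, X_neg, X_test])
    d = C.shape[1]
    sq = lambda A: np.sum((A[:, None, :] - C[None, :, :]) ** 2, axis=2)
    H = (np.pi * sigma ** 2) ** (d / 2) * np.exp(-sq(C) / (4.0 * sigma ** 2))
    k_pos = np.exp(-sq(X_pos) / (2.0 * sigma ** 2)).mean(axis=0)
    k_neg = np.exp(-sq(X_neg) / (2.0 * sigma ** 2)).mean(axis=0)
    k_tst = np.exp(-sq(X_test) / (2.0 * sigma ** 2)).mean(axis=0)
    H_reg = H + lam * np.eye(len(C))
    best_pi, best_l2 = 0.0, np.inf
    for pi in np.linspace(0.0, 1.0, 101):
        h = pi * k_pos + (1.0 - pi) * k_neg - k_tst  # h_hat for q(.;pi) - p_test
        theta = np.linalg.solve(H_reg, h)
        l2 = 2.0 * h @ theta - theta @ H @ theta      # Eq.(3)-style estimate
        if l2 < best_l2:
            best_pi, best_l2 = pi, l2
    return best_pi
```

The directly estimated difference between q(x; π) and the test density drives the whole search; only ĥ changes with π, so H can be factored once.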
The results show that LSDD tends to provide better class-balance estimates than the KDEi-based, KDEj-based, and EM-based methods [5], which translates into lower classification errors.\n\nUnsupervised Change Detection: The objective of change detection is to discover abrupt property changes behind time-series data. Let y(t) ∈ ℝᵐ be an m-dimensional time-series sample at time t, and let Y(t) := [y(t)ᵀ, y(t+1)ᵀ, …, y(t+k−1)ᵀ]ᵀ ∈ ℝ^{km} be a subsequence of the time series at time t with length k. We treat the subsequence Y(t) as a sample, instead of a single point y(t), by which time-dependent information can be incorporated naturally [6]. Let 𝒴(t) be a set of r retrospective subsequence samples starting at time t: 𝒴(t) := {Y(t), Y(t+1), …, Y(t+r−1)}. Our strategy is to compute a certain dissimilarity measure between two consecutive segments 𝒴(t) and 𝒴(t+r), and use it as the plausibility of change points (see Figure 5). As a dissimilarity measure, we use the L2-distance estimated by LSDD and the Kullback-Leibler (KL) divergence estimated by the KL importance estimation procedure (KLIEP) [2, 3]. We set k = 10 and r = 50.\n\nFirst, we use the IPSJ SIG-SLP Corpora and Environments for Noisy Speech Recognition (CENSREC) dataset (http://research.nii.ac.jp/src/en/CENSREC-1-C.html). This dataset, provided by the National Institute of Informatics, Japan, records human voice in noisy environments such as a restaurant. The top graphs in Figure 7(a) display the original time series (true change points were manually annotated) and the change scores obtained by KLIEP and LSDD. 
The graphs show that the LSDD-based change score indicates the existence of change points more clearly than the KLIEP-based change score.\n\nNext, we use a dataset taken from the Human Activity Sensing Consortium (HASC) challenge 2011 (http://hasc.jp/hc2011/), which provides human activity information collected by portable three-axis accelerometers. Because the orientation of the accelerometers is not necessarily fixed, we take the ℓ₂-norm of the 3-dimensional data. The HASC dataset is relatively simple, so we artificially added zero-mean Gaussian noise with standard deviation 5 at each time point with probability 0.005. The top graphs in Figure 7(b) display the original time series for a sequence of actions \u201cjog\u201d, \u201cstay\u201d, \u201cstair down\u201d, \u201cstay\u201d, and \u201cstair up\u201d (there exist 4 change points, at times 540, 1110, 1728, and 2286) and the change scores obtained by KLIEP and LSDD. The graphs show that the LSDD score is much more stable and interpretable than the KLIEP score.\n\nFinally, we compare the change-detection performance more systematically using receiver operating characteristic (ROC) curves (i.e., the false positive rate vs. the true positive rate) and the area under the ROC curve (AUC) values. In addition to LSDD and KLIEP, we test the L2-distance estimated by KDEi and KDEj and native change detection methods based on autoregressive models (AR) [19], subspace identification (SI) [20], singular spectrum transformation (SST) [21], one-class support vector machine (SVM) [22], kernel Fisher discriminant analysis (KFD) [23], and kernel change-point detection (KCP) [24]. Tuning parameters included in these methods were manually optimized. For 10 datasets taken from each of the CENSREC and HASC data collections, mean ROC curves and AUC values are displayed at the bottom of Figure 7(b). 
The results show that LSDD tends to outperform the other methods and is comparable to state-of-the-art native change-detection methods.\n\n5 Conclusions\n\nIn this paper, we proposed a method for directly estimating the difference between two probability density functions without density estimation. The proposed method, called the least-squares density-difference (LSDD) estimator, was derived within the framework of kernel least-squares estimation, and its solution can be computed analytically in a computationally efficient and stable manner. Furthermore, LSDD is equipped with cross-validation, and thus all tuning parameters such as the kernel width and the regularization parameter can be systematically and objectively optimized. We derived a finite-sample error bound for LSDD in a non-parametric setup, and showed that it achieves the optimal convergence rate. We also proposed an L2-distance estimator based on LSDD, which nicely cancels a bias caused by regularization. Through experiments on class-prior estimation and change-point detection, the usefulness of the proposed LSDD was demonstrated.\n\nAcknowledgments: We would like to thank Wittawat Jitkrittum for his comments and Zaïd Harchaoui for providing us with the program code for kernel change-point detection. 
MS was supported by MEXT KAKENHI 23300069 and AOARD, TK was supported by MEXT KAKENHI 24500340, TS was supported by MEXT KAKENHI 22700289, the Aihara Project, the FIRST program from JSPS initiated by CSTP, and the Global COE Program \u201cThe research and training center for new development in mathematics\u201d, MEXT, Japan, MCdP was supported by a MEXT Scholarship, SL was supported by the JST PRESTO program, and IT was supported by MEXT KAKENHI 23700165.\n\nFigure 4: Class-balance estimation.\n\nFigure 5: Change-point detection.\n\nFigure 6 panels: (a) Australian dataset, (b) Diabetes dataset, (c) German dataset, 
[Figure 6, panel (d): Statlogheart dataset, same layout as panels (a)-(c).]

Figure 6: Results of semi-supervised class-balance estimation. Top: Squared error of class balance estimation. Bottom: Misclassification error by a weighted ℓ2-regularized least-squares classifier.

[Figure 7, panels (a) Speech data and (b) Accelerometer data: original time-series, KLIEP and LSDD change scores, and mean ROC curves (true positive rate versus false positive rate) for LSDD, KDEi, KDEj, KLIEP, AR, SI, SST, SVM, KFD, and KCP, with the AUC tables below.]

AUC values, (a) Speech data:

AUC    LSDD  KDEi  KDEj  KLIEP  AR    SI    SST   SVM   KFD   KCP
Mean   .879  .755  .705  .635   .749  .756  .580  .773  .905  .913
SE     .030  .013  .012  .023   .032  .013  .024  .014  .016  .023

AUC values, (b) Accelerometer data:

AUC    LSDD  KDEi  KDEj  KLIEP  AR    SI    SST   SVM   KFD   KCP
Mean   .843  .764  .751  .638   .799  .762  .764  .815  .856  .730
SE     .020  .026  .020  .016   .018  .023  .032  .013  .029  .036

Figure 7: Results of unsupervised change detection.
From top to bottom: Original time-series, change scores obtained by KLIEP and LSDD, mean ROC curves over 10 datasets, and AUC values for 10 datasets. The best method and comparable ones in terms of mean AUC values by the t-test at the significance level 5% are indicated with boldface. "SE" stands for "Standard error".

References

[1] V. N. Vapnik. Statistical Learning Theory. Wiley, New York, NY, USA, 1998.

[2] M. Sugiyama, T. Suzuki, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4):699-746, 2008.

[3] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847-5861, 2010.

[4] C. Cortes, Y. Mansour, and M. Mohri. Learning bounds for importance weighting. In Advances in Neural Information Processing Systems 23, pages 442-450, 2010.

[5] M. Saerens, P. Latinne, and C. Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure. Neural Computation, 14(1):21-41, 2002.

[6] Y. Kawahara and M. Sugiyama. Sequential change-point detection based on direct density-ratio estimation. Statistical Analysis and Data Mining, 5(2):114-127, 2012.

[7] N. Anderson, P. Hall, and D. Titterington. Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 50(1):41-54, 1994.

[8] A. Basu, I. R. Harris, N. L. Hjort, and M. C. Jones. Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3):549-559, 1998.

[9] P. Hall and M. P. Wand.
On nonparametric discrimination using density differences. Biometrika, 75(3):541-547, 1988.

[10] M. Eberts and I. Steinwart. Optimal learning rates for least squares SVMs using Gaussian kernels. In Advances in Neural Information Processing Systems 24, pages 1539-1547, 2011.

[11] R. H. Farrell. On the best obtainable asymptotic rates of convergence in estimation of a density function at a point. The Annals of Mathematical Statistics, 43(1):170-180, 1972.

[12] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, London, UK, 1986.

[13] E. Parzen. On the estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065-1076, 1962.

[14] R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, USA, 1970.

[15] W. Härdle, M. Müller, S. Sperlich, and A. Werwatz. Nonparametric and Semiparametric Models. Springer, Berlin, Germany, 2004.

[16] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, MA, USA, 2006.

[17] R. Rifkin, G. Yeo, and T. Poggio. Regularized least-squares classification. In Advances in Learning Theory: Methods, Models and Applications, pages 131-154. IOS Press, Amsterdam, the Netherlands, 2003.

[18] M. Sugiyama, M. Krauledat, and K.-R. Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8:985-1005, May 2007.

[19] Y. Takeuchi and K. Yamanishi. A unifying framework for detecting outliers and change points from non-stationary time series data. IEEE Transactions on Knowledge and Data Engineering, 18(4):482-489, 2006.

[20] Y. Kawahara, T. Yairi, and K. Machida.
Change-point detection in time-series data based on subspace identification. In Proceedings of the 7th IEEE International Conference on Data Mining, pages 559-564, 2007.

[21] V. Moskvina and A. A. Zhigljavsky. An algorithm based on singular spectrum analysis for change-point detection. Communication in Statistics: Simulation & Computation, 32(2):319-352, 2003.

[22] F. Desobry, M. Davy, and C. Doncarli. An online kernel change detection algorithm. IEEE Transactions on Signal Processing, 53(8):2961-2974, 2005.

[23] Z. Harchaoui, F. Bach, and E. Moulines. Kernel change-point analysis. In Advances in Neural Information Processing Systems 21, pages 609-616, 2009.

[24] S. Arlot, A. Celisse, and Z. Harchaoui. Kernel change-point detection. Technical Report 1202.3878, arXiv, 2012.
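Supplementary illustration: the closed-form LSDD solution summarized in the conclusions can be sketched as follows. This is a minimal NumPy sketch, not the authors' released code; the function name `lsdd` and the fixed hyperparameters are illustrative (the paper selects the kernel width sigma and the regularization parameter lambda by cross-validation). It fits a Gaussian-kernel model of the density difference with centers at the pooled samples, solves the regularized least-squares system analytically, and returns the bias-corrected L2-distance estimate 2 h'theta - theta'H theta.

```python
import numpy as np

def lsdd(X1, X2, sigma=1.0, lam=0.1):
    """Minimal sketch of least-squares density-difference (LSDD) estimation.

    Fits f(x) = sum_l theta_l exp(-||x - c_l||^2 / (2 sigma^2)) to the
    difference p1 - p2, with the pooled samples as kernel centers, by
    minimizing the empirical L2 criterion plus an l2 regularizer.  The
    minimizer is available in closed form: theta = (H + lam I)^{-1} h.
    """
    C = np.vstack([X1, X2])                    # kernel centers
    d = C.shape[1]

    def sqdist(A, B):                          # pairwise squared distances
        return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)

    # H_{ll'} = integral of the product of two Gaussian kernels, which is
    # analytic: (pi sigma^2)^{d/2} exp(-||c_l - c_l'||^2 / (4 sigma^2))
    H = (np.pi * sigma**2) ** (d / 2) * np.exp(-sqdist(C, C) / (4 * sigma**2))
    # h_l = mean kernel value under sample 1 minus that under sample 2
    h = (np.exp(-sqdist(X1, C) / (2 * sigma**2)).mean(axis=0)
         - np.exp(-sqdist(X2, C) / (2 * sigma**2)).mean(axis=0))
    theta = np.linalg.solve(H + lam * np.eye(len(C)), h)
    # L2-distance estimate 2 h'theta - theta'H theta, which cancels the
    # bias introduced by the regularizer
    L2 = 2 * h @ theta - theta @ H @ theta
    return theta, C, L2
```

For two identical samples h vanishes and the estimate is exactly zero; for well-separated samples the estimate is strictly positive, since 2 h'theta - theta'H theta equals h'(H + lam I)^{-1}(H + 2 lam I)(H + lam I)^{-1} h with H positive semi-definite.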