{"title": "M-Statistic for Kernel Change-Point Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 3366, "page_last": 3374, "abstract": "Detecting the emergence of an abrupt change-point is a classic problem in statistics and machine learning. Kernel-based nonparametric statistics have been proposed for this task; they make fewer assumptions on the distributions than traditional parametric approaches. However, none of the existing kernel statistics provides a computationally efficient way to characterize the extremal behavior of the statistic. Such characterization is crucial for setting the detection threshold, to control the significance level in the offline case as well as the average run length in the online case. In this paper we propose two related computationally efficient M-statistics for kernel-based change-point detection when the amount of background data is large. A novel theoretical result of the paper is the characterization of the tail probability of these statistics using a new technique based on change-of-measure. Such characterization provides us with accurate detection thresholds for both the offline and online cases in a computationally efficient manner, without the need to resort to more expensive simulations such as bootstrapping. We show that our methods perform well on both synthetic and real-world data.", "full_text": "M-Statistic for Kernel Change-Point Detection\n\nShuang Li, Yao Xie\n\nH. Milton Stewart School of Industrial and Systems Engineering\nGeorgia Institute of Technology\nsli370@gatech.edu\nyao.xie@isye.gatech.edu\n\nHanjun Dai, Le Song\n\nComputational Science and Engineering\nCollege of Computing\nGeorgia Institute of Technology\nhanjundai@gatech.edu\nlsong@cc.gatech.edu\n\nAbstract\n\nDetecting the emergence of an abrupt change-point is a classic problem in statistics and machine learning. 
Kernel-based nonparametric statistics have been proposed for this task; they make fewer assumptions on the distributions than traditional parametric approaches. However, none of the existing kernel statistics provides a computationally efficient way to characterize the extremal behavior of the statistic. Such characterization is crucial for setting the detection threshold, to control the significance level in the offline case as well as the average run length in the online case. In this paper we propose two related computationally efficient M-statistics for kernel-based change-point detection when the amount of background data is large. A novel theoretical result of the paper is the characterization of the tail probability of these statistics using a new technique based on change-of-measure. Such characterization provides us with accurate detection thresholds for both the offline and online cases in a computationally efficient manner, without the need to resort to more expensive simulations such as bootstrapping. We show that our methods perform well on both synthetic and real-world data.\n\n1 Introduction\nDetecting the emergence of abrupt change-points is a classic problem in statistics and machine learning. Given a sequence of samples, $x_1, x_2, \ldots, x_t$, from a domain $\mathcal{X}$, we are interested in detecting a possible change-point $\tau$, such that before $\tau$ the samples $x_i \sim P$ i.i.d. for $i \le \tau$, where $P$ is the so-called background distribution, and after the change-point the samples $x_i \sim Q$ i.i.d. for $i \ge \tau + 1$, where $Q$ is a post-change distribution. Here the time horizon $t$ can either be a fixed number $t = T_0$ (called an offline or fixed-sample problem), or $t$ is not fixed and we keep receiving new samples (called a sequential or online problem). 
Our goal is to detect the existence of the change-point in the offline setting, or to detect the emergence of a change-point as soon as possible after it occurs in the online setting. We will restrict our attention to detecting one change-point, which arises often in monitoring problems. One such example is seismic event detection [9], where we would like to detect the onset of the event precisely in retrospect, to better understand earthquakes, or as quickly as possible from the streaming data. Ideally, the detection algorithm should also be robust to distributional assumptions, as we wish to detect all kinds of seismic events that differ from the background. Typically we have a large amount of background data (since seismic events are rare), and we want the algorithm to exploit these data while being computationally efficient.\nClassical approaches to change-point detection are usually parametric, meaning that they rely on strong assumptions on the distribution. Nonparametric and kernel approaches are distribution-free and more robust, as they provide consistent results over larger classes of data distributions (they can possibly be less powerful in settings where a clear distributional assumption can be made). In particular, many kernel-based statistics have been proposed in the machine learning literature [5, 2, 18, 6, 7, 1], which typically work better on real data with few assumptions. However, none of these existing kernel statistics provides a computationally efficient way to characterize the tail probability of the extremal value of these statistics. Characterizing such a tail probability is crucial for setting the correct detection thresholds in both the offline and online cases. Furthermore, efficiency is also an important consideration, since typically the amount of background data is very large. 
In this case, one has the freedom to restructure and sample the background data during the statistical design to gain computational efficiency. On the other hand, change-point detection problems are related to statistical two-sample test problems; however, they are usually more difficult in that, for change-point detection, we need to search for the unknown change-point location $\tau$. For instance, in the offline case this corresponds to taking a maximum of a series of statistics, each corresponding to one putative change-point location (a similar idea was used in [5] for the offline case), and in the online case we have to characterize the average run length of the test statistic hitting the threshold, which necessarily results in taking a maximum of the statistics over time. Moreover, the statistics being maximized over are usually highly correlated. Hence, analyzing the tail probabilities of the test statistic for change-point detection typically requires more sophisticated probabilistic tools.\nIn this paper, we design two related M-statistics for change-point detection based on the kernel maximum mean discrepancy (MMD) for two-sample tests [3, 4]. Although MMD has a nice unbiased and minimum-variance U-statistic estimator ($\mathrm{MMD}_u$), it cannot be applied directly, since $\mathrm{MMD}_u$ costs $O(n^2)$ to compute based on a sample of $n$ data points. In the change-point detection case, this translates to a complexity that grows quadratically with the number of background observations and the detection time horizon $t$. Therefore, we adopt a strategy inspired by the recently developed B-test statistic [17] and design an $O(n)$ statistic for change-point detection. At a high level, our methods sample $N$ blocks of background data of size $B$, compute the quadratic-time $\mathrm{MMD}_u^2$ of each reference block with the post-change block, and then average the results. 
However, different from the simple two-sample test case, in order to provide an accurate change-point detection threshold, the background blocks need to be designed in a novel structured way in the offline setting and updated recursively in the online setting.\nBesides presenting the new M-statistics, our contributions also include: (1) deriving accurate approximations to the significance level in the offline case, and to the average run length in the online case, for our M-statistics, which enable us to determine thresholds efficiently without resorting to onerous simulations (e.g., repeated bootstrapping); (2) obtaining a closed-form variance estimator which allows us to form the M-statistic easily; (3) developing novel structured ways to design background blocks in the offline setting and update rules in the online setting, which also lead to the desired correlation structures of our statistics that enable accurate approximations of the tail probability. To approximate the asymptotic tail probabilities, we adopt a highly sophisticated technique based on change-of-measure, recently developed in a series of papers by Yakir, Siegmund, et al. [16]. The numerical accuracy of our approximations is validated by numerical examples. We demonstrate the good performance of our method using real speech and human activity data. We also find that, in the two-sample testing scenario, it is always beneficial to increase the block size $B$, as the distributions of the statistic under the null and the alternative become better separated; however, this is no longer the case in online change-point detection, because a larger block size inevitably causes a larger detection delay. Finally, we point to future directions to relax our Gaussian approximation and correct for the skewness of the kernel-based statistics.\n\n2 Background and Related Work\n\nWe briefly review kernel-based methods and the maximum mean discrepancy. 
A reproducing kernel Hilbert space (RKHS) $\mathcal{F}$ on $\mathcal{X}$ with a kernel $k(x, x')$ is a Hilbert space of functions $f(\cdot): \mathcal{X} \to \mathbb{R}$ with inner product $\langle \cdot, \cdot \rangle_{\mathcal{F}}$. Its element $k(x, \cdot)$ satisfies the reproducing property $\langle f(\cdot), k(x, \cdot) \rangle_{\mathcal{F}} = f(x)$, and consequently $\langle k(x, \cdot), k(x', \cdot) \rangle_{\mathcal{F}} = k(x, x')$, meaning that we can view the evaluation of a function $f$ at any point $x \in \mathcal{X}$ as an inner product.\nAssume there are two sets with $n$ observations from a domain $\mathcal{X}$, where $X = \{x_1, x_2, \ldots, x_n\}$ are drawn i.i.d. from distribution $P$, and $Y = \{y_1, y_2, \ldots, y_n\}$ are drawn i.i.d. from distribution $Q$. The maximum mean discrepancy (MMD) is defined as [3] $\mathrm{MMD}_0[\mathcal{F}, P, Q] := \sup_{f \in \mathcal{F}} \{ \mathbb{E}_x[f(x)] - \mathbb{E}_y[f(y)] \}$. An unbiased estimate of $\mathrm{MMD}_0^2$ can be obtained using the U-statistic\n\n$$\mathrm{MMD}_u^2[\mathcal{F}, X, Y] = \frac{1}{n(n-1)} \sum_{i,j=1,\, i \ne j}^{n} h(x_i, x_j, y_i, y_j), \qquad (1)$$\n\nwhere $h(\cdot)$ is the kernel of the U-statistic, defined as $h(x_i, x_j, y_i, y_j) = k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i)$. Intuitively, the empirical test statistic $\mathrm{MMD}_u^2$ is expected to be small (close to zero) if $P = Q$, and large if $P$ and $Q$ are far apart. The complexity of evaluating (1) is $O(n^2)$, since we have to form the so-called Gram matrix for the data. Under $H_0$ ($P = Q$), the U-statistic is degenerate and distributed the same as an infinite sum of Chi-square variables.\nTo improve the computational efficiency and obtain an easy-to-compute threshold for hypothesis testing, [17] recently proposed an alternative statistic for $\mathrm{MMD}_0^2$ called the B-test. The key idea of the approach is to partition the $n$ samples from $P$ and $Q$ into $N$ non-overlapping blocks, $X_1, \ldots, X_N$ and $Y_1, \ldots, Y_N$, each of constant size $B$. 
Then $\mathrm{MMD}_u^2[\mathcal{F}, X_i, Y_i]$ is computed for each pair of blocks and averaged over the $N$ blocks to obtain $\mathrm{MMD}_B^2[\mathcal{F}, X, Y] = \frac{1}{N} \sum_{i=1}^{N} \mathrm{MMD}_u^2[\mathcal{F}, X_i, Y_i]$. Since $B$ is constant, $N \sim O(n)$, and the computational complexity of $\mathrm{MMD}_B^2[\mathcal{F}, X, Y]$ is $O(B^2 n)$, a significant reduction compared to $\mathrm{MMD}_u^2[\mathcal{F}, X, Y]$. Furthermore, by averaging $\mathrm{MMD}_u^2[\mathcal{F}, X_i, Y_i]$ over independent blocks, the B-statistic is asymptotically normal by the central limit theorem. This latter property also allows a simple threshold to be derived for the two-sample test, rather than resorting to a more expensive bootstrapping approach. Our statistics are inspired by the B-statistic. However, the change-point detection setting requires significant new derivations to obtain the test threshold, since one cares about the maximum of $\mathrm{MMD}_B^2[\mathcal{F}, X, Y]$ computed at different points in time. Moreover, the change-point detection case involves a sum of highly correlated MMD statistics, because these $\mathrm{MMD}_B^2$ are formed with a common test block of data. This is inevitable in our change-point detection problems because there is much less test data than reference data. Hence, we cannot use the central limit theorem (even a martingale version), but have to adopt the aforementioned change-of-measure approach.\nRelated work. Other nonparametric change-point detection approaches have been proposed in the literature. In the offline setting, [5] designs a kernel-based test statistic, based on a so-called running maximum partition strategy, to test for the presence of a change-point; [18] studies a related problem in which there are $s$ anomalous sequences out of $n$ sequences to be detected, and constructs a test statistic using MMD. 
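As a concrete illustration of the estimator (1) and the B-statistic averaging (this is our sketch, not the authors' implementation; the Gaussian kernel choice and all function names are assumptions):

```python
import numpy as np

def gauss_gram(A, B, sigma=1.0):
    """Gaussian kernel Gram matrix between sample sets A and B (rows = samples)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2u(X, Y, sigma=1.0):
    """Unbiased quadratic-time U-statistic estimate of MMD^2, as in (1):
    sums h(x_i, x_j, y_i, y_j) over i != j, so diagonal terms are excluded."""
    n = len(X)
    Kxx = gauss_gram(X, X, sigma)
    Kyy = gauss_gram(Y, Y, sigma)
    Kxy = gauss_gram(X, Y, sigma)
    s = (Kxx.sum() - np.trace(Kxx)) + (Kyy.sum() - np.trace(Kyy)) \
        - 2.0 * (Kxy.sum() - np.trace(Kxy))
    return s / (n * (n - 1))

def mmd2_b(X, Y, B, sigma=1.0):
    """B-statistic: average MMD^2_u over N = n // B disjoint block pairs."""
    N = len(X) // B
    return np.mean([mmd2u(X[i*B:(i+1)*B], Y[i*B:(i+1)*B], sigma)
                    for i in range(N)])
```

Each block estimate costs $O(B^2)$, so the average over $N$ blocks costs $O(B^2 N)$ rather than the $O(n^2)$ of the full U-statistic.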
In the online setting, [6] presents a meta-algorithm that compares data in some "reference window" to the data in the current window, using empirical distance measures (not kernel-based); [1] detects abrupt changes by comparing two sets of descriptors extracted online from the signal at each time instant, the immediate past set and the immediate future set: based on a soft-margin single-class support vector machine (SVM), they build a dissimilarity measure (which is asymptotically equivalent to the Fisher ratio in the Gaussian case) in the feature space between those sets, without estimating densities as an intermediate step; [7] uses density-ratio estimation to detect change-points, and models the density ratio using a non-parametric Gaussian kernel model whose parameters are updated online through stochastic gradient descent. The above works lack theoretical analysis of the extremal behavior of the statistics or of the average run length.\n3 M-statistic for offline and online change-point detection\nGiven a sequence of observations $\{\ldots, x_{-2}, x_{-1}, x_0, x_1, \ldots, x_t\}$, $x_i \in \mathcal{X}$, with $\{\ldots, x_{-2}, x_{-1}, x_0\}$ denoting the sequence of background (or reference) data, assume a large amount of reference data is available. Our goal is to detect the existence of a change-point $\tau$, such that before the change-point the samples are i.i.d. with a distribution $P$, and after the change-point the samples are i.i.d. with a different distribution $Q$. The location $\tau$ where the change-point occurs is unknown. We may formulate this problem as a hypothesis test, where the null hypothesis states that there is no change-point, and the alternative hypothesis is that there exists a change-point at some time $\tau$. 
We will construct our kernel-based M-statistic using the maximum mean discrepancy (MMD) to measure the difference between the distributions of the reference and the test data.\nWe denote by $Y$ the block of data which potentially contains a change-point (also referred to as the post-change block or test block). In the offline setting, we assume the size of $Y$ can be up to $B_{\max}$, and we want to search for a location of the change-point $\tau$ within $Y$ such that observations after $\tau$ are from a different distribution. Inspired by the idea of the B-test [17], we sample $N$ reference blocks of size $B_{\max}$ independently from the reference pool, and index them as $X_i^{(B_{\max})}$, $i = 1, \ldots, N$. Since we search for a location $B$ ($2 \le B \le B_{\max}$) within $Y$ for a change-point, we construct a sub-block from $Y$ by taking $B$ contiguous data points, and denote it as $Y^{(B)}$. To form the statistic, we correspondingly construct sub-blocks from each reference block by taking $B$ contiguous data points out of that block, and index these sub-blocks as $X_i^{(B)}$ (illustrated in Fig. 1(a)).\n\nFigure 1: Illustration of (a) the offline case: data are split into blocks of size $B_{\max}$, indexed backwards from time $t$, and we consider sub-blocks of size $B$, $B = 2, \ldots, B_{\max}$; (b) the online case. We assume a large amount of reference or background data that follows the null distribution.\n\nWe then compute $\mathrm{MMD}_u^2$ between $(X_i^{(B)}, Y^{(B)})$ and average over the blocks:\n\n$$Z_B := \frac{1}{N} \sum_{i=1}^{N} \mathrm{MMD}_u^2(X_i^{(B)}, Y^{(B)}) = \frac{1}{N B(B-1)} \sum_{i=1}^{N} \sum_{j,l=1,\, j \ne l}^{B} h(X_{i,j}^{(B)}, X_{i,l}^{(B)}, Y_j^{(B)}, Y_l^{(B)}), \qquad (2)$$\n\nwhere $X_{i,j}^{(B)}$ denotes the $j$th sample in $X_i^{(B)}$, and $Y_j^{(B)}$ denotes the $j$th sample in $Y^{(B)}$. Due to the properties of $\mathrm{MMD}_u^2$, under the null hypothesis $\mathbb{E}[Z_B] = 0$. Let $\mathrm{Var}[Z_B]$ denote the variance of $Z_B$ under the null, and write $Z'_B := Z_B/\sqrt{\mathrm{Var}[Z_B]}$ for the standardized statistic. 
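A minimal sketch of computing $Z_B$ in (2) for every sub-block size and maximizing the standardized values as in the offline detection rule (3) (ours, not the authors' code; the function names and the externally supplied `std_of_Z` are assumptions, with the closed-form variance of Lemma 1 being one way to supply it):

```python
import numpy as np

def _mmd2u(Kxx, Kyy, Kxy):
    # Unbiased U-statistic MMD^2 from precomputed Gram matrices, as in (1).
    n = len(Kxx)
    s = (Kxx.sum() - np.trace(Kxx)) + (Kyy.sum() - np.trace(Kyy)) \
        - 2.0 * (Kxy.sum() - np.trace(Kxy))
    return s / (n * (n - 1))

def offline_m_statistic(ref_blocks, Y, std_of_Z, sigma=1.0):
    """Offline M-statistic: max over B in {2, ..., Bmax} of Z_B / sqrt(Var[Z_B]).

    ref_blocks: list of N reference blocks of size Bmax (arrays of samples).
    Y: test block of size Bmax.
    std_of_Z(B): estimate of sqrt(Var[Z_B]) under the null (an assumption here).
    """
    k = lambda A, B: np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
                            / (2.0 * sigma ** 2))
    Bmax = len(Y)
    best = -np.inf
    for B in range(2, Bmax + 1):
        # Sub-blocks of B contiguous points from Y and from each reference block.
        Yb = Y[:B]
        Z = np.mean([_mmd2u(k(Xb[:B], Xb[:B]), k(Yb, Yb), k(Xb[:B], Yb))
                     for Xb in ref_blocks])
        best = max(best, Z / std_of_Z(B))
    return best
```

Scanning $B$ from 2 to $B_{\max}$ plays the role of searching over the unknown change-point location within the test block.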
The expression for $\mathrm{Var}[Z_B]$ is given by (6) in the following section. We see that the variance depends on the block size $B$ and the number of blocks $N$. As $B$ increases, $\mathrm{Var}[Z_B]$ decreases (also illustrated in Figure 5 in the appendix). Considering this, we standardize the statistic, maximize over all values of $B$ to define the offline M-statistic, and detect a change-point whenever the M-statistic exceeds the threshold $b > 0$:\n\n$$M := \max_{B \in \{2, 3, \ldots, B_{\max}\}} \frac{Z_B}{\sqrt{\mathrm{Var}[Z_B]}} > b, \qquad \text{(offline change-point detection)} \qquad (3)$$\n\nwhere varying the block size from 2 to $B_{\max}$ corresponds to searching for the unknown change-point location. In the online setting, suppose the post-change block $Y$ has size $B_0$ and we construct it using a sliding window. In this case, a potential change-point is declared at the end of each block $Y$. To form the statistic, we take $N B_0$ samples without replacement (since we assume the reference data are i.i.d. with distribution $P$) from the reference pool to form $N$ reference blocks, compute the quadratic-time $\mathrm{MMD}_u^2$ statistic between each reference block and the post-change block, and then average them. When a new sample arrives (time moves from $t$ to $t + 1$), we append the new sample to the post-change block, remove the oldest sample from the post-change block, and move it to the reference pool. The reference blocks are also updated accordingly: the end point of each reference block is moved to the reference pool, and a new point is sampled and appended to the front of each reference block, as shown in Fig. 1(b). 
Using the sliding-window scheme described above, we may similarly define an online M-statistic by forming a standardized average of the $\mathrm{MMD}_u^2$ between the post-change block in a sliding window and the reference blocks:\n\n$$Z_{B_0,t} := \frac{1}{N} \sum_{i=1}^{N} \mathrm{MMD}_u^2(X_i^{(B_0,t)}, Y^{(B_0,t)}), \qquad (4)$$\n\nwhere $B_0$ is the fixed block size, $X_i^{(B_0,t)}$ is the $i$th reference block of size $B_0$ at time $t$, and $Y^{(B_0,t)}$ is the post-change block of size $B_0$ at time $t$. In the online case, we have to characterize the average run length of the test statistic hitting the threshold, which necessarily results in taking a maximum of the statistics over time. The online change-point detection procedure is a stopping time, where we detect a change-point whenever the standardized statistic $M_t := Z_{B_0,t}/\sqrt{\mathrm{Var}[Z_{B_0}]}$ exceeds a pre-determined threshold $b > 0$:\n\n$$T = \inf\{t : M_t > b\}. \qquad \text{(online change-point detection)} \qquad (5)$$\n\nNote that in the online case we effectively take a maximum of the standardized statistics over time. There is a recursive way to calculate the online M-statistic efficiently, explained in Section A of the appendix. At the stopping time $T$, we claim that there exists a change-point. 
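The stopping rule (4)-(5) can be sketched as follows. This is our simplified illustration, not the paper's procedure: it re-draws the $N$ reference blocks from the pool at every step instead of performing the recursive block update of Appendix A, and all names are assumptions:

```python
import numpy as np

def online_m_detect(stream, ref_pool, B0, N, b, std_z, sigma=1.0, seed=0):
    """Sliding-window online detection per (5): stop at the first t with
    Z_{B0,t} / sqrt(Var[Z_{B0}]) > b.  std_z is a supplied estimate of
    sqrt(Var[Z_{B0}]) under the null."""
    rng = np.random.default_rng(seed)
    k = lambda A, B: np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
                            / (2.0 * sigma ** 2))
    def mmd2u(X, Y):
        n = len(X)
        Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
        s = (Kxx.sum() - np.trace(Kxx)) + (Kyy.sum() - np.trace(Kyy)) \
            - 2.0 * (Kxy.sum() - np.trace(Kxy))
        return s / (n * (n - 1))
    pool = list(ref_pool)
    window = []                          # the post-change (test) block Y
    for t, x in enumerate(stream):
        window.append(x)
        if len(window) > B0:
            pool.append(window.pop(0))   # oldest test point joins the reference pool
        if len(window) < B0:
            continue
        P = np.asarray(pool)
        Y = np.asarray(window)
        idx = rng.choice(len(P), size=(N, B0), replace=False)
        Z = np.mean([mmd2u(P[rows], Y) for rows in idx])
        if Z / std_z > b:
            return t                     # stopping time T: declare a change-point
    return None
```

The recursive update described in the text (and Appendix A) avoids recomputing the Gram matrices from scratch at every step; the re-sampling above trades that efficiency for brevity.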
There is a tradeoff in choosing the block size $B_0$ in the online setting: a small block size incurs a smaller computational cost, which may be important in the online case, and it also enables a smaller detection delay when the change-point magnitude is strong; however, the disadvantage of a small $B_0$ is lower power, which corresponds to a longer detection delay when the change-point magnitude is weak (for example, when the amplitude of the mean shift is small). Examples of offline and online M-statistics are demonstrated in Fig. 2, based on synthetic data and a segment of a real seismic signal. 
We see that the proposed offline M-statistic powerfully detects the existence of a change-point and accurately pinpoints where the change occurs; the online M-statistic quickly hits the threshold as soon as the change happens.\n\nFigure 2: Examples of offline and online M-statistics with $N = 5$: (a) and (b), offline case without and with a change-point from Normal(0,1) to Laplace(0,1) at $\tau = 250$ ($B_{\max} = 500$, threshold $b = 3.34$, and the maximum is obtained at $B = 263$); (c) online case with a change-point at $\tau = 250$, stopping time $T = 268$ (detection delay of 18), threshold $b = 3.55$, and $B_0 = 50$; (d) a real seismic signal and the M-statistic with different kernel bandwidths (median, 100$\times$ median, 0.1$\times$ median). 
All thresholds are theoretical values and are marked in red.\n4 Theoretical Performance Analysis\nWe obtain an analytical expression for the variance $\mathrm{Var}[Z_B]$ in (3) and (5) by leveraging the correspondence between the $\mathrm{MMD}_u^2$ statistic and U-statistics [11] (since $Z_B$ is a form of U-statistic), and by exploiting the known properties of U-statistics. We also derive the covariance structure of the online and offline standardized $Z_B$ statistics, which is crucial for proving Theorems 3 and 4.\nLemma 1 (Variance of $Z_B$ under the null.) Given any fixed block size $B$ and number of blocks $N$, under the null hypothesis,\n\n$$\mathrm{Var}[Z_B] = \binom{B}{2}^{-1} \left[ \frac{1}{N}\, \mathbb{E}[h^2(x, x', y, y')] + \frac{N-1}{N}\, \mathrm{Cov}\big[h(x, x', y, y'),\, h(x'', x''', y, y')\big] \right], \qquad (6)$$\n\nwhere $x, x', x'', x''', y$, and $y'$ are i.i.d. with the null distribution $P$.\nLemma 1 suggests an easy way to estimate $\mathrm{Var}[Z_B]$ from the reference data. To estimate (6), we first estimate $\mathbb{E}[h^2(x, x', y, y')]$ by repeatedly drawing four samples without replacement from the reference data, using them as $x, x', y, y'$, evaluating the sampled function value, and forming a Monte Carlo average. Similarly, we may estimate $\mathrm{Cov}[h(x, x', y, y'), h(x'', x''', y, y')]$.\n\nLemma 2 (Covariance structure of the standardized $Z_B$ statistics.) 
Under the null hypothesis, given $u$ and $v$ in $[2, B_{\max}]$, for the offline case\n\n$$r_{u,v} := \mathrm{Cov}(Z'_u, Z'_v) = \sqrt{\binom{u}{2}\binom{v}{2}} \Big/ \binom{u \vee v}{2}, \qquad (7)$$\n\nwhere $u \vee v = \max\{u, v\}$; and for the online case,\n\n$$r'_{u,v} := \mathrm{Cov}(M_u, M_{u+s}) = \Big(1 - \frac{s}{B_0}\Big)\Big(1 - \frac{s}{B_0 - 1}\Big), \quad \text{for } s \ge 0.$$\n\nIn the offline setting, the choice of the threshold $b$ involves a tradeoff between two standard performance metrics: (i) the significance level (SL), which is the probability that the M-statistic exceeds the threshold $b$ under the null hypothesis (i.e., when there is no change-point); and (ii) the power, which is the probability that the statistic exceeds the threshold under the alternative hypothesis. In the online setting, there are two analogous performance metrics commonly used for analyzing change-point detection procedures [15]: (i) the expected value of the stopping time when there is no change, the average run length (ARL); and (ii) the expected detection delay (EDD), defined to be the expected stopping time in the extreme case where a change occurs immediately at $\tau = 0$. We focus on analyzing the SL and ARL of our methods, since they play key roles in setting thresholds. We derive accurate approximations to these quantities as functions of the threshold $b$, so that given a prescribed SL or ARL, we can solve for the corresponding $b$ analytically. Let $P_\infty$ and $E_\infty$ denote, respectively, the probability measure and the expectation under the null.\nTheorem 3 (SL in the offline case.) When $b \to \infty$ and $b/\sqrt{B_{\max}} \to c$ for some constant $c$, the significance level of the offline M-statistic defined in (3) is given by\n\n$$P_\infty\Big( \max_{B \in \{2, 3, \ldots, B_{\max}\}} \frac{Z_B}{\sqrt{\mathrm{Var}[Z_B]}} > b \Big) = b^2 e^{-b^2/2} \sum_{B=2}^{B_{\max}} \frac{2B - 1}{2\sqrt{2\pi}\, B(B-1)}\, \nu\Big( b \sqrt{\frac{2B - 1}{B(B-1)}} \Big) + o(1), \qquad (8)$$\n\nwhere the special function $\nu(u) \approx \frac{(2/u)\big(\Phi(u/2) - 0.5\big)}{(u/2)\Phi(u/2) + \phi(u/2)}$, and $\phi(x)$ and $\Phi(x)$ are the probability density function and the cumulative distribution function of the standard normal distribution, respectively.\nThe proof of Theorem 3 uses a change-of-measure argument, which is based on the likelihood ratio identity (see, e.g., [12, 16]). The likelihood ratio identity relates the computation of the tail probability under the null to the computation of a sum of expectations, each under an alternative distribution indexed by a particular parameter value. To illustrate, assume the probability density function (pdf) under the null is $f(u)$. Given a function $g_\omega(x)$, with $\omega$ in some index set $\Omega$, we may introduce a family of alternative distributions with pdf $f_\omega(u) = e^{\theta g_\omega(u) - \psi_\omega(\theta)} f(u)$, where $\psi_\omega(\theta) := \log \int e^{\theta g_\omega(u)} f(u)\,du$ is the log moment generating function, and $\theta$ is a parameter to which we may assign an arbitrary value. It can easily be verified that $f_\omega(u)$ is a pdf. Using this family of alternatives, we may calculate the probability of an event $A$ under the original distribution $f$ by calculating a sum of expectations:\n\n$$P\{A\} = E\Big[ \frac{\sum_{\omega \in \Omega} e^{\ell_\omega}}{\sum_{s \in \Omega} e^{\ell_s}};\; A \Big] = \sum_{\omega \in \Omega} E_\omega\Big[ \frac{1}{\sum_{s \in \Omega} e^{\ell_s}};\; A \Big],$$\n\nwhere $E[U; A] := E[U \mathbb{1}\{A\}]$, the indicator $\mathbb{1}\{A\}$ is one when the event $A$ is true and zero otherwise, $E_\omega$ is the expectation under the pdf $f_\omega$, and $\ell_\omega = \log[f_\omega(u)/f(u)] = \theta g_\omega(u) - \psi_\omega(\theta)$ is the log-likelihood ratio; we have the freedom to choose a different $\theta$ value for each $f_\omega$.\nThe basic idea of change-of-measure in our setting is to treat $Z'_B := Z_B/\sqrt{\mathrm{Var}[Z_B]}$ as a random field indexed by $B$. Then, to characterize the SL, we need to study the tail probability of the maximum of this random field. Relating this to the setting above, $Z'_B$ corresponds to $g_\omega(u)$, $B$ corresponds to $\omega$, and $A$ corresponds to the threshold-crossing event. To compute the expectations under the alternative measures, we take a few steps. First, we choose a parameter value $\theta_B$ for each pdf associated with a parameter value $B$, such that $\dot\psi_B(\theta_B) = b$. This is equivalent to setting the mean under each alternative probability to the threshold $b$, $E_B[Z'_B] = b$, and it allows us to use the local central limit theorem, since under the alternative measure the boundary-crossing event has a much larger probability. Second, we express the random quantities involved in the expectations as functions of the so-called local field terms $\{\ell_B - \ell_s : s = B, B \pm 1, \ldots\}$, as well as the re-centered log-likelihood ratios $\tilde\ell_B = \ell_B - b$. We show that they are asymptotically independent as $b \to \infty$ with $b$ growing on the order of $\sqrt{B}$, and this further simplifies our calculation. The last step is to analyze the covariance structure of the random field (Lemma 2) and approximate it using a Gaussian random field. Note that the terms $Z'_u$ and $Z'_v$ have non-negligible correlation due to our construction: they share the same post-change block $Y^{(B)}$. We then apply the localization theorem (Theorem 5.2 in [16]) to obtain the final result.\n\nTheorem 4 (ARL in the online case.) When $b \to \infty$ and $b/\sqrt{B_0} \to c'$ for some constant $c'$, the average run length (ARL) of the stopping time $T$ defined in (5) is given by\n\n$$E_\infty[T] = \frac{\sqrt{2\pi}\, e^{b^2/2}}{b^2} \Big[ \frac{2B_0 - 1}{B_0(B_0 - 1)}\, \nu\Big( b \sqrt{\frac{2(2B_0 - 1)}{B_0(B_0 - 1)}} \Big) \Big]^{-1} + o(1). \qquad (9)$$\n\nThe proof of Theorem 4 is similar to that of Theorem 3, due to the fact that, for a given $m > 0$,\n\n$$P_\infty\{T \le m\} = P_\infty\Big\{ \max_{1 \le t \le m} M_t > b \Big\}. \qquad (10)$$\n\nHence, we also need to study the tail probability of the maximum of the random field $M_t = Z_{B_0,t}/\sqrt{\mathrm{Var}[Z_{B_0}]}$ for a fixed block size $B_0$. 
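Since the right-hand side of (9) is increasing in $b$ (for $b$ not too small), a prescribed ARL can be inverted numerically for the threshold. A sketch (ours, with assumed function names), using the approximation of $\nu(\cdot)$ stated after Theorem 3:

```python
import math

def nu(u):
    """Approximation nu(u) ~ (2/u)(Phi(u/2)-0.5) / ((u/2)Phi(u/2) + phi(u/2))."""
    Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    phi = lambda x: math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
    return ((2.0 / u) * (Phi(u / 2.0) - 0.5)) / ((u / 2.0) * Phi(u / 2.0) + phi(u / 2.0))

def arl(b, B0):
    """Approximate average run length E[T] from (9), dropping the o(1) term."""
    mu = (2.0 * B0 - 1.0) / (B0 * (B0 - 1.0))
    return (math.sqrt(2.0 * math.pi) * math.exp(b * b / 2.0) / (b * b)) \
           / (mu * nu(b * math.sqrt(2.0 * mu)))

def threshold_for_arl(target, B0, lo=1.5, hi=10.0, iters=60):
    """Bisection on b: arl(b, B0) is increasing on [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if arl(mid, B0) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

As a sanity check, with $B_0 = 50$ and a target ARL of 5000 this inversion yields a threshold close to the value $b = 3.55$ shown in Fig. 2(c).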
A similar change-of-measure approach can be used, except that the covariance structure of $M_t$ in the online case is slightly different from that of the offline case. This tail probability turns out to be of the form $P_\infty\{T \le m\} = m\lambda + o(1)$ for a rate $\lambda$. Using arguments similar to those in [13, 14], we may see that $T$ is asymptotically exponentially distributed; hence $P_\infty\{T \le m\} - [1 - \exp(-\lambda m)] \to 0$, and consequently $E_\infty\{T\} \sim 1/\lambda$, which leads to (9).\nTheorem 4 shows that $\mathrm{ARL} \sim O(e^{b^2/2})$ and, hence, $b \sim O(\sqrt{\log \mathrm{ARL}})$. On the other hand, the EDD is typically on the order of $b/\Delta$ by Wald's identity [12] (although a more careful analysis should be carried out in future work), where $\Delta$ is the Kullback-Leibler (KL) divergence between the null and alternative distributions (on the order of a constant). Hence, given a desired ARL (typically on the order of 5000 or 10000), the error made in the estimated threshold is only translated linearly into the EDD. This is a blessing to us: it means that a reasonably accurate $b$ will typically cause little performance loss in EDD. Similarly, Theorem 3 shows that $\mathrm{SL} \sim O(e^{-b^2/2})$, and a similar argument can be made for the offline case.\n5 Numerical examples\nWe test the performance of the M-statistic using simulations and real-world data. Here we only highlight the main results; more details can be found in Appendix C. In the following examples, we use a Gaussian kernel $k(y, y') = \exp\big(-\|y - y'\|^2/(2\sigma^2)\big)$, where $\sigma > 0$ is the kernel bandwidth, and we use the "median trick" [10, 8] to set the bandwidth, which is estimated using the background data.\nAccuracy of Lemma 1 for estimating $\mathrm{Var}[Z_B]$. Fig. 5 in the appendix shows the empirical distributions of $Z_B$ for $B = 2$ and $B = 200$, with $N = 5$. In both cases, we generate 10000 random instances, computed from data following $\mathcal{N}(0, I)$, $I \in \mathbb{R}^{20 \times 20}$, to represent the null distribution. 
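The "median trick" bandwidth and the Monte Carlo estimation of (6) suggested by Lemma 1 can be sketched as follows (our illustration; function names and sampling sizes are assumptions):

```python
import numpy as np

def median_bandwidth(X):
    """'Median trick': set the Gaussian-kernel bandwidth to the median
    pairwise distance of the background data."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.sqrt(np.median(d2[np.triu_indices(len(X), k=1)]))

def estimate_var_zb(X, B, N, sigma, n_mc=2000, seed=0):
    """Monte Carlo estimate of Var[Z_B] via Lemma 1: draw quadruples
    (x, x', y, y') for E[h^2], and pairs of quadruples sharing (y, y')
    for Cov[h(x, x', y, y'), h(x'', x''', y, y')]."""
    rng = np.random.default_rng(seed)
    k = lambda a, b: np.exp(-((a - b) ** 2).sum(-1) / (2.0 * sigma ** 2))
    h1 = np.empty(n_mc)
    h2 = np.empty(n_mc)
    for m in range(n_mc):
        x, xp, xpp, xppp, y, yp = X[rng.choice(len(X), 6, replace=False)]
        h = lambda a, b: k(a, b) + k(y, yp) - k(a, yp) - k(b, y)
        h1[m] = h(x, xp)        # h(x, x', y, y')
        h2[m] = h(xpp, xppp)    # h(x'', x''', y, y'), sharing the same (y, y')
    Eh2 = np.mean(h1 ** 2)
    covh = np.mean(h1 * h2) - np.mean(h1) * np.mean(h2)
    n_pairs = B * (B - 1) / 2.0  # binom(B, 2)
    return (Eh2 / N + (N - 1) / N * covh) / n_pairs
```

The resulting estimate can be plugged in as the $\sqrt{\mathrm{Var}[Z_B]}$ normalization in (3) and (5).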
Moreover, we also plot the Gaussian pdf with the sample mean and sample variance, which matches the empirical distribution well. Note that the approximation works better when the block size decreases. (The skewness of the statistic can be corrected; see the discussion in Section 7.)\nAccuracy of theoretical results for estimating the threshold. For the offline case, we compare the thresholds obtained from numerical simulations, from bootstrapping, and from our approximation in Theorem 3, for various SL values $\alpha$. We choose the maximum block size to be $B_{\max} = 20$. In the appendix, Fig. 6(a) demonstrates how a threshold is obtained by simulation: for $\alpha = 0.05$, the threshold $b = 2.88$ corresponds to the 95% quantile of the empirical distribution of the offline M-statistic. For a range of $b$ values, Fig. 6(b) compares the empirical SL value $\alpha$ from simulation with that predicted by Theorem 3, and shows that the theory is quite accurate for small $\alpha$, which is desirable since we usually care about small $\alpha$'s when setting thresholds. Table 1 shows that our approximation works quite well for determining thresholds given $\alpha$: the thresholds obtained by our theory match those obtained from Monte Carlo simulation (the null distribution is $\mathcal{N}(0, I)$, $I \in \mathbb{R}^{20 \times 20}$), and even those obtained from bootstrapping in a real-data scenario. Here, the "bootstrap" thresholds are for a speech signal from the CENSREC-1-C dataset. In this case, the null distribution $P$ is unknown and we only have 3000 speech-signal samples; thus we generate bootstrap samples to estimate the threshold, as shown in Fig. 7 in the appendix. 
These b's obtained from theoretical approximations incur little performance degradation, and we discuss how to improve them in Section 7.

Table 1: Comparison of thresholds for the offline case, determined by simulation, bootstrapping, and theory, respectively, for various SL values α.

        |        Bmax = 10        |        Bmax = 20        |        Bmax = 50
   α    | b(sim)  b(boot)  b(the) | b(sim)  b(boot)  b(the) | b(sim)  b(boot)  b(the)
  0.20  |  1.78    1.77     2.00  |  1.97    2.29     2.25  |  2.21    2.47     2.48
  0.15  |  2.02    2.05     2.18  |  2.18    2.63     2.41  |  2.44    2.78     2.62
  0.10  |  2.29    2.45     2.40  |  2.47    3.09     2.60  |  2.70    3.25     2.80

For the online case, we also compare the thresholds obtained from simulation (using 5000 instances) for various ARL values and from Theorem 4, respectively. As predicted by theory, the threshold is consistently accurate for various null distributions (shown in Fig. 3). Also note from Fig. 3 that the precision improves as B0 increases. The null distributions we consider include N(0, 1), the exponential distribution with mean 1, an Erdős-Rényi random graph with 10 nodes and edge probability 0.2, and the Laplace distribution.
Expected detection delays (EDD).
In the online setting, we compare the EDD (with the assumption τ = 0) of detecting a change-point when the signal is 20-dimensional and the transition happens from a zero-mean Gaussian N(0, I_20) to a non-zero-mean Gaussian N(μ, I_20), where the post-change mean vector μ is element-wise equal to a constant mean shift. In this setting, Fig. 10(a) demonstrates the tradeoff in choosing a block size: when the block size is too small, the statistical power of the M-statistic is weak and hence the EDD is large; on the other hand, when the block size is too large, although the statistical power is good, the EDD is also large because of the way we update the test block. Therefore, there is an optimal block size for each case. Fig. 10(b) shows that the optimal block size decreases as the mean shift increases, as expected.

Figure 3: In the online case, for a range of ARL values, comparison of b obtained from simulation and from Theorem 4 under various null distributions (Gaussian N(0, I), Exp(1), random graph with 10 nodes and p = 0.2, and Laplace(0, 1)); panels: (a) B0 = 10, (b) B0 = 50, (c) B0 = 200.

6 Real data
We test the performance of our M-statistics using real data. Our datasets include: (1) CENSREC-1-C, a real-world speech dataset in the Speech Resource Consortium (SRC) corpora provided by the National Institute of Informatics (NII)¹; and (2) the Human Activity Sensing Consortium (HASC) challenge 2011 data².
We compare our M-statistic with a state-of-the-art algorithm, the relative density-ratio (RDR) estimate [7] (one limitation of the RDR algorithm, however, is that it is not suitable for high-dimensional data, because estimating a density ratio in the high-dimensional setting is ill-posed). To achieve reasonable performance for the RDR algorithm, we adjust the bandwidth and the regularization parameter at each time step; hence, the RDR algorithm is computationally more expensive than the M-statistic method. We use the Area Under Curve (AUC) [7] (the larger the better) as the performance metric. Our M-statistics show competitive performance against the baseline RDR algorithm on the real data. Here we report the main results; the details can be found in Appendix D. For the speech data, our goal is to detect the onset of a speech signal emerging from background noise (the background noises are taken from real acoustic signals, such as highway, airport, and subway-station noise). The overall AUC for the M-statistic is .8014, versus .7578 for the baseline algorithm. For the human activity detection data, we aim at detecting the onset of a transition from one activity to another. Each dataset consists of human activity information collected by portable three-axis accelerometers. The overall AUC for the M-statistic is .8871, versus .7161 for the baseline algorithm.
7 Discussions
We may be able to improve the precision of the tail probability approximations in Theorems 3 and 4 by accounting for the skewness of Z⁰_B. In the change-of-measure argument, we need to choose parameter values θ_B such that ψ̇_B(θ_B) = b. Currently, we use the Gaussian assumption Z⁰_B ∼ N(0, 1) and, hence, ψ_B(θ) = θ²/2 and θ_B = b. We may improve the precision if we can estimate the skewness κ(Z⁰_B) of Z⁰_B.
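A numerical sketch of this skewness-corrected calibration follows; the κ value used below is made up for illustration, and in practice κ(Z⁰_B) would be estimated from data. With the cubic log-MGF approximation ψ_B(θ) ≈ θ²/2 + κθ³/6, calibrating ψ̇_B(θ) = b means solving the quadratic κθ²/2 + θ = b, and the tail exponent −b²/2 becomes ψ_B(θ⁰_B) − θ⁰_B·b.

```python
import math

def corrected_exponent(b, kappa):
    # log-MGF approximation: psi(theta) = theta^2/2 + kappa*theta^3/6
    # calibration: psi'(theta) = kappa*theta^2/2 + theta = b  (quadratic in theta)
    if kappa == 0.0:
        theta0 = b                      # Gaussian case: psi(theta) = theta^2/2
    else:
        theta0 = (math.sqrt(1.0 + 2.0 * kappa * b) - 1.0) / kappa  # positive root
    psi = theta0 ** 2 / 2 + kappa * theta0 ** 3 / 6
    return psi - theta0 * b             # exponent of exp(psi(theta0) - theta0 * b)

b = 2.5
print(corrected_exponent(b, 0.0))       # Gaussian case: equals -b^2/2 = -3.125
print(corrected_exponent(b, 0.2))       # skewness-corrected exponent (larger, heavier tail)
```

Setting κ = 0 recovers the Gaussian calibration θ_B = b and the exponent −b²/2, so the correction is a strict generalization of the current approximation.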
In particular, we can include the skewness in the log moment generating function approximation, ψ_B(θ) ≈ θ²/2 + κ(Z⁰_B)θ³/6, when we estimate the change-of-measure parameter: setting the derivative of this to b and solving the quadratic equation κ(Z⁰_B)θ²/2 + θ = b for θ⁰_B. This changes the leading exponent term in (8) from e^{−b²/2} to e^{ψ_B(θ⁰_B) − θ⁰_B b}. A similar improvement can be made for the ARL approximation in Theorem 4.

Acknowledgments

This research was supported in part by CMMI-1538746 and CCF-1442635 to Y.X., and by NSF/NIH BIGDATA 1R01GM108341, ONR N00014-15-1-2340, NSF IIS-1218749, and NSF CAREER IIS-1350983 to L.S.

¹ Available from http://research.nii.ac.jp/src/en/CENSREC-1-C.html
² Available from http://hasc.jp/hc2011

References
[1] F. Desobry, M. Davy, and C. Doncarli. An online kernel change detection algorithm. IEEE Trans. Sig. Proc., 2005.
[2] F. Enikeeva and Z. Harchaoui. High-dimensional change-point detection with sparse alternatives. arXiv:1312.1900, 2014.
[3] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(1):723-773, 2012.
[4] Z. Harchaoui, F. Bach, O. Cappé, and E. Moulines. Kernel-based methods for hypothesis testing. IEEE Sig. Proc. Magazine, pages 87-97, 2013.
[5] Z. Harchaoui, F. Bach, and E. Moulines. Kernel change-point analysis. In Adv. in Neural Information Processing Systems 21 (NIPS 2008), 2008.
[6] D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. In Proc. of the 30th VLDB Conf., 2004.
[7] S. Liu, M. Yamada, N. Collier, and M. Sugiyama. Change-point detection in time-series data by relative density-ratio estimation.
Neural Networks, 43:72-83, 2013.
[8] A. Ramdas, S. J. Reddi, B. Póczos, A. Singh, and L. Wasserman. On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[9] Z. E. Ross and Y. Ben-Zion. Automatic picking of direct P, S seismic phases and fault zone head waves. Geophys. J. Int., 2014.
[10] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
[11] R. J. Serfling. Approximation Theorems of Mathematical Statistics. John Wiley & Sons, 1980.
[12] D. Siegmund. Sequential Analysis: Tests and Confidence Intervals. Springer, 1985.
[13] D. Siegmund and E. S. Venkatraman. Using the generalized likelihood ratio statistic for sequential detection of a change-point. Ann. Statist., 23:255-271, 1995.
[14] D. Siegmund and B. Yakir. Detecting the emergence of a signal in a noisy image. Stat. Interface, 1:3-12, 2008.
[15] Y. Xie and D. Siegmund. Sequential multi-sensor change-point detection. Annals of Statistics, 41(2):670-692, 2013.
[16] B. Yakir. Extremes in Random Fields: A Theory and Its Applications. Wiley, 2013.
[17] W. Zaremba, A. Gretton, and M. Blaschko. B-test: low variance kernel two-sample test. In Adv. Neural Info. Proc. Sys. (NIPS), 2013.
[18] S. Zou, Y. Liang, H. V. Poor, and X. Shi. Nonparametric detection of anomalous data via kernel mean embedding.
arXiv:1405.2294, 2014.