{"title": "Kernel Change-point Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 609, "page_last": 616, "abstract": "We introduce a kernel-based method for change-point analysis within a sequence of temporal observations. Change-point analysis of an (unlabelled) sample of observations consists in, first, testing whether a change in the distribution occurs within the sample, and second, if a change occurs, estimating the change-point instant after which the distribution of the observations switches from one distribution to another different distribution. We propose a test statistics based upon the maximum kernel Fisher discriminant ratio as a measure of homogeneity between segments. We derive its limiting distribution under the null hypothesis (no change occurs), and establish the consistency under the alternative hypothesis (a change occurs). This allows to build a statistical hypothesis testing procedure for testing the presence of change-point, with a prescribed false-alarm probability and detection probability tending to one in the large-sample setting. If a change actually occurs, the test statistics also yields an estimator of the change-point location. Promising experimental results in temporal segmentation of mental tasks from BCI data and pop song indexation are presented.", "full_text": "Kernel Change-point Analysis\n\nZa\u00a8\u0131d Harchaoui\n\nLTCI, TELECOM ParisTech and CNRS\n\n46, rue Barrault, 75634 Paris cedex 13, France\n\nzaid.harchaoui@enst.fr\n\nFrancis Bach\n\nWillow Project, INRIA-ENS\n\n45, rue d\u2019Ulm, 75230 Paris, France\nfrancis.bach@mines.org\n\n\u00b4Eric Moulines\n\nLTCI, TELECOM ParisTech and CNRS\n\n46, rue Barrault, 75634 Paris cedex 13, France\n\neric.moulines@enst.fr\n\nAbstract\n\nWe introduce a kernel-based method for change-point analysis within a sequence\nof temporal observations. 
Change-point analysis of an unlabelled sample of obser-\nvations consists in, first, testing whether a change in the distribution occurs within\nthe sample, and second, if a change occurs, estimating the change-point instant\nafter which the distribution of the observations switches from one distribution to\nanother, different distribution. We propose a test statistic based upon the maximum\nkernel Fisher discriminant ratio as a measure of homogeneity between segments.\nWe derive its limiting distribution under the null hypothesis (no change occurs),\nand establish its consistency under the alternative hypothesis (a change occurs).\nThis allows one to build a statistical hypothesis testing procedure for testing the pres-\nence of a change-point, with a prescribed false-alarm probability and a detection\nprobability tending to one in the large-sample setting. If a change actually occurs,\nthe test statistic also yields an estimator of the change-point location. Promising\nexperimental results in temporal segmentation of mental tasks from BCI data and\npop song indexation are presented.\n\n1 Introduction\n\nThe need to partition a sequence of observations into several homogeneous segments arises in many\napplications, ranging from speaker segmentation to pop song indexation. So far, such tasks were\nmost often dealt with using probabilistic sequence models, such as hidden Markov models [1], or\ntheir discriminative counterparts such as conditional random fields [2]. These probabilistic models\nrequire a sound knowledge of the transition structure between the segments and demand careful\ntraining beforehand to yield competitive performance; when data are acquired online, inference in\nsuch models is also not straightforward (see, e.g., [3, Chap. 8]). 
Such models essentially perform\nmultiple change-point estimation, while one is often also interested in meaningful quantitative mea-\nsures for the detection of a change-point within a sample.\n\nWhen a parametric model is available to model the distributions before and after the change, a com-\nprehensive literature for change-point analysis has been developed, which provides optimal criteria\nfrom the maximum likelihood framework, as described in [4]. Nonparametric procedures were also\nproposed, as reviewed in [5], but were limited to univariate data and simple settings. Online coun-\nterparts have also been proposed and mostly built upon the cumulative sum scheme (see [6] for\nextensive references). However, so far, even extensions to the case where the distribution before the\nchange is known, and the distribution after the change is not known, remain an open problem [7].\nThis brings to light the need to develop statistically grounded change-point analysis algorithms,\nworking on multivariate, high-dimensional, and also structured data.\n\nWe propose here a regularized kernel-based test statistic, which simultaneously provides\nquantitative answers to both questions: 1) is there a change-point within the sample? 2) if there is\none, then where is it? We prove that our test statistic for change-point analysis has a false-alarm prob-\nability tending to \u03b1 and a detection probability tending to one as the number of observations tends\nto infinity. Moreover, the test statistic directly provides an accurate estimate of the change-point\ninstant. Our method readily extends to multiple change-point settings, by performing a sequence of\nchange-point analyses in sliding windows running along the signal. 
Usually, physical considerations\nallow one to set the window length sufficiently small to guarantee that at most one\nchange-point occurs within each window, and sufficiently large for our decision rule to be\nstatistically significant (typically n > 50).\nIn Section 2, we set up the framework of change-point analysis, and in Section 3, we describe how\nto devise a regularized kernel-based approach to the change-point problem. Then, in Section 4\nand in Section 5, we respectively derive the limiting distribution of our test statistic under the null\nhypothesis H0 : \u201cno change occurs\u201d, and establish the consistency in power under the alternative\nHA : \u201ca change occurs\u201d. These theoretical results allow one to build a test statistic which has provably a\nfalse-alarm probability tending to a prescribed level \u03b1, and a detection probability tending to one, as\nthe number of observations tends to infinity. Finally, in Section 7, we display the performance of our\nalgorithm for, respectively, segmentation into mental tasks from BCI data and temporal segmentation\nof pop songs.\n\n2 Change-point analysis\n\nIn this section, we outline the change-point problem, and describe formally a strategy for building\nchange-point analysis test statistics.\n\nChange-point problem\nLet X1, . . . , Xn be a time series of independent random variables. The\nchange-point analysis of the sample {X1, . . . , Xn} consists in the following two steps.\n\n1) Decide between\n\nH0 :\nHA :\n\nPX1 = \u00b7\u00b7\u00b7 = PXk = \u00b7\u00b7\u00b7 = PXn\nthere exists 1 < k\u22c6 < n such that\nPX1 = \u00b7\u00b7\u00b7 = PXk\u22c6 \u2260 PXk\u22c6+1 = \u00b7\u00b7\u00b7 = PXn .\n2) Estimate k\u22c6 from the sample {X1, . . . 
, Xn} if HA is true.\n\n(1)\n\nWhile sharing many similarities with usual machine learning problems, the change-point problem is\ndifferent.\n\nStatistical hypothesis testing An important aspect of the above formulation of the change-\npoint problem is its natural embedding in a statistical hypothesis testing framework. Let us re-\ncall briefly the main concepts in statistical hypothesis testing, in order to rephrase them within\nthe change-point problem framework (see, e.g., [8]). The goal is to build a decision rule to\nanswer question 1) in the change-point problem stated above. Set a false-alarm probability \u03b1\nwith 0 < \u03b1 < 1 (also called level or Type I error), whose purpose is to theoretically guar-\nantee that P(decide HA, when H0 is true) is close to \u03b1. Now, if there actually is a change-\npoint within the sample, one would like not to miss it, that is, that the detection probability\n\u03c0 = P(decide HA, when HA is true)\u2014also called power and equal to one minus the Type II\nerror\u2014should be close to one. The purpose of Sections 4-5 is to give theoretical guarantees for those\npractical requirements in the large-sample setting, that is, when the number of observations n tends\nto infinity.\n\nRunning maximum partition strategy An efficient strategy for building change-point analysis\nprocedures is to select the partition of the sample which yields maximal heterogeneity between\nthe two segments: given a sample {X1, . . . , Xn} and a candidate change point k with 1 < k < n,\nassume we may compute a measure of heterogeneity \u2206n,k between the segments {X1, . . . , Xk} on\nthe one hand, and {Xk+1, . . . , Xn} on the other hand. 
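To make the \u201crunning maximum partition strategy\u201d just outlined concrete, here is a minimal sketch, assuming a generic heterogeneity measure: the simple mean-shift measure below is a hypothetical stand-in chosen purely for illustration (the paper itself uses the maximum kernel Fisher discriminant ratio), and all function names are ours.

```python
def mean_shift_heterogeneity(left, right):
    # Illustrative stand-in for the heterogeneity measure Delta_{n,k}:
    # squared difference of segment means, weighted by the segment sizes.
    # This is NOT the paper's kernel Fisher discriminant ratio.
    n1, n2 = len(left), len(right)
    m1, m2 = sum(left) / n1, sum(right) / n2
    return (n1 * n2 / (n1 + n2)) * (m1 - m2) ** 2


def running_max_partition(x, a_n, b_n, heterogeneity=mean_shift_heterogeneity):
    # Scan candidate change points a_n <= k <= b_n, each splitting x into
    # x[:k] and x[k:], and return (max_k Delta_{n,k}, argmax_k): the test
    # statistic and the estimated change-point location.
    scores = {k: heterogeneity(x[:k], x[k:]) for k in range(a_n, b_n + 1)}
    k_hat = max(scores, key=scores.get)
    return scores[k_hat], k_hat
```

On a toy univariate sample with a mean shift at index 50, running_max_partition(x, 5, 95) recovers k = 50; restricting k to [a_n, b_n] away from the sample boundaries mirrors the interval restriction used later in the paper.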
Then, the \u201crunning maximum partition strategy\u201d consists in using max1<k<n \u2206n,k as the test statistic. [...] In practice, the maximum is taken over an < k < bn with an > 1 and bn < n, which is a restriction of ]1, n[, in order to prevent the\ntest statistic from uncontrolled behaviour in the neighborhood of the interval boundaries, which is\nstandard practice in this setting [15].\n\n$d_{2,n,k;\gamma}(\hat{\Sigma}^{W}_{n,k}) := \big(\mathrm{Tr}\{(\hat{\Sigma}^{W}_{n,k} + \gamma I)^{-2}(\hat{\Sigma}^{W}_{n,k})^{2}\}\big)^{1/2}$\n\nRemark\nNote that, if the input space is Euclidean, for instance X = Rd, and if the kernel is linear\nk(x, y) = xT y, then Tn;\u03b3(k) may be interpreted as a regularized version of the classical maximum-\nlikelihood multivariate test statistic used to test change in mean with unequal covariances, under the\nassumption of normal observations, described in [4, Chap. 3]. Yet, as the next section shall show,\nour test statistic is truly nonparametric, and its large-sample properties do not require any \u201cGaussian\nin the feature space\u201d-type assumption. Moreover, in practice it may be computed thanks to the\nkernel trick, adapted to kernel Fisher discriminant analysis and outlined in [16, Chapter 6].\n\nFalse-alarm and detection probability\nIn order to build a principled testing procedure, a proper\ntheoretical analysis from a statistical point of view is necessary. First, as the next section shows, for a\nprescribed \u03b1, we may build a procedure which has, as n tends to infinity, the false-alarm probability\n\u03b1 under the null hypothesis H0, that is, when the sample is completely homogeneous and contains\nno change-point. Besides, when the sample actually contains at most one change-point, we prove\nthat our test statistic is able to detect it with probability tending to one as n tends to infinity.\n\nLarge-sample setting\nFor the sake of generality, we describe here the large-sample setting for\nthe change-point problem under the alternative hypothesis HA. 
Essentially, it corresponds to nor-\nmalizing the signal sampling interval to [0, 1] and letting the resolution increase as we observe more\ndata points [4].\nAssume there is 0 < k\u22c6 < n such that PX1 = \u00b7\u00b7\u00b7 = PXk\u22c6 \u2260 PXk\u22c6+1 = \u00b7\u00b7\u00b7 = PXn. Define\n\u03c4 \u22c6 := k\u22c6/n such that \u03c4 \u22c6 \u2208]0, 1[, and define P(\u2113) as the probability distribution prevailing within the\nleft segment of length \u03c4 \u22c6, and P(r) as the probability distribution prevailing within the right segment\nof length 1 \u2212 \u03c4 \u22c6. Then, we want to study what happens if we have \u230an\u03c4 \u22c6\u230b observations from P(\u2113)\n(before the change) and \u230an(1 \u2212 \u03c4 \u22c6)\u230b observations from P(r) (after the change), where n is large and \u03c4 \u22c6 is\nkept fixed.\n\n4 Limiting distribution under the null hypothesis\nThroughout this section, we work under the null hypothesis H0, that is, PX1 = \u00b7\u00b7\u00b7 = PXk = \u00b7\u00b7\u00b7 =\nPXn for all 2 \u2264 k \u2264 n. The first result gives the limiting distribution of Tn;\u03b3(k) as the number of\nobservations n tends to infinity.\nBefore stating the theoretical results, let us describe informally the crux of our approach. We may\nprove, under H0, using operator-theoretic perturbation results similar to [9], that it is sufficient to\nstudy the large-sample behaviour of \u02dcTn;\u03b3(k) := maxan<k<bn [...]. Assume that an/n \u2192 u > 0 and bn/n \u2192 v < 1 as n tends to infinity. Then, Tn;\u03b3n(k) converges in distribution (D\u2212\u2192) to the supremum over u < t < v of a limiting process normalized by \u221a(t(1 \u2212 t)).\n\n5 Consistency in power\n\nAssume there exist u > 0 and v < 1 such that PX\u230an\u03c4 \u22c6 \u230b \u2260 PX\u230an\u03c4 \u22c6 \u230b+1. Assume in addition that\nthe regularization parameter \u03b3 is held fixed as n tends to infinity, and that limn\u2192\u221e an/n > u and\nlimn\u2192\u221e bn/n < v. 
Then, for any 0 < \u03b1 < 1, we have\n\nPHA( maxan<k<bn Tn;\u03b3(k) > t1\u2212\u03b1 ) \u2192 1 , as n \u2192 \u221e . (5)\n\n6 Extensions and related works\n\nExtensions\nIt is worthwhile to note that we might also have built similar procedures from the\nmaximum mean discrepancy (MMD) test statistic proposed by [19]. Note also that, instead of the\nTikhonov-type regularization of the covariance operator, other regularization schemes may also be\napplied, such as the spectral truncation regularization of the covariance operator, equivalent to pre-\nprocessing by a centered kernel principal component analysis [20, 21], as used in [22] for instance.\n\nRelated works\nA related problem is the abrupt change detection problem, explored in [23],\nwhich is naturally also encompassed by our framework. Here, one is interested in the early de-\ntection of a change from a nominal distribution to an erratic distribution. The procedure KCD of\n[23] consists in running a window-limited detection algorithm, using two one-class support vector\nmachines trained respectively on the left and the right parts of the window, and comparing the sets\nof obtained weights. Their approach differs from ours in two respects. First, we have the limiting\nnull distribution of KCpA, which allows one to compute decision thresholds in a principled way. Sec-\nond, our test statistic incorporates a reweighting to keep power against alternatives with unbalanced\nsegments.\n\n7 Experiments\n\nComputational considerations\nIn all experiments, we set \u03b3 = 10^-5 and took the Gaussian ker-\nnel with isotropic bandwidth set by the plug-in rule used in density estimation. Second, since from k\nto k + 1, the test statistic changes from KFDRn,k;\u03b3 to KFDRn,k+1;\u03b3, it corresponds to taking into ac-\ncount the change from {(X1, Y1 = \u22121), . . . , (Xk, Yk = \u22121), (Xk+1, Yk+1 = +1), . . . , (Xn, Yn =\n+1)} to {(X1, Y1 = \u22121), . . . 
, (Xk, Yk = \u22121), (Xk+1, Yk+1 = \u22121), (Xk+2, Yk+2 =\n+1), . . . , (Xn, Yn = +1)} in the labelling in KFDR [9, 16]. This motivates an efficient strategy\nfor the computation of the test statistic. We compute the matrix inversion of the regularized kernel\nGram matrix once and for all, at a cost of O(n^3), and then compute all values of the test statistic for all\npartitions in one matrix multiplication\u2014in O(n^2). As for computing the decision threshold t1\u2212\u03b1,\nwe used bootstrap resampling calibration with 10,000 runs. Other Monte Carlo-based calibration\nprocedures are possible, but are left for future research.\n\n         Subject 1   Subject 2   Subject 3\nKCpA     79%         74%         61%\nSVM      76%         69%         60%\n\nTable 1: Average classification accuracy for each subject\n\nBrain-computer interface data\nSignals acquired during Brain-Computer Interface (BCI) trial\nexperiments naturally exhibit temporal structure. We considered a dataset proposed in BCI compe-\ntition III1, acquired during 4 non-feedback sessions on 3 normal subjects, where each subject was\nasked to perform different tasks, the time at which the subject switches from one task to another being\nrandom (see also [24]). Mental task segmentation is usually tackled with supervised classification\nalgorithms, which require labelled data to be acquired beforehand. Besides, standard supervised\nclassification algorithms are context-sensitive, and sometimes yield poor performance on BCI data.\nWe performed a sequence of change-point analyses on sliding windows overlapping by 20% along\nthe signals. We provide here two ways of measuring the performance of our method. First, in Fig-\nure 2 (left), we give the empirical ROC curve of our test statistic, averaged over all the signals at\nhand. This shows that our test statistic yields competitive performance for testing the presence of a\nchange-point, when compared with a standard parametric multivariate procedure (param) [4]. 
Sec-\nond, in Table 1, we give experimental results in terms of classification accuracy, which shows that\nwe can reach comparable or better performance than supervised multi-class (one-versus-one) classifica-\ntion algorithms (SVM) with our completely unsupervised kernel change-point analysis algorithm.\nIf each segment is considered as a sample of a given class, then the classification accuracy corre-\nsponds here to the proportion of correctly assigned points at the end of the segmentation process.\nThis also clearly shows that the KCpA algorithm gives accurate estimates of the change-points, since the\nchange-point estimation error is directly measured by the classification accuracy.\n\n[Two ROC curves plotting Power against Level: KCpA versus param (left), and KCpA versus KCD (right).]\n\nFigure 2: Comparison of ROC curves for task segmentation from BCI data (left), and pop song\nsegmentation (right).\n\nPop song segmentation\nIndexation of music signals aims to provide a temporal segmentation\ninto several sections with different dynamic, tonal, or timbral characteristics. We investigated\nthe performance of KCpA on a database of 100 full-length \u201cpop music\u201d signals, whose manual\nsegmentation is available. In Figure 2 (right), we provide the respective ROC curves of the KCD procedure of [23]\nand of KCpA. Our approach is indeed competitive in this context.\n\n8 Conclusion\n\nWe proposed a principled approach to the change-point analysis of a time series of independent\nobservations. It provides a powerful procedure for testing the presence of a change in distri-\nbution in a sample. Moreover, we saw in experiments that it also allows one to estimate the\nchange-point accurately when a change occurs. We are currently exploring several extensions of KCpA. 
Since\nexperimental results are promising on real data, in which the assumption of independence is rather\nunrealistic, it is worthwhile to analyze the effect of dependence on the large-sample behaviour of our\ntest statistic, and to explain why the test statistic remains powerful even for (weakly) dependent data.\nWe are also investigating adaptive versions of the change-point analysis, in which the regularization\nparameter \u03b3 and the reproducing kernel k are learned from the data.\n\n1 See http://ida.first.fraunhofer.de/projects/bci/competition_iii/\n\nAcknowledgments\n\nThis work has been supported by the Agence Nationale de la Recherche under contract ANR-06-BLAN-\n0078 KERNSIG.\n\nReferences\n[1] F. De la Torre Frade, J. Campoy, and J. F. Cohn. Temporal segmentation of facial behavior. In\nICCV, 2007.\n[2] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for\nsegmenting and labeling sequence data. In Proc. ICML, 2001.\n[3] O. Capp\u00e9, E. Moulines, and T. Ryd\u00e9n. Inference in Hidden Markov Models. Springer, 2005.\n[4] J. Chen and A. K. Gupta. Parametric Statistical Change-point Analysis. Birkh\u00e4user, 2000.\n[5] M. Cs\u00f6rg\u0151 and L. Horv\u00e1th. Limit Theorems in Change-Point Analysis. Wiley and Sons, 1998.\n[6] M. Basseville and N. Nikiforov. Detection of Abrupt Changes. Prentice-Hall, 1993.\n[7] T. L. Lai. Sequential analysis: some classical problems and new challenges. Statistica Sinica,\n11, 2001.\n[8] E. Lehmann and J. Romano. Testing Statistical Hypotheses (3rd ed.). Springer, 2005.\n[9] Z. Harchaoui, F. Bach, and E. Moulines. Testing for homogeneity with kernel Fisher discrimi-\nnant analysis. In Adv. NIPS, 2007.\n[10] G. Blanchard, O. Bousquet, and L. Zwald. Statistical properties of kernel principal component\nanalysis. Machine Learning, 66, 2007.\n[11] K. Fukumizu, F. Bach, and A. Gretton. Statistical convergence of kernel canonical correlation\nanalysis. 
JMLR, 8, 2007.\n\n[12] C. Gu. Smoothing Spline ANOVA Models. Springer, 2002.\n[13] I. Steinwart, D. Hush, and C. Scovel. An explicit description of the RKHS of Gaussian RBF\nkernels. IEEE Trans. on Inform. Th., 2006.\n[14] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, G. R. G. Lanckriet, and B. Sch\u00f6lkopf. Injective\nHilbert space embeddings of probability measures. In COLT, 2008.\n[15] B. James, K. L. James, and D. Siegmund. Tests for a change-point. Biometrika, 74, 1987.\n[16] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge UP, 2004.\n[17] P. Billingsley. Convergence of Probability Measures (2nd ed.). Wiley Interscience, 1999.\n[18] P. Glasserman. Monte Carlo Methods in Financial Engineering (1st ed.). Springer, 2003.\n[19] A. Gretton, K. Borgwardt, M. Rasch, B. Sch\u00f6lkopf, and A. J. Smola. A kernel method for the\ntwo-sample problem. In Adv. NIPS, 2006.\n[20] B. Sch\u00f6lkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.\n[21] G. Blanchard and L. Zwald. Finite-dimensional projection for classification and statistical\nlearning. IEEE Transactions on Information Theory, 54(9):4169\u20134182, 2008.\n[22] Z. Harchaoui, F. Vallet, A. Lung-Yut-Fong, and O. Capp\u00e9. A regularized kernel-based approach\nto unsupervised audio segmentation. In ICASSP, 2009.\n[23] F. D\u00e9sobry, M. Davy, and C. Doncarli. An online kernel change detection algorithm. IEEE\nTrans. on Signal Processing, 53(8):2961\u20132974, August 2005.\n[24] Z. Harchaoui and O. Capp\u00e9. Retrospective multiple change-point estimation with kernels. In\nIEEE Workshop on Statistical Signal Processing (SSP), 2007.\n", "award": [], "sourceid": 590, "authors": [{"given_name": "Za\u00efd", "family_name": "Harchaoui", "institution": null}, {"given_name": "Eric", "family_name": "Moulines", "institution": null}, {"given_name": "Francis", "family_name": "Bach", "institution": null}]}