
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors make a nice contribution based on the recent B-test variant instead of MMD_u, with efficient algorithms, additional analysis, and many illustrations and comparisons.
I have the following comments and suggestions:
- In what sense is the change-point detection different from outlier detection problems? (On p.3 the authors mention e.g. the one-class SVM, which is also used for anomaly detection.) It would be good if this could be clarified in the introduction.
- Fig. 2: why are the statistics so different in (b) versus (c)? In both cases one has a Normal → Laplacian change, yet the offline and online statistics are quite different.
- More attention should be paid to the kernel bandwidth selection. How sensitive are the results in Fig. 2 with respect to this kernel bandwidth? Only in Section 5 is a brief comment given on the median trick of [1]. More explanation of this bandwidth selection is needed in the paper.
Q2: Please summarize your review in 1-2 sentences
An M-statistic for change-point detection based on kernel methods, with a characterization of the tail probability and efficient online and offline algorithms.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s).
weak review
This paper proposes two related M-statistics for change-point detection based on the kernel maximum mean discrepancy (MMD). The method can incorporate a large amount of background data at only O(n) cost. Furthermore, approximations to the significance level / average run length allow the change-point detection threshold to be set efficiently without resorting to bootstrapping. These approximations are empirically evaluated on real data, and the change-point detection performance is evaluated on both real and synthetic data.
The paper is very well written, structured and illustrated. The math seems to be sound and utilizes a highly sophisticated technique based on change of measure. The main contributions are: 1. accurate approximations to the significance level in the offline case and the average run length in the online case; 2. a closed-form variance estimator.
There are a few shortcomings, though. First of all, the computational complexity with respect to the time horizon t is hidden in the paper. If I am correct, B_max equals t and each 2 < B < B_max requires O(nB^2), which implies a complexity of O(nt^3). This should be mentioned, as it limits the possible applications. Secondly, it is unclear why RDR is the only baseline algorithm chosen from the nonparametric change-point detection approaches mentioned in the paper.
In summary, this paper contains novel contributions, and the authors do a good job of showing when this algorithm is beneficial (a large amount of reference data) and that the theoretical change-point detection threshold can be used as an efficient alternative to simulation-based approaches.
Q2: Please summarize your review in 1-2 sentences
In summary, this paper contains novel contributions, and the authors do a good job of showing when this algorithm is beneficial (a large amount of reference data) and that the theoretical change-point detection threshold can be used as an efficient alternative to simulation-based approaches.
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s).
The problem addressed in this paper is the determination of the acceptance threshold in an abrupt-change detection test using kernels. The proposed solution consists of two computationally efficient M-statistics, adapted to large datasets, for both the batch and online settings. A theoretical characterization of the tail probability of these statistics is derived using a new change-of-measure technique. Empirical evidence reported on synthetic and real data illustrates the interest of the proposed approach.
The paper is well written, well organized and self-contained, and it represents a significant piece of work on an important and difficult subject: change detection. The literature survey looks rather complete to me. The change-of-measure approach allows deriving a theoretical performance analysis of the proposed test by providing an analytical expression for the variances. This way of doing things is new and interesting. The empirical evidence reported is substantial and convincing.
Minor comments
- Figure 2 is too small and the legend in (d) is unreadable.
Q2: Please summarize your review in 1-2 sentences
The paper is well written, well organized and self-contained, and it represents a significant piece of work on an important and difficult subject: change detection.
Submitted by Assigned_Reviewer_4
Q1: Comments to author(s).
This paper proposes two M-statistics, for the offline and online settings, for change-point detection that are based on the kernel MMD B-test statistic for two-sample testing. The B-test statistic is a recently developed, more efficient alternative to the MMD; it averages the MMD over a partitioning of the data into N blocks. Since the blocks are independent, the B-statistic is asymptotically normal, so thresholds can be obtained without resorting to bootstrapping. This procedure is modified for change-point detection by averaging the MMDs computed between each of N blocks occurring before a potential change-point and a block of the same size occurring after the potential change-point. To determine whether a change-point has occurred somewhere in a block of data, the above procedure is carried out for each potential change-point location, yielding a Z-statistic for each. The M-statistic proposed by the authors is the maximum over the standardized Z-statistics. The authors derive an expression for the variance of the Z-statistic (needed to standardize it) and for the significance threshold of the M-statistic (since the Z-statistics are not averages of independent blocks of data, they are not asymptotically normal as in the case of the B-test), and they give a theoretical performance analysis, which is also verified numerically. Finally, they apply their method to several real data sets, where it seems to perform slightly better than a baseline method in terms of AUC.
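For intuition, the construction just summarized could be sketched roughly as follows. This is a reviewer's illustration, not the authors' code: the function names and parameters are invented, and the empirical standardization below is a crude stand-in for the paper's closed-form variance.

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    # Gaussian (RBF) kernel between two sample vectors
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased estimate of MMD^2 between two equal-size blocks X and Y."""
    B = len(X)
    s = 0.0
    for i in range(B):
        for j in range(B):
            if i != j:
                s += (rbf(X[i], X[j], sigma) + rbf(Y[i], Y[j], sigma)
                      - rbf(X[i], Y[j], sigma) - rbf(Y[i], X[j], sigma))
    return s / (B * (B - 1))

def m_statistic(blocks_by_size, sigma=1.0):
    """blocks_by_size: list of (reference_blocks, test_block) pairs, one per
    candidate block size B.  The Z-statistic averages MMD^2 between the test
    block and each of the N reference blocks; the M-statistic is the maximum
    of the standardized Z-statistics over B."""
    standardized = []
    for ref_blocks, test_block in blocks_by_size:
        vals = np.array([mmd2_unbiased(R, test_block, sigma) for R in ref_blocks])
        z = vals.mean()
        se = vals.std(ddof=1) / np.sqrt(len(vals))  # empirical stand-in for the
        standardized.append(z / se)                 # paper's closed-form variance
    return max(standardized)
```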
The work is original and significant. Competing methods require bootstrapping, which is much less efficient with larger amounts of data. Furthermore, the theoretical work involved in deriving the variance of the Z-statistic and thresholds for the M-statistic may prove to be useful elsewhere.
The theoretical work is rigorous and seems reasonable. I did not check the proofs in the appendix.
The presentation is generally good, but could probably be improved. I needed to read through Section 3 several times before I understood the procedure. The fact that the data are formed into blocks at two different levels (for each N and each B) makes things difficult to understand when described casually in the body of the section. Figure 1 was somewhat helpful, but I was still confused. If the procedure were described more formally, e.g. in pseudocode, it might be easier to understand. In my opinion, some of the space given to details in Sections 4 and 5 could be better used to explain the test procedure in Section 3 (with those details moved to the appendix). Additionally, it would be useful to explain the baseline procedures in Section 6 in more depth and to incorporate some of the results from the appendix.
There are some minor typographical / editing errors in the paper; it should be proofread a few times. Line 111 begins with something that is not a sentence.
Q2: Please summarize your review in 1-2 sentences
The authors propose a new method for kernel change-point detection that is competitive with existing methods in terms of performance, but is much more efficient because it does not require resorting to bootstrap procedures. The paper is significant; the theoretical work is rigorous, appears to be correct and is backed by strong empirical results; and the presentation is generally good, though the procedure could probably be explained more clearly and given a more formal treatment.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note, however, that reviewers and area chairs are busy and may not read long, vague rebuttals. It is in your own interest to be concise and to the point.
We would like to thank the reviewers and area chairs for their time and effort in the review process.
Reviewer 1: 
- Typically, anomaly detection assumes that the anomaly is represented by a small number of samples in a background of normal data; a one-class SVM makes a decision about each individual point separately. Change-point detection studies the setting where the change-point alters the underlying distribution of the data, which means the change is persistent after that point in time. The decision is made based on the data points before and after the change collectively.

- Fig. 2(b) shows the offline M-statistic, and Fig. 2(c) the online M-statistic. These two statistics, although both derived from the normalized B-statistic, detect the change-point via different scanning strategies. In the offline setting, the maximal value of the statistic is obtained by retrospectively scanning through the entire testing data in the reverse direction, which differs from the online setting. Note that in Fig. 2(b) the x-axis represents the block size B, unlike Fig. 2(c), the online setting, where the x-axis represents time. More details about how we scan the testing data in the offline setting can be found in Fig. 1(a). More specifically, we fix one boundary of the test block aligned with the most recent data and move the other boundary backwards. In doing this, as shown in Fig. 2(b), the statistic first increases as the block size B increases, due to an increasing number of post-change samples being tested. The statistic then reaches its peak around the true change-point location. After that, it decreases, as an increasing number of pre-change samples is included in the testing data. The way we design the scanning statistic causes it to peak around the location that is most likely to be a change-point over the entire data.
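The reverse-direction scan described in this answer can be sketched as follows. This is a simplified illustration using a standardized mean-difference statistic as a placeholder for the kernel B-statistic; the function names and parameters are ours, not the paper's.

```python
import numpy as np

def mean_shift_stat(ref, test):
    # Placeholder statistic: standardized difference of means (the actual
    # method uses the normalized kernel B-statistic instead).
    se = np.sqrt(np.var(test, ddof=1) / len(test) + np.var(ref, ddof=1) / len(ref))
    return abs(np.mean(test) - np.mean(ref)) / se

def offline_scan(data, b_min=2, b_max=None):
    """Fix the right boundary of the test block at the most recent sample and
    move the left boundary backwards: for each candidate block size B, compare
    the last B points against all earlier data.  The statistic first grows as
    more post-change samples enter the test block, peaks near the true
    change-point, then decays as pre-change samples are included."""
    b_max = b_max or len(data) // 2
    stats = {B: mean_shift_stat(data[:-B], data[-B:])
             for B in range(b_min, b_max + 1)}
    best_B = max(stats, key=stats.get)  # estimated change-point: best_B samples ago
    return best_B, stats
```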
The online M-statistic, on the other hand, sequentially determines whether a change occurs by moving a sliding window forward. As shown in Fig. 1(b), this time we fix the size of the block and still align the right boundary of the block with the most recent data. A change is detected when the statistic hits the threshold for the first time and the procedure stops (since we design it to be a stopping rule, we fire an alarm once it stops).
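A minimal sketch of this sliding-window stopping rule, again with a simple standardized mean-difference as a stand-in for the B-statistic and with illustrative names and parameter values:

```python
import numpy as np

def online_detect(reference, stream, b0=20, threshold=5.0):
    """Fixed block size B_0: at each time step, compare the most recent b0
    samples against the reference data and stop (fire an alarm) the first
    time the statistic exceeds the threshold."""
    ref_mean = np.mean(reference)
    ref_var = np.var(reference, ddof=1)
    window = []
    for t, x in enumerate(stream):
        window.append(x)
        if len(window) > b0:
            window.pop(0)  # keep only the most recent b0 samples
        if len(window) == b0:
            se = np.sqrt(np.var(window, ddof=1) / b0 + ref_var / len(reference))
            if abs(np.mean(window) - ref_mean) / se > threshold:
                return t  # stopping time: alarm fired
    return None  # no change detected in the stream
```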
- We will investigate kernel bandwidth selection more thoroughly in the journal version. Here we only briefly explain why we employed the median trick. In a recent study [1], the authors justified the use of the median trick in MMD two-sample testing with numerical results as well as theoretical insights. They considered a range of bandwidth choices and numerically studied the testing power under each choice, verifying that the median heuristic for a Gaussian kernel maximizes the power in distinguishing two Gaussians with different means. In our study, we adopted the median heuristic in most experiments since we used a Gaussian kernel. We will add [1] as a reference in the final version.
[1] A. Ramdas et al. On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. AAAI 2015.
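For concreteness, the median heuristic amounts to the following generic sketch (our illustration, not code from the paper or from [1]):

```python
import numpy as np

def median_bandwidth(X):
    """Median heuristic: set the Gaussian kernel bandwidth sigma to the
    median of the pairwise Euclidean distances between the samples."""
    n = len(X)
    dists = [np.linalg.norm(X[i] - X[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.median(dists))
```

In practice the median is often computed on a subsample of the pooled data, since forming all pairwise distances costs O(n^2).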
Reviewer 5: - Indeed, in the worst case, as pointed out by the reviewer, the complexity of the OFFLINE procedure, which requires searching for the maximum over all possible block sizes, is O(nt^3). However, this calculation does not exploit the structure of the statistic: many terms in the sum defining the Z-statistic are identical and can be read off a precomputed Gram matrix. In practice, we can therefore bring the complexity down significantly by precomputing the Gram matrix, which has complexity O(nt^2).
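To illustrate the reuse of kernel evaluations (our sketch of the idea, not the authors' code): the Gram matrix over all samples is computed once, and each candidate block statistic is then assembled by indexing into it rather than re-evaluating the kernel.

```python
import numpy as np

def gram_matrix(X, sigma=1.0):
    """All pairwise Gaussian kernel values, computed once (O(t^2) entries)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2_from_gram(K, idx_x, idx_y):
    """Unbiased MMD^2 between two index sets, read off the precomputed Gram
    matrix instead of recomputing kernel values for every candidate block."""
    Kxx = K[np.ix_(idx_x, idx_x)]
    Kyy = K[np.ix_(idx_y, idx_y)]
    Kxy = K[np.ix_(idx_x, idx_y)]
    m, n = len(idx_x), len(idx_y)
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))  # exclude diagonal
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * Kxy.mean()
```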
For the ONLINE case, however, note that we choose a single block size B_0 regardless of t, so the complexity is only O(nB_0^2), where B_0 is typically a small number. We demonstrate numerically that this constant-block-size, low-complexity approach can still detect the change quickly in many cases.
- RDR (Song et al., 2013) is one of the state-of-the-art algorithms in this line of research, and it also beats uLSIF (Kawahara and Sugiyama, 2012) and KLIEP (Kawahara and Sugiyama, 2012) on datasets including HASC and a subset of CENSREC. We carried out experiments on the same datasets in a similar setting, which demonstrates the advantage of our M-statistic over RDR. We will also compare the proposed algorithm with more algorithms under various settings in future work.

