NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 6662 Comparing distributions: $\ell_1$ geometry improves kernel two-sample testing

### Reviewer 1

The paper is well written. The authors provide a generalisation of the approach taken in [5], which seems quite natural. Their approximation of the metric also seems to be inspired by the approximation performed in [5]. I think the paper is complete: the methodological results are well justified and the experimental study is extensive. I think the paper "Interpoint Distance Based Two Sample Tests in High Dimension" is relevant, as it shows very similar behaviour of L1-based tests in higher dimensions.

Typos:

1) In line 73, I think $D_{\mu,J}$ has not been previously defined.

---

Post-review comments: I thank the authors for their clear response. I think this paper is a good submission, so I would like to maintain my overall score.

### Reviewer 2

Summary: The paper showed that the two-sample test statistics of [5], which compare kernel-based distribution representatives (i.e. kernel mean embeddings and smoothed characteristic functions) using the $L^2$ distance, can be generalised to any $L^p$ distance with $p \ge 1$. Theorem 2.1 and Theorem 3 of the paper showed that this definition gives rise to a metric on the space of Borel probability measures and that it metrises weak convergence. The paper showed that, when the $L^1$ distance is used instead of the $L^2$ distance, the tests of [5] achieve better power with high probability [Propositions 3.1 and 3.4]. Like [5], the paper considered the asymptotic null distribution of the normalised difference between distribution representatives, which gives closed-form asymptotic null distributions [Proposition 3.2]. Further, the authors provided a lower bound on the test power of these two $\ell_1$-based tests [Proposition 3.3]. These results led to the conclusion that it is sufficient to optimise the test statistics jointly in the kernel parameter and the test locations in order to maximise this lower bound. Empirically, the paper investigated the proposed methods on 4 synthetic and 3 real data problems, illustrating the benefits of using the proposed $\ell_1$ geometry.

Clarity: Overall, the paper is well written, with a clear logical structure. Notation is clearly defined except in a few places where a symbol is used before its definition (e.g. line 73: $D_{\mu,J}$ was not defined?). I understand that the authors follow the synthetic experimental setup of [16] and have therefore chosen the same parameter settings. It would be nice to remind readers of these numbers. Also, in line 221, the Blobs experiment uses dim = 30; is that a typo?

Quality: The paper presented several theoretical results, with proofs, on statistics based on the $\ell_1$ norm. In the case of two-sample testing, the authors have shown theoretically that the $\ell_1$ geometry gives better statistical power. This claim is also well supported by experimental results on real and synthetic datasets. However, I would have liked to see a little more intuition for the theoretical results presented. I am wondering whether the authors could comment on any weaknesses of their work. Do we lose anything when changing the norm from $\ell_2$ to $\ell_1$? Or are the authors trying to say that practitioners should now always use the $\ell_1$ norm over the $\ell_2$ norm? It seems that when the sample sizes are large, the run time of the proposed optimised test is longer than that of the other linear-time tests (though still small in this case). Have the authors tried problems that require even larger samples (e.g. $10^7$)? For a given computation time, should one opt for the $\ell_1$-based tests over an $\ell_2$-based test?

Originality & Significance: The proposed methods are a clear and significant extension/generalisation of the work in [5]. The theoretical results are important. The proposed $\ell_1$-norm-based two-sample tests seem likely to be used widely in various applications, competing with or advancing the performance of the current state of the art.

---

Authors' feedback read. I am happy with the response provided by the authors.
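To make the $\ell_1$ versus $\ell_2$ comparison discussed in this review concrete, the following is a minimal Python sketch, not the authors' implementation. It computes the difference between two empirical kernel mean embeddings at $J$ test locations and takes its $\ell_1$ and $\ell_2$ norms. The Gaussian kernel, the bandwidth, the sample sizes, the way the test locations are drawn, and all function names are assumptions made for illustration. The statistic here is unnormalised, whereas the tests in the paper use a normalised difference.

```python
import math
import random


def gaussian_kernel(x, t, bandwidth=1.0):
    """Gaussian kernel between a sample point x and a test location t."""
    sq_dist = sum((xi - ti) ** 2 for xi, ti in zip(x, t))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_embedding_diff(xs, ys, locations, bandwidth=1.0):
    """Difference of empirical mean embeddings, evaluated at each test location."""
    diffs = []
    for t in locations:
        mu_x = sum(gaussian_kernel(x, t, bandwidth) for x in xs) / len(xs)
        mu_y = sum(gaussian_kernel(y, t, bandwidth) for y in ys) / len(ys)
        diffs.append(mu_x - mu_y)
    return diffs


def lp_norm(v, p):
    """Plain l_p norm of a finite vector (p >= 1)."""
    return sum(abs(c) ** p for c in v) ** (1.0 / p)


# Hypothetical toy data: two Gaussians in 2D whose means differ by 0.5.
rng = random.Random(0)
dim, n, J = 2, 200, 5
xs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
ys = [[rng.gauss(0.5, 1.0) for _ in range(dim)] for _ in range(n)]
# Assumption: test locations drawn from a standard normal.
locations = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(J)]

diffs = mean_embedding_diff(xs, ys, locations)
l1 = lp_norm(diffs, 1)
l2 = lp_norm(diffs, 2)
print("l1 distance:", l1)
print("l2 distance:", l2)
```

For any vector in $\mathbb{R}^J$ the $\ell_1$ norm is at least as large as the $\ell_2$ norm, so `l1 >= l2` always holds here. That inequality alone says nothing about test power, which also depends on the null distribution of each statistic. The loop above costs on the order of $(n_x + n_y) \cdot J \cdot d$ operations, i.e. it is linear in the sample sizes.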

### Reviewer 3

I thank the authors for their response, and would like to maintain my overall evaluation.

---

The paper is generally well written and addresses an important problem. In the literature on kernel two-sample tests, apart from the quadratic-time maximum mean discrepancy (MMD) test, which computes the RKHS norm of the difference of mean embeddings, [5] examined a test statistic based on the $L^2$ distance between the empirical mean embeddings and gave the asymptotic null distribution in terms of chi-squared variables. The current work generalizes this approach by replacing the $L^2$ distance with general $L^p$ distances, and proves that for $p \ge 1$ the resulting metrics dominate weak convergence. For the test based on the $L^1$ norm, the paper further derives the asymptotic null distribution in terms of Nakagami variables, and proves that it achieves lower Type-II error than the test based on the $L^2$ norm. Overall, I feel that the paper makes an interesting observation and a valuable contribution. More specifically, I have the following questions/comments:

- How does one choose the distribution $\Gamma$ from which the $T_j$'s are sampled?
- It would be helpful to clearly state the computational complexity of the proposed tests in terms of $N_1$, $N_2$ and $J$.
- In the experiments, to fully verify Proposition 3.1, I think it would be helpful to have a comparison of the Type-I and Type-II errors of all the tests without optimizing for test locations, since the latter introduces an additional source of confounding.
- In the second panel of Figures 1 and 2, why are the ME lines missing?
- Table 2 is rather confusing, as I'm not sure which entries represent Type-I errors and which represent Type-II errors. It might be better to create two separate tables and defer one of them (e.g. the Type-I error table) to the supplementary material.
- Since the Nakagami distribution may not be familiar to most readers, it would be helpful to provide its pdf in the main text or supplementary material to avoid confusion regarding its parameterization.
- I would recommend that the authors release code for implementing the tests and for reproducing the experiments.
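For reference on the parameterization question raised above, the density of the Nakagami distribution in its commonly used shape-spread form is given below. Whether the paper adopts exactly this parameterization is an assumption. The symbols $m$ (shape) and $\Omega$ (spread) are not taken from the paper, and $\Gamma(\cdot)$ here denotes the gamma function, not the distribution of test locations.

$$
f(x; m, \Omega) = \frac{2\, m^{m}}{\Gamma(m)\, \Omega^{m}}\; x^{2m-1} \exp\!\left(-\frac{m}{\Omega}\, x^{2}\right), \qquad x \ge 0,\quad m \ge \tfrac{1}{2},\quad \Omega > 0 .
$$

For $m = \tfrac{1}{2}$ this reduces to $\sqrt{2/(\pi\Omega)}\, \exp\!\left(-x^{2}/(2\Omega)\right)$, the half-normal density with variance parameter $\Omega$, i.e. the law of $|Z|$ for $Z \sim \mathcal{N}(0, \Omega)$.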