Paper ID: 1260 Title: Multivariate f-divergence Estimation With Confidence
Current Reviews

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Paper ID: 1260
Title: Multivariate $f$-Divergence Estimation With Confidence

NOTE: Due to the short reviewing time and out of fairness to the other papers I was reviewing, I DID NOT do more than glance over the supplementary material. In most cases, I therefore did not verify that results claimed by the authors are correct, but instead checked that they are plausible.

Summary: The authors consider an estimator for $f$-divergences based on the following procedure: given samples from two distributions $f_1$ and $f_2$, compute $k$-nn density estimators for

Quality, clarity, originality and significance: This paper proves a second-order result on estimators for information-theoretic/statistical functionals of pairs of distributions. From a statistics/probability perspective this is an important result, since $k$-nn density estimators are relatively straightforward to compute. The manuscript is not written in a way that is easy to read for the non-expert; the problem setup would be difficult to follow for the average NIPS attendee. I therefore lowered my score. The proofs (as determined from the sketches provided) are novel and appear to have broader applicability.

Recommendation: Overall, with some reorganization of the material with an eye towards helping the target audience (i.e., NIPS attendees) understand the results, this would be a nice addition to the program. Without such reorganization, I would find it difficult to recommend acceptance.

Comments: (comments are numbered to allow for easy reference in the rebuttal)

1) In the related work, when discussing drawbacks of previous work, it might be nice to indicate which of the problems with that work the current result addresses, and which problems it does not address.

2) In the setup of the problem it would be best to be clear (for the reader) the space in which the estimators take their value. For example, it does not seem that the results can handle vector-valued parameters, but perhaps I am mistaken. A bit more mathematical precision/rigor earlier in the paper goes a long way.

3) The paper is more-or-less written with the expert in mind, which I find unfortunate. The actual setup of the problem does not appear until the bottom of page 3 -- the estimator is given two sets of samples from $f_1$ and $f_2$ and must estimate a particular $f$-divergence. It would *really* help the reader to have this setup first before talking about weighting ensembles of estimators etc. etc.

4) In the optimization before (1), what is the intuition behind the relaxation?

5) How is $N$ chosen?

6) Why is the estimator in [1] defined using data X and Y and the plug-in estimator using X and X? Is there a significance to the notation change?

7) When constructing the plug-in divergence estimator, is the estimate at $\mathbf{X}_j$ for $j \le N$ formed using the $k$-nn density estimate? This is actually *not* specified in the manuscript as written.

8) The sketch of Lemma 4 is unintelligible to someone who is not going to read the proof (as opposed to the comments after Lemma 3) -- I would dramatically shorten it to make room for a more gentle introduction of the problem. Just relegate the whole proof to the supplementary material.

9) The Broad Implications ("broader impacts?") in Section 3.2 seem appealing but also somewhat speculative. While I appreciate the context, I am not sure this is the right location for these comments. Perhaps move them to the conclusion?

10) In the conclusion: I realize that one must always put in some sort of "further work" to make it seem like the current result is building somewhere. However, I am a little unconvinced that "deriving the nonasymptotic distribution of the estimator" is actually something that people will want to do. To me the more convincing follow-up work would be the application of the main ideas of the proof here to other problems as outlined in section 3.2.

11) I didn't quite understand from the experiment section how the QDA classifier was trained -- perhaps a little more detail here would be appropriate -- there is still space in the paper to provide it, and it might give the reader who is not going to read the proof something concrete to look at.

12) Perhaps describing the estimator in an \begin{algorithm} \end{algorithm} environment would be helpful to the reader who wants to implement this procedure. This may take up too much space. However, it would make the result/approach more accessible.

13) One thing that is not at all clear from the description in Section 2 is that the index set $l$ is indexing over different neighborhood sizes for the $k$-nn density estimate. Because a lot of the intuition/construction of the actual estimator is obscured by the author's writing, the overall idea of the proof is rather difficult to understand.

Small things:
- [Introduction] I think it's fairer to say that the idea of $f$-divergence generalizes several measures, rather than saying that many $f$-divergences have been defined -- it's a little more historically accurate.
- there seems to be an overfull \hbox on page 5

Please summarize your review in 1-2 sentences: The authors prove the asymptotic normality of an estimator of $f$-divergence given two samples from two unknown distributions $f_1$ and $f_2$. The estimator is formed by taking a plug-in estimate after kernel density estimation on each of the two distributions.

Quality Score (1-10): 7

Impact Score (1-2): 1

Confidence (1-5): 4

ADDITION AFTER AUTHOR FEEDBACK: The detailed feedback from the authors has made me even more sanguine about this paper and I am revising my score upward to an 8.

Quality Score (1-10): 8
The authors prove the asymptotic normality of an estimator of $f$-divergence given two samples from two unknown distributions $f_1$ and $f_2$. The estimator is formed by taking a plug-in estimate after kernel density estimation on each of the two distributions.

Submitted by Assigned_Reviewer_43

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper is concerned with proving the asymptotic normality of a class of estimators for multidimensional f-divergence (Ali-Silvey distances) due to Moon and Hero. Estimating these functionals (including KL-divergence, Hellinger distance, and variational distance as examples) from finite data arise in all kinds of settings in information theory, machine learning, neuroscience, etc. One strong use of the asymptotic distribution is in developing statistical tests and confidence bounds, which is also done in the present paper. It is shown that the MSE convergence rate is O(1/T), where T is the number of samples.

The paper does exactly what it sets out to do for this important problem. The proof in the appendix looks correct to this reviewer, and the conclusion lists exactly the things one would hope for in extensions. The simulation and UCI results are insightful.

Theorem 2 uses a set of assumptions A.0--A.5, but these are not given in the paper. Hence it is tough to know what the theorem even is, let alone correctness. The assumptions are stated in the Appendix, but this is not even mentioned. It is recommended that at least a sketch of these assumptions be included in the main text. To make space, typographical techniques are definitely possible, e.g. display equation on lines 224-227 can be be made using '/' rather than stacking or Figure 1 caption can be moved to the side of the figure, rather than below.

Lemma 4 is embedded in proof sketch of Theorem 2: this should perhaps be pulled earlier, so that proof sketch of Theorem 2 flows uninterupted.

Please check abbreviations in References for consistency and correctness; include page numbers and years where missing.
This paper is concerned with proving the asymptotic normality of a class of estimators for multidimensional f-divergence (Ali-Silvey distances) due to Moon and Hero. The paper does exactly what it sets out to do for this important problem.

Submitted by Assigned_Reviewer_44

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper shows that the distribution of a class of ensemble estimators for f-divergences, proposed
by Moon and Hero [1], is asymptotically Gaussian. Such ensemble estimators are weighted combinations
of a number of basis'' estimators, and it was known from [1,3] that the MSE goes down as O(1/T), where T is the number of samples. The authors of this paper consider instead the distribution of the estimators, and show that it converges to a Gaussian. This is of interest in many regimes, as motivated by the authors. The main technical content of the paper is to show that the components of the ensemble satisfy Lemma 3. The result is very interesting. The equations are intimidating at times, and it will be helpful if the authors give a brief overview of the intuition beforehand. In particular, even if natural, could they comment what happens if the assumptions A0-A5 do not hold. For example, what if the distributions are allowed to be over an arbitrary support (hence the assumption of strictly lower bounded is violated), but say the distributions are log-concave or have some other structural property, which can make the problem still interesting.

I have a few specific questions.

- As the authors point out, it would be interesting to obtain Barry-Esseen type theorems. It might also be interesting to incorporate the dependence of 'd', the number of dimensions into the frame. In particular,
does convergence happen with exponentially many samples in the dimension, or with just a polynomial, or even linear?

- The authors mention some shortcomings of papers [22,23,24] and state that their asymptotic distributions are currently unknown. I am curious if (a) anyone has tried to find the distribution for any of these? (b) Is there any reason for the asymptotic distribution to not be Gaussian for any of these? (c) Is there any particular difficulty with analyzing the asymptotic distribution of these estimators? If yes, that will be a strong point for the estimator and the analysis.

Line 70: Extra parentheses
This paper proves the asymptotic normality of a class of ensemble estimators for f-divergence estimators, and is interesting.
Author Feedback
Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.

Response to 1st Reviewer (Assigned_Reviewer_4)
1) If the paper is accepted, we will implement this suggestion.
2) Note that the estimator we are considering is a non-parametric estimator (a k-nn plug-in estimator). We will make this clearer by emphasizing it further in the abstract and introduction if the paper is accepted. It is perhaps confusing that the optimal weights for the weighted ensemble estimator are derived from a parametrized set of basis functions for the bias. These basis functions depend on the non-parametric base estimator. In general, the parameter for the basis functions may be vector valued. However, for this estimator, the parameter happens to be a scalar. The reorganization of this section as suggested in 3) below should help clarify this. Also, we will clarify what space the parameters may take their values in in the revision.
3) We agree with the suggestion that the problem setup should come before the details on the weighted ensemble estimator and will do this in the revision.
4) The original optimization problem in Theorem 1 uses the weights $w(l)$ to zero out the lower order bias terms given in assumption $\mathcal{C}.1$. The relaxed optimization problem instead uses the weights to decrease the bias terms at the rate of O(1/\sqrt(T)). This still gives the MSE rate of O(1/T). Further details are given in [3]. If the paper is accepted we will add a sentence or two to give this intuition.
5) From assumption $\mathcal{A}.0$ in the supplementary material, we assume that $N$ is chosen to be proportional to $T$. This allows the convergence rate of the weighted ensemble estimator to achieve the parametric rate. In practice, we use a leave-one-out approach where we estimate the likelihood ratio at each point in X using all the points in Y and the all other points in X. We then take the empirical average of $g$ evaluated at the estimates of the likelihood ratio. We will include a sentence to this effect in the paper, if accepted.
6) We are not sure what the reviewer is referring to. The definitions of the estimator of both papers (in Eq. (2) in both papers) are identical and use data X and Y in exactly the same way. Note that, as the divergence is not symmetric in the densities $f_1$ and $f_2$, the density $f_2$ is estimated at points $X$ using other points $X$ while $f_1$ is estimated at points in $X$ using points in $Y$. This is explained in lines 154-159.
7) Yes, the reviewer is correct that we are referring to the knn estimate. This will be clarified in the final version.
8) We agree that as currently written, the proof sketch for Lemma 4 is too short to be very insightful. In the final version we will remove the proof sketch and add more explanatory detail to the paragraph prior to the lemma.
9) We did debate between putting Section 3.2 in its current location and the conclusion. Since Theorem 2 is the main contribution of this paper, we decided to put Section 3.2 in its current location to provide context. We believed that it would have less impact if it were left for the conclusion.
10) Yes, we do mention in the conclusion on line 406 that future work also includes "extending the central limit theorem to other divergence estimators", i.e. the estimators mentioned in Section 3.2.
11) If there is space after the other revisions, we will give more details about the QDA classifier.
12) We agree that this would be helpful and will add it if there is room.
13) Indeed the parameter $l$ indexes over different neighborhood sizes for the k-nn density estimate. We will clarify this in the revision.

Response to 2nd Reviewer (Assigned_Reviewer_43)
The reviewer recommends that the assumptions A.0-A.5 be summarized in the paper in abbreviated form. After we do the reorganization and other changes suggested by the reviewers we will consider doing this if there is still space. At a minimum we will give a more obvious reference to the appendix for these assumptions.

The reviewer also recommends pulling Lemma 4 and its proof sketch outside of the proof of Thm 2. In response to the 1st reviewer, we have decided to eliminate the proof of Lemma 4 and add more explanatory detail to the paragraph prior to the lemma. We believe that this will improve the flow of this section sufficiently.

Response to 3rd Reviewer (Assigned_Reviewer_44)
We appreciate the reviewer's comments about the intimidating equations but we have attempted to keep the presentation at a level of technical detail to give the reader a good feel for the technical issues. In the revision, we will remove the Proof of Lemma 4, as suggested by Reviewer 1, and use the space gained to provide more intuition about the results and the assumptions A.0-A.5. However, we wish to remind the reviewer that these assumptions are not new - they have appeared in the same or similar form in previously published work (e.g. [1,3,29] and [23,24] have similar assumptions). It is possible that some of the assumptions A.0-A.5 are not necessary to achieve our results (e.g. see [1] for some experimental results that suggest A.1 is not strictly necessary). While it is beyond the scope of this work to check this, it would be insightful to comment in the proof sketch of Theorem 2 where the assumptions are used. We will do this in the revision.

We agree with the reviewer that Berry-Esseen theorems would be interesting and useful. We are indeed working on this aspect. However, discussion of our work in progress is beyond the scope of this paper.

The reviewer asks about the difficulty of deriving asymptotic Gaussian results for other divergence estimators. Deriving convergence in distribution is a notoriously hard problem for non-linear problems such as divergence estimation and it is far from clear whether this can be done for these other estimators. However, we partially address these questions in Section 3.2 "Broad Implications of Theorem 2".