NeurIPS 2019
Reviewer 1
This is a good paper. The idea of using multivariate functions to perform better learning (distinguishing different distributions) is quite natural, especially considering that the assessment of the quality of discrimination in the paper is expectation-based. It is an interesting and well-structured paper. The statistical analysis of the Graph-based Discriminator classes is clear and informative and invites the follow-up questions presented (closing the gap in sample complexity, extending the expressiveness bounds to larger k).

A few other points regarding this paper:

1) In line 94 - You state that o(\rho) examples are sufficient for discrimination. Even though this is early in the paper, I believe it would be clearer to state this sample size in terms of the accuracy and confidence parameters \epsilon and \delta at this stage.

2) In line 229 - You write 'fix all coordinates'. This phrase may be ambiguous: my understanding is that you intend to fix only a single coordinate while all others remain parameters of the function (performing a reduction to (k-1)-gVC), whereas a literal reading of the phrase suggests that after fixing all but one coordinate we end up with a function which receives one parameter instead of k-1.

3) I would be very interested in seeing future work regarding different types of discrimination 'loss' (equation 1 in line 31; see the note after this review), since the difference in expectations of distinguishing functions hardly seems the best tool for differentiating distributions using a minimal number of tests.

Best of luck.
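On point 3, for reference: a minimal sketch of the loss in question, assuming (as the phrase "the difference in expectations of distinguishing functions" suggests) that Equation 1 is the standard IPM-style objective; the paper's exact formulation and normalization may differ.

\[
\mathrm{IPM}_{\mathcal{D}}(p_1, p_2) \;=\; \sup_{d \in \mathcal{D}} \Big|\, \mathbb{E}_{x \sim p_1}[d(x)] \;-\; \mathbb{E}_{x \sim p_2}[d(x)] \,\Big|.
\]

Point 3 then asks whether objectives other than this supremum of expectation gaps could differentiate distributions with fewer tests.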
Reviewer 2
Originality: The paper introduces a new notion of VC-dimension for k-ary Boolean functions that controls the sample complexity of graph-based discriminators. I do not think that such a notion of VC-dimension has been considered in the literature.

Quality: The main results and the proofs generally seem to be correct. All the main claims in the paper are supported with complete proofs or references to proofs in the literature. I did not check all the proofs in detail; I read the proofs of Theorem 2 and Lemma 1 and they seem to be correct (but please check for typos; some of them are mentioned below under "Minor Comments").

Clarity: I was a bit confused by the definition of the empirical frequency of an edge; please refer to the General Comment/Question below. I also recommend that the paper (including the supplementary material) be proofread at least once more for typos; please see the "Minor Comments" below.

Significance: I think the results in the paper are quite significant from an information-theoretic viewpoint of learning. For practical applications, one issue that was not discussed (as this was probably not the focus of the paper) is the computation of the IPM: although higher-order distinguishing classes have more distinguishing power, how does the complexity of estimating the IPM w.r.t. these classes scale with the order?

General Comment/Question: The definition of the empirical frequency of an edge (between lines 176 and 177 on page 5) seems a bit unusual if I am interpreting it correctly (if I am reading the equation incorrectly, perhaps the definition could be clarified just before line 177?). The expression |{(u_1,...,u_k) \in E_g : \forall i, u_i \in S}| seems to count the number of distinct k-tuples chosen from S^k that belong to E_g. Why are multiple occurrences of the same k-tuple in S^k not counted separately? Counting them separately would seem to encode more information about the distribution on the vertices. As an example, suppose S = {v_1,v_2,v_3} with v_1 = v_2 and v_1 \neq v_3, k = 1, and E_g = {{v_1},{v_3}}. Then (if I am interpreting the definition correctly) |{(u) \in E_g : u \in S}| = 2, so the empirical frequency equals 2/3, whereas it seems natural to calculate the empirical frequency as 3/3 = 1 (a small counting sketch of the two conventions appears at the end of this review). I think this definition is also needed for understanding the proof of Lemma 1 (please see "Minor Comments" below).

Minor Comments:
- Page 2, line 45: "...distinguishing class [gives] rise..."
- Page 2, line 47: singletones -> singletons
- Page 2, line 50: Missing article between "Thus," and "IPM".
- Page 4, line 159: Should the vertex set \mathcal{V}_g be written without the subscript g?
- Page 5, line 177: Missing conjunction between "g" and "a".
- Page 5, line 178: I suggest adding brackets around k-n.
- Page 5, line 187, definition of IPM_{\mathcal{D}}(p_1,p_2): Superfluous closing square bracket in the second expression and an extra closing square bracket in the third expression.
- Page 5, line 190: \mathcal{V}^{{0,1}} -> {0,1}^\mathcal{V} ?
- Page 6, Definition 2: It might be helpful to summarise, in the main body of the paper if space permits or otherwise in the supplementary material, some basic properties of this new notion of VC dimension and how it relates to the usual notion of VC dimension. For example, if the graph VC dimension is an upper bound on the standard VC dimension (treating each concept as a set of k-hyperedges and defining shattering in the usual way), is there a Sauer-Shelah type upper bound on the size of the class in terms of the graph VC dimension? (The standard Sauer-Shelah bound is recalled at the end of this review for reference.)
On a related note, on page 3, line 97, perhaps one could state precisely what it means for the new notion to be "strictly weaker" than the standard VC-dimension; does it mean "at least equal to"? I also suggest using an alternative term (perhaps "hypergraph VC dimension"?), since the term "graph dimension" seems to have already been used for a notion of VC dimension for multivalued functions (e.g. in the paper "Characterizations of learnability for classes of {0,1,...,n}-valued functions" by Ben-David, Cesa-Bianchi, Haussler and Long).
- Page 7, line 232: class -> classes
- Page 7, line 251: Missing space between "bounds" and "[1]".
- Page 7, line 261 and Page 17, line 528: Missing period after "1/2".
- Page 8, line 295: "...the VC [dimension] of a class..."
- Page 10, line 369: [Preliminaries]
- Page 10, line 372: characterizes -> characterize
- Page 10, footnote 4: lable -> label; missing period at the end of the sentence.
- Pages 11 and 12: In the statements of Theorem 2 and Lemma 1, the distribution over the vertices is denoted by P; in the inequality between lines 406 and 407, as well as in the proof between lines 408 and 422, the distribution is denoted by D. Do they refer to the same distribution (it seems so)?
- Page 11, line 417: "...let S_{i,+} [be] all the..."
- Page 12, 5th line of equalities/inequalities: Extra vertical line just before the second "sup". Should the fifth line start with an inequality rather than an equality? (If not, why?)
- Page 13, third line in the chain of equalities/inequalities just after "Bounding *": Under the second expectation symbol, should v_{1:k} be v_{1:k-1}?
- Page 13, line 429: "...uniform distribution over k-tuples from..." -> (k-1)-tuples?
- Page 13, line 430: "...according to d and..." -> "...according to D and..."?
- Page 13, line 431: element -> elements
- Page 13, lines 429 to 433: I was a bit confused by this part of the argument. It seems that S is a sample with possibly repeated vertices, so putting a uniform distribution over (k-1)-tuples of S where each (k-1)-tuple is assigned a probability of 1/m^{k-1} seems to mean that two k-tuples with equal corresponding elements but different sets of corresponding indices are considered to be different. E.g. if S = {v_1,v_2,v_3} is a sample such that v_1 = v_2 = v_3, and k-1 = 2, then (v_1), (v_2) and (v_3) are each assigned a probability of 1/3. If so, why is the process described as one "where we pick m elements according to D and then partition them to m-k+1 elements and to a sequence v_{1:k-1} of distinct [elements]"? In particular, why does v_{1:k-1} consist of distinct elements (i.e. none of its terms are equal)? Is this an assumption, and if so, why is it needed? Why does this process give rise to a "partition" of S unless v_{1:k-1} is picked from S without replacement (is this what was meant)?
- Page 14, line 438 and the inequality just after line 439: \mathcal{D}_{u_{1:k-1}} -> \mathcal{G}_{u_{1:k-1}} ?
- Page 14, lines 438 and 440: Why is gVC(\mathcal{G}_v) exactly \rho for any choice of v (since, in the definition of gVC, we are taking the maximum projected VC dimension)? Should it be less than or equal to \rho?

* Response to author feedback
Thank you very much for the detailed feedback. I understand the paper a bit better now, and I think it deserves a higher score. I have therefore increased my overall score to 8. (I just recommend proofreading the paper for minor typos etc.)
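To make the General Comment concrete: a minimal counting sketch of the two conventions being contrasted, using the example above (S = {v_1, v_2, v_3} with v_1 = v_2 \neq v_3, k = 1, E_g = {{v_1},{v_3}}). The variable names and the list representation of the sample are illustrative only, and the normalization for general k is assumed here to be m^k, which may differ from the paper.

from itertools import product

# The example from the review: a sample with a repeated vertex.
S = ["a", "a", "b"]            # v_1 = v_2 = "a", v_3 = "b"; m = 3
k = 1
E_g = {("a",), ("b",)}         # edge set of the (here unary) graph g

# Reading 1 (how the definition between lines 176 and 177 appears to read):
# count distinct k-tuples of E_g all of whose coordinates appear somewhere in S.
count_distinct = sum(1 for e in E_g if all(u in S for u in e))
freq_distinct = count_distinct / len(S) ** k           # 2/3 in this example

# Reading 2 (counting multiplicities): count the tuples of S^k,
# indexed by sample positions, that land in E_g.
count_multiplicity = sum(1 for t in product(S, repeat=k) if t in E_g)
freq_multiplicity = count_multiplicity / len(S) ** k   # 3/3 = 1 in this example

print(freq_distinct, freq_multiplicity)

Reading 2 is the one that, as argued above, would encode more information about the distribution on the vertices.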
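For the Definition 2 comment: the "Sauer-Shelah type upper bound" asked about refers to the standard lemma for the usual VC dimension, recalled below for reference; whether an analogue holds for the graph VC dimension is precisely the open question raised.

\[
\text{If } \mathrm{VC}(\mathcal{H}) = d, \text{ then for every set } S \text{ with } |S| = m:\quad
\big|\{\, h \cap S : h \in \mathcal{H} \,\}\big| \;\le\; \sum_{i=0}^{d} \binom{m}{i} \;\le\; \Big(\tfrac{em}{d}\Big)^{d} \quad (m \ge d \ge 1).
\]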
Reviewer 3
Due to the shortness of time, I only comment on the expressivity results. The expressivity results are interesting and novel. However, I found it rather hard to sort out the "big" picture of these results, and perhaps some additional general discussion would be helpful.

It seems that in this part of the paper a kind of partial ordering between distinguishing classes is studied, where a class is larger than another if it can distinguish more pairs of distributions (classes can also be equivalent, and because of the parameters it is not clear how to formalize this precisely; one possible reading is sketched after this review). Theorem 4, for example, shows the superiority of using k=2 over k=1. The graph class is a single graph (an infinite bipartite graph encoding all finite subsets of one half by vertices of the second half). It is shown that for any finite-VC class there are distributions distinguished by the graph but not by the class. In terms of the ordering, the theorem says that there is a class for $k=2$ (consisting of that single graph) which is not smaller than or equal to any finite-VC class for $k=1$.

The term "incomparable" is used in a few places, and it is not clear in what sense it is meant. Also, since the ultimate goal of the work seems to be to find possibly simple classes which can distinguish many pairs of distributions, it would be useful to comment on what the implications of such incomparability results are towards this goal.
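One possible way to make the ordering and the term "incomparable" precise, offered only as a reading of the description above and not as the paper's own definition (a quantitative version would have to account for the accuracy parameters, which is where the difficulty noted above arises):

\[
\mathcal{G}_1 \succeq \mathcal{G}_2 \;\iff\; \forall (p_1, p_2):\ \mathrm{IPM}_{\mathcal{G}_2}(p_1, p_2) > 0 \;\Rightarrow\; \mathrm{IPM}_{\mathcal{G}_1}(p_1, p_2) > 0,
\]

with two classes incomparable when neither $\mathcal{G}_1 \succeq \mathcal{G}_2$ nor $\mathcal{G}_2 \succeq \mathcal{G}_1$ holds. Under this reading, Theorem 4 (as described above) exhibits a single graph for $k=2$ that is not $\preceq$ any finite-VC class for $k=1$.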