The submission got four reviews that were quite polarised in their recommendations, with two against accepting and two strongly in favour. The disagreement did not concern the technical quality of the paper. The reviewers agree that the theoretical work in this paper has been very competently performed and in the context of the problem the authors consider, the results are interesting and advance the state of the art. The disagreement is over whether the results are significant enough for NeurIPS or would be more appropriate for a specialised theory conference. The main objections against accepting are (i) the results are not surprising, (ii) the assumptions (monotonicity and uniform distribution) are strong and (iii) the overall computational complexity is high. Regarding (i), the authors in their rebuttal provide a reasonable discussion about why it is not obvious that such results should hold. Regarding (ii), the rebuttal points out and two reviewers agree that such strong assumptions are necessary for meaningful analysis of heuristic top-down algorithms (which otherwise can be shown to break down completely on some simple target classifiers); furthermore, such assumptions have traditionally been accepted as necessary for certain types of deep theoretical analysis. Regarding (iii), the rebuttal clarifies the result of the paper but does so in a direction that confirms the reviewer's negative evaluation. All the reviewers participated in discussion after the rebuttal, but opinions did not converge. Since the reviewers are strongly divided and the issue is largely a judgement call on the significance of the results, I'm making my recommendation based on my personal view, which is in favour of accepting the paper. On issues (i) and (ii), I tend to agree with the rebuttal by the authors and the comments of the two reviewers recommending accepting. On issue (iii) I'm a bit more hesitant, but I consider the results on label efficiency interesting enough. However, the rebuttal mentions that "the runtime of our algorithm can be upper bounded by the product of the size of the dataset and the sample complexity of our learnability procedure." I hope this is an off-the-cuff upper bound that can be improved be more careful use of data structures; otherwise I'm pessimistic about using the algorithm on (large) real-wold data sets which the authors mention.