NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:7813
Title:A New Perspective on Pool-Based Active Classification and False-Discovery Control

Reviewer 1

This manuscript considers a novel active learning criterion: namely one of maximizing the true positive rate (TPR) subject to a constraint on the false discovery rate (FDR). The authors draw connections between this paradigm and several existing lines of work including combinatorial bandits and active classification. The technical results appear sound and interesting. The authors show sample complexity results for FDR control and for pool-based active learning in terms of novel complexity measures. The paper is predominantly clearly written. However, I have some minor comments that could aid in enhancing the quality of the presentation: - The notation, especially in the statement of Theorem 2, is quite difficult to parse. I would encourage the authors to consider ways to aid readability. - Please provide a sketch of the proof in the main text that highlights the main techniques and difficulties. I think this manuscript addresses an interesting problem and the presents novel theoretical results and techniques; I think this paper will make a great addition to NeurIPS 19.

Reviewer 2

Originality: The problem considered in this paper has not been extensively studied yet. The proposed solution is based on a nice combination of techniques from active learning and combinatorial bandits. Quality: I didn't check proofs in appendix, but results look reasonable to me. Clarity: This paper is well-organized. However, its technical part is a little bit dense and more explanation might be helpful. Below are some detailed comments: 1. It is very nice to motivate the problem with an application in the introduction. However, the example given is a little confusing to me. For example, Figure 1 is not well explained (what's the difference between babb and aaa? what does the distribution of bueired NPSA mean? What do you mean by "the distribution of a feature that is highly correlated with the fitted logistic model"?). These details might be irrelevant, but can be distracting for readers. Another issue is that when I was reading this part for the first time, at some point I thought the main issue was sampling bias or class imbalance, but actually this is not the point the authors want to make. 2. It might be easier to read if the authors could explain some high level intuitions for Algorithms 1 and 2 before explaining details. 3. Many definitions are not well explained (for example, V_\pi, V_{\pi, \pi'} in line 178-179, \tau_\pi in line 184-185, ...). Explaining these might shed some lights on how the algorithm/proof works and how much improvement had been made. 4. There seem to be some contributions this paper claims but are not very well explained in the main body. For example, the log(n) factors mentioned in line 121, local VC dimensions in line 180, the improvement over disagreement coefficient in section 3.1. Significance: This problem is well-motivated, and results are better than standard baselines/existing methods. One downside though is that the techniques seem a straightforward extension of existing ones. == After rebuttal: The authors have clarified some of my concerns. I hope the authors could improve the clarity of the technical part, and be more specific about the main challenges and the significance of its contribution in the future version.

Reviewer 3

First I want to say that I like the problem setting and the general approach. Also I do not have a single main concern, but I have several non-negligible ones that, when piled up, determined my final score. In what follows, I’ll order my concerns/questions from more important ones to minor comments. I felt that the main result, which is given in Theorem 2, was really hard to interpret. There are way too many terms, making their combination in the final result very hard to interpret. I found the discussion following the theorem very helpful, but it helped me understand how these terms occurred in the proof, rather than what their practical meaning is. I am happy seeing such a theorem if it is followed by multiple corollaries, given by different parameter instantiations, showing when the bound is tight and how the obtained complexity compares to prior work. Some comparison with the result of [4] is given, although in the fully general setting; it would be interesting to see settings when this difference is significant, and when it is not. For example, I liked the “One Dimensional Thresholds” example. Perhaps this is a matter of taste, but when I reached the last paragraph, I thought the paper ended too early; I was expecting more such examples to follow. 
Another concern is that there have been multiple works on different bandit approaches to multiple testing (not all of which test for whether the mean of a distribution is zero, as stated in the paper). Some related papers that weren’t discussed include, for example, Optimal Testing in the Experiment-rich Regime by Schmit et al. (AISTATS 2019), A framework for Multi-A(rmed)/B(andit) testing with online FDR control by Yang et al. (NeurIPS 2017), etc. Moreover, these papers focus on sample complexity just like the current paper. Their setting is different, but the papers are similar enough that they deserve a brief discussion. Citing only 3 papers in the “Multiple Hypothesis Testing” paragraph makes me think that the connections to prior work in this area have not been fully explored. If I understand correctly, the proposed algorithm is inspired by algorithms in related work (e.g. [4, 10]), but a fundamentally different idea is sampling in the symmetric difference, which has not been exploited so far? Further, I found the writing very messy in some parts, and it took a lot of re-reading to go over some definitions and arguments. For example, A_k and T_k are introduced without stating what k is. At the end of page 4, there is a sentence that says “the number of samples that land in T_k”, even though T_k is not a set of samples. In the same paragraph, mu-hats are used without being previously defined. The paragraph assumes I_t is uniformly distributed on [n], even though that is not stated beforehand. In the definition of TPR, eta_[n] is used, even though it is never defined. I understand that submissions are thoroughly revised before publication, but these kinds of mistakes make reviewing harder than it needs to be. I don’t know if this is necessary for the proof or not, but I didn’t see why the mu-hat difference is unbiased, as stated at the bottom of page 4. Especially if the formal result hinges on this fact, I would appreciate a formal argument explaining why this is the case. The rejection sampling strategy is essentially equivalent to the following: conditioned on “past information”, sample uniformly from the set T_k. This couples the old and new samples, making it not so obvious to me that the mu-hat difference is unbiased. Related to this point, I didn’t understand the part that says that the number of samples that land in T_k follow a geometric distribution. I agree that the wait time until you observe a selection in T_k is a geometric random variable. Relatively minor comments regarding style: 1. It is incorrect to say that Y_{I_t,t} is iid, as written at the beginning of page 3; iid is an assumption that refers to a set of random variables. This sentence needs to be rephrased more formally. 2. I was confused by the sentence “Instead of considering all possible subsets of [n], we will restrict ourselves to a finite class … .” The set of all subsets of [n] is finite anyway. 3. There is a minus sign missing when introducing mu_i in the risk equation on page 4. Either way, I do not see the purpose of introducing the risk. The paper makes it clear that the goal is to maximize TPR given an FDR constraint, as opposed to minimizing risk. Much of Section 1.1. seems like an unnecessary distraction. 4. There are some typos that need revision, like at the beginning of Section 3.1 where it says “specific specific noise models”, or in the Remark on page 5 there should be R_{i,t}. 5. The Benjamini-Hochberg paper should probably be cited, given that the used error metric is FDR, which stems from that paper. After rebuttal: The authors have clarified some of my concerns, and have promised to improve clarity of presentation. The other reviews have also convinced me about the originality of this submission, so I am increasing my score.