NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 3801
BatchBALD: Efficient and Diverse Batch Acquisition for Deep Bayesian Active Learning

### Reviewer 1

Post-rebuttal: Thank you for the clarification. My score remains the same.

---

This is interesting work that addresses the degeneracy of batch acquisition when using BALD as a score function. The method proposed in the paper elegantly deals with the problem of redundant acquisition when using BALD in a greedy manner. I have a few questions and hope the authors can address them:

(1) Does this problem of redundant acquisition happen only when one uses BALD as the score? Intuitively I would think not: if one applies any score function greedily, disregarding the contribution of the other samples selected in the same batch, one can still end up with a biased batch that can potentially harm training. If that is the case, why do var-ratios and mean-std outperform random? Is it a matter of batch size? Is there any intuition for why this problem seems more serious when BALD is the acquisition score? (The two scores at issue are restated after this review.)

(2) To me, Figure 4 really points out the problem, and Figure 5 explains why we want to do AL in a batch manner (due to the computational cost). To avoid confusion, though, I would suggest emphasizing that an AL algorithm is considered good when the "accumulated accuracy" is maximized (Fig. 4), as long as this is not at the cost of extra computational burden (Fig. 5). Reading only Fig. 5 (and the description in the text), one could interpret it as saying BALD is better since it takes less "time" to achieve 95% accuracy.

(3) I'm not sure I get the meaning of Fig. 8. One could randomly permute the RHS and it would look equally uniform; likewise, one could sort the bins of BatchBALD's acquisition counts, and then the difference between the max and min values would surely stand out. Perhaps sorting both would be fairer? I would also suggest printing some statistics measuring the dispersion (entropy, range, etc.); a minimal sketch of such statistics appears after this review. On a related note, why is it that the two methods end up having different "datasets"? Is it that the training set is not exhausted after all of the acquisitions?

(4) The section on "scope and limitations" seems a bit hastily written. For example, I don't get the intuition for why BatchBALD is expected to perform well if the test set is balanced, and not so otherwise. Would BALD also perform well if the test set is balanced? Please elaborate.

(5) What does the shaded area in all of the figures stand for? One standard deviation from the mean across multiple experiments, or over test-set data points?

Typos:
- Line 110: redundant "p(w" at the end?
- Line 122: I think $P_{1:n-1}$ is of shape $c^{n-1} \times k$?
- Line 238: "Noisy" estimator.
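For reference on question (1), the two acquisition scores at issue, as defined in the paper (up to notation, with $\omega$ the model parameters and $\mathcal{D}_{\text{train}}$ the training set):

$$a_{\text{BALD}}(\{x_1,\dots,x_b\}) = \sum_{i=1}^{b} \mathbb{I}[y_i ; \omega \mid x_i, \mathcal{D}_{\text{train}}], \qquad a_{\text{BatchBALD}}(\{x_1,\dots,x_b\}) = \mathbb{I}[y_1,\dots,y_b ; \omega \mid x_1,\dots,x_b, \mathcal{D}_{\text{train}}].$$

The summed score counts information shared between the $y_i$ repeatedly, whereas the joint mutual information discounts that overlap; this is the degeneracy the question refers to.

For the dispersion statistics suggested in question (3), a minimal sketch (the counts are hypothetical placeholders standing in for the per-class acquisition counts one would read off Fig. 8):

```python
import numpy as np

# Hypothetical per-class acquisition counts (placeholders, not taken from the paper).
counts = np.array([230, 210, 255, 240, 248, 232, 221, 260, 215, 239])

freqs = counts / counts.sum()
entropy = -(freqs * np.log(freqs)).sum()
print(f"entropy: {entropy:.4f} (uniform would give {np.log(len(counts)):.4f})")
print(f"range:   {counts.max() - counts.min()}")
```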

### Reviewer 2

Quality:

Pros: Overall, this is a technically sound submission. I really like the proof of the submodularity of the proposed BatchBALD acquisition function. Furthermore, the Monte Carlo estimation of that acquisition function, as well as the new efficient implementation, is quite interesting.

Cons: My first concern is the quality of the estimate in (10) when working with large acquisition sizes $n$. In particular, the number of possible configurations of $y_{1:n}$ is $c^n$ (with $c$ the number of classes), which becomes extremely large as $n$ increases, while only $m$ samples are drawn. Although this is explained in App. C, the gap between $m$ and $c^n$ is still unclear to me; a numerical sketch of this gap follows the review. Furthermore, the experimental results of the submission are not very compelling, since they were conducted only on MNIST and its variants. It would be more convincing if the authors could provide results on at least one more benchmark data set in the field (e.g., CIFAR-10).

Clarity: The submission is clearly written and well organized. However, the definition of data repetition is unclear to me. Does a data point x' duplicate a given data point x only if x' = x, or also when the two samples are close enough? This relates to Alg. 1 as well as to the way Repeated MNIST is generated (Sec. 4.1). In particular, Alg. 1 can only guarantee that the newly selected data point $x_n$ is different from the previous ones, while in Repeated MNIST a duplicated sample is generated by adding Gaussian noise to a given sample.

Originality: The proposed BatchBALD is a novel extension of one of the most widely studied acquisition functions in the Bayesian active learning by disagreement (BALD) framework [10], targeting the selection of a joint (dependent) batch of data samples to improve data diversity. The paper also introduces the use of a greedy approximation algorithm as well as new ways to estimate the BatchBALD acquisition function.

Significance: The main contribution of the paper is to improve the data efficiency of the points selected by BALD with respect to both diversity and batch size. The experimental results are quite promising.

Post-rebuttal comments: I have read the author feedback carefully. I thank the authors for the insightful clarifications, and especially for providing the further experimental results required for a score improvement. Also, based on the positive comments/evaluations from the fellow reviewers, I have decided to upgrade my score to 7.
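To make the $m$ versus $c^n$ gap concrete, here is a minimal numerical sketch (hypothetical names and shapes, not the paper's implementation). It enumerates the joint over label configurations exactly, mirroring the $c^{n} \times k$ probability table $P_{1:n}$ that Reviewer 1 mentions, and compares it against an estimate built from $m$ sampled configurations, assuming the standard form $\hat{H} = -\frac{1}{m}\sum_{s} \log \hat{p}(\hat{y}^{s}_{1:n})$:

```python
import numpy as np

def joint_entropy_exact(p):
    """Exact H(y_1:n), enumerating all c**n label configurations.

    p has shape (k, n, c): class probabilities for n batch points under
    each of k posterior samples. The joint table grows to c**n columns,
    which is the blow-up discussed above.
    """
    k, n, c = p.shape
    P = np.ones((k, 1))
    for j in range(n):
        # Extend the joint by one point: c**j columns -> c**(j + 1).
        P = (P[:, :, None] * p[:, j, None, :]).reshape(k, -1)
    q = P.mean(axis=0)  # average over the k posterior samples
    return -(q * np.log(q)).sum()

def joint_entropy_sampled(p, m, rng):
    """MC estimate H(y_1:n) ~= -(1/m) sum_s log p(y^s_1:n) with y^s ~ p."""
    k, n, c = p.shape
    log_probs = []
    for _ in range(m):
        w = rng.integers(k)                               # w ~ p(w | D)
        y = [rng.choice(c, p=p[w, j]) for j in range(n)]  # y_j ~ p(y_j | x_j, w)
        # p(y_1:n) = (1/k) * sum_i prod_j p(y_j | x_j, w_i)
        log_probs.append(np.log(np.prod(p[:, range(n), y], axis=1).mean()))
    return -np.mean(log_probs)

rng = np.random.default_rng(0)
k, n, c, m = 20, 4, 10, 5_000   # c**n = 10_000 configurations, m = 5_000 samples
logits = rng.normal(size=(k, n, c))
p = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
print(joint_entropy_exact(p), joint_entropy_sampled(p, m, rng))
```

With $n = 4$ and $c = 10$ the exact table is still tractable, which is what makes this side-by-side check possible; the concern above is precisely that $c^n$ outgrows any such check at realistic acquisition sizes, so the quality of the $m$-sample estimate can no longer be verified directly.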