Paper ID: 3431

Title: Bayesian Batch Active Learning as Sparse Subset Approximation

[review updated in Improvements section] The paper is very well written. It tackles the problem of sampling for active learning such that a mini-batch of examples is diverse, and proposes a Bayesian approach as a solution. To resolve the intractability of the original problem, the authors take the expectation of outcomes with respect to the current predictive posterior distribution and use Bayesian coresets (which approximate a total expectation by constructing a sparse subset approximation). A tricky part of the approach in practice is constructing a suitable inner product; the authors suggest some variants for generalized linear models, and random projections for non-linear models. In experiments on synthetic data and on real data from UCI, MNIST, CIFAR-10, SVHN, and Fashion-MNIST, the authors demonstrate that the method achieves what it was designed to achieve (diversity, competitiveness, scalability). I think the paper is a good contribution: it solves an important problem, suggests a novel algorithm, and has thorough evaluations of the most important aspects of the algorithm. I also noticed the authors haven't referenced one recent relevant work (I believe it appeared on arXiv only), "Diverse mini-batch Active Learning", which might add to their baselines.

This manuscript proposes a novel method for Bayesian batch active learning through sparse subset approximation, with a convenient set of reductions to arrive at a tractable algorithm. The method is validated and explored through a series of special cases (linear regression and classification), illustrations, and experiments, and overall appears to be competitive with the state of the art. The manuscript is well written, insightful, and enjoyable to read. The proposed approach outlined on page 3 is elegant, appears to work well in practice, and may be useful in other settings. I only have a few concerns about the work as presented, which I outline below in roughly linear order.

- The fact that naive adaptation of sequential active learning algorithms to the batch setting, by ranking and taking the top-scoring points, produces highly correlated batches is completely unsurprising. However, there is a simple mechanism that can be used to address this phenomenon, which has for example been used in Bayesian optimization [1][2]:
  - begin with an empty batch
  - for i = 1 .. batch size:
    - choose the next point x using sequential AL
    - impute an observation y for this point (e.g., by sampling or MAP)
    - add x to the batch
    - update the model given (x, y)
  - return the batch

  I _think_ this may be the "greedy" procedure you mention for the final experiment, but I'm not sure. Obviously there is a tradeoff here in that you must do (batch size) model updates to compute the batch, although the overall running time is only linearly more expensive than choosing a single point. By imputing observations you get a natural "repulsive effect" that lessens the correlations among batch members. I think at the very least some more discussion is warranted on this point. Added after response: after some reflection, I am lowering my score slightly since this simple and pervasive batch construction strategy is completely ignored and not even acknowledged in the discussion.
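For concreteness, the imputation loop described above can be sketched for Bayesian linear regression, where the posterior admits closed-form rank-1 updates. This is an illustrative reviewer-side sketch, not the paper's algorithm; the function name and the choice of uncertainty sampling as the sequential AL rule are my own assumptions.

```python
import numpy as np

def greedy_batch_with_imputation(pool, mu, Sigma, noise_var=1.0, batch_size=5):
    """Hypothetical sketch of greedy batch AL with imputed labels.

    Bayesian linear regression: pick the most uncertain pool point,
    impute its label with the MAP prediction, update the posterior,
    and repeat. The imputed labels produce the "repulsive effect":
    once a point is pseudo-observed, nearby points lose predictive
    variance and stop being selected.
    """
    mu, Sigma = mu.copy(), Sigma.copy()
    batch, remaining = [], list(range(len(pool)))
    for _ in range(batch_size):
        # predictive variance of each remaining candidate
        var = np.array([pool[i] @ Sigma @ pool[i] for i in remaining])
        best = remaining[int(np.argmax(var))]
        batch.append(best)
        remaining.remove(best)
        x = pool[best]
        y_imputed = x @ mu                      # MAP imputation
        s = noise_var + x @ Sigma @ x           # predictive variance at x
        k = Sigma @ x / s                       # "Kalman gain"
        mu = mu + k * (y_imputed - x @ mu)      # no-op under MAP imputation
        Sigma = Sigma - np.outer(k, Sigma @ x)  # rank-1 covariance update
    return batch
```

Note that in the Gaussian case the covariance update is independent of the imputed y, so the repulsion comes for free; for non-Gaussian likelihoods the imputed label would actually matter.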
  The paper would be much stronger (and more intellectually honest) if this strategy were discussed and evaluated, even if the evaluation concluded the runtime was too great. I don't buy the idea that a linear slowdown in active learning (where evaluations are expensive) is as big a deal as the authors suggest.
- The black box in (5) should be a w.
- I think the discussion on logistic regression would be clearer if you simply replaced it entirely with probit regression. Why mention the logistic function at all if you're just going to throw it away a few lines later?
- I disagree that extending entropy-based methods to the batch setting is necessarily difficult. For example, in logistic regression it would be trivial to at least greedily maximize the entropy (the determinant of the predictive covariance matrix) offline using a series of rank-1 updates. This is effectively the above procedure, although in this particular case you don't need to impute y at all.
- The manuscript would be strengthened by some mention or study of the running time of the proposed method compared to the others.

[1]: https://hal.archives-ouvertes.fr/hal-00260579/document
[2]: Jiang, et al. Efficient nonmyopic batch active search. NeurIPS 2018
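The rank-1-update point above can be made concrete via the matrix determinant lemma: adding a candidate x to the information matrix A multiplies det(A + x xᵀ) by (1 + xᵀ A⁻¹ x), so greedy determinant (D-optimal) maximization only needs a Sherman-Morrison update per step. This is an illustrative sketch under my own assumptions (a linear-Gaussian information matrix and a hypothetical function name), not code from the paper.

```python
import numpy as np

def greedy_d_optimal(pool, batch_size=5, reg=1e-3):
    """Hypothetical sketch: greedy log-det maximization with rank-1 updates.

    The log-det gain of adding candidate x is log(1 + x^T A^{-1} x)
    by the matrix determinant lemma, so each greedy step just picks
    the candidate maximizing x^T A^{-1} x and updates A^{-1} with
    Sherman-Morrison, avoiding any re-inversion.
    """
    d = pool.shape[1]
    A_inv = np.eye(d) / reg  # inverse of the regularized information matrix
    chosen, remaining = [], list(range(len(pool)))
    for _ in range(batch_size):
        gains = np.array([pool[i] @ A_inv @ pool[i] for i in remaining])
        best = remaining[int(np.argmax(gains))]
        chosen.append(best)
        remaining.remove(best)
        x = pool[best]
        Ax = A_inv @ x
        # Sherman-Morrison: (A + x x^T)^{-1} = A^{-1} - (A^{-1}x)(A^{-1}x)^T / (1 + x^T A^{-1} x)
        A_inv = A_inv - np.outer(Ax, Ax) / (1.0 + x @ Ax)
    return chosen
```

Each step costs O(d²) per candidate, so the whole batch is built offline without any model refits or label imputation.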

The paper considers the problem of selecting an optimal batch of samples for evaluation to increase the predictive performance of the model over the entire dataset, i.e., active learning. The paper takes a Bayesian perspective, where the goal is to select a number of points (fixed budget) to minimize the difference between the resulting posterior and the posterior one would have obtained if all the points were labeled. The problem is formulated as a sparse data approximation problem and corresponding algorithms are presented. Some derivations and interpretations are given for simple models (linear, logistic). The algorithm is also extended to deal with larger datasets through the use of random projections. A collection of experimental results for regression and classification on publicly available datasets is presented.

Originality: As far as I know, the algorithmic development, the interpretations, and the experimental results are original.

Quality: The material appears to be sound, although some derivations are presented in the supplementary material, which I didn't check. I would like to have a few things clarified. There has been other work on Bayesian active learning that seems to fit the batch setting well but has not been mentioned: "Nonmyopic active learning of Gaussian processes: an exploration-exploitation approach" by Krause et al. Although the focus there is on GPs, I find the discussion and derivations very relevant to the general Bayesian setting. What is the relationship with this work? If it is not suitable for batch AL, why? If it is, how is it different? What is the relationship between the objective proposed in this paper and one of the objectives discussed there? Please also provide a careful derivation of Eq. 4, as in its current form it is not clear where the first equality comes from.

Clarity: The paper is well written overall. However, the motivation for the batch AL setting can be improved.
Typically in AL the cost of a label is significantly higher than the cost of computation. Yet in this case the argument is that the label cost is "cheaper" than updating a model, leading to the grouping of samples. Indeed, what are the real applications for this scenario?

Significance: Given the concerns about the motivation, I find this work of somewhat limited significance. Evaluations on the MNIST datasets, although useful, do not relieve those concerns. In other words, experiments on a dataset where labeling costs are "on par" with model update time would be ideal.

Post-rebuttal feedback: Thank you for clarifying some of the concerns I raised; including those clarifications in the paper will certainly improve it. That being said, the point raised by R2 about the simple way to construct the batch through imputation and model updates is still valid. The paper would have been much better if either experimental results comparing with this method were included, or a sufficiently detailed discussion were provided regarding the cases where the model update is too expensive to run. So far I can think of this issue possibly arising in the parallel simulation setting only, and in none of the other motivational examples mentioned. As a result, my score remains the same.