NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID 6508: Submodular Function Minimization with Noisy Evaluation Oracle

### Reviewer 1

The paper is well written. Overall, the problem considered in this paper is important for today's potentially large-scale applications, where instead of the actual function value (given as an expectation or as a finite sum over the whole data) we must resort to stochastic estimates (i.e., an i.i.d. sample from the distribution or randomly chosen mini-batches). While the problem formulation is novel, the main novelty is in providing unbiased estimates of the gradient of the Lovász extension; the rest follows immediately from the convexity of the extension. The lower bound is also novel.

A note regarding related work: the concept of submodular optimization with noisy (unbiased) oracles has been introduced before in two papers: (i) "Gradient methods for submodular optimization" and (ii) "Stochastic Submodular Maximization: The Case of Coverage". To the best of my knowledge these works use SGD, albeit in the context of maximization. Also, the work "Submodular Optimization under Noise" considers noisy (worst-case noise) oracles and should be cited as related work.
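For context on where these gradient estimates come from: the exact (noiseless) subgradient of the Lovász extension at a point x is given by the classical greedy rule, i.e., sort the coordinates of x in decreasing order and take consecutive marginal gains of f. A minimal sketch (the toy set function here is illustrative, not from the paper under review):

```python
import numpy as np

def lovasz_subgradient(f, x):
    """Subgradient of the Lovasz extension of f at x in [0, 1]^n,
    via the greedy rule: sort coordinates decreasingly, take marginals."""
    n = len(x)
    order = np.argsort(-np.asarray(x))   # indices sorted by decreasing x
    g = np.zeros(n)
    S = set()
    prev = f(frozenset(S))
    for i in order:
        S.add(int(i))
        cur = f(frozenset(S))
        g[i] = cur - prev                # marginal gain of adding i
        prev = cur
    return g

# Toy symmetric submodular function f(S) = |S| * (n - |S|) with n = 3.
f = lambda S: len(S) * (3 - len(S))
g = lovasz_subgradient(f, np.array([0.9, 0.1, 0.5]))
```

By telescoping, the coordinates of g always sum to f(ground set) minus f(empty set); the noisy-oracle setting replaces the exact marginals with stochastic estimates of them.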

### Reviewer 2

On line 178, the authors should acknowledge the provenance of this subgradient construction. The paper [1] provided this construction several years ago, although I am not sure whether it is the earliest paper to design it; the authors should also search the book [2].

[1] Djolonga, Josip, and Andreas Krause. "From MAP to Marginals: Variational Inference in Bayesian Submodular Models." Advances in Neural Information Processing Systems. 2014.
[2] Fujishige, Satoru. Submodular Functions and Optimization. Vol. 58. Elsevier, 2005.

Using the Lovász extension and stochastic projected gradient descent is standard, and the authors should acknowledge previous works applying this method to submodular function minimization. The major contribution of this paper is the two estimators presented in Sec 5.3. The two estimators look so natural that it is surprising to me that nobody has considered them before. If there are previous works that use only a gradient estimator, the authors should provide a detailed literature review of them.

On line 208, I was confused about why I_t is a subset of {0, 1, ..., n}: the ground set is {1, 2, ..., n}, so why could I_t contain 0? Furthermore, it is extremely confusing that the construction from line 207 to line 210 seems not to rely on x_t.

On lines 217-218, the notation \bigcup_{i\in J_t} \{S_{\sigma(i)}, S_{\sigma(i-1)}\} is perplexing: S_{\sigma(i)} and S_{\sigma(i-1)} are themselves sets, so \{S_{\sigma(i)}, S_{\sigma(i-1)}\} is a set of two sets, and I could not decipher what this sentence tries to convey. J_t seems to be essentially like I_t, also picked uniformly at random, except that its size l is larger (= \lceil k/2 \rceil). What is the high-level intuition for why the estimator (16) is better? The authors should explain the high-level idea carefully.

I checked the proofs of Lemma 1 and Lemma 2 regarding the expectation and variance of the proposed gradient estimators.
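To make the sampling scheme under discussion concrete, here is one hypothetical rendering of a k-point subgradient estimator (the function names, the uniform sampling of positions, and the n/k rescaling are my reading of the construction, not necessarily the paper's exact formulation): sample k positions of the sorted order uniformly without replacement, query the oracle at the consecutive level sets S_{\sigma(i-1)} and S_{\sigma(i)}, and rescale so the estimate is unbiased whenever the oracle is.

```python
import numpy as np

def sampled_lovasz_estimator(f_hat, x, k, rng):
    """Sketch of a k-point estimator of the Lovasz subgradient: only k of
    the n coordinates are estimated per round, rescaled by n/k so that
    the estimate is unbiased for an unbiased oracle f_hat."""
    n = len(x)
    order = np.argsort(-np.asarray(x))                 # decreasing order of x
    positions = rng.choice(n, size=k, replace=False)   # sampled sorted positions
    g = np.zeros(n)
    for p in positions:
        S_prev = frozenset(int(j) for j in order[:p])      # lower level set
        S_cur = frozenset(int(j) for j in order[:p + 1])   # next level set
        g[order[p]] = (n / k) * (f_hat(S_cur) - f_hat(S_prev))
    return g

# Sanity check: with k = n and a noiseless oracle, the full greedy
# subgradient of the toy function f(S) = |S| * (3 - |S|) is recovered.
rng = np.random.default_rng(0)
f = lambda S: len(S) * (3 - len(S))
g = sampled_lovasz_estimator(f, np.array([0.9, 0.1, 0.5]), k=3, rng=rng)
```

Each coordinate is selected with probability k/n, so the n/k factor makes the estimator unbiased coordinate-wise; the cost is extra variance from the unsampled coordinates, which is where the trade-off between the two estimators lives.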
On line 375 (Appendix B.1), the authors should explain why 1\{i\in I_t\} and \hat{f}_t(S_{\sigma(i)}) are independent. It seems to me that the high-level reason that (16) works better is the inequality on line 395, and I have a couple of suggestions regarding it. First, the lemma (Lemma 8 in [17], or Lemma 1 in [25]) should be included for completeness, together with its proof if it is short. Second, the authors should state clearly how the inequality yields a smaller variance and why multi-point evaluations are crucial here. Furthermore, the authors should justify in the paper that the multi-point evaluation assumption is realistic, for example by giving examples where one does have access to random submodular functions (so that the evaluations of the k points come from the same randomly sampled submodular function).

The presentation of Sec 6.1 should be significantly improved. The authors present a distribution, F(S^*, \epsilon), in the first paragraph, and then another distribution, F'(S^*, \epsilon). What did the authors mean by "stochastically independent"? Is it the usual notion of independence? Why pick i_X and s_X? Are they used to make sure that the evaluations of \hat{f} are independent given different X? Why is the construction sampled from F'(S^*, \epsilon) a submodular function? If it is submodular, is it in fact modular? What is the connection between \hat{f} on line 248 and \hat{f} on line 254?

Too many technical details are missing for me to evaluate the correctness of this paper, and it is hard for me to make a favorable recommendation before the authors present detailed explanations and arguments. It would be great if the authors could clarify all of the aforementioned points in their response and incorporate them into the paper.
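On why multi-point evaluations can reduce variance: when two consecutive level sets are evaluated on the same sampled function, noise common to both evaluations cancels in their difference. A toy Monte Carlo illustration, under the simplifying (purely illustrative) assumption that the shared randomness is a single additive Gaussian term:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda S: len(S)                    # toy (modular) set function
S_prev, S_cur = frozenset(), frozenset({0})
sigma, N = 1.0, 20000

# Single-point regime: each oracle call draws independent noise,
# so the difference of two calls has variance 2 * sigma^2.
single = np.array([
    (f(S_cur) + sigma * rng.standard_normal())
    - (f(S_prev) + sigma * rng.standard_normal())
    for _ in range(N)
])

# Multi-point regime: both level sets are evaluated on the SAME sampled
# function; here the shared noise term z cancels in the difference.
multi = np.array([
    (f(S_cur) + z) - (f(S_prev) + z)
    for z in sigma * rng.standard_normal(N)
])
```

This is only a caricature in which the cancellation is exact; the actual variance analysis is the content of Lemma 2 and the inequality on line 395, but the mechanism being exploited is of this flavor.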

### Reviewer 3

Originality: The model is new. Though it is somewhat similar to the bandit model, the new model exploits a salient feature that often appears in practical applications of submodular function minimization and that is not captured by the bandit model.

Quality: The proofs seem to be correct. They are not very hard technically, as a wealth of proof techniques has already been developed in online convex optimization.

Clarity: It was easy to follow the argument.

Significance: Although the proofs are not very hard, I think the model is worth studying.

Line 172: The range of f is missing.