NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
The authors propose a relaxation of submodularity, called differential submodularity, where the marginal gains can be bounded by two submodular functions. They use this concept to provide approximation guarantees for a parallel algorithm, namely adaptive sampling, for maximizing weakly submodular functions, and show its applicability to parallel feature selection and experimental design.

Overall the paper is well written and the problem is well motivated. The main motivation for parallelized algorithms is their applicability to large datasets. Although we see some speedup for relatively small datasets in the experiments, my main concern is that, due to the large number of rounds in the worst case and the large sample complexity, the algorithm may not scale to large datasets, especially in an actual distributed setting (e.g. MapReduce).

Here are my questions:
- What is the sample complexity of line 5 of Algorithm 1 (theoretically)? In the experiments, the authors mention they implemented DASH with 5 samples at every round, but later they say that sampling could be computationally intensive, and that this is why DASH is slower than SDS_MA for small k.
- Is the proposed algorithm scalable to large datasets, and is it applicable to an actual distributed setting (e.g. MapReduce)? The number of rounds, log_{1+\eps/2}(n), becomes prohibitive even for moderately sized problems (see the rough estimate at the end of this review), and the experiments are done on relatively small datasets. So the algorithm may not actually be usable for the large datasets that are the main motivation for parallelization.
- How is SDS_MA parallelized? This is not explained in the paper or the appendix, and it is odd that the parallel version is slower than the centralized one, especially in the multi-processor case rather than an actual distributed setting.
- Can other distributed frameworks, such as "A new framework for distributed submodular maximization", "Fast greedy algorithms in mapreduce and streaming", or "Distributed submodular maximization: Identifying representative elements in massive data", be used to provide guarantees for maximizing weakly submodular functions in the distributed setting (with or without differential submodularity)?

----------------------
Update: I am not very convinced by the authors' answers about the fundamental difference between the current setup and MapReduce. I believe that in MapReduce the ground set can be partitioned across several machines, each machine filters elements from its local data and samples from it, and the results can be communicated to the central machine. I also agree with the other reviewers that introducing further applications would make the introduced notion of differential submodularity more interesting.
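A rough back-of-the-envelope estimate of the round count mentioned above (my own calculation, not a figure from the paper): taking \eps = 0.1 and n = 10^6, the stated bound gives log_{1+\eps/2}(n) = log_{1.05}(10^6) = 6 \ln 10 / \ln 1.05 \approx 283 adaptive rounds, each of which requires fresh samples. This is the scale of overhead that makes me doubt the approach is practical in the large-data regime that motivates parallelization.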
Reviewer 2
Quality/Significance: A somewhat interesting combination of previous ideas. Adaptive sampling provides good motivation for differentially submodular functions (Appendix A).

Originality: The authors briefly mention [HS16] but do not explain how their definition relates to its results. A preprint by Gupta et al. is cited in the references as [GPB18] but never mentioned in the main paper or the Appendix. It contains a definition equivalent to differential submodularity in the case where g and h are the same function (as they are in many of this paper's proofs). Also, the proof of Theorem 6 is very similar to Section 5.2 of [GD18]. For these reasons, it seems like the paper combines several existing lines of work with limited novelty. Without further detailed discussion, this is a key weakness.

Clarity: Generally clear, but the organization could be improved. Several of the paper's key details are pushed to the Appendix. The symbol d is defined as both the feature dimension and a submodular diversity function. The name DASH might cause confusion with ProtoDash [GDC17].

[GD18] Gurumoorthy et al. Streaming Methods for Restricted Strongly Convex Functions with Applications to Prototype Selection. https://arxiv.org/pdf/1807.08091.pdf
[GDC17] Gurumoorthy et al. ProtoDash: Fast Interpretable Prototype Selection. https://arxiv.org/abs/1707.01212

----------------
EDIT: I appreciate the detailed comparison with related work provided by the authors in their response, and will increase my overall score. However, there are some additional concerns about how the proposed algorithm performs in large-scale settings. Therefore my final review is still borderline.
Reviewer 3
It was relatively easy to follow the paper. I like Figure 1; it helps a lot with immediately grasping the definition of differential submodularity. I only do not understand why in the definition we need g_S if we already have \alpha h_S (see the sketch at the end of this review).

The theoretical contribution of this work is heavily motivated by the prior work on performing submodular maximization in parallel, which is also noted by the authors. So, the main contribution of this paper is introducing the notion of differential submodularity. I believe that this notion has future potential, but it would be nice to see in this submission more comments on, and more comparison with, the prior work.

In general, the experiments are convincing that DASH outperforms the baselines. Also, in the experiments it is said: "in all experiments, DASH achieves a two to eight-fold speedup ...", and then: "This shows the incredible potential of other parallelizable algorithms". I assume that this comment refers to plots (c) and (f), i.e., the running times. I think that "incredible" here is an oversell. It is good news that one indeed obtains a faster approach, but saying "incredible" makes it sound as if it is sometimes orders of magnitude more efficient. In Figures 4(a) and 4(d), when does SDS_MA saturate? It would be good to also use larger k in those experiments, to see how SDS_MA behaves.

--- Review updates ---
I thank the authors for providing additional comments on their work. My question still remains about the applicability of this notion to other problems (the authors mentioned that they are working on extending these results to other problems, but as I understand it, that work is still in progress). My score remains the same.
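To make my question about the definition concrete (this is my paraphrase, so the exact form should be checked against the paper): as I read it, f is \alpha-differentially submodular if there exist submodular functions g and h such that \alpha h_S(a) \le g_S(a) \le f_S(a) \le h_S(a) for every set S and element a. Since the chain already implies \alpha h_S(a) \le f_S(a) \le h_S(a), it seems the definition could be stated with h alone; clarifying what additional role the submodular lower bound g plays (e.g., being a tighter lower bound than \alpha h_S) would help.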