{"title": "Distributed Submodular Cover: Succinctly Summarizing Massive Data", "book": "Advances in Neural Information Processing Systems", "page_first": 2881, "page_last": 2889, "abstract": "How can one find a subset, ideally as small as possible, that well represents a massive dataset? I.e., its corresponding utility, measured according to a suitable utility function, should be comparable to that of the whole dataset. In this paper, we formalize this challenge as a submodular cover problem. Here, the utility is assumed to exhibit submodularity, a natural diminishing returns condition prevalent in many data summarization applications. The classical greedy algorithm is known to provide solutions with logarithmic approximation guarantees compared to the optimum solution. However, this sequential, centralized approach is impractical for truly large-scale problems. In this work, we develop the first distributed algorithm \u2013 DISCOVER \u2013 for submodular set cover that is easily implementable using MapReduce-style computations. We theoretically analyze our approach, and present approximation guarantees for the solutions returned by DISCOVER. We also study a natural trade-off between the communication cost and the number of rounds required to obtain such a solution. In our extensive experiments, we demonstrate the effectiveness of our approach on several applications, including active set selection, exemplar based clustering, and vertex cover on tens of millions of data points using Spark.", "full_text": "Distributed Submodular Cover:\n\nSuccinctly Summarizing Massive Data\n\nBaharan Mirzasoleiman\n\nETH Zurich\n\nAshwinkumar Badanidiyuru\n\nGoogle\n\nAmin Karbasi\nYale University\n\nAndreas Krause\n\nETH Zurich\n\nAbstract\n\nHow can one \ufb01nd a subset, ideally as small as possible, that well represents a\nmassive dataset? 
I.e., its corresponding utility, measured according to a suitable\nutility function, should be comparable to that of the whole dataset. In this paper,\nwe formalize this challenge as a submodular cover problem. Here, the utility is\nassumed to exhibit submodularity, a natural diminishing returns condition preva-\nlent in many data summarization applications. The classical greedy algorithm is\nknown to provide solutions with logarithmic approximation guarantees compared\nto the optimum solution. However, this sequential, centralized approach is imprac-\ntical for truly large-scale problems. In this work, we develop the \ufb01rst distributed\nalgorithm \u2013 DISCOVER \u2013 for submodular set cover that is easily implementable\nusing MapReduce-style computations. We theoretically analyze our approach,\nand present approximation guarantees for the solutions returned by DISCOVER.\nWe also study a natural trade-off between the communication cost and the num-\nber of rounds required to obtain such a solution. In our extensive experiments,\nwe demonstrate the effectiveness of our approach on several applications, includ-\ning active set selection, exemplar based clustering, and vertex cover on tens of\nmillions of data points using Spark.\n\n1\n\nIntroduction\n\nA central challenge in machine learning is to extract useful information from massive data. Con-\ncretely, we are often interested in selecting a small subset of data points such that they maximize a\nparticular quality criterion. For example, in nonparametric learning, we often seek to select a small\nsubset of points along with associated basis functions that well approximate the hypothesis space\n[1]. More abstractly, in data summarization problems, we often seek a small subset of images [2],\nnews articles [3], scienti\ufb01c papers [4], etc., that are representative w.r.t. an entire corpus. 
In many\nsuch applications, the utility function that measures the quality of the selected data points satis\ufb01es\nsubmodularity, i.e., adding an element from the dataset helps more in the context of a few selected\nelements than if we have already selected many elements (cf. [5]).\nOur focus in this paper is to \ufb01nd a succinct summary of the data, i.e., a subset, ideally as small as\npossible, which achieves a desired (large) fraction of the utility provided by the full dataset. Hereby,\nutility is measured according to an appropriate submodular function. We formalize this problem as a\nsubmodular cover problem, and seek ef\ufb01cient algorithms for solving it in the face of massive data. The\ncelebrated result of Wolsey [6] shows that a greedy approach that selects elements sequentially in\norder to maximize the gain over the items selected so far yields a logarithmic factor approximation.\nIt is also known that improving upon this approximation ratio is hard under natural complexity\ntheoretic assumptions [7]. Even though such a greedy algorithm produces near-optimal solutions,\n\n1\n\n\fit is impractical for massive datasets, as sequential procedures that require centralized access to the\nfull data are highly constrained in terms of speed and memory.\nIn this paper, we develop the \ufb01rst distributed algorithm \u2013 DISCOVER \u2013 for solving the submodular\ncover problem. It can be easily implemented in MapReduce-style parallel computation models [8]\nand provides a solution that is competitive with the (impractical) centralized solution. We also study\na natural trade-off between the communication cost (for each round of MapReduce) and the number\nof rounds. The trade-off lets us choose between a small communication cost between machines\nat the expense of more rounds, or a larger communication cost with the bene\ufb01t of running\nfewer rounds. 
Our experimental results demonstrate the effectiveness of our approach on a variety\nof submodular cover instances: vertex cover, exemplar-based clustering, and active set selection in\nnon-parametric learning. We also implemented DISCOVER on Spark [9] and approximately solved\nvertex cover on a social graph containing more than 65 million nodes and 1.8 billion edges.\n\n2 Background and Related Work\n\nRecently, submodular optimization has attracted a lot of interest in machine learning and data min-\ning, where it has been applied to a variety of problems including viral marketing [10], information\ngathering [11], and active learning [12], to name a few. Like convexity in continuous optimization,\nsubmodularity allows many discrete problems to become ef\ufb01ciently approximable (e.g., constrained\nsubmodular maximization).\nIn the submodular cover problem, the main objective is to \ufb01nd the smallest subset of data points\nsuch that its utility reaches a desired fraction of that of the entire dataset. As stated earlier, the sequential,\ncentralized greedy method fails to appropriately scale. When faced with massive data, MapReduce\n[8] (and modern implementations like Spark [9]) offers arguably one of the most successful pro-\ngramming models for reliable parallel computing. Distributed solutions for some special cases of\nthe submodular cover problem have been recently proposed. In particular, for the set cover prob-\nlem (i.e., \ufb01nding the smallest subcollection of sets that covers all the data points), Berger et al. [13]\nprovided the \ufb01rst distributed solution with an approximation guarantee similar to that of the greedy\nprocedure. Blelloch et al. [14] improved their result in terms of the number of rounds required\nby a MapReduce-based implementation. Very recently, Stergiou et al. [15] introduced an ef\ufb01cient\ndistributed algorithm for set cover instances of massive size. 
Another variant of the set cover prob-\nlem that has received some attention is maximum k-cover (i.e., cover as many elements as possible\nfrom the ground set by choosing at most k subsets) for which Chierichetti et al. [16] introduced a\ndistributed solution with a (1 \u2212 1/e \u2212 \u0001) approximation guarantee.\nGoing beyond the special case of coverage functions, distributed constrained submodular maximiza-\ntion has also been the subject of recent research in the machine learning and data mining commu-\nnities. In particular, Mirzasoleiman et al. [17] provided a simple two-round distributed algorithm\ncalled GREEDI for submodular maximization under cardinality constraints. Contemporarily, Kumar\net al. [18] developed a multi-round algorithm for submodular maximization subject to cardinality and\nmatroid constraints. There have also been very recent efforts to either make use of randomization\nmethods or treat data in a streaming fashion [19, 20]. To the best of our knowledge, we are the \ufb01rst\nto address the general distributed submodular cover problem and propose an algorithm, DISCOVER,\nfor approximately solving it.\n\n3 The Distributed Submodular Cover Problem\n\nThe goal of data summarization is to select a small subset A out of a large dataset indexed by V\n(called the ground set) such that A achieves a certain quality. To this end, we \ufb01rst need to de\ufb01ne a\nutility function f : 2V \u2192 R+ that measures the quality of any subset A \u2286 V , i.e., f (A) quanti\ufb01es\nhow well A represents V according to some objective. In many data summarization applications, the\nutility function f satis\ufb01es submodularity, stating that the gain in utility of an element e in the context of\na summary A decreases as A grows. Formally, f is submodular if\n\nf (A \u222a {e}) \u2212 f (A) \u2265 f (B \u222a {e}) \u2212 f (B),\n\nfor any A \u2286 B \u2286 V and e \u2208 V \\ B. 
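To make the diminishing-returns condition concrete, here is a small self-contained sketch (our own toy example, not from the paper): a coverage-style utility over a made-up neighborhood structure, together with a numerical check of the inequality above.

```python
# Toy illustration (not from the paper): a coverage-style utility
# f(A) = |union of neighborhoods of A| is monotone submodular.
# The neighborhood structure below is made up for this example.
neighbors = {
    1: {1, 2, 3},
    2: {2, 4},
    3: {3, 4, 5},
    4: {4, 6},
}

def f(A):
    """Coverage utility: number of nodes covered by the set A."""
    covered = set()
    for e in A:
        covered |= neighbors[e]
    return len(covered)

def marginal(e, A):
    """Marginal gain of adding e to A: f(A + e) - f(A)."""
    return f(A | {e}) - f(A)

# Diminishing returns: for A a subset of B and e outside B,
# the gain of e w.r.t. A is at least its gain w.r.t. B.
A, B, e = {1}, {1, 3}, 4
assert A <= B and e not in B
assert marginal(e, A) >= marginal(e, B)
```

Here element 4 covers two new nodes in the context of the small summary {1}, but only one in the context of the larger summary {1, 3}, which is exactly the diminishing-returns behavior the definition captures.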
Note that the meaning of utility is application speci\ufb01c and\nsubmodular functions provide a wide range of possibilities to de\ufb01ne appropriate utility functions. In\n\n2\n\n\fSection 3.2 we discuss concrete instances of functions f that we consider in our experiments. Let us\ndenote the marginal utility of an element e w.r.t. a subset A as (cid:52)(e|A) = f (A \u222a {e}) \u2212 f (A). The\nutility function f is called monotone if (cid:52)(e|A) \u2265 0 for any e \u2208 V \\ A and A \u2286 V . Throughout this\npaper we assume that the utility function is monotone submodular.\nThe focus of this paper is on the submodular cover problem, i.e., \ufb01nding the smallest set Ac such\nthat it achieves a utility Q = (1 \u2212 \u0001)f (V ) for some 0 \u2264 \u0001 \u2264 1. More precisely,\n\nAc = arg minA\u2286V |A| such that f (A) \u2265 Q.\n\n(1)\nWe call Ac the optimum centralized solution with size k = |Ac|. Unfortunately, \ufb01nding Ac\nis NP-hard for many classes of submodular functions [7]. However, a simple greedy algo-\nrithm is known to be very effective. This greedy algorithm starts with the empty set A0, and at\neach iteration i, it chooses an element e \u2208 V that maximizes (cid:52)(e|Ai\u22121), i.e., Ai = Ai\u22121 \u222a\n{arg maxe\u2208V (cid:52)(e|Ai\u22121)}. Let us denote this (centralized) greedy solution by Ag. When f is\nintegral (i.e., f : 2V \u2192 N), it is known that the size of the solution returned by the greedy algorithm,\n|Ag|, is at most H(maxe f ({e}))|Ac|, where H(z) is the z-th harmonic number and is bounded by\nH(z) \u2264 1 + ln z [6]. Thus, we have |Ag| \u2264 (1 + ln(maxe f ({e})))|Ac|, and obtaining a better\nsolution is hard under natural complexity theoretic assumptions [7]. 
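The sequential greedy procedure just described is short enough to sketch directly; the coverage utility and ground set below are hypothetical stand-ins for a general monotone submodular f.

```python
# Sketch of the sequential greedy algorithm for submodular cover:
# repeatedly add the element of largest marginal gain until the
# target utility Q is reached. f below is a toy coverage utility
# (a stand-in for any monotone submodular function).
def greedy_cover(V, f, Q):
    A = set()
    while f(A) < Q and V - A:
        # element with the largest marginal gain f(A + e) - f(A)
        e = max(V - A, key=lambda e: f(A | {e}) - f(A))
        if f(A | {e}) == f(A):   # no element improves the solution
            break
        A.add(e)
    return A

# Hypothetical instance: cover all 6 points of a made-up ground set.
sets = {1: {1, 2, 3}, 2: {3, 4}, 3: {4, 5, 6}, 4: {2, 6}}
f = lambda A: len(set().union(*(sets[e] for e in A))) if A else 0
A = greedy_cover(set(sets), f, Q=6)
assert f(A) >= 6
```

On this instance the greedy loop needs only two picks to reach Q, but each pick scans the whole ground set, which is the sequential bottleneck the paper sets out to remove.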
As is standard practice, for\nour theoretical analysis to hold, we assume that f is an integral, monotone submodular function.\nScaling up: Distributed computation in MapReduce.\nIn many data summarization applications\nwhere the ground set V is large, the sequential greedy algorithm is impractical: either the data cannot\nbe stored on a single computer or the centralized solution is too expensive in terms of computation\ntime.\nInstead, we seek an algorithm for solving the submodular cover problem in a distributed\nmanner, preferably amenable to MapReduce implementations. In this model, at a high level, the\ndata is \ufb01rst distributed to m machines in a cluster, then each part is processed by the corresponding\nmachine (in parallel, without communication), and \ufb01nally the outputs are either merged or used\nfor the next round of MapReduce computation. While in principle multiple rounds of computation\ncan be realized, in practice, expensive synchronization is required after each round. Hence, we are\ninterested in distributed algorithms that require few rounds of computation.\n\n3.1 Naive Approaches Towards Distributed Submodular Cover\n\nOne way of solving the distributed submodular cover problem in multiple rounds is as follows. In\neach round, all machines \u2013 in parallel \u2013 compute the marginal gains for the data points assigned\nto them. Then, they communicate their best candidate to a central processor, which then identi\ufb01es\nthe globally best element, and sends it back to all the m machines. This element is then taken\ninto account when selecting the next element with highest marginal gain, and so on. 
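A single-process simulation of this candidate-exchange scheme might look as follows (the partitions and utility are made up for illustration; each loop iteration corresponds to one synchronization round).

```python
# Single-process simulation of the naive coordinated scheme: in each
# round every "machine" proposes its best local candidate, and a
# coordinator keeps the globally best one. One loop iteration = one
# synchronization round, so reaching Q takes |Ag| rounds in total.
def naive_distributed_greedy(partitions, f, Q):
    A = set()
    while f(A) < Q:
        # map step: each machine reports its best local candidate
        candidates = [max(Vi - A, key=lambda e: f(A | {e}) - f(A))
                      for Vi in partitions if Vi - A]
        if not candidates:
            break
        # reduce step: the coordinator picks the global best ...
        best = max(candidates, key=lambda e: f(A | {e}) - f(A))
        if f(A | {best}) == f(A):
            break
        A.add(best)   # ... and broadcasts it back to all machines
    return A

# Made-up toy instance with two machines.
sets = {1: {1, 2}, 2: {2, 3}, 3: {3, 4}, 4: {1, 4, 5}}
f = lambda A: len(set().union(*(sets[e] for e in A))) if A else 0
A = naive_distributed_greedy([{1, 2}, {3, 4}], f, Q=5)
assert f(A) >= 5
```

The sketch makes the cost structure visible: the work inside a round parallelizes, but the number of rounds equals the solution size, which is exactly the objection raised next.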
Unfortunately,\nthis approach requires synchronization after each round and we have exactly |Ag| many rounds.\nIn many applications, k and hence |Ag| is quite large, which renders this approach impractical for\nMapReduce style computations.\nAn alternative approach would be for each machine i to select greedily enough elements from its\npartition Vi until it reaches at least Q/m utility. Then, all machines merge their solution. This\napproach is much more communication ef\ufb01cient, and can be easily implemented, e.g., using a single\nMapReduce round. Unfortunately, many machines may select redundant elements, and the merged\nsolution may suffer from diminishing returns and never reach Q. Instead of aiming for Q/m, one\ncould aim for a larger fraction, but it is not clear how to select this target value.\nIn Section 4, we introduce our solution DISCOVER, which requires few rounds of communication,\nwhile at the same time yielding a solution competitive with the centralized one. Before that, let us\nbrie\ufb02y discuss the speci\ufb01c utility functions that we use in our experiments (described in Section 5).\n\n3.2 Example Applications of the Distributed Submodular Cover Problem\n\nIn this part, we brie\ufb02y discuss three concrete utility functions that have been extensively used in pre-\nvious work for \ufb01nding a diverse subset of data points and ultimately leading to good data summaries\n[1, 17, 21, 22, 23].\nTruncated Vertex Cover: Let G = (V, E) be a graph with the vertex set V and edge set E. Let\n\u0001(C) denote the neighbours of C \u2286 V in the graph G. One way to measure the in\ufb02uence of a set C\n\n3\n\n\fis to look at its cover f (C) = |\u0001(C)\u222a C|. 
It is easy to see that f is a monotone submodular function.\nThe truncated vertex cover problem is that of choosing a small subset of nodes C such that it covers\na desired fraction of |V | [21].\nActive Set Selection in Kernel Machines: In many applications such as feature selection [22],\ndeterminantal point processes [24], and GP regression [23], where the data is described in terms of a\nkernel matrix K, we want to select a small subset of elements while maintaining a certain diversity.\nVery often, the utility function boils down to f (S) = log det(I + \u03b1KS,S), where \u03b1 > 0 and KS,S is\nthe principal sub-matrix of K indexed by S. It is known that f is monotone submodular [5].\nExemplar-Based Clustering: Another natural application is to select a small number of exem-\nplars from the data representing the clusters present in it. A natural utility function (see [1] and\n[17]) is f (S) = L({e0}) \u2212 L(S \u222a {e0}), where L(S) = 1/|V | (cid:80) e\u2208V min\u03c5\u2208S d(e, \u03c5) is the k-medoid\nloss function and e0 is an appropriately chosen reference element. The utility function f is mono-\ntone submodular [1]. The goal of distributed submodular cover here is to select the smallest set of\nexemplars that satis\ufb01es a speci\ufb01ed bound on the loss.\n\n4 The DISCOVER Algorithm for Distributed Submodular Cover\n\nOn a high level, our main approach is to reduce the submodular cover problem to a sequence of cardinality\nconstrained submodular maximization problems1, a problem for which good distributed algorithms\n(e.g., GREEDI [17, 25, 26]) are known. 
Concretely, our reduction is based on a combination of the\nfollowing three ideas.\nTo get an intuition, we will \ufb01rst assume that we have access to an optimum algorithm which can\nsolve cardinality constrained submodular maximization exactly, i.e., solve, for some speci\ufb01ed (cid:96),\n\nAoc[(cid:96)] = arg max|S|\u2264(cid:96) f (S).\n\n(2)\n\nWe will then consider how to solve the problem when, instead of Aoc[(cid:96)], we only have access to an\napproximation algorithm for cardinality constrained maximization. Lastly, we will illustrate how we\ncan parametrize our algorithm to trade off the number of rounds of the distributed algorithm versus\ncommunication cost per round.\n\n4.1 Estimating the Size of the Optimal Solution\n\nMomentarily, assume that we have access to an optimum algorithm OPTCARD(V, (cid:96)) for computing\nAoc[(cid:96)] on the ground set V . Then one simple way to solve the submodular cover problem would\nbe to incrementally check for each (cid:96) \u2208 {1, 2, 3, . . .} if f (Aoc[(cid:96)]) \u2265 Q. But this is very inef\ufb01cient\nsince it will take k = |Ac| rounds of running the distributed algorithm for computing Aoc[(cid:96)]. A\nsimple \ufb01x that we will follow is to instead start with (cid:96) = 1 and double it until we \ufb01nd an (cid:96) such\nthat f (Aoc[(cid:96)]) \u2265 Q. This way we are guaranteed to \ufb01nd a solution of size at most 2k in at most\n(cid:100)log2(k)(cid:101) rounds of running Aoc[(cid:96)]. The pseudocode is given in Algorithm 1. However, in practice,\nwe cannot run Algorithm 1. In particular, there is no ef\ufb01cient way to identify the optimum subset\nAoc[(cid:96)] in set V , unless P=NP. 
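As a toy illustration of this doubling search, the sketch below substitutes a brute-force oracle for OPTCARD; it is feasible only on tiny, made-up instances, precisely because no efficient exact oracle exists.

```python
from itertools import combinations

# Toy sketch of the doubling search (Algorithm 1) with a brute-force
# stand-in for OPTCARD (exponential time: illustrative only).
def opt_card(V, f, l):
    """Exact oracle: a best subset of size at most l."""
    return max((set(S) for r in range(l + 1)
                for S in combinations(sorted(V), r)), key=f)

def approximate_cover(V, f, Q):
    l = 1
    A = opt_card(V, f, l)
    while f(A) < Q:
        l *= 2                 # doubling search over the target size l
        A = opt_card(V, f, l)
    return A                   # size <= 2k after at most ~log2(k) doublings

# Made-up instance: four overlapping sets covering {1, 2, 3, 4}.
sets = {1: {1, 2}, 2: {2, 3}, 3: {3, 4}, 4: {1, 4}}
f = lambda A: len(set().union(*(sets[e] for e in A))) if A else 0
A = approximate_cover(set(sets), f, Q=4)
assert f(A) >= 4 and len(A) <= 2
```

The loop structure is the point here, not the oracle: each doubling step at most doubles the overshoot beyond the optimum size k, giving the size-2k, ⌈log2 k⌉-round guarantee stated above.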
Hence, we need to rely on approximation algorithms.\n\n4.2 Handling Approximation Algorithms for Submodular Maximization\n\nAssume that there is a distributed algorithm DISCARD(V, m, (cid:96)) for cardinality constrained sub-\nmodular maximization, that runs on the dataset V with m machines and provides a set Agd[m, (cid:96)]\nwith a \u03bb-approximation guarantee relative to the optimal solution Aoc[(cid:96)], i.e., f (Agd[m, (cid:96)]) \u2265 \u03bbf (Aoc[(cid:96)]). Let\nus assume that we could run DISCARD with the unknown value (cid:96) = k. Then the solution we get\nsatis\ufb01es f (Agd[m, k]) \u2265 \u03bbQ. Thus, we are no longer guaranteed to reach Q. Now, what we can do\n(still under the assumption that we know k) is to repeatedly run DISCARD in order to augment our\nsolution set until we reach the desired value Q. Note that for each invocation of DISCARD, to \ufb01nd a\nset of size (cid:96) = k, we have to take into account the solutions A that we have accumulated so far. So,\n\n1Note that while the reduction from submodular coverage to submodular maximization has been used (e.g.,\n[27]), its straightforward application to the distributed setting incurs a large communication cost.\n\n4\n\n\fAlgorithm 1 Approximate Submodular Cover\nInput: Set V , constraint Q.\nOutput: Set A.\n1: (cid:96) = 1.\n2: Aoc[(cid:96)] = OPTCARD(V, (cid:96)).\n3: while f (Aoc[(cid:96)]) < Q do\n4: (cid:96) = (cid:96) \u00d7 2.\n5: Aoc[(cid:96)] = OPTCARD(V, (cid:96)).\n6: A = Aoc[(cid:96)].\n7: Return A.\n\nAlgorithm 2 Approximate OPTCARD\nInput: Set V , # of partitions m, constraint Q, (cid:96).\nOutput: Set Adc[m].\n1: r = 0, Agd[m, (cid:96)] = \u2205.\n2: while f (Agd[m, (cid:96)]) < Q do\n3: A = Agd[m, (cid:96)].\n4: r = r + 1.\n5: Agd[m, (cid:96)] = DISCARD(V, m, (cid:96), A).\n6: if f (Agd[m, (cid:96)]) \u2212 f (A) \u2265 \u03bb(Q \u2212 f (A)) then\n7: Adc[m] = {Agd[m, (cid:96)] \u222a A}.\n8: else\n9: break\n10: Return Adc[m].\n\nby overloading the notation, DISCARD(V, m, 
(cid:96), A) returns a set of size (cid:96) given that A has already\nbeen selected in previous rounds (i.e., DISCARD computes the marginal gains w.r.t. A). Note that at\nevery invocation \u2013 thanks to submodularity \u2013 DISCARD increases the value of the solution by at least\n\u03bb(Q \u2212 f (A)). Therefore, by running DISCARD at most (cid:100)log(Q)/\u03bb(cid:101) times we get Q.\nUnfortunately, we do not know the optimum value k. So, we can feed an estimate (cid:96) of the size of\nthe optimum solution k to DISCARD. Now, again thanks to submodularity, DISCARD can check\nwhether this (cid:96) is good enough or not: if the improvement in the value of the solution is not at least\n\u03bb(Q \u2212 f (A)) during the augmentation process, we can infer that (cid:96) is too small an estimate of k and\nwe cannot reach the desired value Q by using (cid:96) \u2013 so we apply the doubling strategy again.\n\nTheorem 4.1. Let DISCARD be a distributed algorithm for cardinality-constrained submodular\nmaximization with a \u03bb-approximation guarantee. Then, Algorithm 1 (where OPTCARD is replaced\nwith Approximate OPTCARD, Algorithm 2) runs in at most (cid:100)log(k) + log(Q)/\u03bb + 1(cid:101) rounds and\nproduces a solution of size at most (cid:100)2k + 2 log(Q)k/\u03bb(cid:101).\n\n4.3 Trading Off Communication Cost and Number of Rounds\nWhile Algorithm 1 successfully \ufb01nds a distributed solution Adc[m] with f (Adc[m]) \u2265 Q (cf. Theorem 4.1),\nthe intermediate problem instances (i.e., invocations of DISCARD) are required to select sets of size\nup to twice the size of the optimal solution k, and these solutions are communicated between all\nmachines. Oftentimes, k is quite large and we do not want to have such a large communication\ncost per round. Now, instead of \ufb01nding an (cid:96) \u2265 k, what we can do is to \ufb01nd a smaller (cid:96) \u2265 \u03b1k,\nfor 0 < \u03b1 \u2264 1, and augment these smaller sets in each round of Algorithm 2. 
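The log(Q)/λ round count used above follows from a short geometric-decay argument. Writing Ar for the solution accumulated after r successful invocations of DISCARD (notation introduced only for this sketch), and using the per-call improvement guarantee:

```latex
% Each successful call improves f by at least \lambda (Q - f(A)), hence
Q - f(A_r) \;\le\; (1-\lambda)\,\bigl(Q - f(A_{r-1})\bigr) \;\le\; (1-\lambda)^r\, Q .
% Since f is integral, the target is reached once Q - f(A_r) < 1; using
% 1-\lambda \le e^{-\lambda}, this holds as soon as r \ge \ln(Q)/\lambda,
% so on the order of \log(Q)/\lambda invocations of DISCARD suffice.
```

The same decay argument, with the improvement scaled by the trade-off parameter, underlies the round count discussed next.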
This way, the\ncommunication cost reduces to an \u03b1 fraction (per round), while the improvement in the value of\nthe solution is at least \u03b1\u03bb(Q \u2212 f (A)), where A is the solution accumulated so far. Consequently, we can trade off the communication\ncost per round against the total number of rounds. As a positive side effect, for \u03b1 < 1, since each\ninvocation of DISCARD returns smaller sets, the \ufb01nal solution set size can potentially get closer to\nthe optimum solution size k. For instance, for the extreme case of \u03b1 = 1/k we recover the solution\nof the sequential greedy algorithm (up to O(1/\u03bb)). We see this effect in our experimental results.\n\n4.4 DISCOVER\n\nThe DISCOVER algorithm is shown in Algorithm 3. The algorithm proceeds in rounds, with commu-\nnication between machines taking place only between successive rounds. In particular, DISCOVER\ntakes the ground set V , the number of partitions m, and the trade-off parameter \u03b1. It starts with\n(cid:96) = 1, and Adc[m] = \u2205. It then augments the set Adc[m] with a set Agd[m, (cid:96)] of at most (cid:96) new elements\nusing an arbitrary distributed algorithm for submodular maximization under a cardinality constraint,\nDISCARD. If the gain from adding Agd[m, (cid:96)] to Adc[m] is at least \u03b1\u03bb(Q \u2212 f (Adc[m])), then we\nadd it to Adc[m] and continue augmenting with another set of at most (cid:96) elements. Otherwise, we double (cid:96) and\nrestart the process with 2(cid:96). We repeat this process until we reach Q.\n\nTheorem 4.2. Let DISCARD be a distributed algorithm for cardinality-constrained submodular\nmaximization with a \u03bb-approximation guarantee. 
Then, DISCOVER runs in at most (cid:100)log(\u03b1k) + log(Q)/(\u03bb\u03b1) + 1(cid:101) rounds and produces a solution of size (cid:100)2\u03b1k + 2 log(Q)k/\u03bb(cid:101).\n\n5\n\n\fAlgorithm 3 DISCOVER\nInput: Set V , # of partitions m, constraint Q, trade-off parameter \u03b1.\nOutput: Set Adc[m].\n1: Adc[m] = \u2205, r = 0, (cid:96) = 1.\n2: while f (Adc[m]) < Q do\n3: r = r + 1.\n4: Agd[m, (cid:96)] = DISCARD(V, m, (cid:96), Adc[m]).\n5: if f (Adc[m] \u222a Agd[m, (cid:96)]) \u2212 f (Adc[m]) \u2265 \u03b1\u03bb(Q \u2212 f (Adc[m])) then\n6: Adc[m] = {Adc[m] \u222a Agd[m, (cid:96)]}.\n7: else\n8: (cid:96) = (cid:96) \u00d7 2.\n9: Return Adc[m].\n\nGREEDI as Subroutine: So far, we have assumed that a distributed algorithm DISCARD that\nruns on m machines is given to us as a black box, which can be used to \ufb01nd sets of cardinality\n(cid:96) and obtain a \u03bb-factor approximation to the optimal solution. More concretely, we can use GREEDI, a recently\nproposed distributed algorithm for maximizing submodular functions under a cardinality constraint\n[17] (outlined in Algorithm 4). It \ufb01rst distributes the ground set V to m machines. Then each\nmachine i separately runs the standard greedy algorithm to produce a set Agc i [(cid:96)] of size (cid:96). Finally, the\nsolutions are merged, and another round of greedy selection is performed (over the merged results)\nin order to return the solution Agd[m, (cid:96)] of size (cid:96). It was proven that GREEDI provides a (1 \u2212 e\u22121)2/ min(m, (cid:96))-approximation to the optimal solution [17]. Here, we prove a (tight) improved\nbound on the performance of GREEDI. More formally, we have the following theorem.\n\nTheorem 4.3. Let f be a monotone submodular function and let (cid:96) > 0. 
Then, GREEDI produces a solution Agd[m, (cid:96)] where f (Agd[m, (cid:96)]) \u2265 f (Aoc[(cid:96)])/(36(cid:112)min(m, (cid:96))).\n\nAlgorithm 4 Greedy Distributed Submodular Maximization (GREEDI)\nInput: Set V , # of partitions m, constraint (cid:96).\nOutput: Set Agd[m, (cid:96)].\n1: Partition V into m sets V1, V2, . . . , Vm.\n2: Run the standard greedy algorithm on each set Vi to \ufb01nd a solution Agc i [(cid:96)].\n3: Merge the resulting sets: B = \u222am i=1 Agc i [(cid:96)].\n4: Run the standard greedy algorithm on B until (cid:96) elements are selected. Return Agd[m, (cid:96)].\n\nWe illustrate the resulting algorithm DISCOVER using GREEDI as a subroutine in Figure 1. By combining Theorems 4.2 and 4.3, we obtain the following.\nCorollary 4.4. By using GREEDI, we get that DISCOVER produces a solution of size (cid:100)2\u03b1k + 72 log(Q)k(cid:112)min(m, \u03b1k)(cid:101) and runs in at most (cid:100)log(\u03b1k) + 36(cid:112)min(m, \u03b1k) log(Q)/\u03b1 + 1(cid:101) rounds.\n\nNote that for a constant number of machines m, \u03b1 = 1 and a large solution size \u03b1k \u2265 m, the above\nresult simply implies that in at most O(log(kQ)) rounds, DISCOVER produces a solution of size\nO(k log Q). In contrast, the greedy solution requires O(k log Q) rounds (which is much larger than\nO(log(kQ))) to produce a solution of the same quality.\nVery recently, a (1 \u2212 e\u22121)/2-approximation guarantee was proven for the randomized version of\nGREEDI [26, 25]. This suggests that, if it is possible to reshuf\ufb02e (i.e., randomly re-distribute V\namong the m machines) the ground set each time that we invoke GREEDI, we can bene\ufb01t from\nthese stronger approximation guarantees (which are independent of m and k). Note that Theorem 4.2\ndoes not directly apply here, since it requires a deterministic subroutine for constrained submodular\nmaximization. 
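To make Algorithms 3 and 4 concrete, here is a compact single-process simulation of GREEDI used as the DISCARD subroutine inside DISCOVER's doubling loop. The utility, the partitioning, and the constant λ below are illustrative stand-ins (the experiments later use λ = 1/√min(m, k)); this is a sketch, not the authors' implementation.

```python
import random
random.seed(0)  # fixed seed so the toy run is reproducible

def greedy(V, f, l, base):
    """Standard greedy: up to l elements maximizing marginal gain over base."""
    S = set()
    for _ in range(l):
        e = max(V - S, key=lambda e: f(base | S | {e}) - f(base | S),
                default=None)
        if e is None or f(base | S | {e}) == f(base | S):
            break
        S.add(e)
    return S

def greedi(V, m, l, base, f):
    """GREEDI (Algorithm 4): greedy on each partition, merge, greedy again."""
    items = sorted(V)
    random.shuffle(items)
    parts = [set(items[i::m]) for i in range(m)]            # partition V
    B = set().union(*(greedy(Vi, f, l, base) for Vi in parts))
    return greedy(B, f, l, base)                            # second greedy pass

def discover(V, m, f, Q, alpha=1.0, lam=0.5):
    """DISCOVER (Algorithm 3): augment via GREEDI, doubling l on stalls."""
    A, l = set(), 1
    while f(A) < Q:
        S = greedi(V - A, m, l, A, f)
        if f(A | S) - f(A) >= alpha * lam * (Q - f(A)):
            A |= S             # sufficient progress: keep the augmentation
        else:
            l *= 2             # too little progress: double l and retry
    return A

# Made-up covering instance: 10 overlapping windows over {0, ..., 11}.
sets = {i: {i, i + 1, i + 2} for i in range(10)}
f = lambda A: len(set().union(*(sets[e] for e in A))) if A else 0
A = discover(set(sets), m=3, f=f, Q=10)
assert f(A) >= 10
```

Note the design point the sketch exposes: communication happens only where `greedi` merges the per-partition solutions and where the accumulated set A is passed back in, so per-round traffic scales with l rather than with the full ground set.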
We defer the analysis to a longer version of this paper.\nAs a \ufb01nal technical remark, for our theoretical results to hold we have assumed that the utility\nfunction f is integral. In some applications (like active set selection) this assumption may not hold.\nIn these cases, we can either appropriately discretize and rescale the function, or, instead of achieving\nthe utility Q, try to reach (1 \u2212 \u0001)Q, for some 0 < \u0001 < 1. In the latter case, we can simply replace Q\nwith Q/\u0001 in Theorem 4.2.\n\n6\n\n\fFigure 1: Illustration of our multi-round algorithm DISCOVER, assuming it terminates in two rounds\n(without doubling search for (cid:96)).\n\n5 Experiments\nIn our experiments we wish to address the following questions: 1) How well does DISCOVER\nperform compared to the centralized greedy solution; 2) How is the trade-off between the solution\nsize and the number of rounds affected by parameter \u03b1; and 3) How well does DISCOVER scale to\nmassive data sets. To this end, we run DISCOVER on three scenarios: exemplar based clustering,\nactive set selection in GPs, and the vertex cover problem. For vertex cover, we report experiments on a\nlarge social graph with more than 65.6 million vertices and 1.8 billion edges. Since the constant in\nTheorem 4.3 is not optimized, we used \u03bb = 1/(cid:112)min(m, k) in all the experiments.\nExemplar based Clustering. Our exemplar based clustering experiments involve DISCOVER ap-\nplied to the clustering utility f (S) described in Section 3.2 with d(x, x(cid:48)) = (cid:107)x \u2212 x(cid:48)(cid:107)2. We perform\nour experiments on a set of 10,000 Tiny Images [28]. Each 32 by 32 RGB pixel image is represented\nas a 3,072-dimensional vector. We subtract from each vector the mean value, then normalize it to\nhave unit norm. We use the origin as the auxiliary exemplar for this experiment. Fig. 
2a compares\nthe performance of our approach to the centralized benchmark with the number of machines set to\nm = 10 and varying coverage percentage Q = (1 \u2212 \u0001)f (V ). Here, we have \u03b2 = (1 \u2212 \u0001). It can\nbe seen that DISCOVER provides a solution which is very close to the centralized solution, with\na number of rounds much smaller than the solution size. Varying \u03b1 results in a tradeoff between\nsolution size and number of rounds.\nActive Set Selection. Our active set selection experiments involve DISCOVER applied to the\nlog-determinant function f (S) described in Section 3.2, using an exponential kernel K(ei, ej) =\nexp(\u2212|ei \u2212 ej|2/0.75). We use the Parkinsons Telemonitoring dataset [29], comprising 5,875\nbiomedical voice measurements with 22 attributes from people in early-stage Parkinson\u2019s disease.\nFig. 2b compares the performance of our approach to the benchmark with the number of machines\nset to m = 6 and varying coverage percentage Q = (1 \u2212 \u0001)f (V ). Again, DISCOVER performs close\nto the centralized greedy solution, even with very few rounds. Again, we see a tradeoff by varying \u03b1.\nLarge Scale Vertex Cover with Spark. As our large scale experiment, we applied DISCOVER to\nthe Friendster network, which consists of 65,608,366 nodes and 1,806,067,135 edges [30]. The average out-\ndegree is 55.056 while the maximum out-degree is 5,214. The disk footprint of the graph is 30.7GB,\nstored in 246 part \ufb01les on HDFS. Our experimental infrastructure was a cluster of 8 quad-core\nmachines with 32GB of memory each, running Spark. We set the number of reducers to m = 64.\nEach machine carried out a set of map/reduce tasks in sequence, where each map/reduce stage\ncorresponds to running GREEDI with a speci\ufb01c value of (cid:96) on the whole data set. We \ufb01rst distributed\nthe data uniformly at random to the machines, where each machine received \u22481,025,130 vertices\n(\u224812.5GB RAM). 
Then we start with (cid:96) = 1 and perform a map/reduce task to extract one element. We\nthen communicate the results back to each machine and, based on the improvement in the value of\nthe solution, we perform another round of map/reduce computation with either the same value for\n(cid:96) or 2 \u00d7 (cid:96). We continue performing map/reduce tasks until we reach the desired value Q.\nWe examine the performance of DISCOVER by obtaining covers for 50%, 30%, 20% and 10% of\nthe whole graph. The total running time of the algorithm for the above coverage percentages with\n\u03b1 = 1 was about 5.5, 1.5, 0.6 and 0.1 hours respectively. For comparison, we ran the centralized\n\n7\n\n\f(a) Images 10K\n\n(b) Parkinsons Telemonitoring\n\n(c) Friendster\n\nFigure 2: Performance of DISCOVER compared to the centralized solution. a, b) show the solution\nset size vs. the number of rounds for various \u03b1, for a set of 10,000 Tiny Images and Parkinsons\nTelemonitoring. c) shows the same quantities for the Friendster network with 65,608,366 vertices.\n\ngreedy on a computer of 24 cores and 256GB memory. Note that loading the entire data set into\nmemory requires 200GB of RAM, and running the centralized greedy algorithm for 50% cover\nrequires at least another 15GB of RAM. This highlights the challenges in applying the centralized\ngreedy algorithm to larger scale data sets. Fig. 2c shows the solution set size versus the number of\nrounds for various \u03b1 and different coverage constraints. We \ufb01nd that by decreasing \u03b1, DISCOVER\u2019s\nsolutions quickly converge (in size) to those obtained by the centralized solution.\n\n6 Conclusion\n\nWe have developed the \ufb01rst ef\ufb01cient distributed algorithm \u2013 DISCOVER \u2013 for the submodular cover\nproblem. 
We have theoretically analyzed its performance and showed that it can perform arbitrarily close to the centralized greedy solution, which is impractical in the context of large data sets. We also demonstrated the effectiveness of our approach through extensive experiments, including vertex cover on a graph with 65.6 million vertices using Spark. We believe our results provide an important step towards solving submodular optimization problems in very large-scale, real applications.

Acknowledgments. This research was supported by ERC StG 307036, a Microsoft Faculty Fellowship and an ETH Fellowship.

References

[1] Ryan Gomes and Andreas Krause. Budgeted nonparametric learning from data streams. In ICML, 2010.

[2] Sebastian Tschiatschek, Rishabh Iyer, Haochen Wei, and Jeff Bilmes. Learning mixtures of submodular functions for image collection summarization. In NIPS, 2014.

[3] Khalid El-Arini, Gaurav Veda, Dafna Shahaf, and Carlos Guestrin. Turning down the noise in the blogosphere.
In KDD, 2009.

[4] Khalid El-Arini and Carlos Guestrin. Beyond keyword search: Discovering relevant scientific literature. In KDD, 2011.

[5] Andreas Krause and Daniel Golovin. Submodular function maximization. In Tractability: Practical Approaches to Hard Problems. Cambridge University Press, 2013.

[6] Laurence A. Wolsey. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 1982.

[7] Uriel Feige. A threshold of ln n for approximating set cover. Journal of the ACM, 1998.

[8] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.

[9] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: cluster computing with working sets, pages 181–213. Springer, 2010.

[10] David Kempe, Jon Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network. In Proceedings of the ninth ACM SIGKDD, 2003.

[11] Andreas Krause and Carlos Guestrin. Intelligent information gathering and submodular function optimization. Tutorial at the International Joint Conference on Artificial Intelligence, 2009.

[12] Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research, 2011.

[13] Bonnie Berger, John Rompel, and Peter W. Shor. Efficient NC algorithms for set cover with applications to learning and geometry. Journal of Computer and System Sciences, 1994.

[14] Guy E. Blelloch, Richard Peng, and Kanat Tangwongsan. Linear-work greedy parallel approximate set cover and variants. In SPAA, 2011.

[15] Stergios Stergiou and Kostas Tsioutsiouliklis. Set cover at web scale. In KDD, 2015.

[16] Flavio Chierichetti, Ravi Kumar, and Andrew Tomkins. Max-cover in map-reduce.
In WWW, 2010.

[17] Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause. Distributed submodular maximization: Identifying representative elements in massive data. In NIPS, 2013.

[18] Ravi Kumar, Benjamin Moseley, Sergei Vassilvitskii, and Andrea Vattani. Fast greedy algorithms in MapReduce and streaming. In SPAA, 2013.

[19] Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, Amin Karbasi, Jan Vondrák, and Andreas Krause. Lazier than lazy greedy. In AAAI, 2015.

[20] Ashwinkumar Badanidiyuru, Baharan Mirzasoleiman, Amin Karbasi, and Andreas Krause. Streaming submodular maximization: Massive data summarization on the fly. In KDD, 2014.

[21] Silvio Lattanzi, Benjamin Moseley, Siddharth Suri, and Sergei Vassilvitskii. Filtering: a method for solving graph problems in MapReduce. In SPAA, 2011.

[22] Roberto Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5(4):537–550, 1994.

[23] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). 2006.

[24] Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning. Mach. Learn., 2012.

[25] Rafael Barbosa, Alina Ene, Huy L. Nguyen, and Justin Ward. The power of randomization: Distributed submodular maximization on massive datasets. arXiv, 2015.

[26] Vahab Mirrokni and Morteza Zadimoghaddam. Randomized composable core-sets for distributed submodular maximization. In STOC, 2015.

[27] Rishabh K. Iyer and Jeff A. Bilmes. Submodular optimization with submodular cover and submodular knapsack constraints. In NIPS, pages 2436–2444, 2013.

[28] Antonio Torralba, Rob Fergus, and William T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition.
TPAMI, 2008.

[29] Athanasios Tsanas, Max Little, Patrick McSharry, and Lorraine Ramig. Enhanced classical dysphonia measures and sparse regression for telemonitoring of Parkinson's disease progression. In ICASSP, 2010.

[30] Jaewon Yang and Jure Leskovec. Defining and evaluating network communities based on ground-truth. Knowledge and Information Systems, 42(1):181–213, 2015.