{"title": "An Efficient Streaming Algorithm for the Submodular Cover Problem", "book": "Advances in Neural Information Processing Systems", "page_first": 4493, "page_last": 4501, "abstract": "We initiate the study of the classical Submodular Cover (SC) problem in the data streaming model which we refer to as the Streaming Submodular Cover (SSC). We show that any single pass streaming algorithm using sublinear memory in the size of the stream will fail to provide any non-trivial approximation guarantees for SSC. Hence, we consider a relaxed version of SSC, where we only seek to find a partial cover. We design the first Efficient bicriteria Submodular Cover Streaming (ESC-Streaming) algorithm for this problem, and provide theoretical guarantees for its performance supported by numerical evidence. Our algorithm finds solutions that are competitive with the near-optimal offline greedy algorithm despite requiring only a single pass over the data stream. In our numerical experiments, we evaluate the performance of ESC-Streaming on active set selection and large-scale graph cover problems.", "full_text": "An Ef\ufb01cient Streaming Algorithm\nfor the Submodular Cover Problem\n\nAshkan Norouzi-Fard \u21e4\n\nashkan.norouzifard@epfl.ch\n\nAbbas Bazzi \u21e4\n\nabbas.bazzi@epfl.ch\n\nMarwa El Halabi \u2020\n\nmarwa.elhalabi@epfl.ch\n\nIlija Bogunovic \u2020\n\nilija.bogunovic@epfl.ch\n\nYa-Ping Hsieh \u2020\n\nya-ping.hsieh@epfl.ch\n\nVolkan Cevher \u2020\n\nvolkan.cevher@epfl.ch\n\nAbstract\n\nWe initiate the study of the classical Submodular Cover (SC) problem in the data\nstreaming model which we refer to as the Streaming Submodular Cover (SSC). We\nshow that any single pass streaming algorithm using sublinear memory in the size\nof the stream will fail to provide any non-trivial approximation guarantees for SSC.\nHence, we consider a relaxed version of SSC, where we only seek to \ufb01nd a partial\ncover. 
We design the first Efficient bicriteria Submodular Cover Streaming (ESC-Streaming) algorithm for this problem, and provide theoretical guarantees for its performance supported by numerical evidence. Our algorithm finds solutions that are competitive with the near-optimal offline greedy algorithm despite requiring only a single pass over the data stream. In our numerical experiments, we evaluate the performance of ESC-Streaming on active set selection and large-scale graph cover problems.

1 Introduction

We consider the Streaming Submodular Cover (SSC) problem, where we seek to find the smallest subset that achieves a certain utility, as measured by a monotone submodular function. The data is assumed to arrive in an arbitrary order, and the goal is to minimize the number of passes over the whole dataset while using as little memory as possible.

The motivation behind studying SSC is that many real-world applications can be modeled as cover problems, where we need to select a small subset of data points that maximizes a particular utility criterion. Often, the quality criterion can be captured by a utility function that satisfies submodularity [27, 16, 15], an intuitive notion of diminishing returns.

Despite the fact that the standard Submodular Cover (SC) problem is extensively studied and very well understood, all algorithms proposed in the literature rely heavily on having access to the whole ground set during their execution. In many real-world applications, however, this assumption does not hold. For instance, when the dataset is generated on the fly or is too large to fit in memory, access to the whole ground set may not be feasible. Similarly, depending on the application, we may face restrictions on how we can access the data: random access to the data may simply not be possible, or we might be restricted to accessing only a small fraction of it.
In all such scenarios, the optimization needs to be done on the fly.

The SC problem was first considered by Wolsey [28], who showed that a simple greedy algorithm yields a logarithmic-factor approximation. This algorithm performs well in practice and usually returns solutions that are near-optimal. Moreover, improving on its theoretical approximation guarantee is not possible under some natural complexity-theoretic assumptions [12, 10]. However, such an offline greedy approach is impractical for SSC, since it requires an infeasible number of passes over the stream.

*Theory of Computation Laboratory 2 (THL2), EPFL. These authors contributed equally to this work.
†Laboratory for Information and Inference Systems (LIONS), EPFL

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

1.1 Our Contribution

In this work, we rigorously show that achieving any non-trivial approximation of SSC with a single pass over the data stream, while using a reasonable amount of memory, is not possible. More generally, we establish an unconditional lower bound on the trade-off between the memory and the approximation ratio of any p-pass streaming algorithm solving the SSC problem.

Hence, we consider instead a relaxed version of SSC, where we only seek to achieve a fraction (1 − ε) of the specified utility. We develop the first Efficient bicriteria Submodular Cover Streaming (ESC-Streaming) algorithm. ESC-Streaming is simple, easy to implement, and memory- as well as time-efficient.
It returns solutions that are competitive with the near-optimal offline greedy algorithm. It requires only a single pass over the data in arbitrary order and, for any ε > 0, provides a 2/ε-approximation to the optimal solution size while achieving a (1 − ε) fraction of the specified utility.

In our experiments, we test the performance of ESC-Streaming on active set selection in materials science and on graph cover problems. In the latter, we consider a graph dataset that consists of more than 787 million nodes and 47.6 billion edges.

1.2 Related work

Submodular optimization has attracted a lot of interest in machine learning, data mining, and theoretical computer science. Faced with streaming and massive data, the traditional (offline) greedy approaches fail. One popular approach to the challenge of the data deluge is to adopt a streaming or distributed perspective. Several submodular optimization problems have been studied under these two settings [25, 11, 9, 2, 20, 8, 18, 7, 14, 1]. In the streaming setting, the goal is to find nearly optimal solutions with a minimal number of passes over the data stream, memory requirement, and computational cost (measured in terms of oracle queries).

A streaming problem related to SSC was investigated by Badanidiyuru et al. [2], who studied the streaming Submodular Maximization (SM) problem subject to a cardinality constraint. In their setting, given a budget k, the goal is to pick at most k elements that achieve the largest possible utility, whereas for the SC problem, given a utility Q, the goal is to find the minimum number of elements that can achieve it. In the offline setting of cardinality-constrained SM, the greedy algorithm returns a solution within a (1 − 1/e) factor of the optimal value [21], which is known to be the best guarantee that one can obtain efficiently [22]. In the streaming setting, Badanidiyuru et al.
[2] designed an elegant single-pass (1/2 − ε)-approximation algorithm that requires only O((k log k)/ε) memory. More general constraints for SM have also been studied in the streaming setting, e.g., in [8].

Moreover, the Streaming Set Cover problem, a special case of the SSC problem, has been extensively studied [25, 11, 9, 7, 14, 1]. In this special case, the elements in the data stream are m subsets of a universe X of size n, and the goal is to find the minimum number of sets k* that can cover all the elements of the universe X. The study of the Streaming Set Cover problem has mainly focused on the semi-streaming model, where the memory is restricted to Õ(n).³ This regime was first investigated by Saha and Getoor [25], who designed an O(log n)-pass, O(log n)-approximation algorithm that uses Õ(n) space. Emek and Rosén [11] showed that if one restricts the streaming algorithm to perform only one pass over the data stream, then the best possible approximation guarantee is O(√n). This lower bound holds even for randomized algorithms. They also designed a deterministic greedy algorithm that matches this approximation guarantee. By relaxing the single-pass constraint, Chakrabarti and Wirth [7] designed a p-pass semi-streaming (p + 1)·n^(1/(p+1))-approximation algorithm, and proved that this is essentially tight up to a factor of (p + 1)³.

Partial streaming submodular optimization. The Streaming Set Cover problem has also been studied from a bicriteria perspective, where one settles for solutions that only cover a (1 − ε)-fraction of the universe. Building on the work of [11], the authors in [7] designed a p-pass semi-streaming algorithm that achieves a (1 − ε, ρ(n, ε))-approximation, where ρ(n, ε) = min{8p·ε^(−1/p), (8p + 1)·n^(1/(p+1))}.

³The Õ notation is used to hide poly-log factors, i.e., Õ(n) := O(n · poly{log n, log m}).
They also provided a lower bound that matches their approximation ratio up to a factor of Θ(p³).

Distributed submodular optimization. Mirzasoleiman et al. [20] considered the SC problem in the distributed setting, where they designed an efficient algorithm whose solution is close to that of the offline greedy algorithm. Moreover, they studied the trade-off between the communication cost and the number of rounds needed to obtain such a solution.

To the best of our knowledge, no other works have studied the general SSC problem. We propose the first efficient algorithm, ESC-Streaming, that approximately solves this problem with tight guarantees.

2 Problem Statement

Preliminaries. We assume that we are given a utility function f : 2^V → R₊ that measures the quality of a given subset S ⊆ V, where V = {e₁, …, e_m} is the ground set. The marginal gain of an element e ∈ V with respect to a set S ⊆ V is defined as

Δ(e|S) := f(S ∪ {e}) − f(S).

In this work, we focus on normalized, monotone, submodular utility functions f, where f is said to be:

1. submodular if for all S, T such that S ⊆ T, and for all e ∈ V \ T, Δ(e|S) ≥ Δ(e|T);
2. monotone if for all S, T such that S ⊆ T ⊆ V, we have f(S) ≤ f(T);
3. normalized if f(∅) = 0.

In the standard Submodular Cover (SC) problem, the goal is to find the smallest subset S ⊆ V that satisfies a certain utility Q, i.e.,

(SC)    min_{S ⊆ V} |S|   s.t.   f(S) ≥ Q.

Hardness results. The SC problem is known to be NP-Hard. A simple greedy strategy [28] that in each round selects the element with the highest marginal gain until Q is reached returns a solution of size at most H(max_e f({e})) · k*, where k* is the size of the optimal solution set S*.⁴ Moreover, Feige [12] proved that this is the best possible approximation guarantee unless NP ⊆ DTIME(n^O(log log n)).
Feige's bound was recently strengthened to an NP-hardness result by Dinur and Steurer [10].

⁴Here, H(x) is the x-th harmonic number and is bounded by H(x) ≤ 1 + ln x.

Streaming Submodular Cover (SSC). In the streaming setting, the main challenge is to solve the SC problem while maintaining a small memory footprint and without performing a large number of passes over the data stream. We use m to denote the size of the data stream. Our first result states that any single-pass streaming algorithm with an approximation ratio better than m/2 must use at least Ω(m) memory. Hence, for large datasets, if we restrict ourselves to a single-pass streaming algorithm with sublinear memory o(m), we cannot obtain any non-trivial approximation of the SSC problem (cf. Theorem 2 in Section 4). To obtain non-trivial and feasible guarantees, we need to relax the coverage constraint in SC. Thus, we instead solve the Streaming Bicriteria Submodular Cover (SBSC) problem, defined as follows:

Definition 1. Given ε ∈ (0, 1) and δ ≥ 1, an algorithm is said to be a (1 − ε, δ)-bicriteria approximation algorithm for the SBSC problem if, for any Submodular Cover instance with utility Q and optimal set size k*, the algorithm returns a solution S such that

f(S) ≥ (1 − ε)Q   and   |S| ≤ δ · k*.    (1)

3 An efficient streaming submodular cover algorithm

ESC-Streaming algorithm. The first phase of our algorithm is described in Algorithm 1. The algorithm receives as input a parameter M representing the size of the allowed memory; the discussion of the role of this parameter is postponed to Section 4. The algorithm keeps t + 1 = log(M/2) + 1 representative sets. Each representative set S_j (j = 0, …, t) has size at most 2^j, and has a corresponding threshold value Q/2^j.
Once a new element e arrives in the stream, it is added to every representative set that is not yet fully populated and for which the element's marginal gain is above the corresponding threshold, i.e., Δ(e|S_j) ≥ Q/2^j. This phase of the algorithm requires only one pass over the data stream, and its running time is O(log M) per element of the stream, since the per-element computational cost is O(log M) oracle calls.

In the second phase (i.e., Algorithm 2), given a feasible ε̃, the algorithm finds the smallest set S_i among the stored sets such that f(S_i) ≥ (1 − ε̃)Q. For any query, the running time of the second phase is O(log log M). Note that after the single pass over the stream, there is no limitation on the number of queries we can answer, i.e., we do not need another pass over the stream. Moreover, this phase does not require any oracle calls, and its total memory usage is at most M.

Algorithm 1 ESC-Streaming Algorithm - Picking the representative sets
Set t = log(M/2).
1: S_0 = S_1 = … = S_t = ∅
2: for i = 1, …, m do
3:   Let e be the next element in the stream
4:   for j = 0, …, t do
5:     if Δ(e|S_j) ≥ Q/2^j and |S_j| ≤ 2^j then
6:       S_j ← S_j ∪ {e}
7:     end if
8:   end for
9: end for

Algorithm 2 ESC-Streaming Algorithm - Responding to the queries
Given a value ε̃, perform the following steps:
1: Run a binary search over S_0, …, S_t
2: Return the smallest set S_i such that f(S_i) ≥ (1 − ε̃)Q
3: If no such set exists, return "Assumption Violated"

In the following section, we analyze ESC-Streaming and prove that it is a (1 − ε̃, 2/ε̃)-bicriteria approximation algorithm for SSC. Formally, we prove the following:

Theorem 1.
For any given instance of the SSC problem and any values M, ε̃ such that k*/ε̃ ≤ M, where k* is the size of the optimal solution to SSC, the ESC-Streaming algorithm returns a (1 − ε̃, 2/ε̃)-approximate solution.

4 Theoretical Bounds

Lower bound. We start by establishing a lower bound on the trade-off between the memory requirement and the approximation ratio of any p-pass streaming algorithm solving the SSC problem.

Theorem 2. For any number of passes p and any stream size m, a p-pass streaming algorithm that, with probability at least 2/3, approximates the submodular cover problem to a factor smaller than m^(1/p)/(p + 1), must use a memory of size at least Ω(m^(1/p)/(p(p + 1)²)).

The proof of this theorem can be found in the supplementary material.

Note that for p = 1, Theorem 2 states that any one-pass streaming algorithm with an approximation ratio better than m/2 requires at least Ω(m) memory. Hence, for large datasets, Theorem 2 rules out any approximation of the streaming submodular cover problem if we restrict ourselves to a one-pass streaming algorithm with sublinear memory o(m). This result motivates the study of the Streaming Bicriteria Submodular Cover (SBSC) problem of Definition 1.

Main result and discussion. Refining the analysis of the greedy algorithm for SC [28] so that we stop once we have achieved a utility of (1 − ε)Q shows that the number of elements picked is at most k* ln(1/ε). This yields a tight (1 − ε, ln(1/ε))-bicriteria approximation algorithm for the Bicriteria Submodular Cover (BSC) problem. One can turn this bicriteria algorithm into a (1 − ε, ln(1/ε))-bicriteria algorithm for SBSC, at the cost of performing k* ln(1/ε) passes over the data stream, which may be infeasible for some applications.
Moreover, this multi-pass approach requires m·k*·ln(1/ε) oracle calls, which may be infeasible for large datasets.

To circumvent these issues, it is natural to parametrize our algorithm by a user-defined memory budget M that the streaming algorithm is allowed to use. Assuming, for some 0 < ε ≤ e⁻¹, that the (1 − ε, ln(1/ε))-bicriteria solution given by the offline greedy algorithm for the BSC variant of the problem fits in a memory of size M/2, our algorithm (ESC-Streaming) is guaranteed to return a (1 − 1/ln(1/ε), 2 ln(1/ε))-bicriteria solution for the SBSC problem, while using at most M memory. Hence, in only one pass over the data stream, ESC-Streaming returns solutions guaranteed to cover, for small values of ε, almost the same fraction of the utility as the greedy solution, losing only a factor of two in the worst-case solution size. Moreover, the number of oracle calls needed by ESC-Streaming is only m log M, which for M = 2k* ln(1/ε) is bounded by

m log M = m log(2k* ln(1/ε))   [oracle calls by ESC-Streaming]   ≪   m·k*·ln(1/ε)   [oracle calls by greedy],

which is more than a factor of k*/log(k*) smaller than for the greedy algorithm. This enables ESC-Streaming to run much faster than the offline greedy algorithm. Another feature of ESC-Streaming is that it performs a single pass over the data stream, after which we can query a (1 − 1/ln(1/ε'), 2 ln(1/ε'))-bicriteria solution for any ε ≤ ε' ≤ e⁻¹ without any additional oracle calls. Whenever the above inequality does not hold, ESC-Streaming returns "Assumption Violated". More precisely, we state the following theorem, whose proof can be found in the supplementary material.

Theorem 3.
For any given instance of the SSC problem and any values M, ε such that 2k* ln(1/ε) ≤ M, where k* is the optimal solution size, the ESC-Streaming algorithm returns a (1 − 1/ln(1/ε), 2 ln(1/ε))-approximate solution.

Remarks. Note that in Algorithm 1 we can replace the constant 2 by another constant 1 < α ≤ 2. The representative set sizes change accordingly to α^j, and t = log_α(M/α). Varying α provides a trade-off between memory and the solution-size guarantee. More precisely, for any 1 < α ≤ 2, ESC-Streaming achieves a (1 − 1/ln(1/ε), α ln(1/ε))-approximation guarantee for instances of SSC where α·k*·ln(1/ε) ≤ M. However, the improvement in the size guarantee comes at the cost of an increased memory usage of M^(1/(α−1)) and an increased number of oracle calls, m(log_α(M/α) + 1).

Notice that in the statement of Theorem 3, the approximation guarantee of ESC-Streaming is given with respect to a memory only large enough to fit the offline greedy algorithm's solution. However, if we allow the memory M to be as large as k*/ε, then Theorem 1 follows immediately with ε̃ = 1/ln(1/ε).

5 Example Applications

Many real-world problems, such as data summarization [27], image segmentation [16], and influence maximization in social networks [15], can be formulated as a submodular cover problem and can benefit from the streaming setting. In this section, we discuss two such concrete applications.

5.1 Active set selection

To scale kernel methods (such as kernel ridge regression, Gaussian processes, etc.) to large datasets, we often rely on active set selection methods [23]. For example, a significant problem with Gaussian process prediction is that it scales as O(n³). Storing the kernel matrix K and solving the associated linear system is prohibitive when n is large.
One way to overcome this is to select a small subset of the data while maintaining a certain diversity. A popular approach for active set selection is the Informative Vector Machine (IVM) [26], where the goal is to select a set S that maximizes the utility function

f(S) = (1/2) log det(I + σ⁻² K_{S,S}).    (2)

Here, K_{S,S} is the submatrix of K corresponding to the rows/columns indexed by S, and σ > 0 is a regularization parameter. This utility function is monotone submodular, as shown in [17].

5.2 Graph set cover

In many applications, e.g., influence maximization in social networks [15], community detection in graphs [13], etc., we are interested in selecting a small subset of vertices from a massive graph that "covers", in some sense, a large fraction of the graph.

In particular, in Section 6 we consider two fundamental set cover problems: the Dominating Set and Vertex Cover problems. Given a graph G(V, E) with vertex set V and edge set E, let ϱ(S) denote the set of neighbours of the vertices of S in the graph, and δ(S) the set of edges incident to a vertex in S. Dominating Set is the problem of selecting the smallest set that covers the vertex set V, i.e., the corresponding utility is f(S) = |ϱ(S) ∪ S|. Vertex Cover is the problem of selecting the smallest set that covers the edge set E, i.e., the corresponding utility is f(S) = |δ(S)|. Both utilities are monotone submodular functions.

6 Experimental Results

We address the following questions in our experiments:

1. How does ESC-Streaming perform in comparison to the offline greedy algorithm, in terms of solution size and speed?
2. How does α influence the trade-off between solution size and speed?
3. How does ESC-Streaming scale to massive datasets?

We evaluate the performance of ESC-Streaming on real-world datasets with two applications: active set selection and graph set cover problems, described in Section 5.
For active set selection, we choose a dataset whose size permits a comparison with the offline greedy algorithm. For graph cover, we run ESC-Streaming on a large graph of 787 million nodes and 47.6 billion edges. We measure the computational cost in terms of the number of oracle calls, which is independent of the concrete implementation and platform.

6.1 Active Set Selection for Quantum Mechanics

In quantum chemistry, computing certain properties, such as the atomization energy of molecules, can be computationally challenging [24]. In this setting, it is of interest to choose a small and diverse training set from which one can predict the atomization energy (e.g., by using kernel ridge regression) of other molecules.

We apply ESC-Streaming to the log-det function defined in Section 5.1, where we use the Gaussian kernel K_{ij} = exp(−‖x_i − x_j‖² / (2h²)) and set the hyperparameters as in [24]: σ = 1, h = 724. The dataset consists of 7k small organic molecules, each represented by a 276-dimensional vector. We set M = 2¹⁵ and vary Q from f(V)/2 to 3f(V)/4, and α from 1.1 to 2.

We compare against offline greedy and its accelerated version with lazy updates (Lazy Greedy) [19]. For all algorithms, we provide a vector of different values of ε̃ as input and terminate once the utility (1 − ε̃)Q corresponding to the smallest ε̃ is achieved. Below we report the performance for the smallest and largest tested values, ε̃ = 0.01 and ε̃ = 0.5, respectively.

In Figure 6.1, we show the performance of ESC-Streaming with respect to offline greedy and lazy greedy, in terms of the size of the solutions picked and the number of oracle calls made. The computational costs of all algorithms are normalized to those of offline greedy.

Figure 6.1: Active set selection of molecules: (Left) Percentage of oracle calls made relative to offline greedy; (Middle) Size of selected sets for ε = 0.01; (Right) Size of selected sets for ε = 0.5.

It can be seen that standard ESC-Streaming, with α = 2, always chooses a set at most twice as large as offline greedy (the largest observed ratio is 2.1089), using at most 3.15% and 25.5% of the number of oracle calls made, respectively, by offline greedy and lazy greedy. As expected, varying the parameter α leads to smaller solutions at the cost of more oracle calls: α = 1.1 leads to solutions of roughly the same size as those found by offline greedy. Note also that choosing larger values of α leads to jumps in the solution set sizes (cf. Figure 6.1). In particular, increasing the required utility Q, even by a small amount, may not be achievable by the current solution size (α^j) and may require moving to a set larger by at least a factor of α (α^(j+1)).

Finally, we remark that even for this small dataset, offline greedy, for the largest tested Q, required 1.2 × 10⁷ oracle calls and took almost 2 days to run on the same machine.

6.2 Cover problems on massive graphs

To assess the scalability of ESC-Streaming, we apply it to the "uk-2014" graph, a large snapshot of the .uk domain taken at the end of 2014 [5, 4, 3]. It consists of 787,801,471 nodes and 47,614,527,250 edges. This graph is sparse, with average degree 60.440, and hence requires large cover solutions. Storing this dataset (i.e., the adjacency list of the graph) on the hard drive requires more than 190GB of space.

We solve both the Dominating Set and Vertex Cover problems, whose utility functions are defined in Section 5. For the Dominating Set problem, we set M = 520 MB, α = 2, and Q = 0.7|V|. We run the first phase of ESC-Streaming (cf. Algorithm 1), then query for different values of ε̃ between 0 and 1 using Algorithm 2. Similarly, for the Vertex Cover problem, we set M = 320 MB, α = 2, and Q = 0.8|E|. Figure 6.2 shows the performance of ESC-Streaming on both the Dominating Set and Vertex Cover problems, in terms of the utility achieved, i.e., the number of vertices/edges covered, for all feasible ε̃ values, with respect to the size of the subset of vertices picked.

As a baseline, we compare against a random selection procedure that picks a random permutation of the vertices and then selects any vertex with a non-zero marginal gain, until it reaches the same partial cover achieved by ESC-Streaming.
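For reference, the two utilities being covered here (Section 5.2) can be written down directly; the Python below is a small in-memory sketch over an adjacency list, whereas the experiments above operate at a scale where the graph never fits in memory:

```python
def dominating_utility(adj, S):
    """f(S) = |rho(S) ∪ S|: the vertices of S together with their neighbours."""
    covered = set(S)
    for v in S:
        covered.update(adj[v])
    return len(covered)

def vertex_cover_utility(adj, S):
    """f(S) = |delta(S)|: the edges incident to at least one vertex of S."""
    edges = set()
    for v in S:
        for u in adj[v]:
            edges.add(frozenset((u, v)))  # undirected edge, order-free
    return len(edges)

# Path graph 0 - 1 - 2 - 3 as an adjacency list.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
```

Both functions are monotone submodular: on the path graph, for example, dominating_utility(adj, {1}) is 3 (vertices 0, 1, 2) while vertex_cover_utility(adj, {1}) is 2 (the edges {0,1} and {1,2}).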
Note that offline greedy, even with lazy evaluations, is not applicable here, since it does not terminate in a reasonable time, so we omit it from the comparison. Similarly, we do not compare against Emek and Rosén's algorithm [11], due to its large memory requirement of n log m, which in this case is roughly 20 times larger than the memory used by ESC-Streaming.

We do significantly better than random selection, especially on the Vertex Cover problem, which for sparse graphs is more challenging than the Dominating Set problem.

Since running the greedy algorithm on the "uk-2014" graph goes beyond our computing infrastructure, we include another instance of the Dominating Set problem on a smaller graph, "Friendster", an online gaming network [29], to compare with the offline greedy algorithm. This graph has 65.6 million nodes and 1.8 billion edges. The memory required by ESC-Streaming is less than 30MB for α = 2. We let offline greedy run for 2 days, gathering data for 2000 greedy iterations. Figure 6.2 (Right) shows that our performance almost matches the greedy solutions we managed to compute.

7 Conclusion

In this paper, we consider the SC problem in the streaming setting, where we select the least number of elements that can achieve a certain utility, measured by a submodular function.
We prove that there cannot exist any single-pass streaming algorithm that achieves a non-trivial approximation of SSC using sublinear memory if the utility has to be met exactly. Consequently, we develop an efficient approximation algorithm, ESC-Streaming, which finds solution sets, slightly larger than the optimal solution, that partially cover the desired utility. We rigorously analyze the approximation guarantees of ESC-Streaming and compare them against those of the offline greedy algorithm. We demonstrate the performance of ESC-Streaming on real-world problems. We believe that our algorithm is an important step towards solving streaming and large-scale submodular cover problems, which lie at the heart of many modern machine learning applications.

Figure 6.2: (Left) Vertex cover on "uk-2014"; (Middle) Dominating set on "uk-2014"; (Right) Dominating set on "Friendster".

Acknowledgments

We would like to thank Michael Kapralov and Ola Svensson for useful discussions.
This work was supported in part by the European Commission under ERC Future Proof, SNF 200021-146750, SNF CRSII2-147633, NCCR Marvel, and ERC Starting Grant 335288-OptApprox.

References

[1] Sepehr Assadi, Sanjeev Khanna, and Yang Li. Tight bounds for single-pass streaming complexity of the set cover problem. arXiv preprint arXiv:1603.05715, 2016.

[2] Ashwinkumar Badanidiyuru, Baharan Mirzasoleiman, Amin Karbasi, and Andreas Krause. Streaming submodular maximization: Massive data summarization on the fly. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 671-680. ACM, 2014.

[3] Paolo Boldi, Andrea Marino, Massimo Santini, and Sebastiano Vigna. BUbiNG: Massive crawling for the masses. In Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web, pages 227-228. International World Wide Web Conferences Steering Committee, 2014.

[4] Paolo Boldi, Marco Rosa, Massimo Santini, and Sebastiano Vigna. Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In Proceedings of the 20th International Conference on World Wide Web, pages 587-596. ACM Press, 2011.

[5] Paolo Boldi and Sebastiano Vigna. The WebGraph framework I: Compression techniques. In Proceedings of the Thirteenth International World Wide Web Conference (WWW 2004), pages 595-601, Manhattan, USA, 2004. ACM Press.

[6] Amit Chakrabarti, Graham Cormode, and Andrew McGregor. Robust lower bounds for communication and stream computation. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pages 641-650. ACM, 2008.

[7] Amit Chakrabarti and Tony Wirth. Incidence geometries and the pass complexity of semi-streaming set cover.
arXiv preprint arXiv:1507.04645, 2015.\n\n[8] Chandra Chekuri, Shalmoli Gupta, and Kent Quanrud. Streaming algorithms for submodular function\n\nmaximization. In Automata, Languages, and Programming, pages 318\u2013330. Springer, 2015.\n\n[9] Erik D Demaine, Piotr Indyk, Sepideh Mahabadi, and Ali Vakilian. On streaming and communication\n\ncomplexity of the set cover problem. In Distributed Computing, pages 484\u2013498. Springer, 2014.\n\n8\n\n\f[10] Irit Dinur and David Steurer. Analytical approach to parallel repetition. In Proceedings of the 46th Annual\nACM Symposium on Theory of Computing, STOC \u201914, pages 624\u2013633, New York, NY, USA, 2014. ACM.\n\n[11] Yuval Emek and Adi Ros\u00e9n. Semi-streaming set cover. In Automata, Languages, and Programming, pages\n\n453\u2013464. Springer, 2014.\n\n[12] Uriel Feige. A threshold of ln n for approximating set cover. Journal of the ACM (JACM), 45(4):634\u2013652,\n\n1998.\n\n[13] Santo Fortunato. Community detection in graphs. Physics reports, 486(3):75\u2013174, 2010.\n\n[14] Piotr Indyk, Sepideh Mahabadi, and Ali Vakilian. Towards tight bounds for the streaming set cover problem.\n\narXiv preprint arXiv:1509.00118, 2015.\n\n[15] David Kempe, Jon Kleinberg, and \u00c9va Tardos. Maximizing the spread of in\ufb02uence through a social\nnetwork. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and\ndata mining, pages 137\u2013146. ACM, 2003.\n\n[16] Gunhee Kim, Eric P Xing, Li Fei-Fei, and Takeo Kanade. Distributed cosegmentation via submodular\noptimization on anisotropic diffusion. In Computer Vision (ICCV), 2011 IEEE International Conference\non, pages 169\u2013176. IEEE, 2011.\n\n[17] Andreas Krause and Daniel Golovin. Submodular function maximization. Tractability: Practical Ap-\n\nproaches to Hard Problems, 3:19, 2012.\n\n[18] Ravi Kumar, Benjamin Moseley, Sergei Vassilvitskii, and Andrea Vattani. Fast greedy algorithms in\n\nmapreduce and streaming. 
ACM Transactions on Parallel Computing, 2(3):14, 2015.\n\n[19] Michel Minoux. Accelerated greedy algorithms for maximizing submodular set functions. In Optimization\n\nTechniques, pages 234\u2013243. Springer, 1978.\n\n[20] Baharan Mirzasoleiman, Amin Karbasi, Ashwinkumar Badanidiyuru, and Andreas Krause. Distributed\nsubmodular cover: Succinctly summarizing massive data. In Advances in Neural Information Processing\nSystems, pages 2863\u20132871, 2015.\n\n[21] George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for\n\nmaximizing submodular set functions\u2014i. Mathematical Programming, 14(1):265\u2013294, 1978.\n\n[22] George L Nemhauser and Leonard A Wolsey. Best algorithms for approximating the maximum of a\n\nsubmodular set function. Mathematics of operations research, 3(3):177\u2013188, 1978.\n\n[23] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning\n\n(Adaptive Computation and Machine Learning). The MIT Press, 2005.\n\n[24] Matthias Rupp. Machine learning for quantum mechanics in a nutshell. International Journal of Quantum\n\nChemistry, 115(16):1058\u20131073, 2015.\n\n[25] Barna Saha and Lise Getoor. On maximum coverage in the streaming model & application to multi-topic\n\nblog-watch. In SDM, volume 9, pages 697\u2013708. SIAM, 2009.\n\n[26] Matthias Seeger. Greedy forward selection in the informative vector machine. Technical report, Technical\n\nreport, University of California at Berkeley, 2004.\n\n[27] Sebastian Tschiatschek, Rishabh K Iyer, Haochen Wei, and Jeff A Bilmes. Learning mixtures of submodular\nfunctions for image collection summarization. In Advances in Neural Information Processing Systems,\npages 1413\u20131421, 2014.\n\n[28] Laurence A Wolsey. An analysis of the greedy algorithm for the submodular set covering problem.\n\nCombinatorica, 2(4):385\u2013393, 1982.\n\n[29] Jaewon Yang and Jure Leskovec. 
De\ufb01ning and evaluating network communities based on ground-truth. In\nProceedings of the ACM SIGKDD Workshop on Mining Data Semantics, MDS \u201912, pages 3:1\u20133:8, New\nYork, NY, USA, 2012. ACM.\n\n9\n\n\f", "award": [], "sourceid": 2230, "authors": [{"given_name": "Ashkan", "family_name": "Norouzi-Fard", "institution": "EPFL"}, {"given_name": "Abbas", "family_name": "Bazzi", "institution": "EPFL"}, {"given_name": "Ilija", "family_name": "Bogunovic", "institution": "EPFL Lausanne"}, {"given_name": "Marwa", "family_name": "El Halabi", "institution": "l"}, {"given_name": "Ya-Ping", "family_name": "Hsieh", "institution": "EPFL"}, {"given_name": "Volkan", "family_name": "Cevher", "institution": "EPFL"}]}