{"title": "Fast Parallel Algorithms for Statistical Subset Selection Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 5072, "page_last": 5081, "abstract": "In this paper, we propose a new framework for designing fast parallel algorithms for fundamental statistical subset selection tasks that include feature selection and experimental design. Such tasks are known to be weakly submodular and are amenable to optimization via the standard greedy algorithm. Despite its desirable approximation guarantees, however, the greedy algorithm is inherently sequential and in the worst case, its parallel runtime is linear in the size of the data.\nRecently, there has been a surge of interest in a parallel optimization technique called adaptive sampling which produces solutions with desirable approximation guarantees for submodular maximization in exponentially faster parallel runtime. Unfortunately, we show that for general weakly submodular functions such accelerations are impossible. The major contribution in this paper is a novel relaxation of submodularity which we call differential submodularity. We first prove that differential submodularity characterizes objectives like feature selection and experimental design. We then design an adaptive sampling algorithm for differentially submodular functions whose parallel runtime is logarithmic in the size of the data and achieves strong approximation guarantees. 
Through experiments, we show the algorithm's performance is competitive with state-of-the-art methods and obtains dramatic speedups for feature selection and experimental design problems.", "full_text": "Fast Parallel Algorithms for Statistical Subset\n\nSelection Problems\n\nSharon Qian\n\nHarvard University\n\nsharonqian@g.harvard.edu\n\nYaron Singer\n\nHarvard University\n\nyaron@seas.harvard.edu\n\nAbstract\n\nIn this paper, we propose a new framework for designing fast parallel algorithms\nfor fundamental statistical subset selection tasks that include feature selection and\nexperimental design. Such tasks are known to be weakly submodular and are\namenable to optimization via the standard greedy algorithm. Despite its desirable\napproximation guarantees, the greedy algorithm is inherently sequential and in\nthe worst case, its parallel runtime is linear in the size of the data. Recently, there\nhas been a surge of interest in a parallel optimization technique called adaptive\nsampling which produces solutions with desirable approximation guarantees for\nsubmodular maximization in exponentially faster parallel runtime. Unfortunately,\nwe show that for general weakly submodular functions such accelerations are\nimpossible. The major contribution in this paper is a novel relaxation of submod-\nularity which we call differential submodularity. We \ufb01rst prove that differential\nsubmodularity characterizes objectives like feature selection and experimental de-\nsign. We then design an adaptive sampling algorithm for differentially submodular\nfunctions whose parallel runtime is logarithmic in the size of the data and achieves\nstrong approximation guarantees. 
Through experiments, we show the algorithm's performance is competitive with state-of-the-art methods and obtains dramatic speedups for feature selection and experimental design problems.

1 Introduction

In fundamental statistics applications such as regression, classification and maximum likelihood estimation, we are often interested in selecting a subset of elements to optimize an objective function. In a series of recent works, both feature selection (selecting k out of n features) and experimental design (choosing k out of n samples) were shown to be weakly submodular [DK11, EKD+18, BBKT17]. The notion of weak submodularity was defined by Das and Kempe in [DK11] and quantifies the deviation of an objective function from submodularity. Characterizations of weak submodularity are important as they allow proving guarantees of greedy algorithms in terms of the deviation of the objective function from submodularity. More precisely, for objectives that are γ-weakly submodular (for γ that depends on the objective, see preliminaries Section 2), the greedy algorithm is shown to return a 1 − 1/e^γ approximation to the optimal subset.

Greedy is sequential and cannot be parallelized. For large data sets where one wishes to take advantage of parallelization, greedy algorithms are impractical. Greedy algorithms for feature selection such as forward stepwise regression iteratively add the feature with the largest marginal contribution to the objective, which requires computing the contribution of each feature in every iteration. Thus, the parallel runtime of the forward stepwise algorithm, and greedy algorithms in general, scales linearly with the number of features we want to select.
In cases where the computation of the objective\nfunction across all elements is expensive or the dataset is large, this can be computationally infeasible.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fAdaptive sampling for fast parallel submodular maximization.\nIn a recent line of work initiated\nby [BS18a], adaptive sampling techniques have been used for maximizing submodular functions\nunder varying constraints [BS18b, CQ19b, CQ19a, BBS18, CFK19, ENV19, FMZ19a, BRS19b,\nEN19, FMZ19b]. Intuitively, instead of growing the solution set element-wise, adaptive sampling\nadds a large set of elements to the solution at each round which allows the algorithm to be highly\nparallelizable. In particular, for canonical submodular maximization problems, one can obtain\napproximation guarantees arbitrarily close to the one obtained by greedy (which is optimal for\npolynomial time algorithms [NW78]) in exponentially faster parallel runtime.\n\nIn general, adaptive sampling fails for weakly submodular functions. Adaptive sampling tech-\nniques add large sets of high valued elements in each round by \ufb01ltering elements with low marginal\ncontributions. This enables these algorithms to terminate in a small number of rounds. For weak\nsubmodularity, this approach renders arbitrarily poor approximations. In Appendix A.1, we use\nan example of a weakly submodular function from [EDFK17] where adaptive sampling techniques\nhave an arbitrarily poor approximation guarantee. Thus, if we wish to utilize adaptive sampling to\nparallelize algorithms for applications such as feature selection and experimental design, we need a\nstronger characterization of these objectives which is amenable to parallelization.\n\n1.1 Differential Submodularity\nIn this paper, we introduce an alternative measure to quantify the deviation from submodularity which\nwe call differential submodularity, de\ufb01ned below. 
We use f_S(A) to denote f(S ∪ A) − f(S).

Definition 1. A function f : 2^N → R_+ is α-differentially submodular for α ∈ [0, 1], if there exist two submodular functions h, g s.t. for any S, A ⊆ N, we have that g_S(A) ≥ α · h_S(A) and

g_S(A) ≤ f_S(A) ≤ h_S(A)

A 1-differentially submodular function is submodular and a 0-differentially submodular function can be arbitrarily far from submodularity. In Figure 1, we show a depiction of differential submodularity (blue lines) calculated from the feature selection objective by fixing an element a and randomly sampling sets S of size 100 to compute the marginal contribution f_S(a) on a real dataset. For a differentially submodular function (blue lines), the property of decreasing marginal contributions does not hold but can be bounded by two submodular functions (red) with such property.

As we prove in this paper, applications such as feature selection for regression and classification as well as experimental design are all γ²-differentially submodular, where γ corresponds to their weak submodularity ratios [EKD+18, BBKT17]. The power of this characterization is that it allows for parallelization with strong approximation guarantees. We do this by designing an adaptive sampling algorithm that leverages the differential submodularity structure and has bounded approximation guarantees in terms of the differential submodularity ratios.

Figure 1: Marginal contribution of differentially submodular function.

1.2 Main results

Our main result is that for objectives such as feature selection for regression and classification and Bayesian A-optimality experimental design, which are all γ-weakly submodular, there is an approximation guarantee arbitrarily close to 1 − 1/e^{γ⁴} for maximization under cardinality constraints in O(log n) adaptive rounds (see adaptivity definition in Section 2).
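As a concrete toy illustration of Definition 1, the following sketch numerically verifies the sandwich g_S(A) ≤ f_S(A) ≤ h_S(A) with g_S(A) ≥ α · h_S(A). The functions f, g, h below are illustrative inventions, not the paper's objectives: f violates diminishing returns yet is 1/2-differentially submodular.

```python
# Toy numerical check of Definition 1 (alpha-differential submodularity).
# f, g, h are illustrative and NOT the paper's objectives: g and h are
# submodular (modular, in fact), and f is sandwiched between them.
import itertools

N = (0, 1, 2)

def g(S):  # submodular lower bound: |S|
    return float(len(set(S)))

def h(S):  # submodular upper bound: 2|S|
    return 2.0 * len(set(S))

def f(S):  # non-submodular: the marginal gain can grow with |S|
    S = set(S)
    return len(S) + (0.5 if len(S) >= 2 else 0.0)

def marg(fn, S, A):  # marginal contribution fn_S(A) = fn(S | A) - fn(S)
    return fn(set(S) | set(A)) - fn(set(S))

alpha, ok = 0.5, True
subsets = [c for r in range(4) for c in itertools.combinations(N, r)]
for S in subsets:
    for A in subsets:
        gm, fm, hm = marg(g, S, A), marg(f, S, A), marg(h, S, A)
        ok &= (gm <= fm <= hm) and (gm >= alpha * hm)
print(ok)  # every marginal of f is sandwiched, and g dominates alpha * h
```

All values here are multiples of 0.5, so the comparisons are exact; the check passes with α = 1/2 even though f itself is not submodular.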
Thus, while the approximation is inferior to the 1 − 1/e^γ obtained by greedy, our algorithm has exponentially fewer rounds. Importantly, using experiments we show that empirically it has comparable terminal values to the greedy algorithm, greatly outperforms its theoretical lower bound, and obtains the result with two to eight-fold speedups. We achieve our result by proving these objectives are α-differentially submodular and designing an adaptive sampling algorithm that gives a 1 − 1/e^{α²} approximation for maximizing any α-differentially submodular function under a cardinality constraint.

Conceptual overview. For the past decade, fundamental problems in machine learning have been analyzed through relaxed notions of submodularity (see details on different relaxations of submodularity and their relationship to differential submodularity in Appendix B). Our main conceptual contribution is the framework of differential submodularity, which is purposefully designed to enable fast parallelization techniques that previously-studied relaxations of submodularity do not. Specifically, although stronger than weak submodularity, we can prove direct relationships between objectives' weak submodularity ratios and their differential submodularity ratios, which allows getting strong approximations and exponentially faster parallel runtime. We note that differential submodularity is also applicable to more recent parallel optimization techniques such as adaptive sequencing [BRS19b].

Technical overview. From a purely technical perspective, there are two major challenges addressed in this work. The first pertains to the characterization of the objectives in terms of differential submodularity and the second is the design of an adaptive sampling algorithm for differentially submodular functions.
Previous adaptive sampling algorithms are purposefully designed for submodular functions and cannot be applied when the objective function is not submodular (example in Appendix A.2). In these cases, the marginal contribution of individual elements is not necessarily subadditive to the marginal contribution of the set of elements combined. Thus, the standard analysis of adaptive sampling, where we attempt to add large sets of elements to the solution set by assessing the value of individual elements, does not hold. By leveraging the fact that marginal contributions of differentially submodular functions can be bounded by marginal contributions of submodular functions, we can approximate the marginal contribution of a set by assessing the marginal contribution of its elements. This framework allows us to leverage parallelizable algorithms to show a stronger approximation guarantee in exponentially fewer rounds.

Paper organization. We first introduce preliminary definitions in Section 2, followed by our main framework of differential submodularity and its reduction to feature selection and experimental design objectives in Section 3. We then introduce an algorithm for selection problems using adaptive sampling in Section 4 and conclude with experiments in Section 5. Due to space constraints, most proofs of the analysis are deferred to the Appendix.

2 Preliminaries

For a positive integer n, we use [n] to denote the set {1, 2, . . . , n}. Boldface lower and upper case letters denote vectors and matrices respectively: a, x, y represent vectors and A, X, Y represent matrices. Unbolded lower and upper case letters represent elements and sets respectively: a, x, y represent elements and A, X, Y represent sets. For a matrix X ∈ R^{d×n} and S ⊆ [n], we denote the submatrix of columns indexed by S by X_S. For vectors, we use x_S to denote the restriction of x to the support supp(x) ⊆ S.
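To make the indexing conventions concrete, here is a minimal numpy sketch (synthetic data, names illustrative): X_S is the column submatrix X[:, S], and the set function induced by optimizing over vectors supported on S, as defined next, reduces to least squares on those columns in the regression case.

```python
# Minimal illustration of the notation: X_S = columns of X indexed by S, and a
# set function induced by optimizing over vectors w with supp(w) in S.
# Synthetic data; this is a sketch, not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 6))  # X in R^{d x n}, d = 40, n = 6
y = X[:, [0, 3]] @ np.array([1.5, -2.0]) + 0.05 * rng.standard_normal(40)

def f(S):
    # Variance reduction ||y||^2 - ||y - X_S w||^2, maximized over supp(w) in S
    S = sorted(S)
    if not S:
        return 0.0
    w, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)  # least squares on X_S
    resid = y - X[:, S] @ w
    return float(y @ y - resid @ resid)

# Monotone: enlarging the support can only improve the restricted maximum.
print(f(set()) <= f({0}) <= f({0, 3}))
```

Because the feasible supports are nested, f is monotone by construction, matching the monotonicity assumption below.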
To connect the discrete function f(S) to a continuous function, we let f(S) = ℓ(w^(S)), where w^(S) denotes the w that maximizes ℓ(·) subject to supp(w) ⊆ S.

Submodularity and weak submodularity. A function f : 2^N → R_+ is submodular if f_S(a) ≥ f_T(a) for all a ∈ N \ T and S ⊆ T ⊆ N. It is monotone if f(S) ≤ f(T) for all S ⊆ T. We assume that f is normalized and non-negative, i.e., 0 ≤ f(S) ≤ 1 for all S ⊆ N, and monotone. The concept of weak submodularity is a relaxation of submodularity, defined via the submodularity ratio:

Definition 2. [DK11] The submodularity ratio of f : 2^N → R_+ is defined as, for all A ⊆ N,

γ_k = min_{A ⊆ N, S : |A| ≤ k} ( Σ_{a∈A} f_S(a) ) / f_S(A).

Functions with submodularity ratio γ = min_k γ_k < 1 are γ-weakly submodular.

Adaptivity. The adaptivity of an algorithm refers to the number of sequential rounds of queries it makes when polynomially-many queries can be executed in parallel in each round.

Definition 3. For a function f, an algorithm is r-adaptive if every query f(S) for a set S occurs at a round i ∈ [r] such that S is independent of the values f(S′) of all other queries at round i.

Adaptivity is an information-theoretic measure of parallel runtime that can be translated to standard parallel computation frameworks such as PRAM (see Appendix C). Therefore, like all previous work on adaptivity in submodular maximization, we are interested in algorithms that have low adaptivity since they are parallelizable and scalable for large datasets [BRS19a, BS18b, CQ19b, CQ19a, BBS18, CFK19, ENV19, FMZ19a, BRS19b, EN19, FMZ19b].

3 Feature Selection and A-Optimal Design are Differentially Submodular

We begin by characterizing differential submodularity in terms of restricted strong concavity and restricted smoothness, defined as follows.

Definition 4. [EKD+18] Let Ω be a subset of R^n × R^n and ℓ : R^n →
R be a continuously differentiable function. A function ℓ is restricted strong concave (RSC) with parameter m_Ω and restricted smooth (RSM) with parameter M_Ω if, for all (y, x) ∈ Ω,

−(m_Ω/2) ‖y − x‖²₂ ≥ ℓ(y) − ℓ(x) − ⟨∇ℓ(x), y − x⟩ ≥ −(M_Ω/2) ‖y − x‖²₂

Before connecting our notion of differential submodularity to RSC/RSM properties, we first define concavity and smoothness parameters on subsets of Ω. If Ω′ ⊆ Ω, then M_{Ω′} ≤ M_Ω and m_{Ω′} ≥ m_Ω.

Definition 5. We define the domain of s-sparse vectors as Ω_s = {(x, y) : ‖x‖₀ ≤ s, ‖y‖₀ ≤ s, ‖x − y‖₀ ≤ s}. If t ≥ s, then M_s ≤ M_t and m_s ≥ m_t.

Theorem 6. Suppose ℓ(·) is RSC/RSM on s-sparse subdomains Ω_s with parameters m_s, M_s for s ≤ 2k. Then, for t = |S| + k and s = |S| + 1, the objective f(S) = ℓ(w^(S)) is differentially submodular s.t. for S, A ⊆ N with |A| ≤ k,

(m_s/M_t) · f̃_S(A) ≤ f_S(A) ≤ (M_s/m_t) · f̃_S(A), where f̃_S(A) = Σ_{a∈A} f_S(a).

Proof. We first prove the lower bound of the inequality. We define x^(S∪A) = (1/M_t) ∇ℓ(w^(S))_A + w^(S) and use the smoothness of ℓ(·) to lower bound f_S(A):

f_S(A) ≥ ℓ(x^(S∪A)) − ℓ(w^(S)) ≥ ⟨∇ℓ(w^(S)), x^(S∪A) − w^(S)⟩ − (M_t/2) ‖x^(S∪A) − w^(S)‖²₂   (1)
= (1/(2M_t)) ‖∇ℓ(w^(S))_A‖²₂   (2)

where the first inequality follows from the optimality of ℓ(w^(S∪A)) for vectors with support S ∪ A and the last equality is by the definition of x^(S∪A).

We can also use the strong concavity of ℓ(·) to upper bound the marginal contribution of each element of A to S, f_S(a). We define x^(S∪a) = (1/m_s) ∇ℓ(w^(S))_a + w^(S). For a ∈ A,

f_S(a) = ℓ(w^(S∪a)) − ℓ(w^(S)) ≤ ⟨∇ℓ(w^(S)), x^(S∪a) − w^(S)⟩ − (m_s/2) ‖x^(S∪a) − w^(S)‖²₂ = (1/(2m_s)) ‖∇ℓ(w^(S))_a‖²₂

where the last equality follows from the definition of x^(S∪a).
Summing across all a ∈ A, we get

Σ_{a∈A} f_S(a) ≤ Σ_{a∈A} (1/(2m_s)) ‖∇ℓ(w^(S))_a‖²₂ = (1/(2m_s)) ‖∇ℓ(w^(S))_A‖²₂   (3)

By combining (1) and (3), we get the desired lower bound on f_S(A). To get the upper bound on the marginals, we can use the lower bound on the submodularity ratio γ_{S,k} of f from Elenberg et al. [EKD+18], which is no less than m_t/M_s. Then, by letting f̃_S(A) = Σ_{a∈A} f_S(a), we can complete the proof and show that the marginals can be bounded.

We can further generalize the previous result to all sets S, A ⊆ N by using the general RSC/RSM parameters m, M associated with Ω_n, where n ≥ t, s. From Definition 5, since Ω_s ⊆ Ω_t ⊆ Ω_n, we have M_s ≤ M_t ≤ M and m_s ≥ m_t ≥ m. Thus, we can weaken the bounds from Theorem 6 to get

(m/M) · f̃_S(A) ≤ f_S(A) ≤ (M/m) · f̃_S(A),

which makes f a γ²-differentially submodular function for γ = m/M.

3.1 Differential submodularity bounds for statistical subset selection problems

We now connect differential submodularity to feature selection and experimental design objectives. We also show that even when adding diversity-promoting terms d(S) as in [DDK12], the functions remain differentially submodular. Due to space limitations, proofs are deferred to Appendix E.

Feature selection for regression. For a response variable y ∈ R^d and feature matrix X ∈ R^{d×n}, the objective is the maximization of the ℓ₂-utility function that represents the variance reduction of y given the feature set S:

ℓ_reg(y, w^(S)) = ‖y‖²₂ − ‖y − X_S w‖²₂

We can bound the marginals by eigenvalues of the feature covariance matrix. We denote the minimum and maximum eigenvalues of the k-sparse feature covariance matrix by λ_min(k) and λ_max(k).

Corollary 7.
Let γ = λ_min(2k)/λ_max(2k) and d : 2^N → R_+ be a submodular diversity function. Then f(S) = ℓ_reg(w^(S)) and f_div(S) = ℓ_reg(w^(S)) + d(S) are γ²-differentially submodular.

We note that [DK11] use a different objective function to measure the goodness of fit, R². In Appendix F, we show an analogous bound for the objective used in [DK11]. Our lower bound is consistent with the result in Lemma 2.4 from Das and Kempe [DK11].

Feature selection for classification. For classification, we wish to select the best k columns from X ∈ R^{d×n} to predict a categorical variable y ∈ R^d. We use the following log-likelihood objective in logistic regression to select features. For a categorical variable y ∈ R^d, the objective in selecting the elements to form a solution set is the maximization of the log-likelihood function for a given S:

ℓ_class(y, w^(S)) = Σ_{i=1}^{d} [ y_i (X_S w)_i − log(1 + e^{(X_S w)_i}) ]

We denote by m and M the RSC/RSM parameters on the feature matrix X. For γ = m/M, [EKD+18] show that the feature selection objective for classification is γ-weakly submodular.

Corollary 8. Let γ = m/M and d : 2^N → R_+ be a submodular diversity function. Then f(S) = ℓ_class(w^(S)) and f_div(S) = ℓ_class(w^(S)) + d(S) are γ²-differentially submodular.

Bayesian A-optimality for experimental design. In experimental design, we wish to select the set of experimental samples x_i from X ∈ R^{d×n} that maximally reduces the variance of the parameter posterior distribution. We now show that the objective for selecting diverse experiments using the Bayesian A-optimality criterion is differentially submodular. We denote by Λ = β²I the prior, which takes the form of an isotropic Gaussian, and by σ² the noise variance (see Appendix D for more details).

Corollary 9. Let γ = σ²/(‖X‖²(σ² + β²‖X‖²)) and d : 2^N →
R_+ be a submodular diversity function. Then the objectives of Bayesian A-optimality defined by f_A-opt(S) = Tr(Λ⁻¹) − Tr((Λ + σ⁻²X_S X_S^T)⁻¹) and the diverse analog defined by f_A-div(S) = f_A-opt(S) + d(S) are γ²-differentially submodular.

4 The Algorithm

We now present the DASH (DIFFERENTIALLY-ADAPTIVE-SAMPLING) algorithm for maximizing differentially submodular objectives with logarithmic adaptivity. Similar to recent works on low adaptivity algorithms [BRS19a, BS18b, CQ19b, CQ19a, BBS18, CFK19, ENV19, FMZ19a, BRS19b, EN19], this algorithm is a variant of the adaptive sampling technique introduced in [BS18a]. The adaptive sampling algorithm for submodular functions, where α = 1, is not guaranteed to terminate for non-submodular functions (see Appendix A.2). Thus, we design a variant that specifically addresses differential submodularity in order to parallelize the maximization of non-submodular objectives.

Algorithm overview. At each round, the DASH algorithm selects good elements determined by their individual marginal contributions and attempts to add a set of k/r elements to the solution set S. The decision to label elements as \"good\" or \"bad\" depends on the threshold t, which quantifies the distance between the elements that have been selected and OPT. This elimination step takes place in the while loop and effectively filters out elements with low marginal contributions. The algorithm terminates when k elements have been selected or when the value of f(S) is sufficiently close to OPT. The algorithm presented is an idealized version because we cannot exactly calculate expectations, and OPT and the differential submodularity parameter α are unknown.
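The round structure just described can be sketched as follows. This is a minimal Python rendition of the idealized scheme, assuming oracle access to f, a guessed value opt for f(O), and Monte Carlo estimates of the expectations; the parameter names and the safety cap on the filtering loop are ours, not the paper's.

```python
# Idealized sketch of the DASH adaptive-sampling loop: filter elements whose
# expected marginal contribution is low, then add a random block of k/r
# elements. Expectations are Monte Carlo estimates; a safety cap bounds the
# filtering loop since this sketch has sampling noise (the idealized analysis
# guarantees termination without it).
import random

def dash(f, N, k, r, alpha, opt, eps=0.1, n_samples=10):
    S, X = set(), set(N)
    block = max(1, k // r)

    def sample_block(X):
        return set(random.sample(sorted(X), min(block, len(X))))

    def avg_set_marginal(S, X):  # estimate of E_{R ~ U(X)}[f_S(R)]
        return sum(f(S | sample_block(X)) - f(S)
                   for _ in range(n_samples)) / n_samples

    def avg_elem_marginal(S, X, a):  # estimate of E[f_{S u (R \ {a})}(a)]
        tot = 0.0
        for _ in range(n_samples):
            R = sample_block(X) - {a}
            tot += f(S | R | {a}) - f(S | R)
        return tot / n_samples

    for _ in range(r):
        X -= S
        if len(S) >= k or not X:
            break
        t = (1 - eps) * (opt - f(S))
        for _ in range(len(N)):  # safety cap on the filtering loop
            if not X or avg_set_marginal(S, X) >= alpha**2 * t / r:
                break
            X = {a for a in X
                 if avg_elem_marginal(S, X, a) >= alpha * (1 + eps / 2) * t / k}
        if X:
            S |= sample_block(X)
    return S
```

For instance, on the modular function f(T) = Σ_{a∈T} a over N = {0, ..., 9} with k = 4, r = 2, α = 1 and opt = 30, the sketch filters down to high-value elements and returns a subset of at most k elements.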
We can estimate the expectations by increased sampling of the oracle, and we can guess OPT and α by running multiple guesses in parallel (see Appendix G for more details).

Algorithm 1 DASH (N, r, α)
1: Input: ground set N, number of outer-iterations r, differential submodularity parameter α
2: S ← ∅, X ← N
3: for r iterations do
4:    t := (1 − ε)(f(O) − f(S))
5:    while E_{R∼U(X)}[f_S(R)] < α²t/r do
6:        X ← X \ {a : E_{R∼U(X)}[f_{S∪(R\{a})}(a)] < α(1 + ε/2)t/k}
7:    end while
8:    S ← S ∪ R where R ∼ U(X)
9: end for
10: return S

Algorithm analysis. We now outline the proof sketch of the approximation guarantee of f(S) using DASH. In our analysis, we denote the optimal solution value by OPT = f(O), where O = argmax_{|S|≤k} f(S) and k is the cardinality constraint parameter. Proof details can be found in Appendix H.

Theorem 10. Let f be a monotone, α-differentially submodular function where α ∈ [0, 1]. Then, for any ε > 0, DASH is a log_{1+ε/2}(n) adaptive algorithm that obtains the following approximation for the set S that is returned by the algorithm:

f(S) ≥ (1 − 1/e^{α²} − ε) f(O).

The key adaptation for α-differentially submodular functions appears in the thresholds of the algorithm, one to filter out elements and another to lower bound the marginal contribution of the set added in each round.
The additional α factor in the while condition, compared to the single-element marginal contribution threshold, is a result of differential submodularity properties and guarantees termination. To prove the theorem, we lower bound the marginal contribution of the selected elements X_ρ at each iteration ρ: f_S(X_ρ) ≥ (α²/r)(1 − ε)(f(O) − f(S)) (Lemma 19 in Appendix H.1). We can show that the algorithm terminates in log_{1+ε/2}(n) rounds (Lemma 21 in Appendix H.1). Then, using the lower bound on the marginal contribution of a set at each round, f_S(X_ρ), in conjunction with an inductive proof, we get the desired result.

We have seen in Corollaries 7, 8 and 9 that the feature selection and Bayesian experimental design problems are differentially submodular. Thus, we can apply DASH to these problems to obtain the f(S) ≥ (1 − 1/e^{α²} − ε)f(O) guarantee from Theorem 10.

Figure 2: Linear regression feature selection results comparing DASH (blue) to baselines on synthetic (top row) and clinical datasets (bottom row). Dashed line represents LASSO extrapolated across λ.

Figure 3: Logistic regression feature selection results comparing DASH (blue) to baselines on synthetic (top row) and gene datasets (bottom row). The X denotes manual termination of the algorithm due to running time constraints. Dashed line represents approximation for LASSO extrapolated across λ.

5 Experiments

To empirically evaluate the performance of DASH, we conducted several experiments on feature selection and Bayesian experimental design. While the 1 − 1/e^{γ⁴} approximation guarantee of DASH is weaker than the 1 − 1/e^γ of the greedy algorithm (SDSMA), we observe that DASH performs comparably to SDSMA and outperforms other benchmarks.
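To put the two guarantees side by side, here is a quick numeric sketch of the bounds; the γ values are illustrative samples, not ratios measured on any dataset.

```python
# Comparing the guarantee of DASH, 1 - 1/e^(gamma^4), with greedy's
# 1 - 1/e^gamma for a few illustrative weak-submodularity ratios gamma.
import math

def dash_bound(gamma):
    return 1.0 - math.exp(-gamma**4)

def greedy_bound(gamma):
    return 1.0 - math.exp(-gamma)

for gamma in (1.0, 0.8, 0.5):
    print(f'gamma={gamma}: DASH bound {dash_bound(gamma):.3f}, '
          f'greedy bound {greedy_bound(gamma):.3f}')
```

At γ = 1 (submodular objectives) the two bounds coincide at 1 − 1/e; for smaller γ the parallel guarantee degrades faster, matching the tradeoff discussed above and the empirical observation that DASH far exceeds its theoretical lower bound.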
Most importantly, in all experiments, DASH achieves a two to eight-fold speedup over parallelized greedy implementations, even for moderate values of k. This shows the incredible potential of other parallelizable algorithms, such as adaptive sampling and adaptive sequencing, under the differential submodularity framework.

Datasets. We conducted experiments for linear and logistic regression using the ℓ_reg and ℓ_class objectives, and Bayesian experimental design using f_A-opt. We generated the synthetic feature space from a multivariate normal distribution. To generate the response variable y, we sample coefficients uniformly (D1) and map to probabilities for classification (D3), and attempt to select important features and samples. We also select features on a clinical dataset with n = 385 (D2) and classify the location of cancer in a biological dataset with n = 2500 (D4). We use D1, D2 for linear regression and Bayesian experimental design, and D3, D4 for logistic regression experiments. (See Appendix I.2 for details.)

Benchmarks. We compared DASH to RANDOM (selecting k elements randomly in one round), TOP-k (selecting the k elements of largest marginal contribution), SDSMA [KC10] and Parallel SDSMA, and LASSO, a popular algorithm for regression with an ℓ₁ regularization term. (See Appendix I.3.)

Experimental Setup. We run DASH and the baselines for different k in two sets of experiments.

• Accuracy vs. rounds. In this set of experiments, for each dataset we fixed one value of k (k = 150 for D1, k = 100 for D2, D3 and k = 200 for D4) and ran the algorithms to compare the accuracy of the solution (R² for linear regression, classification rate for logistic regression and Bayesian A-optimality for experimental design) as a function of the number of parallel rounds. The results are plotted in Figures 2a, 2d, Figures 3a, 3d and Figures 4a, 4d;

• Accuracy and time vs. features.
In these experiments, we ran the same benchmarks for varying values of k (in D1 the maximum is k = 150, in D2 and D3 the maximum is k = 100, and in D4 the maximum is k = 200) and measured both accuracy (Figures 2b, 2e, 3b, 3e, 4b, 4e) and time (Figures 2c, 2f, 3c, 3f, 4c, 4f). When measuring accuracy, we also ran LASSO by manually varying the regularization parameter to select approximately k features. Since each k represents a different run of the algorithm, the output (accuracy or time) is not necessarily monotonic with respect to k.

We implemented DASH with 5 samples at every round. Even with this small number of samples, the terminal value outperforms greedy throughout all experiments. The advantage of using fewer samples is that it allows parallelizing over fewer cores. In general, given more cores, one can reduce the variance in estimating marginal contributions, which improves the performance of the algorithm.

Figure 4: Bayesian experimental design results comparing DASH (blue) to baselines on synthetic (top row) and clinical datasets (bottom row).

Results on general performance. We first analyze the performance of DASH. For all applications, Figures 2a, 2d, 3a, 3d, 4a and 4d show that the final objective value of DASH is comparable to SDSMA, outperforms TOP-k and RANDOM, and is able to achieve the solution in far fewer rounds. In Figures 2b, 2e, 3b, 3e, 4b and 4e, we show DASH can be very practical in finding a comparable solution set to SDSMA, especially for larger values of k. In the synthetic linear regression experiment, DASH significantly outperforms LASSO and has comparable performance in other experiments. While DASH outperforms the simple baseline of RANDOM, we note that the performance of RANDOM varies widely depending on properties of the dataset.
In cases where a small number of features can give high accuracy, RANDOM can perform well by randomly selecting well-performing features when k is large (Figure 2e). However, in more interesting cases where the value does not immediately saturate, both DASH and SDSMA significantly outperform RANDOM (Figures 2b, 4b).

We can also see in Figures 2c, 2f, 3c, 3f, 4c and 4f that DASH is computationally efficient compared to the other baselines. In some cases, for smaller values of k, SDSMA is faster (Figure 3c). This is mainly due to the sampling done by DASH to estimate the marginals, which can be computationally intensive. However, in most experiments, DASH terminates more quickly even for small values of k. For larger values, DASH shows a two to eight-fold speedup compared to the fastest baseline.

Effect of oracle queries. Across our experiments, the cost of oracle queries varies widely. When the calculation of the marginal contribution is computationally cheap, parallelization of SDSMA has a longer running time than its sequential analog due to the cost of merging parallelized results (Figures 2c, 3c). However, in the logistic regression gene selection experiment, calculating the marginal contribution of an element to the solution set can take more than 1 minute. In this setting, using sequential SDSMA to select 100 elements would take several days to terminate (Figure 3f). Parallelization of SDSMA drastically improves the running time, but DASH is still much faster and can find a comparable solution set in under half the time of parallelized SDSMA. In both cases of cheap and computationally intensive oracle queries, DASH terminates more quickly than the sequential and parallelized versions of SDSMA for larger values of k.
This can be seen in Figures 2c, 3c and 4c, where calculation of marginal contributions on synthetic data is fast, and in Figures 2f, 3f and 4f, where oracle queries on larger datasets are much slower. This shows the incredible potential of using DASH across a wide array of different applications to drastically cut down on computation time when selecting a large number of elements across different objective functions. Given access to more processors, we expect an even larger speedup for DASH.

Acknowledgements

The authors would like to thank Eric Balkanski for helpful discussions. This research was supported by a Smith Family Graduate Science and Engineering Fellowship, NSF grant CAREER CCF 1452961, NSF CCF 1301976, BSF grant 2014389, NSF USICCS proposal 1540428, a Google Research award, and a Facebook research award.

References

[BBKT17] Andrew An Bian, Joachim M Buhmann, Andreas Krause, and Sebastian Tschiatschek. Guarantees for greedy maximization of non-submodular functions with applications. In Proceedings of the 34th International Conference on Machine Learning, pages 498-507. JMLR.org, 2017.

[BBS18] Eric Balkanski, Adam Breuer, and Yaron Singer. Non-monotone submodular maximization in exponentially fewer iterations. In Advances in Neural Information Processing Systems, pages 2359-2370, 2018.

[BRS19a] Eric Balkanski, Aviad Rubinstein, and Yaron Singer. An exponential speedup in parallel running time for submodular maximization without loss in approximation. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), 2019.

[BRS19b] Eric Balkanski, Aviad Rubinstein, and Yaron Singer. An optimal approximation for submodular maximization under a matroid constraint in the adaptive complexity model. STOC, 2019.

[BS18a] Eric Balkanski and Yaron Singer. The adaptive complexity of maximizing a submodular function. STOC, 2018.

[BS18b] Eric Balkanski and Yaron Singer.