{"title": "Polynomial time algorithms for dual volume sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 5038, "page_last": 5047, "abstract": "We study dual volume sampling, a method for selecting k columns from an n*m short and wide matrix (n <= k <= m) such that the probability of selection is proportional to the volume spanned by the rows of the induced submatrix. This method was proposed by Avron and Boutsidis (2013), who showed it to be a promising method for column subset selection and its multiple applications. However, its wider adoption has been hampered by the lack of polynomial time sampling algorithms. We remove this hindrance by developing an exact (randomized) polynomial time sampling algorithm as well as its derandomization. Thereafter, we study dual volume sampling via the theory of real stable polynomials and prove that its distribution satisfies the \u201cStrong Rayleigh\u201d property. This result has numerous consequences, including a provably fast-mixing Markov chain sampler that makes dual volume sampling much more attractive to practitioners. This sampler is closely related to classical algorithms for popular experimental design methods that are to date lacking theoretical analysis but are known to empirically work well.", "full_text": "Polynomial time algorithms for dual volume sampling\n\nChengtao Li\n\nMIT\n\nctli@mit.edu\n\nStefanie Jegelka\n\nMIT\n\nstefje@csail.mit.edu\n\nSuvrit Sra\n\nMIT\n\nsuvrit@mit.edu\n\nAbstract\n\nWe study dual volume sampling, a method for selecting k columns from an n \u21e5 m\nshort and wide matrix (n \uf8ff k \uf8ff m) such that the probability of selection is propor-\ntional to the volume spanned by the rows of the induced submatrix. This method\nwas proposed by Avron and Boutsidis (2013), who showed it to be a promising\nmethod for column subset selection and its multiple applications. 
However, its\nwider adoption has been hampered by the lack of polynomial time sampling algo-\nrithms. We remove this hindrance by developing an exact (randomized) polynomial\ntime sampling algorithm as well as its derandomization. Thereafter, we study\ndual volume sampling via the theory of real stable polynomials and prove that its\ndistribution satis\ufb01es the \u201cStrong Rayleigh\u201d property. This result has numerous\nconsequences, including a provably fast-mixing Markov chain sampler that makes\ndual volume sampling much more attractive to practitioners. This sampler is closely\nrelated to classical algorithms for popular experimental design methods that are to\ndate lacking theoretical analysis but are known to empirically work well.\n\nIntroduction\n\n1\nA variety of applications share the core task of selecting a subset of columns from a short, wide\nmatrix A with n rows and m > n columns. The criteria for selecting these columns typically aim at\npreserving information about the span of A while generating a well-conditioned submatrix. Classical\nand recent examples include experimental design, where we select observations or experiments [38];\npreconditioning for solving linear systems and constructing low-stretch spanning trees (here A\nis a version of the node-edge incidence matrix and we select edges in a graph) [6, 4]; matrix\napproximation [11, 13, 24]; feature selection in k-means clustering [10, 12]; sensor selection [25]\nand graph signal processing [14, 41].\nIn this work, we study a randomized approach that holds promise for all of these applications. This\napproach relies on sampling columns of A according to a probability distribution de\ufb01ned over its\nsubmatrices: the probability of selecting a set S of k columns from A, with n \uf8ff k \uf8ff m, is\n\nP (S; A) / det(ASA>S ),\n\n(1.1)\nwhere AS is the submatrix consisting of the selected columns. 
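For intuition, the distribution (1.1) can be materialized exactly at toy sizes by brute-force enumeration. The sketch below (plain NumPy; function names are ours, purely illustrative) enumerates all k-subsets and also checks the normalizer against the closed form (m-n choose k-n) det(AA^T) that follows from the Cauchy-Binet formula (derived in Section 2.1):

```python
import itertools
from math import comb

import numpy as np

def dvs_distribution(A, k):
    """Brute-force P(S; A) proportional to det(A_S A_S^T) over all k-subsets.
    Exponential in m -- for illustration only; the paper's point is that
    polynomial-time sampling is possible without this enumeration."""
    n, m = A.shape
    subsets = list(itertools.combinations(range(m), k))
    weights = np.array([np.linalg.det(A[:, S] @ A[:, S].T) for S in subsets])
    Z = weights.sum()
    return subsets, weights / Z, Z

rng = np.random.default_rng(0)
n, m, k = 2, 5, 3
A = rng.standard_normal((n, m))
subsets, probs, Z = dvs_distribution(A, k)
assert np.isclose(probs.sum(), 1.0)
# partition function matches the Cauchy-Binet closed form C(m-n, k-n) * det(A A^T)
assert np.isclose(Z, comb(m - n, k - n) * np.linalg.det(A @ A.T))
```

Exact enumeration like this is only feasible for very small m, which is precisely the hindrance the algorithms of Section 2 remove.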
This distribution is reminiscent of\nvolume sampling, where k < n columns are selected with probability proportional to the determinant\ndet(A>S AS) of a k \u21e5 k matrix, i.e., the squared volume of the parallelepiped spanned by the selected\ncolumns. (Volume sampling does not apply to k > n as the involved determinants vanish.) In contrast,\nP (S; A) uses the determinant of an n \u21e5 n matrix and uses the volume spanned by the rows formed\nby the selected columns. Hence we refer to P (S; A)-sampling as dual volume sampling (DVS).\n\nContributions. Despite the ostensible similarity between volume sampling and DVS, and despite\nthe many practical implications of DVS outlined below, ef\ufb01cient algorithms for DVS are not known\nand were raised as open questions in [6]. In this work, we make two key contributions:\n\n\u2013 We develop polynomial-time randomized sampling algorithms and their derandomization for\n\nDVS. Surprisingly, our proofs require only elementary (but involved) matrix manipulations.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\f\u2013 We establish that P (S; A) is a Strongly Rayleigh measure [8], a remarkable property that\ncaptures a speci\ufb01c form of negative dependence. Our proof relies on the theory of real stable\npolynomials, and the ensuing result implies a provably fast-mixing, practical MCMC sampler.\nMoreover, this result implies concentration properties for dual volume sampling.\n\nIn parallel with our work, [16] also proposed a polynomial time sampling algorithm that works\nef\ufb01ciently in practice. Our work goes on to further uncover the hitherto unknown \u201cStrong Rayleigh\u201d\nproperty of DVS, which has important consequences, including those noted above.\n1.1 Connections and implications.\nThe selection of k n columns from a short and wide matrix has many applications. 
Our algorithms for DVS hence have several implications and connections; we note a few below.
Experimental design. The theory of optimal experimental design explores several criteria for selecting the set of columns (experiments) S. Popular choices are

S in argmin_{S subset of {1,...,m}} J(A_S), with
J(A_S) = ||A_S^+||_F^2 = tr((A_S A_S^T)^{-1}) (A-optimal design),
J(A_S) = ||A_S^+||_2^2 (E-optimal design),
J(A_S) = -log det(A_S A_S^T) (D-optimal design).   (1.2)

Here, A^+ denotes the Moore-Penrose pseudoinverse of A, and the minimization ranges over all S such that A_S has full row rank n. A-optimal design, for instance, is statistically optimal for linear regression [38].
Finding an optimal solution for these design problems is NP-hard, and most discrete algorithms use local search [33]. Avron and Boutsidis [6, Theorem 3.1] show that dual volume sampling yields an approximation guarantee for both A- and E-optimal design: if S is sampled from P(S; A), then

E[||A_S^+||_F^2] <= ((m - n + 1)/(k - n + 1)) ||A^+||_F^2;
E[||A_S^+||_2^2] <= (1 + n(m - k)/(k - n + 1)) ||A^+||_2^2.   (1.3)

Avron and Boutsidis [6] provide a polynomial time sampling algorithm only for the case k = n. Our algorithms achieve the bound (1.3) in expectation, and the derandomization in Section 2.3 achieves the bound deterministically. Wang et al. [43] recently (in parallel) achieved approximation bounds for A-optimality via a different algorithm combining convex relaxation and a greedy method. Other methods include leverage score sampling [30] and predictive length sampling [45].
Low-stretch spanning trees and applications. Objectives (1.2) also arise in the construction of low-stretch spanning trees, which have important applications in graph sparsification, preconditioning and solving symmetric diagonally dominant (SDD) linear systems [40], among others [18]. 
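To make the scalarizations in (1.2) concrete, the helper below (our own illustrative code, plain NumPy; not from the paper) evaluates each criterion for a candidate column set. It also checks the identity tr((A_S A_S^T)^{-1}) = ||A_S^+||_F^2, which holds when A_S has full row rank:

```python
import numpy as np

def design_objectives(A_S):
    """Evaluate A-, E- and D-optimality criteria for the submatrix A_S
    (assumed to have full row rank n); smaller is better for each."""
    M = A_S @ A_S.T                       # n x n Gram matrix of the rows
    M_inv = np.linalg.inv(M)
    pinv = np.linalg.pinv(A_S)
    return {
        "A": np.trace(M_inv),             # equals ||A_S^+||_F^2
        "E": np.linalg.norm(pinv, 2)**2,  # ||A_S^+||_2^2 = 1/sigma_min(A_S)^2
        "D": -np.log(np.linalg.det(M)),   # maximizing det <=> minimizing -log det
    }

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 8))
cols = [0, 2, 5, 7]
obj = design_objectives(A[:, cols])
# A-criterion agrees with the squared Frobenius norm of the pseudoinverse
assert np.isclose(obj["A"],
                  np.linalg.norm(np.linalg.pinv(A[:, cols]), "fro")**2)
```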
In the\nnode-edge incidence matrix \u21e7 2 Rn\u21e5m of an undirected graph G with n nodes and m edges, the\ncolumn corresponding to edge (u, v) ispw(u, v)(eu ev). Let \u21e7= U \u2303Y be the SVD of \u21e7 with\nY 2 Rn1\u21e5m. The stretch of a spanning tree T in G is then given by StT (G) = kY 1\nF [6]. In\nT k2\nthose applications, we hence search for a set of edges with low stretch.\nNetwork controllability. The problem of sampling k n columns in a matrix also arises in\nnetwork controllability. For example, Zhao et al. [44] consider selecting control nodes S (under\ncertain constraints) over time in complex networks to control a linear time-invariant network. After\ntransforming the problem into a column subset selection problem from a short and wide controllability\nmatrix, the objective becomes essentially an E-optimal design problem, for which the authors use\ngreedy heuristics.\nNotation. From a matrix A 2 Rn\u21e5m with m n columns, we sample a set S \u2713 [m] of k columns\n(n \uf8ff k \uf8ff m), where [m] := {1, 2, . . . , m}. We denote the singular values of A by {i(A)}n\ni=1,\nin decreasing order. We will assume A has full row rank r(A) = n, so n(A) > 0. We also\nassume that r(AS) = r(A) = n for every S \u2713 [m] where |S| n. By ek(A), we denote the k-th\nelementary symmetric polynomial of A, i.e., the k-th coef\ufb01cient of the characteristic polynomial\ndet(I A) =PN\n\n2 Polynomial-time Dual Volume Sampling\nWe describe in this section our method to sample from the distribution P (S; A). Our \ufb01rst method\nrelies on the key insight that, as we show, the marginal probabilities for DVS can be computed in\npolynomial time. To demonstrate this, we begin with the partition function and then derive marginals.\n\nj=0(1)jej(A)Nj.\n\n2\n\n\f2.1 Marginals\nThe partition function has a conveniently simple closed form, which follows from the Cauchy-Binet\nformula and was also derived in [6].\nLemma 1 (Partition Function [6]). 
For A 2 Rn\u21e5m with r(A) = n and n \uf8ff| S| = k \uf8ff m, we have\n\nZA :=X|S|=k,S\u2713[m]\n\ndet(ASA>S ) =\u2713m n\n\nk n\u25c6 det(AA>).\n\nNext, we will need the marginal probability P (T \u2713 S; A) = PS:T\u2713S P (S; A) that a given set\nT \u2713 [m] is a subset of the random set S. In the following theorem, the set Tc = [m] \\ T denotes the\n(set) complement of T , and Q? denotes the orthogonal complement of Q.\nTheorem 2 (Marginals). Let T \u2713 [m], |T|\uf8ff k, and \"> 0. Let AT = Q\u2303V > be the singular value\ndecomposition of AT where Q 2 Rn\u21e5r(AT ), and Q? 2 Rn\u21e5(nr(AT )). Further de\ufb01ne the matrices\n\nB = (Q?)>ATc 2 R(nr(AT ))\u21e5(m|T|),\n. . .\n\n1 (AT )+\"\n\n0\n1p2\n...\n\n1p2\n0\n...\n\nC =2664\nP (T \u2713 S; A) = hQr(AT )\n\n. . .\n...\n\n3775 Q>ATc 2 Rr(AT )\u21e5(m|T|).\nj (B)i \u21e5 \ni (AT )i \u21e5hQr(B)\n\nj=1 2\n\nZA\n\n.\n\nLet QBdiag(2\n\ni (B))Q>B be the eigenvalue decomposition of B>B where QB 2 R|Tc|\u21e5r(B). More-\nover, let W > =\u21e5ITc; C>\u21e4 and = ek|T|r(B)(W ((Q?B)>Q?B)W >). Then the marginal probabil-\n\nity of T in DVS is\n\n2 (AT )+\"\n\ni=1 2\n\nWe prove Theorem 2 via a perturbation argument that connects DVS to volume sampling. Speci\ufb01cally,\nobserve that for \u270f> 0 and |S| n it holds that\n\ndet(ASA>S + \"In) = \"nk det(A>S AS + \"Ik) = \"nk det \uf8ff ASp\"(Im)S>\uf8ff ASp\"(Im)S! . (2.1)\n\nCarefully letting \u270f ! 0 bridges volumes with \u201cdual\u201d volumes. The technical remainder of the proof\nfurther relates this equality to singular values, and exploits properties of characteristic polynomials. A\nsimilar argument yields an alternative proof of Lemma 1. We show the proofs in detail in Appendix A\nand B respectively.\nComplexity. The numerator of P (T \u2713 S; A) in Theorem 2 requires O(mn2) time to compute the\n\ufb01rst term, O(mn2) to compute the second and O(m3) to compute the third. 
The denominator takes O(mn^2) time, amounting to a total of O(m^3) to compute the marginal probability.
2.2 Sampling
The marginal probabilities derived above directly yield a polynomial-time exact DVS algorithm. Instead of k-sets, we sample ordered k-tuples S⃗ = (s_1, ..., s_k) in [m]^k. We denote the k-tuple variant of the DVS distribution by P⃗(.; A):

P⃗((s_j = i_j)_{j=1}^k; A) = (1/k!) P({i_1, ..., i_k}; A) = prod_{j=1}^k P⃗(s_j = i_j | s_1 = i_1, ..., s_{j-1} = i_{j-1}; A).

Sampling S⃗ is now straightforward. At the j-th step we sample s_j via P⃗(s_j = i_j | s_1 = i_1, ..., s_{j-1} = i_{j-1}; A); these probabilities are easily obtained from the marginals in Theorem 2.
Corollary 3. Let T = {i_1, ..., i_{t-1}}, and P(T subset of S; A) as in Theorem 2. Then,

P⃗(s_t = i | s_1 = i_1, ..., s_{t-1} = i_{t-1}; A) = P(T union {i} subset of S; A) / ((k - t + 1) P(T subset of S; A)).

As a result, it is possible to draw an exact dual volume sample in time O(km^4).
The full proof may be found in the appendix. The running time claim follows since the sampling algorithm invokes O(mk) computations of marginal probabilities, each costing O(m^3) time.
Remark. A potentially more efficient approximate algorithm could be derived by noting the relations between volume sampling and DVS. Specifically, we add a small perturbation to DVS as in Equation (2.1) to transform it into a volume sampling problem, and apply random projection for more efficient volume sampling as in [17]. Please refer to Appendix C for more details.
2.3 Derandomization
Next, we derandomize the above sampling algorithm to deterministically select a subset that satisfies the bound (1.3) for the Frobenius norm, thereby answering another question in [6]. The key insight for derandomization is that conditional expectations can be computed in polynomial time, given the marginals in Theorem 2:
Corollary 4. Let (i_1, ..., i_{t-1}) in [m]^{t-1} be such that the marginal distribution satisfies P⃗(s_1 = i_1, ..., s_{t-1} = i_{t-1}; A) > 0. The conditional expectation can be expressed as

E[||A_S^+||_F^2 | s_1 = i_1, ..., s_{t-1} = i_{t-1}] = sum_{j=1}^n P'({i_1, ..., i_{t-1}} subset of S | S ~ P(S; A_{[n]\{j}})) / P'({i_1, ..., i_{t-1}} subset of S | S ~ P(S; A)),

where P' denotes the unnormalized marginal distributions and A_{[n]\{j}} is A with its j-th row removed; it can be computed in O(nm^3) time.
We show the full derivation in Appendix D.
Corollary 4 enables a greedy derandomization procedure. Starting with the empty tuple S⃗_0 = (), in the i-th iteration we greedily select j* in argmin_j E[||A_{S union j}^+||_F^2 | (s_1, ..., s_i) = S⃗_{i-1} appended with j] and append it to our selection: S⃗_i = S⃗_{i-1} appended with j*. The final set is the non-ordered version S_k of S⃗_k. Theorem 5 shows that this greedy procedure succeeds, and implies a deterministic version of the bound (1.3).
Theorem 5. The greedy derandomization selects a column set S satisfying

||A_S^+||_F^2 <= ((m - n + 1)/(k - n + 1)) ||A^+||_F^2;
||A_S^+||_2^2 <= (n(m - n + 1)/(k - n + 1)) ||A^+||_2^2.

In the proof, we construct a greedy algorithm. In each iteration, the algorithm computes, for each column that has not yet been selected, the expectation conditioned on this column being included in the current set. Then it chooses the element with the lowest conditional expectation to add to the current set. This greedy inclusion of elements can only decrease the conditional expectation, thus retaining the bound in Theorem 5. The detailed proof is deferred to Appendix E.
Complexity. Each iteration of the greedy selection requires O(nm^3) time to compute O(m) conditional expectations. Thus, the total running time for k iterations is O(knm^4). 
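On toy sizes, the Frobenius-norm guarantee of (1.3), which the derandomized procedure attains deterministically, can be verified against exact enumeration of the DVS distribution (illustrative brute force, not the paper's polynomial-time algorithm):

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
n, m, k = 2, 6, 4
A = rng.standard_normal((n, m))

# exact DVS distribution over all k-subsets
subsets = list(itertools.combinations(range(m), k))
w = np.array([np.linalg.det(A[:, S] @ A[:, S].T) for S in subsets])
p = w / w.sum()

# E ||A_S^+||_F^2 under DVS, computed exactly
fro2 = np.array([np.linalg.norm(np.linalg.pinv(A[:, S]), "fro")**2
                 for S in subsets])
lhs = float(p @ fro2)
# the (1.3) bound: (m-n+1)/(k-n+1) * ||A^+||_F^2
rhs = (m - n + 1) / (k - n + 1) * np.linalg.norm(np.linalg.pinv(A), "fro")**2
assert lhs <= rhs
```

Since the bound holds for any full-row-rank A, the assertion passes for every random draw, not just this seed.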
The approximation bound for\nthe spectral norm is slightly worse than that in (1.3), but is of the same order if k = O(n).\n3 Strong Rayleigh Property and Fast Markov Chain Sampling\n\nNext, we investigate DVS more deeply and discover that it possesses a remarkable structural property,\nnamely, the Strongly Rayleigh (SR) [8] property. This property has proved remarkably fruitful\nin a variety of recent contexts, including recent progress in approximation algorithms [23], fast\nsampling [2, 27], graph sparsi\ufb01cation [22, 39], extensions to the Kadison-Singer problem [1], and\ncertain concentration of measure results [37], among others.\nFor DVS, the SR property has two major consequences: it leads to a fast mixing practical MCMC\nsampler, and it implies results on concentration of measure.\n\nStrongly Rayleigh measures. SR measures were introduced in the landmark paper of Borcea\net al. [8], who develop a rich theory of negatively associated measures. In particular, we say that a\n\nprobability measure \u00b5 : 2[n] ! R+ is negatively associated ifR F d\u00b5R Gd\u00b5 R F Gd\u00b5 for F, G\n\nincreasing functions on 2[n] with disjoint support. This property re\ufb02ects a \u201crepelling\u201d nature of \u00b5, a\nproperty that occurs more broadly across probability, combinatorics, physics, and other \ufb01elds\u2014see\n[36, 8, 42] and references therein. The negative association property turns out to be quite subtle in\ngeneral; the class of SR measures captures a strong notion of negative association and provides a\nframework for analyzing such measures.\n\n4\n\n\fSpeci\ufb01cally, SR measures are de\ufb01ned via their connection to real stable polynomials [36, 8, 42]. A\nmultivariate polynomial f 2 C[z] where z 2 Cm is called real stable if all its coef\ufb01cients are real and\nf (z) 6= 0 whenever Im(zi) > 0 for 1 \uf8ff i \uf8ff m. 
A measure is called an SR measure if its multivariate\ngenerating polynomial f\u00b5(z) := PS\u2713[n] \u00b5(S)Qi2S zi is real stable. Notable examples of SR\nmeasures are Determinantal Point Processes [31, 29, 9, 26], balanced matroids [19, 37], Bernoullis\nconditioned on their sum, among others. It is known (see [8, pg. 523]) that the class of SR measures\nis exponentially larger than the class of determinantal measures.\n\n3.1 Strong Rayleigh Property of DVS\nTheorem 6 establishes the SR property for DVS and is the main result of this section. Here and in the\n\nfollowing, we use the notation zS =Qi2S zi.\nTheorem 6. Let A 2 Rn\u21e5m and n \uf8ff k \uf8ff m. Then the multiaf\ufb01ne polynomial\ndet(ASA>S )Yi2S\n\nzi = X|S|=k,S\u2713[m]\n\np(z) := X|S|=k,S\u2713[m]\n\nis real stable. Consequently, P (S; A) is an SR measure.\n\ndet(ASA>S )zS,\n\n(3.1)\n\nThe proof of Theorem 6 relies on key properties of real stable polynomials and SR measures\nestablished in [8]. Essentially, the proof demonstrates that the generating polynomial of P (Sc; A)\ncan be obtained by applying a few carefully chosen stability preserving operations to a polynomial\nthat we know to be real stable. Stability, although easily destroyed, is closed under several operations\nnoted in the important proposition below.\nProposition 7 (Prop. 2.1 [8]). Let f : Cm ! C be a stable polynomial. The following properties pre-\nserve stability: (i) Substitution: f (\u00b5, z2, . . . , zm) for \u00b5 2 R; (ii) Differentiation: @Sf (z1, . . . , zm)\nfor any S \u2713 [m]; (iii) Diagonalization: f (z, z, z3 . . . , zm) is stable, and hence f (z, z, . . . , z); and\n(iv) Inversion: z1 \u00b7\u00b7\u00b7 znf (z1\nIn addition, we need the following two propositions for proving Theorem 6.\nProposition 8 (Prop. 2.4 [7]). Let B be Hermitian, z 2 Cm and Ai (1 \uf8ff i \uf8ff m) be Hermitian\nsemide\ufb01nite matrices. Then, the following polynomial is stable:\n\nn ).\n1 , . . . 
, z1\n\nf (z) := det(B +Xi\n\nziAi).\n\n(3.2)\n\ni=1) be a diagonal matrix. Using the Cauchy-Binet identity we have\n\nProposition 9. For n \uf8ff| S|\uf8ff m and L := A>A, we have det(ASA>S ) = en(LS,S).\nProof. Let Y = Diag([yi]m\ndet((AY ):,T ) det((A>)T,:) =X|T|=n,T\u2713[m]\ndet(AY A>) =X|T|=n,T\u2713[m]\nThus, when Y = IS, the (diagonal) indicator matrix for S, we obtain AY A> = ASA>S . Conse-\nquently, in the summation above only terms with T \u2713 S survive, yielding\ndet(A>T AT ) = X|T|=n,T\u2713S\ndet(ASA>S ) = X|T|=n,T\u2713S\n\ndet(LT,T ) = en(LS,S).\n\ndet(A>T AT )yT .\n\nWe are now ready to sketch the proof of Theorem 6.\n\nProof. (Theorem 6). Notationally, it is more convenient to prove that the \u201ccomplement\u201d polynomial\n\npc(z) :=P|S|=k,S\u2713[m] det(ASA>S )zSc is stable; subsequently, an application of Prop. 7-(iv) yields\n\nstability of (3.1). Using matrix notation W = Diag(w1, . . . , wm), Z = Diag(z1, . . . , zm), our\nstarting stable polynomial (this stability follows from Prop. 8) is\n\nwhich can be expanded as\n\nh(z, w) =XS\u2713[m]\n\nh(z, w) := det(L + W + Z), w 2 Cm, z 2 Cm,\n\ndet(WS + LS)zSc =XS\u2713[m]\u21e3XT\u2713S\n\nwS\\T det(LT,T )\u2318 zSc.\n\n5\n\n\f(3.3)\n\nThus, h(z, w) is real stable in 2m variables, indexed below by S and R where R := S\\T . Instead of\nthe form above, We can sum over S, R \u2713 [m] but then have to constrain the support to the case when\n\nNext, we truncate polynomial (3.3) at degree (mk)+(kn) = mn by restricting |Sc[R| = mn.\nBy [8, Corollary 4.18] this truncation preserves stability, whence\n\nSc \\ T = ; and Sc \\ R = ;. In other words, we may write (using Iverson-bracketsJ\u00b7K)\nh(z, w) = XS,R\u2713[m]JSc \\ R = ; ^ Sc \\ T = ;K det(LT,T )zScwR.\nH(z, w) := XS,R\u2713[m]\n|Sc[R|=mnJSc \\ R = ;K det(LS\\R,S\\R)zScwR,\n)) = XS,R\u2713[m]\n|Sc[R|=mnJSc \\ R = ;K det(LS\\R,S\\R)zScy|R|\n}\ndet(LT,T )\u2318y|S||T|zSc = XS\u2713[m]\n= XS\u2713[m]\u21e3X|T|=n,T\u2713S\n\nis also stable. 
Using Prop. 7-(iii), setting w1 = . . . = wm = y retains stability; thus\n\ng(z, y) : = H(z, (y, y, . . . , y\n\nen(LS,S)y|S|nzSc,\n\n|\n\nm times\n\n{z\n\nis also stable. Next, differentiating g(z, y), k n times with respect to y and evaluating at 0 preserves\nstability (Prop. 7-(ii) and (i)). In doing so, only terms corresponding to |S| = k survive, resulting in\n@kn\ndet(ASA>S )zSc,\n\n= (k n)! X|S|=k,S\u2713[m]\n\nen(LS,S)zSc = (k n)! X|S|=k,S\u2713[m]\n\nwhich is just pc(z) (up to a constant); here, the last equality follows from Prop. 9. This establishes\nstability of pc(z) and hence of p(z). Since p(z) is in addition multiaf\ufb01ne, it is the generating\npolynomial of an SR measure, completing the proof.\n\n@y kn g(z, y)y=0\n\nImplications: MCMC\n\n3.2\nThe SR property of P (S; A) established in Theorem 6 implies a fast mixing Markov chain for\nsampling S. The states for the Markov chain are all sets of cardinality k. The chain starts with a\nrandomly-initialized active set S, and in each iteration we swap an element sin 2 S with an element\nsout /2 S with a speci\ufb01c probability determined by the probability of the current and proposed set.\nThe stationary distribution of this chain is the one induced by DVS, by a simple detailed-balance\nargument. 
The chain is shown in Algorithm 1.

Algorithm 1 Markov Chain for Dual Volume Sampling
  Input: A in R^{n x m} the matrix of interest, k the target cardinality, T the number of steps
  Output: S ~ P(S; A)
  Initialize S subset of [m] such that |S| = k and det(A_S A_S^T) > 0
  for i = 1 to T do
    draw b in {0, 1} uniformly
    if b = 1 then
      pick s_in in S and s_out in [m]\S uniformly at random
      q(s_in, s_out, S) <- min{1, det(A_{S union {s_out} \ {s_in}} A_{S union {s_out} \ {s_in}}^T) / det(A_S A_S^T)}
      S <- S union {s_out} \ {s_in} with probability q(s_in, s_out, S)
    end if
  end for

The convergence of the Markov chain is measured via its mixing time: the mixing time of the chain indicates the number of iterations t that we must perform (starting from S_0) before we can consider S_t as an approximately valid sample from P(S; A). Formally, if d_{S_0}(t) is the total variation distance between the distribution of S_t and P(S; A) after t steps, then

tau_{S_0}(eps) := min{t : d_{S_0}(t') <= eps for all t' >= t}

is the mixing time to sample from a distribution eps-close to P(S; A) in terms of total variation distance. We say that the chain mixes fast if tau_{S_0} is polynomial in the problem size.
The fast mixing result for Algorithm 1 is a corollary of Theorem 6 combined with a recent result of [3] on fast-mixing Markov chains for homogeneous SR measures. Theorem 10 states this precisely.
Theorem 10 (Mixing time). The mixing time of the Markov chain shown in Algorithm 1 satisfies

tau_{S_0}(eps) <= 2k(m - k)(log P(S_0; A)^{-1} + log eps^{-1}).

Proof. Since P(S; A) is k-homogeneous SR by Theorem 6, the chain constructed for sampling S following [3] mixes in tau_{S_0}(eps) <= 2k(m - k)(log P(S_0; A)^{-1} + log eps^{-1}) steps.
Implementation. To implement Algorithm 1 we need to compute the transition probabilities q(s_in, s_out, S). Let T = S\{s_in} and assume r(A_T) = n. 
By the matrix determinant lemma, the acceptance ratio is

det(A_{S union {s_out} \ {s_in}} A_{S union {s_out} \ {s_in}}^T) / det(A_S A_S^T) = (1 + A_{s_out}^T (A_T A_T^T)^{-1} A_{s_out}) / (1 + A_{s_in}^T (A_T A_T^T)^{-1} A_{s_in}).

Thus, the transition probabilities can be computed in O(n^2 k) time. Moreover, one can further accelerate this algorithm by using the quadrature techniques of [28] to compute lower and upper bounds on this acceptance ratio, determining early acceptance or rejection of the proposed move.
Initialization. A remaining question is initialization. Since the mixing time involves log P(S_0; A)^{-1}, we need to start with an S_0 such that P(S_0; A) is sufficiently bounded away from 0. We show in Appendix F that by a simple greedy algorithm we can initialize S such that log P(S; A)^{-1} <= log(2^n k! (m choose k)) = O(k log m), and the resulting running time for Algorithm 1 is O~(k^3 n^2 m), which is linear in the data set size m and is efficient when k is not too large.
3.3 Further implications and connections
Concentration. Pemantle and Peres [37] show concentration results for Strongly Rayleigh measures. As a corollary of our Theorem 6 together with their results, we directly obtain tail bounds for DVS.
Algorithms for experimental design. Widely used, classical algorithms for finding an approximately optimal design include Fedorov's exchange algorithm [20, 21] (a greedy local search) and simulated annealing [34]. Both methods start with a random initial set S, and greedily or randomly exchange a column i in S with a column j not in S. Apart from their very expensive running times, they are known to work well in practice [35, 43]. 
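Our chain (Algorithm 1) uses exactly this exchange step. A minimal NumPy sketch follows (our own illustrative code; for clarity it recomputes both determinants directly, whereas the ratio above brings the per-step cost down to O(n^2 k)):

```python
import numpy as np

def dvs_mcmc(A, k, steps, rng):
    """Exchange-chain sketch of Algorithm 1: swap one selected column for an
    unselected one, accepting with probability min{1, det ratio}."""
    n, m = A.shape
    # random feasible start: det(A_S A_S^T) > 0
    while True:
        S = list(rng.choice(m, size=k, replace=False))
        det_S = np.linalg.det(A[:, S] @ A[:, S].T)
        if det_S > 1e-12:
            break
    for _ in range(steps):
        if rng.random() < 0.5:          # lazy step: b = 0, do nothing
            continue
        i = rng.integers(k)             # position of s_in within S
        s_out = int(rng.choice([j for j in range(m) if j not in S]))
        T = S.copy()
        T[i] = s_out
        det_T = np.linalg.det(A[:, T] @ A[:, T].T)
        if rng.random() < min(1.0, det_T / det_S):
            S, det_S = T, det_T
    return sorted(int(j) for j in S)

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 6))
S = dvs_mcmc(A, k=3, steps=500, rng=rng)
assert len(set(S)) == 3
assert np.linalg.det(A[:, S] @ A[:, S].T) > 0
```

By detailed balance, the stationary distribution of this chain is P(S; A); Theorem 10 bounds how long it must run.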
Yet so far there is no theoretical analysis, nor a principled way of determining when to stop the greedy search.
Curiously, our MCMC sampler is essentially a randomized version of Fedorov's exchange method. The two methods can be connected by a unified simulated-annealing view, where we define P_gamma(S; A) proportional to exp{log det(A_S A_S^T)/gamma} with temperature parameter gamma. Driving gamma to zero essentially recovers Fedorov's method, while our results imply fast mixing for gamma = 1, together with approximation guarantees. Through this lens, simulated annealing may be viewed as initializing Fedorov's method with the fast-mixing sampler. In practice, we observe that letting gamma < 1 improves the approximation results, which opens interesting questions for future work.

4 Experiments

We report the selection performance of DVS for experimental design on real regression data (CompAct, CompAct(s), Abalone and Bank32NH; available at http://www.dcc.fc.up.pt/?ltorgo/Regression/DataSets.html). We use 4,000 samples from each dataset for estimation. We compare against various baselines, including uniform sampling (Unif), leverage score sampling (Lev) [30], predictive length sampling (PL) [45], the sampling (Smpl) and greedy (Greedy) selection methods in [43], and Fedorov's exchange algorithm [20]. For DVS we initialize the MCMC sampler with kmeans++ [5] and run it for 10,000 iterations, which empirically yields sufficiently good selections. We measure performance via (1) the prediction error ||y - X alpha_hat||, and (2) running times. Figure 1 shows the results with sample sizes k varying from 60 to 200. 
Further experiments (including for the interpolation gamma < 1) may be found in the appendix.

[Figure 1 comprises three panels: prediction error vs. k, running time in seconds vs. k, and the time-error trade-off, comparing Unif, Lev, PL, Smpl, Greedy, DVS and Fedorov.]

Figure 1: Results on the CompAct(s) dataset. Results are the median of 10 runs, except for Greedy and Fedorov. Note that Unif, Lev, PL and DVS use less than 1 second to finish the experiments.

In terms of prediction error, DVS performs well and is comparable with Lev. Its strength compared to the greedy and relaxation methods (Smpl, Greedy, Fedorov) is its running time, leading to good time-error tradeoffs; these tradeoffs are illustrated in Figure 1 for k = 120.
In other experiments (shown in Appendix G) we observed that in some cases the optimization and greedy methods (Smpl, Greedy, Fedorov) yield better results than sampling, though with much higher running times. Hence, given the time-error tradeoffs, DVS may be an interesting alternative in situations where time is a very limited resource and results are needed quickly.
5 Conclusion
In this paper, we study the problem of DVS and develop an exact (randomized) polynomial time sampling algorithm as well as its derandomization. We further study dual volume sampling via the theory of real stable polynomials and prove that its distribution satisfies the "Strong Rayleigh" property. 
This result has remarkable consequences, especially because it implies a provably fast-mixing Markov chain sampler that makes dual volume sampling much more attractive to practitioners. Finally, we observe connections to classical, computationally more expensive experimental design methods (Fedorov's method and SA); together with our results here, these could be a first step towards a better theoretical understanding of those methods.

Acknowledgement
This research was supported by NSF CAREER award 1553284, NSF grant IIS-1409802, DARPA grant N66001-17-1-4039, DARPA FunLoL grant (W911NF-16-1-0551) and a Siebel Scholar Fellowship. The views, opinions, and/or findings contained in this article are those of the authors and should not be interpreted as representing the official views or policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the Department of Defense.

References
[1] N. Anari and S. O. Gharan. The Kadison-Singer problem for strongly Rayleigh measures and applications to asymmetric TSP. arXiv:1412.1143, 2014.
[2] N. Anari and S. O. Gharan. Effective-resistance-reducing flows and asymmetric TSP. In IEEE Symposium on Foundations of Computer Science (FOCS), 2015.
[3] N. Anari, S. O. Gharan, and A. Rezaei. Monte Carlo Markov chain algorithms for sampling strongly Rayleigh distributions and determinantal point processes. In COLT, pages 23–26, 2016.
[4] M. Arioli and I. S. Duff. Preconditioning of linear least-squares problems by identifying basic variables. SIAM J. Sci. Comput., 2015.
[5] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 2007.
[6] H. Avron and C. Boutsidis. Faster subset selection for matrices and applications. SIAM Journal on Matrix Analysis and Applications, 34(4):1464–1499, 2013.
[7] J. Borcea and P. Brändén. Applications of stable polynomials to mixed determinants: Johnson's conjectures, unimodality, and symmetrized Fischer products. Duke Mathematical Journal, pages 205–223, 2008.
[8] J. Borcea, P. Brändén, and T. Liggett. Negative dependence and the geometry of polynomials. Journal of the American Mathematical Society, 22:521–567, 2009.
[9] A. Borodin. Determinantal point processes. arXiv:0911.1153, 2009.
[10] C. Boutsidis and M. Magdon-Ismail. Deterministic feature selection for k-means clustering. IEEE Transactions on Information Theory, pages 6099–6110, 2013.
[11] C. Boutsidis, M. W. Mahoney, and P. Drineas. An improved approximation algorithm for the column subset selection problem. In SODA, pages 968–977, 2009.
[12] C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas. Stochastic dimensionality reduction for k-means clustering. arXiv preprint arXiv:1110.2897, 2011.
[13] C. Boutsidis, P. Drineas, and M. Magdon-Ismail. Near-optimal column-based matrix reconstruction. SIAM Journal on Computing, pages 687–717, 2014.
[14] S. Chen, R. Varma, A. Sandryhaila, and J. Kovačević. Discrete signal processing on graphs: Sampling theory. IEEE Transactions on Signal Processing, 63(24):6510–6523, 2015.
[15] A. Çivril and M. Magdon-Ismail. On selecting a maximum volume sub-matrix of a matrix and related problems. Theoretical Computer Science, pages 4801–4811, 2009.
[16] M. Derezinski and M. K. Warmuth. Unbiased estimates for linear regression via volume sampling. In Advances in Neural Information Processing Systems (NIPS), 2017.
[17] A. Deshpande and L. Rademacher. Efficient volume sampling for row/column subset selection. In Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, pages 329–338. IEEE, 2010.
[18] M. Elkin, Y. Emek, D. A. Spielman, and S.-H. Teng. Lower-stretch spanning trees. SIAM Journal on Computing, 2008.
[19] T. Feder and M. Mihail. Balanced matroids. In Symposium on Theory of Computing (STOC), pages 26–38, 1992.
[20] V. Fedorov. Theory of optimal experiments. Preprint 7 lsm, Moscow State University, 1969.
[21] V. Fedorov. Theory of optimal experiments. Academic Press, 1972.
[22] A. Frieze, N. Goyal, L. Rademacher, and S. Vempala. Expanders via random spanning trees. SIAM Journal on Computing, 43(2):497–513, 2014.
[23] S. O. Gharan, A. Saberi, and M. Singh. A randomized rounding approach to the traveling salesman problem. In IEEE Symposium on Foundations of Computer Science (FOCS), pages 550–559, 2011.
[24] N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
[25] S. Joshi and S. Boyd. Sensor selection via convex optimization. IEEE Transactions on Signal Processing, pages 451–462, 2009.
[26] A. Kulesza and B. Taskar. Determinantal Point Processes for machine learning, volume 5. Foundations and Trends in Machine Learning, 2012.
[27] C. Li, S. Jegelka, and S. Sra. Fast mixing Markov chains for strongly Rayleigh measures, DPPs, and constrained sampling. In Advances in Neural Information Processing Systems (NIPS), 2016.
[28] C. Li, S. Sra, and S. Jegelka. Gaussian quadrature for matrix inverse forms with applications. In ICML, pages 1766–1775, 2016.
[29] R. Lyons. Determinantal probability measures. Publications Mathématiques de l'Institut des Hautes Études Scientifiques, 98(1):167–212, 2003.
[30] P. Ma, M. Mahoney, and B. Yu. A statistical perspective on algorithmic leveraging. Journal of Machine Learning Research (JMLR), 2015.
[31] O. Macchi. The coincidence approach to stochastic point processes. Advances in Applied Probability, 7(1), 1975.
[32] A. Magen and A. Zouzias. Near optimal dimensionality reductions that preserve volumes. In Approximation, Randomization and Combinatorial Optimization. Algorithms and Techniques, pages 523–534. Springer, 2008.
[33] A. J. Miller and N.-K. Nguyen. A Fedorov exchange algorithm for D-optimal design. Journal of the Royal Statistical Society, 1994.
[34] M. D. Morris and T. J. Mitchell. Exploratory designs for computational experiments. Journal of Statistical Planning and Inference, 43:381–402, 1995.
[35] N.-K. Nguyen and A. J. Miller. A review of some exchange algorithms for constructing discrete optimal designs. Computational Statistics and Data Analysis, 14:489–498, 1992.
[36] R. Pemantle. Towards a theory of negative dependence. Journal of Mathematical Physics, 41:1371–1390, 2000.
[37] R. Pemantle and Y. Peres. Concentration of Lipschitz functionals of determinantal and other strong Rayleigh measures. Combinatorics, Probability and Computing, 23:140–160, 2014.
[38] F. Pukelsheim. Optimal design of experiments. SIAM, 2006.
[39] D. Spielman and N. Srivastava. Graph sparsification by effective resistances. SIAM J. Comput., 40(6):1913–1926, 2011.
[40] D. A. Spielman and S.-H. Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In STOC, 2004.
[41] M. Tsitsvero, S. Barbarossa, and P. D. Lorenzo. Signals on graphs: Uncertainty principle and sampling. IEEE Transactions on Signal Processing, 64(18):4845–4860, 2016.
[42] D. Wagner. Multivariate stable polynomials: theory and applications. Bulletin of the American Mathematical Society, 48(1):53–84, 2011.
[43] Y. Wang, A. W. Yu, and A. Singh. On computationally tractable selection of experiments in regression models. arXiv e-prints, 2016.
[44] Y. Zhao, F. Pasqualetti, and J. Cortés. Scheduling of control nodes for improved network controllability. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 1859–1864, 2016.
[45] R. Zhu, P. Ma, M. W. Mahoney, and B. Yu. Optimal subsampling approaches for large sample linear regression. arXiv preprint arXiv:1509.05111, 2015.