{"title": "Flexible Modeling of Diversity with Strongly Log-Concave Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 15225, "page_last": 15235, "abstract": "Strongly log-concave (SLC) distributions are a rich class of discrete probability distributions over subsets of some ground set. They are strictly more general than strongly Rayleigh (SR) distributions such as the well-known determinantal point process. While SR distributions offer elegant models of diversity, they lack an easy control over how they express diversity. We propose SLC as the right extension of SR that enables easier, more intuitive control over diversity, illustrating this via examples of practical importance. We develop two fundamental tools needed to apply SLC distributions to learning and inference: sampling and mode finding. For sampling we develop an MCMC sampler and give theoretical mixing time bounds. For mode finding, we establish a weak log-submodularity property for SLC functions and derive optimization guarantees for a distorted greedy algorithm.", "full_text": "Flexible Modeling of Diversity with Strongly\n\nLog-Concave Distributions\n\nMassachusetts Institute of Technology\n\nMassachusetts Institute of Technology\n\nJoshua Robinson\n\njoshrob@mit.edu\n\nSuvrit Sra\n\nsuvrit@mit.edu\n\nStefanie Jegelka\n\nMassachusetts Institute of Technology\n\nstefje@csail.mit.edu\n\nAbstract\n\nStrongly log-concave (SLC) distributions are a rich class of discrete probability\ndistributions over subsets of some ground set. They are strictly more general than\nstrongly Rayleigh (SR) distributions such as the well-known determinantal point\nprocess. While SR distributions offer elegant models of diversity, they lack an easy\ncontrol over how they express diversity. We propose SLC as the right extension\nof SR that enables easier, more intuitive control over diversity, illustrating this via\nexamples of practical importance. 
We develop two fundamental tools needed to\napply SLC distributions to learning and inference: sampling and mode finding.\nFor sampling we develop an MCMC sampler and give theoretical mixing time\nbounds. For mode finding, we establish a weak log-submodularity property for\nSLC functions and derive optimization guarantees for a distorted greedy algorithm.\n\n1 Introduction\n\nA variety of machine learning tasks involve selecting diverse subsets of items. How we model\ndiversity is, therefore, a key concern with possibly far-reaching consequences. Recently popular\nprobabilistic models of diversity include determinantal point processes [32, 39], and more generally,\nstrongly Rayleigh (SR) distributions [8, 35]. These models have been successfully deployed for subset\nselection in applications such as video summarization [44], fairness [13], model compression [46],\nanomaly detection [49], the Nyström method [41], generative models [24, 40], and accelerated\ncoordinate descent [51]. While valuable and broadly applicable, SR distributions have one main\ndrawback: it is difficult to control the strength and nature of the diversity they model.\nWe counter this drawback by leveraging strongly log-concave (SLC) distributions [3–5]. These\ndistributions are strictly more general than SR measures, and possess key properties that enable easier,\nmore intuitive control over diversity. They derive their name from SLC polynomials, introduced by\nGurvits already a decade ago [30]. More recently they have shot into prominence due to their key\nrole in developing deep connections between discrete and continuous convexity, with subsequent\napplications in combinatorics [1, 10, 33]. In particular, they lie at the heart of recent breakthrough\nresults such as a proof of Mason’s conjecture [4] and a fully polynomial-time approximation\nscheme for counting the number of bases of arbitrary matroids [3, 5]. 
We remark that all these works\nassume homogeneous SLC polynomials.\nWe build on this progress to develop fundamental tools for general SLC distributions, namely,\nsampling and mode finding. We highlight the flexibility of SLC distributions through two settings of\nimportance in practice: (i) raising any SLC distribution to a power α ∈ [0, 1]; and (ii) incorporating a\nconstraint that allows sampling sets of any size up to a budget. In contrast to similar modifications\nto SR measures (see e.g., [49]), these settings retain the crucial SLC property. Setting (i) allows us\nto conveniently tune the strength of diversity by varying a single parameter, while setting (ii) offers\ngreater flexibility than fixed-cardinality distributions such as a k-determinantal point process [38].\nThis observation is simple yet important, especially since the “right” value of k is hard to fix a priori.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nContributions. We briefly summarize the main contributions of this work below.\n– We introduce the class of strongly log-concave distributions to the machine learning community,\nshowing how it can offer a flexible discrete probabilistic model for distributions over subsets.\n– We prove various closure properties of SLC distributions (Theorems 2–5), and show how to use\nthese properties for better controlling the distributions used for inference.\n– We derive sampling algorithms for SLC and related distributions, and analyze their corresponding\nmixing times both theoretically and empirically (Algorithm 1, Theorem 8).\n– We study the negative dependence of SLC distributions by deriving a weak log-submodularity\nproperty (Theorem 11). 
Optimization guarantees for a selection of greedy algorithms are obtained\nas a consequence (Theorem 12).\n\nAs noted above, our results build on the remarkable recent progress in [3\u20135] and [10]. The biggest\ndifference between the previous work and this work is our focus on general non-homogeneous SLC\npolynomials, corresponding to distributions over sets of varying cardinality, as opposed to purely\nthe homogeneous, i.e., \ufb01xed-cardinality, case. This broader focus necessitates development of some\nnew machinery, because unlike SR polynomials, the class of SLC polynomials is not closed under\nhomogenization. We summarize the related work below for additional context.\n\n1.1 Related work\n\nSR polynomials. Strongly Rayleigh distributions were introduced in [8] as a class of discrete\ndistributions possessing several strong negative dependence properties. It did not take long for their\npotential in machine learning to be identi\ufb01ed [39]. Particular attention has been paid to determinantal\npoint processes due to the intuitive way they capture negative dependence, and the fact that they\nare parameterized by a single positive semi-de\ufb01nite kernel matrix. Convenient parameterization has\nallowed an abundance of fast algorithms for learning the kernel matrix [23, 26, 45, 47], and sampling\n[2, 42, 50]. SR distributions are a fascinating and elegant probabilistic family whose applicability in\nmachine learning is still an emerging topic [17, 35, 43, 48].\nSLC polynomials. Gurvits introduced SLC polynomials a decade ago [30] and studied their con-\nnection to discrete convex geometry. Recently this connection was signi\ufb01cantly developed [10, 5]\nby establishing that matroids, and more generally M-convex sets, are characterized by the strong\nlog-concavity of their generating polynomial. This is in contrast to SR, for which it is known that\nsome matroids have generating polynomials that are not SR [9].\nLog-submodular distributions. 
Distributions over subsets that are log-submodular (or supermodular) are amenable to mode finding and variational inference with approximation guarantees, by\nexploiting the optimization properties of submodular functions [20–22]. Theoretical bounds on\nsampling time require additional assumptions [29]. Iyer and Bilmes [34] analyze inference for\nsubmodular distributions, establishing polynomial approximation bounds.\nMCMC samplers and mixing time. The seminal works [18, 19] offer two tools for obtaining\nmixing time bounds for Markov chains: lower bounding the spectral gap, or the log-Sobolev constant.\nThese techniques have been successfully deployed to obtain mixing time bounds for homogeneous SR\ndistributions [2], general SR distributions [42], and recently homogeneous SLC distributions [5].\n\n2 Background and setup\n\nNotation. We write [n] = {1, . . . , n}, and denote by 2^[n] the power set {S | S ⊆ [n]}. For any\nvariable u, write ∂_u to denote ∂/∂u; in case u = z_i, we often abbreviate further by writing ∂_i instead\nof ∂_{z_i}. For S ⊆ [n] and α ∈ N^n let 1_S ∈ {0, 1}^n denote the binary indicator vector of S, and\ndefine |α| = ∑_{i=1}^n α_i. We also write variously ∂^S_z = ∏_{i∈S} ∂_i and ∂^α_z = ∏_{i∈[n]} ∂_i^{α_i}, where α_i = 0\nmeans we do not take any derivatives with respect to z_i. We let z^S and z^α denote the monomials\n∏_{i∈S} z_i and ∏_{i=1}^n z_i^{α_i} respectively. For K = R or R_+ we write K[z_1, . . . , z_n] to denote the set of\nall polynomials in the variables z = (z_1, . . . , z_n) whose coefficients belong to K. A polynomial is\nsaid to be d-homogeneous if it is the sum of monomials all of which are of degree d. Finally, for a set\nX we shall minimize clutter by using X ∪ i and X \\ i to denote X ∪ {i} and X \\ {i} respectively.\nSLC distributions. 
We consider distributions π : 2^[n] → [0, 1] on the subsets of a ground set [n].\nThere is a one-to-one correspondence between such distributions and their generating polynomials\n\nf_π(z) := ∑_{S⊆[n]} π(S) ∏_{i∈S} z_i = ∑_{S⊆[n]} π(S) z^S.   (1)\n\nThe central object of interest in this paper is the class of strongly log-concave distributions, which is\ndefined by imposing certain log-concavity requirements on the corresponding generating polynomials.\nDefinition 1. A polynomial f ∈ R_+[z_1, . . . , z_n] is strongly log-concave (SLC) if every derivative of\nf is log-concave. That is, for any α ∈ N^n either ∂^α f = 0, or the function log(∂^α f(z)) is concave\nat all z ∈ R^n_+. We say a distribution π is strongly log-concave if its generating polynomial f_π is\nstrongly log-concave; we also say π is d-homogeneous if f_π is d-homogeneous.\n\nThere are many examples of SLC distributions; we note a few important ones below.\n– Determinantal point processes [39, 27, 38, 41], and more generally, strongly Rayleigh (SR)\ndistributions [8, 17, 43, 35].\n– Exponentiated (for exponents in [0, 1]) homogeneous SR distributions [49, 5].\n– The uniform distribution on the independent sets of a matroid [4].\nSR distributions satisfy several strong negative dependence properties (e.g., log-submodularity and\nnegative association). The fact that SLC is a strict superset of SR suggests that SLC distributions\npossess some weaker negative dependence properties. These properties will play a crucial role in the\ntwo fundamental tasks that we study in this paper: sampling and mode finding.\n\nSampling. Our first task is to efficiently draw samples from an SLC distribution π. 
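Before turning to samplers, the correspondence in (1) can be made concrete with a small sketch; the helper names below are ours, not from the paper:

```python
from itertools import chain, combinations
from math import prod

def powerset(n):
    """All subsets of {0, ..., n-1} as tuples."""
    return chain.from_iterable(combinations(range(n), r) for r in range(n + 1))

def generating_poly(pi, n, z):
    """Evaluate f_pi(z) = sum_S pi(S) * prod_{i in S} z_i."""
    return sum(pi(S) * prod(z[i] for i in S) for S in powerset(n))

# Example: the uniform distribution on all 2^n subsets of a ground set of size 3.
n = 3
uniform = lambda S: 1 / 2 ** n
# Any generating polynomial evaluated at z = (1, ..., 1) sums the probabilities:
print(generating_poly(uniform, n, [1.0] * n))  # 1.0
```

At z = 0 only the empty set survives, so f_π(0) = π(∅); more generally, marginal and conditional quantities of π become evaluations and derivatives of f_π.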
To that end, we\nseek to develop Markov Chain Monte Carlo (MCMC) samplers whose mixing time (see Section 4 for\nde\ufb01nition) can be well-controlled. For homogeneous \u03c0, the breakthrough work of Anari et al. [5]\nprovides the \ufb01rst analysis of fast-mixing for a simple Markov chain called Base Exchange Walk;\nthis analysis is further re\ufb01ned in [15]. Base Exchange Walk is de\ufb01ned as follows: if currently\nat state S \u2286 [n], remove an element i \u2208 S uniformly at random. Then move to R \u2283 S \\ {i} with\nprobability proportional to \u03c0(R). This describes a transition kernel Q(S, R) for moving from S to R.\nWe build on these works to obtain the \ufb01rst mixing time bounds for sampling from general (i.e., not\nnecessarily homogeneous) SLC distributions (Section 4).\n\nMode \ufb01nding. Our second main goal is optimization, where we consider the more general task of\n\ufb01nding a mode of an SLC distribution subject to a cardinality constraint. This task involves solving\nmax|S|\u2264d \u03c0(S). This task is known to be NP-hard even for SR distributions; indeed, the maximum\nvolume subdeterminant problem [14] is a special case. We consider a more practical approach based\non observing that SLC distributions satisfy a relaxed notion of log-submodularity, which enables us\nto adapt simple greedy algorithms.\nBefore presenting the details about sampling and optimization, we need to \ufb01rst establish some key\ntheoretical properties of general SLC distributions. This is the subject of the next section.\n3 Theoretical tools for general SLC polynomials\n\nIn this technical section we develop the theory of strong log-concavity by detailing several transfor-\nmations of an SLC polynomial f that preserve strong log-concavity. Such closure properties can be\nessential for proving the SLC property, or for developing algorithmic results. 
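As a reference point for later sections, one step of the Base Exchange Walk defined above can be sketched as follows (a sketch; `weight` stands in for an unnormalized target distribution):

```python
import random

def base_exchange_step(S, weight, ground):
    """One step of Base Exchange Walk for a homogeneous distribution:
    drop an element of S uniformly at random, then move to a superset R
    of the remainder with probability proportional to weight(R)."""
    S = frozenset(S)
    i = random.choice(sorted(S))                  # remove i uniformly at random
    base = S - {i}
    # candidate supersets R = base ∪ {j}; note j = i (staying put) is allowed
    candidates = [base | {j} for j in ground if j not in base]
    w = [weight(R) for R in candidates]
    return random.choices(candidates, weights=w, k=1)[0]
```

For a homogeneous distribution the walk stays on sets of a fixed size, which is the setting in which the mixing analysis of [5] applies.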
Due to the correspondence between distributions on 2^[n] and their generating polynomials, each statement concerning\npolynomials can be translated into a statement about probability distributions. The forthcoming\nresults assume polynomials that are supported on the independent sets of a matroid. This can be\nviewed as a minor technical assumption since, to the best of our knowledge, all known SLC polynomials are supported on the independent sets of a matroid. A fundamental correspondence between\nhomogeneous SLC distributions and bases of a matroid was observed in [10]; however, it remains an\nopen question to precisely understand this relationship for non-homogeneous SLC polynomials. The\nfollowing theorem is a crucial stepping stone to sampling from non-homogeneous SLC distributions,\nand to sampling with cardinality constraints.\n\nTheorem 2. Let f = ∑_{S⊆[n]} c_S z^S ∈ R_+[z_1, . . . , z_n] be SLC, and suppose the support of the sum is\nthe collection of independent sets of a rank d matroid. Then for any k ≤ d the following polynomial\nis SLC:\n\nH_k f(z, y) = ∑_{|S|≤k} c_S / (k − |S|)! · z^S y^{k−|S|}.\n\nThe above operation is also referred to as scaled homogenization, since the resulting polynomial\nis homogeneous and there is an added 1/(k − |S|)! factor. In fact, we may extend Theorem 2 by\nallowing the user to add an additional exponentiating factor:\nTheorem 3. Let f = ∑_{S⊆[n]} c_S z^S ∈ R_+[z_1, . . . , z_n] be SLC, and suppose the support of the sum\nis the collection of independent sets of a rank d matroid. Then for 0 ≤ α ≤ 1 and any k ≤ d the\nfollowing polynomial is SLC:\n\nH_{k,α} f(z, y) = ∑_{|S|≤k} c_S^α / (k − |S|)! · z^S y^{k−|S|}.\n\nNotably, Theorem 3 fails for all α > 1; for a proof see Appendix A.2.\nNext, we show that polarization preserves strong log-concavity. 
Polarization essentially means to\nreplace a variable with a higher power by multiple “copies”, each occurring only with power one, in\na way that the resulting polynomial is symmetric (or permutation-invariant) in those copies. This is\nachieved by averaging over elementary symmetric polynomials. Formally, the polarization of the\npolynomial f = ∑_{|S|≤d} c_S z^S y^{d−|S|} ∈ R[z_1, . . . , z_n, y] is defined to be\n\nΠf(z_1, . . . , z_n, y_1, . . . , y_d) = ∑_{|S|≤d} c_S z^S binom(d, |S|)^{-1} e_{d−|S|}(y_1, . . . , y_d),\n\nwhere e_k(y_1, . . . , y_d) is the kth elementary symmetric polynomial in d variables. The polarization\nΠf has the following three properties:\n\n1. It is symmetric in the variables y_1, . . . , y_d;\n2. Setting y_1 = . . . = y_d = y recovers f;\n3. Πf is multiaffine, and hence the generating polynomial of a distribution on 2^[n+d].\n\nClosure under polarization, combined with the homogenization results (Theorems 2 and 3), allows\nnon-homogeneous distributions to be transformed into homogeneous ones. This allows general SLC\ndistributions to be transformed into homogeneous SLC distributions for which fast mixing results are\nknown [5]. How to work backwards to obtain samples from the original distribution will be the topic\nof the next section.\n\nTheorem 4.¹ Let f = ∑_{S⊆[n]} c_S z^S y^{d−|S|} ∈ R_+[z_1, . . . , z_n, y] be SLC, and suppose the support of the sum\nis the collection of independent sets of a rank d matroid. Then the polarization Πf is SLC.\n\nPutting all of the preceding results together we obtain the following important corollary. It is this\nobservation that will allow us to do mode finding for SLC distributions and exponentiated, cardinality\nconstrained SLC distributions.\nCorollary 5. Let f = ∑_{S⊆[n]} c_S z^S ∈ R_+[z_1, . . .
, z_n] be SLC, and suppose the support of the sum\nis the collection of independent sets of a rank d matroid. Then Π(H_{k,α} f) is SLC for any k ≤ d and\n0 ≤ α ≤ 1.\n\nIn Appendix A.4 we also show that SLC distributions are closed under conditioning on a fixed set\nsize. We mention those results since they may be of independent interest, but omit them from the\nmain text since we do not use them further in this paper.\n\n¹This result was independently discovered by Brändén and Huh [10].\n\n4 Sampling from strongly log-concave distributions\n\nIn this section we outline how to use the SLC closure results from Section 3 to build a sampling\nalgorithm for general SLC distributions and prove mixing time bounds. Recall that we are considering\na probability distribution π : 2^[n] → [0, 1] that is strongly log-concave. The mixing time of a Markov\nchain (Q, π) started at S_0 is t_{S_0}(ε) = min{t ∈ N : ‖Q^t(S_0, ·) − π‖_1 ≤ ε}, where Q^t is the\nt-step transition kernel. For the remainder of this section we consider the distribution ν where\nν(S) ∝ π(S)^α 1{|S| ≤ d} for 0 ≤ α ≤ 1 and d ∈ [n]. In particular, this includes π itself. The power\nα allows us to vary the degree of diversity induced by the distribution: α < 1 smooths ν, making it less\ndiverse. Indeed, as α → 0, ν converges to the uniform distribution, which promotes no diversity.\nMeanwhile α > 1 (although outside the scope of our results) makes ν more peaked, with ν collapsing\nto a point mass as α → ∞.\nOur strategy is as follows: we first “extend” ν to a distribution ν_sh over subsets of [n + d] of size d\nto obtain a homogeneous distribution. 
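Property 2 of the polarization above is easy to verify numerically: substituting y_1 = · · · = y_d = y into binom(d, |S|)^{-1} e_{d−|S|}(y_1, . . . , y_d) recovers y^{d−|S|}, since e_k(y, . . . , y) = binom(d, k) y^k. A small sketch of this check:

```python
from itertools import combinations
from math import comb, prod

def esym(k, ys):
    """Elementary symmetric polynomial e_k(y_1, ..., y_d)."""
    return sum(prod(c) for c in combinations(ys, k))

# One monomial of the polarization, with every y_j set to the same value y:
d, s, y = 5, 2, 1.7   # d = matroid rank, s = |S|, y an arbitrary point
lhs = esym(d - s, [y] * d) / comb(d, s)
print(abs(lhs - y ** (d - s)) < 1e-9)  # True
```

Note that comb(d, s) = comb(d, d − s), so the same binomial factor exactly cancels the multiplicity of the symmetric copies.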
If we can sample from ν_sh, then we can extract a sample\nS ⊆ [n] of a scaled version of ν by simply restricting a sample T ∼ ν_sh to T ∩ [n]. If ν were SR, then\nν_sh would also be SR, and a fast sampler follows from this observation [42]. But for general SLC\ndistributions (and their powers), ν_sh is not SLC, and deriving a sampler is more challenging.\nTo still enable the homogenization strategy, we instead derive a carefully scaled version of a homogeneous version of ν that, as we prove, is homogeneous and SLC, and hence tractable. We use this\nrescaled version as a proposal distribution in a sampler for ν_sh. To obtain an appropriately scaled,\nextended, homogeneous variant of ν, we first translate Corollary 5 into probabilistic language.\nTheorem 6. Suppose that the support of the sum in the generating polynomial of ν is the collection\nof independent sets of a rank d matroid. Then for any k ≤ d the following probability distribution on\n2^[n+k] is SLC:\n\nH_kν(S) ∝ binom(k, |S ∩ [n]|)^{-1} ν(S ∩ [n]) / (k − |S ∩ [n]|)!  for all S ⊆ [n + k] such that |S| = k,\nand H_kν(S) = 0 otherwise.\n\nProof. Observe that the generating polynomial of H_kν is Π(H_k f) where f denotes the generating\npolynomial of ν. The result follows immediately from Corollary 5.\n\nThe ultimate proposal that we use is not H_kν, but a modified version μ that better aligns with ν:\n\nμ(S) ∝ (d/e)^{d−|S∩[n]|} H_dν(S).\n\nProposition 7. If ν is SLC, then μ is SLC.\n\nProof. Lemma 39 in the Appendix says that strong log-concavity is preserved under linear transformations of the coordinates. 
This implies that μ is SLC since its generating polynomial is Π((H_d f) ∘ T),\nwhere f is the generating polynomial of ν and T is the linear transformation defined by y ↦ (d/e)·y and\nz_i ↦ z_i for i = 1, . . . , n.\nImportantly, since μ is homogeneous and SLC, the Base Exchange Walk for μ mixes rapidly.\nLet Q denote the Markov transition kernel of Base Exchange Walk on 2^[n+d] for μ. We use Q as\na proposal, and then compute the appropriate acceptance probability to obtain a chain that mixes to\nthe symmetric homogenization ν_sh of ν. The target ν_sh is a d-homogeneous distribution on 2^[n+d]:\n\nν_sh(S) ∝ binom(d, |S ∩ [n]|)^{-1} ν(S ∩ [n]),  for all S ⊆ [n + d] such that |S| = d.\n\nA crucial property of ν_sh is that its marginalization over the “dummy” variables yields ν, i.e.,\n∑_{T : T∩[n]=S} ν_sh(T) = ν(S). Therefore, after obtaining a sample T ∼ ν_sh one obtains a sample\nfrom ν by computing T ∩ [n].\nIt is a simple computation to show that the acceptance probabilities in Algorithm 1 are indeed the\nMetropolis-Hastings acceptance probabilities for sampling from ν_sh using the proposal Q. 
Therefore the chain mixes to ν_sh.\n\nAlgorithm 1 Metropolis-Hastings sampler for ν_sh with proposal Q\n1: Initialize S ⊆ [n + d]\n2: while not mixed do\n3: Propose move T ∼ Q(S, ·)\n4: Set k ← |S ∩ [n]|\n5: if |T ∩ [n]| = k − 1 then\n6: R ← T with probability min{1, (e/d)(d − k + 1)}, otherwise stay at S\n7: if |T ∩ [n]| = k then\n8: R ← T\n9: if |T ∩ [n]| = k + 1 then\n10: R ← T with probability min{1, (d/e) · 1/(d − k)}, otherwise stay at S\n\nWe obtain the following mixing time bound, recalling that the mixing time of\n(Q, ν_sh) is t_{S_0}(ε) = min{t ∈ N : ‖Q^t(S_0, ·) − ν_sh‖_1 ≤ ε}.\nTheorem 8. For d ≥ 8 the mixing time of the chain in Algorithm 1 started at S_0 satisfies the bound\n\nt_{S_0}(ε) ≤ √(2π) (d/e)^{1/2} d^{5/2} 2^d ( log log(1/ν_sh(S_0)) + log(1/(2ε²)) ).\n\nA similar bound holds for d < 8. We note that although the mixing time bound scales poorly in d, the\nbound has the interesting property of being independent of the ground set size n. Furthermore, the\nbound is meaningful since the total number of subsets of n objects of size at most d is ∑_{j=0}^d binom(n, j), which is Ω(2^n)\nif d ≥ n/2 − √n and Ω(2^{n/2}) if d ≤ n/2 − √n [36]. So the mixing time bound is\nexponentially better than brute force. Later we detail experiments that suggest this bound is\nloose in d.\nEfficient implementation. It is sufficient to only maintain R = S ∩ [n] since ν_sh is exchangeable\nin the variables {n + 1, . . . , n + d}. 
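In code, the acceptance rule of Algorithm 1 amounts to a one-line case split on how the proposal changes |S ∩ [n]| (a sketch of our reading of the update rule; the function name is ours):

```python
import math

def acceptance_prob(k_cur, k_prop, d):
    """Metropolis-Hastings acceptance probability in Algorithm 1 (sketch).
    k_cur = |S ∩ [n]| for the current state, k_prop = |T ∩ [n]| for the
    proposed state, d = cardinality budget."""
    if k_prop == k_cur:
        return 1.0                                          # same size: accept
    if k_prop == k_cur - 1:
        return min(1.0, (math.e / d) * (d - k_cur + 1))     # one more dummy
    if k_prop == k_cur + 1:
        return min(1.0, d / (math.e * (d - k_cur)))         # one fewer dummy
    raise ValueError("Base Exchange Walk changes |T ∩ [n]| by at most 1")
```

For a pair of mutually reverse moves the two unclipped ratios multiply to 1, so at most one of the two directions is ever rejected.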
Sampling T ∼ Q(S, ·) involves dropping i ∈ S uniformly at\nrandom, then computing the probability μ((S \\ i) ∪ j) for each j not in S \\ i. However, again by the\nexchangeability of μ in {n + 1, . . . , n + d}, this probability is the same for each j in {n + 1, . . . , n + d},\nand so the computation only needs to be performed for one such j.\n\n5 Maximization of weakly log-submodular functions\n\nIn this section we explore the negative dependence properties of SLC functions (unnormalized SLC\ndistributions) through the lens of submodularity, a well known negative dependence property [8].\nIn an earlier version of this paper we conjectured that SLC functions have the strong property of\nlog-submodularity. This conjecture has been disproved in a recent note [28].\nProposition 9 (Propositions 1 and 2, [28]). The distribution with generating polynomial\n\nf(x, y, z) = (1/22)(4 + 3(x + y + z) + 3(xy + xz + yz))\n\nis SLC but not log-submodular.\n\nIn response, we introduce a new notion of weak submodularity and show that any function ν such\nthat H_dν is SLC is weakly log-submodular. Finally, we prove that a distorted greedy optimization\nprocedure leads to optimization guarantees for weakly (log-)submodular functions for the cardinality\nconstrained problem OPT ∈ arg max_{|S|≤k} ν(S). Appendix C contains similar results for constrained\ngreedy optimization of increasing weakly (log-)submodular functions and unconstrained double greedy\noptimization of non-negative (log-)submodular functions.\nDefinition 10. 
We call a function ρ : 2^[n] → R γ-weakly submodular if for any S ⊆ [n] and\ndistinct i, j ∈ [n] \\ S we have\n\nρ(S) + ρ(S ∪ {i, j}) ≤ γ + ρ(S ∪ i) + ρ(S ∪ j).\n\nWe say ν : 2^[n] → R_+ is γ-weakly log-submodular if log ν is (log γ)-weakly submodular.\nWhen γ = 0 this reduces to the classic notion of submodularity. Note carefully that our notion of\nweak submodularity differs from a notion of weak submodularity that already appears in the literature\n[16, 31, 37]. Building on a result by Brändén and Huh [10], we prove the following result.\nTheorem 11. Any non-negative function ρ : 2^[n] → R_+ with support contained in {S ⊆ [n] : |S| ≤\nd} and generating polynomial f such that H_d f is strongly log-concave is γ-weakly log-submodular\nfor γ = 4(1 − 1/d).\n\nNext we show how weak log-submodularity gives a path to optimizing strongly log-concave functions.\nConsider ρ : 2^[n] → R, assumed to be γ-weakly submodular. Note in particular that we do not assume\nρ is non-negative. This is important since we are interested in applying this procedure to the\nlogarithm of a distribution, which need not be non-negative. Define c_e = max{ρ([n] \\ e) − ρ([n]), 0}\nand c(S) = ∑_{e∈S} c_e, with the convention that c(∅) = 0. Then we may decompose ρ = η − c,\nwhere η = ρ + c. Note that η is γ-weakly submodular and c is a non-negative function.\nWe extend the distorted greedy algorithm of [25, 31] to our notion of weak submodularity. To\ndo so, we introduce the distorted objective Φ_i(S) = (1 − 1/k)^{k−i} η(S) − c(S) for i = 0, . . . , k.\nThe distorted greedy algorithm greedily builds a set R of size at most k by forming a sequence\n∅ = S_0, S_1, . . . 
, S_{k−1}, S_k = R such that S_{i+1} is formed by adding the element e_i ∈ [n] to S_i that\nmaximizes Φ_{i+1}(S_i ∪ e_i) − Φ_{i+1}(S_i), so long as the increment is positive.\n\nAlgorithm 2 Distorted greedy weakly submodular constrained maximization of ρ = η − c\n1: Let S_0 = ∅\n2: for i = 0, . . . , k − 1 do\n3: Set e_i = arg max_{e∈[n]} Φ_{i+1}(S_i ∪ e) − Φ_{i+1}(S_i)\n4: if Φ_{i+1}(S_i ∪ e_i) − Φ_{i+1}(S_i) > 0 then\n5: S_{i+1} ← S_i ∪ e_i\n6: else S_{i+1} ← S_i\n7: return R = S_k\n\nTheorem 12. Suppose ρ : 2^[n] → R is γ-weakly submodular and ρ(∅) = 0. Then the solution\nR = S_k obtained by the distorted greedy algorithm satisfies\n\nρ(R) = η(R) − c(R) ≥ (1 − 1/e)(η(OPT) − (1/2)ℓ(ℓ − 1)γ) − c(OPT),\n\nwhere OPT ∈ arg max_{|S|≤k} ρ(S) and ℓ := |OPT| ≤ k.\nNote any weakly submodular function can be brought into the required form by subtracting ρ(∅) if it\nis non-zero. If ν is weakly log-submodular, we can decompose ν = η/c such that log η and log c\nplay the same roles as η and c did in the weakly submodular setting. Then by applying Theorem\n12 to log ν we obtain the following corollary.\nCorollary 13. Suppose ν : 2^[n] → R_+ is γ-weakly log-submodular and ν(∅) = 1. Then the solution\nR = S_k obtained by the distorted greedy algorithm satisfies\n\nν(R) = η(R)/c(R) ≥ γ^{−(1/2)ℓ(ℓ−1)(1−1/e)} η(OPT)^{1−1/e} / c(OPT).\n\n6 Experiments\n\nIn this section we empirically evaluate the mixing time of Algorithm 1. We use the standard potential\nscale reduction factor (PSRF) metric to measure convergence to the stationary distribution [11]. 
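Before turning to the experiments, the distorted greedy procedure of Algorithm 2 above is short enough to sketch directly; `eta` and `cost` are hypothetical stand-ins for η and c:

```python
def distorted_greedy(eta, cost, n, k):
    """Distorted greedy maximization of rho = eta - cost subject to |S| <= k
    (a sketch of Algorithm 2). eta, cost: set functions on frozensets."""
    def phi(i, S):
        # distorted objective Phi_i(S) = (1 - 1/k)^(k - i) * eta(S) - cost(S)
        return (1.0 - 1.0 / k) ** (k - i) * eta(S) - cost(S)

    S = frozenset()
    for i in range(k):
        # best marginal gain of the distorted objective over all elements
        gains = {e: phi(i + 1, S | {e}) - phi(i + 1, S) for e in range(n)}
        e_best = max(gains, key=gains.get)
        if gains[e_best] > 0:        # only add when the increment is positive
            S = S | {e_best}
    return S
```

When cost ≡ 0 and eta is modular, the sketch reduces to picking the k largest weights, which is a convenient sanity check.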
The method\ninvolves running several chains in parallel and computing the average variance within each chain and\nbetween the chains. The PSRF score is the ratio of the between-chain variance to the within-chain variance and\nis usually above 1. When the PSRF score is close to 1 the chains are considered to be mixed. In\nall of our experiments we run three chains in parallel and declare them to be mixed once the PSRF\nscore drops below 1.05.\nFigure 1 considers the results of running the Metropolis-Hastings algorithm on a sequence of\nproblems with different cardinality constraints d.\n\nFigure 1: Empirical mixing time analysis for sampling a ground set of size n = 250 and various\ncardinality constraints d: (a) the PSRF score for each set of chains, (b) the approximate mixing time\nobtained by thresholding at PSRF equal to 1.05.\n\nFigure 2: (a, b) Empirical mixing time analysis for sampling a set of size at most d = 40 for varying\nground set sizes: (a) the PSRF score for each set of chains, (b) the approximate mixing time obtained\nby thresholding at PSRF equal to 1.05, (c) comparison of Algorithm 1 and a M-H algorithm where\nthe proposal is built using H_dν; d = 100 and n = 250.\n\nIn each case we considered the distribution ν(S) ∝ √det(L_S) 1{|S| ≤ d} where L is a randomly generated 250 × 250 PSD matrix. Here L_S\ndenotes the |S| × |S| submatrix of L whose indices belong to S. These simulations suggest that the\nmixing time grows linearly in d for a fixed n.\nFigure 2 considers the results of running the Metropolis-Hastings algorithm on a sequence of\nproblems with different ground set sizes. 
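The PSRF computation described above can be sketched as follows; this is a simplified textbook variant (between-chain over within-chain variance), and practical implementations such as the one cited as [11] include further corrections:

```python
def psrf(chains):
    """Potential scale reduction factor for several chains of a scalar
    statistic (simplified sketch)."""
    m = len(chains)               # number of chains
    n = len(chains[0])            # samples per chain
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)    # between-chain
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m                # within-chain
    var_hat = (n - 1) / n * W + B / n   # pooled variance estimate
    return (var_hat / W) ** 0.5
```

Chains that agree give a score near (or slightly below) 1, while chains stuck in different regions give a score well above 1, matching the mixed/unmixed reading used in the experiments.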
In each case we considered the distribution ν(S) ∝ √det(L_S) 1{|S| ≤ 40}, where L is a randomly\ngenerated PSD matrix of the appropriate size n.\nThese simulations suggest that the mixing time grows sublinearly in n for a fixed d.\nIt is important to know whether the mixing time is robust to different spectra σ_L of L. We consider\nthree cases: (i) smooth decay, σ_L = [n]; (ii) a single large eigenvalue, σ_L = {n, (n − 1)/2, (n −\n2)/2, . . . , 2/2, 1/2}; and (iii) one fifth of the eigenvalues equal to n, the rest equal to 1/n. Note\nthat due to normalization, multiplying the spectrum by a constant does not affect the resulting\ndistribution. The results for (i) are the content of Figures 1 and 2 (a, b). Figures 3 and 4 show the\nresults for (ii), and Figures 5 and 6 show the results for (iii). Figures 3–6 can be found in Appendix D.\nFinally, we address the question of why the proposal distribution was built using the particular choice\nof μ we made. Indeed, one may use Base Exchange Walk for any homogeneous distribution on\n2^[n] to build a sampler; one simply needs to compute the appropriate acceptance probabilities. We\nrestrict our attention to SLC distributions so as to be able to build on the recent mixing time results\nfor homogeneous SLC distributions. An obvious alternative to using μ to build the proposal is to use\nH_dν. Figure 2(c) compares the empirical mixing time of these two chains. 
The strong empirical improvement justifies our choice of adding the extra rescaling factor d/e.

7 Discussion

In this paper we introduced strongly log-concave distributions as a promising class of models for diversity. They have flexibility beyond that of strongly Rayleigh distributions, e.g., via exponentiated and cardinality-constrained distributions (which do not preserve the SR property). We derived a suite of MCMC samplers for general SLC distributions and associated mixing time bounds. For optimization, we showed that SLC distributions satisfy a weak submodularity property and used this to prove mode finding guarantees.
Still, many open problems remain. Although the mixing time bound has the interesting property of not directly depending on n, the O(2^d) dependence seems quite conservative compared to the empirical mixing time results. An important future direction would be to close this gap. More fundamentally, the negative dependence properties of SLC distributions need to be explored in greater detail. Finally, in order for SLC models to be deployed in practice the user needs a way to learn a good SLC model from data, a non-trivial task in general since SLC distributions are non-parametric. However, both exponentiation and the cardinality constraint add only a single parameter that must be learned.
We leave the question of how best to learn these parameters as an important topic for future work.

Acknowledgements

This work was supported by an NSF-BIGDATA award and the Defense Advanced Research Projects Agency (grant number YFA17 N66001-17-1-4039). The views, opinions, and/or findings contained in this article are those of the authors and should not be interpreted as representing the official views or policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the Department of Defense. We thank Matt Staib for helpful comments on the draft.

References
[1] Karim Adiprasito, June Huh, and Eric Katz. Hodge theory for combinatorial geometries. Annals of Mathematics, 188(2):381–452, 2018.
[2] Nima Anari, Shayan Oveis Gharan, and Alireza Rezaei. Monte Carlo Markov chain algorithms for sampling strongly Rayleigh distributions and determinantal point processes. In Conference on Learning Theory, pages 103–115, 2016.
[3] Nima Anari, Shayan Oveis Gharan, and Cynthia Vinzant. Log-concave polynomials, entropy, and a deterministic approximation algorithm for counting bases of matroids. In Annual Symposium on Foundations of Computer Science, pages 35–46. IEEE, 2018.
[4] Nima Anari, Kuikui Liu, Shayan Oveis Gharan, and Cynthia Vinzant. Log-Concave Polynomials III: Mason's Ultra-Log-Concavity Conjecture for Independent Sets of Matroids. arXiv:1811.01600, 2018.
[5] Nima Anari, Kuikui Liu, Shayan Oveis Gharan, and Cynthia Vinzant. Log-Concave Polynomials II: High-Dimensional Walks and an FPRAS for Counting Bases of a Matroid. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing. ACM, June 2019.
[6] Ravindra B Bapat and T. E. S. Raghavan. Nonnegative matrices and applications, volume 64. Cambridge University Press, 1997.
[7] Christian Berg, Jens Peter Reus Christensen, and Paul Ressel.
Harmonic analysis on semigroups: theory of positive definite and related functions, volume 100. Springer, 1984.
[8] Julius Borcea, Petter Brändén, and Thomas Liggett. Negative Dependence and the Geometry of Polynomials. Journal of the American Mathematical Society, 22(2):521–567, 2009.
[9] Petter Brändén. Polynomials with the half-plane property and matroid theory. Advances in Mathematics, 216(1):302–320, 2007.
[10] Petter Brändén and June Huh. Lorentzian polynomials. arXiv:1902.03719, 2019.
[11] Stephen P Brooks and Andrew Gelman. General Methods for Monitoring Convergence of Iterative Simulations. Journal of Computational and Graphical Statistics, 7(4):434–455, 1998.
[12] Niv Buchbinder, Moran Feldman, Joseph Seffi, and Roy Schwartz. A Tight Linear Time (1/2)-Approximation for Unconstrained Submodular Maximization. SIAM Journal on Computing, 44(5):1384–1402, 2015.
[13] L Elisa Celis, Vijay Keswani, Damian Straszak, Amit Deshpande, Tarun Kathuria, and Nisheeth K Vishnoi. Fair and diverse DPP-based data summarization. arXiv:1802.04023, 2018.
[14] Ali Civril and Malik Magdon-Ismail. Exponential inapproximability of selecting a maximum volume sub-matrix. Algorithmica, 65(1):159–176, 2013.
[15] Mary Cryan, Heng Guo, and Giorgos Mousa. Modified log-Sobolev inequalities for strongly log-concave distributions. arXiv:1903.06081, 2019.
[16] Abhimanyu Das and David Kempe. Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. In International Conference on Machine Learning, 2011.
[17] Michal Derezinski and Manfred K Warmuth. Unbiased Estimates for Linear Regression via Volume Sampling. In Advances in Neural Information Processing Systems, pages 3084–3093, 2017.
[18] Persi Diaconis, Daniel Stroock, et al.
Geometric Bounds for Eigenvalues of Markov Chains. The Annals of Applied Probability, 1(1):36–61, 1991.
[19] Persi Diaconis, Laurent Saloff-Coste, et al. Logarithmic Sobolev Inequalities for Finite Markov Chains. The Annals of Applied Probability, 6(3):695–750, 1996.
[20] Josip Djolonga and Andreas Krause. From MAP to marginals: Variational inference in Bayesian submodular models. In Neural Information Processing Systems (NIPS), 2014.
[21] Josip Djolonga and Andreas Krause. Scalable variational inference in log-supermodular models. In International Conference on Machine Learning (ICML), 2015.
[22] Josip Djolonga, Stefanie Jegelka, and Andreas Krause. Provable variational inference for constrained log-submodular models. In Neural Information Processing Systems (NeurIPS), 2018.
[23] Christophe Dupuy and Francis Bach. Learning Determinantal Point Processes in Sublinear Time. Proceedings of the International Conference on Artificial Intelligence and Statistics, 2018.
[24] Mohamed Elfeki, Camille Couprie, Morgane Riviere, and Mohamed Elhoseiny. GDPP: Learning Diverse Generations Using Determinantal Point Process. arXiv:1812.00068, 2018.
[25] Moran Feldman. Guess free maximization of submodular and linear sums. arXiv:1810.03813, 2018.
[26] Mike Gartrell, Ulrich Paquet, and Noam Koenigstein. Low-rank Factorization of Determinantal Point Processes. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[27] Jennifer Gillenwater, Alex Kulesza, and Ben Taskar. Near-optimal MAP inference for determinantal point processes. In Advances in Neural Information Processing Systems, pages 2735–2743, 2012.
[28] Alkis Gotovos. Strong log-concavity does not imply log-submodularity. arXiv, 2019.
[29] Alkis Gotovos, S. Hamed Hassani, and Andreas Krause. Sampling from probabilistic submodular models. In Neural Information Processing Systems (NIPS), 2015.
[30] Leonid Gurvits.
On multivariate Newton-like inequalities. In Advances in Combinatorial Mathematics, pages 61–78. Springer, 2009.
[31] Christopher Harshaw, Moran Feldman, Justin Ward, and Amin Karbasi. Submodular maximization beyond non-negativity: Guarantees, fast algorithms, and applications. arXiv:1904.09354, 2019.
[32] J. Ben Hough, Manjunath Krishnapur, Yuval Peres, and Bálint Virág. Determinantal Processes and Independence. Probab. Surveys, 3:206–229, 2006.
[33] June Huh. Combinatorial applications of the Hodge-Riemann relations. Proceedings of the International Congress of Mathematicians, 2018.
[34] Rishabh Iyer and Jeffrey Bilmes. Submodular point processes with applications to machine learning. In Conference on Artificial Intelligence and Statistics (AISTATS), 2015.
[35] Stefanie Jegelka and Suvrit Sra. Negative dependence, stable polynomials, and all that. NeurIPS 2018 Tutorial, 2018.
[36] Emil Jeřábek. Dual weak pigeonhole principle, Boolean complexity, and derandomization. Annals of Pure and Applied Logic, 129(1-3):1–37, 2004.
[37] Rajiv Khanna, Ethan Elenberg, Alexandros G Dimakis, Sahand Negahban, and Joydeep Ghosh. Scalable greedy feature selection via weak submodularity. arXiv:1703.02723, 2017.
[38] Alex Kulesza and Ben Taskar. k-DPPs: Fixed-size determinantal point processes. In Proceedings of the 28th International Conference on Machine Learning, pages 1193–1200, 2011.
[39] Alex Kulesza, Ben Taskar, et al. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2–3):123–286, 2012.
[40] James T Kwok and Ryan P Adams. Priors for diversity in generative latent variable models. In Advances in Neural Information Processing Systems, pages 2996–3004, 2012.
[41] Chengtao Li, Stefanie Jegelka, and Suvrit Sra. Fast DPP Sampling for Nyström with Application to Kernel Methods.
In International Conference on Machine Learning, pages 2061–2070, 2016.
[42] Chengtao Li, Suvrit Sra, and Stefanie Jegelka. Fast mixing Markov chains for strongly Rayleigh measures, DPPs, and constrained sampling. In Advances in Neural Information Processing Systems, pages 4188–4196, 2016.
[43] Chengtao Li, Stefanie Jegelka, and Suvrit Sra. Polynomial time algorithms for dual volume sampling. In Advances in Neural Information Processing Systems, pages 5038–5047, 2017.
[44] Hui Lin and Jeff Bilmes. Learning mixtures of submodular shells with application to document summarization. In Uncertainty in Artificial Intelligence (UAI), 2012.
[45] Zelda Mariet and Suvrit Sra. Fixed-point algorithms for learning determinantal point processes. In International Conference on Machine Learning, pages 2389–2397, 2015.
[46] Zelda Mariet and Suvrit Sra. Diversity networks: Neural network compression using determinantal point processes. International Conference on Learning Representations, 2016.
[47] Zelda Mariet and Suvrit Sra. Kronecker determinantal point processes. In Advances in Neural Information Processing Systems, pages 2694–2702, 2016.
[48] Zelda Mariet and Suvrit Sra. Elementary symmetric polynomials for optimal experimental design. In Advances in Neural Information Processing Systems, 2017.
[49] Zelda Mariet, Suvrit Sra, and Stefanie Jegelka. Exponentiated Strongly Rayleigh Distributions. In Advances in Neural Information Processing Systems, pages 4459–4469, 2018.
[50] Zelda Mariet, Yaniv Ovadia, and Jasper Snoek. DPPNet: Approximating Determinantal Point Processes with Deep Networks. arXiv:1901.02051, 2019.
[51] Anton Rodomanov and Dmitry Kropotov. A randomized coordinate descent method with volume sampling.
arXiv:1904.04587, 2019.