{"title": "Fast Mixing Markov Chains for Strongly Rayleigh Measures, DPPs, and Constrained Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 4188, "page_last": 4196, "abstract": "We study probability measures induced by set functions with constraints. Such measures arise in a variety of real-world settings, where prior knowledge, resource limitations, or other pragmatic considerations impose constraints. We consider the task of rapidly sampling from such constrained measures, and develop fast Markov chain samplers for them. Our first main result is for MCMC sampling from Strongly Rayleigh (SR) measures, for which we present sharp polynomial bounds on the mixing time. As a corollary, this result yields a fast mixing sampler for Determinantal Point Processes (DPPs), yielding (to our knowledge) the first provably fast MCMC sampler for DPPs since their inception over four decades ago. Beyond SR measures, we develop MCMC samplers for probabilistic models with hard constraints and identify sufficient conditions under which their chains mix rapidly. We illustrate our claims by empirically verifying the dependence of mixing times on the key factors governing our theoretical bounds.", "full_text": "Fast Mixing Markov Chains for Strongly Rayleigh Measures, DPPs, and Constrained Sampling\n\nChengtao Li\n\nMIT\n\nctli@mit.edu\n\nStefanie Jegelka\n\nMIT\n\nstefje@csail.mit.edu\n\nSuvrit Sra\n\nMIT\n\nsuvrit@mit.edu\n\nAbstract\n\nWe study probability measures induced by set functions with constraints. Such\nmeasures arise in a variety of real-world settings, where prior knowledge, resource\nlimitations, or other pragmatic considerations impose constraints. We consider\nthe task of rapidly sampling from such constrained measures, and develop fast\nMarkov chain samplers for them. Our first main result is for MCMC sampling\nfrom Strongly Rayleigh (SR) measures, for which we present sharp polynomial\nbounds on the mixing time. 
As a corollary, this result yields a fast mixing sampler\nfor Determinantal Point Processes (DPPs), yielding (to our knowledge) the first\nprovably fast MCMC sampler for DPPs since their inception over four decades\nago. Beyond SR measures, we develop MCMC samplers for probabilistic models\nwith hard constraints and identify sufficient conditions under which their chains\nmix rapidly. We illustrate our claims by empirically verifying the dependence of\nmixing times on the key factors governing our theoretical bounds.\n\n1 Introduction\n\nDistributions over subsets of objects arise in a variety of machine learning applications. They occur\nas discrete probabilistic models [5, 20, 28, 36, 38] in computer vision, computational biology and\nnatural language processing. They also occur in combinatorial bandit learning [9], as well as in recent\napplications to neural network compression [32] and matrix approximations [29].\nYet, practical use of discrete distributions can be hampered by computational challenges due to\ntheir combinatorial nature. Consider for instance sampling, a task fundamental to learning, optimization,\nand approximation. Without further restrictions, efficient sampling can be impossible\n[13]. Several lines of work thus focus on identifying tractable sub-classes, which in turn have had\nwide-ranging impacts on modeling and algorithms. Important examples include the Ising model [22],\nmatchings (and the matrix permanent) [23], spanning trees (and graph algorithms) [2, 6, 16, 37],\nand Determinantal Point Processes (DPPs) that have gained substantial attention in machine learning\n[3, 17, 24, 26, 28, 30].\nIn this work, we extend the classes of tractable discrete distributions. Specifically, we consider\nthe following two classes of distributions on $2^V$ (the set of subsets of a ground set $V = [N] := \\{1, \\ldots, N\\}$):\n(1) strongly Rayleigh (SR) measures, and (2) distributions with certain cardinality or\nmatroid constraints. 
We analyze Markov chains for sampling from both classes. As a byproduct of\nour analysis, we answer a long-standing question about rapid mixing of MCMC sampling from DPPs.\nSR measures are defined by strong negative correlations, and have recently emerged as valuable\ntools in the design of algorithms [2], in the theory of polynomials and combinatorics [4], and in\nmachine learning through DPPs, a special case of SR distributions. Our first main result is the first\npolynomial-time sampling algorithm that applies to all SR measures (and thus a fortiori to DPPs).\nGeneral distributions on $2^V$ with constrained support (case (2) above) typically arise upon incorporating\nprior knowledge or resource constraints. We focus on resource constraints such as bounds on\ncardinality and bounds on including limited items from sub-groups. Such constraints can be phrased\nas a family $\\mathcal{C} \\subseteq 2^V$ of subsets; we say $S$ satisfies the constraint $\\mathcal{C}$ iff $S \\in \\mathcal{C}$. Then the distribution of\ninterest is of the form\n\n$$\\pi_{\\mathcal{C}}(S) \\propto \\exp(\\beta F(S)) \\, \\llbracket S \\in \\mathcal{C} \\rrbracket, \\tag{1.1}$$\n\nwhere $F : 2^V \\to \\mathbb{R}$ is a set function that encodes relationships between items $i \\in V$, $\\llbracket \\cdot \\rrbracket$ is the\nIverson bracket, and $\\beta$ is a constant (also referred to as the inverse temperature). Most prior work on\nsampling with combinatorial constraints (such as sampling the bases of a matroid) assumes that\n$F$ breaks up linearly using element-wise weights $w_i$, i.e., $F(S) = \\sum_{i \\in S} w_i$. In contrast, we allow\ngeneric, nonlinear functions, and obtain mixing times governed by structural properties of $F$.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nContributions. We briefly summarize the key contributions of this paper below.\n\u2013 We derive a provably fast mixing Markov chain for efficient sampling from a strongly Rayleigh\nmeasure $\\pi$ (Theorem 2). This Markov chain is novel and may be of independent interest. 
Our\nresults provide the first polynomial guarantee (to our knowledge) for Markov chain sampling from\na general DPP, and more generally from an SR distribution (see footnote 1).\n\u2013 We analyze (Theorem 4) mixing times of an exchange chain when the constraint family $\\mathcal{C}$ is the\nset of bases of a special matroid, i.e., $|S| = k$ or $S$ obeys a partition constraint. Both of these\nconstraints have high practical relevance [25, 27, 38].\n\u2013 We analyze (Theorem 6) mixing times of an add-delete chain for the case $|S| \\leq k$, which, perhaps\nsurprisingly, turns out to be quite different from $|S| = k$. This constraint can be more practical\nthan the strict choice $|S| = k$, because in many applications, the user may have an upper bound on\nthe budget, but may not necessarily want to expend all $k$ units.\n\nFinally, a detailed set of experiments illustrates our theoretical results.\n\nRelated work. Recent work in machine learning addresses sampling from distributions with sub- or\nsupermodular $F$ [19, 34], determinantal point processes [3, 29], and sampling by optimization [14, 31].\nMany of these works (necessarily) make additional assumptions on $\\pi_{\\mathcal{C}}$, or are approximate, or cannot\nhandle constraints. Moreover, the constraints cannot easily be included in $F$: an out-of-the-box\napplication of the result in [19], for instance, would lead to an unbounded constant in the mixing\ntime.\nApart from sampling, other related directions include work on variational inference for combinatorial\ndistributions [5, 11, 36, 38] and inference for submodular processes [21]. 
Special instances of (1.1)\ninclude [27], where the authors limit DPPs to sets that satisfy $|S| = k$; partition matroid constraints\nare studied in [25], while the budget constraint $|S| \\leq k$ has been used recently in learning DPPs [17].\nImportant existing results show fast mixing for a sub-family of strongly Rayleigh distributions [3, 15];\nbut those results do not include, for instance, general DPPs.\n\n1.1 Background and Formal Setup\nBefore describing the details of our new contributions, let us briefly recall some useful background\nthat also serves to set the notation. Our focus is on sampling from $\\pi_{\\mathcal{C}}$ in (1.1); we write\n$Z = \\sum_{S \\subseteq V} \\exp(\\beta F(S))$ and $Z_{\\mathcal{C}} = \\sum_{S \\in \\mathcal{C}} \\exp(\\beta F(S))$ for the corresponding normalization constants. The simplest example of $\\pi_{\\mathcal{C}}$ is the\nuniform distribution over sets in $\\mathcal{C}$, where $F(S)$ is constant. In general, $F$ may be highly nonlinear.\nWe sample from $\\pi_{\\mathcal{C}}$ using MCMC, i.e., we run a Markov chain with state space $\\mathcal{C}$. All our chains\nare ergodic. The mixing time of the chain indicates the number of iterations $t$ that we must perform\n(after starting from an arbitrary set $X_0 \\in \\mathcal{C}$) before we can consider $X_t$ as a valid sample from $\\pi_{\\mathcal{C}}$.\nFormally, if $\\Delta_{X_0}(t)$ is the total variation distance between the distribution of $X_t$ and $\\pi_{\\mathcal{C}}$ after $t$ steps,\nthen $\\tau_{X_0}(\\varepsilon) = \\min\\{t : \\Delta_{X_0}(t') \\leq \\varepsilon, \\ \\forall t' \\geq t\\}$ is the mixing time to sample from a distribution\n$\\varepsilon$-close to $\\pi_{\\mathcal{C}}$ in terms of total variation distance. We say that the chain mixes fast if $\\tau_{X_0}$ is polynomial\nin $N$. The mixing time can be bounded in terms of the eigenvalues of the transition matrix, as the\nfollowing classic result shows:\nTheorem 1 (Mixing Time [10]). Let $\\lambda_i$ be the eigenvalues of the transition matrix, and $\\lambda_{\\max} = \\max\\{\\lambda_2, |\\lambda_N|\\} < 1$. 
Then, the mixing time starting from an initial set $X_0 \\in \\mathcal{C}$ is bounded as\n\n$$\\tau_{X_0}(\\varepsilon) \\leq (1 - \\lambda_{\\max})^{-1} (\\log \\pi_{\\mathcal{C}}(X_0)^{-1} + \\log \\varepsilon^{-1}).$$\n\nMost of the effort in bounding mixing times hence is devoted to bounding this eigenvalue.\n\nFootnote 1: The analysis in [24] is not correct since it relies on a wrong construction of path coupling.\n\n2 Sampling from Strongly Rayleigh Distributions\n\nIn this section, we consider sampling from strongly Rayleigh (SR) distributions. Such distributions\ncapture the strongest form of negative dependence properties, while enjoying a host of other remarkable\nproperties [4]. For instance, they include the widely used DPPs as a special case. A distribution\nis SR if its generating polynomial $p_\\pi : \\mathbb{C}^N \\to \\mathbb{C}$, $p_\\pi(z) = \\sum_{S \\subseteq V} \\pi(S) \\prod_{i \\in S} z_i$, is real stable. This\nmeans that if $\\Im(z_i) > 0$ for all arguments $z_i$ of $p_\\pi(z)$, then $p_\\pi(z) \\neq 0$.\nWe show in particular that SR distributions are amenable to efficient Markov chain sampling. Our\nstarting point is the observation of [4] on closure properties of SR measures; of these we use symmetric\nhomogenization. Given a distribution $\\pi$ on $2^{[N]}$, its symmetric homogenization $\\pi_{sh}$ on $2^{[2N]}$ is\n\n$$\\pi_{sh}(S) := \\begin{cases} \\pi(S \\cap [N]) \\binom{N}{|S \\cap [N]|}^{-1} & \\text{if } |S| = N; \\\\ 0 & \\text{otherwise.} \\end{cases}$$\n\nIf $\\pi$ is SR, so is $\\pi_{sh}$. We use this property below in our derivation of a fast-mixing chain.\nWe use here a recent result of Anari et al. [3], who show a Markov chain that mixes rapidly for\nhomogeneous SR distributions. These distributions are over all subsets $S \\subseteq V$ of some fixed size\n$|S| = k$, and hence do not include general DPPs. Concretely, for any $k$-homogeneous SR distribution\n$\\pi : \\{0, 1\\}^N \\to \\mathbb{R}_+$, a Gibbs-exchange sampler has mixing time\n\n$$\\tau_{X_0}(\\varepsilon) \\leq 2k(N - k)(\\log \\pi(X_0)^{-1} + \\log \\varepsilon^{-1}).$$\n\nThis sampler uniformly samples one item in the current set, and one outside the current set, and\nswaps them with an appropriate probability. 
Using these ideas we show how to obtain fast mixing\nchains for any general SR distribution $\\pi$ on $[N]$. First, we construct its symmetric homogenization\n$\\pi_{sh}$, and sample from $\\pi_{sh}$ using a Gibbs-exchange sampler. This chain is fast mixing, thus we will\nefficiently get a sample $T \\sim \\pi_{sh}$. The corresponding sample for $\\pi$ can then be obtained by computing\n$S = T \\cap V$. Theorem 2, proved in the appendix, formally establishes the validity of this idea.\nTheorem 2. If $\\pi$ is SR, then the mixing time of a Gibbs-exchange sampler for $\\pi_{sh}$ is bounded as\n\n$$\\tau_{X_0}(\\varepsilon) \\leq 2N^2 \\left( \\log \\binom{N}{|X_0|} + \\log (\\pi(X_0))^{-1} + \\log \\varepsilon^{-1} \\right). \\tag{2.1}$$\n\nFor Theorem 2 we may choose the initial set such that $X_0$ makes the first term in the sum logarithmic\nin $N$ ($X_0 = T_0 \\cap V$ in Algorithm 1).\n\nAlgorithm 1 Markov Chain for Strongly Rayleigh Distributions\nRequire: SR distribution $\\pi$\n  Initialize $T \\subseteq [2N]$ where $|T| = N$ and take $S = T \\cap V$\n  while not mixed do\n    Draw $q \\sim \\mathrm{Unif}[0, 1]$\n    Draw $t \\in V \\setminus S$ and $s \\in S$ uniformly at random\n    if $q \\in \\left[0, \\frac{(N-|S|)^2}{2N^2}\\right)$ then\n      $S = S \\cup \\{t\\}$ with probability $\\min\\left\\{1, \\frac{\\pi(S \\cup \\{t\\})}{\\pi(S)} \\times \\frac{|S|+1}{N-|S|}\\right\\}$    \u25b7 Add $t$\n    else if $q \\in \\left[\\frac{(N-|S|)^2}{2N^2}, \\frac{N-|S|}{2N}\\right)$ then\n      $S = S \\cup \\{t\\} \\setminus \\{s\\}$ with probability $\\min\\left\\{1, \\frac{\\pi(S \\cup \\{t\\} \\setminus \\{s\\})}{\\pi(S)}\\right\\}$    \u25b7 Exchange $s$ with $t$\n    else if $q \\in \\left[\\frac{N-|S|}{2N}, \\frac{|S|^2 + N(N-|S|)}{2N^2}\\right)$ then\n      $S = S \\setminus \\{s\\}$ with probability $\\min\\left\\{1, \\frac{\\pi(S \\setminus \\{s\\})}{\\pi(S)} \\times \\frac{N-|S|+1}{|S|}\\right\\}$    \u25b7 Delete $s$\n    else\n      Do nothing\n    end if\n  end while\n\nEfficient Implementation. Directly running a chain to sample $N$ items from a (doubled) set of size\n$2N$ adds some computational overhead. Hence, we construct an equivalent, more space-efficient\nchain (Algorithm 1) on the initial ground set $V = [N]$ that only maintains $S \\subseteq V$. Interestingly,\nthis sampler is a mixture of add-delete and Gibbs-exchange samplers. 
This combination makes sense\nintuitively, too: add-delete moves (also shown in Alg. 3) are needed since the exchange sampler\ncannot change the cardinality of $S$. But a pure add-delete chain can stall if the sets concentrate\naround a fixed cardinality (low probability of a larger or smaller set). Exchange moves will not\nsuffer the same high rejection rates. The key idea underlying Algorithm 1 is that the elements\nin $\\{N + 1, \\ldots, 2N\\}$ are indistinguishable, so it suffices to maintain merely the cardinality of the\ncurrently selected subset instead of all its indices. Appendix C contains a detailed proof.\nCorollary 3. The bound (2.1) applies to the mixing time of Algorithm 1.\nRemarks. By assuming $\\pi$ is SR, we obtain a clean bound for fast mixing. Compared to the bound\nin [19], our result avoids the somewhat opaque factor $\\exp(\\beta \\zeta_F)$ that depends on $F$.\nIn certain cases, the above chain may mix slower in practice than a pure add-delete chain that was\nused in previous works [19, 24], since its probability of doing nothing is higher. In other cases, it\nmixes much faster than the pure add-delete chain; we observe both phenomena in our experiments in\nSec. 4. In contrast to a simple add-delete chain, it is guaranteed to mix well in all cases.\n\n3 Sampling from Matroid-Constrained Distributions\nIn this section we consider sampling from an explicitly-constrained distribution $\\pi_{\\mathcal{C}}$ where $\\mathcal{C}$ specifies\ncertain matroid base constraints (\u00a73.1) or a uniform matroid of a given rank (\u00a73.2).\n\n3.1 Matroid Base Constraints\nWe begin with constraints that are special cases of matroid bases (see footnote 2):\n\n1. Uniform matroid: $\\mathcal{C} = \\{S \\subseteq V \\mid |S| = k\\}$,\n2. 
Partition matroid: Given a partition $V = \\bigcup_{i=1}^{k} P_i$, we allow sets that contain exactly one\nelement from each $P_i$: $\\mathcal{C} = \\{S \\subseteq V \\mid |S \\cap P_i| = 1 \\text{ for all } 1 \\leq i \\leq k\\}$.\n\nAn important special case of a distribution with a uniform matroid constraint is the k-DPP [27].\nPartition matroids are used in multilabel problems [38], and also in probabilistic diversity models [21].\n\nAlgorithm 2 Gibbs Exchange Sampler for Matroid Bases\nRequire: set function $F$, inverse temperature $\\beta$, matroid $\\mathcal{C} \\subseteq 2^V$\n  Initialize $S \\in \\mathcal{C}$\n  while not mixed do\n    Let $b = 1$ with probability 0.5\n    if $b = 1$ then\n      Draw $s \\in S$ and $t \\in V \\setminus S$ (for partition matroids, $t \\in P(s) \\setminus \\{s\\}$) uniformly at random\n      if $S \\cup \\{t\\} \\setminus \\{s\\} \\in \\mathcal{C}$ then\n        $S \\leftarrow S \\cup \\{t\\} \\setminus \\{s\\}$ with probability $\\frac{\\pi_{\\mathcal{C}}(S \\cup \\{t\\} \\setminus \\{s\\})}{\\pi_{\\mathcal{C}}(S) + \\pi_{\\mathcal{C}}(S \\cup \\{t\\} \\setminus \\{s\\})}$\n      end if\n    end if\n  end while\n\nThe sampler is shown in Algorithm 2. At each iteration, we randomly select an item $s \\in S$ and\n$t \\in V \\setminus S$ such that the new set $S \\cup \\{t\\} \\setminus \\{s\\}$ satisfies $\\mathcal{C}$, and swap them with a certain probability. For\nuniform matroids, this means $t \\in V \\setminus S$; for partition matroids, $t \\in P(s) \\setminus \\{s\\}$ where $P(s)$ is the part\nthat $s$ resides in. The fact that the chain has stationary distribution $\\pi_{\\mathcal{C}}$ can be inferred via detailed\nbalance. Similar to the analysis in [19] for unconstrained sampling, the mixing time depends on\na quantity that measures how much $F$ deviates from linearity: $\\zeta_F = \\max_{S, T \\in \\mathcal{C}} |F(S) + F(T) - F(S \\cap T) - F(S \\cup T)|$.\nOur proof, however, differs from that of [19]. While they use canonical\npaths [10], we use multicommodity flows, which are more effective in our constrained setting.\nTheorem 4. Consider the chain in Algorithm 2. 
For the uniform matroid, $\\tau_{X_0}(\\varepsilon)$ is bounded as\n\n$$\\tau_{X_0}(\\varepsilon) \\leq 4k(N - k) \\exp(\\beta(2\\zeta_F)) (\\log \\pi_{\\mathcal{C}}(X_0)^{-1} + \\log \\varepsilon^{-1}); \\tag{3.1}$$\n\nfor the partition matroid, the mixing time is bounded as\n\n$$\\tau_{X_0}(\\varepsilon) \\leq 4k^2 \\max_i |P_i| \\exp(\\beta(2\\zeta_F)) (\\log \\pi_{\\mathcal{C}}(X_0)^{-1} + \\log \\varepsilon^{-1}). \\tag{3.2}$$\n\nFootnote 2: Drawing even a uniform sample from the bases of an arbitrary matroid can be hard.\n\nObserve that if the $P_i$ form an equipartition, i.e., $|P_i| = N/k$ for all $i$, then the second bound becomes\n$\\tilde{O}(kN)$. For $k = O(\\log N)$, the mixing times depend as $O(N \\, \\mathrm{polylog}(N)) = \\tilde{O}(N)$ on $N$. For\nuniform matroids, the time is equally small if $k$ is close to $N$. Finally, the time depends on the\ninitialization, $\\pi_{\\mathcal{C}}(X_0)$. If $F$ is monotone increasing, one may run a simple greedy algorithm to ensure\nthat $\\pi_{\\mathcal{C}}(X_0)$ is large. If $F$ is monotone submodular, this ensures that $\\log \\pi_{\\mathcal{C}}(X_0)^{-1} = O(\\log N)$.\nOur proof uses a multicommodity flow to upper bound the eigenvalue $\\lambda_{\\max}$ of the transition\nmatrix. Concretely, let $H$ be the set of all simple paths between states in the state graph of the Markov\nchain; we construct a flow $f : H \\to \\mathbb{R}_+$ that assigns a nonnegative flow value to any simple\npath between any two states (sets) $X, Y \\in \\mathcal{C}$. Each edge $e = (S, T)$ in the graph has a capacity\n$Q(e) = \\pi_{\\mathcal{C}}(S) P(S, T)$ where $P(S, T)$ is the transition probability from $S$ to $T$. The total flow sent\nfrom $X$ to $Y$ must be $\\pi_{\\mathcal{C}}(X) \\pi_{\\mathcal{C}}(Y)$: if $H_{XY}$ is the set of all simple paths from $X$ to $Y$, then we\nneed $\\sum_{p \\in H_{XY}} f(p) = \\pi_{\\mathcal{C}}(X) \\pi_{\\mathcal{C}}(Y)$. Intuitively, the mixing time relates to the congestion in any\nedge, and the length of the paths. If there are many short paths $X \\leadsto Y$ across which flow can be\ndistributed, then mixing is fast. This intuition is captured in a fundamental theorem:\nTheorem 5 (Multicommodity Flow [35]). Let $E$ be the set of edges in the transition graph, and\n$P(X, Y)$ the transition probability. 
Define\n\n$$\\rho(f) = \\max_{e \\in E} \\frac{1}{Q(e)} \\sum_{p \\ni e} f(p) \\, \\mathrm{len}(p),$$\n\nwhere $\\mathrm{len}(p)$ is the length of the path $p$. Then $\\lambda_{\\max} \\leq 1 - 1/\\rho(f)$.\nWith this property of multicommodity flow, we are ready to prove Thm. 4.\n\nProof. (Theorem 4) We sketch the proof for partition matroids; the full proof is in Appendix A.\nFor any two sets $X, Y \\in \\mathcal{C}$, we distribute the flow equally across all shortest paths $X \\leadsto Y$ in the\ntransition graph and bound the amount of flow through any edge $e \\in E$.\nConsider two arbitrary sets $X, Y \\in \\mathcal{C}$ with symmetric difference $|X \\oplus Y| = 2m \\leq 2k$, i.e., $m$\nelements need to be exchanged to reach $Y$ from $X$. However, these $m$ steps are a valid path in the\ntransition graph only if every set $S$ along the way is in $\\mathcal{C}$. The exchange property of matroids implies\nthat this requirement is indeed true, so any shortest path $X \\leadsto Y$ has length $m$. Moreover, there are\nexactly $m!$ such paths, since we can exchange the elements in $X \\setminus Y$ in any order to reach $Y$. Note\nthat once we choose $s \\in X \\setminus Y$ to swap out, there is only one choice $t \\in Y \\setminus X$ to swap in, where $t$\nlies in the same part as $s$ in the partition matroid; otherwise the constraint will be violated. Since the\ntotal flow is $\\pi_{\\mathcal{C}}(X) \\pi_{\\mathcal{C}}(Y)$, each path receives $\\pi_{\\mathcal{C}}(X) \\pi_{\\mathcal{C}}(Y) / m!$ flow.\nNext, let $e = (S, T)$ be any edge on some shortest path $X \\leadsto Y$; so $S, T \\in \\mathcal{C}$ and $T = S \\cup \\{j\\} \\setminus \\{i\\}$\nfor some $i, j \\in V$. Let $2r = |X \\oplus S| < 2m$, so that the shortest path $X \\leadsto S$ has length $r$, i.e., $r$ elements\nneed to be exchanged to reach $S$ from $X$. Similarly, $m - r - 1$ elements are exchanged to reach\n$Y$ from $T$. Since there is a path for every permutation of those elements, the ratio of the total flow\n$w_e(X, Y)$ that edge $e$ receives from the pair $X, Y$ to $Q(e)$ becomes\n\n$$\\frac{w_e(X, Y)}{Q(e)} \\leq \\frac{2 \\, r! \\, (m - 1 - r)! \\, kL}{m! \\, Z_{\\mathcal{C}}} \\exp(2\\beta\\zeta_F) (\\exp(\\beta F(\\sigma_S(X, Y))) + \\exp(\\beta F(\\sigma_T(X, Y)))), \\tag{3.3}$$\n\nwhere $L = \\max_i |P_i|$ and we define $\\sigma_S(X, Y) = X \\oplus Y \\oplus S = (X \\cap Y \\cap S) \\cup (X \\setminus (Y \\cup S)) \\cup (Y \\setminus (X \\cup S))$. 
To bound\nthe total flow, we must count the pairs $X, Y$ such that $e$ is on their shortest path(s), and bound the\nflow they send. We do this in two steps, first summing over all $(X, Y)$ that share the upper bound\n(3.3) because they have the same difference sets $U_S = \\sigma_S(X, Y)$ and $U_T = \\sigma_T(X, Y)$, and then\nsumming over all possible $U_S$ and $U_T$. For fixed $U_S, U_T$, there are $\\binom{m-1}{r}$ pairs that share those difference\nsets, since the only freedom we have is to assign $r$ of the $m - 1$ elements in $S \\setminus (X \\cap Y \\cap S)$ to $Y$,\nand the rest to $X$. Hence, for fixed $U_S, U_T$, appropriate summing and canceling yields\n\n$$\\sum_{(X, Y) :\\, \\sigma_S(X, Y) = U_S, \\, \\sigma_T(X, Y) = U_T} \\frac{w_e(X, Y)}{Q(e)} \\leq \\frac{2kL}{Z_{\\mathcal{C}}} \\exp(2\\beta\\zeta_F) (\\exp(\\beta F(U_S)) + \\exp(\\beta F(U_T))). \\tag{3.4}$$\n\nFinally, we sum over all valid $U_S$ ($U_T$ is determined by $U_S$). One can show that any valid $U_S \\in \\mathcal{C}$,\nand hence $\\sum_{U_S} \\exp(\\beta F(U_S)) \\leq Z_{\\mathcal{C}}$, and likewise for $U_T$. Hence, summing the bound (3.4) over all\npossible choices of $U_S$ yields\n\n$$\\rho(f) \\leq 4kL \\exp(2\\beta\\zeta_F) \\max_p \\mathrm{len}(p) \\leq 4k^2 L \\exp(2\\beta\\zeta_F),$$\n\nwhere we upper bound the length of any shortest path by $k$, since $m \\leq k$. Hence\n\n$$\\tau_{X_0}(\\varepsilon) \\leq 4k^2 L \\exp(2\\beta\\zeta_F) (\\log \\pi_{\\mathcal{C}}(X_0)^{-1} + \\log \\varepsilon^{-1}).$$\n\nFor more restrictive constraints, there are fewer paths, and the bounds can become larger. Appendix A\nshows the general dependence on $k$ (as $k!$). It is also interesting to compare the bound on uniform\nmatroids in Eq. (3.1) to that shown in [3] for a sub-class of distributions that satisfy the property\nof being homogeneous strongly Rayleigh (see footnote 3). If $\\pi_{\\mathcal{C}}$ is homogeneous strongly Rayleigh, we have\n$\\tau_{X_0}(\\varepsilon) \\leq 2k(N - k)(\\log \\pi_{\\mathcal{C}}(X_0)^{-1} + \\log \\varepsilon^{-1})$. In our analysis, without additional assumptions on\n$\\pi_{\\mathcal{C}}$, we pay a factor of $2 \\exp(2\\beta\\zeta_F)$ for generality. 
The exponential term is one for some strongly Rayleigh\ndistributions (e.g., if $F$ is modular), but not for all.\n\n3.2 Uniform Matroid Constraint\nWe consider a constraint that is a uniform matroid of a certain rank: $\\mathcal{C} = \\{S : |S| \\leq k\\}$. We employ the\nlazy add-delete Markov chain in Algo. 3, where in each iteration, with probability 0.5 we uniformly\nrandomly sample one element from $V$ and either add it to or delete it from the current set, while\nrespecting constraints. To show fast mixing, we use path coupling, which essentially says\nthat if we have a contraction of two (coupling) chains then we have fast mixing. We construct a path\ncoupling $(S, T) \\to (S', T')$ on a carefully generated graph with edges $E$ (from a proper metric).\nWith all details in Appendix B, we end up with the following theorem:\nTheorem 6. Consider the chain shown in Algorithm 3. Let $\\alpha = \\max_{(S, T) \\in E} \\{\\alpha_1, \\alpha_2\\}$ where $\\alpha_1$ and\n$\\alpha_2$ are functions of edges $(S, T) \\in E$ and are defined as\n\n\u21b51 =1 Xi2T |p(T, i)  p(S, i)|+ J|S| < kKXi2[N ]\\S\n\u21b52 = min{p(S, s), p(T, t)} Xi2R\n\n|p(S, i)  p(T, i)|+\n\n(p+(S, i)  p+(T, i))+;\n\nJ|S| < kK(min{p+(S, t), p+(T, s)} Xi2[N ]\\(S[T ) |p+(S, i)  p+(T, i)|),\n\nwhere $(x)_+ = \\max(0, x)$. The summations over absolute differences quantify the sensitivity of\ntransition probabilities to adding/deleting elements in neighboring $(S, T)$. 
Assuming $\\alpha < 1$, we get\n\n$$\\tau(\\varepsilon) \\leq \\frac{2N \\log(N \\varepsilon^{-1})}{1 - \\alpha}.$$\n\nAlgorithm 3 Gibbs Add-Delete Markov Chain for Uniform Matroid\nRequire: $F$ the set function, $\\beta$ the inverse temperature, $V$ the ground set, $k$ the rank of $\\mathcal{C}$\nEnsure: $S$ sampled from $\\pi_{\\mathcal{C}}$\n  Initialize $S \\in \\mathcal{C}$\n  while not mixed do\n    Let $b = 1$ with probability 0.5\n    if $b = 1$ then\n      Draw $s \\in V$ uniformly randomly\n      if $s \\notin S$ and $|S \\cup \\{s\\}| \\leq k$ then\n        $S \\leftarrow S \\cup \\{s\\}$ with probability $p^+(S, s) = \\frac{\\pi_{\\mathcal{C}}(S \\cup \\{s\\})}{\\pi_{\\mathcal{C}}(S) + \\pi_{\\mathcal{C}}(S \\cup \\{s\\})}$\n      else\n        $S \\leftarrow S \\setminus \\{s\\}$ with probability $p^-(S, s) = \\frac{\\pi_{\\mathcal{C}}(S \\setminus \\{s\\})}{\\pi_{\\mathcal{C}}(S) + \\pi_{\\mathcal{C}}(S \\setminus \\{s\\})}$\n      end if\n    end if\n  end while\n\nRemarks. If $\\alpha$ is less than 1 and independent of $N$, then the mixing time is nearly linear in $N$.\nThe condition is conceptually similar to those in [29, 34]. Fast mixing requires both $\\alpha_1$ and $\\alpha_2$,\nthat is, the change in probability when adding or deleting a single element to or from neighboring subsets,\nto be small. This notion is closely related to the curvature of discrete set functions.\n\nFootnote 3: Appendix C contains details about strongly Rayleigh distributions.\n\n4 Experiments\n\nWe next empirically study the dependence of sampling times on key factors that govern our theoretical\nbounds. In particular, we run Markov chains on chain-structured Ising models on a partition matroid\nbase and DPPs on a uniform matroid, and consider estimating marginal and conditional probabilities\nof a single variable. To monitor the convergence of Markov chains, we use the potential scale reduction\nfactor (PSRF) [7, 18], which runs several chains in parallel and compares within-chain variances to\nbetween-chain variances. Typically, PSRF is greater than 1 and will converge to 1 in the limit; if it is\nclose to 1 we empirically conclude that chains have mixed well. 
Throughout the experiments we run 10\nchains in parallel for estimations, and declare \u201cconvergence\u201d at a PSRF of 1.05.\nWe first focus on small synthetic examples where we can compute exact marginal and conditional\nprobabilities. We construct a 20-variable chain-structured Ising model as\n\n$$\\pi_{\\mathcal{C}}(S) \\propto \\exp\\left( \\beta \\left( \\delta \\left( \\sum_{i=1}^{19} w_i (s_i \\oplus s_{i+1}) \\right) + (1 - \\delta) |S| \\right) \\right) \\llbracket S \\in \\mathcal{C} \\rrbracket,$$\n\nwhere the $s_i$ are 0-1 encodings of $S$, and the $w_i$ are drawn uniformly randomly from $[0, 1]$. The\nparameters $(\\beta, \\delta)$ govern bounds on the mixing time via $\\exp(2\\beta\\zeta_F)$; the smaller $\\delta$, the smaller $\\zeta_F$.\n$\\mathcal{C}$ is a partition matroid of rank 5. We estimate conditional probabilities of one random variable\nconditioned on 0, 1 and 2 other variables and compare against the ground truth. We set $(\\beta, \\delta)$ to be\n(1, 1), (3, 1) and (3, 0.5); results are shown in Fig. 1. All marginals and conditionals converge to\ntheir true values, but at different speeds. Comparing Fig. 1a against 1b, we observe that with fixed\n$\\delta$, an increase in $\\beta$ slows down the convergence, as expected. Comparing Fig. 1b against 1c, we observe\nthat with fixed $\\beta$, a decrease in $\\delta$ speeds up the convergence, also as expected given our theoretical\nresults. 
Appendix D.1 and D.2 illustrate the convergence of estimations under other $(\\beta, \\delta)$ settings.\n\n[Figure 1 plots omitted. Each panel is titled Convergence for Inference and plots Error against # Iter for the curves Marg, Cond-1 and Cond-2. Panels: (a) $(\\beta, \\delta) = (1, 1)$; (b) $(\\beta, \\delta) = (3, 1)$; (c) $(\\beta, \\delta) = (3, 0.5)$.]\n\nFigure 1: Convergence of marginal (Marg) and conditional (Cond-1 and Cond-2, conditioned on\n1 and 2 other variables) probabilities of a single variable in a 20-variable Ising model with different\n$(\\beta, \\delta)$. Full lines show the means and dotted lines the standard deviations of estimations.\n\nWe also check convergence on larger models. We use a DPP on a uniform matroid of rank 30 on\nthe Ailerons data (http://www.dcc.fc.up.pt/657~ltorgo/Regression/DataSets.html) of size 200.\nHere, we do not have access to the ground truth, and hence plot the estimation\nmean with standard deviations among 10 chains in Fig. 3a. We observe that the chains will eventually\nconverge, i.e., the mean becomes stable and the variance small. We also use PSRF to approximately\njudge the convergence. More results can be found in Appendix D.3.\nFurthermore, the mixing time depends on the size $N$ of the ground set. We use a DPP on Ailerons and\nvary $N$ from 50 to 1000. Fig. 2a shows the PSRF from 10 chains for each setting. By thresholding\nPSRF at 1.05 in Fig. 2b we see a clearer dependence on $N$. At this scale, the mixing time grows\nalmost linearly with $N$, indicating that this chain is efficient at least at small to medium scale.\n\n[Figure 2 plots omitted. Panel (a), titled Potential Scale Reduction Factor, plots PSRF against # Iter for N = 50, 100, 200, 300, 500, 1000. Panel (b), titled Approximate Mixing Time, plots # Iters against Data Size.]\n\nFigure 2: Empirical mixing time analysis when varying dataset sizes, (a) PSRFs for each set of\nchains, (b) Approximate mixing time obtained by thresholding PSRF at 1.05.\n\nFinally, we empirically study how fast our sampler on strongly Rayleigh distributions converges. We\ncompare the chain in Algorithm 1 (Mix) against a simple add-delete chain (Add-Delete). We use a\nDPP on Ailerons data (footnote 4) of size 200, and the corresponding PSRF is shown in Fig. 3b. We observe that\nMix converges slightly slower than Add-Delete since it is lazier. However, the Add-Delete chain\ndoes not always mix fast. Fig. 3c illustrates a different setting, where we modify the eigenspectrum\nof the kernel matrix: the first 100 eigenvalues are 500 and the others 1/500. Such a kernel corresponds to\nalmost an elementary DPP, where the size of the observed subsets sharply concentrates around 100.\nHere, Add-Delete moves very slowly. Mix, in contrast, is able to exchange elements\nand thus converges much faster than Add-Delete.\n\nFootnote 4: http://www.dcc.fc.up.pt/657~ltorgo/Regression/DataSets.html\n\n[Figure 3 plots omitted. Panel (a), titled Convergence for Inference, plots Val against # Iter for the curves Marg, Cond-5 and Cond-10. Panels (b) and (c), titled Potential Scale Reduction Factor, plot PSRF against # Iter for the chains Add-Delete and Mix.]\n\nFigure 3: (a) Convergence of marginal and conditional probabilities by DPP on uniform matroid,\n(b,c) comparison between add-delete chain (Algorithm 3) and projection chain (Algorithm 1) for two\ninstances: slowly decaying spectrum and sharp step in the spectrum.\n\n5 Discussion and Open Problems\n\nWe presented theoretical results on Markov chain sampling for discrete probabilistic models subject\nto implicit and explicit constraints. In particular, under an implicit constraint that the probability\nmeasure is strongly Rayleigh, we obtain an unconditional fast mixing guarantee. For distributions\nwith various explicit constraints we showed sufficient conditions for fast mixing. We show empirically\nthat the dependencies of mixing times on various factors are consistent with our theoretical analysis.\nThere still exist many open problems in both implicitly- and explicitly-constrained settings. Many\nbounds that we show depend on structural quantities ($\\zeta_F$ or $\\alpha$) that may not always be easy to quantify\nin practice. It will be valuable to develop chains on special classes of distributions (as we did for\nstrongly Rayleigh) whose mixing time is independent of these factors. Moreover, we only considered\nmatroid bases or uniform matroids, while several important settings such as knapsack constraints\nremain open. 
In fact, even uniform sampling with a knapsack constraint is not easy; a mixing time\nof $O(N^{4.5})$ is known [33]. We defer the development of similar or better bounds, potentially with\nstructural factors like $\\exp(\\beta \\zeta_F)$, on specialized discrete probabilistic models to future work.\nAcknowledgements. This research was partially supported by NSF CAREER 1553284 and a Google\nResearch Award.\n\nReferences\n[1] D. J. Aldous. Some inequalities for reversible Markov chains. Journal of the London Mathematical Society, pages 564\u2013576, 1982.\n[2] N. Anari and S. O. Gharan. Effective-resistance-reducing flows and asymmetric tsp. In FOCS, 2015.\n[3] N. Anari, S. O. Gharan, and A. Rezaei. Monte Carlo Markov chain algorithms for sampling strongly Rayleigh distributions and determinantal point processes. In COLT, 2016.\n[4] J. Borcea, P. Br\u00e4nd\u00e9n, and T. Liggett. Negative dependence and the geometry of polynomials. Journal of the American Mathematical Society, pages 521\u2013567, 2009.\n[5] A. Bouchard-C\u00f4t\u00e9 and M. I. Jordan. Variational inference over combinatorial spaces. In NIPS, 2010.\n[6] A. Broder. Generating random spanning trees. In FOCS, pages 442\u2013447, 1989.\n[7] S. P. Brooks and A. Gelman. General methods for monitoring convergence of iterative simulations. Journal of computational and graphical statistics, pages 434\u2013455, 1998.\n[8] R. Bubley and M. Dyer. Path coupling: A technique for proving rapid mixing in Markov chains. In FOCS, pages 223\u2013231, 1997.\n[9] N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. In COLT, 2009.\n[10] P. Diaconis and D. Stroock. Geometric bounds for eigenvalues of Markov chains. The Annals of Applied Probability, pages 36\u201361, 1991.\n[11] J. Djolonga and A. Krause. From MAP to marginals: Variational inference in bayesian submodular models. In NIPS, pages 244\u2013252, 2014.\n[12] M. Dyer and C. Greenhill. A more rapidly mixing Markov chain for graph colorings. Random Structures and Algorithms, pages 285\u2013317, 1998.\n[13] M. Dyer, A. Frieze, and M. Jerrum. On counting independent sets in sparse graphs. In FOCS, 1999.\n[14] S. Ermon, C. P. Gomes, A. Sabharwal, and B. Selman. Embed and project: Discrete sampling with universal hashing. In NIPS, pages 2085\u20132093, 2013.\n[15] T. Feder and M. Mihail. Balanced matroids. In STOC, pages 26\u201338, 1992.\n[16] A. Frieze, N. Goyal, L. Rademacher, and S. Vempala. Expanders via random spanning trees. SIAM Journal on Computing, 43(2):497\u2013513, 2014.\n[17] M. Gartrell, U. Paquet, and N. Koenigstein. Low-rank factorization of determinantal point processes for recommendation. arXiv:1602.05436, 2016.\n[18] A. Gelman and D. B. Rubin. Inference from iterative simulation using multiple sequences. Statistical science, pages 457\u2013472, 1992.\n[19] A. Gotovos, H. Hassani, and A. Krause. Sampling from probabilistic submodular models. In NIPS, 2015.\n[20] D. M. Greig, B. T. Porteous, and A. H. Seheult. Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society, 1989.\n[21] R. Iyer and J. Bilmes. Submodular point processes. In AISTATS, 2015.\n[22] M. Jerrum and A. Sinclair. Polynomial-time approximation algorithms for the Ising model. SIAM J. Computing, 1993.\n[23] M. Jerrum, A. Sinclair, and E. Vigoda. A polynomial-time approximation algorithm for the permanent of a matrix with nonnegative entries. JACM, 2004.\n[24] B. Kang. Fast determinantal point process sampling with application to clustering. In NIPS, pages 2319\u20132327, 2013.\n[25] T. Kathuria and A. Deshpande. On sampling from constrained diversity promoting point processes. 2016.\n[26] M. Kojima and F. Komaki. Determinantal point process priors for Bayesian variable selection in linear regression. arXiv:1406.2100, 2014.\n[27] A. Kulesza and B. Taskar. k-DPPs: Fixed-size determinantal point processes. In ICML, pages 1193\u20131200, 2011.\n[28] A. Kulesza and B. Taskar. Determinantal point processes for machine learning. arXiv preprint arXiv:1207.6083, 2012.\n[29] C. Li, S. Jegelka, and S. Sra. Fast DPP sampling for Nystr\u00f6m with application to kernel methods. In ICML, 2016.\n[30] C. Li, S. Sra, and S. Jegelka. Gaussian quadrature for matrix inverse forms with applications. In ICML, 2016.\n[31] C. J. Maddison, D. Tarlow, and T. Minka. A* sampling. In NIPS, 2014.\n[32] Z. Mariet and S. Sra. Diversity networks. In ICLR, 2016.\n[33] B. Morris and A. Sinclair. Random walks on truncated cubes and sampling 0-1 knapsack solutions. SIAM journal on computing, pages 195\u2013226, 2004.\n[34] P. Rebeschini and A. Karbasi. Fast mixing for discrete point processes. In COLT, 2015.\n[35] A. Sinclair. Improved bounds for mixing rates of Markov chains and multicommodity flow. Combinatorics, probability and Computing, pages 351\u2013370, 1992.\n[36] D. Smith and J. Eisner. Dependency parsing by belief propagation. In EMNLP, 2008.\n[37] D. Spielman and N. Srivastava. Graph sparsification by effective resistances. In STOC, 2008.\n[38] J. Zhang, J. Djolonga, and A. Krause. Higher-order inference for multi-class log-supermodular models. In ICCV, pages 1859\u20131867, 2015.", "award": [], "sourceid": 2082, "authors": [{"given_name": "Chengtao", "family_name": "Li", "institution": "MIT"}, {"given_name": "Suvrit", "family_name": "Sra", "institution": "MIT"}, {"given_name": "Stefanie", "family_name": "Jegelka", "institution": "MIT"}]}