{"title": "Exponentiated Strongly Rayleigh Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 4459, "page_last": 4469, "abstract": "Strongly Rayleigh (SR) measures are discrete probability distributions over the subsets of a ground set. They enjoy strong negative dependence properties, as a result of which they assign higher probability to subsets of diverse elements. We introduce in this paper Exponentiated Strongly Rayleigh (ESR) measures, which sharpen (or smoothen) the negative dependence property of SR measures via a single parameter (the exponent) that can intuitively understood as an inverse temperature. We develop efficient MCMC procedures for approximate sampling from ESRs, and obtain explicit mixing time bounds for two concrete instances: exponentiated versions of Determinantal Point Processes and Dual Volume Sampling. We illustrate some of the potential of ESRs, by applying them to a few machine learning tasks; empirical results confirm that beyond their theoretical appeal, ESR-based models hold significant promise for these tasks.", "full_text": "Exponentiated Strongly Rayleigh Distributions\n\nZelda Mariet\n\nMassachusetts Institute of Technology\n\nSuvrit Sra\n\nMassachusetts Institute of Technology\n\n\u0007A\u0006@=(?I=E\u0006\u0002\u0006EJ\u0002A@K\n\nIKLHEJ(\u0006EJ\u0002A@K\n\nStefanie Jegelka\n\nMassachusetts Institute of Technology\n\nIJAB\u0006A(?I=E\u0006\u0002\u0006EJ\u0002A@K\n\nAbstract\n\nStrongly Rayleigh (SR) measures are discrete probability distributions over the\nsubsets of a ground set. They enjoy strong negative dependence properties, as a\nresult of which they assign higher probability to subsets of diverse elements. We\nintroduce in this paper Exponentiated Strongly Rayleigh (ESR) measures, which\nsharpen (or smoothen) the negative dependence property of SR measures via a\nsingle parameter (the exponent) that can be intuitively understood as an inverse\ntemperature. 
We develop efficient MCMC procedures for approximate sampling from ESRs, and obtain explicit mixing time bounds for two concrete instances: exponentiated versions of Determinantal Point Processes and Dual Volume Sampling. We illustrate some of the potential of ESRs by applying them to a few machine learning problems; empirical results confirm that beyond their theoretical appeal, ESR-based models hold significant promise for these tasks.\n\n1 Introduction\n\nThe careful selection of a few items from a large ground set is a crucial component of many machine learning problems. Typically, the selected set of items must fulfill a variety of application-specific requirements: e.g., when recommending items to a user, the quality of each selected item is important. This quality must, however, be balanced by the diversity of the selected items to avoid redundancy within recommendations. Notable applications requiring careful consideration of subset diversity include recommender systems, information retrieval, and automatic summarization; more broadly, such concerns are also vital for model design tasks such as model pruning, and for experimental design.\n\nA flexible approach to such subset selection is to sample subsets of the ground set using a measure that balances quality with diversity. An effective way to capture diversity is to use negatively dependent measures. While such measures have long been studied [41], remarkable recent progress by Borcea et al. [11] has put forth a rich new theory with far-reaching impact. The key concept in Borcea et al.'s theory is that of Strongly Rayleigh (SR) measures, which admit important closure properties (specifically, closure under conditioning on a subset of variables, projection, imposition of external fields, and symmetric homogenization [11, Theorems 4.2, 4.9]) and enjoy the strongest form of negative association. 
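As a concrete illustration (ours, not the paper's): any positive semi-definite kernel L defines a DPP, which is SR, and negative association implies in particular pairwise negative correlation, Pr[i, j ∈ S] ≤ Pr[i ∈ S] Pr[j ∈ S]. A brute-force check on a 3-element ground set (helper names are our own):

```python
import itertools
import numpy as np

def dpp_probs(L):
    """Exact probabilities nu(S) ∝ det(L[S]) over all subsets, by enumeration."""
    masses = {}
    for k in range(L.shape[0] + 1):
        for S in itertools.combinations(range(L.shape[0]), k):
            masses[S] = np.linalg.det(L[np.ix_(S, S)]) if S else 1.0
    Z = sum(masses.values())  # equals det(I + L)
    return {S: m / Z for S, m in masses.items()}

L = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.4],
              [0.2, 0.4, 1.0]])
probs = dpp_probs(L)
marg = lambda i: sum(v for S, v in probs.items() if i in S)
joint = lambda i, j: sum(v for S, v in probs.items() if i in S and j in S)
# Pairwise negative correlation: joint(i, j) <= marg(i) * marg(j) for all i < j.
```

Enumeration is of course only feasible for tiny ground sets; the point is that the inequality holds for every pair, which is exactly the qualitative "diversity" behavior described above.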
These properties have been instrumental in the resolution of long-standing conjectures in mathematics [9, 35]; in machine learning, their broader impact is only beginning to emerge [5, 31, 33], while an important subclass of SR measures, Determinantal Point Processes (DPPs), has already found numerous applications [22, 29].\n\nA practical challenge in using SR measures is the tuning of diversity versus quality, a task that is application dependent and may require significant effort. This modeling need motivates us to consider a generalization of SR measures that allows for easy tuning of the relative importance given to quality and diversity considerations. Specifically, we introduce the class of Exponentiated Strongly Rayleigh (ESR) measures, which are distributions of the form ν(S) ∝ μ(S)^p, where S is a set, p > 0 is a parameter and μ is an SR measure. A power p > 1 captures a sharper notion of diversity than μ; conversely, a power p < 1 allows for weaker diversity preferences; at the p = 0 extreme, ν is uniform, while as p → ∞, ν concentrates on the mode of μ.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nESR measures present an attractive generalization of SR measures, in which a single parameter allows intuitive regulation of the desired strength of negative dependence. Interestingly, a few special cases of ESRs have been briefly noted in the literature [22, 29, 49], although only in the guise of generalizations of DPPs and without noting any connection to SR measures.\n\nWe analyze the negative association properties of ESR measures and derive general-purpose sampling algorithms that we further specialize for important concrete cases. 
Subsequently, we evaluate the proposed sampling procedures on outlier detection and kernel reconstruction, and show how a class of machine learning problems can benefit from the modeling power of ESR measures.\n\nSummary of contributions. The key contributions of this paper are the following:\n\n– The introduction of Exponentiated SR measures as a flexible generalization of SR measures, allowing for intuitive tuning of subset selection quality/diversity tradeoffs via an exponent p > 0.\n– A discussion of cases in which ESR measures remain SR. Specifically, we show that there exist non-trivial determinantal measures whose ESR versions remain SR for p in a neighborhood of 1.\n– The introduction of the notion of r-closeness, which quantifies the suitability of a proposal distribution for MCMC samplers.\n– The analysis of MCMC sampling algorithms for ESR measures that take advantage of fast-mixing chains for SR measures. We show that the mixing time of the ESR samplers is upper bounded in terms of r-closeness; we provide concrete bounds for popular SR measures.\n– An empirical evaluation of ESR measures on various machine learning tasks, showing that ESR measures outperform standard SR models on several problems requiring a delicate balance of subset quality and diversity.\n\n1.1 Related work\n\nAn early work that formally motivates various negative dependence conjectures is [41]. The seminal work [11] provides a response, and outlines a powerful theory of negative dependence via the class of SR measures. The mathematical theory of SR measures, as well as the intimately related theory of multivariate stable polynomials, has been the subject of significant interest [9, 10, 42]; recently, SR measures were central in the proof of the Kadison-Singer conjecture [35].\n\nWithin machine learning, DPPs, which are a subclass of SR measures, have been recognized as a powerful theoretical and practical tool. 
DPPs assign probability proportional to det(L[S]) to a set S ∈ 2^[n], where L is the so-called DPP kernel. Their elegance and tractability have helped DPPs find numerous applications, including document and video summarization [15, 34], sensor placement [27], recommender systems [21, 48], object retrieval [1], neural networks [36] and Nyström approximations [32]. More recently, an SR probability measure known as volume sampling [8, 16] or dual volume sampling (DVS) [33, 37] has found some interest. A DVS measure is parametrized by an m×n matrix A with columns a_i; it assigns to a set S ⊆ [n] of size m a probability proportional to det(∑_{i∈S} a_i a_i^⊤).\n\nIndependent of application-specific motivations, two recent results [5, 31] showed that SR measures admit efficient sampling via fast-mixing Markov chains, suggesting SR measures can be tractably applied to many machine learning problems. Nevertheless, the need to tune the measure to modulate diversity persists. We address this need by passing to the broader class of Exponentiated Strongly Rayleigh measures, whose diversity/quality preference is parametrized by a single exponent.\n\nTo our knowledge, there has been no previous discussion of ESR measures as a class. Nonetheless, they can benefit from the abundant existing theory for log-submodular models [19, 20, 25, 43], and isolated special cases have also been discussed in the literature. In particular, Exponentiated DPPs (or E-DPPs) are mentioned in [29, 49], as well as in [22] and [4].\n\nFigure 1: Anomaly detection by sampling with an Exponentiated DPP. 200 samples of size k = 20 were drawn from an E-DPP with Gaussian kernel; darker colors indicate higher sampling frequencies. 
As p increases, the points furthest from the mean accumulate all of the sampling probability mass.\n\n2 Exponentiated Strongly Rayleigh measures\n\nIn this section, we formally introduce Exponentiated SR measures and analyze their properties within the framework of negative dependence. We use P_n to denote the n×n Hermitian positive definite matrices, and use A ≻ B to denote the usual Löwner order on P_n matrices¹. For a matrix L, we write L[S, T] for the submatrix [L_ij]_{i∈S, j∈T}, and define L[:, S] ≜ L[[n], S] and L[S, :] similarly. We abbreviate L[S, S] as L[S].\n\nRecall that for a measure μ over all subsets of a ground set Y ≜ [n], μ's generating polynomial is the multi-affine function over C^n defined by\n\nP_μ(z_1, …, z_n) = ∑_{S⊆Y} μ(S) ∏_{i∈S} z_i.\n\nDefinition 1 (Strongly Rayleigh [11]). A measure μ over the subsets of [n] := {1, …, n} is SR if its generating polynomial P_μ ∈ C[z_1, …, z_n] is real stable, i.e. P_μ(z_1, …, z_n) ≠ 0 whenever Im(z_j) > 0 for 1 ≤ j ≤ n.\n\nIn order to calibrate the relative influence of the diversity and quality of a set S on the probability an SR measure assigns to S, we introduce the family of Exponentiated Strongly Rayleigh measures.\n\nDefinition 2 (Exponentiated SR measure). A measure μ over 2^[n] is Exponentiated Strongly Rayleigh (ESR) if there exists an SR measure ν over 2^[n] and a power p ≥ 0 such that μ(S) ∝ ν(S)^p.\n\nThe parameter p serves to control the quality/diversity tradeoff by sharpening (p > 1) or smoothing out (p < 1) the variations of the ground SR measure (see Figure 1). 
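To make Definition 2 concrete, here is a small brute-force sketch (our illustration, not code from the paper) of an Exponentiated DPP, ν(S) ∝ det(L[S])^p, on a ground set where two items are nearly redundant:

```python
import itertools
import numpy as np

def esr_probs(L, p):
    """Normalized E-DPP nu(S) ∝ det(L[S])^p, by brute-force enumeration."""
    n = L.shape[0]
    masses = {}
    for k in range(n + 1):
        for S in itertools.combinations(range(n), k):
            masses[S] = np.linalg.det(L[np.ix_(S, S)]) ** p if S else 1.0
    Z = sum(masses.values())
    return {S: m / Z for S, m in masses.items()}

# Items 1 and 2 are nearly redundant; item 0 is orthogonal to both.
L = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.9],
              [0.0, 0.9, 1.0]])
flat, base, sharp = (esr_probs(L, p) for p in (0.5, 1.0, 2.0))
# Raising p (the "inverse temperature") drains mass from the redundant pair:
# sharp[(1, 2)] < base[(1, 2)] < flat[(1, 2)]
```

With p = 0.5 the measure is flattened toward uniform and tolerates the redundant pair; with p = 2 the pair's already-small determinant is squared, so its relative mass shrinks further, which is exactly the sharpening behavior described above.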
A natural question is then to understand how this additional parameter impacts the negative association properties of the ESR. Recall that a fundamental property of SR measures is that they are negatively associated: for two increasing functions F, G over 2^[n] that depend on disjoint sets of coordinates, an SR measure μ verifies the following inequality [11, Theorem 4.9]:\n\nE_μ[F] E_μ[G] ≥ E_μ[F G].   (2.1)\n\nOur first result states that the additional modularity enabled by the exponent parameter can break Strong Rayleighness; as a consequence, we have no immediate guarantee that ESRs verify Eq. (2.1).\n\nProposition 1. There exist ESR measures that are not SR.\n\nConversely, some ESR measures remain SR for any p: if μ is a DPP parametrized by a block-diagonal kernel with 2×2 blocks, ν = αμ^p is also a DPP, and hence SR and ESR by construction. The next theorem guarantees the existence of non-trivial ESR measures which are also SR.\n\nTheorem 1. There exists ε > 0 such that for all p ∈ [1−ε, 1+ε] and all n ∈ N, there exists a non-trivial matrix L ∈ P_n such that the E-DPP distribution defined by ν(S) ∝ det(L[S, S])^p is SR.\n\nHence, ESRs are not guaranteed to be SR but may remain so. Due to their log-submodularity, they nonetheless verify the so-called negative lattice condition μ(S ∩ T) μ(S ∪ T) ≤ μ(S) μ(T), and so retain negative dependence properties.\n\nWe now show that ESRs nonetheless have a fundamental advantage over standard log-submodular functions: although the intractability of their partition function precludes exact sampling algorithms, their closed form as the exponentiation of an SR measure can be leveraged to take advantage of the recent result [31] on fast-mixing Markov chains for SR measures.\n\n¹ i.e. 
A ≻ B ⟺ (A − B) ∈ P_n.\n\n3 Sampling from ESR measures\n\nIn the general case, the normalization term of an ESR is NP-hard to compute, precluding exact sampling algorithms. In this section, we propose instead two MCMC sampling algorithms whose key idea lies in exploiting the explicit relation ESR measures have to SR measures.\n\nWe begin by introducing the notion of r-closeness, which serves as a measure of the proximity between two distributions μ and ν over subsets. In practice, r-closeness will allow us to quantify how close an ESR measure is to being SR, and inform our bounds on mixing time.\n\nDefinition 3 (r-closeness). Let μ, ν be measures over 2^[n] and let p ≥ 0. We say that ν is r-close to μ if for all S ⊆ [n],\n\nν(S) ≠ 0 and μ(S) ≠ 0 ⟹ r^{−1} ≤ ν(S)/μ(S) ≤ r,\n\nwhere we allow r = ∞. We additionally write r(μ, ν) = min{r ∈ R ∪ {∞} : ν is r-close to μ}.\n\nRemark 1. If r(μ, ν) < ∞, ν is absolutely continuous wrt. μ: μ(S) = 0 ⟹ ν(S) = 0.\n\nThe following result establishes that for any ESR measure ν, there exists an SR measure μ which is r-close to ν with r < ∞. This result is the cornerstone of the sampling algorithms we derive, as we show that we can use an r-close SR measure as a proposal to efficiently sample from an ESR measure.\n\nProposition 2. Let μ be an SR measure over 2^[n], and define ν to be the ESR measure such that ν(S) ∝ μ(S)^p. 
Then\n\nr(μ, ν) ≤ max_{S ∈ supp(ν)} [ μ(S)^{−|p−1|} ] < ∞.\n\nIn order to sample from an ESR distribution ν, we now generalize existing MCMC algorithms for SR measures; we bound the distance to stationarity of the chain's current state by comparing it to the distance to stationarity of a similar chain sampling from an SR measure μ, and leveraging the r-closeness r(μ, ν).\n\n3.1 Approximate samplers for ESR measures\n\nBefore investigating MCMC samplers, one may first wonder if rejection sampling might be sufficient: sample a set S from a proposal distribution μ, and accept with probability ν(S)/(M μ(S)), where M ≥ max_S ν(S)/μ(S). Unfortunately, the rejection sampling scaling factor M cannot be computed exactly (although it can be bounded via r(μ, ν)), leading us to prefer MCMC samplers [6].\n\nWe begin by analyzing the standard independent Metropolis–Hastings sampler [26, 38], using an SR measure μ as a proposal: we sample an initial set S from μ via a fast-mixing Markov chain, then iteratively swap from S to a new set S′ with probability\n\nPr(S → S′) = min{1, ν(S′)μ(S) / (ν(S)μ(S′))}.\n\nAlgorithm 1 Proposal-based sampling\nInput: SR proposal μ, ESR measure ν and SR measure ρ s.t. ν = αρ^p\nDraw S ∼ μ\nwhile not mixed do\n  S′ ∼ μ\n  S ← S′ w.p. min{1, ν(S′)μ(S)/(ν(S)μ(S′))} = min{1, (μ(S)/μ(S′)) (ρ(S′)/ρ(S))^p}\nreturn S\n\nAlgorithm 1 relies on the fact that we can compute ν(S′)/ν(S) as (ρ(S′)/ρ(S))^p: we do not require knowledge of ν's partition function. 
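A minimal sketch of this independent Metropolis–Hastings scheme (function names are ours; for self-containedness the SR proposal μ is the uniform measure over 2^[n], which is a product measure and hence SR, rather than a fast-mixing SR chain as in Algorithm 1):

```python
import random
import numpy as np

def log_det_p(L, S, p):
    """Log of the unnormalized E-DPP mass nu(S) ∝ det(L[S])^p (log 1 = 0 for S = ())."""
    if not S:
        return 0.0
    sign, logdet = np.linalg.slogdet(L[np.ix_(S, S)])
    return p * logdet if sign > 0 else float("-inf")

def independent_mh(L, p, steps, rng):
    """Alg. 1-style chain: propose S' from the uniform (product, hence SR) measure;
    since mu is uniform, the acceptance ratio reduces to nu(S')/nu(S)."""
    n = L.shape[0]
    draw = lambda: tuple(i for i in range(n) if rng.random() < 0.5)
    S, chain = draw(), []
    for _ in range(steps):
        T = draw()
        log_ratio = log_det_p(L, T, p) - log_det_p(L, S, p)
        if log_ratio >= 0 or rng.random() < np.exp(log_ratio):
            S = T
        chain.append(S)
    return chain

chain = independent_mh(np.diag([4.0, 1.0]), p=2.0, steps=2000, rng=random.Random(0))
# Under nu ∝ det^2 with this diagonal kernel, item 0 has marginal 32/34 ≈ 0.94,
# so it should appear in the vast majority of visited states.
```

The point of the sketch is the acceptance rule: only unnormalized determinants enter the ratio, so the (intractable) partition function of ν is never needed.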
This sampling method is valid as soon as ν is absolutely continuous with respect to the proposal μ; Proposition 2 guarantees the existence of such measures.\n\nIf the ESR measure ν is k-homogeneous (i.e. ν assigns non-zero probability only to sets of size k), we can instead sample from ν via Algorithm 2: we randomly sample S ⊆ [n] of size k, and swap an element u ∈ S for an element v ∉ S with a probability that depends on how the swap changes the probability of S.\n\nAlgorithm 2 Swap-chain sampling\nInput: k-homogeneous ESR measure ν s.t. ν = αρ^p with ρ SR\nSample S ∼ Unif({T ⊆ [N] : |T| = k})\nwhile not mixed do\n  Sample (u, v) ∈ S × ([N] ∖ S) uniformly at random\n  S ← (S ∖ {u}) ∪ {v} w.p. min{1, ν((S ∖ {u}) ∪ {v})/ν(S)} = min{1, (ρ((S ∖ {u}) ∪ {v})/ρ(S))^p}\nreturn S\n\nThe key to extending Algorithm 2 to non-homogeneous ESR measures is similar to the approach taken by Li et al. [31] for SR measures, and relies on leveraging the symmetric homogenization ν_sh of ν over 2^[2n], defined by\n\nν_sh(S) = ν(S ∩ [n]) (n choose |S ∩ [n]|)^{−1} if |S| = n, and ν_sh(S) = 0 if |S| ≠ n.\n\nIf ν ∝ μ^p, ν_sh is absolutely continuous with respect to μ_sh. A simple calculation further shows that r(μ_sh, ν_sh) = r(μ, ν), and so to sample S from ν, it suffices to sample T of size n from ν_sh using Algorithm 2, and then output S = T ∩ [n].\n\nHence, although we cannot in the general case sample from an ESR measure exactly (unlike many SR measures), being able to evaluate an ESR measure's unnormalized density allows us to leverage MCMC algorithms for approximate sampling. We now focus on bounding the mixing
We now focus on bounding the mixing\ntimes of these algorithms.\n3.2 Bounds on mixing time for the proposal and swapchain algorithms\n\n\u2032\nWriting (cid:23)\nt;S the distribution generated by a Markov chain sampler after t iterations and initialization\nset S, the mixing time (cid:28)S(\u03f5) measures the number of required iterations of the Markov chain so that\n\u2032\nt;S is close enough (in total variational distance) to the true ESR measure (cid:23):\n(cid:23)\n\n(cid:28)S(\u03f5) \u225c minft : \u2225(cid:23)\n\n\u2032\nt;S\n\n(cid:0) (cid:23)\u2225TV (cid:20) \u03f5g\n\nIt is easy to see from the above equation that the mixing time of a chain depends on how close\nthe distribution generating the initialization set S is to the target distribution (cid:22). We now show this\nexplicitly for the two algorithms derived above, obtaining bounds on (cid:28)S that directly depend on the\nr-closeness of the target ESR measure (cid:23) and an SR measure (cid:22).\nFor Algorithm 1, the mixing time explicitly depends on the quality of the proposal distribution.\nTheorem 2 (Alg. 1 mixing time). Let (cid:22); (cid:23) be measures over 2[n] such that (cid:22) is SR and (cid:23) is ESR.\nSampling from (cid:23) via Alg. 1 with (cid:22) as a proposal distribution has a mixing time (cid:28) (\u03f5) such that\n\n(cid:28)S(\u03f5) (cid:20) 2r((cid:22); (cid:23)p) log\n\n1\n\u03f5\n\n:\n\nFor the swapchain algorithm (Alg. 2), we derive a bound on the mixing time by comparing to a\nresult by [5] which shows fast sampling for SR distributions over subsets of a \ufb01xed size.\nTheorem 3 ( Alg. 2 mixing time). Let (cid:23) be a k-homogeneous ESR measure over 2[n]. The mixing\ntime for Alg. 2 with initialization S is bounded in expectation by\n\n(cid:28)S(\u03f5) (cid:20) inf\n(cid:22)2SR\n\n2nk r((cid:22); (cid:23))2 log\n\n1\n\n\u03f5(cid:23)(S)\n\nThe above bound depends on the closest SR distribution to the target measure (cid:23). Combined with\nProp. 2, Thm. 
3 provides a simple upper bound on the mixing time of the swap-chain algorithm.\n\nCorollary 1 (Non-homogeneous swap-chain mixing time). Let ν be a non-homogeneous ESR measure over 2^[n]. The mixing time for the generalized swap-chain sampler to sample from ν with initialization S ⊆ [2n] is bounded in expectation by\n\nτ_S(ε) ≤ inf_{μ ∈ SR} 4n² r(μ, ν)² log(1/(ε ν_sh(S))).\n\nAs a Markov chain's applicability closely depends on its mixing time, a crucial task in sampling from ESR measures lies in finding an r-close SR distribution with small r.\n\n3.3 Specific bounds for r-closeness\n\nWe now derive explicit mixing time bounds for ESR measures ν generated by two popular classes of SR measures: DPPs, in their usual form as well as their k-homogeneous form (k-DPPs), and Dual Volume Sampling (DVS). As Theorems 2 and 3 provide mixing time bounds that depend explicitly on r(μ, ν), this section focuses on upper bounding r(μ, ν). To the extent of our knowledge, the results below are the first for either of these two classes of ESR distributions.\n\nTheorem 4 (E-DVS closeness bounds). Let n ≥ k ≥ m and let X ∈ R^{m×n} be a matrix of maximal rank. Let μ be the Dual Volume Sampling distribution over 2^[n] for sets of size k:\n\nμ(S) ∝ det(X[:, S] X[:, S]^⊤) if |S| = k, and μ(S) = 0 if |S| ≠ k.\n\nLet p > 0 and let ν be the ESR measure induced by μ and p; let MinVol(X, S) be the smallest non-zero minor of degree m of X[:, S]. Then\n\nr(μ, ν) ≤ (n−m choose k−m)^{|1−p|} det(XX^⊤)^{|1−p|} MinVol(X, S)^{−2|1−p|}.\n\nTheorem 5 (E-DPP closeness bound). Let μ be the distribution induced by a DPP with kernel L ⪰ 0 and ν be the E-DPP such that ν(S) ∝ det(L[S])^p. 
Let λ_1 ≤ ⋯ ≤ λ_n be the ordered eigenvalues of L. Then,\n\nr(μ, ν) ≤ ∏_{i=1}^n (1 + λ_i)^{|1−p|} ∏_{i : λ_i < 1} λ_i^{−|1−p|}.\n\nTheorem 6 (E-k-DPP closeness bound). Let μ be the distribution over 2^[n] induced by a k-DPP (k ≤ n) with kernel L, and let ν be the induced ESR measure with power p > 0. Then\n\nr(μ, ν) ≤ e_k(λ_1, …, λ_n)^{|1−p|} ∏_{i=1}^k λ_i^{−|1−p|},\n\nwhere e_k is the k-th elementary symmetric polynomial.\n\nOne easily shows that the values of r(μ, ν) we derive above for (k-)DPPs are loosely lower-bounded by κ^{|1−p|}, where κ is the condition number of the kernel matrix L. However, it is possible to obtain an SR distribution closer to ν ∝ det(L[S])^p than the baseline choice of the DPP with kernel L: indeed, as L is positive semi-definite, we can also consider a DPP parametrized by the kernel L^p.\n\nFor the rest of this section, we define μ as the SR measure corresponding to the DPP with kernel L^p, i.e. μ(S) = det(L^p[S]) / det(I + L^p), and ν as the ESR measure such that ν(S) ∝ det(L[S])^p. Note that ν remains absolutely continuous with regard to μ. In this setting, upper bounding r(μ, ν) proves to be significantly more difficult, and is the focus of the remainder of this section. We first recall a useful expansion of the determinant of principal submatrices, fundamental to deriving the bounds below and potentially of more general interest.\n\nLemma 1 (Shirai and Takahashi [44, Lemma 2.9]). Let H be an n×n Hermitian matrix with eigenvalues λ_1, …, λ_n. 
There exists a 2^n × 2^n symmetric doubly stochastic matrix Q = [Q_SJ], indexed by subsets S, J of [n], such that\n\ndet(H[S]) = ∑_{J⊆[n], |J|=|S|} Q_SJ ∏_{i∈J} λ_i.\n\nQ can be chosen to depend only on the eigenvectors of H and to satisfy Q_SJ = 0 for |S| ≠ |J|.\n\nThe above lemma allows us to bound the ratio det(L^p[S]) / det(L[S])^p in terms of the generalized condition number of L.\n\nDefinition 4 (Generalized condition number). Given a matrix L ∈ P_n with eigenvalues λ_1 ≥ ⋯ ≥ λ_n, we define its generalized condition number of order k as\n\nκ_k = (λ_1 ⋯ λ_k)(λ_{n−k+1} ⋯ λ_n)^{−1}.\n\nNote that κ_k is the usual condition number of the k-th exterior power L^∧k (in particular κ_k ≤ κ^k).\n\nGiven the generalized condition number, Lemma 1 combined with the power-mean inequality [45] (see App. D) suffices to bound the gap between volumes generated by E-DPPs and DPPs:\n\nTheorem 7. Let μ be the distribution induced by a DPP with kernel L^p, and ν be the corresponding E-DPP such that ν(S) ∝ det(L[S])^p. Then r(μ, ν) ≤ r(κ_⌊n/2⌋, p), where r(κ, p) is defined by\n\nr(κ, p) = (p(κ−1)/(κ^p−1))^p · ((1−p)(κ−1)/(κ−κ^p))^{1−p}   for 0 < p < 1,\nr(κ, p) = ((κ^p−1)/(p(κ−1)))^p · ((p−1)(κ−1)/(κ^p−κ))^{p−1}   for p > 1.\n\nCorollary 2. Let μ be the distribution induced by a k-DPP with kernel L^p, and ν be the corresponding ESR measure such that ν(S) ∝ det(L[S])^p. Then r(μ, ν) ≤ r(κ_k, p).\n\nAs shown in Figure 4 (App. 
D), the upper bound of Theorem 7 grows more slowly than κ: this shows that μ(S) ∝ det(L^p[S]) is a closer SR distribution to an E-DPP with kernel L than the E-DPP's generating SR distribution, and leads to finer mixing time bounds.\n\nNote that the per-iteration complexity of both algorithms must also be taken into account when choosing a sampling procedure: for E-DPPs, despite Alg. 1's smaller mixing time, Alg. 2 is more efficient when n is large, due to the comparative costs of each sampling round.\n\n4 Experiments\n\nTo evaluate the empirical applications of ESR measures, we evaluate E-DPPs (DPPs are by far the most popular SR measure in machine learning) on a variety of machine learning tasks. In all cases where we use the proposal MCMC sampler (Alg. 1), we use the DPP with kernel L^p as a proposal.\n\n4.1 Evaluating mixing time\n\nWe begin our experiments by empirically evaluating the mixing time of both algorithms. We measure mixing using the classical Potential Scale Reduction Factor (PSRF) metric [13]; as the PSRF converges to 1, the chain mixes. In the following experiments, we report the mixing time (number of iterations) necessary to reach a PSRF of 1.05, as well as the runtime (in seconds) to convergence, averaged over 5 iterations; we use matrices with a fixed κ_k across all mixing time experiments.\n\nFigure 2: Mixing and sampling time for E-k-DPPs as a function of the set size k. In both cases, the mixing time grows linearly with k; although the mixing time for the proposal algorithm is an order of magnitude smaller than for the swap-chain algorithm, the latter samples faster due to the per-iteration cost of each transition step.\n\nThe mixing time for proposal-based sampling is an order of magnitude smaller than for swap-chain sampling; this is in line with the bounds we provide in Theorems 2 and 3. 
However, this does not translate into faster runtimes: indeed, the per-iteration complexity of proposal-based sampling is significantly higher than for the swap-chain algorithm, as Alg. 1 samples from a DPP at each iteration. The evolution of mixing and wall-clock times as a function of N is provided in Appendix E.\n\n4.2 Anomaly detection\n\nWe now focus on applications of E-DPPs; we begin by evaluating the use of E-DPPs for outlier detection. As increasing p heightens the model's sensitivity to diversity, we expect p > 1 to provide better outlier detection. To our knowledge, this is the first application of DPPs to outlier detection, and so our goal for this experiment is not to improve upon state-of-the-art results, but to compare the performance of (E-)DPPs for various values of p to standard outlier detection algorithms.\n\nExperimentally, we detect outliers via the following approach: given a dataset of n points and an E-DPP with an RBF kernel built from the data (bandwidth β = 100), we sample n/5 subsets of size 50 and report as outliers points that appear at least nε times, where ε is a tunable parameter (hence, if we were doing uniform sampling, each point in the dataset would be sampled on average 10 times).\n\nWe detect outliers on three public datasets: the UCI Breast Cancer Wisconsin dataset [46], modified as in [24, 28], as well as the Letter and Speech datasets from [39]. 
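The counting protocol above can be sketched as follows (a toy reimplementation under our own assumptions: 40 synthetic 2-D points with one planted outlier, a short swap chain per sample, and small sample sizes; none of this is the paper's experimental code):

```python
import math
import random
import numpy as np

def rbf_kernel(X, beta):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / beta)

def swap_chain_sample(L, k, p, steps, rng):
    """Alg. 2-style swap chain targeting nu(S) ∝ det(L[S])^p over |S| = k."""
    n = L.shape[0]
    S = rng.sample(range(n), k)
    def logdet(T):
        sign, val = np.linalg.slogdet(L[np.ix_(T, T)])
        return val if sign > 0 else float("-inf")
    cur = logdet(S)
    for _ in range(steps):
        u = rng.choice(S)                                    # element to drop
        v = rng.choice([i for i in range(n) if i not in S])  # element to add
        T = [i for i in S if i != u] + [v]
        new = logdet(T)
        if p * (new - cur) > math.log(rng.random() + 1e-300):
            S, cur = T, new
    return set(S)

# 39 tightly clustered points plus one planted outlier (index 0):
X = np.random.default_rng(1).normal(scale=0.1, size=(40, 2))
X[0] = (4.0, 4.0)
L = rbf_kernel(X, beta=1.0)

counts = np.zeros(40)
for trial in range(50):  # stand-in for the paper's n/5 samples
    for i in swap_chain_sample(L, k=5, p=2.0, steps=200, rng=random.Random(trial)):
        counts[i] += 1
outliers = np.flatnonzero(counts >= 0.5 * counts.max())
# The planted outlier dominates the sampling frequencies and is flagged.
```

Because the isolated point is nearly orthogonal (under the RBF kernel) to the dense cluster, sets containing it have much larger determinants; with p = 2 this preference is sharpened further, so its sampling frequency stands far above the threshold.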
We also report the performance of a selection of standard outlier detection algorithms whose reported performance in [24] is competitive with other outlier detection algorithms: Local Outlier Factor (LOF) [12], k-Nearest Neighbor (k-NN) [7], Histogram-based Outlier Score (HBOS) [23], Local Outlier Probability (LoOP) [28] and unweighted Cluster-Based Local Outlier Factor (uCBLOF) [3, 24].\n\nTable 1: AUC (mean ± standard deviation) for E-DPPs and standard outlier detection algorithms. As expected, we see that a higher exponent leads to a stronger preference for diversity and hence a better outlier detection scheme. Only LoOP and LOF consistently outperform E-DPPs.\n\n        | p = 0.5       | p = 1         | p = 2         | LOF*          | k-NN*         | HBOS*         | LoOP*         | uCBLOF*\nCancer  | 0.952 ± 0.018 | 0.962 ± 0.004 | 0.965 ± 0.001 | 0.982 ± 0.002 | 0.979 ± 0.001 | 0.983 ± 0.002 | 0.973 ± 0.012 | 0.950 ± 0.039\nLetter  | 0.780 ± 0.013 | 0.820 ± 0.003 | 0.847 ± 0.002 | 0.867 ± 0.027 | 0.872 ± 0.018 | 0.622 ± 0.007 | 0.907 ± 0.008 | 0.819 ± 0.023\nSpeech  | 0.455 ± 0.007 | 0.439 ± 0.011 | 0.445 ± 0.002 | 0.504 ± 0.022 | 0.497 ± 0.010 | 0.471 ± 0.003 | 0.535 ± 0.034 | 0.469 ± 0.003\n\nResults are reported in Table 1; as expected, we see that larger values of p (in this case, p = 2) are more sensitive to outliers, and provide better models for outlier detection.\n\n4.3 E-DPPs for the Nyström method\n\nAs a more standard application of DPPs, we now investigate the use of E-DPPs for kernel reconstruction via the Nyström method [40, 47]. Given a large kernel K, the Nyström method selects a subset C of columns ("landmarks") of K and approximates K as K[:, C] K[C, C]^† K[C, :]. 
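A minimal numpy sketch of this reconstruction step (our illustration; the landmark indices are arbitrary here, whereas the paper selects them with an E-DPP):

```python
import numpy as np

def nystrom(K, C):
    """Nystrom reconstruction K[:, C] K[C, C]^+ K[C, :] for landmark indices C."""
    W_pinv = np.linalg.pinv(K[np.ix_(C, C)])
    return K[:, C] @ W_pinv @ K[C, :]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # RBF kernel

few = nystrom(K, list(range(5)))
many = nystrom(K, list(range(25)))
err_few = np.linalg.norm(K - few)
err_many = np.linalg.norm(K - many)
# With nested landmark sets, more landmarks can only improve the reconstruction
# of a PSD kernel, so err_many <= err_few.
```

The approximation has rank at most |C|, so the art is entirely in picking C; the experiments below compare E-DPP landmark selection against uniform and leverage-score sampling.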
Unsurprisingly, DPPs have successfully been applied to landmark selection for the Nyström approach [2, 30]. We show here that E-DPPs further improve upon the recent results of [30] for kernel reconstruction.\n\nWe apply Kernel Ridge Regression to 3 regression datasets: Ailerons, Bank32NH, and Machine CPU². We subsample 4,000 points from each dataset (3,000 training and 1,000 test), use an RBF kernel, and choose the bandwidth β and regularization parameter λ for each dataset by 10-fold cross-validation. Results are averaged over 3 random subsets of data, using the swap-chain sampler initialized with k-means++ and run for 3000 iterations.\n\nFigure 3: Prediction error on regression datasets; we compare various E-DPP models to uniform sampling ("unif") as well as leverage and regularized leverage sampling ("lev" and "reglev"). On all datasets, the E-DPPs achieve the lowest error, with the largest exponent p = 2 performing markedly better than other methods.\n\nWe evaluate the quality of the sampler via the prediction error on the held-out test set. Figure 3 reports the results. Consistently across all datasets, p = 2 outperforms all other samplers in terms of prediction error, in particular when only sampling a few landmarks. 
Interestingly, we also see that the reconstruction error tends to be smaller when p = 1/2 (see Appendix F).

[Figure 3 panels: (a) Ailerons dataset, (b) Bank32NH dataset, (c) CPU dataset; each plots test error against landmark count.]

2 http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html

5 Conclusion and extensions
Many machine learning problems have been shown to benefit from the negative dependence properties of Strongly Rayleigh measures: measures based on elementary symmetric polynomials, including (dual) volume sampling, have been applied to experimental design; DPPs have been applied successfully to fields ranging from automatic summarization to minibatch selection and neural network pruning. However, tuning the strength of the quality/diversity tradeoff of SR measures requires significant effort.
We introduced Exponentiated Strongly Rayleigh measures, an extension of Strongly Rayleigh measures which augments standard SR measures with an exponent p, allowing for straightforward tuning of the quality-diversity trade-off of SR distributions. Intuitively, p controls how much priority should be given to diversity requirements. We show that although ESR measures do not necessarily remain SR, certain distributions lie at the intersection of both classes.
Despite their intractable partition function, ESR measures can leverage existing fast-mixing Markov chains for SR measures, enabling finer bounds than those obtained for the broader class of log-submodular models.
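The role of p as an inverse temperature can be checked exactly on a tiny ground set by enumeration: as p grows, the distribution ν ∝ μ^p concentrates mass on the most diverse subsets and its entropy drops. A minimal sketch (the toy kernel and function name are ours, not from the experiments):

```python
import numpy as np
from itertools import combinations

def e_dpp_probs(L, k, p):
    """Exact probabilities of an exponentiated k-DPP, Pr(S) prop. to det(L_S)^p."""
    subsets = list(combinations(range(L.shape[0]), k))
    weights = np.array([np.linalg.det(L[np.ix_(S, S)]) ** p for S in subsets])
    return subsets, weights / weights.sum()

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
L = M @ M.T + 0.5 * np.eye(6)          # a well-conditioned PSD kernel

# Entropy shrinks as p grows: mass concentrates on the most diverse subsets.
entropies = {}
for p in (0.5, 1.0, 2.0):
    _, probs = e_dpp_probs(L, 3, p)
    entropies[p] = float(-(probs * np.log(probs)).sum())
```

Note that exponentiation never changes which subset is most likely (x ↦ x^p is monotone for p > 0); it only sharpens or flattens the distribution around that mode.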
We derive general-purpose mixing bounds based on the distance from the target distribution ν to an SR distribution μ; we then show that these bounds can be further improved by specifying a carefully calibrated SR proposal distribution μ, as is the case for Exponentiated DPPs. We verified empirically that ESR measures and the algorithms we derive are valuable modeling tools for machine learning tasks, such as outlier detection and kernel reconstruction. Let us also note that there remain several theoretical and practical open questions regarding ESR measures; in particular, we believe that further characterizing the class of ESR measures that also remain SR may provide valuable insight into the study of negatively associated measures.
Finally, one easily verifies that given an SR measure μ and a collection of i.i.d. subsets S = {S_1, ..., S_m}, the MLE problem that finds the best p > 0 to model S as being sampled from an ESR ν ∝ μ^p is convex:

    argmax_{p > 0}  (p/m) ∑_{k=1}^{m} log μ(S_k) − log ( ∑_{S ⊆ [n]} μ(S)^p ).    (5.1)

As such, standard convex optimization algorithms can be leveraged to select p, potentially after learning a parametrization of μ.

Acknowledgements. This work is in part supported by NSF CAREER award 1553284, NSF-BIGDATA award 1741341, and by The Defense Advanced Research Projects Agency (grant number YFA17 N66001-17-1-4039). The views, opinions, and/or findings contained in this article are those of the authors and should not be interpreted as representing the official views or policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the Department of Defense.

References
[1] R. Affandi, E. Fox, R. Adams, and B. Taskar. Learning the parameters of Determinantal Point Process kernels. In ICML, 2014.

[2] R. H. Affandi, A. Kulesza, E. B. Fox, and B. Taskar.
Nyström approximation for large-scale determinantal processes. In Proc. Int. Conference on Artificial Intelligence and Statistics (AISTATS), 2013.

[3] M. Amer and M. Goldstein. Nearest-neighbor and clustering based anomaly detection algorithms for RapidMiner. In Proceedings of the 3rd RapidMiner Community Meeting and Conference (RCOMM 2012), pages 1-12. Shaker Verlag GmbH, August 2012.

[4] N. Anari and S. Oveis Gharan. A generalization of permanent inequalities and applications in counting and optimization. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC). ACM, 2017.

[5] N. Anari, S. O. Gharan, and A. Rezaei. Monte Carlo Markov Chain algorithms for sampling strongly Rayleigh distributions and Determinantal Point Processes. In Conference on Learning Theory (COLT), 2016.

[6] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50:5-43, 2003.

[7] F. Angiulli and C. Pizzuti. Fast outlier detection in high dimensional spaces. In Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery, PKDD '02, pages 15-26, London, UK, 2002. Springer-Verlag.

[8] H. Avron and C. Boutsidis. Faster subset selection for matrices and applications. SIAM J. Matrix Analysis Applications, 34(4):1464-1499, 2013.

[9] J. Borcea and P. Brändén. Applications of stable polynomials to mixed determinants: Johnson's conjectures, unimodality and mixed Fischer products. Duke Math. J., pages 205-223, 2008.

[10] J. Borcea and P. Brändén. Pólya-Schur master theorems for circular domains and their boundaries. Annals of Mathematics, 170(1):465-492, 2009.

[11] J. Borcea, P. Brändén, and T. Liggett. Negative dependence and the geometry of polynomials. Journal of the American Mathematical Society, 22:521-567, 2009.

[12] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander.
LOF: Identifying density-based local outliers. SIGMOD Rec., 29(2):93-104, May 2000.

[13] S. P. Brooks and A. Gelman. General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7:434-455, 1998.

[14] H. Cai. Exact bound for the convergence of Metropolis chains. Stochastic Analysis and Applications, 18(1):63-71, 2000.

[15] W. Chao, B. Gong, K. Grauman, and F. Sha. Large-margin Determinantal Point Processes. In Uncertainty in Artificial Intelligence (UAI), 2015.

[16] M. Derezinski and M. K. Warmuth. Unbiased estimates for linear regression via volume sampling. In Advances in Neural Information Processing Systems 30, pages 3084-3093. Curran Associates, Inc., 2017.

[17] P. Diaconis and L. Saloff-Coste. Comparison theorems for reversible Markov chains. The Annals of Applied Probability, 3(3):696-730, 1993.

[18] P. Diaconis and D. Stroock. Geometric bounds for eigenvalues of Markov chains. Annals of Applied Probability, pages 36-61, 1991.

[19] J. Djolonga and A. Krause. From MAP to marginals: Variational inference in Bayesian submodular models. In Advances in Neural Information Processing Systems 27, pages 244-252. Curran Associates, Inc., 2014.

[20] J. Djolonga, S. Tschiatschek, and A. Krause. Variational inference in mixed probabilistic submodular models. In Neural Information Processing Systems (NIPS), 2016.

[21] M. Gartrell, U. Paquet, and N. Koenigstein. Low-rank factorization of Determinantal Point Processes. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages 1912-1918, 2017.

[22] J. Gillenwater. Approximate Inference for Determinantal Point Processes. PhD thesis, University of Pennsylvania, 2014.

[23] M. Goldstein and A. Dengel.
Histogram-based outlier score (HBOS): A fast unsupervised anomaly detection algorithm. In KI-2012: Poster and Demo Track, 35th German Conference on Artificial Intelligence (KI-2012), September 24-27, Saarbrücken, Germany, pages 59-63, September 2012.

[24] M. Goldstein and S. Uchida. A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE, 11(4):1-31, April 2016.

[25] A. Gotovos, S. Hassani, and A. Krause. Sampling from probabilistic submodular models. In Advances in Neural Information Processing Systems (NIPS), 2015.

[26] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97-109, 1970.

[27] A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. Journal of Machine Learning Research, 9:235-284, 2008.

[28] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. LoOP: Local outlier probabilities. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM '09, pages 1649-1652, New York, NY, USA, 2009. ACM.

[29] A. Kulesza and B. Taskar. Determinantal Point Processes for machine learning, volume 5. Foundations and Trends in Machine Learning, 2012.

[30] C. Li, S. Jegelka, and S. Sra. Fast DPP sampling for Nyström with application to kernel methods. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML'16, pages 2061-2070. JMLR.org, 2016.

[31] C. Li, S. Jegelka, and S. Sra. Fast mixing Markov chains for strongly Rayleigh measures, DPPs, and constrained sampling. In Advances in Neural Information Processing Systems (NIPS), 2016.

[32] C. Li, S. Jegelka, and S. Sra. Fast DPP sampling for Nyström with application to kernel methods. In Int.
Conference on Machine Learning (ICML), 2016.

[33] C. Li, S. Jegelka, and S. Sra. Polynomial time algorithms for dual volume sampling. In Advances in Neural Information Processing Systems 30, pages 5038-5047. Curran Associates, Inc., 2017.

[34] H. Lin and J. Bilmes. Learning mixtures of submodular shells with application to document summarization. In Uncertainty in Artificial Intelligence (UAI), 2012.

[35] A. W. Marcus, D. A. Spielman, and N. Srivastava. Interlacing families II: Mixed characteristic polynomials and the Kadison-Singer problem. Annals of Mathematics, 182(1):327-350, 2015.

[36] Z. Mariet and S. Sra. Diversity networks. In International Conference on Learning Representations (ICLR), 2016.

[37] Z. Mariet and S. Sra. Elementary symmetric polynomials for optimal experimental design. In Advances in Neural Information Processing Systems 30, pages 2136-2145. Curran Associates, Inc., 2017.

[38] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087-1092, 1953.

[39] B. Micenková, B. McWilliams, and I. Assent. Learning outlier ensembles: The best of both worlds - supervised and unsupervised. In Proceedings of the ACM SIGKDD 2014 Workshop on Outlier Detection and Description under Data Diversity, 2014.

[40] E. Nyström. Über die praktische Auflösung von Integralgleichungen mit Anwendungen auf Randwertaufgaben. Acta Mathematica, 54(1):185-204, 1930.

[41] R. Pemantle. Towards a theory of negative dependence. Journal of Mathematical Physics, 41:1371-1390, 2000.

[42] R. Pemantle and Y. Peres. Concentration of Lipschitz functionals of determinantal and other strong Rayleigh measures. Combinatorics, Probability and Computing, 23(1):140-160, 2014.

[43] P. Rebeschini and A. Karbasi. Fast mixing for discrete point processes. In Conference on Learning Theory (COLT), 2015.

[44] T. Shirai and Y. Takahashi. Random point fields associated with certain Fredholm determinants I: fermion, Poisson and boson point processes. Journal of Functional Analysis, 205(2):414-463, 2003.

[45] W. Specht. Zur Theorie der elementaren Mittel. Math. Zeitschr., 74:91-98, 1960.

[46] N. Street, W. H. Wolberg, and O. L. Mangasarian. Nuclear feature extraction for breast tumor diagnosis, 1993.

[47] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13, pages 682-688. MIT Press, 2001.

[48] T. Zhou, Z. Kuscsik, J.-G. Liu, M. Medo, J. R. Wakeling, and Y.-C. Zhang. Solving the apparent diversity-accuracy dilemma of recommender systems. PNAS, 107(10):4511-4515, 2010.

[49] J. Y. Zou and R. Adams. Priors for diversity in generative latent variable models. In Advances in Neural Information Processing Systems (NIPS), 2012.