{"title": "Provable Variational Inference for Constrained Log-Submodular Models", "book": "Advances in Neural Information Processing Systems", "page_first": 2697, "page_last": 2707, "abstract": "Submodular maximization problems appear in several areas of machine learning and data science, as many useful modelling concepts such as diversity and coverage satisfy this natural diminishing returns property. Because the data defining these functions, as well as the decisions made with the computed solutions, are subject to statistical noise and randomness, it is arguably necessary to go beyond computing a single approximate optimum and quantify its inherent uncertainty. To this end, we define a rich class of probabilistic models associated with constrained submodular maximization problems. These capture log-submodular dependencies of arbitrary order between the variables, but also satisfy hard combinatorial constraints. Namely, the variables are assumed to take on one of \u2014 possibly exponentially many \u2014 set of states, which form the bases of a matroid. To perform inference in these models we design novel variational inference algorithms, which carefully leverage the combinatorial and probabilistic properties of these objects. In addition to providing completely tractable and well-understood variational approximations, our approach results in the minimization of a convex upper bound on the log-partition function. The bound can be efficiently evaluated using greedy algorithms and optimized using any first-order method. Moreover, for the case of facility location and weighted coverage functions, we prove the first constant factor guarantee in this setting \u2014 an efficiently certifiable e/(e-1) approximation of the log-partition function. Finally, we empirically demonstrate the effectiveness of our approach on several instances.", "full_text": "Provable Variational Inference for\n\nConstrained Log-Submodular Models\n\nJosip Djolonga\n\nDept. 
of Computer Science
ETH Zürich
josipd@inf.ethz.ch

Stefanie Jegelka
CSAIL, MIT
stefje@csail.mit.edu

Andreas Krause
Dept. of Computer Science, ETH Zürich
krausea@ethz.ch

Abstract

Submodular maximization problems appear in several areas of machine learning and data science, as many useful modelling concepts such as diversity and coverage satisfy this natural diminishing returns property. Because the data defining these functions, as well as the decisions made with the computed solutions, are subject to statistical noise and randomness, it is arguably necessary to go beyond computing a single approximate optimum and quantify its inherent uncertainty. To this end, we define a rich class of probabilistic models associated with constrained submodular maximization problems. These capture log-submodular dependencies of arbitrary order between the variables, but also satisfy hard combinatorial constraints. Namely, the variables are assumed to take on one of a — possibly exponentially many — set of states, which form the bases of a matroid. To perform inference in these models we design novel variational inference algorithms, which carefully leverage the combinatorial and probabilistic properties of these objects. In addition to providing completely tractable and well-understood variational approximations, our approach results in the minimization of a convex upper bound on the log-partition function. The bound can be efficiently evaluated using greedy algorithms and optimized using any first-order method. Moreover, for the case of facility location and weighted coverage functions, we prove the first constant factor guarantee in this setting — an efficiently certifiable e/(e−1) approximation of the log-partition function. 
Finally, we empirically demonstrate the effectiveness of our approach on several instances.

1 Introduction

Many real-world tasks can be modeled as distributions over combinatorial objects such as trees, assignments or selections. As an illustrative example, let us consider the following scenario inspired by the recent work of Celis et al. [1]. Assume that we are building a news aggregator and are faced with the task of populating the limited number of slots on the front page with articles originating from various news outlets. We furthermore assume that we have a function that, given a news article and a slot, estimates how good a match they are. Hence, if we decide that a certain subset of the articles should be shown, we can compute their optimal assignment using a maximal bipartite matching. Furthermore, to make sure that a diverse set of points of view are represented, we want the chosen articles to not only have a high matching value, but to also come from different sources. This can be enforced using a hard selection constraint — for example, we can require that each source j has exactly k_j articles on the front page. While the optimization problem has been well-studied, as it is that of submodular maximization, taking a probabilistic approach seems very challenging. Not only do the random variables have to satisfy complicated combinatorial requirements, but the utility function is only implicitly defined via optimal matchings and is very challenging for many approximate inference techniques. 
Nevertheless, by exploiting the submodular properties of the objective and the combinatorial and probabilistic properties of matroids we will develop a method that can easily handle such models with combinatorial constraints and complex long-ranging variable interactions.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Another family of constraints that often appears is that of (directed) spanning trees. Namely, we are only interested in subsets of edges of a graph that form a spanning tree. Such constraints can model information cascades in network inference [2], or non-projective dependency parse trees in natural language processing [3, 4]. Moreover, the 1-of-K encoding typically used for multi-label inference tasks is perhaps the simplest and most frequent case of a hard combinatorial assignment constraint. What all these applications have in common is that they give rise to joint distributions over a set of dependent random variables, each of which is itself a combinatorial object (a spanning tree in network inference and dependency parsing; a discrete selection in multi-label inference and slot allocation). Inference in such combinatorial models is complex due to two sources of dependencies. First, the distribution may express pairwise or higher-order dependencies between elements (in our previous example, the value of the optimal matching). Second, we have strict combinatorial constraints on the support of the distribution (e.g., only trees are allowed) that implicitly induce further interactions.
In this work, we undertake a variational inference approach and approximate these rich distributions with simpler ones that respect the combinatorial constraints but are fully tractable. 
These approximations possess very strong negative association properties, which we utilize in our theory. To find the optimal approximation we minimize a Rényi divergence over these distributions, which results in efficiently minimizable convex upper bounds on the partition function. While variational inference methods rarely provide any approximation guarantees, our approach yields provably good approximations for certain model families. In summary, this paper makes the following contributions.

• Fast variational convex algorithms for a large family of probabilistic models with submodular dependencies of arbitrarily high order in combination with hard combinatorial constraints.
• By combining results from approximate inference and submodular maximization, we prove the first constant factor approximation on the log-partition function for facility location and weighted coverage functions under a family of matroid constraints. We specifically show that our upper bound does not exceed the true value by more than a factor of (1 − 1/e)^{−1}.
• An empirical evaluation of the proposed techniques on several problem instances.

Related work. Bouchard-Côté and Jordan [5] introduce a class of variational techniques over combinatorial spaces, but they make a different set of assumptions — they assume a product space and models that are tractable when retaining only one of the constraints. There has also been interest in applying belief propagation (BP) to structured problems such as dependency parsing [3]. Our approach makes a different set of factorization assumptions, and in contrast to BP, provides a bound on the partition function and is guaranteed to converge without any damping heuristics. Other methods that provide upper bounds make factorization assumptions not satisfied by the models we consider [6, 7, 8], or have to repeatedly solve hard optimization problems [9, 10]. 
MCMC sampling methods for distributions over more general combinatorial objects have been addressed in a rich literature [11]. Li et al. [12] consider distributions over partition and uniform matroids that also allow for non-linear dependencies between the variables and develop Gibbs samplers whose mixing time grows exponentially with the non-linearity of the model. In the unconstrained case, the mixing time as a function of non-submodularity has been analyzed in [13, 14].
Variational inference in unconstrained probabilistic submodular models was considered by Djolonga and Krause [15], whose inference method for log-supermodular models was shown to be equivalent to the minimization of the inclusive Rényi divergence [16], which we also use as the variational objective in this paper. The minimization of this divergence for decomposable unconstrained models has been studied in Djolonga et al. [17], who also utilize the M♮-concavity of the terms. Inference in multi-label log-supermodular models has been considered by Zhang et al. [18]. The tractable distributions used in our variational framework have already been studied [19, 20, 21]. Some of them are determinantal point processes (DPPs), which have already been used in machine learning [22]. Risteski [23] has proved a constant factor approximation for the log-partition function of certain Ising models using a variational approach, and also leverages the mean-field bound in the proof.

2 Background — submodularity, matroids and continuous extensions

Submodularity [24, 25] formalizes the concept of diminishing returns — the benefit of adding an element decreases with the growth of the context in which it is being included. 
Formally, a set function F : 2^V → [0, ∞) is said to be submodular if for all X ⊆ Y ⊆ V and i ∈ V \ Y it holds that F(i | X) ≥ F(i | Y), where the marginal gain F(i | X) is defined as F({i} ∪ X) − F(X). To keep the notation as simple as possible, we will w.l.o.g. assume that V = {1, 2, . . . , n}.
A classical family of submodular functions are set cover functions. If we associate to each i ∈ V a set U_i ⊆ U of elements from some finite universe U, the function is given as the size of the union of the chosen sets, i.e., F(X) = |∪_{i∈X} U_i|. Another well-known function class are facility location functions, defined as F(X) = Σ_{j=1}^m max_{i∈X} w_{i,j} for some non-negative weights w_{i,j} ≥ 0. The name stems from the following scenario: a set of facilities V serve m customers such that customer j receives a utility of w_{i,j} from facility i ∈ V, and F(X) measures the total utility from the facilities X if each customer can be served by exactly one facility. Moreover, many problems, such as exemplar clustering, which we use in the experimental section, can be modelled using this function class.

Maximization. As both examples above model utilities, a natural problem that arises is that of finding a configuration X ⊆ V that maximizes F — cover as much as possible from U, or serve as many customers as possible from the opened facilities. Note that the above functions are not only submodular, but also monotone — adding an item can never decrease the value. Moreover, we typically want to find the maximal X subject to some constraints. A classical problem is that of maximizing over all sets of cardinality at most k. In this case, Nemhauser et al. [26] have proven that a simple greedy algorithm results in a provably good solution. 
Specifically, we start with X_0 = ∅, and construct the set X_{j+1} as the union of X_j and any element in arg max_{i∈V\X_j} F(i | X_j). Then, the guarantee is that F(X_k) ≥ (1 − 1/e) max_{X : |X|≤k} F(X), which is also optimal unless P = NP.

M♮-concavity. There exists a subclass of submodular functions for which the above algorithm exactly maximizes the function even when it is not monotone, if we stop once we see a negative gain. These functions, known as M♮-concave [27, §4], are defined as follows: for all X, Y ⊆ V and i ∈ X \ Y either (i) F(X) + F(Y) ≤ F(X \ {i}) + F(Y ∪ {i}), or (ii) there exists some j ∈ Y \ X such that F(X) + F(Y) ≤ F(X \ {i} ∪ {j}) + F(Y \ {j} ∪ {i}). Moreover, it also holds that X_k = arg max_{X : |X|=k} F(X) [28, Lem. 6.3, 29, 30]. This family contains (see e.g. [31, §3.6]) the maximum function max_{i∈X} w_{i,j}, weighted matroid rank functions, the value of the optimal bipartite matching used in the introduction, as well as functions of the form F(X) = Σ_{j=1}^m φ_j(|X ∩ B_j|) for any concave φ_j : R → R and laminar {B_j}_{j=1}^m. While not M♮-concave themselves, many submodular functions, such as facility location, can be written as sums of M♮-concave terms — a fact that we will exploit later on in this paper.

Matroids. Submodular maximization has been studied not only under cardinality constraints, but also under a broader set of structures that have particularly nice mathematical properties: matroids.
Definition 1 (Oxley [32]). A matroid M consists of a ground set V = {1, 2, . . .
, n} and a collection I ⊆ 2^V of subsets of V (called independent) that satisfy:
(i) ∅ ∈ I.
(ii) If X ∈ I and Y ⊆ X then Y ∈ I.
(iii) If X ∈ I and Y ∈ I and |X| < |Y| then there exists some y ∈ Y \ X such that X ∪ {y} ∈ I.
A set X ∈ I is maximal if for all e ∈ V \ X, we have that X ∪ {e} ∉ I. We will focus on the case when M is the collection of all maximal sets in I. These maximal sets are the bases of the matroid. This framework encompasses for instance both cardinality constraints and spanning trees. Namely, the set I = {X ⊆ V | |X| ≤ k} is known as the uniform matroid and its bases are all subsets of cardinality exactly k, while the set of spanning trees form the bases of the graphic matroid, defined as the collection of edge subsets that are cycle-free. This latter example belongs to the family of regular matroids that are defined as follows. Let

U = [ u_1 u_2 · · · u_n ] ∈ {0, ±1}^{r×n}

be a totally unimodular (TU) matrix, meaning that every square submatrix of U has a determinant in {0, ±1}. A subset X ⊆ V is said to be independent if the columns of U indexed by X are linearly independent. The bases of this matroid are the subsets of the columns of U that form a basis of the column space of U. We can think of the i-th column u_i as the representation of element i. As a concrete example, the graphic matroid of a graph G = (V, E) is generated by the (arbitrarily oriented) edge-vertex incidence matrix U ∈ {0, ±1}^{(|V|−1)×|E|} of G after removing an arbitrary vertex.

3 The problem and our approach

Formally, we have a random variable X that takes values in a set of combinatorial objects M. For example, X could be a random tree drawn from the collection M of all trees in some graph G. 
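The regular-matroid construction just described can be made concrete with a small sketch (a 4-cycle graph, chosen here purely for illustration): independence of an edge subset is a rank condition on the corresponding columns of the reduced incidence matrix.

```python
import numpy as np
from itertools import combinations

# Edge-vertex incidence matrix of an (arbitrarily oriented) 4-cycle,
# with one vertex row removed, as in the graphic-matroid construction.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
U = np.zeros((4, len(edges)))
for col, (a, b) in enumerate(edges):
    U[a, col], U[b, col] = 1.0, -1.0
U = U[:-1, :]  # remove an arbitrary vertex

def independent(cols):
    # An edge subset is independent iff its columns are linearly independent
    cols = list(cols)
    return np.linalg.matrix_rank(U[:, cols]) == len(cols)

# The bases are the spanning trees: any 3 of the 4 cycle edges
bases = [S for S in combinations(range(len(edges)), 3) if independent(S)]
```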
We can think of the members of M as being the valid configurations among all possible sets in 2^V, so that any configuration not in M should get a probability of zero. Specifically, we consider distributions over the configurations M of the general form

P(X = X) = ⟦X ∈ M⟧ exp(F(X)) / Z,   (1)

where F : 2^V → R is the objective function and ⟦·⟧ is the Iverson bracket. Note that the problem of computing the MAP configuration reduces to max_{X∈M} F(X), which can be approximated within a factor of (1 − 1/e) [33] when F is monotone submodular and M are matroid bases. We make the following additional assumptions about F and the set M constituting the support of the distribution.

(i) It holds that F = Σ_{j=1}^m F_j for some monotone M♮-concave functions F_j : 2^V → [0, ∞).
(ii) The set M consists of the bases of a matroid, which is a direct sum of uniform and totally unimodular matroids, which we will call normalizable.

We would like to point out that the model class is closed under conditioning, as M♮-concave functions are closed under restrictions, and both uniform and TU matroids are closed under taking minors. The MAP problem under (i) has been studied in [34]. Note that, unlike many inference methods, we make no assumption about the number of variables that each F_j depends on, also known as its order.
We will pay special attention to the case when F is a facility location, or equivalently, a weighted coverage function, i.e., of the form F(X) = Σ_{i∈U} w_i ⟦i ∈ ∪_{i∈X} U_i⟧, where U_i and U are defined as in the unweighted case, and w_i ≥ 0 are arbitrary weights. As a specific instantiation, let us consider the FLID model of Tschiatschek et al. [35], which has been successfully applied to the problem of item set recommendation. Specifically, we have a set of items V = {1, 2, . . . , n} that we want to recommend to the user. Moreover, we assume that there are a total of m traits, and item i expresses a level of w_{i,j} ≥ 0 for trait j ∈ {1, 2, . . . , m}. Then, the idea is that the function F(X) = Σ_{j=1}^m max_{i∈X} w_{i,j} + Σ_{i∈X} u_i captures the classical notion of substitutes — once we select an item that has a high expression level of some trait, those items similar to it will be less likely to be included. In addition, there is the modular function Σ_{i∈X} u_i to model the quality of individual items. Similarly to the example in the introduction, we can explicitly enforce the user to see a diverse set of offers by for example presenting them with a fixed number of items from each brand — if items X_p are produced by producer p, then we can use M = {X ⊆ V | ∀p : |X_p ∩ X| ≤ k_p}, also called a partition matroid, which, as a direct sum of uniform matroids, satisfies our modelling assumptions.
The central problem of interest in this paper is to compute marginal probabilities P(Y ⊆ X) for any set Y ⊆ V. In its general form, this problem is hard, owing to the presence of the intractable normalizer Z, whose computation is also important for the computation of likelihoods and model selection. We therefore revert to approximate techniques for computing the marginal probabilities and the partition function Z. Specifically, we will undertake a divergence minimization approach, which will yield both an estimate of log Z and approximate marginals. Namely, we will first define a set of approximate distributions Q that are rich enough to capture some of the properties of the target distribution P, but are computationally tractable. 
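To make the constraint sets concrete, the partition matroid from the recommendation example above can be enumerated explicitly for a toy instance (the blocks and quotas below are made up for illustration):

```python
from itertools import combinations, product

# Hypothetical blocks (e.g., producers) and per-block quotas k_p
blocks = [[0, 1, 2], [3, 4], [5, 6, 7]]
quotas = [1, 1, 2]

def partition_matroid_bases(blocks, quotas):
    # A basis picks exactly k_p elements from each block V_p
    per_block = [combinations(B, k) for B, k in zip(blocks, quotas)]
    for choice in product(*per_block):
        yield frozenset(i for part in choice for i in part)

bases = list(partition_matroid_bases(blocks, quotas))
```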
Then, we will find the distribution Q in Q that is the closest to P, as measured by some measure of distributional discrepancy, also called a divergence.

4 “Simple” distributions over matroid bases

We begin with a characterization of the distributions Q that will serve as approximations. These distributions correspond to modular objective functions, so that for some θ ∈ R^n they are given as F(X) = Σ_{e∈X} θ_e = θ^⊤ 1_X, where 1_X ∈ {0, 1}^n is the characteristic vector of X with ones only at coordinates X. Formally, they belong to the exponential family and can be written as

Q_θ(X = X) = exp(θ^⊤ 1_X − A(θ)) ⟦X ∈ M⟧.   (2)

Here, A(θ) = log Σ_{X∈M} exp(θ^⊤ 1_X) is the normalizing log-partition function, and M is the set of bases of the considered matroid classes. Because of the constraint ⟦X ∈ M⟧, the distribution is not a product distribution, and the elements i ∈ V are not independent. Even though computing A(θ) can be challenging for arbitrary constraints, it can be efficiently done for the considered classes. In what follows we will assume that we have a single normalizable matroid, as the result for their direct sums easily follows. In the uniform matroid case, (2) is known as a cardinality potential, and both A(θ) and the unary marginals can be computed in O(nk) using the algorithm of Tarlow et al. [19]. If M is a regular matroid, the model can be efficiently normalized via the celebrated matrix-tree theorem.

Theorem 1 (Maurer [36]). For regular matroids, it holds that A(θ) = log det Σ_{i=1}^n e^{θ_i} u_i u_i^⊤.

Lyons [20] showed that the distribution (2) is a determinantal point process (DPP) with the scaled representation U_θ = (U diag(exp(θ)) U^⊤)^{−1/2} U diag(exp(θ/2)), and can be marginalized as follows.

Theorem 2 ([20, Remark 5.6]). The marginal probability of any Y ⊆ V is equal to

P(Y ⊆ X) = det K_Y,   (3)

where K = U_θ^⊤ U_θ ∈ R^{n×n} and K_Y is the submatrix formed by the rows and columns indexed by Y.

For example, the first and second order moments are given by

P(e ∈ X) = ‖u_e^θ‖², and P({e, e′} ⊆ X) = ‖u_e^θ‖² ‖u_{e′}^θ‖² − ⟨u_e^θ, u_{e′}^θ⟩²,   (4)

which implies that the elements e, e′ are negatively correlated: their joint probability is smaller than if they were independent. Moreover, an even stronger condition can be stated — both cases are strongly Rayleigh [37, Coro. 4.18, Prop. 3.5], so that for any Q ∈ Q we have that E_{A∼Q}[G(A)H(A)] ≤ E_{A∼Q}[G(A)] E_{A∼Q}[H(A)] for any monotone functions G and H that depend on disjoint coordinates¹. As Q is an exponential family, it has many remarkable properties (for proofs see e.g. [38]), some of which we now present. The marginals, i.e., the vector μ ∈ [0, 1]^n with entries μ_i = E_{X∼Q_θ}[⟦i ∈ X⟧], can be easily computed from the log-partition function as μ = ∇A(θ). An important object associated with Q is the marginal polytope, the set of all realizable unary marginals by any distribution over M. 
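Theorems 1 and 2 can be sanity-checked numerically on a toy graphic matroid, the spanning trees of a triangle; this NumPy sketch uses made-up parameters θ and compares against brute-force enumeration of the bases.

```python
import numpy as np
from itertools import combinations

# Triangle graph, incidence matrix with one vertex row removed (Section 2)
edges = [(0, 1), (1, 2), (2, 0)]
U = np.zeros((3, 3))
for c, (a, b) in enumerate(edges):
    U[a, c], U[b, c] = 1.0, -1.0
U = U[:2, :]                          # 2 x 3 reduced incidence matrix
theta = np.array([0.3, -0.5, 1.2])    # hypothetical parameters

# Theorem 1 (matrix-tree): A(theta) = log det sum_i e^{theta_i} u_i u_i^T
L = (U * np.exp(theta)) @ U.T
A = np.log(np.linalg.det(L))

# Brute force over the bases (spanning trees = all 2-edge subsets here)
bases = list(combinations(range(3), 2))
A_brute = np.log(sum(np.exp(theta[list(S)].sum()) for S in bases))

# Theorem 2: K = U_theta^T U_theta with
# U_theta = (U diag(e^theta) U^T)^{-1/2} U diag(e^{theta/2})
w, V = np.linalg.eigh(L)
U_theta = (V @ np.diag(w ** -0.5) @ V.T) @ (U * np.exp(theta / 2))
K = U_theta.T @ U_theta
marginals = np.diag(K)                # P(e in X) = K[e, e]
```

The trace of K equals the rank of the matroid, so the marginals sum to the (fixed) size of a basis.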
In our case, it is equal to the convex hull of the bases, i.e., M = conv{1_A | A ∈ M}. Remarkably, Q is rich enough to represent any marginal vector in relint M, i.e., ∀μ ∈ relint M there exists some θ(μ) ∈ R^n such that E_{x∼Q_θ(μ)}[x] = μ. Furthermore, the convex conjugate A*(μ) of the log-partition function A evaluates to ∞ if μ ∉ M, and to −H[Q_θ(μ)] otherwise, where H[·] is Shannon's entropy function. Moreover, we can optimize linear functions over M using Edmonds' [25] celebrated algorithm in O(n log n). Namely, to solve max_{μ∈M} μ^⊤ θ, first sort θ in descending order θ_{σ(1)} ≥ θ_{σ(2)} ≥ . . . ≥ θ_{σ(n)}, and define the chain

X_0 = ∅, and X_i = X_{i−1} ∪ {σ(i)} if X_{i−1} ∪ {σ(i)} ∈ I, and X_i = X_{i−1} otherwise.

Then, it can be shown that 1_{X_n} is a maximizer. For spanning trees, this is exactly Kruskal's algorithm.

5 Inference using the inclusive infinite Rényi divergence

Having fixed the approximation family, we turn to the choice of the function that will quantify the distance between the distributions, and the analysis of the resulting optimization problem. In this paper we will use the inclusive Rényi ∞-divergence [39, 40], defined as

D∞(P ‖ Q) = log max_{X∈M} P(X)/Q(X).   (5)

In other words, it evaluates to the worst-case log-ratio between P and Q. In the terminology of Minka [41], it is an inclusive (zero-avoiding) divergence — it prefers more conservative distributions that do not assign event probabilities close to zero or one.

¹The case M = {X | |X| ≤ k} can also be normalized and Q is again strongly Rayleigh [37, Cor. 4.18].

To better understand the optimization problem that results from minimizing D∞, let us first define for F : 2^V → R and any M ⊆ 2^V the function f̂⋆(y | M) = inf_{X∈M} y^⊤ 1_X − F(X), which is easily seen to be concave. Then, by expanding the divergence and minimizing with respect to Q_θ ∈ Q we obtain the following upper bound²

log Z ≤ inf_{θ∈R^n} A(θ) − f̂⋆(θ | M) = sup_{μ∈M} f̂(μ | M) − A*(μ),   (6)

where f̂(μ | M) = f̂⋆⋆(μ | M) = inf_{y∈R^n} μ^⊤ y − f̂⋆(y | M) is the concave conjugate of f̂⋆(y | M), and the equality follows from Fenchel's duality. Unfortunately, we cannot evaluate the above bound as we do not know how to compute f̂⋆(μ | M), which requires the maximization of a non-monotone function over M. We do, however, know that F decomposes as a sum of M♮-concave functions, which we can leverage to obtain a more tractable bound using dual decomposition [42, 43].

Proposition 1. By applying dual decomposition to (6) we arrive at the following bound

log Z ≤ inf_{{θ_j}_{j=1}^m} A(Σ_{j=1}^m θ_j) − Σ_{j=1}^m f̂⋆_j(θ_j | M) = sup_{μ∈M} Σ_{j=1}^m f̂_j(μ | M) − A*(μ),   (7)

where we denote the objective of the infimum by R(θ_1, . . . , θ_m | M).

Now, instead of maximizing F over M, we only have to maximize each component F_j. Because we have assumed that each F_j is M♮-concave, if M is a uniform matroid remember that we can easily solve the resulting problem max_{|X|=k} F(X) − y^⊤ 1_X using the greedy strategy. Even though the general case seems much harder, it can be solved using Murota's duality theorem [27, Thm. 
8.21(i)] by introducing a set of m auxiliary variables {λ_j ∈ R^n}_{j=1}^m over which we also have to minimize.

Proposition 2. For any set of parameters {θ_j ∈ R^n}_{j=1}^m it holds that

A(Σ_{j=1}^m θ_j) − Σ_{j=1}^m f̂⋆_j(θ_j | M) = inf_{{λ_j}_{j=1}^m} A(Σ_{j=1}^m θ_j) − Σ_{j=1}^m f̂⋆_j(λ_j | V) + sup_{μ∈M} Σ_{j=1}^m μ^⊤(λ_j − θ_j).   (8)

Note that it is easy to both evaluate this bound and compute a subgradient. Namely, we can compute both the log-partition function and its derivatives using the methods from Section 4. The computation of both f̂⋆ and the linear maximization over M can be done using greedy algorithms, and the computed maxima are members of the corresponding subdifferentials. Hence, we can easily employ first-order convex methods to optimize this bound to arbitrary precision in polynomial time.

The facility location case. We will now prove a strong theoretical guarantee for the quality of the computed approximation for this important class. Specifically, we will show that the obtained upper bound is no greater than (1 − 1/e)^{−1} log Z ≈ 1.582 log Z. To this end, we first construct a lower bound on log Z, and then show that the lower and upper bounds are within a multiplicative constant of each other. Moreover, this lower bound can be easily evaluated, so that we can at any point return not only a bound, but also a corresponding certificate. We begin by introducing the multi-linear extension f̃ : [0, 1]^n → R [33] of F, defined as f̃(μ) = E_{x_i∼Bernoulli(μ_i)}[F(x)]. It can be evaluated within any accuracy using Monte-Carlo sampling, and also analytically for several cases such as facility location functions (see e.g. [44]). 
To derive the bound, we start from the mean-field bound [38] (details in appendix) E_{X∼Q}[F(X)] + H[Q] ≤ log Z, which holds for any distribution Q absolutely continuous with respect to P. Then, we use a result by Chekuri et al. [45, Lem. VI.1], which states that if F is a weighted sum of coverage functions and Q is negatively associated with unary marginals μ ∈ [0, 1]^n — both conditions satisfied for our model — then E_{X∼Q}[F(X)] ≥ f̃(μ).

Proposition 3. If F is a facility location function, then for any θ ∈ R^n it holds that

L(θ) = f̃(∇A(θ)) + H[Q_θ] = f̃(∇A(θ)) + A(θ) − ∇A(θ)^⊤ θ ≤ log Z.   (9)

We will actually prove a stronger result that holds not only for (7), but also if we relax the bound and replace f̂⋆_j(y | M) by f̂⋆_j(y | V), i.e., we ignore the constraints when we maximize. In other words, we will show that the bound

inf_{{θ_j}_{j=1}^m} A(Σ_{j=1}^m θ_j) − Σ_{j=1}^m f̂⋆_j(θ_j) = sup_{μ∈M} Σ_{j=1}^m f̂_j(μ) − A*(μ),   (10)

whose infimum objective we denote by R(θ_1, . . . , θ_m), is within a multiplicative constant of L evaluated at any optimizer of (10). Even though perhaps not immediately clear from their definitions, both f̂(μ | V) and f̃(μ) are extensions of F — if we see F as being defined over {0, 1}^n instead of 2^V using the natural bijection, then both of them agree with F for binary vectors and continuously fill in the rest of the unit cube. Moreover, they are closely related via the following result known as the correlation gap inequality.

²We defer the proofs of all results in this section to the appendix.

Theorem 3 ([46, Lem. 3.8, 47]). 
If F : 2^V → R is monotone submodular with F(∅) = 0, then

∀μ ∈ [0, 1]^n : (1 − 1/e) f̂(μ | V) ≤ f̃(μ) ≤ f̂(μ | V).

By combining these two results, we can finally prove the approximation result claimed above.

Theorem 4. If F is a facility location function and {θ*_j}_{j=1}^m minimizes (7) or (10), then

L(Σ_{j=1}^m θ*_j) ≤ log Z ≤ R(θ*_1, . . . , θ*_m) ≤ (1 − 1/e)^{−1} L(Σ_{j=1}^m θ*_j) ≤ (1 − 1/e)^{−1} log Z.   (11)

Furthermore, at any point during the optimization we can easily certify our approximation quality by computing C(θ_1, . . . , θ_m) = R(θ_1, . . . , θ_m) / L(Σ_{j=1}^m θ_j), as the true approximation factor R(θ_1, . . . , θ_m)/ log Z is guaranteed to be upper bounded by it.

6 Experiments

We perform numerical experiments to better understand the practical performance of the proposed methods, namely how good the approximation is when compared to the theoretical e/(e − 1) factor and how well the marginals are estimated. Moreover, we showcase the scalability of our approach by performing inference on large real-world instances. The implementation was done in Python using PyTorch, and we optimize the bound using subgradient descent. The computation of the log-partition function and its gradients (building on the code from [48]), as well as the greedy oracle were implemented in C++. We provide all details in the appendix.

6.1 Synthetic experiments

We begin by comparing the accuracy of the methods on a set of synthetic experiments. We consider facility location models with objectives of the form F(X) = Σ_{j=1}^{20} max_{i∈X} w_{i,j}, where we sample w_{i,j} ∼ Uniform[0, α]. 
We vary the inverse temperature parameter $\alpha$ and show the results in Figure 1. We first used a uniform matroid constraint $|X| = 5$ over a ground set of size $n = 40$. For the same models we then considered partition constraints by partitioning $V$ into three sets $V_1$, $V_2$ and $V_3$ of sizes 10, 10, and 20 respectively, and defining $\mathcal{M} = \{X \subseteq V \mid |X \cap V_1| = 2, |X \cap V_2| = 2, |X \cap V_3| = 4\}$. Because the number of configurations is only in the millions, we were able to compute the exact marginals and log-partition functions. From the plots we can see that the approximation is much better than the theoretical factor ($\approx 1.582$), and close to exact in the low and high temperature regimes. Moreover, even though the divergence we are optimizing does not necessarily target the marginals, we can see that they are also approximated within a small error.

6.2 Real data

We consider two problems from data mining that can be written as facility location maximization problems under cardinality constraints. For each function $F(A)$ we perform inference in models with objectives $\alpha F(A)$ for varying $\alpha \ge 0$. Moreover, to obtain statistical estimates of the approximation factors, we repeat the experiments several times by taking random subsets of the data.

Exemplar clustering. Given a dataset $\mathcal{X} = \{x_1, x_2, \ldots, x_n\}$ of $n$ points in $\mathbb{R}^d$, we want to find a small subset of size $k = 10$ that is a good summary of $\mathcal{X}$ by minimizing $G(A) = \sum_{i=1}^n \min_{x_j \in A} \|x_i - x_j\|$.

(a) Inference on synthetic models under a uniform matroid constraint $|X| = 6$.

(b) Inference on synthetic models under a partition matroid constraint with 3 blocks of sizes 10, 10 and 15.

Figure 1: Results on synthetic facility location models on a ground set of size $n = 40$. The parameters are sampled from $\mathrm{Uniform}(0, \alpha)$, and there are $m = 10$ components.
The ordinates of the plots in the first column have been centered so that zero corresponds to the true partition function. In the last column we plot both the certified approximation factor (the ratio of the upper bound and the certificate) and the exact one (when dividing by the exact partition function). The error bars indicate three standard deviations from 20 repetitions.

(a) Sensor placement under uniform (left) and partition (right) matroids. (b) Exemplar clustering (CIFAR10).

Figure 2: Results on large real-world datasets (full explanation in §6.2). The error bars indicate three standard deviations from 20 repetitions. Note that the certificate is significantly lower than the theoretical factor of 1.582.

While $-G$ is not submodular, it can be shown [49] that $F(A) = G(\{x_0\}) - G(A \cup \{x_0\})$ is monotone submodular for carefully chosen $x_0$, typically taken to be the origin. We show our results in Figure 2(b), on $n = 1500$ points from the CIFAR10 [50] dataset, normalized as in [44].

Sensor placement. The second problem is that of placing sensors at pipe junctions in order to effectively detect water contaminations. Namely, there are a total of $n$ locations where we can place our sensors, and a set of $m$ possible contamination scenarios. For each scenario $j$ and sensor $i$ there is some utility $w_{i,j} \ge 0$ if $i$ detects contamination $j$, computed e.g. as a function of the detection time, and the total utility is naturally captured using $F(A) = \sum_{j=1}^m \max_{i \in A} w_{i,j}$. We use a subset of the data from [51], and show the results in Figure 2(a).
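Both applications thus reduce to maximizing a facility location function under a cardinality constraint. As a reference point for the kind of greedy oracle such bounds rely on, here is a minimal sketch (illustrative Python, not our C++ oracle) of the classical greedy algorithm of Nemhauser et al. [26] applied to $F(A) = \sum_j \max_{i \in A} w_{i,j}$:

```python
import numpy as np

def greedy_facility_location(w, k):
    """Greedily pick k locations maximizing the monotone submodular
    F(A) = sum_j max_{i in A} w[i, j], where w has shape (n, m)."""
    n, m = w.shape
    best = np.zeros(m)            # per-scenario coverage so far: max_{i in A} w[i, j]
    chosen = []
    for _ in range(k):
        # marginal gain of adding each candidate location i
        gains = np.maximum(w, best).sum(axis=1) - best.sum()
        gains[chosen] = -np.inf   # never pick the same location twice
        i = int(np.argmax(gains))
        chosen.append(i)
        best = np.maximum(best, w[i])
    return chosen, best.sum()

rng = np.random.default_rng(1)
w = rng.uniform(0.0, 1.0, size=(30, 8))   # 30 candidate locations, 8 scenarios
A, value = greedy_facility_location(w, k=5)
```

By the classical analysis [26], the value returned this way is at least a $(1 - 1/e)$ fraction of the optimum over all sets of size $k$.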
We consider two scenarios — (i) $n = 5000$, $m = 300$ under a cardinality constraint $\mathcal{M} = \{X \subseteq V \mid |X| = 50\}$, and (ii) $n = 1500$, $m = 100$ under a partition matroid, constructed by splitting $V$ into 3 blocks of equal size, and considering only distributions that pick exactly 5, 10 and 5 points from each block respectively. Despite the fact that these models have a much larger number of variables and components in the objective, in Figure 2 we see a behaviour similar to that of the synthetic instances — the certificate of the approximation factor is close to one under high and low temperatures (large and small $\alpha$ respectively), while always remaining significantly smaller than the theoretical guarantee.

7 Conclusion

We explored a new, rich class of probabilistic models, whose variables realize bases of a sum of normalizable matroids. These models allow us to capture high-order submodular dependencies between complex combinatorial objects. We presented efficient, convergent convex variational inference algorithms that yield upper bounds on the partition function.
Moreover, we proved the first constant factor approximation of the log-partition function for facility location and weighted coverage models under constraints. We also numerically showcased the quality of the estimated partition function and the marginals. Our models and methods provide important steps towards exploiting combinatorial structure for principled modeling and reasoning about complex real-world phenomena.

Acknowledgements. The research was partially supported by ERC StG 307036, a Google European PhD Fellowship, and NSF CAREER award 1553284.

References

[1] L. E. Celis, V. Keswani, D. Straszak, A. Deshpande, T. Kathuria, and N. K. Vishnoi. "Fair and Diverse DPP-based Data Summarization". arXiv preprint arXiv:1802.04023 (2018).

[2] M. Gomez-Rodriguez, J. Leskovec, and A. Krause. "Inferring networks of diffusion and influence". ACM Transactions on Knowledge Discovery from Data (TKDD) (2012).

[3] D. A. Smith and J. Eisner. "Dependency parsing by belief propagation". Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. 2008.

[4] T. Koo, A. Globerson, X. Carreras, and M. Collins. "Structured prediction models via the matrix-tree theorem". Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). 2007.

[5] A. Bouchard-Côté and M. I. Jordan. "Variational inference over combinatorial spaces". Neural Information Processing Systems (NIPS). 2010.

[6] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. "A new class of upper bounds on the log partition function". IEEE Transactions on Information Theory 51.7 (2005), pp. 2313–2335.

[7] M. J. Wainwright and M. I. Jordan. "Log-determinant relaxation for approximate inference in discrete Markov random fields".
IEEE Transactions on Signal Processing 54.6 (2006), pp. 2099–2109.

[8] T. Hazan and A. Shashua. "Norm-product belief propagation: Primal-dual message-passing for approximate inference". IEEE Transactions on Information Theory 56.12 (2010), pp. 6294–6316.

[9] G. Papandreou and A. L. Yuille. "Perturb-and-MAP Random Fields: Using discrete optimization to learn and sample from energy models". Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE. 2011, pp. 193–200.

[10] T. Hazan and T. Jaakkola. "On the partition function and random maximum a-posteriori perturbations". arXiv preprint arXiv:1206.6410 (2012).

[11] M. Jerrum. "Probabilistic Methods for Algorithmic Discrete Mathematics". 1998. Chap. Mathematical Foundations of the Markov Chain Monte Carlo Method, pp. 116–165.

[12] C. Li, S. Sra, and S. Jegelka. "Fast mixing Markov chains for strongly Rayleigh measures, DPPs, and constrained sampling". Advances in Neural Information Processing Systems. 2016, pp. 4188–4196.

[13] A. Gotovos, S. H. Hassani, and A. Krause. "Sampling from Probabilistic Submodular Models". Neural Information Processing Systems (NIPS). Dec. 2015.

[14] P. Rebeschini and A. Karbasi. "Fast Mixing for Discrete Point Processes". 28th Conference on Learning Theory (COLT). 2015.

[15] J. Djolonga and A. Krause. "From MAP to Marginals: Variational Inference in Bayesian Submodular Models". Neural Information Processing Systems (NIPS). 2014.

[16] J. Djolonga and A. Krause. "Scalable Variational Inference in Log-supermodular Models". International Conference on Machine Learning (ICML). 2015.

[17] J. Djolonga, S. Tschiatschek, and A. Krause. "Variational Inference in Mixed Probabilistic Submodular Models". Neural Information Processing Systems (NIPS). 2016.

[18] J. Zhang, J. Djolonga, and A. Krause.
\u201cHigher-Order Inference for Multi-class Log-\nsupermodular Models\u201d. International Conference on Computer Vision (ICCV). 2015.\n\n[19] D. Tarlow, K. Swersky, R. S. Zemel, R. P. Adams, and B. J. Frey. \u201cFast exact inference for\n\nrecursive cardinality models\u201d. arXiv preprint arXiv:1210.4899 (2012).\n\n[20] R. Lyons. \u201cDeterminantal probability measures\u201d. Publications Math\u00e9matiques de l\u2019IH\u00c9S 98\n\n(2003), pp. 167\u2013212.\n\n[21] R. Burton and R. Pemantle. \u201cLocal characteristics, entropy and limit theorems for spanning\ntrees and domino tilings via transfer-impedances\u201d. The Annals of Probability (1993), pp. 1329\u2013\n1371.\n\n[22] A. Kulesza and B. Taskar. \u201cDeterminantal Point Processes for Machine Learning\u201d. Foundations\n\nand Trends in Machine Learning 5.2\u20133 (2012).\n\n[23] A. Risteski. \u201cHow to calculate partition functions using convex programming hierarchies:\n\nprovable bounds for variational methods\u201d. Conference on Learning Theory. 2016.\n\n[24] S. Fujishige. Submodular functions and optimization. Annals of Discrete Mathematics vol. 58.\n\n[25]\n\n2005.\nJ. Edmonds. \u201cMatroids and the greedy algorithm\u201d. Mathematical programming 1.1 (1971),\npp. 127\u2013136.\n\n[26] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. \u201cAn analysis of approximations for\nmaximizing submodular set functions\u2014I\u201d. Mathematical Programming 14.1 (1978), pp. 265\u2013\n294.\n\n[27] K. Murota. Discrete convex analysis. SIAM, 2003.\n[28] R. P. Leme. \u201cGross substitutability: An algorithmic survey\u201d. Games and Economic Behavior\n\n106 (2017), pp. 294\u2013316.\n\n[29] A. W. Dress and W. Terhalle. \u201cWell-layered maps\u2014A class of greedily optimizable set\n\nfunctions\u201d. Applied Mathematics Letters 8.5 (1995), pp. 77\u201380.\n\n[30] S. Fujishige and Z. Yang. 
\u201cA note on Kelso and Crawford\u2019s gross substitutes condition\u201d.\n\nMathematics of Operations Research 28.3 (2003), pp. 463\u2013469.\n\n[31] K. Murota et al. \u201cDiscrete convex analysis: A tool for economics and game theory\u201d. Journal of\n\nMechanism and Institution Design 1.1 (2016), pp. 151\u2013273.\nJ. G. Oxley. Matroid theory. Vol. 3. Oxford University Press, USA, 2006.\n\n[32]\n[33] G. Calinescu, C. Chekuri, M. P\u00e1l, and J. Vondr\u00e1k. \u201cMaximizing a monotone submodular\nfunction subject to a matroid constraint\u201d. SIAM Journal on Computing 40.6 (2011), pp. 1740\u2013\n1766.\n\n[34] A. Shioura. \u201cOn the pipage rounding algorithm for submodular function maximization\u2014a\nview from discrete convex analysis\u201d. Discrete Mathematics, Algorithms and Applications 1.01\n(2009), pp. 1\u201323.\n\n[35] S. Tschiatschek, J. Djolonga, and A. Krause. \u201cLearning Probabilistic Submodular Diversity\nModels Via Noise Contrastive Estimation.\u201d International Conference on Arti\ufb01cial Intelligence\nand Statistics (AISTATS). 2016.\n\n[37]\n\n[36] S. B. Maurer. \u201cMatrix generalizations of some theorems on trees, cycles and cocycles in\n\ngraphs\u201d. SIAM Journal on Applied Mathematics 30.1 (1976), pp. 143\u2013148.\nJ. Borcea, P. Br\u00e4nd\u00e9n, and T. Liggett. \u201cNegative dependence and the geometry of polynomials\u201d.\nJournal of the American Mathematical Society 22.2 (2009), pp. 521\u2013567.\ninference\u201d. Foundations and Trends R(cid:13) in Machine Learning 1.1-2 (2008), pp. 1\u2013305.\n\n[38] M. J. Wainwright and M. I. Jordan. \u201cGraphical models, exponential families, and variational\n\n[39] A. R\u00e9nyi. \u201cOn measures of entropy and information\u201d. Fourth Berkeley symposium on mathe-\n\nmatical statistics and probability. Vol. 1. 1961, pp. 547\u2013561.\n\n[40] T. Van Erven and P. Harremo\u00ebs. \u201cR\u00e9nyi Divergence and Kullback-Leibler Divergence\u201d. 
arXiv preprint arXiv:1206.2459 (2012).

[41] T. Minka. Divergence measures and message passing. Tech. rep. Microsoft Research, 2005.

[42] D. P. Bertsekas. Nonlinear programming. Athena Scientific, Belmont, 1999.

[43] N. Komodakis, N. Paragios, and G. Tziritas. "MRF energy minimization and beyond via dual decomposition". IEEE Transactions on Pattern Analysis and Machine Intelligence 33.3 (2011), pp. 531–552.

[44] M. Karimi, M. Lucic, H. Hassani, and A. Krause. "Stochastic submodular maximization: The case of coverage functions". Neural Information Processing Systems (NIPS). 2017, pp. 6856–6866.

[45] C. Chekuri, J. Vondrák, and R. Zenklusen. "Dependent randomized rounding via exchange properties of combinatorial structures". Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on. IEEE. 2010, pp. 575–584.

[46] J. Vondrák. "Submodularity in combinatorial optimization". PhD thesis. Univerzita Karlova, Matematicko-fyzikální fakulta, 2007.

[47] S. Agrawal, Y. Ding, A. Saberi, and Y. Ye. "Correlation robust stochastic optimization". Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics. 2010, pp. 1087–1096.

[48] K. Swersky, I. Sutskever, D. Tarlow, R. S. Zemel, R. R. Salakhutdinov, and R. P. Adams. "Cardinality restricted Boltzmann machines". Neural Information Processing Systems (NIPS). 2012, pp. 3293–3301.

[49] R. Gomes and A. Krause. "Budgeted Nonparametric Learning from Data Streams". International Conference on Machine Learning (ICML). 2010, pp. 391–398.

[50] A. Krizhevsky and G. Hinton. "Learning multiple layers of features from tiny images" (2009).

[51] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance.
\u201cCost-\n[51]\neffective outbreak detection in networks\u201d. Proceedings of the 13th ACM SIGKDD International\nConference on Knowledge Discovery and Data Mining. ACM. 2007, pp. 420\u2013429.\n\n11\n\n\f", "award": [], "sourceid": 1388, "authors": [{"given_name": "Josip", "family_name": "Djolonga", "institution": "Google Brain"}, {"given_name": "Stefanie", "family_name": "Jegelka", "institution": "MIT"}, {"given_name": "Andreas", "family_name": "Krause", "institution": "ETH Zurich"}]}