{"title": "Variational Inference in Mixed Probabilistic Submodular Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1759, "page_last": 1767, "abstract": "We consider the problem of variational inference in probabilistic models with both log-submodular and log-supermodular higher-order potentials. These models can represent arbitrary distributions over binary variables, and thus generalize the commonly used pairwise Markov random fields and models with log-supermodular potentials only, for which efficient approximate inference algorithms are known. While inference in the considered models is #P-hard in general, we present efficient approximate algorithms exploiting recent advances in the field of discrete optimization. We demonstrate the effectiveness of our approach in a large set of experiments, where our model allows reasoning about preferences over sets of items with complements and substitutes.", "full_text": "Variational Inference in Mixed Probabilistic Submodular Models

Josip Djolonga, Sebastian Tschiatschek, Andreas Krause
Department of Computer Science, ETH Zürich
{josipd,tschiats,krausea}@inf.ethz.ch

Abstract

We consider the problem of variational inference in probabilistic models with both log-submodular and log-supermodular higher-order potentials. These models can represent arbitrary distributions over binary variables, and thus generalize the commonly used pairwise Markov random fields and models with log-supermodular potentials only, for which efficient approximate inference algorithms are known. While inference in the considered models is #P-hard in general, we present efficient approximate algorithms exploiting recent advances in the field of discrete optimization. 
We demonstrate the effectiveness of our approach in a large set of experiments, where our model allows reasoning about preferences over sets of items with complements and substitutes.

1 Introduction

Probabilistic inference is one of the main building blocks for decision making under uncertainty. In general, however, this problem is notoriously hard even for deceptively simple-looking models, and approximate inference techniques are necessary. There are essentially two large classes into which we can categorize approximate inference algorithms — those based on variational inference and those based on sampling. However, these methods typically do not scale well to large numbers of variables, or exhibit an exponential dependence on the model order, rendering them intractable for models with large factors, which can naturally arise in practice.

In this paper we focus on the problem of inference in point processes, i.e. distributions P(A) over subsets A ⊆ V of some finite ground set V. Equivalently, these models can represent arbitrary distributions over |V| binary variables1. Specifically, we consider models that arise from submodular functions. Recently, Djolonga and Krause [1] discussed inference in probabilistic submodular models (PSMs), those of the form P(A) ∝ exp(±F(A)), where F is submodular. These models are called log-submodular (with the plus) and log-supermodular (with the minus), respectively. They generalize widely used models, e.g., pairwise purely attractive or repulsive Ising models and determinantal point processes (DPPs) [2]. 
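On tiny ground sets these definitions can be made concrete by brute force. The following sketch uses an illustrative coverage function as a stand-in for F (an assumption for the example, not a model from the paper): it checks the diminishing-returns property and normalizes a log-submodular distribution exactly by enumerating all subsets.

```python
import itertools
import math

# Illustrative submodular function: coverage F(A) = |union of covered sets|.
# This choice is an assumption for the example; any submodular F behaves the same way.
COVERS = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c"}}

def F(A):
    return len(set().union(*(COVERS[i] for i in A))) if A else 0

def subsets(V):
    # All 2^n subsets of V, as tuples.
    return itertools.chain.from_iterable(
        itertools.combinations(V, r) for r in range(len(V) + 1))

# Diminishing returns: the gain of item 0 in the smaller context {} is at
# least its gain in the larger context {1}.
assert F({0}) - F(set()) >= F({0, 1}) - F({1})

# Log-submodular model P(A) ∝ exp(+F(A)): exact normalization by summing
# over all subsets, feasible only because |V| is tiny.
V = [0, 1, 2]
Z = sum(math.exp(F(set(A))) for A in subsets(V))
P = {frozenset(A): math.exp(F(set(A))) / Z for A in subsets(V)}
```

The exponential cost of computing Z this way is exactly what the variational machinery developed below avoids.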
Approximate inference in these models via variational techniques [1, 3] and sampling based methods [4, 5] has been investigated.

However, many real-world problems have neither purely log-submodular nor log-supermodular formulations, but can be naturally expressed in the form P(A) ∝ exp(F(A) − G(A)), where both F(A) and G(A) are submodular functions — we call these types of models mixed PSMs. For instance, in a probabilistic model for image segmentation there can be both attractive (log-supermodular) potentials, e.g., potentials modeling smoothness in the segmentation, and repulsive (log-submodular) potentials, e.g., potentials indicating that certain pixels should not be assigned to the same class. While the sampling based approaches for approximate inference are in general applicable to models with both types of factors, fast mixing is only guaranteed for a subclass of all possible models and these methods may not scale well to large ground sets. In contrast, the variational inference techniques were only developed for either log-submodular or log-supermodular models.

In this paper we close this gap and develop variational inference techniques for mixed PSMs. Note that these models can represent arbitrary positive distributions over sets, as any set function can be represented as the difference of a submodular and a supermodular function [6].2 By exploiting recent advances in submodular optimization we formulate efficient algorithms for approximate inference that easily scale to large ground sets and enable the usage of large mixed factors.

1Distributions over sets A ⊆ V are isomorphic to distributions over |V| binary variables, where each binary variable corresponds to an indicator whether a certain element is included in A or not.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Applications/Models. 
Mixed PSMs are natural models for a variety of applications — modeling of user preferences, 3D stereo reconstruction [7], and image segmentation [8, 9], to name a few. For instance, user preferences over items can be used for recommending products in an online marketing application and naturally capture the economic notions of substitutes and complements. Informally, item a is a substitute for another item b if, given item b, the utility of a diminishes (log-submodular potentials); on the other hand, an item c is a complement for item d if, given item d, the utility of c increases (log-supermodular potentials). Probabilistic models that can model substitutes of items are for example DPPs [2] and the facility location diversity (FLID) model [10]. In §4 we extend FLID to model both substitutes and complements, which results in improved performance on a real-world product recommendation task. In terms of computer vision problems, non-submodular binary pairwise MRFs are widely used [8], e.g., as discussed above in image segmentation.

Our contributions. We generalize the variational inference procedure proposed in [1] to models containing both log-submodular and log-supermodular potentials, enabling inference in arbitrary distributions over binary variables. Furthermore, we provide efficient approximate algorithms for factor-wise coordinate descent updates, enabling faster inference for certain types of models, in particular for rich scalable diversity models. In a large set of experiments we demonstrate the effectiveness of mixed higher-order models on a product recommendation task and illustrate the merit of the proposed variational inference scheme.

2 Background: Variational Inference in PSMs

Submodularity. Let F : 2^V → R be a set function, i.e., a function mapping sets A ⊆ V to real numbers. We will furthermore w.l.o.g. assume that V = {1, 2, . . . , n}. 
Formally, a function F is called submodular if it satisfies the following diminishing returns property for all A ⊆ B ⊆ V \ {i}:

F(A ∪ {i}) − F(A) ≥ F(B ∪ {i}) − F(B).

Informally, this property states that the gain of an item i in the context of a smaller set A is larger than its gain in the context of a larger set B. A function G is called supermodular if −G is submodular. A function F is modular if it is both submodular and supermodular. Modular functions F can be written as F(A) = Σ_{i∈A} m_i for some numbers m_i ∈ R, and can thus be parameterized by vectors m ∈ R^n. As a shorthand we will frequently use m(A) = Σ_{i∈A} m_i.

Probabilistic submodular models (PSMs). PSMs are distributions over sets of the form

P(A) = (1/Z) exp(±F(A)),

where Z = Σ_{A⊆V} exp(±F(A)) ensures that P(A) is normalized, and is often called the partition function. The distribution P(A) is called log-submodular if the sign in the above definition is positive and log-supermodular if the sign is negative. These distributions generalize many well known classical models and have been effectively used for image segmentation [11], and for modeling diversity of item sets in recommender systems [10]. When F(A) = m(A) is a modular function, P(A) ∝ exp(F(A)) is called log-modular and corresponds to a fully factorized distribution over n binary random variables X_1, . . . , X_n, where we have for each element i ∈ V an associated variable X_i indicating if this element is included in A or not. The resulting distribution can be written as

P(A) = (1/Z) exp(m(A)) = Π_{i∈A} σ(m_i) · Π_{i∉A} σ(−m_i),

where σ(u) = 1/(1 + e^{−u}) is the sigmoid function.

2As the authors in [6] note, such a decomposition can be in general hard to find.

Variational inference and submodular polyhedra. 
Djolonga and Krause [1] considered variational inference for PSMs, whose idea we will present here in a slightly generalized manner. Their approach starts by bounding F(A) using functions of the form m(A) + t, where m(A) is a modular function and t ∈ R. Let us first analyze the log-supermodular case. If for all A ⊆ V it holds that m(A) + t ≤ F(A), then we can bound the partition function Z as

log Z = log Σ_{A⊆V} e^{−F(A)} ≤ log Σ_{A⊆V} e^{−m(A)−t} = Σ_{i=1}^{n} log(1 + e^{−m_i}) − t.

Then, the idea is to optimize over the free parameters m and t to find the best upper bound, or, equivalently, to solve the optimization problem

min_{(m,t) ∈ L(F)} Σ_{i=1}^{n} log(1 + exp(−m_i)) − t,   (1)

where L(F) is the set of all lower bounds of F, also known as the generalized submodular lower polyhedron [12]

L(F) := {(x, t) ∈ R^{n+1} | ∀A ⊆ V : x(A) + t ≤ F(A)}.   (2)

Djolonga and Krause [1] show that one obtains the same optimum if we restrict ourselves to t = 0 and one additional constraint, i.e., if instead of L(F) we use the base polytope B(F) defined as

B(F) := L(F) ∩ {(x, 0) ∈ R^{n+1} | x(V) = F(V)}.

In words, it contains all modular lower bounds of F that are tight at V and ∅. Thanks to the celebrated result of Edmonds [13], one can optimize linear functions over B(F) in time O(n log n). This, together with the fact that log(1 + e^{−u}) is 1/4-smooth, in turn renders the optimization problem (1) solvable via the Frank-Wolfe procedure [14, 15].

In the log-submodular case, we have to replace in problem (1) the minuses with pluses and use instead of L(F) the set of upper bounds. This set, denoted as U(F), defined by reversing the inequality sign in Equation (2), is called the generalized submodular upper polyhedron [12]. 
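Edmonds' greedy algorithm referenced above is short enough to sketch in full: sort the elements by their weight and assign marginal gains along that order. The coverage function below is an illustrative stand-in for F, not a model from the paper.

```python
def greedy_linear_min(F, V, w):
    """Linear optimization over the base polytope B(F) (Edmonds' greedy):
    visit elements in ascending order of weight w and assign each the
    marginal gain along that order. Returns the vertex x of B(F)
    minimizing <w, x>, using O(n log n) time plus n oracle calls to F."""
    x, prefix, prev = {}, set(), 0.0
    for i in sorted(V, key=lambda j: w[j]):  # ascending order for minimization
        prefix.add(i)
        val = F(prefix)
        x[i] = val - prev  # marginal gain of i given the preceding elements
        prev = val
    return x

# Illustrative submodular coverage function (an assumption for the example).
COVERS = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c"}}
def F(A):
    return len(set().union(*(COVERS[i] for i in A))) if A else 0

x = greedy_linear_min(F, [0, 1, 2], {0: 0.0, 1: 1.0, 2: -1.0})
# x lies in B(F): x(V) = F(V) holds by construction.
```

Every vertex of B(F) arises this way from some ordering, which is why iterating this subroutine inside Frank-Wolfe suffices to optimize smooth objectives over B(F).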
Unfortunately, in contrast to L(F), one cannot easily optimize over U(F), and asking membership queries is an NP-hard problem. As discussed by Iyer and Bilmes [12], there are some special cases, like M♮-concave functions [16], where one can describe U(F), which we will discuss in §3. Alternatively, which is the approach taken by [3], one can select a specific subfamily of U(F) and optimize over it.

3 Inference in Mixed PSMs

We consider mixed PSMs, i.e. probability distributions over sets that can be written in the form

P(A) ∝ exp(F(A) − G(A)),

where F(A) and G(A) are both submodular functions. Furthermore, we assume that F and G decompose as F(A) = Σ_{i=1}^{m_F} F_i(A) and G(A) = Σ_{i=1}^{m_G} G_i(A), where the functions F_i and G_i are all submodular. Note that this is not a limiting assumption, as submodular functions are closed under addition and we can always take m_F = m_G = 1, but such a decomposition will sometimes allow us to obtain better bounds. The corresponding distribution has the form

P(A) ∝ Π_{j=1}^{m_F} exp(F_j(A)) · Π_{j=1}^{m_G} exp(−G_j(A)).   (3)

Similarly to the approach by Djolonga and Krause [1], we perform variational inference by upper bounding F(A) − G(A) by a modular function parameterized by m and a constant t such that

F(A) − G(A) ≤ m(A) + t for all A ⊆ V.   (4)

This upper bound induces the log-modular distribution Q(A) ∝ exp(m(A) + t). 
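Because Q is log-modular, its partition function factorizes, which is what makes the resulting bound computable: log Z_Q = t + Σ_i log(1 + e^{m_i}). A minimal sketch (with arbitrary illustrative values for m and t) verifies this closed form against brute-force enumeration.

```python
import itertools
import math

def log_partition_logmodular(m, t):
    """Exact log-partition of Q(A) ∝ exp(m(A) + t):
    log Z_Q = t + sum_i log(1 + exp(m_i))."""
    return t + sum(math.log1p(math.exp(mi)) for mi in m)

def log_partition_bruteforce(energy, n):
    """Exact log Z by enumerating all 2^n subsets (small n only)."""
    total = 0.0
    for A in itertools.chain.from_iterable(
            itertools.combinations(range(n), r) for r in range(n + 1)):
        total += math.exp(energy(set(A)))
    return math.log(total)

# Illustrative parameters (an assumption for the example).
m, t = [0.3, -1.2, 0.7], 0.5
closed = log_partition_logmodular(m, t)
brute = log_partition_bruteforce(lambda A: sum(m[i] for i in A) + t, len(m))
assert abs(closed - brute) < 1e-9
```

Whenever (m, t) satisfies (4), this quantity upper-bounds the log-partition function of the mixed model, which is exactly the objective minimized next.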
Ideally, we would like to select (m, t) such that the partition function of Q(A) is as small as possible (and thus our approximation of the partition function of P(A) is as tight as possible), i.e., we aim to solve

min_{(m,t) ∈ U(F−G)} t + Σ_{i=1}^{|V|} log(1 + exp(m_i)).   (5)

Optimization (and even membership checks) over U(F − G) is in general difficult, mainly because of the structure of U(F − G), which is given by 2^n inequalities. Thus, we seek to perform a series of inner approximations of U(F − G) that make the optimization more tractable.

Approximating U(F − G). In a first step we approximate U(F − G) as U(F) − L(G) ⊆ U(F − G), where the summation is understood as a Minkowski sum. Then, we can replace L(G) by B(G) without losing any expressive power, as shown by the following lemma (see [3, Lemma 6]).

Lemma 1. Optimizing problem (5) over U(F) − L(G) and over U(F) − B(G) yields the same optimum value.

This lemma will turn out to be helpful when we shortly describe our strategy for minimizing (5) over U(F) − B(G), as it will render some of our subproblems convex optimization problems over B(G) — these subproblems can then be efficiently solved using the Frank-Wolfe algorithm as proposed in [1], by noting that a greedy algorithm can be used to solve linear optimization problems over B(G) [17].

By assumption, F(A) and G(A) are composed of simpler functions. First, because G = Σ_{j=1}^{m_G} G_j, it holds that B(G) = Σ_{j=1}^{m_G} B(G_j) (see e.g. [18]). Second, even though it is hard to describe U(F), it might hold that U(F_i) has a tractable description, which leads to the natural inner approximation U(F) ⊇ Σ_{j=1}^{m_F} U(F_j). 
To wrap up, we performed the following series of inner approximations

U(F − G) ⊇ U(F) − B(G) ⊇ Σ_{j=1}^{m_F} U(F_j) − Σ_{j=1}^{m_G} B(G_j),

which we then use to approximate U(F − G) in problem (5) before solving it.

Optimization. To solve the resulting problem we use a block coordinate descent procedure. Let us first rewrite the problem in a form that enables us to easily describe the algorithm. We write our resulting approximation as

(m, t) = Σ_{j=1}^{m_F} (f_j, t_j) − Σ_{j=1}^{m_G} (g_j, 0),

where we have constrained (f_j, t_j) ∈ U(F_j) and g_j ∈ B(G_j). The resulting problem is then to solve

min_{(f_j,t_j) ∈ U(F_j), g_j ∈ B(G_j)}  Σ_{j=1}^{m_F} t_j + Σ_{i=1}^{n} log(1 + exp(Σ_{j=1}^{m_F} f_{j,i} − Σ_{j=1}^{m_G} g_{j,i})) =: T((f_j, t_j)_{j=1,...,m_F}, (g_j)_{j=1,...,m_G}).   (6)

Then, until convergence, we pick one of the m_G + m_F blocks uniformly at random and solve the resulting optimization problem, which we now show how to do.

Log-supermodular blocks. For a log-supermodular block j, minimizing (6) over g_j is a smooth convex optimization problem, and we can either use the Frank-Wolfe procedure as in [1], or the divide-and-conquer algorithm (see e.g. [19]). In particular, if we use the Frank-Wolfe procedure, we perform a block coordinate descent step with respect to (6) by iterating the following until we achieve some desired precision ε: given the current g_j, we compute ∇_{g_j} T and use the greedy algorithm to solve arg min_{x ∈ B(G_j)} ⟨x, ∇_{g_j} T⟩ in O(n log n) time. We then update g_j to (1 − 2/(k+2)) g_j + (2/(k+2)) x, where k is the iteration number.

Log-submodular blocks. As we have already mentioned, this optimization step is much more challenging. 
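The log-supermodular block update just described can be sketched compactly. The concave-of-cardinality function G and the fixed modular part f below are illustrative assumptions, not models from the paper.

```python
import math

def sigma(u):
    return 1.0 / (1.0 + math.exp(-u))

def greedy_linear_min(G, V, w):
    # Edmonds' greedy algorithm: vertex of B(G) minimizing <w, x>.
    x, prefix, prev = {}, set(), 0.0
    for i in sorted(V, key=lambda j: w[j]):
        prefix.add(i)
        val = G(prefix)
        x[i] = val - prev
        prev = val
    return x

def supermodular_block_update(G, V, f, num_iters=100):
    """Frank-Wolfe step for one log-supermodular block (a sketch):
    minimize sum_i log(1 + exp(f_i - g_i)) over g in B(G)."""
    g = greedy_linear_min(G, V, {i: 0.0 for i in V})  # start at a vertex
    for k in range(num_iters):
        grad = {i: -sigma(f[i] - g[i]) for i in V}    # dT/dg_i
        x = greedy_linear_min(G, V, grad)             # linear minimizer over B(G)
        gamma = 2.0 / (k + 2.0)                       # standard FW step size
        g = {i: (1.0 - gamma) * g[i] + gamma * x[i] for i in V}
    return g

# Illustrative submodular G: concave function of the cardinality.
G = lambda A: 2.0 * math.sqrt(len(A))
V = [0, 1, 2]
g = supermodular_block_update(G, V, {0: 0.5, 1: -0.5, 2: 0.0})
```

Since every Frank-Wolfe iterate is a convex combination of base-polytope vertices, g remains in B(G) throughout, in particular g(V) = G(V).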
One procedure, which is taken by [1], is to consider a set of 2n points inside U(F_j) and optimize over them, which turns out to be a submodular minimization problem. However, for specific subfamilies, we can better describe U(F_j). One particularly interesting subfamily is that of M♮-concave functions [16], which have been studied in economics [20]. A set function H is called M♮-concave if for all A, B ⊆ V and i ∈ A \ B it satisfies

H(A) + H(B) ≤ H(A \ {i}) + H(B ∪ {i}),  or
∃j ∈ B \ A : H(A) + H(B) ≤ H((A \ {i}) ∪ {j}) + H((B ∪ {i}) \ {j}).

Equivalently, these functions can be defined through the so-called gross substitutability property known in economics. It turns out that M♮-concave set functions are also submodular. Examples of these functions include facility location functions, matroid rank functions, monotone concave over cardinality functions, etc. [16]. For example, H(A) = max_{i∈A} h_i for h_i ≥ 0 is M♮-concave, which we will exploit in our models in the experimental section.

Returning to our discussion of optimizing (6), if F_j is an M♮-concave function, we can minimize (6) over (f_j, t_j) ∈ U(F_j) to arbitrary precision in polynomial time. Therefore, we can, similarly as in [1], use the Frank-Wolfe algorithm, by noting that a polynomial time algorithm for computing arg min_{x ∈ U(F_j)} ⟨x, ∇_{(f_j,t_j)} T⟩ exists [20]. Although the minimization can be performed in polynomial time, it is a very involved algorithm. We therefore consider an inner approximation Ǔ(F_j) := {(m, 0) ∈ R^{n+1} | ∀A ⊆ V : F_j(A) ≤ m(A)} ⊆ U(F_j) of U(F_j), over which we can more efficiently approximately minimize (6). 
As pointed out by Iyer and Bilmes [12], for M♮-concave functions F_j the polyhedron Ǔ(F_j) can be characterized by O(n^2) inequalities per tight set, as follows:

Ǔ(F_j) = ∪_{A⊆V} {(m, 0) ∈ R^{n+1} | ∀i ∈ A : m_i ≤ F_j(A) − F_j(A \ {i}),
  ∀k ∉ A : m_k ≥ F_j(A ∪ {k}) − F_j(A),
  ∀i ∈ A, k ∉ A : m_i − m_k ≤ F_j(A) − F_j((A \ {i}) ∪ {k})}.

We propose to use Algorithm 1 for minimizing over Ǔ(F_j). Given a set A at which we want our modular approximation to be tight, the algorithm iteratively minimizes the partition function of a modular upper bound on F_j. Clearly, after the first iteration of the algorithm, (m, 0) is an upper bound on F_j. Furthermore, the partition function corresponding to that bound decreases monotonically over the iterations of the algorithm. Several heuristics can be used to select A — in the experiments we determined A as follows: we initialized B = ∅ and then, while 0 < max_{i∈V\B} F(B ∪ {i}) − F(B), added the maximizing element, i.e. B ← B ∪ {arg max_{i∈V\B} F(B ∪ {i}) − F(B)}. We used the final B of this iteration as our tight set A.

Algorithm 1 Modular upper bound for M♮-concave functions
Require: M♮-concave function F, tight set A s.t. m(A) = F(A) for the returned m
  Initialize m randomly
  for l = 1, 2, . . . , max. nr. of iterations do   ▷ Alternately minimize m over the coefficients corresponding to A and V \ A
    ∀i ∈ A : m_i = min{F(A) − F(A \ {i}), min_{k∈V\A} m_k + F(A) − F((A \ {i}) ∪ {k})}
    ∀k ∉ A : m_k = max{F(A ∪ {k}) − F(A), max_{i∈A} m_i − F(A) + F((A \ {i}) ∪ {k})}
  end for
  return Modular upper bound m on F

4 Examples of Mixed PSMs for Modelling Substitutes and Complements

In our experiments we consider probabilistic models that take the following form:

H(A; α, β) = Σ_{i∈A} u_i + α Σ_{l=1}^{L} (max_{i∈A} r_{l,i} − Σ_{i∈A} r_{l,i}) − β Σ_{k=1}^{K} (max_{i∈A} a_{k,i} − Σ_{i∈A} a_{k,i}),   (7)

where the l-th repulsive summand is denoted F_l(A) and the k-th attractive summand G_k(A), and α, β ∈ {0, 1} switch on/off the repulsive and attractive capabilities of the model, respectively. We would like to point out that even though Σ_{l=1}^{L} F_l(A) is not M♮-concave, each summand F_l is, which we will exploit in the next section. The model is parameterized by the vector u ∈ R^{|V|} and the weights (r_l)_{l∈[L]}, r_l ∈ R^{|V|}_{≥0}, and (a_k)_{k∈[K]}, a_k ∈ R^{|V|}_{≥0}, which will be explained shortly. From the general model (7) we instantiate four different models as explained in the following.

Log-modular model. The log-modular model P_mod(A) is instantiated from (7) by setting α = β = 0, i.e. F_mod(A) := H(A; 0, 0), and serves as a baseline model. This model cannot capture any dependencies between items and corresponds to a fully factorized distribution over the items in V.

Facility location diversity model (FLID). This model is instantiated from (7) by setting α = 1, β = 0, i.e. F_FLID(A) := H(A; 1, 0), and is known as the facility location diversity model (FLID) [10]. Note that this induces a log-submodular distribution. The FLID model parameterizes every item i by an item quality u_i and an L-dimensional vector r_{·,i} ∈ R^{L}_{≥0} of latent properties. The model assigns a negative penalty F_l(A) = max_{i∈A} r_{l,i} − Σ_{i∈A} r_{l,i} whenever at least two items in A have the same latent property (the corresponding dimensions of r_l are > 0) — thus the model explicitly captures repulsive dependencies between items.3 Speaking in economic terms, items with similar latent representations can be considered as substitutes for each other. The FLID model has been shown to perform on par with DPPs on product recommendation tasks [10].

Facility location complements model (FLIC). This model is instantiated from (7) by setting α = 0, β = 1, i.e. F_FLIC(A) := H(A; 0, 1), and defines a log-supermodular probability distribution. Similar to FLID, the model parameterizes every item i by an item quality u_i and a K-dimensional vector a_{·,i} ∈ R^{K}_{≥0} of latent properties. In particular, there is a gain of −G_k(A) = Σ_{i∈A} a_{k,i} − max_{i∈A} a_{k,i} if at least two items in A have the same property k (i.e. for both items the corresponding dimensions of a_k are > 0). In this way, FLIC captures attractive dependencies among items and assigns high probabilities to sets of items that have similar latent representations — items with similar latent representations would be considered as complements in economics.

Facility location diversity and complements model (FLDC). This model is instantiated from (7) via F_FLDC(A) := H(A; 1, 1). Hence it combines the modelling power of the log-submodular and log-supermodular models and can explicitly represent attractive and repulsive dependencies. In this way, FLDC can represent complements and substitutes for the items in V. The induced probability distribution is neither log-submodular nor log-supermodular.

5 Experiments

5.1 Experimental Setup

Dataset. We use the Amazon baby registry dataset [21] for evaluating our proposed variational inference scheme. 
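The four instantiations differ only in the switches α, β, so the energy (7) can be evaluated directly. A sketch with illustrative toy parameters (u for item qualities, R for the repulsive weights r_l, W for the attractive weights a_k; all values are assumptions for the example):

```python
def H(A, u, R, W, alpha, beta):
    """Energy (7): modular utilities u, repulsive facility-location terms
    (rows of R, switched on by alpha) and attractive ones (rows of W,
    switched on by beta)."""
    if not A:
        return 0.0
    val = sum(u[i] for i in A)
    # Repulsive: max_i r_{l,i} - sum_i r_{l,i} <= 0 penalizes shared properties.
    val += alpha * sum(max(r[i] for i in A) - sum(r[i] for i in A) for r in R)
    # Attractive: subtracting (max - sum) rewards shared properties.
    val -= beta * sum(max(a[i] for i in A) - sum(a[i] for i in A) for a in W)
    return val

u = [1.0, 1.0]      # item qualities (illustrative)
R = [[2.0, 2.0]]    # one repulsive latent dimension shared by both items
W = [[1.0, 1.0]]    # one attractive latent dimension shared by both items

A = {0, 1}
energies = {
    "modular": H(A, u, R, W, 0, 0),  # 2.0
    "FLID":    H(A, u, R, W, 1, 0),  # 0.0: substitutes are penalized
    "FLIC":    H(A, u, R, W, 0, 1),  # 3.0: complements are rewarded
    "FLDC":    H(A, u, R, W, 1, 1),  # 1.0: both effects combined
}
```

Exponentiating these energies and normalizing yields the corresponding four distributions over subsets.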
This dataset is a standard dataset for benchmarking diversity models and consists of baby registries collected from Amazon. These registries are split into sub-registries according to 13 different product categories, e.g. safety and carseats. Every category contains 32 to 100 different items, and there are ≈5,000 to ≈13,300 sub-registries per category.

Product recommendation task. We construct a realistic product recommendation task from the registries of every category as follows. Let D = (S_1, . . . , S_n) denote the registries from one category. From this data, we create a new dataset

D̂ = {(S \ {i}, i) | S ∈ D, |S| ≥ 2, i ∈ S},   (8)

i.e., D̂ consists of tuples, where the first element is a registry from D with one item removed, and the second element is the removed item. The product recommendation task is to predict i given S \ {i}. For evaluating the performance of different models on this task we use the following two metrics: accuracy and mean reciprocal rank. Let us denote the recommendations of a model given a partial basket A by σ_A : V → [n], where σ_A(a) = 1 means that product a is recommended highest, σ_A(b) = 2 means that product b is recommended second highest, etc. Then, the accuracy is computed as Acc = (1/|D̂|) Σ_{(S′,i)∈D̂} [i = σ_{S′}^{−1}(1)]. The mean reciprocal rank (MRR) is defined as MRR = (1/|D̂|) Σ_{(S′,i)∈D̂} 1/σ_{S′}(i). For our models we consider predictions according to the posterior probability of the model given a partial basket A, under the constraint that exactly a single item is to be added, i.e. σ_A(i) = k if product i achieves the k-th largest value of P(j | A) = P({j} ∪ A) / Σ_{j′∈V\A} P({j′} ∪ A) for j ∈ V \ A (ties are broken arbitrarily).

5.2 Mixed Models for Product Recommendation

We learned the models described in the previous section using the training data of the different categories. In case of the modular model, the parameters u were set according to the item frequencies in the training data. FLID, FLIC and FLDC were learned using noise contrastive estimation (NCE) [22, 10]. We used stochastic gradient descent for optimizing the NCE objective, created 200,000 noise samples from the modular model, and made 100 passes through the data and noise samples.

3Clearly, also attractive dependencies between items can thereby be modeled implicitly.

Figure 1: (a,b) Accuracy and MRR on the product recommendation task. For all datasets, the mixed FLDC model has the best performance. For datasets with small ground set (furniture, carseats, safety) FLID performs better than FLIC. For most other datasets, FLIC outperforms FLID. (c) Accuracy of FLDC for different numbers of latent dimensions L and K on the diaper dataset. FLDC (L, K > 0) performs better than FLID (K = 0) and FLIC (L = 0) for the same value of L + K.

We then used the trained models for the product recommendation task from the previous section and estimated the performance metrics using 10-fold cross-validation. We used K = 10, L = 10 dimensions for the weights (if applicable for the corresponding model). The results are shown in Figure 1. For reference, we also report the performance of DPPs trained with EM [21]. Note that for all categories the mixed FLDC models achieve the best performance, followed by FLIC and FLID. 
For categories with more than 40 items (with the exception of health), FLIC performs better than FLID. The modular model performs worst in all cases. As already observed in the literature, the performance of FLID is similar to that of DPPs [10]. For categories with small ground sets (safety, furniture, carseats), there is no advantage of using the higher-order attractive potentials, but the repulsive higher-order potentials improve performance significantly. However, in combination with repulsive potentials, the attractive potentials enable the model to improve performance over models with only repulsive higher-order potentials.

5.3 Impact of the Dimension Assignment

In Figure 1c we show the accuracy of FLDC for different numbers of latent dimensions L and K for the category diaper, averaged over the 10 cross-validation folds. Similar results can be observed for the other categories (not shown here because of space constraints). We note that the best performance is achieved only for models that have both repulsive and attractive components (i.e. L, K > 0). For instance, if one is constrained to use only 10 latent dimensions in total, i.e. L + K = 10, the best performance is achieved for the settings L = 3, K = 7 and L = 2, K = 8.

5.4 Quality of the Marginals

In this section we analyze the quality of the marginals obtained by the algorithm proposed in Section 3. To this end, we repeat the following experiment for all baskets S, |S| ≥ 2, in the held-out test data. We randomly select a subset S′ ⊂ S of 1 to |S| − 1 items, and a subset S′′ ⊂ V \ S, with |S′′| = ⌊|V \ S|/2⌋, of items not present in the basket. Then we condition our distribution on the event that the items in S′ are present and the items in S′′ are not present, i.e. we consider the distribution P(A | S′ ⊆ A, S′′ ∩ A = ∅). 
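On small ground sets such conditional marginals can be computed exactly by enumeration, which is also how approximate marginals can be sanity-checked. A sketch with an illustrative modular energy (the conditioning logic is the same for any energy):

```python
import itertools
import math

def conditional_marginals(energy, V, S_in, S_out):
    """Exact marginals of P(A) ∝ exp(energy(A)) conditioned on
    S_in ⊆ A and A ∩ S_out = ∅, by enumerating the free items.
    Exponential in |V|; for illustration on small ground sets only."""
    free = [i for i in V if i not in S_in and i not in S_out]
    weights, supports = [], []
    for r in range(len(free) + 1):
        for B in itertools.combinations(free, r):
            A = set(S_in) | set(B)
            supports.append(A)
            weights.append(math.exp(energy(A)))
    Z = sum(weights)
    return {i: sum(w for w, A in zip(weights, supports) if i in A) / Z
            for i in V}

# Illustrative modular energy; conditioning forces item 0 in, item 2 out.
u = {0: 0.5, 1: -0.3, 2: 1.0}
marg = conditional_marginals(lambda A: sum(u[i] for i in A),
                             [0, 1, 2], {0}, {2})
```

As expected, the conditioned-in item has marginal 1, the excluded item 0, and the free item keeps its sigmoid marginal σ(u_1).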
This conditioning is supposed to resemble a fictitious product recommendation task in which we condition on items already selected by a user and exclude items which are of no interest to the user (for instance, according to the user's preferences). We then compute a modular approximation to the posterior distribution using the algorithm from Section 3, and recommend items according to these approximate marginals. For evaluation, we compute the AUC for the product recommendation task and average over the test set data. We found that for different models different modular upper/lower bounds gave the best results. In particular, for FLID we used the upper bound given by Algorithm 1 to bound each summand F_l(A) in the facility location term separately. For FLIC and FLID we optimized the lower bound on the partition function by lower-bounding Σ_{l=1}^{L} F_l(A) and upper-bounding Σ_{k=1}^{K} G_k(A), as suggested in [1]. For approximate inference in FLIC and FLDC we did not split the facility location terms and bounded them as a whole.

Table 1: AUC for the considered models on the product recommendation task based on the posterior marginals. The best result for every dataset is printed in bold. For datasets with at most 62 items, FLDC has the highest AUC, while for larger datasets FLIC and FLDC have similar AUC. This indicates a good quality of the marginals computed by the proposed approximate inference procedure.

Dataset    Modular   FLID      FLIC      FLDC
safety     0.731304  0.756981  0.731269  0.761168
furniture  0.701840  0.739646  0.702100  0.759979
carseats   0.717463  0.770085  0.735472  0.781642
strollers  0.727055  0.794655  0.827800  0.849767
health     0.750271  0.754185  0.756873  0.758586
bath       0.692423  0.705051  0.730443  0.732407
media      0.666509  0.667848  0.758552  0.780634
toys       0.724763  0.729089  0.765474  0.777729
bedding    0.741786  0.744443  0.771159  0.764595
apparel    0.700694  0.696010  0.778067  0.779665
diaper     0.685051  0.700543  0.787457  0.787274
gear       0.687686  0.688116  0.687501  0.688885
feeding    0.686240  0.686845  0.744043  0.739921
Average    0.708698  0.725653  0.752016  0.766327

The results are summarized in Table 1. We observe that FLDC has the highest AUC for all datasets with at most 62 items. For larger datasets, FLDC and FLIC have roughly the same performance and are superior to FLID and the modular model. 
These findings are similar to those from the previous section and confirm the good quality of the marginals computed for FLDC and FLIC by the proposed approximate inference procedure.
6 Related Work
Variational inference in general probabilistic submodular models was first studied in [1]. The authors propose L-FIELD, an approach for approximate inference in both log-submodular and log-supermodular models based on super- and sub-differentials of submodular functions. In [3] they extended this work by relating L-FIELD to the minimum norm problem for submodular minimization, making more scalable algorithms applicable to variational inference in log-supermodular models. The aforementioned works can only be applied to models that contain either log-submodular or log-supermodular potentials, and hence do not cover the models considered in this paper.
While computing the MAP solution in mixed models is known to be NP-hard, there are approximate methods for its computation based on iterative majorization-minimization (or minorization-maximization) procedures [23, 24]. In [9] the authors consider mixed models in which the supermodular component is restricted to a tree-structured cut, and provide several algorithms for approximate MAP computation. In contrast to our work, these methods are non-probabilistic and only provide an approximate MAP solution without any notion of uncertainty.
7 Conclusion
We proposed efficient algorithms for approximate inference in mixed submodular models based on inner approximations of the set of modular bounds on the corresponding energy functions. For many higher-order potentials, optimizing a modular bound over this inner approximation is tractable. As a consequence, the approximate inference problem can be approached by a block coordinate descent procedure, tightening a modular upper bound over the individual higher-order potentials in an iterative manner.
Our approximate inference algorithms enable the computation of approximate marginals and can easily scale to large ground sets. In a large set of experiments, we demonstrated the effectiveness of our approach.

Acknowledgements. The authors acknowledge fruitful discussions with Diego Ballesteros. This research was supported in part by SNSF grant CRSII2-147633, ERC StG 307036, a Microsoft Research Faculty Fellowship and a Google European Doctoral Fellowship.

References
[1] Josip Djolonga and Andreas Krause. From MAP to Marginals: Variational Inference in Bayesian Submodular Models. In Advances in Neural Information Processing Systems (NIPS), pages 244–252, 2014.
[2] Alex Kulesza and Ben Taskar. Determinantal Point Processes for Machine Learning. Foundations and Trends in Machine Learning, 5(2–3):123–286, 2012.
[3] Josip Djolonga and Andreas Krause. Scalable Variational Inference in Log-supermodular Models. In International Conference on Machine Learning (ICML), 2015.
[4] Alkis Gotovos, Hamed Hassani, and Andreas Krause. Sampling from Probabilistic Submodular Models. In Advances in Neural Information Processing Systems (NIPS), pages 1936–1944, 2015.
[5] Patrick Rebeschini and Amin Karbasi. Fast Mixing for Discrete Point Processes. In Proceedings of the Conference on Learning Theory (COLT), pages 1480–1500, 2015.
[6] Mukund Narasimhan and Jeff Bilmes. A Submodular-Supermodular Procedure with Applications to Discriminative Structure Learning. In Uncertainty in Artificial Intelligence (UAI), 2005.
[7] Oliver Woodford, Philip Torr, Ian Reid, and Andrew Fitzgibbon. Global Stereo Reconstruction under Second-Order Smoothness Priors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(12), 2009.
[8] Carsten Rother, Vladimir Kolmogorov, Victor Lempitsky, and Martin Szummer. Optimizing Binary MRFs via Extended Roof Duality. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
[9] Yoshinobu Kawahara, Rishabh Iyer, and Jeffrey A. Bilmes. On Approximate Non-submodular Minimization via Tree-Structured Supermodularity. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.
[10] Sebastian Tschiatschek, Josip Djolonga, and Andreas Krause. Learning Probabilistic Submodular Diversity Models via Noise Contrastive Estimation. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.
[11] Jian Zhang, Josip Djolonga, and Andreas Krause. Higher-order Inference for Multi-class Log-supermodular Models. In International Conference on Computer Vision (ICCV), 2015.
[12] Rishabh Iyer and Jeff Bilmes. Polyhedral Aspects of Submodularity, Convexity and Concavity. arXiv preprint arXiv:1506.07329, 2015.
[13] Jack Edmonds. Matroids and the Greedy Algorithm. Mathematical Programming, 1(1):127–136, 1971.
[14] Marguerite Frank and Philip Wolfe. An Algorithm for Quadratic Programming. Naval Research Logistics Quarterly, 3(1–2):95–110, 1956.
[15] Martin Jaggi. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 427–435, 2013.
[16] Kazuo Murota. Discrete Convex Analysis. SIAM, 2003.
[17] Jack Edmonds. Submodular Functions, Matroids, and Certain Polyhedra. In Combinatorial Optimization—Eureka, You Shrink!, pages 11–26. Springer, 2003.
[18] S. Fujishige. Submodular Functions and Optimization. Elsevier, 2005.
[19] Francis R. Bach. Learning with Submodular Functions: A Convex Optimization Perspective. Foundations and Trends in Machine Learning, 6(2–3):145–373, 2013.
[20] Akiyoshi Shioura. Polynomial-Time Approximation Schemes for Maximizing Gross Substitutes Utility under Budget Constraints. Mathematics of Operations Research, 40(1), 2014.
[21] Jennifer A. Gillenwater, Alex Kulesza, Emily Fox, and Ben Taskar. Expectation-Maximization for Learning Determinantal Point Processes. In Advances in Neural Information Processing Systems (NIPS), 2014.
[22] Michael U. Gutmann and Aapo Hyvärinen. Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics. The Journal of Machine Learning Research, 13(1), 2012.
[23] Mukund Narasimhan and Jeff Bilmes. A Submodular-Supermodular Procedure with Applications to Discriminative Structure Learning. arXiv preprint arXiv:1207.1404, 2012.
[24] Rishabh Iyer and Jeff Bilmes. Algorithms for Approximate Minimization of the Difference Between Submodular Functions, with Applications. arXiv preprint arXiv:1207.0560, 2012.