{"title": "On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations", "book": "Advances in Neural Information Processing Systems", "page_first": 1268, "page_last": 1276, "abstract": "In this paper we describe how MAP inference can be used to sample efficiently from Gibbs distributions. Specifically, we provide means for drawing either approximate or unbiased samples from Gibbs' distributions by introducing low dimensional perturbations and solving the corresponding MAP assignments. Our approach also leads to new ways to derive lower bounds on partition functions. We demonstrate empirically that our method excels in the typical high signal - high coupling'' regime. The setting results in ragged energy landscapes that are challenging for alternative approaches to sampling and/or lower bounds. \"", "full_text": "On Sampling from the Gibbs Distribution with\nRandom Maximum A-Posteriori Perturbations\n\nTamir Hazan\n\nUniversity of Haifa\n\nSubhransu Maji\n\nTTI Chicago\n\nTommi Jaakkola\n\nCSAIL, MIT\n\nAbstract\n\nIn this paper we describe how MAP inference can be used to sample ef\ufb01ciently\nfrom Gibbs distributions. Speci\ufb01cally, we provide means for drawing either ap-\nproximate or unbiased samples from Gibbs\u2019 distributions by introducing low di-\nmensional perturbations and solving the corresponding MAP assignments. Our\napproach also leads to new ways to derive lower bounds on partition functions.\nWe demonstrate empirically that our method excels in the typical \u201chigh signal -\nhigh coupling\u201d regime. The setting results in ragged energy landscapes that are\nchallenging for alternative approaches to sampling and/or lower bounds.\n\n1\n\nIntroduction\n\nInference in complex models drives much of the research in machine learning applications, from\ncomputer vision, natural language processing, to computational biology. Examples include scene\nunderstanding, parsing, or protein design. 
The inference problem in such cases involves finding likely structures, whether objects, parsers, or molecular arrangements. Each structure corresponds to an assignment of values to random variables, and the likelihood of an assignment is based on defining potential functions in a Gibbs distribution. Usually, it is feasible to find only the most likely or maximum a-posteriori (MAP) assignment (structure), rather than to sample from the full Gibbs distribution. Substantial effort has gone into developing algorithms for recovering MAP assignments, either based on specific structural restrictions such as super-modularity [2] or by devising cutting-plane methods on linear programming relaxations [19, 24]. However, MAP inference is limited when there are other likely assignments.

Our work seeks to leverage MAP inference so as to sample efficiently from the full Gibbs distribution. Specifically, we aim to draw either approximate or unbiased samples from Gibbs distributions by introducing low dimensional perturbations in the potential functions and solving the corresponding MAP assignments. Connections between random MAP perturbations and Gibbs distributions have been explored before. Recently, [17, 21] defined probability models that are based on low dimensional perturbations and empirically tied them to Gibbs distributions. [5] augmented these results by providing bounds on the partition function in terms of random MAP perturbations.

In this work we build on these results to construct an efficient sampler for the Gibbs distribution, also deriving new lower bounds on the partition function. Our approach excels in regimes where there are several but not exponentially many prominent assignments.
In such ragged energy landscapes, classical methods for the Gibbs distribution, such as Gibbs sampling and Markov chain Monte Carlo methods, remain computationally expensive [3, 25].

2 Background

Statistical inference problems involve reasoning about the states of discrete variables whose configurations (assignments of values) specify the discrete structures of interest. We assume that the models are parameterized by real valued potentials θ(x) = θ(x_1, ..., x_n) < ∞ defined over a discrete product space X = X_1 × ··· × X_n. The effective domain is implicitly defined through θ(x) via exclusions θ(x) = −∞ whenever x ∉ dom(θ). The real valued potential functions are mapped to the probability scale via the Gibbs distribution:

  p(x_1, ..., x_n) = (1/Z) exp(θ(x_1, ..., x_n)), where Z = Σ_{x_1,...,x_n} exp(θ(x_1, ..., x_n)).   (1)

The normalization constant Z is called the partition function. The feasibility of using the distribution for prediction, including sampling from it, is inherently tied to the ability to evaluate the partition function, i.e., the ability to sum over the discrete structures being modeled. In general, such counting problems are often hard, in #P.

A slightly easier problem is that of finding the most likely assignment of values to variables, also known as maximum a-posteriori (MAP) prediction:

  (MAP)   argmax_{x_1,...,x_n} θ(x_1, ..., x_n).   (2)

Recent advances in optimization theory have been translated into successful algorithms for solving such MAP problems in many cases of practical interest.
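For concreteness, Equations (1) and (2) can be evaluated by enumeration on a toy model. The sketch below uses hypothetical potentials over two binary variables; the specific values are made up for illustration.

```python
import numpy as np
from itertools import product

# Hypothetical potentials theta(x1, x2) over two binary variables.
theta = {x: t for x, t in zip(product([0, 1], repeat=2), [0.5, -0.3, 1.2, 0.1])}

# Partition function Z and the Gibbs distribution, Equation (1).
z = sum(np.exp(t) for t in theta.values())
gibbs = {x: np.exp(t) / z for x, t in theta.items()}

# MAP assignment, Equation (2): the configuration with the largest potential.
x_map = max(theta, key=theta.get)

print(x_map)  # the MAP assignment also maximizes the Gibbs probability
```

Computing Z this way costs time exponential in the number of variables, which is exactly why the paper replaces summation with MAP maximization.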
Although the MAP prediction problem is still NP-hard in general, it is often simpler than sampling from the Gibbs distribution.

Our approach is based on representations of the Gibbs distribution and the partition function using extreme value statistics of linearly perturbed potential functions. Let {γ(x)}_{x∈X} be a collection of random variables with zero mean, and consider random potential functions of the form θ(x) + γ(x). Analytic expressions for the statistics of a randomized MAP predictor, x̂ ∈ argmax_x {θ(x) + γ(x)}, can be derived for general discrete sets, whenever independent and identically distributed (i.i.d.) random perturbations are applied for every assignment x ∈ X. Specifically, when the random perturbations follow the Gumbel distribution (cf. [12]), we obtain the following result.

Theorem 1. ([4], see also [17, 5]) Let {γ(x)}_{x∈X} be a collection of i.i.d. random variables, each following the Gumbel distribution with zero mean, whose cumulative distribution function is F(t) = exp(−exp(−(t + c))), where c is the Euler constant. Then

  log Z = E_γ[ max_{x∈X} {θ(x) + γ(x)} ],
  (1/Z) exp(θ(x̂)) = P_γ[ x̂ ∈ argmax_{x∈X} {θ(x) + γ(x)} ].

The max-stability of the Gumbel distribution provides a straightforward approach to generate unbiased samples from the Gibbs distribution, as well as to approximate the partition function by a sample mean of random MAP perturbations. Assume we sample j = 1, ..., m independent predictions max_x {θ(x) + γ_j(x)}; then every maximal argument is an unbiased sample from the Gibbs distribution. Moreover, the randomized MAP values max_x {θ(x) + γ_j(x)} are independent and follow the Gumbel distribution, whose variance is π²/6.
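Both identities of Theorem 1 can be checked numerically by brute force on a small model. The following sketch (with hypothetical potentials over eight configurations) draws zero-mean Gumbel perturbations and verifies that the perturbed maximum has expectation log Z, while the argmax follows the Gibbs distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny model: hypothetical potentials theta(x) over 8 joint assignments.
theta = np.array([1.0, 0.5, -0.2, 2.0, 0.0, -1.0, 1.5, 0.3])
log_z = np.log(np.sum(np.exp(theta)))   # exact log partition function
gibbs = np.exp(theta - log_z)           # exact Gibbs distribution

# Zero-mean Gumbel: F(t) = exp(-exp(-(t + c))), c the Euler constant,
# i.e. location -euler_gamma, scale 1.
m = 200000
gamma = rng.gumbel(loc=-np.euler_gamma, scale=1.0, size=(m, theta.size))
perturbed = theta + gamma

# E[max_x {theta(x) + gamma(x)}] = log Z  (first identity)
log_z_est = perturbed.max(axis=1).mean()

# argmax_x {theta(x) + gamma(x)} is an unbiased Gibbs sample (second identity)
counts = np.bincount(perturbed.argmax(axis=1), minlength=theta.size)
gibbs_est = counts / m

print(abs(log_z_est - log_z))            # close to 0
print(np.abs(gibbs_est - gibbs).max())   # close to 0
```

Note that this direct use of the theorem needs one i.i.d. perturbation per joint configuration, which is exactly the exponential cost the low dimensional perturbations below avoid.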
Therefore Chebyshev's inequality dictates that for every ε and m,

  P_γ[ | (1/m) Σ_{j=1}^m max_x {θ(x) + γ_j(x)} − log Z | ≥ ε ] ≤ π² / (6mε²).   (3)

In general each x = (x_1, ..., x_n) represents an assignment to n variables. Theorem 1 suggests to introduce an independent perturbation γ(x) for each such n-dimensional assignment x ∈ X. The complexity of inference and learning in this setting would be exponential in n. In our work we propose to investigate low dimensional random perturbations as the main tool for efficient (approximate) sampling from the Gibbs distribution.

3 Probable approximate samples from the Gibbs distribution

Sampling from the Gibbs distribution is inherently tied to estimating the partition function. Markov properties that simplify the distribution also decompose the computation of the partition function. For example, assume a graphical model with potential functions associated with subsets of variables α ⊂ {1, ..., n}, so that θ(x) = Σ_{α∈A} θ_α(x_α). Assume that the subsets are disjoint except for their common intersection β = ∩_{α∈A} α. This separation implies that the partition function can be computed in lower dimensional pieces:

  Z = Σ_{x_β} Π_{α∈A} ( Σ_{x_α\x_β} exp(θ_α(x_α)) ).

As a result, the computation is exponential only in the size of the subsets α ∈ A. Thus, we can also estimate the partition function with lower dimensional random MAP perturbations, E_γ[ max_{x_α\x_β} {θ_α(x_α) + γ_α(x_α)} ].
The random perturbations are now required only for each assignment of values to the variables within the subsets α ∈ A, rather than for the set of all variables. We approximate such partition functions with low dimensional perturbations and their averages. The overall computation is cast in a single MAP problem using an extended representation of the potential functions obtained by replicating variables.

Lemma 1. Let A be a collection of subsets of variables that are separated by their joint intersection β = ∩_{α∈A} α. We create multiple copies of x_α, namely x̂_α = (x_{α,j_α})_{j_α=1,...,m_α}, and define the extended potential function θ̂_α(x̂_α) = Σ_{j_α=1}^{m_α} θ_α(x_{α,j_α}) / m_α. We also define the extended perturbation model γ̂_α(x̂_α) = Σ_{j_α=1}^{m_α} γ_{α,j_α}(x_{α,j_α}) / m_α, where each γ_{α,j_α}(x_{α,j_α}) is independent and distributed according to the Gumbel distribution with zero mean. Then, for every x_β, with probability at least 1 − Σ_{α∈A} π²/(6 m_α ε²),

  | max_{x̂\x_β} { Σ_{α∈A} θ̂_α(x̂_α) + Σ_{α∈A} γ̂_α(x̂_α) } − Σ_{α∈A} log( Σ_{x_α\x_β} exp(θ_α(x_α)) ) | ≤ ε|A|.

Proof: Equation (3) implies that for every x_β, with probability at most π²/(6 m_α ε²),

  | (1/m_α) Σ_{j_α=1}^{m_α} max_{x_α\x_β} {θ_α(x_α) + γ_{α,j_α}(x_α)} − log( Σ_{x_α\x_β} exp(θ_α(x_α)) ) | > ε.

By the union bound, the complementary events hold simultaneously for every α ∈ A with probability at least 1 − Σ_{α∈A} π²/(6 m_α ε²). To compute the sampled averages with a single max-operation we introduce the multiple copies x̂_α = (x_{α,j_α})_{j_α=1,...,m_α}, thus Σ_{j_α=1}^{m_α} max_{x_α\x_β} {θ_α(x_α) + γ_{α,j_α}(x_α)} = max_{x̂_α\x_β} Σ_{j_α=1}^{m_α} {θ_α(x_{α,j_α}) + γ_{α,j_α}(x_{α,j_α})}. Since x_β is fixed for every α ∈ A, the maximizations are done independently across subsets in x̂ \ x_β, where x̂ is the concatenation of all x̂_α, and

  Σ_{α∈A} max_{x̂_α\x_β} Σ_{j_α=1}^{m_α} { θ_α(x_{α,j_α}) + γ_{α,j_α}(x_{α,j_α}) } = max_{x̂\x_β} Σ_{α∈A} Σ_{j_α=1}^{m_α} { θ_α(x_{α,j_α}) + γ_{α,j_α}(x_{α,j_α}) }.

The proof then follows from the triangle inequality. □

Whenever the graphical model has no cycles we can iteratively apply the separation properties without increasing the computational complexity of the perturbations. Thus we may randomly perturb the subsets of potentials in the graph.
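The replication argument in Lemma 1 can be illustrated numerically: because the copies decouple, a single MAP evaluation on the replicated model coincides with the empirical average of per-copy perturbed maxima, which in turn concentrates around the log-partition function. A small sketch, with hypothetical potentials over one subset (x_β held fixed):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
theta = np.array([0.3, 1.2, -0.5])   # hypothetical theta_alpha over a 3-state subset
log_z = np.log(np.exp(theta).sum())

# m copies of the variable, each with its own zero-mean Gumbel perturbation.
m = 3
gamma = rng.gumbel(-np.euler_gamma, 1.0, size=(m, theta.size))

# Single MAP over the replicated model: objective (1/m) sum_j (theta + gamma_j)(x_j).
brute = max(sum(theta[x[j]] + gamma[j, x[j]] for j in range(m)) / m
            for x in product(range(theta.size), repeat=m))

# The copies decouple, so this equals the average of per-copy maxima.
decoupled = np.mean((theta + gamma).max(axis=1))
print(abs(brute - decoupled))  # zero up to floating point

# With many copies the same average concentrates around log Z, as in Equation (3).
m_big = 20000
gamma_big = rng.gumbel(-np.euler_gamma, 1.0, size=(m_big, theta.size))
print(abs((theta + gamma_big).max(axis=1).mean() - log_z))  # small
```

The brute-force maximization over all 3³ replicated assignments is only for checking the decoupling; in practice a MAP solver handles the replicated model directly.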
For notational simplicity we describe our approximate sampling scheme for pairwise interactions α = (i, j), although it holds for general graphical models without cycles:

Theorem 2. Let θ(x) = Σ_{i∈V} θ_i(x_i) + Σ_{(i,j)∈E} θ_{i,j}(x_i, x_j) be a graphical model without cycles, and let p(x) be the Gibbs distribution defined in Equation (1). We create multiple copies x_{i,k_i} for k_i = 1, ..., m_i, and define θ̂(x̂) = Σ_{k_1,...,k_n} θ(x_{1,k_1}, ..., x_{n,k_n}) / Π_i m_i and γ̂_{i,j}(x_i, x_j) = Σ_{k_i,k_j=1}^{m_i,m_j} γ_{i,j,k_i,k_j}(x_{i,k_i}, x_{j,k_j}) / (m_i m_j), where each perturbation is independent and distributed according to the Gumbel distribution with zero mean. Then, for every edge (r, s) with m_r = m_s = 1 (i.e., they have no multiple copies), with probability at least 1 − Σ_{i=1}^n π²c/(6 m_i ε²), where c = max_i |X_i|, it holds that

  | log P_γ[ x_r, x_s ∈ argmax_x̂ { θ̂(x̂) + Σ_{(i,j)∈E} γ̂_{i,j}(x_i, x_j) } ] − log( Σ_{x\{x_r,x_s}} p(x) ) | ≤ εn.

Proof: Theorem 1 implies that we sample (x_r, x_s) approximately from the Gibbs distribution if we approximate its marginal probabilities, which are proportional to Σ_{x\{x_r,x_s}} exp(θ(x)), with a max-operation. Using graph separation (or equivalently the Markov property) it suffices to approximate the partial partition functions over the disjoint subtrees T_r, T_s that originate from r and s, respectively. Lemma 1 describes this case for a directed tree with a single parent.
We use this by induction on the parents of these directed trees, noticing that graph separation guarantees that the statistics of Lemma 1 hold uniformly for every assignment of the parent's non-descendants, and that the optimal assignments in Lemma 1 are chosen independently for every child, for every assignment of the parent's non-descendants. □

Our approximate sampling procedure expands the graphical model, creating layers of the original graph, while connecting edges between vertices in different layers if an edge exists in the original graph. We use graph separations (Markov properties) to guarantee that the number of added layers is polynomial in n, while we approach arbitrarily close to the Gibbs distribution. This construction preserves the structure of the original graph; in particular, whenever the original graph has no cycles, the expanded graph has no cycles as well. In the experiments we show that this probability model approximates the Gibbs distribution well even for graphical models with many cycles.

4 Unbiased sampling using sequential bounds on the partition function

In the following we describe how to use random MAP perturbations to generate unbiased samples from the Gibbs distribution. Sampling from the Gibbs distribution is inherently tied to estimating the partition function. Assume we could compute the partition function exactly; then we could sample from the Gibbs distribution sequentially: for every dimension we sample x_i with probability proportional to Σ_{x_{i+1},...,x_n} exp(θ(x)). Unfortunately, approximations to the partition function, as described in Section 3, cannot provide a sequential procedure that would generate unbiased samples from the full Gibbs distribution.
Instead, we construct a family of self-reducible upper bounds which imitate the behavior of the partition function, namely bound the summation over its exponentiations. These upper bounds extend the one in [5] when restricted to local perturbations.

Lemma 2. Let {γ_i(x_i)} be a collection of i.i.d. random variables, each following the Gumbel distribution with zero mean. Then for every j = 1, ..., n and every x_1, ..., x_{j−1},

  Σ_{x_j} exp( E_γ[ max_{x_{j+1},...,x_n} {θ(x) + Σ_{i=j+1}^n γ_i(x_i)} ] ) ≤ exp( E_γ[ max_{x_j,...,x_n} {θ(x) + Σ_{i=j}^n γ_i(x_i)} ] ).

In particular, for j = n, Σ_{x_n} exp(θ(x)) = exp( E_{γ_n}[ max_{x_n} {θ(x) + γ_n(x_n)} ] ).

Proof: The result is an application of the expectation-optimization interpretation of the partition function in Theorem 1. The left hand side equals exp( E_{γ_j}[ max_{x_j} { E_{γ_{j+1},...,γ_n}[ max_{x_{j+1},...,x_n} {θ(x) + Σ_{i=j+1}^n γ_i(x_i)} ] + γ_j(x_j) } ] ), while the right hand side is attained by alternating the maximization with respect to x_j with the expectation of γ_{j+1}, ..., γ_n. The proof then follows by taking the exponent. □

We use these upper bounds for every dimension j = 1, ..., n to sample from a probability distribution that follows a summation over exponential functions, with a discrepancy that is described by the upper bound. This is formalized below in Algorithm 1.

Algorithm 1 (Unbiased sampling from the Gibbs distribution using randomized prediction). Iterate over j = 1, ..., n, while keeping x_1, ..., x_{j−1} fixed. Set

1. p_j(x_j) = exp( E_γ[ max_{x_{j+1},...,x_n} {θ(x) + Σ_{i=j+1}^n γ_i(x_i)} ] ) / exp( E_γ[ max_{x_j,...,x_n} {θ(x) + Σ_{i=j}^n γ_i(x_i)} ] ).

2. p_j(r) = 1 − Σ_{x_j} p_j(x_j).

3. Sample an element according to p_j(·). If r is sampled then reject and restart with j = 1. Otherwise, fix the sampled element x_j and continue the iterations.

Output: x_1, ..., x_n.

When we reject the discrepancy, the probability that we accept a configuration x is the product of the probabilities in all rounds. Since these upper bounds are self-reducible, i.e., for every dimension j we are using the same quantities that were computed in the previous dimensions 1, ..., j−1, we are sampling an accepted configuration proportionally to exp(θ(x)), the full Gibbs distribution.

Theorem 3. Let p(x) be the Gibbs distribution defined in Equation (1), and let {γ_i(x_i)} be a collection of i.i.d. random variables following the Gumbel distribution with zero mean. Then, whenever Algorithm 1 accepts, it produces a configuration (x_1, ..., x_n) according to the Gibbs distribution:

  P[ Algorithm 1 outputs x | Algorithm 1 accepts ] = p(x).

Proof: The probability of sampling a configuration (x_1, ..., x_n) without rejecting is

  Π_{j=1}^n exp( E_γ[ max_{x_{j+1},...,x_n} {θ(x) + Σ_{i=j+1}^n γ_i(x_i)} ] ) / exp( E_γ[ max_{x_j,...,x_n} {θ(x) + Σ_{i=j}^n γ_i(x_i)} ] ) = exp(θ(x)) / exp( E_γ[ max_{x_1,...,x_n} {θ(x) + Σ_{i=1}^n γ_i(x_i)} ] ).

The probability of sampling without rejecting is thus the sum of this probability over all configurations, i.e., P[Algorithm 1 accepts] = Z / exp( E_γ[ max_{x_1,...,x_n} {θ(x) + Σ_{i=1}^n γ_i(x_i)} ] ). Therefore, conditioned on accepting a configuration, it is produced according to the Gibbs distribution. □
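To make Algorithm 1 concrete, here is a small illustrative sketch on a hypothetical two-variable model with made-up potentials. The expectations E_γ[max ...] from Lemma 2 are estimated by Monte Carlo averages rather than computed exactly, so this is only an approximate rendering of the algorithm; note, though, that the ratios in step 1 telescope, so errors in the intermediate estimates largely cancel.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
states, n = (0, 1), 2
# Hypothetical potentials theta(x1, x2) over two binary variables.
theta = {x: t for x, t in zip(product(states, repeat=n), [0.8, -0.4, 0.1, 1.3])}

def v_hat(prefix, m=50000):
    """MC estimate of E_gamma[max over completions {theta(x) + sum_{i>|prefix|} gamma_i(x_i)}]."""
    free = n - len(prefix)
    if free == 0:
        return theta[prefix]
    g = rng.gumbel(-np.euler_gamma, 1.0, size=(m, free, len(states)))  # zero-mean Gumbel
    best = np.full(m, -np.inf)
    for tail in product(states, repeat=free):
        best = np.maximum(best, theta[prefix + tail] + sum(g[:, i, tail[i]] for i in range(free)))
    return best.mean()

# Precompute the sequential upper-bound quantities of Lemma 2 once.
V = {p: v_hat(p) for r in range(n + 1) for p in product(states, repeat=r)}

def sample_once():
    """One pass of Algorithm 1: a configuration, or None on rejection."""
    prefix = ()
    for _ in range(n):
        u = rng.random()
        for s in states:
            p_s = np.exp(V[prefix + (s,)] - V[prefix])  # p_j(x_j) from step 1
            if u < p_s:
                prefix += (s,)
                break
            u -= p_s
        else:
            return None  # residual mass r was sampled: reject and restart
    return prefix

accepted = []
while len(accepted) < 4000:
    x = sample_once()
    if x is not None:
        accepted.append(x)

z = sum(np.exp(t) for t in theta.values())
gibbs = {x: np.exp(t) / z for x, t in theta.items()}
freq = {x: accepted.count(x) / len(accepted) for x in theta}
tv = 0.5 * sum(abs(freq[x] - gibbs[x]) for x in theta)
print(tv)  # small: accepted samples closely follow the Gibbs distribution
```

With exact expectations the accepted configurations are exactly Gibbs-distributed, as Theorem 3 states; the Monte Carlo estimates only perturb the overall acceptance rate and the residual mass at each step.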
Acceptance/rejection follows the geometric distribution; therefore the sampling procedure rejects k times with probability (1 − P[Algorithm 1 accepts])^k. The running time of our Gibbs sampler is determined by the average number of rejections, 1/P[Algorithm 1 accepts]. Interestingly, this average is the quality of the partition function upper bound presented in [5]. To augment this result we investigate in the next section efficiently computable lower bounds on the partition function that are based on random MAP perturbations. These lower bounds provide a way to efficiently determine the computational complexity of sampling from the Gibbs distribution for a given potential function.

5 Lower bounds on the partition function

The realization of the partition function as an expectation-optimization pair in Theorem 1 provides efficiently computable lower bounds on the partition function. Intuitively, these bounds correspond to moving expectations (or summations) inside the maximization operations. In the following we present two lower bounds that are derived along these lines; the first holds in expectation and the second holds in probability.

Corollary 1. Consider a family of subsets α ∈ A, and let x_α be the set of variables {x_i}_{i∈α} restricted to the indexes in α. Assume that the random variables γ_α(x_α) are i.i.d. according to the Gumbel distribution with zero mean, for every α, x_α.
Then

  ∀α ∈ A: log Z ≥ E_γ[ max_x {θ(x) + γ_α(x_α)} ]. In particular, log Z ≥ E_γ[ max_x {θ(x) + (1/|A|) Σ_{α∈A} γ_α(x_α)} ].

Proof: Let ᾱ = {1, ..., n} \ α; then Z = Σ_{x_α} Σ_{x_ᾱ} exp(θ(x)) ≥ Σ_{x_α} max_{x_ᾱ} exp(θ(x)). The first result is derived by swapping the maximization with the exponent and applying Theorem 1. The second result is attained by averaging these lower bounds, log Z ≥ Σ_{α∈A} (1/|A|) E_γ[ max_x {θ(x) + γ_α(x_α)} ], and by moving the summation inside the maximization operation. □

The expected lower bound requires invoking a MAP solver multiple times. Although this expectation may be estimated with a single MAP execution, the variance of this random MAP prediction is around √n. We suggest to recursively use Lemma 1 to lower bound the partition function with a single MAP operation, in probability.

Corollary 2. Let θ(x) be a potential function over x = (x_1, ..., x_n). We create multiple copies of x_i, namely x_{i,k_i} for k_i = 1, ..., m_i, and define the extended potential function θ̂(x̂) = Σ_{k_1,...,k_n} θ(x_{1,k_1}, ..., x_{n,k_n}) / Π_i m_i. We define the extended perturbation model γ̂_i(x_i) = Σ_{k_i=1}^{m_i} γ_{i,k_i}(x_{i,k_i}) / m_i, where each perturbation is independent and distributed according to the Gumbel distribution with zero mean.
Then, with probability at least 1 − Σ_{i=1}^n π²|dom(θ)|/(6 m_i ε²), it holds that log Z ≥ max_x̂ { θ̂(x̂) + Σ_{i=1}^n γ̂_i(x_i) } − εn.

Figure 1: Left (lower bounds): comparing our expected and probable lower bounds with structured mean-field and belief propagation on attractive models with high signal and varying coupling strength. Middle (unbiased sampler complexity): estimating the complexity of our unbiased sampling procedure on spin glass models of varying sizes. Right (approximate sampler): comparing our approximate sampling procedure on attractive models with high signal.

Proof: We estimate the expectation-optimization value of the log-partition function iteratively for every dimension, while replacing each expectation with its sampled average, as described in Lemma 1. Our result holds for every potential function, thus the statistics in each recursion hold uniformly for every x with probability at least 1 − π²|dom(θ)|/(6 m_i ε²). We then move the averages inside the maximization operation, thus lower bounding the εn-approximation of the partition function. □

The probable lower bound that we provide does not assume graph separations, thus its statistical guarantees are worse than the ones presented in the approximation scheme of Theorem 2. Also, since we seek a lower bound, we are able to relax our optimization requirements and thus to use vertex-based random perturbations γ_i(x_i). This is an important difference that makes this lower bound widely applicable and very efficient.

6 Experiments

We evaluated our approach on spin glass models θ(x) = Σ_{i∈V} θ_i x_i + Σ_{(i,j)∈E} θ_{i,j} x_i x_j, where x_i ∈ {−1, 1}.
Each spin has a local field parameter θ_i, sampled uniformly from [−1, 1]. The spins interact in a grid shaped graphical model with couplings θ_{i,j}, sampled uniformly from [0, c]. Whenever the coupling parameters are positive the model is called attractive, as adjacent variables give higher values to positively correlated configurations. Attractive models are computationally appealing as their MAP predictions can be computed efficiently by the graph-cut algorithm [2].

We begin by evaluating our lower bounds, presented in Section 5, on 10 × 10 spin glass models. Corollary 1 presents a lower bound that holds in expectation. We evaluated these lower bounds while perturbing the local potentials with γ_i(x_i). Corollary 2 presents a lower bound that holds in probability and requires only a single MAP prediction on an expanded model. We evaluate the probable bound by expanding the model to 1000 × 1000 grids, ignoring the discrepancy ε. For both the expected lower bound and the probable lower bound we used graph-cuts to compute the random MAP perturbations. We compared these bounds to the different forms of structured mean-field, taking the one that performed best: standard structured mean-field, which we computed over the vertical chains [8, 1], and the negative tree re-weighted bound computed on the horizontal and vertical trees [14]. We also compared to the sum-product belief propagation algorithm, which was recently proven to produce lower bounds for attractive models [20, 18]. We computed the error in estimating the logarithm of the partition function, averaged over 10 spin glass models; see Figure 1. One can see that the probable bound is the tightest in the medium and high coupling domains, which are traditionally hard for all methods. As it holds in probability, it might generate a solution which is not a lower bound. One can also verify that on average this does not happen.
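At a much smaller scale, the bounds can be checked by enumeration: averaging vertex perturbations gives the expected lower bound of Corollary 1, while summing them gives the upper bound of [5], and together they sandwich log Z. The instance below is hypothetical (a 3×3 grid with medium coupling c = 1), and brute-force enumeration stands in for the graph-cut MAP solver used in the actual experiments:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)

# A tiny 3x3 spin glass: theta(x) = sum_i theta_i*x_i + sum_{(i,j)} theta_ij*x_i*x_j.
n = 9
coords = [(i, j) for i in range(3) for j in range(3)]
edges = [(a, b) for a in range(n) for b in range(a + 1, n)
         if abs(coords[a][0] - coords[b][0]) + abs(coords[a][1] - coords[b][1]) == 1]
field = rng.uniform(-1.0, 1.0, size=n)         # local fields, uniform on [-1, 1]
coup = rng.uniform(0.0, 1.0, size=len(edges))  # attractive couplings, uniform on [0, c], c = 1

configs = np.array(list(product([-1, 1], repeat=n)))
energies = configs @ field + np.array(
    [sum(c * x[a] * x[b] for c, (a, b) in zip(coup, edges)) for x in configs])
log_z = np.logaddexp.reduce(energies)          # exact log-partition by enumeration

idx = (configs + 1) // 2                       # map spins -1/+1 to indices 0/1
lower, upper = [], []
for _ in range(500):
    g = rng.gumbel(-np.euler_gamma, 1.0, size=(n, 2))      # zero-mean Gumbel per (i, x_i)
    noise = g[np.arange(n)[None, :], idx].sum(axis=1)
    lower.append((energies + noise / n).max())  # averaged perturbations (Corollary 1)
    upper.append((energies + noise).max())      # summed perturbations (upper bound of [5])

print(np.mean(lower), log_z, np.mean(upper))    # on average: lower <= log Z <= upper
```

Each bound needs only a single MAP call per trial, which is what makes these estimates practical on models where Z itself is out of reach.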
The expected lower bound is significantly worse in the low coupling regime, in which many configurations need to be taken into account. It is (surprisingly) effective in the high coupling regime, which is characterized by a few dominant configurations.

Figure 2: Example image with the boundary annotation (left), the MAP solution (middle-left), the average of 20 samples (middle-right), and the error estimates obtained using our method (right). Thin structures of the object are often lost in a single MAP solution, but are recovered by averaging the samples, leading to better error estimates.

Section 4 describes an algorithm that generates unbiased samples from the full Gibbs distribution. Focusing on spin glass models with strong local field potentials, it is well known that one cannot produce unbiased samples from the Gibbs distribution in polynomial time [3]. Theorem 3 connects the computational complexity of our unbiased sampling procedure to the gap between the logarithm of the partition function and its upper bound in [5]. We use our probable lower bound to estimate this gap on large grids, for which we cannot compute the partition function exactly. Figure 1 suggests that the running time of this sampling procedure is sub-exponential.

Sampling from the Gibbs distribution in spin glass models with non-zero local field potentials is computationally hard [7, 3]. The approximate sampling technique of Theorem 2 suggests a method to overcome this difficulty by efficiently sampling from a distribution that approximates the Gibbs distribution in its marginal probabilities. Although our theory is only stated for graphs without cycles, it can be readily applied to general graphs, in the same way the (loopy) belief propagation algorithm is applied. For computational reasons we did not expand the graph.
Also, we experimented both with pairwise perturbations, as Theorem 2 suggests, and with local perturbations, which are guaranteed to preserve the super-modularity of the potential function. We computed the local marginal probability errors of our sampling procedure, comparing to the standard methods of Gibbs sampling, Metropolis, and Swendsen-Wang¹. In our experiments we let them run for at most 1e8 iterations; see Figure 1. Both Gibbs sampling and the Metropolis algorithm perform similarly (we omit the Gibbs sampler performance for clarity). Although these algorithms, as well as the Swendsen-Wang algorithm, directly sample from the Gibbs distribution, they typically require exponential running time to succeed on spin glass models. Figure 1 shows that these samplers are worse than our approximate samplers. Although we omit them from the plots for clarity, our approximate sampling marginal probabilities are comparable to those of the sum-product belief propagation and the tree re-weighted belief propagation [22]. Nevertheless, our sampling scheme also provides a probability notion, which is lacking in belief propagation type algorithms. Surprisingly, the approximate sampler that uses pairwise perturbations performs (slightly) worse than the approximate sampler that only uses local perturbations. Although this is not explained by our current theory, it is an encouraging observation, since the approximate sampler that uses random MAP predictions with local perturbations is orders of magnitude faster.

Lastly, we emphasize the importance of probabilistic reasoning over current variational methods, such as tree re-weighted belief propagation [22] or max-marginal probabilities [10], which only generate probabilities over small subsets of variables. The task we consider is to obtain pixel accurate boundaries from rough boundaries provided by the user.
For example, in an image editing application the user may provide an input in the form of a rough polygon, and the goal is to refine the boundaries using information from the gradients in the image. A natural notion of error is the average deviation of the marked boundary from the true boundary of the image. Given a user boundary, we set up a graphical model on the pixels using foreground/background models trained from regions well inside/outside the marked boundary. An exact binary labeling can be obtained using the graph-cuts algorithm. From this we can compute the expected error by sampling multiple solutions using random MAP predictors and averaging. On a dataset of 10 images which we carefully annotated to obtain pixel accurate boundaries, we find that random MAP perturbations produce significantly more accurate estimates of the boundary error than a single MAP solution. On average, the error estimate obtained using random MAP perturbations is off by 1.04 pixels from the true error (obtained from ground truth), whereas the single MAP estimate is off by 3.51 pixels. Such a measure can be used in an active annotation framework where the users iteratively fix the parts of the boundary that contain errors.

¹We used Talya Meltzer's inference package.

Figure 2 shows an example annotation, the MAP solution, the mean of 20 random MAP solutions, and the boundary error estimates.

7 Related work

The Gibbs distribution plays a key role in many areas of science, including computer science, statistics and physics. To learn more about its roles in machine learning, as well as its standard samplers, we refer the interested reader to the textbook [11]. Our work is based on max-statistics of collections of random variables.
For a comprehensive introduction to extreme value statistics we refer the reader to [12].
The Gibbs distribution and its partition function can be realized from the statistics of random MAP perturbations with the Gumbel distribution (see Theorem 1) [12, 17, 21, 5]. Recently, [16, 9, 17, 21, 6] explored different aspects of random MAP predictions with low dimensional perturbations. [16] describe sampling from the Gaussian distribution with random Gaussian perturbations. [17] show that random MAP predictors with low dimensional perturbations share similar statistics with the Gibbs distribution. [21] describe the Bayesian perspective of these models and their efficient sampling procedures. [9, 6] consider the generalization properties of such models within PAC-Bayesian theory. In our work we formally relate random MAP perturbations and the Gibbs distribution. Specifically, we describe the case for which the marginal probabilities of random MAP perturbations, with the proper expansion, approximate those of the Gibbs distribution. We also show how to use the statistics of random MAP perturbations to generate unbiased samples from the Gibbs distribution. These probability models generate samples efficiently through optimization: they have statistical advantages over purely variational approaches such as tree re-weighted belief propagation [22] or max-marginals [10], and they are faster than standard Gibbs samplers and Markov chain Monte Carlo approaches when MAP prediction is efficient [3, 25]. Other methods that efficiently produce samples include herding [23] and determinantal point processes [13].
Our suggested samplers for the Gibbs distribution are based on a low dimensional representation of the partition function [5]. We augment their results in a few ways. In Lemma 2 we refine their upper bound into a series of sequentially tighter bounds.
Corollary 2 shows that the approximation scheme of [5] is in fact a lower bound that holds in probability. Lower bounds for the partition function have been extensively developed in recent years within the context of variational methods. Structured mean-field methods are inner-bound methods where a simpler distribution is optimized as an approximation to the posterior in a KL-divergence sense [8, 1, 14]. The difficulty comes from the non-convexity of the set of feasible distributions. Surprisingly, [20, 18] have shown that sum-product belief propagation provides a lower bound on the partition function for super-modular potential functions. This result is based on the four functions theorem, which considers nonnegative functions over distributive lattices.

8 Discussion

This work explores new approaches to sampling from the Gibbs distribution. Sampling from the Gibbs distribution is a key problem in machine learning. Traditional approaches, such as Gibbs sampling, fail in the “high-signal high-coupling” regime that results in ragged energy landscapes. Following [17, 21], we showed here that one can take advantage of efficient MAP solvers to generate approximate or unbiased samples from the Gibbs distribution by randomly perturbing the potential function. Since MAP predictions are not affected by ragged energy landscapes, our approach excels in the “high-signal high-coupling” regime. As a by-product of our approach we constructed lower bounds on the partition function, which are both tighter and faster than previous approaches in the “high-signal high-coupling” regime.
Our approach is based on random MAP perturbations that estimate the partition function in expectation. In practice we compute the empirical mean.
[15] show that the deviation of the sampled mean from its expectation decays exponentially.
The computational complexity of our approximate sampling procedure is determined by the dimension of the perturbations. Currently, our theory does not describe the success of the probability model that is based on the maximal argument of the perturbed MAP program with local perturbations.

References

[1] Alexandre Bouchard-Côté and Michael I. Jordan. Optimization of structured mean field objectives. In AUAI, pages 67–74, 2009.
[2] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. PAMI, 2001.
[3] L.A. Goldberg and M. Jerrum. The complexity of ferromagnetic Ising with local fields. Combinatorics, Probability and Computing, 16(1):43, 2007.
[4] E.J. Gumbel and J. Lieblein. Statistical theory of extreme values and some practical applications: a series of lectures, volume 33. US Govt. Print. Office, 1954.
[5] T. Hazan and T. Jaakkola. On the partition function and random maximum a-posteriori perturbations. In Proceedings of the 29th International Conference on Machine Learning, 2012.
[6] T. Hazan, S. Maji, J. Keshet, and T. Jaakkola. Learning efficient random maximum a-posteriori predictors with non-decomposable loss functions. Advances in Neural Information Processing Systems, 2013.
[7] M. Jerrum and A. Sinclair. Polynomial-time approximation algorithms for the Ising model. SIAM Journal on Computing, 22(5):1087–1116, 1993.
[8] M.I. Jordan, Z. Ghahramani, T.S. Jaakkola, and L.K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
[9] J. Keshet, D. McAllester, and T. Hazan. PAC-Bayesian approach for minimization of phoneme error rate. In ICASSP, 2011.
[10] Pushmeet Kohli and Philip H.S. Torr.
Measuring uncertainty in graph cut solutions – efficiently computing min-marginal energies using dynamic graph cuts. In ECCV, pages 30–43, 2006.
[11] D. Koller and N. Friedman. Probabilistic Graphical Models. MIT Press, 2009.
[12] S. Kotz and S. Nadarajah. Extreme Value Distributions: Theory and Applications. World Scientific Publishing Company, 2000.
[13] A. Kulesza and B. Taskar. Structured determinantal point processes. In Proc. Neural Information Processing Systems, 2010.
[14] Qiang Liu and Alexander T. Ihler. Negative tree reweighted belief propagation. arXiv preprint arXiv:1203.3494, 2012.
[15] Francesco Orabona, Tamir Hazan, Anand D. Sarwate, and Tommi Jaakkola. On measure concentration of random maximum a-posteriori perturbations. arXiv preprint arXiv:1310.4227, 2013.
[16] G. Papandreou and A. Yuille. Gaussian sampling by local perturbations. In Proc. Int. Conf. on Neural Information Processing Systems (NIPS), pages 1858–1866, December 2010.
[17] G. Papandreou and A. Yuille. Perturb-and-MAP random fields: Using discrete optimization to learn and sample from energy models. In ICCV, Barcelona, Spain, November 2011.
[18] Nicholas Ruozzi. The Bethe partition function of log-supermodular graphical models. arXiv preprint arXiv:1202.6035, 2012.
[19] D. Sontag, T. Meltzer, A. Globerson, T. Jaakkola, and Y. Weiss. Tightening LP relaxations for MAP using message passing. In Conf. Uncertainty in Artificial Intelligence (UAI), 2008.
[20] E.B. Sudderth, M.J. Wainwright, and A.S. Willsky. Loop series and Bethe variational bounds in attractive graphical models. Advances in Neural Information Processing Systems, 20, 2008.
[21] D. Tarlow, R.P. Adams, and R.S. Zemel. Randomized optimum models for structured prediction. In Proceedings of the 15th Conference on Artificial Intelligence and Statistics, 2012.
[22] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky.
A new class of upper bounds on the log partition function. Trans. on Information Theory, 51(7):2313–2335, 2005.
[23] Max Welling. Herding dynamical weights to learn. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1121–1128. ACM, 2009.
[24] T. Werner. High-arity interactions, polyhedral relaxations, and cutting plane algorithm for soft constraint optimisation (MAP-MRF). In CVPR, pages 1–8, 2008.
[25] J. Zhang, H. Liang, and F. Bai. Approximating partition functions of the two-state spin system. Information Processing Letters, 111(14):702–710, 2011.