{"title": "Learning Efficient Random Maximum A-Posteriori Predictors with Non-Decomposable Loss Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 1887, "page_last": 1895, "abstract": "In this work we develop efficient methods for learning random MAP predictors for structured label problems. In particular, we construct posterior distributions over perturbations that can be adjusted via stochastic gradient methods. We show that every smooth posterior distribution would suffice to define a smooth PAC-Bayesian risk bound suitable for gradient methods. In addition, we relate the posterior distributions to computational properties of the MAP predictors. We suggest multiplicative posteriors to learn super-modular potential functions that accompany specialized MAP predictors such as graph-cuts. We also describe label-augmented posterior models that can use efficient MAP approximations, such as those arising from linear program relaxations.", "full_text": "Learning Efficient Random Maximum A-Posteriori Predictors with Non-Decomposable Loss Functions

Tamir Hazan, University of Haifa
Subhransu Maji, TTI Chicago
Joseph Keshet, Bar-Ilan University
Tommi Jaakkola, CSAIL, MIT

Abstract

In this work we develop efficient methods for learning random MAP predictors for structured label problems. In particular, we construct posterior distributions over perturbations that can be adjusted via stochastic gradient methods. We show that any smooth posterior distribution would suffice to define a smooth PAC-Bayesian risk bound suitable for gradient methods. In addition, we relate the posterior distributions to computational properties of the MAP predictors. We suggest multiplicative posteriors to learn super-modular potential functions that accompany specialized MAP predictors such as graph-cuts.
We also describe label-augmented posterior models that can use efficient MAP approximations, such as those arising from linear program relaxations.

1 Introduction

Learning and inference in complex models drives much of the research in machine learning, with applications ranging from computer vision and natural language processing to computational biology [1, 18, 21]. The inference problem in such cases involves assessing the likelihood of possible structured labels, whether they be objects, parses, or molecular structures. Given a training dataset of instances and labels, the learning problem amounts to estimating the parameters of the inference engine so as to best describe the labels of observed instances. The goodness of fit is usually measured by a loss function.

The structures of labels are specified by assignments of random variables, and the likelihood of an assignment is described by a potential function. Usually it is feasible only to find the most likely, or maximum a-posteriori (MAP), assignment, rather than to sample assignments according to their likelihood. Indeed, substantial effort has gone into developing algorithms for recovering MAP assignments, either based on specific parametrized restrictions such as super-modularity [2] or by devising approximate methods based on linear programming relaxations [21]. Learning MAP predictors is usually done with structured-SVMs that compare a "loss adjusted" MAP prediction to its training label [25]. In practice, most loss functions used decompose in the same way as the potential function, so as not to increase the complexity of the MAP prediction task. Nevertheless, non-decomposable loss functions capture the structures in the data that we would like to learn.

Bayesian approaches for expected loss minimization, or risk, effortlessly deal with non-decomposable loss functions.
The inference procedure samples a structure according to its likelihood and computes its loss given a training label. Recently, [17, 23] constructed probability models through MAP predictions. These "perturb-max" models describe the robustness of the MAP prediction to random changes of its parameters. Therefore, one can draw unbiased samples from these distributions using MAP predictions. Interestingly, when incorporating perturb-max models into Bayesian loss minimization, one would ultimately like to use the PAC-Bayesian risk [11, 19, 3, 20, 5, 10].

Our work explores the Bayesian aspects that emerge from PAC-Bayesian risk minimization. We focus on computational aspects when constructing posterior distributions, so that they can be used to minimize the risk bound efficiently. We show that any smooth posterior distribution suffices to define a smooth risk bound which can be minimized through gradient descent. In addition, we relate the posterior distributions to the computational properties of MAP predictors. We suggest multiplicative posterior models to learn super-modular potential functions that come with specialized MAP predictors such as graph-cuts [2]. We also describe label-augmented posterior models that can use MAP approximations, such as those arising from linear program relaxations [21].

2 Background

Learning complex models typically involves reasoning about the states of discrete variables whose labels (assignments of values) specify the discrete structures of interest. The learning task which we consider in this work is to fit parameters w that produce the most accurate prediction y ∈ Y for a given object x. Structures of labels are conveniently described by a discrete product space Y = Y1 × ··· × Yn. We describe the potential of relating a label y to an object x, with respect to the parameters w, by real valued functions θ(y; x, w).
Our goal is to learn the parameters w that best describe the training data (x, y) ∈ S. Within the Bayesian perspective, the distribution that one learns from the training data is composed of a distribution qw(γ) over the parameter space and a distribution P[y|w, x] ∝ exp θ(y; x, w) over the label space. Using the Bayes rule we derive the predictive distribution over the structures

P[y|x] = ∫ P[y|γ, x] qw(γ) dγ    (1)

Unfortunately, sampling algorithms over complex models are provably hard in theory and tend to be slow in many cases of practical interest [7]. This is in contrast to the maximum a-posteriori (MAP) prediction, which can be computed efficiently in many practical cases, even when sampling is provably hard:

(MAP predictor)    yw(x) = argmax_{y1,...,yn} θ(y; x, w)    (2)

Recently, [17, 23] suggested changing the Bayesian posterior probability models to utilize the MAP prediction in a deterministic manner. These perturb-max models allow sampling from the predictive distribution with a single MAP prediction:

(Perturb-max models)    P[y|x] def= Pγ∼qw[y = yγ(x)]    (3)

A potential function is decomposed along a graphical model if it has the form θ(y; x, w) = ∑_{i∈V} θi(yi; x, w) + ∑_{i,j∈E} θi,j(yi, yj; x, w). If the graph has no cycles, MAP prediction can be computed efficiently using the belief propagation algorithm. Nevertheless, there are cases where MAP prediction can be computed efficiently for graphs with cycles.
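As a concrete illustration of the perturb-max model in Equation (3), the following sketch draws Gaussian perturbations of the unary potentials of a tiny chain model and returns the MAP label of the perturbed model. The chain, its potentials, the unit-Gaussian perturbation distribution, and the brute-force MAP routine are illustrative stand-ins, not the paper's models:

```python
import itertools
import random

def map_predict(theta_unary, theta_pair):
    """Brute-force MAP over y in {-1, 1}^n (feasible only for tiny n)."""
    n = len(theta_unary)
    best, best_score = None, float("-inf")
    for y in itertools.product([-1, 1], repeat=n):
        score = sum(theta_unary[i][y[i]] for i in range(n))
        # chain pairwise potentials theta_pair * y_i * y_{i+1}
        score += sum(theta_pair * y[i] * y[i + 1] for i in range(n - 1))
        if score > best_score:
            best, best_score = y, score
    return best

def perturb_max_sample(theta_unary, theta_pair, rng):
    """Draw gamma ~ q_w (here: unit Gaussians added to each unary
    potential) and return the MAP label of the perturbed model."""
    perturbed = [{s: th[s] + rng.gauss(0.0, 1.0) for s in (-1, 1)}
                 for th in theta_unary]
    return map_predict(perturbed, theta_pair)

rng = random.Random(0)
unary = [{-1: 0.0, 1: 1.0}, {-1: 0.5, 1: 0.0}, {-1: 0.0, 1: 0.3}]
samples = [perturb_max_sample(unary, 0.4, rng) for _ in range(100)]
# the empirical frequencies of `samples` approximate P[y|x] of Equation (3)
```

Each sample costs one MAP call, which is the computational point of the construction.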
A potential function is called supermodular if it is defined over Y = {−1, 1}^n and its pairwise interactions favor adjacent states having the same label, i.e., θi,j(−1,−1; x, w) + θi,j(1, 1; x, w) ≥ θi,j(−1, 1; x, w) + θi,j(1,−1; x, w). In such cases MAP prediction reduces to the min-cut (graph-cuts) algorithm.

Recently, a sequence of works has attempted to solve the MAP prediction task for non-supermodular potential functions as well as general regions. These cases usually involve potential functions that are described by a family R of subsets of variables r ⊂ {1, ..., n}, called regions. We denote by yr the set of labels that correspond to the region r, namely (yi)_{i∈r}, and consider potential functions of the form θ(y; x, w) = ∑_{r∈R} θr(yr; x, w). Thus, MAP prediction can be formulated as an integer linear program:

b* ∈ argmax_{br(yr)} ∑_{r,yr} br(yr) θr(yr; x, w)  s.t.  br(yr) ∈ {0, 1},  ∑_{yr} br(yr) = 1,  ∑_{ys\yr} bs(ys) = br(yr) ∀ r ⊂ s    (4)

The correspondence between MAP prediction and integer linear program solutions is (yw(x))_i = argmax_{yi} b*_i(yi). Although integer linear program solvers provide an alternative to MAP prediction, they may be restricted to problems of small size. This restriction can be relaxed when one replaces the integral constraints br(yr) ∈ {0, 1} with nonnegativity constraints br(yr) ≥ 0. These linear program relaxations can be solved efficiently using different convex max-product solvers, and whenever these solvers produce an integral solution it is guaranteed to be the MAP prediction [21].

Given training data of object-label pairs, the learning objective is to estimate a predictive distribution over the structured labels. The goodness of fit is measured by a loss function L(ŷ, y).
As we focus on randomized MAP predictors, our goal is to learn the parameters w that minimize the expected perturb-max prediction loss, or randomized risk. We define the randomized risk at a single instance-label pair as

R(w, x, y) = ∑_{ŷ∈Y} Pγ∼qw[ŷ = yγ(x)] L(ŷ, y).

Alternatively, the randomized risk takes the form R(w, x, y) = Eγ∼qw[L(yγ(x), y)]. The randomized risk originates within the PAC-Bayesian generalization bounds. Intuitively, if the training set is an independent sample, one would expect the best predictor on the training set to perform well on unlabeled objects at test time.

3 Minimizing PAC-Bayesian generalization bounds

Our approach is based on the PAC-Bayesian risk analysis of random MAP predictors. In the following we state the PAC-Bayesian generalization bound for structured predictors and describe the gradients of these bounds for any smooth posterior distribution.

The PAC-Bayesian generalization bound describes the expected loss, or randomized risk, with respect to the true distribution over object-label pairs in the world, R(w) = E_{(x,y)∼ρ}[R(w, x, y)]. It upper bounds the randomized risk by the empirical randomized risk RS(w) = (1/|S|) ∑_{(x,y)∈S} R(w, x, y) plus a penalty term which decreases proportionally to the training set size. Here we state the PAC-Bayesian theorem, which holds uniformly for all posterior distributions over the predictions.

Theorem 1. (Catoni [3], see also [5]). Let L(ŷ, y) ∈ [0, 1] be a bounded loss function. Let p(γ) be any probability density function and let qw(γ) be a family of probability density functions parameterized by w. Let KL(qw||p) = ∫ qw(γ) log(qw(γ)/p(γ)) dγ. Then, for any δ ∈ (0, 1] and for any real number λ > 0, with probability at least 1 − δ over the draw of the training set the following holds simultaneously for all w:

R(w) ≤ (1 / (1 − exp(−λ))) ( λ RS(w) + (KL(qw||p) + log(1/δ)) / |S| )

For completeness we present a proof sketch for the theorem in the appendix. The proof follows Seeger's PAC-Bayesian approach [19], extended to the structured label case [13]. The proof technique replaces the prior randomized risk with the posterior randomized risk, which holds uniformly for every w, while penalizing this change by their KL-divergence. This change-of-measure step is close in spirit to the one performed in importance sampling. The proof is then concluded by a simple convex bound on the moment generating function of the empirical risk.

To find the best posterior distribution that minimizes the randomized risk, one can minimize its empirical upper bound. We show that whenever the posterior distributions have smooth probability density functions qw(γ), the perturb-max probability model is smooth as a function of w. Thus the randomized risk bound can be minimized with gradient methods.

Theorem 2. Assume qw(γ) is smooth as a function of its parameters. Then the PAC-Bayesian bound is smooth as a function of w:

∇w RS(w) = (1/|S|) ∑_{(x,y)∈S} Eγ∼qw[ ∇w[log qw(γ)] L(yγ(x), y) ]

Moreover, the KL-divergence is a smooth function of w and its gradient takes the form:

∇w KL(qw||p) = Eγ∼qw[ ∇w[log qw(γ)] ( log(qw(γ)/p(γ)) + 1 ) ]

Proof: First we note that R(w, x, y) = ∫ qw(γ) L(yγ(x), y) dγ. Since qw(γ) is a probability density function and L(ŷ, y) ∈ [0, 1] we can differentiate under the integral (cf. [4] Theorem 2.27): ∇w R(w, x, y) = ∫ ∇w qw(γ) L(yγ(x), y) dγ. Using the identity ∇w qw(γ) = qw(γ) ∇w log(qw(γ)), the first part of the proof follows. The second part follows in the same manner, noting that ∇w(qw(γ) log qw(γ)) = (∇w qw(γ))(log qw(γ) + 1). □

The gradient of the randomized empirical risk is governed by the gradient of the log-probability density function of the corresponding posterior model. For example, a Gaussian model with mean w and identity covariance matrix has probability density function qw(γ) ∝ exp(−‖γ − w‖²/2); thus the gradient of its log-density is the linear moment γ, i.e., ∇w[log qw] = γ − w.

Taking any smooth distribution qw(γ), we can find the parameters w by descending along the stochastic gradient of the PAC-Bayesian generalization bound. The gradient of the randomized empirical risk is formed by two expectations, over the sample points and over the posterior distribution. Computing these expectations is time consuming, so we use a single sample ∇w[log qw(γ)] L(yγ(x), y) as an unbiased estimator for the gradient. Similarly, we estimate the gradient of the KL-divergence with an unbiased estimator which requires a single sample of ∇w[log qw(γ)](log(qw(γ)/p(γ)) + 1).
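The single-sample estimator can be checked numerically. Below is a minimal sketch for a one-dimensional unit-variance Gaussian posterior, where ∇w log qw(γ) = γ − w; the bounded toy loss and all constants are illustrative choices, not quantities from the paper:

```python
import random

def grad_estimate(w, loss, rng, n_samples=20000):
    """Average of the single-sample estimator of Theorem 2:
    E_{gamma ~ N(w, 1)}[(gamma - w) * L(gamma)], since the score of a
    unit Gaussian is grad_w log q_w(gamma) = gamma - w."""
    total = 0.0
    for _ in range(n_samples):
        gamma = rng.gauss(w, 1.0)
        total += (gamma - w) * loss(gamma)
    return total / n_samples

# toy bounded "loss": L(gamma) = 1 if gamma < 0 else 0, so E[L] = Phi(-w)
rng = random.Random(1)
g = grad_estimate(1.0, lambda t: 1.0 if t < 0 else 0.0, rng)
# analytically d/dw E[L] = -phi(w), about -0.242 at w = 1,
# so g should land near that value
```

The estimator never differentiates the loss itself, which is what makes non-decomposable (even non-differentiable) losses usable here.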
This approach, called stochastic approximation or online gradient descent, amounts to using the stochastic gradient update rule

w ← w − η · λ ∇w[log qw(γ)] ( L(yγ(x), y) + log(qw(γ)/p(γ)) + 1 )

where η is the learning rate. Next, we explore different posterior distributions from computational perspectives. Specifically, we show how to learn the posterior model so as to ensure the computational efficiency of its MAP predictor.

4 Learning posterior distributions efficiently

The ability to efficiently apply MAP predictors is key to the success of the learning process. Although MAP prediction is NP-hard in general, there are posterior models for which it can be computed efficiently. For example, whenever the potential function corresponds to a graphical model with no cycles, MAP prediction can be efficiently computed for any learned parameters w. Learning unconstrained parameters with random MAP predictors provides some freedom in choosing the posterior distribution. In fact, Theorem 2 suggests that one can learn any posterior distribution by performing gradient descent on its risk bound, as long as its probability density function is smooth. We show that for unconstrained parameters, additive posterior distributions simplify the learning problem, and the complexity of the bound (i.e., its KL-divergence) mostly depends on its prior distribution.

Corollary 1. Let q0(γ) be a smooth probability density function with zero mean, and set the posterior distribution using additive shifts, qw(γ) = q0(γ − w). Let H(q) = −Eγ∼q[log q(γ)] be the entropy function.
Then

KL(qw||p) = −H(q0) − Eγ∼q0[log p(γ + w)]

In particular, if p(γ) ∝ exp(−‖γ‖²/2) is Gaussian then ∇w KL(qw||p) = w.

Proof: KL(qw||p) = −H(qw) − Eγ∼qw[log p(γ)]. By a linear change of variable, γ̂ = γ − w, it follows that H(qw) = H(q0), thus ∇w H(qw) = 0. Similarly, Eγ∼qw[log p(γ)] = Eγ∼q0[log p(γ + w)]. Finally, if p(γ) is Gaussian then, since q0 has zero mean, Eγ∼q0[log p(γ + w)] = −‖w‖²/2 − Eγ∼q0[‖γ‖²/2] + const. □

This result implies that every additively-shifted smooth posterior distribution may treat the KL-divergence penalty as a squared regularization term when using a Gaussian prior p(γ) ∝ exp(−‖γ‖²/2). This generalizes the standard claim for Gaussian posterior distributions [11], for which q0(γ) is Gaussian. Thus one can use different posterior distributions to better fit the randomized empirical risk, without increasing the computational complexity over Gaussian processes.

Learning unconstrained parameters can be efficiently applied to tree structured graphical models. This, however, is restrictive. Many practical problems require more complex models, with many cycles. For some of these models linear program solvers give efficient, although sometimes approximate, MAP predictions. For supermodular models there are specific solvers, such as graph-cuts, that produce fast and accurate MAP predictions.
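Corollary 1 can be verified by Monte Carlo. The sketch below uses a zero-mean Laplace q0 (an arbitrary illustrative choice) and a Gaussian prior normalized here as p(γ) ∝ exp(−‖γ‖²/2); under that normalization the KL gradient should come out to w for any smooth zero-mean q0:

```python
import random

def kl_gradient_mc(w, sample_q0, rng, n=50000):
    """Monte Carlo check of Corollary 1: for an additive posterior
    q_w(gamma) = q_0(gamma - w) and a Gaussian prior
    p(gamma) ~ exp(-gamma^2 / 2), the entropy term of the KL does not
    depend on w, so grad_w KL(q_w || p) = E_{gamma~q0}[gamma + w] = w."""
    return sum(sample_q0(rng) + w for _ in range(n)) / n

rng = random.Random(2)
# zero-mean Laplace q0 sampled as a difference of two exponentials
g = kl_gradient_mc(0.7, lambda r: r.expovariate(1.0) - r.expovariate(1.0), rng)
# g should be close to w = 0.7, even though q0 is not Gaussian
```

This is the sense in which a Gaussian prior turns the KL penalty into square regularization regardless of the posterior's shape.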
In the following we show how to define posterior distributions that guarantee efficient predictions, thus allowing efficient sampling and learning.

4.1 Learning constrained posterior models

MAP predictions can be computed efficiently in important practical cases, e.g., for supermodular potential functions satisfying θi,j(−1,−1; x, w) + θi,j(1, 1; x, w) ≥ θi,j(−1, 1; x, w) + θi,j(1,−1; x, w). Whenever we restrict ourselves to symmetric potential functions θi,j(yi, yj; x, w) = wi,j yi yj, supermodularity translates to a nonnegativity constraint on the parameters, wi,j ≥ 0. In order to model posterior distributions that allow efficient sampling, we define models over the constrained parameter space. Unfortunately, the additive posterior models qw(γ) = q0(γ − w) are inappropriate for this purpose, as they place positive probability on negative γ values and would generate non-supermodular models.

To learn constrained parameters one requires posterior distributions that respect these constraints. For nonnegative parameters we apply posterior distributions that are defined on the nonnegative real numbers. We suggest incorporating the parameters of the posterior distribution in a multiplicative manner into a distribution over the nonnegative real numbers. For any distribution qα(γ) we determine a posterior distribution with parameters w as qw(γ) = qα(γ/w)/w. We show that multiplicative posterior models naturally provide log-barrier functions over the constrained set of nonnegative numbers. This property is important to the computational efficiency of the bound minimization algorithm.

Corollary 2. For any probability distribution qα(γ), let qα,w(γ) = qα(γ/w)/w be the parametrized posterior distribution.
Then

KL(qα,w||p) = −H(qα) − log w − Eγ∼qα[log p(wγ)]

Define the Gamma function Γ(α) = ∫₀^∞ γ^{α−1} exp(−γ) dγ. If p(γ) = qα(γ) = γ^{α−1} exp(−γ)/Γ(α) has the Gamma distribution with parameter α, then Eγ∼qα[log p(wγ)] = (α − 1) log w − αw. Alternatively, if p(γ) is a truncated Gaussian then Eγ∼qα[log p(wγ)] = −(α/2)w² + log √(π/2).

Proof: The entropy of the multiplicative posterior model naturally implies the log-barrier function: with the change of variable γ̂ = γ/w,

−H(qα,w) = ∫ qα(γ̂) ( log qα(γ̂) − log w ) dγ̂ = −H(qα) − log w.

Similarly, Eγ∼qα,w[log p(γ)] = Eγ∼qα[log p(wγ)]. The special cases of the Gamma and the truncated normal distribution follow by direct computation. □

The multiplicative posterior distribution thus provides the barrier function −log w as part of its KL-divergence, and thereby effortlessly enforces the constraints on its parameters. This property suggests that multiplicative rules are computationally favorable. Interestingly, a prior model with a Gamma distribution adds to the barrier function a linear regularization term ‖w‖₁ that encourages sparsity. On the other hand, a prior model with a truncated Gaussian adds a squared regularization term which drifts the nonnegative parameters away from zero. A computational disadvantage of the Gaussian prior is that its barrier function cannot be controlled by a parameter α.

4.2 Learning posterior models with approximate MAP predictions

MAP prediction can be phrased as an integer linear program, stated in Equation (4).
The computational burden of integer linear programs can be relaxed when one replaces the integral constraints with nonnegativity constraints. This approach produces approximate MAP predictions. An important learning challenge is to extend the predictive distribution of perturb-max models to incorporate approximate MAP solutions. Approximate MAP predictions are described by the feasible set of their linear program relaxations, which is usually called the local polytope:

L(R) = { br(yr) : br(yr) ≥ 0,  ∑_{yr} br(yr) = 1,  ∑_{ys\yr} bs(ys) = br(yr) ∀ r ⊂ s }

Linear program solutions are usually the extreme points of their feasible polytope. The local polytope is defined by a finite set of equalities and inequalities, thus it has a finite number of extreme points. The perturb-max model defined in Equation (3) can be effortlessly extended to the finite set of the local polytope's extreme points [15]. This approach has two flaws. First, linear program solutions might not be extreme points, and decoding such a point usually requires additional computational effort. Second, without describing the linear program solutions one cannot incorporate loss functions that take the structural properties of approximate MAP predictions into account when computing the randomized risk.

Theorem 3. Consider approximate MAP predictions that arise from the relaxation of the MAP prediction problem in Equation (4):

argmax_{br(yr)} ∑_{r,yr} br(yr) θr(yr; x, w)  s.t.
b ∈ L(R)

Then any optimal solution b* is described by a vector ỹw(x) in the finite power sets over the regions, Ỹ ⊂ ×_r 2^{Yr}:

ỹw(x) = (ỹw,r(x))_{r∈R}   where   ỹw,r(x) = {yr : b*r(yr) > 0}

Moreover, if there is a unique optimal solution b* then it corresponds to an extreme point of the local polytope.

Proof: The program is convex over a compact set, thus strong duality holds. Fixing the Lagrange multipliers λr→s(yr) that correspond to the marginalization constraints ∑_{ys\yr} bs(ys) = br(yr), and considering the probability constraints as the domain of the primal program, we derive the dual program

∑_r max_{yr} { θr(yr; x, w) + ∑_{c:c⊂r} λc→r(yc) − ∑_{p:p⊃r} λr→p(yr) }

The Lagrange optimality constraints (or, equivalently, Danskin's theorem) determine the primal optimal solutions b*r(yr) to be probability distributions over the set argmax_{yr} { θr(yr; x, w) + ∑_{c:c⊂r} λ*c→r(yc) − ∑_{p:p⊃r} λ*r→p(yr) } that satisfy the marginalization constraints. Thus ỹw,r(x) is the information that identifies the primal optimal solutions, i.e., any other primal feasible solution that has the same ỹw,r(x) is also a primal optimal solution. □

This theorem extends Proposition 3 in [6] to non-binary and non-pairwise graphical models. The theorem describes the discrete structures of approximate MAP predictions. Thus we are able to define posterior distributions that use efficient, although approximate, predictions while taking their structures into account. To integrate these posterior distributions into the randomized risk we extend the loss function to L(ỹw(x), y).
One can verify that the results in Section 3 follow through, e.g., by considering loss functions L : Ỹ × Ỹ → [0, 1] while the training examples' labels belong to the subset Y ⊂ Ỹ.

5 Empirical evaluation

We perform experiments on interactive image segmentation. We use the Grabcut dataset proposed by Blake et al. [1], which consists of 50 images of objects on cluttered backgrounds; the goal is to obtain pixel-accurate segmentations of the object given an initial "trimap" (see Figure 1). A trimap is an approximate segmentation of the image into regions that are well inside, well outside, and on the boundary of the object, something a user can easily specify in an interactive application.

A popular approach for segmentation is the GrabCut approach [2, 1]. We learn parameters for the "Gaussian Mixture Markov Random Field" (GMMRF) formulation of [1] using a potential function over foreground/background segmentations Y = {−1, 1}^n: θ(y; x, w) = ∑_{i∈V} θi(yi; x, w) + ∑_{i,j∈E} θi,j(yi, yj; x, w). The local potentials are θi(yi; x, w) = w_{yi} log P(yi|x), where w_{yi} are parameters to be learned and P(yi|x) is obtained from a Gaussian mixture model learned on the background and foreground pixels for an image x in the initial trimap. The pairwise potentials are θi,j(yi, yj; x, w) = wa exp(−(xi − xj)²) yi yj, where xi denotes the intensity of image x at pixel i, and wa are the parameters to be learned for the angles a ∈ {0, 90, 45, −45}°. These potential functions are supermodular as long as the parameters wa are nonnegative, thus MAP prediction can be computed efficiently with the graph-cuts algorithm. For these parameters we use a multiplicative posterior model with the Gamma distribution.
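A sketch of this multiplicative Gamma posterior for a single nonnegative pairwise weight follows. The shape parameter is an illustrative choice, and the closed-form score d/dw log qα,w(γ) = (γ − αw)/w², used in the stochastic gradient update, is derived here under the assumption of a Gamma(α, 1) base density:

```python
import random

ALPHA = 2.0  # shape of the Gamma posterior (illustrative value)

def sample_weight(w, rng):
    """Draw gamma ~ q_{alpha,w} by scaling: gamma = w * z with
    z ~ Gamma(alpha, 1), so the sampled weight stays nonnegative
    whenever w > 0 and graph-cuts remains applicable."""
    return w * rng.gammavariate(ALPHA, 1.0)

def score(w, gamma):
    """d/dw log q_{alpha,w}(gamma) for q_{alpha,w}(gamma) =
    q_alpha(gamma/w)/w with a Gamma(alpha, 1) density; expanding the
    log-density and differentiating gives (gamma - alpha*w) / w**2."""
    return (gamma - ALPHA * w) / w ** 2

rng = random.Random(3)
w = 0.5
draws = [sample_weight(w, rng) for _ in range(20000)]
mean_score = sum(score(w, d) for d in draws) / len(draws)
# E[d/dw log q_w] = 0 for any smooth density, so mean_score is near zero
```

The score replaces the additive-model score γ − w in the update rule of Section 3; no projection step is needed to keep the weights nonnegative.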
The dataset does not come with a standard training/test split, so we use the odd-numbered images for training and the even-numbered images for testing. We use stochastic gradient descent with the step size decaying as ηt = η/(t0 + t) for 250 iterations.

Method                          | GrabCut loss | PASCAL loss
Our method                      |    7.77%     |    5.29%
Structured SVM (Hamming loss)   |    9.74%     |    6.66%
Structured SVM (all-zero loss)  |    7.87%     |    5.63%
GMMRF (Blake et al. [1])        |    7.88%     |    5.85%
Perturb-and-MAP ([17])          |    8.19%     |    5.76%

Table 1: Learning the Grabcut segmentations using two different loss functions. Our learned parameters outperform structured SVM approaches and Perturb-and-MAP moment matching.

Figure 1: Two examples of image (left), input "trimap" (middle) and the final segmentation (right) produced using our learned parameters.

We use two different loss functions for training/testing to illustrate the flexibility of our approach in learning with various task-specific loss functions. The "GrabCut loss" measures the fraction of incorrect pixel labels in the region specified as the boundary in the trimap. The "PASCAL loss", which is commonly used in several image segmentation benchmarks, measures the ratio of the intersection and union of the foregrounds of the ground truth segmentation and the solution. As a comparison we also trained parameters using moment matching of MAP perturbations [17] and structured SVM. We use a stochastic gradient approach with a decaying step size for 1000 iterations. With structured SVM, solving the loss-augmented inference max_{ŷ∈Y} { L(y, ŷ) + θ(ŷ; x, w) } with the Hamming loss can be done efficiently using graph-cuts. We also consider learning parameters with the all-zero loss function, i.e., L(y, ŷ) ≡ 0.
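The two evaluation losses described above can be sketched on toy binary masks. The mask layout is hypothetical, and the PASCAL loss is implemented as one minus the intersection-over-union ratio, which is one plausible reading of the description (the paper reports both losses as percentages):

```python
def grabcut_loss(pred, truth, boundary):
    """Fraction of incorrectly labelled pixels, counted only inside the
    trimap's boundary region (all masks are equal-length lists of 0/1)."""
    idx = [i for i, b in enumerate(boundary) if b]
    wrong = sum(1 for i in idx if pred[i] != truth[i])
    return wrong / len(idx)

def pascal_loss(pred, truth):
    """1 - |intersection| / |union| of predicted and true foregrounds."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    union = sum(1 for p, t in zip(pred, truth) if p or t)
    return 1.0 - inter / union

pred     = [1, 1, 0, 0, 1]   # hypothetical predicted foreground mask
truth    = [1, 0, 0, 1, 1]   # hypothetical ground-truth mask
boundary = [0, 1, 1, 1, 0]   # pixels the trimap marks as uncertain
# grabcut_loss counts 2 of 3 boundary pixels wrong; pascal_loss is 1 - 2/4
```

Neither loss decomposes along the pairwise graphical model, which is exactly the setting the randomized risk handles.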
To ensure that the weights remain nonnegative, we project them onto the nonnegative orthant after each iteration.

Table 1 shows the results of learning using the various methods. For the GrabCut loss, our method obtains results comparable to the GMMRF framework of [1], which used hand-tuned parameters. Our results are significantly better when the PASCAL loss is used. Our method also outperforms the parameters learned using structured SVM and Perturb-and-MAP approaches. In our experiments the structured SVM with the Hamming loss did not perform well: the loss-augmented inference tended to focus on maximum violations instead of good solutions, which causes the parameters to change even when the MAP solution has a low loss (a similar phenomenon was observed in [22]). Using the all-zero loss tends to produce better results in practice, as seen in Table 1. Figure 1 shows some example images, the input trimap, and the segmentations obtained using our approach.

6 Related work

Recent years have introduced many optimization techniques that provide efficient MAP predictors for complex models. These MAP predictors can be integrated to learn complex models using structured-SVM [25]. Structured-SVM has a drawback: its MAP prediction is adjusted by the loss function, and therefore it has an augmented complexity. Recently, there has been an effort to efficiently integrate non-decomposable loss functions into structured-SVMs [24]. However, this approach does not hold for every loss function.

Bayesian approaches to loss minimization treat separately the prediction process and the loss incurred [12]. However, the Bayesian approach depends on the efficiency of its sampling procedure, and unfortunately, sampling in complex models is harder than the MAP prediction task [7].

The recent works [17, 23, 8, 9, 16] integrate efficient MAP predictors into Bayesian modeling.
[23] describes the Bayesian perspectives, while [17, 8] describe their relations to the Gibbs distribution and moment matching. [9] provide unbiased samples from the Gibbs distribution using MAP predictors, and [16] present their measure concentration properties. Other strategies for producing (pseudo) samples efficiently include Herding [26]. However, these approaches do not consider risk minimization.

The perturb-max models in Equation (3) play a key role in PAC-Bayesian theory [14, 11, 19, 3, 20, 5, 10]. The PAC-Bayesian approaches focus on generalization bounds with respect to the object-label distribution. However, the posterior models in the PAC-Bayesian approaches were not extensively studied in the past; in most cases the posterior model remained undefined. [10] investigate linear predictors with Gaussian posterior models so as to obtain a structured-SVM-like bound. This bound holds uniformly for every λ, and its derivation is quite involved. In contrast, we use Catoni's PAC-Bayesian bound, which is not uniform over λ but does not require the log |S| term [3, 5]. The simplicity of Catoni's bound (see Appendix) makes it amenable to different extensions. In our work, we extend these results to smooth posterior distributions, while maintaining the quadratic regularization form. We also describe posterior distributions for non-linear models. From a different perspective, [3, 5] describe the optimal posterior, but unfortunately there is no efficient sampling procedure for this posterior model. In contrast, our work explores posterior models which allow efficient sampling. We investigate two posterior models: multiplicative models, for constrained MAP solvers such as graph-cuts, and posterior models for approximate MAP solutions.

7 Discussion

Learning complex models requires one to consider non-decomposable loss functions that take the desirable structures into account.
We suggest using the Bayesian perspective to efficiently sample and learn such models with random MAP predictions. We show that any smooth posterior distribution suffices to define a smooth PAC-Bayesian risk bound, which can be minimized using gradient descent. In addition, we relate the posterior distributions to the computational properties of the MAP predictors. We suggest multiplicative posterior models to learn supermodular potential functions that come with specialized MAP predictors such as the graph-cuts algorithm. We also describe label-augmented posterior models that can use efficient MAP approximations, such as those arising from linear program relaxations. We did not evaluate the performance of these posterior models, and further exploration of such models is required.

The results here focus on posterior models that allow for efficient sampling using MAP predictions. There are other cases for which specific posterior distributions might be handy, e.g., learning posterior distributions of Gaussian mixture models.
In these cases, the parameters include the covariance matrix, and thus would require sampling over the family of positive definite matrices.

A Proof sketch for Theorem 1

Theorem 2.1 in [5]: For any distribution D over object-label pairs, for any w-parametrized distribution q_w, for any prior distribution p, for any δ ∈ (0, 1], and for any convex function D : [0, 1] × [0, 1] → R, with probability at least 1 − δ over the draw of the training set, the divergence D(E_{γ∼q_w} R_S(γ), E_{γ∼q_w} R(γ)) is upper bounded simultaneously for all w by

    (1/|S|) [ KL(q_w || p) + log( (1/δ) E_{γ∼p} E_{S∼D^m} exp( m D(R_S(γ), R(γ)) ) ) ]

For D(R_S(γ), R(γ)) = F(R(γ)) − λ R_S(γ), the bound reduces to a simple convex bound on the moment generating function of the empirical risk: E_{S∼D^m} exp( m D(R_S(γ), R(γ)) ) = exp( m F(R(γ)) ) E_{S∼D^m} exp( −m λ R_S(γ) ). Since the exponent function is a convex function of R_S(γ) = R_S(γ) · 1 + (1 − R_S(γ)) · 0, the moment generating function bound is exp(−λ R_S(γ)) ≤ R_S(γ) exp(−λ) + (1 − R_S(γ)). Since E_S R_S(γ) = R(γ), the right term in the risk bound can be made 1 by choosing F(R(γ)) to be the inverse of the moment generating function bound. This is Catoni's bound [3, 5] for the structured-labels case. To derive Theorem 1 we apply 1 − x ≤ exp(−x) to derive the lower bound (1 − exp(−λ)) E_{γ∼q_w} R(γ) − λ E_{γ∼q_w} R_S(γ) ≤ D(E_{γ∼q_w} R_S(γ), E_{γ∼q_w} R(γ)).

References
[1] Andrew Blake, Carsten Rother, Matthew Brown, Patrick Perez, and Philip Torr. Interactive image segmentation using an adaptive GMMRF model.
In ECCV 2004, pages 428–441, 2004.

[2] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. PAMI, 2001.

[3] O. Catoni. PAC-Bayesian supervised classification: The thermodynamics of statistical learning. arXiv preprint arXiv:0712.0248, 2007.

[4] G.B. Folland. Real Analysis: Modern Techniques and Their Applications. John Wiley & Sons, New York, 1999.

[5] P. Germain, A. Lacasse, F. Laviolette, and M. Marchand. PAC-Bayesian learning of linear classifiers. In ICML, pages 353–360. ACM, 2009.

[6] A. Globerson and T. S. Jaakkola. Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. Advances in Neural Information Processing Systems, 21, 2007.

[7] L.A. Goldberg and M. Jerrum. The complexity of ferromagnetic Ising with local fields. Combinatorics, Probability and Computing, 16(1):43, 2007.

[8] T. Hazan and T. Jaakkola. On the partition function and random maximum a-posteriori perturbations. In Proceedings of the 29th International Conference on Machine Learning, 2012.

[9] T. Hazan, S. Maji, and T. Jaakkola. On sampling from the Gibbs distribution with random maximum a-posteriori perturbations. Advances in Neural Information Processing Systems, 2013.

[10] J. Keshet, D. McAllester, and T. Hazan. PAC-Bayesian approach for minimization of phoneme error rate. In ICASSP, 2011.

[11] John Langford and John Shawe-Taylor. PAC-Bayes & margins. Advances in Neural Information Processing Systems, 15:423–430, 2002.

[12] Erich Leo Lehmann and George Casella. Theory of Point Estimation, volume 31. 1998.

[13] Andreas Maurer. A note on the PAC Bayesian theorem. arXiv preprint cs/0411099, 2004.

[14] D. McAllester. Simplified PAC-Bayesian margin bounds. Learning Theory and Kernel Machines, pages 203–215, 2003.

[15] D. McAllester, T. Hazan, and J. Keshet.
Direct loss minimization for structured prediction. Advances in Neural Information Processing Systems, 23:1594–1602, 2010.

[16] Francesco Orabona, Tamir Hazan, Anand D. Sarwate, and Tommi Jaakkola. On measure concentration of random maximum a-posteriori perturbations. arXiv:1310.4227, 2013.

[17] G. Papandreou and A. Yuille. Perturb-and-MAP random fields: Using discrete optimization to learn and sample from energy models. In ICCV, Barcelona, Spain, November 2011.

[18] A.M. Rush and M. Collins. A tutorial on dual decomposition and Lagrangian relaxation for inference in natural language processing.

[19] Matthias Seeger. PAC-Bayesian generalisation error bounds for Gaussian process classification. The Journal of Machine Learning Research, 3:233–269, 2003.

[20] Yevgeny Seldin. A PAC-Bayesian Approach to Structure Learning. PhD thesis, 2009.

[21] D. Sontag, T. Meltzer, A. Globerson, T. Jaakkola, and Y. Weiss. Tightening LP relaxations for MAP using message passing. In Conf. Uncertainty in Artificial Intelligence (UAI), 2008.

[22] Martin Szummer, Pushmeet Kohli, and Derek Hoiem. Learning CRFs using graph cuts. In Computer Vision–ECCV 2008, pages 582–595. Springer, 2008.

[23] D. Tarlow, R.P. Adams, and R.S. Zemel. Randomized optimum models for structured prediction. In AISTATS, pages 21–23, 2012.

[24] Daniel Tarlow and Richard S. Zemel. Structured output learning with high order loss functions. In International Conference on Artificial Intelligence and Statistics, pages 1212–1220, 2012.

[25] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. Advances in Neural Information Processing Systems, 16:51, 2004.

[26] Max Welling. Herding dynamical weights to learn. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1121–1128.
ACM, 2009.
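The proof sketch of Appendix A rests on two elementary inequalities: the convexity bound exp(−λ R_S(γ)) ≤ R_S(γ) exp(−λ) + (1 − R_S(γ)), and 1 − x ≤ exp(−x), which yields the linear lower bound F(R) ≥ (1 − exp(−λ)) R for the inverse of the moment generating function bound. As a minimal numerical sanity check (not part of the original derivation; the value λ = 1.5 is an arbitrary illustrative choice), both can be verified on a grid over [0, 1]:

```python
import math

# Numeric check of the two elementary inequalities behind Catoni's bound.
# lambda_ = 1.5 is an arbitrary illustrative choice.
lambda_ = 1.5

for i in range(101):
    r = i / 100.0  # empirical risk R_S(gamma) ranges over [0, 1]

    # Convexity of exp over R_S(gamma) = R_S(gamma)*1 + (1 - R_S(gamma))*0:
    # exp(-lambda * r) <= r * exp(-lambda) + (1 - r)
    mgf_bound = r * math.exp(-lambda_) + (1 - r)
    assert math.exp(-lambda_ * r) <= mgf_bound + 1e-12

    # Choosing F(R) as the inverse of the MGF bound and applying
    # 1 - x <= exp(-x) gives the linear lower bound
    # F(R) = -log(R*exp(-lambda) + 1 - R) >= (1 - exp(-lambda)) * R
    assert -math.log(mgf_bound) >= (1 - math.exp(-lambda_)) * r - 1e-12
```

Equality in the first bound holds at r = 0 and r = 1, which is why the bound is tight at the endpoints of the risk interval.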