{"title": "Multiple Choice Learning: Learning to Produce Multiple Structured Outputs", "book": "Advances in Neural Information Processing Systems", "page_first": 1799, "page_last": 1807, "abstract": "The paper addresses the problem of generating multiple hypotheses for prediction tasks that involve interaction with users or successive components in a cascade. Given a set of multiple hypotheses, such components/users have the ability to automatically rank the results and thus retrieve the best one. The standard approach for handling this scenario is to learn a single model and then produce M-best Maximum a Posteriori (MAP) hypotheses from this model. In contrast, we formulate this multiple {\\em choice} learning task as a multiple-output structured-output prediction problem with a loss function that captures the natural setup of the problem. We present a max-margin formulation  that minimizes an upper-bound on this loss-function. Experimental results on the problems of image co-segmentation and protein side-chain prediction show that our method outperforms conventional approaches used for this  scenario and leads to substantial improvements in prediction accuracy.", "full_text": "Multiple Choice Learning:\n\nLearning to Produce Multiple Structured Outputs\n\nAbner Guzman-Rivera\nUniversity of Illinois\n\naguzman5@illinois.edu\n\nDhruv Batra\nVirginia Tech\n\ndbatra@vt.edu\n\nPushmeet Kohli\n\nMicrosoft Research Cambridge\npkohli@microsoft.com\n\nAbstract\n\nWe address the problem of generating multiple hypotheses for structured predic-\ntion tasks that involve interaction with users or successive components in a cas-\ncaded architecture. Given a set of multiple hypotheses, such components/users\ntypically have the ability to retrieve the best (or approximately the best) solution\nin this set. The standard approach for handling such a scenario is to \ufb01rst learn\na single-output model and then produce M-Best Maximum a Posteriori (MAP)\nhypotheses from this model. 
In contrast, we learn to produce multiple outputs\nby formulating this task as a multiple-output structured-output prediction problem\nwith a loss-function that effectively captures the setup of the problem. We\npresent a max-margin formulation that minimizes an upper-bound on this loss-function.\nExperimental results on image segmentation and protein side-chain prediction\nshow that our method outperforms conventional approaches used for this\ntype of scenario and leads to substantial improvements in prediction accuracy.\n\n1 Introduction\nA number of problems in Computer Vision, Natural Language Processing and Computational\nBiology involve predictions over complex but structured interdependent outputs, also known as\nstructured-output prediction. Formulations such as Conditional Random Fields (CRFs) [18], Max-\nMargin Markov Networks (M$^3$N) [27], and Structured Support Vector Machines (SSVMs) [28] have\nprovided principled techniques for learning such models.\nIn all these (supervised) settings, the learning algorithm typically has access to input-output pairs\n$\\{(x_i, y_i) \\mid x_i \\in \\mathcal{X}, y_i \\in \\mathcal{Y}\\}$, and the goal is to learn a mapping from the input space to the output\nspace, $f : \\mathcal{X} \\to \\mathcal{Y}$, that minimizes a (regularized) task-dependent loss function $\\ell : \\mathcal{Y} \\times \\mathcal{Y} \\to \\mathbb{R}_+$,\nwhere $\\ell(y_i, \\hat{y}_i)$ denotes the cost of predicting $\\hat{y}_i$ when the correct label is $y_i$.\nNotice that the algorithm always makes a single prediction $\\hat{y}_i$ and pays a penalty $\\ell(y_i, \\hat{y}_i)$ for that\nprediction. However, in a number of settings, it might be bene\ufb01cial (even necessary) to make multiple\npredictions:\n\n1. Interactive Intelligent Systems. The goal of interactive machine-learning algorithms is to\nproduce an output for an expert or a user in the loop. 
Popular examples include tools for\ninteractive image segmentation (where the system produces a cutout of an object from a\npicture [5, 25]), systems for image processing/manipulation tasks such as image denoising\nand deblurring (e.g., Photoshop), or machine translation services (e.g., Google Translate).\nThese problems are typically modeled using structured probabilistic models and involve\ncomputing the Maximum a Posteriori (MAP) solution. In order to minimize user interactions,\nthe interface could show not just a single prediction but a small set of diverse\npredictions, and simply let the user pick the best one.\n\n2. Generating M-Best Hypotheses. Machine learning algorithms are often cascaded, with\nthe output of one model being fed into another. In such a setting, at the initial stages it is\nnot necessary to make the perfect prediction; rather, the goal is to make a set of plausible\npredictions, which may then be re-ranked or combined by a secondary mechanism. For\ninstance, in Computer Vision, this is the case for state-of-the-art methods for human-pose\nestimation which produce multiple predictions that are then re\ufb01ned by employing a temporal\nmodel [23, 3]. In Natural Language Processing, this is the case for sentence parsing [8]\nand machine translation [26], where an initial system produces a list of M-Best hypotheses\n[12, 24] (also called k-best lists in the NLP literature), which are then re-ranked.\n\nThe common principle in both scenarios is that we need to generate a set of plausible hypotheses\nfor an algorithm/expert downstream to evaluate. 
Traditionally, this is accomplished by learning a\nsingle-output model and then producing M-Best hypotheses from it (also called the M-Best MAP\nproblem [20, 11, 29] or the Diverse M-Best problem [3] in the context of graphical models).\nNotice that the single-output model is typically trained in the standard way, i.e., either to match the\ndata distribution (max-likelihood) or to score the ground-truth highest by a margin (max-margin).\nThus, there is a disparity between the way this model is trained and the way it is actually used. The\nkey motivating question for this paper is \u2013 can we learn to produce a set of plausible hypotheses?\nWe refer to such a setting as Multiple Choice Learning (MCL) because the learner must learn to\nproduce multiple choices for an expert or other algorithm.\nOverview. This paper presents an algorithm for MCL, formulated as multiple-output structured-\noutput learning, where given an input sample $x_i$ the algorithm produces a set of M hypotheses\n$\\{\\hat{y}_i^1, \\ldots, \\hat{y}_i^M\\}$. We \ufb01rst present a meaningful loss function for this task that effectively captures the\nsetup of the problem. Next, we present a max-margin formulation for training this M-tuple predictor\nthat minimizes an upper-bound on the loss-function. Despite the popularity of M-Best approaches,\nto the best of our knowledge, this is the \ufb01rst attempt to directly model the M-Best prediction problem.\nOur approach has natural connections to SSVMs with latent variables, and resembles a structured-\noutput version of k-means clustering. 
Experimental results on the problems of image segmentation\nand protein side-chain prediction show that our method outperforms conventional M-Best prediction\napproaches used for this scenario and leads to substantial improvements in prediction accuracy.\nThe outline for the rest of this paper is as follows: Section 2 provides the notation and discusses\nclassical (single-output) structured-output learning; Section 3 introduces the natural task loss for\nmultiple-output prediction and presents our learning algorithm; Section 4 discusses related work;\nSection 5 compares our algorithm to other approaches experimentally; and we conclude in Section\n6 with a summary and ideas for future work.\n\n2 Preliminaries: (Single-Output) Structured-Output Prediction\n\nWe begin by reviewing classical (single-output) structured-output prediction and establishing the\nnotation used in the paper.\nNotation. For any positive integer n, let [n] be shorthand for the set {1, 2, . . . , n}. Given a training\ndataset of input-output pairs $\\{(x_i, y_i) \\mid i \\in [n], x_i \\in \\mathcal{X}, y_i \\in \\mathcal{Y}\\}$, we are interested in learning a\nmapping $f : \\mathcal{X} \\to \\mathcal{Y}$ from an input space $\\mathcal{X}$ to a structured output space $\\mathcal{Y}$ that is \ufb01nite but typically\nexponentially large (e.g., the set of all segmentations of an image, or all English translations of a\nChinese sentence).\nStructured Support Vector Machines (SSVMs). In an SSVM setting, the mapping is de\ufb01ned as\n$f(x) = \\operatorname{argmax}_{y \\in \\mathcal{Y}} w^T \\phi(x, y)$, where $\\phi(x, y)$ is a joint feature map: $\\phi : \\mathcal{X} \\times \\mathcal{Y} \\to \\mathbb{R}^d$. The\nquality of the prediction $\\hat{y}_i = f(x_i)$ is measured by a task-speci\ufb01c loss function $\\ell : \\mathcal{Y} \\times \\mathcal{Y} \\to \\mathbb{R}_+$,\nwhere $\\ell(y_i, \\hat{y}_i)$ denotes the cost of predicting $\\hat{y}_i$ when the correct label is $y_i$. 
Some examples\nof loss functions are the intersection/union criteria used by the PASCAL Visual Object Category\nSegmentation Challenge [10], and the BLEU score used to evaluate machine translations [22].\nThe task-loss is typically non-convex and non-continuous in w. Tsochantaridis et al. [28] proposed\nto optimize a regularized surrogate loss function:\n\n$$\\min_{w} \\frac{1}{2} \\|w\\|_2^2 + C \\sum_{i \\in [n]} \\mathcal{H}_i(w) \\qquad (1)$$\n\nwhere C is a positive multiplier and $\\mathcal{H}_i(\\cdot)$ is the structured hinge-loss:\n\n$$\\mathcal{H}_i(w) = \\max_{y} \\left( \\ell(y_i, y) + w^T \\phi(x_i, y) \\right) - w^T \\phi(x_i, y_i). \\qquad (2)$$\n\nIt can be shown [28] that the hinge-loss is an upper-bound on the task loss, i.e., $\\mathcal{H}_i(w) \\geq \\ell(y_i, f(x_i))$.\nMoreover, $\\mathcal{H}_i(w)$ is a non-smooth convex function, and can be equivalently expressed with a set of\nconstraints:\n\n$$\\min_{w, \\xi_i} \\frac{1}{2} \\|w\\|_2^2 + C \\sum_{i \\in [n]} \\xi_i \\qquad (3a)$$\n$$\\text{s.t.} \\quad w^T \\phi(x_i, y_i) - w^T \\phi(x_i, y) \\geq \\ell(y_i, y) - \\xi_i \\quad \\forall y \\in \\mathcal{Y} \\setminus y_i \\qquad (3b)$$\n$$\\xi_i \\geq 0 \\qquad (3c)$$\n\nThis formulation is known as the margin-rescaled n-slack SSVM [28]. Intuitively, we can see that it\nminimizes the squared-norm of w subject to constraints that enforce a soft-margin between the score\nof the ground-truth $y_i$ and the score of all other predictions. The above problem (3) is a Quadratic\nProgram (QP) with $n|\\mathcal{Y}|$ constraints, which is typically exponentially large. If an ef\ufb01cient separation\noracle for identifying the most violated constraint is available, then a cutting-plane approach\ncan be used to solve the QP. A cutting-plane algorithm maintains a working set of constraints and\nincrementally adds the most violated constraint to this working set while solving for the optimum\nsolution under the working set. Tsochantaridis et al. 
[28] showed that such a procedure converges\nin a polynomial number of steps.\n3 Multiple-Output Structured-Output Prediction\nWe now describe our proposed formulation for multiple-output structured-output prediction.\nModel. Our model is a generalization of the single-output SSVM. A multiple-output SSVM is a\nmapping from the input space $\\mathcal{X}$ to an M-tuple$^1$ of structured outputs $Y_i = \\{\\hat{y}_i^1, \\ldots, \\hat{y}_i^M \\mid \\hat{y}_i \\in \\mathcal{Y}\\}$,\ngiven by $g : \\mathcal{X} \\to \\mathcal{Y}^M$, where $g(x) = \\operatorname{argmax}_{Y \\in \\mathcal{Y}^M} W^T \\Phi(x, Y)$. Notice that the joint feature\nmap is now a function of the input and the entire set of predicted structured-outputs, i.e.,\n$\\Phi : \\mathcal{X} \\times \\mathcal{Y}^M \\to \\mathbb{R}^d$. Without further assumptions, optimizing over the output space of size $|\\mathcal{Y}|^M$\nwould be intractable. We make a mean-\ufb01eld-like simplifying assumption that the set score factors\ninto independent predictor scores, i.e., $\\Phi(x_i, Y) = [\\phi^1(x_i, y^1)^T, \\ldots, \\phi^M(x_i, y^M)^T]^T$. Thus,\ng is composed of M single-output predictors: $g(x) = \\left(f^1(x), \\ldots, f^M(x)\\right)$, where $f^m(x) =\n\\operatorname{argmax}_{y \\in \\mathcal{Y}} w_m^T \\phi^m(x, y)$. Hence, the multiple-output SSVM is parameterized by an M-tuple of\nweight vectors: $W = [w_1^T, \\ldots, w_M^T]^T$.\n3.1 Multiple-Output Loss\nLet $\\hat{Y}_i = \\{\\hat{y}_i^1, \\ldots, \\hat{y}_i^M\\}$ be the set of predicted outputs for input $x_i$, i.e., $\\hat{y}_i^m = f^m(x_i)$. In the single-\noutput SSVM, there typically exists a ground-truth output $y_i$ for each datapoint, and the quality of\n$\\hat{y}_i$ w.r.t. $y_i$ is given by $\\ell(y_i, \\hat{y}_i)$.\nHow good is a set of outputs? For our multiple-output predictor, we need to de\ufb01ne a task-speci\ufb01c\nloss function that can measure the quality of any set of predictions $\\hat{Y}_i \\in \\mathcal{Y}^M$. Ideally, the quality of\nthese predictions should be evaluated by the secondary mechanism that uses these predictions. 
For\ninstance, in an interactive setting where they are shown to a user, the quality of \u02c6Yi could be measured\nby how much it reduces the user-interaction time. In the M-best hypotheses re-ranking scenario, the\naccuracy of the top single output after re-ranking could be used as the quality measure for \u02c6Yi. While\nmultiple options exist, in order to provide a general formulation and to isolate our approach, we\npropose the \u201coracle\u201d or \u201chindsight\u201d set-loss as a surrogate:\n\n$$L(\\hat{Y}_i) = \\min_{\\hat{y}_i \\in \\hat{Y}_i} \\ell(y_i, \\hat{y}_i) \\qquad (4)$$\n\ni.e., the set of predictions \u02c6Yi only pays a loss for the most accurate prediction contained in this set\n(e.g., the best segmentation of an image, or the best translation of a sentence). This loss has the\ndesirable behaviour that predicting a set that contains even a single accurate output is better than\npredicting a set that has none. Moreover, only being penalized for the most accurate prediction\nallows an ensemble to hedge its bets without having to pay for being too diverse (this is opposite\nto the effect that replacing min with max or avg. would have). However, this also makes the set-\nloss rather poorly conditioned \u2013 if even a single prediction in the ensemble is the ground-truth, the\nset-loss is 0, no matter what else is predicted.\n\n$^1$ Our formulation is described with a nominal ordering of the predictions. However, both the proposed\nobjective function and optimization algorithm are invariant to permutations of this ordering.\n\nHinge-like Upper-Bound. The set-loss $L(\\hat{Y}_i(W))$ is a non-continuous non-convex function of W\nand is thus dif\ufb01cult to optimize. If unique ground-truth sets $Y_i$ were available, we could set up a\nstandard hinge-loss approximation:\n\n$$H_i(W) = \\max_{Y \\in \\mathcal{Y}^M} \\left( L(Y) + W^T \\Phi(x_i, Y) \\right) - W^T \\Phi(x_i, Y_i) \\qquad (5)$$\n\nwhere \u03a6(xi, Y ) = [ \u03c61(xi, y1)T , . . . 
, \u03c6M (xi, yM )T ]T are stacked joint feature maps.\nHowever, no such natural choice for Yi exists. We propose a hinge-like upper-bound on the set-loss,\nwhich we refer to as min-hinge:\n\n$$\\tilde{H}_i(W) = \\min_{m \\in [M]} \\mathcal{H}_i(w_m), \\qquad (6)$$\n\ni.e., we take the min over the hinge-losses (2) corresponding to each of the M predictors. Since\neach hinge-loss is an upper-bound on the corresponding task-loss, i.e., $\\mathcal{H}_i(w_m) \\geq \\ell(y_i, f^m(x_i))$, it\nis straightforward to see that the min-hinge is an upper-bound on the set-loss, i.e., $\\tilde{H}_i(W) \\geq L(\\hat{Y}_i)$.\nNotice that min-hinge is a min of convex functions, and thus not guaranteed to be convex.\n3.2 Coordinate Descent for Learning Multiple Predictors\nWe now present our algorithm for learning a multiple-output SSVM by minimizing the regularized\nmin-hinge loss:\n\n$$\\min_{W} \\frac{1}{2} \\|W\\|_2^2 + C \\sum_{i \\in [n]} \\tilde{H}_i(W) \\qquad (7)$$\n\nWe begin by rewriting the min-hinge loss in terms of indicator \u201c\ufb02ag\u201d variables, i.e.,\n\n$$\\min_{W, \\{\\rho_{i,m}\\}} \\frac{1}{2} \\|W\\|_2^2 + C \\sum_{i \\in [n]} \\sum_{m \\in [M]} \\rho_{i,m} \\mathcal{H}_i(w_m) \\qquad (8a)$$\n$$\\text{s.t.} \\quad \\sum_{m \\in [M]} \\rho_{i,m} = 1 \\quad \\forall i \\in [n] \\qquad (8b)$$\n$$\\rho_{i,m} \\in \\{0, 1\\} \\quad \\forall i \\in [n], m \\in [M] \\qquad (8c)$$\n\nwhere $\\rho_{i,m}$ is a \ufb02ag variable that indicates which predictor produces the smallest hinge-loss.\nOptimization problem (8) is a mixed-integer quadratic programming problem (MIQP), which is NP-hard\nin general. However, we can exploit the structure of the problem via a block-coordinate descent\nalgorithm where W and $\\{\\rho_{i,m}\\}$ are optimized iteratively:\n\n1. Fix W; Optimize all $\\{\\rho_{i,m}\\}$.\n\nGiven W, the optimization over $\\{\\rho_{i,m}\\}$ reduces to the minimization of\n$\\sum_{i \\in [n]} \\sum_{m \\in [M]} \\rho_{i,m} \\mathcal{H}_i(w_m)$ subject to the \u201cpick-one-predictor\u201d constraints (8b, 8c). 
This decomposes into n independent problems, which simply identify the best\npredictor for each datapoint according to the current hinge-losses, i.e.:\n\n$$\\rho_{i,m} = \\begin{cases} 1 & \\text{if } m = \\operatorname{argmin}_{m' \\in [M]} \\mathcal{H}_i(w_{m'}) \\\\ 0 & \\text{else.} \\end{cases} \\qquad (9)$$\n\n2. Fix $\\{\\rho_{i,m}\\}$; Optimize W.\n\nGiven $\\{\\rho_{i,m}\\}$, optimization over W decomposes into M independent problems, one for\neach predictor, which are equivalent to single-output SSVM learning problems:\n\n$$\\min_{W} \\frac{1}{2} \\|W\\|_2^2 + C \\sum_{i \\in [n]} \\sum_{m \\in [M]} \\rho_{i,m} \\mathcal{H}_i(w_m) \\qquad (10a)$$\n$$= \\min_{W} \\sum_{m \\in [M]} \\left\\{ \\frac{1}{2} \\|w_m\\|_2^2 + C \\sum_{i \\in [n]} \\rho_{i,m} \\mathcal{H}_i(w_m) \\right\\} \\qquad (10b)$$\n$$= \\sum_{m \\in [M]} \\min_{w_m} \\left\\{ \\frac{1}{2} \\|w_m\\|_2^2 + C \\sum_{i : \\rho_{i,m} \\neq 0} \\mathcal{H}_i(w_m) \\right\\} \\qquad (10c)$$\n\nThus, each subproblem in (10c) can be optimized using any standard technique for\ntraining SSVMs. We use the 1-slack algorithm of [14].\n\nConvergence. Overall, the block-coordinate descent algorithm above iteratively assigns each datapoint\nto a particular predictor (Step 1) and then independently trains each predictor with just the\npoints that were assigned to it (Step 2). This is fairly reminiscent of k-means, where step 1 can be\nthought of as the member re-assignment step (or the E-step in EM) and step 2 can be thought of\nas the cluster-\ufb01tting step (or the M-step in EM). Since the \ufb02ag variables take on discrete values and\nthe objective function is non-increasing with iterations, the algorithm is guaranteed to converge in a\n\ufb01nite number of steps.\nGeneralization. 
Formulation (8) can be generalized by replacing the \u201cpick-one-predictor\u201d constraint\nwith \u201cpick-K-predictors\u201d, i.e., $\\sum_{m \\in [M]} \\rho_{i,m} = K$, where K is a robustness parameter that\nallows training data overlap between predictors. The E-step (cluster reassignment) is still simple,\nand involves assigning a data-point to the top K best predictors. The M-step is unchanged. Notice that\nat K = M, all predictors learn the same mapping. We analyze the effect of K in our experiments.\n4 Related Work\nAt \ufb01rst glance, our work seems related to the multi-label classi\ufb01cation literature, where the goal is\nto predict multiple labels for each input instance (e.g., text tags for images on Flickr). However, the\nmotivation and context of our work are fundamentally different. Speci\ufb01cally, in multi-label classi\ufb01cation\nthere are multiple possible labels for each instance and the goal is to predict as many of them\nas possible. On the other hand, in our setting there is a single ground-truth label for each instance\nand the learner makes multiple guesses, all of which are evaluated against that single ground-truth.\nFor the unstructured setting (i.e., when $|\\mathcal{Y}|$ is polynomial), Dey et al. [9] proposed an algorithm that\nlearns a multi-class classi\ufb01er for each \u201cslot\u201d in an M-Best list, and provided a formal regret reduction\nfrom submodular sequence optimization.\nTo the best of our knowledge, the only other work that explicitly addresses the task of predicting\nmultiple structured outputs is multi-label structured prediction (MLSP) [19]. This work may be seen\nas a technique to output predictions in the power-set of $\\mathcal{Y}$ ($2^{\\mathcal{Y}}$) with a learning cost comparable to\nalgorithms for prediction over $\\mathcal{Y}$. Most critically, MLSP requires gold-standard sets of labels (one\nset for each training example). 
In contrast, MCL neither needs nor has access to gold-standard sets.\nAt a high level, MCL and MLSP are orthogonal approaches, e.g., we could introduce MLSP within\nMCL to create an algorithm that predicts multiple (diverse) sets of structured-outputs (e.g., multiple\nguesses by the algorithm where each guess is a set of bounding boxes of objects in an image).\nA form of min-set-loss has received some attention in the context of ambiguously or incompletely\nannotated data. For instance, [4] trains an SSVM for object detection implicitly de\ufb01ning a task-\nadapted loss, $L_{\\min}(Y, \\hat{y}) = \\min_{y \\in Y} \\ell(y, \\hat{y})$. Note that in this case there is a set of ground-truth\nlabels and the model\u2019s prediction is a single label (evaluated against the closest ground-truth).\nOur formulation is also reminiscent of a Latent-SSVM with the indicator \ufb02ags $\\{\\rho_{i,m} \\mid m \\in [M]\\}$\ntaking a role similar to latent variables. However, the two play very different roles. Latent variable\nmodels typically maximize or marginalize the model score across the latent variables, while MCL\nuses the \ufb02ag variables as a representation of the oracle loss.\nAt a high level, our ideas are also related to ensemble methods [21] like boosting. However, the key\ndifference is that ensemble methods attempt to combine outputs from multiple weak predictors to\nultimately make a single prediction. We are interested in making multiple predictions which will all\nbe handed to an expert or secondary mechanism that has access to more complex (e.g., higher-order)\nfeatures.\n\nFigure 1: Each row shows: (a) the input image, (b) the ground-truth segmentation (GT), and (c-h) the set of predictions\nproduced by MCL (M = 6). Red border indicates the most accurate segmentation (i.e., lowest error). 
We can\nsee that the predictors produce different plausible foreground hypotheses, e.g., predictor (g) thinks foliage-like\nthings are foreground. Errors reported under the predictions: (c) 3.44%, (d) 40.79%, (e) 14.78%, (f) 26.98%, (g) 53.78%, (h) 19.54%\nfor one row, and (c) 16.59%, (d) 2.12%, (e) 11.54%, (f) 11.34%, (g) 77.91%, (h) 65.34% for the other.\n\n5 Experiments\nSetup. We tested algorithm MCL on two problems:\ni) foreground-background segmentation in\nimage collections and ii) protein side-chain prediction. In both problems making a single perfect\nprediction is dif\ufb01cult due to inherent ambiguity in the tasks. Moreover, inference-time computing\nlimitations force us to learn restricted models (e.g., pairwise attractive CRFs) that may never be able\nto capture the true solution with a single prediction. The goal of our experiments is to study how\nmuch predicting a set of plausible hypotheses helps. Our experiments will show that MCL is able\nto produce sets of hypotheses which contain more accurate predictions than other algorithms and\nbaselines aimed at producing multiple hypotheses.\n5.1 Foreground-Background Segmentation\nDataset. We used the co-segmentation dataset, iCoseg, of Batra et al. [2]. iCoseg consists of 37\ngroups of related images mimicking typical consumer photograph collections. Each group may be\nthought of as an \u201cevent\u201d (e.g., images from a baseball game, a safari, etc.). The dataset provides\npixel-level ground-truth foreground-background segmentations for each image. We used 9 dif\ufb01cult\ngroups from iCoseg containing 166 images in total. These images were then split into train, validation\nand test sets of roughly equal size. See Fig. 1, 2 for some example images and segmentations.\nModel and Features. The segmentation task is modeled as a binary pairwise MRF where each\nnode corresponds to a superpixel [1] in the image. 
We extracted 12-dim color features at each superpixel\n(mean RGB; mean HSV; 5 bin Hue histogram; Hue histogram entropy). 
The edge features,\ncomputed for each pair of adjacent superpixels, correspond to a standard Potts model and a contrast\nsensitive Potts model. The weights at each edge were constrained to be positive so that the resulting\nsupermodular potentials could be maximized via graph-cuts [6, 17].\nBaselines and Evaluation. We compare our algorithm against three alternatives for producing\nmultiple predictions: i) Single SSVM + M-Best MAP [29], ii) Single SSVM + Diverse M-Best\nMAP [3] and iii) Clustering + Multiple SSVMs.\nFor the \ufb01rst two baselines, we used all training images to learn a single SSVM and then produced\nmultiple segmentations via M-Best MAP and Diverse M-Best MAP [3]. The M-Best MAP baseline\nwas implemented via the BMMF algorithm [29] using dynamic graph-cuts [15] for computing max-\nmarginals ef\ufb01ciently. For the Diverse M-Best MAP baseline we implemented the DIVMBEST\nalgorithm of Batra et al. [3] using dynamic graph-cuts. The third baseline, Clustering + Multiple\nSSVM (C-SSVM), involves \ufb01rst clustering the training images into M clusters and then training M\nSSVMs independently, one on each cluster. For clustering, we used k-means with $\\ell_2$ distance on color\nfeatures (same as above) computed on foreground pixels.\nFor each algorithm we varied the number of predictors M \u2208 {1, 2, . . . , 6} and tuned the regularization\nparameter C on validation. Since MCL involves non-convex optimization, a good initialization\nis important. We used the output of k-means clustering as the initial assignment of images to predictors,\nso MCL\u2019s \ufb01rst coordinate descent iteration produces the same results as C-SSVM. 
The task-loss\nin this experiment ($\\ell$) is the percentage of incorrectly labeled pixels, and the evaluation metric is the\nset-loss, $L = \\min_{\\hat{y}_i \\in \\hat{Y}_i} \\ell(y_i, \\hat{y}_i)$, i.e., the pixel error of the best segmentation among all predictions.\n\nFigure 2: In each column: \ufb01rst row shows input images; second shows ground-truth; third shows segmentation\nproduced by the single SSVM baseline; and the last two rows show the best MCL predictions (M = 6) at the\nend of the \ufb01rst and last coordinate descent iteration. Errors shown under the three rows of predictions, for columns (a)-(g):\n45.91%, 76.84%, 8.30%, 40.55%, 30.01%, 29.16%, 19.54%; 37.00%, 36.06%, 9.14%, 28.54%, 17.43%, 26.91%, 11.09%;\nand 14.70%, 1.17%, 5.69%, 5.86%, 1.18%, 13.32%, 3.44%.\n\nComparison against Baselines. Fig. 3a shows the performance of various algorithms as a function of\nthe number of predictors M. We observed that M-Best MAP produces nearly identical predictions\nand thus the error drops negligibly as M is increased. On the other hand, the diverse M-Best\npredictions output by DIVMBEST [3] lead to a substantial drop in the set-loss. MCL outperforms\nboth DIVMBEST and C-SSVM, con\ufb01rming our hypothesis that it is bene\ufb01cial to learn a collection\nof predictors, rather than learning a single predictor and making diverse predictions from it.\nBehaviour of Coordinate Descent. Fig. 3b shows the MCL objective and train/test errors as a\nfunction of the coordinate descent steps. We verify that the objective function is improved at every\niteration and observe a positive correlation between the objective and the train/test errors.\nEffect of C. Fig. 3c compares performance for different values of regularization parameter C. We\nobserve a fairly stable trend with MCL consistently outperforming baselines.\nEffect of K. Fig. 3d shows the performance of MCL as robustness parameter K is increased from 1\nto M. 
We observe a monotonic reduction in error as K decreases, which suggests there is a natural\nclustering of the data and thus learning a single SSVM is detrimental.\nQualitative Results. Fig. 1 shows example images, ground-truth segmentations, and the predictions\nmade by M = 6 predictors. We observe that the M hypotheses are both diverse and plausible. The\nevolution of the best prediction with coordinate descent iterations can be seen in Fig. 2.\n\nFigure 3: Experiments on foreground-background segmentation. Panels: (a) Error vs. M; (b) Error vs. Iterations; (c) Error vs. C; (d) Error vs. K.\n\n5.2 Protein Side-Chain Prediction\nModel and Dataset. Given a protein backbone structure, the task here is to predict the amino\nacid side-chain con\ufb01gurations. This problem has been traditionally formulated as a pairwise MRF\nwith node labels corresponding to (discretized) side-chain con\ufb01gurations (rotamers). These models\ninclude pairwise interactions between nearby side-chains, and between side-chains and backbone.\nWe use the dataset of [7] which consists of 276 proteins (up to 700 residues long) split into train and\ntest sets of sizes 55 and 221 respectively.$^2$ The energy function is de\ufb01ned as a weighted sum of eight\nknown energy terms where the weights are to be learned. We used TRW-S [16] (early iterations)\nand ILP (CPLEX [13]) for inference.\n\n$^2$ Dataset available from: http://cyanover.fhcrc.org/recomb-2007/\n\nFigure 4: Experiments on protein side-chain prediction. Panels: (a) Error vs. M; (b) Error vs. Iterations; (c) Error vs. C.\n\nBaselines and Evaluation. For this application there is no natural analogue to the C-SSVM baseline\nand thus we used a boosting-like baseline where we \ufb01rst train an SSVM on the entire training data,\nthen use the training instances with high error to train a second SSVM, and so on. For comparison,\nwe also report results from the CRF and HCRF models proposed in [7]. 
Following [7], we report\naverage error rates for the \ufb01rst two angles ($\\chi_1$ and $\\chi_2$) on all test proteins.\nResults. Fig. 4 shows the results. Overall, we observe behavior similar to the previous set of experiments.\nFig. 4a con\ufb01rms that multiple predictors are bene\ufb01cial, and that MCL is able to outperform\nthe boosting-like baseline. Fig. 4b shows the progress of the MCL objective and test loss with coordinate\ndescent iterations; we again observe a positive correlation between the objective and the loss.\nFig. 4c shows that MCL outperforms baselines across a range of values of C.\n6 Discussion and Conclusions\n\nWe presented an algorithm for producing a set of structured outputs and argued that in a number\nof problems it is bene\ufb01cial to generate a set of plausible and diverse hypotheses. Typically, this\nis accomplished by learning a single-output model and then producing M-best hypotheses from it.\nThis causes a disparity between the way the model is trained (to produce a single output) and the\nway it is used (to produce multiple outputs). Our proposed algorithm (MCL) provides a principled\nway to directly optimize the multiple prediction min-set-loss.\nThere are a number of directions to extend this work. While we evaluated performance of all algorithms\nin terms of oracle set-loss, it would be interesting to measure the impact of MCL and other\nbaselines on user experience or \ufb01nal stage performance in cascaded algorithms.\nFurther, our model assumes a modular scoring function $S(Y) = W^T \\Phi(x, Y) =\n\\sum_{m \\in [M]} w_m^T \\phi^m(x, y^m)$, i.e., the score of a set is the sum of the scores of its members. In a number\nof situations, the score S(Y ) might be a submodular function. Such scoring functions often arise\nwhen we want the model to explicitly reward diverse subsets. 
We plan to make connections with\ngreedy-algorithms for submodular maximization for such cases.\n\nAcknowledgments: We thank David Sontag for his assistance with the protein data. AGR was\nsupported by the C2S2 Focus Center (under the SRC\u2019s Focus Center Research Program).\nReferences\n[1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. S\u00fcsstrunk. SLIC Superpixels Compared to\n\nState-of-the-art Superpixel Methods. PAMI, (To Appear) 2012. 6\n\n[2] D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. Chen. iCoseg: Interactive Co-segmentation with Intelligent\n\nScribble Guidance. In CVPR, 2010. 6\n\n[3] D. Batra, P. Yadollahpour, A. Guzman-Rivera, and G. Shakhnarovich. Diverse M-Best Solutions in\n\nMarkov Random Fields. In ECCV, 2012. 2, 6, 7\n\n[4] M. B. Blaschko and C. H. Lampert. Learning to Localize Objects with Structured Output Regression. In\n\nECCV, 2008. 5\n\n[5] Y. Boykov and M.-P. Jolly. Interactive Graph Cuts for Optimal Boundary and Region Segmentation of\n\nObjects in N-D Images. ICCV, 2001. 1\n\n[6] Y. Boykov, O. Veksler, and R. Zabih. Ef\ufb01cient Approximate Energy Minimization via Graph Cuts. PAMI,\n\n20(12):1222\u20131239, 2001. 6\n\n[7] C. Yanover, O. Schueler-Furman, and Y. Weiss. 
Minimizing and Learning Energy Functions for Side-Chain Prediction. Journal of Computational Biology, 15(7):899\u2013911, 2008. 7, 8\n[8] M. Collins. Discriminative Reranking for Natural Language Parsing. In ICML, pages 175\u2013182, 2000. 2\n[9] D. Dey, T. Y. Liu, M. Hebert, and J. A. Bagnell. Contextual sequence prediction with application to control library optimization. In Robotics: Science and Systems, 2012. 5\n[10] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2011 (VOC2011) Results. http://www.pascal-network.org/challenges/VOC/voc2011/workshop/index.html. 2\n[11] M. Fromer and A. Globerson. An LP View of the M-best MAP problem. In NIPS, 2009. 2\n[12] L. Huang and D. Chiang. Better K-best parsing. In Proceedings of the Ninth International Workshop on Parsing Technology (IWPT), pages 53\u201364, 2005. 2\n[13] IBM Corporation. IBM ILOG CPLEX Optimization Studio. http://www-01.ibm.com/software/integration/optimization/cplex-optimization-studio/, 2012. 8\n[14] T. Joachims, T. Finley, and C.-N. Yu. Cutting-Plane Training of Structural SVMs. Machine Learning, 77(1):27\u201359, 2009. 5\n[15] P. Kohli and P. H. S. Torr. Measuring Uncertainty in Graph Cut Solutions. CVIU, 112(1):30\u201338, 2008. 6\n[16] V. Kolmogorov. Convergent Tree-Reweighted Message Passing for Energy Minimization. PAMI, 28(10):1568\u20131583, 2006. 8\n[17] V. Kolmogorov and R. Zabih. What Energy Functions can be Minimized via Graph Cuts? PAMI, 26(2):147\u2013159, 2004. 6\n[18] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In ICML, 2001. 1\n[19] C. H. Lampert. Maximum Margin Multi-Label Structured Prediction. In NIPS, 2011. 5\n[20] E. L. Lawler. A Procedure for Computing the K Best Solutions to Discrete Optimization Problems and Its Application to the Shortest Path Problem. 
Management Science, 18:401\u2013405, 1972. 2\n[21] D. W. Opitz and R. Maclin. Popular Ensemble Methods: An Empirical Study. J. Artif. Intell. Res. (JAIR), 11:169\u2013198, 1999. 5\n[22] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a Method for Automatic Evaluation of Machine Translation. In ACL, 2002. 2\n[23] D. Park and D. Ramanan. N-Best Maximal Decoders for Part Models. In ICCV, 2011. 2\n[24] A. Pauls, D. Klein, and C. Quirk. Top-Down K-Best A* Parsing. In ACL, 2010. 2\n[25] C. Rother, V. Kolmogorov, and A. Blake. \u201cGrabCut\u201d \u2013 Interactive Foreground Extraction using Iterated Graph Cuts. SIGGRAPH, 2004. 1\n[26] L. Shen, A. Sarkar, and F. J. Och. Discriminative Reranking for Machine Translation. In HLT-NAACL, pages 177\u2013184, 2004. 2\n[27] B. Taskar, C. Guestrin, and D. Koller. Max-Margin Markov Networks. In NIPS, 2003. 1\n[28] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large Margin Methods for Structured and Interdependent Output Variables. JMLR, 6:1453\u20131484, 2005. 1, 2, 3\n[29] C. Yanover and Y. Weiss. Finding the M Most Probable Configurations Using Loopy Belief Propagation. In NIPS, 2003. 2, 6\n", "award": [], "sourceid": 891, "authors": [{"given_name": "Abner", "family_name": "Guzm\u00e1n-rivera", "institution": null}, {"given_name": "Dhruv", "family_name": "Batra", "institution": null}, {"given_name": "Pushmeet", "family_name": "Kohli", "institution": null}]}