{"title": "Adversarial Prediction Games for Multivariate Losses", "book": "Advances in Neural Information Processing Systems", "page_first": 2728, "page_last": 2736, "abstract": "Multivariate loss functions are used to assess performance in many modern prediction tasks, including information retrieval and ranking applications. Convex approximations are typically optimized in their place to avoid NP-hard empirical risk minimization problems. We propose to approximate the training data instead of the loss function by posing multivariate prediction as an adversarial game between a loss-minimizing prediction player and a loss-maximizing evaluation player constrained to match specified properties of training data. This avoids the non-convexity of empirical risk minimization, but game sizes are exponential in the number of predicted variables. We overcome this intractability using the double oracle constraint generation method. We demonstrate the efficiency and predictive performance of our approach on tasks evaluated using the precision at k, the F-score and the discounted cumulative gain.", "full_text": "Adversarial Prediction Games for Multivariate Losses\n\nHong Wang\n\n{hwang27, wxing3, kasif2, bziebart}@uic.edu\n\nChicago, IL 60607\n\nKaiser Asif\n\nWei Xing\nDepartment of Computer Science\nUniversity of Illinois at Chicago\n\nBrian D. Ziebart\n\nAbstract\n\nMultivariate loss functions are used to assess performance in many modern pre-\ndiction tasks, including information retrieval and ranking applications. Convex\napproximations are typically optimized in their place to avoid NP-hard empir-\nical risk minimization problems. We propose to approximate the training data\ninstead of the loss function by posing multivariate prediction as an adversarial\ngame between a loss-minimizing prediction player and a loss-maximizing evalua-\ntion player constrained to match speci\ufb01ed properties of training data. 
This avoids the non-convexity of empirical risk minimization, but game sizes are exponential in the number of predicted variables. We overcome this intractability using the double oracle constraint generation method. We demonstrate the efficiency and predictive performance of our approach on tasks evaluated using the precision at k, the F-score and the discounted cumulative gain.

1 Introduction

For many problems in information retrieval and learning to rank, the performance of a predictor is evaluated based on the combination of predictions it makes for multiple variables. Examples include the precision when limited to k positive predictions (P@k), the harmonic mean of precision and recall (F-score), and the discounted cumulative gain (DCG) for assessing ranking quality. These stand in contrast to measures like the accuracy and (log) likelihood, which are additive over independently predicted variables. Many multivariate performance measures are not concave functions of predictor parameters, so maximizing them over empirical training data (or, equivalently, empirical risk minimization over a corresponding non-convex multivariate loss function) is computationally intractable [11] and can only be accomplished approximately using local optimization methods [10]. Instead, convex surrogates for the empirical risk are optimized using either an additive [21, 12, 22] or a multivariate approximation [14, 24] of the loss function. For both types of approximations, the gap between the application performance measure and the surrogate loss measure can lead to substantial sub-optimality of the resulting predictions [4].

Rather than optimizing an approximation of the multivariate loss for available training data, we take an alternate approach [26, 9, 1] that robustly minimizes the exact multivariate loss function using approximations of the training data. 
We formalize this using a zero-sum game between a predictor player and an adversarial evaluator player. Learned weights parameterize this game's payoffs and enable generalization from training data to new predictive settings. The key computational challenge this approach poses is that the size of multivariate prediction games grows exponentially in the number of variables. We leverage constraint generation methods developed for solving large zero-sum games [20] and efficient methods for computing best responses [6] to tame this complexity. In many cases, the structure of the multivariate loss function enables the zero-sum game's Nash equilibrium to be efficiently computed. We formulate parameter estimation as a convex optimization problem and solve it using standard convex optimization methods. We demonstrate the benefits of this approach on prediction tasks with P@k, F-score and DCG multivariate evaluation measures.

2 Background and Related Work

2.1 Notation and multivariate performance functions

We consider the general task of making a multivariate prediction for variables y = {y1, y2, ..., yn} ∈ Y^n (with random variables denoted as Y = {Y1, Y2, ..., Yn}) given some contextual information x = {x1, x2, ..., xn} ∈ X (with random variables X = {X1, X2, ..., Xn}). Each xi is the information relevant to predicted variable yi. We denote the estimator's predicted values as ŷ = {ŷ1, ŷ2, ..., ŷn}. The multivariate performance measure when predicting ŷ when the true multivariate value is actually y is represented as a scoring function: score(ŷ, y). 
Equivalently, a complementary loss function for any score function can be defined based on the maximal score as:

loss(ŷ, y) = max_{y′,y″} score(y′, y″) − score(ŷ, y).

For information retrieval, a vector of retrieved items from the pool of n items can be represented as ŷ ∈ {0,1}^n and a vector of relevant items as y ∈ {0,1}^n, with x = {x1, x2, ..., xn} denoting side contextual information (e.g., search terms and document contents). Precision and recall are important measures for information retrieval systems. However, maximizing either leads to degenerate solutions (predict all to maximize recall or predict none to maximize precision). The precision when limited to exactly k positive predictions, P@k(ŷ, y) = (ŷ · y)/k where ||ŷ||1 = k, is one popular multivariate performance measure that avoids these extremes. Another is the F-score, which is the harmonic mean of the precision and recall, often used in information retrieval tasks. Using this notation, the F-score for a set of items can be simply represented as:

F1(ŷ, y) = 2 ŷ · y / (||ŷ||1 + ||y||1), with F1(0, 0) = 1.

In other information retrieval tasks, a ranked list of retrieved items is desired. This can be represented as a permutation, σ, where σ(i) denotes the ith-ranked item (and σ⁻¹(j) denotes the rank of the jth item). Evaluation measures that emphasize the top-ranked items are used, e.g., to produce search engine results attuned to actual usage. The discounted cumulative gain (DCG) measures the performance of item rankings with k relevancy scores, yi ∈ {0, ..., k − 1}, as:

DCG(σ̂, y) = Σ_{i=1}^n (2^{y_σ̂(i)} − 1) / log2(i + 1)  or  DCG′(σ̂, y) = y_σ̂(1) + Σ_{i=2}^n y_σ̂(i) / log2 i.

2.2 Multivariate empirical risk minimization

Empirical risk minimization [28] is a common supervised learning approach that seeks a predictor P̂(ŷ|x) (from, e.g., a set of predictors Γ) that minimizes the loss under the empirical distribution of training data, denoted P̃(y, x): min_{P̂(ŷ|x)∈Γ} E_{P̃(y,x)P̂(ŷ|x)}[loss(Ŷ, Y)]. Multivariate losses are often not convex and finding the optimal solution is computationally intractable for expressive classes of predictors Γ typically specified by some set of parameters θ (e.g., linear discriminant functions: P̂(ŷ|x) = 1 if θ · Φ(x, ŷ) > θ · Φ(x, y′) ∀y′ ≠ ŷ).

Given these difficulties, convex surrogates to the multivariate loss are instead employed that are additive over ŷi and yi (i.e., loss(ŷ, y) = Σ_i loss(ŷi, yi)). Employing the logarithmic loss, loss(ŷi, yi) = −log P̂(Ŷi = yi), yields the logistic regression model [9]. Using the hinge loss yields support vector machines [5]. 
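Before turning to structured SVMs, the multivariate measures defined in Section 2.1 can be stated concretely in code. The following is an illustrative sketch only; the function names and the 0-indexed permutation convention are ours, not the paper's:

```python
import math

def precision_at_k(y_hat, y, k):
    """P@k(y_hat, y) = (y_hat . y) / k, assuming ||y_hat||_1 = k."""
    assert sum(y_hat) == k
    return sum(a * b for a, b in zip(y_hat, y)) / k

def f1_score(y_hat, y):
    """F1(y_hat, y) = 2 (y_hat . y) / (||y_hat||_1 + ||y||_1), with F1(0, 0) = 1."""
    denom = sum(y_hat) + sum(y)
    if denom == 0:
        return 1.0
    return 2 * sum(a * b for a, b in zip(y_hat, y)) / denom

def dcg(sigma, y):
    """DCG(sigma, y) = sum_i (2^{y_{sigma(i)}} - 1) / log2(i + 1), where
    sigma[i] is the (0-indexed) item placed at rank i + 1."""
    return sum((2 ** y[sigma[i]] - 1) / math.log2(i + 2)
               for i in range(len(sigma)))
```

For binary relevance, 2^y − 1 reduces to y, which is why the DCG entries of Table 1 below are simple sums of rank discounts.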
Structured support vector machines [27] employ a convex approximation of the multivariate loss over a training dataset D using the hinge loss function:

min_{θ, ξi ≥ 0} ||θ||^2 + α Σ_i ξi  such that  ∀i, y′ ∈ Y: θ · [Φ(x^(i), y^(i)) − Φ(x^(i), y′)] ≥ Δ(y′, y^(i)) − ξi.

In other words, linear parameters θ for feature functions Φ(·,·) are desired that make the example label y^(i) have a potential value θ · Φ(x^(i), y^(i)) that is better than all alternative labels y′ by at least the multivariate loss between y′ and y^(i), denoted Δ(y′, y^(i)). When this is not possible for a particular example, a hinge loss penalty ξi is incurred that grows linearly with the difference in potentials. Parameter α controls a trade-off between obtaining a predictor with lower hinge loss or better discrimination between training examples (the margin). The size of set Y is often too large for explicit construction of the constraint set to be computationally tractable. Instead, constraint generation methods are employed to find a smaller set of active constraints. This can be viewed as either finding the most-violated constraint [27] or as a loss-augmented inference problem [25]. 
Our approach employs similar constraint generation techniques, in the inference procedure rather than the parameter learning procedure, to improve its efficiency.

3 Multivariate Prediction Games

We formulate a minimax game for multivariate loss optimization, describe our approach for limiting the computational complexity of solving this game, and describe algorithms for estimating parameters of the game and making predictions using this framework.

3.1 Game formulation

Following a recent adversarial formulation for classification [1], we view multivariate prediction as a two-player game between player Ŷ making predictions and player Y̌ determining the evaluation distribution. Player Ŷ first stochastically chooses a predictive distribution of variable assignments, P̂(ŷ|x), to maximize a multivariate performance measure, then player Y̌ stochastically chooses an evaluation distribution, P̌(y̌|x), that minimizes the performance measure. Further, player Y̌ must choose the relevant items in a way that (approximately) matches in expectation a set of statistics, Φ(x, y), measured from labeled data. We denote this set as Ξ.

Definition 1. 
The multivariate prediction game (MPG) for n predicted variables is:

max_{P̂(ŷ|x)} min_{P̌(y̌|x)∈Ξ} E_{P̃(x) P̂(ŷ|x) P̌(y̌|x)}[score(Ŷ, Y̌)],

where P̂(ŷ|x) and P̌(y̌|x) are distributions over combinations of labels for the n predicted variables and the set Ξ corresponds to the constraint: E_{P̃(x) P̌(y̌|x)}[Φ(X, Y̌)] = E_{P̃(y,x)}[Φ(X, Y)].

Since the set Ξ constrains the adversary's multivariate label distribution over the entire distribution of inputs P̃(x), solving this game directly is impractical when the number of training examples is large. Instead, we employ the method of Lagrange multipliers in Theorem 1, which allows the set of games to be independently solved given Lagrange multipliers θ.

Theorem 1. The multivariate prediction game's value (Definition 1) can be equivalently obtained by solving a set of unconstrained maximin games parameterized by Lagrange multipliers θ:

max_{P̂(ŷ|x)} min_{P̌(y̌|x)∈Ξ} E_{P̃(x) P̂(ŷ|x) P̌(y̌|x)}[score(Ŷ, Y̌)] (a)= min_{P̌(y̌|x)∈Ξ} max_{P̂(ŷ|x)} E_{P̃(x) P̂(ŷ|x) P̌(y̌|x)}[score(Ŷ, Y̌)]   (1)

(b)= max_θ ( E_{P̃(y,x)}[θ · Φ(X, Y)] + Σ_{x∈X} P̃(x) min_{P̌(y̌|x)} max_{P̂(ŷ|x)} E_{P̂(ŷ|x) P̌(y̌|x)}[ score(ŷ, y̌) − θ · Φ(x, y̌) ] ),   (2)

where the bracketed term in Eq. (2) defines the augmented payoff entries C′_{ŷ,y̌}, and where: Φ(x, y) is a vector of features characterizing the set of 
prediction variables {yi} and provided contextual variables {xi} each related to predicted variable yi.

Proof (sketch). Equality (a) is a consequence of duality in zero-sum games [29]. Equality (b) is obtained by writing the Lagrangian and taking the dual. Strong Lagrangian duality is guaranteed when a feasible solution exists on the relative interior of the convex constraint set Ξ [2]. (A small amount of slack corresponds to regularization of the θ parameter in the dual and guarantees the strong duality feasibility requirement is satisfied in practice.)

The resulting game's payoff matrix can be expressed as the original game scores of Eq. (1) augmented with Lagrangian potentials. The combination defines a new payoff matrix with entries C′_{ŷ,y̌} = score(ŷ, y̌) − θ · Φ(x, y̌), as shown in Eq. (2).

3.2 Example multivariate prediction games and small-scale solutions

Examples of the Lagrangian payoff matrices for the P@2, F-score, and DCG games are shown in Table 1 for three variables. We employ additive feature functions, Φ(x, y̌) = Σ_{i=1}^n φ(xi) I(y̌i = 1), in these examples (with indicator function I(·)). We compactly represent the Lagrangian potential terms for each game with potential variables, ψi ≜ θ · φ(Xi = xi) when Y̌i = 1 (and 0 otherwise).

Table 1: The payoff matrices for the zero-sum games between player Y̌ choosing columns and player Ŷ choosing rows with three variables for: precision at k (top); F-score (middle); and DCG with binary relevance values, y̌i ∈ {0, 1}, where we let lg 3 ≜ log2 3 (bottom). Columns are indexed by y̌ ∈ {000, 001, 010, 011, 100, 101, 110, 111} in that order.

P@2:
011: 0 | 1/2−ψ3 | 1/2−ψ2 | 1−ψ2−ψ3 | 0−ψ1 | 1/2−ψ1−ψ3 | 1/2−ψ1−ψ2 | 1−ψ1−ψ2−ψ3
101: 0 | 1/2−ψ3 | 0−ψ2 | 1/2−ψ2−ψ3 | 1/2−ψ1 | 1−ψ1−ψ3 | 1/2−ψ1−ψ2 | 1−ψ1−ψ2−ψ3
110: 0 | 0−ψ3 | 1/2−ψ2 | 1/2−ψ2−ψ3 | 1/2−ψ1 | 1/2−ψ1−ψ3 | 1−ψ1−ψ2 | 1−ψ1−ψ2−ψ3

F1:
000: 1 | 0−ψ3 | 0−ψ2 | 0−ψ2−ψ3 | 0−ψ1 | 0−ψ1−ψ3 | 0−ψ1−ψ2 | 0−ψ1−ψ2−ψ3
001: 0 | 1−ψ3 | 0−ψ2 | 2/3−ψ2−ψ3 | 0−ψ1 | 2/3−ψ1−ψ3 | 0−ψ1−ψ2 | 1/2−ψ1−ψ2−ψ3
010: 0 | 0−ψ3 | 1−ψ2 | 2/3−ψ2−ψ3 | 0−ψ1 | 0−ψ1−ψ3 | 2/3−ψ1−ψ2 | 1/2−ψ1−ψ2−ψ3
011: 0 | 2/3−ψ3 | 2/3−ψ2 | 1−ψ2−ψ3 | 0−ψ1 | 1/2−ψ1−ψ3 | 1/2−ψ1−ψ2 | 4/5−ψ1−ψ2−ψ3
100: 0 | 0−ψ3 | 0−ψ2 | 0−ψ2−ψ3 | 1−ψ1 | 2/3−ψ1−ψ3 | 2/3−ψ1−ψ2 | 1/2−ψ1−ψ2−ψ3
101: 0 | 2/3−ψ3 | 0−ψ2 | 1/2−ψ2−ψ3 | 2/3−ψ1 | 1−ψ1−ψ3 | 1/2−ψ1−ψ2 | 4/5−ψ1−ψ2−ψ3
110: 0 | 0−ψ3 | 2/3−ψ2 | 1/2−ψ2−ψ3 | 2/3−ψ1 | 1/2−ψ1−ψ3 | 1−ψ1−ψ2 | 4/5−ψ1−ψ2−ψ3
111: 0 | 1/2−ψ3 | 1/2−ψ2 | 4/5−ψ2−ψ3 | 1/2−ψ1 | 4/5−ψ1−ψ3 | 4/5−ψ1−ψ2 | 1−ψ1−ψ2−ψ3

DCG:
123: 0 | 1/2−ψ3 | 1/lg3−ψ2 | 1/lg3+1/2−ψ2−ψ3 | 1−ψ1 | 3/2−ψ1−ψ3 | 1+1/lg3−ψ1−ψ2 | 3/2+1/lg3−ψ1−ψ2−ψ3
132: 0 | 1/lg3−ψ3 | 1/2−ψ2 | 1/lg3+1/2−ψ2−ψ3 | 1−ψ1 | 1+1/lg3−ψ1−ψ3 | 3/2−ψ1−ψ2 | 3/2+1/lg3−ψ1−ψ2−ψ3
213: 0 | 1/2−ψ3 | 1−ψ2 | 3/2−ψ2−ψ3 | 1/lg3−ψ1 | 1/2+1/lg3−ψ1−ψ3 | 1+1/lg3−ψ1−ψ2 | 3/2+1/lg3−ψ1−ψ2−ψ3
231: 0 | 1/lg3−ψ3 | 1−ψ2 | 1+1/lg3−ψ2−ψ3 | 1/2−ψ1 | 1/2+1/lg3−ψ1−ψ3 | 3/2−ψ1−ψ2 | 3/2+1/lg3−ψ1−ψ2−ψ3
312: 0 | 1−ψ3 | 1/2−ψ2 | 3/2−ψ2−ψ3 | 1/lg3−ψ1 | 1+1/lg3−ψ1−ψ3 | 1/2+1/lg3−ψ1−ψ2 | 3/2+1/lg3−ψ1−ψ2−ψ3
321: 0 | 1−ψ3 | 1/lg3−ψ2 | 1+1/lg3−ψ2−ψ3 | 1/2−ψ1 | 3/2−ψ1−ψ3 | 1/2+1/lg3−ψ1−ψ2 | 3/2+1/lg3−ψ1−ψ2−ψ3

Zero-sum games such as these can be solved using a pair of linear programs that have a constraint for each pure action (set of variable assignments) in the game [29]:

max_{v, P̂(ŷ|x)≥0} v  such that  v ≤ Σ_{ŷ∈Y} P̂(ŷ|x) C′_{ŷ,y̌} ∀y̌ ∈ Y  and  Σ_{ŷ∈Y} P̂(ŷ|x) = 1;   (3)

min_{v, P̌(y̌|x)≥0} v  such that  v ≥ Σ_{y̌∈Y} P̌(y̌|x) C′_{ŷ,y̌} ∀ŷ ∈ Y  and  Σ_{y̌∈Y} P̌(y̌|x) = 1,   (4)

where C′ is the Lagrangian-augmented payoff and v is the value of the game. The second player to act in a zero-sum game can maximize/minimize using a pure strategy (i.e., a single value assignment to all variables). Thus, these LPs consider only the set of pure strategies of the opponent to find the first player's mixed equilibrium strategy. 
The equilibrium strategy for the predictor is a distribution over rows and the equilibrium strategy for the adversary is a distribution over columns.

The size of each game's payoff matrix grows exponentially with the number of variables, n: 2^n × (n choose k) for the precision at k game; (2^n)^2 for the F-score game; and n! k^n for the DCG game with k possible relevance levels. These sizes make explicit construction of the game matrix impractical for all but the smallest of problems.

3.3 Large-scale strategy inference

More efficient methods for obtaining Nash equilibria are needed to scale our MPG approach to large prediction tasks with exponentially-sized payoff matrices. Though much attention has focused on efficiently computing ε-Nash equilibria (e.g., in O(1/ε) time or O(ln(1/ε)) time [8]), which guarantee each player a payoff within ε of optimal, we employ an approach for finding an exact equilibrium that works well in practice despite not having as strong theoretical guarantees [20].

Consider the reduced game matrices of Table 2. The Nash equilibrium for the precision at k game with Lagrangian potentials ψ1 = ψ2 = ψ3 = 0.4 is: P̂(ŷ|x) = [1/3, 1/3, 1/3] and P̌(y̌|x) = [1/3, 1/3, 1/3], with a game value of −2/15. The Nash equilibrium for the reduced F-score game with no learning (i.e., ψ1 = ψ2 = ψ3 = 0) is: P̂(ŷ|x) = [1/3, 2/3] and P̌(y̌|x) = [1/3, 2/9, 2/9, 2/9], with a game value of 1/3. The reduced game equilibrium is also an equilibrium of the original game. Though the exact size of the subgame and its specific actions depends on the values of ψ, often a compact sub-game with identical equilibrium or close approximation exists [18]. 
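These reduced-game equilibria can be verified numerically. The sketch below (our illustration, not the paper's code) rebuilds the 3 × 3 reduced precision-at-k payoff matrix for ψ1 = ψ2 = ψ3 = 0.4 and checks that uniform strategies admit no profitable pure deviation:

```python
from itertools import combinations

n, k = 3, 2
psi = [0.4, 0.4, 0.4]

# Reduced pure strategies for both players: labelings with exactly k positives.
actions = [tuple(1 if i in pos else 0 for i in range(n))
           for pos in combinations(range(n), k)]

def payoff(y_hat, y_check):
    """Lagrangian-augmented payoff C'_{y_hat, y_check} for the P@k game."""
    score = sum(a * b for a, b in zip(y_hat, y_check)) / k
    return score - sum(p for p, yc in zip(psi, y_check) if yc)

C = [[payoff(r, c) for c in actions] for r in actions]
u = 1.0 / len(actions)  # uniform mixed strategy for both players

value = sum(u * u * C[i][j] for i in range(3) for j in range(3))
# Each player's best pure deviation against the opponent's uniform mix:
row_best = max(sum(u * C[i][j] for j in range(3)) for i in range(3))
col_best = min(sum(u * C[i][j] for i in range(3)) for j in range(3))
```

Since row_best and col_best both equal the uniform-profile value of −2/15, neither player benefits from deviating, confirming the uniform pair as a Nash equilibrium of the reduced game.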
Motivated by the compactness of the reduced game, we employ a constraint generation approach known as the double oracle algorithm [20] to iteratively construct an appropriate reduced game that provides the correct equilibrium but avoids the computational complexity of the original exponentially sized game.

Table 2: The reduced precision at k game with ψ1 = ψ2 = ψ3 = 0.4 (top) and F-score game with ψ1 = ψ2 = ψ3 = 0 (bottom).

P@2 | 011 | 101 | 110
011 | 0.2 | -0.3 | -0.3
101 | -0.3 | 0.2 | -0.3
110 | -0.3 | -0.3 | 0.2

F1 | 000 | 001 | 010 | 100
000 | 1 | 0 | 0 | 0
111 | 0 | 1/2 | 1/2 | 1/2

Algorithm 1 Constraint generation game solver
Input: Lagrange potentials for each variable, ψ = {ψ1, ψ2, ..., ψn}; initial action sets Ŝ0 and Š0
Output: Nash equilibrium, (P̂(ŷ|x), P̌(y̌|x))
1: Initialize Player Ŷ's action set Ŝ ← Ŝ0 and Player Y̌'s action set Š ← Š0
2: C′ ← buildPayoffMatrix(Ŝ, Š, ψ)  ▷ Using Eq. (2) for the sub-game matrix of Ŝ × Š
3: repeat
4:   [P̂(ŷ|x), vNash1] ← solveZeroSumGame_Ŷ(C′)  ▷ Using the LP of Eq. (3)
5:   [ǎ, v̌BR] ← findBestResponseAction(P̂(ŷ|x), ψ)  ▷ ǎ denotes the best response action
6:   if (vNash1 ≠ v̌BR) then  ▷ Check if best response provides improvement
7:     Š ← Š ∪ ǎ  ▷ Add new column to game matrix
8:     C′ ← buildPayoffMatrix(Ŝ, Š, ψ)
9:   end if
10:  [P̌(y̌|x), vNash2] ← solveZeroSumGame_Y̌(C′)  ▷ Using the LP of Eq. (4)
11:  [â, v̂BR] ← findBestResponseAction(P̌(y̌|x), ψ)
12:  if (vNash2 ≠ v̂BR) then
13:    Ŝ ← Ŝ ∪ â  ▷ Add new row to game matrix
14:    C′ ← buildPayoffMatrix(Ŝ, Š, ψ)
15:  end if
16: until (vNash1 = vNash2 = v̂BR = v̌BR)  ▷ Stop if neither best response provides improvement
17: return [P̂(ŷ|x), P̌(y̌|x)]

Neither player can improve upon their strategy with additional pure strategies when Algorithm 1 terminates, thus the mixed strategies it returns are a Nash equilibrium pair [20]. Additionally, the algorithm is efficient in practice so long as each player's strategy is compact (i.e., the number of actions with non-zero probability is a polynomial subset of the label combinations) and best responses to opponents' strategies can be obtained efficiently (i.e., in polynomial time) for each player. 
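A compact sketch of Algorithm 1's double oracle loop follows. It is illustrative only: it uses scipy's linprog as an off-the-shelf solver for the sub-game LPs of Eqs. (3)-(4) and brute-force search over the full action sets as the best-response oracle (the paper instead relies on the efficient oracles of Section 3.4); the three-variable P@2 game with ψ1 = ψ2 = ψ3 = 0.4 serves as the running example:

```python
import numpy as np
from itertools import combinations, product
from scipy.optimize import linprog

def solve_zero_sum(C):
    """Equilibrium mixed strategy and value for the row maximizer of payoff
    matrix C, via the LP of Eq. (3): max v s.t. p^T C >= v column-wise."""
    m, n = C.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0  # maximize v
    A_ub = np.hstack([-C.T, np.ones((n, 1))])  # v - (p^T C)_j <= 0 per column j
    A_eq = np.append(np.ones(m), 0.0).reshape(1, -1)  # strategy sums to one
    res = linprog(c, A_ub, np.zeros(n), A_eq, [1.0],
                  bounds=[(0, None)] * m + [(None, None)])
    return res.x[:m], res.x[-1]

def double_oracle(payoff, all_rows, all_cols, r0, c0, tol=1e-8):
    """Double oracle sketch: alternately solve the current sub-game and add
    each player's best response until neither improves on the sub-game value."""
    R, S = [r0], [c0]
    while True:
        C = np.array([[payoff(r, s) for s in S] for r in R])
        p, v = solve_zero_sum(C)
        q, _ = solve_zero_sum(-C.T)  # column player's sub-game strategy
        br_r = max(all_rows, key=lambda r: sum(qj * payoff(r, s) for qj, s in zip(q, S)))
        br_c = min(all_cols, key=lambda s: sum(pi * payoff(r, s) for pi, r in zip(p, R)))
        vr = sum(qj * payoff(br_r, s) for qj, s in zip(q, S))
        vc = sum(pi * payoff(r, br_c) for pi, r in zip(p, R))
        grew = False
        if vr > v + tol and br_r not in R:
            R.append(br_r)
            grew = True
        if vc < v - tol and br_c not in S:
            S.append(br_c)
            grew = True
        if not grew:
            return p, q, R, S, v

n, k, psi = 3, 2, (0.4, 0.4, 0.4)
def pay(r, s):
    return sum(a * b for a, b in zip(r, s)) / k - sum(w for w, si in zip(psi, s) if si)

rows = [tuple(1 if i in c else 0 for i in range(n)) for c in combinations(range(n), k)]
cols = list(product((0, 1), repeat=n))
p, q, R, S, v = double_oracle(pay, rows, cols, rows[0], cols[0])
```

In this unrestricted game the adversary's equilibrium includes the all-relevant column 111, which costs it ψ1 + ψ2 + ψ3 = 1.2 but holds every predictor row to a payoff of 1 − 1.2 = −0.2, so the converged value is −0.2 rather than the reduced game's −2/15.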
Additionally, this algorithm can be modified to find approximate equilibria by limiting the number of actions for each player's set Ŝ and Š.

3.4 Efficiently computing best responses

The tractability of our approach largely rests on our ability to efficiently find best responses to opponent strategies: argmax_{ŷ∈Ŷ} E_{P̌(y̌|x)}[C′_{ŷ,Y̌}] and argmin_{y̌∈Y̌} E_{P̂(ŷ|x)}[C′_{Ŷ,y̌}]. For some combinations of loss functions and features, finding the best response is trivial using, e.g., a greedy selection algorithm. Other loss function/feature combinations require specialized algorithms or are NP-hard. We illustrate each situation.

Precision at k best response Many best responses can be obtained using greedy algorithms that are based on marginal probabilities of the opponent's strategy. For example, the expected payoff in the precision at k game for the estimator player setting ŷi = 1 is P̌(y̌i = 1|x)/k. Thus, the set of top k variables with the largest marginal label probability provides the best response. For the adversary's best response, the Lagrangian terms must also be included. Since k is a known variable, the adversary's objective is additive over its included items; as long as the value of each included term, P̂(ŷi = 1, ||ŷ||1 = k|x) − kψi, is negative, including it makes the sum smaller, so including exactly the items with negative terms yields the adversary's best response.

F-score game best response We leverage a recently developed method for efficiently maximizing the F-score when a distribution over relevant documents is given [6]. The key insight is that the problem can be separated into an inner greedy maximization over item sets of a certain size k and an outer maximization to select the best set size k from {0, ..., n}. 
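Returning briefly to the precision-at-k case: both greedy best responses just described are easy to validate against brute-force search. The sketch below uses made-up marginal probabilities, and applies the sign test p̂i/k − ψi < 0, which is equivalent to the included term above being negative:

```python
from itertools import combinations, product

n, k = 4, 2
psi = [0.1, 0.3, 0.05, 0.2]
adv_marg = [0.7, 0.2, 0.5, 0.6]    # adversary marginals P(y_check_i = 1 | x), made up
pred_marg = [0.5, 0.5, 0.6, 0.4]   # estimator marginals P(y_hat_i = 1 | x), made up

# Estimator best response: the expected payoff of a k-set is the sum of its
# items' adversary marginals divided by k, so the k largest marginals win.
greedy_pred = sorted(range(n), key=lambda i: -adv_marg[i])[:k]
brute_pred = max(combinations(range(n), k),
                 key=lambda c: sum(adv_marg[i] for i in c))

# Adversary best response: its objective is additive, with item i contributing
# pred_marg[i] / k - psi[i] when included, so it includes exactly the items
# whose contribution is negative.
greedy_adv = [1 if pred_marg[i] / k - psi[i] < 0 else 0 for i in range(n)]

def adv_obj(y):
    return sum(y[i] * (pred_marg[i] / k - psi[i]) for i in range(n))

brute_adv = min(product((0, 1), repeat=n), key=adv_obj)
```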
This method can be directly applied to find the best response of the estimator player, Ŷ, since the Lagrangian terms of the cost matrix are invariant to the choice of ŷ. Algorithm 2 obtains the best response for the adversary player, Y̌, using slight modifications to incorporate the Lagrangian potentials into the objective function.

Algorithm 2 Lagrangian-augmented F-measure maximizer for adversary player Y̌
Input: matrix P̂ of estimator probabilities, with entries P̂_{i,s} = P̂(ŷi = 1, ||ŷ||1 = s|x), and Lagrange potentials ψ = (ψ1, ψ2, ..., ψn)
1: define matrix W with elements W_{s,k} = 1/(s + k), s, k ∈ {1, ..., n}
2: construct matrix F = P̂ × W − (1/2) ψ^T × 1_n  ▷ 1_n is the all ones 1 × n vector
3: for k = 1 to n do
4:   solve the inner optimization problem:
5:   a^(k)* = argmin_{a∈A_k} E_{y∼P̂(Ŷ|x)}[F(y, a)]  ▷ A_k = {a ∈ {0,1}^n | Σ_i a_i = k}
6:   by setting a^(k)_i = 1 for the k smallest elements of the k-th column of F, and a_i = 0 for the rest;
7:   store the value E_{y∼P̂(Ŷ|x)}[F(y, a^(k)*)] = 2 Σ_{i=1}^n a^(k)*_i f_ik
8: end for
9: for k = 0, take a^(k)* = 0_n and E_{y∼P̂(Ŷ|x)}[F(y, 0_n)] = P̂(Ŷ = 0_n|x)
10: solve the outer optimization problem:
11: a* = argmin_{a∈{a^(0)*, ..., a^(n)*}} E_{y∼P̂(Ŷ|x)}[F(y, a)]
12: return a* and E_{y∼P̂(Ŷ|x)}[F(y, a*)]

Order inversion best response Another common loss measure when comparing two rankings is the number of pairs of items with inverted order across rankings (i.e., one variable may occur before another in one ranking, but not in the other ranking). 
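As a sanity check on Algorithm 2 before moving on: its inner/outer decomposition can be implemented in a few lines and compared against brute-force minimization of E[F1(ŷ, a)] − Σ_i ai ψi. The distribution and potentials below are made up purely for illustration:

```python
from itertools import product

n = 3
psi = [0.15, 0.05, 0.25]
# A made-up estimator distribution over {0,1}^3 (uniform, for simplicity).
P = {y: 1.0 / 2 ** n for y in product((0, 1), repeat=n)}

def f1(y, a):
    d = sum(y) + sum(a)
    return 1.0 if d == 0 else 2.0 * sum(yi * ai for yi, ai in zip(y, a)) / d

def adversary_best_response(P, psi, n):
    """Algorithm 2 sketch: joint[i][s] = P(y_i = 1, ||y||_1 = s); for each
    support size K, picking the K smallest entries of column K of
    f[i][K] = sum_s joint[i][s] / (s + K) - psi[i] / 2 is optimal, and twice
    the sum of those entries equals E[F1] minus the included potentials."""
    joint = [[0.0] * (n + 1) for _ in range(n)]
    for y, p in P.items():
        s = sum(y)
        for i in range(n):
            if y[i]:
                joint[i][s] += p
    # K = 0 case: E[F1(y, 0_n)] = P(Y = 0_n), since F1(0, 0) = 1.
    best_a, best_val = (0,) * n, P.get((0,) * n, 0.0)
    for K in range(1, n + 1):
        col = [sum(joint[i][s] / (s + K) for s in range(1, n + 1)) - psi[i] / 2
               for i in range(n)]
        chosen = sorted(range(n), key=lambda i: col[i])[:K]
        val = 2 * sum(col[i] for i in chosen)
        if val < best_val:
            best_a = tuple(1 if i in chosen else 0 for i in range(n))
            best_val = val
    return best_a, best_val

a_star, v_star = adversary_best_response(P, psi, n)

def objective(a):
    return sum(p * f1(y, a) for y, p in P.items()) - sum(ai * pi for ai, pi in zip(a, psi))

brute_val = min(objective(a) for a in product((0, 1), repeat=n))
```

The O(n^2)-per-column structure is what makes this best response polynomial despite the 2^n candidate labelings.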
Only the marginal probabilities of pairwise orderings, P̂(σ̂⁻¹(i) > σ̂⁻¹(j)) ≜ Σ_{σ̂} P̂(σ̂) I(σ̂⁻¹(i) > σ̂⁻¹(j)), are needed to construct the portion of the payoff received for σ̌ ranking item i over item j, P̂(σ̂⁻¹(i) > σ̂⁻¹(j))(1 + ψ_{i>j}), where ψ_{i>j} is a Lagrangian potential based on pair-wise features for ranking item i over item j. One could construct a fully connected directed graph with edges weighted by these portions of the payoff for ranking pairs of items. The best response for σ̌ corresponds to a set of acyclic edges with the smallest sum of edge weights. Unfortunately, this problem is NP-hard in general because the NP-complete minimum feedback arc set problem [15], which seeks to form an acyclic graph by removing the set of edges with the minimal sum of edge weights, can be reduced to it.

DCG best response Although we cannot find an efficient algorithm for the best response under order inversion, the best response for DCG has a known efficient algorithm. In this problem the maximizer is a permutation of the documents while the minimizer is the relevance score of each document. The estimator's best response σ̂ maximizes:

Σ_{y̌} P̌(y̌|x) ( Σ_{i=1}^n (2^{y̌_σ̂(i)} − 1) / log2(i + 1) ) − c,

where c is a constant that has no relationship with σ̂. Since 1/log2(i + 1) is monotonically decreasing, computing and sorting Σ_{y̌} P̌(y̌|x)(2^{y̌_i} − 1) in descending order and greedily assigning the order to σ̂ is optimal. 
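This sort-based estimator best response can again be checked against brute force over all permutations. A small made-up adversary distribution illustrates it (our sketch, with 0-indexed items):

```python
import math
from itertools import permutations

n = 4
# Made-up adversary distribution over binary relevance vectors.
P_check = {(1, 0, 1, 0): 0.5, (0, 1, 1, 0): 0.3, (0, 0, 0, 1): 0.2}

def expected_dcg(sigma):
    """E over P_check of sum_i (2^{y_{sigma(i)}} - 1) / log2(i + 2)."""
    return sum(p * sum((2 ** y[sigma[i]] - 1) / math.log2(i + 2) for i in range(n))
               for y, p in P_check.items())

# Best response: rank items by descending expected gain E[2^{y_i} - 1];
# optimal because the rank discounts 1/log2(i + 1) are decreasing.
gains = [sum(p * (2 ** y[i] - 1) for y, p in P_check.items()) for i in range(n)]
greedy = tuple(sorted(range(n), key=lambda i: -gains[i]))
brute = max(permutations(range(n)), key=expected_dcg)
```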
The adversary's best response using additive features, θ · φ(x, y̌) = Σ_{i=1}^n θi · φi(xi, y̌i), minimizes:

Σ_{σ̂} P̂(σ̂|x) ( Σ_{i=1}^n (2^{y̌_σ̂(i)} − 1) / log2(i + 1) ) − Σ_{i=1}^n θi · φi(xi, y̌i) = Σ_{i=1}^n ( Σ_{σ̂} P̂(σ̂|x) (2^{y̌_i} − 1) / log2(σ̂⁻¹(i) + 1) − θi · φi(xi, y̌i) ).

Thus, by using the expectation of a function of each variable's rank, 1/log2(σ̂⁻¹(i) + 1), which is easily computed from P̂(σ̂), each variable's relevancy score y̌i can be independently chosen.

3.5 Parameter estimation

Predictive model parameters, θ, must be chosen to ensure that the adversarial distribution is similar to training data. Though adversarial prediction can be posed as a convex optimization problem [1], the objective function is not smooth. General subgradient methods require O(1/ε²) iterations to provide an ε approximation to the optima. We instead employ L-BFGS [19], which has been empirically shown to converge at a faster rate in many cases despite lacking theoretical guarantees for non-smooth objectives [16]. We also employ L2 regularization to avoid overfitting to the training data sample. The addition of the smooth regularizer often helps to improve the rate of convergence. The gradient in these optimizations with L2 regularization, −(λ/2)||θ||^2, for training dataset D = {(x^(i), y^(i))} is the difference between feature moments with an additional regularization term: (1/|D|) Σ_{j=1}^{|D|} ( Φ(x^(j), y^(j)) − Σ_{y̌∈Y} P̌(y̌|x^(j)) Φ(x^(j), y̌) ) − λθ. The adversarial strategies P̌(·|x^(j)) needed for calculating this gradient are computed via Alg. 1. 
The adversarial strategies $\check P(\cdot|x^{(i)})$ needed for calculating this gradient are computed via Alg. 1.

4 Experiments

We evaluate our approach, Multivariate Prediction Games (MPG), on the three performance measures of interest in this work: precision at k, F-score, and DCG. Our primary point of comparison is with structured support vector machines (SSVM) [27], to better understand the trade-offs between convexly approximating the loss function with the hinge loss versus adversarially approximating the training data using our approach. We employ an optical recognition of handwritten digits (OPTDIGITS) dataset [17] (10 classes, 64 features, 3,823 training examples, 1,797 test examples), an income prediction dataset ('a4a' ADULT¹ [17]; two classes, 123 features, 3,185 training examples, 29,376 test examples), and query-document pairs from the million query TREC 2007 (MQ2007) dataset of LETOR 4.0 [23] (1,700 queries, 41.15 documents on average per query, 46 features per document). Following the same evaluation method used in [27] for OPTDIGITS, the multi-class dataset is converted into multiple binary datasets and we report the macro-average of the performance of all classes on test data. For OPTDIGITS/ADULT, we use a random 1/3 of the training data as a holdout validation set to select the L2 regularization trade-off parameter $C \in \{2^{-6}, 2^{-5}, \ldots, 2^{6}\}$.

We evaluate the performance of our approach and comparison methods (SSVM variants² and logistic regression (LR)) using precision at k, where k is half the number of positive examples (i.e., $k = \frac{1}{2}\mathrm{POS}$), and F-score. For precision at k, we restrict the pure strategies of the adversary to select k positive labels. This prevents adversary strategies with no positive labels.
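As a concrete reference for these evaluation measures, a small sketch (our own illustrative implementation, not the paper's evaluation code) computing precision at k from real-valued scores and the F-score from binary 0/1 predictions:

```python
def precision_at_k(scores, labels, k):
    """Fraction of true positives among the k highest-scoring examples."""
    top_k = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    return sum(labels[i] for i in top_k) / k

def f_score(predicted, labels):
    """Harmonic mean of precision and recall for binary 0/1 predictions."""
    tp = sum(p * y for p, y in zip(predicted, labels))
    if tp == 0:
        return 0.0
    precision = tp / sum(predicted)
    recall = tp / sum(labels)
    return 2 * precision * recall / (precision + recall)
```

Thresholding at the top k (as in `precision_at_k`) guarantees exactly k positive predictions, mirroring the restriction placed on the adversary's pure strategies above.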
From the results in Table 3, we see that our approach, MPG, works better than SSVM on the OPTDIGITS dataset: slightly better on precision at k and more significantly better on F-measure. For the ADULT dataset, MPG provides equivalent performance for precision at k and better performance on F-measure.

Table 3: Precision at k (top) and F-score performance (bottom).

Precision@k   OPTDIGITS   ADULT
MPG           0.920       0.805
SSVM          0.915       0.638
SSVM'         0.914       0.805

F-score       OPTDIGITS   ADULT
MPG           0.990       0.697
SSVM          0.956       0.673
LR            0.989       0.639

The nature of the running time required for validation and testing is very different for SSVM, which must find the maximizing set of variable assignments, and MPG, which must interactively construct a game and its equilibrium. Model validation and testing require ≈30 seconds for SSVM on the OPTDIGITS dataset and ≈3 seconds on the ADULT dataset, while requiring ≈9 seconds and ≈25 seconds for MPG precision at k, and ≈1397 seconds and ≈252 seconds for MPG F-measure optimization, respectively. For precision at k, MPG is within an order of magnitude (better for OPTDIGITS, worse for ADULT). For the more difficult problem of maximizing the F-score of ADULT over 29,376 test examples, the MPG game becomes quite large and requires significantly more computational time.

¹ http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
² For precision at k, the original SSVM implementation uses the restriction k during training, but not during testing. We modified the code by ordering SSVM's prediction values for each test example and selecting the top k predictions as positives; the rest are considered negatives. We denote the original implementation as SSVM and the modified version as SSVM'.
Though our MPG method is not as finely optimized as existing SSVM implementations, this difference in run times will remain, as the game formulation is inherently more computationally demanding for difficult prediction tasks.

We compare the performance of our approach and comparison methods using five-fold cross validation on the MQ2007 dataset. We measure performance using Normalized DCG (NDCG), which divides the realized DCG by the maximum possible DCG for the dataset, based on a slightly different variant of DCG employed by LETOR 4.0:
$$\mathrm{DCG}''(\hat\sigma, y) = 2^{y_{\hat\sigma(1)}} - 1 + \sum_{i=2}^{n} \frac{2^{y_{\hat\sigma(i)}} - 1}{\log_2 i}.$$
The comparison methods are: RankSVM-Struct [13], part of SVMstruct, which uses a structured SVM to predict the rank; ListNet [3], a list-wise ranking algorithm employing a cross entropy loss; AdaRank-NDCG [30], a boosting method using 'weak rankers' and data reweighing to achieve good NDCG performance; AdaRank-MAP, which uses Mean Average Precision (MAP) rather than NDCG; and RankBoost [7], which reduces ranking to binary classification problems on instance pairs.

Figure 1: NDCG@K as K increases.

Table 4: MQ2007 NDCG Results.

Method         Mean NDCG
MPG            0.5220
RankSVM        0.4966
ListNet        0.4988
AdaRank-NDCG   0.4914
AdaRank-MAP    0.4891
RankBoost      0.5003

Table 4 reports the NDCG@K averaged over all values of K (between 1 and, on average, 41), while Figure 1 reports the results for each value of K between 1 and 10. From this, we can see that our MPG approach provides better rankings on average than the baseline methods, except when K is very small (K = 1, 2).
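The LETOR 4.0 DCG variant above is easy to implement directly; a minimal sketch (our own function names; normalization by the ideal ranking's DCG follows the NDCG definition in the text):

```python
from math import log2

def dcg_letor(ranking, relevance):
    """LETOR 4.0-style DCG'': gain 2**y - 1, with the first position
    undiscounted and position i >= 2 discounted by log2(i).
    ranking[pos] gives the item placed at (0-indexed) position pos."""
    total = 2 ** relevance[ranking[0]] - 1
    for pos, item in enumerate(ranking[1:], start=2):
        total += (2 ** relevance[item] - 1) / log2(pos)
    return total

def ndcg_letor(ranking, relevance):
    """Divide by the DCG of the ideal (relevance-sorted) ranking."""
    ideal = sorted(range(len(relevance)), key=lambda i: -relevance[i])
    best = dcg_letor(ideal, relevance)
    return dcg_letor(ranking, relevance) / best if best > 0 else 0.0
```

Note a quirk of this variant: positions 1 and 2 carry the same weight (the discount $\log_2 2 = 1$), so swapping the top two items leaves DCG'' unchanged.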
In other words, the adversary focuses most of its effort on reducing the score received from the first item in the ranking, but at the expense of allowing a better overall NDCG score for the ranking as a whole.

5 Discussion

We have extended adversarial prediction games [1] to settings with multivariate performance measures in this paper. We believe that this is an important step in demonstrating the benefits of this approach in settings where structured support vector machines [14] are widely employed. Our future work will investigate improving the computational efficiency of adversarial methods and also incorporating structured statistical relationships amongst variables in the constraint set in addition to multivariate performance measures.

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. #1526379, Robust Optimization of Loss Functions with Application to Active Learning.

References

[1] Kaiser Asif, Wei Xing, Sima Behpour, and Brian D. Ziebart. Adversarial cost-sensitive classification. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2015.
[2] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[3] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the International Conference on Machine Learning, pages 129–136. ACM, 2007.
[4] Corinna Cortes and Mehryar Mohri. AUC optimization vs. error rate minimization. In Advances in Neural Information Processing Systems, pages 313–320, 2004.
[5] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[6] Krzysztof J. Dembczynski, Willem Waegeman, Weiwei Cheng, and Eyke Hüllermeier. An exact algorithm for F-measure maximization.
In Advances in Neural Information Processing Systems, pages 1404–1412, 2011.
[7] Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. The Journal of Machine Learning Research, 4:933–969, 2003.
[8] Andrew Gilpin, Javier Peña, and Tuomas Sandholm. First-order algorithm with O(ln(1/ε)) convergence for ε-equilibrium in two-person zero-sum games. In AAAI Conference on Artificial Intelligence, pages 75–82, 2008.
[9] Peter D. Grünwald and A. Phillip Dawid. Game theory, maximum entropy, minimum discrepancy, and robust Bayesian decision theory. Annals of Statistics, 32:1367–1433, 2004.
[10] Tamir Hazan, Joseph Keshet, and David A. McAllester. Direct loss minimization for structured prediction. In Advances in Neural Information Processing Systems, pages 1594–1602, 2010.
[11] Klaus-Uwe Höffgen, Hans-Ulrich Simon, and Kevin S. Vanhorn. Robust trainability of single neurons. Journal of Computer and System Sciences, 50(1):114–125, 1995.
[12] Martin Jansche. Maximum expected F-measure training of logistic regression models. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 692–699. Association for Computational Linguistics, 2005.
[13] Thorsten Joachims. Optimizing search engines using clickthrough data. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, pages 133–142. ACM, 2002.
[14] Thorsten Joachims. A support vector method for multivariate performance measures. In Proceedings of the International Conference on Machine Learning, pages 377–384. ACM, 2005.
[15] Richard M. Karp. Reducibility among combinatorial problems. Springer, 1972.
[16] Adrian S. Lewis and Michael L. Overton. Nonsmooth optimization via BFGS. 2008.
[17] M. Lichman.
UCI Machine Learning Repository, 2013.
[18] Richard J. Lipton and Neal E. Young. Simple strategies for large zero-sum games with applications to complexity theory. In Proceedings of the ACM Symposium on Theory of Computing, pages 734–740. ACM, 1994.
[19] Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503–528, 1989.
[20] H. Brendan McMahan, Geoffrey J. Gordon, and Avrim Blum. Planning in the presence of cost functions controlled by an adversary. In Proceedings of the International Conference on Machine Learning, pages 536–543, 2003.
[21] David R. Musicant, Vipin Kumar, and Aysel Ozgur. Optimizing F-measure with support vector machines. In FLAIRS Conference, pages 356–360, 2003.
[22] Shameem Puthiya Parambath, Nicolas Usunier, and Yves Grandvalet. Optimizing F-measures by cost-sensitive classification. In Advances in Neural Information Processing Systems, pages 2123–2131, 2014.
[23] Tao Qin and Tie-Yan Liu. Introducing LETOR 4.0 datasets. arXiv preprint arXiv:1306.2597, 2013.
[24] Mani Ranjbar, Greg Mori, and Yang Wang. Optimizing complex loss functions in structured prediction. In Proceedings of the European Conference on Computer Vision, pages 580–593. Springer, 2010.
[25] Ben Taskar, Vassil Chatalbashev, Daphne Koller, and Carlos Guestrin. Learning structured prediction models: A large margin approach. In Proceedings of the International Conference on Machine Learning, pages 896–903. ACM, 2005.
[26] Flemming Topsøe. Information theoretical optimization techniques. Kybernetika, 15(1):8–27, 1979.
[27] Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. Support vector machine learning for interdependent and structured output spaces. In Proceedings of the International Conference on Machine Learning, page 104. ACM, 2004.
[28] Vladimir Vapnik.
Principles of risk minimization for learning theory. In Advances in Neural Information Processing Systems, pages 831–838, 1992.
[29] John von Neumann and Oskar Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, 1947.
[30] Jun Xu and Hang Li. AdaRank: a boosting algorithm for information retrieval. In Proceedings of the International Conference on Research and Development in Information Retrieval, pages 391–398. ACM, 2007.