{"title": "Adversarial Multiclass Classification: A Risk Minimization Perspective", "book": "Advances in Neural Information Processing Systems", "page_first": 559, "page_last": 567, "abstract": "Recently proposed adversarial classification methods have shown promising results for cost sensitive and multivariate losses. In contrast with empirical risk minimization (ERM) methods, which use convex surrogate losses to approximate the desired non-convex target loss function, adversarial methods minimize non-convex losses by treating the properties of the training data as being uncertain and worst case within a minimax game. Despite this difference in formulation, we recast adversarial classification under zero-one loss as an ERM method with a novel prescribed loss function. We demonstrate a number of theoretical and practical advantages over the very closely related hinge loss ERM methods. This establishes adversarial classification under the zero-one loss as a method that fills the long standing gap in multiclass hinge loss classification, simultaneously guaranteeing Fisher consistency and universal consistency, while also providing dual parameter sparsity and high accuracy predictions in practice.", "full_text": "Adversarial Multiclass Classi\ufb01cation:\n\nA Risk Minimization Perspective\n\nRizal Fathony\n\nAnqi Liu\n\nKaiser Asif\n\nBrian D. Ziebart\n\nDepartment of Computer Science\nUniversity of Illinois at Chicago\n\nChicago, IL 60607\n\n{rfatho2, aliu33, kasif2, bziebart}@uic.edu\n\nAbstract\n\nRecently proposed adversarial classi\ufb01cation methods have shown promising results\nfor cost sensitive and multivariate losses. In contrast with empirical risk mini-\nmization (ERM) methods, which use convex surrogate losses to approximate the\ndesired non-convex target loss function, adversarial methods minimize non-convex\nlosses by treating the properties of the training data as being uncertain and worst\ncase within a minimax game. 
Despite this difference in formulation, we recast\nadversarial classi\ufb01cation under zero-one loss as an ERM method with a novel\nprescribed loss function. We demonstrate a number of theoretical and practical\nadvantages over the very closely related hinge loss ERM methods. This establishes\nadversarial classi\ufb01cation under the zero-one loss as a method that \ufb01lls the long\nstanding gap in multiclass hinge loss classi\ufb01cation, simultaneously guaranteeing\nFisher consistency and universal consistency, while also providing dual parameter\nsparsity and high accuracy predictions in practice.\n\n1\n\nIntroduction\n\nA common goal for standard classi\ufb01cation problems in machine learning is to \ufb01nd a classi\ufb01er that\nminimizes the zero-one loss. Since directly minimizing this loss over training data via empirical\nrisk minimization (ERM) [1] is generally NP-hard [2], convex surrogate losses are employed to\napproximate the zero-one loss. For example, the logarithmic loss is minimized by the logistic\nregression classi\ufb01er [3] and the hinge loss is minimized by the support vector machine (SVM) [4, 5].\nBoth are Fisher consistent [6, 7] and universally consistent [8, 9] for binary classi\ufb01cation, meaning\nthey minimize the zero-one loss and are Bayes-optimal classi\ufb01ers when they learn from any true\ndistribution of data using a rich feature representation. SVMs provide the additional advantage of dual\nparameter sparsity so that when combined with kernel methods, extremely rich feature representations\ncan be ef\ufb01ciently considered. Unfortunately, generalizing the hinge loss to classi\ufb01cation tasks with\nmore than two labels is challenging and existing multiclass convex surrogates [10\u201312] tend to lose\ntheir consistency guarantees [13\u201315] or produce low accuracy predictions in practice [15].\nAdversarial classi\ufb01cation [16, 17] uses a different approach to tackle non-convex losses like the\nzero-one loss. 
Instead of approximating the desired loss function and evaluating over the training data, it adversarially approximates the available training data within a minimax game formulation with game payoffs defined by the desired (zero-one) loss function [18, 19]. This provides promising empirical results for cost-sensitive losses [16] and multivariate losses such as the F-measure and the precision-at-k [17]. Conceptually, parameter optimization for the adversarial method forces the adversary to "behave like" certain properties of the training data sample, making labels easier to predict within the minimax prediction game. However, a key bottleneck for these methods has been their reliance on zero-sum game solvers for inference, which are computationally expensive relative to inference in other prediction methods, such as SVMs.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

In this paper, we recast adversarial prediction from an empirical risk minimization perspective by analyzing the Nash equilibrium value of adversarial zero-one classification games to define a new multiclass loss.¹ This enables us to demonstrate that zero-one adversarial classification fills the long-standing gap in ERM-based multiclass classification by simultaneously: (1) guaranteeing Fisher consistency and universal consistency; (2) enabling computational efficiency via the kernel trick and dual parameter sparsity; and (3) providing competitive performance in practice. 
This reformulation also provides significant computational efficiency improvements compared to previous adversarial classification training methods [16].

2 Background and Related Work

2.1 Multiclass SVM generalizations

The multiclass support vector machine (SVM) seeks class-based potentials $f_y(x_i)$ for each input vector $x \in \mathcal{X}$ and class $y \in \mathcal{Y}$ so that the discriminant function, $\hat{y}_f(x_i) = \operatorname{argmax}_y f_y(x_i)$, minimizes misclassification errors, $\mathrm{loss}_f(x_i, y_i) = I(y_i \neq \hat{y}_f(x_i))$. Unfortunately, empirical risk minimization (ERM), $\min_f \mathbb{E}_{\tilde{P}(x,y)}[\mathrm{loss}_f(X, Y)]$, for the zero-one loss is NP-hard once the set of potentials is (parametrically) restricted (e.g., as a linear function of input features) [2]. Instead, a hinge loss approximation is employed by the SVM. In the binary setting, $y_i \in \{-1, +1\}$, where the potential of one class can be set to zero ($f_{-1} = 0$) with no loss in generality, the hinge loss is defined as $[1 - y_i f_{+1}(x_i)]_+$, with the compact definition $[g(\cdot)]_+ \triangleq \max(0, g(\cdot))$. Binary SVM, which is an empirical risk minimizer using the hinge loss with L2 regularization,

$$\min_{f_\theta} \; \mathbb{E}_{\tilde{P}(x,y)}\left[\mathrm{loss}_{f_\theta}(X, Y)\right] + \frac{\lambda}{2}\|\theta\|_2^2, \quad (1)$$

provides strong theoretical guarantees (Fisher consistency and universal consistency) [8, 21] and computational efficiency [1].

Many methods have been proposed to generalize the SVM to the multiclass setting. Apart from the one-vs-all and one-vs-one decomposed formulations [22], there are three main joint formulations: the WW model by Weston et al. [11], which incorporates the sum of hinge losses for all alternative labels, $\mathrm{loss}_{\mathrm{WW}}(x_i, y_i) = \sum_{j \neq y_i} [1 - (f_{y_i}(x_i) - f_j(x_i))]_+$; the CS model by Crammer and Singer [10], which uses the hinge loss of only the largest alternative label, $\mathrm{loss}_{\mathrm{CS}}(x_i, y_i) = \max_{j \neq y_i} [1 - (f_{y_i}(x_i) - f_j(x_i))]_+$; and the LLW model by Lee et al. [12], which employs an absolute hinge loss, $\mathrm{loss}_{\mathrm{LLW}}(x_i, y_i) = \sum_{j \neq y_i} [1 + f_j(x_i)]_+$, together with a constraint that $\sum_j f_j(x_i) = 0$. The former two models (CS and WW) both utilize the pairwise class-based potential differences $f_{y_i}(x_i) - f_j(x_i)$ and are therefore categorized as relative margin methods. LLW, on the other hand, is an absolute margin method that only relates to $f_j(x_i)$ [15]. Fisher consistency, or Bayes consistency [7, 13], guarantees that minimization of a surrogate loss for the true distribution provides the Bayes-optimal classifier, i.e., minimizes the zero-one loss. A classifier that is Bayes-optimal when given any possible distribution of data is called universally consistent. Of these methods, only the LLW method is Fisher consistent and universally consistent [12-14]. However, as pointed out by Doğan et al. [15], LLW's use of an absolute margin in the loss (rather than the relative margin of WW and CS) often causes it to perform poorly for datasets with low dimensional feature spaces. From the opposite direction, the requirements for Fisher consistency have been well-characterized [13], yet this has not led to a multiclass classifier that is both Fisher consistent and performs well in practice.

2.2 Adversarial prediction games

Building on a variety of diverse formulations for adversarial prediction [23-26], Asif et al. [16] proposed an adversarial game formulation for multiclass classification with cost-sensitive loss functions. 
Under this formulation, the empirical training data is replaced by an adversarially chosen conditional label distribution $\check{P}(\check{y}|x)$ that must closely approximate the training data, but otherwise seeks to maximize expected loss, while an estimator player $\hat{P}(\hat{y}|x)$ seeks to minimize expected loss. For the zero-one loss, the prediction game is:

$$\min_{\hat{P}} \; \max_{\check{P}:\, \mathbb{E}_{\tilde{P}(x)\check{P}(\check{y}|x)}[\phi(X, \check{Y})] = \tilde{\phi}} \; \mathbb{E}_{\tilde{P}(x)\hat{P}(\hat{y}|x)\check{P}(\check{y}|x)}\big[ I(\hat{Y} \neq \check{Y}) \big]. \quad (2)$$

The vector of feature moments, $\tilde{\phi} = \mathbb{E}_{\tilde{P}(x,y)}[\phi(X, Y)]$, is measured from sample training data. Using minimax and strong Lagrangian duality, the optimization of Eq. (2) reduces to minimizing the equilibrium game values of a new set of zero-sum games characterized by matrix $L'_{x_i,\theta}$:

$$\min_{\theta} \sum_i \max_{\check{p}} \min_{\hat{p}} \; \hat{p}_{x_i}^{\mathsf{T}} L'_{x_i,\theta}\, \check{p}_{x_i}; \qquad L'_{x_i,\theta} = \begin{bmatrix} \psi_{1,y_i}(x_i) & \cdots & \psi_{|\mathcal{Y}|,y_i}(x_i) + 1 \\ \vdots & \ddots & \vdots \\ \psi_{1,y_i}(x_i) + 1 & \cdots & \psi_{|\mathcal{Y}|,y_i}(x_i) \end{bmatrix}; \quad (3)$$

where $\theta$ is a vector of Lagrangian model parameters, $\hat{p}_{x_i}$ is a vector representation of the conditional label distribution $\hat{P}(\hat{Y} = k|x_i)$, i.e., $\hat{p}_{x_i} = [\hat{P}(\hat{Y}=1|x_i)\;\; \hat{P}(\hat{Y}=2|x_i)\; \dots]^{\mathsf{T}}$, and similarly for $\check{p}_{x_i}$. The matrix $L'_{x_i,\theta}$ is a zero-sum game matrix for each example, with $\psi_{j,y_i}(x_i) = f_j(x_i) - f_{y_i}(x_i) = \theta^{\mathsf{T}}(\phi(x_i, j) - \phi(x_i, y_i))$. This optimization problem (Eq. (3)) is convex in $\theta$ and the inner zero-sum game can be solved using linear programming [16].

¹Farnia & Tse independently and concurrently discovered this same loss function [20]. They provide an analysis focused on generalization bounds and experiments for binary classification.

3 Risk Minimization Perspective of Adversarial Multiclass Classification

3.1 Nash equilibrium game value

Despite the differences in formulation between adversarial loss minimization and empirical risk minimization, we now recast the zero-one loss adversarial game as the solution to an empirical risk minimization problem. Theorem 1 defines the loss function that provides this equivalence by considering all possible combinations of the adversary's label assignments with non-zero probability in the Nash equilibrium of the game.²

Theorem 1. The model parameters $\theta$ for multiclass zero-one adversarial classification are equivalently obtained from empirical risk minimization under the adversarial zero-one loss function:

$$\mathrm{AL}^{0\text{-}1}_f(x_i, y_i) = \max_{S \subseteq \{1,\dots,|\mathcal{Y}|\},\, S \neq \emptyset} \frac{\sum_{j \in S} \psi_{j,y_i}(x_i) + |S| - 1}{|S|}, \quad (4)$$

where $S$ is any non-empty member of the powerset of classes $\{1, 2, \dots, |\mathcal{Y}|\}$.

Thus, AL0-1 is the maximum value over $2^{|\mathcal{Y}|} - 1$ linear hyperplanes. For binary prediction tasks, there are three linear hyperplanes: $\psi_{1,y}(x)$, $\psi_{2,y}(x)$, and $\frac{\psi_{1,y}(x) + \psi_{2,y}(x) + 1}{2}$. Figure 1 shows the loss function in the space of potential differences $\psi$ when the true label is $y = 1$. Note that AL0-1 combines two hinge functions at $\psi_{2,y}(x) = -1$ and $\psi_{2,y}(x) = 1$, rather than SVM's single hinge at $\psi_{2,y}(x) = -1$. This difference from the hinge loss corresponds to the loss that is realized by randomizing label predictions.³ For three classes, the loss function has seven facets, as shown in Figure 2a. Figures 2a, 2b, and 2c show the similarities and differences between AL0-1 and the multiclass SVM surrogate losses based on class potential differences. 
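Since these loss surfaces are simple piecewise-linear functions of the potential differences, they can be evaluated directly. The following sketch (function names are ours, and the brute-force maximization over the label powerset is only practical for small $|\mathcal{Y}|$) computes AL0-1 from Eq. (4) alongside the WW and CS hinge surrogates for comparison:

```python
import itertools

def al01_loss(psi):
    """Adversarial zero-one loss AL0-1 (Eq. 4).

    psi: potential differences psi_j = f_j(x) - f_y(x) for the true label y,
         so psi[y] == 0. Maximizes (sum_{j in S} psi_j + |S| - 1) / |S|
         over all non-empty label subsets S.
    """
    labels = range(len(psi))
    return max(
        (sum(psi[j] for j in S) + len(S) - 1) / len(S)
        for r in range(1, len(psi) + 1)
        for S in itertools.combinations(labels, r)
    )

def ww_loss(psi, y):
    # Weston-Watkins: sum of hinges over all alternative labels.
    return sum(max(0.0, 1.0 + psi[j]) for j in range(len(psi)) if j != y)

def cs_loss(psi, y):
    # Crammer-Singer: hinge of only the largest alternative label.
    return max(max(0.0, 1.0 + psi[j]) for j in range(len(psi)) if j != y)
```

For a two-class tie ($\psi_{2,y}(x) = 0$), `al01_loss` returns 0.5, the expected zero-one loss of a randomized prediction, whereas both hinge surrogates return 1; for a uniform three-class tie it returns 2/3, matching the seven-facet surface of Figure 2a at the origin.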
Note that AL0-1 is also a relative margin loss function that utilizes the pairwise potential differences $\psi_{j,y}(x)$.

Figure 1: AL0-1 evaluated over the space of potential differences ($\psi_{j,y}(x) = f_j(x) - f_y(x)$; and $\psi_{j,j}(x) = 0$) for binary prediction tasks when the true label is $y = 1$.

Figure 2: Loss function contour plots over the space of potential differences for the prediction task with three classes when the true label is $y = 1$ under AL0-1 (a), the WW loss (b), and the CS loss (c). (Note that $\psi_i$ in the plots refers to $\psi_{j,y}(x) = f_j(x) - f_y(x)$; and $\psi_{j,j}(x) = 0$.)

3.2 Consistency properties

Fisher consistency is a desirable property for a surrogate loss function that guarantees its minimizer, given the true distribution $P(x, y)$, will yield the Bayes optimal decision boundary [13, 14]. For multiclass zero-one loss, given that we know $P_j(x) \triangleq P(Y = j|x)$, Fisher consistency requires that $\operatorname{argmax}_j f^*_j(x) \subseteq \operatorname{argmax}_j P_j(x)$, where $f^*(x) = [f^*_1(x), \dots, f^*_{|\mathcal{Y}|}(x)]^{\mathsf{T}}$ is the minimizer of $\mathbb{E}[\mathrm{loss}_f(X, Y)|X = x]$. Since any constant can be added to all $f^*_j(x)$ while keeping $\operatorname{argmax}_j f^*_j(x)$ the same, we employ a sum-to-zero constraint, $\sum_{j=1}^{|\mathcal{Y}|} f_j(x) = 0$, to remove redundant solutions. We establish an important property of the minimizer for AL0-1 in the following theorem.

²The proof of this theorem and others in the paper are contained in the Supplementary Materials.
³We refer the reader to Appendix H for a comparison of the binary adversarial method and the binary SVM.

Theorem 2. The loss for the minimizer $f^*$ of $\mathbb{E}\left[\mathrm{AL}^{0\text{-}1}_f(X, Y)|X = x\right]$ resides on the hyperplane defined (in Eq. 4) by the complete set of labels, $S = \{1, \dots, |\mathcal{Y}|\}$.

As an illustration for the case of three classes (Figure 2a), the area described in the theorem above corresponds to the region in the middle where the hyperplane that supports AL0-1 is $\frac{\psi_{1,y}(x) + \psi_{2,y}(x) + \psi_{3,y}(x) + 2}{3}$, and, equivalently, where $-\frac{1}{|\mathcal{Y}|} \le f_j(x) \le \frac{|\mathcal{Y}|-1}{|\mathcal{Y}|}$, $\forall j \in \{1, \dots, |\mathcal{Y}|\}$, with a constraint that $\sum_j f_j(x) = 0$. Based on this restriction, we focus on the minimization of $\mathbb{E}\left[\mathrm{AL}^{0\text{-}1}_f(X, Y)|X = x\right]$ subject to $-\frac{1}{|\mathcal{Y}|} \le f_j(x) \le \frac{|\mathcal{Y}|-1}{|\mathcal{Y}|}$, $\forall j \in \{1, \dots, |\mathcal{Y}|\}$, and the sum of potentials equal to zero. This minimization reduces to the following optimization:

$$\max_f \sum_{y=1}^{|\mathcal{Y}|} P_y(x) f_y(x) \quad \text{subject to: } -\frac{1}{|\mathcal{Y}|} \le f_j(x) \le \frac{|\mathcal{Y}|-1}{|\mathcal{Y}|}, \; j \in \{1, \dots, |\mathcal{Y}|\}; \quad \sum_{j=1}^{|\mathcal{Y}|} f_j(x) = 0.$$

The solution of this maximization (a linear program) satisfies $f^*_j(x) = \frac{|\mathcal{Y}|-1}{|\mathcal{Y}|}$ if $j = \operatorname{argmax}_j P_j(x)$, and $-\frac{1}{|\mathcal{Y}|}$ otherwise, which therefore implies the Fisher consistency theorem.

Theorem 3. The adversarial zero-one loss, AL0-1, from Eq. (4) is Fisher consistent.

Theorem 3 implies that AL0-1 (Eq. (4)) is classification calibrated, which indicates that minimization of that loss for all distributions on $\mathcal{X} \times \mathcal{Y}$ also minimizes the zero-one loss [21, 13]. As proven in general by Steinwart and Christmann [2] and Micchelli et al. [27], since AL0-1 (Eq. (4)) is a Lipschitz loss with constant 1, the adversarial multiclass classifier is universally consistent under the conditions specified in Corollary 1.

Corollary 1. Given a universal kernel and regularization parameter $\lambda$ in Eq. (1) tending to zero slower than $\frac{1}{n}$, the adversarial multiclass classifier is also universally consistent.

3.3 Optimization

In the learning process for adversarial classification, Asif et al. [16] require a linear program to be solved that finds the Nash equilibrium game value and strategy for every training data point in each gradient update. 
This requirement is computationally burdensome compared to multiclass SVMs, which must simply find potential-maximizing labels. We propose two approaches with improved efficiency by leveraging an oracle for finding the maximization inside AL0-1 and Lagrange duality in the quadratic programming formulation.

3.3.1 Primal optimization using stochastic sub-gradient descent

The sub-gradient in the empirical risk minimization of AL0-1 includes the mean of feature differences, $\frac{1}{|R|} \sum_{j \in R} \left[\phi(x_i, j) - \phi(x_i, y_i)\right]$, where $R$ is the set that maximizes AL0-1. The set $R$ is computed by the oracle using a greedy algorithm. Given $\theta$ and a sample $(x_i, y_i)$, the algorithm calculates all potentials $\psi_{j,y_i}(x_i)$ for each label $j \in \{1, \dots, |\mathcal{Y}|\}$ and sorts them in non-increasing order. Starting with the empty set $R = \emptyset$, it then adds labels to $R$ in sorted order until adding a label would decrease the value of $\frac{\sum_{j \in R} \psi_{j,y_i}(x_i) + |R| - 1}{|R|}$.

Theorem 4. The proposed greedy algorithm used by the oracle is optimal.

3.3.2 Dual optimization

In the next subsections, we focus on the dual optimization technique as it enables us to establish convergence guarantees. We re-formulate the learning algorithm (with L2 regularization) as a constrained quadratic program (QP) with $\xi_i$ specifying the amount of AL0-1 incurred by each of the $n$ training examples:

$$\min_{\theta} \; \frac{1}{2}\|\theta\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to: } \xi_i \ge \Delta_{i,k} \;\; \forall i \in \{1, \dots, n\},\; k \in \{1, \dots, 2^{|\mathcal{Y}|} - 1\}, \quad (5)$$

where we denote each of the $2^{|\mathcal{Y}|} - 1$ possible constraints for example $i$ corresponding to non-empty elements of the label powerset as $\Delta_{i,k}$ (e.g., $\Delta_{i,1} = \psi_{1,y_i}(x_i)$, and $\Delta_{i,2^{|\mathcal{Y}|}-1} = \frac{\sum_{j \in \mathcal{Y}} \psi_{j,y_i}(x_i) + |\mathcal{Y}| - 1}{|\mathcal{Y}|}$). Note also that non-negativity for $\xi_i$ is enforced since $\Delta_{i,y_i} = \psi_{y_i,y_i}(x_i) = 0$.

Theorem 5. Let $\Lambda_{i,k}$ be the partial derivative of $\Delta_{i,k}$ with respect to $\theta$, i.e., $\Lambda_{i,k} = \frac{d\Delta_{i,k}}{d\theta}$, and $\nu_{i,k}$ the constant part of $\Delta_{i,k}$ (for example, if $\Delta_{i,k} = \frac{\psi_{1,y_i}(x_i) + \psi_{3,y_i}(x_i) + \psi_{4,y_i}(x_i) + 2}{3}$, then $\nu_{i,k} = \frac{2}{3}$); then the corresponding dual optimization for the primal minimization (Eq. 5) is:

$$\max_{\alpha} \; \sum_{i=1}^{n} \sum_{k=1}^{2^{|\mathcal{Y}|}-1} \nu_{i,k}\, \alpha_{i,k} - \frac{1}{2} \sum_{i,j=1}^{n} \sum_{k,l=1}^{2^{|\mathcal{Y}|}-1} \alpha_{i,k}\, \alpha_{j,l} \left[\Lambda_{i,k} \cdot \Lambda_{j,l}\right] \quad (6)$$
$$\text{subject to: } \alpha_{i,k} \ge 0, \;\; \sum_{k=1}^{2^{|\mathcal{Y}|}-1} \alpha_{i,k} = C, \;\; i \in \{1, \dots, n\},\; k \in \{1, \dots, 2^{|\mathcal{Y}|} - 1\},$$

where $\alpha_{i,k}$ is the dual variable for the $k$-th constraint of the $i$-th sample.

Note that the dual formulation above only depends on the dot product of two constraints' partial derivatives (with respect to $\theta$) and the constant part of the constraints. The original primal variable $\theta$ can be recovered from the dual variables using the formula $\theta = -\sum_{i=1}^{n} \sum_{k=1}^{2^{|\mathcal{Y}|}-1} \alpha_{i,k} \Lambda_{i,k}$. Given a new datapoint $x$, de-randomized predictions are obtained from $\operatorname{argmax}_j f_j(x) = \operatorname{argmax}_j \theta^{\mathsf{T}} \phi(x, j)$.

3.3.3 Efficiently incorporating rich feature spaces using kernelization

Considering large feature spaces is important for developing an expressive classifier that can learn from large amounts of training data. 
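Stepping back to the primal optimization of Section 3.3.1, the greedy oracle can be sketched as follows (an illustrative standalone implementation with our own naming conventions; Theorem 4 justifies stopping at the first decrease of the objective):

```python
def greedy_oracle(psi):
    """Find the non-empty label set R maximizing the value inside AL0-1,
    (sum_{j in R} psi_j + |R| - 1) / |R|.

    psi: potential differences psi_j = f_j(x) - f_y(x).
    Labels are considered in non-increasing order of psi_j and added
    while doing so does not decrease the objective.  Returns (R, value).
    """
    order = sorted(range(len(psi)), key=lambda j: psi[j], reverse=True)
    R, best = [order[0]], float(psi[order[0]])
    for j in order[1:]:
        total = sum(psi[m] for m in R) + psi[j]
        # With |R|+1 labels the objective is (total + (|R|+1) - 1) / (|R|+1).
        value = (total + len(R)) / (len(R) + 1)
        if value >= best:
            R.append(j)
            best = value
        else:
            break  # objective is unimodal in |R| along the sorted order
    return set(R), best
```

The set returned here is the $R$ used in the sub-gradient $\frac{1}{|R|} \sum_{j \in R} [\phi(x_i, j) - \phi(x_i, y_i)]$, and the same routine serves to find the most violated constraint during constraint generation.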
Indeed, Fisher consistency requires such feature spaces for its guarantees to be meaningful. However, naïvely projecting from the original input space, $x_i$, to richer (or possibly infinite) feature spaces, $\omega(x_i)$, can be computationally burdensome. Kernel methods enable this feature expansion by allowing the dot products of certain feature functions to be computed implicitly, i.e., $K(x_i, x_j) = \omega(x_i) \cdot \omega(x_j)$. Since our dual formulation only depends on dot products, we employ kernel methods to incorporate rich feature spaces into our formulation as stated in the following theorem.

Theorem 6. Let $\mathcal{X}$ be the input space and $K$ be a positive definite real valued kernel on $\mathcal{X} \times \mathcal{X}$ with a mapping function $\omega(x): \mathcal{X} \to \mathcal{H}$ that maps the input space $\mathcal{X}$ to a reproducing kernel Hilbert space $\mathcal{H}$. Then all the values in the dual optimization of Eq. (6) needed to operate in the Hilbert space $\mathcal{H}$ can be computed in terms of the kernel function $K(x_i, x_j)$ as:

$$\Lambda_{i,k} \cdot \Lambda_{j,l} = c_{(i,k),(j,l)}\, K(x_i, x_j), \qquad \Delta_{i,k} = -\sum_{j=1}^{n} \sum_{l=1}^{2^{|\mathcal{Y}|}-1} \alpha_{j,l}\, c_{(j,l),(i,k)}\, K(x_j, x_i) + \nu_{i,k}, \quad (7)$$

$$f_m(x_i) = -\sum_{j=1}^{n} \sum_{l=1}^{2^{|\mathcal{Y}|}-1} \alpha_{j,l} \left[\left(\frac{1(m \in R_{j,l})}{|R_{j,l}|} - 1(m = y_j)\right) K(x_j, x_i)\right], \quad (8)$$

$$\text{where } c_{(i,k),(j,l)} = \sum_{m=1}^{|\mathcal{Y}|} \left(\frac{1(m \in R_{i,k})}{|R_{i,k}|} - 1(m = y_i)\right) \left(\frac{1(m \in R_{j,l})}{|R_{j,l}|} - 1(m = y_j)\right),$$

and $R_{i,k}$ is the set of labels included in the constraint $\Delta_{i,k}$ (for example, if $\Delta_{i,k} = \frac{\psi_{1,y_i}(x_i) + \psi_{3,y_i}(x_i) + \psi_{4,y_i}(x_i) + 2}{3}$, then $R_{i,k} = \{1, 3, 4\}$), the function $1(j = y_i)$ returns 1 if $j = y_i$ or 0 otherwise, and the function $1(j \in R_{i,k})$ returns 1 if $j$ is a member of set $R_{i,k}$ or 0 otherwise.

3.3.4 Efficient optimization using constraint generation

The number of constraints in the QP formulation above grows exponentially with the number of classes: $O(2^{|\mathcal{Y}|})$. This prevents the naïve formulation from being efficient for large multiclass problems. We employ a constraint generation method to efficiently solve the dual quadratic programming formulation that is similar to those used for extending the SVM to multivariate loss functions [28] and structured prediction settings [29].

Algorithm 1 Constraint generation method
Require: Training data $(x_1, y_1), \dots, (x_n, y_n)$; $C$; $\epsilon$
1: $\theta \leftarrow 0$
2: $A^*_i \leftarrow \{\Delta_{i,k} \mid \Delta_{i,k} = \psi_{y_i,y_i}(x_i)\}$ $\forall i = 1, \dots, n$  ▷ Actual label enforces non-negativity
3: repeat
4:   for $i \leftarrow 1, n$ do
5:     $a \leftarrow \operatorname{argmax}_{k \mid \Delta_{i,k} \in A_i} \Delta_{i,k}$  ▷ Find the most violated constraint
6:     $\xi_i \leftarrow \max_{k \mid \Delta_{i,k} \in A^*_i} \Delta_{i,k}$  ▷ Compute the example's current loss estimate
7:     if $\Delta_{i,a} > \xi_i + \epsilon$ then
8:       $A^*_i \leftarrow A^*_i \cup \{\Delta_{i,a}\}$  ▷ Add it to the enforced constraints set
9:       $\alpha \leftarrow$ Optimize dual over $A^* = \cup_i A^*_i$
10:      Compute $\theta$ from $\alpha$: $\theta = -\sum_{i=1}^{n} \sum_{k \mid \Delta_{i,k} \in A^*_i} \alpha_{i,k} \Lambda_{i,k}$
11:    end if
12:  end for
13: until no $A^*_i$ has changed in the iteration

Algorithm 1 incrementally expands the set of enforced constraints, $A^*_i$, until no remaining constraint from the set of all $2^{|\mathcal{Y}|} - 1$ constraints (in $A_i$) is violated by more than $\epsilon$. To obtain the most violated constraint, we use the greedy algorithm described in the primal optimization. The constraint generation algorithm's stopping criterion ensures that a solution close to the optimal is returned (violating no constraint by more than $\epsilon$). Theorem 7 provides a polynomial run time convergence bound for Algorithm 1.

Theorem 7. For any $\epsilon > 0$ and training dataset $\{(x_1, y_1), \dots, (x_n, y_n)\}$ with $U = \max_i [x_i \cdot x_i]$, Algorithm 1 terminates after incrementally adding at most $\max\left\{\frac{2n}{\epsilon}, \frac{4nCU}{\epsilon^2}\right\}$ constraints to the constraint set $A^*$.

The proof of Theorem 7 follows the procedures developed by Tsochantaridis et al. [28] for bounding the running time of structured support vector machines. We observe that this bound is quite loose in practice and the algorithm tends to converge much faster in our experiments.

4 Experiments

We evaluate the performance of the AL0-1 classifier and compare with the three most popular multiclass SVM formulations: WW [11], CS [10], and LLW [12]. We use 12 datasets from the UCI Machine Learning repository [30] with various sizes and numbers of classes (details in Table 1). For each dataset, we consider the methods using the original feature space (linear kernel) and a kernelized feature space using the Gaussian radial basis function kernel.

Table 1: Properties of the datasets, the number of constraints considered by the SVM models (WW/CS/LLW), and the average number of constraints added to the constraint set for AL0-1 and the average number of active constraints at the optimum, under both linear and Gaussian kernels.

| # | Dataset | # train | # test | # class | # feature | SVM constraints | AL0-1 added (lin.) | AL0-1 active (lin.) | AL0-1 added (Gauss.) | AL0-1 active (Gauss.) |
|---|---|---|---|---|---|---|---|---|---|---|
| (1) | iris | 105 | 45 | 3 | 4 | 210 | 223 | 13 | 213 | 38 |
| (2) | glass | 149 | 65 | 6 | 9 | 745 | 490 | 125 | 578 | 252 |
| (3) | redwine | 1119 | 480 | 10 | 11 | 10071 | 3811 | 1681 | 5995 | 1783 |
| (4) | ecoli | 235 | 101 | 8 | 7 | 1645 | 821 | 117 | 614 | 130 |
| (5) | vehicle | 592 | 254 | 4 | 18 | 1776 | 1201 | 311 | 1310 | 248 |
| (6) | segment | 1617 | 693 | 7 | 19 | 9702 | 4312 | 244 | 4410 | 469 |
| (7) | sat | 4435 | 2000 | 7 | 36 | 26610 | 11860 | 1524 | 11721 | 6269 |
| (8) | optdigits | 3823 | 1797 | 10 | 64 | 34407 | 10072 | 597 | 7932 | 2315 |
| (9) | pageblocks | 3831 | 1642 | 5 | 10 | 15324 | 9155 | 427 | 9459 | 551 |
| (10) | libras | 252 | 108 | 15 | 90 | 3528 | 1165 | 389 | 1592 | 353 |
| (11) | vertebral | 217 | 93 | 3 | 6 | 434 | 342 | 78 | 344 | 86 |
| (12) | breasttissue | 74 | 32 | 6 | 9 | 370 | 271 | 65 | 258 | 145 |

For our experimental methodology, we first make 20 random splits of each dataset into training and testing sets. We then perform two-stage, five-fold cross validation on the training set of the first split to tune each model's parameter $C$ and the kernel parameter $\gamma$ under the kernelized formulation. In the first stage, the values for $C$ are $2^i$, $i \in \{0, 3, 6, 9, 12\}$, and the values for $\gamma$ are $2^i$, $i \in \{-12, -9, -6, -3, 0\}$. We select final values for $C$ from $2^i C_0$, $i \in \{-2, -1, 0, 1, 2\}$, and values for $\gamma$ from $2^i \gamma_0$, $i \in \{-2, -1, 0, 1, 2\}$, in the second stage, where $C_0$ and $\gamma_0$ are the best parameters obtained in the first stage. Using the selected parameters, we train each model on the 20 training sets and evaluate the performance on the corresponding testing sets. We use the Shark machine learning library [31] for the implementation of the three multiclass SVM formulations.

Despite having an exponential number of possible constraints (i.e., $n(2^{|\mathcal{Y}|} - 1)$ for $n$ examples, versus $n(|\mathcal{Y}| - 1)$ for SVMs), a much smaller number of constraints need to be considered by the AL0-1 algorithm in practice to realize a better approximation ($\epsilon = 0$) than Theorem 7 provides. Table 1 shows how the total number of constraints for multiclass SVM compares to the number considered in practice by our AL0-1 algorithm for linear and Gaussian kernel feature spaces. These range from a small fraction (0.23) of the SVM constraints for optdigits to a slightly greater number (with a fraction of 1.06) for iris. 
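The two-stage parameter search described in the methodology above can be sketched as follows (a schematic only; the hypothetical `evaluate` callback stands in for the five-fold cross-validation accuracy of a trained model):

```python
def two_stage_grid_search(evaluate):
    """Two-stage grid search over (C, gamma).

    evaluate(C, gamma) -> validation score (higher is better).
    Stage 1 scans coarse powers of two; stage 2 refines around the
    stage-1 winner (C0, gamma0) with multiplicative factors 2^-2 .. 2^2.
    """
    stage1 = [(2.0 ** i, 2.0 ** j)
              for i in (0, 3, 6, 9, 12)
              for j in (-12, -9, -6, -3, 0)]
    C0, g0 = max(stage1, key=lambda p: evaluate(*p))
    stage2 = [(C0 * 2.0 ** i, g0 * 2.0 ** j)
              for i in (-2, -1, 0, 1, 2)
              for j in (-2, -1, 0, 1, 2)]
    return max(stage2, key=lambda p: evaluate(*p))
```

This keeps the search at $25 + 25$ evaluations instead of a single dense grid over the same range, at the cost of assuming the score surface is well-behaved enough for the coarse stage to land near the optimum.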
More specifically, of the over 3.9 million ($= 2^{10} \cdot 3823$) possible constraints for optdigits when training the classifier, fewer than 0.3% (7932 or 10072, depending on the feature representation) are added to the constraint set during the constraint generation process. Fewer still (597 or 2315 constraints, less than 0.06%) are constraints that are active in the final classifier with non-zero dual parameters. The sparsity of the dual parameters provides a key computational benefit for support vector machines over logistic regression, which has essentially all non-zero dual parameters. The small number of active constraints shown in Table 1 demonstrates that AL0-1 induces similar sparsity, providing efficiency when employed with kernel methods.

We report the accuracy of each method averaged over the 20 dataset splits for both linear feature representations and Gaussian kernel feature representations in Table 2. We denote the results that are either the best of all four methods or not worse than the best with statistical significance (under paired t-test with $\alpha = 0.05$) using bold font. We also show the accuracy averaged over all of the datasets for each method and the number of datasets for which each method is "indistinguishably best" (bold numbers) in the last row.

Table 2: The mean and (in parentheses) standard deviation of the accuracy for each model with linear kernel and Gaussian kernel feature representations. Bold numbers in each case indicate that the result is the best or not significantly worse than the best (paired t-test with $\alpha = 0.05$). (Per-cell bold face is not recoverable here; the "# bold" row gives the counts.)

| D | AL0-1 (lin.) | WW (lin.) | CS (lin.) | LLW (lin.) | AL0-1 (Gauss.) | WW (Gauss.) | CS (Gauss.) | LLW (Gauss.) |
|---|---|---|---|---|---|---|---|---|
| (1) | 96.3 (3.1) | 96.0 (2.6) | 96.3 (2.4) | 79.7 (5.5) | 96.7 (2.4) | 96.4 (2.4) | 96.2 (2.3) | 95.4 (2.1) |
| (2) | 62.5 (6.0) | 62.2 (3.6) | 62.5 (3.9) | 52.8 (4.6) | 69.5 (4.2) | 66.8 (4.3) | 69.4 (4.8) | 69.2 (4.4) |
| (3) | 58.8 (2.0) | 59.1 (1.9) | 56.6 (2.0) | 57.7 (1.7) | 63.3 (1.8) | 64.2 (2.0) | 64.2 (1.9) | 64.7 (2.1) |
| (4) | 86.2 (2.2) | 85.7 (2.5) | 85.8 (2.3) | 74.1 (3.3) | 86.0 (2.7) | 84.9 (2.4) | 85.6 (2.4) | 86.0 (2.5) |
| (5) | 78.8 (2.2) | 78.8 (1.7) | 78.4 (2.3) | 69.8 (3.7) | 84.3 (2.5) | 84.4 (2.6) | 83.8 (2.3) | 84.4 (2.6) |
| (6) | 94.9 (0.7) | 94.9 (0.8) | 95.2 (0.8) | 75.8 (1.5) | 96.5 (0.6) | 96.6 (0.5) | 96.3 (0.6) | 96.4 (0.5) |
| (7) | 84.9 (0.7) | 85.4 (0.7) | 84.7 (0.7) | 74.9 (0.9) | 91.9 (0.5) | 92.0 (0.6) | 91.9 (0.5) | 91.9 (0.4) |
| (8) | 96.6 (0.6) | 96.5 (0.7) | 96.3 (0.6) | 76.2 (2.2) | 98.7 (0.4) | 98.8 (0.4) | 98.8 (0.3) | 98.9 (0.3) |
| (9) | 96.0 (0.5) | 96.1 (0.5) | 96.3 (0.5) | 92.5 (0.8) | 96.8 (0.5) | 96.6 (0.4) | 96.7 (0.4) | 96.6 (0.4) |
| (10) | 74.1 (3.3) | 72.0 (3.8) | 71.3 (4.3) | 34.0 (6.4) | 83.6 (3.8) | 83.8 (3.4) | 85.0 (3.9) | 83.2 (4.2) |
| (11) | 85.5 (2.9) | 85.9 (2.7) | 85.4 (3.3) | 79.8 (5.6) | 86.0 (3.1) | 85.3 (2.9) | 85.5 (3.3) | 84.4 (2.7) |
| (12) | 64.4 (7.1) | 59.7 (7.8) | 66.3 (6.9) | 58.3 (8.1) | 68.4 (8.6) | 68.1 (6.5) | 66.6 (8.9) | 68.0 (7.2) |
| avg | 81.59 | 81.02 | 81.25 | 68.80 | 85.14 | 84.82 | 85.00 | 84.93 |
| # bold | 9 | 6 | 8 | 0 | 9 | 6 | 6 | 7 |

As we can see from the table, the only alternative model that is Fisher consistent, the LLW model, performs poorly on all datasets when only linear features are employed. This matches previous experimental results conducted by Doğan et al. [15] and demonstrates a weakness of using an absolute margin for the loss function (rather than the relative margins of all other methods). 
The AL0-1 classifier performs competitively with the WW and CS models, with a slight advantage in overall average accuracy and a larger number of "indistinguishably best" performances across the datasets, or, equivalently, fewer statistically significant losses to any other method.

The kernel trick in the Gaussian kernel case provides access to much richer feature spaces, improving the performance of all models, and of the LLW model especially. In general, all models provide competitive results in the Gaussian kernel case. The AL0-1 classifier maintains a similarly slight advantage and only provides performance that is sub-optimal (with statistical significance) on three of the twelve datasets, versus six of twelve and five of twelve for the other methods. We conclude that the multiclass adversarial method performs well in both low and high dimensional feature spaces. Recalling the theoretical analysis of the adversarial method, it is a well-motivated (from the adversarial zero-one loss minimization) multiclass classifier that enjoys both strong theoretical properties (Fisher consistency and universal consistency) and strong empirical performance.

5 Conclusion

Generalizing support vector machines to multiclass settings in a theoretically sound manner remains a long-standing open problem. Though the loss function requirements guaranteeing Fisher consistency are well understood [13], the few Fisher-consistent classifiers that have been developed (e.g., LLW) are often not competitive with inconsistent multiclass classifiers in practice. In this paper, we have sought to fill this gap between theory and practice. 
We have demonstrated that multiclass adversarial classification under the zero-one loss can be recast from an empirical risk minimization perspective, and we have shown that its surrogate loss, AL0-1, satisfies the Fisher consistency property, yielding a universally consistent classifier that also performs well in practice. We believe that this is an important contribution to understanding both adversarial methods and the generalized hinge loss. Our future work includes investigating adversarial methods under different losses and exploring other theoretical properties of the adversarial framework, including generalization bounds.

Acknowledgments

This research was supported as part of the Future of Life Institute (futureoflife.org) FLI-RFP-AI1 program, grant #2016-158710, and by NSF grant RI-#1526379.

References

[1] Vladimir Vapnik. Principles of risk minimization for learning theory. In Advances in Neural Information Processing Systems, pages 831–838, 1992.
[2] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Publishing Company, Incorporated, 1st edition, 2008. ISBN 0387772413.
[3] Peter McCullagh and John A. Nelder. Generalized Linear Models, volume 37. CRC Press, 1989.
[4] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Workshop on Computational Learning Theory, pages 144–152, 1992.
[5] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[6] Yi Lin. Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery, 6(3):259–275, 2002.
[7] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
[8] Ingo Steinwart. Support vector machines are universally consistent.
Journal of Complexity, 18(3):768–791, 2002.
[9] Ingo Steinwart. Consistency of support vector machines and other regularized kernel classifiers. IEEE Transactions on Information Theory, 51(1):128–142, 2005.
[10] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. The Journal of Machine Learning Research, 2:265–292, 2002.
[11] Jason Weston and Chris Watkins. Support vector machines for multi-class pattern recognition. In ESANN, volume 99, pages 219–224, 1999.
[12] Yoonkyung Lee, Yi Lin, and Grace Wahba. Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99(465):67–81, 2004.
[13] Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. The Journal of Machine Learning Research, 8:1007–1025, 2007.
[14] Yufeng Liu. Fisher consistency of multicategory support vector machines. In International Conference on Artificial Intelligence and Statistics, pages 291–298, 2007.
[15] Ürün Doğan, Tobias Glasmachers, and Christian Igel. A unified view on multi-class support vector classification. Journal of Machine Learning Research, 17(45):1–32, 2016.
[16] Kaiser Asif, Wei Xing, Sima Behpour, and Brian D. Ziebart. Adversarial cost-sensitive classification. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2015.
[17] Hong Wang, Wei Xing, Kaiser Asif, and Brian Ziebart. Adversarial prediction games for multivariate losses. In Advances in Neural Information Processing Systems, pages 2710–2718, 2015.
[18] Flemming Topsøe. Information theoretical optimization techniques. Kybernetika, 15(1):8–27, 1979.
[19] Peter D. Grünwald and A. Philip Dawid.
Game theory, maximum entropy, minimum discrepancy, and robust Bayesian decision theory. Annals of Statistics, 32:1367–1433, 2004.
[20] Farzan Farnia and David Tse. A minimax approach to supervised learning. In Advances in Neural Information Processing Systems, pages 4233–4241, 2016.
[21] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Large margin classifiers: Convex loss, low noise, and convergence rates. In Advances in Neural Information Processing Systems, pages 1173–1180, 2003.
[22] Naiyang Deng, Yingjie Tian, and Chunhua Zhang. Support Vector Machines: Optimization Based Theory, Algorithms, and Extensions. CRC Press, 2012.
[23] Nilesh Dalvi, Pedro Domingos, Mausam, Sumit Sanghai, and Deepak Verma. Adversarial classification. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, pages 99–108. ACM, 2004.
[24] Anqi Liu and Brian Ziebart. Robust classification under sample selection bias. In Advances in Neural Information Processing Systems, pages 37–45, 2014.
[25] Gert R. G. Lanckriet, Laurent El Ghaoui, Chiranjib Bhattacharyya, and Michael I. Jordan. A robust minimax approach to classification. The Journal of Machine Learning Research, 3:555–582, 2003.
[26] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[27] Charles A. Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. Journal of Machine Learning Research, 6:2651–2667, 2006.
[28] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484, 2005.
[29] Thorsten Joachims. A support vector method for multivariate performance measures.
In Proceedings of the International Conference on Machine Learning, pages 377–384, 2005.
[30] M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.
[31] Christian Igel, Verena Heidrich-Meisner, and Tobias Glasmachers. Shark. Journal of Machine Learning Research, 9:993–996, 2008.