{"title": "Nash Equilibria of Static Prediction Games", "book": "Advances in Neural Information Processing Systems", "page_first": 171, "page_last": 179, "abstract": "The standard assumption of identically distributed training and test data can be violated when an adversary can exercise some control over the generation of the test data. In a prediction game, a learner produces a predictive model while an adversary may alter the distribution of input data. We study single-shot prediction games in which the cost functions of learner and adversary are not necessarily antagonistic. We identify conditions under which the prediction game has a unique Nash equilibrium, and derive algorithms that will find the equilibrial prediction models. In a case study, we explore properties of Nash-equilibrial prediction models for email spam filtering empirically.", "full_text": "Nash Equilibria of Static Prediction Games\n\nMichael Br\u00fcckner\nDepartment of Computer Science\nUniversity of Potsdam, Germany\nmibrueck@cs.uni-potsdam.de\n\nTobias Scheffer\nDepartment of Computer Science\nUniversity of Potsdam, Germany\nscheffer@cs.uni-potsdam.de\n\nAbstract\n\nThe standard assumption of identically distributed training and test data is violated\nwhen an adversary can exercise some control over the generation of the test data.\nIn a prediction game, a learner produces a predictive model while an adversary\nmay alter the distribution of input data. We study single-shot prediction games in\nwhich the cost functions of learner and adversary are not necessarily antagonistic.\nWe identify conditions under which the prediction game has a unique Nash equilibrium,\nand derive algorithms that will find the equilibrial prediction models. 
In a\ncase study, we explore properties of Nash-equilibrial prediction models for email\nspam filtering empirically.\n\n1 Introduction\n\nThe assumption that training and test data are governed by identical distributions underlies many\npopular learning mechanisms. In a variety of applications, however, data at application time are\ngenerated by an adversary whose interests are in conflict with those of the learner. In computer\nand network security, fraud detection, and drug design, the distribution of data is changed \u2013 by a\nmalevolent individual or under selective pressure \u2013 in response to the predictive model.\nAn adversarial interaction between learner and data generator can be modeled as a single-shot game\nin which one player controls the predictive model whereas the other player exercises some control\nover the distribution of the input data. The optimal action for either player generally depends on\nboth players\u2019 moves.\nThe minimax strategy minimizes the costs under the worst possible move of the opponent. This\nstrategy is motivated for an opponent whose goal is to inflict the highest possible costs on the learner;\nit can also be applied when no information about the interests of the adversary is available. Lanckriet\net al. [1] study the so-called Minimax Probability Machine. This classifier minimizes the maximal\nprobability of misclassifying new instances for a given mean and covariance matrix of each class.\nEl Ghaoui et al. [2] study a minimax model for input data that are known to lie within some\nhyperrectangle. Their solution minimizes the worst-case loss over all possible choices of the data in these\nintervals. Similarly, minimax solutions to classification games in which the adversary deletes input\nfeatures or performs a feature transformation have been studied [3, 4, 5]. 
These studies show that\nthe minimax solution outperforms a learner that naively minimizes the costs on the training data\nwithout taking the adversary into account.\nWhen rational opponents aim at minimizing their personal costs, then the minimax solution is overly\npessimistic. A Nash equilibrium is a pair of actions chosen such that no player gains a benefit by\nunilaterally selecting a different action. If a game has a unique Nash equilibrium, it is the strongest\navailable concept of an optimal strategy in a game against a rational opponent. If, however, multiple\nequilibria exist and the players choose their action according to distinct ones, then the resulting\ncombination may be arbitrarily disadvantageous for either player. It is therefore interesting to study\nwhether adversarial prediction games have a unique Nash equilibrium.\nWe study games in which both players \u2013 learner and adversary \u2013 have cost functions that consist of\ndata-dependent loss and regularizer. In contrast to prior results, we do not assume that the players\u2019\ncost functions are antagonistic. As an example, consider that a spam filter may minimize the error\nrate whereas a spam sender may aim at maximizing revenue solicited by spam emails. These criteria\nare conflicting, but not the exact negatives of each other. We study under which conditions unique\nNash equilibria exist and derive algorithms for identifying them.\nThe rest of this paper is organized as follows. Section 2 introduces the problem setting and defines\naction spaces and cost functions. We study the existence of a unique Nash equilibrium and derive an\nalgorithm that finds it under defined conditions in Section 3. Section 4 discusses antagonistic loss\nfunctions. For this case, we derive an algorithm that finds a unique Nash equilibrium whenever it\nexists. 
Section 5 reports on experiments on email spam filtering; Section 6 concludes.\n\n2 Modeling the Game\nWe study prediction games between a learner ($v = +1$) and an adversary ($v = -1$). We consider\nstatic infinite games. Static or single-shot game means that players make decisions simultaneously;\nneither player has information about the opponent\u2019s decisions. Infinite refers to continuous cost\nfunctions that leave players with infinitely many strategies to choose from. We constrain the players\nto select pure (i.e., deterministic) strategies. Mixed strategies and extensive-form games such as\nStackelberg, Cournot, Bertrand, and repeated games are not within the scope of this work.\nBoth players can access an input matrix of training instances $X$ with outputs $y$, drawn according\nto a probability distribution $q(X, y) = \\prod_{i=1}^n q(x_i, y_i)$. The learner\u2019s action $a_{+1} \\in A_{+1}$ now is\nto choose parameters of a linear model $h_{a_{+1}}(x) = a_{+1}^T x$. Simultaneously, the adversary chooses\na transformation function $\\phi_{a_{-1}}$ that maps any input matrix $X$ to an altered matrix $\\phi_{a_{-1}}(X)$. This\ntransformation induces a transition from input distribution $q$ to test distribution $q_{\\mathrm{test}}$ with\n$q(X, y) = q_{\\mathrm{test}}(\\phi_{a_{-1}}(X), y) = \\prod_{i=1}^n q_{\\mathrm{test}}(\\phi_{a_{-1}}(X)_i, y_i)$. Our main result uses a model that implements\ntransformations as matrices $a_{-1} \\in A_{-1} \\subseteq \\mathbb{R}^{m \\times n}$. Transformation $\\phi_{a_{-1}}(X) = X + a_{-1}$ adds\nperturbation matrix $a_{-1}$ to input matrix $X$, i.e., input pattern $x_i$ is subjected to a perturbation vector\n$a_{-1,i}$. If, for instance, inputs are word vectors, the perturbation matrix adds and deletes words.\nThe possible moves $a = [a_{+1}, a_{-1}]$ constitute the joint action space $A = A_{+1} \\times A_{-1}$ which is\nassumed to be nonempty, compact, and convex. Action spaces $A_v$ are parameters of the game. 
For\ninstance, in spam filtering it is appropriate to constrain $A_{-1}$ such that perturbation matrices contain\nzero vectors for non-spam messages; this reflects that spammers can only alter spam messages.\nEach pair of actions $a$ incurs costs of $\\theta_{+1}(a)$ and $\\theta_{-1}(a)$, respectively, for the players. Each player\nhas an individual loss function $\\ell_v(y', y)$ where $y'$ is the value of decision function $h_{a_{+1}}$ and $y$\nis the true label. Section 4 will discuss antagonistic loss functions $\\ell_{+1} = -\\ell_{-1}$. However, our\nmain contribution in Section 3 regards non-antagonistic loss functions. For instance, a learner may\nminimize the zero-one loss whereas the adversary may focus on the lost revenue.\nBoth players aim at minimizing their loss over the test distribution $q_{\\mathrm{test}}$. But, since $q$ and\nconsequently $q_{\\mathrm{test}}$ are unknown, the cost functions are regularized empirical loss functions over the\nsample $\\phi_{a_{-1}}(X)$ which reflects test distribution $q_{\\mathrm{test}}$. Equation 1 defines either player\u2019s cost\nfunction as player-specific loss plus regularizer. The learner\u2019s regularizer $\\Omega_{a_{+1}}$ will typically regularize\nthe capacity of $h_{a_{+1}}$. Regularizer $\\Omega_{a_{-1}}$ controls the amount of distortion that the adversary may\ninflict on the data and thereby the extent to which an information payload has to be preserved.\n\n$$\\theta_v(a_v, a_{-v}) = \\sum_{i=1}^n \\ell_v(h_{a_{+1}}(\\phi_{a_{-1}}(X)_i), y_i) + \\Omega_{a_v} \\tag{1}$$\n\nEach player\u2019s cost function depends on the opponent\u2019s parameter. In general, there is no value $a_v$\nthat minimizes $\\theta_v(a_v, a_{-v})$ independently of the opponent\u2019s choice of $a_{-v}$. The minimax solution\n$\\arg\\min_{a_v} \\max_{a_{-v}} \\theta_v(a_v, a_{-v})$ minimizes the costs under the worst possible move of the opponent.\nThis solution is optimal for a malicious opponent whose goal is to inflict maximally high costs on\nthe learner. 
In the absence of any information on the opponent\u2019s goals, the minimax solution still gives\nthe lowest upper bound on the learner\u2019s costs over all possible strategies of the opponent.\nIf both players \u2013 learner and adversary \u2013 behave rationally in the sense of minimizing their personal\ncosts, then the Nash equilibrium is the strongest available concept of an optimal choice of $a_v$. A\nNash equilibrium is defined as a pair of actions $a^* = [a^*_{+1}, a^*_{-1}]$ such that no player can benefit\nfrom changing the strategy unilaterally. That is, for both players $v \\in \\{-1, +1\\}$,\n\n$$\\theta_v(a^*_v, a^*_{-v}) = \\min_{a_v \\in A_v} \\theta_v(a_v, a^*_{-v}). \\tag{2}$$\n\nThe Nash equilibrium has several catches. Firstly, if the adversary behaves irrationally in the sense\nof inflicting high costs on the other player at the expense of incurring higher personal costs, then\nchoosing an action according to the Nash equilibrium may result in higher costs than the minimax\nsolution. Secondly, a game may not have an equilibrium point. Thirdly, if an equilibrium point exists, the\ngame may possess multiple equilibria. If $a^* = [a^*_{+1}, a^*_{-1}]$ and $a' = [a'_{+1}, a'_{-1}]$ are distinct\nequilibria, and each player decides to act according to one of them, then a combination $[a^*_v, a'_{-v}]$\nmay be a poor joint strategy and may give rise to higher costs than a worst-case solution. However,\nif a unique Nash equilibrium exists and both players seek to minimize their individual costs, then\nthe Nash equilibrium is guaranteed to be the optimal move.\n\n3 Solution for Convex Loss Functions\n\nIn this section, we study the existence of a unique Nash equilibrium of prediction games with\ncost functions as in Equation 1. We derive an algorithm that identifies the unique equilibrium if\nsufficient conditions are met. 
We consider regularized player-specific loss functions $\\ell_v(y', y)$ which\nare not assumed to satisfy the antagonicity criterion $\\ell_{+1} = -\\ell_{-1}$. Both loss functions are, however,\nrequired to be convex and twice differentiable, and we assume strictly convex regularizers $\\Omega_{a_v}$\nsuch as the $l_2$-norm regularizer. Player- and instance-specific costs may be attached to the loss\nfunctions; however, we omit such cost factors for greater notational harmony. This section\u2019s main\nresult is that if both loss functions are monotonic in $y'$ with different monotonicities \u2013 that is, one is\nmonotonically increasing, and one is decreasing for any fixed $y$ \u2013 then the game has a unique Nash\nequilibrium that can be found efficiently.\n\nTheorem 1. Let the cost functions be defined as in Equation 1 with strictly convex regularizers $\\Omega_{a_v}$,\nlet action spaces $A_v$ be nonempty, compact, and convex subsets of finite-dimensional Euclidean\nspaces. If for any fixed $y$, both loss functions $\\ell_v(y', y)$ are monotonic in $y' \\in \\mathbb{R}$ with distinct\nmonotonicity, convex in $y'$, and twice differentiable in $y'$, then a unique Nash equilibrium exists.\n\nProof. The players\u2019 regularizers $\\Omega_{a_v}$ are strictly convex, and both loss functions\n$\\ell_v(h_{a_{+1}}(\\phi_{a_{-1}}(X)_i), y_i)$ are convex and twice differentiable in $a_v \\in A_v$ for any fixed $a_{-v} \\in A_{-v}$.\nHence, both cost functions $\\theta_v$ are continuously differentiable and strictly convex, and according to\nTheorem 4.3 in [6], at least one Nash equilibrium exists. As each player has its own nonempty,\ncompact, and convex action space $A_v$, Theorem 2 of [7] applies as well; that is, if function\n\n$$\\sigma_r(a) = r\\theta_{+1}(a_{+1}, a_{-1}) + (1 - r)\\theta_{-1}(a_{+1}, a_{-1}) \\tag{3}$$\n\nis diagonally strictly convex in $a$ for some fixed $0 < r < 1$, then a unique Nash equilibrium exists.\nA sufficient condition for $\\sigma_r(a)$ to be diagonally strictly convex is that matrix $J_r(a)$ in Equation 4\nis positive definite for any $a \\in A$ (see Theorem 6 in [7]). This matrix\n\n$$J_r(a) = \\begin{bmatrix} r\\nabla^2_{a_{+1}a_{+1}}\\theta_{+1}(a) & r\\nabla^2_{a_{+1}a_{-1}}\\theta_{+1}(a) \\\\ (1 - r)\\nabla^2_{a_{-1}a_{+1}}\\theta_{-1}(a) & (1 - r)\\nabla^2_{a_{-1}a_{-1}}\\theta_{-1}(a) \\end{bmatrix} \\tag{4}$$\n\nis the Jacobian of the pseudo-gradient of $\\sigma_r(a)$, that is,\n\n$$g_r(a) = \\begin{bmatrix} r\\nabla_{a_{+1}}\\theta_{+1}(a) \\\\ (1 - r)\\nabla_{a_{-1}}\\theta_{-1}(a) \\end{bmatrix}. \\tag{5}$$\n\nWe want to show that $J_r(a)$ is positive definite for some fixed $r$ if both loss functions $\\ell_v(y', y)$\nhave distinct monotonicity and are convex in $y'$. Let $\\ell'_v(y', y)$ be the first and $\\ell''_v(y', y)$ be the\nsecond derivative of $\\ell_v(y', y)$ with respect to $y'$. Let $A_i$ denote the matrix where the $i$-th column\nis $a_{+1}$ and all other elements are zero, let $\\Gamma_v$ be the diagonal matrix with diagonal elements\n$\\gamma_{v,i} = \\ell''_v(h_{a_{+1}}(\\phi_{a_{-1}}(X)_i), y_i)$, and we define $\\mu_{v,i} = \\ell'_v(h_{a_{+1}}(\\phi_{a_{-1}}(X)_i), y_i)$. Using these\ndefinitions, the Jacobian of Equation 4 can be rewritten,\n\n$$J_r(a) = \\begin{bmatrix} \\phi_{a_{-1}}(X) & 0 \\\\ 0 & A_1 \\\\ \\vdots & \\vdots \\\\ 0 & A_n \\end{bmatrix} \\begin{bmatrix} r\\Gamma_{+1} & r\\Gamma_{+1} \\\\ (1 - r)\\Gamma_{-1} & (1 - r)\\Gamma_{-1} \\end{bmatrix} \\begin{bmatrix} \\phi_{a_{-1}}(X) & 0 \\\\ 0 & A_1 \\\\ \\vdots & \\vdots \\\\ 0 & A_n \\end{bmatrix}^T + \\begin{bmatrix} r\\nabla^2\\Omega_{a_{+1}} & r\\mu_{+1,1} I & \\cdots & r\\mu_{+1,n} I \\\\ (1 - r)\\mu_{-1,1} I & (1 - r)\\nabla^2\\Omega_{a_{-1}} & \\cdots & 0 \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ (1 - r)\\mu_{-1,n} I & 0 & \\cdots & (1 - r)\\nabla^2\\Omega_{a_{-1}} \\end{bmatrix}. \\tag{6}$$\n\nThe eigenvalues of the inner matrix of the first summand in Equation 6 are $r\\gamma_{+1,i} + (1 - r)\\gamma_{-1,i}$ and\nzero. Loss functions $\\ell_v$ are convex in $y'$, that is, both second derivatives $\\ell''_v(y', y)$ are non-negative\nfor any $y'$ and consequently $r\\gamma_{+1,i} + (1 - r)\\gamma_{-1,i} \\ge 0$. Hence, the first summand of Jacobian\n$J_r(a)$ is positive semi-definite for any choice of $0 < r < 1$. Additionally, we can decompose the\nregularizers\u2019 Hessians as follows:\n\n$$\\nabla^2\\Omega_{a_v} = \\lambda_v I + (\\nabla^2\\Omega_{a_v} - \\lambda_v I), \\tag{7}$$\n\nwhere $\\lambda_v$ is the smallest eigenvalue of $\\nabla^2\\Omega_{a_v}$. As the regularizers are strictly convex, $\\lambda_v > 0$ and\nthe second summand in Equation 7 is positive semi-definite. Hence, it suffices to show that matrix\n\n$$\\begin{bmatrix} r\\lambda_{+1} I & r\\mu_{+1,1} I & \\cdots & r\\mu_{+1,n} I \\\\ (1 - r)\\mu_{-1,1} I & (1 - r)\\lambda_{-1} I & \\cdots & 0 \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ (1 - r)\\mu_{-1,n} I & 0 & \\cdots & (1 - r)\\lambda_{-1} I \\end{bmatrix} \\tag{8}$$\n\nis positive definite. We derive the eigenvalues of this matrix which assume only three different\nvalues; these are $(1 - r)\\lambda_{-1}$ and\n\n$$\\frac{1}{2}\\left( r\\lambda_{+1} + (1 - r)\\lambda_{-1} \\pm \\sqrt{(r\\lambda_{+1} - (1 - r)\\lambda_{-1})^2 + 4r(1 - r)\\mu_{+1}^T\\mu_{-1}} \\right). \\tag{9}$$\n\nEigenvalue $(1 - r)\\lambda_{-1}$ is positive by definition. The others are positive if the value under the square\nroot is non-negative and less than $(r\\lambda_{+1} + (1 - r)\\lambda_{-1})^2$. The scalar product $b = \\mu_{+1}^T\\mu_{-1}$ is\nnon-positive as both loss functions $\\ell_v(y', y)$ are monotonic in $y'$ with distinct monotonicity, i.e., both\nderivatives have a different sign for any $y' \\in \\mathbb{R}$ and consequently $b \\le 0$. This implies that the\nvalue under the square root is less than or equal to $(r\\lambda_{+1} - (1 - r)\\lambda_{-1})^2 < (r\\lambda_{+1} + (1 - r)\\lambda_{-1})^2$. In\naddition, $b$ is bounded from below as action spaces $A_v$, and therefore the value of $h_{a_{+1}}(\\phi_{a_{-1}}(X)_i)$,\nis bounded. Let $b = \\inf_{a \\in A} \\mu_{+1}^T\\mu_{-1}$ be such a lower bound with $-\\infty < b \\le 0$. We solve for $r$ such\nthat the value under the square root in Equation 9 attains a non-negative value, that is,\n\n$$0 < r \\le \\frac{(\\lambda_{+1} + \\lambda_{-1})\\lambda_{-1} - 2b - 2\\sqrt{b^2 - \\lambda_{+1}\\lambda_{-1}b}}{(\\lambda_{+1} + \\lambda_{-1})^2 - 4b} \\tag{10}$$\n\nor alternatively\n\n$$\\frac{(\\lambda_{+1} + \\lambda_{-1})\\lambda_{-1} - 2b + 2\\sqrt{b^2 - \\lambda_{+1}\\lambda_{-1}b}}{(\\lambda_{+1} + \\lambda_{-1})^2 - 4b} \\le r < 1. \\tag{11}$$\n\nFor any $\\lambda_{+1}, \\lambda_{-1} > 0$ there are values $r$ that satisfy Inequality 10 or 11 because, for any fixed $b \\le 0$,\n\n$$0 < (\\lambda_{+1} + \\lambda_{-1})\\lambda_{-1} - 2b \\pm 2\\sqrt{b^2 - \\lambda_{+1}\\lambda_{-1}b} < (\\lambda_{+1} + \\lambda_{-1})^2 - 4b. \\tag{12}$$\n\nFor such $r$ all eigenvalues in Equation 9 are strictly positive which completes the proof.\nAccording to Theorem 1, a unique Nash equilibrium exists for suitable loss functions such as the\nsquared hinge loss, logistic loss, etc. To find this equilibrium, we make use of the weighted\nNikaido-Isoda function (Equation 13). Intuitively, $\\Psi_{r_v}(a, b)$ quantifies the weighted sum of the relative cost\nsavings that the players can enjoy by changing from strategy $a_v$ to strategy $b_v$ while their opponent\ncontinues to play $a_{-v}$. Equation 14 defines the value function $V_{r_v}(a)$ as the weighted sum of greatest\npossible cost savings attainable by changing from $a$ to any strategy unilaterally. By these definitions,\n$a^*$ is a Nash equilibrium if, and only if, $V_{r_v}(a^*)$ is a global minimum of the value function with\n$V_{r_v}(a^*) = 0$ for any fixed weights $r_{+1} = r$ and $r_{-1} = 1 - r$, where $0 < r < 1$.\n\n$$\\Psi_{r_v}(a, b) = \\sum_{v \\in \\{+1, -1\\}} r_v (\\theta_v(a_v, a_{-v}) - \\theta_v(b_v, a_{-v})) \\tag{13}$$\n\n$$V_{r_v}(a) = \\max_{b \\in A} \\Psi_{r_v}(a, b) \\tag{14}$$\n\nTo find this global minimum of $V_{r_v}(a)$ we make use of Corollary 3.4 of [8]. 
The weights $r_v$ are\nfixed scaling factors of the players\u2019 objectives which do not affect the Nash equilibrium in\nEquation 2; however, these weights ensure the main condition of Corollary 3.4, that is, the positive\ndefiniteness of the Jacobian $J_r(a)$ in Equation 4. According to this corollary, vector $d = \\hat{b} - a$ is\na descent direction for the value function at any position $a$, where $\\hat{b}$ is the maximizing argument\n$\\hat{b} = \\arg\\max_{b \\in A} \\Psi_{r_v}(a, b)$. In addition, the convexity of $A$ ensures that any point $a + td$ with\n$t \\in [0, 1]$ (i.e., a point between $a$ and $\\hat{b}$) is a valid pair of actions.\n\nAlgorithm 1 Nash Equilibrium of Games with Convex Loss Functions\nRequire: Cost functions $\\theta_v$ as defined in Equation 1 and action spaces $A_v$.\n1: Select initial $a^0 \\in A_{+1} \\times A_{-1}$, set $k := 0$, and choose $r$ that satisfies Inequality 10 or 11.\n2: repeat\n3:   Set $b^k := \\arg\\max_{b \\in A_{+1} \\times A_{-1}} \\Psi_{r_v}(a^k, b)$ where $\\Psi_{r_v}$ is defined in Equation 13.\n4:   Set $d^k := b^k - a^k$.\n5:   Find maximal step size $t^k \\in \\{2^{-l} : l \\in \\mathbb{N}\\}$ with $V_{r_v}(a^k + t^k d^k) \\le V_{r_v}(a^k) - \\epsilon\\|t^k d^k\\|^2$.\n6:   Set $a^{k+1} := a^k + t^k d^k$ and $k := k + 1$.\n7: until $\\|a^k - a^{k-1}\\| \\le \\epsilon$.\n\nAlgorithm 1 exploits these properties and finds the global minimum of $V_{r_v}$ and thereby the unique\nNash equilibrium, under the preconditions of Theorem 1. Convergence follows from the fact that if\nin the $k$-th iteration $d^k = 0$, then $a^k$ is a Nash equilibrium which is unique according to Theorem 1.\nIf $d^k \\ne 0$, then $d^k$ is a descent direction of $V_{r_v}$ at position $a^k$. Together with term $\\epsilon\\|t^k d^k\\|^2$,\nthis ensures $V_{r_v}(a^{k+1}) < V_{r_v}(a^k)$, and as value function $V_{r_v}$ is bounded from below, Algorithm 1\nconverges to the global minimum of $V_{r_v}$. Note that $r$ only controls the convergence rate, but has no\ninfluence on the solution. 
Any value of $r$ that satisfies Inequality 10 or 11 ensures convergence.\n\n4 Solution for Antagonistic Loss Functions\n\nAlgorithm 1 is guaranteed to identify the unique equilibrium if the loss functions are convex, twice\ndifferentiable, and of distinct monotonicities. We will now study the case in which the learner\u2019s cost\nfunction is continuous and convex, and the adversary\u2019s loss function is antagonistic to the learner\u2019s\nloss, that is, $\\ell_{+1} = -\\ell_{-1}$. We abstain from making assumptions about the adversary\u2019s regularizers.\nBecause of the regularizers, the game is still not a zero-sum game. In this setting, a unique Nash\nequilibrium cannot be guaranteed to exist because the adversary\u2019s cost function is not necessarily\nstrictly convex. However, an individual game may still possess a unique Nash equilibrium, and we\ncan derive an algorithm that identifies it whenever it exists.\nThe symmetry of the loss functions simplifies the players\u2019 cost functions in Equation 1 to\n\n$$\\theta_{+1}(a_{+1}, a_{-1}) = \\sum_{i=1}^n \\ell_{+1}(h_{a_{+1}}(\\phi_{a_{-1}}(X)_i), y_i) + \\Omega_{a_{+1}}, \\tag{15}$$\n\n$$\\theta_{-1}(a_{-1}, a_{+1}) = -\\sum_{i=1}^n \\ell_{+1}(h_{a_{+1}}(\\phi_{a_{-1}}(X)_i), y_i) + \\Omega_{a_{-1}}. \\tag{16}$$\n\nEven though the loss functions are antagonistic, the cost functions in Equations 15 and 16 are not,\nunless the players\u2019 regularizers are antagonistic as well. 
Hence, the game is not a zero-sum game.\nHowever, according to Theorem 2, if the game has a unique Nash equilibrium, then this equilibrium\nis a minimax solution of the zero-sum game defined by the joint cost function of Equation 17.\n\nTheorem 2. If the game with cost functions $\\theta_{+1}$ and $\\theta_{-1}$ defined in Equations 15\nand 16 has a unique Nash equilibrium $a^*$, then this equilibrium also satisfies\n$a^* = \\arg\\min_{a_{+1}} \\max_{a_{-1}} \\theta_0(a_{+1}, a_{-1})$ where\n\n$$\\theta_0(a_{+1}, a_{-1}) = \\sum_{i=1}^n \\ell_{+1}(h_{a_{+1}}(\\phi_{a_{-1}}(X)_i), y_i) + \\Omega_{a_{+1}} - \\Omega_{a_{-1}}. \\tag{17}$$\n\nThe proof can be found in the appendix. As a consequence of Theorem 2, we can identify the unique\nNash equilibrium of the game with cost functions $\\theta_{+1}$ and $\\theta_{-1}$, if it exists, by finding the minimax\nsolution of the game with joint cost function $\\theta_0$. The minimax solution is given by\n\n$$a^*_{+1} = \\arg\\min_{a_{+1} \\in A_{+1}} \\max_{a_{-1} \\in A_{-1}} \\theta_0(a_{+1}, a_{-1}). \\tag{18}$$\n\nTo solve this optimization problem, we define $\\hat{\\theta}_0(a_{+1}) = \\theta_0(a_{+1}, \\hat{a}_{-1})$ to be the function of $a_{+1}$\nwhere $\\hat{a}_{-1}$ is set to the value $\\hat{a}_{-1} = \\arg\\max_{a_{-1}} \\theta_0(a_{+1}, a_{-1})$. Since cost function $\\theta_0$ is continuous\nin its arguments, convex in $a_{+1}$, and $A_{-1}$ is a compact set, Danskin\u2019s Theorem [9] implies that $\\hat{\\theta}_0$\nis convex in $a_{+1}$ with gradient\n\n$$\\nabla\\hat{\\theta}_0(a_{+1}) = \\nabla_{a_{+1}}\\theta_0(a_{+1}, \\hat{a}_{-1}). \\tag{19}$$\n\nThe significance of Danskin\u2019s Theorem is that when calculating the gradient $\\nabla_{a_{+1}}\\theta_0(a_{+1}, \\hat{a}_{-1})$ at\nposition $a_{+1}$, argument $\\hat{a}_{-1}$ acts as a constant in the derivative instead of as a function of $a_{+1}$.\nThe convexity of $\\hat{\\theta}_0(a_{+1})$ suggests the gradient descent method implemented in Algorithm 2. It\nidentifies the unique Nash equilibrium of a game with antagonistic loss functions, if it exists, by\nfinding the minimax solution of the game with joint cost function $\\theta_0$.\n\nAlgorithm 2 Nash Equilibrium of Games with Antagonistic Loss Functions\nRequire: Joint cost function $\\theta_0$ as defined in Equation 17 and action spaces $A_v$.\n1: Select initial $a^0_{+1} \\in A_{+1}$ and set $k := 0$.\n2: repeat\n3:   Set $a^k_{-1} := \\arg\\max_{a_{-1} \\in A_{-1}} \\theta_0(a^k_{+1}, a_{-1})$.\n4:   Set $d^k := -\\nabla_{a^k_{+1}}\\theta_0(a^k_{+1}, a^k_{-1})$.\n5:   Find maximal step size $t^k \\in \\{2^{-l} : l \\in \\mathbb{N}\\}$ with $\\theta_0(a^k_{+1} + t^k d^k, a^k_{-1}) \\le \\theta_0(a^k_{+1}, a^k_{-1}) - \\epsilon\\|t^k d^k\\|^2$.\n6:   Set $a^{k+1}_{+1} := a^k_{+1} + t^k d^k$ and $k := k + 1$.\n7:   Project $a^k_{+1}$ to the admissible set $A_{+1}$, if necessary.\n8: until $\\|a^k_{+1} - a^{k-1}_{+1}\\| \\le \\epsilon$\n\nA minimax solution $\\arg\\min_{a_{+1}} \\max_{a_{-1}} \\theta_{+1}(a_{+1}, a_{-1})$ of the learner\u2019s cost function minimizes\nthe learner\u2019s costs when playing against the most malicious opponent; for instance, Invar-SVM [4]\nfinds such a solution. By contrast, the minimax solution $\\arg\\min_{a_{+1}} \\max_{a_{-1}} \\theta_0(a_{+1}, a_{-1})$ of the\njoint cost function as defined in Equation 17 constitutes a Nash equilibrium of the game with cost\nfunctions $\\theta_{+1}$ and $\\theta_{-1}$, defined in Equations 15 and 16. It minimizes the costs for each of two players\nthat seek their personal advantage. 
Algorithmically, Invar-SVM and Algorithm 2 are very similar;\nthe main difference lies in the optimization criteria and the resulting properties of the solution.\n\n5 Experiments\n\nWe study the problem of email spam filtering where the learner tries to identify spam emails while\nthe adversary conceals spam messages in order to penetrate the filter. Our goal is to explore the\nrelative strengths and weaknesses of the proposed Nash models for antagonistic and non-antagonistic\nloss functions and existing baseline methods. We compare a regular SVM, logistic regression, SVM\nwith Invariances (Invar-SVM, [4]), the Nash equilibrium for antagonistic loss functions found by\nidentifying the minimax solution of the joint cost function (Minimax, Algorithm 2), and the Nash\nequilibrium for convex loss functions (Nash, Algorithm 1).\n\nFigure 1: Adversary\u2019s regularization parameter and AUC on test data (private emails).\n\nWe use the logistic loss as the learner\u2019s loss function $\\ell_{+1}(h(x), y) = \\log(1 + e^{-yh(x)})$ for the\nMinimax and the Nash model. Consequently, the adversary\u2019s loss for the Minimax solution is the\nnegative loss of the learner. In the Nash model, we choose $\\ell_{-1}(h(x), y) = \\log(1 + e^{yh(x)})$ which is\na convex approximation of the adversary\u2019s zero-one loss, that is, correct predictions by the learner\nincur high costs for the adversary. We use the additive transformation model $\\phi_{a_{-1}}(X)_i = x_i + a_{-1,i}$\nas defined in Section 2. For spam emails $x_i$, we impose box constraints $-\\frac{1}{2} x_i \\le a_{-1,i} \\le \\frac{1}{2} x_i$ on\nthe adversary\u2019s parameters; for non-spam we set $a_{-1,i} = 0$. That is, the spam sender can only\ntransform spam emails. This model is equivalent to the component-wise scaling model [4] with\nscaling factors between 0.5 and 1.5, and ensures that the adversary\u2019s action space is nonempty,\ncompact, and convex. We use $l_2$-norm regularizers for both players, that is, $\\Omega_{a_v} = \\frac{\\lambda_v}{2}\\|a_v\\|_2^2$ where\n$\\lambda_v$ is the regularization parameter of player $v$. For the Nash model we set $r$ to the mean of the\ninterval defined by Inequality 11, where $b = -\\frac{n}{4}$ is a lower bound for the chosen logistic loss and\nregularization parameters $\\lambda_v$ are identical to the smallest eigenvalues of $\\nabla^2\\Omega_{a_v}$.\nWe use two email corpora: the first contains 65,000 publicly available emails received between 2000\nand 2002 from the Enron corpus, the SpamAssassin corpus, Bruce Guenter\u2019s spam trap, and several\nmailing lists. The second contains 40,000 private emails received between 2000 and 2007. All\nemails are binary word vectors of dimensionality 329,518 and 160,981, respectively. The emails are\nsorted chronologically and tagged with label, date, and size. The preprocessed corpora are available\nfrom the authors. We cannot use a standard TREC corpus because there the delivery dates of the\nspam messages have been fabricated, and our experiments require the correct chronological order.\nOur evaluation protocol is as follows. We use the 6,000 oldest instances as training portion and\nset the remaining emails aside as test instances. We use the area under the ROC curve as a fair\nevaluation metric that is adequate for the application; error bars indicate the standard error. We train\nall methods 20 times for the first experiment and 50 times for the following experiments on a subset\nof 200 messages drawn at random from the training portion and average the AUC values on the test\nset. 
In order to tune both players\u2019 regularization parameters, we conduct a grid search maximizing\nthe AUC for 5-fold cross validation on the training portion.\nIn the first experiment, we explore the impact of the regularization parameter of the transformation\nmodel, i.e., $\\lambda_{-1}$ for our models and $K$ \u2013 the maximal number of alterable attributes \u2013 for Invar-SVM.\nFigure 1 shows the averaged AUC value on the private corpus\u2019 test portion. The crosses indicate the\nparameter values found by the grid search with cross validation on the training data.\nIn the next experiment, we evaluate all methods into the future by processing the test set in\nchronological order. Figure 2 shows that Invar-SVM, Minimax, and the Nash solution outperform the\nregular SVM and logistic regression significantly. For the public data set, Minimax performs slightly\nbetter than Nash; for the private corpus, there is no significant difference between the solutions of\nMinimax and Nash. For both data sets, the $l_2$-regularization gives Minimax and Nash an advantage\nover Invar-SVM. Recall that Minimax refers to the Nash equilibrium for antagonistic loss functions\nfound by solving the minimax problem for the joint cost function (Algorithm 2). In this setting, loss\nfunctions \u2013 but not cost functions \u2013 are antagonistic; hence, Nash cannot gain an advantage over\nMinimax. Figure 2 (right hand side) shows the execution time of all methods. Regular SVM and\nlogistic regression are faster than the game models; the game models behave comparably.\nFinally, we explore a setting with non-antagonistic loss. We weight the loss functions with player-\nand instance-specific factors $c_{v,i}$, that is, $\\ell^c_v(h_{a_{+1}}(\\phi_{a_{-1}}(X)_i), y_i) = c_{v,i}\\ell_v(h_{a_{+1}}(\\phi_{a_{-1}}(X)_i), y_i)$.\nOur model reflects that an email service provider may delete detected spam emails after a latency\nperiod whereas other emails incur storage costs $c_{+1,i}$ proportional to their file size. The spam sender\u2019s\ncosts are $c_{-1,i} = 1$ for all spam instances and $c_{-1,i} = 0$ for all non-spam instances. The classifier\nthreshold balances a trade-off between non-spam recall (fraction of legitimate emails delivered) and\nstorage costs. For a threshold of $-\\infty$, storage costs and non-spam recall are zero for all decision\nfunctions. Likewise, a threshold of $\\infty$ gives a recall of 1, but all emails have to be stored.\nFigure 3 shows this trade-off for all methods. The Nash prediction model behaves most favorably: it\noutperforms all reference methods for almost all threshold values, often by several standard errors.\nInvar-SVM and Minimax cannot reflect differing costs for learner and adversary in their\noptimization criteria and therefore perform worse. Logistic regression and the SVM with costs perform better\nthan their counterparts without costs, but worse than the Nash model.\n\n[Figure 1 panels, each titled \"Amount of Transformation vs. Accuracy\": AUC versus $K$ (SVM, LogReg, Invar-SVM); AUC versus $\\lambda_{-1}$ (SVM, LogReg, Minimax); AUC versus $\\lambda_{-1}$ (SVM, LogReg, Nash).]\n\nFigure 2: Left, center: AUC evaluated into the future after training on past. Right: execution time.\n\nFigure 3: Average storage costs versus non-spam recall.\n\n6 Conclusion\n\nWe studied games in which each player\u2019s cost function consists of a data-dependent loss and a\nregularizer. A learner produces a linear model while an adversary chooses a transformation matrix\nto be added to the data matrix. Our main result regards regularized non-antagonistic loss functions\nthat are convex, twice differentiable, and have distinct monotonicity. In this case, a unique Nash\nequilibrium exists. 
It minimizes the costs of each of two players that aim for their highest personal\nbenefit. We derive an algorithm that identifies the equilibrium under these conditions. For the case\nof antagonistic loss functions with arbitrary regularizers, a unique Nash equilibrium may or may\nnot exist. We derive an algorithm that finds the unique Nash equilibrium, if it exists, by solving a\nminimax problem on a newly derived joint cost function.\nWe evaluate spam filters derived from the different optimization problems on chronologically\nordered future emails. We observe that game models outperform the reference methods. In a setting\nwith player- and instance-specific costs, the Nash model for non-antagonistic loss functions excels\nbecause this setting is poorly modeled with antagonistic loss functions.\n\nAcknowledgments\n\nWe gratefully acknowledge support from STRATO AG.\n\n[Figure 2 panels: Accuracy over Time (65,000 Public Emails) and Accuracy over Time (40,000 Private Emails), AUC versus t emails received after training; Execution Time, time in sec versus number of training emails. Methods: SVM, LogReg, Invar-SVM, Minimax, Nash.]\n\n[Figure 3 panels: Storage Costs vs. Accuracy (65,000 Public Emails) and Storage Costs vs. Accuracy (40,000 Private Emails), required storage in MB versus non-spam recall. Methods: SVM, SVM with costs, LogReg, LogReg with costs, Invar-SVM, Minimax, Nash, Nash with costs.]\n\nReferences\n\n[1] Gert R. G. Lanckriet, Laurent El Ghaoui, Chiranjib Bhattacharyya, and Michael I. Jordan. A\nrobust minimax approach to classification. Journal of Machine Learning Research, 3:555\u2013582,\n2002.\n\n[2] Laurent El Ghaoui, Gert R. G. Lanckriet, and Georges Natsoulis. Robust classification with\ninterval data. Technical Report UCB/CSD-03-1279, EECS Department, University of California,\nBerkeley, 2003.\n\n[3] Amir Globerson and Sam T. Roweis. Nightmare at test time: robust learning by feature deletion.\nIn Proceedings of the International Conference on Machine Learning, 2006.\n\n[4] Choon Hui Teo, Amir Globerson, Sam T. Roweis, and Alex J. Smola. Convex learning with\ninvariances. In Advances in Neural Information Processing Systems, 2008.\n\n[5] Amir Globerson, Choon Hui Teo, Alex J. Smola, and Sam T. Roweis. Dataset Shift in Machine\nLearning, chapter An adversarial view of covariate shift and a minimax approach, pages 179\u2013198.\nMIT Press, 2009.\n\n[6] Tamer Basar and Geert J. Olsder. Dynamic Noncooperative Game Theory. Society for Industrial\nand Applied Mathematics, 1999.\n\n[7] J. B. Rosen. Existence and uniqueness of equilibrium points for concave n-person games.\nEconometrica, 33(3):520\u2013534, 1965.\n\n[8] Anna von Heusinger and Christian Kanzow. Relaxation methods for generalized Nash\nequilibrium problems with inexact line search. Journal of Optimization Theory and Applications,\n143(1):159\u2013183, 2009.\n\n[9] John M. Danskin. The theory of max-min, with applications. SIAM Journal on Applied\nMathematics, 14(4):641\u2013664, 1966.\n", "award": [], "sourceid": 534, "authors": [{"given_name": "Michael", "family_name": "Br\u00fcckner", "institution": null}, {"given_name": "Tobias", "family_name": "Scheffer", "institution": null}]}