{"title": "Solving a Class of Non-Convex Min-Max Games Using Iterative First Order Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 14934, "page_last": 14942, "abstract": "Recent applications that arise in machine learning have surged significant interest in solving min-max saddle point games. This problem has been extensively studied in the convex-concave regime for which a global equilibrium solution can be computed efficiently. In this paper, we study the problem in the non-convex regime and show that an $\\varepsilon$--first order stationary point of the game can be computed when one of the player\u2019s objective can be optimized to global optimality efficiently. In particular, we first consider the case where the objective of one of the players satisfies the Polyak-{\\L}ojasiewicz (PL) condition. For such a game, we show that a simple multi-step gradient descent-ascent algorithm finds an $\\varepsilon$--first order stationary point of the problem in $\\widetilde{\\mathcal{O}}(\\varepsilon^{-2})$ iterations. Then we show that our framework can also be applied to the case where the objective of the ``max-player\" is concave. In this case, we propose a multi-step gradient descent-ascent algorithm that finds an $\\varepsilon$--first order stationary point of the game in $\\widetilde{\\cal O}(\\varepsilon^{-3.5})$ iterations, which is the best known rate in the literature. We applied our algorithm to a fair classification problem of Fashion-MNIST dataset and observed that the proposed algorithm results in smoother training and better generalization.", "full_text": "Solving a Class of Non-Convex Min-Max Games\n\nUsing Iterative First Order Methods\n\nMaher Nouiehed\nnouiehed@usc.edu \u2217\n\nMaziar Sanjabi\nsanjabi@usc.edu \u2020\n\nTianjian Huang\ntianjian@usc.edu \u2021\n\nJason D. 
Lee\n\njasonlee@princeton.edu \u00a7\n\nMeisam Razaviyayn\nrazaviya@usc.edu \u00b6\n\nAbstract\n\nRecent applications that arise in machine learning have surged signi\ufb01cant interest\nin solving min-max saddle point games. This problem has been extensively studied\nin the convex-concave regime for which a global equilibrium solution can be\ncomputed ef\ufb01ciently. In this paper, we study the problem in the non-convex regime\nand show that an \u03b5\u2013\ufb01rst order stationary point of the game can be computed when\none of the player\u2019s objective can be optimized to global optimality ef\ufb01ciently. In\nparticular, we \ufb01rst consider the case where the objective of one of the players\nsatis\ufb01es the Polyak-\u0141ojasiewicz (PL) condition. For such a game, we show that a\nsimple multi-step gradient descent-ascent algorithm \ufb01nds an \u03b5\u2013\ufb01rst order stationary\n\npoint of the problem in (cid:101)O(\u03b5\u22122) iterations. Then we show that our framework can\nan \u03b5\u2013\ufb01rst order stationary point of the game in (cid:101)O(\u03b5\u22123.5) iterations, which is the\n\nalso be applied to the case where the objective of the \u201cmax-player\" is concave.\nIn this case, we propose a multi-step gradient descent-ascent algorithm that \ufb01nds\n\nbest known rate in the literature. We applied our algorithm to a fair classi\ufb01cation\nproblem of Fashion-MNIST dataset and observed that the proposed algorithm\nresults in smoother training and better generalization.\n\n1\n\nIntroduction\n\nRecent years have witnessed a wide range of machine learning and robust optimization applications\nbeing formulated as a min-max saddle point game; see [51, 11, 10, 50, 20, 53] and the references\ntherein. 
Examples of problems that are formulated under this framework include generative adversarial networks (GANs) [51], reinforcement learning [11], adversarial learning [53], learning exponential families [10], fair statistical inference [17, 56, 52, 37], generative adversarial imitation learning [6, 27], distributed non-convex optimization [35], and many others. These applications require solving an optimization problem of the form

min_{θ∈Θ} max_{α∈A} f(θ, α).   (1)

This optimization problem can be viewed as a zero-sum game between two players. The goal of the first player is to minimize f(θ, α) by tuning θ, while the other player's objective is to maximize f(θ, α) by tuning α.

∗Department of Industrial and Systems Engineering, University of Southern California
†Data Science and Operations Department, Marshall School of Business, University of Southern California
‡Department of Industrial and Systems Engineering, University of Southern California
§Department of Electrical Engineering, Princeton University
¶Department of Industrial and Systems Engineering, University of Southern California

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Gradient-based methods, especially gradient descent-ascent (GDA), are widely used in practice to solve these problems. GDA alternates between a gradient ascent step on α and a gradient descent step on θ. Despite its popularity, this algorithm fails to converge even for simple bilinear zero-sum games [41, 39, 14, 2, 32]. This failure can be fixed by adding negative momentum or by using the primal-dual methods proposed in [22, 21, 8, 13, 15, 33].
When the objective f is convex in θ and concave in α, the corresponding variational inequality becomes monotone. 
This setting has been extensively studied, and different algorithms have been developed for finding a Nash equilibrium [46, 21, 44, 29, 40, 23, 26, 43, 18, 45]. Moreover, [12] proposed an algorithm for solving a more general setting that covers both monotone and pseudo-monotone variational problems.
While the convex-concave setting has been extensively studied in the literature, recent machine learning applications make it necessary to move beyond these classical settings. For example, in a typical GAN problem formulation, two neural networks (generator and discriminator) compete in a non-convex zero-sum game framework [24]. For general non-convex non-concave games, [28, Proposition 10] provides an example for which a local Nash equilibrium does not exist. Similarly, one can show that even second-order Nash equilibria may not exist for non-convex games; see Section 2 for more details. Therefore, a well-justified objective is to find first order Nash equilibria of such games [48]; see definitions and discussion in Section 2. The first order Nash equilibrium can be viewed as a direct extension of the concept of first order stationarity in optimization to the above min-max game. While ε–first order stationarity in the context of optimization can be found efficiently in O(ε−2) iterations with the gradient descent algorithm [47], the question of whether it is possible to design a gradient-based algorithm that can find an ε–first order Nash equilibrium for general non-convex saddle point games remains open.
Several recent results have provided a partial answer to the problem of finding first-order stationary points of a non-convex min-max game. 
For instance, [51] proposed a stochastic gradient descent algorithm for the case when f(·,·) is strongly concave in α and showed convergence of the algorithm to an ε–first-order Nash equilibrium with Õ(ε−2) gradient evaluations. Also, the work [28] analyzes the gradient descent algorithm with Max-oracle and shows that O(ε−4) gradient evaluations and max-oracle calls suffice for solving min-max problems where the inner problem can be solved in one iteration using an existing oracle. More recently, [35, 36] considered the case where f is concave in α. They developed a descent-ascent algorithm with iteration complexity Õ(ε−4). In this non-convex concave setting, [50] proposed a stochastic sub-gradient descent method with worst-case complexity Õ(ε−6). Under the same concavity assumption on f, in this paper, we propose an alternative multi-step framework that finds an ε–first order Nash equilibrium/stationary point with Õ(ε−3.5) gradient evaluations.

In an effort to solve the more general non-convex non-concave setting, [34] developed a framework that converges to ε–first order stationarity/Nash equilibrium under the assumption that there exists a solution to the Minty variational inequality at each iteration. Although this is among the first algorithms with theoretical convergence guarantees in the non-convex non-concave setting, the required conditions are strong and difficult to check. To the best of our knowledge, there is no practical problem for which the Minty variational inequality condition has been proven. With the motivation of exploring the non-convex non-concave setting, we propose a simple multi-step gradient descent ascent algorithm for the case where the objective of one of the players satisfies the Polyak-Łojasiewicz (PL) condition. 
We show a worst-case complexity of Õ(ε−2) for our algorithm. This rate is optimal in terms of its dependence on ε up to logarithmic factors, as discussed in Section 3. Compared to the Minty variational inequality condition used in [34], the PL condition is very well studied in the literature and has been theoretically verified for the objectives of optimization problems arising in many practical settings. For example, it has been proven to hold for the objectives of over-parameterized deep networks [16], learning LQR models [19], phase retrieval [54], and many other simple problems discussed in [30]. In the context of min-max games, it has also proven useful in generative adversarial imitation learning with LQR dynamics [6], as discussed in Section 3.
The rest of this paper is organized as follows. In Section 2 we define the concepts of first-order Nash equilibrium (FNE) and ε–FNE. In Section 3, we describe our algorithm designed for min-max games with the objective of one player satisfying the PL condition. Finally, in Section 4 we describe our method for solving games in which the function f(θ, α) is concave in α (or convex in θ).

2 Two-player Min-Max Games and First-Order Nash Equilibrium

Consider the two-player zero-sum min-max game

min_{θ∈Θ} max_{α∈A} f(θ, α),   (2)

where Θ and A are both convex sets, and f(θ, α) is a continuously differentiable function. We say (θ∗, α∗) ∈ Θ × A is a Nash equilibrium of the game if

f(θ∗, α) ≤ f(θ∗, α∗) ≤ f(θ, α∗)   ∀ θ ∈ Θ, ∀ α ∈ A.

In convex-concave games, such a Nash equilibrium always exists [28], and several algorithms have been proposed to find Nash equilibria [23, 26]. 
However, in the non-convex non-concave regime, computing these points is in general NP-hard. In fact, even finding local Nash equilibria is NP-hard in the general non-convex non-concave regime. In addition, as shown by [28, Proposition 10], local Nash equilibria for general non-convex non-concave games may not exist. Thus, in this paper we aim for the less ambitious goal of finding a first-order Nash equilibrium, which is defined in the sequel.
Definition 2.1 (FNE). A point (θ∗, α∗) ∈ Θ × A is a first order Nash equilibrium (FNE) of the game (2) if

⟨∇θf(θ∗, α∗), θ − θ∗⟩ ≥ 0  ∀ θ ∈ Θ  and  ⟨∇αf(θ∗, α∗), α − α∗⟩ ≤ 0  ∀ α ∈ A.   (3)

Notice that this definition, which is also used in [48, 49], contains the first order necessary optimality conditions of the objective function of each player [5]. Thus, these are necessary conditions for a local Nash equilibrium. Moreover, in the absence of constraints, the above definition simplifies to ∇θf(θ∗, α∗) = 0 and ∇αf(θ∗, α∗) = 0, which are the well-known unconstrained first-order optimality conditions. Based on this observation, it is tempting to think that the above first-order Nash equilibrium condition does not differentiate between the min-max type solutions of (2) and min-min solutions of the type min_{θ∈Θ,α∈A} f(θ, α). However, the direction of the second inequality in (3) would be different had we considered the min-min problem instead of the min-max problem. This different direction makes the problem of finding an FNE non-trivial. 
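In the unconstrained case, the conditions above are easy to check numerically since they reduce to vanishing gradients. The following is a minimal sketch of our own (not part of the paper's experiments), echoing the quadratic game f(θ, α) = −θ² + α² + 4θα used later in Remark 2.4, whose gradients vanish only at the origin:

```python
# Toy check of the unconstrained first-order Nash conditions
# for f(theta, alpha) = -theta**2 + alpha**2 + 4*theta*alpha.

def grad_theta(theta, alpha):
    # d f / d theta = -2*theta + 4*alpha
    return -2.0 * theta + 4.0 * alpha

def grad_alpha(theta, alpha):
    # d f / d alpha = 2*alpha + 4*theta
    return 2.0 * alpha + 4.0 * theta

def is_fne(theta, alpha, eps=1e-9):
    # unconstrained FNE: both partial gradients (approximately) vanish
    return abs(grad_theta(theta, alpha)) <= eps and abs(grad_alpha(theta, alpha)) <= eps

print(is_fne(0.0, 0.0))  # True
print(is_fne(0.5, 0.0))  # False
```

Solving grad_theta = grad_alpha = 0 gives θ = α = 0, so (0, 0) is the unique point satisfying the first-order conditions for this game.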
The following theorem guarantees the existence of first-order Nash equilibria under some mild assumptions.
Theorem 2.2 (Restated from Proposition 2 in [48]). Suppose the sets Θ and A are non-empty, compact, and convex. Moreover, assume that the function f(·,·) is twice continuously differentiable. Then there exists a feasible point (θ̄, ᾱ) that is a first-order Nash equilibrium.

The above theorem guarantees the existence of FNE points even when (local) Nash equilibria may not exist. The next natural question concerns the computability of such points. Since in practice we use iterative methods for computation, we need to define the notion of an approximate FNE.
Definition 2.3 (Approximate FNE). A point (θ∗, α∗) is said to be an ε–first-order Nash equilibrium (ε–FNE) of the game (2) if

X(θ∗, α∗) ≤ ε  and  Y(θ∗, α∗) ≤ ε,

where

X(θ∗, α∗) ≜ − min_θ ⟨∇θf(θ∗, α∗), θ − θ∗⟩  s.t. θ ∈ Θ, ‖θ − θ∗‖ ≤ 1,   (4)

and

Y(θ∗, α∗) ≜ max_α ⟨∇αf(θ∗, α∗), α − α∗⟩  s.t. α ∈ A, ‖α − α∗‖ ≤ 1.   (5)

In the absence of constraints, ε–FNE in Definition 2.3 reduces to ‖∇θf(θ∗, α∗)‖ ≤ ε and ‖∇αf(θ∗, α∗)‖ ≤ ε.
Remark 2.4. The ε–FNE definition above is based on the first order optimality measure of the objective of each player. Such a first-order optimality measure has been used before in the context of optimization; see [9]. 
Such a condition guarantees that neither player can improve their objective function using first order information. Similar to the optimization setting, one can define a second-order Nash equilibrium as a point at which neither player can further improve their objective using first and second order information. However, the use of second order Nash equilibria is more subtle in the context of games. The following example shows that such a point may not exist. Consider the game

min_{−1≤θ≤1} max_{−2≤α≤2} −θ² + α² + 4θα.

Then (0, 0) is the only first-order Nash equilibrium and is not a second-order Nash equilibrium.

In this paper, our goal is to find an ε–FNE of the game (2) using iterative methods. To proceed, we make the following standard assumptions about the smoothness of the objective function f.
Assumption 2.5. The function f is continuously differentiable in both θ and α, and there exist constants L11, L22 and L12 such that for every α, α1, α2 ∈ A, and θ, θ1, θ2 ∈ Θ, we have

‖∇θf(θ1, α) − ∇θf(θ2, α)‖ ≤ L11‖θ1 − θ2‖,   ‖∇αf(θ, α1) − ∇αf(θ, α2)‖ ≤ L22‖α1 − α2‖,
‖∇αf(θ1, α) − ∇αf(θ2, α)‖ ≤ L12‖θ1 − θ2‖,   ‖∇θf(θ, α1) − ∇θf(θ, α2)‖ ≤ L12‖α1 − α2‖.

3 Non-Convex PL-Game

In this section, we consider the problem of developing an "efficient" algorithm for finding an ε–FNE of (2) when the objective of one of the players satisfies the Polyak-Łojasiewicz 
(PL) condition. To proceed, let us first formally define the Polyak-Łojasiewicz (PL) condition.
Definition 3.1 (Polyak-Łojasiewicz Condition). A differentiable function h(x) with minimum value h∗ = min_x h(x) is said to be μ-Polyak-Łojasiewicz (μ-PL) if

(1/2)‖∇h(x)‖² ≥ μ(h(x) − h∗)  ∀ x.   (6)

The PL condition has been established and utilized for analyzing many practical modern problems [30, 19, 16, 54, 6]. Moreover, it is well known that a function can be non-convex and still satisfy the PL condition [30]. Based on the definition above, we define a class of min-max PL-games.
Definition 3.2 (PL-Game). We say that the min-max game (2) is a PL-Game if the max player is unconstrained, i.e., A = R^n, and there exists a constant μ > 0 such that the function h_θ(α) ≜ −f(θ, α) is μ-PL for any fixed value of θ ∈ Θ.
A simple example of a practical PL-game is detailed next.
Example 3.1 (Generative adversarial imitation learning of linear quadratic regulators). Imitation learning is a paradigm that aims to learn from an expert's demonstration of performing a task [6]. It is known that this learning process can be formulated as a min-max game [27]. In such a game, the minimization is performed over all policies, and the goal is to minimize the discrepancy between the accumulated reward of the expert's policy and that of the proposed policy. On the other hand, the maximization is done over the parameters of the reward function and aims at maximizing this discrepancy. This approach is also referred to as generative adversarial imitation learning (GAIL) [27]. 
The problem of generative adversarial imitation learning for linear quadratic regulators [6] refers to solving this problem in the specific case where the underlying dynamics and the reward function come from a linear quadratic regulator [19]. To be more specific, this problem can be formulated [6] as min_K max_{θ∈Θ} m(K, θ), where K represents the choice of policy and θ represents the parameters of the dynamics and the reward functions. Under the discussed setting, m is strongly concave in θ and PL in K (see [6] for more details). Note that since m is strongly concave in θ and PL in K, any FNE of the game would also be a Nash equilibrium point. Also note that the notion of FNE does not depend on the ordering of the min and max. Thus, to be consistent with our notion of PL-games, we can formulate the problem as

min_{θ∈Θ} max_K −m(K, θ).   (7)

Thus, generative adversarial imitation learning of linear quadratic regulators is an example of finding an FNE of a min-max PL-game.

In what follows, we present a simple iterative method for computing an ε–FNE of PL-games.

3.1 Multi-step gradient descent ascent for PL-games

In this section, we propose a multi-step gradient descent ascent algorithm that finds an ε–FNE point for PL-games. At each iteration, our method runs multiple gradient ascent steps to estimate the solution of the inner maximization problem. This solution is then used to estimate the gradient of the inner maximization value function, which directly provides a descent direction. In a nutshell, our proposed algorithm is a gradient descent-like algorithm on the inner maximization value function. 
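Before presenting the algorithm formally, the inner/outer structure can be sketched in a few lines. The instance below is our own toy example (not from the paper): f(θ, α) = 2θα − α² with Θ = [−1, 1], whose inner problem is strongly concave in α (so h_θ(α) = −f(θ, α) is PL) with maximizer α∗(θ) = θ and value function g(θ) = θ², driving θ to the minimizer 0. The inner step size mirrors the η1 = 1/L22 choice used later in Algorithm 1.

```python
# Toy instance of the multi-step descent-ascent loop for
# f(theta, alpha) = 2*theta*alpha - alpha**2 on Theta = [-1, 1].
# Here grad_alpha f = 2*theta - 2*alpha and grad_theta f = 2*alpha.

def multi_step_gda(theta0, alpha0, eta1=0.5, eta2=0.25, K=10, T=60):
    theta, alpha = theta0, alpha0
    for _ in range(T):
        # Inner loop: K gradient ascent steps on alpha, approximately
        # solving max_alpha f(theta, alpha) (exact maximizer: alpha = theta).
        for _ in range(K):
            alpha += eta1 * (2.0 * theta - 2.0 * alpha)
        # Outer step: projected gradient descent on theta using the
        # approximate maximizer; grad_theta f(theta, alpha) = 2*alpha,
        # and projection onto [-1, 1] is a clip.
        theta = min(1.0, max(-1.0, theta - eta2 * 2.0 * alpha))
    return theta, alpha

theta, alpha = multi_step_gda(0.9, -0.4)
print(abs(theta) < 1e-6, abs(alpha) < 1e-6)  # True True
```

In this toy problem the outer update contracts θ by a factor of 1/2 per iteration, illustrating how descending along ∇θf evaluated at the (approximate) inner maximizer descends on the value function g.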
To present the ideas of our multi-step algorithm, let us re-write (2) as

min_{θ∈Θ} g(θ),   (8)

where

g(θ) ≜ max_{α∈A} f(θ, α).   (9)

A classical result in optimization is Danskin's theorem [4], which provides a sufficient condition under which the gradient of the value function max_{α∈A} f(θ, α) can be directly evaluated using the gradient of the objective f(θ, α∗) at the optimal solution α∗. This result requires the optimizer α∗ to be unique. Under our PL assumption on f(θ,·), the inner maximization problem (9) may have multiple optimal solutions, so Danskin's theorem does not directly apply. However, as we show in Lemma A.5 in the supplementary material, under the PL assumption we can still establish

∇θg(θ) = ∇θf(θ, α∗)  with  α∗ ∈ arg max_{α∈A} f(θ, α),

despite the non-uniqueness of the optimal solution.
Motivated by this result, we propose a Multi-step Gradient Descent Ascent algorithm that solves the inner maximization problem to "approximate" the gradient of the value function g. This gradient direction is then used to descend on θ. More specifically, the inner loop (Step 4) in Algorithm 1 solves the maximization problem (9) for a given fixed value θ = θt. The computed solution of this optimization problem provides an approximation of the gradient of the function g(θ); see Lemma A.6 in Appendix A. 
This gradient is then used in Step 7 to descend on θ.

Algorithm 1 Multi-step Gradient Descent Ascent
1: INPUT: K, T, η1 = 1/L22, η2 = 1/L, α0 ∈ A and θ0 ∈ Θ
2: for t = 0,···, T − 1 do
3:    Set α0(θt) = αt
4:    for k = 0,···, K − 1 do
5:       Set αk+1(θt) = αk(θt) + η1∇αf(θt, αk(θt))
6:    end for
7:    Set θt+1 = projΘ( θt − η2∇θf(θt, αK(θt)) )
8: end for
9: Return (θt, αK(θt)) for t = 0,···, T − 1.

3.2 Convergence analysis of Multi-Step Gradient Descent Ascent Algorithm for PL games

Throughout this section, we make the following assumption.
Assumption 3.3. The constraint set Θ is convex and compact. Moreover, there exists a ball with radius R, denoted by BR, such that Θ ⊆ BR.
We are now ready to state the main result of this section.
Theorem 3.4. Under Assumptions 2.5 and 3.3, for any given scalar ε ∈ (0, 1), if we choose K and T large enough such that

T ≥ NT(ε) ≜ O(ε−2)  and  K ≥ NK(ε) ≜ O(log(ε−1)),

then there exists an iteration t ∈ {0,···, T} such that (θt, αt+1) is an ε–FNE of (2).

Proof. The proof is relegated to Appendix A.2.
Corollary 3.5. Under Assumption 2.5 and Assumption 3.3, Algorithm 1 finds an ε-FNE of the game (2) with O(ε−2) gradient evaluations of the objective with respect to θ and O(ε−2 log(ε−1)) gradient evaluations with respect to α. If the two gradient oracles have the same complexity, the overall complexity of the method is O(ε−2 log(ε−1)).
Remark 3.6. 
The iteration complexity O(ε−2 log(ε−1)) in Theorem 3.4 is tight up to logarithmic factors. This is due to the fact that for general non-convex smooth problems, finding an ε–stationary solution requires at least Ω(ε−2) gradient evaluations [7, 47]. Clearly, this lower bound is also valid for finding an ε–FNE of PL-games: we can simply take a function f(θ, α) that does not depend on α (and is thus PL in α).

Remark 3.7. Theorem 3.4 shows that under the PL assumption, the pair (θt, αK(θt)) computed by Algorithm 1 is an ε–FNE of the game (2). Since αK(θt) is an approximate solution of the inner maximization problem, it follows that θt is concurrently an ε–first order stationary solution of the optimization problem (8).
Remark 3.8. In [51, Theorem 4.2], a similar result was shown for the case when f(θ, α) is strongly concave in α. Hence, Theorem 3.4 can be viewed as an extension of [51, Theorem 4.2]. Similar to [51, Theorem 4.2], one can easily extend the result of Theorem 3.4 to the stochastic setting by replacing the gradient of f with respect to θ in Step 7 with a stochastic version of the gradient.

In the next section we consider the non-convex concave min-max saddle game. It is well known that convexity/concavity does not imply the PL condition, and the PL condition does not imply convexity/concavity [30]. Therefore, the problems we consider in the next section are neither a restriction nor an extension of our results on PL-games.

4 Non-Convex Concave Games

In this section, we focus on "non-convex concave" games satisfying the following assumption:
Assumption 4.1. The objective function f(θ, α) is concave in α for any fixed value of θ. 
Moreover, the set A is convex and compact, and there exists a ball with radius R that contains the feasible set A.
One major difference between this case and PL-games is that here the function g(θ) = max_{α∈A} f(θ, α) might not be differentiable. To see this, consider the example g(θ) = max_{0≤α≤1} (2α − 1)θ; the objective is linear, and hence concave, in α, yet the value function g(θ) = |θ| is non-smooth. Using a small regularization term, we approximate the function g(·) by a differentiable function

gλ(θ) ≜ max_{α∈A} fλ(θ, α),   (10)

where fλ(θ, α) ≜ f(θ, α) − (λ/2)‖α − ᾱ‖². Here ᾱ ∈ A is some given fixed point, and λ > 0 is a regularization parameter that we will specify later. Since f(θ, α) is concave in α, fλ(θ,·) is λ-strongly concave. Thus, the function gλ(·) becomes smooth with Lipschitz gradient; see Lemma B.1 in the supplementary material. Using this property, we propose an algorithm that, at each iteration, runs multiple steps of Nesterov accelerated projected gradient ascent to estimate the solution of (10). This solution is then used to estimate the gradient of gλ(θ), which directly provides a descent direction on θ.
Our algorithm computes an ε–FNE point for non-convex concave games with Õ(ε−3.5) gradient evaluations. 
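For the example above, the smoothing can be worked out in closed form; the sketch below is our own illustration, with the anchor point ᾱ = 1/2 as an arbitrary choice. The regularized inner maximizer is α∗(θ) = clip(1/2 + 2θ/λ, [0, 1]), and since fλ(θ,·) is strongly concave, Danskin's theorem gives ∇gλ(θ) = 2α∗(θ) − 1, which is continuous in θ even at the kink of g(θ) = |θ|:

```python
# Closed-form smoothing of g(theta) = max_{0<=alpha<=1} (2*alpha - 1)*theta = |theta|
# via f_lambda(theta, alpha) = (2*alpha - 1)*theta - (lam/2)*(alpha - 0.5)**2.

def alpha_star(theta, lam=0.1):
    # unique maximizer of the strongly concave inner problem over [0, 1]
    return min(1.0, max(0.0, 0.5 + 2.0 * theta / lam))

def g_lam(theta, lam=0.1):
    a = alpha_star(theta, lam)
    return (2.0 * a - 1.0) * theta - 0.5 * lam * (a - 0.5) ** 2

def grad_g_lam(theta, lam=0.1):
    # Danskin: gradient of the value function, evaluated at the maximizer
    return 2.0 * alpha_star(theta, lam) - 1.0

print(grad_g_lam(0.0))  # 0.0: smooth where |theta| has a kink
print(g_lam(1.0))       # |1| - lam/8: a uniform O(lam) approximation error
```

The gradient interpolates linearly between −1 and 1 on the interval |θ| ≤ λ/4, so gλ is a Huber-like smoothing of |θ|, and the approximation error is at most λ/8 everywhere.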
For a sufficiently small regularization coefficient, we then show that the computed point is an ε-FNE of the original game.
Notice that since fλ is Lipschitz smooth, by the compactness assumption we can define

g_θ ≜ max_{θ∈Θ} ‖∇gλ(θ)‖,  g_α ≜ max_{θ∈Θ} ‖∇αfλ(θ, α∗(θ))‖,  and  gmax = max{g_θ, g_α, 1},   (11)

where α∗(θ) ≜ arg max_{α∈A} fλ(θ, α). We are now ready to describe our proposed algorithm.

4.1 Algorithm Description

Our proposed method is outlined in Algorithm 2. The algorithm alternates between two main steps. In Step 2, we run K steps of the accelerated gradient ascent algorithm over the variable α, with a restart every N iterations, to find an approximate maximizer of the problem max_α fλ(θt, α); the details of this subroutine can be found in Subsection B.1 of the supplementary materials. In Step 3, using the approximate maximizer αt+1, we update the θ variable with one first order step: we can either use the projected gradient descent update rule

θt+1 ≜ projΘ( θt − (1/(L11 + L12²/λ)) ∇θfλ(θt, αt+1) ),

or the Frank-Wolfe update rule described in Subsection B.2 of the supplementary material. We show convergence of the algorithm to an ε–FNE in Theorem 4.2.

Algorithm 2 Multi-Step Frank Wolfe/Projected Gradient Step Framework
Require: Constants L̃ ≜ max{L, L12, gmax}, N ≜ ⌊√(8L22/λ)⌋, K, T, η, λ, θ0 ∈ Θ, α0 ∈ A
1: for t = 0, 1, 2, . . . 
, T do
2:    Set αt+1 = APGA(αt, θt, η, N, K) by running K steps of the Accelerated Projected Gradient Ascent subroutine (Algorithm 3) with a periodic restart every N iterations.
3:    Compute θt+1 using first-order information (Frank-Wolfe or projected gradient descent).
4: end for

Theorem 4.2. Given a scalar ε ∈ (0, 1), assume that Step 3 of Algorithm 2 runs either projected gradient descent or a Frank-Wolfe iteration. Under Assumptions 4.1 and 2.5, if

η = 1/L22,  λ ≜ ε/(4R),  T ≥ NT(ε) ≜ O(ε−3),  and  K ≥ NK(ε) ≜ O(ε−1/2 log(ε−1)),

then there exists t ∈ {0, . . . , T} such that (θt, αt+1) is an ε–FNE of problem (2).

Proof. The proof is relegated to Appendix B.4.
Corollary 4.3. Under Assumptions 2.5 and 4.1, Algorithm 2 finds an ε-first-order stationary solution of the game (2) with O(ε−3) gradient evaluations of the objective with respect to θ and O(ε−3.5 log(ε−1)) gradient evaluations with respect to α. If the two oracles have the same complexity, the overall complexity of the method is O(ε−3.5 log(ε−1)).

5 Numerical Results

We evaluate the numerical performance of Algorithm 2 in the following two applications:

5.1 Fair Classifier

We conduct two experiments on the Fashion-MNIST dataset [55]. This dataset consists of 28 × 28 arrays of grayscale pixel images classified into 10 categories of clothing. It includes 60,000 training images and 10,000 testing images.
Experimental Setup: The recent work in [42] observed that training a logistic regression model to classify the images of the Fashion-MNIST dataset can be biased against certain categories. 
To remove this bias, [42] proposed to minimize the maximum loss incurred across the different categories. We repeat the experiment using a more complex non-convex Convolutional Neural Network (CNN) model for classification. Similar to [42], we limit our experiment to the three categories T-shirt/top, Coat, and Shirt, which correspond to the lowest three testing accuracies achieved by the trained classifier. To minimize the maximum loss over these three categories, we train the classifier to solve

min_W max{L1(W), L2(W), L3(W)},   (12)

where W represents the parameters of the CNN, and L1, L2, and L3 correspond to the loss incurred by samples in the T-shirt/top, Coat, and Shirt categories. Problem (12) can be re-written as

min_W max_{t1,t2,t3} Σ_{i=1}^{3} tiLi(W)   s.t.  ti ≥ 0 ∀ i = 1, 2, 3;  Σ_{i=1}^{3} ti = 1.

Clearly the inner maximization problem is concave, and thus our theory can be applied. To empirically evaluate the regularization scheme proposed in Section 4, we implement two versions of Algorithm 2. The first version solves at each iteration the regularized strongly concave sub-problem

max_{t1,t2,t3} Σ_{i=1}^{3} tiLi(W) − (λ/2) Σ_{i=1}^{3} ti²   s.t.  ti ≥ 0 ∀ i = 1, 2, 3;  Σ_{i=1}^{3} ti = 1,   (13)

and uses the optimal t to perform a gradient descent step on W (notice that for a fixed value of W, the optimal t can be computed using the KKT conditions and a simple sorting or bisection procedure).

Figure 1: The effect of regularization on the convergence of the training loss, λ = 0.1.

The second version of Algorithm 2 solves at each iteration the concave inner maximization problem without the regularization term, and then uses the computed solution to perform a descent step on W. Notice that in both cases, the optimization with respect to the t variable can be done in (almost) closed form. 
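Completing the square in the regularized sub-problem (13) shows why this closed form exists: maximizing Σ tiLi(W) − (λ/2)Σ ti² over the simplex is equivalent to minimizing (λ/2)‖t − L/λ‖², i.e., to the Euclidean projection of (L1, L2, L3)/λ onto the probability simplex. The sketch below uses the standard sorting-based projection routine; it is our illustration of one such procedure, not necessarily the authors' exact implementation.

```python
def proj_simplex(v):
    """Euclidean projection of v onto {t : t_i >= 0, sum_i t_i = 1} (sort-based)."""
    u = sorted(v, reverse=True)
    cssv, tau = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cssv += uj
        t = (1.0 - cssv) / j
        if uj + t > 0:   # uj still active under threshold t
            tau = t
    return [max(vi + tau, 0.0) for vi in v]

def optimal_weights(losses, lam):
    # argmax_t sum_i t_i * L_i - (lam/2) * sum_i t_i**2 over the simplex,
    # obtained by completing the square: t* = proj_simplex(L / lam)
    return proj_simplex([li / lam for li in losses])

print(optimal_weights([1.0, 2.0, 0.5], lam=0.1))   # [0.0, 1.0, 0.0]
print(optimal_weights([1.0, 2.0, 0.5], lam=10.0))
```

For a small λ the optimal weights concentrate on the largest loss, recovering the unregularized max; a large λ spreads the weight over all categories, which is the smoothing effect discussed above.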
Although regularization is required for our theoretical convergence guarantees, we compare the two versions of the algorithm empirically to determine whether anything is lost by adding such regularization. We further compare these two algorithms with normal training, which uses gradient descent to minimize the average loss over the three categories. We run all algorithms for 5500 epochs and record the test accuracy of each category. To reduce the effect of random initialization, we run our methods with 50 different random initializations and record the average and standard deviation of the collected test accuracies. For a fair comparison, the same initialization is used for all methods in each run. The results are summarized in Table 1. To test our framework in stochastic settings, we repeat the experiment running all algorithms for 12,000 iterations with the Adam and SGD optimizers and a batch size of 600 images (200 from each category). The results of this second experiment with the Adam optimizer are summarized in Table 2. The model architecture and parameters are detailed in Appendix F. We mainly use the Adam optimizer because it is more robust to the choice of step-size and thus can be easily tuned. In fact, the use of SGD or Adam does not change the overall takeaways of the experiments. The results of using the SGD optimizer are relegated to Appendix C.

Results: Tables 1 and 2 show the average and standard deviation of the number of correctly classified samples, taken over 50 runs. For each run, 1000 testing samples are considered per category. The results show that with MinMax and MinMax with regularization, the accuracies across the different categories are more balanced compared to normal training. Moreover, the tables show that Algorithm 2 with regularization provides a slightly better worst-case performance than the unregularized approach.
Note that the empirical\nadvantages due to regularization appears more in the stochastic setting. To see this compare the\ndifferences between MinMax and MinMax with Regularization in Tables 1 and 2. Figure 1 depicts a\nsample trajectory of deterministic algorithm applied to the regularized and regularized formulations.\nThis \ufb01gures shows that regularization provides a smoother and slightly faster convergence compared\nto the unregularized approach. In addition, we apply our algorithm to the exact similar logistic\nregression setup as in [42]. Results of this experiment can be found in Appendix D.\n\nNormal\nMinMax\n\nMinMax with Regularization\n\nT-shirt/top\nstd\n8.58\n10.40\n10.53\n\nmean\n850.72\n774.14\n779.84\n\nmean\n843.50\n753.88\n765.56\n\nstd\n17.24\n22.52\n22.28\n\nCoat\n\nShirt\n\nWorst\n\nmean\n658.74\n766.14\n762.34\n\nstd\n17.81\n13.59\n11.91\n\nmean\n658.74\n750.04\n755.66\n\nstd\n17.81\n18.92\n15.11\n\nTable 1: The mean and standard deviation of the number of correctly classi\ufb01ed samples when gradient\ndescent is used in training, \u03bb = 0.1.\n\n8\n\n\fNormal\nMinMax\n\nMinMax with Regularization\n\nT-shirt/top\nstd\n10.04\n15.12\n14.12\n\nmean\n853.86\n753.44\n764.02\n\nCoat\n\nShirt\n\nWorst\n\nmean\n852.22\n715.24\n739.80\n\nstd\n18.27\n32.00\n27.60\n\nmean\n683.32\n733.42\n748.84\n\nstd\n17.96\n18.51\n15.79\n\nmean\n683.32\n711.64\n734.34\n\nstd\n17.96\n29.02\n23.54\n\nTable 2: The mean and standard deviation of the number of correctly classi\ufb01ed samples when Adam\n(mini-batch) is used in training, \u03bb = 0.1.\n\n5.2 Robust Neural Network Training\n\nExperimental Setup: Neural networks have been widely used in various applications, especially in\nthe \ufb01eld of image recognition. 
However, these neural networks are vulnerable to adversarial attacks, such as the Fast Gradient Sign Method (FGSM) [25] and the Projected Gradient Descent (PGD) attack [31]. These attacks show that a small perturbation of the input can significantly change the output of a neural network. To train a neural network that is robust against adversarial attacks, researchers have reformulated the training procedure as a robust min-max optimization problem [38], such as

    min_w Σ_{i=1}^{N} max_{δ_i : |δ_i|_∞ ≤ ε} ℓ(f(x_i + δ_i; w), y_i).

Here w denotes the parameters of the neural network, the pair (x_i, y_i) denotes the i-th data point, and δ_i is the perturbation added to data point i. As discussed in this paper, solving such a non-convex non-concave min-max optimization problem is computationally challenging. Motivated by the theory developed in this work, we approximate the above optimization problem with a novel objective function that is concave in the variables of the (inner) maximization player. To do so, we first approximate the inner maximization problem with a finite-max problem

    min_w Σ_{i=1}^{N} max{ℓ(f(x̂_{i0}(w); w), y_i), . . . , ℓ(f(x̂_{i9}(w); w), y_i)},    (14)

where each x̂_{ij}(w) is the result of a targeted attack on sample x_i aiming to change the output of the network to label j. These perturbed inputs, which are explained in detail in Appendix E, are functions of the weights of the network. We then replace this finite-max inner problem with a concave problem over a probability simplex. Such a concave inner problem allows us to use the multi-step gradient descent-ascent method. The structure of the network and the details of the formulation are given in Appendix E.

Results: We compare our results with [38, 57].
Note that [57] is the state-of-the-art algorithm and won first place, out of ≈ 2000 submissions, in the NeurIPS 2018 Adversarial Vision Challenge. The accuracy of our formulation against the popular attacks FGSM [25] and PGD [31] is summarized in Table 3. This table shows that our formulation leads to comparable results against state-of-the-art algorithms (while in some cases it also outperforms those methods by as much as ≈ 15% accuracy).

                           Natural    FGSM L∞ [25]                  PGD40 L∞ [31]
                                      ε=0.2    ε=0.3    ε=0.4       ε=0.2    ε=0.3    ε=0.4
[38] with ε = 0.35         98.58%     96.09%   94.82%   89.84%      94.64%   91.41%   78.67%
[57] with ε = 0.35         97.37%     95.47%   94.86%   79.04%      94.41%   92.69%   85.74%
[57] with ε = 0.40         97.21%     96.19%   96.17%   96.14%      95.01%   94.36%   94.11%
Proposed with ε = 0.40     98.20%     97.04%   96.66%   96.23%      96.00%   95.17%   94.22%

Table 3: Test accuracies under FGSM and PGD attacks. All adversarial images are quantized to 256 levels (0-255 integers).

Links to the code and pre-trained models for the above two simulations are available in Appendix G.
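As a concrete point of reference for the attacks reported in Table 3: FGSM [25] perturbs each input by a single signed-gradient step on the loss with respect to the input, clipped to the valid pixel range, which keeps the perturbation inside the L∞-ball of radius ε. A minimal sketch on a toy binary logistic model (the model, variable names, and numbers are ours for illustration, not the CNN used in the experiments):

```python
import numpy as np

def logistic_loss(x, y, w, b):
    """Binary logistic loss for a label y in {-1, +1}."""
    return np.log1p(np.exp(-y * (w @ x + b)))

def fgsm(x, y, w, b, eps):
    """One signed-gradient ascent step on the loss with respect to the
    input x, clipped back to the valid pixel range [0, 1]."""
    grad_x = -y * w / (1.0 + np.exp(y * (w @ x + b)))  # d(loss)/dx
    return np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)

# A correctly classified point sitting on the decision boundary margin:
x = np.array([0.5, 0.5])
w = np.array([1.0, -1.0])
x_adv = fgsm(x, y=1.0, w=w, b=0.0, eps=0.1)  # perturbation bounded by eps
```

The PGD attack [31] simply iterates this step several times (PGD40 uses 40 iterations), projecting back onto the L∞-ball around the clean input after each step.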