{"title": "Computational Equivalence of Fixed Points and No Regret Algorithms, and Convergence to Equilibria", "book": "Advances in Neural Information Processing Systems", "page_first": 625, "page_last": 632, "abstract": "We study the relation between notions of game-theoretic equilibria which are based on stability under a set of deviations, and empirical equilibria which are reached by rational players. Rational players are modelled by players using no regret algorithms, which guarantee that their payoff in the long run is almost as much as the most they could hope to achieve by consistently deviating from the algorithm's suggested action. We show that for a given set of deviations over the strategy set of a player, it is possible to efficiently approximate fixed points of a given deviation if and only if there exist efficient no regret algorithms resistant to the deviations. Further, we show that if all players use a no regret algorithm, then the empirical distribution of their plays converges to an equilibrium.", "full_text": "Computational Equivalence of Fixed Points and No\nRegret Algorithms, and Convergence to Equilibria\n\nIBM Almaden Research Center\n\nComputer Science Department,\n\nElad Hazan\n\n650 Harry Road\n\nSan Jose, CA 95120\n\nhazan@us.ibm.com\n\nSatyen Kale\n\nPrinceton University\n\n35 Olden St.\n\nPrinceton, NJ 08540\n\nsatyen@cs.princeton.edu\n\nAbstract\n\nWe study the relation between notions of game-theoretic equilibria which are\nbased on stability under a set of deviations, and empirical equilibria which are\nreached by rational players. 
Rational players are modeled by players using no regret algorithms, which guarantee that their payoff in the long run is close to the maximum they could hope to achieve by consistently deviating from the algorithm's suggested action.
We show that for a given set of deviations over the strategy set of a player, it is possible to efficiently approximate fixed points of a given deviation if and only if there exist efficient no regret algorithms resistant to the deviations. Further, we show that if all players use a no regret algorithm, then the empirical distribution of their plays converges to an equilibrium.

1 Introduction

We consider a setting where a number of agents need to repeatedly make decisions in the face of uncertainty. In each round, the agent obtains a payoff based on the decision she chose. Each agent would like to be able to maximize her payoff. While this might seem like a natural objective, it may be impossible to achieve without placing restrictions on the kind of payoffs that can arise. For instance, if the payoffs were adversarially chosen, then the agent's task would become essentially hopeless.
In such a situation, one way for the agent to cope with the uncertainty is to aim for a relative benchmark rather than an absolute one. The notion of regret minimization captures this intuition. We imagine that the agent has a choice of several well-defined ways to change her decision, and now the agent aims to maximize her payoff relative to what she could have obtained had she changed her decisions in a consistent manner. As an example of what we mean by consistent changes, a possible objective could be to maximize her payoff relative to the most she could have achieved by choosing some fixed decision in all the rounds. The difference between these payoffs is known as external regret in the game theory literature. 
Another notion is that of internal regret, which arises when the possible ways to change are the ones that switch from some decision i to another, j, whenever the agent chose decision i, leaving all other decisions unchanged.
A learning algorithm for an agent is said to have no regret with respect to an associated set of decision modifiers (also called deviations) Φ if the average payoff of an agent using the algorithm converges to the largest average payoff she would have achieved had she changed her decisions using a fixed decision modifier in all the rounds. Based on what set of decision modifiers is under consideration, various no regret algorithms are known (e.g., Hannan [10] gave algorithms to minimize external regret, and Hart and Mas-Colell [11] gave algorithms to minimize internal regret).

The reason no regret algorithms are so appealing, apart from the fact that they model rational behavior of agents in the face of uncertainty, is that in various cases it can be shown that using no regret algorithms guides the overall play towards a game theoretic equilibrium. For example, Freund and Schapire [7] show that in a zero-sum game, if all agents use a no external regret algorithm, then the empirical distribution of the play converges to the set of minimax equilibria. Similarly, Hart and Mas-Colell [11] show that if all agents use a no internal regret algorithm, then the empirical distribution of the play converges to the set of correlated equilibria.
In general, given a set of decision modifiers Φ, we can define a notion of game theoretic equilibrium that is based on the property of being stable under deviations specified by Φ. 
This is a joint distribution on the agents' decisions that ensures that the expected payoff to any agent is no less than the most she could achieve if she decided to unilaterally (and consistently) deviate from her suggested action using any decision modifier in Φ. One can then show that if all agents use a Φ-no regret algorithm, then the empirical distribution of the play converges to the set of Φ-equilibria.
This brings us to the question of whether it is possible to design no regret algorithms for various sets of decision modifiers Φ. In this paper, we design algorithms which achieve no regret with respect to Φ for a very general setting of arbitrary convex compact decision spaces, arbitrary concave payoff functions, and arbitrary continuous decision modifiers. Our method works as long as it is possible to compute approximate fixed points for (convex combinations of) decision modifiers in Φ. Our algorithms are based on a connection to the framework of Online Convex Optimization (see, e.g. [18]) and we show how to apply known learning algorithms to obtain Φ-no regret algorithms. The generality of our connection allows us to use various sophisticated Online Convex Optimization algorithms which can exploit various structural properties of the utility functions and guarantee a faster rate of convergence to the equilibrium.
Previous work by Greenwald and Jafari [9] gave algorithms for the case when the decision space is the simplex of probability distributions over the agents' decisions, the payoff functions are linear, and the decision modifiers are also linear. Their algorithm, based on the work of Hart and Mas-Colell [11], uses a version of Blackwell's Approachability Theorem, and also needs to compute fixed points of the decision modifiers. 
Since these modifiers are linear, it is possible to compute fixed points for them by computing the stationary distribution of an appropriate stochastic matrix (say, by computing its top eigenvector).
Computing Brouwer fixed points of continuous functions is in general a very hard problem (it is PPAD-complete, as shown by Papadimitriou [15]). Fixed points are ubiquitous in game theory. Most common notions of equilibria in game theory are defined as the set of fixed points of a certain mapping. For example, Nash Equilibria (NE) are the set of fixed points of the best response mapping (appropriately defined to avoid ambiguity). The fact that Brouwer fixed points are hard to compute in general is no reason why computing specific fixed points should be hard (for instance, as mentioned earlier, computing fixed points of linear functions is easy via eigenvector computations). More specifically, could it be the case that the NE, being a fixed point of some well-specified mapping, is easy to compute? These hopes were dashed by the work of [6, 3], who showed that computing NE is as computationally difficult as finding fixed points of a general mapping: they show that computing NE in a two-player game is PPAD-complete. Further work showed that even computing an approximate NE is PPAD-complete [4].
Since our algorithms (and all previous ones as well) depend on computing (approximate) fixed points of various decision modifiers, the above discussion leads us to question whether this is necessary. We show in this paper that indeed it is: a Φ-no-regret algorithm can be efficiently used to compute approximate fixed points of any convex combination of decision modifiers. 
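For linear decision modifiers, the fixed-point computation mentioned above reduces to finding the stationary distribution of a stochastic matrix. The following is a minimal illustrative sketch (not the paper's implementation; the matrix P and tolerances are made up for the example) using power iteration:

```python
import numpy as np

def stationary_distribution(P, tol=1e-10, max_iter=10_000):
    """Approximate a fixed point x = P x of a column-stochastic
    matrix P (columns sum to 1) by power iteration."""
    n = P.shape[0]
    x = np.full(n, 1.0 / n)          # start from the uniform distribution
    for _ in range(max_iter):
        x_next = P @ x
        if np.linalg.norm(x_next - x, 1) <= tol:
            return x_next
        x = x_next
    return x

# A hypothetical linear decision modifier on the 3-simplex,
# written as a column-stochastic matrix.
P = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.6, 0.1],
              [0.2, 0.2, 0.6]])
x = stationary_distribution(P)
assert np.allclose(P @ x, x, atol=1e-6)   # x is an approximate fixed point
```

Since P has positive entries, power iteration converges to the unique stationary distribution, which is exactly a fixed point of the linear modifier.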
This establishes an equivalence theorem, which is the main contribution of this paper: there exist efficient Φ-no-regret algorithms if and only if it is possible to efficiently compute fixed points of convex combinations of decision modifiers in Φ. This equivalence theorem allows us to translate complexity theoretic lower bounds on computing fixed points to designing no regret algorithms. For instance, a Nash equilibrium can be obtained by applying Brouwer's fixed point theorem to an appropriately defined continuous mapping from the compact convex set of pairs of the players' mixed strategies to itself. Thus, if Φ contains this mapping, then it is PPAD-hard to design Φ-no-regret algorithms.
It was recently brought to our attention that Stoltz and Lugosi [17], building on the work of Hart and Schmeidler [12], have also considered Φ-no-regret algorithms. They also show how to design them from fixed-point oracles, and proved convergence to equilibria under even more general conditions than we consider. Gordon, Greenwald, Marks, and Zinkevich [8] have also considered similar notions of regret and showed convergence to equilibria, in the special case when the deviations in Φ can be represented as the composition of a fixed embedding into a higher dimensional space and an adjustable linear transformation. The focus of our results is on the computational aspect of such reductions, and the equivalence of fixed-point computation and no-regret algorithms.

2 Preliminaries

2.1 Games and Equilibria

We consider the following kinds of games. First, the set of strategies for the players of the game is a convex compact set. Second, the utility functions for the players are concave over their strategy sets. 
To avoid cumbersome notation, we restrict ourselves to two player games, although all of our results naturally extend to multi-player games.
Formally, for i = 1, 2, player i plays points from a convex compact set Ki ⊆ R^{ni}. Her payoff is given by a function ui : K1 × K2 → R, i.e. if x1, x2 is the pair of strategies played by the two players, then the payoff to player i is given by ui(x1, x2). We assume that u1 is a concave function of x1 for any fixed x2, and similarly u2 is a concave function of x2 for any fixed x1.
We now define a notion of game theoretic equilibrium based on the property of being stable with respect to consistent deviations. By this, we mean an online game-playing strategy for the players that will guarantee that neither stands to gain if they decided to unilaterally, and consistently, deviate from their suggested moves.
To model this, assume that each player i has a set of possible deviations Φi, which is a finite^1 set of continuous mappings φi : Ki → Ki. Let Φ = (Φ1, Φ2). Let Ψ be a joint distribution on K1 × K2. If it is the case that for any deviation φ1 ∈ Φ1, player 1's expected payoff obtained by sampling x1 using Ψ is at least as large as her expected payoff obtained by deviating to φ1(x1), then we call Ψ stable under deviations in Φ1. The distribution Ψ is said to be a Φ-equilibrium if Ψ is stable under deviations in Φ1 and Φ2. A similar definition appears in [12] and [17].
Definition 1 (Φ-equilibrium). 
A joint distribution Ψ over K1 × K2 is called a Φ-equilibrium if the following holds, for any φ1 ∈ Φ1, and for any φ2 ∈ Φ2:

  ∫ u1(x1, x2) dΨ(x1, x2) ≥ ∫ u1(φ1(x1), x2) dΨ(x1, x2)
  ∫ u2(x1, x2) dΨ(x1, x2) ≥ ∫ u2(x1, φ2(x2)) dΨ(x1, x2)

We say that Ψ is an ε-approximate Φ-equilibrium if the inequalities above are satisfied up to an additive error of ε.

Intuitively, we imagine a repeated game between the two players, where at equilibrium, the players' moves are correlated by a signal, which could be the past history of the play, and various external factors. This signal samples a pair of moves from an equilibrium joint distribution over all pairs of moves, and suggests to each player individually only the move she is supposed to play. If no player stands to gain if she unilaterally, but consistently, used a deviation from her suggested move, then the distribution of the correlating signal is stable under the set of deviations, and is hence an equilibrium.
Example 1: Correlated Equilibria. A standard 2-player game is obtained when the Ki are the simplices of distributions over some base sets of actions Ai and the utility functions ui are bilinear in x1, x2. 
If the sets Φi consist of the maps φa,b : Ki → Ki for every pair a, b ∈ Ai, defined as

  φa,b(x)[c] = 0          if c = a
               xa + xb    if c = b
               xc         otherwise          (1)

then it can be shown that any Φ-equilibrium can be equivalently viewed as a correlated equilibrium of the game, and vice-versa.

^1 It is highly plausible that the results in this paper extend to the case where Φ is infinite – indeed, our results hold for any set of mappings Φ which is obtained by taking all convex combinations of finitely many mappings – but we restrict to finite Φ in this paper for simplicity.

Example 2: The Stock Market game. Consider the following setting: there are two investors (the generalization to many investors is straightforward), who invest their wealth in n stocks. In each period, they choose portfolios x1 and x2 over the n stocks, and observe the stock returns. We model the stock returns as a function r of the portfolios x1, x2 chosen by the investors, and it maps the portfolios to the vector of stock returns. We make the assumption that each player has a small influence on the market, and thus the function r is insensitive to small perturbations in the input. The wealth gain for each investor i is r(x1, x2) · xi. The standard way to measure the performance of an investment strategy is the logarithmic growth rate, viz. log(r(x1, x2) · xi). We can now define the utility functions as ui(x1, x2) = log(r(x1, x2) · xi). Intuitively, this game models the setting in which the market prices are affected by the investments of the players.
A natural goal for a good investment strategy would be to compare the wealth gain to that of the best fixed portfolio, i.e. Φi is the set of all constant maps. This was considered by Cover in his Universal Portfolio Framework [5]. 
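The swap maps of equation (1) are straightforward to implement directly. A minimal sketch (illustrative only; the example distribution is made up):

```python
import numpy as np

def phi_swap(x, a, b):
    """The deviation of equation (1): move all probability mass
    from action a onto action b, leaving the rest of the
    distribution x unchanged."""
    y = x.copy()
    y[b] = x[a] + x[b]
    y[a] = 0.0
    return y

x = np.array([0.5, 0.3, 0.2])   # a distribution over 3 actions
y = phi_swap(x, a=0, b=2)       # switch action 0 to action 2
# y == [0.0, 0.3, 0.7]; the result is still a probability distribution
```

Since the map only transfers mass between coordinates, it sends the simplex to itself, which is what makes it a valid decision modifier.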
Another possible goal would be to compare the wealth gained to that achievable by modifying the portfolios using the φa,b maps above, as considered by [16]. In Section 3, we show that the stock market game admits algorithms that converge to an ε-equilibrium in O((1/ε) log(1/ε)) rounds, whereas all previous algorithms need O(1/ε²) rounds.

2.2 No regret algorithms

The online learning framework we consider is called online convex optimization [18], in which there is a fixed convex compact feasible set K ⊂ R^n and an arbitrary, unknown sequence of concave payoff functions f^(1), f^(2), . . . : K → R. The decision maker must make a sequence of decisions, where the t-th decision is a selection of a point x^(t) ∈ K, and she obtains a payoff of f^(t)(x^(t)) in period t. The decision maker can only use the previous points x^(1), . . . , x^(t-1), and the previous payoff functions f^(1), . . . , f^(t-1), to choose the point x^(t).
The performance measure we use to evaluate online algorithms is regret, defined as follows. The decision maker has a finite set of N decision modifiers Φ which, as before, is a set of continuous mappings from K to K. Then the regret for not using some deviation φ ∈ Φ is the excess payoff the decision maker could have obtained if she had changed her points in each round by applying φ.
Definition 2 (Φ-Regret). Let Φ be a set of continuous functions from K to K. Given a sequence of T concave utility functions f^(1), . . . , f^(T), define the Φ-regret as

  Regret_Φ(T) = max_{φ∈Φ} Σ_{t=1}^{T} f^(t)(φ(x^(t))) − Σ_{t=1}^{T} f^(t)(x^(t)).

Two specific examples of Φ-regret deserve mention. The first one is "external regret", which is defined when Φ is the set of all constant mappings from K to itself. 
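Definition 2 is easy to evaluate on a finite history of play. Below is a hedged sketch (the payoffs and play sequence are invented for illustration), specialized to external regret: Φ consists of the constant maps onto the simplex vertices, which suffices for linear payoffs since the best constant map is then always a vertex.

```python
import numpy as np

def phi_regret(payoff_fns, points, deviations):
    """Phi-regret of Definition 2: the best fixed deviation's total
    payoff minus the total payoff actually obtained."""
    actual = sum(f(x) for f, x in zip(payoff_fns, points))
    best_dev = max(
        sum(f(phi(x)) for f, x in zip(payoff_fns, points))
        for phi in deviations
    )
    return best_dev - actual

# External regret on the 2-simplex: Phi = constant maps onto each vertex.
n = 3
vertices = [np.eye(n)[i] for i in range(n)]
deviations = [lambda x, v=v: v for v in vertices]            # constant maps
payoffs = [np.array([1.0, 0.0, 0.5]), np.array([0.8, 0.1, 0.4])]
payoff_fns = [lambda x, p=p: float(p @ x) for p in payoffs]  # linear payoffs
points = [np.full(n, 1 / 3), np.full(n, 1 / 3)]              # uniform play
r = phi_regret(payoff_fns, points, deviations)
# best fixed vertex (action 0) earns 1.8; uniform play earns (1.5 + 1.3) / 3
```

Note the `v=v` and `p=p` default arguments, which freeze each loop variable inside its lambda; without them all the constant maps would return the last vertex.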
The second one is "internal regret", which is defined when K is the simplex of distributions over some base set of actions A, and Φ is the set of the φa,b functions (defined in (1)) for all pairs a, b ∈ A.
A desirable property of an algorithm for Online Convex Optimization is Hannan consistency: the regret, as a function of the number of rounds T, is sublinear. This implies that the average per iteration payoff of the algorithm converges to the average payoff of a clairvoyant algorithm that uses the best deviation in hindsight to change the point in every round. For the purpose of this paper, we require a slightly stronger property of an algorithm, viz. that the regret is polynomially sublinear as a function of T.
Definition 3 (No Φ-regret algorithm). A no Φ-regret algorithm is one which, given any sequence of concave payoff functions f^(1), f^(2), . . ., generates a sequence of points x^(1), x^(2), . . . ∈ K such that for all T = 1, 2, . . ., Regret_Φ(T) = O(T^{1−c}) for some constant c > 0. Such an algorithm will be called efficient if it computes x^(t) in poly(n, N, t, L) time.

In the above definition, L is a description length parameter for K, defined appropriately depending on how the set K is represented. For instance, if K is the n-dimensional probability simplex, then L = n. If K is specified by means of a separation oracle and inner and outer radii r and R, then L = log(R/r), and we allow poly(n, N, t, L) calls to the separation oracle in each iteration.
The relatively new framework of Online Convex Optimization (OCO) has received much attention recently in the machine learning community. Our no Φ-regret algorithms can use any of a wide variety of algorithms for OCO. In this paper, we will use the Exponentiated Gradient (EG) algorithm ([14], [1]), which has the following (external) regret bound:
Theorem 1. 
Let the domain K be the simplex of distributions over a base set of size n. Let G∞ be an upper bound on the L∞ norm of the gradients of the payoff functions, i.e. G∞ ≥ sup_{x∈K} ||∇f^(t)(x)||∞. Then the EG algorithm generates points x^(1), . . . , x^(T) such that

  max_{x∈K} Σ_{t=1}^{T} f^(t)(x) − Σ_{t=1}^{T} f^(t)(x^(t)) ≤ O(G∞ √(log(n) T)).

If the utility functions are strictly concave rather than linear, even stronger regret bounds, which depend on log(T) rather than √T, are known [13].
While most of the literature on online convex optimization focuses on external regret, it was observed that any Online Convex Optimization algorithm for external regret can be converted to an internal regret algorithm (for example, see [2], [16]).

2.3 Fixed Points

As mentioned in the introduction, our no regret algorithms depend on computing fixed points of the relevant mappings. For a given set of deviations Φ, denote by CH(Φ) the set of all convex combinations of deviations in Φ, i.e.

  CH(Φ) = { Σ_{φ∈Φ} α_φ φ : α_φ ≥ 0 and Σ_{φ∈Φ} α_φ = 1 }.

Since each map φ ∈ CH(Φ) is a continuous function from K to K, and K is a convex compact domain, by Brouwer's fixed point theorem, φ has a fixed point in K, i.e. there exists a point x ∈ K such that φ(x) = x. We consider algorithms which approximate fixed points for a given map in the following sense.
Definition 4 (FPTAS for fixed points of deviations). Let Φ be a set of N continuous functions from K to K. 
A fully polynomial time approximation scheme (FPTAS) for fixed points of Φ is an algorithm which, given any function φ ∈ CH(Φ) and an error parameter ε > 0, computes a point x ∈ K such that ||φ(x) − x|| ≤ ε in poly(n, N, L, 1/ε) time.

3 Convergence of no Φ-regret algorithms to Φ-equilibria

In this section we prove that if the players use no Φ-regret algorithms, then the empirical distribution of the moves converges to a Φ-equilibrium. [11] shows that if players use no internal regret algorithms, then the empirical distribution of the moves converges to a correlated equilibrium. This was generalized by [9] to any set of linear transformations Φ. The more general setting of this paper also follows easily from the definitions. A similar theorem was also proved in [17].
The advantage of this general setting is that the connection to online convex optimization allows for faster rates of convergence using recent online learning techniques. We give an example of a natural game theoretic setting with a faster convergence rate below.
Theorem 2. If each player i chooses moves using a no Φi-regret algorithm, then the empirical game distribution of the players' moves converges to a Φ-equilibrium. Further, an ε-approximate Φ-equilibrium is reached after T iterations for the first T which satisfies (1/T) Regret_Φ(T) ≤ ε.

Proof. Consider the first player. In each game iteration t, let (x1^(t), x2^(t)) be the pair of moves played by the two players. From player 1's point of view, the payoff function she obtains, f^(t), is the following:

  ∀x ∈ K1 :  f^(t)(x) := u1(x, x2^(t)).

Note that this function is concave by assumption. 
Then we have, by Definition 3,

  Regret_{Φ1}(T) = max_{φ∈Φ1} Σ_{t=1}^{T} f^(t)(φ(x1^(t))) − Σ_{t=1}^{T} f^(t)(x1^(t)).

Rewriting this in terms of the original utility function, and scaling by the number of iterations, we get, for every φ ∈ Φ1,

  (1/T) Σ_{t=1}^{T} u1(x1^(t), x2^(t)) ≥ (1/T) Σ_{t=1}^{T} u1(φ(x1^(t)), x2^(t)) − (1/T) Regret_{Φ1}(T).

Denote by Ψ^(T) the empirical distribution of the played strategies till iteration T, i.e. the distribution which puts a probability mass of 1/T on all pairs (x1^(t), x2^(t)) for t = 1, 2, . . . , T. Then, the above inequality can be rewritten as

  ∫ u1(x1, x2) dΨ^(T)(x1, x2) ≥ ∫ u1(φ(x1), x2) dΨ^(T)(x1, x2) − (1/T) Regret_{Φ1}(T).

A similar inequality holds for player 2 as well. Now assume that both players use no regret algorithms, which ensure that Regret_{Φi}(T) = O(T^{1−c}) for some constant c > 0. Hence as T → ∞, we have (1/T) Regret_{Φi}(T) → 0. Thus Ψ^(T) converges to a Φ-equilibrium. Also, Ψ^(T) is an ε-approximate equilibrium as soon as T is large enough so that (1/T) Regret_{Φ1}(T) and (1/T) Regret_{Φ2}(T) are less than ε, i.e. T ≥ Ω(1/ε^{1/c}).

A corollary of Theorem 2 is that we can obtain faster rates of convergence using recent online learning techniques, when the payoff functions are non-linear. This is natural in many situations, since risk aversion is associated with the concavity of utility functions.
Corollary 3. For the stock market game as defined in Section 2.1, there exist no regret algorithms which guarantee convergence to an ε-equilibrium in O((1/ε) log(1/ε)) iterations.

Proof sketch. The utility functions observed by investor i in the stock market game are of the form ui(x1, x2) = log(r(x1, x2) · xi). 
This logarithmic utility function is exp-concave, by the assumption on the insensitivity of the function r to small perturbations in the input. Thus the online algorithm of [5], or the more efficient algorithms of [13], can be applied. In the full version of this paper, we show that Lemma 6 can be modified to obtain algorithms with Regret_{Φi}(T) = O(log T). By Theorem 2 above, the investors reach an ε-equilibrium in O((1/ε) log(1/ε)) iterations.

4 Computational Equivalence of Fixed Points and No Regret algorithms

In this section we prove our main result on the computational equivalence of computing fixed points and designing no regret algorithms. By the result of the previous section, players using no regret algorithms converge to equilibria.
We assume that the payoff functions f^(t) are scaled so that the (L2) norm of their gradients is bounded by 1, i.e. ||∇f^(t)|| ≤ 1. Our main theorem is the following:
Theorem 4. Let Φ be a given finite set of deviations. Then there is an FPTAS for fixed points of Φ if and only if there exists an efficient no Φ-regret algorithm.

The first direction of the theorem is proved by designing utility functions for which the no regret property will imply convergence to an approximate fixed point of the corresponding transformations. The proof crucially depends on the fact that no regret algorithms have the stringent requirement that their worst case regret, against arbitrary adversarially chosen payoff functions, is sublinear as a function of the number of rounds.
Lemma 5. If there exists a no Φ-regret algorithm A, then there exists an FPTAS for fixed points of Φ.

Proof. Let φ0 ∈ CH(Φ) be a given mapping whose fixed point we wish to compute. Let ε be a given error parameter.
At iteration t, let x^(t) be the point chosen by A. 
If ||φ0(x^(t)) − x^(t)|| ≤ ε, we can stop, because we have found an approximate fixed point. Else, supply A with the following payoff function:

  f^(t)(x) := [(φ0(x^(t)) − x^(t))^T (x − x^(t))] / ||φ0(x^(t)) − x^(t)||

This is a linear function, with ||∇f^(t)(x)|| = 1. Also, f^(t)(x^(t)) = 0, and f^(t)(φ0(x^(t))) = ||φ0(x^(t)) − x^(t)|| ≥ ε. After T iterations, since φ0 is a convex combination of functions in Φ, and since all the f^(t) are linear functions, we have

  max_{φ∈Φ} Σ_{t=1}^{T} f^(t)(φ(x^(t))) ≥ Σ_{t=1}^{T} f^(t)(φ0(x^(t))) ≥ εT.

Thus,

  Regret_Φ(T) = max_{φ∈Φ} Σ_{t=1}^{T} f^(t)(φ(x^(t))) − Σ_{t=1}^{T} f^(t)(x^(t)) ≥ εT.          (2)

Since A is a no-regret algorithm, assume that A ensures that Regret_Φ(T) = O(T^{1−c}) for some constant c > 0. Thus, when T = Ω(1/ε^{1/c}), the lower bound (2) on the regret cannot hold unless we have already found an ε-approximate fixed point of φ0.

The second direction is on the lines of the algorithms of [2] and [16], which use fixed point computations to obtain no internal regret algorithms.
Lemma 6. If there is an FPTAS for fixed points of Φ, then there is an efficient no Φ-regret algorithm. In fact, the algorithm guarantees that Regret_Φ(T) = O(√T).^2

Proof. We reduce the given OCO problem to an "inner" OCO problem. The "outer" OCO problem is the original one. We use a no external regret algorithm for the inner OCO problem to generate points in K for the outer one, and use the payoff functions obtained in the outer OCO problem to generate appropriate payoff functions for the inner one.
Let Φ = {φ1, φ2, . . . , φN}. The domain for the inner OCO problem is the simplex of all distributions on Φ, denoted ∆N. 
For a distribution α ∈ ∆N, let αi be the probability measure assigned to φi in the distribution α. There is a natural mapping from ∆N to CH(Φ): for any α ∈ ∆N, denote by φα the function Σ_{i=1}^{N} αi φi ∈ CH(Φ).
Let x^(t) ∈ K be the point used in the outer OCO problem in the t-th round, and let f^(t) be the obtained payoff function. Then the payoff function for the inner OCO problem is the function g^(t) : ∆N → R defined as follows:

  ∀α ∈ ∆N :  g^(t)(α) := f^(t)(φα(x^(t))).

We now apply the Exponentiated Gradient (EG) algorithm (see Section 2.2) to the inner OCO problem. To analyze the algorithm, we bound ||∇g^(t)||∞ as follows. Let x0 be an arbitrary point in K. We can rewrite g^(t) as g^(t)(α) = f^(t)(x0 + Σ_i αi(φi(x^(t)) − x0)), because Σ_i αi = 1. Then ∇g^(t) = X^(t) ∇f^(t)(φα(x^(t))), where X^(t) is an N × n matrix whose i-th row is (φi(x^(t)) − x0)^T. Thus,

  ||∇g^(t)||∞ = max_i |(φi(x^(t)) − x0)^T ∇f^(t)(φα(x^(t)))| ≤ ||φi(x^(t)) − x0|| ||∇f^(t)(φα(x^(t)))|| ≤ 1.

The last inequality follows because we assumed that the diameter of K is bounded by 1, and the norm of the gradient of f^(t) is also bounded by 1.
Let α^(t) be the distribution on Φ produced by the EG algorithm at time t. Now, the point x^(t) is computed by running the FPTAS for computing a (1/√t)-approximate fixed point of the function φ_{α^(t)}, i.e. 
we have ||φ_{α^(t)}(x^(t)) − x^(t)|| ≤ 1/√t.

Now, using the definition of the g^(t) functions, and by the regret bound for the EG algorithm, we have that for any fixed distribution α ∈ ∆N,

  Σ_{t=1}^{T} f^(t)(φα(x^(t))) − Σ_{t=1}^{T} f^(t)(φ_{α^(t)}(x^(t))) = Σ_{t=1}^{T} g^(t)(α) − Σ_{t=1}^{T} g^(t)(α^(t)) ≤ O(√(log(N) T)).          (3)

Since ||∇f^(t)|| ≤ 1,

  f^(t)(φ_{α^(t)}(x^(t))) − f^(t)(x^(t)) ≤ ||φ_{α^(t)}(x^(t)) − x^(t)|| ≤ 1/√t.          (4)

Summing (4) from t = 1 to T, and adding to (3), we get that for any distribution α over Φ,

  Σ_{t=1}^{T} f^(t)(φα(x^(t))) − Σ_{t=1}^{T} f^(t)(x^(t)) ≤ O(√(log(N) T)) + Σ_{t=1}^{T} 1/√t = O(√(log(N) T)).

In particular, by concentrating α on any given φi, the above inequality implies that Σ_{t=1}^{T} f^(t)(φi(x^(t))) − Σ_{t=1}^{T} f^(t)(x^(t)) ≤ O(√(log(N) T)), and thus we have a no Φ-regret algorithm.

^2 In the full version of the paper, we improve the regret bound to O(log T) under some stronger concavity assumptions on the payoff functions.

References

[1] S. Arora, E. Hazan, and S. Kale. The multiplicative weights update method: a meta algorithm and applications. Manuscript, 2005.
[2] A. Blum and Y. Mansour. From external to internal regret. In COLT, pages 621–636, 2005.
[3] X. Chen and X. Deng. Settling the complexity of two-player Nash equilibrium. In 47th FOCS, pages 261–272, 2006.
[4] X. Chen, X. Deng, and S-H. Teng. Computing Nash equilibria: Approximation and smoothed complexity. In 47th FOCS, pages 603–612, 2006.
[5] T. Cover. Universal portfolios. Math. Finance, 1:1–19, 1991.
[6] C. Daskalakis, P. W. Goldberg, and C. H. Papadimitriou. The complexity of computing a Nash equilibrium. In 38th STOC, pages 71–78, 2006.
[7] Y. Freund and R. E. Schapire. 
Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79–103, 1999.
[8] G. Gordon, A. Greenwald, C. Marks, and M. Zinkevich. No-regret learning in convex games. Brown University Tech Report CS-07-10, 2007.
[9] A. Greenwald and A. Jafari. A general class of no-regret learning algorithms and game-theoretic equilibria, 2003.
[10] J. Hannan. Approximation to Bayes risk in repeated play. In M. Dresher, A. W. Tucker, and P. Wolfe, editors, Contributions to the Theory of Games, volume III, pages 97–139, 1957.
[11] S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
[12] S. Hart and D. Schmeidler. Existence of correlated equilibria. Mathematics of Operations Research, 14(1):18–25, 1989.
[13] E. Hazan, A. Kalai, S. Kale, and A. Agarwal. Logarithmic regret algorithms for online convex optimization. In 19th COLT, 2006.
[14] J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Inf. Comput., 132(1):1–63, 1997.
[15] C. H. Papadimitriou. On the complexity of the parity argument and other inefficient proofs of existence. J. Comput. Syst. Sci., 48(3):498–532, 1994.
[16] G. Stoltz and G. Lugosi. Internal regret in on-line portfolio selection. Machine Learning, 59:125–159, 2005.
[17] G. Stoltz and G. Lugosi. Learning correlated equilibria in games with compact sets of strategies. Games and Economic Behavior, 59:187–208, 2007.
[18] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In 20th ICML, pages 928–936, 2003.
", "award": [], "sourceid": 695, "authors": [{"given_name": "Elad", "family_name": "Hazan", "institution": null}, {"given_name": "Satyen", "family_name": "Kale", "institution": null}]}