{"title": "Conditional Swap Regret and Conditional Correlated Equilibrium", "book": "Advances in Neural Information Processing Systems", "page_first": 1314, "page_last": 1322, "abstract": "We introduce a natural extension of the notion of swap regret, conditional swap regret, that allows for action modifications conditioned on the player\u2019s action history. We prove a series of new results for conditional swap regret minimization. We present algorithms for minimizing conditional swap regret with bounded conditioning history. We further extend these results to the case where conditional swaps are considered only for a subset of actions. We also define a new notion of equilibrium, conditional correlated equilibrium, that is tightly connected to the notion of conditional swap regret: when all players follow conditional swap regret minimization strategies, then the empirical distribution approaches this equilibrium. Finally, we extend our results to the multi-armed bandit scenario.", "full_text": "Conditional Swap Regret and\n\nConditional Correlated Equilibrium\n\nMehryar Mohri\n\nCourant Institute and Google\n\n251 Mercer Street\n\nNew York, NY 10012\nmohri@cims.nyu.edu\n\nScott Yang\n\nCourant Institute\n251 Mercer Street\n\nNew York, NY 10012\nyangs@cims.nyu.edu\n\nAbstract\n\nWe introduce a natural extension of the notion of swap regret, conditional swap\nregret, that allows for action modi\ufb01cations conditioned on the player\u2019s action his-\ntory. We prove a series of new results for conditional swap regret minimization.\nWe present algorithms for minimizing conditional swap regret with bounded con-\nditioning history. We further extend these results to the case where conditional\nswaps are considered only for a subset of actions. 
We also de\ufb01ne a new notion\nof equilibrium, conditional correlated equilibrium, that is tightly connected to the\nnotion of conditional swap regret: when all players follow conditional swap regret\nminimization strategies, then the empirical distribution approaches this equilib-\nrium. Finally, we extend our results to the multi-armed bandit scenario.\n\n1\n\nIntroduction\n\nOn-line learning has received much attention in recent years.\nIn contrast to the standard batch\nframework, the online learning scenario requires no distributional assumption. It can be described\nin terms of sequential prediction with expert advice [13] or formulated as a repeated two-player\ngame between a player (the algorithm) and an opponent with an unknown strategy [7]: at each time\nstep, the algorithm probabilistically selects an action, the opponent chooses the losses assigned to\neach action, and the algorithm incurs the loss corresponding to the action it selected.\nThe standard measure of the quality of an online algorithm is its regret, which is the difference\nbetween the cumulative loss it incurs after some number of rounds and that of an alternative policy.\nThe cumulative loss can be compared to that of the single best action in retrospect [13] (external\nregret), to the loss incurred by changing every occurrence of a speci\ufb01c action to another [9] (internal\nregret), or, more generally, to the loss of action sequences obtained by mapping each action to some\nother action [4] (swap regret). Swap regret, in particular, accounts for situations where the algorithm\ncould have reduced its loss by swapping every instance of one action with another (e.g. every time\nthe player bought Microsoft, he should have bought IBM).\nThere are many algorithms for minimizing external regret [7], such as, for example, the randomized\nweighted-majority algorithm of [13]. It was also shown in [4] and [15] that there exist algorithms for\nminimizing internal and swap regret. 
These regret minimization techniques have been shown to be useful for approximating game-theoretic equilibria: external regret algorithms for Nash equilibria and swap regret algorithms for correlated equilibria [14].
By definition, swap regret compares a player's action sequence against all possible modifications at each round, independently of the previous time steps. In this paper, we introduce a natural extension of swap regret, conditional swap regret, that allows for action modifications conditioned on the player's action history. Our definition depends on the number of past time steps we condition upon.
As a motivating example, let us limit this history to just the previous one time step, and suppose we design an online algorithm for the purpose of investing, where one of our actions is to buy bonds and another to buy stocks. Since bond and stock prices are known to be negatively correlated, we should always be wary of buying one immediately after the other – unless our objective was to pay for transaction costs without actually modifying our portfolio! However, this does not mean that we should avoid purchasing one or both of the two assets completely, which would be the only available alternative in the swap regret scenario. The conditional swap class we introduce provides precisely a way to account for such correlations between actions. We start by introducing the learning set-up and the key notions relevant to our analysis (Section 2).

2 Learning set-up and model

We consider the standard online learning set-up with a set of actions N = {1, ..., N}. At each round t ∈ {1, ..., T}, T ≥ 1, the player selects an action x_t ∈ N according to a distribution p^t over N, in response to which the adversary chooses a function f^t: N^t → [0, 1] and causes the player to incur a loss f^t(x_t, x_{t-1}, ..., x_1). 
The objective of the player is to choose a sequence of actions (x_1, ..., x_T) that minimizes his cumulative loss sum_{t=1}^T f^t(x_t, x_{t-1}, ..., x_1).
A standard metric used to measure the performance of an online algorithm A over T rounds is its (expected) external regret, which measures the player's expected performance against the best fixed action in hindsight:

Reg_{Ext}(A, T) = sum_{t=1}^T E_{(x_t,...,x_1) ~ (p^t,...,p^1)}[f^t(x_t, ..., x_1)] - min_{j ∈ N} sum_{t=1}^T f^t(j, j, ..., j).

There are several common modifications to the above online learning scenario: (1) we may compare regret against stronger competitor classes: Reg_C(A, T) = sum_{t=1}^T E_{p^t,...,p^1}[f^t(x_t, ..., x_1)] - min_{φ ∈ C} sum_{t=1}^T E_{p^t,...,p^1}[f^t(φ(x_t), φ(x_{t-1}), ..., φ(x_1))] for some function class C ⊆ N^N; (2) the player may have access to only partial information about the loss, i.e., only knowledge of f^t(x_t, ..., x_1) as opposed to f^t(a, x_{t-1}, ..., x_1) for all a ∈ N (also known as the bandit scenario); (3) the loss function may have bounded memory: f^t(x_t, ..., x_{t-k}, x_{t-k-1}, ..., x_1) = f^t(x_t, ..., x_{t-k}, y_{t-k-1}, ..., y_1) for all x_j, y_j ∈ N.
The scenario where C = N^N in (1) is called the swap regret case, and the case where k = 0 in (3) is referred to as the oblivious adversary. (Sublinear) regret minimization is possible for loss functions against any competitor class of the form described in (1), with only partial information, and with at least some level of bounded memory. See [4] and [1] for a reference on (1), [2] and [5] for (2), and [1] for (3). 
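To make the external regret benchmark concrete, here is a minimal sketch of the randomized weighted-majority algorithm of [13] run on a logged loss sequence (an illustration only, not the paper's construction; the learning rate eta = 0.1 and the toy loss sequence are assumptions):

```python
import math

def rwm(losses, eta=0.1):
    # losses: list of rounds, each a list of N losses in [0, 1]
    n = len(losses[0])
    w = [1.0] * n
    total = 0.0
    for f in losses:
        z = sum(w)
        p = [wi / z for wi in w]
        # expected loss of the randomized play at this round
        total += sum(pi * fi for pi, fi in zip(p, f))
        # multiplicative update on each action's weight
        w = [wi * math.exp(-eta * fi) for wi, fi in zip(w, f)]
    # cumulative loss of the best fixed action in hindsight
    best = min(sum(f[j] for f in losses) for j in range(n))
    return total, best  # external regret = total - best

losses = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
tot, best = rwm(losses)
```

Here the external regret after T = 3 rounds is tot - best; for a fixed horizon, tuning eta gives the familiar O(sqrt(T log N)) guarantee.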
[6] also provides a detailed summary of the best known regret bounds in all of these scenarios and more.
The introduction of adversaries with bounded memory naturally leads to an interesting question: what if we also try to increase the power of the competitor class in this way?
While swap regret is a natural competitor class and has many useful game-theoretic consequences (see [14]), one important missing ingredient is that the competitor class of functions does not have memory. In fact, in most if not all online learning scenarios and regret minimization algorithms considered so far, the point of comparison has been against modifications of the player's actions at each point of time, independently of the previous actions. But, as we discussed above in the financial markets example, there exist cases where a player should be measured against alternatives that depend on the past, and the player should take into account the correlations between actions.
Specifically, we consider competitor functions of the form Φ_t: N^t → N^t. Let C_all = {Φ_t: N^t → N^t}_{t=1}^∞ denote the class of all such functions. This leads us to the expression: sum_{t=1}^T E_{p^1,...,p^t}[f^t] - min_{Φ_t ∈ C_all} sum_{t=1}^T E_{p^1,...,p^t}[f^t ∘ Φ_t]. C_all is clearly a substantially richer class of competitor functions than traditional swap regret. In fact, it is the most comprehensive class, since we can always reach sum_{t=1}^T E_{p^1,...,p^t}[f^t] - sum_{t=1}^T min_{(x_1,...,x_t)} f^t(x_1, ..., x_t) by choosing Φ_t to map all points to argmin_{(x_t,...,x_1)} f^t(x_t, ..., x_1). 
Not surprisingly, however, it is not possible to obtain a sublinear regret bound against this general class.

[Figure 1: (a) unigram conditional swap class interpreted as a finite-state transducer. This is the same as the usual swap class and has only the trivial state; (b) bigram conditional swap class interpreted as a finite-state transducer. The action at time t-1 defines the current state and influences the potential swap at time t.]

Theorem 1. No algorithm can achieve sublinear regret against the class C_all, regardless of the loss function's memory.

This result is well-known in the on-line learning community, but, for completeness, we include a proof in Appendix 9. Theorem 1 suggests examining more reasonable subclasses of C_all. To simplify the notation and proofs that follow in the paper, we will henceforth restrict ourselves to the scenario of an oblivious adversary, as in the original study of swap regret [4]. However, an application of the batching technique of [1] should produce analogous results in the non-oblivious case for all of the theorems that we provide.
Now consider the collection of competitor functions C_k = {φ: N^k → N}. 
Then, a player who has played actions {a_s}_{s=1}^{t-1} in the past should have his performance compared against φ(a_t, a_{t-1}, a_{t-2}, ..., a_{t-(k-1)}) at time t, where φ ∈ C_k. We call this class C_k of functions the k-gram conditional swap regret class, which also leads us to the regret definition:

Reg_{C_k}(A, T) = sum_{t=1}^T E_{x_t ~ p^t}[f^t(x_t)] - min_{φ ∈ C_k} sum_{t=1}^T E_{x_t ~ p^t}[f^t(φ(x_t, a_{t-1}, a_{t-2}, ..., a_{t-(k-1)}))].

Note that this is a direct extension of swap regret to the scenario where we allow for swaps conditioned on the history of the previous (k-1) actions. For k = 1, this precisely coincides with swap regret.
One important remark about the k-gram conditional swap regret is that it is a random quantity that depends on the particular sequence of actions played. A natural deterministic alternative would be of the form:

sum_{t=1}^T E_{x_t ~ p^t}[f^t(x_t)] - min_{φ ∈ C_k} sum_{t=1}^T E_{(x_t,...,x_1) ~ (p^t,...,p^1)}[f^t(φ(x_t, x_{t-1}, x_{t-2}, ..., x_{t-(k-1)}))].

However, by taking the expectation of Reg_{C_k}(A, T) with respect to a_{T-1}, a_{T-2}, ..., a_1 and applying Jensen's inequality, we obtain

E[Reg_{C_k}(A, T)] ≥ sum_{t=1}^T E_{x_t ~ p^t}[f^t(x_t)] - min_{φ ∈ C_k} sum_{t=1}^T E_{(x_t,...,x_1) ~ (p^t,...,p^1)}[f^t(φ(x_t, x_{t-1}, x_{t-2}, ..., x_{t-(k-1)}))],

and so no generality is lost by considering the randomized sequence of actions in our regret term.
Another interpretation of the bigram conditional swap class is in the context of finite-state transducers. Taking a player's sequence of actions (x_1, ..., x_T), we may view each competitor function in the conditional swap class as an application of a finite-state transducer with N states, as illustrated by Figure 1. 
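For intuition, the empirical bigram (k = 2) conditional swap regret of a logged play can be computed directly from the definition: the best swap φ: N^2 → N decomposes over (current action, previous action) pairs, so one can pick the argmin replacement per pair in hindsight. A small sketch, assuming point-mass distributions (so expectations reduce to realized plays) and an illustrative toy log:

```python
def bigram_cond_swap_regret(actions, losses):
    # actions[t] in {0, ..., n-1}; losses[t] is the full loss vector f^t
    # (point-mass play distributions assumed for simplicity)
    n = len(losses[0])
    realized = sum(losses[t][actions[t]] for t in range(len(actions)))
    # accumulate, per (current, previous) action pair, the summed loss
    # vector over rounds where that pair occurred
    acc = {}
    for t in range(len(actions)):
        prev = actions[t - 1] if t > 0 else 0  # arbitrary initial history
        vec = acc.setdefault((actions[t], prev), [0.0] * n)
        for j in range(n):
            vec[j] += losses[t][j]
    # best swap phi(i, k): independent argmin for each pair (i, k)
    best = sum(min(vec) for vec in acc.values())
    return realized - best

actions = [0, 1, 0, 1]
losses = [[0.2, 0.8], [0.9, 0.1], [0.2, 0.8], [0.9, 0.1]]
regret = bigram_cond_swap_regret(actions, losses)
```

In this toy log the alternating play is already optimal against every history-conditioned swap, so the regret is zero; repeating a bad action after itself would instead yield positive regret.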
Each state encodes the history of actions (x_{t-1}, ..., x_{t-(k-1)}) and admits N outgoing transitions representing the next action along with its possible modification. In this framework, the original swap regret class is simply a transducer with a single state.

3 Full Information Scenario

Here, we prove that it is in fact possible to minimize k-gram conditional swap regret against an oblivious adversary, starting with the easier-to-interpret bigram scenario. Our proof constructs a meta-algorithm using external regret algorithms as subroutines, as in [4]. The key is to attribute a fraction of the loss to each external regret algorithm, so that these losses sum up to our actual realized loss and also press the subroutines to minimize regret against each of the conditional swaps.
Theorem 2. There exists an online algorithm A with bigram swap regret bounded as follows: Reg_{C_2}(A, T) ≤ O(N √(T log N)).

Proof. Since the distribution p^t at round t is finite-dimensional, we can represent it as a vector p^t = (p^t_1, ..., p^t_N). Similarly, since oblivious adversaries take only N arguments, we can write f^t as the loss vector f^t = (f^t_1, ..., f^t_N). Let {a_t}_{t=1}^T be a sequence of random variables denoting the player's actions at each time t, and let δ^t_{a_t} denote the (random) Dirac delta distribution concentrated at a_t and applied to variable x_t. Then, we can rewrite the bigram swap regret as follows:

Reg_{C_2}(A, T) = sum_{t=1}^T E_{p^t}[f^t(x_t)] - min_{φ ∈ C_2} sum_{t=1}^T E_{p^t, δ^{t-1}_{a_{t-1}}}[f^t(φ(x_t, x_{t-1}))]
= sum_{t=1}^T sum_{i=1}^N p^t_i f^t_i - min_{φ ∈ C_2} sum_{t=1}^T sum_{i,j=1}^N p^t_i δ^{t-1}_{a_{t-1}=j} f^t_{φ(i,j)}.

Our algorithm for achieving sublinear regret is defined as follows:

1. 
At t = 1, initialize N^2 external regret minimizing algorithms A_{i,k}, (i, k) ∈ N^2. We can view these in the form of N matrices in R^{N×N}, {Q^{t,k}}_{k=1}^N, where for each k ∈ {1, ..., N}, Q^{t,k}_i is a row vector consisting of the distribution weights generated by algorithm A_{i,k} at time t based on losses received at times 1, ..., t-1.

2. At each time t, let a_{t-1} denote the random action played at time t-1 and let δ^{t-1}_{a_{t-1}} denote the (random) Dirac delta distribution for this action. Define the N × N matrix Q^t = sum_{k=1}^N δ^{t-1}_{a_{t-1}=k} Q^{t,k}. Q^t is a Markov chain (i.e., its rows sum up to one), so it admits a stationary distribution p^t, which we will use as our distribution for time t.

3. When we draw from p^t, we play a random action a_t and receive loss f^t. Attribute the portion of loss p^t_i δ^{t-1}_{a_{t-1}=k} f^t to algorithm A_{i,k}, and generate the distributions Q^{t+1,k}_i for the next round. Notice that sum_{i,k=1}^N p^t_i δ^{t-1}_{a_{t-1}=k} f^t = f^t, so that the actual realized loss is allocated completely.

Recall that an optimal external regret minimizing algorithm A (e.g., randomized weighted majority) admits a regret bound of the form R_{i,k} = R_{i,k}(L^{i,k}_min, T, N) = O(√(L^{i,k}_min log N)), where L^{i,k}_min = min_{j ∈ N} sum_{t=1}^T f^{t,i,k}_j for the sequence of loss vectors {f^{t,i,k}}_{t=1}^T incurred by the algorithm. Since p^t = p^t Q^t is a stationary distribution, we can write:

sum_{t=1}^T p^t · f^t = sum_{t=1}^T sum_{j=1}^N p^t_j f^t_j = sum_{t=1}^T sum_{j=1}^N sum_{i=1}^N p^t_i Q^t_{i,j} f^t_j = sum_{t=1}^T sum_{j=1}^N sum_{i=1}^N sum_{k=1}^N p^t_i δ^{t-1}_{a_{t-1}=k} Q^{t,k}_{i,j} f^t_j.

Rearranging leads to

sum_{t=1}^T p^t · f^t = sum_{i,k=1}^N [ sum_{t=1}^T sum_{j=1}^N p^t_i δ^{t-1}_{a_{t-1}=k} Q^{t,k}_{i,j} f^t_j ]
≤ sum_{i,k=1}^N [ ( sum_{t=1}^T p^t_i δ^{t-1}_{a_{t-1}=k} f^t_{φ(i,k)} ) + R_{i,k}(L_min, T, N) ] (for arbitrary φ: N^2 → N)
= sum_{i,k=1}^N ( sum_{t=1}^T p^t_i δ^{t-1}_{a_{t-1}=k} f^t_{φ(i,k)} ) + sum_{i,k=1}^N R_{i,k}(L_min, T, N).

Since φ is arbitrary, we obtain

Reg_{C_2}(A, T) = sum_{t=1}^T p^t · f^t - min_{φ ∈ C_2} sum_{i,k=1}^N sum_{t=1}^T p^t_i δ^{t-1}_{a_{t-1}=k} f^t_{φ(i,k)} ≤ sum_{i,k=1}^N R_{i,k}(L_min, T, N).

Using the fact that R_{i,k} = O(√(L^{i,k}_min log N)) and that we scaled the losses to algorithm A_{i,k} by p^t_i δ^{t-1}_{a_{t-1}=k}, the following inequality holds: sum_{k=1}^N sum_{j=1}^N L^{k,j}_min ≤ T. By Jensen's inequality, this implies

(1/N^2) sum_{k=1}^N sum_{j=1}^N √(L^{k,j}_min) ≤ √( (1/N^2) sum_{k=1}^N sum_{j=1}^N L^{k,j}_min ) ≤ √T / N,

or, equivalently, sum_{k=1}^N sum_{j=1}^N √(L^{k,j}_min) ≤ N √T. Combining this with our regret bound yields

Reg_{C_2}(A, T) ≤ sum_{i,k=1}^N R_{i,k}(L_min, T, N) = sum_{i,k=1}^N O(√(L^{i,k}_min log N)) ≤ O(N √(T log N)),

which concludes the proof.
Remark 1. 
The computational complexity of a standard external regret minimization algorithm such as randomized weighted majority per round is in O(N) (update the distribution on each of the N actions multiplicatively and then renormalize), which implies that updating the N^2 subroutines will cost O(N^3) per round. Allocating losses to these subroutines and combining the distributions that they return will cost an additional O(N^3) time. Finding the stationary distribution of a stochastic matrix can be done via matrix inversion in O(N^3) time. Thus, the total computational complexity of achieving O(N √(T log N)) regret is only O(N^3 T). We remark that in practice, one often uses iterative methods to compute dominant eigenvalues (see [16] for a standard reference and [11] for recent improvements). [10] has also studied techniques to avoid computing the exact stationary distribution at every iteration step for similar types of problems.

The meta-algorithm above can be interpreted in three equivalent ways: (1) the player draws an action x_t from distribution p^t at time t; (2) the player uses distribution p^t to choose among the N subsets of algorithms Q^t_1, ..., Q^t_N, picking one subset Q^t_j with probability p^t_j; next, after drawing j from p^t, the player uses δ^{t-1}_{a_{t-1}=k} to randomly choose among the algorithms Q^{t,1}_j, ..., Q^{t,N}_j, picking algorithm Q^{t,a_{t-1}}_j; after locating this algorithm, the player uses the distribution from algorithm Q^{t,a_{t-1}}_j to draw an action; (3) the player chooses algorithm Q^{t,k}_j with probability p^t_j δ^{t-1}_{a_{t-1}=k} and draws an action from its distribution.
The following more general bound can be given for an arbitrary k-gram swap scenario.
Theorem 3. 
There exists an online algorithm A with k-gram swap regret bounded as follows: Reg_{C_k}(A, T) ≤ O(√(N^k T log N)).
The algorithm used to derive this result is a straightforward extension of the algorithm provided in the bigram scenario, and the proof is given in Appendix 11.
Remark 2. The computational complexity of achieving the above regret bound is O(N^{k+1} T).

[Figure 2: bigram conditional swap class restricted to a finite number of active states. When the action at time t-1 is 1 or 2, the transducer is in the same state, and the swap function is the same.]

4 State-Dependent Bounds

In some situations, it may not be relevant to consider conditional swaps for every possible action, either because of the specific problem at hand or simply for the sake of computational efficiency. Thus, for any S ⊆ N^2, we define the following competitor class of functions:

C_{2,S} = {φ: N^2 → N | φ(i, k) = φ~(i) for (i, k) ∈ S, where φ~: N → N}.

See Figure 2 for a transducer interpretation of this scenario.
We will now show that the algorithm above can be easily modified to derive a tighter bound that is dependent on the number of states in our competitor class. We will focus on the bigram case, although a similar result can be shown for the general k-gram conditional swap regret.
Theorem 4. 
There exists an online algorithm A such that Reg_{C_{2,S}}(A, T) ≤ O(√(T (|S^c| + N) log N)).
The proof of this result is given in Appendix 10. Note that when S = ∅, we are in the scenario where all the previous states matter, and our bound coincides with that of the previous section.
Remark 3. The computational complexity of achieving the above regret bound is O((N(|π_1(S)| + |S^c|) + N^3) T), where π_1 is projection onto the first component. This follows from the fact that we allocate the same loss to all {A_{i,k}}_{k: (i,k) ∈ S} for all i ∈ π_1(S), so we effectively only have to manage |π_1(S)| + |S^c| subroutines.

5 Conditional Correlated Equilibrium and ε-Dominated Actions

It is well-known that regret minimization in on-line learning is related to game-theoretic equilibria [14]. Specifically, when both players in a two-player zero-sum game follow external regret minimizing strategies, then the product of their individual empirical distributions converges to a Nash equilibrium. Moreover, if all players in a general K-player game follow swap regret minimizing strategies, then their empirical joint distribution converges to a correlated equilibrium [7].
We will show in this section that when all players follow conditional swap regret minimization strategies, then the empirical joint distribution will converge to a new, stricter type of correlated equilibrium.
Definition 1. Let N_k = {1, ..., N_k}, for k ∈ {1, ..., K}, and let G = (S = N_1 × ... × N_K, {l^(k): S → [0, 1]}_{k=1}^K) denote a K-player game. Let s = (s_1, ..., s_K) ∈ S denote the strategies of all players in one instance of the game, and let s^(-k) denote the (K-1)-vector of strategies played by all players aside from player k. 
A joint distribution P on two rounds of this game is a conditional correlated equilibrium if, for any player k, actions j, j' ∈ N_k, and map φ_k: N_k^2 → N_k, we have

sum_{(s,r) ∈ S^2: s_k = j, r_k = j'} P(s, r) [ l^(k)(s_k, s^(-k)) - l^(k)(φ_k(s_k, r_k), s^(-k)) ] ≤ 0.

The standard interpretation of correlated equilibrium, which was first introduced by Aumann, is a scenario where an external authority assigns mixed strategies to each player in such a way that no player has an incentive to deviate from the recommendation, provided that no other player deviates from his [3]. In the context of repeated games, a conditional correlated equilibrium is a situation where an external authority assigns mixed strategies to each player in such a way that no player has an incentive to deviate from the recommendation in the second round, even after factoring in information from the previous round of the game, provided that no other player deviates from his.
It is important to note that the concept of conditional correlated equilibrium presented here is different from the notions of extensive form correlated equilibrium and repeated game correlated equilibrium that have been studied in the game theory and economics literature [8, 12].
Notice that when the values taken for φ_k are independent of its second argument, we retrieve the familiar notion of correlated equilibrium.
Theorem 5. Suppose that all players in a K-player repeated game follow bigram conditional swap regret minimizing strategies. Then, the joint empirical distribution of all players converges to a conditional correlated equilibrium.

Proof. Let I^t ∈ S be a random vector denoting the actions played by all K players in the game at round t. 
The empirical joint distribution of every two subsequent rounds of a K-player game played repeatedly for T total rounds has the form P^T = (1/T) sum_{t=1}^T sum_{(s,r) ∈ S^2} δ_{I^t = s, I^{t-1} = r}, where I = (I_1, ..., I_K) and I_k ~ p^(k) denotes the action played by player k using the mixed strategy p^(k).
Let q^{t,(k)} denote δ^{t-1}_{I^{t-1}_k} ⊗ p^{t-1,(k)}. Then, the conditional swap regret of each player k, reg(k, T), can be bounded as follows, since he is playing with a conditional swap regret minimizing strategy:

reg(k, T) = (1/T) sum_{t=1}^T E_{s^t_k ~ p^{t,(k)}}[ l^(k)(s_k, s^(-k)) ] - min_φ (1/T) sum_{t=1}^T E_{(s^t_k, s^{t-1}_k) ~ (p^{t,(k)}, q^{t,(k)})}[ l^(k)(φ(s^t_k, s^{t-1}_k), s^t_(-k)) ] ≤ O( N √(log(N)/T) ).

Define the instantaneous conditional swap regret vector as

r^(k)_{t,j0,j1} = δ_{I^t_(k)=j0, I^{t-1}_(k)=j1} [ l^(k)(I^t) - l^(k)(φ_k(j0, j1), I^t_(-k)) ],

and the expected instantaneous conditional swap regret vector as

r~^(k)_{t,j0,j1} = P(s^t_k = j0) δ_{I^{t-1}_(k)=j1} [ l^(k)(j0, I^t_(-k)) - l^(k)(φ_k(j0, j1), I^t_(-k)) ].

Consider the filtration G_t = {information of opponents at time t and of the player's actions up to time t-1}. Then, we see that E[ r^(k)_{t,j0,j1} | G_t ] = r~^(k)_{t,j0,j1}. Thus, {R_t = r^(k)_{t,j0,j1} - r~^(k)_{t,j0,j1}}_{t=1}^∞ is a sequence of bounded martingale differences, and by the Hoeffding-Azuma inequality, we can write, for any α > 0, that P[ |sum_{t=1}^T R_t| > α ] ≤ 2 exp(-C α^2 / T) for some constant C > 0.
Now define the sets A_T := { |(1/T) sum_{t=1}^T R_t| > √( (C/T) log(2/δ_T) ) }. By our concentration bound, we have P(A_T) ≤ δ_T. 
Setting δ_T = exp(-√T) and applying the Borel-Cantelli lemma, we obtain that limsup_{T→∞} |(1/T) sum_{t=1}^T R_t| = 0 a.s..
Finally, since each player followed a conditional swap regret minimizing strategy, we can write limsup_{T→∞} (1/T) sum_{t=1}^T r~^(k)_{t,j0,j1} ≤ 0. Now, if the empirical distribution did not converge to a conditional correlated equilibrium, then by Prokhorov's theorem, there exists a subsequence {P^{T_j}}_j satisfying the conditional correlated equilibrium inequality but converging to some limit P* that is not a conditional correlated equilibrium. This cannot be true because the inequality is closed under weak limits.

Convergence to equilibria over the course of repeated game-playing also naturally implies the scarcity of "very suboptimal" strategies.

Definition 2. An action pair (s_k, r_k) ∈ N_k^2 played by player k is conditionally ε-dominated if there exists a map φ_k: N_k^2 → N_k such that l^(k)(s_k, s^(-k)) - l^(k)(φ_k(s_k, r_k), s^(-k)) ≥ ε.

Theorem 6. Suppose player k follows a conditional swap regret minimizing strategy that produces a regret R over T instances of the repeated game. Then, on average, an action pair of player k is conditionally ε-dominated at most an R/(εT) fraction of the time.

The proof of this result is provided in Appendix 12.

6 Bandit Scenario

As discussed earlier, the bandit scenario differs from the full-information scenario in that the player only receives information about the loss of his action f^t(x_t) at each time and not the entire loss function f^t. 
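This partial-information constraint is typically handled with an importance-weighted loss estimate. The sketch below shows one round of an Exp3-style update of this kind (a simplified illustration, not the paper's construction; the mixing parameter gamma and the toy loss function are assumptions):

```python
import math, random

def exp3_round(w, gamma, loss_fn):
    # w: current weights; returns updated weights after one bandit round
    n = len(w)
    z = sum(w)
    # mix the weight distribution with uniform exploration
    p = [(1 - gamma) * wi / z + gamma / n for wi in w]
    # draw an action; only its own loss is observed
    a = random.choices(range(n), weights=p)[0]
    observed = loss_fn(a)
    # importance-weighted estimate: unbiased, since E[est_j] = loss_j
    est = [0.0] * n
    est[a] = observed / p[a]
    return [wi * math.exp(-gamma * est[j] / n) for j, wi in enumerate(w)], a

random.seed(0)
w = [1.0, 1.0, 1.0]
for _ in range(200):
    # toy environment: action 2 always has loss 0, the others loss 1
    w, _a = exp3_round(w, gamma=0.1, loss_fn=lambda a: 0.0 if a == 2 else 1.0)
```

After the loop, the zero-loss action keeps its initial weight while the others can only shrink, so the play distribution concentrates on the best arm.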
One standard external regret minimizing algorithm is the Exp3 algorithm introduced by [2], and it is the base learner off of which we will build a conditional swap regret minimizing algorithm.
To derive a sublinear conditional swap regret bound, we require an external regret bound on Exp3:

sum_{t=1}^T E_{p^t}[f^t(x_t)] - min_{a ∈ N} sum_{t=1}^T f^t(a) ≤ 2 √(L_min N log N),

which can be found in Theorem 3.1 of [5]. Using this estimate, we can derive the following result.
Theorem 7. There exists an algorithm A such that Reg_{C_2,bandit}(A, T) ≤ O(√(N^3 log(N) T)).
The proof is given in Appendix 13 and is very similar to the proof for the full information setting. It can also easily be extended in the analogous way to provide a regret bound for the k-gram regret in the bandit scenario.
Theorem 8. There exists an algorithm A such that Reg_{C_k,bandit}(A, T) ≤ O(√(N^{k+1} log(N) T)).
See Appendix 14 for an outline of the algorithm.

7 Conclusion

We analyzed the extent to which on-line learning scenarios are learnable. In contrast to some of the more recent work that has focused on increasing the power of the adversary (see e.g. [1]), we increased the power of the competitor class instead by allowing history-dependent action swaps and thereby extending the notion of swap regret. We proved that this stronger class of competitors can still be beaten in the sense of sublinear regret as long as the memory of the competitor is bounded. We also provided a state-dependent bound that gives a more favorable guarantee when only some parts of the history are considered. In the bigram setting, we introduced the notion of conditional correlated equilibrium in the context of repeated K-player games, and showed how it can be seen as a generalization of the traditional correlated equilibrium. 
We proved that if all players follow bigram conditional swap regret minimizing strategies, then the empirical joint distribution converges to a conditional correlated equilibrium and that no player can play very suboptimal strategies too often. Finally, we showed that sublinear conditional swap regret can also be achieved in the partial information bandit setting.

8 Acknowledgements

We thank the reviewers for their comments, many of which were very insightful. We are particularly grateful to the reviewer who found an issue in our discussion on conditional correlated equilibrium and proposed a helpful resolution. This work was partly funded by the NSF award IIS-1117591. The material is also based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE 1342536.
", "award": [], "sourceid": 737, "authors": [{"given_name": "Mehryar", "family_name": "Mohri", "institution": "Courant Institute, NYU & Google"}, {"given_name": "Scott", "family_name": "Yang", "institution": "New York University"}]}