{"title": "How to Combine Expert (and Novice) Advice when Actions Impact the Environment?", "book": "Advances in Neural Information Processing Systems", "page_first": 815, "page_last": 822, "abstract": "", "full_text": "How to Combine Expert (or Novice) Advice\n\nwhen Actions Impact the Environment\n\nDaniela Pucci de Farias\u2217\n\nDepartment of Mechanical Engineering\nMassachusetts Institute of Technology\n\nCambridge, MA 02139\n\nNimrod Megiddo\n\nIBM Almaden Research Center\n\n650 Harry Road, K53-B2\n\nSan Jose, CA 95120\n\npucci@mit.edu\n\nmegiddo@almaden.ibm.com\n\nAbstract\n\nThe so-called \u201cexperts algorithms\u201d constitute a methodology for choos-\ning actions repeatedly, when the rewards depend both on the choice of\naction and on the unknown current state of the environment. An experts\nalgorithm has access to a set of strategies (\u201cexperts\u201d), each of which may\nrecommend which action to choose. The algorithm learns how to com-\nbine the recommendations of individual experts so that, in the long run,\nfor any \ufb01xed sequence of states of the environment, it does as well as the\nbest expert would have done relative to the same sequence. This method-\nology may not be suitable for situations where the evolution of states of\nthe environment depends on past chosen actions, as is usually the case,\nfor example, in a repeated non-zero-sum game.\nA new experts algorithm is presented and analyzed in the context of re-\npeated games. It is shown that asymptotically, under certain conditions,\nit performs as well as the best available expert. This algorithm is quite\ndifferent from previously proposed experts algorithms. It represents a\nshift from the paradigms of regret minimization and myopic optimiza-\ntion to consideration of the long-term effect of a player\u2019s actions on the\nopponent\u2019s actions or the environment. 
The importance of this shift is demonstrated by the fact that this algorithm is capable of inducing cooperation in the repeated Prisoner’s Dilemma game, whereas previous experts algorithms converge to the suboptimal non-cooperative play.

∗Work done while at IBM Almaden Research Center, San Jose, California.

1 Introduction

Experts algorithms. A well-known class of methods in machine learning is that of the so-called experts algorithms. The goal of these methods is to learn from experience how to combine advice from multiple experts in order to make sequential decisions in an online environment. The general idea can be described as follows. An agent has to choose repeatedly from a given set of actions. The reward in each stage is a function of the chosen action and the choices of Nature or the environment (also referred to as the “adversary” or the “opponent”). A set of strategies {1, . . . , r} is available for the agent to choose from. We refer to each such strategy as an “expert,” even though some of them might be simple enough to be called a “novice.” Each expert suggests a choice of an action based on the history of the process and the expert’s own choice algorithm. After each stage, the agent observes his own reward. An experts algorithm directs the agent with regard to which expert to follow in the next stage, based on the past history of actions and rewards.

Minimum Regret. A popular criterion in decision processes is called Minimum Regret (MR). Regret is defined as the difference between the reward that could have been achieved, given the choices of Nature, and what was actually achieved. An expert selection rule is said to minimize regret if it yields an average reward as large as that of any single expert, against any fixed sequence of actions chosen by the opponent. 
Indeed, certain experts algorithms,\nwhich at each stage choose an expert from a probability distribution that is related to the\nreward accumulated by the expert prior to that stage, have been shown to minimize regret\n[1, 2]. It is crucial to note though that, since the experts are compared on a sequence-by-\nsequence basis, the MR criterion ignores the possibility that different experts may induce\ndifferent sequences of choices by the opponent. Thus, MR makes sense only under the\nassumption that Nature\u2019s choices are independent of the decision maker\u2019s choices.\n\nRepeated games. We consider a multi-agent interaction in the form of a repeated game.\nIn repeated games, the assumption that the opponent\u2019s choices are independent of the\nagent\u2019s choices is not justi\ufb01ed, because the opponent is likely to base his choices of ac-\ntions on the past history of the game. This is evident in nonzero-sum games, where players\nare faced with issues such as how to coordinate actions, establish trust or induce coopera-\ntion. These goals require that they take each other\u2019s past actions into account when making\ndecisions. But even in the case of zero-sum games, the possibility that an opponent has\nbounded rationality may lead a player to look for patterns to be exploited in the opponent\u2019s\npast actions.\n\nWe illustrate some of the aforementioned issues with an example involving the Prisoner\u2019s\nDilemma game.\n\nThe Prisoner\u2019s Dilemma.\nIn the single-stage Prisoner\u2019s Dilemma (PD) game, each\nplayer can either cooperate (C) or defect (D). Defecting is better than cooperating regard-\nless of what the opponent does, but it is better for both players if both cooperate than if both\ndefect. Consider the repeated PD. Suppose the row player consults with a set of experts,\nincluding the \u201cdefecting expert,\u201d who recommends defection all the time. Let the strategy\nof the column player in the repeated game be \ufb01xed. 
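To make the scenario concrete, it can be simulated. In the sketch below, the payoff values are standard Prisoner’s Dilemma numbers and the column player’s fixed strategy is taken to be Tit-for-Tat (cooperate first, then mirror the opponent’s previous move); both are assumptions made for illustration only, not values from the paper.

```python
# Standard (assumed) PD payoffs for the row player:
# both cooperate: 3, both defect: 1, defect vs C: 5, cooperate vs D: 0.
PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}

def tit_for_tat(history):
    # fixed column strategy: cooperate first, then mirror the row
    # player's previous action
    return 'C' if not history else history[-1][0]

def average_reward(row_strategy, stages=100):
    history = []  # list of (row_action, column_action) pairs
    total = 0
    for _ in range(stages):
        a = row_strategy(history)
        b = tit_for_tat(history)
        total += PAYOFF[(a, b)]
        history.append((a, b))
    return total / stages

always_defect = lambda history: 'D'
always_cooperate = lambda history: 'C'
# Constant defection wins each stage in isolation, yet against this
# column strategy constant cooperation earns the higher average.
```

Here average_reward(always_cooperate) evaluates to 3.0 per stage, while average_reward(always_defect) stays close to 1: the stagewise-dominant expert is the wrong one to follow in the long run.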
In particular, the column player may be very patient and cooperative, willing to wait for the row player to become cooperative, but eventually becoming non-cooperative if the row player does not seem to cooperate. Since defection is a dominant strategy in the stage game, the defecting expert achieves in each step a reward as high as any other expert against any sequence of choices of the column player, so the row player learns with the experts algorithm to defect all the time. In retrospect, this seems to minimize regret, since for any fixed sequence of actions by the column player, constant defection is the best response. However, constant defection is not the best response in the repeated game against many possible strategies of the column player. For instance, the row player would regret very much using the experts algorithm if he were told later that the column player had been playing a strategy such as Tit-for-Tat.1

1The Tit-for-Tat strategy is to play C in the first stage, and later play in every stage whatever the opponent played in the preceding stage.

In this paper, we propose and analyze a new experts algorithm, which follows experts judiciously, attempting to maximize the long-term average reward. Our algorithm differs from previous approaches in at least two ways. First, each time an expert is selected, it is followed for multiple stages of the game rather than a single one. Second, our algorithm takes into account only the rewards that were actually achieved by an expert in the stages it was followed, rather than the reward that could have been obtained in any stage. Our algorithm enjoys the appealing simplicity of the previous algorithms, yet it leads to a qualitatively different behavior and improved average reward. We present two results:

1. 
A \u201cworst-case\u201d guarantee that, in any play of the game, our algorithm achieves an\naverage reward that is asymptotically as large as that of the expert that did best\nin the rounds of the game when it was played. The worst-case guarantee holds\nwithout any assumptions on the opponent\u2019s or experts\u2019 strategies.\n\n2. Under certain conditions, our algorithm achieves an average reward that is asymp-\ntotically as large as the average reward that could have been achieved by the best\nexpert, had it been followed exclusively. The conditions are required in order to\nfacilitate learning and for the notion of a \u201cbest expert\u201d to be well-de\ufb01ned.\n\nThe effectiveness of the algorithm is demonstrated by its performance in the repeated PD\ngame, namely, it is capable of identifying the opponent\u2019s willingness to cooperate and it\ninduces cooperative behavior.\n\nThe paper is organized as follows. The algorithm is described in section 2. A bound\nbased on actual expert performance is presented in section 3. In section 4, we introduce\nand discuss an assumption about the opponent. This assumption gives rise to asymptotic\noptimality, which is presented in section 5.\n\n2 The algorithm\n\nWe consider an \u201cexperts strategy\u201d for the row player in a repeated two-person game in\nnormal form. At each stage of the game, the row and column player choose actions i \u2208 I\nand j \u2208 J, respectively. The row player has a reward matrix R, with entries 0 \u2264 Rij \u2264 u.\nThe row player may consult at each stage with a set of experts {1, . . . , r}, before choosing\nan action for the next stage. We denote by \u03c3e the strategy proposed by expert e, i.e.,\n\u03c3e = \u03c3e(hs) is the proposed probability distribution over actions in stage s, given the\nhistory hs. We refer to the row player as the agent and to the column player as the opponent.\nUsually, the form of experts algorithms found in the literature is as follows. 
Denote by Me(s − 1) the average reward achieved by expert e prior to stage s of the game.2 Then, a reasonable rule is to follow expert e in stage s with a probability that is proportional to some monotone function of Me(s − 1). In particular, when this probability is proportional to exp{ηsMe(s − 1)}, for a certain choice of ηs, this algorithm is known to minimize regret [1, 2]. Specifically, by letting js (s = 1, 2, . . .) denote the observed actions of the opponent up to stage s, and letting σX denote the strategy induced by the experts algorithm, we have

(1/s) Σ_{s'=1}^{s} E[R(i, js') : i ∼ σX(hs')] ≥ sup_e (1/s) Σ_{s'=1}^{s} E[R(i, js') : i ∼ σe(hs')] − o(1).   (1)

2In different variants of the algorithm, and depending on what information is available to the row player, Me(s − 1) could be either an estimate of the average reward based on the reward achieved by expert e in the stages it was played, or the reward it could have obtained, had it been played in all stages against the same history of play of the opponent.

The main deficiency of the regret minimization approach is that it fails to consider the influence of chosen actions of a player on the future choices of the opponent — the inequality (1) holds for any fixed sequence (js) of the opponent’s moves, but does not account for the fact that different choices of actions by the agent may induce different sequences of the opponent. This subtlety is also missing in the experts algorithm we described above. At each stage of the game, the selection of expert is based solely on how well various experts have, or could have, done so far. There is no notion of learning how an expert’s actions affect the opponent’s moves. 
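The stagewise selection rule just described can be sketched as follows. This is a minimal illustration of the exponential-weighting idea, not the authors’ implementation; the function names are ours and the learning rate ηs is left as a plain parameter.

```python
import math
import random

def exp_weights_probs(avg_rewards, eta):
    # probability of following expert e is proportional to
    # exp(eta * Me), where Me is e's average reward so far
    w = [math.exp(eta * m) for m in avg_rewards]
    z = sum(w)
    return [x / z for x in w]

def choose_expert(avg_rewards, eta, rng=random):
    # sample one expert index from the exponential-weights distribution
    probs = exp_weights_probs(avg_rewards, eta)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Note that the rule looks only at accumulated averages; nothing in it models how the opponent reacts to the chosen actions, which is precisely the deficiency discussed here.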
For instance, in the repeated PD game described in the introduction, assuming that the opponent is playing Tit-for-Tat, the algorithm is unable to establish the connection between the opponent’s cooperative moves and his own.

Based on the previous observations, we propose a new experts algorithm, which takes into account how the opponent reacts to each of the experts. The idea is simple: instead of choosing a (potentially different) expert at each stage of the game, the number of stages an expert is followed, each time it is selected, increases gradually. We refer to each such set of stages as a “phase” of the algorithm. Following is the statement of the Strategic Experts Algorithm (SEA). The phase number is denoted by i. The number of phases during which expert e has been followed is denoted by Ne. The average payoff from phases in which expert e has been followed is denoted by Me.

Strategic Experts Algorithm (SEA):

1. For e = 1, . . . , r, set Me = Ne = 0. Set i = 1.

2. With probability 1/i perform an exploration phase, namely, choose an expert e from the uniform distribution over {1, . . . , r}; otherwise, perform an exploitation phase, namely, choose an expert e from the uniform distribution over the set of experts e' with maximum Me'.

3. Set Ne = Ne + 1. Follow expert e’s instructions for the next Ne stages. Denote by R̃ the average payoff accumulated during the current phase (i.e., these Ne stages), and set

Me = Me + (2/(Ne + 1)) (R̃ − Me).

4. Set i = i + 1 and go to step 2.

Throughout the paper, s will denote a stage number, and i will denote a phase number. We denote by M1(i), . . . , Mr(i) the values of the registers M1, . . . , Mr, respectively, at the end of phase i. Similarly, we denote by N1(i), . . . , Nr(i) the values of the registers N1, . . . , Nr, respectively, at the end of phase i. 
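Steps 1-4 of the SEA can be sketched directly in code. The game interface below (a play_stage callback returning one stage’s payoff when following a given expert) is an assumption made for illustration; it is not part of the paper.

```python
import random

def strategic_experts_algorithm(num_experts, play_stage, num_phases, rng=random):
    # M[e]: average payoff over phases in which expert e was followed
    # N[e]: number of phases in which expert e was followed
    M = [0.0] * num_experts
    N = [0] * num_experts
    for i in range(1, num_phases + 1):
        if rng.random() < 1.0 / i:
            # exploration phase: expert drawn uniformly over all experts
            e = rng.randrange(num_experts)
        else:
            # exploitation phase: uniform over experts with maximum M
            best = max(M)
            e = rng.choice([k for k in range(num_experts) if M[k] == best])
        N[e] += 1
        # follow expert e for the next N[e] stages
        r_avg = sum(play_stage(e) for _ in range(N[e])) / N[e]
        # incremental update of e's per-stage average payoff
        M[e] += (2.0 / (N[e] + 1)) * (r_avg - M[e])
    return M, N
```

Since phase lengths grow as 1, 2, 3, . . ., expert e has been followed for N[e](N[e] + 1)/2 stages in total, and the factor 2/(N[e] + 1) is exactly the weight of the newest N[e]-stage phase within that total.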
Thus, Me(i) and Ne(i) are, respectively, the average payoff accumulated by expert e and the total number of phases this expert was followed on or before phase i. We will also let M(s) and M(i) denote, without confusion, the average payoff accumulated by the algorithm in the first s stages or first i phases of the game.

3 A bound based on actual expert performance

When the SEA is employed, the average reward Me(i) that was actually achieved by each available expert e is being tracked. It is therefore interesting to compare the average reward M(s) achieved by the SEA with the averages achieved by the various experts. The following theorem states that, in the long run, the SEA obtains almost surely at least as much as the actual average reward obtained by any available expert during the same play.

Theorem 3.1.

Pr( lim inf_{s→∞} M(s) ≥ max_e lim inf_{i→∞} Me(i) ) = 1.   (2)

Although the claim of Theorem 3.1 seems very close to regret minimization, there is an essential difference in that we compare the average reward of our algorithm with the average reward actually achieved by each expert in the stages when it was played, as opposed to the estimated average reward based on the whole history of play of the opponent.

Note that the bound (2) is merely a statement about the average reward of the SEA in comparison to the average reward achieved by each expert, but nothing is claimed about the limits themselves. Theorem 5.1 proposes an application of this bound in a case when an additional assumption about the experts’ and opponent’s strategies allows us to analyze convergence of the average reward for each expert. Another interesting case occurs when one of the experts plays a maximin strategy; in this case, bound (2) ensures that the SEA achieves at least the maximin value of the game. 
The same holds if one of the experts is a regret-minimizing experts algorithm, which is known to achieve at least the maximin value of the game.

The remainder of this section consists of a sketch of the proof of Theorem 3.1.

Sketch of proof: Denote by V the random variable max_e lim inf_{i→∞} Me(i), and denote by Ē the expert that achieves that maximum (if there is more than one, let Ē be the one with the least index). For any logical proposition L, let δ(L) = 1 if L is true; otherwise δ(L) = 0. The proof of Theorem 3.1 relies on establishing that, for all ε > 0 and any expert e,

Pr( lim_{i→∞} Ne(i) · δ(Me(i) ≤ V − ε) / i = 0 ) = 1.   (3)

In words, if the average reward of an expert falls below V by a non-negligible amount, it must have been followed only a small fraction of the total number of phases. There are three possible situations for any expert e: (a) When lim inf_{i→∞} Me(i) > V − ε, the inequality is satisfied trivially. (b) When lim sup_{i→∞} Me(i) < V, there is a phase I such that for all i ≥ I, Me(i) < MĒ(i), so that expert e is played only on exploration phases, and a large deviations argument establishes that (3) holds. (c) The most involved situation occurs when lim inf_{i→∞} Me(i) ≤ V − ε and lim sup_{i→∞} Me(i) ≥ V. To show that (3) holds in this case, we are going to focus on the trajectory of Me(i) each time it goes from above V − ε/2 to below V − ε + δ/2, for some 0 < δ < ε. We offer the two following observations:

1. Let Ik be the kth phase such that Me(i) ≤ V − ε + δ/2, and let I'_k be the first phase before Ik such that Me(i) ≥ V − ε/2. Then, between phases I'_k and Ik, expert e is selected at least Ne(I'_k)(ε − δ)/(6u) times.

Denoting by I^j_k, j = 1, . . . , Pk, the phases when expert e is selected between I'_k and Ik, we have

Me(I^j_k) ≥ Me(I^{j−1}_k) (Ne(I'_k) + j − 1)(Ne(I'_k) + j + 1) / (Ne(I'_k) + j)^2.

A simple induction argument shows that, in order to have

Me(Ik) ≤ V − ε + δ/2 ≤ Me(I'_k) − (ε − δ)/2,

expert e must be selected a number of times Pk ≥ Ne(I'_k)(ε − δ)/(6u).

2. For all large enough k, the phases I^j_k when expert e is selected are exclusively exploration phases. This follows trivially from the fact that, after a certain phase I, we have MĒ(i) ≥ V − ε/2 for all i ≥ I, whereas Me(i) < V − ε/2 for all i between I'_k and Ik.

From the first observation, we have

Ne(Ik)/Ik ≤ (Ne(I'_k) + Pk)/(Ik − I'_k) ≤ (1 + 6u)Pk / ((ε − δ)(Ik − I'_k)).

Since expert e is selected only during exploration phases between I'_k and Ik, a large deviations argument allows us to conclude that the ratio of the number of times Pk expert e is selected, to the total number of phases Ik − I'_k, converges to zero with probability one. We conclude that (3) holds.

We now observe that

M(i) = Σ_e Ne(i)(Ne(i) + 1)Me(i) / Σ_e Ne(i)(Ne(i) + 1).   (4)

By a simple optimization argument, we can show that

Σ_e Ne(i)(Ne(i) + 1) ≥ i(i/r + 1).   (5)

Using (3) and (5) to bound (4), we conclude that (2) holds for the subsequence of stages s corresponding to the end of each phase of the SEA. 
It is easy to show that the average reward M(s) in stages s in the middle of phase i becomes arbitrarily close to the average reward at the end of that phase, M(i), as i goes to infinity, and the theorem follows. □

4 The flexible opponent

In general, it is impossible for an experts algorithm to guarantee, against an unknown opponent, a reward close to what the best available expert would have achieved if it had been the only expert. It is easy to construct examples which prove this impossibility.

Example: Repeated Matching Pennies. In the Matching Pennies (MP) game, the player and the adversary each have to choose either H (“Heads”) or T (“Tails”). If the choices match, the player loses 1; otherwise, he wins 1. A possible strategy for the adversary in the repeated MP game is:

Adversary: Fix a positive integer s and a string σs ∈ {H, T}^s. In each of the first s stages, play the 50:50 mixed strategy. In each of the stages s + 1, s + 2, . . . , if the sequence of choices of the player during the first s stages coincided with the string σs, then play T; otherwise, play the 50:50 mixed strategy.

Suppose each available expert e corresponds to a strategy of the form:

Expert: Fix a string σe ∈ {H, T}^s. During the first s stages play according to σe. In each of the stages s + 1, s + 2, . . . , play H.

Suppose an expert e∗ with σe∗ = σs is available. Then, in order for an experts algorithm to achieve at least the reward of e∗, it needs to follow the string σs precisely during the first s stages. Of course, without knowing what σs is, the algorithm cannot play it with probability one, nor can it learn anything about it during the play.

In view of the repeated MP example, some assumption about the opponent must be made in order for the player to be able to learn how to play against that opponent. 
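The adversary and expert strategies of the Matching Pennies example can be written out as follows; this is a minimal sketch, and the function names and the history-as-list interface are our assumptions.

```python
import random

def mp_adversary(secret, s, rng=random):
    # 50:50 for the first s stages; afterwards, play T forever if the
    # player's first s choices coincided with `secret`, else keep 50:50
    def play(player_history):
        if len(player_history) >= s and player_history[:s] == list(secret):
            return 'T'
        return rng.choice(['H', 'T'])
    return play

def mp_expert(string, s):
    # play the fixed string for the first s stages, then H forever
    def play(player_history):
        t = len(player_history)
        return string[t] if t < s else 'H'
    return play
# With the matching convention above (player loses 1 on a match, wins 1
# otherwise), only the expert whose string equals `secret` turns the
# adversary into a predictable T-player and secures +1 per stage later.
```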
The essence of the difficulty with the above strategy of the opponent is that it is not flexible — the player has only one chance to guess who the best expert is and thus cannot recover from a mistake. Here, we introduce the assumption of flexibility as a possible remedy to that problem. Under the assumption of flexibility, the SEA achieves an average reward that is asymptotically as high as what the best expert could be expected to achieve.

Definition 4.1 (Flexibility). (i) An opponent playing strategy π is said to be flexible with respect to expert e (e = 1, . . . , r) if there exist constants µe, c and τ > 1/4 such that for every stage s', every possible history hs' at stage s' and any number of stages s,

E[ | (1/s) Σ_{t=s'+1}^{s'+s} R(ae(t), b(t)) − µe | : ae(t) ∼ σe(ht), b(t) ∼ π(ht) ] ≤ c / s^τ.

(ii) Flexibility with respect to a set of experts is defined as flexibility with respect to every member of the set.

In words, the expected average reward during the s stages between stage s' and stage s' + s converges (as s tends to infinity) to a limit that does not depend on the history of the play prior to stage s'.

Example 4.1 : Finite Automata. In the literature on “bounded rationality”, players are often modelled as finite automata. A probabilistic automaton strategy (PAS) is specified by a tuple A = ⟨M, O, A, σ, P⟩, where M = {1, . . . , m} is the finite set of internal states of the automaton, A is the set of possible actions, O is the set of possible outcomes, σi(a) is the probability of choosing action a while in state i (i = 1, . . . , m), and P^o = (P^o_ij) (1 ≤ i, j ≤ m) is the matrix of state transition probabilities, given an outcome o ∈ O. 
Thus, at any stage of the game, the automaton picks an action from a probability distribution associated with its current state and transitions into a new state, according to a probability distribution which depends on the outcome of the stage game. If both the opponent and an expert play PASs, then a Markov chain is induced over the set of pairs of the respective internal states. If this Markov chain has a single class of recurrent states, then the flexibility assumption holds. Note that we do not limit the size of the automata; a larger set of internal states implies a slower convergence of the average rewards, but does not affect the asymptotic results for the SEA.

Example 4.2 : Bounded dependence on the history. The number of possible histories at stage s grows exponentially with s. Thus, it is reasonable to assume that the choice of action would be based not on the exact detail of the history but rather on the empirical distribution of past actions or patterns of actions. If the opponent is believed not to be stationary, then discounting previous observations by recency may be sensible. For instance, if the frequency of play of action j by the opponent is relevant, the player might condition his choice at stage s + 1 on the quantities τj = Σ_{s'=1}^{s} β^{s−s'} δ_{j js'}, where β < 1 and δ is the Kronecker delta. In this case, only actions js' at stages s' that are relatively recent have a significant impact on τj. Therefore strategies based on τj should exhibit behavior similar to that of bounded recall, and lead to flexibility in the same circumstances as the latter.

5 A bound based on expected expert performance

In this section we show that if the opponent is “flexible” with respect to the available experts, then the SEA achieves almost surely an average payoff that is asymptotically as large as what the best expert could achieve against the same opponent.

Theorem 5.1. 
If an opponent π is flexible with respect to the experts 1, . . . , r, then the average payoff up to stage s, M(s), satisfies

Pr( lim inf_{s→∞} M(s) ≥ max_e µe ) = 1.

Theorem 5.1 follows from Lemma 5.1, stated and proven below, and Theorem 3.1.

Flexibility comes into play as a way of ensuring that the value of following any given expert is well-defined, and can eventually be estimated as long as the SEA follows that expert sufficiently many times. In other words, flexibility ensures that there is a best expert to be learned, and that learning can effectively occur because actions taken by other experts, which could affect the behavior of the opponent, are eventually forgotten by the latter.

We now present Lemma 5.1, which shows that, under the flexibility assumption, the average reward achieved by each expert is asymptotically almost surely the same as the reward that would have been achieved by the same expert, had he been the only available expert.

Lemma 5.1. If the opponent is flexible with respect to expert e, then with probability one, lim_{i→∞} Me(i) = µe.

Sketch of proof: Let e be any expert. By the Borel-Cantelli lemma, exploration occurs infinitely many times, hence e is followed during infinitely many phases. Let Ij = Ij(e) (j = 1, 2, . . .) be the phase numbers in which e is followed. By Markov’s inequality, for every ε > 0,

Pr(|Me(Ij) − µe| > ε) ≤ ε^{−4} E[(Me(Ij) − µe)^4].

If we could show that

Σ_{j=1}^{∞} E[(Me(Ij) − µe)^4] < ∞,   (6)

then we could conclude, by the Borel-Cantelli lemma, that with probability one, the inequality |Me(Ij) − µe| > ε holds only for finitely many values of j. This implies that, with probability one, lim_{i→∞} Me(i) = µe. 
It follows that if the opponent is flexible with respect to expert e, then for some ν > 0, as j tends to infinity, E[(Me(Ij) − µe)^4] = O(j^{−1−ν}), which suffices for (6). □

Example 5.1 : Repeated Prisoner’s Dilemma revisited. Consider playing the repeated PD game against an opponent who plays Tit-for-Tat, and suppose there are only two experts: “Always defect” (AD) and “Always cooperate” (AC). Thus, AC induces cooperation in every stage and yields a payoff higher than AD, which induces defection in every stage of the game except the first one. It is easy to verify that Tit-for-Tat is flexible with respect to the experts AC and AD. Therefore, Theorem 5.1 holds and the SEA achieves an average payoff at least as large as that of AC. By contrast, as mentioned in the introduction, in order to minimize regret, the standard experts algorithm must play D in almost every stage of the game, and therefore achieves a lower payoff.

References

[1] Auer, P., Cesa-Bianchi, N., Freund, Y. & Schapire, R.E. (1995) Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proc. 36th Annual IEEE Symp. on Foundations of Computer Science, pp. 322-331, Los Alamitos, CA: IEEE Computer Society Press.

[2] Freund, Y. & Schapire, R.E. (1999) Adaptive game playing using multiplicative weights. Games and Economic Behavior 29:79-103.

[3] Foster, D. & Vohra, R. (1999) Regret and the on-line decision problem. Games and Economic Behavior 29:7-35.

[4] Fudenberg, D. & Levine, D.K. (1997) The Theory of Learning in Games. Cambridge, MA: The MIT Press.

[5] Littlestone, N. & Warmuth, M.K. (1994) The weighted majority algorithm. 
Information and Computation 108(2):212-261.
", "award": [], "sourceid": 2489, "authors": [{"given_name": "Daniela", "family_name": "de Farias", "institution": null}, {"given_name": "Nimrod", "family_name": "Megiddo", "institution": null}]}