{"title": "Distributed Non-Stochastic Experts", "book": "Advances in Neural Information Processing Systems", "page_first": 260, "page_last": 268, "abstract": "We consider the online distributed non-stochastic experts problem, where the distributed system consists of one coordinator node that is connected to k sites, and the sites are required to communicate with each other via the coordinator. At each time-step t, one of the k site nodes has to pick an expert from the set {1, . . . , n}, and the same site receives information about payoffs of all experts for that round. The goal of the distributed system is to minimize regret at time horizon T, while simultaneously keeping communication to a minimum. The two extreme solutions to this problem are: (i) Full communication: This essentially simulates the non-distributed setting to obtain the optimal O(\\sqrt{log(n)T}) regret bound at the cost of T communication. (ii) No communication: Each site runs an independent copy \u2013 the regret is O(\\sqrt{log(n)kT}) and the communication is 0. This paper shows the difficulty of simultaneously achieving regret asymptotically better than \\sqrt{kT} and communication better than T. We give a novel algorithm that for an oblivious adversary achieves a non-trivial trade-off: regret O(\\sqrt{k^{5(1+\\epsilon)/6} T}) and communication O(T/k^\\epsilon), for any value of \\epsilon in (0, 1/5). We also consider a variant of the model, where the coordinator picks the expert. In this model, we show that the label-efficient forecaster of Cesa-Bianchi et al. 
(2005) already gives us a strategy that is near optimal in the regret vs. communication trade-off.", "full_text": "Distributed Non-Stochastic Experts\n\nVarun Kanade*\nUC Berkeley\nvkanade@eecs.berkeley.edu\n\nZhenming Liu\u2020\nPrinceton University\nzhenming@cs.princeton.edu\n\nBo\u017eidar Radunovi\u0107\nMicrosoft Research\nbozidar@microsoft.com\n\nAbstract\n\nWe consider the online distributed non-stochastic experts problem, where the distributed system consists of one coordinator node that is connected to k sites, and the sites are required to communicate with each other via the coordinator. At each time-step t, one of the k site nodes has to pick an expert from the set {1, . . . , n}, and the same site receives information about payoffs of all experts for that round. The goal of the distributed system is to minimize regret at time horizon T, while simultaneously keeping communication to a minimum. The two extreme solutions to this problem are: (i) Full communication: This essentially simulates the non-distributed setting to obtain the optimal O(\\sqrt{log(n)T}) regret bound at the cost of T communication. (ii) No communication: Each site runs an independent copy; the regret is O(\\sqrt{log(n)kT}) and the communication is 0. This paper shows the difficulty of simultaneously achieving regret asymptotically better than \\sqrt{kT} and communication better than T. We give a novel algorithm that for an oblivious adversary achieves a non-trivial trade-off: regret O(\\sqrt{k^{5(1+\\epsilon)/6} T}) and communication O(T/k^\\epsilon), for any value of \\epsilon in (0, 1/5). We also consider a variant of the model, where the coordinator picks the expert. In this model, we show that the label-efficient forecaster of Cesa-Bianchi et al. 
(2005) already gives us a strategy that is near optimal in the regret vs. communication trade-off.\n\n1 Introduction\n\nIn this paper, we consider the well-studied non-stochastic expert problem in a distributed setting. In the standard (non-distributed) setting, there are a total of n experts available for the decision-maker to consult, and at each round t = 1, . . . , T, she must choose to follow the advice of one of the experts, say a_t, from the set [n] = {1, . . . , n}. At the end of the round, she observes a payoff vector p_t \\in [0, 1]^n, where p_t[a] denotes the payoff that would have been received by following the advice of expert a. The payoff received by the decision-maker is p_t[a_t]. In the non-stochastic setting, an adversary decides the payoff vectors at any time step. At the end of the T rounds, the regret of the decision-maker is the difference between the payoff that she would have received using the single best expert at all times in hindsight and the payoff that she actually received, i.e. R = max_{a \\in [n]} \\sum_{t=1}^T p_t[a] - \\sum_{t=1}^T p_t[a_t]. The goal here is to minimize her regret; this general problem in the non-stochastic setting captures several applications of interest, such as experiment design, online ad-selection, portfolio optimization, etc. (See [1, 2, 3, 4, 5] and references therein.)\n\n*This work was performed while the author was at Harvard University, supported in part by grant NSF-CCF-09-64401.\n\u2020This work was performed while the author was at Harvard University, supported in part by grants NSF-IIS-0964473 and NSF-CCF-0915922.\n\nTight bounds on regret for the non-stochastic expert problem are obtained by the so-called follow the regularized leader approaches; at time t, the decision-maker chooses a distribution, x_t, over the n experts. Here x_t minimizes the quantity \\sum_{s=1}^{t-1} p_s \\cdot x + r(x), where r is a regularizer. 
Common regularizers are the entropy function, which results in Hedge [1] or the exponentially weighted forecaster (see Chap. 2 in [2]), or, as we consider in this paper, r(x) = \\bar{\\eta} \\cdot x, where \\bar{\\eta} is a random vector drawn uniformly from [0, \\eta]^n; this gives the follow the perturbed leader (FPL) algorithm [6].\n\nWe consider the setting where the decision-maker is a distributed system, in which several different nodes may select experts and/or observe payoffs at different time-steps. Such settings are common: internet search companies, such as Google or Bing, may use several nodes to answer search queries, and the performance is revealed by user clicks. From the point of view of making better predictions, it is useful to pool all available data. However, this may involve significant communication, which may be quite costly. Thus, the question of interest is the trade-off between the cost of communication and the cost of inaccuracy (due to not pooling together all data).\n\n2 Models and Summary of Results\n\nWe consider a distributed computation model consisting of one central coordinator node connected to k site nodes. The site nodes must communicate with each other through the coordinator node. At each time step, the distributed system receives a query^1, which indicates that it must choose an expert to follow. At the end of the round, the distributed system observes the payoff vector. We consider two different models, described in detail below: the site prediction model, where one of the k sites receives a query at any given time-step, and the coordinator prediction model, where the query is always received at the coordinator node. In both models, the payoff vector, p_t, is always observed at one of the k site nodes. Thus, some communication is required to share information about the payoff vectors among nodes. As we shall see, these two models yield different algorithms and performance bounds. 
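As a concrete illustration of the FPL rule just described (perturb each expert's cumulative payoff by uniform noise in [0, \\eta], then follow the leader), here is a minimal Python sketch. It is not the implementation evaluated in the paper; the payoff sequence and the noise scale are invented for the example:

```python
import random

def fpl_choose(cum_payoff, eta, rng):
    """Follow the perturbed leader: add uniform noise in [0, eta] to each
    expert's cumulative payoff and follow the argmax."""
    perturbed = [p + rng.uniform(0, eta) for p in cum_payoff]
    return max(range(len(perturbed)), key=lambda a: perturbed[a])

def run_fpl(payoffs, eta, seed=0):
    """Run FPL over a full payoff sequence; return the total payoff obtained."""
    rng = random.Random(seed)
    n = len(payoffs[0])
    cum = [0.0] * n
    total = 0.0
    for p in payoffs:
        a = fpl_choose(cum, eta, rng)            # pick an expert before seeing p
        total += p[a]                            # receive p[a]
        cum = [c + x for c, x in zip(cum, p)]    # then observe the full payoff vector
    return total

# toy adversarial sequence over 2 experts; expert 0 is best in hindsight
payoffs = [(1.0, 0.0) if t % 3 else (0.0, 1.0) for t in range(300)]
best = max(sum(p[a] for p in payoffs) for a in range(2))
got = run_fpl(payoffs, eta=300 ** 0.5)
```

The noise scale eta = \\sqrt{T} matches the order of perturbation the paper later notes FPL uses; the difference best - got is the (random) regret of this run.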
All missing proofs are provided in the long version [7].\n\nGoal: The algorithm implemented on the distributed system may use randomness, both to decide which expert to pick and to decide when to communicate with other nodes. We focus on simultaneously minimizing the expected regret and the expected communication used by the (distributed) algorithm. Recall that the expected regret is:\n\nE[R] = E[ max_{a \\in [n]} \\sum_{t=1}^T p_t[a] - \\sum_{t=1}^T p_t[a_t] ],     (1)\n\nwhere the expectation is over the random choices made by the algorithm. The expected communication is simply the expected number (over the random choices) of messages sent in the system.\n\nAs we show in this paper, this is a challenging problem, and to keep the analysis simple we focus on bounds in terms of the number of sites k and the time horizon T, which are often the most important scaling parameters. In particular, our algorithms are variants of follow the perturbed leader (FPL) and hence our bounds are not optimal in terms of the number of experts n. We believe that the dependence on the number of experts in our algorithms (upper bounds) can be strengthened using a different regularizer. Also, all our lower bounds are shown in terms of T and k, for n = 2. For larger n, using techniques similar to Thm. 3.6 in [2] should give the appropriate dependence on n.\n\nAdversaries: In the non-stochastic setting, we assume that an adversary may decide the payoff vectors, p_t, at each time-step and also the site, s_t, that receives the payoff vector (and also the query in the site-prediction model). An oblivious adversary cannot see any of the actions of the distributed system, i.e. selection of experts, communication patterns or any random bits used. However, the oblivious adversary may know the description of the algorithm. 
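For a fixed payoff sequence and a fixed sequence of chosen experts, the quantity inside the expectation in (1) is straightforward to compute directly; a small self-contained sketch with made-up numbers:

```python
def regret(payoffs, actions):
    """R = max_a sum_t p_t[a] - sum_t p_t[a_t], i.e. Eq. (1) for one
    realization (without the expectation over the algorithm's randomness)."""
    n = len(payoffs[0])
    best_fixed = max(sum(p[a] for p in payoffs) for a in range(n))
    received = sum(p[a] for p, a in zip(payoffs, actions))
    return best_fixed - received

# toy sequence over 2 experts and a toy action sequence
payoffs = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
actions = [0, 1, 0, 0]
# best fixed expert is 0 with payoff 3; the actions above collect 2, so regret is 1.0
```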
In addition to knowing the description of the algorithm, an adaptive adversary is stronger and can record all of the past actions of the algorithm, and use these arbitrarily to decide the future payoff vectors and site allocations.\n\nCommunication: We do not explicitly account for message sizes, since we are primarily concerned with scaling in terms of T and k. We require that the message size not depend on k or T, but only on the number of experts n. In other words, we assume that n is substantially smaller than T and k. All the messages used in our algorithms contain at most n real numbers. As is standard in the distributed systems literature, we assume that communication delay is 0, i.e. the updates sent by any node are received by the recipients before any future query arrives. All our results still hold under the weaker assumption that the number of queries received by the distributed system in the duration required to complete a broadcast is negligible compared to k.^2\n\n^1 We do not use the word query in the sense of explicitly giving some information or context, but merely as an indication of the occurrence of an event that forces some site or coordinator to choose an expert.\n\nWe now describe the two models in greater detail, state our main results and discuss related work:\n\n1. SITE PREDICTION MODEL: At each time step t = 1, . . . , T, one of the k sites, say s_t, receives a query and has to pick an expert, a_t, from the set [n] = {1, . . . , n}. The payoff vector p_t \\in [0, 1]^n, where p_t[i] is the payoff of the ith expert, is revealed only to the site s_t, and the decision-maker (distributed system) receives payoff p_t[a_t], corresponding to the expert actually chosen. The site prediction model is commonly studied in distributed machine learning settings (see [8, 9, 10]). The payoff vectors p_1, . . . , p_T and also the choice of sites that receive the query, s_1, . . . , s_T, are decided by an adversary. 
There are two very simple algorithms in this model:\n\n(i) Full communication: The coordinator always maintains the current cumulative payoff vector \\sum_{\\tau=1}^{t-1} p_\\tau. At time step t, s_t receives the current cumulative payoff vector \\sum_{\\tau=1}^{t-1} p_\\tau from the coordinator, chooses an expert a_t \\in [n] using FPL, receives payoff vector p_t and sends p_t to the coordinator, which updates its cumulative payoff vector. Note that the total communication is 2T and the system simulates (non-distributed) FPL to achieve the (optimal) regret guarantee O(\\sqrt{nT}).\n\n(ii) No communication: Each site maintains cumulative payoff vectors corresponding to the queries received by it, thus implementing k independent versions of FPL. Suppose that the ith site receives a total of T_i queries (\\sum_{i=1}^k T_i = T); the regret is bounded by \\sum_{i=1}^k O(\\sqrt{nT_i}) = O(\\sqrt{nkT}) and the total communication is 0. This upper bound is actually tight in the event that there is 0 communication (see the accompanying long version [7]).\n\nSimultaneously achieving regret that is asymptotically lower than \\sqrt{knT} using communication asymptotically lower than T turns out to be a significantly challenging question. Our main positive result is the first distributed expert algorithm in the oblivious adversarial (non-stochastic) setting using sub-linear communication. Finding such an algorithm in the case of an adaptive adversary is an interesting open problem.\n\nTheorem 1. When T \\geq 2k^{2.3}, there exists an algorithm for the distributed experts problem that against an oblivious adversary achieves regret O(log(n) \\sqrt{k^{5(1+\\epsilon)/6} T}) and uses communication O(T/k^\\epsilon), giving non-trivial guarantees in the range \\epsilon \\in (0, 1/5).\n\n2. COORDINATOR PREDICTION MODEL: At every time step, the query is received by the coordinator node, which chooses an expert a_t \\in [n]. 
However, at the end of the round, one of the site nodes, say s_t, observes the payoff vector p_t. The payoff vectors p_t and the choice of sites s_t are decided by an adversary. This model is also a natural one and is explored in the distributed systems and streaming literature (see [11, 12, 13] and references therein).\n\nThe full communication protocol is equally applicable here, getting the optimal regret bound O(\\sqrt{nT}) at the cost of substantial (essentially T) communication. But here, we do not have any straightforward algorithms that achieve non-trivial regret without using any communication. This model is closely related to the label-efficient prediction problem (see Chapter 6.1-3 in [2]), where the decision-maker has a limited budget and has to spend part of its budget to observe any payoff information. The optimal strategy is to request payoff information randomly with probability C/T at each time-step, if C is the communication budget. We refer to this algorithm as LEF (label-efficient forecaster) [14].\n\nTheorem 2. [14] (Informal) The LEF algorithm using FPL with communication budget C achieves regret O(T \\sqrt{n/C}) against both an adaptive and an oblivious adversary.\n\n^2 This is because in regularized leader like approaches, if the cumulative payoff vector changes by a small amount, the distribution over experts does not change much, because of the regularization effect.\n\nOne of the crucial differences between this model and the label-efficient setting is that when communication does occur, the site can send cumulative payoff vectors comprising all previous updates to the coordinator rather than just the latest one. The other difference is that, unlike in the label-efficient case, the sites have knowledge of their local regrets and can use it to decide when to communicate. 
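A minimal sketch of the label-efficient protocol just described, where each observed payoff vector is forwarded to the coordinator with probability p = C/T. The function name, learning rate and the exponentially weighted update on importance-weighted estimates are illustrative choices for this sketch, not the exact forecaster of [14]:

```python
import math
import random

def lef(payoffs, budget_C, eta=0.01, seed=0):
    """Label-efficient forecaster sketch: forward each round's payoff vector
    with probability p = C/T; the coordinator runs an exponentially weighted
    forecaster on the importance-weighted payoff estimates p_t[i] / p."""
    rng = random.Random(seed)
    T, n = len(payoffs), len(payoffs[0])
    p = min(1.0, budget_C / T)
    weights = [1.0] * n
    total, messages = 0.0, 0
    for pt in payoffs:
        s = sum(weights)
        a = rng.choices(range(n), weights=[w / s for w in weights])[0]
        total += pt[a]                       # payoff actually received
        if rng.random() < p:                 # site forwards this payoff vector
            messages += 1
            for i in range(n):               # unbiased estimate pt[i] / p
                weights[i] *= math.exp(eta * pt[i] / p)
    return total, messages

# toy run: expert 0 always pays 1 over T = 1000 rounds, budget C = 100
total, messages = lef([(1.0, 0.0)] * 1000, budget_C=100)
```

On this toy input the forecaster concentrates on expert 0 after a handful of forwarded updates, while the expected number of messages stays close to the budget C.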
However, our lower bounds for natural types of algorithms show that these advantages probably do not help to get better guarantees.\n\nLower Bound Results: In the case of an adaptive adversary, we have an unconditional (for any type of algorithm) lower bound in both models:\n\nTheorem 3. Let n = 2 be the number of experts. Then any (distributed) algorithm that achieves expected regret o(\\sqrt{kT}) must use communication (T/k)(1 - o(1)).\n\nThe proof appears in [7]. Notice that in the coordinator prediction model, when C = T/k, this lower bound is matched by the upper bound of LEF.\n\nIn the case of an oblivious adversary, our results are weaker, but we can show that certain natural types of algorithms are not applicable directly in this setting. The so-called regularized leader algorithms maintain a cumulative payoff vector, P_t, and use only this and a regularizer to select an expert at time t. We consider two variants in the distributed setting:\n\n(i) Distributed Counter Algorithms: Here the forecaster only uses \\tilde{P}_t, which is an (approximate) version of the cumulative payoff vector P_t. But we make no assumptions on how the forecaster will use \\tilde{P}_t. \\tilde{P}_t can be maintained using sub-linear communication by applying techniques from the distributed systems literature [12].\n\n(ii) Delayed Regularized Leader: Here the regularized leaders do not try to explicitly maintain an approximate version of the cumulative payoff vector. Instead, they may use an arbitrary communication protocol, but make predictions using the cumulative payoff vector (using any past payoff vectors that they could have received) and some regularizer.\n\nWe show in Section 3.2 that the distributed counter approach does not yield any non-trivial guarantee in the site-prediction model even against an oblivious adversary. 
It is possible to show a similar lower bound in the coordinator prediction model, but it is omitted since it follows easily from the idea in the site-prediction model combined with an explicit communication lower bound given in [12].\n\nSection 4 shows that the delayed regularized leader approach is ineffective even against an oblivious adversary in the coordinator prediction model, suggesting that the LEF algorithm is near optimal.\n\nRelated Work: Recently there has been significant interest in distributed online learning questions (see for example [8, 9, 10]). However, these works have focused mainly on stochastic optimization problems. Thus, the techniques used, such as reducing variance through mini-batching, are not applicable to our setting. Questions such as network structure [9] and network delays [10] are interesting in our setting as well; however, at present our work focuses on establishing some non-trivial regret guarantees in the distributed online non-stochastic experts setting. The study of communication as a resource in distributed learning is also considered in [15, 16, 17]; however, this body of work seems applicable only to offline learning.\n\nThe other related work is that of distributed functional monitoring [11], and in particular distributed counting [12, 13] and sketching [18]. Some of these techniques have been successfully applied in offline machine learning problems [19]. However, we are the first to analyze the performance-communication trade-off of an online learning algorithm in the standard distributed functional monitoring framework [11]. An application of a distributed counter to online Bayesian regression was proposed in Liu et al. [13]. 
Our lower bounds, discussed below, show that approximate distributed counter techniques do not directly yield non-trivial algorithms.\n\n3 Site-prediction model\n\n3.1 Upper Bounds\n\nWe describe our algorithm that simultaneously achieves non-trivial bounds on expected regret and expected communication. We begin by making two assumptions that simplify the exposition. First, we assume that there are only 2 experts. The generalization from 2 experts to n is easy, as discussed in Remark 1 at the end of this section. Second, we assume that there exists a global query counter, available to all sites and the coordinator, which keeps track of the total number of queries received across the k sites. We discuss this assumption in Remark 2 at the end of the section. As is often the case in online algorithms, we assume that the time horizon T is known. Otherwise, the standard doubling trick may be employed. The notation used in this section is defined in Table 1.\n\nSymbol | Definition\np_t | Payoff vector at time-step t, p_t \\in [0, 1]^2\n\\ell | The length of the blocks into which the input is divided\nb | Number of input blocks, b = T/\\ell\nP_i | Cumulative payoff vector within block i, P_i = \\sum_{t=(i-1)\\ell+1}^{i\\ell} p_t\nQ_i | Cumulative payoff vector until the end of block (i-1), Q_i = \\sum_{j=1}^{i-1} P_j\nM(v) | For a vector v \\in R^2, M(v) = 1 if v_1 > v_2; M(v) = 2 otherwise\nFP_i(\\eta) | Random variable denoting the payoff obtained by playing FPL(\\eta) on block i\nFR^i_a(\\eta) | Random variable denoting the regret with respect to action a of playing FPL(\\eta) on block i; FR^i_a(\\eta) = P_i[a] - FP_i(\\eta)\nFR^i(\\eta) | Random variable denoting the regret of playing FPL(\\eta) on the payoff vectors in block i; FR^i(\\eta) = max_{a=1,2} P_i[a] - FP_i(\\eta) = max_{a=1,2} FR^i_a(\\eta)\n\nTable 1: Notation used in Algorithm DFPL (Fig. 
1) and in Section 3.1.\n\nDFPL(T, \\ell, \\eta):\n  set b = T/\\ell; \\eta' = \\sqrt{\\ell}; q = 2\\ell^3 T^2 / \\eta^5\n  for i = 1, . . . , b:\n    let Y_i = Bernoulli(q)\n    if Y_i = 1 then  # step phase\n      play FPL(\\eta') for time-steps (i-1)\\ell + 1, . . . , i\\ell\n    else  # block phase\n      a_i = M(Q_i + r) where r \\in_R [0, \\eta]^2\n      play a_i for time-steps (i-1)\\ell + 1, . . . , i\\ell\n    P_i = \\sum_{t=(i-1)\\ell+1}^{i\\ell} p_t\n    Q_{i+1} = Q_i + P_i\n\nFPL(T, n = 2, \\eta):\n  for t = 1, . . . , T:\n    a_t = M(\\sum_{\\tau=1}^{t-1} p_\\tau + r) where r \\in_R [0, \\eta]^2\n    follow expert a_t at time-step t\n    observe payoff vector p_t\n\nFigure 1: (a) DFPL: Distributed Follow the Perturbed Leader; (b) FPL: Follow the Perturbed Leader with parameter \\eta for 2 experts (M(\\cdot) is defined in Table 1, r is a random vector).\n\nAlgorithm Description: Our algorithm DFPL is described in Figure 1(a). We make use of the FPL algorithm, described in Figure 1(b), which takes as a parameter the amount of added noise \\eta. The DFPL algorithm treats the T time steps as b (= T/\\ell) blocks, each of length \\ell. At a high level, with probability q on any given block the algorithm is in the step phase, running a copy of FPL (with noise parameter \\eta') across all time steps of the block, synchronizing after each time step. Otherwise it is in a block phase, running a copy of FPL (with noise parameter \\eta) across blocks, with the same expert being followed for the entire block and synchronization after each block. This effectively makes P_i, the cumulative payoff over block i, the payoff vector for the block FPL. The block FPL has on average (1 - q)T/\\ell total time steps. We begin by stating a (slightly stronger) guarantee for FPL.\n\nLemma 1. Consider the case n = 2. Let p_1, . . . 
, p_T \\in [0, 1]^2 be a sequence of payoff vectors such that max_t |p_t|_\\infty \\leq B. Then FPL(\\eta) has the following guarantee on expected regret: E[R] \\leq (B/\\eta) \\sum_{t=1}^T |p_t[1] - p_t[2]| + \\eta.\n\nThe proof is a simple modification of the standard analysis [6] and is given in [7]. The rest of this section is devoted to the proof of Lemma 2.\n\nLemma 2. Consider the case n = 2. If T > 2k^{2.3}, Algorithm DFPL (Fig. 1), when run with parameters \\ell, T, \\eta = \\ell^{5/12} T^{1/2} and b, \\eta', q as defined in Fig. 1, has expected regret O(\\sqrt{\\ell^{5/6} T}) and expected communication O(Tk/\\ell). In particular, for \\ell = k^{1+\\epsilon} with 0 < \\epsilon < 1/5, the algorithm simultaneously achieves regret that is asymptotically lower than \\sqrt{kT} and communication that is asymptotically lower^3 than T.\n\n^3 Note that here the asymptotics are in terms of both parameters k and T. Getting communication of the form T^{1-\\delta} f(k) for a regret bound better than \\sqrt{kT} seems to be a fairly difficult and interesting problem.\n\nSince we are in the case of an oblivious adversary, we may assume that the payoff vectors p_1, . . . , p_T are fixed ahead of time. Without loss of generality, let expert 1 (out of {1, 2}) be the one that has the greater payoff in hindsight. Recall that FR^i_1(\\eta') denotes the random variable that is the regret of playing FPL(\\eta') in a step phase on block i with respect to the first expert. In particular, this will be negative if expert 2 is the best expert on block i, even though globally expert 1 is better. In fact, this is exactly what our algorithm exploits: it gains on regret in the communication-expensive step phase while saving on communication in the block phase.\n\nThe regret can be written as R = \\sum_{i=1}^b ( Y_i FR^i_1(\\eta') + (1 - Y_i)(P_i[1] - P_i[a_i]) ). Note that the random variables Y_i are independent of the random variables FR^i_1(\\eta') and the random variables a_i. As E[Y_i] = q, we can bound the expected regret as follows:\n\nE[R] \\leq q \\sum_{i=1}^b E[FR^i_1(\\eta')] + (1 - q) \\sum_{i=1}^b E[P_i[1] - P_i[a_i]]     (2)\n\nWe first analyze the second term of the above equation. This is just the regret corresponding to running FPL(\\eta) at the block level, with T/\\ell time steps. Using the fact that max_i |P_i|_\\infty \\leq \\ell max_t |p_t|_\\infty \\leq \\ell, Lemma 1 allows us to conclude that:\n\n\\sum_{i=1}^b E[P_i[1] - P_i[a_i]] \\leq (\\ell/\\eta) \\sum_{i=1}^b |P_i[1] - P_i[2]| + \\eta     (3)\n\nNext, we analyse the first term of inequality (2). We chose \\eta' = \\sqrt{\\ell} (see Fig. 1), and the analysis of FPL guarantees that E[FR^i(\\eta')] \\leq 2\\sqrt{\\ell}, where FR^i(\\eta') denotes the random variable that is the actual regret of FPL(\\eta'), not the regret with respect to expert 1 (which is FR^i_1(\\eta')). Now either FR^i(\\eta') = FR^i_1(\\eta') (i.e. expert 1 was the better one on block i), in which case E[FR^i_1(\\eta')] \\leq 2\\sqrt{\\ell}; or FR^i(\\eta') = FR^i_2(\\eta') (i.e. expert 2 was the better one on block i), in which case E[FR^i_1(\\eta')] \\leq 2\\sqrt{\\ell} + P_i[1] - P_i[2]. Note that in this expression P_i[1] - P_i[2] is negative. Putting everything together, we can write E[FR^i_1(\\eta')] \\leq 2\\sqrt{\\ell} - (P_i[2] - P_i[1])_+, where (x)_+ = x if x \\geq 0 and 0 otherwise. Thus, we get the main equation for regret:\n\nE[R] \\leq 2qb\\sqrt{\\ell} - q \\sum_{i=1}^b (P_i[2] - P_i[1])_+ [term 1] + (\\ell/\\eta) \\sum_{i=1}^b |P_i[1] - P_i[2]| [term 2] + \\eta     (4)\n\nNote that the first (i.e. 2qb\\sqrt{\\ell}) and last (i.e. \\eta) terms of inequality (4) are O(\\sqrt{\\ell^{5/6} T}) for the setting of the parameters as in Lemma 2. The strategy is to show that when \u201cterm 2\u201d becomes large, then \u201cterm 1\u201d is also large in magnitude, but negative, compensating the effect of \u201cterm 2\u201d. We consider a few cases:\n\nCase 1: The best expert is identified quickly and not changed thereafter. Let \\zeta denote the maximum index i such that Q_i[1] - Q_i[2] \\leq \\eta. Note that after block \\zeta is processed, the algorithm in the block phase will never follow expert 2. Suppose that \\zeta \\leq (\\eta/\\ell)^2. We note that the correct bound for \u201cterm 2\u201d is now actually (\\ell/\\eta) \\sum_{i=1}^\\zeta |P_i[1] - P_i[2]| \\leq \\ell^2 \\zeta / \\eta \\leq \\eta, since |P_i[1] - P_i[2]| \\leq \\ell for all i.\n\nCase 2: The best expert may not be identified quickly; furthermore, |P_i[1] - P_i[2]| is large often. In this case, although \u201cterm 2\u201d may be large (when |P_i[1] - P_i[2]| is large), this is compensated by the negative regret in \u201cterm 1\u201d of expression (4). This is because if |P_i[1] - P_i[2]| is large often, but the best expert is not identified quickly, there must be enough blocks on which (P_i[2] - P_i[1]) is positive and large. Notice that \\zeta \\geq (\\eta/\\ell)^2. Define \\lambda = \\eta^2/T and let S = {i \\leq \\zeta : |P_i[1] - P_i[2]| \\geq \\lambda}. Let \\alpha = |S|/\\zeta. We show that \\sum_{i=1}^\\zeta (P_i[2] - P_i[1])_+ \\geq (\\alpha\\zeta\\lambda)/2 - \\eta. To see this, consider S_1 = {i \\in S : P_i[1] > P_i[2]} and S_2 = S \\setminus S_1. First, observe that \\sum_{i \\in S} |P_i[1] - P_i[2]| \\geq \\alpha\\zeta\\lambda. Then, if \\sum_{i \\in S_2} (P_i[2] - P_i[1]) \\geq (\\alpha\\zeta\\lambda)/2, we are done. If not, \\sum_{i \\in S_1} (P_i[1] - P_i[2]) \\geq (\\alpha\\zeta\\lambda)/2. Now notice that \\sum_{i=1}^\\zeta P_i[1] - P_i[2] \\leq \\eta, hence it must be the case that \\sum_{i=1}^\\zeta (P_i[2] - P_i[1])_+ \\geq (\\alpha\\zeta\\lambda)/2 - \\eta. Now, for the value q = 2\\ell^3 T^2 / \\eta^5 and if \\alpha \\geq \\eta^2/(T\\ell), the negative contribution of \u201cterm 1\u201d is at least q\\alpha\\zeta\\lambda/2, which is at least the maximum possible positive contribution of \u201cterm 2\u201d, namely \\ell^2 \\zeta / \\eta (at \\alpha = \\eta^2/(T\\ell) these two quantities are equal), and hence the total contribution of \u201cterm 1\u201d and \u201cterm 2\u201d together is at most \\eta.\n\nCase 3: |P_i[1] - P_i[2]| is \u201csmall\u201d most of the time. In this case the parameter \\eta is actually well-tuned (which was not the case when |P_i[1] - P_i[2]| \u2248 \\ell) and gives us a small overall regret (see Lemma 1). We have \\alpha < \\eta^2/(T\\ell). Note that \\alpha\\ell \\leq \\lambda = \\eta^2/T and that \\zeta \\leq T/\\ell. In this case \u201cterm 2\u201d can be bounded easily as follows: (\\ell/\\eta)(\\alpha\\zeta\\ell + (1 - \\alpha)\\zeta\\lambda) \\leq 2\\eta.\n\nThe above three cases exhaust all possibilities, and hence no matter what the nature of the payoff sequence, the expected regret of DFPL is bounded by O(\\eta) as required. The expected total communication is easily seen to be O(qT + Tk/\\ell): the q(T/\\ell) blocks on which step FPL is used contribute O(\\ell) communication each, and the (1 - q)(T/\\ell) blocks where block FPL is used contribute O(k) communication each.\n\nRemark 1. Our algorithm can be generalized to n experts by recursively dividing the set of experts in two and applying our algorithm to two meta-experts, giving the result of Theorem 1. Details are provided in [7].\n\nRemark 2. Instead of a global counter, it suffices for the coordinator to maintain an approximate counter and to notify all sites of the beginning and end of blocks by broadcast. This only adds 2k communication per block. See [7] for more details.\n\n3.2 Lower Bounds\n\nIn this section we give a lower bound on distributed counter algorithms in the site prediction model. Distributed counters allow tight approximation guarantees, i.e. for an additive approximation factor \\beta, the communication required is only O(T log(T) \\sqrt{k} / \\beta) [12]. We observe that the noise used by FPL is quite large, O(\\sqrt{T}), and so it is tempting to find a suitable \\beta and run FPL using approximate cumulative payoffs. We consider the class of algorithms such that:\n\n(i) Whenever a site receives a query, it has an (approximate) cumulative payoff of each expert to additive accuracy \\beta. Furthermore, any communication is only used to maintain such a counter.\n\n(ii) Any site only uses the (approximate) cumulative payoffs and any local information it may have to choose an expert when queried.\n\nHowever, our negative result shows that even with a highly accurate counter, \\beta = O(k), the non-stochasticity of the payoff sequence may cause any such algorithm to have \\Omega(\\sqrt{kT}) regret. Furthermore, we show that any distributed algorithm that implements (approximate) counters to additive error k/10 on all sites^4 uses at least \\Omega(T) communication.\n\nTheorem 4. 
At any time step t, suppose each site has an (approximate) cumulative payoff count, \\tilde{P}_t[a], for every expert, such that |P_t[a] - \\tilde{P}_t[a]| \\leq \\beta. Then we have the following:\n\n1. If \\beta \\leq k, any algorithm that uses the approximate counts \\tilde{P}_t[a] and any local information at the site making the decision cannot achieve expected regret asymptotically better than \\sqrt{\\beta T}.\n\n2. Any protocol on the distributed system that guarantees that at each time step each site has a \\beta = k/10 approximate cumulative payoff with probability \\geq 1/2 uses \\Omega(T) communication.\n\n^4 The approximation guarantee is only required when a site receives a query and has to make a prediction.\n\n4 Coordinator-prediction model\n\nIn the coordinator prediction model, as mentioned earlier, it is possible to use the label-efficient forecaster, LEF (Chap. 6 [2, 14]). Let C be an upper bound on the total amount of communication we are allowed to use. The label-efficient predictor translates into the following simple protocol: whenever a site receives a payoff vector, it forwards that particular payoff to the coordinator with probability p \u2248 C/T. The coordinator always executes the exponentially weighted forecaster over the sampled subset of payoffs to make new decisions. Here, the expected regret is O(T \\sqrt{log(n)/C}). In other words, if our regret needs to be O(\\sqrt{T}), the communication needs to be linear in T.\n\nWe observe that in principle there is a possibility of better algorithms in this setting, for mainly two reasons: (i) when the sites send payoff vectors to the coordinator, they can send cumulative payoffs rather than the latest ones, thus giving more information, and (ii) the sites may decide when to communicate as a function of the payoff vectors instead of just randomly. 
However, we present a lower bound showing that, for a natural family of algorithms, achieving regret O(√T) requires at least Ω(T^(1−ε)) communication for every ε > 0, even when k = 1. The type of algorithms we consider may have an arbitrary communication protocol, but it satisfies the following: (i) Whenever a site communicates with the coordinator, the site reports its local cumulative payoff vector. (ii) When the coordinator makes a decision, it executes FPL(√T) (follow the perturbed leader with noise √T) using the latest cumulative payoff vector. The proof of Theorem 5 appears in [7], and the results can be generalized to other regularizers.
Theorem 5. Consider the distributed non-stochastic experts problem in the coordinator-prediction model. Any algorithm of the kind described above that achieves regret O(√T) must use Ω(T^(1−ε)) communication against an oblivious adversary, for every constant ε.

5 Simulations

Figure 2: (a) Cumulative regret for the MC sequences as a function of correlation λ; (b) worst-case cumulative regret vs. communication cost for the MC and zig-zag sequences.

In this section, we describe some simulation results comparing the efficacy of our algorithm DFPL with some other techniques. We compare DFPL against the simple baselines of full communication and no communication, and two other algorithms, which we refer to as mini-batch and HYZ. In the mini-batch algorithm, at any time step the coordinator requests, randomly with some probability p, all cumulative payoff vectors at all sites. It then broadcasts the sum (across all of the sites) back to the sites, so that all sites have the latest cumulative payoff vector. Whenever such a communication does occur, the cost is 2k. We refer to this as mini-batch because it is similar in spirit to the mini-batch algorithms used in stochastic optimization problems.
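For concreteness, the FPL(√T) rule used by the coordinator in Theorem 5 can be sketched as follows. This is a hedged illustration: the uniform perturbation in [0, √T] is the standard Kalai-Vempala choice of noise at this scale, and the zigzag_payoff helper is our own rendering of the zig-zag sequence from this section, not code from the paper.

```python
import math
import random

def fpl_choose(cum_payoffs, T, rng):
    """Follow-the-perturbed-leader with noise scale sqrt(T): perturb each
    expert's cumulative payoff with independent uniform noise in
    [0, sqrt(T)] and pick the argmax."""
    scale = math.sqrt(T)
    perturbed = [pay + rng.random() * scale for pay in cum_payoffs]
    return max(range(len(perturbed)), key=lambda i: perturbed[i])

def zigzag_payoff(t, mu):
    """Zig-zag payoff sequence (our rendering): expert 1 is better for the
    first mu steps, then the better expert alternates every 2*mu steps."""
    if t < mu:
        return (1, 0)
    phase = ((t - mu) // (2 * mu)) % 2
    return (0, 1) if phase == 0 else (1, 0)
```

Because the zig-zag sequence keeps the two cumulative payoffs within O(µ) of each other, the √T-scale noise dominates the gap, which is what forces the frequent communication in the lower-bound argument.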
In the HYZ algorithm, we use the distributed counter technique of Huang et al. [12] to maintain the (approximate) cumulative payoff for each expert. Whenever a counter update occurs, the coordinator must broadcast to all nodes to make sure they have the most current update.
We consider two types of synthetic sequences. The first is a zig-zag sequence, with µ being the length of one increase/decrease. For the first µ time steps the payoff vector is always (1, 0) (expert 1 being better), then for the next 2µ time steps the payoff vector is (0, 1) (expert 2 is better), then again for the next 2µ time steps the payoff vector is (1, 0), and so on. The zig-zag sequence is also the sequence used in the proof of the lower bound in Theorem 5. The second is a two-state Markov chain (MC) with states 1, 2 and Pr[1 → 2] = Pr[2 → 1] = 1/(2λ). While in state 1 the payoff vector is (1, 0), and while in state 2 it is (0, 1).
In our simulations we use T = 20000 predictions and k = 20 sites. Fig. 2(a) shows the performance of the above algorithms for the MC sequences; the results are averaged across 100 runs, over both the randomness of the MC and of the algorithms. Fig. 2(b) shows the worst-case cumulative communication vs. worst-case cumulative regret trade-off for three algorithms, DFPL, mini-batch and HYZ, over all the described sequences. While in general it is hard to compare algorithms on non-stochastic inputs, our results confirm that for non-stochastic sequences inspired by the lower bounds in the paper, our algorithm DFPL outperforms the other related techniques.

References

[1] Y.
Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, 1995.
[2] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[3] T. Cover. Universal portfolios. Mathematical Finance, 1:1–19, 1991.
[4] E. Hazan and S. Kale. On stochastic and worst-case models for investing. In NIPS, 2009.
[5] E. Hazan. The convex optimization approach to regret minimization. Optimization for Machine Learning, 2012.
[6] A. Kalai and S. Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71:291–307, 2005.
[7] V. Kanade, Z. Liu, and B. Radunović. Distributed non-stochastic experts. In arXiv, 2012.
[8] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction. In ICML, 2011.
[9] J. Duchi, A. Agarwal, and M. Wainwright. Distributed dual averaging in networks. In NIPS, 2010.
[10] A. Agarwal and J. Duchi. Distributed delayed stochastic optimization. In NIPS, 2011.
[11] G. Cormode, S. Muthukrishnan, and K. Yi. Algorithms for distributed functional monitoring. ACM Transactions on Algorithms, 7, 2011.
[12] Z. Huang, K. Yi, and Q. Zhang. Randomized algorithms for tracking distributed count, frequencies and ranks. In PODS, 2012.
[13] Z. Liu, B. Radunović, and M. Vojnović. Continuous distributed counting for non-monotone streams. In PODS, 2012.
[14] N. Cesa-Bianchi, G. Lugosi, and G. Stoltz. Minimizing regret with label efficient prediction. In ISIT, 2005.
[15] M.-F. Balcan, A. Blum, S. Fine, and Y. Mansour. Distributed learning, communication complexity and privacy. In COLT, 2012.
[16] H. Daumé III, J. M. Phillips, A. Saha, and S. Venkatasubramanian. Protocols for learning classifiers on distributed data. In AISTATS, 2012.
[17] H. Daumé III, J. M. Phillips, A.
Saha, and S. Venkatasubramanian. Efficient protocols for distributed classification and optimization. In arXiv:1204.3523v1, 2012.
[18] G. Cormode, M. Garofalakis, P. Haas, and C. Jermaine. Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches. Foundations and Trends in Databases, 2012.
[19] K. Clarkson, E. Hazan, and D. Woodruff. Sublinear optimization for machine learning. In FOCS, 2010.