{"title": "From Bandits to Experts: A Tale of Domination and Independence", "book": "Advances in Neural Information Processing Systems", "page_first": 1610, "page_last": 1618, "abstract": "We consider the partial observability model for multi-armed bandits, introduced by Mannor and Shamir (2011). Our main result is a characterization of regret in the directed observability model in terms of the dominating and independence numbers of the observability graph. We also show that in the undirected case, the learner can achieve optimal regret without even accessing the observability graph before selecting an action. Both results are shown using variants of the Exp3 algorithm operating on the observability graph in a time-efficient manner.", "full_text": "From Bandits to Experts:\n\nA Tale of Domination and Independence\n\nNoga Alon\n\nTel-Aviv University, Israel\n\nnogaa@tau.ac.il\n\nNicol`o Cesa-Bianchi\n\nUniversit`a degli Studi di Milano, Italy\n\nnicolo.cesa\u00adbianchi@unimi.it\n\nClaudio Gentile\n\nUniversity of Insubria, Italy\n\nclaudio.gentile@uninsubria.it\n\nYishay Mansour\n\nTel-Aviv University, Israel\nmansour@tau.ac.il\n\nAbstract\n\nWe consider the partial observability model for multi-armed bandits, introduced\nby Mannor and Shamir [14]. Our main result is a characterization of regret in\nthe directed observability model in terms of the dominating and independence\nnumbers of the observability graph (which must be accessible before selecting an\naction). In the undirected case, we show that the learner can achieve optimal regret\nwithout even accessing the observability graph before selecting an action. 
Both results are shown using variants of the Exp3 algorithm operating on the observability graph in a time-efficient manner.\n\n1 Introduction\n\nPrediction with expert advice —see, e.g., [13, 16, 6, 10, 7]— is a general abstract framework for studying sequential prediction problems, formulated as repeated games between a player and an adversary. A well-studied example of prediction game is the following: In each round, the adversary privately assigns a loss value to each action in a fixed set. Then the player chooses an action (possibly using randomization) and incurs the corresponding loss. The goal of the player is to control regret, which is defined as the excess loss incurred by the player as compared to the best fixed action over a sequence of rounds. Two important variants of this game have been studied in the past: the expert setting, where at the end of each round the player observes the loss assigned to each action for that round, and the bandit setting, where the player only observes the loss of the chosen action, but not that of other actions.\n\nLet K be the number of available actions, and T be the number of prediction rounds. The best possible regret for the expert setting is of order √(T log K). This optimal rate is achieved by the Hedge algorithm [10] or the Follow the Perturbed Leader algorithm [12]. In the bandit setting, the optimal regret is of order √(TK), achieved by the INF algorithm [2]. 
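The exponential-weights mechanism behind Hedge is simple enough to state in a few lines. Below is a minimal Python sketch of one round in the expert setting (function and variable names are our own, not taken from any cited implementation):

```python
import math
import random

def hedge_round(weights, losses, eta):
    """One round of exponential weights (Hedge) in the expert setting:
    sample an action from the normalized weights, then, since all
    losses are observed, update every weight multiplicatively."""
    total = sum(weights)
    probs = [w / total for w in weights]
    action = random.choices(range(len(weights)), weights=probs)[0]
    new_weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    return action, probs, new_weights
```

With a learning rate η of order √((ln K)/T), this scheme attains the √(T log K) expert-setting regret quoted above.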
A bandit variant of Hedge, called Exp3 [3], achieves a regret with a slightly worse bound of order √(TK log K).\n\nRecently, Mannor and Shamir [14] introduced an elegant way for defining intermediate observability models between the expert setting (full observability) and the bandit setting (single observability). An intuitive way of representing an observability model is through a directed graph over actions: an arc1 from action i to action j implies that when playing action i we get information also about the loss of action j. Thus, the expert setting is obtained by choosing a complete graph over actions (playing any action reveals all losses), and the bandit setting is obtained by choosing an empty edge set (playing an action only reveals the loss of that action).\n\n1 According to the standard terminology in directed graph theory, throughout this paper a directed edge will be called an arc.\n\nThe main result of [14] concerns undirected observability graphs. The regret is characterized in terms of the independence number α of the undirected observability graph. Specifically, they prove that √(Tα log K) is the optimal regret (up to logarithmic factors) and show that a variant of Exp3, called ELP, achieves this bound when the graph is known ahead of time, where α ∈ {1, . . . , K} interpolates between full observability (α = 1 for the clique) and single observability (α = K for the graph with no edges). Given the observability graph, ELP runs a linear program to compute the desired distribution over actions. In the case when the graph changes over time, and at each time step ELP observes the current observability graph before prediction, a bound of √(Σ_{t=1}^T α_t log K) is shown, where α_t is the independence number of the graph at time t. 
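To make the interpolation concrete, the leading terms of the three regret rates can be compared directly. A small sketch, ignoring the constants and lower-order factors that the bounds above also hide:

```python
import math

def partial_observability_rate(T, K, alpha):
    """Leading term sqrt(T * alpha * log K) of the regret under the
    Mannor-Shamir characterization: alpha = 1 (clique) recovers the
    expert rate sqrt(T log K); alpha = K (empty graph) gives the
    Exp3-style bandit rate sqrt(T K log K)."""
    return math.sqrt(T * alpha * math.log(K))
```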
A major problem left open in [14] was the characterization of regret for directed observability graphs, a setting for which they only proved partial results.\n\nOur main result is a full characterization (to within logarithmic factors) of regret in the case of directed observability graphs. Our upper bounds are proven using a new algorithm, called Exp3-DOM. This algorithm is efficient to run even when the graph changes over time: it just needs to compute a small dominating set of the current observability graph (which must be given as side information)2 before prediction. As in the undirected case, the regret for the directed case is characterized in terms of the independence numbers of the observability graphs (computed ignoring edge directions). We arrive at this result by showing that a key quantity emerging in the analysis of Exp3-DOM can be bounded in terms of the independence numbers of the graphs. This bound (Lemma 13 in the appendix) is based on a combinatorial construction which might be of independent interest.\n\nWe also explore the possibility of the learning algorithm receiving the observability graph only after prediction, and not before. For this setting, we introduce a new variant of Exp3, called Exp3-SET, which achieves the same regret as ELP for undirected graphs, but without the need of accessing the current observability graph before each prediction. We show that in some random directed graph models Exp3-SET also has a good performance. In general, we can upper bound the regret of Exp3-SET as a function of the maximum acyclic subgraph of the observability graph, but this upper bound may not be tight. Yet, Exp3-SET is much simpler and computationally less demanding than ELP, which needs to solve a linear program in each round.\n\nThere are a variety of real-world settings where partial observability models corresponding to directed and undirected graphs are applicable. One of them is route selection. 
We are given a graph of possible routes connecting cities: when we select a route r connecting two cities, we observe the cost (say, driving time or fuel consumption) of the \"edges\" along that route and, in addition, we have complete information on any sub-route r′ of r, but not vice versa. We abstract this in our model by having an observability graph over routes r, and an arc from r to any of its sub-routes r′.3\n\nSequential prediction problems with partial observability models also arise in the context of recommendation systems. For example, an online retailer, which advertises products to users, knows that users buying certain products are often interested in a set of related products. This knowledge can be represented as a graph over the set of products, where two products are joined by an edge if and only if users who buy any one of the two are likely to buy the other as well. In certain cases, however, edges have a preferred orientation. For instance, a person buying a video game console might also buy a high-def cable to connect it to the TV set. Vice versa, interest in high-def cables need not indicate an interest in game consoles.\n\nSuch observability models may also arise in the case when a recommendation system operates in a network of users. For example, consider the problem of recommending a sequence of products, or contents, to users in a group. Suppose the recommendation system is hosted on an online social network, on which users can befriend each other. In this case, it has been observed that social relationships reveal similarities in tastes and interests [15]. However, social links can also be asymmetric (e.g., followers of celebrities). In such cases, followers might be more likely to shape their preferences after the person they follow, than the other way around. 
Hence, a product liked by a celebrity is probably also liked by his/her followers, whereas a preference expressed by a follower is more often specific to that person.\n\n2 Computing an approximately minimum dominating set can be done by running a standard greedy set cover algorithm, see Section 2.\n\n3 Though this example may also be viewed as an instance of combinatorial bandits [8], the model studied here is more general. For example, it does not assume linear losses, which could arise in the routing example from the partial ordering of sub-routes.\n\n2 Learning protocol, notation, and preliminaries\n\nAs stated in the introduction, we consider an adversarial multi-armed bandit setting with a finite action set V = {1, . . . , K}. At each time t = 1, 2, . . . , a player (the \"learning algorithm\") picks some action I_t ∈ V and incurs a bounded loss ℓ_{I_t,t} ∈ [0, 1]. Unlike the standard adversarial bandit problem [3, 7], where only the played action I_t reveals its loss ℓ_{I_t,t}, here we assume all the losses in a subset S_{I_t,t} ⊆ V of actions are revealed after I_t is played. More formally, the player observes the pairs (i, ℓ_{i,t}) for each i ∈ S_{I_t,t}. We also assume i ∈ S_{i,t} for any i and t, that is, any action reveals its own loss when played. Note that the bandit setting (S_{i,t} = {i}) and the expert setting (S_{i,t} = V) are both special cases of this framework. We call S_{i,t} the observation set of action i at time t, and write i →_t j when at time t playing action i also reveals the loss of action j. Hence, S_{i,t} = {j ∈ V : i →_t j}. The family of observation sets {S_{i,t}}_{i∈V} we collectively call the observation system at time t.\n\nThe adversaries we consider are nonoblivious. 
Namely, each loss ℓ_{i,t} at time t can be an arbitrary function of the past player's actions I_1, . . . , I_{t−1}. The performance of a player A is measured through the regret\n\nmax_{k∈V} E[L_{A,T} − L_{k,T}]\n\nwhere L_{A,T} = ℓ_{I_1,1} + · · · + ℓ_{I_T,T} and L_{k,T} = ℓ_{k,1} + · · · + ℓ_{k,T} are the cumulative losses of the player and of action k, respectively. The expectation is taken with respect to the player's internal randomization (since losses are allowed to depend on the player's past random actions, also L_{k,T} may be random).4 The observation system {S_{i,t}}_{i∈V} is also adversarially generated, and each S_{i,t} can be an arbitrary function of past player's actions, just like losses are. However, in Section 3 we also consider a variant in which the observation system is randomly generated according to a specific stochastic model.\n\nWhereas some algorithms need to know the observation system at the beginning of each step t, others need not. From this viewpoint, we consider two online learning settings. In the first setting, called the informed setting, the full observation system {S_{i,t}}_{i∈V} selected by the adversary is made available to the learner before making its choice I_t. This is essentially the \"side-information\" framework first considered in [14]. In the second setting, called the uninformed setting, no information whatsoever regarding the time-t observation system is given to the learner prior to prediction. We find it convenient to adopt the same graph-theoretic interpretation of observation systems as in [14]. At each step t = 1, 2, . . . , the observation system {S_{i,t}}_{i∈V} defines a directed graph G_t = (V, D_t), where V is the set of actions, and D_t is the set of arcs, i.e., ordered pairs of nodes. 
For j ≠ i, arc (i, j) ∈ D_t if and only if i →_t j (the self-loops created by i →_t i are intentionally ignored). Hence, we can equivalently define {S_{i,t}}_{i∈V} in terms of G_t. Observe that the outdegree d⁺_i of any i ∈ V equals |S_{i,t}| − 1. Similarly, the indegree d⁻_i of i is the number of actions j ≠ i such that i ∈ S_{j,t} (i.e., such that j →_t i). A notable special case of the above is when the observation system is symmetric over time: j ∈ S_{i,t} if and only if i ∈ S_{j,t} for all i, j and t. In words, playing i at time t reveals the loss of j if and only if playing j at time t reveals the loss of i. A symmetric observation system is equivalent to G_t being an undirected graph or, more precisely, to a directed graph having, for every pair of nodes i, j ∈ V, either no arcs or length-two directed cycles. Thus, from the point of view of the symmetry of the observation system, we also distinguish between the directed case (G_t is a general directed graph) and the symmetric case (G_t is an undirected graph for all t).\n\nThe analysis of our algorithms depends on certain properties of the sequence of graphs G_t. Two graph-theoretic notions playing an important role here are those of independent sets and dominating sets. Given an undirected graph G = (V, E), an independent set of G is any subset T ⊆ V such that no two i, j ∈ T are connected by an edge in E. An independent set is maximal if no proper superset thereof is itself an independent set. The size of a largest (maximal) independent set is the independence number of G, denoted by α(G). If G is directed, we can still associate with it an independence number: we simply view G as undirected by ignoring arc orientation. 
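For small graphs, the independence number of a directed graph can be computed by brute force, ignoring orientations as just described. The following sketch is for illustration only (computing α(G) is NP-hard in general):

```python
from itertools import combinations

def independence_number(n, arcs):
    """Brute-force alpha(G) of a directed graph on nodes 0..n-1:
    ignore arc orientation, then search subsets from largest to
    smallest for one with no edge between any pair of its nodes.
    Exponential in n; meant only for small illustrative graphs."""
    edges = {frozenset(a) for a in arcs if a[0] != a[1]}
    for size in range(n, 0, -1):
        for subset in combinations(range(n), size):
            if all(frozenset(pair) not in edges
                   for pair in combinations(subset, 2)):
                return size
    return 0
```

On the clique this returns 1 and on the empty graph it returns the number of nodes, matching the α ∈ {1, . . . , K} interpolation discussed in the introduction.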
If G = (V, D) is a directed graph, then a subset R ⊆ V is a dominating set for G if for all j ∉ R there exists some i ∈ R such that arc (i, j) ∈ D. In our bandit setting, a time-t dominating set R_t is a subset of actions with the property that the loss of any remaining action in round t can be observed by playing some action in R_t. A dominating set is minimal if no proper subset thereof is itself a dominating set. The domination number of directed graph G, denoted by γ(G), is the size of a smallest (minimal) dominating set of G.\n\n4 Although we defined the problem in terms of losses, our analysis can be applied to the case when actions return rewards g_{i,t} ∈ [0, 1] via the transformation ℓ_{i,t} = 1 − g_{i,t}.\n\nAlgorithm 1: Exp3-SET algorithm (for the uninformed setting)\nParameter: η ∈ [0, 1]\nInitialize: w_{i,1} = 1 for all i ∈ V = {1, . . . , K}\nFor t = 1, 2, . . . :\n1. Observation system {S_{i,t}}_{i∈V} is generated but not disclosed;\n2. Set p_{i,t} = w_{i,t}/W_t for each i ∈ V, where W_t = Σ_{j∈V} w_{j,t};\n3. Play action I_t drawn according to distribution p_t = (p_{1,t}, . . . , p_{K,t});\n4. Observe pairs (i, ℓ_{i,t}) for all i ∈ S_{I_t,t};\n5. Observation system {S_{i,t}}_{i∈V} is disclosed;\n6. For any i ∈ V set w_{i,t+1} = w_{i,t} exp(−η ℓ̂_{i,t}), where ℓ̂_{i,t} = (ℓ_{i,t}/q_{i,t}) I{i ∈ S_{I_t,t}} and q_{i,t} = Σ_{j : j →_t i} p_{j,t}.\n\nComputing a minimum dominating set for an arbitrary directed graph G_t is equivalent to solving a minimum set cover problem on the associated observation system {S_{i,t}}_{i∈V}. 
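This set-cover view translates directly into code. A minimal sketch of the greedy cover heuristic used throughout the paper (the dict-of-sets representation of the observation system is our own choice):

```python
def greedy_dominating_set(obs_sets):
    """Greedy Set Cover on an observation system {S_i}: repeatedly pick
    the action whose observation set covers the most still-uncovered
    actions. Since i is in S_i for every i, the loop always terminates
    with a dominating set of the induced graph."""
    uncovered = set(obs_sets)          # every action must be covered
    dominating = []
    while uncovered:
        best = max(obs_sets, key=lambda i: len(obs_sets[i] & uncovered))
        dominating.append(best)
        uncovered -= obs_sets[best]
    return dominating
```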
Although minimum set cover is NP-hard, the well-known Greedy Set Cover algorithm [9], which repeatedly selects from {S_{i,t}}_{i∈V} the set containing the largest number of uncovered elements so far, computes a dominating set R_t such that |R_t| ≤ γ(G_t) (1 + ln K).\n\nFinally, we can also lift the notion of independence number of an undirected graph to directed graphs through the notion of maximum acyclic subgraphs: Given a directed graph G = (V, D), an acyclic subgraph of G is any graph G′ = (V′, D′) such that V′ ⊆ V, and D′ = D ∩ (V′ × V′), with no (directed) cycles. We denote by mas(G) = |V′| the maximum size of such V′. Note that when G is undirected (more precisely, as above, when G is a directed graph having for every pair of nodes i, j ∈ V either no arcs or length-two cycles), then mas(G) = α(G), otherwise mas(G) ≥ α(G). In particular, when G is itself a directed acyclic graph, then mas(G) = |V|.\n\n3 Algorithms without Explicit Exploration: The Uninformed Setting\n\nIn this section, we show that a simple variant of the Exp3 algorithm [3] obtains optimal regret (to within logarithmic factors) in the symmetric and uninformed setting. We then show that even the harder adversarial directed setting lends itself to an analysis, though with a weaker regret bound.\n\nExp3-SET (Algorithm 1) runs Exp3 without mixing with the uniform distribution. Similar to Exp3, Exp3-SET uses loss estimates ℓ̂_{i,t} that divide each observed loss ℓ_{i,t} by the probability q_{i,t} of observing it. 
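One round of Algorithm 1 can be sketched in Python; the data representation below (weights as a list, observation sets as a dict) is our own, not prescribed by the paper:

```python
import math
import random

def exp3_set_round(w, obs_sets, losses, eta):
    """One round of Exp3-SET: play from the normalized weights, observe
    the loss of every action in the played action's observation set,
    and update with the importance-weighted estimate loss_i / q_i,
    where q_i sums p_j over all j whose observation set contains i."""
    K = len(w)
    total = sum(w)
    p = [wi / total for wi in w]
    played = random.choices(range(K), weights=p)[0]
    for i in obs_sets[played]:
        q_i = sum(p[j] for j in range(K) if i in obs_sets[j])
        w[i] *= math.exp(-eta * losses[i] / q_i)
    return played, p, w
```

In the bandit case (obs_sets[i] == {i}) this reduces to an Exp3-style update without the uniform-mixing term; in the expert case (every obs_sets[i] equal to the whole action set) each q_i equals 1 and the update coincides with Hedge.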
This probability q_{i,t} is simply the sum of all p_{j,t} such that j →_t i (the sum includes p_{i,t}). Next, we bound the regret of Exp3-SET in terms of the key quantity\n\nQ_t = Σ_{i∈V} p_{i,t}/q_{i,t} = Σ_{i∈V} p_{i,t} / Σ_{j : j →_t i} p_{j,t} .  (1)\n\nEach term p_{i,t}/q_{i,t} can be viewed as the probability of drawing i from p_t conditioned on the event that i was observed. Similar to [14], a key aspect to our analysis is the ability to deterministically and nonvacuously5 upper bound Q_t in terms of certain quantities defined on {S_{i,t}}_{i∈V}. We do so in two ways, either irrespective of how small each p_{i,t} may be (this section) or depending on suitable lower bounds on the probabilities p_{i,t} (Section 4). In fact, forcing lower bounds on p_{i,t} is equivalent to adding exploration terms to the algorithm, which can be done only when knowing {S_{i,t}}_{i∈V} before each prediction —an information available only in the informed setting.\n\n5 An obvious upper bound on Q_t is K.\n\nThe following result is the building block for all subsequent results in the uninformed setting.6\n\nTheorem 1 The regret of Exp3-SET satisfies\n\nmax_{k∈V} E[L_{A,T} − L_{k,T}] ≤ (ln K)/η + (η/2) Σ_{t=1}^T E[Q_t] .\n\nAs we said, in the adversarial and symmetric case the observation system at time t can be described by an undirected graph G_t = (V, E_t). This is essentially the problem of [14], which they studied in the easier informed setting, where the same quantity Q_t above arises in the analysis of their ELP algorithm. In their Lemma 3, they show that Q_t ≤ α(G_t), irrespective of the choice of the probabilities p_t. 
When applied to Exp3-SET, this immediately gives the following result.\n\nCorollary 2 In the symmetric setting, the regret of Exp3-SET satisfies\n\nmax_{k∈V} E[L_{A,T} − L_{k,T}] ≤ (ln K)/η + (η/2) Σ_{t=1}^T E[α(G_t)] .\n\nIn particular, if for constants α_1, . . . , α_T we have α(G_t) ≤ α_t, t = 1, . . . , T, then setting η = √((2 ln K)/Σ_{t=1}^T α_t) gives\n\nmax_{k∈V} E[L_{A,T} − L_{k,T}] ≤ √(2(ln K) Σ_{t=1}^T α_t) .\n\nThe bounds proven in Corollary 2 are equivalent to those proven in [14] (Theorem 2 therein) for the ELP algorithm. Yet, our analysis is much simpler and, more importantly, our algorithm is simpler and more efficient than ELP, which requires solving a linear program at each step. Moreover, unlike ELP, Exp3-SET does not require prior knowledge of the observation system {S_{i,t}}_{i∈V} at the beginning of each step.\n\nWe now turn to the directed setting. We start by considering a setting in which the observation system is stochastically generated. Then, we turn to the harder adversarial setting.\n\nThe Erdős-Rényi model is a standard model for random directed graphs G = (V, D), where we are given a density parameter r ∈ [0, 1] and, for any pair i, j ∈ V, arc (i, j) ∈ D with independent probability r.7 We have the following result.\n\nCorollary 3 Let G_t be generated according to the Erdős-Rényi model with parameter r ∈ [0, 1]. Then the regret of Exp3-SET satisfies\n\nmax_{k∈V} E[L_{A,T} − L_{k,T}] ≤ (ln K)/η + (ηT)/(2r) (1 − (1 − r)^K) .\n\nIn the above, the expectations E[·] are w.r.t. both the algorithm's randomization and the random generation of G_t occurring at each round. 
In particular, setting η = √((2r ln K)/(T(1 − (1 − r)^K))) gives\n\nmax_{k∈V} E[L_{A,T} − L_{k,T}] ≤ √(2(ln K) T (1 − (1 − r)^K)/r) .\n\nNote that as r ranges in [0, 1] we interpolate between the bandit (r = 0)8 and the expert (r = 1) regret bounds.\n\nWhen the observation system is generated by an adversary, we have the following result.\n\nCorollary 4 In the directed setting, the regret of Exp3-SET satisfies\n\nmax_{k∈V} E[L_{A,T} − L_{k,T}] ≤ (ln K)/η + (η/2) Σ_{t=1}^T E[mas(G_t)] .\n\n6 All proofs are given in the supplementary material to this paper.\n7 Self-loops, i.e., arcs (i, i), are included by default here.\n8 Observe that lim_{r→0⁺} (1 − (1 − r)^K)/r = K.\n\nIn particular, if for constants m_1, . . . , m_T we have mas(G_t) ≤ m_t, t = 1, . . . , T, then setting η = √((2 ln K)/Σ_{t=1}^T m_t) gives\n\nmax_{k∈V} E[L_{A,T} − L_{k,T}] ≤ √(2(ln K) Σ_{t=1}^T m_t) .\n\nObserve that Corollary 4 is a strict generalization of Corollary 2 because, as we pointed out in Section 2, mas(G_t) ≥ α(G_t), with equality holding when G_t is an undirected graph.\n\nAs far as lower bounds are concerned, in the symmetric setting, the authors of [14] derive a lower bound of Ω(√(α(G)T)) in the case when G_t = G for all t. We remark that, similar to the symmetric setting, we can derive a lower bound of Ω(√(α(G)T)) in the directed setting as well. The simple observation is that given a directed graph G, we can define a new graph G′ which is made undirected just by reciprocating arcs; namely, if there is an arc (i, j) in G we add arcs (i, j) and (j, i) in G′. 
Note that α(G) = α(G′). Since in G′ the learner can only receive more information than in G, any lower bound on G′ also applies to G. Therefore we derive the following corollary to the lower bound of [14] (Theorem 4 therein).\n\nCorollary 5 Fix a directed graph G, and suppose G_t = G for all t. Then there exists a (randomized) adversarial strategy such that for any T = Ω(α(G)³) and for any learning strategy, the expected regret of the learner is Ω(√(α(G)T)).\n\nMoreover, standard results in the theory of Erdős-Rényi graphs, at least in the symmetric case (e.g., [11]), show that, when the density parameter r is constant, the independence number of the resulting graph has an inverse dependence on r. This fact, combined with the abovementioned lower bound of [14], gives a lower bound of the form √(T/r), matching (up to logarithmic factors) the upper bound of Corollary 3.\n\nOne may wonder whether a sharper lower bound argument exists which applies to the general directed adversarial setting and involves the larger quantity mas(G). Unfortunately, this measure does not seem to be related to the optimal regret: Using Claim 1 in the appendix (see proof of Theorem 3) one can exhibit a sequence of graphs each having a large acyclic subgraph, on which the regret of Exp3-SET is still small.\n\nThe lack of a lower bound matching the upper bound provided by Corollary 4 is a good indication that something more sophisticated has to be done in order to upper bound Q_t in (1). This leads us to consider more refined ways of allocating probabilities p_{i,t} to nodes. 
In the next section, we show an allocation strategy that delivers optimal (to within logarithmic factors) regret bounds using prior knowledge of the graphs G_t.\n\n4 Algorithms with Explicit Exploration: The Informed Setting\n\nWe are still in the general scenario where graphs G_t are adversarially generated and directed, but now G_t is made available before prediction. We start by showing a simple example where our analysis of Exp3-SET inherently fails. This is due to the fact that, when the graph induced by the observation system is directed, the key quantity Q_t defined in (1) cannot be nonvacuously upper bounded independent of the choice of probabilities p_{i,t}. A way around it is to introduce a new algorithm, called Exp3-DOM, which controls probabilities p_{i,t} by adding an exploration term to the distribution p_t. This exploration term is supported on a dominating set of the current graph G_t. For this reason, Exp3-DOM requires prior access to a dominating set R_t at each time step t which, in turn, requires prior knowledge of the entire observation system {S_{i,t}}_{i∈V}.\n\nAs announced, the next result shows that, even for simple directed graphs, there exist distributions p_t on the vertices such that Q_t is linear in the number of nodes while the independence number is 1.9 Hence, nontrivial bounds on Q_t can be found only by imposing conditions on distribution p_t.\n\n9 In this specific example, the maximum acyclic subgraph has size K, which confirms the looseness of Corollary 4.\n\nAlgorithm 2: Exp3-DOM algorithm (for the informed setting)\nInput: exploration parameters γ^(b) ∈ (0, 1] for b ∈ {0, 1, . . . , ⌊log₂ K⌋}\nInitialization: w^(b)_{i,1} = 1 for all i ∈ V and b ∈ {0, 1, . . . , ⌊log₂ K⌋}\nFor t = 1, 2, . . . :\n1. Observation system {S_{i,t}}_{i∈V} is generated and disclosed;\n2. 
Compute a dominating set R_t ⊆ V for G_t associated with {S_{i,t}}_{i∈V};\n3. Let b_t be such that |R_t| ∈ [2^{b_t}, 2^{b_t+1} − 1];\n4. Set W^{(b_t)}_t = Σ_{i∈V} w^{(b_t)}_{i,t};\n5. Set p^{(b_t)}_{i,t} = (1 − γ^{(b_t)}) w^{(b_t)}_{i,t}/W^{(b_t)}_t + (γ^{(b_t)}/|R_t|) I{i ∈ R_t};\n6. Play action I_t drawn according to distribution p^{(b_t)}_t = (p^{(b_t)}_{1,t}, . . . , p^{(b_t)}_{K,t});\n7. Observe pairs (i, ℓ_{i,t}) for all i ∈ S_{I_t,t};\n8. For any i ∈ V set w^{(b_t)}_{i,t+1} = w^{(b_t)}_{i,t} exp(−γ^{(b_t)} ℓ̂^{(b_t)}_{i,t}/2^{b_t}), where ℓ̂^{(b_t)}_{i,t} = (ℓ_{i,t}/q^{(b_t)}_{i,t}) I{i ∈ S_{I_t,t}} and q^{(b_t)}_{i,t} = Σ_{j : j →_t i} p^{(b_t)}_{j,t}.\n\nFact 6 Let G = (V, D) be a total order on V = {1, . . . , K}, i.e., such that for all i ∈ V, arc (j, i) ∈ D for all j = i + 1, . . . , K. Let p = (p_1, . . . , p_K) be a distribution on V such that p_i = 2^{−i} for i < K and p_K = 2^{−K+1}. Then\n\nQ = Σ_{i=1}^K p_i/(p_i + Σ_{j : j→i} p_j) = Σ_{i=1}^K p_i/Σ_{j=i}^K p_j = (K + 1)/2 .\n\nWe are now ready to introduce and analyze the new algorithm Exp3-DOM for the informed and directed setting. Exp3-DOM (see Algorithm 2) runs ⌊log₂ K⌋ + 1 variants of Exp3 indexed by b = 0, 1, . . . , ⌊log₂ K⌋. At time t the algorithm is given observation system {S_{i,t}}_{i∈V}, and computes a dominating set R_t of the directed graph G_t induced by {S_{i,t}}_{i∈V}. Based on the size |R_t| of R_t, the algorithm uses instance b_t = ⌊log₂ |R_t|⌋ to pick action I_t. 
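Fact 6 is easy to verify numerically. The following sketch evaluates Q exactly for the total-order graph with the stated distribution (exact rational arithmetic avoids any floating-point doubt):

```python
from fractions import Fraction

def total_order_Q(K):
    """Q = sum_i p_i / (sum_{j >= i} p_j) for the total-order graph of
    Fact 6, with p_i = 2^(-i) for i < K and p_K = 2^(-K+1) (1-indexed).
    Fact 6 states this equals (K + 1) / 2, linear in K, even though
    the independence number of the graph is 1."""
    p = [Fraction(1, 2 ** i) for i in range(1, K)] + [Fraction(1, 2 ** (K - 1))]
    return sum(p[i] / sum(p[i:]) for i in range(K))
```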
We use a superscript b to denote the quantities relevant to the variant of Exp3 indexed by b. Similarly to the analysis of Exp3-SET, the key quantities are\n\nQ^(b)_t = Σ_{i∈V} p^(b)_{i,t}/q^(b)_{i,t} , where q^(b)_{i,t} = Σ_{j : j →_t i} p^(b)_{j,t} = Σ_{j : i∈S_{j,t}} p^(b)_{j,t} , b = 0, 1, . . . , ⌊log₂ K⌋ .\n\nLet T^(b) = {t = 1, . . . , T : |R_t| ∈ [2^b, 2^{b+1} − 1]}. Clearly, the sets T^(b) are a partition of the time steps {1, . . . , T}, so that Σ_b |T^(b)| = T. Since the adversary adaptively chooses the dominating sets R_t, the sets T^(b) are random. This causes a problem in tuning the parameters γ^(b). For this reason, we do not prove a regret bound for Exp3-DOM, where each instance uses a fixed γ^(b), but for a slight variant (described in the proof of Theorem 7 —see the appendix) where each γ^(b) is set through a doubling trick.\n\nTheorem 7 In the directed case, the regret of Exp3-DOM satisfies\n\nmax_{k∈V} E[L_{A,T} − L_{k,T}] ≤ Σ_{b=0}^{⌊log₂ K⌋} ( 2^b ln K/γ^(b) + γ^(b) E[ Σ_{t∈T^(b)} (1 + Q^(b)_t/2^{b+1}) ] ) .  (2)\n\nMoreover, if we use a doubling trick to choose γ^(b) for each b = 0, . . . , ⌊log₂ K⌋, then\n\nmax_{k∈V} E[L_{A,T} − L_{k,T}] = O( √((ln K) E[Σ_{t=1}^T (4|R_t| + Q^{(b_t)}_t)]) + (ln K) ln(KT) ) .  (3)\n\nImportantly, the next result shows how bound (3) of Theorem 7 can be expressed in terms of the sequence α(G_t) of independence numbers of graphs G_t whenever the Greedy Set Cover algorithm [9] (see Section 2) is used to compute the dominating set R_t of the observation system at time t.\n\nCorollary 8 If Step 2 of Exp3-DOM uses the Greedy Set Cover algorithm to compute the dominating sets R_t, then the regret of Exp3-DOM with doubling trick satisfies\n\nmax_{k∈V} E[L_{A,T} − L_{k,T}] = O( ln(K) √(ln(KT) Σ_{t=1}^T α(G_t)) + ln(K) ln(KT) )\n\nwhere, for each t, α(G_t) is the independence number of the graph G_t induced by observation system {S_{i,t}}_{i∈V}.\n\nComparing Corollary 8 to Corollary 5 delivers the announced characterization in the general adversarial and directed setting. Moreover, a quick comparison between Corollary 2 and Corollary 8 reveals that a symmetric observation system overcomes the advantage of working in an informed setting: The bound we obtained for the uninformed symmetric setting (Corollary 2) is sharper by logarithmic factors than the one we derived for the informed —but more general, i.e., directed— setting (Corollary 8).\n\n5 Conclusions and work in progress\n\nWe have investigated online prediction problems in partial information regimes that interpolate between the classical bandit and expert settings. 
We have shown a number of results characterizing prediction performance in terms of: the structure of the observation system, the amount of information available before prediction, and the nature (adversarial or fully random) of the process generating the observation system. Our results are substantial improvements over the paper [14] that initiated this interesting line of research. Our improvements are diverse, and range from considering both informed and uninformed settings to delivering more refined graph-theoretic characterizations, from providing more efficient algorithmic solutions to relying on simpler (and often more general) analytical tools.

Some research directions we are currently pursuing are the following: (1) We are currently investigating the extent to which our results could be applied to the case when the observation system $\{S_{i,t}\}_{i \in V}$ may depend on the loss $\ell_{I_t,t}$ of the player's action $I_t$. Note that this would prevent a direct construction of an unbiased estimator for unobserved losses, which many worst-case bandit algorithms (including ours; see the appendix) hinge upon. (2) The upper bound contained in Corollary 4 and expressed in terms of $\mathrm{mas}(\cdot)$ is almost certainly suboptimal, even in the uninformed setting, and we are trying to see if more adequate graph complexity measures can be used instead. (3) Our lower bound in Corollary 5 heavily relies on the corresponding lower bound in [14] which, in turn, refers to a constant graph sequence. We would like to provide a more complete characterization applying to sequences of adversarially-generated graphs $G_1, G_2, \dots, G_T$ in terms of the sequences of their corresponding independence numbers $\alpha(G_1), \alpha(G_2), \dots, \alpha(G_T)$ (or variants thereof), in both the uninformed and the informed settings.
(4) All our upper bounds rely on parameters to be tuned as a function of sequences of observation system quantities (e.g., the sequence of independence numbers). We are trying to see if an adaptive learning rate strategy à la [4], based on the observable quantities $Q_t$, could give similar results without such prior knowledge.

Acknowledgments

NA was supported in part by an ERC advanced grant, by a USA-Israeli BSF grant, and by the Israeli I-CORE program. NCB acknowledges partial support by MIUR (project ARS TechnoMedia, PRIN 2010-2011, grant no. 2010N5K7EB 003). YM was supported in part by a grant from the Israel Science Foundation, a grant from the United States-Israel Binational Science Foundation (BSF), a grant by the Israel Ministry of Science and Technology, and the Israeli Centers of Research Excellence (I-CORE) program (Center No. 4/11).

References

[1] N. Alon and J. H. Spencer. The probabilistic method. John Wiley & Sons, 2004.

[2] J.-Y. Audibert and S. Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, 2009.

[3] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002.

[4] P. Auer, N. Cesa-Bianchi, and C. Gentile. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64(1):48-75, 2002.

[5] Y. Caro. New results on the independence number. Technical Report, Tel-Aviv University, 1979.

[6] N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire, and M. K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427-485, 1997.

[7] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.

[8] N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404-1422, 2012.

[9] V. Chvátal.
A greedy heuristic for the set-covering problem. Mathematics of Operations Research, 4(3):233-235, 1979.

[10] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, pages 23-37. Springer-Verlag, 1995. Also, Journal of Computer and System Sciences, 55(1):119-139, 1997.

[11] A. M. Frieze. On the independence number of random graphs. Discrete Mathematics, 81:171-175, 1990.

[12] A. Kalai and S. Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71:291-307, 2005.

[13] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212-261, 1994.

[14] S. Mannor and O. Shamir. From bandits to experts: On the value of side-observations. In 25th Annual Conference on Neural Information Processing Systems (NIPS 2011), 2011.

[15] A. Said, E. W. De Luca, and S. Albayrak. How social relationships affect user similarities. In Proceedings of the International Conference on Intelligent User Interfaces Workshop on Social Recommender Systems, Hong Kong, 2010.

[16] V. G. Vovk. Aggregating strategies. In COLT, pages 371-386, 1990.

[17] V. K. Wei. A lower bound on the stability number of a simple graph. Bell Laboratories Technical Memorandum No. 81-11217-9, 1981.