{"title": "Online Learning with Gaussian Payoffs and Side Observations", "book": "Advances in Neural Information Processing Systems", "page_first": 1360, "page_last": 1368, "abstract": "We consider a sequential learning problem with Gaussian payoffs and side information: after selecting an action $i$, the learner receives information about the payoff of every action $j$ in the form of Gaussian observations whose mean is the same as the mean payoff, but the variance depends on the pair $(i,j)$ (and may be infinite). The setup allows a more refined information transfer from one action to another than previous partial monitoring setups, including the recently introduced graph-structured feedback case. For the first time in the literature, we provide non-asymptotic problem-dependent lower bounds on the regret of any algorithm, which recover existing asymptotic problem-dependent lower bounds and finite-time minimax lower bounds available in the literature. We also provide algorithms that achieve the problem-dependent lower bound (up to some universal constant factor) or the minimax lower bounds (up to logarithmic factors).", "full_text": "Online Learning with Gaussian Payoffs and Side Observations\n\nYifan Wu1\n\nAndr\u00e1s Gy\u00f6rgy2\n\nCsaba Szepesv\u00e1ri1\n\n1Dept. of Computing Science\n\nUniversity of Alberta\n\n{ywu12,szepesva}@ualberta.ca\n\n2Dept. of Electrical and Electronic Engineering\n\nImperial College London\n\na.gyorgy@imperial.ac.uk\n\nAbstract\n\nWe consider a sequential learning problem with Gaussian payoffs and side\nobservations: after selecting an action i, the learner receives information about the\npayoff of every action j in the form of Gaussian observations whose mean is the\nsame as the mean payoff, but the variance depends on the pair (i, j) (and may be\ninfinite). 
The setup allows a more re\ufb01ned information transfer from one action to\nanother than previous partial monitoring setups, including the recently introduced\ngraph-structured feedback case. For the \ufb01rst time in the literature, we provide\nnon-asymptotic problem-dependent lower bounds on the regret of any algorithm,\nwhich recover existing asymptotic problem-dependent lower bounds and \ufb01nite-\ntime minimax lower bounds available in the literature. We also provide algorithms\nthat achieve the problem-dependent lower bound (up to some universal constant\nfactor) or the minimax lower bounds (up to logarithmic factors).\n\n1\n\nIntroduction\n\nOnline learning in stochastic environments is a sequential decision problem where in each time step\na learner chooses an action from a given \ufb01nite set, observes some random feedback and receives\na random payoff. Several feedback models have been considered in the literature: The simplest is\nthe full information case where the learner observes the payoff of all possible actions at the end\nof every round. A popular setup is the case of bandit feedback, where the learner only observes\nits own payoff and receives no information about the payoff of other actions [1]. Recently, several\npapers considered a more re\ufb01ned setup, called graph-structured feedback, that interpolates between\nthe full-information and the bandit case: here the feedback structure is described by a (possibly\ndirected) graph, and choosing an action reveals the payoff of all actions that are connected to the\nselected one, including the chosen action itself. This problem, motivated for example by social\nnetworks, has been studied extensively in both the adversarial [2, 3, 4, 5] and the stochastic cases\n[6, 7]. However, most algorithms presented heavily depend on the self-observability assumption,\nthat is, that the payoff of the selected action can be observed. 
Removing this self-loop assumption\nleads to the so-called partial monitoring case [5]. In the absolutely general partial monitoring setup\nthe learner receives some general feedback that depends on its choice (and the environment), with\nsome arbitrary (but known) dependence [8, 9]. While the partial monitoring setup covers all other\nproblems, its analysis has concentrated on the finite case where both the set of actions and the set\nof feedback signals are finite [8, 9], which is in contrast to the standard full information and bandit\nsettings where the feedback is typically assumed to be real-valued. To our knowledge there are only\na few exceptions to this case: in [5], graph-structured feedback is considered without the self-loop\nassumption, while continuous action spaces are considered in [10] and [11] with special feedback\nstructure (linear and censored observations, resp.).\nIn this paper we consider a generalization of the graph-structured feedback model that can also be\nviewed as a general partial monitoring model with real-valued feedback. We assume that, by selecting\nan action i, the learner can observe a random variable Xij for each action j whose mean is the same\nas the payoff of j, but whose variance $\\sigma_{ij}^2$ depends on the pair (i, j). For simplicity, throughout the paper\nwe assume that all the payoffs and the Xij are Gaussian. In the graph-structured feedback\ncase one either has an observation on an action or not, and the observation always gives the same amount\nof information; our model is more refined: depending on the value of $\\sigma_{ij}^2$, the information can be\nof different quality. 
For example, if $\\sigma_{ij}^2 = \\infty$, trying action i gives no information about action j.\nIn general, for any $\\sigma_{ij}^2 < \\infty$, the value of the information depends on the time horizon T of the\nproblem: when $\\sigma_{ij}^2$ is large relative to T (and the payoff differences of the actions) essentially no\ninformation is received, while a small variance results in useful observations.\nAfter defining the problem formally in Section 2, we provide non-asymptotic problem-dependent\nlower bounds in Section 3, which depend on the distribution of the observations through their mean\npayoffs and variances. To our knowledge, these are the first such bounds presented for any stochastic\npartial monitoring problem beyond the full-information setting: previous work either presented\nasymptotic problem-dependent lower bounds (e.g., [12, 7]), or finite-time minimax bounds (e.g.,\n[9, 3, 5]). Our bounds can recover all previous bounds up to some universal constant factors not\ndepending on the problem.\nIn Section 4, we present two algorithms with finite-time performance\nguarantees for the case of graph-structured feedback without the self-observability assumption.\nWhile due to their complicated forms it is hard to compare our finite-time upper and lower bounds,\nwe show that our first algorithm achieves the asymptotic problem-dependent lower bound up to\nproblem-independent multiplicative factors. Regarding the minimax regret, the hardness ($\\widetilde{\\Theta}(T^{1/2})$\nor $\\widetilde{\\Theta}(T^{2/3})$ regret$^1$) of partial monitoring problems is characterized by their global/local observability\nproperty [9] or, in case of the graph-structured feedback model, by their strong/weak observability\nproperty [5]. In the same section we present another algorithm that achieves the minimax regret\n(up to logarithmic factors) under both strong and weak observability, and achieves an $O(\\log^{3/2} T)$\nproblem-dependent regret. 
Earlier results for the stochastic graph-structured feedback problems\n[6, 7] provided only asymptotic problem-dependent lower bounds and performance bounds that did\nnot match the asymptotic lower bounds or the minimax rate up to constant factors. A related combinatorial\npartial monitoring problem with linear feedback was considered in [10], where the presented\nalgorithm was shown to satisfy both an $\\widetilde{O}(T^{2/3})$ minimax bound and a logarithmic problem-dependent\nbound. However, the dependence on the problem structure in that paper is not optimal, and, in\nparticular, the paper does not achieve the $O(\\sqrt{T})$ minimax bound for easy problems. Finally, we\ndraw conclusions and consider some interesting future directions in Section 5. Proofs can be found\nin the long version of this paper [13].\n\n2 Problem Formulation\nFormally, we consider an online learning problem with Gaussian payoffs and side observations:\nSuppose a learner has to choose from K actions in every round. When choosing an action, the\nlearner receives a random payoff and also some side observations corresponding to other actions.\nMore precisely, each action i \u2208 [K] = {1, . . . , K} is associated with some parameter \u03b8i, and\nthe payoff Yt,i to action i in round t is a normally distributed random variable with mean \u03b8i and\nvariance $\\sigma_{ii}^2$, while the learner observes a K-dimensional Gaussian random vector Xt,i whose jth\ncoordinate is a normal random variable with mean \u03b8j and variance $\\sigma_{ij}^2$ (we assume 0 \u2264 \u03c3ij \u2264 \u221e)\nand the coordinates of Xt,i are independent of each other. 
We assume the following: (i) the random\nvariables (Xt, Yt)t are independent for all t; (ii) the parameter vector \u03b8 is unknown to the learner but\nthe variance matrix $\\Sigma = (\\sigma_{ij}^2)_{i,j \\in [K]}$ is known in advance; (iii) $\\theta \\in [0, D]^K$ for some D > 0; (iv)\n$\\min_{i \\in [K]} \\sigma_{ij} \\le \\sigma < \\infty$ for all j \u2208 [K], that is, the expected payoff of each action can be observed.\nThe goal of the learner is to maximize its payoff or, in other words, minimize the expected regret\n\n$R_T = T \\max_{i \\in [K]} \\theta_i - \\sum_{t=1}^{T} \\mathbb{E}[Y_{t,i_t}]$\n\nwhere $i_t$ is the action selected by the learner in round t. Note that the problem encompasses several\ncommon feedback models considered in online learning (modulo the Gaussian assumption), and\nmakes it possible to examine more delicate observation structures:\n\nFull information: \u03c3ij = \u03c3j < \u221e for all i, j \u2208 [K].\nBandit: \u03c3ii < \u221e and \u03c3ij = \u221e for all i \u2260 j \u2208 [K].\nPartial monitoring with feedback graphs [5]: Each action i \u2208 [K] is associated with an observation\nset Si \u2282 [K] such that \u03c3ij = \u03c3j < \u221e if j \u2208 Si and \u03c3ij = \u221e otherwise.\n\n1Tilde denotes order up to logarithmic factors.\n\nWe refer to the uniform variance version of these problems when all the finite \u03c3ij are equal to some\n\u03c3 \u2265 0. Some interesting features of the problem can be seen when considering the generalized full\ninformation case, when all entries of \u03a3 are finite. 
In this case, the greedy algorithm, which estimates\nthe payoff of each action by the average of the corresponding observed samples and selects the one\nwith the highest average, achieves at most a constant regret for any time horizon T.$^2$ On the other\nhand, the constant can be quite large: in particular, when the variances of some observations are\nlarge relative to the gaps $d_j = \\max_i \\theta_i - \\theta_j$, the situation is rather similar to a partial monitoring\nsetup for a smaller, finite time horizon. In this paper we are going to analyze this problem and\npresent algorithms and lower bounds that are able to \u201cinterpolate\u201d between these cases and capture\nthe characteristics of the different regimes.\n\n2.1 Notation\n\nDefine $C_T^{\\mathbb{N}} = \\{c \\in \\mathbb{N}^K : c_i \\ge 0, \\sum_{i \\in [K]} c_i = T\\}$ and let $N(T) \\in C_T^{\\mathbb{N}}$ denote the number of\nplays over all actions taken by some algorithm in T rounds. Also let $C_T^{\\mathbb{R}} = \\{c \\in \\mathbb{R}^K : c_i \\ge 0, \\sum_{i \\in [K]} c_i = T\\}$.\nWe will consider environments with different expected payoff vectors \u03b8 \u2208 \u0398,\nbut the variance matrix \u03a3 will be fixed. Therefore, an environment can be specified by \u03b8; oftentimes,\nwe will explicitly denote the dependence of different quantities on \u03b8: The probability and expectation\nfunctionals under environment \u03b8 will be denoted by Pr (\u00b7; \u03b8) and E [\u00b7; \u03b8], respectively. Furthermore,\nlet $i_j(\\theta)$ be the jth best action (ties are broken arbitrarily), i.e., $\\theta_{i_1} \\ge \\theta_{i_2} \\ge \\cdots \\ge \\theta_{i_K}$, and define\n$d_i(\\theta) = \\theta_{i_1(\\theta)} - \\theta_i$ for any i \u2208 [K]. Then the expected regret under environment \u03b8 is\n$R_T(\\theta) = \\sum_{i \\in [K]} \\mathbb{E}[N_i(T); \\theta] d_i(\\theta)$. For any action i \u2208 [K], let Si = {j \u2208 [K] : \u03c3ij < \u221e} denote the set of\nactions whose parameter \u03b8j is observable by choosing action i. 
Throughout the paper, log denotes\nthe natural logarithm and \u2206n denotes the n-dimensional simplex for any positive integer n.\n\ni\u2208[K]\n\n3 Lower Bounds\n\nThe aim of this section is to derive generic, problem-dependent lower bounds to the regret, which\nare also able to provide minimax lower bounds. The hardness in deriving such bounds is that for any\n\ufb01xed \u03b8 and \u03a3, the dumb algorithm that always selects i1(\u03b8) achieves zero regret (obviously, the re-\ngret of this algorithm is linear for any \u03b8(cid:48) with i1(\u03b8) (cid:54)= i1(\u03b8(cid:48))), so in general it is not possible to give a\nlower bound for a single instance. When deriving asymptotic lower bounds, this is circumvented by\nonly considering consistent algorithms whose regret is sub-polynomial for any problem [12]. How-\never, this asymptotic notion of consistency is not applicable to \ufb01nite-horizon problems. Therefore,\nfollowing ideas of [14], for any problem we create a family of related problems (by perturbing the\nmean payoffs) such that if the regret of an algorithm is \u201ctoo small\u201d in one of the problems than it\nwill be \u201clarge\u201d in another one, while it still depends on the original problem parameters (note that\nderiving minimax bounds usually only involves perturbing certain special \u201cworst-case\u201d problems).\nAs a warm-up, and to show the reader what form of a lower bound can be expected, \ufb01rst we present\nan asymptotic lower bound for the uniform-variance version of the problem of partial monitoring\nwith feedback graphs. The result presented below is an easy consequence of [12], hence its proof\nis omitted. An algorithm is said to be consistent if sup\u03b8\u2208\u0398 RT (\u03b8) = o(T \u03b3) for every \u03b3 > 0. 
Now\nassume for simplicity that there is a unique optimal action in environment \u03b8, that is, \u03b8i1(\u03b8) > \u03b8i for\nall i (cid:54)= i1 and let\n\n\uf8f1\uf8f2\uf8f3c \u2208 [0,\u221e)K :\n\n(cid:88)\n\ni:j\u2208Si\n\nC\u03b8 =\n\nci \u2265 2\u03c32\nd2\nj (\u03b8)\n\nfor all j (cid:54)= i1(\u03b8) ,\n\nci \u2265 2\u03c32\n\nd2\ni2(\u03b8)(\u03b8)\n\ni:i1(\u03b8)\u2208Si\n\n(cid:88)\n\n\uf8fc\uf8fd\uf8fe .\n\n2To see this, notice that the error of identifying the optimal action decays exponentially with the number of\n\nrounds.\n\n3\n\n\fThen, for any consistent algorithm and for any \u03b8 with \u03b8i1(\u03b8) > \u03b8i2(\u03b8),\n\nlim inf\nT\u2192\u221e\n\nRT (\u03b8)\nlog T\n\n\u2265 inf\nc\u2208C\u03b8\n\n(cid:104)c, d(\u03b8)(cid:105) .\n\n(1)\n\nNote that the right hand side of (1) is 0 for any generalized full information problem (recall that\nthe expected regret is bounded by a constant for such problems), but it is a \ufb01nite positive number\nfor other problems. Similar bounds have been provided in [6, 7] for graph-structured feedback with\nself-observability (under non-Gaussian assumptions on the payoffs).\nIn the following we derive\n\ufb01nite time lower bounds that are also able to replicate this result.\n\n3.1 A General Finite Time Lower Bound\nFirst we derive a general lower bound. 
For any \u03b8, \u03b8(cid:48) \u2208 \u0398 and q \u2208 \u2206|C\n\nN\n\nT |, de\ufb01ne f (\u03b8, q, \u03b8(cid:48)) as\n\n(cid:88)\n\n|\n\na\u2208C\n\nN\nT\n\nq(cid:48)(a)(cid:104)a, d(\u03b8(cid:48))(cid:105)\n\n\u2264 (cid:88)\n\ni\u2208[K]\n\n\uf8eb\uf8edIi(\u03b8, \u03b8(cid:48))\n\n\uf8f6\uf8f8 ,\n\nq(a)ai\n\n(cid:88)\n\na\u2208C\n\nN\nT\n\nq(a) log\n\nq(a)\nq(cid:48)(a)\n\nf (\u03b8, q, \u03b8(cid:48)) = inf\nN\n|C\nq(cid:48)\u2208\u2206\nT\n\nsuch that (cid:88)\nKL(Xt,i(\u03b8); Xt,i(\u03b8(cid:48))) = (cid:80)K\n\na\u2208C\n\nN\nT\n\nwhere Ii(\u03b8, \u03b8(cid:48)) is the KL-divergence between Xt,i(\u03b8) and Xt,i(\u03b8(cid:48)), given by Ii(\u03b8, \u03b8(cid:48)) =\nij. Clearly, f (\u03b8, q, \u03b8(cid:48)) is a lower bound on RT (\u03b8(cid:48))\nfor any algorithm for which the distribution of N (T ) is q. The intuition behind the allowed values\nof q(cid:48) is that we want q(cid:48) to be as similar to q as the environments \u03b8 and \u03b8(cid:48) look like for the algorithm\n(through the feedback (Xt,it)t). Now de\ufb01ne\n\nj=1(\u03b8j \u2212 \u03b8(cid:48)\n\nj)2/2\u03c32\n\ng(\u03b8, c) = inf\n|C\nq\u2208\u2206\n\nN\nT\n\n|\n\nsup\n\u03b8(cid:48)\u2208\u0398\n\nf (\u03b8, q, \u03b8(cid:48)),\n\nsuch that (cid:88)\n\na\u2208C\n\nN\nT\n\nq(a)a = c \u2208 C\n\nR\nT .\n\ng(\u03b8, c) is a lower bound of the worst-case regret of any algorithm with E [N (T ); \u03b8] = c. Finally, for\nany x > 0, de\ufb01ne\n\nb(\u03b8, x) = inf\n\nc\u2208C\u03b8,x\n\n(cid:104)c, d(\u03b8)(cid:105)\n\nwhere C\u03b8,x = {c \u2208 C\n\nT ; g(\u03b8, c) \u2264 x}.\nR\n\nHere C\u03b8,B contains all the possible values of E [N (T ); \u03b8] that can be achieved by some algorithm\nwhose lower bound g on the worst-case regret is smaller than x. These de\ufb01nitions give rise to the\nfollowing theorem:\nTheorem 1. 
Given any B > 0, for any algorithm such that sup\u03b8(cid:48)\u2208\u0398 RT (\u03b8(cid:48)) \u2264 B, we have, for any\nenvironment \u03b8 \u2208 \u0398, RT (\u03b8) \u2265 b(\u03b8, B).\nRemark 2. If B is picked as the minimax value of the problem given the observation structure \u03a3,\nthe theorem states that for any minimax optimal algorithm the expected regret for a certain \u03b8 is lower\nbounded by b(\u03b8, B).\n\n3.2 A Relaxed Lower Bound\n\nNow we introduce a relaxed but more interpretable version of the \ufb01nite-time lower bound of Theo-\nrem 1, which can be shown to match the asymptotic lower bound (1). The idea of deriving the lower\nbound is the following: instead of ensuring that the algorithm performs well in the most adversarial\nenvironment \u03b8(cid:48), we consider a set of \u201cbad\u201d environments and make sure that the algorithm performs\nwell on them, where each \u201cbad\u201d environment \u03b8(cid:48) is the most adversarial one by only perturbing one\ncoordinate \u03b8i of \u03b8.\nHowever, in order to get meaningful \ufb01nite-time lower bounds, we need to perturb \u03b8 more carefully\nthan in the case of asymptotic lower bounds. The reason for this is that for any sub-optimal action\ni, if \u03b8i is very close to \u03b8i1(\u03b8), then E [Ni(T ); \u03b8] is not necessarily small for a good algorithm for\n\u03b8. If it is small, one can increase \u03b8i to obtain an environment \u03b8(cid:48) where i is the best action and the\nalgorithm performs bad; otherwise, when E [Ni(T ); \u03b8] is large, we need to decrease \u03b8i to make the\n\n4\n\n\fi \u2212 \u03b8i1(\u03b8) arbitrarily small as in asymptotic lower-bound arguments, because when \u03b8(cid:48)\n\nis small, large E(cid:2)Ni1(\u03b8); \u03b8(cid:48)(cid:3), and not necessarily large E [Ni(T ); \u03b8(cid:48)], may also lead to low \ufb01nite-time\n\nalgorithm perform badly in \u03b8(cid:48). 
Moreover, when perturbing \u03b8i to be better than \u03b8i1(\u03b8), we cannot\ni \u2212 \u03b8i1(\u03b8)\nmake \u03b8(cid:48)\nregret in \u03b8(cid:48). In the following we make this argument precise to obtain an interpretable lower bound.\n3.2.1 Formulation\n\nR\nT that contains the set of \u201creasonable\u201d values for E [N (T ); \u03b8].\n\nWe start with de\ufb01ning a subset of C\nFor any \u03b8 \u2208 \u0398 and B > 0, let\n\n\uf8f1\uf8f2\uf8f3c \u2208 C\n\nR\nT :\n\nK(cid:88)\n\nj=1\n\ncj\n\u03c32\nji\n\nC(cid:48)\n\u03b8,B =\n\n\u2265 mi(\u03b8, B) for all i \u2208 [K]\n\n\uf8fc\uf8fd\uf8fe\n\nwhere mi, the minimum sample size required to distinguish between \u03b8i and its worst-case perturba-\ntion, is de\ufb01ned as follows: For i (cid:54)= i1, if \u03b8i1 = D,3 then mi(\u03b8, B) = 0. Otherwise let\n\nmi,+(\u03b8, B) =\n\nmax\n\n\u0001\u2208(di(\u03b8),D\u2212\u03b8i]\n\n1\n\n\u00012 log T (\u0001\u2212di(\u03b8))\n\n8B\n\n,\n\nmi,\u2212(\u03b8, B) = max\n\u0001\u2208(0,\u03b8i]\n\n1\n\n\u00012 log T (\u0001+di(\u03b8))\n\n8B\n\n,\n\nand let \u0001i,+ and \u0001i,\u2212 denote the value of \u0001 achieving the maximum in mi,+ and mi,\u2212, respectively.\nThen, de\ufb01ne\n\nmi(\u03b8, B) =\n\nmin{mi,+(\u03b8, B), mi,\u2212(\u03b8, B)}\n\nif di(\u03b8) \u2265 4B/T ;\nif di(\u03b8) < 4B/T .\n\n(cid:26)mi,+(\u03b8, B)\n\nFor i = i1, then mi1(\u03b8, B) = 0 if \u03b8i2(\u03b8) = 0, else the de\ufb01nitions for i (cid:54)= i1 change by replacing\ndi(\u03b8) with di2(\u03b8)(\u03b8) (and switching the + and \u2212 indices):\n\u0001\u2208(di2(\u03b8)(\u03b8),\u03b8i1(\u03b8)]\n\nmi1(\u03b8),\u2212(\u03b8, B) =\n\nT (\u0001\u2212di2(\u03b8)(\u03b8))\n\n1\n\u00012 log\n\nmax\n\n8B\n\n,\n\nmi1(\u03b8),+(\u03b8, B) =\n\nmax\n\n\u0001\u2208(0,D\u2212\u03b8i1(\u03b8)]\n\n1\n\u00012 log\n\nT (\u0001+di2(\u03b8)(\u03b8))\n\n8B\n\nwhere \u0001i1(\u03b8),\u2212 and \u0001i1(\u03b8),+ are the maximizers for \u0001 in the above expressions. 
Then, de\ufb01ne\nif di2(\u03b8)(\u03b8) \u2265 4B/T ;\nif di2(\u03b8)(\u03b8) < 4B/T .\n\nmi1(\u03b8)(\u03b8, B) =\n\nNote that \u0001i,+ and \u0001i,\u2212 can be expressed in closed form using the Lambert W : R \u2192 R function\nsatisfying W (x)eW (x) = x: for any i (cid:54)= i1(\u03b8),\n\u221a\nD \u2212 \u03b8i , 8\n\n\u0001i,+ = min\n\n/T + di(\u03b8)\n\n(cid:41)\n\n\u221a\n\neB\n\n16\n\n,\n\n(cid:26)mi1(\u03b8),\u2212(\u03b8, B)\nmin(cid:8)mi1(\u03b8),+(\u03b8, B), mi1(\u03b8),\u2212(\u03b8, B)(cid:9)\n(cid:19)\n(cid:18) di(\u03b8)T\n(cid:19)\n\n(cid:40)\n(cid:40)\n\neBe\n\n(cid:18)\n\nW\n\n\u221a\n\u03b8i , 8\n\nW\n\neBe\n\n\u221a\n\n\u2212 di(\u03b8)T\neB\n\n16\n\n/T \u2212 di(\u03b8)\n\n(2)\n\n,\n\n\u0001i,\u2212 = min\n\n(cid:41)\n\nand similar results hold for i = i1, as well.\nNow we can give the main result of this section, a simpli\ufb01ed version of Theorem 1:\nCorollary 3. Given B > 0, for any algorithm such that sup\u03bb\u2208\u0398 RT (\u03bb) \u2264 B, we have, for any\nenvironment \u03b8 \u2208 \u0398, RT (\u03b8) \u2265 b(cid:48)(\u03b8, B) = minc\u2208C(cid:48)\n\n(cid:104)c, d(\u03b8)(cid:105).\n\n\u03b8,B\n\nNext we compare this bound to existing lower bounds.\n\n3.2.2 Comparison to the Asymptotic Lower Bound (1)\n\nNow we will show that our \ufb01nite time lower bound in Corollary 3 matches the asymptotic lower\nbound in (1) up to some constants. Pick B = \u03b1T \u03b2 for some \u03b1 > 0 and 0 < \u03b2 < 1. For sim-\nplicity, we only consider \u03b8 which is \u201caway from\u201d the boundary of \u0398 (so that the minima in (2) are\n\n3Recall that \u03b8i \u2208 [0, D].\n\n5\n\n\fdi(\u03b8)\n\n2W (di(\u03b8)T 1\u2212\u03b2 /(16\u03b1\n\nachieved by the second terms) and has a unique optimal action. Then, for i (cid:54)= i1(\u03b8), it is easy\nlog T (\u0001i,+\u2212di(\u03b8))\nto show that \u0001i,+ =\nfor large enough T . 
Then, using the fact that log x \u2212 log log x \u2264 W (x) \u2264 log x for x \u2265 e,\nit follows that limT\u2192\u221e mi(\u03b8, B)/ log T = (1 \u2212 \u03b2)/d2\ni (\u03b8), and similarly we can show that\nlimT\u2192\u221e mi1(\u03b8)(\u03b8, B)/ log T = (1 \u2212 \u03b2)/d2\nC\u03b8, under the as-\nsumptions of (1), as T \u2192 \u221e. This implies that Corollary 3 matches the asymptotic lower bound of\n(1) up to a factor of (1 \u2212 \u03b2)/2.\n\ne)) + di(\u03b8) by (2) and mi(\u03b8, B) = 1\n\u00012\ni,+\n\ni2(\u03b8)(\u03b8). Thus, C(cid:48)\n\n\u03b8,B \u2192 (1\u2212\u03b2) log T\n\n\u221a\n\n8B\n\n2\n\n3.2.3 Comparison to Minimax Bounds\n\nNow we will show that our \u03b8-dependent \ufb01nite-time lower bound reproduces the minimax regret\nbounds of [2] and [5], except for the generalized full information case.\nThe minimax bounds depend on the following notion of observability: An action i is strongly ob-\nservable if either i \u2208 Si or [K] \\ {i} \u2282 {j : i \u2208 Sj}. i is weakly observable if it is not strongly\nobservable but there exists j such that i \u2208 Sj (note that we already assumed the latter condition for\nall i). Let W(\u03a3) be the set of all weakly observable actions. \u03a3 is said to be strongly observable if\nW(\u03a3) = \u2205. \u03a3 is weakly observable if W(\u03a3) (cid:54)= \u2205.\nNext we will de\ufb01ne two key qualities introduced by [2] and [5] that characterize the hardness of a\nproblem instance with feedback structure \u03a3: A set A \u2282 [K] is called an independent set if for any\ni \u2208 A, Si \u2229 A \u2282 {i}. The independence number \u03ba(\u03a3) is de\ufb01ned as the cardinality of the largest\nindependent set. For any pair of subsets A, A(cid:48) \u2282 [K], A is said to be dominating A(cid:48) if for any j \u2208 A(cid:48)\nthere exists i \u2208 A such that j \u2208 Si. 
The weak domination number \u03c1(\u03a3) is de\ufb01ned as the cardinality\nof the smallest set that dominates W(\u03a3).\nCorollary 4. Assume that \u03c3ij = \u221e for some i, j \u2208 [K], that is, we are not in the generalized full\ninformation case. Then,\n\n(i) if \u03a3 is strongly observable, with B = \u03b1\u03c3(cid:112)\u03ba(\u03a3)T for some \u03b1 > 0, we have\n\n\u221a\n\nsup\u03b8\u2208\u0398 b(cid:48)(\u03b8, B) \u2265 \u03c3\n\n\u03ba(\u03a3)T\n\n64e\u03b1\n\nfor T \u2265 64e2\u03b12\u03c32\u03ba(\u03a3)3/D2.\n\n(ii) If \u03a3 is weakly observable, with B = \u03b1(\u03c1(\u03a3)D)1/3(\u03c3T )2/3 log\n\nhave sup\u03b8\u2208\u0398 b(cid:48)(\u03b8, B) \u2265 (\u03c1(\u03a3)D)1/3(\u03c3T )2/3 log\u22122/3 K\n\n.\n\n51200e2\u03b12\n\u221a\nRemark 5. In Corollary 4, picking \u03b1 = 1\n73 for weakly\n8\nobservable \u03a3 gives formal minimax lower bounds: (i) If \u03a3 is strongly observable, for any algorithm\nwe have sup\u03b8\u2208\u0398 RT (\u03b8) \u2265 \u03c3\nfor T \u2265 e\u03c32\u03ba(\u03a3)3/D2. (ii) If \u03a3 is weakly observable, for any\nalgorithm we have sup\u03b8\u2208\u0398 RT (\u03b8) \u2265 (\u03c1(\u03a3)D)1/3(\u03c3T )2/3\n\ne for strongly observable \u03a3 and \u03b1 = 1\n\n\u221a\n\u03ba(\u03a3)T\n8\n\n\u221a\n\n.\n\ne\n\n73 log2/3 K\n\n\u22122/3 K for some \u03b1 > 0, we\n\n4 Algorithms\n\nIn this section we present two algorithms and their \ufb01nite-time analysis for the uniform variance\nversion of our problem (where \u03c3ij is either \u03c3 or \u221e). The upper bound for the \ufb01rst algorithm matches\nthe asymptotic lower bound in (1) up to constants. 
The second algorithm achieves the minimax lower\nbounds of Corollary 4 up to logarithmic factors, as well as O(log3/2 T ) problem-dependent regret.\nIn the problem-dependent upper bounds of both algorithms, we assume that the optimal action is\nunique, that is, di2(\u03b8)(\u03b8) > 0.\n\n4.1 An Asymptotically Optimal Algorithm\n\nni(t) =(cid:80)t\u22121\nempirical estimate of \u03b8i based on the \ufb01rst ni(t) observations. Let Ni(t) =(cid:80)t\u22121\n\n(cid:104)c, d(\u03b8)(cid:105); note that increasing ci1(\u03b8)(\u03b8) does not change the value of\nLet c(\u03b8) = argminc\u2208C\u03b8\n(cid:104)c, d(\u03b8)(cid:105) (since di1(\u03b8)(\u03b8) = 0), so we take the minimum value of ci1(\u03b8)(\u03b8) in this de\ufb01nition. Let\nI{i \u2208 Sis} be the number of observations for action i before round t and \u02c6\u03b8t,i be the\nI{is = i} be the\nnumber of plays for action i before round t. Note that this de\ufb01nition of Ni(t) is different from that\nin the previous sections since it excludes round t.\n\ns=1\n\ns=1\n\n6\n\n\felse\n\nthen\n\n4\u03b1 log t \u2208 C\u02c6\u03b8t\nif N (t)\nPlay it = i1(\u02c6\u03b8t).\nSet ne(t + 1) = ne(t).\n\nAlgorithm 1\n1: Inputs: \u03a3, \u03b1, \u03b2 : N \u2192 [0,\u221e).\n2: For t = 1, ..., K, observe each action i at least\n\nonce by playing it such that t \u2208 Sit.\n3: Set exploration count ne(K + 1) = 0.\n4: for t = K + 1, K + 2, ... do\n5:\n6:\n7:\n8:\n9:\n10:\n11:\n12:\n13:\n14:\nend if\n15:\n16: end for\nTheorem 6. For any \u03b8 \u2208 \u0398, \u0001 > 0, \u03b1 > 2 and any non-decreasing \u03b2(n) that satis\ufb01es 0 \u2264 \u03b2(n) \u2264\nn/2 and \u03b2(m + n) \u2264 \u03b2(m) + \u03b2(n) for m, n \u2208 N,\n\nOur \ufb01rst algorithm is presented in Algo-\nrithm 1. The main idea, coming from\n[15], is that by forcing exploration over\nall actions, the solution c(\u03b8) of the lin-\near program can be well approximated\nwhile paying a constant price. 
This solves\nthe main dif\ufb01culty that, without getting\nenough observations on each action, we\nmay not have good enough estimates for\nd(\u03b8) and c(\u03b8). One advantage of our algo-\nrithm compared to that of [15] is that we\nuse a nondecreasing, sublinear exploration\nschedule \u03b2(n) (\u03b2 : N \u2192 [0,\u221e)) instead\nof a constant rate \u03b2(n) = \u03b2n. This re-\nsolves the problem that, to achieve asymp-\ntotically optimal performance, some pa-\nrameter of the algorithm needs to be cho-\nsen according to dmin(\u03b8) as in [15]. The\nexpected regret of Algorithm 1 is upper\nbounded as follows:\n\nPlay it such that argmini\u2208[K] ni(t) \u2208 Sit.\nPlay it such that Ni(t) < ci(\u02c6\u03b8t)4\u03b1 log t.\n\nend if\nSet ne(t + 1) = ne(t) + 1.\n\nif mini\u2208[K] ni(t) < \u03b2(ne(t))/K then\n\nRT (\u03b8) \u2264(cid:0)2K + 2 + 4K/(\u03b1 \u2212 2)(cid:1)dmax(\u03b8) + 4Kdmax(\u03b8)\n\n(cid:16) \u2212 \u03b2(s)\u00012\n\nT(cid:88)\n\n(cid:17)\n\nelse\n\n(cid:16)\n\n(cid:88)\n\n(cid:17)\nj \u2212 \u03b8j| \u2264 \u0001 for all j \u2208 [K]}.\n\nci(\u03b8, \u0001) + K\n\ni\u2208[K]\n\n2K\u03c32\n\ns=0\n\nexp\n\n(cid:88)\n\ni\u2208[K]\n\n+ 2dmax(\u03b8)\u03b2\n\n4\u03b1 log T\n\n+ 4\u03b1 log T\n\nci(\u03b8, \u0001)di(\u03b8) .\n\nwhere ci(\u03b8, \u0001) = sup{ci(\u03b8(cid:48)) : |\u03b8(cid:48)\nFurther specifying \u03b2(n) and using the continuity of c(\u03b8) around \u03b8, it immediately follows that Al-\ngorithm 1 achieves asymptotically optimal performance:\nCorollary 7. Suppose the conditions of Theorem 6 hold. 
Assume, furthermore, that $\\beta(n)$ satisfies $\\beta(n) = o(n)$ and $\\sum_{s=0}^{\\infty} \\exp\\Big(-\\frac{\\beta(s)\\epsilon^2}{2K\\sigma^2}\\Big) < \\infty$ for any $\\epsilon > 0$. Then for any $\\theta$ such that $c(\\theta)$ is unique,\n\n$$\\limsup_{T\\to\\infty} R_T(\\theta)/\\log T \\le 4\\alpha \\inf_{c\\in C(\\theta)} \\langle c, d(\\theta)\\rangle .$$\n\nNote that any $\\beta(n) = a n^b$ with $a \\in (0, \\frac{1}{2}]$, $b \\in (0, 1)$ satisfies the requirements in Theorem 6 and Corollary 7. Also note that the algorithms presented in [6, 7] do not achieve this asymptotic bound.\n\n4.2 A Minimax Optimal Algorithm\n\nNext we present an algorithm achieving the minimax bounds. For any $A, A' \\subset [K]$, let $c(A, A') = \\arg\\max_{c\\in\\Delta^{|A|}} \\min_{i\\in A'} \\sum_{j: i\\in S_j} c_j$ (ties are broken arbitrarily) and $m(A, A') = \\min_{i\\in A'} \\sum_{j: i\\in S_j} c_j(A, A')$. For any $A \\subset [K]$ and $|A| \\ge 2$, let $A^S = \\{i \\in A : \\exists j \\in A, i \\in S_j\\}$ and $A^W = A - A^S$. Furthermore, let $g_{r,i}(\\delta) = \\sigma \\sqrt{\\frac{2\\log(8K^2r^3/\\delta)}{n_i(r)}}$, where $n_i(r) = \\sum_{s=1}^{r-1} i_{s,i}$, and let $\\hat{\\theta}_{r,i}$ be the empirical estimate of $\\theta_i$ based on the first $n_i(r)$ observations (i.e., the average of the samples).\n\nThe algorithm is presented in Algorithm 2. It follows a successive elimination process: it explores all possibly optimal actions (called \u201cgood actions\u201d later) based on some confidence intervals until only one action remains. While doing exploration, the algorithm first tries to explore the good actions by only using good ones. However, due to weak observability, some good actions might have to be explored by actions that have already been eliminated. 
To manage this exploration-exploitation trade-off, we use a sublinear function $\\gamma$ to control the exploration of weakly observable actions.\n\nIn the following we present high-probability bounds on the performance of the algorithm, so, with a slight abuse of notation, $R_T(\\theta)$ will denote the regret without expectation in the rest of this section.\n\nAlgorithm 2\nInputs: $\\Sigma$, $\\delta$.\nSet $t_1 = 0$, $A_1 = [K]$.\nfor $r = 1, 2, \\ldots$ do\n    Let $\\alpha_r = \\min_{1\\le s\\le r,\\, A_s^W \\ne \\emptyset} m([K], A_s^W)$ and $\\gamma(r) = (\\sigma \\alpha_r t_r / D)^{2/3}$. (Define $\\alpha_r = 1$ if $A_s^W = \\emptyset$ for all $1 \\le s \\le r$.)\n    if $A_r^W \\ne \\emptyset$ and $\\min_{i\\in A_r^W} n_i(r) < \\min_{i\\in A_r^S} n_i(r)$ and $\\min_{i\\in A_r^W} n_i(r) < \\gamma(r)$ then\n        Set $c_r = c([K], A_r^W)$.\n    else\n        Set $c_r = c(A_r, A_r^S)$.\n    end if\n    Play $i_r = \\lceil c_r \\cdot \\|c_r\\|_0 \\rceil$ and set $t_{r+1} \\leftarrow t_r + \\|i_r\\|_1$.\n    $A_{r+1} \\leftarrow \\{i \\in A_r : \\hat{\\theta}_{r+1,i} + g_{r+1,i}(\\delta) \\ge \\max_{j\\in A_r} \\hat{\\theta}_{r+1,j} - g_{r+1,j}(\\delta)\\}$.\n    if $|A_{r+1}| = 1$ then\n        Play the only action in the remaining rounds.\n    end if\nend for\n\nTheorem 8. For any $\\delta \\in (0, 1)$ and any $\\theta \\in \\Theta$,\n\n$$R_T(\\theta) \\le (\\rho(\\Sigma) D)^{1/3} (\\sigma T)^{2/3} \\cdot 7\\sqrt{6 \\log(2KT/\\delta)} + 125\\sigma^2 K^3/D + 13 K^3 D$$\n\nwith probability at least $1 - \\delta$ if $\\Sigma$ is weakly observable, while\n\n$$R_T(\\theta) \\le 2KD + 80\\sigma \\sqrt{\\kappa(\\Sigma) T \\cdot 6 \\log K \\log\\frac{2KT}{\\delta}}$$\n\nwith probability at least $1 - \\delta$ if $\\Sigma$ is strongly observable.\n\nTheorem 9 (Problem-dependent upper bound). 
For any $\\delta \\in (0, 1)$ and any $\\theta \\in \\Theta$ such that the optimal action is unique, with probability at least $1 - \\delta$,\n\n$$R_T(\\theta) \\le \\frac{1603\\, \\rho(\\Sigma) D \\sigma^2}{d_{\\min}^2(\\theta)} (\\log(2KT/\\delta))^{3/2} + 14 K^3 D + 125\\sigma^2 K^3/D + 15 \\big(\\rho(\\Sigma) D \\sigma^2\\big)^{1/3} \\big(125\\sigma^2/D^2 + 10\\big) K^2 (\\log(2KT/\\delta))^{1/2} .$$\n\nRemark 10. Picking $\\delta = 1/T$ gives an $O(\\log^{3/2} T)$ upper bound on the expected regret.\n\nRemark 11. Note that Algorithm 2 is similar to the UCB-LP algorithm of [7], which admits a better problem-dependent upper bound (although it does not achieve it with optimal problem-dependent constants), but it does not achieve the minimax bound even under strong observability.\n\n5 Conclusions and Open Problems\n\nWe considered a novel partial-monitoring setup with Gaussian side observations, which generalizes the recently introduced setting of graph-structured feedback, allowing finer quantification of the observed information from one action to another. We provided non-asymptotic problem-dependent lower bounds that imply existing asymptotic problem-dependent and non-asymptotic minimax lower bounds (up to some constant factors) beyond the full information case. We also provided an algorithm that achieves the asymptotic problem-dependent lower bound (up to some universal constants) and another algorithm that achieves the minimax bounds under both weak and strong observability. However, we think this is just the beginning. For example, we currently have no algorithm that achieves both the problem-dependent and the minimax lower bounds at the same time. Also, our upper bounds only correspond to the graph-structured feedback case. 
It is of great interest to go beyond the weak/strong observability in characterizing the hardness of the problem, and provide algorithms that can adapt to any correspondence between the mean payoffs and the variances (the hardness is that one needs to identify suboptimal actions with good information/cost trade-off).\n\nAcknowledgments This work was supported by the Alberta Innovates Technology Futures through the Alberta Ingenuity Centre for Machine Learning (AICML) and NSERC. During this work, A. Gy\u00f6rgy was with the Department of Computing Science, University of Alberta.\n\nReferences\n\n[1] S\u00e9bastien Bubeck and Nicol\u00f2 Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1\u2013122, 2012.\n\n[2] Shie Mannor and Ohad Shamir. From bandits to experts: on the value of side-observations. In Advances in Neural Information Processing Systems 24 (NIPS), pages 684\u2013692, 2011.\n\n[3] Noga Alon, Nicol\u00f2 Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. From bandits to experts: A tale of domination and independence. In Advances in Neural Information Processing Systems 26 (NIPS), pages 1610\u20131618, 2013.\n\n[4] Tom\u00e1\u0161 Koc\u00e1k, Gergely Neu, Michal Valko, and R\u00e9mi Munos. Efficient learning by implicit exploration in bandit problems with side observations. In Advances in Neural Information Processing Systems 27 (NIPS), pages 613\u2013621, 2014.\n\n[5] Noga Alon, Nicol\u00f2 Cesa-Bianchi, Ofer Dekel, and Tomer Koren. Online learning with feedback graphs: beyond bandits. In Proceedings of The 28th Conference on Learning Theory (COLT), pages 23\u201335, 2015.\n\n[6] St\u00e9phane Caron, Branislav Kveton, Marc Lelarge, and Smriti Bhagat. Leveraging side observations in stochastic bandits. 
In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence (UAI), pages 142\u2013151, 2012.\n\n[7] Swapna Buccapatnam, Atilla Eryilmaz, and Ness B. Shroff. Stochastic bandits with side observations on networks. SIGMETRICS Perform. Eval. Rev., 42(1):289\u2013300, June 2014.\n\n[8] Nicol\u00f2 Cesa-Bianchi and G\u00e1bor Lugosi. Prediction, Learning, and Games. Cambridge University Press, Cambridge, 2006.\n\n[9] G\u00e1bor Bart\u00f3k, Dean P. Foster, D\u00e1vid P\u00e1l, Alexander Rakhlin, and Csaba Szepesv\u00e1ri. Partial monitoring \u2013 classification, regret bounds, and algorithms. Mathematics of Operations Research, 39:967\u2013997, 2014.\n\n[10] Tian Lin, Bruno Abrahao, Robert Kleinberg, John Lui, and Wei Chen. Combinatorial partial monitoring game with linear feedback and its applications. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 901\u2013909, 2014.\n\n[11] Tor Lattimore, Andr\u00e1s Gy\u00f6rgy, and Csaba Szepesv\u00e1ri. On learning the optimal waiting time. In Peter Auer, Alexander Clark, Thomas Zeugmann, and Sandra Zilles, editors, Algorithmic Learning Theory, volume 8776 of Lecture Notes in Computer Science, pages 200\u2013214. Springer International Publishing, 2014.\n\n[12] Todd L. Graves and Tze Leung Lai. Asymptotically efficient adaptive choice of control laws in controlled Markov chains. SIAM Journal on Control and Optimization, 35(3):715\u2013743, 1997.\n\n[13] Yifan Wu, Andr\u00e1s Gy\u00f6rgy, and Csaba Szepesv\u00e1ri. Online learning with Gaussian payoffs and side observations. arXiv preprint arXiv:1510.08108, 2015.\n\n[14] Lihong Li, R\u00e9mi Munos, and Csaba Szepesv\u00e1ri. Toward minimax off-policy value estimation. 
In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 608\u2013616, 2015.\n\n[15] Stefan Magureanu, Richard Combes, and Alexandre Proutiere. Lipschitz bandits: Regret lower bounds and optimal algorithms. In Proceedings of The 27th Conference on Learning Theory (COLT), pages 975\u2013999, 2014.\n\n[16] Emilie Kaufmann, Olivier Capp\u00e9, and Aur\u00e9lien Garivier. On the complexity of best arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 2015. (to appear).\n\n[17] Richard Combes and Alexandre Proutiere. Unimodal bandits: Regret lower bounds and optimal algorithms. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 521\u2013529, 2014.", "award": [], "sourceid": 830, "authors": [{"given_name": "Yifan", "family_name": "Wu", "institution": "University of Alberta"}, {"given_name": "Andr\u00e1s", "family_name": "Gy\u00f6rgy", "institution": "Imperial College London"}, {"given_name": "Csaba", "family_name": "Szepesvari", "institution": "University of Alberta"}]}