{"title": "Linear Multi-Resource Allocation with Semi-Bandit Feedback", "book": "Advances in Neural Information Processing Systems", "page_first": 964, "page_last": 972, "abstract": "We study an idealised sequential resource allocation problem. In each time step the learner chooses an allocation of several resource types between a number of tasks. Assigning more resources to a task increases the probability that it is completed. The problem is challenging because the alignment of the tasks to the resource types is unknown and the feedback is noisy. Our main contribution is the new setting and an algorithm with nearly-optimal regret analysis. Along the way we draw connections to the problem of minimising regret for stochastic linear bandits with heteroscedastic noise. We also present some new results for stochastic linear bandits on the hypercube that significantly out-performs existing work, especially in the sparse case.", "full_text": "Linear Multi-Resource Allocation with Semi-Bandit\n\nFeedback\n\nTor Lattimore\n\nDepartment of Computing Science\n\nUniversity of Alberta, Canada\n\ntor.lattimore@gmail.com\n\nKoby Crammer\n\nDepartment of Electrical Engineering\n\nThe Technion, Israel\n\nkoby@ee.technion.ac.il\n\nCsaba Szepesv\u00b4ari\n\nDepartment of Computing Science\n\nUniversity of Alberta, Canada\nszepesva@ualberta.ca\n\nAbstract\n\nWe study an idealised sequential resource allocation problem. In each time step\nthe learner chooses an allocation of several resource types between a number of\ntasks. Assigning more resources to a task increases the probability that it is com-\npleted. The problem is challenging because the alignment of the tasks to the re-\nsource types is unknown and the feedback is noisy. Our main contribution is the\nnew setting and an algorithm with nearly-optimal regret analysis. Along the way\nwe draw connections to the problem of minimising regret for stochastic linear\nbandits with heteroscedastic noise. 
We also present some new results for stochastic linear bandits on the hypercube that significantly improve on existing work, especially in the sparse case.\n\n1 Introduction\n\nEconomist Thomas Sowell remarked that “The first lesson of economics is scarcity: There is never enough of anything to fully satisfy all those who want it.”1 The optimal allocation of resources is an enduring problem in economics, operations research and daily life. The problem is challenging not only because you are compelled to make difficult trade-offs, but also because the (expected) outcome of a particular allocation may be unknown and the feedback noisy.\nWe focus on an idealised resource allocation problem where the economist plays a repeated resource allocation game with multiple resource types and multiple tasks to which these resources can be assigned. Specifically, we consider a (nearly) linear model with D resources and K tasks. In each time step t the economist chooses an allocation of resources Mt ∈ R^{D×K} where Mtk ∈ R^D is the kth column and represents the amount of each resource type assigned to the kth task. We assume that the kth task is completed successfully with probability min{1, ⟨Mtk, νk⟩} and νk ∈ R^D is an unknown non-negative vector that determines how the success rate of a given task depends on the quantity and type of resources assigned to it. Naturally we will limit the availability of resources by demanding that Mt satisfies Σ_{k=1}^K Mtdk ≤ 1 for all resource types d. At the end of each time step the economist observes which tasks were successful. The objective is to maximise the number of successful tasks up to some time horizon n that is known in advance. This model is a natural generalisation of the one used by Lattimore et al. [2014], where it was assumed that there was a single resource type only.\n\n1 He went on to add that “The first lesson of politics is to disregard the first lesson of economics.” Sowell [1993]\n\nAn example application might be the problem of allocating computing resources on a server between a number of Virtual Private Servers (VPS). In each time step (some fixed interval) the controller chooses how much memory/cpu/bandwidth to allocate to each VPS. A VPS is said to fail in a given round if it fails to respond to requests in a timely fashion. The requirements of each VPS are unknown in advance, but do not change greatly with time. The controller should learn which VPS benefit the most from which resource types and allocate accordingly.\nThe main contribution of this paper besides the new setting is an algorithm designed for this problem along with theoretical guarantees on its performance in terms of the regret. Along the way we present some additional results for the related problem of minimising regret for stochastic linear bandits on the hypercube. We also prove new concentration results for weighted least squares estimation, which may be independently interesting.\nThe generalisation of the work of Lattimore et al. [2014] to multiple resources turns out to be fairly non-trivial. Those with knowledge of the theory of stochastic linear bandits will recognise some similarity. In particular, once the nonlinearity of the objective is removed, the problem is equivalent to playing K linear bandits in parallel, but where the limited resources constrain the actions of the learner and correspondingly the returns for each task. Stochastic linear bandits have recently been generating a significant body of research (e.g., Auer [2003], Dani et al. [2008], Rusmevichientong and Tsitsiklis [2010], Abbasi-Yadkori et al. 
[2011, 2012], Agrawal and Goyal [2012] and many others). A related problem is that of online combinatorial optimisation. This has an extensive literature, but most results are only applicable for discrete action sets, are in the adversarial setting, and cannot exploit the additional structure of our problem. Nevertheless, we refer the interested reader to (say) the recent work by Kveton et al. [2014] and references therein. Also worth mentioning is that the resource allocation problem at hand is quite different to the “linear semi-bandit” proposed and analysed by Krishnamurthy et al. [2015] where the action set is also finite (the setting is different in many other ways besides).\nGiven its similarity, it is tempting to apply the techniques of linear bandits to our problem. When doing so, two main difficulties arise. The first is that our payoffs are non-linear: the expected reward is a linear function only up to a point, after which it is clipped. In the resource allocation problem this has a natural interpretation, which is that over-allocating resources beyond a certain point is fruitless. Fortunately, one can avoid this difficulty rather easily by ensuring that with high probability resources are never over-allocated. The second problem concerns achieving good regret regardless of the task specifics. In particular, when the number of tasks K is large and resources are at a premium, the allocation problem behaves more like a K-armed bandit where the economist must choose the few tasks that can be completed successfully. For this kind of problem regret should scale in the worst case with √K only [Auer et al., 2002, Bubeck and Cesa-Bianchi, 2012]. The standard linear bandits approach, on the other hand, would lead to a bound on the regret that depends linearly on K. To remedy this situation, we will exploit that if K is large and resources are scarce, then many tasks will necessarily be under-resourced and will fail with high probability. Since the noise model is Bernoulli, the variance of the noise for these tasks is extremely low. By using weighted least-squares estimators we are able to exploit this and thereby obtain an improved regret. An added benefit is that when resources are plentiful, then all tasks will succeed with high probability under the optimal allocation, and in this case the variance is also low. This leads to a poly-logarithmic regret for the resource-laden case where the optimal allocation fully allocates every task.\n\n2 Preliminaries\n\nIf F is some event, then ¬F is its complement (i.e., it is the event that F does not occur). If A is positive definite and x is a vector, then ‖x‖²_A = xᵀAx stands for the weighted 2-norm. We write |x| for the vector of element-wise absolute values of x. We let ν ∈ R^{D×K} be a matrix with columns ν₁, ..., ν_K. All entries in ν are non-negative, but otherwise we make no global assumptions on ν. At each time step t the learner chooses an allocation matrix Mt ∈ M where\n\nM = { M ∈ [0, 1]^{D×K} : Σ_{k=1}^K Mdk ≤ 1 for all d }.\n\nThe assumption that each resource type has a bound of 1 is non-restrictive, since the units of any resource can be changed to accommodate this assumption. We write Mtk ∈ [0, 1]^D for the kth column of Mt. The reward at time step t is ‖Yt‖₁ where Ytk ∈ {0, 1} is sampled from a Bernoulli distribution with parameter ψ(⟨Mtk, νk⟩) = min{1, ⟨Mtk, νk⟩}. The economist observes all Ytk, however, not just the sum. 
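The interaction protocol just described is easy to simulate. The following minimal sketch (function and variable names are ours, not the paper's) draws one round of semi-bandit feedback from the clipped-linear model:

```python
import numpy as np

def play_round(M, nu, rng):
    """One round of the allocation game.

    M  : D x K allocation matrix, column k = resources given to task k,
         with sum_k M[d, k] <= 1 for every resource type d.
    nu : D x K matrix of (unknown) task parameters.
    Task k succeeds with probability psi(<M_k, nu_k>) = min{1, <M_k, nu_k>}.
    """
    assert np.all(M >= 0) and np.all(M.sum(axis=1) <= 1 + 1e-9)  # budgets
    p = np.minimum(1.0, np.einsum("dk,dk->k", M, nu))  # clipped linear means
    Y = rng.binomial(1, p)  # semi-bandit feedback: one bit per task
    return Y                # the round's reward is Y.sum()
```

Note the feedback is one bit per task rather than only the total reward, which is what makes per-task estimation possible.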
The optimal allocation is denoted by M∗ and defined by\n\nM∗ = argmax_{M∈M} Σ_{k=1}^K ψ(⟨Mk, νk⟩).\n\nWe are primarily concerned with designing an allocation algorithm that minimises the expected (pseudo) regret of this problem, which is defined by\n\nRn = n Σ_{k=1}^K ψ(⟨M∗k, νk⟩) − E[ Σ_{t=1}^n Σ_{k=1}^K ψ(⟨Mtk, νk⟩) ],\n\nwhere the expectation is taken over both the actions of the algorithm and the observed reward.\n\nOptimal Allocations\n\nIf ν is known, then the optimal allocation can be computed by constructing an appropriate linear program. Somewhat surprisingly it may also be computed exactly in O(K log K + D log D) time using Algorithm 1 below. The optimal allocation is not so straight-forward as, e.g., simply allocating resources to the incomplete task for which the corresponding ν is largest in some dimension. For example, for K = 2 tasks and d = 2 resource types:\n\nν = (ν₁ ν₂) = ( 0 1/2 ; 1/2 1 ) =⇒ M∗ = (M∗₁ M∗₂) = ( 0 1 ; 1/2 1/2 ).\n\nWe see that even though ν₂₂ is the largest parameter, the optimal allocation assigns only half of the second resource (d = 2) to this task. The right approach is to allocate resources to incomplete tasks using the ratios as prescribed by Algorithm 1. The intuition for allocating in this way is that resources should be allocated as efficiently as possible, and efficiency is determined by the ratio of the expected success due to the allocation of a resource and the amount of resources allocated.\n\nAlgorithm 1\nInput: ν\nM = 0 ∈ R^{D×K} and B = 1 ∈ R^D\nwhile ∃ k, d s.t. ⟨Mk, νk⟩ < 1 and Bd > 0 do\n  A = {k : ⟨Mk, νk⟩ < 1} and B = {d : Bd > 0}\n  k, d = argmax_{(k,d)∈A×B} ( νdk / max_{i∈A, i≠k} νdi )\n  Mdk = min{ Bd, (1 − ⟨Mk, νk⟩)/νdk }\n  Bd = Bd − Mdk\nend while\nreturn M\n\nTheorem 1. Algorithm 1 returns M∗.\nThe proof of Theorem 1 and an implementation of Algorithm 1 may be found in the supplementary material.\nWe are interested primarily in the case when ν is unknown, so Algorithm 1 will not be directly applicable. Nevertheless, the algorithm is useful as a module in the implementation of a subsequent algorithm that estimates ν from data.\n\n3 Optimistic Allocation Algorithm\n\nWe follow the optimism in the face of uncertainty principle. In each time step t, the algorithm constructs an estimator ν̂tk for each νk and a corresponding confidence set Ctk for which νk ∈ Ctk holds with high probability. The algorithm then takes the optimistic action subject to the assumption that νk does indeed lie in Ctk for all k. The main difficulty is the construction of the confidence sets. Like other authors [Dani et al., 2008, Rusmevichientong and Tsitsiklis, 2010, Abbasi-Yadkori et al., 2011] we define our confidence sets to be ellipses, but the use of a weighted least-squares estimator means that our ellipses may be significantly smaller than the sets that would be available by using these previous works in a straightforward way. 
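As a concrete picture of the estimator behind these confidence ellipses, here is a minimal weighted regularised least-squares sketch. This is our own illustration, not the paper's code; in the algorithm the weights play the role of estimated inverse variances:

```python
import numpy as np

def weighted_ridge(xs, ys, ws, alpha):
    """Weighted ridge regression: minimises
    sum_s w_s (y_s - <x_s, nu>)^2 + alpha * ||nu||^2.
    Larger weights w_s (lower-variance observations) shrink the
    confidence ellipse in the directions those x_s span."""
    D = xs.shape[1]
    G = alpha * np.eye(D)            # G = alpha*I + sum_s w_s x_s x_s^T
    b = np.zeros(D)
    for x, y, w in zip(xs, ys, ws):
        G += w * np.outer(x, x)
        b += w * y * x
    return np.linalg.solve(G, b), G  # estimate and ellipse shape matrix
```

With noiseless observations and negligible regularisation the estimate recovers the true parameter exactly, whatever positive weights are used; the weights matter only for how the noise, and hence the ellipse, behaves.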
The algorithm accepts as input the number of tasks and resource types, the horizon and constants α > 0 and β, where β is defined by\n\nN = (4n⁴D²)^D,  δ = 1/(nK),  β = ( 1 + √( αB + 2 log(6nN/δ) ) )²,  where B ≥ max_k ‖νk‖²₂.  (1)\n\nNote that B must be a known bound on max_k ‖νk‖²₂, which might seem like a serious restriction, until one realizes that it is easy to add an initialisation phase where estimates are quickly made while incurring minimal additional regret, as was also done by Lattimore et al. [2014]. The value of α determines the level of regularisation in the least squares estimation and will be tuned later to optimise the regret.\n\nAlgorithm 2 Optimistic Allocation Algorithm\n1: Input K, D, n, α, β\n2: for t ∈ 1, . . . , n do\n3:   // Compute confidence sets for all tasks k:\n4:   Gtk = αI + Σ_{s<t} γsk Msk Mskᵀ\n5:   ν̂tk = Gtk⁻¹ Σ_{s<t} γsk Ysk Msk\n6:   Ctk = { ν̃ : ‖ν̃ − ν̂tk‖²_{Gtk} ≤ β }\n\nLemma 4. Let Gt = αI + Σ_{s=1}^t xs xsᵀ. Then Σ_{t=1}^n min{ 1, ‖xt‖²_{Gt⁻¹} } ≤ 2D log(1 + c·n/D).\n\nCorollary 5. If F does not hold, then\n\nΣ_{t=1}^n γtk min{ 1, ‖Mtk‖²_{Gtk⁻¹} } ≤ 8D log(1 + 4n²).\n\nThe proof is omitted, but follows rather easily by showing that γtk can be moved inside the minimum at a price of increasing the loss at most by a factor of four, and then applying Lemma 4. See the supplementary material for the formal proof.\n\nLemma 6. 
Suppose F does not hold. Then\n\nΣ_{k=1}^K γtk⁻¹ ≤ D max_k ( ‖νk‖∞ + 4√(β/α) ).\n\nProof. We exploit the fact that γtk⁻¹ is an estimate of the variance, which is small whenever ‖Mtk‖₁ is small:\n\nγtk⁻¹ = max_{ν̃k∈C′tk} ⟨Mtk, ν̃k⟩ (1 − ⟨Mtk, ν̃k⟩) ≤ max_{ν̃k∈C′tk} ⟨Mtk, ν̃k⟩ = ⟨Mtk, νk⟩ + max_{ν̃k∈C′tk} ⟨Mtk, ν̃k − νk⟩ (a)≤ ‖Mtk‖₁ ‖νk‖∞ + 4√β ‖Mtk‖_{Gtk⁻¹} (b)≤ ‖Mtk‖₁ ‖νk‖∞ + 4√β ‖Mtk‖_{I/α} (c)≤ ‖Mtk‖₁ ( ‖νk‖∞ + 4√(β/α) ),\n\nwhere (a) follows from Cauchy-Schwarz and the fact that νk ∈ C′tk, (b) since Gtk⁻¹ ≼ I/α and basic linear algebra, (c) since ‖Mtk‖_{I/α} = √(1/α) ‖Mtk‖₂ ≤ √(1/α) ‖Mtk‖₁. The result is completed since the resource constraints imply that Σ_{k=1}^K ‖Mtk‖₁ ≤ D.\n\nProof of Theorem 2. 
By Theorem 3 we have that F holds with probability at most δ = 1/(nK). If F does not hold, then by the definition of the confidence set we have νk ∈ Ctk for all t and k. Therefore\n\nRn = E[ Σ_{t=1}^n Σ_{k=1}^K ( ⟨M∗k, νk⟩ − ψ(⟨Mtk, νk⟩) ) ] ≤ 1 + E[ 1{¬F} Σ_{t=1}^n Σ_{k=1}^K ⟨M∗k − Mtk, νk⟩ ].\n\nNote that we were able to replace ψ(⟨Mtk, νk⟩) = ⟨Mtk, νk⟩, since if F does not hold, then Mtk will never be chosen in such a way that resources are over-allocated. We will now assume that F does not hold and bound the argument in the expectation. By the optimism principle we have:\n\nΣ_{t=1}^n Σ_{k=1}^K ⟨M∗k − Mtk, νk⟩\n(a)≤ Σ_{t=1}^n Σ_{k=1}^K min{ 1, ⟨Mtk, ν̃tk − νk⟩ }\n(b)≤ Σ_{t=1}^n Σ_{k=1}^K min{ 1, ‖Mtk‖_{Gtk⁻¹} ‖ν̃tk − νk‖_{Gtk} }\n(c)≤ 2√β Σ_{t=1}^n Σ_{k=1}^K min{ 1, ‖Mtk‖_{Gtk⁻¹} }\n(d)≤ 2√β √( n Σ_{t=1}^n ( Σ_{k=1}^K min{ 1, ‖Mtk‖_{Gtk⁻¹} } )² )\n(e)≤ 2√β √( n Σ_{t=1}^n ( Σ_{k=1}^K γtk⁻¹ )( Σ_{k=1}^K γtk min{ 1, ‖Mtk‖²_{Gtk⁻¹} } ) )\n(f)≤ 2√β √( nD max_k ( ‖νk‖∞ + 4√(β/α) ) Σ_{t=1}^n Σ_{k=1}^K γtk min{ 1, ‖Mtk‖²_{Gtk⁻¹} } )\n(g)≤ 4D √( 2βnK max_k ( ‖νk‖∞ + 4√(β/α) ) log(1 + 4n²) ),\n\nwhere (a) follows from the assumption that νk ∈ Ctk for all t and k and since Mt is chosen optimistically, (b) by the Cauchy-Schwarz inequality, (c) by the definition of ν̃tk, which lies inside Ctk, (d) by Jensen's inequality, (e) by Cauchy-Schwarz again, (f) follows from Lemma 6. Finally (g) follows from Corollary 5.\n\n5 Regret in Resource-Laden Case\n\nWe now show that if there are enough resources such that the optimal strategy can complete every task with certainty, then the regret of Algorithm 2 is poly-logarithmic (in contrast to O(√n) otherwise). As before we exploit the low variance, but now the variance is small because ⟨Mtk, νk⟩ is close to 1, while in the previous section we argued that this could not happen too often (there is no contradiction as the quantity max_k ‖νk‖ appeared in the previous bound).\n\nTheorem 7. If Σ_{k=1}^K ⟨M∗k, νk⟩ = K, then Rn ≤ 1 + 8βKD log(1 + 4n²).\n\nProof. We start by showing that the weights are large:\n\nγtk⁻¹ = max_{ν∈C′tk} ⟨Mtk, ν⟩ (1 − ⟨Mtk, ν⟩) ≤ max_{ν∈C′tk} (1 − ⟨Mtk, ν⟩) ≤ max_{ν̄,ν∈C′tk} ⟨Mtk, ν̄ − ν⟩ ≤ ‖Mtk‖_{Gtk⁻¹} max_{ν̄,ν∈C′tk} ‖ν̄ − ν‖_{Gtk} ≤ 4√β ‖Mtk‖_{Gtk⁻¹}.\n\nApplying the optimism principle and using the bound above combined with Corollary 5 gives the result:\n\nRn ≤ 1 + E[ 1{¬F} Σ_{t=1}^n Σ_{k=1}^K min{ 1, ⟨Mtk, ν̃tk − νk⟩ } ]\n≤ 1 + 2 E[ 1{¬F} Σ_{t=1}^n Σ_{k=1}^K min{ 1, ‖Mtk‖_{Gtk⁻¹} √β } ]\n= 1 + 2 E[ 1{¬F} Σ_{t=1}^n Σ_{k=1}^K min{ 1, γtk⁻¹ γtk ‖Mtk‖_{Gtk⁻¹} √β } ]\n≤ 1 + 8β E[ 1{¬F} Σ_{t=1}^n Σ_{k=1}^K min{ 1, γtk ‖Mtk‖²_{Gtk⁻¹} } ]\n≤ 1 + 8βKD log(1 + 4n²).\n\n6 Experiments\n\nWe present two experiments to demonstrate the behaviour of Algorithm 2. All code and data are available in the supplementary material. Error bars indicate 95% confidence intervals, but sometimes they are too small to see (the algorithm is quite conservative, so the variance is very low). We used B = 10 for all experiments. The first experiment demonstrates the improvements obtained by using a weighted estimator over an unweighted one, and also serves to give some idea of the rate of learning. For this experiment we used D = K = 2 and n = 10⁶ and\n\nν = (ν₁ ν₂) = ( 8/10 2/10 ; 4/10 2 ) =⇒ M∗ = ( 1 0 ; 1/2 1/2 ) and Σ_{k=1}^K ⟨M∗k, νk⟩ = 2,\n\nwhere the kth column is the parameter/allocation for the kth task. We ran two versions of the algorithm. 
The first, exactly as given in Algorithm 2, and the second identical except that the weights were fixed to γtk = 4 for all t and k (this value is chosen because it corresponds to the minimum inverse variance for a Bernoulli variable). The data was produced by taking the average regret over 8 runs. The results are given in Fig. 1. In Fig. 2 we plot γtk. The results show that γtk is increasing linearly with t. This is congruent with what we might expect, because in this regime the estimation error should drop with O(1/t) and the estimated variance is proportional to the estimation error. Note that the estimation error for the algorithm with γtk = 4 will be O(√(1/t)).\n\nFor the second experiment we show the algorithm adapting to the environment. We fix n = 5 × 10⁵ and D = K = 2. For α ∈ (0, 1) we define\n\nν_α = ( 1/2 α/2 ; 1/2 α/2 ) =⇒ M∗ = ( 1 0 ; 1 0 ) and Σ_{k=1}^K ⟨M∗k, νk⟩ = 1.\n\nThe unusual profile of the regret as α varies can be attributed to two factors. First, if α is small then the algorithm quickly identifies that resources should be allocated first to the first task. However, in the early stages of learning the algorithm is conservative in allocating to the first task to avoid over-allocation. Since the remaining resources are given to the second task, the regret is larger for small α because the gain from allocating to the second task is small. On the other hand, if α is close to 1, then the algorithm suffers the opposite problem. Namely, it cannot identify which task the resources should be assigned to. Of course, if α = 1, then the algorithm must simply learn that all resources can be allocated safely and so the regret is smallest here. 
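The comparator M∗ in these experiments can be computed with the linear program mentioned in Section 2. Below is a minimal sketch of one such formulation (ours, not the paper's code): introduce a slack s_k standing for ψ(⟨Mk, νk⟩), constrained by s_k ≤ 1 and s_k ≤ ⟨Mk, νk⟩, and maximise Σ_k s_k subject to the per-resource budgets.

```python
import numpy as np
from scipy.optimize import linprog

def optimal_allocation(nu):
    """Solve max_M sum_k min{1, <M_k, nu_k>} over the allocation set M."""
    D, K = nu.shape
    n = D * K + K                       # D*K allocation entries + K slacks
    c = np.zeros(n)
    c[D * K:] = -1.0                    # linprog minimises, so use -sum_k s_k
    A_ub, b_ub = [], []
    for k in range(K):                  # s_k - <M_k, nu_k> <= 0
        row = np.zeros(n)
        row[k * D:(k + 1) * D] = -nu[:, k]
        row[D * K + k] = 1.0
        A_ub.append(row)
        b_ub.append(0.0)
    for d in range(D):                  # sum_k M_dk <= 1 for each resource d
        row = np.zeros(n)
        row[d:D * K:D] = 1.0
        A_ub.append(row)
        b_ub.append(1.0)
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(0.0, 1.0)] * n, method="highs")
    M = res.x[:D * K].reshape(K, D).T   # column k = allocation to task k
    return M, -res.fun                  # allocation and its expected reward
```

On the first experiment's ν from Section 6 this attains the stated optimal value of 2, i.e. both tasks are completed with certainty under M∗.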
An important point is that the algorithm never allocates all its resources at the start of the process because this risks over-allocation, so even in “easy” problems the regret will not vanish.\n\nFigure 1: Weighted vs unweighted estimation (regret against t for the two estimators). Figure 2: Weights (γt1 and γt2 against t). Figure 3: “Gap” dependence (regret against α).\n\n7 Conclusions and Summary\n\nWe introduced the stochastic multi-resource allocation problem and developed a new algorithm that enjoys near-optimal worst-case regret. The main drawback of the new algorithm is that its computation time is exponential in the dimension parameters, which makes practical implementations challenging unless both K and D are relatively small. Despite this challenge we were able to implement the algorithm using a relatively brutish approach to solving the optimisation problem, and this was sufficient to present experimental results on synthetic data showing that the algorithm is behaving as the theory predicts, and that the use of the weighted least-squares estimation is leading to a real improvement.\nDespite the computational issues, we think this is a reasonable first step towards a more practical algorithm as well as a solid theoretical understanding of the structure of the problem. As a consolation 
As a consolation\n(and on their own merits) we include some other results:\n\nallocation is impossible.\n\n\u2022 An ef\ufb01cient (both in terms of regret and computation) algorithm for the case where over-\n\u2022 An algorithm for linear bandits on the hypercube that enjoys optimal regret bounds and\n\u2022 Theoretical analysis of weighted least-squares estimators, which may have other applica-\n\nadapts to sparsity.\n\ntions (e.g., linear bandits with heteroscedastic noise).\n\nThere are many directions for future research. The most natural is to improve the practicality of the\nalgorithm. We envisage such an algorithm might be obtained by following the program below:\n\n\u2022 Generalise the Thompson sampling analysis for linear bandits by Agrawal and Goyal\n[2012]. This is a highly non-trivial step, since it is no longer straight-forward to show\nthat such an algorithm is optimistic with high probability. Instead it will be necessary to\nmake do with some kind of local optimism for each task.\n\u2022 The method of estimation depends heavily on the algorithm over-allocating its resources\nonly with extremely low probability, but this signi\ufb01cantly slows learning in the initial\nphases when the con\ufb01dence sets are large and the algorithm is acting conservatively. Ideally\nwe would use a method of estimation that depended on the real structure of the problem,\nbut existing techniques that might lead to theoretical guarantees (e.g., empirical process\ntheory) do not seem promising if small constants are expected.\n\nIt is not hard to think up extensions or modi\ufb01cations to the setting. For example, it would be\ninteresting to look at an adversarial setting (even de\ufb01ning it is not so easy), or move towards a\nnon-parametric model for the likelihood of success given an allocation.\n\n8\n\n\fReferences\nYasin Abbasi-Yadkori, Csaba Szepesv\u00b4ari, and David Tax. Improved algorithms for linear stochastic\n\nbandits. 
In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.\n\nYasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Online-to-confidence-set conversions and application to sparse stochastic bandits. In AISTATS, volume 22, pages 1–9, 2012.\n\nShipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. arXiv preprint arXiv:1209.3352, 2012.\n\nPeter Auer. Using confidence bounds for exploitation-exploration trade-offs. The Journal of Machine Learning Research, 3:397–422, 2003.\n\nPeter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.\n\nKristin P. Bennett and Olvi L. Mangasarian. Bilinear separation of two sets in n-space. Computational Optimization and Applications, 2(3):207–227, 1993.\n\nSébastien Bubeck and Nicolò Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning. Now Publishers Incorporated, 2012. ISBN 9781601986269.\n\nVarsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In COLT, pages 355–366, 2008.\n\nAkshay Krishnamurthy, Alekh Agarwal, and Miroslav Dudik. Efficient contextual semi-bandit learning. arXiv preprint arXiv:1502.05890, 2015.\n\nBranislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvári. Tight regret bounds for stochastic combinatorial semi-bandits. arXiv preprint arXiv:1410.0949, 2014.\n\nTor Lattimore, Koby Crammer, and Csaba Szepesvári. Optimal resource allocation with semi-bandit feedback. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence (UAI), 2014.\n\nMarek Petrik and Shlomo Zilberstein. Robust approximate bilinear programming for value function approximation. 
The Journal of Machine Learning Research, 12:3027–3063, 2011.\n\nPaat Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.\n\nThomas Sowell. Is Reality Optional?: And Other Essays. Hoover Institution Press, 1993.\n", "award": [], "sourceid": 614, "authors": [{"given_name": "Tor", "family_name": "Lattimore", "institution": "University of Alberta"}, {"given_name": "Koby", "family_name": "Crammer", "institution": "Technion"}, {"given_name": "Csaba", "family_name": "Szepesvari", "institution": "University of Alberta"}]}