{"title": "Oracle-Efficient Algorithms for Online Linear Optimization with Bandit Feedback", "book": "Advances in Neural Information Processing Systems", "page_first": 10590, "page_last": 10599, "abstract": "We propose computationally efficient algorithms for \\textit{online linear optimization with bandit feedback}, in which a player chooses an \\textit{action vector} from a given (possibly infinite) set $\\mathcal{A} \\subseteq \\mathbb{R}^d$, and then suffers a loss that can be expressed as a linear function in action vectors. Although existing algorithms achieve an optimal regret bound of $\\tilde{O}(\\sqrt{T})$ for $T$ rounds (ignoring factors of $\\mathrm{poly} (d, \\log T)$), computationally efficient ways of implementing them have not yet been specified, in particular when $|\\mathcal{A}|$ is not bounded by a polynomial size in $d$. A standard way to pursue computational efficiency is to assume that we have an efficient algorithm referred to as \\textit{oracle} that solves (offline) linear optimization problems over $\\mathcal{A}$. Under this assumption, the computational efficiency of a bandit algorithm can then be measured in terms of \\textit{oracle complexity}, i.e., the number of oracle calls. Our contribution is to propose algorithms that offer optimal regret bounds of $\\tilde{O}(\\sqrt{T})$ as well as low oracle complexity for both \\textit{non-stochastic settings} and \\textit{stochastic settings}. Our algorithm for non-stochastic settings has an oracle complexity of $\\tilde{O}( T )$ and is the first algorithm that achieves both a regret bound of $\\tilde{O}( \\sqrt{T} )$ and an oracle complexity of $\\tilde{O} ( \\mathrm{poly} ( T ) )$, given only linear optimization oracles. 
Our algorithm for stochastic settings calls the oracle only $O( \\mathrm{poly} (d, \\log T))$ times, which is smaller than the current best oracle complexity of $O( T )$ if $T$ is sufficiently large.", "full_text": "Oracle-Efficient Algorithms for Online Linear Optimization with Bandit Feedback∗

Shinji Ito†
NEC Corporation, The University of Tokyo
i-shinji@nec.com

Hanna Sumita
Tokyo Metropolitan University
sumita@tmu.ac.jp

Takuro Fukunaga‡
Chuo University, RIKEN AIP, JST PRESTO
fukunaga.07s@g.chuo-u.ac.jp

Daisuke Hatano
RIKEN AIP
daisuke.hatano@riken.jp

Kei Takemura
NEC Corporation
kei_takemura@nec.com

Naonori Kakimura§
Keio University
kakimura@math.keio.ac.jp

Ken-ichi Kawarabayashi§
National Institute of Informatics
k-keniti@nii.ac.jp

Abstract

We propose computationally efficient algorithms for online linear optimization with bandit feedback, in which a player chooses an action vector from a given (possibly infinite) set A ⊆ R^d, and then suffers a loss that can be expressed as a linear function in action vectors. Although existing algorithms achieve an optimal regret bound of Õ(√T) for T rounds (ignoring factors of poly(d, log T)), computationally efficient ways of implementing them have not yet been specified, in particular when |A| is not bounded by a polynomial size in d. A standard way to pursue computational efficiency is to assume that we have an efficient algorithm, referred to as an oracle, that solves (offline) linear optimization problems over A. Under this assumption, the computational efficiency of a bandit algorithm can then be measured in terms of oracle complexity, i.e., the number of oracle calls. Our contribution is to propose algorithms that offer optimal regret bounds of Õ(√T) as well as low oracle complexity for both non-stochastic settings and stochastic settings. 
Our algorithm for non-stochastic settings has an oracle complexity of Õ(T) and is the first algorithm that achieves both a regret bound of Õ(√T) and an oracle complexity of Õ(poly(T)), given only linear optimization oracles. Our algorithm for stochastic settings calls the oracle only O(poly(d, log T)) times, which is smaller than the current best oracle complexity of O(T) if T is sufficiently large.

∗This work was supported by JST, ERATO, Grant Number JPMJER1201, Japan.
†This work was supported by JST, ACT-I, Grant Number JPMJPR18U5, Japan.
‡This work was supported by JST, PRESTO, Grant Number JPMJPR1759, Japan.
§This work was supported by JSPS, KAKENHI, Grant Number JP18H05291, Japan.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

Online linear optimization with bandit feedback, or bandit linear optimization, is an important problem that has a wide range of applications. In it, a player is given A ⊆ R^d, referred to as a set of action vectors, and T, the number of rounds of decision-making. In each round t ∈ [T] := {1, 2, . . . , T}, the player chooses an action a_t ∈ A, and then observes the loss ℓ_t^⊤ a_t, where ℓ_t ∈ R^d is an unknown loss vector that can change over rounds. Bandit linear optimization includes a variety of important online decision-making problems as special cases. For example, given a graph G = (V, E) and s, t ∈ V, by setting A ⊆ R^|E| to be the set of all characteristic vectors of s-t paths, we can take into account bandit shortest path or adaptive routing [9]. In this setting, ℓ_t ∈ R^|E| corresponds to the (unknown) lengths of the edges, and the bandit feedback ℓ_t^⊤ a_t represents the length of a chosen s-t path a_t. In addition to this application, bandit linear optimization includes bandit versions of combinatorial optimization problems such as minimum spanning tree, minimum cut, and the knapsack problem, as well as continuous optimization problems such as linear programming and semidefinite programming.

The performance of the player is evaluated in terms of the regret R_T(a*), defined as R_T(a*) = Σ_{t=1}^T ℓ_t^⊤ a_t − Σ_{t=1}^T ℓ_t^⊤ a* for a* ∈ A, which represents the difference between the cumulative loss for the decisions {a_t} of the player and that for an arbitrarily fixed strategy a*. The primary goal in bandit linear optimization is to achieve small regret for arbitrary a* ∈ A. Some existing algorithms achieve regret bounds of Õ(√T),[1] as shown in Tables 1 and 2. In contrast, papers [6; 8; 12; 21] showed that any algorithm will suffer at least Ω(√T) regret in the worst case. Thus, algorithms with Õ(√T)-regret bounds achieve optimal performance w.r.t. dependence on T.

Algorithms achieving an optimal Õ(√T)-regret, however, have computational issues, especially when the action set A is exponentially large or is an infinite set. For example, the well-known LinUCB methods [1; 16; 29] need to solve quadratic programming over A, which has time complexity of Ω(|A|) if there are no additional assumptions. The ComBand algorithm [11] runs efficiently if there is an efficient sampling algorithm for A (such as for k-sets, spanning trees, or bipartite perfect matchings), but such sampling algorithms are open for many important examples, including s-t paths. For the special case in which the convex hull of A can be expressed by c linear inequalities, CombExp [13] runs in O(poly(c, d)T)-time. 
However, c (the size of the linear inequality expression) can be exponentially large for many examples.

In this study, we aim to develop computationally efficient algorithms that achieve an Õ(√T) regret bound, under the assumption that we can call a linear optimization oracle. The oracle solves offline linear optimization problems over A, i.e., given a loss vector ℓ ∈ R^d, the oracle outputs a* ∈ arg min_{a∈A} ℓ^⊤ a. This assumption is standard in the context of online optimization [15; 23]. Under it, the computational efficiency of online optimization algorithms is evaluated in terms of oracle complexity: the number of calls to the linear optimization oracle.

For online linear optimization with full information, in which a player can observe all entries of ℓ_t ∈ R^d after choosing a_t, Kalai and Vempala [23] have proposed algorithms with an Õ(√T)-regret bound and an oracle complexity of O(T). Using this algorithm, McMahan and Blum [26] and Dani and Hayes [15] showed that one can achieve Õ(T^{2/3})-regret and O(T^{1/2})-oracle complexity for bandit linear optimization. However, it has been an open question as to whether or not we can achieve Õ(√T)-regret and Õ(poly(T))-oracle complexity for bandit linear optimization with only linear optimization oracles. In this study, we solve this open problem by proposing an algorithm that achieves Õ(√T)-regret as well as Õ(T)-oracle complexity.

Here, we separately consider two different settings for bandit linear optimization: a non-stochastic setting and a stochastic setting. In the non-stochastic setting, we do not assume any generative models, but ℓ_t may be chosen in an adversarial manner, depending on the previous actions a_1, . . . , a_{t−1}. The performance of an algorithm is measured in terms of the expectation of the regret R_T(a*) w.r.t. the algorithm's internal randomness and ℓ_t. In the stochastic setting, by way of contrast, the loss vectors ℓ_t are assumed to follow a probability distribution D over R^d, i.i.d. for t = 1, . . . , T.

[1] In Õ(·) notation, we ignore factors of polynomials in d and log(T).

1.1 Our Contribution

In this paper, we present computationally efficient algorithms that achieve O(poly(d)√T)-regret. Specifically, we present algorithms with a small oracle complexity, i.e., algorithms that call the oracle as infrequently as possible. Our contribution is summarized in Tables 1 and 2.

For the non-stochastic setting, we propose an algorithm (Algorithm 1) that achieves O(√(d³ T log T))-regret in expectation and has O(poly(d, log T) T)-oracle complexity.

Theorem 1. For the non-stochastic setting, Algorithm 1 satisfies the following conditions:
• The output of the algorithm satisfies E[R_T(a*)] = O(√(d³ T log T)) for all a* ∈ A.
• The algorithm calls the linear optimization oracle O(poly(d, log T) T) times.
• The computational time, except for that of the oracle, is O(poly(d, T)).

As shown in Table 1, our Algorithm 1 achieves the smallest oracle complexity among algorithms with Õ(√T)-regret. Noting that GeometricHedge assumes A to be a convex body, we can see that Algorithm 1 is the first algorithm that is applicable to discrete A and that achieves Õ(√T)-regret and Õ(poly(T))-oracle complexity.

Although the first algorithm in Table 1, with Õ(T^{2/3})-regret and O(T^{2/3})-oracle complexity, might look incomparable to our results, algorithms with such bounds can be easily constructed from our Algorithm 1. In fact, by dividing the T rounds into T/B blocks of size B > 1 and regarding each block as an individual round, we obtain the following statement:

Proposition 1. 
Suppose there exists an algorithm with O(f(T))-regret and O(g(T))-oracle complexity. Then, for an arbitrary positive integer B, there exists an algorithm with O(B · f(T/B))-regret and O(g(T/B))-oracle complexity.

By setting the block size to be B = Θ(T^{1/3}) and applying Algorithm 1, we can achieve O(B √(T/B)) = Õ(T^{2/3})-regret and O(T/B) = Õ(T^{2/3})-oracle complexity, which is equivalent to the uppermost result in Table 1. Note that Proposition 1 does not, conversely, lead to an Õ(√T)-regret algorithm given an Õ(T^{2/3})-regret algorithm, since the block size B must be at least 1.

For the stochastic setting, we propose an algorithm (Algorithm 2) that achieves O(√(d³ T log(d log T / δ)))-regret with probability 1 − δ and has O(poly(d, log T))-oracle complexity, where δ ∈ (0, 1) is an arbitrary parameter.

Theorem 2. Suppose ℓ_t follows a distribution over R^d, i.i.d. for t = 1, 2, . . . , T. Algorithm 2 then satisfies the following conditions:
• The output of the algorithm satisfies R_T(a*) = O(√(d³ T log(d log T / δ))) for all a* ∈ A, with probability 1 − δ.
• The algorithm calls the linear optimization oracle O(poly(d, log T)) times.
• The computational time, except for that of the oracle, is O(poly(d, T)).

A complete description of Algorithm 2 and a proof of this theorem are given in Appendix B. As shown in Table 2, all existing algorithms that achieve Õ(√T)-regret require at least Ω(T) oracle complexity, and our Algorithm 2 is the first with an Õ(√T)-regret bound and an oracle complexity sublinear in T.

In both Algorithms 1 and 2, we use the well-known techniques [30] of reduction among linear optimization, separation, and decomposition over a given convex body. 
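The blocking reduction behind Proposition 1 is mechanical enough to sketch in code. The following is a minimal illustration, assuming a base bandit algorithm exposed through hypothetical `choose()`/`update()` methods (these names are ours, not an interface from the paper): each block of B real rounds is treated as one meta-round, so the base algorithm is queried and updated only T/B times.

```python
class BlockedBandit:
    """Blocking reduction of Proposition 1 (illustrative sketch): run a base
    bandit algorithm on T/B meta-rounds, each spanning B consecutive real
    rounds. The choose()/update() interface of `base` is a hypothetical
    assumption made for this sketch."""

    def __init__(self, base, block_size):
        self.base = base
        self.B = block_size
        self._action = None
        self._acc = 0.0   # loss accumulated within the current block
        self._count = 0   # real rounds elapsed in the current block

    def choose(self):
        if self._count == 0:              # new block: query the base once
            self._action = self.base.choose()
        return self._action               # repeat the action for B rounds

    def update(self, loss):
        self._acc += loss
        self._count += 1
        if self._count == self.B:         # block finished: one base update
            self.base.update(self._acc)   # feed back the block's total loss
            self._acc, self._count = 0.0, 0


class CountingBase:
    """Toy base algorithm used only to illustrate the call pattern."""
    def __init__(self):
        self.updates, self.t = 0, 0
    def choose(self):
        self.t += 1
        return self.t
    def update(self, loss):
        self.updates += 1
```

With T = 12 and B = 3, the base algorithm is queried and updated exactly 4 = T/B times, which is where the O(g(T/B)) oracle complexity comes from, while each base decision is held for B rounds, inflating the regret to O(B · f(T/B)).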
Definitions of these three problems are given in Section 4. The reduction algorithms enable us to solve separation and decomposition problems by calling the linear optimization oracle O(poly(d)) times. Using these reduction techniques, Algorithms 1 and 2 maintain, respectively, supersets and subsets of the convex hull of A (=: Conv(A)).

[2] In this algorithm, A is assumed to be a convex body, and a membership oracle for A is assumed. Because we can construct a membership oracle from a linear optimization oracle and vice versa by a polynomial-time reduction [30], the assumption regarding the oracle is equivalent to ours, modulo polynomial-time reduction.

Table 1: Regret Bound and Oracle Complexity of Non-Stochastic Bandit Linear Optimization

  Algorithm                                        | Regret Bound | Oracle Complexity
  MV algorithm [15; 26] with FPL [23]              | Õ(T^{2/3})   | O(T^{2/3})
  ComBand [11], GeometricHedge [17], Exp2 [6]      | Õ(T^{1/2})   | −
  GeometricHedge with Volumetric Spanners[2] [19]  | Õ(T^{1/2})   | Õ(T^7)
  Algorithm 1 [This paper]                         | Õ(T^{1/2})   | Õ(T)

Table 2: Regret Bound and Oracle Complexity for Stochastic Bandit Linear Optimization

  Algorithm                                        | Regret Bound | Oracle Complexity
  LinRel [7], LinUCB with ℓ2-ball [1; 16; 29]      | Õ(T^{1/2})   | −
  LinUCB with ℓ1-ball [16]                         | Õ(T^{1/2})   | O(T)
  Linear Thompson sampling [2; 4]                  | Õ(T^{1/2})   | O(T)
  Algorithm 2 [This paper]                         | Õ(T^{1/2})   | O(poly(d, log T))

To construct Algorithm 1 for the non-stochastic setting, we extend a cutting-plane approach to our bandit-feedback setting. The cutting-plane approach, a way of reducing oracle complexity, has previously been applied only to full-information settings [20], not to a bandit-feedback setting. 
A major difference between the bandit-feedback and full-information settings is that the former needs exploration, i.e., chosen actions should be randomized with sufficiently large variance, whereas the latter does not need it and can choose actions deterministically. In full-information settings, hence, it suffices to focus on a deterministically chosen action alone. In the bandit-feedback setting, by contrast, the difficulty lies in constructing a distribution of actions with sufficiently large variance for which cutting planes can be computed efficiently and the number of cutting planes can be bounded.

To this end, we design the relevant probability distributions so that the cutting-plane approach works, which successfully reduces oracle complexity. Specifically, the cutting-plane approach maintains convex bodies K_t that include and approximate Conv(A), from which we choose candidates for actions; these candidates serve as the support of the probability distribution of the action to be chosen. Only when some candidates are invalid, i.e., when some lie outside of Conv(A), is K_t updated with a cutting plane excluding such an invalid candidate. To bound the number of oracle calls, we design the candidate actions to satisfy two conditions: the set of candidates has bounded cardinality, and each candidate is sufficiently close to the weighted center of K_t. Thanks to the first condition, we can efficiently decide whether invalid candidates exist. The second condition is essential for bounding the number of oracle calls in each update of K_t.

Our Algorithm 2 for the stochastic setting is based on the framework of phased elimination of actions, in which the T rounds are divided into phases, i.e., segments of consecutive rounds, and, in each phase, actions are eliminated so that only promising ones are left. 
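For intuition, the phased-elimination framework just described can be sketched as follows. This is a deliberately simplified variant with finitely many actions and mean-based elimination (the paper's Algorithm 2 instead maintains a convex set of promising actions and exploits the linear structure); the phase lengths and confidence width below are illustrative assumptions, not the paper's choices.

```python
import math

def phased_elimination(actions, sample_loss, T, delta=0.05):
    """Simplified phased elimination sketch: finitely many actions,
    elimination by empirical means.

    `sample_loss(a)` returns a stochastic loss in [0, 1] for action `a`.
    In each phase, every surviving action is played n_k times (all choices
    are fixed at the start of the phase, mirroring the paper's observation
    that per-phase decisions are made up front), after which actions that
    are provably worse than the best survivor are removed.
    """
    active = list(actions)
    k, t = 1, 0
    history = []
    while t < T and len(active) > 1:
        n_k = 2 ** k  # doubling phase length (an illustrative choice)
        width = math.sqrt(math.log(len(actions) * T / delta) / (2 * n_k))
        means = {}
        for a in active:
            s = 0.0
            for _ in range(n_k):
                if t >= T:
                    break
                s += sample_loss(a)
                t += 1
            means[a] = s / n_k
        best = min(means.values())
        # Keep an action only if its confidence interval overlaps the best.
        active = [a for a in active if means[a] - width <= best + width]
        history.append(list(active))
        k += 1
    return active, history
```

Because all actions of a phase are determined at the phase's start, the number of "decision points" scales with the number of phases rather than with T, which is the mechanism the paper uses to obtain O(poly(d, log T)) oracle complexity.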
Existing works employing this framework [7; 24; 31] are computationally inefficient, mainly for the following two reasons: (i) we need to maintain a set of promising actions that may be an exponentially large combinatorial set, and (ii) when choosing actions, we need to solve hard optimization problems, e.g., G-optimal design [24] or quadratic programming [7].

Our idea for resolving the first computational issue is to maintain the set of promising actions as a convex set instead of a subset of actions. The convex set here can be represented with only O(poly(d) log T) linear inequalities, which implies that operations over it can be conducted efficiently. We resolve the second computational issue by combining barycentric spanners [9] and the decomposition technique over convex bodies [30], both of which can be computed efficiently with O(poly(d)) oracle calls. We show that, thanks to these techniques, we can estimate the loss vector with enough accuracy to achieve an Õ(√T)-regret bound. The oracle complexity is bounded as follows: in our algorithm, all actions a_t chosen in each phase are determined at the beginning of the phase, which means that the oracle complexity depends not on the number of rounds but on the number of phases. The number of phases is O(poly(d) log T), and the number of oracle calls in each phase is O(poly(d, log T)), which results in an overall O(poly(d, log T))-oracle complexity.

2 Related Work

For the full-information setting, in which a player can observe ℓ_t ∈ R^d rather than ℓ_t^⊤ a_t, Follow the Perturbed Leader (FPL) by Kalai and Vempala [23] achieves O(√T)-regret and O(T)-oracle complexity. This algorithm is used as a subroutine in the MV algorithm [15; 26] (see Table 1).

For a more general problem, referred to as online improper learning, in which only an approximate linear optimization oracle is given, Kakade et al. 
[22] have proposed the first efficient algorithms that achieve approximate regret of O(√T) in the full-feedback setting and O(T^{2/3}) in the bandit-feedback setting. Recent papers by Garber [18] and Hazan et al. [20] have improved the oracle complexity. Algorithms in [20] achieve oracle complexity of Õ(T) in the full-feedback setting and Õ(T^{2/3}) in the bandit-feedback setting, with the same regret bounds as in Kakade et al. [22]. For online improper learning with bandit feedback, however, constructing an efficient algorithm achieving Õ(√T) poses difficulties that have yet to be overcome.

In addition to the studies listed in Tables 1 and 2, there exist efficient algorithms for bandit linear optimization that work under different assumptions. Abernethy et al. [3] proposed a computationally efficient algorithm achieving O(√T)-regret under the assumption that A is a convex body and that a self-concordant barrier [27] for A is given. However, constructing self-concordant barriers is not always possible with a linear optimization oracle alone; hence, this algorithm does not always work under our assumptions of a linear optimization oracle and Assumption 1 given in the next section.

3 Problem Setting

The bandit linear optimization problem is a repeated game described as follows: before the game starts, a player is given the number T of rounds and the dimensionality d of the action set A ⊆ R^d. In each round t = 1, 2, . . . , T, the player chooses a_t ∈ A while an environment chooses a loss vector ℓ_t ∈ R^d, and then the player observes the loss ℓ_t^⊤ a_t. The goal of the player is to achieve a small regret R_T(a), which is defined for an arbitrary a ∈ A as R_T(a) := Σ_{t=1}^T ℓ_t^⊤ a_t − Σ_{t=1}^T ℓ_t^⊤ a.

We assume the action set A to be a compact set. Suppose that we have an algorithm for linear optimization over A for any vector w ∈ R^d, which we call a linear optimization oracle O_A : R^d → A; it receives an input w ∈ R^d and returns a point O_A(w) ∈ A satisfying w^⊤ O_A(w) = min_{a∈A} w^⊤ a.

Assumption 1. We assume that there exist positive real numbers L and R such that (a) ‖ℓ_t‖_2 ≤ L for all t ∈ [T], and (b) ‖a‖_2 ≤ R for all a ∈ A. In addition, we assume that (c) K := Conv(A) has a positive volume, i.e., Vol(K) := ∫_K 1 dx > 0.

The first two assumptions (a) and (b) are standard in bandit linear optimization. If we are given a linear optimization oracle over A, we can assume (c) without loss of generality. In fact, if A is included in a subspace with a smaller dimension than d, we can detect this by calling the linear optimization oracle a polynomial number of times (see, e.g., Corollary 14.1g in [30]), and we can make K full-dimensional by ignoring redundant dimensions.

4 Preliminaries

4.1 Linear Optimization, Separation, and Decomposition

We define the linear optimization problem (LP), separation problem (SP), and decomposition problem (DP) for a compact convex body P ⊆ R^d as follows:

Problem 1 (linear optimization problem, LP). Given a vector w ∈ R^d, find a vector x* ∈ P such that w^⊤ x* = min_{x∈P} w^⊤ x.

Problem 2 (separation problem, SP). Given a vector y ∈ R^d, decide whether y belongs to P or not, and, in the latter case, find a vector w ∈ R^d such that w^⊤ y < min_{x∈P} w^⊤ x.

Problem 3 (decomposition problem, DP). Given a vector x ∈ P, find vertices x_0, . . . , x_d of P and λ_0, . . . , λ_d ≥ 0 such that x = λ_0 x_0 + ··· + λ_d x_d.

Ellipsoid methods provide reductions among these problems, which imply that

  LP: solvable ⟺ SP: solvable ⟹ DP: solvable.

Theorem 3 (Corollaries 14.1a, 14.1b, and 14.1g in [30]). Suppose that P ⊆ R^d is a polytope of which each vertex can be expressed by rationals with bit-lengths of at most φ, and that each entry of x, y, w ∈ Q^d is also a rational with bit-length at most φ. Then, the following holds:

(a) If there is an algorithm SEP that solves the separation problem, we can solve the linear optimization problem for w ∈ Q^d by calling SEP at most poly(d, φ) times.

(b) If there is an algorithm OPT that solves the linear optimization problem, we can solve the separation problem for y ∈ Q^d by calling OPT at most poly(d, φ) times.

(c) If there is an algorithm OPT that solves the linear optimization problem, we can solve the decomposition problem for x ∈ P by calling OPT at most poly(d, φ) times.

Remark 1. For any ε > 0 and any real number x ∈ [−1, 1], we can approximate x by a rational x̂ ∈ Q with a bit-length of at most O(log(1/ε)) so that |x − x̂| ≤ ε. Hence, we can assume that φ in Theorem 3 is bounded as φ = O(log T) by ignoring O(1/T) errors. This implies that the above reductions can be computed in O(poly(d, log T)) time.

4.2 Algorithms for Logconcave Distributions

If a probability distribution over a convex body P ⊆ R^d has a probability density function (PDF) p : P → R_{>0} such that log p is a concave function, we refer to it as a logconcave distribution. The following theorem means that, given a value oracle for a convex function f : P → R, we can approximately sample a vector in P from the logconcave distribution p(x) ∝ exp(−f(x)).

Theorem 4 (Theorems 2.1 and 2.2 in [25], Lemma 10 in [19]). 
Let P ⊆ R^d be a convex body with non-zero Lebesgue measure, let f : P → R be a convex function, and let p be the logconcave distribution proportional to exp(−f(x)). Suppose ε > 0 and δ ∈ (0, 1). Then, given access to a membership oracle for P and a value oracle for f, there is an algorithm that samples approximately from p such that (i) the total variation distance between the produced distribution and p is at most ε, and (ii) after preprocessing in time O(d^5 (log d)^{O(1)}), each sample can be produced in time O(d^4/ε^4 · (log(d/ε))^{O(1)}).

As an implication of this theorem, we can efficiently approximate the mean µ(p) ∈ R^d and the covariance matrix Cov(p) ∈ R^{d×d} of the distribution p. In fact, from Corollary 5.52 in [32] and standard concentration of logconcave distributions (see, e.g., Lemma 5.17 in [25]), it takes (n log(1/δ)/ε)^{O(1)} samples to get a matrix Σ̂ such that (1 − ε)Cov(p) ⪯ Σ̂ ⪯ (1 + ε)Cov(p) with probability at least 1 − δ.[3] Similarly, we can get µ̂ ∈ R^d such that ‖µ̂ − µ(p)‖_{Cov(p)^{−1}} ≤ ε from (n log(1/δ)/ε)^{O(1)} samples.[4] Accordingly, we obtain the following corollary:

Corollary 1. Suppose the same assumptions as in Theorem 4. There is an algorithm that outputs a vector µ̂ ∈ R^d and a symmetric matrix Σ̂ ∈ R^{d×d} satisfying (1/2)Cov(p) ⪯ Σ̂ ⪯ 2Cov(p) and ‖µ̂ − µ(p)‖_{Cov(p)^{−1}} ≤ ε with probability at least 1 − δ. The computational time, the number of calls to the membership oracle of P, and the number of calls to the value oracle of f are bounded by poly(d, 1/ε, log(1/δ)).

[3] A similar argument can be found in Section 6.3 in [10].
[4] For a vector x ∈ R^d and a positive semidefinite matrix A ∈ R^{d×d}, denote ‖x‖_A := √(x^⊤ A x).

5 Algorithm for Non-stochastic Bandit Linear Optimization

Our algorithm uses the framework of continuous multiplicative weight updates (CMWU) [5; 14; 33]. A straightforward way of applying CMWU is to maintain probability distributions over K := Conv(A), which, however, requires a large number of oracle calls. In fact, the algorithm by Hazan and Karnin [19] for bandit linear optimization over convex bodies calls an oracle Õ(T^7) times. This inefficiency is due to the need to sample from K; the sampling algorithm in Theorem 4 requires O(d^4/ε^4)-oracle complexity.

We reduce oracle complexity by means of a cutting-plane approach [20]. In this approach, we maintain convex bodies K_t^{(j)} that include and approximate K, and we update a distribution over K_t^{(j)} instead of K. The advantage of this approach is that we can sample from K_t^{(j)} without calling an oracle. On the other hand, updating K_t^{(j)} requires oracle calls; therefore, we need to bound the number of updates as well as the number of oracle calls in each update. We design a strategy achieving these as follows: we set candidate actions E_t^{(j)} ⊆ K_t^{(j)}, from which we choose the action. When some actions among the candidates are invalid, i.e., outside of K, we reduce K_t^{(j)} by a cutting plane excluding such an invalid candidate. With this strategy, we need oracle calls to check whether invalid candidates exist. Our algorithm bounds the oracle complexity here by setting E_t^{(j)} to have O(d) elements. 
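The candidate-checking strategy just described can be illustrated as follows. For concreteness, this sketch assumes K is given explicitly as {x : Ax ≤ c}; the paper instead answers each separation query with poly(d) calls to the linear optimization oracle (Theorem 3). The 2d candidates µ̂ ± b_i/(4e) of the algorithm's candidate set are tested, and any violated inequality serves as a cutting plane.

```python
import numpy as np

def candidates(mu_hat, B):
    """The 2d candidate actions mu_hat ± b_i / (4e), i = 1, ..., d,
    where the b_i are the columns of B (a factor of the estimated
    covariance matrix)."""
    d = len(mu_hat)
    return [mu_hat + s * B[:, i] / (4 * np.e)
            for i in range(d) for s in (1.0, -1.0)]

def separation_step(A, c, mu_hat, B_mat):
    """One cutting-plane iteration under the simplifying assumption that
    K = {x : A x <= c} is given explicitly. Returns a violated constraint
    (w, b), i.e., a cutting plane w^T x <= b excluding an invalid
    candidate, or None if all 2d candidates lie in K."""
    for y in candidates(mu_hat, B_mat):
        viol = A @ y - c
        i = int(np.argmax(viol))
        if viol[i] > 1e-12:       # candidate y is outside K
            return A[i], c[i]
    return None
```

Because there are only O(d) candidates, checking them needs O(poly(d)) separation queries per iteration, matching condition 1 of the construction; when `separation_step` returns None, the algorithm can stop refining and sample its action.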
Further, we design E (j)\nso that its elements are suf\ufb01ciently close to the weighted\ncenter of K(j)\n. Indeed, when\na convex body is updated by a cutting plane that excludes a point close to its center, its volume then\ndecreases by a constant factor less than 1 (see, e.g., [25]). On the other hand, K(j)\nalways includes K\nwith a positive volume; hence, the volume of K(j)\ncannot be smaller than that of K, which implies\nthat the number of updates is bounded.\n\n. This plays an important role in bounding the number of updates of K(j)\n\nt\n\nt\n\nt\n\nt\n\nt\n\nt\n\nt\n\n2 \u2287 \u00b7\u00b7\u00b7 \u2287 K(s2)\n\n5.1 Algorithm\nOur algorithm maintains a convex body K(j)\n1 =\nK(0)\n2 \u2287 K(1)\n2 \u2287 \u00b7\u00b7\u00b7 \u2287 K(sT )\nT \u2287 K = Conv(A), where t corresponds to the round,\nj \u2208 {0, 1, . . .} is an index, and st \u2208 {0, 1, . . . , T} will be de\ufb01ned later. It also updates a logconcave\nfunction zt : Rd \u2192 R>0 in each round t based on the multiplicative weight update [5; 14; 33]. Before\nthe \ufb01rst round, zt is initialized to be a constant function z1(x) = 1. Let q(j)\ndenote the PDF of a\ndistribution over K(j)\n\nt \u2286 Rd such that K(0)\n\nthat is proportional to the function zt, i.e.,\n\n1 \u2287 \u00b7\u00b7\u00b7 \u2287 K(s1)\n\n1 \u2287 K(1)\n\nt\n\nt\n\n(cid:90)\n\nK(j)\n\nt\n\nZ (j)\n\nt =\n\nzt(x)dx,\n\nq(j)\nt\n\n(x) =\n\n(cid:40) zt(x)\n\nt\n\nZ(j)\n0\n\nt\n\n,\n\nif a \u2208 K(j)\nif a \u2208 Rd \\ K(j)\nt \u2208\nby \u00b5(j)\nt )(cid:62)]. 
From Corollary 1,\n\nt \u2208 Rd and \u03a3(j)\n\n(1)\n\n.\n\nt\n\nt\n\nLet us denote the mean and the covariance matrix for distribution of q(j)\nRd\u00d7d, respectively: \u00b5(j)\nwe can compute estimators \u02c6\u00b5(j)\n\nt = E a\u223cq(j)\n\n[a], \u03a3(j)\n\nt\n\nt\n\nt\n\nt = E a\u223cq(j)\nand \u03a3(j)\n(cid:107)\u02c6\u00b5(j)\n\nt\n\nt\n\nand \u02c6\u03a3(j)\nt of \u00b5(j)\nt (cid:22) 2\u03a3(j)\n\n,\n\nt\n\nt )(a\u2212\u00b5(j)\n\n[(a\u2212\u00b5(j)\n, respectively, such that\nt (cid:107)(\u03a3(j)\n\n1\n2\n\nt (cid:22) \u02c6\u03a3(j)\n\u03a3(j)\nwith probability of at least 1 \u2212 \u03b4(j)\nt = (b(j)\nB(j)\n\nt \u2212 \u00b5(j)\n(2)\nt \u2208 (0, 1), which will be de\ufb01ned later. Let\n, where \u03b5 > 0 and \u03b4(j)\n(cid:26)\nt B(j)(cid:62)\ntd ) \u2208 Rd\u00d7d be a matrix such that B(j)\nt1 , . . . , b(j)\nt \u2212 1\nE (j)\n\u02c6\u00b5(j)\nt =\n4e\n\n(cid:27)\nd . De\ufb01ne E (j)\n\n(cid:12)(cid:12)(cid:12)(cid:12) i \u2208 [d]\n\n(cid:12)(cid:12)(cid:12)(cid:12) i \u2208 [d]\n\nt \u2286 Rd as\n\n)\u22121 \u2264 \u03b5\n\n\u02c6\u00b5(j)\nt +\n\n= \u02c6\u03a3(j)\n\n(cid:26)\n\n(cid:27)\n\n1\n4e\n\nb(j)\nti\n\nb(j)\nti\n\n(3)\n\n\u222a\n\n.\n\nt\n\nt\n\nt\n\nIn each round t, our algorithm checks if E (j)\nin Step 7 of Algorithm 1, to exclude an element in E (j)\nfour conditions are satis\ufb01ed:\n\nt\n\nt\n\nis included in K, and if not, it updates K(j)\n\n\\ K. Set E (j)\n\nt\n\n, as described\nis designed so that the following\n\nt\n\n1. The cardinality of E (j)\nO(poly(d)) oracle calls.\n\nt\n\nis bounded as |E (j)\n\n| = O(d). Hence, we can decide if E (j)\n\nt\n\n2. Each y \u2208 E (j)\n\nt\n\nis suf\ufb01ciently close to \u00b5(j)\n\nt\n\n, i.e., it satis\ufb01es (cid:107)y \u2212 \u00b5(j)\n\nt (cid:107)(\u03a3(j)\n\nt\n\nis important to bound the number of oracle calls.\n\nt \u2286 K by\n)\u22121 \u2264 1/(2e). This\n\n3. 
The mean of E (j)\n\nt\n\nE (j)\n\nt\n\n, we then have E[(cid:96)(cid:62)\n\nis equal to \u02c6\u00b5(j)\nt y] = (cid:96)(cid:62)\n\nt\n\n. This implies that if y follows a uniform distribution over\nt \u2248 (cid:96)(cid:62)\n\nt x] for x \u223c q(j)\n\nt = E[(cid:96)(cid:62)\n\nt \u00b5(j)\n\nt \u02c6\u00b5(j)\n\n.\n\nt\n\nThanks to this, empirical estimates of (cid:96)t based on E (j)\n\n4. The covariance matrix \u03a3 of a uniform distribution over E (j)\n\nsatis\ufb01es \u03a3 (cid:23) O(1/d2) \u00b7 \u03a3(j)\n.\nt will have a suf\ufb01ciently small variance.\nThe conditions 1 and 2 are used to bound the oracle complexity, and 3 and 4 are necessary to bound\nthe regret. Once E (j)\n. An integer st\n\nis included in K, our algorithm escapes the loop of updating K(j)\n\nt\n\nt\n\nt\n\nt\n\n7\n\n\ft\n\nAssumption 1.\n\nfor j = 0, 1, 2, . . . do\n\non the basis of (1) \u223c (3).\n\n1 = B\u221e(0, R) = {x \u2208 Rd | (cid:107)x(cid:107)\u221e \u2264 R} and de\ufb01ne z1 : Rd \u2192 R>0 by z1(x) = 1.\n\nCompute E (j)\nSolve SP for P = K and for each y \u2208 E (j)\nif There is a hyperplane w \u2208 Rd s.t. w(cid:62)y < minx\u2208K w(cid:62)x for some y \u2208 E (j)\n\nAlgorithm 1 An oracle ef\ufb01cient algorithm for non-stochastic bandit linear optimization\nRequire: Learning rate \u03b7 > 0, error bound \u03b5 > 0, time horizon T \u2208 N, R > 0 satisfying\n1: Set K(0)\n2: for t = 1, 2, . . . , T do\n3:\n4:\n5:\n6:\n7:\n8:\n9:\n10:\n11:\n12:\n13:\n14:\n\nend if\nend for\nLet \u02c6\u00b5t = \u02c6\u00b5(st)\nChoose it \u2208 [d] and \u03c3t \u2208 {1,\u22121} uniformly at random.\nSolve DP for P = K and x = \u02c6\u00b5t + \u03c3t\n4e btit = \u03bbt0xt0 + \u00b7\u00b7\u00b7 + \u03bbtdxtd.\n\u03bbt0, . . . , \u03bbtd such that \u02c6\u00b5t + \u03c3t\nPlay at = xts with probability \u03bbts (s = 0, . . . 
An integer $s_t$ denotes the number of the updates in round $t$. We denote $E_t = E_t^{(s_t)}$, $\hat{\mu}_t = \hat{\mu}_t^{(s_t)}$, $\hat{\Sigma}_t = \hat{\Sigma}_t^{(s_t)}$, and $B_t = B_t^{(s_t)}$. We randomly choose $x$ from $E_t$ as follows: choose $\sigma_t \in \{-1, 1\}$ and $i_t \in [d]$ uniformly at random, and define $x = \hat{\mu}_t + \frac{\sigma_t}{4e} b_{t i_t}$. If we can play this $x$, then we can construct a good estimate of $\ell_t$ from the above condition 4, which leads to a small degree of regret. However, $x \in E_t$ does not always belong to $\mathcal{A}$, particularly when $\mathcal{A}$ is discrete. To address this issue, we solve DP for this $x$ and $P = K$ to derive a decomposition of $x$, i.e., compute $x_{t0}, \ldots, x_{td} \in K$ and $\lambda_{t0}, \ldots, \lambda_{td} \ge 0$ as in Step 14. Then, the algorithm plays $a_t = x_{ti}$ with probability $\lambda_{ti}$, and obtains feedback of $\ell_t^\top a_t$. Based on this feedback, we compute an estimator $\hat{\ell}_t$ of the loss vector $\ell_t$ as

$$\hat{\ell}_t = 4ed \, \sigma_t \, \ell_t^\top a_t \, \hat{\Sigma}_t^{-1} b_{t i_t}. \tag{4}$$

This is an unbiased estimator of $\ell_t$, i.e., we have $\mathbb{E}[\hat{\ell}_t] = \ell_t$. The existence of $\hat{\Sigma}_t^{-1}$ follows from the definition of $\hat{\Sigma}_t$ and Assumption 1.
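The unbiasedness $\mathbb{E}[\hat{\ell}_t] = \ell_t$ can be checked by averaging (4) over the $2d$ equally likely draws of $(\sigma_t, i_t)$, using that $a_t$ has conditional mean $x = \hat{\mu}_t + \frac{\sigma_t}{4e} b_{t i_t}$ by Step 15: the $\sigma_t$-odd term cancels, and the remaining term is $\hat{\Sigma}_t^{-1} B_t B_t^\top \ell_t = \ell_t$. A numerical sketch with synthetic stand-ins for $\ell_t$, $\hat{\mu}_t$, and $\hat{\Sigma}_t$ (all quantities hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
e = np.e

ell = rng.normal(size=d)             # the unknown loss vector (hypothetical)
mu_hat = rng.normal(size=d)
M = rng.normal(size=(d, d))
Sigma_hat = M @ M.T + np.eye(d)      # positive definite, so invertible
B = np.linalg.cholesky(Sigma_hat)    # B B^T = Sigma_hat
Sigma_inv = np.linalg.inv(Sigma_hat)

# Average the estimator (4) over the 2d equally likely draws of (sigma, i),
# replacing a_t by its conditional mean mu_hat + sigma * b_i / (4e).
est_mean = np.zeros(d)
for sigma in (-1.0, 1.0):
    for i in range(d):
        a_mean = mu_hat + sigma * B[:, i] / (4 * e)
        est_mean += 4 * e * d * sigma * (ell @ a_mean) * (Sigma_inv @ B[:, i])
est_mean /= 2 * d

assert np.allclose(est_mean, ell)    # E[ell_hat] = ell, as claimed
```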
In fact, from $\mathcal{A} \subseteq K_t^{(j)}$ and Assumption 1, $K_t^{(j)}$ has a positive volume and $q_t^{(j)}$ has a positive density over $K_t^{(j)}$, which implies that the covariance matrix $\Sigma_t^{(j)}$ of $q_t^{(j)}$ is positive definite for all $t$ and $j$. From this and (2), $\hat{\Sigma}_t^{(j)}$ is positive definite. The function $z_t$ is updated by $z_{t+1}(x) = z_t(x) \exp(-\eta \hat{\ell}_t^\top (x - \hat{\mu}_t))$, where $\eta > 0$ is an input parameter standing for the learning rate, which will be optimized later. Let

$$\delta_t^{(j)} = 1 \Big/ \Big( T \big( j + 2 + \textstyle\sum_{i=1}^{t-1} (s_i + 1) \big) \big( j + 3 + \textstyle\sum_{i=1}^{t-1} (s_i + 1) \big) \Big). \tag{5}$$

To compute $\hat{\mu}_t^{(j)}$ and $\hat{\Sigma}_t^{(j)}$ satisfying (2) with probability at least $1 - \delta_t^{(j)}$, we use the algorithm in Corollary 1. Let $S_T = \sum_{t=1}^{T} s_t$ denote the number of updates of $K_t^{(j)}$. We show the following regret bound.

Theorem 5. Define $\psi = \frac{1}{d} \log \frac{\mathrm{Vol}(B_\infty(0, R))}{\mathrm{Vol}(K)}$. Suppose $a_t$ is given by Algorithm 1 with parameters $\varepsilon = \frac{1}{12eT}$ and $\eta = \frac{1}{2eLR} \min \left\{ \sqrt{\frac{1 + \psi + \log T}{dT}}, \frac{1}{24 d^{3/2} (1 + \psi + \log T)} \right\}$. Then, for all $a^* \in \mathcal{A}$, we have

$$\mathbb{E}[R_T(a^*)] \le 27 e L R d^{3/2} \max \left\{ \sqrt{T (1 + \psi + \log T)},\; d (1 + \psi + \log T)^2 \right\} \left( 1 - S_T / 210 \right). \tag{6}$$

We note that $\psi$ in the above theorem satisfies $\psi \le \log \frac{R}{r}$ if $K$ includes an $\ell_\infty$-ball of radius $r > 0$. The proof of this theorem is given in Appendix A.

5.2 Oracle Complexity Analysis

Here, we show that Algorithm 1 calls the linear optimization oracle only $O(\mathrm{poly}(d) \, T)$ times. To implement Algorithm 1, the linear optimization oracle is required only in Steps 5 and 14.
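The choice of $\delta_t^{(j)}$ in (5) makes the failure probabilities sum to a small constant over the whole run: as the algorithm visits the pairs $(t, j)$ in order, the index $A = j + 2 + \sum_{i=1}^{t-1}(s_i + 1)$ runs through consecutive integers starting at 2, so $\sum \delta_t^{(j)} = \frac{1}{T} \sum_A \frac{1}{A(A+1)}$ telescopes to less than $\frac{1}{2T}$. A quick numerical check with a hypothetical sequence of update counts $s_t$:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 50
s = rng.integers(0, 4, size=T)        # hypothetical update counts s_1, ..., s_T

total = 0.0
for t in range(1, T + 1):
    prev = np.sum(s[:t - 1] + 1)      # sum_{i<t} (s_i + 1)
    for j in range(s[t - 1] + 1):
        A = j + 2 + prev
        total += 1.0 / (T * A * (A + 1))   # delta_t^{(j)} from (5)

# The indices A are consecutive integers 2, 3, ..., so the sum telescopes:
# sum_A 1/(A(A+1)) = 1/2 - 1/(A_last + 1) < 1/2, hence total < 1/(2T).
assert 0 < total < 1 / (2 * T)
```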
In Step 5, we need to solve SP to decide if there exists $x \in E_t^{(j)}$ such that $x \notin K$. From the definition (3) of $E_t^{(j)}$, the number of elements in $E_t^{(j)}$ is equal to $2d$ for each $t$ and $j$; accordingly, the total number of SP instances to solve is $\sum_{t=1}^{T} \sum_{j=0}^{s_t} |E_t^{(j)}| = 2d \sum_{t=1}^{T} (s_t + 1) = 2d (T + S_T)$. The number $S_T$ can be bounded as $S_T = O(T)$. Indeed, from Theorem 5, if $S_T > 210 (1 + T)$ then $\mathbb{E}[R_T(a^*)] < -27 e L R T$, which contradicts $R_T(a^*) = \sum_{t=1}^{T} \ell_t^\top (a_t - a^*) \ge -2 L R T$. Consequently, the total number of SP instances is $O(dT)$. In Step 14, we solve DP once in each round $t$; hence the total number of DP instances is equal to $T$. Because we can solve SP and DP by calling the linear optimization oracle $\mathrm{poly}(d, \log T)$ times by Theorem 4 and Remark 1, we can implement Algorithm 1 so that it calls the linear optimization oracle $O(\mathrm{poly}(d, \log T) \, T)$ times.
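The count $2d(T + S_T)$ follows mechanically from the loop structure: each inner iteration $j = 0, \ldots, s_t$ solves SP for the $2d$ points of $E_t^{(j)}$. A toy tally with a hypothetical sequence $s_t$ (nothing here calls a real oracle):

```python
# Count SP instances in Algorithm 1, Steps 3-11: one SP per point of
# E_t^{(j)}, over j = 0..s_t and t = 1..T, giving 2d * sum_t (s_t + 1).
d, T = 5, 100
s = [(3 * t) % 4 for t in range(1, T + 1)]   # hypothetical update counts s_t

sp_calls = 0
for s_t in s:
    for j in range(s_t + 1):                 # inner loop over j = 0..s_t
        sp_calls += 2 * d                    # 2d points checked per iteration

S_T = sum(s)
assert sp_calls == 2 * d * (T + S_T)         # matches the closed form
```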