{"title": "Scalable Generalized Linear Bandits: Online Computation and Hashing", "book": "Advances in Neural Information Processing Systems", "page_first": 99, "page_last": 109, "abstract": "Generalized Linear Bandits (GLBs), a natural extension of the stochastic linear bandits, have been popular and successful in recent years. However, existing GLBs scale poorly with the number of rounds and the number of arms, limiting their utility in practice. This paper proposes new, scalable solutions to the GLB problem in two respects. First, unlike existing GLBs, whose per-time-step space and time complexity grow at least linearly with time $t$, we propose a new algorithm that performs online computations to enjoy a constant space and time complexity. At its heart is a novel Generalized Linear extension of the Online-to-confidence-set Conversion (GLOC method) that takes \\emph{any} online learning algorithm and turns it into a GLB algorithm. As a special case, we apply GLOC to the online Newton step algorithm, which results in a low-regret GLB algorithm with much lower time and memory complexity than prior work. Second, for the case where the number $N$ of arms is very large, we propose new algorithms in which each next arm is selected via an inner product search. Such methods can be implemented via hashing algorithms (i.e., ``hash-amenable'') and result in a time complexity sublinear in $N$. While a Thompson sampling extension of GLOC is hash-amenable, its regret bound for $d$-dimensional arm sets scales with $d^{3/2}$, whereas GLOC's regret bound scales with $d$. Towards closing this gap, we propose a new hash-amenable algorithm whose regret bound scales with $d^{5/4}$. Finally, we propose a fast approximate hash-key computation (inner product) with a better accuracy than the state-of-the-art, which can be of independent interest. 
We conclude the paper with preliminary experimental results confirming the merits of our methods.", "full_text": "Scalable Generalized Linear Bandits:\n\nOnline Computation and Hashing\n\nKwang-Sung Jun\n\nUW-Madison\n\nkjun@discovery.wisc.edu\n\nAniruddha Bhargava\n\nUW-Madison\n\naniruddha@wisc.edu\n\nRobert Nowak\nUW-Madison\n\nrdnowak@wisc.edu\n\nRebecca Willett\nUW-Madison\n\nwillett@discovery.wisc.edu\n\nAbstract\n\nGeneralized Linear Bandits (GLBs), a natural extension of the stochastic linear\nbandits, has been popular and successful in recent years. However, existing GLBs\nscale poorly with the number of rounds and the number of arms, limiting their\nutility in practice. This paper proposes new, scalable solutions to the GLB problem\nin two respects. First, unlike existing GLBs, whose per-time-step space and time\ncomplexity grow at least linearly with time t, we propose a new algorithm that\nperforms online computations to enjoy a constant space and time complexity. At\nits heart is a novel Generalized Linear extension of the Online-to-con\ufb01dence-set\nConversion (GLOC method) that takes any online learning algorithm and turns it\ninto a GLB algorithm. As a special case, we apply GLOC to the online Newton\nstep algorithm, which results in a low-regret GLB algorithm with much lower\ntime and memory complexity than prior work. Second, for the case where the\nnumber N of arms is very large, we propose new algorithms in which each next\narm is selected via an inner product search. Such methods can be implemented\nvia hashing algorithms (i.e., \u201chash-amenable\u201d) and result in a time complexity\nsublinear in N. While a Thompson sampling extension of GLOC is hash-amenable,\nits regret bound for d-dimensional arm sets scales with d3/2, whereas GLOC\u2019s\nregret bound scales with d. Towards closing this gap, we propose a new hash-\namenable algorithm whose regret bound scales with d5/4. 
Finally, we propose a fast approximate hash-key computation (inner product) with a better accuracy than the state-of-the-art, which can be of independent interest. We conclude the paper with preliminary experimental results confirming the merits of our methods.

1 Introduction
This paper considers the problem of making generalized linear bandits (GLBs) scalable. In the stochastic GLB problem, a learner makes successive decisions to maximize her cumulative rewards. Specifically, at time t the learner observes a set of arms X_t ⊆ R^d. The learner then chooses an arm x_t ∈ X_t and receives a stochastic reward y_t that is a noisy function of x_t: y_t = µ(x_t⊤θ*) + η_t, where θ* ∈ R^d is unknown, µ: R → R is a known nonlinear mapping, and η_t ∈ R is some zero-mean noise. This reward structure encompasses generalized linear models [29]; e.g., Bernoulli, Poisson, etc.

The key aspect of the bandit problem is that the learner does not know how much reward she would have received, had she chosen another arm. The estimation of θ* is thus biased by the history of the selected arms, and one needs to mix in exploratory arm selections to avoid ruling out the optimal arm. This is well known as the exploration-exploitation dilemma. The performance of a learner is evaluated by her regret, which measures how much additional cumulative reward she would have gained had she known the true θ*. We provide backgrounds and formal definitions in Section 2.

A linear case of the problem above (µ(z) = z) is called the (stochastic) linear bandit problem. Since the first formulation of the linear bandits [7], there has been a flurry of studies on the problem [11, 34, 1, 9, 5].

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

In an effort to generalize the restrictive linear rewards, Filippi et al. 
[15] propose the GLB problem and provide a low-regret algorithm, whose Thompson sampling version appears later in Abeille & Lazaric [3]. Li et al. [27] evaluate GLBs via extensive experiments, where GLBs exhibit lower regrets than linear bandits for 0/1 rewards. Li et al. [28] achieve a smaller regret bound when the arm set X_t is finite, though with an impractical algorithm.

However, we claim that all existing GLB algorithms [15, 28] suffer from two scalability issues that limit their practical use: (i) under a large time horizon and (ii) under a large number N of arms. First, existing GLBs require storing all the arms and rewards that have appeared so far, {(x_s, y_s)}_{s=1}^t, so the space complexity grows linearly with t. Furthermore, they have to solve a batch optimization problem for the maximum likelihood estimation (MLE) at each time step t, whose per-time-step time complexity grows at least linearly with t. While Zhang et al. [41] provide a solution whose space and time complexity do not grow over time, they consider a specific 0/1 reward with the logistic link function, and a generic solution for GLBs is not provided.

Second, existing GLBs have linear time complexities in N. This is impractical when N is very large, which is not uncommon in applications of GLBs such as online advertisements, recommendation systems, and interactive retrieval of images or documents [26, 27, 40, 21, 25], where arms are items in a very large database. Furthermore, the interactive nature of these systems requires prompt responses, as users do not want to wait. This implies that the typical linear time in N is not tenable. 
Towards a sublinear time in N, locality-sensitive hashing [18] or its extensions [35, 36, 30] are good candidates, as they have been successful in fast similarity search and other machine learning problems like active learning [22], where the search time scales with N^ρ for some ρ < 1 (ρ is usually optimized and often ranges from 0.4 to 0.8 depending on the target search accuracy). Leveraging hashing in GLBs, however, relies critically on the objective function used for arm selections. The function must take a form that is readily optimized using existing hashing algorithms.1 For example, algorithms whose objective function (a function of each arm x ∈ X_t) can be written as a distance or inner product between x and a query q are hash-amenable, as there exist hashing methods for such functions.

To be scalable to a large time horizon, we propose a new algorithmic framework called Generalized Linear Online-to-confidence-set Conversion (GLOC) that takes in an online learning (OL) algorithm with a low 'OL' regret bound and turns it into a GLB algorithm with a low 'GLB' regret bound. The key tool is a novel generalization of the online-to-confidence-set conversion technique used in [2] (also similar to [14, 10, 16, 41]). This allows us to construct a confidence set for θ*, which is then used to choose an arm x_t according to the well-known optimism in the face of uncertainty principle. By relying on an online learner, GLOC inherently performs online computations and is thus free from the scalability issues in large time steps. While any online learner equipped with a low OL regret bound can be used, we choose the online Newton step (ONS) algorithm and prove a tight OL regret bound, which results in a practical GLB algorithm with almost the same regret bound as existing inefficient GLB algorithms. 
We present our proposed algorithms and their regret bounds in Section 3.

For a large number N of arms, our proposed algorithm GLOC is not hash-amenable, to our knowledge, due to its nonlinear criterion for arm selection. As a first attempt, we derive a Thompson sampling [5, 3] extension of GLOC (GLOC-TS), which is hash-amenable due to its linear criterion. However, its regret bound scales with d^{3/2} for d-dimensional arm sets, which is far from the d of GLOC. Towards closing this gap, we propose a new algorithm Quadratic GLOC (QGLOC) with a regret bound that scales with d^{5/4}. We summarize the comparison of our proposed GLB algorithms in Table 1. In Section 4, we present GLOC-TS, QGLOC, and their regret bounds.

Note that, while hashing achieves a time complexity sublinear in N, there is a nontrivial overhead of computing the projections to determine the hash keys. As an extra contribution, we reduce this overhead by proposing a new sampling-based approximate inner product method. Our proposed sampling method has smaller variance than the state-of-the-art sampling method proposed by [22, 24] when the vectors are normally distributed, which fits our setting where projection vectors are indeed normally distributed. Moreover, our method results in thinner tails in the distribution of estimation error than the existing method, which implies a better concentration. We elaborate more on reducing the computational complexity of QGLOC in Section 5.

Table 1: Comparison of GLB algorithms for d-dimensional arm sets. T is the time horizon. QGLOC achieves the smallest regret among hash-amenable algorithms.

Algorithm   Regret           Hash-amenable
GLOC        Õ(d√T)           No
GLOC-TS     Õ(d^{3/2}√T)     Yes
QGLOC       Õ(d^{5/4}√T)     Yes

1 Without this designation, no currently known bandit algorithm achieves a sublinear time complexity in N.

2 Preliminaries
We review relevant backgrounds here. 
A refers to a GLB algorithm, and B refers to an online learning algorithm. Let B^d(S) be the d-dimensional Euclidean ball of radius S, which overloads the notation B. Let A_{·i} be the i-th column vector of a matrix A. Define ||x||_A := √(x⊤Ax) and vec(A) := [A_{·1}; A_{·2}; · · · ; A_{·d}] ∈ R^{d²}. Given a function f : R → R, we denote by f′ and f″ its first and second derivative, respectively. We define [N] := {1, 2, . . . , N}.

Generalized Linear Model (GLM)  Consider modeling the reward y as a one-dimensional exponential family such as Bernoulli or Poisson. When the feature vector x is believed to correlate with y, one popular modeling assumption is the generalized linear model (GLM), which turns the natural parameter of an exponential family model into x⊤θ*, where θ* is a parameter [29]:

P(y | z = x⊤θ*) = exp( (yz − m(z)) / g(τ) + h(y, τ) ),    (1)

where τ ∈ R_+ is a known scale parameter and m, g, and h are normalizers. It is known that m′(z) = E[y | z] =: µ(z) and m″(z) = Var(y | z). We call µ(z) the inverse link function. Throughout, we assume that the exponential family being used in a GLM has a minimal representation, which ensures that m(z) is strictly convex [38, Prop. 3.1]. Then, the negative log likelihood (NLL) ℓ(z, y) := −yz + m(z) of a GLM is strictly convex. We refer to such GLMs as the canonical GLM. In the case of Bernoulli rewards y ∈ {0, 1}, m(z) = log(1 + exp(z)), µ(z) = (1 + exp(−z))^{−1}, and the NLL can be written as the logistic loss: log(1 + exp(−y′(x_t⊤θ))), where y′ = 2y − 1.

Generalized Linear Bandits (GLB)  Recall that x_t is the arm chosen at time t by an algorithm. We assume that the arm set X_t can be of an infinite cardinality, although we focus on finite arm sets in the hashing part of the paper (Section 4). One can write down the reward model (1) in a different form:

y_t = µ(x_t⊤θ*) + η_t,    (2)

where η_t is conditionally R-sub-Gaussian given x_t and {(x_s, η_s)}_{s=1}^{t−1}. For example, the Bernoulli reward model has η_t equal to 1 − µ(x_t⊤θ*) w.p. µ(x_t⊤θ*) and −µ(x_t⊤θ*) otherwise. Assume that ||θ*||_2 ≤ S, where S is known. One can show that the sub-Gaussian scale R is determined by µ: R = sup_{z∈(−S,S)} √(µ′(z)) ≤ √L, where L is the Lipschitz constant of µ. Throughout, we assume that each arm has ℓ_2-norm at most 1: ||x||_2 ≤ 1, ∀x ∈ X_t, ∀t. Let x_{t,*} := arg max_{x∈X_t} x⊤θ*. The performance of a GLB algorithm A is analyzed by the expected cumulative regret (or simply regret): Regret^A_T := Σ_{t=1}^T µ(x_{t,*}⊤θ*) − µ((x^A_t)⊤θ*), where x^A_t makes the dependence on A explicit.

We remark that our results in this paper hold true for a strictly larger family of distributions than the canonical GLM, which we call the non-canonical GLM and explain below. The condition is that the reward model follows (2), where R is now independent from µ, and µ satisfies the following:

Assumption 1. µ is L-Lipschitz on [−S, S] and continuously differentiable on (−S, S). Furthermore, inf_{z∈(−S,S)} µ′(z) = κ for some finite κ > 0 (thus µ is strictly increasing).

Define µ′(z) at ±S as their limits. 
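As a concrete instance of the definitions above, the sketch below computes the Bernoulli GLM quantities m, µ, and the NLL, checks µ = m′ numerically, and evaluates κ = inf_{z∈(−S,S)} µ′(z), which for the logistic µ is attained at the boundary ±S. This is a minimal illustration only; the value S = 2 and the check point z0 are arbitrary choices, not values from the paper.

```python
import math

def m(z):
    # log-partition of the Bernoulli GLM: m(z) = log(1 + exp(z))
    return math.log1p(math.exp(z))

def mu(z):
    # inverse link: mu(z) = m'(z) = (1 + exp(-z))^{-1}
    return 1.0 / (1.0 + math.exp(-z))

def nll(z, y):
    # GLM negative log likelihood: l(z, y) = -y*z + m(z)
    return -y * z + m(z)

S = 2.0   # illustrative bound on |x^T theta*|
z0 = 0.7  # arbitrary point for the derivative check
eps = 1e-6
# central-difference check that mu = m'
assert abs((m(z0 + eps) - m(z0 - eps)) / (2 * eps) - mu(z0)) < 1e-6
# kappa = inf mu'(z) over (-S, S); mu'(z) = mu(z)(1 - mu(z)) is minimized at |z| = S
kappa = mu(S) * (1.0 - mu(S))
```

For the Bernoulli GLM this recovers the familiar logistic quantities; κ shrinks as S grows, which is why κ appears in the denominators of the regret bounds below.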
Under Assumption 1, m is defined to be an integral of µ. Then, one can show that m is κ-strongly convex on B^1(S). An example of the non-canonical GLM is the probit model for 0/1 reward, where µ is the Gaussian CDF, which is popular and competitive with the Bernoulli GLM as evaluated by Li et al. [27]. Note that canonical GLMs satisfy Assumption 1.

3 Generalized Linear Bandits with Online Computation
We describe and analyze a new GLB algorithm called Generalized Linear Online-to-confidence-set Conversion (GLOC) that performs online computations, unlike existing GLB algorithms.

GLOC employs the optimism in the face of uncertainty principle, which dates back to [7]. That is, we maintain a confidence set C_t (defined below) that traps the true parameter θ* with high probability (w.h.p.) and choose the arm with the largest feasible reward given C_{t−1} as a constraint:

(x_t, θ̃_t) := arg max_{x∈X_t, θ∈C_{t−1}} ⟨x, θ⟩.    (3)

The main difference between GLOC and existing GLBs is in the computation of the C_t's. Prior methods involve "batch" computations that involve all past observations, and so scale poorly with t.
Let\nfunction (cid:96)t(\u03b8) := (cid:96)(x(cid:62)\nXt \u2208 Rt\u00d7d be the design matrix consisting of x1, . . . , xt. De\ufb01ne Vt := \u03bbI + X(cid:62)\nt Xt, where \u03bb\n\u22121\nis the ridge parameter. Let zt := x(cid:62)\nt X(cid:62)\nt zt be the ridge\nregression estimator taking zt as responses. Theorem 1 below is the key result for constructing our\ncon\ufb01dence set Ct, which is a function of the parameter predictions {\u03b8s}t\ns=1 and the online (OL)\nregret bound Bt of the learner B. All the proofs are in the supplementary material (SM).\nTheorem 1. (Generalized Linear Online-to-Con\ufb01dence-Set Conversion) Suppose we feed loss func-\n(cid:80)t\ntions {(cid:96)s(\u03b8)}t\ns=1 into online learner B. Let \u03b8s be the parameter predicted at time step s by B.\nAssume that B has an OL regret bound Bt: \u2200\u03b8 \u2208 Bd(S),\u2200t \u2265 1,\n(cid:113)\ns=1 (cid:96)s(\u03b8s) \u2212 (cid:96)s(\u03b8) \u2264 Bt .\n(4)\n\u2264 \u03b1(Bt) + \u03bbS2 \u2212(cid:16)||zt||2\n2 \u2212(cid:98)\u03b8\n\u03ba4\u03b42 ). Then, with probability (w.p.) at least 1\u2212 \u03b4,\n1 + 2\n(5)\nCt := {\u03b8 \u2208 Rd : ||\u03b8 \u2212(cid:98)\u03b8t||2\n\nNote that the center of the ellipsoid is the ridge regression estimator on the predicted natural\nparameters zs = x(cid:62)\ns \u03b8s rather than the rewards. Theorem 1 motivates the following con\ufb01dence set:\n(6)\n\u2217 for all t \u2265 1, w.p. at least 1 \u2212 \u03b4. See Algorithm 1 for pseudocode. 
One way to solve\nwhich traps \u03b8\nthe optimization problem (3) is to de\ufb01ne the function \u03b8(x) := max\u03b8\u2208Ct\u22121 x(cid:62)\u03b8, and then use the\nLagrangian method to write:\n\n\u2217 \u2212(cid:98)\u03b8t||2\n\nLet \u03b1(Bt) := 1 + 4\n\n\u2200t \u2265 1,||\u03b8\n\n\u03ba Bt + 8R2\n\n(cid:62)\nt X(cid:62)\nt zt\n\n\u03ba Bt + 4R4\n\n\u03ba2 log( 2\n\n\u2264 \u03b2t}\n\n=: \u03b2t .\n\n(cid:17)\n\nVt\n\nVt\n\n\u03b4\n\nxGLOC\nt\n\n:= arg max\nx\u2208Xt\n\n.\n\n\u22121\nt\u22121\n\n(7)\n\nWe prove the regret bound of GLOC in the following theorem.\nTheorem 2. Let {\u03b2t} be a nondecreasing sequence such that \u03b2t \u2265 \u03b2t. Then, w.p. at least 1 \u2212 \u03b4,\n\nx(cid:62)(cid:98)\u03b8t\u22121 +(cid:112)\u03b2t\u22121||x||V\n(cid:19)\n(cid:18)\n\n(cid:113)\n\nAlgorithm 1 GLOC\n1: Input: R > 0, \u03b4 \u2208 (0, 1), S > 0, \u03bb > 0, \u03ba > 0,\nan online learner B with known regret bounds\n{Bt}t\u22651.\n\nRegretGLOC\n\nT\n\nL\n\n\u221a\n\n\u03b2T dT log T\n\nCompute xt by solving (3).\nPull xt and then observe yt.\nReceive \u03b8t from B.\nFeed into B the loss (cid:96)t(\u03b8) = (cid:96)(x(cid:62)\nUpdate Vt = Vt\u22121 + xtx(cid:62)\n\n= O\nAlthough any low-regret online learner can be\ncombined with GLOC, one would like to ensure\nthat \u03b2T is O(polylog(T )) in which case the total\nregret can be bounded by \u02dcO(\nT ). This means\nthat we must use online learners whose OL regret\ngrows logarithmically in T such as [20, 31]. In\nthis work, we consider the online Newton step\n(ONS) algorithm [20].\nOnline Newton Step (ONS) for Generalized\nLinear Models Note that ONS requires the loss\nfunctions to be \u03b1-exp-concave. One can show\nthat (cid:96)t(\u03b8) is \u03b1-exp-concave [20, Sec. 2.2]. Then,\nGLOC can use ONS and its OL regret bound to\nsolve the GLB problem. 
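A minimal numerical sketch of the optimistic selection rule in (7) over a finite arm set follows (numpy assumed; the arm set, θ̂, V, and β here are synthetic placeholders, not quantities computed by GLOC):

```python
import numpy as np

def gloc_arm(arms, theta_hat, V, beta):
    """Optimistic arm selection: argmax_x x^T theta_hat + sqrt(beta) * ||x||_{V^{-1}}."""
    V_inv = np.linalg.inv(V)
    # x^T V^{-1} x for every arm at once
    quad = np.einsum('ij,jk,ik->i', arms, V_inv, arms)
    scores = arms @ theta_hat + np.sqrt(beta) * np.sqrt(quad)
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
arms = rng.normal(size=(50, 4))
# enforce ||x||_2 <= 1 as assumed in Section 2
arms /= np.maximum(1.0, np.linalg.norm(arms, axis=1, keepdims=True))
theta_hat = rng.normal(size=4)
V = np.eye(4)          # V_0 = lambda * I with lambda = 1 (placeholder)
idx = gloc_arm(arms, theta_hat, V, beta=0.0)
# with beta = 0 the rule reduces to the purely greedy choice
assert idx == int(np.argmax(arms @ theta_hat))
```

Increasing `beta` widens the exploration bonus √β‖x‖_{V⁻¹}, which is exactly the radius term of the confidence ellipsoid (6).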
However, motivated by the fact that the OL regret bound B_t appears in the radius √(β_t) of the confidence set, while a tighter confidence set tends to reduce the bandit regret in practice, we derive a tight data-dependent OL regret bound tailored to GLMs.

We present our version of ONS for GLMs (ONS-GLM) in Algorithm 2. ℓ′(z, y) is the first derivative w.r.t. z, and the parameter ε is for inverting matrices conveniently (usually ε = 1 or 0.1). The only difference from the original ONS [20] is that we rely on the strong convexity of m(z) instead of the α-exp-concavity of the loss, thanks to the GLM structure.² Theorem 3 states that we achieve the desired polylogarithmic regret in T.

Algorithm 2 ONS-GLM
1: Input: κ > 0, ε > 0, S > 0.
2: A_0 = εI.
3: Set θ_1 ∈ B^d(S) arbitrarily.
4: for t = 1, 2, 3, . . . do
5:   Output θ_t.
6:   Observe x_t and y_t.
7:   Incur loss ℓ(x_t⊤θ_t, y_t).
8:   A_t = A_{t−1} + x_t x_t⊤.
9:   θ′_{t+1} = θ_t − (ℓ′(x_t⊤θ_t, y_t)/κ) A_t^{−1} x_t.
10:  θ_{t+1} = arg min_{θ∈B^d(S)} ||θ − θ′_{t+1}||²_{A_t}.
11: end for

² A similar change to ONS has been applied in [16, 41].

Theorem 3. Define g_s := ℓ′(x_s⊤θ_s, y_s). The regret of ONS-GLM satisfies, for any ε > 0 and t ≥ 1,

Σ_{s=1}^t ℓ_s(θ_s) − ℓ_s(θ*) ≤ (1/(2κ)) Σ_{s=1}^t g_s² ||x_s||²_{A_s^{−1}} + 2κS²ε =: B^ONS_t,

where B^ONS_t = O( ((L² + R² log(t))/κ) d log t ), ∀t ≥ 1, w.h.p. If max_{s≥1} |η_s| is bounded by R̄ w.p. 1, B^ONS_t = O( ((L² + R̄²)/κ) d log t ).

We emphasize that the OL regret bound is data-dependent. A confidence set constructed by combining Theorem 1 and Theorem 3 directly implies the following regret bound of GLOC with ONS-GLM.

Corollary 1. Define β^ONS_t by replacing B_t with B^ONS_t in (5). Then, w.p. at least 1 − 2δ,

∀t ≥ 1, θ* ∈ C^ONS_t := {θ ∈ R^d : ||θ − θ̂_t||²_{V_t} ≤ β^ONS_t}.    (8)

Corollary 2. Run GLOC with C^ONS_t. Then, w.p. at least 1 − 2δ, ∀T ≥ 1, Regret^GLOC_T = Ô( (L(L+R)/κ) d√T log^{3/2}(T) ), where Ô ignores log log(t) factors. If |η_t| is bounded by R̄, Regret^GLOC_T = Ô( (L(L+R̄)/κ) d√T log(T) ).

We make regret bound comparisons ignoring log log T factors. For generic arm sets, our dependence on d is optimal for linear rewards [34]. For the Bernoulli GLM, our regret has the same order as Zhang et al. [41]. One can show that the regret of Filippi et al. [15] has the same order as ours if we use their assumption that the reward y_t is bounded by R_max. For unbounded noise, Li et al. [28] have regret O( (LR/κ) d √(T log T) ), which is a log T factor smaller than ours and has LR in place of L(L + R). While L(L + R) could be an artifact of our analysis, the gap is not too large for canonical GLMs. Let L be the smallest Lipschitz constant of µ. Then, R = √L. If L ≤ 1, R satisfies R > L, and so L(L + R) = O(LR). If L > 1, then L(L + R) = O(L²), which is larger than LR = O(L^{3/2}). For the Gaussian GLM with known variance σ², L = R = 1.³ For finite arm sets, SupCB-GLM of Li et al. [28] achieves a regret of Õ(√(dT log N)) that has a better scaling with d but is not a practical algorithm, as it wastes a large number of arm pulls. Finally, we remark that none of the existing GLB algorithms are scalable to large T. Zhang et al. [41] is scalable to large T but is restricted to the Bernoulli GLM; e.g., theirs does not allow the probit model (non-canonical GLM), which is popular and shown to be competitive with the Bernoulli GLM [27].

Discussion  The trick of obtaining a confidence set from an online learner appeared first in [13, 14] for the linear model and was then used in [10, 16, 41]. GLOC is slightly different from these studies and rather close to Abbasi-Yadkori et al. [2] in that the confidence set is a function of a known regret bound. This generality frees us from re-deriving a confidence set for every online learner. Our result is essentially a nontrivial extension of Abbasi-Yadkori et al. [2] to GLMs.

One might have noticed that C_t does not use θ_{t+1}, which is available before pulling x_{t+1} and has the most up-to-date information. This is inherent to GLOC, as it relies on the OL regret bound directly. One can modify the proof of ONS-GLM to obtain a tighter confidence set C_t that uses θ_{t+1}, as we show in SM Section E. However, this is then specific to ONS-GLM, which loses generality.

4 Hash-Amenable Generalized Linear Bandits
We now turn to a setting where the arm set is finite but very large. 
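To make the ONS-GLM update of Algorithm 2 concrete, here is a compact Python sketch of a run on a synthetic logistic-reward stream (numpy assumed). One simplification is hedged explicitly: Algorithm 2 projects onto B^d(S) in the ||·||_{A_t} norm, which has no closed form, so the sketch uses the plain Euclidean projection instead; θ*, S, and the horizon are illustrative values, not from the paper.

```python
import numpy as np

def ons_glm_step(theta, A, x, y, kappa, S, mu):
    """One ONS-GLM-style update: A_t = A_{t-1} + x x^T, then a gradient step
    theta - (l'(z, y)/kappa) A_t^{-1} x, then projection onto the ball B_d(S).
    NOTE: Euclidean projection here is a simplification of the ||.||_A projection."""
    A = A + np.outer(x, x)
    g = mu(x @ theta) - y                      # l'(z, y) = -y + mu(z) for canonical GLMs
    theta = theta - (g / kappa) * np.linalg.solve(A, x)
    norm = np.linalg.norm(theta)
    if norm > S:
        theta = theta * (S / norm)
    return theta, A

rng = np.random.default_rng(1)
d, S = 3, 1.0
theta_star = np.array([0.5, -0.3, 0.2])        # illustrative true parameter
mu = lambda z: 1.0 / (1.0 + np.exp(-z))
theta, A = np.zeros(d), 1.0 * np.eye(d)        # A_0 = eps * I with eps = 1
kappa = mu(S) * (1.0 - mu(S))                  # inf of mu' over (-S, S) for the logistic mu
for _ in range(200):
    x = rng.normal(size=d)
    x /= max(1.0, np.linalg.norm(x))           # ||x||_2 <= 1
    y = float(rng.random() < mu(x @ theta_star))  # Bernoulli reward
    theta, A = ons_glm_step(theta, A, x, y, kappa, S, mu)
```

The only statistics kept between rounds are θ_t and the d×d matrix A_t, which is the source of GLOC's constant per-step space complexity.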
For example, imagine an interactive\nretrieval scenario [33, 25, 6] where a user is shown K images (e.g., shoes) at a time and provides\nrelevance feedback (e.g., yes/no or 5-star rating) on each image, which is repeated until the user is\nsatis\ufb01ed. In this paper, we focus on showing one image (i.e., arm) at a time.4 Most existing algorithms\nrequire maximizing an objective function (e.g., (7)), the complexity of which scales linearly with the\nnumber N of arms. This can easily become prohibitive for large numbers of images. Furthermore,\nthe system has to perform real-time computations to promptly choose which image to show the user\nin the next round. Thus, it is critical for a practical system to have a time complexity sublinear in N.\nOne naive approach is to select a subset of arms ahead of time, such as volumetric spanners [19].\nHowever, this is specialized for an ef\ufb01cient exploration only and can rule out a large number of\ngood arms. Another option is to use hashing methods. Locality-sensitive hashing and Maximum\n\n3 The reason why R is not \u03c3 here is that the suf\ufb01cient statistic of the GLM is y/\u03c3, which is equivalent to\n\ndealing with the normalized reward. Then, \u03c3 appears as a factor in the regret bound.\n\n4 One image at a time is a simpli\ufb01cation of the practical setting. One can extend it to showing multiple\n\nimages at a time, which is a special case of the combinatorial bandits of Qin et al. 
[32].

Inner Product Search (MIPS) are effective and well-understood tools but can only be used when the objective function is a distance or an inner product computation; (7) cannot be written in this form. In this section, we consider alternatives to GLOC which are compatible with hashing.

Thompson Sampling  We present a Thompson sampling (TS) version of GLOC called GLOC-TS that chooses an arm x_t = arg max_{x∈X_t} x⊤θ̇_t, where θ̇_t ∼ N(θ̂_{t−1}, β_{t−1}V^{−1}_{t−1}). TS is known to perform well in practice [8] and can solve the polytope arm set case in polynomial time,⁵ whereas algorithms that solve an objective function like (3) (e.g., [1]) cannot, since they have to solve an NP-hard problem [5]. We present the regret bound of GLOC-TS below. Due to space constraints, we present the pseudocode and the full version of the result in the SM.

Theorem 4. (Informal) If we run GLOC-TS with θ̇_t ∼ N(θ̂_{t−1}, β^ONS_{t−1}V^{−1}_{t−1}), then Regret^GLOC-TS_T = Ô( (L(L+R)/κ) d^{3/2}√T log^{3/2}(T) ) w.h.p. If η_t is bounded by R̄, then Regret^GLOC-TS_T = Ô( (L(L+R̄)/κ) d^{3/2}√T log(T) ).

Notice that the regret now scales with d^{3/2}, as expected from the analysis of linear TS [4], which is higher than the scaling with d of GLOC. This is concerning in the interactive retrieval or product recommendation scenario since the relevance of the shown items is harmed, which makes us wonder if one can improve the regret without losing the hash-amenability.

Quadratic GLOC  We now propose a new hash-amenable algorithm called Quadratic GLOC (QGLOC). Recall that GLOC chooses the arm x^GLOC_t by (7). Define r := min_{x∈X} ||x||_2 and

m_{t−1} := min_{x : ||x||_2 ∈ [r,1]} ||x||_{V^{−1}_{t−1}},    (9)

which is r times the square root of the smallest eigenvalue of V^{−1}_{t−1}. It is easy to see that m_{t−1} ≤ ||x||_{V^{−1}_{t−1}} for all x ∈ X and that m_{t−1} ≥ r/√(t + λ) using the definition of V_{t−1}. There is an alternative way to define m_{t−1} without relying on r, which we present in the SM.

Let c_0 > 0 be the exploration-exploitation tradeoff parameter (elaborated upon later). At time t, QGLOC chooses the arm

x^QGLOC_t := arg max_{x∈X_t} ⟨θ̂_{t−1}, x⟩ + (β^{1/4}_{t−1} / (4c_0 m_{t−1})) ||x||²_{V^{−1}_{t−1}} = arg max_{x∈X_t} ⟨q_t, φ(x)⟩,    (10)

where q_t = [θ̂_{t−1}; vec( (β^{1/4}_{t−1}/(4c_0 m_{t−1})) V^{−1}_{t−1} )] ∈ R^{d+d²} and φ(x) := [x; vec(xx⊤)]. The key property of QGLOC is that its objective function is now quadratic in x (hence the name Quadratic GLOC) and can be written as an inner product. Thus, QGLOC is hash-amenable. We present the regret bound of QGLOC (10) in Theorem 5. The key step of the proof is that the QGLOC objective function (10) plus c_0 β^{3/4}_{t−1} m_{t−1} is a tight upper bound of the GLOC objective function (7).

Theorem 5. Run QGLOC with C^ONS_t. Then, w.p. at least 1 − 2δ, Regret^QGLOC_T = O( ( (1/c_0)((L+R)/κ)^{1/2} + c_0((L+R)/κ)^{3/2} ) L d^{5/4} √T log²(T) ). By setting c_0 = ((L+R)/κ)^{−1/2}, the regret bound is O( (L(L+R)/κ) d^{5/4} √T log²(T) ).

Note that one can obtain a better dependence on log T when η_t is bounded (available in the proof). The regret bound of QGLOC is a d^{1/4}-factor improvement over that of GLOC-TS; see Table 1. Furthermore, c_0 in (10) is a free parameter that adjusts the balance between exploitation (the first term) and exploration (the second term). Interestingly, the regret guarantee does not break down when adjusting c_0 in Theorem 5. Such a characteristic is not found in existing algorithms but is attractive to practitioners, which we elaborate on in the SM.

Maximum Inner Product Search (MIPS) Hashing  While MIPS hashing algorithms such as [35, 36, 30] can solve (10) in time sublinear in N, they necessarily introduce an approximation error. Ideally, one would like the following guarantee on the error, with probability at least 1 − δ_H:

Definition 1. Let X ⊆ R^{d′} satisfy |X| < ∞. A data point x̃ ∈ X is called c_H-MIPS w.r.t. a given query q if it satisfies ⟨q, x̃⟩ ≥ c_H · max_{x∈X} ⟨q, x⟩ for some c_H < 1. An algorithm is called c_H-MIPS if, given a query q ∈ R^{d′}, it retrieves x ∈ X that is c_H-MIPS w.r.t. q.

Unfortunately, existing MIPS algorithms do not directly offer such a guarantee, and one must build a series of hashing schemes with varying hashing parameters like Har-Peled et al. [18]. Under the fixed budget setting T, we elaborate our construction, which is simpler than [18], in the SM.

⁵ The ConfidenceBall₁ algorithm of Dani et al. [11] can solve the problem in polynomial time as well.
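The reduction of the quadratic objective (10) to a single inner product can be sanity-checked numerically: with φ(x) = [x; vec(xx⊤)] and q built from θ̂ and the scaled V⁻¹, ⟨q, φ(x)⟩ reproduces ⟨θ̂, x⟩ + c‖x‖²_{V⁻¹}. The sketch below (numpy assumed) uses made-up values for β, c_0, and m_{t−1}:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
B = rng.normal(size=(d, d))
V = B @ B.T + np.eye(d)              # stand-in for V_{t-1}: positive definite
V_inv = np.linalg.inv(V)
theta_hat = rng.normal(size=d)
beta, c0, m = 2.0, 1.0, 0.5          # hypothetical beta_{t-1}, c_0, m_{t-1}
c = beta ** 0.25 / (4 * c0 * m)      # coefficient of the quadratic term in (10)

# query q_t in R^{d + d^2} and feature map phi(x) = [x; vec(x x^T)]
q = np.concatenate([theta_hat, c * V_inv.flatten()])
phi = lambda x: np.concatenate([x, np.outer(x, x).flatten()])

x = rng.normal(size=d)
lhs = q @ phi(x)
rhs = theta_hat @ x + c * (x @ V_inv @ x)   # the QGLOC objective (10)
assert abs(lhs - rhs) < 1e-9
```

The identity used is vec(A)⊤vec(xx⊤) = x⊤Ax, which is what makes the criterion a valid MIPS query over the lifted (d + d²)-dimensional arm representations.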
Let $\rho^* < 1$ be an optimized value for the hashing (see [35] for detail). The time complexity for $d'$-dimensional vectors is $O\big(\frac{\log(dT)}{\log(c_H^{-1})} N^{\rho^*} \log(N) d'\big)$, and the space complexity (except the original data) is $O\big(\frac{\log(dT)}{\log(c_H^{-1})} (N + \log(N) d')\big)$. While the time and space complexity grow with the time horizon $T$, the dependence is mild: $\log\log(T)$ and $\log(T)$, respectively. QGLOC uses $d' = d + d^2$,^6 and GLOC-TS uses $d' = d$. While both achieve a time complexity sublinear in $N$, the time complexity of GLOC-TS scales with $d$, which is better than the $d^2$ scaling of QGLOC. However, GLOC-TS has a $d^{1/4}$-factor worse regret bound than QGLOC.
Discussion   While it is reasonable to incur small errors in solving arm selection criteria like (10) and sacrifice some regret in practice, the regret bounds of QGLOC and GLOC-TS then no longer hold. Though not the focus of our paper, we prove a regret bound under the presence of the hashing error in the fixed-budget setting for QGLOC; see SM. Although the result therein has an inefficient space complexity that is linear in $T$, it provides, to our knowledge, the first low regret bound with time sublinear in $N$.
5 Approximate Inner Product Computations with L1 Sampling
While hashing allows a time complexity sublinear in $N$, it performs an additional computation for determining the hash keys. Consider a hashing with $U$ tables and length-$k$ hash keys. Given a query $q$ and projection vectors $a^{(1)}, \ldots, a^{(Uk)}$, the hashing computes $q^\top a^{(i)}, \forall i \in [Uk]$ to determine the hash key of $q$.
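For concreteness, here is a sign-random-projection sketch of this hash-key computation (an illustrative LSH-style scheme of ours with hypothetical sizes; the paper's construction builds on MIPS-adapted hashing [35]). Each of the $U$ tables derives its key from $k$ of the $Uk$ inner products:

```python
import numpy as np

rng = np.random.default_rng(2)
d_prime, U, k = 16, 4, 8                     # dimension, #tables, key length (hypothetical)
A = rng.standard_normal((U, k, d_prime))     # Gaussian projection vectors a^{(1)}..a^{(Uk)}

def hash_keys(q):
    # U*k inner products q^T a^{(i)}; each table's key packs the k sign bits into an int
    bits = (np.einsum('ukd,d->uk', A, q) >= 0).astype(int)
    return [int(''.join(map(str, row)), 2) for row in bits]

q = rng.standard_normal(d_prime)
keys = hash_keys(q)
assert len(keys) == U and all(0 <= key < 2 ** k for key in keys)
```

The $Uk$ inner products, each costing $O(d')$, are the overhead that the approximate inner product methods discussed next aim to reduce.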
To reduce such an overhead, approximate inner product methods like [22, 24] are attractive since hash keys are determined by discretizing the inner products; small inner product errors often do not alter the hash keys.
Figure 1: (a) A box plot of estimators. L1 and L2 have the same variance, but L2 has thicker tails. (b) The frequency of L1 inducing smaller variance than L2 in 1000 trials. After 100 dimensions, L1 mostly has smaller variance than L2.
In this section, we propose an improved approximate inner product method called L1 sampling, which we claim is more accurate than the sampling proposed by Jain et al. [22], which we call L2 sampling. Consider an inner product $q^\top a$. The main idea is to construct an unbiased estimate of $q^\top a$. That is, let $p \in \mathbb{R}^{d'}$ be a probability vector. Let

$$i_k \overset{\mathrm{i.i.d.}}{\sim} \mathrm{Multinomial}(p) \quad\text{and}\quad G_k := q_{i_k} a_{i_k} / p_{i_k}, \; k \in [m] \,. \qquad (11)$$

It is easy to see that $\mathbb{E} G_k = q^\top a$. By taking $\frac{1}{m}\sum_{k=1}^m G_k$ as an estimate of $q^\top a$, the time complexity is now $O(mUk)$ rather than $O(d'Uk)$. The key is to choose the right $p$. L2 sampling uses $p^{(L2)} := [q_i^2/\|q\|_2^2]_i$. Departing from L2, we propose $p^{(L1)}$, which we call L1 sampling and define as follows:

$$p^{(L1)} := [|q_1|; \cdots; |q_{d'}|] / \|q\|_1 \,. \qquad (12)$$

We compare L1 with L2 from two different points of view. Due to space constraints, we summarize the key ideas and defer the details to SM.
The first is their concentration of measure. Lemma 1 below shows an error bound for L1 whose failure probability decays exponentially in $m$. This is in contrast to the polynomially decaying failure probability of L2 [22], which is inferior.^7
Lemma 1. Define $G_k$ as in (11) with $p = p^{(L1)}$.
Then, given a target error $\epsilon > 0$,

$$\mathbb{P}\left( \left| \frac{1}{m} \sum_{k=1}^m G_k - q^\top a \right| \ge \epsilon \right) \le 2 \exp\left( -\frac{m \epsilon^2}{2 \|q\|_1^2 \|a\|_{\max}^2} \right) \,. \qquad (13)$$

^6 Note that this does not mean we need to store $\mathrm{vec}(xx^\top)$ since an inner product with it is structured.
^7 In fact, one can show a bound for L2 that fails with exponentially-decaying probability. However, the bound introduces a constant that can be arbitrarily large, which makes the tails thick. We provide details on this in SM.
Cumulative regret of the hash-amenable algorithms: QGLOC 266.6 ($\pm$19.7); QGLOC-Hash 285.0 ($\pm$30.3); GLOC-TS 277.0 ($\pm$36.1); GLOC-TS-Hash 289.1 ($\pm$28.1).
Figure 2: Cumulative regrets with confidence intervals under the (a) logit and (b) probit model. (c) Cumulative regrets with confidence intervals of hash-amenable algorithms.
To illustrate such a difference, we fix $q$ and $a$ in 1000 dimensions and apply L2 and L1 sampling 20K times each with $m = 5$, where we scale down the L2 distribution so its variance matches that of L1. Figure 1(a) shows that L2 has thicker tails than L1. Note this is not a pathological case but a typical case for Gaussian $q$ and $a$. This confirms our claim that L1 is safer than L2.
Another point of comparison is the variance of L2 and L1. We show in SM that the variance of L1 may or may not be larger than that of L2; there is no absolute winner. However, if $q$ and $a$ follow a Gaussian distribution, then L1 induces smaller variances than L2 for large enough $d$; see Lemma 9 in SM. Figure 1(b) confirms such a result. The actual gap between the variance of L2 and L1 is also nontrivial under the Gaussian assumption.
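The estimator in (11)–(12) is short to implement; the NumPy sketch below (ours, with hypothetical sizes) checks its unbiasedness empirically by averaging many independent estimates:

```python
import numpy as np

rng = np.random.default_rng(3)
d_prime = 50
q = rng.standard_normal(d_prime)
a = rng.standard_normal(d_prime)

p_L1 = np.abs(q) / np.abs(q).sum()           # p^(L1) from (12)

def l1_estimate(q, a, p, m):
    # Draw i_1, ..., i_m i.i.d. from Multinomial(p); G_k = q_{i_k} a_{i_k} / p_{i_k}  (11)
    idx = rng.choice(len(q), size=m, p=p)
    return float(np.mean(q[idx] * a[idx] / p[idx]))

# E[G_k] = sum_i p_i (q_i a_i / p_i) = q^T a, so a large average should recover q^T a
est = np.mean([l1_estimate(q, a, p_L1, m=5) for _ in range(20000)])
assert abs(est - q @ a) < 1.0
```

Note that $|G_k| \le \|q\|_1 \|a\|_{\max}$ under $p^{(L1)}$, which is exactly the boundedness that yields the Hoeffding-type tail in (13).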
For instance, with $d = 200$, the average variance of $G_k$ induced by L2 is 0.99 whereas that induced by L1 is 0.63. Although a stochastic assumption on the vectors being inner-producted is often unrealistic, in our work we deal with projection vectors $a$ that are truly normally distributed.
6 Experiments
We now show our experimental results comparing GLB algorithms and hash-amenable algorithms.
GLB Algorithms   We compare GLOC with two different algorithms: UCB-GLM [28] and Online Learning for Logit Model (OL2M) [41].^8 For each trial, we draw $\theta^* \in \mathbb{R}^d$ and $N$ arms ($\mathcal{X}$) uniformly at random from the unit sphere. We set $d = 10$ and $\mathcal{X}_t = \mathcal{X}, \forall t \ge 1$. Note it is a common practice to scale the confidence set radius for bandits [8, 27]. Following Zhang et al. [41], for OL2M we set the squared radius $\gamma_t = c \log(\det(Z_t)/\det(Z_1))$, where $c$ is a tuning parameter. For UCB-GLM, we set the radius as $\alpha = \sqrt{cd \log t}$. For GLOC, we replace $\beta_t^{\mathrm{ONS}}$ with $c \sum_{s=1}^t g_s^2 \|x_s\|^2_{A_s^{-1}}$. While parameter tuning in practice is nontrivial, for the sake of comparison we tune $c \in \{10^1, 10^{0.5}, \ldots, 10^{-3}\}$ and report the best one. We perform 40 trials up to time $T = 3000$ for each method and compute confidence bounds on the regret.
We consider two GLM rewards: (i) the logit model (the Bernoulli GLM) and (ii) the probit model (non-canonical GLM) for 0/1 rewards that sets $\mu$ as the probit function. Since OL2M is for the logit model only, we expect to see the consequences of model mismatch in the probit setting. For GLOC and UCB-GLM, we specify the correct reward model. We plot the cumulative regret under the logit model in Figure 2(a). All three methods perform similarly, and we do not find any statistically significant difference based on a paired t-test. The result for the probit model in Figure 2(b) shows that OL2M indeed has higher regret than both GLOC and UCB-GLM due to the model mismatch in the probit setting.
Specifically, we verify that at $t = 3000$ the difference between the regret of UCB-GLM and OL2M is statistically significant. Furthermore, OL2M exhibits a significantly higher variance in the regret, which is unattractive in practice. This shows the importance of being generalizable to any GLM reward. Note we observe a large increase in running time for UCB-GLM compared to OL2M and GLOC.
Hash-Amenable GLBs   To compare hash-amenable GLBs, we use the logit model as above but now with $N = 100{,}000$ and $T = 5000$. We run QGLOC, QGLOC with hashing (QGLOC-Hash), GLOC-TS, and GLOC-TS with hashing (GLOC-TS-Hash), where we use the hashing to compute the objective function (e.g., (10)) on just 1% of the data points and save a significant amount of computation. Details on our hashing implementation are found in SM. Figure 2(c) summarizes the result. We observe that QGLOC-Hash and GLOC-TS-Hash increase regret from QGLOC and GLOC-TS, respectively, but only moderately, which shows the efficacy of hashing.
7 Future Work
In this paper, we have proposed scalable algorithms for the GLB problem: (i) for large time horizon $T$ and (ii) for large number $N$ of arms. There are a number of interesting directions for future work. First, we would like to extend the GLM rewards to the single index models [23] so one does not need to know the function $\mu$ ahead of time under mild assumptions. Second, closing the regret bound gap between QGLOC and GLOC without losing hash-amenability would be interesting: i.e., develop a hash-amenable GLB algorithm with $O(d\sqrt{T})$ regret. In this direction, a first attempt could be to design a hashing scheme that can directly solve (7) approximately.
^8 We have chosen UCB-GLM over GLM-UCB of Filippi et al. [15] as UCB-GLM has a lower regret bound.
Acknowledgments   This work was partially supported by the NSF grant IIS-1447449 and the MURI grant 2015-05174-04.
The authors thank Yasin Abbasi-Yadkori and Anshumali Shrivastava for providing constructive feedback and Xin Hunt for her contribution at the initial stage.
References
[1] Abbasi-Yadkori, Yasin, Pal, David, and Szepesvari, Csaba. Improved Algorithms for Linear Stochastic Bandits. Advances in Neural Information Processing Systems (NIPS), pp. 1-19, 2011.
[2] Abbasi-Yadkori, Yasin, Pal, David, and Szepesvari, Csaba. Online-to-Confidence-Set Conversions and Application to Sparse Stochastic Bandits. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.
[3] Abeille, Marc and Lazaric, Alessandro. Linear Thompson Sampling Revisited. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), volume 54, pp. 176-184, 2017.
[4] Agrawal, Shipra and Goyal, Navin. Thompson Sampling for Contextual Bandits with Linear Payoffs. CoRR, abs/1209.3, 2012.
[5] Agrawal, Shipra and Goyal, Navin. Thompson Sampling for Contextual Bandits with Linear Payoffs. In Proceedings of the International Conference on Machine Learning (ICML), pp. 127-135, 2013.
[6] Ahukorala, Kumaripaba, Medlar, Alan, Ilves, Kalle, and Glowacka, Dorota. Balancing Exploration and Exploitation: Empirical Parameterization of Exploratory Search Systems. In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM), pp. 1703-1706, 2015.
[7] Auer, Peter and Long, M. Using Confidence Bounds for Exploitation-Exploration Trade-offs. Journal of Machine Learning Research, 3:397-422, 2002.
[8] Chapelle, Olivier and Li, Lihong. An Empirical Evaluation of Thompson Sampling. In Advances in Neural Information Processing Systems (NIPS), pp. 2249-2257, 2011.
[9] Chu, Wei, Li, Lihong, Reyzin, Lev, and Schapire, Robert E. Contextual Bandits with Linear Payoff Functions.
In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), volume 15, pp. 208-214, 2011.
[10] Crammer, Koby and Gentile, Claudio. Multiclass Classification with Bandit Feedback Using Adaptive Regularization. Machine Learning, 90(3):347-383, 2013.
[11] Dani, Varsha, Hayes, Thomas P, and Kakade, Sham M. Stochastic Linear Optimization under Bandit Feedback. In Proceedings of the Conference on Learning Theory (COLT), pp. 355-366, 2008.
[12] Datar, Mayur, Immorlica, Nicole, Indyk, Piotr, and Mirrokni, Vahab S. Locality-sensitive Hashing Scheme Based on P-stable Distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, pp. 253-262, 2004.
[13] Dekel, Ofer, Gentile, Claudio, and Sridharan, Karthik. Robust selective sampling from single and multiple teachers. In Proceedings of the Conference on Learning Theory (COLT), 2010.
[14] Dekel, Ofer, Gentile, Claudio, and Sridharan, Karthik. Selective sampling and active learning from single and multiple teachers. Journal of Machine Learning Research, 13:2655-2697, 2012.
[15] Filippi, Sarah, Cappe, Olivier, Garivier, Aurélien, and Szepesvári, Csaba. Parametric Bandits: The Generalized Linear Case. In Advances in Neural Information Processing Systems (NIPS), pp. 586-594, 2010.
[16] Gentile, Claudio and Orabona, Francesco. On Multilabel Classification and Ranking with Bandit Feedback. Journal of Machine Learning Research, 15:2451-2487, 2014.
[17] Guo, Ruiqi, Kumar, Sanjiv, Choromanski, Krzysztof, and Simcha, David. Quantization based Fast Inner Product Search. Journal of Machine Learning Research, 41:482-490, 2016.
[18] Har-Peled, Sariel, Indyk, Piotr, and Motwani, Rajeev. Approximate nearest neighbor: towards removing the curse of dimensionality. Theory of Computing, 8:321-350, 2012.
[19] Hazan, Elad and Karnin, Zohar.
Volumetric Spanners: An Efficient Exploration Basis for Learning. Journal of Machine Learning Research, 17(119):1-34, 2016.
[20] Hazan, Elad, Agarwal, Amit, and Kale, Satyen. Logarithmic Regret Algorithms for Online Convex Optimization. Machine Learning, 69(2-3):169-192, 2007.
[21] Hofmann, Katja, Whiteson, Shimon, and de Rijke, Maarten. Contextual Bandits for Information Retrieval. In NIPS Workshop on Bayesian Optimization, Experimental Design and Bandits: Theory and Applications, 2011.
[22] Jain, Prateek, Vijayanarasimhan, Sudheendra, and Grauman, Kristen. Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning. In Advances in Neural Information Processing Systems (NIPS), pp. 928-936, 2010.
[23] Kalai, Adam Tauman and Sastry, Ravi. The Isotron Algorithm: High-Dimensional Isotonic Regression. In Proceedings of the Conference on Learning Theory (COLT), 2009.
[24] Kannan, Ravindran, Vempala, Santosh, and others. Spectral algorithms. Foundations and Trends in Theoretical Computer Science, 4(3-4):157-288, 2009.
[25] Konyushkova, Ksenia and Glowacka, Dorota. Content-based image retrieval with hierarchical Gaussian Process bandits with self-organizing maps. In 21st European Symposium on Artificial Neural Networks, 2013.
[26] Li, Lihong, Chu, Wei, Langford, John, and Schapire, Robert E. A Contextual-Bandit Approach to Personalized News Article Recommendation. Proceedings of the International Conference on World Wide Web (WWW), pp. 661-670, 2010.
[27] Li, Lihong, Chu, Wei, Langford, John, Moon, Taesup, and Wang, Xuanhui. An Unbiased Offline Evaluation of Contextual Bandit Algorithms with Generalized Linear Models. In Proceedings of the Workshop on On-line Trading of Exploration and Exploitation 2, volume 26, pp. 19-36, 2012.
[28] Li, Lihong, Lu, Yu, and Zhou, Dengyong. Provable Optimal Algorithms for Generalized Linear Contextual Bandits.
CoRR, abs/1703.0, 2017.
[29] McCullagh, P and Nelder, J A. Generalized Linear Models. London, 1989.
[30] Neyshabur, Behnam and Srebro, Nathan. On Symmetric and Asymmetric LSHs for Inner Product Search. Proceedings of the International Conference on Machine Learning (ICML), 37:1926-1934, 2015.
[31] Orabona, Francesco, Cesa-Bianchi, Nicolo, and Gentile, Claudio. Beyond Logarithmic Bounds in Online Learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), volume 22, pp. 823-831, 2012.
[32] Qin, Lijing, Chen, Shouyuan, and Zhu, Xiaoyan. Contextual Combinatorial Bandit and its Application on Diversified Online Recommendation. In SDM, pp. 461-469, 2014.
[33] Rui, Yong, Huang, T S, Ortega, M, and Mehrotra, S. Relevance feedback: a power tool for interactive content-based image retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 8(5):644-655, 1998.
[34] Rusmevichientong, Paat and Tsitsiklis, John N. Linearly Parameterized Bandits. Mathematics of Operations Research, 35(2):395-411, 2010.
[35] Shrivastava, Anshumali and Li, Ping. Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS). Advances in Neural Information Processing Systems 27, pp. 2321-2329, 2014.
[36] Shrivastava, Anshumali and Li, Ping. Improved Asymmetric Locality Sensitive Hashing (ALSH) for Maximum Inner Product Search (MIPS). In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), pp. 812-821, 2015.
[37] Slaney, Malcolm, Lifshits, Yury, and He, Junfeng. Optimal parameters for locality-sensitive hashing. Proceedings of the IEEE, 100(9):2604-2623, 2012.
[38] Wainwright, Martin J and Jordan, Michael I. Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.
[39] Wang, Jingdong, Shen, Heng Tao, Song, Jingkuan, and Ji, Jianqiu.
Hashing for Similarity Search: A Survey. CoRR, abs/1408.2, 2014.
[40] Yue, Yisong, Hong, Sue Ann Sa, and Guestrin, Carlos. Hierarchical exploration for accelerating contextual bandits. Proceedings of the International Conference on Machine Learning (ICML), pp. 1895-1902, 2012.
[41] Zhang, Lijun, Yang, Tianbao, Jin, Rong, Xiao, Yichi, and Zhou, Zhi-hua. Online Stochastic Linear Optimization under One-bit Feedback. In Proceedings of the International Conference on Machine Learning (ICML), volume 48, pp. 392-401, 2016.