{"title": "Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 891, "page_last": 899, "abstract": "We present an efficient algorithm for the problem of online multiclass prediction with bandit feedback in the fully adversarial setting. We measure its regret with respect to the log-loss defined in \\cite{AbernethyR09}, which is parameterized by a scalar \\(\\alpha\\). We prove that the regret of \\newtron is \\(O(\\log T)\\) when \\(\\alpha\\) is a constant that does not vary with horizon \\(T\\), and at most \\(O(T^{2/3})\\) if \\(\\alpha\\) is allowed to increase to infinity with \\(T\\). For \\(\\alpha\\) = \\(O(\\log T)\\), the regret is bounded by \\(O(\\sqrt{T})\\), thus solving the open problem of \\cite{KST08, AbernethyR09}. Our algorithm is based on a novel application of the online Newton method \\cite{HAK07}. We test our algorithm and show it to perform well in experiments, even when \\(\\alpha\\) is a small constant.", "full_text": "NEWTRON: an Ef\ufb01cient Bandit algorithm for Online\n\nMulticlass Prediction\n\nElad Hazan\n\nDepartment of Industrial Engineering\n\nTechnion - Israel Institute of Technology\n\nHaifa 32000 Israel\n\nehazan@ie.technion.ac.il\n\nskale@yahoo-inc.com\n\nSatyen Kale\n\nYahoo! Research\n\n4301 Great America Parkway\n\nSanta Clara, CA 95054\n\nAbstract\n\nWe present an ef\ufb01cient algorithm for the problem of online multiclass prediction\nwith bandit feedback in the fully adversarial setting. We measure its regret with\nrespect to the log-loss de\ufb01ned in [AR09], which is parameterized by a scalar \u03b1.\nWe prove that the regret of NEWTRON is O(log T ) when \u03b1 is a constant that does\nnot vary with horizon T , and at most O(T 2/3) if \u03b1 is allowed to increase to in\ufb01nity\nwith T . For \u03b1 = O(log T ), the regret is bounded by O(\nT ), thus solving the open\nproblem of [KSST08, AR09]. 
Our algorithm is based on a novel application of the online Newton method [HAK07]. We test our algorithm and show it to perform well in experiments, even when α is a small constant.\n\n1 Introduction\n\nClassification is a fundamental task of machine learning, and is by now well understood in its basic variants. Unlike the well-studied supervised learning setting, in many recent applications (such as recommender systems, ad selection algorithms, etc.) we only obtain limited feedback about the true label of the input (e.g., in recommender systems, we only get feedback on the recommended items). Several such problems can be cast as online, bandit versions of multiclass prediction problems1. The general framework, called the “contextual bandits” problem [LZ07], is as follows. In each round, the learner receives an input x in some high-dimensional feature space (the “context”), produces an action in response, and obtains an associated reward. The goal is to minimize regret with respect to a reference class of policies specifying actions for each context.\nIn this paper, we consider the special case of multiclass prediction, a fundamental problem in this area introduced by Kakade et al. [KSST08]. Here, a learner obtains a feature vector, which is associated with an unknown label y which can take one of k values. Then the learner produces a prediction of the label, ŷ. In response, only 1 bit of information is given: whether the label is correct or incorrect. The goal is to design an efficient algorithm that minimizes regret with respect to a natural reference class of policies: linear predictors. Kakade et al. [KSST08] gave an efficient algorithm, dubbed BANDITRON. Their algorithm attains regret of O(T^{2/3}) for a natural multiclass hinge loss, and they ask the question whether a better regret bound is possible. 
While the EXP4 algorithm [ACBFS03], applied to this setting, has an O(√(T log T)) regret bound, it is highly inefficient, requiring O(T^{n/2}) time per iteration, where n is the dimension of the feature space. Ideally, one would like to match or improve the O(√(T log T)) regret bound of the EXP4 algorithm with an efficient algorithm (for a suitable loss function).\nThis question has received considerable attention. In COLT 2009, Abernethy and Rakhlin [AR09] formulated the open question precisely as minimizing regret for a suitable loss function in the fully adversarial setting (and even offered a monetary reward for a resolution of the problem). Some special cases have been successfully resolved: the original paper of [KSST08] gives an O(√T) bound in the noiseless large-margin case. More recently, Crammer and Gentile [CG11] gave an O(√(T log T)) regret bound via an efficient algorithm based on the upper confidence bound method, under a semi-adversarial assumption on the labels: they are generated stochastically via a specific linear model (with unknown parameters which change over time). Yet the general (fully adversarial) case has been unresolved as of now.\n\n1For the basic bandit classification problem see [DHK07, RTB07, DH06, FKM05, AK08, MB04, AHR08].\n\nIn this paper we address this question and design a novel algorithm for the fully adversarial setting, with its expected regret measured with respect to the log-loss function defined in [AR09], which is parameterized by a scalar α. When α is a constant independent of T, we get a much stronger guarantee than required by the open problem: the regret is bounded by O(log T). In fact, the regret is bounded by O(√T) even for α = Θ(log T). 
Our regret bound for larger values of α increases smoothly to a maximum of O(T^{2/3}), matching that of BANDITRON in the worst case.\nThe algorithm is efficient to implement, and it is based on the online Newton method introduced in [HAK07]; hence we call the new algorithm NEWTRON. We implement the algorithm (and a faster variant, PNEWTRON) and test it on the same data sets used by Kakade et al. [KSST08]. The experiments show improved performance over the BANDITRON algorithm, even for α as small as 10.\n\n2 Preliminaries\n\n2.1 Notation\n\nLet [k] denote the set of integers {1, 2, . . . , k}, and ∆_k ⊆ R^k the set of distributions on [k]. For any n, let 1, 0 denote the all-1s and all-0s vectors in R^n respectively, and let I denote the identity matrix in R^{n×n}. For two (row or column) vectors v, w ∈ R^n, we denote by v · w their usual inner product, i.e. v · w = Σ_{i=1}^n v_i w_i. We denote by ||v|| the ℓ2 norm of v. For a vector v ∈ R^n, denote by diag(v) the diagonal matrix in R^{n×n} whose ith diagonal entry equals v_i.\nFor a matrix W ∈ R^{k×n}, denote by W_1, W_2, . . . , W_k its rows, which are (row) vectors in R^n. To avoid defining unnecessary notation, we will interchangeably use W to denote both a matrix in R^{k×n} and a (column) vector in R^{kn}. The vector form of the matrix W is formed by arranging its rows one after the other, and then taking the transpose (i.e., the vector [W_1|W_2|···|W_k]^T). Thus, for two matrices V and W, V · W denotes their inner product in their vector form. For i ∈ [n] and l ∈ [k], denote by E_il the matrix which has 1 in its (i, l)th entry, and 0 everywhere else.\nFor a matrix W, we denote by ||W|| the Frobenius norm of W, which is also the usual ℓ2 norm of the vector form of W, and so the notation is consistent. Also, we denote by ||W||_2 the spectral norm of W, i.e. 
the largest singular value of W.\nFor two matrices W and V, denote by W ⊗ V their Kronecker product [HJ91]. For two square symmetric matrices W, V of like order, denote by W ⪰ V the fact that W − V is positive semidefinite, i.e. all its eigenvalues are non-negative. A useful fact of the Kronecker product is the following: if W, V are symmetric matrices such that W ⪰ V, and if U is a positive semidefinite symmetric matrix, then W ⊗ U ⪰ V ⊗ U. This follows from the fact that if W, U are both symmetric positive semidefinite matrices, then so is their Kronecker product W ⊗ U.\n\n2.2 Problem setup\n\nLearning proceeds in rounds. In each round t, for t = 1, 2, . . . , T, we are presented a feature vector x_t ∈ X, where X ⊆ R^n and ||x|| ≤ R for all x ∈ X. Here R is some specified constant. Associated with x_t is an unknown label y_t ∈ [k]. We are required to produce a prediction, ŷ_t ∈ [k], as the label of x_t. In response, we obtain only 1 bit of information: whether ŷ_t = y_t or not. In particular, when ŷ_t ≠ y_t, the identity of y_t remains unknown (although one label, ŷ_t, is ruled out).\nThe learner's hypothesis class is parameterized by matrices W ∈ R^{k×n} with ||W|| ≤ D, for some specified constant D. Denote the set of such matrices by K. Given a matrix W ∈ K with rows W_1, W_2, . . . , W_k, the prediction associated with W for x_t is\n\nŷ_t = arg max_{i∈[k]} W_i · x_t.\n\nWhile ideally we would like to minimize the 0-1 loss suffered by the learner, for computational reasons it is preferable to consider convex loss functions. 
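The two prediction rules used in this setup — the deterministic argmax rule of the hypothesis class above, and the softmax distribution P(W, x) defined in the log-loss discussion below — can be sketched in NumPy as follows. This is only an illustration; the function names are ours, not from the paper.

```python
import numpy as np

def predict_argmax(W, x):
    # Deterministic prediction: the label i maximizing W_i . x.
    return int(np.argmax(W @ x))

def label_distribution(W, x, alpha):
    # Softmax distribution P(W, x)_i = exp(alpha W_i.x) / sum_j exp(alpha W_j.x).
    scores = alpha * (W @ x)
    scores -= scores.max()  # shift for numerical stability; leaves the softmax unchanged
    p = np.exp(scores)
    return p / p.sum()
```

As alpha grows, label_distribution concentrates on the argmax label, which is how the log-loss below interpolates toward the 0-1 loss.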
A natural choice, used in Kakade et al. [KSST08], is the multiclass hinge loss:\n\nℓ(W, (x_t, y_t)) = max_{i∈[k]\{y_t}} [1 − W_{y_t} · x_t + W_i · x_t]_+.\n\nOther suitable loss functions ℓ(·,·) may also be used. The ultimate goal of the learner is to minimize regret, i.e.\n\nRegret := Σ_{t=1}^T ℓ(W_t, (x_t, y_t)) − min_{W⋆∈K} Σ_{t=1}^T ℓ(W⋆, (x_t, y_t)).\n\nA different loss function was proposed in an open problem by Abernethy and Rakhlin in COLT 2009 [AR09]. We use this loss function in this paper and define it now.\nWe choose a constant α which parameterizes the loss function. Given a matrix W ∈ K and an example (x, y) ∈ X × [k], define the function P : K × X → ∆_k as\n\nP(W, x)_i = exp(α W_i · x) / Σ_j exp(α W_j · x).\n\nNow let p = P(W, x). Suppose we make our prediction ŷ_t by sampling from p. A natural loss function for this scheme is the log-loss, defined as follows:\n\nℓ(W, (x, y)) = −(1/α) log(p_y) = −(1/α) log( exp(α W_y · x) / Σ_j exp(α W_j · x) ) = −W_y · x + (1/α) log( Σ_j exp(α W_j · x) ).\n\nThe log-loss is always positive. As α becomes large, this log-loss function has the property that when the prediction given by W for x is correct, it is very close to zero, and when the prediction is incorrect, it is roughly proportional to the margin of the incorrect prediction over the correct one.\nThe algorithm and its analysis depend upon the gradient and Hessian of the loss function w.r.t. W. The following lemma derives these quantities (proof in the full version). Note that in the following, W is to be interpreted as a vector W ∈ R^{kn}.\nLemma 1. Fix a matrix W ∈ K and an example (x, y) ∈ X × [k], and let p = P(W, x). 
Then we have\n\n∇ℓ(W, (x, y)) = (p − e_y) ⊗ x and ∇²ℓ(W, (x, y)) = α (diag(p) − p pᵀ) ⊗ x xᵀ.\n\nIn the analysis, we need bounds on the smallest non-zero eigenvalue of the (diag(p) − p pᵀ) factor of the Hessian. Such bounds are given in the full version.2 For the sake of the analysis, however, the matrix inequality given in Lemma 2 below suffices. It is given in terms of a parameter δ, which is the minimum probability of a label in any distribution P(W, x).\nDefinition 1. Define δ := min_{W∈K, x∈X} min_i P(W, x)_i.\nWe have the following (loose) bound on δ, which follows easily using the fact that |W_i · x| ≤ RD:\n\nδ ≥ exp(−2αRD)/k. (1)\n\nLemma 2. Let W ∈ K be any weight matrix, and let H ∈ R^{k×k} be any symmetric matrix such that H1 = 0. Then we have\n\n∇²ℓ(W, (x, y)) ⪰ (αδ/||H||_2) H ⊗ x xᵀ.\n\n2Our earlier proof used Cheeger's inequality. We thank an anonymous referee for a simplified proof.\n\nAlgorithm 1 NEWTRON. Parameters: β, γ\n1: Initialize W′_1 = 0.\n2: for t = 1 to T do\n3:   Obtain the example x_t.\n4:   Let p_t = P(W′_t, x_t), and set p′_t = (1 − γ) · p_t + (γ/k) 1.\n5:   Output the label ŷ_t by sampling from p′_t. This is equivalent to playing W_t = W′_t with probability (1 − γ), and W_t = 0 with probability γ.\n6:   Obtain feedback, i.e. whether ŷ_t = y_t or not.\n7:   if ŷ_t = y_t then\n8:     Define ∇̃_t := ((1 − p_t(y_t))/p′_t(y_t)) · ((1/k) 1 − e_{y_t}) ⊗ x_t and κ_t := p′_t(y_t).\n9:   else\n10:     Define ∇̃_t := (p_t(ŷ_t)/p′_t(ŷ_t)) · (e_{ŷ_t} − (1/k) 1) ⊗ x_t and κ_t := 1.\n11:   end if\n12:   Define the cost function\n\nf_t(W) := ∇̃_t · (W − W′_t) + (1/2) κ_t β (∇̃_t · (W − W′_t))². (2)\n\n13:   Compute\n\nW′_{t+1} := arg min_{W∈K} Σ_{τ=1}^t f_τ(W) + (1/(2D)) ||W||². (3)\n\n14: end for\n\n2.3 The FTAL Lemma\n\nOur algorithm is based on the FTAL algorithm [HAK07]. This algorithm is an online version of the Newton step algorithm in offline optimization. The following lemma specifies the algorithm, specialized to our setting, and gives its regret bound. The proof is in the full version.\nLemma 3. Consider an online convex optimization problem over some convex, compact domain K ⊆ R^n of diameter D, with cost functions f_t(w) = (v_t · w − α_t) + (1/2) β_t (v_t · w − α_t)², where the vector v_t ∈ R^n and the scalars α_t, β_t are chosen by the adversary such that for some known parameters r, a, b, we have ||v_t|| ≤ r, β_t ≥ a, and |β_t (v_t · w − α_t)| ≤ b, for all w ∈ K. Then the algorithm that, in round t, plays\n\nw_t := arg min_{w∈K} Σ_{τ=1}^{t−1} f_τ(w)\n\nhas regret bounded by O((nb²/a) log(DraT/b)).\n\n3 The NEWTRON algorithm\n\nOur algorithm for bandit multiclass learning, dubbed NEWTRON, is shown as Algorithm 1 above. In each iteration, we randomly choose a label from the distribution specified by the current weight matrix on the current example, mixed with the uniform distribution over labels as specified by an exploration parameter γ. The parameter γ (which is similar to the exploration parameter used in the EXP3 algorithm of [ACBFS03]) is eventually tuned based on the value of the parameter α in the loss function (see Corollary 5). 
We then use the observed feedback to construct a quadratic loss function (which is strongly convex) that lower bounds the true loss function in expectation (see Lemma 7) and thus allows us to bound the regret. To do this, we construct a randomized estimator ∇̃_t for the gradient of the loss function at the current weight matrix. Furthermore, we also choose a parameter κ_t, an adjustment factor for the strong convexity of the quadratic loss function, ensuring that its expectation lower bounds the true loss function. Finally, we compute the new weight matrix using a Follow-The-Regularized-Leader strategy, by minimizing the sum of all quadratic loss functions so far with ℓ2 regularization. As described in [HAK07], this convex program can be solved in quadratic time, plus a projection on K in the norm induced by the Hessian.\n\nStatement and discussion of main theorem. To simplify notation, define the function ℓ_t : K → R as ℓ_t(W) = ℓ(W, (x_t, y_t)). Let E_t[·] denote the conditional expectation with respect to the σ-field F_t, where F_t is the smallest σ-field with respect to which the predictions ŷ_k, for k = 1, 2, . . . , t − 1, are measurable.\nWith this notation, we can state our main theorem giving the regret bound:\nTheorem 4. Given α, δ and γ ≤ 1/2, suppose we set β ≤ min{αδ/10 + η, 1/(4RD)} in the NEWTRON algorithm, for η = γ log(k)/(20αR²D²). Let ν = max{δ/2, γ/k}. The NEWTRON algorithm has the following bound on the expected regret:\n\nΣ_{t=1}^T E[ℓ_t(W_t)] − ℓ_t(W⋆) = O( (kn/(νβ)) log T + (γ log(k)/α) T ).\n\nBefore giving the proof of Theorem 4, we first state a corollary (a simple optimization of parameters, proved in the full version) which shows how γ in Theorem 4 can be set appropriately to get a smooth interpolation between O(log(T)) and O(T^{2/3}) regret based on the value of α.\nCorollary 5. Given α, there is a setting of γ so that the regret of NEWTRON is bounded by\n\nmin{ c (exp(4αRD)/α) log(T), 6cRD T^{2/3} },\n\nwhere the constant c = O(k³n) is independent of α.\n\nDiscussion of the bound. The parameter α is inherent to the log-loss function as defined in [AR09]. Our main result, as given in Corollary 5, which entails logarithmic regret for constant α, contains a constant which depends exponentially on α. Empirically, it seems that α can be set to a small constant, say 10 (see Section 4), and still give good performance.\nNote that even when α grows with T, as long as α ≤ (1/(8RD)) log(T), the regret can be bounded as O(cRD√T), thus solving the open problem of [KSST08, AR09] for log-loss functions with this range of α.\nWe can say something even stronger - our results provide a “safety net” - no matter what the value of α is, the regret of our algorithm is never worse than O(T^{2/3}), matching the bound of the BANDITRON algorithm (although the latter holds for the multiclass hinge loss).\n\nAnalysis.\n\nProof. (Theorem 4.) The optimization (3) is essentially running the algorithm from Lemma 3 on K with the cost functions f_t(W), with additional nk initial fictitious cost functions (1/(2D))(E_il · W)² for i ∈ [n] and l ∈ [k]. 
These fictitious cost functions can be thought of as regularization. While technically these fictitious cost functions are not necessary to prove our regret bound, we include them since this seems to give better experimental performance and only adds a constant to the regret.\nWe now apply the regret bound of Lemma 3 by estimating the parameters r, a, b. This is a simple technical calculation, done in Lemma 6 below, which yields the values r = R/ν, a = βν, b = 1. Hence, the regret bound of Lemma 3 implies that for any W⋆ ∈ K,\n\nΣ_{t=1}^T f_t(W′_t) − f_t(W⋆) = O( (kn/(νβ)) log T ).\n\nNote that the bound above excludes the fictitious cost functions, since they only add a constant additive term to the regret, which is absorbed by the O(log T) term. Similarly, we have also suppressed additive constants arising from the log(DraT/b) term in the regret bound of Lemma 3.\nTaking expectation on both sides of the above bound with respect to the randomness in the algorithm, and using the specification (2) of f_t(W), we get\n\nE[ Σ_{t=1}^T ∇̃_t · (W′_t − W⋆) − (1/2) κ_t β (∇̃_t · (W′_t − W⋆))² ] = O( (kn/(νβ)) log T ). (4)\n\nBy Lemma 7 below, we get that\n\nℓ_t(W′_t) − ℓ_t(W⋆) ≤ E_t[ ∇̃_t · (W′_t − W⋆) − (1/2) κ_t β (∇̃_t · (W′_t − W⋆))² ] + 20ηR²D². (5)\n\nFurthermore, we have\n\nE_t[ℓ_t(W_t)] − ℓ_t(W′_t) ≤ γ log(k)/α, (6)\n\nsince W_t = W′_t with probability (1 − γ) and W_t = 0 with probability γ, and ℓ_t(0) = log(k)/α.\nPlugging (5) and (6) in (4), and using η = γ log(k)/(20αR²D²), we get\n\nΣ_{t=1}^T E[ℓ_t(W_t)] − ℓ_t(W⋆) = O( (kn/(νβ)) log T + (γ log(k)/α) T ).\n\nWe now state two lemmas that were used in the proof of Theorem 4. The first one (proof in the full version) obtains parameter settings to use Lemma 3 in Theorem 4.\nLemma 6. Assume β ≤ 1/(4RD) and γ ≤ 1/2. Let ν = max{δ/2, γ/k}. Then the following are valid settings for the parameters r, a, b: r = R/ν, a = βν and b = 1.\nThe next lemma shows that in each round, the expected regret of the inner FTAL algorithm with the f_t cost functions is larger than the regret of NEWTRON.\nLemma 7. For β = αδ/10 + η and γ ≤ 1/2, we have\n\nℓ_t(W′_t) − ℓ_t(W⋆) ≤ E_t[ ∇̃_t · (W′_t − W⋆) − (1/2) κ_t β (∇̃_t · (W′_t − W⋆))² ] + 20ηR²D².\n\nProof. The intuition behind the proof is the following. We show that E_t[∇̃_t] = (p_t − e_{y_t}) ⊗ x_t, which by Lemma 1 equals ∇ℓ_t(W′_t). Next, we show that E_t[κ_t ∇̃_t ∇̃_tᵀ] = H_t ⊗ x_t x_tᵀ for some matrix H_t s.t. H_t 1 = 0. By upper bounding ||H_t||_2, we then show (using Lemma 2) that for any Ψ ∈ K we have\n\n∇²ℓ_t(Ψ) ⪰ β H_t ⊗ x_t x_tᵀ.\n\nThe stated bound then follows by an application of Taylor's theorem.\nThe technical details for the proof are as follows. 
First, note that\n\nE_t[∇̃_t · (W′_t − W⋆)] = E_t[∇̃_t] · (W′_t − W⋆). (7)\n\nWe now compute E_t[∇̃_t]:\n\nE_t[∇̃_t] = [ p′_t(y_t) · ((1 − p_t(y_t))/p′_t(y_t)) · ((1/k) 1 − e_{y_t}) + Σ_{y≠y_t} p′_t(y) · (p_t(y)/p′_t(y)) · (e_y − (1/k) 1) ] ⊗ x_t = (p_t − e_{y_t}) ⊗ x_t. (8)\n\nNext, we have\n\nE_t[κ_t (∇̃_t · (W′_t − W⋆))²] = (W′_t − W⋆)ᵀ E_t[κ_t ∇̃_t ∇̃_tᵀ] (W′_t − W⋆). (9)\n\nWe now compute E_t[κ_t ∇̃_t ∇̃_tᵀ]:\n\nE_t[κ_t ∇̃_t ∇̃_tᵀ] = [ p′_t(y_t) · κ_t · ((1 − p_t(y_t))/p′_t(y_t))² · ((1/k) 1 − e_{y_t})((1/k) 1 − e_{y_t})ᵀ + Σ_{y≠y_t} p′_t(y) · (p_t(y)/p′_t(y))² · (e_y − (1/k) 1)(e_y − (1/k) 1)ᵀ ] ⊗ x_t x_tᵀ =: H_t ⊗ x_t x_tᵀ, (10)\n\nwhere H_t is the matrix in the brackets above. We note a few facts about H_t. First, note that (e_y − (1/k) 1) · 1 = 0, and so H_t 1 = 0. Next, the spectral norm (i.e. largest eigenvalue) of H_t is bounded as:\n\n||H_t||_2 ≤ ||(1/k) 1 − e_{y_t}||² + Σ_{y≠y_t} (p′_t(y)/(1 − γ)²) ||e_y − (1/k) 1||² ≤ 10,\n\nfor γ ≤ 1/2. 
Now, for any Ψ ∈ K, by Lemma 2, for the specified value of β we have\n\n∇²ℓ_t(Ψ) ⪰ (αδ/10) H_t ⊗ x_t x_tᵀ. (11)\n\nNow, by Taylor's theorem, for some Ψ on the line segment connecting W′_t to W⋆, we have\n\nℓ_t(W⋆) − ℓ_t(W′_t) = ∇ℓ_t(W′_t) · (W⋆ − W′_t) + (1/2)(W⋆ − W′_t)ᵀ[∇²ℓ_t(Ψ)](W⋆ − W′_t) ≥ ((p_t − e_{y_t}) ⊗ x_t) · (W⋆ − W′_t) + (1/2)(W⋆ − W′_t)ᵀ[(αδ/10) H_t ⊗ x_t x_tᵀ](W⋆ − W′_t), (12)\n\nwhere the last inequality follows from (11). Finally, we have\n\n(1/2)(W⋆ − W′_t)ᵀ[η H_t ⊗ x_t x_tᵀ](W⋆ − W′_t) ≤ (1/2) η ||H_t ⊗ x_t x_tᵀ||_2 ||W⋆ − W′_t||² ≤ 20ηR²D², (13)\n\nsince ||W⋆ − W′_t|| ≤ 2D. Adding inequalities (12) and (13), rearranging the result and using (7), (8), (9), and (10) gives the stated bound.\n\n4 Experiments\n\nWhile the theoretical regret bound for NEWTRON is O(log T) when α = O(1), the provable constant in the O(·) notation is quite large, leading one to question the practical performance of the algorithm. The main reason for the large constant is that the analysis requires the β parameter to be set extremely small to get the required bounds. In practice, however, one can keep β a tunable parameter and try using larger values. 
In this section, we give experimental evidence (replicating the experiments of [KSST08]) that shows that the practical performance of the algorithm is quite good for small values of α (like 10), and not too small values of β (like 0.01 or 0.0001).\n\nData sets. We used three data sets from [KSST08]: SYNSEP, SYNNONSEP, and REUTERS4. The first two, SYNSEP and SYNNONSEP, are synthetic data sets, generated according to the description given in [KSST08]. These data sets have the same 10^6 feature vectors with 400 features. There are 9 possible labels. The data set SYNSEP is linearly separable, whereas the data set SYNNONSEP is made inseparable by artificially adding 5% label noise. The REUTERS4 data set is generated from the Reuters RCV1 corpus. There are 673,768 documents in the data set with 4 possible labels, and 346,810 features. Our results are reported by averaging over 10 runs of the algorithm involved.\n\nAlgorithms. We implemented the BANDITRON and NEWTRON algorithms3. The NEWTRON algorithm is significantly slower than BANDITRON due to its quadratic running time. This makes it infeasible for really large data sets like REUTERS4. To surmount this problem, we implemented an approximate version of NEWTRON, called PNEWTRON4, which runs in linear time per iteration and thus has comparable speed to BANDITRON. PNEWTRON does not have the same regret guarantees as NEWTRON, however. To derive PNEWTRON, we can restate NEWTRON equivalently as (see [HAK07]):
To derive PNEWTRON, we can restate NEWTRON equivalently as (see\n[HAK07]):\n\nt )(cid:62)At(W \u2212 W(cid:48)(cid:48)\nt )\n\u03c4 =1(1\u2212 \u03ba\u03c4 \u03b2 \u02dc\u2207\u03c4 \u00b7W\u03c4 ) \u02dc\u2207\u03c4 .\nwhere W(cid:48)(cid:48)\nPNEWTRON makes the following change, using the diagonal approximation for the Hessian, and\nusual Euclidean projections:\n\n\u03c4 and bt =(cid:80)t\u22121\n\n\u03c4 =1 \u03ba\u03c4 \u03b2 \u02dc\u2207\u03c4 \u02dc\u2207(cid:62)\n\nt bt, for At = 1\n\nt = \u2212A\u22121\n\nt = arg min\n\nW(cid:48)\n\nD I +(cid:80)t\u22121\nW\u2208K(W \u2212 W(cid:48)(cid:48)\nW\u2208K(W \u2212 W(cid:48)(cid:48)\n\nW(cid:48)\n\nt = arg min\n\nt )(cid:62)(W \u2212 W(cid:48)(cid:48)\nt )\n\n3We did not implement the Con\ufb01dit algorithm of [CG11] since our aim was to consider algorithms in the\n\nfully adversarial setting.\n\n4Short for pseudo-NEWTRON. The \u201cP\u201d may be left silent so that it\u2019s almost NEWTRON, but not quite.\n\n7\n\n\fwhere W(cid:48)(cid:48)\n\nbt =(cid:80)t\u22121\n\nt = \u2212A\u22121\nt bt, for At = 1\n\u03c4 =1(1 \u2212 \u03ba\u03c4 \u03b2 \u02dc\u2207\u03c4 \u00b7 W\u03c4 ) \u02dc\u2207\u03c4 .\n\nD I +(cid:80)t\u22121\n\n\u03c4 =1 diag(\u03ba\u03c4 \u03b2 \u02dc\u2207\u03c4 \u02dc\u2207(cid:62)\n\n\u03c4 ) and bt is the same as before,\n\nIn our experiments, we chose K to be the unit (cid:96)2 ball in Rkn, so D = 1. We\nParameter settings.\nalso choose \u03b1 = 10 for all experiments in the log-loss. For BANDITRON, we chose the value of\n\u03b3 speci\ufb01ed in [KSST08]: \u03b3 = 0.014, 0.006 and 0.05 for SYNSEP, SYNNONSEP and REUTERS4\nrespectively. For NEWTRON and PNEWTRON, we chose \u03b3 = 0.01, 0.006 and 0.05 respectively. The\nother parameter for NEWTRON and PNEWTRON, \u03b2, was set to the values \u03b2 = 0.01, 0.01, and 0.0001\nrespectively. We did not tune any of the parameters \u03b1, \u03b2 and \u03b3 for NEWTRON or PNEWTRON.\n\nEvaluation. 
We evaluated the algorithms in terms of their error rate, i.e. the fraction of prediction mistakes made as a function of time. Experimentally, PNEWTRON has quite similar performance to NEWTRON, but is significantly faster. Figure 1 shows how BANDITRON, NEWTRON and PNEWTRON compare on the SYNNONSEP data set for 10^4 examples5. It can be seen that PNEWTRON has similar behavior to NEWTRON, and is not much worse.\nThe rest of the experiments were conducted using only BANDITRON and PNEWTRON. The results are shown in Figure 2. It can be clearly seen that PNEWTRON decreases the error rate much faster than BANDITRON. For the SYNSEP data set, PNEWTRON very rapidly converges to the lowest possible error rate given the setting of the exploration parameter γ = 0.01, viz. 0.01 × 8/9 = 0.89%. In comparison, the final error for BANDITRON is 1.91%. For the SYNNONSEP data set, PNEWTRON converges rapidly to its final value of 11.94%. BANDITRON remains at a high error level until about 10^4 examples, and at the very end catches up with and does slightly better than PNEWTRON, ending at 11.47%. For the REUTERS4 data set, both BANDITRON and PNEWTRON decrease the error rate at roughly the same pace; however, PNEWTRON still obtains better performance consistently by a few percentage points. In our experiments, the final error rate for PNEWTRON is 13.08%, while that for BANDITRON is 18.10%.\n\nFigure 1: Log-log plots of error rates vs. number of examples for BANDITRON, NEWTRON and PNEWTRON on SYNNONSEP with 10^4 examples.\n\nFigure 2: Log-log plots of error rates vs. number of examples for BANDITRON and PNEWTRON on different data sets. Left: SYNSEP. Middle: SYNNONSEP. Right: REUTERS4.\n\n5 Future Work\n\nSome interesting questions remain open. Our theoretical guarantee applies only to the quadratic-time NEWTRON algorithm. Is it possible to obtain similar regret guarantees for a linear time algorithm? 
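The linear-time PNEWTRON heuristic described in Section 4 — a diagonal approximation to A_t plus a plain Euclidean projection — can be sketched as follows. This is our own minimal NumPy rendering with flattened kn-dimensional weight vectors; pnewtron_update is an illustrative name, not the authors' implementation.

```python
import numpy as np

def pnewtron_update(grads, kappas, W_played, beta, D):
    # Compute W'_t from the history using the diagonal Hessian approximation.
    # grads: list of flattened gradient estimates; W_played: past iterates (flattened).
    dim = grads[0].size
    A_diag = np.full(dim, 1.0 / D)      # diagonal of A_t = (1/D) I + sum_tau diag(kappa*beta*g g^T)
    b = np.zeros(dim)
    for g, kappa, Wp in zip(grads, kappas, W_played):
        A_diag += kappa * beta * g * g  # keep only the diagonal of the rank-one term
        b += (1.0 - kappa * beta * (g @ Wp)) * g
    W_dd = -b / A_diag                  # W''_t = -A_t^{-1} b_t, now a cheap elementwise divide
    norm = np.linalg.norm(W_dd)
    if norm > D:                        # Euclidean projection onto K = {W : ||W|| <= D}
        W_dd *= D / norm
    return W_dd
```

Each update touches only O(kn) entries, versus the O((kn)^2) work of the full Newton-style update, which is what makes PNEWTRON feasible on the larger data sets.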
Our regret bound has an exponentially large constant, which depends on the loss function's parameters. Does there exist an algorithm with similar regret guarantees but better constants?\n\n5In the interest of reducing running time for NEWTRON, we used a smaller data set.\n\nReferences\n\n[ACBFS03] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32:48-77, January 2003.\n[AHR08] Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In COLT, pages 263-274, 2008.\n[AK08] Baruch Awerbuch and Robert Kleinberg. Online linear optimization and adaptive routing. J. Comput. Syst. Sci., 74(1):97-114, 2008.\n[AR09] Jacob Abernethy and Alexander Rakhlin. An efficient bandit algorithm for √T-regret in online multiclass prediction? In COLT, 2009.\n[CG11] Koby Crammer and Claudio Gentile. Multiclass classification with bandit feedback using adaptive regularization. In ICML, 2011.\n[DH06] Varsha Dani and Thomas P. Hayes. Robbing the bandit: less regret in online geometric optimization against an adaptive adversary. In SODA, pages 937-943, 2006.\n[DHK07] Varsha Dani, Thomas Hayes, and Sham Kakade. The price of bandit information for online optimization. In NIPS, 2007.\n[FKM05] Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In SODA, pages 385-394, 2005.\n[HAK07] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169-192, 2007.\n[HJ91] R.A. Horn and C.R. Johnson. Topics in Matrix Analysis. Cambridge University Press, Cambridge, 1991.\n[KSST08] Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Efficient bandit algorithms for online multiclass prediction. In ICML, pages 440-447, 2008.\n[LZ07] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In NIPS, 2007.\n[MB04] H. Brendan McMahan and Avrim Blum. Online geometric optimization in the bandit setting against an adaptive adversary. In COLT, pages 109-123, 2004.\n[RTB07] Alexander Rakhlin, Ambuj Tewari, and Peter Bartlett. Closing the gap between bandit and full-information online optimization: High-probability regret bound. Technical Report UCB/EECS-2007-109, EECS Department, University of California, Berkeley, Aug 2007.\n", "award": [], "sourceid": 577, "authors": [{"given_name": "Elad", "family_name": "Hazan", "institution": null}, {"given_name": "Satyen", "family_name": "Kale", "institution": null}]}