{"title": "Adaptive Regularization of Weight Vectors", "book": "Advances in Neural Information Processing Systems", "page_first": 414, "page_last": 422, "abstract": "We present AROW, a new online learning algorithm that combines several useful properties: large margin training, confidence weighting, and the capacity to handle non-separable data. AROW performs adaptive regularization of the prediction function upon seeing each new instance, allowing it to perform especially well in the presence of label noise. We derive a mistake bound, similar in form to the second order perceptron bound, which does not assume separability. We also relate our algorithm to recent confidence-weighted online learning techniques and empirically show that AROW achieves state-of-the-art performance and notable robustness in the case of non-separable data.", "full_text": "Adaptive Regularization of Weight Vectors\n\nKoby Crammer\nDepartment of Electrical Engineering\nThe Technion\nHaifa, 32000 Israel\nkoby@ee.technion.ac.il\n\nAlex Kulesza\nDepartment of Computer and Information Science\nUniversity of Pennsylvania\nPhiladelphia, PA 19104\nkulesza@cis.upenn.edu\n\nMark Dredze\nHuman Language Tech. Center of Excellence\nJohns Hopkins University\nBaltimore, MD 21211\nmdredze@cs.jhu.edu\n\nAbstract\n\nWe present AROW, a new online learning algorithm that combines several useful properties: large margin training, confidence weighting, and the capacity to handle non-separable data. AROW performs adaptive regularization of the prediction function upon seeing each new instance, allowing it to perform especially well in the presence of label noise. We derive a mistake bound, similar in form to the second order perceptron bound, that does not assume separability. 
We also relate our algorithm to recent confidence-weighted online learning techniques and show empirically that AROW achieves state-of-the-art performance and notable robustness in the case of non-separable data.\n\n1 Introduction\n\nOnline learning algorithms are fast, simple, make few statistical assumptions, and perform well in a wide variety of settings. Recent work has shown that parameter confidence information can be effectively used to guide online learning [2]. Confidence weighted (CW) learning, for example, maintains a Gaussian distribution over linear classifier hypotheses and uses it to control the direction and scale of parameter updates [6]. In addition to formal guarantees in the mistake-bound model [11], CW learning has achieved state-of-the-art performance on many tasks. However, the strict update criterion used by CW learning is very aggressive and can over-fit [5]. Approximate solutions can be used to regularize the update and improve results; however, current analyses of CW learning still assume that the data are separable. It is not immediately clear how to relax this assumption.\nIn this paper we present a new online learning algorithm for binary classification that combines several attractive properties: large margin training, confidence weighting, and the capacity to handle non-separable data. The key to our approach is the adaptive regularization of the prediction function upon seeing each new instance, so we call this algorithm Adaptive Regularization of Weights (AROW). Because it adjusts its regularization for each example, AROW is robust to sudden changes in the classification function due to label noise. We derive a mistake bound, similar in form to the second order perceptron bound, that does not assume separability. 
We also provide empirical results demonstrating that AROW is competitive with state-of-the-art methods and improves upon them significantly in the presence of label noise.\n\n2 Confidence Weighted Online Learning of Linear Classifiers\n\nOnline algorithms operate in rounds. In round t the algorithm receives an instance xt ∈ R^d and applies its current prediction rule to make a prediction ŷt ∈ Y. It then receives the true label yt ∈ Y and suffers a loss ℓ(yt, ŷt). For binary classification we have Y = {−1, +1} and use the zero-one loss ℓ01(yt, ŷt) = 0 if yt = ŷt and 1 otherwise. Finally, the algorithm updates its prediction rule using (xt, yt) and proceeds to the next round. In this work we consider linear prediction rules parameterized by a weight vector w: ŷ = hw(x) = sign(w · x).\nRecently Dredze, Crammer and Pereira [6, 5] proposed an algorithmic framework for online learning of binary classification tasks called confidence weighted (CW) learning. CW learning captures the notion of confidence in a linear classifier by maintaining a Gaussian distribution over the weights with mean µ ∈ R^d and covariance matrix Σ ∈ R^{d×d}. The values µ_p and Σ_{p,p}, respectively, encode the learner's knowledge of and confidence in the weight for feature p: the smaller Σ_{p,p}, the more confidence the learner has in the mean weight value µ_p. Covariance terms Σ_{p,q} capture interactions between weights.\nConceptually, to classify an instance x, a CW classifier draws a parameter vector w ∼ N(µ, Σ) and predicts the label according to sign(w · x). In practice, however, it can be easier to simply use the average weight vector E[w] = µ to make predictions. 
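The round-based protocol described above is easy to state in code. The sketch below is illustrative and not from the paper: it uses NumPy, and a plain perceptron update stands in for a generic learner (the CW and AROW updates come later); `run_online` and `perceptron_update` are hypothetical names.

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """Zero-one loss: 0 if the labels agree, 1 otherwise."""
    return 0 if y_true == y_pred else 1

def run_online(examples, update, w0):
    """Generic online protocol: predict, receive label, suffer loss, update.

    `examples` is a list of (x, y) pairs with y in {-1, +1};
    `update(w, x, y)` returns the new weight vector.
    """
    w = w0.copy()
    mistakes = 0
    for x, y in examples:
        y_hat = 1 if np.dot(w, x) >= 0 else -1  # linear rule sign(w . x)
        mistakes += zero_one_loss(y, y_hat)     # suffer loss, then update
        w = update(w, x, y)
    return w, mistakes

def perceptron_update(w, x, y):
    """Perceptron step as a stand-in learner (not the CW rule)."""
    return w + y * x if y * np.dot(w, x) <= 0 else w
```

Any rule fitting the `update(w, x, y)` signature can be plugged into the same loop.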
This is similar to the approach taken by Bayes point machines [9], where a single weight vector is used to approximate a distribution. Furthermore, for binary classification, the prediction given by the mean weight vector turns out to be Bayes optimal.\nCW classifiers are trained according to a passive-aggressive rule [3] that adjusts the distribution at each round to ensure that the probability of a correct prediction is at least η ∈ (0.5, 1]. This yields the update constraint Pr[yt (w · xt) ≥ 0] ≥ η. Subject to this constraint, the algorithm makes the smallest possible change to the hypothesis weight distribution as measured using the KL divergence. This implies the following optimization problem for each round t:\n\n(µt, Σt) = argmin_{µ,Σ} D_KL(N(µ, Σ) ‖ N(µ_{t−1}, Σ_{t−1}))   s.t.   Pr_{w∼N(µ,Σ)}[yt (w · xt) ≥ 0] ≥ η\n\nConfidence-weighted algorithms have been shown to perform well in practice [5, 6], but they suffer from several problems. First, the update is quite aggressive, forcing the probability of predicting each example correctly to be at least η > 1/2 regardless of the cost to the objective. This may cause severe over-fitting when labels are noisy; indeed, current analyses of the CW algorithm [5] assume that the data are linearly separable. Second, they are designed for classification, and it is not clear how to extend them to alternative settings such as regression. 
This is in part because the constraint is written in discrete terms where the prediction is either correct or not.\nWe deal with both of these issues, coping more effectively with label noise and generalizing the advantages of CW learning in an extensible way.\n\n3 Adaptive Regularization Of Weights\n\nWe identify two important properties of the CW update rule that contribute to its good performance but also make it sensitive to label noise. First, the mean parameters µ are guaranteed to correctly classify the current training example with margin following each update. This is because the probability constraint Pr[yt (w · xt) ≥ 0] ≥ η can be written explicitly as yt (µ · xt) ≥ φ sqrt(x_t^T Σ xt), where φ > 0 is a positive constant related to η. This aggressiveness yields rapid learning, but given an incorrectly labeled example, it can also force the learner to make a drastic and incorrect change to its parameters. Second, confidence, as measured by the inverse eigenvalues of Σ, increases monotonically with every update. While it is intuitive that our confidence should grow as we see more data, this also means that even incorrectly labeled examples causing wild parameter swings result in artificially increased confidence.\nIn order to maintain the positives but reduce the negatives of these two properties, we isolate and soften them. 
As in CW learning, we maintain a Gaussian distribution over weight vectors with mean µ and covariance Σ; however, we recast the above characteristics of the CW constraint as regularizers, minimizing the following unconstrained objective on each round:\n\nC(µ, Σ) = D_KL(N(µ, Σ) ‖ N(µ_{t−1}, Σ_{t−1})) + λ1 ℓ_{h2}(yt, µ · xt) + λ2 x_t^T Σ xt ,   (1)\n\nwhere ℓ_{h2}(yt, µ · xt) = (max{0, 1 − yt(µ · xt)})^2 is the squared-hinge loss suffered using the weight vector µ to predict the output for input xt when the true output is yt. λ1, λ2 ≥ 0 are two tradeoff hyperparameters. For simplicity and compactness of notation, in the following we will assume that λ1 = λ2 = 1/(2r) for some r > 0.\nThe objective balances three desires. First, the parameters should not change radically on each round, since the current parameters contain information about previous examples (first term). Second, the new mean parameters should predict the current example with low loss (second term). Finally, as we see more examples, our confidence in the parameters should generally grow (third term).\nNote that this objective is not simply the dualization of the CW constraint, but a new formulation inspired by the properties discussed above. Since the loss term depends on µ only via the inner-product µ · xt, we are able to prove a representer theorem (Sec. 4). 
While we use the squared-hinge loss for classification, different loss functions, as long as they are convex and differentiable in µ, yield algorithms for different settings.^1\nTo solve the optimization in (1), we begin by writing the KL explicitly:\n\nC(µ, Σ) = (1/2) log(det Σ_{t−1} / det Σ) + (1/2) Tr(Σ_{t−1}^{−1} Σ) + (1/2)(µ_{t−1} − µ)^T Σ_{t−1}^{−1} (µ_{t−1} − µ) − d/2 + (1/(2r)) ℓ_{h2}(yt, µ · xt) + (1/(2r)) x_t^T Σ xt   (2)\n\nWe can decompose the result into two terms: C1(µ), depending only on µ, and C2(Σ), depending only on Σ. The updates to µ and Σ can therefore be performed independently. The squared-hinge loss yields a conservative (or passive) update for µ in which the mean parameters change only when the margin is too small, and we follow CW learning by enforcing a correspondingly conservative update for the confidence parameter Σ, updating it only when µ changes. This results in fewer updates and is easier to analyze. Our update thus proceeds in two stages.\n\n1. Update the mean parameters: µt = argmin_µ C1(µ)   (3)\n2. If µt ≠ µ_{t−1}, update the confidence parameters: Σt = argmin_Σ C2(Σ)   (4)\n\nWe now develop the update equations for (3) and (4) explicitly, starting with the former. Taking the derivative of C(µ, Σ) with respect to µ and setting it to zero, we get\n\nµt = µ_{t−1} − (1/(2r)) [ (d/dz) ℓ_{h2}(yt, z) |_{z=µt·xt} ] Σ_{t−1} xt ,   (5)\n\nassuming Σ_{t−1} is non-singular. 
Substituting the derivative of the squared-hinge loss in (5) and assuming 1 − yt (µt · xt) ≥ 0, we get\n\nµt = µ_{t−1} + (yt/r) (1 − yt (µt · xt)) Σ_{t−1} xt .   (6)\n\nWe solve for µt by taking the dot product of each side of the equality with xt and substituting back in (6) to obtain the rule\n\nµt = µ_{t−1} + [ max(0, 1 − yt x_t^T µ_{t−1}) / (x_t^T Σ_{t−1} xt + r) ] Σ_{t−1} yt xt .   (7)\n\nIt can be easily verified that (7) satisfies our assumption that 1 − yt (µt · xt) ≥ 0.\n\n^1 It can be shown that the well known recursive least squares (RLS) regression algorithm [7] is a special case of AROW with the squared loss.\n\nInput: parameter r\nInitialize µ0 = 0, Σ0 = I\nFor t = 1, . . . , T\n  • Receive a training example xt ∈ R^d\n  • Compute margin and confidence: mt = µ_{t−1} · xt, vt = x_t^T Σ_{t−1} xt\n  • Receive true label yt, and suffer loss ℓt = 1 if sign(mt) ≠ yt\n  • If mt yt < 1, update using eqs. (7) & (9):\n      βt = 1 / (x_t^T Σ_{t−1} xt + r)\n      αt = max(0, 1 − yt x_t^T µ_{t−1}) βt\n      µt = µ_{t−1} + αt Σ_{t−1} yt xt\n      Σt = Σ_{t−1} − βt Σ_{t−1} xt x_t^T Σ_{t−1}\nOutput: Weight vector µT and confidence ΣT.\n\nFigure 1: The AROW algorithm for online binary classification.\n\n
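The pseudocode of Figure 1 maps directly onto a short implementation. The following is an illustrative NumPy rendering (not the authors' code); the function name `arow_fit` is hypothetical.

```python
import numpy as np

def arow_fit(X, y, r=1.0):
    """AROW for online binary classification, following Fig. 1.

    X: (T, d) array of instances; y: (T,) array of labels in {-1, +1};
    r: regularization hyperparameter (lambda_1 = lambda_2 = 1/(2r)).
    Returns the final mean weights mu and confidence matrix Sigma.
    """
    T, d = X.shape
    mu = np.zeros(d)      # mu_0 = 0
    Sigma = np.eye(d)     # Sigma_0 = I
    for t in range(T):
        x_t, y_t = X[t], y[t]
        m_t = mu @ x_t                # margin
        v_t = x_t @ Sigma @ x_t       # confidence
        if m_t * y_t < 1:             # update only when the margin is small
            beta = 1.0 / (v_t + r)                    # beta_t
            alpha = max(0.0, 1.0 - y_t * m_t) * beta  # alpha_t, eq. (7)
            Sx = Sigma @ x_t
            mu = mu + alpha * y_t * Sx                # mean update, eq. (7)
            Sigma = Sigma - beta * np.outer(Sx, Sx)   # confidence update, eq. (9)
    return mu, Sigma
```

Predictions are then sign(µ_T · x); smaller r makes the updates more aggressive, larger r more conservative.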
The update for the confidence parameters is made only if µt ≠ µ_{t−1}, that is, if 1 > yt x_t^T µ_{t−1}. In this case, we compute the update of the confidence parameters by setting the derivative of C(µ, Σ) with respect to Σ to zero:\n\nΣ_t^{−1} = Σ_{t−1}^{−1} + xt x_t^T / r   (8)\n\nUsing the Woodbury identity we can also rewrite the update for Σ in non-inverted form:\n\nΣt = Σ_{t−1} − Σ_{t−1} xt x_t^T Σ_{t−1} / (r + x_t^T Σ_{t−1} xt)   (9)\n\nNote that it follows directly from (8) and (9) that the eigenvalues of the confidence parameters are monotonically decreasing: Σt ⪯ Σ_{t−1}; Σ_t^{−1} ⪰ Σ_{t−1}^{−1}. Pseudocode for AROW appears in Fig. 1.\n\n4 Analysis\n\nWe first show that AROW can be kernelized by stating the following representer theorem.\n\nLemma 1 (Representer Theorem) Assume that Σ0 = I and µ0 = 0. The mean parameters µt and confidence parameters Σt produced by updating via (7) and (9) can be written as linear combinations of the input vectors (resp. outer products of the input vectors with themselves) with coefficients depending only on inner-products of input vectors.\n\nProof sketch: By induction. The base case follows from the definitions of µ0 and Σ0, and the induction step follows algebraically from the update rules (7) and (9).\nWe now prove a mistake bound for AROW. Denote by M (M = |M|) the set of example indices for which the algorithm makes a mistake, yt (µ_{t−1} · xt) ≤ 0, and by U (U = |U|) the set of example indices for which there is an update but not a mistake, 0 < yt (µ_{t−1} · xt) ≤ 1. Other examples do not affect the behavior of the algorithm and can be ignored. 
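The equivalence of the inverse-space update (8) and its Woodbury form (9) is easy to verify numerically. This small NumPy check is illustrative (not from the paper); it also confirms the eigenvalue monotonicity Σt ⪯ Σ_{t−1} noted above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 4, 2.0

# A generic positive-definite "previous" covariance (assumed test data).
A = rng.standard_normal((d, d))
Sigma_prev = np.linalg.inv(np.eye(d) + A @ A.T)
x = rng.standard_normal(d)

# Inverse-space update, eq. (8): Sigma_t^{-1} = Sigma_{t-1}^{-1} + x x^T / r
Sigma_inv = np.linalg.inv(Sigma_prev) + np.outer(x, x) / r
Sigma_from_inverse = np.linalg.inv(Sigma_inv)

# Direct update via the Woodbury identity, eq. (9)
Sx = Sigma_prev @ x
Sigma_direct = Sigma_prev - np.outer(Sx, Sx) / (r + x @ Sx)

# Both routes give the same matrix (up to floating-point error).
assert np.allclose(Sigma_from_inverse, Sigma_direct)

# Sigma_{t-1} - Sigma_t is PSD, i.e. Sigma_t "shrinks": Sigma_t <= Sigma_{t-1}.
assert np.all(np.linalg.eigvalsh(Sigma_prev - Sigma_direct) >= -1e-9)
```

In practice the direct form (9) avoids explicit matrix inversion on every round.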
Let X_M = sum_{t∈M} xt x_t^T, X_U = sum_{t∈U} xt x_t^T, and X_A = X_M + X_U.\n\nTheorem 2 For any reference weight vector u ∈ R^d, the number of mistakes made by AROW (Fig. 1) is upper bounded by\n\nM ≤ sqrt(r ‖u‖^2 + u^T X_A u) · sqrt(log(det(I + (1/r) X_A)) + U) + sum_{t∈M∪U} gt − U ,   (10)\n\nwhere gt = max(0, 1 − yt u^T xt).\n\nThe proof depends on two lemmas; we omit the proof of the first for lack of space.\n\nLemma 3 Let ℓt = max(0, 1 − yt µ_{t−1}^T xt) and χt = x_t^T Σ_{t−1} xt. Then, for every t ∈ M∪U,\n\nu^T Σ_t^{−1} µt = u^T Σ_{t−1}^{−1} µ_{t−1} + (1/r) yt u^T xt\nµ_t^T Σ_t^{−1} µt = µ_{t−1}^T Σ_{t−1}^{−1} µ_{t−1} + (χt + r − ℓ_t^2 r) / (r (χt + r))\n\nLemma 4 Let T be the number of rounds. 
Then\n\nsum_{t∈M∪U} χt / (χt + r) ≤ log(det(Σ_T^{−1})) .\n\nProof: We compute the following quantity:\n\nx_t^T Σt xt = x_t^T (Σ_{t−1} − βt Σ_{t−1} xt x_t^T Σ_{t−1}) xt = χt − χ_t^2 / (χt + r) = χt r / (χt + r) .   (11)\n\nUsing Lemma D.1 from [2] we have that\n\n(1/r) x_t^T Σt xt = 1 − det(Σ_{t−1}^{−1}) / det(Σ_t^{−1}) .\n\nCombining, and using the inequality 1 − x ≤ − log(x), we get\n\nsum_t χt / (χt + r) = sum_t (1 − det(Σ_{t−1}^{−1}) / det(Σ_t^{−1})) ≤ − sum_t log(det(Σ_{t−1}^{−1}) / det(Σ_t^{−1})) = log(det(Σ_T^{−1})) ,\n\nsince the sum telescopes and det(Σ_0^{−1}) = det(I) = 1.\n\nWe now prove Theorem 2.\nProof: We iterate the first equality of Lemma 3 to get\n\nu^T Σ_T^{−1} µT = sum_{t∈M∪U} (1/r) yt u^T xt ≥ sum_{t∈M∪U} (1 − gt)/r = (M + U)/r − (1/r) sum_{t∈M∪U} gt .   (12)\n\nWe iterate the second equality to get\n\nµ_T^T Σ_T^{−1} µT = sum_{t∈M∪U} (χt + r − ℓ_t^2 r) / (r (χt + r)) = sum_{t∈M∪U} χt / (r (χt + r)) + sum_{t∈M∪U} (1 − ℓ_t^2) / (χt + r) .   (13)\n\nUsing Lemma 4 we have that the first term of (13) is upper bounded by (1/r) log(det(Σ_T^{−1})). For the second term in (13) we consider two cases. First, if a mistake occurred on example t, then yt (xt · µ_{t−1}) ≤ 0 and ℓt ≥ 1, so 1 − ℓ_t^2 ≤ 0. Second, if the algorithm made an update (but no mistake) on example t, then 0 < yt (xt · µ_{t−1}) ≤ 1 and ℓt ≥ 0, thus 1 − ℓ_t^2 ≤ 1. We therefore have\n\nµ_T^T Σ_T^{−1} µT ≤ (1/r) log(det(Σ_T^{−1})) + sum_{t∈U} 1 / (χt + r) .   (14)\n\nCombining and plugging into the Cauchy-Schwarz inequality u^T Σ_T^{−1} µT ≤ sqrt(u^T Σ_T^{−1} u) · sqrt(µ_T^T Σ_T^{−1} µT), we get\n\n(M + U)/r − (1/r) sum_{t∈M∪U} gt ≤ sqrt(u^T Σ_T^{−1} u) · sqrt((1/r) log(det(Σ_T^{−1})) + sum_{t∈U} 1/(χt + r)) .   (15)\n\nRearranging the terms and using the fact that χt ≥ 0 yields\n\nM ≤ sqrt(r u^T Σ_T^{−1} u) · sqrt(log(det(Σ_T^{−1})) + U) + sum_{t∈M∪U} gt − U .\n\nBy definition,\n\nΣ_T^{−1} = I + (1/r) sum_{t∈M∪U} xt x_t^T = I + (1/r) X_A ,\n\nso substituting and simplifying completes the proof:\n\nM ≤ sqrt(r u^T (I + (1/r) X_A) u) · sqrt(log(det(I + (1/r) X_A)) + U) + sum_{t∈M∪U} gt − U = sqrt(r ‖u‖^2 + u^T X_A u) · sqrt(log(det(I + (1/r) X_A)) + U) + sum_{t∈M∪U} gt − U .\n\nA few comments are in order. First, the two square-root terms of the bound depend on r in opposite ways: the first is monotonically increasing, while the second is monotonically decreasing. One could expect to optimize the bound by minimizing over r. 
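The bound of Theorem 2 can be checked empirically by tracking M, U, the hinge losses gt of the reference vector, and X_A during a run. The sketch below is a numerical sanity check with NumPy, not part of the paper; `arow_with_bound` is a hypothetical helper that mirrors the Fig. 1 updates while accumulating the bound's quantities.

```python
import numpy as np

def arow_with_bound(X, y, u, r=1.0):
    """Run AROW and evaluate the Theorem 2 mistake bound for reference vector u."""
    d = X.shape[1]
    mu, Sigma = np.zeros(d), np.eye(d)
    M = U = 0              # mistakes / margin-only updates
    g_sum = 0.0            # sum of g_t = max(0, 1 - y_t u.x_t) over update rounds
    X_A = np.zeros((d, d)) # sum of x_t x_t^T over update rounds
    for x_t, y_t in zip(X, y):
        m_t = mu @ x_t
        if y_t * m_t <= 0:
            M += 1                         # mistake round (t in M)
        if y_t * m_t < 1:                  # update round (t in M or U)
            if y_t * m_t > 0:
                U += 1                     # update without mistake (t in U)
            g_sum += max(0.0, 1.0 - y_t * (u @ x_t))
            X_A += np.outer(x_t, x_t)
            v_t = x_t @ Sigma @ x_t
            beta = 1.0 / (v_t + r)
            alpha = max(0.0, 1.0 - y_t * m_t) * beta
            Sx = Sigma @ x_t
            mu = mu + alpha * y_t * Sx
            Sigma = Sigma - beta * np.outer(Sx, Sx)
    # Right-hand side of eq. (10).
    logdet = np.linalg.slogdet(np.eye(d) + X_A / r)[1]
    bound = np.sqrt(r * (u @ u) + u @ X_A @ u) * np.sqrt(logdet + U) + g_sum - U
    return M, bound
```

Since the theorem holds for any data sequence and any u, M should never exceed the returned bound.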
However, the bound also depends on r indirectly via other quantities (e.g. X_A), so there is no direct way to do so. Second, if all the updates are associated with errors, that is, U = ∅, then the bound reduces to the bound of the second-order perceptron [2]. In general, however, the bounds are not comparable since each depends on the actual runtime behavior of its algorithm.\n\n5 Empirical Evaluation\n\nWe evaluate AROW on both synthetic and real data, including several popular datasets for document classification and optical character recognition (OCR). We compare with three baselines: Passive-Aggressive (PA), Second Order Perceptron (SOP)^2 and Confidence-Weighted (CW) learning^3.\nOur synthetic data are as in [5], but we invert the labels on 10% of the training examples. (Note that evaluation is still done against the true labels.) Fig. 2(a) shows the online learning curves for both full and diagonalized versions of the algorithms on these noisy data. AROW improves over all competitors, and the full version outperforms the diagonal version. Note that CW-full performs worse than CW-diagonal, as has been observed previously for noisy data.\nWe selected a variety of document classification datasets popular in the NLP community, summarized as follows. Amazon: Product reviews to be classified into domains (e.g., books or music) [6]. We created binary datasets by taking all pairs of the six domains (15 datasets). Feature extraction follows [1] (bigram counts). 20 Newsgroups: Approximately 20,000 newsgroup messages partitioned across 20 different newsgroups^4. We binarized the corpus following [6] and used binary bag-of-words features (3 datasets). Each dataset has between 1850 and 1971 instances. Reuters (RCV1-v2/LYRL2004): Over 800,000 manually categorized newswire stories. We created binary classification tasks using pairs of labels following [6] (3 datasets). 
Details on document preparation and feature extraction are given by [10]. Sentiment: Product reviews to be classified as positive or negative. We used each Amazon product review domain as a sentiment classification task (6 datasets). Spam: We selected three task A users from the ECML/PKDD Challenge^5, using bag-of-words to classify each email as spam or ham (3 datasets). For OCR data we binarized two well known digit recognition datasets, MNIST^6 and USPS, into 45 all-pairs problems. We also created ten one vs. all datasets from the MNIST data (100 datasets total).\nEach result for the text datasets was averaged over 10-fold cross-validation. The OCR experiments used the standard split into training and test sets. Hyperparameters (including r for AROW) and the number of online iterations (up to 10) were optimized using a single randomized run. \n\n^2 For the real world (high dimensional) datasets, we must drop cross-feature confidence terms by projecting onto the set of diagonal matrices, following the approach of [6]. While this may reduce performance, we make the same approximation for all evaluated algorithms.\n^3 We use the “variance” version developed in [6].\n^4 http://people.csail.mit.edu/jrennie/20Newsgroups/\n^5 http://ecmlpkdd2006.org/challenge.html\n^6 http://yann.lecun.com/exdb/mnist/index.html\n\nFigure 2: Learning curves for AROW (full/diagonal) and baseline methods. (a) 5k synthetic training examples and 10k test examples (10% noise, 100 runs). (b) MNIST 3 vs. 5 binary classification task for different amounts of label noise (left: 0 noise, right: 10%).\n\n
We used 2000 instances from each dataset unless otherwise noted above.\nIn order to observe each algorithm's ability to handle non-separable data, we performed each experiment using various levels of artificial label noise, generated by independently flipping each binary label with fixed probability.\n\n5.1 Results and Discussion\n\nOur experimental results are summarized in Table 1. AROW outperforms the baselines at all noise levels, but does especially well as noise increases. More detailed results for AROW and CW, the overall best performing baseline, are compared in Fig. 3. AROW and CW are comparable when there is no added noise, with AROW winning the majority of the time. As label noise increases (moving across the rows in Fig. 3) AROW holds up remarkably well. In almost every high noise evaluation, AROW improves over CW (as well as the other baselines, not shown). Fig. 2(b) shows the total number of mistakes (w.r.t. noise-free labels) made by each algorithm during training on the MNIST dataset for 0% and 10% noise. Though absolute performance suffers with noise, the gap between AROW and the baselines increases.\n\nTable 1: Mean rank (out of 4, over all datasets) at different noise levels. A rank of 1 indicates that an algorithm outperformed all the others.\n\nNoise level:  0.0   0.05  0.1   0.15  0.2   0.3\nAROW:         1.51  1.44  1.42  1.38  1.25  1.25\nCW:           1.63  1.87  1.95  2.08  2.42  2.76\nPA:           2.95  2.83  2.78  2.61  2.33  2.08\nSOP:          3.91  3.87  3.89  3.89  4.00  3.91\n\nTo help interpret the results, we classify the algorithms evaluated here according to four characteristics: the use of large margin updates, confidence weighting, a design that accommodates non-separable data, and an adaptive per-instance margin (Table 2). 
While all of these properties can be desirable in different situations, we would like to understand how they interact and achieve high performance while avoiding sensitivity to noise.\nBased on the results in Table 1, it is clear that the combination of confidence information and large margin learning is powerful when label noise is low. CW easily outperforms the other baselines in such situations, as it has been shown to do in previous work. However, as noise increases, the separability assumption inherent in CW appears to reduce its performance considerably.\n\nTable 2: Online algorithm properties overview.\n\nAlgorithm  Large Margin  Confidence  Non-Separable  Adaptive Margin\nPA         Yes           No          Yes            No\nSOP        No            Yes         Yes            No\nCW         Yes           Yes         No             Yes\nAROW       Yes           Yes         Yes            No\n\nFigure 3: Accuracy on text (top) and OCR (bottom) binary classification. Plots compare performance between AROW and CW, the best performing baseline (Table 1). Markers above the line indicate superior AROW performance and below the line superior CW performance. Label noise increases from left to right: 0%, 10% and 30%. AROW improves relative to CW as noise increases.\n\nAROW, by combining the large margin and confidence weighting of CW with a soft update rule that accommodates non-separable data, matches CW's performance in general while avoiding degradation under noise. AROW lacks the adaptive margin of CW, suggesting that this characteristic is not crucial to achieving strong performance. 
However, we leave open for future work the possibility that an algorithm with all four properties might have unique advantages.\n\n6 Related and Future Work\n\nAROW is most similar to the second order perceptron [2]. The SOP performs the same type of update as AROW, but only when it makes an error. AROW, on the other hand, updates even when its prediction is correct if there is insufficient margin. Confidence weighted (CW) [6, 5] algorithms, by which AROW was inspired, update the mean and confidence parameters simultaneously, while AROW makes a decoupled update and softens the hard constraint of CW. The AROW algorithm can be seen as a variant of the PA-II algorithm from [3] where the regularization is modified according to the data.\nHazan [8] describes a framework for gradient descent algorithms with logarithmic regret in which a quantity similar to Σt plays an important role. Our algorithm differs in several ways. First, Hazan [8] considers gradient algorithms, while we derive and analyze algorithms that directly solve an optimization problem. Second, we bound the loss directly, not the cumulative sum of regularization and loss. Third, the gradient algorithms perform a projection after making an update (not before) since the norm of the weight vector is kept bounded.\nOngoing work includes the development and analysis of AROW style algorithms for other settings, including a multi-class version following the recent extension of CW to multi-class problems [4]. Our mistake bound can be extended to this case. Applying the ideas behind AROW to regression problems turns out to yield the well known recursive least squares (RLS) algorithm, for which AROW offers new bounds (omitted). Finally, while we used the confidence term x_t^T Σ xt in (1), we can replace this term with any differentiable, monotonically increasing function f(x_t^T Σ xt). 
This generalization may yield additional algorithms.\n\nReferences\n\n[1] John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In ACL, 2007.\n\n[2] Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. A second-order perceptron algorithm. SIAM Journal on Computing, 34, 2005.\n\n[3] Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551–585, 2006.\n\n[4] Koby Crammer, Mark Dredze, and Alex Kulesza. Multi-class confidence weighted algorithms. In Empirical Methods in Natural Language Processing (EMNLP), 2009.\n\n[5] Koby Crammer, Mark Dredze, and Fernando Pereira. Exact convex confidence-weighted learning. In Neural Information Processing Systems (NIPS), 2008.\n\n[6] Mark Dredze, Koby Crammer, and Fernando Pereira. Confidence-weighted linear classification. In International Conference on Machine Learning, 2008.\n\n[7] Simon Haykin. Adaptive Filter Theory. 1996.\n\n[8] Elad Hazan. Efficient algorithms for online convex optimization and their applications. PhD thesis, Princeton University, 2006.\n\n[9] Ralf Herbrich, Thore Graepel, and Colin Campbell. Bayes point machines. Journal of Machine Learning Research (JMLR), 1:245–279, 2001.\n\n[10] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. 
RCV1: A new benchmark collection for text categorization research. JMLR, 5:361–397, 2004.\n\n[11] Nick Littlestone. Learning when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285–318, 1988.\n", "award": [], "sourceid": 611, "authors": [{"given_name": "Koby", "family_name": "Crammer", "institution": null}, {"given_name": "Alex", "family_name": "Kulesza", "institution": null}, {"given_name": "Mark", "family_name": "Dredze", "institution": null}]}