{"title": "Linear Hinge Loss and Average Margin", "book": "Advances in Neural Information Processing Systems", "page_first": 225, "page_last": 231, "abstract": null, "full_text": "Linear Hinge Loss and Average Margin \n\nClaudio Gentile \n\nManfred K. Warmuth\u00b7 \n\nDSI, Universita' di Milano, \n\nComputer Science Department, \n\nVia Comelico 39, \n20135 Milano. Italy \n\nUniversity of California, \n95064 Santa Cruz, USA \n\ngentile@dsi.unimi.it \n\nmanfred@cse.ucsc.edu \n\nAbstract \n\nWe describe a unifying method for proving relative loss bounds for on(cid:173)\nline linear threshold classification algorithms, such as the Perceptron and \nthe Winnow algorithms. For classification problems the discrete loss is \nused, i.e., the total number of prediction mistakes. We introduce a con(cid:173)\ntinuous loss function, called the \"linear hinge loss\", that can be employed \nto derive the updates of the algorithms. We first prove bounds w.r.t. the \nlinear hinge loss and then convert them to the discrete loss. We intro(cid:173)\nduce a notion of \"average margin\" of a set of examples . We show how \nrelative loss bounds based on the linear hinge loss can be converted to \nrelative loss bounds i.t.o. the discrete loss using the average margin. \n\n1 Introduction \nConsider the classical Perceptron algorithm. The hypothesis of this algorithm at trial t \nis a linear threshold function determined by a weight vector Wt E Rn. For an instance \nXt ERn the linear activation at = Wt . Xt is passed through a threshold function (7 r \nwhich is -Ion arguments less than the threshold rand + 1 otherwise. Thus the prediction \nof the algorithm is binary and -1 , + 1 denote the two classes. The Perceptron algorithm \nis aimed at learning a classification problem where the examples have the form (X t , Yt ) E \nR n x {-I , +1} . 
\nAfter seeing T examples (x_t, y_t), 1 ≤ t ≤ T, for any u ∈ W and any η > 0: \n\nHL_r(a_t, â_t) - HL_r(u · x_t, â_t) + HL_r(u · x_t, a_t) \n= (1/η) (d_f(u, w_t) - d_f(u, w_{t+1}) + d_f(w_t, w_{t+1})) = (1/2)(ŷ_t - y_t)(a_t - u · x_t), (5) \n\nwhere â_t denotes an activation with σ_r(â_t) = y_t, so that ŷ_t = σ_r(a_t). \n\nProof. We have d_f(u, w_t) - d_f(u, w_{t+1}) + d_f(w_t, w_{t+1}) = (u - w_t) · (f(w_{t+1}) - f(w_t)) = (w_t - u) · (η/2)(σ_r(a_t) - σ_r(â_t)) x_t = (η/2)(σ_r(a_t) - σ_r(â_t))(a_t - u · x_t) = η (HL_r(a_t, â_t) - HL_r(u · x_t, â_t) + HL_r(u · x_t, a_t)). The first equality follows from Lemma 1 and the second follows from the update rule of the regression algorithm. The last equality uses HL_r(a_t, â_t) as a divergence d_{σ_r}(a_t, â_t) (see (4)) and again Lemma 1. □ \n\nBy summing the first equality of (5) over all trials t we could relate the total HL_r of the regression algorithm to the total HL_r of the regressor u. However, our goal is to obtain bounds on the number of mistakes of the classification algorithm. It is therefore natural to interpret u too as a linear threshold classifier, with the same threshold r used by the classification algorithm. We use the second equality of (5) and sum over all T trials: \n\nΣ_{t=1}^T (1/2)(ŷ_t - y_t)(a_t - u · x_t) = (1/η) (d_f(u, w_1) - d_f(u, w_{T+1}) + Σ_{t=1}^T d_f(w_t, w_{t+1})). \n\nNote that the sums in the above equality are unaffected by trials in which no mistake occurs: in such trials ŷ_t = y_t and w_{t+1} = w_t. Thus the above is equivalent to the following, where M is the set of trials in which a mistake occurs: \n\nΣ_{t∈M} (1/2)(ŷ_t - y_t)(a_t - u · x_t) = (1/η) (d_f(u, w_1) - d_f(u, w_{T+1}) + Σ_{t∈M} d_f(w_t, w_{t+1})). \n\nSince (1/2)(ŷ_t - y_t) = -y_t when t ∈ M and d_f(u, w_{T+1}) ≥ 0 we get the following theorem: \n\nTheorem 3 Let M ⊆ {1, ..., T} be the set of trials in which the classification algorithm makes a mistake. Then for every u ∈ W we have \n\nΣ_{t∈M} y_t (u · x_t - a_t) ≤ (1/η) (d_f(u, w_1) + Σ_{t∈M} d_f(w_t, w_{t+1})). □ \n\nThroughout the rest of this section the classification algorithm is compared to the performance of a linear threshold classifier u with threshold r = 0. 
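For the Perceptron, the link function is the identity f(w) = w, so the divergence is d_f(u, w) = (1/2)||u - w||_2^2, and the second equality of (5) can be checked numerically. The sketch below is in our own notation (the helper `both_sides_of_5` is a name we made up); it assumes threshold r = 0 and the mistake-driven update f(w_{t+1}) = f(w_t) - (η/2)(ŷ_t - y_t) x_t:

```python
import numpy as np

def d(u, w):
    # Bregman divergence for the identity link f(w) = w:
    # d_f(u, w) = 0.5 * ||u - w||_2^2 (the Perceptron case).
    return 0.5 * float(np.dot(u - w, u - w))

def both_sides_of_5(u, w_t, x_t, y_t, eta):
    """Return both sides of the second equality of (5) for the identity link."""
    a_t = float(w_t @ x_t)                    # algorithm's linear activation
    y_hat = -1.0 if a_t < 0 else 1.0          # prediction with threshold r = 0
    # update: f(w_{t+1}) = f(w_t) - (eta/2)(y_hat - y_t) x_t, f = identity
    w_next = w_t - (eta / 2.0) * (y_hat - y_t) * x_t
    lhs = (d(u, w_t) - d(u, w_next) + d(w_t, w_next)) / eta
    rhs = 0.5 * (y_hat - y_t) * (a_t - float(u @ x_t))
    return lhs, rhs
```

The two values agree (up to floating-point error) for any u, w_t, x_t, because the equality is an algebraic consequence of the update rule, independent of the data.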
We now apply Theorem 3 to the Perceptron algorithm with w_1 = 0, giving a bound in terms of the average margin of a linear threshold classifier u with threshold 0 on a trial sequence M: \n\nγ_{u,M} := (1/|M|) Σ_{t∈M} y_t u · x_t. \n\nSince y_t a_t ≤ 0 for t ∈ M, the l.h.s. of the inequality of Theorem 3 is at least |M| γ_{u,M}. By the update rule, Σ_{t∈M} d_f(w_t, w_{t+1}) = Σ_{t∈M} (η²/2)||x_t||₂² ≤ (η²/2)|M| X₂², where ||x_t||₂ ≤ X₂ for t ∈ M. Since in Theorem 3 u is an arbitrary vector, we replace u by λu therein, and set λ = η X₂²/γ_{u,M}. When we solve the resulting inequality for |M| the dependence on η cancels out. This gives us the following bound on the number of mistakes: \n\n|M| ≤ (||u||₂ X₂ / γ_{u,M})². \n\nNote that in the usual mistake bound for the Perceptron algorithm the average γ_{u,M} is replaced by min_{t∈M} y_t u · x_t.³ Also, observe that the predictions of the Perceptron algorithm with r = 0 and w_1 = 0 are not affected by η. Hence the previous bound holds for any η > 0. \n\nNext, we apply Theorem 3 to a normalized version of Winnow. This version of Winnow keeps weights in the probability simplex and is obtained by a slight modification of Winnow's link function. We assume r = 0 and choose X = {x ∈ R^n : ||x||_∞ ≤ X_∞}. Unlike the Perceptron algorithm, a Winnow-like algorithm depends heavily on the learning rate, so a careful tuning is needed. One can show (details omitted due to space limitations) that if η is such that η γ_{u,M} + η X_∞ - ln((e^{2ηX_∞} + 1)/2) > 0 then this normalized version of Winnow achieves the bound \n\n|M| ≤ d_f(u, w_1) / (η γ_{u,M} + η X_∞ - ln((e^{2ηX_∞} + 1)/2)), \n\nwhere d_f(u, w_1) is the relative entropy between the two probability vectors u and w_1. 
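To make the average-margin bound concrete, here is a small self-contained check (our own code, not the paper's) that runs the Perceptron with r = 0 and w_1 = 0 on separable toy data and verifies |M| ≤ (||u||₂ X₂ / γ_{u,M})², where M is the set of mistake trials:

```python
import numpy as np

def run_perceptron(examples, eta=1.0):
    """Perceptron with threshold r = 0 and w_1 = 0; returns the mistake trials M."""
    w = np.zeros(len(examples[0][0]))
    mistake_trials = []
    for x, y in examples:
        y_hat = -1.0 if w @ x < 0 else 1.0
        if y_hat != y:
            w = w + eta * y * x              # update only on mistakes
            mistake_trials.append((x, y))
    return mistake_trials

def average_margin(u, mistake_trials):
    """gamma_{u,M}: the mean of y_t * (u . x_t) over the mistake trials M."""
    return float(np.mean([y * (u @ x) for x, y in mistake_trials]))

# Separable toy data: labels come from the comparator u, with margin > 0.3 enforced.
rng = np.random.default_rng(1)
u = np.array([1.0, 1.0])
points = rng.uniform(-1.0, 1.0, size=(200, 2))
points = points[np.abs(points @ u) > 0.3]
examples = [(x, 1.0 if u @ x > 0 else -1.0) for x in points]

M = run_perceptron(examples)
if M:  # the bound is vacuous when no mistakes occur
    gamma = average_margin(u, M)                  # average margin over M
    X2 = max(np.linalg.norm(x) for x, _ in M)     # ||x_t||_2 <= X_2 for t in M
    assert len(M) <= (np.linalg.norm(u) * X2 / gamma) ** 2
```

Note that γ_{u,M} is computed only over the mistake trials, and here it is strictly larger than the minimum margin 0.3, so the average-margin bound is tighter than the classical one on this data.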
\nConclusions: In the full paper we study the case when there is no consistent linear threshold classifier u more carefully, and give more involved bounds for the Winnow and normalized Winnow algorithms as well as for the p-norm Perceptron algorithm [GLS97]. \n\nReferences \n\n[AW98] K. Azoury and M. K. Warmuth. Relative loss bounds and the exponential family of distributions. Unpublished manuscript, 1998. \n\n[Bre67] L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7:200-217, 1967. \n\n[FS98] Y. Freund and R. Schapire. Large margin classification using the perceptron algorithm. In 11th COLT, pp. 209-217. ACM, 1998. \n\n[GLS97] A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. In 10th COLT, pp. 171-183. ACM, 1997. \n\n[HKW95] D. P. Helmbold, J. Kivinen, and M. K. Warmuth. Worst-case loss bounds for sigmoided linear neurons. In NIPS 1995, pp. 309-315. MIT Press, 1995. \n\n[KW97] J. Kivinen and M. K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132(1):1-64, 1997. \n\n[KW98] J. Kivinen and M. K. Warmuth. Relative loss bounds for multidimensional regression problems. In NIPS 10, pp. 287-293. MIT Press, 1998. \n\n[KWA97] J. Kivinen, M. K. Warmuth, and P. Auer. The perceptron algorithm vs. winnow: linear vs. logarithmic mistake bounds when few input variables are relevant. Artificial Intelligence, 97:325-343, 1997. \n\n[Lit88] N. Littlestone. Learning when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285-318, 1988. \n\n[Lit89] N. Littlestone. Mistake Bounds and Logarithmic Linear-threshold Learning Algorithms. PhD thesis, University of California Santa Cruz, 1989. \n\n[Lit91] N. Littlestone. Redundant noisy attributes, attribute errors, and linear-threshold learning using Winnow. In 4th COLT, pp. 147-156. Morgan Kaufmann, 1991. \n\n³ The average margin γ_{u,M} may be positive even though u is not consistent.\n", "award": [], "sourceid": 1610, "authors": [{"given_name": "Claudio", "family_name": "Gentile", "institution": null}, {"given_name": "Manfred K.", "family_name": "Warmuth", "institution": null}]}