{"title": "Online Classification for Complex Problems Using Simultaneous Projections", "book": "Advances in Neural Information Processing Systems", "page_first": 17, "page_last": 24, "abstract": null, "full_text": "Online Classi\ufb01cation for Complex Problems Using\n\nSimultaneous Projections\n\nYonatan Amit1\n\nShai Shalev-Shwartz1 Yoram Singer1,2\n\n1 School of Computer Sci. & Eng., The Hebrew University, Jerusalem 91904, Israel\n\n2 Google Inc. 1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA\n\n{mitmit,shais,singer}@cs.huji.ac.il\n\nAbstract\n\nWe describe and analyze an algorithmic framework for online classi\ufb01cation where\neach online trial consists of multiple prediction tasks that are tied together. We\ntackle the problem of updating the online hypothesis by de\ufb01ning a projection\nproblem in which each prediction task corresponds to a single linear constraint.\nThese constraints are tied together through a single slack parameter. We then in-\ntroduce a general method for approximately solving the problem by projecting\nsimultaneously and independently on each constraint which corresponds to a pre-\ndiction sub-problem, and then averaging the individual solutions. We show that\nthis approach constitutes a feasible, albeit not necessarily optimal, solution for the\noriginal projection problem. We derive concrete simultaneous projection schemes\nand analyze them in the mistake bound model. We demonstrate the power of\nthe proposed algorithm in experiments with online multiclass text categorization.\nOur experiments indicate that a combination of class-dependent features with the\nsimultaneous projection method outperforms previously studied algorithms.\n\n1 Introduction\n\nIn this paper we discuss and analyze a framework for devising ef\ufb01cient online learning algorithms\nfor complex prediction problems such as multiclass categorization. 
In the settings we cover, a complex\nprediction problem is cast as the task of simultaneously coping with multiple simplified\nsub-problems which are nonetheless tied together. For example, in multiclass categorization, the task is\nto predict a single label out of k possible outcomes. Our simultaneous projection approach is based\non the fact that we can retrospectively (after making a prediction) cast the problem as the task of\nmaking k \u2212 1 binary decisions each of which involves the correct label and one of the competing\nlabels. The performance of the k \u2212 1 predictions is measured through a single loss. Our approach\nstands in contrast to previously studied methods which can roughly be partitioned into three\nparadigms. The first and probably the simplest previously studied approach is to break the problem\ninto multiple decoupled problems that are solved independently. Such an approach was used for\ninstance by Weston and Watkins [1] for batch learning of multiclass support vector machines. The\nsimplicity of this approach also underscores its deficiency as it is detached from the original loss of\nthe complex decision problem. The second approach maintains the original structure of the problem\nbut focuses on a single, worst performing, derived sub-problem (see for instance [2]). While this\napproach adheres to the original structure of the problem, the resulting update mechanism is by\nconstruction sub-optimal as it overlooks almost all of the constraints imposed by the complex\nprediction problem. (See also [6] for analysis and explanation of the sub-optimality of this approach.)\nThe third approach for dealing with complex problems is to tailor a specific efficient solution for\nthe problem at hand. 
While this approach yielded efficient learning algorithms for multiclass\ncategorization problems [2] and aesthetic solutions for structured output problems [3, 4], devising these\nalgorithms required dedicated efforts. Moreover, tailored solutions typically impose rather\nrestrictive assumptions on the representation of the data in order to yield efficient algorithmic solutions.\n\nIn contrast to previously studied approaches, we propose a simple, general, and efficient framework\nfor online learning of a wide variety of complex problems. We do so by casting the online update\ntask as an optimization problem in which the newly devised hypothesis is required to be similar to\nthe current hypothesis while attaining a small loss on multiple binary prediction problems. Casting\nthe online learning task as a sequence of instantaneous optimization problems was first suggested\nand analyzed by Kivinen and Warmuth [12] for binary classification and regression problems. In\nour optimization-based approach, the complex decision problem is cast as an optimization problem\nthat consists of multiple linear constraints each of which represents a simplified sub-problem. These\nconstraints are tied through a single slack variable whose role is to assess the overall prediction\nquality for the complex problem. We describe and analyze a family of two-phase algorithms. In the\nfirst phase, the algorithms solve simultaneously multiple sub-problems. Each sub-problem distills\nto an optimization problem with a single linear constraint from the original multiple-constraints\nproblem. The simple structure of each single-constraint problem results in an analytical solution\nwhich is efficiently computable. In the second phase, the algorithms take a convex combination of\nthe independent solutions to obtain a solution for the multiple-constraints problem. 
The end result is\nan approach whose time complexity and mistake bounds are equivalent to approaches which solely\ndeal with the worst-violating constraint [9]. In practice, though, the performance of the simultaneous\nprojection framework is much better than that of single-constraint update schemes.\n\n2 Problem Setting\n\nIn this section we introduce the notation used throughout the paper and formally describe our\nproblem setting. We denote vectors by lower case bold face letters (e.g. x and \u03c9) where the j\u2019th element\nof x is denoted by x_j. We denote matrices by upper case bold face letters (e.g. X), where the j\u2019th\nrow of X is denoted by x_j. The set of integers {1, . . . , k} is denoted by [k]. Finally, we use the\nhinge function [a]_+ = max{0, a}.\nOnline learning is performed in a sequence of trials. At trial t the algorithm receives a matrix X^t\nof size k_t \u00d7 n, where each row of X^t is an instance, and is required to make a prediction on the\nlabel associated with each instance. We denote the vector of predicted labels by \u0177^t. We allow \u0177^t_j\nto take any value in R, where the actual label being predicted is sign(\u0177^t_j) and |\u0177^t_j| is the confidence\nin the prediction. After making a prediction \u0177^t the algorithm receives the correct labels y^t where\ny^t_j \u2208 {\u22121, 1} for all j \u2208 [k_t]. In this paper we assume that the predictions in each trial are formed\nby calculating the inner product between a weight vector \u03c9^t \u2208 R^n and each instance in X^t, thus\n\u0177^t = X^t \u03c9^t. Our goal is to perfectly predict the entire vector y^t. We thus say that the vector \u0177^t\nwas imperfectly predicted if there exists an outcome j such that y^t_j \u2260 sign(\u0177^t_j). That is, we suffer a\nunit loss on trial t if there exists j, such that sign(\u0177^t_j) \u2260 y^t_j. Directly minimizing this combinatorial\nerror is a computationally difficult task. Therefore, we use an adaptation of the hinge-loss, defined as\n\u2113(\u0177^t, y^t) = max_{j\u2208[k_t]} [1 \u2212 y^t_j \u0177^t_j]_+, as a proxy for the combinatorial error. The quantity y^t_j \u0177^t_j is\noften referred to as the (signed) margin of the prediction and ties the correctness and the confidence\nin the prediction. We use \u2113(\u03c9; (X^t, y^t)) to denote \u2113(\u0177^t, y^t) where \u0177^t = X^t \u03c9. We also denote the\nset of instances whose labels were predicted incorrectly by Mt = {j | sign(\u0177^t_j) \u2260 y^t_j}, and similarly\nthe set of instances whose hinge-losses are greater than zero by \u0393t = {j | [1 \u2212 y^t_j \u0177^t_j]_+ > 0}.\n\n3 Derived Problems\n\nIn this section we further explore the motivation for our problem setting by describing two different\ncomplex decision tasks and showing how they can be cast as special cases of our setting. We also\nwould like to note that our approach can be employed in other prediction problems (see Sec. 7).\nMultilabel Categorization In the multilabel categorization task each instance is associated with\na set of relevant labels from the set [k]. The multilabel categorization task can be cast as a\nspecial case of a ranking task in which the goal is to rank the relevant labels above the\nirrelevant ones. Many learning algorithms for this task employ class-dependent features (for\nexample, see [7]). For simplicity, assume that each class is associated with n features and\ndenote by \u03c6(x, r) the feature vector for class r. We would like to note that features obtained\nfor different classes typically relay different information and are often substantially different.\nA categorizer, or label ranker, is based on a weight vector\n\u03c9. 
A vector \u03c9 induces a score for each class \u03c9 \u00b7 \u03c6(x, r)\nwhich, in turn, de\ufb01nes an ordering of the classes. A learner is\nrequired to build a vector \u03c9 that successfully ranks the labels\naccording to their relevance, namely for each pair of classes\n(r, s) such that r is relevant while s is not, the class r should\nbe ranked higher than the class s. Thus we require that \u03c9 \u00b7\n\u03c6(x, r) > \u03c9 \u00b7 \u03c6(x, s) for every such pair (r, s). We say that a\nlabel ranking is imperfect if there exists any pair (r, s) which\nviolates this requirement. The loss associated with each such\nviolation is [1 \u2212 (\u03c9 \u00b7 \u03c6(x, r) \u2212 \u03c9 \u00b7 \u03c6(x, s))]+ and the loss\nof the categorizer is de\ufb01ned as the maximum over the losses\ninduced by the violated pairs. In order to map the problem to\nour setting, we de\ufb01ne a virtual instance for every pair (r, s)\nsuch that r is relevant and s is not. The new instance is the\nn dimensional vector de\ufb01ned by \u03c6(x, r)\u2212 \u03c6(x, s). The label\nassociated with all of the instances is set to 1. It is clear that\nan imperfect categorizer makes a prediction mistake on at\nleast one of the instances, and that the losses de\ufb01ned by both\nproblems are the same.\nOrdinal Regression In the problem of ordinal regression an instance x is a vector of n features\nthat is associated with a target rank y \u2208 [k]. A learning algorithm is required to \ufb01nd a vector \u03c9\nand k thresholds b1 \u2264 \u00b7\u00b7\u00b7 \u2264 bk\u22121 \u2264 bk = \u221e. The value of \u03c9 \u00b7 x provides a score from which the\nprediction value can be de\ufb01ned as the smallest index i for which \u03c9\u00b7x < bi, \u02c6y = min{i|\u03c9 \u00b7 x < bi}.\nIn order to obtain a correct prediction, an ordinal regressor is required to ensure that \u03c9\u00b7x \u2265 bi for all\ni < y and that \u03c9 \u00b7 x < bi for i \u2265 y. 
It is considered a prediction mistake if any of these constraints\nis violated. In order to map the ordinal regression task to our setting, we introduce k \u2212 1 instances.\nEach instance is a vector in R^{n+k\u22121}. The first n entries of the vector are set to be the elements of\nx, the remaining k \u2212 1 entries are set to \u2212\u03b4_{i,j}. That is, the i\u2019th entry in the j\u2019th vector is set to \u22121\nif i = j and to 0 otherwise. The label of the first y \u2212 1 instances is 1, while the remaining k \u2212 y\ninstances are labeled as \u22121. Once we have learned an expanded vector in R^{n+k\u22121}, the regressor \u03c9 is\nobtained by taking the first n components of the expanded vector and the thresholds b_1, . . . , b_{k\u22121}\nare set to be the last k \u2212 1 elements. A prediction mistake on any of the instances corresponds to an\nincorrect rank in the original problem.\n\nFigure 1: Illustration of the simultaneous projections algorithm: each instance\ncasts a constraint on \u03c9 and each such constraint defines a halfspace of feasible\nsolutions. We project on each halfspace in parallel and the new vector is a\nweighted average of these projections.\n\n4 Simultaneous Projection Algorithms\n\nRecall that on trial t the algorithm receives a matrix, X^t, of k_t instances, and predicts \u0177^t = X^t \u03c9^t.\nAfter performing its prediction, the algorithm receives the corresponding labels y^t. Each such\ninstance-label pair casts a constraint on \u03c9^t, y^t_j (\u03c9^t \u00b7 x^t_j) \u2265 1. If all the constraints are satisfied\nby \u03c9^t then \u03c9^{t+1} is set to be \u03c9^t and the algorithm proceeds to the next trial. Otherwise, we would\nlike to set \u03c9^{t+1} as close as possible to \u03c9^t while satisfying all constraints.\nSuch an aggressive approach may be sensitive to outliers and over-fitting. 
Thus, we allow some\nof the constraints to remain violated by introducing a tradeoff between the change to \u03c9^t and the\nloss attained on (X^t, y^t). Formally, we would like to set \u03c9^{t+1} to be the solution of the following\noptimization problem, min_{\u03c9\u2208R^n} (1/2) ||\u03c9 \u2212 \u03c9^t||^2 + C \u2113(\u03c9; (X^t, y^t)), where C is a tradeoff parameter.\nAs we discuss below, this formalism effectively translates to a cap on the maximal change to \u03c9^t.\nWe rewrite the above optimization by introducing a single slack variable as follows:\n\n    min_{\u03c9\u2208R^n, \u03be\u22650} (1/2) ||\u03c9 \u2212 \u03c9^t||^2 + C\u03be   s.t.   \u2200j \u2208 [k_t] : y^t_j (\u03c9 \u00b7 x^t_j) \u2265 1 \u2212 \u03be ,  \u03be \u2265 0 .    (1)\n\nWe denote the objective function of Eq. (1) by P^t and refer to it as the instantaneous primal problem\nto be solved on trial t. The dual optimization problem of P^t is the maximization problem\n\n    max_{\u03b1^t_1,..,\u03b1^t_{k_t}} \u2211_{j=1}^{k_t} \u03b1^t_j \u2212 (1/2) ||\u03c9^t + \u2211_{j=1}^{k_t} \u03b1^t_j y^t_j x^t_j||^2   s.t.   \u2211_{j=1}^{k_t} \u03b1^t_j \u2264 C ,  \u2200j : \u03b1^t_j \u2265 0 .    (2)\n\nEach dual variable corresponds to a single constraint of the primal problem. The minimizer of the\nprimal problem is calculated from the optimal dual solution as follows, \u03c9^{t+1} = \u03c9^t + \u2211_{j=1}^{k_t} \u03b1^t_j y^t_j x^t_j.\nUnfortunately, in the common case, where each x^t_j is in an arbitrary orientation, there does not exist\nan analytic solution for the dual problem (Eq. (2)). We tackle the problem by breaking it down\ninto k_t reduced problems, each of which focuses on a single dual variable. Formally, for the j\u2019th\nvariable, the j\u2019th reduced problem solves Eq. (2) while fixing \u03b1^t_{j\u2032} = 0 for all j\u2032 \u2260 j. Each reduced\noptimization problem amounts to the following problem\n\n    max_{\u03b1^t_j} \u03b1^t_j \u2212 (1/2) ||\u03c9^t + \u03b1^t_j y^t_j x^t_j||^2   s.t.   \u03b1^t_j \u2208 [0, C] .    (3)\n\nWe next obtain an exact or approximate solution for each reduced problem as if it were\nindependent of the rest. We then choose a distribution \u00b5^t \u2208 \u2206^{k_t}, where \u2206^{k_t} = {\u00b5 \u2208 R^{k_t} :\n\u2211_j \u00b5_j = 1, \u00b5_j \u2265 0} is the probability simplex, and multiply each \u03b1^t_j by the corresponding\n\u00b5^t_j. Since \u00b5^t \u2208 \u2206^{k_t}, this yields a feasible solution to the dual problem defined in Eq. (2) for\nthe following reason. Each \u00b5^t_j \u03b1^t_j \u2265 0 and the fact that \u03b1^t_j \u2264 C implies that \u2211_{j=1}^{k_t} \u00b5^t_j \u03b1^t_j \u2264 C.\nFinally, the algorithm uses the combined solution and sets \u03c9^{t+1} = \u03c9^t + \u2211_{j=1}^{k_t} \u00b5^t_j \u03b1^t_j y^t_j x^t_j.\n\nInput: Aggressiveness parameter C > 0\nInitialize: \u03c9^1 = (0, . . . , 0)\nFor t = 1, 2, . . . , T :\n    Receive instance matrix X^t \u2208 R^{k_t\u00d7n}\n    Predict \u0177^t = X^t \u03c9^t\n    Receive correct labels y^t\n    Suffer loss \u2113(\u03c9^t; (X^t, y^t))\n    If \u2113 > 0:\n        Choose importance weights \u00b5^t \u2208 \u2206^{k_t}\n        Choose individual dual solutions \u03b1^t_j\n        Update \u03c9^{t+1} = \u03c9^t + \u2211_{j=1}^{k_t} \u00b5^t_j \u03b1^t_j y^t_j x^t_j\n\nFigure 2: Simultaneous projections algorithm.\n\nWe next present three schemes to obtain a solution for the reduced problem (Eq. (3)) and then\ncombine the solution into a single update.\nSimultaneous Perceptron: The simplest of the update forms generalizes the famous Perceptron\nalgorithm from [8] by setting \u03b1^t_j to C if the j\u2019th instance is incorrectly labeled, and to 0 otherwise.\nWe similarly set the weight \u00b5^t_j to be 1/|Mt| for j \u2208 Mt and to 0 otherwise. We abbreviate this\nscheme as the SimPerc algorithm.\nSoft Simultaneous Projections: The soft simultaneous projections scheme uses the fact that each\nreduced problem has an analytic solution, yielding \u03b1^t_j = min{C, \u2113(\u03c9^t; (x^t_j, y^t_j)) / ||x^t_j||^2}. We\nindependently assign each \u03b1^t_j this optimal solution. We next set \u00b5^t_j to be 1/|\u0393t| for j \u2208 \u0393t and to 0\notherwise. We would like to comment that this solution may update \u03b1^t_j also for instances which\nwere correctly classified as long as the margin they attain is not sufficiently large. We abbreviate\nthis scheme as the SimProj algorithm.\nConservative Simultaneous Projections: Combining ideas from both methods, the conservative\nsimultaneous projections scheme optimally sets \u03b1^t_j according to the analytic solution. The difference\nfrom the SimProj algorithm lies in the selection of \u00b5^t. In the conservative scheme only the instances\nwhich were incorrectly predicted (j \u2208 Mt) are assigned a positive weight. Put differently, \u00b5^t_j is set\nto 1/|Mt| for j \u2208 Mt and to 0 otherwise. We abbreviate this scheme as the ConProj algorithm.\n\nTo recap, on each trial t we obtain a feasible solution for the instantaneous dual given in Eq. (2).\nThis solution combines independently calculated \u03b1^t_j, according to a weight vector \u00b5^t \u2208 \u2206^{k_t}. While\nthis solution may not be optimal, it does constitute an infrastructure for obtaining a mistake bound\nand, as we demonstrate in Sec. 6, performs well in practice.\n\n5 Analysis\n\nThe algorithms described in the previous section perform updates in order to increase the value of the\ninstantaneous dual objective defined in Eq. 
(2). We now use the mistake bound model to derive an upper\nbound on the number of trials on which the predictions of the SimPerc and ConProj algorithms are\nimperfect. Following [6], the first step in the analysis is to tie the instantaneous dual problems to\na global loss function. To do so, we introduce a primal optimization problem defined over the\nentire sequence of examples as follows, min_{\u03c9\u2208R^n} (1/2) ||\u03c9||^2 + C \u2211_{t=1}^{T} \u2113(\u03c9; (X^t, y^t)). We rewrite the\noptimization problem as the following equivalent constrained optimization problem,\n\n    min_{\u03c9\u2208R^n, \u03be\u2208R^T} (1/2) ||\u03c9||^2 + C \u2211_{t=1}^{T} \u03be_t   s.t.   \u2200t \u2208 [T], \u2200j \u2208 [k_t] : y^t_j (\u03c9 \u00b7 x^t_j) \u2265 1 \u2212 \u03be_t ,  \u2200t : \u03be_t \u2265 0 .    (4)\n\nWe denote the value of the objective function at (\u03c9, \u03be) for this optimization problem by P(\u03c9, \u03be).\nA competitor who may see the entire sequence of examples in advance may in particular set (\u03c9, \u03be)\nto be the minimizer of the problem which we denote by (\u03c9*, \u03be*). Standard usage of Lagrange\nmultipliers yields that the dual of Eq. (4) is,\n\n    max_{\u03bb} \u2211_{t=1}^{T} \u2211_{j=1}^{k_t} \u03bb_{t,j} \u2212 (1/2) ||\u2211_{t=1}^{T} \u2211_{j=1}^{k_t} \u03bb_{t,j} y^t_j x^t_j||^2   s.t.   \u2200t : \u2211_{j=1}^{k_t} \u03bb_{t,j} \u2264 C ,  \u2200t, j : \u03bb_{t,j} \u2265 0 .    (5)\n\nWe denote the value of the objective function of Eq. (5) by D(\u03bb_1, . . . , \u03bb_T), where each \u03bb_t is a\nvector in R^{k_t}. Throughout our derivation we use the fact that any set of dual variables \u03bb_1, . . . , \u03bb_T\ndefines a feasible solution \u03c9 = \u2211_{t=1}^{T} \u2211_{j=1}^{k_t} \u03bb_{t,j} y^t_j x^t_j with a corresponding assignment of the slack\nvariables.\nClearly, the optimization problem given by Eq. (5) depends on all the examples from the first trial\nthrough time step T and thus can only be solved in hindsight. We note however, that if we ensure\nthat \u03bb_{s,j} = 0 for all s > t then the dual function no longer depends on instances occurring on rounds\nfollowing round t. As we show next, we use this primal-dual view to derive the skeleton algorithm\nfrom Fig. 2 by finding a new feasible solution for the dual problem on every trial. Formally, the\ninstantaneous dual problem, given by Eq. (2), is equivalent (after omitting an additive constant) to\nthe following constrained optimization problem,\n\n    max_{\u03bb} D(\u03bb_1, . . . , \u03bb_{t\u22121}, \u03bb, 0, . . . , 0)   s.t.   \u03bb \u2265 0 ,  \u2211_{j=1}^{k_t} \u03bb_j \u2264 C .    (6)\n\nThat is, the instantaneous dual problem is obtained from D(\u03bb_1, . . . , \u03bb_T) by fixing \u03bb_1, . . . , \u03bb_{t\u22121}\nto the values set in previous rounds, forcing \u03bb_{t+1} through \u03bb_T to the zero vectors, and choosing a\nfeasible vector for \u03bb_t. Given the set of dual variables \u03bb_1, . . . , \u03bb_{t\u22121} it is straightforward to show that\nthe prediction vector used on trial t is \u03c9^t = \u2211_{s=1}^{t\u22121} \u2211_{j} \u03bb_{s,j} y^s_j x^s_j. Equipped with these relations and\nomitting constants which do not depend on \u03bb_t, Eq. (6) can be rewritten as,\n\n    max_{\u03bb_1,...,\u03bb_{k_t}} \u2211_{j=1}^{k_t} \u03bb_j \u2212 (1/2) ||\u03c9^t + \u2211_{j=1}^{k_t} \u03bb_j y^t_j x^t_j||^2   s.t.   \u2200j : \u03bb_j \u2265 0 ,  \u2211_{j=1}^{k_t} \u03bb_j \u2264 C .    (7)\n\nThe problems defined by Eq. (7) and Eq. (2) are equivalent. Thus, weighing the variables\n\u03b1^t_1, . . . , \u03b1^t_{k_t} by \u00b5^t_1, . . . , \u00b5^t_{k_t} also yields a feasible solution for the problem defined in Eq. (6), namely\n\u03bb_{t,j} = \u00b5^t_j \u03b1^t_j. We now tie all of these observations together by using the weak-duality theorem. Our\nfirst bound is given for the SimPerc algorithm.\n\nTheorem 1. Let (X^1, y^1), . . . , (X^T, y^T) be a sequence of examples where X^t is a matrix of k_t\nexamples and y^t are the associated labels. Assume that for all t and j the norm of an instance x^t_j\nis at most R. Then, for any \u03c9* \u2208 R^n the number of trials on which the prediction of SimPerc is\nimperfect is at most,\n\n    ( (1/2) ||\u03c9*||^2 + C \u2211_{t=1}^{T} \u2113(\u03c9*; (X^t, y^t)) ) / ( C \u2212 (1/2) C^2 R^2 ) .\n\nProof. To prove the theorem we make use of the weak-duality theorem. Recall that any dual feasible\nsolution induces a value for the dual\u2019s objective function which is upper bounded by the optimum\nvalue of the primal problem, P(\u03c9*, \u03be*). In particular, the solution obtained at the end of trial T\nis dual feasible, and thus D(\u03bb_1, . . . , \u03bb_T) \u2264 P(\u03c9*, \u03be*). We now rewrite the left hand-side of the\nabove inequality as the following sum,\n\n    D(0, . . . , 0) + \u2211_{t=1}^{T} [ D(\u03bb_1, . . . , \u03bb_t, 0, . . . , 0) \u2212 D(\u03bb_1, . . . , \u03bb_{t\u22121}, 0, . . . , 0) ] .    (8)\n\nNote that D(0, . . . , 0) equals 0. Therefore, denoting by \u2206t the difference in two consecutive dual\nobjective values, D(\u03bb_1, . . . , \u03bb_t, 0, . . . , 0) \u2212 D(\u03bb_1, . . . , \u03bb_{t\u22121}, 0, . . . , 0), we get that \u2211_{t=1}^{T} \u2206t \u2264\nP(\u03c9*, \u03be*). We now turn to bounding \u2206t from below. First, note that if the prediction on trial t is\nperfect (Mt = \u2205) then SimPerc sets \u03bb_t to the zero vector and thus \u2206t = 0. We can thus focus on\ntrials for which the algorithm\u2019s prediction is imperfect. We remind the reader that by unraveling the\nupdate of \u03c9^t we get that \u03c9^t = \u2211_{s=1}^{t\u22121} \u2211_{j=1}^{k_s} \u03bb_{s,j} y^s_j x^s_j. We now rewrite \u2206t as follows,\n\n    \u2206t = \u2211_{j=1}^{k_t} \u03bb_{t,j} \u2212 (1/2) ||\u03c9^t + \u2211_{j=1}^{k_t} \u03bb_{t,j} y^t_j x^t_j||^2 + (1/2) ||\u03c9^t||^2 .    (9)\n\nBy construction, \u03bb_{t,j} = \u00b5^t_j \u03b1^t_j and \u2211_{j=1}^{k_t} \u00b5^t_j = 1, which lets us further expand Eq. (9) and write,\n\n    \u2206t = \u2211_{j=1}^{k_t} \u00b5^t_j \u03b1^t_j \u2212 (1/2) ||\u03c9^t + \u2211_{j=1}^{k_t} \u00b5^t_j \u03b1^t_j y^t_j x^t_j||^2 + (1/2) ||\u03c9^t||^2 .\n\nThe squared norm, ||\u00b7||^2, is a convex function in its vector argument and thus \u2206t is concave in \u00b5^t, which\n(writing \u03c9^t = \u2211_{j} \u00b5^t_j \u03c9^t, since the weights sum to one) yields the following lower bound on \u2206t,\n\n    \u2206t \u2265 \u2211_{j=1}^{k_t} \u00b5^t_j [ \u03b1^t_j \u2212 (1/2) ||\u03c9^t + \u03b1^t_j y^t_j x^t_j||^2 + (1/2) ||\u03c9^t||^2 ] .    (10)\n\nThe SimPerc algorithm sets \u00b5^t_j to be 1/|Mt| for all j \u2208 Mt and to be 0 otherwise. Furthermore,\nfor all j \u2208 Mt, \u03b1^t_j is set to C. Thus, the right hand-side of Eq. (10) can be further simplified and\nwritten as,\n\n    \u2206t \u2265 \u2211_{j\u2208Mt} \u00b5^t_j [ C \u2212 (1/2) ||\u03c9^t + C y^t_j x^t_j||^2 + (1/2) ||\u03c9^t||^2 ] .\n\nWe expand the norm in the above equation and obtain that,\n\n    \u2206t \u2265 \u2211_{j\u2208Mt} \u00b5^t_j [ C \u2212 (1/2) ||\u03c9^t||^2 \u2212 C y^t_j \u03c9^t \u00b7 x^t_j \u2212 (1/2) C^2 ||y^t_j x^t_j||^2 + (1/2) ||\u03c9^t||^2 ] .    (11)\n\nThe set Mt consists of indices of instances which were incorrectly classified. Thus, y^t_j (\u03c9^t \u00b7 x^t_j) \u2264 0\nfor every j \u2208 Mt. Therefore, \u2206t can further be bounded from below as follows,\n\n    \u2206t \u2265 \u2211_{j\u2208Mt} \u00b5^t_j [ C \u2212 (1/2) C^2 ||y^t_j x^t_j||^2 ] \u2265 \u2211_{j\u2208Mt} \u00b5^t_j [ C \u2212 (1/2) C^2 R^2 ] = C \u2212 (1/2) C^2 R^2 ,    (12)\n\nwhere for the second inequality we used the fact that the norm of all the instances is bounded by\nR. To recap, we have shown that on trials for which the prediction is imperfect \u2206t \u2265 C \u2212 (1/2) C^2 R^2,\nwhile in perfect trials where no mistake is made \u2206t = 0. Putting all the inequalities together we\nobtain the following bound,\n\n    ( C \u2212 (1/2) C^2 R^2 ) \u03b5 \u2264 \u2211_{t=1}^{T} \u2206t = D(\u03bb_1, . . . , \u03bb_T) \u2264 P(\u03c9*, \u03be*) ,    (13)\n\nwhere \u03b5 is the number of imperfect trials. Finally, rewriting P(\u03c9*, \u03be*) as (1/2) ||\u03c9*||^2 +\nC \u2211_{t=1}^{T} \u2113(\u03c9*; (X^t, y^t)) yields the bound stated in the theorem.\n\nThe ConProj algorithm updates the same set of dual variables as the SimPerc algorithm, but selects\n\u03b1^t_j to be the optimal solution of Eq. (3). Thus, the value of \u2206t attained by the ConProj algorithm\nis never lower than the value attained by the SimPerc algorithm. 
The following corollary is a direct\nconsequence of this observation.\nCorollary 1. Under the same conditions of Thm. 1 and for any \u03c9? \u2208 Rn, the number of trials on\nwhich the prediction of ConProj is imperfect is at most,\n\n2k\u03c9?k2 + CPT\n\n1\n\nt=1 \u2018 (\u03c9?; (Xt, yt))\n2 C 2R2\n\nC \u2212 1\n\n.\n\n\fusername\n\nbeck-s\nfarmer-d\nkaminski-v\nkitchen-l\nlokay-m\nsanders-r\n\nwilliams-w3\n\nk\n101\n25\n41\n47\n11\n30\n18\n\nm\n\n1973\n3674\n4479\n4017\n2491\n1190\n2771\n\n50.0\n27.4\n43.1\n42.9\n18.8\n20.7\n4.2\n\nSimProj ConProj\n\n55.2\n30.3\n47.8\n47.0\n25.3\n25.6\n5.0\n\nSimPerc Max-SP Max-MP Mira\n63.7\n31.8\n47.3\n52.6\n25.3\n34.1\n5.9\n\n56.6\n30.0\n49.5\n48.0\n23.0\n23.8\n4.2\n\n63.8\n28.6\n49.6\n54.9\n25.4\n36.3\n5.8\n\n55.9\n30.7\n47.0\n49.0\n25.3\n23.2\n5.4\n\nTable 1: The percentage of online mistakes of the three variants compared to Max-Update (Single\nprototype (SP) and Multi prototype (MP)) and the Mira algorithm. Experiments were performed on\nseven users of the Enron data set.\n\nNote that the predictions of the SimPerc algorithm do not depend on the speci\ufb01c value of C, thus\nfor R = 1 and an optimal choice of C the bound attained in Thm. 1 now becomes.\n\n\u2018(cid:0)\u03c9?; (Xt, yt)(cid:1) +\n\nk\u03c9?k2 +\n\n1\n2\n\n1\n2\n\npk\u03c9?k4 + k\u03c9?k2\u2018 (\u03c9?; (Xt, yt)) .\n\nWe omit the proof for lack of space, see [6] for a closely related analysis.\nWe conclude this section with a few closing words about the SimProj variant. The SimPerc and\nConProj algorithms ensure a minimal increase in the dual by focusing solely on classi\ufb01cation errors\nand ignoring margin errors. While this approach ensures a suf\ufb01cient increase of the dual, in practice\nit appears to be a double edged sword as the SimProj algorithm performs empirically better. This\nsuperior empirical performance can be motivated by a re\ufb01ned derivation of the optimal choice for\n\u00b5. 
This derivation will be provided in a long version of this manuscript.\n\n6 Experiments\n\nIn this section we describe experimental results in order to demonstrate some of the merits\nof our algorithms. We tested the performance of the three variants described in Sec. 4\non a multiclass categorization task and compared them to previously studied algorithms for\nmulticlass categorization. We compared our algorithms to the single-prototype and\nmulti-prototype Max-Update algorithms from [9] and to the Mira algorithm [2]. The experiments\nwere performed on the task of email classification using the Enron email dataset (available at\nhttp://www.cs.cmu.edu/~enron/enron_mail_030204.tar.gz). The learning goal was to correctly classify\nemail messages into user defined folders. Thus, the instances in this dataset are email messages,\nwhile the set of classes are the user defined folders denoted by {1, . . . , k}. We ran the experiments\non the sequence of email messages from 7 different users.\nSince each user employs different criteria for email classification, we treated each person as a\nseparate online learning problem. We represented each email message as a vector with a component\nfor every word in the corpus. On each trial, and for each class r, we constructed class-dependent\nvectors as follows. We set \u03c6_j(x^t, r) to twice the number of times the j\u2019th word appeared in the\nmessage if it had also appeared in a fifth of the messages previously assigned to folder r. Similarly,\nwe set \u03c6_j(x^t, r) to minus the number of appearances of the word if it had appeared in less\nthan 2 percent of previous messages. In all other cases, we set \u03c6_j(x^t, r) to 0. This class-dependent\nconstruction is closely related to the construction given in [10]. Next, we employed the mapping\ndescribed in Sec. 3, and defined a set of k \u2212 1 instances for each message as follows. 
Denote the relevant\nclass by r; then for every irrelevant class s \u2260 r, we define an instance x^t_s = \u03c6(x^t, r) \u2212 \u03c6(x^t, s)\nand set its label to 1. All these instances were combined into a single matrix X^t and were provided\nto the algorithm in trial t.\nThe results of the experiments are summarized in Table 1. It is apparent that the SimProj algorithm\noutperforms all other algorithms. The performances of SimPerc and ConProj are comparable\nwith no obvious winner. It is worth noting that the Mira algorithm finds the optimum of a projection\nproblem on each trial while our algorithms only find an approximate solution. However, Mira\nemploys a different approach in which there is a single input instance (instead of the set X^t) and\nconstructs multiple predictors (instead of a single vector \u03c9). Thus, Mira employs a larger hypothesis\nspace which is more difficult to learn in online settings. In addition, by employing a single vector\nrepresentation of the email message, Mira cannot benefit from feature selection which yields class-dependent\nfeatures. It is also apparent that the simultaneous projection variants, while remaining\nsimple to implement, consistently outperform the Max-Update technique which is commonly used\nin online multiclass classification. In Fig. 3 we plot the cumulative number of mistakes as a function\nof the trial number for 3 of the 7 users. The graphs clearly indicate the high correlation between the\nSimPerc and ConProj variants, while indicating the superiority of the SimProj variant.\n\nFigure 3: The cumulative number of mistakes as a function of the number of trials.\n\n7 Extensions and discussion\n\nWe presented a new approach for online categorization with complex output structure. 
Our algorithms\ndecouple the complex optimization task into multiple sub-tasks, each of which is simple\nenough to be solved analytically. While the dual representation of the online problem imposes a\nglobal constraint on all the dual variables, namely \u2211_j \u03b1^t_j \u2264 C, our framework of simultaneous\nprojections which are followed by averaging the solutions automatically adheres to this constraint\nand hence constitutes a feasible solution. It is worthwhile noting that our approach can also cope\nwith multiple constraints of the more general form \u2211_j \u03bd_j \u03b1_j \u2264 C, where \u03bd_j \u2265 0 for all j. The\nbox constraint implied for each individual projection problem reduces to 0 \u2264 \u03b1_j \u2264 C/\u03bd_j and thus\nthe simultaneous projection algorithm can be used verbatim. We are currently exploring the usage\nof this extension in complex decision problems with multiple structural constraints. Another possible\nextension is to replace the squared norm regularization with other twice differentiable penalty\nfunctions. Algorithms of this more general framework still attain similar mistake bounds and are\neasy to implement so long as the induced individual problems are efficiently solvable. A particularly\ninteresting case is obtained when setting the penalty to the relative entropy. In this case we\nobtain a generalization of the Winnow and the EG algorithms [11, 12] for complex classification\nproblems. Another interesting direction is the usage of simultaneous projections for problems with\nmore constrained structured output such as max-margin networks [3].\n\nReferences\n[1] J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In Proc. of the Seventh European Symposium on Artificial Neural Networks, April 1999.\n[2] K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. J. of Machine Learning Res., 3:951\u2013991, 2003.\n[3] B. Taskar, C. Guestrin, and D. Koller. 
Max-margin markov networks. In Advances in Neural Information Processing Systems 17, 2003.\n[4] I. 
Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In Proc. of the 21st Intl. Conference on Machine Learning, 2004.\n[5] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119\u2013139, August 1997.\n[6] S. Shalev-Shwartz and Y. Singer. Online learning meets optimization in the dual. In Proc. of the Nineteenth Annual Conference on Computational Learning Theory, 2006.\n[7] R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 32(2/3), 2000.\n[8] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386\u2013407, 1958. (Reprinted in Neurocomputing, MIT Press, 1988.)\n[9] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive aggressive algorithms. Journal of Machine Learning Research, 7, Mar 2006.\n[10] M. Fink, S. Shalev-Shwartz, Y. Singer, and S. Ullman. Online multiclass learning by interclass hypothesis sharing. In Proc. of the 23rd International Conference on Machine Learning, 2006.\n[11] N. Littlestone. Learning when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285\u2013318, 1988.\n[12] J. Kivinen and M. Warmuth. 
Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1\u201364, January 1997.\n\n[Figure 3 plots: panels farmer-d, lokay-m, sanders-r; curves SimProj, ConProj, SimPerc, Mira.]", "award": [], "sourceid": 2984, "authors": [{"given_name": "Yonatan", "family_name": "Amit", "institution": null}, {"given_name": "Shai", "family_name": "Shalev-shwartz", "institution": null}, {"given_name": "Yoram", "family_name": "Singer", "institution": null}]}