{"title": "Online Classification on a Budget", "book": "Advances in Neural Information Processing Systems", "page_first": 225, "page_last": 232, "abstract": "", "full_text": "Online Classi\ufb01cation on a Budget\n\nKoby Crammer\n\nComputer Sci. & Eng.\n\nHebrew University\n\nJerusalem 91904, Israel\n\nJaz Kandola\n\nRoyal Holloway,\n\nUniversity of London\n\nEgham, UK\n\nYoram Singer\n\nComputer Sci. & Eng.\n\nHebrew University\n\nJerusalem 91904, Israel\n\nkobics@cs.huji.ac.il\n\njaz@cs.rhul.ac.uk\n\nsinger@cs.huji.ac.il\n\nAbstract\n\nOnline algorithms for classi\ufb01cation often require vast amounts of mem-\nory and computation time when employed in conjunction with kernel\nfunctions. In this paper we describe and analyze a simple approach for an\non-the-\ufb02y reduction of the number of past examples used for prediction.\nExperiments performed with real datasets show that using the proposed\nalgorithmic approach with a single epoch is competitive with the sup-\nport vector machine (SVM) although the latter, being a batch algorithm,\naccesses each training example multiple times.\n\n1\n\nIntroduction and Motivation\n\nKernel-based methods are widely being used for data modeling and prediction because of\ntheir conceptual simplicity and outstanding performance on many real-world tasks. The\nsupport vector machine (SVM) is a well known algorithm for \ufb01nding kernel-based linear\nclassi\ufb01ers with maximal margin [7]. The kernel trick can be used to provide an effective\nmethod to deal with very high dimensional feature spaces as well as to model complex in-\nput phenomena via embedding into inner product spaces. However, despite generalization\nerror being upper bounded by a function of the margin of a linear classi\ufb01er, it is notoriously\ndif\ufb01cult to implement such classi\ufb01ers ef\ufb01ciently. Empirically this often translates into very\nlong training times. 
A number of alternative algorithms exist for finding a maximal margin hyperplane, many of which have been inspired by Rosenblatt's Perceptron algorithm [6], an online learning algorithm for linear classifiers. The work on SVMs has inspired a number of modifications and enhancements to the original Perceptron algorithm. These incorporate the notion of margin into the learning and prediction processes while exhibiting good empirical performance. Examples of such algorithms include the Relaxed Online Maximum Margin Algorithm (ROMMA) [4], the Approximate Maximal Margin Classification Algorithm (ALMA) [2], and the Margin Infused Relaxed Algorithm (MIRA) [1], all of which can be used in conjunction with kernel functions.

A notable limitation of kernel-based methods is their computational complexity, since the amount of memory required to store the so-called support patterns grows linearly with the number of prediction errors. A number of attempts have been made to speed up the training and testing of SVMs by enforcing a sparsity condition. In this paper we devise an online algorithm that is not only sparse but also generalizes well. To achieve this goal our algorithm employs an insertion and deletion process. Informally, it can be thought of as revising the weight vector after each example on which a prediction mistake has been made. Once such an event occurs, the algorithm adds the new erroneous example (the insertion phase) and then immediately searches for past examples that appear to be redundant given the recent addition (the deletion phase). As we describe later, this adjustment allows us to modify the standard online proof techniques so as to provide a bound on the total number of examples the algorithm keeps.

This paper is organized as follows. In Sec.
2 we formalize the problem setting and provide a brief outline of our method for obtaining a sparse set of support patterns in an online setting. In Sec. 3 we present both theoretical and algorithmic details of our approach and provide a bound on the number of support patterns that constitute the cache. Sec. 4 provides experimental details, evaluated on three real-world datasets, to illustrate the performance and merits of our sparse online algorithm. We end the paper with conclusions and ideas for future work.

2 Problem Setting and Algorithms

This work focuses on online additive algorithms for classification tasks. In such problems we are typically given a stream of instance-label pairs (x_1, y_1), ..., (x_t, y_t), .... We assume that each instance is a vector x_t in R^n and that each label belongs to a finite set Y. In this and the next section we assume that Y = {-1, +1}, but relax this assumption in Sec. 4, where we describe experiments with datasets consisting of more than two labels. When dealing with the task of predicting new labels, thresholded linear classifiers of the form h(x) = sign(w . x) are commonly employed. The vector w is typically represented as a weighted linear combination of the examples, namely w = sum_t alpha_t y_t x_t with alpha_t >= 0. The instances for which alpha_t > 0 are referred to as support patterns. Since under this representation the output of the classifier depends solely on inner products of the form x . x_t, kernel functions can easily be employed by simply replacing the standard scalar product with a function K(., .) which satisfies Mercer's conditions [7]. The resulting classification rule takes the form h(x) = sign(w . x) = sign(sum_t alpha_t y_t K(x, x_t)).

The majority of additive online algorithms for classification, for example the well-known Perceptron [6], share a common algorithmic structure.
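The kernelized prediction rule above, h(x) = sign(sum_t alpha_t y_t K(x, x_t)), can be sketched in a few lines. This is an illustrative sketch only; the kernel degree and the helper names are my own choices, not taken from the paper.

```python
import numpy as np

def poly_kernel(u, v, degree=2):
    """Homogeneous polynomial kernel K(u, v) = (u . v)^degree."""
    return float(np.dot(u, v)) ** degree

def kernel_predict(x, support_x, support_y, alpha, kernel=poly_kernel):
    """h(x) = sign(sum_t alpha_t y_t K(x, x_t)) over the support patterns."""
    s = sum(a * y * kernel(x, xt)
            for a, y, xt in zip(alpha, support_y, support_x))
    return 1 if s >= 0 else -1
```

Only the examples with alpha_t > 0 (the support patterns) need to be stored, which is exactly the memory cost the paper sets out to control.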
These online algorithms typically work in rounds. On the t-th round, an online algorithm receives an instance x_t, computes the inner products s_t = sum_{i in C_{t-1}} alpha_i y_i K(x_i, x_t), and outputs the prediction sign(s_t). It then receives the correct label y_t and, whenever the margin y_t s_t falls below a threshold beta_t, updates the classifier by setting w_t = c_t w_{t-1} + alpha_t y_t x_t, i.e. it scales the previous weights by c_t and adds the new example with weight alpha_t > 0. The various online algorithms differ in the way the values of the parameters beta_t, alpha_t and c_t are set. A notable example of an online algorithm is the Perceptron algorithm [6], for which we set beta_t = 0, alpha_t = 1 and c_t = 1. More recent algorithms such as the Relaxed Online Maximum Margin Algorithm (ROMMA) [4], the Approximate Maximal Margin Classification Algorithm (ALMA) [2] and the Margin Infused Relaxed Algorithm (MIRA) [1] can also be described in this framework, although the constants beta_t, alpha_t and c_t are not as simple as the ones employed by the Perceptron algorithm.

An important computational consideration needs to be made when employing kernel functions for machine learning tasks. This is because the amount of memory required to store the so-called support patterns grows linearly with the number of prediction errors.

Input: Tolerance beta.
Initialize: Set alpha_t = 0 for all t; w_0 = 0; C_0 = {}.
Loop: For t = 1, 2, ..., T
  - Get a new instance x_t in R^n.
  - Predict yhat_t = sign(x_t . w_{t-1}).
  - Get a new label y_t.
  - If y_t (x_t . w_{t-1}) <= beta update:
    1. Insert C_t <- C_{t-1} union {t}.
    2. Set alpha_t = 1.
    3. Compute w_t <- w_{t-1} + y_t alpha_t x_t.
    4. DistillCache(C_t, w_t, (alpha_1, ..., alpha_t)).
Output: H(x) = sign(w_T . x).

Figure 1: The aggressive Perceptron algorithm with a variable-size cache.

In this paper we shift the focus to the problem of devising online algorithms which are budget-conscious, in that they attempt to keep the number of support patterns small. The approach is attractive for at least two reasons.
Firstly, both the training time and the classification time can be reduced significantly if we store only a fraction of the potential support patterns. Secondly, a classifier with a small number of support patterns is intuitively "simpler" and hence more likely to exhibit good generalization properties than a complex classifier with a large number of support patterns. (See for instance [7] for formal results connecting the number of support patterns to the generalization error.)

Input: C; w; (alpha_1, ..., alpha_t).
Loop:
  - Choose i in C such that beta <= y_i ((w - alpha_i y_i x_i) . x_i).
  - If no such i exists then return.
  - Remove the example i:
    1. w <- w - alpha_i y_i x_i.
    2. alpha_i = 0.
    3. C <- C \\ {i}.

Figure 2: DistillCache.

In Sec. 3 we present a formal analysis and the algorithmic details of our approach. Let us now provide a general overview of how to restrict the number of support patterns in an online setting. Denote by C_t the indices of the patterns which constitute the classification vector w_t. That is, i is in C_t if and only if alpha_i > 0 on round t, when x_t is received. The online classification algorithms discussed above keep enlarging C_t: once an example is added to C_t it is never deleted. However, as the online algorithm receives more examples, the performance of the classifier improves, and some of the past examples may become redundant and hence can be removed. Put another way, old examples may have been inserted into the cache simply due to the lack of support patterns in the early rounds. As more examples are observed, the old examples may be replaced with new examples whose location is closer to the decision boundary induced by the online classifier. We thus add a new stage to the online algorithm in which we discard a few old examples from the cache C_t.
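For concreteness, the insertion step of Fig. 1 combined with the deletion step of Fig. 2 can be sketched as follows for a linear kernel. This is a minimal sketch under my own naming, not the authors' code; a kernelized version would replace the inner products with K(., .).

```python
import numpy as np

def distill_cache(cache, w, alpha, X, y, beta):
    """Fig. 2: repeatedly remove any cached example i whose margin,
    measured after subtracting its own contribution from w, is still
    at least beta."""
    changed = True
    while changed:
        changed = False
        for i in list(cache):
            if y[i] * np.dot(w - alpha[i] * y[i] * X[i], X[i]) >= beta:
                w = w - alpha[i] * y[i] * X[i]   # delete i's contribution
                alpha[i] = 0.0
                cache.remove(i)
                changed = True
    return w

def budget_perceptron(X, y, beta):
    """Fig. 1: aggressive Perceptron with a variable-size cache."""
    n, d = X.shape
    w, alpha, cache = np.zeros(d), np.zeros(n), set()
    for t in range(n):
        if y[t] * np.dot(w, X[t]) <= beta:   # margin mistake: insert t
            alpha[t] = 1.0
            w = w + y[t] * X[t]
            cache.add(t)
            w = distill_cache(cache, w, alpha, X, y, beta)
    return w, cache
```

Note that deletion only requires the cached example's own weight, so in the kernel setting no bookkeeping beyond the (alpha_i, x_i, y_i) triplets is needed.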
We suggest a modification of the online algorithm structure as follows. Whenever y_t (sum_{i in C_{t-1}} alpha_i y_i (x_i . x_t)) <= beta, the algorithm first inserts the new example into the cache and then calls the procedure DistillCache (Fig. 2), which deletes from the cache any example that would still attain a margin of at least beta under the classifier obtained by removing that example's own contribution. The resulting algorithm, the aggressive Perceptron with a variable-size cache, is given in Fig. 1.

3 Analysis

We now provide a bound on the size of the cache maintained by the algorithm of Fig. 1.

Theorem 1. Let (x_1, y_1), ..., (x_T, y_T) be a sequence of examples such that ||x_t|| <= R for all t. Assume that there exists a vector u of unit norm and a scalar gamma > 0 such that y_t (x_t . u) >= gamma for all t. Then the number of support patterns constituting the cache is at most S <= (R^2 + 2 beta) / gamma^2.

Proof: The proof of the theorem is based on the mistake bound of the Perceptron algorithm [5]. To prove the theorem we bound ||w_T||^2 from above and below and compare the bounds. Denote by alpha_i^t the weight of the i-th example at the end of round t (after stage 4 of the algorithm). Similarly, denote by ~alpha_i^t the weight of the i-th example on round t after stage 3, before calling the DistillCache (Fig. 2) procedure. We analogously denote by w_t and ~w_t the corresponding instantaneous classifiers. First, we derive a lower bound on ||w_T|| by bounding the term w_T . u from below,

    w_T . u = sum_{t in C_T} alpha_t^T y_t (x_t . u) >= gamma sum_{t in C_T} alpha_t^T = gamma S ,    (1)

where the last equality holds since alpha_t^T = 1 for every t in C_T, so the sum equals the number of support patterns S. We now turn to upper bounding ||w_T||^2. Recall that each example may be added to the cache and removed from the cache a single time. Let us write ||w_T||^2 as a telescopic sum,

    ||w_T||^2 = (||w_T||^2 - ||~w_T||^2) + (||~w_T||^2 - ||w_{T-1}||^2) + ... + (||~w_1||^2 - ||w_0||^2) .    (2)

We now consider three different scenarios that may occur for each new example. The first case is when we did not insert the t-th example into the cache at all. In this case, (||~w_t||^2 - ||w_{t-1}||^2) = 0. The second scenario is when an example is inserted into the cache and is never discarded in future rounds; then,

    ||~w_t||^2 = ||w_{t-1} + y_t x_t||^2 = ||w_{t-1}||^2 + 2 y_t (w_{t-1} . x_t) + ||x_t||^2 .

Since we inserted (x_t, y_t), the condition y_t (w_{t-1} . x_t) <= beta must hold. Combining this with the assumption that the examples are enclosed in a ball of radius R we get (||~w_t||^2 - ||w_{t-1}||^2) <= 2 beta + R^2.
Input: Tolerance beta, Cache Limit n.
Initialize: Set alpha_t = 0 for all t; w_0 = 0; C_0 = {}.
Loop: For t = 1, 2, ..., T
  - Get a new instance x_t in R^n.
  - Predict yhat_t = sign(x_t . w_{t-1}).
  - Get a new label y_t.
  - If y_t (x_t . w_{t-1}) <= beta update:
    1. If |C_{t-1}| = n remove one example:
       (a) Find i = argmax_{j in C_{t-1}} { y_j ((w_{t-1} - alpha_j y_j x_j) . x_j) }.
       (b) Update w_{t-1} <- w_{t-1} - alpha_i y_i x_i.
       (c) Remove C_{t-1} <- C_{t-1} \\ {i}.
    2. Insert C_t <- C_{t-1} union {t}.
    3. Set alpha_t = 1.
    4. Compute w_t <- w_{t-1} + y_t alpha_t x_t.
Output: H(x) = sign(w_T . x).

Figure 3: The aggressive Perceptron algorithm with a fixed-size cache.

The last scenario occurs when an example is inserted into the cache on some round t and is later removed from the cache on round t + p for p > 0. As in the previous case we can bound the sum of the two corresponding summands in Equ. (2),

    (||~w_t||^2 - ||w_{t-1}||^2) + (||w_{t+p}||^2 - ||~w_{t+p}||^2)
        = 2 y_t (w_{t-1} . x_t) + ||x_t||^2 - 2 y_t (~w_{t+p} . x_t) + ||x_t||^2
        = 2 [ y_t (w_{t-1} . x_t) - y_t ((~w_{t+p} - y_t x_t) . x_t) ]
        <= 2 [ beta - y_t ((~w_{t+p} - y_t x_t) . x_t) ] .

Based on the form of the cache update we know that y_t ((~w_{t+p} - y_t x_t) . x_t) >= beta, and thus,

    (||~w_t||^2 - ||w_{t-1}||^2) + (||w_{t+p}||^2 - ||~w_{t+p}||^2) <= 0 .

Summarizing all three cases, we see that only the examples which persist in the cache contribute a factor of R^2 + 2 beta each to the bound on the telescopic sum of Equ. (2); the rest of the examples do not contribute anything to the bound. Hence, we can bound the norm of w_T as follows,

    ||w_T||^2 <= S (R^2 + 2 beta) .    (3)

We finish the proof by applying the Cauchy-Schwarz inequality and the assumption ||u|| = 1. Combining Equ. (1) and Equ.
(3) we get,

    gamma^2 S^2 <= (w_T . u)^2 <= ||w_T||^2 ||u||^2 <= S (2 beta + R^2) ,

which gives the desired bound.

4 Experiments

In this section we describe the experimental methods that were used to compare the performance of standard online algorithms with the new algorithm described above. We also briefly describe another variant that sets a hard limit on the number of support patterns. The experiments were designed with the aim of answering the following questions. First, what is the effect of the number of support patterns on the generalization error (measured in terms of classification accuracy on unseen data)? Second, is the algorithm described in Fig. 2 able to find an optimal cache size that achieves the best generalization performance? To examine each question separately we used a modified version of the algorithm described by Fig. 2 in which we restricted ourselves to a fixed, bounded cache. This modified algorithm (which we refer to as the fixed budget Perceptron) simulates the original Perceptron algorithm with one notable difference: when the number of support patterns exceeds a pre-determined limit, it chooses a support pattern from the cache and discards it. With this modification the number of support patterns can never exceed the pre-determined limit. The modified algorithm is described in Fig. 3. The algorithm deletes the example which seemingly attains the highest margin after the removal of the example itself (line 1(a) in Fig. 3).

Name     Training Examples   Test Examples   Classes   Attributes
mnist    60000               10000           10        784
letter   16000               4000            26        16
usps     7291                2007            10        256

Table 1: Description of the datasets used in the experiments.

Despite the simplicity of the original Perceptron algorithm [6], its good generalization performance on many datasets is remarkable.
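The deletion rule of the fixed-budget variant (line 1(a) of Fig. 3) can be isolated into a one-liner: among the cached examples, pick the one that attains the highest margin once its own contribution is subtracted. A linear-kernel sketch (the function name is mine):

```python
import numpy as np

def pick_removal(cache, w, alpha, X, y):
    """Fig. 3, line 1(a): argmax over i in C of
    y_i ((w - alpha_i y_i x_i) . x_i) -- the example the current
    classifier can best afford to lose."""
    return max(cache,
               key=lambda i: y[i] * np.dot(w - alpha[i] * y[i] * X[i], X[i]))
```

In contrast to DistillCache, this removal is performed unconditionally whenever the cache is full, even when the selected example's residual margin is below beta.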
During the last few years a number of other additive online algorithms have been developed [4, 2, 1] that have shown better performance on a number of tasks. In this paper we have preferred to embed these ideas into another online algorithm and start with a higher baseline performance. We have chosen the Margin Infused Relaxed Algorithm (MIRA) as our baseline since it has exhibited good generalization performance in previous experiments [1] and has the additional advantage that it is designed to solve multiclass classification problems directly, without any recourse to reductions.

The algorithms were evaluated on three natural datasets: mnist(1), usps(2) and letter(3). The characteristics of these datasets are summarized in Table 1. A comprehensive overview of the performance of various algorithms on these datasets can be found in a recent paper [2]. Since all of the algorithms that we evaluated are online, it is not implausible that the specific ordering of the examples affects the generalization performance. We thus report results averaged over 11 random permutations for usps and letter and over 5 random permutations for mnist. No free-parameter optimization was carried out; instead we simply used the values reported in [1]. More specifically, the margin parameter was set to beta = 0.01 for all algorithms and for all datasets. A homogeneous polynomial kernel of degree 9 was used when training on the mnist and usps datasets, and an RBF kernel for the letter dataset. (The variance of the RBF kernel was identical to the one used in [1].)

We evaluated the performance of four algorithms in total. The first algorithm was the standard MIRA online algorithm, which does not incorporate any budget constraints. The second algorithm is the version of MIRA described in Fig. 3, which uses a fixed limited budget. Here we enumerated the cache size limit in each experiment we performed.
The different sizes that we tested are dataset dependent, but for each dataset we evaluated at least 10 different sizes. We would like to note that such an enumeration cannot be done in an online fashion and that the goal of employing the algorithm with a fixed-size cache is to underscore the merit of the truly adaptive algorithm. The third algorithm is the version of MIRA described in Fig. 2 that adapts the cache size during the run of the algorithm. We also report additional results for a multiclass version of the SVM [1]. Whilst this algorithm is not online and during the training process it considers all the examples at once, it serves as our gold-standard algorithm against which we compare performance. Note that for the multiclass SVM we report the results using the best set of parameters, which does not coincide with the set of parameters used for the online training.

(1) Available from http://www.research.att.com/~yann
(2) Available from ftp.kyb.tuebingen.mpg.de
(3) Available from http://www.ics.uci.edu/~mlearn/MLRepository.html

[Figure 4: Results on the three data sets - mnist (left), usps (center) and letter (right). Row 1 shows the test error, row 2 the online training errors, and row 3 the online training margin errors, each plotted (y-axis) against the number of support patterns used (x-axis). Four algorithms are compared - SVM, MIRA, MIRA with a fixed cache size and MIRA with a variable cache size.]

The results are summarized in Fig. 4. This figure is composed of three columns of plots, each corresponding to a different dataset - mnist (left), usps (center) and letter (right).
In each of the three columns the x-axis designates the number of support patterns the algorithm uses. The results for the fixed-size cache are connected with a line to emphasize the dependency of performance on the size of the cache.

The top row shows the generalization error; the y-axis designates the test error of an algorithm on unseen data at the end of training. Looking at the error of the algorithm with a fixed-size cache reveals that there is a broad range of cache sizes for which the algorithm exhibits good performance. In fact, for mnist and usps there are sizes for which the test error of the algorithm is better than the SVM's test error. Naturally, we cannot fix the correct size in hindsight, so the question is whether the algorithm with a variable cache size is a viable automatic size-selection method. Analyzing each of the datasets in turn reveals that this is indeed the case: the algorithm obtains a very similar number of support patterns and test error when compared to the SVM method. The results are somewhat less impressive for the letter dataset, which contains fewer examples per class. One possible explanation is that the algorithm had fewer chances to modify and distill the cache. Nonetheless, overall the results are remarkable given that all the online algorithms make a single pass through the data, and the variable-size method finds a very good cache size while remaining comparable to the SVM in terms of performance. The MIRA algorithm, which does not incorporate any form of example insertion or deletion in its algorithmic structure, obtains the poorest level of performance, not only in terms of generalization error but also in terms of the number of support patterns.

The plot of online training error against the number of support patterns, in row 2 of Fig. 4, can be considered a good on-the-fly validation of generalization performance.
As the plots indicate, for both the fixed and the adaptive versions of the algorithm, on all the datasets, a low online training error translates into good generalization performance. Comparing the test-error plots with the online-error plots, we see a clear similarity between the qualitative behavior of the two errors. Hence, one can use the online error, which is easy to evaluate, to choose a good cache size for the fixed-size algorithm.

The third row gives the online training margin errors, which translate directly into the number of insertions into the cache. Here we see that the good test error and compactness of the algorithm with a variable cache size come at a price: the algorithm makes significantly more insertions into the cache than the fixed-size version. However, as the upper two rows of plots indicate, the surplus in insertions is later taken care of by excess deletions, and the end result is very good overall performance. In summary, the online algorithm with a variable cache and the SVM obtain similar levels of generalization and similar numbers of support patterns. While the SVM is still somewhat better in both respects on the letter dataset, the online algorithm is much simpler to implement and performs a single sweep through the training data.

5 Summary

We have described and analyzed a new sparse online algorithm that attempts to deal with the computational problems implicit in classification algorithms such as the SVM. The proposed method was empirically tested, and both the size of the resulting classifier and its error rate are comparable to the SVM's. There are a few possible extensions and enhancements. We are currently looking at alternative criteria for the deletion of examples from the cache. For instance, the weight of an example might relay information about its importance for accurate classification.
Incorporating prior knowledge into the insertion and deletion scheme might also prove important. We hope that such enhancements will make the proposed approach a viable alternative to the SVM and other batch algorithms.

Acknowledgements: The authors would like to thank John Shawe-Taylor for many helpful comments and discussions. This research was partially funded by EU project KerMIT No. IST-2000-25341.

References

[1] K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3:951-991, 2003.
[2] C. Gentile. A new approximate maximal margin classification algorithm. Journal of Machine Learning Research, 2:213-242, 2001.
[3] W. Krauth and M. Mezard. Learning algorithms with optimal stability in neural networks. Journal of Physics A, 20:745, 1987.
[4] Y. Li and P. M. Long. The relaxed online maximum margin algorithm. Machine Learning, 46(1-3):361-387, 2002.
[5] A. B. J. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume XII, pages 615-622, 1962.
[6] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386-407, 1958. (Reprinted in Neurocomputing, MIT Press, 1988.)
[7] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.
", "award": [], "sourceid": 2385, "authors": [{"given_name": "Koby", "family_name": "Crammer", "institution": null}, {"given_name": "Jaz", "family_name": "Kandola", "institution": null}, {"given_name": "Yoram", "family_name": "Singer", "institution": null}]}