{"title": "From Online to Batch Learning with Cutoff-Averaging", "book": "Advances in Neural Information Processing Systems", "page_first": 377, "page_last": 384, "abstract": "We present cutoff averaging, a technique for converting any conservative online learning algorithm into a batch learning algorithm. Most online-to-batch conversion techniques work well with certain types of online learning algorithms and not with others, whereas cutoff averaging explicitly tries to adapt to the characteristics of the online algorithm being converted. An attractive property of our technique is that it preserves the efficiency of the original online algorithm, making it appropriate for large-scale learning problems. We provide a statistical analysis of our technique and back our theoretical claims with experimental results.", "full_text": "From Online to Batch Learning with Cutoff-Averaging\n\nAnonymous Author(s)\n\nAf\ufb01liation\nAddress\nemail\n\nAbstract\n\nWe present cutoff averaging, a technique for converting any conservative online\nlearning algorithm into a batch learning algorithm. Most online-to-batch conver-\nsion techniques work well with certain types of online learning algorithms and not\nwith others, whereas cutoff averaging explicitly tries to adapt to the characteristics\nof the online algorithm being converted. An attractive property of our technique\nis that it preserves the ef\ufb01ciency of the original online algorithm, making it appro-\npriate for large-scale learning problems. We provide a statistical analysis of our\ntechnique and back our theoretical claims with experimental results.\n\n1 Introduction\n\nBatch learning (also called statistical learning) and online learning are two different supervised\nmachine-learning frameworks. In both frameworks, a learning problem is primarily de\ufb01ned by an\ninstance space X and a label set Y, and the goal is to assign labels from Y to instances in X. 
In batch\nlearning, we assume that there exists a probability distribution over the product space X \u00d7 Y, and\nthat we have access to a training set drawn i.i.d. from this distribution. A batch learning algorithm\nuses the training set to generate an output hypothesis, which is a function that maps instances in\nX to labels in Y. We expect a batch learning algorithm to generalize, in the sense that its output\nhypothesis should accurately predict the labels of previously unseen examples, which are sampled\nfrom the distribution.\n\nOn the other hand, in the online learning framework, we typically make no statistical assumptions\nregarding the origin of the data. An online learning algorithm receives a sequence of examples and\nprocesses these examples one-by-one. On each online-learning round, the algorithm receives an\ninstance and predicts its label using an internal hypothesis, which it keeps in memory. Then, the\nalgorithm receives the correct label corresponding to the instance, and uses the new instance-label\npair to update and improve its internal hypothesis. There is no notion of statistical generalization,\nas the algorithm is only expected to accurately predict the labels of examples it receives as input.\nThe sequence of internal hypotheses constructed by the online algorithm from round to round plays\na central role in this paper, and we refer to this sequence as the online hypothesis sequence.\n\nOnline learning algorithms tend to be computationally ef\ufb01cient and easy to implement. However,\nmany real-world problems \ufb01t more naturally in the batch learning framework. As a result, we are\nsometimes tempted to use online learning algorithms as if they were batch learning algorithms. A\ncommon way to do this is to present training examples one-by-one to the online algorithm, and\nuse the last hypothesis constructed by the algorithm as the output hypothesis. We call this tech-\nnique the last-hypothesis online-to-batch conversion technique. 
The appeal of this technique is that\nit maintains the computational ef\ufb01ciency of the original online algorithm. However, this heuris-\ntic technique generally comes with no theoretical guarantees, and the online algorithm\u2019s inherent\ndisregard for out-of-sample performance makes it a risky practice.\n\nIn addition to the last-hypothesis heuristic, various principled techniques for converting online al-\ngorithms into batch algorithms have been proposed. Each of these techniques essentially wraps the\nonline learning algorithm with an additional layer of instructions that endow it with the ability to\ngeneralize. One approach is to use the online algorithm to create the online hypothesis sequence, and\nthen to choose a single good hypothesis from this sequence. For instance, the longest survivor tech-\nnique [8] (originally called the pocket algorithm) chooses the hypothesis that survives the largest\nnumber of consecutive online rounds before it is replaced. The validation technique [12] uses a\nvalidation set to evaluate each online hypothesis and chooses the hypothesis with the best empirical\nperformance. Improved versions of the validation technique are given in [2, 3], where the wasteful\nneed for a separate validation set is resolved. All of these techniques follow the single hypothesis\napproach. We note in passing that a disadvantage of the various validation techniques [12, 2, 3] is\nthat their running time scales quadratically with the number of examples. We typically turn to online\nalgorithms for their ef\ufb01ciency, and often a quadratic running time can be problematic.\n\nAnother common online-to-batch conversion approach, which we call the ensemble approach, uses\nthe online algorithm to construct the online hypothesis sequence, and combines the hypotheses in\nthe sequence by taking a majority [7] or by averaging [2, Sec. 2.A]. 
When using linear hypotheses,\naveraging can be done on-the-\ufb02y, while the online algorithm is constructing the online hypothesis\nsequence. This preserves the computational ef\ufb01ciency of the online algorithm. Taking the majority\nor the average over a rich set of hypotheses promotes robustness and stability. Moreover, since we\ndo not truly know the quality of each online hypothesis, building an ensemble allows us to hedge\nour bets, rather than committing to a single online hypothesis.\n\nSometimes the ensemble approach outperforms the single hypothesis approach, while other times\nwe see the opposite behavior (see Sec. 4 and [9]). Ideally, we would like a conversion technique\nthat enjoys the best of both worlds: when a single good online hypothesis can be clearly identi\ufb01ed,\nit should be chosen as the output hypothesis, but when a good hypothesis cannot be identi\ufb01ed, we\nshould play it safe and construct an ensemble.\n\nA \ufb01rst step in this direction was taken in [10, 5], where the conversion technique selectively chooses\nwhich subset of online hypotheses to include in the ensemble. For example, the suf\ufb01x averaging\nconversion [5] sets the output hypothesis to be the average over a suf\ufb01x of the online hypothesis\nsequence, where the suf\ufb01x length is determined by minimizing a theoretical upper-bound on the\ngeneralization ability of the resulting hypothesis. One extreme of this approach is to include the\nentire online hypothesis sequence in the ensemble. The other extreme reduces to the last-hypothesis\nheuristic. By choosing the suf\ufb01x that gives the best theoretical guarantee, suf\ufb01x averaging automat-\nically balances the trade-off between these two extremes. Regretfully, this technique suffers from\na computational ef\ufb01ciency problem. Speci\ufb01cally, the suf\ufb01x averaging technique only chooses the\nsuf\ufb01x length after the entire hypothesis sequence has been constructed. 
Therefore, it must store\nthe entire sequence in memory before it constructs the output hypothesis, and its memory footprint\ngrows linearly with training set size. This is in sharp contrast to the last-hypothesis heuristic, which\nuses no memory aside from the memory used by the online algorithm itself. When the training set\nis massive, storing the entire online hypothesis sequence in memory is impossible.\n\nIn this paper, we present and analyze a new conversion technique called cutoff averaging. Like\nsuf\ufb01x averaging, it attempts to enjoy the best of the single hypothesis approach and of the ensemble\napproach. One extreme of our technique reduces to the simple averaging conversion technique,\nwhile the other extreme reduces to the longest-survivor conversion technique. As in suf\ufb01x averaging,\nwe search for the sweet spot between these two extremes by explicitly minimizing a tight theoretical\ngeneralization bound. The advantage of our technique is that much of it can be performed on-the-\ufb02y,\nas the online algorithm processes the data. The memory required by cutoff averaging scales with\nthe square root of the number of training examples in the worst case, and is far less in the typical case.\n\nThis paper is organized as follows. In Sec. 2 we formally present the background for our approach.\nIn Sec. 3 we present the cutoff averaging technique and provide a statistical generalization analysis\nfor it. Finally, we demonstrate the merits of our approach with a set of experiments in Sec. 4.\n\n2 Preliminaries\n\nRecall that X is an instance domain and that Y is a set of labels, and let H be a hypothesis class,\nwhere each h \u2208 H is a mapping from X to Y. For example, we may be faced with a con\ufb01dence-\nrated binary classi\ufb01cation problem, where H is the class of linear separators. 
In this case, X is a\nsubset of the Euclidean space $\\mathbb{R}^n$, Y is the real line, and each hypothesis in H is a linear function\nparametrized by a weight vector $w \\in \\mathbb{R}^n$ and de\ufb01ned as $h(x) = \\langle w, x \\rangle$. We interpret sign(h(x)) as\nthe actual binary label predicted by h, and |h(x)| as the degree of con\ufb01dence in this prediction.\nThe quality of the predictions made by h is measured using a loss function \u2113. We use \u2113(h; (x, y))\nto denote the penalty incurred for predicting the label h(x) when the correct label is actually y.\nReturning to the example of linear separators, a common choice of loss function is the zero-one loss,\nwhich is simply the indicator function of prediction mistakes. Another popular loss function is the\nhinge loss, de\ufb01ned as\n\n$$\\ell(h; (x, y)) = \\begin{cases} 1 - y \\langle w, x \\rangle & \\text{if } y \\langle w, x \\rangle \\le 1 \\\\ 0 & \\text{otherwise} \\end{cases} .$$\n\nAs noted above, in batch learning we assume the existence of a probability distribution D over the\nproduct space X \u00d7 Y. The input of a batch learning algorithm is a training set, sampled from $D^m$.\nThe risk of a hypothesis h, denoted by \u2113(h;D), is de\ufb01ned as the expected loss incurred by h over\nexamples sampled from D. Formally,\n\n$$\\ell(h; D) = \\mathbb{E}_{(X,Y) \\sim D} \\left[ \\ell(h; (X, Y)) \\right] .$$\n\nWe can talk about the zero-one-risk or the hinge-loss-risk, depending on which loss function we\nchoose to work with. The goal of a batch learning algorithm for the hypothesis class H and for the\nloss function \u2113 is to \ufb01nd a hypothesis h\u22c6 \u2208 H whose risk is as close as possible to $\\inf_{h \\in H} \\ell(h; D)$.\n\nIn online learning, the labeled examples take the form of a sequence $S = \\big( (x_i, y_i) \\big)_{i=1}^m$. We typically\nrefrain from making any assumptions on the process that generates S; it could very well be a stochas-\ntic process but it doesn\u2019t have to be. 
The online algorithm observes the examples in the sequence\none-by-one, and incrementally constructs the sequence of online hypotheses $(h_i)_{i=0}^m$, where each\n$h_i \\in H$. The \ufb01rst hypothesis, $h_0$, is a default hypothesis, which is de\ufb01ned in advance. Before round\nt begins, the algorithm has already constructed the pre\ufb01x $(h_i)_{i=0}^{t-1}$. At the beginning of round t, the\nalgorithm observes $x_t$ and makes the prediction $h_{t-1}(x_t)$. Then, the correct label $y_t$ is revealed and\nthe algorithm suffers a loss of $\\ell(h_{t-1}; (x_t, y_t))$. Finally, the algorithm uses the new example $(x_t, y_t)$\nto construct the next hypothesis $h_t$. The update rule used to construct $h_t$ is the main component of\nthe online learning algorithm. In this paper, we make the simplifying assumption that the update\nrule is deterministic, and we note that our derivation can be extended to randomized update rules.\nSince S is not necessarily generated by any distribution D, we cannot de\ufb01ne the risk of an online\nhypothesis. Instead, the performance of an online algorithm is measured using the game-theoretic\nnotion of regret. The regret of an online algorithm is de\ufb01ned as\n\n$$\\frac{1}{m} \\sum_{i=1}^m \\ell\\big(h_{i-1}; (x_i, y_i)\\big) - \\min_{\\hat{h} \\in H} \\frac{1}{m} \\sum_{i=1}^m \\ell\\big(\\hat{h}; (x_i, y_i)\\big) . \\tag{1}$$\n\nIn words, regret measures how much better the algorithm could have done by using the best \ufb01xed\nhypothesis in H on all m rounds. The goal of an online learning algorithm is to minimize regret.\n\nTo make things more concrete, we focus on two online learning algorithms for binary classi\ufb01cation.\nThe \ufb01rst is the classic Perceptron algorithm [13] and the second is a \ufb01nite-horizon margin-based\nvariant of the Perceptron, which closely resembles algorithms given in [11, 4]. The term \ufb01nite-\nhorizon indicates that the algorithm knows the total length of the sequence of examples before ob-\nserving any data. 
The term margin-based indicates that the algorithm is concerned with minimizing\nthe hinge-loss, unlike the classic Perceptron, which deals directly with the zero-one loss. Pseudo-\ncode for both algorithms is given in Fig. 1. We chose these two particular algorithms because they\nexhibit two extreme behaviors when converted into batch learning algorithms. Speci\ufb01cally, if we\nwere to present the classic Perceptron with an example-sequence S drawn i.i.d. from a distribution\nD, we would typically see large \ufb02uctuations in the zero-one-risk of the various online hypotheses\n(see Sec. 4).\n\nPERCEPTRON\ninput $S = \\big( (x_i, y_i) \\big)_{i=1}^m$\nset $w_0 = (0, \\ldots, 0)$\nfor $i = 1, \\ldots, m$\n    receive $x_i$, predict $\\mathrm{sign}(\\langle w_{i-1}, x_i \\rangle)$\n    receive $y_i \\in \\{-1, +1\\}$\n    if $\\mathrm{sign}(\\langle w_{i-1}, x_i \\rangle) \\ne y_i$\n        $w_i \\leftarrow w_{i-1} + y_i x_i$\n\nFINITE-HORIZON MARGIN-BASED PERCEPTRON\ninput $S = \\big( (x_i, y_i) \\big)_{i=1}^m$ s.t. $\\|x_i\\|_2 \\le R$\nset $w_0 = (0, \\ldots, 0)$\nfor $i = 1, \\ldots, m$\n    receive $x_i$, predict $\\mathrm{sign}(\\langle w_{i-1}, x_i \\rangle)$\n    receive $y_i \\in \\{-1, +1\\}$\n    if $\\ell(w_{i-1}; (x_i, y_i)) > 0$\n        $w'_{i-1} \\leftarrow w_{i-1} + \\frac{y_i x_i}{\\sqrt{m} R}$\n        $w_i \\leftarrow \\frac{w'_{i-1}}{\\|w'_{i-1}\\|_2}$\n\nFigure 1: Two versions of the Perceptron algorithm.\n\nDue to these \ufb02uctuations, the ensemble approach suits the classic Perceptron very well,\nand typically outperforms any single hypothesis approach. On the other hand, if we were to repeat\nthis experiment with the margin-based Perceptron, using hinge-loss-risk, we would typically see a\nmonotonic decrease in risk from round to round. A possible explanation for this is the similarity\nbetween the margin-based Perceptron and some incremental SVM solvers [14]. The last hypothesis\nconstructed by the margin-based Perceptron is typically better than any average. This difference\nbetween the classic Perceptron and its margin-based variant was previously observed in [9]. 
Ideally,\nwe would like a conversion technique that performs well in both cases.\n\nFrom a theoretical standpoint, the purpose of an online-to-batch conversion technique is to turn an\nonline learning algorithm with a regret bound into a batch learning algorithm with a risk bound. We\nstate a regret bound for the margin-based Perceptron, so that we can demonstrate this idea in the\nnext section.\n\nTheorem 1. Let $S = \\big( (x_i, y_i) \\big)_{i=1}^m$\nbe a sequence of examples such that $x_i \\in \\mathbb{R}^n$ and $y_i \\in \\{-1, +1\\}$\nand let \u2113 denote the hinge loss. Let H be the set of linear separators de\ufb01ned by weight vectors in\nthe unit $L_2$ ball. Let $(h_i)_{i=0}^m$ be the online hypothesis sequence generated by the margin-based\nPerceptron (see Fig. 1) when it processes S. Then, for any $\\hat{h} \\in H$,\n\n$$\\frac{1}{m} \\sum_{i=1}^m \\ell\\big(h_{i-1}; (x_i, y_i)\\big) - \\frac{1}{m} \\sum_{i=1}^m \\ell\\big(\\hat{h}; (x_i, y_i)\\big) \\le \\frac{R}{\\sqrt{m}} .$$\n\nThe proof of Thm. 1 is not much different from the proofs of other regret bounds for Perceptron-like algorithms;\nfor completeness we give the proof in [1].\n\n3 Cutoff Averaging\n\nWe now present the cutoff averaging conversion technique. This technique can be applied to any\nconservative online learning algorithm that uses a convex hypothesis class H. A conservative al-\ngorithm is one that modi\ufb01es its online hypotheses only on rounds where a positive loss is suffered.\nOn rounds where no loss is suffered, the algorithm keeps its current hypothesis, and we say that\nthe hypothesis survived the round. The survival time of each distinct online hypothesis is the num-\nber of consecutive rounds it survives before the algorithm suffers a loss and replaces it with a new\nhypothesis.\n\nLike the conversion techniques mentioned in Sec. 1, we start by applying the online learning algo-\nrithm to an i.i.d. training set, and obtaining the online hypothesis sequence $(h_i)_{i=0}^{m-1}$. 
Let k be an\narbitrary non-negative integer, which we call the cutoff parameter. Ultimately, our technique will\nset k automatically, but for the time being, assume k is a prede\ufb01ned constant. Let $\\Theta \\subseteq (h_i)_{i=0}^{m-1}$ be\nthe set of distinct hypotheses whose survival time is greater than k. The cutoff averaging technique\nde\ufb01nes the output hypothesis h\u22c6 as a weighted average over the hypotheses in \u0398, where the weight\nof a hypothesis with survival time s is proportional to s \u2212 k. Intuitively, each hypothesis must qual-\nify for the ensemble, by suffering no loss for k consecutive rounds. The cutoff parameter k sets the\nbar for acceptance into the ensemble. Once a hypothesis is included in the ensemble, its weight is\ndetermined by the number of additional rounds it perseveres after qualifying.\n\nWe present a statistical analysis of the cutoff averaging technique. We use capital-letter notation\nthroughout our analysis to emphasize that our input is stochastic and that we are essentially ana-\nlyzing random variables. First, we represent the sequence of examples as a sequence of random\nvariables $\\big( (X_i, Y_i) \\big)_{i=1}^m$. Once this sequence is presented to the online algorithm, we obtain the online\nhypothesis sequence $(H_i)_{i=0}^m$, which is a sequence of random functions. Note that each random\nfunction $H_i$ is deterministically de\ufb01ned by the random variables $\\big( (X_j, Y_j) \\big)_{j=1}^i$. Therefore, the risk\nof $H_i$ is also a deterministic function of $\\big( (X_j, Y_j) \\big)_{j=1}^i$. Since $(X_{i+1}, Y_{i+1})$ is sampled from D\nindependently of $\\big( (X_j, Y_j) \\big)_{j=1}^i$, we observe that\n\n$$\\ell(H_i; D) = \\mathbb{E}\\Big[ \\ell\\big(H_i; (X_{i+1}, Y_{i+1})\\big) \\,\\Big|\\, \\big( (X_j, Y_j) \\big)_{j=1}^i \\Big] . \\tag{2}$$\n\nIn words, the risk of the random function $H_i$ equals the conditional expectation of the online loss\nsuffered on round i + 1, conditioned on the random examples 1 through i. This simple observation\nrelates statistical risk with online loss, and is the key to converting regret bounds into risk bounds.\n\nDe\ufb01ne the sequence of binary random variables $(B_i)_{i=0}^{m-1}$ as follows\n\n$$B_i = \\begin{cases} 1 & \\text{if } i = 0 \\text{ or if } i \\ge k \\text{ and } H_{i-k} = H_{i-k+1} = \\ldots = H_i \\\\ 0 & \\text{otherwise} \\end{cases} . \\tag{3}$$\n\nNow de\ufb01ne the output hypothesis\n\n$$H^\\star_k = \\Big( \\sum_{i=0}^{m-1} B_i \\Big)^{-1} \\sum_{i=0}^{m-1} B_i H_i . \\tag{4}$$\n\nNote that we automatically include the default hypothesis $H_0$ in the de\ufb01nition of $H^\\star_k$. This technical\ndetail makes our analysis more elegant, and is otherwise irrelevant. Also note that setting k = 0\nresults in $B_i = 1$ for all i, and would reduce our conversion technique to the standard averaging\nconversion technique. At the other extreme, as k increases, our technique approaches the longest\nsurvivor conversion technique.\n\nThe following theorem bounds the risk of $H^\\star_k$ using the online loss suffered on rounds where $B_i = 1$.\nThe theorem holds only when the loss function \u2113 is convex in its \ufb01rst argument and bounded in [0, C].\nNote that this is indeed the case for the margin-based Perceptron and the hinge loss function. Since\nthe margin-based Perceptron enforces $\\|w_i\\| \\le 1$, and assuming that $\\|x_i\\| \\le R$, it follows from the\nCauchy-Schwarz inequality that \u2113 \u2208 [0, R + 1]. If the loss function is not convex, the theorem does\nnot hold, but note that we can still bound the average risk of the hypotheses in the ensemble.\n\nTheorem 2. Let k be a non-negative constant and let \u2113 be a convex loss function such that\n\u2113(h; (x, y)) \u2208 [0, C]. An online algorithm is given m \u2265 4 independent samples from D and\nconstructs the online hypothesis sequence $(H_i)_{i=0}^m$. De\ufb01ne $B_i$ and $H^\\star_k$ as above, let\n$L_i = B_{i-1} \\ell\\big(H_{i-1}; (X_i, Y_i)\\big)$ for all i and let $\\bar{L} = (\\sum B_i)^{-1} \\sum L_i$. For any \u03b4 \u2208 (0, 1), with proba-\nbility at least 1 \u2212 \u03b4, it holds that\n\n$$\\ell(H^\\star_k; D) < \\bar{L} + \\sqrt{\\frac{2 C \\ln(\\frac{m}{\\delta}) \\bar{L}}{\\sum B_i}} + \\frac{7 C \\ln(\\frac{m}{\\delta})}{\\sum B_i} .$$\n\nTo prove the theorem, we require the following tail bound, which is a corollary of Freedman\u2019s tail\nbound for martingales [6], similar to [3, Proposition 2].\n\nLemma 1. Let $(L_i)_{i=1}^m$ be a sequence of real-valued random variables and let $(Z_i)_{i=1}^m$ be a sequence\nof arbitrary random variables such that $L_i = \\mathbb{E}[L_i \\mid (Z_j)_{j=1}^i]$ and $L_i \\in [0, C]$ for all i. De\ufb01ne\n$U_i = \\mathbb{E}[L_i \\mid (Z_j)_{j=1}^{i-1}]$ for all i, and de\ufb01ne $\\bar{L}_t = \\sum_{i=1}^t L_i$ and $\\bar{U}_t = \\sum_{i=1}^t U_i$ for all t. For any\nm \u2265 4 and for any \u03b4 \u2208 (0, 1), with probability at least 1 \u2212 \u03b4, it holds that\n\n$$\\forall\\, t \\in \\{1, \\ldots, m\\} \\qquad \\bar{U}_t < \\bar{L}_t + \\sqrt{2 C \\ln(\\tfrac{m}{\\delta}) \\bar{L}_t} + 7 C \\ln(\\tfrac{m}{\\delta}) .$$\n\nDue to space constraints, the proof of Lemma 1 is given in [1]. It can also be reverse-engineered\nfrom [3, Proposition 2]. Equipped with Lemma 1, we now prove Thm. 2.\n\nProof of Thm. 2. De\ufb01ne $U_i = \\mathbb{E}[L_i \\mid ((X_j, Y_j))_{j=1}^{i-1}]$ for all i \u2208 {1, . . . , m}, and de\ufb01ne\n$\\bar{U} = \\sum_{i=1}^m U_i$. Using Lemma 1 with t = m, we have that, with probability at least 1 \u2212 \u03b4\n\n$$\\bar{U} < \\bar{L} + \\sqrt{2 C \\ln(\\tfrac{m}{\\delta}) \\bar{L}} + 7 C \\ln(\\tfrac{m}{\\delta}) ,$$\n\nwhere, throughout this proof, $\\bar{L}$ denotes the unnormalized sum $\\sum_{i=1}^m L_i$ (that is, $\\bar{L}_m$ in the notation of Lemma 1).\nNow notice that, by de\ufb01nition,\n\n$$U_i = \\mathbb{E}\\Big[ B_{i-1} \\ell\\big(H_{i-1}; (X_i, Y_i)\\big) \\,\\Big|\\, \\big( (X_j, Y_j) \\big)_{j=1}^{i-1} \\Big] .$$\n\nSince $B_{i-1}$ is deterministically de\ufb01ned by $\\big( (X_j, Y_j) \\big)_{j=1}^{i-1}$, it can be taken outside of the conditional\nexpectation above. Using the observation made in Eq. (2), we have $U_i = B_{i-1} \\ell(H_{i-1}; D)$. 
Overall,\nwe have shown that\n\n$$\\sum_{i=1}^m B_{i-1} \\ell(H_{i-1}; D) < \\bar{L} + \\sqrt{2 C \\ln(\\tfrac{m}{\\delta}) \\bar{L}} + 7 C \\ln(\\tfrac{m}{\\delta}) .$$\n\nUsing Jensen\u2019s inequality, the left-hand side above is at least $\\big( \\sum_{i=1}^m B_{i-1} \\big) \\ell(H^\\star_k; D)$.\nDividing both sides by $\\sum_{i=1}^m B_{i-1} = \\sum_{i=0}^{m-1} B_i$ and expressing the sum $\\sum_{i=1}^m L_i$ through the\nnormalized average of Thm. 2 yields the stated bound.\n\nWe can now complete the de\ufb01nition of the cutoff averaging technique. Note that by replacing \u03b4\nwith \u03b4/m in Thm. 2 and by using the union bound, we can ensure that Thm. 2 holds uniformly for\nall k \u2208 {0, . . . , m \u2212 1} with probability at least 1 \u2212 \u03b4. The cutoff averaging technique sets the\noutput hypothesis H\u22c6 to be the hypothesis in $\\{H^\\star_0, \\ldots, H^\\star_{m-1}\\}$ for which Thm. 2 gives the smallest\nbound. In other words, k is chosen automatically so as to balance the trade-off between the bene\ufb01ts\nof averaging and those of good empirical performance. If a small number of online hypotheses\nstand out with signi\ufb01cantly long survival times, then our technique will favor a large k and a sparse\nensemble. On the other hand, if most of the online hypotheses have medium/short survival times,\nthen our technique will favor small values of k and a dense ensemble. Even if \u2113 is not convex,\nminimizing the bound in Thm. 2 implicitly minimizes the average risk of the ensemble hypotheses.\n\nIf the online algorithm being converted has a regret bound, then the data-dependent risk bound\ngiven by Thm. 2 can be turned into a data-independent risk bound. A detailed derivation of such a\nbound exceeds the scope of this paper, and we just sketch the proof in the case of the margin-based\nPerceptron. It trivially holds that the risk of H\u22c6 is upper-bounded by the bound given in Thm. 2 for\nk = 0. When Thm. 2 is applied with k = 0, $\\bar{L}$ simply becomes the average loss suffered by the\nonline algorithm over the entire training set and $\\sum B_i = m$. We can now use Thm. 
1 to bound $\\bar{L}$\nby the average loss of any $\\hat{h} \\in H$ on the sequence $\\big( (X_i, Y_i) \\big)_{i=1}^m$. In particular, we can choose $\\hat{h}$ to\nbe the hypothesis with the smallest risk in H, namely, $\\hat{h} = \\arg\\min_{h \\in H} \\ell(h; D)$. The \ufb01nal step is\nto bound the difference between $\\frac{1}{m} \\sum_{i=1}^m \\ell(\\hat{h}; (X_i, Y_i))$ and $\\ell(\\hat{h}; D)$, which can be done using any tail\nbound for sums of independent bounded random variables, such as Hoeffding\u2019s bound or Bernstein\u2019s\nbound. The result is that, with high probability, $\\ell(H^\\star; D) \\le \\min_{h \\in H} \\ell(h; D) + O(m^{-1/2})$. Similar\nderivations appear in [2, 3].\n\nAs mentioned in the introduction, our approach is similar to the suf\ufb01x averaging conversion tech-\nnique of [5], which also interpolates between an ensemble approach and a single hypothesis ap-\nproach. However, the suf\ufb01x conversion requires \u2126(m) space, which is problematic when m is large.\nIn contrast, cutoff averaging requires only $O(\\sqrt{m})$ space. Our technique cannot choose the optimal\nvalue of k before the entire dataset has been processed, but nevertheless, it does not need to store\nthe entire hypothesis sequence. Instead, it can group the online hypotheses based on their survival\ntimes, and store only the average hypothesis in each group and the total loss in each group. By\nthe time the entire dataset is processed, most of the work has already been done and calculating the\noptimal k and the output hypothesis is straightforward. By a simple counting argument, the maximal\nnumber of distinct survival times in a sequence of m hypotheses is $O(\\sqrt{m})$: any j distinct non-negative\nintegers sum to at least j(j \u2212 1)/2, while the survival times sum to at most m.\n\nFinally, note that Lemma 1 is a Kolmogorov-type bound, namely, it holds uniformly for every pre\ufb01x\nof the sequence of random variables. Therefore, Thm. 2 actually holds simultaneously for every\npre\ufb01x of the training set. 
Since our conversion is mostly calculated on-the-\ufb02y, in parallel with the\nonline rounds, we can easily construct intermediate output hypotheses, before the online algorithm\nhas a chance to process the entire dataset. Thanks to the Kolmogorov-type bound, the risk bounds\nfor all of these hypotheses hold simultaneously. We can monitor how the risk bound changes\nas the number of examples increases, and perhaps even use the bound to de\ufb01ne an early stopping\ncriterion for the training algorithm. Speci\ufb01cally, we could stop processing examples when the risk\nbound becomes lower than a prede\ufb01ned threshold.\n\n[Plots omitted. Ten panels: CCAT vs. GCAT, CCAT vs. MCAT, CCAT vs. ECAT, CCAT vs. OTHER, GCAT vs. MCAT, GCAT vs. ECAT, GCAT vs. OTHER, MCAT vs. ECAT, MCAT vs. OTHER, ECAT vs. OTHER. y-axis: test error, 0.1 to 0.5; x-axis: training set size, $10^1$ to $10^5$; curves: cutoff, last.]\n\nFigure 2: Test error (zero-one-loss) of last-hypothesis and cutoff averaging, each applied to the stan-\ndard Perceptron, on ten binary classi\ufb01cation problems from RCV1. The x-axis represents training\nset size, and is given in log-scale. Each plot represents the average over 10 random train-test splits.\n\n4 Experiments and Conclusions\n\nWe conducted experiments using Reuters Corpus Vol. 1 (RCV1), a collection of over 800K news\narticles collected from the Reuters news wire. 
An average article in the corpus contains 240 words,\nand the entire corpus contains over half a million distinct tokens (not including numbers and dates).\nEach article in the corpus is associated with one or more high-level categories, which are: Cor-\nporate/Industrial (CCAT), Economics (ECAT), Government/Social (GCAT), Markets (MCAT), and\nOther (OTHER). About 20% of the articles in the corpus are associated with more than one high-\nlevel category. After discarding this 20%, we are left with over 600K documents, each with a single\nhigh-level label. Each pair of high-level labels de\ufb01nes the binary classi\ufb01cation problem of distin-\nguishing between articles of the two categories, for a total of ten different problems. Each problem\nhas different characteristics, due to the different number of articles and the varying degree of homo-\ngeneity in each category.\n\nEach article was mapped to a feature vector using a logarithmic bag-of-words representation.\nNamely, the length of each vector equals the number of distinct tokens in the corpus, and each\ncoordinate in the vector represents one of these tokens. If a token appears s times in a given article,\nthe respective coordinate in the feature vector equals log2(1 + s).\nWe applied the cutoff averaging technique to the classic Perceptron and to the margin-based Per-\nceptron. We repeated each of our experiments ten times, each time taking a new random split of\nthe data into a training set (80%) and a test set (20%), and randomly ordering the training set. We\ntrained each algorithm on each dataset in an incremental manner, namely, we started by training the\nalgorithm using a short pre\ufb01x of the training sequence, and gradually increased the training set size.\nWe paused training at regular intervals, computed the output hypothesis so far, and calculated its test\nloss. This gives us an idea of what would happen on smaller training sets.\n\nFig. 
2 shows the test zero-one loss attained when our technique is applied to the classic Perceptron\nalgorithm. It also shows the test zero-one loss of the last-hypothesis conversion technique. Clearly,\nthe test loss of the last hypothesis is very unstable, even after averaging over 10 repetitions. In some\ncases, adding training data actually degrades the performance of the last hypothesis. If we decide\nto use the last hypothesis technique, our training set size could happen to be such that we end up with\na bad output hypothesis. On the other hand, the cutoff averaging hypothesis is accurate, stable and\nconsistent. The performance of the simple averaging conversion technique is not plotted in Fig. 2,\nbut we note that it was only slightly worse than the performance of cutoff averaging. When using\nthe classic Perceptron, any form of averaging is bene\ufb01cial, and our technique successfully identi\ufb01es\nthis.\n\nFig. 3 shows the test hinge loss of cutoff averaging, last-hypothesis, and simple averaging, when\napplied to the margin-based Perceptron. In this case, the last hypothesis performs remarkably well\nand the simple averaging conversion technique is signi\ufb01cantly inferior for all training set sizes.\nWithin 1000 online rounds (0.1% of the data), the cutoff averaging technique catches up to the last\nhypothesis and performs comparably well from then on. Our technique\u2019s poor performance on the\n\ufb01rst 0.1% of the data is expected, since the tail bounds we rely on are meaningless with so few\nexamples. Once the tail bounds become tight enough, our technique essentially identi\ufb01es that there\nis no bene\ufb01t in constructing a diverse ensemble, and assigns all of the weight to a short suf\ufb01x of the\nonline hypothesis sequence.\n\n[Plots omitted. Ten panels: CCAT vs. GCAT, CCAT vs. MCAT, CCAT vs. ECAT, CCAT vs. OTHER, GCAT vs. MCAT, GCAT vs. ECAT, GCAT vs. OTHER, MCAT vs. ECAT, MCAT vs. OTHER, ECAT vs. OTHER. y-axis: test hinge-loss, 0.1 to 0.9; x-axis: training set size, $10^1$ to $10^5$; curves: cutoff, average, last.]\n\nFigure 3: Test hinge-loss of last-hypothesis, averaging, and cutoff averaging, each applied to the\n\ufb01nite-horizon margin-based Perceptron, on ten binary classi\ufb01cation problems from RCV1. The x-\naxis represents training set size and each plot represents the average over 10 random train-test splits.\n\nWe conclude that there are cases where the single-hypothesis approach is called for and there are\ncases where an ensemble approach should be used. If we are fortunate enough to know which case\napplies, we can simply choose the right approach. However, if we are after a generic solution that\nperforms well in both cases, we need a conversion technique that automatically balances the trade-\noff between these two extremes. Suf\ufb01x averaging [5] and cutoff averaging are two such techniques,\nwith cutoff averaging having a signi\ufb01cant computational advantage.\n\nReferences\n[1] Anonymous. Technical appendix submitted with this manuscript, 2008.\n[2] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of online learning\nalgorithms. IEEE Transactions on Information Theory, 50(9):2050\u20132057, September 2004.\n[3] N. Cesa-Bianchi and C. Gentile. 
Improved risk bounds for online algorithms. NIPS 19, 2006.\n[4] O. Dekel, S. Shalev-Shwartz, and Y. Singer. The Forgetron: A kernel-based perceptron on a\n\nbudget. SIAM Journal on Computing, 37:1342\u20131372, 2008.\n\n[5] O. Dekel and Y. Singer. Data-driven online to batch conversions. NIPS 18, 2006.\n[6] D. A. Freedman. On tail probabilities for martingales. Annals of Prob., 3(1):100\u2013118, 1975.\n[7] Y. Freund and R. E. Schapire. Large margin classi\ufb01cation using the perceptron algorithm.\n\nMachine Learning, 37(3):277\u2013296, 1999.\n\n[8] S. I. Gallant. Optimal linear discriminants. Proc. of ICPR 8, pages 849\u2013852. IEEE, 1986.\n[9] R. Khardon and G. Wachman. Noise tolerant variants of the perceptron algorithm. Journal of\n\nMachine Learning Research, 8:227\u2013248, 2007.\n\n[10] Y. Li. Selective voting for perceptron-like learning. Proc. of ICML 17, pages 559\u2013566, 2000.\n[11] Y. Li, H. Zaragoza, R. He, J. ShaweTaylor, and J. Kandola. The perceptron algorithm with\n\nuneven margins. Proc. of ICML 19, pages 379\u2013386, 2002.\n\n[12] N. Littlestone. From online to batch learning. Proc. of COLT 2, pages 269\u2013284, 1989.\n[13] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization\n\nin the brain. Psychological Review, 65:386\u2013407, 1958.\n\n[14] T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent\n\nalgorithms. Proc. of ICML 21, 2004.\n\n8\n\n\f", "award": [], "sourceid": 195, "authors": [{"given_name": "Ofer", "family_name": "Dekel", "institution": null}]}