{"title": "implicit Online Learning with Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 249, "page_last": 256, "abstract": null, "full_text": "Implicit Online Learning with Kernels\n\nLi Cheng, S. V. N. Vishwanathan\nNational ICT Australia\nli.cheng@nicta.com.au, SVN.Vishwanathan@nicta.com.au\n\nShaojun Wang\nDepartment of Computer Science and Engineering, Wright State University\nshaojun.wang@wright.edu\n\nDale Schuurmans\nDepartment of Computing Science, University of Alberta, Canada\ndale@cs.ualberta.ca\n\nTerry Caelli\nNational ICT Australia\nterry.caelli@nicta.com.au\n\nAbstract\n\nWe present two new algorithms for online learning in reproducing kernel Hilbert spaces. Our first algorithm, ILK (implicit online learning with kernels), employs a new, implicit update technique that can be applied to a wide variety of convex loss functions. We then introduce a bounded memory version, SILK (sparse ILK), that maintains a compact representation of the predictor without compromising solution quality, even in non-stationary environments. We prove loss bounds and analyze the convergence rate of both. Experimental evidence shows that our proposed algorithms outperform current methods on synthetic and real data.\n\n1 Introduction\n\nOnline learning refers to a paradigm where, at each time t, an instance x_t ∈ X is presented to a learner, which uses its parameter vector f_t to predict a label. This predicted label is then compared to the true label y_t via a non-negative, piecewise differentiable, convex loss function L(x_t, y_t, f_t). The learner then updates its parameter vector to minimize a risk functional, and the process repeats.
Kivinen and Warmuth [1] proposed a generic framework for online learning where the risk functional J_t(f) to be minimized consists of two terms: a Bregman divergence between parameters, Δ_G(f, f_t) := G(f) - G(f_t) - ⟨f - f_t, ∂_f G(f_t)⟩, defined via a convex function G, and the instantaneous risk R(x_t, y_t, f), which is usually given by a function of the instantaneous loss L(x_t, y_t, f). The parameter updates are then derived via the principle\n\nf_{t+1} = argmin_f J_t(f) := argmin_f {Δ_G(f, f_t) + η_t R(x_t, y_t, f)}, (1)\n\nwhere η_t is the learning rate. Since J_t(f) is convex, (1) is solved by setting the gradient (or, if necessary, a subgradient) to 0. Using the fact that ∂_f Δ_G(f, f_t) = ∂_f G(f) - ∂_f G(f_t), one obtains\n\n∂_f G(f_{t+1}) = ∂_f G(f_t) - η_t ∂_f R(x_t, y_t, f_{t+1}). (2)\n\nSince it is difficult to determine ∂_f R(x_t, y_t, f_{t+1}) in closed form, an explicit update, as opposed to the above implicit update, uses the approximation ∂_f R(x_t, y_t, f_{t+1}) ≈ ∂_f R(x_t, y_t, f_t) to arrive at the more easily computable expression [1]\n\n∂_f G(f_{t+1}) = ∂_f G(f_t) - η_t ∂_f R(x_t, y_t, f_t). (3)\n\nIn particular, if we set G(f) = (1/2)||f||², then Δ_G(f, f_t) = (1/2)||f - f_t||² and ∂_f G(f) = f, and we obtain the familiar stochastic gradient descent update\n\nf_{t+1} = f_t - η_t ∂_f R(x_t, y_t, f_t). (4)\n\nWe are interested in applying online learning updates in a reproducing kernel Hilbert space (RKHS). To lift the above update into an RKHS, H, one typically restricts attention to f ∈ H and defines [2]\n\nR(x_t, y_t, f) := (λ/2)||f||²_H + C·L(x_t, y_t, f), (5)\n\nwhere ||·||_H denotes the RKHS norm, λ > 0 is a regularization constant, and C > 0 determines the penalty imposed on point prediction violations. Recall that if H is an RKHS of functions on X × Y, then its defining kernel k : (X × Y)² → R satisfies the reproducing property, namely that ⟨f, k((x, y), ·)⟩_H = f(x, y) for all f ∈ H.
Therefore, by making the standard assumption that L only depends on f via its evaluations at f(x, y), one reaches the conclusion that ∂_f L(x, y, f) ∈ H, and in particular\n\n∂_f L(x, y, f) = Σ_{ỹ ∈ Y} β_ỹ k((x, ỹ), ·), (6)\n\nfor some β_ỹ ∈ R. Since ∂_f R(x_t, y_t, f_t) = λ f_t + C·∂_f L(x_t, y_t, f_t), one can use (4) to obtain an explicit update f_{t+1} = (1 - η_t λ) f_t - η_t C·∂_f L(x_t, y_t, f_t), which combined with (6) shows that there must exist coefficients α_{i,ỹ} fully specifying f_{t+1} via\n\nf_{t+1} = Σ_{i=1}^{t} Σ_{ỹ ∈ Y} α_{i,ỹ} k((x_i, ỹ), ·). (7)\n\nIn this paper we propose an algorithm ILK (implicit online learning with kernels) that solves (2) directly, while still expressing updates in the form (7). That is, we derive a technique for computing the implicit update that can be applied to many popular loss functions, including quadratic, hinge, and logistic losses, as well as their extensions to structured domains (see e.g. [3]), all in an RKHS. We also provide a general recipe to check if a new convex loss function is amenable to these implicit updates. Furthermore, to reduce the memory requirement of ILK, which grows linearly with the number of observations (instance-label pairs), we propose a sparse variant SILK (sparse ILK) that approximates the decision function f by truncating past observations with insignificant weights.\n\n2 Implicit Updates in an RKHS\n\nAs shown in (1), to perform an implicit update one needs to minimize Δ_G(f, f_t) + η_t R(x_t, y_t, f). By replacing R(x_t, y_t, f) with (5), and setting G(f) = (1/2)||f||²_H, one obtains\n\nf_{t+1} = argmin_f J(f) = argmin_f { (1/2)||f - f_t||²_H + η_t [ (λ/2)||f||²_H + C·L(x_t, y_t, f) ] }. (8)\n\nSince L is assumed convex with respect to f, setting ∂_f J = 0 and using an auxiliary variable τ_t = η_t λ / (1 + η_t λ) yields\n\nf_{t+1} = (1 - τ_t) f_t - (1 - τ_t) η_t C ∂_f L(x_t, y_t, f_{t+1}). (9)\n\nOn the other hand, from the form (7) it follows that f_{t+1} can also be written as\n\nf_{t+1} = Σ_{i=1}^{t-1} Σ_{ỹ ∈ Y} α_{i,ỹ} k((x_i, ỹ), ·) + Σ_{ỹ ∈ Y} α_{t,ỹ} k((x_t, ỹ), ·), (10)\n\nfor some α_{j,ỹ} ∈ R and j = 1, ..., t. Since\n\n∂_f L(x_t, y_t, f_{t+1}) = Σ_{ỹ ∈ Y} β_{t,ỹ} k((x_t, ỹ), ·),\n\nand since, for ease of exposition, we assume a fixed step size (learning rate) η_t = 1 and consequently τ_t = τ, it follows from (9) and (10) that the existing coefficients are rescaled and a new one is added:\n\nα_{i,ỹ} ← (1 - τ) α_{i,ỹ} for i = 1, ..., t - 1 and ỹ ∈ Y, (11)\nα_{t,ỹ} = -(1 - τ) C β_{t,ỹ} for all ỹ ∈ Y. (12)\n\nNote that sophisticated step size adaptation algorithms (e.g. [3]) can be modified in a straightforward manner to work in our setting. The main difficulty in performing the above update arises from the fact that β_{t,ỹ} depends on f_{t+1} (see e.g. (13)), which in turn depends on β_{t,ỹ} via α_{t,ỹ}. The general recipe to overcome this problem is to first use (9) to write β_{t,ỹ} as a function of α_{t,ỹ}. Plugging this back into (12) yields an equation in α_{t,ỹ} alone, which sometimes can be solved efficiently. We now elucidate the details for some well-known loss functions.\n\nSquare Loss. In this case, k((x_t, y_t), ·) = k(x_t, ·). That is, the kernel does not depend on the value of y. Furthermore, we assume that Y = R, and write\n\nL(x_t, y_t, f) = (1/2)(f(x_t) - y_t)² = (1/2)(⟨f(·), k(x_t, ·)⟩_H - y_t)²,\n\nwhich yields ∂_f L(x_t, y_t, f) = (f(x_t) - y_t) k(x_t, ·). Substituting into (12) and using (9) we have\n\nα_t = -(1 - τ) C ((1 - τ) f_t(x_t) + α_t k(x_t, x_t) - y_t).\n\nAfter some straightforward algebraic manipulation we obtain the solution\n\nα_t = C(1 - τ)(y_t - (1 - τ) f_t(x_t)) / (1 + C(1 - τ) k(x_t, x_t)). (13)\n\nBinary Hinge Loss. As before, we assume k((x_t, y_t), ·) = k(x_t, ·), and set Y = {±1}. The hinge loss for binary classification can be written as\n\nL(x_t, y_t, f) = (ρ - y_t f(x_t))_+ = (ρ - y_t ⟨f, k(x_t, ·)⟩_H)_+, (14)\n\nwhere ρ > 0 is the margin parameter, and (·)_+ := max(0, ·).
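A minimal numerical check of the square-loss coefficient (13) derived above, assuming eta_t = 1 and a Gaussian kernel on real-valued inputs. The function names and the toy values are illustrative and not part of the paper. The sketch evaluates (13) and confirms that the result satisfies the implicit fixed-point condition obtained from (9) and (12).

```python
import math

def gaussian_kernel(a, b, width=1.0):
    # Illustrative kernel choice (an assumption; the text does not fix one here).
    return math.exp(-((a - b) ** 2) / (2.0 * width ** 2))

def square_loss_alpha(f_t_x, y, k_xx, tau, C):
    # Closed-form coefficient from (13), with step size eta_t = 1.
    return C * (1.0 - tau) * (y - (1.0 - tau) * f_t_x) / (1.0 + C * (1.0 - tau) * k_xx)

def implicit_residual(alpha, f_t_x, y, k_xx, tau, C):
    # Residual of the fixed-point condition from (12) and (9):
    # alpha = -(1 - tau) * C * ((1 - tau) * f_t(x) + alpha * k(x, x) - y)
    f_next_x = (1.0 - tau) * f_t_x + alpha * k_xx
    return alpha + (1.0 - tau) * C * (f_next_x - y)

# Toy values (hypothetical): current prediction, label, decay and penalty.
f_t_x, y, tau, C = 0.3, 1.0, 0.1, 2.0
k_xx = gaussian_kernel(0.5, 0.5)  # equals 1 for a Gaussian kernel
alpha = square_loss_alpha(f_t_x, y, k_xx, tau, C)
residual = implicit_residual(alpha, f_t_x, y, k_xx, tau, C)
```

The residual vanishes up to rounding because the gradient is evaluated at f_{t+1} rather than at f_t, which is the defining property of the implicit update.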
Recall that the subgradient is a set, and the function is said to be differentiable at a point if this set is a singleton [4]. The binary hinge loss is not differentiable at the hinge point, but its subgradient exists everywhere. Writing ∂_f L(x_t, y_t, f) = β_t k(x_t, ·) we have:\n\ny_t f(x_t) > ρ ⟹ β_t = 0; (15a)\ny_t f(x_t) = ρ ⟹ β_t ∈ [0, -y_t]; (15b)\ny_t f(x_t) < ρ ⟹ β_t = -y_t. (15c)\n\nWe need to balance between two conflicting requirements while computing α_t. On one hand we want the loss to be zero, which can be achieved by setting ρ - y_t f_{t+1}(x_t) = 0. On the other hand, the gradient of the loss at the new point, ∂_f L(x_t, y_t, f_{t+1}), must satisfy (15). We satisfy both constraints by appropriately clipping the optimal estimate of α_t. Let α̂_t denote the optimal estimate of α_t which leads to ρ - y_t f_{t+1}(x_t) = 0. Using (9) we have\n\nρ - y_t ((1 - τ) f_t(x_t) + α̂_t k(x_t, x_t)) = 0,\n\nwhich yields\n\nα̂_t = (ρ - (1 - τ) y_t f_t(x_t)) / (y_t k(x_t, x_t)) = y_t (ρ - (1 - τ) y_t f_t(x_t)) / k(x_t, x_t).\n\nOn the other hand, by using (15) and (12) we have α_t y_t ∈ [0, (1 - τ)C]. By combining the two scenarios, we arrive at the final update\n\nα_t = α̂_t if y_t α̂_t ∈ [0, (1 - τ)C];\nα_t = 0 if y_t α̂_t < 0;\nα_t = y_t (1 - τ)C if y_t α̂_t > (1 - τ)C. (16)\n\nThe updates for the hinge loss used in novelty detection are very similar.\n\nGraph Structured Loss. The graph-structured loss on the label domain can be written as\n\nL(x_t, y_t, f) = [-f(x_t, y_t) + max_{ỹ ≠ y_t} (Δ(y_t, ỹ) + f(x_t, ỹ))]_+. (17)\n\nHere, the margin of separation between labels is given by Δ(y_t, ỹ), which in turn depends on the graph structure of the output space. This is a very general loss, which includes binary and multiclass hinge loss as special cases (see e.g. [3]). We briefly summarize the update equations for this case. Let y* = argmax_{ỹ ≠ y_t} {Δ(y_t, ỹ) + f_t(x_t, ỹ)} denote the best runner-up label for the current instance x_t.
Then set α_{t,y_t} = -α_{t,y*} = α_t, use k_t(y, y') to denote k((x_t, y), (x_t, y')) and write\n\nα̂_t = [-(1 - τ) f_t(x_t, y_t) + Δ(y_t, y*) + (1 - τ) f_t(x_t, y*)] / (k_t(y_t, y_t) + k_t(y*, y*) - 2 k_t(y_t, y*)).\n\nThe updates are now given by\n\nα_t = 0 if α̂_t < 0;\nα_t = α̂_t if α̂_t ∈ [0, (1 - τ)C];\nα_t = (1 - τ)C if α̂_t > (1 - τ)C. (18)\n\nLogistic Regression Loss. The logistic regression loss and its gradient can be written as\n\nL(x_t, y_t, f) = log(1 + exp(-y_t f(x_t))), ∂_f L(x_t, y_t, f) = -y_t k(x_t, ·) / (1 + exp(y_t f(x_t))),\n\nrespectively. Using (9) and (12), we obtain\n\nα_t = (1 - τ) C y_t / (1 + exp(y_t (1 - τ) f_t(x_t) + α_t y_t k(x_t, x_t))).\n\nAlthough this equation does not give a closed-form solution, the value of α_t can still be obtained by using a numerical root-finding routine, such as those described in [5].\n\n2.1 ILK and SILK Algorithms\n\nWe refer to the algorithm that performs implicit updates as ILK, for \"implicit online learning with kernels\". The update equations of ILK enjoy certain advantages. For example, using (11) it is easy to see that an exponential decay term can be naturally incorporated to down-weight past observations:\n\nf_{t+1} = Σ_{i=1}^{t} Σ_{ỹ ∈ Y} (1 - τ)^{t-i} α_{i,ỹ} k((x_i, ỹ), ·). (19)\n\nIntuitively, the parameter τ ∈ (0, 1) (determined by λ and η) trades off between the regularizer and the loss on the current sample. In the case of hinge losses, both binary and graph structured, the weight |α_t| is always upper bounded by (1 - τ)C, which ensures limited influence from outliers (cf. (16) and (18)). A major drawback of the ILK algorithm described above is that the size of the kernel expansion grows linearly with the number of data points up to time t (see (10)). In many practical domains where real-time prediction is important (for example, video surveillance), storing all the past observations and their coefficients is prohibitively expensive. Therefore, following Kivinen et al. [2] and Vishwanathan et al.
[3] one can truncate the function expansion by storing only a few relevant past observations. We call this version of our algorithm SILK, for \"sparse ILK\". Specifically, the SILK algorithm maintains a buffer of fixed size. Each new point is inserted into the buffer with coefficient α_t. Once the buffer limit is exceeded, the point with the lowest coefficient value is discarded to maintain a bound on memory usage. This scheme is more effective than the straightforward least recently used (LRU) strategy proposed in Kivinen et al. [2] and Vishwanathan et al. [3]. It is relatively straightforward to show that the difference between the true predictor and its truncated version, obtained by storing only a bounded number of expansion coefficients, decreases exponentially as the buffer size increases [2].\n\n3 Theoretical Analysis\n\nIn this section we will primarily focus on analyzing the graph-structured loss (17), establishing relative loss bounds and analyzing the rate of convergence of ILK and SILK. Our proof techniques adopt those of Kivinen et al. [2]. Due to space constraints, we leave some details and analysis to the full version of the paper. Although the bounds we obtain are similar to those obtained in [2], our experimental results clearly show that ILK and SILK are stronger than the NORMA strategy of [2] and its truncated variant.\n\n3.1 Mistake Bound\n\nWe begin with a technical definition.\n\nDefinition 1 A sequence of hypotheses {(f_1, ..., f_T) : f_t ∈ H} is said to be (T, B, D_1, D_2) bounded if it satisfies ||f_t||²_H ≤ B² for all t ∈ {1, ..., T}, Σ_t ||f_t - f_{t+1}||_H ≤ D_1, and Σ_t ||f_t - f_{t+1}||²_H ≤ D_2 for some B, D_1, D_2 ≥ 0. The set of all (T, B, D_1, D_2) bounded hypothesis sequences is denoted as F(T, B, D_1, D_2).\n\nGiven a fixed sequence of observations {(x_1, y_1), ..., (x_T, y_T)}, and a sequence of hypotheses {(f_1, ...
, fT )  F }, the number of errors M is defined as\n M := |{t : f (xt , yt , yt )  0}| ,    where f (xt , yt , yt ) = f (xt , yt ) - f (xt , yt ) and yt is the best runner-up label. To keep the equations succinct, we denote kt ((yt , y ), ) := k ((xt , yt ), )-k ((xt , y ), ), and kt ((yt , y ), (yt , y )) := kt ((yt , y ), ) 2 = kt (yt , yt ) - 2kt (yt , y ) + kt (y , y ). In the following we bound the number H of mistakes M made by ILK by the cumulative loss of an arbitrary sequence of hypotheses from F (T , B , D1 , D2 ).\n\nTheorem 2 Let {(x1 , y1 ), . . . , (xT , yT )} be an arbitrary sequence of observations such that kt ((yt , y ), (yt , y ))  X 2 holds for any t, any y , and for some X > 0. For an arbitraryt sequence of hypotheses (g1 ,    , gT )  F (T , B , D1 , D2 ) with t average margin  =  , g 1  (yt , yt  ) - (yt , yt ) and bounded cumulative loss K := L(xt , yt , gt ), the numE |E | ber of mistakes of Dhe sequence of hypotheses (f1 ,    , fT ) generated by ILK with learning rate t t =  ,  =\n1 B\n2\n\nT\n\nis upper-bounded by M\n\nK 1S1 2 K 2S S2 + 2 +2 +2 , (20) 2       2 g where S = X (B 2 + B D1 + B T D2 ),  > 0, and yt  denotes the best runner-up label with 4 hypothesis gt . When considering the stationary distribution in a separable (noiseless) scenario, this theorem allows us to obtain a mistake bound that is reminiscent of the Perceptron convergence theorem. In particular, if we assume the sequence of hypotheses (g1 ,    , gT )  F (T , B , D1 = 0, D2 = 0) and the cumulative loss K = 0, we obtain a bound on the number of mistakes M 3.2 Convergence Analysis T The following theorem asserts that under mild assumptions, the cumulative risk t=1 R(xt , yt , ft ) of the hypothesis sequence produced by ILK converges to the minimum risk of the batch learning T counterpart g  := argmingH t=1 R(xt , yt , g ) at a rate of O(T -1/2 ). Theorem 3 Let {(x1 , y1 ), . . . 
, (xT , yT )} be an arbitrary sequence of observations such that kt ((yt , yt ), (yt , yt ))  X 2 holds for any t, any y . Denote (f1 , . . . , fT ) the sequence of hypotheses T produced by ILK with learning rate t =  t-1/2 , t=1 R(xt , yt , ft ) the cumulative risk of this T sequence, and t=1 R(xt , yt , g ) the batch cumulative risk of (g , . . . , g ), for any g  H. Then tT\n=1 CX ,\n\nB2X 2 . 2\n\n(21)\n\nR(xt , yt , ft ) \n2 2 2U 2 ,\n\ntT\n=1\n\nR(xt , yt , g ) + aT 1/2 + b,\nU2 2\n\nwhere U =\n\na = 4 C X +\n\nand b =\n\nare constants. In particular, if R(xt , yt , g ),\n\ng  = arg min\ng H\n\ntT\n=1\n\nwe obtain\n\nT T 1t 1t R(xt , yt , ft )  R(xt , yt , g  ) + O(T -1/2 ). T =1 T =1\n\n(22)\n\nEssentially the same theorem holds for SILK, but now with a slightly larger constant a = 2 2 4  (1 +  )C 2 X 2 + 2U . In addition, denote g  the minimizer of the batch learning cumula t tive risk R(xt , yt , g ), and f  the minimizer of the minimum expected risk with R(f  ) := minf E(x,y)P (x,y) R(x, y , f ). As stated in [6] for the structured risk minimization framework, as\n\n\f\n3500\n\n3000\n\nNORMA vs. ILK 180\n\n2500\n\n2000\n\n180 170 NORMA 160 150 140\n-200 0 200 400 600 800\n\nMistakes of NORMA\n\n1500\n\n160\n\n1000\n\nILK\n\n500\n\n0 -400\n\nSILK\n\nILK(0)\n\n140 140\nTrunc. NORMA(0)\n\n160 Mistakes of ILK\n\n180\n\nFigure 1: The left panel depicts a synthetic data sequence containing two classes (blue crosses and red diamonds, see the zoomed-in portion in bottom-left corner), with each class being sampled from a mixture of two drifting Gaussian distributions. Performance comparison of ILK vs NORMA and truncated NORMA on this data: Average cumulative error over 100 trials (middle), and average cumulative error each trial (right).\n\nthe sample size T grows, T  , we obtain g   f  in probability. This subsequently guarantees the convergence of the average regularized risk of ILK and SILK to R(f  ). 
The upper bound in the above theorem can be directly plugged into Corollary 2 of Cesa-Bianchi et al. [7] to obtain bounds on the generalization error of ILK. Let f̄ denote the average hypothesis produced by averaging over all hypotheses f_1, ..., f_T. Then for any δ ∈ (0, 1), with probability at least 1 - δ, the expected risk of f̄ is upper bounded by the risk of the best hypothesis chosen in hindsight plus a term which grows as O(1/√T).\n\n4 Experiments\n\nWe evaluate the performance of ILK and SILK by comparing them to NORMA [2] and its truncated variant. On OCR data, we also compare our algorithms to SVMD, a sophisticated step-size adaptation algorithm in RKHS presented in [3]. For a fair comparison we tuned the parameters of each algorithm separately and report the best results. In addition, we fixed the margin to ρ = 1 for all our loss functions.\n\nBinary Classification on Synthetic Sequences. The aim here is to demonstrate that ILK is better than NORMA in coping with non-stationary distributions. Each trial of our experiment works with 2000 two-dimensional instances sampled from a non-stationary distribution (see Figure 1), and the task is to classify the sampled points into one of two classes. The central panel of Figure 1 compares the number of errors made by various algorithms, averaged over 100 trials. Here, ILK and SILK make fewer mistakes than NORMA and truncated NORMA. We also tested two other algorithms: ILK(0), obtained by setting the decay factor to zero, and similarly NORMA(0). As expected, both these variants make more mistakes because they are unable to forget the past, which is crucial for obtaining good performance in a non-stationary environment. To further compare the performance of ILK and NORMA we plot the relative errors of these two algorithms in the right panel of Figure 1. As can be seen, ILK outperforms NORMA on this simple non-stationary problem.
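To make the role of the decay and the clipping concrete, the following is a minimal sketch of a binary hinge-loss learner that applies the clipped coefficient (16), the decay (11), and SILK-style buffer truncation. It assumes eta_t = 1, margin rho = 1, scalar inputs, and a Gaussian kernel. The class and function names are hypothetical, and reading lowest coefficient value as smallest absolute value is an assumption. It is not the authors' implementation and does not reproduce the experiments reported here.

```python
import math

def rbf(a, b, width=1.0):
    # Gaussian kernel on real inputs (an illustrative choice).
    return math.exp(-((a - b) ** 2) / (2.0 * width ** 2))

def hinge_alpha(f_t_x, y, k_xx, tau, C, rho=1.0):
    # Unclipped estimate that places x exactly on the margin, then clipped as in (16).
    alpha_hat = y * (rho - (1.0 - tau) * y * f_t_x) / k_xx
    clipped = min(max(y * alpha_hat, 0.0), (1.0 - tau) * C)
    return y * clipped

class SparseILK:
    def __init__(self, tau, C, buffer_size, kernel=rbf):
        self.tau, self.C = tau, C
        self.buffer_size, self.kernel = buffer_size, kernel
        self.points = []  # stored instances x_i
        self.alphas = []  # their expansion coefficients

    def predict(self, x):
        return sum(a * self.kernel(p, x) for p, a in zip(self.points, self.alphas))

    def update(self, x, y):
        f_t_x = self.predict(x)
        alpha = hinge_alpha(f_t_x, y, self.kernel(x, x), self.tau, self.C)
        # Decay old coefficients as in (11), then append the new one as in (12).
        self.alphas = [(1.0 - self.tau) * a for a in self.alphas]
        self.points.append(x)
        self.alphas.append(alpha)
        if len(self.points) > self.buffer_size:
            # Assumed truncation rule: drop the smallest coefficient in magnitude.
            j = min(range(len(self.alphas)), key=lambda i: abs(self.alphas[i]))
            del self.points[j]
            del self.alphas[j]
        return f_t_x  # prediction made before seeing the label

# Toy stream (hypothetical): negative inputs labelled -1, positive inputs +1.
model = SparseILK(tau=0.05, C=1.0, buffer_size=5)
stream = [(-2.0, -1), (2.0, 1), (-1.5, -1), (1.5, 1),
          (-2.5, -1), (2.5, 1), (-1.0, -1), (1.0, 1)]
mistakes = 0
for x, y in stream:
    if y * model.update(x, y) <= 0:
        mistakes += 1
```

Because each coefficient is clipped to magnitude at most (1 - tau) * C and then decays geometrically, a single outlier has bounded and fading influence on later predictions, which is the property the text attributes to the hinge-loss variants.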
Novelty Detection on Video Sequences. As a significant application, we applied SILK to a background subtraction problem in video data analysis. The goal is to detect the moving foreground objects (such as cars and persons) from relatively static background scenes in real time. The challenge in this application is to be able to cope with variations in lighting as well as jitter due to shaking of the camera. We formulate the problem as a novelty detection task using a network of classifiers, one for each pixel. For this task we compare the performance of SILK vs. truncated NORMA. (The ILK and NORMA algorithms are not suitable since their storage requirements grow linearly.) A constant buffer size of 20 is used for both algorithms in this application. We report further implementation details in the full version of this paper. The first task is to identify people, under varying lighting conditions, in an indoor video sequence taken with a static camera. The left-hand panel of Figure 2 plots the ROC curves of NORMA and SILK, which demonstrates the overall better performance of SILK. We sampled one of the initial frames after the light was switched off and back on. The results are shown in the right panel of Figure 2. As can be seen, SILK is able to recover from the change in lighting condition better than NORMA, and is able to identify foreground objects reasonably close to the ground truth.\n\n[Figure 2, left panel: ROC curves (true positive vs. false positive) for SILK and NORMA. Right panel: frame 1353 with ground truth, NORMA result, and SILK result.]\n\nFigure 2: Performance comparison of SILK vs truncated NORMA on a background subtraction (moving object detection) task, with varying lighting conditions. ROC curve (left) and a comparison of algorithms immediately after the lights have been switched off and on (right).\n\nFigure 3: Performance of SILK on a road traffic sequence (moving car detection) task, with a jittery camera.
Two random frames and the performance of SILK on those frames are depicted.\n\nOur second experiment is a traffic sequence taken by a camera that shakes irregularly, which creates a challenging problem for any novelty detection algorithm. As seen from the randomly chosen frames plotted in Figure 3 SILK manages to obtain a visually plausible detection result. We cannot report a quantitative comparison with other methods in this case, due to the lack of manually labeled ground-truth data. Binary and Multiclass Classification on OCR data We present two sets of experiments on the MNIST dataset. The aim of the first set experiment is to show that SILK is competitive with NORMA and SVMD on a simple binary task. The data is split into two classes comprising the digits 0 - 4 and 5 - 9, respectively. A polynomial kernel of degree 9 and a buffer size of  = 128 is employed for all three algorithms. Figure 4 (a) plots current average error rate, i.e., the total number of errors on the examples seen so far divided by the iteration number. As can be seen, after the initial oscillations have died out, SILK consistently outperforms SVMD and NORMA, achieving a lower average error after one pass through the dataset. Figure 4 (b) examines the effect of buffer size on SILK. As expected, smaller buffer sizes result in larger truncation error and hence worse performance. With increasing buffer size the asymptotic average error decreases. For the 10-way multiclass classification task we set  = 128, and used a Gaussian kernel following [3]. Figure 4 (c) shows that SILK consistently outperforms NORMA and SVMD, while the trend with the increasing buffer size is repeated, as shown in Figure 4 (d). In both experiments, we used the parameters for NORMA and SVMD reported in [3], and set  = 0.00005 and C = 100 for SILK.\n\n5\n\nOutlook and Discussion\n\nIn this paper we presented a general recipe for performing implicit online updates in an RKHS. 
Specifically, we showed that for many popular loss functions these updates can be computed efficiently. We then presented a sparse version of our algorithm which uses limited basis expansions to approximate the function. For graph-structured loss we also showed loss bounds and rates of convergence. Experiments on real-life datasets demonstrate that our algorithm is able to track non-stationary targets, and outperforms existing algorithms. For the binary hinge loss, when τ = 0 the proposed update formula for α_t (16) reduces to the PA-I algorithm of Crammer et al. [8]. Curiously enough, the motivation for the updates in both cases seems completely different. While we use an implicit update formula, Crammer et al. [8] use a Lagrangian formulation and a passive-aggressive strategy. Furthermore, the loss functions they handle are generally linear (hinge loss and its various generalizations) while our updates can handle other non-linear losses such as quadratic or logistic loss. Our analysis of loss bounds is admittedly straightforward given current results. The use of more sophisticated analysis and extending our bounds to deal with other non-linear loss functions is ongoing. We are also applying our techniques to video analysis applications by exploiting the structure of the output space.\n\nFigure 4: Performance comparison of different algorithms over one run of the MNIST dataset. (a) Online binary classification. (b) Performance of SILK using different buffer sizes. (c) Online 10-way multiclass classification. (d) Performance of SILK on three different buffer sizes.\n\nAcknowledgements\n\nWe thank Xinhua Zhang, Simon Guenter, Nic Schraudolph and Bob Williamson for carefully proofreading the paper, pointing us to many references, and helping us improve the presentation style.
National ICT Australia is funded by the Australian Government's Department of Communications, Information Technology and the Arts and the Australian Research Council through Backing Australia's Ability and the ICT Center of Excellence program. This work is supported by the IST Program of the European Community, under the Pascal Network of Excellence, IST-2002-506778.\n\nReferences\n\n[1] J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–64, 1997.\n[2] J. Kivinen, A. J. Smola, and R. C. Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8), 2004.\n[3] S. V. N. Vishwanathan, N. N. Schraudolph, and A. J. Smola. Step size adaptation in reproducing kernel Hilbert space. Journal of Machine Learning Research, 7, 2006.\n[4] R. T. Rockafellar. Convex Analysis, volume 28 of Princeton Mathematics Series. Princeton University Press, 1970.\n[5] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing (2nd ed.). Cambridge University Press, Cambridge, 1992. ISBN 0-521-43108-5.\n[6] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998.\n[7] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Trans. Information Theory, 50(9):2050–2057, 2004.\n[8] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551–585, 2006.\n", "award": [], "sourceid": 3038, "authors": [{"given_name": "Li", "family_name": "Cheng", "institution": null}, {"given_name": "Dale", "family_name": "Schuurmans", "institution": null}, {"given_name": "Shaojun", "family_name": "Wang", "institution": null}, {"given_name": "Terry", "family_name": "Caelli", "institution": null}, {"given_name": "S.v.n.", "family_name": "Vishwanathan", "institution": null}]}