{"title": "Online Learning with Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 785, "page_last": 792, "abstract": null, "full_text": "Online Learning with Kernels\n\nJyrki Kivinen\n\nAlex J. Smola\n\nRobert C. Williamson\n\nResearch School of Information Sciences and Engineering\n\nAustralian National University\n\nCanberra, ACT 0200\n\nAbstract\n\nWe consider online learning in a Reproducing Kernel Hilbert Space. Our\nmethod is computationally ef\ufb01cient and leads to simple algorithms. In\nparticular we derive update equations for classi\ufb01cation, regression, and\nnovelty detection. The inclusion of the  -trick allows us to give a robust\nparameterization. Moreover, unlike in batch learning where the  -trick\nonly applies to the\n-insensitive loss function we are able to derive gen-\neral trimmed-mean types of estimators such as for Huber\u2019s robust loss.\n\n1 Introduction\n\nWhile kernel methods have proven to be successful in many batch settings (Support Vector\nMachines, Gaussian Processes, Regularization Networks) the extension to online methods\nhas proven to provide some unsolved challenges. Firstly, the standard online settings for\nlinear methods are in danger of over\ufb01tting, when applied to an estimator using a feature\nspace method. This calls for regularization (or prior probabilities in function space if the\nGaussian Process view is taken).\n\nSecondly, the functional representation of the estimator becomes more complex as the num-\nber of observations increases. More speci\ufb01cally, the Representer Theorem [10] implies\nthat the number of kernel functions can grow up to linearly with the number of observa-\ntions. 
Depending on the loss function used [15], this will happen in practice in most cases. Thereby the complexity of the estimator used in prediction increases linearly over time (in some restricted situations this can be reduced to logarithmic cost [8]).\n\n
Finally, the training time of batch and/or incremental update algorithms typically increases superlinearly with the number of observations. Incremental update algorithms [2] attempt to overcome this problem but cannot guarantee a bound on the number of operations required per iteration. Projection methods [3], on the other hand, ensure a limited number of updates per iteration. However, they can be computationally expensive since they require one matrix multiplication at each step; the size of the matrix is given by the number of kernel functions required at each step.\n\n
Recently several algorithms have been proposed [5, 8, 6, 12] performing perceptron-like updates for classification at each step. Some algorithms work only in the noise-free case, others not for moving targets, and yet others assume an upper bound on the complexity of the estimators. In the present paper we present a simple method which allows the use of kernel estimators for classification, regression, and novelty detection and which copes with a large number of kernel functions efficiently.\n\n
2 Stochastic Gradient Descent in Feature Space\n\n
Reproducing Kernel Hilbert Space. The class of functions f: X → R to be studied in this paper are elements of an RKHS H. This means that there exists a kernel k: X × X → R and a dot product ⟨·, ·⟩ such that 1) ⟨f, k(x, ·)⟩ = f(x) (reproducing property); 2) H is the closure of the span of all k(x, ·) with x ∈ X. In other words, all f ∈ H are linear combinations of kernel functions.\n\n
Typically Ω[f] := (1/2)||f||² = (1/2)⟨f, f⟩ is used as a regularization functional. It is the “length of the weight vector in feature space” as commonly used in SV algorithms. To state our algorithm we need to compute derivatives of functionals defined on H. For the regularizer Ω[f] = (1/2)||f||² we obtain ∂_f Ω[f] = f. For the evaluation functional E_x[f] := f(x) we compute the derivative by using the reproducing property of H and obtain ∂_f f(x) = ∂_f ⟨f, k(x, ·)⟩ = k(x, ·). Consequently, for a loss function l: X × Y × R → R which is differentiable in its third argument we obtain ∂_f l(x, y, f(x)) = l'(x, y, f(x)) k(x, ·), where l' denotes the derivative of l with respect to its third argument. Below l will be the loss function.\n\n
Regularized Risk Functionals and Learning. In the standard learning setting we are supplied with pairs of observations (x_i, y_i) ∈ X × Y drawn according to some underlying distribution P(x, y). Our aim is to predict the likely outcome y at location x. Several variants are possible: (i) P(x, y) may change over time, (ii) the training sample may be the next observation on which to predict, which leads to a true online setting, or (iii) we may want to find an algorithm which approximately minimizes a regularized risk functional on a given training set.\n\n
We assume that we want to minimize a loss function l: X × Y × R → R which penalizes the deviation between an observation y at location x and the prediction f(x). Since P(x, y) is unknown, a standard approach is to instead minimize the empirical risk on the observations (x_1, y_1), ..., (x_m, y_m),\n\n
R_emp[f] = (1/m) Σ_{i=1}^m l(x_i, y_i, f(x_i)),   (1)\n\n
or, in order to avoid overly complex hypotheses, minimize the empirical risk plus an additional regularization term λΩ[f]. This sum is known as the regularized risk\n\n
R_reg[f] = R_emp[f] + λΩ[f]   for λ > 0.   (2)\n\n
Common loss functions are the soft margin loss function [1] or the logistic loss for classification and novelty detection [14], and the quadratic loss, absolute loss, Huber’s robust loss [9], or the ε-insensitive loss [16] for regression. We discuss these in Section 3.\n\n
In some cases the loss function depends on an additional parameter, such as the width of the margin ρ or the size of the ε-insensitive zone. One may make these variables themselves parameters of the optimization problem [15] in order to make the loss function adaptive to the amount or type of noise present in the data. This typically results in a term −νρ or νε added to R_reg[f].\n\n
Stochastic Approximation. In order to find a good estimator we would like to minimize R_reg[f]. This can be costly if the number of observations is large. Recently several gradient descent algorithms for minimizing such functionals efficiently have been proposed [13, 7]. Below we extend these methods to stochastic gradient descent by approximating R_reg[f] by\n\n
R_stoch[f, t] = l(x_t, y_t, f(x_t)) + λΩ[f]   (3)\n\n
and then performing gradient descent with respect to R_stoch. Here (x_t, y_t) is either randomly chosen from {(x_1, y_1), ..., (x_m, y_m)} or it is the new training instance observed at time t. Consequently the gradient of R_stoch[f, t] with respect to f is\n\n
∂_f R_stoch[f, t] = l'(x_t, y_t, f(x_t)) k(x_t, ·) + λ ∂_f Ω[f],   (4)\n\n
and the update equations are hence straightforward:\n\n
f ← f − η ∂_f R_stoch[f, t].   (5)\n\n
Here η > 0 is the learning rate controlling the size of updates undertaken at each iteration. We will return to the issue of adjusting η at a later stage.\n\n
Descent Algorithm. For simplicity, assume that Ω[f] = (1/2)||f||². In this case (5) becomes\n\n
f ← (1 − ηλ) f − η l'(x_t, y_t, f(x_t)) k(x_t, ·).   (6)\n\n
Analogous results hold for general Ω[f]. While (6) is convenient to use for a theoretical analysis, it is not directly amenable to computation. For this purpose we have to express f as a kernel expansion\n\n
f(x) = Σ_i α_i k(x_i, x),   (7)\n\n
where the x_i are (previously seen) training patterns. Then (6) becomes\n\n
α_t = −η l'(x_t, y_t, f(x_t)),   (8)\n
α_i ← (1 − ηλ) α_i   for i < t,   (9)\n
f_{t+1}(x) = Σ_{i=1}^{t} α_i k(x_i, x).   (10)\n\n
Eq. (8) means that at each iteration the kernel expansion may grow by one term. Furthermore, the cost for training at each step is not larger than the prediction cost: once we have computed f(x_t), the coefficient α_t is obtained from the value of the derivative of l at (x_t, y_t, f(x_t)). Instead of updating all coefficients α_i we may simply cache the power series 1, (1 − ηλ), (1 − ηλ)², (1 − ηλ)³, ... and pick suitable terms as needed. 
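As a concrete illustration, the updates (8)-(10) can be sketched in a few lines of plain Python. The Gaussian kernel and the soft-margin derivative below are example choices for the sketch, not prescribed by the derivation above:

```python
import math

def gaussian_kernel(x, y, width=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 width^2)); an example kernel choice
    d = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d / (2.0 * width ** 2))

def soft_margin_deriv(x, y, fx):
    # derivative of l(x, y, f(x)) = max(0, 1 - y f(x)) with respect to f(x)
    return -y if y * fx < 1.0 else 0.0

class KernelSGD:
    """Stochastic gradient descent in an RKHS, f(x) = sum_i alpha_i k(x_i, x)."""

    def __init__(self, kernel, loss_deriv, eta=0.1, lam=0.01):
        self.kernel, self.loss_deriv = kernel, loss_deriv
        self.eta, self.lam = eta, lam
        self.patterns, self.alphas = [], []

    def predict(self, x):
        # evaluate the kernel expansion (10)
        return sum(a * self.kernel(xi, x)
                   for xi, a in zip(self.patterns, self.alphas))

    def update(self, x, y):
        fx = self.predict(x)
        # (9): shrink all old coefficients by (1 - eta * lambda)
        decay = 1.0 - self.eta * self.lam
        self.alphas = [decay * a for a in self.alphas]
        # (8): the expansion may grow by one term per iteration
        self.patterns.append(x)
        self.alphas.append(-self.eta * self.loss_deriv(x, y, fx))
        return fx
```

As the text observes, the cost of one `update` call is dominated by the evaluation of f(x_t) itself.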
Such caching is particularly useful if the derivative of the loss function l only assumes discrete values, say {−1, 0, 1}, as is the case when using the soft-margin type loss functions (see Section 3).\n\n
Truncation. The problem with (8) and (10) is that, without any further measures, the number of basis functions will grow without bound. This is not desirable, since it determines the amount of computation needed for prediction. The regularization term helps us here. At each iteration the coefficients α_i with i < t are shrunk by (1 − ηλ). Thus after τ iterations the coefficient α_i is reduced to (1 − ηλ)^τ α_i. Hence we may drop small terms from the expansion and bound the resulting error:\n\n
Proposition 1 (Truncation Error) For a loss function l(x, y, f(x)) with its first derivative bounded by C and a kernel k with norm ||k(x, ·)|| ≤ X, the truncation error incurred by dropping a term α_i k(x_i, ·) from the kernel expansion of f after τ update steps is bounded by ηC X (1 − ηλ)^τ. Furthermore, the total truncation error incurred by dropping all terms which are at least τ steps old is bounded by\n\n
||f − f_trunc|| ≤ Σ_{s=τ}^{∞} ηC X (1 − ηλ)^s = C X (1 − ηλ)^τ / λ.   (11)\n\n
Obviously the approximation quality increases exponentially with the number of terms retained. The regularization parameter λ can thus be used to control the storage requirements for the expansion. 
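A minimal sketch of this truncation rule, assuming the total-error bound (11) and illustrative constants:

```python
import math

def truncate(alphas, patterns, eta, lam, C, X, tol=1e-3):
    """Drop all expansion terms older than tau steps, where tau is chosen so
    that the total truncation error bound C * X * (1 - eta*lam)**tau / lam
    from (11) stays below tol.  Terms are assumed stored oldest first."""
    decay = 1.0 - eta * lam
    ratio = tol * lam / (C * X)
    if ratio < 1.0:
        # smallest integer tau with C*X*decay**tau / lam <= tol
        tau = max(1, math.ceil(math.log(ratio) / math.log(decay)))
    else:
        tau = 1  # the bound already holds after a single step
    return alphas[-tau:], patterns[-tau:]
```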
In addition, this decay naturally allows for distributions P(x, y) that change over time, in which case it is desirable to forget instances (x_i, y_i) that are much older than the average time scale of the distribution change [11].\n\n
3 Applications\n\n
We now proceed to applications of (8) and (10) to specific learning situations. We utilize the standard addition of a constant offset b ∈ R to the function expansion, i.e. g(x) = f(x) + b where f ∈ H. Hence we also update b into b − η ∂_b R_stoch[g, t].\n\n
Classification. A typical loss function in SVMs is the soft margin, given by l(x, y, g(x)) = max(0, ρ − y g(x)). In this situation the update equations become\n\n
(α_t, b) ← (η y_t, b + η y_t)   if y_t g(x_t) < ρ,\n
(α_t, b) ← (0, b)   otherwise,   (12)\n\n
together with α_i ← (1 − ηλ) α_i for i < t.\n\n
In classification with the ν-trick we avoid having to fix the margin ρ by treating it as a variable [15]. The value of ρ is found automatically by using the loss function\n\n
l(x, y, g(x)) = max(0, ρ − y g(x)) − νρ,   (13)\n\n
where ν ∈ (0, 1] is another parameter. Since ν has a much clearer intuitive meaning than ρ, it is easier to tune. On the other hand, one can show [15] that the specific choice of λ has no influence on the estimate in ν-SV classification. Therefore we may set λ = 1 and obtain\n\n
(α_t, b, ρ) ← (η y_t, b + η y_t, ρ − η(1 − ν))   if y_t g(x_t) < ρ,\n
(α_t, b, ρ) ← (0, b, ρ + ην)   otherwise.   (14)\n\n
Finally, if we choose the hinge loss l(x, y, g(x)) = max(0, −y g(x)), the update equations become\n\n
(α_t, b) ← (η y_t, b + η y_t)   if y_t g(x_t) < 0,\n
(α_t, b) ← (0, b)   otherwise.   (15)\n\n
Setting λ = 0 recovers the kernel-perceptron algorithm. For nonzero λ we obtain the kernel-perceptron with regularization.\n\n
Novelty Detection. The results for novelty detection [14] are similar in spirit. The ν-setting is most useful here, particularly where the estimator acts as a warning device (e.g. network intrusion detection) and we would like to specify an upper limit on the frequency of alerts f(x) < ρ. The relevant loss function is l(x, f(x)) = max(0, ρ − f(x)) − νρ, and usually [14] one uses f ∈ H rather than g = f + b in order to avoid trivial solutions. The update equations are\n\n
(α_t, ρ) ← (η, ρ − η(1 − ν))   if f(x_t) < ρ,\n
(α_t, ρ) ← (0, ρ + ην)   otherwise,   (16)\n\n
with α_i ← (1 − ηλ) α_i for i < t as before. Considering the update of ρ we can see that on average only a fraction ν of observations will be considered for updates. Thus we only have to store a small fraction of the x_i.\n\n
Regression. We consider the following four settings: squared loss, the ε-insensitive loss using the ν-trick, Huber’s robust loss function, and trimmed mean estimators. For convenience we will only use estimates f ∈ H rather than g = f + b. The extension to the latter case is straightforward. 
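The novelty-detection update (16) can be sketched as follows; the Gaussian RBF kernel and the parameter values are illustrative assumptions:

```python
import math

class OnlineNoveltyDetector:
    """Online nu-style novelty detection: raise an alert whenever f(x) < rho."""

    def __init__(self, nu=0.1, eta=0.1, lam=0.01, width=1.0):
        self.nu, self.eta, self.lam, self.width = nu, eta, lam, width
        self.rho = 0.0
        self.patterns, self.alphas = [], []

    def _k(self, x, y):
        d = sum((a - b) ** 2 for a, b in zip(x, y))
        return math.exp(-d / (2.0 * self.width ** 2))

    def f(self, x):
        return sum(a * self._k(xi, x)
                   for xi, a in zip(self.patterns, self.alphas))

    def observe(self, x):
        novel = self.f(x) < self.rho  # margin error: pattern looks novel
        self.alphas = [(1.0 - self.eta * self.lam) * a for a in self.alphas]
        if novel:
            self.patterns.append(x)       # alpha_t = eta, as in (16)
            self.alphas.append(self.eta)
            self.rho -= self.eta * (1.0 - self.nu)
        else:
            self.rho += self.eta * self.nu
        return novel
```

Because ρ rises by ην on ordinary points and falls by η(1 − ν) on alerts, alerts settle at roughly a fraction ν of the stream, which is what keeps the stored expansion small.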
We begin with squared loss, where l is given by l(x, y, f(x)) = (1/2)(y − f(x))². Consequently the update equation is\n\n
α_t = η (y_t − f(x_t)).   (17)\n\n
This means that we have to store every observation we make or, more precisely, the prediction error we made on the observation.\n\n
The ε-insensitive loss l(x, y, f(x)) = max(0, |y − f(x)| − ε) avoids this problem but introduces a new parameter in turn, the width of the insensitivity zone ε. By making ε a variable of the optimization problem we have l(x, y, f(x)) = max(0, |y − f(x)| − ε) + νε. The update equations now have to be stated in terms of ε as well, which is allowed to change during the optimization process. This leads to\n\n
α_t = η sign(y_t − f(x_t)),  ε ← ε + η(1 − ν)   if |y_t − f(x_t)| > ε,\n
α_t = 0,  ε ← ε − ην   otherwise.   (18)\n\n
This means that every time the prediction error exceeds ε, we increase the insensitivity zone by η(1 − ν). Likewise, if it is smaller than ε, the insensitivity zone is decreased by ην. Next let us analyze the case of regression with Huber’s robust loss. 
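Before moving on, the adaptive-ε update (18) can be sketched as follows (again with an assumed Gaussian kernel; clamping ε at zero is a practical safeguard, not part of (18)):

```python
import math

class OnlineNuRegression:
    """Online regression with the eps-insensitive loss, eps adapted as in (18)."""

    def __init__(self, nu=0.2, eta=0.1, lam=0.01, width=1.0):
        self.nu, self.eta, self.lam, self.width = nu, eta, lam, width
        self.eps = 0.0
        self.patterns, self.alphas = [], []

    def _k(self, x, y):
        d = sum((a - b) ** 2 for a, b in zip(x, y))
        return math.exp(-d / (2.0 * self.width ** 2))

    def predict(self, x):
        return sum(a * self._k(xi, x)
                   for xi, a in zip(self.patterns, self.alphas))

    def update(self, x, y):
        err = y - self.predict(x)
        self.alphas = [(1.0 - self.eta * self.lam) * a for a in self.alphas]
        if abs(err) > self.eps:
            # error leaves the tube: add a term and widen the insensitivity zone
            self.patterns.append(x)
            self.alphas.append(self.eta * math.copysign(1.0, err))
            self.eps += self.eta * (1.0 - self.nu)
        else:
            # inside the tube: no new term; shrink the zone (clamped at zero)
            self.eps = max(0.0, self.eps - self.eta * self.nu)
```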
Huber’s robust loss is given by\n\n
l(x, y, f(x)) = (1/(2σ)) (y − f(x))²   if |y − f(x)| ≤ σ,\n
l(x, y, f(x)) = |y − f(x)| − σ/2   otherwise.   (19)\n\n
As before, we obtain the update equations by computing the derivative of l with respect to f(x). This leads to\n\n
α_t = η (y_t − f(x_t)) / σ   if |y_t − f(x_t)| ≤ σ,\n
α_t = η sign(y_t − f(x_t))   otherwise.   (20)\n\n
Comparing (20) with (18) leads to the question whether σ might not also be adjusted adaptively. This is a desirable goal, since we may not know the amount of noise present in the data. While the ν-setting allowed us to form such adaptive estimators for batch learning with the ε-insensitive loss, this goal has proven elusive for other estimators in the standard batch setting. In the online situation, however, such an extension is quite natural (see also [4]). All we need to do is make σ a variable of the optimization problem, add the term νσ to the loss, and set\n\n
σ ← σ + η (1/2 − ν)   if |y_t − f(x_t)| > σ,\n
σ ← σ − η (ν − (y_t − f(x_t))² / (2σ²))   otherwise.   (21)\n\n
4 Theoretical Analysis\n\n
Consider now the classification problem with the soft margin loss l(x, y, f(x)) = max(0, ρ − y f(x)); here ρ is a fixed margin parameter. Let f_t denote the hypothesis of the online algorithm after seeing the first t − 1 observations. Thus, at time t, the algorithm receives an input x_t, makes its prediction f_t(x_t), receives the correct outcome y_t, and updates its hypothesis into f_{t+1} according to (5). We now wish to bound the cumulative risk\n\n
Σ_{t=1}^m R_stoch[f_t, t].\n\n
The motivation for such bounds is roughly as follows. Assume there is some fixed distribution P(x, y) from which the examples (x_t, y_t) are drawn, and define\n\n
g* = arg min_{g ∈ H} R_reg[g],  where  R_reg[g] = E_{(x,y)~P}[l(x, y, g(x))] + λΩ[g].\n\n
Then it would be desirable for the online hypothesis f_t to converge towards g*. If we can show that the cumulative risk grows like m R_reg[g*] + o(m), we see that at least in some sense f_t does converge to g*. Hence, as a first step in our convergence analysis, we obtain an upper bound for the cumulative risk. In all the bounds of this section we assume Ω[f] = (1/2)||f||².\n\n
Theorem 1 Let ((x_1, y_1), ..., (x_m, y_m)) be an example sequence such that ||k(x_t, ·)|| ≤ X for all t. Fix B > 0 and choose the learning rate η proportional to 1/√m. Then for any g ∈ H such that ||g|| ≤ B we have\n\n
Σ_{t=1}^m R_stoch[f_t, t] ≤ Σ_{t=1}^m R_stoch[g, t] + O(√m),   (22)\n\n
where the constants in η and in the O(√m) term depend only on B, X and λ.\n\n
Notice that the bound does not depend on any probabilistic assumptions. If the example sequence is such that some fixed predictor g has a small cumulative risk, then the cumulative risk of the online algorithm will also be small. There is a slight catch here, in that the learning rate η must be chosen a priori, and the optimal setting depends on m. The longer the sequence of examples, the smaller a learning rate we want. We can avoid this by using a learning rate that starts from a fairly large value and decreases as learning progresses. This leads to a bound similar to Theorem 1 but with somewhat worse constant coefficients.\n\n
Theorem 2 Let ((x_t, y_t)) be an example sequence such that ||k(x_t, ·)|| ≤ X for all t. Fix B > 0 and use at update t the learning rate η_t proportional to 1/√t. Then for any g ∈ H such that ||g|| ≤ B we have\n\n
Σ_{t=1}^m R_stoch[f_t, t] ≤ Σ_{t=1}^m R_stoch[g, t] + O(√m)   for all m,   (23)\n\n
again with constants depending only on B, X and λ.\n\n
Let us now consider the implications of Theorem 2 in a situation in which we assume that the examples (x_t, y_t) are i.i.d. according to some fixed distribution P(x, y).\n\n
Theorem 3 Let P be a distribution over X × Y such that ||k(x, ·)|| ≤ X holds with probability 1 for (x, y) drawn from P, and let f_t be the t-th online hypothesis based on an example sequence ((x_1, y_1), ..., (x_m, y_m)) drawn i.i.d. according to P. Fix B > 0 and use at update t the learning rate η_t proportional to 1/√t. Then for any g ∈ H such that ||g|| ≤ B we have\n\n
E[ (1/m) Σ_{t=1}^m R_stoch[f_t, t] ] ≤ R_reg[g] + O(1/√m).   (24)\n\n
If we know in advance how many examples we are going to draw, we can use a fixed learning rate as in Theorem 1 and obtain somewhat better constants.\n\n
5 Experiments and Discussion\n\n
In our experiments we studied the performance of online ν-SVM algorithms in various settings. They always yielded competitive performance. Due to space constraints we only report the findings in novelty detection, as given in Figure 1 (where the training algorithm was fed the patterns sans class labels).\n\n
Already after one pass through the USPS database (5000 training patterns, 2000 test patterns, each of them of size 16 × 16 pixels), which took in MATLAB less than 15s on a 433MHz Celeron, the results can be used for weeding out badly written digits. The ν-setting was used (with ν = 0.01) to allow for a fixed fraction of detected “outliers.” Based on the theoretical analysis of Section 4 we used a decreasing learning rate proportional to 1/√t.\n\n
Conclusion. We have presented a range of simple online kernel-based algorithms for a variety of standard machine learning tasks. The algorithms have constant memory requirements and are computationally cheap at each update step. 
They allow the ready application of powerful kernel-based methods such as novelty detection to online and time-varying problems.\n\n
Figure 1: Online novelty detection on the USPS dataset with ν = 0.01. Results after one pass through the USPS database, using Gaussian RBF kernels and a learning rate decreasing with the number of iterations. Top: the first 50 patterns which incurred a margin error; bottom left: the 50 worst patterns according to f(x) − ρ on the training set; bottom right: the 50 worst patterns on an unseen test set.\n\n
Acknowledgments. A.S. was supported by the DFG under grant Sm 62/1-1; J.K. and R.C.W. were supported by the ARC. The authors thank Paul Wankadia for help with the implementation.\n\n
References\n\n
[1] K. P. Bennett and O. L. Mangasarian. Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1:23–34, 1992.\n\n
[2] G. Cauwenberghs and T. Poggio. Incremental and decremental support vector machine learning. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 409–415. MIT Press, 2001.\n\n
[3] L. Csató and M. Opper. Sparse representation for Gaussian process models. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 444–450. MIT Press, 2001.\n\n
[4] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Technical report, Stanford University, Dept. of Statistics, 1998.\n\n
[5] C. Gentile. A new approximate maximal margin classification algorithm. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 500–506. MIT Press, 2001.\n\n
[6] T. Graepel, R. Herbrich, and R. C. Williamson. From margin to sparsity. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 210–216. MIT Press, 2001.\n\n
[7] Y. Guo, P. Bartlett, A. Smola, and R. C. Williamson. Norm-based regularization of boosting. Submitted to Journal of Machine Learning Research, 2001.\n\n
[8] M. Herbster. Learning additive models online with fast evaluating kernels. In Proc. 14th Annual Conference on Computational Learning Theory (COLT), pages 444–460. Springer, 2001.\n\n
[9] P. J. Huber. Robust statistics: a review. Annals of Statistics, 43:1041, 1972.\n\n
[10] G. S. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Math. Anal. Applic., 33:82–95, 1971.\n\n
[11] J. Kivinen, A. J. Smola, and R. C. Williamson. Large margin classification for moving targets. Unpublished manuscript, 2001.\n\n
[12] Y. Li and P. M. Long. The relaxed online maximum margin algorithm. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 498–504. MIT Press, 1999.\n\n
[13] L. Mason, J. Baxter, P. L. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 221–246. MIT Press, Cambridge, MA, 2000.\n\n
[14] B. Schölkopf, J. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7), 2001.\n\n
[15] B. Schölkopf, A. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12(5):1207–1245, 2000.\n\n
[16] V. Vapnik, S. Golowich, and A. Smola. Support vector method for function approximation, regression estimation, and signal processing. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 281–287. MIT Press, Cambridge, MA, 1997.\n", "award": [], "sourceid": 2128, "authors": [{"given_name": "Jyrki", "family_name": "Kivinen", "institution": null}, {"given_name": "Alex", "family_name": "Smola", "institution": null}, {"given_name": "Robert", "family_name": "Williamson", "institution": null}]}