Large Scale Online Learning

Léon Bottou, NEC Labs America, Princeton NJ 08540, leon@bottou.org
Yann Le Cun, NEC Labs America, Princeton NJ 08540, yann@lecun.com

Advances in Neural Information Processing Systems, pages 217-224.

Abstract

We consider situations where training data is abundant and computing resources are comparatively scarce. We argue that suitably designed online learning algorithms asymptotically outperform any batch learning algorithm. Both theoretical and experimental evidence is presented.

1 Introduction

The last decade brought us tremendous improvements in the performance and price of mass storage devices and network systems. Storing and shipping audio or video data is now inexpensive. Network traffic itself provides new and abundant sources of data in the form of server log files. The availability of such large data sources provides clear opportunities for the machine learning community.

These technological improvements have outpaced the exponential evolution of the computing power of integrated circuits (Moore's law). This remark suggests that learning algorithms must process increasing amounts of data using comparatively smaller computing resources.

This work assumes that datasets have grown to practically infinite sizes and discusses which learning algorithms asymptotically provide the best generalization performance using limited computing resources.

- Online algorithms operate by repetitively drawing a fresh random example and adjusting the parameters on the basis of this single example only. Online algorithms can quickly process a large number of examples.
On the other hand, they usually are not able to fully optimize the cost function defined on these examples.

- Batch algorithms avoid this issue by completely optimizing the cost function defined on a set of training examples. On the other hand, such algorithms cannot process as many examples because they must iterate several times over the training set to achieve the optimum.

As datasets grow to practically infinite sizes, we argue that online algorithms outperform learning algorithms that operate by repetitively sweeping over a training set.

2 Gradient Based Learning

Many learning algorithms optimize an empirical cost function $C_n(\theta)$ that can be expressed as the average of a large number of terms $L(z, \theta)$. Each term measures the cost associated with running a model with parameter vector $\theta$ on independent examples $z_i$ (typically input/output pairs $z_i = (x_i, y_i)$).

$$ C_n(\theta) \triangleq \frac{1}{n} \sum_{i=1}^{n} L(z_i, \theta) \qquad (1) $$

Two kinds of optimization procedures are often mentioned in connection with this problem:

- Batch gradient: Parameter updates are performed on the basis of the gradient and Hessian information accumulated over a predefined training set:

$$ \theta(k) = \theta(k-1) - \Phi_k \frac{\partial C_n}{\partial\theta}(\theta(k-1)) = \theta(k-1) - \frac{1}{n} \Phi_k \sum_{i=1}^{n} \frac{\partial L}{\partial\theta}(z_i, \theta(k-1)) \qquad (2) $$

where $\Phi_k$ is an appropriately chosen positive definite symmetric matrix.

- Online gradient: Parameter updates are performed on the basis of a single sample $z_t$ picked randomly at each iteration:

$$ \theta(t) = \theta(t-1) - \frac{1}{t} \Phi_t \frac{\partial L}{\partial\theta}(z_t, \theta(t-1)) \qquad (3) $$

where $\Phi_t$ is again an appropriately chosen positive definite symmetric matrix. Very often the examples $z_t$ are chosen by cycling over a randomly permuted training set. Each cycle is called an epoch. This paper however considers situations where the supply of training samples is practically unlimited. Each iteration of the online algorithm utilizes a fresh sample, unlikely to have been presented to the system before.

Simple batch algorithms converge linearly [1] to the optimum $\theta^*_n$ of the empirical cost. Careful choices of $\Phi_k$ make the convergence super-linear or even quadratic [2] in favorable cases (Dennis and Schnabel, 1983).

Whereas online algorithms may converge to the general area of the optimum at least as fast as batch algorithms (Le Cun et al., 1998), the optimization proceeds rather slowly during the final convergence phase (Bottou and Murata, 2002). The noisy gradient estimate causes the parameter vector to fluctuate around the optimum in a bowl whose size decreases like $1/t$ at best.

Online algorithms therefore seem hopelessly slow. However, the above discussion compares the speed of convergence toward the minimum of the empirical cost $C_n$, whereas one should be much more interested in the convergence toward the minimum $\theta^*$ of the expected cost $C_\infty$, which measures the generalization performance:

$$ C_\infty(\theta) \triangleq \int L(z, \theta)\, p(z)\, dz \qquad (4) $$

Density $p(z)$ represents the unknown distribution from which the examples are drawn (Vapnik, 1974). This is the fundamental difference between optimization speed and learning speed.

[1] Linear convergence speed: $\log 1/|\theta(k) - \theta^*_n|^2$ grows linearly with $k$.
[2] Quadratic convergence speed: $\log \log 1/|\theta(k) - \theta^*_n|^2$ grows linearly with $k$.

3 Learning Speed
Running an efficient batch algorithm on a training set of size $n$ quickly yields the empirical optimum $\theta^*_n$. The sequence of empirical optima $\theta^*_n$ usually converges to the solution $\theta^*$ when the training set size $n$ increases.

In contrast, online algorithms randomly draw one example $z_t$ at each iteration. When these examples are drawn from a set of $n$ examples, the online algorithm minimizes the empirical error $C_n$. When these examples are drawn from the asymptotic distribution $p(z)$, it minimizes the expected cost $C_\infty$. Because the supply of training samples is practically unlimited, each iteration of the online algorithm utilizes a fresh example. These fresh examples follow the asymptotic distribution. The parameter vectors $\theta(t)$ thus directly converge to the optimum $\theta^*$ of the expected cost $C_\infty$.

The convergence speeds of the batch $\theta^*_n$ and online $\theta(t)$ sequences were first compared by Murata and Amari (1999). This section reports a similar result whose derivation uncovers a deeper relationship between these two sequences.
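This direct convergence to the expected-cost minimizer can be illustrated with a minimal Python sketch that runs update (3) with a scalar scaling factor on an endless stream of fresh synthetic examples. The model, distribution, and constants below are illustrative assumptions, not those of the paper's experiments:

```python
import random

def dloss(z, theta):
    # Gradient of L(z, theta) = (y - theta*x)^2 / 2 at z = (x, y).
    x, y = z
    return -(y - theta * x) * x

random.seed(0)
theta = 0.0
for t in range(1, 100001):
    # Draw a fresh example from p(z): here y = 2x + noise, so the
    # expected cost is minimized at theta* = 2.
    x = random.gauss(0.0, 1.0)
    z = (x, 2.0 * x + random.gauss(0.0, 0.5))
    # Online update (3); capping the rate at 1/20 tames the first steps.
    theta -= (1.0 / max(20, t)) * dloss(z, theta)

print(abs(theta - 2.0) < 0.1)  # theta ends close to theta* = 2
```

Because every example is fresh, the iterate approaches the minimizer of the expected cost rather than the minimizer of any finite empirical cost.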
This approach also provides a mathematically rigorous treatment (Bottou and Le Cun, 2003).

Let us first define the Hessian matrix $H$ and Fisher information matrix $G$:

$$ H \triangleq E\!\left( \frac{\partial^2 L}{\partial\theta\,\partial\theta}(z, \theta^*) \right) \qquad G \triangleq E\!\left( \left[\frac{\partial L}{\partial\theta}(z, \theta^*)\right] \left[\frac{\partial L}{\partial\theta}(z, \theta^*)\right]^{T} \right) $$

Manipulating a Taylor expansion of the gradient of $C_n(\theta)$ in the vicinity of $\theta^*_{n-1}$ immediately provides the following recursive relation between $\theta^*_n$ and $\theta^*_{n-1}$:

$$ \theta^*_n = \theta^*_{n-1} - \frac{1}{n} \Psi_n \frac{\partial L}{\partial\theta}(z_n, \theta^*_{n-1}) + O\!\left(\frac{1}{n^2}\right) \qquad (5) $$

with

$$ \Psi_n \triangleq \left( \frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2 L}{\partial\theta\,\partial\theta}(z_i, \theta^*_{n-1}) \right)^{-1} \xrightarrow[\;n\to\infty\;]{} H^{-1} $$

Relation (5) describes the $\theta^*_n$ sequence as a recursive stochastic process that is essentially similar to the online learning algorithm (3). Each iteration of this "algorithm" consists in picking a fresh example $z_n$ and updating the parameters according to (5). This is not a practical algorithm because we have no analytical expression for the second order term. We can however apply the mathematics of online learning algorithms to this stochastic process.

The similarity between (5) and (3) suggests that both the batch and online sequences converge at the same speed for adequate choices of the scaling matrix $\Phi_t$.
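The relationship between the empirical optima and the online update can be checked exactly in a toy case. For the one-dimensional quadratic loss $L(z,\theta) = (\theta - z)^2/2$, the empirical optimum $\theta^*_n$ is the sample mean, $H = 1$, and recursion (5) holds with the $O(1/n^2)$ term identically zero. A short Python check (the data here are arbitrary):

```python
import random

random.seed(1)
z = [random.gauss(3.0, 1.0) for _ in range(1000)]

# Recursion (5): theta*_n = theta*_{n-1} - (1/n) H^{-1} dL/dtheta(z_n, .)
theta_rec = z[0]                       # theta*_1 minimizes C_1 exactly
for n in range(2, len(z) + 1):
    theta_rec -= (1.0 / n) * (theta_rec - z[n - 1])

theta_direct = sum(z) / len(z)         # direct minimizer of C_n
print(abs(theta_rec - theta_direct) < 1e-9)  # the two coincide
```

In this quadratic case the sequence of batch optima is literally an online algorithm with scaling matrix $H^{-1}$.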
Under customary regularity conditions, the following asymptotic speed result holds when the scaling matrix $\Phi_t$ converges to the inverse $H^{-1}$ of the Hessian matrix:

$$ E\!\left(|\theta(t) - \theta^*|^2\right) + o\!\left(\frac{1}{t}\right) \;=\; E\!\left(|\theta^*_t - \theta^*|^2\right) + o\!\left(\frac{1}{t}\right) \;=\; \frac{\mathrm{tr}\!\left(H^{-1} G\, H^{-1}\right)}{t} \qquad (6) $$

This convergence speed expression has been discovered many times. Tsypkin (1973) establishes (6) for linear systems. Murata and Amari (1999) address generic stochastic gradient algorithms with a constant scaling matrix. Our result (Bottou and Le Cun, 2003) holds when the scaling matrix $\Phi_t$ depends on the previously seen examples, and also holds when the stochastic update is perturbed by unspecified second order terms, as in equation (5). See the appendix for a proof sketch (Bottou and Le Cun, 2003).

Result (6) applies to both the online $\theta(t)$ and batch $\theta^*_t$ sequences. Not only does it establish that both sequences have $O(1/t)$ convergence, but also it provides the value of the constant. This constant is neither affected by the second order terms of (5) nor by the convergence speed of the scaling matrix $\Phi_t$ toward $H^{-1}$.

In the Maximum Likelihood case, it is well known that $H$ and $G$ are equal at the optimum. Equation (6) then indicates that the convergence speed saturates the Cramér-Rao bound. This fact was known in the case of the natural gradient algorithm (Amari, 1998). It remains true for a large class of online learning algorithms.

Result (6) suggests that the scaling matrix $\Phi_t$ should be a full rank approximation of the Hessian $H$. Maintaining such an approximation becomes expensive when the dimension of the parameter vector increases. The computational cost of each iteration can be drastically reduced by maintaining only a coarse approximation of the Hessian (e.g.
diagonal, block-diagonal, multiplicative, etc.). A proper setup ensures that the convergence speed remains $O(1/t)$ despite a less favorable constant factor.

The similar nature of the convergence of the batch and online sequences can be summarized as follows. Consider two optimally designed batch and online learning algorithms. The best generalization error is asymptotically achieved by the learning algorithm that uses the most examples within the allowed time.

4 Computational Cost

The discussion so far has established that a properly designed online learning algorithm performs as well as any batch learning algorithm for the same number of examples. We now establish that, given the same computing resources, an online learning algorithm can asymptotically process more examples than a batch learning algorithm.

Each iteration of a batch learning algorithm running on $N$ training examples requires a time $K_1 N + K_2$. Constants $K_1$ and $K_2$ respectively represent the time required to process each example, and the time required to update the parameters. Result (6) provides the following asymptotic equivalence:

$$ (\theta^*_N - \theta^*)^2 \sim \frac{1}{N} $$

The batch algorithm must perform enough iterations to approximate $\theta^*_N$ with at least the same accuracy ($\sim 1/N$). An efficient algorithm with quadratic convergence achieves this after a number of iterations asymptotically proportional to $\log\log N$.

Running an online learning algorithm requires a constant time $K_3$ per processed example. Let us call $T$ the number of examples processed by the online learning algorithm using the same computing resources as the batch algorithm. We then have:

$$ K_3\, T \sim (K_1 N + K_2) \log\log N \;\Longrightarrow\; T \sim N \log\log N $$

The parameter $\theta(T)$ of the online algorithm also converges according to (6).
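The accounting above can be made concrete with a small computation. With hypothetical constants $K_1$, $K_2$, $K_3$ (the values below are arbitrary, not measured), the number of examples $T$ that the online algorithm can process within the batch algorithm's time budget grows faster than $N$:

```python
import math

K1, K2, K3 = 1.0, 100.0, 1.0   # illustrative constants, not measured values

ratios = []
for N in (10**4, 10**6, 10**8):
    # Batch: log log N quadratic-convergence iterations, each costing K1*N + K2.
    batch_time = (K1 * N + K2) * math.log(math.log(N))
    T = batch_time / K3        # examples the online algorithm processes instead
    ratios.append(T / N)       # grows like log log N
    print(f"N={N:.0e}  T/N={T / N:.2f}")
```

The slow growth of $T/N$ reflects the $\log\log N$ number of batch iterations.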
Comparing the accuracies of both algorithms shows that the online algorithm asymptotically provides a better solution by a factor $O(\log\log N)$:

$$ (\theta(T) - \theta^*)^2 \;\sim\; \frac{1}{N \log\log N} \;\ll\; \frac{1}{N} \;\sim\; (\theta^*_N - \theta^*)^2 $$

This $\log\log N$ factor corresponds to the number of iterations required by the batch algorithm. This number increases slowly with the desired accuracy of the solution. In practice, this factor is much less significant than the actual value of the constants $K_1$, $K_2$ and $K_3$. Experience shows however that online algorithms are considerably easier to implement. Each iteration of the batch algorithm involves a large summation over all the available examples. Memory must be allocated to hold these examples. On the other hand, each iteration of the online algorithm only involves one random example which can then be discarded.

5 Experiments

A simple validation experiment was carried out using synthetic data. The examples are input/output pairs $(x, y)$ with $x \in \mathbb{R}^{20}$ and $y = \pm 1$. The model is a single sigmoid unit trained using the least-squares criterion

$$ L(x, y, \theta) = \left(1.5\,y - f(\theta x)\right)^2 $$

where $f(x) = 1.71 \tanh(0.66\,x)$ is the standard sigmoid discussed in Le Cun et al. (1998). The sigmoid generates various curvature conditions in the parameter space, including negative curvature and plateaus. This simple model represents well the final convergence phase of the learning process. Yet it is also very similar to the widely used generalized linear models (GLIM) (Chambers and Hastie, 1992).

The first component of the input $x$ is always 1 in order to compensate for the absence of a bias parameter in the model. The remaining 19 components are drawn from two Gaussian distributions, centered on $(-1, -1, \ldots, -1)$ for the first class and $(+1, +1, \ldots, +1)$ for the second class.
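The loss above and its gradient can be written down directly. The following Python sketch checks the analytic gradient against central finite differences; the test point is arbitrary and the dimension is reduced to 3 for brevity:

```python
import math

def f(u):                      # the sigmoid used in the experiments
    return 1.71 * math.tanh(0.66 * u)

def df(u):                     # its derivative
    return 1.71 * 0.66 * (1.0 - math.tanh(0.66 * u) ** 2)

def loss(x, y, theta):         # L(x, y, theta) = (1.5 y - f(theta . x))^2
    s = sum(t * xi for t, xi in zip(theta, x))
    return (1.5 * y - f(s)) ** 2

def grad(x, y, theta):         # dL/dtheta = -2 (1.5 y - f(s)) f'(s) x
    s = sum(t * xi for t, xi in zip(theta, x))
    g = -2.0 * (1.5 * y - f(s)) * df(s)
    return [g * xi for xi in x]

# Central finite-difference check at an arbitrary 3-dimensional point.
x, y, theta = [1.0, 0.3, -0.7], 1.0, [0.2, -0.1, 0.4]
eps, errors = 1e-6, []
for i in range(len(theta)):
    tp, tm = list(theta), list(theta)
    tp[i] += eps
    tm[i] -= eps
    num = (loss(x, y, tp) - loss(x, y, tm)) / (2 * eps)
    errors.append(abs(num - grad(x, y, theta)[i]))
print(max(errors) < 1e-5)
```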
The eigenvalues of the covariance matrix of each class range from 1 to 20. Two separate sets for training and testing were drawn with 1,000,000 examples each. One hundred permutations of the first set are generated. Each learning algorithm is trained using various numbers of examples taken sequentially from the beginning of the permuted sets. The resulting performance is then measured on the testing set and averaged over the one hundred permutations.

Batch-Newton algorithm

The reference batch algorithm uses the Newton-Raphson algorithm with Gauss-Newton approximation (Le Cun et al., 1998). Each iteration visits all the training examples and computes both the gradient $g$ and the Gauss-Newton approximation $H$ of the Hessian matrix:

$$ g = \sum_i \frac{\partial L}{\partial\theta}(x_i, y_i, \theta_{k-1}) \qquad H = \sum_i \left(f'(\theta_{k-1} x_i)\right)^2 x_i x_i^{T} $$

The parameters are then updated using Newton's formula:

$$ \theta_k = \theta_{k-1} - H^{-1} g $$

Iterations are repeated until the parameter vector moves by less than $0.01/N$ where $N$ is the number of training examples. This algorithm yields quadratic convergence speed.

Online-Kalman algorithm

The online algorithm performs a single sequential sweep over the training examples. The parameter vector is updated after processing each example $(x_t, y_t)$ as follows:

$$ \theta_t = \theta_{t-1} - \frac{1}{\tau} \Phi_t \frac{\partial L}{\partial\theta}(x_t, y_t, \theta_{t-1}) $$

The scalar $\tau = \max(20,\; t - 40)$ makes sure that the first few examples do not cause impractically large parameter updates.
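A minimal sketch of such a single sequential sweep, with a scalar $1/\tau$ rate standing in for the full scaling matrix $\Phi_t$ (which is specified next). The two-dimensional stream below is illustrative, not the paper's 20-dimensional setup:

```python
import math
import random

def f(u):
    return 1.71 * math.tanh(0.66 * u)

def df(u):
    return 1.71 * 0.66 * (1.0 - math.tanh(0.66 * u) ** 2)

random.seed(2)
theta = [0.0, 0.0]                    # bias weight and one feature weight
for t in range(1, 3001):
    # Fresh example: constant bias input, one class-dependent feature.
    y = random.choice((-1.0, 1.0))
    x = (1.0, y + random.gauss(0.0, 1.0))
    s = theta[0] * x[0] + theta[1] * x[1]
    g = -2.0 * (1.5 * y - f(s)) * df(s)       # dL/ds for L = (1.5y - f(s))^2
    tau = max(20, t - 40)                     # damps the first few updates
    theta = [th - (1.0 / tau) * g * xi for th, xi in zip(theta, x)]

print(theta[1] > 0.1)  # the feature weight picks up the class correlation
```

Each example is visited once and then discarded, which is what makes the sweep compatible with sequential disk access.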
The scaling matrix $\Phi_t$ is equal to the inverse of a leaky average of the per-example Gauss-Newton approximation of the Hessian:

$$ \Phi_t = \left( \left(1 - \frac{2}{\tau}\right) \Phi_{t-1}^{-1} + \frac{2}{\tau} \left(f'(\theta_{t-1} x_t)\right)^2 x_t x_t^{T} \right)^{-1} $$

The implementation avoids the matrix inversions by directly computing $\Phi_t$ from $\Phi_{t-1}$ using the matrix inversion lemma (see (Bottou, 1998) for instance):

$$ \left( \alpha A^{-1} + \beta u u^{T} \right)^{-1} = \frac{1}{\alpha} \left( A - \frac{(Au)(Au)^{T}}{\alpha/\beta + u^{T} A u} \right) $$

Figure 1: Average $(\theta - \theta^*)^2$ as a function of the number of examples. The gray line represents the theoretical prediction (6). Filled circles: batch. Hollow circles: online. The error bars indicate a 95% confidence interval.

Figure 2: Average $(\theta - \theta^*)^2$ as a function of the training time (milliseconds). Hollow circles: online. Filled circles: batch. The error bars indicate a 95% confidence interval.

The resulting algorithm slightly differs from the Adaptive Natural Gradient algorithm (Amari, Park, and Fukumizu, 1998). In particular, there is little need to adjust a learning rate parameter in the Gauss-Newton approach. The $1/t$ (or $1/\tau$) schedule is asymptotically optimal.

Results

The optimal parameter vector $\theta^*$ was first computed on the testing set using the batch-Newton approach. The matrices $H$ and $G$ were computed on the testing set as well in order to determine the constant in relation (6).

Figure 1 plots the average squared distance between the optimal parameter vector $\theta^*$ and the parameter vector $\theta$ achieved on training sets of various sizes.
The gray line represents the theoretical prediction. Both the batch points and the online points join the theoretical prediction when the training set size increases. Figure 2 shows the same data points as a function of the CPU time required to run the algorithm on a standard PC. The online algorithm gradually becomes more efficient when the training set size increases. This happens because the batch algorithm needs to perform additional iterations in order to maintain the same level of accuracy.

In practice, the test set mean squared error (MSE) is usually more relevant than the accuracy of the parameter vector. Figure 3 displays a logarithmic plot of the difference between the MSE and the best achievable MSE, that is to say the MSE achieved by parameter vector $\theta^*$. This difference can be approximated as $(\theta - \theta^*)^{T} H (\theta - \theta^*)$. Both algorithms yield virtually identical errors for the same training set size. This suggests that the small differences shown in figure 1 occur along the low curvature directions of the cost function. Figure 4 shows the MSE as a function of the CPU time. The online algorithm always provides higher accuracy in significantly less time.

As expected from the theoretical argument, the online algorithm asymptotically outperforms the super-linear Newton-Raphson algorithm [3]. More importantly, the online algorithm achieves this result by performing a single sweep over the training data.
This is a very significant advantage when the data does not fit in central memory and must be sequentially accessed from a disk based database.

[3] Generalized linear models are usually trained using the IRLS method (Chambers and Hastie, 1992), which is closely related to the Newton-Raphson algorithm and requires similar computational resources.

Figure 3: Average test MSE as a function of the number of examples. The vertical axis shows the logarithm of the difference between the error and the best error achievable on the testing set. Both curves are essentially superposed.

Figure 4: Average test MSE as a function of the training time (milliseconds). Hollow circles: online. Filled circles: batch. The gray line indicates the best mean squared error achievable on the test set.

6 Conclusion

Many popular algorithms do not scale well to large numbers of examples because they were designed with small data sets in mind. For instance, the training time for Support Vector Machines scales somewhere between $N^2$ and $N^3$, where $N$ is the number of examples. Our baseline super-linear batch algorithm learns in $N \log\log N$ time. We demonstrate that adequate online algorithms asymptotically achieve the same generalization performance in $N$ time after a single sweep on the training set.

The convergence of learning algorithms is usually described in terms of a search phase followed by a final convergence phase (Bottou and Murata, 2002). Solid empirical evidence (Le Cun et al., 1998) suggests that online algorithms outperform batch algorithms during the search phase.
The present work provides both theoretical and experimental evidence that an adequate online algorithm outperforms any batch algorithm during the final convergence phase as well.

Appendix: Sketch of the convergence speed proof [4]

[4] This section has been added for the final version.

Lemma — Let $(u_t)$ be a sequence of positive reals verifying the following recurrence:

$$ u_t = \left(1 - \frac{\alpha}{t} + o\!\left(\frac{1}{t}\right)\right) u_{t-1} + \frac{\beta}{t^2} + o\!\left(\frac{1}{t^2}\right) \qquad (7) $$

The lemma states that $t\, u_t \to \frac{\beta}{\alpha - 1}$ when $\alpha > 1$ and $\beta > 0$. The proof is delicate because the result holds regardless of the unspecified low order terms of the recurrence. However, it is easy to illustrate this convergence with simple numerical simulations.

Convergence speed — Consider the following recursive stochastic process:

$$ \theta(t) = \theta(t-1) - \frac{1}{t} \Phi_t \frac{\partial L}{\partial\theta}(z_t, \theta(t-1)) + O\!\left(\frac{1}{t^2}\right) \qquad (8) $$

Our discussion addresses the final convergence phase of this process. Therefore we assume that the parameters $\theta$ remain confined in a bounded domain $D$ where the cost function $C_\infty(\theta)$ is convex and has a single non degenerate minimum $\theta^* \in D$. We can assume $\theta^* = 0$ without loss of generality. We write $E_t(X)$ for the conditional expectation of $X$ given all that is known before time $t$, including the initial conditions $\theta_0$ and the selected examples $z_1, \ldots, z_{t-1}$. We initially assume also that $\Phi_t$ is a function of $z_1, \ldots, z_{t-1}$ only.

Using (8), we write $E_t(\theta_t \theta_t^{T})$ as a function of $\theta_{t-1}$. Then we simplify [5] and take the trace. Taking the unconditional expectation yields a recurrence similar to (7).
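As the lemma's statement notes, its claim is easy to illustrate numerically. The sketch below iterates the recurrence with the unspecified low order terms set to zero and checks that $t\, u_t$ approaches $\beta/(\alpha - 1)$; the choices of $\alpha$, $\beta$, and horizon are arbitrary:

```python
def scaled_limit(alpha, beta, horizon=200000):
    # Iterate u_t = (1 - alpha/t) u_{t-1} + beta/t^2, then return t * u_t.
    u = 1.0
    for t in range(2, horizon + 1):
        u = (1.0 - alpha / t) * u + beta / t ** 2
    return horizon * u

# The lemma predicts t * u_t -> beta / (alpha - 1) for alpha > 1, beta > 0.
print(abs(scaled_limit(2.0, 1.0) - 1.0) < 0.01)  # limit 1/(2-1) = 1
print(abs(scaled_limit(3.0, 4.0) - 2.0) < 0.01)  # limit 4/(3-1) = 2
```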
The trace computation gives

$$ E_t\!\left(|\theta_t|^2\right) = |\theta_{t-1}|^2 - \frac{2}{t} |\theta_{t-1}|^2 + o\!\left(\frac{|\theta_{t-1}|^2}{t}\right) + \frac{\mathrm{tr}\!\left(H^{-1} G H^{-1}\right)}{t^2} + o\!\left(\frac{1}{t^2}\right) $$

We then apply the lemma and conclude that $t\, E(|\theta_t|^2) \to \mathrm{tr}\!\left(H^{-1} G H^{-1}\right)$.

Remark 1 — The notation $o(X_t)$ is quite ambiguous when dealing with stochastic processes. There are many possible flavors of convergence, including uniform convergence, almost sure convergence, convergence in probability, etc. Furthermore, it is not true in general that $E(o(X_t)) = o(E(X_t))$. The complete proof precisely defines the meaning of these notations and carefully checks their properties.

Remark 2 — The proof sketch assumes that $\Phi_t$ is a function of $z_1, \ldots, z_{t-1}$ only. In (5), $\Psi_t$ also depends on $z_t$. The result still holds because the contribution of $z_t$ vanishes quickly when $t$ grows large.

Remark 3 — The same $1/t$ behavior holds when $\Phi_t \to \Phi^*$ and when $\Phi^*$ is greater than $\frac{1}{2} H^{-1}$ in the semi definite sense. The constant however is worse by a factor roughly equal to $\|H \Phi^*\|$.

Acknowledgments

The authors acknowledge extensive discussions with Yoshua Bengio, Sami Bengio, Ronan Collobert, Noboru Murata, Kenji Fukumizu, Susanna Still, and Barak Pearlmutter.

References

Amari, S. (1998). Natural Gradient Works Efficiently in Learning. Neural Computation, 10(2):251-276.

Bottou, L. (1998). Online Algorithms and Stochastic Approximations. In Saad, D., editor, Online Learning and Neural Networks, pages 9-42. Cambridge University Press, Cambridge, UK.

Bottou, L. and Murata, N. (2002). Stochastic Approximations and Efficient Learning. In Arbib, M. A., editor, The Handbook of Brain Theory and Neural Networks, Second edition. The MIT Press, Cambridge, MA.

Bottou, L. and Le Cun, Y. (2003).
Online Learning for Very Large Datasets. NEC Labs TR-2003-L039. To appear in Applied Stochastic Models in Business and Industry. Wiley.

Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Chapman & Hall, London.

Dennis, J. and Schnabel, R. B. (1983). Numerical Methods For Unconstrained Optimization and Nonlinear Equations. Prentice-Hall, Inc., Englewood Cliffs, New Jersey.

Amari, S., Park, H., and Fukumizu, K. (1998). Adaptive Method of Realizing Natural Gradient Learning for Multilayer Perceptrons. Neural Computation, 12(6):1399-1409.

Le Cun, Y., Bottou, L., Orr, G. B., and Müller, K.-R. (1998). Efficient Back-prop. In Neural Networks, Tricks of the Trade, Lecture Notes in Computer Science 1524. Springer Verlag.

Murata, N. and Amari, S. (1999). Statistical analysis of learning dynamics. Signal Processing, 74(1):3-28.

Vapnik, V. N. and Chervonenkis, A. (1974). Theory of Pattern Recognition (in Russian). Nauka.

Tsypkin, Ya. (1973). Foundations of the theory of learning systems. Academic Press.

[5] Recall $E_t\!\left(\Phi_t \frac{\partial L}{\partial\theta}(z_t, \theta)\right) = \Phi_t \frac{\partial C}{\partial\theta}(\theta) = \Phi_t H \theta + o(|\theta|) = \theta + o(|\theta|)$.