{"title": "Worst-case Loss Bounds for Single Neurons", "book": "Advances in Neural Information Processing Systems", "page_first": 309, "page_last": 315, "abstract": null, "full_text": "Worst-case Loss Bounds \n\nfor Single Neurons \n\nDavid P. Helmbold \n\nDepartment of Computer Science \n\nUniversity of California, Santa Cruz \n\nSanta Cruz, CA 95064 \n\nUSA \n\nJyrki Kivinen \n\nDepartment of Computer Science \nP.O. Box 26 (Teollisuuskatu 23) \nFIN-00014 University of Helsinki \n\nFinland \n\nManfred K. Warmuth \n\nDepartment of Computer Science \n\nUniversity of California, Santa Cruz \n\nSanta Cruz, CA 95064 \n\nUSA \n\nAbstract \n\nWe analyze and compare the well-known Gradient Descent algo(cid:173)\nrithm and a new algorithm, called the Exponentiated Gradient \nalgorithm, for training a single neuron with an arbitrary transfer \nfunction . Both algorithms are easily generalized to larger neural \nnetworks, and the generalization of Gradient Descent is the stan(cid:173)\ndard back-propagation algorithm. In this paper we prove worst(cid:173)\ncase loss bounds for both algorithms in the single neuron case. \nSince local minima make it difficult to prove worst-case bounds \nfor gradient-based algorithms, we must use a loss function that \nprevents the formation of spurious local minima. We define such \na matching loss function for any strictly increasing differentiable \ntransfer function and prove worst-case loss bound for any such \ntransfer function and its corresponding matching loss. For exam(cid:173)\nple, the matching loss for the identity function is the square loss \nand the matching loss for the logistic sigmoid is the entropic loss. \nThe different structure of the bounds for the two algorithms indi(cid:173)\ncates that the new algorithm out-performs Gradient Descent when \nthe inputs contain a large number of irrelevant components. \n\n\f310 \n\nD. P. HELMBOLD, J. KIVINEN, M. K. 
WARMUTH \n\n1 \n\nINTRODUCTION \n\nThe basic element of a neural network, a neuron, takes in a number of real-valued \ninput variables and produces a real-valued output. The input-output mapping of \na neuron is defined by a weight vector W E RN, where N is the number of input \nvariables, and a transfer function \u00a2. When presented with input given by a vector \nx E RN, the neuron produces the output y = \u00a2(w . x). Thus, the weight vector \nregulates the influence of each input variable on the output, and the transfer function \ncan produce nonlinearities into the input-output mapping. In particular, when the \ntransfer function is the commonly used logistic function, \u00a2(p) = 1/(1 + e- P ), the \noutputs are bounded between 0 and 1. On the other hand, if the outputs should \nbe unbounded, it is often convenient to use the identity function as the transfer \nfunction, in which case the neuron simply computes a linear mapping. In this \npaper we consider a large class of transfer functions that includes both the logistic \nfunction and the identity function, but not discontinuous (e.g. step) functions. \nThe goal of learning is to come up with a weight vector w that produces a \ndesirable input-output mapping. This is achieved by considering a sequence \nS = ((X1,yt}, ... ,(Xl,Yl\u00bb of examples, where for t = 1, ... ,i the value Yt E R \nis the desired output for the input vector Xt, possibly distorted by noise or other \nerrors. We call Xt the tth instance and Yt the tth outcome. In what is often called \nbatch learning, alli examples are given at once and are available during the whole \ntraining session. 
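The input-output mapping of a single neuron described above is a one-liner in code; a minimal sketch (the function names and example values are ours, chosen for illustration):

```python
import math

def neuron_output(w, x, phi):
    """A single neuron: apply the transfer function phi to the dot product w . x."""
    return phi(sum(wi * xi for wi, xi in zip(w, x)))

def logistic(p):
    """Logistic transfer function; outputs are bounded between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-p))

def identity(p):
    """Identity transfer function; the neuron computes a linear mapping."""
    return p

w = [0.5, -1.0, 2.0]
x = [1.0, 1.0, 1.0]
linear_out = neuron_output(w, x, identity)    # w . x = 1.5, unbounded
sigmoid_out = neuron_output(w, x, logistic)   # same sum squashed into (0, 1)
```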
As noise and other problems often make it impossible to find a weight vector w that would satisfy φ(w · x_t) = y_t for all t, one instead introduces a loss function L, such as the square loss given by L(y, ŷ) = (y − ŷ)²/2, and finds a weight vector w that minimizes the empirical loss (or training error)

Loss(w, S) = Σ_{t=1}^{ℓ} L(y_t, φ(w · x_t)) .   (1)

With the square loss and identity transfer function φ(p) = p, this is the well-known linear regression problem. When φ is the logistic function and L is the entropic loss given by L(y, ŷ) = y ln(y/ŷ) + (1 − y) ln((1 − y)/(1 − ŷ)), this can be seen as a special case of logistic regression. (With the entropic loss, we assume 0 ≤ y_t, ŷ_t ≤ 1 for all t, and use the convention 0 ln 0 = 0 ln(0/0) = 0.)

In this paper we use an on-line prediction (or life-long learning) approach to the learning problem. It is well known that on-line performance is closely related to batch learning performance (Littlestone, 1989; Kivinen and Warmuth, 1994). Instead of receiving all the examples at once, the training algorithm begins with some fixed start vector w_1, and produces a sequence w_1, ..., w_{ℓ+1} of weight vectors. The new weight vector w_{t+1} is obtained by applying a simple update rule to the previous weight vector w_t and the single example (x_t, y_t). In the on-line prediction model, the algorithm uses its tth weight vector, or hypothesis, to make the prediction ŷ_t = φ(w_t · x_t). The training algorithm is then charged a loss L(y_t, ŷ_t) for this tth trial. The performance of a training algorithm A that produces the weight vectors w_t on an example sequence S is measured by its total (cumulative) loss

Loss(A, S) = Σ_{t=1}^{ℓ} L(y_t, φ(w_t · x_t)) .   (2)

Our main results are bounds on the cumulative losses for two on-line prediction algorithms. One of these is the standard Gradient Descent (GD) algorithm.
The other one, which we call EG±, is also based on the gradient but uses it in a different manner than GD. The bounds are derived in a worst-case setting: we make no assumptions about how the instances are distributed or the relationship between each instance x_t and its corresponding outcome y_t. Obviously, some assumptions are needed in order to obtain meaningful bounds. The approach we take is to compare the total losses, Loss(GD, S) and Loss(EG±, S), to the least achievable empirical loss, inf_w Loss(w, S). If the least achievable empirical loss is high, the dependence between the instances and outcomes in S cannot be tracked by any neuron using the transfer function, so it is reasonable that the losses of the algorithms are also high. More interestingly, if some weight vector achieves a low empirical loss, we also require that the losses of the algorithms are low. Hence, although the algorithms always predict based on an initial segment of the example sequence, they must perform almost as well as the best fixed weight vector for the whole sequence.

The choice of loss function is crucial for the results that we prove. In particular, since we are using gradient-based algorithms, the empirical loss should not have spurious local minima. This can be achieved for any differentiable increasing transfer function φ by using the loss function L_φ defined by

L_φ(y, ŷ) = ∫_{φ⁻¹(y)}^{φ⁻¹(ŷ)} (φ(z) − y) dz .   (3)

For y < ŷ the value L_φ(y, ŷ) is the area in the z × φ(z) plane below the function φ(z), above the line φ(z) = y, and to the left of the line z = φ⁻¹(ŷ). We call L_φ the matching loss function for transfer function φ, and will show that for any example sequence S, if L = L_φ then the mapping from w to Loss(w, S) is convex.
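Definition (3) can be checked numerically: integrating φ(z) − y between φ⁻¹(y) and φ⁻¹(ŷ) should reproduce the corresponding closed-form losses. A sketch using simple trapezoidal integration (the helper names are ours):

```python
import math

def matching_loss(phi, phi_inv, y, y_hat, n=20000):
    """Numerically integrate (phi(z) - y) dz from phi_inv(y) to phi_inv(y_hat),
    i.e. the matching loss of definition (3), by the trapezoid rule."""
    a, b = phi_inv(y), phi_inv(y_hat)
    h = (b - a) / n
    total = 0.5 * ((phi(a) - y) + (phi(b) - y))   # endpoint terms
    for i in range(1, n):
        total += phi(a + i * h) - y
    return total * h

identity = lambda p: p
logistic = lambda p: 1.0 / (1.0 + math.exp(-p))
logit = lambda v: math.log(v / (1.0 - v))        # inverse of the logistic

y, y_hat = 0.2, 0.7
square_loss = (y - y_hat) ** 2 / 2
entropic_loss = (y * math.log(y / y_hat)
                 + (1 - y) * math.log((1 - y) / (1 - y_hat)))
```

With the identity transfer function the integral reduces to the square loss, and with the logistic it reduces to the entropic loss, matching the closed forms above to numerical precision.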
For example, if the transfer function is the logistic function, the matching loss function is the entropic loss, and if the transfer function is the identity function, the matching loss function is the square loss. Note that using the logistic activation function with the square loss can lead to a very large number of local minima (Auer et al., 1996). Even in the batch setting there are reasons to use the entropic loss with the logistic transfer function (see, for example, Solla et al., 1988).

How much our bounds on the losses of the two algorithms exceed the least empirical loss depends on the maximum slope of the transfer function we use. More importantly, they depend on various norms of the instances and the vector w for which the least empirical loss is achieved. As one might expect, neither of the algorithms is uniformly better than the other. Interestingly, the new EG± algorithm is better when most of the input variables are irrelevant, i.e., when some weight vector w with w_i = 0 for most indices i has a low empirical loss. On the other hand, the GD algorithm is better when the weight vectors with low empirical loss have many nonzero components, but the instances contain many zero components.

The bounds we derive concern only single neurons, and one often combines a number of neurons into a multilayer feedforward neural network. In particular, applying the Gradient Descent algorithm in the multilayer setting gives the famous back propagation algorithm. Also the EG± algorithm, being gradient-based, can easily be generalized for multilayer feedforward networks. Although it seems unlikely that our loss bounds will generalize to multilayer networks, we believe that the intuition gained from the single neuron case will provide useful insight into the relative performance of the two algorithms in the multilayer case.
Furthermore, the EG± algorithm is less sensitive to large numbers of irrelevant attributes. Thus it might be possible to avoid multilayer networks by introducing many new inputs, each of which is a non-linear function of the original inputs. Multilayer networks remain an interesting area for future study.

Our work follows the path opened by Littlestone (1988) with his work on learning thresholded neurons with sparse weight vectors. More immediately, this paper is preceded by results on linear neurons using the identity transfer function (Cesa-Bianchi et al., 1996; Kivinen and Warmuth, 1994).

2 THE ALGORITHMS

This section describes how the Gradient Descent training algorithm and the new Exponentiated Gradient training algorithm update the neuron's weight vector. For the remainder of this paper, we assume that the transfer function φ is strictly increasing and differentiable. The EG algorithm maintains a weight vector whose components satisfy w_{t,i} > 0 and Σ_i w_{t,i} = 1. In general, of course, we do not expect that such constraints are useful. Hence, we introduce a modified algorithm EG± by employing a linear transformation of the inputs. In addition to the learning rate η, the EG± algorithm has a scaling factor U > 0 as a parameter. We define the behavior of EG± on a sequence of examples S = ((x_1, y_1), ..., (x_ℓ, y_ℓ)) in terms of the EG algorithm's behavior on a transformed example sequence S' = ((x'_1, y_1), ..., (x'_ℓ, y_ℓ)) where x'_t = (U x_{t,1}, ..., U x_{t,N}, −U x_{t,1}, ..., −U x_{t,N}). The EG algorithm uses the uniform start vector (1/(2N), ..., 1/(2N)) and the learning rate supplied by the EG± algorithm.
At each time t the N-dimensional weight vector w_t of EG± is defined in terms of the 2N-dimensional weight vector w'_t of EG as

w_{t,i} = U(w'_{t,i} − w'_{t,N+i}) .

Thus EG± with scaling factor U can learn any weight vector w ∈ R^N with ||w||_1 ≤ U by having the embedded EG algorithm learn the appropriate 2N-dimensional (nonnegative and normalized) weight vector w'.

3 MAIN RESULTS

The loss bounds for the GD and EG± algorithms can be written in similar forms that emphasize how different algorithms work well for different problems. When L = L_φ, we write Loss_φ(w, S) and Loss_φ(A, S) for the empirical loss of a weight vector w and the total loss of an algorithm A, as defined in (1) and (2). We give the upper bounds in terms of various norms. For x ∈ R^N, the 2-norm ||x||_2 is the Euclidean length of the vector x, the 1-norm ||x||_1 the sum of the absolute values of the components of x, and the ∞-norm ||x||_∞ the maximum absolute value of any component of x. For the purposes of setting the learning rates, we assume that before training begins the algorithm gets an upper bound for the norms of instances. The GD algorithm gets a parameter X_2 and EG± a parameter X_∞ such that ||x_t||_2 ≤ X_2 and ||x_t||_∞ ≤ X_∞ hold for all t. Finally, let Z denote an upper bound on φ′(p). We can take Z = 1 when φ is the identity function and Z = 1/4 when φ is the logistic function.

Our first upper bound is for GD. For any sequence of examples S and any weight vector u ∈ R^N, when the learning rate is η = 1/(2X_2²Z) we have

Loss_φ(GD, S) ≤ 2 Loss_φ(u, S) + 2(||u||_2 X_2)² Z .

Our upper bounds on the EG± algorithm require that we restrict the 1-norm of the comparison class: the set of weight vectors competed against. The comparison class contains all weight vectors u such that ||u||_1 is at most the scaling factor U.
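The EG± reduction can be sketched end to end. The multiplicative update used for the embedded EG algorithm follows Kivinen and Warmuth (1994); with the matching loss, the gradient of the loss with respect to the weights is simply (ŷ_t − y_t)x'_t. The toy data and names below are ours:

```python
import math

def eg_pm(examples, phi, U, eta, N):
    """Train via the EG± reduction: run EG on transformed instances
    (U*x, -U*x), then recover w_i = U * (w'_i - w'_{N+i}).

    The multiplicative EG update follows Kivinen and Warmuth (1994); with the
    matching loss the gradient with respect to w' is (y_hat - y) * x'."""
    wp = [1.0 / (2 * N)] * (2 * N)                 # uniform start vector of EG
    for x, y in examples:
        xp = [U * xi for xi in x] + [-U * xi for xi in x]
        y_hat = phi(sum(wi * xi for wi, xi in zip(wp, xp)))
        wp = [wi * math.exp(-eta * (y_hat - y) * xi)
              for wi, xi in zip(wp, xp)]
        s = sum(wp)                                # renormalize so components
        wp = [wi / s for wi in wp]                 # stay positive and sum to one
    return [U * (wp[i] - wp[N + i]) for i in range(N)]

identity = lambda p: p
# toy data whose outcome depends only on the first of three inputs: y = x[0]
examples = [([1.0, 0.3, -0.2], 1.0), ([0.0, 0.9, 0.4], 0.0),
            ([1.0, -0.5, 0.1], 1.0), ([0.0, 0.2, -0.7], 0.0)] * 50
w = eg_pm(examples, identity, U=2.0, eta=0.05, N=3)
```

By construction every weight vector the reduction can output satisfies ||w||_1 ≤ U, since w' stays on the probability simplex throughout training.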
For any scaling factor U, any sequence of examples S, and any weight vector u ∈ R^N with ||u||_1 ≤ U, we have

Loss_φ(EG±, S) ≤ (4/3) Loss_φ(u, S) + (16/3)(U X_∞)² Z ln(2N)

when the learning rate is η = 1/(4(U X_∞)² Z).

Note that these bounds depend on both the unknown weight vector u and some norms of the input vectors. If the algorithms have some further prior information on the sequence S they can make a more informed choice of η. This leads to bounds with a constant of 1 before the Loss_φ(u, S) term at the cost of an additional square-root term (for details see the full paper, Helmbold et al., 1996).

It is important to realize that we bound the total loss of the algorithms over any adversarially chosen sequence of examples where the input vectors satisfy the norm bound. Although we state the bounds in terms of loss on the data, they imply that the algorithms must also perform well on new unseen examples, since the bounds still hold when an adversary adds these additional examples to the end of the sequence. A formal treatment of this appears in several places (Littlestone, 1989; Kivinen and Warmuth, 1994). Furthermore, in contrast to standard convergence proofs (e.g. Luenberger, 1984), we bound the loss on the entire sequence of examples instead of studying the convergence behavior of the algorithm when it is arbitrarily close to the best weight vector.

Comparing these loss bounds we see that the bound for the EG± algorithm grows with the maximum component of the input vectors and the one-norm of the best weight vector from the comparison class. On the other hand, the loss bound for the GD algorithm grows with the two-norm (Euclidean length) of both vectors.
Thus when the best weight vector is sparse, having few significant components, and the input vectors are dense, with several similarly-sized components, the bound for the EG± algorithm is better than the bound for the GD algorithm. More formally, consider the noise-free situation where Loss_φ(u, S) = 0 for some u. Assume x_t ∈ {−1, 1}^N and u ∈ {−1, 0, 1}^N with only k nonzero components in u. We can then take X_2 = √N, X_∞ = 1, ||u||_2 = √k, and U = k. The loss bounds become (16/3)k² Z ln(2N) for EG± and 2kZN for GD, so for N ≫ k the EG± algorithm clearly wins this comparison. On the other hand, the GD algorithm has the advantage over the EG± algorithm when each input vector is sparse and the best weight vector is dense, having its weight distributed evenly over its components. For example, if the inputs x_t are the rows of an N × N unit matrix and u ∈ {−1, 1}^N, then X_2 = X_∞ = 1, ||u||_2 = √N, and U = N. Thus the upper bounds become (16/3)N² Z ln(2N) for EG± and 2NZ for GD, so here GD wins the comparison.

Of course, a comparison of the upper bounds is meaningless unless the bounds are known to be reasonably tight. Our experiments with artificial random data suggest that the upper bounds are not tight. However, the experimental evidence also indicates that EG± is much better than GD when the best weight vector is sparse. Thus the upper bounds do predict the relative behaviors of the algorithms.

The bounds we give in this paper are very similar to the bounds Kivinen and Warmuth (1994) obtained for the comparison class of linear functions and the square loss. They observed how the relative performances of the GD and EG± algorithms relate to the norms of the input vectors and the best weight vector in the linear case.

Our methods are direct generalizations of those applied for the linear case (Kivinen and Warmuth, 1994).
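The sparse-versus-dense arithmetic above is straightforward to reproduce; a sketch (taking Z = 1, i.e. the identity transfer function, and zero best loss; the function names are ours):

```python
import math

def eg_pm_bound(U, X_inf, Z, N, best_loss=0.0):
    """EG± bound: (4/3)*Loss(u,S) + (16/3)*(U*X_inf)^2 * Z * ln(2N)."""
    return (4 / 3) * best_loss + (16 / 3) * (U * X_inf) ** 2 * Z * math.log(2 * N)

def gd_bound(u_norm2, X_2, Z, best_loss=0.0):
    """GD bound: 2*Loss(u,S) + 2*(||u||_2 * X_2)^2 * Z."""
    return 2 * best_loss + 2 * (u_norm2 * X_2) ** 2 * Z

N, k, Z = 10000, 5, 1.0
# sparse u, dense +/-1 instances: X_2 = sqrt(N), X_inf = 1, ||u||_2 = sqrt(k), U = k
sparse_eg = eg_pm_bound(U=k, X_inf=1.0, Z=Z, N=N)                  # (16/3) k^2 ln(2N)
sparse_gd = gd_bound(u_norm2=math.sqrt(k), X_2=math.sqrt(N), Z=Z)  # 2kNZ
# dense u, rows of the N x N unit matrix: X_2 = X_inf = 1, ||u||_2 = sqrt(N), U = N
dense_eg = eg_pm_bound(U=N, X_inf=1.0, Z=Z, N=N)                   # (16/3) N^2 ln(2N)
dense_gd = gd_bound(u_norm2=math.sqrt(N), X_2=1.0, Z=Z)            # 2NZ
```

With N = 10000 and k = 5 the EG± bound wins the first comparison by roughly two orders of magnitude, and GD wins the second by far more.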
The key notion here is a distance function d for measuring the distance d(u, w) between two weight vectors u and w. Our main distance measures are the squared Euclidean distance (1/2)||u − w||_2² and the relative entropy distance (or Kullback-Leibler divergence) Σ_{i=1}^{N} u_i ln(u_i/w_i). The analysis exploits an invariant over t and u of the form

a L_φ(y_t, φ(w_t · x_t)) − b L_φ(y_t, φ(u · x_t)) ≤ d(u, w_t) − d(u, w_{t+1}) ,

where a and b are suitably chosen constants. This invariant implies that at each trial, if the loss of the algorithm is much larger than that of an arbitrary vector u, then the algorithm updates its weight vector so that it gets closer to u. By summing the invariant over all trials we can bound the total loss of the algorithms in terms of Loss_φ(u, S) and d(u, w_1). Full details will be contained in a technical report (Helmbold et al., 1996).

4 OPEN PROBLEMS

Although the presence of local minima in multilayer networks makes it difficult to obtain worst-case bounds for gradient-based algorithms, it may be possible to analyze slightly more complicated settings than just a single neuron. One likely candidate is to generalize the analysis to logistic regression with more than two classes. In this case each class would be represented by one neuron.

As noted above, the matching loss for the logistic transfer function is the entropic loss, so this pair does not create local minima. No bounded transfer function matches the square loss in this sense (Auer et al., 1996), and thus it seems impossible to get the same kind of strong loss bounds for a bounded transfer function and the square loss as we have for any (increasing and differentiable) transfer function and its matching loss function.

As the bounds for EG± depend only logarithmically on the input dimension, the following approach may be feasible.
Instead of using a multilayer net, use a single (linear or sigmoided) neuron on top of a large set of basis functions. The logarithmic growth of the loss bounds in the number of such basis functions means that large numbers of basis functions can be tried.

Note that the bounds of this paper are only worst-case bounds, and our experiments on artificial data indicate that the bounds may not be tight when the input values and best weights are large. However, we feel that the bounds do indicate the relative merits of the algorithms in different situations. Further research needs to be done to tighten the bounds. Nevertheless, this paper gives the first worst-case upper bounds for neurons with nonlinear transfer functions.

References

P. Auer, M. Herbster, and M. K. Warmuth (1996). Exponentially many local minima for single neurons. In Advances in Neural Information Processing Systems 8.

N. Cesa-Bianchi, P. Long, and M. K. Warmuth (1996). Worst-case quadratic loss bounds for on-line prediction of linear functions by gradient descent. IEEE Transactions on Neural Networks. To appear. An extended abstract appeared in COLT '93, pp. 429-438.

D. P. Helmbold, J. Kivinen, and M. K. Warmuth (1996). Worst-case loss bounds for single neurons. Technical Report UCSC-CRL-96-2, Univ. of Calif. Computer Research Lab, Santa Cruz, CA. In preparation.

J. Kivinen and M. K. Warmuth (1994). Exponentiated gradient versus gradient descent for linear predictors. Technical Report UCSC-CRL-94-16, Univ. of Calif. Computer Research Lab, Santa Cruz, CA. An extended abstract appeared in STOC '95, pp. 209-218.

N. Littlestone (1988). Learning when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285-318.

N. Littlestone (1989). From on-line to batch learning. In Proc. 2nd Annual Workshop on Computational Learning Theory, pages 269-284.
Morgan Kaufmann, San Mateo, CA.

D. G. Luenberger (1984). Linear and Nonlinear Programming. Addison-Wesley, Reading, MA.

S. A. Solla, E. Levin, and M. Fleisher (1988). Accelerated learning in layered neural networks. Complex Systems, 2:625-639.
", "award": [], "sourceid": 1095, "authors": [{"given_name": "David", "family_name": "Helmbold", "institution": null}, {"given_name": "Jyrki", "family_name": "Kivinen", "institution": null}, {"given_name": "Manfred K.", "family_name": "Warmuth", "institution": null}]}