{"title": "Exponentially many local minima for single neurons", "book": "Advances in Neural Information Processing Systems", "page_first": 316, "page_last": 322, "abstract": null, "full_text": "Exponentially many local minima for single \n\nneurons \n\nPeter Auer \n\nMark Herbster \n\nManfred K. Warmuth \n\nDepartment of Computer Science \n\nSanta Cruz, California \n\n{pauer,mark,manfred} @cs.ucsc.edu \n\nAbstract \n\nWe show that for a single neuron with the logistic function as the transfer \nfunction the number of local minima of the error function based on the \nsquare loss can grow exponentially in the dimension. \n\n1 \n\nINTRODUCTION \n\nConsider a single artificial neuron with d inputs. The neuron has d weights w E Rd. The \noutput of the neuron for an input pattern x E Rd is y = \u00a2(x\u00b7 w), where \u00a2 : R -+ R \nis a transfer function. For a given sequence of training examples ((Xt, Yt))I(y, f)) = 1 (\u00a2(z) - y) dz . \n\n,p-l(y) \n\n-l(y) \n\nThe loss is the area depicted in Figure 2a. If \u00a2 is the identity function then L is the square \nloss likewise if \u00a2 is the logistic function then L is the entropic loss. For the matching loss \nthe gradient descent update for minimizing the error function for a sequence of examples \nis simply \n\nWnew := Wold -1] (f)\u00a2(Xt . Wold) - Yt)Xt) \n\n, \n\nt=1 \n\nwhere 1] is a positive learning rate. Also the second derivatives are easy to calculate for \nthis general setting: L4>(Y~:v~<:Wt;W)) = \u00a2'(Xt . W)Xt,iXt,j. Thus, if Ht(w) is the Hessian \nof L(Yt, \u00a2(Xt . w)) with respect to W then v T Ht(w)v = \u00a2'(Xt . w)(v . Xt)2. Thus \n\n\f318 \n\nP. AUER. M. HERBSTER, M. K. WARMUTH \n\n0.8 \n\nwO.I \n\n0.4 \n\n0.2 \n\n... \n\n- 2 \n\no ... \n(b) \n\n.-1 (9) = w \u00b7 x \n\n(a) \n\nFigure 2: \n\n(a) The Matching Loss Function L. \n(b) The Square Loss becomes Saturated, the Entropic Loss does not. \n\nH t is positive semi-definite for any increasing differentiable transfer function. 
Clearly \u2211_{t=1}^m H_t(w) is the Hessian of the error function E(w) for a sequence of m examples, and it is also positive semi-definite. It follows that for any differentiable increasing transfer function the error function with respect to the matching loss is always convex. \n\nWe show that in the case of one neuron the logistic function paired with the square loss can lead to exponentially many minima. It is open whether the number of local minima grows exponentially for some natural data. However, there is another problem with the pairing of the logistic function and the square loss that makes it hard to optimize the error function with gradient-based methods: the problem of flat regions. Consider one example (x, y) consisting of a pattern x (not equal to the all-zero vector) and the desired output y. Then the square loss (logistic(x \u00b7 w) - y)^2, for y \u2208 [0, 1] and w \u2208 R^d, turns flat as a function of w when \u0177 = logistic(x \u00b7 w) approaches zero or one (for an example see Figure 2b, where d = 1 and y = 0). It is easy to see that the same phenomenon occurs for all bounded transfer functions with a finite number of extrema and corresponding bounded loss functions. In other words, the composition L(y, \u03c6(x \u00b7 w)) of the square loss with any bounded transfer function \u03c6 which has a finite number of extrema turns flat as |x \u00b7 w| becomes large. Similarly, for multiple examples the error function E(w) as defined above becomes flat. In flat regions the gradients with respect to the weight vector w are small, and thus gradient-based updates may have a hard time moving the weight vector out of these regions. This phenomenon is easily observed in practice and is sometimes called \"saturation\" [Hay94]. In contrast, if the logistic function is paired with the entropic loss (see Figure 2b), then the error function turns flat only at the global minimum.
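The saturation effect is easy to check numerically. The sketch below is an illustrative calculation (the specific example and weight are our own choices): for a single example (x, y) = (1, 0) and a weight deep in the saturated region, the gradient of the square loss nearly vanishes even though the loss itself is close to its maximum, while the gradient of the matching (entropic) loss does not.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y, w = 1.0, 0.0, 10.0   # w deep in the saturated region
p = logistic(x * w)        # network output, close to 1

# d/dw of the square loss (logistic(x*w) - y)^2:
square_grad = 2.0 * (p - y) * p * (1.0 - p) * x

# Gradient of the matching (entropic) loss: (logistic(x*w) - y) * x
matching_grad = (p - y) * x

square_loss = (p - y) ** 2  # close to 1, yet its gradient is tiny
```

The square-loss gradient is on the order of 1e-4 here while the matching-loss gradient is essentially 1, which is the flat-region phenomenon described above.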
The same holds for any increasing differentiable transfer function and its matching loss function. \n\nA number of previous papers discussed conditions necessary and sufficient for multiple local minima of the error function of single neurons or otherwise small networks [WD88, SS89, BRS89, Blu89, SS91, GT92]. This previous work only discusses the occurrence of multiple local minima, whereas in this paper we show that the number of such minima can grow exponentially with the dimension. Also, the previous work has mainly been limited to the demonstration of local minima in networks or neurons that use the hyperbolic tangent or logistic function with the square loss. Here we show that exponentially many minima occur whenever the composition of the loss function with the transfer function is continuous and bounded. \n\nFigure 3: (a) Error function for the logistic transfer function and the square loss with examples ((10, .55), (.7, .25)). (b) Sets of minima can be combined. \n\nThe paper is outlined as follows. After some preliminaries in the next section, we give formal statements and proofs of the results mentioned above in Section 3. At first (Section 3.1) we show that n one-dimensional examples might result in n local minima of the error function (see e.g. Figure 3a for the error function of two one-dimensional examples). From the local minima in one dimension it follows easily that n d-dimensional examples might result in \u230an/d\u230b^d local minima of the error function (see Figure 1 and discussion in Section 3.2).
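The two local minima of Figure 3a can be verified directly. In the sketch below, the examples ((10, .55), (.7, .25)) are the ones from the figure, while the grid range and step are our own illustrative choices; it evaluates the error function on a grid and counts the strict local minima it finds.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def error(w, examples):
    """Square-loss error function E(w) of a one-dimensional neuron."""
    return sum((logistic(x * w) - y) ** 2 for x, y in examples)

examples = [(10.0, 0.55), (0.7, 0.25)]        # the examples of Figure 3a
ws = [-8.0 + 0.01 * i for i in range(1001)]   # grid over [-8, 2]
E = [error(w, examples) for w in ws]

# Interior grid points that are strict local minima of E.
minima = [ws[i] for i in range(1, len(E) - 1) if E[i - 1] > E[i] < E[i + 1]]
# 'minima' holds two points: one near w = -1.57, one near w = 0.01
```

The minimum near w = -1.57 lies in the region where the first example is saturated and its basin is shallow, matching the picture in Figure 3a.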
\nWe then consider neurons with a bias (Section 4), i.e. we add an additional input that is clamped to one. The error function for a sequence of examples S = ((x_t, y_t))_{t=1}^m is then \n\nE_S(B, w) = \u2211_{t=1}^m L(y_t, \u03c6(B + w \u00b7 x_t)), \n\nwhere B denotes the bias, i.e. the weight of the input that is clamped to one. We can prove that the error function might have \u230an/(2d)\u230b^d local minima if loss and transfer function are symmetric. This holds, for example, for the square loss and the logistic transfer function. The proofs are omitted due to space constraints; they are given in the full paper [AHW96], together with additional results for general loss and transfer functions. \n\nFinally, we show in Section 5 that, with minimal assumptions on transfer and loss functions, there is only one minimum of the error function if the sequence of examples is realizable by the neuron. \n\nThe essence of the proofs is quite simple. At first observe that if loss and transfer function are bounded and the domain is unbounded, then there exist areas of saturation where the error function is essentially flat. Furthermore, the error function is \"additive\", i.e. the error function produced by the examples in S \u222a S' is simply the error function produced by the examples in S added to the error function produced by the examples in S': E_{S\u222aS'} = E_S + E_{S'}. Hence the local minima of E_S remain local minima of E_{S\u222aS'} if they fall into an area of saturation of E_{S'}. Similarly, the local minima of E_{S'} remain local minima of E_{S\u222aS'} as well (see Figure 3b). In this way sets of local minima can be combined. \n\n2 PRELIMINARIES \n\nWe introduce the notion of a minimum-containing set, which will prove useful for counting the minima of the error function. \n\nDefinition 2.1 Let f : R^d \u2192 R be a continuous function.
Then an open and bounded set U \u2286 R^d is called a minimum-containing set for f if for each w on the boundary of U there is a w* \u2208 U such that f(w*) < f(w). \n\nObviously any minimum-containing set contains a local minimum of the respective function. Furthermore, each of n disjoint minimum-containing sets contains a distinct local minimum. Thus it is sufficient to find n disjoint minimum-containing sets in order to show that a function has at least n local minima. \n\n3 MINIMA FOR NEURONS WITHOUT BIAS \n\nWe will consider transfer functions \u03c6 and loss functions L which have the following property: \n\n(P1): The transfer function \u03c6 : R \u2192 R is non-constant. The loss function L : \u03c6(R) \u00d7 \u03c6(R) \u2192 [0, \u221e) has the property that L(y, y) = 0 and L(y, \u0177) > 0 for all y \u2260 \u0177, y, \u0177 \u2208 \u03c6(R). Finally, the function L(\u00b7, \u03c6(\u00b7)) : \u03c6(R) \u00d7 R \u2192 [0, \u221e) is continuous and bounded. \n\n3.1 ONE MINIMUM PER EXAMPLE IN ONE DIMENSION \n\nTheorem 3.1 Let \u03c6 and L satisfy (P1). Then for all n \u2265 1 there is a sequence of n examples S = ((x_1, y), ..., (x_n, y)), x_t \u2208 R, y \u2208 \u03c6(R), such that E_S(w) has n distinct local minima. \n\nSince L(y, \u03c6(w)) is continuous and non-constant, there are w^-, w^*, w^+ \u2208 R such that the values \u03c6(w^-), \u03c6(w^*), \u03c6(w^+) are all distinct. Furthermore, we can assume without loss of generality that 0 < w^- < w^* < w^+. Now set y = \u03c6(w^*). If the error function L(y, \u03c6(w)) has infinitely many local minima then Theorem 3.1 follows immediately, e.g. by setting x_1 = ... = x_n = 1. If L(y, \u03c6(w)) has only finitely many minima, then lim_{w\u2192\u221e} L(y, \u03c6(w)) = L(y, \u03c6(\u221e)) exists, since L(y, \u03c6(w)) is bounded and continuous. We use this fact in the following lemma. It states that we get a new minimum-containing set by adding an example in the area of saturation of the error function. \n\nLemma 3.2 Assume that lim_{w\u2192\u221e} L(y, \u03c6(w)) exists.
Let S = ((x_1, y_1), ..., (x_n, y_n)) be a sequence of examples and 0 < w_1^- < w_1^* < w_1^+ < ... < w_n^- < w_n^* < w_n^+ such that E_S(w_t^-) > E_S(w_t^*) and E_S(w_t^*) < E_S(w_t^+) for t = 1, ..., n. Let S' = ((x_0, y), (x_1, y_1), ..., (x_n, y_n)) where x_0 is sufficiently large. Furthermore, let w_0^* = w^*/x_0 and w_0^\u00b1 = w^\u00b1/x_0 (where w^-, w^*, w^+, y = \u03c6(w^*) are as above). Then 0 < w_0^- < w_0^* < w_0^+ < w_1^- < w_1^* < w_1^+ < ... < w_n^- < w_n^* < w_n^+ and \n\nE_{S'}(w_t^-) > E_{S'}(w_t^*) and E_{S'}(w_t^*) < E_{S'}(w_t^+) for t = 0, ..., n.    (1) \n\nProof. We have to show that condition (1) is satisfied for all sufficiently large x_0, i.e. that \n\nlim_{x_0\u2192\u221e} (E_{S'}(w_t^-) - E_{S'}(w_t^*)) > 0 and lim_{x_0\u2192\u221e} (E_{S'}(w_t^+) - E_{S'}(w_t^*)) > 0 for t = 0, ..., n.    (2) \n\nWe get \n\nlim_{x_0\u2192\u221e} E_{S'}(w_0^*) = L(y, \u03c6(w^*)) + lim_{x_0\u2192\u221e} E_S(w^*/x_0) = L(y, \u03c6(w^*)) + E_S(0), \n\nrecalling that w_0^* = w^*/x_0 and S' = S \u222a ((x_0, y)). Analogously, \n\nlim_{x_0\u2192\u221e} E_{S'}(w_0^\u00b1) = L(y, \u03c6(w^\u00b1)) + E_S(0). \n\nThus equation (2) holds for t = 0. For t = 1, ..., n we get \n\nlim_{x_0\u2192\u221e} E_{S'}(w_t^*) = lim_{x_0\u2192\u221e} L(y, \u03c6(w_t^* x_0)) + E_S(w_t^*) = L(y, \u03c6(\u221e)) + E_S(w_t^*) \n\nand analogously \n\nlim_{x_0\u2192\u221e} E_{S'}(w_t^\u00b1) = L(y, \u03c6(\u221e)) + E_S(w_t^\u00b1). \n\nSince E_S(w_t^*) < E_S(w_t^\u00b1) for t = 1, ..., n, the lemma follows. \u25a1 \n\nProof of Theorem 3.1. The theorem follows by induction from Lemma 3.2, since each interval (w_t^-, w_t^+) is a minimum-containing set for the error function. \u25a1 \n\nRemark. Though the proof requires the magnitude of the examples to be arbitrarily large\u00b9, in practice local minima show up even for moderately sized w (see Figure 3a). \n\n3.2 CURSE OF DIMENSIONALITY: THE NUMBER OF MINIMA MIGHT GROW EXPONENTIALLY WITH THE DIMENSION \n\nWe show how the 1-dimensional minima of Theorem 3.1 can be combined to obtain d-dimensional minima. \n\nLemma 3.3 Let f : R \u2192 R be a continuous function with n disjoint minimum-containing sets U_1, ..., U_n. Then the sets U_{t_1} \u00d7 ... \u00d7 U_{t_d}, t_j \u2208 {1, ..., n}, are n^d disjoint minimum-containing sets for the function g : R^d \u2192 R, g(x_1, ..., x_d) = f(x_1) + ... + f(x_d). \n\nProof. Omitted. \u25a1 \n\nTheorem 3.4 Let \u03c6 and L satisfy (P1).
Then for all n \u2265 1 there is a sequence of examples S = ((x_1, y), ..., (x_n, y)), x_t \u2208 R^d, y \u2208 \u03c6(R), such that E_S(w) has \u230an/d\u230b^d distinct local minima. \n\nProof. By Lemma 3.2 there exists a sequence of one-dimensional examples S' = ((x_1, y), ..., (x_{\u230an/d\u230b}, y)) such that E_{S'}(w) has \u230an/d\u230b disjoint minimum-containing sets. Thus by Lemma 3.3 the error function E_S(w) has \u230an/d\u230b^d disjoint minimum-containing sets, where S = (((x_1, 0, ..., 0), y), ..., ((x_{\u230an/d\u230b}, 0, ..., 0), y), ..., ((0, ..., 0, x_1), y), ..., ((0, ..., 0, x_{\u230an/d\u230b}), y)). \u25a1 \n\n4 MINIMA FOR NEURONS WITH A BIAS \n\nTheorem 4.1 Let the transfer function \u03c6 and the loss function L satisfy \u03c6(B_0 + z) - \u03c6_0 = \u03c6_0 - \u03c6(B_0 - z) and L(\u03c6_0 + y, \u03c6_0 + \u0177) = L(\u03c6_0 - y, \u03c6_0 - \u0177) for some B_0, \u03c6_0 \u2208 R and all z \u2208 R, y, \u0177 \u2208 \u03c6(R). Furthermore, let \u03c6 have a continuous second derivative and assume that the first derivative of \u03c6 at B_0 is non-zero. At last, let \u2202\u00b2L(y, \u0177)/\u2202\u0177\u00b2 be continuous in y and \u0177, L(y, y) = 0 for all y \u2208 \u03c6(R), and (\u2202\u00b2L(y, \u0177)/\u2202\u0177\u00b2)(\u03c6_0, \u03c6_0) > 0. Then for all n \u2265 1 there is a sequence of examples S = ((x_1, y_1), ..., (x_n, y_n)), x_t \u2208 R^d, y_t \u2208 \u03c6(R), such that E_S(B, w) has \u230an/(2d)\u230b^d distinct local minima. \n\nNote that the square loss along with either the hyperbolic tangent or the logistic transfer function satisfies the conditions of the theorem. \n\n\u00b9There is a parallel proof where the magnitudes of the examples may be arbitrarily small. \n\n5 ONE MINIMUM IN THE REALIZABLE CASE \n\nWe show that when transfer and loss function are monotone and the examples are realizable, then there is only a single minimal surface. A sequence of examples S is realizable if E_S(w) = 0 for some w \u2208 R^d. \n\nTheorem 5.1 Let \u03c6 and L satisfy (P1). Furthermore, let \u03c6 be monotone and L such that L(y, y + r_1) \u2264 L(y, y + r_2) for 0 \u2264 r_1 \u2264 r_2 or 0 \u2265 r_1 \u2265 r_2.
Assume that for some sequence of examples S there is a weight vector w_0 \u2208 R^d such that E_S(w_0) = 0. Then for each w_1 \u2208 R^d the function h(\u03b1) = E_S((1 - \u03b1)w_0 + \u03b1w_1) is increasing for \u03b1 \u2265 0. \n\nThus each minimum w_1 can be connected with w_0 by the line segment from w_0 to w_1 such that E_S(w) = 0 for all w on this segment. \n\nProof of Theorem 5.1. Let S = ((x_1, y_1), ..., (x_n, y_n)). Then h(\u03b1) = \u2211_{t=1}^n L(y_t, \u03c6(w_0 \u00b7 x_t + \u03b1(w_1 - w_0) \u00b7 x_t)). Since y_t = \u03c6(w_0 \u00b7 x_t), it suffices to show that L(\u03c6(z), \u03c6(z + \u03b1r)) is monotonically increasing in \u03b1 \u2265 0 for all z, r \u2208 R. Let 0 \u2264 \u03b1_1 \u2264 \u03b1_2. Since \u03c6 is monotone we get \u03c6(z + \u03b1_1 r) = \u03c6(z) + r_1 and \u03c6(z + \u03b1_2 r) = \u03c6(z) + r_2 where 0 \u2264 r_1 \u2264 r_2 or 0 \u2265 r_1 \u2265 r_2. Thus L(\u03c6(z), \u03c6(z + \u03b1_1 r)) \u2264 L(\u03c6(z), \u03c6(z + \u03b1_2 r)). \u25a1 \n\nAcknowledgments \n\nWe thank Mike Dooley, Andrew Klinger and Eduardo Sontag for valuable discussions. Peter Auer gratefully acknowledges support from the FWF, Austria, under grant J01028-MAT. Mark Herbster and Manfred Warmuth were supported by NSF grant IRI-9123692. \n\nReferences \n\n[AHW96] P. Auer, M. Herbster, and M. K. Warmuth. Exponentially many local minima for single neurons. Technical Report UCSC-CRL-96-1, Univ. of Calif. Computer Research Lab, Santa Cruz, CA, 1996. In preparation. \n\n[Blu89] E.K. Blum. Approximation of boolean functions by sigmoidal networks: Part I: XOR and other two-variable functions. Neural Computation, 1:532-540, February 1989. \n\n[BRS89] M.L. Brady, R. Raghavan, and J. Slawny. Back propagation fails to separate where perceptrons succeed. IEEE Transactions on Circuits and Systems, 36(5):665-674, May 1989. \n\n[BW88] E. Baum and F. Wilczek. Supervised learning of probability distributions by neural networks. In D.Z. Anderson, editor, Neural Information Processing Systems, pages 52-61, New York, 1988. American Institute of Physics. \n\n[GT92] Marco Gori and Alberto Tesi. On the problem of local minima in backpropagation.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(1):76-86, 1992. \n\n[Hay94] S. Haykin. Neural Networks: A Comprehensive Foundation. Macmillan, New York, NY, 1994. \n\n[SLF88] S. A. Solla, E. Levin, and M. Fleisher. Accelerated learning in layered neural networks. Complex Systems, 2:625-639, 1988. \n\n[SS89] E.D. Sontag and H.J. Sussmann. Backpropagation can give rise to spurious local minima even for networks without hidden layers. Complex Systems, 3(1):91-106, February 1989. \n\n[SS91] E.D. Sontag and H.J. Sussmann. Back propagation separates where perceptrons do. Neural Networks, 4(3), 1991. \n\n[Wat92] R. L. Watrous. A comparison between squared error and relative entropy metrics using several optimization algorithms. Complex Systems, 6:495-505, 1992. \n\n[WD88] B.S. Wittner and J.S. Denker. Strategies for teaching layered networks classification tasks. In D.Z. Anderson, editor, Neural Information Processing Systems, pages 850-859, New York, 1988. American Institute of Physics. \n", "award": [], "sourceid": 1028, "authors": [{"given_name": "Peter", "family_name": "Auer", "institution": null}, {"given_name": "Mark", "family_name": "Herbster", "institution": null}, {"given_name": "Manfred K.", "family_name": "Warmuth", "institution": null}]}