{"title": "Linear Learning: Landscapes and Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 65, "page_last": 72, "abstract": null, "full_text": "LINEAR LEARNING: LANDSCAPES AND ALGORITHMS \n\nPierre Baldi \n\nJet Propulsion Laboratory \nCalifornia Institute of Technology \nPasadena, CA 91109 \n\nWhat follows extends some of our results of [1] on learning from examples in layered feed-forward networks of linear units. In particular, we examine what happens when the number of layers is large or when the connectivity between layers is local, and we investigate some of the properties of an autoassociative algorithm. Notation is as in [1], where additional motivation and references can be found. It is usual to criticize linear networks because \"linear functions do not compute\" and because several layers can always be reduced to one by the proper multiplication of matrices. However, this is not the point of view adopted here. It is assumed that the architecture of the network is given (and could perhaps depend on external constraints), and the purpose is to understand what happens during the learning phase, what strategies are adopted by a synaptic-weight-modifying algorithm, ... [see also Cottrell et al. (1988) for an example of an application and the work of Linsker (1988) on the emergence of feature-detecting units in linear networks]. \n\nConsider first a two-layer network with n input units, n output units and p hidden units (p < n). Let (x_1, y_1), ..., (x_T, y_T) be the set of centered input-output training patterns. The problem is then to find two matrices of weights A and B minimizing the error function E: \n\nE(A, B) = Σ_t ||y_t - A B x_t||^2. \n\nAssume that the matrix Σ = Σ_YX Σ_XX^{-1} Σ_XY has n distinct eigenvalues λ_1 > ... > λ_n. If I = {i_1, ..., i_p} (1 ≤ i_1 < ... < i_p ≤ n) is any ordered p-index set, let U_I = [u_{i_1}, ..., u_{i_p}] denote the matrix formed by the orthonormal eigenvectors of Σ associated with the eigenvalues λ_{i_1}, ..., λ_{i_p}. Then two full-rank matrices A and B define a critical point of E if and only if there exist an ordered p-index set I and an invertible p x p matrix C such that \n\nA = U_I C    (8) \n\nB = C^{-1} U_I^T Σ_YX Σ_XX^{-1}    (9) \n\nFor such a critical point the overall map is \n\nW = A B = P_{U_I} Σ_YX Σ_XX^{-1}    (10) \n\nand we have \n\nE(A, B) = tr(Σ_YY) - Σ_{i ∈ I} λ_i.    (11) \n\nTherefore a critical point W of rank p is always the product of the ordinary least-squares regression matrix followed by an orthogonal projection onto the subspace spanned by p eigenvectors of Σ. The map W associated with the index set {1, 2, ..., p} is the unique local and global minimum of E. The remaining (n choose p) - 1 p-index sets correspond to saddle points. All additional critical points defined by matrices A and B which are not of full rank are also saddle points and can be characterized in terms of orthogonal projections onto subspaces spanned by q eigenvectors, with q < p. \n\nDeep Networks \n\nConsider now the case of a deep network with a first layer of n input units, an (m + 1)-th layer of n output units and m - 1 hidden layers, with an error function given by \n\nE(A_1, ..., A_m) = Σ_t ||y_t - A_1 A_2 ... A_m x_t||^2. \n\nIf μ(0) > x_d then μ(k) → +∞. Therefore the algorithm can converge only for 0 < μ(0) < x_d. When the learning rate is too large, i.e. when ηλ > 1/2, then even if μ(0) is in the interval (0, x_d) one can see that the algorithm does not converge and may even exhibit complex oscillatory behavior. However, when ηλ < 1/2: if 0 < μ(0) < x_a then μ(k) → 1; if μ(0) = x_a then μ(k) = 0; and if x_a < μ(0) < x_d then μ(k) → 1. \n\nIn conclusion, we see that if the algorithm is to be tested, the learning rate should be chosen so that it does not exceed 1/(2λ), where λ is the largest eigenvalue of Σ_XX. 
Even more so than back propagation, the algorithm can encounter problems in the proximity of saddle points. Once a non-principal eigenvector of Σ_XX is learnt, the algorithm rapidly incorporates a projection along that direction which cannot be escaped at later stages. Simulations are required to examine the effects of \"noisy gradients\" (computed after the presentation of only a few training examples), multiple starting points, variable learning rates, momentum terms, and so forth. \n\nAcknowledgement \n\nWork supported by NSF grant DMS-8800323 and in part by ONR contract 411P006-01. \n\nReferences \n\n(1) Baldi, P. and Hornik, K. (1988) Neural Networks and Principal Component Analysis: Learning from Examples without Local Minima. Neural Networks, Vol. 2, No. 1. \n(2) Chauvin, Y. (1989) Another Neural Model as a Principal Component Analyzer. Submitted for publication. \n(3) Cottrell, G. W., Munro, P. W. and Zipser, D. (1988) Image Compression by Back Propagation: a Demonstration of Extensional Programming. In: Advances in Cognitive Science, Vol. 2, Sharkey, N. E., ed., Norwood, NJ: Ablex. \n(4) Linsker, R. (1988) Self-Organization in a Perceptual Network. Computer, 21(3), 105-117. \n(5) Williams, R. J. (1985) Feature Discovery Through Error-Correction Learning. ICS Report 8501, University of California, San Diego. \n", "award": [], "sourceid": 123, "authors": [{"given_name": "Pierre", "family_name": "Baldi", "institution": null}]}