{"title": "Geometrical Singularities in the Neuromanifold of Multilayer Perceptrons", "book": "Advances in Neural Information Processing Systems", "page_first": 343, "page_last": 350, "abstract": null, "full_text": "Geometrical Singularities in the \n\nNeuromanifold of Multilayer Perceptrons \n\nShun-ichi Amari, Hyeyoung Park, and Tomoko Ozeki \n\nBrain Science Institute, RIKEN \n\nHirosawa 2-1, Wako, Saitama, 351-0198, Japan \n\n{amari, hypark, tomoko} @brain.riken.go.jp \n\nAbstract \n\nSingularities are ubiquitous in the parameter space of hierarchical \nmodels such as multilayer perceptrons. At singularities, the Fisher \ninformation matrix degenerates, and the Cramer-Rao paradigm \ndoes no more hold, implying that the classical model selection the(cid:173)\nory such as AIC and MDL cannot be applied. It is important to \nstudy the relation between the generalization error and the training \nerror at singularities. The present paper demonstrates a method \nof analyzing these errors both for the maximum likelihood estima(cid:173)\ntor and the Bayesian predictive distribution in terms of Gaussian \nrandom fields, by using simple models. \n\n1 \n\nIntroduction \n\nA neural network is specified by a number of parameters which are synaptic weights \nand biases. Learning takes place by modifying these parameters from observed \ninput-output examples. Let us denote these parameters by a vector () = (01 , .. . , On). \nThen, a network is represented by a point in the parameter space S, where () plays \nthe role of a coordinate system. The parameter space S is called a neuromanifold. \nA learning process is represented by a trajectory in the neuromanifold. The dy(cid:173)\nnamical behavior of learning is known to be very slow, because of the plateau \nphenomenon. The statistical physical method [1] has made it clear that plateaus \nare ubiquitous in a large-scale perceptron. 
In order to improve the dynamics of learning, the natural gradient learning method has been introduced, which takes the Riemannian geometrical structure of the neuromanifold into account [2, 3]. Its adaptive version, in which the inverse of the Fisher information matrix is estimated adaptively, has been shown by computer simulations to behave excellently [4, 5]. \n\nBecause of the symmetry in the architecture of multilayer perceptrons, the parameter space of the MLP admits an equivalence relation [6, 7]. The quotient of the parameter space by this equivalence relation gives rise to singularities in the neuromanifold, and plateaus exist at such singularities [8]. The Fisher information matrix becomes singular at the singularities, so the neuromanifold is strongly curved, like a spacetime that includes black holes. \n\nIn the neighborhood of singularities, the Fisher-Cramer-Rao paradigm does not hold, and the estimator is no longer subject to the Gaussian distribution, even asymptotically. This is essential in neural learning and model selection. The AIC and MDL criteria of model selection rely on the Gaussian paradigm, so they are not appropriate at singularities. \n\nThe problem was first pointed out by Hagiwara et al. [9]. Watanabe [10] applied algebraic geometry to elucidate the behavior of the Bayesian predictive estimator in MLP, showing a sharp difference between the regular and singular cases. Fukumizu [11] gave a general analysis of the maximum likelihood estimator in singular statistical models, including multilayer perceptrons. \n\nThe present paper is a first step toward elucidating the effects of singularities in the neuromanifold of multilayer perceptrons. We use a simple cone model to show how different the behaviors of the maximum likelihood estimator and the Bayes predictive distribution are from the regular case. 
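The natural gradient update mentioned above preconditions the ordinary gradient by the inverse Fisher information matrix. A minimal illustrative sketch follows; the function names and the toy quadratic loss are our own assumptions, not from the paper, and the damping term stands in for the adaptive estimation of G⁻¹ (and is also what keeps the update defined near singularities, where G degenerates).

```python
import numpy as np

def natural_gradient_step(theta, grad_loss, fisher, lr=0.1, damping=1e-8):
    """One natural gradient update: theta <- theta - lr * G^{-1} grad.

    `damping` regularizes G; this matters near singularities, where the
    Fisher information matrix G becomes singular and G^{-1} blows up.
    """
    G = fisher(theta) + damping * np.eye(theta.size)
    return theta - lr * np.linalg.solve(G, grad_loss(theta))

# Hypothetical toy loss L(theta) = 0.5 * theta^T A theta, whose Fisher
# matrix we take to be A.  Natural gradient then contracts every
# coordinate at the same rate (1 - lr), regardless of the conditioning of A.
A = np.array([[2.0, 0.0],
              [0.0, 0.5]])
theta = np.array([1.0, 1.0])
for _ in range(50):
    theta = natural_gradient_step(theta, lambda t: A @ t, lambda t: A)
```

On this toy problem the update reduces to theta <- (1 - lr) * theta, so all coordinates converge uniformly, which is the intuition behind natural gradient escaping plateaus faster than plain gradient descent.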
To this end, we introduce the Gaussian random field [11, 12, 13], and analyze the generalization error and training error for both the mle (maximum likelihood estimator) and the Bayes estimator. \n\n2 Topology of the neuromanifold \n\nLet us consider an MLP with h hidden units and one output unit, \n\ny = Σ_{i=1}^h v_i φ(w_i · x) + n, (1) \n\nwhere y is the output, x is the input, and n is Gaussian noise. Let us summarize all the parameters in a single parameter vector θ = (w_1, ..., w_h; v_1, ..., v_h) and write \n\nf(x; θ) = Σ_{i=1}^h v_i φ(w_i · x). (2) \n\nThen θ is a coordinate system of the neuromanifold. Because of the noise, the input-output relation is stochastic, given by the conditional probability distribution \n\np(y|x, θ) = (1/√(2π)) exp{−(1/2)(y − f(x; θ))^2}, (3) \n\nwhere we have normalized the scale of the noise to 1. Each point in the neuromanifold represents a neural network, or equivalently its probability distribution. \n\nIt is known that the behavior of the MLP is invariant under 1) permutations of the hidden units, and 2) a sign change of both w_i and v_i at the same time (φ being an odd function). Two networks are equivalent when one is mapped to the other by any of the above operations, which form a group. Hence, it is natural to treat the quotient space S/≈, where ≈ is the equivalence relation. There are some points that are invariant under a nontrivial isotropy subgroup, and on these points singularities occur. \n\nWhen v_i = 0, we have v_i φ(w_i · x) = 0, so all the points on the submanifold v_i = 0 are equivalent whatever w_i is; this hidden unit is not needed. Hence, in M = S/≈, all of these points are reduced to one and the same point. When w_i = w_j holds, these two units may be merged into one, and as long as v_i + v_j is the same, two points are equivalent even when they differ in v_i − v_j. Hence, dimension reduction takes place in the subspace satisfying w_i = w_j. 
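The two invariances above are easy to check numerically. The sketch below assumes φ = tanh (an odd activation, so the simultaneous sign flip of w_i and v_i leaves f unchanged); the array names and random values are illustrative, not from the paper.

```python
import numpy as np

def f(x, W, v):
    """One-hidden-layer MLP, eq. (2): f(x; theta) = sum_i v_i * tanh(w_i . x)."""
    return np.tanh(W @ x) @ v

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))   # rows are w_1, w_2, w_3
v = rng.normal(size=3)
x = rng.normal(size=2)

out = f(x, W, v)

# 1) Permuting the hidden units leaves f unchanged.
perm = np.array([2, 0, 1])
out_perm = f(x, W[perm], v[perm])

# 2) Flipping the sign of both w_i and v_i for one unit leaves f
#    unchanged, because tanh is odd: (-v_i)*tanh(-w_i.x) = v_i*tanh(w_i.x).
W2, v2 = W.copy(), v.copy()
W2[1] *= -1.0
v2[1] *= -1.0
out_flip = f(x, W2, v2)
```

Both transformed networks compute exactly the same function as the original, which is why the parameter space must be treated as the quotient S/≈.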
Such singularities occur on the critical submanifolds of the two types \n\nw_i = w_j,   v_i = 0. (4) \n\n3 Simple toy models \n\nGiven training data, the parameters of the neural network are estimated or trained by learning. It is important to elucidate the effects of singularities on learning and estimation. We use simple toy models to attack this problem. One is a very simple multilayer perceptron having only one hidden unit. The other is a simple cone model. Let x ∈ R^{d+2} be a Gaussian random variable with mean μ and identity covariance matrix I, \n\np(x; μ) = (2π)^{−(d+2)/2} exp{−(1/2)||x − μ||^2}, (5) \n\nand let S = {p(x; μ) | μ ∈ R^{d+2}} be the parameter space. The cone model M is a subset of S, embedded as \n\nM: μ = (ξ/√(1 + c^2)) (1, cω), (6) \n\nwhere c is a constant, ||ω|| = 1, and ω ∈ S^d, the d-dimensional unit sphere. When d = 1, S^1 is a circle, so ω is replaced by an angle θ, and we have \n\nμ = (ξ/√(1 + c^2)) (1, c cos θ, c sin θ). (7) \n\nSee Figure 1. M is a cone with (ξ, ω) as coordinates, where the apex ξ = 0 is the singular point. \n\nFigure 1: One-dimensional cone model \n\nThe input-output relation of a simple multilayer perceptron is given by \n\ny = v
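The degeneracy at the apex can be made concrete for the d = 1 cone (7). Since x ~ N(μ(ξ, θ), I), the Fisher information matrix in the (ξ, θ) coordinates is G = JᵀJ, where J is the Jacobian of μ; the θ-column of J is proportional to ξ, so G loses rank exactly at ξ = 0. A small sketch under these assumptions (c = 1 is an arbitrary illustrative choice):

```python
import numpy as np

c = 1.0  # illustrative value of the constant c in eq. (6)

def mu(xi, theta):
    """Cone embedding, eq. (7): mu = xi/sqrt(1+c^2) * (1, c cos(theta), c sin(theta))."""
    return xi / np.sqrt(1 + c**2) * np.array([1.0, c * np.cos(theta), c * np.sin(theta)])

def fisher(xi, theta):
    """Fisher matrix of N(mu(xi,theta), I) in (xi, theta) coordinates: G = J^T J."""
    # d mu / d xi: the unit direction a(theta) of the ray.
    a = np.array([1.0, c * np.cos(theta), c * np.sin(theta)]) / np.sqrt(1 + c**2)
    # d mu / d theta: proportional to xi, hence it vanishes at the apex.
    da = xi * np.array([0.0, -c * np.sin(theta), c * np.cos(theta)]) / np.sqrt(1 + c**2)
    J = np.column_stack([a, da])
    return J.T @ J

G_regular = fisher(2.0, 0.3)  # away from the apex: full rank
G_apex = fisher(0.0, 0.3)     # at the apex xi = 0: singular
```

A short calculation gives G = diag(1, ξ²c²/(1 + c²)): away from the apex det G = ξ²c²/(1 + c²) > 0, while at ξ = 0 it is exactly zero, so the Cramer-Rao bound in the θ direction is vacuous there.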