{"title": "Neural Learning in Structured Parameter Spaces - Natural Riemannian Gradient", "book": "Advances in Neural Information Processing Systems", "page_first": 127, "page_last": 133, "abstract": null, "full_text": "Neural Learning in Structured \n\nParameter Spaces \n\nNatural Riemannian Gradient \n\nShun-ichi Amari \n\nRIKEN Frontier Research Program, RIKEN, \n\nHirosawa 2-1, Wako-shi 351-01, Japan \n\namari@zoo.riken.go.jp \n\nAbstract \n\nThe parameter space of neural networks has a Riemannian met(cid:173)\nric structure. The natural Riemannian gradient should be used \ninstead of the conventional gradient, since the former denotes the \ntrue steepest descent direction of a loss function in the Riemannian \nspace. The behavior of the stochastic gradient learning algorithm \nis much more effective if the natural gradient is used. The present \npaper studies the information-geometrical structure of perceptrons \nand other networks, and prove that the on-line learning method \nbased on the natural gradient is asymptotically as efficient as the \noptimal batch algorithm. Adaptive modification of the learning \nconstant is proposed and analyzed in terms of the Riemannian mea(cid:173)\nsure and is shown to be efficient. The natural gradient is finally \napplied to blind separation of mixtured independent signal sources. \n\n1 \n\nIntrod uction \n\nNeural learning takes place in the parameter space of modifiable synaptic weights \nof a neural network. The role of each parameter is different in the neural network \nso that the parameter space is structured in this sense. The Riemannian structure \nwhich represents a local distance measure is introduced in the parameter space by \ninformation geometry (Amari, 1985). \n\nOn-line learning is mostly based on the stochastic gradient descent method, where \nthe current weight vector is modified in the gradient direction of a loss function. 
However, the ordinary gradient does not represent the steepest descent direction of a loss function in the Riemannian space. A geometrical modification is necessary, and the result is called the natural Riemannian gradient. The present paper studies the remarkable effects of using the natural Riemannian gradient in neural learning.

128
S. Amari

We first study the asymptotic behavior of on-line learning (Opper, NIPS'95 Workshop). Batch learning uses all the examples at any time to obtain the optimal weight vector, whereas on-line learning uses each example only once, when it is observed. Hence, in general, the target weight vector is estimated more accurately in the case of batch learning. However, we prove that, when the Riemannian gradient is used, on-line learning is asymptotically as efficient as optimal batch learning.

On-line learning is useful when the target vector fluctuates slowly (Amari, 1967). In this case, we need to modify the learning constant η_t depending on how far the current weight vector is located from the target. We show an algorithm that adaptively changes the learning constant based on the Riemannian criterion and prove that it gives asymptotically optimal behavior. This is a generalization of the idea of Sompolinsky et al. [1995].

We then answer the question of what Riemannian structure should be introduced in the parameter space of synaptic weights. We answer this problem from the point of view of information geometry (Amari [1985, 1995], Amari et al. [1992]). The explicit form of the Riemannian metric and its inverse matrix are given in the case of simple perceptrons.

We finally show how the Riemannian gradient is applied to blind separation of mixed independent signal sources. Here, the mixing matrix is unknown, so the parameter space is the space of matrices. The Riemannian structure is introduced in it.
The natural Riemannian gradient is computationally much simpler and more effective than the conventional gradient.

2 Stochastic Gradient Descent and On-Line Learning

Let us consider a neural network which is specified by a vector parameter w = (w_1, ..., w_n) in R^n. The parameter w is composed of modifiable connection weights and thresholds. Let us denote by l(x, w) the loss when input signal x is processed by a network having parameter w. In the case of multilayer perceptrons, a desired output or teacher signal y is associated with x, and a typical loss is given by

    l(x, y, w) = (1/2) || y - f(x, w) ||^2,                          (1)

where z = f(x, w) is the output from the network.

When the input x, or the input-output training pair (x, y), is generated from a fixed probability distribution, the expected loss L(w) of the network specified by w is

    L(w) = E[ l(x, y; w) ],                                          (2)

where E denotes the expectation. A neural network is trained by using training examples (x_1, y_1), (x_2, y_2), ... to obtain the optimal network parameter w* that minimizes L(w). If L(w) is known, the gradient method is described by

    w_{t+1} = w_t - η_t ∇L(w_t),    t = 1, 2, ...,

where η_t is a learning constant depending on t and ∇L = ∂L/∂w. Usually L(w) is unknown. The stochastic gradient learning method

    w_{t+1} = w_t - η_t ∇l(x_{t+1}, y_{t+1}; w_t)                    (3)

was proposed in an old paper (Amari [1967]). This method has become popular since Rumelhart et al. [1986] rediscovered it. It is expected that, when η_t converges to 0 in a suitable manner, the above w_t converges to w*.

Neural Learning and Natural Gradient Descent
129

The dynamical behavior of (3) was studied by Amari [1967], Heskes and Kappen [1991] and many others when η_t is a constant.

It was also shown in Amari [1967] that the modified rule

    w_{t+1} = w_t - η_t C ∇l(x_{t+1}, y_{t+1}; w_t)                  (4)

works well for any positive-definite matrix C, in particular for the inverse of the metric G. Geometrically speaking, ∂l/∂w is a covariant vector, while Δw_t = w_{t+1} - w_t is a contravariant vector.
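The covariant/contravariant distinction can be checked numerically. In the sketch below (illustrative only; the metric G and the gradient vector are randomly generated), the preconditioned step uses C = G^{-1}, and we verify that among all directions of equal Riemannian length d'Gd = 1, the direction proportional to G^{-1}∇l gives the largest first-order change of the loss:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4

# An arbitrary positive-definite metric G and an arbitrary gradient vector.
A = rng.normal(size=(n, n))
G = A @ A.T + n * np.eye(n)          # symmetric positive definite
grad = rng.normal(size=n)

# Natural (contravariant) gradient: solve G d = grad rather than forming G^{-1}.
nat = np.linalg.solve(G, grad)

def metric_normalize(d):
    # Scale d to unit Riemannian length: d' G d = 1.
    return d / np.sqrt(d @ G @ d)

# First-order change of l along the metric-normalized natural direction ...
best = grad @ metric_normalize(nat)

# ... compared with many random directions of the same Riemannian length.
worse = max(grad @ metric_normalize(rng.normal(size=n)) for _ in range(2000))
print(best >= worse)                 # the natural direction is steepest
```

Solving the linear system G d = ∇l is both cheaper and numerically safer than explicitly inverting G.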
Therefore, it is natural to use the (contravariant) inverse metric tensor G^{-1} to convert the covariant gradient ∂l/∂w into the contravariant form

    G^{-1} ∂l/∂w = ( Σ_j g^{ij} ∂l/∂w_j ),                           (5)

where G^{-1} = (g^{ij}) is the inverse matrix of G = (g_{ij}). The present paper studies how the metric tensor G should be defined in neural learning and how effective the new gradient learning rule

    w_{t+1} = w_t - η_t G^{-1}(w_t) ∇l(x_{t+1}, y_{t+1}; w_t)        (6)

is.

3 Gradient in Riemannian spaces

Let S = {w} be the parameter space and let l(w) be a function defined on S. When S is a Euclidean space and w is an orthonormal coordinate system, the squared length of a small incremental vector dw connecting w and w + dw is given by

    |dw|^2 = Σ_{i=1}^n (dw_i)^2.                                     (7)

However, when the coordinate system is non-orthonormal or the space S is Riemannian, the squared length is given by the quadratic form

    |dw|^2 = Σ_{i,j} g_{ij}(w) dw_i dw_j = dw' G dw.                 (8)

Here, the matrix G = (g_{ij}) depends in general on w and is called the metric tensor. It reduces to the identity matrix I in the Euclidean orthonormal case. It will be shown soon that the parameter space S of neural networks has a Riemannian structure (see Amari et al. [1992], Amari [1995], etc.).

The steepest descent direction of a function l(w) at w is defined by the vector dw that minimizes l(w + dw) under the constraint |dw|^2 = ε^2 (see eq. (8)) for a sufficiently small constant ε.

Lemma 1. The steepest descent direction of l(w) in a Riemannian space is given by

    - G^{-1}(w) ∇l(w).

We call G^{-1}(w) ∇l(w) the natural gradient of l(w) in the Riemannian space. It shows the steepest descent direction of l, and is nothing but the contravariant form of ∇l in tensor notation. When the space is Euclidean and the coordinate system is orthonormal, G is the unit matrix I, so that the natural gradient coincides with ∇l.
4 Natural gradient gives efficient on-line learning

Let us begin with the simplest case of noisy multilayer analog perceptrons. Given input x, the network emits output z = f(x, w) + n, where f is a differentiable deterministic function of the multilayer perceptron with parameter w, and n is noise subject to the normal distribution N(0, 1). The probability density of an input-output pair (x, z) is given by

    p(x, z; w) = q(x) p(z | x; w),

where q(x) is the probability distribution of the input x, and

    p(z | x; w) = (1 / sqrt(2π)) exp{ -(1/2) || z - f(x, w) ||^2 }.

The squared error loss function (1) can be written as

    l(x, z, w) = - log p(x, z; w) + log q(x) - log sqrt(2π).

Hence, minimizing the loss is equivalent to maximizing the likelihood function p(x, z; w).

Let D_T = {(x_1, z_1), ..., (x_T, z_T)} be T independent input-output examples generated by the network having the parameter w*. Then the maximum likelihood estimator ŵ_T minimizes the log loss l(x, z; w) = - log p(x, z; w) over the training data D_T, that is, it minimizes the training error

    L_T(w) = (1/T) Σ_{t=1}^T l(x_t, z_t; w).                         (9)

The maximum likelihood estimator is efficient (or Fisher-efficient), implying that it is the best consistent estimator, satisfying the Cramer-Rao bound asymptotically:

    lim_{T → ∞} T E[ (ŵ_T - w*)(ŵ_T - w*)' ] = G^{-1},               (10)

where G^{-1} is the inverse of the Fisher information matrix G = (g_{ij}), defined in component form by

    g_{ij} = E[ (∂ log p(x, z; w) / ∂w_i)(∂ log p(x, z; w) / ∂w_j) ].   (11)

Information geometry (Amari, 1985) proves that the Fisher information G is the only invariant metric to be introduced in the space S = {w} of the parameters of probability distributions.

In on-line learning, examples (x_1, z_1), (x_2, z_2), ... are given one at a time. Let w_t be the estimated value at time t.
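The Fisher matrix (11) can be estimated by Monte Carlo. In the sketch below (an illustration of mine, not from the paper), the network is simplified to a linear map f(x, w) = w'x with unit noise variance, so the score is ∂ log p/∂w = (z - w'x) x and the true Fisher matrix is E[xx'], i.e. the identity for standard normal inputs:

```python
import numpy as np

rng = np.random.default_rng(2)
d, N = 3, 20000
w = np.array([0.7, -0.3, 1.2])        # hypothetical parameter value

# Model: z = f(x, w) + n with n ~ N(0, 1); f(x, w) = w . x for simplicity.
x = rng.normal(size=(N, d))
z = x @ w + rng.normal(size=N)

# Score of log p(x, z; w) for Gaussian noise: (z - f(x, w)) * x.
score = (z - x @ w)[:, None] * x

# Monte Carlo estimate of g_ij = E[ score_i * score_j ].
G_hat = score.T @ score / N
print(np.round(G_hat, 2))             # close to E[xx'] = I here
```

The same recipe works for any differentiable f; only the score computation changes, with df/dw replacing x.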
At the next time t + 1, the estimator w_t is modified to give a new estimator w_{t+1} based on the observation (x_{t+1}, z_{t+1}). The old observations (x_1, z_1), ..., (x_t, z_t) cannot be reused to obtain w_{t+1}, so the learning rule is written as w_{t+1} = m(x_{t+1}, z_{t+1}, w_t). The process {w_t} is hence Markovian. Whatever learning rule m we choose, the behavior of the estimator w_t is never better than that of the optimal batch estimator ŵ_t because of this restriction. The conventional on-line learning rule is given by the gradient form w_{t+1} = w_t - η_t ∇l(x_{t+1}, z_{t+1}; w_t). When η_t satisfies a certain condition, say η_t = c/t, stochastic approximation guarantees that w_t is a consistent estimator converging to w*. However, it is not efficient in general.

There arises the question of whether there exists an on-line learning rule that gives an efficient estimator. If it exists, the asymptotic behavior of on-line learning is equivalent to that of the batch estimation method. The present paper answers the question by giving an efficient on-line learning rule

    w_{t+1} = w_t - (1/t) G^{-1}(w_t) ∇l(x_{t+1}, z_{t+1}; w_t).     (12)

Theorem 1. The natural gradient on-line learning rule gives a Fisher-efficient estimator, that is,

    lim_{t → ∞} t E[ (w_t - w*)(w_t - w*)' ] = G^{-1}.               (13)

5 Adaptive modification of learning constant

We have proved that η_t = 1/t with the coefficient matrix G^{-1} is the asymptotically best choice for on-line learning. However, when the target parameter w* is not fixed but fluctuates or changes suddenly, this choice is not good, because the learning system cannot follow the change if η_t is too small. It was proposed in Amari [1967] to choose η_t adaptively, such that η_t becomes larger when the current target w* is far from w_t and smaller when it is close. However, no definite scheme was analyzed there. Sompolinsky et al.
[1995] proposed an excellent scheme for an adaptive choice of η_t for deterministic dichotomy neural networks. We extend their idea to be applicable to stochastic cases, where the Riemannian structure plays a role.

We assume that l(x, z; w) is differentiable with respect to w. (The non-differentiable case is usually more difficult to analyze. Sompolinsky et al. [1995] treated this case.) We moreover treat a realizable teacher, so that L(w*) = 0. We propose the following learning scheme:

    w_{t+1} = w_t - η_t G^{-1}(w_t) ∇l(x_{t+1}, z_{t+1}; w_t),       (14)
    η_{t+1} = η_t + α η_t [ β l(x_{t+1}, z_{t+1}; w_t) - η_t ],      (15)

where α and β are constants. We analyze the dynamical behavior of learning by using the continuous version of the algorithm,

    (d/dt) w_t = - η_t G^{-1}(w_t) ∇l(x_t, z_t; w_t),                (16)
    (d/dt) η_t = α η_t [ β l(x_t, z_t; w_t) - η_t ].                 (17)

In order to show the dynamical behavior of (w_t, η_t), we use the averaged version of the above equations with respect to the current input-output pair (x_t, z_t). We introduce the squared error variable

    e_t = (1/2) (w_t - w*)' G* (w_t - w*),                           (18)

where G* = G(w*). By using the averaged and continuous-time version

    ẇ_t = - η_t G^{-1}(w_t) ⟨ ∇l(x_t, z_t; w_t) ⟩,
    η̇_t = α η_t [ β ⟨ l(x_t, z_t; w_t) ⟩ - η_t ],

where ˙ denotes d/dt and ⟨ ⟩ the average over the current (x, z), we have

    ė_t = - 2 η_t e_t,                                               (19)
    η̇_t = α β η_t e_t - α η_t^2.                                     (20)

The behavior of the above equations is interesting: the origin (0, 0) is an attractor. However, the basin of attraction has a fractal boundary. In any case, starting from an adequate initial value, the solution has the form

    e_t ≈ (1/β)(1/2 - 1/α)(1/t),                                     (21)
    η_t ≈ 1/(2t).                                                    (22)

This proves the 1/t convergence rate of the generalization error, which is optimal in order for any estimator w_t converging to w*.
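The averaged dynamics (19)-(20) are easy to integrate numerically. A minimal Euler sketch (the values of α, β, the step size, and the initial conditions are arbitrary choices of mine, with α > 2 so that the asymptotic error constant is positive) shows η_t settling onto the 1/(2t) law of (22):

```python
import numpy as np

# Euler integration of the averaged dynamics:
#   de/dt   = -2 * eta * e                         (19)
#   deta/dt = alpha*beta*eta*e - alpha*eta**2      (20)
alpha, beta = 4.0, 1.0       # hypothetical constants, alpha > 2
e, eta = 1.0, 0.1            # arbitrary initial error and learning constant
dt, steps = 0.01, 200_000    # integrate up to t = 2000

for _ in range(steps):
    de = -2.0 * eta * e
    deta = alpha * beta * eta * e - alpha * eta ** 2
    e += de * dt
    eta += deta * dt

t_final = steps * dt
print(e, eta, 1.0 / (2.0 * t_final))   # eta approaches the 1/(2t) law
```

Early on, η_t grows (the error term βe dominates), letting the system converge quickly; once e_t is small, the -αη² term takes over and anneals η_t at the optimal rate.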
\n\n6 Riemannian structures of simple perceptrons \n\nWe first study the parameter space S of simple perceptrons to obtain an explicit \nform of the metric G and its inverse G-1. This suggests how to calculate the metric \nin the parameter space of multilayer perceptrons. \nLet us consider a simple perceptron with input ~ and output z. Let w be its \nconnection weight vector. For the analog stochastic perceptron, its input-output \nbehavior is described by z = f( Wi ~) + n, where n denotes a random noise subject \nto the normal distribution N(O, (72) and f is the hyperbolic tangent, \n\n1- e- u \nf(u)=l+e- u \n\nIn order to calculate the metric G explicitly, let ew be the unit column vector in \nthe direction of w in the Euclidean space R n , \nw \n\new=~, \n\nwhere II w II is the Euclidean norm. We then have the following theorem. \nTheorem 2. The Fisher metric G and its in verse G-l are given by \n\nG(w) = Cl(W)! + {C2(W) - C1(w)}ewe~, \n1 ) \nG \n\n1 \n\n-1 \n\nI \n\nCl W \n\n(1 \nC2 W \n\n(w) = -( -)! + -( -) - -( -) ewew' \nwhere W = Iwl (Euclidean norm) and C1(W) and C2(W) are given by \n4~(72 J {f2(wc) _1}2 exp { _~c2} dc, \nC2(W) = 4~(72 J {f2(wc) - 1}2c2 exp {_~c2} de. \n\nC1(W) \n\nC1 W \n\nTheorem 3. The Jeffrey prior is given by \n\nvIG(w)1 = Vn VC2(W){C1(W)}n-1. \n\n1 \n\n(23) \n\n(24) \n\n(25) \n\n(26) \n\n(27) \n\n7 The natural gradient for blind separation of mixtured \n\nsignals \n\nLet s = (51, ... , 5 n ) be n source signals which are n independent random variables. \nWe assume that their n mixtures ~ = (Xl,' \" , X n ), \n\n~ = As \n\n(28) \n\n\fNeural Learning and Natural Gradient Descent \n\n133 \n\nare observed. Here, A is a matrix. When 8 \nis time serieses, we observe \n~(1), . .. , ~(t) . The problem of blind separation is to estimate W = A-I adap(cid:173)\ntively from z(t), t = 1,2,3,\u00b7 \u00b7\u00b7 without knowing 8(t) nor A. We can then recover \noriginal 8 by \n\n(29) \nwhen W = A-I. 
Let W ∈ Gl(n), that is, a nonsingular n x n matrix, and let φ(W) be a scalar function. This is given by a measure of independence, such as the Kullback-Leibler divergence between the joint density of y = Wx and the product of its marginals, which is represented by the expectation of a loss function. We define the natural gradient of φ(W).

Now we return to our manifold Gl(n) of matrices. It has a Lie group structure: any A ∈ Gl(n) maps Gl(n) to Gl(n) by W → WA. We impose that the Riemannian structure be invariant under this operation. We can then prove that the natural gradient in this case is

    (∇φ) W'W.                                                        (30)

The natural gradient works surprisingly well for adaptive blind signal separation (Amari et al. [1996]; Cardoso and Laheld [1996]).

References

[1] S. Amari. Theory of adaptive pattern classifiers, IEEE Trans., EC-16, No. 3, 299-307, 1967.

[2] S. Amari. Differential-Geometrical Methods in Statistics, Lecture Notes in Statistics, vol. 28, Springer, 1985.

[3] S. Amari. Information geometry of the EM and em algorithms for neural networks, Neural Networks, 8, No. 9, 1379-1408, 1995.

[4] S. Amari, A. Cichocki and H. H. Yang. A new learning algorithm for blind signal separation, in Advances in Neural Information Processing Systems 8, MIT Press, Cambridge, MA, 1996.

[5] S. Amari, K. Kurata and H. Nagaoka. Information geometry of Boltzmann machines, IEEE Trans. on Neural Networks, 3, 260-271, 1992.

[6] J.-F. Cardoso and B. Laheld. Equivariant adaptive source separation, to appear in IEEE Trans. on Signal Processing, 1996.

[7] T. M. Heskes and B. Kappen. Learning processes in neural networks, Physical Review A, 44, 2718-2726, 1991.

[8] D. Rumelhart, G. E. Hinton and R. J. Williams. Learning internal representations, in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1, Foundations, MIT Press, Cambridge, MA, 1986.

[9] H. Sompolinsky, N. Barkai and H. S. Seung.
On-line learning of dichotomies: algorithms and learning curves, in Neural Networks: The Statistical Mechanics Perspective, Proceedings of the CTP-PBSRI Joint Workshop on Theoretical Physics, J.-H. Oh et al., eds., 105-130, 1995.
", "award": [], "sourceid": 1248, "authors": [{"given_name": "Shun-ichi", "family_name": "Amari", "institution": null}]}