{"title": "Learning Curves, Model Selection and Complexity of Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 607, "page_last": 614, "abstract": null, "full_text": "Learning Curves, Model Selection and Complexity of Neural Networks\n\nNoboru Murata\nDepartment of Mathematical Engineering and Information Physics\nUniversity of Tokyo, Tokyo 113, JAPAN\nE-mail: mura@sat.t.u-tokyo.ac.jp\n\nShuji Yoshizawa\nDept. Mech. Info.\nUniversity of Tokyo\n\nShun-ichi Amari\nDept. Math. Eng. and Info. Phys.\nUniversity of Tokyo\n\nAbstract\n\nLearning curves show how a neural network improves as the number of training examples increases and how this improvement is related to the network complexity. The present paper clarifies the asymptotic properties of two learning curves and the relation between them, one concerning the predictive loss or generalization loss and the other the training loss. The result gives a natural definition of the complexity of a neural network. Moreover, it provides a new criterion for model selection.\n\n1 INTRODUCTION\n\nThe learning curve shows how well the behavior of a neural network improves as the number of training examples increases and how it is related to the complexity of neural networks. This provides us with a criterion for choosing an adequate network in relation to the number of training examples. Some researchers have attacked this problem by using statistical mechanical methods (see Levin et al. [1990], Seung et al. [1991], etc.) and some by information theory and algorithmic methods (see Baum and Haussler [1989], etc.). The present paper elucidates asymptotic properties of the learning curve from the statistical point of view, giving a new criterion for model selection. 
\n\n2 STATEMENT OF THE PROBLEM\n\nLet us consider a stochastic neural network, which is parameterized by a set of $m$ weights $\\theta = (\\theta^1, \\ldots, \\theta^m)$ and whose input-output relation is specified by a conditional probability $p(y|x, \\theta)$. In other words, for an input signal $x \\in R^{n_{in}}$, the probability distribution of the output $y \\in R^{n_{out}}$ is given by $p(y|x, \\theta)$.\n\nA typical form of the stochastic neural network is as follows. Let us consider a multi-layered network $f(x, \\theta)$, where $\\theta$ is a set of $m$ parameters $\\theta = (\\theta^1, \\ldots, \\theta^m)$ whose components correspond to the weights and thresholds of the network. When an input $x$ is given, the network produces an output\n\n$$y = f(x, \\theta) + \\eta(x), \\qquad (1)$$\n\nwhere $\\eta(x)$ is noise whose conditional distribution is given by $a(\\eta|x)$. Then the conditional distribution of the network, which specifies the input-output relation, is given by\n\n$$p(y|x, \\theta) = a(y - f(x, \\theta)|x). \\qquad (2)$$\n\nWe define a training sample $e = \\{(x_1, y_1), \\ldots, (x_t, y_t)\\}$ as a set of $t$ examples generated from the true conditional distribution $q(y|x)$, where each $x_i$ is generated independently from a probability distribution $r(x)$. We should note that both $r(x)$ and $q(y|x)$ are unknown and that we need not assume the faithfulness of the model; that is, we do not assume that there exists a parameter $\\theta^*$ which realizes the true distribution $q(y|x)$ such that $p(y|x, \\theta^*) = q(y|x)$.\n\nOur purpose is to find an appropriate parameter $\\theta$ which realizes a good approximation $p(y|x, \\theta)$ to $q(y|x)$. 
For this purpose, we use a loss function\n\n$$L(\\theta) = D(r; q|p(\\theta)) + S(\\theta) \\qquad (3)$$\n\nas a criterion to be minimized, where $D(r; q|p(\\theta))$ represents a general divergence measure between two conditional probabilities $q(y|x)$ and $p(y|x, \\theta)$ in the expectation form under the true input-output probability\n\n$$D(r; q|p(\\theta)) = \\int r(x) q(y|x) k(x, y, \\theta) \\, dx \\, dy \\qquad (4)$$\n\nand $S(\\theta)$ is a regularization term to fit the smoothness condition of outputs (Moody [1992]). So the loss function is rewritten in an expectation form\n\n$$L(\\theta) = \\int r(x) q(y|x) d(x, y, \\theta) \\, dx \\, dy, \\qquad d(x, y, \\theta) = k(x, y, \\theta) + S(\\theta), \\qquad (5)$$\n\nand $d(x, y, \\theta)$ is called the pointwise loss function.\n\nA typical case of the divergence $D$ of the multi-layered network $f(x, \\theta)$ with noise is the squared error\n\n$$D(r; q|p(\\theta)) = \\int r(x) q(y|x) \\|y - f(x, \\theta)\\|^2 \\, dx \\, dy. \\qquad (6)$$\n\nThe error function of an ordinary multi-layered network is in this form, and the conventional Back-Propagation method is derived from this type of loss function.\n\nAnother typical case is the Kullback-Leibler divergence\n\n$$D(r; q|p(\\theta)) = \\int r(x) q(y|x) \\log \\frac{q(y|x)}{p(y|x, \\theta)} \\, dx \\, dy. \\qquad (7)$$\n\nThe integral $\\int r(x) q(y|x) \\log q(y|x) \\, dx \\, dy$ is a constant called a conditional entropy, and we usually use the following abbreviated form instead of the previous divergence:\n\n$$D(r; q|p(\\theta)) = -\\int r(x) q(y|x) \\log p(y|x, \\theta) \\, dx \\, dy. \\qquad (8)$$\n\nNext, we define an optimum of the parameter in the sense of the loss function that we introduced. We denote by $\\theta^*$ the optimal parameter that minimizes the loss function $L(\\theta)$, that is,\n\n$$L(\\theta^*) = \\min_{\\theta} L(\\theta), \\qquad (9)$$\n\nand we regard $p(y|x, \\theta^*)$ as the best realization of the model. 
\n\nWhen a training sample $e$ is given, we can also define an empirical loss function:\n\n$$\\hat{L}(\\theta) = D(\\hat{r}; \\hat{q}|p(\\theta)) + S(\\theta), \\qquad (10)$$\n\nwhere $\\hat{r}$, $\\hat{q}$ are the empirical distributions given by the sample $e$, that is,\n\n$$D(\\hat{r}; \\hat{q}|p(\\theta)) = \\frac{1}{t} \\sum_{i=1}^{t} k(x_i, y_i, \\theta), \\qquad (x_i, y_i) \\in e. \\qquad (11)$$\n\nIn practical cases, we consider the empirical loss function and search for the quasi-optimal parameter $\\hat{\\theta}$ defined by\n\n$$\\hat{L}(\\hat{\\theta}) = \\min_{\\theta} \\hat{L}(\\theta), \\qquad (12)$$\n\nbecause the true distributions $r(x)$ and $q(y|x)$ are unknown and we can only use examples $(x_i, y_i)$ observed from the true distribution $r(x)q(y|x)$. We should note that the quasi-optimal parameter $\\hat{\\theta}$ is a random variable depending on the sample $e$, each element of which is chosen randomly.\n\nThe following lemma guarantees that we can use the empirical loss function instead of the actual loss function when the number of examples $t$ is large.\n\nLemma 1 If the number of examples $t$ is large enough, it is shown that the quasi-optimal parameter $\\hat{\\theta}$ is normally distributed around the optimal parameter $\\theta^*$, that is,\n\n$$\\hat{\\theta} \\sim N\\left(\\theta^*, \\frac{1}{t} Q^{-1} G Q^{-1}\\right), \\qquad (13)$$\n\nwhere\n\n$$G = \\int r(x) q(y|x) \\nabla d(x, y, \\theta^*) \\nabla d(x, y, \\theta^*)^T \\, dx \\, dy, \\qquad (14)$$\n\n$$Q = \\int r(x) q(y|x) \\nabla\\nabla d(x, y, \\theta^*) \\, dx \\, dy, \\qquad (15)$$\n\nand $\\nabla$ denotes the differential operator with respect to $\\theta$.\n\nThis lemma is proved by using the usual statistical methods.\n\n3 LEARNING PROCEDURE\n\nIn many cases, however, it is difficult to obtain the quasi-optimal parameter $\\hat{\\theta}$ by minimizing equation (10) directly. We therefore often use a stochastic descent method to get an approximation to the quasi-optimal parameter $\\hat{\\theta}$. 
\n\nDefinition 1 (Stochastic Descent Method) In each learning step, an example is re-sampled from the given sample $e$ randomly, and the following modification is applied to the parameter $\\theta_n$ at step $n$,\n\n$$\\theta_{n+1} = \\theta_n - \\varepsilon \\nabla d(x_{i(n)}, y_{i(n)}, \\theta_n), \\qquad (16)$$\n\nwhere $\\varepsilon$ is a positive value called a learning coefficient and $(x_{i(n)}, y_{i(n)})$ is the re-sampled example at step $n$.\n\nThis is a sequential learning method, and the operation of random sampling from $e$ in each learning step is called the re-sampling plan. The parameter $\\theta_n$ at step $n$ is a random variable as a function of the re-sampled sequence $\\omega = \\{(x_{i(1)}, y_{i(1)}), \\ldots, (x_{i(n)}, y_{i(n)})\\}$. However, if the initial value of $\\theta$ is appropriate (this assumption prevents being stuck in local minima) and if the learning step $n$ is large enough, it is shown that the learned parameter $\\theta_n$ is normally distributed around the quasi-optimal parameter.\n\nLemma 2 If the learning step $n$ is large enough and the learning coefficient $\\varepsilon$ is small enough, the parameter $\\theta_n$ is normally distributed asymptotically, that is,\n\n$$\\theta_n \\sim N(\\hat{\\theta}, \\varepsilon V), \\qquad (17)$$\n\nwhere $V$ satisfies the following relation\n\n$$\\hat{G} = \\hat{Q} V + V \\hat{Q}, \\qquad (18)$$\n\n$$\\hat{G} = \\frac{1}{t} \\sum_{i=1}^{t} \\nabla d(x_i, y_i, \\hat{\\theta}) \\nabla d(x_i, y_i, \\hat{\\theta})^T, \\qquad \\hat{Q} = \\frac{1}{t} \\sum_{i=1}^{t} \\nabla\\nabla d(x_i, y_i, \\hat{\\theta}).$$\n\nIn the following discussion, we assume that $n$ is large enough and $\\varepsilon$ is small enough, and we denote the learned parameter by\n\n$$\\tilde{\\theta} = \\theta_n. \\qquad (19)$$\n\nThe distribution of the random variable $\\tilde{\\theta}$, therefore, can be regarded as the normal distribution $N(\\hat{\\theta}, \\varepsilon V)$.\n\n4 LEARNING CURVES\n\nIt is important to evaluate the difference between the two quantities $L(\\tilde{\\theta})$ and $\\hat{L}(\\tilde{\\theta})$. The quantity $L(\\tilde{\\theta})$ is called the predictive loss or the generalization error, which shows the average loss of the trained network when a novel example is given. 
On the other hand, the quantity $\\hat{L}(\\tilde{\\theta})$ is called the training loss or the training error, which shows the average loss evaluated by the examples used in training. Since these quantities depend on the sample $e$ and the re-sampled sequence $\\omega$, we take the expectation $E$ and the variance $\\mathrm{Var}$ with respect to the sample $e$ and the re-sampled sequence $\\omega$.\n\nFirst, let us consider the predictive loss, which is the average loss of the trained network when a new example (which does not belong to the sample $e$) is given. This averaging operation is replaced by averaging over all the input-output pairs, because the measure of the sample $e$ is zero. Then the predictive loss is written as\n\n$$L(\\tilde{\\theta}) = \\int r(x) q(y|x) d(x, y, \\tilde{\\theta}) \\, dx \\, dy. \\qquad (20)$$\n\nFrom the properties of $\\hat{\\theta}$ and $\\tilde{\\theta}$, we can prove the following important relations.\n\nTheorem 1 The predictive loss asymptotically satisfies\n\n$$E[L(\\tilde{\\theta})] = L(\\theta^*) + \\frac{1}{2t} \\mathrm{tr}\\, G Q^{-1} + \\frac{\\varepsilon}{2} \\mathrm{tr}\\, Q V, \\qquad (21)$$\n\n$$\\mathrm{Var}[L(\\tilde{\\theta})] = \\frac{1}{2t^2} \\mathrm{tr}\\, G Q^{-1} G Q^{-1} + \\frac{\\varepsilon^2}{2} \\mathrm{tr}\\, Q V Q V + \\frac{\\varepsilon}{t} \\mathrm{tr}\\, G V. \\qquad (22)$$\n\nRoughly speaking, there exist two random variables $Y_1$ and $Y_2$, and the predictive loss can be written in the following form:\n\n$$L(\\tilde{\\theta}) = L(\\theta^*) + \\frac{1}{2t} \\mathrm{tr}\\, G Q^{-1} + \\frac{\\varepsilon}{2} \\mathrm{tr}\\, Q V + \\frac{1}{t} Y_1 + \\varepsilon Y_2 + o_p\\left(\\frac{1}{t}\\right) + o_p(\\varepsilon), \\qquad (23)$$\n\nwhere $Y_1$ and $Y_2$ satisfy\n\n$$E[Y_1] = 0, \\qquad E[Y_2] = 0, \\qquad \\mathrm{Var}[Y_1] = \\frac{1}{2} \\mathrm{tr}\\, G Q^{-1} G Q^{-1}, \\qquad \\mathrm{Var}[Y_2] = \\frac{1}{2} \\mathrm{tr}\\, Q V Q V, \\qquad \\mathrm{Cov}[Y_1, Y_2] = \\frac{1}{2} \\mathrm{tr}\\, G V.$$\n\n$E$, $\\mathrm{Var}$ and $\\mathrm{Cov}$ denote the expectation, the variance and the covariance respectively.\n\nNext, we consider the training loss, i.e., the average loss evaluated by the examples used in tr