W.

(A.3) Let {r_j(x, w*); j = 1, 2, ..., d} be the associated convergence radii of h(x, w) at w*; in other words, the Taylor expansion of h(x, w) at w* = (w*_1, ..., w*_d),

h(x, w) = Σ_{k1,...,kd=0}^∞ a_{k1 k2 ... kd}(x) (w_1 − w*_1)^{k1} (w_2 − w*_2)^{k2} ⋯ (w_d − w*_d)^{kd},

converges absolutely in |w_j − w*_j| < r_j(x, w*). Assume inf_{x∈Q} inf_{w*∈W} r_j(x, w*) > 0 for j = 1, 2, ..., d.

Theorem 1  Assume (A.1), (A.2), and (A.3). Then there exist a rational number λ1 > 0, a natural number m1, and a constant C such that

|F(n) − λ1 log n + (m1 − 1) log log n| < C

holds for an arbitrary natural number n.

Remarks. (1) If q(x) has compact support, then assumption (A.3) is automatically satisfied. (2) Without assumptions (A.1) and (A.3), we can still prove the upper bound F(n) ≤ λ1 log n − (m1 − 1) log log n + const.

358 S. Watanabe

From Theorem 1, if the generalization error K(n) has an asymptotic expansion, then it should be

K(n) = λ1/n − (m1 − 1)/(n log n) + o(1/(n log n)).

As is well known, if the model is identifiable and has a positive definite Fisher information matrix, then λ1 = d/2 (d is the dimension of the parameter space) and m1 = 1. However, hierarchical learning models such as multi-layer perceptrons, radial basis functions, and normal mixtures have smaller λ1 and larger m1; in other words, hierarchical models are better learning machines than regular ones if Bayes estimation is applied. The constants λ1 and m1 are characterized by the following theorem.

Theorem 2  Assume the same conditions as Theorem 1. Let ε > 0 be a sufficiently small constant. The function

J(z) = ∫_{H(w)<ε} H(w)^z φ(w) dw,

which is holomorphic in Re(z) > 0, can be analytically continued to a meromorphic function on the entire complex plane whose poles are negative rational numbers. Let −λ1 be its largest pole and m1 the order of that pole; these values coincide with λ1 and m1 in Theorem 1.

4 Resolution of Singularities

In this section, we construct a method to calculate λ1 and m1. First of all, we cover the compact set W0 with a finite union of open sets W_α.
In other words, W0 ⊂ ∪_α W_α. Hironaka's resolution of singularities [5][2] ensures that, for an arbitrary analytic function H(w), we can algorithmically find an open set U_α ⊂ R^d (U_α contains the origin) and an analytic function g_α : U_α → W_α such that

H(g_α(u)) = a(u) u_1^{k1} u_2^{k2} ⋯ u_d^{kd}   (u ∈ U_α)   (7)

where a(u) > 0 is a positive function and the k_i ≥ 0 (1 ≤ i ≤ d) are even integers (a(u) and the k_i depend on U_α). Note that the Jacobian satisfies |g_α'(u)| = 0 if and only if u ∈ g_α^{−1}(W0). Writing the Jacobian in normal crossing form as well, the contribution of the chart U_α becomes

J_α(z) = ∫_{U_α} a(u)^z {u_1^{k1} u_2^{k2} ⋯ u_d^{kd}}^z |u_1^{h1} u_2^{h2} ⋯ u_d^{hd}| du_1 du_2 ⋯ du_d.

For real z, max_α J_α(z) ≤ J(z) ≤ Σ_α J_α(z), hence

λ1 = min_α min_{1≤q≤d} (h_q + 1)/k_q

(the inner minimum being taken over q with k_q > 0), and m1 is equal to the number of q which attain the minimum min_{1≤q≤d}.

Remark. In a neighborhood of w0 ∈ W0, the analytic function H(w) is equivalent to a polynomial H_{w0}(w); in other words, there exist constants c1, c2 > 0 such that c1 H_{w0}(w) ≤ H(w) ≤ c2 H_{w0}(w). Hironaka's theorem constructs the resolution map g_α for any polynomial H_{w0}(w) algorithmically in finitely many procedures (blowing-ups for nonsingular manifolds in the singularities are applied recursively [5]). From the above discussion, we obtain the inequality 1 ≤ m1 ≤ d. Moreover, since there exists γ > 0 such that H(w) ≤ γ|w − w0|² in a neighborhood of w0 ∈ W0, we obtain λ1 ≤ d/2.

Example. Let us consider a model with (x, y) ∈ R² and w = (a, b, c, d) ∈ R⁴,

p(x, y|w) = p0(x) (1/(2π)^{1/2}) exp(−(1/2)(y − ψ(x, w))²),

ψ(x, a, b, c, d) = a tanh(bx) + c tanh(dx),

Algebraic Analysis for Non-regular Learning Machines 361

where p0(x) is a probability density with compact support (not estimated). We also assume that the true regression function is y = ψ(x, 0, 0, 0, 0). The set of true parameters is

W0 = {w : E_x[ψ(X, a, b, c, d)²] = 0} = {w : ab + cd = 0 and ab³ + cd³ = 0}.

Assumptions (A.1), (A.2), and (A.3) are satisfied.
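Once a resolution of singularities is in hand, λ1 and m1 follow from elementary arithmetic on the chart exponents. The sketch below, a minimal illustration rather than anything from the paper itself, computes min_α min_q (h_q + 1)/k_q with exact rationals; the function name and the sample exponents are my own. The exponents used in the call are what I obtain by hand for this tanh model in one chart of its resolution (monomial x²y⁶w² for the loss, Jacobian x y³ w), so treat them as a worked illustration, not a quotation.

```python
from fractions import Fraction

def lambda1_m1(charts):
    """charts: list of (k, h) pairs, one per local chart U_alpha.

    k[q]: exponent of u_q in H(g_alpha(u)) = a(u) * u^k (normal crossing form).
    h[q]: exponent of u_q in the Jacobian |g_alpha'(u)|.

    Each coordinate with k[q] > 0 contributes a pole of J_alpha(z) at
    z = -(h[q] + 1)/k[q].  lambda_1 is the smallest such ratio over all
    charts; m_1 is the number of coordinates attaining it in the chart
    realizing the minimum (the order of the largest pole of J(z))."""
    lam, m = None, 0
    for k, h in charts:
        ratios = [Fraction(hq + 1, kq) for kq, hq in zip(k, h) if kq > 0]
        if not ratios:
            continue  # chart where H is nowhere zero: no pole contributed
        chart_min = min(ratios)
        chart_mult = ratios.count(chart_min)
        if lam is None or chart_min < lam:
            lam, m = chart_min, chart_mult
        elif chart_min == lam:
            m = max(m, chart_mult)
    return lam, m

# Exponents worked out by hand (an assumption of this sketch) for one chart
# of the tanh example, in coordinates (x, y, z, w):
#   H0(g(u)) ~ x^2 y^6 w^2, Jacobian ~ |x y^3 w|,
# i.e. k = (2, 6, 0, 2), h = (1, 3, 0, 1).
lam, m = lambda1_m1([((2, 6, 0, 2), (1, 3, 0, 1))])
print(lam, m)  # 2/3 1
```

The resulting λ1 = 2/3 is strictly smaller than d/2 = 2, illustrating the claim above that singular models attain a smaller λ1 than regular ones; for a regular one-parameter model with H(w) ~ w², the same routine with k = (2,), h = (0,) returns λ1 = 1/2 = d/2.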
The singularity in W0 which gives the smallest λ1 is the origin, and the average loss function in a neighborhood of the origin is equivalent to the polynomial H0(a, b, c, d) = (ab + cd)² + (ab³ + cd³)² (see [9]). Using blowing-ups, we find a map g : (x, y, z, w) ↦ (a, b, c, d),

a = x,  b = y³w − yzw,  c = zwx,  d = y,

by which the singularity at the origin is resolved, and

J(z) = ∫ H0(a, b, c, d)^z da db dc dd

can be evaluated in the resolved coordinates. We also have

H(w) = ∫ q(x) log(q(x)/p(x, w)) dx ≥ (1/2) ∫ q(x) (log(q(x)/p(x, w)))² dx ≥ a0 Σ_{j=1}^J g_j(w)²,

where a0 > 0 is the smallest eigenvalue of the positive definite symmetric matrix E_x{f_j(X, w0) f_k(X, w0)}. Lastly, combining

A(X^n) = sup_{H(w)<ε} n Ĥ(w)²/H(w) ≤ (2/a0) sup_{H(w)<ε} Σ_{j=1}^J {(1/√n) Σ_{i=1}^n (f_j(X_i, w) − E_x f_j(X, w))}²

with eq. (10), we obtain eq. (5).

Acknowledgments

This research was partially supported by the Ministry of Education, Science, Sports and Culture in Japan, Grant-in-Aid for Scientific Research 09680362.

References

[1] Amari, S., Murata, N. (1993) Statistical theory of learning curves under entropic loss. Neural Computation, 5 (4), pp. 140-153.

[2] Atiyah, M. F. (1970) Resolution of singularities and division of distributions. Comm. Pure and Appl. Math., 23, pp. 145-150.

[3] Fukumizu, K. (1999) Generalization error of linear neural networks in unidentifiable cases. Lecture Notes in Computer Science, 1720, Springer, pp. 51-62.

[4] Hagiwara, K., Toda, N., Usui, S. (1993) On the problem of applying AIC to determine the structure of a layered feed-forward neural network. Proc. of IJCNN, 3, pp. 2263-2266.

[5] Hironaka, H. (1964) Resolution of singularities of an algebraic variety over a field of characteristic zero, I, II. Annals of Math., 79, pp. 109-326.

[6] Kashiwara, M. (1976) B-functions and holonomic systems. Invent. Math., 38, pp. 33-53.

[7] Oaku, T. (1997) An algorithm of computing b-functions. Duke Math.
J., 87, pp. 115-132.

[8] Sato, M., Shintani, T. (1974) On zeta functions associated with prehomogeneous vector spaces. Annals of Math., 100, pp. 131-170.

[9] Watanabe, S. (1998) On the generalization error by a layered statistical model with Bayesian estimation. IEICE Trans., J81-A, pp. 1442-1452. (English version: Elect. Comm. in Japan, to appear.)

[10] Watanabe, S. (1999) Algebraic analysis for singular statistical estimation. Lecture Notes in Computer Science, 1720, Springer, pp. 39-50.