{"title": "Statistical Mechanics of Learning in a Large Committee Machine", "book": "Advances in Neural Information Processing Systems", "page_first": 523, "page_last": 530, "abstract": null, "full_text": "Statistical Mechanics of Learning \nLarge Committee Machine \n\n\u2022 In a \n\nHolm Schwarze \n\nCONNECT, The Niels Bohr Institute \n\nBlegdamsvej 17, DK-2100 Copenhagen 0, Denmark \n\nJohn Hertz\u00b7 \n\nBlegdamsvej 17, DK-2100 Copenhagen 0, Denmark \n\nNordita \n\nAbstract \n\nWe use statistical mechanics to study generalization in large com(cid:173)\nmittee machines. For an architecture with nonoverlapping recep(cid:173)\ntive fields a replica calculation yields the generalization error in the \nlimit of a large number of hidden units. For continuous weights the \ngeneralization error falls off asymptotically inversely proportional \nto Q, the number of training examples per weight. For binary \nweights we find a discontinuous transition from poor to perfect \ngeneralization followed by a wide region of metastability. Broken \nreplica symmetry is found within this region at low temperatures. \nFor a fully connected architecture the generalization error is cal(cid:173)\nculated within the annealed approximation. For both binary and \ncontinuous weights we find transitions from a symmetric state to \none with specialized hidden units, accompanied by discontinuous \ndrops in the generalization error. \n\n1 \n\nIntroduction \n\nThere has been a good deal of theoretical work on calcula.ting the generalization \nability of neural networks within the fra.mework of statistical mechanics (for a review \n\n\u2022 Address in 1993: Laboratory of Neuropsychology, NIMH, Bethesda, MD 20892, USA \n\n523 \n\n\f524 \n\nSchwarze and Hertz \n\nsee e.g. Watkin et.al., 1992; Seung et.al., 1992). This approach has mostly been \napplied to single-layer nets (e.g. Gyorgyi and Tishby, 1990; Seung et.al., 1992). 
\nExtensions to networks with a hidden layer include a model with small hidden \nreceptive fields (Sompolinskyand Tishby, 1990), some general results on networks \nwhose outputs are continuous functions of their inputs (Seung et.al., 1992; Krogh \nand Hertz, 1992), and calculations for a so-called committee machine (Nilsson, \n1965), a two-layer Boolean network, which implements a majority decision of the \nhidden units (Schwarze et.al., 1992; Schwarze and Hertz, 1992; Mato and Parga, \n1992; Barkai et.al., 1992; Engel et.al., 1992). This model has previomlly been studied \nwhen learning a function which could be implemented by a simple perceptron (i.e. \none with no hidden units) in the high-temperature (i.e. high-noise) limit (Schwarze \net.al., 1992). In most practical applications, however, the function to be learnt is \nnot linearly separable. Therefore, we consider here a committee machine trained on \na rule which itself is defined by another committee machine (the 'teacher' network) \nand hence not linearly separable. \n\nWe calculate the generalization error, the probability of misclassifying an arbitrary \nnew input, as a function of 0, the ratio of the number of training examples P to the \nnumber of adjustable weights in the network. First we present results for the 'tree' \ncommittee machine, a restricted version of the model in which the receptive fields of \nthe hidden units do not overlap. In section 3 we study a fully connected architecture \nallowing for correlations between different hidden units in the student network. In \nboth cases we study a large-net limit in which the total number of inputs (N) and \nthe number of hidden units (K) both go to infinity, but with K \u00ab: N. \n\n2 Committee machine with nonoverlapping receptive fields \n\nIn this model each hidden unit receives its input from N I K input units, subject to \nthe restriction that different hidden units do not share common inputs. 
Therefore there is only one path from each input unit to the output. The hidden-output weights are all fixed to +1 so as to implement a majority decision of the hidden units. The overall network output for inputs S_l ∈ R^{N/K}, l = 1, ..., K, to the K branches is given by \n\nσ({S_l}) = sign( (1/√K) Σ_{l=1}^K σ_l(S_l) ),   (1) \n\nwhere σ_l is the output of the lth hidden unit, given by \n\nσ_l(S_l) = sign( √(K/N) W_l · S_l ).   (2) \n\nHere W_l is the N/K-dimensional weight vector connecting the input with the lth hidden unit. The training examples ({ξ^μ_l}, τ({ξ^μ_l})), μ = 1, ..., P, are generated by another committee machine with weight vectors V_l and an overall output τ({ξ^μ_l}) defined analogously to (1). There are N adjustable weights in the network, and therefore we have α = P/N. \n\nAs in the corresponding calculations for simple perceptrons (Gardner and Derrida, 1988; Gyorgyi and Tishby, 1990; Seung et al., 1992), we consider a stochastic learning algorithm which for long training times yields a Gibbs distribution of networks. The statistical mechanics approach starts out from the partition function Z = ∫ dμ_0({W_l}) e^{−βE({W_l})}, an integral over weight space with a priori measure μ_0({W_l}), weighted with a thermal factor e^{−βE({W_l})}, where E is the total error on the training examples \n\nE({W_l}) = Σ_{μ=1}^P Θ[−σ({ξ^μ_l}) τ({ξ^μ_l})].   (3) \n\nThe formal temperature T = 1/β defines the level of noise during the training process. For T = 0 this procedure corresponds to simply minimizing the training error E. \n\nFrom this the average free energy F = −T ⟨⟨ln Z⟩⟩, averaged over all possible sets of training examples, can be calculated using the replica method (for details see Schwarze and Hertz, 1992). Like the calculations for simple perceptrons, our theory has two sets of order parameters: \n\nq_l^{αβ} = (K/N) W_l^α · W_l^β,   R_l^α = (K/N) W_l^α · V_l. \n\nNote that these are the only order parameters in this model. Due to the tree structure no correlations between different hidden units exist. Assuming both replica symmetry and 'translational symmetry' we are left with two parameters: q, the pattern average of the square of the average input-hidden weight vector, and R, the average overlap between this weight vector and a corresponding one for the teacher. \n\nWe then obtain expressions for the replica-symmetric free energy of the form G(q, R, q̂, R̂) = α G_1(q, R) + G_2(q, R, q̂, R̂), where the 'entropy' terms G_2 for the continuous- and binary-weight cases are exactly the same as in the simple perceptron (Gyorgyi and Tishby, 1990; Seung et al., 1992). In the large-K limit another simplification, similar to the zero-temperature capacity calculation (Barkai et al., 1992), is found in the tree model. The 'energy' term G_1 is the same as the corresponding term in the calculation for the simple perceptron, except that the order parameters have to be replaced by f(q) = (2/π) sin^{−1} q and f(R) = (2/π) sin^{−1} R. The generalization error \n\nε_g = (1/π) arccos[f(R)]   (4) \n\ncan then be obtained from the value of R at the saddle point of the free energy. \n\nFor a network with continuous weights, the solution of the saddle point equations yields an algebraically decreasing generalization error. There is no phase transition at any value of α or T. For T = 0 the asymptotic form of the generalization error in powers of 1/α can easily be obtained as 1.25/α + O(1/α²), twice the ε_g found for the simple perceptron in this limit. \n\nFigure 1: Learning curve for the large-K tree committee (solid line) with binary weights at T = 1. 
The phase transition occurs at α_c = 1.98, and the spinodal point is at α_s = 3.56. The analytic results are compared with Monte Carlo simulations with K = 9, N = 75 and T = 1, averaged over 10 runs. In each simulation the number of training examples is gradually increased (dotted line) and decreased (dashed line), respectively. The broken line shows the generalization error for the simple perceptron. \n\nIn contrast, the model with binary weights exhibits a phase transition at all temperatures from poor to perfect generalization. The corresponding generalization error as a function of α is shown in figure 1. At small values of α the free energy has two saddle points, one at R < 1 and the other at R = 1. Initially the solution with R < 1 and poor generalization ability has the lower free energy and therefore corresponds to the equilibrium state. When the load parameter is increased to a critical value α_c, the situation changes and the solution at R = 1 becomes the global minimum of the free energy. The system exhibits a first-order phase transition to the state of perfect generalization. In the region α_c < α < α_s the R < 1 solution remains metastable and disappears at the spinodal point α_s. We find the same qualitative picture at all temperatures, and the complete replica-symmetric phase diagram is shown in figure 2. The solid line corresponds to the phase transition to perfect generalization, and in the region between the solid and the dashed lines the R < 1 state of poor generalization is metastable. Below the dotted line, the replica-symmetric solution yields a negative entropy for the metastable state. This is unphysical in a binary system and replica symmetry has to be broken in this region, indicating the existence of many different metastable states. \n\nThe simple perceptron without hidden units corresponds to the case K = 1 in our model. 
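As an aside, the generalization error formula (4) is straightforward to evaluate numerically; the short script below is our own illustration (not part of the paper) and checks its limiting values:

```python
import numpy as np

def f(x):
    """Large-K order-parameter mapping, f(x) = (2/pi) arcsin(x)."""
    return (2.0 / np.pi) * np.arcsin(x)

def eps_g(R):
    """Generalization error of the large-K tree committee, eq. (4)."""
    # clip guards against rounding pushing f(R) slightly outside [-1, 1]
    return np.arccos(np.clip(f(R), -1.0, 1.0)) / np.pi

print(eps_g(0.0))  # 0.5: chance-level performance at zero teacher overlap
print(eps_g(1.0))  # ~0.0: perfect generalization at full overlap
```

As expected, ε_g decreases monotonically from 1/2 at R = 0 to 0 at R = 1.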
A comparison of the generalization properties with the large-K limit shows that both limits exhibit qualitatively similar behavior. The locations of the thermodynamic transitions and the spinodal line, however, are different, and the generalization error of the R < 1 state in the large-K committee machine is higher than in the simple perceptron. \n\nThe case of general finite K is rather more involved, but the annealed approximation for finite K indicates a rather smooth K-dependence for 1 < K < ∞ (Mato and Parga, 1992). \n\nFigure 2: Replica-symmetric phase diagram of the large-K tree committee machine with binary weights. The solid line shows the locations of the phase transition, and the spinodal line is shown dashed. Below the dotted line the replica-symmetric solution is incorrect. (Regions shown: R < 1 metastability, RSB, and R = 1.) \n\nWe performed Monte Carlo simulations to check the validity of the assumptions made in our calculation and found good agreement with our analytic results. Figure 1 compares the analytic predictions for large K with Monte Carlo simulations for K = 9. The simulations were performed for a slowly increasing and decreasing training set size, respectively, yielding a hysteresis loop around the location of the phase transition. \n\n3 Fully connected committee machine \n\nIn contrast to the previous model, the hidden units in the fully connected committee machine receive inputs from the entire input layer. Their output for a given N-dimensional input vector S is given by \n\nσ_l(S) = sign( (1/√N) W_l · S ),   (5) \n\nwhile the overall output is again of the form (1). 
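The output rule just defined, hidden units (5) feeding the majority vote (1), can be sketched concretely in a few lines of NumPy; this is a hypothetical illustration of ours, not code from the paper, and the names are our own:

```python
import numpy as np

def committee_output(W, S):
    """Output of a fully connected committee machine.

    W : (K, N) array of hidden-unit weight vectors
    S : (N,) input vector
    Each of the K hidden units sees the full N-dimensional input,
    as in eq. (5); the output is their majority vote, eq. (1).
    """
    K, N = W.shape
    # Hidden-unit outputs, eq. (5): sigma_l = sign(W_l . S / sqrt(N))
    hidden = np.sign(W @ S / np.sqrt(N))
    # Majority decision of the hidden units, eq. (1)
    return np.sign(hidden.sum() / np.sqrt(K))

# Hypothetical usage: a random teacher classifying a random input
rng = np.random.default_rng(0)
K, N = 9, 45
V = rng.standard_normal((K, N))   # teacher weight vectors
S = rng.standard_normal(N)        # one input pattern
print(committee_output(V, S))     # either 1.0 or -1.0
```

With K odd the hidden-unit votes can never tie, so the output is always ±1.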
Note that the weight vectors W_l are now N-dimensional, and the load parameter is given by α = P/(KN). \n\nFor this model we solved the annealed approximation, which replaces ⟨⟨ln Z⟩⟩ by ln ⟨⟨Z⟩⟩. This approximation becomes exact at high temperatures (high noise level during training). For learnable target rules, as in the present problem, previous work indicates that the annealed approximation yields qualitatively correct results and correctly predicts the shape of the learning curves even at low temperatures (Seung et al., 1992). Performing the average over all possible training sets again leads to two sets of order parameters: the overlaps between the student and teacher weight vectors, R_lk = N^{−1} W_l · V_k, and the mutual overlaps in the student network, C_lk = N^{−1} W_l · W_k. The weight vectors of the target rule are assumed to be uncorrelated and normalized, N^{−1} V_l · V_k = δ_lk. As in the previous model we make symmetry assumptions for the order parameters. In the fully connected architecture we have to allow for correlations between different hidden units (R_lk, C_lk ≠ 0 for l ≠ k) but also include the possibility of a specialization of individual units (R_ll ≠ R_lk). This is necessary because the ground state of the system with vanishing generalization error is achieved for the choice R_lk = C_lk = δ_lk. Therefore we make the ansatz \n\nR_lk = R + Δ δ_lk,   C_lk = C + (1 − C) δ_lk   (6) \n\nand evaluate the annealed free energy of the system using the saddle point method (details will be reported elsewhere). The values of the order parameters at the minimum of the free energy finally yield the average generalization error ε_g as a function of α. \n\nFor a network with continuous weights and small α the global minimum of the free energy occurs at Δ = 0 and R ∼ O(K^{−3/4}). 
Hence, for small training sets each hidden unit in the student network has a small symmetric overlap with all the hidden units in the teacher network. The information obtained from the training examples is not sufficient for a specialization of hidden units, and the generalization error approaches a plateau. To order 1/√K, this approach is given by \n\nε_g = ε_0 + γ(β)/(α√K) + O(1/K),   ε_0 = (1/π) arccos(√(2/π)) ≈ 0.206,   (7) \n\nwith γ(β) = √(π/2 − 1) [(1 − e^{−β})^{−1} − ε_0]/(4π). Figure 3 shows the generalization error as a function of α, including 1/√K-corrections for different values of K. \n\nFigure 3: Generalization error for continuous weights and T = 0.5. The approach to the residual error is shown including 1/√K-corrections for K = 5 (solid line), K = 11 (dotted line), and K = 100 (dashed line). The broken line corresponds to the solution with nonvanishing Δ. \n\nWhen the training set size is increased to a critical value α_s of the load parameter, a second minimum of the free energy appears at a finite value of Δ close to 1. For a larger value α_c > α_s this becomes the global minimum of the free energy and the system exhibits a first-order phase transition. The generalization error of the specialized solution decays smoothly with an asymptotic behavior inversely proportional to α. However, the poorly-generalizing symmetric state remains metastable for all α > α_c. Therefore, a stochastic learning procedure starting with Δ = 0 will first settle into the metastable state. 
For large N it will take an exponentially long time to cross the free energy barrier to the global minimum of the free energy. \n\nIn a network with binary weights and for large K we find the same initial approach to a finite generalization error as in (7) for continuous weights. In the large-K limit the discreteness of the weights does not influence the behavior for small training sets. However, while a perfect match of the student to the teacher network (R_lk = C_lk = δ_lk) cannot happen for α < ∞ in the continuous model, such a 'freezing' is possible in a discrete system. The free energy of the binary model always has a local minimum at R_lk = C_lk = δ_lk. When the load parameter is increased to a critical value, this minimum becomes the global minimum of the free energy, and a discontinuous transition into this perfectly generalizing state occurs, just as in the binary-weight simple perceptron and the tree described in section 2. As in the case of continuous weights, the symmetric solution remains metastable here even for large values of α. Figure 4 shows the generalization error for binary weights, including 1/√K-corrections for K = 5. The predictions of the large-K theory are compared with Monte Carlo simulations. Although we cannot expect good quantitative agreement for such a small committee, the simulations support our qualitative results. Note that the leading-order correction to ε_0 in eqn. (7) is only small for α ≫ 1/K. However, we have obtained a different solution, which is valid for α ∼ O(1/K). The corresponding generalization error is shown as a dotted line in figure 4. \n\nFigure 4: Generalization error for binary weights at T = 5. 
The large-K theory for different regions of α is compared with simulations for K = 5 and N = 45, averaged over all simulations (+) and over simulations in which no freezing occurred (*), respectively. The solid line shows the finite-α results including 1/√K-corrections. The dotted line shows the small-α solution. \n\nCompared to the tree model the fully connected committee machine shows a qualitatively different behavior. This difference is particularly pronounced in the continuous model. While the generalization error of the tree architecture decays smoothly for all values of α, the fully connected model exhibits a discontinuous phase transition. Compared to the tree model, the fully connected architecture has an additional symmetry, because each permutation of hidden units in the student network yields the same output for a given input (Barkai et al., 1992). This additional degree of freedom causes the poor generalization ability for small training sets. Only if the training set size is sufficiently large can the hidden units specialize on one of the hidden units in the teacher network and achieve good generalization. However, the poorly generalizing states remain metastable even for arbitrarily large α. A similar phenomenon has also been found in a different architecture with only 2 hidden units performing a parity operation (Hansel et al., 1992). \n\nAcknowledgements \n\nH. Schwarze acknowledges support from the EC under the SCIENCE programme and by the Danish Natural Science Council and the Danish Technical Research Council through CONNECT. \n\nReferences \n\nE. Barkai, D. Hansel, and H. Sompolinsky (1992), Phys. Rev. A 45, 4146. \n\nA. Engel, H.M. Kohler, F. Tschepke, H. Vollmayr, and A. Zippelius (1992), Phys. Rev. A 45, 7590. \n\nE. Gardner and B. Derrida (1988), J. Phys. A 21, 271. \n\nG. Gyorgyi and N. Tishby (1990), in Neural Networks and Spin Glasses, edited by W.K. Theumann and R. Koberle (World Scientific, Singapore). \n\nD. Hansel, G. Mato, and C. Meunier (1992), Europhys. Lett. 20, 471. \n\nA. Krogh and J. Hertz (1992), in Advances in Neural Information Processing Systems IV, edited by J.E. Moody, S.J. Hanson, and R.P. Lippmann (Morgan Kaufmann, San Mateo). \n\nG. Mato and N. Parga (1992), J. Phys. A 25, 5047. \n\nN.J. Nilsson (1965), Learning Machines (McGraw-Hill, New York). \n\nH. Schwarze, M. Opper, and W. Kinzel (1992), Phys. Rev. A 45, R6185. \n\nH. Schwarze and J. Hertz (1992), Europhys. Lett. 20, 375. \n\nH.S. Seung, H. Sompolinsky, and N. Tishby (1992), Phys. Rev. A 45, 6056. \n\nH. Sompolinsky and N. Tishby (1990), Europhys. Lett. 13, 567. \n\nT. Watkin, A. Rau, and M. Biehl (1992), to be published in Reviews of Modern Physics. \n", "award": [], "sourceid": 617, "authors": [{"given_name": "Holm", "family_name": "Schwarze", "institution": null}, {"given_name": "John", "family_name": "Hertz", "institution": null}]}