{"title": "Analysis and Comparison of Different Learning Algorithms for Pattern Association Problems", "book": "Neural Information Processing Systems", "page_first": 72, "page_last": 81, "abstract": null, "full_text": "72 \n\nANALYSIS AND COMPARISON OF DIFFERENT LEARNING \nALGORITHMS FOR PATTERN ASSOCIATION PROBLEMS \n\nJ. Bernasconi \n\nBrown Boveri Research Center \nCH-S40S Baden, Switzerland \n\nABSTRACT \n\nWe \n\ninvestigate the behavior of different learning algorithms \n\nfor networks of neuron-like units. As test cases we use simple pat(cid:173)\ntern association problems, such as the XOR-problem and symmetry de(cid:173)\ntection problems. The algorithms considered are either versions of \nthe Boltzmann machine learning rule or based on the backpropagation \nof errors. We also propose and analyze a generalized delta rule for \nlinear threshold units. We \nfind that the performance of a given \nlearning algorithm depends strongly on the type of units used. In \nparticular, we observe that networks with \u00b11 units quite generally \nexhibit a significantly better learning behavior than the correspon(cid:173)\nding 0,1 versions. We also demonstrate that an adaption of the \nweight-structure to \nthe symmetries of the problem can lead to a \ndrastic increase in learning speed. \n\nINTRODUCTION \n\nIn the past few years, a number of learning procedures for \nneural network models with hidden units have been proposed 1 ,2. They \ncan all be considered as strategies to minimize a suitably chosen \nerror measure. Most of these strategies represent local optimization \nprocedures (e.g. gradient descent) and therefore suffer from all the \nproblems with local m1n1ma or cycles. The corresponding learning \nrates, moreover, are usually very slow. \n\nThe performance of a given learning scheme may depend critical(cid:173)\n\nlyon a number of parameters and implementation details. 
General analytical results concerning these dependences, however, are practically non-existent. As a first step, we have therefore attempted to study empirically the influence of some factors that could have a significant effect on the learning behavior of neural network systems. \n\nOur preliminary investigations are restricted to very small networks and to a few simple examples. Nevertheless, we have made some interesting observations which appear to be rather general and which can thus be expected to remain valid also for much larger and more complex systems. \n\nNEURAL NETWORK MODELS FOR PATTERN ASSOCIATION \n\nAn artificial neural network consists of a set of interconnected units (formal neurons). The state of the i-th unit is described by a variable S_i which can be discrete (e.g. S_i = 0,1 or S_i = \u00b11) or continuous (e.g. 0 < S_i < 1 or -1 < S_i < +1), and each connection j->i carries a weight W_ij which can be positive, zero, or negative. \n\n\u00a9 American Institute of Physics 1988 \n\nThe dynamics of the network is determined by a local update rule, \n\nS_i(t+1) = f( sum_j W_ij S_j(t) )   (1) \n\nwhere f is a nonlinear activation function, specifically a threshold function in the case of discrete units and a sigmoid-type function, e.g. \n\nf(x) = 1/(1 + e^-x)   (2) \n\nor \n\nf(x) = tanh(x)   (3) \n\nrespectively, in the case of continuous units. The individual units can be given different thresholds by introducing an extra unit which always has a value of 1. \n\nIf the network is supposed to perform a pattern association task, it is convenient to divide its units into input units, output units, and hidden units. Learning then consists in adjusting the weights in such a way that, for a given input pattern, the network relaxes (under the prescribed dynamics) to a state in which the output units represent the desired output pattern. 
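As an illustrative sketch (the code is ours, and the unit values and weights below are arbitrary examples), the update rule of Eq. (1) for the two unit types can be written as:

```python
import math

def update_discrete(S, W, i):
    """Eq. (1) with a +-1 threshold activation: the i-th unit takes the
    sign of its weighted input (an illustrative tie-break maps 0 to +1)."""
    x = sum(W[i][j] * S[j] for j in range(len(S)))
    return 1 if x >= 0 else -1

def update_continuous(S, W, i):
    """Eq. (1) with a sigmoid-type activation for -1 < S_i < +1 (tanh)."""
    x = sum(W[i][j] * S[j] for j in range(len(S)))
    return math.tanh(x)

# Two units driving unit 0; unit 2 plays the role of the always-on
# bias unit mentioned in the text (its weight acts as a threshold).
S = [0.0, 1.0, 1.0]
W = [[0.0, 2.0, -1.0], [0.0] * 3, [0.0] * 3]
print(update_discrete(S, W, 0))                   # prints 1
print(round(update_continuous(S, W, 0), 3))       # tanh(1.0), about 0.762
```

The extra always-1 unit is how the paper folds thresholds into the weight matrix, so both update functions need no separate bias term.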
\n\nNeural networks learn from examples (input/output pairs) which are presented many times, and a typical learning procedure can be viewed as a strategy to minimize a suitably defined error function F. In most cases, this strategy is a (stochastic) gradient descent method: To a clamped input pattern, randomly chosen from the learning examples, the network produces an output pattern {O_i}. This is compared with the desired output, say {T_i}, and the error F({O_i},{T_i}) is calculated. Subsequently, each weight is changed by an amount proportional to the respective gradient of F, \n\nDelta W_ij = -eta dF/dW_ij   (4) \n\nand the procedure is repeated for a new learning example until F is minimized to a satisfactory level. \n\nIn our investigations, we shall consider two different types of learning schemes. The first is a deterministic version of the Boltzmann machine learning rule1 and has been proposed by Yann Le Cun2. It applies to networks with symmetric weights, W_ij = W_ji, so that an energy \n\nE(S) = - sum_(i,j) W_ij S_i S_j   (5) \n\ncan be associated with each state S = {S_i}. If X refers to the network state when only the input units are clamped and Y to the state when both the input and output units are clamped, the error function is defined as \n\nF = E(Y) - E(X)   (6) \n\nand the gradients are simply given by \n\n- dF/dW_ij = Y_i Y_j - X_i X_j   (7) \n\nThe second scheme, called backpropagation or generalized delta rule1,3, probably represents the most widely used learning algorithm. In its original form, it applies to networks with feedforward connections only, and it uses gradient descent to minimize the mean squared error of the output signal, \n\nF = (1/2) sum_i (T_i - O_i)^2   (8) \n\nFor a weight W_ij 
from an (input or hidden) unit j to an output unit i, we simply have \n\nDelta W_ij = eta (T_i - O_i) f'( sum_k W_ik S_k ) S_j   (9) \n\nwhere f' is the derivative of the nonlinear activation function introduced in Eq. (1), and for weights which do not connect to an output unit, the gradients can successively be determined by applying the chain rule of differentiation. \n\nIn the case of discrete units, f is a threshold function, so that the backpropagation algorithm described above cannot be applied. We remark, however, that the perceptron learning rule4, \n\nDelta W_ij = epsilon (T_i - O_i) S_j   (10) \n\nis nothing else than Eq. (9) with f' replaced by a constant epsilon. Therefore, we propose that a generalized delta rule for linear threshold units can be obtained if f' is replaced by a constant epsilon in all the backpropagation expressions for dF/dW_ij. This generalization of the perceptron rule is, of course, not unique. In layered networks, e.g., the value of the constant which replaces f' need not be the same for the different layers. \n\nANALYSIS OF LEARNING ALGORITHMS \n\nThe proposed learning algorithms suffer from all the problems of gradient descent on a complicated landscape. If we use small weight changes, learning becomes prohibitively slow, while large weight changes inevitably lead to oscillations which prevent the algorithm from converging to a good solution. The error surface, moreover, may contain many local minima, so that gradient descent is not guaranteed to find a global minimum. \n\nThere are several ways to improve a stochastic gradient descent procedure. The weight changes may, e.g., be accumulated over a number of learning examples before the weights are actually changed. Another often used method consists in smoothing the weight changes by overrelaxation, \n\nDelta W_ij(k+1) = -eta dF/dW_ij + alpha Delta W_ij(k)   (11) \n\nwhere Delta W_ij
(k) refers to the weight change after the presentation of the k-th learning example (or group of learning examples, respectively). The use of a weight decay term, \n\nDelta W_ij = -eta dF/dW_ij - beta W_ij   (12) \n\nprevents the algorithm from generating very large weights which may create such high barriers that a solution cannot be found in reasonable time. \n\nSuch smoothing methods suppress the occurrence of oscillations, at least to a certain extent, and thus allow us to use higher learning rates. They cannot prevent, however, that the algorithm may become trapped in a bad local minimum. An obvious way to deal with the problem of local minima is to restart the algorithm with different initial weights or, equivalently, to randomize the weights with a certain probability p during the learning procedure. More sophisticated approaches involve, e.g., the use of hill-climbing methods. \n\nThe properties of the error-surface over the weight space not only depend on the choice of the error function F, but also on the network architecture, on the type of units used, and on possible restrictions concerning the values which the weights are allowed to assume. \n\nThe performance of a learning algorithm thus depends on many factors and parameters. These dependences are conveniently analyzed in terms of the behavior of an appropriately defined learning curve. For our small examples, where the learning set always consists of all input/output cases, we have chosen to represent the performance of a learning procedure by the fraction of networks that are \"perfect\" after the presentation of N input patterns. (Perfect networks are networks which for every input pattern produce the correct output.) Such learning curves give us much more detailed information about the behavior of the system than, e.g., averaged quantities like the mean learning time. 
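The smoothing rules above, Eqs. (11) and (12), are one-line modifications of the plain gradient step of Eq. (4); the following sketch (function names and parameter values are our own illustrative choices, not the paper's) shows both for a single weight.

```python
def momentum_step(grad, prev_dw, eta=0.1, alpha=0.9):
    """Eq. (11): overrelaxed (momentum) update for one weight; prev_dw
    is the weight change from the previous presentation."""
    return -eta * grad + alpha * prev_dw

def decay_step(grad, w, eta=0.1, beta=0.01):
    """Eq. (12): gradient step with a weight-decay term that keeps
    individual weights from growing very large."""
    return -eta * grad - beta * w

# Repeated identical gradients let the momentum term build up speed:
dw = 0.0
for _ in range(3):
    dw = momentum_step(grad=1.0, prev_dw=dw)
print(round(dw, 3))   # -0.1, then -0.19, then -0.271
```

In the limit of many equal-gradient steps the momentum update approaches -eta*grad/(1 - alpha), which is why it tolerates higher learning rates.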
\n\nRESULTS \n\nIn the following, we shall present and discuss some representative results of our empirical study. All learning curves refer to a set of 100 networks that have been exposed to the same learning procedure, where we have varied the initial weights, or the sequence of learning examples, or both. With one exception (Figure 4), the sequences of learning examples are always random. \n\nA prototype pattern association problem is the exclusive-or (XOR) problem. Corresponding networks have two input units and one output unit. Let us first consider an XOR-network with only one hidden unit, but in which the input units also have direct connections to the output unit. The weights are symmetric, and we use the deterministic version of the Boltzmann learning rule (see Eqs. (5) to (7)). Figure 1 shows results for the case of tabula rasa initial conditions, i.e. the initial weights are all set equal to zero. If the weights are changed after every learning example, about 2/3 of the networks learn the problem with less than 25 presentations per pattern (which corresponds to a total number of 4 x 25 = 100 presentations). The remaining networks (about 1/3), however, never learn to solve the XOR-problem, no matter how many input/output cases are presented. This can be understood by analyzing the corresponding evolution-tree in weight-space which contains an attractor consisting of 14 \"non-perfect\" weight-configurations. The probability to become trapped by this attractor is exactly 1/3. 
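For this network, combining the gradient step of Eq. (4) with Eq. (7) gives a particularly simple weight update, which can be sketched as follows (the state vectors below are arbitrary illustrative values, not actual relaxed XOR states):

```python
def boltzmann_delta(X, Y, eta=1.0):
    """Deterministic Boltzmann rule: from Eq. (4) and Eq. (7),
    -dF/dW_ij = Y_i*Y_j - X_i*X_j, so
    Delta W_ij = eta * (Y_i*Y_j - X_i*X_j).
    Diagonal entries are unused (no self-connections)."""
    n = len(X)
    return [[eta * (Y[i] * Y[j] - X[i] * X[j]) if i != j else 0.0
             for j in range(n)] for i in range(n)]

# +-1 states: X = relaxed state with only the input clamped,
# Y = state with both input and output clamped (illustrative values).
X = [1, -1, 1]
Y = [1, -1, -1]
dW = boltzmann_delta(X, Y)
print(dW[0][2])   # eta * ((1 * -1) - (1 * 1)) = -2.0
```

Note that the update is symmetric, dW_ij = dW_ji, so the weight symmetry required by the energy of Eq. (5) is preserved automatically.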
If the weight changes are accumulated over 4 learning examples, no such attractor seems to exist (see Fig. 1), but for some networks learning at least takes an extremely long time. The same saturation effect is observed with random initial weights (uniformly distributed between -1 and +1), see Fig. 2. \n\nFig. 1. Learning curves for an XOR-network with one hidden unit (deterministic Boltzmann learning, discrete \u00b11 units, initial weights zero). Full circles: weights changed after every learning example; open circles: weight changes accumulated over 4 learning examples. \n\nFigure 2 also exhibits the difference in learning behavior between networks with \u00b11 units and such with 0,1 units. In both cases, weight randomization leads to a considerably improved learning behavior. A weight decay term, by the way, has the same effect. The most striking observation, however, is that \u00b11 networks learn much faster than 0,1 networks (the respective average learning times differ by about a factor of 5). In this connection, we should mention that eta = 0.1 is about optimal for 0,1 units and that for \u00b11 networks the learning behavior is practically independent of the value of eta. 
\nIt therefore seems that \u00b11 units lead to a much more well-behaved error-surface than 0,1 units. One can argue, of course, that a discrete 0,1 model can always be translated into a \u00b11 model, but this would lead to an energy function which has a considerably more complicated weight dependence than Eq. (5). \n\nFig. 2. Learning curves for an XOR-network with one hidden unit (deterministic Boltzmann learning, initial weights random, weight changes accumulated over 5 learning examples). Circles: discrete \u00b11 units, eta = 1; triangles: discrete 0,1 units, eta = 0.1; broken curves: without weight randomization; solid curves: with weight randomization (p = 0.025). \n\nFigures 3 and 4 refer to a feedforward XOR-network with 3 hidden units, and to backpropagation or generalized delta rule learning. In all cases we have included an overrelaxation (or momentum) term with alpha = 0.9 (see Eq. (11)). For the networks with continuous units we have used the activation functions given by Eqs. (2) and (3), respectively, and a network was considered \"perfect\" if for all input/output cases the error was smaller than 0.1 in the 0,1 case, or smaller than 0.2 in the \u00b11 case, respectively. \n\nIn Figure 3, the weights have been changed after every learning example, and all curves refer to an optimal choice of the only remaining parameter, epsilon or eta, respectively. For discrete as well as for continuous units, the \u00b11 networks again perform much better than their 0,1 counterparts. In the continuous case, the average learning times differ by about a factor of 7, and in the discrete case the discrepancy is even more pronounced. 
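The generalized delta rule for discrete units, i.e. Eq. (9) with f' replaced by the constant epsilon as in Eq. (10), can be sketched for a single linear threshold output unit as follows (the training loop and the \u00b11-coded OR pattern set are our own toy illustration, not one of the paper's test problems):

```python
def threshold(x):
    """+-1 threshold activation (0 mapped to +1 for illustration)."""
    return 1 if x >= 0 else -1

def delta_rule_step(W, S, target, eps=0.05):
    """Eq. (10): Delta W_j = eps * (T - O) * S_j, i.e. Eq. (9) with the
    almost-everywhere-zero derivative f' replaced by a constant eps."""
    O = threshold(sum(w * s for w, s in zip(W, S)))
    return [w + eps * (target - O) * s for w, s in zip(W, S)]

# +-1-coded OR problem with a bias unit (last component always 1).
patterns = [([-1, -1, 1], -1), ([-1, 1, 1], 1), ([1, -1, 1], 1), ([1, 1, 1], 1)]
W = [0.0, 0.0, 0.0]
for _ in range(20):                      # a few sweeps through all patterns
    for S, T in patterns:
        W = delta_rule_step(W, S, T)
perfect = all(threshold(sum(w * s for w, s in zip(W, S))) == T
              for S, T in patterns)
print(perfect)   # prints True: the network is "perfect" on this toy task
```

With hidden layers, the same substitution f' -> epsilon is applied in every chain-rule factor, and as noted above the constant need not be the same for different layers.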
In addition, we observe that in \u00b11 networks learning with the generalized delta rule for discrete units is about twice as fast as with the backpropagation algorithm for continuous units. \n\nFig. 3. Learning curves for an XOR-network with three hidden units (backpropagation/generalized delta rule, initial weights random, weights changed after every learning example). Open circles: discrete \u00b11 units, epsilon = 0.05; open triangles: discrete 0,1 units, epsilon = 0.025; full circles: continuous \u00b11 units, eta = 0.125; full triangles: continuous 0,1 units, eta = 0.25. \n\nIn Figure 4, the weight changes are accumulated over all 4 input/output cases, and only networks with continuous units are considered. Also in this case, the \u00b11 units lead to an improved learning behavior (the optimal eta-values are about 2.5 and 5.0, respectively). They not only lead to significantly smaller learning times, but \u00b11 networks also appear to be less sensitive with respect to a variation of eta than the corresponding 0,1 versions. \n\nThe better performance of the \u00b11 models with continuous units can partly be attributed to the steeper slope of the chosen activation function, Eq. (3). A comparison with activation functions that have the same slope, however, shows that the networks with \u00b11 units still perform significantly better than those with 0,1 units. If the weights are updated after every learning example, e.g., the reduction in learning time remains as large as a factor of 5. 
In the case of backpropagation learning, the main reason for the better performance of \u00b11 units thus seems to be related to the fact that the algorithm does not modify weights which emerge from a unit with value zero. Similar observations have been made by Stornetta and Huberman5, who further find that the discrepancies become even more pronounced if the network size is increased. \n\nFig. 4. Learning curves for an XOR-network with three hidden units (backpropagation, initial weights random, weight changes accumulated over all 4 input/output cases). Circles: continuous \u00b11 units; triangles: continuous 0,1 units. \n\nIn Figure 5, finally, we present results for a network that learns to detect mirror symmetry in the input pattern. The network consists of one output, one hidden, and four input units which are also directly connected to the output unit. We use the deterministic version of Boltzmann learning and change the weights after every presentation of a learning pattern. If the weights are allowed to assume arbitrary values, learning is rather slow and on average requires almost 700 presentations per pattern. We have observed, however, that the algorithm preferably seems to converge to solutions in which geometrically symmetric weights are opposite in sign and almost equal in magnitude (see also Ref. 3). This means that the symmetric input patterns are automatically treated as equivalent, as their net input to the hidden as well as to the output unit is zero. We have therefore investigated what happens if the weights are forced to be antisymmetric from the beginning. (The learning procedure, of course, has to be adjusted such that it preserves this antisymmetry.) 
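One simple way to keep the input weights of a unit antisymmetric during learning is to project them back onto the antisymmetric subspace after each update; this projection idea is our own sketch (the paper only states that the procedure must preserve the antisymmetry), and the weight values are illustrative.

```python
def antisymmetrize(w):
    """Project one unit's input weights onto the antisymmetric subspace
    w_k = -w_{n-1-k}, so that every mirror-symmetric input pattern
    contributes zero net input to the unit."""
    n = len(w)
    return [(w[k] - w[n - 1 - k]) / 2.0 for k in range(n)]

W = antisymmetrize([0.75, -0.25, 0.25, -0.5])
print(W)                                 # [0.625, -0.25, 0.25, -0.625]

# A mirror-symmetric +-1 input pattern now produces zero net input:
net = sum(w * s for w, s in zip(W, [1, -1, -1, 1]))
print(net)                               # 0.0
```

Applying the projection after every weight change is equivalent to learning directly in the reduced weight space, which is how the symmetric input patterns become automatically equivalent.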
Figure 5 shows that such a problem-adapted weight-structure leads to a dramatic decrease in learning time. \n\nFig. 5. Learning curves for a symmetry detection network with 4 input units and one hidden unit (deterministic Boltzmann learning, eta = 1, discrete \u00b11 units, initial weights random, weights changed after every learning example). Full circles: symmetry-adapted weights; open circles: arbitrary weights, weight randomization (p = 0.015). \n\nCONCLUSIONS \n\nThe main results of our empirical study can be summarized as follows: \n- Networks with \u00b11 units quite generally exhibit a significantly faster learning than the corresponding 0,1 versions. \n- In addition, \u00b11 networks are often less sensitive to parameter variations than 0,1 networks. \n- An adaptation of the weight-structure to the symmetries of the problem can lead to a drastic improvement of the learning behavior. \n\nOur qualitative interpretations seem to indicate that the observed effects should not be restricted to the small examples considered in this paper. It would be very valuable, however, to have corresponding analytical results. \n\nREFERENCES \n\n1. \"Parallel Distributed Processing: Explorations in the Microstructure of Cognition\", vol. 1: \"Foundations\", ed. by D.E. Rumelhart and J.L. McClelland (MIT Press, Cambridge), 1986, Chapters 7 & 8. \n\n2. Y. 
le Cun, in \"Disordered Systems and Biological Organization\", ed. by E. Bienenstock, F. Fogelman Soulie, and G. Weisbuch (Springer, Berlin), 1986, pp. 233-240. \n\n3. D.E. Rumelhart, G.E. Hinton, and R.J. Williams, Nature 323, 533 (1986). \n\n4. M.L. Minsky and S. Papert, \"Perceptrons\" (MIT Press, Cambridge), 1969. \n\n5. W.S. Stornetta and B.A. Huberman, IEEE Conference on \"Neural Networks\", San Diego, California, 21-24 June 1987. \n", "award": [], "sourceid": 83, "authors": [{"given_name": "J.", "family_name": "Bernasconi", "institution": null}]}