{"title": "Benchmarking Feed-Forward Neural Networks: Models and Measures", "book": "Advances in Neural Information Processing Systems", "page_first": 1167, "page_last": 1174, "abstract": null, "full_text": "Benchmarking Feed-Forward Neural Networks: \n\nModels and Measures \n\nLeonard G. C. Harney \nComputing Discipline \nMacquarie University \n\nNSW2109 \nAUSTRALIA \n\nAbstract \n\nExisting metrics for the learning performance of feed-forward neural networks do \nnot provide a satisfactory basis for comparison because the choice of the training \nepoch limit can determine the results of the comparison. I propose new metrics \nwhich have the desirable property of being independent of the training epoch \nlimit. The efficiency measures the yield of correct networks in proportion to the \ntraining effort expended. The optimal epoch limit provides the greatest efficiency. \nThe learning performance is modelled statistically, and asymptotic performance \nis estimated. Implementation details may be found in (Harney, 1992). \n\n1 Introduction \n\nThe empirical comparison of neural network training algorithms is of great value in the \ndevelopment of improved techniques and in algorithm selection for problem solving. In \nview of the great sensitivity of learning times to the random starting weights (Kolen and \nPollack, 1990), individual trial times such as reported in (Rumelhart, et al., 1986) are almost \nuseless as measures of learning performance. \nBenchmarking experiments normally involve many training trials (typically N = 25 or \n100, although Tesauro and Janssens (1988) use N = 10000). For each trial i, the training \ntime to obtain a correct network ti is recorded. Trials which are not successful within a \nlimitofTepochs are considered failures; they are recorded as ti = T. The mean successful \ntraining time IT is defined as follows. \n1167 \n\n\f1168 \n\nHarney \n\nwhere S is the number of successful trials. 
The median successful time t̃T is the epoch at which S/2 trials are successes. It is common (e.g. Jacobs, 1987; Kruschke and Movellan, 1991; Veitch and Holmes, 1991) to report the mean and standard deviation along with the success rate AT = S/N, but the results are strongly dependent on the choice of T, as shown by Fahlman (1988). The problem is to characterise training performance independently of T.

Tesauro and Janssens (1988) use the harmonic mean tH as the average learning rate.

tH = N / Σ (1/ti), summed over all N trials

This minimises the contribution of large learning times, so changes in T have little effect on tH. However, tH is not an unbiased estimator of the mean, and is strongly influenced by the shortest learning times, so that training algorithms which produce greater variation in the learning times are preferred by this measure.

Fahlman (1988) allows the learning program to restart an unsuccessful trial, incorporating the failed training time in the total time for that trial. This method is realistic, since a failed trial would be restarted in a problem-solving situation. However, Fahlman's averages are still highly dependent upon the epoch limit T, which is chosen beforehand as the restart point.

The present paper proposes new performance measures for feed-forward neural networks. In section 4, the optimal epoch limit TE is defined. TE is the optimal restart point for Fahlman's averages, and the efficiency e is the scaled reciprocal of the optimised Fahlman average. In sections 5 and 6, the asymptotic learning behaviour is modelled and the mean and median are corrected for the truncation effect of the epoch limit T. Some benchmark results are presented in section 7, and compared with previously published results.

2 Performance Measurement

For benchmark results to be useful, the parameters and techniques of measurement and training must be fully specified.
Training parameters include the network structure, the learning rate η, the momentum term α and the range of the initial random weights [-r, r].

For problems with binary output, the correctness of the network response is defined by a threshold Tc: responses less than Tc are considered equivalent to 0, while responses greater than 1 - Tc are considered equivalent to 1. For problems with analog output, the network response is considered correct if it lies within Tc of the desired value. In the present paper, only binary problems are considered and the value Tc = 0.4 is used, as in (Fahlman, 1988).

3 The Training Graph

The training graph displays the proportion of correct networks as a function of the epoch. Typically, the tail of the graph resembles a decay curve.

Figure 1: Typical Training Graphs: Back-Propagation (η = 0.5, α = 0) and Descending Epsilon (η = 0.5, α = 0) on Exclusive-Or (2-2-1 structure, N = 1000, T = 10000).

It is evident in figure 1 that the success rate for either algorithm could be significantly increased if the epoch limit were raised beyond 10000. The shape of the training graph varies depending upon the problem and the algorithm employed to solve it. Descending epsilon (Yu and Simmons, 1990) solves a higher proportion of the exclusive-or trials with T = 10000, but back-propagation would have a higher success rate if T = 3000. This exemplifies the dramatic effect that the choice of T can have on the comparison of training algorithms.

Two questions naturally arise from this discussion: "What is the optimal value for T?" and "What happens as T → ∞?". These questions will be addressed in the following sections.
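The training graph is simply the empirical proportion of trials that have succeeded at each epoch limit. A minimal sketch, again assuming failed trials are recorded as t = T:

```python
# Sketch: the training graph as the empirical proportion of correct networks
# for each epoch limit 1..T. Failed trials are recorded with t == T.
def training_graph(times, T):
    """Return a list: proportion of trials successful within each limit 1..T."""
    N = len(times)
    return [sum(1 for t in times if t <= limit and t < T) / N
            for limit in range(1, T + 1)]

# Hypothetical run: four successes and two failures at the limit T = 8.
graph = training_graph([2, 3, 3, 5, 8, 8], T=8)
```

Plotting such a curve for each algorithm reproduces the comparison of figure 1, including its decay-curve tail.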
\n\n4 Efficiency and Optimal T. \n\nAdjusting the epOch limit T in a learning algorithm affects both the yield of correct networks \nand the effort expended on unsuccessful trials. To capture the total yield for effort ratio, we \ndefine the efficiency E( t) of epoch limit t as follows. \n\nThe efficiency graph plots the efficiency against of the epoch limit. The effiCiency graph for \nback-propagation (figure 2) exhibits a strong peak with the efficiency reducing relatively \nquickly if the epoch limit is too large. In contrast, the efficiency graph for descending \nepsilon exhibits an extremely broad peak with only a slight drop as the epoch limit is \nincreased. This occurs because the asymptotic success rate (A in section 5) is close to \n\n\fFigure 2: Efficiency Graphs: Back-Propagation (ry = 0.3, a = 0.9) and Descending \nEpsilon (ry = 0.3, a = 0.9) on Exclusive-Or (2-2-1 structure, N = 1000, T = 10000). \n\n1.0; in such cases, the efficiency remains high over a wider range of epoch limits and \nnear-optimal performance can be more easily achieved for novel problems. \n\nThe efficiency benchmark parameters are derived from the graph as shown in figure 3. The \nepoch limit TE at which the peak efficiency occurs is the optimal epoch limit. The peak \nefficiency e is a good performance measure, independent of T when T > TE. Unlike I H , it \nis not biased by the shortest learning times. The peak efficiency is the scaled reciprocal of \nFahlman's (1988) average for optimal T, and incorporates the failed trials as a perfonnance \npenalty. The optimisation of training parameters is suggested by Tesauro and Janssens \n(1988), but they do not optimise T. For comparison with other performance measures, the \nun scaled optimised Fahlman average t E = 1000/ e may be used instead of e. \nThe prediction of the optimal epoch limit TE for novel problems would help reduce wasted \ncomputation. 
The range parameters TE1 and TE2 show how precisely T must be set to obtain efficiency within 50% of optimal; if two algorithms are otherwise similar in performance, the one with a wider range (TE1, TE2) would be preferred for novel problems.

5 Asymptotic Performance: T → ∞

In the training graph, the proportion of trials that ultimately learn correctly can be estimated by the asymptote which the graph is approaching. I statistically model the tail of the graph by the distribution F(t) = 1 - [a(t - T0) + 1]^(-k) and thus estimate the asymptotic success rate A. Figure 4 illustrates the model parameters. Since the early portions of the graph are dominated by initialisation effects, T0, the point where the model commences to fit, is determined by applying the Kolmogorov-Smirnov goodness-of-fit test (Stephens, 1974) for all possible values of T0.

Figure 3: Efficiency Parameters in Relation to the Efficiency Graph.

The maximum likelihood estimates of a and k are found by using the simplex algorithm (Caceci and Cacheris, 1984) to directly maximise the following log-likelihood equation.

L(t) = M [ln a + ln k - ln(1 - (a(T - T0) + 1)^(-k))] - (k + 1) Σ ln(a(ti - T0) + 1)

where the sum is over the M trials that succeed between T0 and T epochs.

6 Corrected Mean and Median

Figure 4: Parameters for the Model of Asymptotic Performance.

Incorporating the predicted successes, the corrected mean t̄c estimates the mean successful learning time as T → ∞. The corrected median t̃c is the epoch for which A/2 of the trials are successes. It estimates the median successful learning time as T → ∞.
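The tail-model fit can be sketched directly from the log-likelihood. The code below substitutes a coarse grid search for the paper's simplex method, and the success times and the (T0, T) values are hypothetical:

```python
import math

# Sketch: log-likelihood of the tail model F(t) = 1 - (a(t - T0) + 1)^(-k),
# truncated at the epoch limit T. 'succ' holds the M success times in (T0, T].
def log_likelihood(succ, T0, T, a, k):
    M = len(succ)
    trunc = 1.0 - (a * (T - T0) + 1.0) ** (-k)   # F(T), the truncation term
    return (M * (math.log(a) + math.log(k) - math.log(trunc))
            - (k + 1.0) * sum(math.log(a * (t - T0) + 1.0) for t in succ))

def fit_tail(succ, T0, T):
    """Maximise the log-likelihood over a grid of (a, k); a simplex method
    such as Nelder-Mead would be used in practice, as in the paper."""
    grid = [x / 100.0 for x in range(1, 201)]    # candidate values 0.01 .. 2.00
    return max(((a, k) for a in grid for k in grid),
               key=lambda p: log_likelihood(succ, T0, T, p[0], p[1]))

# Hypothetical success times between T0 = 50 and T = 1000 epochs.
a_hat, k_hat = fit_tail([60, 75, 120, 300], T0=50, T=1000)
```

With a and k in hand, the asymptotic success rate A and the corrected mean and median follow from the fitted distribution.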
\n\n7 Benchmark Results for Back.Propagation \n\nTable 1 presents optimised results for two popular benchmark problems: \nthe 2-2-1 \nexclusive-or problem (Rumelhart, et al., 1986, page 334), and the 10-5-10 encoder/decoder \nproblem (Fahlman, 1988). Both problems employ three-layer networks with one hidden \nlayer fully connected to the input and output units. The networks were trained with input \nand output values of 0 and 1. The weights were updated after each epoch of training; i.e. \nafter each cycle through all the training patterns. \nThe characteristics of the learning for these two problems differs significantly. To accurately \nbenchmark the exclusive-or problem, N = 10000 learning runs were needed to measure e \naccurate to \u00b10.3. With T = 200, I searched the combinations of 0:', 1] and r. The optimal \nparameters were then used in a separate run with N = 10000 and T = 2000 to estimate \nthe other benchmark parameters. In contrast, the encoder/decoder problem produced more \nstable efficiency values so that N = 100 learning runs produced estimates of e precise to \n\u00b10.2. With T = 600, all the learning runs converged. The final benchmark values were \n\n\fBenchmarking Feed-Forward Neural Networks: Models and Measures \n\n1173 \n\nTable 1: Optimised Benchmark Results. \n\nPROBLEM \n\nr \n\nQ' \n\nTJ \n\ne \n\nTE \n\nTEl \n\nTE2 \n\ntE \n\nexclusive-or \n2-2-1 \nencoder/decoder \n10-5-10 \n\n1.4 \n\n1.1 \n\n0.65 \n\n17.1 \n\u00b10.2 \u00b10.05 \u00b10.5 \u00b10.3 \n8.1 \n\u00b10.2 \u00b10.10 \u00b10.1 \u00b10.2 \n\n0.00 \n\n7.0 \n\n1.7 \n\n49 \n\n26 \n\n235 \n\n59 \n\n00 \n\n110 \n\n00 \n\n124 \n\nPROBLEM \n\na \n\nk \n\nTo \n\n'Y \n\nA \n\nIe \n\nAT \n\nIT \n\nIH \n\nexclusive-or \nencoder/decoder \n\n0.1 \n\n0.5 \n\n54 0.66 0.93 409 0.76 \n1.00 \n\n1.00 124 \n\n50 \n124 \n\n40 \n114 \n\ndetermined with N = 1000. 
Confidence intervals for e were obtained by applying the jackknife procedure (Mosteller and Tukey, 1977, chapter 8); confidence intervals on the training parameters reflect the range of near-optimal efficiency results.

In the exclusive-or results, the four means vary from each other considerably. t̄c is large because the asymptotic performance model predicts many successful learning runs with T > 2000. However, since the model is fitting only a small portion of the data (approximately 1000 cases), its predictions may not be highly reliable. t̄T is low because the limit T = 2000 discards the longer training runs. tH is also low because it is strongly biased by the shortest times. tE measures the training effort required per trained network, including failure times, provided that T = 49. However, TE1 and TE2 show that T can lie anywhere within the range (26, 235) and achieve performance no worse than 118 epochs of effort per trained network.

The results for the encoder/decoder problem agree well with Fahlman (1988), who found α = 0, η = 1.7 and r = 1.0 as optimal parameter values and obtained t = 129 based upon N = 25. Equal performance is obtained with α = 0.1 and η = 1.6, but momentum values in excess of 0.2 reduce the efficiency. Since all the learning runs are successful, tE = t̄c = t̄T and A = AT = 1.0. Both TE and TE2 are infinite, indicating that there is no need to limit the training epochs to produce optimal learning performance. Because there were no failed runs, the asymptotic performance was not modelled.

8 Conclusion

The measurement of learning performance in artificial neural networks is of great importance. Existing performance measurements have employed measures that are either dependent on an arbitrarily chosen training epoch limit or are strongly biased by the shortest learning times.
By optimising the training epoch limit, I have developed new performance measures, the efficiency e and the related mean tE, which are both independent of the training epoch limit and provide an unbiased measure of performance. The optimal training epoch limit TE and the range over which near-optimal performance is achieved (TE1, TE2) may be useful for solving novel problems.

I have also shown how the random distribution of learning times can be statistically modelled, allowing prediction of the asymptotic success rate A and computation of corrected mean and median successful learning times, and I have demonstrated these new techniques on two popular benchmark problems. Further work is needed to extend the modelling to encompass a wider range of algorithms and to broaden the available base of benchmark results. In the process, it is believed that greater understanding of the learning processes of feed-forward artificial neural networks will result.

References

M. S. Caceci and W. P. Cacheris. Fitting curves to data: The simplex algorithm is the answer. Byte, pages 340-362, May 1984.

Scott E. Fahlman. An empirical study of learning speed in back-propagation networks. Technical Report CMU-CS-88-162, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, 1988.

Leonard G. C. Hamey. Benchmarking feed-forward neural networks: Models and measures. Macquarie Computing Report, Computing Discipline, Macquarie University, NSW 2109, Australia, 1992.

R. A. Jacobs. Increased rates of convergence through learning rate adaptation. COINS Technical Report 87-117, University of Massachusetts at Amherst, Dept. of Computer and Information Science, Amherst, MA, 1987.

John F. Kolen and Jordan B. Pollack. Back propagation is sensitive to initial conditions. Complex Systems, 4:269-280, 1990.

John K. Kruschke and Javier R. Movellan.
Benefits of gain: Speeded learning and minimal hidden layers in back-propagation networks. IEEE Trans. Systems, Man and Cybernetics, 21(1):273-280, January 1991.

Frederick Mosteller and John W. Tukey. Data Analysis and Regression. Addison-Wesley, 1977.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing, chapter 8, pages 318-362. MIT Press, 1986.

M. A. Stephens. EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association, 69:730-737, September 1974.

G. Tesauro and B. Janssens. Scaling relationships in back-propagation learning. Complex Systems, 2:39-44, 1988.

A. C. Veitch and G. Holmes. Benchmarking and fast learning in neural networks: Results for back-propagation. In Proceedings of the Second Australian Conference on Neural Networks, pages 167-171, 1991.

Yeong-Ho Yu and Robert F. Simmons. Descending epsilon in back-propagation: A technique for better generalization. In Proceedings of the International Joint Conference on Neural Networks 1990, 1990.
", "award": [], "sourceid": 548, "authors": [{"given_name": "Leonard", "family_name": "Hamey", "institution": null}]}