{"title": "Chaitin-Kolmogorov Complexity and Generalization in Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 925, "page_last": 931, "abstract": null, "full_text": "Chaitin-Kolmogorov Complexity \nand Generalization in Neural Networks \n\nBarak A. Pearlmutter \nSchool of Computer Science \nCarnegie Mellon University \nPittsburgh, PA 15213 \n\nRonald Rosenfeld \nSchool of Computer Science \nCarnegie Mellon University \nPittsburgh, PA 15213 \n\nAbstract \n\nWe present a unified framework for a number of different ways of failing to generalize properly. During learning, sources of random information contaminate the network, effectively augmenting the training data with random information. The complexity of the function computed is therefore increased, and generalization is degraded. We analyze replicated networks, in which a number of identical networks are independently trained on the same data and their results averaged. We conclude that replication almost always results in a decrease in the expected complexity of the network, and that replication therefore increases expected generalization. Simulations confirming the effect are also presented. \n\n1 BROKEN SYMMETRY CONSIDERED HARMFUL \n\nConsider a one-unit backpropagation network trained on exclusive or. Without hidden units, the problem is insoluble. One point where learning would stop is when all weights are zero and the output is always 1/2, resulting in a mean squared error of 1/4. But this is a saddle point; by placing the discrimination boundary properly, one point can be gotten correctly, two with errors of 1/2, and one with error of 1, giving an MSE of 3/8, as shown in figure 1. \n\nNetworks are initialized with small random weights, or noise is injected during training to break symmetries of this sort. But in breaking this symmetry, something has been lost. 
Consider a kNN classifier, constructed from a kNN program and the training data. Anyone who has a copy of the kNN program can construct an identical classifier if they receive the training data. Thus, considering the classification as an abstract entity, we know its complexity cannot exceed that of the training data plus the overhead of the complexity of the program, which is fixed. \n\nBut this is not necessarily the case for the backpropagation network we saw! Because of the introduction of randomly broken symmetries, the complexity of the classification itself can exceed that of the training data plus the learning procedure. Thus an identical classifier can no longer be constructed just from the program and the training data, because random factors have been introduced. For a striking example, consider presenting a \"32 bit parity with 10,000 exceptions\" stochastic learner with one million exemplars. The complexity of the resulting function will be high, since in order to specify it we must specify not only the regularities of the training set, which we just did in a couple of words, but also which of the 4 billion possibilities are among the 10,000 exceptions. \n\nApplying this idea to undertraining and overtraining, we see that there are two kinds of symmetries that can be broken. First, if not all the exemplars can be loaded, which of the outliers are not loaded can be arbitrary. Second, underconstrained networks that behave the same on the training set may behave differently on other inputs. Both phenomena can be present simultaneously. 
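The size of the exception list dominates the description. A back-of-the-envelope computation (ours, not the paper's) of log2 of (2^32 choose 10^4) makes the point concrete:

```python
# A rough arithmetic sketch (ours, not the paper's): how many bits it takes
# just to specify which 10,000 of the 2**32 inputs are the exceptions,
# i.e. log2 of C(2**32, 10**4), computed via log-gamma to avoid huge ints.
import math

def log2_binomial(n, k):
    """log base 2 of the binomial coefficient C(n, k), using lgamma(x+1) = ln(x!)."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1)
            - math.lgamma(n - k + 1)) / math.log(2)

bits = log2_binomial(2**32, 10**4)
print(f"about {bits:,.0f} bits just to list the exceptions")
```

This comes to roughly 2 x 10^5 bits, on the order of 25 kilobytes, no matter how compactly the parity rule itself is stated.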
\n\n2 A COMPLEXITY BOUND \n\nThe expected value of the complexity of the function implemented by a network b trained on data d, where b is a potentially stochastic mapping, satisfies \n\nE(C(b(d))) \u2264 C(d) + C(b) + I(b(d)|d) \n\nwhere I(b(d)|d) is the entropy of the bias distribution of b trained on d, \n\nI(b(d)|d) = H(b(d)) = - \u03a3_f P(b(d) = f) log P(b(d) = f) \n\nwhere f ranges over the functions that the network could end up performing, with the network regarded as a black box. This in turn is bounded by the information contained in the random internal parameters, or by the entropy of the watershed structure; but these are both potentially unbounded. \n\nA number of techniques for improving generalization, when viewed in this light, work because they tighten this bound. \n\n\u2022 Weight decay [2] and the statistical technique of ridge regression impose an extra constraint on the parameters, reducing their freedom to arbitrarily break symmetry when underconstrained. \n\n\u2022 Cross validation attempts to stop training before too many symmetries have been broken. \n\n\u2022 Efforts to find the perfect number of hidden units attempt to minimize the number of symmetries that must be broken. \n\nThese techniques strike a balance between undertraining and overtraining. Since in any realistic domain both of these effects will be simultaneously present, it would seem advantageous to attack the problem at the root. One approach that has been rediscovered a number of times [1, 3], and systematically explored in its pure form by Lincoln and Skrzypek [4], is that of replicated networks. \n\nFigure 1: The bifurcation of a perceptron trained on xor. \n\nFigure 2: The training set. Crosses are negative examples and diamonds are positive examples. \n\n3 REPLICATED NETWORKS \n\nOne might think that the complexity of the average of a collection of networks would be the sum of the complexities of the components; but this need not be the case. Consider an ensemble network, in which an infinite number of networks are taught the training data simultaneously, each making its random decisions according to whatever distributions the training procedure calls for, and their outputs averaged. We have seen that the complexity of a single network can exceed that of its training data plus the training program. But this is not the case with ensemble networks, since the ensemble network output can be determined solely from the program and the training data, i.e. C(E(b(d))) \u2264 C(b) + C(d) + C(\"replicate\"), where C(\"replicate\") is the complexity of the instruction to replicate and average (a small constant). \n\nA simple way to approximate the ensemble machine is to train a number of networks simultaneously and average the results. As the number of networks is increased, the composite model approaches the ensemble network, which cannot have higher complexity than the training data plus the program plus the instruction to replicate. 
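The replication-and-averaging scheme can be sketched in a few lines. This is a minimal illustration (ours, not the authors' code), using a single sigmoid unit on a toy linearly separable problem standing in for the paper's task; the architecture, learning rate, and epoch count are assumptions:

```python
# Minimal sketch of replication (Section 3): train several identically
# structured one-unit nets that differ only in their random initial weights,
# then average their outputs to approximate the "ensemble network".
import math
import random

def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid math.exp overflow
    return 1.0 / (1.0 + math.exp(-z))

def train(data, seed, epochs=500, lr=0.5):
    """Gradient descent on squared error for a single sigmoid unit."""
    rng = random.Random(seed)
    w = [rng.uniform(-1.0, 1.0) for _ in range(3)]  # w1, w2, bias
    for _ in range(epochs):
        for (x1, x2), t in data:
            y = sigmoid(w[0] * x1 + w[1] * x2 + w[2])
            g = 2.0 * (y - t) * y * (1.0 - y)       # d(error)/d(logit)
            w[0] -= lr * g * x1
            w[1] -= lr * g * x2
            w[2] -= lr * g
    return w

def predict(w, x):
    return sigmoid(w[0] * x[0] + w[1] * x[1] + w[2])

def mse(preds, data):
    return sum((p - t) ** 2 for p, (_, t) in zip(preds, data)) / len(data)

# A toy noisy-free linearly separable problem: label is the sign of x1.
rng = random.Random(0)
pts = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(30)]
data = [((x1, x2), 1 if x1 >= 0 else 0) for x1, x2 in pts]

nets = [train(data, seed) for seed in range(5)]            # 5 replicas
per_net = [[predict(w, x) for x, _ in data] for w in nets]
ensemble = [sum(col) / len(nets) for col in zip(*per_net)]

avg_mse = sum(mse(p, data) for p in per_net) / len(nets)
ens_mse = mse(ensemble, data)
# By convexity of squared error (Jensen's inequality), the averaged
# predictor can do no worse than the average of its members' MSEs.
assert ens_mse <= avg_mse + 1e-12
```

The final assertion holds for any set of component networks, which is the Jensen's-inequality bound the paper invokes later; how far below the bound the ensemble lands depends on how differently the replicas break symmetry.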
\nNote that even if one accidentally stumbles across the perfect architecture and training regime, resulting in a net that always learns the training set perfectly but with no leftover capacity, and which generalizes as well as anything could, then making a replicated network can't hurt, since all the component networks would do exactly the same thing anyway. \n\nA number of researchers seem to have inadvertently exploited this fact. For instance, Hampshire et al. [1] train a number of networks on a speech task, where the networks differed in choice of objective function. The networks' outputs were averaged to form the answer used in the recognition phase, and the generalization performance of the composite network was significantly higher than that of any of its component networks. Replicated implementations programmed from identical specifications are a common technique in the software engineering of highly reliable systems. \n\n4 THE ISSUE OF INDUCTIVE BIAS \n\nThe representational power of an ensemble is greater than that of a single network. By the usual logic, one would expect the ensemble to have worse generalization, since its inductive bias is weaker. Counterintuitively, this is not the case. For instance, the VC dimension of an ensemble of perceptrons is infinite, because it can implement an arbitrary three-layer network, using replication to implement weights. This is much greater than the finite VC dimension of a single perceptron within the ensemble, but our analysis predicts better generalization for the ensemble than for a single stochastic perceptron when the bounds are tight, that is, when \n\nH(b(d)) \u226b C(\"replicate\"). (1) \n\nThis leads to the conclusion that just knowing the inductive bias of a learner is not enough information to make strong conclusions about its expected generalization. 
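The representational-power claim has a tiny concrete witness (our construction, not the paper's): averaging two linear threshold units and re-thresholding the mean computes xor, which no single perceptron can compute.

```python
# A toy illustration of Section 4: the average of two linear threshold
# units, re-thresholded, implements xor -- impossible for one perceptron.
def ltu(w1, w2, b):
    """A linear threshold unit over two inputs."""
    return lambda x1, x2: 1 if w1 * x1 + w2 * x2 + b >= 0 else 0

u1 = ltu(1, 1, -0.5)    # fires when x1 + x2 >= 0.5
u2 = ltu(-1, -1, 1.5)   # fires when x1 + x2 <= 1.5

def averaged(x1, x2, threshold=0.75):
    mean = (u1(x1, x2) + u2(x1, x2)) / 2
    return 1 if mean >= threshold else 0

# The averaged pair fires exactly on the xor inputs (0,1) and (1,0).
for x1 in (0, 1):
    for x2 in (0, 1):
        assert averaged(x1, x2) == (x1 ^ x2)
```

The two units carve out the diagonal band between the lines x1 + x2 = 0.5 and x1 + x2 = 1.5, which is exactly the mechanism by which an ensemble exceeds the representational power of its members.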
Thus, distribution-free results based purely on the inductive bias, such as VC-dimension-based PAC learning theory [5], may sometimes be unduly pessimistic. \n\nAs to replicated networks, we have seen that they cannot help but improve generalization when (1) holds. Thus, if one is training the same network over and over, perhaps with slightly different training regimes, and getting worse generalization than was hoped for, but on different cases each time, then one can improve generalization in a seemingly principled manner by putting all the trained networks in a box and calling it a finite sample of the ensemble network (and perhaps buying a bigger computer to run it on). \n\n5 EMPIRICAL SUPPORT \n\nWe conducted the following experiment: 17 standard backpropagation networks (actually 20, but 3 were lost to a disk failure) were trained on a binary classification task. The nets all had identical architectures (2-20-1) but different initial weights, chosen uniformly from the interval [-1, 1]. The same training set was used to train all the networks. The functions implemented by each of the networks were then calculated in detail, and the performance of individual networks compared to that of their ensemble. \n\nThe classification task was a stochastic 2D linear discriminator. Each point was obtained from a Gaussian centered at (0,0) with stdev 1. A classification of 1 was assigned to points with x \u2265 0, and 0 to points with x < 0, but reversed with an independent probability of 0.1. The final position of each point was then determined by adding a zero-mean Gaussian with stdev 0.25. 200 points were so generated for the training set (shown in figure 2) and another 1000 points for the test set. \n\nFigure 3: The functions implemented by the 17 trained networks, and by their average (bottom right). Both the x and y axes run from -3 to 3, and grey levels are used to represent intermediate values in the interval [0,1]. \n\nTable 1: Mean squared error and number of mislabeled exemplars for each network on the training set of 200. \n\nnet        MSE        errors \n12         0.0150837  3 \n9          0.0200039  4 \n16         0.0200026  4 \n5          0.0250207  5 \n7          0.0250213  5 \n10         0.0228319  5 \n13         0.0250156  5 \n17         0.0250018  5 \n19         0.0175466  5 \n6          0.0300099  6 \n15         0.0300075  6 \n18         0.0300060  6 \n8          0.0350609  7 \n11         0.0350006  7 \n20         0.0400013  8 \n14         0.0305254  9 \n4          0.0408391  13 \nmean       0.027469 \u00b1 0.007226  6.058824 \u00b1 2.261457 \nensemble   0.016286   4 \nnohidden   0.060314   31 \n\nLooking at figure 3, each net appears to correctly classify as many of the inputs as possible, within the bounds imposed on it by its inductive bias. Each function implemented by such a net is roughly equivalent to a linear combination of 20 independent linear discriminators. It is therefore clear why each map consists of regions delineated by up to 20 straight lines. Since the initial conditions were different for each net, so were the resultant regions. All networks misclassified some of the exemplars (see table 1), but the misclassifications were different for each network, illustrating symmetry breaking due to an overconstraining data set. \n\nNote that the ensemble's performance on the training set is comparable to that of the best of the trained networks, while its performance on the test set is far superior. 
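The data-generation procedure described above can be sketched as follows (a reconstruction under our reading of Section 5, not the authors' code; the seed is arbitrary):

```python
# Sketch of the Section 5 task: points drawn from a unit Gaussian at the
# origin, labeled 1 when x >= 0 and 0 otherwise, labels flipped with
# independent probability 0.1, positions then jittered with stdev 0.25.
import random

def make_set(n, rng):
    points = []
    for _ in range(n):
        x, y = rng.gauss(0, 1), rng.gauss(0, 1)
        label = 1 if x >= 0 else 0
        if rng.random() < 0.1:          # independent label noise
            label = 1 - label
        x += rng.gauss(0, 0.25)         # positional jitter, applied
        y += rng.gauss(0, 0.25)         # after the label is assigned
        points.append(((x, y), label))
    return points

rng = random.Random(1)                  # arbitrary seed
train_set = make_set(200, rng)          # sizes as in the paper
test_set = make_set(1000, rng)
```

Under this reading, the rule sign(x) errs on the 10% flipped labels plus the points the jitter pushes across the boundary, which is consistent with the roughly 170-in-1000 error count the paper quotes for the theoretically perfect classifier.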
\nThe MSE of the ensemble is far better than the bound obtained from Jensen's inequality, the average MSE. In fact, the ensemble network gets a lower MSE than all but one individual network on the training set, and a much lower MSE than any individual network on the test set; and it generalizes much better than any of the individual networks by a misclassification-count metric. \n\nTable 2: Mean squared error and number of mislabeled samples for each network on the test set of 1000. The performance of a theoretically perfect classifier (sign x) on the test set is 170 misclassifications, which is about what the network without hidden units gets. \n\nnet        MSE     errors \n16         0.201   205 \n9          0.207   213 \n4          0.206   215 \n5          0.209   216 \n11         0.208   216 \n15         0.207   216 \n6          0.212   219 \n19         0.213   220 \n7          0.214   222 \n8          0.214   224 \n12         0.212   225 \n17         0.219   225 \n18         0.220   227 \n20         0.223   229 \n13         0.223   231 \n14         0.227   237 \n10         0.226   254 \nmean       0.214 \u00b1 0.007  223 \u00b1 10.7 \nensemble   0.160   200 \nnohidden   0.0715  169 \n\nTable 3: Histogram of the networks' performance by number of misclassified training exemplars. \n\nerror count  networks \n0 \n1 \n2 \n3            * \n4            ** \n5            ****** \n6            *** \n7            ** \n8            * \n9            * \n10 \n11 \n12 \n13           * \n14 \n15 \n16 \n\nReferences \n\n[1] J. Hampshire and A. Waibel. A novel objective function for improved phoneme recognition using time delay neural networks. Technical Report CMU-CS-89-118, Carnegie Mellon University School of Computer Science, March 1989. \n\n[2] Geoffrey E. Hinton, Terrence J. Sejnowski, and David H. Ackley. Boltzmann Machines: Constraint satisfaction networks that learn. Technical Report CMU-CS-84-119, Carnegie Mellon University, May 1984. \n\n[3] Nathan Intrator. 
A neural network for feature extraction. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 719-726, San Mateo, CA, 1990. Morgan Kaufmann. \n\n[4] William P. Lincoln and Josef Skrzypek. Synergy of clustering multiple back propagation networks. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 650-657, San Mateo, CA, 1990. Morgan Kaufmann. \n\n[5] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, 1984. \n", "award": [], "sourceid": 394, "authors": [{"given_name": "Barak", "family_name": "Pearlmutter", "institution": null}, {"given_name": "Ronald", "family_name": "Rosenfeld", "institution": null}]}