{"title": "Network Generality, Training Required, and Precision Required", "book": "Neural Information Processing Systems", "page_first": 219, "page_last": 222, "abstract": null, "full_text": "219 \n\nNetwork Generality, Training Required, \n\nand PrecisIon Required \n\nJohn S. Denker and Ben S. Wittner 1 \n\nAT&T Bell Laboratories \n\nHolmdel, New Jersey 07733 \n\nKeep your hand on your wallet. \n- Leon Cooper, 1987 \n\nAbstract \n\nWe show how to estimate (1) the number of functions that can be implemented by a \nparticular network architecture, (2) how much analog precision is needed in the con(cid:173)\nnections in the network, and (3) the number of training examples the network must see \nbefore it can be expected to form reliable generalizations. \n\nGenerality versus Training Data Required \n\nConsider the following objectives: First, the network should be very powerful and ver(cid:173)\nsatile, i.e., it should implement any function (truth table) you like, and secondly, it \nshould learn easily, forming meaningful generalizations from a small number of training \nexamples. Well, it is information-theoretically impossible to create such a network. We \nwill present here a simplified argument; a more complete and sophisticated version can \nbe found in Denker et al. (1987). \n\nIt is customary to regard learning as a dynamical process: adjusting the weights (etc.) \nin a single network. In order to derive the results of this paper, however, we take \na different viewpoint, which we call the ensemble viewpoint. Imagine making a very \nlarge number of replicas of the network. Each replica has the same architecture as the \noriginal, but the weights are set differently in each case. No further adjustment takes \nplace; the \"learning process\" consists of winnowing the ensemble of replicas, searching \nfor the one( s) that satisfy our requirements. \n\nTraining proceeds as follows: We present each item in the training set to every network \nin the ensemble. 
That is, we use the abscissa of the training pattern as input to the network, and compare the ordinate of the training pattern with the actual output of the network to see if they agree. For each network, we keep a score reflecting how many times (and how badly) it disagreed with a training item. Networks with the lowest score are the ones that agree best with the training data. If we had complete confidence in the reliability of the training set, we could at each step simply throw away all networks that disagree.\n\n[1] Currently at NYNEX Science and Technology, 500 Westchester Ave., White Plains, NY 10604\n\n© American Institute of Physics 1988\n\nFor definiteness, let us consider a typical network architecture, with N_0 input wires and N_l units in each processing layer l, for l in {1, ..., L}. For simplicity we assume N_L = 1. We recognize the importance of networks with continuous-valued inputs and outputs, but we will concentrate for now on training (and testing) patterns that are discrete, with N = N_0 bits of abscissa and N_L = 1 bit of ordinate. This allows us to classify the networks into bins according to what Boolean input-output relation they implement, and simply consider the ensemble of bins.\n\nThere are 2^(2^N) possible bins. If the network architecture is completely general and powerful, all 2^(2^N) functions will exist in the ensemble of bins. On average, one expects that each training item will throw away at most half of the bins. Assuming maximal efficiency, if m training items are used, then when m ≈ 2^N there will be only one bin remaining, and that must be the unique function that consistently describes all the data. But there are only 2^N possible abscissas using N bits. Therefore a truly general network cannot possibly exhibit meaningful generalization: 100% of the possible data is needed for training.
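The winnowing argument above is easy to check by brute force. The following is our own minimal sketch, not code from the paper: for N input bits we enumerate all 2^(2^N) truth tables as bins, then discard every bin that disagrees with each training item in turn.

```python
# Winnow an ensemble of bins for a completely general network.
# Each bin is a truth table on N input bits; each training item
# (abscissa, ordinate) discards exactly half of the surviving bins,
# so a unique bin remains only after all 2**N abscissas are seen.
from itertools import product

N = 2
abscissas = list(product([0, 1], repeat=N))          # 2**N possible inputs
bins = list(product([0, 1], repeat=2 ** N))          # 2**(2**N) truth tables

target = bins[11]                                    # the 'true' function
for m, x in enumerate(abscissas, start=1):
    i = abscissas.index(x)
    bins = [b for b in bins if b[i] == target[i]]    # keep agreeing bins
    print(m, 'items ->', len(bins), 'bins remain')   # 8, 4, 2, 1
```

The survivors halve at every step, so meaningful generalization is impossible until 100% of the 2^N possible items have been used for training, exactly as the counting argument predicts.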
\n\nNow suppose that the network is not completely general, so that even with all possible settings of the weights we can only create functions in 2^(S_0) bins, where S_0 < 2^N. We call S_0 the initial entropy of the network. A more formal and general definition is given in Denker et al. (1987). Once again, we can use the training data to winnow the ensemble, and when m ≈ S_0, there will be only one remaining bin. That function will presumably generalize correctly to the remaining 2^N - m possible patterns. Certainly that function is the best we can do with the network architecture and the training data we were given.\n\nThe usual problem with automatic learning is this: if the network is too general, S_0 will be large, and an inordinate amount of training data will be required. The required amount of data may be simply unavailable, or it may be so large that training would be prohibitively time-consuming. This shows the critical importance of building a network that is not more general than necessary.\n\nEstimating the Entropy\n\nIn real engineering situations, it is important to be able to estimate the initial entropy of various proposed designs, since that determines the amount of training data that will be required. Calculating S_0 directly from the definition is prohibitively difficult, but we can use the definition to derive useful approximate expressions. (You wouldn't want to calculate the thermodynamic entropy of a bucket of water directly from the definition, either.)\n\nSuppose that the weights in the network at each connection i were not continuously adjustable real numbers, but rather were specified by a discrete code with b_i bits. Then the total number of bits required to specify the configuration of the network is\n\nB = Σ_i b_i    (1)\n\nNow the total number of functions that could possibly be implemented by such a network architecture would be at most 2^B.
The actual number will always be smaller than this, since there are various ways in which different settings of the weights can lead to identical functions (bins). For one thing, for each hidden layer l in {1, ..., L-1}, the numbering of the hidden units can be permuted, and the polarity of the hidden units can be flipped, which means that 2^(S_0) is less than 2^B by a factor (among others) of Π_l N_l! 2^(N_l). In addition, if there is an inordinately large number of bits b_i at each connection, there will be many settings where small changes in the connection will be immaterial. This will make 2^(S_0) smaller by an additional factor. We expect ∂S_0/∂b_i ≈ 1 when b_i is small, and ∂S_0/∂b_i ≈ 0 when b_i is large; we must now figure out where the crossover occurs.\n\nThe number of \"useful and significant\" bits of precision, which we designate b*, typically scales like the logarithm of the number of connections to the unit in question. This can be understood as follows: suppose there are N connections into a given unit, and an input signal to that unit of some size A is observed to be significant (the exact value of A drops out of the present calculation). Then there is no point in having a weight with magnitude much larger than A, nor much smaller than A/N. That is, the dynamic range should be comparable to the number of connections. (This argument is not exact, and it is easy to devise exceptions, but the conclusion remains useful.) If only a fraction 1/S of the units in the previous layer are active (nonzero) at a time, the needed dynamic range is reduced. This implies b* ≈ log(N/S).\n\nNote: our calculation does not involve the dynamics of the learning process. Some numerical methods (including versions of back propagation) commonly require a number of temporary \"guard bits\" on each weight, as pointed out by Richard Durbin (private communication). Another log N bits ought to suffice.
These bits are not needed after learning is complete, and do not contribute to S_0.\n\nIf we combine these ideas and apply them to a network with N units in each layer, fully connected, we arrive at the following expression for the number of different Boolean functions that can be implemented by such a network:\n\n2^(S_0) ≈ 2^B    (2)\n\nwhere\n\nB ≈ L N^2 log N    (3)\n\nThese results depend on the fact that we are considering only a very restricted type of processing unit: the output is a monotone function of a weighted sum of inputs. Cover (1965) discussed in considerable depth the capabilities of such units. Valiant (1986) has explored the learning capabilities of various models of computation.\n\nAbu-Mostafa has emphasized the principles of information and entropy and applied them to measuring the properties of the training set. At this conference, formulas similar to equation 3 arose in the work of Baum, Psaltis, and Venkatesh, in the context of calculating the number of different training patterns a network should be able to memorize. We originally proposed equation 2 as an estimate of the number of patterns the network would have to memorize before it could form a reliable generalization. The basic idea, which has numerous consequences, is to estimate the number of (bins of) networks that can be realized.\n\nReferences\n\n1. Yaser Abu-Mostafa, these proceedings.\n\n2. Eric Baum, these proceedings.\n\n3. T. M. Cover, \"Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition,\" IEEE Trans. Elec. Comp., EC-14, 326-334 (June 1965).\n\n4. John Denker, Daniel Schwartz, Ben Wittner, Sara Solla, John Hopfield, Richard Howard, and Lawrence Jackel, Complex Systems, in press (1987).\n\n5. Demetri Psaltis, these proceedings.\n\n6. L. G. Valiant, SIAM J. Comput. 15(2), 531 (1986), and references therein.\n\n7. Santosh Venkatesh, these proceedings.
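As a closing numerical illustration (ours, not part of the paper), the scaling of equations 2 and 3 can be evaluated directly; the helper function and its parameter names below are our own.

```python
# Back-of-envelope evaluation of equations 2 and 3: a fully connected
# network with L layers of N units has about L * N**2 connections,
# each worth roughly b* = log2(N / S) useful bits, giving
# B ~ L * N**2 * log2(N / S) and at most 2**B distinct functions.
from math import log2

def bits_required(L, N, S=1):
    b_star = log2(N / S)        # useful bits per connection, b* ~ log(N/S)
    return L * N ** 2 * b_star  # total bits B (equation 3)

for width in (10, 100, 1000):
    print(width, round(bits_required(L=3, N=width)))
```

Since the training data required scales with S_0 <= B, doubling the layer width N multiplies the amount of training data needed by roughly four (times a slowly growing log factor).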
", "award": [], "sourceid": 16, "authors": [{"given_name": "John", "family_name": "Denker", "institution": null}, {"given_name": "Ben", "family_name": "Wittner", "institution": null}]}