{"title": "Generalisation of A Class of Continuous Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 267, "page_last": 273, "abstract": null, "full_text": "Generalisation of A Class of Continuous Neural Networks\n\nJohn Shawe-Taylor\nDept of Computer Science,\nRoyal Holloway, University of London,\nEgham, Surrey TW20 0EX, UK\nEmail: john@dcs.rhbnc.ac.uk\n\nJieyu Zhao*\nIDSIA, Corso Elvezia 36,\n6900-Lugano, Switzerland\nEmail: jieyu@carota.idsia.ch\n\nAbstract\n\nWe propose a way of using boolean circuits to perform real valued computation in a way that naturally extends their boolean functionality. The functionality of multiple fan-in threshold gates in this model is shown to mimic that of a hardware implementation of continuous Neural Networks. A Vapnik-Chervonenkis dimension and sample size analysis for the systems is performed giving best known sample sizes for a real valued Neural Network. Experimental results confirm the conclusion that the sample sizes required for the networks are significantly smaller than for sigmoidal networks.\n\n1 Introduction\n\nRecent developments in complexity theory have addressed the question of complexity of computation over the real numbers. More recently attempts have been made to introduce some computational cost related to the accuracy of the computations [5]. The model proposed in this paper weakens the computational power still further by relying on classical boolean circuits to perform the computation using a simple encoding of the real values. Using this encoding we also show that $TC^0$ circuits interpreted in the model correspond to a Neural Network design referred to as Bit Stream Neural Networks, which have been developed for hardware implementation [8]. 
\n\nWith the perspective afforded by the general approach considered here, we are also able to analyse the Bit Stream Neural Networks (or indeed any other adaptive system based on the technique), giving VC dimension and sample size bounds for PAC learning. The sample sizes obtained are very similar to those for threshold networks, despite their being derived by very different techniques. They give the best bounds for neural networks involving smooth activation functions, being significantly lower than the bounds obtained recently for sigmoidal networks [4, 7].\n\n*Work performed while at Royal Holloway, University of London\n\nWe subsequently present simulation results showing that Bit Stream Neural Networks based on the technique can be used to solve a standard benchmark problem. The results of the simulations support the theoretical finding that for the same sample size generalisation will be better for the Bit Stream Neural Networks than for classical sigmoidal networks. It should also be stressed that the approach is very general - being applicable to any boolean circuit - and by its definition employs compact digital hardware. This fact motivates the introduction of the model, though it will not play an important part in this paper.\n\n2 Definitions and Basic Results\n\nA boolean circuit is a directed acyclic graph whose nodes are referred to as gates, with a single output node of out-degree zero. The nodes with in-degree zero are termed input nodes. The nodes that are not input nodes are computational nodes. There is a boolean function associated with each computational node of arity equal to its in-degree. The function computed by a boolean network is determined by assigning (input) values to its input nodes and performing the function at each computational node once its input values are determined. The result is the value at the output node. 
The class $TC^0$ is defined to be those functions that can be computed by a family of polynomially sized Boolean circuits with unrestricted fan-in and constant depth, where the gates are either NOT or THRESHOLD.\n\nIn order to use the boolean circuits to compute with real numbers we use the method of stochastic computing to encode real numbers as bit streams. The encoding we will use is to consider the stream of binary bits, for which the 1's are generated independently at random with probability $p$, as representing the number $p$. This is referred to as a Bernoulli sequence of probability $p$. In this representation, the multiplication of two independently generated streams can be achieved by a simple AND gate, since the probability of a 1 on the output stream is equal to $p_1 p_2$, where $p_1$ is the probability of a 1 on the first input stream and $p_2$ is the probability of a 1 on the second input stream. Hence, in this representation the boolean circuit consisting of a single AND gate can compute the product of its two inputs.\n\nMore background information about stochastic computing can be found in the work of Gaines [1]. The analysis we provide is made by treating the calculations as exact real valued computations. In a practical (hardware) implementation real bit streams would have to be generated [3] and the question of the accuracy of a delivered result arises.\n\nIn the applications considered here the output values are used to determine a binary value by comparing with a threshold of 0.5. Unless the actual output is exactly 1 or 0 (which can happen), then however many bits are collected at the output there is a slight probability that an incorrect classification will be made. Hence, the number of bits required is a function of the difference between the actual output and 0.5 and the level of confidence required in the correctness of the classification. 
\n\nDefinition 1 The real function computed by a boolean circuit $C$, which computes the boolean function\n\n$$f_C : \\{0,1\\}^n \\to \\{0,1\\},$$\n\nis the function\n\n$$g_C : [0,1]^n \\to [0,1],$$\n\nobtained by coding each input independently as a Bernoulli sequence and interpreting the output as a similar sequence.\n\nHence, by the discussion above, for the circuit $C$ consisting of a single AND gate the function $g_C$ is given by $g_C(x_1, x_2) = x_1 x_2$.\n\nWe now give a proposition showing that the definition of real computation given above is well-defined and generalises the Boolean computation performed by the circuit.\n\nProposition 2 The bit stream on the output of a boolean circuit computing a real function is a Bernoulli sequence. The real function $g_C$ computed by an $n$ input boolean circuit $C$ can be expressed in terms of the corresponding boolean function $f_C$ as follows:\n\n$$g_C(x) = \\sum_{\\alpha \\in \\{0,1\\}^n} P_x(\\alpha) f_C(\\alpha), \\quad \\text{where } P_x(\\alpha) = \\prod_{i=1}^{n} x_i^{\\alpha_i} (1 - x_i)^{(1 - \\alpha_i)}.$$\n\nIn particular, $g_C|_{\\{0,1\\}^n} = f_C$.\n\nProof: The output bit stream is a Bernoulli sequence, since the behaviour at each time step is independent of the behaviour at previous time steps, assuming the input sequences are independent. Let the probability of a 1 in the output sequence be $p$. Hence, $g_C(x) = p$. At any given time the input to the circuit must be one of the $2^n$ possible binary vectors $\\alpha$. $P_x(\\alpha)$ gives the probability of the vector $\\alpha$ occurring. Hence, the expected value of the output of the circuit is given in the proposition statement, but by the properties of a Bernoulli sequence this value is also $p$. The final claim holds since $P_\\alpha(\\alpha) = 1$, while $P_\\alpha(\\alpha') = 0$ for $\\alpha \\neq \\alpha'$. \u2022\n\nHence, the function computed by a circuit can be denoted by a polynomial of degree $n$, though the representation given above may involve exponentially many terms. This representation will therefore only be used for theoretical analysis. 
\n\n3 Bit Stream Neural Networks\n\nIn this section we describe a neural network model based on stochastic computing and show that it corresponds to taking $TC^0$ circuits in the framework considered in Section 2.\n\nA Stochastic Bit Stream Neuron is a processing unit which carries out very simple operations on its input bit streams. All input bit streams are combined with their corresponding weight bit streams and then the weighted bits are summed up. The final total is compared to a threshold value. If the sum is larger than the threshold the neuron gives an output 1, otherwise 0.\n\nThere are two different versions of the Stochastic Bit Stream Neuron corresponding to the different data representations. The definitions are given as follows.\n\nDefinition 3 (AND-SBSN): An $n$-input AND version Stochastic Bit Stream Neuron has $n$ weights in the range $[-1,1]$ and $n$ inputs in the range $[0,1]$, which are all unipolar representations of Bernoulli sequences. An extra sign bit is attached to each weight Bernoulli sequence. The threshold $\\theta$ is an integer lying between $-n$ and $n$ which is randomly generated according to the threshold probability density function $\\phi(\\theta)$. The computations performed during each operational cycle are\n\n(1) combining respectively the $n$ bits from $n$ input Bernoulli sequences with the corresponding $n$ bits from $n$ weight Bernoulli sequences using the AND operation.\n\n(2) assigning $n$ weight sign bits to the corresponding output bits of the AND gate, summing up all the $n$ signed output bits and then comparing the total with the randomly generated threshold value. If the total is not less than the threshold value, the AND-SBSN outputs 1, otherwise it outputs 0.\n\nWe can now present the main result characterising the functionality of a Stochastic Bit Stream Neural Network as the real function of a $TC^0$ circuit. 
\n\nTheorem 4 The functionality of a family of feedforward networks of Bit Stream Neurons with constant depth organised into layers with interconnections only between adjacent layers corresponds to the function $g_C$ for a $TC^0$ circuit $C$ of depth twice that of the network. The number of input streams is equal to the number of network inputs while the number of parameters is at most twice the number of weights.\n\nProof: Consider first an individual neuron. We construct a circuit whose real functionality matches that of the neuron. The circuit has two layers. The first consists of a series of AND gates. Each gate links one input line of the neuron with its corresponding weight input. The outputs of these gates are linked into a threshold gate with fixed threshold $2d$ for the AND-SBSN, where $d$ is the number of input lines to the neuron. The threshold distribution of the AND SBSN is now simulated by having a series of $2d$ additional inputs to the threshold gate. The number of additional input streams required to simulate the threshold depends on how general a distribution is allowed for the threshold. We consider three cases:\n\n1. If the threshold is fixed (i.e. not programmable), then no additional inputs are required, since the actual threshold can be suitably adapted.\n\n2. If the threshold distribution is always focussed on one value (which can be varied), then an additional $\\lceil \\log_2(2d) \\rceil$ ($\\lceil \\log_2(d) \\rceil$) inputs are required to specify the binary value of this number. A circuit feeding the corresponding number of 1's to the threshold gate is not hard to construct.\n\n3. In the fully general case any series of $2d + 1$ ($d + 1$) numbers summing to one can be assigned as the probabilities of the possible values\n\n$$\\phi(0), \\phi(1), \\ldots, \\phi(t),$$\n\nwhere $t = 2d$ for the AND SBSN. 
We now construct a circuit which takes $t$ input streams and passes to the threshold gate the 1-bits of all the inputs up to the first input stream carrying a 0. No further input is passed to the threshold gate. In other words, the threshold gate receives $s$ bits of input if and only if input streams $1, \\ldots, s$ have bit 1 and either $s = t$ or input stream $s + 1$ has input 0.\n\nWe now set the probability $p_s$ of stream $s$ as follows:\n\n$$p_1 = 1 - \\phi(0), \\qquad p_s = \\frac{1 - \\sum_{i=0}^{s-1} \\phi(i)}{1 - \\sum_{i=0}^{s-2} \\phi(i)} \\quad \\text{for } s = 2, \\ldots, t.$$\n\nWith these values the probability of the threshold gate receiving $s$ bits is $\\phi(s)$ as required.\n\nThis completes the replacement of a single neuron. Clearly, we can replace all neurons in a network in the same manner and construct a network with the required properties provided connections do not 'shortcut' layers, since this would create interactions between bits in different time slots. \u2022\n\n4 VC Dimension and Sample Sizes\n\nIn order to perform a VC Dimension and sample size analysis of the Bit Stream Neural Networks described in the previous section we introduce the following general framework.\n\nDefinition 5 For a set $\\mathcal{G}$ of smooth functions $f : \\mathbb{R}^n \\times \\mathbb{R}^l \\to \\mathbb{R}$, the class $\\mathcal{F}$ is defined as\n\n$$\\mathcal{F} = \\mathcal{F}_{\\mathcal{G}} = \\{ f_w \\mid f_w(x) = f(x, w),\\ f \\in \\mathcal{G} \\}.$$\n\nThe corresponding classification class obtained by taking a fixed set of $s$ of the functions from $\\mathcal{G}$, thresholding the corresponding functions from $\\mathcal{F}$ at 0 and combining them (with the same parameter vector) in some logical formula will be denoted $H_s(\\mathcal{F})$. We will denote $H_1(\\mathcal{F})$ by $H(\\mathcal{F})$.\n\nIn our case we will consider a set of circuits $\\mathcal{C}$ each with $n + l$ input connections, $n$ labelled as the input vector and $l$ identified as parameter input connections. Note that if circuits have too few input connections, we can pad them with dummy ones. 
The set $\\mathcal{G}$ will then be the set\n\n$$\\mathcal{G} = \\mathcal{G}_{\\mathcal{C}} = \\{ g_C \\mid C \\in \\mathcal{C} \\},$$\n\nwhile $\\mathcal{F}_{\\mathcal{G}_{\\mathcal{C}}}$ will be denoted by $\\mathcal{F}_{\\mathcal{C}}$.\n\nWe now quote some of the results of [7] which uses the techniques of Karpinski and MacIntyre [4] to derive sample sizes for classes of smoothly parametrised functions.\n\nProposition 6 [7] Let $\\mathcal{G}$ be the set of polynomials $p$ of degree at most $d$ with $p : \\mathbb{R}^n \\times \\mathbb{R}^l \\to \\mathbb{R}$ and\n\n$$\\mathcal{F} = \\mathcal{F}_{\\mathcal{G}} = \\{ p_w \\mid p_w(x) = p(x, w),\\ p \\in \\mathcal{G} \\}.$$\n\nHence, there are $l$ adjustable parameters and the input dimension is $n$. Then the VC-dimension of the class $H_s(\\mathcal{F})$ is bounded above by\n\n$$\\log_2(2(2d)^l) + 17 l \\log_2(s).$$\n\nCorollary 7 For a set of circuits $\\mathcal{C}$, with $n$ input connections and $l$ parameter connections, the VC-dimension of the class $H_s(\\mathcal{F}_{\\mathcal{C}})$ is bounded above by\n\n$$\\log_2(2(2(n + l))^l) + 17 l \\log_2(s).$$\n\nProof: By Proposition 2 the function $g_C$ computed by a circuit $C$ with $t$ input connections has the form\n\n$$g_C(x) = \\sum_{\\alpha \\in \\{0,1\\}^t} P_x(\\alpha) f_C(\\alpha), \\quad \\text{where } P_x(\\alpha) = \\prod_{i=1}^{t} x_i^{\\alpha_i} (1 - x_i)^{(1 - \\alpha_i)}.$$\n\nHence, $g_C(x)$ is a polynomial of degree $t$. In the case considered the number $t$ of input connections is $n + l$. The result follows from the proposition. \u2022\n\nProposition 8 [7] Let $\\mathcal{G}$ be the set of polynomials $p$ of degree at most $d$ with $p : \\mathbb{R}^n \\times \\mathbb{R}^l \\to \\mathbb{R}$ and\n\n$$\\mathcal{F} = \\mathcal{F}_{\\mathcal{G}} = \\{ p_w \\mid p_w(x) = p(x, w),\\ p \\in \\mathcal{G} \\}.$$\n\nHence, there are $l$ adjustable parameters and the input dimension is $n$. If a function $h \\in H_s(\\mathcal{F})$ correctly computes a function on a sample of $m$ inputs drawn independently according to a fixed probability distribution, where \n\nm ~ \"\",(e, 0) = e(1 ~ y'\u20ac) \n\n[Uln ( 4e~) + In (2l/(~ - 1)) 1 \n\nthen with probability at least $1 - \\delta$ the error rate of $h$ will be less than $\\epsilon$ on inputs drawn according to the same distribution. 
\n\nCorollary 9 For a set of circuits $\\mathcal{C}$, with $n$ input connections and $l$ parameter connections, if a function $h \\in H_s(\\mathcal{F}_{\\mathcal{C}})$ correctly computes a function on a sample of $m$ inputs drawn independently according to a fixed probability distribution, where \n\nm ~ \"\",(e, 0) = e(1 ~ y\u20ac) \n\n[Uln ( 4eJs~n +l)) + In Cl/(~ - 1)) 1 \n\nthen with probability at least $1 - \\delta$ the error rate of $h$ will be less than $\\epsilon$ on inputs drawn according to the same distribution.\n\nProof: As in the proof of the previous corollary, we need only observe that the functions $g_C$ for $C \\in \\mathcal{C}$ are polynomials of degree at most $n + l$. \u2022\n\nNote that the best known sample sizes for threshold networks are given in [6]: \n\nm ~ \"\",(e, 0) = e(1 ~ y'\u20ac) \n\n[2Wln (6~) + In (l/(lo- 1)) 1 ' \n\nwhere $W$ is the number of adaptable weights (parameters) and $N$ is the number of computational nodes in the network. Hence, the bounds given above are almost identical to those for threshold networks, despite the underlying techniques used to derive them being entirely different.\n\nOne surprising fact about the above results is that the VC dimension and sample sizes are independent of the complexity of the circuit (except in as much as it must have the required number of inputs). Hence, additional layers of fixed computation cannot increase the sample complexity above the bound given.\n\n5 Simulation Results\n\nThe Monk's problems, which were the basis of a first international comparison of learning algorithms, are derived from a domain in which each training example is represented by six discrete-valued attributes. Each problem involves learning a binary function defined over this domain, from a sample of training examples of this function. The 'true' concepts underlying each Monk's problem are given by:\n\nMONK-1: (attribute1 = attribute2) or (attribute5 = 1)\n\nMONK-2: (attributei = 1) for EXACTLY TWO i \u2208 {1, 2, ... 
, 6}\n\nMONK-3: (attribute5 = 3 and attribute4 = 1) or (attribute5 \u2260 4 and attribute2 \u2260 3)\n\nThere are 124, 169 and 122 samples in the training sets of MONK-1, MONK-2 and MONK-3 respectively. The testing set has 432 patterns. The network had 17 input units, 10 hidden units, 1 output unit, and was fully connected. Two networks were used for each problem. The first was a standard multi-layer perceptron with sigmoid activation function trained using the backpropagation algorithm (BP Network).\n\nThe second network had the same architecture, but used bit stream neurons in place of sigmoid ones (BSN Network). The functionality of the neurons was simulated using probability generating functions to compute the probability values of the bit streams output at each neuron. The backpropagation algorithm was adapted to train these networks by computing the derivative of the output probability value with respect to the individual inputs to that neuron [8].\n\nExperiments were performed with and without noise in the training examples. There is 5% additional noise (misclassifications) in the training set of MONK-3. The results for the Monk's problems using the moment generating function simulation are shown as follows:\n\n          BP Network            BSN Network\n          training  testing     training  testing\nMONK-1    100%      86.6%       100%      97.7%\nMONK-2    100%      84.2%       100%      100%\nMONK-3    97.1%     83.3%       98.4%     98.6%\n\nIt can be seen that the generalisation of the BSN network is much better than that of a general multilayer backpropagation network. The results on the MONK-3 problem are extremely good. The results reported by Hassibi and Stork [2] using a sophisticated weight pruning technique are only 93.4% correct for the training set and 97.2% correct for the testing set.\n\nReferences\n\n[1] B. R. 
Gaines, Stochastic Computing Systems, Advances in Information Systems Science 2 (1969) pp. 37-172.\n\n[2] B. Hassibi and D.G. Stork, Second order derivatives for network pruning: Optimal brain surgeon, Advances in Neural Information Processing Systems, Vol 5 (1993) 164-171.\n\n[3] P. Jeavons, D.A. Cohen and J. Shawe-Taylor, Generating Binary Sequences for Stochastic Computing, IEEE Trans on Information Theory, 40 (3) (1994) 716-720.\n\n[4] M. Karpinski and A. MacIntyre, Bounding VC-Dimension for Neural Networks: Progress and Prospects, Proceedings of EuroCOLT'95, 1995, pp. 337-341, Springer Lecture Notes in Artificial Intelligence, 904.\n\n[5] P. Koiran, A Weak Version of the Blum, Shub and Smale Model, ESPRIT Working Group NeuroCOLT Technical Report Series, NC-TR-94-5, 1994.\n\n[6] J. Shawe-Taylor, Threshold Network Learning in the Presence of Equivalences, Proceedings of NIPS 4, 1991, pp. 879-886.\n\n[7] J. Shawe-Taylor, Sample Sizes for Sigmoidal Networks, to appear in the Proceedings of Eighth Conference on Computational Learning Theory, COLT'95, 1995.\n\n[8] J. Shawe-Taylor, P. Jeavons and M. van Daalen, Probabilistic Bit Stream Neural Chip: Theory, Connection Science, Vol 3, No 3, 1991.\n", "award": [], "sourceid": 1163, "authors": [{"given_name": "John", "family_name": "Shawe-Taylor", "institution": null}, {"given_name": "Jieyu", "family_name": "Zhao", "institution": null}]}