{"title": "Nets with Unreliable Hidden Nodes Learn Error-Correcting Codes", "book": "Advances in Neural Information Processing Systems", "page_first": 89, "page_last": 96, "abstract": null, "full_text": "Nets with Unreliable Hidden Nodes Learn Error-Correcting Codes\n\nStephen Judd\nSiemens Corporate Research\n755 College Road East\nPrinceton NJ 08540\njUdd@learning.siemens.com\n\nPaul W. Munro\nDepartment of Information Science\nUniversity of Pittsburgh\nPittsburgh, PA 15260\nmunro@lis.pitt.edu\n\nABSTRACT\n\nIn a multi-layered neural network, any one of the hidden layers can be viewed as computing a distributed representation of the input. Several \"encoder\" experiments have shown that when the representation space is small it can be fully used. But computing with such a representation requires completely dependable nodes. In the case where the hidden nodes are noisy and unreliable, we find that error-correcting schemes emerge simply by using noisy units during training; random errors injected during backpropagation result in spreading representations apart. Average and minimum distances increase with misfire probability, as predicted by coding-theoretic considerations. Furthermore, the effect of this noise is to protect the machine against permanent node failure, thereby potentially extending the useful lifetime of the machine.\n\n1 INTRODUCTION\n\nThe encoder task described by Ackley, Hinton, and Sejnowski (1985) for the Boltzmann machine, and by Rumelhart, Hinton, and Williams (1986) for feed-forward networks, has been used as one of several standard benchmarks in the neural network literature. Cottrell, Munro, and Zipser (1987) demonstrated the potential of such autoencoding architectures for lossy compression of image data. In the encoder architecture, the weights connecting the input layer to the hidden layer play the role of an encoding mechanism, 
and the hidden-output weights are analogous to a decoding device. In the terminology of Shannon and Weaver (1949), the hidden layer corresponds to the communication channel. By analogy, channel noise corresponds to a fault (misfiring) in the hidden layer. Previous encoder studies have shown that the representations in the hidden layer correspond to optimally efficient (i.e., fully compressed) codes, which suggests that introducing noise in the form of random interference with hidden unit function may lead to the development of codes more robust to noise of the kind that prevailed during learning. Many of these ideas also appear in Chiueh and Goodman (1987) and Sequin and Clay (1990).\n\nWe have tested this conjecture empirically, and analyzed the resulting solutions, using a standard gradient-descent procedure (backpropagation). Although there are alternative techniques to encourage fault tolerance through construction of specialized error functions (e.g., Chauvin, 1989) or direct attacks (e.g., Neti, Schneider, and Young, 1990), we have used a minimalist approach that simply introduces intermittent node misfirings during training that mimic the errors anticipated during normal performance.\n\nIn traditional approaches to developing error-correcting codes (e.g., Hamming, 1980), each symbol from a source alphabet is mapped to a codeword (a sequence of symbols from a code alphabet); the distance between codewords is directly related to the code's robustness.\n\n2 METHODOLOGY\n\nComputer simulations were performed using strictly layered feed-forward networks. The nodes of one of the hidden layers randomly misfire during training; in most experiments, this \"channel\" layer was the sole hidden layer. 
Each input node corresponds to a transmitted symbol, output nodes to received symbols, channel representations to codewords; other layers are introduced as needed to enable nonlinear encoding and/or decoding. After training, the networks were analyzed under various conditions, in terms of performance and coding-theoretic measures, such as Hamming distance between codewords.\n\nThe response, r, of each unit in the channel layer is computed by passing the weighted sum, x, through the hyperbolic tangent (a sigmoid that ranges from -1 to +1). The responses of those units randomly designated to misfire are then multiplied by -1, as this is most comparable with concepts from coding theory for binary channels.1 The misfire operation influences the course of learning in two ways, since the erroneous information is both passed on to units further \"downstream\" in the net, and used as the presynaptic factor in the synaptic modification rule. Note that the derivative factor in the backpropagation procedure is unaffected for units using the hyperbolic tangent, since dr/dx = (1+r)(1-r)/2, which is unchanged when r is replaced by -r.\n\nThese misfirings were randomly assigned according to various kinds of probability distributions: independent identically distributed (i.i.d.), k-of-n, correlated across hidden units, and correlated over the input distribution. The hidden unit representations required to handle uncorrelated noise roughly correspond to Hamming spheres,2 and can be decoded by a\n\n1 Other possible misfire modes include setting the node's activity to zero (or some other constant) or randomizing it. The most appropriate mode depends on various factors, including the situation to be simulated and the type of analysis to be performed. For example, simulating neuronal death in a biological situation may warrant a different failure mode than simulating failure of an electronic component. 
\n2 Consider an n-bit block code, where each codeword lies on a vertex of an n-cube. The Hamming sphere of radius k is the neighborhood of vertices that differ from the codeword by a number of bits less than or equal to k.\n\nsingle layer of weights; thus the entire network consists of just three sets of units: source-channel-sink. However, correlated noise generally necessitates additional layers.\n\nAll the experiments described below use the encoder task described by Ackley, Hinton, and Sejnowski (1985); that is, the input pattern consists of just one unit active and the others inactive. The task is to activate only the corresponding unit in the output layer. By comparison with coding theory, the input units are thus analogous to symbols to be encoded, and the hidden unit representations are analogous to the code words.\n\n3 RESULTS\n\n3.1. PERFORMANCE\n\nThe first experiment supports the claim of Sequin and Clay (1990) that training with faults improves network robustness. Four 8-30-8 encoders were trained with fault probability p = 0, 0.05, 0.1, and 0.3 respectively. After training, each network was tested with fault probabilities varying from 0.05 to 1.0. The results show enhanced performance for networks trained with a higher rate of hidden unit misfiring. Figure 1 shows four performance curves (one for each training fault probability), each as a function of test fault probability.\n\nInteresting convergence properties were also observed; as the training fault probability, p, was varied from 0 to 0.4, networks converged reliably faster for low nonzero values (0.05 < p < 0.15) than they did at p = 0.\n\n[Figure 1 plot: average percent correct (0.0 to 1.0) versus test fault probability (0.0 to 1.0), with one curve for each training fault probability: p=0.00, p=0.05, p=0.10, p=0.30.]\n\nFigure 1. Performance for various training conditions. Four 8-30-8 encoders were trained with different probabilities for hidden unit misfiring. Each data point is an average over 1000 random stimuli with random hidden unit faults. Outputs are scored correct if the most active output node corresponds to the active input node.\n\n3.2. DISTANCE\n\n3.2.1 Distances increase with fault probability\n\nDistances were measured between all pairs of hidden unit representations. Several networks trained with different fault probabilities and various numbers of hidden units were examined. As expected, both the minimum distances and average distances increase with the training fault probability until it approaches 0.5 per node (see Figure 2). For probabilities above 0.25, the minimum distances fall within the theoretical bounds for a 30-bit code of a 16-symbol alphabet given by Gilbert and Elias (see Blahut, 1987).\n\n[Figure 2 plot: distance (axis ticks 4 to 14) versus training fault probability (0.0 to 0.4), showing average and minimum distance curves, with the Elias bound marked.]\n\nFigure 2. Distance increases with fault probability. Average and minimum L1 distances are plotted for 16-30-16 networks trained with fault probabilities ranging from 0.0 to 0.4. Each data point represents an average over 100 networks trained using different weight initializations.\n\n3.2.2. Input probabilities affect distance\n\nThe probability distribution over the inputs influences the relative distances of the representations at the hidden unit level. To illustrate this, a 4-10-4 encoder was trained using various probabilities for one of the four inputs (denoted P*), distributing the remaining probability uniformly among the other three. 
The average distance between the representation of P* and the others increases with its probability, while the average distance among the other three decreases, as shown in the upper part of Figure 3. The more frequent patterns are generally expected to \"claim\" a larger region of representation space.\n\n[Figure 3 plots: two panels (upper and lower), each showing average distance versus Prob(P*) from 0.0 to 0.5.]\n\nFigure 3. Non-uniform input distribution. 4-10-4 encoders were trained using failure probabilities of 0 (squares), 0.1 (circles), and 0.2 (triangles). The input distribution was skewed by varying the probability of one of the four items (denoted P*) in the training set from 0.05 to 0.5, keeping the other probabilities uniform. Average L1 distances are shown from the manipulated pattern to the other three (open symbols) and among the equiprobables (filled symbols) as well. In the upper figure, failure is independent of the input, while in the lower figure, failure is induced only when P* is presented.\n\nThe dashed line in Figure 3 indicates a uniform input distribution; hence in the top figure, the average distance to P* is equal to the average distances among the other patterns. However, this does not hold in the lower figure, indicating that the representations of stimuli that induce more frequent channel errors also claim more representation space.\n\n3.3. CORRELATED MISFIRING\n\nIf the error probability for each bit in a message (or each hidden unit in a network layer) is uncorrelated with the other message bits (hidden units), then the principles of distance between codewords (representations) apply. 
On the other hand, if there is some structure to the noise (i.e., the misfirings are correlated across the hidden units), there may be different strategies for encoding and decoding that require computations other than simple distance. While a Hamming distance criterion on a hypercube is a linearly separable classification function, and hence computable by a single layer of weights, the more general case is not linearly separable, as is demonstrated below.\n\nExample: Misfiring in 2 of 6 channel units.\nIn this example, up to two of six channel units are randomly selected to misfire with each learning trial. In order to guarantee full recovery from two simultaneous faults, only two symbols can be represented, if the faults are independent; however, if one fault is always in one three-unit subset and the other is always in the complementary subset, it is possible to store four patterns. The following code can be considered with no loss of generality: Let the six hidden units (code bits) be partitioned into two sets of three, where there is at most one fault in each subset. The four code words, 000000, 000111, 111000, 111111, form an error-correcting code under this condition; i.e., each subset is a triplicate code. Under the allowed fault combinations specified above, any given transmitted code string will be converted by noise to one of 9 strings of the 15 that lie at a Hamming distance of 2 (the 15 unconstrained two-bit errors of the string 000000 are shown in the table below with the 9 that satisfy the constraint in a box). Because of the symmetric distribution of these 9 allowed states, any category that includes all of them and is defined by a linear (hyperplane) boundary must include all 15. 
Thus, this code cannot be decoded by a single layer of threshold (or sigmoidal) units; hence even if a 4-6-4 network discovers this code, it will not decode it accurately. However, our experiments show that inserting a reliable (fault-free) hidden layer of just two units between the channel layer and the output layer (i.e., a 4-6-2-4 encoder) enables the discovery of a code that is robust to errors of this kind. The representations of the four patterns in the channel layer show a triply redundant code in each half of the channel layer (Figure 4). The 2-unit layer provides a transformation that allows successful decoding of channel representations with faults.\n\nTable. Possible two-bit error masks\n\n  000011\n  000101  000110\n+------------------------+\n| 001001  001010  001100 |\n| 010001  010010  010100 | 011000\n| 100001  100010  100100 | 101000  110000\n+------------------------+\n\n[Figure 4 diagram: thresholded activation patterns for the Input, Channel, Decoder, and Output layers.]\n\nFigure 4. Sample solution to 3-3 channel task. Thresholded activation patterns are shown for a 4-6-2-4 network. Errors are introduced into the first hidden (channel) layer only. With each iteration, the outputs of one hidden unit from the left half of the hidden layer and one unit from the right half can be inverted. Note that the channel develops a triplicate code for each half-layer.\n\n4 DISCUSSION\n\nResults indicate that vanilla backpropagation on its own does not spread out the hidden unit representations (codewords) optimally, and that deliberate random misfiring during training induces wider separations, increasing resistance to node misfiring. Furthermore, non-uniform input distributions and non-uniform channel properties lead to asymmetries among the similarity relationships between hidden unit representations that are consistent with optimizing mutual information. 
\n\nA mechanism of this kind may be useful for increasing fault tolerance in electronic systems, and may be used in neurobiological systems. The potential usefulness of inducing faults during training extends beyond fault tolerance. Clay and Sequin (1992) point out that training of this kind can enhance the capacity of a network to generalize. In effect, the probability of random faults can be used to vary the number of \"effective parameters\" (a term coined by Moody, 1992) available for adaptation, without dynamically altering network architecture. Thus, a naive system might begin with a relatively high probability of misfiring, and gradually reduce it as storage capacity needs increase with experience.\n\nThis technique may be particularly valuable for designing efficient, robust codes for channels with high-order statistical properties, which defy traditional coding techniques. In such cases, a single layer of weights for encoding is not generally sufficient, as was shown above in the 4-6-2-4 example. Additional layers may enhance code efficiency for complex noiseless applications, such as image compression (Cottrell, Munro, and Zipser, 1987).\n\nAcknowledgements\n\nThe second author participated in this research as a visiting research scientist during the summers of 1991 and 1992 at Siemens Corporate Research, which kindly provided financial support and a stimulating research environment.\n\nReferences\n\nAckley, D. H., Hinton, G. E., and Sejnowski, T. J. (1985) A learning algorithm for Boltzmann machines. Cognitive Science 9: 147-169.\n\nBlahut, R. E. (1987) Principles and Practice of Information Theory. Reading, MA: Addison-Wesley.\n\nChauvin, Y. (1989) A back-propagation algorithm with optimal use of hidden units. In: Touretzky, D. S. (ed.) Advances in Neural Information Processing Systems I. San Mateo, CA: Morgan Kaufmann Publishers. 
\n\nChiueh, Tz-Dar and Rodney Goodman. (1987) A neural network classifier based on coding theory. In: Dana Z. Anderson, editor, Neural Information Processing Systems, pp. 174-183, New York, A.I.P.\n\nClay, Reed D. and Sequin, Carlo H. (1992) Fault tolerance training improves generalization and robustness. Proceedings of IJCNN92, 1-769, Baltimore.\n\nCottrell, G. W., P. Munro, and D. Zipser (1987) Image compression by back propagation: An example of extensional programming. Ninth Ann. Meeting of the Cognitive Science Society, pp. 461-473.\n\nHamming, R. W. (1980) Coding and Information Theory. Prentice Hall: Englewood Cliffs, N.J.\n\nMoody, J. (1992) The effective number of parameters. In: Moody, J. E., Hanson, S. J., Lippmann, R. (eds.) Advances in Neural Information Processing Systems 4. San Mateo, CA: Morgan Kaufmann Publishers.\n\nNeti, C., M. H. Schneider, and E. D. Young. (1990) Maximally fault-tolerant neural networks and nonlinear programming. Proceedings of IJCNN, 11-483, San Diego.\n\nRumelhart D., Hinton G., and Williams R. (1986) Learning representations by back-propagating errors. Nature 323:533-536.\n\nSequin, Carlo H. and Reed D. Clay (1990) Fault tolerance in artificial neural networks. Proceedings of IJCNN, 1-703, San Diego.\n\nShannon, C. and Weaver, W. (1949) The Mathematical Theory of Communication. University of Illinois Press.\n", "award": [], "sourceid": 605, "authors": [{"given_name": "Stephen", "family_name": "Judd", "institution": null}, {"given_name": "Paul", "family_name": "Munro", "institution": null}]}