{"title": "Nets with Unreliable Hidden Nodes Learn Error-Correcting Codes", "book": "Advances in Neural Information Processing Systems", "page_first": 89, "page_last": 96, "abstract": null, "full_text": "Nets with Unreliable Hidden Nodes Learn \n\nError-Correcting Codes \n\nStephen Judd \n\nSiemens Corporate Research \n\n755 College Road East \nPrinceton NJ 08540 \n\njUdd@learning.siemens.com \n\nPaul W. Munro \n\nDepartment of Infonnation Science \n\nUniversity of Pittsburgh \nPittsburgh, PA 15260 \n\nmunro@lis.pitt.edu \n\nABSTRACT \n\nIn a multi-layered neural network, anyone of the hidden layers can be \nviewed as computing a distributed representation of the input. Several \n\"encoder\" experiments have shown that when the representation space is \nsmall it can be fully used. But computing with such a representation \nrequires completely dependable nodes. In the case where the hidden \nnodes are noisy and unreliable, we find that error correcting schemes \nemerge simply by using noisy units during training; random errors in(cid:173)\njected during backpropagation result in spreading representations apart. \nAverage and minimum distances increase with misfire probability, as \npredicted by coding-theoretic considerations. Furthennore, the effect of \nthis noise is to protect the machine against permanent node failure, \nthereby potentially extending the useful lifetime of the machine. \n\n1 INTRODUCTION \n\nThe encoder task described by Ackley, Hinton, and Sejnowski (1985) for the Boltzmann \nmachine, and by Rumelhart, Hinton, and Williams (1986) for feed-forward networks. has \nbeen used as one of several standard benchmarks in the neural network literature. \nCottrell, Munro, and Zipser (1987) demonstrated the potential of such autoencoding archi(cid:173)\ntectures to lossy compression of image data. In the encoder architecture, the weights con(cid:173)\nnecting the input layer to the hidden layer play the role of an encoding mechanism. 
and the hidden-output weights are analogous to a decoding device. In the terminology of Shannon and Weaver (1949), the hidden layer corresponds to the communication channel. By analogy, channel noise corresponds to a fault (misfiring) in the hidden layer. Previous encoder studies have shown that the representations in the hidden layer correspond to optimally efficient (i.e., fully compressed) codes, which suggests that introducing noise in the form of random interference with hidden unit function may lead to the development of codes more robust to noise of the kind that prevailed during learning. Many of these ideas also appear in Chiueh and Goodman (1987) and Sequin and Clay (1990).

We have tested this conjecture empirically, and analyzed the resulting solutions, using a standard gradient-descent procedure (backpropagation). Although there are alternative techniques to encourage fault tolerance through construction of specialized error functions (e.g., Chauvin, 1989) or direct attacks (e.g., Neti, Schneider, and Young, 1990), we have used a minimalist approach that simply introduces intermittent node misfirings during training that mimic the errors anticipated during normal performance.

In traditional approaches to developing error-correcting codes (e.g., Hamming, 1980), each symbol from a source alphabet is mapped to a codeword (a sequence of symbols from a code alphabet); the distance between codewords is directly related to the code's robustness.

2 METHODOLOGY

Computer simulations were performed using strictly layered feed-forward networks. The nodes of one of the hidden layers randomly misfire during training; in most experiments, this "channel" layer was the sole hidden layer.
Each input node corresponds to a transmitted symbol, output nodes to received symbols, and channel representations to codewords; other layers are introduced as needed to enable nonlinear encoding and/or decoding. After training, the networks were analyzed under various conditions, in terms of performance and coding-theoretic measures, such as Hamming distance between codewords.

The response, r, of each unit in the channel layer is computed by passing the weighted sum, x, through the hyperbolic tangent (a sigmoid that ranges from -1 to +1). The responses of those units randomly designated to misfire are then multiplied by -1, as this is most comparable with concepts from coding theory for binary channels.1 The misfire operation influences the course of learning in two ways, since the erroneous information is both passed on to units further "downstream" in the net, and used as the presynaptic factor in the synaptic modification rule. Note that the derivative factor in the backpropagation procedure is unaffected for units using the hyperbolic tangent, since dr/dx = (1+r)(1-r)/2, which is unchanged when the sign of r is flipped.

These misfirings were randomly assigned according to various kinds of probability distributions: independent identically distributed (i.i.d.), k-of-n, correlated across hidden units, and correlated over the input distribution. The hidden unit representations required to handle uncorrelated noise roughly correspond to Hamming spheres,2 and can be decoded by a

1 Other possible misfire modes include setting the node's activity to zero (or some other constant) or randomizing it. The most appropriate mode depends on various factors, including the situation to be simulated and the type of analysis to be performed. For example, simulating neuronal death in a biological situation may warrant a different failure mode than simulating failure of an electronic component.
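The misfire operation described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the i.i.d. misfire mode and the sigmoid tanh(x/2), whose derivative matches the (1+r)(1-r)/2 factor given in the text.

```python
import numpy as np

def channel_forward(x, p, rng):
    """Pass weighted sums x through the channel layer; each unit
    independently misfires (its response is multiplied by -1) with
    probability p.  With p = 0 this is an ordinary forward pass."""
    r = np.tanh(x / 2.0)             # sigmoid ranging from -1 to +1
    flip = rng.random(r.shape) < p   # units randomly designated to misfire
    return np.where(flip, -r, r)

def channel_derivative(r):
    """Derivative factor dr/dx = (1 + r)(1 - r)/2 used in backpropagation.
    It is even in r, so a misfired (sign-flipped) response yields the
    same factor as the true response."""
    return (1.0 + r) * (1.0 - r) / 2.0
```

A quick check of the symmetry claim: `channel_derivative(r)` and `channel_derivative(-r)` agree for any r, which is why the backward pass needs no special handling for misfired units.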
2 Consider an n-bit block code, where each codeword lies on the vertex of an n-cube. The Hamming sphere of radius k is the neighborhood of vertices that differ from the codeword by a number of bits less than or equal to k.

single layer of weights; thus the entire network consists of just three sets of units: source-channel-sink. However, correlated noise generally necessitates additional layers.

All the experiments described below use the encoder task described by Ackley, Hinton, and Sejnowski (1985); that is, the input pattern consists of just one unit active and the others inactive. The task is to activate only the corresponding unit in the output layer. By comparison with coding theory, the input units are thus analogous to symbols to be encoded, and the hidden unit representations are analogous to the code words.

3 RESULTS

3.1 PERFORMANCE

The first experiment supports the claim of Sequin and Clay (1990) that training with faults improves network robustness. Four 8-30-8 encoders were trained with fault probability p = 0, 0.05, 0.1, and 0.3 respectively. After training, each network was tested with fault probabilities varying from 0.05 to 1.0. The results show enhanced performance for networks trained with a higher rate of hidden unit misfiring. Figure 1 shows four performance curves (one for each training fault probability), each as a function of test fault probability.

Interesting convergence properties were also observed; as the training fault probability, p, was varied from 0 to 0.4, networks converged reliably faster for low nonzero values (0.05
[Figure 1 here: performance vs. test fault probability (0.0 to 1.0), one curve per training fault probability: p = 0.00, 0.05, 0.10, 0.30.]

Figure 1. Performance for various training conditions. Four 8-30-8 encoders were trained with different probabilities for hidden unit misfiring. Each data point is an average over 1000 random stimuli with random hidden unit faults. Outputs are scored correct if the most active output node corresponds to the active input node.

3.2 DISTANCE

3.2.1 Distances increase with fault probability

Distances were measured between all pairs of hidden unit representations. Several networks trained with different fault probabilities and various numbers of hidden units were examined. As expected, both the minimum distances and average distances increase with the training fault probability until it approaches 0.5 per node (see Figure 2). For probabilities above 0.25, the minimum distances fall within the theoretical bounds for a 30-bit code of a 16-symbol alphabet given by Gilbert and Elias (see Blahut, 1987).

[Figure 2 here: average and minimum distance vs. training fault probability (0.0 to 0.4), with the Elias bound marked.]

Figure 2. Distance increases with fault probability. Average and minimum L1 distances are plotted for 16-30-16 networks trained with fault probabilities ranging from 0.0 to 0.4. Each data point represents an average over 100 networks trained using different weight initializations.

3.2.2 Input probabilities affect distance

The probability distribution over the inputs influences the relative distances of the representations at the hidden unit level.
To illustrate this, a 4-10-4 encoder was trained using various probabilities for one of the four inputs (denoted P*), distributing the remaining probability uniformly among the other three. The average distance between the representation of P* and the others increases with its probability, while the average distance among the other three decreases, as shown in the upper part of Figure 3. The more frequent patterns are generally expected to "claim" a larger region of representation space.

[Figure 3 (upper part) here: average distances as a function of the probability of P*.]
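The pairwise distance measurements of Section 3.2 can be sketched as follows. This is an illustrative outline, not the authors' analysis code: it assumes channel responses are binarized by sign before comparison, so the distances reported are Hamming distances between codewords (Figure 2 reports L1 distances on the responses themselves).

```python
import numpy as np
from itertools import combinations

def code_distances(hidden_reps):
    """Given one channel-layer response vector per input symbol
    (the rows of hidden_reps), binarize each response by sign and
    return (minimum, average) Hamming distance over all pairs of
    the resulting codewords."""
    codes = np.sign(hidden_reps) >= 0                    # threshold at zero
    dists = [np.sum(a != b) for a, b in combinations(codes, 2)]
    return min(dists), sum(dists) / len(dists)
```

For example, four codewords in a 4-bit space whose closest pair differs in two positions give a minimum distance of 2; for a trained 16-30-16 encoder one would instead pass the 16 rows of 30 channel responses.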