{"title": "A Revolution: Belief Propagation in Graphs with Cycles", "book": "Advances in Neural Information Processing Systems", "page_first": 479, "page_last": 485, "abstract": "", "full_text": "A Revolution: Belief Propagation in Graphs with Cycles \n\nBrendan J. Frey* \nhttp://www.cs.utoronto.ca/~frey \nDepartment of Computer Science \nUniversity of Toronto \n\nDavid J. C. MacKay \nhttp://wol.ra.phy.cam.ac.uk/mackay \nDepartment of Physics, Cavendish Laboratory \nCambridge University \n\nAbstract \n\nUntil recently, artificial intelligence researchers have frowned upon the application of probability propagation in Bayesian belief networks that have cycles. The probability propagation algorithm is only exact in networks that are cycle-free. However, it has recently been discovered that the two best error-correcting decoding algorithms are actually performing probability propagation in belief networks with cycles. \n\n1 Communicating over a noisy channel \n\nOur increasingly wired world demands efficient methods for communicating bits of information over physical channels that introduce errors. Examples of real-world channels include twisted-pair telephone wires, shielded cable-TV wire, fiber-optic cable, deep-space radio, terrestrial radio, and indoor radio. Engineers attempt to correct the errors introduced by the noise in these channels through the use of channel coding, which adds protection to the information source so that some channel errors can be corrected. A popular model of a physical channel is shown in Fig. 1. A vector of K information bits u = (u_1, ..., u_K), u_k ∈ {0, 1}, is encoded, and a vector of N codeword bits x = (x_1, ..., x_N) is transmitted into the channel. Independent Gaussian noise with variance σ^2 is then added to each codeword bit, 
*Brendan Frey is currently a Beckman Fellow at the Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign. \n\nFigure 1: A communication system with a channel that adds Gaussian noise to the transmitted discrete-time sequence. \n\nproducing the real-valued channel output vector y = (y_1, ..., y_N). The decoder must then use this received vector to make a guess û at the original information vector. The probability P_b(e) of bit error is minimized by choosing the u_k that maximizes P(u_k | y) for k = 1, ..., K. The rate K/N of a code is the number of information bits communicated per codeword bit. We will consider rate 1/2 systems in this paper, where N = 2K. \n\nThe simplest rate 1/2 encoder duplicates each information bit: x_{2k-1} = x_{2k} = u_k, k = 1, ..., K. The optimal decoder for this repetition code simply averages together pairs of noisy channel outputs and then applies a threshold: \n\nû_k = 1 if (y_{2k-1} + y_{2k})/2 > 0.5, and û_k = 0 otherwise. (1) \n\nClearly, this procedure has the effect of reducing the noise variance by a factor of 1/2. The resulting probability P_b(e) that an information bit will be erroneously decoded is given by the area under the tail of the noise Gaussian: \n\nP_b(e) = Φ(−0.5 / √(σ^2/2)), (2) \n\nwhere Φ(·) is the cumulative standard normal distribution. A plot of P_b(e) versus σ for this repetition code is shown in Fig. 2, along with a thumbnail picture that shows the distribution of noisy received signals at the noise level where the repetition code gives P_b(e) = 10^{-5}. \n\nMore sophisticated channel encoders and decoders can be used to increase the tolerable noise level without increasing the probability of a bit error. 
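The repetition code and the threshold decoder of Eqs. 1 and 2 are simple enough to check by simulation. Below is a minimal sketch (the noise level σ = 0.4 and the number of simulated bits are arbitrary choices, not values from the experiments), comparing the empirical bit error rate against the Gaussian tail formula:

```python
import math
import random

random.seed(0)
SIGMA = 0.4    # assumed noise standard deviation (arbitrary choice)
K = 200_000    # number of information bits to simulate

errors = 0
for _ in range(K):
    u = random.randint(0, 1)
    # Rate-1/2 repetition code: each bit is transmitted twice (0/1 signalling).
    y1 = u + random.gauss(0.0, SIGMA)
    y2 = u + random.gauss(0.0, SIGMA)
    # Optimal decoder: average the pair of outputs, then threshold at 0.5.
    u_hat = 1 if (y1 + y2) / 2 > 0.5 else 0
    errors += (u_hat != u)

empirical = errors / K
# Averaging halves the noise variance, so P_b(e) = Phi(-0.5 / sqrt(SIGMA^2 / 2)),
# written here via the complementary error function: Phi(-a) = 0.5 * erfc(a / sqrt(2)).
a = 0.5 / math.sqrt(SIGMA**2 / 2)
theory = 0.5 * math.erfc(a / math.sqrt(2))
print(f"empirical {empirical:.4f}  theory {theory:.4f}")
```

The two numbers agree to within sampling error, which is just the statement that averaging a pair of outputs halves the noise variance.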
This approach can in principle improve performance up to a bound determined by Shannon (1948). For a given probability of bit error P_b(e), this limit gives the maximum noise level that can be tolerated, no matter what channel code is used. Shannon's proof was nonconstructive, meaning that he showed that there exist channel codes that achieve his limit, but did not present practical encoders and decoders. The curve for Shannon's limit is also shown in Fig. 2. \n\nThe two curves described above define the region of interest for practical channel coding systems. For a given P_b(e), if a system requires a lower noise level than the repetition code, then it is not very interesting. At the other extreme, it is impossible for a system to tolerate a higher noise level than Shannon's limit. \n\n2 Decoding Hamming codes by probability propagation \n\nOne way to detect errors in a string of bits is to add a parity-check bit that is chosen so that the sum modulo 2 of all the bits is 0. If the channel flips one bit, the receiver will find that the sum modulo 2 is 1, and can detect that an error occurred. In a simple Hamming code, the codeword x consists of the original vector \n\nFigure 2: Probability of bit error P_b(e) versus noise level σ for several codes with rates near 1/2, using 0/1 signalling. It is impossible to obtain a P_b(e) below Shannon's limit (shown on the far right for rate 1/2). 
"H-PP" = Hamming code (rate 4/7) decoded by probability propagation (5 iterations); "H-Exact" = Hamming code decoded exactly; "LDPCC-PP" = low-density parity-check code decoded by probability propagation; "TC-PP" = turbocode decoded by probability propagation. The thumbnail pictures show the distribution of noisy received signals at the noise levels where the repetition code and the Shannon limit give P_b(e) = 10^{-5}. \n\nu in addition to several parity-check bits, each of which depends on a different subset of the information bits. In this way, the Hamming code can not only detect errors but also correct them. \n\nThe code can be cast in the form of the conditional probabilities that specify a Bayesian network. The Bayesian network for a K = 4, N = 7 rate 4/7 Hamming code is shown in Fig. 3a. Assuming the information bits are uniformly random, we have P(u_k) = 0.5, u_k ∈ {0, 1}, k = 1, 2, 3, 4. Codeword bits 1 to 4 are direct copies of the information bits: P(x_k | u_k) = δ(x_k, u_k), k = 1, 2, 3, 4, where δ(a, b) = 1 if a = b and 0 otherwise. Codeword bits 5 to 7 are parity-check bits: P(x_5 | u_1, u_2, u_3) = δ(x_5, u_1 ⊕ u_2 ⊕ u_3), P(x_6 | u_1, u_2, u_4) = δ(x_6, u_1 ⊕ u_2 ⊕ u_4), P(x_7 | u_2, u_3, u_4) = δ(x_7, u_2 ⊕ u_3 ⊕ u_4), where ⊕ indicates addition modulo 2 (XOR). Finally, the conditional channel probability densities are \n\nP(y_n | x_n) = (1/√(2πσ^2)) exp(−(y_n − x_n)^2 / (2σ^2)), (3) \n\nfor n = 1, ..., 7. \n\nFigure 3: (a) The Bayesian network for a K = 4, N = 7 Hamming code. (b) The Bayesian network for a K = 4, N = 8 low-density parity-check code. (c) A block diagram for the turbocode linear feedback shift register. (d) The Bayesian network for a K = 6, N = 12 turbocode. \n\nThe probabilities P(u_k | y) can be computed exactly in this belief network, using Lauritzen and Spiegelhalter's algorithm (1988) or just brute force computation. However, for the more powerful codes discussed below, exact computations are intractable. Instead, one way the decoder can approximate the probabilities P(u_k | y) is by applying the probability propagation algorithm (Pearl 1988) to the Bayesian network. Probability propagation is only approximate in this case because the network contains cycles (ignoring edge directions), e.g., u_1–x_5–u_2–x_6–u_1. \n\nOnce a channel output vector y is observed, propagation begins by sending a message from y_n to x_n for n = 1, ..., 7. Then, a message is sent from x_k to u_k for k = 1, 2, 3, 4. An iteration now begins by sending messages from the information variables u_1, u_2, u_3, u_4 to the parity-check variables x_5, x_6, x_7 in parallel. The iteration finishes by sending messages from the parity-check variables back to the information variables in parallel. Each time an iteration is completed, new estimates of P(u_k | y) for k = 1, 2, 3, 4 are obtained. \n\nThe P_b(e)–σ curve for optimal decoding and the curve for the probability propagation decoder (5 iterations) are shown in Fig. 2. Quite surprisingly, the performance of the iterative decoder is quite close to that of the optimal decoder. Our expectation was that short cycles would confound the probability propagation decoder. However, it seems that good performance can be obtained even when there are short cycles in the code network. \n\nFor this simple Hamming code, the complexities of the probability propagation decoder and the exact decoder are comparable. 
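For a code this small, the exact decoder mentioned above can be written as a brute-force sum over all 2^4 information vectors: score each of the 16 codewords under the Gaussian channel likelihood and marginalize. A minimal sketch (the noise level is an assumed, arbitrary value; the parity checks are the ones given in Section 2):

```python
import itertools
import math

SIGMA = 0.5  # assumed noise standard deviation (arbitrary choice)

def encode(u):
    """Map 4 information bits to the 7-bit codeword using the parity checks above."""
    u1, u2, u3, u4 = u
    return [u1, u2, u3, u4,
            u1 ^ u2 ^ u3,   # x_5
            u1 ^ u2 ^ u4,   # x_6
            u2 ^ u3 ^ u4]   # x_7

def likelihood(y, x):
    """Unnormalized Gaussian channel likelihood P(y | x) with variance SIGMA^2."""
    return math.exp(-sum((yn - xn) ** 2 for yn, xn in zip(y, x)) / (2 * SIGMA**2))

def exact_posteriors(y):
    """Exact P(u_k = 1 | y), k = 1..4, by summing over all 16 codewords."""
    num, z = [0.0] * 4, 0.0
    for u in itertools.product((0, 1), repeat=4):
        w = likelihood(y, encode(u))
        z += w
        for k in range(4):
            num[k] += w * u[k]
    return [n / z for n in num]
```

The probability propagation decoder approximates these same marginals by message passing rather than exhaustive summation; for this network the two agree closely even though exhaustive summation is exact.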
However, the similarity in performance between these two decoders prompts the question: "Can probability propagation decoders give performances comparable to exact decoding in cases where exact decoding is computationally intractable?" \n\n3 A leap towards the limit: Low-density parity-check codes \n\nRecently, there has been an explosion of interest in the channel coding community in two new coding systems that have brought us a leap closer to Shannon's limit. Both of these codes can be described by Bayesian networks with cycles, and it turns out that the corresponding iterative decoders are performing probability propagation in these networks. \n\nFig. 3b shows the Bayesian network for a simple low-density parity-check code (Gallager 1963). In this network, the information bits are not represented explicitly. Instead, the network defines a set of allowed configurations for the codewords. Each parity-check vertex q_i requires that the codeword bits {x_n}_{n∈Q_i} to which q_i is connected have even parity: \n\nP(q_i | {x_n}_{n∈Q_i}) = δ(q_i, ⊕_{n∈Q_i} x_n), (4) \n\nwhere q_i is clamped to 0 to ensure even parity. Here, Q_i is the set of indices of the codeword bits to which parity-check vertex q_i is connected. The conditional probability densities for the channel outputs are the same as in Eq. 3. \n\nOne way to view the above code is as N binary codeword variables along with a set of linear (modulo 2) equations. If in the end we want there to be K degrees of freedom, then the number of linearly independent parity-check equations should be N − K. In the above example, there are N = 8 codeword bits and 4 parity checks, leaving K = 8 − 4 = 4 degrees of freedom. It is these degrees of freedom that we use to represent the information vector u. 
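The counting argument above is easy to verify exhaustively for a toy code. In the sketch below the four check sets Q_i are hypothetical (the actual connections of Fig. 3b are not reproduced here); what matters is that the checks are linearly independent, so N = 8 bits constrained by 4 checks leave 2^4 = 16 valid codewords:

```python
from itertools import product

# Hypothetical check sets Q_i: the codeword bit indices each parity check covers.
CHECKS = [(0, 1, 2, 3), (2, 3, 4, 5), (4, 5, 6, 7), (0, 3, 5, 6)]

def in_code(x):
    """True iff every parity-check vertex q_i sees even parity (Eq. 4 with q_i = 0)."""
    return all(sum(x[n] for n in q) % 2 == 0 for q in CHECKS)

# Enumerate all 2**8 binary vectors and keep those satisfying every check.
codewords = [x for x in product((0, 1), repeat=8) if in_code(x)]
print(len(codewords))  # 4 independent checks on 8 bits leave 2**(8-4) = 16 codewords
```

Any dependent check would be redundant and would not reduce the count, which is why the text requires the N − K parity-check equations to be linearly independent.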
Because the code is linear, a K-dimensional vector u can be mapped to a valid x simply by multiplying by an N × K matrix (using modulo 2 addition). This is how an encoder can produce a low-density parity-check codeword for an input vector. \n\nOnce a channel output vector y is observed, the iterative probability propagation decoder begins by sending messages from y to x. An iteration now begins by sending messages from the codeword variables x to the parity-check constraint variables q. The iteration finishes by sending messages from the parity-check constraint variables back to the codeword variables. Each time an iteration is completed, new estimates of P(x_n | y) for n = 1, ..., N are obtained. After a valid (but not necessarily correct) codeword has been found, or a prespecified limit on the number of iterations has been reached, decoding stops. The estimate of the codeword is then mapped back to an estimate û of the information vector. \n\nFig. 2 shows the performance of a K = 32,621, N = 65,389 low-density parity-check code when decoded as described above. (See MacKay and Neal (1996) for details.) It is impressively close to Shannon's limit, significantly closer than the "Concatenated Code" (described in Lin and Costello (1983)), which was considered the best practical code until recently. \n\n4 Another leap: Turbocodes \n\nThe codeword for a turbocode (Berrou et al. 1996) consists of the original information vector, plus two sets of bits used to protect the information. Each of these two sets is produced by feeding the information bits into a linear feedback shift register (LFSR), which is a type of finite state machine. The two sets differ in that one set is produced by a permuted set of information bits; i.e., the order of the bits is scrambled in a fixed way before the bits are fed into the LFSR. Fig. 
3c shows a block diagram (not a Bayesian network) for the LFSR that was used in our experiments. Each box represents a delay (memory) element, and each circle performs addition modulo 2. When the kth information bit arrives, the machine has a state s_k which can be written as a binary string of state bits b_4 b_3 b_2 b_1 b_0 as shown in the figure. Bit b_0 of the state s_k is determined by the current input u_k and the previous state s_{k-1}. Bits b_1 to b_4 are just shifted versions of the bits in the previous state. \n\nFig. 3d shows the Bayesian network for a simple turbocode. Notice that each state variable in the two constituent chains depends on the previous state and an information bit. In each chain, every second LFSR output is not transmitted. In this way, the overall rate of the code is 1/2, since there are K = 6 information bits and N = 6 + 3 + 3 = 12 codeword bits. The conditional probabilities for the states of the nonpermuted chain are \n\nP(s^1_k | s^1_{k-1}, u_k) = 1 if state s^1_k follows s^1_{k-1} for input u_k, and 0 otherwise. (5) \n\nThe conditional probabilities for the states in the other chain are similar, except that the inputs are permuted. The probabilities for the information bits are uniform, and the conditional probability densities for the channel outputs are the same as in Eq. 3. \n\nDecoding proceeds with messages being passed from the channel output variables to the constituent chains and the information bits. Next, messages are passed from the information variables to the first constituent chain, s^1. Messages are passed forward and then backward along this chain, in the manner of the forward-backward algorithm (Smyth et al. 1997). After messages are passed from the first chain to the second chain s^2, the second chain is processed using the forward-backward algorithm. To complete the iteration, messages are passed from s^2 to the information bits. \n\nFig. 
2 shows the performance of a K = 65,536, N = 131,072 turbocode when decoded as described above, using a fixed number (18) of iterations. (See Frey (1998) for details.) Its performance is significantly closer to Shannon's limit than the performances of both the low-density parity-check code and the textbook standard "Concatenated Code". \n\n5 Open questions \n\nWe are certainly not claiming that the NP-hard problem (Cooper 1990) of probabilistic inference in general Bayesian networks can be solved in polynomial time by probability propagation. However, the results presented in this paper do show that there are practical problems which can be solved using approximate inference in graphs with cycles. Iterative decoding algorithms are using probability propagation in graphs with cycles, and it is still not well understood why these decoders work so well. Compared to other approximate inference techniques such as variational methods, probability propagation in graphs with cycles is unprincipled. How well do more principled decoders work? In (MacKay and Neal 1995), a variational decoder that maximized a lower bound on ∏_{k=1}^K P(u_k | y) was presented for low-density parity-check codes. However, it was found that the performance of the variational decoder was not as good as the performance of the probability propagation decoder. \n\nIt is not difficult to design small Bayesian networks with cycles for which probability propagation is unstable. Is there a way to easily distinguish between those graphs for which propagation will work and those graphs for which propagation is unstable? A belief that is not uncommon in the graphical models community is that short cycles are particularly apt to lead probability propagation astray. 
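By contrast, the forward-backward sweeps used on each constituent chain in Section 4 are exact, because a chain has no cycles. A generic sketch of that computation (binary states, with assumed transition and observation tables; this is the standard smoothing recursion, not the specific turbocode trellis used in the experiments):

```python
import numpy as np

def forward_backward(trans, obs_lik, prior):
    """Exact posterior state marginals on a chain via the forward-backward algorithm.

    trans[i, j]   = P(s_{k+1} = j | s_k = i)
    obs_lik[k, i] = P(y_k | s_k = i)
    prior[i]      = P(s_1 = i)
    """
    T, S = obs_lik.shape
    alpha = np.zeros((T, S))
    beta = np.ones((T, S))
    alpha[0] = prior * obs_lik[0]
    alpha[0] /= alpha[0].sum()
    for k in range(1, T):            # forward pass
        alpha[k] = (alpha[k - 1] @ trans) * obs_lik[k]
        alpha[k] /= alpha[k].sum()   # rescale to avoid underflow on long chains
    for k in range(T - 2, -1, -1):   # backward pass
        beta[k] = trans @ (beta[k + 1] * obs_lik[k + 1])
        beta[k] /= beta[k].sum()
    post = alpha * beta              # proportional to P(s_k | y) at each k
    return post / post.sum(axis=1, keepdims=True)
```

On a chain these marginals coincide with brute-force enumeration; the open question discussed here is precisely what happens when the same message-passing rules are run on graphs with cycles.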
Although it is possible to design networks where this is so, there seems to be a variety of interesting networks (such as the Hamming code network described above) for which propagation works well, despite short cycles. \n\nThe probability distributions that we deal with in decoding are very special distributions: the true posterior probability mass is actually concentrated in one microstate in a space of size 2^M where M is large (e.g., 10,000). The decoding problem is to find this most probable microstate, and it may be that iterative probability propagation decoders work because the true probability distribution is concentrated in this microstate. \n\nWe believe that there are many interesting and contentious issues in this area that remain to be resolved. \n\nAcknowledgements \n\nWe thank Frank Kschischang, Bob McEliece, and Radford Neal for discussions related to this work, and Zoubin Ghahramani for comments on a draft of this paper. This research was supported in part by grants from the Gatsby foundation, the Information Technology Research Council, and the Natural Sciences and Engineering Research Council. \n\nReferences \n\nC. Berrou and A. Glavieux 1996. Near optimum error correcting coding and decoding: Turbo-codes. IEEE Transactions on Communications 44, 1261-1271. \nG. F. Cooper 1990. The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence 42, 393-405. \nB. J. Frey 1998. Graphical Models for Machine Learning and Digital Communication, MIT Press, Cambridge, MA. See http://www.cs.utoronto.ca/~frey. \nR. G. Gallager 1963. Low-Density Parity-Check Codes, MIT Press, Cambridge, MA. \nS. Lin and D. J. Costello, Jr. 1983. Error Control Coding: Fundamentals and Applications, Prentice-Hall Inc., Englewood Cliffs, NJ. \nS. L. Lauritzen and D. J. 
Spiegelhalter 1988. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B 50, 157-224. \nD. J. C. MacKay and R. M. Neal 1995. Good codes based on very sparse matrices. In Cryptography and Coding: 5th IMA Conference, number 1025 in Lecture Notes in Computer Science, 100-111, Springer, Berlin, Germany. \nD. J. C. MacKay and R. M. Neal 1996. Near Shannon limit performance of low density parity check codes. Electronics Letters 32, 1645-1646. Due to editing errors, reprinted in Electronics Letters 33, 457-458. \nJ. Pearl 1988. Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann, San Mateo, CA. \nC. E. Shannon 1948. A mathematical theory of communication. Bell System Technical Journal 27, 379-423, 623-656. \nP. Smyth, D. Heckerman, and M. I. Jordan 1997. Probabilistic independence networks for hidden Markov probability models. Neural Computation 9, 227-270. \n", "award": [], "sourceid": 1467, "authors": [{"given_name": "Brendan", "family_name": "Frey", "institution": null}, {"given_name": "David", "family_name": "MacKay", "institution": null}]}