Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)
Joris M. Mooij, Hilbert Kappen
We introduce a computationally efficient method to estimate the valid- ity of the BP method as a function of graph topology, the connectiv- ity strength, frustration and network size. We present numerical results that demonstrate the correctness of our estimates for the uniform random model and for a real-world network ("C. Elegans"). Although the method is restricted to pair-wise interactions, no local evidence (zero "biases") and binary variables, we believe that its predictions correctly capture the limitations of BP for inference and MAP estimation on arbitrary graphi- cal models. Using this approach, we find that BP always performs better than MF. Especially for large networks with broad degree distributions (such as scale-free networks) BP turns out to significantly outperform MF.
1 Introduction
Loopy Belief Propagation (BP) [1] and its generalizations (such as the Cluster Variation Method [2]) are powerful methods for inference and optimization. As is well-known, BP is exact on trees, but also yields surprisingly good results for many other graphs that arise in real-world applications [3, 4]. On the other hand, for densely connected graphs with high interaction strengths the results can be quite bad or BP can simply fail to converge. Despite the fact that BP is often used in applications nowadays, a good theoretical understanding of its convergence properties and the quality of the approximation is still lacking (except for the very special case of graphs with a single loop [5]).
In this article we attempt to answer the question in what way the quality of the BP re- sults depends on the topology of the underlying graph (looking at structural properties such as short cycles and large "hubs") and on the interaction potentials (i.e. strength and frus- tration). We do this for the special but interesting case of binary networks with symmetric pairwise potentials (i.e. Boltzmann machines) without local evidence. This has the practical
advantage that analytical calculations are feasible and furthermore we believe that adding local evidence will only serve to extend the domain of convergence, implying this to be the worst-case scenario. We compare the results with those of the variational mean-field (MF) method.
Real-world graphs are often far from uniformly random and possess structure such as clus- tering and power-law degree distributions [6]. Since we expect these structural features to arise in many applications of BP, we focus in this article on graphs modeling this kind of features. In particular, we consider Erdos-Renyi uniform random graphs [7], Barabasi- Albert "scale-free" graphs [8], and the neural network of a widely studied worm, the Caenorhabditis elegans.
This paper is organized as follows. In the next section we describe the class of graphical models under investigation and explain our method to efficiently estimate the validity of BP and MF. In section 3 we give a qualitative discussion of how the connectivity strength and frustration generally govern the model behavior and discuss the relevant regimes of the model parameters. We show for uniform random graphs that our validity estimates are in very good agreement with the real behavior of the BP algorithm. In section 4 we study the influence of graph topology. Thanks to the numerical efficiency of our estimation method we are able to study very large (N 10000) networks, for which it would not be feasible to simply run BP and look what happens. We also try our method on the neural network of the worm C. Elegans and find almost perfect agreement of our predictions with observed BP behavior. We conclude that BP is always better than MF and that the difference is particularly striking for the case of large networks with broad degree distributions such as scale-free graphs.
2 Model, paramagnetic solution and stability analysis
Let G = (V, B) be an undirected labelled graph without self-connections, defined by a set of nodes V = {1, . . . , N } and a set of links B {(i, j) | 1 i < j N }. The adjacency matrix corresponding to G is denoted M and defined as follows: Mij := 1 if (ij) B or (ji) B and 0 otherwise. We denote the set of neighbors of node i V by Ni := {j V | (ij) B} and its degree by di := #(Ni). We define the average degree d := 1 d N iV i and the maximum degree := maxiV di.
To each node i we associate a binary random variable xi taking values in {-1, +1}. Let W be a symmetric N N -matrix defining the strength of the links between the nodes. The probability distribution over configurations x = (x1, . . . , xN ) is given by
1 1 1 M P(x) := eWijxixj = e 2 ijWijxixj (1) Z Z (ij)B i,jV
with Z a normalization constant. We will take the weight matrix W to be random, with i.i.d. entries {Wij}1i For this model, instead of using the single-node and pair-wise beliefs bi(xi) resp. bij(xi, xj), it turns out to be more convenient to use the (equivalent) quantities m := {mi}iV and := {ij}(ij)B, defined by: We will use these throughout this paper. We call the mi magnetizations; note that the expectation values E xi vanish because of the symmetry in the probability distribution (1). As is well-known [2, 9], fixed points of BP correspond to stationary points of the Bethe free energy, which is in this case given by with (x) := x log x. Note that with this parameterization all normalization and overlap constraints (i.e. b x ij (xi, xj ) = bi(xi)) are satisfied by construction [10]. We can mini- j mize the Bethe free energy analytically by setting its derivatives to zero; one then immedi- ately sees that a possible solution of the resulting equations is the paramagnetic1 solution: mi = 0 and ij = tanh Wij (for (ij) B). For this solution to be a minimum (instead of a saddle point or maximum), the Hessian of FBe at that point should be positive-definite. This condition turns out to be equivalent to the following Bethe stability matrix being positive-definite. Whether this is the case obviously depends on the values of the weights Wij and the adjacency matrix M . Since for zero weights (W = 0), the stability matrix is just the identity matrix, the paramagnetic solution is a minimum of the Bethe free energy for small values of the weights Wij. The question of what "small" exactly means in terms of J and J0 and how this relates to the graph topology will be taken on in the next two sections. First we discuss the situation for the mean-field variational method. The mean-field free energy FMF (m) only depends on m; we can set its derivatives to zero, which again yields the paramagnetic solution m = 0. The corresponding stability matrix (equal to the Hes- sian) is given by (AMF )ij := ij - WijMij and should be positive-definite for the paramagnetic solution to be stable. One can prove [11] that ABe is positive-definite whenever AMF is positive-definite. Since the exact mag- netizations are zero, we conclude that the Bethe approximation is better than the mean-field approximation for all possible choices of the weights W . As we will see later on, this dif- ference can become quite large for large networks. 3 Weight dependence The behavior of the graphical model depends critically on the parameters J0 and J. Taking the graph topology to be uniformly random (see also subsection 4.1) we recover the model known in the statistical physics community as the Viana-Bray model [12], which has been thoroughly studied and is quite well-understood. In the limit N , there are different relevant regimes ("phases") for the parameters J and J0 to be distinguished (cf. Fig. 1): J 0.2 J 0.2 Figure 1: Empirical regime boundaries for the ER graph model with N = 100 and d = 20, averaged over three instances; expectation values are shown as thick black lines, standard- deviations are indicated by the gray areas. See the main text for additional explanation. The exact location of the boundary between the spin-glass and ferromagnetic phase in the right-hand plot (indicated by the dashed line) was not calculated. The red dash-dotted line shows the stability boundary for MF. Consider now the right-hand plot in Fig. 1. Here we have plotted the different regimes con- cerning the stability of the paramagnetic solution of the Bethe approximation.2 We find that the m = 0 solution is indeed stable for J and J0 small and becomes unstable at some point when J0 increases. This signals the paramagnetic-ferromagnetic phase transition. The lo- cation is in good agreement with the known phase boundary found for the N limit by advanced statistical physics methods as we show in more detail in [11]. For comparison we have also plotted the stability boundary for MF (the red dash-dotted line). Clearly, the mean-field approximation breaks down much earlier than the Bethe approximation and is unable to capture the phase transitions occurring for large connectivity strengths. The boundary between the spin-glass phase and the paramagnetic phase is more subtle. What happens is that the Bethe stability matrix becomes marginally stable at some point when we increase J , i.e. the minimum eigenvalue of ABe approaches zero (in the limit N ). This means that the Bethe free energy becomes very flat at that point. If we go on increasing J , the m = 0 solution becomes stable again (in other words, the minimum eigenvalue of the stability matrix ABe becomes positive again). We interpret the marginal instability as signalling the onset of the spin-glass phase. Indeed it coincides with the known phase boundary for the Viana-Bray model [11, 12]. We observe a similar marginal instability for other graph topologies. Now consider the left-hand plot, Fig. 1(a). It shows the convergence behavior of the BP al- gorithm, which was determined by running BP with a fixed number of maximum iterations and slight damping. The messages were initialized randomly. We find different regimes that are separated by the boundaries shown in the plot. For small J and J0, BP converges to m = 0. For J0 large enough, BP converges to one of the two ferromagnetic solutions 2Although in Fig. 1 we show only one particular graph topology, the general appearance of these plots does not differ much for other graph topologies, especially for large N . The scale of the plots mostly depends on the network size N and the average degree d as we will show in the next section. Figure 2: Critical values for Bethe and MF for different graph topologies ( : ER, : BA) in the dense limit with d = 0.1N as a function of network size. Note that the y-axis is rescaled by d. (which one is determined by the random initial conditions). For large J , BP does not con- verge within 1000 iterations, indicating a complex probability distribution. The boundaries coincide within statistical precision with those in the right-hand plot which were obtained by the stability analysis. The computation time necessary for producing a plot such as Fig. 1(a), showing the conver- gence behavior of BP, quickly increases with increasing N . The computation time needed for the stability analysis (Fig. 1(b)), which amounts to calculating the minimal eigenvalue of the N N stability matrix, is much less, allowing us to investigate the behavior of BP for large networks. 4 Graph topology In this section we will concentrate on the frustrated case, more precisely on the case J0 = 0 (i.e. the y-axis in the regime diagrams) and study the location of the Bethe marginal instability and of the MF instability for various graph topologies as a function of network size N and average degree d. We will denote by J Be c the critical value of J at which the Bethe paramagnetic solution becomes marginally unstable and we will refer to this as the Bethe critical value. The critical value of J where the MF solution becomes unstable will be denoted as J MF c and referred to as the MF critical value. In studying the influence of graph topology for large networks, we have to distinguish two cases, which we call the dense and sparse limits. In the dense limit, we let N and scale the average degree as d = cN for some fixed constant c. In this limit, we find that the influence of the graph topology is almost negligible. For all graph topologies that we have considered, we find the following asymptotic behavior for the critical values: The constant of proportionality is approximately 1. These results are illustrated in Fig. 2 for two different graph topologies that will be discussed in more detail below. In the sparse limit, we let N but keep d fixed. In that case the resulting critical values show significant dependence on the graph topology as we will see. 4.1 Uniform random graphs (ER) The first and most elementary random graph model we will consider was introduced and studied by Erdos and Renyi [7]. The ensemble, which we denote as ER(N, p), consists of Figure 3: Critical values for Bethe and MF for Erdos-Renyi uniform random graphs with average degree d = 10. the graphs with N nodes; links are added between each pair of nodes independently with probability p. The resulting graphs have a degree distribution that is approximately Poisson for large N and the expected average degree is E d = p(N - 1). As was mentioned before, the resulting graphical model is known in the statistical physics literature as the Viana-Bray model (with zero "external field"). Fig. 3 shows the results for the sparse limit, where p is chosen such that the expected aver- age degree is fixed to d = 10. The Bethe critical value J Be c appears to be independent of network size and is slightly larger than 1/ d. The MF critical value J MF c does depend on network size (it looks to be proportional to 1/ instead of 1/ d); in fact it can be proven that it converges very slowly to 0 as N [11], implying that the MF approximation breaks down for very large ER networks in the sparse limit. Although this is an interesting result, one could say that for all practical purposes the MF critical value J MF c is nearly independent of network size N for uniform random graphs. 4.2 Scale-free graphs (BA) A phenomenon often observed in real-world networks is that the degree distribution be- haves like a power-law, i.e. the number of nodes with degree is proportional to - for some > 0. These graphs are also known as "scale-free" graphs. The first random graph model exhibiting this behavior is from Barabasi and Albert [8]. We will consider a slightly different model, which we will denote by BA(N, m). It is defined as a stochastic process, yielding graphs with more and more nodes as time goes on. At t = 0 one starts with the graph consisting of m nodes and no links. At each time step, one node is added; it is connected with m different already existing nodes, attaching preferably to nodes with higher degree ("rich get richer"). More specifically, we take the probability to connect to a node of degree to be proportional to + 1. The degree dis- tribution turns out to have a power-law dependence for N with exponent = 3. In Fig. 4 we illustrate some BA graphs. The difference between the maximum degree and the average degree d is rather large: whereas the average degree d converges to 2m, the maximum degree is known to scale as N . Fig. 5 shows the results of the stability analysis for BA graphs with average degree d = 10. Note that the y-axis is rescaled by to show that the MF critical value J MF c is proportional to 1/ . The Bethe critical values are seen to have a scaling behavior that lies somewhere between 1/ d and 1/ . Compared to the situation for uniform ER graphs, BP now even more significantly outperforms MF. The relatively low sensitivity to the maximum degree that BP exhibits here can be understood intuitively since BA graphs resemble forests of sparsely interconnected stars of high degree, on which BP is exact. mi := bi(+1) - bi(-1); ij := bij(+1, +1) - bij(+1, -1) - bij(-1, +1) + bij(-1, -1).
N 1 + mixi FBe(m, ) := - Wijij + (1 - di) 2 (ij)B i=1 xi=1
1 + mixi + mjxj + xixjij + 4 (ij)B xi,xj =1
2 ij (A ik Be)ij := ij 1 + - Mij (with ij = tanh Wij) (2) 1 - 2 1 - 2 kN ik ij i
The paramagnetic phase, where the magnetizations all vanish (m = 0), valid for J and J0 both small. The ferromagnetic phase, where two configurations (characterized by all magne- tizations being either positive or negative) each get half of the probability mass. This is the phase occurring for large J0.
1Throughout this article, we will use terminology from statistical physics if there is no good corresponding terminology in the field of machine learning available.
BP convergence behavior Stability m=0 minimum Bethe free energy 0.4 0.4 m=0 stable (spin-glass phase) no convergence ? 0.3 0.3
marginal instability
convergence m=0 stable m=0 instable convergence to ferromagnetic (paramagnetic 0.1 0.1 (ferromagnetic to m=0 solutions phase) phase)
0 0 0 0.02 0.04 0.06 0.08 0.1 0 0.02 0.04 0.06 0.08 0.1 J J (a) 0 (b) 0
The spin-glass phase where the probability mass is distributed over exponentially (in N ) many different configurations. This phase occurs for frustrated weights, i.e. for large J .
Mean Field Bethe
2 2
1.5 1.5 1/2 1 d 1 J c 0.5 0.5
0 0 10 100 1000 10000 10 100 1000 10000
N N
1 1 J Be , J MF c c d 2 d
0.5 Bethe J 0.4 c 1/2 1/d 0.3 J c MF Jc 0.2 1/(21/2) 0.1
0 10 100 1000 10000
N