{"title": "Estimating the wrong Markov random field: Benefits in the computation-limited setting", "book": "Advances in Neural Information Processing Systems", "page_first": 1425, "page_last": 1432, "abstract": null, "full_text": "Estimating the \"wrong\" Markov random field: Benefits in the computation-limited setting\n\nMartin J. Wainwright Department of Statistics, and Department of Electrical Engineering and Computer Science UC Berkeley, Berkeley CA 94720 wainwrig@{stat,eecs}.berkeley.edu\n\nAbstract\nConsider the problem of joint parameter estimation and prediction in a Markov random field: i.e., the model parameters are estimated on the basis of an initial set of data, and then the fitted model is used to perform prediction (e.g., smoothing, denoising, interpolation) on a new noisy observation. Working in the computation-limited setting, we analyze a joint method in which the same convex variational relaxation is used to construct an M-estimator for fitting parameters, and to perform approximate marginalization for the prediction step. The key result of this paper is that in the computation-limited setting, using an inconsistent parameter estimator (i.e., an estimator that returns the \"wrong\" model even in the infinite data limit) is provably beneficial, since the resulting errors can partially compensate for errors made by using an approximate prediction technique. En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of variational methods. We show that joint estimation/prediction based on the reweighted sum-product algorithm substantially outperforms a commonly used heuristic based on ordinary sum-product. 
Keywords: Markov random fields; variational method; message-passing algorithms; sum-product; belief propagation; parameter estimation; learning.\n\nWork partially supported by Intel Corporation Equipment Grant 22978, an Alfred P. Sloan Foundation Fellowship, and NSF Grant DMS-0528488.\n\n1 Introduction\n\nConsider the problem of joint learning (parameter estimation) and prediction in a Markov random field (MRF): in the learning phase, an initial collection of data is used to estimate parameters, and the fitted model is then used to perform prediction (e.g., smoothing, interpolation, denoising) on a new noisy observation. Disregarding computational cost, there exist optimal methods for solving this problem (Route A in Figure 1). For general MRFs, however, optimal methods are computationally intractable; consequently, many researchers have examined various types of message-passing methods for learning and prediction problems, including belief propagation [3, 6, 7, 14], expectation propagation [5], linear response [4], as well as reweighted message-passing algorithms [10, 13]. Accordingly, it is of considerable interest to understand and quantify the performance loss incurred by using computationally tractable methods versus exact methods (i.e., Route B versus A in Figure 1).\n\n[Figure 1: block diagram. A data source supplies samples $\\{x^i\\}$ and a new observation $y$. Route A: optimal parameter estimation followed by optimal prediction. Route B: approximate parameter estimation followed by approximate prediction. The error is the gap between the two resulting predictions.]\n\nFigure 1. Route A: computationally intractable combination of parameter estimation and prediction. Route B: computationally efficient combination of approximate parameter estimation and prediction.\n\nIt is now well known that many message-passing algorithms--including mean field, (generalized) belief propagation, expectation propagation and various convex relaxations--can be understood from a variational perspective; in particular, all of these message-passing algorithms are iterative methods solving relaxed forms of an exact variational principle [12]. This paper focuses on the analysis of variational methods based on convex relaxations, which include a broad range of extant algorithms--among them the tree-reweighted sum-product algorithm [11], reweighted forms of generalized belief propagation [13], and semidefinite relaxations [12]. Moreover, it is straightforward to modify other message-passing methods (e.g., expectation propagation [5]) so as to \"convexify\" them. At a high level, the key idea of this paper is the following: given that approximate methods can lead to errors at both the estimation and prediction phases, it is natural to speculate that these sources of error might be arranged to partially cancel one another. Our theoretical analysis confirms this intuition: we show that with respect to end-to-end performance, it is in fact beneficial, even in the infinite data limit, to learn the \"wrong\" model by using an inconsistent parameter estimator. More specifically, we show how any convex variational method can be used to define a surrogate likelihood function. We then investigate the asymptotic properties of parameter estimators based on maximizing such surrogate likelihoods, and establish that they are asymptotically normal but inconsistent in general. We then prove that any variational method that is based on a strongly concave entropy approximation is globally Lipschitz stable. 
Finally, focusing on prediction for a coupled mixture of Gaussians, we prove upper bounds on the increase in MSE of our computationally efficient method, relative to the unachievable Bayes optimum. We provide experimental results using the tree-reweighted (TRW) sum-product algorithm that confirm the stability of our methods, and demonstrate their superior performance relative to a heuristic method based on standard sum-product.\n\n2 Background\n\nWe begin with necessary notation and background on multinomial Markov random fields, as well as variational representations and methods.\n\nMarkov random fields: Given an undirected graph $G = (V, E)$ with $N = |V|$ vertices, we associate to each vertex $s \\in V$ a discrete random variable $X_s$, taking values in $\\mathcal{X}_s = \\{0, 1, \\ldots, m-1\\}$. We assume that the vector $X = \\{X_s \\mid s \\in V\\}$ has a distribution that is Markov with respect to the graph $G$, so that its distribution can be represented in the form\n\n$$p(x; \\theta) = \\exp\\Big\\{ \\sum_{s \\in V} \\theta_s(x_s) + \\sum_{(s,t) \\in E} \\theta_{st}(x_s, x_t) - A(\\theta) \\Big\\}. \\qquad (1)$$\n\nHere $A(\\theta) := \\log \\sum_{x \\in \\mathcal{X}^N} \\exp\\big\\{ \\sum_{s \\in V} \\theta_s(x_s) + \\sum_{(s,t) \\in E} \\theta_{st}(x_s, x_t) \\big\\}$ is the cumulant generating function that normalizes the distribution, and $\\theta_s(\\cdot)$ and $\\theta_{st}(\\cdot, \\cdot)$ are potential functions. In particular, we make use of the parameterization $\\theta_s(x_s) := \\sum_{j \\in \\mathcal{X}_s} \\theta_{s;j} \\, \\mathbb{I}_j[x_s]$, where $\\mathbb{I}_j[x_s]$ is an indicator function for the event $\\{x_s = j\\}$; the quantity $\\theta_{st}$ is defined analogously. Overall, the family of MRFs (1) is an exponential family with canonical parameter $\\theta \\in \\mathbb{R}^d$. Note that the elements of the canonical parameters are associated with vertices $\\{\\theta_{s;j}, \\, s \\in V, \\, j \\in \\mathcal{X}_s\\}$ and edges $\\{\\theta_{st;jk}, \\, (s,t) \\in E, \\, (j,k) \\in \\mathcal{X}_s \\times \\mathcal{X}_t\\}$ of the underlying graph.\n\nVariational representation: We now describe how the cumulant generating function can be represented as the solution of an optimization problem. Writing $\\phi$ for the vector of indicator functions, the constraint set is given by $\\mathrm{MARG}(G; \\phi) := \\big\\{ \\mu \\in \\mathbb{R}^d \\mid \\mu = \\sum_{x \\in \\mathcal{X}^N} p(x) \\phi(x) \\text{ for some } p(\\cdot) \\big\\}$, consisting of all globally realizable singleton $\\mu_s(\\cdot)$ and pairwise $\\mu_{st}(\\cdot, \\cdot)$ marginal distributions on the graph $G$. 
For any $\\mu \\in \\mathrm{MARG}(G; \\phi)$, we define $A^*(\\mu) = -\\max_p H(p)$, where the maximum is taken over all distributions $p$ that have mean parameters $\\mu$. With these definitions, it can be shown [12] that $A$ has the variational representation\n\n$$A(\\theta) = \\max_{\\mu \\in \\mathrm{MARG}(G; \\phi)} \\big\\{ \\theta^T \\mu - A^*(\\mu) \\big\\}. \\qquad (2)$$\n\n3 From convex surrogates to joint estimation/prediction\n\nIn general, solving the variational problem (2) is intractable for two reasons: (i) the constraint set $\\mathrm{MARG}(G; \\phi)$ is extremely difficult to characterize; and (ii) the dual function $A^*$ lacks a closed-form representation. These challenges motivate approximations to $A^*$ and $\\mathrm{MARG}(G; \\phi)$; the resulting relaxed optimization problem defines a convex surrogate to the cumulant generating function.\n\nConvex surrogates: Let $\\mathrm{REL}(G; \\phi)$ be a compact and convex outer bound to the marginal polytope $\\mathrm{MARG}(G; \\phi)$, and let $B^*$ be a strictly convex and twice continuously differentiable approximation to the dual function $A^*$. We use these approximations to define a convex surrogate $B$ via the relaxed optimization problem\n\n$$B(\\theta) := \\max_{\\tau \\in \\mathrm{REL}(G; \\phi)} \\big\\{ \\theta^T \\tau - B^*(\\tau) \\big\\}. \\qquad (3)$$\n\nThe function $B$ so defined has several desirable properties. First, since $B$ is defined by the maximum of a collection of functions linear in $\\theta$, it is convex [1]. Moreover, by the strict convexity of $B^*$ and compactness of $\\mathrm{REL}(G; \\phi)$, the optimum is uniquely attained at some $\\tau(\\theta)$. Finally, an application of Danskin's theorem [1] yields that $B$ is differentiable, and that $\\nabla B(\\theta) = \\tau(\\theta)$. Since $\\tau(\\theta)$ has a natural interpretation as a pseudomarginal, this last property of $B$ is analogous to the well-known cumulant generating property of $A$--namely, $\\nabla A(\\theta) = \\mu(\\theta)$.\n\nOne example of such a convex surrogate is the tree-reweighted Bethe free energy considered in our previous work [11]. For this surrogate, the relaxed constraint set $\\mathrm{REL}(G; \\phi)$ takes the form $\\mathrm{LOCAL}(G; \\phi) := \\big\\{ \\tau \\in \\mathbb{R}^d_+ \\mid \\sum_{x_s} \\tau_s(x_s) = 1, \\; \\sum_{x_t} \\tau_{st}(x_s, x_t) = \\tau_s(x_s) \\big\\}$, whereas the entropy approximation $B^*$ is of the \"convexified\" Bethe form\n\n$$-B^*(\\tau) = \\sum_{s \\in V} H_s(\\tau_s) - \\sum_{(s,t) \\in E} \\rho_{st} I_{st}(\\tau_{st}). \\qquad (4)$$\n\nHere $H_s$ and $I_{st}$ are the singleton entropy and edge-based mutual information, respectively, and the weights $\\rho_{st}$ are derived from the graph structure so as to ensure convexity (see [11] for more details). Analogous convex variational formulations underlie the reweighted generalized BP algorithm [13], as well as a log-determinant relaxation [12].\n\nApproximate parameter estimation using surrogate likelihoods: Consider the problem of estimating the parameter $\\theta$ using i.i.d. samples $\\{x^1, \\ldots, x^n\\}$. For an MRF of the form (1), the maximum likelihood estimate (MLE) is specified using the vector $\\hat{\\mu}^n$ of empirical marginal distributions (singleton $\\hat{\\mu}^n_s$ and pairwise $\\hat{\\mu}^n_{st}$). Since the likelihood is intractable to optimize (due to the cumulant generating function $A$), it is natural to use the convex surrogate $B$ to define an alternative estimator obtained by maximizing the regularized surrogate likelihood:\n\n$$\\hat{\\theta}^n := \\arg\\max_{\\theta \\in \\mathbb{R}^d} \\big\\{ L_B(\\theta; \\hat{\\mu}^n) - \\lambda^n R(\\theta) \\big\\} = \\arg\\max_{\\theta \\in \\mathbb{R}^d} \\big\\{ \\theta^T \\hat{\\mu}^n - B(\\theta) - \\lambda^n R(\\theta) \\big\\}. \\qquad (5)$$\n\nHere $R : \\mathbb{R}^d \\to \\mathbb{R}_+$ is a regularization function (e.g., $R(\\theta) = \\|\\theta\\|^2$), whereas $\\lambda^n > 0$ is a regularization coefficient. For the tree-reweighted Bethe surrogate, we have shown in previous work [10] that in the absence of regularization, the optimal parameter estimates $\\hat{\\theta}^n$ have a very simple closed-form solution, specified in terms of the weights $\\rho_{st}$ and the empirical marginals $\\hat{\\mu}^n$. If a regularizing term is added, these estimates no longer have a closed-form solution, but the optimization problem (5) can still be solved efficiently by message-passing methods.\n\nJoint estimation/prediction: Using such an estimator, we now consider the joint approach to estimation and prediction illustrated in Figure 2. Using an initial set of i.i.d. samples, we first use the surrogate likelihood (5) to construct a parameter estimate $\\hat{\\theta}^n$. Given a new noisy or incomplete observation $y$, we wish to perform near-optimal prediction or data fusion using the fitted model (e.g., for smoothing or interpolation of a noisy image). 
In order to do so, we first incorporate the new observation into the model, and then use the message-passing algorithm associated with the convex surrogate $B$ in order to compute approximate pseudomarginals $\\tau$. These pseudomarginals can then be used to construct a prediction $\\hat{z}(y; \\tau)$, where the specifics of the prediction depend on the observation model. We provide a concrete illustration in Section 5 using a mixture-of-Gaussians observation model.\n\nGeneric algorithm for joint parameter estimation and prediction:\n1. Estimate parameters $\\hat{\\theta}^n$ from initial data $x^1, \\ldots, x^n$ by maximizing the surrogate likelihood $L_B$.\n2. Given a new set of observations $y$, incorporate them into the model: $\\tilde{\\theta}_s(\\cdot\\,; y_s) = \\hat{\\theta}^n_s(\\cdot) + \\log p(y_s \\mid \\cdot)$. (6)\n3. Compute approximate marginals $\\tau$ by using the message-passing algorithm associated with the convex surrogate $B$. Use the approximate marginals to construct a prediction $\\hat{z}(y; \\tau)$ of $z$ based on the observation $y$ and pseudomarginals $\\tau$.\n\nFigure 2. Algorithm for joint parameter estimation and prediction. Both the learning and prediction steps are approximate, but the key is that they are both based on the same underlying convex surrogate $B$. Such a construction yields a provably beneficial cancellation of the two sources of error (learning and prediction).\n\n4 Analysis\n\nAsymptotics of estimator: We begin by considering the asymptotic behavior of the parameter estimator $\\hat{\\theta}^n$ defined by the surrogate likelihood (5). Since this parameter estimator is a particular type of M-estimator, the following result follows from standard techniques [8]:\n\nProposition 1. For a general graph with cycles, $\\hat{\\theta}^n$ converges in probability to some fixed $\\hat{\\theta} \\neq \\theta^*$; moreover, $\\sqrt{n} \\, [\\hat{\\theta}^n - \\hat{\\theta}]$ is asymptotically normal.\n\nA key property of the estimator is its inconsistency--i.e., the estimated model $\\hat{\\theta}$ differs from the true model $\\theta^*$ even in the limit of large data. Despite this inconsistency, we will see that $\\hat{\\theta}^n$ is useful for performing prediction.\n\nAlgorithmic stability: A desirable property of any algorithm--particularly one applied to statistical data--is that it exhibit an appropriate form of stability with respect to its inputs. Not all message-passing algorithms have such stability properties. For instance, the standard BP algorithm, although stable for relatively weakly coupled MRFs [3, 6], can be highly unstable due to phase transitions. Previous experimental work has shown that methods based on convex relaxations, including reweighted belief propagation [10], reweighted generalized BP [13], and log-determinant relaxations [12], appear to be very stable. Here we provide theoretical support for these empirical observations: in particular, we prove that, in sharp contrast to non-convex methods, any variational method based on a strongly concave entropy approximation (i.e., a strongly convex $B^*$) is globally stable. A function $f : \\mathbb{R}^n \\to \\mathbb{R}$ is strongly convex if there exists a constant $c > 0$ such that $f(y) \\ge f(x) + \\nabla f(x)^T (y - x) + \\frac{c}{2} \\|y - x\\|^2$ for all $x, y \\in \\mathbb{R}^n$. For a twice continuously differentiable function, this condition is equivalent to the eigenspectrum of the Hessian $\\nabla^2 f(x)$ being uniformly bounded away from zero by $c$. With this definition, we have:\n\nProposition 2. Consider any variational method based on a strongly concave entropy approximation $-B^*$; moreover, for any parameter $\\theta \\in \\mathbb{R}^d$, let $\\tau(\\theta)$ denote the associated set of pseudomarginals. If the optimum is attained in the interior of the constraint set, then there exists a constant $R < +\\infty$ such that $\\|\\tau(\\theta + \\delta) - \\tau(\\theta)\\| \\le R \\|\\delta\\|$ for all $\\theta, \\delta \\in \\mathbb{R}^d$.\n\nProof. By our construction of the convex surrogate $B$, we have $\\tau(\\theta) = \\nabla B(\\theta)$, so that the statement is equivalent to the assertion that the gradient $\\nabla B$ is a Lipschitz function. Applying the mean value theorem to $\\nabla B$, we can write $\\nabla B(\\theta + \\delta) - \\nabla B(\\theta) = \\nabla^2 B(\\theta + t\\delta) \\, \\delta$ where $t \\in [0, 1]$. Consequently, in order to establish the Lipschitz condition, it suffices to show that the spectral norm of $\\nabla^2 B(\\theta)$ is uniformly bounded above over all $\\theta \\in \\mathbb{R}^d$. 
Differentiating the relation $\\nabla B(\\theta) = \\tau(\\theta)$ yields $\\nabla^2 B(\\theta) = \\nabla \\tau(\\theta)$. Now standard sensitivity analysis results [1] yield that $\\nabla \\tau(\\theta) = [\\nabla^2 B^*(\\tau(\\theta))]^{-1}$. Finally, our assumption of strong convexity of $B^*$ yields that the eigenvalues of $\\nabla^2 B^*(\\tau)$ are uniformly bounded away from zero, so that the spectral norm of its inverse is uniformly bounded above, which yields the claim.\n\nMany existing entropy approximations, including the convexified Bethe entropy (4), can be shown to be strongly concave [9].\n\n5 Bounds on performance loss\n\nWe now turn to theoretical analysis of the joint method for parameter estimation and prediction illustrated in Figure 2. Note that given our setting of limited computation, the Bayes optimum is unattainable for two reasons: (a) it has knowledge of the exact parameter value $\\theta^*$; and (b) the prediction step (7) involves computing exact marginal probabilities $\\mu$. Therefore, our ultimate goal is to bound the performance loss of our method relative to the unachievable Bayes optimum. So as to obtain a concrete result, we focus on the special case of joint learning/prediction for a mixture of Gaussians; however, the ideas and techniques described here are more generally applicable.\n\nPrediction for mixture of Gaussians: Suppose that the discrete random vector $X$ is a label vector for the components in a finite mixture of Gaussians: i.e., for each $s \\in V$, the random variable $Z_s$ is specified by $p(Z_s = z_s \\mid X_s = j; \\theta^*) \\sim N(\\nu_j, \\sigma_j^2)$, for $j \\in \\{0, 1, \\ldots, m-1\\}$. Such models are widely used in statistical signal and image processing [2]. Suppose that we observe a noise-corrupted version of $Z_s$--namely $Y_s = \\alpha Z_s + \\sqrt{1 - \\alpha^2} \\, W_s$, where $W_s \\sim N(0, 1)$ is additive Gaussian noise, and the parameter $\\alpha \\in [0, 1]$ specifies the signal-to-noise ratio (SNR) of the observation model. (Here $\\alpha = 0$ corresponds to pure noise, whereas $\\alpha = 1$ corresponds to completely uncorrupted observations.) 
With this set-up, it is straightforward to show that the optimal Bayes least squares estimator (BLSE) of $Z$ takes the form\n\n$$\\hat{z}_s(y; \\mu) := \\sum_{j=0}^{m-1} \\mu_s(j; \\theta^*) \\big[ \\omega_j(\\alpha) (y_s - \\nu_j) + \\nu_j \\big], \\qquad (7)$$\n\nwhere $\\mu_s(j; \\theta^*)$ is the exact marginal of the distribution $p(y \\mid x) p(x; \\theta^*)$; and $\\omega_j(\\alpha) := \\frac{\\alpha \\sigma_j^2}{\\alpha^2 \\sigma_j^2 + (1 - \\alpha^2)}$ is the usual BLSE weighting for a Gaussian with variance $\\sigma_j^2$. For this set-up, the approximate predictor $\\hat{z}_s(y; \\tau)$ defined by our joint procedure in Figure 2 corresponds to replacing the exact marginals $\\mu$ with the pseudomarginals $\\tau_s(j; \\hat{\\theta})$ obtained by solving the variational problem with $\\hat{\\theta}$.\n\nBounds on performance loss: We now turn to a comparison of the mean-squared error (MSE) of the Bayes optimal predictor $\\hat{z}(Y; \\mu)$ to the MSE of the surrogate-based predictor $\\hat{z}(Y; \\tau)$. More specifically, we provide an upper bound on the increase in MSE, where the bound is specified in terms of the coupling strength and the SNR parameter $\\alpha$. Although results of this nature can be derived more generally, for simplicity we focus on the case of two mixture components ($m = 2$), and consider the asymptotic setting, in which the number of data samples $n \\to +\\infty$, so that the law of large numbers [8] ensures that the empirical marginals $\\hat{\\mu}^n$ converge to the exact marginal distributions $\\mu^*$. Consequently, the MLE converges to the true parameter value $\\theta^*$, whereas Proposition 1 guarantees that our approximate parameter estimate $\\hat{\\theta}^n$ converges to the fixed quantity $\\hat{\\theta}$. By construction, we have the relations $\\nabla B(\\hat{\\theta}) = \\mu^* = \\nabla A(\\theta^*)$. An important factor in our bound is the quantity\n\n$$L(\\theta^*; \\hat{\\theta}) := \\sup_{\\delta \\in \\mathbb{R}^d} \\sigma_{\\max}\\big( \\nabla^2 A(\\theta^* + \\delta) - \\nabla^2 B(\\hat{\\theta} + \\delta) \\big), \\qquad (8)$$\n\nwhere $\\sigma_{\\max}$ denotes the maximal singular value. Following the argument in the proof of Proposition 2, it can be seen that $L(\\theta^*; \\hat{\\theta})$ is finite. Two additional quantities that play a role in our bound are the differences\n\n$$\\Delta\\omega(\\alpha) := \\omega_1(\\alpha) - \\omega_0(\\alpha), \\quad \\text{and} \\quad \\Delta\\nu(\\alpha) := [1 - \\omega_1(\\alpha)] \\nu_1 - [1 - \\omega_0(\\alpha)] \\nu_0,$$\n\nwhere $\\nu_0, \\nu_1$ are the means of the two Gaussian components. Finally, we define $\\gamma(Y; \\alpha) \\in \\mathbb{R}^d$ with components $\\log \\frac{p(Y_s \\mid X_s = 1)}{p(Y_s \\mid X_s = 0)}$ for $s \\in V$, and zeroes otherwise. With this notation, we state the following result (see the technical report [9] for the proof):\n\nTheorem 1. Let $\\mathrm{MSE}(\\tau)$ and $\\mathrm{MSE}(\\mu)$ denote the mean-squared prediction errors of the surrogate-based predictor $\\hat{z}(y; \\tau)$, and the Bayes optimal estimate $\\hat{z}(y; \\mu)$ respectively. The MSE increase $I(\\alpha) := \\frac{1}{N} [\\mathrm{MSE}(\\tau) - \\mathrm{MSE}(\\mu)]$ is upper bounded by\n\n$$I(\\alpha) \\le E\\Big\\{ \\Omega^2(\\alpha) \\, \\Delta\\nu^2(\\alpha) + \\Omega(\\alpha) \\, \\Delta\\omega^2(\\alpha) \\sqrt{\\frac{\\sum_s Y_s^4}{N}} + 2 |\\Delta\\nu(\\alpha)| \\, |\\Delta\\omega(\\alpha)| \\sqrt{\\frac{\\sum_s Y_s^2}{N}} \\Big\\},$$\n\nwhere $\\Omega(\\alpha) := \\min\\big\\{ 1, \\, L(\\theta^*; \\hat{\\theta}) \\frac{\\|\\gamma(Y; \\alpha)\\|}{\\sqrt{N}} \\big\\}$.\n\nIt can be seen that $I(\\alpha) \\to 0$ as $\\alpha \\to 0^+$ and as $\\alpha \\to 1^-$, so that the surrogate-based method is asymptotically optimal for both low and high SNR. The behavior of the bound in the intermediate regime is controlled by the balance between these two terms.\n\n[Figure 3: six surface plots, panels (a)-(f), of performance loss against SNR and edge strength. Columns: IND, BP, TRW. The vertical axis runs from 0 to 50 in the top row and from 0 to 5 in the bottom row.]\n\nFigure 3. Surface plots of the percentage increase in MSE relative to Bayes optimum for different methods as a function of observation SNR and coupling strength. Top row: Gaussian mixture with components $(\\nu_0, \\sigma_0^2) = (-1, 0.5)$ and $(\\nu_1, \\sigma_1^2) = (1, 0.5)$. Bottom row: Gaussian mixture with components $(\\nu_0, \\sigma_0^2) = (0, 1)$ and $(\\nu_1, \\sigma_1^2) = (0, 9)$. Left column: independence model (IND). Center column: ordinary belief propagation (BP). Right column: tree-reweighted algorithm (TRW).\n\nExperimental results: In order to test our joint estimation/prediction procedure, we have applied it to coupled Gaussian mixture models on different graphs, coupling strengths, observation SNRs, and mixture distributions. 
Although our methods are more generally applicable, here we show representative results for $m = 2$ components, and two different mixture types. The first ensemble, constructed with mean and variance components $(\\nu_0, \\sigma_0^2) = (0, 1)$ and $(\\nu_1, \\sigma_1^2) = (0, 9)$, mimics heavy-tailed behavior. The second ensemble is bimodal, with components $(\\nu_0, \\sigma_0^2) = (-1, 0.5)$ and $(\\nu_1, \\sigma_1^2) = (1, 0.5)$. In both cases, each mixture component is equally weighted. Here we show results for a 2-D grid with $N = 64$ nodes. Since the mixture variables have $m = 2$ states, the coupling distribution can be written as $p(x; \\theta) \\propto \\exp\\big\\{ \\sum_{s \\in V} \\theta_s x_s + \\sum_{(s,t) \\in E} \\theta_{st} x_s x_t \\big\\}$, where $x \\in \\{-1, +1\\}^N$ are spin variables indexing the mixture components. In all trials, we chose $\\theta_s = 0$ for all nodes $s \\in V$, which ensures uniform marginal distributions $p(x_s; \\theta)$ at each node. For each coupling strength $\\kappa \\in [0, 1]$, we chose edge parameters as $\\theta_{st} \\sim U[0, \\kappa]$, and we varied the SNR parameter $\\alpha$ controlling the observation model in $[0, 1]$. We evaluated the following three methods based on their increase in mean-squared error (MSE) over the Bayes optimal predictor (7): (a) As a baseline, we used the independence model for the mixture components: parameters are estimated as $\\theta_s(x_s) = \\log \\hat{\\mu}_s(x_s)$, with the coupling terms $\\theta_{st}(x_s, x_t)$ set equal to zero. The prediction step reduces to performing BLSE at each node independently. (b) The standard belief propagation (BP) approach is based on estimating parameters (see step (1) of Figure 2) using $\\rho_{st} = 1$ for all edges $(s, t)$, and using BP to compute the pseudomarginals. (c) The tree-reweighted method (TRW) is based on estimating parameters using the tree-reweighted surrogate [10] with weights $\\rho_{st} = \\frac{1}{2}$ for all edges $(s, t)$, and using the TRW sum-product algorithm to compute the pseudomarginals.\n\nShown in Figure 3 are 2-D surface plots of the average percentage increase in MSE, taken over 100 trials, as a function of the coupling strength $\\kappa \\in [0, 1]$ and the observation SNR parameter $\\alpha \\in [0, 1]$ for the independence model (left column), BP approach (middle column) and TRW method (right column). For weak coupling ($\\kappa \\approx 0$), all three methods--including the independence model--perform quite well, as should be expected given the weak dependency. Although not clear in these plots, BP outperforms TRW for weak coupling; however, both methods lose less than 1% in this regime. As the coupling is increased, the BP method eventually deteriorates quite seriously; indeed, for large enough coupling and low/intermediate SNR, its performance can be worse than the independence model. Looking at alternative models (in which phase transitions are known), we have found that this rapid degradation coincides with the appearance of multiple fixed points. In contrast, the behavior of the TRW method is extremely stable, consistent with our theory.\n\n6 Conclusion\n\nWe have described and analyzed joint methods for parameter estimation and prediction/smoothing using variational methods that are based on convex surrogates to the cumulant generating function. Our results--both theoretical and experimental--confirm the intuition that in the computation-limited setting, in which errors arise from approximations made both during parameter estimation and subsequent prediction, it is provably beneficial to use an inconsistent parameter estimator. 
Our experimental results on the coupled mixture-of-Gaussians model confirm the theory: the tree-reweighted sum-product algorithm yields prediction results close to the Bayes optimum, and substantially outperforms an analogous but heuristic method based on standard belief propagation.\n\nReferences\n[1] D. Bertsekas. Nonlinear programming. Athena Scientific, Belmont, MA, 1995.\n[2] M. Crouse, R. Nowak, and R. Baraniuk. Wavelet-based statistical signal processing using hidden Markov models. IEEE Trans. Signal Processing, 46:886-902, April 1998.\n[3] A. Ihler, J. Fisher, and A. S. Willsky. Loopy belief propagation: Convergence and effects of message errors. Journal of Machine Learning Research, 6:905-936, May 2005.\n[4] M. A. R. Leisink and H. J. Kappen. Learning in higher order Boltzmann machines using linear response. Neural Networks, 13:329-335, 2000.\n[5] T. P. Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, MIT, January 2001.\n[6] S. Tatikonda and M. I. Jordan. Loopy belief propagation and Gibbs measures. In Proc. Uncertainty in Artificial Intelligence, volume 18, pages 493-500, August 2002.\n[7] Y. W. Teh and M. Welling. On improving the efficiency of the iterative proportional fitting procedure. In Workshop on Artificial Intelligence and Statistics, 2003.\n[8] A. W. van der Vaart. Asymptotic statistics. Cambridge University Press, Cambridge, UK, 1998.\n[9] M. J. Wainwright. Joint estimation and prediction in Markov random fields: Benefits of inconsistency in the computation-limited regime. Technical Report 690, Department of Statistics, UC Berkeley, 2005.\n[10] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. Tree-reweighted belief propagation algorithms and approximate ML estimation by pseudomoment matching. In Workshop on Artificial Intelligence and Statistics, January 2003.\n[11] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. A new class of upper bounds on the log partition function. IEEE Trans. Info. Theory, 51(7):2313-2335, July 2005.\n[12] M. J. Wainwright and M. I. Jordan. A variational principle for graphical models. In New Directions in Statistical Signal Processing. MIT Press, Cambridge, MA, 2005.\n[13] W. Wiegerinck. Approximations with reweighted generalized belief propagation. In Workshop on Artificial Intelligence and Statistics, January 2005.\n[14] J. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free energy approximations and generalized belief propagation algorithms. IEEE Trans. Info. Theory, 51(7):2282-2312, July 2005.\n", "award": [], "sourceid": 2925, "authors": [{"given_name": "Martin", "family_name": "Wainwright", "institution": null}]}