{"title": "Elementary Estimators for Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 2159, "page_last": 2167, "abstract": "We propose a class of closed-form estimators for sparsity-structured graphical models, expressed as exponential family distributions, under high-dimensional settings. Our approach builds on observing the precise manner in which the classical graphical model MLE ``breaks down'' under high-dimensional settings. Our estimator uses a carefully constructed, well-defined and closed-form backward map, and then performs thresholding operations to ensure the desired sparsity structure. We provide a rigorous statistical analysis that shows that surprisingly our simple class of estimators recovers the same asymptotic convergence rates as those of the $\\ell_1$-regularized MLEs that are much more difficult to compute. We corroborate this statistical performance, as well as significant computational advantages via simulations of both discrete and Gaussian graphical models.", "full_text": "Elementary Estimators for Graphical Models\n\nEunho Yang\n\nIBM T.J. Watson Research Center\n\neunhyang@us.ibm.com\n\nAur\u00e9lie C. Lozano\n\nIBM T.J. Watson Research Center\n\naclozano@us.ibm.com\n\nPradeep Ravikumar\n\nUniversity of Texas at Austin\n\npradeepr@cs.utexas.edu\n\nAbstract\n\nWe propose a class of closed-form estimators for sparsity-structured graphical models, expressed as exponential family distributions, under high-dimensional settings. Our approach builds on observing the precise manner in which the classical graphical model MLE \u201cbreaks down\u201d under high-dimensional settings. 
Our estimator uses a carefully constructed, well-defined and closed-form backward map, and then performs thresholding operations to ensure the desired sparsity structure. We provide a rigorous statistical analysis that shows that, surprisingly, our simple class of estimators recovers the same asymptotic convergence rates as those of the $\\ell_1$-regularized MLEs that are much more difficult to compute. We corroborate this statistical performance, as well as significant computational advantages, via simulations of both discrete and Gaussian graphical models.\n\n1 Introduction\n\nUndirected graphical models, also known as Markov random fields (MRFs), are a powerful class of statistical models that represent distributions over a large number of variables using graphs, where the structure of the graph encodes Markov conditional independence assumptions among the variables. MRFs are widely used in a variety of domains, including natural language processing [1], image processing [2, 3, 4], statistical physics [5], and spatial statistics [6]. Popular instances of this class of models include Gaussian graphical models (GMRFs) [7, 8, 9, 10], used for modeling continuous real-valued data, and discrete graphical models, including the Ising model, where each variable takes values in a discrete set [10, 11, 12]. In this paper, we consider the problem of high-dimensional estimation, where the number of variables $p$ may exceed the number of observations $n$. In such high-dimensional settings, it is still possible to perform consistent estimation by leveraging low-dimensional structure. Sparse and group-sparse structural constraints, where few parameters (or parameter groups) are non-zero, are particularly pertinent in the context of MRFs as they translate into graphs with few edges.\n\nA key class of estimators for learning graphical models has thus been based on maximum likelihood estimators (MLE) with sparsity-encouraging regularization. 
For the task of sparse GMRF estimation, the state-of-the-art estimator minimizes the Gaussian negative log-likelihood regularized by the $\\ell_1$ norm of the entries (or the off-diagonal entries) of the concentration matrix (see [8, 9, 10]). Strong statistical guarantees for this estimator have been established (see [13] and references therein). The resulting optimization problem is a log-determinant program, which can be solved in polynomial time with interior point methods [14], or by coordinate descent algorithms [9, 10]. In a computationally simpler approach for sparse GMRFs, [7] proposed the use of neighborhood selection, which consists of estimating conditional independence relationships separately for each node in the graph, via $\\ell_1$-regularized linear regression, or LASSO [15]. They showed that the procedure can consistently recover the sparse GMRF structure even under high-dimensional settings. The neighborhood selection approach has also been successfully applied to discrete Markov random fields. In particular, for binary graphical models, [11] showed that consistent neighborhood selection can be performed via $\\ell_1$-regularized logistic regression. These results were generalized to general discrete graphical models (where each variable can take $m \\ge 2$ values) by [12] through node-wise multi-class logistic regression with group sparsity. A related regularized convex program to solve for sparse GMRFs is the CLIME estimator [16], which reduces the estimation problem to solving linear programs. Overall, while state-of-the-art optimization methods have been developed to solve all of these regularized (and consequently non-smooth) convex programs, their iterative approach is very expensive for large-scale problems. 
Indeed, scaling such regularized convex programs to very large-scale settings has attracted considerable recent research and attention.\n\nIn this paper, we investigate the following leading question: \u201cCan one devise simple estimators with closed-form solutions that are yet consistent and achieve the sharp convergence rates of the aforementioned regularized convex programs?\u201d This question was originally considered in the context of linear regression by [17], who gave a positive answer. It is thus natural to wonder whether an affirmative response can be provided for the more complicated statistical modeling setting of MRFs as well.\n\nOur key idea is to revisit the vanilla MLE for estimating a graphical model, and consider where it \u201cbreaks down\u201d in the high-dimensional case. The vanilla graphical model MLE can be expressed as a backward mapping [18] in an exponential family distribution that computes the model parameters corresponding to some given (sample) moments. There are however two caveats with this backward mapping: it is not available in closed form for many classes of models, and even if it were available in closed form, it need not be well-defined in high-dimensional settings (i.e., it could lead to unbounded model parameter estimates). Accordingly, we consider the use of carefully constructed proxy backward maps that are both available in closed form and well-defined in high-dimensional settings. We then perform simple thresholding operations on these proxy backward maps to obtain our final estimators. Our class of algorithms is thus both computationally practical and highly scalable. We provide a unified statistical analysis of our class of algorithms for graphical models arising from general exponential families. 
We then instantiate our analysis for the specific cases of GMRFs and DMRFs, and show that the resulting algorithms come with strong statistical guarantees, achieving near-optimal convergence rates while being computationally much faster than the regularized convex programs. These surprising results are confirmed via simulation for both GMRFs and DMRFs. There has been considerable recent interest in large-scale statistical model estimation, and in particular, in scaling these to very large-scale settings. We believe our much simpler class of closed-form graphical model estimators has the potential to be the estimator of choice in such large-scale settings, particularly if it attracts research on optimizing and scaling its closed-form operations.\n\n2 Background and Problem Setup\n\nSince most popular graphical model families can be expressed as exponential families (see [18]), we consider general exponential family distributions for a random variable $X \\in \\mathbb{R}^p$:\n\n$$\\mathbb{P}(X; \\theta) = \\exp\\big\\{ \\langle \\theta, \\phi(X) \\rangle - A(\\theta) \\big\\} \\tag{1}$$\n\nwhere $\\theta \\in \\Omega \\subseteq \\mathbb{R}^d$ is the canonical parameter to be estimated, $\\phi(X)$ denotes the sufficient statistics with feature function $\\phi : \\mathbb{R}^p \\mapsto \\mathbb{R}^d$, and $A(\\theta)$ is the log-partition function.\n\nAn alternative parameterization of the exponential family, to the canonical parameterization above, is via the vector of \u201cmean parameters\u201d $\\mu(\\theta) := \\mathbb{E}_\\theta[\\phi(X)]$, which are the moments of the sufficient statistics $\\phi(X)$ under the exponential family distribution. We denote the set of all possible moments by the moment polytope: $\\mathcal{M} = \\{\\mu : \\exists \\text{ distribution } p \\text{ s.t. } \\mathbb{E}_p(\\phi) = \\mu\\}$, which consists of moments of the sufficient statistics under all possible distributions. The problem of computing the mean parameters $\\mu(\\theta) \\in \\mathcal{M}$ given the canonical parameters $\\theta \\in \\Omega$ constitutes the key machine learning problem of inference in graphical models (expressed in exponential family form (1)). 
Let us denote this computation via a so-called forward mapping $\\mathcal{A} : \\Omega \\mapsto \\mathcal{M}$. By properties of exponential family distributions, the forward mapping $\\mathcal{A}$ can actually be expressed in terms of the first derivative of the log-partition function $A(\\cdot)$: $\\mathcal{A} : \\theta \\mapsto \\mu = \\nabla A(\\theta)$. It can be shown that this map is injective (one-to-one with its range) if the exponential family is minimal. Moreover, it is onto the interior of $\\mathcal{M}$, denoted by $\\mathcal{M}^o$. Thus, for any mean parameter $\\mu \\in \\mathcal{M}^o$, there exists a canonical parameter $\\theta(\\mu) \\in \\Omega$ such that $\\mathbb{E}_{\\theta(\\mu)}[\\phi(X)] = \\mu$. Unless the exponential family is minimal, however, the corresponding canonical parameter $\\theta(\\mu)$ need not be unique. Thus while there will always exist a so-called backward mapping $\\mathcal{A}^* : \\mathcal{M}^o \\mapsto \\Omega$ that computes the canonical parameters corresponding to given moments, it need not be unique. A candidate backward map can be constructed via the conjugate of the log-partition function $A^*(\\mu) = \\sup_{\\theta \\in \\Omega} \\langle \\theta, \\mu \\rangle - A(\\theta)$: $\\mathcal{A}^* : \\mu \\mapsto \\theta = \\nabla A^*(\\mu)$.\n\n2.1 High-dimensional Graphical Model Selection\n\nWe focus on the high-dimensional setting, where the number of variables $p$ may greatly exceed the sample size $n$. Under such high-dimensional settings, it is now well understood that consistent estimation is possible if structural constraints are imposed on the model parameters $\\theta$. In this paper, we focus on the structural constraint of sparsity, for which the $\\ell_1$ norm is known to be well-suited. Given $n$ samples $\\{X^{(i)}\\}_{i=1}^n$ from $\\mathbb{P}(X; \\theta^*)$ that belongs to an exponential family (1), a popular class of M-estimators for recovering the sparse model parameter $\\theta^*$ is the $\\ell_1$-regularized maximum likelihood estimators:\n\n$$\\operatorname*{minimize}_{\\theta} \\; -\\langle \\theta, \\widehat{\\phi} \\rangle + A(\\theta) + \\lambda_n \\|\\theta\\|_1 \\tag{2}$$\n\nwhere $\\widehat{\\phi} := \\frac{1}{n} \\sum_{i=1}^n \\phi(X^{(i)})$ is the empirical mean of the sufficient statistics. Since the log-partition function $A(\\theta)$ in (1) is convex, the problem (2) is convex as well.\n\nWe now briefly review the two most popular examples of exponential families in the context of graphical models.\n\nGaussian Graphical Models. Consider a random vector $X = (X_1, \\ldots, X_p)$ with associated $p$-variate Gaussian distribution $\\mathcal{N}(X | \\mu, \\Sigma)$, mean vector $\\mu$ and covariance matrix $\\Sigma$. The probability density function of $X$ can be formulated as an instance of (1):\n\n$$\\mathbb{P}(X | \\theta, \\Theta) = \\exp\\Big( -\\frac{1}{2} \\langle\\langle \\Theta, XX^\\top \\rangle\\rangle + \\langle \\theta, X \\rangle - A(\\Theta, \\theta) \\Big) \\tag{3}$$\n\nwhere $\\langle\\langle A, B \\rangle\\rangle$ denotes the trace inner product $\\operatorname{tr}(A B^\\top)$. Here, the canonical parameters are the precision matrix $\\Theta$ and a vector $\\theta$, with domain $\\Omega := \\{(\\theta, \\Theta) \\in \\mathbb{R}^p \\times \\mathbb{R}^{p \\times p} : \\Theta \\succ 0, \\Theta = \\Theta^\\top\\}$. The corresponding moment parameters of the graphical model distribution are given by the mean $\\mu = \\mathbb{E}_\\theta[X]$, and the covariance matrix $\\Sigma = \\mathbb{E}_\\theta[XX^\\top]$ of the Gaussian. The forward map $\\mathcal{A} : (\\theta, \\Theta) \\mapsto (\\mu, \\Sigma)$ computing these from the canonical parameters can be written as: $\\Sigma = \\Theta^{-1}$ and $\\mu = \\Theta^{-1}\\theta$. The moment polytope can be seen to be given by $\\mathcal{M} = \\{(\\mu, \\Sigma) \\in \\mathbb{R}^p \\times \\mathbb{R}^{p \\times p} : \\Sigma - \\mu\\mu^\\top \\succeq 0, \\Sigma \\succeq 0\\}$, with interior $\\mathcal{M}^o = \\{(\\mu, \\Sigma) \\in \\mathbb{R}^p \\times \\mathbb{R}^{p \\times p} : \\Sigma - \\mu\\mu^\\top \\succ 0, \\Sigma \\succ 0\\}$. The corresponding backward map $\\mathcal{A}^* : (\\mu, \\Sigma) \\mapsto (\\theta, \\Theta)$ for $(\\mu, \\Sigma) \\in \\mathcal{M}^o$ can be computed as: $\\Theta = \\Sigma^{-1}$ and $\\theta = \\Sigma^{-1}\\mu$.\n\nWithout loss of generality, assume that $\\mu = 0$ (and hence $\\theta = 0$). Then the set of non-zero entries in the precision matrix $\\Theta$ corresponds to the set of edges in an associated Gaussian Markov random field (GMRF). In cases where the graph underlying the GMRF has relatively few edges, it thus makes sense to impose $\\ell_1$ regularization on the off-diagonal entries of $\\Theta$. 
Given $n$ i.i.d. random vectors $X^{(i)} \\in \\mathbb{R}^p$ sampled from $\\mathcal{N}(0, \\Sigma^*)$, the corresponding $\\ell_1$-regularized maximum likelihood estimator (MLE) is given by:\n\n$$\\operatorname*{minimize}_{\\Theta \\succ 0} \\; \\langle\\langle \\Theta, S \\rangle\\rangle - \\log\\det\\Theta + \\lambda_n \\|\\Theta\\|_{1,\\text{off}} \\tag{4}$$\n\nwhere $S$ is the sample covariance matrix defined as $S := \\frac{1}{n}\\sum_{i=1}^n (X^{(i)} - \\bar{X})(X^{(i)} - \\bar{X})^\\top$, $\\bar{X} := \\frac{1}{n}\\sum_{i=1}^n X^{(i)}$, and $\\|\\cdot\\|_{1,\\text{off}}$ is the off-diagonal element-wise $\\ell_1$ norm.\n\nDiscrete Graphical Models. Let $X = (X_1, \\ldots, X_p)$ be a random vector where each variable $X_i$ takes values in a discrete set $\\mathcal{X}$ of cardinality $m$. Given a graph $G = (V, E)$, a pairwise Markov random field over $X$ is specified via nodewise functions $\\theta_s : \\mathcal{X} \\mapsto \\mathbb{R}$ for all $s \\in V$, and pairwise functions $\\theta_{st} : \\mathcal{X} \\times \\mathcal{X} \\mapsto \\mathbb{R}$ for all $(s, t) \\in E$, as\n\n$$\\mathbb{P}(X) = \\exp\\Big\\{ \\sum_{s \\in V} \\theta_s(X_s) + \\sum_{(s,t) \\in E} \\theta_{st}(X_s, X_t) - A(\\theta) \\Big\\}. \\tag{5}$$\n\nThis family of probability distributions can be represented using the so-called overcomplete representations [18] as follows. For each random variable $X_s$ and $j \\in \\{1, \\ldots, m\\}$, define nodewise indicators $\\mathbb{I}[X_s = j]$ equal to 1 if $X_s = j$ and 0 otherwise. Then pairwise MRFs in (5) can be rewritten as\n\n$$\\mathbb{P}(X) = \\exp\\Big\\{ \\sum_{s \\in V; j \\in [m]} \\theta_{s;j}\\, \\mathbb{I}[X_s = j] + \\sum_{(s,t) \\in E; j,k \\in [m]} \\theta_{st;jk}\\, \\mathbb{I}[X_s = j, X_t = k] - A(\\theta) \\Big\\} \\tag{6}$$\n\nfor a set of parameters $\\theta := \\{\\theta_{s;j}, \\theta_{st;jk} : s, t \\in V; (s, t) \\in E; j, k \\in [m]\\}$. Given these sufficient statistics, the mean/moment parameters are given by the moments $\\mu_{s;j} := \\mathbb{E}_\\theta\\,\\mathbb{I}[X_s = j] = \\mathbb{P}(X_s = j; \\theta)$, and $\\mu_{st;jk} := \\mathbb{E}_\\theta\\,\\mathbb{I}[X_s = j, X_t = k] = \\mathbb{P}(X_s = j, X_t = k; \\theta)$, which precisely correspond to nodewise and pairwise marginals of the discrete graphical model. Thus, the forward mapping $\\mathcal{A} : \\theta \\mapsto \\mu$ would correspond to the inference task of computing nodewise and pairwise marginals of the discrete graphical model given the canonical parameters. A backward mapping $\\mathcal{A}^* : \\mu \\mapsto \\theta$ corresponds to computing a set of canonical parameters such that the corresponding graphical model distribution would yield the given set of nodewise and pairwise marginals. The moment polytope in this case consists of the set of all nodewise and pairwise marginals of any distribution over the random vector $X$, and hence is termed the marginal polytope; it is typically intractable to characterize in high dimensions [18].\n\nGiven $n$ i.i.d. samples from an unknown distribution (6) with parameter $\\theta^*$, one could consider estimating the graphical model structure with an $\\ell_1$-regularized MLE: $\\widehat{\\theta} \\in \\operatorname{minimize}_\\theta \\; -\\langle \\theta, \\widehat{\\phi} \\rangle + A(\\theta) + \\lambda\\|\\theta\\|_{1,E}$, where $\\|\\cdot\\|_{1,E}$ is the $\\ell_1$ norm of the edge-parameters: $\\|\\theta\\|_{1,E} = \\sum_{s \\neq t} \\|\\theta_{st}\\|$, and where we have collated the edgewise parameters $\\{\\theta_{st;jk}\\}_{j,k=1}^m$ for an edge $(s, t) \\in E$ into the vector $\\theta_{st}$. However, there is a critical caveat to actually computing this regularized MLE: the computation of the log-partition function $A(\\theta)$ is intractable (see [18] for details). To overcome this issue, one might consider instead the following class of M-estimators, discussed in [19]:\n\n$$\\widehat{\\theta} \\in \\operatorname*{minimize}_{\\theta} \\; -\\langle \\theta, \\widehat{\\phi} \\rangle + B(\\theta) + \\lambda\\|\\theta\\|_{1,E}. \\tag{7}$$\n\nHere $B(\\theta)$ is a variational approximation to the log-partition function $A(\\theta)$ of the form: $B(\\theta) = \\sup_{\\mu \\in \\mathcal{L}} \\langle \\theta, \\mu \\rangle - B^*(\\mu)$, where $\\mathcal{L}$ is a tractable bound on the marginal polytope $\\mathcal{M}$, and $B^*(\\mu)$ is a tractable approximation to the graphical model entropy $A^*(\\mu)$. An example of such an approximation, which we shall later leverage in this paper, is the tree-reweighted entropy [20] given by $B^*_{\\mathrm{trw}}(\\mu) = \\sum_s H_s(\\mu_s) - \\sum_{st} \\rho_{st} I_{st}(\\mu_{st})$, where $H_s(\\mu_s)$ is the entropy for node $s$, $I_{st}(\\mu_{st})$ is the mutual information for an edge $(s, t)$, and $\\rho_{st}$ denote the edge-weights that lie in a so-called spanning tree polytope. If all $\\rho_{st}$ are set to 1, this boils down to the Bethe approximation [21].\n\n3 Closed-form Estimators for Graphical Models\n\nThe state-of-the-art $\\ell_1$-regularized MLE estimators discussed in the previous section enjoy strong statistical guarantees but involve solving difficult non-smooth programs. Scaling them to very large-scale problems is thus an important and challenging ongoing research area.\n\nIn this paper we tackle the scalability issue at the source by departing from regularized MLE approaches and proposing instead a family of closed-form estimators for graphical models.\n\nElem-GM Estimation:\n\n$$\\operatorname*{minimize}_{\\theta} \\; \\|\\theta\\|_1 \\quad \\text{s.t.} \\quad \\big\\|\\theta - \\mathcal{B}^*(\\widehat{\\phi})\\big\\|_\\infty \\le \\lambda_n \\tag{8}$$\n\nwhere $\\mathcal{B}^*(\\cdot)$ is the proxy of the backward mapping $\\mathcal{A}^*$, and $\\lambda_n$ is a regularization parameter as in (2). One of the most important properties of (8) is that the estimator is available in closed form: $\\widehat{\\theta} = S_{\\lambda_n}\\big(\\mathcal{B}^*(\\widehat{\\phi})\\big)$, where $[S_\\lambda(u)]_i = \\operatorname{sign}(u_i) \\max(|u_i| - \\lambda, 0)$ is the element-wise soft-thresholding function. This can be shown by the fact that the optimization problem (8) is decomposable into independent element-wise sub-problems, where each sub-problem corresponds to soft-thresholding.\n\nTo get some intuition on our approach, let us first revisit classical MLE estimators for graphical models as in (1), and see where they \u201cbreak down\u201d in a high-dimensional setting: $\\operatorname{minimize}_\\theta \\; -\\langle \\theta, \\widehat{\\phi} \\rangle + A(\\theta)$. By the stationary condition of this optimization problem, the MLE estimator can be simply expressed as a backward mapping $\\mathcal{A}^*(\\widehat{\\phi})$. There are two caveats here in high-dimensional settings. The first is that this backward mapping need not have a simple closed form, and is typically intractable to compute for a large number of variables $p$. 
The second is that the backward mapping is well-defined only for mean parameters that are in the interior $\\mathcal{M}^o$ of the marginal polytope, whereas the sample moments $\\widehat{\\phi}$ might well lie on the boundary of the marginal polytope. We will illustrate these two caveats in the next two examples.\n\nOur key idea is to use instead a well-defined proxy function $\\mathcal{B}^*(\\cdot)$ in lieu of the MLE backward map $\\mathcal{A}^*(\\cdot)$ so that $\\mathcal{B}^*(\\widehat{\\phi})$ is both well-defined under high-dimensional settings and available in a simple closed form. The optimization problem (8) seeks an estimator with minimum complexity in terms of the regularizer $\\|\\cdot\\|_1$ while being close enough to some \u201cinitial estimator\u201d $\\mathcal{B}^*(\\widehat{\\phi})$ in terms of the element-wise $\\ell_\\infty$ norm, ensuring that the final estimator has the desired sparse structure.\n\n3.1 Strong Statistical Guarantees of Closed-form Estimators\n\nWe now provide a statistical analysis of estimators in (8) under the following structural constraint:\n\n(C-Sparsity) The \u201ctrue\u201d canonical exponential family parameter $\\theta^*$ is exactly sparse with $k$ non-zero elements indexed by the support set $S$. All other elements in $S^c$ are zeros.\n\nTheorem 1. Consider any graphical model in (1) with sparse canonical parameter $\\theta^*$ as stated in (C-Sparsity). Suppose we solve (8) setting the constraint bound $\\lambda_n$ such that $\\lambda_n \\ge \\|\\theta^* - \\mathcal{B}^*(\\widehat{\\phi})\\|_\\infty$.\n\n(A) Then the optimal solution $\\widehat{\\theta}$ satisfies the following error bounds:\n\n$$\\|\\widehat{\\theta} - \\theta^*\\|_\\infty \\le 2\\lambda_n, \\qquad \\|\\widehat{\\theta} - \\theta^*\\|_2 \\le 4\\sqrt{k}\\,\\lambda_n, \\qquad \\text{and} \\qquad \\|\\widehat{\\theta} - \\theta^*\\|_1 \\le 8k\\lambda_n.$$\n\n(B) The support set of the estimate $\\widehat{\\theta}$ correctly excludes all true zero coordinates of $\\theta^*$. Moreover, under the additional assumption that $\\min_{s \\in S} |\\theta^*_s| \\ge 3\\|\\theta^* - \\mathcal{B}^*(\\widehat{\\phi})\\|_\\infty$, it correctly includes all non-zero coordinates of $\\theta^*$.\n\nRemarks. 
Theorem 1 is a non-probabilistic result, and holds deterministically for any selection of $\\lambda_n$ and any selection of $\\mathcal{B}^*(\\cdot)$. We would then use a probabilistic analysis when applying the theorem to specific distributional settings and choices of the backward map $\\mathcal{B}^*(\\cdot)$.\n\nWe note that while the theorem analyzes the case of sparsity-structured parameters, our class of estimators as well as analyses can be seamlessly extended to more general structures (such as group sparsity and low rank), by substituting appropriate regularization functions in (8).\n\nA key ingredient in our class of closed-form estimators is the proxy backward map $\\mathcal{B}^*(\\widehat{\\phi})$. The conditions of the theorem require that this backward map be carefully constructed in order for the error bounds and sparsistency guarantees to hold. In the following sections, we will see how to precisely construct such backward maps $\\mathcal{B}^*(\\cdot)$ for specific problem instances, and then derive the corresponding consequences of our abstract theorem as corollaries.\n\n4 Closed-form Estimators for Inverse Covariance Estimation in Gaussian Graphical Models\n\nIn this section, we derive a class of closed-form estimators for the multivariate Gaussian setting in Section 2.1. From our discussion of Gaussian graphical models in Section 2.1, the backward mapping from moments to the canonical parameters can be simply computed as $\\mathcal{A}^*(\\Sigma) = \\Sigma^{-1}$, but only provided $\\Sigma \\in \\mathcal{M}^o := \\{\\Sigma \\in \\mathbb{R}^{p \\times p} : \\Sigma \\succ 0\\}$. However, given the sample covariance, we cannot just compute the MLE as $\\mathcal{A}^*(S) = S^{-1}$ since the sample covariance matrix is rank-deficient and hence does not belong to $\\mathcal{M}^o$ under high-dimensional settings where $p > n$.\n\nIn our estimation framework (8), we thus use an alternative backward mapping $\\mathcal{B}^*(\\cdot)$ via a thresholding operator. 
Specifically, for any matrix $M \\in \\mathbb{R}^{p \\times p}$, we consider the family of thresholding operators $T_\\nu(M) : \\mathbb{R}^{p \\times p} \\to \\mathbb{R}^{p \\times p}$ with thresholding parameter $\\nu$, defined as $[T_\\nu(M)]_{ij} := \\rho_\\nu(M_{ij})$, where $\\rho_\\nu(\\cdot)$ is an element-wise thresholding operator. Soft-thresholding is a natural option; however, along the lines of [22], we can use arbitrary sparse thresholding operators satisfying the conditions:\n\n(C-Thresh) For any input $a \\in \\mathbb{R}$, (i) $|\\rho_\\nu(a)| \\le |a|$, (ii) $|\\rho_\\nu(a)| = 0$ for $|a| \\le \\nu$, and finally (iii) $|\\rho_\\nu(a) - a| \\le \\nu$.\n\nAs long as $T_\\nu(S)$ is invertible (which we shall examine in Section 4.1), we can define $\\mathcal{B}^*(S) := [T_\\nu(S)]^{-1}$ and obtain the following class of estimators:\n\nElem-GGM Estimation:\n\n$$\\operatorname*{minimize}_{\\Theta} \\; \\|\\Theta\\|_{1,\\text{off}} \\quad \\text{s.t.} \\quad \\big\\|\\Theta - [T_\\nu(S)]^{-1}\\big\\|_{\\infty,\\text{off}} \\le \\lambda_n \\tag{9}$$\n\nwhere $\\|\\cdot\\|_{\\infty,\\text{off}}$ is the off-diagonal element-wise $\\ell_\\infty$ norm as the dual of $\\|\\cdot\\|_{1,\\text{off}}$.\n\nComparison with related work. Note that [16] suggest a Dantzig-like estimator: $\\operatorname{minimize}_\\Theta \\; \\|\\Theta\\|_1$ s.t. $\\|S\\Theta - I\\|_\\infty \\le \\lambda_n$, where both $\\|\\cdot\\|_1$ and $\\|\\cdot\\|_\\infty$ are entry-wise ($\\ell_1$ and $\\ell_\\infty$, respectively) norms for a matrix. This estimator applies penalty functions even for the diagonal elements so that the problem can be decoupled into multiple but much simpler optimization problems. It still requires solving $p$ linear programs with $2p$ linear constraints for each. On the other hand, the estimator from (9) has a closed-form solution as long as $T_\\nu(S)$ is invertible.\n\n4.1 Convergence Rates for Elem-GGM\n\nIn this section we derive a corollary of Theorem 1 for Elem-GGM. A prerequisite is to show that $\\mathcal{B}^*(S) := [T_\\nu(S)]^{-1}$ is well-defined and \u201cwell-behaved\u201d. 
The following conditions define a broad class of Gaussian graphical models that satisfy this requirement.\n\n(C-MinInf$\\Sigma$) The true canonical parameter $\\Theta^*$ of (3) has bounded induced operator norm such that $|||\\Theta^*|||_\\infty := \\sup_{w \\neq 0 \\in \\mathbb{R}^p} \\frac{\\|\\Theta^* w\\|_\\infty}{\\|w\\|_\\infty} \\le \\kappa_1$.\n\n(C-Sparse$\\Sigma$) The true covariance matrix $\\Sigma^* := (\\Theta^*)^{-1}$ is \u201capproximately sparse\u201d along the lines of Bickel and Levina [23]: for some positive constant $D$, $\\Sigma^*_{ii} \\le D$ for all diagonal entries, and moreover, for some $0 \\le q < 1$ and $c_0(p)$, $\\max_i \\sum_{j=1}^p |\\Sigma^*_{ij}|^q \\le c_0(p)$. If $q = 0$, then this condition boils down to $\\Sigma^*$ being sparse. We additionally require $\\inf_{w \\neq 0 \\in \\mathbb{R}^p} \\frac{\\|\\Sigma^* w\\|_\\infty}{\\|w\\|_\\infty} \\ge \\kappa_2$.\n\nNow we are ready to utilize Theorem 1 and derive the convergence rates for our Elem-GGM (9).\n\nCorollary 1. Consider Gaussian graphical models (3) where the true parameter $\\Theta^*$ has $k$ non-zero off-diagonal elements, and the conditions in (C-MinInf$\\Sigma$) and (C-Sparse$\\Sigma$) hold. Suppose that we solve the optimization problem in (9) with a generalized thresholding operator satisfying (C-Thresh) and setting $\\nu := 16(\\max_i \\Sigma_{ii})\\sqrt{\\frac{10\\tau \\log p'}{n}} := a\\sqrt{\\frac{\\log p'}{n}}$ for $p' := \\max\\{n, p\\}$. Furthermore, suppose also that we select $\\lambda_n := \\frac{4\\kappa_1 a}{\\kappa_2}\\sqrt{\\frac{\\log p'}{n}}$. Then, as long as $n > c_3 \\log p'$, any optimal solution $\\widehat{\\Theta}$ of (9) satisfies\n\n$$\\|\\widehat{\\Theta} - \\Theta^*\\|_{\\infty,\\text{off}} \\le \\frac{8\\kappa_1 a}{\\kappa_2}\\sqrt{\\frac{\\log p'}{n}}, \\quad |||\\widehat{\\Theta} - \\Theta^*|||_F \\le \\frac{16\\kappa_1 a}{\\kappa_2}\\sqrt{\\frac{k \\log p'}{n}}, \\quad \\|\\widehat{\\Theta} - \\Theta^*\\|_{1,\\text{off}} \\le \\frac{32\\kappa_1 a}{\\kappa_2}\\, k\\sqrt{\\frac{\\log p'}{n}}$$\n\nwith probability at least $1 - c_1 \\exp(-c_2 \\log p')$.\n\nWe remark that the rates in Corollary 1 are asymptotically the same as those for standard $\\ell_1$-regularized MLE estimators in (4); for instance, [13] show that $|||\\widehat{\\Theta}_{\\text{MLE}} - \\Theta^*|||_F = O\\Big(\\sqrt{\\frac{k \\log p'}{n}}\\Big)$. This is remarkable given the simplicity of Elem-GGM.\n\n5 Closed-form Estimators for Discrete Markov Random Fields\n\nWe now specialize our class of closed-form estimators (8) to the setting of discrete Markov random fields described in Section 2.1. In this case, computing the backward mapping $\\mathcal{A}^*$ is non-trivial and typically intractable if the graphical structure has loops [18]. Therefore, we need an approximation of the backward map $\\mathcal{A}^*$, for which we will leverage the tree-reweighted variational approximation discussed in Section 2.1. Consider the following map $\\bar{\\theta} := \\mathcal{B}^*_{\\mathrm{trw}}(\\widehat{\\phi})$, where\n\n$$\\bar{\\theta}_{s;j} = \\log \\widehat{\\phi}_{s;j}, \\quad \\text{and} \\quad \\bar{\\theta}_{st;jk} = \\rho_{st} \\log \\frac{\\widehat{\\phi}_{st;jk}}{\\widehat{\\phi}_{s;j}\\,\\widehat{\\phi}_{t;k}} \\tag{10}$$\n\nwhere $\\widehat{\\phi}_{s;j} = \\frac{1}{n}\\sum_{i=1}^n \\mathbb{I}[X_{s,i} = j]$ and $\\widehat{\\phi}_{st;jk} = \\frac{1}{n}\\sum_{i=1}^n \\mathbb{I}[X_{s,i} = j]\\,\\mathbb{I}[X_{t,i} = k]$ are the empirical moments of the sufficient statistics in (6) (we define $0/0 := 1$). It was shown in [20] that $\\mathcal{B}^*_{\\mathrm{trw}}(\\cdot)$ satisfies the following property: the (pseudo)marginals computed by performing tree-reweighted variational inference with the parameters $\\bar{\\theta} := \\mathcal{B}^*_{\\mathrm{trw}}(\\widehat{\\phi})$ yield the sufficient statistics $\\widehat{\\phi}$. In other words, the approximate backward map $\\mathcal{B}^*_{\\mathrm{trw}}$ computes an element in the pre-image of the approximate forward map given by tree-reweighted variational inference. Since tree-reweighted variational inference approximates the true marginals well in practice, the map $\\mathcal{B}^*_{\\mathrm{trw}}(\\cdot)$ is thus a great candidate for an approximate backward map.\n\nAs an alternative to the $\\ell_1$-regularized approximate MLE estimators (7), we thus obtain the following class of estimators using $\\mathcal{B}^*_{\\mathrm{trw}}(\\cdot)$ as an instance of (8):\n\nElem-DMRF Estimation:\n\n$$\\operatorname*{minimize}_{\\theta} \\; \\|\\theta\\|_{1,E} \\quad \\text{s.t.} \\quad \\big\\|\\theta - \\mathcal{B}^*_{\\mathrm{trw}}(\\widehat{\\phi})\\big\\|_{\\infty,E} \\le \\lambda_n \\tag{11}$$\n\nwhere $\\|\\cdot\\|_{\\infty,E}$ is the maximum absolute value of edge-parameters as a dual of $\\|\\cdot\\|_{1,E}$.\n\nNote that given the empirical means of sufficient statistics, $\\mathcal{B}^*_{\\mathrm{trw}}(\\widehat{\\phi})$ can usually be obtained easily, without the need of explicitly specifying the log-partition function approximation $B(\\cdot)$ in (7).\n\n5.1 Convergence Rates for Elem-DMRF\n\nWe now derive the convergence rates of Elem-DMRF for the case where $\\mathcal{B}^*(\\cdot)$ is selected as in (10) following the tree-reweighted approximation [20]. Let $\\mu^*$ be the \u201ctrue\u201d marginals (or mean parameters) from the true log-partition function and true canonical parameter $\\theta^*$: $\\mu^* = \\mathcal{A}(\\theta^*)$. We shall require that the approximation $B_{\\mathrm{trw}}(\\cdot)$ be close enough to the true $A(\\cdot)$ in terms of backward mapping. In addition we assume that true marginal distributions are strictly positive.\n\n(C-LogPartition) $\\|\\theta^* - \\mathcal{B}^*_{\\mathrm{trw}}(\\mu^*)\\|_{\\infty,E} \\le \\epsilon$.\n\n(C-Marginal) For all $s \\in V$ and $j \\in [m]$, the true singleton marginal $\\mu^*_{s;j} := \\mathbb{E}_{\\theta^*}\\mathbb{I}[X_s = j] = \\mathbb{P}(X_s = j; \\theta^*)$ satisfies $\\epsilon_{\\min} < \\mu^*_{s;j}$ for some strictly positive constant $\\epsilon_{\\min} \\in (0, 1)$. Similarly, for all $s, t \\in V$ and all $j, k \\in [m]$, $\\mu^*_{st;jk}$ satisfies $\\epsilon_{\\min} < \\mu^*_{st;jk}$.\n\nNow we are ready to utilize Theorem 1 to derive the convergence rates for our closed-form estimator (11) when $\\theta^*$ has $k$ non-zero pairwise parameters $\\theta^*_{st}$, where we recall the notation that $\\theta_{st} := \\{\\theta_{st;jk}\\}_{j,k=1}^m$ is a collation of the edgewise parameters for edge $(s, t)$. We also define $\\|\\theta\\|_{q,E} := (\\sum_{s \\neq t} \\|\\theta_{st}\\|^q)^{1/q}$, for $q \\in \\{1, 2, \\infty\\}$.\n\nCorollary 2. Consider discrete Markov random fields (6) when the true parameter $\\theta^*$ has actually $k$ non-zero pairwise parameters, and the conditions in (C-LogPartition) and (C-Marginal) also hold in these discrete MRFs. Suppose that we solve the optimization problem in (11) with $\\mathcal{B}^*_{\\mathrm{trw}}(\\cdot)$ set as (10) using the tree-reweighted approximation. Furthermore, suppose also that we select $\\lambda_n := \\epsilon + c_1\\sqrt{\\frac{\\log p}{n}}$ for some positive constant $c_1$ depending only on $\\epsilon_{\\min}$. Then, as long as $n > \\frac{4c_1^2 \\log p}{\\epsilon_{\\min}^2}$, there are universal positive constants $(c_2, c_3)$ such that any optimal solution $\\widehat{\\theta}$ of (11) satisfies\n\n$$\\|\\widehat{\\theta} - \\theta^*\\|_{\\infty,E} \\le 2\\epsilon + 2c_1\\sqrt{\\frac{\\log p}{n}}, \\quad \\|\\widehat{\\theta} - \\theta^*\\|_{2,E} \\le 4\\sqrt{k}\\,\\epsilon + 4c_1\\sqrt{\\frac{k \\log p}{n}}, \\quad \\|\\widehat{\\theta} - \\theta^*\\|_{1,E} \\le 8k\\epsilon + 8c_1 k\\sqrt{\\frac{\\log p}{n}}$$\n\nwith probability at least $1 - c_2 \\exp(-c_3 \\log p')$.\n\n6 Experiments\n\nIn this section, we report a set of synthetic experiments corroborating our theoretical results on both Gaussian and discrete graphical models.\n\nTable 1: Performance of our Elem-GM vs. the state-of-the-art QUIC algorithm [24] solving (4) under two different regimes: (Left) $(n, p) = (800, 1600)$, (Right) $(n, p) = (5000, 10000)$.\n\n(Left) $(n, p) = (800, 1600)$\nMethod | K | Time (sec) | $\\ell_F$ (off) | $\\ell_\\infty$ (off) | FPR | TPR\nElem-GM | 0.01 | < 1 | 6.36 | 0.1616 | 0.48 | 0.99\nElem-GM | 0.02 | < 1 | 6.19 | 0.1880 | 0.24 | 0.99\nElem-GM | 0.05 | < 1 | 5.91 | 0.1655 | 0.06 | 0.99\nElem-GM | 0.1 | < 1 | 6 | 0.1703 | 0.01 | 0.97\nQUIC | 0.5 | 2575.5 | 12.74 | 0.11 | 0.52 | 1.00\nQUIC | 1 | 1009 | 7.30 | 0.13 | 0.35 | 0.99\nQUIC | 2 | 272.1 | 6.33 | 0.18 | 0.16 | 0.99\nQUIC | 3 | 78.1 | 6.97 | 0.21 | 0.07 | 0.94\nQUIC | 4 | 28.7 | 7.68 | 0.23 | 0.02 | 0.86\n\n(Right) $(n, p) = (5000, 10000)$\nMethod | K | Time (sec) | $\\ell_F$ (off) | $\\ell_\\infty$ (off) | FPR | TPR\nElem-GM | 0.05 | 47.3 | 11.73 | 0.1501 | 0.13 | 1.00\nElem-GM | 0.1 | 46.3 | 8.91 | 0.1479 | 0.03 | 1.00\nElem-GM | 0.5 | 45.8 | 5.66 | 0.1308 | 0.0 | 1.00\nElem-GM | 1 | 46.2 | 8.63 | 0.1111 | 0.0 | 0.99\nQUIC | 2 | * | * | * | * | *\nQUIC | 2.5 | * | * | * | * | *\nQUIC | 3 | $4.8 \\times 10^4$ | 9.85 | 0.1083 | 0.06 | 1.00\nQUIC | 3.5 | $2.7 \\times 10^4$ | 10.51 | 0.1111 | 0.04 | 0.99\n\nTable 2: Performance of Elem-DMRF vs. the regularized MLE-based approach of [12] for structure recovery of DMRFs.\n\nGraph Type | # Parameters | Method | Time (sec) | TPR | FNR\nChain Graph | 128 | Elem-DMRF | 0.17 | 0.87 | 0.01\nChain Graph | 128 | Regularized MLE | 7.30 | 0.81 | 0.01\nChain Graph | 2000 | Elem-DMRF | 21.67 | 0.79 | 0.12\nChain Graph | 2000 | Regularized MLE | 4315.10 | 0.75 | 0.21\nGrid Graph | 128 | Elem-DMRF | 0.17 | 0.97 | 0.01\nGrid Graph | 128 | Regularized MLE | 7.99 | 0.84 | 0.02\nGrid Graph | 2000 | Elem-DMRF | 21.68 | 0.80 | 0.12\nGrid Graph | 2000 | Regularized MLE | 4454.44 | 0.77 | 0.18\n\nGaussian Graphical Models. We now corroborate Corollary 1, and furthermore, compare our estimator with the $\\ell_1$-regularized MLE in terms of statistical performance with respect to the parameter error $\\|\\widehat{\\Theta} - \\Theta^*\\|_q$ for $q \\in \\{2, \\infty\\}$, as well as in terms of computational performance.\n\nTo generate true inverse covariance matrices $\\Theta^*$ with a random sparsity structure, we follow the procedure described in [25, 24]. We first generate a sparse matrix $U$ whose non-zero entries are set to $\\pm 1$ with equal probabilities. $\\Theta^*$ is then set to $U^\\top U$ and then a diagonal term is added to ensure $\\Theta^*$ is positive definite. Finally, we normalize $\\Theta^*$ with $\\max_{i=1}^p \\Theta^*_{ii}$ so that the maximum diagonal entry is equal to 1. We control the number of non-zeros in $U$ so that the number of non-zeros in the final $\\Theta^*$ is approximately $10p$. We additionally set the number of samples $n$ to half of the number of variables $p$. 
Note that although the number of variables is $p$, the total number of entries in the canonical parameter, the inverse covariance matrix, is $O(p^2)$.\n\nTable 1 summarizes the performance of our closed-form estimators in terms of computation time, $\\|\\hat{\\Theta} - \\Theta^*\\|_{\\infty,\\mathrm{off}}$, and $|||\\hat{\\Theta} - \\Theta^*|||_{F,\\mathrm{off}}$. We fix the thresholding parameter $\\nu = 2.5\\sqrt{\\log p / n}$ for all settings and vary the regularization parameter $\\lambda_n = K\\sqrt{\\log p / n}$ to investigate how this regularizer affects the final estimators. The baselines are the $\\ell_1$-regularized MLE estimators in (4), computed with the QUIC algorithm [24], which is one of the fastest ways to solve (4). In the table, we show the results of the QUIC algorithm run with tolerance $\\epsilon = 10^{-4}$; * indicates that the algorithm did not terminate within 15 hours. In the Appendix, we provide more extensive comparisons, including receiver operating characteristic (ROC) curves, for the settings in Table 1. As can be seen from the table and the figure, Elem-GM estimators are statistically competitive in terms of all types of errors and support set recovery, while being much faster computationally than classical methods based on the $\\ell_1$-regularized MLE.\n\nDiscrete Graphical Models We consider two different classes of pairwise graphical models: chain graphs and grids. In each case, the size of the alphabet is set to $m = 3$, and the true parameter vector $\\theta^*$ is generated by sampling each non-zero entry from $\\mathcal{N}(0, 1)$.\n\nWe compare Elem-DMRF with the group-sparse regularized MLE-based approach of Jalali et al. [12], which uses group $\\ell_1/\\ell_2$ regularization, where all the parameters of an edge form a group, so as to encourage sparsity in terms of the edges; we solved it using proximal gradient descent. While our estimator in (11) uses vanilla sparsity, here we use a simple extension to the group-sparse structured setting; please see Appendix E for more details. 
For both methods, the tuning parameter is set to $\\lambda_n = c\\sqrt{\\log p / n}$, where $c$ is selected using cross-validation. We use 20 simulation runs, where for each run $n = p/2$ samples are drawn from the distribution specified by $\\theta^*$.\n\nWe report true positive rates, false positive rates, and timing for each method. The timing excludes the time spent in the cross-validation process. (Had we taken cross-validation into account, the advantage of our method would be even more pronounced, since the entire path of solutions can be computed via simple group-wise thresholding operations.) The results in Table 2 show that Elem-DMRF is much faster than its MLE-based counterpart and yields competitive results in terms of structure recovery.\n\nAcknowledgments E.Y. and P.R. acknowledge the support of ARO via W911NF-12-1-0390 and NSF via IIS-1149803, IIS-1320894, IIS-1447574, and DMS-1264033.\n\nReferences\n\n[1] C. D. Manning and H. Schutze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.\n\n[2] J. W. Woods. Markov image modeling. IEEE Transactions on Automatic Control, 23:846\u2013850, October 1978.\n\n[3] M. Hassner and J. Sklansky. Markov random field models of digitized image texture. In ICPR78, pages 538\u2013540, 1978.\n\n[4] G. Cross and A. Jain. Markov random field texture models. IEEE Trans. PAMI, 5:25\u201339, 1983.\n\n[5] E. Ising. Beitrag zur Theorie des Ferromagnetismus. Zeitschrift f\u00fcr Physik, 31:253\u2013258, 1925.\n\n[6] B. D. Ripley. Spatial statistics. Wiley, New York, 1981.\n\n[7] N. Meinshausen and P. B\u00fchlmann. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34:1436\u20131462, 2006.\n\n[8] M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19\u201335, 2007.\n\n[9] J. Friedman, T. Hastie, and R. Tibshirani. 
Sparse inverse covariance estimation with the graphical Lasso. Biostatistics, 2007.\n\n[10] O. Banerjee, L. El Ghaoui, and A. d\u2019Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Jour. Mach. Lear. Res., 9:485\u2013516, March 2008.\n\n[11] P. Ravikumar, M. J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using $\\ell_1$-regularized logistic regression. Annals of Statistics, 38(3):1287\u20131319, 2010.\n\n[12] A. Jalali, P. Ravikumar, V. Vasuki, and S. Sanghavi. On learning discrete graphical models using group-sparse regularization. In Inter. Conf. on AI and Statistics (AISTATS), 14, 2011.\n\n[13] P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing $\\ell_1$-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935\u2013980, 2011.\n\n[14] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, Cambridge, UK, 2004.\n\n[15] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267\u2013288, 1996.\n\n[16] T. Cai, W. Liu, and X. Luo. A constrained $\\ell_1$ minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594\u2013607, 2011.\n\n[17] E. Yang, A. Lozano, and P. Ravikumar. Elementary estimators for high-dimensional linear regression. In International Conference on Machine Learning (ICML), 31, 2014.\n\n[18] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families and variational inference. Foundations and Trends in Machine Learning, 1(1\u20132):1\u2013305, December 2008.\n\n[19] E. Yang and P. Ravikumar. On the use of variational inference for learning discrete graphical models. In International Conference on Machine Learning (ICML), 28, 2011.\n\n[20] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. 
Tree-reweighted belief propagation algorithms and approximate ML estimation by pseudomoment matching. In Inter. Conf. on AI and Statistics (AISTATS), 2003.\n\n[21] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Generalized belief propagation. In NIPS 13, pages 689\u2013695. MIT Press, 2001.\n\n[22] A. J. Rothman, E. Levina, and J. Zhu. Generalized thresholding of large covariance matrices. Journal of the American Statistical Association (Theory and Methods), 104:177\u2013186, 2009.\n\n[23] P. J. Bickel and E. Levina. Covariance regularization by thresholding. Annals of Statistics, 36(6):2577\u20132604, 2008.\n\n[24] C. J. Hsieh, M. Sustik, I. Dhillon, and P. Ravikumar. Sparse inverse covariance matrix estimation using quadratic approximation. In Neur. Info. Proc. Sys. (NIPS), 24, 2011.\n\n[25] L. Li and K. C. Toh. An inexact interior point method for l1-regularized sparse covariance selection. Mathematical Programming Computation, 2:291\u2013315, 2010.", "award": [], "sourceid": 1142, "authors": [{"given_name": "Eunho", "family_name": "Yang", "institution": "UT Austin"}, {"given_name": "Aurelie", "family_name": "Lozano", "institution": "IBM T.J. Watson Research Center"}, {"given_name": "Pradeep", "family_name": "Ravikumar", "institution": "UT Austin"}]}