{"title": "Statistical Inference for Pairwise Graphical Models Using Score Matching", "book": "Advances in Neural Information Processing Systems", "page_first": 2829, "page_last": 2837, "abstract": "Probabilistic graphical models have been widely used to model complex systems and aid scientific discoveries. As a result, there is a large body of literature focused on consistent model selection. However, scientists are often interested in understanding uncertainty associated with the estimated parameters, which current literature has not addressed thoroughly. In this paper, we propose a novel estimator for edge parameters for pairwise graphical models based on Hyv\\\"arinen scoring rule. Hyv\\\"arinen scoring rule is especially useful in cases where the normalizing constant cannot be obtained efficiently in a closed form. We prove that the estimator is $\\sqrt{n}$-consistent and asymptotically Normal. This result allows us to construct confidence intervals for edge parameters, as well as, hypothesis tests. We establish our results under conditions that are typically assumed in the literature for consistent estimation. However, we do not require that the estimator consistently recovers the graph structure. In particular, we prove that the asymptotic distribution of the estimator is robust to model selection mistakes and uniformly valid for a large number of data-generating processes. We illustrate validity of our estimator through extensive simulation studies.", "full_text": "Statistical Inference for Pairwise Graphical Models\n\nUsing Score Matching\n\nMing Yu\n\nmingyu@chicagobooth.edu\n\nVarun Gupta\n\nvarun.gupta@chicagobooth.edu\n\nmladen.kolar@chicagobooth.edu\n\nUniversity of Chicago Booth School of Business\n\nMladen Kolar\u21e4\n\nChicago, IL 60637\n\nAbstract\n\nProbabilistic graphical models have been widely used to model complex systems\nand aid scienti\ufb01c discoveries. 
As a result, there is a large body of literature focused on consistent model selection. However, scientists are often interested in understanding the uncertainty associated with the estimated parameters, which the current literature has not addressed thoroughly. In this paper, we propose a novel estimator of edge parameters in pairwise graphical models based on the Hyvärinen scoring rule. The Hyvärinen scoring rule is especially useful in cases where the normalizing constant cannot be obtained efficiently in a closed form. We prove that the estimator is √n-consistent and asymptotically normal. This result allows us to construct confidence intervals for edge parameters, as well as hypothesis tests. We establish our results under conditions that are typically assumed in the literature for consistent estimation. However, we do not require that the estimator consistently recovers the graph structure. In particular, we prove that the asymptotic distribution of the estimator is robust to model selection mistakes and uniformly valid for a large class of data-generating processes. We illustrate the validity of our estimator through extensive simulation studies.

1 Introduction

Undirected probabilistic graphical models are widely used to explore and represent dependencies between random variables. They have been used in areas ranging from computational biology to neuroscience and finance; see [7] for a recent review. An undirected probabilistic graphical model consists of an undirected graph G = (V, E), where V = {1, ..., p} is the vertex set and E ⊂ V × V is the edge set, and a random vector X = (X_1, ..., X_p) ∈ X ⊆ R^p. Each coordinate of the random vector X is associated with a vertex in V, and the graph structure encodes the conditional independence assumptions underlying the distribution of X.
In particular, X_a and X_b are conditionally independent given all the other variables if and only if (a, b) ∉ E, that is, the nodes a and b are not adjacent in G. One of the fundamental problems in statistics is that of learning the structure of G from i.i.d. samples from X and quantifying the uncertainty of the estimated structure.

*This work is supported by an IBM Corporation Faculty Research Fund at the University of Chicago Booth School of Business. This work was completed in part with resources provided by the University of Chicago Research Computing Center.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

We consider a basic class of pairwise interaction graphical models with densities belonging to an exponential family P = {p_θ(x) | θ ∈ Θ} with natural parameter space Θ and

log p_θ(x) = Σ_{a ∈ V} Σ_{k ∈ [K]} θ_a^{(k)} t_a^{(k)}(x_a) + Σ_{(a,b) ∈ E} Σ_{l ∈ [L]} θ_ab^{(l)} t_ab^{(l)}(x_a, x_b) - Ψ(θ) + Σ_{a ∈ V} h_a(x_a),   x ∈ X ⊆ R^p.   (1)

The functions t_a^{(k)}, t_ab^{(l)} are sufficient statistics and Ψ(θ) is the log-partition function. In this paper the support of the densities is either X = R^p or X = R^p_+, and P is dominated by the Lebesgue measure on R^p. To simplify the notation, we write log p_θ(x) = θ^T t(x) - Ψ(θ) + h(x), where θ ∈ R^s and t(x) : R^p → R^s with s = (p choose 2)·L + p·K. The natural parameter space has the form Θ = {θ ∈ R^s | Ψ(θ) = log ∫_X exp(θ^T t(x)) dx < ∞}. Under the model in (1), there is no edge between a and b in the corresponding conditional independence graph if and only if θ_ab^{(1)} = ··· = θ_ab^{(L)} = 0.
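To make (1) concrete, here is a minimal sketch (ours, not from the paper) of the Gaussian special case: node statistics t_a(x_a) = (x_a, x_a²) (so K = 2), a single edge statistic t_ab(x_a, x_b) = x_a x_b (so L = 1), and edge parameters given by the off-diagonal entries of a precision matrix. All function and variable names below are our own.

```python
import numpy as np

def unnormalized_log_density(x, eta, Omega):
    """log p(x) + Psi(theta) for the Gaussian instance of (1):
    linear node terms eta, quadratic node terms -0.5*Omega[a,a]*x_a^2,
    and edge terms -Omega[a,b]*x_a*x_b for a < b."""
    p = len(x)
    quad = sum(-0.5 * Omega[a, a] * x[a] ** 2 for a in range(p))
    pair = sum(-Omega[a, b] * x[a] * x[b]
               for a in range(p) for b in range(a + 1, p))
    return eta @ x + quad + pair

Omega = np.array([[1.0, 0.5], [0.5, 1.0]])   # precision matrix (edge weight 0.5)
eta = np.zeros(2)
x = np.array([0.3, -0.2])

# Must agree with the standard Gaussian quadratic form -0.5 x' Omega x (eta = 0).
assert np.isclose(unnormalized_log_density(x, eta, Omega), -0.5 * x @ Omega @ x)
```

Note that evaluating the unnormalized log-density never touches the log-partition function Ψ(θ), which is precisely the property that score matching exploits below.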
The model in (1) encompasses a large number of graphical models studied in the literature (see, for example, [7, 15] and references therein).

The main focus of the paper is the construction of an asymptotically normal estimator for parameters in (1) and performing (asymptotic) inference for them. We illustrate a procedure for constructing valid confidence intervals that have the nominal coverage and propose a statistical test with nominal size for the existence of edges in the graphical model. Our inference results are robust to model selection mistakes, which commonly occur in the ultra-high-dimensional setting. The results in the paper complement the existing literature, which is focused on consistent model selection and parameter recovery, as we review in the next section.

We use the Hyvärinen scoring rule to estimate θ, as in [15]. However, rather than focusing on consistent model selection, we modify the regularized score matching procedure to construct a regular estimator that is robust to model selection mistakes and show how to use its asymptotic distribution for statistical inference. Compared to previous work on high-dimensional inference in graphical models [23, 2, 29, 11], this is the first work on inference in models where computing the normalizing constant is intractable.

Related work. Our work straddles two areas of statistical learning which have attracted significant research of late: model selection and estimation in high-dimensional graphical models, and high-dimensional inference. Our approach to inference for high-dimensional graphical models is based on regularized score matching.
We briefly review the literature most relevant to our work and refer the reader to a recent review article for a comprehensive overview [7].

Graphical model selection: Much of the research effort on graphical model selection has been carried out under the assumption that the data obey the law X ∼ N(0, Σ) (Gaussian graphical models), in which case the edge set E of the graph G is encoded by the non-zero elements of the precision matrix Ω = Σ^{-1}. More recently, [31] studied estimation of graphical models under the assumption that the node-conditional distributions belong to an exponential family (including, for example, Bernoulli, Gaussian, Poisson, and exponential) via regularized likelihood (see also [13, 6, 30] and references therein). In our paper, we construct a novel √n-consistent estimator of a parameter corresponding to a particular edge in (1). As mentioned earlier, this is the first procedure that can obtain a parametric rate of convergence for an edge parameter in a graphical model where computing the normalizing constant is intractable.

High-dimensional inference: Methods for constructing confidence intervals for low-dimensional parameters in high-dimensional linear and generalized linear models, as well as hypothesis tests, have been developed in [32, 4, 28, 12]. These methods construct honest, uniformly valid confidence intervals and hypothesis tests based on a first-stage ℓ1-penalized estimator. [16, 23, 5] construct √n-consistent estimators for elements of the precision matrix Ω under a Gaussian assumption. We contribute to the literature on high-dimensional inference by demonstrating how to construct estimators that are robust and uniformly valid under more general distributional assumptions than Gaussianity.

Score matching estimators: Score matching estimators were first proposed in [9, 10].
Score matching offers a computational advantage when the normalization constant is not available in closed form, making likelihood-based approaches intractable. Despite its power, there have not been any results on inference in high-dimensional models using score matching. In [8], the authors use score matching for inference in Gaussian linear models (and hence for Gaussian graphical models) in the low-dimensional setting. In [15], the authors use ℓ1-regularized score matching to develop consistent estimators for graphical models in the high-dimensional setting. We present the first high-dimensional inference results using score matching.

2 Score Matching

Let X be a random variable with values in X, and let P be a family of distributions over X. A scoring rule S(x, Q) is a real-valued function that quantifies the accuracy of Q ∈ P upon observing a realized value x ∈ X of X. There are a large number of scoring rules that correspond to different decision problems [20]. Given n independent realizations of X, {x_i}_{i ∈ [n]}, one finds the optimal score estimator Q̂ ∈ P that minimizes the empirical score

Q̂ = argmin_{Q ∈ P} E_n[S(x_i, Q)].   (2)

When X = R^p and P consists of twice-differentiable densities with respect to the Lebesgue measure, the Hyvärinen scoring rule [9] is given as

S(x, Q) = (1/2) ||∇ log q(x)||₂² + Δ log q(x),   (3)

where q is the density of Q with respect to the Lebesgue measure on X, ∇f(x) = {∂/(∂x_j) f(x)}_{j ∈ [p]} denotes the gradient, and Δf(x) = Σ_{j ∈ [p]} ∂²/(∂x_j²) f(x) the Laplacian operator on R^p. This scoring rule is convenient for learning models that are specified in an unnormalized fashion or whose normalizing constant is difficult to compute. The score matching rule is proper, that is, E_{X∼P} S(X, Q) is minimized over P at Q = P. Under suitable regularity conditions, the Fisher divergence between P, Q ∈ P, D(P, Q) = ∫ p(x) ||∇ log q(x) - ∇ log p(x)||₂² dx, where p is the density of P, is induced by the score matching rule [9].
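As a sanity check of (2) and (3), the following sketch (ours, not from the paper) evaluates the Hyvärinen score of a univariate Gaussian. Only derivatives of log q enter the score, so the normalizing constant is never computed, and the empirical score is minimized in closed form at the usual moment estimators.

```python
import numpy as np

def hyvarinen_score(x, mu, s2):
    """Rule (3) for a univariate N(mu, s2):
    d/dx log q = -(x - mu)/s2 and d^2/dx^2 log q = -1/s2."""
    dlogq = -(x - mu) / s2
    d2logq = -1.0 / s2
    return 0.5 * dlogq ** 2 + d2logq

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=50_000)

# Minimizing the empirical score over (mu, s2) in closed form gives the
# moment estimators: mu_hat = mean(x), s2_hat = mean((x - mu_hat)^2).
mu_hat = x.mean()
s2_hat = ((x - mu_hat) ** 2).mean()

# The empirical score at the minimizer beats the score at a perturbed point.
best = hyvarinen_score(x, mu_hat, s2_hat).mean()
worse = hyvarinen_score(x, mu_hat + 0.5, 2.0 * s2_hat).mean()
assert best < worse
```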
For a parametric exponential family P = {p_θ | θ ∈ Θ} with densities given in (1), minimizing (2) can be done in closed form [9, 8]. An estimator θ̂ obtained in this way can be shown to be asymptotically consistent [9]; however, in general it will not be efficient [8].

Hyvärinen [10] proposed a generalization of the score matching approach to the case of non-negative data. When X = R^p_+, the scoring rule is given as

S₊(x, Q) = Σ_{a ∈ V} [ 2x_a ∂log q(x)/∂x_a + x_a² ∂²log q(x)/∂x_a² + (1/2) x_a² (∂log q(x)/∂x_a)² ].   (4)

For exponential families, the non-negative score matching loss can again be obtained in closed form, and the estimator is consistent and asymptotically normal under suitable conditions [10].

In the context of probabilistic graphical models, [8] studied score matching to learn Gaussian graphical models with symmetry constraints. [15] proposed a regularized score matching procedure to learn the conditional independence graph in a high-dimensional setting by minimizing E_n[ℓ(x_i, θ)] + λ||θ||₁, where the loss ℓ(x_i, θ) is either S(x_i, Q_θ) or S₊(x_i, Q_θ). For Gaussian models, ℓ1-norm regularized score matching is a simple but state-of-the-art method, which coincides with the method in [17]. Extending the work on estimation of infinite-dimensional exponential families [26], [27] study learning the structure of nonparametric probabilistic graphical models using a score matching estimator.

In the next section, we present a new estimator for components of θ in (1) that is consistent and asymptotically normal, building on [15] and [4].

3 Methodology

In this section, we propose a procedure that constructs a √n-consistent estimator of an element θ_ab of θ. Our procedure is based on three steps that we describe after introducing some additional notation.
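As a quick numerical illustration (ours, not from the paper) of the non-negative rule (4): for a univariate exponential density q(x) ∝ exp(-θx) on (0, ∞) we have ∂log q/∂x = -θ and a vanishing second derivative, so S₊(x) = -2xθ + (1/2)x²θ², and minimizing the empirical score gives θ̂ = 2·mean(x)/mean(x²) in closed form, with no normalizing constant required.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = 3.0
# numpy parameterizes the exponential by its scale = 1/rate.
x = rng.exponential(scale=1.0 / theta_true, size=100_000)

# Closed-form minimizer of the empirical non-negative score (4) for q ∝ exp(-theta*x):
# d/dtheta E_n[-2*x*theta + 0.5*x^2*theta^2] = 0  =>  theta_hat = 2 E_n[x] / E_n[x^2].
theta_hat = 2 * x.mean() / (x ** 2).mean()

# For Exp(theta): E[x] = 1/theta and E[x^2] = 2/theta^2, so the estimand is theta.
assert abs(theta_hat - theta_true) < 0.1
```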
We start by describing the procedure for the case where X = R^p. For fixed indices a, b ∈ [p], let q_θ^{ab}(x) := q_θ^{ab}(x_a, x_b | x_{-ab}) be the conditional density of (X_a, X_b) given X_{-ab} = x_{-ab}. In particular,

log q_θ^{ab}(x) = ⟨θ^{ab}, φ(x)⟩ - Ψ_ab(θ, x_{-ab}) + h_ab(x),

where θ^{ab} ∈ R^{s₀} is the part of the vector θ corresponding to {θ_a^{(k)}, θ_b^{(k)}}_{k ∈ [K]} and {θ_ab^{(l)}, θ_ac^{(l)}, θ_bc^{(l)}}_{l ∈ [L], c ∈ -ab}, and φ(x) = φ^{ab}(x) ∈ R^{s₀} is the corresponding vector of sufficient statistics, with dimension s₀ = 2K + (2(p - 2) + 1)L. Here Ψ_ab(θ, x_{-ab}) is the log-partition function for the conditional distribution and h_ab(x) = h_a(x_a) + h_b(x_b). Let ∇_ab f(x) = ((∂/∂x_a) f(x), (∂/∂x_b) f(x))^T ∈ R² be the gradient with respect to x_a and x_b, and Δ_ab f(x) = (∂²/∂x_a²) f(x) + (∂²/∂x_b²) f(x).

With this notation, we introduce the following scoring rule:

S_ab(x, θ) = (1/2) ||∇_ab log q_θ^{ab}(x)||₂² + Δ_ab log q_θ^{ab}(x) = (1/2) θ^T Γ(x) θ + θ^T g(x),   (5)

where

Γ(x) = φ₁(x) φ₁(x)^T + φ₂(x) φ₂(x)^T  and  g(x) = φ₁(x) h₁^{ab}(x) + φ₂(x) h₂^{ab}(x) + Δ_ab φ(x),

with φ₁ = (∂/∂x_a) φ, φ₂ = (∂/∂x_b) φ, h₁^{ab} = (∂/∂x_a) h_ab, and h₂^{ab} = (∂/∂x_b) h_ab. This scoring rule is related to the one in (3); however, rather than using the density q_θ in evaluating the parameter vector, we only consider the conditional density q_θ^{ab}. We will use this conditional scoring rule to create an asymptotically normal estimator of the element θ_ab.
Our motivation for using this estimator comes from the fact that the parameter θ_ab can be identified from the conditional distribution of (X_a, X_b) | X_{M_ab}, where M_ab := {c | (a, c) ∈ E or (b, c) ∈ E} is the Markov blanket of (X_a, X_b). Furthermore, the optimization problems arising in steps 1-3 below can be solved much more efficiently, as the problems are of much smaller dimension.

We are now ready to describe our procedure for estimating θ_ab, which proceeds in three steps.

Step 1: We find a pilot estimator of θ^{ab} by solving the program

θ̂^{ab} = argmin_{θ ∈ R^{s₀}} E_n[S_ab(x_i, θ)] + λ₁ ||θ||₁,   (6)

where λ₁ is a tuning parameter. Let M̂₁ = M(θ̂^{ab}) := {(c, d) | θ̂^{ab}_{cd} ≠ 0}.

Since we are after an asymptotically normal estimator of θ_ab, one may think that it is sufficient to find θ̃^{ab} = argmin {E_n[S_ab(x_i, θ)] | M(θ) ⊆ M̂₁} and appeal to the results of [21]. Unfortunately, this is not the case. Since θ̃^{ab} is obtained via a model selection procedure, it is irregular and its asymptotic distribution cannot be estimated [14, 22]. Therefore, we proceed to create a regular estimator of θ_ab in steps 2 and 3. The idea is to create an estimator θ̃_ab that is insensitive to first-order perturbations of the other components of θ̃^{ab}, which we treat as nuisance components. The idea of creating an estimator that is robust to perturbations of the nuisance has recently been used in [4]; however, the approach goes back to the work of [19].

Step 2: Let γ̂^{ab} be a minimizer of

(1/2) E_n[(φ_{1,ab}(x_i) - φ_{1,-ab}(x_i)^T γ)² + (φ_{2,ab}(x_i) - φ_{2,-ab}(x_i)^T γ)²] + λ₂ ||γ||₁.   (7)

The vector (1, -γ̂^{ab,T})^T approximately computes a row of the inverse of the Hessian in (6).

Step 3: Let M̃ = {(a, b)} ∪ M̂₁ ∪ M(γ̂^{ab}). We obtain our estimator as a solution to the program

θ̃^{ab} = argmin E_n[S_ab(x_i, θ)]  s.t.  M(θ) ⊆ M̃.   (8)

Motivation for this procedure will be clear from the proof of Theorem 1 given in the next section.

Extension to non-negative data. For non-negative data, the procedure is slightly different. Instead of (5), as shown in [15], we define a different scoring rule S_ab^+(x, θ) = (1/2) θ^T Γ₊(x) θ + θ^T g₊(x) with

Γ₊(x) = x_a² φ₁(x) φ₁(x)^T + x_b² φ₂(x) φ₂(x)^T

and

g₊(x) = x_a² φ₁(x) h₁^{ab}(x) + x_b² φ₂(x) h₂^{ab}(x) + x_a² φ₁₁(x) + x_b² φ₂₂(x) + 2x_a φ₁(x) + 2x_b φ₂(x).

Here φ₁₁ = (∂²/∂x_a²) φ and φ₂₂ = (∂²/∂x_b²) φ. Now we can define φ̃₁ = x_a φ₁ and φ̃₂ = x_b φ₂. Then Γ₊(x) = φ̃₁(x) φ̃₁(x)^T + φ̃₂(x) φ̃₂(x)^T, which is of the same form as in (5), with φ̃₁ and φ̃₂ replacing φ₁ and φ₂, respectively. Thus our three-step procedure for non-negative data follows as before.

4 Asymptotic Normality of the Estimator

In this section, we outline the main theoretical properties of our procedure. We start by providing high-level conditions that allow us to establish the properties of each step in our procedure.

Assumption M. We are given n i.i.d. samples {x_i}_{i ∈ [n]} from p_{θ*} of the form in (1). The parameter vector θ* is sparse, with |M(θ^{ab,*})| ≪ n. Let

γ^{ab,*} = argmin_γ E[(φ_{1,ab}(x_i) - φ_{1,-ab}(x_i)^T γ)² + (φ_{2,ab}(x_i) - φ_{2,-ab}(x_i)^T γ)²]   (9)

and let ξ_{1i} = φ_{1,ab}(x_i) - φ_{1,-ab}(x_i)^T γ^{ab,*} and ξ_{2i} = φ_{2,ab}(x_i) - φ_{2,-ab}(x_i)^T γ^{ab,*} for i ∈ [n]. The vector γ^{ab,*} is sparse, with |M(γ^{ab,*})| ≪ n. Let m = |M(θ^{ab,*})| ∨ |M(γ^{ab,*})|.

Assumption M supposes that the parameter to be estimated is sparse, which makes estimation in the high-dimensional setting feasible. An extension to approximately sparse parameters is possible, but technical.
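Steps 1 and 3 of the procedure of Section 3 can be sketched schematically as follows. This is our toy code, not the authors' implementation: we assume the empirical quantities Γ̄ = E_n[Γ(x_i)] and ḡ = E_n[g(x_i)] from (5) are precomputed, use soft-thresholded (proximal) gradient descent as a stand-in solver for the ℓ1-penalized quadratic in (6), and refit without penalty on the selected support as in (8). Step 2 is an analogous ℓ1-penalized least-squares problem and is omitted for brevity.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def l1_score_matching(Gamma_bar, g_bar, lam, iters=2000):
    """Step 1: minimize 0.5 th' Gamma_bar th + th' g_bar + lam*||th||_1 (ISTA)."""
    step = 1.0 / np.linalg.eigvalsh(Gamma_bar).max()
    th = np.zeros(len(g_bar))
    for _ in range(iters):
        th = soft_threshold(th - step * (Gamma_bar @ th + g_bar), step * lam)
    return th

def refit_on_support(Gamma_bar, g_bar, support):
    """Step 3: unpenalized refit constrained to M(theta) contained in support."""
    th = np.zeros(len(g_bar))
    idx = np.flatnonzero(support)
    th[idx] = np.linalg.solve(Gamma_bar[np.ix_(idx, idx)], -g_bar[idx])
    return th

# Toy instance: well-conditioned Gamma_bar, sparse population minimizer theta_star.
rng = np.random.default_rng(2)
A = rng.normal(size=(20, 5))
Gamma_bar = A.T @ A / 20 + 0.5 * np.eye(5)
theta_star = np.array([1.0, -2.0, 0.0, 0.0, 0.0])
g_bar = -Gamma_bar @ theta_star            # so the unpenalized minimizer is theta_star

pilot = l1_score_matching(Gamma_bar, g_bar, lam=0.05)
final = refit_on_support(Gamma_bar, g_bar, np.abs(pilot) > 1e-8)
assert np.linalg.norm(final - theta_star) < 1e-6
```

The refit is exact here because the pilot support contains the true support; in the paper's procedure the refit support M̃ is additionally augmented with (a, b) and the support of the step-2 estimator.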
One of the benefits of using the conditional score to learn the parameters of the model is that the required sample size depends only on the size of M(θ^{ab,*}) and not on the sparsity of the whole vector θ* as in [15]. The second part of the assumption states that the inverse of the population Hessian is approximately sparse, which is a reasonable assumption since the Markov blanket of (X_a, X_b) is small under the sparsity assumption on θ^{ab,*}.

Our next condition assumes that the Hessian in (6) and (7) is well conditioned. Let ρ₋(s, A) = inf {δ^T A δ / ||δ||₂² | 1 ≤ ||δ||₀ ≤ s} and ρ₊(s, A) = sup {δ^T A δ / ||δ||₂² | 1 ≤ ||δ||₀ ≤ s} denote the minimal and maximal s-sparse eigenvalues of a semi-definite matrix A, respectively.

Assumption SE. The event

E_SE = {κ_min ≤ ρ₋(m · log n, E_n[Γ(x_i)]) ≤ ρ₊(m · log n, E_n[Γ(x_i)]) ≤ κ_max}

holds with probability 1 - δ_SE, where 0 < κ_min ≤ κ_max < ∞.

We choose to impose the sparse eigenvalue condition directly on E_n[Γ(x_i)] rather than on the population quantity E[Γ(x_i)]. It is well known that condition SE holds for a large number of models; see, for example, [24] and specifically [31] for exponential family graphical models.

Let r_{jθ} = ||θ̂^{ab} - θ^{ab,*}||_j and r_{jγ} = ||γ̂^{ab} - γ^{ab,*}||_j, for j ∈ {1, 2}, be the rates of estimation in steps 1 and 2. Under assumption SE, on the event E_θ = {||E_n[Γ(x_i) θ^{ab,*} + g(x_i)]||_∞ ≤ λ₁/2} we have that r_{1θ} ≤ c₁ m λ₁ / κ_min and r_{2θ} ≤ c₂ √m λ₁ / κ_min. Similarly, on the event E_γ = {||E_n[ξ_{1i} φ_{1,-ab}(x_i) + ξ_{2i} φ_{2,-ab}(x_i)]||_∞ ≤ λ₂/2} we have that r_{1γ} ≤ c₁ m λ₂ / κ_min and r_{2γ} ≤ c₂ √m λ₂ / κ_min, using results of [18]. Again, one needs to verify that the two events hold with high probability for the model at hand.
However, this is a routine calculation under suitable tail assumptions; see, for example, Lemma 9 in [31].

The following result establishes a Bahadur representation for θ̃_ab.

Theorem 1. Suppose that assumptions M and SE hold. Define w* with w*_ab = 1 and w*_{-ab} = -γ^{ab,*}, where γ^{ab,*} is given in assumption M. On the event E_γ ∩ E_θ, we have that

√n · (θ̃_ab - θ*_ab) = -κ̂_n^{-1} · √n E_n[w^{*,T}(Γ(x_i) θ^{ab,*} + g(x_i))] + O(κ_max² κ_min^{-4} · √n λ² m),   (10)

where λ = λ₁ ∨ λ₂ and κ̂_n = E_n[ξ_{1i} φ_{1,ab}(x_i) + ξ_{2i} φ_{2,ab}(x_i)].

Theorem 1 is deterministic in nature. It establishes a representation that holds on the event E_γ ∩ E_θ ∩ E_SE, which in many cases holds with overwhelming probability. We will show that under suitable conditions the first term converges to a normal distribution. The following is a regularity condition needed even in a low-dimensional setting for asymptotic normality [8].

Assumption R. E_{q^{ab}}[||Γ(X_a, X_b, x_{-ab}) θ^{ab,*}||₂²] and E_{q^{ab}}[||g(X_a, X_b, x_{-ab})||₂²] are finite for all values of x_{-ab} in the domain.

Theorem 1 and Lemma 9 together give the following corollary.

Corollary 2. Suppose that the conditions of Theorem 1 hold. In addition, suppose that assumption R holds, that (m log p)²/n = o(1), and that P(E_γ ∩ E_θ ∩ E_SE) → 1. Then √n(θ̃_ab - θ*_ab) →_D N(0, V), where V = (E[κ̂_n])^{-2} · Var(w^{*,T}(Γ(x_i) θ^{ab,*} + g(x_i))) and κ̂_n is as in Theorem 1.

We see that the variance V depends on the true θ^{ab,*} and γ^{ab,*}, which are unknown. In practice, we estimate V using the following consistent estimator V̂:

V̂ = e_ab^T E_n[Γ(x_i)]_{M̃}^{-1} E_n[(Γ(x_i) θ̃^{ab} + g(x_i))_{M̃} (Γ(x_i) θ̃^{ab} + g(x_i))_{M̃}^T] E_n[Γ(x_i)]_{M̃}^{-1} e_ab,

where e_ab is the canonical vector with a 1 in the position of element ab. Using this estimate, we can construct a confidence interval with asymptotically nominal coverage.
In particular,

lim_{n→∞} sup_{θ* ∈ Θ} | P_{θ*}( θ*_ab ∈ θ̃_ab ± z_{α/2} · (V̂/n)^{1/2} ) - (1 - α) | = 0.

In the next section, we outline the proof of Theorem 1. Proofs of the other technical results are relegated to the appendix.

4.1 Proof of Theorem 1

We first introduce some auxiliary estimates. Let γ̃^{ab} be a minimizer of the following constrained problem:

min_γ E_n[(φ_{1,ab}(x_i) - φ_{1,-ab}(x_i)^T γ)² + (φ_{2,ab}(x_i) - φ_{2,-ab}(x_i)^T γ)²]  s.t.  M(γ) ⊆ M̃,   (11)

where M̃ is defined in step 3 of the procedure. Essentially, γ̃^{ab} is the refitted estimator from step 2, constrained to have its support on M̃. Let w̃ ∈ R^{s₀} with w̃_ab = 1, w̃_{M̃} = -γ̃^{ab}_{M̃}, and zero elsewhere.

The solution θ̃^{ab} satisfies the first-order optimality condition

(E_n[Γ(x_i)] θ̃^{ab} + E_n[g(x_i)])_{M̃} = 0.

Multiplying by w̃, it follows that

0 = w̃^T (E_n[Γ(x_i)] θ̃^{ab} + E_n[g(x_i)])
  = (w̃ - w*)^T E_n[Γ(x_i)] (θ̃^{ab} - θ^{ab,*}) + (w̃ - w*)^T E_n[Γ(x_i) θ^{ab,*} + g(x_i)]
    + w^{*,T} E_n[Γ(x_i)] (θ̃^{ab} - θ^{ab,*}) + w^{*,T} E_n[Γ(x_i) θ^{ab,*} + g(x_i)]
  =: L₁ + L₂ + L₃ + L₄.   (12)

From Lemma 6 and Lemma 7, we have that |L₁ + L₂| ≤ C · κ_max² κ_min^{-4} · λ² m. Using Lemma 8, the term L₃ can be written as E_n[ξ_{1i} φ_{1,ab}(x_i) + ξ_{2i} φ_{2,ab}(x_i)] · (θ̃_ab - θ^{ab,*}_ab) + O(κ_min^{-1/2} · λ² m). Putting all the pieces together completes the proof.

5 Synthetic Datasets

In this section we illustrate finite-sample properties of our inference procedure on data simulated from three different exponential family distributions. The first two examples involve Gaussian node-conditional distributions, for which we use regularized score matching. For the third setting, where the node-conditional distributions follow an exponential distribution, we use the regularized non-negative score matching procedure.
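Concretely, once the point estimate θ̃_ab and the variance estimate V̂ of Corollary 2 are in hand, the reported intervals and edge-test p-values follow from the normal limit. A sketch (ours; the numeric inputs below are made up for illustration):

```python
from statistics import NormalDist

def edge_inference(theta_tilde, V_hat, n, alpha=0.05):
    """Normal-approximation CI for theta_ab and two-sided p-value for H0: theta_ab = 0."""
    se = (V_hat / n) ** 0.5
    z = NormalDist().inv_cdf(1 - alpha / 2)          # e.g. 1.96 for alpha = 0.05
    ci = (theta_tilde - z * se, theta_tilde + z * se)
    p_value = 2 * (1 - NormalDist().cdf(abs(theta_tilde) / se))
    return ci, p_value

# Hypothetical numbers: theta_tilde = 0.31, V_hat = 2.4, n = 300.
ci, p = edge_inference(theta_tilde=0.31, V_hat=2.4, n=300)
assert ci[0] < 0.31 < ci[1] and 0.0 <= p <= 1.0
```

This is the recipe behind the coverage rates reported in the tables below, and the p-value is the quantity thresholded at 0.01 in the protein-network analysis of Section 6.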
In each example, we report the mean coverage rate of 95% confidence intervals for several coefficients, averaged over 500 independent simulation runs.

Gaussian Graphical Model. We first consider the simplest case of a Gaussian graphical model. The data is generated according to X ∼ N(0, Σ). We denote the precision matrix (the inverse of the covariance matrix) by Ω = Σ^{-1} = (w_ab).

For the experiment, we set the diagonal entries of Ω as w_jj = 1, and we set the coefficients of the 4-nearest-neighbor lattice graph according to w_{j,j-1} = w_{j-1,j} = 0.5 and w_{j,j-2} = w_{j-2,j} = 0.3. We set the sample size n = 300. Table 1 shows the empirical coverage rate for different choices of the number of nodes p for four chosen coefficients. As is evident, our inference procedure performs remarkably well for the Gaussian graphical model studied.

Table 1: Empirical coverage for the Gaussian graphical model

          w_{1,2}   w_{1,3}   w_{1,4}   w_{1,10}
p = 50    95.4%    92.4%    93.8%    93.2%
p = 200   94.6%    92.4%    92.6%    94.0%
p = 400   94.6%    94.8%    92.6%    93.8%

Normal Conditionals. Our second synthetic dataset is sampled from the following exponential family distribution:

q(x | B, b, b^{(2)}) ∝ exp{ Σ_{j≠k} β_{jk} x_j² x_k² + Σ_{j=1}^p β_j x_j + Σ_{j=1}^p β_j^{(2)} x_j² },

where b = (β₁, ..., β_p) and b^{(2)} = (β₁^{(2)}, ..., β_p^{(2)}) are p-dimensional vectors, and B = {β_{jk}} is a symmetric interaction matrix with diagonal entries set to 0. The above distribution is a special case of a class of exponential family distributions with normal conditionals, and its densities need not be unimodal [1]. This family is intriguing from the perspective of graphical modeling as, in contrast to the Gaussian case, conditional dependence may also express itself in the variances.

For our experiment we set β_j = 0.4 and β_j^{(2)} = -2, and we use a 4-nearest-neighbor lattice dependence graph with interaction matrix β_{j,j-1} = β_{j-1,j} = -0.2 and β_{j,j-2} = β_{j-2,j} = -0.2. Since the univariate node-conditional distributions are all Gaussian, we generate the data by Gibbs sampling. The first 500 samples were discarded as a 'burn-in' step, and of the remaining samples we keep one in three. We set the number of samples n = 500. Table 2 shows the empirical coverage rate for p = 100 and p = 300 nodes. Again, we see that our inference algorithm behaves well on the above normal conditionals model.

Table 2: Empirical coverage for normal conditionals

          β_{1,2}   β_{1,3}   β_{1,4}   β_{1,10}
p = 100   93.2%    93.4%    94.6%    95.0%
p = 300   93.2%    93.0%    92.6%    93.0%

Exponential Graphical Model. Our final synthetic example illustrates non-negative score matching for the exponential graphical model. Here the node-conditional distributions obey an exponential distribution, and therefore the variables take only non-negative values. Such exponential distributions are typically used for data describing inter-arrival times between events, among other applications. The density function is given by

q(x | θ) ∝ exp{ -Σ_{j=1}^p θ_j x_j - Σ_{j≠k} θ_{jk} x_j x_k }.

To ensure that the distribution is valid and normalizable, we require θ_j > 0 and θ_{jk} ≥ 0. Therefore, we can only model negative dependencies via the exponential graphical model. For the experiment we choose θ_j = 2, and a 2-nearest-neighbor dependence graph with θ_{j,j-1} = θ_{j-1,j} = 0.3.

Table 3: Empirical coverage for the exponential graphical model

          θ_{1,2}   θ_{1,3}   θ_{1,4}   θ_{1,10}
p = 100   92.0%    90.0%    90.0%    92.4%
p = 300   92.6%    92.0%    92.2%    92.4%

We set
We set\nn = 1000 and again use Gibbs sampling to generate data. The empirical coverage rate and histograms\nof estimates of four selected coef\ufb01cients are presented in Table 3 and Figures 1 for p = 100 and\np = 300, respectively.\nWe should point out that, in general, non-negative score matching is harder than regular score\nmatching. For example, as shown in [15], to recover the structure from a regular Gaussian distribution\n\ny\nt\ni\ns\nn\ne\nD\n\ny\nt\ni\ns\nn\ne\nD\n\n5\n\n4\n\n3\n\n2\n\n1\n\n0\n\n5\n\n4\n\n3\n\n2\n\n1\n\n0\n\n6\n\n5\n\n4\n\n3\n\n2\n\n1\n\ny\nt\ni\ns\nn\ne\nD\n\n6\n\n5\n\n4\n\n3\n\n2\n\n1\n\ny\nt\ni\ns\nn\ne\nD\n\n6\n\n5\n\n4\n\n3\n\n2\n\n1\n\ny\nt\ni\ns\nn\ne\nD\n\n0\n\n0.2\n\n\u03b8\n\n1,2\n\n0.4\n\n0.6\n\n0\n-0.4\n\n-0.2\n\n\u03b8\n\n0\n\n1,3\n\n0.2\n\n0.4\n\n0\n-0.4\n\n-0.2\n\n\u03b8\n\n0\n\n1,4\n\n0.2\n\n0.4\n\n0\n-0.4\n\n-0.2\n\n\u03b8\n\n0\n\n0.2\n\n0.4\n\n1,10\n\n6\n\n5\n\n4\n\n3\n\n2\n\n1\n\ny\nt\ni\ns\nn\ne\nD\n\n6\n\n5\n\n4\n\n3\n\n2\n\n1\n\ny\nt\ni\ns\nn\ne\nD\n\n6\n\n5\n\n4\n\n3\n\n2\n\n1\n\ny\nt\ni\ns\nn\ne\nD\n\n0\n\n0.2\n\n\u03b8\n\n1,2\n\n0.4\n\n0.6\n\n0\n-0.3 -0.2 -0.1\n\n\u03b8\n\n0\n\n0.1\n\n0.2\n\n1,3\n\n0\n-0.3 -0.2 -0.1\n\n\u03b8\n\n0\n\n0.1\n\n0.2\n\n1,4\n\n0\n-0.4\n\n-0.2\n\n\u03b8\n\n0\n\n0.2\n\n0.4\n\n1,10\n\nFigure 1: Histograms for \u2713: the \ufb01rst row is for p = 100 and the second row is for p = 300\n\n7\n\n\fwith high probability, a sample size about O(m2 log p) suf\ufb01ces, while to recover from non-negative\nGaussian distribution, we need O(m2(log p)8), which is signi\ufb01cantly larger. Therefore, we expect\nthat con\ufb01dence intervals for non-negative score matching would require more samples to give accurate\ninference. We can see this from Table 3, where the empirical coverage rate tends to be about 92%,\nrather than the designed 95% \u2013 still impressive for the not so large sample size. 
The histograms in Figure 1 show that the fit is quite good, but to get better estimates, and hence better coverage, we would need more samples.

6 Protein Signaling Dataset

In this section we apply our algorithm to a protein signaling flow cytometry dataset. The dataset contains the presence of p = 11 proteins in n = 7466 cells. It was first analyzed using Bayesian networks in [25], who fit a directed acyclic graph to the data, while [31] fit their proposed M-estimators for exponential and Gaussian graphical models to the data set.

Figure 2 shows the network structure after applying our method to the data using an exponential graphical model. Since the data is non-negative and skewed, it can also be analyzed after a log transformation, as was done by [31] for fitting a Gaussian graphical model. We instead learn the structure directly from the data without such a transformation. To infer the network structure, we calculate the p-value for each pair of nodes and keep the edges with p-values smaller than 0.01. Estimated negative conditional dependencies are shown via red edges in the figure. Recall that the exponential graphical model restricts the edge weights to be non-negative, hence only negative dependencies can be estimated. From the figure we see that PKA is a major protein inhibitor in cell signaling networks. This result is consistent with the estimated graph structure in [31], as well as with the Bayesian network of [25]. In addition, we find a significant dependency between PKC and PIP3.

Figure 2: Estimated structure of the protein signaling dataset (nodes: Raf, Mek, Plcg, PIP2, PIP3, Erk, Akt, PKA, PKC, P38, Jnk).

7 Conclusion

Driven by applications in biology and social networks, there has been a surge in statistical learning models and methods for networks with a large number of nodes.
Graphical models provide a very flexible modeling framework for such networks, leading to much work on estimation and inference algorithms for Gaussian graphical models and, more generally, for graphical models with node-conditional densities lying in an exponential family, in the high-dimensional setting. Most of this work is based on regularized likelihood minimization, which has the disadvantage of being computationally intractable when the normalization constant (partition function) of the conditional densities is not available in closed form. Score matching estimators provide a way around this issue, but so far there has been no work that provides inference guarantees for score matching based estimators for high-dimensional graphical models. In this paper we fill this gap for the case where score matching is used to estimate the parameter corresponding to a single edge at a time. An interesting future extension would be to perform inference on the entire model instead of one edge at a time as in the current paper. Another extension would be to extend our results to discrete-valued data.

References

[1] B. C. Arnold, E. Castillo, and J. M. Sarabia. Conditional specification of statistical models. Springer Series in Statistics. Springer-Verlag, New York, 1999. ISBN 0-387-98761-4.
[2] R. F. Barber and M. Kolar. ROCKET: Robust confidence intervals via Kendall's tau for transelliptical graphical models. ArXiv e-prints, arXiv:1502.07641, Feb. 2015.
[3] A. Belloni and V. Chernozhukov. Least squares after model selection in high-dimensional sparse models. Bernoulli, 19(2):521-547, May 2013.
[4] A. Belloni, V. Chernozhukov, and C. B. Hansen. Inference on treatment effects after selection amongst high-dimensional controls. Rev. Econ. Stud., 81(2):608-650, Nov 2013.
[5] M. Chen, Z. Ren, H. Zhao, and H. H. Zhou. Asymptotically normal and efficient estimation of covariate-
Asymptotically normal and efficient estimation of covariate-adjusted Gaussian graphical model. Journal of the American Statistical Association, 2015.

[6] S. Chen, D. M. Witten, and A. Shojaie. Selection and estimation for mixed graphical models. ArXiv e-prints, arXiv:1311.0085, Nov. 2013.

[7] M. Drton and M. H. Maathuis. Structure learning in graphical modeling. To appear in Annual Review of Statistics and Its Application, 3, 2016.

[8] P. G. M. Forbes and S. L. Lauritzen. Linear estimating equations for exponential families with application to Gaussian linear concentration models. Linear Algebra Appl., 473:261–283, 2015.

[9] A. Hyvärinen. Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res., 6:695–709, 2005.

[10] A. Hyvärinen. Some extensions of score matching. Comput. Stat. Data Anal., 51(5):2499–2512, 2007.

[11] J. Jankova and S. A. van de Geer. Confidence intervals for high-dimensional inverse covariance estimation. ArXiv e-prints, arXiv:1403.6752, Mar. 2014.

[12] A. Javanmard and A. Montanari. Confidence intervals and hypothesis testing for high-dimensional regression. J. Mach. Learn. Res., 15(Oct):2869–2909, 2014.

[13] J. D. Lee and T. J. Hastie. Learning the structure of mixed graphical models. J. Comput. Graph. Statist., 24(1):230–253, 2015.

[14] H. Leeb and B. M. Pötscher. Can one estimate the unconditional distribution of post-model-selection estimators? Econ. Theory, 24(02):338–376, Nov 2007.

[15] L. Lin, M. Drton, and A. Shojaie. Estimation of high-dimensional graphical models using regularized score matching. ArXiv e-prints, arXiv:1507.00433, July 2015.

[16] W. Liu. Gaussian graphical model estimation with false discovery rate control. Ann. Stat., 41(6):2948–2978, 2013.

[17] W. Liu and X. Luo. Fast and adaptive sparse precision matrix estimation in high dimensions. J. Multivar. Anal., 135:153–162, 2015.

[18] S. N. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Stat. Sci., 27(4):538–557, 2012.

[19] J. Neyman. Optimal asymptotic tests of composite statistical hypotheses. Probability and Statistics, 57:213, 1959.

[20] M. Parry, A. P. Dawid, and S. L. Lauritzen. Proper local scoring rules. Ann. Stat., 40(1):561–592, Feb 2012.

[21] S. L. Portnoy. Asymptotic behavior of likelihood methods for exponential families when the number of parameters tends to infinity. Ann. Stat., 16(1):356–366, 1988.

[22] B. M. Pötscher. Confidence sets based on sparse estimators are necessarily large. Sankhyā, 71(1, Ser. A):1–18, 2009.

[23] Z. Ren, T. Sun, C.-H. Zhang, and H. H. Zhou. Asymptotic normality and optimalities in estimation of large Gaussian graphical models. Ann. Stat., 43(3):991–1026, 2015.

[24] M. Rudelson and S. Zhou. Reconstruction from anisotropic random measurements. 2011.

[25] K. Sachs, O. Perez, D. Pe'er, D. A. Lauffenburger, and G. P. Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721):523–529, 2005.

[26] B. Sriperumbudur, K. Fukumizu, A. Gretton, and A. Hyvärinen. Density estimation in infinite dimensional exponential families. ArXiv e-prints, arXiv:1312.3516, Dec. 2013.

[27] S. Sun, M. Kolar, and J. Xu. Learning structured densities via infinite dimensional exponential families. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2287–2295. Curran Associates, Inc., 2015.

[28] S. A. van de Geer, P. Bühlmann, Y. Ritov, and R. Dezeure. On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Stat., 42(3):1166–1202, Jun 2014.

[29] J. Wang and M. Kolar. Inference for high-dimensional exponential family graphical models. In A. Gretton and C. C. Robert, editors, Proc. of AISTATS, volume 51, pages 751–760, 2016.

[30] E. Yang, Y. Baker, P. Ravikumar, G. I. Allen, and Z. Liu. Mixed graphical models via exponential families. In Proc. 17th Int. Conf. Artif. Intel. Stat., pages 1042–1050, 2014.

[31] E. Yang, P. Ravikumar, G. I. Allen, and Z. Liu. On graphical models via univariate exponential family distributions. J. Mach. Learn. Res., 16:3813–3847, 2015.

[32] C.-H. Zhang and S. S. Zhang. Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. B, 76(1):217–242, Jul 2013.