{"title": "Learning structured densities via infinite dimensional exponential families", "book": "Advances in Neural Information Processing Systems", "page_first": 2287, "page_last": 2295, "abstract": "Learning the structure of probabilistic graphical models is a well-studied problem in the machine learning community due to its importance in many applications. Current approaches are mainly focused on learning the structure under restrictive parametric assumptions, which limits the applicability of these methods. In this paper, we study the problem of estimating the structure of a probabilistic graphical model without assuming a particular parametric model. We consider probabilities that are members of an infinite dimensional exponential family, which is parametrized by a reproducing kernel Hilbert space (RKHS) H and its kernel $k$. One difficulty in learning nonparametric densities is the evaluation of the normalizing constant. In order to avoid this issue, our procedure minimizes the penalized score matching objective. We show how to efficiently minimize the proposed objective using existing group lasso solvers. Furthermore, we prove that our procedure recovers the graph structure with high probability under mild conditions. Simulation studies illustrate the ability of our procedure to recover the true graph structure without knowledge of the data generating process.", "full_text": "Learning Structured Densities via Infinite Dimensional Exponential Families\n\nSiqi Sun\nTTI Chicago\nsiqi.sun@ttic.edu\n\nMladen Kolar\nUniversity of Chicago\nmkolar@chicagobooth.edu\n\nJinbo Xu\nTTI Chicago\njinbo.xu@gmail.com\n\nAbstract\n\nLearning the structure of probabilistic graphical models is a well-studied problem in the machine learning community due to its importance in many applications. 
Current approaches are mainly focused on learning the structure under restrictive parametric assumptions, which limits the applicability of these methods. In this paper, we study the problem of estimating the structure of a probabilistic graphical model without assuming a particular parametric model. We consider probabilities that are members of an infinite dimensional exponential family [4], which is parametrized by a reproducing kernel Hilbert space (RKHS) H and its kernel k. One difficulty in learning nonparametric densities is the evaluation of the normalizing constant. In order to avoid this issue, our procedure minimizes the penalized score matching objective [10, 11]. We show how to efficiently minimize the proposed objective using existing group lasso solvers. Furthermore, we prove that our procedure recovers the graph structure with high probability under mild conditions. Simulation studies illustrate the ability of our procedure to recover the true graph structure without knowledge of the data generating process.\n\n1 Introduction\n\nUndirected graphical models, or Markov random fields [13], have been extensively studied and applied in fields ranging from computational biology [15, 28] to natural language processing [16, 20] and computer vision [9, 17]. In an undirected graphical model, conditional independence assumptions underlying a probability distribution are encoded in the graph structure. Furthermore, the joint probability density function can be factorized according to the cliques of the graph [14]. One of the fundamental problems in the literature is learning the structure of a graphical model given an i.i.d. sample from an unknown distribution. A lot of work has been done under specific parametric assumptions on the unknown distribution. For example, in Gaussian graphical models the structure of the graph is encoded by the sparsity pattern of the precision matrix [6, 30]. 
Similarly, in the context of exponential family graphical models, where the node conditional distribution given all the other nodes is a member of an exponential family, the structure is described by the non-zero coefficients [29]. Most existing approaches to learning the structure of a high-dimensional undirected graphical model are based on minimizing a penalized loss objective, where the loss is usually a log-likelihood or a composite likelihood and the penalty induces sparsity on the resulting parameter vector [see, for example, 6, 12, 18, 22, 24, 29, 30]. In addition to sparsity inducing penalties, methods that use other structural constraints have been proposed. For example, since many real-world networks are scale-free [1], several algorithms are designed specifically to learn the structure of such networks [5, 19]. Graphs also tend to have cluster structure, and learning the structure and the cluster assignment simultaneously has been investigated [2, 27].\n\nIn this paper, we focus on learning the structure of pairwise graphical models without assuming a parametric class of models. The main challenge in estimating nonparametric graphical models is computation of the log normalizing constant. To get around this problem, we propose to use score matching [10, 11] as a divergence, instead of the usual KL divergence, as it does not require evaluation of the log partition function. The probability density function is estimated by minimizing the expected distance between the model score function and the data score function, where the score function is defined as the gradient of the logarithm of the corresponding probability density function. The advantage of this measure is that the normalization constant cancels out when computing the distance. 
In order to learn the underlying graph structure, we assume that the logarithm of the density is additive in node-wise and edge-wise potentials and use a sparsity inducing penalty to select non-zero edge potentials. As we will prove later, our procedure allows us to consistently estimate the underlying graph structure.\n\nThe rest of the paper is organized as follows. We first introduce the notation, background and related work. Then we formulate our model, establish a representer theorem and present a group lasso algorithm to optimize the objective. Next we prove that our estimator is consistent by showing that it can recover the true graph with high probability given a sufficient number of samples. Finally, results on simulated data are presented to demonstrate the correctness of our algorithm empirically.\n\n1.1 Notations\n\nLet $[n]$ denote the set $\\{1, 2, \\ldots, n\\}$. For a vector $\\theta = (\\theta_1, \\ldots, \\theta_d)^T \\in \\mathbb{R}^d$, let $\\|\\theta\\|_p = (\\sum_{i \\in [d]} |\\theta_i|^p)^{1/p}$ denote its $l_p$ norm. Let the column vector $\\mathrm{vec}(D)$ denote the vectorization of matrix $D$, $\\mathrm{cat}(a, b)$ the concatenation of two vectors $a$ and $b$, and $\\mathrm{mat}(a_1^T, \\ldots, a_d^T)$ the matrix with rows given by $a_1^T, \\ldots, a_d^T$. For $\\chi \\subseteq \\mathbb{R}^d$, let $L^p(\\chi, p_0)$ denote the space of functions for which the $p$-th power of the absolute value is $p_0$ integrable; and for $f \\in L^p(\\chi, p_0)$, let $\\|f\\|_{L^p(\\chi, p_0)} = \\|f\\|_p = (\\int_\\chi |f|^p dx)^{1/p}$ denote its $L^p$ norm. Throughout the paper, we denote by $H$ (or $H_i$, $H_{ij}$) a Hilbert space and by $\\langle \\cdot, \\cdot \\rangle_H$, $\\|\\cdot\\|_H$ the corresponding inner product and norm. For any operator $C : H_1 \\to H_2$, we use $\\|C\\|$ to denote the usual operator norm, which is defined as\n\n$$\\|C\\| = \\inf\\{a \\geq 0 : \\|Cf\\|_{H_2} \\leq a \\|f\\|_{H_1} \\text{ for all } f \\in H_1\\};$$\n\nand $\\|C\\|_{HS}$ to denote its Hilbert-Schmidt norm, which is defined as\n\n$$\\|C\\|_{HS}^2 = \\sum_{i \\in I} \\|C e_i\\|_{H_2}^2,$$\n\nwhere $\\{e_i\\}_{i \\in I}$ is an orthonormal basis of $H_1$ for an index set $I$. Also, we use $R(C)$ to denote the range space of operator $C$. For any $f \\in H_1$ and $g \\in H_2$, let $f \\otimes g$ denote their tensor product.\n\n2 Background & Related Work\n\n2.1 Learning graphical models in exponential families\n\nLet $x = (x_1, x_2, \\ldots, x_d)$ be a $d$-dimensional random vector from a multivariate Gaussian distribution. It is well known that the conditional independence of two variables given all the others is encoded in the zero pattern of its precision matrix $\\Omega$, that is, $x_i$ and $x_j$ are conditionally independent given $x_{-ij}$ if and only if $\\Omega_{ij} = 0$, where $x_{-ij}$ is the vector $x$ without $x_i$ and $x_j$. A sparse estimate of $\\Omega$ can be obtained by maximum-likelihood (joint selection) or pseudo-likelihood (neighborhood selection) optimization with an added $l_1$ penalty [6, 22, 30]. 
Given $n$ independent realizations of $x$ (rows of $X \\in \\mathbb{R}^{n \\times d}$), the penalized maximum-likelihood estimate of the precision matrix can be obtained as\n\n$$\\hat{\\Omega}_\\lambda = \\arg\\min_{\\Omega \\succ 0} \\; \\mathrm{tr}(\\hat{S}\\Omega) - \\log\\det\\Omega + \\lambda\\|\\Omega\\|_1, \\qquad (1)$$\n\nwhere $\\hat{S} = n^{-1} X^T X$ and $\\lambda$ controls the sparsity level of the estimated graph.\n\nThe pseudo-likelihood method estimates the neighborhood of a node $s$ by the non-zeros of the solution to a regularized linear model\n\n$$\\hat{\\theta}_s = \\arg\\min_\\theta \\; \\frac{1}{n}\\|X_s - X_{-s}\\theta\\|_2^2 + \\lambda\\|\\theta\\|_1. \\qquad (2)$$\n\nThe estimated neighborhood is then $\\hat{N}(s) = \\{a : \\hat{\\theta}_{sa} \\neq 0\\}$.\n\nAnother way to specify a parametric graphical model is by assuming that each node-conditional distribution is a member of an exponential family [29]. Specifically, the conditional distribution of $x_s$ given $x_{-s}$ is assumed to be\n\n$$P(x_s | x_{-s}) = \\exp\\Big(\\sum_{t \\in N(s)} \\theta_{st} x_s x_t + C(x_s) - D(x_{-s}, \\theta)\\Big), \\qquad (3)$$\n\nwhere $C$ is the base measure, $D$ is the log-normalization constant and $N(s)$ is the neighborhood of the node $s$. Similar to (2), the neighborhood of each node can be estimated by minimizing the negative log-likelihood with an $l_1$ penalty on $\\theta$. The optimization is tractable when the normalization constant $D$ can be easily computed based on the model assumption. For example, under Poisson graphical model assumptions for count data, the normalization constant is $-\\exp(\\sum_{t \\in N(s)} \\theta_{st} x_t)$. 
When using the neighborhood estimation, the graph can be estimated as the union of the neighborhoods of each node, which leads to consistent graph estimation [22, 29].\n\n2.2 Generalized Exponential Family and RKHS\n\nWe say $H$ is a RKHS associated with kernel $k : \\chi \\times \\chi \\to \\mathbb{R}_+$ if and only if for each $x \\in \\chi$, the following two conditions are satisfied: (1) $k(\\cdot, x) \\in H$ and (2) it has the reproducing property $f(x) = \\langle f, k(\\cdot, x) \\rangle_H$ for all $f(\\cdot) \\in H$, where $k$ is a symmetric and positive semidefinite function. Denote the RKHS $H$ with kernel $k$ as $H(k)$.\n\nFor any $f \\in H(k)$, there exists a set of $x_i$ and $\\alpha_i$ such that $f(\\cdot) = \\sum_{i=1}^\\infty \\alpha_i k(\\cdot, x_i)$. Similarly, for any $g \\in H(k)$ with $g(\\cdot) = \\sum_{j=1}^\\infty \\beta_j k(\\cdot, y_j)$, the inner product of $f$ and $g$ is defined as $\\langle f, g \\rangle_H = \\sum_{i,j=1}^\\infty \\alpha_i \\beta_j k(x_i, y_j)$. Therefore the norm of $f$ is simply $\\|f\\|_H = \\sqrt{\\sum_{i,j} \\alpha_i \\alpha_j k(x_i, x_j)}$. The summation is guaranteed to be larger than or equal to zero because the kernel $k$ is positive semidefinite.\n\nWe consider the exponential family in infinite dimensions [4], where\n\n$$P = \\{p_f(x) = e^{f(x) - A(f)} q_0(x), \\; x \\in \\chi; \\; f \\in F\\}$$\n\nand the function space $F$ is defined as\n\n$$F = \\{f \\in H(k) : A(f) = \\log \\int_\\chi e^{f(x)} q_0(x) dx < \\infty\\},$$\n\nwhere $q_0(x)$ is the base measure, $A(f)$ is a generalized normalization constant such that $p_f(x)$ is a valid probability density function, and $H$ is a RKHS [3] associated with kernel $k$. To see it as a generalization of the exponential family, we show some examples that generate useful finite dimensional exponential families:\n\n\u2022 Normal: $\\chi = \\mathbb{R}$, $k(x, y) = xy + x^2 y^2$\n\u2022 Poisson: $\\chi = \\mathbb{N} \\cup \\{0\\}$, $k(x, y) = xy$\n\u2022 Exponential: $\\chi = \\mathbb{R}_+$, $k(x, y) = xy$.\n\nFor more detailed information, please refer to [4].\n\nWhen learning the structure of a graphical model, we will further impose structural conditions on $H(k)$ in order to ensure that $F$ consists of additive functions.\n\n2.3 Score Matching\n\nScore matching is a convenient procedure that allows for estimating a probability density without computing the normalizing constant [10, 11]. It is based on minimizing the Fisher divergence\n\n$$J(p \\| p_0) = \\frac{1}{2} \\int p(x) \\left\\| \\frac{\\partial \\log p(x)}{\\partial x} - \\frac{\\partial \\log p_0(x)}{\\partial x} \\right\\|_2^2 dx, \\qquad (4)$$\n\nwhere $\\frac{\\partial \\log p(x)}{\\partial x} = (\\frac{\\partial \\log p(x)}{\\partial x_1}, \\ldots, \\frac{\\partial \\log p(x)}{\\partial x_d})$ is the score function. Observe that for $p(x, \\theta) = \\frac{1}{Z(\\theta)} q(x, \\theta)$ the normalization constant $Z(\\theta)$ cancels out in the gradient computation, which makes the divergence independent of $Z(\\theta)$. Since the score matching objective involves the unknown oracle probability density function $p_0$, it is typically not computable. 
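
The cancellation of the normalization constant noted above can be checked numerically. The following is a minimal sketch rather than code from the paper: the unnormalized log-density log_q and the helper names score and make_normalized are hypothetical choices made only for illustration, and the score is approximated by a central finite difference.

```python
import math

def log_q(x):
    # Unnormalized log-density (illustrative choice, not from the paper).
    return -x**4 / 4.0 - x**2 / 2.0

def score(log_density, x, h=1e-5):
    # Central finite-difference approximation of d/dx log p(x).
    return (log_density(x + h) - log_density(x - h)) / (2.0 * h)

def make_normalized(log_unnormalized, z):
    # Dividing the density by z subtracts log(z) from the log-density.
    return lambda x: log_unnormalized(x) - math.log(z)

if __name__ == '__main__':
    # The printed score is the same for every value of z.
    for z in (1.0, 3.7, 1e6):
        print(z, score(make_normalized(log_q, z), 0.8))
```

Whatever constant z is used, the score at a given point is unchanged, which is why a score-based divergence never requires the value of the normalizer.
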
However, under some mild conditions, which we discuss in Section 3, (4) can be rewritten as\n\n$$J(p \\| p_0) = \\int p_0(x) \\sum_{i \\in [d]} \\left[ \\frac{1}{2} \\left( \\frac{\\partial \\log p(x)}{\\partial x_i} \\right)^2 + \\frac{\\partial^2 \\log p(x)}{\\partial x_i^2} \\right] dx. \\qquad (5)$$\n\nAfter substituting the expectation with an empirical average, we get\n\n$$\\hat{J}(p \\| p_0) = \\frac{1}{n} \\sum_{a \\in [n]} \\sum_{i \\in [d]} \\left[ \\frac{1}{2} \\left( \\frac{\\partial \\log p(X_a)}{\\partial x_i} \\right)^2 + \\frac{\\partial^2 \\log p(X_a)}{\\partial x_i^2} \\right]. \\qquad (6)$$\n\nCompared to maximum likelihood estimation, minimizing $\\hat{J}(p \\| p_0)$ is computationally tractable. While we will be able to estimate $p_0$ only up to a scale factor, this will be sufficient for the purpose of graph structure estimation.\n\n3 Methods\n\n3.1 Model Formulation and Assumptions\n\nWe assume that the true probability density function $p_0$ is in $P$. Furthermore, for simplicity we assume that\n\n$$\\log p_0(x) = f(x) = \\sum_{i \\leq j, \\, (i,j) \\in S} f_{0,ij}(x_i, x_j),$$\n\nwhere $f_{0,ii}(x_i, x_i)$ is a node potential and $f_{0,ij}(x_i, x_j)$ is an edge potential. The set $S$ denotes the edge set of the graph. Extensions to models where potentials are defined over larger cliques are possible. We further assume that $f_{0,ij} \\in H_{ij}(k_{ij})$, where $H_{ij}$ is a RKHS with kernel $k_{ij}$. To simplify the notation, we use $f_{0,ij}(x)$ or $k_{ij}(\\cdot, x)$ to denote $f_{0,ij}(x_i, x_j)$ and $k_{ij}(\\cdot, (x_i, x_j))$. If the context is clear, we drop the subscript for the norm or inner product. Define\n\n$$H(S) = \\{f = \\sum_{(i,j) \\in S} f_{ij} \\; | \\; f_{ij} \\in H_{ij}\\} \\qquad (7)$$\n\nas the set of functions that decompose as a sum of bivariate functions on the edge set $S$. Note that $H(S)$ is also (a subset of) a RKHS with the norm $\\|f\\|^2_{H(S)} = \\sum_{(i,j) \\in S} \\|f_{ij}\\|^2_{H_{ij}}$ and kernel $k = \\sum_{(i,j) \\in S} k_{ij}$.\n\nLet $\\Omega(f) = \\|f\\|_{H,1} = \\sum_{i \\leq j} \\|f_{ij}\\|_{H_{ij}}$. For any edge set $S$ (not necessarily the true edge set), we denote $\\Omega_S(f_S) = \\sum_{s \\in S} \\|f_s\\|_{H_s}$ as the norm $\\Omega$ reduced to $S$. Similarly, denote its dual norm as $\\Omega^*_S[f_S] = \\max_{\\Omega_S(g_S) \\leq 1} \\langle f_S, g_S \\rangle$ [25].\n\nUnder the assumption that the unknown $f_0$ is additive, the loss function becomes\n\n$$J(f) = \\frac{1}{2} \\int p_0(x) \\sum_{i \\in [d]} \\left( \\frac{\\partial f(x)}{\\partial x_i} - \\frac{\\partial f_0(x)}{\\partial x_i} \\right)^2 dx$$\n\n$$= \\frac{1}{2} \\sum_{i \\in [d]} \\sum_{j, j' \\in [d]} \\left\\langle f_{ij} - f_{0,ij}, \\int p_0(x) \\frac{\\partial k_{ij}(\\cdot, (x_i, x_j))}{\\partial x_i} \\otimes \\frac{\\partial k_{ij'}(\\cdot, (x_i, x_{j'}))}{\\partial x_i} dx \\, (f_{ij'} - f_{0,ij'}) \\right\\rangle$$\n\n$$= \\frac{1}{2} \\sum_{i \\in [d]} \\sum_{j, j' \\in [d]} \\langle f_{ij} - f_{0,ij}, C_{ij,ij'} (f_{ij'} - f_{0,ij'}) \\rangle.$$\n\nIntuitively, $C$ can be viewed as a $d^2$ matrix, and the operator at position $(ij, ij')$ is $C_{ij,ij'}$. For general $(ij, i'j')$ with $i \\neq i'$, the corresponding operator is simply 0. Define $C_{SS'}$ as\n\n$$\\sum_{(i,j) \\in S, (i',j') \\in S'} \\int p_0(x) \\frac{\\partial k_{ij}(\\cdot, (x_i, x_j))}{\\partial x_i} \\otimes \\frac{\\partial k_{i'j'}(\\cdot, (x_{i'}, x_{j'}))}{\\partial x_i} dx,$$\n\nwhich intuitively can be treated as a sub matrix of $C$ with rows $S$ and columns $S'$. We will use this notation intensively in the main theorem and its proof.\n\nFollowing [26], we make the following assumptions.\n\nA1. Each $k_{ij}$ is twice differentiable on $\\chi \\times \\chi$.\n\nA2. For any $i$ and $\\tilde{x}_j \\in \\chi_j = [a_j, b_j]$, we assume that\n\n$$\\lim_{x_i \\to a_i^+ \\text{ or } b_i^-} \\frac{\\partial^2 k_{ij}(x, y)}{\\partial x_i \\partial y_i} \\Big|_{y=x} p_0^2(x) = 0,$$\n\nwhere $x = (x_i, \\tilde{x}_j)$ and $a_i, b_i$ could be $-\\infty$ or $\\infty$.\n\nA3. This condition ensures that $J(p \\| p_0) < \\infty$ for any $p \\in P$ [for more details see 26]:\n\n$$\\left\\| \\frac{\\partial k_{ij}(\\cdot, x)}{\\partial x_i} \\right\\|_{H_{ij}} \\in L^2(\\chi, p_0), \\quad \\left\\| \\frac{\\partial^2 k_{ij}(\\cdot, x)}{\\partial x_i^2} \\right\\|_{H_{ij}} \\in L^2(\\chi, p_0).$$\n\nA4. The operator $C_{SS}$ is compact and the smallest eigenvalue $\\omega_{\\min} = \\lambda_{\\min}(C_{SS}) > 0$.\n\nA5. $\\Omega^*_{S^c}[C_{S^c S} C_{SS}^{-1}] \\leq 1 - \\eta$, where $\\eta > 0$.\n\nA6. $f_0 \\in R(C)$, which means there exists $\\gamma \\in H$ such that $f_0 = C\\gamma$. $f_0$ is the oracle function.\n\nWe will discuss the definition of operator $C$ and $\\Omega^*$ in Section 4. Compared with [29], A4 can be interpreted as the dependency condition and A5 is the incoherence condition, which is a standard condition for structure learning in high dimensional statistical estimators.\n\n3.2 Estimation Procedure\n\nWe estimate $f$ by minimizing the following penalized score matching objective\n\n$$\\min_f \\; \\hat{L}_\\mu(f) = \\hat{J}(f) + \\frac{\\mu}{2} \\|f\\|_{H,1} \\quad \\text{s.t. } f_{ij} \\in H_{ij}, \\qquad (8)$$\n\nwhere $\\hat{J}(f)$ is given in (6). The norm $\\|f\\|_{H,1} = \\sum_{i \\leq j} \\|f_{ij}\\|_{H_{ij}}$ is used as a sparsity inducing penalty. 
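
To see why a sum of group norms induces sparsity at the level of whole groups, it can help to look at the proximal operator of a single group norm. The sketch below is an illustration under simplifying assumptions, not the solver used in the paper: each group is a plain Euclidean vector standing in for one potential, and the function name group_soft_threshold is hypothetical.

```python
import math

def group_soft_threshold(v, t):
    # Proximal operator of t * ||v||_2 (block soft-thresholding):
    # the whole group is set to zero when its norm is at most t,
    # otherwise it is shrunk toward zero by a common factor.
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= t:
        return [0.0] * len(v)
    scale = 1.0 - t / norm
    return [scale * x for x in v]

if __name__ == '__main__':
    print(group_soft_threshold([0.3, 0.4], 1.0))  # weak group is removed
    print(group_soft_threshold([3.0, 4.0], 1.0))  # strong group is shrunk
```

Groups with a small norm are removed entirely, which is the mechanism by which an edge potential can be estimated as exactly zero.
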
A simplified form of $\\hat{J}(f)$ is given below that will lead to an efficient algorithm for solving (8).\n\nThe following theorem states that the score matching objective can be written as a penalized quadratic function of $f$.\n\nTheorem 3.1 (i) The score matching objective can be represented as\n\n$$L_\\mu(f) = \\frac{1}{2} \\langle f - f_0, C(f - f_0) \\rangle + \\frac{\\mu}{2} \\|f\\|_{H,1}, \\qquad (9)$$\n\nwhere $C = \\int p_0(x) \\sum_{i \\in [d]} \\frac{\\partial k(\\cdot, x)}{\\partial x_i} \\otimes \\frac{\\partial k(\\cdot, x)}{\\partial x_i} dx$ is a trace operator.\n\n(ii) Given observed data $X_{n \\times d}$, the empirical estimation of $L_\\mu$ is\n\n$$\\hat{L}_\\mu(f) = \\frac{1}{2} \\langle f, \\hat{C} f \\rangle + \\sum_{i \\leq j} \\langle f_{ij}, -\\hat{\\xi}_{ij} \\rangle + \\frac{\\mu}{2} \\|f\\|_{H,1} + \\text{const}, \\qquad (10)$$\n\nwhere $\\hat{C} = \\frac{1}{n} \\sum_{a \\in [n]} \\sum_{i \\in [d]} \\frac{\\partial k(\\cdot, X_a)}{\\partial x_i} \\otimes \\frac{\\partial k(\\cdot, X_a)}{\\partial x_i}$ and $\\hat{\\xi}_{ij} = \\frac{1}{n} \\sum_{a \\in [n]} \\left[ \\frac{\\partial^2 k_{ij}(\\cdot, (X_{ai}, X_{aj}))}{\\partial x_i^2} + \\frac{\\partial^2 k_{ij}(\\cdot, (X_{ai}, X_{aj}))}{\\partial x_j^2} \\right]$ if $i \\neq j$, or $\\hat{\\xi}_{ij} = \\frac{1}{n} \\sum_{a \\in [n]} \\frac{\\partial^2 k_{ij}(\\cdot, (X_{ai}, X_{aj}))}{\\partial x_i^2}$ otherwise.\n\nPlease refer to our supplementary material for a detailed proof (footnote 1).\n\nThe above theorem still requires us to minimize over $F$. Our next result shows that the solution is finite dimensional. 
That is, we establish a representer theorem for our problem.\n\nFootnote 1: Please visit ttic.uchicago.edu/\u223csiqi for supplementary material and code.\n\nTheorem 3.2 (i) The solution to (10) can be represented as\n\n$$\\hat{f}_{ij}(\\cdot) = \\sum_{b \\in [n]} \\left[ \\beta_{bij} \\frac{\\partial k_{ij}(\\cdot, (X_{bi}, X_{bj}))}{\\partial x_i} + \\beta_{bji} \\frac{\\partial k_{ij}(\\cdot, (X_{bi}, X_{bj}))}{\\partial x_j} \\right] + \\alpha_{ij} \\hat{\\xi}_{ij}, \\qquad (11)$$\n\nwhere $i \\leq j$.\n\n(ii) Minimizing (10) is equivalent to minimizing the following quadratic function:\n\n$$\\frac{1}{2n} \\sum_{ai} \\Big( \\sum_{bj} (\\beta_{bij} G^{ab}_{ij11} + \\beta_{bji} G^{ab}_{ij12}) + \\sum_j \\alpha_{ij} h^{1a}_{ij} \\Big)^2 + \\sum_{i \\leq j} \\sum_b (\\beta_{bij} h^{1b}_{ij} + \\beta_{bji} h^{2b}_{ij}) + \\sum_{i \\leq j} \\alpha_{ij} \\|\\hat{\\xi}_{ij}\\|^2 + \\frac{\\mu}{2} \\|f\\|_{H,1}$$\n\n$$= \\frac{1}{2n} \\sum_{ai} (D_{ai}^T \\cdot \\theta)^2 + E^T \\theta + \\frac{\\mu}{2} \\sum_{i \\leq j} \\sqrt{\\theta_{ij}^T F_{ij} \\theta_{ij}}, \\qquad (12)$$\n\nwhere $G^{ab}_{ijrs} = \\frac{\\partial^2 k_{ij}(X_a, X_b)}{\\partial x_r \\partial y_s}$ and $h^{rb}_{ij} = \\langle \\frac{\\partial k_{ij}(\\cdot, X_b)}{\\partial x_r}, \\hat{\\xi}_{ij} \\rangle$ are constants that only depend on $X$, $\\theta = \\mathrm{cat}(\\mathrm{vec}(\\alpha), \\mathrm{vec}(\\beta))$ is the vector of parameters and $\\theta_{ij} = \\mathrm{cat}(\\alpha_{ij}, \\mathrm{vec}(\\beta_{\\cdot ij}))$ is a group of parameters. $D_{ai}$, $E$ and $F$ are the corresponding constant vectors and matrices based on $G$, $h$ and the order of the parameters. The above problem can then be solved by group lasso [7, 21].\n\nThe first part of the theorem states our representer theorem, and the second part is obtained by plugging (11) into (10). See supplementary material for a detailed proof. Theorem 3.2 provides us with an efficient way to minimize (8), as it reduces the optimization to a group lasso problem for which many efficient solvers exist.\n\nLet $\\hat{f}^\\mu = \\arg\\min_{f \\in H} \\hat{L}_\\mu(f)$ denote the solution to (12). 
We can estimate the graph as follows:\n\n$$\\hat{S}_\\mu = \\{(i, j) : \\|\\hat{f}^\\mu_{ij}\\| \\neq 0\\}. \\qquad (13)$$\n\nThat is, the graph is encoded in the sparsity pattern of $\\hat{f}^\\mu$.\n\n4 Statistical Guarantees\n\nIn this section we study statistical properties of the proposed estimator (13). Let $S$ denote the true edge set and $S^c$ its complement. We prove that $\\hat{S}_\\mu$ recovers $S$ with high probability when the sample size $n$ is sufficiently large.\n\nDenote $D = \\mathrm{mat}(D_{11}^T, \\ldots, D_{ai}^T, \\ldots, D_{nd}^T)$. We will need the following result on the estimated operator $\\hat{C}$.\n\nProposition 4.1 (Lemma 5 in [8] or Theorem 5 in [26]) (Properties of C)\n\n1. $\\|\\hat{C} - C\\|_{HS} = O_{p_0}(n^{-1/2})$\n2. $\\|(C + \\mu L)^{-1}\\| \\leq \\frac{1}{\\mu \\min \\mathrm{diag}(L)}$, $\\|C(C + \\mu L)^{-1}\\| \\leq 1$, where $\\mu > 0$ and $L$ is diagonal with positive constants.\n\nThe following result gives first order optimality conditions for the optimization problem (8).\n\nProposition 4.2 (Optimality Condition)\n$\\hat{J}(f) + \\frac{\\mu}{2} \\Omega(f)^2$ achieves optimality when the following two conditions are satisfied:\n\n(1) $\\nabla_{f_s} \\hat{J}(f) + \\mu \\Omega(f) \\frac{f_s}{\\|f_s\\|_{H_s}} = 0 \\;\\; \\forall s \\in S$\n(2) $\\Omega^*_{S^c}[\\nabla_{f_{S^c}} \\hat{J}(f)] \\leq \\mu \\Omega(f)$.\n\nWith these preliminary results, we have the following main result.\n\nTheorem 4.3 Assume that conditions A1-A7 are satisfied. The regularization parameter $\\mu$ is selected at the order of $n^{-1/4}$ and satisfies $\\mu \\leq \\frac{\\eta \\kappa_{\\min} \\omega_{\\min}}{4(1-\\eta)\\kappa_{\\max}\\sqrt{|S|} + \\frac{\\eta}{5}}$, where $\\kappa_{\\min} = \\min_{s \\in S} \\|f^*_s\\| > 0$ and $\\kappa_{\\max} = \\max_{s \\in S} \\|f^*_s\\| > 0$. Then $P(\\hat{S}_\\mu = S) \\to 1$.\n\nProof Idea: The theorem above is the main theoretical guarantee for our score matching estimator. We use the \u201cwitness\u201d proof framework inspired by [23, 29]. Let $f^*$ denote the true density function and $p^*$ the probability density function. We first construct a solution $\\hat{f}_S$ on the true edge set $S$ as\n\n$$\\hat{f}_S = \\arg\\min_{f_{S^c} = 0} \\; \\hat{J}(f) + \\frac{\\mu}{2} \\Big( \\sum_{(i,j) \\in S} \\|f_{ij}\\| \\Big)^2 \\qquad (14)$$\n\nand set $\\hat{f}_{S^c}$ to zero. Using Proposition 4.1, we prove that $\\|\\hat{f}_S - f^*_S\\| = O_p(n^{-1/4})$. Then we compute the subgradient on $S^c$ and prove that its dual norm is upper bounded by $\\mu \\Omega(f)$ by using assumptions A4, A5 and A6. Therefore we construct a solution that satisfies the optimality condition and converges in probability to the true graph. Refer to the supplementary material for a detailed proof.\n\n5 Experiments\n\nWe illustrate the performance of our method on two simulations. In our experiments, we use the same kernel, defined as follows:\n\n$$k(x, y) = \\exp\\Big(-\\frac{\\|x - y\\|_2^2}{2\\sigma^2}\\Big) + r(x^T y + c)^2, \\qquad (15)$$\n\nthat is, the summation of a Gaussian kernel and a polynomial kernel. We set $\\sigma^2 = 1.5$, $r = 0.1$ and $c = 0.5$ for all the simulations.\n\nWe report the true positive rate vs false positive rate (ROC) curve to measure the performance of different procedures. Let $S$ be the true edge set, and let $\\hat{S}_\\mu$ be the estimated graph. The true positive rate is defined as $\\mathrm{TPR}_\\mu = \\frac{|S = 1 \\text{ and } \\hat{S}_\\mu = 1|}{|S = 1|}$, and the false positive rate is $\\mathrm{FPR}_\\mu = \\frac{|\\hat{S}_\\mu = 1 \\text{ and } S = 0|}{|S = 0|}$, where $|\\cdot|$ is the cardinality of the set. The curve is then plotted based on 100 uniformly-sampled regularization parameters and based on 20 independent runs.\n\nIn the first simulation, we apply our algorithm to data sampled from a simple chain graph-based Gaussian model (see Figure 1 for detail), and compare its performance with glasso [6]. 
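
The kernel in (15) and the two rates that make up one ROC point can be written down directly. The following is a minimal sketch under stated assumptions, not the authors' code: the function names kernel and roc_point are hypothetical, and edges are represented as unordered pairs (i, j) with i < j, so node potentials on the diagonal are not counted.

```python
import math

def kernel(x, y, sigma2=1.5, r=0.1, c=0.5):
    # Gaussian kernel plus a scaled degree-2 polynomial kernel, as in (15).
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    dot = sum(a * b for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma2)) + r * (dot + c) ** 2

def roc_point(true_edges, est_edges, d):
    # true_edges / est_edges: sets of pairs (i, j) with i < j over d nodes.
    all_pairs = {(i, j) for i in range(d) for j in range(i + 1, d)}
    non_edges = all_pairs - true_edges
    tpr = len(true_edges & est_edges) / len(true_edges)
    fpr = len(est_edges & non_edges) / len(non_edges)
    return tpr, fpr

if __name__ == '__main__':
    chain = {(0, 1), (1, 2), (2, 3)}
    estimate = {(0, 1), (1, 2), (0, 3)}
    print(kernel((0.0, 0.0), (0.0, 0.0)))
    print(roc_point(chain, estimate, 4))
```

Sweeping the regularization parameter and recording one such pair per value traces out the ROC curve described above.
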
We use the same sampling method as in [31] to generate the data: we set $\\Omega_s = 0.4$ for $s \\in S$ and its diagonal to a constant such that $\\Omega$ is positive definite. We set the dimension $d$ to 25 and vary the sample size $n \\in \\{20, 40, 60, 80, 100\\}$.\n\nExcept for the low sample size case ($n = 20$), the performance of our method is comparable with glasso, without utilizing the fact that the underlying distribution is of a particular parametric form. Intuitively, to capture the graph structure, the proposed nonparametric method requires more data because of its much weaker assumptions.\n\nTo further show the strength of our algorithm, we test it on a nonparanormal (NPN) distribution [18]. A random vector $x = (x_1, \\ldots, x_d)$ has a nonparanormal distribution if there exist functions $(f_1, \\ldots, f_d)$ such that $(f_1(x_1), \\ldots, f_d(x_d)) \\sim N(\\mu, \\Sigma)$. When $f$ is monotone and differentiable, the probability density function is given by\n\n$$P(x) = \\frac{1}{(2\\pi)^{d/2} |\\Sigma|^{1/2}} \\exp\\Big\\{-\\frac{1}{2} (f(x) - \\mu)^T \\Sigma^{-1} (f(x) - \\mu)\\Big\\} \\prod_j |f_j'(x_j)|.$$\n\nHere the graph structure is still encoded in the sparsity pattern of $\\Omega = \\Sigma^{-1}$, that is, $x_i \\perp x_j | x_{-i,j}$ if and only if $\\Omega_{ij} = 0$ [18].\n\nFigure 1: The estimation results for Gaussian graphical models. Left: the adjacency matrix of the true graph. Center: the ROC curve of glasso. Right: the ROC curve of the score matching estimator (SME).\n\nFigure 2: The estimated ROC curves of nonparanormal graphical models for glasso (left), NPN (center) and SME (right).\n\nIn our experiments we use the \u201cSymmetric Power Transformation\u201d [18], that is,\n\n$$f_j(z_j) = \\sigma_j \\left( \\frac{g_0(z_j - \\mu_j)}{\\sqrt{\\int g_0^2(t - \\mu_j) \\phi(\\frac{t - \\mu_j}{\\sigma_j}) dt}} \\right) + \\mu_j,$$\n\nwhere $g_0(t) = \\mathrm{sign}(t)|t|^\\alpha$, to transform data. 
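
The transformation above can be evaluated numerically. The sketch below follows the displayed formula term by term, with two labeled assumptions: phi is taken to be the standard normal density, and the integral is approximated by the trapezoid rule on a finite interval around mu. The function names and the grid parameters n_grid and width are hypothetical and not taken from the paper or from [18].

```python
import math

def g0(t, alpha):
    # Signed power function g0(t) = sign(t) * |t|^alpha.
    return math.copysign(abs(t) ** alpha, t)

def std_normal_pdf(u):
    # Assumption: phi denotes the standard normal density.
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def power_transform(z, mu, sigma, alpha, n_grid=20001, width=10.0):
    # Trapezoid-rule approximation of the normalizing integral over
    # [mu - width * sigma, mu + width * sigma].
    lo = mu - width * sigma
    step = 2.0 * width * sigma / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        t = lo + i * step
        weight = 0.5 if i in (0, n_grid - 1) else 1.0
        total += weight * g0(t - mu, alpha) ** 2 * std_normal_pdf((t - mu) / sigma)
    norm = math.sqrt(total * step)
    return sigma * g0(z - mu, alpha) / norm + mu

if __name__ == '__main__':
    for z in (-1.0, 0.0, 0.5, 2.0):
        print(z, power_transform(z, 0.0, 1.0, 3.0))
```

With alpha = 1, mu = 0 and sigma = 1 the normalizing integral equals one and the map reduces to the identity, which gives a quick sanity check of the implementation.
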
For comparison with graphical lasso, we first use a truncation method to Gaussianize the data, and then apply graphical lasso to the transformed data. See [18, 31] for details. As Figure 2 shows, without knowing the underlying data distribution, the score matching estimator outperforms glasso, and shows similar results to nonparanormal when the sample size is large.\n\n6 Discussion\n\nIn this paper, we have proposed a new procedure for learning the structure of a nonparametric graphical model. Our procedure is based on minimizing a penalized score matching objective, which can be performed efficiently using existing group lasso solvers. A particularly appealing aspect of our approach is that it does not require computing the normalization constant. Therefore, our procedure can be applied to a very broad family of infinite dimensional exponential families. We have established that the procedure provably recovers the true underlying graphical structure with high probability under mild conditions. In the future, we plan to investigate more efficient algorithms for solving (10), since it is often the case that $\\hat{C}$ is well structured and can be efficiently approximated.\n\nAcknowledgments\n\nThe authors are grateful for the financial support from National Institutes of Health R01GM0897532, National Science Foundation CAREER award CCF-1149811 and IBM Corporation Faculty Research Fund at the University of Chicago Booth School of Business. 
This work was completed in part with resources provided by the University of Chicago Research Computing Center.\n\nReferences\n\n[1] R. Albert. Scale-free networks in cell biology. Journal of cell science, 118(21):4947\u20134957, 2005.\n[2] C. Ambroise, J. Chiquet, C. Matias, et al. Inferring sparse gaussian graphical models with latent structure. Electronic Journal of Statistics, 3:205\u2013238, 2009.\n[3] N. Aronszajn. Theory of reproducing kernels. Transactions of the American mathematical society, pages 337\u2013404, 1950.\n[4] S. Canu and A. Smola. Kernel methods and the exponential family. Neurocomputing, 69(7):714\u2013720, 2006.\n[5] A. Defazio and T. S. Caetano. 
A convex formulation for learning scale-free networks via submodular relaxation. In Advances in Neural Information Processing Systems, pages 1250\u20131258, 2012.\n[6] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432\u2013441, 2008.\n[7] J. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso. arXiv preprint arXiv:1001.0736, 2010.\n[8] K. Fukumizu, F. R. Bach, and A. Gretton. Statistical consistency of kernel canonical correlation analysis. The Journal of Machine Learning Research, 8:361\u2013383, 2007.\n[9] S. Geman and C. Graffigne. Markov random field image models and their applications to computer vision. In Proceedings of the International Congress of Mathematicians, volume 1, page 2, 1986.\n[10] A. Hyv\u00e4rinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, pages 695\u2013709, 2005.\n[11] A. Hyv\u00e4rinen. Some extensions of score matching. Computational statistics & data analysis, 51(5):2499\u20132512, 2007.\n[12] Y. Jeon and Y. Lin. An effective method for high-dimensional log-density anova estimation, with application to nonparametric graphical model building. Statistica Sinica, 16(2):353, 2006.\n[13] R. Kindermann, J. L. Snell, et al. Markov random fields and their applications, volume 1. American Mathematical Society Providence, RI, 1980.\n[14] D. Koller and N. Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.\n[15] Y. A. Kourmpetis, A. D. Van Dijk, M. C. Bink, R. C. van Ham, and C. J. ter Braak. Bayesian markov random field analysis for protein function prediction based on network data. PloS one, 5(2):e9293, 2010.\n[16] J. Lafferty, A. McCallum, and F. C. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. 
2001.\n[17] S. Z. Li. Markov random field modeling in image analysis. 2011.\n[18] H. Liu, J. Lafferty, and L. Wasserman. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. The Journal of Machine Learning Research, 10:2295\u20132328, 2009.\n[19] Q. Liu and A. T. Ihler. Learning scale free networks by reweighted l1 regularization. In International Conference on Artificial Intelligence and Statistics, pages 40\u201348, 2011.\n[20] C. D. Manning and H. Sch\u00fctze. Foundations of statistical natural language processing. MIT press, 1999.\n[21] L. Meier, S. Van De Geer, and P. B\u00fchlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):53\u201371, 2008.\n[22] N. Meinshausen and P. B\u00fchlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, pages 1436\u20131462, 2006.\n[23] P. Ravikumar, M. J. Wainwright, and J. Lafferty. High-dimensional graphical model selection using l1-regularized logistic regression. 2008.\n[24] P. Ravikumar, M. J. Wainwright, J. D. Lafferty, et al. High-dimensional Ising model selection using l1-regularized logistic regression. The Annals of Statistics, 38(3):1287\u20131319, 2010.\n[25] R. T. Rockafellar. Convex analysis. Number 28. Princeton university press, 1970.\n[26] B. Sriperumbudur, K. Fukumizu, R. Kumar, A. Gretton, and A. Hyv\u00e4rinen. Density estimation in infinite dimensional exponential families. arXiv preprint arXiv:1312.3516, 2013.\n[27] S. Sun, H. Wang, and J. Xu. Inferring block structure of graphical models in exponential families. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pages 939\u2013947, 2015.\n[28] Z. Wei and H. Li. A Markov random field model for network-based analysis of genomic data. Bioinformatics, 23(12):1537\u20131544, 2007.\n[29] E. 
Yang, G. Allen, Z. Liu, and P. K. Ravikumar. Graphical models via generalized linear models. In Advances in Neural Information Processing Systems, pages 1358\u20131366, 2012.\n[30] M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19\u201335, 2007.\n[31] T. Zhao, H. Liu, K. Roeder, J. Lafferty, and L. Wasserman. The huge package for high-dimensional undirected graph estimation in R. The Journal of Machine Learning Research, 13(1):1059\u20131062, 2012.\n", "award": [], "sourceid": 1358, "authors": [{"given_name": "Siqi", "family_name": "Sun", "institution": "TTIC"}, {"given_name": "Mladen", "family_name": "Kolar", "institution": "University of Chicago Booth School of Business"}, {"given_name": "Jinbo", "family_name": "Xu", "institution": "Toyota Technological Institute at Chicago"}]}