{"title": "Transelliptical Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 800, "page_last": 808, "abstract": null, "full_text": "Transelliptical Graphical Models\n\nHan Liu\n\nDepartment of Operations Research\n\nand Financial Engineering\n\nPrinceton University, NJ 08544\nhanliu@princeton.edu\n\nFang Han\n\nDepartment of Biostatistics\nJohns Hopkins University\n\nBaltimore, MD 21210\nfhan@jhsph.edu\n\nCun-hui Zhang\n\nDepartment of Statistics\n\nRutgers University\n\nPiscataway, NJ 08854\n\ncunhui@stat.rutgers.edu\n\nAbstract\n\nWe advocate the use of a new distribution family\u2014the transelliptical\u2014for robust\ninference of high dimensional graphical models. The transelliptical family is an\nextension of the nonparanormal family proposed by Liu et al. (2009). Just as the\nnonparanormal extends the normal by transforming the variables using univariate\nfunctions, the transelliptical extends the elliptical family in the same way. We\npropose a nonparametric rank-based regularization estimator which achieves the\nparametric rates of convergence for both graph recovery and parameter estima-\ntion. Such a result suggests that the extra robustness and \ufb02exibility obtained by\nthe semiparametric transelliptical modeling incurs almost no ef\ufb01ciency loss. We\nalso discuss the relationship between this work with the transelliptical component\nanalysis proposed by Han and Liu (2012).\n\n1 Introduction\nWe consider the problem of learning high dimensional graphical models.\nIn a typical setting, a\nd-dimensional random vector X = (X1, ..., Xd)T can be represented as an undirected graph de-\nnoted by G = (V, E), where V contains nodes corresponding to the d variables in X, and the\nedge set E describes the conditional independence relationship among X1, ..., Xd. Let X\\{i,j} :=\n{Xk : k 6= i, j}. We say the joint distribution of X is Markov to G if Xi is independent of Xj given\nX\\{i,j} for all (i, j) /\u2208 E. 
While often G is assumed given, here we want to estimate it from data.

Most graph estimation methods rely on Gaussian graphical models, in which the random vector X is assumed to be Gaussian: X ∼ Nd(µ, Σ). Under this assumption, the graph G is encoded by the precision matrix Θ := Σ−1. More specifically, no edge connects Xj and Xk if and only if Θjk = 0. This problem of estimating G is called covariance selection [5]. In low dimensions where d < n, [6, 7] develop a multiple testing procedure for identifying the sparsity pattern of the precision matrix. In high dimensions where d ≫ n, [21] propose a neighborhood pursuit approach for estimating Gaussian graphical models by solving a collection of sparse regression problems using the Lasso [25, 3]. Such an approach can be viewed as a pseudo-likelihood approximation of the full likelihood. In contrast, [1, 30, 10] propose a penalized likelihood approach to directly estimate Θ. [15, 14, 24] maximize a non-concave penalized likelihood to obtain an estimator with less bias than the traditional L1-regularized estimator. Under the irrepresentable conditions [33, 31, 27], [22, 23] study the theoretical properties of the penalized likelihood methods. More recently, [29, 2] propose the graphical Dantzig selector and CLIME, which can be solved by linear programming and possess more favorable theoretical properties than the penalized likelihood approach.

Besides Gaussian models, [18] propose a semiparametric procedure named nonparanormal SKEPTIC which extends the Gaussian family to the more flexible semiparametric Gaussian copula family. Instead of assuming X follows a Gaussian distribution, they assume there exists a set of monotone functions f1, . . . , fd such that the transformed data f(X) := (f1(X1), . . . , fd(Xd))T is Gaussian. More details can be found in [18].
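For intuition, a Gaussian-copula (nonparanormal) sample can be generated by pushing a latent Gaussian sample through strictly monotone marginal transforms; rank statistics such as Kendall's tau are exactly unchanged by such transforms. A minimal Python sketch (the transforms exp and cube, and all names, are our illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def kendall_tau(a, b):
    # Naive O(n^2) Kendall's tau between two samples.
    da = np.sign(a[:, None] - a[None, :])
    db = np.sign(b[:, None] - b[None, :])
    n = len(a)
    return (da * db).sum() / (n * (n - 1))

# Latent Gaussian with correlation 0.6.
Sigma = np.array([[1.0, 0.6], [0.6, 1.0]])
Z = rng.multivariate_normal(np.zeros(2), Sigma, size=1000)

# Strictly monotone marginal transforms give a nonparanormal sample.
X = np.column_stack([np.exp(Z[:, 0]), Z[:, 1] ** 3])

# Kendall's tau is invariant under the monotone transforms:
print(kendall_tau(Z[:, 0], Z[:, 1]) == kendall_tau(X[:, 0], X[:, 1]))  # True
```

The equality is exact, not approximate: monotone transforms preserve every pairwise sign in the definition of tau.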
[32] has developed a scalable software package that implements these algorithms. In another line of research, [26] extends the Gaussian graphical models to the elliptical graphical models. However, for elliptical distributions, only the generalized partial correlation graph can be reliably estimated. Such a graph represents only the conditional uncorrelatedness, not the conditional independence, among variables. Therefore, in extending the Gaussian family to the elliptical family, the gain in modeling flexibility is traded off against a loss in the strength of inference. In a related work, [9] provide a latent variable interpretation of the generalized partial correlation graph for multivariate t-distributions. An EM-type algorithm is proposed to fit the model for high dimensional data. However, the theoretical properties of their estimator are unknown.

In this paper, we introduce a new distribution family named the transelliptical graphical model. A key concept is the transelliptical distribution [12], which is a generalization of the nonparanormal distribution proposed by [18]. Mimicking the way the nonparanormal extends the normal family, the transelliptical extends the elliptical family; the transelliptical family contains both the nonparanormal family and the elliptical family. To infer the graph structure, a rank-based procedure using the Kendall's tau statistic is proposed. We show that this procedure is adaptive over the transelliptical family: by default it delivers a conditional uncorrelatedness graph among certain latent variables; however, if the true distribution is nonparanormal, the procedure automatically delivers the conditional independence graph. Computationally, the only extra cost is a one-pass data sort, which is almost negligible.
Theoretically, even though the transelliptical family is much larger than the nonparanormal family, the same parametric rates of convergence for graph recovery and parameter estimation can be established. These results suggest that the transelliptical graphical model can be used routinely as a replacement for the nonparanormal models. Thorough numerical results are provided to back up our theory.

2 Background on Elliptical Distributions

Let X and Y be two random variables; we write X =d Y if they have the same distribution.

Definition 2.1 (elliptical distribution [8]). Let µ ∈ Rd and Σ ∈ Rd×d with rank(Σ) = q ≤ d. A d-dimensional random vector X has an elliptical distribution, denoted by X ∼ ECd(µ, Σ, ξ), if it has the stochastic representation X =d µ + ξAU, where U is a random vector uniformly distributed on the unit sphere in Rq, ξ ≥ 0 is a scalar random variable independent of U, and A ∈ Rd×q is a deterministic matrix such that AAT = Σ.

Remark 2.1. An equivalent definition of an elliptical distribution is that its characteristic function can be written as exp(itT µ)φ(tT Σt), where φ is a properly-defined characteristic function which has a one-to-one mapping with ξ in Definition 2.1. In this setting we write X ∼ ECd(µ, Σ, φ).

An elliptical distribution does not necessarily have a density. One example is the rank-deficient Gaussian; more examples can be found in [11]. However, when the random variable ξ is absolutely continuous with respect to the Lebesgue measure and Σ is non-singular, the density of X exists and has the form

p(x) = |Σ|−1/2 g((x − µ)T Σ−1 (x − µ)),   (1)

where g(·) is a scale function uniquely determined by the distribution of ξ.
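The stochastic representation in Definition 2.1 also gives a direct sampling recipe: draw U uniformly on the sphere, draw ξ, and set X = µ + ξAU. A Python sketch (all function and variable names are ours; the chi-distributed ξ recovering the Gaussian is a standard fact):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_elliptical(mu, Sigma, xi_sampler, n):
    # X = mu + xi * A U  with  A A^T = Sigma and U uniform on the unit sphere.
    d = len(mu)
    A = np.linalg.cholesky(Sigma)
    Z = rng.standard_normal((n, d))
    U = Z / np.linalg.norm(Z, axis=1, keepdims=True)  # uniform on the sphere
    xi = xi_sampler(n)                                # scalar generating variable
    return mu + xi[:, None] * (U @ A.T)

# With xi ~ chi_d (so that xi * U is standard normal), we recover N(mu, Sigma).
mu = np.zeros(3)
Sigma = np.array([[1.0, 0.5, 0.2], [0.5, 1.0, 0.3], [0.2, 0.3, 1.0]])
X = sample_elliptical(mu, Sigma, lambda n: np.sqrt(rng.chisquare(3, n)), 200000)
print(np.round(np.cov(X, rowvar=False), 2))  # entries close to Sigma
```

Swapping in a different ξ (e.g. one with heavy tails) yields other members of the elliptical family with the same Σ.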
When such a density exists, we also write X ∼ ECd(µ, Σ, g). Many multivariate distributions belong to the elliptical family. For example, when g(x) = (2π)−d/2 exp{−x/2}, X is d-dimensional Gaussian. Another important subclass is the multivariate t-distribution with v degrees of freedom, for which we choose

g(x) = cv (1 + x/v)−(v+d)/2, with cv = Γ((v + d)/2) / ((vπ)d/2 Γ(v/2)),   (2)

where cv is a normalizing constant.

The model family in Definition 2.1 is not identifiable. For example, given X ∼ ECd(µ, Σ, ξ) with rank(Σ) = q, there are multiple matrices A corresponding to the same Σ, i.e., there exist A1 ≠ A2 ∈ Rd×q such that A1A1T = A2A2T = Σ. Moreover, for any constant c ≠ 0, if we define ξ∗ = ξ/c and A∗ = c · A, then ξAU = ξ∗A∗U. Therefore, the matrix Σ is unique only up to a constant scaling. To make the model identifiable, we impose the condition max{diag(Σ)} = 1. More discussion of the identifiability issue can be found in [12].

3 Transelliptical Graphical Models

In this paper we only consider distributions with continuous marginals. We introduce the transelliptical graphical models in analogy to the nonparanormal graphical models [19, 18]. The key concept is the transelliptical distribution, also introduced in [12]. However, the definition of the transelliptical distribution in this paper is slightly more restrictive than that in [12] due to the complications of graphical modeling. More specifically, let

R+d := {Σ ∈ Rd×d : ΣT = Σ, diag(Σ) = 1, Σ ≻ 0},   (3)

and define the transelliptical distribution as follows:

Definition 3.1 (transelliptical distribution). A continuous random vector X = (X1, . . . 
, Xd)T is transelliptical, denoted by X ∼ TEd(Σ, ξ; f1, . . . , fd), if there exists a set of monotone univariate functions f1, . . . , fd and a nonnegative random variable ξ satisfying P(ξ = 0) = 0, such that

(f1(X1), . . . , fd(Xd))T ∼ ECd(0, Σ, ξ), where Σ ∈ R+d.   (4)

Here, Σ is called the latent generalized correlation matrix¹.

We now discuss the relationship between the transelliptical family and the nonparanormal family, which is defined as follows:

Definition 3.2 (nonparanormal distribution). A random vector X = (X1, . . . , Xd)T is nonparanormal, denoted by X ∼ NPNd(Σ; f1, . . . , fd), if there exist monotone functions f1, . . . , fd such that (f1(X1), . . . , fd(Xd))T ∼ Nd(0, Σ), where Σ ∈ R+d. Here, Σ is called the latent correlation matrix.

From Definitions 3.1 and 3.2, we see that the transelliptical is a strict extension of the nonparanormal. Both families assume there exists a set of univariate transformations such that the transformed data follow a base distribution: the nonparanormal exploits a normal base distribution, while the transelliptical exploits an elliptical base distribution. In the nonparanormal, Σ is the correlation matrix of the latent normal, and is therefore called the latent correlation matrix; in the transelliptical, Σ is the generalized correlation matrix of the latent elliptical distribution, and is therefore called the latent generalized correlation matrix.

We now define the transelliptical graphical models. Let X ∼ TEd(Σ, ξ; f1, . . . , fd), where Σ ∈ R+d is the latent generalized correlation matrix. In this paper, we always assume the second moment Eξ2 < ∞. We define Θ := Σ−1 to be the latent generalized concentration matrix. Let Θjk be the element of Θ in the j-th row and k-th column. 
We define the latent generalized partial correlation matrix Γ as Γjk := −Θjk/√(Θjj Θkk). Let diag(A) be the matrix A with off-diagonal elements replaced by zero and A1/2 the square root matrix of A. It is easy to see that

Γ = −[diag(Σ−1)]−1/2 Σ−1 [diag(Σ−1)]−1/2.   (5)

Therefore, Γ has the same off-diagonal nonzero pattern as Σ−1. We then define an undirected graph G = (V, E): the vertex set V contains nodes corresponding to the d variables in X, and the edge set E satisfies

(Xj, Xk) ∈ E if and only if Γjk ≠ 0, for j ≠ k, j, k = 1, . . . , d.   (6)

Given a graph G, we define R+d(G) to be the set containing all the Σ ∈ R+d whose inverses have zero entries at the positions specified by the graph G. The transelliptical graphical model induced by G is defined as follows:

Definition 3.3 (transelliptical graphical model). The transelliptical graphical model induced by a graph G, denoted by P(G), is defined to be the set of distributions

P(G) := {all the transelliptical distributions TEd(Σ, ξ; f1, . . . , fd) satisfying Σ ∈ R+d(G)}.   (7)

In the rest of this section, we prove some properties of the transelliptical family and discuss the interpretation of the graph G, which is called the latent generalized partial correlation graph. First, we show that the transelliptical family is closed under marginalization and conditioning.

¹One thing to note is that in [12], the condition Σ ∈ R+d is not required.

Lemma 3.1. Let X := (X1, . . . , Xd)T ∼ TEd(Σ, ξ; f1, . . . , fd). The marginal and conditional distributions of (X1, X2)T given the remaining variables are still transelliptical.

Proof. Since X ∼ TEd(Σ, ξ; f1, . . . , fd), we have (f1(X1), . . . , fd(Xd))T ∼ ECd(0, Σ, ξ). Let Zj := fj(Xj) for j = 1, . . . 
, d. From Theorem 2.18 of [8], the marginal distribution of (Z1, Z2)T and the conditional distribution of (Z1, Z2)T given the remaining Z3, . . . , Zd are both elliptical. By definition, the marginal distribution of (X1, X2)T is then transelliptical. For the conditional case, since X has continuous marginals and f1, . . . , fd are monotone, the distribution of (X1, X2)T conditional on X\{1,2} is the same as that conditional on Z\{1,2}. Combined with the fact that Z1 = f1(X1) and Z2 = f2(X2), this shows that (X1, X2)T | X\{1,2} follows a transelliptical distribution.

From (5), we see that the matrices Γ and Θ have the same off-diagonal nonzero pattern; therefore, they encode the same graph G. Let X ∼ TEd(Σ, ξ; f1, . . . , fd). The next lemma shows that, if the second moment of ξ exists, the absence of an edge in the graph G is equivalent to the pairwise conditional uncorrelatedness of the two corresponding latent variables.

Lemma 3.2. Let X := (X1, . . . , Xd)T ∼ TEd(Σ, ξ; f1, . . . , fd) with Eξ2 < ∞, and let Zj := fj(Xj) for j = 1, . . . , d. Then Γjk = 0 if and only if Zj and Zk are conditionally uncorrelated given Z\{j,k}.

Proof. Let Z := (Z1, . . . , Zd)T. Since X ∼ TEd(Σ, ξ; f1, . . . , fd), we have Z ∼ ECd(0, Σ, ξ). Therefore, the latent generalized correlation matrix Σ is the generalized correlation matrix of the latent variable Z. It suffices to prove that, for elliptical distributions with Eξ2 < ∞, the generalized partial correlation matrix Γ as defined in (5) encodes the conditional uncorrelatedness among the variables. Such a result has been proved in Section 2 of [26].

Let A, B, C ⊂ {1, . . . , d}. We say C separates A and B in the graph G if any path from a node in A to a node in B passes through at least one node in C. We denote by XA the subvector of X indexed by A. 
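The separation relation just defined is a purely graph-theoretic condition and can be checked by breadth-first search over the graph with the separator removed. A small Python sketch (function names ours):

```python
from collections import deque

def separates(adj, A, B, C):
    """True if C separates A and B in the undirected graph `adj`
    (a dict mapping each node to its neighbor list): every path
    from a node in A to a node in B must pass through a node in C."""
    blocked = set(C)
    seen = set(A) - blocked
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        if u in B:
            return False  # reached B while avoiding every node of C
        for v in adj[u]:
            if v not in blocked and v not in seen:
                seen.add(v)
                queue.append(v)
    return True

# Chain graph 1 - 2 - 3 - 4: the middle node {2} separates {1} from {3, 4}.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(separates(adj, {1}, {3, 4}, {2}))  # True
print(separates(adj, {1}, {4}, set()))   # False
```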
The next lemma establishes the equivalence between the pairwise and global conditional uncorrelatedness of the latent variables for the transelliptical graphical models; it connects graph theory with probability theory.

Lemma 3.3. Let X ∼ TEd(Σ, ξ; f1, . . . , fd) be any element of the transelliptical graphical model P(G) satisfying Eξ2 < ∞. Let Z := (Z1, . . . , Zd)T with Zj = fj(Xj), and let A, B, C ⊂ {1, . . . , d}. Then C separates A and B in G if and only if ZA and ZB are conditionally uncorrelated given ZC.

Proof. By definition, we know Z ∼ ECd(0, Σ, ξ). It then suffices to show that pairwise conditional uncorrelatedness implies global conditional uncorrelatedness for the elliptical family. This follows from the same induction argument as in Theorem 3.7 of [16].

Compared with the nonparanormal graphical model, the transelliptical graphical model gains a lot of modeling flexibility, but at the price of inferring a weaker notion of graph: a missing edge in the graph only represents the conditional uncorrelatedness of the latent variables. The next lemma shows that we do not lose anything compared with the nonparanormal graphical model. The proof of this lemma is simple and is omitted; some related discussion can be found in [19].

Lemma 3.4. Let X ∼ TEd(Σ, ξ; f1, . . . , fd) be a member of the transelliptical graphical model P(G). If X is also nonparanormal, then the graph G encodes the conditional independence relationships of X (in other words, the distribution of X is Markov to G).

4 Rank-based Regularization Estimator

In this section, we propose a nonparametric rank-based regularization estimator which achieves the optimal parametric rates of convergence for both graph recovery and parameter estimation. 
The main idea of our procedure is to treat the marginal transformation functions fj and the generating variable ξ as nuisance parameters, and to exploit the nonparametric Kendall's tau statistic to directly estimate the latent generalized correlation matrix Σ. The obtained correlation matrix estimate is then plugged into the CLIME procedure to estimate the sparse latent generalized concentration matrix Θ. From the previous discussion, we know the graph G is encoded by the nonzero pattern of Θ. We then get a graph estimator by thresholding the estimated Θ̂.

4.1 The Kendall's tau Statistic and its Invariance Property

Let x1, . . . , xn ∈ Rd be n observations of a random vector X ∼ TEd(Σ, ξ; f1, . . . , fd). Our task is to estimate the latent generalized concentration matrix Θ := Σ−1. The Kendall's tau statistic is defined as

τ̂jk = (2/(n(n − 1))) Σ_{1≤i<i′≤n} sign(x^i_j − x^{i′}_j) sign(x^i_k − x^{i′}_k),   (8)

where x^i_j denotes the j-th entry of xi. Since the sign function is invariant under strictly monotone transformations of the marginals, τ̂jk computed on X is the same as τ̂jk computed on the latent variables fj(Xj) and fk(Xk); moreover, its population version τjk satisfies Σjk = sin((π/2)τjk) for the transelliptical family. We therefore estimate the latent generalized correlation matrix by

Ŝjk := sin((π/2)τ̂jk) for j ≠ k, and Ŝjj := 1.   (9)

4.2 The Rank-based CLIME Estimator

We plug Ŝ into the CLIME procedure [2] and solve

Θ̂ := argmin_Θ Σ_{j,k} |Θjk| subject to ||ŜΘ − Id||max ≤ λ,   (10)

where λ > 0 is a tuning parameter. [2] show that this optimization can be decomposed into d vector minimization problems, each of which can be reformulated as a linear program; thus it has the potential to scale to very large problems. Once Θ̂ is obtained, we can apply an additional thresholding step to estimate the graph G. For this, we define a graph estimator Ĝ = (V, Ê), in which an edge (j, k) ∈ Ê if and only if |Θ̂jk| ≥ γ. Here γ is another tuning parameter.

Compared with the original CLIME, the extra cost of our rank-based procedure is the computation of Ŝ, which requires us to evaluate d(d − 1)/2 pairwise Kendall's tau statistics. A naive implementation of the Kendall's tau requires O(n^2) computation. 
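Concretely, the naive per-pair computation of (8), together with the sin((π/2)τ̂jk) plug-in for the latent generalized correlation matrix, can be sketched in Python (a toy illustration, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def kendall_tau_matrix(X):
    # Naive O(d^2 n^2) evaluation of the pairwise Kendall's tau in (8).
    n, d = X.shape
    T = np.eye(d)
    for j in range(d):
        for k in range(j + 1, d):
            s = (np.sign(X[:, None, j] - X[None, :, j])
                 * np.sign(X[:, None, k] - X[None, :, k])).sum()
            T[j, k] = T[k, j] = s / (n * (n - 1))
    return T

def latent_correlation(X):
    # S_hat = sin(pi/2 * tau_hat): rank-based estimate of Sigma,
    # with the diagonal set to one.
    S = np.sin(np.pi / 2 * kendall_tau_matrix(X))
    np.fill_diagonal(S, 1.0)
    return S

# Sanity check on Gaussian data, where Sigma is the usual correlation matrix.
Sigma = np.array([[1.0, 0.5, 0.2], [0.5, 1.0, 0.4], [0.2, 0.4, 1.0]])
X = rng.multivariate_normal(np.zeros(3), Sigma, size=1000)
print(np.round(latent_correlation(X), 2))  # entries close to Sigma
```

The resulting Ŝ would then be handed to a CLIME solver in place of the Pearson correlation matrix.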
However, efficient algorithms based on sorting and balanced binary trees have been developed that calculate the Kendall's tau statistic with computational complexity O(n log n) [4]. Therefore, the incurred computational burden is negligible.

Remark 4.1. Similar rank-based procedures have been discussed in [19, 18, 28]. Unlike our work, they focus on the more restrictive nonparanormal family and discuss several rank-based procedures using the normal-score, Spearman's rho, and Kendall's tau statistics. Unlike our results, they advocate the use of the Spearman's rho and normal-score correlation coefficients. Their main concern is that, within the more restrictive nonparanormal family, the Spearman's rho and normal-score correlations are slightly easier to compute and have smaller asymptotic variance. In contrast, the new insight of the current paper is that we advocate the use of the Kendall's tau due to its invariance property within the much larger transelliptical family. In fact, we can show that the Spearman's rho is not invariant within the transelliptical family unless the true distribution is nonparanormal. More details on this issue can be found in [8].

5 Asymptotic Properties

We analyze the theoretical properties of the rank-based regularization estimator proposed in Section 4.2. Our main result shows that, under the same conditions on Σ that ensure the parameter estimation and graph recovery consistency of the original CLIME estimator for Gaussian graphical models, our rank-based regularization procedure achieves exactly the same parametric rates of convergence for both parameter estimation and graph recovery over the much larger transelliptical family. 
This result suggests that the transelliptical graphical model can be used as a safe replacement for the Gaussian graphical models, the nonparanormal graphical models, and the elliptical graphical models.

We introduce some additional notation. Given a symmetric matrix A and 0 ≤ q < 1, we define ||A||Lq := max_i Σ_j |Aij|^q, and let the spectral norm ||A||L2 be its largest eigenvalue. We define

Sd(q, s, M) := {Θ : ||Θ||L1 ≤ M and ||Θ||Lq ≤ s}.   (11)

For q = 0, the class Sd(0, s, M) contains all the s-sparse matrices. Our main result is Theorem 5.1.

Theorem 5.1. Let X ∼ TEd(Σ, ξ; f1, . . . , fd) with Σ ∈ R+d and Θ := Σ−1 ∈ Sd(q, s, M) with 0 ≤ q < 1. Let Θ̂ be defined in (10). There exist constants C0 and C1, depending only on q, such that, whenever λ = C0 M √((log d)/n), with probability no less than 1 − d−2, we have

(Parameter estimation)   ||Θ̂ − Θ||L2 ≤ C1 M^{2−2q} · s · ((log d)/n)^{(1−q)/2}.   (12)

Let Ĝ be the graph estimator defined in Section 4.2 with the additional tuning parameter γ = 4Mλ. If we further assume Θ ∈ Sd(0, s, M) and min_{j,k: |Θjk|≠0} |Θjk| ≥ 2γ, then

(Graph recovery)   P(Ĝ = G) ≥ 1 − o(1),   (13)

where G is the graph determined by the nonzero pattern of Θ.

Proof. The difference between the rank-based CLIME and the original CLIME is that we replace the Pearson correlation coefficient matrix R̂ by the Kendall's tau matrix Ŝ. By examining the proofs of Theorems 1 and 7 in [2], the only property needed of R̂ is an exponential concentration inequality of the form P(|R̂jk − Σjk| > t) ≤ c1 exp(−c2 n t^2). Therefore, it suffices to prove a similar concentration inequality for |Ŝjk − Σjk|. 
Since Ŝjk = sin((π/2)τ̂jk) and Σjk = sin((π/2)τjk), and the sine function is Lipschitz, we have |Ŝjk − Σjk| ≤ (π/2)|τ̂jk − τjk|. Therefore, we only need to prove

P(|τ̂jk − τjk| > t) ≤ exp(−n t^2/(2π)).

This result holds since τ̂jk is a U-statistic with a bounded kernel: τ̂jk = (2/(n(n − 1))) Σ_{1≤i<i′≤n} Kτ(x^i, x^{i′}), where Kτ(x^i, x^{i′}) = sign((x^i_j − x^{i′}_j)(x^i_k − x^{i′}_k))