{"title": "Speeding Up Latent Variable Gaussian Graphical Model Estimation via Nonconvex Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 1933, "page_last": 1944, "abstract": "We study the estimation of the latent variable Gaussian graphical model (LVGGM), where the precision matrix is the superposition of a sparse matrix and a low-rank matrix. In order to speed up the estimation of the sparse plus low-rank components, we propose a sparsity constrained maximum likelihood estimator based on matrix factorization and an efficient alternating gradient descent algorithm with hard thresholding to solve it. Our algorithm is orders of magnitude faster than the convex relaxation based methods for LVGGM. In addition, we prove that our algorithm is guaranteed to linearly converge to the unknown sparse and low-rank components up to the optimal statistical precision. Experiments on both synthetic and genomic data demonstrate the superiority of our algorithm over the state-of-the-art algorithms and corroborate our theory.", "full_text": "Speeding Up Latent Variable Gaussian Graphical\nModel Estimation via Nonconvex Optimization\n\nPan Xu\n\nDepartment of Computer Science\n\nUniversity of Virginia\n\nCharlottesville, VA 22904\npx3ds@virginia.edu\n\nJian Ma\n\nSchool of Computer Science\nCarnegie Mellon University\n\nPittsburgh, PA 15213\njianma@cs.cmu.edu\n\nQuanquan Gu\n\nDepartment of Computer Science\n\nUniversity of Virginia\n\nCharlottesville, VA 22904\n\nqg5w@virginia.edu\n\nAbstract\n\nWe study the estimation of the latent variable Gaussian graphical model (LVGGM),\nwhere the precision matrix is the superposition of a sparse matrix and a low-rank\nmatrix. In order to speed up the estimation of the sparse plus low-rank components,\nwe propose a sparsity constrained maximum likelihood estimator based on matrix\nfactorization, and an ef\ufb01cient alternating gradient descent algorithm with hard\nthresholding to solve it. 
Our algorithm is orders of magnitude faster than the\nconvex relaxation based methods for LVGGM. In addition, we prove that our\nalgorithm is guaranteed to linearly converge to the unknown sparse and low-rank\ncomponents up to the optimal statistical precision. Experiments on both synthetic\nand genomic data demonstrate the superiority of our algorithm over the state-of-\nthe-art algorithms and corroborate our theory.\n\nIntroduction\n\n1\nFor a d-dimensional Gaussian graphical model (i.e., multivariate Gaussian distribution) N (0, \u2303\u21e4),\nthe inverse of covariance matrix \u2326\u21e4 = (\u2303\u21e4)1 (also known as the precision matrix or concentration\nmatrix) measures the conditional dependence relationship between marginal random variables [19].\nWhen the number of observations is comparable to the ambient dimension of the Gaussian graphical\nmodel, additional structural assumptions are needed for consistent estimation. Sparsity is one of\nthe most common structures imposed on the precision matrix in Gaussian graphical models (GGM),\nbecause it gives rise to a sparse graph, which characterizes the conditional dependence of the marginal\nvariables. The problem of estimating the sparse precision matrix in Gaussian graphical models has\nbeen studied by a large body of literature [23, 29, 12, 28, 6, 34, 37, 38, 33]. However, the real world\ndata may not follow a sparse GGM, especially when some of the variables are unobservable.\nTo alleviate this problem, the latent variable Gaussian graphical model (LVGGM) [9, 24] has been\nstudied, where the precision matrix of the observed variables is conditionally sparse given the latent\nvariables (i.e., unobserved) , but marginally not sparse. It is well-known that in LVGGM, the precision\nmatrix \u2326\u21e4 can be represented as the superposition of a sparse matrix S\u21e4 and a low-rank matrix L\u21e4,\nwhere the latent variables contribute to the low rank component in the precision matrix. 
In other\nwords, we have \u2326\u21e4 = S\u21e4 + L\u21e4.\nIn the learning problem of LVGGM, the goal is to estimate both the unknown sparse component\nS\u21e4 and the low-rank component L\u21e4 of the precision matrix simultaneously. In the seminal work,\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fChandrasekaran et al. [9] proposed a maximum-likelihood estimator based on `1 norm penalty on\nthe sparse matrix and nuclear norm penalty on the low-rank matrix, and proved the model selection\nconsistency for LVGGM estimation. Meng et al. [24] studied a similar penalized estimator, and\nderived Frobenius norm error bounds based on the restricted strong convexity [26] and the structural\nFisher incoherence condition between the sparse and low-rank components. Both of these two\nmethods for LVGGM estimation are based on a penalized convex optimization problem, which can\nbe solved by log-determinant proximal point algorithm [32] and alternating direction method of\nmultipliers [22]. Due to the nuclear norm penalty, these convex optimization algorithms need to do\nfull singular value decomposition (SVD) to solve the proximal mapping of nuclear norm at each\niteration, which results in an extremely high time complexity of O(d3). When d is large as often in\nthe high dimensional setting, the convex relaxation based methods are computationally intractable. It\nis worth noting that full SVD cannot be accelerated by power method [13] or other randomized SVD\nalgorithms [15], hence the O(d3) is unavoidable whenever nuclear norm regularization is employed.\nIn this paper, in order to speed up learning LVGGM, we propose a novel sparsity constrained\nmaximum likelihood estimator for LVGGM based on matrix factorization. 
Speci\ufb01cally, inspired by\nthe recent work on matrix factorization [18, 16, 44, 45, 11, 30], we propose to reparameterize the\nlow-rank component L in the precision matrix as the product of smaller matrices, i.e., L = ZZ>,\nwhere Z 2 Rd\u21e5r and r \uf8ff d is the number of latent variables. This factorization captures the\nintrinsic low-rank structure of L, and automatically ensures its low-rankness. We propose an\nalternating gradient descent with hard thresholding to solve the new estimator. We prove that the\noutput of our algorithm is guaranteed to linearly converge to the unknown parameters up to the\nstatistical precision. In detail, our algorithm enjoys O(d2r) per-iteration time complexity, which\noutperforms the O(d3) per-iteration complexity of state-of-the-art LVGGM estimators based on\nnuclear norm penalty [9, 22, 24]. In addition, the estimators from our algorithm for LVGGM attain\n\nmax{Op(ps\u21e4 log d/n), Op(prd/n)} statistical rate of convergence in terms of Frobenius norm,\n\nwhere s\u21e4 is the conditional sparsity of the precision matrix (i.e., sparsity of S\u21e4), and r is the number\nof latent variables (i.e., rank of L\u21e4). This matches the minimax optimal convergence rate for LVGGM\nestimation [9, 1, 24]. Thorough experiments on both synthetic and breast cancer genomic datasets\nshow that our algorithm is orders of magnitude faster than existing methods.\nIt is also worth noting that, although our estimator and algorithm is designed for LVGGM, it is\ndirectly applicable to the Gaussian graphical model where the precision matrix is the sum of a sparse\nmatrix and a low-rank matrix. And the theoretical guarantees of our algorithm still hold.\nThe remainder of this paper is organized as follows: In Section 2, we brie\ufb02y review existing work\nthat is relevant to our study. We present our estimator and algorithm in detail in Section 3, and the\nmain theory in Section 4. 
In Section 5, we compare the proposed algorithm with the state-of-the-art algorithms on both synthetic data and real-world breast cancer data. Finally, we conclude this paper in Section 6.
Notation For matrices A, B with commensurate dimensions, we use ⟨A, B⟩ = tr(A⊤B) to denote their inner product and A ⊗ B to denote their Kronecker product. For a matrix A ∈ R^{d×d}, we denote its (ordered) singular values by σ1(A) ≥ σ2(A) ≥ . . . ≥ σd(A) ≥ 0. We denote by A^{−1} the inverse of A, and by |A| its determinant. We use the notation ‖·‖ for various types of matrix norms, including the spectral norm ‖A‖2 and the Frobenius norm ‖A‖F. We also use the following entrywise norms: ‖A‖_{0,0} = Σ_{i,j} 1(A_{ij} ≠ 0), ‖A‖_{∞,∞} = max_{1≤i,j≤d} |A_{ij}|, and ‖A‖_{1,1} = Σ_{i,j=1}^d |A_{ij}|. A constant is called an absolute constant if it does not depend on the parameters of the problem, e.g., dimension and sample size. We write a ≲ b if a is less than b up to a constant factor.
2 Additional Related Work
Precision matrix estimation in sparse Gaussian graphical models (GGM) is commonly formulated as a penalized maximum likelihood estimation problem, either with ℓ_{1,1} norm regularization [12, 29, 28] (graphical Lasso) or with regularization on the diagonal elements of the Cholesky decomposition of the precision matrix [17]. Due to the complex dependency among marginal variables in many applications, the sparsity assumption on the precision matrix often does not hold. To relax this assumption, the conditional Gaussian graphical model (cGGM) was proposed in [41, 5] and the partial Gaussian graphical model (pGGM) in [42], both of which impose blockwise sparsity on the precision matrix and estimate multiple blocks therein. Despite the good interpretability of these models, they need access to both the observed and the latent variables for estimation. Another alternative is the latent variable Gaussian graphical model (LVGGM), which was proposed in [9] and later investigated in [22, 24].
Compared with cGGM and pGGM, the estimation of LVGGM does not need\nto access the latent variables and therefore is more \ufb02exible.\nAnother line of research related to ours is low-rank matrix estimation based on alternating minimiza-\ntion and gradient descent [18, 16, 44, 45, 11, 30, 3, 35, 43]. However, extending them to low-rank\nand sparse matrix estimation as in LVGGM turns out to be highly nontrivial. The most related work\nto ours includes [14] and [40], which studied nonconvex optimization for low-rank plus sparse matrix\nestimation. However, they are limited to robust PCA [8] and multi-task regression [1] in the noiseless\nsetting. Due to the square loss in RPCA, the sparse matrix S can be calculated by subtracting the\nlow-rank matrix L from the observed data matrix. Nevertheless, in LVGGM, there is no closed-form\nsolution for the sparse matrix due to the log-determinant term, and we need to use gradient descent\nto update S. On the other hand, both the algorithm in [40] and our algorithm have an initialization\nstage. Yet our initialization algorithm is new and different from the initialization algorithm in [40] for\nRPCA. Furthermore, our analysis of the initialization algorithm is built on the spikiness condition,\nwhich is also different from that for RPCA.\nThe last but not least related work is expectation maximization (EM) algorithm [2, 36], which shares a\nsimilar bivariate structure as our estimator. However, the proof technique used in [2, 36] is not directly\napplicable to our algorithm, due to the matrix factorization structure in our estimator. Moreover, to\novercome the dependency issue between consecutive iterations in the proof, sample splitting strategy\n[18, 16] was employed in [2, 36, 39], i.e., dividing the whole dataset into T pieces and using a fresh\npiece of data in each iteration. Unfortunately, the sample splitting technique results in a suboptimal\nstatistical rate, incurring an extra factor of pT in the rate. 
In sharp contrast, our proof technique does not rely on sample splitting, because we are able to prove a uniform convergence result over a small neighborhood of the unknown parameters, which directly resolves the dependency issue.
3 The Proposed Estimator and Algorithm
In this section, we present a new estimator for LVGGM estimation, together with a new algorithm.
3.1 Latent Variable GGMs
Let X_O be the d-dimensional random vector of observed variables and X_L be the r-dimensional random vector of latent variables. We assume that the concatenated random vector X = (X_O⊤, X_L⊤)⊤ follows a multivariate Gaussian distribution with covariance matrix Σ̃ and sparse precision matrix Ω̃ = Σ̃^{−1}. It is proved in [10] that the observed data X_O follows a normal distribution with marginal covariance matrix Σ* = Σ̃_{OO}, which is the top-left block of Σ̃ corresponding to X_O. The precision matrix of X_O is then given by the Schur complement [13]:
Ω* = (Σ̃_{OO})^{−1} = Ω̃_{OO} − Ω̃_{OL} Ω̃_{LL}^{−1} Ω̃_{LO}.   (3.1)
Since we can only observe X_O, the marginal precision matrix Ω* is generally not sparse. We define S* := Ω̃_{OO} and L* := −Ω̃_{OL} Ω̃_{LL}^{−1} Ω̃_{LO}. Then S* is sparse due to the sparsity of Ω̃. We do not impose any dependency restriction on X_O and X_L, and thus the matrices Ω̃_{OL} and Ω̃_{LO} can be potentially dense. We assume that the number of latent variables is smaller than that of the observed variables. Therefore, L* is low-rank and may be dense. Thus, the precision matrix of LVGGM can be written as
Ω* = S* + L*,   (3.2)
where ‖S*‖_{0,0} = s* and rank(L*) = r. We refer to [9] for a detailed discussion of LVGGM.
3.2 The Proposed Estimator
Suppose that we observe i.i.d. samples X1, . . . , Xn from N(0, Σ*).
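As a quick numerical sanity check of the Schur-complement identity (3.1), the following minimal NumPy sketch (dimensions and matrices are arbitrary choices for illustration, not from the paper) builds a random positive definite joint precision matrix and compares the two sides:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # hypothetical sizes: d observed variables, r latent variables

# A random positive definite joint precision matrix over (X_O, X_L)
A = rng.standard_normal((d + r, d + r))
Omega_joint = A @ A.T + (d + r) * np.eye(d + r)
Sigma_joint = np.linalg.inv(Omega_joint)

# Marginal precision of X_O: invert the top-left block of the joint covariance
Omega_marginal = np.linalg.inv(Sigma_joint[:d, :d])

# Schur complement (3.1): Omega_OO - Omega_OL Omega_LL^{-1} Omega_LO
O_OO = Omega_joint[:d, :d]
O_OL = Omega_joint[:d, d:]
O_LL = Omega_joint[d:, d:]
schur = O_OO - O_OL @ np.linalg.inv(O_LL) @ O_OL.T

print(np.allclose(Omega_marginal, schur))  # True
```

The correction term O_OL O_LL^{-1} O_OL.T has rank at most r, which is exactly the low-rank structure that the factorization L = ZZ⊤ later exploits.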
Our goal is to estimate the sparse component S* and the low-rank component L* of the unknown precision matrix Ω* in (3.2). The negative log-likelihood of the Gaussian graphical model is proportional, up to a constant, to the following sample loss function
p_n(S, L) = tr[Σ̂(S + L)] − log|S + L|,   (3.3)
where Σ̂ = (1/n) Σ_{i=1}^n X_i X_i⊤ is the sample covariance matrix, and |S + L| is the determinant of Ω = S + L. We employ the maximum likelihood principle to estimate S* and L*, which is equivalent to minimizing the negative log-likelihood in (3.3).
The low-rank structure of the component L of the precision matrix poses a great challenge for computation. A typical way is to use a nuclear-norm regularized estimator, or a rank constrained estimator, to estimate L. However, such estimators involve a singular value decomposition at each iteration, which is computationally very expensive. To overcome this computational obstacle, we reparameterize L as the product of smaller matrices. More specifically, due to the symmetry of L, it can be reparameterized as L = ZZ⊤, where Z ∈ R^{d×r} and r > 0 is the number of latent variables and is a tuning parameter. This kind of reparametrization has recently been used in low-rank matrix estimation based on matrix factorization [18, 16, 44, 45, 11, 30].
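The partial gradients of the factored loss follow from the standard matrix-calculus identities ∇_Ω tr(Σ̂Ω) = Σ̂ and ∇_Ω log|Ω| = Ω^{−1}, giving ∇_S q_n = Σ̂ − (S + ZZ⊤)^{−1} and ∇_Z q_n = 2(Σ̂ − (S + ZZ⊤)^{−1})Z. A small finite-difference check (all dimensions and test points below are hypothetical) confirms these formulas:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 6, 2
# Hypothetical test point: S symmetric with S + Z Z^T positive definite
S = 0.1 * rng.standard_normal((d, d))
S = 0.5 * (S + S.T) + d * np.eye(d)
Z = rng.standard_normal((d, r))
Sigma_hat = np.eye(d)  # stands in for a sample covariance matrix

def q(S, Z):
    """Sample loss (3.4): tr(Sigma_hat (S + Z Z^T)) - log det(S + Z Z^T)."""
    Omega = S + Z @ Z.T
    sign, logdet = np.linalg.slogdet(Omega)
    return np.trace(Sigma_hat @ Omega) - logdet

def gradients(S, Z):
    G = Sigma_hat - np.linalg.inv(S + Z @ Z.T)  # gradient with respect to S
    return G, 2.0 * G @ Z                       # gradient with respect to Z

gS, gZ = gradients(S, Z)
eps = 1e-6
E = np.zeros((d, d)); E[1, 2] = eps             # perturb one entry of S
fd_S = (q(S + E, Z) - q(S - E, Z)) / (2 * eps)
F = np.zeros((d, r)); F[3, 1] = eps             # perturb one entry of Z
fd_Z = (q(S, Z + F) - q(S, Z - F)) / (2 * eps)
print(abs(fd_S - gS[1, 2]), abs(fd_Z - gZ[3, 1]))  # both tiny (finite-difference error only)
```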
Then we can rewrite the sample loss function in (3.3) as the following objective function
q_n(S, Z) = tr[Σ̂(S + ZZ⊤)] − log|S + ZZ⊤|.   (3.4)
Based on (3.4), we propose a nonconvex estimator using sparsity constrained maximum likelihood:
min_{S,Z} q_n(S, Z)  subject to  ‖S‖_{0,0} ≤ s,   (3.5)
where s > 0 is a tuning parameter that controls the sparsity of S.
3.3 The Proposed Algorithm
Due to the matrix factorization based reparameterization L = ZZ⊤, the objective function in (3.5) is nonconvex. In addition, the sparsity constraint in (3.5) is nonconvex as well. Therefore, the estimation in (3.5) is essentially a nonconvex optimization problem. We propose to solve it by alternately performing gradient descent with respect to one parameter matrix with the other one fixed. The detailed algorithm is displayed in Algorithm 1, which consists of two stages.
The initialization stage (Stage I) outputs initial points Ŝ(0), Ẑ(0), which, as we will show later, are guaranteed to fall in a small neighborhood of S* and Z* respectively. Note that we need to compute a matrix inversion in Line 3, whose complexity is O(d³). Nevertheless, we only need to do this inversion once. In sharp contrast, convex relaxation approaches need to compute a full SVD with O(d³) complexity at each iteration, which is much more time consuming than our approach.
In the alternating gradient descent stage (Stage II), we iteratively update S while fixing Z, and then update Z while fixing S. Instead of solving each subproblem exactly, we perform one-step gradient descent for S and Z alternately, using step sizes η and η′. In Lines 6 and 8 of Algorithm 1, ∇_S q_n(S, Z) and ∇_Z q_n(S, Z) denote the partial gradients of q_n(S, Z) with respect to S and Z respectively. The choice of the step sizes will be made clear by our theory. In practice, one can also use line search to choose the step sizes.
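The two alternating updates, together with the hard-thresholding step of Line 7 of Algorithm 1, can be sketched in code. This is an illustrative NumPy sketch under simplifying assumptions (hand-picked step sizes, a symmetric eigendecomposition in place of the SVD of the symmetric residual, no cross-validation), not the authors' MATLAB implementation:

```python
import numpy as np

def hard_threshold(M, s):
    """Keep the s largest-magnitude entries of M, zero the rest (exact ties may keep a few more)."""
    if M.size <= s:
        return M.copy()
    cutoff = np.sort(np.abs(M), axis=None)[-s]
    return np.where(np.abs(M) >= cutoff, M, 0.0)

def altgd(Sigma_hat, r, s, eta, eta_z, T):
    """Sketch of alternating thresholded gradient descent for
    min tr(Sigma_hat (S + Z Z^T)) - log|S + Z Z^T|  s.t.  ||S||_0 <= s."""
    # Stage I: initialize from the hard-thresholded inverse sample covariance
    Omega0 = np.linalg.inv(Sigma_hat)
    S = hard_threshold(Omega0, s)
    w, U = np.linalg.eigh(Omega0 - S)       # symmetric eigendecomposition of the residual
    top = np.argsort(w)[::-1][:r]
    Z = U[:, top] * np.sqrt(np.clip(w[top], 0.0, None))
    # Stage II: one gradient step in S (plus thresholding), one in Z, per iteration
    for _ in range(T):
        G = Sigma_hat - np.linalg.inv(S + Z @ Z.T)  # shared gradient term
        S, Z = hard_threshold(S - eta * G, s), Z - eta_z * (2.0 * G @ Z)
    return S, Z
```

Both updates inside the loop are evaluated at the same iterate (Ŝ(t), Ẑ(t)), matching Lines 6–8 of Algorithm 1.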
Due to the sparsity constraint ‖S‖_{0,0} ≤ s, we apply hard thresholding [4] right after the gradient descent step for S, in Line 7 of Algorithm 1. For a matrix S ∈ R^{d×d} and an integer s, the hard thresholding operator HT_s(S) preserves the s entries of S with the largest magnitudes and sets the remaining entries to zero. Algorithm 1 involves neither a singular value decomposition at each iteration nor the exact solution of an optimization subproblem, which makes it much faster than the convex relaxation based algorithms [9, 24]. The computational overhead of Algorithm 1 mainly comes from the calculation of the partial gradient with respect to Z, whose time complexity is O(rd²). Therefore, our algorithm has a per-iteration complexity of O(rd²).
4 Main Theory
We present our main theory in this section, which characterizes the convergence rate of Algorithm 1 and the statistical rate of its output. We begin with some definitions and assumptions that are necessary for our theoretical analysis.
Assumption 4.1. There is a constant ν > 0 such that 0 < 1/ν ≤ λ_min(Σ*) ≤ λ_max(Σ*) ≤ ν < ∞, where λ_min(Σ*) and λ_max(Σ*) are the minimal and maximal eigenvalues of Σ* respectively.
Assumption 4.1 requires the eigenvalues of the true covariance matrix Σ* to be finite and bounded away from zero, which is a standard assumption for Gaussian graphical models [29, 21, 28]. The relation Ω* = (Σ*)^{−1} between the covariance matrix and the precision matrix immediately yields 1/ν ≤ λ_min(Ω*) ≤ λ_max(Ω*) ≤ ν.
It is well understood that the estimation problem of the decomposition Ω* = S* + L* can be ill-posed: an identifiability issue arises when the low-rank matrix L* is also sparse [10, 7].
The concept of an incoherence condition, which was originally proposed for matrix completion [7], has been adopted in [9, 10]; it ensures that the low-rank matrix is not too sparse by restricting the degree of coherence between its singular vectors and the standard basis. Later work such as [1, 25] relaxed this condition to a constraint on the spikiness ratio, and showed that the spikiness condition is milder than the incoherence condition. In our theory, we use the notion of spikiness as follows.
Assumption 4.2 (Spikiness Condition [25]). For a matrix L ∈ R^{d×d}, the spikiness ratio is defined as α_sp(L) := d‖L‖_{∞,∞}/‖L‖_F. For the low-rank matrix L* in (3.2), we assume that there exists a constant α* > 0 such that
‖L*‖_{∞,∞} = α_sp(L*)·‖L*‖_F / d ≤ α*/d.   (4.1)
Since rank(L*) = r, we define σ_max = σ_1(L*) > 0 and σ_min = σ_r(L*) > 0 to be the maximal and minimal nonzero singular values of L* respectively. We observe that the decomposition of the low-rank matrix L* in Section 3.2 is not unique, since we have L* = (Z*U)(Z*U)⊤ for any r × r orthogonal matrix U.

Algorithm 1 Alternating Thresholded Gradient Descent (AltGD) for LVGGM
1: Input: i.i.d. samples X1, . . . , Xn from LVGGM, max number of iterations T, and parameters η, η′, r, s.
Stage I: Initialization
2: Σ̂ = (1/n) Σ_{i=1}^n X_i X_i⊤.
3: Ŝ(0) = HT_s(Σ̂^{−1}), which preserves the s largest magnitudes of Σ̂^{−1}.
4: Compute the SVD: Σ̂^{−1} − Ŝ(0) = UDU⊤, where D is a diagonal matrix. Let Ẑ(0) = U D_r^{1/2}, where D_r is the first r columns of D.
Stage II: Alternating Gradient Descent
5: for t = 0, . . . , T − 1 do
6:   Ŝ(t+0.5) = Ŝ(t) − η ∇_S q_n(Ŝ(t), Ẑ(t));
7:   Ŝ(t+1) = HT_s(Ŝ(t+0.5)), which preserves the s largest magnitudes of Ŝ(t+0.5);
8:   Ẑ(t+1) = Ẑ(t) − η′ ∇_Z q_n(Ŝ(t), Ẑ(t));
9: end for
10: Output: Ŝ(T), Ẑ(T).
Thus, we define the following solution set for Z:
U = { Z̃ ∈ R^{d×r} | Z̃ = Z*U for some U ∈ R^{r×r} with UU⊤ = U⊤U = I_r }.   (4.2)
Note that σ_1(Z̃) = √σ_max and σ_r(Z̃) = √σ_min for any Z̃ ∈ U.
To measure the closeness between our estimator for Z and the unknown parameter Z*, we use the following distance d(·,·), which is invariant to rotation. Similar definitions have been used in [45, 30, 40].
Definition 4.3. Define the distance between Z and Z* as d(Z, Z*) = min_{Z̃∈U} ‖Z − Z̃‖_F, where U is the solution set defined in (4.2).
At the core of our proof technique is the first-order stability condition on the population loss function. In detail, the population loss function is defined as the expectation of the sample loss function in (3.3):
p(S, L) = tr(Σ*(S + L)) − log|S + L|.   (4.3)
For ease of presentation, we define two balls around S* and Z* respectively: B_F(S*, R) = {S ∈ R^{d×d} : ‖S − S*‖_F ≤ R} and B_d(Z*, R) = {Z ∈ R^{d×r} : d(Z, Z*) ≤ R}. The first-order stability condition is stated as follows.
Condition 4.4 (First-order Stability). Suppose S ∈ B_F(S*, R) and Z ∈ B_d(Z*, R) for some R > 0; by definition we have L = ZZ⊤ and L* = Z*Z*⊤. The gradient of the population loss function with respect to S satisfies
‖∇_S p(S, L) − ∇_S p(S, L*)‖_F ≤ γ_2 · ‖L − L*‖_F.
The gradient of the population loss function with respect to L satisfies
‖∇_L p(S, L) − ∇_L p(S*, L)‖_F ≤ γ_1 · ‖S − S*‖_F,
where γ_1, γ_2 > 0 are constants.
Condition 4.4 requires that the population loss function has a variant of Lipschitz continuity of the gradient. Note that the gradient is taken with respect to one variable (S or L), while the Lipschitz continuity is with respect to the other variable.
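The minimization over rotations in Definition 4.3 has a closed form via the orthogonal Procrustes problem: for M = Z*⊤Z with SVD M = AΣB⊤, the minimizer is U = AB⊤. A small illustrative sketch (all inputs below are arbitrary, not from the paper):

```python
import numpy as np

def rotation_distance(Z, Z_star):
    """d(Z, Z*) = min over orthogonal U of ||Z - Z* U||_F (orthogonal Procrustes)."""
    A, _, Bt = np.linalg.svd(Z_star.T @ Z)
    U = A @ Bt                      # optimal orthogonal matrix aligning Z* with Z
    return np.linalg.norm(Z - Z_star @ U)

rng = np.random.default_rng(3)
Z_star = rng.standard_normal((20, 3))
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # a random orthogonal matrix
print(rotation_distance(Z_star @ Q, Z_star))      # numerically zero: rotations do not count
```

Since U = I is feasible, the distance is always at most ‖Z − Z*‖_F, which is why this metric only removes the rotational ambiguity and nothing more.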
Also, the Lipschitz property is defined only between the true parameters S*, L* and arbitrary elements S ∈ B_F(S*, R) and L = ZZ⊤ such that Z ∈ B_d(Z*, R). It should be noted that Condition 4.4, as verified in the appendix, is inspired by a similar condition originally introduced in [2]. We extend it to the loss function of LVGGM with both sparse and low-rank structures, where it plays an important role in the analysis.
The following theorem characterizes the theoretical properties of Algorithm 1.
Theorem 4.5. Suppose Assumptions 4.1 and 4.2 hold. Assume that the sample size satisfies n ≥ 484‖Ω*‖_{1,1}ν²rs* log d/(25R²σ_min) and that the sparsity of the unknown sparse matrix satisfies s* ≤ 25d²R²σ_min/(121rα*²), where R = min{√σ_max/4, 1/(2ν), √σ_min/(6.5ν²)}. Then with probability at least 1 − C/d, the initial points Ŝ(0), Ẑ(0) obtained by the initialization stage of Algorithm 1 satisfy
‖Ŝ(0) − S*‖_F ≤ R,  d(Ẑ(0), Z*) ≤ R,   (4.4)
where C > 0 is an absolute constant. Furthermore, suppose Condition 4.4 holds. Let the step sizes satisfy η ≤ C₀/(σ_max ν²) and η′ ≤ C₀σ_min/(σ_max ν⁴), and let the sparsity parameter satisfy s ≥ (4(1/(2√ρ) − 1)^{−2} + 1)s*, where C₀ > 0 is a constant that can be chosen arbitrarily small. Let ρ and τ be
ρ = max{1 − η/ν², 1 − η′σ²_min/(σ_max ν⁴)},  τ = max{48C²ν² · s* log d/n, 32C₀²σ²_min/(σ_max ν⁶) · rd/n}.
Then for any t ≥ 1, with probability at least 1 − C₁/d, the output of Algorithm 1 satisfies
max{ ‖Ŝ(t+1) − S*‖²_F , d²(Ẑ(t+1), Z*) } ≤ τ/(1 − √ρ) + (√ρ)^{t+1} · R,   (4.5)
where the first term on the right-hand side is the statistical error, the second is the optimization error, and C₁ > 0 is an absolute constant.
In Theorem 4.5, ρ is the contraction parameter of the linear convergence rate, and it depends on the step size η. Therefore, we can always choose a sufficiently small step size, by choosing a small enough C₀, such that ρ is strictly between 0 and 1.
Remark 4.6. (4.4) suggests that, in order to ensure that the initial points returned by the initialization stage of Algorithm 1 fall in small neighborhoods of S* and Z*, we require n = O(s* log d), which essentially attains the optimal sample complexity for LVGGM estimation. In addition, we require s* ≲ d²/(rα*²), which means the unknown sparse matrix cannot be too dense.
Remark 4.7. (4.5) suggests that the estimation error of the output of Algorithm 1 consists of two terms: the first is the statistical error, and the second is the optimization error. The statistical error comes from τ and scales as max{O_P(√(s* log d/n)), O_P(√(rd/n))}, where O_P(√(s* log d/n)) corresponds to the statistical error of S* and O_P(√(rd/n)) to that of L*.¹ This matches the minimax optimal rate of estimation errors in Frobenius norm for LVGGM estimation [9, 1, 24]. For the optimization error, note that σ_max and σ_min are fixed constants.
For a sufficiently small constant C₀, we can always ensure ρ < 1, which establishes the linear convergence rate of Algorithm 1. In fact, after T ≥ max{O(log(ν⁴n/(s* log d))), O(log(ν⁶n/(rd)))} iterations, the total estimation error of our algorithm achieves the same order as the statistical error.
Remark 4.8. Our statistical rate is sharp, because our theoretical analysis is conducted uniformly over a neighborhood of the true parameters S* and Z*, rather than relying on sample splitting. This is another big advantage of our approach over existing algorithms which are also built upon first-order stability [2, 36] but rely on the sample splitting technique.
5 Experiments
In this section, we present numerical results on both synthetic and real datasets to verify the theoretical properties of our algorithm, and compare it with the state-of-the-art methods. Specifically, we compare our method, denoted by AltGD, with two convex relaxation based methods for estimating LVGGM: (1) LogdetPPA [9, 32] for solving log-determinant semidefinite programs, denoted by PPA, and (2) the alternating direction method of multipliers in [22, 24], denoted by ADMM. We also considered variants of the convex methods that use the randomized SVD method [15] in each iteration. However, the randomized SVD method still needs to compute a full SVD for nuclear norm regularization, and in our experiments we found it to be slower than the full SVD method implemented in [22]. Thus, we only report the results of the original convex relaxations in [9, 32, 22, 24]. The implementations of these two methods were downloaded from the authors' websites.
¹While the derived error bound in (4.5) is for Ẑ(t), it is of the same order as the error bound of L̂(t) by definition.
All numerical experiments were run in MATLAB R2015b on a laptop with an Intel Core i5 2.7 GHz CPU and 8GB of RAM.
5.1 Synthetic Data
In the synthetic experiments, we first validate the performance of our method on the latent variable GGM. We then show that our method also performs well on a more general GGM where the precision matrix is the sum of an arbitrary sparse matrix S* and an arbitrary low-rank matrix L*. Specifically, we generated data according to the following two schemes:
• Scheme I: we generated data from the latent variable GGM defined in Section 3.1. In detail, the dimension of the observed data is d and the number of latent variables is r. We randomly generated a sparse positive definite matrix Ω̃ ∈ R^{(d+r)×(d+r)} with sparsity s* = 0.02d². According to (3.1), the sparse component of the precision matrix is S* := Ω̃_{1:d,1:d} and the low-rank component is L* := −Ω̃_{1:d,(d+1):(d+r)}[Ω̃_{(d+1):(d+r),(d+1):(d+r)}]^{−1}Ω̃_{(d+1):(d+r),1:d}. We then sampled data X1, . . . , Xn from the distribution N(0, (Ω*)^{−1}), where Ω* = S* + L* is the true precision matrix.
• Scheme II: the dimension of the observed data is d and the number of latent variables is r. S* is a symmetric positive definite matrix with entries randomly generated from [−1, 1] and sparsity s* = 0.05d². L* = Z*Z*⊤, where Z* ∈ R^{d×r} has entries randomly generated from [−1, 1]. We then sampled data X1, . . . , Xn from the multivariate normal distribution N(0, (Ω*)^{−1}) with Ω* = S* + L* being the true precision matrix.
Table 1: Scheme I: estimation errors of the sparse and low-rank components S* and L* as well as the true precision matrix Ω* in terms of Frobenius norm on different synthetic datasets. Data were generated
Data were generated\nfrom LVGGM and results were reported on 10 replicates in each setting.\n\nkbL(T ) L\u21e4kF\n0.0170\u00b10.0125\n0.0224\u00b10.0115\n0.0113\u00b10.0014\n0.0195\u00b10.0046\n0.0294\u00b10.0041\n0.0125\u00b10.0000\n0.0224\u00b10.0034\n0.0356\u00b10.0033\n0.0167\u00b10.0030\n0.0371\u00b10.0052\n0.0442\u00b10.0068\n0.0208\u00b10.0014\n\nkb\u2326(T ) \u2326\u21e4kF\n0.7350\u00b10.0359\n0.7563\u00b10.0298\n0.6236\u00b10.0669\n0.9813\u00b10.0192\n1.0610\u00b10.0134\n0.8210\u00b10.0143\n1.1639\u00b10.0179\n1.1869\u00b10.0254\n0.9021\u00b10.0244\n1.4824\u00b10.0120\n1.5012\u00b10.0240\n1.3449\u00b10.0084\n\nTime (s)\n1.1610\n1.1120\n0.0250\n35.7220\n25.8010\n0.4800\n356.7360\n156.5550\n7.4740\n\nSetting\n\nd = 100, r = 2, n =\n2000\n\nd = 500, r = 5, n =\n10000\n\nMethod\n\nkbS(T ) S\u21e4kF\n0.7335\u00b10.0352\nPPA\nADMM 0.7521\u00b10.0288\n0.6241\u00b10.0668\nAltGD\n0.9803\u00b10.0192\nPPA\nADMM 1.0571\u00b10.0135\nAltGD\n0.8212\u00b10.0143\n1.1620\u00b10.0177\nPPA\nADMM 1.1867\u00b10.0253\nAltGD\n0.9016\u00b10.0245\nPPA\n1.4822\u00b10.0302\nADMM 1.5010\u00b10.0240\n1.3449\u00b10.0073\nAltGD\n\nd = 1000, r = 8, n =\n2.5 \u21e5 104\nd = 5000, r = 10, n =\n2 \u21e5 105\nIn both schemes, we conducted experiments in different settings of d, n, s\u21e4 and r. The step sizes\nof our method were set as \u2318 = c1/(max\u232b2) and \u23180 = c1min/(max\u232b4) according to Theorem 4.5,\nwhere c1 = 0.25. The thresholding parameter s is set as c2s\u21e4, where c2 > 1 was selected by 4-fold\ncross-validation. The regularization parameters for `1,1-norm and nuclear norm in PPA and ADMM\nand the tuning parameter r in our algorithm were selected by 4-fold cross-validation. Under both\nschemes, we repeatedly generated 10 datasets for each setting of d, n, s\u21e4 and r\u21e4, and calculated the\nmean and standard error of the estimation error. We summarize the results of Scheme I over 10\nreplications in Table 1. 
Note that our algorithm AltGD outputs a slightly better estimator in terms of estimation errors compared with PPA and ADMM. It should also be noted that the errors do not differ much, because the statistical rates of the three methods are of the same order. To demonstrate the efficiency of our algorithm, we also report the mean CPU time in the last column of Table 1. We observe significant speed-ups brought by our algorithm, which is almost 50 times faster than the convex ones (for d = 5000, PPA, ADMM, and AltGD take 33522.02 s, 21090.79 s, and 445.67 s respectively). In particular, when the dimension d scales up to several thousands, the computation of the SVD in PPA and ADMM takes enormous time and their computational cost increases dramatically. We report the averaged results of Scheme II over 10 repetitions in Table 2. Again, it can be seen that our method AltGD achieves comparable or slightly better estimators in terms of estimation errors in Frobenius norm compared against PPA and ADMM, and is nearly 50 times faster than the other two convex-relaxation based methods.
Table 2: Scheme II: estimation errors of the sparse and low-rank components S* and L* as well as the true precision matrix Ω* in terms of Frobenius norm on different synthetic datasets. Data were generated from a multivariate normal distribution whose precision matrix is the sum of an arbitrary sparse matrix and an arbitrary low-rank matrix.
Results were reported on 10 replicates in each setting.

Setting                        Method   ‖Ŝ(T)−S*‖_F     ‖L̂(T)−L*‖_F     ‖Ω̂(T)−Ω*‖_F     Time (s)
d = 100, r = 2, n = 2000       PPA      0.5710±0.0319   0.6231±0.0261   0.8912±0.0356       1.6710
                               ADMM     0.6198±0.0361   0.5286±0.0308   0.8588±0.0375       1.2790
                               AltGD    0.5639±0.0905   0.4824±0.0323   0.7483±0.0742       0.0460
d = 500, r = 5, n = 10000      PPA      0.8140±0.0157   0.7802±0.0104   1.1363±0.0131      43.8000
                               ADMM     0.8140±0.0157   0.7803±0.0104   1.1363±0.0131      25.8980
                               AltGD    0.6139±0.0198   0.7594±0.0111   0.9718±0.0146       0.8690
d = 1000, r = 8, n = 2.5×10^4  PPA      0.9235±0.0193   1.1985±0.0084   1.4913±0.0162     487.4900
                               ADMM     0.9209±0.0212   1.2131±0.0084   1.4975±0.0171     163.9350
                               AltGD    0.7249±0.0158   0.9651±0.0093   1.2029±0.0141       7.1360
d = 5000, r = 10, n = 2×10^5   PPA      1.1883±0.0091   1.0970±0.0022   1.3841±0.0083   44098.6710
                               ADMM     1.2846±0.0089   1.1568±0.0023   1.5324±0.0085   20393.3650
                               AltGD    1.0681±0.0034   1.0685±0.0023   1.2068±0.0032     287.8630

In addition, we illustrate the convergence rate of our algorithm in Figures 1(a) and 1(b), where the x-axis is the iteration number and the y-axis is the estimation error in Frobenius norm. Our algorithm converges within dozens of iterations, which confirms our theoretical guarantee of a linear convergence rate. We also plot the overall estimation errors against the scaled statistical errors of Ŝ(T) and L̂(T) under different configurations of d, n, s* and r in Figures 1(c) and 1(d). According to Theorem 4.5, ‖Ŝ(t) − S*‖_F and ‖L̂(t) − L*‖_F converge to the statistical errors as the number of iterations t grows, and these errors are of order O(√(s* log d / n)) and O(√(rd / n)), respectively.
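To make the quantities on the axes of Figures 1(c) and 1(d) concrete, the scaled statistical errors from Theorem 4.5 can be computed as follows (a trivial helper we add for illustration; constants are dropped):

```python
import math

def scaled_rates(d, n, s_star, r):
    """Scaled statistical errors (up to constants): sqrt(s* log d / n) for the
    sparse component and sqrt(r d / n) for the low-rank component."""
    return math.sqrt(s_star * math.log(d) / n), math.sqrt(r * d / n)

# e.g. the d = 500, n = 10000, r = 5 setting with s* = 0.02 * d^2
sparse_rate, lowrank_rate = scaled_rates(500, 10000, int(0.02 * 500 ** 2), 5)
```

Plotting the final estimation errors against these quantities should trace an approximately straight line if the theory holds, which is what Figures 1(c) and 1(d) show.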
We can see that the estimation errors grow linearly with the theoretical rates, which validates our theoretical guarantee of the minimax optimal statistical rate.

[Figure 1: four panels. (a) Estimation error for S* and (b) estimation error for L*, with curves for (d, n, r) ∈ {(100, 1000, 2), (500, 10000, 5), (1000, 25000, 8)}; (c) r fixed and varying n, d and s*, with curves for r = 2, 5, 7, 10; (d) s* fixed and varying n, d and r, with curves for s* = 300, 400, 500, 600.]

Figure 1: (a)-(b): Evolution of the estimation errors as the number of iterations t grows, with the sparsity parameter s* set as 0.02 × d² and varying d, n and r. (c)-(d): Estimation errors ‖Ŝ(T) − S*‖_F and ‖L̂(T) − L*‖_F versus the scaled statistical errors √(s* log d / n) and √(rd / n).

5.2 Genomic Data
In this subsection, we apply our method to TCGA breast cancer gene expression data to infer regulatory networks. We downloaded the gene expression data from cBioPortal². We focused on 299 breast cancer related transcription factors (TFs) and estimated the regulatory relationships for each pair of TFs over two breast cancer subtypes: luminal and basal.
We compared our method AltGD with ADMM and PPA, which are all based on the LVGGM. We also compared it with the graphical Lasso (GLasso), which only considers the sparse structure of the precision matrix and ignores the latent variables; we chose QUIC³ to solve the GLasso estimator. As the benchmark standard, we used the "regulatory potential scores" between pairs of genes (a TF and a target gene) for these two breast cancer subtypes, compiled from both co-expression and TF ChIP-seq binding data in the Cistrome Cancer Database⁴.

For the luminal subtype, there are n = 601 samples and d = 299 TFs. The regularization parameter for the ℓ1,1 norm in GLasso and those for the ℓ1,1 norm and the nuclear norm in PPA and ADMM were tuned by grid search. The step sizes of AltGD were set as η = 0.1/ν̂² and η′ = 0.1/ν̂⁴, where ν̂ is the maximal eigenvalue of the sample covariance matrix. The thresholding parameter s and the number of latent variables r were tuned by grid search. Table 3 reports the CPU time of each method.

Table 3: Summary of CPU time of different methods on the luminal subtype breast cancer dataset.
Method      PPA       ADMM     GLasso    AltGD
Time (s)    85.0100   7.6700   38.6310   0.1500

² http://www.cbioportal.org/
Importantly, we can see that AltGD is the fastest among all the methods and is more than 50 times faster than the second fastest method, ADMM.

[Figure 2: four panels showing the subnetworks recovered by (a) GLasso, (b) PPA, (c) ADMM and (d) AltGD over the TFs ELF5, SUMO2, MXI1, HDAC2, H2AFX, SREBF2, ATF4, POLR2B, IRF4 and SF1.]

Figure 2: An example of a subnetwork in the transcriptional regulatory network of luminal breast cancer. Gray edges are the interactions from the Cistrome Cancer Database; red edges are the ones inferred correctly by the respective methods; green edges are incorrectly inferred interactions.

To demonstrate the performance of the different methods in recovering the overall transcriptional regulatory network, we randomly selected 10 TFs in the benchmark network and plotted in Figure 2 the subnetwork, which has 70 edges with nonzero regulatory potential scores. Specifically, the gray edges form the benchmark network, the red edges are those identified correctly, and the green edges are those inferred incorrectly by each method. We can observe from Figure 2 that the LVGGM-based methods recover more edges accurately than the graphical Lasso because they account for the latent variables.
We remark that none of the methods was able to completely recover the entire regulatory network, partly because we only used the gene expression data for TFs (instead of all genes), and the regulatory potential scores from the Cistrome Cancer Database also use TF binding information. Due to the space limit, we defer additional experimental results to the appendix.

6 Conclusions
In this paper, to speed up the learning of LVGGM, we proposed a sparsity constrained maximum likelihood estimator based on matrix factorization. We developed an efficient alternating gradient descent algorithm, and proved that the proposed algorithm is guaranteed to converge to the unknown sparse and low-rank matrices at a linear rate, up to the optimal statistical error. Experiments on both synthetic and real world genomic data support our theory.

Acknowledgements
We would like to thank the anonymous reviewers for their helpful comments. This research was sponsored in part by the National Science Foundation IIS-1652539, IIS-1717205 and IIS-1717206. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.

³ http://www.cs.utexas.edu/~sustik/QUIC/
⁴ http://cistrome.org/CistromeCancer/

References
[1] Alekh Agarwal, Sahand Negahban, and Martin J Wainwright. Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. The Annals of Statistics, pages 1171–1197, 2012.
[2] Sivaraman Balakrishnan, Martin J Wainwright, and Bin Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. arXiv preprint arXiv:1408.2156, 2014.
[3] Srinadh Bhojanapalli, Anastasios Kyrillidis, and Sujay Sanghavi. Dropping convexity for faster semi-definite optimization. arXiv preprint, 2015.
[4] Thomas Blumensath and Mike E Davies.
Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.
[5] T Tony Cai, Hongzhe Li, Weidong Liu, and Jichun Xie. Covariate-adjusted precision matrix estimation with an application in genetical genomics. Biometrika, page ass058, 2012.
[6] Tony Cai, Weidong Liu, and Xi Luo. A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594–607, 2011.
[7] Emmanuel Candes and Benjamin Recht. Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111–119, 2012.
[8] Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):11, 2011.
[9] Venkat Chandrasekaran, Pablo A Parrilo, and Alan S Willsky. Latent variable graphical model selection via convex optimization. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pages 1610–1613. IEEE, 2010.
[10] Venkat Chandrasekaran, Sujay Sanghavi, Pablo A Parrilo, and Alan S Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596, 2011.
[11] Yudong Chen and Martin J Wainwright. Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. arXiv preprint arXiv:1509.03025, 2015.
[12] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
[13] Gene H Golub and Charles F Van Loan. Matrix Computations, volume 3. JHU Press, 2012.
[14] Quanquan Gu, Zhaoran Wang, and Han Liu. Low-rank and sparse structure pursuit via alternating minimization.
In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 600–609, 2016.
[15] Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
[16] Moritz Hardt. Understanding alternating minimization for matrix completion. In FOCS, pages 651–660. IEEE, 2014.
[17] Jianhua Z Huang, Naiping Liu, Mohsen Pourahmadi, and Linxu Liu. Covariance matrix selection and estimation via penalised normal likelihood. Biometrika, 93(1):85–98, 2006.
[18] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In STOC, pages 665–674, 2013.
[19] Steffen L Lauritzen. Graphical Models, volume 17. Clarendon Press, 1996.
[20] Xingguo Li, Tuo Zhao, Raman Arora, Han Liu, and Jarvis Haupt. Stochastic variance reduced optimization for nonconvex sparse learning, 2016.
[21] Han Liu, John Lafferty, and Larry Wasserman. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research, 10(Oct):2295–2328, 2009.
[22] Shiqian Ma, Lingzhou Xue, and Hui Zou. Alternating direction methods for latent variable Gaussian graphical model selection. Neural Computation, 25(8):2172–2198, 2013.
[23] Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, pages 1436–1462, 2006.
[24] Zhaoshi Meng, Brian Eriksson, and Alfred O Hero III. Learning latent variable Gaussian graphical models. arXiv preprint arXiv:1406.2721, 2014.
[25] Sahand Negahban and Martin J Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise.
Journal of Machine Learning Research, 13(May):1665–1697, 2012.
[26] Sahand Negahban, Bin Yu, Martin J Wainwright, and Pradeep K Ravikumar. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems, pages 1348–1356, 2009.
[27] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
[28] Pradeep Ravikumar, Martin J Wainwright, Garvesh Raskutti, Bin Yu, et al. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.
[29] Adam J Rothman, Peter J Bickel, Elizaveta Levina, Ji Zhu, et al. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494–515, 2008.
[30] Stephen Tu, Ross Boczar, Mahdi Soltanolkotabi, and Benjamin Recht. Low-rank solutions of linear matrix equations via Procrustes flow. arXiv preprint arXiv:1507.03566, 2015.
[31] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
[32] Chengjing Wang, Defeng Sun, and Kim-Chuan Toh. Solving log-determinant optimization problems by a Newton-CG primal proximal point algorithm. SIAM Journal on Optimization, 20(6):2994–3013, 2010.
[33] Lingxiao Wang and Quanquan Gu. Robust Gaussian graphical model estimation with arbitrary corruption. In International Conference on Machine Learning, pages 3617–3626, 2017.
[34] Lingxiao Wang, Xiang Ren, and Quanquan Gu. Precision matrix estimation in high dimensional Gaussian graphical models with faster rates. In Artificial Intelligence and Statistics, pages 177–185, 2016.
[35] Lingxiao Wang, Xiao Zhang, and Quanquan Gu. A unified computational and statistical framework for nonconvex low-rank matrix estimation.
arXiv preprint arXiv:1610.05275, 2016.
[36] Zhaoran Wang, Quanquan Gu, Yang Ning, and Han Liu. High dimensional expectation-maximization algorithm: Statistical optimization and asymptotic normality. arXiv preprint arXiv:1412.8729, 2014.
[37] Pan Xu and Quanquan Gu. Semiparametric differential graph models. In Advances in Neural Information Processing Systems, pages 1064–1072, 2016.
[38] Pan Xu, Lu Tian, and Quanquan Gu. Communication-efficient distributed estimation and inference for transelliptical graphical models. arXiv preprint arXiv:1612.09297, 2016.
[39] Pan Xu, Tingting Zhang, and Quanquan Gu. Efficient algorithm for sparse tensor-variate Gaussian graphical models via gradient descent. In Artificial Intelligence and Statistics, pages 923–932, 2017.
[40] Xinyang Yi, Dohyung Park, Yudong Chen, and Constantine Caramanis. Fast algorithms for robust PCA via gradient descent. arXiv preprint arXiv:1605.07784, 2016.
[41] Jianxin Yin and Hongzhe Li. A sparse conditional Gaussian graphical model for analysis of genetical genomics data. The Annals of Applied Statistics, 5(4):2630, 2011.
[42] Xiao-Tong Yuan and Tong Zhang. Partial Gaussian graphical model estimation. IEEE Transactions on Information Theory, 60(3):1673–1687, 2014.
[43] Xiao Zhang, Lingxiao Wang, and Quanquan Gu. A nonconvex free lunch for low-rank plus sparse matrix recovery. arXiv preprint arXiv:1702.06525, 2017.
[44] Tuo Zhao, Zhaoran Wang, and Han Liu. A nonconvex optimization framework for low rank matrix estimation. In Advances in Neural Information Processing Systems, pages 559–567, 2015.
[45] Qinqing Zheng and John Lafferty. A convergent gradient descent algorithm for rank minimization and semidefinite programming from random linear measurements.
In Advances in Neural Information Processing Systems, pages 109–117, 2015.