{"title": "Optimization Methods for Sparse Pseudo-Likelihood Graphical Model Selection", "book": "Advances in Neural Information Processing Systems", "page_first": 667, "page_last": 675, "abstract": "Sparse high dimensional graphical model selection is a popular topic in contemporary machine learning. To this end, various useful approaches have been proposed in the context of $\\ell_1$ penalized estimation in the Gaussian framework. Though many of these approaches are demonstrably scalable and have leveraged recent advances in convex optimization, they still depend on the Gaussian functional form. To address this gap, a convex pseudo-likelihood based partial correlation graph estimation method (CONCORD) has been recently proposed. This method uses cyclic coordinate-wise minimization of a regression based pseudo-likelihood, and has been shown to have robust model selection properties in comparison with the Gaussian approach. In direct contrast to the parallel work in the Gaussian setting however, this new convex pseudo-likelihood framework has not leveraged the extensive array of methods that have been proposed in the machine learning literature for convex optimization. In this paper, we address this crucial gap by proposing two proximal gradient methods (CONCORD-ISTA and CONCORD-FISTA) for performing $\\ell_1$-regularized inverse covariance matrix estimation in the pseudo-likelihood framework. We present timing comparisons with coordinate-wise minimization and demonstrate that our approach yields tremendous pay offs for $\\ell_1$-penalized partial correlation graph estimation outside the Gaussian setting, thus yielding the fastest and most scalable approach for such problems. 
We undertake a theoretical analysis of our approach and rigorously demonstrate convergence, and also derive rates thereof.", "full_text": "Optimization Methods for Sparse Pseudo-Likelihood Graphical Model Selection

Sang-Yun Oh
Computational Research Division
Lawrence Berkeley National Lab
syoh@lbl.gov

Kshitij Khare
Department of Statistics
University of Florida
kdkhare@stat.ufl.edu

Onkar Dalal
Stanford University
onkar@alumni.stanford.edu

Bala Rajaratnam
Department of Statistics
Stanford University
brajarat@stanford.edu

Abstract

Sparse high dimensional graphical model selection is a popular topic in contemporary machine learning. To this end, various useful approaches have been proposed in the context of ℓ1-penalized estimation in the Gaussian framework. Though many of these inverse covariance estimation approaches are demonstrably scalable and have leveraged recent advances in convex optimization, they still depend on the Gaussian functional form. To address this gap, a convex pseudo-likelihood based partial correlation graph estimation method (CONCORD) has been recently proposed. This method uses coordinate-wise minimization of a regression based pseudo-likelihood, and has been shown to have robust model selection properties in comparison with the Gaussian approach. In direct contrast to the parallel work in the Gaussian setting however, this new convex pseudo-likelihood framework has not leveraged the extensive array of methods that have been proposed in the machine learning literature for convex optimization. In this paper, we address this crucial gap by proposing two proximal gradient methods (CONCORD-ISTA and CONCORD-FISTA) for performing ℓ1-regularized inverse covariance matrix estimation in the pseudo-likelihood framework. 
We present timing comparisons with coordinate-wise minimization and demonstrate that our approach yields tremendous payoffs for ℓ1-penalized partial correlation graph estimation outside the Gaussian setting, thus yielding the fastest and most scalable approach for such problems. We undertake a theoretical analysis of our approach and rigorously demonstrate convergence, and also derive rates thereof.

1 Introduction

Sparse inverse covariance estimation has received tremendous attention in the machine learning, statistics and optimization communities. These sparse models, popularly known as graphical models, have widespread use in various applications, especially in high dimensional settings. The most popular inverse covariance estimation framework is arguably the ℓ1-penalized Gaussian likelihood optimization framework, given by

minimize_{Ω ∈ S^p_{++}}  −log det Ω + tr(SΩ) + λ‖Ω‖_1,

where S^p_{++} denotes the space of p-dimensional positive definite matrices, and the ℓ1-penalty is imposed on the elements of Ω = (ω_ij)_{1≤i≤j≤p} by the term ‖Ω‖_1 = Σ_{i,j} |ω_ij| along with the scaling factor λ > 0. The matrix S denotes the sample covariance matrix of the data Y ∈ R^{n×p}. As the ℓ1-penalized log likelihood is convex, the problem is tractable and has benefited from advances in convex optimization. Recent efforts in the literature on Gaussian graphical models have therefore focused on developing principled methods that are increasingly scalable. The literature on this topic is enormous; for the sake of brevity, and given the topic of this paper, we avoid an extensive literature review and instead refer to the references in the seminal work of [1] and the very recent work of [2]. 
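For concreteness, the penalized Gaussian objective above is straightforward to evaluate numerically. The following sketch is ours, not from the paper (`gauss_objective` is a hypothetical name), and simply spells out −log det Ω + tr(SΩ) + λ‖Ω‖_1 with numpy:

```python
import numpy as np

def gauss_objective(omega, s, lam):
    # -log det(Omega) + tr(S Omega) + lam * sum_ij |omega_ij|
    sign, logdet = np.linalg.slogdet(omega)
    if sign <= 0:
        raise ValueError('Omega must be positive definite')
    return -logdet + np.trace(s @ omega) + lam * np.abs(omega).sum()

# at Omega = I with S = I the value is 0 + p + lam * p
p = 3
val = gauss_objective(np.eye(p), np.eye(p), 0.1)
```

Using `slogdet` rather than `det` avoids overflow for larger p; the ℓ1 term here runs over all entries of Ω, matching the definition above.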
These two papers contain references to recent work, including past NIPS conference proceedings.

1.1 The CONCORD method

Despite their tremendous contributions, one shortcoming of the traditional approaches to ℓ1-penalized likelihood maximization is the restriction to the Gaussian assumption. To address this gap, a number of ℓ1-penalized pseudo-likelihood approaches have been proposed: SPACE [3], SPLICE [4] and SYMLASSO [5]. These approaches are either not convex, and/or convergence of the corresponding maximization algorithms is not established. In this sense, non-Gaussian partial correlation graph estimation methods have lagged severely behind, despite the tremendous need to move beyond the Gaussian framework for obvious practical reasons. In very recent work, a convex pseudo-likelihood approach with good model selection properties called CONCORD [6] was proposed. The CONCORD algorithm minimizes

Qcon(Ω) = −n Σ_{i=1}^{p} log ω_ii + (1/2) Σ_{i=1}^{p} ‖ω_ii Y_i + Σ_{j≠i} ω_ij Y_j‖_2^2 + nλ Σ_{1≤i<j≤p} |ω_ij|.   (1)

Algorithm 1: CONCORD-ISTA
Input: sample covariance matrix S, penalty Λ
Set: Ω(0) ∈ S^p_+, τ(0,0) ≤ 1, c < 1, Δsubg = 1
while Δsubg > εsubg do
  G(k) = −(Ω(k)_D)^{−1} + (1/2)(SΩ(k) + Ω(k)S)
  Take the largest τk ∈ {c^j τ(k,0)}_{j=0,1,...} such that the update Ω(k+1) = S_{τkΛ}(Ω(k) − τk G(k)) satisfies the line-search condition associated with (8)
  Compute τ(k+1,0)
  Compute Δsubg = ‖∇h1(Ω(k)) + ∂h2(Ω(k))‖ / ‖Ω(k)‖
end while

Algorithm 2: CONCORD-FISTA
Input: sample covariance matrix S, penalty Λ
Set: (Θ(1) =) Ω(0) ∈ S^p_+, α1 = 1, τ(0,0) ≤ 1, c < 1, Δsubg = 1
while Δsubg > εsubg do
  G(k) = −(Θ(k)_D)^{−1} + (1/2)(SΘ(k) + Θ(k)S)
  Take the largest τk ∈ {c^j τ(k,0)}_{j=0,1,...} such that the update Ω(k) = S_{τkΛ}(Θ(k) − τk G(k)) satisfies the line-search condition associated with (8)
  α(k+1) = (1 + √(1 + 4α(k)^2))/2
  Θ(k+1) = Ω(k) + ((α(k) − 1)/α(k+1)) (Ω(k) − Ω(k−1))
  Compute τ(k+1,0)
  Compute Δsubg
end while

2.3 Computational complexity

After the one-time calculation of S, the most significant computation in each iteration of the CONCORD-ISTA and CONCORD-FISTA algorithms is the matrix-matrix multiplication W = SΩ in the gradient term. If s is the number of non-zeros in Ω, then W can be computed using O(sp) operations if we exploit the extreme sparsity in Ω. The second matrix-matrix multiplication, for the term tr(Ω(SΩ)), can be computed efficiently using tr(ΩW) = Σ ω_ij w_ij over the set of non-zero ω_ij's. This computation only requires O(s) operations. The remaining computations are all at the element level and can be completed in O(p^2) operations. Therefore, the overall computational complexity of each iteration reduces to max(O(sp), O(p^2)). On the other hand, the proximal gradient algorithms for the Gaussian framework require inversion of a full p×p matrix, which is non-parallelizable and requires O(p^3) operations. The coordinate-wise method for optimizing CONCORD in [6] also requires cycling through the p^2 entries of Ω in a specified order and thus does not allow parallelization. In contrast, CONCORD-ISTA and CONCORD-FISTA can use 'perfectly parallel' implementations to distribute the above matrix-matrix multiplications. At no step do we need to keep all of the dense matrices S, SΩ, ∇h1 on a single machine. 
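To make the complexity argument concrete, here is a small sketch of the two products involved. It is ours, in Python with scipy.sparse (the paper's implementation uses Eigen in C++): W = SΩ is formed with one sparse-times-dense product, and tr(ΩW) is accumulated only over the nonzero positions of Ω, using the symmetry of Ω:

```python
import numpy as np
from scipy import sparse

p = 200
s = np.eye(p) + 0.01                        # dense stand-in for the sample covariance
omega = sparse.random(p, p, density=0.01, random_state=0, format='csr')
omega = omega + omega.T + sparse.eye(p)     # sparse symmetric Omega

# W = S Omega; since S and Omega are symmetric, S Omega = (Omega S)^T,
# so one sparse-times-dense product suffices: O(s p) for s nonzeros
w = (omega @ s).T

# tr(Omega W) = sum of omega_ij * w_ij over the nonzeros of Omega,
# which is valid because Omega is symmetric: O(s) operations
oc = omega.tocoo()
tr_ow = float(np.sum(oc.data * w[oc.row, oc.col]))

tr_dense = float(np.trace(omega.toarray() @ w))  # O(p^3) dense check
```

The two traces agree; only the sparse path touches O(s) entries rather than all p^2 of them.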
Therefore, CONCORD-ISTA and CONCORD-FISTA are scalable to arbitrarily high dimensions, restricted only by the number of machines.

3 Convergence Analysis

In this section, we prove convergence of the CONCORD-ISTA and CONCORD-FISTA methods along with their respective convergence rates of O(1/k) and O(1/k^2). We would like to point out that, although the authors in [6] provide a proof of convergence for their coordinate-wise minimization algorithm for CONCORD, they do not provide any rates of convergence. The arguments for convergence leverage the results in [7] but require some essential ingredients. We begin by proving lower and upper bounds on the diagonal entries ω_kk for Ω belonging to a level set of Qcon(Ω). The lower bound on the diagonal entries of Ω establishes Lipschitz continuity of the gradient ∇h1(Ω) based on the Hessian of the smooth function as stated in (6). The proof of the lower bound uses the existence of an upper bound on the diagonal entries; hence, we prove both bounds. We begin by defining a level set C0 of the objective function, starting from an arbitrary initial point Ω(0) with a finite function value, as

C0 = {Ω | Qcon(Ω) ≤ Qcon(Ω(0)) = M}.   (10)

For the positive semidefinite matrix S, let U denote 1/√2 times the upper triangular matrix from the LU decomposition of S, such that S = 2U^T U (the factor 2 simplifies further arithmetic). Assuming the diagonal entries of S to be strictly nonzero (if s_kk = 0, then the kth component can be ignored upfront since it has zero variance and is equal to a constant for every data point), we have at least one k such that u_ki ≠ 0 for every i. Using this, we prove the following theorem.

Theorem 3.1. For any symmetric matrix Ω satisfying Ω ∈ C0, the diagonal elements of Ω are bounded above and below by constants which depend only on M, λ and S. In other words,

0 < a ≤ |ω_kk| ≤ b,  for all k = 1, 2, ..., p,

for some constants a and b.

Proof. (a) Upper bound: Suppose |ω_ii| = max{|ω_kk| : k = 1, 2, ..., p}. Then, we have

M = Qcon(Ω(0)) ≥ Qcon(Ω) = h1(Ω) + h2(Ω) ≥ −log det Ω_D + tr((UΩ)^T (UΩ)) + λ‖Ω_X‖_1 = −log det Ω_D + ‖UΩ‖_F^2 + λ‖Ω_X‖_1.   (11)

Considering the (k,i)th entry in the Frobenius norm and the ith column in the third term, we get

M ≥ −p log |ω_ii| + (Σ_{j=k}^{p} u_kj ω_ji)^2 + λ Σ_{j=k, j≠i}^{p} |ω_ji|.   (12)

Now, suppose |u_ki ω_ii| = z and Σ_{j=k, j≠i}^{p} u_kj ω_ji = x. Then

|x| ≤ Σ_{j=k, j≠i}^{p} |u_kj||ω_ji| ≤ ū Σ_{j=k, j≠i}^{p} |ω_ji|,

where ū = max{|u_kj| : j = k, ..., p, j ≠ i}. Substituting in (12), for λ̄ = λ/(2ū), we have

M̄ = M + λ̄^2 − p log |u_ki| ≥ −p log z + (z + x)^2 + 2λ̄|x| + λ̄^2   (13)
= −p log z + (z + x + λ̄ sign(x))^2 − 2λ̄ z sign(x).   (14)

Here, if x ≥ 0, then M̄ ≥ −p log z + z^2 using the first inequality (13), and if x < 0, then M̄ ≥ −p log z + 2λ̄z using the second inequality (14). In either case, the functions −p log z + z^2 and −p log z + 2λ̄z are unbounded as z → ∞. Hence, the upper bound M̄ on these functions guarantees an upper bound b such that |ω_ii| ≤ b. Therefore, |ω_kk| ≤ b for all k = 1, 2, ..., p.

(b) Lower bound: By positivity of the trace term and the ℓ1 term (for off-diagonals), we have

M ≥ −log det Ω_D = Σ_{i=1}^{p} −log |ω_ii|.   (15)

Since g(z) = −log(z) is decreasing and |ω_ii| ≤ b, each term satisfies −log |ω_ii| ≥ −log b. Therefore, for any k = 1, 2, ..., p, we have

M ≥ Σ_{i=1}^{p} −log |ω_ii| ≥ −(p − 1) log b − log |ω_kk|.   (16)

Simplifying the above equation, we get log |ω_kk| ≥ −M − (p − 1) log b. Therefore, |ω_kk| ≥ a = e^{−M − (p−1) log b} > 0 serves as a lower bound for all k = 1, 2, ..., p.

Given that the function values are non-increasing along the iterates of Algorithms 1, 2 and 3, the sequence Ω(k) satisfies Ω(k) ∈ C0 for k = 1, 2, .... The lower bound on the diagonal elements of Ω(k) provides Lipschitz continuity via

∇^2 h1(Ω(k)) ⪯ (a^{−2} + ‖S‖_2)(I ⊗ I).   (17)

Therefore, using the mean-value theorem, the gradient ∇h1 satisfies

‖∇h1(Ω) − ∇h1(Θ)‖_F ≤ L ‖Ω − Θ‖_F,   (18)

with Lipschitz continuity constant L = a^{−2} + ‖S‖_2. The remaining argument for convergence follows from the theorems in [7].

Theorem 3.2. ([7, Theorem 3.1]). Let {Ω(k)} be the sequence generated by Algorithm 1 with either constant step size or backtracking line-search. Then, for the solution Ω*, for any k ≥ 1,

Qcon(Ω(k)) − Qcon(Ω*) ≤ αL ‖Ω(0) − Ω*‖_F^2 / (2k),   (19)

where α = 1 for the constant step size setting and α = c for the backtracking step size setting.

Theorem 3.3. ([7, Theorem 4.4]). Let {Ω(k)}, {Θ(k)} be the sequences generated by Algorithm 2 with either constant step size or backtracking line-search. Then, for the solution Ω*, for any k ≥ 1,

Qcon(Ω(k)) − Qcon(Ω*) ≤ 2αL ‖Ω(0) − Ω*‖_F^2 / (k + 1)^2,   (20)

where α = 1 for the constant step size setting and α = c for the backtracking step size setting. Hence, CONCORD-ISTA and CONCORD-FISTA converge at the rates of O(1/k) and O(1/k^2), respectively.

4 Implementation & Numerical Experiments

In this section, we outline algorithm implementation details and present the results of our comprehensive numerical evaluation. Section 4.1 gives performance comparisons using synthetic multivariate Gaussian datasets generated over a wide range of sample sizes (n) and dimensions (p); convergence of CONCORD-ISTA and CONCORD-FISTA is also illustrated. Section 4.2 gives timing results from analyzing a real breast cancer dataset with outliers. Comparisons are made to the coordinate-wise CONCORD implementation in the gconcord package for R, available at http://cran.r-project.org/web/packages/gconcord/.

For implementing the proposed algorithms, we can take advantage of existing linear algebra libraries. Most of the numerical computations in Algorithms 1 and 2 are linear algebra operations, and, unlike the sequential coordinate-wise CONCORD algorithm, the CONCORD-ISTA and CONCORD-FISTA implementations can solve increasingly larger problems as more scalable and efficient linear algebra libraries become available. For this work, we opted to use the Eigen library [11] for its sparse linear algebra routines written in C++. Algorithms 1 and 2 were also written in C++ and then interfaced to R for testing. 
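As a reference point for the implementation discussion, a single proximal gradient step of Algorithm 1 can be sketched in a few lines of dense numpy. This is our illustrative code only, with a fixed step size and no backtracking line search; the actual implementation is sparse C++/Eigen:

```python
import numpy as np

def soft_threshold(a, thr):
    # elementwise soft-thresholding, the proximal operator of the l1 penalty
    return np.sign(a) * np.maximum(np.abs(a) - thr, 0.0)

def concord_ista_step(omega, s, lam, tau):
    # gradient of the smooth part: -diag(Omega)^{-1} + (S Omega + Omega S) / 2
    g = -np.diag(1.0 / np.diag(omega)) + 0.5 * (s @ omega + omega @ s)
    # penalize off-diagonal entries only, as in the CONCORD objective (1)
    thr = tau * lam * (1.0 - np.eye(omega.shape[0]))
    return soft_threshold(omega - tau * g, thr)

# with S = I, Omega = I is a fixed point: the gradient vanishes there
p = 4
omega_new = concord_ista_step(np.eye(p), np.eye(p), lam=0.1, tau=0.5)
```

In a production version the matrix products would be sparse and the step size chosen by the backtracking rule of Algorithm 1.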
Table 1 gives names for the CONCORD-ISTA and CONCORD-FISTA variants corresponding to different initial step size choices.

4.1 Synthetic Datasets

Synthetic datasets were generated from true sparse, positive definite random Ω matrices of three sizes: p = {1000, 3000, 5000}. The instances of random matrices used here consist of 4995, 14985 and 24975 non-zeros, corresponding to 1%, 0.33% and 0.20% edge densities, respectively. For each p, Gaussian and t-distributed datasets of sizes n = {0.25p, 0.75p, 1.25p} were used as inputs. The initial guess, Ω(0), and the convergence criteria were matched to those of the coordinate-wise CONCORD implementation. Highlights of the results are summarized below, and the complete set of comparisons is given in Supplementary Section A.

For normally distributed synthetic datasets, our experiments indicate that the two variations of the CONCORD-ISTA method show little performance difference, though ccista 0 was marginally faster in our tests. On the other hand, the ccfista 1 variation of CONCORD-FISTA, which uses τ(k+1,0) = τk as the initial step size, was significantly faster than ccfista 0. Table 2 gives actual running times for the two best performing algorithms, ccista 0 and ccfista 1, against coordinate-wise concord. As p and n increase, ccista 0 performs very well. For smaller n and λ, coordinate-wise concord performs well (more in Supplementary Section A). This can be attributed to the min(O(np^2), O(p^3)) computational complexity of coordinate-wise CONCORD [6], and to the sparse linear algebra routines used in the CONCORD-ISTA and CONCORD-FISTA implementations slowing down as the number of non-zero elements in Ω increases. On the other hand, for the large n fraction (n = 1.25p), the proposed methods ccista 0 and ccfista 1 are significantly faster than coordinate-wise concord. In particular, when p = 5000 and n = 6250, the speed-up of ccista 0 can be as much as 150 times over coordinate-wise concord. 
Also, for t-distributed synthetic datasets, ccista 0 is generally the fastest, especially when n and p are both large.

Figure 1: Convergence of CONCORD-ISTA and CONCORD-FISTA for threshold Δsubg < 10^{−5}

When a good initial guess Ω(0) is available, warm-starting the ccista 0 and ccfista 0 algorithms substantially shortens the running times. Simulations with Gaussian datasets indicate the running times can be shortened by, on average, as much as 60%. Complete simulation results are given in Supplementary Section A.6.

Convergence behavior of the CONCORD-ISTA and CONCORD-FISTA methods is shown in Figure 1 for the best performing algorithms, ccista 0 and ccfista 1. The vertical axis is the subgradient Δsubg (see Algorithms 1 and 2). The plots show that ccista 0 converges at a roughly constant rate, much faster than ccfista 1, which appears to slow down after a few initial iterations. While the theoretical results from Section 3 prove convergence rates of O(1/k) and O(1/k^2) for CONCORD-ISTA and CONCORD-FISTA, in practice ccista 0 with constant step size performed the fastest in the tests in this section.

4.2 Real Data

Real datasets arising from the physical and biological sciences are often not multivariate Gaussian and can have outliers; hence, convergence characteristics may differ on such datasets. In this section, the performance of the proposed methods is assessed on a breast cancer dataset [12]. This dataset contains expression levels of 24481 genes on 266 patients with breast cancer. Following the approach in Khare et al. [6], the number of genes is reduced by utilizing the clinical information provided together with the microarray expression dataset. In particular, survival analysis via univariate Cox regression with patient survival times is used to select a subset of genes closely associated with breast cancer. 
A choice of p-value < 0.03 yields a reduced dataset with p = 4433 genes. Graphical model selection algorithms are often applied in a non-Gaussian and n ≪ p setting such as this one. In the n ≪ p setting, the coordinate-wise CONCORD algorithm is especially fast due to its O(np^2) computational complexity. However, even in this setting, the newly proposed methods ccista 0, ccista 1 and ccfista 1 perform competitively with, or often better than, concord, as illustrated in Table 3. On this real dataset, ccista 1 performed the fastest, whereas ccista 0 was the fastest on synthetic datasets.

5 Conclusion

Gaussian graphical model estimation, or inverse covariance estimation, has seen tremendous advances in the past few years. In this paper we propose using proximal gradient methods to solve the general non-Gaussian sparse inverse covariance estimation problem. Rates of convergence were established for the CONCORD-ISTA and CONCORD-FISTA algorithms. Coordinate-wise minimization has been the standard approach to this problem thus far, and we provide numerical results comparing CONCORD-ISTA/FISTA with coordinate-wise minimization. We demonstrate that CONCORD-ISTA outperforms coordinate-wise minimization in general, and that in high dimensional settings CONCORD-ISTA can outperform coordinate-wise optimization by orders of magnitude. The methodology is also tested on real datasets. We undertake a comprehensive treatment of the problem by also examining the dual formulation and considering methods to maximize the dual objective. We note that efforts similar to ours for the Gaussian case have appeared in not one, but several NIPS and other publications. 
Our approach, on the other hand, gives a complete and thorough treatment of the non-Gaussian partial correlation graph estimation problem, all in this one self-contained paper.

Table 1: Naming convention for step size variations

Variation:    concord         | ccista 0 | ccista 1         | ccfista 0 | ccfista 1
Method:       Coordinate-wise | ISTA     | ISTA             | FISTA     | FISTA
Initial step: -               | Constant | Barzilai-Borwein | Constant  | τk

Table 2: Timing comparison of concord and the proposed methods ccista 0 and ccfista 1.

p | n | λ | NZ% | concord (iter / sec) | ccista 0 (iter / sec) | ccfista 1 (iter / sec)
1000 | 250 | 0.150 | 1.52 | 9 / 3.2 | 13 / 1.8 | 20 / 3.3
1000 | 250 | 0.163 | 0.99 | 9 / 2.6 | 18 / 2.0 | 26 / 3.3
1000 | 250 | 0.300 | 0.05 | 9 / 2.6 | 15 / 1.2 | 23 / 2.7
1000 | 750 | 0.090 | 1.50 | 9 / 8.9 | 11 / 1.4 | 17 / 2.5
1000 | 750 | 0.103 | 0.76 | 9 / 8.4 | 15 / 1.6 | 24 / 3.3
1000 | 750 | 0.163 | 0.23 | 9 / 8.0 | 15 / 1.6 | 24 / 2.8
1000 | 1250 | 0.071 | 1.41 | 9 / 41.3 | 10 / 1.4 | 17 / 2.9
1000 | 1250 | 0.077 | 0.97 | 9 / 40.5 | 15 / 1.7 | 24 / 3.3
1000 | 1250 | 0.163 | 0.23 | 9 / 43.8 | 13 / 1.2 | 23 / 2.8
3000 | 750 | 0.090 | 1.10 | 17 / 147.4 | 20 / 32.4 | 25 / 53.2
3000 | 750 | 0.103 | 0.47 | 17 / 182.4 | 28 / 36.0 | 35 / 60.1
3000 | 750 | 0.163 | 0.08 | 16 / 160.1 | 28 / 28.3 | 26 / 39.9
3000 | 2250 | 0.053 | 1.07 | 16 / 388.3 | 17 / 28.5 | 17 / 39.6
3000 | 2250 | 0.059 | 0.56 | 16 / 435.0 | 28 / 38.5 | 26 / 61.9
3000 | 2250 | 0.090 | 0.16 | 16 / 379.4 | 16 / 19.9 | 15 / 23.6
3000 | 3750 | 0.040 | 1.28 | 16 / 2854.2 | 17 / 33.0 | 17 / 47.3
3000 | 3750 | 0.053 | 0.28 | 16 / 2921.5 | 15 / 23.5 | 16 / 31.4
3000 | 3750 | 0.163 | 0.07 | 15 / 2780.5 | 25 / 35.1 | 32 / 56.1
5000 | 1250 | 0.066 | 1.42 | 17 / 832.7 | 32 / 193.9 | 37 / 379.2
5000 | 1250 | 0.077 | 0.53 | 17 / 674.7 | 30 / 121.4 | 35 / 265.8
5000 | 1250 | 0.103 | 0.10 | 17 / 667.6 | 27 / 81.2 | 33 / 163.0
5000 | 3750 | 0.039 | 1.36 | 17 / 2102.8 | 18 / 113.0 | 17 / 176.3
5000 | 3750 | 0.049 | 0.31 | 17 / 1826.6 | 16 / 73.4 | 17 / 107.4
5000 | 3750 | 0.077 | 0.10 | 17 / 2094.7 | 29 / 95.8 | 33 / 178.1
5000 | 6250 | 0.039 | 0.27 | 17 / 15629.3 | 17 / 93.9 | 17 / 130.0
5000 | 6250 | 0.077 | 0.10 | 17 / 15671.1 | 27 / 101.0 | 25 / 123.9
5000 | 6250 | 0.163 | 0.04 | 16 / 14787.8 | 26 / 97.3 | 34 / 173.7

Table 3: Running time comparison on the breast cancer dataset

λ | NZ% | concord (iter / sec) | ccista 0 (iter / sec) | ccista 1 (iter / sec) | ccfista 0 (iter / sec) | ccfista 1 (iter / sec)
0.450 | 0.110 | 80 / 724.5 | 132 / 686.7 | 123 / 504.0 | 250 / 10870.3 | 201 / 672.6
0.451 | 0.109 | 80 / 664.2 | 129 / 669.2 | 112 / 457.0 | 216 / 7867.2 | 199 / 662.9
0.454 | 0.106 | 80 / 690.3 | 130 / 686.2 | 81 / 352.9 | 213 / 7704.2 | 198 / 677.8
0.462 | 0.101 | 79 / 671.6 | 125 / 640.4 | 109 / 447.1 | 214 / 7978.4 | 196 / 646.3
0.478 | 0.088 | 77 / 663.3 | 117 / 558.6 | 87 / 337.9 | 202 / 6913.1 | 197 / 609.0
0.515 | 0.063 | 63 / 600.6 | 104 / 466.0 | 75 / 282.4 | 276 / 9706.9 | 184 / 542.0
0.602 | 0.027 | 46 / 383.5 | 80 / 308.0 | 66 / 229.7 | 172 / 4685.2 | 152 / 409.1
0.800 | 0.002 | 24 / 193.6 | 45 / 133.8 | 32 / 92.2 | 74 / 1077.2 | 70 / 169.8

Acknowledgments: S.O., O.D. and B.R. were supported in part by the National Science Foundation under grants DMS-0906392, DMS-CMG 1025465, AGS-1003823, DMS-1106642, DMS CAREER-1352656 and grants DARPA-YFAN66001-111-4131 and SMC-DBNKY. K.K. was partially supported by NSF grant DMS-1106084. S.O. was also supported in part by the Laboratory Directed Research and Development Program of Lawrence Berkeley National Laboratory under U.S. Department of Energy Contract No. DE-AC02-05CH11231.

References

[1] Onureena Banerjee, Laurent El Ghaoui, and Alexandre d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. JMLR, 9:485-516, 2008.

[2] Onkar Anant Dalal and Bala Rajaratnam. G-AMA: Sparse Gaussian graphical model estimation via alternating minimization. arXiv preprint arXiv:1405.3034, 2014.

[3] Jie Peng, Pei Wang, Nengfeng Zhou, and Ji Zhu. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association, 104(486):735-746, June 2009.

[4] Guilherme V. Rocha, Peng Zhao, and Bin Yu. A path following algorithm for Sparse Pseudo-Likelihood Inverse Covariance Estimation (SPLICE). Technical Report 60628102, 2008.

[5] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Applications of the lasso and grouped lasso to the estimation of sparse graphical models. Technical report, 2010.

[6] Kshitij Khare, Sang-Yun Oh, and Bala Rajaratnam. A convex pseudo-likelihood framework for high dimensional partial correlation estimation with convergence guarantees. Journal of the Royal Statistical Society: Series B (to appear), 2014.

[7] Amir Beck and Marc Teboulle. 
A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.

[8] R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877-898, 1976.

[9] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pages 372-376, 1983.

[10] J. Barzilai and J. M. Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1):141-148, 1988.

[11] Gaël Guennebaud, Benoît Jacob, et al. Eigen v3. http://eigen.tuxfamily.org, 2010.

[12] Howard Y. Chang, Dimitry S. A. Nuyten, Julie B. Sneddon, Trevor Hastie, Robert Tibshirani, Therese Sørlie, Hongyue Dai, Yudong D. He, Laura J. van't Veer, Harry Bartelink, Matt van de Rijn, Patrick O. Brown, and Marc J. van de Vijver. Robustness, scalability, and integration of a wound-response gene expression signature in predicting breast cancer survival. Proceedings of the National Academy of Sciences of the United States of America, 102(10):3738-43, March 2005.
", "award": [], "sourceid": 468, "authors": [{"given_name": "Sang", "family_name": "Oh", "institution": "Lawrence Berkeley Lab"}, {"given_name": "Onkar", "family_name": "Dalal", "institution": null}, {"given_name": "Kshitij", "family_name": "Khare", "institution": "University of Florida"}, {"given_name": "Bala", "family_name": "Rajaratnam", "institution": "Stanford University"}]}