{"title": "The Multiple Quantile Graphical Model", "book": "Advances in Neural Information Processing Systems", "page_first": 3747, "page_last": 3755, "abstract": "We introduce the Multiple Quantile Graphical Model (MQGM), which extends the neighborhood selection approach of Meinshausen and Buhlmann for learning sparse graphical models. The latter is defined by the basic subproblem of modeling the conditional mean of one variable as a sparse function of all others. Our approach models a set of conditional quantiles of one variable as a sparse function of all others, and hence offers a much richer, more expressive class of conditional distribution estimates. We establish that, under suitable regularity conditions, the MQGM identifies the exact conditional independencies with probability tending to one as the problem size grows, even outside of the usual homoskedastic Gaussian data model. We develop an efficient algorithm for fitting the MQGM using the alternating direction method of multipliers. We also describe a strategy for sampling from the joint distribution that underlies the MQGM estimate. Lastly, we present detailed experiments that demonstrate the flexibility and effectiveness of the MQGM in modeling hetereoskedastic non-Gaussian data.", "full_text": "The Multiple Quantile Graphical Model\n\nAlnur Ali\n\nMachine Learning Department\nCarnegie Mellon University\n\nalnurali@cmu.edu\n\nJ. Zico Kolter\n\nComputer Science Department\nCarnegie Mellon University\n\nzkolter@cs.cmu.edu\n\nRyan J. Tibshirani\n\nDepartment of Statistics\n\nCarnegie Mellon University\n\nryantibs@cmu.edu\n\nAbstract\n\nWe introduce the Multiple Quantile Graphical Model (MQGM), which extends\nthe neighborhood selection approach of Meinshausen and B\u00fchlmann for learning\nsparse graphical models. The latter is de\ufb01ned by the basic subproblem of model-\ning the conditional mean of one variable as a sparse function of all others. 
Our approach models a set of conditional quantiles of one variable as a sparse function of all others, and hence offers a much richer, more expressive class of conditional distribution estimates. We establish that, under suitable regularity conditions, the MQGM identifies the exact conditional independencies with probability tending to one as the problem size grows, even outside of the usual homoskedastic Gaussian data model. We develop an efficient algorithm for fitting the MQGM using the alternating direction method of multipliers. We also describe a strategy for sampling from the joint distribution that underlies the MQGM estimate. Lastly, we present detailed experiments that demonstrate the flexibility and effectiveness of the MQGM in modeling heteroskedastic non-Gaussian data.

1 Introduction

We consider modeling the joint distribution Pr(y1, . . . , yd) of d random variables, given n independent draws from this distribution y(1), . . . , y(n) ∈ R^d, where possibly d ≫ n. Later, we generalize this setup and consider modeling the conditional distribution Pr(y1, . . . , yd | x1, . . . , xp), given n independent pairs (x(1), y(1)), . . . , (x(n), y(n)) ∈ R^{p+d}. Our starting point is the neighborhood selection method [28], which is typically considered in the context of multivariate Gaussian data, and seen as a tool for covariance selection [8]: when Pr(y1, . . . , yd) is a multivariate Gaussian distribution, it is a well-known fact that yj and yk are conditionally independent given the remaining variables if and only if the coefficient corresponding to yk is zero in the (linear) regression of yj on all other variables (e.g., [22]). Therefore, in neighborhood selection we compute, for each k = 1, . . .
, d, a lasso regression — in order to obtain a small set of conditional dependencies — of yk on the remaining variables, i.e.,

minimize_{θk ∈ R^d}  Σ_{i=1}^n ( y_k^{(i)} − Σ_{j≠k} θ_{kj} y_j^{(i)} )² + λ ‖θk‖1,   (1)

for a tuning parameter λ > 0. This strategy can be seen as a pseudolikelihood approximation [4],

Pr(y1, . . . , yd) ≈ Π_{k=1}^d Pr(yk | y¬k),   (2)

where y¬k denotes all variables except yk. Under the multivariate Gaussian model for Pr(y1, . . . , yd), the conditional distributions Pr(yk | y¬k), k = 1, . . . , d here are (univariate) Gaussians, and maximizing the pseudolikelihood in (2) is equivalent to separately maximizing the conditionals, as is precisely done in (1) (with induced sparsity), for k = 1, . . . , d.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Following the pseudolikelihood-based approach traditionally means carrying out three steps: (i) we write down a suitable family of joint distributions for Pr(y1, . . . , yd), (ii) we derive the conditionals Pr(yk | y¬k), k = 1, . . . , d, and then (iii) we maximize each conditional likelihood by (freely) fitting the parameters. Neighborhood selection, and a number of related approaches that came after it (see Section 2.1), can all be thought of in this workflow. In many ways, step (ii) acts as the bottleneck here, and to derive the conditionals, we are usually limited to a homoskedastic and parametric family for the joint distribution.

The approach we take in this paper differs somewhat substantially, as we begin by directly modeling the conditionals in (2), without any preconceived model for the joint distribution — in this sense, it may be seen as a type of dependency network [13] for continuous data. 
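Problem (1) can be made concrete in a few lines of code. The sketch below is an illustration only, not the authors' implementation: it fits the lasso in (1) by coordinate descent (soft-thresholding) on hypothetical three-variable data, and declares an edge {j, k} whenever either of the two corresponding regression coefficients is nonzero (an "OR" combination rule).

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2)||y - X theta||^2 + lam * ||theta||_1
    (a rescaling of problem (1))."""
    n, d = X.shape
    theta = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)
    r = y.copy()  # residual y - X theta
    for _ in range(n_iter):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue
            # partial residual correlation, with coordinate j added back
            rho = X[:, j] @ r + col_sq[j] * theta[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r += X[:, j] * (theta[j] - new)
            theta[j] = new
    return theta

def neighborhood_select(Y, lam):
    """Run (1) for each variable k; edge {j, k} if either coefficient is nonzero."""
    n, d = Y.shape
    adj = np.zeros((d, d), dtype=bool)
    for k in range(d):
        others = [j for j in range(d) if j != k]
        theta = lasso_cd(Y[:, others], Y[:, k], lam)
        for idx, j in enumerate(others):
            if abs(theta[idx]) > 1e-8:
                adj[k, j] = True
    return adj | adj.T  # "OR" rule

rng = np.random.default_rng(0)
n = 500
y1 = rng.normal(size=n)
y2 = 0.9 * y1 + 0.1 * rng.normal(size=n)   # y2 depends on y1
y3 = rng.normal(size=n)                    # independent of the others
Y = np.column_stack([y1, y2, y3])
A = neighborhood_select(Y, lam=60.0)
print(A[0, 1], A[0, 2])  # expect an edge {y1, y2} but none between y1 and y3
```

Here the toy data, the tuning value lam=60.0, and the thresholding tolerance are all arbitrary choices for the illustration.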
We also employ heteroskedastic, nonparametric models for the conditional distributions, which allows us great flexibility in learning these conditional relationships. Our method, called the Multiple Quantile Graphical Model (MQGM), is a marriage of ideas in high-dimensional, nonparametric, multiple quantile regression with those in the dependency network literature (the latter is typically focused on discrete, not continuous, data).

An outline for this paper is as follows. Section 2 reviews background material, and Section 3 develops the MQGM estimator. Section 4 studies basic properties of the MQGM, and establishes a structure recovery result under appropriate regularity conditions, even for heteroskedastic, non-Gaussian data. Section 5 describes an efficient ADMM algorithm for estimation, and Section 6 presents empirical examples comparing the MQGM versus common alternatives. Section 7 concludes with a discussion.

2 Background

2.1 Neighborhood selection and related methods

Neighborhood selection has motivated a number of methods for learning sparse graphical models. The literature here is vast; we do not claim to give a complete treatment, but just mention some relevant approaches. Many pseudolikelihood approaches have been proposed, see e.g., [35, 33, 12, 24, 17, 1]. These works exploit the connection between estimating a sparse inverse covariance matrix and regression, and they vary in terms of the optimization algorithms they use and the theoretical guarantees they offer. In a clearly related but distinct line of research, [45, 2, 11, 36] proposed ℓ1-penalized likelihood estimation in the Gaussian graphical model, a method now generally termed the graphical lasso (GLasso). Following this, several recent papers have extended the GLasso in various ways. [10] examined a modification based on the multivariate Student t-distribution, for robust graphical modeling. 
[37, 46, 42] considered conditional distributions of the form Pr(y1, . . . , yd | x1, . . . , xp). [23] proposed a model for mixed (both continuous and discrete) data types, generalizing both GLasso and pairwise Markov random fields. [25, 26] used copulas for learning non-Gaussian graphical models. A strength of neighborhood-based (i.e., pseudolikelihood-based) approaches lies in their simplicity; because they essentially reduce to a collection of univariate probability models, they are in a sense much easier to study outside of the typical homoskedastic, Gaussian data setting. [14, 43, 44] elegantly studied the implications of using univariate exponential family models for the conditionals in (2). Closely related to pseudolikelihood approaches are dependency networks [13]. Both frameworks focus on the conditional distributions of one variable given all the rest; the difference lies in whether or not the model for the conditionals stems from first specifying some family of joint distributions (pseudolikelihood methods), or not (dependency networks). Dependency networks have been thoroughly studied for discrete data, e.g., [13, 29]. For continuous data, [40] proposed modeling the mean in a Gaussian neighborhood regression as a nonparametric, additive function of the remaining variables, yielding flexible relationships — this is a type of dependency network for continuous data (though it is not described by the authors in this way). Our method, the MQGM, also deals with continuous data, and is the first to our knowledge that allows for fully nonparametric conditional distributions, as well as nonparametric contributions of the neighborhood variables, in each local model.

2.2 Quantile regression

In linear regression, we estimate the conditional mean of y | x1, . . . , xp from samples. Similarly, in α-quantile regression [20], we estimate the conditional α-quantile of y | x1, . . .
, xp for a given α ∈ [0, 1], formally Q_{y|x1,...,xp}(α) = inf{t : Pr(y ≤ t | x1, . . . , xp) ≥ α}, by solving the convex optimization problem:

minimize_θ  Σ_{i=1}^n ψ_α( y^{(i)} − Σ_{j=1}^p θ_j x_j^{(i)} ),

where ψ_α(z) = max{αz, (α − 1)z} is the α-quantile loss (also called the "pinball" or "tilted absolute" loss). Quantile regression can be useful when the conditional distribution in question is suspected to be heteroskedastic and/or non-Gaussian, e.g., heavy-tailed, or if we wish to understand properties of the distribution other than the mean, e.g., tail behavior. In multiple quantile regression, we solve several quantile regression problems simultaneously, each corresponding to a different quantile level; these problems can be coupled somehow to increase efficiency in estimation (see details in the next section). Again, the literature on quantile regression is quite vast (especially that from econometrics), and we only give a short review here. A standard text is [18]. Nonparametric modeling of quantiles is a natural extension from the (linear) quantile regression approach outlined above; in the univariate case (one conditioning variable), [21] suggested a method using smoothing splines, and [38] described an approach using kernels. More recently, [19] studied the multivariate nonparametric case (more than one conditioning variable), using additive models. In the high-dimensional setting, where p is large, [3, 16, 9] studied ℓ1-penalized quantile regression and derived estimation and recovery theory for non-(sub-)Gaussian data. We extend results in [9] to prove structure recovery guarantees for the MQGM (in Section 4.3).

3 The multiple quantile graphical model

Many choices can be made with regards to the final form of the MQGM, and to help in understanding these options, we break down our presentation in parts. 
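A useful sanity check on ψ_α: minimizing Σ_i ψ_α(y^{(i)} − t) over a constant t recovers an empirical α-quantile of the data. The toy sketch below (grid search over t, hypothetical data; an illustration, not from the paper) demonstrates this.

```python
def pinball(z, alpha):
    """alpha-quantile ("pinball") loss: max(alpha*z, (alpha-1)*z)."""
    return max(alpha * z, (alpha - 1.0) * z)

def fit_constant_quantile(ys, alpha, grid):
    """Minimize sum_i pinball(y_i - t, alpha) over candidate constants t."""
    return min(grid, key=lambda t: sum(pinball(y - t, alpha) for y in ys))

ys = list(range(1, 101))               # toy data: 1, 2, ..., 100
grid = [t / 2 for t in range(0, 202)]  # candidates 0.0, 0.5, ..., 100.5
t90 = fit_constant_quantile(ys, 0.90, grid)
print(t90)  # a minimizer lies in [90, 91], the empirical 0.9-quantile interval
```

The grid resolution (0.5) and the data are arbitrary; with n = 100 points and α = 0.9, any t between the 90th and 91st order statistics minimizes the loss.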
First fix some ordered set A = {α1, . . . , αr} of quantile levels, e.g., A = {0.05, 0.10, . . . , 0.95}. For each variable yk, and each level αℓ, we model the conditional αℓ-quantile given the other variables, using an additive expansion of the form:

Q_{yk|y¬k}(αℓ) = b*ℓk + Σ_{j≠k} f*ℓkj(yj),   (3)

where b*ℓk ∈ R is an intercept term, and f*ℓkj, j = 1, . . . , d are smooth, but not parametric in form. In its most general form, the MQGM estimator is defined as a collection of optimization problems, over k = 1, . . . , d and ℓ = 1, . . . , r:

minimize_{bℓk, fℓkj ∈ Fℓkj, j=1,...,d}  Σ_{i=1}^n ψ_{αℓ}( y_k^{(i)} − bℓk − Σ_{j≠k} fℓkj(y_j^{(i)}) ) + Σ_{j≠k} ( λ1 P1(fℓkj) + λ2 P2(fℓkj) )^ω.   (4)

Here λ1, λ2 ≥ 0 are tuning parameters, Fℓkj, j = 1, . . . , d are univariate function spaces, ω > 0 is a fixed exponent, and P1, P2 are sparsity and smoothness penalty functions, respectively. We give three examples below; many other variants are also possible.

Example 1: basis expansion model. Consider taking Fℓkj = span{φ^j_1, . . . , φ^j_m}, the span of m basis functions, e.g., radial basis functions (RBFs) with centers placed at appropriate locations across the domain of variable j, for each j = 1, . . . , d. This means that each fℓkj ∈ Fℓkj can be expressed as fℓkj(x) = θℓkjᵀ φ^j(x), for a coefficient vector θℓkj ∈ R^m, where φ^j(x) = (φ^j_1(x), . . . , φ^j_m(x)). Also consider an exponent ω = 1, and the sparsity and smoothness penalties

P1(fℓkj) = ‖θℓkj‖2   and   P2(fℓkj) = ‖θℓkj‖2²,

which are group lasso and ridge penalties, respectively. With these choices in place, the MQGM problem in (4) can be rewritten in finite-dimensional form:

minimize_{bℓk, θℓk = (θℓk1, . . . , θℓkd)}  ψ_{αℓ}( Yk − bℓk 1 − Φθℓk ) + Σ_{j≠k} ( λ1 ‖θℓkj‖2 + λ2 ‖θℓkj‖2² ).   (5)

Above, we have used the abbreviation ψ_{αℓ}(z) = Σ_{i=1}^n ψ_{αℓ}(zi) for a vector z = (z1, . . . , zn) ∈ R^n, and also Yk = (y_k^{(1)}, . . . , y_k^{(n)}) ∈ R^n for the observations along variable k, 1 = (1, . . . , 1) ∈ R^n, and Φ ∈ R^{n×dm} for the basis matrix, with blocks of columns to be understood as Φij = φ^j(y_j^{(i)})ᵀ ∈ R^m. The basis expansion model is simple and tends to work well in practice, so we focus on it for most of the paper. In principle, essentially all our results apply to the next two models we describe, as well.

Example 2: smoothing splines model. Now consider taking Fℓkj = span{g^j_1, . . . , g^j_n}, the span of m = n natural cubic splines with knots at y_j^{(1)}, . . . , y_j^{(n)}, for j = 1, . . . , d. As before, we can then write fℓkj(x) = θℓkjᵀ g^j(x) with coefficients θℓkj ∈ R^n, for fℓkj ∈ Fℓkj. The work of [27], on high-dimensional additive smoothing splines, suggests a choice of exponent ω = 1/2, and penalties

P1(fℓkj) = ‖G^j θℓkj‖2²   and   P2(fℓkj) = θℓkjᵀ Ω^j θℓkj,

for sparsity and smoothness, respectively, where G^j ∈ R^{n×n} is a spline basis matrix with entries G^j_{ii′} = g^j_{i′}(y_j^{(i)}), and Ω^j is the smoothing spline penalty matrix containing integrated products of pairs of twice differentiated basis functions. The MQGM problem in (4) can be translated into a finite-dimensional form, very similar to what we have done in (5), but we omit this for brevity.

Example 3: RKHS model. Consider taking Fℓkj = Hj, a univariate reproducing kernel Hilbert space (RKHS), with kernel function κj(·, ·). The representer theorem allows us to express each function fℓkj ∈ Hj in terms of the representers of evaluation, i.e., fℓkj(x) = Σ_{i=1}^n (θℓkj)i κj(x, y_j^{(i)}), for a coefficient vector θℓkj ∈ R^n. The work of [34], on high-dimensional additive RKHS modeling, suggests a choice of exponent ω = 1, and sparsity and smoothness penalties

P1(fℓkj) = ‖K^j θℓkj‖2   and   P2(fℓkj) = √(θℓkjᵀ K^j θℓkj),

respectively, where K^j ∈ R^{n×n} is the kernel matrix with entries K^j_{ii′} = κj(y_j^{(i)}, y_j^{(i′)}). Again, the MQGM problem in (4) can be written in finite-dimensional form, now an SDP, omitted for brevity.

Structural constraints. Several structural constraints can be placed on top of the MQGM optimization problem in order to guide the estimated component functions to meet particular shape requirements. An important example is non-crossing constraints (commonplace in nonparametric, multiple quantile regression [18, 38]): here, we optimize (4) jointly over ℓ = 1, . . .
, r, subject to

bℓk + Σ_{j≠k} fℓkj(y_j^{(i)}) ≤ bℓ′k + Σ_{j≠k} fℓ′kj(y_j^{(i)}),   for all αℓ < αℓ′, and i = 1, . . . , n.   (6)

This ensures that the estimated quantiles obey the proper ordering, at the observations. For concreteness, we consider the implications for the basis regression model, in Example 1 (similar statements hold for the other two models). For each ℓ = 1, . . . , r, denote by Fℓk(bℓk, θℓk) the criterion in (5). Introducing the non-crossing constraints requires coupling (5) over ℓ = 1, . . . , r, so that we now have the following optimization problems, for each target variable k = 1, . . . , d:

minimize_{Bk, Θk}  Σ_{ℓ=1}^r Fℓk(bℓk, θℓk)   subject to  (1Bkᵀ + ΦΘk)Dᵀ ≥ 0,   (7)

where we denote Bk = (b1k, . . . , brk) ∈ R^r, Φ ∈ R^{n×dm} the basis matrix as before, Θk ∈ R^{dm×r} given by column-stacking θℓk ∈ R^{dm}, ℓ = 1, . . . , r, and D ∈ R^{(r−1)×r} is the usual discrete difference operator. (The inequality in (7) is to be interpreted componentwise.) Computationally, coupling the subproblems across ℓ = 1, . . . , r clearly adds to the overall difficulty of the MQGM, but statistically this coupling acts as a regularizer, by constraining the parameter space in a useful way, thus increasing our efficiency in fitting multiple quantile levels from the given data.

For a triplet ℓ, k, j, monotonicity constraints are also easy to add, i.e., fℓkj(y_j^{(i)}) ≤ fℓkj(y_j^{(i′)}) for all y_j^{(i)} < y_j^{(i′)}. Convexity constraints, where we require fℓkj to be convex over the observations, for a particular ℓ, k, j, are also straightforward. 
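For intuition on what enforcing (6) does at a single observation: the Euclidean projection of a vector of (possibly crossing) fitted quantiles onto the isotonic cone can be computed in linear time by the pool-adjacent-violators algorithm (PAVA), and the same row-wise projection reappears in the ADMM algorithm of Section 5. The sketch below is a minimal illustration, not the authors' code.

```python
def project_isotonic(v):
    """Euclidean projection of v onto {x : x[0] <= x[1] <= ... <= x[-1]}
    via the pool-adjacent-violators algorithm (PAVA), O(len(v))."""
    blocks = []  # each block is [total, count]; its value is total/count
    for x in v:
        blocks.append([x, 1])
        # merge adjacent blocks while their means violate the ordering
        # (compare means via cross-multiplication to avoid division)
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    return out

# crossing quantile estimates at one observation, across increasing levels
print(project_isotonic([1.0, 3.0, 2.0, 5.0]))  # -> [1.0, 2.5, 2.5, 5.0]
```

The crossed pair (3.0, 2.0) is pooled to its average, leaving a nondecreasing vector that is as close as possible in Euclidean distance.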
Lastly, strong non-crossing constraints, where we enforce (6) over all z ∈ R^d (not just over the observations), are also possible with positive basis functions.

Exogenous variables and conditional random fields. So far, we have considered modeling the joint distribution Pr(y1, . . . , yd), corresponding to learning a Markov random field (MRF). It is not hard to extend our framework to model the conditional distribution Pr(y1, . . . , yd | x1, . . . , xp) given some exogenous variables x1, . . . , xp, corresponding to learning a conditional random field (CRF). To extend the basis regression model, we introduce the additional parameters θ^x_ℓk ∈ R^p in (5), and the loss now becomes ψ_{αℓ}(Yk − bℓk 1 − Φθℓk − Xθ^x_ℓk), where X ∈ R^{n×p} is filled with the exogenous observations x^{(1)}, . . . , x^{(n)} ∈ R^p; the other models are changed similarly.

4 Basic properties and theory

4.1 Quantiles and conditional independence

In the model (3), when a particular variable yj has no contribution, i.e., satisfies f*ℓkj = 0 across all quantile levels αℓ, ℓ = 1, . . . , r, what does this imply about the conditional independence between yk and yj, given the rest? Outside of the multivariate normal model (where the feature transformations need only be linear), nothing can be said in generality. But we argue that conditional independence can be understood in a certain approximate sense (i.e., in a projected approximation of the data generating model). We begin with a simple lemma. Its proof is elementary, and given in the supplement.

Lemma 4.1. Let U, V, W be random variables, and suppose that all conditional quantiles of U | V, W do not depend on V, i.e., Q_{U|V,W}(α) = Q_{U|W}(α) for all α ∈ [0, 1]. 
Then U and V are conditionally independent given W.

By the lemma, if we knew that Q_{U|V,W}(α) = h(α, W) for some function h, then it would follow that U, V are conditionally independent given W (n.b., the converse is true, as well). The MQGM problem in (4), with sparsity imposed on the coefficients, essentially aims to achieve such a representation for the conditional quantiles; of course, we cannot use a fully nonparametric representation of the conditional distribution yk | y¬k, and instead we use an r-step approximation to the conditional cumulative distribution function (CDF) of yk | y¬k (corresponding to estimating r conditional quantiles), and (say) in the basis regression model, limit the dependence on conditioning variables to be in terms of an additive function of RBFs in yj, j ≠ k. Thus, if at the solution in (5) we find that θ̂ℓkj = 0, ℓ = 1, . . . , r, we may interpret this to mean that yk and yj are conditionally independent given the remaining variables, but according to the distribution defined by the projection of yk | y¬k onto the space of models considered in (5) (r-step conditional CDFs, which are additive expansions in yj, j ≠ k). This interpretation is no more tenuous (arguably, less so, as the model space here is much larger) than that needed when applying standard neighborhood selection to non-Gaussian data.

4.2 Gibbs sampling and the "joint" distribution

When specifying a form for the conditional distributions in a pseudolikelihood approximation as in (2), it is natural to ask: what is the corresponding joint distribution? Unfortunately, for a general collection of conditional distributions, there need not exist a compatible joint distribution, even when all conditionals are continuous [41]. 
Still, pseudolikelihood approximations (a special case of composite likelihood approximations) possess solid theoretical backing, in that maximizing the pseudolikelihood relates closely to minimizing a certain (expected composite) Kullback-Leibler divergence, measured to the true conditionals [39]. Recently, [7, 44] made nice progress in describing specific conditions on conditional distributions that give rise to a valid joint distribution, though their work was specific to exponential families. A practical answer to the question of this subsection is to use Gibbs sampling, which attempts to draw samples consistent with the fitted conditionals; this is precisely the observation of [13], who show that Gibbs sampling from discrete conditionals converges to a unique stationary distribution, although this distribution may not actually be compatible with the conditionals. The following result establishes the analogous claim for continuous conditionals; its proof is in the supplement. We demonstrate the practical value of Gibbs sampling through various examples in Section 6.

Lemma 4.2. Assume that the conditional distributions Pr(yk | y¬k), k = 1, . . . , d take only positive values on their domain. Then, for any given ordering of the variables, Gibbs sampling converges to a unique stationary distribution that can be reached from any initial point. (This stationary distribution depends on the ordering.)

4.3 Graph structure recovery

When log d = O(n^{2/21}), and we assume somewhat standard regularity conditions (listed as A1–A4 in the supplement), the MQGM estimate recovers the underlying conditional independencies with high probability (interpreted in the projected model space, as explained in Section 4.1). Importantly, we do not require a Gaussian, sub-Gaussian, or even parametric assumption on the data generating process; instead, we assume i.i.d. draws y(1), . .
. , y(n) ∈ R^d, where the conditional distributions yk | y¬k have quantiles specified by the model in (3) for k = 1, . . . , d, ℓ = 1, . . . , r, and further, each f*ℓkj(x) = θ*ℓkjᵀ φ^j(x) for coefficients θ*ℓkj ∈ R^m, j = 1, . . . , d, as in the basis expansion model.

Let E* denote the corresponding edge set of conditional dependencies from these neighborhood models, i.e., {k, j} ∈ E* ⟺ max_{ℓ=1,...,r} max{‖θ*ℓkj‖2, ‖θ*ℓjk‖2} > 0. We define the estimated edge set Ê in the analogous way, based on the solution in (5). Without a loss of generality, we assume the features have been scaled to satisfy ‖Φj‖ ≤ √n for all j = 1, . . . , dm. The following is our recovery result; its proof is provided in the supplement.

Theorem 4.3. Assume log d = O(n^{2/21}), and conditions A1–A4 in the supplement. Assume that the tuning parameters λ1, λ2 satisfy λ1 ≍ (mn log(d²mr/δ) log³ n)^{1/2} and λ2 = o(n^{41/42}/θ*max), where θ*max = max_{ℓ,k,j} ‖θ*ℓkj‖2. Then for n sufficiently large, the MQGM estimate in (5) exactly recovers the underlying conditional dependencies, i.e., Ê = E*, with probability at least 1 − δ.

The theorem shows that the nonzero pattern in the MQGM estimate identifies, with high probability, the underlying conditional independencies. But to be clear, we emphasize that the MQGM estimate is not an estimate of the inverse covariance matrix itself (this is also true of neighborhood regression, SpaceJam of [40], and many other methods for learning graphical models).

5 Computational approach

By design, the MQGM problem in (5) separates into d subproblems, across k = 1, . . .
, d (it therefore suffices to consider only a single subproblem, so we omit notational dependence on k for auxiliary variables). While these subproblems are challenging for off-the-shelf solvers (even for only moderately-sized graphs), the key terms here all admit efficient proximal operators [32], which makes operator splitting methods like the alternating direction method of multipliers [5] a natural choice. As an illustration, we consider the non-crossing constraints in the basis regression model below. Reparameterizing our problem, so that we may apply ADMM, yields:

minimize_{Θk, Bk, V, W, Z}  ψ_A(Z) + λ1 Σ_{ℓ=1}^r Σ_{j=1}^d ‖Wℓj‖2 + (λ2/2) ‖W‖F² + I+(V Dᵀ)
subject to  V = 1Bkᵀ + ΦΘk,  W = Θk,  Z = Yk1ᵀ − 1Bkᵀ − ΦΘk,   (8)

where for brevity ψ_A(A) = Σ_{ℓ=1}^r ψ_{αℓ}(Aℓ), applying the vectorized quantile loss to each column Aℓ of A, and I+(·) is the indicator function of the space of elementwise nonnegative matrices. The augmented Lagrangian associated with (8) is:

L_ρ(Θk, Bk, V, W, Z, UV, UW, UZ) = ψ_A(Z) + λ1 Σ_{ℓ=1}^r Σ_{j=1}^d ‖Wℓj‖2 + (λ2/2) ‖W‖F² + I+(V Dᵀ)
  + (ρ/2) ( ‖1Bkᵀ + ΦΘk − V + UV‖F² + ‖Θk − W + UW‖F² + ‖Yk1ᵀ − 1Bkᵀ − ΦΘk − Z + UZ‖F² ),   (9)

where ρ > 0 is the augmented Lagrangian parameter, and UV, UW, UZ are dual variables corresponding to the equality constraints on V, W, Z, respectively. 
Minimizing (9) over V yields:

V ← Piso( 1Bkᵀ + ΦΘk + UV ),   (10)

where Piso(·) denotes the row-wise projection operator onto the isotonic cone (the space of componentwise nondecreasing vectors), an O(nr) operation here [15]. Minimizing (9) over Wℓj yields the update:

Wℓj ← ( (Θk)ℓj + (UW)ℓj ) / (1 + λ2/ρ) · ( 1 − (λ1/ρ) / ‖(Θk)ℓj + (UW)ℓj‖2 )_+ ,   (11)

where (·)+ is the positive part operator. This can be seen by deriving the proximal operator of the function f(x) = λ1‖x‖2 + (λ2/2)‖x‖2². Minimizing (9) over Z yields the update:

Z ← prox_{(1/ρ)ψ_A}( Yk1ᵀ − 1Bkᵀ − ΦΘk + UZ ),   (12)

where prox_f(·) denotes the proximal operator of a function f. For the multiple quantile loss function ψ_A, this is a kind of generalized soft-thresholding. The proof is given in the supplement.

Lemma 5.1. Let P+(·) and P−(·) be the elementwise positive and negative part operators, respectively, and let a = (α1, . . . , αr). Then prox_{tψ_A}(A) = P+(A − t1aᵀ) + P−(A − t1(a − 1)ᵀ).

Finally, differentiation in (9) with respect to Bk and Θk yields the simultaneous updates:

[ Θk ; Bkᵀ ] ← (1/2) [ ΦᵀΦ + (1/2)I , Φᵀ1 ; 1ᵀΦ , 1ᵀ1 ]^{−1} ( [I 0]ᵀ (W − UW) + [Φ 1]ᵀ (Yk1ᵀ − Z + UZ + V − UV) ).   (13)

A complete description of our ADMM algorithm for solving the MQGM problem is in the supplement.

Gibbs sampling. Having fit the conditionals yk | y¬k, k = 1, . . . , d, we may want to make predictions or extract joint distributions over subsets of variables. 
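The generalized soft-thresholding in Lemma 5.1 can be checked entrywise: the prox of the pinball loss with step t keeps values above tα (shifted down by tα), keeps values below t(α − 1) (shifted up by t(1 − α)), and zeroes out the band in between. A small numerical sketch, comparing the closed form against a brute-force grid minimization at arbitrary toy values (an illustration, not the paper's code):

```python
def prox_pinball(v, t, alpha):
    """Closed-form prox of the pinball loss psi_alpha at v with step t:
    positive part shifted by t*alpha, negative part shifted by t*(alpha-1)."""
    return max(v - t * alpha, 0.0) + min(v - t * (alpha - 1.0), 0.0)

def prox_brute(v, t, alpha, lo=-5.0, hi=5.0, steps=40001):
    """Brute force: argmin_x psi_alpha(x) + (1/(2t)) * (x - v)^2 on a fine grid."""
    def obj(x):
        return max(alpha * x, (alpha - 1.0) * x) + (x - v) ** 2 / (2.0 * t)
    return min((lo + k * (hi - lo) / (steps - 1) for k in range(steps)), key=obj)

for v in (-2.0, -0.05, 0.0, 0.03, 2.0):
    assert abs(prox_pinball(v, 0.5, 0.3) - prox_brute(v, 0.5, 0.3)) < 1e-3
print("closed form matches brute force")
```

With t = 0.5 and α = 0.3, the dead zone is [t(α − 1), tα] = [−0.35, 0.15], so the inputs −0.05, 0.0, and 0.03 all map to zero, while ±2.0 are shifted toward it.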
As discussed in Section 4.2, there is no general analytic form for these joint distributions, but the pseudolikelihood approximation underlying the MQGM suggests a natural Gibbs sampler. A careful implementation that respects the additive model in (3) yields a highly efficient Gibbs sampler, especially for CRFs; the supplement gives details.

6 Empirical examples

6.1 Synthetic data

We consider synthetic examples, comparing the MQGM to neighborhood selection (MB), the graphical lasso (GLasso), SpaceJam [40], the nonparanormal skeptic [26], TIGER [24], and neighborhood selection using the absolute loss (Laplace).

Ring example. As a simple but telling example, we drew n = 400 samples from a "ring" distribution in d = 4 dimensions. Data were generated by drawing a random angle ν ∼ Uniform(0, 1), a random radius R ∼ N(0, 0.1), and then computing the coordinates y1 = R cos ν, y2 = R sin ν and y3, y4 ∼ N(0, 1), i.e., y1 and y2 are the only dependent variables here. The MQGM was used with m = 10 basis functions (RBFs), and r = 20 quantile levels. The left panel of Figure 1 plots samples (blue) of the coordinates y1, y2, as well as new samples from the MQGM (red) fitted to these same (blue) samples, obtained by using our Gibbs sampler; the samples from the MQGM appear to closely match the samples from the underlying ring. The main panel of Figure 1 shows the conditional dependencies recovered by the MQGM, SpaceJam, GLasso, and MB (plots for the other methods are given in the supplement), when run on the ring data. We visualize these dependencies by forming a d × d matrix, with cell (j, k) set to black if j, k are conditionally dependent given the others, and white otherwise. Across a range of tuning parameters for each method, the MQGM is the only one that successfully recovers the underlying conditional dependencies, at some point along its solution path. 
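For context on the Gibbs sampler used above: it cycles through the variables, and each step draws yk by inverse-CDF sampling from the fitted conditional, i.e., it samples u ∼ Uniform(0, 1) and evaluates the estimated conditional u-quantile (with r fitted levels, a step-function approximation of that quantile function). The sketch below uses exact Gaussian conditional quantile functions for a hypothetical two-variable model, rather than fitted MQGM conditionals; it is an illustration only.

```python
import random
from statistics import NormalDist

def gibbs_sample(cond_quantile, init, n_sweeps, rng):
    """Generic Gibbs sweeps: cond_quantile(k, y, u) returns the u-quantile of
    y_k given the other coordinates of y; drawing u ~ Uniform(0,1) and
    evaluating that quantile is inverse-CDF sampling from the conditional."""
    y = list(init)
    for _ in range(n_sweeps):
        for k in range(len(y)):
            u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep u in (0,1)
            y[k] = cond_quantile(k, y, u)
    return y

# toy compatible conditionals: y_k | y_other ~ N(0.8 * y_other, 1)
def cond_quantile(k, y, u):
    return NormalDist(mu=0.8 * y[1 - k], sigma=1.0).inv_cdf(u)

rng = random.Random(0)
draws = [gibbs_sample(cond_quantile, [0.0, 0.0], 20, rng) for _ in range(500)]
corr_sign = sum(a * b for a, b in draws)
print(corr_sign > 0)  # the sampled pairs show the built-in positive dependence
```

These conditionals are mutually compatible (the stationary joint is a bivariate Gaussian with correlation 0.8), so here Gibbs sampling recovers a genuine joint distribution; per Lemma 4.2, a unique stationary distribution exists even when the fitted conditionals are not compatible.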
In the supplement, we present an evaluation of the conditional CDFs given by each method,\nwhen run on the ring data; again, the MQGM performs best in this setting.\nLarger examples To investigate performance at larger scales, we drew n \u2208 {50, 100, 300} samples\nfrom a multivariate normal and Student t-distribution (with 3 degrees of freedom), both in d = 100\ndimensions, both parameterized by a random, sparse, diagonally dominant d \u00d7 d inverse covariance\nmatrix, following the procedure in [33, 17, 31, 1]. Over the same set of sample sizes, with d = 100, we\nalso considered an autoregressive setup in which we drew samples of pairs of adjacent variables from\nthe ring distribution. In all three data settings (normal, t, and autoregressive), we used m = 10 and\nr = 20 for the MQGM. To summarize the performances, we considered a range of tuning parameters\nfor each method, computed corresponding false and true positive rates (in detecting conditional\ndependencies), and then computed the corresponding area under the curve (AUC), following, e.g.,\n[33, 17, 31, 1]. Table 1 reports the median AUCs (across 50 trials) for all three of these examples; the\nMQGM outperforms all other methods on the autoregressive example; on the normal and Student t\nexamples, it performs quite competitively.\n\nFigure 1: Left: data from the ring distribution (blue) as well as new samples from the MQGM (red) \ufb01tted to\nthe same (blue) data, obtained by using our Gibbs sampler. Right: conditional dependencies recovered by the\nMQGM, MB, GLasso, and SpaceJam on the ring data; black means conditional dependence. 
The MQGM is the only method that successfully recovers the underlying conditional dependencies along its solution path.

Table 1: AUC values for the MQGM, MB, GLasso, SpaceJam, the nonparanormal skeptic, TIGER, and Laplace for the normal, Student t, and autoregressive data settings; higher is better, best in bold.

                  Normal                    Student t                  Autoregressive
          n=50    n=100   n=300     n=50    n=100   n=300     n=50    n=100   n=300
MQGM      0.953   0.976   0.988     0.928   0.947   0.981     0.726   0.754   0.955
MB        0.850   0.959   0.994     0.844   0.923   0.988     0.532   0.563   0.725
GLasso    0.908   0.964   0.998     0.691   0.605   0.965     0.541   0.620   0.711
SpaceJam  0.889   0.968   0.997     0.893   0.965   0.993     0.624   0.708   0.854
Nonpara.  0.881   0.962   0.996     0.862   0.942   0.998     0.545   0.590   0.612
TIGER     0.732   0.921   0.996     0.420   0.873   0.989     0.503   0.518   0.718
Laplace   0.803   0.931   0.989     0.800   0.876   0.991     0.530   0.554   0.758

Figure 2: Top panel and bottom row, middle panel: conditional dependencies recovered by the MQGM on the flu data; each of the first ten cells corresponds to a region of the U.S., and black means dependence. Bottom row, left panel: wallclock time (in seconds) for solving one subproblem using ADMM versus SCS.
Bottom row, right panel: samples from the fitted marginal distribution of the weekly flu incidence rates at region 6; samples at larger quantiles are shaded lighter, and the median is in darker blue.

6.2 Modeling flu epidemics

We study n = 937 weekly flu incidence reports from September 28, 1997 through August 30, 2015, across 10 regions in the United States (see the top panel of Figure 2), obtained from [6]. We considered d = 20 variables: the first 10 encode the current week's flu incidence (precisely, the percentage of doctors' visits in which flu-like symptoms are presented) in the 10 regions, and the last 10 encode the same, but for the prior week. We set m = 5, r = 99, and also introduced exogenous variables to encode the week numbers, so p = 1. Thus, learning the MQGM here corresponds to learning the structure of a spatiotemporal graphical model, and reduces to solving 20 multiple quantile regression subproblems, each of dimension (19 × 5 + 1) × 99 = 9504. All subproblems took about 1 minute on a 6-core 3.3 GHz Core i7 X980 processor.

The bottom left panel in Figure 2 plots the time (in seconds) taken to solve one subproblem using ADMM versus SCS [30], a cone solver that has been advocated as a reasonable choice for a class of problems encapsulating (4); ADMM outperforms SCS by roughly two orders of magnitude. The bottom middle panel of Figure 2 presents the conditional dependencies recovered by the MQGM. Nonzero entries in the upper left 10 × 10 submatrix correspond to dependencies between the yk variables for k = 1, . . . , 10; e.g., the nonzero (0, 2) entry suggests that the flu reports of regions 1 and 3 are dependent. The lower right 10 × 10 submatrix corresponds to the yk variables for k = 11, . . . , 20, and the nonzero banded entries suggest that, at any region, the previous week's flu incidence (naturally) influences the next week's.
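Dependency matrices like these are read off from the fitted coefficient groups: variable j is estimated to influence y_k whenever the whole block of basis coefficients linking j to subproblem k is nonzero. A small sketch of our own (the storage layout and names are hypothetical; in practice the resulting matrix can be symmetrized with an AND or OR rule, as in neighborhood selection):

```python
import numpy as np

def dependency_matrix(Theta, d, m, tol=1e-8):
    """Build a d x d conditional-dependency indicator from fitted coefficients.

    Theta[k] holds the coefficients for the regression of y_k on all other
    variables: rows come in blocks of m basis functions per predictor, and
    columns index the quantile levels. Predictor j is an estimated neighbor
    of k iff its entire block has norm above tol."""
    D = np.zeros((d, d), dtype=bool)
    for k in range(d):
        others = [j for j in range(d) if j != k]
        for i, j in enumerate(others):
            block = Theta[k][i * m:(i + 1) * m, :]
            D[k, j] = np.linalg.norm(block) > tol
    return D
```

The group lasso penalty in the MQGM zeroes out such blocks jointly across basis functions and quantile levels, which is what makes this simple thresholding read-out meaningful.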
The top panel of Figure 2 visualizes these relationships by drawing an edge between dependent regions; region 6 is highly connected, suggesting that it may be a bellwether for other regions, roughly in keeping with the current understanding of flu dynamics. To draw samples from the fitted distributions, we ran our Gibbs sampler over the year, generating 1000 total samples, making 5 passes over all coordinates between each sample, and with a burn-in period of 100 iterations. The bottom right panel of Figure 2 plots samples from the marginal distribution of the percentages of flu reports at region 6 (other regions are in the supplement) throughout the year, revealing the heteroskedastic nature of the data.

For space reasons, our last example, on wind power data, is presented in the supplement.

7 Discussion

We proposed and studied the Multiple Quantile Graphical Model (MQGM). We established theoretical and empirical backing to the claim that the MQGM is capable of compactly representing relationships between heteroskedastic non-Gaussian variables. We also developed efficient algorithms for both estimation and sampling in the MQGM. All in all, we believe that our work represents a step forward in the design of flexible yet tractable graphical models.

Acknowledgements  AA was supported by DOE Computational Science Graduate Fellowship DE-FG02-97ER25308. JZK was supported by an NSF Expeditions in Computation Award, CompSustNet, CCF-1522054. RJT was supported by NSF Grants DMS-1309174 and DMS-1554123.

References

[1] Alnur Ali, Kshitij Khare, Sang-Yun Oh, and Bala Rajaratnam. Generalized pseudolikelihood methods for inverse covariance estimation. Technical report, 2016.
Available at http://arxiv.org/pdf/1606.00033.pdf.

[2] Onureena Banerjee, Laurent El Ghaoui, and Alexandre d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485–516, 2008.

[3] Alexandre Belloni and Victor Chernozhukov. ℓ1-penalized quantile regression in high-dimensional sparse models. Annals of Statistics, 39(1):82–130, 2011.

[4] Julian Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society: Series B, 36(2):192–236, 1974.

[5] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[6] Centers for Disease Control and Prevention (CDC). Influenza national and regional level graphs and data, August 2015. URL http://gis.cdc.gov/grasp/fluview/fluportaldashboard.html.

[7] Shizhe Chen, Daniela Witten, and Ali Shojaie. Selection and estimation for mixed graphical models. Biometrika, 102(1):47–64, 2015.

[8] Arthur Dempster. Covariance selection. Biometrics, 28(1):157–175, 1972.

[9] Jianqing Fan, Yingying Fan, and Emre Barut. Adaptive robust variable selection. Annals of Statistics, 42(1):324–351, 2014.

[10] Michael Finegold and Mathias Drton. Robust graphical modeling of gene networks using classical and alternative t-distributions. Annals of Applied Statistics, 5(2A):1057–1080, 2011.

[11] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.

[12] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Applications of the lasso and grouped lasso to the estimation of sparse graphical models. Technical report, 2010.
Available at http://statweb.stanford.edu/~tibs/ftp/ggraph.pdf.

[13] David Heckerman, David Maxwell Chickering, David Meek, Robert Rounthwaite, and Carl Kadie. Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research, 1:49–75, 2000.

[14] Holger Höfling and Robert Tibshirani. Estimation of sparse binary pairwise Markov networks using pseudo-likelihoods. Journal of Machine Learning Research, 10:883–906, 2009.

[15] Nicholas Johnson. A dynamic programming algorithm for the fused lasso and ℓ0-segmentation. Journal of Computational and Graphical Statistics, 22(2):246–260, 2013.

[16] Kengo Kato. Group lasso for high dimensional sparse quantile regression models. Technical report, 2011. Available at http://arxiv.org/pdf/1103.1458.pdf.

[17] Kshitij Khare, Sang-Yun Oh, and Bala Rajaratnam. A convex pseudolikelihood framework for high dimensional partial correlation estimation with convergence guarantees. Journal of the Royal Statistical Society: Series B, 77(4):803–825, 2014.

[18] Roger Koenker. Quantile Regression. Cambridge University Press, 2005.

[19] Roger Koenker. Additive models for quantile regression: Model selection and confidence bandaids. Brazilian Journal of Probability and Statistics, 25(3):239–262, 2011.

[20] Roger Koenker and Gilbert Bassett. Regression quantiles. Econometrica, 46(1):33–50, 1978.

[21] Roger Koenker, Pin Ng, and Stephen Portnoy. Quantile smoothing splines. Biometrika, 81(4):673–680, 1994.

[22] Steffen Lauritzen. Graphical models. Oxford University Press, 1996.

[23] Jason Lee and Trevor Hastie. Structure learning of mixed graphical models. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics, pages 388–396, 2013.

[24] Han Liu and Lie Wang. TIGER: A tuning-insensitive approach for optimally estimating Gaussian graphical models. Technical report, 2012. Available at http://arxiv.org/pdf/1209.2437.pdf.

[25] Han Liu, John Lafferty, and Larry Wasserman. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research, 10:2295–2328, 2009.

[26] Han Liu, Fang Han, Ming Yuan, John Lafferty, and Larry Wasserman. High-dimensional semiparametric Gaussian copula graphical models. The Annals of Statistics, pages 2293–2326, 2012.

[27] Lukas Meier, Sara van de Geer, and Peter Bühlmann. High-dimensional additive modeling. Annals of Statistics, 37(6):3779–3821, 2009.

[28] Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34(3):1436–1462, 2006.

[29] Jennifer Neville and David Jensen. Dependency networks for relational data. In Proceedings of the Fourth IEEE International Conference on Data Mining, pages 170–177. IEEE, 2004.

[30] Brendan O'Donoghue, Eric Chu, Neal Parikh, and Stephen Boyd. Operator splitting for conic optimization via homogeneous self-dual embedding. Technical report, 2013. Available at https://stanford.edu/~boyd/papers/pdf/scs.pdf.

[31] Sang-Yun Oh, Onkar Dalal, Kshitij Khare, and Bala Rajaratnam. Optimization methods for sparse pseudolikelihood graphical model selection. In Advances in Neural Information Processing Systems 27, pages 667–675, 2014.

[32] Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):123–231, 2013.

[33] Jie Peng, Pei Wang, Nengfeng Zhou, and Ji Zhu. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association, 104(486):735–746, 2009.

[34] Garvesh Raskutti, Martin Wainwright, and Bin Yu.
Minimax-optimal rates for sparse additive models over kernel classes via convex programming. Journal of Machine Learning Research, 13:389–427, 2012.

[35] Guilherme Rocha, Peng Zhao, and Bin Yu. A path following algorithm for sparse pseudo-likelihood inverse covariance estimation (SPLICE). Technical report, 2008. Available at https://www.stat.berkeley.edu/~binyu/ps/rocha.pseudo.pdf.

[36] Adam Rothman, Peter Bickel, Elizaveta Levina, and Ji Zhu. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494–515, 2008.

[37] Kyung-Ah Sohn and Seyoung Kim. Joint estimation of structured sparsity and output structure in multiple-output regression via inverse covariance regularization. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, pages 1081–1089, 2012.

[38] Ichiro Takeuchi, Quoc Le, Timothy Sears, and Alexander Smola. Nonparametric quantile estimation. Journal of Machine Learning Research, 7:1231–1264, 2006.

[39] Cristiano Varin and Paolo Vidoni. A note on composite likelihood inference and model selection. Biometrika, 92(3):519–528, 2005.

[40] Arend Voorman, Ali Shojaie, and Daniela Witten. Graph estimation with joint additive models. Biometrika, 101(1):85–101, 2014.

[41] Yuchung Wang and Edward Ip. Conditionally specified continuous distributions. Biometrika, 95(3):735–746, 2008.

[42] Matt Wytock and Zico Kolter. Sparse Gaussian conditional random fields: Algorithms, theory, and application to energy forecasting. In Proceedings of the 30th International Conference on Machine Learning, pages 1265–1273, 2013.

[43] Eunho Yang, Pradeep Ravikumar, Genevera Allen, and Zhandong Liu. Graphical models via generalized linear models. In Advances in Neural Information Processing Systems 25, pages 1358–1366, 2012.

[44] Eunho Yang, Pradeep Ravikumar, Genevera Allen, and Zhandong Liu.
Graphical models via univariate exponential family distributions. Journal of Machine Learning Research, 16:3813–3847, 2015.

[45] Ming Yuan and Yi Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.

[46] Xiao-Tong Yuan and Tong Zhang. Partial Gaussian graphical model estimation. IEEE Transactions on Information Theory, 60(3):1673–1687, 2014.