The Expxorcist: Nonparametric Graphical Models Via Conditional Exponential Densities

Arun Sai Suggala (Carnegie Mellon University, Pittsburgh, PA 15213)
Mladen Kolar (University of Chicago, Chicago, IL 60637)
Pradeep Ravikumar (Carnegie Mellon University, Pittsburgh, PA 15213)

Advances in Neural Information Processing Systems, pp. 4446-4456.

Abstract

Non-parametric multivariate density estimation faces strong statistical and computational bottlenecks, and the more practical approaches impose near-parametric assumptions on the form of the density functions. In this paper, we leverage recent developments to propose a class of non-parametric models which have very attractive computational and statistical properties. Our approach relies on the simple function space assumption that the conditional distribution of each variable conditioned on the other variables has a non-parametric exponential family form.

1 Introduction

Let $X = (X_1, \ldots, X_p)$ be a $p$-dimensional random vector.
Let $G = (V, E)$ be the graph that encodes the conditional independence assumptions underlying the distribution of $X$; that is, each node of the graph corresponds to a component of the vector $X$, and $(a, b) \in E$ if and only if $X_a \not\perp\!\!\!\perp X_b \mid X_{\neg ab}$, with $X_{\neg ab} := \{X_c \mid c \in V \setminus \{a, b\}\}$. The graphical model represented by $G$ is then the set of distributions over $X$ that satisfy the conditional independence assumptions specified by the graph $G$.

There has been a considerable line of work on learning parametric families of such graphical model distributions from data [22, 20, 13, 28], where the distribution is indexed by a finite-dimensional parameter vector. The goal of this paper, however, is to specify and learn nonparametric families of graphical model distributions, indexed by infinite-dimensional parameters, for which there has been comparatively limited work. Non-parametric multivariate density estimation broadly, even without the graphical model constraint, has not proved as popular in practical machine learning contexts, for both statistical and computational reasons. Loosely, estimating a non-parametric multivariate density under mild assumptions typically requires the number of samples to scale exponentially in the dimension $p$ of the data, which is infeasible even in the big-data era when $n$ is very large. And the resulting estimators are typically computationally expensive or intractable, for instance requiring repeated computations of multivariate integrals.

We present a review of multivariate density estimation that is necessarily incomplete but sets up our proposed approach.
A common approach dating back to [15] uses the logistic density transform to satisfy the positivity and unit-integral constraints for densities, and considers densities of the form $f(X) = \exp(\eta(X)) / \int_{\mathcal{X}} \exp(\eta(x))\,dx$, with some constraints on $\eta$ for identifiability, such as $\eta(X_0) = 0$ for some $X_0 \in \mathcal{X}$, or $\int_{\mathcal{X}} \eta(x)\,dx = 0$.

[Author contact: asuggala@cs.cmu.edu, mkolar@chicagobooth.edu, pradeepr@cs.cmu.edu. 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.]

With the logistic density transform, differing approaches for non-parametric density estimation can be contrasted in part by their assumptions on the infinite-dimensional function space domain of $\eta(\cdot)$. An early approach [8] considered spaces of functions with bounded "roughness" functionals. The predominant line of work, however, has focused on the setting where $\eta(\cdot)$ lies in a Reproducing Kernel Hilbert Space (RKHS), dating back to [21]. Consider the estimation of these logistic density transforms $\eta(X)$ given $n$ i.i.d. samples $\mathcal{X}^n = \{X^{(i)}\}_{i=1}^n$ drawn from $f_\eta(X)$. A natural loss functional is the penalized log likelihood, with a penalty functional that ensures a smooth fit with respect to the function space domain: $\ell(\eta; \mathcal{X}^n) := -\frac{1}{n} \sum_{i \in [n]} \eta(X^{(i)}) + \log \int \exp(\eta(x))\,dx + \lambda\, \mathrm{pen}(\eta)$, for functions $\eta(\cdot)$ that lie in an RKHS $\mathcal{H}$, and where $\mathrm{pen}(\eta) = \|\eta\|_{\mathcal{H}}^2$ is the squared RKHS norm. This was studied by many [21, 11, 6]. A crucial caveat is that the representer theorem for RKHSs does not hold here. Nonetheless, one can consider finite-dimensional function space approximations consisting of the linear span of kernel functions evaluated at the sample points [12].
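The logistic density transform above can be made concrete with a minimal one-dimensional sketch: given any $\eta$, the normalizer is a single integral, here approximated by a midpoint Riemann sum. The grid, the choice of $\eta$, and the function name `logistic_density` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Minimal 1-D sketch of the logistic density transform
# f(x) = exp(eta(x)) / \int exp(eta(u)) du, with the normalizer
# approximated by a midpoint Riemann sum on [lo, hi].
def logistic_density(eta, lo=-1.0, hi=1.0, n_grid=4000):
    dx = (hi - lo) / n_grid
    x = lo + (np.arange(n_grid) + 0.5) * dx    # midpoint grid
    unnorm = np.exp(eta(x))                    # positive by construction
    return x, unnorm / (unnorm.sum() * dx)     # integrates to ~1

# An illustrative eta; any integrable choice yields a valid density.
x, f = logistic_density(lambda u: np.sin(4 * np.pi * u))
```

In one dimension this is trivial; the computational difficulty discussed next arises because in $p$ dimensions the corresponding integral is over $\mathcal{X} \subseteq \mathbb{R}^p$.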
Computationally this still scales poorly with the dimension, due to the need to compute multidimensional integrals of the form $\int \exp(\eta(x))\,dx$, which do not, in general, decompose. These approximations also do not come with strong statistical guarantees.

We briefly note that the function space assumption that $\eta(\cdot)$ lies in an RKHS could also be viewed from the lens of an infinite-dimensional exponential family [4]. Specifically, let $\mathcal{H}$ be a Reproducing Kernel Hilbert Space with reproducing kernel $k(\cdot, \cdot)$ and inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$. Then $\eta(X) = \langle \theta(\cdot), k(X, \cdot) \rangle_{\mathcal{H}}$, so that the density $f(X)$ can in turn be viewed as a member of an infinite-dimensional exponential family with sufficient statistics $k(X, \cdot) : \mathcal{X} \mapsto \mathcal{H}$ and natural parameter $\theta(\cdot) \in \mathcal{H}$. Following this viewpoint, [4] propose estimators via linear span approximations similar to [11].

Due to the computational caveat with exact likelihood based functionals, a line of approaches has focused on penalized surrogate likelihoods instead. [14] study the following loss functional: $\ell(\eta; \mathcal{X}^n) := \frac{1}{n} \sum_{i \in [n]} \exp(-\eta(X^{(i)})) + \int \eta(x) \rho(x)\,dx + \lambda\, \mathrm{pen}(\eta)$, where $\rho(X)$ is some fixed known density with the same support as the unknown density $f(X)$. While this estimation procedure is much more computationally amenable than minimizing the exact penalized likelihood, the caveat, however, is that for a general RKHS this requires solving higher order integrals. The next level of simplification has thus focused on the form of the logistic transform function itself. There has been a line of work on an ANOVA type decomposition of the logistic density function into node-wise and pairwise terms: $\eta(X) = \sum_{s=1}^p \eta_s(X_s) + \sum_{s=1}^p \sum_{t=s+1}^p \eta_{st}(X_s, X_t)$.
A line of work has coupled such a decomposition with the assumption that each of the terms lies in an RKHS. This does not immediately provide a computational benefit: with penalized likelihood based loss functionals, the loss functional does not necessarily decompose into such node and pairwise terms. [24] thus couple this ANOVA type pairwise decomposition with a score matching based objective. [10] use the above decomposition with the surrogate loss functional of [14] discussed above, but note that this still requires the aforementioned function space approximation as a linear span of kernel evaluations, as well as two-dimensional integrals.

A line of recent work has thus focused on further stringent assumptions on the density function space, by assuming some components of the logistic transform to be finite-dimensional. [30] use an ANOVA decomposition but assume the terms belong to finite-dimensional function spaces instead of RKHSs, specified by a pre-defined finite set of basis functions. [29] consider logistic transform functions $\eta(\cdot)$ that have the pairwise decomposition above, with a specific class of parametric pairwise functions $\beta_{st} X_s X_t$, and non-parametric node-wise functions. [17, 16] consider the problem of estimating monotonic node-wise functions such that the transformed random vector is multivariate Gaussian, which could also be viewed as estimating a Gaussian copula density.

To summarize the (necessarily incomplete) review above, non-parametric density estimation faces strong statistical and computational bottlenecks, and the more practical approaches impose stringent near-parametric assumptions on the form of the (logistic transform of the) density functions. In this paper, we leverage recent developments to propose a very computationally simple non-parametric density estimation algorithm that still comes with strong statistical guarantees.
Moreover, the density can be viewed as a graphical model distribution, with a corresponding sparse conditional independence graph.

Our approach relies on the following simple function space assumption: that the conditional distribution of each variable conditioned on the other variables has a non-parametric exponential family form. As we show, for a consistent joint density to exist, the logistic density transform with respect to a particular base measure necessarily decomposes, in the pairwise case, into the following semi-parametric form: $\eta(X) = \sum_{s=1}^p \theta_s B_s(X_s) + \sum_{s=1}^p \sum_{t=s+1}^p \theta_{st} B_s(X_s) B_t(X_t)$, with both a parametric component $\{\theta_s : s = 1, \ldots, p\} \cup \{\theta_{st} : s < t;\ s, t = 1, \ldots, p\}$, as well as non-parametric components $\{B_s : s = 1, \ldots, p\}$. We call this class of models the "Expxorcist", following other "ghostbusting" semi-parametric models such as the nonparanormal and nonparanormal skeptic [17, 16].

Since the conditional distributions are exponential families, we show that there exist computationally amenable estimators, even in our more general non-parametric setting, where the sufficient statistics have to be estimated as well. The statistical analysis in our non-parametric setting, however, is more subtle, due in part to non-convexity and in part to the non-parametric setting. We also show how the Expxorcist class of densities is closely related to a semi-parametric exponential family copula density that generalizes the Gaussian copula density of [17, 16]. We corroborate the applicability of our class of models with experiments on synthetic and real data sets.

2 Multivariate Density Specification via Conditional Densities

We are interested in the approach of estimating a multivariate density by estimating node-conditional densities.
Since node-conditional densities focus on the density of a single variable, albeit conditioned on the rest of the variables, estimating these is potentially a simpler problem, both statistically and computationally, than estimating the entire joint density itself. Let us consider the general non-parametric conditional density estimation problem. Given the general multivariate density $f(X) = \exp(\eta(X)) / \int_{\mathcal{X}} \exp(\eta(x))\,dx$, the conditional density of a variable $X_s$ given the rest of the variables $X_{-s}$ is $f(X_s \mid X_{-s}) = \exp(\eta((X_s, X_{-s}))) / \int_{\mathcal{X}_s} \exp(\eta((x, X_{-s})))\,dx$, which does not involve a multi-dimensional integral, but otherwise does not have a computationally amenable form. There has been a line of work on such conditional density estimation, mirroring developments in multivariate density estimation [9, 18, 23], but unlike parametric settings, there are no large sample complexity gains with non-parametric conditional density estimation under general settings. There have also been efforts to use ANOVA decompositions in a conditional density context [31, 26].

In addition to the computational and sample complexity caveats, recall that in our context we would like to use conditional density estimates to infer a joint multivariate density. A crucial caveat with using the above estimates to do so is that it is not clear when the estimated node-conditional densities would be consistent with a joint multivariate density. There has been a line of work on this question (of when conditional densities are consistent with a joint density) for parametric densities; see [1] for an overview, with more recent results in [27, 5, 2, 25]. Overall, while estimating node-conditional densities can be viewed as surrogate estimation of a joint density, arbitrary node-conditional distributions need not be consistent in general with any joint density.
There has, however, been a line of work in recent years [3, 28] showing that when the node-conditional distributions belong to an exponential family, then under certain conditions on their parameterization, there do exist multivariate densities consistent with the node-conditional densities. In the next section, we leverage these results towards non-parametric estimation of conditional densities.

3 Conditional Densities of an Exponential Family Form

We first recall the definition of an exponential family in the context of a conditional density.

Definition 1. A conditional density of a random variable $Y \in \mathcal{Y}$ given covariates $Z := (Z_1, \ldots, Z_m) \in \mathcal{Z}$ is said to have an exponential family form if it can be written as $f(Y \mid Z) = \exp(B(Y)^T E(Z) + C(Y) + D(Z))$, for some functions $B : \mathcal{Y} \mapsto \mathbb{R}^k$ (for some finite integer $k > 0$), $E : \mathcal{Z} \mapsto \mathbb{R}^k$, $C : \mathcal{Y} \mapsto \mathbb{R}$ and $D : \mathcal{Z} \mapsto \mathbb{R}$.

Thus, $f(Y \mid Z)$ belongs to a finite-dimensional exponential family with sufficient statistics $B(Y)$, base measure $\exp(C(Y))$, natural parameter $E(Z)$, and log-partition function $-D(Z)$. Contrast this with a general conditional density $f(Y \mid Z) = \exp(h(Y, Z) + C(Y) + D(Z))$ with respect to the base measure $\exp(C(Y))$, with $-D(Z)$ the log-normalization constant: a conditional density of the exponential family form has a logistic density transform $h(Y, Z)$ that factorizes as $B(Y)^T E(Z)$.

Consider the case where the sufficient statistic function is real-valued. The non-parametric estimation problem for a conditional density of exponential family form then reduces to the estimation of the sufficient statistics function $B(\cdot)$ and the natural parameter function $E(\cdot)$, assuming the base measure $C(\cdot)$ is given.
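A one-dimensional numerical sketch of Definition 1 may help fix ideas: the density is determined by the three functions $B$, $E$, $C$, with the normalizer computed on a grid. The particular choices of $B$, $E$, $C$ and the helper name `conditional_density` below are illustrative assumptions, not the paper's.

```python
import numpy as np

# 1-D sketch of Definition 1: f(y | z) is proportional to
# exp(B(y) * E(z) + C(y)), with the log-partition function -D(z)
# handled numerically by normalizing over a grid of y values.
def conditional_density(y_grid, z, B, E, C):
    logits = B(y_grid) * E(z) + C(y_grid)
    w = np.exp(logits - logits.max())        # stabilized unnormalized density
    dy = y_grid[1] - y_grid[0]
    return w / (w.sum() * dy)                # now integrates to ~1 over y_grid

y_grid = np.linspace(-1.0, 1.0, 2001)
f = conditional_density(y_grid, z=0.3,
                        B=lambda y: np.sin(np.pi * y),   # sufficient statistic
                        E=lambda z: 2.0 * z,             # natural parameter
                        C=lambda y: -y ** 2)             # log base measure
```

Note that changing $z$ only rescales the contribution of $B(y)$, which is precisely the factorized structure $h(Y, Z) = B(Y) E(Z)$ discussed above.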
But when would such estimated conditional densities be consistent with a joint density? To answer this question, we draw upon developments in [28]. Suppose that the node-conditional distributions of each random variable $X_s$ conditioned on the rest of the random variables have the exponential family form in Definition 1, so that for each $s \in V$

$$P(X_s \mid X_{-s}) \propto \exp\{ E_s(X_{-s}) B_s(X_s) + C_s(X_s) \}, \qquad (1)$$

for some arbitrary functions $E_s(\cdot), B_s(\cdot), C_s(\cdot)$ that specify a valid conditional density. Then [28] show that these node-conditional densities are consistent with a unique joint density over the random vector $X$, that moreover factors according to a set of cliques $\mathcal{C}$ in the graph $G$, if and only if the functions $\{E_s(\cdot)\}_{s \in V}$ specifying the node-conditional distributions have the form $E_s(X_{-s}) = \theta_s + \sum_{C \in \mathcal{C} : s \in C} \theta_C \prod_{t \in C, t \neq s} B_t(X_t)$, where $\{\theta_s\} \cup \{\theta_C\}_{C \in \mathcal{C}}$ is a set of parameters. Moreover, the corresponding consistent joint distribution has the following form:

$$P(X) \propto \exp\Big\{ \sum_{s \in V} \theta_s B_s(X_s) + \sum_{C \in \mathcal{C}} \theta_C \prod_{s \in C} B_s(X_s) + \sum_{s \in V} C_s(X_s) \Big\}. \qquad (2)$$

In this paper, we are interested in the non-parametric estimation of the Expxorcist class of densities in (2), where we estimate both the finite-dimensional parameters $\{\theta_s\} \cup \{\theta_C\}_{C \in \mathcal{C}}$ and the functions $\{B_s(X_s)\}_{s \in V}$. We assume we are given the base measures $\{C_s(X_s)\}_{s \in V}$, so that the joint density is with respect to a given product base measure $\prod_{s \in V} \exp(C_s(X_s))$, as is common in the multivariate density estimation literature. Note that this is not a very restrictive assumption. In practice the base measure at each node can be well approximated using the empirical univariate marginal density of that node.
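The practical recipe just mentioned, approximating the log base measure $C_s$ by the log of an empirical estimate of the univariate marginal of node $s$, can be sketched as follows. The histogram estimator, the bin count, and the smoothing constant `eps` are our illustrative choices, not the paper's procedure.

```python
import numpy as np

# Sketch: approximate the log base measure C_s(x) by the log of a
# histogram-based estimate of the univariate marginal density of node s.
def log_base_measure(samples, lo=-1.0, hi=1.0, n_bins=20, eps=1e-3):
    density, edges = np.histogram(samples, bins=n_bins, range=(lo, hi),
                                  density=True)
    def C(x):
        idx = np.clip(np.searchsorted(edges, x, side="right") - 1,
                      0, n_bins - 1)
        return np.log(density[idx] + eps)   # eps keeps the log finite
    return C

rng = np.random.default_rng(0)
# For uniform samples on [-1, 1], the marginal density is ~0.5,
# so C(x) should be close to log(0.5) in the interior of the domain.
C = log_base_measure(rng.uniform(-1.0, 1.0, size=5000))
```

Any smoother univariate density estimator (e.g. a kernel density estimate) would serve the same role.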
We could also extend our algorithm, which we present next, to estimate the base measures along with the sufficient statistic functions.

4 Regularized Conditional Likelihood Estimation for Exponential Family Form Densities

We consider the nonparametric problem of estimating a joint density of the form in (2), focusing on the pairwise case where the factors have size at most $k = 2$, so that the joint density takes the form

$$P(X) \propto \exp\Big\{ \sum_{s \in V} \theta_s B_s(X_s) + \sum_{(s,t) \in E} \theta_{st} B_s(X_s) B_t(X_t) + \sum_{s \in V} C_s(X_s) \Big\}. \qquad (3)$$

As detailed in the previous section, estimating this joint density can be reduced to estimating its node-conditional densities, which take the form

$$P(X_s \mid X_{-s}) \propto \exp\Big\{ B_s(X_s) \Big( \theta_s + \sum_{t \in N_G(s)} \theta_{st} B_t(X_t) \Big) + C_s(X_s) \Big\}. \qquad (4)$$

We now introduce some notation which we use in the sequel. Let $\Theta = \{\theta_s\}_{s \in V} \cup \{\theta_{st}\}_{s \neq t}$ and $\Theta_s = \theta_s \cup \{\theta_{st}\}_{t \in V \setminus \{s\}}$. Let $B = \{B_s\}_{s \in V}$ be the set of sufficient statistics. Let $\mathcal{X}_s$ be the domain of $X_s$, which we assume is bounded, and let $L^2(\mathcal{X}_s)$ be the Hilbert space of square integrable functions over $\mathcal{X}_s$ with respect to the Lebesgue measure. We assume that the sufficient statistics satisfy $B_s(\cdot) \in L^2(\mathcal{X}_s)$.

Note that the model in Equation (3) is unidentifiable. To overcome this issue we impose additional constraints on its parameters. Specifically, we require $B_s(X_s)$ to satisfy $\int_{\mathcal{X}_s} B_s(X)\,dX = 0$, $\int_{\mathcal{X}_s} B_s(X)^2\,dX = 1$, and $\theta_s \geq 0$, for all $s \in V$.

Optimization objective: Let $\mathcal{X}^n = \{X^{(1)}, \ldots, X^{(n)}\}$ be $n$ i.i.d. samples drawn from a joint density of the form in Equation (3), with parameters $\Theta^*, B^*$.
Let $L_s(\Theta_s, B; \mathcal{X}^n)$ be the node-conditional negative log likelihood at node $s$:

$$L_s(\Theta_s, B; \mathcal{X}^n) = \frac{1}{n} \sum_{i=1}^n \Big\{ -B_s(X_s^{(i)}) \Big( \theta_s + \sum_{t \in V \setminus s} \theta_{st} B_t(X_t^{(i)}) \Big) + A(X_{-s}^{(i)}; \Theta_s, B) \Big\},$$

where $A(X_{-s}; \Theta_s, B)$ is the log-partition function. To estimate the unknown parameters, we solve the following regularized node-conditional log-likelihood estimation problem at each node $s \in V$:

$$\min_{\Theta_s, B}\ L_s(\Theta_s, B; \mathcal{X}^n) + \lambda_n \|\Theta_s\|_1 \quad \text{s.t. } \theta_s \geq 0,\ \int_{\mathcal{X}_t} B_t(X)\,dX = 0,\ \int_{\mathcal{X}_t} B_t(X)^2\,dX = 1\ \forall t \in V. \qquad (5)$$

The equality constraints on the norms of the functions $B_t(\cdot)$ make the above optimization problem a difficult one to solve. While the norm constraints on $B_t(\cdot), \forall t \in V \setminus s$ can be handled through re-parametrization, the constraint on $B_s(\cdot)$ cannot be handled efficiently. To make the problem more amenable to numerical optimization techniques, we solve a closely related optimization problem. At each node $s \in V$, we consider the following re-parametrization of $B$: $B_s(X_s) \leftarrow \theta_s B_s(X_s)$ and $B_t(X_t) \leftarrow (\theta_{st}/\theta_s) B_t(X_t), \forall t \in V \setminus \{s\}$. With a slight abuse of notation we redefine $L_s$ using this re-parametrization as

$$L_s(B; \mathcal{X}^n) = \frac{1}{n} \sum_{i=1}^n \Big\{ -B_s(X_s^{(i)}) \Big( 1 + \sum_{t \in V \setminus s} B_t(X_t^{(i)}) \Big) + A(X_{-s}^{(i)}; B) \Big\}, \qquad (6)$$

where $A(X_{-s}; B)$ is the log-partition function. We solve the following optimization problem, which is closely related to the original optimization in Equation (5):

$$\min_B\ L_s(B; \mathcal{X}^n) + \lambda_n \sum_{t \in V} \sqrt{\int_{\mathcal{X}_t} B_t(X)^2\,dX} \quad \text{s.t. } \int_{\mathcal{X}_t} B_t(X)\,dX = 0\ \forall t \in V. \qquad (7)$$

For more details on the relation between (5) and (7), please refer to the Appendix.

Algorithm: We now present our algorithm for the optimization of (7). In the sequel, for simplicity, we assume that the domains $\mathcal{X}_t$ of the random variables $X_t$ are all the same and equal to $\mathcal{X}$. In order to estimate the functions $B_t$, we expand them over a uniformly bounded, orthonormal basis $\{\phi_k(\cdot)\}_{k=0}^\infty$ of $L^2(\mathcal{X})$ with $\phi_0(\cdot) \propto 1$. Expansion of the functions $B_t(\cdot)$ over this basis yields

$$B_t(X) = \sum_{k=1}^m \alpha_{t,k} \phi_k(X) + \rho_{t,m}(X), \quad \text{where } \rho_{t,m}(X) = \alpha_{t,0} \phi_0(X) + \sum_{k=m+1}^\infty \alpha_{t,k} \phi_k(X).$$

Note that the constraint $\int_{\mathcal{X}} B_t(X)\,dX = 0$ in Equation (7) translates to $\alpha_{t,0} = 0$. To convert the infinite-dimensional optimization problem in (7) into a finite-dimensional problem, we truncate the basis expansion to the top $m$ terms and approximate $B_t(\cdot)$ as $\sum_{k=1}^m \alpha_{t,k} \phi_k(\cdot)$. The optimization problem in Equation (7) can then be rewritten as

$$\min_{\alpha_m}\ L_{s,m}(\alpha_m; \mathcal{X}^n) + \lambda_n \sum_{t \in V} \|\alpha_{t,m}\|_2, \qquad (8)$$

where $\alpha_{t,m} = \{\alpha_{t,k}\}_{k=1}^m$, $\alpha_m = \{\alpha_{t,m}\}_{t \in V}$, and $L_{s,m}$ is defined as

$$L_{s,m}(\alpha_m; \mathcal{X}^n) = \frac{1}{n} \sum_{i=1}^n \Big\{ -\sum_{k=1}^m \alpha_{s,k} \phi_k(X_s^{(i)}) \Big( 1 + \sum_{t \in V \setminus \{s\}} \sum_{l=1}^m \alpha_{t,l} \phi_l(X_t^{(i)}) \Big) + A(X_{-s}^{(i)}; \alpha_m) \Big\}.$$

Iterative minimization of (8): Note that the objective in (8) is non-convex. In this work, we use a simple alternating minimization technique for its optimization, alternately minimizing over $\alpha_{s,m}$ and $\{\alpha_{t,m}\}_{t \in V \setminus s}$ while fixing the other parameters. The resulting optimization problem in each of the alternating steps is convex.
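The truncated basis expansion used above can be sketched numerically. Here we use the cosine basis $\phi_k(x) = \sqrt{2}\cos(\pi k x)$, orthonormal on $[0, 1]$; the domain, the midpoint-rule quadrature, and the target function are illustrative assumptions (the paper works over a generic bounded domain).

```python
import numpy as np

# Sketch of the truncation B(x) ~ sum_{k=1}^m alpha_k phi_k(x), with the
# coefficients alpha_k = \int B(x) phi_k(x) dx computed by a midpoint rule.
def phi(k, x):
    return np.sqrt(2.0) * np.cos(np.pi * k * x)   # cosine basis on [0, 1]

def truncate(B, m, n_grid=4000):
    dx = 1.0 / n_grid
    x = (np.arange(n_grid) + 0.5) * dx            # midpoint grid on [0, 1]
    coefs = np.array([(B(x) * phi(k, x)).sum() * dx for k in range(1, m + 1)])
    approx = sum(c * phi(k, x) for k, c in zip(range(1, m + 1), coefs))
    return x, coefs, approx

B = lambda x: phi(2, x)                           # mean-zero, unit-norm target
x, coefs, approx = truncate(B, m=30)
err = np.max(np.abs(B(x) - approx))               # ~0: target lies in the span
```

Since $\phi_0 \propto 1$ is excluded, every truncated expansion automatically satisfies the mean-zero constraint $\int B_t(X)\,dX = 0$.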
We use Proximal Gradient Descent to optimize these sub-problems. To compute the objective and its gradients, we need to numerically evaluate the one-dimensional integrals in the log-partition function. To do this, we choose a uniform grid of points over the domain and use quadrature rules to approximate the integrals.

Convergence: Although (8) is non-convex, we can show that under certain conditions on the objective function, the alternating minimization procedure converges to the global minimum. In recent work, [32] analyze alternating minimization for low rank matrix factorization problems and show that it converges to a global minimum if the sequence of convex problems is strongly convex and satisfies certain other regularity conditions. The analysis of [32] can be extended to show global convergence of alternating minimization for (8).

5 Statistical Properties

In this section we provide parameter estimation error rates for the node-conditional estimator in Equation (8). Note that these rates are for the re-parameterized model described in Equation (6) and can be easily translated to guarantees on the original model described in Equations (3) and (4).

Notation: Let $B_2(x, r) = \{y : \|y - x\|_2 \leq r\}$ be the $\ell_2$ ball with center $x$ and radius $r$. Let $\{B_t^*(\cdot)\}_{t \in V}$ be the true functions of the re-parametrized model, which we would like to estimate from the data. Denote the basis expansion coefficients of $B_t(\cdot)$ with respect to the orthonormal basis $\{\phi_k(\cdot)\}_{k=0}^\infty$ by $\alpha_t$, which is an infinite-dimensional vector, and let $\alpha_t^*$ be the coefficients of $B_t^*(\cdot)$. And let $\alpha_{t,m}$ be the coefficients corresponding to the top $m$ basis functions in the basis expansion of $B_t(\cdot)$. Note that $\int B_t(X)^2\,dX = \|\alpha_t\|_2^2$. Let $\alpha = \{\alpha_t\}_{t \in V}$ and $\alpha_m = \{\alpha_{t,m}\}_{t \in V}$. Let $\bar{L}_{s,m}(\alpha_m) = \mathbb{E}[L_{s,m}(\alpha_m; \mathcal{X}^n)]$ be the population version of the sample loss defined in Equation (8). We will often omit $\mathcal{X}^n$ from $L_{s,m}(\alpha_m; \mathcal{X}^n)$ when clear from the context. We let $(\alpha_t - \alpha_{t,m})$ be the difference between the infinite-dimensional vector $\alpha_t$ and the vector obtained by appropriately padding $\alpha_{t,m}$ with zeros. Finally, we define the norm $R(\cdot)$ as $R(\alpha_m) = \sum_{t \in V} \|\alpha_{t,m}\|_2$ and its dual as $R^*(\alpha_m) = \sup_{t \in V} \|\alpha_{t,m}\|_2$. The norms on the infinite-dimensional vector $\alpha$ are defined similarly.

We now state our key assumption on the loss function $L_{s,m}(\cdot)$. This assumption imposes a strong curvature condition on $L_{s,m}$ along certain directions in a ball around $\alpha_m^*$.

Assumption 1. There exist $r_m > 0$ and constants $c, \kappa > 0$ such that for any $\Delta_m \in B_2(0, r_m)$ the gradient of the sample loss $L_{s,m}$ satisfies:

$$\langle \nabla L_{s,m}(\alpha_m^* + \Delta_m) - \nabla L_{s,m}(\alpha_m^*), \Delta_m \rangle \geq \kappa \|\Delta_m\|_2^2 - c \sqrt{\tfrac{m \log p}{n}}\, R(\Delta_m).$$

Similar assumptions are increasingly common in analyses of non-convex estimators; see [19] and references therein. We are now ready to state our results, which give the parameter estimation error rates; the proofs can be found in the Appendix. We first provide a deterministic bound on the error $\|\alpha_m - \alpha_m^*\|_2$ in terms of the random quantity $R^*(\nabla L_{s,m}(\alpha_m^*))$. We derive probabilistic results in the subsequent corollaries.

Theorem 2. Let $N_s$ be the true neighborhood of node $s$, with $|N_s| = d$. Suppose $L_{s,m}$ satisfies Assumption 1.
If the regularization parameter $\lambda_n$ is chosen such that $\lambda_n \geq 2 R^*(\nabla L_{s,m}(\alpha_m^*)) + 2c \sqrt{\frac{m \log p}{n}}$, then any stationary point $\hat{\alpha}_m$ of (8) in $B_2(\alpha_m^*, r_m)$ satisfies:

$$\|\hat{\alpha}_m - \alpha_m^*\|_2 \leq \frac{6\sqrt{2}}{\kappa} \sqrt{d}\, \lambda_n.$$

We now provide a set of sufficient conditions under which the random quantity $R^*(\nabla L_{s,m}(\alpha_m^*))$ can be bounded.

Assumption 2. There exists a constant $L > 0$ such that the gradient of the population loss $\bar{L}_{s,m}$ at $\alpha_m^*$ satisfies: $R^*(\nabla \bar{L}_{s,m}(\alpha_m^*)) \leq L\, R^*(\alpha^* - \alpha_m^*)$.

Corollary 3. Suppose the conditions in Theorem 2 are satisfied. Moreover, let $\gamma = \sup_{i \in \mathbb{N}, X \in \mathcal{X}} |\phi_i(X)|$ and $\tau_m = \sup_{t \in V, X \in \mathcal{X}} |\sum_{i=1}^m \alpha_{t,i}^* \phi_i(X)|$. Suppose $L_{s,m}$ satisfies Assumption 2. If the regularization parameter $\lambda_n$ is chosen such that $\lambda_n \geq 2 L R^*(\alpha^* - \alpha_m^*) + c \gamma \tau_m \sqrt{\frac{m d^2 \log p}{n}}$, then with probability at least $1 - 2m/p^2$ any stationary point $\hat{\alpha}_m$ of (8) in $B_2(\alpha_m^*, r_m)$ satisfies:

$$\|\hat{\alpha}_m - \alpha_m^*\|_2 \leq \frac{6\sqrt{2}}{\kappa} \sqrt{d}\, \lambda_n.$$

Theorem 2 and Corollary 3 bound the error of the estimated coefficients in the truncated expansion. The approximation error of the truncated expansion itself depends on the function space assumption, as well as the basis chosen, but can simply be combined with the statement of the above corollary to derive the overall error. As an instance, we present a corollary below for the specific case of the Sobolev space of order two and the trigonometric basis.

Corollary 4.
Suppose the conditions in Corollary 3 are satisfied. Moreover, suppose the true functions $B_t^*(\cdot)$ lie in a Sobolev space of order two. Let $\{\phi_k\}_{k=0}^\infty$ be the trigonometric basis of $L^2(\mathcal{X})$. If the optimization problem (8) is solved with $\lambda_n = c_1 (d^2 \log(p)/n)^{2/5}$ and $m = c_2 (n / d^2 \log(p))^{1/5}$, then with probability at least $1 - 2m/p^2$ any stationary point $\hat{\alpha}_m$ of (8) in $B_2(\alpha_m^*, r_m)$ satisfies:

$$\|\hat{\alpha}_m - \alpha^*\|_2 \leq c_3 \left( \frac{d^{13/4} \log p}{n} \right)^{2/5},$$

where $c_1, c_2, c_3$ depend on $L, \kappa, \gamma, \tau_m$.

Discussion of Assumption 1: We now provide a set of sufficient conditions which ensure the restricted strong convexity (RSC) condition. Suppose the population risk $\bar{L}_{s,m}(\cdot)$ is strongly convex in a ball of radius $r_m$ around $\alpha_m^*$:

$$\langle \nabla \bar{L}_{s,m}(\alpha_m^* + \Delta_m) - \nabla \bar{L}_{s,m}(\alpha_m^*), \Delta_m \rangle \geq \kappa \|\Delta_m\|_2^2 \quad \forall \Delta_m \in B_2(0, r_m). \qquad (9)$$

Moreover, suppose the empirical gradients converge uniformly to the population gradients:

$$\sup_{\alpha_m \in B_2(\alpha_m^*, r_m)} R^*\big( \nabla L_{s,m}(\alpha_m) - \nabla \bar{L}_{s,m}(\alpha_m) \big) \leq c \sqrt{\frac{m \log p}{n}}. \qquad (10)$$

For example, this condition holds with high probability when the gradient of $L_{s,m}(\alpha_m)$ with respect to $\alpha_{t,m}$, for any $t \in [p]$, is a sub-Gaussian process. Equations (9) and (10) are easier to check, and together they ensure that $L_{s,m}(\alpha_m)$ satisfies the RSC property in Assumption 1.

6 Connections to Exponential Family MRF Copulas

The Expxorcist class of models can be viewed as being closely related to an exponential family MRF [28] copula density.
Consider the parametric exponential family MRF joint density in\n(3): PMRF;\u03b8(X) \u221d exp\n,\ns\u2208V Cs(Xs)\nwhere the distribution is indexed by the \ufb01nite-dimensional parameters {\u03b8s}s\u2208V ,{\u03b8st}(s,t)\u2208E, and\nwhere in contrast to the previous sections, we assume we are given the suf\ufb01cient statistics functions\n{Bs(\u00b7)}s\u2208V as well as the nodewise base measures {Cs(\u00b7)}s\u2208V . Now consider the following non-\nparametric problem. Given a random vector X, suppose we are interested in estimating monotonic\nnode-wise functions {fs(Xs)}s\u2208V such that (f1(X1), . . . , fp(Xp)) follows PMRF;\u03b8 for some \u03b8. Let-\nting f(X) = (f1(X1), . . . , fp(Xp)), we have that P(f(X)) = PMRF;\u03b8(f(X)), so that the density of\ns(Xs). This is now a semi-parametric estimation\nproblem, where the unknowns are the functions {fs(Xs)}s\u2208V as well as the \ufb01nite-dimensional pa-\nrameters \u03b8. To simplify this density, suppose we assume that the given node-wise suf\ufb01cient statistics\nare linear, so that Bs(z) = z, for all s \u2208 V , so that density reduces to\n\nX can be written as P(X) \u221d P(f(X))(cid:81)\n\ns\u2208V f(cid:48)\n\n(cid:111)\n\n(cid:88)\n\ns\u2208V\n\n(cid:88)\n\n(s,t)\u2208E(G)\n\n\uf8f1\uf8f2\uf8f3(cid:88)\n\ns\u2208V\n\n(cid:88)\n\n(s,t)\u2208E(G)\n\n\uf8fc\uf8fd\uf8fe .\n\n(cid:48)\ns(Xs))\n\n\uf8fc\uf8fd\uf8fe .\n\n(11)\n\n(12)\n\n(cid:88)\n\ns\u2208V\n\n\uf8f1\uf8f2\uf8f3(cid:88)\n\ns\u2208V\n\nP(X) \u221d exp\n\n\u03b8sfs(Xs) +\n\n\u03b8stfs(Xs) ft(Xt) +\n\n(Cs(fs(Xs)) + log f\n\nIn contrast, the Expxorcist nonparametric exponential family graphical model takes the form\n\nP(X) \u221d exp\n\n\u03b8sfs(Xs) +\n\n\u03b8stfs(Xs) ft(Xt) +\n\nCs(Xs)\n\nIt can be seen that the two densities have very similar forms, except that the density in (11) has a\nmore complex base measure that depends on the unknown functions {fs}s\u2208V and importantly the\nfunctions {fs}s\u2208V in (11) are 
monotonic.
The class of densities in (11) can be cast as an exponential family MRF copula density. Suppose we denote the CDF of the parametric exponential family MRF joint density by FMRF;θ(X), with nodewise marginal CDFs FMRF;θ,s(Xs). Then the marginal CDF of the density (11) can be written as Fs(xs) = P[Xs ≤ xs] = P[fs(Xs) ≤ fs(xs)] = FMRF;θ,s(fs(xs)), so that

fs(xs) = F⁻¹MRF;θ,s(Fs(xs)).   (13)

It then follows that F(X) = FMRF;θ(F⁻¹MRF;θ,1(F1(X1)), . . . , F⁻¹MRF;θ,p(Fp(Xp))), where F(X) is the CDF of density (11). By letting FCOP;θ(U) = FMRF;θ(F⁻¹MRF;θ,1(U1), . . . , F⁻¹MRF;θ,p(Up)) be the exponential family MRF copula distribution function, we see that the CDF of X is precisely F(X) = FCOP;θ(F1(X1), . . . , Fp(Xp)), which is specified by the marginal CDFs {Fs(Xs)}s∈V and the copula FCOP;θ corresponding to the exponential family MRF density. In other words, the non-parametric extension in (11) of the exponential family MRF densities is precisely an exponential family MRF copula density. This development thus generalizes the non-parametric extension of Gaussian MRF densities via the Gaussian copula nonparanormal densities [17]. The caveats with the copula density, however, are two-fold: not only are the node-wise functions restricted to be monotonic, but their estimation as in (13) also requires the estimation of inverses of marginal CDFs of an exponential family MRF, which is intractable in general. Thus, minor differences in the expressions of the Expxorcist density (12) and the exponential family MRF copula density (11) nonetheless have seemingly large consequences for tractable estimation of these densities from data.

7 Experiments

We present experimental results on both synthetic and real datasets. 
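Since the Nonparanormal [17] serves as the main baseline in the experiments, the tractable Gaussian special case of the copula viewpoint above can be made concrete. The following is a minimal sketch of the first (marginal-transformation) step of the two-step Nonparanormal estimator: each coordinate is mapped through Φ⁻¹ ∘ F̂s, after which a Gaussian MRF is fit to the resulting scores. The function name `nonparanormal_scores` and the exact truncation level are illustrative choices, not code from the paper.

```python
import numpy as np
from scipy.stats import norm

def nonparanormal_scores(X, eps=None):
    """Map each column to Gaussian scores via its empirical CDF.

    Sketch of step one of the Nonparanormal [17]: estimate marginal CDFs
    F_s and transform x -> Phi^{-1}(F_s(x)); a Gaussian graphical model
    is then fit to the scores. `eps` Winsorizes the empirical CDF in the
    tails (a truncation of the form suggested in [17]).
    """
    n, p = X.shape
    if eps is None:
        eps = 1.0 / (4.0 * n ** 0.25 * np.sqrt(np.pi * np.log(n)))
    Z = np.empty_like(X, dtype=float)
    for s in range(p):
        ranks = np.argsort(np.argsort(X[:, s])) + 1   # ranks 1..n
        F = ranks / (n + 1.0)                         # rescaled empirical CDF
        F = np.clip(F, eps, 1.0 - eps)                # truncate the tails
        Z[:, s] = norm.ppf(F)                         # Gaussian scores
    return Z

rng = np.random.default_rng(0)
X = rng.standard_exponential((200, 3))  # heavily non-Gaussian marginals
Z = nonparanormal_scores(X)
```

After this transformation, the second step of [17] applies a Gaussian graphical model estimator (e.g. glasso [7]) to the covariance of Z.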
We compare our estimator, Expxorcist, with the Nonparanormal model of [17] and the Gaussian Graphical Model (GGM). We use glasso [7] to estimate the GGM and the two-step estimator of [17] to estimate the Nonparanormal model.

7.1 Synthetic Experiments

Data: We generated synthetic data from the Expxorcist model with chain and grid graph structures. For both graph structures, we set θs = 1, ∀s ∈ V, θst = 1, ∀(s, t) ∈ E, and fixed the domain X to [−1, 1]. We experimented with two choices for the sufficient statistic Bs(X): sin(4πX) and exp(−20(X − 0.5)²) + exp(−20(X + 0.5)²) − 1, and set the log base measure Cs(X) to 0. The grid graph we considered has a 10 × (p/10) structure. We used Gibbs sampling to sample data from these models. We also generated data from a Gaussian distribution with chain and grid graph structures. To generate this data, we set the off-diagonal non-zero entries of the inverse covariance matrix to 0.49 for the chain graph and 0.25 for the grid graph, and the diagonal entries to 1.

Evaluation Metric: We compared the performance of Expxorcist against the baselines on graph structure recovery using ROC curves. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) over different choices of the regularization parameter, where TPR is the fraction of correctly detected edges and FPR is the fraction of mis-identified non-edges.

Experiment Settings: For this experiment we set p = 50 and n ∈ {100, 200, 500}, and varied the regularization parameter λ from 10⁻² to 1. To fit the data to the non-parametric model (3), we used the cosine basis and truncated the basis expansion to the top 30 terms. In practice, one could choose the number of basis functions (m) based on domain knowledge (e.g. "smooth" functions), or, in the absence of such knowledge, use hold-out validation or cross-validation. 
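The Gibbs sampling used above to generate the synthetic chain-graph data can be sketched as follows, under the stated settings (Bs(x) = sin(4πx), Cs = 0, θs = θst = 1, domain [−1, 1]). Each conditional p(xs | x¬s) ∝ exp{(θs + Σt∈N(s) θst Bt(xt)) Bs(xs)} is one-dimensional, so it can be sampled by inverse-CDF on a discretized grid. The grid resolution and burn-in length are illustrative choices, not the paper's.

```python
import numpy as np

def gibbs_chain(n_samples, p, theta=1.0, burn=200, grid=400, seed=0):
    """Gibbs sampler for the chain-graph Expxorcist model of Section 7.1.

    Sketch: B_s(x) = sin(4*pi*x), C_s = 0, theta_s = theta_st = theta,
    domain [-1, 1]. Each node-conditional is sampled by inverse-CDF on a
    uniform grid over the domain.
    """
    rng = np.random.default_rng(seed)
    B = lambda x: np.sin(4 * np.pi * x)
    xs = np.linspace(-1.0, 1.0, grid)
    X = rng.uniform(-1, 1, size=p)
    out = []
    for it in range(burn + n_samples):
        for s in range(p):
            # sum of B_t over chain neighbors of node s
            nbr = sum(B(X[t]) for t in (s - 1, s + 1) if 0 <= t < p)
            logp = theta * (1.0 + nbr) * B(xs)       # unnormalized log-conditional
            w = np.exp(logp - logp.max())            # stabilized weights
            cdf = np.cumsum(w)
            X[s] = xs[np.searchsorted(cdf, rng.uniform() * cdf[-1])]
        if it >= burn:
            out.append(X.copy())
    return np.array(out)

samples = gibbs_chain(100, p=10)
```

The grid-based inverse-CDF step exploits the fact that each conditional is univariate on a bounded domain, which is what makes sampling from the Expxorcist model tractable despite the intractable joint normalization.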
Given N̂(s), the estimated neighborhood for node s, we estimated the overall graph structure as ∪s∈V ∪t∈N̂(s) {(s, t)}. To reduce the variance in the ROC plots, we averaged results over 10 repetitions.

Results: Figure 1 shows the ROC plots obtained from this experiment. Due to lack of space, we present further experimental results in the Appendix. It can be seen that Expxorcist has much better performance on non-Gaussian data. On these datasets, even at n = 500 the baselines chose edges at random. This suggests that in the presence of multiple modes and fat tails, Expxorcist is the better model. Expxorcist performs slightly worse than the baselines on Gaussian data. However, this is expected, because it learns a broader family of distributions than the Nonparanormal.

7.2 Futures Intraday Data

We now present our analysis of Futures price returns. This dataset was downloaded from http://www.kibot.com/. We focus on the Top-26 most liquid instruments traded at the Chicago Mercantile Exchange (CME). The instruments span different sectors such as Energy, Agriculture, Currencies, Equity Indices, Metals and Interest Rates. We focus on the hours of maximum liquidity (9am Eastern to 3pm Eastern) and look at the 1-minute price returns. The return distribution is a mixture of the 1-minute returns with the overnight return. Since overnight returns tend to be bigger than the 1-minute returns within the day, the return distribution is multimodal and fat-tailed. We treat each instrument as a random variable and the 1-minute returns as independent samples drawn from these random variables. We use the data collected in February 2010 as training data and the data from March 2010 as held-out data for tuning parameter selection. After removing samples with missing entries, we are left with 894 training and 650 held-out data samples. We fit Expxorcist and the baselines on this data with the same parameter settings described above. 
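The neighborhood-union rule and the TPR/FPR quantities used for the ROC evaluation in Section 7.1 can be sketched as follows; `edge_union` and `tpr_fpr` are hypothetical helper names for illustration, not code from the paper.

```python
def edge_union(neighborhoods):
    """Combine estimated per-node neighborhoods N^(s) into an undirected
    edge set via the union ("OR") rule of Section 7.1."""
    edges = set()
    for s, nbrs in neighborhoods.items():
        for t in nbrs:
            edges.add((min(s, t), max(s, t)))
    return edges

def tpr_fpr(est_edges, true_edges, p):
    """True/false positive rates plotted in the ROC curves: TPR is the
    fraction of true edges detected, FPR the fraction of non-edges
    mistakenly included."""
    all_pairs = {(s, t) for s in range(p) for t in range(s + 1, p)}
    non_edges = all_pairs - true_edges
    tpr = len(est_edges & true_edges) / max(len(true_edges), 1)
    fpr = len(est_edges & non_edges) / max(len(non_edges), 1)
    return tpr, fpr

# Toy check: a chain on 4 nodes, perfectly recovered neighborhoods.
true_edges = {(0, 1), (1, 2), (2, 3)}
est = edge_union({0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}})
tpr, fpr = tpr_fpr(est, true_edges, p=4)  # → (1.0, 0.0)
```

Sweeping the regularization parameter λ and recording (FPR, TPR) at each value traces out one ROC curve.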
For each of these models, we select the best tuning parameter through the log likelihood on held-out data. However, this criterion resulted in complete graphs for the Nonparanormal and GGM (325 edges) and a relatively sparser graph for Expxorcist (168 edges). So, for a better comparison of these models, we selected tuning parameters for each of the models such that the resulting graphs have almost the same number of edges. Figure 2 shows the learned graphs for one such choice of tuning parameters, which resulted in ∼52 edges in the graphs. The Nonparanormal and GGM resulted in very similar graphs, so we only present the Nonparanormal here. It can be seen that Expxorcist is able to identify the clusters better than the Nonparanormal. More detailed graphs and a comparison with GGM can be found in the Appendix.

Figure 1: ROC plots from synthetic experiments. Top and bottom rows show plots for chain and grid graphs respectively. Left column shows plots for data generated from our non-parametric model with Bs(X) = sin(X), n = 500, and center column shows plots for the other choice of sufficient statistic with n = 500. Right column shows plots for Gaussian data with n = 200.

Figure 2: Graph structures learned for the Futures Intraday Data: (a) Nonparanormal, (b) Expxorcist. The Expxorcist graph shown here was obtained by selecting λ = 0.1. Nodes are colored based on their categories. Edge thickness is proportional to the magnitude of the interaction.

8 Conclusion

In this work we considered the problem of non-parametric density estimation and introduced Expxorcist, a new family of non-parametric graphical models. Our approach relies on a simple function space assumption: that the conditional distribution of each variable conditioned on the other variables has a non-parametric exponential family form. We proposed an estimator for Expxorcist that is computationally efficient and comes with statistical guarantees. 
Our empirical results suggest that, in the presence of multiple modes and fat tails in the data, our non-parametric model is a better choice than the Nonparanormal model of [17].

9 Acknowledgement

A.S. and P.R. acknowledge the support of ARO via W911NF-12-1-0390 and NSF via IIS-1149803, IIS-1447574, DMS-1264033, and NIH via R01 GM117594-01 as part of the Joint DMS/NIGMS Initiative to Support Research at the Interface of the Biological and Mathematical Sciences. M.K. acknowledges support by an IBM Corporation Faculty Research Fund at the University of Chicago Booth School of Business.

References

[1] Barry C. Arnold, Enrique Castillo, and José María Sarabia. Conditionally specified distributions: an introduction. Stat. Sci., 16(3):249–274, 2001. With comments and a rejoinder by the authors.

[2] Patrizia Berti, Emanuela Dreassi, and Pietro Rigo. Compatibility results for conditional distributions. J. Multivar. Anal., 125:190–203, 2014.

[3] Julian Besag. Spatial interaction and the statistical analysis of lattice systems. J. R. Stat. Soc. B, pages 192–236, 1974.

[4] Stéphane Canu and Alex Smola. Kernel methods and the exponential family. Neurocomputing, 69(7-9):714–720, Mar 2006.

[5] Hua Yun Chen. Compatibility of conditionally specified models. Statist. Probab. Lett., 80(7-8):670–677, 2010.

[6] Ronaldo Dias. Density estimation via hybrid splines. J. Statist. Comput. Simulation, 60(4):277–293, 1998.

[7] Jerome H. Friedman, Trevor J. Hastie, and Robert J. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.

[8] I. J. Good and R. A. Gaskins. 
Nonparametric roughness penalties for probability densities. Biometrika, 58:255–277, 1971.

[9] Chong Gu. Smoothing spline density estimation: conditional distribution. Stat. Sinica, 5(2):709–726, 1995.

[10] Chong Gu, Yongho Jeon, and Yi Lin. Nonparametric density estimation in high-dimensions. Stat. Sinica, 23:1131–1153, 2013.

[11] Chong Gu and Chunfu Qiu. Smoothing spline density estimation: theory. Ann. Stat., 21(1):217–234, 1993.

[12] Chong Gu and Jingyuan Wang. Penalized likelihood density estimation: direct cross-validation and scalable approximation. Stat. Sinica, 13(3):811–826, 2003.

[13] Ali Jalali, Pradeep Ravikumar, Vishvas Vasuki, and Sujay Sanghavi. On learning discrete graphical models using group-sparse regularization. In AISTATS, pages 378–387, 2011.

[14] Yongho Jeon and Yi Lin. An effective method for high-dimensional log-density ANOVA estimation, with application to nonparametric graphical model building. Stat. Sinica, 16(2):353–374, 2006.

[15] Tom Leonard. Density estimation, stochastic processes and prior information. J. R. Stat. Soc. B, 40(2):113–146, 1978. With discussion.

[16] Han Liu, Fang Han, Ming Yuan, John D. Lafferty, and Larry A. Wasserman. High-dimensional semiparametric Gaussian copula graphical models. Ann. Stat., 40(4):2293–2326, 2012.

[17] Han Liu, John D. Lafferty, and Larry A. Wasserman. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. J. Mach. Learn. Res., 10:2295–2328, 2009.

[18] Benoît R. Mâsse and Young K. Truong. Conditional logspline density estimation. Canad. J. Statist., 27(4):819–832, 1999.

[19] Song Mei, Yu Bai, and Andrea Montanari. The landscape of empirical risk for non-convex losses. arXiv preprint arXiv:1607.06534, 2016.

[20] Pradeep Ravikumar, Martin J. Wainwright, and John D. Lafferty. 
High-dimensional Ising model selection using ℓ1-regularized logistic regression. Ann. Stat., 38(3):1287–1319, 2010.

[21] B. W. Silverman. On the estimation of a probability density function by the maximum penalized likelihood method. Ann. Stat., 10(3):795–810, 1982.

[22] T. P. Speed and H. T. Kiiveri. Gaussian Markov distributions over finite graphs. Ann. Stat., pages 138–150, 1986.

[23] Charles J. Stone, Mark H. Hansen, Charles Kooperberg, and Young K. Truong. Polynomial splines and their tensor products in extended linear modeling. Ann. Stat., 25(4):1371–1470, 1997. With discussion and a rejoinder by the authors and Jianhua Z. Huang.

[24] Siqi Sun, Jinbo Xu, and Mladen Kolar. Learning structured densities via infinite dimensional exponential families. In Advances in Neural Information Processing Systems, pages 2287–2295, 2015.

[25] Cristiano Varin, Nancy Reid, and David Firth. An overview of composite likelihood methods. Stat. Sinica, 21(1):5–42, 2011.

[26] Arend Voorman, Ali Shojaie, and Daniela M. Witten. Graph estimation with joint additive models. Biometrika, 101(1):85–101, Mar 2014.

[27] Yuchung J. Wang and Edward H. Ip. Conditionally specified continuous distributions. Biometrika, 95(3):735–746, 2008.

[28] Eunho Yang, Pradeep Ravikumar, Genevera I. Allen, and Zhandong Liu. Graphical models via univariate exponential family distributions. J. Mach. Learn. Res., 16(1):3813–3847, 2015.

[29] Zhuoran Yang, Yang Ning, and Han Liu. On semiparametric exponential family graphical models. arXiv preprint arXiv:1412.8697, 2014.

[30] Xiaotong Yuan, Ping Li, Tong Zhang, Qingshan Liu, and Guangcan Liu. Learning additive exponential family graphical models via ℓ2,1-norm regularized M-estimation. 
In Advances in Neural Information Processing Systems, pages 4367–4375, 2016.

[31] Hao Helen Zhang and Yi Lin. Component selection and smoothing for nonparametric regression in exponential families. Stat. Sinica, 16(3):1021–1041, 2006.

[32] Tuo Zhao, Zhaoran Wang, and Han Liu. Nonconvex low rank matrix factorization via inexact first order oracle. In Advances in Neural Information Processing Systems, 2015.