{"title": "On Separability of Loss Functions, and Revisiting Discriminative Vs Generative Models", "book": "Advances in Neural Information Processing Systems", "page_first": 7050, "page_last": 7059, "abstract": "We revisit the classical analysis of generative vs discriminative models for general exponential families, and high-dimensional settings. Towards this, we develop novel technical machinery, including a notion of separability of general loss functions, which allow us to provide a general framework to obtain l\u221e convergence rates for general M-estimators. We use this machinery to analyze l\u221e and l2 convergence rates of generative and discriminative models, and provide insights into their nuanced behaviors in high-dimensions. Our results are also applicable to differential parameter estimation, where the quantity of interest is the difference between generative model parameters.", "full_text": "On Separability of Loss Functions, and Revisiting\n\nDiscriminative Vs Generative Models\n\nAdarsh Prasad\n\nMachine Learning Dept.\n\nCMU\n\nAlexandru Niculescu-Mizil\nNEC Laboratories America\n\nPrinceton, NJ, USA\n\nadarshp@andrew.cmu.edu\n\nalex@nec-labs.com\n\npradeepr@cs.cmu.edu\n\nPradeep Ravikumar\nMachine Learning Dept.\n\nCMU\n\nAbstract\n\nWe revisit the classical analysis of generative vs discriminative models for general\nexponential families, and high-dimensional settings. Towards this, we develop\nnovel technical machinery, including a notion of separability of general loss func-\ntions, which allow us to provide a general framework to obtain `1 convergence\nrates for general M-estimators. We use this machinery to analyze `1 and `2\nconvergence rates of generative and discriminative models, and provide insights\ninto their nuanced behaviors in high-dimensions. 
Our results are also applicable to differential parameter estimation, where the quantity of interest is the difference between generative model parameters.

1 Introduction

Consider the classical conditional generative model setting, where we have a binary random response $Y \in \{0, 1\}$, and a random covariate vector $X \in \mathbb{R}^p$, such that $X \mid (Y = i) \sim P_{\theta_i}$ for $i \in \{0, 1\}$. Assuming that we know $P(Y)$ and $\{P_{\theta_i}\}_{i=0}^{1}$, we can use the Bayes rule to predict the response $Y$ given covariates $X$. This is said to be the generative model approach to classification. Alternatively, consider the conditional distribution $P(Y \mid X)$ as specified by the Bayes rule, also called the discriminative model corresponding to the generative model specified above. Learning this conditional model directly is said to be the discriminative model approach to classification. In a classical paper [8], the authors provided theoretical justification for the common wisdom regarding generative and discriminative models: when the generative model assumptions hold, the generative model estimators initially converge faster as a function of the number of samples, but have the same asymptotic error rate as discriminative models. And when the generative model assumptions do not hold, the discriminative model estimators eventually overtake the generative model estimators. Their analysis however was for the specific generative-discriminative model pair of Naive Bayes and logistic regression models, and moreover, was not under a high-dimensional sampling regime, when the number of samples could even be smaller than the number of parameters. In this paper, we aim to extend their analysis to these more general settings.

Doing so however required some novel technical and conceptual developments. To motivate the machinery we develop, consider why the Naive Bayes model estimator might initially converge faster.
The Naive Bayes model makes the conditional independence assumption that $P(X \mid Y) = \prod_{s=1}^{p} P(X_s \mid Y)$, so that the parameters of each of the conditional distributions $P(X_s \mid Y)$ for $s \in \{1, \ldots, p\}$ could be estimated independently. The corresponding log-likelihood loss function is thus fully "separable" into multiple components. The logistic regression log-likelihood on the other hand is seemingly much less "separable", and in particular, it does not split into multiple components each of which can be estimated independently. In general, we do not expect the loss functions underlying statistical estimators to be fully separable into multiple components, so that we need a more flexible notion of separability, where different losses could be shown to be separable to differing degrees. On a very related note, though it might seem unrelated at first, the analysis of $\ell_\infty$ convergence rates of statistical estimators considerably lags that of, say, $\ell_2$ rates (see for instance the unified framework of [7], which is suited to $\ell_2$ rates but is highly sub-optimal for $\ell_\infty$ rates). In part, the analysis of $\ell_\infty$ rates is harder because it implicitly requires analysis at the level of individual coordinates of the parameter vector. While this is thus harder than an $\ell_2$ error analysis, intuitively it would be much easier if the loss function were to split into independent components involving individual coordinates. While general loss functions might not be so "fully separable", they might perhaps satisfy the softer notion of separability motivated above. In a contribution that would be of independent interest, we develop precisely such a softer notion of separability for general loss functions.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
We then use this notion of separability to derive $\ell_\infty$ convergence rates for general M-estimators.

Given this machinery, we are then able to contrast generative and discriminative models. We focus on the case where the generative models are specified by exponential family distributions, so that the corresponding discriminative models are logistic regression models with the generative model sufficient statistics as feature functions. To compare the convergence rates of the two models, we focus on the difference of the two generative model parameters, since this difference is also precisely the model parameter for the discriminative model counterpart of the generative model, via an application of the Bayes rule. Moreover, as Li et al. [3] and others show, the $\ell_2$ convergence rate of the difference of the two parameters is what drives the classification error rates of both generative as well as discriminative model classifiers. Incidentally, such a difference of generative model parameters has also attracted interest outside the context of classification, where it is called differential parameter learning [1, 14, 6]. We thus analyze the $\ell_\infty$ as well as $\ell_2$ rates for both the generative and discriminative models, focusing on this parameter difference. As we show, unlike the case of Naive Bayes and logistic regression in low dimensions as studied in [8], this general high-dimensional setting is more nuanced, and in particular depends on the separability of the generative models. As we show, under some conditions on the models, generative and discriminative models not only have potentially different $\ell_\infty$ rates, but also differing "burn in" periods in terms of the minimum number of samples required in order for the convergence rates to hold. The choice of a generative vs discriminative model, namely that with the better sample complexity, thus depends on their corresponding separabilities.
As a minor note, we also show how generative model M-estimators are not directly suitable in high dimensions, and provide a simple methodological fix in order to obtain better $\ell_2$ rates. We instantiate our results with two running examples of isotropic and non-isotropic Gaussian generative models, and also corroborate our theory with instructive simulations.

2 Background and Setup.

We consider the problem of differential parameter estimation under the following generative model. Let $Y \in \{0, 1\}$ denote a binary response variable, and let $X = (X_1, \ldots, X_p) \in \mathbb{R}^p$ be the covariates. For simplicity, we assume $\mathbb{P}[Y = 1] = \mathbb{P}[Y = 0] = \tfrac{1}{2}$. We assume that conditioned on the response variable, the covariates belong to an exponential family, $X \mid Y \sim P_{\theta^\star_Y}(\cdot)$, where:

$$P_{\theta^\star_Y}(X \mid Y) = h(X)\exp\big(\langle \theta^\star_Y, \phi(X)\rangle - A(\theta^\star_Y)\big). \qquad (1)$$

Here, $\theta^\star_Y$ is the vector of the true canonical parameters, $A(\theta)$ is the log-partition function, and $\phi(X)$ is the sufficient statistic. We assume access to two sets of samples $X^n_0 = \{x^{(0)}_i\}_{i=1}^{n} \sim P_{\theta^\star_0}$ and $X^n_1 = \{x^{(1)}_i\}_{i=1}^{n} \sim P_{\theta^\star_1}$. Given these samples, as noted in the introduction, we are particularly interested in estimating the differential parameter $\theta^\star_{\text{diff}} := \theta^\star_1 - \theta^\star_0$, since this is also the model parameter corresponding to the discriminative model, as we show below. In high-dimensional sampling settings, we additionally assume that $\theta^\star_{\text{diff}}$ is at most $s$-sparse, i.e. $\|\theta^\star_{\text{diff}}\|_0 \le s$. We will be using the following two exponential family generative models as running examples: isotropic and non-isotropic multivariate Gaussian models.

Isotropic Gaussians (IG). Let $X = (X_1, \ldots, X_p) \sim \mathcal{N}(\mu, I_p)$ be an isotropic Gaussian random variable; its density can be written as:

$$P_\mu(x) = \frac{1}{\sqrt{(2\pi)^p}}\exp\Big(-\frac{1}{2}(x - \mu)^T(x - \mu)\Big). \qquad (2)$$

Gaussian MRF (GMRF). Let $X = (X_1, \ldots, X_p)$ denote a zero-mean Gaussian random vector; its density is fully parametrized by the inverse covariance or concentration matrix $\Theta = (\Sigma)^{-1} \succ 0$, and can be written as:

$$P_\Theta(x) = \frac{1}{\sqrt{(2\pi)^p \det\big(\Theta^{-1}\big)}}\exp\Big(-\frac{1}{2}x^T\Theta x\Big). \qquad (3)$$

Let $d_\Theta = \max_{j \in [p]} \|\Theta_{(:,j)}\|_0$ be the maximum number of non-zeros in a row of $\Theta$. Let $\kappa_{\Sigma^\star} = |||(\Theta^\star)^{-1}|||_\infty$, where $|||M|||_\infty$ is the $\ell_\infty/\ell_\infty$ operator norm given by $|||M|||_\infty = \max_{j = 1, 2, \ldots, p} \sum_{k=1}^{p} |M_{jk}|$.

Generative Model Estimation. Here, we proceed by estimating the two parameters $\{\theta^\star_i\}_{i=0}^{1}$ individually. Letting $\hat{\theta}_1$ and $\hat{\theta}_0$ be the corresponding estimators, we can then estimate the difference of the parameters as $\hat{\theta}_{\text{diff}} = \hat{\theta}_1 - \hat{\theta}_0$. The most popular class of estimators for the individual parameters is based on Maximum Likelihood Estimation (MLE), where we maximize the likelihood of the given data. For isotropic Gaussians, the negative log-likelihood function can be written as:

$$\mathcal{L}^n_{IG}(\theta) = \frac{\theta^T\theta}{2} - \theta^T\hat{\mu}, \qquad (4)$$

where $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$. In the case of GGMs the negative log-likelihood function can be written as:

$$\mathcal{L}^n_{GGM}(\Theta) = \langle\langle \Theta, \hat{\Sigma} \rangle\rangle - \log(\det(\Theta)), \qquad (5)$$

where $\hat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^T$ is the sample covariance matrix and $\langle\langle U, V \rangle\rangle = \sum_{i,j} U_{ij}V_{ij}$ denotes the trace inner product on the space of symmetric matrices. In high-dimensional sampling regimes ($n \ll p$), regularized MLEs, for instance with $\ell_1$-regularization under the assumption of sparse model parameters, have been widely used [11, 10, 2].

Discriminative Model Estimation.
Using Bayes rule, we have that:

$$\mathbb{P}[Y = 1 \mid X] = \frac{\mathbb{P}[X \mid Y = 1]\,\mathbb{P}[Y = 1]}{\mathbb{P}[X \mid Y = 0]\,\mathbb{P}[Y = 0] + \mathbb{P}[X \mid Y = 1]\,\mathbb{P}[Y = 1]} = \frac{1}{1 + \exp\big(-(\langle \theta^\star_1 - \theta^\star_0, \phi(x)\rangle + c^\star)\big)}, \qquad (6)$$

where $c^\star = A(\theta^\star_0) - A(\theta^\star_1)$. The conditional distribution is simply a logistic regression model, with the generative model sufficient statistics as the features, and with optimal parameters being precisely the difference $\theta^\star_{\text{diff}} := \theta^\star_1 - \theta^\star_0$ of the generative model parameters. The corresponding negative log-likelihood function can be written as

$$\mathcal{L}_{\text{logistic}}(\theta, c) = \frac{1}{n}\sum_{i=1}^{n} \Big(-y_i\big(\langle\theta, \phi(x_i)\rangle + c\big) + \psi\big(\langle\theta, \phi(x_i)\rangle + c\big)\Big), \qquad (7)$$

where $\psi(t) = \log(1 + \exp(t))$. In high-dimensional sampling regimes, under the assumption that the model parameters are sparse, we would use the $\ell_1$-penalized version $\hat{\theta}_{\text{diff}}$ of the MLE (7) to estimate $\theta^\star_{\text{diff}}$.

Outline. We proceed by studying the more general problem of $\ell_\infty$ error for parameter estimation for any loss function $\mathcal{L}^n(\cdot)$. Specifically, consider the general M-estimation problem, where we are given $n$ i.i.d. samples $Z^n_1 = \{z_1, z_2, \ldots, z_n\}$, $z_i \in \mathcal{Z}$, from some distribution $\mathbb{P}$, and we are interested in estimating some parameter $\theta^\star$ of the distribution $\mathbb{P}$. Let $\ell : \mathbb{R}^p \times \mathcal{Z} \mapsto \mathbb{R}$ be a twice differentiable and convex function which assigns a loss $\ell(\theta; z)$ to any parameter $\theta \in \mathbb{R}^p$, for a given observation $z$. Also assume that the loss is Fisher consistent, so that $\theta^\star \in \operatorname{argmin}_\theta \bar{\mathcal{L}}(\theta)$, where $\bar{\mathcal{L}}(\theta) \stackrel{\text{def}}{=} \mathbb{E}_{z \sim \mathbb{P}}[\ell(\theta; z)]$ is the population loss.
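As a concrete aside (our own sketch, not code from the paper), the discriminative negative log-likelihood (7) is simple enough to state directly in numpy; here `Phi` stacks the feature vectors $\phi(x_i)$ as rows, and the function names are ours:

```python
import numpy as np

def logistic_nll(theta, c, Phi, y):
    """Negative log-likelihood (7): mean over samples of
    -y_i * (<theta, phi_i> + c) + psi(<theta, phi_i> + c), with psi(t) = log(1 + e^t)."""
    z = Phi @ theta + c
    # np.logaddexp(0, z) computes log(1 + exp(z)) stably for large |z|
    return np.mean(-y * z + np.logaddexp(0.0, z))

# At theta = 0, c = 0 the loss equals psi(0) = log 2, regardless of the data.
Phi = np.random.randn(50, 5)
y = np.random.randint(0, 2, size=50)
print(np.isclose(logistic_nll(np.zeros(5), 0.0, Phi, y), np.log(2)))  # True
```

Minimizing this objective plus an $\ell_1$ penalty over $(\theta, c)$ is exactly the discriminative estimator analyzed below.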
We are then interested in analyzing the M-estimators that minimize the empirical loss, i.e. $\hat{\theta} \in \operatorname{argmin}_\theta \mathcal{L}^n(\theta)$, or regularized versions thereof, where $\mathcal{L}^n(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(\theta; Z_i)$.

We introduce a notion of the separability of a loss function, and show how more separable losses require fewer samples to establish convergence of $\|\hat{\theta} - \theta^\star\|_\infty$. We then instantiate our separability results from this general setting for both generative and discriminative models. We calculate the number of samples required by the generative and discriminative approaches to estimate the differential parameter $\theta^\star_{\text{diff}}$, for consistent convergence rates with respect to the $\ell_\infty$ and $\ell_2$ norms. We also discuss the consequences of these results for high-dimensional classification for Gaussian generative models.

3 Separability

Let $R(\Delta; \theta^\star) = \nabla\mathcal{L}^n(\theta^\star + \Delta) - \nabla\mathcal{L}^n(\theta^\star) - \nabla^2\mathcal{L}^n(\theta^\star)\Delta$ be the error in the first-order approximation of the gradient at $\theta^\star$. Let $B_\infty(r) = \{\theta \mid \|\theta\|_\infty \le r\}$ be an $\ell_\infty$ ball of radius $r$. We begin by analyzing the low-dimensional case, and then extend it to high dimensions.

3.1 Low Dimensional Sampling Regimes

In low-dimensional sampling regimes, we assume that the number of samples $n \ge p$. In this setting, we make the standard assumption that the empirical loss function $\mathcal{L}^n(\cdot)$ is strongly convex. Let $\hat{\theta} = \operatorname{argmin}_\theta \mathcal{L}^n(\theta)$ denote the unique minimizer of the empirical loss function. We begin by defining a notion of separability for any such empirical loss function $\mathcal{L}^n$.

Definition 1.
$\mathcal{L}^n$ is $(\alpha, \beta, \gamma)$-locally separable around $\theta^\star$ if the remainder term $R(\Delta; \theta^\star)$ satisfies:

$$\|R(\Delta; \theta^\star)\|_\infty \le \frac{1}{\beta}\|\Delta\|_\infty^\alpha \quad \forall\, \Delta \in B_\infty(\gamma).$$

This definition might seem a bit abstract, but for some general intuition: $\gamma$ indicates the region where the loss is separable, $\alpha$ indicates the conditioning of the loss, while it is $\beta$ that quantifies the degree of separability: the larger $\beta$ is, the more separable the loss function. Next, we provide some additional intuition on how a loss function's separability is connected to $(\alpha, \beta, \gamma)$. Using the mean-value theorem, we can write $\|R(\Delta, \theta^\star)\|_\infty = \big\|\big(\nabla^2\mathcal{L}^n(\theta^\star + t\Delta) - \nabla^2\mathcal{L}^n(\theta^\star)\big)\Delta\big\|_\infty$ for some $t \in (0, 1)$. This can be further simplified as $\|R(\Delta, \theta^\star)\|_\infty \le |||\nabla^2\mathcal{L}^n(\theta^\star + t\Delta) - \nabla^2\mathcal{L}^n(\theta^\star)|||_\infty \|\Delta\|_\infty$. Hence, $\alpha$ and $1/\beta$ measure the smoothness of the Hessian (w.r.t. the $\ell_\infty/\ell_\infty$ matrix norm) in the neighborhood of $\theta^\star$, with $\alpha$ being the smoothness exponent, and $1/\beta$ being the smoothness constant. Note that the Hessian of the loss function $\nabla^2\mathcal{L}^n(\theta)$ is a random matrix and can vary from being a diagonal matrix for a fully-separable loss function to a dense matrix for a heavily-coupled loss function. Moreover, from standard concentration arguments, the $\ell_\infty/\ell_\infty$ matrix norm of a diagonal ("separable") subgaussian random matrix has at most logarithmic dimension dependence¹, but for a dense ("non-separable") random matrix, the $\ell_\infty/\ell_\infty$ matrix norm could possibly scale linearly in the dimension. Thus, the scaling of the $\ell_\infty/\ell_\infty$ matrix norm gives us an indication of how "separable" the matrix is. This intuition is captured by $(\alpha, \beta, \gamma)$, which we further elaborate in future sections by explicitly deriving $(\alpha, \beta, \gamma)$ for different loss functions, and use them to derive $\ell_2$ and $\ell_\infty$ convergence rates.

Theorem 1. Let $\mathcal{L}^n$ be a strongly convex loss function which is $(\alpha, \beta, \gamma)$-locally separable around $\theta^\star$.
Then, if $\|\nabla\mathcal{L}^n(\theta^\star)\|_\infty \le \min\Big\{\frac{\gamma}{2\kappa},\ \frac{1}{2\kappa}\big(\frac{\beta}{2\kappa}\big)^{\frac{1}{\alpha - 1}}\Big\}$, then

$$\|\hat{\theta} - \theta^\star\|_\infty \le 2\kappa\,\|\nabla\mathcal{L}^n(\theta^\star)\|_\infty,$$

where $\kappa = |||\nabla^2\mathcal{L}^n(\theta^\star)^{-1}|||_\infty$.

Proof. (Proof Sketch). The proof begins by constructing a suitable continuous function $F$, for which $\hat{\Delta} = \hat{\theta} - \theta^\star$ is the unique fixed point. Next, we show that $F(B_\infty(r)) \subseteq B_\infty(r)$ for $r = 2\kappa\|\nabla\mathcal{L}^n(\theta^\star)\|_\infty$. Since $F$ is continuous and the $\ell_\infty$-ball is convex and compact, the contraction property coupled with Brouwer's fixed point theorem [9] shows that there exists some fixed point $\Delta$ of $F$ such that $\|\Delta\|_\infty \le 2\kappa\|\nabla\mathcal{L}^n(\theta^\star)\|_\infty$. By uniqueness of the fixed point, we then establish our result. See Figure 1 for a geometric description and Section A for more details.

3.2 High Dimensional Sampling Regimes

In high-dimensional sampling regimes ($n \ll p$), estimation of model parameters is typically an under-determined problem. It is thus necessary to impose additional assumptions on the true model parameter $\theta^\star$. We will focus on the popular assumption of sparsity, which entails that the number of non-zero coefficients of $\theta^\star$ is small, so that $\|\theta^\star\|_0 \le s$. For this setting, we will be focusing in particular on $\ell_1$-regularized empirical loss minimization:

¹ Follows from the concentration of subgaussian maxima [12].

Figure 1: Under the conditions of Theorem 1, $F(\Delta) = -\nabla^2\mathcal{L}^n(\theta^\star)^{-1}\big(R(\Delta; \theta^\star) + \nabla\mathcal{L}^n(\theta^\star)\big)$ is contractive over $B_\infty(2\kappa\|\nabla\mathcal{L}^n(\theta^\star)\|_\infty)$ and has $\hat{\Delta} = \hat{\theta} - \theta^\star$ as its unique fixed point.
Using these two observations, we can conclude that $\|\hat{\Delta}\|_\infty \le 2\kappa\|\nabla\mathcal{L}^n(\theta^\star)\|_\infty$.

$$\hat{\theta}_{\lambda_n} = \operatorname{argmin}_{\theta}\ \mathcal{L}^n(\theta) + \lambda_n\|\theta\|_1. \qquad (8)$$

Let $S = \{i \mid \theta^\star_i \ne 0\}$ be the support set of the true parameter and $\mathcal{M}(S) = \{v \mid v_{S^c} = 0\}$ be the corresponding subspace. Note that under a high-dimensional sampling regime, we can no longer assume that the empirical loss $\mathcal{L}^n(\cdot)$ is strongly convex. Accordingly, we make the following set of assumptions:

• Assumption 1 (A1): Positive Definite Restricted Hessian. $\nabla^2_{SS}\mathcal{L}^n(\theta^\star) \succeq \lambda_{\min} I$.
• Assumption 2 (A2): Irrepresentability. There exists some $\delta \in (0, 1]$ such that $|||\nabla^2_{S^cS}\mathcal{L}^n(\theta^\star)\big(\nabla^2_{SS}\mathcal{L}^n(\theta^\star)\big)^{-1}|||_\infty \le 1 - \delta$.
• Assumption 3 (A3): Unique Minimizer. When restricted to the true support, the solution to the $\ell_1$-penalized loss minimization problem is unique, which we denote by:

$$\tilde{\theta}_{\lambda_n} = \operatorname{argmin}_{\theta \in \mathcal{M}(S)}\ \{\mathcal{L}^n(\theta) + \lambda_n\|\theta\|_1\}. \qquad (9)$$

Assumptions 1 and 2 are common in high-dimensional analysis. We verify that Assumption 3 holds for different loss functions individually. We refer the reader to [13, 5, 11, 10] for further details on these assumptions. For this high-dimensional sampling regime, we also modify our separability notion to a restricted separability, which entails that the remainder term be separable only over the model subspace $\mathcal{M}(S)$.

Definition 2. $\mathcal{L}^n$ is $(\alpha, \beta, \gamma)$ restricted locally separable around $\theta^\star$ over the subspace $\mathcal{M}(S)$ if the remainder term $R(\Delta; \theta^\star)$ satisfies:

$$\|R(\Delta; \theta^\star)\|_\infty \le \frac{1}{\beta}\|\Delta\|_\infty^\alpha \quad \forall\, \Delta \in B_\infty(\gamma) \cap \mathcal{M}(S).$$

We present our main deterministic result in high dimensions.

Theorem 2. Let $\mathcal{L}^n$ be an $(\alpha, \beta, \gamma)$ restricted locally separable function around $\theta^\star$.
If $(\lambda_n, \nabla\mathcal{L}^n(\theta^\star))$ are such that:

• $\lambda_n \ge \frac{8}{\delta}\,\|\nabla\mathcal{L}^n(\theta^\star)\|_\infty$
• $\|\nabla\mathcal{L}^n(\theta^\star)\|_\infty + \lambda_n \le \min\Big\{\frac{\gamma}{2\kappa},\ \frac{1}{2\kappa}\big(\frac{\beta}{2\kappa}\big)^{\frac{1}{\alpha - 1}}\Big\}$

then we have that $\operatorname{support}(\hat{\theta}_{\lambda_n}) \subseteq \operatorname{support}(\theta^\star)$ and

$$\|\hat{\theta}_{\lambda_n} - \theta^\star\|_\infty \le 2\kappa\big(\|\nabla\mathcal{L}^n(\theta^\star)\|_\infty + \lambda_n\big),$$

where $\kappa = |||\big(\nabla^2_{SS}\mathcal{L}^n(\theta^\star)\big)^{-1}|||_\infty$.

Proof. (Proof Sketch). The proof invokes the primal-dual witness argument [13], which when combined with Assumptions 1–3 gives $\hat{\theta}_{\lambda_n} \in \mathcal{M}(S)$ and that $\hat{\theta}_{\lambda_n}$ is the unique solution of the restricted problem. The rest of the proof proceeds similarly to Theorem 1, by constructing a suitable function $F : \mathbb{R}^{|S|} \mapsto \mathbb{R}^{|S|}$ for which $\hat{\Delta} = \hat{\theta}_{\lambda_n} - \theta^\star$ is the unique fixed point, and showing that $F$ is contractive over $B_\infty(r; \theta^\star)$ for $r = 2\kappa(\|\nabla\mathcal{L}^n(\theta^\star)\|_\infty + \lambda_n)$. See Section B for more details.

Discussion. Theorems 1 and 2 provide a general recipe to estimate the number of samples required by any loss $\ell(\theta, z)$ to establish $\ell_\infty$ convergence. The first step is to calculate the separability constants $(\alpha, \beta, \gamma)$ for the corresponding empirical loss function $\mathcal{L}^n$. Next, since the loss $\ell$ is Fisher consistent, so that $\nabla\bar{\mathcal{L}}(\theta^\star) = 0$, the upper bound on $\|\nabla\mathcal{L}^n(\theta^\star)\|_\infty$ can be shown to hold by analyzing the concentration of $\nabla\mathcal{L}^n(\theta^\star)$ around its mean. We emphasize that we do not impose any restrictions on the values of $(\alpha, \beta, \gamma)$. In particular, these can scale with the number of samples $n$; our results hold so long as the number of samples $n$ satisfies the conditions of the theorem. As a rule of thumb, the smaller that either $\beta$ or $\gamma$ gets for any given loss $\ell$, the larger the required number of samples.

4 $\ell_\infty$-rates for Generative and Discriminative Model Estimation

In this section we study the $\ell_\infty$ rates for differential parameter estimation for the discriminative and generative approaches.
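To make the recipe concrete, consider the simplest case, the isotropic Gaussian loss (4), $\mathcal{L}^n_{IG}(\theta) = \theta^T\theta/2 - \theta^T\hat{\mu}$: its Hessian is the identity, so the remainder term $R(\Delta; \theta^\star)$ vanishes identically, which is the best possible separability. A small numerical check of this fact (our own sketch, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 10, 50
x = rng.normal(size=(n, p))
mu_hat = x.mean(axis=0)

def grad(theta):
    # Gradient of the isotropic-Gaussian NLL (4): L(theta) = theta^T theta / 2 - theta^T mu_hat
    return theta - mu_hat

theta_star = rng.normal(size=p)
delta = rng.normal(size=p)
hessian = np.eye(p)  # constant Hessian => zero first-order remainder

# R(delta; theta*) = grad(theta* + delta) - grad(theta*) - Hessian @ delta
remainder = grad(theta_star + delta) - grad(theta_star) - hessian @ delta
print(np.max(np.abs(remainder)))  # ~0, up to floating-point rounding
```

For coupled losses such as the logistic or GMRF losses, the analogous remainder is nonzero, and bounding its $\ell_\infty$ norm is exactly the separability calculation carried out next.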
We do so by calculating the separability of the discriminative and generative loss functions, and then instantiating our previously derived results.

4.1 Discriminative Estimation

As discussed before, the discriminative approach uses $\ell_1$-regularized logistic regression with the sufficient statistics as features to estimate the differential parameter. In addition to A1–A3, we assume column normalization of the sufficient statistics, i.e. $\sum_{i=1}^{n} ([\phi(x_i)]_j)^2 \le n$ for all $j$. Let $\phi_n = \max_i \|\phi(x_i)\|_\infty$ and $\nu_n = \max_i \|(\phi(x_i))_S\|_2$. Firstly, we characterize the separability of the logistic loss.

Lemma 1. The logistic regression negative log-likelihood $\mathcal{L}^n_{\text{Logistic}}$ from (7) is $\big(2, \frac{1}{s\,\phi_n\,\nu_n^2}, \infty\big)$ restricted locally separable around $\theta^\star$.

Combining Lemma 1 with Theorem 2, we get the following corollary.

Corollary 3. (Logistic Regression) Consider the model in (1). Then there exist universal positive constants $C_1$, $C_2$ and $C_3$ such that for $n \ge C_1\,\kappa^2 s^2 \phi_n^2 \nu_n^4 \log p$ and $\lambda_n = C_2\sqrt{\frac{\log p}{n}}$, the discriminative differential estimate $\hat{\theta}_{\text{diff}}$ satisfies

$$\operatorname{support}(\hat{\theta}_{\text{diff}}) \subseteq \operatorname{support}(\theta^\star_{\text{diff}}) \quad\text{and}\quad \|\hat{\theta}_{\text{diff}} - \theta^\star_{\text{diff}}\|_\infty \le C_3\sqrt{\frac{\log p}{n}}.$$

4.2 Generative Estimation

We characterize the separability of generative exponential families. The negative log-likelihood function can be written as:

$$\mathcal{L}^n(\theta) = A(\theta) - \langle\theta, \hat{\phi}_n\rangle,$$

where $\hat{\phi}_n = \frac{1}{n}\sum_{i=1}^{n} \phi(x_i)$. In this setting, the remainder term is independent of the data and can be written as $R(\Delta) = \nabla A(\theta^\star + \Delta) - \nabla A(\theta^\star) - \nabla^2 A(\theta^\star)\Delta$, and $\nabla\mathcal{L}^n(\theta^\star) = \mathbb{E}[\phi(x)] - \frac{1}{n}\sum_i \phi(x_i)$. Hence, $\|\nabla\mathcal{L}^n(\theta^\star)\|_\infty$ is a measure of how well the sufficient statistics concentrate around their mean. Next, we show the separability of our running examples, Isotropic Gaussians and Gaussian Graphical Models.

Lemma 2.
The isotropic Gaussian negative log-likelihood $\mathcal{L}^n_{IG}$ from (4) is $(\cdot, \infty, \infty)$ locally separable around $\theta^\star$.

Lemma 3. The Gaussian MRF negative log-likelihood $\mathcal{L}^n_{GGM}$ from (5) is $\big(2, \frac{2}{3\,d_{\Theta^\star}\kappa^3_{\Sigma^\star}}, \frac{1}{3\,d_{\Theta^\star}\kappa_{\Sigma^\star}}\big)$ restricted locally separable around $\Theta^\star$.

Comparing Lemmas 1, 2 and 3, we see that the separability of the discriminative model loss depends only weakly on the feature functions. On the other hand, the separability of the generative model loss depends critically on the underlying sufficient statistics. This has consequences for their differing sample complexities for differential parameter estimation, as we show next.

Corollary 4. (Isotropic Gaussians) Consider the model in (2). Then there exist universal constants $C_1$, $C_2$, $C_3$ such that if the number of samples scales as $n \ge C_1 \log p$, then with probability at least $1 - 1/p^{C_2}$, the generative estimate of the differential parameter $\hat{\theta}_{\text{diff}}$ satisfies

$$\|\hat{\theta}_{\text{diff}} - \theta^\star_{\text{diff}}\|_\infty \le C_3\sqrt{\frac{\log p}{n}}.$$

Comparing Corollary 3 and Corollary 4, we see that for isotropic Gaussians, both the discriminative and generative approaches achieve the same $\ell_\infty$ convergence rates, but at different sample complexities. Specifically, the sample complexity of the generative method depends only logarithmically on the dimension $p$, and is independent of the differential sparsity $s$, while the sample complexity of the discriminative method depends on the differential sparsity $s$. Therefore in this case, the generative method is strictly better than its discriminative counterpart, assuming that the generative model assumptions hold.

Corollary 5. (Gaussian MRF) Consider the model in (3), and suppose that the scaled covariates $X_k/\sqrt{\Sigma^\star_{kk}}$ are subgaussian with parameter $\sigma^2$. Then there exist universal positive constants $C_2$, $C_3$, $C_4$ such that if the number of samples for the two generative models scale as $n_i \ge C_2\,\kappa^2_{(\Theta^\star_i)^{-1}}\,d^2_{\Theta^\star_i}\log p$ for $i \in \{0, 1\}$, then with probability at least $1 - 1/p^{C_3}$, the generative estimate of the differential parameter, $\hat{\Theta}_{\text{diff}} = \hat{\Theta}_1 - \hat{\Theta}_0$, satisfies

$$\|\hat{\Theta}_{\text{diff}} - \Theta^\star_{\text{diff}}\|_\infty \le C_4\sqrt{\frac{\log p}{n}},$$

and $\operatorname{support}(\hat{\Theta}_i) \subseteq \operatorname{support}(\Theta^\star_i)$ for $i \in \{0, 1\}$.

Comparing Corollary 3 and Corollary 5, we see that for Gaussian Graphical Models, both the discriminative and generative approaches achieve the same $\ell_\infty$ convergence rates, but at different sample complexities. Specifically, the sample complexity of the generative method depends only on the row-wise sparsity of the individual models $d^2_{\Theta^\star_i}$, and is independent of the sparsity $s$ of the differential parameter $\Theta^\star_{\text{diff}}$. In contrast, the sample complexity of the discriminative method depends only on the sparsity of the differential parameter, and is independent of the structural complexities of the individual model parameters. This suggests that in high dimensions, even when the generative model assumptions hold, generative methods might perform poorly if the underlying model is highly non-separable (e.g. $d = \Omega(p)$), which is in contrast to the conventional wisdom in low dimensions.

Related Work. Note that results similar to Corollaries 3 and 5 have been previously reported in [11, 5] separately. Under the same set of assumptions as ours, Li et al. [5] provide a unified analysis for support recovery and $\ell_\infty$-bounds for $\ell_1$-regularized M-estimators. While they obtain the same rates as ours, their required sample complexities are much higher, since they do not exploit the separability of the underlying loss function.
As one example, in the case of GMRFs, their results require the number of samples to scale as $n > k^2\log p$, where $k$ is the total number of edges in the graph, which is sub-optimal, and in particular does not match the GMRF-specific analysis of [11]. On the other hand, our unified analysis is tighter, and in particular does match the results of [11].

5 $\ell_2$-rates for Generative and Discriminative Model Estimation

In this section we study the $\ell_2$ rates for differential parameter estimation for the discriminative and generative approaches.

5.1 Discriminative Approach

The bounds for the discriminative approach are relatively straightforward. Corollary 3 gives bounds on the $\ell_\infty$ error and establishes that $\operatorname{support}(\hat{\theta}) \subseteq \operatorname{support}(\theta^\star)$. Since the true model parameter is $s$-sparse, $\|\theta^\star\|_0 \le s$, the $\ell_2$ error can be simply bounded as $\sqrt{s}\,\|\hat{\theta} - \theta^\star\|_\infty$.

5.2 Generative Approach

In the previous section, we saw that the generative approach is able to exploit the inherent separability of the underlying model, and thus is able to obtain $\ell_\infty$ rates for differential parameter estimation at a much lower sample complexity. Unfortunately, it does not have support consistency. Hence a naïve generative estimator will have an $\ell_2$ error scaling with $\sqrt{\frac{p\log p}{n}}$, which in high dimensions would make it unappealing. However, one can exploit the sparsity of $\theta^\star_{\text{diff}}$ and get better rates of convergence in $\ell_2$-norm by simply soft-thresholding the generative estimate. Moreover, soft-thresholding also leads to support consistency.

Definition 3. We denote the soft-thresholding operator $ST_{\lambda_n}(\cdot)$, defined as:

$$ST_{\lambda_n}(\theta) = \operatorname{argmin}_{w}\ \frac{1}{2}\|w - \theta\|_2^2 + \lambda_n\|w\|_1.$$

Lemma 4. Suppose $\theta = \theta^\star + \epsilon$ for some $s$-sparse $\theta^\star$.
Then there exists a universal constant $C_1$ such that for $\lambda_n \ge 2\|\epsilon\|_\infty$,

$$\|ST_{\lambda_n}(\theta) - \theta^\star\|_2 \le C_1\sqrt{s}\,\|\epsilon\|_\infty \quad\text{and}\quad \|ST_{\lambda_n}(\theta) - \theta^\star\|_1 \le C_1\,s\,\|\epsilon\|_\infty. \qquad (10)$$

Note that this is a completely deterministic result and has no sample complexity requirement. Motivated by this, we introduce a thresholded generative estimator that has two stages: (a) compute $\hat{\theta}_{\text{diff}}$ using the generative model estimates, and (b) soft-threshold the generative estimate with $\lambda_n = c\,\|\hat{\theta}_{\text{diff}} - \theta^\star_{\text{diff}}\|_\infty$. An elementary application of Lemma 4 can then be shown to provide $\ell_2$ error bounds for $\hat{\theta}_{\text{diff}}$ given its $\ell_\infty$ error bounds, and that the true parameter $\theta^\star_{\text{diff}}$ is $s$-sparse. We instantiate these $\ell_2$-bounds via corollaries for our running examples of Isotropic Gaussians and Gaussian MRFs.

Lemma 5. (Isotropic Gaussians) Consider the model in (2). Then there exist universal constants $C_1$, $C_2$, $C_3$ such that if the number of samples scales as $n \ge C_1\log p$, then with probability at least $1 - 1/p^{C_2}$, the soft-thresholded generative estimate of the differential parameter $ST_{\lambda_n}(\hat{\theta}_{\text{diff}})$, with the soft-thresholding parameter set as $\lambda_n = c\sqrt{\frac{\log p}{n}}$ for some constant $c$, satisfies:

$$\|ST_{\lambda_n}(\hat{\theta}_{\text{diff}}) - \theta^\star_{\text{diff}}\|_2 \le C_3\sqrt{\frac{s\log p}{n}}.$$

Lemma 6. (Gaussian MRF) Consider the model in Equation (3), and suppose that the covariates $X_k/\sqrt{\Sigma^\star_{kk}}$ are subgaussian with parameter $\sigma^2$. Then there exist universal positive constants $C_2$, $C_3$, $C_4$ such that if the number of samples for the two generative models scale as $n_i \ge C_2\,\kappa^2_{(\Theta^\star_i)^{-1}}\,d^2_{\Theta^\star_i}\log p$ for $i \in \{0, 1\}$, then with probability at least $1 - 1/p^{C_3}$, the soft-thresholded generative estimate of the differential parameter, $ST_{\lambda_n}(\hat{\Theta}_{\text{diff}})$, with the soft-thresholding parameter set as $\lambda_n = c\sqrt{\frac{\log p}{n}}$ for some constant $c$, satisfies:

$$\|ST_{\lambda_n}(\hat{\Theta}_{\text{diff}}) - \Theta^\star_{\text{diff}}\|_2 \le C_4\sqrt{\frac{s\log p}{n}}.$$

Comparing Lemmas 5 and 6 to Section 5.1, we can see that the additional soft-thresholding step allows the generative methods to achieve the same $\ell_2$-error rates as the discriminative methods, but at different sample complexities. The sample complexities of the generative estimates depend on the separabilities of the individual models, and are independent of the differential sparsity $s$, whereas the sample complexity of the discriminative estimate depends only on the differential sparsity $s$.

6 Experiments: High Dimensional Classification

In this section, we corroborate our theoretical results on $\ell_2$-error rates for generative and discriminative model estimators, via their consequences for high dimensional classification.
We focus on the case of isotropic Gaussian generative models $X | Y \sim \mathcal{N}(\mu_Y, I_p)$, where $\mu_0, \mu_1 \in \mathbb{R}^p$ are unknown and $\mu_1 - \mu_0$ is $s$-sparse.

[Figure 2: Effect of sparsity $s$ on excess 0-1 error. Panels (a) $s = 4$, (b) $s = 16$, (c) $s = 64$, each with $p = 512$, $d = 1$, plot the excess 0-1 error against the number of samples $n$ for Gen-Thresh and Logistic.]

Here, we are interested in a classifier $C : \mathbb{R}^p \mapsto \{0, 1\}$ that achieves low classification error $\mathbb{E}_{X,Y}[\mathbf{1}\{C(X) \neq Y\}]$. Under this setting, it can be shown that the Bayes classifier, which achieves the lowest possible classification error, is given by the linear discriminant classifier $C^*(x) = \mathbf{1}\{x^T w^* + b^* > 0\}$, where $w^* = \mu_1 - \mu_0$ and $b^* = \frac{\mu_0^T \mu_0 - \mu_1^T \mu_1}{2}$. Thus, the coefficient $w^*$ of the linear discriminant is precisely the differential parameter, which can be estimated via both generative and discriminative approaches as detailed in the previous section. Moreover, the classification error can also be related to the $\ell_2$ error of the estimates. Under some mild assumptions, Li et al.
[3] showed that for any linear classifier $\hat{C}(x) = \mathbf{1}\{x^T \hat{w} + \hat{b} > 0\}$, the excess classification error can be bounded as:

$$\mathcal{E}(\hat{C}) \leq C_1 \left( \|\hat{w} - w^*\|_2^2 + (\hat{b} - b^*)^2 \right),$$

for some constant $C_1 > 0$, and where $\mathcal{E}(C) = \mathbb{E}_{X,Y}[\mathbf{1}\{C(X) \neq Y\}] - \mathbb{E}_{X,Y}[\mathbf{1}\{C^*(X) \neq Y\}]$ is the excess 0-1 error. In other words, the excess classification error is bounded by a constant times the $\ell_2$ error of the differential parameter estimate.

Methods. In this setting, as discussed in previous sections, the discriminative model is simply a logistic regression model with linear features (6), so that the discriminative estimates of the differential parameter $\hat{w}$ and the constant bias term $\hat{b}$ can be simply obtained via $\ell_1$-regularized logistic regression. For the generative estimate, we use our two-stage estimator from Section 5, which proceeds by estimating $\hat{\mu}_0, \hat{\mu}_1$ using the empirical means, and then estimating the differential parameter by soft-thresholding the difference of the generative model parameter estimates, $\hat{w}_T = ST_{\lambda_n}(\hat{\mu}_1 - \hat{\mu}_0)$, where $\lambda_n = C_1\sqrt{\log p / n}$ for some constant $C_1$. The corresponding estimate for $b^*$ is given by $\hat{b}_T = -\frac{1}{2} \langle \hat{w}_T, \hat{\mu}_1 + \hat{\mu}_0 \rangle$.

Experimental Setup. For our experimental setup, we consider isotropic Gaussian models with means $\mu_0 = \mathbf{1}_p - \frac{1}{\sqrt{s}}\left[\mathbf{1}_s;\, \mathbf{0}_{p-s}\right]$ and $\mu_1 = \mathbf{1}_p + \frac{1}{\sqrt{s}}\left[\mathbf{1}_s;\, \mathbf{0}_{p-s}\right]$, and vary the sparsity level $s$. For both methods, we set the regularization parameter² as $\lambda_n = \sqrt{\log(p)/n}$. We report the excess classification error for the two approaches, averaged over 20 trials, in Figure 2.

Results. As can be seen from Figure 2, our two-stage thresholded generative estimator is always better than the discriminative estimator, across different sparsity levels $s$.
Moreover, the sample complexity or "burn-in" period of the discriminative classifier strongly depends on the sparsity level, which makes it unsuitable when the true parameter is not highly sparse. For our two-stage generative estimator, we see that the sparsity $s$ has no effect on the "burn-in" period of the classifier. These observations validate our theoretical results from Section 5.

²See Appendix J for cross-validated plots.

Acknowledgements

A.P. and P.R. acknowledge the support of ARO via W911NF-12-1-0390 and NSF via IIS-1149803, IIS-1447574, DMS-1264033, and NIH via R01 GM117594-01 as part of the Joint DMS/NIGMS Initiative to Support Research at the Interface of the Biological and Mathematical Sciences.

References

[1] Alberto de la Fuente. From 'differential expression' to 'differential networking': identification of dysfunctional regulatory networks in diseases. Trends in Genetics, 26(7):326–333, 2010.

[2] Christophe Giraud. Introduction to High-Dimensional Statistics, volume 138. CRC Press, 2014.

[3] Tianyang Li, Adarsh Prasad, and Pradeep K. Ravikumar. Fast classification rates for high-dimensional Gaussian generative models. In Advances in Neural Information Processing Systems, pages 1054–1062, 2015.

[4] Tianyang Li, Xinyang Yi, Constantine Caramanis, and Pradeep Ravikumar. Minimax Gaussian classification & clustering. In Artificial Intelligence and Statistics, pages 1–9, 2017.

[5] Yen-Huan Li, Jonathan Scarlett, Pradeep Ravikumar, and Volkan Cevher. Sparsistency of ℓ1-regularized M-estimators. In AISTATS, 2015.

[6] Song Liu, John A. Quinn, Michael U. Gutmann, Taiji Suzuki, and Masashi Sugiyama. Direct learning of sparse changes in Markov networks by density ratio estimation. Neural Computation, 26(6):1169–1197, 2014.

[7] Sahand Negahban, Bin Yu, Martin J. Wainwright, and Pradeep K. Ravikumar.
A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems, pages 1348–1356, 2009.

[8] Andrew Y. Ng and Michael I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems, 2:841–848, 2002.

[9] James M. Ortega and Werner C. Rheinboldt. Iterative Solution of Nonlinear Equations in Several Variables. SIAM, 2000.

[10] Pradeep Ravikumar, Martin J. Wainwright, John D. Lafferty, et al. High-dimensional Ising model selection using ℓ1-regularized logistic regression. The Annals of Statistics, 38(3):1287–1319, 2010.

[11] Pradeep Ravikumar, Martin J. Wainwright, Garvesh Raskutti, Bin Yu, et al. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.

[12] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. In preparation. University of California, Berkeley, 2015.

[13] Martin J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.

[14] Sihai Dave Zhao, T. Tony Cai, and Hongzhe Li. Direct estimation of differential networks. Biometrika, page asu009, 2014.