{"title": "Fast Classification Rates for High-dimensional Gaussian Generative Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1054, "page_last": 1062, "abstract": "We consider the problem of binary classification when the covariates conditioned on the each of the response values follow multivariate Gaussian distributions. We focus on the setting where the covariance matrices for the two conditional distributions are the same. The corresponding generative model classifier, derived via the Bayes rule, also called Linear Discriminant Analysis, has been shown to behave poorly in high-dimensional settings. We present a novel analysis of the classification error of any linear discriminant approach given conditional Gaussian models. This allows us to compare the generative model classifier, other recently proposed discriminative approaches that directly learn the discriminant function, and then finally logistic regression which is another classical discriminative model classifier. As we show, under a natural sparsity assumption, and letting $s$ denote the sparsity of the Bayes classifier, $p$ the number of covariates, and $n$ the number of samples, the simple ($\\ell_1$-regularized) logistic regression classifier achieves the fast misclassification error rates of $O\\left(\\frac{s \\log p}{n}\\right)$, which is much better than the other approaches, which are either inconsistent under high-dimensional settings, or achieve a slower rate of $O\\left(\\sqrt{\\frac{s \\log p}{n}}\\right)$.", "full_text": "Fast Classi\ufb01cation Rates for High-dimensional\n\nGaussian Generative Models\n\nTianyang Li\n\nAdarsh Prasad\n\nDepartment of Computer Science, UT Austin\n\n{lty,adarsh,pradeepr}@cs.utexas.edu\n\nPradeep Ravikumar\n\nAbstract\n\nWe consider the problem of binary classi\ufb01cation when the covariates conditioned\non the each of the response values follow multivariate Gaussian distributions. 
We focus on the setting where the covariance matrices for the two conditional distributions are the same. The corresponding generative model classifier, derived via the Bayes rule, also called Linear Discriminant Analysis, has been shown to behave poorly in high-dimensional settings. We present a novel analysis of the classification error of any linear discriminant approach given conditional Gaussian models. This allows us to compare the generative model classifier, other recently proposed discriminative approaches that directly learn the discriminant function, and then finally logistic regression, which is another classical discriminative model classifier. As we show, under a natural sparsity assumption, and letting $s$ denote the sparsity of the Bayes classifier, $p$ the number of covariates, and $n$ the number of samples, the simple ($\ell_1$-regularized) logistic regression classifier achieves the fast misclassification error rate of $O(\frac{s \log p}{n})$, which is much better than the other approaches, which are either inconsistent under high-dimensional settings, or achieve a slower rate of $O(\sqrt{\frac{s \log p}{n}})$.

1 Introduction

We consider the problem of classification of a binary response given $p$ covariates. A popular class of approaches are statistical decision-theoretic: given a classification evaluation metric, they then optimize a surrogate evaluation metric that is computationally tractable, and yet has strong guarantees on sample complexity, namely, the number of observations required for some bound on the expected classification evaluation metric. These guarantees and methods have been developed largely for the zero-one evaluation metric, and extending these to general evaluation metrics is an area of active research. 
Another class of classification methods is relatively evaluation-metric agnostic, which is an important desideratum in modern settings, where the evaluation metric for an application is typically less clear: these are based on learning statistical models over the response and covariates, and can be categorized into two classes. The first are the so-called generative models, where we specify conditional distributions of the covariates conditioned on the response, and then use the Bayes rule to derive the conditional distribution of the response given the covariates. The second are the so-called discriminative models, where we directly specify the conditional distribution of the response given the covariates.

In the classical fixed $p$ setting, we now have a good understanding of the performance of the classification approaches above. For generative and discriminative modeling based approaches, consider the specific case of Naive Bayes generative models and logistic regression discriminative models (which form a so-called generative-discriminative pair¹): Ng and Jordan [27] provided qualitative consistency analyses, and showed that under small sample settings, the generative model classifiers converge at a faster rate to their population error rate compared to the discriminative model classifiers, though the population error rate of the discriminative model classifiers could be potentially lower than that of the generative model classifiers due to weaker model assumptions. But if the generative model assumption holds, then generative model classifiers seem preferable to discriminative model classifiers.

¹In such a so-called generative-discriminative pair, the discriminative model has the same form as that of the conditional distribution of the response given the covariates specified by the Bayes rule given the generative model.

In this paper, we investigate whether this conventional 
wisdom holds even under high-dimensional settings. We focus on the simple generative model where the response is binary, and the covariates conditioned on each of the response values follow a conditional multivariate Gaussian distribution. We also assume that the covariance matrices of the two conditional Gaussian distributions are the same. The corresponding generative model classifier, derived via the Bayes rule, is known in the statistics literature as the Linear Discriminant Analysis (LDA) classifier [21]. Under classical settings where $p \ll n$, the misclassification error rate of this classifier has been shown to converge to that of the Bayes classifier. However, in a high-dimensional setting, where the number of covariates $p$ could scale with the number of samples $n$, this performance of the LDA classifier breaks down. In particular, Bickel and Levina [3] show that when $p/n \to \infty$, the LDA classifier could converge to an error rate of 0.5, that of random chance. What should one then do, when we are even allowed this generative model assumption, and when $p > n$?

Bickel and Levina [3] suggest the use of a Naive Bayes or conditional independence assumption, which in the conditional Gaussian context assumes the covariance matrices to be diagonal. As they showed, the corresponding Naive Bayes LDA classifier does have a misclassification error rate that is better than chance, but it is asymptotically biased: it converges to an error rate that is strictly larger than that of the Bayes classifier when the Naive Bayes conditional independence assumption does not hold. 
Bickel and Levina [3] also considered a weakening of the Naive Bayes rule: assuming that the covariance matrix is weakly sparse, and under an ellipsoidal constraint on the means, they showed that an estimator that leverages these structural constraints converges to the Bayes risk at a rate of $O(\log(n)/n^{\gamma})$, where $0 < \gamma < 1$ depends on the mean and covariance structural assumptions. A caveat is that these covariance sparsity assumptions might not hold in practice. Similar caveats apply to the related works on feature annealed independence rules [14] and nearest shrunken centroids [29, 30]. Moreover, even when the assumptions hold, they do not yield the "fast" rates of $O(1/n)$.

An alternative approach is to directly impose sparsity on the linear discriminant [28, 7], which is weaker than the covariance sparsity assumptions (though [28] impose these in addition). [28, 7] then proposed new estimators that leveraged these assumptions, but while they were able to show convergence to the Bayes risk, they were only able to show a slower rate of $O(\sqrt{\frac{s \log p}{n}})$.

It is instructive at this juncture to look at recent results on classification error rates from the machine learning community. A key notion of importance here is whether the two classes are separable, which can be understood as requiring that the classification error of the Bayes classifier is 0. Classical learning theory gives a rate of $O(1/\sqrt{n})$ for any classifier when the two classes are non-separable, and it is shown that this is also minimax [12], with the note that this is relatively distribution agnostic, since it assumes very little on the underlying distributions. When the two classes are non-separable, only rates slower than $\Omega(1/\sqrt{n})$ are known. 
Another key notion is a "low-noise condition" [25], under which certain classifiers can be shown to attain a rate faster than $o(1/\sqrt{n})$, albeit not at the $O(1/n)$ rate unless the two classes are separable. Specifically, let $\alpha$ denote a constant such that

$P(|P(Y = 1|X) - 1/2| \le t) \le O(t^{\alpha})$,   (1)

holds when $t \to 0$. This is said to be a low-noise assumption, since as $\alpha \to +\infty$, the two classes start becoming separable, that is, the Bayes risk approaches zero. Under this low-noise assumption, the known rate for the excess 0-1 risk is $O((\frac{1}{n})^{\frac{1+\alpha}{2+\alpha}})$ [23]. Note that this is always slower than $O(\frac{1}{n})$ when $\alpha < +\infty$.

There has been a surge of recent results on high-dimensional statistical analyses of M-estimators [26, 9, 1]. These however are largely focused on parameter error bounds, empirical and population log-likelihood, and sparsistency. In this paper however, we are interested in analyzing the zero-one classification error under high-dimensional sampling regimes. One could stitch these recent results together to obtain some error bounds: use bounds on the excess log-likelihood, and use transforms from [2] to convert excess log-likelihood bounds into bounds on the 0-1 classification error. However, the resulting bounds are very loose, and in particular, do not yield the fast rates that we seek.

In this paper, we leverage the closed form expression for the zero-one classification error for our generative model, and directly analyse it to give faster rates for any linear discriminant method. Our analyses show that, assuming a sparse linear discriminant in addition, the simple $\ell_1$-regularized logistic regression classifier achieves near optimal fast rates of $O(\frac{s \log p}{n})$, even without requiring that the two classes be separable.

2 Problem Setup

We consider the problem of high dimensional binary classification under the following generative model. Let $Y \in \{0, 1\}$ denote a binary response variable, and let $X = (X_1, \ldots, X_p) \in \mathbb{R}^p$ denote a set of $p$ covariates. For technical simplicity, we assume $\Pr[Y = 1] = \Pr[Y = 0] = \frac{1}{2}$; however our analysis easily extends to the more general case when $\Pr[Y = 1], \Pr[Y = 0] \in [\delta_0, 1 - \delta_0]$, for some constant $0 < \delta_0 < \frac{1}{2}$. We assume that $X|Y \sim N(\mu_Y, \Sigma_Y)$, i.e. conditioned on a response, the covariate follows a multivariate Gaussian distribution. We assume we are given $n$ training samples $\{(X^{(1)}, Y^{(1)}), (X^{(2)}, Y^{(2)}), \ldots, (X^{(n)}, Y^{(n)})\}$ drawn i.i.d. from the conditional Gaussian model above.

For any classifier $C : \mathbb{R}^p \to \{1, 0\}$, the 0-1 risk, or simply the classification error, is given by $R_{0\text{-}1}(C) = E_{X,Y}[\ell_{0\text{-}1}(C(X), Y)]$, where $\ell_{0\text{-}1}(C(x), y) = 1(C(x) \ne y)$ is the 0-1 loss. It can also be simply written as $R(C) = \Pr[C(X) \ne Y]$. 
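As an illustrative aside, the generative model and the 0-1 risk just defined are easy to simulate directly; in the following sketch the function names, the sampler, and the use of numpy are our own illustrative choices, not part of the paper's analysis:

```python
import numpy as np

def sample_model(n, mu1, mu0, Sigma, rng):
    """Draw n i.i.d. pairs with Pr[Y=1] = Pr[Y=0] = 1/2 and X | Y ~ N(mu_Y, Sigma)."""
    y = rng.integers(0, 2, size=n)
    noise = rng.multivariate_normal(np.zeros(len(mu1)), Sigma, size=n)
    # Pick the class-conditional mean for each sample, then add Gaussian noise.
    X = np.where(y[:, None] == 1, mu1, mu0) + noise
    return X, y

def empirical_risk(w, b, X, y):
    """Empirical 0-1 risk of the linear classifier C(x) = 1(w^T x + b > 0)."""
    return float(np.mean((X @ w + b > 0).astype(int) != y))
```

For instance, with $\Sigma = I_2$, $\mu_1 = (1, 0)$, $\mu_0 = (-1, 0)$, the linear rule $w = \mu_1 - \mu_0$, $b = 0$ has risk $1 - \Phi(1) \approx 0.159$ by a standard Gaussian computation, and the empirical risk above concentrates around that value as $n$ grows.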
The classifier attaining the lowest classification error is known as the Bayes classifier, which we will denote by $C^*$. Under the generative model assumption above, the Bayes classifier can be derived simply as $C^*(X) = 1(\log \frac{\Pr[Y=1|X]}{\Pr[Y=0|X]} > 0)$, so that a given sample $X$ is classified as 1 if $\frac{\Pr[Y=1|X]}{\Pr[Y=0|X]} > 1$, and as 0 otherwise. We denote the error of the Bayes classifier by $R^* = R(C^*)$.

When $\Sigma_1 = \Sigma_0 = \Sigma$,

$\log \frac{\Pr[Y=1|X]}{\Pr[Y=0|X]} = (\mu_1 - \mu_0)^T \Sigma^{-1} X + \frac{1}{2}(-\mu_1^T \Sigma^{-1} \mu_1 + \mu_0^T \Sigma^{-1} \mu_0)$   (2)

and we denote this quantity as $w^{*T} X + b^*$, where

$w^* = \Sigma^{-1}(\mu_1 - \mu_0), \quad b^* = \frac{-\mu_1^T \Sigma^{-1} \mu_1 + \mu_0^T \Sigma^{-1} \mu_0}{2}$,

so that the Bayes classifier can be written as $C^*(x) = 1(w^{*T} x + b^* > 0)$.

For any trained classifier $\hat{C}$ we are interested in bounding the excess risk, defined as $R(\hat{C}) - R^*$. The generative approach to training a classifier is to estimate $\Sigma^{-1}$ and $\delta$ from data, and then plug the estimates into Equation 2 to construct the classifier. This classifier is known as the linear discriminant analysis (LDA) classifier, whose theoretical properties have been well studied in the classical fixed $p$ setting. The discriminative approach to training is to estimate $\frac{\Pr[Y=1|X]}{\Pr[Y=0|X]}$ directly from samples.

2.1 Assumptions.

We assume that the means are bounded, i.e. $\mu_1, \mu_0 \in \{\mu \in \mathbb{R}^p : \|\mu\|_2 \le B_\mu\}$, where $B_\mu$ is a constant which doesn't scale with $p$. We assume that the covariance matrix $\Sigma$ is non-degenerate, i.e. all eigenvalues of $\Sigma$ are in $[B_{\lambda_{\min}}, B_{\lambda_{\max}}]$. 
Additionally, we assume $\sqrt{(\mu_1 - \mu_0)^T \Sigma^{-1}(\mu_1 - \mu_0)} \le B_s$, which gives a lower bound on the Bayes classifier's classification error: $R^* \ge 1 - \Phi(\frac{1}{2} B_s) > 0$. Note that this assumption is different from the definition of separable classes in [11] and the low-noise condition in [25]; the two classes are still not separable because $R^* > 0$.

2.1.1 Sparsity Assumption.

Motivated by [7], we assume that $\Sigma^{-1}(\mu_1 - \mu_0)$ is sparse, with at most $s$ non-zero entries. Cai and Liu [7] extensively discuss and show that such a sparsity assumption is much weaker than assuming $\Sigma^{-1}$ and $(\mu_1 - \mu_0)$ to be individually sparse. We refer the reader to [7] for an elaborate discussion.

2.2 Generative Classifiers

Generative techniques work by estimating $\Sigma^{-1}$ and $(\mu_1 - \mu_0)$ from data and plugging them into Equation 2. In high dimensions, simple estimation techniques do not perform well: when $p \gg n$, the sample estimate of the covariance matrix $\hat{\Sigma}$ is singular, and using the generalized inverse of the sample covariance matrix makes the estimator highly biased and unstable. Numerous alternative approaches have been proposed by imposing structural conditions on $\Sigma$ or $\Sigma^{-1}$ and $\delta$ to ensure that they can be estimated consistently. Some early work based on nearest shrunken centroids [29, 30], feature annealed independence rules [14], and Naive Bayes [4] imposed independence assumptions on $\Sigma$, which are often violated in real-world applications. [4, 13] impose more complex structural assumptions on the covariance matrix and suggest more complicated thresholding techniques. 
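The degeneracy of the naive plug-in rule when $p \gg n$ can be checked directly: the sample covariance of $n$ points in $\mathbb{R}^p$ has rank at most $n - 1$, so $\hat{\Sigma}$ is singular and $\hat{\Sigma}^{-1}(\hat{\mu}_1 - \hat{\mu}_0)$ is undefined without further structure. A minimal sketch (the dimensions are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                            # fewer samples than covariates
X = rng.standard_normal((n, p))           # toy draws with covariance I_p

Sigma_hat = np.cov(X, rowvar=False)       # p x p sample covariance
rank = np.linalg.matrix_rank(Sigma_hat)   # at most n - 1, hence < p: singular
print(rank < p)                           # prints True
```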
Most commonly, $\Sigma^{-1}$ and $\delta$ are assumed to be sparse, and then some thresholding techniques are used to estimate them consistently [17, 28].

2.3 Discriminative Classifiers.

Recently, more direct techniques have been proposed to solve the sparse LDA problem. Let $\hat{\Sigma}$ and $\hat{\mu}_d$ be consistent estimators of $\Sigma$ and $\mu = \frac{\mu_1 + \mu_0}{2}$. Fan et al. [15] proposed the Regularized Optimal Affine Discriminant (ROAD) approach, which minimizes $w^T \Sigma w$ with $w^T \mu$ restricted to be a constant value, under an $\ell_1$-penalty on $w$:

$w_{ROAD} = \arg\min_{w^T \hat{\mu} = 1, \, \|w\|_1 \le c} \; w^T \hat{\Sigma} w$   (3)

Kolar and Liu [22] provided theoretical insights into the ROAD estimator by analysing its consistency for variable selection. Cai and Liu [7] proposed another variant called the linear programming discriminant (LPD), which tries to make $w$ close to the Bayes rule's linear term $\Sigma^{-1}(\mu_1 - \mu_0)$ in the $\ell_\infty$ norm. This can be cast as a linear programming problem related to the Dantzig selector [8]:

$w_{LPD} = \arg\min_{w} \; \|w\|_1 \quad \text{s.t.} \; \|\hat{\Sigma} w - \hat{\mu}\|_\infty \le \lambda_n$   (4)

Mai et al. 
[24] proposed another version of sparse linear discriminant analysis, based on an equivalent least squares formulation of LDA, where they solve an $\ell_1$-regularized least squares problem to produce a consistent classifier.

All the techniques above either do not have finite sample convergence rates, or their 0-1 risk converges at a slow rate of $O(\sqrt{\frac{s \log p}{n}})$.

In this paper, we first provide an analysis of classification error rates for any classifier with a linear discriminant function, and then follow this analysis by investigating the performance of generative and discriminative classifiers for the conditional Gaussian model.

3 Classifiers with Sparse Linear Discriminants

We first analyze any classifier with a linear discriminant function, of the form $C(x) = 1(w^T x + b > 0)$. We first note that the 0-1 classification error of any such classifier is available in closed form as

$R(w, b) = 1 - \frac{1}{2} \Phi\left(\frac{w^T \mu_1 + b}{\sqrt{w^T \Sigma w}}\right) - \frac{1}{2} \Phi\left(\frac{-(w^T \mu_0 + b)}{\sqrt{w^T \Sigma w}}\right)$,   (5)

which can be shown by noting that $w^T X + b$ is a univariate normal random variable when conditioned on the label $Y$.

Next, we relate the 0-1 classification error above to that of the Bayes classifier. Recall the earlier notation of the Bayes classifier as $C^*(x) = 1(x^T w^* + b^* > 0)$. The following theorem is a key result of the paper: it shows that for any linear discriminant classifier whose linear discriminant parameters are close to those of the Bayes classifier, the excess 0-1 risk is bounded only by second order terms of the difference. Note that this theorem will enable fast classification rates if we obtain fast rates for the parameter error.

Theorem 1. 
Let $w = w^* + \Delta$, $b = b^* + \Omega$, with $\Delta \to 0$, $\Omega \to 0$. Then we have

$R(w = w^* + \Delta, b = b^* + \Omega) - R(w^*, b^*) = O(\|\Delta\|_2^2 + \Omega^2)$.   (6)

Proof. Denote $S^* = \sqrt{(\mu_1 - \mu_0)^T \Sigma^{-1} (\mu_1 - \mu_0)}$; then we have $\frac{\mu_1^T w^* + b^*}{\sqrt{w^{*T} \Sigma w^*}} = \frac{1}{2} S^*$ and $\frac{-\mu_0^T w^* - b^*}{\sqrt{w^{*T} \Sigma w^*}} = \frac{1}{2} S^*$. Using (5) and the Taylor series expansion of $\Phi(\cdot)$ around $\frac{1}{2} S^*$, we have

$|R(w, b) - R(w^*, b^*)| = \left| \left( \Phi\left(\frac{\mu_1^T w + b}{\sqrt{w^T \Sigma w}}\right) - \Phi(\tfrac{1}{2} S^*) \right) + \left( \Phi\left(\frac{-\mu_0^T w - b}{\sqrt{w^T \Sigma w}}\right) - \Phi(\tfrac{1}{2} S^*) \right) \right|$
$\le K_1 \left| \frac{(\mu_1 - \mu_0)^T w}{\sqrt{w^T \Sigma w}} - S^* \right| + K_2 \left( \frac{\mu_1^T w + b}{\sqrt{w^T \Sigma w}} - \tfrac{1}{2} S^* \right)^2 + K_3 \left( \frac{-\mu_0^T w - b}{\sqrt{w^T \Sigma w}} - \tfrac{1}{2} S^* \right)^2$   (7)

where $K_1, K_2, K_3 > 0$ are constants, because the first and second order derivatives of $\Phi(\cdot)$ are bounded.

First note that $|\sqrt{w^T \Sigma w} - \sqrt{w^{*T} \Sigma w^*}| = O(\|\Delta\|_2)$, because $\|w^*\|_2$ is bounded. Denote $w'' = \Sigma^{\frac{1}{2}} w$, $\Delta'' = \Sigma^{\frac{1}{2}} \Delta$, $w''^* = \Sigma^{\frac{1}{2}} w^*$, and $a'' = \Sigma^{-\frac{1}{2}}(\mu_1 - \mu_0)$, so that $w''^* = a''$. We have (by the binomial Taylor series expansion)

$\frac{(\mu_1 - \mu_0)^T w}{\sqrt{w^T \Sigma w}} - S^* = \frac{a''^T w''}{\sqrt{w''^T w''}} - \sqrt{a''^T a''} = \sqrt{a''^T a''} \left( \frac{1 + \frac{a''^T \Delta''}{a''^T a''}}{\sqrt{1 + 2 \frac{a''^T \Delta''}{a''^T a''} + \frac{\Delta''^T \Delta''}{a''^T a''}}} - 1 \right) = O\left( \frac{\|\Delta''\|_2^2}{\sqrt{a''^T a''}} \right)$   (8)

Noting that $w'' \to a''$, $\Delta'' \to 0$, $\|\Delta\|_2 = \Theta(\|\Delta''\|_2)$, and $S^*$ is lower bounded, we have $\left| \frac{(\mu_1 - \mu_0)^T w}{\sqrt{w^T \Sigma w}} - S^* \right| = O(\|\Delta\|_2^2)$.

Next we bound $\left| \frac{\mu_1^T w + b}{\sqrt{w^T \Sigma w}} - \frac{1}{2} S^* \right|$:

$\left| \frac{\mu_1^T w + b}{\sqrt{w^T \Sigma w}} - \frac{1}{2} S^* \right| = \left| \frac{(\mu_1^T w^* + b^*)(\sqrt{w^{*T} \Sigma w^*} - \sqrt{w^T \Sigma w}) + \sqrt{w^{*T} \Sigma w^*} (\mu_1^T \Delta + \Omega)}{\sqrt{w^T \Sigma w} \sqrt{w^{*T} \Sigma w^*}} \right| = O(\sqrt{\|\Delta\|_2^2 + \Omega^2})$   (9)

where we use the fact that $|\mu_1^T w^* + b^*|$ and $S^*$ are bounded. Similarly $\left| \frac{-\mu_0^T w - b}{\sqrt{w^T \Sigma w}} - \frac{1}{2} S^* \right| = O(\sqrt{\|\Delta\|_2^2 + \Omega^2})$.

Combining the above bounds we get the desired result.

4 Logistic Regression Classifier

In this section, we show that the simple $\ell_1$-regularized logistic regression classifier attains fast classification error rates. Specifically, we are interested in the M-estimator [21] below:

$(\hat{w}, \hat{b}) = \arg\min_{w,b} \left\{ \frac{1}{n} \sum_{i=1}^{n} \left( -Y^{(i)}(w^T X^{(i)} + b) + \log(1 + \exp(w^T X^{(i)} + b)) \right) + \lambda(\|w\|_1 + |b|) \right\}$,   (10)

which maximizes the penalized log-likelihood of the logistic regression model, which also corresponds to the conditional probability of the response given 
the covariates $P(Y|X)$ for the conditional Gaussian model.

Note that here we penalize the intercept term $b$ as well. Although the intercept term is usually not penalized (e.g. [19]), some packages (e.g. [16]) penalize the intercept term. Our analysis shows that penalizing the intercept term does not degrade the performance of the classifier.

In [2, 31] it is shown that minimizing the expected risk of the logistic loss also minimizes the classification error for the corresponding linear classifier. $\ell_1$-regularized logistic regression is a popular classification method in many settings [18, 5]. Several commonly used packages ([19, 16]) have been developed for $\ell_1$-regularized logistic regression, and recent works ([20, 10]) have focused on scaling regularized logistic regression to ultra-high dimensions and large numbers of samples.

4.1 Analysis

We first show that the $\ell_1$-regularized logistic regression estimator above converges to the Bayes classifier parameters. Next we use the theorem from the previous section to argue that, since the estimated parameters $\hat{w}, \hat{b}$ are close to the Bayes classifier's parameters $w^*, b^*$, the excess risk of the classifier using the estimated parameters is tightly bounded as well.

For the first step, we first show a restricted eigenvalue condition for $X' = (X, 1)$, where $X$ are our covariates, which come from a mixture of two Gaussian distributions $\frac{1}{2} N(\mu_1, \Sigma) + \frac{1}{2} N(\mu_0, \Sigma)$. Note that $X'$ is not zero centered, which is different from existing scenarios ([26], [6], etc.) that assume the covariates are zero centered. We denote $w' = (w, b)$, $S' = \{i : w'^*_i \ne 0\}$, and $s' = |S'| \le s + 1$.

Lemma 1. With probability $1 - \delta$, $\forall v' \in A' \subseteq \{v' \in \mathbb{R}^{p+1} : \|v'\|_2 = 1\}$, for some constants $\kappa_1, \kappa_2, \kappa_3 > 0$ we have

$\|X' v'\|_2 \ge \kappa_1 \sqrt{n} - \kappa_2 w(A') - \kappa_3 \sqrt{\log \frac{1}{\delta}}$   (11)

where $w(A') = E_{g' \sim N(0, I_{p+1})}[\sup_{a' \in A'} g'^T a']$ is the Gaussian width of $A'$. In the special case when $A' = \{v' : \|v'_{\bar{S}'}\|_1 \le 3 \|v'_{S'}\|_1, \|v'\|_2 = 1\}$, we have $w(A') = O(\sqrt{s \log p})$.   (12)

Proof. First note that $X'$ is sub-Gaussian with bounded parameter, and

$\Sigma' = E\left[\frac{1}{n} X'^T X'\right] = \begin{bmatrix} \Sigma + \frac{1}{2}(\mu_1 \mu_1^T + \mu_0 \mu_0^T) & \frac{1}{2}(\mu_1 + \mu_0) \\ \frac{1}{2}(\mu_1 + \mu_0)^T & 1 \end{bmatrix}$.

Note that $A \Sigma' A^T = \begin{bmatrix} \Sigma + \frac{1}{4}(\mu_1 - \mu_0)(\mu_1 - \mu_0)^T & 0 \\ 0 & 1 \end{bmatrix}$, where $A = \begin{bmatrix} I_p & -\frac{1}{2}(\mu_1 + \mu_0) \\ 0 & 1 \end{bmatrix}$ and $A^{-1} = \begin{bmatrix} I_p & \frac{1}{2}(\mu_1 + \mu_0) \\ 0 & 1 \end{bmatrix}$. Notice that $A A^T = \begin{bmatrix} I_p & 0 \\ 0 & 0 \end{bmatrix} + \begin{bmatrix} -\frac{1}{2}(\mu_1 + \mu_0) \\ 1 \end{bmatrix} \begin{bmatrix} -\frac{1}{2}(\mu_1 + \mu_0)^T & 1 \end{bmatrix}$ and $A^{-1} A^{-T} = \begin{bmatrix} I_p & 0 \\ 0 & 0 \end{bmatrix} + \begin{bmatrix} \frac{1}{2}(\mu_1 + \mu_0) \\ 1 \end{bmatrix} \begin{bmatrix} \frac{1}{2}(\mu_1 + \mu_0)^T & 1 \end{bmatrix}$, and we can see that the singular values of $A$ and $A^{-1}$ are lower bounded by $\frac{1}{\sqrt{2 + B_\mu^2}}$ and upper bounded by $\sqrt{2 + B_\mu^2}$. Let $\lambda_1$ be the minimum eigenvalue of $\Sigma'$, and $u'_1$ ($\|u'_1\|_2 = 1$) the corresponding eigenvector. From the expression $A \Sigma' A^T A^{-T} u'_1 = \lambda_1 A u'_1$, we know that the minimum eigenvalue of $\Sigma'$ is lower bounded. Similarly, the largest eigenvalue of $\Sigma'$ is upper bounded. Then the desired result follows the proof of Theorem 10 in [1]. Although the proof of Theorem 10 in [1] is for zero-centered random variables, the proof remains valid for non zero-centered random variables.

When $A' = \{v' : \|v'_{\bar{S}'}\|_1 \le 3 \|v'_{S'}\|_1, \|v'\|_2 = 1\}$, [9] gives $w(A') = O(\sqrt{s \log p})$.

Having established a restricted eigenvalue result in Lemma 1, we next use the result in [26] for parameter recovery in generalized linear models (GLMs) to show that $\ell_1$-regularized logistic regression can recover the Bayes classifier parameters.

Lemma 2. When the number of samples $n \gg s' \log p$, and we choose $\lambda = c_0 \sqrt{\frac{\log p}{n}}$ for some constant $c_0$, then we have

$\|w^* - \hat{w}\|_2^2 + (b^* - \hat{b})^2 = O\left(\frac{s' \log p}{n}\right)$   (13)

with probability at least $1 - O(\frac{1}{p^{c_1}} + \frac{1}{n^{c_2}})$, where $c_1, c_2 > 0$ are constants.

Proof. Following the proof of Lemma 1, we see that the conditions (GLM1) and (GLM2) in [26] are satisfied. Following the proofs of Proposition 2 and Corollary 5 in [26], we have the desired result. Although the proofs of Proposition 2 and Corollary 5 in [26] are for zero-centered random variables, they remain valid for non zero-centered random variables.

Combining Lemma 2 and Theorem 1, we have the following theorem, which gives a fast rate for the excess 0-1 risk of a classifier trained using $\ell_1$-regularized logistic regression.

Theorem 2. With probability at least $1 - O(\frac{1}{p^{c_1}} + \frac{1}{n^{c_2}})$, where $c_1, c_2 > 0$ are constants, when we set $\lambda = c_0 \sqrt{\frac{\log p}{n}}$ for some constant $c_0$, the Lasso estimate $\hat{w}, \hat{b}$ in (10) satisfies

$R(\hat{w}, \hat{b}) - R(w^*, b^*) = O\left(\frac{s \log p}{n}\right)$   (14)

Proof. This follows from Lemma 2 and Theorem 1.

5 Other Linear Discriminant Classifiers

In this section, we provide convergence results for the 0-1 risk of the other linear discriminant classifiers discussed in Section 2.3.

Naive Bayes. We compare the discriminative approach using $\ell_1$-regularized logistic regression to the generative approach using naive Bayes. For illustration purposes we consider the case where $\Sigma = I_p$, $\mu_1 = \frac{M_1}{\sqrt{s}} \begin{bmatrix} 1_s \\ 0_{p-s} \end{bmatrix}$ and $\mu_0 = -\frac{M_0}{\sqrt{s}} \begin{bmatrix} 1_s \\ 0_{p-s} \end{bmatrix}$, where $0 < B_1 \le M_1, M_0 \le B_2$ are unknown but bounded constants. In this case $w^* = \frac{M_1 + M_0}{\sqrt{s}} \begin{bmatrix} 1_s \\ 0_{p-s} \end{bmatrix}$ and $b^* = \frac{1}{2}(-M_1^2 + M_0^2)$. Using naive Bayes we estimate $\hat{w} = \bar{\mu}_1 - \bar{\mu}_0$, where $\bar{\mu}_1 = \frac{1}{\sum_i 1(Y^{(i)} = 1)} \sum_{Y^{(i)} = 1} X^{(i)}$ and $\bar{\mu}_0 = \frac{1}{\sum_i 1(Y^{(i)} = 0)} \sum_{Y^{(i)} = 0} X^{(i)}$. Thus with high probability we have $\|\hat{w} - w^*\|_2^2 = O(\frac{p}{n})$; using Theorem 1 we get a slower rate than the bound given in Theorem 2 for discriminative classification using $\ell_1$-regularized logistic regression.

LPD [7]. LPD uses a linear program similar to the Dantzig selector.

Lemma 3 (Cai and Liu [7], Theorem 4). Let $\lambda_n = C \sqrt{\frac{s \log p}{n}}$ with $C$ being a sufficiently large constant. 
Let $n > \log p$, let $\Delta = (\mu_1 - \mu_0)^T \Sigma^{-1} (\mu_1 - \mu_0) > c_1$ for some constant $c_1 > 0$, and let $w_{LPD}$ be obtained as in Equation 4. Then with probability greater than $1 - O(p^{-1})$, we have $\frac{R(w_{LPD})}{R^*} - 1 = O(\sqrt{\frac{s \log p}{n}})$.

SLDA [28]. SLDA uses thresholded estimates of $\Sigma$ and $\mu_1 - \mu_0$. We state a simpler version.

Lemma 4 ([28], Theorem 3). Assume that $\Sigma$ and $\mu_1 - \mu_0$ are sparse. Then we have $\frac{R(w_{SLDA})}{R^*} - 1 = O(\max((\frac{s \log p}{n})^{\alpha_1}, (\frac{S \log p}{n})^{\alpha_2}))$ with high probability, where $s = \|\mu_1 - \mu_0\|_0$, $S$ is the number of non-zero entries in $\Sigma$, and $\alpha_1, \alpha_2 \in (0, \frac{1}{2})$ are constants.

ROAD [15]. ROAD minimizes $w^T \Sigma w$ with $w^T \mu$ restricted to be a constant value, under an $\ell_1$-penalty on $w$.

Lemma 5 (Fan et al. [15], Theorem 1). Assume that with high probability $\|\hat{\Sigma} - \Sigma\|_\infty = O(\sqrt{\frac{\log p}{n}})$ and $\|\hat{\mu} - \mu\|_\infty = O(\sqrt{\frac{\log p}{n}})$, and let $w_{ROAD}$ be obtained as in Equation 3. Then with high probability, we have $R(w_{ROAD}) - R^* = O(\sqrt{\frac{s \log p}{n}})$.

6 Experiments

In this section we describe experiments which illustrate the rates for the excess 0-1 risk given in Theorem 2. In our experiments we use Glmnet [19], where we set the option to penalize the intercept term along with all other parameters. Glmnet is a popular package for $\ell_1$-regularized logistic regression using coordinate descent methods.

For illustration purposes, in all simulations we use $\Sigma = I_p$, $\mu_1 = 1_p + \frac{1}{\sqrt{s}} \begin{bmatrix} 1_s \\ 0_{p-s} \end{bmatrix}$, $\mu_0 = 1_p - \frac{1}{\sqrt{s}} \begin{bmatrix} 1_s \\ 0_{p-s} \end{bmatrix}$.

To illustrate our bound in Theorem 2, we consider three different scenarios. 
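A single run of this kind of simulation can be sketched end-to-end as follows. This is an illustrative stand-in, not the actual experimental code: it replaces Glmnet with a plain proximal-gradient (ISTA) solver for the penalized estimator in (10), and the step size, iteration count, and problem sizes are arbitrary choices.

```python
import numpy as np
from math import erf, sqrt

def gaussian_data(n, p, s, rng):
    # Sigma = I_p, mu1 = 1_p + s^{-1/2}[1_s; 0], mu0 = 1_p - s^{-1/2}[1_s; 0]
    spike = np.zeros(p)
    spike[:s] = 1.0 / np.sqrt(s)
    mu1, mu0 = 1.0 + spike, 1.0 - spike
    y = rng.integers(0, 2, size=n)
    X = np.where(y[:, None] == 1, mu1, mu0) + rng.standard_normal((n, p))
    return X, y, mu1, mu0

def l1_logistic(X, y, lam, step=0.03, iters=4000):
    # Proximal gradient (ISTA) for l1-penalized logistic regression;
    # the intercept is appended as an extra column and penalized too.
    n, p = X.shape
    Z = np.hstack([X, np.ones((n, 1))])
    theta = np.zeros(p + 1)
    for _ in range(iters):
        grad = Z.T @ (1.0 / (1.0 + np.exp(-(Z @ theta))) - y) / n
        theta = theta - step * grad
        theta = np.sign(theta) * np.maximum(np.abs(theta) - step * lam, 0.0)
    return theta[:p], theta[p]

def closed_form_risk(w, b, mu1, mu0):
    # The 0-1 risk of 1(w^T x + b > 0) in closed form, Eq. (5), for Sigma = I_p
    Phi = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))
    scale = np.linalg.norm(w)
    return 1.0 - 0.5 * Phi((w @ mu1 + b) / scale) - 0.5 * Phi((-w @ mu0 - b) / scale)
```

For example, with n = 2000, p = 50, s = 5, and the regularization parameter set to $\sqrt{\log p / n}$, the excess risk $R(\hat{w}, \hat{b}) - R(w^*, b^*)$ of the fitted rule comes out small relative to the Bayes risk, which for this mean configuration equals $1 - \Phi(1)$.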
Figure 1: Simulations for different Gaussian classification problems showing the dependence of classification error on different quantities. All experiments plot the average of 20 trials. In all experiments we set the regularization parameter $\lambda = \sqrt{\frac{\log p}{n}}$. (a) Only varying $p$. (b) Only varying $s$. (c) Dependence of excess 0-1 risk on $n$.

In Figure 1a we vary $p$ while keeping $s$ and $(\mu_1 - \mu_0)^T \Sigma^{-1} (\mu_1 - \mu_0)$ constant. Figure 1a shows, for different $p$, how the classification error changes with increasing $n$: we plot the classification error against the quantity $n / \log p$, and the figure agrees with our result on the excess 0-1 risk's dependence on $p$. In Figure 1b we vary $s$ while keeping $p$ and $(\mu_1 - \mu_0)^T \Sigma^{-1} (\mu_1 - \mu_0)$ constant. Figure 1b shows, for different $s$, how the classification error changes with increasing $n$: we plot the classification error against the quantity $n / s$, and the figure agrees with our result on the excess 0-1 risk's dependence on $s$. In Figure 1c we show how $R(\hat{w}, \hat{b}) - R(w^*, b^*)$ changes with respect to $\frac{1}{n}$ in one instance of the Gaussian classification problem. We can see that the excess 0-1 risk achieves the fast rate and agrees with our bound.

Acknowledgements

We acknowledge the support of ARO via W911NF-12-1-0390 and NSF via IIS-1149803, IIS-1320894, IIS-1447574, and DMS-1264033, and NIH via R01 GM117594-01 as part of the Joint DMS/NIGMS Initiative to Support Research at the Interface of the Biological and Mathematical Sciences.

References

[1] Arindam Banerjee, Sheng Chen, Farideh Fazayeli, and Vidyashankar Sivakumar. Estimation with norm regularization.
In Advances in Neural Information Processing Systems, pages 1556–1564, 2014.

[2] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.

[3] Peter J. Bickel and Elizaveta Levina. Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli, pages 989–1010, 2004.

[4] Peter J. Bickel and Elizaveta Levina. Covariance regularization by thresholding. The Annals of Statistics, pages 2577–2604, 2008.

[5] C. M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, 2006. ISBN 9780387310732.

[6] Peter Bühlmann and Sara van de Geer. Statistics for High-dimensional Data: Methods, Theory and Applications. Springer Science & Business Media, 2011.

[7] Tony Cai and Weidong Liu. A direct estimation approach to sparse linear discriminant analysis. Journal of the American Statistical Association, 106(496), 2011.

[8] Emmanuel Candès and Terence Tao. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, pages 2313–2351, 2007.

[9] Venkat Chandrasekaran, Benjamin Recht, Pablo A. Parrilo, and Alan S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.

[10] Weizhu Chen, Zhenghao Wang, and Jingren Zhou. Large-scale L-BFGS using MapReduce. In Advances in Neural Information Processing Systems, pages 1332–1340, 2014.

[11] L. Devroye, L. Györfi, and G. Lugosi.
A Probabilistic Theory of Pattern Recognition. Springer New York, 1996.

[12] Luc Devroye. A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 1996.

[13] David Donoho and Jiashun Jin. Higher criticism thresholding: Optimal feature selection when useful features are rare and weak. Proceedings of the National Academy of Sciences, 105(39):14790–14795, 2008.

[14] Jianqing Fan and Yingying Fan. High dimensional classification using features annealed independence rules. Annals of Statistics, 36(6):2605, 2008.

[15] Jianqing Fan, Yang Feng, and Xin Tong. A road to classification in high dimensional space: the regularized optimal affine discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(4):745–771, 2012.

[16] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research, 9:1871–1874, 2008.

[17] Yingying Fan, Jiashun Jin, Zhigang Yao, et al. Optimal classification in sparse Gaussian graphic model. The Annals of Statistics, 41(5):2537–2571, 2013.

[18] Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. Do we need hundreds of classifiers to solve real world classification problems? The Journal of Machine Learning Research, 15(1):3133–3181, 2014.

[19] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.

[20] Siddharth Gopal and Yiming Yang. Distributed training of large-scale logistic models. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 289–297, 2013.

[21] T. Hastie, R. Tibshirani, and J. Friedman.
The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.

[22] Mladen Kolar and Han Liu. Feature selection in high-dimensional classification. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 329–337, 2013.

[23] Vladimir Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: École d'Été de Probabilités de Saint-Flour XXXVIII-2008, volume 2033. Springer Science & Business Media, 2011.

[24] Qing Mai, Hui Zou, and Ming Yuan. A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, page asr066, 2012.

[25] Enno Mammen, Alexandre B. Tsybakov, et al. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.

[26] Sahand Negahban, Bin Yu, Martin J. Wainwright, and Pradeep K. Ravikumar. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems, pages 1348–1356, 2009.

[27] Andrew Y. Ng and Michael I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems 14 (NIPS 2001), 2001.

[28] Jun Shao, Yazhen Wang, Xinwei Deng, Sijian Wang, et al. Sparse linear discriminant analysis by thresholding for high dimensional data. The Annals of Statistics, 39(2):1241–1265, 2011.

[29] Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences, 99(10):6567–6572, 2002.

[30] Sijian Wang and Ji Zhu. Improved centroids estimation for the nearest shrunken centroid classifier. Bioinformatics, 23(8):972–979, 2007.

[31] Tong Zhang.
Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, pages 56–85, 2004.