
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
I don't have many comments, other than to alert the authors to this manuscript: Q. Mai and H. Zou, "A Note On the Connection and Equivalence of Three Sparse Linear Discriminant Analysis Methods," Technometrics, vol. 55, pp. 243-246, 2013. [Online]. Available: http://www.tandfonline.com/doi/abs/10.1080/00401706.2012.746208
which demonstrates that several "different" sparse discriminant algorithms are, in fact, the same. Thus, this contribution is perhaps even more useful.
Q2: Please summarize your review in 1-2 sentences
The authors provide novel misclassification rates under perhaps the simplest high-dimensional classification problem, comparing several different approaches.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
+ Well-written paper; claims are clear.
+ Best convergence rates for the logistic regression classifier in the high-dimensional sampling regime.
- The usefulness for machine learning is left stranded.
- The end of the paper is sloppy and finishes quickly, while there is room for a conclusion and a discussion.
Q2: Please summarize your review in 1-2 sentences
This paper presents convergence rates for linear discriminant models and zero-one loss for high-dimensional classification problems. The main result is a convergence rate proportional to the inverse of the training set size, the number of covariates, and the sparsity of the Bayes classifier for the logistic regression case (Theorem 2).
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
I have read the paper "Fast Classification Rates for High-Dimensional Conditional Gaussian Models". The paper studies the problem of binary classification using a Gaussian model and provides some theoretical results on the convergence of the classification error rates (compared to the Bayes classifier).
The paper presents some nice theoretical results and is interesting to some extent. I am generally positive about the paper but I have the following concerns.
First, practical relevance. One very important point about classification is that we need a tuning-free trained classifier; when a classifier has tuning parameters, we usually use data splitting to decide them. Therefore, it is more relevant to study either the error rates of a tuning-free classifier, or those of a classifier with tuning parameters selected by data splitting. This matters because, while the convergence rate under the "ideal tuning parameters" may be fast, the convergence rate of the classifier that is actually used could be very slow, simply because the convergence of the data-driven tuning parameters to the ideal ones is slow. Given this, the paper needs to (a) compare more carefully with recent tuning-free classifiers, and (b) address the effect of selecting the tuning parameters by data splitting.
Second, the paper puts all its focus on the convergence rate and does not contain any insightful comparison of computing speed. The penalization methods the authors present are all computationally slow when $p$ is large. It is desirable to have a detailed comparison of computational complexity or speed.
Third, given that this is a well-studied topic, it is desirable to compare with different methods using at least 1-2 real data sets (say, microarray data sets). The methods to compare include those discussed in the paper [3,7,13,15,17], where easy to implement, as well as support vector machines (SVM) and random forests.
Fourth, the paper comments that assuming $\Sigma^{-1} (\mu_1 - \mu_0)$ is sparse is weaker than assuming both $\Sigma^{-1}$ and $(\mu_1 - \mu_0)$ are sparse. Mathematically, this is true. However, the more relevant point is whether we have an important application example (say, in genomics or genetics) where $\Sigma^{-1} (\mu_1 - \mu_0)$ is sparse but either $\Sigma^{-1}$ or $(\mu_1 - \mu_0)$ is non-sparse. If we don't have such application examples in areas many people care about, then arguing how the two assumptions differ is largely a matter of mathematical curiosity, not of scientific interest.
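[Editorial note appended to this point of the review: the mathematical gap between the two assumptions is easy to exhibit numerically. The following sketch, not taken from the paper under review, constructs an equicorrelation covariance, for which both $\Sigma$ and $\Sigma^{-1}$ are fully dense, together with a mean difference chosen so that the discriminant direction $\Sigma^{-1}(\mu_1 - \mu_0)$ has only three nonzero entries.]

```python
import numpy as np

p = 50
rho = 0.5
# Equicorrelation covariance: every off-diagonal entry equals rho,
# so both Sigma and its inverse are fully dense matrices.
Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))
Sigma_inv = np.linalg.inv(Sigma)

# Pick a sparse discriminant direction beta (only 3 nonzeros) and
# back out the corresponding mean difference mu_diff = Sigma @ beta.
beta = np.zeros(p)
beta[:3] = 1.0
mu_diff = Sigma @ beta

# Sigma^{-1} (mu_1 - mu_0) recovers the sparse beta ...
recovered = Sigma_inv @ mu_diff
print(np.count_nonzero(np.abs(recovered) > 1e-8))  # 3 nonzeros
# ... even though Sigma^{-1} and mu_diff are themselves dense.
print(np.count_nonzero(np.abs(Sigma_inv) > 1e-8))  # p*p nonzeros
print(np.count_nonzero(np.abs(mu_diff) > 1e-8))    # p nonzeros
```

Whether such structure arises in applications people care about is, of course, exactly the reviewer's question; the sketch only shows the assumption gap is real.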
Last, the presentation can be improved. The material can be streamlined so that the main contribution is better highlighted.
Q2: Please summarize your review in 1-2 sentences
Good paper in terms of theory, but the authors seem to have very little concern for practical matters (tuning-parameter selection, computation speed, real-data analysis). This largely downweights the importance of the paper.
Submitted by Assigned_Reviewer_4
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper establishes a framework for evaluating the error of the linear discriminant approach given conditional Gaussian models. It presents a faster rate of convergence than all existing results, without using standard assumptions on separation or low-noise. The paper ends a bit suddenly. A better simulation study and a discussion section would be good additions.
Well written. Clear. Better results.
Q2: Please summarize your review in 1-2 sentences
It presents a faster rate of convergence than all existing results, without using standard assumptions on separation or low-noise. A better simulation study and a discussion section would be good additions.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note, however, that reviewers and area chairs are busy and may not read long, vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their kind comments
regarding the novelty and importance of our work. Some reviewers felt
that a Conclusion and Discussion section would greatly add to the paper:
we will be sure to add this in the camera-ready version (including in part
the discussions in the responses below).
AR2:
We thank
the reviewer for pointing out the related work by Mai and Zou. It
would be interesting to look into the relationship between those various
methods and l1-regularized logistic regression in later work.
AR3:
We thank the reviewer for comments on the presentation
of our paper. We will make the relevant edits in the camera-ready
version.
AR4:
Regarding choosing the tuning parameter ($\lambda$) in l1-regularized logistic regression: in practice, $\lambda$ is typically chosen using cross-validation or via information-theoretic measures such as BIC. Providing error bounds for such practical selections of $\lambda$ is still an active area of research, and certainly an interesting line of future work.
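[Editorial note: a minimal sketch of the cross-validation selection mentioned above, assuming scikit-learn is available. The synthetic data, grid size, and fold count are illustrative choices, not details from the paper.]

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(0)
n, p, s = 200, 50, 5

# Logistic model with a sparse linear discriminant direction:
# only the first s coordinates of beta are nonzero.
beta = np.zeros(p)
beta[:s] = 2.0
X = rng.standard_normal((n, p))
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta))).astype(int)

# l1-penalized logistic regression; the penalty level (C = 1/lambda)
# is selected by 5-fold cross-validation over a grid of 20 values.
clf = LogisticRegressionCV(Cs=20, cv=5, penalty="l1", solver="liblinear")
clf.fit(X, y)
print(clf.C_)                        # selected penalty level
print(np.count_nonzero(clf.coef_))   # number of covariates retained
```

The theoretical rates in the paper are stated for an ideal $\lambda$; as the rebuttal notes, bounds for the data-driven choice above are a separate open question.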
Regarding the tradeoff in computational complexity when using logistic regression with an l1 penalty relative to other related work: we agree that an explicit tradeoff and
comparison of computational complexities is an interesting question. It
was slightly beyond the focus of this work, which was on statistical
rates, but we will definitely expand upon this in an extended version.
Regarding motivating the sparsity assumption from an application
perspective: we will do so in the cameraready version. In particular,
many social network and bioinformatics applications have dense connected
components in the inverse covariance graph: here, though the inverse
covariance graph is not sparse, our less stringent sparsity assumption
would still hold.
AR5:
We thank the reviewer for their kind
comments on our paper.
AR6:
Regarding the usefulness of our
work to machine learning: (a) In theoretical contributions: we provide
a "direct" approach to analyzing 0-1 classification error rates, rather than the traditional approach of bounding the log-likelihood loss and then using transforms (inequalities) to bound the 0-1 classification error. (b) In methodological contributions: we systematically studied linear discriminant classifiers under a conditional Gaussian model, and showed that the simple approach of logistic regression with an l1 penalty
statistically outperforms other linear discriminant approaches, including
many *recently proposed* approaches. This has strong consequences for
practical users of machine learning based classification
techniques.
AR8:
Our paper focuses on the theory of
linear discriminant classifiers under conditional Gaussian models, and
experiments were only used to demonstrate our theoretical results. In particular, the experiments section presented synthetic simulations whose main purpose was to show that the bounds we obtain are tight. We plotted the 0-1 error rate against each parameter (sparsity (s), dimension (p), number of samples (n)), and showed the stacking of the curves against the control parameter for different (n, p, s) settings.

