
Submitted by
Assigned_Reviewer_3
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
The paper presents the idea of using correlated random
features for efficient learning in the semi-supervised setting. It draws on
several areas: Nystrom approximation, CCA, and random Fourier features.
Overall, the authors have done a nice job of combining these ideas to
propose Nystrom-based Correlated Views, and of conducting detailed experiments.
The datasets considered are quite comprehensive and the results look good.
It would have been nicer if a synthetic experiment had been
conducted to illustrate the claims/arguments given in the first paragraph
of Section 2.3, in particular the effect of weakly correlated features.
The paper is reasonably well written. In some places I had
difficulty in understanding; for example, is the squared error loss used for
the classification problems during training as well? The other comment is
that statistical comparison of multiple algorithms on multiple datasets is
well known (see, for example, the paper by Janez Demsar, JMLR 7 (2006),
pp. 1-30). It would have been nicer if such a comparison had been made; this
would help make the performance claims stronger in terms of
statistical significance.
There are some typos in the paper
(e.g., in the conclusion, the third-to-last sentence).
Q2: Please summarize your review in 1-2 sentences
The paper presents the idea of using correlated random
features for efficient learning in the semi-supervised setting. Overall, this
is a decent paper combining ideas from several areas, namely Nystrom
approximation, canonical correlation analysis, and random Fourier features.
Nevertheless, novelty is somewhat limited for the same reason of
borrowing ideas and some results from these areas. Submitted
by Assigned_Reviewer_5
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
UPDATE: I acknowledge that I have read the author
rebuttal.
The authors present a technique for semi-supervised
learning based on the following idea: first, they use unlabeled data to
learn useful features, using Nystrom featurization together with canonical
correlation analysis (the idea being to featurize with respect to
multiple subsets of the data, and then use CCA to find features that
are correlated across the subsets). Once this step is performed, the
labeled data is used to train a model (regularized based on the
correlation coefficients found by CCA).
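For concreteness, the two-stage procedure described here can be sketched in plain numpy. This is a hypothetical illustration, not the authors' code: the RBF kernel, the landmark-set sizes, the regularization constants, and the (1 - rho)/rho canonical-norm weights are illustrative assumptions based on this review's description and on Kakade & Foster's canonical norm.

```python
import numpy as np

def nystrom_features(X, landmarks, gamma=1.0):
    """Nystrom features K(X, L) K(L, L)^{-1/2} for an assumed RBF kernel."""
    def rbf(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    K_nl = rbf(X, landmarks)
    K_ll = rbf(landmarks, landmarks)
    # symmetric inverse square root of the landmark kernel matrix
    w, V = np.linalg.eigh(K_ll + 1e-8 * np.eye(len(landmarks)))
    return K_nl @ (V @ np.diag(1.0 / np.sqrt(np.maximum(w, 1e-10))) @ V.T)

def cca(Z1, Z2, reg=1e-6):
    """Canonical basis for view 1 and canonical correlations, via SVD of
    the whitened cross-covariance (a standard CCA formulation)."""
    Z1 = Z1 - Z1.mean(0); Z2 = Z2 - Z2.mean(0)
    n = len(Z1)
    C11 = Z1.T @ Z1 / n + reg * np.eye(Z1.shape[1])
    C22 = Z2.T @ Z2 / n + reg * np.eye(Z2.shape[1])
    C12 = Z1.T @ Z2 / n
    def inv_sqrt(C):
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(np.maximum(w, 1e-12))) @ V.T
    U, corrs, _ = np.linalg.svd(inv_sqrt(C11) @ C12 @ inv_sqrt(C22))
    return inv_sqrt(C11) @ U, np.clip(corrs, 0.0, 1.0)

# -- stage 1 (unlabeled): two views from two random landmark subsets --
rng = np.random.default_rng(0)
X_unlab = rng.normal(size=(200, 5))
L1, L2 = X_unlab[:20], X_unlab[20:40]
Z1 = nystrom_features(X_unlab, L1)
Z2 = nystrom_features(X_unlab, L2)
B, corrs = cca(Z1, Z2)

# -- stage 2 (labeled): ridge regression in the canonical basis, with the
# canonical-norm weights (1 - rho)/rho penalizing weakly correlated features --
X_lab = rng.normal(size=(50, 5))
y = X_lab[:, 0] + 0.1 * rng.normal(size=50)
Phi = nystrom_features(X_lab, L1) @ B
penalty = (1.0 - corrs) / np.maximum(corrs, 1e-3)
w = np.linalg.solve(Phi.T @ Phi + np.diag(penalty) + 0.1 * np.eye(Phi.shape[1]),
                    Phi.T @ y)
pred = Phi @ w
```

The sketch omits details such as centering the labeled features with the unlabeled means, which a careful implementation would include.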
Quality/clarity: the paper
is confusing in parts (particularly the description of canonical ridge
regression) but well written overall. The experiments are well presented
and solid.
Originality: most of the ideas are borrowed from
elsewhere, but they are combined in a useful way and appear to be
well executed.
Significance: I think the combination of ideas,
together with the fact that the experiments are good, makes this paper
significant.
Other comments: The authors may wish to include a
reference to recent work on the method of moments, which is another
multi-view approach that seems related, at least to this reviewer's naive
intuition. (Feel free to ignore this comment if your intuition
disagrees.)
Q2: Please summarize your review in 1-2 sentences
The paper combines multiple interesting ideas
in a technically competent way, and obtains good experimental results. I
think this paper is quite strong overall. Submitted by
Assigned_Reviewer_6
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
Summary: This paper brings together two recently
popular trends of research, namely random features and multi-view
regression. It provides a two-step algorithm for semi-supervised learning.
In the first stage, they generate random features corresponding to two
views using the Nystrom method; in the second stage, they use CCA
to bias the optimization towards features correlated across the views via
the canonical norm, which penalizes less-correlated features
more heavily. The experimental results show the superior performance of
their approach over a state-of-the-art competitor, Laplacian Regularized
Least Squares (SSSL).
Comments: The paper is well
motivated and addresses an important problem. It operationalizes
the ideas proposed in (Kakade and Foster 2007) and presents a Nystrom
method to generate two sets of views which satisfy the multi-view assumption
(also detailed in (Kakade and Foster 2007)).
In short, the paper
harnesses the CCA results from (Kakade and Foster 2007) and the Nystrom
method results from (Bach 2013) to come up with an efficient and scalable
semi-supervised learning algorithm.
That said, I found the paper to
have limited mathematical novelty for a venue like NIPS; it is mostly an
engineering paper.
The authors could have considered joint
learning of the random features and CCA bases, which would have been more
novel and could possibly lead to better error bounds than the two methods
separately.
The experimental results are detailed and the proposed
approach (XNV) significantly beats SSSL.
The authors should
consider adding an additional tuning parameter (\lambda) for the canonical
norm in Algorithm 1 (9), which could improve accuracy further, since that
term appears to be on a different scale than the other two terms in
the objective.
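The suggestion can be illustrated with a small, hypothetical numpy sketch (not code from the paper): a separate parameter lam_cca rescales the canonical-norm term, alongside the ordinary ridge penalty, and a grid over its values could then be compared on held-out data. All names and values here are illustrative assumptions.

```python
import numpy as np

# Regularized least squares with a separately tuned canonical-norm term:
#   min_w ||Phi w - y||^2 + lam_ridge ||w||^2 + lam_cca * sum_i p_i w_i^2
# where p_i = (1 - rho_i)/rho_i are the assumed canonical-norm weights.
rng = np.random.default_rng(1)
Phi = rng.normal(size=(50, 8))        # stand-in canonical features
y = rng.normal(size=50)
corrs = np.linspace(0.95, 0.2, 8)     # stand-in canonical correlations
p = (1.0 - corrs) / corrs

def solve(lam_ridge, lam_cca):
    # closed-form ridge solution with the rescaled canonical penalty
    A = Phi.T @ Phi + lam_ridge * np.eye(8) + lam_cca * np.diag(p)
    return np.linalg.solve(A, Phi.T @ y)

# a grid over lam_cca lets one check whether rescaling the canonical term
# (which may sit on a different scale) improves held-out error
weights = {lam: solve(0.1, lam) for lam in (0.0, 0.1, 1.0, 10.0)}
```

Increasing lam_cca shrinks the weakly correlated coordinates more aggressively, which is exactly the scale adjustment the comment asks about.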
Q2: Please summarize your review in 1-2 sentences
Mostly an engineering paper which harnesses the
results from (Kakade and Foster 2007) and (Bach 2013) to come up with an
efficient and scalable semi-supervised learning algorithm.
**My evaluation of the paper remains the same even after
reading the author rebuttal. I feel that just the observation of using
random features in the CRR paper is not enough. That said, this is a nice
engineering paper better fit for a more applied venue.**
Q1: Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We thank the reviewers for their comments and
suggestions and respond to their main points. We first address the
question of novelty raised by reviewers 3 and 6.
Our work is based
on the important observation that random features define multiple views
that are automatically guaranteed to satisfy the multi-view assumption on
*any* dataset. We convert Canonical Ridge Regression (CRR, Kakade &
Foster, 2007), a theoretically motivated algorithm with impressive
performance guarantees, into a general-purpose tool that outperforms the
current state of the art. The resulting algorithm, XNV, is a powerful,
computationally efficient algorithm for semi-supervised learning which we
demonstrate can be widely applied.
CRR has had little impact to
date due to the highly specialized multi-view assumption, which both
rarely holds and is difficult to check in practice. We expect our
contribution will change this situation.
Reviewer 3: 1.
Synthetic experiments with weakly correlated features. We feel that
there is some confusion in terminology regarding the observed variables in
the dataset and the features we construct. Since features are constructed
randomly, we do not actively manipulate the correlations between them, and
hence do not report on the specific effect of weakly correlated features.
An interesting approach, deferred to future work, is sampling from
distributions that are designed so that the resulting features are weakly
correlated. This should accelerate the decay of correlation coefficients,
and may significantly improve performance.
2. Squared loss. We
used the squared error loss for training. This will be clarified in the final
version.
3. Comparing algorithms across data sets. Following
standard practice in the NIPS community, we reported on prediction error
with standard deviation measures combined with plots. We appreciate that
there are other ways to report comparisons of algorithms and datasets.
While we do not report statistical significance, the average across
datasets paints a clear picture of XNV’s improvement over other algorithms
on many datasets.
Reviewer 6: 1. Learning representations.
The question of feature learning is an interesting one. Much recent work
uses e.g. deep neural networks, which have been shown to be empirically
successful. Typically, however, such methods are
computationally intensive and come with few theoretical guarantees about the
properties of the learned representations.
We take a different
tack. Random features are cheap to compute, come with strong guarantees,
work well in practice, and are easily applied to big datasets!
2.
Additional tuning parameter. Our aim is to introduce a practical,
easy-to-use algorithm. We therefore keep tunable parameters to a minimum:
the kernel width and the ridge penalty.
From the theoretical
properties of CRR, we expect that the performance benefit from adding an
additional parameter for the CCA norm would not outweigh the cost in
additional tuning time. However, this cannot be ruled out a priori and
will be investigated in the journal version of the paper.
 