Paper ID: 1081
Title: Contrastive Learning Using Spectral Methods
Reviews

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper develops a model for "contrastive learning" which aims to capture structure in a "foreground" set of target observations as opposed to generic "background" structure. The moments of the foreground and background are expressed as tensor products, estimation is then based on the difference between the moments of the foreground and moments of the background.

The concept of contrastive learning appears to be novel and of general interest, at least conceptually. The authors compare the idea conceptually to PCA and linear discriminant analysis, I would be interested in seeing a comparison of performance as well. Does contrastive learning actually pick up different structure, and when?

This is a clear and well written paper within the space constraints. However, with that said, this paper suffers from having too much stuffed into 8 pages. My main complaint is simply that I want more detail.

The section describing contrastive hidden Markov models, and the application to genomics does not have enough detail. The equations in the paragraph lines 286-300, really don't have enough detail to figure out what's different and what's the same as in the topic model application. The results in lines 336-342 also lack detail and context. And it's very confusing for readers that Figure 2a goes with the topic modeling application (at least according to line 269), but Figure 2b goes with the HMM application. Even worse, you don't mention this in the caption!!

line 295: the word "justifiable" is ambiguous. Does that mean that's what you did? Or does that mean, you could if you wanted to? But again, this whole paragraph needs more detail.

The "specificity" test described on lines 258-269 seems slightly bizarre to me. Though, I admit that this may be due again to a lack of detail under the space constraints. Why is it informative that a model trained only on foreground data has poor performance when used as a classifier for foreground/background? Plus the purpose of this method is not to classify foreground v. background, it is to analyze the variation in the foreground (lines 050-052, lines 073-075, or even line 106).

I would find a simulation study that demonstrates the method's ability to pick up patterns in foreground data, much more convincing than this specificity test.


Small item:
line 376: "…under the [assumption] that the support…"
Q2: Please summarize your review in 1-2 sentences
The idea of separating foreground structure from background structure appears to be novel, and of general interest. However, this paper lacks detail because it is clearly up against the strict space constraints.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary:
--------

The authors introduce the paradigm of "contrastive learning" for mixture
models, which aims at learning the parameters of mixture components that are
unique to "foreground" data and not present in "background" data. They
present a spectral approach to contrastive learning to mixtures over models which
have a spectral learning algorithm. They illustrate their approach with examples
of contrastive learning of topic models and HMMs.



Comments:
---------

The main idea of the paper, the "contrastive learning" framework, is well
motivated and might indeed be of practical use (although it looks a bit tailored
to the "spectral trick" of subtracting the moment matrices/tensors). Both
numerical experiments (LDA, HMM) nicely illustrate that contrastive learning
might be an interesting and worthwhile approach to pursue. The proposed
algorithm builds on very recent developments in spectral learning, which makes
it of interest to a broader NIPS audience. It seems to be technically sound,
although I lack the expertise to be able to check all the technical
contributions in detail. The paper is clearly structured and well written.


My comments mainly focus on the fact that the authors could have made a better
effort to put their contribution into the bigger machine learning context:

1) The authors state (p2, last line and p3) that the direct, possibly naive way
of doing contrastive learning would be to learn two models, one
for the background and one for the foreground, and then to isolate the foreground
specific components. They state advantages of their approach, especially
robustness, however they never actually show these experimentally in
comparison to the naive approach. The authors should compare their algorithm on
their two numerical experiments with a "naive" contrastive model: A Mixture
model each for the for- and the background with (partially) shared parameters
learned by EM (or a similar model). Is the spectral method better / faster
/ more robust?

2) pp7, section 5.1: It is conceivable that a "naive contrastive model" (see
above) could also be robust to sparse sampling of the background model under
the assumption of disjoint support of background and for ground-specific
components. It is important to also demonstrate the robustness of the proposed
algorithm on data relative to the naive model.

3) Learning the statistics of some signal (foreground) that is embedded in some
noise (background) is of course a major theme in signal processing / stats /
machine learning. It would have been nice if the authors briefly compared their
approach to other methods in the introduction or discussion.
Q2: Please summarize your review in 1-2 sentences
The paper contains intriguing ideas, is well written and of interest to the NIPS
audience. It would have benefitted from more thorough experiments.

Submitted by Assigned_Reviewer_7

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
[Summary]
This paper proposes a method for what the authors call contrastive learning
on the basis of a power-iteration based tensor decomposition method for
latent variable models. An experimental result on a simple topic model
is shown to demonstrate usefulness of the proposal. Efficiency of the
proposal in gemonic data analysis is suggested via an experiment on chromatin
states.

[Quality]
The proposed method is interesting from application point of view, in
particular its demonstrated efficiency in finding topics specific to
a certain foreground corpus in contrast to background. From the theoretical
viewpoint, however, the argument in Section 5 is on something which seems
different from what has been discussed up to Section 4, in particular
in the definition of M_2 given in the line numbered 348-349, where the
coefficients are \sigma_i=sign(\lambda_i) rather than \lambda_i themselves.
This incoherence in the descriptions makes possible contributions of this
paper somewhat obscure.

[Clarity]
In Figure 1, I do not understand what the red and blue lines represent.
The example is not the same as the model considered in this paper (mixture
models), which I feel makes it a rather bad example for explanation
purposes.
A document is defined as a bag-of-words, but in Sect. 3.1 "the first,
second, and third words" in a document are used in the description,
which is confusing.
In the line numbered 072 "learn amplify".
In the first line of page 3, "just from {x_n^f}," the superscript should
read b instead of f.
In the line numbered 276, "under the that ...".
In the line numbered 317, "Each chromatin state correspond(s) to".
In the line numbered 390, "to compute execute".
After Proposition 4, \circ and "nnz" are used without explicit definitions.

[Originality]
The idea of applying a power-method based tensor decomposition method
to the contrastive learning is considered original.

[Significance]
This paper is of importance in application point of view, in that
the proposed approach to contrastive learning has successfully demonstrated
its efficiency even in an application of a simple topic model. It is rather
difficult to judge the significance of this paper in the theoretical side,
mainly because of the incoherence mentioned above, as well as a rather brief
explanation given in the main paper.
Q2: Please summarize your review in 1-2 sentences
This paper proposes an intersting approach to contrastive
learning using a power-method based tensor decomposition method. Demonstrated
usefulness of the proposal via an experiment on a simple topic model is
interesting.

Submitted by Assigned_Reviewer_8

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)

This paper tackles the problem of learning a mixture model to capture
properties of "foreground data" which are not present in background
data. The proposed technique extends recent work on method of moments
and tensor factorization. The novelty in the paper is (i) the
observation that one can subtract moments before factorization and (ii)
developing a variant of the robust power method from Anandkumar et al. 2013
that doesn't require positive semidefiniteness.

The idea presented in the paper is simple, natural, and novel. The
main weakness is that the paper doesn't really compare the proposed
method with several natural baselines either theoretically or
empirically.

073: the paper says that the stated goal is not to discriminate between
foreground and background, but isn't this exactly the intuition of
being contrastive?

383: In the case where the foreground words are disjoint from the rest
(B and C), the intuition that you only need enough background data to
get the signs of the topics in B and C correct seems right. However,
the paragraph then talks about the accuracy of the foreground topics,
which doesn't relevant. I would have expected to see some statement
characterizing the number of samples you need to get the sign correct.

Also, I normally think of the foreground as being captured by a modest
number of topics and that the background is enormous and requires a lot
more, so $K$ would have to be pretty large, close to $D$. So footnote
1, which talks about one topic, seems pretty unmotivated.

The fact that power iteration requires multiple passes over the
training data ($K$ times the number of power iteration updates times
the number of restarts of the power tensor method) is somewhat
discouraging, since one of the benefits of these method of moments
approaches is that you can take one pass over the data. Could you
randomly project the data onto $O(K)$ dimensions, building up a much
smaller dense tensor that you can store in memory?

Baseline 1: estimate the foreground model and background models
separately. Under the assumption that the model is well-specified,
then the normal spectral method of moments will provide a consistent
estimator of both models, from which it is easy to detect which are the
common topics. What can you say about this method theoretically (it is
consistent after all) and empirically?

Baseline 2: simply use EM to do parameter estimation in the full
generative model. Is local optima is actually a problem in this
setting? My experience is that EM works pretty well anyway, and any
sort of empirical statement about this approach must compare with EM.
Even if EM outperforms spectral, you can always use the latter as
initialization to the former.

Theoretically, I think it's interesting to think about the behavior of
the maximum likelihood estimator in this case, which in some sense
tries to spread its modeling efforts evenly, whereas the contrastive
estimator seems more targeted towards what you want. So there might be
a regime that the contrastive estimator is better, even theoretically?

One piece of work which is not directly relevant but has a similar
high-level flavor is contrastive estimation:
http://www.cs.cmu.edu/~nasmith/papers/smith+eisner.acl05.pdf
Actually, now that I think about it, the intuition is somewhat
different - instead of using all the background data, we are using a
subset which is "close" to the foreground data.
Q2: Please summarize your review in 1-2 sentences

This paper provides a new spectral method of moments estimator for
estimating a mixture model on foreground data, contrasting against
background data. Though there could more comparison to baselines,
overall the idea put forth is novel and interesting.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank all the reviewers for their thoughtful comments and suggestions. As the reviewers mention, we had to leave out some discussions due to space limitation. We'll incorporate more descriptions and other suggestions in the revision.

Reviewer4
PCA or linear discriminant analysis do not yield parameters which can be interpreted as a topic's distribution over words or the emission matrix in HMM. This makes it difficult to compare our method against them.

We agree that the HMM section needs more description and will include them in revision. Will fix the caption of Figure 2. On Line 295, we say that the reference [13] provides justification (under a "fast mixing" condition) for averaging over all consecutive observation triplets.

Classification accuracy is only a surrogate for measuring how well a model captures topics specific to the foreground. Our experiments show that the contrastive model assigns higher probability to documents that are about USA but not economics and lower probability to documents that are about both "USA and economics" (when "economics" is the background). The model trained just on foreground is a baseline.

Unfortunately we did not have room to include a simulation study (although we did conduct one), but will try to include it in a revision.

Reviewer5
Good suggestion on placing this work in the wider context, we'll include more discussions in revision. Some related techniques in stats are anomaly and novelty detection. Key differences are that typically in those settings the background data is abundant and they don't directly learn a generative model of the anomaly.

Constructing a generative model that accounts for both foreground and background is an interesting research direction. A key advantage of our approach is that we do not need to learn a good model of the background. We have experiments showing that this is more efficient when the background is large and complex, since we only need to compute its moments. When the background is very sparse and we apply LDA, we find that it doesn't capture coherent topics but just sets of keywords. It's not clear how to use this to filter out shared topics in the foreground in a principled way. Will include more discussion of this.

Reviewer_7
Thanks for your review and interest in the applications. The key theoretical contributions are that we developed fast new tensor algorithm to deal with indefinite tensors, and proved how this allows us to recover a foreground specific model. This extends existing tensor methods which require positive semidefiniteness. We had to put most of the theoretical discussion in the supplement due to space; will try to clarify this in revision.

The model in Sec 3,4 is consistent with Sec 5. To tranform Eqn.2 to line 349, we map a_i to mu_i*sqrt(|w_i|), sigma_i to sign(w_i) and lambda_i to sign(w_i)/sqrt(|w_i|). We used this new notation in Sec 5 to convey that its results are generally applicable to indefinite tensors. Prop2 shows how we can recover {a_t, lambda_t}, and from this we can reconstruct {w_t, mu_t} for LDA. We will clarify the notations in revision.

Contrastive HMM and LDA are special cases of indefinite tensors. The key idea is that when we take contrast of moments, M_2^f -gamma M_2^b and M_3^f - gamma M_3^b, the foreground specific components have positive sign and the background have negative sign.

In the bag-of-words representation, we can arbitrarily order the words and call one word the first, another word the second, and so on. This is used to explain the structure of the moments (similar to [9]). Sec 5.2 shows how to work directly with the word count vector to avoid explicitly enumerating triples of words in the document.

Fig 1 is meant to be a toy illustration to motivate contrast. The idea is that we would like to learn a subspace (represented by the blue line in (c)) that captures variance of the foreground data that is not already captured by the background projection (represented by the red line). We'll clarify this.

Reviewer_8
Our motivation for contrastive learning is to learn a full generative model that captures the data structures specific to the foreground (FG), rather than to learn a decision boundary that discriminate FG data. It's true, as in the LDA experiments, that the contrastive model does well in classification. The parameters of the generative contrastive model are also of interest; in the genomic study, they suggest novel biological states.

On projecting to a dense KxKxK tensor--that's exactly what we do for the HMM experiment, where K is small. We use the projection method similar to [9]. We've also tried the projection method on topics. It works well but we found it to be faster to exploit the sparsity in the moments. Multiple passes through the data can be avoided in the tensor power iteration by using orthogonal iteration which updates all K parameter vectors at once. The overall computation time is, up to a factor of ~2, the same as using multiple passes. Overall the running time is linear in input size. We'll include more discussion of this in the revision.

While it's true that the background (BG) can often be enormous, we want to include general settings to allow, for example, the BG to be sparse or to be a specific biological experiment and the FG to be larger. It's still interesting to understand the contrast in such cases. When the BG is large and complex, the efficiency advantage of our method is that we only need to compute its moments once instead of needing to learn a good model of it.

It's true that baseline 1 is also consistent. However, when the BG is sparse, we observe that standard LDA doesn't learn coherent topics, just subsets of keywords. It's not clear how to filter the shared topics in a principled way. It's also not clear how to extend baseline 1 to contrastive HMM. A full generative model capturing both FG and BG process is an interesting direction of research.