
Submitted by
Assigned_Reviewer_4
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
This paper describes a nice trick to extend MinHash to
estimate the generalization of the Jaccard index to more than two sets.
Using the trick, the paper presents efficient search algorithms for (1)
given S and T, identifying U for which (S, T, U) are similar, (2) given S,
identifying T and U for which (S, T, U) are similar, and (3) identifying S,
T, and U for which (S, T, U) are similar. There are experiments showing the
value of a >2-way similarity measure.
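For concreteness, the trick rests on the fact that, under a random permutation of the universe, the probability that S, T, and U all attain the same minimum hash value equals |S∩T∩U| / |S∪T∪U|. A minimal sketch of that estimator (our own illustration, using Python's built-in salted tuple hash as a stand-in for random permutations; not the authors' code):

```python
def minhash(s, seed):
    # Minimum salted hash over the set; one seed simulates one permutation.
    return min(hash((seed, x)) for x in s)

def estimate_3way_resemblance(S, T, U, num_hashes=2000):
    # P[all three min-hashes collide] = |S ∩ T ∩ U| / |S ∪ T ∪ U|,
    # so the empirical collision frequency estimates the 3-way resemblance.
    matches = sum(
        1 for seed in range(num_hashes)
        if minhash(S, seed) == minhash(T, seed) == minhash(U, seed)
    )
    return matches / num_hashes
```

For example, with S = {1,2,3,4}, T = {1,2,3,5}, U = {1,2,6}, the true 3-way resemblance is |{1,2}| / |{1,...,6}| = 1/3, and the estimate concentrates around that value as num_hashes grows.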
This paper is not written
well and is consequently difficult to read.
As far as I know, the
work is original. The algorithms are significant and may prove to be
useful in many domains. If this paper is edited to remove all of the
grammar errors, it will be very nice.
The anecdotes for the
"Google Sets" problem are very cool. I found the search-quality
experiments very hard to follow. It would help a lot if the metric
discussion were promoted to come earlier in the section and discussed in a
little more depth.
Example presentation issues (restricting to one per page):
Page 1: "one of the most widely used notion" => "notions"
Page 2: "may be" => "maybe"
Page 3: "any two point" => "any two points"
Page 4: "One of the peculiarity" => "peculiarities"
Page 5: "cBP" in the Section 5 title should be "cCP"
Page 6: "if we are interested in very high similarity cluster" => "a very" or "clusters"
Page 7: "report top 5 words" => "report the top 5 words"
Page 8: "We formulated ... and obtain" => "formulate" or "obtained"
Q2: Please summarize your review in 1-2 sentences
I think this is a nice contribution that could have
practical impact. It needs a serious edit by a native English speaker.
Submitted by
Assigned_Reviewer_5
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
This paper extends the MinHash mechanism to k-way
Jaccard similarity and presents a theoretical analysis following the
framework of [Indyk & Motwani '98; Har-Peled et al. '12]. The work is
in general solid and supported mostly by the theoretical analysis.
Experiments demonstrate the advantage of k-way similarity over the
typical pairwise methods.
Overall, I like this paper. Although
the majority of the theorems and proofs are based on existing results, the
authors have some interesting constructions (e.g., the f_1 and f_2
functions) and cover several aspects of the 3-way scenario, including
near neighbor, close pair, and best cluster. One minor concern is that
although the argument for using k-way similarity is convincing, its
practical value may not be demonstrated strongly enough in the experiments.
For instance, the Google Sets experiment is interesting, but instead of
"Sum Resemblance", perhaps the correct function is something that can
simulate the conjunction (i.e., w1 is similar to w AND w2 is similar to
w). Similarly, in Table 2, the results would be stronger if the pairwise
setting became a similar conjunction or average setting. Nevertheless,
even if the enhanced pairwise similarity approach performs comparably to
the k-way similarity method, the latter is still a more principled
approach.

Thanks for your response to the comments.
Q2: Please summarize your review in 1-2 sentences
The proposed approach is a nice extension of MinHash
similarity to general k-way Jaccard similarity with a fairly complete
theoretical analysis. The experiments could be stronger to better justify
the value of k-way similarity over the pairwise case.
Submitted by
Assigned_Reviewer_6
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
This paper addresses the problem of k-way
Jaccard-based similarity search. To do so, the authors show that minwise
hashing is efficient for the problem of k-way c-NN search. The
development they propose is a direct extension of the one known for
pairwise similarities. The extension to other efficient similarities
through the PGF transformation is also a direct extension of prior work.
The work presented here is well conducted, clear, and of good
quality. At the same time, the focus on Jaccard similarity for sets
limits the applicability and the impact of the method. It is not common at
all to use presence/absence representations in text-mining applications
(such as the Google Sets and webspam problems), and the work would be more
interesting if different similarities were considered. Regarding the
experiments performed, for the Google Sets problem, I did not find the
words obtained by the 3-way similarity really better than the ones
obtained by the standard Google algorithm. For the retrieval experiments,
it seems obvious that the k-way (k > 2) methods will perform better
than the pairwise one, as they use more information (x_1 and x_2,
sometimes x_3, vs. x_1 only). This experiment is not really convincing and
does not show the utility of k-way similarity as compared to, say, an
average of pairwise similarities.
Q2: Please summarize your review in 1-2 sentences
While the work is interesting, it represents a direct
extension of previous studies and is limited in its scope.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
Dear Reviewers, thank you all for your efforts and
encouraging comments. Your constructive inputs will be carefully
considered.
Assigned_Reviewer_5:
Q:> The Google Sets experiment is interesting, but instead of "Sum
Resemblance", perhaps the correct function is something that can simulate
the conjunction (i.e., w1 is similar to w AND w2 is similar to w).
>> We appreciate the reviewer's concern here; in fact, it is a very
interesting question to find a function which simulates the notion of
"AND" well.
If we
are constrained to the notion of similarity being pairwise (which is the
case in the current literature), there is not much we can do. The most
intuitive way is either to combine the similarities using addition ("Sum
Resemblance" in the paper) or to use pairwise retrieval for each of w_1 and
w_2 and take the intersection of the top results (the PI method in the paper).
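For concreteness, these two pairwise baselines can be sketched as follows (our own minimal illustration on plain sets; the function names and toy setup are ours, and the paper's actual implementations may differ in details):

```python
def resemblance(a, b):
    # Pairwise resemblance (Jaccard similarity) between two sets.
    return len(a & b) / len(a | b)

def sum_resemblance_top(w1, w2, candidates, k):
    # "Sum Resemblance": rank each candidate w by R(w1, w) + R(w2, w).
    def score(name):
        return resemblance(w1, candidates[name]) + resemblance(w2, candidates[name])
    return sorted(candidates, key=score, reverse=True)[:k]

def pairwise_intersection_top(w1, w2, candidates, k):
    # "PI": retrieve the top-k candidates for each query separately,
    # then intersect the two result lists.
    def top(w):
        return set(sorted(candidates,
                          key=lambda name: resemblance(w, candidates[name]),
                          reverse=True)[:k])
    return top(w1) & top(w2)
```

Both sketches scan all candidates, which already illustrates the efficiency issue: neither objective is known to admit a provably sublinear search algorithm.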
Even if we found such an "ideal" objective which combines the
similarities between w1 and w2 in the best possible way, we must also ensure
that searching with that objective is efficient. We all agree that
efficiency is crucial in search; even linear algorithms are
prohibitive at the scale of the web. It should be noted that there is no
known provably sublinear algorithm even for the very simple-looking "Sum
Resemblance".
3-way resemblance is a very intuitive higher-order
similarity, which seems to model the notion of "AND" better than other
intuitive pairwise approaches, and at the same time it admits sublinear
search.
Assigned_Reviewer_6:
Q:> I did not
find the words obtained by the 3-way similarity really better than the
ones obtained by the standard Google algorithm.
>> Our aim was to demonstrate the applicability of 3-way similarity for
the Google Sets application over other similarity measures, not to beat
Google's original algorithm, which most likely is much more complicated and
uses a lot more data than 1.2M wiki documents. Interestingly, for the query
"finger" and "lakes" we do retrieve better words (see line 354) than Google,
which is very encouraging. Even for the query "Jaguar" and "Tiger", 3-way
retrieves all cat-family words in the top 5, while Google retrieves "dog" as
the 5th best candidate, which is not in the cat family.
Q:> For the retrieval
experiments, it seems obvious that the k-way (k > 2) methods will
perform better than the pairwise one, as they use more information (x_1
and x_2, sometimes x_3, vs. x_1 only).
>> We agree with the reviewer that if we use more information than in
the pairwise case, then we should do better. The experiments establish that
3-way and higher-order resemblance search utilizes this extra information in
the right way, which we believe is the first step.
The most exciting
part, which comes from the theory in the paper, is that 3-way similarity
search admits sublinear algorithms with the same guarantees as the pairwise
case. Thus, it is possible to use the extra information in multiple
queries, which cannot be used by vanilla pairwise retrieval, without
sacrificing computation, which is very critical in the big-data world.
The biggest disadvantage of sum-of-pairwise-similarity retrieval,
or any other way of combining the objective, is that it is not known to be
efficient. There is no known provably sublinear algorithm for such a
combination of similarities.
**************************************
There have been
plenty of studies on pairwise similarities, but the notion of 3-way and
k-way similarities is severely under-exploited. We hope our paper will be
a good starting point for an exciting line of research and applications.
We expect many opportunities can be explored. For example, as
Reviewer_6 pointed out, it would be useful to test k-way similarity on
non-binary (i.e., not 0/1) datasets. We fully agree this would be a
promising direction, but here we would like to explain this issue a bit
more, for the sake of discussion:
1. Other than MinHash, it is difficult to
find another good 3-way or k-way similarity estimation algorithm. For
example, random projection is extremely popular, but after random
projection, all data points become (zero-mean) normal, whose distribution
is determined by covariance (i.e., pairwise similarity) only.
2. While many datasets used in academic research are not too high-dimensional
(e.g., fewer than 10,000 or a million dimensions) and often not binary, it
appears industry indeed uses extremely high-dimensional representations
based on n-grams. For example, around 2000, industry papers described the
use of 7-grams. These days, some search-engine applications have started to
use even 11-grams (by combining all 1-, 2-, 3-, ..., 11-grams). With such
high-order n-grams, usually only absence/presence information matters (in
fact, most n-grams will occur at most once).
3. On the limited number of datasets we
have experimented with in the past, it appears that when the dimensionality
of the data is not too high, the non-binary feature values often help (for
classification/clustering). But if we use higher n-grams, often only
presence/absence matters. For example, using 1-grams (character n-grams),
the Webspam dataset has 254 dimensions (the size of the ASCII table minus 1).
The classification accuracy using a linear SVM is about 93% with
non-binary (tf-idf) features vs. 88% with binary-only features. With 3-grams
(dimensionality 16 million), the classification accuracy is 99.6% whether we
use binary or non-binary (tf-idf) features. We observed a similar phenomenon
with other text datasets.
4. Despite the success of using extremely
high-dimensional 0/1 high-order n-gram representations, we agree with the
reviewer that it is nevertheless an interesting research problem to look
into non-binary features. The definition of 3-way minwise hashing can be
extended to, e.g., sum_i min{x_i, y_i, z_i} / sum_i max{x_i, y_i, z_i},
which might be something interesting to study.
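As a minimal sketch of that proposed extension (our own illustration; the function name is ours), the generalized quantity is straightforward to compute directly, and on 0/1 vectors it reduces to the ordinary 3-way resemblance:

```python
def weighted_3way_resemblance(x, y, z):
    # Generalized (weighted) 3-way resemblance over non-negative vectors:
    #   sum_i min{x_i, y_i, z_i} / sum_i max{x_i, y_i, z_i}
    num = sum(min(a, b, c) for a, b, c in zip(x, y, z))
    den = sum(max(a, b, c) for a, b, c in zip(x, y, z))
    # Convention: three all-zero vectors are treated as identical.
    return num / den if den else 1.0
```

On binary vectors x = (1,0,1), y = (1,1,0), z = (1,0,0) this gives 1/3, matching |S∩T∩U| / |S∪T∪U| for the corresponding sets; on non-binary input such as (2,0), (1,3), (1,1) it gives 1/5.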
Thanks again for all your
encouraging comments.
 