Submitted by Assigned_Reviewer_5
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
I think this is an interesting paper. The connection to Bloom filters took a while for me to parse, and the abstract was far from clear. However, I think the presentation is adequate for a conference paper. I like the emphasis on sub-linear classification cost, but wish this had been explored more explicitly in the experimental results. Overall, the experimental results are strong and are supported by useful theory.
Q2: Please summarize your review in 1-2 sentences
The paper proposes a novel mechanism for multi-label classification, inspired by Bloom filters (I thought of it as basically a theoretically grounded way to use random projections for label clustering). The paper then goes on to show how this framework can be augmented with label clustering methods to be more robust when label distributions are non-uniform (as is normally the case). Empirical results are strong.
Submitted by Assigned_Reviewer_6
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
This is a well-written paper that makes an interesting
contribution to an important topic. In fact, the idea of using Bloom
filters for dimensionality reduction in multi-label classification is
intriguing. This seems to be a companion paper of another paper, in which
this idea has already been introduced. The contribution of this paper is
to improve the encoding scheme used in the Bloom filter. To this end, the
structure of the label space is exploited by means of clustering methods.
As claimed by the authors, this encoding makes the filter more robust.
While I'm essentially fine with the idea of the method and the theoretical part of the paper, I am much less convinced by the empirical study. First, given the goal of efficiency, one would certainly expect an empirical analysis of runtime in the experimental part, not only of predictive accuracy.
Moreover, the two data sets used in the
experiments are somewhat debatable, especially due to their extremely low
label cardinality (it would be interesting to see the marginal label
distribution). In addition to the sparsity, the distribution will probably
be skewed, so that most of the labels will almost never occur. In fact,
one may wonder whether such data sets are sufficiently representative to
provide an idea of how the authors' approach performs. Indeed, the
performance of this approach (like the performance of other compression schemes) strongly depends on the label distribution, which is very specific to the data sets used.
Moreover, losses like the Hamming loss, which the authors focus on, are debatable in such cases, especially since always predicting negative will already yield an extremely low loss. For example, the authors report a Hamming loss of 0.000734 for SBF and "only" 0.000705 for RBF. Is this tiny difference of any meaning? Hard to say, especially since no standard deviations are reported.
Q2: Please summarize your review in 1-2
sentences
Nice paper that definitely holds promise. While the
theoretical part is fine, the empirical one is less convincing.
Submitted by Assigned_Reviewer_7
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
The paper proposes a multi-label classification algorithm: a large number of classes is hashed down into a smaller number of classifiers via a Bloom-filter-like rule. The hashing function is chosen to be consonant with the co-occurrence of rare class labels, for better decoding.
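For intuition, here is a minimal sketch of a Bloom-filter-style label encoding of the kind described above; the hash construction, code size, and naive decoding rule are illustrative assumptions, not the authors' actual scheme (one binary classifier would be trained per bit of the code):

```python
import hashlib

CODE_SIZE = 50   # number of code bits / binary classifiers (illustrative only)
N_HASHES = 2     # hash functions per label (illustrative only)

def label_bits(label):
    """Bits that a single label sets in the Bloom-filter-style code."""
    return {
        int(hashlib.sha1(f"{label}:{i}".encode()).hexdigest(), 16) % CODE_SIZE
        for i in range(N_HASHES)
    }

def encode(label_set):
    """Encode a label set as the OR of its labels' bit patterns."""
    code = [0] * CODE_SIZE
    for label in label_set:
        for b in label_bits(label):
            code[b] = 1
    return code

def decode(code, all_labels):
    """Naive decoding: predict a label iff all of its bits are set."""
    return {l for l in all_labels if all(code[b] for b in label_bits(l))}

labels = [f"label_{i}" for i in range(1000)]
print(decode(encode({"label_3", "label_7"}), labels))
```

With such a naive decoder, bits shared between labels produce false positives, which is why the choice of hash functions matters (as noted in the Significance comments below).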
Significance: The idea of mapping multi-label
classification into Bloom filters is insightful, and deserves to be
published. It's interesting that a Naive Bloom filter does not work well -- the experimental results show that you need to carefully choose the Bloom filters (rather than generating them randomly). The experimental results show that you can save ~2x in the number of output classes -- this is a good result, although not earth-shattering.
Novelty: This idea is new, as far as I know. The authors correctly cite paper reference [12] as the closest idea.
Clarity: The paper is quite clear.
Quality: I am concerned about the comparison to paper reference [12], which proposes learning a compressed version of the output labels and then using various compressed sensing decoding techniques to recover the sparse output labels. The current paper uses Orthogonal Matching Pursuit (OMP) to decode the compressed labels. But in paper [12], OMP is sometimes substantially worse than CoSaMP or LASSO. In the feedback, the authors say they will squeeze in some more results.
Q2: Please summarize your review in 1-2 sentences
A nice idea to use Bloom filters to perform
multi-label classification.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We first thank the reviewers for their helpful
reviews. We will work on the abstract to clarify it. We now answer the
reviewers' main points, and will modify the paper accordingly if it is
accepted.
Assigned_Reviewer_6:
- runtime analysis: We agree with the reviewer that a quantitative analysis should be added. For all methods, the runtime of building the models (resp. performing prediction) is dominated by the training (resp. prediction) time of the binary classifiers/regressors. The encoding step, the label clustering, and finding the optimal configuration for the Robust Bloom filter (RBF) are negligible. Thus, the runtime for RBF is roughly linear in the code size (while the runtime of binary relevance (BR) is linear in the number of labels).
For instance, on the Wikipedia1K dataset, training a single binary classifier takes about 12.35s on average (for a fixed hyperparameter), and performing prediction on the whole test set takes about 3.18s (for a single classifier, on average). This means that training and testing for BR respectively take about 206 min (1000 x 12.35s) and 53 min, while training and prediction of the individual binary classifiers for RBF with a code size of 500 take about half that time (and thus one tenth of it for a code size of 100). In contrast to the training time of single classifiers for a fixed hyperparameter, finding the optimal configuration for RBFs in terms of (#hubs, cluster size) for every code size takes 8 min in total.
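As a sanity check, the runtime figures above follow from a simple back-of-the-envelope calculation; the sketch below merely restates the per-classifier timings quoted in this paragraph and introduces no new measurements:

```python
# Per-binary-classifier timings quoted above for Wikipedia1K:
# 12.35 s training, 3.18 s prediction on the whole test set.
train_s, predict_s = 12.35, 3.18
n_labels = 1000  # binary relevance (BR) trains one classifier per label

print(f"BR: train ~{n_labels * train_s / 60:.0f} min, "
      f"predict ~{n_labels * predict_s / 60:.0f} min")

for code_size in (500, 100):  # RBF trains one classifier per code bit
    print(f"RBF (code size {code_size}): "
          f"train ~{code_size * train_s / 60:.0f} min, "
          f"predict ~{code_size * predict_s / 60:.0f} min")
```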
- label set distributions/skewed distributions: The marginal distribution was removed due to lack of space; we will add some numbers to the dataset description. On the Industries dataset, the maximum label cardinality is 30, and about 20% of the data have a label set of size greater than or equal to 2. On Wikipedia, the maximum label cardinality is 5, and about 10% of the data have a label set of size greater than or equal to 2. The distribution of labels is indeed skewed. We believe, however, that this kind of distribution appears in most (if not all) real-life datasets with a large number of classes, as discussed in Dekel et al. Our method is designed to perform well when reasonable label clusters can be found (which we do not believe to be a major limitation in real-life datasets) and, implicitly, when the classes that appear rarely can be predicted better than what a trivial classifier (which never predicts any label) can do (otherwise, Dekel et al.'s baseline is essentially unbeatable). The Dekel et al. baseline is there to show that the datasets satisfy the second, implicit requirement, and that the methods actually predict something non-trivial.
- choice of the Hamming loss: We would first like to point out that the Hamming loss is the usual evaluation metric in the context of the paper (multilabel classification, large number of classes). We believe that the (small) scale of the Hamming loss is not a problem in itself and does not invalidate it as an evaluation measure. The Hamming loss still provides a natural quantification of how much better we do than a trivial solution. In the multilabel case that we deal with, the trivial solution is to predict no label for every test example. Roughly speaking, the different methods perform between 18% and 36% better than this trivial solution, which is a very substantial improvement. Standard deviations (which, indeed, should and will be added to the result tables) follow the small scale of the Hamming loss: over our 10-run experiments, the standard deviation is about four orders of magnitude smaller than the loss itself, for all methods, and this includes the variability due to the label clustering, the hash functions, etc. So on the Industries dataset, for a code of size 200, the difference (in Hamming loss) between 0.217% (+/- 0.13e-04%) for RBF and 0.205% (+/- 0.29e-04%) for SBF is significant. Note that such low variances are in line with previous results (see e.g. [15]).
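To make the comparison to the trivial predictor concrete, here is a minimal sketch of the relative-improvement computation; the Hamming loss of the trivial predictor below is a made-up placeholder, not a number from the paper:

```python
def relative_improvement(method_loss, trivial_loss):
    """Fraction by which a method's Hamming loss improves on the trivial one."""
    return (trivial_loss - method_loss) / trivial_loss

trivial_loss = 0.0030  # hypothetical loss of always predicting "no label"
method_loss = 0.00217  # RBF on Industries, code size 200 (0.217%, quoted above)
print(f"~{100 * relative_improvement(method_loss, trivial_loss):.0f}% better than trivial")
```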
Assigned_Reviewer_7:
(We suppose the reviewer means [12] instead of [9].) The choice of the decoding algorithm for compressed sensing (CS) ([12]) was somewhat of an issue, because there is no clear winner in [12] and there was not enough space in the paper to add two or three more baselines to the tables. We chose Orthogonal Matching Pursuit (OMP) because it was the decoding algorithm used by [15] in their comparisons, and also because OMP seemed to perform well on the precision@k metric in [12] (while correlation decoding (CD) is the worst in terms of precision@k in three cases out of four in Figure 4 of [12]). On the other hand, CD works well in [12], as pointed out by the reviewer, but its good performance is only in terms of root mean squared error, which is not really reliable as a measure of classification accuracy.
We performed experiments with LASSO and CD, and the results are indeed better with OMP. While LASSO achieves performance similar to OMP in terms of Hamming loss, CD is significantly worse, and both methods have significantly lower micro/macro-F1. For instance, on Wikipedia1K with a code size of 500, OMP achieves a Hamming loss of 0.0875% and a micro-F1 of 58.5%, while LASSO has a Hamming loss of 0.0898% (micro-F1: 46.8%) and CD has a Hamming loss of 0.100% (micro-F1: 52.3%). We will add these concrete numbers to the text.
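For reference, here is a minimal sketch of the kind of CS-style decoding being compared (compress label vectors with a random projection, then recover a sparse label vector with OMP), using scikit-learn's OrthogonalMatchingPursuit; the projection matrix, sparsity level, and simulated predicted code are illustrative assumptions, not the exact setup of [12], [15], or our experiments:

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
n_labels, code_size, sparsity = 1000, 500, 5  # illustrative sizes only

# Random projection used to compress label vectors, as in CS-based reduction.
A = rng.standard_normal((code_size, n_labels)) / np.sqrt(code_size)

# Simulate the code predicted by the regressors for one test example.
y_true = np.zeros(n_labels)
y_true[rng.choice(n_labels, sparsity, replace=False)] = 1.0
predicted_code = A @ y_true + 0.01 * rng.standard_normal(code_size)

# OMP recovers a sparse label vector whose projection matches the predicted code.
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=sparsity, fit_intercept=False)
omp.fit(A, predicted_code)
recovered_labels = np.flatnonzero(omp.coef_ > 0.5)
print("recovered labels:", recovered_labels)
```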