Submitted by
Assigned_Reviewer_7
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
Paper 1048 proposes a system for large-scale
zero-shot visual recognition. It consists of the following steps:
(1) Learn an embedding of a large number of words in a Euclidean
space. (2) Learn a deep architecture which takes images as input and
predicts one of 1,000 object categories. The 1,000 categories are a
subset of the 'large number of words' of step (1). (3) Remove the last
layer of the visual model -- leaving what is referred to as the 'core'
visual model. Replace it with a new layer that maps the core visual model output into the word embedding space.
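For concreteness, my reading of the resulting pipeline is sketched below in Python; every dimension, name and the random initialization is an illustrative placeholder of mine, not a detail taken from the paper.

    import numpy as np

    # Illustrative sizes only; the paper's actual dimensions may differ.
    EMBED_DIM = 500      # dimensionality of the word embedding space
    CORE_DIM = 4096      # output size of the 'core' visual model
    VOCAB_SIZE = 100000  # number of words with learned embeddings

    # Step (1): embeddings for a large vocabulary of words, unit normed.
    word_vectors = np.random.randn(VOCAB_SIZE, EMBED_DIM)
    word_vectors /= np.linalg.norm(word_vectors, axis=1, keepdims=True)

    # Step (3): the 1,000-way softmax layer is removed and replaced by a
    # linear map M from the core visual representation into embedding space.
    M = 0.01 * np.random.randn(EMBED_DIM, CORE_DIM)

    def predict_labels(core_features, k=5):
        """Rank all vocabulary words by dot-product similarity to the
        mapped image representation and return the top-k indices."""
        v = M @ core_features         # map the image into embedding space
        scores = word_vectors @ v     # similarity to every label's embedding
        return np.argsort(-scores)[:k]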
On the positive side:
+ The paper is well written and reads easily.
+ The problem of large-scale zero-shot recognition is one of high practical and scientific value.
+ The experiments are comprehensive and well-designed.
+ The paper reports state-of-the-art results on a very large scale.
On the negative side:
- The technical contribution feels somewhat incremental. The paper heavily relies on pre-existing systems, see [12] for the word embedding or [11] for the visual model. The novelty seems to be in mapping a visual representation into a word embedding by adding an intermediate layer. However, this was proposed for instance in [15]. Of course, there are differences between paper 1048 and [15], as outlined in lines 81-85:
  * [15] reports results at a smaller scale.
  * [15] uses a quadratic loss for the mapping while paper 1048 uses a rank loss.
  * [15] does not use back-propagation to retrain the visual model.
But, again, these sound like incremental contributions.
- The design choices of the mapping layer seem ad hoc and are poorly justified.
  * It is stated that 'the embedding vectors learned by the language model are unit normed' (line 159). Is there any justification for such a normalization?
  * It is stated (lines 166-170) that 'training the model to minimize mean-squared difference ... produced poor results. We achieved much better results with a combination of dot-product similarity and hinge rank loss'. Is there any justification for this besides 'it works better'?
  * Similarly, several 'tricks' are proposed in lines 179-182 and no justification is provided. Why set the margin to 0.1, for instance? (A sketch of my reading of this objective is given below.)
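For reference, my reading of the objective quoted above is sketched below; the variable names are mine, and how the contrasting labels are chosen is not specified in the quoted lines.

    import numpy as np

    def hinge_rank_loss(v_image, t_correct, t_incorrect, margin=0.1):
        """Sum of margin violations: each incorrect label whose dot-product
        similarity to the mapped image comes within `margin` of the correct
        label's similarity contributes to the loss."""
        s_correct = float(np.dot(t_correct, v_image))
        loss = 0.0
        for t_j in t_incorrect:       # embeddings of incorrect labels
            loss += max(0.0, margin - s_correct + float(np.dot(t_j, v_image)))
        return loss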
Here are additional comments/questions:
- There are many missing references on zero-shot recognition.
One of the most relevant ones is the following: Palatucci,
Pomerleau, Hinton, Mitchell, 'Zero-shot learning with semantic output
codes', NIPS, 2009. Especially, see section 4.1: words are embedded in
a Euclidean space using a text corpus and then a mapping is learned
between inputs and word embeddings. While the embedding is certainly
much cruder than the one used in paper 1048, I believe this work is worth
mentioning.
- Do you have any results in the case where there is
no back-propagation into the core visual model? Quantifying the impact
of the back-propagation would be interesting.
Q2: Please summarize your review in 1-2
sentences
While paper 1048 presents impressive results for
large-scale zero-shot visual recognition, its technical contribution
is somewhat incremental as it looks like a combination of [11,12,15].
Submitted by
Assigned_Reviewer_9
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
Summary of paper: This computer vision paper uses an
unsupervised, neural-net-based semantic embedding of a Wikipedia text corpus, trained using skip-gram coding, to enhance the performance of the Krizhevsky et al. deep network [11] that won the 2012 ImageNet large-scale
visual recognition challenge, particularly for zero-shot learning problems
(i.e. previously unseen classes with some similarity to previously seen
ones). The two networks are trained separately, then the output layer of
[11] is replaced with a linear mapping to the semantic text representation
and re-trained on ImageNet 1k using a dot product loss reminiscent of a
structured output SVM one. The text representation is not currently
re-trained. The model is tested on ImageNet 1k and 21k. With the semantic
embedding output it does not quite manage to reproduce the ImageNet 1k
flat-class hit rates of the original softmax-output model, but it does
better than the original on hierarchical-class hit rates and on previously
unseen classes from ImageNet 21k. For unseen classes, the improvements are
modest in absolute terms (albeit somewhat larger in relative ones).
Review: I think this is an interesting paper that can be accepted.
The subject area (scaling to very large numbers of visual classes,
combining modalities, deep learning) is clearly topical. This is not the
first method to combine visual and textual information, but AFAIK it is
the first to tackle this problem at such a large scale (all 21k ImageNet
classes), especially using a fully unsupervised text model. The absolute
improvements in recognition rates are fairly modest, but this is a
challenging area where any improvement is welcome.
Further points
I have several questions for the authors:
- The skip-gram
textual model is quite weak in the sense that no prior semantics is built
in. It seems plausible that strengthening the semantic cues (e.g. with a
WordNet distance based regularizer) would improve the embedding. Keying on
Wikipedia structures might also help, e.g. a "blue shark" might have its
own page, with links back to the generic "shark" page.
- It would
be useful to quantify the improvements available by refining the textual
embedding during the joint training phase.
- How do adjectives /
visual attributes come into this? Many classes have descriptive names,
e.g. one might expect a "whitecap shark" to have some body part with a
white cap. It is not clear to me that the current embedding model exploits
such hints / such factorization in any usable way (e.g. as a "white" basis vector to be mixed into the "shark" dot product); a toy sketch of what I mean is given after these questions.
- What are the
flat precision scores for zero-shot DeVISE? - I ask because the
hierarchical metric necessarily confuses the issue, especially at such low
absolute precisions. E.g. if images of blue sharks are classified mainly
as sharks, but not particularly as blue ones, the hierarchical score could
still be quite high but one would hesitate to claim that a blue shark
classifier has been learned. I suspect that this is happening here, at
least to some extent.
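Regarding the adjective/attribute question above, the kind of check I have in mind is sketched here; the vectors and the mixing weight are purely hypothetical, and I do not know whether the learned embedding supports this.

    import numpy as np

    def cosine(a, b):
        """Cosine similarity between two vectors."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def composed_query(vec_shark, vec_white, alpha=0.5):
        """Mix a hypothetical 'white' component into the 'shark' vector and
        re-normalize, giving a query to score against mapped image vectors
        by dot product."""
        q = vec_shark + alpha * vec_white
        return q / np.linalg.norm(q)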
Note added after rebuttal:
Excuse me
for my garbled question about flat hit@k precision scores for zero shot
DeVISE. I realise that you have already given these. What I meant to ask
for was scores for a *hierarchical version* of the flat hit@k metric. Flat
hit@k is the familiar "best of k" metric -- only the best of the k guesses
counts towards the score. Your hierarchical precision@k is interesting but
very different: it essentially measures how well the full set of k guesses
reproduces the local categorical neighbourhood of the one true class. As
such it is less robust than a "best of k" metric: even if a method
invariably determines the true class with one of its k guesses, it will
score badly on hierarchical precision@k if its other k-1 guesses are
typically far from the true class. Given that you report the two metrics
face to face and don't fully explain hierarchical precision@k in the
paper, unwary readers are likely to think that hierarchical precision@k is
a hierarchical "best of k" metric, and hence be misled by the results.
Also, to get a handle on the types of errors that are being made, it would
be convenient to report at least some results in a metric that supports a
"hops from ground truth" score decompositon. E.g. with the hierarchical
best of k metric that counts how many hops from the true class the best
[*] of the k guesses is, you can report accuracies as (say) 10% hits at
zero hops (the flat hit@k score) plus 5% hits at one hop plus ... If
desired this can be refined by breaking hop ties, e.g. counting children
as closer than parents you might get "plus 3% hits to a child class plus
2% hits to the parent class plus...".
[*] Only one of the k
guesses ever counts. Ties are broken arbitrarily.
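To make the proposed decomposition concrete, a minimal sketch of the bookkeeping I have in mind is given below; hop_distance is an assumed helper returning the number of hops between two classes in the label hierarchy.

    from collections import Counter

    def hop_decomposition(all_guesses, ground_truth, hop_distance, k=5):
        """Tally test examples by the number of hierarchy hops between the
        true class and the closest of the top-k guesses; the share at zero
        hops recovers the familiar flat hit@k score."""
        tally = Counter()
        for guesses, true_class in zip(all_guesses, ground_truth):
            best = min(hop_distance(g, true_class) for g in guesses[:k])
            tally[best] += 1
        total = len(ground_truth)
        return {hops: count / total for hops, count in sorted(tally.items())}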
Further points:
- Please use a uniform reporting convention for all performance
score tables -- convert either Table 3 to percentages, or Table 2 to
probabilities.
Q2: Please summarize your
review in 1-2 sentences
A decent paper on using text based semantic embedding
to improve the performance of deep network classifiers for large numbers
of visual classes, especially previously unseen ones. The actual
experimental improvements are modest but the model is interesting and it
is ambitious and topical to tackle the scaling to all 21k ImageNet
classes.
Q1: Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We’d like to thank all reviewers for their comments.
Our responses to their concerns are below.
- Several reviewers felt
the paper is somewhat incremental, based on contributions [11], [12], and
[15]. While it's true that our model is based on pre-existing building
blocks, its integration is novel and achieves state-of-the-art performance
on zero-shot prediction, while maintaining state-of-the-art performance on
known classes, with semantically meaningful predictions in both cases.
Furthermore, the proposed framework does not rely on any given ontology
and would thus scale very well to much bigger corpora.
At a very
high level, our work is algorithmically similar to [15]. The results
however are very different. The approach in [15] works poorly on 8 classes
(as evidenced by their use of a separate model to handle the known
classes), and just satisfactorily on 2 novel classes. Our approach gives
state-of-the-art performance on 1000 native classes with no side model,
and generalizes well to 20,000 novel classes -- a scale of zero-shot
learning two orders of magnitude larger than has even been attempted in
the literature, and 1000x larger than [15]. Moreover, we believe that [15]
optimized the wrong objective function (see below).
R7:
- Some design choices are poorly justified. In particular:
  + Why not mean squared error? We will make this point more clearly in the revision. It’s not just that a ranking loss works better in practice, it’s that mean-squared difference does not capture the problem statement: the ultimate use of the system will be to emit an ordered list of predictions. Fundamentally this is a ranking problem and is best solved using a ranking loss. A mean-squared difference loss tries only to make the response close to the correct label, neglecting whether there are incorrect labels that are closer still. We tried mean-squared loss, and it halved the classification accuracy.
  + Why are the embedding vectors unit normed and why do we use a margin of 0.1? Reference [12], which introduced the skip-gram model for text, used cosine distance between word vectors as their measure of semantic similarity (the norm of the vector being related to word frequency in the text corpus). Following their logic, we unit normed the vectors. The margin was chosen to be a fraction of the norm (1), and a wide range of values would likely work well.
- Missing
references on zero-shot recognition, including Palatucci et al, NIPS 2009.
We will include this reference in the revision.
- Results with
no back-propagation into the core visual model? Back-propagation into
the visual model provides a small improvement, typically in the range of
1-3% absolute over the model without back-propagation.
R8:
- The proposed model does not show any improvement over the
baseline for the flat metric, only for the semantic metric.
Classification accuracy is one important measure in this domain, and
as the reviewer points out, our model neither improves upon nor loses
ground on this measure. But failing gracefully is a critical property for
real-world systems. An artificial visual system that misidentifies a telephoto lens as an eyepiece of some sort is strictly more intelligent than one that thinks it’s a typewriter (real examples from the paper). The precision@k shows just this, and we believe it is at least as relevant to real-world use cases as the flat classification accuracy measure.
- In zero-shot learning experiments, the proposed method is not
compared with the state-of-the-art [A]. We should have cited [A] and
compared our results to theirs; in the revision we will do both. In the
short time allowed for rebuttal, we were not able to obtain and retrain on
the exact train/test split used by [A], but we will do so in time for the
revision. As a proxy, we did a zero-shot experiment using the model
trained on ImageNet 1K and tested using 250 classes randomly chosen from
our 3-hop zero-shot data set, which maintains the same 80/20 split used by
[A]. We ran two random selections of the 250 classes. For 250-way
classification, our hit-at-5 accuracy is 31.5-35.7%, which matches the
performance in [A]. For 1250-way classification, our hit-at-5 accuracy is
8.6-10.4%, which compares very favorably to the ~3% accuracy reported on
the 1000-way classification graph in Fig 2 of [A]. Note that [A] requires
a hand-engineered hierarchy to perform zero-shot, whereas our method needs
only an unlabeled text corpus. Also note that our model performs better on
the non-zero-shot flat metric (Table 1 in [A]).
- The
implementation details are scarce and not sufficient to reproduce the
results. We thank the reviewers for pointing out a few details and
hyperparameters that were missed in the paper. The final version will
contain all the details needed to reproduce the results, with proper
justification.
- In Supp. Material the validation subset of
ImageNet 21K is used for zero-shot experiments. How exactly is it used?
The text in the Appendix was incorrect: there is no validation set from ImageNet 21K used in zero-shot learning; ImageNet 21K is only used for testing in that scenario. The mistake will be fixed in the final
revision.
R9:
- The skip-gram textual model is quite weak.
We agree with the reviewer that the model is fairly weak, but chose to use
it because we were impressed with how much semantic information it gleaned
in unsupervised training. Adding structure from WordNet might give gains
in the ImageNet challenge, but is less scalable and maintainable than
learning the semantics directly from text. Human curated knowledge
representations are costly to scale, to maintain, and to keep current. For
example, WordNet doesn’t contain the term “iPhone” whereas our model
correctly learns that the most semantically related terms are “iPad”,
“iOS” and “smartphone”.
- What are the flat precision scores for
zero-shot DeVISE? They are given in Table 2.