NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 36
Title: Zero-shot Learning via Simultaneous Generating and Learning

Reviewer 1

Overview: The authors propose an original approach to zero-shot learning by combining VAEs with EM for inferring the optimal unseen examples. The key idea is to simultaneously generate examples of unseen classes and learn from them. The authors run a number of experiments demonstrating that the proposed method shows competitive performance on a number of ZSL tasks.

Quality: The work is generally of high quality. The experiments are clearly described, and the model specifications are detailed. What limits its quality, however, is the lack of performance variability estimates. In general, the results are not entirely consistent across metrics, so adding variability measures seems warranted. I also feel that the paper may greatly benefit from a few toy simulation studies in low-dimensional settings, such that the structure of the learned classes can be easily visualized.

Clarity: The paper is generally well written and is a pleasure to read. It would help to clarify how the class embeddings are constructed (even if they are given in the current setting). There are also a number of minor typos: 234: "outperforms than"; 277: "attain(s)".

Originality: The proposed method is certainly novel, and the contribution is original enough.

Conclusion: Overall, the paper proposes a novel approach to zero-shot learning. In my view, it is a borderline submission. While there are considerable limitations, I think this paper is slightly above the acceptance threshold.

## After reading the authors' response

The authors partially addressed some of my concerns, providing variability measures for some of the results and an additional low-dimensional visualization. I believe that, to further elucidate the behavior of the algorithm, it would still be beneficial to actually train on an artificial low-dimensional dataset, as opposed to using a t-SNE visualization of a natural dataset.
At the same time, I partially agree with some of the concerns voiced by the second reviewer: more experimentation would help to understand the contributions of the different parts of the model. This is partly why I think low-dimensional simulations would be helpful. Overall, I still consider this a borderline submission, slightly above the threshold, and I keep my score unchanged.
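To make the simulation suggestion above concrete, here is a minimal sketch (not from the paper; all names, dimensions, and parameters are illustrative assumptions) of the kind of artificial low-dimensional dataset I have in mind: class means are a linear function of attribute vectors, so unseen-class structure is in principle recoverable from the attributes and can be plotted directly in 2D.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_toy_zsl(n_classes=6, n_seen=4, n_per_class=50, attr_dim=3, dim=2, noise=0.3):
    """Synthetic low-dimensional ZSL data: each class mean is a linear
    map of that class's attribute vector, perturbed by Gaussian noise."""
    attrs = rng.normal(size=(n_classes, attr_dim))   # per-class attribute vectors
    W = rng.normal(size=(attr_dim, dim))             # attribute -> class-mean map
    means = attrs @ W
    X = np.vstack([m + noise * rng.normal(size=(n_per_class, dim)) for m in means])
    y = np.repeat(np.arange(n_classes), n_per_class)
    seen = y < n_seen                                # first n_seen classes are "seen"
    return X, y, attrs, seen

X, y, attrs, seen = make_toy_zsl()
```

Since `dim=2`, the learned class structure could be visualized with a plain scatter plot, with no t-SNE projection needed.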

Reviewer 2

Comments
--------

Notation: it would be great to simplify the notation; in particular, the asterisk was hard for me to read. Here is one suggestion; please consider adapting it: {x_i, y_i} as the labeled set with y_i \in Y, and {u_i, z_i} as the unlabeled set with z_i \in Z, where Y \cap Z = \emptyset.

Motivation for using a VAE: "In order to capture the complex distribution, VAE can be a useful tool." seems a bit weak. There are a few concrete choices here: a VAE vs. specifying a parametric generative distribution vs. a GAN. Given the experiments on image datasets, why a VAE?

A key difference from prior work is the reliance on the number of unseen classes and their attributes being known a priori, and I wonder how realistic this assumption is. Most other work relies on a new attribute (a.k.a. class) being provided at inference time, whereas this work cannot handle that scenario.

A number of key ablations seem to be missing:
(a) How much is the EM actually helping?
(b) How much is the *multi-modal* prior actually helping?
(c) How large are the models; are the parameter counts comparable to other work?
(d) If the number of EM steps is just 1, this is similar to some of the other cited work, specifically when the parameters are fully optimized on the seen classes and the unseen classes' attributes are used to generate examples on which a deep-NN classifier is trained. How much of the benefit comes from having a more expressive deep NN vs. a simpler baseline like an SVM? What if we made some prior work (there are many; to cite one from the paper, 'Generalized Zero-Shot Learning via Synthesized Examples') use a deep NN rather than the SVM used by the cited work?
(e) How much could this work be improved by using a discriminative deep NN instead of a generative one?
Main concern [Originality & Significance]
----------------------------------------

Significance: Most of this work seemed to me like "yes, I think if you do that, performance will likely improve compared to other work"; I found the work to be more of a bag of tricks than a key insight. While this is not in itself a bad thing, and there can be good work that improves the SOTA on well-established benchmarks, this particular work worries me in that it tends to improve end-to-end scores without providing key insights.

Originality: AFAIU, the new insights are the use of EM and of the multi-modal prior, both of which are relatively mild. Focusing on ablations would perhaps help the authors identify the key contributions and refine their work.

After rebuttal
------------------

Thank you for the responses and the references; they were helpful. They have (partially) addressed my questions/concerns. Overall, I think there is still room to perform clearer ablations to study the effect of each component. For example, one proposal/contribution here is the generation of samples for unseen classes; for a practitioner, it would be great to have a clear set of experiments proving its utility (as a thought, the generated samples could be used to train a discriminative classifier rather than continuing with a generative one).
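As a concrete illustration of the thought above (a hypothetical sketch, not the authors' method: the Gaussian `generate` function stands in for the learned conditional VAE, and all names and parameters are my own assumptions), one could sample pseudo-examples for the unseen classes and fit a plain discriminative softmax classifier on them:

```python
import numpy as np

rng = np.random.default_rng(1)

def generate(attrs, P, n=100, noise=0.2):
    """Stand-in generator: in the paper this would be the conditional VAE.
    Here, each unseen class is a Gaussian around a projection of its attributes."""
    means = attrs @ P
    X = np.vstack([m + noise * rng.normal(size=(n, means.shape[1])) for m in means])
    y = np.repeat(np.arange(len(attrs)), n)
    return X, y

def train_softmax(X, y, n_classes, lr=0.5, steps=300):
    """Plain softmax regression trained by gradient descent: the kind of
    discriminative classifier that could consume the generated samples."""
    Xb = np.hstack([X, np.ones((len(X), 1))])        # add bias column
    W = np.zeros((Xb.shape[1], n_classes))
    Y = np.eye(n_classes)[y]                          # one-hot targets
    for _ in range(steps):
        logits = Xb @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * Xb.T @ (p - Y) / len(X)             # cross-entropy gradient step
    return W

attrs = rng.normal(size=(3, 4))      # attribute vectors of 3 "unseen" classes
P = rng.normal(size=(4, 2))          # attribute -> feature-space map
Xg, yg = generate(attrs, P)          # pseudo-examples from the generator
W = train_softmax(Xg, yg, n_classes=3)

Xt, yt = generate(attrs, P)          # fresh generated samples as a sanity check
pred = np.argmax(np.hstack([Xt, np.ones((len(Xt), 1))]) @ W, axis=1)
acc = (pred == yt).mean()
```

Such an experiment would isolate how much of the final accuracy comes from the quality of the generated samples versus the choice of generative vs. discriminative classifier.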

Reviewer 3

The article is indicative of solid work, and it is well written. The state of the art is covered well. The evaluation does not elaborate on the computational complexity and memory requirements of the proposed algorithm compared to state-of-the-art systems. It is not fully clear how the Simultaneously Generating And Learning (SGAL) strategy generates the missing data points using the class embeddings and the model itself; elaborating on this mechanism would help the reader. The multi-modal prior plays a crucial role in providing information to support the learning of the unseen classes. Please report the number of iterations that the algorithm requires for the benchmark datasets.

Reviewer comment to author response: the authors have provided more details on the above points and addressed the mentioned presentation-related issues.