Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Update: I read the author responses and the other reviews. Overall I saw nothing that substantiated changing my score: reproducibility / accessibility is an important component of impact to the community. I acknowledge the importance to representation learning, which was one reason for my original score.

------

Overall the paper is very well written and technically correct. It was enjoyable to read. However, the contributions are more or less incremental, and to me this work represents a technical report / demo on what happens if you scale up the architecture of BiGAN. It is incremental in that it takes well-established ideas (BiGAN + BigGAN) and combines them in a straightforward way that would only be available to those with the most compute.

Making models much, much bigger has become popular lately. BERT showed that a very large transformer trained on a masked language modeling objective achieves SOTA results on important NLP benchmarks (model / approach novelty and SOTA). BigGAN showed that very large GANs can reliably generate high-resolution images (high impact, as this was the first instance of full-ImageNet generation with GANs). GPT-2 showed that large transformer language models trained on lots of data can generate uncannily realistic paragraphs (high impact, as this was the first instance of language generation of this sort at this scale). All of these works were important because they challenged expectations of what is possible as we scale neural networks up.

And now there's BigBiGAN, which is mostly just what the title describes: Big + BiGAN (or BigGAN + BiGAN, if you will). BiGAN is important, I suppose, as it demonstrates a particular paradigm of adversarial learning, but its significance isn't really well established in this work beyond "good linear probe classification performance". I would say that the modifications to the discriminator are useful for training BiGANs and worth exploring further.
But it's not entirely clear that just making BiGAN big is an important step toward answering some fundamental question in ML beyond "what if we make it big?". That question, posed this way, is of little use to anyone outside the handful of research groups that can actually train these models. As a tech demo, this paper is excellent. However, I can't imagine it having much significance in the community other than hitting a benchmark, as most researchers cannot reproduce these results. So the reproducibility score on this paper is naturally low (though the techniques are described very clearly). Why not provide a more accessible comparison across some of the models at smaller scale (i.e., reimplement some of those models and train them in comparable settings on smaller datasets)?

Everything above sounds very negative, but overall the paper is very good and I think it deserves to be accepted. It would also help if the representation learning / self-supervised works mentioned and compared against weren't just from DeepMind / Google, as there are many other relevant works out there: e.g., Noise as Targets (FAIR), Deep InfoMax (MSR), and Invariant Information Clustering (Oxford), to name a few.

One question: how do the generative results compare to VQ-VAE-2? Do you think VQ-VAE-2 outperforms yours in classification?

------

One last thing, which *did not* affect my score: the claims of SOTA are already dated, though I understand that the authors did not have access to these works at submission. I would expect the message to be adjusted and these works to be at least mentioned in the camera ready, if this paper gets accepted, so as not to mislead (which is important):

Big CPC: "Data-Efficient Image Recognition with Contrastive Predictive Coding"
CMC: https://arxiv.org/abs/1906.05849
(both of which get scores close to yours without a generator)
AMDIM: "Learning Representations by Maximizing Mutual Information Across Views", which gets much better results than yours (68.1% on ImageNet)
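As an aside, for readers unfamiliar with the "linear probe" protocol used to score these representations: the pretrained encoder is frozen, and only a linear classifier is trained on its output features. A minimal sketch of the idea (all names and data here are illustrative stand-ins, not the authors' code; a fixed random projection stands in for a trained encoder):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen, pretrained encoder E(x): a fixed random ReLU projection.
# In the real protocol this would be the trained BigBiGAN encoder, never updated here.
W_frozen = rng.normal(size=(10, 32)) / np.sqrt(10)

def encode(x):
    return np.maximum(x @ W_frozen, 0.0)  # features only; no gradients flow back

# Toy labeled data for the probe (binary task for simplicity).
X = rng.normal(size=(500, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
feats = encode(X)

# Train only a linear classifier (logistic regression) on the frozen features.
w = np.zeros(feats.shape[1])
b = 0.0
lr = 0.3
for _ in range(1000):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))  # sigmoid predictions
    w -= lr * feats.T @ (p - y) / len(y)
    b -= lr * np.mean(p - y)

# Probe accuracy: the quality of the frozen features is what determines it,
# since the classifier on top is deliberately weak (linear).
acc = np.mean(((feats @ w + b) > 0) == (y > 0.5))
```

The reported "linear probe classification performance" is this `acc`, computed at full ImageNet scale on held-out data; it is a proxy for how linearly separable the semantic classes are in the learned feature space.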
Originality: This work is a novel combination of two well-known models, namely BiGAN and BigGAN. The main novelty lies in demonstrating the effectiveness of the additional unary terms for x and z that were not studied in BiGAN. The related work is well-cited.

Quality: The submission is technically sound. The claims are well-supported by extensive and well-considered empirical analysis. There are no theoretical contributions. The authors are careful and honest in their evaluation.

Clarity: The writing is very clear. Pedagogically, it would be helpful to readers who are not already familiar with BiGAN to read an intuitive explanation of how the forms of L_EG and L_D implement their cross-purposes.

Significance: BigBiGAN is the natural pushout of the BiGAN <- GAN -> BigGAN triad, so this work is more inevitable than it is surprising. However, it remains a significant contribution: the investigation is exemplary in its clarity and thoroughness, the methodology is advanced by the unary terms, and the performance is SOTA with respect to standard benchmarks in the community. The semantic learning apparent in the images, hypothesized to result from an implicit form of reconstruction error, is especially exciting as an area of future research.
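To make the clarity suggestion concrete: as I understand the paper (notation approximate, my paraphrase rather than the authors' exact equations), the discriminator produces unary scores s_x, s_z and a joint score s_{xz}, and the two losses push the same three scores in opposite directions:

```latex
% Hinge h(t) = max(0, 1 - t); label y = +1 for encoder pairs (x, E(x)),
% y = -1 for generator pairs (G(z), z). The discriminator minimizes:
\ell_D(x, z, y) = h\!\big(y\, s_x(x)\big) + h\!\big(y\, s_z(z)\big)
                + h\!\big(y\, s_{xz}(x, z)\big)

% The encoder/generator loss uses the raw (unclipped) scores with opposite
% signs, so E and G gain exactly where D loses, term by term:
\mathcal{L}_{EG} = \mathbb{E}_{x \sim P_x}\!\big[ s_x(x) + s_z(E(x)) + s_{xz}(x, E(x)) \big]
                 - \mathbb{E}_{z \sim P_z}\!\big[ s_x(G(z)) + s_z(z) + s_{xz}(G(z), z) \big]
```

That is, D tries to score encoder pairs positive and generator pairs negative, while E and G minimize L_EG, driving encoder-pair scores down and generator-pair scores up; the unary s_x and s_z terms are what this work adds on top of the joint BiGAN term.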
Though there isn't much new technical content in this paper, I think it provides impressive empirical motivation for continuing to explore generative models as mechanisms for representation learning. The authors demonstrate the power of ALI/BiGAN as a technique for representation learning by reimplementing and re-tuning the core idea using the biggest, strongest current GAN model (i.e., BigGAN). The result is a model which matches the representation learning performance of concurrent self-supervised methods, while also improving on the prior state-of-the-art for unsupervised GAN-type image generation (by beating BigGAN). The empirical results are impressive. My main concern is that the work is generally unreproducible due to massive computation costs. Nonetheless, I think the work is worth publishing, since it will encourage people to keep developing generative approaches to representation learning, some of which will hopefully be more efficient and practical to work with. The paper was clear and easy to read, which is a plus.

Post Rebuttal: I found the authors' response to my statements about model size a bit misleading. I think their main result is the "state-of-the-art on ImageNet" result, and the encoder they use for this result is the RevNet x16 model from the CVPR 2019 "Revisiting Self-Supervised Visual Representation Learning" paper. That model is massive, with perhaps 10x the compute cost of the ResNet-50 to which they refer in the rebuttal. Additionally, training the encoder requires training the rest of BigGAN too, which isn't exactly fast or cheap.