Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
[Update after taking into account author feedback: thanks for preparing the author feedback and addressing the issues raised by the reviewers. Given the clarifications and additional results, I am raising my score to 7 and vote and argue for accepting the paper.]
The paper proposes a novel stochastic encoder channel, Echo Noise, that allows for an analytically tractable expression of the latent marginal distribution without imposing strong assumptions about the latent-space distribution, in contrast to, e.g., the commonly used independent Gaussian assumption. Importantly, the Echo Noise channel allows for a simple analytical expression of the mutual information (between input and encoded/latent variable) that can be used as a regularizer in a mini-batch-based gradient optimisation scheme. Applying Echo Noise in the context of auto-encoding leads to several theoretically appealing properties. Most importantly, it can be shown that Echo Noise automatically ensures that the latent-space prior matches the marginal, which is the “optimal” prior and allows for a simple and tractable implementation of the (exact) rate-distortion objective instead of a “looser” general-prior ELBO. Related literature is discussed in a concise and well-written manner, and some results on simple image auto-encoding benchmarks are reported and compared against a number of state-of-the-art methods. In the recent literature, the connection between lossy compression (i.e. rate-distortion) and representation learning has been pointed out several times, in particular w.r.t. interpreting the evidence lower bound as a degenerate, or suboptimal, version of the rate-distortion objective that can be improved upon by using the “optimal” prior. Unfortunately, assumptions about the functional form of the prior and/or latent-space distribution, commonly made for reasons of tractability, have often prevented a straightforward implementation of the rate-distortion objective (as a tighter ELBO).
This paper presents an interesting approach to tackle this problem, with an iterative stochastic encoder that is simple to implement and does not seem to add substantial computational overhead. This is an important step towards better objective functions and a more solid theoretical basis for representation learning, not only in the self-supervised but potentially also in the supervised setting (cf. the Information Bottleneck). I personally find the topic very timely, and the proposed approach both novel and original. Echo Noise is nicely introduced and its appealing theoretical properties are analyzed and presented in a solid fashion (given access to the appendix). A big plus is also that the authors analyzed convergence behavior and bounds for approximating the infinite-series construction of Echo Noise via a finite-iteration procedure. Unfortunately, I found the experimental section a bit underwhelming and thin. While I appreciate the comparison with many other state-of-the-art methods on some standard auto-encoding benchmarks, I would have been very excited to see some thorough analysis of the learned representations, e.g. some exploration of the latent-space distribution, which should now be of a much less restricted form than in many competitor methods. It would have been great to see, e.g., the structure and topology of the latent space (under different rate limits), some inter- or extrapolations, disentanglement, some experiments in the supervised setting, or at least some secondary tasks to assess the quality of the learned representations (after all, the paper’s aim is representation learning). I personally do not find comparing negative log-likelihood scores very insightful when talking about representation learning (the log-likelihood/reconstruction error says a lot more about the generative model than about the quality of the learned representation); however, I do recognize that this is commonly used as a proxy in many publications.
But particularly in the low-rate regime (where representation learning is particularly interesting), the structure of the latent space and the potential transferability of learned representations become much more interesting than log-likelihood scores or blurry generated samples (perhaps more emphasis on the multimodal nature of the generated samples in the low-rate regime would help a bit). I think that this paper had all the right ingredients in place to show some exciting results, but ends up with a somewhat underwhelming experimental section. As much as I would like to give the first four sections of the paper a very high score, the current experimental section brings my overall score down to 6, but I’m happy to be convinced otherwise during the rebuttal or by the other reviewers.
Originality: High for the Echo Noise construction, low for the experimental section.
Quality: Very well-written paper, high-quality literature overview and discussion of related methods, some nice theoretical results, connections and analysis in the appendix.
Clarity: Writing is clear; the mathematics supports the main arguments and is certainly not used to obfuscate quality or novelty. Unfortunately the appendix is missing in the upload. Since I had already seen the preprint of the paper before reviewing (which is admittedly not ideal), I referred to the appendix in the preprint version.
Significance: Potential for high impact, mostly due to alleviating very limiting requirements on the latent-space distribution/prior and by taking the rate-distortion viewpoint on representation learning seriously. Unfortunately the experiments do not show much of the promising advantages (at least not very directly), which leaves the current significance a bit more questionable and perhaps at a medium level.
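For readers unfamiliar with the finite-iteration approximation mentioned in this review, the following is a minimal sketch, not the authors' implementation. It assumes the channel has the nested form z = f(x) + S(x)(f(x') + S(x')(f(x'') + ...)) with diagonal S(x) whose entries have magnitude below 1, and that the echoes x', x'', ... are drawn with replacement from the same mini-batch. The name `echo_sample` and its arguments are hypothetical.

```python
import random

def echo_sample(f, s, n_terms, rng=None):
    """Truncated echo-noise sample for every row of a mini-batch.

    f, s    : lists of rows (lists of floats); hypothetical encoder outputs
              f(x) and the diagonal of S(x), one row per input.
    n_terms : number of echo terms kept from the infinite series.
    """
    rng = rng or random.Random(0)
    batch, dim = len(f), len(f[0])
    z = []
    for i in range(batch):
        # Assumption: echoes are drawn with replacement from the same batch.
        picks = [rng.randrange(batch) for _ in range(n_terms)]
        eps = [0.0] * dim
        # Build the nested sum from the innermost kept term outwards.
        for j in reversed(picks):
            eps = [f[j][k] + s[j][k] * eps[k] for k in range(dim)]
        # Outermost step: the sample's own mean plus its scaled noise.
        z.append([f[i][k] + s[i][k] * eps[k] for k in range(dim)])
    return z
```

With constant f = 1 and s = 0.5, keeping three echo terms gives the partial geometric sum 1 + 0.5 + 0.25 + 0.125 in every coordinate, which illustrates why the truncation error shrinks with the product of the scales.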
This is a really wonderful paper: it moves away from perhaps naive uses of Gaussian distributions in VAEs to the so-called echo distribution, which has both theoretical and practical benefits. The paper also solidifies the rate-distortion view of VAEs and yields distributions with closed-form mutual information expressions. The topical area of VAEs is certainly important, and so this is a great contribution for the community. The analytical idea for constructing echo noise is also quite clever. The paper is generally well written and the experiments are also well performed. As far as I can tell, this work is quite novel.
Personally, I'm not an expert in the topic covered in this paper, so my review should probably be downweighted. Having said that, the theoretical contributions seem weak. The conditions in Lemma 2.3 are not discussed, so it is unclear how the echo noise mechanism in (3) would be implemented in practice. Furthermore, there is no discussion of why we can restrict S(x) to be diagonal. One of the claims the authors make is that prior assumptions on Gaussianity allow for analytical tractability. By assuming that S(x) is diagonal, aren't the authors also making assumptions for analytical tractability to ensure that the mutual information can be computed in closed form? The relation to rate-distortion theory in classical information theory is also very weak. In particular, where is the distortion measure? The claim that Echo has "approximately the same wall time" as VAEs should be substantiated with evidence. Currently, this claim is weak. In general, this paper is not very convincing.
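To make the diagonal-S(x) question above concrete, here is a small sketch of what the closed-form rate would look like in that case. It assumes the mutual information takes the form I(X;Z) = -E_x[log |det S(x)|], which for a diagonal S(x) reduces to a sum of log scales. This form is an assumption about the paper's result rather than something stated in these reviews, and the function name `echo_rate` is hypothetical.

```python
import math

def echo_rate(s):
    """Mini-batch estimate of -E_x[ sum_j log|s_j(x)| ], in nats.

    s : list of rows (lists of floats), the assumed diagonal of S(x)
        for each input in the batch; entries must be nonzero.
    """
    per_example = [-sum(math.log(abs(v)) for v in row) for row in s]
    return sum(per_example) / len(per_example)
```

Under this assumed form, a four-dimensional latent with every scale equal to 0.5 has a rate of 4 log 2 nats, and scales closer to 1 in magnitude correspond to lower rates.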