Reviews: Exact Rate-Distortion in Autoencoders via Echo Noise

[Update after taking into account the author feedback: thanks for preparing the author feedback and addressing the issues raised by the reviewers. Given the clarifications and additional results, I am raising my score to 7, and I vote and argue for accepting the paper.]

The paper proposes a novel stochastic encoder channel, Echo Noise, that allows for an analytically tractable expression of the latent marginal distribution without imposing strong assumptions on the latent-space distribution, in contrast to, e.g., the commonly used independent Gaussian assumption. Importantly, the Echo Noise channel admits a simple analytical expression for the mutual information (between the input and the encoded/latent variable) that can be used as a regularizer in a mini-batch-based gradient optimisation scheme. Applying Echo Noise in the context of auto-encoding leads to several theoretically appealing properties. Most importantly, it can be shown that Echo Noise automatically ensures that the latent-space prior matches the marginal, which is the “optimal” prior; this allows for a simple and tractable implementation of the (exact) rate-distortion objective instead of a “looser” general-prior ELBO. Related literature is discussed in a concise and well-written manner, and results on simple image auto-encoding benchmarks are reported and compared against a number of state-of-the-art methods.

In the recent literature, the connection between lossy compression (i.e. rate-distortion) and representation learning has been pointed out several times, in particular w.r.t. interpreting the evidence lower bound (ELBO) as a degenerate, or suboptimal, version of the rate-distortion objective that can be improved upon by using the “optimal” prior. Unfortunately, assumptions about the functional form of the prior and/or latent-space distribution, commonly made for reasons of tractability, have often prevented a straightforward implementation of the rate-distortion objective (as a tighter ELBO).
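The remark about the ELBO being a looser version of the rate-distortion objective can be made concrete with the standard decomposition of the ELBO's rate term. The notation below is an assumption for illustration and is not taken from the paper: $q(z|x)$ is the stochastic encoder, $p_{\mathrm{data}}(x)$ the data distribution, $q(z) = \mathbb{E}_{p_{\mathrm{data}}(x)}[q(z|x)]$ the latent marginal, and $p(z)$ an arbitrary prior.

```latex
\begin{align}
\mathbb{E}_{p_{\mathrm{data}}(x)}\!\left[ D_{\mathrm{KL}}\big(q(z|x)\,\|\,p(z)\big) \right]
  &= \mathbb{E}_{p_{\mathrm{data}}(x)\,q(z|x)}\!\left[ \log\frac{q(z|x)}{q(z)} + \log\frac{q(z)}{p(z)} \right] \\
  &= I_q(X;Z) + D_{\mathrm{KL}}\big(q(z)\,\|\,p(z)\big) \\
  &\ge I_q(X;Z).
\end{align}
```

Since the second KL term is non-negative, the rate paid under a general prior upper-bounds the mutual information, with equality exactly when $p(z) = q(z)$. This is the sense in which the marginal is the “optimal” prior, and why a method whose prior matches the marginal by construction can optimise the rate-distortion objective directly.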
This paper presents an interesting approach to tackle this problem, with an iterative stochastic encoder that is simple to implement and does not seem to add substantial computational overhead. This is an important step towards better objective functions and a more solid theoretical basis for representation learning, not only in the self-supervised but potentially also in the supervised setting (cf. the Information Bottleneck). I personally find the topic very timely, and the proposed approach both novel and original. Echo Noise is nicely introduced, and its appealing theoretical properties are analyzed and presented in a solid fashion (given access to the appendix). A big plus is that the authors also analyzed the convergence behavior of, and bounds for, approximating the infinite-series construction of Echo Noise via a finite-iteration procedure.

Unfortunately, I found the experimental section a bit underwhelming and thin. While I appreciate the comparison with many other state-of-the-art methods on some standard auto-encoding benchmarks, I would have been very excited to see a thorough analysis of the learned representations, e.g. some exploration of the latent-space distribution, which should now be of a much less restricted form than in many competitor methods. It would have been great to see, e.g., the structure and topology of the latent space (under different rate limits), some interpolations or extrapolations, disentanglement, some experiments in the supervised setting, or at least some secondary tasks to assess the quality of the learned representations (after all, the paper’s aim is representation learning). I personally do not find comparing negative log-likelihood scores very insightful when talking about representation learning (the log-likelihood/reconstruction error says a lot more about the generative model than about the quality of the learned representation); however, I do recognize that this is commonly used as a proxy in many publications.
But particularly in the low-rate regime (where representation learning is especially interesting), the structure of the latent space and the potential transferability of learned representations become much more interesting than log-likelihood scores or blurry generated samples (perhaps more emphasis on the multimodal nature of the generated samples in the low-rate regime would help a bit). I think that this paper had all the right ingredients in place to show some exciting results, but it ends up with a somewhat underwhelming experimental section. As much as I would like to give the first four sections of the paper a very high score, the current experimental section brings my overall score down to 6, but I’m happy to be convinced otherwise during the rebuttal or by the other reviewers.

Originality: High for the Echo Noise construction, low for the experimental section.

Quality: Very well-written paper, with a high-quality literature overview and discussion of related methods, and some nice theoretical results, connections and analysis in the appendix.

Clarity: The writing is clear, and the mathematics supports the main arguments and is certainly not used to obfuscate quality or novelty. Unfortunately, the appendix is missing from the upload; since I had already seen the preprint of the paper before reviewing (which is admittedly not ideal), I referred to the appendix in the preprint version.

Significance: Potential for high impact, mostly due to alleviating very limiting requirements on the latent-space distribution/prior and to taking the rate-distortion viewpoint on representation learning seriously. Unfortunately, the experiments do not show many of the promising advantages (at least not very directly), which leaves the current significance more questionable, perhaps at a medium level.

Paper ID:	2137
Title:	Exact Rate-Distortion in Autoencoders via Echo Noise
