__ Summary and Contributions__: The authors observe that VAEs do not correctly encode samples generated by the decoder, and therefore iteratively encoding and decoding images leads to diverging behavior. This is related to adversarial robustness. This paper proposes a new method for training VAEs that does not explicitly target robustness, and yet yields better robustness on classification tasks in ColorMNIST and CelebA.

__ Strengths__: This work seems to be theoretically well grounded, with rather convincing experiments to back up the claims. As far as I know (I do not follow the adversarial robustness literature closely), this work is novel. The problem explored in this paper is interesting to the community and relevant for NeurIPS.

__ Weaknesses__: One of the underlying points is that VAEs don't have the autoencoding property, which is deemed important for robustness in representation learning. I believe this should be made clearer, as it is currently somewhat obscured in Section 2.
More generally, the theory part should include a higher-level overview of the main points and propositions. As it stands, it might be hard to follow, especially because the notation deviates from the current standard notation in the VAE literature.
It would be nice to include a discussion of the chosen architecture, as results might depend strongly on it. For example, the role of batch normalization is not necessarily clear, and various regularization strategies (dropout, L2 regularization, observation noise in the input) might significantly affect results. It's also unclear why (according to the appendix) the convolutional decoder should be much smaller than the encoder.
The effect (if any) of these modifications to the model and training objective on the test-set ELBO could be reported, especially since robustness seems to come at the cost of decoder quality (probably because the encoder is more constrained than in a standard VAE, according to the authors). Quantifying and investigating this a bit further would make the paper stronger.
Ideally, the experimental section should include intuitive visualizations of the tables. It would be a good idea to include experiments on a standard benchmark for generative modeling (e.g. binary MNIST or CIFAR10) using the same likelihood function as in the literature, and report the ELBO. This way, one could make sure that the trained models are reasonably good, and the experimental conclusions would be sounder.

__ Correctness__: Yes.

__ Clarity__: The paper is well written, although it could use some proofreading and polishing.

__ Relation to Prior Work__: I believe this is a major issue in the current version. The paper presents related work that is used for experimental comparison. However, only a few other loosely related works are mentioned throughout the paper, and there is no dedicated related-work section.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: This paper discusses the propensity of (Gaussian) variational autoencoders to drift during repeated encodings; that is, given an original code z (or, similarly, a point x drawn from the data), this paper describes the distribution of z', the decoding and then re-encoding of z. This paper demonstrates 1) that || z' - z || is generally non-trivial, 2) that useful results follow from analyzing this behavior, and 3) that removing this behavior makes the learned representations more robust to specific adversarial perturbations for later prediction tasks.
Similar to the original VAE, the authors define a latent variable model from Z -> X, this time considering pairs of observations (Z, Z') -> (X, X') with a "linking coefficient" $\rho$ between the Z and Z' instances (Z ~ Gaussian, and $\rho$ is exactly the covariance between corresponding indices of Z and Z'). They then shrink estimated models toward their prior model, which at a high level is similar to the original VAE (a KL[Q || P] penalty, but using a repeated encoding instead of a single Z). In practice this is an encoder-decoder pair, with the original VAE penalties plus penalties based on re-encoding the output X' to Z'.
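The drift under discussion can be illustrated with a toy deterministic stand-in for a trained encoder/decoder pair (a minimal sketch using hypothetical linear maps, not the paper's model): when encode and decode are only approximate inverses, iterating z -> decode -> encode yields a non-trivial || z' - z ||.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Hypothetical linear stand-ins for a trained decoder/encoder pair.
# The encoder is a slightly perturbed pseudo-inverse of the decoder,
# so encode(decode(z)) != z, mimicking the mismatch discussed above.
W = rng.normal(size=(d, d)) / np.sqrt(d)
W_inv = np.linalg.pinv(W) + 0.05 * rng.normal(size=(d, d))
decode = lambda z: W @ z
encode = lambda x: W_inv @ x

z0 = rng.normal(size=d)
z = z0.copy()
drifts = []
for t in range(20):
    z = encode(decode(z))  # one re-encoding step
    drifts.append(np.linalg.norm(z - z0))

# Even a single re-encoding step already moves the code away from z0,
# and repeated iterations compound the mismatch.
```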

__ Strengths__: In general I found this paper well written and pleasant to explore. While the mismatch between iterates of autoencoders is not an entirely novel concept in general (see, e.g., [1]), it is certainly a fresh perspective and in my opinion a novel analysis of the noted phenomenon, specifically for VAEs. The derived model is intuitively similar to the original VAE once it is understood, yet expressive enough for the purpose of exploring autoencoder drift. Even from the latent-variable/generative perspective alone, in my opinion this paper includes interesting work in VAE theory. It further introduces clever tools for describing and testing drift phenomena.
I think the resulting simplicity of the derived model (modulo the original VAE training) is understated. That the end result separates so cleanly into the same KL[q||p] and entropy terms plus a Z -> Z' transition likelihood under a known distribution is incredibly elegant in my opinion.

__ Weaknesses__: I found the details of the reasoning fairly difficult to follow in Sections 2 and 3; while the individual statements are meaningful and usually well explained, the reasoning for introducing or building toward Props. 2.3 and 3.1 is not immediately clear, even though it eventually becomes critical to the construction of the actual proposed method.
Neither the original problem nor the adversarial attacks in the empirical experiments are particularly well motivated from the perspective of the theoretical analysis. The existence and nature of VAE drift is somewhat left to the reader, and its link to adversarial vulnerability is largely unexplored. The empirical results include an interesting set of experiments on adversarial robustness, but it seems important to link these experiments more closely (even just in intuition, with some hand-waving) to what appears to be the main idea of the paper, the analysis in Sections 2 and 3. Similarly, while I understand it is not the main focus of the work, results relative to the smooth encoder method are somewhat underwhelming without context or analytic comparison.

__ Correctness__: Results and derivations appear correct.

__ Clarity__: Generally clear, notation with $\theta$ and $\eta$ should maybe require a second look (is it truly necessary?).

__ Relation to Prior Work__: The adversarial experiments are (again) perhaps undermotivated, and it is not explicitly stated what relation either AVAE or compared methods share (except somewhat in the supplement).
Other auto-encoder inconsistency studies are not referenced (e.g. [1]), though the literature is sparse.
[1] Alain, Guillaume, and Yoshua Bengio. "What regularized auto-encoders learn from the data-generating distribution." The Journal of Machine Learning Research 15.1 (2014): 3563-3593.

__ Reproducibility__: Yes

__ Additional Feedback__: 1) Figure 1 does not aid your case that this phenomenon exists. While I am familiar with this line of research from other work and thus know that VAEs and other encoders drift, a better argument should be made to the reader, both in the introduction and in Figure 1.
2) The result that the AVAE condition makes representations more robust is empirical. This isn't a problem (empirical results are good too), but it seems almost independent of the theoretical frame and intuition of the work. I understand there are space constraints (this is a very full paper), but an analysis of why VAE vulnerabilities to adversarial attacks are mitigated by the AVAE condition would be helpful. Notably, $\varepsilon$-perturbation-type attacks are obviously not constrained to any data manifold (or, in a probabilistic sense, may move data $x$ to rare events $x + \varepsilon$). It is not immediately clear why a data-amortized or $z$-amortized condition should result in robustness to non-data-manifold points $x + \varepsilon$. This is maybe similar to 1), in that it's not clear to the reader why we should think about AVAE for this problem (or, alternatively, why drift in autoencoders allows for adversarial vulnerability).
3) The data distribution $p(X|Z)$ in Eq. 5 is an isotropic Gaussian. As noted earlier, the original VAE paper doesn't really specify $p(X|Z)$; it could be one of many (named, learned, etc.). Does this also apply to the results here? I think it does (otherwise Section 3.1 is irregular), but this should be noted in Section 3, adjacent to Equation 5, and it should be verified in the supplement that no proof uses specific properties of $p(X|Z)$. The final statement of Prop. 3.1 would also need to be changed, as v appears to be a parameter specific to the conditional Gaussian.
4) Is there a notion of "natural drift"? In the prior $\bar{\mathcal{P}}$ the smoothness parameter $\rho$ is left as just that, a parameter. Deriving a posterior for that parameter is a big ask, but one might ask: does an estimated $\rho$ from a regular (non-AVAE) Gaussian variational autoencoder tell us something about our data? Is there some understanding to be gained from autoencoder drift/mixing times? Or is it simply an undesirable property, similar to overfitting or mode collapse in GANs, to be removed by regularization? Or does the degradation of MSE performance vs. the vanilla VAE imply that the AVAE is relatively misspecified vs. the original model, and that Z does not have "natural smoothness"?
Along the same lines, why is $\rho \approx 1$ desirable in all cases?
4.1) In my opinion, since Example 3.1 is mostly relegated to the appendix, it would be more helpful to have further experimental results: instead of focusing on adversarial experiments, testing $\rho$ w.r.t. MSE performance seems interesting. This also seems important since these results are referenced in the conclusion without supporting evidence.
5) In Section 3 the notation switches to $p_\theta$ to denote a decoder held constant. Proposition 3.1 is used as justification for this; I do not understand the connection. It does not seem to affect the actual objective, but instead necessitates the `stopgradient` in implementation. The justification is unclear here.
6) Again, I think the resulting simplicity of the derived model is understated, and the end result is incredibly elegant. Unless this is obvious to others, I think it would benefit potential readers to know at the start that such a result is present, and that the derived training desiderata have the same simplicity (both in theory and implementation) as the original VAE model.
Minor notes:
- Equation at end of page 2 (prior to line 69) is missing a number.
- Proposition 3.1 makes a statement about "v", which previously appears only in Equation 5 and is never referred to elsewhere in the main text. A back-reference at Prop. 3.1 (perhaps in the text preceding it) would be helpful.
- Figure 1 Caption: Mnist -> MNIST
- Line 149: effecting -> affecting

__ Summary and Contributions__: [====After rebuttal====
I have read your rebuttal. Thank you for addressing the concerns. Overall, I think this is a good paper and I am happy to keep my score as it is.]
------------------------------------------------
This paper investigates the inconsistency between the encoder and decoder in VAEs, where the encoder cannot re-encode the decoder's generated samples. The paper then proposes a self-consistency method to fix this issue by using an alternative construction of the variational distribution. The resulting objective is intuitive in that it explicitly enforces consistency between the representation z used to generate a sample and the encoding of the generated sample. The paper conducts experiments on discrete data to show the consistency achieved by the proposed model, and also empirically shows that the proposed model is more robust to adversarial attacks on ColorMNIST and CelebA.
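The consistency being enforced can be summarized as penalizing the gap between a code and the re-encoding of its decoded sample. Below is a hedged sketch with hypothetical deterministic `encode`/`decode` stand-ins (not the authors' implementation; the actual objective incorporates such a term alongside the usual ELBO terms during training rather than computing it post hoc):

```python
import numpy as np

def consistency_gap(z, encode, decode):
    """Mean squared gap between a code z and the re-encoding of its
    decoded sample. encode/decode are deterministic stand-ins for the
    posterior mean and decoder mean."""
    z_reencoded = encode(decode(z))
    return float(np.mean((z - z_reencoded) ** 2))

# A perfectly self-consistent pair has zero gap; a mismatched pair does not.
zero_gap = consistency_gap(np.ones(4), lambda x: x, lambda z: z)
nonzero_gap = consistency_gap(np.ones(4), lambda x: 0.9 * x, lambda z: z)
```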

__ Strengths__: There are several things to like about this paper:
- Well motivated and well written
- A principled derivation of the proposed self-consistency objective
- A promising experiment where adversarial robustness can be achieved via self-consistency and without adversarial training signal.

__ Weaknesses__: The experimental results are obtained on relatively simple data (discrete data, ColorMNIST and CelebA). In particular, the experiment showing the consistency of AVAE is conducted only on discrete data. It is unclear whether the consistency and adversarial robustness of AVAE translate to more “realistic” data such as CIFAR10.

__ Correctness__: The derivations in the paper appear correct.

__ Clarity__: The paper is very well written and explained.

__ Relation to Prior Work__: The paper refers well to related work. For example, in line 187, the paper clearly mentions that the target model is already used in [4].

__ Reproducibility__: Yes

__ Additional Feedback__: Line 174: The extra “.” should be moved up to the previous equation.
I think Fig. 1 (or at least the way Fig. 1 is presented) is not strong enough to support the evidence about the inconsistency of the nominal VAE. Since P(Z|X = digit 9) is a distribution, it can happen (with low but nonzero probability) that a latent sample z ~ P(Z|X = digit 9) lies in a latent region that generates the digit 4 most of the time. A framework that fixes this problem should separate the latent space well; but since P(Z|X) is a Gaussian, for any x and any z, P(Z=z|X=x) > 0. To address this, the paper might emphasize the above point, or state that the generated sequence in Fig. 1 is obtained with high probability.
The legend in Fig. 5 could be written more clearly: e.g., move “(Top row)” next to the MLP architecture and “(Bottom row)” next to the convnet architecture, and put “and” between them.
In the first column of Fig. 5, why does the adversarial accuracy decrease for $\rho$ very close to 1?
Fig. 1 shows the inconsistency between the encoder and decoder of the VAE and motivates AVAE. I expected a similar experiment showing that the proposed AVAE achieves consistency between encoder and decoder, but the paper never presents one. The only experiment showing the consistency of the AVAE is Fig. 3, for discrete data. I think it would be interesting to show the consistency of AVAE in the same manner as in Fig. 1 for MNIST.

__ Summary and Contributions__: The authors propose a novel objective for VAEs that improves the adversarial robustness of the learnt representations. Compared to the recent "Adversarially Robust Representations with Smooth Encoders" paper (SE) [4], their formulation does not require finding adversarial examples during training, which makes training faster. The authors also show that their approach is complementary to [4] and propose a combination of their model with SE.

__ Strengths__: This paper proposes a novel VAE objective which is theoretically sound and clearly motivated. It improves over [4] in terms of training performance. The analysis brings novel insights on VAEs and can be of broader interest to the community. All claims are proved and valid.

__ Weaknesses__: I would advise to make explicit in the title that this model focuses on learning adversarially robust representations using VAEs.
Also, I think that the first observation "We uncover that widely used VAE models are not autoencoding - samples generated by the decoder of a VAE are not mapped to the corresponding representations by the encoder" was already made in [4].

__ Correctness__: Yes. The empirical methodology is inspired from [4] and clearly compares with existing methods.

__ Clarity__: Absolutely.

__ Relation to Prior Work__: Maybe emphasizing in the introduction that this provides an alternative approach/extension to [4] could make the introduction and the claims more impactful.

__ Reproducibility__: Yes

__ Additional Feedback__: -"In this paper, our starting point is based on the assumption that if the learned decoder can provide a good approximation to the true data distribution, the exact posterior distribution (implied by the decoder) tends to possess many of the mentioned desired properties of a good representation, such as robustness." seems in contradiction with what is done in the remainder of the paper.
-Why does X' depend on Z' in Eq. 2 and throughout Sect. 2, while this is no longer the case in Sect. 3?
-At the beginning of Sect. 3, it could be interesting to emphasize the meaning of X tilde (all possible images) and of the distribution u tilde, whose choice is possible since the image manifold is compact.
Typos in appendix:
- l.412 double bar for KL
- l.415 why q tilde for the marginals?