In this paper, the authors propose a new method for training deep networks that relies on learning a generative model for data augmentations. The resulting generative model is derived from a variational auto-encoder (VAE) and trained to learn augmentations for each hidden representation in the network. The authors test this new method for data augmentation and fine-tuning on CIFAR-10, CIFAR-100 and ImageNet. On all of these benchmarks, the authors report substantial gains in classification accuracy. The reviewers raised concerns about the theoretical grounding of the method, the strength of the baselines, the training time in practice, and the clarity of presentation of this complex method. Some of these concerns were well addressed by the rebuttal. In significant post-rebuttal discussion, the reviewers concurred that a first demonstration of a generative model improving a classifier's performance on a challenging benchmark is an impactful result that may significantly affect many other forms of machine learning. This point subsequently convinced all reviewers to raise their scores. Although I agree that the theoretical grounding of the method is limited and that the method itself is quite complicated, this empirical result is an important contribution to the field and deserves broader recognition because it may spur others to investigate the benefits of generative models for data augmentation. That said, the authors have substantial work to do in incorporating results and discussion from the rebuttal, as well as in significantly improving the presentation and clarity of the manuscript. On this last point, I stress that the authors should pay close attention to addressing the suggestions of R2 to improve clarity. Assuming that all of these issues are addressed, this paper is conditionally accepted to the NeurIPS conference.