All four reviewers support acceptance. They highlight the contributions, notably the idea of using adversarial perturbations for training transformer-based vision-language models and the successful experimental demonstration of this idea on six standard vision-and-language tasks/datasets, leading to SOTA results, all in a clearly written and well-organized paper. I agree with these observations and also recommend acceptance of this strong paper. The concerns the reviewers raised have been successfully addressed in the author response, and I expect the authors to follow through on their promise to release all code and pre-trained models and to revise the paper with the corrections, clarifications, and additions from the rebuttal, including results on VILLA_LARGE and, ideally, additional qualitative examples in the appendix.