This paper presents a novel pretraining idea and demonstrates strong empirical results on a number of tasks. At present, the paper reads somewhat like a system description, and it would be good to add ablation experiments to shed light on the various design choices. This might mean relegating some of the tasks to the appendix to create space for these additional ablations. In the eyes of the AC, a few ablations would be more useful than the current enumeration of tasks.

It would also be good to consider alternative terminology for the MT setup. If I understand the setup correctly, the corpus your model has access to contains translations of the sentences it is being trained to translate; it just does not know the sentence-level alignment. This is less supervision than 'supervised' systems have (which are trained on sentence-aligned parallel corpora), but more supervision than unsupervised systems trained on monolingual corpora. It is therefore not fair to call the approach unsupervised; it seems closer to a transductive setting. One can easily imagine how borrowed words and entity names could enable the model to quickly find some of the parallel sentences and bootstrap its way from there.