NeurIPS 2020

Cascaded Text Generation with Markov Transformers

Review 1

Summary and Contributions: This paper presents an approach to non-autoregressive sequence generation using a cascade of transformers. They formulate transformers as CRFs, where an mth-order transformer depends on the previous m words in the decoder (as well as all words in the input). Critically, by pruning the state space at each each timestep using lower-order models, inference in higher-order models becomes faster. This pruning is based on max marginals: the score of the best path using a particular word (for 0th order models) or n-gram (for higher-order). These max marginals themselves would pose a O(L) serial bottleneck, except that the paper presents a clever parallel algorithm for computing them in O(log L) time. Results on 5 MT settings show that performance is nearly as good as an autoregressive transformer, better than most past non-autoregressive models, and giving a 3x-5x speedup without much loss in performance compared to the autoregressive model. Distillation from the autoregressive model is necessary to achieve this strong performance. The authors have released their model and hyperparameters for all experiments.

Strengths: This paper has some very nice algorithmic contributions. I am not sure whether this will be a preferred method for non-autoregressive generation (see below in the Weaknesses), but I like what the authors are trying to do. Coarse-to-fine pruning using max marginals is a very appealing technique that hasn't really seen the light of day in the neural era. It's impressive to me that the authors have gotten this as fast as it is, and there are some strong algorithmic components to make this work. In particular, the parallel max marginal computation is pretty cool. The paper introduces a significantly different way of doing things than most other non-autoregressive techniques, so I think it will be a valuable reference and inspiration to other researchers who want to carry on in this direction. The model is also nicely tuned. Folding all of the models into a single one is reasonably elegant. Between this and the knowledge distillation, the authors seem to have put a fair amount of effort into making this work well, and this effort pays off nicely in the results. The results themselves seem quite strong, handily outperforming many of the past non-autoregressive models with a decent speedup.

Weaknesses: While I am advocating for this paper's acceptance, I'm curious as to whether the authors think this will truly be the dominant approach going forward in this area. I find this approach theoretically more appealing than the Levenshtein transformer, but I think the "global communication" as a negative feature of that model isn't strictly a negative. Sure, the more local nature of this one gives a speedup. But successfully capturing long-range dependencies is one of the things transformer models like GPT-3 seem to be good at. This is a limitation of the paper only evaluating on MT; in MT, the input heavily constrains the shape of the output and long-range output dependencies may not be quite as necessary. But in tasks like summarization or open-ended text generation like story generation, the input governs the shape out of the output much less and I wonder whether this more local model is sufficient. It seems like while approaches like the Levenshtein transformer and other insertion-based approaches maybe aren't as well motivated and are harder to get working well, a good generation order there could outperform the approach in this work for more complex generation settings. Another mild weakness is the need for the length prediction with linear regression. Again, this works for MT, but I wonder about other settings.

Correctness: This paper is very nice empirically and supports its claims well.

Clarity: Yes, this paper is very clear.

Relation to Prior Work: The set of work cited is impressive and the comparisons in the experiments are thorough as far as I can tell.

Reproducibility: Yes

Additional Feedback: Line 100: "wih" =============== Post-response: thanks for the nice response. I do take your point about length prediction being an upper bound. Otherwise, my review and score is unchanged -- I still think this is solid work.

Review 2

Summary and Contributions: The paper proposes a model for sub-linear time, semi-autoregressive neural text generation with a model that considers Markov dependencies in the output. The model is based on adapting the framework on structured prediction cascades, where candidate outputs are pruned during decoding through scoring with CRF models of increasing context length (0- to Mth-order dependencies in the outputs). The max marginals of each n-gram are computed to filter the top-K candidates at each step with a parallelized implementation. The model is a Transformer with hard barriers between every M+1 words during training to enforce the Markov dependency structure. The results show that up to 7 time speed-ups over fully autoregressive decoding can be obtained at little loss in translation accuracy when knowledge distillation is used, which compared favorably to most previous non-autoregressive approaches.

Strengths: - The paper proposes a principled approach to semi-autoregressive generation with a novel application of previous work on structured prediction in a framework with more expressive neural models. - The approach provides a good trade-off between speed and accuracy compared to previous approaches. Extensive model analysis is included.

Weaknesses: - The description of the approach is complete and clear, but quite dense, in particular for cascaded decoding. So I think the paper could be improved by having more high-level explanations or intuitions. - The paper presents a general and powerful framework, but the model is then trained so that the log potential of an n-gram is the next word probability, which makes sense for training efficiency but seems to not fully exploit the expressive potential.

Correctness: The claims and empirical methodology are correct.

Clarity: The paper is well written.

Relation to Prior Work: All relevant prior work is mentioned, but there should preferably be more discussion on the relation to the closest previous work on semi-autoregressive generation, in particular Sun et al., 2019, who also proposed a CRF-based approach.

Reproducibility: Yes

Additional Feedback: What is the effect of computing the actual max marginals in decoding vs approximating it by only using the n-gram score? In the result tables, it may improve clarity and consistency to use M instead of "iterations". --- Thanks for the response and clarifications.

Review 3

Summary and Contributions: This paper proposes a new non-autoregressive text generation algorithm using cascaded decoding. The authors proposed a Markov Transformer, which can compute the log-potentials efficiently. The proposed method can be parallelized to get good efficiency during inference. The method gives comparable performance against baselines.

Strengths: Proposed method is general, so it can be applied on many variations of transformer models. Experiment details are clearly described. The proposed method is thoroughly compared with several baselines.

Weaknesses: The clarity, especially the method part needs to be improved. (See Clarity). Since the proposed method does not outperform other models in BLEU, it's necessary to have more discussion. For example, why other methods can not reach same level of parallelism.

Correctness: The method is likely to be correct.

Clarity: The explanation to the method (Sec. 3 and 4) is difficult to understand. I'm not very familiar with non-autoregressive text generation, so I don't fully understand how the generation process works. For example, figure one explains how to get K size-3 spans, but it's unclear how to decode a full sentence using these 3-spans. I would suggest to give a real example on how the algorithm generates a sentence. Notations are not clear. - In line 92, X(x_{i:j}) represents "set of sequences that contain a span x_{i:j}", it's unclear to me if it should appear at the same position or not. - What are C, P and S in TreeMM? Line 100 "wih" -> "with"

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: After the authors clarified a few questions in the response, I read the paper again. I feel the algorithm is clear and interesting. But I didn't give a score higher than 6 mainly because the method sections need to be improved.

Review 4

Summary and Contributions: Main Contribution: This paper proposes a cascaded decoding approach with a Markov transformer architecture. The authors also apply treeDP on the decoding process for higher decoding speed, which supports computation in parallel. The cascaded approximation can be computed in parallel in O(MK^2logL).

Strengths: Strength: The proposed idea is novel. Motivated by semi-autoregressive training, the authors propose a Markov transformer architecture and a new cascaded decoding approach. The usage of TreeDP is natural and elegant. Compared with AT-based decoding approaches, the speedup performance is significant. The proposed approach achieves large speedup within little bleu decrease. The related work section is detailed. The paper is in well written.

Weaknesses: Missing some well-performing NAT baselines in experiments. Although the authors list sufficient work in Section 2, the empirical comparison in experiment is limited. Many recent baselines are missed. Compared with NAT-based decoding approaches, the performance improvement is marginal. For example, Levenshtein achieves higher results than the proposed model with similar speedup performance. Also, some NAT approaches have great potential to accelerate decoding, like imitate-NAT and NART-DCRF. They can achieve extremely high speedup performance, like 10X and 18X. Although these approaches also bring large BLEU decays, I still have no idea whether the proposed approach can reach higher BLEU scores under such extremely cases.

Correctness: Yes

Clarity: yes

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback: Minor: In line 26, the authors claim that “we can explore any level of autoregressive dependencies to achieve a speed/accuracy tradeoff”. It would be better to add a figure like Figure 3b to show the comparison between the proposed approach and literatures. In line 102-109, the details of treeDP are not clear. It is difficult for me to understand the whole process. To my knowledge, the performance of transformer implemented by fairseq (default version) on De-en should be around 34.7, higher than the reported score 34.4. There is a big margin between Table 1 and line 208. ======== I have read and taken into account the rebuttal.