NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:7578
Title:Paraphrase Generation with Latent Bag of Words

Reviewer 1

This paper presents a model in which a latent bag-of-words informs a paraphrase generation model. For each source word, the authors compute a multinomial over "neighbor" vocabulary words; a mixture of softmaxes over these neighbors then yields a bag-of-words distribution. In the generative process, a set of words is drawn from this distribution, and their word embeddings are averaged to form the input to the decoder. During training, the authors use a continuous relaxation of this with Gumbel top-k sampling (a differentiable way to sample k of these words without replacement). The words are averaged and fed into the LSTM's initial state.

Results show decent BLEU and ROUGE scores compared to baselines, as well as some nice examples. However, the authors don't compare against any baselines from prior work, instead comparing against their own implementations of basic Seq2seq, Seq2seq + Attn, and VAE models. As a result, it's a bit hard to situate the results with respect to prior efforts.

As for the model itself, I like the premise a lot, but I am a bit disappointed by its actual implementation. Averaging the word embeddings to get an initial state for the decoder seems like "giving up" on the fact that you actually have a bag-of-words that generation should be conditioned on. It would be much more interesting to at least attend over this collection, or ideally to use some more complex generative process that respects it more heavily. In light of this, I would also like to see a comparison to a model that simply treats the phi_ij as mixture weights and computes the input to the decoder by summing the weighted word vectors. As it stands, I'm not sure how much value the top-k sampling layer adds, since its output immediately gets smashed back down into a continuous representation. I do see the value in principle, but this and other ablations would convince me more strongly of the model's benefit.
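To make the comparison requested above concrete, here is a minimal sketch of the two ways of building the decoder input: averaging the embeddings of k words drawn by Gumbel top-k sampling, versus a plain weighted sum of embeddings using the mixture weights. This is an illustration under stated assumptions, not the authors' implementation: it uses hard (non-relaxed) sampling rather than the continuous relaxation used in training, it assumes the per-source-word distributions are combined with uniform weights, and all function and variable names are hypothetical.

```python
import math
import random


def mixture_over_sources(phi):
    """Average the per-source-word distributions phi[i][j] into one
    distribution over the vocabulary (uniform weights: an assumption)."""
    m, vocab = len(phi), len(phi[0])
    return [sum(row[j] for row in phi) / m for j in range(vocab)]


def gumbel_top_k(probs, k, rng):
    """Sample k distinct indices without replacement: add Gumbel(0, 1)
    noise to each log-probability and keep the k largest perturbed keys."""
    keys = []
    for j, p in enumerate(probs):
        # Clamp u away from 0 and 1 so both logarithms are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        keys.append((math.log(p + 1e-12) + gumbel, j))
    keys.sort(reverse=True)
    return [j for _, j in keys[:k]]


def average_embeddings(indices, emb):
    """Decoder input as the mean embedding of the sampled words."""
    dim = len(emb[0])
    return [sum(emb[j][d] for j in indices) / len(indices) for d in range(dim)]


def weighted_sum_embeddings(weights, emb):
    """Ablation: skip sampling and sum embeddings weighted by the mixture."""
    dim = len(emb[0])
    return [sum(w * e[d] for w, e in zip(weights, emb)) for d in range(dim)]


# Toy example: 2 source words, a 5-word vocabulary, 2-dimensional embeddings.
rng = random.Random(0)
phi = [[0.7, 0.1, 0.1, 0.05, 0.05],
       [0.1, 0.6, 0.1, 0.1, 0.1]]
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

mixture = mixture_over_sources(phi)
sampled = gumbel_top_k(mixture, k=3, rng=rng)
sampled_input = average_embeddings(sampled, emb)
ablation_input = weighted_sum_embeddings(mixture, emb)
```

Both `sampled_input` and `ablation_input` are single dense vectors of the same size, which is the crux of the concern: once the sampled words are averaged, the decoder no longer sees them as a discrete set.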
Based on Figure 2, I'm not really convinced that the sample is guiding the generation strongly, as opposed to providing a latent topic. The bag-of-words is definitely related to the output, but not very closely. The Figure 4 results are more compelling, but I'd have to see more than these two examples to be truly convinced. Overall, I like the direction the authors are going with this, but I'm not quite convinced by the experimental evaluation, and I think the model could be more than it currently is.

========================

Thanks for the author response; I have left the review above unchanged, but I provide some extra comments here. I see now that this is in the supplementary material, but it is still unclear from the main paper. In light of this, I have raised my score to a 7. I think this model is quite nice for this task. The results are still a bit marginal, but stronger given what's shown in Table 4 in the response. Finally, the comparisons in Table 1, even if not totally favorable, at least situate the work with respect to prior efforts. So this has largely addressed my criticisms.

Reviewer 2

This paper proposes an interesting and novel method for paraphrase generation. I enjoyed reading the paper: the description is clear, the related work is balanced, and the result analysis section is convincing.

Minor suggestions:
- Figure 1 could be expanded with more details, tied to the equations or sections in Sec 2.
- The term CBOW invokes a standard word embedding method as opposed to "cheating BOW". It's fine, but it confused me for a bit originally.
- If space permits, some more details about the Gumbel implementation would be helpful.

== Note after author response ==

I think my suggestions weren't mentioned in the author response since they were minor points (which is fine), but I trust the authors will improve the draft in the revision.

Reviewer 3

The paper presents a simple, fully differentiable discrete latent variable model for content planning and surface realization in sentence-level paraphrase generation. The discrete latent variables are grounded in the BOW of the target sentences, bringing semantic interpretability to the latent model. The paper is very well written, and the proposed approach is thoroughly evaluated on sentence-level paraphrase generation with both quantitative and qualitative analysis.

One of my main concerns with the approach is its evaluation on sentence-level paraphrase generation. The need for content planning in sentence-level paraphrase generation is rather limited: we often aim to generate a target sentence that is semantically equivalent to the input sentence, so there is no need for content selection or content reordering during content planning. It is no surprise that a simple model such as the BOW model is good enough here. The decoder simply takes an average representation of the bag of words and generates the target sentence. I am afraid that the presented model will not influence, or be useful for, more complex text generation tasks such as document summarization or data-to-text generation. These tasks often require more sophisticated content selection and content reordering during content planning.

What are the state-of-the-art results on the Quora and MSCOCO datasets? It is not surprising that the Seq2seq-Attn model seems to be doing well too. The attention scores also learn a semantic interpretation of the information that is used during decoding. It would have been interesting to see the outputs of the Seq2seq-Attn model alongside those of the LBOW models, to understand how the unsupervised learning of neighbour words is useful.