Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Originality: The proposed approach of modelling syntactic and lexical diversity within the latent space to generate diverse image captions is novel. Quality: To establish that the generated captions are diverse, various standard diversity metrics are reported for the proposed method in Tab. 2. Qualitative results demonstrating diverse captions, and diversity conditioned on different visual parse tree probabilities, are shown in Figs. 5 and 6. These experiments help justify the core components of the proposed approach. Clarity: The paper is well written and easy to follow. Careful illustrations in Figs. 2 and 3 aid the description of the proposed method. Significance: The proposed method achieves accuracy comparable to its GAN/VAE-based diverse-captioning counterparts while demonstrating better diversity scores. It is a valuable benchmark for future development in diverse captioning. Post-rebuttal comments -- The authors added experiments measuring the edit distance between POS tags/words when the lexical or syntactic variables are changed, which is a valuable addition to the earlier experiments. They addressed my concerns about the experimental setup for the baseline methods and should clarify this in the final submission. I remain positive about the contributions of the paper and maintain my earlier rating.
Originality: The paper is moderately original — the idea of splitting diversity into lexical and syntactic components is interesting. However, the approach taken is derived from  and standard VAEs. Quality: The paper is technically correct and cites relevant work adequately. Clarity: While the writing is generally clear, the notation is somewhat cluttered and hard to parse, owing to the “.” and “..” superscripts on variables. Significance: The idea of splitting diversity into these two factors is both novel and interesting. The paper takes a reasonable route to modelling both aspects and shows that this leads to improvements, at least over other VAE-based methods (  and other RL-based methods perform better; see the COCO leaderboard). Therefore, it is of reasonable interest to the community.
1. It is not clear to me how the model generalizes to longer captions or paragraphs, given that each tree node has only a local latent variable/representation. 2. If possible, the authors should reorganize Section 3, as the current version is not easy to follow. 3. In general, the proposed framework seems somewhat engineering-driven. It would be helpful if the authors could provide a more abstract analysis and empirical discussion of the framework.