Reviews: Approximating Interactive Human Evaluation with Self-Play for Open-Domain Dialog Systems

Reviewer 1

Originality/significance/quality: The paper addresses an important topic, as there is indeed a need for better evaluation of dialog systems. The paper attempts to move away from traditional evaluation of open-domain dialog systems (i.e., judging a response given its conversation history) and towards a more interactive one (i.e., a human talking to a bot), which is likely an important step towards better evaluation. However, I do have several serious concerns about this work in its current form:
(1) The authors contrast their work with existing evaluation of open-domain dialog systems, which they call “single-turn” evaluation. They point out that this type of evaluation cannot capture “failure modes […] such as a lack of diversity in the responses, inability to track long-term aspects of the conversation”. I think this is rather misleading and the term “single-turn” is a misnomer. Most previous work has indeed evaluated each conversation by factorizing it into a sequence of independent turn-level judgments, but each of these judgments assesses the quality of the current turn T_n **given** a history of several previous turns T_{n-k}, …, T_{n-1}. Contrary to what “single-turn” implies, judges look at several conversational turns at once for each judgment. While one could set n-k=1 and therefore let the judge see the entire conversation T_1, …, T_n (and therefore let the judge identify “lack of diversity”, etc.), the practice is to set k to a smaller window so as to reduce cognitive overload for the judge and therefore make evaluation more reliable. Much of the earlier work on open-domain dialog systems used a small context window (k=2 or 3, which should really be called “two-” or “three-turn evaluation”), but more recent work (e.g., Serban et al.) has used much bigger windows of context and values of k. Serban et al. and others would not have been able to show that longer context and better representation thereof helped without the use of *multi-turn* evaluation. In sum, I think the characterization of previous work in the paper is rather misleading and I think most of what is said about “single-turn” in the paper would have to be rewritten.
(2) Self-play protocol: It seems to me that this automatic evaluation is easily gameable, as each prior turn T_{n-1} is generated by the same system, so the system could “strategically” only ask questions or issue queries for which it has high-quality responses. For example, one could design a system that ignores the original turn T_1 and then (for T_2, …, T_n) simply regurgitates human-to-human conversations memorized from the training data. While one might argue that gameability is only relevant in the context of a competition and that one could decide to leave that issue to future work, I think it is important to make metrics resistant to gaming, as it is fairly common to optimize directly or indirectly for automatic metrics (e.g., directly through RL as in https://arxiv.org/abs/1511.06732). Even if most system designers do not deliberately try to game metrics, RL or repeated queries of the metric might just take care of that, so metrics need to be gaming-proof from the outset.
(3) I am not convinced by the comparisons in the paper between “single-turn” evaluation and the self-play scenario of the paper. Results are not comparable, as the paper uses Kappa for single-turn and Pearson’s r for self-play. As in previous work, it is entirely possible to use Pearson’s r for assessing the level of correlation between existing metrics and human judgment, so why haven’t the authors done that? (On a related note, Spearman’s rho or Kendall’s tau are usually deemed more robust than Pearson’s r, as they don’t assume that the relationship is necessarily linear.)
(4) Semantic and engagement metrics: this is more of a minor concern, but these metrics are potentially gameable too, as they are machine-learned. We can think of it this way: trainable metrics are akin to a single round of GAN training, where the generator is the response generation system and the discriminator is the learned metric. Clearly, one would have to iterate multiple times (GAN style), as otherwise one could modify the generator to find the weak spots in which the discriminator can easily be fooled. This intuition of GAN for dialog evaluation is described in more detail in (e.g.) https://arxiv.org/abs/1701.08198, https://arxiv.org/pdf/1701.06547.pdf, https://arxiv.org/pdf/1809.08267.pdf.
Clarity: the paper is mostly clear.
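To illustrate the parenthetical remark in (3), here is a minimal sketch in plain Python comparing Pearson's r with Spearman's rho and Kendall's tau on made-up scores. The data and the names `metric_scores` and `human_scores` are hypothetical and not taken from the paper; the point is only that a perfectly monotone but nonlinear relationship gets a rank correlation of 1 while Pearson's r falls short of it.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's r: linear correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical scores: the human rating grows monotonically but
# nonlinearly (cubically) with the automatic metric.
metric_scores = [1, 2, 3, 4, 5, 6]
human_scores = [m ** 3 for m in metric_scores]

print("Pearson r   :", round(pearson(metric_scores, human_scores), 3))
print("Spearman rho:", round(spearman(metric_scores, human_scores), 3))
print("Kendall tau :", round(kendall_tau_a(metric_scores, human_scores), 3))
```

On this toy data both rank correlations equal 1, while Pearson's r is roughly 0.94. This is why reporting a rank correlation alongside Pearson's r, for both the turn-level and the self-play setting, would make the two sets of results easier to compare.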

Reviewer 2

This paper addresses the problem of evaluating spoken dialogue systems, pointing out that current methods, which focus on evaluating a single turn given the history, bear little relation to the overall user experience. The core idea is a hybrid metric M_h, which is a linear combination of three metric classes, each measured across a whole dialogue: sentiment, using a classifier trained on Twitter; semantic similarity, measured by InferSent plus a number of simple word-based metrics; and engagement, based on a question score and the length of user inputs. The dialogues themselves are generated automatically by self-play, i.e., the system's output is simply fed back as input for a fixed number of turns. The ideas underlying M_h are then used in two ways. Firstly, its main components, sentiment and semantics, are used to regularise the training of standard HRED, VHRED and VHCR dialogue models. This is shown to improve human evaluations of quality, fluency, diversity, contingency and empathy in most cases. Secondly, they show that although single-turn metrics show similar trends, they are very noisy and their correlation with human evaluation is poor. However, the proposed M_h metric correlates very well with human judgement. Overall, this is an interesting paper. The proposed metric appears to be very useful, and the self-play idea is useful where there is no existing dialogue data to evaluate on (as would be the case during system development). However, this is really two papers in one. Although they are inspired by the same idea, the proposed regularisation for training chatbots is really a separate topic from evaluation, and I would have preferred the focus to be on the latter. In particular, it isn't clear why the proposed metric could not be applied to some of the existing published data sets, e.g., MultiWOZ.
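The self-play loop and the linear combination summarised above can be sketched as follows. This is only an illustration under strong assumptions: `toy_bot`, the word lists, and the three scoring functions are hypothetical stand-ins written for this sketch, not the paper's classifiers, InferSent embeddings, or learned weights.

```python
def self_play(bot, seed, num_turns):
    """Feed the bot's own output back as input for a fixed number of turns."""
    dialog = [seed]
    for _ in range(num_turns):
        dialog.append(bot(dialog))
    return dialog


def toy_bot(history):
    """Hypothetical bot: asks about the last word of the previous turn."""
    last_word = history[-1].rstrip("?").split()[-1]
    return "tell me more about " + last_word + "?"


def words(turn):
    """Lower-cased set of words in a turn, ignoring question marks."""
    return set(turn.lower().replace("?", " ").split())


# Toy stand-ins for the three metric classes (all assumptions).
POSITIVE = {"love", "great", "happy"}
NEGATIVE = {"hate", "sad", "awful"}


def sentiment(dialog):
    """Net count of positive minus negative lexicon hits, per word."""
    tokens = [w for turn in dialog for w in turn.lower().rstrip("?").split()]
    hits = sum((w in POSITIVE) - (w in NEGATIVE) for w in tokens)
    return hits / max(len(tokens), 1)


def semantic(dialog):
    """Mean Jaccard word overlap between consecutive turns."""
    pairs = list(zip(dialog, dialog[1:]))
    if not pairs:
        return 0.0
    overlaps = []
    for a, b in pairs:
        union = words(a) | words(b)
        overlaps.append(len(words(a) & words(b)) / len(union) if union else 0.0)
    return sum(overlaps) / len(overlaps)


def engagement(dialog):
    """Fraction of turns that end with a question mark."""
    return sum(turn.endswith("?") for turn in dialog) / len(dialog)


def hybrid_score(dialog, weights):
    """Linear combination of the three dialogue-level scores."""
    w_sentiment, w_semantic, w_engagement = weights
    return (w_sentiment * sentiment(dialog)
            + w_semantic * semantic(dialog)
            + w_engagement * engagement(dialog))


dialog = self_play(toy_bot, "i love hiking", num_turns=3)
for turn in dialog:
    print(turn)
print("hybrid score:", hybrid_score(dialog, weights=(1.0, 1.0, 1.0)))
```

Note that the toy bot repeats the same question from the second turn onward, and the overlap-based score rewards that repetition. A sketch like this makes it easy to see why it matters which components enter the combination and how they are weighted.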

Reviewer 3

The paper starts out with an attempt to address the difficulty of evaluating a conversational model that is supposed to be able to produce an interesting dialog (with a human) in an open-domain chat. This is indeed a challenge for research on automated dialog systems. However, the paper soon turns to the addition of an EI regularization layer, which is confusing and somewhat distracts from the main focus on the evaluation system. In the experiments it is shown that, using five conventional metrics with humans doing the scoring, the proposed EI regularization can improve almost every one of the three known models; yet between the three models the ANOVA results do not show significant differences. Then, using the hybrid metric and self-play automation, it is shown that the addition of the EI layer is favorable over the baseline models, and the results correlate well with human judgment using the conventional metrics on interactive sessions. One critical discussion that is missing is in what specific ways self-play dialog differs from interactions with a real user. Does self-play assume that the same persona is talking on both sides? How can one accurately represent the persistent but potentially very different personas represented by real users? The paper also seems to struggle between two main proposals: one on the addition of the EI regularization layer, and the other on advocating self-play automated multi-turn evaluation and a hybrid metric. If the EI proposal is meant to be the main contribution, the arguments about the evaluation metrics and self-play setup are problematic, because you are showing that with your proposed evaluation, your proposed modification is better than the baseline. There is a question about the fairness of the evaluation choices. On the other hand, if the evaluation system is meant to be the main proposal, then the choice of the EI regularization could be an unfair demonstration of the effectiveness of the evaluation system.
How can you be sure that your proposed evaluation scheme will not contradict human judgment when a different set of alternatives is evaluated? Either proposal provides a useful but rather incremental contribution to the state of the art in dialog modeling. Given that, the paper seems to be a better fit for an audience more specialized in dialog modeling. It is unclear how the broader NeurIPS audience would benefit from the work.

Paper ID:	7592