Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
This paper explores interesting directions, in particular 1) using interactive settings to evaluate a model rather than a single answer, and 2) combining different automated metrics in a weighted sum to approximate human evaluation (e.g., based on sentiment). Reviewers raised crucial points regarding gameability (using the metrics to train a model is risky unless it is followed by a non-gameable evaluation) and the lack of comparability across different self-play setups. It is indeed a much better evaluation setting if the system does not control both sides of the conversation (e.g., models being matched against the same set of fixed models), so the authors should definitely pursue that direction.

However, I expect this work would still be interesting to the dialog community: many of the diagnostic advantages of the model-talking-to-model setting remain in practice, especially because the model is not trained with the self-play objective; that criterion is only applied post hoc, so the system cannot extensively exploit it during training. In practice, many problems in a given model's generations already surface during self-play, and the reasonable worry raised by reviewers that the model could exploit the metric remains theoretical at the moment. So, with the caveat that results in self-play mode may be too optimistic, self-play can still serve as a useful and cheaper diagnostic tool (compared to collecting human responses) for catching issues such as excessive repetition.

The authors should at the very least include a discussion of the crucial caveats raised by the reviewers (e.g., that a more reliable evaluation would pair the model with a stable "opponent/user model"), or better, report results in that setup.
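To make the second direction concrete, here is a minimal sketch of combining automated metrics into a weighted sum as a cheap proxy for human judgment; the metric names and weights below are hypothetical, chosen only for illustration, and are not taken from the paper under review.

```python
# Hypothetical sketch: a weighted sum of automated dialog metrics used as a
# proxy for human evaluation. Metric names and weights are illustrative only.

def combined_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-metric scores (each assumed normalized to [0, 1])."""
    return sum(weights[name] * metrics[name] for name in weights)

# Example metrics for one self-play conversation: sentiment, a non-repetition
# score (higher = less repetitive), and fluency.
weights = {"sentiment": 0.4, "non_repetition": 0.3, "fluency": 0.3}
scores = {"sentiment": 0.8, "non_repetition": 0.5, "fluency": 0.9}

print(round(combined_score(scores, weights), 2))  # 0.74
```

As the reviewers note, any such combined metric is gameable if optimized directly, which is why it is safer as a post-hoc diagnostic than as a training objective.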