Reviews: HYPE: A Benchmark for Human eYe Perceptual Evaluation of Generative Models

Reviewer 1

This paper proposes a human-based benchmark for rating GANs. While readers might argue that this is too costly, the authors nicely illustrate the costs of their system and motivate the need for human raters as the true gold standard. Although the system is introduced as general and applicable to any type of generative model, the authors missed a (very recent) related work from the NLP community. The HUSE system, from the paper Unifying Human and Statistical Evaluation for Natural Language Generation (Hashimoto et al., 2019), uses a similar Turing-test-like method to rate the accuracy of a translation model. In addition, Hashimoto et al. claim that, while humans are the gold standard for rating accuracy, they fail to quantify the diversity of the underlying generative model.

Reviewer 2

This paper introduces a framework to evaluate the perceptual realism of samples from generative models. The framework, HYPE (Human eYe Perceptual Evaluation), is based on psychophysics methods. Two different metrics are proposed. The first one, HYPE_time, measures the amount of time a human needs to distinguish a real image from a fake one. The metric is clearly defined and well founded in psychophysics. The second one, HYPE_infinity, measures the rate at which humans misclassify fake and real images (given unlimited time). This second metric is much simpler, faster and cheaper (in terms of human labor) while maintaining the reliability of the first one.

The paper is very well written, it is based on psychophysical theory, the methodology is meticulously detailed, and the experimental results and conclusions are quite interesting. The work will clearly contribute to the research and development of better generative models, an issue of major importance in machine learning. I am undoubtedly in favor of acceptance. Nevertheless, I list a few comments below:

-- The two proposed metrics allow ranking models according to realism, but their absolute values, and even their relative differences, do not carry meaning. The same numerical change in a HYPE score does not correspond to the same amount of visually perceived change in realism. This implies that the proposed metrics do not tell us how much better a model is, but only produce a ranking. I would like the authors to comment on this.

-- The proposed metrics only measure sample realism. But another very important property of generative models is diversity. This is openly stated as a limitation of the method. Nevertheless, I think it would be good to discuss how a measure of diversity could be incorporated into the framework (either from human evaluation or from automatic measurements of the generated samples). A ranking of generative models should consider at least both realism and diversity.

-- Comparison to automatic metrics. In the end, evaluating a single generative model is rather costly, so the only way out seems to be to compute automatic measurements. The authors compare to FID, KID, and F1/8, but there is little discussion of this comparison. The section would be improved by drawing conclusions that go beyond whether or not the metrics are correlated. It would also benefit from a figure showing all the FID/HYPE/XX data points, not only the correlation coefficients.

-- Regarding reproducibility. It would be good if the authors made available all the data necessary to recalculate the metrics shown in the paper, in particular the evaluations by each human of the generated images that are used. This would allow other researchers to fully reproduce the results and would facilitate further research along different lines (e.g., can the humans be clustered in terms of realism perception? Do all humans behave more or less the same? Are some better correlated with any of the automatic metrics?).

Other minor comments:

-- During the evaluation procedure, when presenting images to the experts, you mention in many cases that half of the images are fake and half are real (e.g., line 135: 50 real / 50 fake). Knowing this proportion would bias the expert. Could this be a problem?

-- Hyper-realism is hard to understand (the generator produces images that look more real than real ones). Maybe this is associated with poor-quality image datasets (e.g., CIFAR-10), where the images might look a little artificial. Could you add a short comment on this?

-- Table 1 (p. 5). There seems to be a typo in the first and second models, since the HYPE_time values are outside their respective 95% CIs.

------ After rebuttal. I appreciate the answers and comments of the authors. I think this is a very interesting work that clearly deserves to be published in this venue.
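
To make the Table 1 remark concrete, the following is a minimal sketch of the kind of sanity check meant there: compute an unlimited-time error rate as described in this review, attach a 95% confidence interval, and confirm that the point estimate lies inside it. The percentile bootstrap, the function names, and the toy judgments are all assumptions made for illustration; none of them is taken from the paper or its data.

```python
import random


def error_rate(judgments):
    """Fraction of misclassified images.

    `judgments` is a list of (is_real, judged_real) boolean pairs
    (a hypothetical data layout chosen for this sketch).
    """
    wrong = sum(1 for is_real, judged_real in judgments if is_real != judged_real)
    return wrong / len(judgments)


def bootstrap_ci(judgments, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the error rate (an assumed choice of CI)."""
    rng = random.Random(seed)
    n = len(judgments)
    stats = sorted(
        error_rate([judgments[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up toy data: 50 real and 50 fake images, 25 of them misjudged.
judgments = (
    [(True, True)] * 40      # real, judged real (correct)
    + [(True, False)] * 10   # real, judged fake (error)
    + [(False, True)] * 15   # fake, judged real (error)
    + [(False, False)] * 35  # fake, judged fake (correct)
)

score = error_rate(judgments)
ci_low, ci_high = bootstrap_ci(judgments)

# A reported value that falls outside its own interval suggests a typo.
inside = ci_low <= score <= ci_high
print(f"error rate = {score:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}], inside = {inside}")
```

If the per-rater judgments were released, as requested under reproducibility, a check of this kind could be run directly on the published numbers.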

Reviewer 3

In this work, two new benchmarks are introduced for better evaluation of generative models. The first metric is HYPE_time, which computes the minimum amount of time it takes a person to distinguish whether an image is real or fake. The second metric, HYPE_infinity, measures the errors people make given unlimited time, and is much faster to compute and more cost-effective. The authors test these two metrics on multiple datasets using many models and show that HYPE is reproducible and can be used to separate the efficacy of different models. Further, HYPE was shown to be a more reliable predictor than alternative automated measures. This metric can be used to compare different generative approaches more consistently and to provide a foundation for research in this area.

Strengths:
- A benchmark for generative models can be very useful for the community, enabling more consistent evaluation of new methods.
- The authors provided a thorough set of experiments for the two metrics, HYPE_time and HYPE_infinity, across different models and datasets.

Weaknesses:
- There was no extended discussion of related work, even though other metrics have been proposed for evaluating generative models. The authors should justify why their metrics are more effective and how they differ from other benchmarks.

Originality: There are a few related works on developing benchmarks for generative models, listed below. It would be great to get a justification for why the proposed metrics are preferable.
Xu, Qiantong, et al. "An empirical study on evaluation metrics of generative adversarial networks." arXiv preprint arXiv:1806.07755 (2018).
Wang, Zhengwei, et al. "Neuroscore: A Brain-inspired Evaluation Metric for Generative Adversarial Networks." arXiv preprint arXiv:1905.04243 (2019).

Quality: The quality of the work is high. The measures were based on prior research, which motivates the choice of HYPE. The authors tested a variety of models with many datasets and showed the consistency of the metrics' predictions of model performance.

Clarity: The paper was well written. All of the included details about the datasets and models were helpful in understanding the evaluation of HYPE. There were a couple of typos:
- Pg 5: the results of HYPE_infinity approximates the those from → the results of HYPE_infinity approximate those from
- Pg 7: one for each of object classes → one for each of the object classes

Significance: The addition of a benchmark for evaluating generative models can be very useful to the field, as having standardized metrics allows for more consistent comparison of models. Having reproducible measures is incredibly important for advancing research in this space.

-----------------------
I read the author response and am satisfied with the authors' discussion about how the work differs from prior literature.
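
As an illustration of what comparing a human score against an automated measure involves, here is a minimal sketch of a Spearman rank correlation in plain Python. It compares only the orderings the two measures induce over a set of models, which is the relevant quantity when scores are read as rankings. The four score pairs are invented for the example and are not values from the paper; the function names are likewise hypothetical.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Invented scores for four hypothetical models.
human_scores = [0.10, 0.25, 0.05, 0.40]  # higher = judged more realistic
auto_metric = [30.0, 12.0, 45.0, 8.0]    # FID-like: lower = better

rho = spearman(human_scores, auto_metric)
print(f"Spearman rho = {rho:.2f}")
```

Because the automated measure in this toy example is lower-is-better, perfect agreement between the two rankings shows up as a correlation of -1 rather than +1.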

Paper ID:	1908