Review for NeurIPS paper: HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

NeurIPS 2020

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Meta Review

This work initially received mixed reviews, but after the author feedback cleared up a misunderstanding, most reviewers now recommend acceptance. Nevertheless, I think R2 (who has not raised their score) has some valid concerns, which I want to account for in my decision. I have decided to recommend acceptance.
The experimental section of this work is fairly comprehensive, and adequately demonstrates that the proposed architecture is effective. However, it is important to point out that the majority of experiments were conducted using ground-truth mel-spectrogram conditioning, which does not match the usual practical setting of TTS systems, where the spectrograms are themselves generated by a model (and thus imperfect). I would encourage the authors to make this abundantly clear in the manuscript, and to consider including additional experiments in the "true" TTS setting to balance things out.
R2 points out that both the MPD and MSD architectures show a lot of similarities to the random window discriminators from the GAN-TTS paper. I think this is a fair observation, and it would be helpful to discuss the differences in more detail in the updated manuscript (e.g. the weight sharing after the initial "reshape", and the use of prime downsampling factors). The authors have committed to this in their feedback. In particular, it is important to accurately convey which ideas are novel and which have previously been proposed in the literature, so the reader does not come away with the wrong impression. Furthermore, any claim that these architectural choices have a better inductive bias for periodic signals should be properly motivated, both theoretically and empirically. R2 also suggests an additional ablation comparing MPDs with prime and non-prime factors, leaving everything else unchanged, and I think this would be a useful addition.
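
To make the suggested prime versus non-prime comparison concrete, here is a minimal sketch of what the period-based "reshape" could look like. It assumes that the reshape folds a 1D waveform of length T into rows of width p, so that each column holds every p-th sample. The helper names (`fold_by_period`, `column`, `shared_stride`) are hypothetical and not taken from the paper, and the zero padding is a simplification chosen for this illustration. This is not the authors' implementation.

```python
from math import gcd


def fold_by_period(signal, period):
    """Fold a 1D sequence into rows of width `period`.

    Column j of the result holds samples j, j + period, j + 2 * period, ...
    Zero padding at the end is an assumption made for this sketch.
    """
    samples = list(signal)
    pad = (-len(samples)) % period  # samples needed to reach a multiple of period
    samples += [0] * pad
    return [samples[i:i + period] for i in range(0, len(samples), period)]


def column(folded, j):
    """Return column j of a folded signal, i.e. every period-th sample."""
    return [row[j] for row in folded]


def shared_stride(p, q):
    """Spacing between sample indices that fall in column 0 for both periods.

    This is the least common multiple of p and q. For coprime periods it
    equals p * q, so the two views share few samples.
    """
    return p * q // gcd(p, q)


if __name__ == "__main__":
    x = list(range(24))  # sample indices stand in for audio samples
    print("period 3, column 0:", column(fold_by_period(x, 3), 0))
    print("period 4, column 0:", column(fold_by_period(x, 4), 0))
    for p, q in [(2, 3), (3, 5), (2, 4), (4, 8)]:
        print(f"periods {p} and {q} share every {shared_stride(p, q)}-th sample")
```

Under these assumptions, coprime periods such as 2 and 3 share only every 6th sample in column 0, whereas for periods 4 and 8 every sample seen in a period-8 column is also seen in a period-4 column. An ablation that swaps prime for non-prime factors, with everything else held fixed, would test whether this reduced overlap matters empirically.
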
I concur with R1 that the choice of baselines for this work is appropriate, and the addition of an experimental comparison to WaveRNN or Parallel WaveNet is not a requirement to meet the bar for acceptance. That said, the differences with these approaches should at least be discussed qualitatively in the context of fast real-time synthesis (I believe this is already the case to some extent for Parallel WaveNet, but not WaveRNN). One small note regarding the author feedback: the original WaveNet model did not use mel-spectrogram conditioning -- vocoder variants of WaveNet were only introduced in later works. I wanted to point this out, in case the authors were intending to reuse this part of the author feedback in their manuscript. Given the recent publication of a paper with a very similar name (https://arxiv.org/abs/2006.05694), as pointed out by R4, the authors may also want to consider a name change. This is merely a suggestion, not a requirement on our part.