NeurIPS 2020

A Spectral Energy Distance for Parallel Speech Synthesis

Meta Review

This paper proposes a strategy for parallel TTS based on a spectral energy distance. It relies on neither explicit likelihood optimization nor adversarial learning, and as a result enjoys more stable and consistent training. On top of that, the authors introduce a repulsive term that is shown to significantly improve the quality of the generated speech. When the approach is combined with adversarial training, speech quality improves further. Overall, this is an interesting work, technically solid and experimentally compelling. All reviewers support acceptance. The rebuttal engages well with the reviewers' comments and makes the work more convincing. Please complete what remains from the rebuttal and revise the paper accordingly for the final version.