Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Originality: Although phoneme duration prediction is widely adopted in conventional TTS systems, jointly training it in a neural TTS model is new. This paper is one of the first works on non-autoregressive text-to-spectrogram modeling.

Quality: This paper seems sound overall, except for a few issues in the comments below. Some of these issues must be addressed before acceptance.

Clarity: A well-written paper. A good read to me, except for a few comments below.

Significance: The advantages over its autoregressive counterparts are significant, especially for industrial use. It is likely to be followed by the research community as well as the industry.

Comments:
1. It's interesting to see the use of 2-head attention for the Transformer block, instead of more popular settings such as 8 heads in the baseline Transformer TTS model. Does it bring benefits?
2. What is the reason for using the mel-spectrogram generated by the autoregressive model for distillation training, instead of using the ground-truth mel-spectrogram? Intuitively, the ground truth gives more accurate information.
3. Sec. 4.3 says that the FastSpeech model is partially initialized from the autoregressive Transformer TTS model (phoneme embeddings and FFT blocks) as they share the same architecture. However, the hyperparameters given in Appendix A as well as in Sec. 4.2 show that these two models use different dimensions for these components.
4. The pre-net and post-net of the baseline autoregressive Transformer TTS, as well as the decoder's final linear layer of FastSpeech, seem to be missing from the hyperparameter comparison in Appendix A.
5. Experiment results on inference speedup: what batch size was used for this evaluation?
6. The latency numbers in Table 2 and Figure 2 seem inconsistent. The numbers in Table 2 seem unrealistically fast.
7. Robustness experiment: since you have included Tacotron 2 in Table 1, it would be nice to also include Tacotron 2 in Table 3. Tacotron 2 is another widely discussed attention-based model that is also considered to suffer from robustness issues due to attention failure. It would be interesting to include such results for comparison.
8. The CMOS evaluation needs a reference.

==============

Update: Thanks for the authors' response. I have updated my score accordingly.
The authors propose a non-autoregressive, parallel text-to-mel-spectrogram model that allows a significant speedup in text-to-speech generation. The underlying model is based on a feed-forward Transformer, extended with auxiliary machinery for predicting the length and duration of the underlying phonemes (i.e., the input is phoneme-based, not word-based). To train the whole system, the approach still requires an autoregressive teacher model to work out the phoneme durations. The model does not seem sensitive to spurious generative errors like repetitions or omissions. For waveform generation, another non-autoregressive vocoder, WaveGlow, is used (not a contribution). Overall, the study seems fair and reproducible, proposes several solutions for existing shortcomings in e2e TTS systems, and offers a large speedup while preserving the quality of autoregressive models. It's a good paper.

Would it make a difference if you assumed access to a pronunciation dictionary? G2P may introduce some errors along the way (though it is good that the system works with it regardless).

You could probably enforce monotonicity in the attention aligner, rather than scoring the heads based on which one behaves as you hope.

I would definitely cite "FFTNet: A Real-Time Speaker-Dependent Neural Vocoder" (2018), as you borrow a number of blocks and ideas from that paper (like 1D convolutions in a TTS context, etc.).

Minor: several unnecessary repetitions.

====

Update: Thanks for answering my concerns. Regarding the G2P point: you should make it explicit in the paper that your system requires a pronunciation dictionary.
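For context on the "score them based on which one behaves as you hope" remark: the teacher-based duration extraction being criticized can be sketched roughly as below. This is an illustrative approximation, not the authors' code; the function names are hypothetical, and the focus-rate scoring (mean per-frame max attention weight) is paraphrased from the paper's description.

```python
import numpy as np

def focus_rate(attn):
    # attn: [T_dec, T_enc] attention weights of one head (rows sum to 1).
    # Average of the per-frame max weight; a higher value means the head
    # concentrates on one phoneme per frame, i.e. behaves like a hard alignment.
    return float(np.mean(np.max(attn, axis=1)))

def extract_durations(attn_heads):
    # attn_heads: list of [T_dec, T_enc] matrices from the teacher's decoder.
    # Pick the most "alignment-like" head by focus rate, then count how many
    # mel frames each phoneme wins under a per-frame argmax.
    best = max(attn_heads, key=focus_rate)
    phoneme_idx = np.argmax(best, axis=1)            # [T_dec]
    durations = np.bincount(phoneme_idx, minlength=best.shape[1])
    return durations  # durations[i] = number of frames assigned to phoneme i
```

Nothing here enforces monotonicity: a head can score a high focus rate while jumping back and forth across phonemes, which is exactly the gap the monotonic-aligner suggestion would close.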
Comments:
1. The audio quality and inference speedup are impressive.
2. In Section 3.2 (Length Regulator), the hidden states of the phoneme sequence are simply repeated, much like what Gu et al. did in "Non-Autoregressive Neural Machine Translation". However, the advantage of attention-based sequence-to-sequence speech synthesis models is the soft alignment between phonemes and spectrograms. Empirically, soft attention gives better prosody and more natural speech. Won't the hard alignments (rounding and repetition) hurt the performance of the proposed model?
3. In Section 3.3 (Duration Predictor), the proposed focus rate F has nothing to do with "measuring how close an attention head is to diagonal". The focus rate roughly measures the overall confidence of the attention alignments, but it does not constrain the alignments to be close to diagonal. Also, it is hard to understand the behavior of each head in multi-head attention, and in many cases the attention does not have any clear visual meaning at all. Why, then, are diagonal alignments good, and what if there is NO diagonal alignment in the multi-head attention?
4. The title is improper. In TTS, "controllable" usually refers to prosody or pitch diversity under expressive settings. In Section 5, the voice-speed control and breaks-between-words control are trivial for TTS models; they should not be called "controllable".
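The hard "rounding and repetition" mechanism questioned in comment 2 can be sketched as follows. This is a minimal illustration assuming per-phoneme integer durations from the duration predictor; the function name `length_regulator` and the speed factor `alpha` are illustrative, not taken from the authors' code.

```python
import numpy as np

def length_regulator(hidden, durations, alpha=1.0):
    # hidden: [T_enc, d] phoneme hidden states; durations: [T_enc] frame counts.
    # alpha rescales durations for voice-speed control; results are rounded,
    # which is exactly the hard (non-soft) alignment the comment asks about.
    scaled = np.maximum(np.round(np.asarray(durations) * alpha).astype(int), 0)
    return np.repeat(hidden, scaled, axis=0)  # [sum(scaled), d]

h = np.arange(6, dtype=float).reshape(3, 2)   # 3 phonemes, hidden size 2
out = length_regulator(h, [2, 1, 3])          # shape (6, 2): states repeated 2/1/3 times
```

Each mel frame thus copies exactly one phoneme's hidden state, in contrast to soft attention, where every frame is a weighted mixture over all phonemes.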