Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
a. I like the idea of using two VAEs to model dance at different levels. Generating complex sequences is very challenging, so decomposing the generative process into stages makes a lot of sense.
b. The proposed dance synthesis model uses an autoregressive approach to generate the dance sequence, which simplifies the sequence generation process. The low-level VAE decomposes a dance unit into an initial pose and a movement. The high-level VAE models the movement sequence and shares a latent space with the output of the music VAE.
c. The adversarial losses and reconstruction losses are carefully designed to improve the naturalness of the generated dance.
d. The video demo clearly shows that the proposed model outperforms the baselines.
e. Paper organization and presentation are good.
Originality - The work is more than moderately original.
Quality - The quality of the work/experiment/evaluation is high.
Clarity - The paper is structured well and written nicely, but I have several comments below.
Significance - The work is moderately significant. The impact on the same task would be big but is limited to the area around it.
---- comments ----
Title - I strongly suggest changing the title. "Dance to music" can be a nickname for this paper but not a title. I don't think I need to list everything that makes a good title.
Abstract - "top-down" and "bottom-up" don't add any information and therefore seem unnecessary. I can't think of any non-top-down analysis, and I was actually confused by these words because I thought they might mean some very special kind of analysis or synthesis.
L21 - "Inspired by the above observations" -- which observations exactly? It is unclear to me.
L31 - Throughout the paper, "multimodality" is undefined and could simply be replaced with "diversity", because that is what it really means. In the experiments there are two different kinds of diversity measures (and only then was I sure that it means diversity), but they could be called "XX diversity" and "YY diversity". "Multimodality" in the sense of diversity is commonly used in the GAN literature, but the word is more likely to be read as something else (e.g., multi-domain, like audio and video), and it is therefore confusing.
L67 and L77 - These two concepts are not parallel. Also, the two paragraphs seem somewhat redundant and could be compressed if the authors need more space.
L107 - "a fixed number of poses": how many?
Sections 3.1 and 3.2 overall - Clearer and more explicit hypotheses and assumptions would be nice. In building this structure and planning the proposed approach, what is assumed? As in probably all other works, there are assumptions that allow the authors to model the whole system in this way, e.g., the use of VAEs, some hyperparameters, etc. It is already good, but I think it can be slightly improved.
L146 - More detail on the music style classifier is necessary, or at least a reference. I was surprised not to find this in the supplementary material.
L192 - L198 - This looks like a legitimate choice, but again, the details of these systems are absolutely necessary.
L205 - L221 - Although it's not bad to have this information, at the end of the day it is completely subjective, and one could write exactly the same content with cherry-picked examples. I think this should be more compact and probably mentioned only after all the quantitative results are shown.
L223 - Again, I don't see why we should call it multimodality and not diversity.
Section 4.3 - It would be nicer to state more explicitly that this quantitative result still comes from a subjective test.
L237, L240 - "Style consistency" can mean a lot of things, e.g., consistency over time. Is there a better way to describe it?
L238 - 50 subjects - who are they?
L250 - L252 - The action classifier should be elaborated much, much more than this.
References and background - "J. Lee et al., 2018 Nov" could be discussed too, especially considering its timeliness.
Learning to generate dance according to a given piece of music is an interesting task and could be beneficial to artists in related areas. Although both adversarial learning and reconstruction losses are widely used in various generation tasks, they had never been applied to this new task before this work. Therefore, I recognize the methodological innovation made by this application work. The evaluation includes both quantitative and qualitative results. From the quantitative results (on automatic metrics and human judgment), it looks like the improvement over the selected baselines is significant. The authors also provide a video in the supplementary material that shows how the generated dance looks. Overall, I think the paper makes decent contributions to AI research and industry; however, I have several concerns (suggestions):
1. The authors highlight the decomposition of a dance session into dance units as their innovation. However, from the description in the supplementary material, they just divide the dance session into small pieces of 32 frames (2 seconds) each. Thus my understanding is that the dance unit is independent of the kinematic beat or onset strength. Then what is special about the dance unit?
2. Dance generation is not totally new. The following works study the same problem with deep learning techniques but are ignored by the authors:
a. Generative Choreography using Deep Learning
b. Dance with Melody: An LSTM-autoencoder Approach to Music-oriented Dance Synthesis
I suggest the authors compare their method with these existing ones.
3. Long sequence generation is a big challenge for DL-based models due to exposure bias. It is common for a model to output similar units (e.g., poses in the context of dance generation) after a few steps. Therefore, I doubt whether the proposed method can really generate long sequences, since 20 seconds is not long.
4. Poses in the selected dance styles are relatively simple. Have you tried generating any pop dances with complicated poses?
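To make concern 1 concrete, the following minimal sketch contrasts the fixed-length segmentation described in that concern (32 frames per unit) with a hypothetical beat-aligned alternative. The function names, the `beats_per_unit` parameter, and the example beat positions are assumptions made for illustration only; this is not the authors' implementation.

```python
def fixed_length_units(num_frames, unit_len=32):
    """Split a sequence into consecutive fixed-length units.

    Returns (start, end) frame index pairs, end exclusive.
    A trailing partial unit is dropped.
    """
    return [(s, s + unit_len)
            for s in range(0, num_frames - unit_len + 1, unit_len)]


def beat_aligned_units(beat_frames, beats_per_unit=2):
    """Hypothetical alternative: place unit boundaries on detected beats.

    beat_frames is a sorted list of frame indices where a beat occurs.
    Every beats_per_unit-th beat becomes a unit boundary, so unit
    lengths vary with the tempo instead of being constant.
    """
    bounds = beat_frames[::beats_per_unit]
    return list(zip(bounds[:-1], bounds[1:]))


if __name__ == "__main__":
    # Made-up beat positions, not taken from any real recording.
    beats = [0, 14, 30, 47, 61, 78, 95]
    print(fixed_length_units(100))    # boundaries ignore the beats
    print(beat_aligned_units(beats))  # boundaries follow the beats
```

With fixed-length units every boundary is determined by the frame count alone, which is the sense in which the unit is independent of the kinematic beat or onset strength.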