NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 9187
Title:BehaveNet: nonlinear embedding and Bayesian neural decoding of behavioral videos

Reviewer 1

This paper proposes a probabilistic framework that combines nonlinear (convolutional) autoencoders with ARHMMs to model videos from neuroscience experiments. The authors then use these representations to build Bayesian decoders that can produce full-resolution frames based only on the neural recordings. I find the motivation of this paper - building tools to study the relationship between neural activity and behavior from a less reductionist approach - extremely valuable. I have, however, the following concerns.

First, this work is closely related to Wiltschko et al., the main difference being the use of nonlinear autoencoders instead of PCA. However, the difference between the linear and nonlinear AE reconstructions shown in the supplemental videos is not very noticeable. What are the units of MSE in Figure 2? How large, in pixels, is the improvement in decoding videos from neural data when using a CAE as opposed to PCA?

Secondly, the authors refer to "behavioral syllables", but while in Wiltschko et al. there is an explicit mapping between states and behavioral subcomponents (walk, pause, low rear), such a mapping is missing in this paper. Can the resulting discrete states be mapped to any behaviorally interpretable state? In Fig. 3 the authors show that the sequence of states is task-related; can any of the states be related to task-related computations (left vs. right responses, correct vs. incorrect trials, etc.)?

Thirdly, the analyzed behaviors are very constrained, so it is not very surprising that the videos can be encoded in 12-16 latent variables. Have the authors tried freely moving animals? Given that the whole idea is to analyze behavior from a less constrained perspective, a less restricted behavior would be more interesting, in my view.

[L210] "Nonetheless, many of the generated frames look qualitatively similar to real frames (Fig. 3D) and we can see the mouse transitioning between still and moving in a fairly natural way in the sampled videos, indicating that the generative model is placing significant probability mass near the space of real behavioral videos." This is, of course, really difficult to evaluate, but the supplementary videos provided raise some concerns: in NP, many reconstructed video frames are static, without any movement, which does not seem to be the case in the original video, and there are many segments with more than two paws reconstructed. In WFCI, the spouts move in the generative reconstructions, while they do not move in the original frames.

The decoding of videos purely from neural activity is a very interesting approach. One thing that confused me in supplementary video3_WFCI is the reconstruction of the two round structures in the lower part of the image. What are those, and were they controlled by the animal?

Small details: reference to Fig. 1.

===UPDATE===

I had three main critiques in my review, which largely overlap with the points raised by the other reviewers: (1) Using a nonlinear autoencoder doesn't seem to make a difference (at least qualitatively). (2) The inferred "behavioural syllables" are not interpretable. (3) The considered behaviours are too constrained. I have carefully read the authors' feedback and appreciate the responses to all my points. However, I still believe that my points stand, and therefore won't change my score at this time.

Point (1), the use of a CAE, is one of the main novelties of the paper, but it is unclear how useful it is. The authors claim that even if we see no qualitative difference and the MSE is similar to that of a linear AE, the CAE could help by reducing the number of parameters. However, as far as I understand, we have no evidence for that.

More importantly, re point (2), the authors emphasise that the main point of the paper is "to provide a set of tools for understanding the relationship between neural activity and behavior". However, this is in contrast with their claim that "interpretability of the behavioural syllables [...] is not our main concern; rather, we use the ARHMM as a prior model of behavior which is then incorporated into the Bayesian decoder". As they later note, there are hints that the inferred states could be more closely related to the animal's state (they have structure related to the task), but this is not pursued in depth, in my view. I think this work is potentially very useful, but it could be greatly improved if the authors showed more explicitly how this method can provide insights into the relationship between neural activity and behaviour.
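The parameter-count argument in point (1) can be made concrete with a back-of-the-envelope sketch. The frame size, latent dimensionality, and encoder architecture below are illustrative assumptions, not taken from the paper; the point is only that convolutional weight sharing makes the encoder's parameter count largely independent of image resolution, whereas a linear encoder scales with the number of pixels.

```python
# Illustrative encoder parameter counts for a linear vs. convolutional
# autoencoder, assuming a 128x128 grayscale frame and a 12-dimensional
# latent space (both numbers are assumptions, not from the paper).
frame_pixels = 128 * 128
latent_dim = 12

# Linear AE encoder: one dense map from pixels to latents (plus bias).
linear_params = frame_pixels * latent_dim + latent_dim

# CAE encoder: a few small conv layers (weights shared across space),
# then a dense readout from the downsampled feature map.
def conv_params(c_in, c_out, k=3):
    # Weights (c_in * c_out * k * k) plus one bias per output channel.
    return c_in * c_out * k * k + c_out

cae_params = (
    conv_params(1, 32)      # 128x128 -> 64x64 after a stride-2 conv
    + conv_params(32, 32)   # 64x64 -> 32x32
    + conv_params(32, 16)   # 32x32 -> 16x16
    + 16 * 16 * 16 * latent_dim + latent_dim  # dense readout to latents
)

print(f"linear encoder params: {linear_params:,}")
print(f"CAE encoder params:    {cae_params:,}")
```

Under these assumptions the CAE encoder uses roughly a third of the parameters of the linear one, which is the kind of evidence the rebuttal could have quantified directly.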

Reviewer 2

The paper is very well written and the authors are honest about their results. The subject of the work is also very important to the neuroscience community. Although the methods are not original, they are combined very well in the pipeline.

The final results, however, do not look complete to me. The most important use of automated data-analysis algorithms for science is to provide interpretable results, e.g. the demixed data in Kobak et al. 2016, or the interpretable components of TCA in Williams et al. 2018. To my understanding, the presented framework does not provide any interpretable results. Even if the states are not interpretable, the authors could show whether the framework predicts the same state at a specific event in each trial, e.g. the lever grab; I think Figure 3c suggests that it does not.

Also, regarding generating the video from neural data, the current results could be the product of learning the structure of the video and its changes over time, especially because very similar events happen in each trial (things that differ between trials, such as the animal's paws, are not captured very well). In fact, the current results look like an average over all trials. I think it is important to see how using neural data actually improves the prediction (for example, compared to the average over all trials at each time step).
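The trial-average baseline suggested above is straightforward to compute. The sketch below uses synthetic arrays with hypothetical shapes (trials x timesteps x latents); `decoded` stands in for the neural decoder's predictions and is not the paper's actual output. A decoder that only learned trial-invariant video structure would match the baseline's error but not beat it.

```python
import numpy as np

# Hypothetical data: 100 trials, 50 timesteps, 12 latent dimensions.
rng = np.random.default_rng(0)
true_latents = rng.normal(size=(100, 50, 12))
# Stand-in for neural-decoder predictions (true latents plus small noise).
decoded = true_latents + 0.1 * rng.normal(size=true_latents.shape)

# Baseline: the per-timestep average over all trials, broadcast back to
# every trial. This captures only trial-invariant structure.
baseline = np.broadcast_to(
    true_latents.mean(axis=0, keepdims=True), true_latents.shape
)

mse_baseline = np.mean((true_latents - baseline) ** 2)
mse_decoder = np.mean((true_latents - decoded) ** 2)
print(f"trial-average baseline MSE: {mse_baseline:.4f}")
print(f"neural decoder MSE:         {mse_decoder:.4f}")
```

Reporting both numbers (in the same units, per pixel or per latent) would show directly how much the neural data contributes beyond the stereotyped trial structure.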

Reviewer 3

The paper clearly describes the method and is easy to read. The originality of the paper is in the composition of CAE+ARHMM modeling of animal behavior videos, combined with SLDS decoding of behavior from neural activity. This is an impressive array of statistical methods, and I am excited to see how well it works. The biggest question is whether the model works well compared to simpler methods. It is nice to see qualitative results of the method in the supplemental videos.