{"title": "BehaveNet: nonlinear embedding and Bayesian neural decoding of behavioral videos", "book": "Advances in Neural Information Processing Systems", "page_first": 15706, "page_last": 15717, "abstract": "A fundamental goal of systems neuroscience is to understand the relationship between neural activity and behavior. Behavior has traditionally been characterized by low-dimensional, task-related variables such as movement speed or response times. More recently, there has been a growing interest in automated analysis of high-dimensional video data collected during experiments. Here we introduce a probabilistic framework for the analysis of behavioral video and neural activity. This framework provides tools for compression, segmentation, generation, and decoding of behavioral videos. Compression is performed using a convolutional autoencoder (CAE), which yields a low-dimensional continuous representation of behavior. We then use an autoregressive hidden Markov model (ARHMM) to segment the CAE representation into discrete \"behavioral syllables.\" The resulting generative model can be used to simulate behavioral video data. Finally, based on this generative model, we develop a novel Bayesian decoding approach that takes in neural activity and outputs probabilistic estimates of the full-resolution behavioral video. 
We demonstrate this framework on two different experimental paradigms using distinct behavioral and neural recording technologies.", "full_text": "BehaveNet: nonlinear embedding and Bayesian neural\n\ndecoding of behavioral videos\n\nEleanor Batty*, Matthew R Whiteway*, Shreya Saxena, Dan Biderman, Taiga Abe\n\nColumbia University\n\nerb2180,m.whiteway,ss5513,db3236,ta2507 @columbia.edu\n\nSimon Musall\n\nCold Spring Harbor\n\nsimon.musall@gmail.com\n\nWinthrop Gillis\n\nHarvard Medical School\nwin.gillis@gmail.com\n\nJe\ufb00rey E Markowitz\nHarvard Medical School\n\nAnne Churchland\nCold Spring Harbor\n\nJohn Cunningham\nColumbia University\n\njeffrey_markowitz@hms.harvard.edu\n\nchurchland@cshl.edu\n\njpc2181@columbia.edu\n\nSandeep Robert Datta\nHarvard Medical School\n\nsrdatta@hms.harvard.edu\n\nScott W Linderman\u2020\nStanford University\nswl1@stanford.edu\n\nLiam Paninski\u2020\n\nColumbia University\n\nliam@stat.columbia.edu\n\nAbstract\n\nA fundamental goal of systems neuroscience is to understand the relationship\nbetween neural activity and behavior. Behavior has traditionally been characterized\nby low-dimensional, task-related variables such as movement speed or response\ntimes. More recently, there has been a growing interest in automated analysis of\nhigh-dimensional video data collected during experiments. Here we introduce a\nprobabilistic framework for the analysis of behavioral video and neural activity.\nThis framework provides tools for compression, segmentation, generation, and\ndecoding of behavioral videos. Compression is performed using a convolutional\nautoencoder (CAE), which yields a low-dimensional continuous representation\nof behavior. We then use an autoregressive hidden Markov model (ARHMM) to\nsegment the CAE representation into discrete \u201cbehavioral syllables.\u201d The resulting\ngenerative model can be used to simulate behavioral video data. 
Finally, based on\nthis generative model, we develop a novel Bayesian decoding approach that takes in\nneural activity and outputs probabilistic estimates of the full-resolution behavioral\nvideo. We demonstrate this framework on two di\ufb00erent experimental paradigms\nusing distinct behavioral and neural recording technologies.\n\nUnderstanding the complex relationship between neural activity and behavior requires a thorough\ncharacterization of behavior across multiple timescales. Behavior has traditionally been characterized\nby low-dimensional, task-related variables such as reaction times, or the position of a joystick, or the\nspeed of a wheel turn. These require specialized sensors set up by the experimenter, necessitating\nlaborious testing and calibration.\nOf course, behavior is in reality potentially very high-dimensional, and there is a growing appreciation\nthat to understand neural activity we need to monitor behavior in a less simplistic (and labor-intensive)\n\n\u2217Equal contributions\n\u2020Joint senior authors\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fway [1, 2, 3, 4]. There has recently been a growing interest in automated analysis of video data collected\nduring experiments, aimed at extracting richer, higher-dimensional representations of behavior. For\nexample, the last couple years have seen dramatic improvements in markerless tracking of body parts\n[5, 6, 7]. These tracking methods have opened up a range of exciting new studies, but come with some\ndrawbacks. Tracking methods are supervised, and therefore require user e\ufb00ort to label training images.\nFurthermore, simply tracking a few body parts may not capture all of the useful information in the\nvideo. For example, subtle changes of facial expression or body pose may be di\ufb03cult to track with a\nfew markers. 
Moreover, the tracked landmarks are chosen by the experimenter, and as such, important\nvariables may potentially be missed. Another drawback to tracking methods is that occlusion or\nmovement out of frame may cause markers to be dropped; if downstream analyses do not properly\nhandle missing data these frames must be excluded from analysis.\nIn a separate thread of work, fully-unsupervised linear dimensionality reduction methods such as\nSingular Value Decomposition (SVD) have been used to analyze behavioral videos [8, 9], but these\napproaches require a large number of dimensions to represent behavioral videos (typically, >200\ndimensions are chosen), potentially hampering downstream analyses. In fact, we have no reason to\nexpect that images of a moving animal can be represented in a low-dimensional linear vector space,\nas required for SVD to be an e\ufb00ective model.\nOnce low-dimensional time series corresponding to behavior have been obtained \u2014 whether through\nsupervised or unsupervised methods \u2014 we would like to model the dependence of neural activity\non these behavioral signals, after characterizing this behavior at di\ufb00erent timescales. Most previous\napproaches have focused on directly mapping the extracted signals into neural activity, i.e., \ufb01tting\n\u201cencoding models\u201d that predict neural responses from the observed behavioral signals [8, 9]. However,\na number of alternative analysis approaches are possible [10, 11], including unsupervised modeling of\nthe full behavioral video [12], decoding behavior directly from neural signals [13, 14, 15], or jointly\nmodeling both the behavior and neural signals [16, 17].\nHere we introduce a probabilistic framework for the unsupervised analysis of behavioral video,\nwith semi-supervised decoding of neural activity. This framework provides tools for compression,\nsegmentation, generation, and decoding of behavioral videos. 
Compression is performed using\nCAEs, which yield a low-dimensional, continuous representation of behavior that requires fewer\ndimensions than linear methods (e.g., SVD) to obtain the same video reconstruction error. We\nthen use an ARHMM to segment the CAE representation into discrete \u201cbehavioral syllables.\u201d The\nresulting generative model can be used to simulate behavioral video data. Finally, we exploit this\ngenerative model to construct Bayesian decoders which take in neural activity and output probabilistic\nestimates of the full-resolution behavioral video. We demonstrate the use of this framework using two\npopular experimental paradigms and neural recording technologies: Neuropixel multielectrode array\nrecordings during spontaneous behavior in head-\ufb01xed mice [9, 18], and wide\ufb01eld calcium imaging\nfrom task-engaged head-\ufb01xed mice [8, 19].\nRelated work. We build upon a rich literature of behavioral analysis. Stephens et al. [20] showed that\nthe posture of the nematode C. elegans is captured by a low dimensional subspace of \u201ceigenworms.\u201d\nStudies have been performed on pose and posture estimation in other model organisms with similar\nresults [21, 22, 23]. Our work is inspired by Wiltschko et al. [24] on characterizing mouse behav-\nior. In this work, behavioral videos of freely behaving mice are compressed using PCA, followed\nby segmentation via ARHMMs. Using these models, the authors identify behavioral syllables in\nmouse behavior such as rearing and grooming. Extending this work, Johnson et al. [12] combined\ncompression and time series modeling in a structured variational autoencoder. Related models have\nbeen developed for other model organisms, including C. elegans [25, 26] and larval zebra\ufb01sh [27].\nRecently, Markowitz et al. [28] has used these methods to identify speci\ufb01c neural representations of\nbehavioral syllables.\nParallel advances have been made in the analysis of neural time series. 
Sequential variational\nautoencoders [29, 30] and recurrent state space models [31, 32, 33] capture low dimensional structure\nin neural activity and relate it to sensory inputs and motor outputs. These build on a long line of\nwork decoding movement from neural activity which we do not have room to adequately review here\n[13, 14, 15]. Most of these approaches have a low-dimensional output that consists of either the\nelectromyography (EMG) of a handful of muscles in the limb, or the kinematics of the end-e\ufb00ector,\nor both. They do not capture the high-dimensional facial movements or bimanual arm poses that are\nthe focus of this study.\n\n2\n\n\fFigure 1: Graphical model showing the architecture we use for the neural decoding of continuous and discrete\nstates estimated directly from the behavior. Example neural and behavior data shown for the WFCI dataset, as\ndetailed in the text.\n\nFinally, outside of the motor decoding literature, a couple threads of work on image and speech\ndecoding from neural signals are particularly relevant: Parthasarathy et al. [34] developed approximate\nneural network Bayesian decoding methods to decode high-dimensional natural images directly from\npopulations of retinal ganglion cells, while Akbari et al. [35] and Anumanchipalli et al. [36] used\nstructured neural network approaches to decode high-resolution speech signals from neural activity.\n\nMethods\n\nWe begin by describing the datasets used in this work (data splits are described in Appendix A). Then\nwe describe the methods used for compression, segmentation, and decoding.\nWide\ufb01eld Calcium Imaging (WFCI) dataset [8, 19]. A head-\ufb01xed mouse performed a visual decision-\nmaking task while neural activity across dorsal cortex was optically recorded using wide\ufb01eld calcium\nimaging. We used the LocaNMF decomposition approach to extract signals from the calcium imaging\nvideo [37]. 
Behavioral data was recorded using two cameras (one side view and one bottom view; Fig. 2B, left); grayscale video frames were downsampled to 128x128 pixels. Data consists of 1126 trials across two sessions, with 189 frames per trial (30 Hz framerate). Neural activity was acquired at the same frame rate.

Neuropixels (NP) dataset [9, 18]. A head-fixed mouse behaved freely (including spontaneous manipulation of a wheel with its forelimbs) while neural activity across multiple brain structures was electrically recorded using eight Neuropixels probes [38]. Behavioral data was recorded using a single camera (Fig. 2B, center); grayscale video frames were downsampled to 112x192 pixels. Data consists of 96k frames (40 Hz framerate), and “trials” were arbitrarily defined as blocks of 1000 frames. Neural activity was binned at the video frame rate.

Neuropixels-zoom (NP-zoom) dataset. We cropped the behavioral videos in the NP dataset in order to analyze the fine-grained facial movements of the mouse (Fig. 2B, right); grayscale video frames were downsampled to 128x128 pixels after cropping, and the bottom corners were masked to occlude forelimb movements near the face.

Figure 1 (schematic): observed behavior is compressed by a convolutional autoencoder (CAE) into continuous hidden states, which are segmented into discrete hidden states by an autoregressive hidden Markov model (ARHMM); both are decoded from observed neural activity with feedforward neural networks.

Figure 2: The CAE obtains good reconstructions at high compression rates. A) Reconstruction MSE on held-out test data as a function of latent dimension. The nonlinear CAE consistently outperforms the linear autoencoder. Plotted values are means over test trials, and errorbars represent 95% bootstrapped confidence intervals. B) Reconstruction quality is good even when the latent dimensionality is three orders of magnitude smaller than the original number of pixels per frame. 
Top row shows example original frames from held-out test data in each dataset using 8 CAE dimensions; middle and bottom rows show corresponding linear autoencoder and CAE output frames, respectively. In the WFCI bottom view we have enhanced the contrast and clipped high pixel values in all figures for better visibility. Also see Supplementary Videos C.1 for full reconstruction videos.

Nonlinear dimensionality reduction of behavioral videos. We compress the behavioral videos with a convolutional autoencoder (CAE), yielding a low-dimensional continuous representation of behavior that is useful for downstream analyses. The CAE architecture is fixed for all datasets, except for the number of latents (Fig. 2; see Appendix A for architecture details). We train the autoencoders by minimizing the mean squared error (MSE) between original and reconstructed frames using the Adam optimizer [39] with a learning rate of 10^-4. Models are trained for a minimum of 500 epochs and a maximum of 1000 epochs. Training terminates when MSE on held-out validation data, averaged over the previous 10 epochs, begins to increase. As a baseline comparison we also fit a linear SVD model.¹

Segmentation of behavior. The CAE outputs a low-dimensional nonlinear embedding of the behavioral video frames, but does not capture temporal dependencies between frames. Next we train a simple class of nonlinear dynamical systems to approximate dynamics within this embedded space. Let x ∈ R^(T×D) denote the sequence of continuous latents obtained by embedding the video frames with the CAE. Each latent x_t corresponds to the embedding of the corresponding video frame at timestep t; T is the video length and D is the embedding dimension (D is of order 10 in the examples here). 
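The early-stopping rule described above (stop once validation MSE, averaged over the previous 10 epochs, begins to increase, subject to the 500/1000-epoch bounds) can be sketched as follows. This is a minimal reading of the stated criterion, not the authors' exact implementation, and the function name is ours:

```python
def should_stop(val_mse_history, min_epochs=500, max_epochs=1000, window=10):
    """Return True when CAE training should terminate.

    val_mse_history: list of per-epoch validation MSEs observed so far.
    Training runs for at least `min_epochs` and at most `max_epochs`;
    otherwise it stops when the validation MSE averaged over the most
    recent `window` epochs exceeds the average over the preceding
    `window` epochs (i.e., the smoothed validation MSE starts rising).
    """
    epoch = len(val_mse_history)
    if epoch >= max_epochs:
        return True
    if epoch < min_epochs or epoch < 2 * window:
        return False
    prev_avg = sum(val_mse_history[-2 * window:-window]) / window
    curr_avg = sum(val_mse_history[-window:]) / window
    return curr_avg > prev_avg
```

In a training loop, `should_stop` would be called once per epoch after evaluating the model on held-out validation data.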
Building on previous work [24, 28, 40, 41, 42], we model the sequence of continuous latents as a stochastic process that switches between a small number K of discrete regimes, each characterized by linear-Gaussian dynamics. These discrete regimes are specified by an additional layer of discrete state variables z ∈ {1, . . . , K}^T, and they too may exhibit temporal dependencies; the discrete state at time t may depend on its preceding value.

¹Direct SVD was too slow due to the large matrices involved here, and randomized SVD approaches led to suboptimal results in our hands; instead, we simply used Adam to minimize the reconstruction MSE of a linear autoencoder (the same optimization problem solved by SVD/PCA), which uses a single dense layer for the encoder (images to latents) and a single dense layer for the decoder (latents to images), with encoding and decoding weights tied and a linear transfer function (with the same learning rate used for the nonlinear CAE).
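The footnoted linear-autoencoder baseline (a gradient-based stand-in for SVD, with tied encoder/decoder weights) can be sketched in numpy as below. This is an illustrative version using plain gradient descent on flattened frames; the authors' version uses Adam in their pipeline, and all names here are ours:

```python
import numpy as np

def fit_linear_ae(frames, n_latents, lr=5e-3, n_steps=3000, seed=0):
    """Tied-weight linear autoencoder: latents = X @ W, reconstructions
    = X @ W @ W.T. Minimizing reconstruction MSE over W recovers the
    principal subspace, the same optimization problem solved by SVD/PCA.

    frames: (N, P) array of flattened video frames.
    Returns W with shape (P, n_latents).
    """
    rng = np.random.default_rng(seed)
    N, P = frames.shape
    W = rng.normal(scale=0.01, size=(P, n_latents))
    for _ in range(n_steps):
        err = frames @ W @ W.T - frames            # reconstruction error, (N, P)
        # gradient of (1/N) * ||X W W^T - X||_F^2 w.r.t. the tied weights W
        grad = (2.0 / N) * (frames.T @ (err @ W) + err.T @ (frames @ W))
        W -= lr * grad
    return W
```

Because the weights are tied, a single matrix W plays the role of both the dense encoder and the (transposed) dense decoder described in the footnote.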
These modeling assumptions are captured by an autoregressive hidden Markov model (ARHMM), which specifies a joint distribution over continuous and discrete state sequences,

p(x, z; θ) = p(z_1) p(x_1) ∏_{t=2}^T p(z_t | z_{t−1}; θ) p(x_t | x_{t−1}, z_t; θ)
           = π_{z_1} N(x_1 | μ_1, Σ_1) ∏_{t=2}^T P_{z_{t−1}, z_t} N(x_t | A_{z_t} x_{t−1} + b_{z_t}, Q_{z_t}),     (1)

where π ∈ Δ^K specifies the initial distribution over discrete states, (μ_1, Σ_1) parameterize a Gaussian initial distribution over continuous states, P ∈ [0, 1]^(K×K) is a row-stochastic transition matrix, and the parameters {A_k, b_k, Q_k}_{k=1}^K specify the linear dynamics associated with each of the K discrete states. These parameters are combined in the set θ = {π, μ_1, Σ_1, P, {A_k, b_k, Q_k}_{k=1}^K}.

We fit the ARHMM with expectation-maximization (EM). As in standard hidden Markov models [43], the posterior expectations in the E-step are obtained via message passing in the chain-structured discrete graphical model. The optimal dynamics parameters are found via weighted least squares regression. We present results with ARHMMs with a single autoregressive lag.

The fitted ARHMM produces a discrete segmentation of the sequence of continuous latents output by the CAE. We estimate the discrete states with the maximum a posteriori (MAP) state sequence z* = arg max_z p(z | x, θ*), which we obtain via the Viterbi algorithm. The estimated state sequence serves multiple purposes. As we will see, the discrete states may offer useful interpretations of behavior as a sequence of discrete “syllables,” patterns of behavior identified by similar temporal dynamics. Moreover, different discrete states may correspond to different patterns of neural activity, and different mappings from neural activity to continuous latent states. 
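Equation (1) can be written out directly. Below is a small numpy sketch of the ARHMM log-joint, our own illustrative code with hypothetical variable names (the authors' implementation uses the ssm library [47]):

```python
import numpy as np

def arhmm_log_joint(x, z, pi, mu1, Sigma1, P, A, b, Q):
    """Log of Eq. (1): joint probability of continuous latents x (T, D)
    and discrete states z (T,) under an ARHMM with K linear-Gaussian
    regimes. pi: (K,) initial state probabilities; P: (K, K) row-stochastic
    transition matrix; A: (K, D, D), b: (K, D), Q: (K, D, D) per-state
    AR(1) dynamics parameters."""
    def log_gauss(v, mean, cov):
        # multivariate Gaussian log-density
        d = v.shape[0]
        diff = v - mean
        _, logdet = np.linalg.slogdet(cov)
        return -0.5 * (d * np.log(2 * np.pi) + logdet
                       + diff @ np.linalg.solve(cov, diff))

    T = x.shape[0]
    lp = np.log(pi[z[0]]) + log_gauss(x[0], mu1, Sigma1)
    for t in range(1, T):
        lp += np.log(P[z[t - 1], z[t]])                       # discrete transition
        lp += log_gauss(x[t], A[z[t]] @ x[t - 1] + b[z[t]], Q[z[t]])  # AR dynamics
    return lp
```

In EM, this quantity (summed in expectation over the posterior on z) is the objective the M-step increases.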
We will leverage this feature\nof the discrete segmentation when developing the Bayesian decoders next.\nDecoding behavior from neural activity. Our ultimate goal is to develop a clearer understanding of\nhow neural activity maps to observed behavior (and vice versa). Probabilistic models of behavior o\ufb00er\na useful means to that end. Speci\ufb01cally, probabilistic models like the ARHMM o\ufb00er a set of latent\nstates that summarize behavioral time series, and thus a low-dimensional target for neural decoding.\nWe develop a nonlinear Bayesian decoder that combines neural recordings with the ARHMM prior to\nyield a posterior distribution over behavioral videos given neural activity.\nIdeally, we would learn the likelihood of the observed neural activity u \u2208 RT\u00d7N, where N is the\nnumber of neurons, given the underlying discrete states z and continuous states x. With a good likeli-\nhood model, we could combine it with the ARHMM prior to obtain a posterior distribution p(x, z | u)\nfor our Bayesian decoder. Unfortunately, learning a good likelihood model is challenging, so we take\nan alternative approach in order to sidestep this problem.\nInspired by Burkhart et al. [44], we instead train feedforward neural networks to output conditional\ndistributions p(zt | ut\u2212\u2206:t+\u2206) and p(xt | ut\u2212\u2206:t+\u2206) over the discrete and continuous states, respec-\ntively, given a window of neural activity. These are trained discriminatively, using the latent states\ninferred from the behavioral data. Details of the architecture, training procedure, and hyperparameter\nsearches are in Appendix A.\nWe use Bayes\u2019 rule to write p(ut\u2212\u2206:t+\u2206 | zt) \u221d p(zt | ut\u2212\u2206:t+\u2206)/p(zt), where the proportionality\nconstant p(ut\u2212\u2206:t+\u2206) is constant with respect to zt. 
The numerator is given by the feedforward networks and the denominator is the marginal distribution under a Markov chain, which for long sequences is well-approximated by the stationary distribution. We plug in this ratio as a substitute for the likelihood in a hidden Markov model, and then use standard message passing routines to sample and compute expectations of z under the posterior p(z | u). Of course, this is the posterior distribution under an approximate model of p(u | z); nevertheless, it suffices for combining the ARHMM prior and the neural data in a Bayesian way.

We use the same technique to obtain posterior samples of the continuous states x, but here we condition on both the neural data and a sample z ∼ p(z | u). Here, we need the marginal distribution p(x_t | z), which we obtain from a simple Kalman filter with time-varying dynamics parameters determined by z. Given the marginal distribution and the conditional distribution output by the neural network, we use the Kalman smoother to compute posterior expectations of the continuous latent states. In doing so, we obtain a Bayesian estimate of the discrete and continuous states given the neural data. Finally, given sample sequences x_{1:T} we can again map these sequences through the CAE decoder to obtain full videos y_{1:T} sampled from the posterior.

Figure 3: Segmenting behavioral traces and sampling new traces and videos from the generative model. A) CAE latents are shown on held out test data over time, with background colors indicating the discrete state (K=4) inferred for that time step using the ARHMM (colors are chosen to maximally differentiate states; colors do not indicate the same states across different datasets). Transitions from rest to movement are easily detectable based on the assigned colors. B) Similar to A) but the latents and states are generated by sampling from the ARHMM; resulting traces are qualitatively similar to real traces, with similarly strong heterogeneity in smoothness in different temporal segments. C) Discrete states are shown for 19 trials of the WFCI dataset, aligned to a right lever grab in the behavioral task. The same states (labeled by the same colors as in A) and B) above) frequently occur at similar points in each trial, indicating trial-locked state structure. Trial specific time points such as the levers moving in and stimulus onset are overlaid. D) Two random frames from videos sampled from the full generative model learned for each dataset; in each case the generated frames resemble real frames. See Supplementary Videos C.4 for full generated videos.

Figure 4: Summary of decoding results. A) Confusion matrices of predicted vs actual inferred discrete states for each dataset for the Bayesian decoder. A diagonal matrix corresponds to perfect decoding. States ordered by usage in training data. The Bayesian decoder and feedforward decoder (not shown) perform similarly, both outperforming chance. B) Both the Bayesian decoder and the feedforward decoder outperform baseline for CAE latent predictions from neural activity. The percent improvement over baseline is shown as a function of the number of discrete states used by the ARHMM prior for the Bayesian decoder.

Results

Nonlinear dimensionality reduction. We begin by quantifying the performance of the CAE. The critical result here is that the behavioral videos can be embedded in a low-dimensional space (Fig. 2): an embedding dimension D < 20 suffices to capture much of the structure visible in the mouse’s behavior (though unsurprisingly very high-resolution details such as the tips of the whiskers are blurred at this level of compression). Even linear autoencoders achieve decent compression, though the nonlinear CAE outperforms the linear model consistently, particularly in frames where large paw motions occur (see Supplementary Videos C.1 for reconstructions). Importantly, for the WFCI dataset, the CAE operates on both camera views simultaneously, allowing us to combine information from multiple sources into one low-dimensional representation. Throughout the rest of the paper, we will use CAEs with an 8-dimensional latent space for each dataset.

Segmentation of behavioral video. We fit ARHMMs to segment the behavior based on the dynamics of the CAE latents. The segmentation corresponds to visible changes in the CAE latents (Fig. 3A). With two ARHMM states, we segment the behavior roughly into moving vs still for all datasets (see Supplementary Videos C.2 with K=2). With an increased number of ARHMM states, we see more nuanced segmentation. We examined the reproducibility of these segmentations across trials in the WFCI dataset; clear trial-locked state structure is visible in Fig. 3C, indicating that these models are capturing reproducible structure in the CAE latents. We also examined the reproducibility of these segmentations across mice (Fig. A2), and find that a similar trial-locked state structure is shared across multiple animals.

Sampling from the full generative model. The ARHMM fit to data serves as a generative model of behavioral videos. First, we sample forward from the ARHMM with learned parameters θ* and most likely state sequence z* to obtain continuous state sequences x_{1:T} (Fig. 3B). We then feed the sequence of continuous latents into the CAE decoder to obtain novel synthetic behavioral videos y_{1:T} (see Supplementary Videos C.4). 
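Ancestral sampling from the ARHMM generative model (draw z forward from the Markov chain, then x from the per-state linear-Gaussian dynamics) can be sketched in numpy as below. Names and shapes are ours; in the real pipeline the sampled x would then be passed through the trained CAE decoder to produce frames:

```python
import numpy as np

def sample_arhmm(T, pi, mu1, Sigma1, P, A, b, Q, seed=0):
    """Ancestral sampling from an ARHMM: a discrete state sequence z is
    drawn from the Markov chain (initial distribution pi, transitions P),
    and continuous latents x from the switching AR(1) dynamics
    x_t ~ N(A[z_t] x_{t-1} + b[z_t], Q[z_t]). Returns (z, x)."""
    rng = np.random.default_rng(seed)
    K, D = P.shape[0], mu1.shape[0]
    z = np.empty(T, dtype=int)
    x = np.empty((T, D))
    z[0] = rng.choice(K, p=pi)
    x[0] = rng.multivariate_normal(mu1, Sigma1)
    for t in range(1, T):
        z[t] = rng.choice(K, p=P[z[t - 1]])                      # discrete transition
        mean = A[z[t]] @ x[t - 1] + b[z[t]]                      # state-dependent dynamics
        x[t] = rng.multivariate_normal(mean, Q[z[t]])
    return z, x
```

Decoding each x_t with the CAE decoder then yields a sampled behavioral video y_{1:T}.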
The resulting generative process is clearly not perfect; there are occasional distorted frames, and given enough viewing time it is easy for human observers to distinguish real versus sampled movies. Nevertheless, many generated frames qualitatively resemble real frames (Fig. 3D) and we can see the mouse transitioning between still and moving in a fairly natural way in the sampled videos, indicating that the generative model places significant probability mass near the space of real behavioral videos.

Bayesian decoding of behavioral states, latents, and full videos. We use this generative ARHMM model as the basis of a fully Bayesian decoder that operates on neural activity to reconstruct mouse behavior. We first fit separate feedforward neural network decoders to predict discrete states and CAE latents, and then incorporate these decoders into a fully Bayesian decoder (see Methods). We choose the number of states for the ARHMM based on Bayesian CAE mean squared error on validation data (WFCI: 16 states, NP: 8, NP-zoom: 4). We compare decoding performance to baseline predictions which are defined as the most common state (discrete decoder) and the mean value of the CAE latents (continuous decoder) on training data.

Decoder predictions of both continuous CAE latents and discrete ARHMM states are above chance across datasets (Fig. 4). Consistent with previous work [8, 9], we find that the neural signals recorded in these experiments contain rich information about behavior. For discrete states, the confusion matrices of actual vs predicted states (Fig. 4A), sorted by usage of the actual states in the training data, show a diagonal structure, reflecting above-chance performance (WFCI: 53% correct vs 19% baseline for 16 states, NP: 53% correct vs 32% baseline for 8 states, NP-zoom: 67% correct vs 37% baseline for 4 states; feedforward decoding accuracy is comparable to Bayesian).

Figure 5: Example decoded ARHMM states and CAE latents for a held out WFCI test trial. A) Discrete state probabilities inferred by the ARHMM for the behavioral data (top row); predictions from a feedforward model operating on neural data (middle row); output of the Bayesian decoding model (bottom row). B) CAE latents (black) are compared to predicted latents from a feedforward decoder (blue) and the Bayesian decoder (green). The shaded region indicates ±3 posterior SDs, output by the Bayesian posterior. C) Example frames of the real behavioral video (top row) compared to the Bayesian decoded frames (bottom row); see Supplementary Videos C.3 for full details.

For the continuous latents, the Bayesian decoder improves over baseline MSE for latent trace prediction by 77% for WFCI and 67% in the NP-zoom dataset (Fig. 4B). The Bayesian decoder slightly outperformed feedforward decoders; feedforward decoders in turn outperformed simple linear decoders (results not shown). 
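The discrete half of the Bayesian decoder described in the Methods can be sketched as follows: the feedforward network's outputs p(z_t | u) are divided by the Markov chain's stationary distribution to form a pseudo-likelihood (the Bayes' rule ratio), which is then run through standard HMM forward-backward smoothing under the ARHMM transition prior. This is illustrative numpy code with hypothetical names, omitting the continuous-state Kalman smoother:

```python
import numpy as np

def decode_discrete(clf_probs, P, pi):
    """Smoothed posteriors p(z_t | u_{1:T}) for the discrete states.

    clf_probs: (T, K) feedforward-network outputs p(z_t | u).
    P: (K, K) ARHMM transition matrix; pi: (K,) initial distribution.
    Dividing clf_probs by the stationary distribution of P gives a
    quantity proportional to the likelihood p(u | z_t), which is fed
    into standard (normalized) forward-backward recursions."""
    T, K = clf_probs.shape
    # stationary distribution: left eigenvector of P with eigenvalue 1
    evals, evecs = np.linalg.eig(P.T)
    stat = np.real(evecs[:, np.argmax(np.real(evals))])
    stat = stat / stat.sum()
    lik = clf_probs / stat                     # pseudo-likelihood, up to a constant
    alpha = np.empty((T, K))
    beta = np.empty((T, K))
    alpha[0] = pi * lik[0]; alpha[0] /= alpha[0].sum()
    for t in range(1, T):                      # forward pass
        alpha[t] = (alpha[t - 1] @ P) * lik[t]
        alpha[t] /= alpha[t].sum()
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):             # backward pass
        beta[t] = P @ (lik[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)
```

With uninformative (uniform) transitions this reduces to the classifier outputs themselves; with structured transitions, the ARHMM prior temporally smooths the per-frame predictions.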
The Bayesian decoder o\ufb00ers less improvement (37%) in the NP dataset,\nwhich may seem surprising since the NP-zoom behavioral video is simply a crop of the NP video.\nOur interpretation is that the variance in the CAE latents extracted from the NP video are dominated\nby the location of the mouse\u2019s paws in space, and the brain regions recorded from in this experiment\ncarried much more information about the discrete states and the facial pose than absolute paw location.\nFuture work analyzing data from a richer variety of brain regions will further test this hypothesis.\nFinally, Fig. 5 shows the feedforward and Bayesian decoder predictions for discrete states and CAE\nlatents for an example test trial from the WFCI dataset. See Supplementary Figs. A5 and A6 for\nexample trials for NP-zoom and NP datasets. The Bayesian decoder also provides valuable information\nat each time step about the uncertainty of the CAE latents \u2014 this information is not directly available\nfrom the feedforward decoder. In Supplementary Videos C.3 we show several samples of the full\ndecoded video next to the real video, to provide a more detailed illustration of the posterior variability.\n\nDiscussion\nWe have introduced a framework for the compression, segmentation, generation, and decoding of\nbehavioral videos. Our approach builds on previous work that used ARHMMs to segment behavioral\nvideos [24, 28]. We extend these methods by incorporating nonlinear autoencoders (providing more\naccurate and compact representations of the video signal) and introducing a novel Bayesian decoding\napproach that exploits this ARHMM prior backbone; the resulting generative model and decoder can\noutput accurate full-resolution behavioral videos, to our knowledge for the \ufb01rst time. We demonstrate\nthe application of this framework to multiple behavioral paradigms and neural recording technologies.\nA few exciting directions for future work are clear. 
First, for simplicity, in this work we decomposed\nour approach into individual compression, segmentation, and decoding steps. In principle it is possible\nto train the graphical model in Fig. 1 in an end-to-end fashion. This approach may lead to improved\nperformance on compression and decoding metrics. Second, the ability to segment animal behavior\ninto reproducible syllables opens up new possibilities for neural data analysis, for example, novel\nswitching encoding models triggered by the segmentation output of the methods developed here [45];\nthese methods could also in principle be directly applied to the coordinates of tracked body parts from\npose tracking algorithms. Finally, our unsupervised compression approach does not allow us to easily\ndisentangle factors of variation in the behavior. For example, changes in arm position are generally\nrepresented across all latent factors, hindering our ability to connect neural activity with particular\nbehaviors. Hybrid approaches that create a more interpretable representation of behavior, through\nthe incorporation of labeled data from pose tracking algorithms, or from the timing of task-related\nvariables (e.g., stimulus onset), seem particularly promising.\nWe hope to facilitate the application of these methods to a variety of behavioral datasets. A python\nimplementation of our pipeline is available at https://github.com/ebatty/behavenet, which is\nbased on the PyTorch [46], ssm [47], and Test Tube [48] libraries.\nAcknowledgments We thank N. Steinmetz, M. Carandini, and K. Harris for generously making their data pub-\nlicly available. 
This work was supported by the Simons Foundation and the Gatsby Charitable Foundation, by NSF NeuroNex Award DBI-1707398, and by NIH awards 5U19NS107613, 5U19NS104649, and 1U19NS113201.

Table 1: Author contributions.
Columns (author initials): EB, MW, SS, DB, TA, SM, WG, JM, AC, JC, SD, SL, LP
Rows (contribution types): Conceptualization; Data collection; Data analysis; Code development; Writing; Editing; Funding acquisition

References
[1] Alex Gomez-Marin, Joseph J Paton, Adam R Kampff, Rui M Costa, and Zachary F Mainen. Big behavioral data: psychology, ethology and the foundations of neuroscience. Nature Neuroscience, 17(11):1455, 2014.
[2] Adam J Calhoun and Mala Murthy. Quantifying behavior to solve sensorimotor transformations: advances from worms and flies. Current Opinion in Neurobiology, 46:90–98, 2017.
[3] John W Krakauer, Asif A Ghazanfar, Alex Gomez-Marin, Malcolm A MacIver, and David Poeppel. Neuroscience needs behavior: correcting a reductionist bias. Neuron, 93(3):480–490, 2017.
[4] Gordon J Berman. Measuring behavior across scales. BMC Biology, 16(1):23, 2018.
[5] Alexander Mathis, Pranav Mamidanna, Kevin M Cury, Taiga Abe, Venkatesh N Murthy, Mackenzie Weygandt Mathis, and Matthias Bethge. DeepLabCut: markerless pose estimation of user-defined body parts with deep learning. Technical report, Nature Publishing Group, 2018.
[6] Talmo D Pereira, Diego E Aldarondo, Lindsay Willmore, Mikhail Kislin, Samuel S-H Wang, Mala Murthy, and Joshua W Shaevitz. Fast animal pose estimation using deep neural networks. Nature Methods, 16(1):117, 2019.
[7] Jacob M Graving, Daniel Chae, Hemal Naik, Liang Li, Benjamin Koger, Blair R Costelloe, and Iain D Couzin. Fast and robust animal pose estimation. bioRxiv, page 620245, 2019.
[8] Simon Musall, Matthew T Kaufman, Steven Gluf, and Anne K Churchland. Movement-related activity dominates cortex during sensory-guided decision making.
bioRxiv, page 308288, 2018.
[9] Carsen Stringer, Marius Pachitariu, Nicholas Steinmetz, Charu Bai Reddy, Matteo Carandini, and Kenneth D. Harris. Spontaneous behaviors drive multidimensional, brainwide activity. Science, 364(6437), 2019. doi: 10.1126/science.aav7893. URL https://science.sciencemag.org/content/364/6437/eaav7893.
[10] Liam Paninski and JP Cunningham. Neural data science: accelerating the experiment-analysis-theory cycle in large-scale neuroscience. Current Opinion in Neurobiology, 50:232–241, 2018.
[11] Shreya Saxena and John P Cunningham. Towards the neural population doctrine. Current Opinion in Neurobiology, 55:103–111, 2019.
[12] Matthew Johnson, David K Duvenaud, Alex Wiltschko, Ryan P Adams, and Sandeep R Datta. Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems, pages 2946–2954, 2016.
[13] Mijail D Serruya, Nicholas G Hatsopoulos, Liam Paninski, Matthew R Fellows, and John P Donoghue. Brain-machine interface: Instant neural control of a movement signal. Nature, 416(6877):141, 2002.
[14] David Sussillo, Paul Nuyujukian, Joline M Fan, Jonathan C Kao, Sergey D Stavisky, Stephen Ryu, and Krishna Shenoy. A recurrent neural network for closed-loop intracortical brain–machine interface decoders. Journal of Neural Engineering, 9(2):026027, 2012.
[15] Joshua I Glaser, Raeed H Chowdhury, Matthew G Perich, Lee E Miller, and Konrad P Kording. Machine learning for neural decoding. arXiv preprint arXiv:1708.00909, 2017.
[16] Omid G Sani, Bijan Pesaran, and Maryam M Shanechi. Modeling behaviorally relevant neural dynamics enabled by preferential subspace identification (PSID). bioRxiv, page 808154, 2019.
[17] Matthew D. Golub, Chandramouli Chandrasekaran, William T Newsome, Krishna Shenoy, and David Sussillo. Joint neural-behavioral models of perceptual decision making.
Computational and Systems Neuroscience (Cosyne), 2019.
[18] Nick Steinmetz, Marius Pachitariu, Carsen Stringer, Matteo Carandini, and Kenneth Harris. Eight-probe Neuropixels recordings during spontaneous behaviors. March 2019. doi: 10.25378/janelia.7739750.v4. URL https://janelia.figshare.com/articles/Eight-probe_Neuropixels_recordings_during_spontaneous_behaviors/7739750.
[19] Anne K Churchland, Simon Musall, Matthew T Kaufmann, Ashley L Juavinett, and Steven Gluf. Single-trial neural dynamics are dominated by richly varied movements: dataset. October 2019. doi: 10.14224/1.38599. URL http://repository.cshl.edu/38599/.
[20] Greg J Stephens, Bethany Johnson-Kerner, William Bialek, and William S Ryu. Dimensionality and dynamics in the behavior of C. elegans. PLoS Computational Biology, 4(4):e1000028, 2008.
[21] Gordon J Berman, Daniel M Choi, William Bialek, and Joshua W Shaevitz. Mapping the stereotyped behaviour of freely moving fruit flies. Journal of The Royal Society Interface, 11(99):20140672, 2014.
[22] Michael B Orger and Gonzalo G de Polavieja. Zebrafish behavior: opportunities and challenges. Annual Review of Neuroscience, 40:125–147, 2017.
[23] Semih Gunel, Helge Rhodin, Daniel Morales, João Compagnolo, Pavan Ramdya, and Pascal Fua. DeepFly3D: A deep learning-based approach for 3D limb and appendage tracking in tethered, adult Drosophila. bioRxiv, 2019.
[24] Alexander B Wiltschko, Matthew J Johnson, Giuliano Iurilli, Ralph E Peterson, Jesse M Katon, Stan L Pashkovski, Victoria E Abraira, Ryan P Adams, and Sandeep Robert Datta. Mapping sub-second structure in mouse behavior. Neuron, 88(6):1121–1135, 2015.
[25] E. Kelly Buchanan, Akiva Lipschitz, Scott W. Linderman, and Liam Paninski. Quantifying the behavioral dynamics of C. elegans with autoregressive hidden Markov models.
Workshop on Worm's Neural Information Processing at the 31st Conference on Neural Information Processing Systems, 2017.
[26] Antonio C Costa, Tosif Ahamed, and Greg J Stephens. Adaptive, locally linear models of complex dynamics. Proceedings of the National Academy of Sciences, 116(5):1501–1510, 2019.
[27] Anuj Sharma, Robert E. Johnson, Florian Engert, and Scott W. Linderman. Point process latent variable models of freely swimming larval zebrafish. Advances in Neural Information Processing Systems (NeurIPS), 2018.
[28] Jeffrey E Markowitz, Winthrop F Gillis, Celia C Beron, Shay Q Neufeld, Keiramarie Robertson, Neha D Bhagat, Ralph E Peterson, Emalee Peterson, Minsuk Hyun, Scott W Linderman, et al. The striatum organizes 3D behavior via moment-to-moment action selection. Cell, 174(1):44–58, 2018.
[29] Yuanjun Gao, Evan W Archer, Liam Paninski, and John P Cunningham. Linear dynamical neural population models through nonlinear embeddings. In Advances in Neural Information Processing Systems, pages 163–171, 2016.
[30] Chethan Pandarinath, Daniel J O'Shea, Jasmine Collins, Rafal Jozefowicz, Sergey D Stavisky, Jonathan C Kao, Eric M Trautmann, Matthew T Kaufman, Stephen I Ryu, Leigh R Hochberg, et al. Inferring single-trial neural population dynamics using sequential auto-encoders. Nature Methods, page 1, 2018.
[31] Scott W. Linderman*, Matthew J. Johnson*, Andrew C. Miller, Ryan P. Adams, David M. Blei, and Liam Paninski. Bayesian learning and inference in recurrent switching linear dynamical systems. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
[32] Josue Nassar, Scott W. Linderman, Monica Bugallo, and Il Memming Park. Tree-structured recurrent switching linear dynamical systems for multi-scale modeling. In International Conference on Learning Representations (ICLR), 2019.
[33] Scott W. Linderman, Annika L. A.
Nichols, David M. Blei, Manuel Zimmer, and Liam Paninski. Hierarchical recurrent state space models reveal discrete and continuous dynamics of neural activity in C. elegans. bioRxiv, 2019. doi: 10.1101/621540.
[34] Nikhil Parthasarathy, Eleanor Batty, William Falcon, Thomas Rutten, Mohit Rajpal, EJ Chichilnisky, and Liam Paninski. Neural networks for efficient Bayesian decoding of natural images from retinal neurons. In Advances in Neural Information Processing Systems, pages 6434–6445, 2017.
[35] Hassan Akbari, Bahar Khalighinejad, Jose L Herrero, Ashesh D Mehta, and Nima Mesgarani. Towards reconstructing intelligible speech from the human auditory cortex. Scientific Reports, 9(1):874, 2019.
[36] Gopala K. Anumanchipalli, Josh Chartier, and Edward F. Chang. Speech synthesis from neural decoding of spoken sentences. Nature, 568(7753):493–498, 2019.
[37] Shreya Saxena, Ian Kinsella, Simon Musall, Sharon H Kim, Jozsef Meszaros, David N Thibodeaux, Carla Kim, John Cunningham, Elizabeth Hillman, Anne Churchland, et al. Localized semi-nonnegative matrix factorization (LocaNMF) of widefield calcium imaging data. bioRxiv, page 650093, 2019.
[38] James J Jun, Nicholas A Steinmetz, Joshua H Siegle, Daniel J Denman, Marius Bauza, Brian Barbarits, Albert K Lee, Costas A Anastassiou, Alexandru Andrei, Çağatay Aydın, et al. Fully integrated silicon probes for high-density recording of neural activity. Nature, 551(7679):232, 2017.
[39] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[40] E. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky. Bayesian nonparametric inference of switching dynamic linear models. IEEE Transactions on Signal Processing, 59(4):1569–1585, April 2011. doi: 10.1109/TSP.2010.2102756.
[41] Samuel Ainsworth, Nicholas J. Foti, Adrian K. C. Lee, and Emily B. Fox.
Interpretable VAEs for nonlinear group factor analysis. CoRR, abs/1802.06765, 2018.
[42] Drausin Wulsin, Emily Fox, and Brian Litt. Parsing epileptic events using a Markov switching process model for correlated time series. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 356–364, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. URL http://proceedings.mlr.press/v28/wulsin13.html.
[43] Lawrence R Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[44] Michael C Burkhart, David M Brandman, Carlos E Vargas-Irwin, and Matthew T Harrison. The discriminative Kalman filter for nonlinear and non-Gaussian sequential Bayesian filtering. arXiv preprint arXiv:1608.06622, 2016.
[45] Ziqiang Wei, Hidehiko Inagaki, Nuo Li, Karel Svoboda, and Shaul Druckmann. An orderly single-trial organization of population dynamics in premotor cortex predicts behavioral variability. Nature Communications, 10, 2019.
[46] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019. URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
[47] S Linderman. ssm. https://github.com/slinderman/ssm, 2019.
[48] W.A. Falcon. Test Tube.
https://github.com/williamfalcon/test-tube, 2017.