{"title": "Trajectory Convolution for Action Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 2204, "page_last": 2215, "abstract": "How to leverage the temporal dimension is a key question in video analysis. Recent works suggest an efficient approach to video feature learning, i.e.,\nfactorizing 3D convolutions into separate components respectively for spatial and temporal convolutions. The temporal convolution, however, comes with an implicit assumption \u2013 the feature maps across time steps are well aligned so that the features at the same locations can be aggregated. This assumption may be overly strong in practical applications, especially in action recognition where the motion serves as a crucial cue. In this work, we propose a new CNN architecture TrajectoryNet, which incorporates trajectory convolution, a new operation for integrating features along the temporal dimension, to replace the existing temporal convolution. This operation explicitly takes into account the changes in contents caused by deformation or motion, allowing the visual features to be aggregated along the the motion paths, trajectories. On two large-scale action recognition datasets, namely, Something-Something and Kinetics, the proposed network architecture achieves notable improvement over strong baselines.", "full_text": "Trajectory Convolution for Action Recognition\n\nYue Zhao\n\nDepartment of Information Engineering\nThe Chinese University of Hong Kong\n\nzy317@ie.cuhk.edu.hk\n\nYuanjun Xiong\n\nAmazon Rekognition\nyuanjx@amazon.com\n\nDahua Lin\n\nDepartment of Information Engineering\nThe Chinese University of Hong Kong\n\ndhlin@ie.cuhk.edu.hk\n\nAbstract\n\nHow to leverage the temporal dimension is one major question in video analysis.\nRecent works [47, 36] suggest an ef\ufb01cient approach to video feature learning,\ni.e., factorizing 3D convolutions into separate components respectively for spatial\nand temporal convolutions. 
The temporal convolution, however, comes with an implicit assumption – the feature maps across time steps are well aligned so that the features at the same locations can be aggregated. This assumption can be overly strong in practical applications, especially in action recognition where the motion serves as a crucial cue. In this work, we propose a new CNN architecture TrajectoryNet, which incorporates trajectory convolution, a new operation for integrating features along the temporal dimension, to replace the existing temporal convolution. This operation explicitly takes into account the changes in contents caused by deformation or motion, allowing the visual features to be aggregated along the motion paths, i.e., trajectories. On two large-scale action recognition datasets, Something-Something V1 and Kinetics, the proposed network architecture achieves notable improvement over strong baselines.

1 Introduction

The past decade has witnessed significant progress in action recognition [37, 38, 29, 42, 1], especially due to the advances in deep learning. Deep learning based methods for action recognition mostly fall into two categories: two-stream architectures [29] with 2D convolutional networks, and 3D convolutional networks [34]. Particularly, the latter has demonstrated great potential on large-scale video datasets [19, 25], with the use of new training strategies like transferring weights from pretrained 2D CNNs [42, 1].

However, for 3D convolution, several key questions remain to be answered: (1) 3D convolution involves substantially increased computing cost. Is it really necessary? (2) 3D convolution treats the spatial and temporal dimensions uniformly. Is it the most effective way for video modeling? We are not the first to raise such questions. In recent works, there have been attempts to move beyond 3D convolution and further improve the efficiency and effectiveness of joint spatio-temporal analysis.
For instance, both Separable-3D (S3D) [47] and R(2+1)D [36] obtain superior performance by factorizing the 3D convolutional filter into separate spatial and temporal operations. However, both methods are based on an implicit assumption that the feature maps across frames are well aligned so that the features at the same locations (across consecutive frames) can be aggregated via temporal convolution. This assumption ignores the motion of people or objects, a key aspect in video analysis.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Illustration of our trajectory convolution. Given a sequence of video frames (left) and its corresponding input feature map of size C × T × H × W (bottom-middle; the dimension of channels C is simplified as one for clarity), in order to calculate the response of a specific point at time step t, we leverage the motion fields ←ω and →ω (top-middle; the arrows in blue denote the motion velocity) to determine the sampling locations at the neighboring time steps t − 1 and t + 1 in the sense of tracking along the motion path. The response is denoted on the output feature map (bottom-right). The operation of trajectory convolution (denoted in a red box) is illustrated on the top-right. This figure is best viewed in color.

A natural idea to address this issue is to track the objects of interest and extract the features along their motion paths, i.e., trajectories. This idea has been explored in previous works [33, 37, 38, 41]. The most recent work along this direction is the Trajectory-pooled Deep-convolutional Descriptor (TDD) [41], which aggregates off-the-shelf deep features along trajectories. However, in this method, the visual features are derived separately from an existing deep network, just as a replacement of hand-crafted features.
Hence, a question emerges: can we learn better video features in conjunction with feature tracking?

In pursuit of this question, we develop a new CNN architecture for learning video features, called TrajectoryNet. Inspired by the Separable-3D network [36, 47], our design involves a cascade of convolutional operations respectively along the spatial and temporal dimensions. A distinguishing feature of this architecture is that it introduces a new operation, namely the trajectory convolution, to take the place of the standard temporal convolution. As shown in Figure 1, the trajectory convolution operates along the trajectories that trace the pixels corresponding to the same physical points, rather than at fixed pixel locations. The trajectories can be derived from either a precomputed optical flow field or a dense flow prediction network trained jointly with the features. The standard temporal convolution can be seen as a special case of the trajectory convolution where all pixels are considered to be stationary over time.

Experimental results on the Something-Something V1 and Kinetics datasets show that by explicitly taking into account the motion dynamics in the temporal operation, the proposed network obtains considerable improvements over Separable-3D, a competitive baseline.

2 Related Work

Trajectory-based Methods for Action Recognition Action recognition in videos has been greatly advanced thanks to the emergence of powerful features. It was first tackled by extracting spatial-temporal local descriptors [39] from space-time interest points [20, 46].
These successful local features include: Histogram of Oriented Gradients (HOG) [3], Histogram of Optical Flow (HOF) [21], and Motion Boundary Histogram (MBH) [4].

Over the years, it was recognized that the 2D space domain and the 1D time domain have different characteristics and should intuitively be handled in different manners. As for motion modeling in the temporal domain, trajectories have been a powerful intermediary to convey such motion information. Messing et al [24] used a KLT tracker [22] to extract feature trajectories and applied log-polar uniform quantization. Sun et al [33] extracted trajectories by matching SIFT features between frames. These trajectories are based on sparse interest points, which were later shown to be inferior to dense sampling. In [37], Wang et al used dense trajectories to extract low-level features within aligned 3D volumes. An improved version [38] increased recognition accuracy by estimating and compensating for the effect of camera motion. [37, 38] also revealed that the trajectory itself can serve as a component of descriptors in the form of concatenated displacement vectors, which was consolidated by deep learning methods [29].

Wang et al first proposed TDD in [41] to introduce deep features to trajectory analysis. It conducts trajectory-constrained pooling to aggregate deep features into video descriptors. However, the backbone two-stream CNN [29], from which the deep feature is extracted, is learned from very short frame snippets and is unaware of the information of temporal evolution.
In addition, all of these trajectory-aligned methods rely on encoding methods such as Fisher vectors (FV) [45] and vectors of locally aggregated descriptors (VLAD) [14], and an extra SVM is needed for classification, which prohibits end-to-end training. To sum up the discussion above, we provide a comparison of our approach with previous works on action recognition in Table 1.

Table 1: A comparison of our approach with existing methods.

Method               | Use deep feature? | Feature tracking? | End-to-end?
STIP [20]            | ✗                 | ✗                 | ✗
DT [37], iDT [38]    | ✗                 | ✓                 | ✗
TSN [42], I3D [1]    | ✓                 | ✗                 | ✓
TDD [41]             | ✓                 | ✓                 | ✗
TrajectoryNet (Ours) | ✓                 | ✓                 | ✓

Action Recognition in the Context of Deep Learning Deep convolutional neural network based models have been widely applied to action recognition [18, 29, 34], and can mostly be categorized into two families, i.e., two-stream networks [29] and 3D convolution networks [15, 34]. Recently, 3D convolutional networks have drawn attention since Carreira et al introduced Inflated-3D models [1] by inflating an existing 2D convolutional network to its 3D variant and training on a very large action recognition dataset [19]. Tran et al argued in [36] that factorizing 3D convolutions into separable spatial and temporal convolutions obtains higher accuracy. A similar phenomenon is also observed in the Separable-3D models by Xie et al [47]. Wang et al incorporated multiplicative interaction into 3D convolution for modeling relations in [40]. All of these modifications are focused on a single modality, i.e., the appearance branch.

Apart from network architectural designs, another direction is to exploit the interaction of the appearance and motion information of action. Feichtenhofer et al explored strategies for the spatio-temporal fusion of two-stream networks at earlier stages in [7].
Such attempts are mostly simple manipulations of features such as stacking, addition [7], and multiplicative gating [6].

Motion Representation using Convolutional Networks Optical flow has been used for decades as a generic representation of motion in general and trajectory in particular. As a competitive counterpart to the classical variational approaches [10, 31], many parametric models based on CNNs have been recently proposed and have achieved promising results in estimating optical flow. These include, but are not limited to, the FlowNet family [5, 11], SpyNet [26], and PWC-Net [32]. The aforementioned models are learned in a supervised manner on large-scale simulated flow datasets [5, 23], possibly leaving a large gap between simulated animations and real-world videos. Also, these datasets are designed for accurate flow prediction, which is possibly not appropriate for motion estimation of human action due to the inhomogeneity of displacement between optical flow datasets and human action datasets, as revealed in [11]. As for the network architecture, most models require parameters on the order of 10^7 ∼ 10^8, which both prohibits plugging them into action recognition networks as a submodule and causes too much computational cost. Zhu et al proposed MotionNet [51] to learn dense flow fields in an unsupervised manner and plugged it into a two-stream network [29] to be finetuned for the action recognition task. The MotionNet is relatively lightweight and can accept a sequence of multiple images. However, it is only used to substitute the pre-calculated optical flow while maintaining the conventional two-stream architecture.
Zhao et al proposed an alternative representation based on cost volume for efficiency, at the cost of a degraded quality of the motion field [49].

Transformation-Sensitive Convolutional Networks Conventional CNNs operate on fixed locations in a regular grid, which limits their ability to model unknown geometric transformations. Spatial Transformer Networks (STN) [13] were the first to introduce spatial transformation learning into deep models. An STN estimates a global parametric transformation with which the ordinary feature map is warped. Such warping is computationally expensive, and the transformation is considered to be universal across the whole image, which is usually not the case for action recognition, since different body parts have their own movements. In Dynamic Filter Networks [16], Xu et al introduce dynamic filters which are conditioned on the input and can change over samples. This enables learning local spatial transformations. The Deformable Convolutional Network (DCN) [2] achieves similar local transformations in a different way. While keeping the filter weights invariant to the input, the proposed deformable convolution first learns a dense offset map from the input, and then applies it to the regular feature map for re-sampling. The proposed trajectory convolution is inspired by the deformable sampling in DCN and utilizes it for feature tracking in spatio-temporal convolution operations.

3 Methods

The TrajectoryNet model is built with the trajectory convolution operation. In this section, we first introduce the concept of trajectory convolution. Then we illustrate the architecture of TrajectoryNet. Finally, we describe the approach to learning the trajectory together with the trajectory convolution.

3.1 Trajectory Convolution

In the context of separable spatio-temporal 3D convolution, the 1D temporal convolution is conducted pixel-wise on the 2D spatial feature map along the temporal dimension.
Given input feature maps x_t(p) at the t-th time step, the output feature y_t(p), at position p = (h, w) ∈ [0, H) × [0, W), is calculated as the inner product of the input feature sequence at the same spatial position across neighboring frames and the 1D convolution kernel.

By revisiting the idea of trajectory modeling in the action recognition literature, we introduce the concept of trajectory convolution. In trajectory convolution, the convolutional operation is done across irregular grids such that the sampled positions at different times correspond to the same physical point of a moving object. Formally, parameterized by the filter weights {w_τ : τ ∈ [−Δt, Δt]} with kernel size (2Δt + 1), the output feature y_t(p) is calculated as

    y_t(p) = Σ_{τ=−Δt}^{Δt} w_τ · x_{t+τ}(p̃_{t+τ}).    (1)

Following the formulation of trajectories in [37], the point p_t at frame t can be tracked to position p̃_{t+1} at the next frame (t + 1) given a forward dense optical flow field →ω = (u_t, v_t) = F(I_t, I_{t+1}) using the following equation:

    p̃_{t+1} = (h_{t+1}, w_{t+1}) = p_t + →ω(p_t) = (h_t, w_t) + →ω|_{(h_t, w_t)}.    (2)

For τ > 1, the sampling position p̃_{t+τ} can be calculated by applying Eq. (2) iteratively. To track to the previous frame (t − 1), a backward dense optical flow field ←ω = (u_t, v_t) = F(I_t, I_{t−1}) is used likewise.

Since the optical flow field is typically real-valued, the sampling position p̃_{t+τ} becomes fractional. Therefore, the corresponding feature x(p̃_{t+τ}) is derived via interpolation with a specific sampling kernel G, written as

    x(p̃_{t+τ}) = Σ_{p′} G(p′, p̃_{t+τ}) · x(p′).    (3)

In this paper, we will not go deeper into the different choices of the sampling kernel G and use bilinear interpolation as the default.

3.2 Relation with Deformable Convolution

The original deformable convolution was introduced for 2D convolution, but it is natural to extend it to 3D scenarios. A spatio-temporal grid R ⊂ R^3 can be defined by an ordinary 3D convolution specified by a certain receptive field size and dilation. For each location q_0 = (t, h, w) on the output feature map y, the response is calculated by sampling at irregular locations offset by Δq_n:

    y(q_0) = Σ_{q_n ∈ R} w(q_n) · x(q_0 + q_n + Δq_n).    (4)

The trajectory convolution can then be viewed as a special case of 3D deformable convolution where the offset map comes from the trajectories. Here, the grid R = {(−1, 0, 0), (0, 0, 0), (1, 0, 0)} is defined by a 3 × 1 × 1 kernel with dilation 1. The temporal component of the offset is always 0, i.e., Δq_n = (0, Δp_n). The discussion above reveals the relationship with deformable convolution. Therefore, the trajectory convolution can be efficiently implemented in a way similar to that discussed in [2].

3.3 Combining Motion and Appearance Features

The trajectory convolution helps the network to aggregate appearance features along motion paths, alleviating motion artifacts by trajectory alignment.
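To make the sampling scheme of Eqs. (1)-(3) concrete, the following is a minimal single-channel NumPy sketch of trajectory convolution with kernel size 3 (Δt = 1). The array layouts and function names are illustrative only, not the paper's implementation; flows are stored as per-pixel (dy, dx) displacements and the sampling kernel G is bilinear.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinear kernel G of Eq. (3): sample feat[H, W] at a fractional (y, x)."""
    H, W = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    val = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < H and 0 <= xx < W:
                val += (1 - abs(y - yy)) * (1 - abs(x - xx)) * feat[yy, xx]
    return val

def trajectory_conv(feats, fwd_flow, bwd_flow, w):
    """Trajectory convolution of Eq. (1) for a 3x1x1 temporal kernel (Dt = 1).

    feats:    [T, H, W] single-channel feature maps
    fwd_flow: [T, H, W, 2] displacement (dy, dx) toward frame t+1, as in Eq. (2)
    bwd_flow: [T, H, W, 2] displacement (dy, dx) toward frame t-1
    w:        temporal weights (w_-1, w_0, w_+1)
    """
    T, H, W = feats.shape
    out = np.zeros_like(feats)
    for t in range(T):
        for i in range(H):
            for j in range(W):
                val = w[1] * feats[t, i, j]            # tau = 0: fixed location p
                if t > 0:                              # tau = -1: track backward
                    dy, dx = bwd_flow[t, i, j]
                    val += w[0] * bilinear_sample(feats[t - 1], i + dy, j + dx)
                if t < T - 1:                          # tau = +1: track forward
                    dy, dx = fwd_flow[t, i, j]
                    val += w[2] * bilinear_sample(feats[t + 1], i + dy, j + dx)
                out[t, i, j] = val
    return out
```

With all-zero flow fields, the function reduces to the standard temporal convolution, matching the remark in Section 1 that temporal convolution is the stationary special case of trajectory convolution.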
However, the motion information itself is important for action recognition. Inspired by the trajectory descriptor proposed in [37], we describe local motion patterns at each position p using the sequence of trajectory information in the form of the coordinates of the sampling offsets {Δp_τ : τ ∈ [−Δt, Δt]}. This is equivalent to stacking the offset map for trajectory convolution and the original appearance feature map. The offset map is normalized through Batch Normalization [12] before concatenation. As a result, we achieve the combination of appearance features and motion information in terms of trajectories with a minimal increase in network parameters. Compared with the canonical two-stream approaches, which are based on the late fusion of two networks, our approach leads to a unified network architecture and is much more parameter- and computation-efficient.

3.4 The TrajectoryNet Architecture

Based on the concept of trajectory convolution, we design a unified architecture, called TrajectoryNet, that can align appearance and motion features along the motion trajectories. It is built by integrating trajectory convolution into the Separable-3D ResNet18 architecture [9, 36]. In the middle level of the network, the 1D temporal convolution component of a (2+1)D-convolutional block is replaced by a trajectory convolution with a down-sampled motion field, such as a pre-computed optical flow. The appearance feature map for trajectory convolution is optionally concatenated with the down-sampled motion field to introduce extra motion information. Adding trajectory convolution at higher levels is likely to provide less motion information since the spatial resolution is reduced and the down-sampled optical flow may be inaccurate.
Adding trajectory convolution at lower levels increases the precision of motion estimation, but the receptive field for sampling positions is limited.

3.5 Learning Trajectory

As discussed in the previous subsection, the trajectory convolution can be viewed as deformable convolution with a special deformation map, namely the motion trajectory in the video. It is capable of accumulating gradients from higher layers via back-propagation. Therefore, if the trajectory can be estimated by a parametric model, we can learn the model parameters using back-propagation as well.

The most straightforward approach for this is applying a small 3D CNN to estimate trajectories, mimicking the 2D CNN used in deformable convolution networks [2]. Preliminary experiments show that this is not very effective. It can be observed that the offsets obtained simply by applying a 3D convolutional layer over the same input feature map are highly correlated with the appearance. On the contrary, the motion representation, for which we use trajectories as a medium, has long been considered, both intuitively and empirically, to be invariant to appearance [17, 28]. Therefore, we cannot naïvely adopt the way of learning offsets in [2]. This also reveals the difference between the original deformable convolution for object detection and our trajectory convolution for action recognition: the original deformable convolution attempts to learn the deformation of the spatial configuration within a single image, while our trajectory convolution tries to model the variation of appearance deformation across neighboring images, despite sharing a similar mathematical formulation.

To tackle this issue, we instead train a separate network to predict the trajectory. In particular, we use MotionNet [51] as the basis due to its light weight.
It accepts a stack of (M + 1) images as a 3(M + 1)-channel input and predicts a series of M motion field maps as a 2M-channel output. Following a "downsample-upsample" design like FlowNet-SD [11], motion fields at multiple spatial resolutions are predicted. The network is trained without external supervision such as ground-truth optical flow. An unsupervised loss L_unsup [51] is designed to enforce pair-wise reconstruction and similarity, with motion smoothness as a regularization.

Once pre-trained, the MotionNet can be plugged into the TrajectoryNet architecture to substitute for the input of pre-computed optical flow. We modify the original model in [51] to produce an optical flow map of the same resolution as the feature maps on which the trajectory convolution operates. The MotionNet can also be fine-tuned with the classification network. In this case, the loss for network training is a weighted sum of the unsupervised loss L_unsup and the cross-entropy loss for classification L_cls, written as L = γ L_unsup + L_cls.

4 Experiments

To evaluate the effectiveness of our TrajectoryNet, we conduct experiments on two benchmark datasets for action recognition: Something-Something V1 [8] and Kinetics [19]. Visualization of intermediate features for both appearance and trajectory is also provided.

4.1 Dataset Descriptions

Something-Something V1 [8] is a large-scale crowd-sourced video dataset on human-object interaction. It contains 108,499 video clips in 174 classes. The dataset is split into train, validation and test subsets in a ratio of around 8:1:1. The top-1 and top-5 accuracy are reported.

Kinetics [19] is a large-scale video dataset on human-centric activities sourced from YouTube. We use the version released in 2017, covering 400 human action classes.
Due to the inaccessibility of some videos on YouTube, our version contains 240,436, 19,796 and 38,685 clips in the training, validation and test subsets, respectively. The recognition performance is measured by the average of the top-1 and top-5 accuracy.

4.2 Experimental Setups

Network configuration We use the Separable-3D ResNet-18 [9] as the base model, if not specified. Starting from the base ResNet-18 model, a 1D temporal convolution module with a temporal kernel size of 3, followed by Batch Normalization [12] and a ReLU non-linearity, is inserted after every 2D spatial convolution module. A dropout of 0.2 is used between the global pooling and the last C-dimensional (C equals the total number of classes) fully-connected layer.

Generating trajectories As stated above, we study two methods to generate trajectories: one is based on variational methods and the other is based on CNNs. For the former, we adopt the TV-L1 algorithm [48] as implemented in OpenCV with CUDA. To match the size of the input feature, two types of pooling are used to down-sample the optical flow field: average pooling and max pooling. For the latter, the MotionNet is trained by randomly sampling image pairs from UCF-101 [30]. The training policy follows the practices in [51].

Training The network is trained with stochastic gradient descent with the momentum set to 0.9. The weights for 2D spatial convolution are initialized with the 2D ResNet pre-trained on ImageNet [27]. The length of each input clip is 16 and the sampling step varies from 1 to 2. For Something-Something V1, the batch size is set to 64, while for Kinetics, the batch size is 128. On Kinetics, the network is trained from an initial learning rate of 0.01, which is reduced by a factor of 10 every 40 epochs. The whole training procedure takes 100 epochs.
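The step schedule used on Kinetics (an initial rate of 0.01, divided by 10 every 40 epochs over a 100-epoch run) can be sketched as follows; the function name and signature are illustrative, not from the paper.

```python
def learning_rate(epoch, base_lr=0.01, step=40, factor=0.1):
    """Step learning-rate schedule: start at base_lr and multiply by
    `factor` after every `step` epochs (defaults match the Kinetics setup)."""
    return base_lr * factor ** (epoch // step)
```

Under this schedule, epochs 0-39 train at 0.01, epochs 40-79 at 0.001, and the remainder of the 100-epoch run at 0.0001.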
For Something-Something V1, the epoch number is halved because the duration of its videos is shorter.

Testing At test time, we follow the common practice of sampling a fixed number of N snippets (N = 7 for Something-Something V1 and N = 25 for Kinetics) at equal temporal intervals. By cropping and flipping the four corners and the center of each frame within a snippet, 10 inputs are obtained for each snippet. The final class scores are calculated by averaging the scores across all 10N inputs.

4.3 Ablation Studies

Trajectory convolution We first evaluate the effect of using trajectory convolution in the Separable-3D ResNet architecture in Table 2. A consistent improvement in accuracy can be observed when trajectory convolution is used. Then, we study the effect of incorporating trajectory convolution at different locations. Adding trajectory convolution increases the top-5 accuracy, but the top-1 accuracy saturates. In the remaining experiments, we use only one trajectory convolution, at the res3b1.conv1 block, if not specified.

Since we did not see a remarkable gain, we conjecture that this is because the used trajectory is derived from the optical flow down-sampled via average pooling. The optical flow is already smoothed by TV-L1, and the extra average pooling degrades its quality further. To verify this, we perform an additional experiment replacing average pooling with max pooling. This alternative down-sampling strategy preserves more details without degrading the trajectory. Furthermore, as will be shown in Table 4, using trajectories learned from MotionNet leads to higher accuracy. This indicates that the performance of TrajectoryNet highly depends on the quality of the trajectories.

Table 2: Results of using trajectory convolution in different convolutional layers in the Separable-3D ResNet-18 network. The accuracy is reported on the validation subset of Something-Something V1.

Usage of Traj. Conv. | Down-sample Method | Top-1 Acc. | Top-5 Acc.
None                 | None               | 34.30      | 65.66
res2b1.conv1         | Avg. Pool          | 34.49      | 66.23
res3a.conv1          | Avg. Pool          | 34.79      | 66.21
res3b1.conv1         | Avg. Pool          | 34.96      | 66.24
res3b1.conv1,2       | Avg. Pool          | 34.72      | 66.89
res3b1.conv1         | Max Pool           | 36.04      | 67.72

Combining motion and appearance features We compare the results of incorporating motion information into the trajectory convolution in Table 3. We can clearly see an improvement of more than 1% after encoding a 4-dimensional feature map of trajectory coordinates. We compare with several other methods, such as the early spatial fusion by concatenation with the motion feature map [7] and the late fusion used in the two-stream network [29]. Though there is still an apparent gap between ours and the late-fusion strategy, our fusion strategy achieves a notable increase with a negligible increase in parameters. It also completely removes the computation for running a motion-stream recognition network.

Table 3: Results of incorporating different sources of input into the trajectory convolution in the Separable-3D ResNet-18 network. ft. denotes the feature map. The accuracy is reported on the validation subset of Something-Something V1.

Source                            | Usage of Traj. Conv. | # param. | Top-1 Acc. | Top-5 Acc.
appearance                        | res3b1.conv1         | 15.2M    | 34.96      | 66.24
appearance + motion (ft.)         | res3b1.conv1         | 15.9M    | 35.24      | 67.22
appearance + trajectory (# dim=4) | res3b1.conv1         | 15.2M    | 36.08      | 67.72
two-stream S3D (late fusion)      | None                 | 30.4M    | 40.67      | 72.79

Learning trajectory Here we compare the learned trajectories against pre-computed optical flow from TV-L1 [48]. We choose two architectures of MotionNet: one accepts one image pair and outputs one motion field (denoted by MotionNet-(2)), and the other accepts 17 consecutive images and produces 16 motion fields (denoted by MotionNet-(17)).
We study three training policies: (1) fixing the MotionNet once it is pre-trained; (2) fine-tuning the MotionNet with the classification cross-entropy loss; and (3) fine-tuning the MotionNet with both the unsupervised loss and the classification loss. The loss weight γ is set to 0.01. The results are listed in Table 4. It turns out that the trajectories learned by both MotionNet-(2) and MotionNet-(17) outperform those derived from TV-L1 [48]. It is interesting to observe that jointly training MotionNet and TrajectoryNet yields lower accuracies than freezing MotionNet unless the unsupervised loss is introduced. We conjecture that the presence of L_unsup can help to maintain the quality of the trajectories by enforcing pair-wise consistency. The necessity of multi-task fine-tuning may also explain the difficulty of using shallow convolutional modules with random initialization to estimate the trajectory, which we discussed in Sec 3.5.

Table 4: Results of learning trajectory. The settings are elaborated in the text.

Source of trajectory | Fine-tune weight | Unsup. loss | Top-1 Acc. | Top-5 Acc.
TV-L1                | -                | -           | 34.96      | 66.24
MotionNet-(2)        | ✗                | ✗           | 36.37      | 67.74
MotionNet-(2)        | ✓                | ✗           | 34.72      | 65.59
MotionNet-(2)        | ✓                | ✓           | 36.91      | 68.47
MotionNet-(17)       | ✗                | ✗           | 35.69      | 66.82
MotionNet-(17)       | ✓                | ✗           | 35.25      | 66.65
MotionNet-(17)       | ✓                | ✓           | 36.69      | 68.52

Trajectories with step greater than one Here we evaluate the model which accepts an input of 16 frames at a sampling step of 2. To be more specific, we collect 32 consecutive frames and randomly sample one frame from every two neighboring frames. This enlarges the effective coverage of the architecture, i.e., from 16 to 32 frames, while keeping the computation the same. With the strategy of learning trajectories mentioned above, the TrajectoryNet can still improve over the baseline.
This also reflects the flexibility of learnable trajectories, since pre-computed optical flow would have to be re-run for the whole training set under such circumstances.

Table 5: Results of using trajectories with step greater than one.

# of frame × step | Effective coverage | Usage of Trajectories     | Top-1 Acc. | Top-5 Acc.
16 × 2            | 32                 | None                      | 42.47      | 74.57
16 × 2            | 32                 | MotionNet-(17)-ft.-unsup. | 43.32      | 74.85

Runtime Cost In Table 6, we report the runtime of the proposed TrajectoryNet under two settings: (1) one whose trajectories come from pre-computed TV-L1 (time not included) and (2) one whose trajectories are inferred by MotionNet-(17) (time included). Compared with its plain counterpart, the TrajectoryNet with pre-computed TV-L1 incurs less than 10% additional computation for the operation of trajectory convolution. TrajectoryNet with MotionNet-(17) takes an extra 0.137 seconds for the network forward pass compared to TrajectoryNet with TV-L1, which can be ascribed to the forward time of the plugged-in MotionNet.

Table 6: Runtime comparison of TrajectoryNet and its counterpart. The network is tested on a workstation with an Intel(R) Xeon(R) CPU (E5-2640 v3 @2.60GHz) and an Nvidia Titan X GPU.

Method                         | Net. forward (sec) | Δt (sec)
S3D                            | 0.390              | -
TrajectoryNet (TV-L1)          | 0.426              | +0.036
TrajectoryNet (MotionNet-(17)) | 0.563              | +0.137

4.4 Comparison with State-of-the-Arts

We compare the performance of our TrajectoryNet with other state-of-the-art methods. The results on Something-Something V1 [8] and Kinetics [19] are shown in Table 7 and Table 8, respectively. For Something-Something V1, we use 16 frames with a step of 2 as input and apply MotionNet-(17) to produce trajectories. Motion information encoded by trajectory is used optionally.
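As a concrete reading of the 16-frame, step-2 sampling described above (32 consecutive frames, with one frame drawn at random from each neighboring pair), here is a small sketch; the function name is hypothetical and the pairing interpretation is our assumption.

```python
import random

def sample_step2_clip(num_frames=16, start=0, rng=random):
    """Pick one frame at random from every consecutive pair (2i, 2i + 1),
    covering 2 * num_frames frames with num_frames samples (an assumed
    reading of the step-2 sampling described in the text)."""
    return [start + 2 * i + rng.randint(0, 1) for i in range(num_frames)]
```

The returned indices are strictly increasing and span an effective coverage of 32 frames while keeping the clip length, and hence the computation, at 16 frames.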
In Table 7, we can see that our TrajectoryNet achieves competitive results against state-of-the-art models, including those with deeper backbones or those pre-trained on larger datasets. After pre-training on Kinetics, the accuracy is boosted to a new level. For Kinetics, a MotionNet-(2) is used. In Table 8, the TrajectoryNet improves over the Separable-3D baseline. With 16 input frames at a step of 2, it performs on par with models of similar complexity.

Table 7: Comparison with state-of-the-art methods on the validation and test set of Something-Something V1. The performance is measured by the Top-1 accuracy.

Method                                       Backbone network               Pre-train   Val Top-1
3D-CNN [8]                                   C3D                            Sports-1M   11.5
MultiScale TRN [50]                          BN-Inception                   ImageNet    34.4
ECO lite [52]                                BN-Inception + 3D-ResNet18     Kinetics    46.4
Non-local I3D + GCN [44]                     ResNet-50                      Kinetics    46.1
TrajectoryNet-MotionNet-(17) w/o. motion     ResNet-18                      ImageNet    43.3
TrajectoryNet-MotionNet-(17) w/. motion      ResNet-18                      ImageNet    44.0
TrajectoryNet-MotionNet-(17) w/o. motion     ResNet-18                      Kinetics    47.8

Table 8: Comparison with state-of-the-art methods on the validation subset of Kinetics. The performance is measured by the average of Top-1 and Top-5 accuracy.

Method                                         Backbone network   Pre-train   Val. Avg. Acc.
TSN (RGB) [42]                                 BN-Inception-v2    ImageNet    77.8
I3D (RGB) [1]                                  BN-Inception-v1    ImageNet    81.2
Nonlocal-I3D (RGB) [43]                        ResNet-101         ImageNet    85.5
R(2+1)D (RGB) [36]                             ResNet-34          Sports-1M   82.6
C3D [35]                                       ResNet-18          -           75.7
ARTNet w/. TSN [40]                            ResNet-18          -           80.0
Separable-3D (RGB, 16 × 1 frames)              ResNet-18          ImageNet    76.9
TrajectoryNet-MotionNet-(2) (16 × 1 frames)    ResNet-18          ImageNet    77.8
TrajectoryNet-MotionNet-(2) (16 × 2 frames)    ResNet-18          ImageNet    79.8

4.5 Visualization

We present a qualitative study by visualizing the intermediate features of our TrajectoryNet in Figure 2. Given the pair of consecutive images at the top of the first column, we first compare their feature maps at the layer res3b1.conv1, i.e., the layer on which the trajectory convolution is applied, at the bottom of the first column. We can observe a visible spatial shift between the two images' high-response regions, which conforms to our assumption that feature maps are not well aligned due to object movement. We also show the different types of trajectories used in the experiments, namely the TV-L1 [48] optical flow and the MotionNet predictions before and after finetuning, in the second, third and fourth columns. We can see that the motion estimation by the original MotionNet is less smooth than TV-L1, especially in background regions. For foreground objects, however, MotionNet does well and can sometimes produce motion with a more rigid shape, e.g., the hand in the left example of Figure 2. Also, the joint training further improves the quality of the trajectories.

Figure 2: Visualization of the intermediate features of the TrajectoryNet. The two image pairs depict the actions of "moving something (a pen) down" and "trying but failing to attach something (a ball) to something (a cat) because it doesn't stick." For each block, the first column shows a pair of input images and their corresponding feature maps at the layer res3b1.conv1; the second, third and fourth columns show the optical flow field generated by the TV-L1 algorithm and learned by MotionNet before and after finetuning (the motion field encoded as an HSV color map, together with its x-axis and y-axis components, is shown from top to bottom).
The figure is best viewed in color.

5 Conclusion

In this paper, we propose a unified end-to-end architecture called TrajectoryNet for action recognition. The approach incorporates the repeatedly proven idea of trajectory modeling into the Separable-3D network by introducing a new operation named trajectory convolution. The TrajectoryNet further combines appearance and motion information in a unified model architecture. The proposed architecture achieves notable improvements over the Separable-3D baseline, offering a new perspective on explicitly modeling motion dynamics in deep networks.

Acknowledgment This work is partially supported by the Big Data Collaboration Research grant from SenseTime Group (CUHK Agreement No. TS1610626), and the Early Career Scheme (ECS) of Hong Kong (No. 24204215).

References

[1] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4733. IEEE, 2017.

[2] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In The IEEE International Conference on Computer Vision (ICCV), pages 764–773, 2017.

[3] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 886–893. IEEE, 2005.

[4] Navneet Dalal, Bill Triggs, and Cordelia Schmid. Human detection using oriented histograms of flow and appearance. In European Conference on Computer Vision (ECCV), pages 428–441. Springer, 2006.

[5] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. In The IEEE International Conference on Computer Vision (ICCV), pages 2758–2766, 2015.

[6] Christoph Feichtenhofer, Axel Pinz, and Richard P. Wildes. Spatiotemporal multiplier networks for video action recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7445–7454. IEEE, 2017.

[7] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream network fusion for video action recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016.

[8] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "Something Something" video database for learning and evaluating visual common sense. In The IEEE International Conference on Computer Vision (ICCV), 2017.

[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[10] Berthold K. P. Horn and Brian G. Schunck. Determining optical flow. Artificial Intelligence, 17(1-3):185–203, 1981.

[11] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, 2017.

[12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), pages 448–456, 2015.

[13] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems (NIPS), pages 2017–2025, 2015.

[14] Herve Jegou, Florent Perronnin, Matthijs Douze, Jorge Sánchez, Patrick Perez, and Cordelia Schmid. Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9):1704–1716, 2012.

[15] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2013.

[16] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc V. Gool. Dynamic filter networks. In Advances in Neural Information Processing Systems (NIPS), pages 667–675, 2016.

[17] Gunnar Johansson. Visual perception of biological motion and a model for its analysis. Perception & Psychophysics, 14(2):201–211, 1973.

[18] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1725–1732, 2014.

[19] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.

[20] Ivan Laptev. On space-time interest points. International Journal of Computer Vision, 64(2-3):107–123, 2005.

[21] Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozenfeld. Learning realistic human actions from movies. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE, 2008.

[22] Bruce D. Lucas and Takeo Kanade. An iterative image registration technique with an application to stereo vision. In International Joint Conference on Artificial Intelligence (IJCAI), pages 674–679. Morgan Kaufmann Publishers Inc., 1981.

[23] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4040–4048, 2016.

[24] Ross Messing, Chris Pal, and Henry Kautz. Activity recognition using the velocity histories of tracked keypoints. In The IEEE International Conference on Computer Vision (ICCV), pages 104–111. IEEE, 2009.

[25] Mathew Monfort, Bolei Zhou, Sarah Adel Bargal, Alex Andonian, Tom Yan, Kandan Ramakrishnan, Lisa Brown, Quanfu Fan, Dan Gutfruend, Carl Vondrick, et al. Moments in Time dataset: one million videos for event understanding. arXiv preprint arXiv:1801.03150, 2018.

[26] Anurag Ranjan and Michael J. Black. Optical flow estimation using a spatial pyramid network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, 2017.

[27] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[28] Laura Sevilla-Lara, Yiyi Liao, Fatma Guney, Varun Jampani, Andreas Geiger, and Michael J. Black. On the integration of optical flow and action recognition. arXiv preprint arXiv:1712.08416, 2017.

[29] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems (NIPS), pages 568–576, 2014.

[30] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human action classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.

[31] Deqing Sun, Stefan Roth, and Michael J. Black. Secrets of optical flow estimation and their principles. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2432–2439. IEEE, 2010.

[32] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[33] Ju Sun, Xiao Wu, Shuicheng Yan, Loong-Fah Cheong, Tat-Seng Chua, and Jintao Li. Hierarchical spatio-temporal context modeling for action recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2004–2011. IEEE, 2009.

[34] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In The IEEE International Conference on Computer Vision (ICCV), pages 4489–4497. IEEE, 2015.

[35] Du Tran, Jamie Ray, Zheng Shou, Shih-Fu Chang, and Manohar Paluri. ConvNet architecture search for spatiotemporal feature learning. arXiv preprint arXiv:1708.05038, 2017.

[36] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018.

[37] Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. Action recognition by dense trajectories. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3169–3176. IEEE, 2011.

[38] Heng Wang and Cordelia Schmid. Action recognition with improved trajectories. In The IEEE International Conference on Computer Vision (ICCV), pages 3551–3558, 2013.

[39] Heng Wang, Muhammad Muneeb Ullah, Alexander Klaser, Ivan Laptev, and Cordelia Schmid. Evaluation of local spatio-temporal features for action recognition. In British Machine Vision Conference (BMVC), pages 124–1. BMVA Press, 2009.

[40] Limin Wang, Wei Li, Wen Li, and Luc Van Gool. Appearance-and-relation networks for video classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018.

[41] Limin Wang, Yu Qiao, and Xiaoou Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4305–4314, 2015.

[42] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision (ECCV), pages 20–36. Springer, 2016.

[43] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[44] Xiaolong Wang and Abhinav Gupta. Videos as space-time region graphs. In European Conference on Computer Vision (ECCV), 2018.

[45] Xingxing Wang, Limin Wang, and Yu Qiao. A comparative study of encoding, pooling and normalization methods for action recognition. In Asian Conference on Computer Vision (ACCV), pages 572–585. Springer, 2012.

[46] Geert Willems, Tinne Tuytelaars, and Luc Van Gool. An efficient dense and scale-invariant spatio-temporal interest point detector. In European Conference on Computer Vision (ECCV), pages 650–663. Springer, 2008.

[47] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In European Conference on Computer Vision (ECCV), 2018.

[48] Christopher Zach, Thomas Pock, and Horst Bischof. A duality based approach for realtime TV-L1 optical flow. In Joint Pattern Recognition Symposium, pages 214–223. Springer, 2007.

[49] Yue Zhao, Yuanjun Xiong, and Dahua Lin. Recognize actions by disentangling components of dynamics. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6566–6575, 2018.

[50] Bolei Zhou, Alex Andonian, Aude Oliva, and Antonio Torralba. Temporal relational reasoning in videos. In European Conference on Computer Vision (ECCV), 2018.

[51] Yi Zhu, Zhenzhong Lan, Shawn Newsam, and Alexander G. Hauptmann. Hidden two-stream convolutional networks for action recognition. arXiv preprint arXiv:1704.00389, 2017.

[52] Mohammadreza Zolfaghari, Kamaljeet Singh, and Thomas Brox. ECO: Efficient convolutional network for online video understanding. In European Conference on Computer Vision (ECCV), 2018.