{"title": "Spatiotemporal Residual Networks for Video Action Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 3468, "page_last": 3476, "abstract": "Two-stream Convolutional Networks (ConvNets) have shown strong performance for human action recognition in videos. Recently, Residual Networks (ResNets) have arisen as a new technique to train extremely deep architectures. In this paper, we introduce spatiotemporal ResNets as a combination of these two approaches. Our novel architecture generalizes ResNets for the spatiotemporal domain by introducing residual connections in two ways. First, we inject residual connections between the appearance and motion pathways of a two-stream architecture to allow spatiotemporal interaction between the two streams. Second, we transform pretrained image ConvNets into spatiotemporal networks by equipping these with learnable convolutional filters that are initialized as temporal residual connections and operate on adjacent feature maps in time.  This approach slowly increases the spatiotemporal receptive field as the depth of the model increases and naturally integrates image ConvNet design principles. The whole model is trained end-to-end to allow hierarchical learning of complex spatiotemporal features. We evaluate our novel spatiotemporal ResNet using two widely used action recognition benchmarks where it exceeds the previous state-of-the-art.", "full_text": "Spatiotemporal Residual Networks\n\nfor Video Action Recognition\n\nChristoph Feichtenhofer\n\nGraz University of Technology\nfeichtenhofer@tugraz.at\n\nAxel Pinz\n\nGraz University of Technology\n\naxel.pinz@tugraz.at\n\nRichard P. Wildes\n\nYork University, Toronto\nwildes@cse.yorku.ca\n\nAbstract\n\nTwo-stream Convolutional Networks (ConvNets) have shown strong performance\nfor human action recognition in videos. Recently, Residual Networks (ResNets)\nhave arisen as a new technique to train extremely deep architectures. 
In this paper,\nwe introduce spatiotemporal ResNets as a combination of these two approaches.\nOur novel architecture generalizes ResNets for the spatiotemporal domain by\nintroducing residual connections in two ways. First, we inject residual connections\nbetween the appearance and motion pathways of a two-stream architecture to\nallow spatiotemporal interaction between the two streams. Second, we transform\npretrained image ConvNets into spatiotemporal networks by equipping them with\nlearnable convolutional \ufb01lters that are initialized as temporal residual connections\nand operate on adjacent feature maps in time. This approach slowly increases the\nspatiotemporal receptive \ufb01eld as the depth of the model increases and naturally\nintegrates image ConvNet design principles. The whole model is trained end-to-end\nto allow hierarchical learning of complex spatiotemporal features. We evaluate our\nnovel spatiotemporal ResNet using two widely used action recognition benchmarks\nwhere it exceeds the previous state-of-the-art.\n\n1 Introduction\nAction recognition in video is an intensively researched area, with many recent approaches focused\non application of Convolutional Networks (ConvNets) to this task, e.g. [13, 20, 26]. As actions can\nbe understood as spatiotemporal objects, researchers have investigated carrying spatial recognition\nprinciples over to the temporal domain by learning local spatiotemporal \ufb01lters [13, 25, 26]. However,\nsince the temporal domain arguably is fundamentally different from the spatial one, different treatment\nof these dimensions has been considered, e.g. 
by incorporating optical \ufb02ow networks [20], or\nmodelling temporal sequences in recurrent architectures [4, 18, 19].\nSince the introduction of the \u201cAlexNet\u201d architecture [14] in the 2012 ImageNet competition, ConvNets\nhave dominated state-of-the-art performance across a variety of computer vision tasks, including\nobject detection, image segmentation, image classi\ufb01cation, face recognition, human pose estimation\nand tracking. In conjunction with these advances as well as the evolution of network architectures,\nseveral design best practices have emerged [8, 21, 23, 24]. First, information bottlenecks should be\navoided and the representation size should gently decrease from the input to the output as the number\nof feature channels increases with the depth of the network. Second, the receptive \ufb01eld at the end of\nthe network should be large enough that the processing units can base operations on larger regions of\nthe input. This functionality can be achieved by stacking many small \ufb01lters or using large \ufb01lters in the\nnetwork; notably, the \ufb01rst choice can be implemented with fewer operations (faster, fewer parameters)\nand also allows inclusion of more nonlinearities. Third, dimensionality reduction (1\u00d71 convolutions)\nbefore spatially aggregating \ufb01lters (e.g. 3\u00d73) is supported by the fact that outputs of neighbouring\n\ufb01lters are highly correlated and therefore these activations can be reduced before aggregation [23].\nFourth, spatial factorization into asymmetric \ufb01lters can even further reduce computational cost and\nease the learning problem. Fifth, it is important to normalize the responses of each feature channel\nwithin a batch to reduce internal covariate shift [11]. The last architectural guideline is to use residual\nconnections to facilitate training of very deep models that are essential for good performance [8]. We\ncarry over these good practices for designing ConvNets in the image domain to the video domain\nby converting the 1\u00d71 convolutional dimensionality mapping \ufb01lters in ResNets to temporal \ufb01lters.\nBy stacking several of these transformed temporal \ufb01lters throughout the network we provide a large\nreceptive \ufb01eld for the discriminative units at the end of the network. Further, this design allows us\nto convert spatial ConvNets into spatiotemporal models and thereby exploit the large amount of\ntraining data from image datasets such as ImageNet.\n\nFigure 1: Our method introduces residual connections in a two-stream ConvNet model [20]. The two\nnetworks separately capture spatial (appearance) and temporal (motion) information to recognize the\ninput sequences. We do not use residuals from the spatial into the temporal stream as this would bias\nboth losses towards appearance information.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nWe build on the two-stream approach [20] that employs two separate ConvNet streams, a spatial\nappearance stream, which achieves state-of-the-art action recognition from RGB images, and a\ntemporal motion stream, which operates on optical \ufb02ow information. The two-stream architecture\nis inspired by the two-stream hypothesis from neuroscience [6] that postulates two pathways in\nthe visual cortex: The ventral pathway, which responds to spatial features such as shape or colour\nof objects, and the dorsal pathway, which is sensitive to object transformations and their spatial\nrelationship, e.g. as caused by motion. We extend two-stream ConvNets in the following ways.\nFirst, motivated by the recent success of residual networks (ResNets) [8] for numerous challenging\nrecognition tasks on datasets such as ImageNet and MS COCO, we apply ResNets to the task of\nhuman action recognition in videos. 
Here, we initialize our model with pre-trained ResNets for image\ncategorization [8] to leverage a large amount of image-based training data for the action recognition\ntask in video. Second, we demonstrate that injecting residual connections between the two streams\n(see Fig. 1) and jointly \ufb01ne-tuning the resulting model achieves improved performance over the\ntwo-stream architecture. Third, we overcome limited temporal receptive \ufb01eld size in the original\ntwo-stream approach by extending the model over time. We convert convolutional dimensionality\nmapping \ufb01lters to temporal \ufb01lters that provide the network with learnable residual connections over\ntime. By stacking several of these temporal \ufb01lters and sampling the input sequence at large temporal\nstrides (i.e. skipping frames), we enable the network to operate over large temporal extents of the\ninput. To demonstrate the bene\ufb01ts of our proposed spatiotemporal ResNet architecture, it has been\nevaluated on two standard action recognition benchmarks where it greatly boosts the state-of-the-art.\n2 Related work\nApproaches for action recognition in video can largely be divided into two categories: Those that use\nhand-crafted features with decoupled classi\ufb01ers and those that jointly learn features and classi\ufb01er.\nOur work is related to the latter, which is outlined in the following.\nSeveral approaches have been presented for spatiotemporal feature learning. Unsupervised learning\ntechniques have been applied by stacking ISA or convolutional gated RBMs to learn spatiotemporal\nfeatures for action recognition [16, 25].\nIn other work, spatiotemporal features are learned by\nextending 2D ConvNets into time by stacking consecutive video frames [12]. 
Yet another study\ncompared several approaches to extending ConvNets into the temporal domain, but with rather\ndisappointing results [13]: The architectures were not particularly sensitive to temporal modelling,\nwith a slow fusion model performing slightly better than early and late fusion alternatives; moreover,\nsimilar levels of performance were achieved by a purely spatial network. The recently proposed C3D\napproach learns 3D ConvNets on a limited temporal support of 16 frames, with all \ufb01lter kernels having\nsize 3\u00d73\u00d73 [26]. The network structure is similar to earlier deep spatial networks [21].\nAnother research branch has investigated combining image information in network architectures\nacross longer time periods. A comparison of temporal pooling architectures suggested that temporal\npooling of convolutional layers performs better than slow, local, or late pooling, as well as temporal\nconvolution [18]. That work also considered ordered sequence modelling, which feeds ConvNet\nfeatures into a recurrent network with Long Short-Term Memory (LSTM) cells. Using LSTMs,\nhowever, did not yield an improvement over temporal pooling of convolutional features. Other work\ntrained an LSTM on human skeleton sequences to regularize another LSTM that uses an Inception\nnetwork for frame-level descriptor input [17]. Yet other work uses a multilayer LSTM to let the\nmodel attend to relevant spatial parts in the input frames [19]. Further, the inner product of a recurrent\nmodel has been replaced with a 2D convolution, thereby converting the fully connected hidden\nlayers in a GRU-RNN to 2D convolutional operations [1]. 
That approach takes advantage of the local\nspatial similarity in images; however, it only yields a minor increase over their baseline, which is a\ntwo-stream VGG-16 ConvNet [21] used as the input to their convolutional RNN. Finally, three recent\napproaches for action recognition apply ConvNets as follows: In [2] dynamic images are created\nby weighted averaging of video frames over time; [31] captures the transformation of ConvNet\nfeatures from the beginning to the end of the video with a Siamese architecture; and [5] introduces a\nspatiotemporal convolutional fusion layer between the streams of a two-stream architecture.\nNotably, the most closely related work to ours (and to several of those above) is the two-stream\nConvNet architecture [20]. That approach \ufb01rst decomposes video into spatial and temporal compo-\nnents by using RGB and optical \ufb02ow frames. These components are fed into separate deep ConvNet\narchitectures to learn spatial as well as temporal information about the appearance and movement\nof the objects in a scene. Each stream initially performs video recognition on its own and for \ufb01nal\nclassi\ufb01cation, softmax scores are combined by late fusion. To date, this approach is the most effective\napproach of applying deep learning to action recognition, especially with limited training data. In\nour work we directly convert image ConvNets into 3D architectures and show greatly improved\nperformance over the two-stream baseline.\n3 Technical approach\n3.1 Two-Stream residual networks\nAs our base representation we use deep ResNets [8, 9]. 
These networks are designed similarly\nto the VGG networks [21], with small 3\u00d73 spatial \ufb01lters (except at the \ufb01rst layer), and similar to\nthe Inception networks [23], with 1\u00d71 \ufb01lters for learned dimensionality reduction and expansion.\nThe network sees an input of size 224\u00d7224 that is reduced \ufb01ve times in the network by stride 2\nconvolutions followed by a global average pooling layer of the \ufb01nal 7\u00d77 feature map and a\nfully-connected classi\ufb01cation layer with softmax. Each time the spatial size of the feature map changes,\nthe number of features is doubled to avoid tight bottlenecks. Batch normalization [11] and ReLU\n[14] are applied after each convolution; the network does not use hidden fc, dropout, or max-pooling\n(except immediately after the \ufb01rst layer). The residual units are de\ufb01ned as [8, 9]:\n\n$$x_{l+1} = f\\left(x_l + \\mathcal{F}(x_l; \\mathcal{W}_l)\\right), \\qquad (1)$$\n\nwhere $x_l$ and $x_{l+1}$ are input and output of the l-th layer, $\\mathcal{F}$ is a nonlinear residual mapping represented\nby convolutional \ufb01lter weights $\\mathcal{W}_l = \\{W_{l,k} \\mid 1 \\le k \\le K\\}$ with $K \\in \\{2, 3\\}$ and $f \\equiv$ ReLU [9]. A key\nadvantage of residual units is that their skip connections allow direct signal propagation from the \ufb01rst\nto the last layer of the network. Especially during backpropagation this arrangement is advantageous:\nGradients are propagated directly from the loss layer to any previous layer while skipping intermediate\nweight layers that have potential to trigger vanishing or deterioration of the gradient signal.\nWe also leverage the two-stream architecture [20]. For both streams, we use the ResNet-50 model [8]\npretrained on the ImageNet dataset and replace the last (classi\ufb01cation) layer according to the number\nof classes in the target dataset. The \ufb01lters in the \ufb01rst layer of the motion stream are further modi\ufb01ed\nby replicating the three RGB \ufb01lter channels to a size of 2L = 20 for operating over the horizontal\nand vertical optical \ufb02ow stacks, each of which has a stack of L = 10 frames. This tack allows us to\nexploit the availability of a large amount of annotated training data for both streams.\n\nFigure 2: The conv5_x residual units of our architecture. A residual connection (highlighted in red)\nbetween the two streams enables motion interactions. The second residual unit, conv5_2, also includes\ntemporal convolutions (highlighted in green) for learning abstract spacetime features.\n\nA drawback of the two-stream architecture is that it is unable to spatiotemporally register appearance\nand motion information. Thus, it is not able to represent what (captured by the spatial stream) moves\nin which way (captured by the temporal stream). Here, we remedy this de\ufb01ciency by letting the\nnetwork learn such spatiotemporal cues at several spatiotemporal scales. We enable this interaction\nby introducing residual connections between the two streams. Just as there can be various types\nof shortcut connections in a ResNet, there are several ways the two streams can be connected. In\npreliminary experiments we found that direct connections between identical layers of the two streams\nled to an increase in validation error. Similarly, bidirectional connections increased the validation\nerror signi\ufb01cantly. We conjecture that these results are due to the large change that the signal of\none network stream undergoes after injecting a fusion signal from the other stream. Therefore, we\ndeveloped a more subtle alternative solution based on additive interactions, as follows.\n\nMotion Residuals. We inject a skip connection from the motion stream to the appearance stream\u2019s\nresidual unit. To enable learning of spatiotemporal features at all possible scales, this modi\ufb01cation\nis applied before the second residual unit at each spatial resolution of the network (indicated by\n\u201cskip-stream\u201d in Table 1), as exempli\ufb01ed by the connection at the conv5_x layers in Fig. 2. Formally,\nthe corresponding appearance stream\u2019s residual units (1) are modi\ufb01ed according to\n\n$$\\hat{x}^a_{l+1} = f(x^a_l) + \\mathcal{F}\\left(x^a_l + f(x^m_l), \\mathcal{W}^a_l\\right), \\qquad (2)$$\n\nwhere $x^a_l$ is the input of the l-th layer of the appearance stream, $x^m_l$ the input of the l-th layer of the motion stream\nand $\\mathcal{W}^a_l$ are the weights of the l-th layer residual unit in the appearance stream. For the gradient of\nthe loss function $\\mathcal{L}$ in the backward pass the chain rule yields\n\n$$\\frac{\\partial \\mathcal{L}}{\\partial x^a_l} = \\frac{\\partial \\mathcal{L}}{\\partial \\hat{x}^a_{l+1}} \\frac{\\partial \\hat{x}^a_{l+1}}{\\partial x^a_l} = \\frac{\\partial \\mathcal{L}}{\\partial \\hat{x}^a_{l+1}} \\left( \\frac{\\partial f(x^a_l)}{\\partial x^a_l} + \\frac{\\partial}{\\partial x^a_l} \\mathcal{F}\\left(x^a_l + f(x^m_l), \\mathcal{W}^a_l\\right) \\right) \\qquad (3)$$\n\nfor the appearance stream and similarly for the motion stream\n\n$$\\frac{\\partial \\mathcal{L}}{\\partial x^m_l} = \\frac{\\partial \\mathcal{L}}{\\partial x^m_{l+1}} \\frac{\\partial x^m_{l+1}}{\\partial x^m_l} + \\frac{\\partial \\mathcal{L}}{\\partial \\hat{x}^a_{l+1}} \\frac{\\partial}{\\partial x^a_l} \\mathcal{F}\\left(x^a_l + f(x^m_l), \\mathcal{W}^a_l\\right), \\qquad (4)$$\n\nwhere the \ufb01rst additive term of (4) is the gradient at the l-th layer in the motion stream and the second\nterm accumulates gradients from the appearance stream. Thus, the residual connection between the\nstreams backpropagates gradients from the appearance stream into the motion stream.\n3.2 Convolutional residual connections across time\nSpatiotemporal coherence is an important cue when working with time varying visual data and can\nbe exploited to learn general representations from video in an unsupervised manner [7]. 
In that case,\ntemporal smoothness is an important property and is enforced by requiring features to vary slowly\nwith respect to time. Further, one can expect that in many cases a ConvNet is capturing similar\nfeatures across time. For example, an action with repetitive motion patterns such as \u201cHammering\u201d\nwould trigger similar features for the appearance and motion stream over time. For such cases\nthe use of temporal residual connections would make perfect sense. However, for cases where the\nappearance or the instantaneous motion pattern varies over time, a residual connection would be\nsuboptimal for discriminative learning, since the sum operation corresponds to a low-pass \ufb01ltering\nover time and would smooth out potentially important high-frequency temporal variation of the\nfeatures. Moreover, backpropagation is unable to compensate for that de\ufb01cit since at a sum layer all\ngradients are distributed equally from output to input connections.\n\nFigure 3: The temporal receptive \ufb01eld of a single neuron at the \ufb01fth meta layer of our motion network\nstream is highlighted. \u03c4 indicates the temporal stride between inputs. The outputs of conv5_3 are\nmax-pooled in time and fed to the fully connected layer of our ST-ResNet*.\n\nBased on the above observations, we developed a novel approach to temporal residual connections\nthat builds on the ConvNet design guidelines of chaining small [21] asymmetric [10, 23] \ufb01lters, noted\nin Sec. 1. 
We extend the ResNet architecture with temporal convolutions by transforming spatial\ndimensionality mapping \ufb01lters in the residual paths to temporal \ufb01lters. This allows the straightforward\nuse of standard two-stream ConvNets that have been pre-trained on large-scale datasets, e.g. to leverage\nthe massive amounts of training data from the ImageNet challenge. We initialize the temporal weights\nas residual connections across time and let the network learn to best discriminate image dynamics\nvia backpropagation. We achieve this by replicating the learned spatial 1\u00d71 dimensionality mapping\nkernels in pretrained ResNets across time. Given the pretrained spatial weights, $w_l \\in \\mathbb{R}^{1 \\times 1 \\times C}$,\ntemporal \ufb01lters, $\\hat{w}_l \\in \\mathbb{R}^{1 \\times 1 \\times T' \\times C}$, are initialized according to\n\n$$\\hat{w}_l(i, j, t, c) = \\frac{w_l(i, j, c)}{T'}, \\quad \\forall t \\in [1, T'], \\qquad (5)$$\n\nand subsequently re\ufb01ned via backpropagation. In (5), division by $T'$ serves to average feature\nresponses across time. We transform \ufb01lters from both the motion and the appearance ResNets\naccordingly. Hence, the temporal \ufb01lters are able to learn the temporal evolution of the appearance\nand motion features and, moreover, by stacking such \ufb01lters as the depth of the network increases,\ncomplex spatiotemporal relationships can be modelled.\n3.3 Proposed architecture\nOur overall architecture (used for each stream) is summarized in Table 1. The underlying network\nused is a 50 layer ResNet [8]. Each \ufb01ltering operation is followed by batch normalization [11] and\nhalfway recti\ufb01cation (ReLU). In the columns we show \u201cmetalayers\u201d which share the same output size.\nFrom left to right, top to bottom, the \ufb01rst row shows the convolutional and pooling building blocks,\nwith the \ufb01lter and pooling size shown as (W \u00d7 H \u00d7 T, C), denoting width, height, temporal extent\nand number of feature channels, resp. 
Brackets outline residual units equipped with skip connections.\nIn the last two rows we show the output size of these metalayers as well as the receptive \ufb01eld on\nwhich they operate. One observes that the temporal receptive \ufb01eld is modulated by the temporal stride\n\u03c4 between the input chunks. For example, if the stride is set to \u03c4 = 15 frames, a unit at conv5_3 sees\na window of 17 \u00d7 15 = 255 frames on the input video; see Fig. 3. The pool5 layer receives multiple\nspatiotemporal features, where the spatial 7 \u00d7 7 features are averaged as in [8] and the temporal\nfeatures are max-pooled within a window of 5, with each of these seeing a window of 705 frames at\nthe input. The pool5 output is classi\ufb01ed by a fully connected layer of size 1 \u00d7 1 \u00d7 1 \u00d7 2048; note\nthat this passes several temporally max-pooled chunks to the softmax log-loss layer afterwards. For\nvideos with fewer than 705 frames we reduce the stride between temporal inputs and for extremely\nshort videos we symmetrically pad the input over time.\n\nLayers | conv1 | pool1 | conv2_x | conv3_x | conv4_x | conv5_x | pool5\nBlocks | 7\u00d77\u00d71, 64 | 3\u00d73\u00d71 max, stride 2 | [1\u00d71\u00d71, 64; 3\u00d73\u00d71, 64; 1\u00d71\u00d71, 256], skip-stream, [1\u00d71\u00d73, 64; 3\u00d73\u00d71, 64; 1\u00d71\u00d73, 256], [1\u00d71\u00d71, 64; 3\u00d73\u00d71, 64; 1\u00d71\u00d71, 256] | [1\u00d71\u00d71, 128; 3\u00d73\u00d71, 128; 1\u00d71\u00d71, 512], skip-stream, [1\u00d71\u00d73, 128; 3\u00d73\u00d71, 128; 1\u00d71\u00d73, 512], [1\u00d71\u00d71, 128; 3\u00d73\u00d71, 128; 1\u00d71\u00d71, 512] \u00d72 | [1\u00d71\u00d71, 256; 3\u00d73\u00d71, 256; 1\u00d71\u00d71, 1024], skip-stream, [1\u00d71\u00d73, 256; 3\u00d73\u00d71, 256; 1\u00d71\u00d73, 1024], [1\u00d71\u00d71, 256; 3\u00d73\u00d71, 256; 1\u00d71\u00d71, 1024] \u00d74 | [1\u00d71\u00d71, 512; 3\u00d73\u00d71, 512; 1\u00d71\u00d71, 2048], skip-stream, [1\u00d71\u00d73, 512; 3\u00d73\u00d71, 512; 1\u00d71\u00d73, 2048], [1\u00d71\u00d71, 512; 3\u00d73\u00d71, 512; 1\u00d71\u00d71, 2048] | 7\u00d77\u00d71 avg; 1\u00d71\u00d75 max, stride 2\nOutput size | 112\u00d7112\u00d711 | 56\u00d756\u00d711 | 56\u00d756\u00d711 | 28\u00d728\u00d711 | 14\u00d714\u00d711 | 7\u00d77\u00d711 | 1\u00d71\u00d74\nRecept. field | 7\u00d77\u00d71 | 11\u00d711\u00d71 | 35\u00d735\u00d75\u03c4 | 99\u00d799\u00d79\u03c4 | 291\u00d7291\u00d713\u03c4 | 483\u00d7483\u00d717\u03c4 | 675\u00d7675\u00d747\u03c4\n\nTable 1: Spatiotemporal ResNet architecture used in both ConvNet streams. The metalayers are shown\nin the columns with their building blocks showing the convolutional \ufb01lter dimensions (W \u00d7 H \u00d7 T, C)\nin brackets. Each building block shown in brackets also has a skip connection to the block below\nand skip-stream denotes a residual connection from the motion to the appearance stream, e.g., see\nFig. 2 for the conv5_2 building block. Stride 2 downsampling is performed by conv1, pool1, conv3_1,\nconv4_1 and conv5_1. The output and receptive \ufb01eld size of these layers is shown below. For both\nstreams, the pool5 layer is followed by a 1 \u00d7 1 \u00d7 1 \u00d7 2048 fully connected layer, a softmax and a loss.\n\nSub-batch normalization. Batch normalization [11] subtracts from all activations the batchwise\nmean and divides by the batchwise standard deviation. These moments are estimated by averaging over spatial locations\nand multiple images in the batch. After batch normalization a learned, channel-speci\ufb01c af\ufb01ne\ntransformation (scaling and bias) is applied. The noisy mean/variance estimation replaces the need\nfor dropout regularization [8, 24]. We found that lowering the number of samples used for batch\nnormalization can further improve the generalization performance of the model. For example, for the\nappearance stream we use a low batch size of 4 for moment estimation during training. This practice\nstrongly supports generalization of the model and nontrivially increases validation accuracy (\u22484% on\nUCF101). Interestingly, in comparison to this approach, using dropout after the classi\ufb01cation layer\n(e.g. as in [24]) decreased validation accuracy of the appearance stream. 
Note that only the batch size\nfor normalizing the activations is reduced; the batch size in stochastic gradient descent is unchanged.\n3.4 Model training and evaluation\nOur method has been implemented in MatConvNet [28] and we share our code and models at\nhttps://github.com/feichtenhofer/st-resnet. We train our model in three optimization steps with the\nparameters listed in Table 2.\n\nTraining phase | SGD batch size | Bnorm batch size | Learning Rate (#Iterations) | Temporal chunks / stride \u03c4\nMotion stream | 256 | 86 | 10^{-2} (30K), 10^{-3} (10K) | 1 / \u03c4 = 1\nAppearance stream | 256 | 8 | 10^{-2} (10K), 10^{-3} (10K) | 1 / \u03c4 = 1\nST-ResNet | 128 | 4 | 10^{-3} (30K), 10^{-4} (30K), 10^{-5} (20K) | 5 / \u03c4 \u2208 [5, 15]\nST-ResNet* | 128 | 4 | 10^{-4} (2K), 10^{-5} (2K) | 11 / \u03c4 \u2208 [1, 15]\n\nTable 2: Parameters for the three training phases of our model\n\nMotion and appearance streams. First, each stream is trained similarly to [20] using Stochastic\nGradient Descent (SGD) with momentum of 0.9. We rescale all videos by keeping the aspect ratio\nand resizing the smallest side of a frame to 256. The motion network uses optical \ufb02ow stacking\nwith L = 10 frames and is trained for 30K iterations with a learning rate of 10^{-2} followed by 10K\niterations at a learning rate of 10^{-3}. At each iteration, a batch of 256 samples is constructed by\nrandomly sampling a single optical \ufb02ow stack from a video; however, for batch normalization [11],\nwe only use 86 samples to facilitate generalization. We precompute optical \ufb02ow [32] before training\nand store the \ufb02ow \ufb01elds as JPEGs (with displacement vectors > 20 pixels clipped). During training,\nwe use the same augmentations as in [1, 31]; i.e. randomly cropping from the borders and centre of\nthe \ufb02ow stack and sampling the width and height of each crop randomly from {256, 224, 192, 168},\nfollowed by resizing to 224 \u00d7 224. 
The appearance stream is trained identically with a batch of\n256 RGB frames and learning rate of 10^{-2} for 10K iterations, followed by 10^{-3} for another 10K\niterations. Notably, here we choose a very small batch size of 8 for normalization. We also apply\nrandom cropping and scale augmentations: We randomly jitter the width and height of the 224 \u00d7 224\ninput frame by \u00b125% and also randomly crop it from a maximum of 25% distance from the image\nborders. The cropped patch is rescaled to 224 \u00d7 224 and passed as input to the network. The same\nrescaling and cropping technique is chosen to train the next two steps described below. In all our\ntraining steps we use random horizontal \ufb02ipping and do not apply RGB colour jittering [14].\nST-ResNet. Second, to train our spatiotemporal ResNet we sample 5 inputs from a video with\nrandom temporal stride between 5 and 15 frames. This technique can be thought of as frame-rate\njittering for the temporal convolutional layers and is important to reduce over\ufb01tting of the \ufb01nal model.\nSGD is used with a batch size of 128 videos where 5 temporal chunks are extracted from each.\nBatch-normalization uses a smaller batch size of 128/32 = 4. The learning rate is set to 10^{-3} and is\nreduced by a factor of 10 after 30K iterations. Notably, there is no pooling over time, which leads to\ntemporal fully convolutional training with a single loss for each of the 5 inputs and both streams. We\nfound that this strategy signi\ufb01cantly reduces the training duration with the drawback that each loss\ndoes not capture all available information. We overcome this by the next training step.\nST-ResNet*. For our \ufb01nal model, we equip the spatiotemporal ResNet with a temporal max-pooling\nlayer after pool5 (see Table 1; temporal average pooling led to inferior results) and continue training\nas above with the learning rate starting from 10^{-4} for 2K iterations followed by 10^{-5}. 
As indicated\nin Table 2, we now use 11 temporal chunks as input with the stride \u03c4 between these being randomly\nchosen from [1, 15].\nFully convolutional inference. For fair comparison, we follow the evaluation procedure of the\noriginal two-stream work [20] by sampling 25 frames (and their horizontal \ufb02ips). However, rather\nthan using 10 spatial 224 \u00d7 224 crops from each of the frames, we apply fully convolutional testing\nboth spatially (smallest side rescaled to 256) and temporally (the 25 frame-chunks) by classifying the\nvideo in a single forward pass, which takes \u2248250 ms on a Titan X GPU. For inference, we average\nthe predictions of the fully connected layers (without softmax) over all spatiotemporal locations.\n4 Evaluation\nWe evaluate our approach on two challenging action recognition datasets. First, we consider UCF101\n[22], which consists of 13320 videos showing 101 action classes. It provides large diversity in terms\nof actions, variations in background, illumination, camera motion and viewpoint, as well as object\nappearance, scale and pose. Second, we consider HMDB51 [15], which has 6766 videos that show 51\ndifferent actions and generally is considered more challenging than UCF101 due to the even wider\nvariations in which actions occur. For both datasets, we use the provided evaluation protocol and\nreport mean average accuracy over three splits into training and test sets.\n4.1 Two-Stream ResNet with additive interactions\nTable 3 shows the results of our two-stream architecture across the three training stages outlined\nin Sec. 3.4. For stream fusion, we always average the (non-softmaxed) prediction scores of the\nclassi\ufb01cation layer as this approach produces better results than averaging the softmax scores. 
Initially,\nlet us consider the performance of the two streams, both initialized with ResNet50 models trained on\nImageNet [8], but without cross-stream residual connections (2) and temporal convolutional layers\n(5). The accuracies for UCF101 and HMDB51 are 89.47% and 60.59%, respectively (our HMDB51 motion stream\nis initialized from the UCF101 model). Comparatively, a VGG16 two-stream architecture produces\n91.4% and 58.5% [1, 31]. In comparing these results it is notable that the VGG16 architecture is\nmore computationally demanding (19.6 vs. 3.8 billion multiply-add FLOPs) and also holds more\nmodel parameters (135M vs. 34M) than a ResNet50 model.\n\nDataset   Appearance stream   Motion stream   Two-Streams   ST-ResNet   ST-ResNet*\nUCF101    82.29%              79.05%          89.47%        92.76%      93.46%\nHMDB51    43.42%              55.47%          60.59%        65.57%      66.41%\n\nTable 3: Classification accuracy on UCF101 and HMDB51 in the three training stages of our model.\n\nWe now consider our proposed spatiotemporal ResNet (ST-ResNet), which is initialized by our\ntwo-stream ResNet50 model of above and subsequently equipped with 4 residual connections between the\nstreams and 16 transformed temporal convolution layers (initialized as averaging filters). The model\nis trained end-to-end with the loss layers unchanged (we found that using a single, joint softmax\nclassifier overfits severely to appearance information) and learning parameters chosen as in Table 2.\nThe results are shown in the penultimate column of Table 3. Our architecture significantly improves\nover the two-stream baseline indicating the importance of residual connections between the streams\nas well as temporal convolutional connections over time. Interestingly, research in neuroscience\nalso suggests that the human visual cortex is equipped with connections between the dorsal and the\nventral stream to distribute motion information to separate visual areas [3, 27]. 
Finally, in the last\ncolumn of Table 3 we show results for our ST-ResNet* architecture that is further equipped with a\ntemporal max-pooling layer to consider larger temporal windows in training and testing. For training\nST-ResNet* we use 11 temporal chunks at the input and the max-pooling layer pools over 5 chunks\nto expand the temporal receptive field at the loss layer to a maximum of 705 frames at the input. For\ntesting, where the network sees 25 temporal chunks, we observe that this long-term pooling further\nimproves accuracy over our ST-ResNet by around 1% on both datasets.\n4.2 Comparison with the state-of-the-art\nWe compare to the state-of-the-art in action recognition over all three splits of UCF101 and HMDB51\nin Table 4 (left). We use ST-ResNet*, as above, and predict the videos in a single forward pass using\nfully convolutional testing. When comparing to the original two-stream method [20], we improve by\n5.4% on UCF101 and by 7% on HMDB51. Apparently, even though the original two-stream approach\nhas the advantage of multitask learning (HMDB51) and SVM fusion, the benefits of our deeper\narchitecture with its cross-stream residual connections are greater. Another interesting comparison\nis against the two-stream network in [18], which attaches an LSTM to a two-stream Inception [23]\narchitecture. Their accuracy of 88.6% is to date the best performing approach using LSTMs for action\nrecognition. 
Here, our gain of 4.8% further underlines the importance of our architectural choices.\n\nMethod                        UCF101   HMDB51\nTwo-Stream ConvNet [20]       88.0%    59.4%\nTwo-Stream+LSTM [18]          88.6%    -\nTwo-Stream (VGG16) [1, 31]    91.4%    58.5%\nTransformations [31]          92.4%    62.0%\nTwo-Stream Fusion [5]         92.5%    65.4%\nST-ResNet*                    93.4%    66.4%\n\nMethod                               UCF101   HMDB51\nIDT [29]                             86.4%    61.7%\nC3D + IDT [26]                       90.4%    -\nTDD + IDT [30]                       91.5%    65.9%\nDynamic Image Networks + IDT [2]     89.1%    65.2%\nTwo-Stream Fusion [5]                93.5%    69.2%\nST-ResNet* + IDT                     94.6%    70.3%\n\nTable 4: Mean classification accuracy of the state-of-the-art on HMDB51 and UCF101 for the best\nConvNet approaches (left) and methods that additionally use IDT features (right). Our ST-ResNet\nobtains best performance on both datasets.\nThe Transformations [31] method captures the transformation from start to finish of a video by\nusing two VGG16 Siamese streams (that do not share model parameters, i.e. 4 VGG16 models) to\ndiscriminatively learn a transformation matrix. This method uses considerably more parameters\nthan our approach, yet is readily outperformed by ours. When comparing with the previously best\nperforming approach [5], we observe that our method provides a consistent performance gain of\naround 1% on both datasets.\nThe combination of ConvNet methods with trajectory-based hand-crafted IDT features [29] typically\nboosts performance nontrivially [2, 26]. Therefore, we further explore the benefits of adding trajectory\nfeatures to our approach. We achieve this by simply averaging the L2-normalized SVM scores of the\nFV-encoded IDT descriptors (i.e. HOG, HOF, MBH) [29] with the L2-normalized video predictions\nof our ST-ResNet*, again without softmax normalization. 
The results are shown in Table 4 (right)\nwhere we observe a notable boost in accuracy of our approach on HMDB51, albeit less on UCF101.\nNote that unlike our approach, the other approaches in Table 4 (right) suffer considerably larger\nperformance drops when used without IDT, e.g. C3D [26] reduces to 85.2% on UCF101, while\nDynamic Image Networks [2] reduces to 76.9% on UCF101 and 42.8% on HMDB51. These\nrelatively larger performance decrements again underline that our approach is better able to capture\nthe available dynamic information, as there is less to be gained by augmenting it with IDT. Still, there\nis a benefit from the hand-crafted IDT features even with our approach, which could be attributed to\ntheir explicit compensation of camera motion. Overall, our 94.6% on UCF101 and 70.3% on HMDB51\nclearly set a new state-of-the-art on these widely used action recognition datasets.\n5 Conclusion\nWe have presented a novel spatiotemporal ResNet architecture for video-based action recognition. In\nparticular, our approach is the first to combine two-stream with residual networks and to show the\ngreat advantage that results. Our ST-ResNet allows the hierarchical learning of spacetime features\nby connecting the appearance and motion channels of a two-stream architecture. Furthermore, we\ntransfer both streams from the spatial to the spatiotemporal domain by transforming the dimensionality\nmapping filters of a pre-trained model into temporal convolutions, initialized as residual filters over\ntime. The whole system is trained end-to-end and achieves state-of-the-art performance on two\npopular action recognition datasets.\n\nAcknowledgments. This work was supported by the Austrian Science Fund (FWF) under project\nP27076 and NSERC. The GPUs used for this research were donated by NVIDIA. 
Christoph\nFeichtenhofer is a recipient of a DOC Fellowship of the Austrian Academy of Sciences at the Institute\nof Electrical Measurement and Measurement Signal Processing, Graz University of Technology.\n\nReferences\n[1] Nicolas Ballas, Li Yao, Chris Pal, and Aaron Courville. Delving deeper into convolutional networks for learning video representations. In Proc. ICLR, 2016.\n[2] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould. Dynamic image networks for action recognition. In Proc. CVPR, 2016.\n[3] Richard T Born and Roger BH Tootell. Segregation of global and local motion processing in primate middle temporal visual area. Nature, 357(6378):497\u2013499, 1992.\n[4] Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proc. CVPR, 2015.\n[5] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream network fusion for video action recognition. In Proc. CVPR, 2016.\n[6] M. A. Goodale and A. D. Milner. Separate visual pathways for perception and action. Trends in Neurosciences, 15(1):20\u201325, 1992.\n[7] Ross Goroshin, Joan Bruna, Jonathan Tompson, David Eigen, and Yann LeCun. Unsupervised feature learning from temporal data. In Proc. ICCV, 2015.\n[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.\n[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. arXiv preprint arXiv:1603.05027, 2016.\n[10] Yani Ioannou, Duncan Robertson, Jamie Shotton, Roberto Cipolla, and Antonio Criminisi. Training CNNs with low-rank filters for efficient image classification. In Proc. ICLR, 2016.\n[11] Sergey Ioffe and Christian Szegedy. 
Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. ICML, 2015.\n[12] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE PAMI, 35(1):221\u2013231, 2013.\n[13] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proc. CVPR, 2014.\n[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.\n[15] Hildegard Kuehne, Hueihan Jhuang, Est\u00edbaliz Garrote, Tomaso Poggio, and Thomas Serre. HMDB: a large video database for human motion recognition. In Proc. ICCV, 2011.\n[16] Quoc V Le, Will Y Zou, Serena Y Yeung, and Andrew Y Ng. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In Proc. CVPR, 2011.\n[17] Behrooz Mahasseni and Sinisa Todorovic. Regularizing long short term memory with 3D human-skeleton sequences for action recognition.\n[18] Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In Proc. CVPR, 2015.\n[19] Shikhar Sharma, Ryan Kiros, and Ruslan Salakhutdinov. Action recognition using visual attention. In NIPS Workshop on Time Series, 2015.\n[20] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.\n[21] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Proc. ICLR, 2014.\n[22] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. 
Technical Report CRCV-TR-12-01, 2012.\n[23] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.\n[24] Christian Szegedy, Sergey Ioffe, and Vincent Vanhoucke. Inception-v4, inception-resnet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016.\n[25] G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler. Convolutional learning of spatio-temporal features. In Proc. ECCV, 2010.\n[26] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proc. ICCV, 2015.\n[27] David C Van Essen and Jack L Gallant. Neural mechanisms of form and motion processing in the primate visual system. Neuron, 13(1):1\u201310, 1994.\n[28] A. Vedaldi and K. Lenc. MatConvNet \u2013 convolutional neural networks for MATLAB. In Proceedings of the ACM Int. Conf. on Multimedia, 2015.\n[29] Heng Wang and Cordelia Schmid. Action recognition with improved trajectories. In Proc. ICCV, 2013.\n[30] Limin Wang, Yu Qiao, and Xiaoou Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In Proc. CVPR, 2015.\n[31] Xiaolong Wang, Ali Farhadi, and Abhinav Gupta. Actions ~ transformations. In Proc. CVPR, 2016.\n[32] C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime TV-L1 optical flow. In Proc. DAGM, pages 214\u2013223, 2007.", "award": [], "sourceid": 1725, "authors": [{"given_name": "Christoph", "family_name": "Feichtenhofer", "institution": "Graz University of Technology"}, {"given_name": "Axel", "family_name": "Pinz", "institution": "Graz University of Technology"}, {"given_name": "Richard", "family_name": "Wildes", "institution": "York University Toronto"}]}