{"title": "Two-Stream Convolutional Networks for Action Recognition in Videos", "book": "Advances in Neural Information Processing Systems", "page_first": 568, "page_last": 576, "abstract": "We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to generalise the best performing hand-crafted features within a data-driven learning framework. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multi-task learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it is competitive with the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.", "full_text": "Two-Stream Convolutional Networks\n\nfor Action Recognition in Videos\n\nKaren Simonyan\n\nAndrew Zisserman\n\nVisual Geometry Group, University of Oxford\n\n{karen,az}@robots.ox.ac.uk\n\nAbstract\n\nWe investigate architectures of discriminatively trained deep Convolutional Net-\nworks (ConvNets) for action recognition in video. The challenge is to capture\nthe complementary information on appearance from still frames and motion be-\ntween frames. We also aim to generalise the best performing hand-crafted features\nwithin a data-driven learning framework.\nOur contribution is three-fold. First, we propose a two-stream ConvNet architec-\nture which incorporates spatial and temporal networks. 
Second, we demonstrate\nthat a ConvNet trained on multi-frame dense optical \ufb02ow is able to achieve very\ngood performance in spite of limited training data. Finally, we show that multi-\ntask learning, applied to two different action classi\ufb01cation datasets, can be used to\nincrease the amount of training data and improve the performance on both. Our\narchitecture is trained and evaluated on the standard video actions benchmarks of\nUCF-101 and HMDB-51, where it is competitive with the state of the art. It also\nexceeds by a large margin previous attempts to use deep nets for video classi\ufb01ca-\ntion.\n\n1 Introduction\nRecognition of human actions in videos is a challenging task which has received a signi\ufb01cant amount\nof attention in the research community [11, 14, 17, 26]. Compared to still image classi\ufb01cation, the\ntemporal component of videos provides an additional (and important) clue for recognition, as a\nnumber of actions can be reliably recognised based on the motion information. Additionally, video\nprovides natural data augmentation (jittering) for single image (video frame) classi\ufb01cation.\nIn this work, we aim at extending deep Convolutional Networks (ConvNets) [19], a state-of-the-\nart still image representation [15], to action recognition in video data. This task has recently been\naddressed in [14] by using stacked video frames as input to the network, but the results were signif-\nicantly worse than those of the best hand-crafted shallow representations [20, 26]. We investigate\na different architecture based on two separate recognition streams (spatial and temporal), which\nare then combined by late fusion. The spatial stream performs action recognition from still video\nframes, whilst the temporal stream is trained to recognise action from motion in the form of dense\noptical \ufb02ow. Both streams are implemented as ConvNets. 
Decoupling the spatial and temporal nets\nalso allows us to exploit the availability of large amounts of annotated image data by pre-training\nthe spatial net on the ImageNet challenge dataset [1]. Our proposed architecture is related to the\ntwo-streams hypothesis [9], according to which the human visual cortex contains two pathways: the\nventral stream (which performs object recognition) and the dorsal stream (which recognises motion);\nthough we do not investigate this connection any further here.\nThe rest of the paper is organised as follows. In Sect. 1.1 we review the related work on action\nrecognition using both shallow and deep architectures.\nIn Sect. 2 we introduce the two-stream\narchitecture and specify the Spatial ConvNet. Sect. 3 introduces the Temporal ConvNet and in\nparticular how it generalises the previous architectures reviewed in Sect. 1.1. A multi-task learning\nframework is developed in Sect. 4 in order to allow effortless combination of training data over\nmultiple datasets. Implementation details are given in Sect. 5, and the performance is evaluated\nin Sect. 6 and compared to the state of the art. Our experiments on two challenging datasets (UCF-\n101 [24] and HMDB-51 [16]) show that the two recognition streams are complementary, and our\ndeep architecture signi\ufb01cantly outperforms that of [14] and is competitive with the state of the art\nshallow representations [20, 21, 26] in spite of being trained on relatively small datasets.\n1.1 Related work\nVideo recognition research has been largely driven by the advances in image recognition methods,\nwhich were often adapted and extended to deal with video data. A large family of video action\nrecognition methods is based on shallow high-dimensional encodings of local spatio-temporal fea-\ntures. 
For instance, the algorithm of [17] consists in detecting sparse spatio-temporal interest points,\nwhich are then described using local spatio-temporal features: Histogram of Oriented Gradients\n(HOG) [7] and Histogram of Optical Flow (HOF). The features are then encoded into the Bag Of\nFeatures (BoF) representation, which is pooled over several spatio-temporal grids (similarly to spa-\ntial pyramid pooling) and combined with an SVM classi\ufb01er. In a later work [28], it was shown that\ndense sampling of local features outperforms sparse interest points.\nInstead of computing local video features over spatio-temporal cuboids, state-of-the-art shallow\nvideo representations [20, 21, 26] make use of dense point trajectories. The approach, \ufb01rst in-\ntroduced in [29], consists in adjusting local descriptor support regions, so that they follow dense\ntrajectories, computed using optical \ufb02ow. The best performance in the trajectory-based pipeline\nwas achieved by the Motion Boundary Histogram (MBH) [8], which is a gradient-based feature,\nseparately computed on the horizontal and vertical components of optical \ufb02ow. A combination of\nseveral features was shown to further boost the accuracy. Recent improvements of trajectory-based\nhand-crafted representations include compensation of global (camera) motion [10, 16, 26], and the\nuse of the Fisher vector encoding [22] (in [26]) or its deeper variant [23] (in [21]).\nThere has also been a number of attempts to develop a deep architecture for video recognition. In\nthe majority of these works, the input to the network is a stack of consecutive video frames, so the\nmodel is expected to implicitly learn spatio-temporal motion-dependent features in the \ufb01rst layers,\nwhich can be a dif\ufb01cult task. In [11], an HMAX architecture for video recognition was proposed\nwith pre-de\ufb01ned spatio-temporal \ufb01lters in the \ufb01rst layer. 
Later, it was combined [16] with a spatial\nHMAX model, thus forming spatial (ventral-like) and temporal (dorsal-like) recognition streams.\nUnlike our work, however, the streams were implemented as hand-crafted and rather shallow (3-\nlayer) HMAX models. In [4, 18, 25], a convolutional RBM and ISA were used for unsupervised\nlearning of spatio-temporal features, which were then plugged into a discriminative model for action\nclassi\ufb01cation. Discriminative end-to-end learning of video ConvNets has been addressed in [12]\nand, more recently, in [14], who compared several ConvNet architectures for action recognition.\nTraining was carried out on a very large Sports-1M dataset, comprising 1.1M YouTube videos of\nsports activities.\nInterestingly, [14] found that a network, operating on individual video frames,\nperforms similarly to the networks, whose input is a stack of frames. This might indicate that\nthe learnt spatio-temporal features do not capture the motion well. The learnt representation, \ufb01ne-\ntuned on the UCF-101 dataset, turned out to be 20% less accurate than hand-crafted state-of-the-art\ntrajectory-based representation [20, 27].\nOur temporal stream ConvNet operates on multiple-frame dense optical \ufb02ow, which is typically\ncomputed in an energy minimisation framework by solving for a displacement \ufb01eld (typically at\nmultiple image scales). We used a popular method of [2], which formulates the energy based on\nconstancy assumptions for intensity and its gradient, as well as smoothness of the displacement \ufb01eld.\nRecently, [30] proposed an image patch matching scheme, which is reminiscent of deep ConvNets,\nbut does not incorporate learning.\n2 Two-stream architecture for video recognition\nVideo can naturally be decomposed into spatial and temporal components. The spatial part, in the\nform of individual frame appearance, carries information about scenes and objects depicted in the\nvideo. 
The temporal part, in the form of motion across the frames, conveys the movement of the\nobserver (the camera) and the objects. We devise our video recognition architecture accordingly,\ndividing it into two streams, as shown in Fig. 1. Each stream is implemented using a deep ConvNet,\nwhose softmax scores are combined by late fusion. We consider two fusion methods: averaging\nand training a multi-class linear SVM [6] on stacked L2-normalised softmax scores as features.\n\nFigure 1: Two-stream architecture for video classi\ufb01cation.\n\nSpatial stream ConvNet operates on individual video frames, effectively performing action recog-\nnition from still images. The static appearance by itself is a useful clue, since some actions are\nstrongly associated with particular objects. In fact, as will be shown in Sect. 6, action classi\ufb01cation\nfrom still frames (the spatial recognition stream) is fairly competitive on its own. Since a spatial\nConvNet is essentially an image classi\ufb01cation architecture, we can build upon the recent advances\nin large-scale image recognition methods [15], and pre-train the network on a large image classi\ufb01ca-\ntion dataset, such as the ImageNet challenge dataset. The details are presented in Sect. 5. Next, we\ndescribe the temporal stream ConvNet, which exploits motion and signi\ufb01cantly improves accuracy.\n3 Optical \ufb02ow ConvNets\nIn this section, we describe a ConvNet model, which forms the temporal recognition stream of our\narchitecture (Sect. 2). Unlike the ConvNet models, reviewed in Sect. 1.1, the input to our model is\nformed by stacking optical \ufb02ow displacement \ufb01elds between several consecutive frames. Such input\nexplicitly describes the motion between video frames, which makes the recognition easier, as the\nnetwork does not need to estimate motion implicitly. 
We consider several variations of the optical\n\ufb02ow-based input, which we describe below.\n\nFigure 2: Optical \ufb02ow. (a),(b): a pair of consecutive video frames with the area around a mov-\ning hand outlined with a cyan rectangle. (c): a close-up of dense optical \ufb02ow in the outlined area;\n(d): horizontal component d^x of the displacement vector \ufb01eld (higher intensity corresponds to pos-\nitive values, lower intensity to negative values). (e): vertical component d^y. Note how (d) and (e)\nhighlight the moving hand and bow. The input to a ConvNet contains multiple \ufb02ows (Sect. 3.1).\n3.1 ConvNet input con\ufb01gurations\nOptical \ufb02ow stacking. A dense optical \ufb02ow can be seen as a set of displacement vector \ufb01elds d_t\nbetween the pairs of consecutive frames t and t + 1. By d_t(u, v) we denote the displacement vector\nat the point (u, v) in frame t, which moves the point to the corresponding point in the following\nframe t + 1. The horizontal and vertical components of the vector \ufb01eld, d^x_t and d^y_t, can be seen\nas image channels (shown in Fig. 2), well suited to recognition using a convolutional network. To\nrepresent the motion across a sequence of frames, we stack the \ufb02ow channels d^{x,y}_t of L consecutive\nframes to form a total of 2L input channels. More formally, let w and h be the width and height\nof a video; a ConvNet input volume I_\u03c4 \u2208 R^{w\u00d7h\u00d72L} for an arbitrary frame \u03c4 is then constructed as\nfollows:\n\nI_\u03c4(u, v, 2k \u2212 1) = d^x_{\u03c4+k\u22121}(u, v),\nI_\u03c4(u, v, 2k) = d^y_{\u03c4+k\u22121}(u, v),    u = [1; w], v = [1; h], k = [1; L].    (1)\n\nFor an arbitrary point (u, v), the channels I_\u03c4(u, v, c), c = [1; 2L] encode the motion at that point\nover a sequence of L frames (as illustrated in Fig. 3-left).\nTrajectory stacking. 
An alternative motion representation, inspired by the trajectory-based de-\nscriptors [29], replaces the optical \ufb02ow, sampled at the same locations across several frames, with\nthe \ufb02ow, sampled along the motion trajectories. In this case, the input volume I_\u03c4, corresponding to\na frame \u03c4, takes the following form:\n\nI_\u03c4(u, v, 2k \u2212 1) = d^x_{\u03c4+k\u22121}(p_k),\nI_\u03c4(u, v, 2k) = d^y_{\u03c4+k\u22121}(p_k),    u = [1; w], v = [1; h], k = [1; L].    (2)\n\nwhere p_k is the k-th point along the trajectory, which starts at the location (u, v) in the frame \u03c4 and\nis de\ufb01ned by the following recurrence relation:\n\np_1 = (u, v);    p_k = p_{k\u22121} + d_{\u03c4+k\u22122}(p_{k\u22121}), k > 1.\n\nCompared to the input volume representation (1), where the channels I_\u03c4(u, v, c) store the displace-\nment vectors at the locations (u, v), the input volume (2) stores the vectors sampled at the locations\np_k along the trajectory (as illustrated in Fig. 3-right).\n\nFigure 3: ConvNet input derivation from the multi-frame optical \ufb02ow. Left: optical \ufb02ow stack-\ning (1) samples the displacement vectors d at the same location in multiple frames. Right: trajectory\nstacking (2) samples the vectors along the trajectory. The frames and the corresponding displace-\nment vectors are shown with the same colour.\n\n[Figure 1 layer labels. Spatial stream ConvNet (single frame input): conv1 7x7x96 stride 2, norm., pool 2x2; conv2 5x5x256 stride 2, norm., pool 2x2; conv3 3x3x512 stride 1; conv4 3x3x512 stride 1; conv5 3x3x512 stride 1, pool 2x2; full6 4096, dropout; full7 2048, dropout; softmax. Temporal stream ConvNet (multi-frame optical flow input): conv1 7x7x96 stride 2, norm., pool 2x2; conv2 5x5x256 stride 2, pool 2x2; conv3 3x3x512 stride 1; conv4 3x3x512 stride 1; conv5 3x3x512 stride 1, pool 2x2; full6 4096, dropout; full7 2048, dropout; softmax. The two softmax outputs feed class score fusion.]\n\nBi-directional optical \ufb02ow. 
Optical \ufb02ow representations (1) and (2) deal with the forward optical\n\ufb02ow, i.e. the displacement \ufb01eld dt of the frame t speci\ufb01es the location of its pixels in the following\nframe t + 1. It is natural to consider an extension to a bi-directional optical \ufb02ow, which can be\nobtained by computing an additional set of displacement \ufb01elds in the opposite direction. We then\nconstruct an input volume I\u03c4 by stacking L/2 forward \ufb02ows between frames \u03c4 and \u03c4 +L/2 and L/2\nbackward \ufb02ows between frames \u03c4 \u2212 L/2 and \u03c4. The input I\u03c4 thus has the same number of channels\n(2L) as before. The \ufb02ows can be represented using either of the two methods (1) and (2).\nMean \ufb02ow subtraction. It is generally bene\ufb01cial to perform zero-centering of the network input,\nas it allows the model to better exploit the recti\ufb01cation non-linearities. In our case, the displacement\nvector \ufb01eld components can take on both positive and negative values, and are naturally centered in\nthe sense that across a large variety of motions, the movement in one direction is as probable as the\nmovement in the opposite one. However, given a pair of frames, the optical \ufb02ow between them can\nbe dominated by a particular displacement, e.g. caused by the camera movement. The importance\nof camera motion compensation has been previously highlighted in [10, 26], where a global motion\ncomponent was estimated and subtracted from the dense \ufb02ow. In our case, we consider a simpler\napproach: from each displacement \ufb01eld d we subtract its mean vector.\nArchitecture. Above we have described different ways of combining multiple optical \ufb02ow displace-\nment \ufb01elds into a single volume I\u03c4 \u2208 Rw\u00d7h\u00d72L. Considering that a ConvNet requires a \ufb01xed-size\ninput, we sample a 224 \u00d7 224 \u00d7 2L sub-volume from I\u03c4 and pass it to the net as input. 
The hid-\nden layers con\ufb01guration remains largely the same as that used in the spatial net, and is illustrated\nin Fig. 1. Testing is similar to the spatial ConvNet, and is described in detail in Sect. 5.\n3.2 Relation of the temporal ConvNet architecture to previous representations\nIn this section, we put our temporal ConvNet architecture in the context of prior art, drawing con-\nnections to the video representations, reviewed in Sect. 1.1. Methods based on feature encod-\nings [17, 29] typically combine several spatio-temporal local features. Such features are computed\nfrom the optical \ufb02ow and are thus generalised by our temporal ConvNet. Indeed, the HOF and MBH\nlocal descriptors are based on the histograms of orientations of optical \ufb02ow or its gradient, which\ncan be obtained from the displacement \ufb01eld input (1) using a single convolutional layer (containing\norientation-sensitive \ufb01lters), followed by the recti\ufb01cation and pooling layers. The kinematic features\nof [10] (divergence, curl and shear) are also computed from the optical \ufb02ow gradient, and, again, can\nbe captured by our convolutional model. Finally, the trajectory feature [29] is computed by stacking\nthe displacement vectors along the trajectory, which corresponds to the trajectory stacking (2). In the\nsupplementary material we visualise the convolutional \ufb01lters, learnt in the \ufb01rst layer of the temporal\nnetwork. This provides further evidence that our representation generalises hand-crafted features.\nAs far as the deep networks are concerned, a two-stream video classi\ufb01cation architecture of [16]\ncontains two HMAX models which are hand-crafted and less deep than our discriminatively trained\nConvNets, which can be seen as a learnable generalisation of HMAX. 
The convolutional models\nof [12, 14] do not decouple spatial and temporal recognition streams, and rely on the motion-\nsensitive convolutional \ufb01lters, learnt from the data. In our case, motion is explicitly represented\nusing the optical \ufb02ow displacement \ufb01eld, computed based on the assumptions of constancy of the\nintensity and smoothness of the \ufb02ow. Incorporating such assumptions into a ConvNet framework\nmight be able to boost the performance of end-to-end ConvNet-based methods, and is an interesting\ndirection for future research.\n4 Multi-task learning\nUnlike the spatial stream ConvNet, which can be pre-trained on a large still image classi\ufb01cation\ndataset (such as ImageNet), the temporal ConvNet needs to be trained on video data \u2013 and the\navailable datasets for video action classi\ufb01cation are still rather small. In our experiments (Sect. 6),\ntraining is performed on the UCF-101 and HMDB-51 datasets, which have only: 9.5K and 3.7K\nvideos respectively. To decrease over-\ufb01tting, one could consider combining the two datasets into\none; this, however, is not straightforward due to the intersection between the sets of classes. One\noption (which we evaluate later) is to only add the images from the classes, which do not appear in\nthe original dataset. This, however, requires manual search for such classes and limits the amount\nof additional training data.\nA more principled way of combining several datasets is based on multi-task learning [5]. Its aim\nis to learn a (video) representation, which is applicable not only to the task in question (such as\nHMDB-51 classi\ufb01cation), but also to other tasks (e.g. UCF-101 classi\ufb01cation). Additional tasks act\nas a regulariser, and allow for the exploitation of additional training data. 
In our case, a ConvNet\narchitecture is modi\ufb01ed so that it has two softmax classi\ufb01cation layers on top of the last fully-\nconnected layer: one softmax layer computes HMDB-51 classi\ufb01cation scores, the other one \u2013 the\nUCF-101 scores. Each of the layers is equipped with its own loss function, which operates only on\nthe videos, coming from the respective dataset. The overall training loss is computed as the sum of\nthe individual tasks\u2019 losses, and the network weight derivatives can be found by back-propagation.\n5 Implementation details\nConvNets con\ufb01guration. The layer con\ufb01guration of our spatial and temporal ConvNets is schemat-\nically shown in Fig. 1.\nIt corresponds to the CNN-M-2048 architecture of [3] and is similar to the\nnetwork of [31]. All hidden weight layers use the recti\ufb01cation (ReLU) activation function; max-\npooling is performed over 3 \u00d7 3 spatial windows with stride 2; local response normalisation uses the\nsame settings as [15]. The only difference between spatial and temporal ConvNet con\ufb01gurations is\nthat we removed the second normalisation layer from the latter to reduce memory consumption.\nTraining. The training procedure can be seen as an adaptation of that of [15] to video frames, and\nis generally the same for both spatial and temporal nets. The network weights are learnt using the\nmini-batch stochastic gradient descent with momentum (set to 0.9). At each iteration, a mini-batch\nof 256 samples is constructed by sampling 256 training videos (uniformly across the classes), from\neach of which a single frame is randomly selected. In spatial net training, a 224 \u00d7 224 sub-image is\nrandomly cropped from the selected frame; it then undergoes random horizontal \ufb02ipping and RGB\njittering. The videos are rescaled beforehand, so that the smallest side of the frame equals 256. 
We\nnote that unlike [15], the sub-image is sampled from the whole frame, not just its 256 \u00d7 256 center.\nIn the temporal net training, we compute an optical \ufb02ow volume I for the selected training frame as\ndescribed in Sect. 3. From that volume, a \ufb01xed-size 224 \u00d7 224 \u00d7 2L input is randomly cropped and\n\ufb02ipped. The learning rate is initially set to 10^\u22122, and then decreased according to a \ufb01xed schedule,\nwhich is kept the same for all training sets. Namely, when training a ConvNet from scratch, the rate\nis changed to 10^\u22123 after 50K iterations, then to 10^\u22124 after 70K iterations, and training is stopped\nafter 80K iterations. In the \ufb01ne-tuning scenario, the rate is changed to 10^\u22123 after 14K iterations, and\ntraining is stopped after 20K iterations.\nTesting. At test time, given a video, we sample a \ufb01xed number of frames (25 in our experiments)\nwith equal temporal spacing between them. From each of the frames we then obtain 10 ConvNet\ninputs [15] by cropping and \ufb02ipping four corners and the center of the frame. The class scores for the\nwhole video are then obtained by averaging the scores across the sampled frames and crops therein.\nPre-training on ImageNet ILSVRC-2012. When pre-training the spatial ConvNet, we use the\nsame training and test data augmentation as described above (cropping, \ufb02ipping, RGB jittering).\nThis yields 13.5% top-5 error on the ILSVRC-2012 validation set, which compares favourably to 16.0%\nreported in [31] for a similar network. We believe that the main reason for the improvement is\nsampling of ConvNet inputs from the whole image, rather than just its center.\nMulti-GPU training. Our implementation is derived from the publicly available Caffe toolbox [13],\nbut contains a number of signi\ufb01cant modi\ufb01cations, including parallel training on multiple GPUs\ninstalled in a single system. 
We exploit the data parallelism, and split each SGD batch across several\nGPUs. Training a single temporal ConvNet takes 1 day on a system with 4 NVIDIA Titan cards,\nwhich constitutes a 3.2 times speed-up over single-GPU training.\nOptical \ufb02ow is computed using the off-the-shelf GPU implementation of [2] from the OpenCV\ntoolbox. In spite of the fast computation time (0.06s for a pair of frames), it would still introduce\na bottleneck if done on-the-\ufb02y, so we pre-computed the \ufb02ow before training. To avoid storing\nthe displacement \ufb01elds as \ufb02oats, the horizontal and vertical components of the \ufb02ow were linearly\nrescaled to a [0, 255] range and compressed using JPEG (after decompression, the \ufb02ow is rescaled\nback to its original range). This reduced the \ufb02ow size for the UCF-101 dataset from 1.5TB to 27GB.\n6 Evaluation\nDatasets and evaluation protocol.\nThe evaluation is performed on UCF-101 [24] and\nHMDB-51 [16] action recognition benchmarks, which are among the largest available annotated\nvideo datasets1. UCF-101 contains 13K videos (180 frames/video on average), annotated into 101\naction classes; HMDB-51 includes 6.8K videos of 51 actions. The evaluation protocol is the same\nfor both datasets: the organisers provide three splits into training and test data, and the performance\nis measured by the mean classi\ufb01cation accuracy across the splits. Each UCF-101 split contains 9.5K\ntraining videos; an HMDB-51 split contains 3.7K training videos. We begin by comparing different\narchitectures on the \ufb01rst split of the UCF-101 dataset. For comparison with the state of the art, we\nfollow the standard evaluation protocol and report the average accuracy over three splits on both\nUCF-101 and HMDB-51.\nSpatial ConvNets. First, we measure the performance of the spatial stream ConvNet. 
Three sce-\nnarios are considered: (i) training from scratch on UCF-101, (ii) pre-training on ILSVRC-2012\nfollowed by \ufb01ne-tuning on UCF-101, (iii) keeping the pre-trained network \ufb01xed and only training\nthe last (classi\ufb01cation) layer. For each of the settings, we experiment with setting the dropout regu-\nlarisation ratio to 0.5 or to 0.9. From the results, presented in Table 1a, it is clear that training the\nConvNet solely on the UCF-101 dataset leads to over-\ufb01tting (even with high dropout), and is inferior\nto pre-training on a large ILSVRC-2012 dataset. Interestingly, \ufb01ne-tuning the whole network gives\nonly marginal improvement over training the last layer only. In the latter setting, higher dropout\nover-regularises learning and leads to worse accuracy. In the following experiments we opted for\ntraining the last layer on top of a pre-trained ConvNet.\nTemporal ConvNets. Having evaluated spatial ConvNet variants, we now turn to the temporal\nConvNet architectures, and assess the effect of the input con\ufb01gurations, described in Sect. 3.1. In\nparticular, we measure the effect of: using multiple (L = {5, 10}) stacked optical \ufb02ows; trajectory\nstacking; mean displacement subtraction; using the bi-directional optical \ufb02ow. The architectures\nare trained on the UCF-101 dataset from scratch, so we used an aggressive dropout ratio of 0.9 to\nhelp improve generalisation. The results are shown in Table 1b. 
\n\nTable 1: Individual ConvNets accuracy on UCF-101 (split 1).\n\n(a) Spatial ConvNet.\nTraining setting | Dropout ratio 0.5 | Dropout ratio 0.9\nFrom scratch | 42.5% | 52.3%\nPre-trained + \ufb01ne-tuning | 70.8% | 72.8%\nPre-trained + last layer | 72.7% | 59.9%\n\n(b) Temporal ConvNet.\nInput con\ufb01guration | Mean subtraction off | Mean subtraction on\nSingle-frame optical \ufb02ow (L = 1) | - | 73.9%\nOptical \ufb02ow stacking (1) (L = 5) | - | 80.4%\nOptical \ufb02ow stacking (1) (L = 10) | 79.9% | 81.0%\nTrajectory stacking (2) (L = 10) | 79.6% | 80.2%\nOptical \ufb02ow stacking (1) (L = 10), bi-dir. | - | 81.2%\n\nFootnote 1: Very recently, [14] released the Sports-1M dataset of 1.1M automatically annotated YouTube sports videos.\nProcessing the dataset of such scale is very challenging, and we plan to address it in future work.\n\nFirst, we can conclude that stacking\nmultiple (L > 1) displacement \ufb01elds in the input is highly bene\ufb01cial, as it provides the network with\nlong-term motion information, which is more discriminative than the \ufb02ow between a pair of frames\n(L = 1 setting). Increasing the number of input \ufb02ows from 5 to 10 leads to a smaller improvement,\nso we kept L \ufb01xed to 10 in the following experiments. Second, we \ufb01nd that mean subtraction is\nhelpful, as it reduces the effect of global motion between the frames. We use it in the following\nexperiments as default. The difference between different stacking techniques is marginal; it turns\nout that optical \ufb02ow stacking performs better than trajectory stacking, and using the bi-directional\noptical \ufb02ow is only slightly better than a uni-directional forward \ufb02ow. 
Finally, we note that temporal\nConvNets signi\ufb01cantly outperform the spatial ConvNets (Table 1a), which con\ufb01rms the importance\nof motion information for action recognition.\nWe also implemented the \u201cslow fusion\u201d architecture of [14], which amounts to applying a ConvNet\nto a stack of RGB frames (11 frames in our case). When trained from scratch on UCF-101, it\nachieved 56.4% accuracy, which is better than a single-frame architecture trained from scratch\n(52.3%), but is still far off the network trained from scratch on optical \ufb02ow. This shows that while\nmulti-frame information is important, it is also important to present it to a ConvNet in an appropriate\nmanner.\nMulti-task learning of temporal ConvNets. Training temporal ConvNets on UCF-101 is challeng-\ning due to the small size of the training set. An even bigger challenge is to train the ConvNet on\nHMDB-51, where each training split is 2.6 times smaller than that of UCF-101. Here we evaluate\ndifferent options for increasing the effective training set size of HMDB-51: (i) \ufb01ne-tuning a temporal\nnetwork pre-trained on UCF-101; (ii) adding 78 classes from UCF-101, which are manually selected\nso that there is no intersection between these classes and the native HMDB-51 classes; (iii) using the\nmulti-task formulation (Sect. 4) to learn a video representation, shared between the UCF-101 and\nHMDB-51 classi\ufb01cation tasks. The results are reported in Table 2.\n\nTable 2: Temporal ConvNet accuracy on HMDB-51 (split 1 with additional training data).\n\nTraining setting | Accuracy\nTraining on HMDB-51 without additional data | 46.6%\nFine-tuning a ConvNet, pre-trained on UCF-101 | 49.0%\nTraining on HMDB-51 with classes added from UCF-101 | 52.8%\nMulti-task learning on HMDB-51 and UCF-101 | 55.4%\n\nAs expected, it is bene\ufb01cial to\nutilise full (all splits combined) UCF-101 data for training (either explicitly by borrowing images, or\nimplicitly by pre-training). 
Multi-task learning performs the best, as it allows the training procedure\nto exploit all available training data.\nWe have also experimented with multi-task learning on the UCF-101 dataset, by training a network\nto classify both the full HMDB-51 data (all splits combined) and the UCF-101 data (a single split).\nOn the \ufb01rst split of UCF-101, the accuracy was measured to be 81.5%, which improves on 81.0%\nachieved using the same settings, but without the additional HMDB classi\ufb01cation task (Table 1b).\nTwo-stream ConvNets. Here we evaluate the complete two-stream model, which combines the\ntwo recognition streams. One way of combining the networks would be to train a joint stack of\nfully-connected layers on top of full6 or full7 layers of the two nets. This, however, was not feasible\nin our case due to over-\ufb01tting. We therefore fused the softmax scores using either averaging or\na linear SVM. From Table 3 we conclude that: (i) temporal and spatial recognition streams are\ncomplementary, as their fusion signi\ufb01cantly improves on both (6% over temporal and 14% over\nspatial nets); (ii) SVM-based fusion of softmax scores outperforms fusion by averaging; (iii) using\nbi-directional \ufb02ow is not bene\ufb01cial in the case of ConvNet fusion; (iv) temporal ConvNet, trained\nusing multi-task learning, performs the best both alone and when fused with a spatial net.\nComparison with the state of the art. We conclude the experimental evaluation with the com-\nparison against the state of the art on three splits of UCF-101 and HMDB-51. 
For that we used a\nspatial net, pre-trained on ILSVRC, with the last layer trained on UCF or HMDB. The temporal\nnet was trained on UCF and HMDB using multi-task learning, and the input was computed using\nuni-directional optical flow stacking with mean subtraction. The softmax scores of the two nets were\ncombined using averaging or SVM. As can be seen from Table 4, both our spatial and temporal nets\nalone outperform the deep architectures of [14, 16] by a large margin. The combination of the two\nnets further improves the results (in line with the single-split experiments above), and is comparable\nto the very recent state-of-the-art hand-crafted models [20, 21, 26].\n\nTable 3: Two-stream ConvNet accuracy on UCF-101 (split 1).\n\nSpatial ConvNet | Temporal ConvNet | Fusion Method | Accuracy\nPre-trained + last layer | bi-directional | averaging | 85.6%\nPre-trained + last layer | uni-directional | averaging | 85.9%\nPre-trained + last layer | uni-directional, multi-task | averaging | 86.2%\nPre-trained + last layer | uni-directional, multi-task | SVM | 87.0%\n\nTable 4: Mean accuracy (over three splits) on UCF-101 and HMDB-51.\n\nMethod | UCF-101 | HMDB-51\nImproved dense trajectories (IDT) [26, 27] | 85.9% | 57.2%\nIDT with higher-dimensional encodings [20] | 87.9% | 61.1%\nIDT with stacked Fisher encoding [21] (based on Deep Fisher Net [23]) | - | 66.8%\nSpatio-temporal HMAX network [11, 16] | - | 22.8%\n\u201cSlow fusion\u201d spatio-temporal ConvNet [14] | 65.4% | -\nSpatial stream ConvNet | 73.0% | 40.5%\nTemporal stream ConvNet | 83.7% | 54.6%\nTwo-stream model (fusion by averaging) | 86.9% | 58.0%\nTwo-stream model (fusion by SVM) | 88.0% | 59.4%\n\n7 Conclusions and directions for improvement\nWe proposed a deep video classification model with competitive performance, which incorporates\nseparate spatial and temporal recognition streams based on ConvNets. 
Currently it appears that\ntraining a temporal ConvNet on optical flow (as here) is significantly better than training on raw\nstacked frames [14]. The latter is probably too challenging, and might require architectural changes\n(for example, a combination with the deep matching approach of [30]). Despite using optical flow\nas input, our temporal model does not require significant hand-crafting, since the flow is computed\nusing a method based on the generic assumptions of constancy and smoothness.\nAs we have shown, extra training data is beneficial for our temporal ConvNet, so we are planning to\ntrain it on large video datasets, such as the recently released collection of [14]. This, however, poses\na significant challenge on its own due to the gigantic amount of training data (multiple TBs).\nThere still remain some essential ingredients of the state-of-the-art shallow representation [26],\nwhich are missed in our current architecture. The most prominent one is local feature pooling\nover spatio-temporal tubes, centered at the trajectories. Even though the input (2) captures the optical\nflow along the trajectories, the spatial pooling in our network does not take the trajectories into\naccount. Another potential area of improvement is explicit handling of camera motion, which in our\ncase is compensated by mean displacement subtraction.\nAcknowledgements\nThis work was supported by ERC grant VisRec no. 228180. We gratefully acknowledge the support\nof NVIDIA Corporation with the donation of the GPUs used for this research.\nReferences\n[1] A. Berg, J. Deng, and L. Fei-Fei. Large scale visual recognition challenge (ILSVRC), 2010. URL http://www.image-net.org/challenges/LSVRC/2010/.\n\n[2] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High accuracy optical flow estimation based on a theory for warping. In Proc. ECCV, pages 25\u201336, 2004.\n\n[3] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. 
Return of the devil in the details: Delving deep into convolutional nets. In Proc. BMVC., 2014.\n\n[4] B. Chen, J. A. Ting, B. Marlin, and N. de Freitas. Deep learning of invariant spatio-temporal features from video. In NIPS Deep Learning and Unsupervised Feature Learning Workshop, 2010.\n\n[5] R. Collobert and J. Weston. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proc. ICML, pages 160\u2013167, 2008.\n\n[6] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2:265\u2013292, 2001.\n\n[7] N. Dalal and B. Triggs. Histogram of Oriented Gradients for Human Detection. In Proc. CVPR, volume 2, pages 886\u2013893, 2005.\n\n[8] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In Proc. ECCV, pages 428\u2013441, 2006.\n\n[9] M. A. Goodale and A. D. Milner. Separate visual pathways for perception and action. Trends in Neurosciences, 15(1):20\u201325, 1992.\n\n[10] M. Jain, H. Jegou, and P. Bouthemy. Better exploiting motion for better action recognition. In Proc. CVPR, pages 2555\u20132562, 2013.\n\n[11] H. Jhuang, T. Serre, L. Wolf, and T. Poggio. A biologically inspired system for action recognition. In Proc. ICCV, pages 1\u20138, 2007.\n\n[12] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE PAMI, 35(1):221\u2013231, 2013.\n\n[13] Y. Jia. Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/, 2013.\n\n[14] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proc. CVPR, 2014.\n\n[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. 
In NIPS, pages 1106\u20131114, 2012.\n\n[16] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In Proc. ICCV, pages 2556\u20132563, 2011.\n\n[17] I. Laptev, M. Marsza\u0142ek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In Proc. CVPR, 2008.\n\n[18] Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In Proc. CVPR, pages 3361\u20133368, 2011.\n\n[19] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541\u2013551, 1989.\n\n[20] X. Peng, L. Wang, X. Wang, and Y. Qiao. Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice. CoRR, abs/1405.4506, 2014.\n\n[21] X. Peng, C. Zou, Y. Qiao, and Q. Peng. Action recognition with stacked Fisher vectors. In Proc. ECCV, pages 581\u2013595, 2014.\n\n[22] F. Perronnin, J. S\u00e1nchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In Proc. ECCV, 2010.\n\n[23] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep Fisher networks for large-scale image classification. In NIPS, 2013.\n\n[24] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR, abs/1212.0402, 2012.\n\n[25] G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler. Convolutional learning of spatio-temporal features. In Proc. ECCV, pages 140\u2013153, 2010.\n\n[26] H. Wang and C. Schmid. Action recognition with improved trajectories. In Proc. ICCV, pages 3551\u20133558, 2013.\n\n[27] H. Wang and C. Schmid. LEAR-INRIA submission for the THUMOS workshop. In ICCV Workshop on Action Recognition with a Large Number of Classes, 2013.\n\n[28] H. Wang, M. M. Ullah, A. 
Kl\u00e4ser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In Proc. BMVC., pages 1\u201311, 2009.\n\n[29] H. Wang, A. Kl\u00e4ser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In Proc. CVPR, pages 3169\u20133176, 2011.\n\n[30] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid. DeepFlow: Large displacement optical flow with deep matching. In Proc. ICCV, pages 1385\u20131392, 2013.\n\n[31] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. CoRR, abs/1311.2901, 2013.", "award": [], "sourceid": 381, "authors": [{"given_name": "Karen", "family_name": "Simonyan", "institution": "University of Oxford / Google DeepMind"}, {"given_name": "Andrew", "family_name": "Zisserman", "institution": "University of Oxford"}]}