{"title": "A flexible model for training action localization with varying levels of supervision", "book": "Advances in Neural Information Processing Systems", "page_first": 942, "page_last": 953, "abstract": "Spatio-temporal action detection in videos is typically addressed in a fully-supervised setup with manual annotation of training videos required at every frame.  Since such annotation is extremely tedious and prohibits scalability, there is a clear need to minimize the amount of manual supervision. In this work we propose a unifying framework that can handle and combine varying types of less demanding weak supervision. Our model is based on discriminative clustering and integrates different types of supervision as constraints on the optimization. We investigate applications of such a model to training setups with alternative supervisory signals ranging from video-level class labels over temporal points or sparse action bounding boxes to the full per-frame annotation of action bounding boxes. Experiments on the challenging UCF101-24 and DALY datasets demonstrate competitive performance of our method at a fraction of supervision used by previous methods. The flexibility of our model enables joint learning from data with different levels of annotation. Experimental results demonstrate a significant gain by adding a few fully supervised examples to otherwise weakly labeled videos.", "full_text": "A \ufb02exible model for training action localization\n\nwith varying levels of supervision\n\nGuilhem Ch\u00e9ron\u2217 1 2\n\nJean-Baptiste Alayrac\u2217 1\n\nIvan Laptev1\n\nCordelia Schmid2\n\nAbstract\n\nSpatio-temporal action detection in videos is typically addressed in a fully-\nsupervised setup with manual annotation of training videos required at every frame.\nSince such annotation is extremely tedious and prohibits scalability, there is a clear\nneed to minimize the amount of manual supervision. 
In this work we propose a unifying framework that can handle and combine varying types of less-demanding weak supervision. Our model is based on discriminative clustering and integrates different types of supervision as constraints on the optimization. We investigate applications of such a model to training setups with alternative supervisory signals ranging from video-level class labels to the full per-frame annotation of action bounding boxes. Experiments on the challenging UCF101-24 and DALY datasets demonstrate competitive performance of our method at a fraction of the supervision used by previous methods. The flexibility of our model enables joint learning from data with different levels of annotation. Experimental results demonstrate a significant gain by adding a few fully supervised examples to otherwise weakly labeled videos.

1 Introduction

Action localization aims to find the spatial and temporal extents as well as the classes of actions in a video, answering questions such as: what are the performed actions? when do they happen? and where do they take place? This is a challenging task with many potential applications in surveillance, autonomous driving, video description and search. To address this challenge, a number of successful methods have been proposed [11, 13, 18, 31, 36, 37, 45]. Such methods, however, typically rely on exhaustive supervision where each frame of a training action has to be manually annotated with an action bounding box.

Manual annotation of video frames is extremely tedious. Moreover, achieving consensus when annotating action intervals is often problematic due to ambiguities in the start and end times of an action. This prevents fully-supervised methods from scaling to many action classes and training from many examples. To avoid exhaustive annotation, recent works have developed several weakly-supervised methods.
For example, [46] learns action localization from a sparse set of frames with annotated bounding boxes. [27] reduces bounding box annotation to a single spatial point specified for each frame of an action. Such methods, however, are designed for particular types of weak supervision and cannot be used directly either to compare or to combine various types of annotation.

1 Inria, École normale supérieure, CNRS, PSL Research University, 75005 Paris, France.
2 University Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France.
* Equal contribution.
† Project webpage: https://www.di.ens.fr/willow/research/weakactionloc/

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: A flexible method for handling varying levels of supervision. Our method estimates a matrix Y assigning human tracklets to action labels in training videos by optimizing an objective function h(Y) under constraints Ys. Different types of supervision define particular constraints Ys and do not affect the form of the objective function. The increasing level of supervision imposes stricter constraints, e.g. Y1 ⊃ Y2 ⊃ Y3 ⊃ Y4 as illustrated for the Cliff Diving example above.

In this work we design a unifying framework for handling various levels of supervision†. Our model is based on discriminative clustering and integrates different types of supervision in the form of optimization constraints, as illustrated in Figure 1. We investigate applications of such a model to training setups with alternative supervisory signals ranging from video-level class labels over temporal points or sparse action bounding boxes to the full per-frame annotation of action bounding boxes. Experiments on the challenging UCF101-24 and DALY datasets demonstrate competitive performance of our method at a fraction of the supervision used by previous methods.
We also demonstrate a significant gain from adding a few fully supervised examples to weakly-labeled videos. In summary, the main contributions of our work are (i) a flexible model with the ability to adopt and combine various types of supervision for action localization and (ii) an experimental study demonstrating the strengths and weaknesses of a wide range of supervisory signals and their combinations.

2 Related work

Spatio-temporal action localization consists in finding action instances in a video volume, i.e. both in space and time. Initial attempts [20, 22] scanned the video clip with a 3D sliding window detector on top of volumetric features. Next, [29, 43] adapted the idea of object proposals [2, 42] to video action proposals. Currently, the dominant strategy [11, 31, 36, 37, 45] is to obtain per-frame action detections and then to link them into continuous spatio-temporal tracks. The most recent methods [13, 15, 18, 35, 49] operate on multiple frames and leverage temporal information to determine actions. Examples include stacking features from several frames [18] and applying I3D features [6] to spatio-temporal volumes [13]. These methods are fully supervised and necessitate a large quantity of annotations.

Weakly supervised learning for action understanding is promising since it can enable a significant reduction of the required annotation effort. Prior work has explored the use of readily-available sources of information such as movie scripts [4, 8] to discover actions in clips using text analysis. Recent work has also explored more complex forms of weak supervision such as the ordering of actions [5, 16, 33]. A number of approaches address temporal action detection with weak supervision. The UntrimmedNet [44] and Hide-and-Seek [38] methods are the current state of the art.
In [44], the authors introduce a feed-forward neural network composed of two branches, one for classification and another for selecting relevant frames, that can be trained end-to-end from clip-level supervision only. The method proposed in [38] obtains more precise object boxes or temporal action boundaries by hiding random patches of the input images at training time, forcing the network to focus on less discriminative parts.

In contrast, we seek not only to localize the action temporally but also to localize it spatially from weak supervision. In [40, 48], the authors propose unsupervised methods that localize actions by using graph clustering algorithms to first discover action classes and then localize actions within each cluster. Other works assume that clip-level annotations are available [7, 39]. These methods often rely on action proposals, followed by a selection procedure to find the most relevant segment. Contrary to this line of work, we do not use action proposals. We rely instead on recent advances in off-the-shelf human detectors [10] and use human tracks. More recently, [26] makes use of pseudo-annotation to integrate biases (e.g. the presence of objects, or the fact that the action is often in the center of the video) to further improve performance. Most of this work uses Multiple Instance Learning (MIL) in order to select discriminative instances from the set of all potential candidates. Here, we instead use discriminative clustering, which does not require EM optimization but instead relies on a convex relaxation with convergence guarantees.

Most related to our work are [27] and [46], which study the trade-off between annotation cost and final performance. [27] shows that one can simply use spatial points instead of bounding boxes and still obtain reasonable performance.
[46] instead demonstrates that only a few frames with bounding box annotation are necessary for good performance. Here, we introduce a method that can leverage varying levels of supervision.

Discriminative clustering [3, 47] is a learning method that consists in clustering the data so that the separation is easily recoverable by a classifier. In this paper, we employ the method proposed in [3], which is appealing due to its simple analytic form, its flexibility and its recent successes for weakly supervised methods in computer vision [1, 4, 17]. To use this method, we rely on a convex relaxation technique which transforms the initial NP-hard problem into a convex quadratic program under linear constraints. The Frank-Wolfe algorithm [9] has been shown to be very effective for solving such problems. Recently, [28] proposed a block coordinate [21] version of Frank-Wolfe for discriminative clustering. We leverage this algorithm to scale our method to hundreds of videos.

3 Problem formulation and approach

We are given a set of N videos of people performing various actions. The total number of possible actions is K. Our goal is to learn a model for each action in order to localize actions in time and space in unseen videos. Given the difficulty of manual action annotation, we investigate how different types and amounts of supervision affect the performance of action localization. To this end, we propose a model that can adopt and combine various levels of spatio-temporal action supervision, such as (i) video-level only annotations, (ii) a single temporal point, (iii) a single bounding box, (iv) temporal bounds only, (v) temporal bounds with a few bounding boxes and (vi) full supervision. In all cases, we are given tracks of people, i.e. sequences of bounding box detections linked through time.
These tracks are subdivided into short elementary segments in order to obtain precise action boundaries and to allow the same person to perform multiple actions. These segments are called tracklets and our goal is to assign them to action labels. In total, we have M such tracklets.

A single flexible model: one cost function, different constraints. We introduce a single model that can be trained with various levels and amounts of supervision. The general idea is the following. We learn a model for action classification on the training set from weak supervision using discriminative clustering [3, 47]. This model is used to predict spatio-temporal localization in test videos.

The idea behind discriminative clustering is to cluster the data (e.g. assign human tracklets to action labels) so that the clusters can be recovered by a classifier given some feature representation. The clustering and the classifier are simultaneously optimized subject to constraints that can regularize the problem and guide the solution towards a preferred direction. In this work we use constraints as a means to encode different types of supervisory signals. More formally, we frame our problem as recovering an assignment matrix of tracklets to actions Y ∈ {0, 1}^(M×K) which minimizes a clustering cost h under constraints:

    min_{Y ∈ Ys} h(Y),    (1)

where Ys is a constraint set encoding the supervision available at training time. In this work we consider constraints corresponding to the following types of supervision.

(i) Video-level action labels: only video-level action labels are known.
(ii) Single temporal point: we have a rough idea of when the action occurs, but we know neither the exact temporal extent nor the spatial extent of the action.
Here, we can for example build our constraints such that at least one human track (composed of several tracklets) should be assigned to the associated action class in the neighborhood of a temporal point.
(iii) One bounding box (BB): we are given the spatial location of a person at a given time inside each action instance. Spatial constraints force tracks that overlap with the bounding box to match the action class around this time step.
(iv) Temporal bounds: we know when the action occurs but its spatial extent is unknown. We constrain the labels so that at least one human track contained in the given temporal interval is assigned to that action. Samples outside annotated intervals are assigned to the background class.
(v) Temporal bounds with bounding boxes (BBs): combination of (iii) and (iv).
(vi) Fully supervised: annotation is defined by the bounding box at each frame of an action.

All these constraints can be formulated under a common mathematical framework that we describe next. In this paper we limit ourselves to the case where each tracklet should be assigned to exactly one action class (or the background). More formally, this can be written as follows:

    ∀ m ∈ [1..M]:  Σ_{k=1}^{K} Y_mk = 1.    (2)

This does not prevent us from having multiple action classes in a video, or from assigning multiple classes to tracklets from the same human track. Also note that it is possible to handle the case where a single tracklet can be assigned to multiple actions by simply replacing the equality with, e.g., a greater-or-equal inequality.
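As a concrete illustration, the one-class-per-tracklet constraint (2), together with forced assignments (tracklets pinned to an action or to the background) and at-least-one "bag" constraints of the kind described above, can be checked in a few lines of NumPy. This is a minimal sketch with names of our own choosing (`check_assignment`, `O`, `Z`, `bags`), not the authors' code:

```python
import numpy as np

def check_assignment(Y, O=(), Z=(), bags=()):
    """Check a candidate assignment matrix Y (M tracklets x K classes).
    O: (tracklet, class) pairs forced to 1; Z: pairs forced to 0;
    bags: (tracklet_indices, class) pairs, of which at least one tracklet
    in the bag must be assigned to that class."""
    ok = np.all(Y.sum(axis=1) == 1)                          # eq. (2): one class per tracklet
    ok &= all(Y[t, k] == 1 for t, k in O)                    # forced assignments
    ok &= all(Y[t, k] == 0 for t, k in Z)                    # forbidden assignments
    ok &= all(Y[list(ts), k].sum() >= 1 for ts, k in bags)   # at-least-one constraint
    return bool(ok)

# Toy example: 3 tracklets, 2 classes (column 1 plays the role of background).
Y = np.array([[1, 0], [0, 1], [1, 0]])
print(check_assignment(Y, O=[(0, 0)], Z=[(1, 0)], bags=[((0, 2), 0)]))  # True
```

Feasible matrices Y are exactly the vertices of the constraint set; the optimization described below operates over this set.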
However, as the datasets considered in this work always satisfy (2), we keep this assumption for the sake of simplicity. We propose to enforce the various levels of supervision with two types of constraints on the assignment matrix Y, as described below.

Strong supervision with equality constraints. In many cases, even when dealing with weak supervision, the supervision may provide strong cues about a tracklet. For example, if we know that a video corresponds to the action 'Diving', we can assume that no tracklet of that video corresponds to the action 'Tennis Swing'. This can be imposed by setting the corresponding entry in the assignment matrix Y to 0. Similarly, if a tracklet is outside an annotated action interval, we know that it should belong to the background class. Such cues can be enforced by setting the matching entry in Y to 1. Formally,

    ∀ (t, k) ∈ Os: Y_tk = 1,   and   ∀ (t, k) ∈ Zs: Y_tk = 0,    (3)

with Os and Zs containing all the tracklet/action pairs that we want to match (Os) or dissociate (Zs).

Weak supervision with at-least-one constraints. Often, we are uncertain about which tracklet should be assigned to a given action k. For example, we might know when the action occurs without knowing where it happens. Hence, multiple tracklets might overlap with that action in time, but not all are good candidates. These multiple tracklets compose a bag [4], which we denote Ak. Among them, we want to find at least one tracklet that matches the action, which can be written as:

    Σ_{t ∈ Ak} Y_tk ≥ 1.    (4)

We denote by Bs the set of all such bags. Hence, Ys is characterized by defining its corresponding strong supervision Os and Zs and the bags Bs that compose its weak supervision.

Discriminative clustering cost.
As stated above, the intuition behind discriminative clustering is to separate the data so that the clustering is easily recoverable by a classifier over the input features. Here, we use the square loss and a linear classifier, which corresponds to the DIFFRAC setting [3]:

    h(Y) = min_{W ∈ R^(d×K)}  (1/2M) ‖XW − Y‖²_F + (λ/2) ‖W‖²_F.    (5)

X ∈ R^(M×d) contains the features describing each tracklet. W ∈ R^(d×K) corresponds to the classifier weights for each action. λ ∈ R+ is a regularization parameter. ‖·‖_F is the standard Frobenius matrix norm. Following [3], we can solve the minimization in W in closed form to obtain h(Y) = (1/2M) Tr(Y Yᵀ B), where Tr(·) is the matrix trace and B is a strictly positive definite matrix (hence h is strongly convex) defined as B := I_M − X(Xᵀ X + M λ I_d)⁻¹ Xᵀ. I_d stands for the d-dimensional identity matrix.

Optimization. Directly optimizing the problem defined in (1) is NP-hard due to the integer constraints. To address this challenge, we follow recent approaches such as [5] and propose a convex relaxation of the constraints. The problem hence becomes min_{Y ∈ Ȳs} h(Y), where Ȳs is the convex hull of Ys. To deal with such constraints we use the Frank-Wolfe algorithm [9], which has the nice property of only requiring to solve linear programs over the constraint set. Another challenge lies in the fact that we deal with large datasets containing hundreds of long videos. Using the fact that our set of constraints decomposes over the videos, Ȳs = Ȳ¹s × ··· × Ȳᴺs, we use the block-coordinate Frank-Wolfe algorithm [21] that has recently been adapted to the discriminative clustering objective [28].
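The closed form of h(Y) is easy to verify numerically. The sketch below is our own minimal NumPy illustration (not the released implementation): it builds B and checks that (1/2M) Tr(Y Yᵀ B) matches the explicit ridge-regression minimum over W in equation (5).

```python
import numpy as np

def diffrac_cost_matrix(X, lam):
    """B = I_M - X (X^T X + M*lam*I_d)^{-1} X^T, strictly positive definite for lam > 0."""
    M, d = X.shape
    return np.eye(M) - X @ np.linalg.solve(X.T @ X + M * lam * np.eye(d), X.T)

def h(Y, B):
    """Clustering cost h(Y) = (1/2M) Tr(Y Y^T B), i.e. equation (5) minimized over W."""
    return np.trace(Y @ Y.T @ B) / (2 * Y.shape[0])

rng = np.random.default_rng(0)
M, d, K, lam = 20, 5, 3, 1e-4
X = rng.normal(size=(M, d))            # toy tracklet features
Y = np.eye(K)[rng.integers(0, K, M)]   # a hard assignment of tracklets to classes
B = diffrac_cost_matrix(X, lam)

# Cross-check against the explicit minimum over W (ridge regression solution).
W = np.linalg.solve(X.T @ X + M * lam * np.eye(d), X.T @ Y)
explicit = np.linalg.norm(X @ W - Y, 'fro')**2 / (2 * M) + lam / 2 * np.linalg.norm(W, 'fro')**2
print(np.allclose(h(Y, B), explicit))  # True
```

Because B is strictly positive definite, h is strongly convex in Y, which is what the Frank-Wolfe relaxation exploits.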
This allows us to scale to several hundreds of videos.

4 Experiments

4.1 Datasets and metrics

The UCF101 dataset [41] was originally designed for action classification. It contains 13321 videos of 101 different action classes. A subset of 24 action classes (selected by [12]) was defined for the particular task of spatio-temporal action localization in 3207 videos. We refer to this subset as 'UCF101-24'. The videos are relatively short and the actions usually last a large part of the duration (at least 80% of the video length for half of the classes). The annotation is exhaustive, i.e., for each action instance, the full person track within the temporal interval of an action is annotated. We use the recently corrected ground-truth tracks [36]. Each UCF101-24 video contains actions of a single class. There is only one split, containing 2293 train videos and 914 test videos.

The DALY dataset [46] is a recent large-scale dataset for action localization, with 10 different daily actions (e.g. 'drinking', 'phoning', 'cleaning floor'). It contains 510 videos for a total of 31 hours, arranged in a single split containing 31 train and 20 test videos per class. The average length of the videos is 3 min 45 s. They are composed of several shots, and all actions are short w.r.t. the full video length, making the task of temporal action localization very challenging. Also, DALY may contain multiple action classes in the same video. For each action instance, its temporal extent is provided, while bounding boxes are spatially annotated for a sparse set of frames within action intervals.

Performance evaluation. To evaluate detection performance, we use the standard spatio-temporal intersection over union (IoU) criterion between a candidate track and a ground-truth track.
It is defined for the UCF101-24 benchmark as the product of the temporal IoU between the time segments of the tracks and the average spatial IoU over the frames where both tracks are present. A candidate detection is correct if its intersection with the ground-truth track is above a threshold (set to 0.2 or 0.5 in our experiments) and if both tracks belong to the same action class. Duplicate detections are counted as false positives. The overall performance is evaluated in terms of mean average precision (mAP). The same metric is used for all levels of supervision, which enables a fair comparison. Note that since only sparse spatial annotation is available for the DALY dataset, its benchmark computes spatial IoU at the annotated frame locations only, while the temporal IoU remains the same.

4.2 Implementation details

The code to reproduce our experiments can be found on the project webpage*.

Person detector and tracker. Person boxes are obtained with Faster R-CNN [32] using the ResNet-101 architecture [14]. When no spatial annotation is provided, we use an off-the-shelf person detector trained on the COCO dataset [23]. Otherwise, action detections are obtained by training the detector on the available frames with bounding boxes, starting from ImageNet [34] pre-training. We use the implementation from the Detectron package [10]. Detections from the human detector are linked into continuous tracks using KLT [24] to differentiate between person instances. Class-specific detections are temporally aggregated by a simpler online linker [18, 37] based on action scores and overlap.

Feature representation. In our model, person tracks are divided into consecutive subtracks of 8 frames which we call tracklets. Due to its recent success for action recognition, we use the I3D [6] network trained on the Kinetics dataset [19] to obtain tracklet features.
More precisely, we extract video descriptors with the I3D RGB and flow networks after the 7-th inception block, before max-pooling, to balance depth and spatial resolution. The temporal receptive field is 63 frames and the temporal stride is 4 frames. The input frames are resized to 320 × 240 pixels, which results in feature maps of size 20 × 15 with 832 channels. We use ROI-pooling to extract an 832-dimensional descriptor for each human box at a given time step. The final 1664-dimensional representation is obtained by concatenating the descriptors from the RGB and flow streams. Finally, a tracklet representation is obtained by averaging the descriptors corresponding to the detections spanned by this tracklet.

* https://www.di.ens.fr/willow/research/weakactionloc/

Optimization. For all types of supervision we run the block-coordinate Frank-Wolfe optimizer for 30k iterations (one iteration deals with one video). We use optimal line search and we sample videos according to the block gap values [30], which speeds up convergence in practice.

Temporal localization. At test time, person tracks have to be trimmed to localize the action instances temporally. To do so, each of the T detections composing a track is scored by the learnt classifier. The scores are then smoothed with a median filter (window of size 25), giving, for an action k ∈ [1..K], the sequence of detection scores s^k = (s^k_1, ..., s^k_T). Within a track, temporal segmentation is performed by selecting consecutive person boxes with scores s^k_t > θ_k, leading to subtrack candidates for action k. A single person track can therefore produce several spatio-temporal predictions at different time steps. A score for each subtrack is obtained by averaging the corresponding detection scores.
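The trimming procedure just described can be sketched in a few lines. The following is our own illustrative re-implementation (the window size of 25 and per-class threshold θ_k follow the text; function and variable names are ours):

```python
import numpy as np

def temporal_localization(scores, theta, win=25):
    """Trim one person track into subtrack candidates for one action class.
    scores: array of length T with the per-detection classifier scores s^k_t.
    Returns (start, end, score) triplets (end exclusive) for maximal runs
    where the median-filtered score exceeds the threshold theta."""
    T = len(scores)
    pad = win // 2
    padded = np.pad(scores, pad, mode='edge')
    smoothed = np.array([np.median(padded[t:t + win]) for t in range(T)])
    segments, start = [], None
    for t, above in enumerate(smoothed > theta):
        if above and start is None:
            start = t                      # a new candidate subtrack begins
        elif not above and start is not None:
            segments.append((start, t, float(scores[start:t].mean())))
            start = None
    if start is not None:                  # run extends to the end of the track
        segments.append((start, T, float(scores[start:].mean())))
    return segments

scores = np.array([0.0] * 30 + [1.0] * 30 + [0.0] * 30)  # a 90-frame toy track
print(temporal_localization(scores, theta=0.5))  # [(30, 60, 1.0)]
```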
Finally, non-maximum suppression (NMS) is applied to eliminate multiple overlapping predictions with spatio-temporal IoU above 0.2.

Hyperparameters. The regularization parameter λ (see equation (5)) is set to 10⁻⁴ in all of our experiments. To calibrate the model, the temporal localization thresholds θ_k are validated per class on a separate set corresponding to 10% of the training set.

Computational cost. At each training iteration of our algorithm, the computational complexity is linear in the number of tracklets of a given video. When running 30k iterations (sufficient to reach stable performance in terms of training loss) on a relatively modern computer (12 cores and 50 GB of RAM) using a single CPU thread, our Python implementation takes 20 minutes for UCF101-24 (100K tracklets) and 60 minutes for DALY (500K tracklets) in the 'Temporal + 1 BB' setting. Note that the timings are similar across the different levels of supervision.

4.3 Supervision as constraints

In this section, we describe in detail the constraint sets Ys for the different levels of supervision considered in our work. Recall from Section 3 that Ys is characterized by Os and Zs, the sets of tracklets for which we have strong supervision, and by Bs, the set of bags containing tracklets that are candidates to match a given action. Note that in all settings we have one class that corresponds to the background. In the following, a time unit corresponds to one of the bins obtained after uniformly dividing a time interval; its size is 8 frames.

Video level. Here, Bs contains as many bags as action classes occurring in the entire video. Every bag contains all the tracklets of the video. This constrains each annotated action class to be assigned to at least one tracklet from the video. We also construct Zs to make sure no tracklets are assigned to action classes not present in the video.

Shot level.
This setting is identical to the 'Video level' one, but we further decompose the video into clips and assume we have clip-level annotations. Such clips are obtained with the shot detector of [25] for the DALY dataset. Since UCF101-24 videos are already relatively short, we do not report shot-level results for this dataset.

Temporal point. For each action instance, we are given a point in time which falls into the temporal interval of the action. In our experiments, we sample this time point uniformly within the ground-truth interval. This corresponds to the scenario where an annotator would simply click one point in time where the action occurs, instead of precisely specifying the time boundaries. We then create a candidate interval around that time point with a fixed size of 50 frames (2 seconds). This interval is discretized into time units. For each of them, we impose that at least one tracklet should be assigned to the action instance. Hence, we add A × U bags to Bs, where A is the number of instances and U is the number of time units in a 50-frame interval. Finally, Zs is built as before: no tracklet should match actions absent from the video.

One bounding box. At each location of the previous temporal points, we are now additionally given the corresponding action-instance bounding box (a.k.a. keyframe in [46]). Similarly, we construct 50-frame time intervals. We consider all tracklets whose original track has a temporal overlap with the annotated frame. Then, we compute the spatial IoU between the bounding box of the track at the time of the annotated frame and the bounding box annotation. If this IoU is less than 0.3 for all possible frames, we construct Os so that these tracklets are forced to the background class. Otherwise, if the tracklet is inside the 50-frame time interval, we force it to belong to the action class instance with the highest IoU by augmenting Os.
Again, for each time unit of the interval, we construct Bs such that at least one tracklet matches the action instance. Zs is built as previously.

Temporal bounds. Given temporal boundaries for each action instance, we first add all tracklets that are outside these ground-truth intervals to Os in order to assign them to the background class. Then, we augment Bs with one bag per time unit composing the ground-truth interval in order to ensure that at least one tracklet is assigned to the corresponding action class at all times within the ground-truth interval. The set Zs is constructed as above.

Temporal bounds with bounding boxes. In addition to the temporal boundaries of action instances, we are also given a few frames with spatial annotation. This is a standard scenario adopted in DALY [46] and AVA [13]. In our experiments, we report results with one and three bounding boxes per action instance. Zs and Bs are constructed as in the 'Temporal bounds' setting. Os is initialized to assign all tracklets outside the temporal intervals to the background class. Similarly to the 'One bounding box' scenario, we augment Os in order to force tracklets to belong to either the corresponding action or the background depending on the spatial overlap criterion. However, here an action instance potentially has several annotations; the spatial overlap with the track is therefore defined as the minimum IoU over the corresponding bounding boxes.

Fully supervised. Here, we enforce a hard assignment of all tracklets to action classes based on the ground truth through Os. When a tracklet has a spatial IoU greater than 0.3 with at least one ground-truth instance, we assign it to the action instance with the highest IoU. Otherwise, the tracklet is assigned to the background.

Temporal bounds with spatial points. In [27], the authors introduce the idea of working with spatial points instead of bounding boxes for spatial supervision.
Following [27], we take the center of each annotated bounding box to simulate the spatial point annotation. Similarly to the 'Fully supervised' setting, we build Os to enforce a hard assignment for all tracklets, but we modify the action-tracklet matching criterion. When the tracklet bounding boxes contain all the corresponding annotation points of at least one ground-truth instance, we assign the tracklet to the action instance with the lowest distance between the annotation point and the bounding box center of the tracklet.

4.4 Results and analysis

In this section we present and discuss our experimental results. We evaluate our method on two datasets, UCF101-24 and DALY, for the following supervision levels: video level, temporal point, one bounding box, temporal bounds only, temporal bounds with one bounding box, and temporal bounds with three bounding boxes. For UCF101-24, we also report the temporal bounds with spatial points [27] and the fully supervised settings (not possible for DALY, as the spatial annotation is not dense). We evaluate the additional 'shot-level' supervision setup on DALY. All results are given in Table 1. We compare to the state of the art whenever possible. To the best of our knowledge, previously reported results are not available for all levels of supervision considered in this paper. We also report results for mixtures of different levels and amounts of supervision in Figure 2a. Qualitative results are illustrated in Figure 2b and in the Appendix.

Comparison to the state of the art. We compare our results to two state-of-the-art methods [27, 46] that are designed to deal with weak supervision for action localization. Mettes et al. [27] use spatial point annotation instead of bounding boxes. Weinzaepfel et al. [46] compare various levels of spatial supervision (spatial points, few bounding boxes). Temporal boundaries are known in both cases.
We also compare our fully supervised approach to recent methods, see Table 2.

Table 1: Video mAP for varying levels of supervision. Numbers in parentheses use a detector finetuned on spatial annotations (see text); no full spatial ground truth is available for DALY.

| Supervision               | Method | UCF101-24 @0.2 | UCF101-24 @0.5 | DALY @0.2   | DALY @0.5  |
| Video level               | Our    | 43.9           | 17.7           | 7.6         | 2.3        |
| Temporal point            | Our    | 45.5           | 18.7           | 26.7        | 8.1        |
| Shot level                | Our    | -              | -              | 12.3        | 3.9        |
| Temporal                  | Our    | 47.3 (69.5)    | 20.1 (38.0)    | 31.5 (33.4) | 9.8 (14.3) |
| Temporal + spatial points | [27]   | 34.8           | -              | -           | -          |
| Temporal + spatial points | [46]   | 57.5           | -              | -           | -          |
| Temporal + spatial points | Our    | 49.1 (69.8)    | 19.5 (39.5)    | -           | -          |
| 1 BB                      | Our    | 66.8           | 36.9           | 28.1        | 12.2       |
| Temp. + 1 BB              | [46]   | 57.4           | -              | 15.0        | -          |
| Temp. + 1 BB              | Our    | 70.6           | 38.6           | 32.5        | 13.3       |
| Temp. + 3 BBs             | [46]   | 57.3           | -              | 14.5        | -          |
| Temp. + 3 BBs             | Our    | 74.5           | 43.2           | 32.5        | 13.9       |
| Fully supervised          | [46]   | 58.9           | -              | -           | -          |
| Fully supervised          | Our    | 76.0           | 50.1           | -           | -          |

UCF101-24 specificity. For UCF101-24, we noticed a significant drop of performance whenever we use an off-the-shelf detector instead of a detector pretrained on UCF101-24 bounding boxes. We observe that this is due to two main issues: (i) the quality of images in the UCF101-24 dataset is quite low compared to DALY, which makes human detection very challenging, and (ii) the bounding boxes have been annotated with a large margin around the person, whereas a typical off-the-shelf detector produces tight detections (see Appendix). Addressing problem (i) is difficult without adaptive finetuning. Concerning (ii), a simple solution adopted in our work is to enlarge person detections with a single scaling factor (we use √2). However, we also observe that the size of boxes depends on the action class (e.g. the bounding boxes for ‘TennisSwing’ contain the tennis racket), which is something we cannot capture without more specific information.
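The single-factor enlargement mentioned above can be sketched as scaling each detection about its center. This is an illustrative sketch under our own assumptions: the function name is ours, and we assume the √2 factor is applied to the box width and height.

```python
import math

def enlarge_box(box, scale=math.sqrt(2)):
    """Scale a (x1, y1, x2, y2) box about its center by `scale` in width
    and height, to compensate for loose ground-truth annotations
    versus tight off-the-shelf detections (assumption: the factor
    applies to both dimensions)."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) * scale / 2.0
    half_h = (y2 - y1) * scale / 2.0
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)
```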
To better highlight this problem, we have included in Table 1 a baseline where we use the detections obtained after finetuning on spatial annotations even when this is normally not possible (the ‘temporal’ and ‘temporal with spatial points’ annotations). The detector is the same as in the ‘one bounding box’ supervision setup. We report these baselines in parentheses in Table 1. We observe that the drop in performance is much higher on UCF101-24 (−22.2% mAP@0.2 for ‘temporal’) than on DALY (−1.9%), which confirms the hypothesis that the problem comes from the tracks rather than from our method.

Results for varying levels of supervision. Results are reported in Table 1. As expected, performance increases when more supervision is available. On the DALY dataset, our method outperforms [46] by a significant margin (+18%). This highlights the strength of our model in addition to its flexibility. We first note that there is only a 1.0% difference between the ‘temporal’ and ‘temp. + 3 BBs’ levels of supervision. This shows that, with recent accurate human detectors, spatial supervision is less important. Interestingly, we observe good performance when using one temporal point only (26.7%). This weak information is almost sufficient to match the results of the precise ‘temporal’ supervision (31.5%). The corresponding performance drop is even smaller on the UCF101-24 dataset (−1.8%). This demonstrates the strength of the single ‘temporal click’ annotation, which could be a good and cheap alternative to the precise annotation of temporal boundaries.
On the UCF101-24 dataset, we always outperform the state of the art when bounding box annotations are available, e.g., +17.2% in the ‘temp. + 3 BBs’ setting.
This demonstrates the strength of sparse bounding box supervision, which almost matches the accuracy of the fully supervised setup (−1.5%). We note that using only one bounding box already enables good performance (66.8%). Overall, this opens up the possibility to annotate action video datasets much more efficiently, at a moderate cost. However, as explained above, we observe a significant drop when removing the spatial supervision. This is mainly because [46] uses more robust tracks obtained by a sophisticated combination of detections and an instance-specific tracker. This is shown in Table 1 of the Appendix, where we run our approach with the tracks of [46]. We obtain better results than [46] in all cases and a video-level mAP of 53.1%. ‘Video level’ results are reported in [26]. Compared to their approach, which is specifically designed for this supervision, we achieve better performance on UCF101-24 (43.9%, or 53.1% with tracks from [46], vs. 37.4%).
Comparison to fully supervised baselines. Table 2 reports additional baselines in the fully supervised setting for the UCF101-24 dataset. Note that, even if it is not the main focus of our work, our performance in the fully supervised setting is on par with many recent approaches, e.g. the recent work [18], which reports a mAP@0.5 of 49.2% on UCF101-24 (versus our 50.1%). This shows again that our model is not only designed for weak supervision, and thus can be used to fairly compare all levels of supervision. However, we are still below the current state of the art [13] on UCF101-24 with full supervision, which reports a mAP@0.5 of 59.9% (versus our 50.1%). This can be explained by the fact that we only learn a linear model on I3D features, whereas [13] trains a non-linear classifier on the same features.
Investigating how to build flexible deep (i.e.
non-linear) models that can handle varying levels of supervision is therefore an interesting avenue for future work.

Table 2: Comparison of fully supervised methods on UCF101-24 (video mAP).

| Method                  | @0.2 | @0.5 |
| Weinzaepfel et al. [46] | 58.9 | -    |
| Peng w/ MR [31]         | -    | 35.9 |
| Singh et al. [37]       | 73.5 | 46.3 |
| ACT [18]                | 76.5 | 49.2 |
| AVA [13]                | -    | 59.9 |
| Our method              | 76.0 | 50.1 |

(a) Mixing supervision. (b) Qualitative results on DALY.

Figure 2: Left. Mixing levels of supervision on the DALY dataset. Right. Predictions for the light green ground truth instance for various levels of supervision. We display the highest-scoring detection in each case. When training with only ‘shot-level’ annotation, it is hard for the method to discover precise boundaries of actions, as ‘Playing Harmonica’ could also consist of someone holding a harmonica. Annotating with a temporal point is sufficient to better detect the instance. Training with precise temporal boundaries further improves performance.

Mixing levels of supervision. To further demonstrate the flexibility of our approach, we conduct an experiment where we mix different levels of supervision in order to improve the results. We consider the weakest form of supervision, i.e. the ‘video-level’ annotation, and report results on the DALY dataset. The experimental setup consists of constructing a training dataset where a portion of the videos has weak supervision (either ‘video-level’ or ‘temporal point’ labels) and the remaining videos have stronger supervision (temporal bounds and 3 bounding boxes). We vary the portion of videos with stronger supervision available at training time (the rest having weak labels), and we evaluate mAP@0.2 on the test set. We report results in Figure 2a. Tracks are obtained using the off-the-shelf person detector.
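The construction of such a mixed-supervision training set amounts to randomly selecting a portion of videos to receive strong labels while the rest keep weak labels. The sketch below is our own illustration of this protocol, not the authors’ code; the function name and signature are hypothetical.

```python
import random

def mix_supervision(video_ids, strong_portion, seed=0):
    """Randomly pick a portion of training videos to receive strong
    labels (e.g. temporal bounds + 3 bounding boxes); the rest keep
    only weak labels (e.g. video-level or temporal-point).
    Returns (strong_ids, weak_ids)."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    ids = list(video_ids)
    rng.shuffle(ids)
    n_strong = round(strong_portion * len(ids))
    return ids[:n_strong], ids[n_strong:]
```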
With only 20 supervised videos (around 5% of the training data) and ‘video level’ labels for the remaining videos, the performance goes from 7.6% to 18.2% (mAP@0.2). We are on par with the performance of the fully supervised setting when using only 40% of the fully annotated data. This performance is reached even sooner when using ‘Temporal point’ weak labeling (with only 20% of fully annotated videos). This strongly encourages the use of methods with mixed levels of supervision for action localization.

5 Conclusion

This paper presents a weakly-supervised method for spatio-temporal action localization which aims to reduce the annotation cost associated with fully-supervised learning. We propose a unifying framework that can handle and combine varying types of less-demanding weak supervision. The key observations are that (i) dense spatial annotation is not always needed, given the recent advances in human detection and tracking, (ii) the performance of ‘temporal point’ supervision indicates that annotating an action with only a single ‘click’ is a promising way to decrease annotation cost at the price of a moderate performance drop, and (iii) mixing levels of supervision (see Figure 2a) is a powerful approach for reducing annotation efforts.

Acknowledgements

We thank Rémi Leblond for his thoughtful comments and help with the paper. This work was supported in part by ERC grants ACTIVIA and ALLEGRO, the MSR-Inria joint lab, the Louis Vuitton ENS Chair on Artificial Intelligence, an Amazon academic research award, the Intel gift and the DGA project DRAAF.

References

[1] Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Ivan Laptev, Josef Sivic, and Simon Lacoste-Julien. Unsupervised learning from narrated instruction videos. In CVPR, 2016. 3

[2] Pablo Arbeláez, Jordi Pont-Tuset, Jonathan T. Barron, Ferran Marques, and Jitendra Malik. Multiscale combinatorial grouping. In CVPR, 2014.
2

[3] Francis Bach and Zaïd Harchaoui. DIFFRAC: A discriminative and flexible framework for clustering. In NIPS, 2007. 3, 4

[4] Piotr Bojanowski, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid, and Josef Sivic. Finding actors and actions in movies. In ICCV, 2013. 2, 3, 4

[5] Piotr Bojanowski, Rémi Lajugie, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid, and Josef Sivic. Weakly supervised action labeling in videos under ordering constraints. In ECCV, 2014. 2, 5

[6] João Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017. 2, 5

[7] Wei Chen and Jason J. Corso. Action detection by implicit intentional motion clustering. In ICCV, 2015. 2

[8] Olivier Duchenne, Ivan Laptev, Josef Sivic, Francis Bach, and Jean Ponce. Automatic annotation of human actions in video. In CVPR, 2009. 2

[9] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. In Naval Research Logistics Quarterly, 1956. 3, 5

[10] Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollár, and Kaiming He. Detectron. https://github.com/facebookresearch/detectron, 2018. 2, 5

[11] Georgia Gkioxari and Jitendra Malik. Finding action tubes. In CVPR, 2015. 1, 2

[12] Alex Gorban, Haroon Idrees, Yu-Gang Jiang, Amir R. Roshan Zamir, Ivan Laptev, Mubarak Shah, and Rahul Sukthankar. THUMOS challenge: Action recognition with a large number of classes. http://www.thumos.info/, 2015. 5

[13] Chunhui Gu, Chen Sun, Sudheendra Vijayanarasimhan, Caroline Pantofaru, David A Ross, George Toderici, Yeqing Li, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, et al. AVA: A video dataset of spatio-temporally localized atomic visual actions.
In CVPR, 2018. 1, 2, 7, 8

[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. 5

[15] Rui Hou, Chen Chen, and Mubarak Shah. Tube convolutional neural network (T-CNN) for action detection in videos. In ICCV, 2017. 2

[16] De-An Huang, Li Fei-Fei, and Juan Carlos Niebles. Connectionist temporal modeling for weakly supervised action labeling. In ECCV, 2016. 2

[17] Armand Joulin, Francis Bach, and Jean Ponce. Discriminative clustering for image co-segmentation. In CVPR, 2010. 3

[18] Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, and Cordelia Schmid. Action tubelet detector for spatio-temporal action localization. In ICCV, 2017. 1, 2, 5, 8

[19] Will Kay, João Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. In CoRR, 2017. 5

[20] Yan Ke, Rahul Sukthankar, and Martial Hebert. Efficient visual event detection using volumetric features. In ICCV, 2005. 2

[21] S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In ICML, 2013. 3, 5

[22] Ivan Laptev and Patrick Pérez. Retrieving actions in movies. In ICCV, 2007. 2

[23] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014. 5

[24] Bruce D Lucas, Takeo Kanade, et al. An iterative image registration technique with an application to stereo vision. In IJCAI, 1981. 5

[25] Johan Mathe. Shotdetect. https://github.com/johmathe/Shotdetect, 2012. 6

[26] Pascal Mettes, Cees G. M. Snoek, and Shih-Fu Chang. Localizing actions from video labels and pseudo-annotations. In BMVC, 2017. 2, 8

[27] Pascal Mettes, Jan C. van Gemert, and Cees G. M.
Snoek. Spot on: action localization from pointly-supervised proposals. In ECCV, 2016. 1, 3, 7, 8

[28] Antoine Miech, Jean-Baptiste Alayrac, Piotr Bojanowski, Ivan Laptev, and Josef Sivic. Learning from video and text via large-scale discriminative clustering. In ICCV, 2017. 3, 5

[29] Dan Oneata, Jérôme Revaud, Jakob Verbeek, and Cordelia Schmid. Spatio-temporal object detection proposals. In ECCV, 2014. 2

[30] Anton Osokin, Jean-Baptiste Alayrac, Isabella Lukasewitz, Puneet K. Dokania, and Simon Lacoste-Julien. Minding the gaps for block Frank-Wolfe optimization of structured SVMs. In ICML, 2016. 6

[31] Xiaojiang Peng and Cordelia Schmid. Multi-region two-stream R-CNN for action detection. In ECCV, 2016. 1, 2, 8

[32] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015. 5

[33] Alexander Richard, Hilde Kuehne, and Juergen Gall. Weakly supervised action learning with RNN based fine-to-coarse modeling. In CVPR, 2017. 2

[34] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. In IJCV, 2015. 5

[35] Suman Saha, Gurkirt Singh, and Fabio Cuzzolin. AMTnet: Action-micro-tube regression by end-to-end trainable deep architecture. In ICCV, 2017. 2

[36] Suman Saha, Gurkirt Singh, Michael Sapienza, Philip HS Torr, and Fabio Cuzzolin. Deep learning for detecting multiple space-time action tubes in videos. In BMVC, 2016. 1, 2, 5

[37] Gurkirt Singh, Suman Saha, Michael Sapienza, Philip Torr, and Fabio Cuzzolin. Online real time multiple spatiotemporal action localisation and prediction. In ICCV, 2017. 1, 2, 5, 8

[38] Krishna Kumar Singh and Yong Jae Lee.
Hide-and-Seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In ICCV, 2017. 2

[39] Parthipan Siva and Tao Xiang. Weakly supervised action detection. In BMVC, 2011. 2

[40] Khurram Soomro and Mubarak Shah. Unsupervised action discovery and localization in videos. In ICCV, 2017. 2

[41] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. In CoRR, 2012. 5

[42] Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM Smeulders. Selective search for object recognition. In IJCV, 2013. 2

[43] Jan C. van Gemert, Mihir Jain, Elia Gati, and Cees G. M. Snoek. APT: Action localization proposals from dense trajectories. In BMVC, 2015. 2

[44] Limin Wang, Yuanjun Xiong, Dahua Lin, and Luc Van Gool. UntrimmedNets for weakly supervised action recognition and detection. In CVPR, 2017. 2

[45] Philippe Weinzaepfel, Zaid Harchaoui, and Cordelia Schmid. Learning to track for spatio-temporal action localization. In ICCV, 2015. 1, 2

[46] Philippe Weinzaepfel, Xavier Martin, and Cordelia Schmid. Human action localization with sparse spatial supervision. In CoRR, 2016. 1, 3, 5, 6, 7, 8

[47] Linli Xu, James Neufeld, Bryce Larson, and Dale Schuurmans. Maximum margin clustering. In NIPS, 2004. 3

[48] Jiong Yang and Junsong Yuan. Common action discovery and localization in unconstrained videos. In ICCV, 2017. 2

[49] Mohammadreza Zolfaghari, Gabriel L Oliveira, Nima Sedaghat, and Thomas Brox. Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection. In ICCV, 2017.
2