{"title": "Max-Margin Structured Output Regression for Spatio-Temporal Action Localization", "book": "Advances in Neural Information Processing Systems", "page_first": 350, "page_last": 358, "abstract": "Structured output learning has been successfully applied to object localization, where the mapping between an image and an object bounding box can be well captured. Its extension to action localization in videos, however, is much more challenging, because one needs to predict the locations of the action patterns both spatially and temporally, i.e., identifying a sequence of bounding boxes that track the action in video. The problem becomes intractable due to the exponentially large size of the structured video space where actions could occur. We propose a novel structured learning approach for spatio-temporal action localization. The mapping between a video and a spatio-temporal action trajectory is learned. The intractable inference and learning problems are addressed by leveraging an efficient Max-Path search method, thus makes it feasible to optimize the model over the whole structured space. Experiments on two challenging benchmark datasets show that our proposed method outperforms the state-of-the-art methods.", "full_text": "Max-Margin Structured Output Regression for\n\nSpatio-Temporal Action Localization\n\nDu Tran and Junsong Yuan\n\nSchool of Electrical and Electronic Engineering\nNanyang Technological University, Singapore\n\ntrandu@gmail.com, jsyuan@ntu.edu.sg\n\nAbstract\n\nStructured output learning has been successfully applied to object localization,\nwhere the mapping between an image and an object bounding box can be well\ncaptured. Its extension to action localization in videos, however, is much more\nchallenging, because we need to predict the locations of the action patterns both\nspatially and temporally, i.e., identifying a sequence of bounding boxes that track\nthe action in video. 
The problem becomes intractable due to the exponentially large size of the structured video space where actions could occur. We propose a novel structured learning approach for spatio-temporal action localization. The mapping between a video and a spatio-temporal action trajectory is learned. The intractable inference and learning problems are addressed by leveraging an efficient Max-Path search method, thus making it feasible to optimize the model over the whole structured space. Experiments on two challenging benchmark datasets show that our proposed method outperforms the state-of-the-art methods.\n\n1 Introduction\n\nBlaschko and Lampert have recently shown that object localization can be approached as a structured regression problem [2]. Instead of modeling object localization as a binary classification and treating every bounding box independently, their method trains a discriminant function directly for predicting the bounding boxes of objects located in images. Compared with the conventional sliding-window approach, it considers the correlations among the output variables and avoids an exhaustive search over subwindows for object detection.\n\nMotivated by the successful application of structured regression to object localization [2], it is natural to ask whether we can perform structured regression for action localization in videos. Although this idea looks plausible, the extension from object localization to action localization is non-trivial. Different from object localization, where a visual object can be well localized by a 2-dimensional (2D) subwindow, human actions cannot be tightly bounded in an analogous way, i.e., by a 3-dimensional (3D) subvolume. 
Although many current methods for action detection are based on this 3D subvolume assumption [6, 9, 20, 29] and search for video subvolumes to detect actions, such an assumption is only reasonable for “static” actions, where the subjects do not move globally, e.g., pick-up or kiss. For “dynamic” actions, where the subjects can move globally, e.g., walk, run, or dive, the subvolume constraint is no longer suitable. Thus, a more accurate localization scheme that can track the actions is required for localizing dynamic actions in videos. For example, one can localize an action by a 2D bounding box in each frame and track it as the action moves across frames. This structured output forms a smooth spatio-temporal path of connected 2D bounding boxes. Such a spatio-temporal path can tightly bound the action in the video space and provide a more accurate spatio-temporal localization of actions.\n\nFigure 1: Complexities of object and action localization: a) Object localization is of O(n^4). b) Action localization by subvolume search is of O(n^6). c) Spatio-temporal action localization has a much larger search space.\n\nHowever, as the video space is much larger than the image space, spatio-temporal action localization has a much larger structured space than object localization. For a video of size w × h × n, the search spaces of 2D subwindows and 3D subvolumes are only O(w^2 h^2) and O(w^2 h^2 n^2), respectively (Figure 1). In contrast, the number of possible spatio-temporal paths in the video space is exponential, O(whn · k^n) [23], if we do not know the start and end points of the path (k is the number of incoming edges per node). Any of these paths can be a candidate for spatio-temporal action localization, thus an exhaustive search is infeasible. 
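To make these magnitudes concrete, the following is a small illustrative sketch (toy sizes chosen here for illustration; the counting expressions simply follow the complexity bounds above, with k the number of incoming edges per node):

```python
# Toy comparison of the three search-space sizes discussed above.
# Counts follow the asymptotic expressions in the text:
#   2D subwindows: O(w^2 h^2), 3D subvolumes: O(w^2 h^2 n^2),
#   smooth spatio-temporal paths: O(w h n k^n).

def search_space_sizes(w, h, n, k=9):
    subwindows = (w * h) ** 2          # subwindows in one frame
    subvolumes = (w * h * n) ** 2      # axis-aligned 3D subvolumes
    paths = w * h * n * k ** n         # paths: exponential in video length n
    return subwindows, subvolumes, paths

sw, sv, p = search_space_sizes(w=32, h=24, n=100)
print(sw, sv)   # both polynomial and easily enumerable
print(p > sv)   # the path count dwarfs the subvolume count
```

Even at this tiny resolution, the path count exceeds the subvolume count by hundreds of orders of magnitude, which is why a naive enumeration is hopeless.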
This huge structured space keeps structured learning approaches from being practical for spatio-temporal action localization due to intractable inference.\n\nThis paper proposes a new approach for spatio-temporal action localization which addresses the above-mentioned problems. Instead of using the 3D subvolume localization scheme, we precisely locate and track the action by finding an optimal spatio-temporal path. The mapping between a video and a spatio-temporal action trajectory is learned. By leveraging an efficient Max-Path search method [23], the intractable inference and learning problems can be addressed, thus making our approach practical and effective even though the structured space is very large. Being solved as a structured learning problem, our method can well exploit the correlations between dependent local video features, and therefore optimizes the structured output. Experiments on two challenging benchmark datasets show that our method significantly outperforms the state-of-the-art methods.\n\n1.1 Related work\n\nHuman action detection is traditionally approached by spatio-temporal video volume matching using different features: space-time orientation [6], volumetric [9], action MACH [20], HOG3D [10]. A sliding-window scheme is then applied to locate actions, which is ineffective and time-consuming. Various matching and learning models have also been introduced. Boiman and Irani proposed ensembles of patches to detect irregularities in images and videos [3]. Hu et al. used multiple-instance learning to detect actions [8]. Mahadevan et al. used mixtures of dynamic textures to detect anomalous events [15]. Le et al. used deep learning to learn unsupervised features for recognizing human activities [14]. Niebles et al. used a probabilistic latent semantic analysis model for recognizing actions [17]. 
Yao et al. trained probabilistic non-linear latent variable models to track complex activities [28]. Yuan et al. extended the branch-and-bound subwindow search [11] to subvolume search for action detection [29]. Recently, Tran and Yuan relaxed the 3D bounding box constraint for detecting and localizing medium and long video events [23]. Despite the improvements over 3D-subvolume-based approaches, this method did not fully utilize the correlations between local part detectors, as they were trained independently.\n\nMax-margin structured output learning [19, 21, 24] was recently proposed and has demonstrated its success in many applications. One of its attractive features is that although the structured space can be very large, whenever inference is tractable, learning is also tractable. Finley and Joachims further showed that overgenerating (e.g. relaxation) algorithms have theoretical advantages over undergenerating (e.g. greedy) methods when exact inference is intractable [7]. Various structured-learning-based approaches were proposed to solve computer vision problems including pedestrian detection [22], object detection [2, 25], object segmentation [1], facial action unit detection [16], human interaction recognition [18], group activity recognition [13], and human pose parsing [27]. More recently, Lan et al. used a latent SVM to jointly detect and recognize actions in videos [12].\n\nAmong these works, Lan et al. [12] is the most similar to ours. However, this method requires a reliable human detector in both inference and learning, thus it is not applicable to “dynamic” actions where the human poses vary significantly. 
Moreover, because it uses HOG3D features [26], it only detects actions in a sparse subset of frames where interest points are present.\n\n2 Spatio-Temporal Action Localization as Structured Output Regression\n\nGiven a video x of size w × h × m, where w × h is the frame size and m is its length, to localize actions one needs to predict a structured object y which is a smooth spatio-temporal path in the video space. We denote a path y = {(l, t, r, b)i=1..m}, where (l, t, r, b)i are respectively the left, top, right, and bottom coordinates of the rectangle that bounds the action in the i-th frame. These values are all set to zero when there is no action in the frame. Because of the spatio-temporal smoothness constraint, the boxes in y necessarily vary smoothly over the spatio-temporal video space. Let us denote X ⊂ [0, 255]^{3whm} as the set of color videos, and Y ⊂ R^{4m} as the set of all smooth spatio-temporal paths in the video space. The problem of spatio-temporal action localization then becomes learning a structured prediction function f : X → Y.\n\n2.1 Structured Output Learning\n\nLet {x1, . . . , xn} ⊂ X be the training videos, and {y1, . . . , yn} ⊂ Y be their corresponding annotated ground truths. We formulate the action localization problem using structured learning as presented in [24]. Instead of searching for f directly, we learn a discriminant function F : X × Y → R. F is a compatibility function which measures how well a localization y suits the given input video x. 
If the model is parameterized by w, we denote F(x, y; w) = ⟨w, φ(x, y)⟩, which is a family of functions parameterized by w, where φ(x, y) is a joint kernel feature map representing the spatio-temporal features of y given x.\n\nOnce F is trained, i.e., the optimal parameter w∗ is determined, the final prediction ŷ can be obtained by maximizing F over Y for a specific input x:\n\nŷ = f(x; w∗) = argmax_{y ∈ Y} F(x, y; w∗) = argmax_{y ∈ Y} ⟨w∗, φ(x, y)⟩  (1)\n\nThe optimal parameter set w∗ is selected by solving the convex optimization problem in Eq. 2:\n\nmin_{w, ξ} (1/2)‖w‖^2 + C Σ_{i=1}^{n} ξi\ns.t. ⟨w, φ(xi, yi) − φ(xi, y)⟩ ≥ Δ(yi, y) − ξi, ∀i, ∀y ∈ Y\\yi,\nξi ≥ 0, ∀i.  (2)\n\nEq. 2 optimizes w such that the score of the true structure yi of xi is larger than that of any other structure y by a margin rescaled by the loss Δ(yi, y). The loss function will be defined in Section 2.3. This optimization is similar to the traditional support vector machine (SVM) formulation except for two differences. First, the number of constraints is much larger due to the huge size of the structured space Y. Second, the margins are rescaled differently by each constraint’s loss Δ(yi, y). Because of the large number of constraints, the problem in Eq. 2 cannot be solved directly although it is convex. Alternatively, one can solve it by the cutting plane algorithm [24] or subgradient methods [19, 21]. We use the cutting plane algorithm. It starts with a random parameter w and an empty constraint set. At each round, it searches for the most violated constraint and adds it to the constraint set. 
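Schematically, this cutting-plane loop can be sketched as follows (a minimal illustration only; `find_most_violated` and `solve_qp` are hypothetical placeholders for the loss-augmented inference step and the restricted QP solver described in [24]):

```python
# Schematic cutting-plane training loop for a margin-rescaled structural SVM.
# `find_most_violated(w, x, y_true)` is assumed to return the most violated
# structure and its violation value; `solve_qp(constraints)` is assumed to
# re-optimize the parameters over the current working set of constraints.

def cutting_plane(examples, find_most_violated, solve_qp, eps=1e-3, max_rounds=100):
    constraints = []   # working set of (x, y_true, y_violated) triples
    w = None           # model parameters; refit after each new constraint
    for _ in range(max_rounds):
        added = 0
        for x, y_true in examples:
            y_hat, violation = find_most_violated(w, x, y_true)
            if violation > eps:            # constraint violated beyond tolerance
                constraints.append((x, y_true, y_hat))
                w = solve_qp(constraints)  # re-optimize over the working set
                added += 1
        if added == 0:                     # no new constraint: converged
            break
    return w                               # None if no constraint was ever violated
```

The key property exploited in this paper is that the inner `find_most_violated` step can itself be carried out by the Max-Path search described in Section 3.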
This step searches for the y that maximizes the violation value ξi (Eq. 3). When a new constraint is found, the optimization is applied to update w. The process is repeated until no new constraint is added. This algorithm is proven to converge [24], normally within a small number of constraints due to the sparsity of the structured space.\n\nξi ≥ Δ(yi, y) + ⟨w, φ(xi, y)⟩ − ⟨w, φ(xi, yi)⟩, ∀y ∈ Y\\yi  (3)\n\n2.2 The Joint Kernel Feature Map for Action Localization\n\nLet us denote by x|y the video portion cut out from x by the path y, namely the stack of images cropped by the bounding boxes b1..m of y. We also denote ϕ(bi) ∈ R^k as a feature map for a 2D box bi. It is worth noting that ϕ(bi) can be represented by either local features (e.g. local interest points) or global features (e.g. HOG, HOF) of the whole box bi. We thus have a feature map φ(x, y) for x|y, which is also a vector in R^k:\n\nφ(x, y) = (1/m) Σ_{i=1}^{m} ϕ(bi)  (4)\n\nFinally, the decision function of our structured prediction is formed as in Eq. 5:\n\nF(x, y; w) = ⟨w, φ(x, y)⟩ = (1/m) Σ_{i=1}^{m} ⟨w, ϕ(bi)⟩.  (5)\n\n2.3 Loss Function\n\nWe define a hinge loss function Δ : Y × Y → [0, 1] for evaluating the loss induced by a predicted structure ŷ compared with a true structure label y. We denote y = {bi=1..m}, where bi = (l, t, r, b)i is the ground truth box of the i-th frame. 
Similarly, we denote by ŷ = {b̂i=1..m} the predicted structure. The loss function is defined as follows:\n\nΔ(y, ŷ) = (1/m) Σ_{i=1}^{m} δ(bi, b̂i)  (6)\n\nδ(b, b̂) = 1 − Area(b ∩ b̂)/Area(b ∪ b̂) if l_b = l_b̂ = 1; δ(b, b̂) = 1 − (1/2)(l_b · l_b̂ + 1) otherwise  (7)\n\nl_b = −1 if b = (0, 0, 0, 0); l_b = 1 otherwise  (8)\n\n3 Inference and Learning\n\nWe need a feasible way to perform the inference in Eq. 1 during testing, which can be rewritten as Eq. 9:\n\nŷ = argmax_{y ∈ Y} ⟨w, φ(x, y)⟩ = (1/m) argmax_{y ∈ Y} Σ_{i=1}^{m} ⟨w, ϕ(bi)⟩.  (9)\n\nDuring training, we need to search for the most violated constraints by maximizing the right-hand side of Eq. 3, which is equivalent to Eq. 10. From now on, we write ȳ for yi in Eq. 2 because the example index i is no longer important.\n\nmax_{y ∈ Y} {Δ(y, ȳ) + ⟨w, φ(x, y)⟩}  (10)\n= max_{y ∈ Y} {(1/m) Σ_{i=1}^{m} δ(bi, b̄i) + (1/m) Σ_{i=1}^{m} ⟨w, ϕ(bi)⟩}  (11)\n= (1/m) max_{y ∈ Y} {Σ_{i=1}^{m} (δ(bi, b̄i) + ⟨w, ϕ(bi)⟩)}  (12)\n\nTo solve Eq. 9 and Eq. 12, one needs to search for a smooth path y∗ in the spatio-temporal video space Y that gives the maximum total score. Both equations are difficult due to the large size of Y, e.g. the exponential number of possible spatio-temporal paths in Y (see supplemental material). We now show that both problems in Eq. 9 and Eq. 12 can be reduced to the Max-Path search problem and solved efficiently by [23]. The Max-Path algorithm [23] was proposed to detect dynamic video events. 
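As a simplified illustration of this kind of search, the sketch below runs a best-path dynamic program over a toy trellis (assumed simplification: one window index per frame and a ±1 smoothness constraint; the actual algorithm in [23] performs the analogous dynamic program over 2D window positions and scales, and also recovers the path via backpointers):

```python
# Simplified sketch of Max-Path-style search: find the best-scoring smooth
# path through a trellis of per-frame window scores. "Smooth" here means the
# window index changes by at most 1 between consecutive frames, and the path
# may start and end at any frame (a negative-score prefix is simply dropped).

def max_path(trellis):
    """trellis[t][j] = local score of window j at frame t (may be negative).
    Returns the best total path score (backpointers omitted for brevity)."""
    best_score = float("-inf")
    prev = [float("-inf")] * len(trellis[0])
    for frame in trellis:
        cur = []
        for j, local in enumerate(frame):
            # extend the best smooth predecessor, or start a new path here
            best_prev = max(prev[max(0, j - 1):j + 2])
            cur.append(local + max(best_prev, 0.0))
        prev = cur
        best_score = max(best_score, max(cur))
    return best_score

# a path that skips the early negative frame and tracks the high scores
print(max_path([[-5, -5], [3, 1], [2, 4]]))  # prints 7.0
```

The dynamic program visits each trellis cell once, so its cost is linear in the trellis size even though the number of candidate paths is exponential.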
The Max-Path algorithm is guaranteed to obtain the best spatio-temporal path in the video space, provided that the local window scores can be precomputed. It takes a 3D trellis of local window scores as input and outputs the best path, i.e., the one with the maximum total score. In testing, the trellis’s local scores are ⟨w, ϕ(bi)⟩, where bi is the local window; these values are easily evaluated given w and a feature map ϕ. In training, the trellis values are δ(bi, b̄i) + ⟨w, ϕ(bi)⟩, which are also computable given the parameter w, the feature map ϕ, and the ground truth b̄i. After the trellis is constructed, the Max-Path algorithm is employed to find the best path; we can therefore identify the smooth spatio-temporal path y∗ that maximizes Eq. 9 and Eq. 12.\n\n3.1 Constraint Enforcement\n\nLet us consider one constraint in Eq. 2; here we ignore the index i of the example for simplicity and use ȳ as the ground truth for example x. We also denote y = b1..m and ȳ = b̄1..m.\n\n⟨w, φ(x, ȳ)⟩ − ⟨w, φ(x, y)⟩ ≥ Δ(ȳ, y) − ξ, ∀y ∈ Y\\ȳ  (13)\n⇔ (1/m) Σ_{i=1}^{m} ⟨w, ϕ(b̄i)⟩ − (1/m) Σ_{i=1}^{m} ⟨w, ϕ(bi)⟩ ≥ (1/m) Σ_{i=1}^{m} δ(bi, b̄i) − ξ, ∀y ∈ Y\\ȳ  (14)\n⇔ Σ_{i=1}^{m} ⟨w, ϕ(b̄i)⟩ − Σ_{i=1}^{m} ⟨w, ϕ(bi)⟩ ≥ Σ_{i=1}^{m} δ(bi, b̄i) − mξ, ∀y ∈ Y\\ȳ  (15)\n\nThe constraint in Eq. 15 can be split into the m constraints in Eq. 16, which are harder; therefore satisfying these m constraints will lead to satisfying the Eq. 
15 constraint:\n\n⟨w, ϕ(b̄i)⟩ − ⟨w, ϕ(bi)⟩ ≥ δ(bi, b̄i) − ξ, ∀i ∈ [1..m], ∀y ∈ Y\\ȳ  (16)\n\nIn training, instead of solving Eq. 2 with the constraints in Eq. 13, we solve it with the set of constraints in Eq. 16. The problem is harder because of the tighter constraints. However, the important benefit of this enforcement is that instead of comparing the features of two different spatio-temporal paths y and ȳ, one can compare the features of the individual box pairs (bi, b̄i) of those two paths. This constraint enforcement helps the training algorithm avoid comparing the features of two paths of different lengths, which is unstable due to feature normalization.\n\n4 Experimental Setup\n\nDatasets: we conduct experiments on two datasets: UCF-Sport [20] and Oxford-TV [18]. The UCF-Sport dataset consists of 150 video sequences of 10 different action classes. We use the same training/testing split as [12]. On this dataset, we detect three different actions: horse-riding, running, and diving. We choose these actions because they have different levels of body movement: horse-riding is relatively rigid, running is more deformable, and diving is extremely deformable in terms of articulated body movements. The Oxford-TV dataset consists of 300 videos taken from real TV programs. It has 4 classes of actions (hand-shake, high-five, hug, and kiss) and a set of 100 negative videos. As in [18], this dataset is divided into two equal subsets; we use set 1 for training and set 2 for testing. We perform the task of kiss detection and localization on this dataset. Kissing is more challenging than the other action classes in this dataset due to weaker motion and appearance cues.\n\nFeatures and Parameters: our algorithm needs a feature representation ϕ(b) of a cropped image b. 
We use a global representation for ϕ(b) based on Histograms of Oriented Gradients (HOG) [4] and Histograms of Flow (HOF) [5]. The cropped image b is divided into h × v half-overlapped blocks; each block has 2 × 2 cells, and each cell is represented by a 9-bin histogram. The feature vector’s length thus becomes h × v × 2 × 2 × 9 × 2 = 72 × h × v for HOG and HOF together. (h, v) can differ per class due to the different shape ratios of the actions (e.g. rectangular boxes for horse-riding and running, square boxes for diving). More specifically, we use (7, 15) for horse-riding and running, (11, 11) for diving, and (9, 7) for kissing. The regularization parameter C in Eq. 2 is set to 1 in all cases.\n\nEvaluation Metrics: we quantitatively evaluate the different methods on both detection and localization. As in [12], the video localization score is measured by averaging its frame localization scores, each of which is the overlap area divided by the union area of the predicted and ground truth boxes. A prediction is then considered correct if its localization score is greater than or equal to σ = 0.2. It is worth noting that detection evaluations are applied to both positive and negative testing examples, while localization evaluations are applied only to positive ones. As a result, the detection metric measures the reliability of the detections (precision/recall), while the localization metric indicates the quality of the detections, e.g. how accurate the predicted spatio-temporal paths are compared with the ground truth. More specifically, detection answers the question “Is there any action of interest in this video?”, while localization answers “Provided that one action instance appears in this video, where is it?”.\n\nFigure 2: Action detection results on UCF-Sport: detection curves of our proposed method compared with [12] and [23]. 
Upper plots are detection results evaluated on the subset of frames given by [12], while lower plots are the results of all-frame evaluations. Except for diving, our proposed method significantly improves over the other methods.\n\nEval. Set | Method | H-Ride | Run | Dive | Average\nSubset | [12] | 21.75 | 19.60 | 42.67 | 28.01\nSubset | [23] | 62.19 | 50.20 | 16.41 | 42.93\nSubset | Our | 68.06 | 61.41 | 36.54 | 55.34\nAll | [12] | N/A | N/A | N/A | N/A\nAll | [23] | 63.06 | 48.09 | 22.64 | 44.60\nAll | Our | 64.01 | 61.86 | 37.03 | 54.30\n\nTable 1: Action localization results on UCF-Sport: comparisons among our proposed method, [12], and [23]. The upper section presents results evaluated on the subset of frames given by [12], while the lower section reports results from evaluating on all frames. Our method improves by 27.33% over [12] and 12.41% over [23] on subset evaluations, and by 9.7% over [23] on all-frame evaluations. N/A indicates not applicable.\n\n5 Experimental Results\n\nUCF-Sport: we compare our method with two current approaches: Lan et al. [12] and Tran and Yuan [23]. The output predictions of Lan et al. are obtained directly from [12]. For [23], we train a linear SVM detector for each action class using the same features as ours; the Max-Path algorithm is then applied to detect the actions of interest. The method of [12] uses HOG3D features [26], so it is only able to detect and localize actions at the sparse set of frames where HOG3D interest points are present. To provide a fair comparison with [12], we report two different sets of evaluations: the first is applied only to the subset of frames where [12] reports detections, and the second takes all frames into consideration.\n\nTable 1 reports the action localization results of the different methods and action classes. On average, our method improves by 27.33% over [12] and 12.41% over [23] on subset evaluations, and by 9.7% over [23] on all-frame evaluations. 
Figure 2 shows the detection results of the different methods on the UCF-Sport dataset. Our method significantly improves over [23] for all three action classes on both subset and all-frame evaluations. Compared with [12] on subset evaluations, our method significantly improves on horse-riding and running detection. However, [12] provides better results than ours on diving detection, because its interest-point-based sparse features are more suitable for deformable actions such as diving. For a complete presentation, we visualize the localization results of our method compared with those of [12] and [23] on a diving sequence (Figure 3). All predicted boxes are plotted together with ground truth boxes for comparison. It is worth noting that [12] only has predictions at a sparse set of frames, therefore blue\n\nFigure 3: Visualization of diving localization: the plots of localization scores of different methods on a diving video sequence. Lan et al.’s [12] results are visualized in blue, Tran and Yuan’s [23] in green, ours in red, and ground truth as black boxes. Best viewed in color.\n\nFigure 4: Action detection and localization on UCF-Sport: Lan et al.’s [12] results are visualized in blue, Tran and Yuan’s [23] in green, ours in red, and ground truth in black. 
Our method and [23] can detect multiple instances of actions (bottom left two images).\n\nsquares are visualized as discrete dots, while the other methods are visualized as continuous curves. Our method (red curve) localizes the diving action much more accurately than [23] (green curve). [12] localizes the diving action fairly well; however, it is not applicable when more accurate localizations (e.g. predictions for all frames) are required.\n\nOxford-TV: we compare our method with [23] on both detection and localization tasks. For detection, we report two different quantitative evaluations: the equal precision-recall (EPR) rate and the area under the ROC curve (AUC). For localization, besides the spatial localization (SL) metric used in the UCF-Sport experiments, we also evaluate the methods by a temporal localization (TL) metric. This metric is not applicable to the UCF-Sport dataset because most of its action instances start and end at the first and last frame, respectively. Temporal localization is computed as the length\n\nMethod | EPR(%) | AUC | SL(%) | TL(%)\n[18] | 32.50* | N/A | N/A | N/A\n[23] | 24.14 | 0.27 | 18.46 | 40.09\nOur | 38.89 | 0.42 | 39.52 | 45.30\n\nTable 2: Kiss detection and localization results. We improve over [23] by 14.74% in equal precision/recall detection rate, 0.15 in area under the ROC curve, 21.06% in spatial localization, and 5.21% in temporal localization. *The result of [18] is not directly comparable. N/A indicates not applicable.\n\nFigure 5: Visualization of kiss detection: our results are visualized in red; ground truths are in green. The upper two rows show some correct detections, while the last row shows false or missed detections.\n\nFigure 6: Kiss detection results: a) Precision-recall curves with σ = 0.2. b) Precision-recall curves with σ = 0.4. c) ROC curves with σ = 0.2. 
Numbers inside the legends are the best precision-recall values (a and b) and the area under the ROC curve (c).\n\n(measured in frames) of the intersection divided by the union of the detection and the ground truth. Table 2 presents the detection and localization results of our proposed method compared with [23]. On the localization task, our method improves over [23] by 21.06% in spatial localization and 5.21% in temporal localization. On the detection task, with the cut-off threshold σ = 0.2, our method improves over [23] by 14.74% in equal precision-recall rate and 0.15 in area under the ROC curve (Figures 6a and 6c). One may further ask “what if we need more accurate detections?”. Interestingly, when we increase the cut-off threshold σ to 0.4, [23] drops significantly from 24.14% to 8.82%, while our method remains at 29.03% (Figure 6b), which demonstrates that our method can simultaneously detect and localize actions with high accuracy.\n\n6 Conclusions\n\nWe have proposed a novel structured learning approach for spatio-temporal action localization in videos. While most current approaches detect actions as 3D subvolumes [6, 9, 20, 29] or at a sparse subset of frames [12], our method can precisely detect and track actions in both space and time. Although [23] is also applicable to spatio-temporal action detection, it cannot be optimized over the large video space because its detectors are trained independently. Our approach significantly outperforms [23] thanks to the structured optimization; this improvement gap is also consistent with the theoretical analysis in [7]. Moreover, being free from people detection and background subtraction, our approach can efficiently handle unconstrained videos and be easily extended to detect other spatio-temporal video patterns. 
Strong experimental results on two challenging benchmark datasets demonstrate that our proposed method significantly outperforms the state-of-the-art methods.\n\nAcknowledgments\n\nThe authors would like to thank Tian Lan for reproducing [12]’s results on the UCF dataset, and Minh Hoai Nguyen for useful discussions about the cutting-plane algorithm. This work is supported in part by the Nanyang Assistant Professorship (SUG M58040015) to Dr. Junsong Yuan.\n\nReferences\n[1] L. Bertelli, T. Yu, D. Vu, and S. Gokturk. Kernelized structural SVM learning for supervised object segmentation. CVPR, 2011.\n[2] M. B. Blaschko and C. H. Lampert. Learning to localize objects with structured output regression. ECCV, 2008.\n[3] O. Boiman and M. Irani. Detecting irregularities in images and in video. IJCV, 2007.\n[4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. CVPR, 2005.\n[5] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. ECCV, 2006.\n[6] K. Derpanis, M. Sizintsev, K. Cannons, and P. Wildes. Efficient action spotting based on a spacetime oriented structure representation. CVPR, 2010.\n[7] T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. ICML, 2008.\n[8] Y. Hu, L. Cao, F. Lv, S. Yan, Y. Gong, and T. S. Huang. Action detection in complex scenes with spatial and temporal ambiguities. ICCV, 2009.\n[9] Y. Ke, R. Sukthankar, and M. Hebert. Volumetric features for video event detection. IJCV, 2010.\n[10] A. Klaser, M. Marszalek, and C. Schmid. A spatio-temporal descriptor based on 3d-gradients. BMVC, 2008.\n[11] C. H. 
Lampert, M. B. Blaschko, and T. Hofmann. Efficient subwindow search: A branch and bound framework for object localization. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2009.\n[12] T. Lan, Y. Wang, and G. Mori. Discriminative figure-centric models for joint action localization and recognition. ICCV, 2011.\n[13] T. Lan, Y. Wang, W. Yang, and G. Mori. Beyond actions: Discriminative models for contextual group activities. NIPS, 2010.\n[14] Q. Le, W. Zou, S. Yeung, and A. Ng. Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. CVPR, 2011.\n[15] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos. Anomaly detection in crowded scenes. CVPR, 2010.\n[16] M. H. Nguyen, T. Simon, F. De la Torre, and J. Cohn. Action unit detection with segment-based SVMs. CVPR, 2010.\n[17] J. C. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. International Journal of Computer Vision, 2008.\n[18] A. Patron-Perez, M. Marszalek, A. Zisserman, and I. Reid. High five: Recognising human interactions in tv shows. BMVC, 2010.\n[19] N. Ratliff, J. A. Bagnell, and M. Zinkevich. Subgradient methods for maximum margin structured learning. ICML 2006 Workshop on Learning in Structured Output Spaces, 2006.\n[20] M. D. Rodriguez, J. Ahmed, and M. Shah. Action mach: A spatio-temporal maximum average correlation height filter for action recognition. CVPR, 2008.\n[21] B. Taskar, S. Lacoste-Julien, and M. Jordan. Structured prediction via the extragradient method. NIPS, 2005.\n[22] D. Tran and D. Forsyth. Configuration estimates improve pedestrian finding. NIPS, 2007.\n[23] D. Tran and J. Yuan. Optimal spatio-temporal path discovery for video event detection. CVPR, pages 3321–3328, 2011.\n[24] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. 
Large margin methods for structured and interdependent output variables. JMLR, 2005.\n[25] A. Vedaldi and A. Zisserman. Structured output regression for detection with partial truncation. NIPS, 2009.\n[26] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. BMVC, 2009.\n[27] Y. Wang, D. Tran, and Z. Liao. Learning hierarchical poselets for human parsing. CVPR, 2011.\n[28] A. Yao, J. Gall, L. V. Gool, and R. Urtasun. Learning probabilistic non-linear latent variable models for tracking complex activities. NIPS, 2011.\n[29] J. Yuan, Z. Liu, and Y. Wu. Discriminative video pattern search for efficient action detection. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2011.\n", "award": [], "sourceid": 186, "authors": [{"given_name": "Du", "family_name": "Tran", "institution": null}, {"given_name": "Junsong", "family_name": "Yuan", "institution": null}]}