{"title": "Attentional Pooling for Action Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 34, "page_last": 45, "abstract": "We introduce a simple yet surprisingly powerful model to incorporate attention in action recognition and human object interaction tasks. Our proposed attention module can be trained with or without extra supervision, and gives a sizable boost in accuracy while keeping the network size and computational cost nearly the same. It leads to significant improvements over state of the art base architecture on three standard action recognition benchmarks across still images and videos, and establishes new state of the art on MPII dataset with 12.5% relative improvement. We also perform an extensive analysis of our attention module both empirically and analytically. In terms of the latter, we introduce a novel derivation of bottom-up and top-down attention as low-rank approximations of bilinear pooling methods (typically used for fine-grained classification). From this perspective, our attention formulation suggests a novel characterization of action recognition as a fine-grained recognition problem.", "full_text": "Attentional Pooling for Action Recognition\n\nRohit Girdhar\n\nDeva Ramanan\n\nThe Robotics Institute, Carnegie Mellon University\n\nhttp://rohitgirdhar.github.io/AttentionalPoolingAction\n\nAbstract\n\nWe introduce a simple yet surprisingly powerful model to incorporate attention\nin action recognition and human object interaction tasks. Our proposed attention\nmodule can be trained with or without extra supervision, and gives a sizable boost\nin accuracy while keeping the network size and computational cost nearly the\nsame. 
It leads to signi\ufb01cant improvements over state of the art base architecture on\nthree standard action recognition benchmarks across still images and videos, and\nestablishes new state of the art on MPII dataset with 12.5% relative improvement.\nWe also perform an extensive analysis of our attention module both empirically and\nanalytically. In terms of the latter, we introduce a novel derivation of bottom-up\nand top-down attention as low-rank approximations of bilinear pooling methods\n(typically used for \ufb01ne-grained classi\ufb01cation). From this perspective, our attention\nformulation suggests a novel characterization of action recognition as a \ufb01ne-grained\nrecognition problem.\n\n1\n\nIntroduction\n\nHuman action recognition is a fundamental and well studied problem in computer vision. Traditional\napproaches to action recognition relied on object detection [11, 19, 57], articulated pose [29, 34, 35,\n55, 57], dense trajectories [52, 53] and part-based/structured models [9, 56, 58]. However, more\nrecently these methods have been surpassed by deep CNN-based representations [18, 30, 42, 47].\nInterestingly, even video based action recognition has bene\ufb01ted greatly from advancements in image-\nbased CNN models [20, 22, 43, 46]. With the exception of a few 3D-conv-based methods [33, 47, 49],\nmost approaches [12, 14, 15, 17, 54], including the current state of the art [54], use a variant of\ndiscriminatively trained 2D-CNN [22] over the appearance (frames) and in some cases motion (optical\n\ufb02ow) modalities of the input video.\nAttention: While using standard deep networks over the full image have shown great promise for\nthe task [54], it raises the question of whether action recognition can be considered as a general\nclassi\ufb01cation problem. 
Some recent works have tried to generate more \ufb01ne-grained representations\nby extracting features around human pose keypoints [8] or on object/person bounding boxes [18, 30].\nThis form of \u2018hard-coded attention\u2019 helps improve performance, but requires labeling (or detecting)\nobjects or human pose. Moreover, these methods assume that focusing on the human or its parts is\nalways useful for discriminating actions. This might not necessarily be true for all actions; some\nactions might be easier to distinguish using the background and context, like a \u2018basketball shoot\u2019 vs a\n\u2018throw\u2019; while others might require paying close attention to objects being interacted by the human,\nlike in case of \u2018drinking from mug\u2019 vs \u2018drinking from water bottle\u2019.\nOur work: In this work, we propose a simple yet surprisingly powerful network modi\ufb01cation that\nlearns attention maps which focus computation on speci\ufb01c parts of the input relevant to the task at\nhand. Our attention maps can be learned without any additional supervision and automatically lead to\nsigni\ufb01cant improvements over the baseline architecture. Our formulation is simple-to-implement,\nand can be seen as a natural extension of average pooling into a \u201cweighted\u201d average pooling with\nimage-speci\ufb01c weights. Our formulation also provides a novel factorization of attentional processing\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\finto bottom-up saliency combined with top-down attention. We further experiment with adding\nhuman pose as an intermediate supervision to train our attention module, which encourages the\nnetwork to look for human object interactions. 
While this makes little difference to the performance\nof image-based recognition models, it leads to a larger improvement on video datasets as videos\nconsist of a large number of \u2018non-iconic\u2019 frames where the subject or object of actions may not be at\nthe center of focus.\nOur contributions: (1) An easy-to-use extension of state-of-the-art base architectures that incorporates\nattention to give signi\ufb01cant improvement in action recognition performance at virtually\nnegligible increase in computation cost; (2) Extensive analysis of its performance on three action\nrecognition datasets across still images and videos, obtaining state of the art on MPII and HMDB-51\n(RGB, single-frame models) and competitive on HICO; (3) Analysis of different base architectures\nfor applicability of our attention module; and (4) Mathematical analysis of our proposed attention\nmodule and showing its equivalence to a rank-1 approximation of second order or bilinear pooling\n(typically used in \ufb01ne grained recognition methods [16, 26, 28]) suggesting a novel characterization\nof action recognition as a \ufb01ne grained recognition problem.\n\n2 Related Work\n\nHuman action recognition is a well studied problem with various standard benchmarks spanning\nacross still images [7, 13, 34, 36, 58] and videos [24, 27, 41, 45]. The newer image based datasets such\nas HICO [7] and MPII [34] are large and highly diverse, containing 600 and 393 classes respectively.\nIn contrast, collecting such diverse video based action datasets is hard, and hence existing popular\nbenchmarks like UCF101 [45] or HMDB51 [27] contain only 101 and 51 categories respectively. This in\nturn has led to much higher baseline performance on videos, e.g., \u223c 94% [54] classi\ufb01cation accuracy\non UCF101, compared to images, e.g., 
\u223c 32% [30] mean average precision (mAP) on MPII.\nFeatures: Video based action recognition methods focus on two main problems: action classi\ufb01ca-\ntion and (spatio-)temporal detection. While image based recognition problems, including action\nrecognition, have seen a large boost with the recent advancements in deep learning (e.g., MPII\nperformance went up from 5% mAP [34] to 27% mAP [18]), video based recognition still relies\non hand crafted features such as iDT [53] to obtain competitive performance. These features are\ncomputed by extracting appearance and motion features along densely sampled point trajectories in\nthe video, aggregated into a \ufb01xed length representation by using \ufb01sher vectors [32]. Convolutional\nneural network (CNN) based approaches to video action recognition have broadly followed two main\nparadigms: (1) Multi-stream networks [42, 54] which split the input video into multiple modalities\nsuch as RGB, optical \ufb02ow, warped \ufb02ow etc, train standard image based CNNs on top of those, and\nlate-fuse the predictions from each of the CNNs; and (2) 3D Conv Networks [47, 49] which represent\nthe video as a spatio-temporal blob and train a 3D convolutional model for action prediction. In terms\nof performance, 3D conv based methods have been harder to scale and multi-stream methods [54]\ncurrently hold state of the art performance on standard benchmarks. Our approach is complementary\nto these paradigms and the attention module can be applied on top of either. We show results on\nimproving action classi\ufb01cation over state of the art multi-stream model [54] in experiments.\nPose: There have also been previous works in incorporating human pose into action recognition [8,\n10, 60]. 
In particular, P-CNN [8] computes local appearance and motion features along the pose\nkeypoints and aggregates those over the video for action prediction, but is not end-to-end trainable.\nMore recent work [60] adds pose as an additional stream in chained multi-stream fashion and shows\nsigni\ufb01cant improvements. Our approach is complementary to these approaches as we use pose as\na regularizer in learning spatial attention maps to weight regions of the RGB frame. Moreover,\nour method is not constrained by pose labels, and as we show in experiments, can show effective\nperformance with pose predicted by existing methods [4] or even without using pose.\nHard attention: Previous works in image based action recognition have shown impressive\nperformance by incorporating evidence from the human, context and pose keypoint bounding\nboxes [8, 18, 30]. Gkioxari et al. [18] modi\ufb01ed the R-CNN pipeline to propose R*CNN, where they\nchoose an auxiliary box to encode context apart from the human bounding box. Mallya and Lazebnik\n[30] improve upon it by using the full image as the context and using multiple instance learning\n(MIL) to reason over all humans present in the image to predict an action label for the image. Our\napproach gets rid of the bounding box detection step and improves over both these methods by\nautomatically learning to attend to the most informative parts of the image for the task.\nSoft attention: There has been relatively little work that explores unconstrained \u2018soft\u2019 attention for\naction recognition, with the exception of [39, 44] for spatio-temporal and [40] for temporal attention.\nImportantly, all these consider a video setting, where an LSTM network predicts a spatial attention\nmap for the current frame. Our method, however, uses a single frame to both predict and apply\nspatial attention, making it amenable to both single image and video based use cases. 
[44] also uses\npose keypoints labeled in 3D videos to drive attention to parts of the body. In contrast, we learn an\nunconstrained attention model that frequently learns to look around the human body for objects that\nmake it easier to classify the action.\nSecond-order pooling: Because our model uses a single set of appearance features to both predict\nand apply an attention map, this makes the output quadratic in the features (Sec. 3.1). This observation\nallows us to implement attention through second-order or bilinear pooling operations [28], made\nef\ufb01cient through low-rank approximations [16, 25, 26]. Our work is most related to [26], who point\nout when ef\ufb01ciently implemented, low-rank approximations avoid explicitly computing second-order\nfeatures. We point out that a rank-1 approximation of second-order features is equivalent to an\nattentional model sometimes denoted as \u201cself attention\u201d [50]. Exposing this connection allows us\nto explore several extensions, including variations of bottom-up and top-down attention, as well as\nregularized attention maps that make use of additional supervised pose labels.\n\n3 Approach\n\nOur attentional pooling module is a trainable layer that plugs in as a replacement for a pooling opera-\ntion in any standard CNN. As most contemporary architectures [20, 22, 46] are fully convolutional\nwith an average pooling operation at the end, our module can be used to replace that operation with an\nattention-weighted pooling. We now derive the pooling layer as an ef\ufb01cient low-rank approximation\nto second order pooling (Sec. 3.1). Then, we describe our network architecture that incorporates this\nattention module and explore a pose-regularized variant of the same (Sec. 
3.2).\n\n3.1 Attentional pooling as low-rank approximation of second-order pooling\nLet us write the layer to be pooled as X \u2208 R^{n\u00d7f}, where n is the number of spatial locations (e.g.,\nn = 16 \u00d7 16 = 256) and f is the number of channels (e.g., 2048). Standard sum (or max) pooling\nwould reduce this to a vector in R^{f\u00d71}, which could then be processed by a \u201cfully-connected\u201d weight\nvector w \u2208 R^{f\u00d71} to generate a classi\ufb01cation score. We will denote matrices with upper case letters,\nand vectors with lower-case bold letters. For the moment, assume we are training a binary classi\ufb01er\n(we generalize to more classes later in the derivation). We can formalize this pipeline with the\nfollowing notation:\n\nscore_pool(X) = 1^T X w, where X \u2208 R^{n\u00d7f}, 1 \u2208 R^{n\u00d71}, w \u2208 R^{f\u00d71}    (1)\n\nwhere 1 is a vector of all ones and x = 1^T X \u2208 R^{1\u00d7f} is the (transposed) sum-pooled feature.\nSecond-order pooling: Following past work on second-order pooling [5], let us construct the feature\nX^T X \u2208 R^{f\u00d7f}. Prior work has demonstrated that such second-order statistics can be useful for\n\ufb01ne-grained classi\ufb01cation [28]. Typically, one then \u201cvectorizes\u201d this feature, and learns an f^2 vector\nof weights to generate a score. If we write the vector of weights as an f \u00d7 f matrix, the inner product\nbetween the two vectorized quantities can be succinctly written using the trace operator^1. The key\nidentity, Tr(A B^T) = dot(A(:), B(:)) (using matlab notation), can easily be veri\ufb01ed by plugging in\nthe de\ufb01nition of a trace operator. This allows us to write the classi\ufb01cation score as follows:\n\nscore_order2(X) = Tr(X^T X W^T), where X \u2208 R^{n\u00d7f}, W \u2208 R^{f\u00d7f}    (2)\n\n^1 https://en.wikipedia.org/wiki/Trace_(linear_algebra)\n\nLow-rank second-order pooling: Let us approximate matrix W with a rank-1 approximation,\nW = a b^T where a, b \u2208 R^{f\u00d71}. 
Plugging this into the above yields a novel formulation of attentional\npooling:\n\nscore_attention(X) = Tr(X^T X b a^T), where X \u2208 R^{n\u00d7f}, a, b \u2208 R^{f\u00d71}    (3)\n                   = Tr(a^T X^T X b)    (4)\n                   = a^T X^T X b    (5)\n                   = a^T (X^T (X b))    (6)\n\nwhere (4) makes use of the trace identity that Tr(ABC) = Tr(CAB) and (5) uses the fact that the\ntrace of a scalar is simply the scalar. The last line (6) gives an ef\ufb01cient implementation of attentional\npooling: given a feature map X, compute an attention map over all n spatial locations with h =\nX b \u2208 R^{n\u00d71}, that is then used to compute a weighted average of features x = X^T h \u2208 R^{f\u00d71}. This\nweighted-average feature is then pushed through a linear model a^T x to produce the \ufb01nal score.\nInterestingly, (6) can also be written as the following:\n\nscore_attention(X) = ((X a)^T X) b    (7)\n                   = (X a)^T (X b)    (8)\n\nThe \ufb01rst line illustrates that the attentional heatmap can also be seen as X a \u2208 R^{n\u00d71}, with b being\nthe classi\ufb01er of the attentionally-pooled feature. The second line illustrates that our formulation is\nin fact symmetric, where the \ufb01nal score can be seen as the inner product between two attentional\nheatmaps de\ufb01ned over all n spatial locations. Fig. 1a illustrates our approach.\nTop-down attention: To generate prediction for multiple classes, we replace the weight matrix from\n(2) with class-speci\ufb01c weights:\n\nscore_order2(X, k) = Tr(X^T X W_k^T), where X \u2208 R^{n\u00d7f}, W_k \u2208 R^{f\u00d7f}    (9)\n\nOne could apply a similar derivation to produce class-speci\ufb01c vectors a_k and b_k, each of them\ngenerating a class-speci\ufb01c attention map. Instead, we choose to distinctly model class-speci\ufb01c\n\u201ctop-down\u201d attention [3, 48, 59] from bottom-up visual saliency that is class-agnostic [37]. 
We do so\nby forcing one of the attention parameter vectors to be class-agnostic - e.g., b_k = b. This makes our\n\ufb01nal low-rank attentional model\n\nscore_attention(X, k) = t_k^T h, where t_k = X a_k, h = X b    (10)\n\nequivalent to an inner product between top-down (class-speci\ufb01c) t_k and bottom-up (saliency-based) h\nattention maps. Our approach of combining top-down and bottom-up attentional maps is reminiscent\nof biologically-motivated schemes that modulate saliency maps with top-down cues [31]. This\nsuggests that our attentional model can also be implemented using a single, combined attention map\nde\ufb01ned over all n spatial locations:\n\nscore_attention(X, k) = 1^T c_k, where c_k = t_k \u25e6 h    (11)\n\nwhere \u25e6 denotes element-wise multiplication and 1 is de\ufb01ned as before. We visualize the combined,\ntop-down, and bottom-up attention maps c_k, t_k, h \u2208 R^{n\u00d71} in our experimental results.\nAverage pooling (revisited): The above derivation allows us to revisit our average pooling formulation\nfrom (1), replacing weights w with class-speci\ufb01c weights w_k as follows:\n\nscore_top-down(X, k) = 1^T X w_k = 1^T t_k, where t_k = X w_k    (12)\n\nFrom this perspective, the above derivation gives the ability to generate top-down attentional maps\nfrom existing average-pooling networks. While similar observations have been pointed out before [59],\nit naturally emerges as a special case of our bottom-up and top-down formulation of attention.\n\n3.2 Network Architecture\n\nWe now describe our network architecture to implement the attentional pooling described above. We\nstart from a state of the art base architecture, ResNet-101 [20]. It consists of a stack of \u2018modules\u2019, each\nof which contains multiple convolutional, pooling or identity mapping streams. 
It \ufb01nally generates an\nn_1 \u00d7 n_2 \u00d7 f spatial feature map, which is average pooled to get an f-dimensional vector and is then\nclassi\ufb01ed using a linear classi\ufb01er.\n\n(a) Visualization of our approach to attentional pooling as a rank-1 approximation\nof 2nd order pooling. By judicious ordering of the matrix multiplications,\none can avoid computing the second order feature X^T X and instead\ncompute the product of two attention maps. The top-down attentional map\nis computed using class-speci\ufb01c weights a_k, while the bottom-up map is\ncomputed using class-agnostic weights b. We visualize the top-down and\nbottom-up attention maps learned by our approach in Fig. 2.\n\n(b) We explore two architectures\nin our work, explained in Sec. 3.2.\n\nFigure 1: Visualization of our derivation and \ufb01nal network architectures.\n\nOur attention module plugs in at the last layer, after the spatial feature map. As shown in Fig. 1b\n(Method 1), we predict a single channel bottom-up saliency map of the same spatial resolution as the last\nfeature map, using a linear classi\ufb01er on top of it (X b). Similarly, we also generate the n_1 \u00d7 n_2 \u00d7 K\ndimensional top-down attention map X a, where K is the number of classes. The two attention maps are\nmultiplied and spatially averaged to generate the K-dimensional output predictions ((X a)^T (X b)).\nThese operations are equivalent to \ufb01rst multiplying the features with saliency (X^T (X b)) and then\npassing through a classi\ufb01er (a^T (X^T (X b))).\nPose: While this unconstrained attention module automatically learns to focus on relevant parts and\ngives a sizable boost in accuracy, we take inspiration from previous work [8] and use human pose\nkeypoints to guide the attention. As shown in Fig. 1b (Method 2), we use a two-layer MLP on top\nof the last layer to predict a 17 channel heatmap. 
The \ufb01rst 16 channels correspond to human pose\nkeypoints and incur an l2 loss against labeled (or detected, using [4]) pose. The \ufb01nal channel is used\nas an unconstrained bottom-up attention map, as before. We refer to this method as pose-regularized\nattention, and it can be thought of as a non-linear extension of the previous attention map.\n\n4 Experiments\n\nDatasets: We experiment with three recent, large scale action recognition datasets, across still images\nand videos, namely MPII, HICO and HMDB51. MPII Human Pose Dataset [34] contains 15205\nimages labeled with up to 16 human body keypoints, and classi\ufb01ed into one of 393 action classes. It\nis split into train, val (from authors of [18]) and test sets, with 8218, 6987 and 5708 images respectively. We\nuse the val set to compare with [18] and for ablative analysis while the \ufb01nal test results are obtained\nby emailing our results to authors of [34]. The dataset is highly imbalanced and the evaluation is\nperformed using mean average precision (mAP) to equally weight all classes. HICO [7] is a recently\nintroduced dataset with labels for 600 human object interactions (HOI) combining 117 actions with\n80 objects. It contains 38116 training and 9658 test images, with each image labeled with all the\nHOIs active for that image (multi-label setting). Like MPII, this dataset is also highly unbalanced and\nevaluation is performed using mAP over classes. Finally, to verify our method\u2019s applicability to video\nbased action recognition, we experiment with a challenging trimmed action classi\ufb01cation dataset,\nHMDB51 [27]. It contains 6766 realistic and varied video clips from 51 action classes. 
Evaluation is\nperformed using average classi\ufb01cation accuracy over three train/test splits from [23], each with 3570\ntrain and 1530 test videos.\nBaselines: Throughout the following sections, we compare our approach \ufb01rst to the standard base\narchitecture, mostly ResNet-101 [20], without the attention-weighted pooling. Then we compare to\nother reported methods and previous state of the art on the respective datasets.\nMPII: We train our models for 393-way action classi\ufb01cation on MPII with softmax cross-entropy\nloss for both the baseline ResNet and our attentional model. We compare our performance in Tab. 1.\nOur unconstrained attention model clearly out-performs the base ResNet model, as well as previous\nstate of the art methods involving detection of multiple contextual bounding boxes [18] and fusion\nof full image with human bounding box features [30]. Our pose-regularized model performs best,\nthough the improvement is small. We visualize the attention maps learned in Fig. 2.\nHICO: We train our model on HICO similar to MPII, and compare our performance in Tab. 2.\nAgain, we see a signi\ufb01cant 5% boost over our base ResNet model. Moreover, we out-perform all\n\n[Figure 1 diagram labels: 2nd order pooling; Bottom-up Saliency; Top-down Attention; Method 1: Attention, softmax x-entropy loss; Method 2: Pose Reg. Attention, pose l2 loss]\n\nFigure 2: Auto-generated (not hand-picked) visualization of bottom-up (X b), top-down (X a_k) and combined\n((X a_k) \u25e6 (X b)) attention on validation images in MPII that see the largest improvement in softmax score for\nthe correct class when trained with attention. Since the top-down/combined maps are class speci\ufb01c, we mention the\nclass name for which they are generated on the top left of those heatmaps. 
We consider 2 classes, the ground\ntruth (GT) for the image, and the class on which it gets lowest softmax score. The attention maps for GT class\nfocus on the objects most useful for distinguishing the class. Though the top-down and combined maps look\nsimilar in many cases, they do capture different information. For example, for a garbage collector action (second\nrow), top-down also focuses on the vehicles in background, while the combined map narrows focus down to the\ngarbage bags. (Best viewed zoomed-in on screen)\n\n[Figure 2 column labels: Test Image; Bottom Up; GT Class: Top Down, Combined; Other Class: Top Down, Combined]\n\nFigure 3: We crop a 100px patch around the attention peak for all images containing an HOI involving a given\nobject, and show 5 randomly picked patches for 6 object classes here. This suggests our attention model learns\nto look for objects to improve HOI detection.\n\n[Figure 3 object classes: suitcase, donut, bird, hot dog, sports ball, laptop]\n\nTable 1: Action classi\ufb01cation performance on MPII dataset. Validation (Val) performance is reported on train set\nsplit shared by authors of [18]. Test performance obtained from training on complete train set and submitting our\noutput \ufb01le to authors of [34]. Note that even though our pose regularized model uses pose labels at training time\nfor regularizing attention, it does not require any pose input at test time. The top-half corresponds to a diagnostic\nanalysis of our approach with different base networks. 
Attention provides a strong 4% improvement for baseline\nnetworks with larger spatial resolution (e.g., ResNet). Please see text for additional discussion. The bottom-half\nreports prior work that makes use of object bounding boxes/pose. Our method performs slightly better with pose\nannotations (on training data), but even without any pose or detection annotations, we outperform all prior work.\n\nMethod\nInception-V2 (ours)\nResNet101 (ours)\nAttn. Pool. (I-V2) (ours)\nAttn. Pool. (R-101) (ours)\nDense Trajectory + Pose [34]\nVGG16, RCNN [18]\nVGG16, R*CNN [18]\nVGG16, Fusion (best) [30]\nVGG16, Fusion+MIL (best) [30]\nPose Reg. Attn. Pooling (R-101) (ours)\n\n(cid:88)\n(cid:88)\n(cid:88)\n(cid:88)\n(cid:88)\n\n(cid:88)\n(cid:88)\n(cid:88)\n\nFull Img Bbox\n\nPose MIL Val (mAP)\n25.2\n26.2\n24.3\n30.3\n-\n16.5\n21.7\n-\n-\n30.6\n\n(cid:88)\n\n(cid:88)\n\n(cid:88)\n\nTest (mAP)\n-\n-\n-\n36.0\n5.5\n-\n26.7\n32.2\n31.9\n36.1\n\n(cid:88)\n(cid:88)\n(cid:88)\n(cid:88)\n\nTable 2: Multi-label HOI classi\ufb01cation performance on HICO dataset. The top-half compares our performance\nto other full image-based methods. The bottom-half reports methods that use object bounding boxes/pose. Our\nmodel out-performs various approaches that need bounding boxes, multi-instance learning (MIL) or specialized\nlosses, and achieves performance competitive to state of the art. Note that even though our pose regularized\nmodel uses computed pose labels at training time, it does not require any pose input at test time.\n\nMethod\nAlexNet+SVM [7]\nVGG16, full image [30]\nResNet101, full image (ours)\nResNet101 with CBP [16] (impl. from [1])\nAttentional Pooling (R-101) (ours)\nR*CNN [18] (reported in [30])\nScene-RCNN [18] (reported in [30])\nFusion (best reported) [30]\nPose Regularized Attentional Pooling (R101) (ours)\nFusion, weighted loss (best reported) [30]\n\n(cid:88)\n(cid:88)\n(cid:88)\n(cid:88)\n(cid:88)\n\nFull Im. 
Bbox/Pose MIL Wtd Loss mAP\n19.4\n29.4\n30.2\n26.8\n35.0\n28.5\n29.0\n33.8\n34.6\n36.1\n\n(cid:88)\n(cid:88)\n(cid:88)\n(cid:88)\n(cid:88)\n\n(cid:88)\n(cid:88)\n(cid:88)\n(cid:88)\n\n(cid:88)\n\n(cid:88)\n(cid:88)\n(cid:88)\n\n(cid:88)\n\nprevious methods, including ones that use detection bounding boxes at test time except one [30],\nwhen that is trained with a specialized weighted loss for this dataset. It is also worth noting that\nthe full image-only performance of VGG and ResNet were comparable in our experiments (29.4%\nand 30.2%), suggesting that our approach shows larger relative improvement over a similar starting\nbaseline. Though we did not experiment with the same optimization setting as [30], we believe it\nwill give similar improvements there as well. Since this dataset also comes with labels decomposed\ninto actions and objects, we visualize what our attention model looks for, given images containing\ninteractions with a speci\ufb01c object. As Fig. 3 shows, the attention peak is typically close to the object\nof interest, showing the importance of detecting objects in HOI detection tasks. Moreover, this\nsuggests that our attention maps can also function as weak-supervision for object detection.\nHMDB51: Next, we apply our attentional method to the RGB stream of the current state of the\nart single-frame deep model on this dataset, TSN [54]. TSN extends the standard two-stream [42]\narchitecture by using a much deeper base architecture [22] along with enforcing consensus over\nmultiple frames from the video at training time. For the purpose of this work, we focus on the\nRGB stream only but our method is applicable to \ufb02ow/warped-\ufb02ow streams as well. We \ufb01rst train\na TSN model using ResNet-101 as base architecture after re-sizing input frames to 450px. This\nensures larger spatial dimensions of the output (14 \u00d7 14), hence ensuring the last-layer features\nare amenable to attention. 
Though our base ResNet model does worse than the BN-inception TSN\nmodel, as Tab. 3 shows, using our attention module improves the base model to do comparably\nwell. Interestingly, on this dataset regularizing the attention through pose gives a signi\ufb01cant boost in\nperformance, out-performing TSN and establishing new state of the art on the RGB-only single-frame\nmodel for HMDB.\n\nTable 3: Action classi\ufb01cation performance on HMDB51 dataset using only the RGB stream of a two-stream\nmodel. Our base ResNet stream training is done over 480px rescaled images, same as used in our attention\nmodel for comparison purposes. Our pose based attention model out-performs the base network by a large margin,\nas well as the previous RGB stream (single-frame) state-of-the-art, TSN [54].\n\nMethod | Split 1 | Split 2 | Split 3 | Avg\nTSN, BN-inception (RGB) [54] (Via email with authors) | 54.4 | 49.5 | 49.2 | 51.0\nActionVLAD [17] | 51.2 | - | - | 49.8\nRGB Stream, ResNet50 (RGB) [14] (reported at [2]) | - | - | - | 48.9\nRGB Stream, ResNet152 (RGB) [14] (reported at [2]) | - | - | - | 46.7\nTSN, ResNet101 (RGB) (ours) | 48.2 | 46.5 | 46.7 | 47.1\nLinear Attentional Pooling (ours) | 51.1 | 51.6 | 49.7 | 50.8\nPose regularized Attentional Pooling (ours) | 54.4 | 51.1 | 50.9 | 52.2\n\nFigure 4: Attention maps with linear attention and pose regularized attention on a video from HMDB. Note the\npose-guided attention is better able to focus on regions of interest in the non-iconic frames.\n\nWe visualize the attention maps with normal and pose-regularized attention in\nFig. 4. The pose-regularized attention maps are more peaked near the human than their linear counterparts.\nThis potentially explains the improvement using pose on HMDB while it does not help as much on\nHICO or MPII; HICO and MPII, being image based datasets, typically have \u2018iconic\u2019 images, with the\nsubjects and objects of action typically in the center and focus of the image. 
Video frames in HMDB,\non the other hand, may have the subject move all across the frame throughout the video, and hence\nadditional supervision through pose at training time helps focus the attention at the right spot.\nFull-rank pooling: Given our formulation of attention as low-rank second-order pooling, a natural\nquestion is what would be the performance of a full-rank model? Explicitly computing the second-\norder features of size f \u00d7 f for f = 2048 (and learning the associated classi\ufb01er) is cumbersome.\nInstead, we make use of the compact bilinear approach (CBP) of [16], which generates a low-\ndimensional approximation of full bilinear pooling [28] using the TensorSketch algorithm. To keep\nthe \ufb01nal output comparable to our attentional-pooled model, we project to f = 2048 dimensions.\nWe \ufb01nd it performs slightly worse than simple average pooling in Table 2. Note that we use an\nexisting implementation [1] with minimal hyper-parameter optimization, and leave a more rigorous\ncomparison to future work.\nRank-P approximation: While a full-rank model is cumbersome, we can still explore the effect of\nusing a higher, P -rank approximation. Essentially, a rank-P approximation generates P (1-channel)\nbottom-up and (C channel) top-down attention maps, and the \ufb01nal prediction is the product of\ncorresponding heatmaps, summed over P . On MPII, we obtain mAP of 30.3, 29.9, 30.0 for P =1,\n2 and 5 respectively, showing that the validation performance is relatively stable with P . We do\nobserve a drop in training loss with a higher P , indicating that a higher-rank approximation could be\nuseful for harder datasets and tasks.\nPer-class attention maps: As we described in Sec. 3.1, our inspiration for combining class-speci\ufb01c\nand class-agnostic classi\ufb01ers (i.e. top-down and bottom-up attention respectively), came from the\nNeuroscience literature on integrating top-down and bottom-up attention [31]. 
However, our model can also be extended to learn completely class-specific attention maps, by predicting C bottom-up\nattention maps, and combining each map with the corresponding softmax classifier for that class. We\nexperiment with this idea on MPII and obtain a mAP of 27.9% with 393 (=num-classes) attention maps,\ncompared to 30.3% with 1 map, and 26.2% without attention. On further analysis we observe that\nboth models achieve near perfect mAP on training data, implying that adding more parameters with\nmultiple attention maps leads to over-fitting on the relatively small MPII trainset. However, this may\nbe a viable approach for larger datasets.\nDiagnostics: It is natural to consider variants of our model that only consider the bottom-up or\ntop-down attentional map. As derived in (12), baseline models with average pooling are equivalent to\n\u201ctop-down-only\u201d attention models, which are resoundingly outperformed by our joint bottom-up and\ntop-down model. It is not clear how to construct a bottom-up only model, since it is class-agnostic,\nmaking it difficult to produce class-specific scores. Rather, a reasonable approximation might be\nto apply an off-the-shelf (bottom-up) saliency method to limit the spatial region that features are\naveraged over. Our initial experiments with existing saliency-based methods [21] were not promising.\nBase Network: Finally, we analyze how the choice of base architecture affects the effectiveness of our\nproposed attentional pooling module. In Tab. 1, we compare the improvement using attention over\nResNet-101 (R-101) [20] and a BN-Inception (I-V2) [22]. Both models perform comparably when\ntrained on the full image. However, while we see a 4% improvement on R-101 when using attention, we do\nnot see similar improvements for I-V2.\n
This points to an important distinction in the two architectures,\ni.e., Inception-style models are designed to be faster in inference and training by rapidly\ndownsampling input images in initial layers through max-pooling. While this reduces the computational\ncost for later layers, it leads to most layers having very large receptive fields, and hence later neurons\nhave effective access to all of the image pixels. This suggests that all the spatial features at the last\nlayer could be highly similar. In contrast, R-101 downscales the spatial resolution gradually, allowing\nthe last layer features to specialize to different parts of the image, hence benefiting more from\nattentional pooling. This effect was further corroborated by our experiments on HMDB, where using\nthe standard 224px input resolution showed no improvement with attention, while the same image\nresized to 450px at input time did. This initial resize ensures the last-layer features are sufficiently\ndistinct to benefit from attentional pooling.\n\n5 Discussion and Conclusion\n\nAn important distinction of our model from some previous works [18, 30] is that it does not explicitly\nmodel action at an instance or bounding-box level. This, in fact, is a strength of our model, making it\ncapable of attending to objects outside of any person-instance bounding box (such as bags of garbage\nfor \u201cgarbage collecting\u201d, in Fig. 2). In theory, our model can also be applied to instance-level action\nrecognition by applying attentional pooling over an instance\u2019s RoI features. Such a model would learn\nto look at different parts of the human body and its interactions with nearby objects. However, it\u2019s notable\nthat most existing action datasets, including [6, 7, 27, 34, 41, 45], come with only frame or video\nlevel labels; and though [18, 30] are designed for instance-level recognition, they are not applied\nas such.\n
They either copy image level labels to instances or use multiple-instance learning, either\nof which can be used in conjunction with our model. Another interesting connection that emerges\nfrom our work is the relation between second-order pooling and attention. The two communities are\ntraditionally seen as distinct, and our work strongly suggests that they should mix: as newer action\ndatasets become more fine-grained, we should explore second-order pooling techniques for action\nrecognition. Similarly, second-order pooling can serve as a simple but strong baseline for the attention\ncommunity, which tends to focus on more complex sequential attention networks (based on RNNs or\nLSTMs). It is also worth noting that similar ideas involving self attention and bilinear models have\nrecently also shown significant improvements in other tasks like image classification [51], language\ntranslation [50] and visual question answering [38].\nConclusion: We have introduced a simple formulation of attention as low-rank second-order pooling,\nand illustrated it on the task of action classification from single (RGB) images. Our formulation\nallows for explicit integration of bottom-up saliency and top-down attention, and can take advantage\nof additional supervision when needed (through pose labels). Our model produces competitive or\nstate-of-the-art results on widely benchmarked datasets, by learning where to look when pooling\nfeatures across an image. Finally, it is easy to implement and requires few additional parameters,\nmaking it an attractive alternative to standard pooling, which is a ubiquitous operation in nearly all\ncontemporary deep networks.\n\nAcknowledgements: The authors would like to thank Olga Russakovsky for an initial review. This\nresearch was supported in part by the National Science Foundation (NSF) under grant numbers\nCNS-1518865 and IIS-1618903, and the Defense Advanced Research Projects Agency (DARPA) under Contract No.\n
HR001117C0051. Additional support was provided by the Intel Science and\nTechnology Center for Visual Cloud Systems (ISTC-VCS). Any opinions, findings, conclusions or\nrecommendations expressed in this material are those of the authors and do not necessarily reflect the\nview(s) of their employers or the above-mentioned funding sources.\n\nReferences\n[1] Compact bilinear pooling implementation. https://github.com/ronghanghu/tensorflow_compact_bilinear_pooling.\n[2] Convolutional two-stream network fusion for video action recognition. http://www.robots.ox.ac.uk/~vgg/software/two_stream_action/.\n[3] F. Baluch and L. Itti. Mechanisms of top-down attention. Trends in Neurosciences, 2011.\n[4] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.\n[5] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. In ECCV, 2012.\n[6] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017.\n[7] Y.-W. Chao, Z. Wang, Y. He, J. Wang, and J. Deng. Hico: A benchmark for recognizing human-object interactions in images. In ICCV, 2015.\n[8] G. Ch\u00e9ron, I. Laptev, and C. Schmid. P-CNN: Pose-based CNN Features for Action Recognition. In ICCV, 2015.\n[9] V. Delaitre, I. Laptev, and J. Sivic. Recognizing human actions in still images: a study of bag-of-features and part-based representations. In BMVC, 2010.\n[10] V. Delaitre, J. Sivic, and I. Laptev. Learning person-object interactions for action recognition in still images. In NIPS, 2011.\n[11] C. Desai, D. Ramanan, and C. Fowlkes. Discriminative models for static human-object interactions. In CVPR-Workshops, 2010.\n[12] J. Donahue, L. A. Hendricks, S. Guadarrama, S. V. M. Rohrbach, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.\n[13] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. IJCV, 2010.\n[14] C. Feichtenhofer, A. Pinz, and R. Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, 2016.\n[15] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.\n[16] Y. Gao, O. Beijbom, N. Zhang, and T. Darrell. Compact bilinear pooling. In CVPR, 2016.\n[17] R. Girdhar, D. Ramanan, A. Gupta, J. Sivic, and B. Russell. ActionVLAD: Learning spatio-temporal aggregation for action classification. In CVPR, 2017.\n[18] G. Gkioxari, R. Girshick, and J. Malik. Contextual action recognition with R*CNN. In ICCV, 2015.\n[19] A. Gupta, A. Kembhavi, and L. S. Davis. Observing human-object interactions: Using spatial and functional compatibility for recognition. PAMI, 2009.\n[20] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CVPR, 2016.\n[21] X. Huang, C. Shen, X. Boix, and Q. Zhao. SALICON: Reducing the semantic gap in saliency prediction by adapting deep neural networks. In ICCV, 2015.\n[22] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, 2015.\n[23] Y. Jiang, J. Liu, A. Roshan Zamir, I. Laptev, M. Piccardi, M. Shah, and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes. http://www.thumos.info/, 2013.\n[24] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.\n[25] J.-H. Kim, K. W. On, W. Lim, J. Kim, J.-W. Ha, and B.-T. Zhang. Hadamard Product for Low-rank Bilinear Pooling. In ICLR, 2017.\n[26] S. Kong and C. Fowlkes. Low-rank bilinear pooling for fine-grained classification. In CVPR, 2017.\n[27] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In ICCV, 2011.\n[28] T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. In ICCV, 2015.\n[29] S. Maji, L. Bourdev, and J. Malik. Action recognition from a distributed representation of pose and appearance. In CVPR, 2011.\n[30] A. Mallya and S. Lazebnik. Learning models for actions and person-object interactions with transfer to question answering. In ECCV, 2016.\n[31] V. Navalpakkam and L. Itti. An integrated model of top-down and bottom-up attention for optimizing detection speed. In CVPR, 2006.\n[32] F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007.\n[33] A. Piergiovanni, C. Fan, and M. S. Ryoo. Learning latent sub-events in activity videos using temporal attention filters. In AAAI, 2017.\n[34] L. Pishchulin, M. Andriluka, and B. Schiele. Fine-grained activity recognition with holistic and pose based features. In GCPR, 2014.\n[35] D. Ramanan and D. A. Forsyth. Automatic annotation of everyday movements. In NIPS, 2003.\n[36] M. Ronchi and P. Perona. Describing common human visual actions in images. In BMVC, 2015.\n[37] U. Rutishauser, D. Walther, C. Koch, and P. Perona. Is bottom-up attention useful for object recognition? In CVPR, 2004.\n[38] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap. A simple neural network module for relational reasoning. arXiv preprint arXiv:1706.01427, 2017.\n[39] S. Sharma, R. Kiros, and R. Salakhutdinov. Action recognition using visual attention. ICLR-Workshops, 2016.\n[40] Y. Shi, Y. Tian, Y. Wang, and T. Huang. Joint network based attention for action recognition. arXiv preprint arXiv:1611.05215, 2016.\n[41] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In ECCV, 2016.\n[42] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.\n[43] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.\n[44] S. Song, C. Lan, J. Xing, W. Zeng, and J. Liu. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In AAAI, 2017.\n[45] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CRCV-TR-12-01, 2012.\n[46] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, inception-resnet and the impact of residual connections on learning. 2016.\n[47] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015.\n[48] S. Ullman. Visual routines. Cognition, 1984.\n[49] G. Varol, I. Laptev, and C. Schmid. Long-term temporal convolutions for action recognition. CoRR, abs/1604.04494, 2016.\n[50] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.\n[51] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. Residual attention network for image classification. In CVPR, 2017.\n[52] H. Wang, A. Kl\u00e4ser, C. Schmid, and L. Cheng-Lin. Action Recognition by Dense Trajectories. In CVPR, 2011.\n[53] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.\n[54] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.\n[55] W. Yang, Y. Wang, and G. Mori. Recognizing human actions from still images with latent poses. In CVPR, 2010.\n[56] B. Yao and L. Fei-Fei. Grouplet: A structured image representation for recognizing human and object interactions. In CVPR, 2010.\n[57] B. Yao and L. Fei-Fei. Modeling mutual context of object and human pose in human-object interaction activities. In CVPR, 2010.\n[58] B. Yao, X. Jiang, A. Khosla, A. Lin, L. Guibas, and L. Fei-Fei. Human action recognition by learning bases of action attributes and parts. In ICCV, 2011.\n[59] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In CVPR, 2016.\n[60] M. Zolfaghari, G. L. Oliveira, N. Sedaghat, and T. Brox. Chained Multi-stream Networks Exploiting Pose, Motion, and Appearance for Action Classification and Detection. In ICCV, 2017.\n", "award": [], "sourceid": 47, "authors": [{"given_name": "Rohit", "family_name": "Girdhar", "institution": "Carnegie Mellon University"}, {"given_name": "Deva", "family_name": "Ramanan", "institution": "Carnegie Mellon University"}]}