{"title": "LiteEval: A Coarse-to-Fine Framework for Resource Efficient Video Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 7780, "page_last": 7789, "abstract": "This paper presents LiteEval, a simple yet effective coarse-to-fine framework for resource efficient video recognition, suitable for both online and offline scenarios. Exploiting decent yet computationally efficient features derived at a coarse scale with a lightweight CNN model, LiteEval dynamically decides on-the-fly whether to compute more powerful features for incoming video frames at a finer scale to obtain more details. This is achieved by a coarse LSTM and a fine LSTM operating cooperatively, as well as a conditional gating module to learn when to allocate more computation. Extensive experiments are conducted on two large-scale video benchmarks, FCVID and ActivityNet, and the results demonstrate LiteEval requires substantially less computation while offering excellent classification accuracy for both online and offline predictions.", "full_text": "LiteEval: A Coarse-to-Fine Framework for Resource\n\nEf\ufb01cient Video Recognition\n\nZuxuan Wu1\u2217, Caiming Xiong2, Yu-Gang Jiang3, Larry S. Davis1\n1 University of Maryland, 2 Salesforce Research, 3 Fudan University\n\nAbstract\n\nThis paper presents LiteEval, a simple yet effective coarse-to-\ufb01ne framework for\nresource ef\ufb01cient video recognition, suitable for both online and of\ufb02ine scenarios.\nExploiting decent yet computationally ef\ufb01cient features derived at a coarse scale\nwith a lightweight CNN model, LiteEval dynamically decides on-the-\ufb02y whether\nto compute more powerful features for incoming video frames at a \ufb01ner scale to\nobtain more details. This is achieved by a coarse LSTM and a \ufb01ne LSTM operating\ncooperatively, as well as a conditional gating module to learn when to allocate\nmore computation. 
Extensive experiments are conducted on two large-scale video benchmarks, FCVID and ActivityNet, and the results demonstrate LiteEval requires substantially less computation while offering excellent classification accuracy for both online and offline predictions.

1 Introduction

Convolutional neural networks (CNNs) have demonstrated stunning progress in several computer vision tasks like image classification [11, 39, 14], object detection [28, 10], video classification [34, 33], etc., sometimes even surpassing human-level performance [11] when recognizing fine-grained categories. The astounding performance of CNN models, while making them appealing for deployment in many practical applications such as autonomous vehicles, navigation robots and image recognition services, results from complicated model design, which in turn limits their use in real-world scenarios that are often resource-constrained. To remedy this, extensive studies have been conducted to compress neural networks [2, 26, 20] and design compact architectures suitable for mobile devices [13, 16]. However, they produce one-size-fits-all models that require the same amount of computation for all samples.
Although computationally efficient models usually exhibit good accuracy when recognizing the majority of samples, computationally expensive models, if not ensembles of models, are needed to additionally recognize corner cases that lie in the tail of the data distribution, offering top-notch performance on standard benchmarks like ImageNet [3] and COCO [21]. In addition to network design, the computational cost of CNNs is directly affected by input resolution: 74% of computation can be saved (measured by floating point operations) when evaluating a ResNet-101 model on images with half of the original resolution, while still offering reasonable accuracy.
Motivated by these observations, a natural question arises: can we have a network with components of different complexity operating on different scales, and derive policies conditioned on inputs to switch among these components to save computation? Intuitively, during inference, lightweight modules are run by default to recognize easy samples (e.g., images with canonical views) with coarse-scale inputs, and high-precision components will be activated to further obtain finer details to recognize hard samples (e.g., images with occlusion). This is conceptually similar to human perception systems, where we pay more attention to complicated scenes while a glance would suffice for most objects.

* Part of the work was done when the author was an intern at Salesforce Research.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: An overview of the proposed framework. At each time step, coarse features, computed with a lightweight CNN, together with historical information are used to determine whether to examine the current frame more carefully. If further inspection is needed, fine features are derived to update the fine LSTM; otherwise the two LSTMs are synchronized. See text for more details.

In this spirit, we explore the problem of dynamically allocating computational resources for video recognition. We consider resource-constrained video recognition for two reasons: (1) Videos are more computationally demanding compared to images. Thus, video recognition systems should be resource efficient, since computation is a direct indicator of energy consumption, which should be minimized to be cost-effective and eco-friendly; additionally, power consumption directly affects the battery life of embedded systems. (2) Videos exhibit large variations in the computation required to be correctly labeled.
For instance, for videos that depict static scenes (e.g., \u201criver\u201d or \u201cdesert\u201d) or centered objects\n(e.g., \u201cgorilla\u201d or \u201cpanda\u201d), viewing a single frame already gives high con\ufb01dence, while one needs to\nsee more frames in order to distinguish \u201cmaking latte\u201d from \u201cmaking cappuccino\u201d. Further, frames\nneeded to predict the label of a video clip not only differ among different classes but also within\nthe same category. For example, for many sports actions like \u201crunning\u201d and \u201cplaying football\u201d,\nprofessionally recorded videos with less camera motion are more easily recognized compared to\nuser-generated videos using hand-held devices or wearable cameras.\nWe introduce LITEEVAL, a resource-ef\ufb01cient framework suitable for both online and of\ufb02ine video\nclassi\ufb01cation, which adaptively assigns computational resources to incoming video frames. In\nparticular, LITEEVAL is a coarse-to-\ufb01ne framework that uses coarse information for economical\nevaluation while only requiring \ufb01ne clues when necessary. It consists of a coarse LSTM operating on\nfeatures extracted from downsampled video frames using a lightweight CNN, a \ufb01ne LSTM whose\ninputs are features from images of a \ufb01ner scale using a more powerful CNN, as well as a gating\nmodule to dynamically decide the granularity of features to use. Given a stream of video frames, at\neach time step, LITEEVAL computes coarse features from the current frame and updates the coarse\nLSTM to accumulate information over time. Then, conditioned on the coarse features and historical\ninformation, the gating module determines whether to further compute \ufb01ne features to obtain more\ndetails from the current frame. 
If further analysis is needed, fine features are computed and input into the fine LSTM for temporal modeling; otherwise hidden states from the coarse LSTM are synchronized with those of the fine LSTM such that the fine LSTM contains all information seen so far and can be readily used for prediction. Finally, LITEEVAL proceeds to the next frame. Such a recurrent and efficient way of processing video frames allows LITEEVAL to be used in both online and offline scenarios. See Figure 1 for an overview of the framework.
We conduct extensive experiments on two large-scale video datasets for generic video classification (FCVID [18]) and activity recognition (ACTIVITYNET [12]) under both online and offline settings. For offline predictions, we demonstrate that LITEEVAL achieves accuracies that are on par with the strong and popular uniform sampling strategy while requiring 51.8% and 51.3% less computation, and it also outperforms efficient video recognition approaches in the recent literature [41, 4]. We also show that LITEEVAL can be effectively used for online video predictions to accommodate different computational budgets. Furthermore, qualitative results suggest the learned fine feature usage policies not only correspond to the difficulty of making predictions (i.e., easier samples require fewer fine features) but also can reflect salient parts in videos when recognizing a class of interest.

2 Approach

LITEEVAL consists of a coarse LSTM and a fine LSTM that are organized hierarchically, taking in visual information at different granularities, as well as a conditional gating module governing the switching between different feature scales.
In particular, given a stream of video frames, the goal of LITEEVAL is to learn a policy that determines at each time step whether to examine the incoming video frame carefully with discriminative yet computationally expensive features, conditioned on a quick glance of the frame with economical features computed at a coarse scale and on historical information. LITEEVAL operates on coarse information by default and is expected to take in fine details infrequently, reducing overall computational cost while maintaining recognition accuracy. In the following, we introduce each component in our framework in detail, and present the optimization of the model.

2.1 A Coarse-to-Fine Framework

Coarse LSTM. Operating on features computed at a coarse image scale using a lightweight CNN model (see Sec. 3.1 for details), the coarse LSTM quickly glimpses over video frames to get an overview of the current inputs in a computationally efficient manner. More formally, at the t-th time step, the coarse LSTM takes in the coarse features v^c_t of the current frame, the previous hidden states h^c_{t-1} and cell states c^c_{t-1}, and computes the current hidden states h^c_t and cell states c^c_t:

    h^c_t, c^c_t = cLSTM(v^c_t, h^c_{t-1}, c^c_{t-1}).    (1)

Conditional gating module. The coarse LSTM skims video frames efficiently without allocating too much computation; however, fast processing with coarse features will inevitably overlook important details needed to differentiate subtle actions/events (e.g., it is much easier to separate "drinking coffee" from "drinking beer" with larger video frames). Therefore, LITEEVAL incorporates a conditional gating module to decide whether to examine the incoming video frame more carefully to obtain finer details.
The gating module is a one-layer MLP that outputs the (unnormalized) probability of computing fine features with a more powerful CNN:

    b_t = W_g^T [v^c_t, h^f_{t-1}] ∈ R^2,    (2)

where W_g are the weights of the conditional gate, h^f_{t-1} and c^f_{t-1} are the hidden and cell states of the fine LSTM (discussed below) from the previous time step, and [ , ] denotes the concatenation of features. Since the gating module aims to make a discrete decision whether to compute features at a finer scale based on b_t, a straightforward way is to choose the entry with the higher value in b_t, which, however, is not differentiable. Instead, we define a random variable B_t to make the decision through sampling from b_t. Learning such a parameterized gating function by sampling can be achieved in different ways, as will be discussed below in Section 2.2.

Fine LSTM. If the gating module selects to pay more attention to the current frame (i.e., B_t = 1), features at a finer scale will be computed with a computationally intensive CNN, and will be sent to the fine LSTM for temporal modeling.
In particular, the fine LSTM takes as inputs the fine features v^f_t concatenated with the coarse features v^c_t, the previous hidden states h^f_{t-1} and cell states c^f_{t-1}, and produces the hidden states h^f_t and cell states c^f_t of the current time step:

    ~h^f_t, ~c^f_t = fLSTM([v^c_t, v^f_t], h^f_{t-1}, c^f_{t-1}),    (3)

    h^f_t = (1 - B_t) h^f_{t-1} + B_t ~h^f_t,    c^f_t = (1 - B_t) c^f_{t-1} + B_t ~c^f_t.    (4)

When the gating module opts out of the computation of fine features (i.e., B_t = 0), the hidden states from the previous time step are reused.

Synchronizing the cLSTM with the fLSTM. It is worth noting that the coarse LSTM contains information from all frames seen so far, while the hidden states in the fine LSTM only consist of accumulated knowledge from frames selected by the gating module. While fine-grained details are stored in the fLSTM, the cLSTM provides context information from the remaining frames that might be beneficial for recognition. To obtain improved performance, a straightforward way is to concatenate their hidden states before classification, yet they are asynchronous (the coarse LSTM is always ahead of the fine LSTM, seeing more frames), making it difficult to know when to perform fusion. Therefore, we synchronize these two LSTMs by simply copying. In particular, at the t-th step, if the gating module decides not to compute fine features (i.e., B_t = 0 in Equation 4), instead of using h^f_{t-1} directly, we update h^f_t = [h^c_t, h^f_{t-1}(D_c + 1 : D_f)], where D_c and D_f denote the dimensions of h^c and h^f, respectively. A similar modification is performed to c^f_t.
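The per-step update above can be sketched in pure Python. This is a toy illustration under stated assumptions, not the authors' code: `clstm` and `flstm` are stand-in recurrences for the two LSTM cells, states are plain lists, and `litestep` is a hypothetical name; the gated blend of Equation 4 collapses to an if/else because B_t is binary, and the B_t = 0 branch applies the copy-based synchronization h^f_t = [h^c_t, h^f_{t-1}(D_c+1 : D_f)].

```python
def clstm(v, h, c):
    # toy stand-in for the coarse LSTM cell (Eq. 1)
    h2 = [0.5 * hi + 0.5 * vi for hi, vi in zip(h, v)]
    return h2, c

def flstm(x, h, c):
    # toy stand-in for the fine LSTM cell; x is [v^c_t, v^f_t] concatenated (Eq. 3)
    h2 = [0.5 * hi + 0.5 * sum(x) / len(x) for hi in h]
    return h2, c

def litestep(v_coarse, v_fine, state, B_t, d_c):
    """One LiteEval time step. B_t = 1: update the fine LSTM with fine
    features (Eq. 3-4). B_t = 0: skip fine features and synchronize by
    copying the coarse states into the first d_c units of the fine states."""
    h_c, c_c, h_f, c_f = state
    h_c, c_c = clstm(v_coarse, h_c, c_c)              # Eq. (1): always runs
    if B_t == 1:
        h_f, c_f = flstm(v_coarse + v_fine, h_f, c_f) # Eq. (3)-(4) with B_t = 1
    else:
        h_f = h_c + h_f[d_c:]                         # sync: h^f_t = [h^c_t, h^f_{t-1}(d_c+1:)]
        c_f = c_c + c_f[d_c:]
    return h_c, c_c, h_f, c_f
```

After a B_t = 0 step, the first d_c entries of the fine hidden state equal the coarse hidden state, so the fine LSTM always summarizes every frame seen so far.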
Now the hidden states in the fine LSTM contain all information seen so far and can be readily used to derive predictions at any time: p_t = softmax(W_p^T h^f_t), where W_p denotes the weights of the classifier.

2.2 Optimization

Let Θ = {Θ_cLSTM, Θ_fLSTM, Θ_g} denote the trainable parameters in the framework, where Θ_cLSTM and Θ_fLSTM represent the parameters in the coarse and fine LSTMs, respectively, and Θ_g are the weights of the gating module¹. During training, we use predictions from the last time step T as the video-level predictions, and optimize the following loss function:

    minimize_Θ  E_{(x,y)∼D_train, B_t∼Bernoulli(b_t; Θ_g)} [ -y log(p_T(x; Θ)) + λ (1/T Σ_{t=1}^T B_t - γ)^2 ].    (5)

Here x and y denote a sampled video and its corresponding one-hot label vector from the training set D_train, and the first term is a standard cross-entropy loss. The second term limits the usage of fine features to a predefined target γ, with 1/T Σ_{t=1}^T B_t being the fraction of times fine features are used over the entire time horizon. In addition, λ balances the trade-off between recognition accuracy and computational cost.
However, optimizing Equation 5 is not trivial, as the decision whether to compute fine features is binary and requires sampling from a Bernoulli distribution parameterized by Θ_g. One way to solve this is to convert the optimization in Equation 5 into a reinforcement learning problem and then derive the optimal parameters of the gating module with policy gradient methods [29] by associating each action taken with a reward. However, training with policy gradient requires techniques to reduce variance during training as well as carefully selected reward functions. Instead, we use the Gumbel-Max trick to make the framework fully differentiable.
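The Gumbel-based sampling mentioned above and the objective of Equation 5 can be sketched in pure Python. A minimal illustration, not the training code: `gumbel_softmax` draws a relaxed two-way decision from unnormalized log-probabilities, and `liteeval_loss` (a hypothetical helper name) evaluates Equation 5 for a single video, using the λ = 2 and γ = 0.05 settings reported in Section 3 as defaults.

```python
import math, random

def gumbel_softmax(logits, tau):
    """Relaxed categorical sample: perturb log-probabilities with Gumbel
    noise, then apply a temperature-tau softmax."""
    g = [-math.log(-math.log(random.random() + 1e-12)) for _ in logits]  # Gumbel noise
    z = [(l + n) / tau for l, n in zip(logits, g)]
    m = max(z)                                    # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def liteeval_loss(probs, label_idx, gate_decisions, lam=2.0, gamma=0.05):
    """Eq. (5) for one video: cross-entropy on the last-step prediction plus
    a penalty steering mean fine-feature usage toward the target gamma."""
    ce = -math.log(probs[label_idx])                    # -y . log p_T(x)
    usage = sum(gate_decisions) / len(gate_decisions)   # (1/T) sum_t B_t
    return ce + lam * (usage - gamma) ** 2
```

At low temperatures the relaxed sample is nearly one-hot, while its argmax follows the underlying categorical distribution, which is what makes the gate trainable end-to-end.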
More specifically, given a discrete categorical variable B̂ with class probabilities P(B̂ = k) ∝ b_k, where b_k ∈ (0, ∞) and k ≤ K (K denotes the total number of classes; in our framework K = 2), the Gumbel-Max [9, 23] trick indicates that sampling from a categorical distribution can be performed in the following way:

    B̂ = arg max_k (log b_k + G_k),    (6)

where G_k = -log(-log(U_k)) denotes the Gumbel noise and U_k are i.i.d. samples drawn from Uniform(0, 1). Although the arg max operation in Equation 6 is not differentiable, we can use softmax as a continuous relaxation of arg max [23, 17]:

    B_i = exp((log b_i + G_i)/τ) / Σ_{j=1}^K exp((log b_j + G_j)/τ)    for i = 1, ..., K,    (7)

where τ is a temperature parameter controlling the discreteness of the output vector B. In the extreme case when τ → 0, Equation 7 produces the same samples as Equation 6.
In our framework, at each time step, we sample from a Gumbel-Softmax distribution parameterized by the weights of the gating module Θ_g. This facilitates the learning of binary decisions in a fully differentiable framework. Following [17], we anneal the temperature from a high value, which encourages exploration, to a smaller positive value.

¹We absorb the weights of the classifier W_p into Θ_fLSTM.

3 Experiments

3.1 Experimental Setup

Datasets and evaluation metrics. We adopt two large-scale video classification benchmarks to evaluate the performance of LITEEVAL, i.e., FCVID and ACTIVITYNET. FCVID (Fudan-Columbia Video Dataset) [18] contains 91,223 videos collected from YouTube belonging to 239 classes that are selected to cover popular topics in our daily lives like "graduation", "baby shower", "making cookies", etc.
The average duration of videos in FCVID is 167 seconds, and the dataset is split into a training set with 45,611 videos and a testing set with 45,612 videos. While FCVID contains generic video classes, ACTIVITYNET [12] consists of videos that are action/activity-oriented, like "drinking beer", "drinking coffee", "fencing", etc. There are around 20K videos in ACTIVITYNET with an average duration of 117 seconds, manually annotated into 200 categories. Here, we use the v1.3 split with a training set of 10,024 videos, a validation set of 4,926 videos and a testing set of 5,044 videos. We report performance on the validation set since labels in the testing set are withheld by the authors. For offline prediction, we compute average precision (AP) for each video category and use the mean AP across all classes to measure the overall performance, following [18, 12]. For online recognition, we compute top-1 accuracy when evaluating the performance of LITEEVAL, since average precision is a ranking-based metric computed over all testing videos, which is not suitable for online prediction (we do observe similar trends with both metrics). We measure computational cost with giga floating point operations (GFLOPs), which is a hardware-independent metric.

Implementation details. We extract coarse features with a MobileNetv2 [27] model using spatially downsampled video frames (i.e., 112 × 112). MobileNetv2 is a lightweight model and achieves a top-1 accuracy of 52.3% on ImageNet operating on images with a resolution of 112 × 112. To extract features from high-resolution images (i.e., 224 × 224) as inputs to the fine LSTM, we use a ResNet-101 model and obtain features from its penultimate layer. The ResNet-101 model offers a top-1 accuracy of 77.4% on ImageNet and is further finetuned on the target datasets for better performance.
We implement the framework using PyTorch on one NVIDIA P6000 GPU, adopt Adam [40] as the optimizer with a fixed learning rate of 1e-4, and set λ to 2. For ACTIVITYNET, we train with a batch size of 128, and the coarse LSTM and the fine LSTM contain 64 and 512 hidden units, respectively; for FCVID, there are 512 and 2,048 hidden units in the coarse and fine LSTMs, respectively, and the batch size is 256. The computational cost of MobileNetv2 (112 × 112) and ResNet-101 (224 × 224) is 0.08 and 7.82 GFLOPs per frame, respectively.

3.2 Main Results

Offline recognition. We first report the results of LITEEVAL for offline prediction and compare with the following alternatives: (1) UNIFORM, which computes predictions from 25 uniformly sampled frames and then averages these frame-level results as video-level classification scores; (2) LSTM, which produces predictions with hidden states from the last time step of an LSTM; (3) FRAMEGLIMPSE [41], which employs an agent trained with REINFORCE [29] to select a small number of frames for efficient recognition; (4) FASTFORWARD [4], which at each time step learns how many steps to jump forward by training an agent to select from a predefined action set; (5) LITEEVAL-RL, which is a variant of LITEEVAL using REINFORCE to learn binary decisions. The first two methods are widely used baselines for video recognition, particularly the strong uniform testing strategy, which is adopted by almost all CNN-based approaches, while the remaining approaches focus on efficient video understanding.
Table 1 summarizes the results and comparisons. LITEEVAL offers 51.8% (94.3 vs. 195.5) and 51.3% (95.1 vs. 195.5) computational savings measured by GFLOPs compared to the uniform baseline while achieving similar or better accuracies on FCVID and ACTIVITYNET, respectively.
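The per-frame costs quoted above make the cost model easy to reason about. Below is a back-of-the-envelope sketch, under the simplifying assumption that CNN feature extraction dominates (the text notes the LSTMs and gate are comparatively negligible); `expected_gflops` is an illustrative name, not a function from the paper's code.

```python
def expected_gflops(num_frames, fine_fraction, coarse_cost=0.08, fine_cost=7.82):
    """Approximate per-video cost: the coarse CNN (0.08 GFLOPs/frame) runs on
    every frame, the fine CNN (7.82 GFLOPs/frame) only on the gated fraction."""
    return num_frames * (coarse_cost + fine_fraction * fine_cost)
```

For reference, the UNIFORM baseline's 25 frames at 7.82 GFLOPs each give 25 × 7.82 = 195.5 GFLOPs, matching Table 1, while running both CNNs on all 25 frames would cost 25 × 7.90 = 197.5 GFLOPs; keeping the fine-feature fraction small is therefore where the savings come from.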
This confirms that LITEEVAL can save computation by computing expensive features as infrequently as possible while operating on economical features by default. The reason that LITEEVAL requires more computation on average on ACTIVITYNET than on FCVID is that categories in ACTIVITYNET are action-focused, whereas FCVID also contains classes, such as scenes and objects, that are relatively static with less motion. Further, compared to FRAMEGLIMPSE and FASTFORWARD, which also learn frame usage policies, LITEEVAL achieves significantly better accuracy although it requires more computation. Note that the low computation of FRAMEGLIMPSE and FASTFORWARD results from their access to future frames (i.e., jumping to a future time step), while we simply make decisions whether to compute fine features for the current frame, making the framework suitable not only for offline prediction but also for online settings, as will be discussed below. In addition, we also compare with LITEEVAL-RL, which, instead of using Gumbel-Softmax, leverages policy search methods to learn binary decisions. LITEEVAL is clearly better than LITEEVAL-RL in terms of both accuracy and computational cost, and it is also easier to optimize.

Table 1: Results of different methods for offline video recognition. We compare LITEEVAL with alternative methods on FCVID and ACTIVITYNET.

Method             | FCVID mAP | FCVID GFLOPs | ACTIVITYNET mAP | ACTIVITYNET GFLOPs
UNIFORM            | 80.0%     | 195.5        | 70.0%           | 195.5
LSTM               | 79.8%     | 196.0        | 70.8%           | 195.8
FRAMEGLIMPSE [41]  | 71.2%     | 29.9         | 60.2%           | 32.9
FASTFORWARD [4]    | 67.6%     | 66.2         | 54.7%           | 17.2
LITEEVAL-RL        | 74.2%     | 245.9        | 65.2%           | 269.3
LITEEVAL           | 80.0%     | 94.3         | 72.7%           | 95.1

Figure 2: Computational cost vs. recognition accuracy on FCVID (a) and ACTIVITYNET (b). Results of LITEEVAL and comparisons with alternative methods for online video prediction.

Online recognition with varying computational budgets.
Once trained, LITEEVAL can be readily deployed in an online setting where frames arrive sequentially. Since computing fine features is the most expensive operation in the framework (7.82 GFLOPs per frame), given a video clip, we vary the number of times fine features are read in (denoted by K) such that different computational budgets can be accommodated, i.e., forcing early predictions after the model has computed fine features for the K-th time. This is similar in spirit to anytime prediction [15], where there is a budget for each testing sample. We then report the average computational cost with respect to the achieved top-1 recognition accuracy on the testing set. We compare with (1) UNIFORM-K, which, for a testing video, averages predictions from K frames sampled uniformly from a total of K' frames as its final prediction scores (K' is the location where LITEEVAL produces predictions after having seen the fine features for the K-th time); (2) SEQ-K, which performs mean-pooling over K consecutive frames.
The results are summarized in Figure 2. We observe that LITEEVAL offers the best trade-off between computational cost and recognition accuracy in the online setting on both FCVID and ACTIVITYNET. It is worth noting that while UNIFORM-K is a powerful baseline, it is not practical in the online setting, as there is no prior knowledge of how many frames have been seen so far and how many are yet to arrive. Further, LITEEVAL outperforms the straightforward frame-by-frame computation strategy SEQ-K by clear margins. This confirms the effectiveness of LITEEVAL when deployed online.

Learned policies for fine feature usage. We now analyze the policies learned by the gating module for deciding whether to compute fine features. Figure 3 visualizes the distribution of fine feature usage for sampled video categories in FCVID.
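The budgeted online procedure described above can be sketched as a simple loop. This is an illustrative skeleton, not the authors' API: `gate`, `step`, and `predict` are hypothetical callables standing in for the gating module, the recurrent state update, and the classifier.

```python
def predict_with_budget(frames, gate, step, predict, K):
    """Process frames sequentially and force an early prediction once fine
    features have been computed for the K-th time. Returns the prediction
    and the number of frames consumed (the K' referenced by UNIFORM-K)."""
    state, fine_used = None, 0
    for t, frame in enumerate(frames):
        use_fine = gate(state, frame)          # conditional gating decision
        state = step(state, frame, use_fine)   # coarse (and possibly fine) update
        fine_used += int(use_fine)
        if fine_used >= K:                     # budget exhausted: stop early
            return predict(state), t + 1
    return predict(state), len(frames)
```

Because the fine LSTM is kept synchronized with the coarse LSTM, the state is valid for prediction at whichever frame the budget runs out.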
We can see that the number of times fine features are computed not only varies across different categories but also within the same class. Since fine feature usage is proportional to the overall computation required, this verifies our hypothesis that the computation required to make correct predictions differs conditioned on input samples. We further visualize, in Figure 4, the frames selected by LITEEVAL to compute fine features for certain videos. We observe that redundant frames without additional information are ignored and the selected frames provide salient information for recognizing the class of interest.

Figure 3: The distribution of fine feature usage for sampled classes on FCVID. In addition to quartiles and medians, mean usage, denoted as yellow dots, is also presented.

Figure 4: Frames selected (indicated by green borders) by LITEEVAL to compute fine features in sampled FCVID videos (examples shown: Marriage Proposal, Making Salad, Chorus, Accordion Performance).

3.3 Ablation Studies

Fine feature usage. Table 3 presents the results of using γ to control fine feature usage in LITEEVAL. We observe that setting γ to 0.05 offers the best trade-off between computational cost and accuracy, while using an extremely small γ (e.g., 0.01) achieves worse results, since it forces the model to compute fine features as seldom as possible to save computation and could overlook important information. It is also worth mentioning that relatively small values (i.e., less than or equal to 0.1) produce decent results, demonstrating that there exists a high level of redundancy in video frames.

The synchronization of the fine LSTM with the coarse LSTM. We also investigate the effectiveness of synchronizing the two LSTMs.
We can see in Table 2 that, without updating the hidden states of the fLSTM with those of the cLSTM, the performance degrades to 65.7%. This confirms that synchronization by transferring information from the cLSTM to the fLSTM is critical for good performance, as it makes the fine LSTM aware of all useful information seen so far.

Table 2: The effectiveness of syncing LSTMs on FCVID.

Method    | mAP
w/o. sync | 65.7%
LITEEVAL  | 80.0%

Table 3: Results of different γ in LITEEVAL on FCVID.

γ    | mAP   | GFLOPs
0.01 | 78.8% | 75.4
0.03 | 79.7% | 82.1
0.05 | 80.0% | 94.3
0.10 | 80.1% | 139.0

Table 4: Results of different sizes of LSTMs on FCVID.

# units in cLSTM | mAP
64               | 76.9%
128              | 77.3%
256              | 78.3%
512              | 80.0%

Number of hidden units in the LSTMs. We experiment with different numbers of hidden units in the coarse LSTM and present the results in Table 4. We can see that using a small LSTM with fewer hidden units degrades performance due to limited capacity. As mentioned earlier, the most expensive operation in the framework is computing CNN features from video frames, while the LSTMs are much more computationally efficient: they require only 0.06% of the GFLOPs needed to extract features with a ResNet-101 model. For the fine LSTM, we found that a size of 2,048 offers the best results.

4 Related Work

Conditional Computation.
Our work relates to conditional computation, which aims to achieve decent recognition accuracy while accommodating varying computational budgets. Cascaded classifiers [32] are among the earliest work to save computation by quickly rejecting easy negative windows for fast face detection. Recently, the idea of conditional computation has also been investigated in deep neural networks [30, 15, 24, 6, 1, 22] by learning when to exit CNNs with attached decision branches. Graves [8] adds a halting unit to RNNs to associate a ponder cost with computation. Several recent approaches learn to choose which layers in a large network to use [35, 31, 37] or which regions to attend to in images [25, 7], conditioned on inputs, to achieve fast inference. In contrast, we focus on conditional computation in videos, where we learn a fine feature usage strategy to determine whether to use computationally expensive components in a network.

Efficient Video Analysis. While there is a plethora of work focusing on designing robust models for video classification, limited effort has been made on efficient video recognition [42, 36, 4, 41, 38, 5, 19, 43]. Yeung et al. use an agent trained with policy gradient methods to select informative frames and predict when to stop inference for action detection [41]. Fan et al. further introduce a fast-forward agent that decides how many frames to jump forward at a certain time step [4]. While these are conceptually similar to our approach, which also aims to skip redundant frames, our framework is fully differentiable, and thus easier to train than policy search methods [4, 41].
More importantly,\nwithout assuming access to future frames, our framework is not only suitable for of\ufb02ine predictions\nbut also can be deployed in an online setting where a stream of video frames arrive sequentially.\nA few recent approaches explore lightweight 3D CNNs to save computation [5, 43], but they use\nthe same set of parameters for all videos regardless of their complexity. In contrast, LITEEVAL is\na general dynamic inference framework for resource-ef\ufb01cient recognition, leveraging LSTMs to\naggregate temporal information and making feature usage decisions over time; it is complementary to\n3D CNNs, as we can replace the inputs to the \ufb01ne LSTM with features from 3D CNNs, dynamically\ndetermining whether to compute powerful features from incoming video snippets.\n\n5 Conclusion\n\nWe presented LITEEVAL, a simple yet effective framework for resource-ef\ufb01cient video prediction\nin both online and of\ufb02ine settings. LITEEVAL is a coarse-to-\ufb01ne framework that contains a coarse\nLSTM and a \ufb01ne LSTM organized hierarchically, as well as a gating module. In particular, LITEEVAL\noperates on compact features computed at a coarse scale and dynamically decides whether to compute\nmore powerful features for incoming video frames to obtain more details with a gating module. The\ntwo LSTMs are further synchronized such that the \ufb01ne LSTM always contains all information seen\nso far that can be readily used for predictions. Extensive experiments are conducted on FCVID and\nACTIVITYNET and the results demonstrate the effectiveness of the proposed approach.\nAcknowledgment ZW and LSD are supported by Facebook and the Of\ufb01ce of Naval Research under Grant\nN000141612713.\n\n8\n\n\fReferences\n[1] E. Bengio, P.-L. Bacon, J. Pineau, and D. Precup. Conditional computation in neural networks for faster\n\nmodels. In ICML Workshop on Abstraction in Reinforcement Learning, 2016. 8\n\n[2] W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. 
Chen. Compressing neural networks with the hashing trick. In ICML, 2015.

[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

[4] H. Fan, Z. Xu, L. Zhu, C. Yan, J. Ge, and Y. Yang. Watching a small portion could be as good as watching all: Towards efficient video classification. In IJCAI, 2018.

[5] C. Feichtenhofer, H. Fan, J. Malik, and K. He. SlowFast networks for video recognition. In ICCV, 2019.

[6] M. Figurnov, M. D. Collins, Y. Zhu, L. Zhang, J. Huang, D. Vetrov, and R. Salakhutdinov. Spatially adaptive computation time for residual networks. In CVPR, 2017.

[7] M. Gao, R. Yu, A. Li, V. I. Morariu, and L. S. Davis. Dynamic zoom-in network for fast object detection in large images. In CVPR, 2018.

[8] A. Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016.

[9] T. Hazan and T. S. Jaakkola. On the partition function and random maximum a-posteriori perturbations. In ICML, 2012.

[10] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.

[11] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.

[12] F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In CVPR, 2015.

[13] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. In CVPR, 2017.

[14] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In CVPR, 2018.

[15] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger. Multi-scale dense convolutional networks for efficient prediction. In ICLR, 2018.
[16] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.

[17] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with Gumbel-softmax. In ICLR, 2017.

[18] Y.-G. Jiang, Z. Wu, J. Wang, X. Xue, and S.-F. Chang. Exploiting feature and class relationships in video categorization with regularized deep neural networks. IEEE TPAMI, 2018.

[19] B. Korbar, D. Tran, and L. Torresani. SCSampler: Sampling salient clips from video for efficient action recognition. In ICCV, 2019.

[20] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. In ICLR, 2017.

[21] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár. Microsoft COCO: Common objects in context. In ECCV, 2014.

[22] L. Liu and J. Deng. Dynamic deep neural networks: Optimizing accuracy-efficiency trade-offs by selective execution. arXiv preprint arXiv:1701.00299, 2017.

[23] C. J. Maddison, A. Mnih, and Y. W. Teh. The Concrete distribution: A continuous relaxation of discrete random variables. In ICLR, 2017.

[24] M. McGill and P. Perona. Deciding how to decide: Dynamic routing in artificial neural networks. In ICML, 2017.

[25] M. Najibi, B. Singh, and L. S. Davis. AutoFocus: Efficient multi-scale inference. In ICCV, 2019.

[26] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, 2016.

[27] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In CVPR, 2018.

[28] B. Singh, M. Najibi, and L. S. Davis. SNIPER: Efficient multi-scale training. In NIPS, 2018.

[29] R. S. Sutton and A.
G. Barto. Reinforcement learning: An introduction. MIT Press, 1998.

[30] S. Teerapittayanon, B. McDanel, and H. Kung. BranchyNet: Fast inference via early exiting from deep neural networks. In ICPR, 2016.

[31] A. Veit and S. Belongie. Convolutional networks with adaptive inference graphs. In ECCV, 2018.

[32] P. Viola and M. J. Jones. Robust real-time face detection. IJCV, 2004.

[33] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.

[34] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, 2018.

[35] X. Wang, F. Yu, Z.-Y. Dou, and J. E. Gonzalez. SkipNet: Learning dynamic routing in convolutional networks. In ECCV, 2018.

[36] C.-Y. Wu, M. Zaheer, H. Hu, R. Manmatha, A. J. Smola, and P. Krähenbühl. Compressed video action recognition. In CVPR, 2018.

[37] Z. Wu, T. Nagarajan, A. Kumar, S. Rennie, L. S. Davis, K. Grauman, and R. Feris. BlockDrop: Dynamic inference paths in residual networks. In CVPR, 2018.

[38] Z. Wu, C. Xiong, C.-Y. Ma, R. Socher, and L. S. Davis. AdaFrame: Adaptive frame selection for fast video recognition. In CVPR, 2019.

[39] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.

[40] T. Yao, C.-W. Ngo, and S. Zhu. Predicting domain adaptivity: Redo or recycle? In ACM Multimedia, 2012.

[41] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei. End-to-end learning of action detection from frame glimpses in videos. In CVPR, 2016.

[42] B. Zhang, L. Wang, Z. Wang, Y. Qiao, and H. Wang. Real-time action recognition with enhanced motion vector CNNs. In CVPR, 2016.

[43] M. Zolfaghari, K. Singh, and T. Brox. ECO: Efficient convolutional network for online video understanding. In ECCV, 2018.