{"title": "Weakly Supervised Dense Event Captioning in Videos", "book": "Advances in Neural Information Processing Systems", "page_first": 3059, "page_last": 3069, "abstract": "Dense event captioning aims to detect and describe all events of interest contained in a video. Despite the advanced development in this area, existing methods tackle this task by making use of dense temporal annotations, which is dramatically source-consuming. This paper formulates a new problem: weakly supervised dense event captioning, which does not require temporal segment annotations for model training.  Our solution is based on the one-to-one correspondence assumption, each caption describes one temporal segment, and each temporal segment has one caption, which holds in current benchmark datasets and  most real world cases. We decompose the problem into a pair of dual problems: event captioning and sentence localization and present a cycle system to train our model. Extensive experimental results are provided to  demonstrate the ability of our model  on both dense event captioning and sentence localization in videos.", "full_text": "Weakly Supervised Dense Event Captioning in Videos\n\nXuguang Duan\u2217 1, Wenbing Huang\u22172, Chuang Gan3, Jingdong Wang4,\n\nWenwu Zhu1, Junzhou Huang2\n\n1 Tsinghua University, Beijing, China; 2 Tencent AI Lab. ;\n\n3 MIT-IBM Watson AI Lab; 4 Microsoft Research Asia, Beijing, China;\n\nduan_xg@outlook.com, hwenbing@126.com, ganchuang1990@gmail.com,\n\njingdw@microsoft.com, wwzhu@tsinghua.edu.cn,joehhuang@tencent.com\n\nAbstract\n\nDense event captioning aims to detect and describe all events of interest contained\nin a video. Despite the advanced development in this area, existing methods tackle\nthis task by making use of dense temporal annotations, which is dramatically\nsource-consuming. This paper formulates a new problem: weakly supervised dense\nevent captioning, which does not require temporal segment annotations for model\ntraining. 
Our solution is based on the one-to-one correspondence assumption,\neach caption describes one temporal segment, and each temporal segment has one\ncaption, which holds in current benchmark datasets and most real-world cases. We\ndecompose the problem into a pair of dual problems: event captioning and sentence\nlocalization and present a cycle system to train our model. Extensive experimental\nresults are provided to demonstrate the ability of our model on both dense event\ncaptioning and sentence localization in videos.\n\n1\n\nIntroduction\n\nDramatic improvements have been made on video understanding due to the development of deep\nneural networks and large-scale video datasets [1, 2, 3]. Among the wide variety of applications\non video understanding, the video captioning task is attracting more and more interests in recent\nyears [4, 5, 6, 7, 8, 9, 10, 11]. In video captioning, the machine is required to describe the video\ncontent in the natural language form, which makes it more meticulous and thus challenging compared\nto other tasks describing the video content using a few tags or labels, such as video classi\ufb01cation and\naction detection [12, 13].\nThe current trend on video captioning is to perform Dense Event Captioning (DEC, also called\nDense-Captioning Event in videos in [10]). As one video usually contains more than one event of\ninterest, the goal of DEC is to locate all events in the video and perform captioning for each of\nthem. Clearly, such dense captioning enriches the information we obtained and is bene\ufb01cial for more\nin-depth video analysis. 
Nevertheless, to achieve this goal, we need to collect the caption annotation\nfor each event along with its temporal segment coordinate (i.e., the start and end times) for network\ntraining, which is resource-consuming and often impractical.\nIn this paper, we introduce a new problem, Weakly Supervised Dense Event Captioning (WS-DEC) (see footnote 2),\nwhich aims at dense event captioning using only the caption annotations for training. In the training\ndataset, only a paragraph or a set of sentences is available to describe each video, but the temporal\nsegment coordinate of each event and its correspondence to the captioning sentence are not given. At\ntest time, the trained model detects all events of interest and provides a caption for each event. One\nobvious advantage of the weak supervision is the significant reduction of the annotation cost. This\nbenefit becomes even more valuable if we attempt to make use of videos in the wild (e.g., the videos\non the web) to enlarge the training set.\n\nFootnote \u2217: denotes equal contributions. This work was done when Xuguang Duan was a research intern at\nTencent AI Lab. Wenwu Zhu is the corresponding author.\n\nFootnote 2: More specifically, the term \u201cweakly\u201d in our paper refers to the incompleteness of the supervision rather than\nthe amount of information.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nWe solve the problem by utilizing the one-to-one correspondence assumption: each caption describes\none temporal segment, and each temporal segment has one caption. We decompose the problem\ninto a cycle of dual problems: caption generation and sentence localization. During the training\nphase, we perform sentence localization from the given caption annotation to obtain the associated\nsegment, which is then fed to the caption generator to reconstruct the caption. The objective is\nto minimize the reconstruction error. 
Our cycle process repeatedly optimizes the caption generator and\nsentence localizer without any ground-truth segment. During the testing phase, it is infeasible to\napply the cycle process in the same way as in the training phase, as the caption is unknown. Instead, we\nfirst perform caption generation on a set of randomly initialized candidate segments and then map\nthe resulting captions back to the segment space. The segments output by this cycle process will get\ncloser to the ground truth if certain properties are satisfied. We thus formulate an extra training loss\nthat encourages our model to meet these properties. Based on the detected segments, we are able to\nperform event captioning on them, and thus achieve the goal of dense event captioning.\nWe summarize our contributions as follows. I. We propose to solve the DEC task without the need for\ntemporal segment annotations, and thus introduce a new problem, WS-DEC, which aims at making use of the\nhuge amount of data on the web while reducing the cost of annotation. II. We develop a flexible\nand efficient method to address WS-DEC by exploring the one-to-one correspondence between\nthe temporal segment and the event caption. III. We evaluate the performance of our approach on the\nwidely-used benchmark ActivityNet Captions [10]. The experimental results verify the effectiveness\nof our method regarding both dense event captioning ability and sentence localization accuracy.\n\n2 Related Work\n\nWe briefly review recent advances in video captioning, dense event captioning, and sentence localization\nin videos in the next few paragraphs.\nVideo captioning. Early researchers simply aggregate frame-level features by mean pooling and\nthen use pipelines similar to those of image captioning [4] to generate caption sentences. 
This mean-pooling\nstrategy works well for short video clips, but its performance degrades quickly as the video length increases.\nRecurrent Neural Networks (RNNs) along with attention mechanisms are thus employed [5, 6, 7, 8],\namong which S2VT [7] exhibits more desirable efficiency and flexibility. Since a single sentence is\nfar from enough to describe the dynamics of an untrimmed real-world video, some researchers attempt\nto generate multiple sentences or a paragraph to describe the given video [9, 14, 15]. Among them,\nthe work by [15] aims at providing diverse captions corresponding to different spatial regions in a\nweakly supervised manner. Despite sharing a similar weakly-supervised setting, our paper\ndiffers in that it localizes different events temporally and performs captioning for each detected event,\nso that descriptions are generated from meaningful events rather than from unstructured visual features.\nDense Event Captioning. Recent attention has been paid to dense event captioning in videos [10,\n11]. Current works all follow the \"detection and description\" framework. The model proposed by [10]\nresorts to the DAP method [16] for event detection and enhances the captioning ability by applying the\ncontext-aware S2VT [7]. Meanwhile, [11] employs a grouping schema based on their previous video\nhighlight detector [17] to perform event detection, and the attribute-augmented LSTM (LSTM-A) [18]\nfor caption generation. Most recently, [19, 20] try to boost the event proposal with the generated sentences,\nwhile [21] leverages bidirectional SST [22] instead of DAP [16]. Also, [21] proposes to use\nbidirectional attention for dense captioning. In contrast to these fully-supervised works, we address\nthe task without the guidance of temporal segments during training. 
Specifically, instead of detecting\nall events using a one-to-many-mapping event detector [23, 22], we localize them one by one\nusing our sentence localizer and caption generator.\nSentence localization in videos. Early work on localizing sentences in videos was constrained to certain visual\ndomains (e.g., kitchen environments) [24, 25, 26]. With the development of\ndeep learning, several models have been proposed to work on real-world videos [27, 28, 29]. The\napproaches by [27, 28] fall into the typical two-stage \u201cscan and localize\u201d framework.\nTo elaborate a bit, the work by [27] employs a Moment Context Network (MCN) for matching\ncandidate video clips and sentence queries, while the model in [28] proposes a Cross-modal Temporal\nRegression Localizer (CTRL) that makes use of coarsely sampled clips to reduce computation. In\ncontrast, [29] opens up a different direction by regressing the temporal coordinates given learned video\nand sentence representations. In our framework, sentence localization is originally\nformulated as an intermediate task to enable weakly supervised training for dense event captioning.\nIn fact, our model also provides an unsupervised solution to sentence localization.\n\n3 The Proposed Method\n\nWe start this section by presenting the fundamental formulation of our method and then\nprovide the details of the model architecture.\nNotations. We first provide the key notations used in this work. We\ndenote the given video by V = (v_1, v_2, \u00b7\u00b7\u00b7, v_T), with v_t denoting the image frame at time t. We\ndefine an event of interest as a temporally continuous segment of V and denote all the events by their\ntemporal coordinates as {S_i = (m_i, w_i)}_i^N, where N is the number of events, and m_i and w_i denote the\ntemporal center and width, respectively. 
The temporal coordinates of all events are normalized to be\nwithin [0, 1] throughout this paper. Let the caption for the segment S_i be the sentence C_i = {c_ij}_{j=0}^{T_c},\nwhere c_ij denotes the j-th word and T_c is the length of the caption sentence.\n\n3.1 Formulation\nFormally, the conventional event captioning models [10, 11] first locate the temporal segments {S_i}_i^N\nof the events with an event proposal module, and then generate the caption C_i for each segment S_i\nthrough the caption generator. In our weakly supervised setting, the segment labels are not provided and\nonly the caption sentences (possibly several for a single video) are available.\nThe biggest difficulty of our task is that it is infeasible to perform weakly supervised event\nproposal, which is by nature a one-to-many mapping problem and is too noisy for weakly supervised learning.\nInstead, we try a new direction that makes use of the bidirectional one-to-one mapping between\ncaption sentence and temporal segment. Formally, we formulate a pair of dual tasks: sentence\nlocalization and event captioning. 
Conditioned on a target video V, the dual tasks are defined as:\n\u2022 Sentence localization: this task localizes the segment S_i corresponding to the given caption C_i, i.e., it learns the mapping l_{\u03b81} : (V, C_i) \u2192 S_i, associated with parameter \u03b81;\n\u2022 Event captioning: this task inversely generates the caption C_i for the given segment S_i, i.e., it learns the function g_{\u03b82} : (V, S_i) \u2192 C_i, associated with parameter \u03b82.\nThe dual problems exist simultaneously once the correspondence between S_i and C_i is one-to-one,\nwhich is the case in our problem because S_i and C_i are tied together by their corresponding event.\nIf we nest these dual functions together, then any valid caption and segment pair (C_i, S_i) becomes a\nfixed-point solution of the following functions:\n\nC_i = g_{\u03b82}(V, l_{\u03b81}(V, C_i)),    (1)\nS_i = l_{\u03b81}(V, g_{\u03b82}(V, S_i)).    (2)\n\nMore interestingly, Eq. (1) derives an auto-encoder for C_i in which the segment S_i vanishes. This\ngives us a way to train the parameters of both functions l and g, by formulating the loss as\n\nL_c = dist(C_i, g_{\u03b82}(V, l_{\u03b81}(V, C_i))),    (3)\n\nwhere dist(\u00b7,\u00b7) is a distance-based loss.\nA remaining issue is that it is still infeasible to perform dense event captioning in the testing phase by\napplying l_{\u03b81} or g_{\u03b82} directly, since both the temporal segment and the caption sentence are unknown. To tackle this\nissue, we introduce the concept of fixed-point iteration [30] as follows.\nProposition 1 (Fixed-Point Iteration). We define the iteration as\n\nS(t + 1) = l_{\u03b81}(V, g_{\u03b82}(V, S(t))),    (4)\n\nFigure 1: Model structure and training connections. Our model is composed of a sentence localizer\nand a caption generator. For training, the video and all event descriptions are available. 
We feed\nthe video and one of its event descriptions to the sentence localizer to obtain a temporal segment\nprediction, and then the predicted segment is used to regenerate the caption sentence and to relocate\nthe temporal segment. The trained dual system is used to generate dense event captions from random\ntemporal segments in the test phase.\n\nwhere S(t) will converge to the fixed-point solution, i.e., S\u2217 = l_{\u03b81}(V, g_{\u03b82}(V, S\u2217)), if there exists a\nsufficiently small \u03f5 > 0 satisfying \u2016S(0) \u2212 S\u2217\u2016 \u2264 \u03f5 and the function l_{\u03b81}(V, g_{\u03b82}(V, S)) is locally\nLipschitz continuous around S\u2217 with Lipschitz constant L < 1.\n\nNote that the proof has been derived previously; for better readability, we include it in the\nsupplementary material.\nWith the fixed-point iteration, we can solve the event captioning task without any\ncaption or segment during testing. We sample a random set of candidate segments {S_i^(r)}_{i=0}^{N_r} for\nthe target video as initial guesses, and then perform the iteration in Eq. (4) on these candidates. After\nsufficient iterations, the outputs will converge to the fixed-point solutions (i.e., the valid segments)\nS\u2217. In our experiments, we use only one round of iteration, S'_i = l_{\u03b81}(V, g_{\u03b82}(V, S_i^(r))), and find it\nsufficient to deliver promising results. With the refined segments {S'_i}_{i=0}^{N'} at hand, we are able to\ngenerate the captions as {C_i = g_{\u03b82}(V, S'_i)}_{i=0}^{N'} and thus solve the dense event captioning task.\nAs introduced below, both l_{\u03b81} and g_{\u03b82} are built by stacking multiple neural layers, which do not naturally\nsatisfy the local Lipschitz continuity required in Proposition 1. We thus apply the idea of the denoising\nauto-encoder in [31], where we generate noisy data by adding Gaussian noise to the training data\nand minimize the reconstruction error between the noisy data and the true data. Explicitly, we enforce the\ntemporal segments around the true data to converge to the fixed-point solutions in one round of iteration.\nSince, under the weak supervision constraint, we do not have the ground-truth segments during\ntraining, we apply l_{\u03b81}(V, C_i) as the approximated segment and minimize the following loss:\n\nL_s = dist(l_{\u03b81}(V, C_i), l_{\u03b81}(V, g_{\u03b82}(V, \u03b5_i + l_{\u03b81}(V, C_i)))),    (5)\n\nwhere \u03b5_i \u223c N(0, \u03c3) is Gaussian noise. The Gaussian smoothing in Eq. (5) does not theoretically guarantee\nLipschitz continuity, but in practice it encourages the random proposals to converge to the positive\nsegments, as verified by our experiments.\nBy combining Eq. (3) and Eq. (5), we obtain the hybrid loss as\n\nL = L_c + \u03bb_s L_s,    (6)\n\nwhere \u03bb_s is the trade-off parameter.\n\n3.2 Network Design\nThe core of our framework, as illustrated in Fig. 1, consists of a Sentence Localizer (i.e., l_{\u03b81}(V, C))\nand a Caption Generator (i.e., g_{\u03b82}(V, S)). Any differentiable model can be used to formulate the\nsentence localizer and caption generator. Here, we introduce the ones that we use. We omit\nthe RNN-based video and sentence feature extractors, leaving their details to the supplementary\nmaterial. In the following paragraphs, suppose we have obtained the features V = {v_t \u2208 R^k}_{t=0}^{T_v} and hidden\nstates {h_t^(v) \u2208 R^k}_{t=0}^{T_v} for each video, and the features C = {c_t \u2208 R^k}_{t=0}^{T_c} and hidden states\n{h_t^(c) \u2208 R^k}_{t=0}^{T_c} for each caption sentence. T_v and T_c are the lengths of the video and caption.\nSentence Localizer. Performing localization requires modeling the correspondence between the video\nand the caption. We draw on ideas from [29, 28] and propose a cross-attention multi-modal feature\nfusion framework. Here, we develop a novel attention mechanism named Crossing Attention,\nwhich contains two sub-attention computations. The first one computes the attention between the\nfinal hidden state of the video and the caption feature at each time step, namely,\n\nf_c = softmax((h_{T_v}^(v))^T A_c C) C^T,    (7)\n\nwhere (\u00b7)^T denotes the matrix transposition and A_c \u2208 R^{k\u00d7k} is a learnable matrix. The other one\ncalculates the attention between the final hidden state of the caption and the video features, i.e.,\n\nf_v = softmax((h_{T_c}^(c))^T A_v V) V^T,    (8)\n\nwhere A_v \u2208 R^{k\u00d7k} is a learnable matrix.\nThen, we apply the multi-modal feature fusion layer in [28] to fuse the two sub-attentions as\n\nf_cv = (f_c + f_v) \u2016 (f_c \u00b7 f_v) \u2016 FC(f_c \u2016 f_v),    (9)\n\nwhere \u00b7 is the element-wise multiplication, FC(\u00b7) is a Fully-Connected (FC) layer, and \u2016 denotes the\ncolumn-wise concatenation.\nOne can regress the temporal segment directly by adding an FC layer on the mixed feature f_cv,\nwhich however easily gets stuck in local minima if the initial output is far away from the valid\nsegment. To allow our prediction to move between two distant locations efficiently, we first relax the\nregression problem to a classification task. In particular, we evenly divide the input video into multiple\nanchor segments under multiple scales, and train an FC layer on f_cv to predict the best anchor,\ni.e., the one that produces the highest Meteor score [32] for the generated caption sentence. We then conduct\nregression around this best anchor. 
Formally, we attain\n\nS = S^(a) + \u2206S,    (10)\n\nwhere S^(a) is the best anchor segment and \u2206S = (\u2206m, \u2206w) is the regression output obtained by applying\nan FC layer to f_cv.\nCaption Generator. Given the temporal segment, we can perform captioning on the frames clipped\nfrom the video. However, such a clipping operation is non-differentiable, making it intractable for\nend-to-end training. Here, we perform a soft clipping by defining a continuous mask function with\nrespect to the time t. This mask is defined by\n\nM(t, S) = Sig(\u2212K(t \u2212 m + w/2)) \u2212 Sig(\u2212K(t \u2212 m \u2212 w/2)),    (11)\n\nwhere S = (m, w) is the temporal segment, K is the scaling factor, and Sig(\u00b7) is the sigmoid function.\nWhen K is large enough, the mask function becomes a step function whose value is zero except for\nthe region [m \u2212 w/2, m + w/2]. The conventional mean-pooling feature of the clipped frames is then\nequal to the weighted sum of the video features by the mask after normalization, i.e.,\n\nv' = \u2211_{t=1}^{T_v} M(t, S) \u00b7 v_t / \u2211_{t=1}^{T_v} M(t, S).    (12)\n\nRegarding v' as the context and h^(v)_{m+w/2} as the initial hidden state, an RNN is applied to generate the caption:\n\n{c\u0304_t}_{t=1}^{T_c} = RNN(v', h^(v)_{m+w/2}).    (13)\n\nLoss Function. The loss function in Eq. (6) contains two terms, L_c and L_s.\nL_c is used to minimize the distance between the ground truth C = {c_i}_{i=0}^{T_c} and our prediction\nC\u0304 = {c\u0304_i}_{i=0}^{T_c}. We apply the cross-entropy loss (also known as the perplexity loss) as follows:\n\nL_c = \u2212\u2211_{t=1}^{T_c} c_t \u00b7 log(c\u0304_t | c_0 : c_{t\u22121}).    (14)\n\nL_s is applied to compare the difference between S = (m, w) and S' = (m', w') as illustrated in\nFig. 1, and is implemented by the \u21132 norm as\n\nL_s = (m \u2212 m')^2 + (w \u2212 w')^2.    (15)\n\nAs mentioned in Eq. 
(10), we further train the sentence localizer to predict the best anchor segment\nby adding a soft-max layer on the mixed feature f_cv in Eq. (9). We define the one-hot label as\ny = [y_1, \u00b7\u00b7\u00b7, y_{N_a}], where y_j = 1 if the j-th anchor segment is the best one and y_j = 0 otherwise.\nSuppose the prediction output by the soft-max layer is p = [p_1, \u00b7\u00b7\u00b7, p_{N_a}]. The classification loss is\nformulated as\n\nL_a = \u2212\u2211_{i=0}^{N_a} y_i log p_i.    (16)\n\nTaking all losses together, we have\n\nL = L_c + \u03bb_s L_s + \u03bb_a L_a,    (17)\n\nwhere \u03bb_s and \u03bb_a are constant parameters.\n\n4 Experiments\n\nWe conduct experiments on the ActivityNet Captions [10] dataset, which has been used as the\nbenchmark for dense video captioning. This dataset contains 20,000 videos in total, covering a wide\nrange of complex human activities. For each video, the temporal segment and caption sentence\nof each human event are annotated. On average, there are 3.65 events annotated per video,\nresulting in a total of 100,000 events. We follow the protocol suggested by [10, 11] and use 50% of the\nvideos for training, 25% for validation, and 25% for testing.\nThe vocabulary size for all text sentences is set to 6000. As detailed in the supplementary material,\nboth the video and sentence encoders apply GRU models [33] for feature extraction, where the\ndimensions of the hidden and output layers are 512. The trade-off parameters in our loss, i.e., \u03bb_s and\n\u03bb_a, are both set to 0.1. We train our model using stochastic gradient descent with an initial\nlearning rate of 0.01 and a momentum factor of 0.9. Our code is implemented in PyTorch 0.3.\nTraining. Under the weak supervision constraint, the ground-truth temporal segments are not used\nfor training. The video itself is regarded as a special segment that is given by S^(f) = (0.5, 1). 
We\nfirst pre-train the caption generator by using the entire video as input and each event caption in\nit as output. Such a pretraining process allows us to learn a well-initialized caption generator, since\nthe whole video content is related to the event caption, even though the correlation is not precise. After the\npretraining, we train our model in 2 stages. In the first stage, we minimize the captioning loss L_c\nand the reconstruction loss L_s. Then we minimize L_a in the second stage. Details about training are\nprovided in the supplementary material and our GitHub repository.\nTesting. For testing, only input videos are available. As already discussed in \u00a7 3.1, we start from a\nrandom set of segments {S_i^(r)}_{i=0}^{N_r} as initial guesses (N_r = 15 in our reported results). After the\none-round fixed-point iteration, we obtain the refined segments {S_i}_{i=0}^{N_r}. We further filter them\nbased on the IoU between S_i^(r) and S_i (more details are given in the supplementary material), and keep\nthose having high IoU as valid proposals. We then input the filtered segments to the caption generator\nto obtain event captioning sentences. It is worth mentioning that we choose not to use a pretrained\ntemporal segment proposal model (e.g., SST [22]) for the initial temporal segment generation, since such a model\nuses external temporal segment data, which contradicts our motivation.\n\n4.1 Evaluation of dense event captioning\n\nEvaluation metric. The performance is measured with the commonly-used evaluation metrics:\nMETEOR [32], CIDEr [23], Rouge-L [34], and Bleu@N [35]. We compute the above metrics on the\nproposals whose overlap with the ground-truth segments is larger than a given tIoU (footnote 3) threshold,\nand set the score to 0 otherwise. All scores are averaged over tIoU thresholds of 0.3, 0.5, 0.7 and\n0.9 in our experiments. We use the official scripts (footnote 4) for the score computation.\nBaselines. 
No previous method has been proposed for dense event captioning under weak supervision.\nFor oracle comparisons, we still report the results of two fully-supervised methods [10, 11]. As\nfor our method, we implement several variants to analyze the impact of each component. The first\nvariant is the pretrained model, where we randomly sample an event segment from each video and\nfeed it into the pretrained caption generator for captioning in the testing phase. Another variant\nremoves the anchor classification in Eq. (10), and thus regresses the temporal coordinate\nglobally as in [29]. As a complement, we also evaluate the version that preserves the classification\nterm but removes the regression component \u2206S from Eq. (10).\nResults. The event captioning results\nare summarized in Table 1. In general,\nthe Meteor and Cider metrics are considered\nto be more convincing than the other\nscores: the Meteor score is highly correlated\nwith human judgment, and has\nbeen used for the final ranking in the\nActivityNet challenge, while Cider is a\nnewly proposed metric in which the repetition\nof sentences is taken into account.\nOur method reaches performance comparable\nto the fully-supervised methods\nregarding the Meteor score and obtains\nthe best score in terms of the Cider metric.\nSuch results are encouraging, as our\nmethod is weakly supervised and no\nground-truth segment is used. Comparing the different variants of our method, we\nobserve that removing either the anchor classification or the regression decreases the accuracy, which\nverifies the necessity of each component in our model.\nAs we use a set of randomly selected temporal segments to generate the caption results, the\nrobustness of the model to such a random strategy should also be evaluated. 
We use different\nnumbers of temporal segments and different random seeds to generate event caption sentences, and the\nevaluation results are summarized in Table 3. From the table, we can see that the variance across\ndifferent random seeds is small. Besides, we can see a slight increase in performance as the\nnumber of temporal segments increases. We choose N_r = 15 as a trade-off between complexity and\nperformance in our final experiment.\nMoreover, we display the recalls of the detected events by various methods with respect to the testing\nsegments in Figure 2. To compute the recall, we count a predicted segment as a positive sample\nif its overlap with the testing segment is larger than the tIoU threshold. From Fig. 2, we can see\nthat our model is much better than the random proposal model, which verifies the power of our\nweakly-supervised method. Also, our final model is better than the two baselines in general.\nIllustrations. Figure 3 illustrates the event captioning results for two videos. It presents the ground-truth\ndescriptions and the captioning sentences produced by the pretrained model and by our method. Compared with the\npretrained model, which generates a single description for each video, our model is capable of generating\nmore accurate and detailed descriptions. Some of the descriptions are\ncomparable to the ground truth in terms of both the generated sentence and the event temporal segment. However, two\nissues still remain. One is that our model sometimes cannot capture the beginning of an event, which,\nin our opinion, is due to the fact that we use the final hidden state of a temporal segment to generate\nthe description, which does not rely much on the starting coordinate. The other is that our model tends to\ngenerate only 2 to 3 descriptions most of the time, which means that it is not good at capturing all the\nevents in a video, especially in videos with many tiny events.\n\nFigure 2: Evaluation of the event detection.\n\nFootnote 3: temporal Intersection over Union.\nFootnote 4: https://github.com/ranjaykrishna/densevid_eval\n\nTable 1: Evaluation results of captioning. The term ws denotes \"weak supervision\" for short. M, C, R, and B@N denote METEOR, CIDEr, Rouge-L, and Bleu@N, respectively.\n\nModel | ws | M | C | R | B@1 | B@2 | B@3 | B@4\nKrishna\u2019s [10] | False | 4.82 | 17.29 | \u2013 | 17.95 | 7.69 | 3.86 | 2.20\nYao\u2019s [11] | False | 7.71 | 16.08 | 13.27 | 17.50 | 9.62 | 5.54 | 3.38\nPretrained | True | 4.58 | 10.45 | 9.27 | 8.7 | 3.39 | 1.50 | 0.69\nOurs (no classification) | True | 6.08 | 15.1 | 12.25 | 11.85 | 4.67 | 1.90 | 0.80\nOurs (no regression) | True | 6.11 | 17.66 | 12.40 | 11.98 | 5.45 | 2.69 | 1.44\nOurs | True | 6.30 | 18.77 | 12.55 | 12.41 | 5.50 | 2.62 | 1.27\n\nTable 2: Evaluation results of sentence localization. The term us denotes \u201cunsupervised\u201d for short.\n\nModel | us | R@1, IoU 0.1 | R@1, IoU 0.3 | R@1, IoU 0.5 | mIoU\nCTRL [28] | False | 49.09 | 28.70 | 14.00 | 20.54\nABLR [29] | False | 73.30 | 55.67 | 36.79 | 36.99\nFull-supervised | False | 70.01 | 52.89 | 37.61 | 40.36\nOur Final | True | 62.71 | 41.98 | 23.34 | 28.23\n\n4.2 Evaluation of sentence localization\n\nUsing the learned sentence localizer, our model can also be applied to the sentence localization task in\nan unsupervised way. In this section, we provide experimental results to demonstrate the effectiveness\nof our model on this task.\nEvaluation metric. Following the works of [29, 28], we compute the \"R@1, IoU=\u03c3\" and \u201cmIoU\u201d\nscores to measure the model\u2019s sentence localization ability. In detail, for a given sentence and video\npair, the \"R@1, IoU=\u03c3\" score indicates the percentage of sentences whose top-1 predicted temporal\nsegment has a higher IoU with the ground-truth temporal segment than the given threshold \u03c3, while\nthe \"mIoU\" is the average tIoU between all top-1 predictions and the ground-truth temporal segments. In\nour experiment, \u03c3 is set to 0.1, 0.3 and 0.5 following the setting in [29].\nBaselines. 
We compare our model\u2019s sentence localization ability with the Cross-modal Temporal\nRegression Localizer (CTRL) [28] and Attention Based Location Regression (ABLR) [29]. These two\nmodels currently achieve state-of-the-art performance. Besides the unsupervised model, we also\nimplement a fully-supervised version that uses ground-truth segments.\nResults. Table 2 shows the results of all compared methods. First, our fully-supervised implementation\nreaches performance similar to that of the fully-supervised state-of-the-art baseline ABLR,\nthus indicating the effectiveness of our model. As for the unsupervised scenario, we can\nsee that our unsupervised model outperforms CTRL by a considerable margin, which shows that our\nmodel can indeed learn to locate meaningful temporal segments from the indirect losses.\n\nTable 3: Evaluation of model robustness to random temporal segments during testing (see\nSec. 4). We report the captioning evaluations for varying N_r. For each value of N_r, we run the\nexperiments over 5 trials and report the results in the form of mean\u00b1std.\n\nN_r | M | C | B@1 | B@2 | B@3 | B@4\n10 | 6.13\u00b10.03 | 17.75\u00b10.12 | 12.10\u00b10.06 | 5.33\u00b10.05 | 2.50\u00b10.03 | 1.20\u00b10.01\n15 | 6.29\u00b10.01 | 18.65\u00b10.14 | 12.39\u00b10.04 | 5.49\u00b10.02 | 2.58\u00b10.03 | 1.23\u00b10.02\n20 | 6.34\u00b10.01 | 19.14\u00b10.05 | 12.52\u00b10.02 | 5.57\u00b10.02 | 2.61\u00b10.02 | 1.25\u00b10.02\n\n5 Conclusion and Future Work\n\nWe raise a new task termed Weakly Supervised Dense Event Captioning (WS-DEC) and propose an\nefficient method to tackle it. The weak supervision is of great importance, as it eliminates the\nresource-consuming annotation of accurate temporal coordinates and encourages us to explore the huge amount\nof videos in the wild. The proposed solution not only solves the task efficiently but also provides\nan unsupervised method for sentence localization. 
Extensive experiments on both tasks verify the\neffectiveness of our model. For future research, one potential direction is to verify our model by\nperforming experiments directly on Web videos. Meanwhile, since weakly supervised learning is\nbecoming an important research vein in the domain, our proposed method, which uses the cycle process\nand fixed-point iteration, could be applied to other tasks, e.g., weakly-supervised detection.\n\nFigure 3: Illustration of the generated dense event captions. Left: the ground truth; middle: the\noutput of our pretrained caption model; right: the output of our weakly-supervised training approach.\nDifferent colors indicate different temporal segments and sentence descriptions.\n\nAcknowledgements\nThis work was supported in part by the National Program on Key Basic Research Project (No.\n2015CB352300) and the National Natural Science Foundation of China Major Project (No. U1611461).\n\nReferences\n[1] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. ActivityNet: A\nlarge-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on\nComputer Vision and Pattern Recognition, pages 961\u2013970, 2015.\n\n[2] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei.\nLarge-scale video classification with convolutional neural networks. In CVPR, 2014.\n\n[3] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan,\nFabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv\npreprint arXiv:1705.06950, 2017.\n\n[4] Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate\nSaenko. Translating videos to natural language using deep recurrent neural networks. 
arXiv preprint\narXiv:1412.4729, 2014.\n\n[5] Chiori Hori, Takaaki Hori, Teng-Yok Lee, Ziming Zhang, Bret Harsham, John R Hershey, Tim K Marks,\nand Kazuhiko Sumi. Attention-based multimodal fusion for video description. In Computer Vision (ICCV),\n2017 IEEE International Conference on, pages 4203\u20134212. IEEE, 2017.\n\n[6] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan,\nKate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and\ndescription. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages\n2625\u20132634, 2015.\n\n[7] Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate\nSaenko. Sequence to sequence-video to text. In Proceedings of the IEEE international conference on\ncomputer vision, pages 4534\u20134542, 2015.\n\n[8] Huijuan Xu, Subhashini Venugopalan, Vasili Ramanishka, Marcus Rohrbach, and Kate Saenko. A\nmulti-scale multiple instance video description network. arXiv preprint arXiv:1505.05914, 2015.\n\n[9] Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, and Wei Xu. Video paragraph captioning using\nhierarchical recurrent neural networks. In Proceedings of the IEEE conference on computer vision and\npattern recognition, pages 4584\u20134593, 2016.\n\n[10] Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events\nin videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages\n706\u2013715, 2017.\n\n[11] Ting Yao, Yehao Li, Zhaofan Qiu, Fuchen Long, Yingwei Pan, Dong Li, and Tao Mei. MSR Asia MSM at\nActivityNet Challenge 2017: Trimmed action recognition, temporal action proposals and dense-captioning\nevents in videos.\n\n[12] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in\nvideos. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. 
Weinberger, editors, Advances\nin Neural Information Processing Systems 27, pages 568\u2013576. Curran Associates, Inc., 2014.\n\n[13] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal\nsegment networks: Towards good practices for deep action recognition. In European Conference on\nComputer Vision, pages 20\u201336. Springer, 2016.\n\n[14] Tom\u00e1\u0161 Mikolov, Martin Karafi\u00e1t, Luk\u00e1\u0161 Burget, Jan \u010cernock\u00fd, and Sanjeev Khudanpur. Recurrent neural\nnetwork based language model. In Eleventh Annual Conference of the International Speech Communication\nAssociation, 2010.\n\n[15] Zhiqiang Shen, Jianguo Li, Zhou Su, Minjun Li, Yurong Chen, Yu-Gang Jiang, and Xiangyang Xue.\nWeakly supervised dense video captioning. In Proceedings of the IEEE Conference on Computer Vision\nand Pattern Recognition, volume 2, page 10, 2017.\n\n[16] Victor Escorcia, Fabian Caba Heilbron, Juan Carlos Niebles, and Bernard Ghanem. DAPS: Deep action\nproposals for action understanding. In European Conference on Computer Vision, pages 768\u2013784. Springer,\n2016.\n\n[17] Ting Yao, Tao Mei, and Yong Rui. Highlight detection with pairwise deep ranking for first-person video\nsummarization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,\npages 982\u2013990, 2016.\n\n[18] Ting Yao, Yingwei Pan, Yehao Li, Zhaofan Qiu, and Tao Mei. Boosting image captioning with attributes.\nOpenReview, 2(5):8, 2016.\n\n[19] Yehao Li, Ting Yao, Yingwei Pan, Hongyang Chao, and Tao Mei. Jointly localizing and describing events\nfor dense video captioning. arXiv preprint arXiv:1804.08274, 2018.\n\n[20] Luowei Zhou, Yingbo Zhou, Jason J Corso, Richard Socher, and Caiming Xiong. End-to-end dense video\ncaptioning with masked transformer. arXiv preprint arXiv:1804.00819, 2018.\n\n[21] Jingwen Wang, Wenhao Jiang, Lin Ma, Wei Liu, and Yong Xu. 
Bidirectional attentive fusion with context\ngating for dense video captioning. arXiv preprint arXiv:1804.00100, 2018.\n\n[22] Shyamal Buch, Victor Escorcia, Chuanqi Shen, Bernard Ghanem, and Juan Carlos Niebles. SST:\nSingle-stream temporal action proposals. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE\nConference on, pages 6373\u20136382. IEEE, 2017.\n\n[23] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description\nevaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages\n4566\u20134575, 2015.\n\n[24] Iftekhar Naim, Young Chol Song, Qiguang Liu, Henry A Kautz, Jiebo Luo, and Daniel Gildea.\nUnsupervised alignment of natural language instructions with video segments. In AAAI, pages 1558\u20131564,\n2014.\n\n[25] Young Chol Song, Iftekhar Naim, Abdullah Al Mamun, Kaustubh Kulkarni, Parag Singla, Jiebo Luo,\nDaniel Gildea, and Henry A Kautz. Unsupervised alignment of actions in video with text descriptions. In\nIJCAI, pages 2025\u20132031, 2016.\n\n[26] Piotr Bojanowski, R\u00e9mi Lajugie, Edouard Grave, Francis Bach, Ivan Laptev, Jean Ponce, and Cordelia\nSchmid. Weakly-supervised alignment of video with text. In Computer Vision (ICCV), 2015 IEEE\nInternational Conference on, pages 4462\u20134470. IEEE, 2015.\n\n[27] Jiyang Gao, Zhenheng Yang, Chen Sun, Kan Chen, and Ram Nevatia. TURN TAP: Temporal unit regression\nnetwork for temporal action proposals. arXiv preprint arXiv:1703.06189, 2017.\n\n[28] Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. TALL: Temporal activity localization via\nlanguage query. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,\npages 5267\u20135275, 2017.\n\n[29] Yitian Yuan, Tao Mei, and Wenwu Zhu. To find where you talk: Temporal sentence localization in video\nwith attention based location regression. arXiv preprint arXiv:1804.07014, 2018.\n\n[30] CE Chidume. 
Iterative approximation of fixed points of Lipschitzian strictly pseudocontractive mappings.\nProceedings of the American Mathematical Society, 99(2):283\u2013288, 1987.\n\n[31] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing\nrobust features with denoising autoencoders. In Proceedings of the 25th international conference on\nMachine learning, pages 1096\u20131103. ACM, 2008.\n\n[32] Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved\ncorrelation with human judgments. In Proceedings of the ACL workshop on intrinsic and extrinsic evaluation\nmeasures for machine translation and/or summarization, pages 65\u201372, 2005.\n\n[33] Kyunghyun Cho, Bart Van Merri\u00ebnboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger\nSchwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical\nmachine translation. arXiv preprint arXiv:1406.1078, 2014.\n\n[34] Chin-Yew Lin and Franz Josef Och. Automatic evaluation of machine translation quality using longest\ncommon subsequence and skip-bigram statistics. In Proceedings of the 42nd Annual Meeting on Association\nfor Computational Linguistics, page 605. Association for Computational Linguistics, 2004.\n\n[35] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation\nof machine translation. In Proceedings of the 40th annual meeting on association for computational\nlinguistics, pages 311\u2013318. 
Association for Computational Linguistics, 2002.", "award": [], "sourceid": 1587, "authors": [{"given_name": "Xuguang", "family_name": "Duan", "institution": "Tsinghua University"}, {"given_name": "Wenbing", "family_name": "Huang", "institution": "Tencent AI Lab"}, {"given_name": "Chuang", "family_name": "Gan", "institution": "MIT-IBM Watson AI Lab"}, {"given_name": "Jingdong", "family_name": "Wang", "institution": "Microsoft Research,"}, {"given_name": "Wenwu", "family_name": "Zhu", "institution": "Tsinghua University"}, {"given_name": "Junzhou", "family_name": "Huang", "institution": "University of Texas at Arlington / Tencent AI Lab"}]}