{"title": "Sequence-to-Segment Networks for Segment Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 3507, "page_last": 3516, "abstract": "Detecting segments of interest from an input sequence is a challenging problem which often requires not only good knowledge of individual target segments, but also contextual understanding of the entire input sequence and the relationships between the target segments.  To address this problem, we propose the Sequence-to-Segment Network (S$^2$N), a novel end-to-end sequential encoder-decoder architecture. S$^2$N first encodes the input into a sequence of hidden states that progressively capture both local and holistic information. It then employs a novel decoding architecture, called Segment Detection Unit (SDU), that integrates the decoder state and encoder hidden states to detect segments sequentially.  During training, we formulate the assignment of predicted segments to ground truth as bipartite matching and use the Earth Mover's Distance to calculate the localization errors. We experiment with S$^2$N on temporal action proposal generation and video summarization and show that S$^2$N achieves state-of-the-art performance on both tasks.", "full_text": "Sequence-to-Segments Networks for Segment Detection\n\nZijun Wei1, Boyu Wang1, Minh Hoai1, Jianming Zhang2, Zhe Lin2, Xiaohui Shen3, Radom\u00edr M\u011bch2, Dimitris Samaras1\n\n1Stony Brook University, 2Adobe Research, 3ByteDance AI Lab\n\nAbstract\n\nDetecting segments of interest from an input sequence is a challenging problem\nwhich often requires not only good knowledge of individual target segments, but\nalso contextual understanding of the entire input sequence and the relationships\nbetween the target segments. To address this problem, we propose the Sequence-to-Segments\nNetwork (S2N), a novel end-to-end sequential encoder-decoder architecture. 
S2N \ufb01rst encodes the input into a sequence of hidden states that progressively\ncapture both local and holistic information. It then employs a novel decoding\narchitecture, called Segment Detection Unit (SDU), that integrates the decoder\nstate and encoder hidden states to detect segments sequentially. During training,\nwe formulate the assignment of predicted segments to ground truth as the bipartite\nmatching problem and use the Earth Mover\u2019s Distance to calculate the localization\nerrors. Experiments on temporal action proposal and video summarization show\nthat S2N achieves state-of-the-art performance on both tasks.\n\n1\n\nIntroduction\n\nWe address the problem of detecting temporal segments of \u201cinterest\u201d in an input time series. Here\nwe de\ufb01ne \u201cinterest\u201d as an abstract concept that denotes the parts of the data that have the highest\n(application dependent) semantic values. We assume there are training time series with annotated\nsegments of interest (e.g., labeled by humans), and our goal is to train a neural network that can\ndetect the segments of interest in unseen time series. This general problem arises in many situations\nincluding temporal event detection [17, 18], video summarization [47, 48], sentence chunking [32],\ngene localization [24], and discriminative localization [19, 31]. For human action detection, the\nsegments of interest are the ones that correspond to the temporal extents of human actions. For video\nsummarization, the segments of interest are the video snippets that summarize the video.\nA typical approach to address this problem is to train a classi\ufb01er to separate the annotated segments of\ninterest from some negative examples. Once trained, the classi\ufb01er can be used to evaluate individual\ncandidate segments of the input time series in a sliding window approach to identify the segments of\ninterest. This approach however has two drawbacks. 
First, the computational complexity depends on\nthe number of candidate segments, and this scales quadratically with the length of the time series.\nSecond, the independent evaluation of each segment is suboptimal for many situations because\n\u201cinterest\u201d might be a contextual concept. To detect a set of target segments, not only do we need\nto evaluate the local content of individual segments, but also their collective relationships and their\nroles in the global context. Taking video summarization as an example, to summarize a video, it is\nimportant to know and preserve the gist of the video, and this requires a holistic analysis of the video.\nFurthermore, the set of selected video snippets should not overlap temporally or semantically, and\nthis can only be avoided by collectively evaluating the segments. The second drawback of the sliding\nwindow classification approach is commonly addressed by applying a post-processing step such as\nnon-maximum suppression, but the addition of post-processing steps creates a pipeline that cannot be\noptimized end-to-end.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nIn this paper we propose the Sequence-to-Segments Network (S2N), a novel recurrent neural network\nfor analyzing a time series to detect temporal segments of interest. Our network is based on the\nsequential encoder-decoder architecture [40]. The encoder network encodes the time series and\nproduces a sequence of hidden states that progressively capture information, from local to holistic,\nabout the time series. The decoder network takes the final state of the encoder network as its starting\nstate and outputs one segment of interest at a time. The state is updated to incorporate what has\nalready been output. This alleviates the need for a post-processing step that may not have access\nto the time series information. 
The whole encoder-decoder pipeline can be optimized end-to-end.\nFor the decoder network, we introduce a novel architecture, named Segment Detection Unit (SDU),\nwhich outputs a segment based on the decoding state and the hidden states of the encoder. The SDU\nlocalizes the segment of interest by pointing to the boundaries of the segment, similar to the pointer\nnetwork [43]. The SDU also outputs a con\ufb01dence value for the selected segment. The computational\ncomplexity of SDU is linear with respect to the length of the input sequence, which is more ef\ufb01cient\nthan the quadratic complexity of the sliding window approach.\nTo train an S2N, we optimize a loss function that is de\ufb01ned based on the localization offsets and\nthe recall rate of the proposed segments. This loss function is computed based on the minimum\nmatching cost between the target segments of interest and the sequence of detected segments [13, 39].\nInspired by [39], we use a lexicographic comparison function for the detection-target pairs and use\nthe Hungarian algorithm to \ufb01nd the best matching. In addition, we use the Earth Mover\u2019s Distance\nloss that accounts for the localization error to train the boundary pointing modules of the SDU.\nThe major contributions of this paper are: (1) We propose S2N, a novel network architecture for\ndetecting segments of interest in video. (2) We design a matching algorithm and an Earth Mover\u2019s\nDistance based loss function for the training of S2Ns. (3) We show that S2Ns outperform the state-of-\nthe-art methods in two real-world applications: human action proposal and video summarization.\n\n2 Related Work\n\nRecurrent Neural Networks (RNNs) have been the standard method for learning functions over\nsequences from examples for a long time [34]. To further remove the constraint that the number of\noutputs is dependent on the number of inputs, Sutskever et al. 
[40] recently proposed the sequence-to-sequence\nparadigm that first uses one RNN to map an input sequence to a state and then applies\nanother RNN to output a sequence with arbitrary length based on the encoded state. Bahdanau\net al. augmented the decoder by propagating extra contextual information from the input using a\ncontent-based attentional mechanism [1, 14]. Vinyals et al. [43] modified the attention model to allow\nthe model to directly point to elements in the input sequence, providing a more efficient and accurate\nmodel for element localization. These developments have made it possible to apply RNNs to new\ndomains such as language translation [1, 40], parsing [44], and image and video captioning [7, 45].\nHowever, current RNNs are designed to output one \u201ctoken\u201d of the input sequence at a time, so they\ncannot properly handle the segment detection task, in which a contiguous chunk of the\ninput is selected at each step. Perhaps the most related work to ours is [13], which attempts to train RNNs to\nlabel unsegmented sequences directly. But the goal of [13] is classification, where the localization\ninformation is not required in the output. The proposed S2N simultaneously detects segments and\nestimates their confidence scores, and can thus be applied to different problems such as temporal action\nproposal generation and video summarization.\n\n3 Sequence-to-Segments Network Architecture\n\nIn this section, we describe the S2N. We first formally state the problem. We then describe the\noverall S2N architecture and the details of the proposed Segment Detection Unit (SDU), the core\ncomponent of S2N for localizing a temporal segment of interest.\n\nFigure 1: An Encoder (green) processes the input sequence to create a set of encoding vectors\n({e1, e2, ..., eM}). 
At each decoding step, a Segment Detection Unit (SDU) updates the decoding state\nwith a GRU, and based on the updated state, the SDU points to the beginning (b) and ending positions\n(d) with two separate pointing modules and estimates the con\ufb01dence score (c) of the segment.\n\n3.1 Problem Formulation\nLet X = (x1, x2,\u00b7\u00b7\u00b7 , xM ) be an input time series of length M, where xm \u2208 Rd is the observation\nfeature vector at time m. Our goal is to learn an RNN that can localize a set of segments of interest\nS = (S1,\u00b7\u00b7\u00b7 , SN ) from the input time series X. Here each segment Sn corresponds to a contiguous\nsubsequence of X and it is parameterized by a tuple of three elements (bn, dn, cn) indicating the\nbeginning position bn, the ending position dn, and the estimated interest score cn. There are no\nexplicit constraints on the locations and extents of the segments; the segments can overlap and their\nunion does not have to cover the entire sequence X. Intuitively, many problems that detect temporal\nsegments in a series such as action detection or video summarization can be formulated this way.\n\n3.2 Model Overview\n\nThe proposed S2N is illustrated in Fig. 1. S2N is a sequential encoder-decoder with an attentional\nmechanism [1]. S2N sequentially encodes an input sequence x1,\u00b7\u00b7\u00b7 , xM and obtains a corresponding\nsequence of encoding state vectors e1,\u00b7\u00b7\u00b7 , eM ; the encoding state vector em essentially contains\nintegrated information from x1 to xm [23, 40].\n\n3.3 Segment Detection Unit (SDU)\n\nA key component of the S2N is the Segment Detection Unit (SDU) for localizing a segment of\ninterest. As shown in Fig 1, each SDU has four components: a Gated Recurrent Unit (GRU) [5] for\nupdating and communicating states between time steps, two pointing modules [43] for pointing to\nthe beginning and ending positions of the segment, and a score estimator for evaluating the interest\nscore of the segment. 
Details about these components are described below.\n\nGRU for state update. During decoding, at each step given the previous hidden state hj\u22121 (h0 is\nthe concatenation of the last hidden state and memory cell of the encoder), the GRU module updates\nthe current hidden state: hj = GRU(hj\u22121, z), where z is a learned input vector to the GRU at each\nstep. We refer the reader to [5] for further details about the GRU update function.\nNote that S2N can in theory be used with any RNN architecture, including LSTM, GRU, and\ntheir variants (e.g., [26]). We propose to use GRU [5] because it has a simpler architecture and\nfewer parameters than the others (which means higher training and testing efficiency). We also\nexperimented with LSTM but did not observe a significant difference in terms of model accuracy. This\nis consistent with prior observations [3] and empirical findings from prior work on deep recurrent\nmodels in other domains [5, 6, 22].\nPointing modules for boundary localization. Given the current state hj of an SDU, we predict the\ntwo boundary positions in a manner similar to pointer networks (Ptr-Net) [43]. To localize the beginning\nposition bj of a segment, we use the pointer mechanism as follows:\n\n$$b_j = \\arg\\max_{i} \\, g(h_j, e_i), \\quad \\text{where} \\quad g(h_j, e_i) = v^{T} \\tanh(W_1 e_i + W_2 h_j). \\qquad (1)$$\n\nThe beginning boundary is determined as the location that has the highest response to a pointer\nfunction g. 
The output of this function depends on the state hj of the SDU and the encoding vector\nei of the encoder component.\nOne alternative for predicting the locations is regression (similar to [27, 33]). However, this\napproach outputs a ratio in [0, 1], which does not respect the constraint that the outputs map back\nexactly to the boundaries. As demonstrated in prior works [38, 43], such predictions are blurry over\nlonger sequences.\nNote the difference compared to the original Ptr-Net [43]: the pointer function is defined based on the\nencoding state vector ei instead of the input vector xi. The encoding state vector ei contains richer\ninformation than the input vector xi; ei integrates the progression of the input time series up until\ntime i, and this information is crucial for determining the segment boundaries [29]. In the above, v,\nW1, and W2 are learnable parameters of the pointing module that associates the decoding state with\nthe hidden encoding states.\nSimilarly, the ending position dj is determined using another independent Ptr-Net module. Thus, we\nhave two Ptr-Net modules for determining the locations of the beginning and ending positions.\n\nScore predictor. Finally, we estimate the confidence score of the segment using a two-layer 1D\nconvolutional network with a ReLU activation layer in between.\n\nNo terminal output. We do not design a terminal output for S2N as in [1] for two reasons.\nFirst, the problem we address is to output a ranked list of temporal segments of interest, which is\ndifferent from the problem of sequence-to-sequence translation, in which there is a need for a terminal\nstate. Second, by not having a terminal state, S2N can output as many segments as needed, bringing\nflexibility to different needs in real-world problems.\n\n4 Training a Sequence-to-Segments Network\n\nS2Ns can be trained end-to-end. 
In this section, we first present the loss function, and then\ndescribe how we match the sequence of predicted segments to the set of target segments.\nLet G = {G1,\u00b7\u00b7\u00b7 , GK} denote the set of ground truth segments and S = (S1,\u00b7\u00b7\u00b7 , SN ) the sequence\nof segments produced by the S2N. Given an assignment strategy for matching G to S, we will have\nan injective mapping f : {1,\u00b7\u00b7\u00b7 , K} \u2192 {1,\u00b7\u00b7\u00b7 , N}, where f (k) indicates that the ground truth\ninstance Gk should be matched to Sf (k). Then, the loss value for the predicted sequence of segments\nand the set of ground truth instances is computed as follows:\n\n$$L(G, S, f) = \\alpha \\sum_{k=1}^{K} L_{loc}(G_k, S_{f(k)}) + \\sum_{n=1}^{N} L_{conf}(S_n, \\delta_n), \\qquad (2)$$\n\nwhere \u03b4n is the desired confidence value for Sn (depending on whether Sn is matched to a ground\ntruth instance in G). Lloc and Lconf are the loss functions for localization and confidence score\nprediction, which will be explained below.\n\nLoss function for localization. For a given probability distribution over the location of the segment\nboundary returned by the pointing module, one way to define the localization loss is to use the\ncross-entropy loss as in [43]. However, this loss function is unsuitable for boundary localization\nbecause it is insensitive to the amount of localization error: a prediction one step away from the true\nboundary is penalized in the same way as one that is far away, so the loss does not provide\nmeaningful gradients for the training process.\nWe propose to use a loss function that is defined based on the Earth Mover\u2019s Distance (EMD) between\nthe probability distribution of the predicted boundary and the distribution that represents the ground\ntruth boundary. We now explain how this loss function can be computed for the beginning position\nb (the loss for the ending position d is computed similarly). Recall from Eq. 
(1) that we determine\nthe beginning location of a segment as the maximum of a response function: b = argmax_i g(h, ei),\nwhere h is the state vector of the SDU. We define the probability of picking i as the boundary point\nbased on the soft-max function:\n\n$$Pr(b = i) = \\frac{\\exp(g(h, e_i))}{\\sum_{i'=1}^{M} \\exp(g(h, e_{i'}))}.$$\n\nLet p\u2217 be the binary\nindicator vector for the ground truth location of the segment boundary; p\u2217(i) = 1 if i is the ground\ntruth boundary and 0 otherwise. The EMD loss can be computed based on the differences between the two\ncumulative distributions:\n\n$$L^{b}_{loc} = \\sum_{m=1}^{M} \\left( \\sum_{i=1}^{m} Pr(b = i) - \\sum_{i=1}^{m} p^{*}(i) \\right)^{2}. \\qquad (3)$$\n\nWe use the squared loss in Eq. (3) because it usually converges faster than an L1 loss and is easier\nto optimize with gradient descent [20, 28, 35]. The loss for predicting the ending position is\nsimilarly defined, and the total localization loss is $L_{loc} = L^{b}_{loc} + L^{d}_{loc}$.\n\nLoss function for confidence estimation. Recall that the S2N predicts a confidence value cn for\neach segment Sn. We can use the cross-entropy loss to measure the compatibility between cn and\n\u03b4n: Lconf (Sn, \u03b4n) = \u2212\u03b4n log(cn) \u2212 (1 \u2212 \u03b4n) log(1 \u2212 cn). For some applications, such as video\nsummarization, the desired confidence value for each segment Sn is not necessarily binary. In this\ncase, we can use the L2 loss function, i.e., Lconf (Sn, \u03b4n) = (cn \u2212 \u03b4n)^2.\n\nAssignment Strategy. To implement the above loss functions, we need an assignment strategy to\nmatch the target segments to the predicted ones. We follow the bipartite matching strategy based\non the Hungarian loss used in [39]. 
Specifically, we define the matching cost between a predicted\nsegment Sn and a ground truth Gk using a triplet cost function:\n\n$$\\Delta(G_k, S_n) = (o_{kn}, n, l_{kn}). \\qquad (4)$$\n\nThe function $\\Delta : G \\times S \\to \\mathbb{R}^3$ returns a tuple, where lkn is the L1 distance between Gk and Sn, and okn\nindicates whether there is significant overlap between Gk and Sn:\n\n$$o_{kn} = \\begin{cases} 1 & \\text{if } IoU(G_k, S_n) \\geq 0.5 \\\\ 0 & \\text{otherwise.} \\end{cases} \\qquad (5)$$\n\nWe can use the Hungarian algorithm to determine the best matching with lexicographic preference:\n\n$$\\sum_{k=1}^{K} \\Delta(G_k, S_{f(k)}) = \\left( \\sum_{k=1}^{K} o_{kf(k)}, \\; \\sum_{k=1}^{K} f(k), \\; \\sum_{k=1}^{K} l_{kf(k)} \\right). \\qquad (6)$$\n\nIn words, the Hungarian algorithm first finds the best matching based on o only. For tie-breaking, it\nwill consider n, and then l if necessary. For more details, see [39].\n\n5 Experiments\n\n5.1 Model Implementation and Hyper-parameters\n\nWe used the same architecture in all experiments even though better results can likely be achieved\nby tuning the model to fit specific problems. Unless specified otherwise, the encoder is a 2-layer\nbi-directional GRU with 512 hidden units and a dropout rate of 0.5, and the GRU module in the SDU is\none-directional with 1024 hidden units. All the models are trained with the Adam optimizer [25] for\n50 epochs with an initial learning rate of 0.0001, which was decreased by a factor of 10 when the\ntraining performance plateaued, a batch size of 32, and L2 gradient clipping of 1.0. The trade-off factor\n\u03b1 in Eq. (2) is set to ensure that Lloc does not dominate the total loss. A weight adjustment for\nthe score predictor is also used if necessary to account for the imbalance between the positive and\nnegative samples. 
The code is publicly available at https://www3.cs.stonybrook.edu/~cvl/projects/wei2018s2n/S2N_NIPS2018s.html\n\n5.2 Temporal Action Proposal\n\nTemporal Action Proposal (TAP) generation, akin to generation of object proposals in images, is an\nimportant problem as accurate extraction of semantically important segments (e.g., human actions)\nfrom untrimmed videos is an important step for large-scale video analysis. In this section we show\nthat an S2N can be trained to generate action proposals.\n\nFigure 2: S2N outperforms previous temporal action proposal generation approaches\non THUMOS14 under various performance metrics. Panels: (a) AR-N, (b) AR-F, (c) Recall@1.0-tIoU.\n\nDataset. We evaluate S2Ns on the THUMOS14 dataset [21], a challenging benchmark for the action\nproposal task. Following the standard practice, we train an S2N on the validation set and evaluate it\non the testing set. On these two sets, 200 and 212 videos have temporal annotations in 20 classes,\nrespectively. The average video duration in THUMOS14 is 233 seconds. The average number of\nlabeled actions in each video is around 15. The average action duration is 4 seconds and more than\n99% of the actions are within 10 seconds. We train an S2N using 180 out of 200 videos from the\nvalidation set and hold out 20 videos for validation.\n\nImplementation. For each video, we extract C3D features [41] following [3, 8]. To address the\nproblem of long videos, we split each video into overlapping chunks of 360 frames (~12s) and\nsubsample every 4 frames. 
We set the number of proposals generated from each chunk to be 15,\nwhich is the largest possible number of ground truth proposals contained in a chunk during training.\nWe combine the proposals from chunks, sort them by their scores, and apply a Non-Maximum\nSuppression (NMS) with a 75% temporal intersection over union (tIoU). Note this is the only\npost-processing step used to address the overlap introduced in splitting the videos.\nMetrics. We compare S2N with the baselines under the following metrics:\nAR-N [46]: AR-N measures average recall (AR) as a function of number of proposals per video. Note\nthat the numbers of retrieved proposals (N) for all the test videos are the same regardless of their\nlengths. Under this metric, we limit N to 300 considering that on average each video only contains\n15 actions.\nAR-F [9]: AR-F measures average recall (AR) as a function of proposal frequency (F ), which denotes\nthe number of retrieved proposals per second for a video. For a video of length Li seconds and\nproposal frequency of F , the retrieved proposal number of this video is Ni = F \u00d7 Li.\nRecall@F-tIoU [9]: this metric measures the recall rate at proposal frequency F with regard to\ndifferent tIoUs. In the evaluation, we set F = 1.0 following [9].\n\nBaselines. We compare S2N to the state-of-the art TAP generation methods including DAPs [8]\nthat uses an encoder LSTM and a regression branch for localization, Sparse-prop [4] that applies\ndictionary learning for class independent proposal generation over a large set of candidate proposals,\nand TURN-TAP [9] that evaluates candidate proposals in a sliding window manner over different\ntemporal scales and level of contexts (we compare with variants of TURN-TAP based on different\nfeatures and denote them as TURN-C3D and TURN-FLOW). We also compare with sliding window\nand random generators. For the DAPs, Sparse-prop, and TURN-TAPS, we plot the curves using the\ngenerated proposals provided by the authors. 
The sliding window proposals and random proposals\nare generated following [9].\n\nResults. The comparisons to the baselines under the AR-N, AR-F, and Recall@F=1.0-tIoU metrics are shown\nin Fig. 2. S2N outperforms the baselines by a significant margin over all the metrics. Note that the gap\nbetween S2N and DAPs suggests the necessity of considering the contextual information\nand the superiority of the proposed pointing mechanism. Also note that we did not apply any\npost-processing such as using the action length distributions as priors [9, 36], merging neighboring\nproposals, or boundary refinement [9, 37] other than a simple non-maximum suppression step.\n\n[Figure 2 plots: (a) Average Recall vs. average number of proposals; (b) Average Recall vs. proposal frequency; (c) Recall@F=1.0 vs. tIoU. Methods shown: Random, Sliding Window, Sparse-prop, DAPs, TURN-C3D, TURN-FLOW, S2N (proposed).]\n\nTable 1: F1 scores (%) of various video summary methods on the SumMe dataset [15].\n\nMethod               F1 (%)\nInteresting [15]     39.4\nSubmodularity [16]   39.7\nDPP-LSTM [47]        38.6\nGANsup [30]          41.7\nDR-DSNsup [48]       42.1\nS2N (proposed)       43.3\n\nAblation Study. We explore the influence of different label assignment strategies and loss functions on\nthe performance of S2N. Specifically, we compare the proposed S2N with the following variants:\nCLS-FIX: optimize the localization errors using the cross-entropy classification loss as suggested in [43]\nand assign labels to predictions based on a fixed order matching.\nCLS-HUG: optimize the localization errors with the cross-entropy loss and assign labels to predictions\nbased on the Hungarian matching algorithm described in Sec. 4.\nEMD-FIX: optimize the localization errors with the\nEMD loss as in Eq. 
(3) and assign labels based on\nthe \ufb01xed order matching.\nL2-FIX /HUG: optimize the localization errors with\nthe L2 loss as an alternative to EMD loss and assign\nlabels based on the \ufb01xed order matching or the Hungarian matching algorithm.\nAs shown in Fig. 3, the proposed strategy to train the S2N signi\ufb01cantly outperforms its variants. The\nvariant methods tend to generate overlapping proposals so that the post-processing NMS reduces the\neffective number of proposals signi\ufb01cantly.\n\nFigure 3: Comparing different action proposal\nmethods. Best viewed on a digital device.\n\nSpeed. S2N is ef\ufb01cient since it does not require repeated computation over multi-scale context.\nSpeci\ufb01cally, S2N processes each frame in a sequence only once in the encoding stage and outputs a\n\ufb01xed set of segments over the whole sequence in the decoding stage. It is more ef\ufb01cient than recent\nmodels ( [2, 3]) that evaluate on a dense set of highly-overlapped candidates at each temporal step\nin a sequence. Quantitatively, it takes on average 0.028s to process a 12s, 30FPS video on a GTX\nTitan X Maxwell GPU with 12GB memory. In the batch mode, it takes around 2s to generate over\n1200 proposals for an 8-minute video (14400 frames sampled every 4 frames). This is more than two\ntimes faster than the recently proposed models (1800 FPS v.s. 701 FPS [2] v.s. 308 FPS [3] v.s. 134\nFPS [8]).\n\n5.3 Video Summarization\n\nAutomatic video summarization provides a method for humans to browse and analyze video data.\nA good video summarization algorithm need to select a small set of segments that are interesting,\ndiverse, and representative of the original video. In this section we show that S2N can be trained to\nsummarize long videos by generating a set of segments.\n\nDataset. We perform experiments on SumMe [15], a standard benchmark for video summarization.\nSumMe consists of 25 user videos covering various topics such as holidays and sports. 
Each video\nin SumMe ranges from 1 to 6 minutes and is annotated by 15 to 18 people (thus there are multiple\nground truth summaries for each video). We treat each annotation separately and consider all of\nthem ground truth. In this way, S2N is trained to model multiple segment combinations to account\nfor different user annotations (around 450 annotated video instances). We use the canonical setting\nsuggested in [47] for evaluation: we use the standard 5-fold cross validation (5FCV), i.e., 80% of\nvideos are for training and the rest for testing.\n\n[Figure 3 plot: Average Recall vs. average number of proposals for CLS-FIX, L2-HUG, L2-FIX, CLS-HUG, EMD-FIX, TURN-FLOW, and S2N (proposed).]\n\nFigure 4: Visualization of the summarization results. S2N localizes the interesting events in the video\npreferred by the annotators.\n\nImplementation. Similar to temporal action proposal generation, we use C3D features. Each video\nis split into overlapping chunks of 800 frames, subsampled every 8 frames as inputs. We limit\nthe maximum number of output segments to 6. To generate a summary, following the standard\npractice [47, 48], we select segments based on their scores by maximizing the total score while\nensuring that the summary length does not exceed a limit, which is usually 15% of the video length.\nThe maximization step is essentially the 0/1 Knapsack problem. Because SumMe\nhas limited training data, we train each split for exactly 10 epochs and report the performance based\non the last epoch.\n\nEvaluation metric. We follow the commonly used protocol from [16, 47, 48]: we compute the\nF1-score to assess the similarity between the predicted segments and the ground truth summaries. To\ndeal with the existence of multiple ground truth summaries [16], we evaluate the predictions w.r.t.\nthe nearest-human summary, i.e., the one that is the most similar to the automatically created one.\n\nBaselines. 
We compare S2N to multiple state-of-the-art video summary algorithms including\ninterestingness-based summary [15], submodularity-based summary [16], and recent deep\nlearning based models: DPP-LSTM [47] (based on LSTM and determinantal point\nprocesses [11]), GANsup [30] (based on GAN [12] with extra supervision), and DR-DSNsup [48]\n(based on reinforcement learning with supervision).\n\nResults. As shown in Tab. 1, S2N outperforms all other methods. S2N is designed to capture\nall the information needed for generating good summaries. We also visualize an example of the\nsummarization in Fig. 4.\n\n6 Conclusions and Future Work\n\nWe have proposed the Sequence-to-Segments Network (S2N), a novel architecture that uses Segment\nDetection Units (SDU) to detect segments sequentially from an input sequence. We have shown that\nS2N can be applied to real-world problems and achieve state-of-the-art performance.\nThere are a few directions for future work. One direction is to augment the encoding stage to be\ncapable of recording longer sequences [26]. Another possible direction is to extend S2N to more\ncomplex problems such as action detection in untrimmed videos. A third direction is to introduce\nauxiliary losses to enforce explicit semantic constraints on S2N [2]. It is also possible to base S2N on\nthe fully convolutional encoder-decoder architecture [10, 42].\n\nAcknowledgements. This project was partially supported by NSF-CNS-1718014, NSF-IIS-1763981,\nNSF-IIS-1566248, the Partner University Fund, the SUNY2020 Infrastructure Transportation Security\nCenter, and a gift from Adobe.\n\nReferences\n\n[1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate.\n\nIn Proceedings of the International Conference on Learning and Representation, 2014.\n\n[2] S. Buch, V. Escorcia, B. Ghanem, L. Fei-Fei, and J. Niebles. 
End-to-end, single-stream temporal action\ndetection in untrimmed videos. In Proceedings of the British Machine Vision Conference, 2017.\n\n[3] S. Buch, V. Escorcia, C. Shen, B. Ghanem, and J. C. Niebles. SST: Single-stream temporal action proposals.\nIn Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.\n\n[4] F. Caba Heilbron, J. Carlos Niebles, and B. Ghanem. Fast temporal activity proposals for efficient detection\nof human actions in untrimmed videos. In Proceedings of the IEEE Conference on Computer Vision and\nPattern Recognition, 2016.\n\n[5] K. Cho, B. Van Merri\u00ebnboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning\nphrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of\nInternational Conference on Empirical Methods in Natural Language Processing, 2014.\n\n[6] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on\nsequence modeling. arXiv:1412.3555, 2014.\n\n[7] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell.\nLong-term recurrent convolutional networks for visual recognition and description. In Proceedings of the\nIEEE Conference on Computer Vision and Pattern Recognition, 2015.\n\n[8] V. Escorcia, F. C. Heilbron, J. C. Niebles, and B. Ghanem. DAPs: Deep action proposals for action\nunderstanding. In Proceedings of the European Conference on Computer Vision, 2016.\n\n[9] J. Gao, Z. Yang, K. Chen, C. Sun, and R. Nevatia. TURN TAP: Temporal unit regression network for temporal\naction proposals. In Proceedings of the International Conference on Computer Vision, 2017.\n\n[10] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin. Convolutional sequence to sequence\nlearning. In Proceedings of the International Conference on Machine Learning, 2017.\n\n[11] B. Gong, W.-L. Chao, K. Grauman, and F. Sha. 
Diverse sequential subset selection for supervised video\nsummarization. In Advances in Neural Information Processing Systems, 2014.\n\n[12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio.\nGenerative adversarial nets. In Advances in Neural Information Processing Systems, 2014.\n\n[13] A. Graves, S. Fern\u00e1ndez, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling\nunsegmented sequence data with recurrent neural networks. In Proceedings of the International Conference\non Machine Learning, 2006.\n\n[14] A. Graves, G. Wayne, and I. Danihelka. Neural Turing machines. arXiv:1410.5401, 2014.\n\n[15] M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool. Creating summaries from user videos. In\nProceedings of the European Conference on Computer Vision, 2014.\n\n[16] M. Gygli, H. Grabner, and L. Van Gool. Video summarization by learning submodular mixtures of\nobjectives. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.\n\n[17] M. Hoai and F. De la Torre. Max-margin early event detectors. International Journal of Computer Vision,\n107(2):191\u2013202, 2014.\n\n[18] M. Hoai, Z.-Z. Lan, and F. De la Torre. Joint segmentation and classification of human actions in video. In\nProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011.\n\n[19] M. Hoai, L. Torresani, F. De la Torre, and C. Rother. Learning discriminative localization from weakly\nlabeled data. Pattern Recognition, 47(3):1523\u20131534, 2014.\n\n[20] L. Hou, C.-P. Yu, and D. Samaras. Squared earth mover\u2019s distance-based loss for training deep neural\nnetworks. arXiv:1611.05916, 2016.\n\n[21] Y.-G. Jiang, J. Liu, A. Roshan Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. THUMOS\nchallenge: Action recognition with a large number of classes. http://crcv.ucf.edu/THUMOS14/,\n2014.\n\n[22] R. Jozefowicz, W. Zaremba, and I. 
Sutskever. An empirical exploration of recurrent network architectures.\nIn Proceedings of the International Conference on Machine Learning, 2015.\n\n[23] A. Karpathy, J. Johnson, and L. Fei-Fei. Visualizing and understanding recurrent networks. In Proceedings\nof the International Conference on Learning Representations, 2016.\n\n[24] D. R. Kelley, Y. A. Reshef, D. Belanger, C. McLean, J. Snoek, and M. Bileschi. Sequential regulatory\nactivity prediction across chromosomes with convolutional neural networks. Genome Research, 2018.\n\n[25] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of the International\nConference on Learning Representations, 2015.\n\n[26] S. Li, W. Li, C. Cook, C. Zhu, and Y. Gao. Independently recurrent neural network (IndRNN): Building a\nlonger and deeper RNN. arXiv:1803.04831, 2018.\n\n[27] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox\ndetector. In Proceedings of the European Conference on Computer Vision, 2016.\n\n[28] D. G. Luenberger. Introduction to linear and nonlinear programming. Addison-Wesley publishing\ncompany, 1973.\n\n[29] S. Ma, L. Sigal, and S. Sclaroff. Learning activity progression in LSTMs for activity detection and early\ndetection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.\n\n[30] B. Mahasseni, M. Lam, and S. Todorovic. Unsupervised video summarization with adversarial LSTM\nnetworks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.\n\n[31] M. H. Nguyen, L. Torresani, F. De la Torre, and C. Rother. Weakly supervised discriminative localization\nand classification: a joint learning process. In Proceedings of the International Conference on Computer\nVision, 2009.\n\n[32] N. Peng and M. Dredze. 
Named entity recognition for Chinese social media with jointly trained embeddings.\nIn Proceedings of International Conference on Empirical Methods in Natural Language Processing, 2015.\n\n[33] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection.\nIn Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.\n\n[34] D. Rumelhart, G. Hinton, and R. Williams. Learning internal representations by error propagation. In\nParallel Distributed Processing, volume 1, chapter 8, pages 318\u2013362. MIT Press, Cambridge, MA, 1986.\n\n[35] S. Shalev-Shwartz and A. Tewari. Stochastic methods for l1-regularized loss minimization. Journal of\nMachine Learning Research, 12(Jun):1865\u20131892, 2011.\n\n[36] Z. Shou, D. Wang, and S.-F. Chang. Temporal action localization in untrimmed videos via multi-stage\nCNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.\n\n[37] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S.-F. Chang. CDC: Convolutional-de-convolutional\nnetworks for precise temporal action localization in untrimmed videos. In Proceedings of the IEEE\nConference on Computer Vision and Pattern Recognition, 2017.\n\n[38] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using\nLSTMs. In Proceedings of the International Conference on Machine Learning, 2015.\n\n[39] R. Stewart, M. Andriluka, and A. Y. Ng. End-to-end people detection in crowded scenes. In Proceedings\nof the IEEE Conference on Computer Vision and Pattern Recognition, 2016.\n\n[40] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances\nin Neural Information Processing Systems, 2014.\n\n[41] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D\nconvolutional networks. 
In Proceedings of the International Conference on Computer Vision, 2015.\n\n[42] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin.\nAttention is all you need. In Advances in Neural Information Processing Systems, 2017.\n\n[43] O. Vinyals, M. Fortunato, and N. Jaitly. Pointer networks. In Advances in Neural Information Processing\nSystems, 2015.\n\n[44] O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton. Grammar as a foreign language. In\nAdvances in Neural Information Processing Systems, 2015.\n\n[45] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In\nProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.\n\n[46] G. Yu and J. Yuan. Fast action proposals for human action detection and search. In Proceedings of the\nIEEE Conference on Computer Vision and Pattern Recognition, 2015.\n\n[47] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman. Video summarization with long short-term memory. In\nProceedings of the European Conference on Computer Vision, 2016.\n\n[48] K. Zhou and Y. Qiao. Deep reinforcement learning for unsupervised video summarization with\ndiversity-representativeness reward. 
In Proceedings of the AAAI Conference on Artificial Intelligence, 2017.", "award": [], "sourceid": 1794, "authors": [{"given_name": "Zijun", "family_name": "Wei", "institution": "Stony Brook University"}, {"given_name": "Boyu", "family_name": "Wang", "institution": "Stony Brook University"}, {"given_name": "Minh Hoai", "family_name": "Nguyen", "institution": "Stony Brook University"}, {"given_name": "Jianming", "family_name": "Zhang", "institution": "Adobe Research"}, {"given_name": "Zhe", "family_name": "Lin", "institution": "Adobe Research"}, {"given_name": "Xiaohui", "family_name": "Shen", "institution": "ByteDance AI Lab"}, {"given_name": "Radomir", "family_name": "Mech", "institution": "Adobe Systems Incorporated"}, {"given_name": "Dimitris", "family_name": "Samaras", "institution": "Stony Brook University"}]}