NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 1336
Title: More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation

Reviewer 1

+ Three challenging, large-scale video datasets are used: Something-Something, Kinetics, Moments-in-time.
+ Various ablations are provided. The experimental analyses lead to very interesting observations that are useful for the community. It is interesting to see that TSN [15] does not improve with more frames, whereas bLVNet-TAM does (Fig 3). The performance breakdown for different components in Table 5 is nice. Overall, there are many take-home messages to learn from this paper’s experiments.
- In terms of methodology, the paper combines existing blocks: Big-Little-Net [7], TSN [15], TSM [4]. Therefore, its technical contribution is limited.
* Final rating: I keep my initial score, i.e. 6, after having read the rebuttal and the other reviews. I thank the authors for the clarifications provided. Despite the limited originality, I believe that the paper can be a valuable contribution to the community with its simplicity, positive results, and comprehensive experiments. I encourage the authors to incorporate the clarifications in the revised version.

Reviewer 2

Strengths:
* Simplicity. Both the bLVNet and the TAM are simple models, easy to implement and probably fairly straightforward to train. This is a good property.
* The paper is well-written and the technical approach is easy to comprehend.
* Although the TAM is demonstrated using a frame-based 2D CNN, it is straightforward to extend to 3D CNNs, with potential further gains in accuracy.
* Comprehensive evaluation on 3 large-scale video datasets shows the memory/efficiency/accuracy gains enabled by the two proposed schemes (bLVNet and TAM).

Weaknesses:
* Technical innovation is fairly limited. The bLVNet is a straightforward extension of bLNet (an image model) to video. The TAM involves the use of 1D temporal convolution and depthwise convolution. Both mechanisms have been widely leveraged before. On the other hand, the paper does not make bold novelty claims and recognizes the contribution as being more empirical than technical. The TAM shares many similarities with Timeception [Hussein et al., CVPR 19], which was not yet published at the time of this submission and thus does not diminish the value of this work. Nevertheless, given the many analogies between these concurrent approaches, it would be advisable to discuss their relations in future versions (or the camera-ready version) of the paper.
* While the memory/efficiency gains are convincingly demonstrated, they are not substantial enough to be a game-changer in the practice of training video understanding models. Due to the overhead of setting up the proposed framework (even though it is quite simple), adoption by the community may be fairly limited.

Final rating:
- After having read the other reviews and the author responses, I have decided to maintain my initial rating (6). The contribution of this work is mostly empirical. The stronger results compared to more complex models and the promise to release the code imply that this work deserves to be known, even if fairly incremental.

Reviewer 3

1. Originality: WEAK
a) I think this is the weakest part of this paper. Almost all of the contributions of this work have been individually explored. For example:
b) The idea of parallel pathways for low-res/high-res processing was explored in SlowFast, 2-stream networks etc. Granted, the authors use a somewhat different design from the ICLR'19 paper, where the spatial resolution is changed across the pathways, but the core idea is fairly well explored.
c) The aggregation layer (TAM) is essentially TSM [4] with learned weights, and gets a few percentage points extra in performance.
d) Missing related work: Aggregating context temporally for video representation learning has been explored in many previous works, which would be good to report in the related work. I point to some here.
- Aggregating using VLAD/Fisher vectors etc: Action recognition with stacked fisher vectors (ECCV'14), Learnable pooling with Context Gating for video classification (CVPR'17), ActionVLAD (CVPR'17), SeqVLAD (TIP'18) etc
- Aggregation using attention: Attention Clusters (CVPR'18), Video Action Transformer Network (CVPR'19), Long-Term Feature Banks for Detailed Video Understanding (CVPR'19) etc
- Other temporal modeling architectures: Timeception for Complex Action Recognition (CVPR'19), Videos as space-time region graphs (ECCV'18) etc

2. Quality: GOOD
I think the authors do a good job of running thorough experiments and comparing the performance of recent works along with their computational/memory costs. The ablations are useful as well.

3. Clarity: WEAK
Quite a few aspects of the model were not immediately clear to me. I would encourage the authors to clarify in the rebuttal:
a) Splitting the video into odd/even frames, setting odd as big and even as little: This seems very ad hoc. Why enforce this rule? Why not just use pairs of frames and use one at lower resolution and the other at higher? Is there a reason odd frames in the video must be bigger?
b) What is the train-time complexity of the model? Since the aggregation layer has to be trained, and needs at least "r" temporally-shifted clips at the same time, it would limit the training batch size (something like TSN would not have that issue). Is that a limiting factor at all? I would like to see more discussion on that aspect in the final version.
c) L221: What is "single-crop single-frame" testing? I assume it is done in TSN style, so for the SS-V1 model, which uses 32x2 frames, you have 32 segments at test time and use a pair of frames from each segment (odd and even).
d) If my understanding in (c) is correct, then what is the "multi-crop" setup used in Kinetics? How many frames are being used in Table 3?
e) I am assuming the "Frames" column in the tables reports the *TOTAL* frames used in inference, including multiple crops etc. Is that correct?

4. Significance: MODERATE
While the work doesn't significantly improve on the state of the art, it does seem to propose a cheaper alternative. That can be very useful for research groups with limited resources to work on related areas, if the code is made available. However, it's not clear from the paper whether the code for reproducing the reported results will be released.

Final rating
========
I have looked through the other reviews and author feedback. I appreciate the authors' efforts in responding to my concerns and clarifying parts of the paper. As all reviewers note, the technical novelty of the work is limited, though the good performance on standard benchmarks with lower computation might be valuable. Given the newer results in the rebuttal and the promise to release code, I am upgrading my rating to 6. However, I still think the writing and presentation need quite a bit more work to explain the approach and setup clearly.