Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The paper explores end-to-end learned video compression, which is relatively unexplored in the literature and likely to be impactful for real-world applications. The proposed method extends the model proposed in  to the video setting. The authors propose to add a global variable that captures information about the sequence of frames in the video. Local (frame) and global (video) variables are trained with well-known techniques from amortized variational inference, using parametric function approximators. The method is well executed, and experiments show that the global variable helps achieve, on the datasets presented, lower distortion at any given compression rate. The paper is well written and clear, and it is overall an enjoyable read.

Pros:
+ The paper deals with a relatively unexplored domain.
+ The set of baselines is rather complete, and the experiments show the superiority of the proposed approach.

Cons:
- The methodological novelty is limited, as the model is a straightforward extension of .
- The datasets are rather simplistic, and the model is evaluated on videos only 10 frames long (which is quite short). Standard codecs seem to work better than the proposed model when applied to longer videos. In general, the experiments do not give a good idea of the applicability of the proposed model to more realistic settings (longer or higher-resolution videos).
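The local/global decomposition described above can be illustrated with a toy numerical sketch. This is purely illustrative: `encode_video`, `decode_video`, and the mean/residual split are my own stand-ins, not the paper's learned inference networks, which would be trained via amortized variational inference.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_video(frames):
    """Toy sketch of a global + local latent split.

    A global latent summarizes the whole clip; local latents
    capture per-frame residual information. (Illustrative only:
    real encoders are learned neural networks, not a mean/residual split.)
    """
    f_global = frames.mean(axis=0)   # global latent: clip-level summary
    z_local = frames - f_global      # local latents: per-frame residuals
    return f_global, z_local

def decode_video(f_global, z_local):
    # Reconstruct each frame from the shared global code plus its local code.
    return f_global + z_local

frames = rng.normal(size=(10, 8))    # 10 frames, 8 features each
f, z = encode_video(frames)
recon = decode_video(f, z)
assert np.allclose(recon, frames)
```

The point of the split is that clip-level structure is stored once in `f_global`, so the per-frame codes `z_local` carry less information and compress better.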
Originality:
- Using deep learning methods for video compression is still underexplored and poses an interesting research direction compared to current (handcrafted) methods.
- End-to-end framework: extension of a VAE with entropy coding to remove redundancy in the latent space -> a combination of a well-known DL model with previous work on image compression, here applied to video data.
- (Uncited) recent work: 'DVC: An End-to-end Deep Video Compression Framework', Lu et al., CVPR 2019.

Quality:
- The use of a global encoding of the entire sequence might limit the applicability of the approach, e.g., for encoding and transmitting live videos. Furthermore, the current approach seems to be limited to a small, fixed sequence length.
- The relation of Eq. 3 to Eq. 4 is not obvious. Eq. 3 only conditions on the z_t up to time t, while Eq. 4 accesses all z_t of the sequence for the global encoding.
- The evaluation demonstrates superior performance compared to traditional compression approaches on three datasets with varying degrees of realism/difficulty.
- An ablation study is provided that demonstrates the benefit of the model components (additional global representation for the entire sequence, predictive model).

Significance:
- The combination of local and global features is well motivated, and the global feature is shown to have a significant impact on performance. However, the usability of the approach seems limited (small sequence length, global encoding of the complete sequence).
- The evaluation was performed only on short (10 frames), low-resolution (64x64) videos. Superior results compared to traditional approaches were mainly achieved on special-domain videos; the improvement on the diverse Kinetics600 set is relatively small and was only evaluated within a narrow range of image-quality scores. (Although the authors express their interest in examining the extension to full-resolution videos, it remains questionable whether this approach is feasible due to the high memory/GPU requirements.)
Clarity:
- Clear motivation for the approach (video accounts for a large share of Internet traffic; a DL approach is a promising alternative to the current state of the art).
- l. 208: what is the time-dependent context c_t?

Minor:
- If possible, figures should be shown on the pages where they are mentioned.
- Fig. 5, referenced in Section 4.3, is missing or has been wrongly referenced.
- The first equation (p. 4) is not numbered.
- Check references, e.g., Higgins 2016: the journal/booktitle is missing.
- The usage of abbreviations is not consistent.
Originality: While there does exist work on modeling video data with deep generative models, the authors are the first (to the best of my knowledge) to propose a neural, end-to-end video codec based on VAEs and entropy coding. The method offers a simple way to discretize the continuous latent space to learn a binary coding scheme for the compressed video. Although this has been explored in the context of image compression (e.g., Townsend 2019), it is important and useful. The generative model is actually quite similar in spirit to (Li & Mandt 2018), but with the added component of discretizing the latent space/entropy coding.

Quality: The authors test their proposed method on 3 video datasets (Sprites, BAIR, Kinetics600) and evaluate their results using a variety of metrics (bpp, PSNR, MS-SSIM). Because there are no existing baselines to compare their method against, the authors provide a series of baselines to test the effect of each component of their model. The authors also clearly state the limitations of their method (GPU memory limitations with respect to the resolution at which they can compress videos, etc.). Although 64x64 videos are small, I believe this method is a great starting point for future work.

Clarity: The paper was well written, self-contained, and easy to follow, which I appreciated. The presentation of the model was clear as well.

Significance: As video data comprises a significant proportion of modern-day communication traffic on the Internet, the impact of this work is indeed significant.

----------------------------------------
UPDATE: Although I appreciated the authors' feedback, I wish they had addressed more of my questions in the Improvements section (e.g., regarding the strange plots with the bitrates and the disentangled representations).
However, as the authors noted that they will include additional experiments on longer video sequences in the final version, which addresses something I was particularly concerned about, I will keep my score as is.
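The discretize-then-entropy-code idea mentioned in the review above can be sketched in a few lines. This is a toy illustration, not the paper's actual scheme: `quantize` and `bits_estimate` are hypothetical helpers, and a real codec would pair a learned prior with an arithmetic coder rather than an empirical histogram.

```python
import numpy as np
from collections import Counter

def quantize(z, step=0.5):
    # Hard quantization: round continuous latents onto a uniform grid,
    # turning them into discrete symbols that an entropy coder can handle.
    return np.round(z / step).astype(int)

def bits_estimate(symbols):
    # Empirical code-length estimate: sum of -log2 p(symbol),
    # with p taken from the symbols' own histogram.
    counts = Counter(symbols.ravel().tolist())
    total = symbols.size
    probs = {s: c / total for s, c in counts.items()}
    return -sum(np.log2(probs[s]) for s in symbols.ravel().tolist())

rng = np.random.default_rng(0)
z = rng.normal(size=(10, 16))        # toy latents for a 10-frame clip
q = quantize(z)
print(f"estimated code length: {bits_estimate(q):.1f} bits for {z.size} latents")
```

The rate-distortion trade-off the reviews discuss shows up directly here: a larger `step` merges more latent values into the same symbol, lowering the bit estimate at the cost of reconstruction error.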