NeurIPS 2020

RNNPool: Efficient Non-linear Pooling for RAM Constrained Inference


Review 1

Summary and Contributions: The authors propose an RNNPool block that downsamples activation maps over patches of up to 64x64, decreasing memory and computation cost. Because the pooling is parameterized by RNNs, it retains more information than traditional pooling operators and can also incrementally add model capacity.

Strengths: The experiments are meticulous: the authors design datasets of small 8-bit monochrome images to illustrate the fitting ability of RNNPool, and they give a formal derivation of the memory cost. Additionally, the authors evaluate their approach on the Visual Wake Words dataset; by inserting RNNPool at the beginning and end of the network, the resulting model achieves accuracy within 0.6% of the baseline at a far smaller memory cost (250 KB -> ~33 KB).

Weaknesses:
- Absence of an inference-speed test. The title, "Efficient Non-linear Pooling for RAM Constrained Inference," calls not only for a detailed calculation of RAM cost but also for measurements of how inference speed changes, especially since introducing an RNN usually hurts the parallelism of the computation. Unfortunately, the authors seem to ignore this significant aspect of efficiency.
- Inappropriate experimental comparison. The performance of a model depends largely on how well its capacity matches the task difficulty; a big model is usually not at an advantage over a smaller one on a small dataset. The DenseNet121-on-ImageNet-10 experiment appears to demonstrate efficiency with a big model on a small dataset. It therefore remains unclear whether the large savings in memory and computation, without an obvious drop in accuracy, obtained by replacing C1, P1, D1, and T1 in DenseNet121 with RNNPool are due to the superiority of RNNPool or are simply the expected result of an easier task. It would be more convincing if the authors ran the experiment on ImageNet-1K.

---------------------------- After rebuttal ----------------------------
The authors' feedback addressed my questions on inference speed very well. For the performance comparison, I still think it would be more complete to show results for all the architectures in various settings on each dataset. Overall, this paper proposes a simple idea that could have good impact in many application domains, so I recommend acceptance.

Correctness: The paper sounds technically correct.

Clarity: The writing is good and the structure is well organized.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:


Review 2

Summary and Contributions: An efficient and effective RNN-based pooling operator (termed RNNPool) is proposed for rapidly downsampling activation map sizes, which is more effective than traditional average or max pooling, and also more effective than ReNet. RNNPool can be simply embedded into most existing CNN based architectures by replacing several stacks of convolutions and pooling layers or one pooling layer. Extensive experiments show the RNNPool operator can significantly decrease computational complexity and peak memory usage for inference while retaining comparable accuracy for different vision tasks.
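
To make the operator concrete, below is a minimal PyTorch sketch of my reading of RNNPool (my own construction, not the authors' released code): a shared first RNN sweeps every row and every column of a patch, and a second bidirectional RNN summarizes the row summaries and column summaries into a single 4*h2 vector. Plain GRU cells and the hidden sizes h1, h2 are placeholders for the lightweight recurrent cells and settings used in the paper.

```python
import torch
import torch.nn as nn

class RNNPoolSketch(nn.Module):
    """Sketch of an RNNPool-style operator: summarize an (r x c x f) patch
    into a single vector of size 4*h2."""
    def __init__(self, in_channels, h1=8, h2=8):
        super().__init__()
        self.rnn1 = nn.GRU(in_channels, h1, batch_first=True)            # shared over rows and columns
        self.rnn2 = nn.GRU(h1, h2, batch_first=True, bidirectional=True)

    def forward(self, patch):                    # patch: (B, r, c, f)
        B, r, c, f = patch.shape
        rows = patch.reshape(B * r, c, f)        # each row as a length-c sequence
        cols = patch.permute(0, 2, 1, 3).reshape(B * c, r, f)
        _, h_rows = self.rnn1(rows)              # (1, B*r, h1): one summary per row
        _, h_cols = self.rnn1(cols)              # (1, B*c, h1): one summary per column
        row_summary = h_rows.reshape(B, r, -1)   # sequence of r row summaries
        col_summary = h_cols.reshape(B, c, -1)   # sequence of c column summaries
        _, h_r = self.rnn2(row_summary)          # (2, B, h2): forward/backward final states
        _, h_c = self.rnn2(col_summary)          # (2, B, h2)
        return torch.cat([h_r[0], h_r[1], h_c[0], h_c[1]], dim=-1)       # (B, 4*h2)

pool = RNNPoolSketch(in_channels=16)
patch = torch.randn(2, 8, 8, 16)                 # two 8x8 patches with 16 channels
print(pool(patch).shape)                         # torch.Size([2, 32])
```

Applied patch-wise with a stride over the activation map, this replaces a stack of convolution and pooling layers while holding only one patch's worth of intermediate state at a time.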

Strengths:
1. Good presentation; well written, with implementation details of the algorithm.
2. A simple yet effective idea: replacing traditional pooling+convolution blocks with the RNNPool operator.
3. Insightful and thorough experiments, e.g., comprehensive comparisons with other operators, an ablation study, and applications to several different tasks.

Weaknesses:
1. RNNPool is very similar to ReNet, which also replaces a convolution+pooling layer with 4 RNNs sweeping horizontally and vertically in both directions. As I understand it, the differences between the two operators are the number of RNNs, the order of sweeping, and the intermediate hidden states. The authors claim that ReNet cannot capture local features because it flattens non-overlapping patches; I cannot agree with this. ReNet is flexible enough to switch from non-overlapping to overlapping patches and extract strong local features (see the sketch after this list). If not, more experiments should be added to support the claim (line 110).
2. Add more text describing the main idea of Figure 1 for better understanding.
3. Several typos: "P_{xx}^{h}" in line 13 of Algorithm 1 -> "P_{xx}^{r}"; "showsn" in line 245 -> "shown".
4. More comparisons are needed with other DNN compression methods, e.g., pruning methods [21, Ref-1-3].
[Ref-1] NISP: Pruning Networks Using Neuron Importance Score Propagation, CVPR 2018.
[Ref-2] Towards Optimal Structured CNN Pruning via Generative Adversarial Learning, CVPR 2019.
[Ref-3] Gate Decorator: Global Filter Pruning Method for Accelerating Deep Convolutional Neural Networks, NeurIPS 2019.
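
As an aside on point 1, here is a minimal sketch (my own, not from either paper) of the overlapping-patch adjustment: with a stride smaller than the patch size, the same flatten-and-sweep pipeline sees overlapping local windows. The tensor sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 32, 16, 16)                        # (B, C, H, W) activation map
# Non-overlapping 4x4 patches (stride == kernel size), as in the ReNet description.
non_overlap = F.unfold(x, kernel_size=4, stride=4)    # (1, 32*4*4, 16 patches)
# Overlapping 4x4 patches (stride < kernel size), the adjustment suggested above.
overlap = F.unfold(x, kernel_size=4, stride=2)        # (1, 32*4*4, 49 patches)
print(non_overlap.shape, overlap.shape)
```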

Correctness: With the same input and RNN hyperparameters, ReNet generates a hidden state of size h_1 + h_2, which should be smaller than the 4*h_2 produced by the 2 shared RNNs in RNNPool. Why are the FLOPs of ReNet larger than those of the RNNPoolLayer in Table 2?

-------------------------------------------
I appreciate the authors' feedback; it answers my concerns well. I stand by my recommendation to accept the paper.

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: Refer to weaknesses for more improvements.


Review 3

Summary and Contributions: This paper proposes to reduce the memory and computation of neural networks at inference time through the use of an RNN-based pooling scheme that downsamples activations more aggressively as compared to conventional pooling mechanisms.

Strengths:
- It is a relatively novel approach to pooling, with a consequent reduction in memory and computational requirements.
- Sufficient empirical evidence has been provided to show that the method is effective in practice without compromising the accuracy of the networks.

Weaknesses:
- The networks need to be retrained after changing the pooling units to RNNPool.
- It is not explained whether the networks become more difficult to train when using such pooling, given that RNNs are harder to train than feed-forward convolutional networks.
- No comparisons are made to quantization methods, which also reduce memory and computational requirements without compromising accuracy. In this regard only one paper [37] is cited, which concerns mixed-precision networks. Several others that deal with integer or binary networks could be cited:
[Ref-1] "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks," ECCV 2016.
[Ref-2] "Data-Free Quantization Through Weight Equalization and Bias Correction," ICCV 2019.

Correctness: Yes, the claims and method seem correct.

Clarity: The paper is mostly well written, but clarity can still be improved. In particular, Section 3.1 could better explain how the networks are trained with RNNPool units; it is not obvious what the RNNPool layers train on. One has to assume that the input to the RNNPool layers is the rows and columns of the activation patches on which the pooling is applied for each example (a minimal sketch of this reading follows below).
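
For what it is worth, here is a minimal sketch of that reading (my own construction, not the authors' code): the pooling RNN is an ordinary trainable module that consumes rows of the activation map as sequences (only a single row sweep is shown, for brevity), so the whole network trains end-to-end with backpropagation on the task loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRNNPoolNet(nn.Module):
    def __init__(self, h=8, num_classes=10):
        super().__init__()
        self.conv = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.row_rnn = nn.GRU(16, h, batch_first=True)        # pooling RNN: sweeps each row
        self.fc = nn.Linear(h, num_classes)

    def forward(self, x):                                     # x: (B, 1, H, W)
        feat = torch.relu(self.conv(x))                       # (B, 16, H, W)
        B, C, H, W = feat.shape
        rows = feat.permute(0, 2, 3, 1).reshape(B * H, W, C)  # rows as sequences
        _, h_n = self.row_rnn(rows)                           # (1, B*H, h): per-row summary
        pooled = h_n.reshape(B, H, -1).mean(dim=1)            # average the row summaries
        return self.fc(pooled)

net = TinyRNNPoolNet()
x, y = torch.randn(4, 1, 16, 16), torch.randint(0, 10, (4,))
loss = F.cross_entropy(net(x), y)
loss.backward()                                               # gradients reach the pooling RNN
print(net.row_rnn.weight_ih_l0.grad is not None)              # True
```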

Relation to Prior Work: Yes, the distinction has been made clear.

Reproducibility: Yes

Additional Feedback:
- The main shortcoming of the approach is that the network has to be retrained from scratch when using the RNNPool units. In some sense the authors are proposing a new architecture to replace a bulkier one. Given this, the benefits of lower computational and memory requirements quickly pale in comparison to methods that either quantize networks to binary values [Ref-1 and the like] or quantize networks without requiring retraining [Ref-2].
- Otherwise, the approach seems novel and effective in reducing memory and computational requirements at inference time.
- An aspect the authors do not touch upon is the difficulty of training the network after plugging in the RNNPool units. This cannot be ignored, since RNNs are known to be harder to train than feed-forward networks.

-- post-rebuttal --
According to the authors, having replaced the pooling modules, the model is in effect a new one and needs to be retrained from scratch. This means that most of the benefits of the fully trained models are at risk, since we may not obtain the same performance again. Training likely becomes more difficult or takes longer after replacing the modules; this should be stated and demonstrated if possible. Given the authors' response, I will stick to my previous rating of the paper.


Review 4

Summary and Contributions: In neural networks for image classification, it is typical to gradually downsample an image from full resolution to an output vector of size 1x1xN, where N is the number of categories to classify. Downsampling also occurs in neural networks for other computer vision tasks such as semantic segmentation and object detection. When downsampling, it is typical to first increase the number of channels and then downsample, which requires more working memory than the other layers in the network. To address this, instead of following the "increase channels, then downsample" approach, RNNPool uses an RNN to downsample.
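
As a back-of-envelope illustration of this memory point (with hypothetical layer sizes, not figures from the paper), the peak activation footprint of an "expand channels, then downsample" block can be compared with the output of an aggressive RNNPool-style downsampling step.

```python
# Hypothetical sizes chosen for illustration only; activations assumed fp32.
def activation_kb(h, w, channels, bytes_per_el=4):
    return h * w * channels * bytes_per_el / 1024

# Conventional pattern: expand channels at full resolution, then downsample.
expand_then_pool_peak = activation_kb(112, 112, 64)   # large map held before pooling
after_pooling = activation_kb(56, 56, 64)

# RNNPool-style pattern: downsample aggressively right away, emitting 4*h2
# channels per patch (h2 = 16 here, an arbitrary choice).
rnnpool_output = activation_kb(28, 28, 4 * 16)

print(f"expand-then-pool peak: {expand_then_pool_peak:.0f} KB")   # 3136 KB
print(f"after pooling:         {after_pooling:.0f} KB")           # 784 KB
print(f"RNNPool-style output:  {rnnpool_output:.0f} KB")          # 196 KB
```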

Strengths: In my opinion, the authors are correct that the details of how you downsample inside your neural net are a common reason for running out of memory when running computer vision inference on a microcontroller. This is a practical problem that I have personally dealt with in my research, and the authors propose a compelling solution. Note that running out of memory on a microcontroller is much more painful than on a server, because there is often no "swap" memory, so the device freezes up. The improvements enabled by RNNPool are clearly conveyed in experiments on the ImageNet-1k and WIDER FACE datasets. Specifically, introducing the RNNPool operation into the MobileNet v2 network significantly improves peak memory usage (savings of up to 10x) and actually improves accuracy too.

--- Update after reading the rebuttal ---
I continue to recommend this paper for acceptance.

Weaknesses: It would be interesting to see the network run on a microcontroller. But, I know this would require a lot more engineering. As it is, I am already convinced that the work presented in this paper does make a real advance for TinyML on a microcontroller.

Correctness: Looks good.

Clarity: Yes.

Relation to Prior Work: One thing a reader might ask is "couldn't you simply use vanilla max-pooling to downsample, but just wait until after downsampling to increase the number of channels?" One way to answer this question is to point to SqueezeNAS [1], which searches for efficient neural architectures for semantic segmentation. Interestingly, even on a constrained computational budget, the architecture search "prefers" to put a lot of computation just before downsampling (see Figures 8 and 9 of the SqueezeNAS paper). So, RNNPool may be one of the only ways to preserve accuracy without inflating the number of channels (and the amount of memory) just before downsampling.
[1] Albert Shaw, Daniel Hunter, Forrest Iandola, and Sammy Sidhu. SqueezeNAS: Fast neural architecture search for faster semantic segmentation. ICCV Neural Architects Workshop, 2019.

Reproducibility: Yes

Additional Feedback: In the broader impact section, there is a hyperlink to "Seeing AI." I think it would be better to cite this instead of using a hyperlink.