Sun Dec 8 through Sat Dec 14, 2019, at Vancouver Convention Center
It is interesting to realize scalable neural networks within a single architecture by introducing shallow classifiers. However, the motivation is not new: some recent work has investigated a similar objective via the anytime-prediction property [C1, C2, C3], yet the paper provides no analysis of or comparison with these related studies, which should be introduced and compared against.

[C1] Amir R. Zamir, Te-Lin Wu, Lin Sun, William B. Shen, Bertram E. Shi, Jitendra Malik, and Silvio Savarese. Feedback networks. CVPR, 2017.
[C2] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. FractalNet: Ultra-deep neural networks without residuals. ICLR, 2017.
[C3] Eunwoo Kim, Chanho Ahn, and Songhwai Oh. NestedNet: Learning nested sparse structures in deep neural networks. CVPR, 2018.

The proposed framework is not especially significant: the additional components are borrowed from existing, well-developed techniques (attention, distillation), and the framework requires more parameters and computation. The methodology itself is therefore incremental. Although the authors state that the approach achieves sample-specific dynamic inference, the corresponding strategy (Section 3.3) does not appear theoretically well grounded.

The paper lacks clarity in places and contains some errors. First, the description of the proposed method in the Introduction is quite weak; I could not grasp what the authors want to develop, which made it hard to stay engaged. Second, in line 119, what is meant by "signify the feature maps" for F_i and F_c? Are they learnable variables? Third, a statement on ensemble learning is missing, and Algorithm 2 is never explained. In line 197, there is no Table 4.1 (I assume Table 2 is meant). Table 1 appears somewhat duplicative of Figures 3 and 4.

Control experiments with respect to lambda and alpha would help show how the approach works. The results in Table 2 are not satisfying compared to Table 1, which leaves me unconvinced by the proposed approach.
It is natural for Classifier 4/4 to perform slightly better than the baseline, as it has additional parameters and computation, so directly comparing them seems unfair.
- This paper is new and technically sound. The joint-training nature differentiates this work from prior work.
- The overall structure and title need improvement. Before reaching Equation 1, it is difficult to understand what the authors mean in lines 32-36 of page 1 about the differences between this work and prior work. The claimed contribution is confusing: "Compared to existing lightweight design, the proposed ... is more hardware friendly ...", yet many lightweight network designs are hardware friendly. The joint-training and knowledge-distillation ideas do not appear in these claims. The same issue occurs in Figure 1: simply by reading Figure 1, it is difficult to grasp the core idea of this work.
- The result is significant, as 1% on ImageNet is not trivial. This provides future studies with an empirical result on joint training of multiple classifiers.
Summary of the paper: The paper proposes an inference-acceleration technique for deep networks: multiple shallow classifiers feed on the intermediate outputs of a shared backbone network that has been trained for the task at hand. For example, with ResNet-50 the paper uses four classifiers in total, including the original FC layer, placed at different depths and predicting from the representation available up to that point. If the prediction score is more than a threshold set using a genetic algorithm, inference stops early and the prediction is emitted. If the shallow classifier is not sufficiently confident, inference trickles down through the cascade until it reaches the last classifier; at that point the method takes the sum of all the cascades into account for the prediction, making it an ensemble that shares the backbone. The shallow classifiers are learned via self-distillation from the point of the network where each is attached, and they contain attention modules both to enhance their performance and to help understand which attention maps are activated. In a nutshell, this method reduces the overall inference cost over an entire dataset based on which classifier each image reaches in the cascade.

Originality: Each component of this work has been present in the literature for some time; to the best of my knowledge, as the paper also states, this is the first work to combine all the moving parts mentioned above and make them work successfully, with good results.

Quality: The paper is of high quality, in its writing, its experimental results, and its real-world impact. I am sure it will have a positive impact on the resource-efficient ML community.

Clarity and writing: The paper is well written and easy to follow.
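The cascade described in the summary above can be sketched as follows. This is a hypothetical illustration, not the authors' code: the function name, the use of a plain softmax confidence, and averaging logits for the final ensemble are my assumptions; only the early-exit-with-thresholds and fall-through-to-ensemble structure comes from the paper's description.

```python
import numpy as np

def cascade_predict(logits_per_exit, thresholds):
    """Early-exit cascade inference (hypothetical sketch).

    logits_per_exit: list of 1-D logit arrays, one per classifier,
                     ordered shallow -> deep (last is the original head).
    thresholds: per-exit confidence thresholds for the shallow exits
                (the paper sets these with a genetic algorithm).
    Returns (predicted class, index of the exit that produced it).
    """
    for i, scores in enumerate(logits_per_exit[:-1]):
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                  # softmax confidence
        if probs.max() >= thresholds[i]:      # confident enough: stop early
            return int(probs.argmax()), i
    # Not confident at any shallow exit: ensemble all classifiers,
    # which share the backbone, and predict from the combined scores.
    ensemble = np.mean(logits_per_exit, axis=0)
    return int(ensemble.argmax()), len(logits_per_exit) - 1
```

An "easy" input would exit at the first classifier, while a "hard" one falls through to the ensemble, so the average per-image cost depends on how the dataset distributes across the exits.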
However, I feel there are some papers missing from the references:
1) "Model Compression" (Bucila et al., KDD 2006) and "Do Deep Nets Really Need to be Deep?" (Ba et al.), when discussing knowledge distillation.
2) The ensemble of intermediate shallow classifiers already exists in "Resource-efficient Machine Learning in 2 KB RAM for the Internet of Things" (Kumar et al., ICML 2017); this should be cited at the appropriate place in the paper.
3) I am sure there is a large body of literature on this, so I would ask the authors to revisit it once more before the next revision.

Positives:
1) The problem setting is very important today, and the extensive experimental evaluation shows the value of the proposed method.
2) The simple idea and lucid explanation make the paper easy on the reader.

Issues and questions:
1) In Algorithm 1, the ensemble is taken only if the prediction reaches the final classifier. What if the method did this at every level? That is, if a data point reaches classifier 2, why not add the scores from classifier 1 to make the prediction more robust, and so on? I would like to see these numbers in Tables 1 and 2 to show that the ensemble works at the intermediate levels as well. This would make a stronger case for adopting the method, given the higher accuracy I expect at each level based on past experience.
2) Why not find thresholds for the ensembles suggested above as well? The thresholds would then be more robust: even if classifiers 1 and 2 are individually unconfident, their ensemble could be confident about the prediction.
3) I do not understand items (iv) and (v) in lines 192-193. It would be good to explain them in the rebuttal if possible.
4) Figure 4 shows the parameter count for various networks with shallow classifiers.
It would be good to also give a number in the text for the total memory and compute increase over the original model, since this overhead is something the reader will want to know. (One can derive it from Figure 1, but it would be good to make life easy.)
5) It would also be good to report the training-time overhead incurred after taking the pre-trained model.

---

I have gone through the rebuttal and the other reviews. I am convinced by the answers in the rebuttal; however, I agree with the other reviewers that some experiments/justification are needed for the design choices, such as self-attention, in the framework (simple ablation studies should show the gains, if they are indeed obvious). You have shared the CIFAR results with intermediate ensembling; I would like to see the ImageNet results as well in the next version, as I have a feeling this will push the 3/4 classifier past the baseline. Please add everything mentioned in the rebuttal to the main paper.
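The cumulative-ensemble variant raised in points 1) and 2) above could look like the following. This is purely a reviewer-side sketch of the suggestion, not anything from the paper: the function name is invented, per-level thresholds on the running ensemble are assumed to be given (e.g. found by the same genetic algorithm), and averaging logits is one arbitrary choice of combination.

```python
import numpy as np

def cumulative_cascade_predict(logits_per_exit, thresholds):
    """Suggested variant: at exit k, threshold the running ensemble of
    classifiers 1..k rather than classifier k alone (hypothetical sketch)."""
    running = np.zeros_like(logits_per_exit[0], dtype=float)
    for i, scores in enumerate(logits_per_exit):
        running += scores
        avg = running / (i + 1)               # ensemble of exits 1..i+1
        probs = np.exp(avg - avg.max())
        probs /= probs.sum()                  # softmax confidence of ensemble
        # Always return at the last exit; otherwise require confidence.
        if i == len(logits_per_exit) - 1 or probs.max() >= thresholds[i]:
            return int(probs.argmax()), i
```

The intended benefit is the one stated in point 2): two individually unconfident classifiers can still produce a confident combined score, so the cascade may exit earlier without the accuracy of the standalone shallow exit.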