Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
This paper proposes a novel conditional computation layer, CondConv, to increase the model capacity of ConvNets while achieving high efficiency at inference. The CondConv layer can be used as a drop-in replacement for existing convolution layers in CNN architectures. The implementation is quite simple and can be viewed as a complement to current techniques for improving a model's performance. Experiments have been conducted on multiple datasets, providing adequate evidence to demonstrate the effectiveness of the proposed method. However, the proposed module is only tested on two network architectures, namely MobileNet and ResNet. While the paper as a whole is well-organized, there are some confusing conclusions that need to be clarified; e.g., the conclusion in Section 5, "This agrees with empirical results in Table 4, which show that CondConv blocks at later layers of the network improve final accuracy more than those at earlier layers," remains questionable.
The idea of CondConvs is interesting, but there are some important questions that the authors don't address, and the lack of a proper discussion is frustrating and significantly weakens the paper. The authors give no discussion of the weight matrices W_i. Is each one of these supposed to be the same size as the convolutional layer it replaces? Do they all have the same number of channels? It seems to me that replacing existing convolutional layers with CondConvs would increase the number of parameters in the model by a factor of n. Given that the focus of the manuscript is to maintain efficient inference, this would imply that an application area of interest is the deployment of networks on compute-limited mobile devices, so having so many more parameters seems impractical. I find it hard to believe that the authors actually mean this, but the lack of clarity of the paper makes it hard to conclude anything else. How does having all of these W_i's affect training? What happens during training when the routing weights have not yet been learned: do inputs get sent to all the W_i's? Looking at the code didn't help me figure this out. The paragraph starting on line 80 is unclear, especially towards the end (what is meant by "each additional parameter require only 1 additional multiply-add"?). Why is the sigmoid function used for r(x)? What other nonlinear functions were tried, and how did they compare? There's a brief mention of using softmax in the experiments section (why would one want to try softmax for this?!), but no details. In Figure 3, why include the FC layer? One would think that the FC isn't being replaced by a CondConv, so what is it supposed to tell the reader? In Figure 5, it's not clear what's shown on the left and what's shown on the right; the caption isn't helpful.
Some general comments:
-- The authors should say what MADD and mAP are abbreviations of; not all readers are familiar with the terms.
-- If all CondConvs used have 8 kernels, there doesn't need to be a separate column for this in Table 2.
-- The authors use both "experts" and "kernels" for n; it would be clearer to stick with one.
-- Still lots of cleaning up to be done: an extra "We" on line 104, an extra "network" on line 55, "one" should be "when" on line 109, and "no" shouldn't be capitalized on line 202.
In conclusion, while CondConvs appear to be original, there's a similarity in spirit with Inception Modules (did they serve as inspiration?). I don't think the quality of the paper is good, and the clarity can be improved significantly. I'd require a more thoughtful and thorough discussion before I can say the work is significant.
Update: After reading the author rebuttal, the other reviews, and re-reading the paper, it is clear to me that I did not fully appreciate the benefits of CondConvs at first. I apologize to the authors for being too harsh in my original review, and I am changing my recommendation to a 6. The work certainly has merit and could inspire interesting follow-up work. The authors should of course still incorporate the improvements they outlined in the rebuttal.
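The "1 additional multiply-add per parameter" claim the reviewer questions rests on the linearity of convolution: combining the expert kernels first and convolving once is mathematically identical to convolving with each expert separately and mixing the n outputs, but far cheaper. A toy numeric check of that identity, using hypothetical scalar "kernels" so the convolution reduces to multiplication (not the authors' implementation):

```python
import numpy as np

# Linearity check: conv(sum_i r_i * W_i, x) == sum_i r_i * conv(W_i, x).
# 1x1 scalar kernels are used so "convolution" is just elementwise scaling.
rng = np.random.default_rng(0)
n = 8                                       # number of experts (illustrative)
x = rng.normal(size=(4, 4))                 # toy input feature map
experts = rng.normal(size=(n,))             # n scalar expert "kernels"
r = 1.0 / (1.0 + np.exp(-rng.normal(size=n)))  # sigmoid routing weights

combine_then_conv = (r @ experts) * x       # mix kernels, then one "conv"
conv_then_combine = sum(ri * (wi * x) for ri, wi in zip(r, experts))
assert np.allclose(combine_then_conv, conv_then_combine)
```

Mixing the kernels costs one multiply-add per expert parameter, whereas the right-hand side costs n full convolutions, which is the efficiency argument the paper appears to be making.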
The authors propose a method that parameterizes the convolutional kernels in a CondConv layer as a linear combination of n experts; by increasing the number of experts, the capacity of the layer can be increased. The authors evaluate CondConv on MobileNetV1, MobileNetV2, and ResNet-50. Experiments on both the ImageNet and COCO datasets demonstrate the effectiveness of the proposed CondConv. Overall, this paper is an interesting and complete piece of work that aims to improve the capacity of the network without significantly increasing the inference cost. The writing and illustrations are clear.
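The per-example kernel combination this review summarizes can be sketched in a few lines of NumPy. This is a minimal illustration under assumed tensor shapes, not the authors' code: routing weights come from a global-average-pooled input passed through a (hypothetical) fully connected layer with a sigmoid, and the combined kernel is applied with a naive valid convolution.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def condconv_forward(x, experts, routing_w):
    """Toy CondConv forward pass for one example.

    x         : (C_in, H, W) input feature map
    experts   : (n, C_out, C_in, k, k) expert kernels (assumed layout)
    routing_w : (n, C_in) weights of the routing function r(x)
    """
    n, c_out, c_in, k, _ = experts.shape
    # Routing: global average pool, then FC + sigmoid, giving one weight per expert.
    pooled = x.mean(axis=(1, 2))                  # (C_in,)
    r = sigmoid(routing_w @ pooled)               # (n,)
    # Combine the n experts into a single example-dependent kernel.
    kernel = np.tensordot(r, experts, axes=1)     # (C_out, C_in, k, k)
    # Naive valid convolution with the combined kernel.
    H, W = x.shape[1], x.shape[2]
    out = np.zeros((c_out, H - k + 1, W - k + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = x[:, i:i + k, j:j + k]        # (C_in, k, k)
            out[:, i, j] = (kernel * patch).sum(axis=(1, 2, 3))
    return out, r
```

Because the routing depends on the input, each example effectively sees its own kernel, which is how capacity grows with n while inference still performs only one convolution.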