Paper ID: 828

Title: Fine-grained Optimization of Deep Neural Networks

The work is novel and significant. It takes previously established generalization bounds, expressed as upper bounds on norms of DNN weights, and instead imposes these bounds as constraints on the parameters in order to provide performance guarantees. This is an interesting development in the study of the geometry of the parameters W of deep networks, and a promising direction to pursue, as demonstrated by the improved performance in the experimental results. The authors describe the challenges of handling multiple constraints expressed as products of manifolds (POMs), whose geometry can differ from that imposed by the individual component manifolds. They then propose the algorithm FG-SGD, a modification of SGD that keeps the POM constraints satisfied: gradients are projected onto the tangent space of the manifold, and a rescaling operation ensures the upper bound holds with a right-hand side of 1.

Clarity can be improved by avoiding run-on sections, breaking the text into meaningful subsections, and clearly stating the purpose of each section along with its motivations, findings, and conclusions.

### After rebuttal, review discussion

Suggestions to improve clarity:

- Reduce the introduction to no more than 1 page.
- List the contributions succinctly, as done in the rebuttal.
- Currently there is some redundancy under "Boundary names of lists" and "Training DNNs", which turn into another bulleted list on pages 2 and 3; please consolidate.
- Have a separate background/related work section to motivate the work.
- Have a separate notation section, introducing any notation used throughout the paper. The text is difficult to parse when new notation is introduced right before it is used, or not at all.
- Subsections in Section 3:
  - The subsection beginning "In order to ..." should be a new section.
  - The 2 results should be listed as sub-headings.
  - How these results are incorporated into the algorithm should be listed immediately after.
- Section 4:
  - Compress the captions of Tables 1 and 2.
  - Describe notation separately rather than merging it into the bullets.
  - Explain the derivation of Lines 5, 6, and 7 in greater detail.
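The constrained update described above (project the gradient onto the tangent space of the manifold, step, then rescale so the norm bound holds with right-hand side 1) can be sketched as follows. This is not the paper's FG-SGD algorithm itself; it is a minimal illustrative example assuming a single unit-norm (sphere) constraint on one weight vector, with a hypothetical `fg_sgd_step` helper:

```python
import numpy as np

def fg_sgd_step(w, grad, lr=0.1):
    """One constrained SGD step on the unit sphere (illustrative only).

    (1) Project the Euclidean gradient onto the tangent space at w,
    (2) take a gradient step in that tangent direction,
    (3) rescale so the constraint ||w|| = 1 holds again.
    """
    # Tangent-space projection at w: remove the radial component g·w w
    tangent_grad = grad - np.dot(grad, w) * w
    # Euclidean step along the tangent direction
    w_new = w - lr * tangent_grad
    # Rescale back onto the unit sphere (upper bound RHS = 1)
    return w_new / np.linalg.norm(w_new)

w = np.array([1.0, 0.0])
g = np.array([0.5, 0.5])
w = fg_sgd_step(w, g)  # stays on the unit sphere
```

In the actual method, such an update would be applied per component manifold of the product, which is where the difficulties with POM geometry described by the authors arise.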

This is probably the densest paper this reviewer has seen in a long time. The material has been compressed, almost certainly past the breaking point, to fit within the 8-page limit; this is exemplified by the fact that the supplementary document is 19 pages long. It would seem to make more sense to present this work in a longer format, for example as a journal article, which would allow a clearer explanation of the ideas. Alternatively, the paper could potentially be split into separate papers. As it stands, the clarity of the presented work is poor, which makes the quality and significance difficult to ascertain.

------- Post-rebuttal comments

The authors' response addresses all of my minor improvement suggestions, so I remain very positive about this paper.

------- Pre-rebuttal comments

I vote to accept this paper because: 1) the authors propose the reasonable hypothesis that imposing multiple constraints on DNN weights to bound their norms can bring empirical generalization errors closer to the theoretical bounds, and also improve performance; 2) they design a novel algorithm, FG-SGD, to achieve this goal, addressing many technically difficult problems along the way; and 3) they conduct solid experiments that support the superiority of the proposed method. A detailed breakdown by review rubric follows.

1. Originality - 9

The proposed FG-SGD algorithm is novel. The work also makes many non-trivial technical contributions to make training with complicated constraints over POMs feasible; this cannot be done by simply applying existing techniques.

2. Quality - 9

Although I did not follow all the mathematical details carefully, the overall story is sound and the mathematical analysis seems very solid. The authors also conducted extensive experiments with the widely used ResNet50/101, ResNeXt, MobileNet, and DeepRoot architectures on multiple datasets (CIFAR-10, CIFAR-100, and ImageNet) to support the paper's claims and the superiority of the proposed method.

3. Clarity - 7

This paper is mathematically very dense and takes effort to follow, even though the authors put considerable effort into providing high-level ideas in the main paper to help readers understand it. It would also be great if the authors released code to help others understand and implement the algorithms and reproduce the results.

4. Significance - 8

This work appears to be applicable to all existing NN learning problems. The only question is whether the extra training time caused by the complexity of the proposed algorithm pays off. If the authors can integrate this algorithm into a standard ML library like TensorFlow or PyTorch, I am sure it will help people use it.