NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
The authors consider the parameter recovery problem in which the data are generated by a ``teacher network'': the inputs are drawn from a Gaussian distribution and labeled by the ``teacher network'' with additive white noise, and the goal is to learn the network from these data. A simple algorithm is proposed to learn the ground-truth parameters of a non-overlapping CNN, and it is shown to converge efficiently with high probability under a small sample complexity bound. Thanks to the good properties of Gaussian inputs, the expectation of each update lies in the span of w^t and w^* (and of v^t and v^* as well). Moreover, using properties of a specific ``good set'', as long as optimality has not yet been reached, the gradient at each step is large enough that the iterate moves closer to w^*. Furthermore, once the initial point lies in the ``good set'', all subsequent iterates remain in that set, which guarantees the fast convergence rate.

The high-level idea and proof sketch of the paper are clearly written. The authors design a special but simple gradient descent algorithm whose several nice properties make steady improvement over iterations possible. However, it would be much better to write down the proof of the claim in line 209, since that is where the assumptions of ``non-overlapping filter'' and ``Gaussian data'' are required. Moreover, some conditions used in the proof should be clearly specified. First, the proof requires that $\eta_w$ and $\eta_v$ are both less than 1, but this is not mentioned in the main text. The $c_1$ and $c_2$ in the theorem statement are not enough to show that both $\eta_w$ and $\eta_v$ are less than 1, since $C$ and $C'$ can still be chosen large enough that the inequality fails. Secondly, it is not clear how to obtain line 518 and line 528 from the assumptions, although I believe there should be some way to make the proof work.

Regarding originality and significance, the authors follow the line of [13] but propose a feasible algorithm that converges linearly. They also fix the two-stage convergence guarantee by simply reusing Theorem 4.3 at different time steps. However, this work only applies to the non-overlapping, Gaussian-input setting, which is far from the cases seen in practice, and it is not at all clear that such strong assumptions can be avoided using the approach of this paper.

Typos: (a) The P_r in line 101 should be P_k. (b) The third and the last terms of the step size $\alpha$ (line 152) may contain mistakes.

* After author response: I have read the response, but it does not change my opinion. I still feel that the assumptions of Gaussian inputs and non-overlapping filters are very strong, so I really have difficulty in judging the potential impact of this work. On the other hand, I would not be upset if the paper were accepted.
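The span property mentioned in the review is a standard consequence of the rotational invariance of the Gaussian input distribution. Below is a minimal sketch of that argument, with a generic function $f$ standing in for the paper's specific update rule; the notation is illustrative, not the authors' exact expression.

```latex
% Sketch, assuming x ~ N(0, I_d); f is a generic placeholder for the update rule.
% Let P be the orthogonal projection onto span{w, w^*} and P^perp = I - P.
% P^perp x is uncorrelated with, hence independent of, the Gaussian pair
% (w^T x, (w^*)^T x), and has mean zero, so the second term below vanishes:
\mathbb{E}\!\big[x\,f(w^{\top}x,(w^{*})^{\top}x)\big]
  = \mathbb{E}\!\big[(Px)\,f\big]
    + \underbrace{\mathbb{E}\!\big[P^{\perp}x\big]\,\mathbb{E}\!\big[f\big]}_{=\,0}
  = P\,\mathbb{E}\!\big[x\,f\big]
  \;\in\; \operatorname{span}\{w,\,w^{*}\}.
```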
Reviewer 2
Originality: To the best of my knowledge, both the proposed algorithm and the analysis contain novel insights/contributions.

Quality: The submission is technically sound. The claims are well supported by theory and experiments.

Clarity: The paper is very well written. The introduction and the related work section provide a complete survey as well as comparisons with previous work. The proof section is very well done.

Significance: This paper studies an important problem and improves over previous results in several directions. However, the Gaussianity assumption is very restrictive compared to the symmetry assumption in the Convotron paper. Therefore, the algorithm is not likely to have much impact in practice, since the analysis heavily relies on the Gaussianity of the input.

%%%%%%%% UPDATE %%%%%%%%
After reading the authors' feedback, I am happy to increase my rating from 6 to 7. Overall, I vote for accepting this paper.
Reviewer 3
The paper is well written and complete with theoretical proofs and experimental results.

- The authors present theoretical analysis and experimental evidence for a modification of the gradient descent algorithm (for the non-overlapping CNN case) called approximate gradient descent. The proof demonstrates that, with high probability, the proposed algorithm with random initialization achieves linear convergence to the ground-truth parameters up to statistical precision. The results apply to general non-trivial, monotonic, and Lipschitz continuous activation functions, including ReLU, Leaky ReLU, Sigmoid, and Softplus. The sample complexity beats existing results in its dependence on the number of hidden nodes and the filter size, and matches the information-theoretic lower bound for learning one-hidden-layer CNNs with linear activation functions, suggesting that the sample complexity is tight.
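For concreteness, the following is a minimal Python/NumPy sketch of the kind of teacher-student setup the reviews describe: a one-hidden-layer non-overlapping CNN teacher, Gaussian inputs, and labels corrupted by white noise. All names, dimensions, and the noise level are illustrative assumptions, and the paper's approximate gradient descent update is not reproduced here.

```python
# Hypothetical sketch of the teacher-student setting described in the reviews.
# Dimensions, names, and noise level are illustrative assumptions, not from the paper.
import numpy as np

rng = np.random.default_rng(0)

k, r = 4, 5            # number of non-overlapping patches (hidden nodes) and filter size
d = k * r              # input dimension
sigma_noise = 0.1      # std of the additive white label noise

# Ground-truth ("teacher") parameters: a shared filter w* and second-layer weights v*.
w_star = rng.standard_normal(r)
v_star = rng.standard_normal(k)

def activation(z):
    """ReLU; the analysis reportedly also covers Leaky ReLU, Sigmoid, Softplus, etc."""
    return np.maximum(z, 0.0)

def cnn_forward(x, w, v):
    """One-hidden-layer non-overlapping CNN: the same filter w is applied to each
    of the k disjoint length-r patches of x, then combined with weights v."""
    patches = x.reshape(k, r)          # non-overlapping patches
    hidden = activation(patches @ w)   # shared filter applied to every patch
    return v @ hidden

def sample_data(n):
    """Gaussian inputs labeled by the teacher network plus white noise."""
    X = rng.standard_normal((n, d))
    y = np.array([cnn_forward(x, w_star, v_star) for x in X])
    y += sigma_noise * rng.standard_normal(n)
    return X, y

X, y = sample_data(1000)
# A student network (w, v) would then be fit to (X, y); the convergence and
# sample-complexity claims discussed above concern recovering w* and v* from such data.
```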