Summary and Contributions: The paper proves the following result: given any polynomial-time learning algorithm that can learn some function class, it is possible to design a neural network architecture such that gradient descent on this architecture will learn the aforementioned function class in polynomial time. This result shows that neural network learning is universal in the regime of efficient learning algorithms.
Strengths: This is a really cool and basic observation that is certainly worth publication.
Weaknesses: The structure of the neural network is (not surprisingly) complex. Hence, it is not clear whether this result says something about neural networks that are used in practice.
Correctness: yes
Clarity: yes
Relation to Prior Work: yes
Reproducibility: Yes
Additional Feedback:
Summary and Contributions: The paper shows that deep learning with SGD is a universal learning paradigm, i.e., for every problem P that is learnable by some algorithm A, there exists a poly-size network and a corresponding initialization such that SGD learns P with a polynomial step size in a polynomial number of steps. Along these lines, they show that SGD learns parities on a poly-size network. Further, they provide a lower bound showing that full-gradient / large-batch gradient descent is not a general learning paradigm. To the best of my knowledge, this is the first work that shows this difference between GD and SGD at this level of generality.
Strengths: I think the paper aims to address a very fundamental problem in deep learning and has useful implications for understanding and comparing various learning algorithms and NN architecture designs. Please see my other answers for more details.
Weaknesses: Order of the quantifiers: The main positive result essentially observes that if an algorithm A learns a problem P, then there exists a network and a corresponding initialization on which SGD will simulate the algorithm A on P. I am confused about the order of the quantifiers here. It seems that the network architecture depends on the algorithm A, and knowledge of the algorithm A imposes some structure on the underlying learning problem P. Given that, it seems that the paper misuses the term "universal learning". Ideally, one would want the network architecture and the initialization to be independent of the algorithm A or the problem P. Am I missing something here?

Negative result for GD vs. the optimization perspective: I am confused by the negative result. (a) The positive result suggests that there exists a network architecture for which there exists a set of weights with small generalization error. Suppose one starts with this network architecture and runs GD on the test loss. The negative result suggests that GD will fail to reach the global minimum or a well-generalizing stationary point. Can the authors please provide some intuition for why that would be the case for parities? From an optimization perspective, it seems that GD on the test loss with the right architecture should definitely succeed (at least for a large set of initializations). (b) Typically the learning rate is chosen to be poly(1/L, 1/T), where L is the gradient-Lipschitz parameter of the loss function. The Lipschitz parameter may be exponentially large for poly-sized networks, which would motivate an exponentially small learning rate (instead of polynomial or constant, as in Corollary 1). Would GD with a small learning rate succeed?

SQ lower bounds vs SGD: SQ algorithms cannot learn parities (BKW03). It seems that SGD is an SQ algorithm? I am confused about why the lower bound for learning parities in the SQ model does not conflict with the positive results given in the paper.
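To make the GD vs. SGD comparison in (a) and (b) concrete, here is a minimal sketch of the kind of toy experiment I have in mind. Everything in it (the two-layer ReLU architecture, the width, the learning rate, the squared loss, and the sparse-parity task) is my own illustrative choice and not the paper's construction.

```python
# Minimal toy sketch (my own setup, not the paper's construction): the same
# small ReLU network trained on a sparse parity, once with full-batch GD and
# once with mini-batch SGD. All hyperparameters below are illustrative guesses.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, k, m = 20, 3, 2000                                # input bits, parity size, sample count
X = torch.randint(0, 2, (m, n)).float() * 2 - 1      # inputs in {-1, +1}
y = X[:, :k].prod(dim=1)                             # parity of the first k coordinates

def train(batch_size, steps=5000, lr=0.05, width=512):
    net = nn.Sequential(nn.Linear(n, width), nn.ReLU(), nn.Linear(width, 1))
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(steps):
        idx = torch.randperm(m)[:batch_size]         # batch_size == m gives full-batch GD
        loss = ((net(X[idx]).squeeze(-1) - y[idx]) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (torch.sign(net(X).squeeze(-1)) == y).float().mean().item()

print("full-batch GD train accuracy :", train(batch_size=m))
print("mini-batch SGD train accuracy:", train(batch_size=32))
```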
Correctness: The proofs seem correct. I skimmed through them but did not verify everything line by line.
Clarity: The paper is mostly well written. However, I would recommend that the authors be more explicit about the order of the quantifiers, etc., in the theorem statements. It might also help to break down some of the larger corollaries for ease of understanding.
Relation to Prior Work: Yes
Reproducibility: Yes
Additional Feedback: I would be flexible in changing my score depending on the feedback to my questions listed in the weaknesses section. ===== Post author feedback and discussions ===== The authors have addressed some of my questions. I am thus increasing my score.
Summary and Contributions: This paper shows several results establishing that deep learning by SGD or GD can learn, in polynomial time, any function class that can be learned in polynomial time by some algorithm. Specifically, it shows that SGD with polynomially many steps can robustly learn any such function class under poly-noise on the gradients. The results establish the universality of deep NNs.
Strengths: This paper studies the universality of deep learning, especially for function classes defined on parities. This is a challenging problem, and the results in this paper answer several meaningful questions in this field.
Weaknesses: Although the results in this paper have theoretical value, their practical value is weak. Can you provide experiments showing that SGD can be more efficient than SQ algorithms at learning parities, as claimed in item 3 of the introduction?
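To be concrete about the kind of experiment I am asking for, here is a rough sketch. It is only a crude proxy: the "SQ-like" baseline replaces a statistical-query oracle with large-batch gradients perturbed by noise of magnitude tau, and all architecture and hyperparameter choices are my own assumptions, not taken from the paper.

```python
# Rough sketch of the experiment I am asking for (my own toy setup, not the
# paper's): per-example SGD vs. an "SQ-like" baseline that only sees large-batch
# gradients perturbed by noise of size tau. All choices here are illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, k, m = 16, 5, 4000
X = torch.randint(0, 2, (m, n)).float() * 2 - 1      # inputs in {-1, +1}
y = X[:, :k].prod(dim=1)                             # k-sparse parity labels in {-1, +1}

def run(batch_size, tau=0.0, steps=20000, lr=0.02, width=512):
    net = nn.Sequential(nn.Linear(n, width), nn.ReLU(), nn.Linear(width, 1))
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(steps):
        idx = torch.randint(0, m, (batch_size,))
        loss = ((net(X[idx]).squeeze(-1) - y[idx]) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        if tau > 0:                                   # crude stand-in for an SQ oracle's tolerance
            with torch.no_grad():
                for p in net.parameters():
                    p.grad += tau * torch.randn_like(p.grad)
        opt.step()
    return (torch.sign(net(X).squeeze(-1)) == y).float().mean().item()

print("per-example SGD accuracy            :", run(batch_size=1))
print("noisy large-batch 'SQ-like' accuracy:", run(batch_size=m, tau=0.01))
```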
Correctness: Yes
Clarity: Not very well. The theorems and proofs are hard to follow, and some statements are not rigorous. For example, in the paragraph before Theorem 2 in Sec. 2.2, you say "inverse-poly noise", while in the introduction you say "polynomial noise can be added". Which one is accurate, inverse-poly or poly? Another example is "therefore the previous theorem is not a degeneracy due to infinite precision". What does "infinite precision" refer to? An introduction to the SQ framework is necessary, since many results are compared against it. As there may be many existing results, it would be better to provide a table summarizing them; through such a table, your contributions could be clearly delineated. It would also help to provide a proof sketch with a clear structure.
Relation to Prior Work: Yes
Reproducibility: Yes
Additional Feedback:
1. Can you elaborate on the statement: "In a practical setting, there may be no reason to use SGD to a general learning algorithm"? As DL is universal, what is the drawback of SGD?
2. The results in Section 2.2 show that there exists an initialization for SGD to learn the target function. Is there any practical principle or algorithm for constructing such an initialization?
3. I am not familiar with statistical query (SQ) algorithms, so the claims about the relation between SQ and DL are not clear to me. It would be better to introduce the SQ setting in the background.
4. Is it possible to characterize the exact order of the "poly" terms in the theoretical results?
5. One suggestion: cite the relevant theorem when stating the contributions in the introduction.