NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:5449
Title:Can SGD Learn Recurrent Neural Networks with Provable Generalization?

Reviewer 1

The paper is interesting and notable for tackling the difficult case of using recurrent neural networks to do learning. The function class it learns seems to be identical to the function classes learned by Allen-Zhu et al. in prior works, which are known to be learnable in polynomial-time via other methods. The main technical difference from this paper and previous work (per the authors) seems to be Lemma 5.2b. What happens if the inputs are not norm 1? What sort of complexity bounds are obtained in this case? Also, how does the complexity scale with the lipschitzness/boundedness of the loss function (currently assumed to be 1)? Are there hardness results indicating that your results are best possible? For example, the 'almost polynomial' result of L^{O(log 1/\eps)}, is it possible this can be improved? Can you give some explanation as to what this function H is from Lemma 5.2? Roughly what techniques are involved? Why is "n" considered a noise parameter? I did not follow. The main concern is how much does this paper overlap with several other works analyzing SGD on overparameterized networks, most notably the work of Allen-Zhu et al. The authors indicate that the proof here is substantially different and more difficult.

Reviewer 2

This paper show that Elman RNNs optimized with vanilla SGD can learn concepts where the target output at each position of the sequence is any function of the previous L inputs that can be encoded in a two-layer smooth neural network. There are multiple assumptions and complications in showing the main result. The crux of the proof is to show that if the RNN is overparameterized enough, then if we start from a randomly initialized RNN matrix W, there exists a function which is linear in matrix W* whose value at a specific W* is a good approximation to the target in the concept class. Showing that SGD moves in a direction similar to such W* gives the desired result. Another interesting aspect of the main result is that the number of samples that SGD needs depends only logarithmically with respect to the number of RNN neurons, making it applicable to overparameterized settings. Indeed, for the result to hold, the number of RNN neurons must depend polynomially (or slightly more depending on the complexity of the concept class) on the length L of the sequences. Pros: - Novel result for RNN learnability with generalization bound polynomial in input length. - Sample complexity bound almost independent of size of the RNN - Different properties of RNNs at random initialization (not sure which of these are novel when compared to [2]) Cons: - RNN is so overparameterized that there's a quick shortcut leading close to the optimal solution - Similar in vein to [2] Other: - Can you give some concrete (possibly practical) examples of functions belonging to the concept class? For example does the concept class allow to count, or, say, determine whether an a^nb^m sequence is such that n=m? - Can you give an intuition on the function complexities C_\eps and C_s?

Reviewer 3

The paper proves that RNN trained via SGD when only the weights are trained converges to a point close to the optimum and that the number of examples required for training scales polynomially with the number of training examples. The result is very interesting and novel. Yet, I have few remarks: 1. The work "Robust Large Margin Deep Neural Networks" provides generalization error guarantees that are independent of the number of neurons, unlike what is written in the paper that there is no such generalization result for neural networks till now. 2. I would appreciate seeing a simulation that demonstrates the result proposed in the paper. I believe it is not hard to train such a network when only one matrix is optimized. Indeed, as only one parameter is optimized the performance will be poor but still, it would demonstrate the result. 3. I would appreciate a discussion as to how the work may be extended to learning of the other parameters.