Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
UPDATE: I’m largely happy with how the authors addressed my points. I still think that the requirement for RNN to be non-expansive is quite restrictive per se, but this work may still be a good starting point for further theoretical discussion of such issues. The authors provide a straightforward proof by construction that a URNN with two times the number of hidden states as the corresponding RNN is as expressive as the RNN, i.e. can be formulated such that it produces the same outputs for the same series of inputs. While this is true for RNN with ReLU activation, the authors further prove, by linearizing around fixed points, that this is generally not true for RNN/URNN with sigmoid activation. Strengths: - Given that URNN are an important technique for modeling long-term dependencies, while avoiding some of the complexities of LSTM/GRU, rigorous theoretical results on how restrictive the unitary constraint is are timely and important. As far as I’m aware, this is the first set of such results. Weaknesses: - The proof works only under the assumption that the corresponding RNN is contractive, i.e. has no diverging directions in its eigenspace. As the authors point out (line #127), for expansive RNN there will usually be no corresponding URNN. While this is true, I think it still imposes a strong limitation a priori on the classes of problems that could be computed by an URNN. For instance chaotic attractors with at least one diverging eigendirection are ruled out to begin with. I think this needs further discussion. For instance, could URNN/ contractive RNN still *efficiently* solve some of the classical long-term RNN benchmarks, like the multiplication problem? Minor stuff: - Statement on line 134: Only true for standard sigmoid [1+exp(-x)]^-1, depends on max. slope - Theorem 4.1: Would be useful to elaborate a bit more in the main text why this holds (intuitively, since the RNN unlike the URNN will converge to the nearest FP). - line 199: The difference is not fundamental but only for the specific class of smooth (sigmoid) and non-smooth (ReLU) activation functions considered I think? Moreover: Is smoothness the crucial difference at all, or rather the fact that sigmoid is truly contractive while ReLU is just non-expansive? - line 223-245: Are URNN at all practical given the costly requirement to enforce the unitary matrix after each iteration?
Overall the paper focuses on theoretic analysis of the expressive powers of RNN's in terms of generating a desired sequence, but does not provide any implementable strategies to improve over existing algorithms in terms of avoiding vanishing or explosive gradient. Another concern is that ``generating desired output sequence'' may not be directly related to the generalization capacity of RNN in terms of predicting future signals, and so it would be more desirable if the analysis can bridge this gap.
Originality: To my knowledge the results in this work are clearly new and interesting. They build on and advance existing works. Quality: The paper appears to be a complete, and self-contained work that backs up claims with proofs or results. The paper states both what can be achieved but also what cannot. Clarity: The paper is well written and organised. It introduces the problem very well, and all relevant terms are well introduced. The supplementary material contains some of the proofs for theorems in the main paper. Significance: I believe the results of this work are important for future research of RNN and their training methods. While earlier work already looked into orthogonal networks (mostly for memory capacity, eg White, O, Lee, D, and Sompolinky, H. Short-term memory in orthogonal neural networks; also others Mikael Henaff et al Recurrent Orthogonal Networks and Long-Memory Tasks), expressiveness of the approaches has not been compared in this form, at least to my knowledge.