Paper ID: | 2471 |
---|---|

Title: | Full-Capacity Unitary Recurrent Neural Networks |

This paper describes a method for training recurrent neural networks with unitary transition matrices that doesn't rely on restricting the set of unitary matrices that can be learned. The authors also provide some theoretical motivation for their method by proposing the use of Givens operators (similar to Givens rotations) to quantify the representational power of a certain type of unitary matrix. They use this mathematical machinery to show that the original parameterization of uRNNs is not able to represent the entire unitary group for N x N matrices for which N > 22. The authors propose to use a gradient descent optimization method that operates directly on the Stiefel manifold of unitary matrices and show on a couple of tasks that this method leads to results that are at least as good and sometimes clearly superior to results obtained with the original uRNN parameterization and LSTMs.

I think this is a strong paper in that it presents multiple theoretical and empirical contributions. The theoretical ideas and proposed optimization algorithm are in my eyes more impressive than the empirical work, which is also decent but could benefit from a more thorough analysis. In any case, I think it's a very nice continuation of the ideas presented in the original paper about uRNNs. Except for some minor issues, the paper is well written in the sense that it was easy enough for me to follow the materials about Givens operators and how they can be used to quantify the representational capacity of a unitary matrix even though this specific subject matter was rather new to me. The proofs in the supplementary material seems sound to me and are relatively simple. The experimental section is a bit less convincing since all the tasks are either synthetic, or using real data in a setup that is still artificial. I don't have too many issues with the actual results, but the paper sometimes confused me a bit about the precise experimental designs and the interpretation of the results. In line 206, the authors say that they investigated values of the matrix dimensionality N that are below, *at* and above the critical dimension of 22. The value 22 itself however, is not part of the set of dimensions investigated. I also would have expected a sudden increase in the performance gap between the restricted-capacity uRNN and the full-capacity uRNN after increasing N past the value 22. The difference in performance seems to greater for the lower dimension problems. I understand that it's hard to compare these results directly, because they're MSE scores for problems of different dimensionality, but a discussion about the lack of a sudden breakdown of the restricted model for N > 22 would in my opinion improve the paper. In lines 211-212, the authors say they report the best test results over 100 epochs for several random initializations. It's not very nice to do this early stopping on the test data directly, although I suspect the results wouldn't be very difference. Knowing how many initializations where actually used and seeing standard errors or deviations in Table 2 would make this section of the paper a lot stronger. The issue I have with the 'copy memory' task is that it's a negative results for the restricted uRNN. Looking at the original paper about uRNNs, I now know that the restricted variant can solve the task for T=500 when trained by the original authors but fails at T=1000 when trained by the authors of this paper. I'm not saying they didn't try to get the best results possible with the restricted version, but it would have been better if there was some indication of a thorough hyperparameter search for this baseline. That said, the results for the new method on this harder version of the task are still impressive and indicative of it's potential practical usefulness. The speech task still feels a bit artificial to me as next frame prediction for speech features is not a very common application except perhaps for training generative models of speech. I'm not certain that this task requires an RNN to learn very long term dependencies ans neighboring frames from spectrograms with relatively small window sizes tend to be highly correlated. I find the results on this task somewhat surprising. The restricted uRNN seems to do surprisingly well in terms of MSE for up to 70 hidden units, while the full-capacity models always perform better on the other metrics. I wouldn't claim that the full-capacity model clearly outperforms the restricted variant here. It would also have been interesting to see the performance of much larger RNNs of 500 hidden units. Again, this task still provides evidence that the optimization method works and clearly outperforms the LSTMs for models of this size. While it may seem I have a lot of issues with the quality of the individual experiments, I think that taken together they still provide a decent investigation of the methods presented. However, I also think that the paper should improve the discussion of the results and could benefit from some additional details that would make replication of the results easier.

2-Confident (read it all; understood it all reasonably well)

The authors define a notion called "Givens capacity" which measures the complexity of a unitary matrix. They bound the Givens capacity of a parametrized class of unitary matrices proposed in previous work. From here they claim that this means that the parametrization only covers a proper subset of unitary matrices. They present an alternative method for working within the full set of unitary matrices and show that on some examples it improves over the parametrized method.

The experimental portion of the paper seems more valuable than the theoretical portion, but also has some serious issues. First, the original reason the paramatrization (3) was proposed was to limit the number of parameters to trade off between representation power and computation time. The present paper proposes a method that not only uses many more parameters, but apparently requires matrix inversion to compute. The only discussion of computation time is in lines 178 - 181, in which the authors claim that the matrix inversion is faster. This would seem to point to a problem with the Theano implementation, since the matrix inversion method should take time O(N^3) and the reduced-capacity method should take time O(N log N) with small constants, meaning that the latter should be significantly faster. Second, there is no apples-to-apples comparison of the methods -- some are trained using RMSprop, some are trained using SGD, and no argument is given for why. The authors then draw conclusions without considering any of these confounding factors. Maybe the observed differences are only due to the choice of optimization method? I am not asserting that this is the case, but the authors have not ruled it out. A minor point: the authors introduce Stiefel manifolds for seemingly no reason. The only one they consider is the square matrix case which is the same as unitary matrices. It seems that they do this because they would like to apply a method from a paper which considers the more general Stiefel manifolds, but since that level of generality is not needed here, there seems to be no reason to introduce it.

2-Confident (read it all; understood it all reasonably well)

The paper proposes an RNN architecture, where the recurrent matrix is unitary (uRNN), and is optimized over the entire space of unitary matrices. The authors also prove that previously proposed unitary RNN optimization method was not "full-capacity", i.e. did not cover the entire space. The capacity analysis is done using the Givens decomposition.

Novelty: Using Unitary matrices for RNNs (uRNN) gained some attention recently. This paper addresses an interesting question: how to optimize a unitary matrix over the entire space of unitary matrices. The proposed optimization, over the Stiefel manifold, is a simple yet powerful technique that achieves this. Impact: It remains to be seen if other researchers accept Unitary matrices as a viable alternative to LSTMs, and if this technique proves more powerful than previously proposed ones, so impact is unclear. Clarity: The introduction and analysis using the Givens decomposition are clearly written, and convincing. Section (4) which briefly described the optimization method can be improved by providing adding eq (1) from [16] and briefly explaining why gradient descent over manifolds is carried in the tangent space. Technical quality: - The authors demonstrate nicely how their method outperforms LSTMs in the experimental section. - A problem I see with comparison to previous uRNN technique is this: When the authors compare partial and full capacity matrices they match the size of the matrices and not the number of trainable parameters (which in theory also corresponds to memory usage and running time). So for the results of section 5 one might wonder what would happen if these were matched. - In section 5.2 the MSE results actually support more the restricted capacity uRNN (and not the proposed model), so the authors add "voice quality" measures which shows the superior quality of their model. The problem is that this was not the the optimization objective. The authors can perhaps solve this by optimizing MSE in mel-space instead of linear frequency space, which is known correlate much better with voice quality. Minor comments: - line 80: please mention the P is chosen randomly and fixed throughout training. - line 101: extra "define". - Regarding Theorem 3.1, please mention if the bound is tight, otherwise it's not clear there exists a full capacity matrix.

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

This work investigates the representation power of the unitary matrix used in the recently proposed unitary evolution RNNs, and then shows how a complete, full capacity unitary matrix can be optimized. A very specific composition of individual unitary matrices (diagonal, Householder reflection, permutation and DFT) was originally proposed, shown in the current paper to be limiting in its ability to represent the entire unitary group once the hidden state dimension exceeds 22. Previous work in optics showed that a unitary matrix can be represented as a product of at most comb(N,2) Givens operators. The current work defines the Givens capacity of a unitary matrix as the minimum number of operators required to construct the unitary matrix, upper bounded by comb(N,2). It is shown how to optimize full capacity unitary matrices using gradient descent using a method discussed in a 2011 technical report. This full capacity optimization is applied to 3 tasks and compared to the restricted capacity unitary matrices proposed in the original work as well as the LSTM. The synthetic data system identification task compares restricted capacity unitary RNN to the full capacity uRNN to the problem of learning the dynamics of a target uRNN given only samples of input and output. MSE objective is used in training. For all hidden dimensions tested the restricted capacity achieves worse performance. The copy memory problem which requires outputting the 10 starting inputs after long delays of 1000 and 2000 steps shows superior performance of the full capacity over restricted and LSTM. Finally a prediction problem of estimating the log magnitude of the short time Fourier transform of a speech sample given all previous STFT. The TIMIT corpus was used. Objective measures of MSE, segmental SNR and perceptual metrics of STOI and PESQ. There is little difference between retricted and full capacity for MSE but both are substantially better than LSTM, for SNR, STOI and PESQ full capacity slightly outperforms the restricted, where again both are substantially better than the LSTM

The uRNN is certainly a more elegant solution to the vanishing/exploding gradients of the RNN than the LSTM. With the proposed full capacity optimization an even larger gap is opened up with respect to the LSTM vs. that originally found with the restrictive capacity uRNN. One thing that you can notice between the original and current on the copy memory problem, is that increasing the hidden state size of the LSTM to match the uRNN, the gap between LSTM and restricted seems to go away, whereas both now lag at T=1000, and previously the restricted solved rapidly and perfectly at T=500. Does further increase of the hidden dimension of the restricted and LSTM help performance of both on the increasing T? Does more effort need to be put on the optimization of the LSTM (if you used something like adam, would gradient clipping still be needed). Line 101 missing a word after here?

2-Confident (read it all; understood it all reasonably well)

A Long standing issue with recurrent neural networks is the phenomena of vanishing and exploding gradients which prevents recurrent neural networks from learning long sequences. Recently unitary hidden-to-hidden weight matrices have been proposed to alleviate this problem. However, unitary matrices as proposed from previous literature are from their construction limited in their representational power for large dimensions and the current paper proposes the “Givens capacity” used to quantify the representational power of such unitary matrices. Furthermore, the paper proposes a modified version of stochastic gradient descent, which allow for training full-capacity unitary matrixes. These full-capacity unitary matrices allow for more representational power compared to previous proposed matrices, while still preserving the desired properties of being unitary, hence minimizing the effect of vanishing and exploding gradients during training.

This paper address an important issue related to recurrent neural networks. The paper is well written, technical and theoretical sound and the authors propose theoretical justified techniques that improve over current state-of-the-art. The use of unitary matrices for recurrent neural networks are still relatively new and its practical applicability and feasibility over traditional architectures has not been rigorously investigated. However, promising theoretical and empirical results are presented in the paper, which indicate that using unitary weight matrices in recurrent neural networks might have potential to be applied in a broad range of applications.

2-Confident (read it all; understood it all reasonably well)

The authors propose a theoretical approach to measuring the capacity of a family of unitary matrices. They show that previous work using a simple family of unitary matrices to parameterize a unitary recurrent neural network has less capacity than the full unitary group. They apply an existing approach for gradient descent on the Stiefel manifold to optimize over the full unitary group. In synthetic and real data experiments, they confirm the superiority of this approach. Surprisingly, they also find that it is more efficient computationally.

All-around good paper and well-written. One of the main contributions of the paper is a novel concept for evaluating the capacity of a family of unitary matrices. They use this approach to demonstrate that the previous family of unitary matrices limited the representational capacity of the neural network. Can the authors explain why a simple counting argument is not also valid? As the authors explain, the previous family of unitary matrices was parameterized by 7n real parameters, whereas U(N) is parameterized by n^2 real parameters. This seems to suggest an even lower bound for when the capacity of the previous family of unitary matrices will not cover the entire unitary group. The authors could strengthen the paper by conducting further experiments (for example, replicating some of the experiments from the original unitary evolution recurrent neural network paper like the permuted mnist experiments).

2-Confident (read it all; understood it all reasonably well)