NeurIPS 2020

What if Neural Networks had SVDs?

Review 1

Summary and Contributions: The paper proposes a novel way to accelerate the multiplication of Householder matrices, and demonstrates substantial speedup over prior work [18] and the standard Pytorch implementation. The speedup could benefit researchers who use SVD in neural networks for matrix inversion, exponential, spectral normalization, etc.

Strengths: The problem of speeding up SVD in neural networks is quite novel. I admit that I do not know much about this community to evaluate its size, but the results in this paper show substantial impact to this community. The empirical evaluation is sound. I did not have time to go through the theoretical grounding.

Weaknesses: Although the maximum speedups over [18] in matrix inversion is quite sizable (27x in Fig 1), the gain over the much simpler, standard baseline seems moderate (2.7x to 4.3x). This potentially weakens the significance of the contributions, unless the proposed approach has advantages in other dimensions compared to the baseline.

Correctness: The claims, method and empirical methodology is correct.

Clarity: Yes. The paper clearly explains the motivation, problem, and solution of the work.

Relation to Prior Work: Yes. The authors made is clear that previous work on SVD reparameterizations are O(d^3) while they focus on getting a O(d^2m) algorithm.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: This paper considers how to efficiently incorporate SVD-decomposed dense layer in a DNN. The core idea of representing the U and V matrix as a product of rank-1 Householder reflections was already present in [18]. This allows efficient forward and backward propagation, i.e. no computational overhead. In return, many operations now become free. However, the benefit is "on paper", since the method presented in [18] is not amenable to parallelization. The present paper solves this issue by breaking the product of Householder reflection into blocks, and converting each block to its WY representation. The WY representation is a way to represent a product of Householder reflections that was invented to solve similar issues in parallel multifrontal QR decomposition algorithms.

Strengths: The paper a practical problem in a previous theoretical advancement. It was already observed that SVD layers can be useful in deep learning, and it was shown that this can be done efficiently, however in practice it didn't work well due to lack of parallelism. This paper solves this. The techniques in the paper can have applicability beyond deep learning. The authors release code (w/ CUDA).

Weaknesses: 1) Novelty: At the end, there is very little in the way of a fundamentally new idea. The use of SVD in deep learning, and the observation that it can be represented efficiently via Householder reflections, was already developed in [18]. WY factorization was already developed in 1987 also, I think in order to solve similar problems. The authors simply adjust the tool to the job. 2) Limited applicability: seems that the technique applies only to layers whose #input neurons is equal to #output neurons. 3) Experiments: very weak. They do not explore deep learning at all, but rather focus on the time to do a single step and the cost of various matrix operations. I am curious to see how end-to-end learning pans out. 4) Discussion of memory complexity is a bit neglected. There are some discussions throughout, but I do not feel they are complete. Is there a memory-parallelization tradeoff? The start of section 3.3 discusses only time complexity and parallelism tradeoff.

Correctness: Claims seem correct. The empirical methodology is a bit problematic since it does not explore deep learning at all, but rather on the time to do a single step and the cost of various matrix operations. I am curious to see how end-to-end learning pans out.

Clarity: Mostly. Section 3.3 is very unclear, especially how the SVD layers can be used in convolutional layers and recurrent layers. Also: Lines 158-160 are unclear. What is g? Sum of what?

Relation to Prior Work: Mostly. However, the use of WY decomposition in multifrontal QR factorization is neglected. Since WY was invented to solve similar issues in such algorithms, it should be discussed.

Reproducibility: Yes

Additional Feedback: - The mix between using v to represent the HH vectors and using U and V for the SVD components can create a challenge in understanding the paper. - Line 82 "Mathematical Setting": I think "Computational Model" is more appropriate. - Line 158-160: unclear. ,What is g? Sum of what? - Page 5, pseudocodes: Should say that H_j is given using its HH vector. - Why is the inverse needed in deep learning. - Line 275 "These approaches are thus undesirable for SVD since..." - What do you mean here? - End of line 175: What took less than 1s?

Review 3

Summary and Contributions: The paper propose a method to incorporate SVD of Deep Neural Network layers in training in an efficient way, which can have applications in Recurrent Neural Networks. The authors identify that previous attempts incurred many serialized inner product operations that resulted in higher time complexity and reduce resource utilization in massively parallel GPUs. They propose a method named Fast Householder Multiplication which uses an algorithm to compute Householder matrices with low time complexity. They analytically show that FastH can reduce the number of serialized computations and reduced the time complexity. They further experimentally compare their method with previous algorithms based on serialized inner products and show that they can outperform these methods.

Strengths: Due to the rapid development of Machine Learning hardware during the past few years, each of which employ various techniques for purposes of acceleration, theoretical guarantees of performance, such as the ones derived in the paper, have become valuable. The authors have further included their implementation that can help reproduce their results.

Weaknesses: However, I would like to ask the authors to provide more details regarding the experimental setup to help reproducibility.

Correctness: The analytical derivations seem correct. Further, the experimental results provided seem to follow the adhere the theoretical bounds derived in the paper.

Clarity: The paper is well written and clear.

Relation to Prior Work: Yes, the authors have clearly discussed the previous approaches for incorporating SVD into neural networks and provided the necessary details for algebraic results used upon which they have based their proposed methodology.

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: This paper proposes to accelerate the matrix operations in practice. The new algorithm convert the vector-vector operations to matrix-matrix operations to improve the degree of parallelism. A time complexity analysis is provided on matrix multiplications and gradient computations. Besides, this paper provides two algorithms for Forward and Backward.

Strengths: The strengths of this work include proposing a new method for handle matrix operations with high speed, a theoretical analysis on the complexity of Forward Pass and Backwards Propagation.

Weaknesses: First, something is confusing about why should we take into account the computation of sequential inner products. I'm not sure about it without any theoretical or empirical analysis. Second, the proposed method is rather incremental upon previous works. The new algorithm is based on the grouped matrix operations. Third, there is a gap between theoretical time-complexity and empirical time-complexity which makes the analysis of time-complexity in section 3 can't support the effectiveness of the designed algorithm. Based on the result in section 4.2, the sequential method spend too much time on inner products isn't mentioned in section 3. Besides, in section 3.3, the cost of searching proper k is O(d^3) which is 'negligible' which makes me confuse about which is not negligible in designed algorithm.

Correctness: The adopted methodology seems to be correct. The empirical results seems to be convinced.

Clarity: Satisfactory.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: