__ Summary and Contributions__: (post-rebuttal)
I have read the authors' response. Please do make sure to include corrections/updates that are mentioned in the rebuttal, especially the residual connections. My belief is that this paper makes interesting and nontrivial contributions on the width-depth tradeoff of Transformers; thus, I have raised the score.
---------------------------------
This paper studies the expressive power of self-attention models, which are very popular in natural language processing. The paper focuses on a simplified model where the ReLU activations, softmax operation, and layer normalization are removed from the standard BERT architecture. Using the concept of the “separation rank” which measures how “difficult” it is to capture the dependency between two sets of input variables, the paper shows that the growth of the separation rank as a function of the embedding dimension (or “width”) dx and depth L exhibits a phase transition:
1. If L < log_3 dx, the separation rank grows doubly exponentially in depth L but polynomially in width dx.
2. If L > log_3 dx, the separation rank has an upper bound that is exponential in both depth L and width dx.
These results show that, up to a depth threshold that is logarithmic in the width, depth is much more efficient than width at improving the expressive power of self-attention models. Past the threshold, however, the depth efficiency becomes limited, and width and depth contribute similarly to the expressive power. While this paper only presents theoretical results, the theory is supported by empirical observations reported in a different paper.
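To make the threshold concrete, here is a minimal sketch that computes log_3 dx for a given width and reports which of the two regimes a given depth falls into. The function names and regime labels are illustrative and not taken from the paper, and the constants hidden in the bounds are ignored.

```python
import math

def depth_threshold(width: int) -> float:
    """Depth L = log_3(d_x) at which the two regimes in the summary meet."""
    if width < 1:
        raise ValueError("width must be a positive integer")
    return math.log(width) / math.log(3)

def regime(depth: int, width: int) -> str:
    """Label the regime a (depth, width) pair falls into (labels are illustrative)."""
    threshold = depth_threshold(width)
    if depth < threshold:
        return "depth-efficient"             # case 1: L < log_3 d_x
    if depth > threshold:
        return "width-and-depth-comparable"  # case 2: L > log_3 d_x
    return "at-threshold"

if __name__ == "__main__":
    for width in (256, 1024, 4096):
        print(width, round(depth_threshold(width), 2))
```

Because the threshold grows only logarithmically, even large widths correspond to single-digit or low double-digit depths under this formula.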

__ Strengths__: This paper reveals an interesting phenomenon: for self-attention models, the efficiency of depth is rather limited past a certain threshold. The theory of self-attention models lags far behind their huge empirical success in NLP, so the concrete theoretical results presented in this paper are timely and much needed. I lowered my score for now because of some clarifications I need, but I’m happy to raise my score depending on the authors’ response.

__ Weaknesses__: The main weakness of this paper is that it studies a simplified model without ReLU activations or softmax operators. However, given that linear neural networks are also a popular subject of study in the theory literature, I don’t consider this a big weakness.

__ Correctness__: As far as I can tell from the main text, the claims look correct, except for some clarification issues that I will list below. I also briefly checked the proof of the upper bounds in the supplementary material.

__ Clarity__: The presentation of the paper is good overall, but there are certain parts that have room for improvement, in my opinion. In particular, I believe that the paper would benefit from moving some of the details deferred to the supplementary material back to the main text. For example:
- The separation rank is the main subject of study in this paper, so its formal definition should appear in the main text.
- In Theorems 1 and 2, the dependency of the constants a_i, b_i, c_i on L and log (dx) should be spelled out explicitly, because L and log (dx) are the key variables of the bound.
- Equation (4) interrupted my flow of reading this paper: why specifically C = (3^L - 1)/2, what is g^L, why are there so many arguments in g^L, etc.? A more precise description of g^L rather than a “placeholder” would be helpful. In fact, Eq (4) is rarely mentioned in the main text; maybe it can be deferred to the supplementary material or to the proof sketch?

__ Relation to Prior Work__: To the best of my knowledge, this paper does a good job summarizing and citing existing results.

__ Reproducibility__: Yes

__ Additional Feedback__: I have some clarification questions:
- In Line 115, the paper claims that the feed-forward layer and the residual connections can be embedded within W^{O,l,h}. While I agree that the feed-forward layer is just a matrix multiplication once the ReLU is taken out, I do not fully understand why this is the case for residual connections. The i-th position of the output of the self-attention layer (including the skip connection) reads x^{l,i}+f^{l,i}_{SA}, and after removing the softmax, this will be of the form x^{l,i}+(a degree 3 polynomial of inputs). In other words, using the notation X \in R^{dx \times N} as in the supplementary material (assuming H=1 for simplicity), the self-attention layer together with the residual connection computes X + W^O W^V X X^T (W^K)^T W^Q X. Given that the first term is linear in X while the second term has three X’s, I do not see how one can embed the skip connection into W^O.
- It looks like the upper bounds (Theorems 1.1 and 1.2 in the supplementary material) do not depend on whether L < log_3 dx or L > log_3 dx. Denoting the upper bounds in Theorems 1.1 and 1.2 by b1(L) and b2(L), respectively, my understanding is that the “phase transition” occurs because min(b1(L), b2(L)) = b1(L) for L < log_3 dx and min(b1(L), b2(L)) = b2(L) for L > log_3 dx. However, after examining the exact coefficients presented in the proofs for L = log_3 dx, it seems that the 2dx log_3 dx terms cancel out in b2(log_3 dx), so b1(log_3 dx) = O(dx log dx) but b2(log_3 dx) = O(dx + log dx). This would mean that the threshold is in fact smaller than L = log_3 dx. Is there any particular reason as to why the authors chose to claim L = log_3 dx as the threshold? Is it because the lower bound in Theorem 1 holds for L < log_3 dx only?
- Corollary 1 says that d_x^{shallow} \propto \exp(\exp(L^{deep})) is necessary to represent a function realized by a deep model with a shallow model satisfying L^{shallow} = \alpha L^{deep} for some small \alpha. I wonder why there is no dependence on \alpha in the requirement for d_x^{shallow} \propto \exp(\exp(L^{deep})). Is the width requirement the same for very small \alpha and relatively big \alpha?
Additional questions, disregard if you don’t have enough space for rebuttal:
- Can you show a matching lower bound in Theorem 2?
- Theorem 1 requires that H > 1, i.e., the self-attention layers have multiple heads. Does this mean that the theory does not hold for single-head attention models? What does the growth of the separation rank look like for single-head models?
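The degree mismatch behind the first clarification question (the residual connection) can be checked numerically. The sketch below assumes H = 1, random Gaussian weights, and a hypothetical head dimension d_a; it only illustrates that the softmax-free attention term is homogeneous of degree 3 in X while the layer with the skip connection is not, and it does not settle whether the paper's embedding works in some other sense.

```python
import numpy as np

def linear_attention(X, W_O, W_V, W_K, W_Q):
    """Softmax-free single-head attention: W^O W^V X X^T (W^K)^T W^Q X."""
    scores = X.T @ W_K.T @ W_Q @ X   # N x N attention scores, no softmax
    return W_O @ W_V @ X @ scores    # d_x x N output

def layer_with_skip(X, W_O, W_V, W_K, W_Q):
    """Attention output plus the residual (skip) connection."""
    return X + linear_attention(X, W_O, W_V, W_K, W_Q)

rng = np.random.default_rng(0)
d_x, d_a, N = 4, 2, 3                # d_a is an assumed head dimension
X = rng.standard_normal((d_x, N))
W_O = rng.standard_normal((d_x, d_a))
W_V, W_K, W_Q = (rng.standard_normal((d_a, d_x)) for _ in range(3))
weights = (W_O, W_V, W_K, W_Q)

t = 2.0
# Scaling the input by t scales the attention term by t^3 ...
attn_is_cubic = np.allclose(linear_attention(t * X, *weights),
                            t**3 * linear_attention(X, *weights))
# ... but the skip term only scales by t, so the full layer is not cubic.
layer_is_cubic = np.allclose(layer_with_skip(t * X, *weights),
                             t**3 * layer_with_skip(X, *weights))
print(attn_is_cubic, layer_is_cubic)
```

The first check should hold and the second should fail, which is the reason a single output matrix multiplying the degree-3 term cannot obviously reproduce the linear skip term.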

__ Summary and Contributions__: UPDATE: Thanks to the author(s) for the response. I still believe that some well-designed toy experiments can be illuminating and can improve the paper very much. Personally, after reading your paper, I was a bit disappointed to try a wide BERT and find that it did not perform well at all. I think it would be wise to improve the paper so that other readers do not feel the same way.
The paper explores the benefit of width vs. depth in a transformer through the lens of expressivity. It shows that the separation rank of a transformer increases quickly with depth when the depth is below log_3(width); otherwise, depth and width contribute similarly to the separation rank. The authors also support their argument by citing a figure from Kaplan et al. 2020. The paper suggests that transformers should have depth L equal to log_3(width).

__ Strengths__: As people are training larger and larger transformers, hyperparameter tuning becomes more costly, and any theoretical guidance on hyperparameter choices is desirable. This paper tries to tackle the issue of width vs. depth allocation in a transformer. It is thus a timely paper that can potentially have a lot of impact.
The proof insights are, to my knowledge, novel.

__ Weaknesses__: 1. My major confusion is the explanation of Kaplan et al.’s figure: Shouldn’t you want to say, for any fixed vertical slice (fixing #params), the performance saturates at a depth that’s predicted by your equation? This should be explained a lot more clearly, in the figure caption and in the text.
2. Another concern is that Theorem 1 applies to “almost every” weight assignment. However, it is highly likely that after training, the network will converge to a low-dimensional submanifold of the weight space (cf. Ji & Telgarsky). Therefore, I don’t know how much I should trust that this result applies to trained networks.
More minor questions
3. Following the recommendation of your Table 1, I reshaped BERT-large to be shallower but wider, with the same number of parameters. However, the performance is much worse than that of the original model. This suggests that the evidence of Fig. 1 is perhaps more circumstantial and dependent on other hyperparameters, and that optimization is still a key factor in performance, which is not discussed in this paper. Can you say anything about optimization?
4. Is there a way to measure the separation rank of real networks? It would be great to see some experiments.
5. What’s the role of heads? It’s always polynomial in width?
6. Section 2.1: “BERT” should be replaced with “Transformer”
7. Lines 120-122: the sentence is long and confusing.
8. Why is claim 1 not a theorem or proposition? “Claim” suggests that it’s a heuristic argument. Is that true?
9. What’s the dependence on N in Theorem 1?
10. Lines 242-244: explain the rank argument more carefully.
Ji, Ziwei, and Matus Telgarsky. "Gradient descent aligns the layers of deep linear networks." arXiv preprint arXiv:1810.02032 (2018).
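For the reshaping experiment in question 3, one way such a same-budget comparison could be set up is sketched below. It assumes the common approximation that the non-embedding parameter count is about 12 * L * d^2 (feed-forward size 4d, embeddings and biases ignored), and it uses the standard BERT-large configuration of 24 layers and hidden size 1024. The helper names are hypothetical, and this is not necessarily how the reshaped model above was built.

```python
import math

def non_embedding_params(depth: int, width: int) -> int:
    """Approximate count 12 * L * d^2 (assumes feed-forward size 4*d; ignores embeddings and biases)."""
    return 12 * depth * width ** 2

def width_for_depth(old_depth: int, old_width: int, new_depth: int) -> int:
    """Width that keeps 12 * L * d^2 roughly constant when the depth changes."""
    return round(old_width * math.sqrt(old_depth / new_depth))

if __name__ == "__main__":
    base_depth, base_width = 24, 1024   # standard BERT-large configuration
    for new_depth in (24, 12, 6):
        new_width = width_for_depth(base_depth, base_width, new_depth)
        print(new_depth, new_width, non_embedding_params(new_depth, new_width))
```

In practice the width would also need to be rounded to a multiple of the number of heads, so the budgets match only approximately.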

__ Correctness__: All of the theoretical arguments seem correct. However, because there are some claims about the implication of this in practice, I’d like to see some carefully designed experiments beyond the figure from Kaplan et al.

__ Clarity__: The overall picture is relatively clear, but as I explained above, there are still some areas to improve on clarity.

__ Relation to Prior Work__: This work cites and builds on prior work well, to my knowledge.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: --------------------------
UPDATE:
After reading the authors' response, I still believe that the paper could be improved by running some experiments on Transformer models or on the linearized self-attention models to verify the theoretical analysis. R2's result on BERT might also indicate that the theoretical analysis of the linearized self-attention model does not agree with empirical findings on the real Transformer model. But I do think it is an interesting paper, and the theoretical results could shed some light on the width-depth tradeoff for Transformer models. Thus, I keep my score at 5, but I would also be fine with the paper being accepted.
--------------------------
This paper focuses on the interplay between depth and width for Transformer models. It studies a simplified model in which all non-linear activations and normalizations are removed, in order to analyze the bottleneck of stacking self-attention layers in terms of modeling input dependencies (as measured by separation rank). It theoretically establishes an interesting result: there exists a depth threshold for self-attention layers that depends logarithmically on the width, and increasing depth is more efficient than increasing width below this threshold. The theoretical result is shown to be consistent with the experimental result in Kaplan et al., 2020 and has practical implications for self-attention model design.

__ Strengths__: + The problem is well motivated, since depth efficiency for Transformer models is not clearly observed in practice, in contrast to other deep learning models.
+ This paper establishes some interesting theoretical results about the limits of depth efficiency for self-attention models, from the perspective of the separation-rank bottleneck that arises when stacking self-attention layers.
+ The theoretical result has practical implications for parameter allocation between depth and width for self-attention models.

__ Weaknesses__: - The problem setting of this paper is too simplified: it considers only a “linearized” self-attention layer, with all non-linear activations, layer normalization, and the softmax operation removed. However, given that the main purpose of the paper is to analyze the functionality of self-attention in terms of integrating inputs, these relaxations are not totally unreasonable.
- The experiments are not sufficient. More empirical experiments or toy experiments (for the simplified self-attention model considered in the theoretical analysis) need to be done to show the validity of the model relaxations and the consistency of the theoretical analysis with empirical results, beyond citing the result in Kaplan et al. 2020.
- Although the paper is well organized, some parts are not well explained, especially the proof sketches for Theorem 1 and Theorem 2.

__ Correctness__: I did not find any obvious methodological mistakes in the paper.

__ Clarity__: Overall, the paper is well organized. But some parts are not well explained, especially the proof sketches for Theorem 1 and Theorem 2.

__ Relation to Prior Work__: This paper does not discuss related work very comprehensively. In particular, it is not clear whether the depth inefficiency of Transformer models results from the expressivity of stacking self-attention layers (as discussed in this paper) or from the difficulty of training deep Transformer models (e.g., Huang et al. [1]). It would be great if the paper discussed this related literature in more detail.
[1] http://www.cs.toronto.edu/~mvolkovs/ICML2020_tfixup.pdf

__ Reproducibility__: Yes

__ Additional Feedback__: - Since Tab. 1 shows the depth threshold for different model sizes, it would be great to carry out more experiments for larger model sizes (10^9 ~ 10^11) and more fine-grained depths (e.g., 6~12 layers), and to plot the theoretical depth threshold against the empirical results as in Fig. 1 to see if they are consistent.
- Although the separation rank may be a good theoretical metric for measuring the ability to model input dependencies, is there any empirical evidence that it can indeed predict the expressivity of the self-attention model?
- In Eq. 2, there should be a layer normalization after the feedforward sub-layer as well.
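On the question of empirical evidence for the separation rank, a toy numerical probe is sketched below. It assumes the usual definition of the separation rank with respect to a partition (A, B) as the smallest R such that f(x) = sum_{r=1}^{R} g_r(x_A) h_r(x_B). Under that definition, the rank of the matrix of function values on a grid is a lower bound on R. The function `grid_rank` is hypothetical, and the example uses scalar toy functions rather than a real network.

```python
import numpy as np

def grid_rank(f, points_a, points_b):
    """Rank of M[i, j] = f(a_i, b_j); a lower bound on the separation rank of f."""
    M = np.array([[f(a, b) for b in points_b] for a in points_a])
    return int(np.linalg.matrix_rank(M))

grid = np.linspace(-1.0, 1.0, 8)

# a * b is a single separable term.
rank_product = grid_rank(lambda a, b: a * b, grid, grid)

# (a + b)^2 = a^2 + 2ab + b^2 needs three separable terms.
rank_square = grid_rank(lambda a, b: (a + b) ** 2, grid, grid)

print(rank_product, rank_square)
```

For a real model, the rows and columns would have to index sampled values of the two input subsets, and the estimate is capped by the grid size, so this kind of probe can only certify lower bounds on small examples.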

__ Summary and Contributions__: This paper aims to provide fundamental theory that addresses the depth-to-width trade-off in self-attention networks. Some findings are interesting and may be valuable for future research.

__ Strengths__: The motivation of the paper is clear: to provide the fundamental theory needed to understand the trade-off between depth and width in self-attention networks. The theory and proofs look sound and reasonable.

__ Weaknesses__: The study is conducted on self-attention networks in which all non-linear activations and normalization operations are removed. This does not seem to reflect real self-attention models. The effect of these removals should be analyzed further.

__ Correctness__: The claims in the paper are technically correct and in line with many empirical studies.

__ Clarity__: The paper is well written with sufficient proofs.

__ Relation to Prior Work__: It is clearly discussed how this work differs from previous contributions.

__ Reproducibility__: Yes

__ Additional Feedback__: