NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 3591 Neural Shuffle-Exchange Networks - Sequence Processing in O(n log n) Time

### Reviewer 1

This paper describes a neural network for sequence processing whose topology is inspired by sparse routing networks. The primary advantage of the proposed architecture is the ability to process length $n$ sequences with $O(n \log n)$ complexity, which is a significant performance improvement over attention based models (Transformers, NeuralGPU, etc), which typically have complexity $O(n^2)$. The authors demonstrate the effectiveness of their model using previously studied synthetic algorithmic tasks (sorting, binary multiplication, etc), and the LAMBADA QA dataset. I really enjoyed the ideas presented in this paper. The connection to message routing networks is quite interesting. To the best of my knowledge this is novel, and, similar to other "neuralifications" of CS concepts [1, 2, 3], could be a productive area of future research. The results of the ablation study are interesting, although it is unfortunate they are not more conclusive given that the most interesting properties (swap, residual, Benes) had little effect on the LAMBADA dataset. It would be very useful to know if similar results are observed on tasks where the model does learn algorithms which generalize to longer sequences (unlike multiplication), and I would appreciate including these results in the main body of the paper. My biggest concern is that the architecture is somewhat difficult to understand. I've never encountered Shuffle-Exchange Networks before, and do not imagine this is a topic that most of the NeurIPS community is familiar with. I believe it would be very beneficial if the authors included additional exposition or an Appendix that gave a more detailed description of the architecture. I highlight specific points of confusion below. I found the description of the weight sharing confusing. Based on my reading, there are essentially two sets of Switch Unit parameters. The first for layers $1, k, 2k-1, \ldots$, and the second for layers $2, \ldots, k-1, k+1, 2k-2, 2k, \ldots$ Is that correct? I find it interesting that the architecture does not require any sort of positional encoding. Perhaps this is due to the fixed nature of the shuffling stage, or perhaps due to some misunderstanding of the architecture on my part. It would be helpful if the authors could comment on this? How exactly are sequences given to the model? In all the diagrams, each switch has two inputs, so I'm assuming that given a sequence of length $2^k$, [x1, x2, x3, x4, ….] then adjacent pairs [(x1, x2), (x3, x4), …] are given to each Switch Unit (of which there are only $2^{k-1}$ with shared parameters per layer). Is this correct? Line 134: "Rest gate is" -> "The reset gate is" Lines 254-258: The wording here shows a bit of bias, and should be adjusted. In particular, the absolute performance differences between the Gated-Attention Reader and Shuffle-Exchange Network (3.28) is smaller than between the Shuffle-Exchange Network and the Universal Transformer (3.72), so it is not appropriate to use "looses only slightly" exclusively when referring to the later. #### Post Author Response I thank the authors for their helpful response, it cleared up several ambiguities for me. I find the architectural diagram included within very helpful and would strongly encourage the author's include it in subsequent versions. I would also encourage the authors to include the additional ablation studies from other algorithm tasks, as well as figures depicting generalization as a function of sequence length compared the the NeuralGPU (possibly in an appendix). #### References 1. Learning to Transduce with Unbounded Memory, http://papers.nips.cc/paper/5648-learning-to-transduce-with-unbounded-memory 2. Meta-Learning Neural Bloom Filters, http://proceedings.mlr.press/v97/rae19a/rae19a.pdf 3. Neural Random Access Machines, https://ai.google/research/pubs/pub45398