Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
This paper presents a novel model design and algorithm for building compositional representations of sequences when (as in natural language or code) the sequences are presumed to have salient latent structure that can be described as a binary tree. The method performs essentially at ceiling on two existing artificial datasets that were designed for this task, neither of which had previously been solved under comparable conditions. The method also performs reasonably well on a sentiment analysis task.

Pros: The method is novel and solves a couple of prominent instances of an important open problem in deep learning for NLP and similar domains with latent structure: how do we build models that can efficiently learn to build compositional representations using latent structure? This is interesting and likely to garner a reasonably large audience as a somewhat abstract/artificial result. It is also plausible (though by no means certain) that models of this kind could yield significant advances in interpretability and robustness in downstream applications.

Cons: If I understand correctly, there is a modest limitation of the model that is never really mentioned in the paper: at least at training time, the model takes O(N^2) steps to do O(N) work. A binary tree over N leaves requires only N-1 merge/reduce operations, but in this model the cell() operation must be applied sequentially N^2 times to fully build an arbitrary tree. Some of this cost could be saved at test time. This should be discussed more in the paper, but it is not fatal: it is better to have a polynomial-time algorithm that works than a linear-time one that doesn't. In addition, while the paper doesn't discuss the resources used for training, I expect that this polynomial-time, backprop-friendly method is more efficient in practice than linear-time models like Yogatama et al.'s, which must be trained using slow and unstable RL-style mechanisms.
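To make the cost argument above concrete, here is a rough counting sketch. The `cell` function and `om_style_pass` loop are hypothetical stand-ins for the paper's recursive cell and memory sweep, not the authors' implementation; the point is only the operation count under the assumption that every time step sweeps the cell across all memory slots.

```python
def cell(slot, token):
    """Stand-in for the model's recursive cell; here it just pairs its inputs."""
    return (slot, token)

def om_style_pass(tokens, num_slots=None):
    """Sweep cell() over every memory slot at every time step and count calls.

    For a length-N input with N slots this gives N * N cell applications,
    even though a binary parse of N leaves needs only N - 1 merges.
    """
    tokens = list(tokens)
    n = len(tokens)
    num_slots = num_slots or n
    memory = [None] * num_slots
    calls = 0
    for tok in tokens:
        for i in range(num_slots):
            memory[i] = cell(memory[i], tok)
            calls += 1
    return calls

n = 8
print(om_style_pass(range(n)))  # 64 = n * n sequential cell applications
print(n - 1)                    # 7 merges would suffice for a binary tree
```

The gap between the two printed numbers is exactly the O(N^2)-work-for-O(N)-structure overhead described above.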
Also, as I hinted at above, the fact that this model can discover and use structure reliably on clean artificial data does not make it obvious that it will discover structure consistently on noisy natural data. I think this is a significant and publishable advance, and I don't see any overt overclaiming, but this is a clear opportunity for improvement. The SST sentiment results are a bit disappointing, both because this is such a straightforward, longstanding task and because there is no evaluation or visualization of the structure the model induces on that data. The reproducibility checklist suggests that source code is available now, but I can't find any. Nit: most of the arXiv papers cited have since been published; consider citing the published versions.
This paper proposes Ordered Memory (OM), a new memory architecture that simulates a stack based on the principles of Ordered Neurons. In practice, OM operates similarly to a continuous shift-reduce parser: given an input sequence, OM decides how many reduce operations to perform before shifting the new input onto the stack. OM uses a stick-breaking attention mechanism to model the number of reduce steps and a novel Gated Recursive Cell to implement the reduce step. Experiments on logical inference and ListOps show that OM is able to model the required tree structures to solve the tasks and to generalize to longer sequences not seen in training. Experiments on sentiment analysis show that OM achieves good performance on a real-world dataset as well.

While OM relies on many principles introduced in Ordered Neurons, applying them to memory is not straightforward, so I think this is still a good methodological contribution. The writing of the paper can be improved: I have published in this area in the past, but it still took me some time to understand how the new method operates. I suggest improving the illustration in Figure 1. I also found several typos in the paper (e.g., in Eq. 5).

In terms of technical contributions, my major comment concerns comparisons with other stack-augmented recurrent neural networks. The authors show that OM outperforms LSTM, RRNet, ON-LSTM, Transformer, and Universal Transformer on the logical inference task, but I would expect other stack-based methods (e.g., Joulin and Mikolov, 2015; Yogatama et al., 2018) to perform well on this task too, and similarly on ListOps. What is the main benefit of OM compared to these similar methods?
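For readers unfamiliar with the stick-breaking construction mentioned above, here is a minimal sketch of the generic weighting scheme (not the paper's exact parameterization, which conditions the break fractions on learned attention scores): slot i receives weight p_i = beta_i * prod_{j<i} (1 - beta_j), so earlier slots claim their share of the remaining "stick" first.

```python
def stick_breaking(betas):
    """Generic stick-breaking weights: p_i = beta_i * prod_{j<i} (1 - beta_j).

    The induced soft ordering over slots is what lets a model like OM decide,
    in a differentiable way, how many reduce steps to take before shifting.
    """
    weights, remaining = [], 1.0
    for b in betas:
        weights.append(b * remaining)
        remaining *= (1.0 - b)
    return weights

w = stick_breaking([0.5, 0.5, 1.0])
print(w)  # [0.5, 0.25, 0.25]
```

Note that when the final break fraction is 1.0 the weights sum exactly to one, since the last slot absorbs whatever stick remains.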
UPDATE: Thank you, authors, for your response and for adding the model ablations. I have changed my score to accept; however, my concerns still stand regarding the performance of this model on real language tasks.

The proposed architecture is certainly interesting. However, I have some concerns with the evaluation used in this work (namely, the method does not do well on the sole real-language task attempted) and with how it does not actually validate the hypothesis (i.e., that this method learns the compositionality of real language and as a result outperforms other methods). Moreover, the authors identify two core contributions that are distinct from prior work (stick-breaking attention and the gated recursive cell) but do not analyse the effectiveness of these two contributions through any ablation studies.

Some other comments on the writing:
- The introduction contains a substantial discussion of related work. Personally, I find this distracting: out of the ~45 lines used for the introduction, only ~10 lines actually describe what this work is about.
- There is perhaps a better title than Ordered Memory, since "ordered memory" does not really motivate anything related to compositionality, which seems to be the main motivation of this paper.
- In Equation 5, the superscript i for the (1 - \pi) term should be inside the bracket.