Reviews: Novel positional encodings to enable tree-based transformers

Originality: To my knowledge, the proposed tree positional encodings are novel. They are fairly straightforward, using concatenated one-hot embeddings for each level of the tree (below the maximum depth).

Quality: Some terms are used inappropriately. In particular, the transformations between embeddings at different layers are not linear or affine (for which there is a unique inverse). The experiments are sufficient and demonstrate the viability of the approach. The authors obtain state-of-the-art results on some, but not all, of the tasks considered. The authors correctly mention that the composition property breaks down past the maximum depth. Uniqueness also fails at this point, but there are no experiments evaluating whether this can be a significant issue in practice.

Clarity: I had some issues understanding the paper. The terms "linear" and "affine" were particularly confusing. In Figure 1, it would be helpful to show the actual representations in addition to the sequence of transformations. Datasets are often introduced with little detail, with the reader referred to other papers.

Significance: As sequential data isn't always the most appropriate representation, extending transformer models to other data structures could be useful for many tasks.
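To make the point about the maximum depth concrete, here is a minimal sketch of an encoding of the kind described above: one one-hot block per tree level, concatenated, with a fixed number of levels. The function names `encode_path` and `parent`, the newest-step-first layout, and the choice to drop the oldest step on overflow are all assumptions made for illustration; the paper's exact construction may differ.

```python
from typing import List, Sequence


def encode_path(path: Sequence[int], n_children: int, max_depth: int) -> List[int]:
    """Encode a root-to-node path as concatenated one-hot branch choices.

    Assumed layout (hypothetical): the newest step occupies the first
    n_children slots, and steps older than max_depth are dropped.
    """
    if n_children < 1 or max_depth < 1:
        raise ValueError("n_children and max_depth must be at least 1")
    vec = [0] * (n_children * max_depth)
    for branch in path:
        if not 0 <= branch < n_children:
            raise ValueError(f"branch {branch} out of range")
        one_hot = [0] * n_children
        one_hot[branch] = 1
        # Push the newest step to the front and drop the oldest slot.
        vec = one_hot + vec[:-n_children]
    return vec


def parent(vec: Sequence[int], n_children: int) -> List[int]:
    """Undo the most recent step: pop the front slot and zero-pad the end."""
    return list(vec[n_children:]) + [0] * n_children


if __name__ == "__main__":
    n, k = 2, 3  # binary tree, encoding holds at most 3 levels
    # Within the maximum depth: distinct paths differ, and parent() inverts a step.
    print(encode_path([0, 1], n, k) != encode_path([1, 0], n, k))
    print(parent(encode_path([0, 1, 1], n, k), n) == encode_path([0, 1], n, k))
    # Past the maximum depth: two different depth-4 paths share one encoding.
    print(encode_path([0, 1, 1, 0], n, k) == encode_path([1, 1, 1, 0], n, k))
    # Past the maximum depth: parent() no longer recovers the true parent encoding.
    print(parent(encode_path([0, 1, 1, 0], n, k), n) == encode_path([0, 1, 1], n, k))
```

Under these assumptions, any two nodes whose last `max_depth` steps coincide receive the same encoding, which is the uniqueness failure noted under Quality. How often such collisions occur in real datasets is exactly what an experiment would need to measure.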

ORIGINALITY: (good)
This paper presents a novel positional embedding for transformer networks. The definition of this embedding, specifically designed to support tree-like datasets, is a novel contribution. The end of Section 2 offers an interesting perspective with the bag-of-words and bag-of-positions view. It is clear how this work differs from previous work. Related work is adequately cited, but some passages lose credibility for lack of citations, in particular lines 72-74 and 87-89. In general, any claim about key properties such as independence or other mathematical concepts should be supported by appropriate references.

QUALITY: (average)
Overall this submission is of good quality, but non-negligible improvements are needed. Not all claims are supported by theoretical analysis or experimental results. For instance, I was not convinced by the following claims:
- lines 72-74: "The calculations performed on any given element of a sequence are entirely independent of the order of the rest of that sequence in that layer; ...".
- line 88: "any relationship between two positions can be modeled by an affine transform ...".
- lines 107-108: "... and k, the maximum tree depth for which our constraint is preserved."
- lines 117-119: It is not clear at all why this is the case.
All such claims need to be better supported by some evidence (theoretical, experimental, or at least a citation of previous work).
While the submission introduces a novel mechanism, it would be more convincing with more experiments, in particular on more complex tree structures. This was properly acknowledged by the authors.
Some sections of the paper need better motivation, in particular the paragraph between lines 121 and 126. The "lack of richness" is not enough motivation to add complexity to an already complex encoding. Did the parameter-free positional encoding scheme not work at all? In any case, experimental evidence is required to motivate this choice.
Finally, the abstract mentions a method "enabling sequence-to-tree, tree-to-sequence and tree-to-tree mappings"; however, this submission did not cover tree-to-sequence mapping at all.

CLARITY: (low)
- The submission is technically challenging to understand. Section 3 needs to be revised to make it easier to follow. For instance, between lines 111 and 112, x is defined as a function of the D_i's, but a few lines below, in equation (1), D_i is defined as a function of x. A figure (in addition to Figure 1) or different notation could help make this section clearer.
- The abstract mentions a method "enabling sequence-to-tree, tree-to-sequence and tree-to-tree mappings." However, this submission did not cover tree-to-sequence mapping at all.
- Figures 1, 2 and 4 are not mentioned in the text. If the authors want readers to look at them and read through their long captions, they should be referenced in the text.
- Most (if not all) citations should be reformatted. In general, a citation should be entirely in parentheses, like so: "this and that is true (author, year).". Note that the year is not placed in a second pair of parentheses. Inline citations should be used when the authors of the cited paper are part of the subject of a sentence, for instance: "author (year) did this and that.". In that case, only the year is in parentheses.
- Some typos: "multply" (l.134); "evaulation" (l.158).

SIGNIFICANCE: (average +)
The results provided in this work are positive and suggest that the proposed method is powerful. However, additional experiments and a more detailed analysis could help confirm this. Other researchers could find this work interesting, as it is the first to propose a helpful inductive bias for transformers to handle tree-like datasets. The main limitation for adoption would be the clarity of the work. Figures and schemas could help.
The authors address a difficult task in a better way and hence improve the state of the art for two translation tasks (one being synthetic), and for one semantic parsing task.
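As an illustration of the kind of supporting evidence requested for the line 88 claim quoted in the review above: assuming that claim refers to standard sinusoidal positional encodings (an assumption, since the surrounding text of the paper is not quoted here), each frequency ω contributes a pair (sin ωp, cos ωp) at position p, and a shift by a fixed offset k acts on that pair as a rotation that depends only on k. The symbols ω, p and k are introduced here for illustration.

```latex
\begin{pmatrix} \sin\omega(p+k) \\ \cos\omega(p+k) \end{pmatrix}
=
\begin{pmatrix} \cos\omega k & \sin\omega k \\ -\sin\omega k & \cos\omega k \end{pmatrix}
\begin{pmatrix} \sin\omega p \\ \cos\omega p \end{pmatrix}
```

This follows directly from the angle-addition identities. Stacking one such 2x2 block per frequency gives a block-diagonal linear (hence affine) map between the encodings of positions p and p+k. Whether an analogous statement holds for the proposed tree encodings is a separate question that this identity does not settle.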

Paper ID:	6499
