NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 4632
Title: Visualizing and Measuring the Geometry of BERT

Reviewer 1

Originality: This paper is a straightforward extension of Hewitt and Manning (2019), with more detailed analysis (attention probes, geometry of embeddings, word sense analysis) and interesting discoveries. It would be a nice contribution to help NLP researchers understand how BERT works and to inspire further exploration.

Quality: The mathematical arguments about embedding trees in Euclidean space are sound and important for analyzing the parse tree space.

Clarity: This paper is well written and easy to understand.

Significance: The discoveries about the geometry of syntax and word senses are quite important and will be very useful for NLP research.

Reviewer 2

The paper investigates the relationship between BERT and syntactic structure. The idea builds on the Hewitt and Manning paper, as the authors point out. The overall readability is OK. Here are some points the authors could improve:

1. The visualization tool is useful. However, comprehensive quantitative evidence would be more convincing. The figures shown in the paper (such as the parse tree embeddings) represent only one or two instances. How does this idea apply to all sentences in the corpus?

2. The attention probe experiments (binary and multiclass) report accuracy numbers, but are they good? There is no comparison against classifiers using other features. 85.8% could be strong in some binary classification tasks but very poor in others, so the authors need to establish this evidence.

3. Theorem 1 is interesting, but it only proves that for ONE tree, i.e. ONE sentence, there is a power-2 embedding. This embedding will definitely be useless if the same words appear in a sentence with different syntax. How can you prove that for all sentences there can be approximately good power-2 embeddings, which is what Manning's result suggests?
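For context on point 3: the "power-2" (Pythagorean) embedding discussed in Theorem 1 is one in which squared Euclidean distance between node embeddings equals tree distance. A minimal sketch of the standard per-tree construction (one orthogonal unit vector per edge, a node embedded as the sum of the vectors on its root path; this is an illustrative reconstruction, not the authors' code):

```python
import numpy as np

# Small tree: parent[i] is the parent of node i; node 0 is the root.
# Edge i (for i >= 1) connects node i to parent[i].
parent = [None, 0, 0, 1, 1, 2]
n = len(parent)

# Pythagorean ("power-2") embedding: each edge gets its own orthogonal
# unit vector; a node's embedding is the sum of the unit vectors on the
# path from the root to that node.
emb = np.zeros((n, n - 1))           # one coordinate per edge
for i in range(1, n):
    emb[i] = emb[parent[i]]          # copy the parent's embedding
    emb[i, i - 1] = 1.0              # add the unit vector for edge i

def path_to_root(u):
    path = []
    while u is not None:
        path.append(u)
        u = parent[u]
    return path

def tree_distance(u, v):
    """Number of edges on the tree path between u and v."""
    pu, pv = path_to_root(u), path_to_root(v)
    lca = next(x for x in pu if x in set(pv))  # lowest common ancestor
    return pu.index(lca) + pv.index(lca)

# Squared Euclidean distance matches tree distance for every node pair.
for u in range(n):
    for v in range(n):
        assert np.isclose(np.sum((emb[u] - emb[v]) ** 2),
                          tree_distance(u, v))
```

The construction is per-tree, which is exactly the reviewer's concern: each sentence's parse gets its own embedding, and nothing in the theorem forces the embeddings of different parses to be mutually consistent.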

Reviewer 3

Originality: This submission uses existing techniques to analyze how syntax and semantics are represented in BERT. The authors do a good job of contextualizing the work in terms of previous work, for instance similar analyses for other models (like Word2Vec). They also build on the work of Hewitt and Manning and provide new theoretical justification for Hewitt and Manning's empirical findings.

Quality: Their mathematical arguments are sound, but the authors could add more rigor to the conclusions they draw in the remarks following Theorem 1. The empirical studies show some interesting results; in particular, many of the visualizations are quite nice. They provide convincing experimental results that syntactic information is encoded in the attention matrices of BERT and that the context embeddings represent word sense well. However, some of the experiments seem incomplete relative to the conclusions drawn from them. For instance, the authors conjecture that syntax and semantics are encoded in complementary subspaces in the vector representations produced by BERT. There isn't much evidence to suggest this, except for their experiment showing that word sense information may be contained in a lower-dimensional space. Further, in Section 4.1, the authors provide a visualization of word senses; in particular, they note that within one of the clusters for the word "die" there is a type of quantitative scale. A further exploration of this directionality would have been interesting and more convincing.

Clarity: The paper is clear and very well written.

Significance: While their mathematical justification is interesting in relation to previous work, it is not particularly novel or interesting in and of itself. It leads to a better understanding of the geometry of BERT, and may provide inspiration for future work. The empirical findings are interesting. They are perhaps not particularly surprising, but to my knowledge no one has done this analysis before.