Sun Dec 8 through Sat Dec 14, 2019, at the Vancouver Convention Center
This paper describes a method for leveraging sub-character information from Chinese characters, and reports small but reliable improvements on a large number of Chinese NLP tasks.

The paper is strong in the results it reports. The authors show that incorporating their "Glyce" embeddings improves results over BERT (which is SOTA on nearly all of the tasks), as well as over the strongest non-BERT models, across a wide variety of tasks. So it appears that the authors' methods have successfully allowed them to leverage some useful signal from the sub-character information, which seems a reasonably significant contribution for Chinese NLP.

The main weakness of the paper is the clarity of the methods. I'm not clear on the details of the training procedure. Is the glyph CNN trained jointly with each separate downstream task? Or is the CNN pre-trained separately so that it can be used ELMo-style on a variety of downstream tasks, as certain sections seem to suggest? If the latter, then I'm not clear on the nature of the "task" objective that is combined with the image classification objective.

I would also like to see more detail about the glyph CNN itself. Is the image classification predicting character identities (10k possible labels)? Does the model receive, at each timestep, the relevant character in all of the mentioned scripts? (If not, how are the different scripts being leveraged?)

Finally, there should be more detail about the tasks. Little information is given about the sentence-pair and single-sentence classification tasks in particular: mostly just the acronym designations and the number of classes, rather than detail about the tasks themselves. Relatedly, I would like to see some discussion of why we would expect the character information to help with these different tasks.
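For concreteness, one plausible reading of the joint-training setup in question is an auxiliary image-classification loss on the glyph CNN, added to the downstream task loss with a weight that decays over training. This is a sketch of that reading only; the function name, `lambda0`, and `decay` schedule are my assumptions, not the authors' notation:

```python
# Hypothetical sketch of a joint objective: the glyph CNN is trained
# together with the downstream task, and an auxiliary glyph-classification
# loss (predicting character identity from the glyph image) is mixed in
# with a weight that decays across epochs. Hyperparameter names and the
# exponential-decay schedule are assumptions for illustration.

def combined_loss(task_loss, glyph_cls_loss, epoch, lambda0=0.1, decay=0.8):
    """Combine the downstream-task loss with the auxiliary glyph objective.

    Because the weight shrinks over epochs, the image-classification
    objective would mainly regularize early training.
    """
    lam = lambda0 * (decay ** epoch)
    return task_loss + lam * glyph_cls_loss
```

Under this reading, the "task" objective would simply be whatever loss the downstream model already uses, so no separate pre-training stage is required.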
An additional point: currently it is difficult to discern which of the authors' innovations contribute most to the improved performance; ablation experiments would be helpful in this regard.

Minor:
- Text in Figure 2 is too small to read.
- Sec 4.1, "BERT outperforms non-BERT models in all datasets except Weibo": is this true? BERT seems to be better on Weibo as well.
- line 95: prune --> prone
- line 124: use to --> use the
- line 174: character --> characters
- line 201: the the
Regarding the first contribution: using visual features for Chinese characters (which are visually motivated) is a long-standing idea, but as the authors point out, few previous works were able to achieve significantly better performance than purely embedding-driven approaches. In this regard, the paper revisits the problem well and provides an interesting hypothesis.

Regarding the second contribution: a dedicated CNN structure with a diverse training corpus (historical scripts) and a multi-task learning setup is novel and seems to be effective, as demonstrated on several benchmark datasets. It is especially encouraging to see that this gives a non-trivial additional improvement over BERT, a very strong baseline.

Regarding the lack of analysis: this is rather disappointing, because this kind of paper would benefit from qualitative evidence showing how the visual features help the downstream task-specific model better understand the language.

Appropriateness: I am also worried about the appropriateness of the paper for NeurIPS. The paper seems more appropriate for NLP conferences than for NeurIPS, and I am not sure NeurIPS has a large enough audience for this kind of work.

After rebuttal: the rebuttal was helpful, but a more detailed analysis of why/how a visual model helps to learn better character embeddings would be desirable. Increased my score from 5 to 6.
The authors propose a very simple approach to capturing the pictographic nature of Chinese characters in embeddings. The goal is that such a representation captures semantic similarities that may be obvious to the eye but are difficult to learn by clustering character IDs in a downstream task. Further, the authors provide insights into why previous approaches failed to see performance gains.