Reviews: DRUM: End-To-End Differentiable Rule Mining On Knowledge Graphs

Reviewer 1

The authors propose DRUM, an end-to-end differentiable rule-based inference method that can be used to mine rules via backpropagation and to extract rules from data. The approach is quite interesting: it can be trained from positive examples only, without negative sampling (which is currently a burden for representation learning algorithms targeting knowledge graphs).

In DRUM, paths in a knowledge graph are represented by a chain of matrix multiplications (this idea is not especially novel; see [1]). To mine rules, the authors start from a formulation of the problem in which each rule is associated with a confidence weight, and try to maximise the likelihood of the training triples by optimising an end-to-end differentiable objective. However, the space of possible rules (and thus the number of confidence-score parameters) is massive, so the authors propose to efficiently approximate the rule-score tensor with another tensor of lower rank (Eq. (1)), and by using a recurrent neural network conditioned on the rule.

My main concern with this paper is that it does not compare at all with some very relevant work on end-to-end differentiable rule mining, e.g. [2, 3, 4]. Finally, in Tab. 4, the authors claim that "The results on FB15k-237 are also very close to SOTA", while the table makes it quite clear that they are not (yet the authors even use bold numbers for their results). Also, in those results the authors evaluate on WN18: this does not seem fair, since [5] shows that one can obtain strong results on WN18 with an extremely simple rule-based baseline that captures patterns like has_part(X, Y) :- part_of(Y, X). The rules in Table 7 look fairly interpretable, but so do the ones in [2].

Contributions:
- A clever end-to-end differentiable rule-based inference method that makes it possible to learn rules via backpropagation.
- Results seem quite promising in comparison with Neural-LP, but there is a lot of work in this specific area that it would be great to compare with.
- Very clearly written paper.
[1] https://arxiv.org/abs/1506.01094
[2] https://arxiv.org/pdf/1705.11040
[3] https://arxiv.org/abs/1807.08204
[4] https://arxiv.org/abs/1711.04574
[5] https://arxiv.org/abs/1707.01476
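As a minimal illustration of the "chain of matrix multiplications" view of paths mentioned in the review above, the sketch below multiplies relation adjacency matrices to count the paths that satisfy a rule body. The toy graph, the relation name parent_of, and the helper functions are assumptions made for this example and are not taken from the paper; the learned confidence weights are omitted.

```python
def adjacency(num_entities, edges):
    """Build a 0/1 adjacency matrix (list of lists) for one relation."""
    m = [[0] * num_entities for _ in range(num_entities)]
    for head, tail in edges:
        m[head][tail] = 1
    return m


def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Hypothetical toy graph: entities 0..3 and a single relation parent_of.
parent_of = adjacency(4, [(0, 1), (1, 2), (1, 3)])

# Body of the rule grandparent_of(X, Z) :- parent_of(X, Y), parent_of(Y, Z).
# Entry [x][z] counts the intermediate entities Y that connect x to z.
paths = matmul(parent_of, parent_of)

for x, row in enumerate(paths):
    for z, count in enumerate(row):
        if count:
            print(f"grandparent_of({x}, {z}) supported by {count} path(s)")
```

Longer rule bodies correspond to longer products, one adjacency matrix per body atom.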

Reviewer 2

- A comparison with the neural link prediction methods ComplEx, TransE or ConvE is good, but not timely anymore. I think by now you have to include the following state-of-the-art methods:
  - M3GM -- Pinter and Eisenstein. Predicting Semantic Relations using Global Graph Properties. EMNLP 2018.
  - ComplEx-N3 -- Lacroix et al. Canonical Tensor Decomposition for Knowledge Base Completion. 2018.
  - HypER -- Balazevic et al. Hypernetwork Knowledge Graph Embeddings. 2018.
  - TuckER -- Balazevic et al. TuckER: Tensor Factorization for Knowledge Graph Completion. 2019.
  Claims of state-of-the-art performance (L252) do not hold. TuckER, HypER and ComplEx-N3 outperform DRUM on FB15k-237, and all three, as well as the inverse model of ConvE, outperform DRUM on WN18! Moreover, instead of WN18, I would encourage the authors to use the harder WN18-RR (see Dettmers et al. Convolutional 2d knowledge graph embeddings. AAAI 2018. for details).
- It is true that some prior differentiable rule induction work (references [24] and [19] in the paper) was jointly learning entity embeddings and rules. However, at test time one can simply reinitialize the entity embeddings randomly so that predictions are based solely on rules. I think that could be a fair and worthwhile comparison to the approach presented here.
- How does your approach relate to the literature on random walks for knowledge base population?
  - Wang. (2015). Joint Information Extraction and Reasoning: A Scalable Statistical Relational Learning Approach
  - Gardner. (2014). Incorporating vector space similarity in random walk inference over knowledge bases.
  - Gardner. (2013). Improving Learning and Inference in a Large Knowledge-base using Latent Syntactic Cues.
  - Wang. (2013). Programming with personalized pagerank: a locally groundable first-order probabilistic logic.
  - Lao. (2012). Reading the web with learned syntactic-semantic inference rules.
  - Lao. (2011). Random walk inference and learning in a large scale knowledge base.
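To make the WN18 concern concrete, here is a rough sketch of an inverse-relation baseline of the kind the reviews allude to: it finds relation pairs that co-occur as r1(X, Y) and r2(Y, X) in the training triples and answers queries by reversing known facts. The function names and the toy triples are hypothetical, and this is not the exact inverse model evaluated by Dettmers et al.

```python
from collections import Counter, defaultdict


def mine_inverse_relations(train):
    """Map each relation r to the relation that most often holds in reverse."""
    relations_by_pair = defaultdict(set)  # (head, tail) -> relations seen
    for head, rel, tail in train:
        relations_by_pair[(head, tail)].add(rel)

    counts = defaultdict(Counter)
    for head, rel, tail in train:
        # Count relations observed on the reversed entity pair.
        for rel_inv in relations_by_pair.get((tail, head), ()):
            counts[rel][rel_inv] += 1
    return {rel: c.most_common(1)[0][0] for rel, c in counts.items()}


def predict_tails(train, inverse_of, head, relation):
    """Answer relation(head, ?) by reversing facts of the mined inverse."""
    rel_inv = inverse_of.get(relation)
    return sorted(h for h, r, t in train if r == rel_inv and t == head)


# Hypothetical toy training set; has_part(car, engine) is deliberately missing.
train = [
    ("car", "has_part", "wheel"),
    ("wheel", "part_of", "car"),
    ("engine", "part_of", "car"),
    ("tree", "has_part", "leaf"),
    ("leaf", "part_of", "tree"),
]

inverse_of = mine_inverse_relations(train)
answers = predict_tails(train, inverse_of, "car", "has_part")
print(inverse_of["has_part"], answers)
```

When a benchmark's test triples are largely reversals of training triples, a baseline of this shape can score well without learning anything else, which is the motivation for moving to WN18-RR.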
Other comments/questions:
- L66: "if each variable in the rule appears at least twice" -- I think you meant if every variable appears in at least two atoms.
- L125: What is R in |R|? The set of relations?
- L189: What's the rationale behind using L different RNNs instead of one shared one?
- L271: How are the two annotators selected?
- L292: Extending the approach to incorporate negative sampling should be trivial, no?

UPDATE: I thank the authors for their response. I am increasing my score by one.
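The distinction raised in the L66 comment can be shown with a small check. This is only a sketch under an assumed representation (each atom is a tuple of a relation name followed by its variables); the function name is hypothetical and does not come from the paper.

```python
from collections import Counter


def every_variable_in_two_atoms(head, body):
    """True if each variable occurs in at least two distinct atoms."""
    atoms_containing = Counter()
    for _relation, *variables in [head] + list(body):
        # set() ensures a variable repeated inside one atom counts once.
        for var in set(variables):
            atoms_containing[var] += 1
    return all(n >= 2 for n in atoms_containing.values())


# p(X, Y) :- q(X, Z), r(Z, Y): each of X, Y, Z occurs in two atoms.
chain_rule = every_variable_in_two_atoms(
    ("p", "X", "Y"), [("q", "X", "Z"), ("r", "Z", "Y")])

# p(X, Y) :- q(X, Y), r(Z, Z): Z appears twice, but only in one atom.
repeated_var_rule = every_variable_in_two_atoms(
    ("p", "X", "Y"), [("q", "X", "Y"), ("r", "Z", "Z")])

print(chain_rule, repeated_var_rule)
```

The second rule satisfies "appears at least twice" read literally, but not the "at least two atoms" reading, so the two wordings do not coincide.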

Reviewer 3

Originality: The paper is an extension of the NeuralLP model in which changes are made to handle variable-length rules.

Significance: The paper provides a way to learn variable-length rules and improves on the previous approach by sharing information while estimating rules for different head relations. Results in the inductive setting for link prediction are promising. The human evaluation is interesting but would benefit from comparison to methods other than just NeuralLP.

Clarity & Quality: The paper is well written; it explains the problems with simpler approaches well and provides solutions in a step-wise manner. However, the notation for the equations, especially going from page 3 to page 4, should be defined better, as the variables used are not adequately described.

Issues with the paper:
1) The authors claim that it is not right to compare their method to black-box methods that are not interpretable, but they also do not report results for models that are interpretable, like [1], [2]. These are currently relevant approaches for this task and it is important to report results on them regardless of whether DRUM is able to beat the scores or not. Comparing only to NeuralLP does not seem fair.
2) It has been shown that WN18 is inadequate, and recent work on the link prediction task has shifted to using WN18RR instead of WN18. It would be great if the authors reported on this dataset as well.
3) Would like to see hits@1,3 for the inductive setting as well.
4) Equation numbers should be added.

[1] Lin, Xi Victoria, Richard Socher, and Caiming Xiong. "Multi-hop knowledge graph reasoning with reward shaping." (2018)
[2] Das, Rajarshi, et al. "Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning." (2018)
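For reference, the Hits@k metric requested in point 3 above is the fraction of test queries whose correct entity is ranked within the top k candidates. The sketch below assumes ranks are 1-based and that ties are broken optimistically; the scores and ranks are made up for illustration and are not results from the paper.

```python
def rank_of(scores, target):
    """1-based rank of target: one plus the number of strictly better scores."""
    return 1 + sum(1 for s in scores.values() if s > scores[target])


def hits_at_k(ranks, k):
    """Fraction of queries whose correct answer is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical candidate scores for a single query; "b" is the correct answer.
example_rank = rank_of({"a": 0.9, "b": 0.7, "c": 0.1}, "b")

# Hypothetical ranks of the correct entity over six test queries.
ranks = [1, 3, 2, 10, 1, 5]

for k in (1, 3, 10):
    print(f"Hits@{k}: {hits_at_k(ranks, k):.3f}")
```

Reporting k = 1 and k = 3 alongside k = 10 shows whether correct answers sit at the very top of the ranking or merely somewhere in the top ten.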

Paper ID:	8836