{"title": "Differentiable Learning of Logical Rules for Knowledge Base Reasoning", "book": "Advances in Neural Information Processing Systems", "page_first": 2319, "page_last": 2328, "abstract": "We study the problem of learning probabilistic first-order logical rules for knowledge base reasoning. This learning problem is difficult because it requires learning the parameters in a continuous space as well as the structure in a discrete space. We propose a framework, Neural Logic Programming, that combines the parameter and structure learning of first-order logical rules in an end-to-end differentiable model. This approach is inspired by a recently-developed differentiable logic called TensorLog [5], where inference tasks can be compiled into sequences of differentiable operations. We design a neural controller system that learns to compose these operations. Empirically, our method outperforms prior work on multiple knowledge base benchmark datasets, including Freebase and WikiMovies.", "full_text": "Differentiable Learning of Logical Rules for\n\nKnowledge Base Reasoning\n\nFan Yang\n\nZhilin Yang\n\nWilliam W. Cohen\n\nSchool of Computer Science\nCarnegie Mellon University\n\n{fanyang1,zhiliny,wcohen}@cs.cmu.edu\n\nAbstract\n\nWe study the problem of learning probabilistic \ufb01rst-order logical rules for knowl-\nedge base reasoning. This learning problem is dif\ufb01cult because it requires learning\nthe parameters in a continuous space as well as the structure in a discrete space.\nWe propose a framework, Neural Logic Programming, that combines the parameter\nand structure learning of \ufb01rst-order logical rules in an end-to-end differentiable\nmodel. This approach is inspired by a recently-developed differentiable logic called\nTensorLog [5], where inference tasks can be compiled into sequences of differ-\nentiable operations. We design a neural controller system that learns to compose\nthese operations. 
Empirically, our method outperforms prior work on multiple\nknowledge base benchmark datasets, including Freebase and WikiMovies.\n\n1\n\nIntroduction\n\nA large body of work in AI and machine learning has considered the problem of learning models\ncomposed of sets of \ufb01rst-order logical rules. An example of such rules is shown in Figure 1. Logical\nrules are useful representations for knowledge base reasoning tasks because they are interpretable,\nwhich can provide insight to inference results. In many cases this interpretability leads to robustness\nin transfer tasks. For example, consider the scenario in Figure 1. If new facts about more companies\nor locations are added to the knowledge base, the rule about HasOfficeInCountry will still be\nusefully accurate without retraining. The same might not be true for methods that learn embeddings\nfor speci\ufb01c knowledge base entities, as is done in TransE [3].\n\nFigure 1: Using logical rules (shown in the box) for knowledge base reasoning.\n\nLearning collections of relational rules is a type of statistical relational learning [7], and when the\nlearning involves proposing new logical rules, it is often called inductive logic programming [18]\n. Often the underlying logic is a probabilistic logic, such as Markov Logic Networks [22] or\nProPPR [26]. The advantage of using a probabilistic logic is that by equipping logical rules with\nprobability, one can better model statistically complex and noisy data. Unfortunately, this learning\nproblem is quite dif\ufb01cult \u2014 it requires learning both the structure (i.e. the particular sets of rules\nincluded in a model) and the parameters (i.e. con\ufb01dence associated with each rule). 
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

[Figure 1: the query "In which country Y does X have an office?" is answered by the rule HasOfficeInCountry(Y,X) ← HasOfficeInCity(Z,X), CityInCountry(Y,Z). For example, HasOfficeInCity(New York,Uber) and CityInCountry(USA,New York) yield Y = USA for X = Uber; HasOfficeInCity(Paris,Lyft) and CityInCountry(France,Paris) yield Y = France for X = Lyft.]

Determining the structure is a discrete optimization problem, and one that involves search over a potentially large problem space. Many past learning systems have thus used optimization methods that interleave moves in a discrete structure space with moves in parameter space [12, 13, 14, 27].

In this paper, we explore an alternative approach: a completely differentiable system for learning models defined by sets of first-order rules. This allows one to use modern gradient-based programming frameworks and optimization methods for the inductive logic programming task. Our approach is inspired by a differentiable probabilistic logic called TensorLog [5]. TensorLog establishes a connection between inference using first-order rules and sparse matrix multiplication, which enables certain types of logical inference tasks to be compiled into sequences of differentiable numerical operations on matrices. However, TensorLog is limited as a learning system because it only learns parameters, not rules. In order to learn parameters and structure simultaneously in a differentiable framework, we design a neural controller system with an attention mechanism and memory to learn to sequentially compose the primitive differentiable operations used by TensorLog. At each stage of the computation, the controller uses attention to "softly" choose a subset of TensorLog's operations, and then performs the operations with contents selected from the memory.
We call our approach\nneural logic programming, or Neural LP.\nExperimentally, we show that Neural LP performs well on a number of tasks. It improves the\nperformance in knowledge base completion on several benchmark datasets, such as WordNet18\nand Freebase15K [3]. And it obtains state-of-the-art performance on Freebase15KSelected [25],\na recent and more challenging variant of Freebase15K. Neural LP also performs well on standard\nbenchmark datasets for statistical relational learning, including datasets about biomedicine and\nkinship relationships [12]. Since good performance on many of these datasets can be obtained using\nshort rules, we also evaluate Neural LP on a synthetic task which requires longer rules. Finally, we\nshow that Neural LP can perform well in answering partially structured queries, where the query is\nposed partially in natural language. In particular, Neural LP also obtains state-of-the-art results on the\nKB version of the WIKIMOVIES dataset [16] for question-answering against a knowledge base. In\naddition, we show that logical rules can be recovered by executing the learned controller on examples\nand tracking the attention.\nTo summarize, the contributions of this paper include the following. First, we describe Neural LP,\nwhich is, to our knowledge, the \ufb01rst end-to-end differentiable approach to learning not only the\nparameters but also the structure of logical rules. Second, we experimentally evaluate Neural LP on\nseveral types of knowledge base reasoning tasks, illustrating that this new approach to inductive logic\nprogramming outperforms prior work. Third, we illustrate techniques for visualizing a Neural LP\nmodel as logical rules.\n\n2 Related work\n\nStructure embedding [3, 24, 29] has been a popular approach to reasoning with a knowledge base.\nThis approach usually learns a embedding that maps knowledge base relations (e.g CityInCountry)\nand entities (e.g. USA) to tensors or vectors in latent feature spaces. 
Though our Neural LP system can be used for similar tasks as structure embedding, the methods are quite different. Structure embedding focuses on learning representations of relations and entities, while Neural LP learns logical rules. In addition, logical rules learned by Neural LP can be applied to entities not seen at training time. This is not achievable by structure embedding, since its reasoning ability relies on entity-dependent representations.

Neural LP differs from prior work on logical rule learning in that the system is end-to-end differentiable, thus enabling gradient-based optimization, while most prior work involves discrete search in the problem space. For instance, Kok and Domingos [12] interleave beam search, using discrete operators to alter a rule set, with parameter learning via numeric methods for rule confidences. Lao and Cohen [13] introduce all rules from a restricted set, then use lasso-style regression to select a subset of predictive rules. Wang et al. [27] use an Iterative Structural Gradient algorithm that alternates gradient-based search for the parameters of the probabilistic logic ProPPR [26] with structural additions suggested by the parameter gradients.

Recent work on neural program induction [21, 20, 1, 8] has used attention mechanisms to "softly choose" differentiable operators, where the attentions are simply approximations to binary choices. The main difference in our work is that the attentions are treated as confidences of the logical rules and have semantic meanings. In other words, Neural LP learns a distribution over logical rules, instead of an approximation to a particular rule. Therefore, we do not use hardmax to replace softmax during inference time.

3 Framework

3.1 Knowledge base reasoning

Knowledge bases are collections of relational data of the format Relation(head,tail), where head and tail are entities and Relation is a binary relation between entities.
Examples of such data tuples are HasOfficeInCity(New York,Uber) and CityInCountry(USA,New York). The knowledge base reasoning task we consider here consists of a query¹, an entity tail that the query is about, and an entity head that is the answer to the query. The goal is to retrieve a ranked list of entities based on the query such that the desired answer (i.e. head) is ranked as high as possible.

To reason over a knowledge base, for each query we are interested in learning weighted chain-like logical rules of the following form, similar to stochastic logic programs [19],

α  query(Y,X) ← R_n(Y,Z_n) ∧ ··· ∧ R_1(Z_1,X)    (1)

where α ∈ [0, 1] is the confidence associated with this rule, and R_1, ..., R_n are relations in the knowledge base. During inference, given an entity x, the score of each y is defined as the sum of the confidences of the rules that imply query(y,x), and we return a ranked list of entities in which a higher score implies a higher rank.

3.2 TensorLog for KB reasoning

We next introduce TensorLog operators and then describe how they can be used for KB reasoning. Given a knowledge base, let E be the set of all entities and R be the set of all binary relations. We map all entities to integers, and each entity i is associated with a one-hot encoded vector v_i ∈ {0,1}^|E| such that only the i-th entry is 1. TensorLog defines an operator M_R for each relation R. Concretely, M_R is a matrix in {0,1}^(|E|×|E|) such that its (i,j) entry is 1 if and only if R(i,j) is in the knowledge base, where i is the i-th entity and similarly for j.

We now draw the connection between TensorLog operations and a restricted case of logical rule inference. Using the operators described above, we can imitate logical rule inference R(Y,X) ← P(Y,Z) ∧ Q(Z,X) for any entity X = x by performing the matrix multiplications M_P · M_Q · v_x ≐ s.
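The inference-as-matrix-multiplication connection can be sketched in NumPy on a toy knowledge base (the entities and facts below are illustrative, following the running example):

```python
import numpy as np

# Toy knowledge base; entities are mapped to integer indices.
entities = ["Uber", "New York", "USA", "Lyft", "Paris", "France"]
idx = {e: i for i, e in enumerate(entities)}
n = len(entities)

def one_hot(entity):
    v = np.zeros(n)
    v[idx[entity]] = 1.0
    return v

def operator(facts):
    # TensorLog operator M_R: entry (i, j) is 1 iff R(i, j) is in the KB.
    M = np.zeros((n, n))
    for head, tail in facts:
        M[idx[head], idx[tail]] = 1.0
    return M

M_office = operator([("New York", "Uber"), ("Paris", "Lyft")])    # HasOfficeInCity
M_country = operator([("USA", "New York"), ("France", "Paris")])  # CityInCountry

# Rule: HasOfficeInCountry(Y,X) <- HasOfficeInCity(Z,X) ^ CityInCountry(Y,Z).
# Inference for X = Uber is two sparse matrix-vector products.
s = M_country @ (M_office @ one_hot("Uber"))
print([entities[i] for i in np.nonzero(s)[0]])  # -> ['USA']
```

The nonzero entries of s mark exactly the entities Y for which the rule body can be satisfied, which is the connection the text develops next.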
In other words, the non-zero entries of the vector s correspond exactly to the set of y such that there exists z with P(y,z) and Q(z,x) in the KB. Though we describe the case where the rule length is two, it is straightforward to generalize this connection to rules of any length.

Using TensorLog operations, what we want to learn for each query is shown in Equation 2,

Σ_l α_l Π_{k∈β_l} M_{R_k}    (2)

where l indexes over all possible rules, α_l is the confidence associated with rule l, and β_l is an ordered list of all relations in this particular rule. During inference, given an entity v_x, the score of each retrieved entity is then equivalent to the entries in the vector s, as shown in Equation 3.

s = Σ_l α_l (Π_{k∈β_l} M_{R_k} v_x),    score(y | x) = v_y^T s    (3)

To summarize, we are interested in the following learning problem for each query,

max_{α_l,β_l} Σ_{x,y} score(y | x) = max_{α_l,β_l} Σ_{x,y} v_y^T (Σ_l α_l (Π_{k∈β_l} M_{R_k} v_x))    (4)

where {x, y} are entity pairs that satisfy the query, and {α_l, β_l} are to be learned.

¹ In this work, the notion of query refers to relations, which differs from the conventional notion, in which a query usually contains both a relation and an entity.

Figure 2: The neural controller system.

3.3 Learning the logical rules

We will now describe the differentiable rule learning process, including the learnable parameters and the model architecture. As shown in Equation 2, for each query, we need to learn the set of rules that imply it and the confidences associated with these rules. However, it is difficult to formulate a differentiable process to directly learn the parameters and the structure {α_l, β_l}. This is because each parameter is associated with a particular rule, and enumerating rules is an inherently discrete task.
To overcome this difficulty, we observe that a different way to write Equation 2 is to interchange the summation and product, resulting in the following formula with a different parameterization,

Π_{t=1}^{T} Σ_{k=1}^{|R|} a_t^k M_{R_k}    (5)

where T is the maximum length of rules and |R| is the number of relations in the knowledge base. The key parameterization difference between Equation 2 and Equation 5 is that in the latter we associate each relation in the rule with a weight. This combines the rule enumeration and confidence assignment. However, the parameterization in Equation 5 is not sufficiently expressive, as it assumes that all rules are of the same length. We address this limitation in Equations 6-8, where we introduce a recurrent formulation similar to Equation 3.

In the recurrent formulation, we use auxiliary memory vectors u_t. Initially the memory vector is set to the given entity v_x. At each step, as described in Equation 7, the model first computes a weighted average of the previous memory vectors using the memory attention vector b_t. Then the model "softly" applies the TensorLog operators using the operator attention vector a_t. This formulation allows the model to apply the TensorLog operators to all previous partial inference results, instead of just the last step's.

u_0 = v_x    (6)

u_t = (Σ_{k=1}^{|R|} a_t^k M_{R_k}) (Σ_{τ=0}^{t-1} b_t^τ u_τ)    for 1 ≤ t ≤ T    (7)

u_{T+1} = Σ_{τ=0}^{T} b_{T+1}^τ u_τ    (8)

Finally, the model computes a weighted average of all memory vectors, thus using attention to select the proper rule length.
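The recurrence in Equations 6-8 can be sketched in NumPy. Here the attention vectors a_t and b_t are random placeholders standing in for the controller outputs described below:

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, n_relations, T = 5, 3, 2

# Random 0/1 relation operators M_Rk and a one-hot query entity v_x.
M = rng.integers(0, 2, size=(n_relations, n_entities, n_entities)).astype(float)
v_x = np.eye(n_entities)[0]

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

memory = [v_x]  # u_0 = v_x (Equation 6)
for t in range(1, T + 1):
    a_t = softmax(rng.normal(size=n_relations))  # operator attention
    b_t = softmax(rng.normal(size=len(memory)))  # memory attention over u_0..u_{t-1}
    read = sum(b * u for b, u in zip(b_t, memory))                # inner sum of Eq. 7
    memory.append(sum(a * (Mk @ read) for a, Mk in zip(a_t, M)))  # Eq. 7
b_T1 = softmax(rng.normal(size=len(memory)))
u_final = sum(b * u for b, u in zip(b_T1, memory))  # u_{T+1} (Equation 8)
print(u_final.shape)  # -> (5,)
```

Because each step reads a b_t-weighted mixture of all earlier memory vectors, the final average over u_0..u_{T+1} softly mixes rules of every length up to T.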
Given the above recurrent formulation, the learnable parameters for each query are {a_t | 1 ≤ t ≤ T} and {b_t | 1 ≤ t ≤ T + 1}.

We now describe a neural controller system that learns the operator and memory attention vectors. We use recurrent neural networks not only because they fit with our recurrent formulation, but also because it is likely that the current step's attentions depend on the previous steps'. At every step t ∈ [1, T + 1], the network predicts the operator and memory attention vectors using Equations 9, 10, and 11. The input is the query for 1 ≤ t ≤ T and a special END token when t = T + 1.

h_t = update(h_{t-1}, input)    (9)

a_t = softmax(W h_t + b)    (10)

b_t = softmax([h_0, ..., h_{t-1}]^T h_t)    (11)

The system then performs the computation in Equation 7 and stores u_t into the memory. The memory holds each step's partial inference results, i.e. {u_0, ..., u_t, ..., u_{T+1}}. Figure 2 shows an overview of the system. The final inference result u is just the last vector in memory, i.e. u_{T+1}. As discussed in Equation 4, the objective is to maximize v_y^T u. In particular, we maximize log v_y^T u, because the nonlinearity empirically improves the optimization performance. We also observe that normalizing the memory vectors (i.e. u_t) to have unit length sometimes improves the optimization.

To recover logical rules from the neural controller system, for each query we can write the rules and their confidences {α_l, β_l} in terms of the attention vectors {a_t, b_t}. Based on the relationship between Equation 3 and Equations 6-8, we can recover rules by following Equation 7 and keeping track of the coefficients in front of each matrix M_{R_k}. The detailed procedure is presented in Algorithm 1.

Algorithm 1 Recover logical rules from attention vectors

Input: attention vectors {a_t | t = 1, ..., T} and {b_t | t = 1, ...
, T + 1}

Notation: Let R_t = {r_1, ..., r_l} be the set of partial rules at step t. Each rule r_l is represented by a pair (α, β) as described in Equation 1, where α is the confidence and β is an ordered list of relation indexes.

Initialize: R_0 = {r_0} where r_0 = (1, ()).

for t ← 1 to T + 1 do
    Initialize: R̂_t = ∅, a placeholder for storing intermediate results.
    for τ ← 0 to t − 1 do
        for rule (α, β) in R_τ do
            Update α′ ← α · b_t^τ. Store the updated rule (α′, β) in R̂_t.
    if t ≤ T then
        Initialize: R_t = ∅
        for rule (α, β) in R̂_t do
            for k ← 1 to |R| do
                Update α′ ← α · a_t^k, β′ ← β append k. Add the updated rule (α′, β′) to R_t.
    else
        R_t = R̂_t
return R_{T+1}

4 Experiments

To test the reasoning ability of Neural LP, we conduct experiments on statistical relation learning, grid path finding, knowledge base completion, and question answering against a knowledge base. For all tasks, the data used in the experiments are divided into three files: facts, train, and test. The facts file is used as the knowledge base to construct the TensorLog operators {M_{R_k} | R_k ∈ R}. The train and test files contain query examples query(head,tail). Unlike in the case of learning embeddings, we do not require the entities in train and test to overlap, since our system learns rules that are entity independent.

Our system is implemented in TensorFlow and can be trained end-to-end using gradient methods. The recurrent neural network used in the neural controller is a long short-term memory network [9], and the hidden state dimension is 128. The optimization algorithm we use is mini-batch ADAM [11] with batch size 64 and the learning rate initially set to 0.001.
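Algorithm 1 can be sketched in plain Python. The attention values below are illustrative stand-ins for learned outputs; each rule is a (confidence, relation-index-tuple) pair:

```python
import numpy as np

def recover_rules(a, b, n_relations):
    """Sketch of Algorithm 1. a[t-1] is the operator attention at step t
    (length n_relations, t = 1..T); b[t-1] is the memory attention at step t
    (length t, t = 1..T+1). Returns rules as (confidence, relations) pairs."""
    T = len(a)
    R = [[(1.0, ())]]  # R_0 = {r_0} with r_0 = (1, ())
    for t in range(1, T + 2):
        # Weight every partial rule from steps 0..t-1 by b_t^tau.
        hat = [(alpha * b[t - 1][tau], beta)
               for tau in range(t) for (alpha, beta) in R[tau]]
        if t <= T:
            # Extend each partial rule with every relation k, weighted by a_t^k.
            R.append([(alpha * a[t - 1][k], beta + (k,))
                      for (alpha, beta) in hat for k in range(n_relations)])
        else:
            R.append(hat)
    return sorted(R[T + 1], reverse=True)  # highest confidence first

# T = 1 with two relations: attention mostly on relation 0 and memory slot u_1.
a = [np.array([0.9, 0.1])]
b = [np.array([1.0]), np.array([0.2, 0.8])]
rules = recover_rules(a, b, n_relations=2)
print(rules[0])  # highest-confidence rule: relation 0, confidence 0.9 * 0.8
```

Tracking all of R_0..R_t mirrors the fact that Equation 7 reads from every earlier memory slot, so shorter rules survive alongside longer ones.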
The maximum number of training epochs is 10, and validation sets are used for early stopping.

4.1 Statistical relation learning

We conduct experiments on two benchmark datasets [12] in statistical relation learning. The first dataset, Unified Medical Language System (UMLS), is from biomedicine. The entities are biomedical concepts (e.g. disease, antibiotic) and the relations include treats and diagnoses. The second dataset, Kinship, contains kinship relationships among members of the Alyawarra tribe from Central Australia [6]. Dataset statistics are shown in Table 1. We randomly split the datasets into facts, train, and test files as described above with ratio 6:2:1. The evaluation metric is Hits@10. Experiment results are shown in Table 2. Compared with Iterative Structural Gradient (ISG) [27], Neural LP achieves better performance on both datasets.² We conjecture that this is mainly because of the optimization strategy used in Neural LP, which is end-to-end gradient-based, while ISG's optimization alternates between structure and parameter search.

Table 1: Dataset statistics.

          # Data   # Relation   # Entity
UMLS      5960     46           135
Kinship   9587     25           104

Table 2: Experiment results. T indicates the maximum rule length.

          ISG               Neural LP
          T = 2    T = 3    T = 2    T = 3
UMLS      43.5     43.3     92.0     93.2
Kinship   59.2     59.0     90.2     90.1

Figure 3: Accuracy on grid path finding.

4.2 Grid path finding

Since in the previous tasks the learned rules are of length at most three, we design a synthetic task to test whether Neural LP can learn longer rules. The experiment setup includes a knowledge base that contains location information about a 16-by-16 grid, such as North((1,2),(1,1)) and SouthEast((0,2),(1,1)). The query is randomly generated by combining a series of directions, such as North_SouthWest.
The train and test examples are pairs of start and end locations, which are\ngenerated by randomly choosing a location on the grid and then following the queries. We classify\nthe queries into four classes based on the path length (i.e. Hamming distance between start and\nend), ranging from two to ten. Figure 3 shows inference accuracy of this task for learning logical\nrules using ISG [27] and Neural LP. As the path length and learning dif\ufb01culty increase, the results\nshow that Neural LP can accurately learn rules of length 6-8 for this task, and is more robust than\nISG in terms of handling longer rules.\n\n4.3 Knowledge base completion\n\nWe also conduct experiments on the canonical knowledge base completion task as described in [3].\nIn this task, the query and tail are part of a missing data tuple, and the goal is to retrieve the\nrelated head. For example, if HasOfficeInCountry(USA,Uber) is missing from the knowledge\nbase, then the goal is to reason over existing data tuples and retrieve USA when presented with\nquery HasOfficeInCountry and Uber. To represent the query as a continuous input to the neural\ncontroller, we jointly learn an embedding lookup table for each query. The embedding has dimension\n128 and is randomly initialized to unit norm vectors.\nThe knowledge bases in our experiments are from WordNet [17, 10] and Freebase [2]. We use the\ndatasets WN18 and FB15K, which are introduced in [3]. We also considered a more challenging\ndataset, FB15KSelected [25], which is constructed by removing near-duplicate and inverse relations\nfrom FB15K. We use the same train/validation/test split as in prior work and augment data \ufb01les with\nreversed data tuples, i.e. for each relation, we add its inverse inv_relation. In order to create a\n\n2We use the implementation of ISG available at https://github.com/TeamCohen/ProPPR. In Wang\net al. 
[27], ISG is compared with other statistical relational learning methods in a different experiment setup, and ISG is superior to several methods, including Markov Logic Networks [12].

facts file which will be used as the knowledge base, we further split the original train file into facts and train with ratio 3:1.³ The dataset statistics are summarized in Table 3.

Table 3: Knowledge base completion datasets statistics.

Dataset         # Facts    # Train    # Test   # Relation   # Entity
WN18            106,088    35,354     5,000    18           40,943
FB15K           362,538    120,604    59,071   1,345        14,951
FB15KSelected   204,087    68,028     20,466   237          14,541

The attention vector at each step is by default applied to all relations in the knowledge base. Sometimes this creates an unnecessarily large search space. In our experiment on FB15K, we therefore use a subset of operators for each query. The subsets are chosen by including the top 128 relations that share common entities with the query. For all datasets, the maximum rule length T is 2.

The evaluation metrics we use are Mean Reciprocal Rank (MRR) and Hits@10. MRR computes the average of the reciprocal ranks of the desired entities. Hits@10 computes the percentage of desired entities ranked among the top ten. Following the protocol of Bordes et al. [3], we also use filtered rankings. We compare the performance of Neural LP with several models, as summarized in Table 4.

Table 4: Knowledge base completion performance comparison. TransE [4] and Neural Tensor Network [24] results are extracted from [29].
Results on FB15KSelected are from [25].

                           WN18             FB15K            FB15KSelected
                           MRR   Hits@10    MRR   Hits@10    MRR   Hits@10
Neural Tensor Network      0.53  66.1       0.25  41.4       -     -
TransE                     0.38  90.9       0.32  53.9       -     -
DISTMULT [29]              0.83  94.2       0.35  57.7       0.25  40.8
Node+LinkFeat [25]         0.94  94.3       0.82  87.0       0.23  34.7
Implicit ReasoNets [23]    -     95.3       -     92.7       -     -
Neural LP                  0.94  94.5       0.76  83.7       0.24  36.2

Neural LP gives state-of-the-art results on WN18, and results close to the state-of-the-art on FB15K. It has been noted [25] that many relations in WN18 and FB15K have their inverses also defined, which makes them easy to learn. FB15KSelected is a more challenging dataset, and on it, Neural LP substantially improves the performance over Node+LinkFeat [25] and achieves performance similar to DISTMULT [29] in terms of MRR. We note that in FB15KSelected, since the test entities are rarely directly linked in the knowledge base, the models need to reason explicitly about compositions of relations. The logical rules learned by Neural LP can very naturally capture such compositions.

Examples of rules learned by Neural LP are shown in Table 5. The number in front of each rule is the normalized confidence, which is computed by dividing by the maximum confidence among the rules for each relation. From the examples we can see that Neural LP successfully combines structure learning and parameter learning. It not only induces multiple logical rules to capture the complex structure in the knowledge base, but also learns to distribute confidences over rules.

To demonstrate the inductive learning advantage of Neural LP, we conduct experiments in which training and testing use disjoint sets of entities. To create such a setting, we first randomly select a subset of the test tuples to be the test set. Secondly, we filter the train set by excluding any tuples that share entities with the selected test tuples.
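The entity-disjoint split just described can be sketched as follows (the triple format (relation, head, tail) and the helper name are illustrative):

```python
import random

def inductive_split(test_pool, train_tuples, k, seed=0):
    # Randomly keep k test tuples, then drop every train tuple that
    # shares an entity with the selected test tuples.
    rng = random.Random(seed)
    test = rng.sample(test_pool, k)
    held_out = {e for (_rel, head, tail) in test for e in (head, tail)}
    train = [(r, h, t) for (r, h, t) in train_tuples
             if h not in held_out and t not in held_out]
    return train, test

train = [("CityInCountry", "USA", "New York"),
         ("CityInCountry", "France", "Paris")]
test_pool = [("HasOfficeInCountry", "USA", "Uber")]
new_train, test = inductive_split(test_pool, train, k=1)
print(new_train)  # -> [('CityInCountry', 'France', 'Paris')]
```

Every entity appearing in a test tuple is thus unseen at training time, which is exactly the condition under which entity-independent rules retain an advantage over learned entity embeddings.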
Table 6 shows the experiment results in this inductive setting.

³ We also make minimal adjustments to ensure that all query relations in test appear at least once in train and that all entities in train and test are also in facts. For FB15KSelected, we also ensure that entities in train are not directly linked in facts.

Table 5: Examples of logical rules learned by Neural LP on FB15KSelected. The letters A, B, C are ungrounded logic variables.

1.00  partially_contains(C,A) ← contains(B,A) ∧ contains(B,C)
0.45  partially_contains(C,A) ← contains(A,B) ∧ contains(B,C)
0.35  partially_contains(C,A) ← contains(C,B) ∧ contains(B,A)
1.00  marriage_location(C,A) ← nationality(C,B) ∧ contains(B,A)
0.35  marriage_location(B,A) ← nationality(B,A)
0.24  marriage_location(C,A) ← place_lived(C,B) ∧ contains(B,A)
1.00  film_edited_by(B,A) ← nominated_for(A,B)
0.20  film_edited_by(C,A) ← award_nominee(B,A) ∧ nominated_for(B,C)

Table 6: Inductive knowledge base completion. The metric is Hits@10.

            WN18     FB15K    FB15KSelected
TransE      0.01     0.48     0.53
Neural LP   94.49    73.28    27.97

As expected, the inductive setting results in a huge decrease in performance for the TransE model⁴, which uses a transductive learning approach; for all three datasets, Hits@10 drops to near zero. In contrast, Neural LP is much less affected by the amount of unseen entities and achieves performance on the same scale as in the non-inductive setting. This emphasizes that our Neural LP model has the advantage of being able to transfer to unseen entities.

4.4 Question answering against knowledge base

We also conduct experiments on a knowledge reasoning task where the query is "partially structured", as the query is posed partially in natural language.
An example of a partially structured query would be "in which country does x have an office" for a given entity x, instead of HasOfficeInCountry(Y,x). Neural LP handles queries of this sort very naturally, since the input to the neural controller is a vector which can encode either a structured query or natural language text.

We use the WIKIMOVIES dataset from Miller et al. [16]. The dataset contains a knowledge base and question-answer pairs. Each question (i.e. the query) is about an entity, and the answers are sets of entities in the knowledge base. There are 196,453 train examples and 10,000 test examples. The knowledge base has 43,230 movie-related entities and nine relations. A subset of the dataset is shown in Table 7.

Table 7: A subset of the WIKIMOVIES dataset.

Knowledge base:
directed_by(Blade Runner,Ridley Scott)
written_by(Blade Runner,Philip K. Dick)
starred_actors(Blade Runner,Harrison Ford)
starred_actors(Blade Runner,Sean Young)

Questions:
What year was the movie Blade Runner released?
Who is the writer of the film Blade Runner?

We process the dataset to match the input format of Neural LP. For each question, we identify the tail entity by checking which words match entities in the knowledge base. We also filter the words in the question, keeping only the top 100 most frequent words. The length of each question is limited to six words. To represent a query in natural language as a continuous input for the neural controller, we jointly learn an embedding lookup table for all words appearing in the query. The query representation is computed as the arithmetic mean of the embeddings of the words in it.

⁴ We use the implementation of TransE available at https://github.com/thunlp/KB2E.

We compare Neural LP with several embedding-based QA models.
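The query representation described above (filter to a small vocabulary, truncate to six words, average the word embeddings) can be sketched as follows; the vocabulary and random vectors are illustrative stand-ins for the jointly learned lookup table:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 128
vocab = ["who", "is", "the", "writer", "of", "film", "what", "year"]
embed = {w: rng.normal(size=dim) for w in vocab}  # placeholder lookup table

def query_vector(question, max_len=6):
    # Keep only in-vocabulary words, truncate, then take the arithmetic mean.
    words = [w for w in question.lower().split() if w in embed][:max_len]
    return np.mean([embed[w] for w in words], axis=0)

q = query_vector("Who is the writer of the film Blade Runner?")
print(q.shape)  # -> (128,)
```

The resulting vector plays the same role as the learned query embedding in the structured-query experiments: it is simply fed to the neural controller as the input at each step.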
The main difference between these methods and ours is that Neural LP does not embed the knowledge base, but instead learns to compose operators defined on the knowledge base. The comparison is summarized in Table 8. Experiment results are extracted from Miller et al. [16].

Table 8: Performance comparison. Memory Network is from [28]. QA system is from [4].

Model                             Accuracy
Memory Network                    78.5
QA system                         93.5
Key-Value Memory Network [16]     93.9
Neural LP                         94.6

Figure 4: Visualization of learned logical rules.

To visualize the learned model, we randomly sample 650 questions from the test dataset and compute the embeddings of each question. We use t-SNE [15] to reduce the embeddings to a two-dimensional space and plot them in Figure 4. Most learned logical rules consist of one relation from the knowledge base, and we use different colors to indicate the different relations and label some clusters by relation. The experiment results show that Neural LP can successfully handle queries that are posed in natural language by jointly learning word representations as well as the logical rules.

5 Conclusions

We present an end-to-end differentiable method for learning the parameters as well as the structure of logical rules for knowledge base reasoning. Our method, Neural LP, is inspired by a recent probabilistic differentiable logic, TensorLog [5]. Empirically, Neural LP improves performance on several knowledge base reasoning datasets. In the future, we plan to work on more problems where logical rules are essential and complementary to pattern recognition.

Acknowledgments

This work was funded by NSF under IIS1250956 and by Google Research.

References

[1] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Learning to compose neural networks for question answering. In Proceedings of NAACL-HLT, pages 1545-1554, 2016.

[2] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor.
Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1247–1250. ACM, 2008.

[3] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, pages 2787–2795, 2013.

[4] Antoine Bordes, Sumit Chopra, and Jason Weston. Question answering with subgraph embeddings. arXiv preprint arXiv:1406.3676, 2014.

[5] William W Cohen. Tensorlog: A differentiable deductive database. arXiv preprint arXiv:1605.06523, 2016.

[6] Woodrow W Denham. The detection of patterns in Alyawara nonverbal behavior. PhD thesis, University of Washington, Seattle, 1973.

[7] Lise Getoor. Introduction to statistical relational learning. MIT press, 2007.

[8] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, 2016.

[9] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

[10] Adam Kilgarriff and Christiane Fellbaum. Wordnet: An electronic lexical database, 2000.

[11] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[12] Stanley Kok and Pedro Domingos. Statistical predicate invention. In Proceedings of the 24th international conference on Machine learning, pages 433–440. ACM, 2007.

[13] Ni Lao and William W Cohen. Relational retrieval using a combination of path-constrained random walks. Machine learning, 81(1):53–67, 2010.

[14] Ni Lao, Tom Mitchell, and William W Cohen.
Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 529–539. Association for Computational Linguistics, 2011.

[15] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[16] Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. Key-value memory networks for directly reading documents. arXiv preprint arXiv:1606.03126, 2016.

[17] George A Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995.

[18] Stephen Muggleton, Ramon Otero, and Alireza Tamaddoni-Nezhad. Inductive logic programming, volume 38. Springer, 1992.

[19] Stephen Muggleton et al. Stochastic logic programs. Advances in inductive logic programming, 32:254–264, 1996.

[20] Arvind Neelakantan, Quoc V Le, and Ilya Sutskever. Neural programmer: Inducing latent programs with gradient descent. arXiv preprint arXiv:1511.04834, 2015.

[21] Arvind Neelakantan, Quoc V Le, Martin Abadi, Andrew McCallum, and Dario Amodei. Learning a natural language interface with neural programmer. arXiv preprint arXiv:1611.08945, 2016.

[22] Matthew Richardson and Pedro Domingos. Markov logic networks. Machine learning, 62(1-2):107–136, 2006.

[23] Yelong Shen, Po-Sen Huang, Ming-Wei Chang, and Jianfeng Gao. Implicit reasonet: Modeling large-scale structured relationships with shared memory. arXiv preprint arXiv:1611.04642, 2016.

[24] Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. Reasoning with neural tensor networks for knowledge base completion. In Advances in neural information processing systems, pages 926–934, 2013.

[25] Kristina Toutanova and Danqi Chen. Observed versus latent features for knowledge base and text inference.
In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, pages 57–66, 2015.

[26] William Yang Wang, Kathryn Mazaitis, and William W Cohen. Programming with personalized pagerank: a locally groundable first-order probabilistic logic. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management, pages 2129–2138. ACM, 2013.

[27] William Yang Wang, Kathryn Mazaitis, and William W Cohen. Structure learning via parameter learning. In CIKM 2014, 2014.

[28] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.

[29] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and relations for learning and inference in knowledge bases. In ICLR, 2015.