{"title": "End-to-end Differentiable Proving", "book": "Advances in Neural Information Processing Systems", "page_first": 3788, "page_last": 3800, "abstract": "We introduce deep neural networks for end-to-end differentiable theorem proving that operate on dense vector representations of symbols.  These neural networks are recursively constructed by following the backward chaining algorithm as used in Prolog.  Specifically, we replace symbolic unification with a differentiable computation on vector representations of symbols using a radial basis function kernel, thereby combining symbolic reasoning with learning subsymbolic vector representations.  The resulting neural network can be trained to infer facts from a given incomplete knowledge base using gradient descent.  By doing so, it learns to (i) place representations of similar symbols in close proximity in a vector space, (ii) make use of such similarities to prove facts, (iii) induce logical rules, and (iv) it can use provided and induced logical rules for complex multi-hop reasoning.  On four benchmark knowledge bases we demonstrate that this architecture outperforms ComplEx, a state-of-the-art neural link prediction model, while at the same time inducing interpretable function-free first-order logic rules.", "full_text": "End-to-End Differentiable Proving\n\nTim Rockt\u00e4schel\nUniversity of Oxford\n\ntim.rocktaschel@cs.ox.ac.uk\n\nSebastian Riedel\n\nUniversity College London & Bloomsbury AI\n\ns.riedel@cs.ucl.ac.uk\n\nAbstract\n\nWe introduce neural networks for end-to-end differentiable proving of queries to\nknowledge bases by operating on dense vector representations of symbols. These\nneural networks are constructed recursively by taking inspiration from the backward\nchaining algorithm as used in Prolog. 
Specifically, we replace symbolic unification\nwith a differentiable computation on vector representations of symbols using a\nradial basis function kernel, thereby combining symbolic reasoning with learning\nsubsymbolic vector representations. By using gradient descent, the resulting neural\nnetwork can be trained to infer facts from a given incomplete knowledge base.\nIt learns to (i) place representations of similar symbols in close proximity in a\nvector space, (ii) make use of such similarities to prove queries, (iii) induce logical\nrules, and (iv) use provided and induced logical rules for multi-hop reasoning. We\ndemonstrate that this architecture outperforms ComplEx, a state-of-the-art neural\nlink prediction model, on three out of four benchmark knowledge bases while at\nthe same time inducing interpretable function-free first-order logic rules.\n\n1 Introduction\n\nCurrent state-of-the-art methods for automated Knowledge Base (KB) completion use neural link\nprediction models to learn distributed vector representations of symbols (i.e. subsymbolic representations)\nfor scoring fact triples [1\u20137]. Such subsymbolic representations enable these models to generalize\nto unseen facts by encoding similarities: If the vector of the predicate symbol grandfatherOf is\nsimilar to the vector of the symbol grandpaOf, both predicates likely express a similar relation.\nLikewise, if the vector of the constant symbol LISA is similar to MAGGIE, similar relations likely\nhold for both constants (e.g. they live in the same city, have the same parents etc.).\nThis simple form of reasoning based on similarities is remarkably effective for automatically completing\nlarge KBs. However, in practice it is often important to capture more complex reasoning patterns\nthat involve several inference steps. For example, if ABE is the father of HOMER and HOMER is a\nparent of BART, we would like to infer that ABE is a grandfather of BART. 
Such transitive reasoning\nis inherently hard for neural link prediction models as they only learn to score facts locally. In\ncontrast, symbolic theorem provers like Prolog [8] enable exactly this type of multi-hop reasoning.\nFurthermore, Inductive Logic Programming (ILP) [9] builds upon such provers to learn interpretable\nrules from data and to exploit them for reasoning in KBs. However, symbolic provers lack the ability\nto learn subsymbolic representations and similarities between them from large KBs, which limits\ntheir ability to generalize to queries with similar but not identical symbols.\nWhile the connection between logic and machine learning has been addressed by statistical relational\nlearning approaches, these models traditionally do not support reasoning with subsymbolic\nrepresentations (e.g. [10]), and when using subsymbolic representations they are not trained end-to-end\nfrom training data (e.g. [11\u201313]). Neural multi-hop reasoning models [14\u201318] address the aforementioned\nlimitations to some extent by encoding reasoning chains in a vector space or by iteratively\nrefining subsymbolic representations of a question before comparison with answers. In many ways,\nthese models operate like basic theorem provers, but they lack two of their most crucial ingredients:\ninterpretability and straightforward ways of incorporating domain-specific knowledge in the form of\nrules.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nOur approach to this problem is inspired by recent neural network architectures like Neural Turing\nMachines [19], Memory Networks [20], Neural Stacks/Queues [21, 22], Neural Programmer [23],\nNeural Programmer-Interpreters [24], Hierarchical Attentive Memory [25] and the Differentiable\nForth Interpreter [26]. These architectures replace discrete algorithms and data structures by end-to-end\ndifferentiable counterparts that operate on real-valued vectors. 
At the heart of our approach is the\nidea to translate this concept to basic symbolic theorem provers, and hence combine their advantages\n(multi-hop reasoning, interpretability, easy integration of domain knowledge) with the ability to\nreason with vector representations of predicates and constants. Specifically, we keep variable binding\nsymbolic but compare symbols using their subsymbolic vector representations.\nConcretely, we introduce Neural Theorem Provers (NTPs): End-to-end differentiable provers for\nbasic theorems formulated as queries to a KB. We use Prolog\u2019s backward chaining algorithm as\na recipe for recursively constructing neural networks that are capable of proving queries to a KB\nusing subsymbolic representations. The success score of such proofs is differentiable with respect to\nvector representations of symbols, which enables us to learn such representations for predicates and\nconstants in ground atoms, as well as parameters of function-free first-order logic rules of predefined\nstructure. By doing so, NTPs learn to place representations of similar symbols in close proximity in a\nvector space and to induce rules given prior assumptions about the structure of logical relationships\nin a KB such as transitivity. Furthermore, NTPs can seamlessly reason with provided domain-specific\nrules. As NTPs operate on distributed representations of symbols, a single hand-crafted rule can\nbe leveraged for many proofs of queries with symbols that have a similar representation. 
Finally,\nNTPs demonstrate a high degree of interpretability as they induce latent rules that we can decode to\nhuman-readable symbolic rules.\nOur contributions are threefold: (i) We present the construction of NTPs inspired by Prolog\u2019s backward\nchaining algorithm and a differentiable unification operation using subsymbolic representations,\n(ii) we propose optimizations to this architecture by joint training with a neural link prediction model,\nbatch proving, and approximate gradient calculation, and (iii) we experimentally show that NTPs can\nlearn representations of symbols and function-free first-order rules of predefined structure, enabling\nthem to learn to perform multi-hop reasoning on benchmark KBs and to outperform ComplEx [7], a\nstate-of-the-art neural link prediction model, on three out of four KBs.\n\n2 Background\n\nIn this section, we briefly introduce the syntax of KBs that we use in the remainder of the paper.\nWe refer the reader to [27, 28] for a more in-depth introduction. An atom consists of a predicate\nsymbol and a list of terms. We will use lowercase names to refer to predicate and constant symbols\n(e.g. fatherOf and BART), and uppercase names for variables (e.g. X, Y, Z). As we only consider\nfunction-free first-order logic rules, a term can only be a constant or a variable. For instance,\n[grandfatherOf, Q, BART] is an atom with the predicate grandfatherOf, and two terms, the\nvariable Q and the constant BART. We consider rules of the form H :\u2013 B, where the body B is a\npossibly empty conjunction of atoms represented as a list, and the head H is an atom. We call a rule\nwith no free variables a ground rule. All variables are universally quantified. We call a ground rule\nwith an empty body a fact. A substitution set \u03c8 = {X_1/t_1, . . . , X_N/t_N} is an assignment of variable\nsymbols X_i to terms t_i, and applying substitutions to an atom replaces all occurrences of variables\nX_i by their respective term t_i.\nGiven a query (also called goal) such as [grandfatherOf, Q, BART], we can use Prolog\u2019s backward\nchaining algorithm to find substitutions for Q [8] (see appendix A for pseudocode). On a high level,\nbackward chaining is based on two functions called OR and AND. OR iterates through all rules\n(including rules with an empty body, i.e., facts) in a KB and unifies the goal with the respective\nrule head, thereby updating a substitution set. It is called OR since any successful proof suffices\n(disjunction). If unification succeeds, OR calls AND to prove all atoms (subgoals) in the body of\nthe rule. To prove subgoals of a rule body, AND first applies substitutions to the first atom that is\nthen proven by again calling OR, before proving the remaining subgoals by recursively calling AND.\nThis function is called AND as all atoms in the body need to be proven together (conjunction). As\nan example, a rule such as [grandfatherOf, X, Y] :\u2013 [[fatherOf, X, Z], [parentOf, Z, Y]] is used\nin OR for translating a goal like [grandfatherOf, Q, BART] into subgoals [fatherOf, Q, Z] and\n[parentOf, Z, BART] that are subsequently proven by AND.\u00b9\n\n3 Differentiable Prover\n\nIn the following, we describe the recursive construction of NTPs \u2013 neural networks for end-to-end\ndifferentiable proving that allow us to calculate the gradient of proof successes with respect to vector\nrepresentations of symbols. We define the construction of NTPs in terms of modules similar to\ndynamic neural module networks [29]. 
Each module takes as inputs discrete objects (atoms and rules)\nand a proof state, and returns a list of new proof states (see Figure 1 for a graphical representation).\nA proof state S = (\u03c8, \u03c1) is a tuple consisting of the substitution set \u03c8 constructed in the proof so far\nand a neural network \u03c1 that outputs a real-valued success score of a (partial) proof. While discrete\nobjects and the substitution set are only used during construction of the neural network, once the\nnetwork is constructed a continuous proof success score can be calculated for many different goals at\ntraining and test time. To summarize, modules are instantiated by discrete objects and the substitution\nset. They construct a neural network representing the (partial) proof success score and recursively\ninstantiate submodules to continue the proof.\nThe shared signature of modules is D \u00d7 S \u2192 S^N where D is a domain that controls the construction of\nthe network, S is the domain of proof states, and N is the number of output proof states. Furthermore,\nlet S_\u03c8 denote the substitution set of the proof state S and let S_\u03c1 denote the neural network for\ncalculating the proof success.\nWe use pseudocode in the style of a functional programming language to define the behavior of modules\nand auxiliary functions. Particularly, we are making use of pattern matching to check for properties\nof arguments passed to a module. We denote sets by Euler script letters (e.g. \u2130), lists by small capital\nletters (e.g. E), lists of lists by blackboard bold letters (e.g. \ud835\udd3c) and we use : to refer to prepending an\nelement to a list (e.g. e : E or E : \ud835\udd3c). 
While an atom is a list of a predicate symbol and terms, a rule\ncan be seen as a list of atoms and thus a list of lists where the head of the list is the rule head.\u00b2\n\nFigure 1: A module is mapping an upstream proof state (left) to a list of new proof states (right),\nthereby extending the substitution set S_\u03c8 and adding nodes to the computation graph of the\nneural network S_\u03c1 representing the proof success.\n\n3.1 Unification Module\n\nUnification of two atoms, e.g., a goal that we want to prove and a rule head, is a central operation\nin backward chaining. Two non-variable symbols (predicates or constants) are checked for equality\nand the proof can be aborted if this check fails. However, we want to be able to apply rules even\nif symbols in the goal and head are not equal but similar in meaning (e.g. grandfatherOf and\ngrandpaOf) and thus replace symbolic comparison with a computation that measures the similarity\nof both symbols in a vector space.\nThe module unify updates a substitution set and creates a neural network for comparing the vector\nrepresentations of non-variable symbols in two sequences of terms. The signature of this module\nis L \u00d7 L \u00d7 S \u2192 S where L is the domain of lists of terms. unify takes two atoms represented\nas lists of terms and an upstream proof state, and maps these to a new proof state (substitution set\nand proof success). To this end, unify iterates through the list of terms of two atoms and compares\ntheir symbols. If one of the symbols is a variable, a substitution is added to the substitution set.\nOtherwise, the vector representations of the two non-variable symbols are compared using a Radial\nBasis Function (RBF) kernel [30] where \u00b5 is a hyperparameter that we set to 1/\u221a2 in our experiments.\nThe following pseudocode implements unify. 
Note that \"_\" matches every argument and that the order matters, i.e., if arguments match a line, subsequent lines are not evaluated.\n1. unify_\u03b8([ ], [ ], S) = S\n2. unify_\u03b8([ ], _, _) = FAIL\n3. unify_\u03b8(_, [ ], _) = FAIL\n4. unify_\u03b8(h : H, g : G, S) = unify_\u03b8(H, G, S') with S' = (S'_\u03c8, S'_\u03c1), where\n\nS'_\u03c8 = S_\u03c8 \u222a {h/g} if h \u2208 V\nS'_\u03c8 = S_\u03c8 \u222a {g/h} if g \u2208 V, h \u2209 V\nS'_\u03c8 = S_\u03c8 otherwise\n\nS'_\u03c1 = min(S_\u03c1, exp(\u2212\u2016\u03b8_h: \u2212 \u03b8_g:\u2016_2 / (2\u00b5\u00b2))) if h, g \u2209 V\nS'_\u03c1 = min(S_\u03c1, 1) otherwise\n\n\u00b9 For clarity, we will sometimes omit lists when writing rules and atoms, e.g., grandfatherOf(X, Y) :\u2013 fatherOf(X, Z), parentOf(Z, Y).\n\u00b2 For example, [[grandfatherOf, X, Y], [fatherOf, X, Z], [parentOf, Z, Y]].\n\nHere, S' refers to the new proof state, V refers to the set of variable symbols, h/g is a substitution\nfrom the variable symbol h to the symbol g, and \u03b8_g: denotes the embedding lookup of the non-variable\nsymbol with index g. unify is parameterized by an embedding matrix \u03b8 \u2208 \u211d^(|Z| \u00d7 k) where Z is the set\nof non-variable symbols and k is the dimension of vector representations of symbols. Furthermore,\nFAIL represents a unification failure due to mismatching arity of two atoms. Once a failure is reached,\nwe abort the creation of the neural network for this branch of proving. In addition, we constrain\nproofs to be cycle-free by checking whether a variable is already bound. Note that this is a simple\nheuristic that prohibits applying the same non-ground rule twice. There are more sophisticated ways\nfor finding and avoiding cycles in a proof graph such that the same rule can still be applied multiple\ntimes (e.g. 
[31]), but we leave this for future work.\n\nExample Assume that we are unifying two atoms [grandpaOf, ABE, BART] and [s, Q, i] given an\nupstream proof state S = (\u2205, \u03c1) where the latter input atom has placeholders for a predicate s\nand a constant i, and the neural network \u03c1 would output 0.7 when evaluated. Furthermore, assume\ngrandpaOf, ABE and BART represent the indices of the respective symbols in a global symbol\nvocabulary. Then, the new proof state constructed by unify is:\n\nunify_\u03b8([grandpaOf, ABE, BART], [s, Q, i], (\u2205, \u03c1)) = (S'_\u03c8, S'_\u03c1) =\n({Q/ABE}, min(\u03c1, exp(\u2212\u2016\u03b8_grandpaOf: \u2212 \u03b8_s:\u2016_2), exp(\u2212\u2016\u03b8_BART: \u2212 \u03b8_i:\u2016_2)))\n\nThus, the output score of the neural network S'_\u03c1 will be high if the subsymbolic representation of the\ninput s is close to grandpaOf and the input i is close to BART. However, the score cannot be higher\nthan 0.7 due to the upstream proof success score in the forward pass of the neural network \u03c1. Note\nthat in addition to extending the neural network \u03c1 to S'_\u03c1, this module also outputs a substitution set\n{Q/ABE} at graph creation time that will be used to instantiate submodules.\n\n3.2 OR Module\n\nBased on unify, we now define the or module which attempts to apply rules in a KB. The signature\nof or is L \u00d7 \u2115 \u00d7 S \u2192 S^N where L is the domain of goal atoms and \u2115 is the domain of integers used\nfor specifying the maximum proof depth of the neural network. Furthermore, N is the number of\npossible output proof states for a goal of a given structure and a provided KB.\u00b3 We implement or as\n1. or_\u03b8^K(G, d, S) = [S' | S' \u2208 and_\u03b8^K(B, d, unify_\u03b8(H, G, S)) for H :\u2013 B \u2208 K]\nwhere H :\u2013 B denotes a rule in a given KB K with a head atom H and a list of body atoms B. In\ncontrast to the symbolic OR method, the or module is able to use the grandfatherOf rule above\nfor a query involving grandpaOf provided that the subsymbolic representations of both predicates\nare similar as measured by the RBF kernel in the unify module.\n\nExample For a goal [s, Q, i], or would instantiate an and submodule based on the rule\n[grandfatherOf, X, Y] :\u2013 [[fatherOf, X, Z], [parentOf, Z, Y]] as follows\n\nor_\u03b8^K([s, Q, i], d, S) = [S' | S' \u2208 and_\u03b8^K([[fatherOf, X, Z], [parentOf, Z, Y]], d, ({X/Q, Y/i}, \u015c_\u03c1)), . . .]\n\nwhere the proof state ({X/Q, Y/i}, \u015c_\u03c1) is the result of unify.\n\n\u00b3 The creation of the neural network is dependent on the KB but also the structure of the goal. For instance,\nthe goal s(Q, i) would result in a different neural network, and hence a different number of output proof states,\nthan s(i, j).\n\n3.3 AND Module\n\nFor implementing and we first define an auxiliary function called substitute which applies\nsubstitutions to variables in an atom if possible. This is realized via\n1. substitute([ ], _) = [ ]\n2. substitute(g : G, \u03c8) = (x if g/x \u2208 \u03c8, g otherwise) : substitute(G, \u03c8)\nFor example, substitute([fatherOf, X, Z], {X/Q, Y/i}) results in [fatherOf, Q, Z].\nThe signature of and is L \u00d7 \u2115 \u00d7 S \u2192 S^N where L is the domain of lists of atoms and N is the\nnumber of possible output proof states for a list of atoms with a known structure and a provided KB.\nThis module is implemented as\n1. and_\u03b8^K(_, _, FAIL) = FAIL\n2. and_\u03b8^K(_, 0, _) = FAIL\n3. and_\u03b8^K([ ], _, S) = S\n4. and_\u03b8^K(G : \ud835\udd3e, d, S) = [S'' | S'' \u2208 and_\u03b8^K(\ud835\udd3e, d, S') for S' \u2208 or_\u03b8^K(substitute(G, S_\u03c8), d \u2212 1, S)]\nwhere the first two lines define the failure of a proof, either because of an upstream unification\nfailure that has been passed from the or module (line 1), or because the maximum proof depth has\nbeen reached (line 2). Line 3 specifies a proof success, i.e., the list of subgoals is empty before the\nmaximum proof depth has been reached. Lastly, line 4 defines the recursion: The first subgoal G is\nproven by instantiating an or module after substitutions are applied, and every resulting proof state\nS' is used for proving the remaining subgoals \ud835\udd3e by again instantiating and modules.\n\nExample Continuing the example from Section 3.2, the and module would instantiate submodules\nas follows:\n\nand_\u03b8^K([[fatherOf, X, Z], [parentOf, Z, Y]], d, ({X/Q, Y/i}, \u015c_\u03c1)) =\n[S'' | S'' \u2208 and_\u03b8^K([[parentOf, Z, Y]], d, S') for S' \u2208 or_\u03b8^K([fatherOf, Q, Z], d \u2212 1, ({X/Q, Y/i}, \u015c_\u03c1))]\n\nwhere ({X/Q, Y/i}, \u015c_\u03c1) is the result of unify in or and [fatherOf, Q, Z] is the result of substitute.\n\n3.4 Proof Aggregation\n\nFinally, we define the overall success score of proving a goal G using a KB K with parameters \u03b8 as\n\nntp_\u03b8^K(G, d) = arg max S_\u03c1 over all S \u2208 or_\u03b8^K(G, d, (\u2205, 1)) with S \u2260 FAIL\n\nwhere d is a predefined maximum proof depth and the initial proof state is set to an empty substitution\nset and a proof success score of 1.\n\nExample Figure 2 illustrates an exemplary NTP computation graph constructed for a toy KB. Note\nthat such an NTP is constructed once before training, and can then be used for proving goals of the\nstructure [s, i, j] at training and test time where s is the index of an input predicate, and i and j are\nindices of input constants. Final proof states which are used in proof aggregation are underlined.\n\n3.5 Neural Inductive Logic Programming\n\nWe can use NTPs for ILP by gradient descent instead of a combinatorial search over the space of\nrules as, for example, done by the First Order Inductive Learner (FOIL) [32]. 
Specifically, we are\nusing the concept of learning from entailment [9] to induce rules that let us prove known ground\natoms, but that do not give high proof success scores to sampled unknown ground atoms.\n\n[Figure 2: computation graph not reproduced. Example knowledge base used in the figure: 1. fatherOf(ABE, HOMER). 2. parentOf(HOMER, BART). 3. grandfatherOf(X, Y) :\u2013 fatherOf(X, Z), parentOf(Z, Y).]\n\nFigure 2: Exemplary construction of an NTP computation graph for a toy knowledge base. Indices\non arrows correspond to application of the respective KB rule. Proof states (blue) are subscripted\nwith the sequence of indices of the rules that were applied. Underlined proof states are aggregated to\nobtain the final proof success. Boxes visualize instantiations of modules (omitted for unify). The\nproofs S_33, S_313 and S_323 fail due to cycle-detection (the same rule cannot be applied twice).\n\nLet \u03b8_r:, \u03b8_s:, \u03b8_t: \u2208 \u211d^k be representations of some unknown predicates with indices r, s and t\nrespectively. The prior knowledge of a transitivity between three unknown predicates can be specified via\nr(X, Y) :\u2013 s(X, Z), t(Z, Y). We call this a parameterized rule as the corresponding predicates are\nunknown and their representations are learned from data. Such a rule can be used for proofs at training\nand test time in the same way as any other given rule. During training, the predicate representations\nof parameterized rules are optimized jointly with all other subsymbolic representations. Thus, the\nmodel can adapt parameterized rules such that proofs for known facts succeed while proofs for\nsampled unknown ground atoms fail, thereby inducing rules of predefined structures like the one\nabove. Inspired by [33], we use rule templates for conveniently defining the structure of multiple\nparameterized rules by specifying the number of parameterized rules that should be instantiated for\na given rule structure (see appendix E for examples). For inspection after training, we decode a\nparameterized rule by searching for the closest representations of known predicates. In addition,\nwe provide users with a rule confidence by taking the minimum similarity between unknown and\ndecoded predicate representations using the RBF kernel in unify. This confidence score is an upper\nbound on the proof success score that can be achieved when the induced rule is used in proofs.\n\n4 Optimization\n\nIn this section, we present the basic training loss that we use for NTPs, a training loss where a neural\nlink prediction model is used as an auxiliary task, as well as various computational optimizations.\n\n4.1 Training Objective\n\nLet K be the set of known facts in a given KB. Usually, we do not observe negative facts and thus\nresort to sampling corrupted ground atoms as done in previous work [34]. 
Specifically, for every\n[s, i, j] \u2208 K we obtain corrupted ground atoms [s, \u00ee, j], [s, i, \u0135], [s, \u0129, j\u0303] \u2209 K by sampling \u00ee, \u0135, \u0129 and j\u0303\nfrom the set of constants. These corrupted ground atoms are resampled in every iteration of training,\nand we denote the set of known and corrupted ground atoms together with their target score (1.0 for\nknown ground atoms and 0.0 for corrupted ones) as T. We use the negative log-likelihood of the\nproof success score as loss function for an NTP with parameters \u03b8 and a given KB K\n\nL_ntp_\u03b8^K = \u2211_{([s,i,j],y) \u2208 T} \u2212y log(ntp_\u03b8^K([s, i, j], d)_\u03c1) \u2212 (1 \u2212 y) log(1 \u2212 ntp_\u03b8^K([s, i, j], d)_\u03c1)\n\nwhere [s, i, j] is a training ground atom and y its target proof success score. Note that since in our\napplication all training facts are ground atoms, we only make use of the proof success score \u03c1 and not\nthe substitution list of the resulting proof state. We can prove known facts trivially by a unification\nwith themselves, resulting in no parameter updates during training and hence no generalization.\nTherefore, during training we are masking the calculation of the unification success of a known\nground atom that we want to prove. Specifically, we set the unification score to 0 to temporarily hide\nthat training fact and assume it can be proven from other facts and rules in the KB.\n\n4.2 Neural Link Prediction as Auxiliary Loss\n\nAt the beginning of training all subsymbolic representations are initialized randomly. 
When unifying\na goal with all facts in a KB we consequently get very noisy success scores in early stages of training.\nMoreover, as only the maximum success score will result in gradient updates for the respective\nsubsymbolic representations along the maximum proof path, it can take a long time until NTPs learn\nto place similar symbols close to each other in the vector space and to make effective use of rules.\nTo speed up learning subsymbolic representations, we train NTPs jointly with ComplEx [7]\n(Appendix B). ComplEx and the NTP share the same subsymbolic representations, which is feasible\nas the RBF kernel in unify is also defined for complex vectors. While the NTP is responsible for\nmulti-hop reasoning, the neural link prediction model learns to score ground atoms locally. At test\ntime, only the NTP is used for predictions. Thus, the training loss for ComplEx can be seen as an\nauxiliary loss for the subsymbolic representations learned by the NTP. We term the resulting model\nNTP\u03bb. Based on the loss in Section 4.1, the joint training loss is defined as\n\nL_ntp\u03bb_\u03b8^K = L_ntp_\u03b8^K + \u2211_{([s,i,j],y) \u2208 T} \u2212y log(complex_\u03b8(s, i, j)) \u2212 (1 \u2212 y) log(1 \u2212 complex_\u03b8(s, i, j))\n\nwhere [s, i, j] is a training atom and y its ground truth target.\n\n4.3 Computational Optimizations\n\nNTPs as described above suffer from severe computational limitations since the neural network is\nrepresenting all possible proofs up to some predefined depth. In contrast to symbolic backward\nchaining where a proof can be aborted as soon as unification fails, in differentiable proving we only get\na unification failure for atoms whose arity does not match or when we detect cyclic rule application.\nWe propose two optimizations to speed up NTPs in the Appendix. First, we make use of modern\nGPUs by batch processing many proofs in parallel (Appendix C). 
Second, we exploit the sparseness\nof gradients caused by the min and max operations used in the unification and proof aggregation\nrespectively to derive a heuristic for a truncated forward and backward pass that drastically reduces\nthe number of proofs that have to be considered for calculating gradients (Appendix D).\n\n5 Experiments\n\nConsistent with previous work, we carry out experiments on four benchmark KBs and compare\nComplEx with the NTP and NTP\u03bb in terms of area under the Precision-Recall-curve (AUC-PR) on\nthe Countries KB, and Mean Reciprocal Rank (MRR) and HITS@m [34] on the other KBs described\nbelow. Training details, including hyperparameters and rule templates, can be found in Appendix E.\n\nCountries The Countries KB is a dataset introduced by [35] for testing reasoning capabilities of\nneural link prediction models. It consists of 244 countries, 5 regions (e.g. EUROPE), 23 subregions\n(e.g. WESTERN EUROPE, NORTHERN AMERICA), and 1158 facts about the neighborhood of countries,\nand the location of countries and subregions. We follow [36] and split countries randomly into a\ntraining set of 204 countries (train), a development set of 20 countries (dev), and a test set of 20 countries\n(test), such that every dev and test country has at least one neighbor in the training set. Subsequently,\nthree different task datasets are created. For all tasks, the goal is to predict locatedIn(c, r) for every\ntest country c and all five regions r, but the access to training atoms in the KB varies.\nS1: All ground atoms locatedIn(c, r) where c is a test country and r is a region\nare removed from the KB. 
Since information about the subregion of test countries is still contained in the KB,\nthis task can be solved by using the transitivity rule\nlocatedIn(X, Y) :\u2013 locatedIn(X, Z), locatedIn(Z, Y).\n\nTable 1: AUC-PR results on Countries and MRR and HITS@m on Kinship, Nations, and UMLS.\n\nCorpus | Metric | ComplEx | NTP | NTP\u03bb\nCountries | S1 AUC-PR | 99.37 \u00b1 0.4 | 90.83 \u00b1 15.4 | 100.00 \u00b1 0.0\nCountries | S2 AUC-PR | 87.95 \u00b1 2.8 | 87.40 \u00b1 11.7 | 93.04 \u00b1 0.4\nCountries | S3 AUC-PR | 48.44 \u00b1 6.3 | 56.68 \u00b1 17.6 | 77.26 \u00b1 17.0\nKinship | MRR | 0.81 | 0.60 | 0.80\nKinship | HITS@1 | 0.70 | 0.48 | 0.76\nKinship | HITS@3 | 0.89 | 0.70 | 0.82\nKinship | HITS@10 | 0.98 | 0.78 | 0.89\nNations | MRR | 0.75 | 0.75 | 0.74\nNations | HITS@1 | 0.62 | 0.62 | 0.59\nNations | HITS@3 | 0.84 | 0.86 | 0.89\nNations | HITS@10 | 0.99 | 0.99 | 0.99\nUMLS | MRR | 0.89 | 0.88 | 0.93\nUMLS | HITS@1 | 0.82 | 0.82 | 0.87\nUMLS | HITS@3 | 0.96 | 0.92 | 0.98\nUMLS | HITS@10 | 1.00 | 0.97 | 1.00\n\nExamples of induced rules and their confidence\nCountries:\n0.90 locatedIn(X,Y) :\u2013 locatedIn(X,Z), locatedIn(Z,Y).\n0.63 locatedIn(X,Y) :\u2013 neighborOf(X,Z), locatedIn(Z,Y).\n0.32 locatedIn(X,Y) :\u2013 neighborOf(X,Z), neighborOf(Z,W), locatedIn(W,Y).\nKinship:\n0.98 term15(X,Y) :\u2013 term5(Y,X)\n0.97 term18(X,Y) :\u2013 term18(Y,X)\n0.86 term4(X,Y) :\u2013 term4(Y,X)\n0.73 term12(X,Y) :\u2013 term10(X, Z), term12(Z, Y).\nNations:\n0.68 blockpositionindex(X,Y) :\u2013 blockpositionindex(Y,X).\n0.46 expeldiplomats(X,Y) :\u2013 negativebehavior(X,Y).\n0.38 negativecomm(X,Y) :\u2013 commonbloc0(X,Y).\n0.38 intergovorgs3(X,Y) :\u2013 intergovorgs(Y,X).\nUMLS:\n0.88 interacts_with(X,Y) :\u2013 interacts_with(X,Z), interacts_with(Z,Y).\n0.77 isa(X,Y) :\u2013 isa(X,Z), isa(Z,Y).\n0.71 derivative_of(X,Y) :\u2013 derivative_of(X,Z), derivative_of(Z,Y).\n\nS2: In addition to S1, all ground atoms locatedIn(c, s) are removed where c is a test country and s\nis a subregion. 
The location of test countries needs to be inferred from the location of their neighboring\ncountries: locatedIn(X, Y) :\u2013 neighborOf(X, Z), locatedIn(Z, Y). This task is more dif\ufb01cult\nthan S1, as neighboring countries might not be in the same region, so the rule above will not always\nhold.\nS3: In addition to S2, all ground atoms locatedIn(c, r) where r is a region and c\nis a training country that has a test or dev country as a neighbor are also removed.\nThe location of test countries can for instance be inferred using the three-hop rule\nlocatedIn(X, Y) :\u2013 neighborOf(X, Z), neighborOf(Z, W), locatedIn(W, Y).\n\nKinship, Nations & UMLS We use the Nations, Alyawarra kinship (Kinship) and Uni\ufb01ed Medical\nLanguage System (UMLS) KBs from [10]. We left out the Animals dataset as it only contains unary\npredicates and thus cannot be used for evaluating multi-hop reasoning. Nations contains 56 binary\npredicates, 111 unary predicates, 14 constants and 2565 true facts; Kinship contains 26 predicates,\n104 constants and 10686 true facts; and UMLS contains 49 predicates, 135 constants and 6529 true\nfacts. Since our baseline ComplEx cannot deal with unary predicates, we remove unary atoms from\nNations. We split every KB into 80% training facts, 10% development facts and 10% test facts. For\nevaluation, we take a test fact and corrupt its \ufb01rst and second argument in all possible ways such that\nthe corrupted fact is not in the original KB. Subsequently, we predict a ranking of every test fact and\nits corruptions to calculate MRR and HITS@m.\n\n6 Results and Discussion\n\nResults for the different model variants on the benchmark KBs are shown in Table 1. Another method\nfor inducing rules in a differentiable way for automated KB completion has been introduced recently\nby [37] and our evaluation setup is equivalent to their Protocol II. 
However, our neural link prediction\nbaseline, ComplEx, already achieves much higher HITS@10 results (1.00 vs. 0.70 on UMLS and\n0.98 vs. 0.73 on Kinship). We thus focus on the comparison of NTPs with ComplEx.\nFirst, we note that vanilla NTPs alone do not work particularly well compared to ComplEx. They only\noutperform ComplEx on Countries S3 and Nations, but not on Kinship or UMLS. This demonstrates\nthe dif\ufb01culty of learning subsymbolic representations in a differentiable prover from uni\ufb01cation\nalone, and the need for auxiliary losses. The NTP\u03bb with ComplEx as auxiliary loss outperforms the\nother models in the majority of tasks. The difference in AUC-PR between ComplEx and NTP\u03bb is\nsigni\ufb01cant for all Countries tasks (p < 0.0001).\nA major advantage of NTPs is that we can inspect induced rules, which provide us with an interpretable\nrepresentation of what the model has learned. The right column in Table 1 shows examples of rules\ninduced by NTP (note that predicates on Kinship are anonymized). For Countries, the NTP recovered\nthose rules that are needed for solving the three different tasks. On UMLS, the NTP induced\ntransitivity rules. Those relationships are particularly hard to encode by neural link prediction models\nlike ComplEx, as they are optimized to locally predict the score of a fact.\n\n7 Related Work\n\nCombining neural and symbolic approaches to relational learning and reasoning has a long tradition\nand has led to various proposed architectures over the past decades (see [38] for a review). Early proposals\nfor neural-symbolic networks are limited to propositional rules (e.g., EBL-ANN [39], KBANN [40]\nand C-IL2P [41]). Other neural-symbolic approaches focus on \ufb01rst-order inference, but do not\nlearn subsymbolic vector representations from training facts in a KB (e.g., SHRUTI [42], Neural\nProlog [43], CLIP++ [44], Lifted Relational Neural Networks [45], and TensorLog [46]). 
Logic\nTensor Networks [47] are in spirit similar to NTPs, but need to fully ground \ufb01rst-order logic rules.\nHowever, they support function terms, whereas NTPs currently only support function-free terms.\nRecent question-answering architectures such as [15, 17, 18] translate query representations implicitly\nin a vector space without explicit rule representations and can thus not easily incorporate domain-\nspeci\ufb01c knowledge. In addition, NTPs are related to random walk [48, 49, 11, 12] and path encoding\nmodels [14, 16]. However, instead of aggregating paths from random walks or encoding paths to\npredict a target predicate, reasoning steps in NTPs are explicit and only uni\ufb01cation uses subsymbolic\nrepresentations. This allows us to induce interpretable rules, as well as to incorporate prior knowledge\neither in the form of rules or in the form of rule templates which de\ufb01ne the structure of logical\nrelationships that we expect to hold in a KB. Another line of work [50\u201354] regularizes distributed\nrepresentations via domain-speci\ufb01c rules, but these approaches do not learn such rules from data and\nonly support a restricted subset of \ufb01rst-order logic. NTPs are constructed from Prolog\u2019s backward\nchaining and are thus related to Uni\ufb01cation Neural Networks [55, 56]. However, NTPs operate on\nvector representations of symbols instead of scalar values, which are more expressive.\nAs NTPs can learn rules from data, they are related to ILP systems such as FOIL [32], Sherlock\n[57] and meta-interpretive learning of higher-order dyadic Datalog (Metagol) [58]. While these ILP\nsystems operate on symbols and search over the discrete space of logical rules, NTPs work with\nsubsymbolic representations and induce rules using gradient descent. Recently, [37] introduced\na differentiable rule learning system based on TensorLog and a neural network controller similar\nto LSTMs [59]. 
Their method is more scalable than the NTPs introduced here. However, on\nUMLS and Kinship our baseline already achieved stronger generalization by learning subsymbolic\nrepresentations. Still, scaling NTPs to larger KBs for competing with more scalable relational learning\nmethods is an open problem that we seek to address in future work.\n\n8 Conclusion and Future Work\n\nWe proposed an end-to-end differentiable prover for automated KB completion that operates on\nsubsymbolic representations. To this end, we used Prolog\u2019s backward chaining algorithm as a recipe\nfor recursively constructing neural networks that can be used to prove queries to a KB. Speci\ufb01cally,\nwe introduced a differentiable uni\ufb01cation operation between vector representations of symbols. The\nconstructed neural network allowed us to compute the gradient of proof successes with respect to\nvector representations of symbols, and thus enabled us to train subsymbolic representations end-to-\nend from facts in a KB, and to induce function-free \ufb01rst-order logic rules using gradient descent. On\nbenchmark KBs, our model outperformed ComplEx, a state-of-the-art neural link prediction model,\non three out of four KBs while at the same time inducing interpretable rules.\nTo overcome the computational limitations of the end-to-end differentiable prover introduced in\nthis paper, we want to investigate the use of hierarchical attention [25] and reinforcement learning\nmethods such as Monte Carlo tree search [60, 61] that have been used for learning to play Go [62]\nand chemical synthesis planning [63]. 
In addition, we plan to support function terms in the future.\nBased on [64], we are furthermore interested in applying NTPs to automated proving of mathematical\ntheorems, either in logical or natural language form, similar to recent approaches by [65] and [66].\n\nAcknowledgements\n\nWe thank Pasquale Minervini, Tim Dettmers, Matko Bosnjak, Johannes Welbl, Naoya Inoue, Kai\nArulkumaran, and the anonymous reviewers for very helpful comments on drafts of this paper. This\nwork has been supported by a Google PhD Fellowship in Natural Language Processing, an Allen\nDistinguished Investigator Award, and a Marie Curie Career Integration Award.\n\nReferences\n[1] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. Factorizing YAGO: scalable machine learning\nfor linked data. In Proceedings of the 21st World Wide Web Conference 2012, WWW 2012, Lyon, France,\nApril 16-20, 2012, pages 271\u2013280, 2012. doi: 10.1145/2187836.2187874.\n\n[2] Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. Relation extraction with\nmatrix factorization and universal schemas. In Human Language Technologies: Conference of the North\nAmerican Chapter of the Association of Computational Linguistics, Proceedings, June 9-14, 2013, Westin\nPeachtree Plaza Hotel, Atlanta, Georgia, USA, pages 74\u201384, 2013.\n\n[3] Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Y. Ng. Reasoning with neural tensor\nnetworks for knowledge base completion. In Advances in Neural Information Processing Systems 26:\n27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held\nDecember 5-8, 2013, Lake Tahoe, Nevada, United States., pages 926\u2013934, 2013.\n\n[4] Kai-Wei Chang, Wen-tau Yih, Bishan Yang, and Christopher Meek. Typed tensor decomposition of\nknowledge bases for relation extraction. 
In Proceedings of the 2014 Conference on Empirical Methods in\nNatural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a\nSpecial Interest Group of the ACL, pages 1568\u20131579, 2014.\n\n[5] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and relations\nfor learning and inference in knowledge bases. In International Conference on Learning Representations\n(ICLR), 2015.\n\n[6] Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon.\nRepresenting text for joint embedding of text and knowledge bases. In Proceedings of the 2015 Conference\non Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21,\n2015, pages 1499\u20131509, 2015.\n\n[7] Th\u00e9o Trouillon, Johannes Welbl, Sebastian Riedel, \u00c9ric Gaussier, and Guillaume Bouchard. Complex\nembeddings for simple link prediction. In Proceedings of the 33rd International Conference on Machine\nLearning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 2071\u20132080, 2016.\n\n[8] Herv\u00e9 Gallaire and Jack Minker, editors. Logic and Data Bases, Symposium on Logic and Data Bases,\nCentre d\u2019\u00e9tudes et de recherches de Toulouse, 1977, Advances in Data Base Theory, New York, 1978.\nPlenum Press. ISBN 0-306-40060-X.\n\n[9] Stephen Muggleton. Inductive logic programming. New Generation Comput., 8(4):295\u2013318, 1991. doi:\n10.1007/BF03037089.\n\n[10] Stanley Kok and Pedro M. Domingos. Statistical predicate invention. In Machine Learning, Proceedings\nof the Twenty-Fourth International Conference (ICML 2007), Corvallis, Oregon, USA, June 20-24, 2007,\npages 433\u2013440, 2007. doi: 10.1145/1273496.1273551.\n\n[11] Matt Gardner, Partha Pratim Talukdar, Bryan Kisiel, and Tom M. Mitchell. Improving learning and\ninference in a large knowledge-base using latent syntactic cues. 
In Proceedings of the 2013 Conference on\nEmpirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt\nSeattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pages\n833\u2013838, 2013.\n\n[12] Matt Gardner, Partha Pratim Talukdar, Jayant Krishnamurthy, and Tom M. Mitchell. Incorporating vector\nspace similarity in random walk inference over knowledge bases. In Proceedings of the 2014 Conference\non Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar,\nA meeting of SIGDAT, a Special Interest Group of the ACL, pages 397\u2013406, 2014.\n\n[13] Islam Beltagy, Stephen Roller, Pengxiang Cheng, Katrin Erk, and Raymond J Mooney. Representing\nmeaning with a combination of logical and distributional models. Computational Linguistics, 2017.\n\n[14] Arvind Neelakantan, Benjamin Roth, and Andrew McCallum. Compositional vector space models\nfor knowledge base completion. In Proceedings of the 53rd Annual Meeting of the Association for\nComputational Linguistics and the 7th International Joint Conference on Natural Language Processing\nof the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China,\nVolume 1: Long Papers, pages 156\u2013166, 2015.\n\n[15] Baolin Peng, Zhengdong Lu, Hang Li, and Kam-Fai Wong. Towards neural network-based reasoning.\nCoRR, abs/1508.05508, 2015.\n\n[16] Rajarshi Das, Arvind Neelakantan, David Belanger, and Andrew McCallum. Chains of reasoning over\nentities, relations, and text using recurrent neural networks. In Conference of the European Chapter of the\nAssociation for Computational Linguistics (EACL), 2017.\n\n[17] Dirk Weissenborn. Separating answers from queries for neural reading comprehension. CoRR,\nabs/1607.03316, 2016.\n\n[18] Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. Reasonet: Learning to stop reading\nin machine comprehension. 
In Proceedings of the Workshop on Cognitive Computation: Integrating\nneural and symbolic approaches 2016 co-located with the 30th Annual Conference on Neural Information\nProcessing Systems (NIPS 2016), Barcelona, Spain, December 9, 2016., 2016.\n\n[19] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. CoRR, abs/1410.5401, 2014.\n\n[20] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. CoRR, abs/1410.3916, 2014.\n\n[21] Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. Learning to transduce\nwith unbounded memory. In Advances in Neural Information Processing Systems 28: Annual Conference\non Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages\n1828\u20131836, 2015.\n\n[22] Armand Joulin and Tomas Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets.\nIn Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information\nProcessing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 190\u2013198, 2015.\n\n[23] Arvind Neelakantan, Quoc V. Le, and Ilya Sutskever. Neural programmer: Inducing latent programs with\n\ngradient descent. In International Conference on Learning Representations (ICLR), 2016.\n\n[24] Scott E. Reed and Nando de Freitas. Neural programmer-interpreters. In International Conference on\n\nLearning Representations (ICLR), 2016.\n\n[25] Marcin Andrychowicz, Misha Denil, Sergio Gomez Colmenarejo, Matthew W. Hoffman, David Pfau, Tom\nSchaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in\nNeural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems\n2016, December 5-10, 2016, Barcelona, Spain, pages 3981\u20133989, 2016.\n\n[26] Matko Bosnjak, Tim Rockt\u00e4schel, Jason Naradowsky, and Sebastian Riedel. Programming with a differen-\n\ntiable forth interpreter. 
In International Conference on Machine Learning (ICML), 2017.\n\n[27] Stuart J. Russell and Peter Norvig. Arti\ufb01cial Intelligence - A Modern Approach (3. internat. ed.). Pearson\n\nEducation, 2010. ISBN 978-0-13-207148-2.\n\n[28] Lise Getoor. Introduction to statistical relational learning. MIT press, 2007.\n\n[29] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Learning to compose neural networks\nfor question answering. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the\nAssociation for Computational Linguistics: Human Language Technologies, San Diego California, USA,\nJune 12-17, 2016, pages 1545\u20131554, 2016.\n\n[30] David S Broomhead and David Lowe. Radial basis functions, multi-variable functional interpolation and\n\nadaptive networks. Technical report, DTIC Document, 1988.\n\n[31] Allen Van Gelder. Ef\ufb01cient loop detection in prolog using the tortoise-and-hare technique. J. Log. Program.,\n\n4(1):23\u201331, 1987. doi: 10.1016/0743-1066(87)90020-3.\n\n[32] J. Ross Quinlan. Learning logical de\ufb01nitions from relations. Machine Learning, 5:239\u2013266, 1990. doi:\n\n10.1007/BF00117105.\n\n[33] William Yang Wang and William W. Cohen. Joint information extraction and reasoning: A scalable\nstatistical relational learning approach. In Proceedings of the 53rd Annual Meeting of the Association for\nComputational Linguistics and the 7th International Joint Conference on Natural Language Processing\nof the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China,\nVolume 1: Long Papers, pages 355\u2013364, 2015.\n\n[34] Antoine Bordes, Nicolas Usunier, Alberto Garc\u00eda-Dur\u00e1n, Jason Weston, and Oksana Yakhnenko. Translat-\ning embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems\n26: 27th Annual Conference on Neural Information Processing Systems 2013. 
Proceedings of a meeting\nheld December 5-8, 2013, Lake Tahoe, Nevada, United States., pages 2787\u20132795, 2013.\n\n[35] Guillaume Bouchard, Sameer Singh, and Theo Trouillon. On approximate reasoning capabilities of\nlow-rank vector spaces. In Proceedings of the 2015 AAAI Spring Symposium on Knowledge Representation\nand Reasoning (KRR): Integrating Symbolic and Neural Approaches, 2015.\n\n[36] Maximilian Nickel, Lorenzo Rosasco, and Tomaso A. Poggio. Holographic embeddings of knowledge\ngraphs. In Proceedings of the Thirtieth AAAI Conference on Arti\ufb01cial Intelligence, February 12-17, 2016,\nPhoenix, Arizona, USA., pages 1955\u20131961, 2016.\n\n[37] Fan Yang, Zhilin Yang, and William W. Cohen. Differentiable learning of logical rules for knowledge base\ncompletion. CoRR, abs/1702.08367, 2017.\n\n[38] Artur S. d\u2019Avila Garcez, Krysia Broda, and Dov M. Gabbay. Neural-symbolic learning systems: foundations\nand applications. Springer Science & Business Media, 2012.\n\n[39] Jude W Shavlik and Geoffrey G Towell. An approach to combining explanation-based and neural learning\nalgorithms. Connection Science, 1(3):231\u2013253, 1989.\n\n[40] Geoffrey G. Towell and Jude W. Shavlik. Knowledge-based arti\ufb01cial neural networks. Artif. Intell., 70\n(1-2):119\u2013165, 1994. doi: 10.1016/0004-3702(94)90105-8.\n\n[41] Artur S. d\u2019Avila Garcez and Gerson Zaverucha. The connectionist inductive learning and logic programming\nsystem. Appl. Intell., 11(1):59\u201377, 1999. doi: 10.1023/A:1008328630915.\n\n[42] Lokendra Shastri. Neurally motivated constraints on the working memory capacity of a production system\nfor parallel processing: Implications of a connectionist model based on temporal synchrony. In Proceedings\nof the Fourteenth Annual Conference of the Cognitive Science Society: July 29 to August 1, 1992, Cognitive\nScience Program, Indiana University, Bloomington, volume 14, page 159. Psychology Press, 1992.\n\n[43] Liya Ding. 
Neural prolog-the concepts, construction and mechanism. In Systems, Man and Cybernetics,\n1995. Intelligent Systems for the 21st Century., IEEE International Conference on, volume 4, pages\n3603\u20133608. IEEE, 1995.\n\n[44] Manoel V. M. Fran\u00e7a, Gerson Zaverucha, and Artur S. d\u2019Avila Garcez. Fast relational learning using\nbottom clause propositionalization with arti\ufb01cial neural networks. Machine Learning, 94(1):81\u2013104, 2014.\ndoi: 10.1007/s10994-013-5392-1.\n\n[45] Gustav Sourek, Vojtech Aschenbrenner, Filip Zelezn\u00fd, and Ondrej Kuzelka. Lifted relational neural\nnetworks. In Proceedings of the NIPS Workshop on Cognitive Computation: Integrating Neural and\nSymbolic Approaches co-located with the 29th Annual Conference on Neural Information Processing\nSystems (NIPS 2015), Montreal, Canada, December 11-12, 2015., 2015.\n\n[46] William W. Cohen. Tensorlog: A differentiable deductive database. CoRR, abs/1605.06523, 2016.\n\n[47] Luciano Sera\ufb01ni and Artur S. d\u2019Avila Garcez. Logic tensor networks: Deep learning and logical reasoning\nfrom data and knowledge.\nIn Proceedings of the 11th International Workshop on Neural-Symbolic\nLearning and Reasoning (NeSy\u201916) co-located with the Joint Multi-Conference on Human-Level Arti\ufb01cial\nIntelligence (HLAI 2016), New York City, NY, USA, July 16-17, 2016., 2016.\n\n[48] Ni Lao, Tom M. Mitchell, and William W. Cohen. Random walk inference and learning in A large scale\nknowledge base. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language\nProcessing, EMNLP 2011, 27-31 July 2011, John McIntyre Conference Centre, Edinburgh, UK, A meeting\nof SIGDAT, a Special Interest Group of the ACL, pages 529\u2013539, 2011.\n\n[49] Ni Lao, Amarnag Subramanya, Fernando C. N. Pereira, and William W. Cohen. Reading the web with\nlearned syntactic-semantic inference rules. 
In Proceedings of the 2012 Joint Conference on Empirical\nMethods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL\n2012, July 12-14, 2012, Jeju Island, Korea, pages 1017\u20131026, 2012.\n\n[50] Tim Rockt\u00e4schel, Matko Bosnjak, Sameer Singh, and Sebastian Riedel. Low-Dimensional Embeddings of\nLogic. In ACL Workshop on Semantic Parsing (SP\u201914), 2014.\n\n[51] Tim Rockt\u00e4schel, Sameer Singh, and Sebastian Riedel. Injecting logical background knowledge into\nembeddings for relation extraction. In NAACL HLT 2015, The 2015 Conference of the North American\nChapter of the Association for Computational Linguistics: Human Language Technologies, Denver,\nColorado, USA, May 31 - June 5, 2015, pages 1119\u20131129, 2015.\n\n[52] Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. Order-embeddings of images and language.\nIn International Conference on Learning Representations (ICLR), 2016.\n\n[53] Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard H. Hovy, and Eric P. Xing. Harnessing deep neural\nnetworks with logic rules. In Proceedings of the 54th Annual Meeting of the Association for Computational\nLinguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, 2016.\n\n[54] Thomas Demeester, Tim Rockt\u00e4schel, and Sebastian Riedel. Lifted rule injection for relation embeddings.\nIn Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP\n2016, Austin, Texas, USA, November 1-4, 2016, pages 1389\u20131399, 2016.\n\n[55] Ekaterina Komendantskaya. Uni\ufb01cation neural networks: uni\ufb01cation by error-correction learning. Logic\nJournal of the IGPL, 19(6):821\u2013847, 2011. doi: 10.1093/jigpal/jzq012.\n\n[56] Steffen H\u00f6lldobler. A structured connectionist uni\ufb01cation algorithm. In Proceedings of the 8th National\nConference on Arti\ufb01cial Intelligence. 
Boston, Massachusetts, July 29 - August 3, 1990, 2 Volumes., pages\n587\u2013593, 1990.\n\n[57] Stefan Schoenmackers, Jesse Davis, Oren Etzioni, and Daniel S. Weld. Learning \ufb01rst-order horn clauses\nfrom web text. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language\nProcessing, EMNLP 2010, 9-11 October 2010, MIT Stata Center, Massachusetts, USA, A meeting of\nSIGDAT, a Special Interest Group of the ACL, pages 1088\u20131098, 2010.\n\n[58] Stephen H Muggleton, Dianhuan Lin, and Alireza Tamaddoni-Nezhad. Meta-interpretive learning of\n\nhigher-order dyadic datalog: Predicate invention revisited. Machine Learning, 100(1):49\u201373, 2015.\n\n[59] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735\u20131780,\n\n1997. doi: 10.1162/neco.1997.9.8.1735.\n\n[60] R\u00e9mi Coulom. Ef\ufb01cient selectivity and backup operators in monte-carlo tree search. In Computers and\nGames, 5th International Conference, CG 2006, Turin, Italy, May 29-31, 2006. Revised Papers, pages\n72\u201383, 2006. doi: 10.1007/978-3-540-75538-8_7.\n\n[61] Levente Kocsis and Csaba Szepesv\u00e1ri. Bandit based monte-carlo planning.\n\nIn Machine Learning:\nECML 2006, 17th European Conference on Machine Learning, Berlin, Germany, September 18-22, 2006,\nProceedings, pages 282\u2013293, 2006. doi: 10.1007/11871842_29.\n\n[62] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian\nSchrittwieser, Ioannis Antonoglou, Vedavyas Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik\nGrewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy P. Lillicrap, Madeleine Leach, Koray\nKavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of go with deep neural networks\nand tree search. Nature, 529(7587):484\u2013489, 2016. doi: 10.1038/nature16961.\n\n[63] Marwin H. S. Segler, Mike Preu\u00df, and Mark P. Waller. 
Towards \"alphachem\": Chemical synthesis planning\n\nwith tree search and deep neural network policies. CoRR, abs/1702.00020, 2017.\n\n[64] Mark E. Stickel. A prolog technology theorem prover. New Generation Comput., 2(4):371\u2013383, 1984. doi:\n\n10.1007/BF03037328.\n\n[65] Cezary Kaliszyk, Fran\u00e7ois Chollet, and Christian Szegedy. Holstep: A machine learning dataset for\nhigher-order logic theorem proving. In International Conference on Learning Representations (ICLR),\n2017.\n\n[66] Sarah M. Loos, Geoffrey Irving, Christian Szegedy, and Cezary Kaliszyk. Deep network guided proof\nsearch. In LPAR-21, 21st International Conference on Logic for Programming, Arti\ufb01cial Intelligence and\nReasoning, Maun, Botswana, May 7-12, 2017, pages 85\u2013105, 2017.\n\n[67] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International\n\nConference on Learning Representations (ICLR), 2015.\n\n[68] Xavier Glorot and Yoshua Bengio. Understanding the dif\ufb01culty of training deep feedforward neural\nnetworks.\nIn Proceedings of the Thirteenth International Conference on Arti\ufb01cial Intelligence and\nStatistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13-15, 2010, pages 249\u2013256, 2010.\n\n[69] Mart\u00edn Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S.\nCorrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian J. Goodfellow, Andrew Harp,\nGeoffrey Irving, Michael Isard, Yangqing Jia, Rafal J\u00f3zefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh\nLevenberg, Dan Man\u00e9, Rajat Monga, Sherry Moore, Derek Gordon Murray, Chris Olah, Mike Schuster,\nJonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay\nVasudevan, Fernanda B. Vi\u00e9gas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu,\nand Xiaoqiang Zheng. 
Tensor\ufb02ow: Large-scale machine learning on heterogeneous distributed systems.\nCoRR, abs/1603.04467, 2016.\n", "award": [], "sourceid": 2093, "authors": [{"given_name": "Tim", "family_name": "Rockt\u00e4schel", "institution": "University of Oxford"}, {"given_name": "Sebastian", "family_name": "Riedel", "institution": "University College London"}]}