{"title": "Probabilistic Logic Neural Networks for Reasoning", "book": "Advances in Neural Information Processing Systems", "page_first": 7712, "page_last": 7722, "abstract": "Knowledge graph reasoning, which aims at predicting missing facts through reasoning with observed facts, is critical for many applications. Such a problem has been widely explored by traditional logic rule-based approaches and recent knowledge graph embedding methods. A principled logic rule-based approach is the Markov Logic Network (MLN), which is able to leverage domain knowledge with first-order logic and meanwhile handle uncertainty. However, the inference in MLNs is usually very difficult due to the complicated graph structures. Different from MLNs, knowledge graph embedding methods (e.g. TransE, DistMult) learn effective entity and relation embeddings for reasoning, which are much more effective and efficient. However, they are unable to leverage domain knowledge. In this paper, we propose the probabilistic Logic Neural Network (pLogicNet), which combines the advantages of both methods. A pLogicNet defines the joint distribution of all possible triplets by using a Markov logic network with first-order logic, which can be efficiently optimized with the variational EM algorithm. Specifically, in the E-step, a knowledge graph embedding model is used for inferring the missing triplets, while in the M-step, the weights of the logic rules are updated according to both the observed and predicted triplets. Experiments on multiple knowledge graphs prove the effectiveness of pLogicNet over many competitive baselines.", "full_text": "Probabilistic Logic Neural Networks for Reasoning\n\nMeng Qu1,2, Jian Tang1,3,4\n\n1Mila - Quebec AI Institute 2University of Montr\u00e9al\n\n3HEC Montr\u00e9al 4CIFAR AI Research Chair\n\nAbstract\n\nKnowledge graph reasoning, which aims at predicting the missing facts through\nreasoning with the observed facts, is critical to many applications. 
Such a problem\nhas been widely explored by traditional logic rule-based approaches and recent\nknowledge graph embedding methods. A principled logic rule-based approach is\nthe Markov Logic Network (MLN), which is able to leverage domain knowledge\nwith \ufb01rst-order logic and meanwhile handle the uncertainty. However, the inference\nin MLNs is usually very dif\ufb01cult due to the complicated graph structures. Different\nfrom MLNs, knowledge graph embedding methods (e.g. TransE, DistMult) learn\neffective entity and relation embeddings for reasoning, which are much more\neffective and ef\ufb01cient. However, they are unable to leverage domain knowledge.\nIn this paper, we propose the probabilistic Logic Neural Network (pLogicNet),\nwhich combines the advantages of both methods. A pLogicNet de\ufb01nes the joint\ndistribution of all possible triplets by using a Markov logic network with \ufb01rst-order\nlogic, which can be ef\ufb01ciently optimized with the variational EM algorithm. In\nthe E-step, a knowledge graph embedding model is used for inferring the missing\ntriplets, while in the M-step, the weights of logic rules are updated based on both\nthe observed and predicted triplets. Experiments on multiple knowledge graphs\nprove the effectiveness of pLogicNet over many competitive baselines.\n\n1\n\nIntroduction\n\nMany real-world entities are interconnected with each other through various types of relationships,\nforming massive relational data. Naturally, such relational data can be characterized by a set of (h, r, t)\ntriplets, meaning that entity h has relation r with entity t. To store the triplets, many knowledge graphs\nhave been constructed such as Freebase [14] and WordNet [24]. These graphs have been proven\nuseful in many tasks, such as question answering [49], relation extraction [34] and recommender\nsystems [4]. 
However, one big challenge of knowledge graphs is that their coverage is limited.\nTherefore, one fundamental problem is how to predict the missing links based on the existing triplets.\nOne type of methods for reasoning on knowledge graphs are the symbolic logic rule-based ap-\nproaches [12, 17, 35, 41, 46]. These rules can be either handcrafted by domain experts [42] or mined\nfrom knowledge graphs themselves [10]. Traditional methods such as expert systems [12, 17] use\nhard logic rules for prediction. For example, given a logic rule \u2200x, y, Husband(x, y) \u21d2 Wife(y, x)\nand a fact that A is the husband of B, we can derive that B is the wife of A. However, in many cases\nlogic rules can be imperfect or even contradictory, and hence effectively modeling the uncertainty\nof logic rules is very critical. A more principled method for using logic rules is the Markov Logic\nNetwork (MLN) [35, 39], which combines \ufb01rst-order logic and probabilistic graphical models. MLNs\nlearn the weights of logic rules in a probabilistic framework and thus soundly handle the uncertainty.\nSuch methods have been proven effective for reasoning on knowledge graphs. However, the inference\nprocess in MLNs is dif\ufb01cult and inef\ufb01cient due to the complicated graph structure among triplets.\nMoreover, the results can be unsatisfactory as many missing triplets cannot be inferred by any rules.\nAnother type of methods for reasoning on knowledge graphs are the recent knowledge graph em-\nbedding based methods (e.g., TransE [3], DistMult [48] and ComplEx [44]). These methods learn\nuseful embeddings of entities and relations by projecting existing triplets into low-dimensional spaces.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fThese embeddings preserve the semantic meanings of entities and relations, and can effectively\npredict the missing triplets. 
In addition, they can be ef\ufb01ciently trained with stochastic gradient\ndescent. However, one limitation is that they do not leverage logic rules, which compactly encode\ndomain knowledge and are useful in many applications.\nWe are seeking an approach that combines the advantages of both worlds, one which is able to exploit\n\ufb01rst-order logic rules while handling their uncertainty, infer missing triplets effectively, and can\nbe trained in an ef\ufb01cient way. We propose such an approach called the probabilistic Logic Neural\nNetworks (pLogicNet). A pLogicNet de\ufb01nes the joint distribution of a collection of triplets with a\nMarkov Logic Network [35], which associates each logic rule with a weight and can be effectively\ntrained with the variational EM algorithm [26]. In the variational E-step, we infer the plausibility of\nthe unobserved triplets (i.e., hidden variables) with amortized mean-\ufb01eld inference [11, 21, 29], in\nwhich the variational distribution is parameterized as a knowledge graph embedding model. In the\nM-step, we update the weights of logic rules by optimizing the pseudolikelihood [1], which is de\ufb01ned\non both the observed triplets and those inferred by the knowledge graph embedding model. The\nframework can be ef\ufb01ciently trained by stochastic gradient descent. Experiments on four benchmark\nknowledge graphs prove the effectiveness of pLogicNet over many competitive baselines.\n\n2 Related Work\n\nFirst-order logic rules can compactly encode domain knowledge and have been extensively explored\nfor reasoning. Early methods such as expert systems [12, 17] use hard logic rules for reasoning.\nHowever, logic rules can be imperfect or even contradictory. Later studies try to model the uncertainty\nof logic rules by using Horn clauses [5, 19, 30, 46] or database query languages [31, 41]. A more\nprincipled method is the Markov logic network [35, 39], which combines \ufb01rst-order logic with\nprobabilistic graphical models. 
Despite their effectiveness in a variety of tasks, inference in MLNs remains difficult and inefficient due to the complicated connections between triplets. Moreover, for predicting missing triplets on knowledge graphs, the performance can be limited, as many triplets cannot be discovered by any rules. In contrast, pLogicNet uses knowledge graph embedding models for inference, which is much more effective thanks to the learned entity and relation embeddings.

Another category of approaches for knowledge graph reasoning is the knowledge graph embedding methods [3, 8, 28, 40, 44, 45, 48], which aim at learning effective embeddings of entities and relations. Generally, these methods design different scoring functions to model different relation patterns for reasoning. For example, TransE [3] defines each relation as a translation vector, which can effectively model the composition and inverse relation patterns. DistMult [48] models symmetric relations with a bilinear scoring function. ComplEx [44] models asymmetric relations by using a bilinear scoring function in complex space. RotatE [40] further models multiple relation patterns by defining each relation as a rotation in complex space. Despite their effectiveness and efficiency, these methods are not able to leverage logic rules, which are beneficial in many tasks. Recently, there have been a few studies on combining logic rules and knowledge graph embedding [9, 15]. However, they cannot effectively handle the uncertainty of logic rules. Compared with them, pLogicNet is able to use logic rules and also handle their uncertainty in a more principled way through Markov logic networks.

Some recent work also studies using reinforcement learning for reasoning on knowledge graphs [6, 23, 38, 47], where an agent is trained to search for reasoning paths. However, the performance of these methods is not very competitive.
Our pLogicNet is easier to train and also more effective.

Lastly, there are also some recent studies trying to combine statistical relational learning and graph neural networks for semi-supervised node classification [33], or using Markov networks for visual dialog reasoning [32, 51]. Our work shares a similar idea with these studies, but we focus on a different problem, i.e., reasoning with first-order logic on knowledge graphs. There is also concurrent work using graph neural networks for logic reasoning [50]. Compared to this study, which emphasizes the inference problem, our work focuses on both the inference and the learning problems.

3 Preliminary

3.1 Problem Definition

A knowledge graph is a collection of relational facts, each of which is represented as a triplet (h, r, t). Due to the high cost of knowledge graph construction, the coverage of knowledge graphs is usually limited. Therefore, a critical problem on knowledge graphs is to predict the missing facts.

Formally, given a knowledge graph (E, R, O), where E is a set of entities, R is a set of relations, and O is a set of observed (h, r, t) triplets, the goal is to infer the missing triplets by reasoning with the observed triplets. Following existing studies [27], the problem can be reformulated in a probabilistic way. Each triplet (h, r, t) is associated with a binary indicator variable v(h,r,t): v(h,r,t) = 1 means (h, r, t) is true, and v(h,r,t) = 0 otherwise. Given some true facts vO = {v(h,r,t) = 1}(h,r,t)∈O, we aim to predict the labels of the remaining hidden triplets H, i.e., vH = {v(h,r,t)}(h,r,t)∈H. We will discuss how to generate the hidden triplets H later in Sec. 4.4.

This problem has been extensively studied in both traditional logic rule-based methods and recent knowledge graph embedding methods. For logic rule-based methods, we mainly focus on one representative approach, the Markov logic network [35].
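As a concrete illustration of this probabilistic formulation, the observed and hidden indicator variables can be sketched as follows (toy data and names of our own, not the authors' code):

```python
# Sketch of the setup in Sec. 3.1: a knowledge graph is a set of (h, r, t)
# triplets, and each triplet carries a binary indicator variable.

entities = {"Newton", "Einstein", "UK"}
relations = {"Born in", "Live in"}

# Observed facts O: their indicators are fixed to 1.
observed = {("Newton", "Born in", "UK")}
v = {triplet: 1 for triplet in observed}

# Hidden triplets H: their indicators are unknown and must be inferred.
hidden = {("Newton", "Live in", "UK")}
for triplet in hidden:
    v[triplet] = None  # to be predicted by reasoning

assert v[("Newton", "Born in", "UK")] == 1
```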
Essentially, both types of methods aim to\nmodel the joint distribution of the observed and hidden triplets p(vO, vH ). Next, we brie\ufb02y introduce\nthe Markov logic network (MLN) [35] and the knowledge graph embedding methods [3, 40, 48].\n\n3.2 Markov Logic Network\n\nIn the MLN, a Markov network is designed to de\ufb01ne the joint distribution of the observed and the\nhidden triplets, where the potential function is de\ufb01ned by the \ufb01rst-order logic. Some common logic\nrules to encode domain knowledge include: (1) Composition Rules. A relation rk is a composition of\nri and rj means that for any three entities x, y, z, if x has relation ri with y, and y has relation rj with\nz, then x has relation rk with z. Formally, we have \u2200x, y, z \u2208 E, v(x,ri,y) \u2227 v(y,rj ,z) \u21d2 v(x,rk,z).\n(2) Inverse Rules. A relation rj is an inverse of ri indicates that for two entities x, y, if x has relation\nri with y, then y has relation rj with x. We can represent the rule as \u2200x, y \u2208 E, v(x,ri,y) \u21d2 v(y,rj ,x).\n(3) Symmetric Rules. A relation r is symmetric means that for any entity pair x, y, if x has relation\nr with y, then y also has relation r with x. Formally, we have \u2200x, y \u2208 E, v(x,r,y) \u21d2 v(y,r,x). (4)\nSubrelation Rules. A relation rj is a subrelation of ri indicates that for any entity pair x, y, if x and y\nhave relation ri, then they also have relation rj. Formally, we have \u2200x, y \u2208 E, v(x,ri,y) \u21d2 v(x,rj ,y).\nFor each logic rule l, we can obtain a set of possible groundings Gl by instantiating the entity\nplaceholders in the logic rule with real entities in knowledge graphs. For example, for a subre-\nlation rule, \u2200x, y, v(x,Born in,y) \u21d2 v(x,Live in,y), two groundings in Gl can be v(Newton,Born in,UK) \u21d2\nv(Newton,Live in,UK) and v(Einstein,Born in,German) \u21d2 v(Einstein,Live in,German). We see that the former one is\ntrue while the latter one is false. 
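The grounding procedure just illustrated can be sketched in a few lines. This is a toy example with hypothetical entities and facts; the helper counts how many groundings of a rule are true, which is the quantity a Markov logic network weighs:

```python
from itertools import product

# Ground the subrelation rule  forall x, y: (x, Born in, y) => (x, Live in, y)
# over a toy entity set, and count how many groundings hold under a given
# truth assignment. Illustrative only, not the authors' code.

entities = ["Newton", "Einstein", "UK", "Germany"]
true_facts = {
    ("Newton", "Born in", "UK"),
    ("Newton", "Live in", "UK"),
    ("Einstein", "Born in", "Germany"),  # no matching "Live in" fact
}

def groundings(premise_rel, hypothesis_rel):
    # Instantiate the placeholders x, y with every pair of real entities.
    for x, y in product(entities, repeat=2):
        yield (x, premise_rel, y), (x, hypothesis_rel, y)

def n_true(premise_rel, hypothesis_rel, facts):
    # A grounded implication is true unless its premise holds and its
    # hypothesis fails.
    return sum(
        not (p in facts and h not in facts)
        for p, h in groundings(premise_rel, hypothesis_rel)
    )

# 16 groundings in total; only the Einstein grounding violates the rule.
print(n_true("Born in", "Live in", true_facts))  # -> 15
```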
To handle such uncertainty of logic rules, Markov logic networks introduce a weight wl for each rule l, and then the joint distribution of all triplets is defined as follows:

p(vO, vH) = (1/Z) exp( Σ_{l∈L} wl Σ_{g∈Gl} 1{g is true} ) = (1/Z) exp( Σ_{l∈L} wl nl(vO, vH) ),   (1)

where nl is the number of true groundings of the logic rule l based on the values of vO and vH.

With such a formulation, predicting the missing triplets essentially becomes inferring the posterior distribution p(vH|vO). Exact inference is usually infeasible due to the complicated graph structures, and hence approximate inference is often used, such as MCMC [13] and loopy belief propagation [25].

3.3 Knowledge Graph Embedding

Different from the logic rule-based approaches, the knowledge graph embedding methods learn embeddings of entities and relations with the observed facts vO, and then predict the missing facts with the learned entity and relation embeddings. Formally, each entity e ∈ E and relation r ∈ R is associated with an embedding xe and xr. Then the joint distribution of all the triplets is defined as:

p(vO, vH) = Π_{(h,r,t)∈O∪H} Ber(v(h,r,t) | f(xh, xr, xt)),   (2)

where Ber stands for the Bernoulli distribution, and f(xh, xr, xt) computes the probability that the triplet (h, r, t) is true, with f(·, ·, ·) being a scoring function on the entity and relation embeddings. For example, in TransE, the score function f can be formulated as σ(γ − ||xh + xr − xt||) according to [40], where σ is the sigmoid function and γ is a fixed bias. To learn the entity and relation embeddings, these methods typically treat observed triplets as positive examples and the hidden triplets as negative ones.
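The TransE-style scoring function σ(γ − ||xh + xr − xt||) mentioned above can be sketched as follows; the embeddings here are random and purely illustrative:

```python
import math
import random

# Toy TransE score used for f in Eq. (2): the closer x_h + x_r is to x_t,
# the higher the probability that (h, r, t) is true.

random.seed(0)
dim, gamma = 8, 12.0

def embed():
    return [random.uniform(-1, 1) for _ in range(dim)]

x_h, x_r, x_t = embed(), embed(), embed()

def transe_prob(xh, xr, xt, gamma=gamma):
    dist = math.sqrt(sum((a + b - c) ** 2 for a, b, c in zip(xh, xr, xt)))
    return 1.0 / (1.0 + math.exp(-(gamma - dist)))  # sigmoid(gamma - dist)

p = transe_prob(x_h, x_r, x_t)
assert 0.0 < p < 1.0  # a valid Bernoulli parameter
```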
In other words, these methods seek to maximize log p(vO = 1, vH = 0). The whole framework can be efficiently optimized with the stochastic gradient descent algorithm.

[Figure 1 diagram: five example triplets around the center triplet (Alan Turing, Nationality, UK), namely (Alan Turing, Born in, London) ✓, (Alan Turing, Live in, UK) ✓, (London, City of, UK) ✓ and (Alan Turing, Politician of, UK) ✗, connected by the weighted rules "Born in ∧ City of ⇒ Nationality (1.5)", "Nationality ⇐ Live in (0.2)" and "Nationality ⇐ Politician of (2.6)".]

Figure 1: Framework overview. Each possible triplet is associated with a binary indicator (circles), indicating whether it is true (✓) or not (✗). The observed (yellow circles) and hidden (grey circles) indicators are connected by a set of logic rules, with each rule having a weight (red number). For the center triplet, the KGE model predicts its indicator through embeddings, while the logic rules consider the Markov blanket of the triplet (all connected triplets). If any indicator in the Markov blanket is hidden, we simply fill it with the prediction from the KGE model. In the E-step, we use the logic rules to predict the center indicator, and treat it as extra training data for the KGE model. In the M-step, we annotate all hidden indicators with the KGE model, and then update the weights of rules.

4 Model

In this section, we introduce our proposed approach pLogicNet for knowledge graph reasoning, which combines the logic rule-based methods and the knowledge graph embedding methods. To leverage the domain knowledge provided by first-order logic rules, pLogicNet formulates the joint distribution of all triplets with a Markov logic network [35], which is trained with the variational EM algorithm [26], alternating between a variational E-step and an M-step.
In the variational E-step, we employ a knowledge graph embedding model to infer the missing triplets, during which the knowledge preserved by the logic rules can be effectively distilled into the learned embeddings. In the M-step, the weights of the logic rules are updated based on both the observed triplets and those inferred by the knowledge graph embedding model. In this way, the knowledge graph embedding model provides extra supervision for weight learning. An overview of pLogicNet is given in Fig. 1.

4.1 Variational EM

Given a set of first-order logic rules L = {li}_{i=1}^{|L|}, our approach uses a Markov logic network [35] as in Eq. (1) to model the joint distribution of both the observed and hidden triplets:

pw(vO, vH) = (1/Z) exp( Σ_l wl nl(vO, vH) ),   (3)

where wl is the weight of rule l. The model can be trained by maximizing the log-likelihood of the observed indicator variables, i.e., log pw(vO). However, directly optimizing this objective is infeasible, as we need to integrate over all the hidden indicator variables vH. Therefore, we instead optimize the evidence lower bound (ELBO) of the log-likelihood function, which is given as follows:

log pw(vO) ≥ L(qθ, pw) = E_{qθ(vH)}[log pw(vO, vH) − log qθ(vH)],   (4)

where qθ(vH) is a variational distribution of the hidden variables vH. The equation holds when qθ(vH) equals the true posterior distribution pw(vH|vO). Such a lower bound can be effectively optimized with the variational EM algorithm [26], which consists of a variational E-step and an M-step. In the variational E-step, which is known as the inference procedure, we fix pw and update qθ to minimize the KL divergence between qθ(vH) and pw(vH|vO). In the M-step, which is known as the learning procedure, we fix qθ and update pw to maximize the log-likelihood function of all the triplets, i.e., E_{qθ(vH)}[log pw(vO, vH)].
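The ELBO in Eq. (4) can be checked numerically in a toy setting with a single hidden indicator. The example below is our own construction, not the paper's model; it verifies that the bound never exceeds log pw(vO) and is tight at the true posterior:

```python
import math

# One hidden indicator v in {0, 1}. phi(v) plays the role of
# exp(sum_l w_l n_l(v_O, v)); Z also covers other evidence configurations,
# so log p(v_O) < 0. Purely illustrative numbers.

phi = {0: math.exp(1.0), 1: math.exp(2.5)}
Z = sum(phi.values()) + 5.0                      # extra mass from other v_O values
log_joint = {v: math.log(p) - math.log(Z) for v, p in phi.items()}
log_evidence = math.log(sum(phi.values())) - math.log(Z)   # log p(v_O)

def elbo(q1):
    # ELBO(q) = E_q[log p(v_O, v) - log q(v)] for q(v=1) = q1.
    q = {0: 1.0 - q1, 1: q1}
    return sum(q[v] * (log_joint[v] - math.log(q[v])) for v in (0, 1) if q[v] > 0)

# The bound holds for any q, and is tight at the true posterior q*(v=1).
assert all(elbo(q1) <= log_evidence + 1e-9 for q1 in (0.1, 0.5, 0.9))
q_star = phi[1] / sum(phi.values())
assert abs(elbo(q_star) - log_evidence) < 1e-9
```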
Next, we introduce the details of both steps.

4.2 E-step: Inference Procedure

For inference, we aim to infer the posterior distribution of the hidden variables, i.e., pw(vH|vO). As exact inference is intractable, we approximate the true posterior distribution with a mean-field [29] variational distribution qθ(vH), in which each v(h,r,t) is inferred independently for (h, r, t) ∈ H. To further improve inference, we use amortized inference [11, 21], and parameterize qθ(v(h,r,t)) with a knowledge graph embedding model. Formally, qθ(vH) is formulated as below:

qθ(vH) = Π_{(h,r,t)∈H} qθ(v(h,r,t)) = Π_{(h,r,t)∈H} Ber(v(h,r,t) | f(xh, xr, xt)),   (5)

where Ber stands for the Bernoulli distribution, and f(·, ·, ·) is a scoring function defined on triplets as introduced in Sec. 3.3. By minimizing the KL divergence between the variational distribution qθ(vH) and the true posterior pw(vH|vO), the optimal qθ(vH) is given by the fixed-point condition:

log qθ(v(h,r,t)) = E_{qθ(vMB(h,r,t))}[log pw(v(h,r,t)|vMB(h,r,t))] + const   for all (h, r, t) ∈ H,   (6)

where MB(h, r, t) is the Markov blanket of (h, r, t), which contains the triplets that appear together with (h, r, t) in any grounding of the logic rules. For example, from a grounding v(Newton,Born in,UK) ⇒ v(Newton,Live in,UK), we can tell that the two triplets are in the Markov blanket of each other.

With Eq. (6), our goal becomes finding a distribution qθ that satisfies the condition. However, Eq. (6) involves the expectation with respect to qθ(vMB(h,r,t)). To simplify the condition, we follow [16] and estimate the expectation with a sample ˆvMB(h,r,t) = {ˆv(h′,r′,t′)}(h′,r′,t′)∈MB(h,r,t).
Speci\ufb01cally,\nfor each (h\u2032, r\u2032, t\u2032) \u2208 MB(h, r, t), if it is observed, we set \u02c6v(h\u2032,r\u2032,t\u2032) = 1, and otherwise \u02c6v(h\u2032,r\u2032,t\u2032) \u223c\nq\u03b8(v(h\u2032,r\u2032,t\u2032)). In this way, the right side of Eq. (6) is approximated as log pw(v(h,r,t)|\u02c6vMB(h,r,t)),\nand thus the optimality condition can be further simpli\ufb01ed as q\u03b8(v(h,r,t)) \u2248 pw(v(h,r,t)|\u02c6vMB(h,r,t)).\nIntuitively, for each hidden triplet (h, r, t), the knowledge graph embedding model predicts v(h,r,t)\nthrough the entity and relation embeddings (i.e., q\u03b8(v(h,r,t))), while the logic rules make the\nprediction by utilizing the triplets connected with (h, r, t) (i.e., pw(v(h,r,t)|\u02c6vMB(h,r,t))).\nIf any\ntriplet (h\u2032, r\u2032, t\u2032) connected with (h, r, t) is unobserved, we simply \ufb01ll in v(h\u2032,r\u2032,t\u2032) with a sample\n\u02c6v(h\u2032,r\u2032,t\u2032) \u223c q\u03b8(v(h\u2032,r\u2032,t\u2032)). Then, the simpli\ufb01ed optimality condition tells us that for the optimal\nknowledge graph embedding model, it should reach a consensus with the logic rules on the distribution\nof v(h,r,t) for every (h, r, t), i.e., q\u03b8(v(h,r,t)) \u2248 pw(v(h,r,t)|\u02c6vMB(h,r,t)).\nTo learn the optimal q\u03b8, we use a method similar to [36]. We start by computing pw(v(h,r,t)|\u02c6vMB(h,r,t))\nwith the current q\u03b8. 
Then, we fix the value as the target, and update qθ to minimize the reverse KL divergence between qθ(v(h,r,t)) and the target pw(v(h,r,t)|ˆvMB(h,r,t)), leading to the following objective:

Oθ,U = Σ_{(h,r,t)∈H} E_{pw(v(h,r,t)|ˆvMB(h,r,t))}[log qθ(v(h,r,t))].   (7)

To optimize this objective, we first compute pw(v(h,r,t)|ˆvMB(h,r,t)) for each hidden triplet (h, r, t). If pw(v(h,r,t) = 1|ˆvMB(h,r,t)) ≥ τtriplet, with τtriplet being a hyperparameter, then we treat (h, r, t) as a positive example and train the knowledge graph embedding model to maximize the log-likelihood log qθ(v(h,r,t) = 1). Otherwise, the triplet is treated as a negative example. In this way, the knowledge captured by the logic rules can be effectively distilled into the knowledge graph embedding model. We can also use the observed triplets in O as positive examples to enhance the knowledge graph embedding model. Therefore, we also optimize the following objective function:

Oθ,L = Σ_{(h,r,t)∈O} log qθ(v(h,r,t) = 1).   (8)

By adding Eq. (7) and (8), we obtain the overall objective function for qθ, i.e., Oθ = Oθ,U + Oθ,L.

4.3 M-step: Learning Procedure

In the learning procedure, we fix qθ and update the weights of logic rules w by maximizing the log-likelihood function, i.e., E_{qθ(vH)}[log pw(vO, vH)]. However, directly optimizing the log-likelihood function can be difficult, as we need to deal with the partition function, i.e., Z in Eq. (3). Therefore, we follow existing studies [22, 35] and instead optimize the pseudolikelihood function [1]:

ℓPL(w) ≜ E_{qθ(vH)}[ Σ_{(h,r,t)} log pw(v(h,r,t)|vO∪H\(h,r,t)) ] = E_{qθ(vH)}[ Σ_{(h,r,t)} log pw(v(h,r,t)|vMB(h,r,t)) ],

where the second equation is derived from the independence property of the MLN in Eq. (3). We optimize w through the gradient descent algorithm.
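As a rough illustration of pseudolikelihood weight learning, consider the following simplified toy setup of our own, where the conditional probability of a triplet given its Markov blanket is a logistic function of the weighted change in true-grounding counts:

```python
import math

# One gradient step on rule weights under a toy pseudolikelihood objective.
# This is our own simplified sketch, not the paper's exact update rule.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pl_step(w, examples, lr=0.1):
    # examples: list of (delta_n, y). delta_n[l] is the change in rule l's
    # true-grounding count when the triplet is flipped to true; y is the
    # target (1 for observed triplets, q_theta(v=1) for hidden ones).
    for delta_n, y in examples:
        p = sigmoid(sum(w[l] * d for l, d in delta_n.items()))
        for l, d in delta_n.items():
            w[l] += lr * (y - p) * d   # move p toward the target y
    return w

w = {"born_implies_live": 0.0}
# A rule that is consistently supported by the data accumulates weight.
examples = [({"born_implies_live": 1.0}, 1.0)] * 50
w = pl_step(w, examples)
assert w["born_implies_live"] > 0.0
```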
For each expected conditional distribution E_{qθ(vH)}[log pw(v(h,r,t)|vMB(h,r,t))], suppose v(h,r,t) connects with vMB(h,r,t) through a set of rules. For each such rule l, the derivative with respect to wl is computed as:

∇wl E_{qθ(vH)}[log pw(v(h,r,t)|vMB(h,r,t))] ≃ y(h,r,t) − pw(v(h,r,t) = 1|ˆvMB(h,r,t)),   (9)

where y(h,r,t) = 1 if (h, r, t) is an observed triplet, and y(h,r,t) = qθ(v(h,r,t) = 1) if (h, r, t) is a hidden one. ˆvMB(h,r,t) = {ˆv(h′,r′,t′)}(h′,r′,t′)∈MB(h,r,t) is a sample from qθ. For each (h′, r′, t′) ∈ MB(h, r, t), ˆv(h′,r′,t′) = 1 if (h′, r′, t′) is observed, and otherwise ˆv(h′,r′,t′) ∼ qθ(v(h′,r′,t′)).

Intuitively, for each observed triplet (h, r, t) ∈ O, we seek to maximize pw(v(h,r,t) = 1|ˆvMB(h,r,t)). For each hidden triplet (h, r, t) ∈ H, we treat qθ(v(h,r,t) = 1) as the target for updating the probability pw(v(h,r,t) = 1|ˆvMB(h,r,t)). In this way, the knowledge graph embedding model qθ essentially provides extra supervision to benefit learning the weights of logic rules.

4.4 Optimization and Prediction

During training, we iteratively perform the E-step and the M-step until convergence. Note that there is a huge number of possible hidden triplets (i.e., |E| × |R| × |E| − |O|), and handling all of them is impractical for optimization. Therefore, we only include a small number of triplets in the hidden set H. Specifically, an unobserved triplet (h, r, t) is added to H if we can find a grounding [premise] ⇒ [hypothesis], where the hypothesis is (h, r, t) and the premise only contains triplets in the observed set O. In practice, we can construct H with brute-force search as in [15].

After training, according to the fixed-point condition given in Eq.
(6), the posterior distribution pw(v(h,r,t)|vO) for (h, r, t) ∈ H can be characterized by either qθ(v(h,r,t)) or pw(v(h,r,t)|ˆvMB(h,r,t)) with ˆvMB(h,r,t) ∼ qθ(vMB(h,r,t)). Although we try to encourage the consensus of pw and qθ during training, they may still give different predictions as different information is used. Therefore, we use both of them for prediction, and we approximate the true posterior distribution pw(v(h,r,t)|vO) as:

pw(v(h,r,t)|vO) ∝ qθ(v(h,r,t)) + λ pw(v(h,r,t)|ˆvMB(h,r,t)),   (10)

where λ is a hyperparameter controlling the relative weight of qθ(v(h,r,t)) and pw(v(h,r,t)|ˆvMB(h,r,t)). In practice, we also expect to infer the plausibility of the triplets outside H. For each of such triplets (h, r, t), we can still compute qθ(v(h,r,t)) through the learned embeddings, but we cannot make predictions with the logic rules, so we simply replace pw(v(h,r,t) = 1|ˆvMB(h,r,t)) with 0.5 in Eq. (10).

5 Experiment

5.1 Experiment Settings

Datasets. In experiments, we evaluate pLogicNet on four benchmark datasets. The FB15k [3] and FB15k-237 [43] datasets are constructed from Freebase [2]. WN18 [3] and WN18RR [8] are constructed from WordNet [24]. The detailed statistics of the datasets are summarized in the appendix.

Evaluation Metrics. We compare different methods on the task of knowledge graph reasoning. For each test triplet, we mask the head or the tail entity, and let each compared method predict the masked entity. Following existing studies [3, 48], we use the filtered setting during evaluation. The Mean Rank (MR), Mean Reciprocal Rank (MRR) and Hit@K (H@K) are treated as the evaluation metrics.

Compared Algorithms. We compare with both the knowledge graph embedding methods and rule-based methods.
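The ranking metrics described above can be computed from per-query filtered ranks as follows (a minimal sketch with made-up ranks; in the filtered setting, each rank is taken after removing all other true triplets from the candidate list):

```python
# Illustrative MR / MRR / Hits@K from a list of filtered ranks, where each
# entry is the rank of the correct entity for one masked test query.

ranks = [1, 3, 2, 10, 1]

mr = sum(ranks) / len(ranks)                            # Mean Rank
mrr = sum(1.0 / r for r in ranks) / len(ranks)          # Mean Reciprocal Rank

def hits_at(k):
    # Fraction of queries whose correct entity is ranked within the top k.
    return sum(r <= k for r in ranks) / len(ranks)

assert mr == 3.4
assert hits_at(1) == 0.4 and hits_at(3) == 0.8
```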
For the knowledge graph embedding methods, we choose five representative methods to compare with, including TransE [3], DistMult [48], HolE [28], ComplEx [44] and ConvE [8]. For the rule-based methods, we compare with the Markov logic network (MLN) [35] and the Bayesian logic programming (BLP) method [7], which model logic rules with Markov networks and Bayesian networks respectively. Besides, we also compare with RUGE [15] and NNE-AER [9], which are hybrid methods that combine knowledge graph embedding and logic rules. As only the results on the FB15k dataset are reported in the RUGE paper, we only compare with RUGE on that dataset. For our approach, we consider two variants, where pLogicNet uses only qθ to infer the plausibility of unobserved triplets during evaluation, while pLogicNet∗ uses both qθ and pw through Eq. (10).

Experimental Setup of pLogicNet. To generate the candidate rules in pLogicNet, we search for all the possible composition rules, inverse rules, symmetric rules and subrelation rules from the observed triplets, which is similar to [10, 15]. Then, we compute the empirical precision of each rule, i.e., pl = |Sl ∩ O| / |Sl|, where Sl is the set of triplets extracted by the rule l and O is the set of the observed triplets. We only keep rules whose empirical precision is larger than a threshold τrule. TransE [3] is used as the default knowledge graph embedding model to parameterize qθ. We update the weights of logic rules with gradient descent. The detailed hyperparameter settings are available in the appendix.

5.2 Results

5.2.1 Comparing pLogicNet with Other Methods

Table 1: Results of reasoning on the FB15k and WN18 datasets. The results of the KGE and the Hybrid methods except for TransE are directly taken from the corresponding papers.
H@K is in %.

Category | Algorithm | FB15k: MR | MRR | H@1 | H@3 | H@10 | WN18: MR | MRR | H@1 | H@3 | H@10
KGE | TransE [3] | 40 | 0.730 | 64.5 | 79.3 | 86.4 | 272 | 0.772 | 70.1 | 80.8 | 92.0
KGE | DistMult [18] | 42 | 0.798 | - | - | 89.3 | 655 | 0.797 | - | - | 94.6
KGE | HolE [28] | - | 0.524 | 40.2 | 61.3 | 73.9 | - | 0.938 | 93.0 | 94.5 | 94.9
KGE | ComplEx [44] | - | 0.692 | 59.9 | 75.9 | 84.0 | - | 0.941 | 93.6 | 94.5 | 94.7
KGE | ConvE [8] | 51 | 0.657 | 55.8 | 72.3 | 83.1 | 374 | 0.943 | 93.5 | 94.6 | 95.6
Rule-based | BLP [7] | 415 | 0.242 | 15.1 | 26.9 | 42.4 | 736 | 0.643 | 53.7 | 71.7 | 83.0
Rule-based | MLN [35] | 352 | 0.321 | 21.0 | 37.0 | 55.0 | 717 | 0.657 | 55.4 | 73.1 | 83.9
Hybrid | RUGE [15] | - | 0.768 | 70.3 | 81.5 | 86.5 | - | - | - | - | -
Hybrid | NNE-AER [9] | - | 0.803 | 76.1 | 83.1 | 87.4 | - | 0.943 | 94.0 | 94.5 | 94.8
Ours | pLogicNet | 33 | 0.792 | 71.4 | 85.7 | 90.1 | 255 | 0.832 | 71.6 | 94.4 | 95.7
Ours | pLogicNet∗ | 33 | 0.844 | 81.2 | 86.2 | 90.2 | 254 | 0.945 | 93.9 | 94.7 | 95.8

Table 2: Results of reasoning on the FB15k-237 and WN18RR datasets. The results of the KGE methods except for TransE are directly taken from the corresponding papers. H@K is in %.

Category | Algorithm | FB15k-237: MR | MRR | H@1 | H@3 | H@10 | WN18RR: MR | MRR | H@1 | H@3 | H@10
KGE | TransE [3] | 181 | 0.326 | 22.9 | 36.3 | 52.1 | 3410 | 0.223 | 1.3 | 40.1 | 53.1
KGE | DistMult [18] | 254 | 0.241 | 15.5 | 26.3 | 41.9 | 5110 | 0.43 | 39 | 44 | 49
KGE | ComplEx [44] | 339 | 0.247 | 15.8 | 27.5 | 42.8 | 5261 | 0.44 | 41 | 46 | 51
KGE | ConvE [8] | 244 | 0.325 | 23.7 | 35.6 | 50.1 | 4187 | 0.43 | 40 | 44 | 52
Rule-based | BLP [7] | 1985 | 0.092 | 6.2 | 9.8 | 15.0 | 12051 | 0.254 | 18.7 | 31.3 | 35.8
Rule-based | MLN [35] | 1980 | 0.098 | 6.7 | 10.3 | 16.0 | 11549 | 0.259 | 19.1 | 32.2 | 36.1
Ours | pLogicNet | 173 | 0.330 | 23.1 | 36.9 | 52.8 | 3436 | 0.230 | 1.5 | 41.1 | 53.1
Ours | pLogicNet∗ | 173 | 0.332 | 23.7 | 36.7 | 52.4 | 3408 | 0.441 | 39.8 | 44.6 | 53.7

The main results on the four datasets are presented in Tab. 1 and 2.
We can see that pLogicNet significantly outperforms the rule-based methods, as it uses a knowledge graph embedding model to improve inference. pLogicNet also outperforms all the knowledge graph embedding methods in most cases, where the improvement comes from its ability to exploit the knowledge captured by the logic rules. Moreover, our approach is superior to both hybrid methods (RUGE and NNE-AER) under most metrics, as it handles the uncertainty of logic rules in a more principled way. Comparing pLogicNet and pLogicNet*, pLogicNet* uses both q_θ and p_w to predict the plausibility of hidden triplets, which outperforms pLogicNet in most cases. The reason is that the information captured by q_θ and p_w is different and complementary, so combining them yields better performance.

5.2.2 Analysis of Different Rule Patterns

Table 3: Analysis of different rule patterns. H@K is in %.

                          FB15k                           FB15k-237
Rule Pattern    MR   MRR    H@1   H@3   H@10    MR   MRR    H@1   H@3   H@10
Without         40   0.730  64.7  79.4  86.4    181  0.326  22.9  36.3  52.1
Composition     40   0.752  69.3  78.7  86.0    173  0.335  24.1  37.1  52.5
Inverse         39   0.813  77.7  83.1  88.1    175  0.332  23.8  36.7  52.4
Symmetric       40   0.793  75.0  81.7  87.1    175  0.333  23.8  36.8  52.4
Subrelation     40   0.761  70.2  79.8  86.6    172  0.334  23.9  36.8  52.5

In pLogicNet, four types of rule patterns are used. Next, we systematically study the effect of each rule pattern. We take the FB15k and FB15k-237 datasets as examples, and report the results obtained with each single rule pattern in Tab. 3. On both datasets, most rule patterns lead to significant improvement over the model without logic rules. Moreover, the effects of different rule patterns are quite different across datasets. 
On FB15k, the inverse and symmetric rules are more important, whereas on FB15k-237, the composition and subrelation rules are more effective.

5.2.3 Inference with Different Knowledge Graph Embedding Methods

Table 4: Comparison of using different knowledge graph embedding methods. H@K is in %.

                                        FB15k                           WN18RR
KGE Method   Algorithm      MR   MRR    H@1   H@3   H@10    MR    MRR    H@1   H@3   H@10
TransE       pLogicNet      33   0.792  71.4  85.7  90.1    3436  0.230  1.5   41.1  53.1
             pLogicNet*     33   0.844  81.2  86.2  90.2    3408  0.441  39.8  44.6  53.7
DistMult     pLogicNet      40   0.791  73.1  83.2  89.5    4902  0.442  39.8  45.5  53.5
             pLogicNet*     39   0.815  76.8  84.6  89.8    4894  0.443  39.9  45.5  53.6
ComplEx      pLogicNet      39   0.776  70.6  81.7  88.5    5266  0.471  43.0  49.2  55.7
             pLogicNet*     45   0.788  73.5  82.1  88.5    5233  0.475  43.5  49.2  55.7

In this part, we compare the performance of pLogicNet with different knowledge graph embedding methods for inference. We use TransE as the default model and compare with two other widely-used knowledge graph embedding methods, DistMult [48] and ComplEx [44]. The results on the FB15k and WN18RR datasets are presented in Tab. 4. Compared with the results in Tab. 1 and 2, we see that pLogicNet improves the performance of all three methods by using logic rules. Moreover, pLogicNet achieves very robust performance with any of the three methods for inference.

Table 5: Effect of KGE on logic rules.

                  FB15k                    WN18
Iteration   # Triplets  Precision   # Triplets  Precision
1           64,929      79.21%      11,146      80.99%
2           74,717      79.31%      11,430      82.06%
3           76,268      79.10%      11,432      82.09%

Figure 2: Convergence analysis.

5.2.4 Effect of Knowledge Graph Embedding on Logic Rules

In the M-step of pLogicNet, we use the learned embeddings to annotate the hidden triplets, and further update the weights of the logic rules. 
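The E-step/M-step alternation can be sketched as a toy program. Everything below (the `Rule` and `ToyKGE` classes and all function names) is an illustrative stand-in we invented, not the paper's implementation, which parameterizes q_θ with a KGE model such as TransE and updates the rule weights by gradient descent:

```python
# Toy sketch of the variational EM loop in pLogicNet (illustrative only).

class Rule:
    """A symmetric rule r(h, t) -> r(t, h), one of the four patterns used."""
    def ground(self, triplets):
        return {(t, r, h) for (h, r, t) in triplets}

class ToyKGE:
    """Stand-in for q_theta: simply memorizes the triplets it is trained on."""
    def __init__(self):
        self.known = set()
    def train(self, triplets):
        self.known |= set(triplets)
    def infer(self):
        return set(self.known)

def variational_em(observed, rules, kge, num_iters=3):
    weights = {i: 0.0 for i, _ in enumerate(rules)}
    for _ in range(num_iters):
        # E-step: logic rules annotate hidden triplets, providing extra
        # positive training data; the KGE model then infers missing triplets.
        annotated = set().union(*(r.ground(observed) for r in rules))
        kge.train(observed | annotated)
        predicted = kge.infer()
        # M-step: update each rule's weight from both the observed and the
        # predicted triplets (a precision-style proxy for the true update).
        for i, r in enumerate(rules):
            extracted = r.ground(observed)
            weights[i] = len(extracted & (observed | predicted)) / max(len(extracted), 1)
    return weights

observed = {("a", "marriedTo", "b"), ("c", "marriedTo", "d")}
w = variational_em(observed, [Rule()], ToyKGE())
```

In this toy run, the KGE stand-in confirms every triplet the symmetric rule extracts, so the rule's weight rises to its maximum; in the real model, rules that conflict with the embeddings' predictions would instead be down-weighted.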
Next, we analyze the effect of knowledge graph embeddings on logic rules. Recall that in the E-step, the logic rules are used to annotate the hidden triplets through Eq. (7), thus collecting extra positive training data for embedding learning. To evaluate the performance of the logic rules, in each iteration we report the number of positive triplets discovered by the rules, as well as the precision of those triplets, in Tab. 5. We see that as training proceeds, the logic rules find more triplets with stable precision. This observation proves that the knowledge graph embedding model can indeed provide effective supervision for learning the weights of the logic rules.

5.2.5 Convergence Analysis

Finally, we present the convergence curves of pLogicNet* on the FB15k and WN18 datasets in Fig. 2. The horizontal axis represents the iteration, and the vertical axis shows the value of Hit@1 (in %). We see that on both datasets, our approach takes only 2-3 iterations to converge, which is very efficient.

6 Conclusion

This paper studies knowledge graph reasoning, and an approach called pLogicNet is proposed to integrate existing rule-based methods and knowledge graph embedding methods. pLogicNet models the distribution of all possible triplets with a Markov logic network, which is efficiently optimized with the variational EM algorithm. In the E-step, a knowledge graph embedding model is used to infer the hidden triplets, whereas in the M-step, the weights of the rules are updated based on the observed and inferred triplets. Experimental results prove the effectiveness of pLogicNet. In the future, we plan to explore more advanced models for inference, such as relational GCNs [37, 50] and RotatE [40].

Acknowledgements

We would like to thank all the reviewers for the insightful comments. We also thank Prof. Guillaume Rabusseau and Weiping Song for their valuable feedback. 
Jian Tang is supported by the Natural Sciences and Engineering Research Council of Canada, and the Canada CIFAR AI Chair Program.

References

[1] J. Besag. Statistical analysis of non-lattice data. The Statistician, 1975.

[2] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: A collaboratively created graph database for structuring human knowledge. In SIGMOD, 2008.

[3] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In NeurIPS, 2013.

[4] R. Burke. Knowledge-based recommender systems. Encyclopedia of Library and Information Systems, 2000.

[5] V. S. Costa, D. Page, M. Qazi, and J. Cussens. CLP(BN): Constraint logic programming for probabilistic knowledge. In UAI, 2002.

[6] R. Das, S. Dhuliawala, M. Zaheer, L. Vilnis, I. Durugkar, A. Krishnamurthy, A. Smola, and A. McCallum. Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning. In ICLR, 2018.

[7] L. De Raedt and K. Kersting. Probabilistic inductive logic programming. In Probabilistic Inductive Logic Programming. Springer, 2008.

[8] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel. Convolutional 2D knowledge graph embeddings. In AAAI, 2018.

[9] B. Ding, Q. Wang, B. Wang, and L. Guo. Improving knowledge graph embedding using simple constraints. In ACL, 2018.

[10] L. A. Galárraga, C. Teflioudi, K. Hose, and F. Suchanek. AMIE: Association rule mining under incomplete evidence in ontological knowledge bases. In WWW, 2013.

[11] S. Gershman and N. Goodman. Amortized inference in probabilistic reasoning. In CogSci, 2014.

[12] J. C. Giarratano and G. Riley. Expert Systems. PWS Publishing Co., 1998.

[13] W. R. Gilks, S. Richardson, and D. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman and Hall/CRC, 1995.

[14] Google. Freebase data dumps. 
https://developers.google.com/freebase/data.

[15] S. Guo, Q. Wang, L. Wang, B. Wang, and L. Guo. Knowledge graph embedding with iterative guidance from soft rules. In AAAI, 2018.

[16] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. Journal of Machine Learning Research, 2013.

[17] P. Jackson. Introduction to Expert Systems. Addison-Wesley Longman Publishing Co., Inc., 1998.

[18] R. Kadlec, O. Bajgar, and J. Kleindienst. Knowledge base completion: Baselines strike back. In Workshop on Representation Learning for NLP, 2017.

[19] K. Kersting and L. De Raedt. Towards combining inductive logic programming with Bayesian networks. In ICILP, 2001.

[20] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2014.

[21] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.

[22] S. Kok and P. Domingos. Learning the structure of Markov logic networks. In ICML, 2005.

[23] X. V. Lin, R. Socher, and C. Xiong. Multi-hop knowledge graph reasoning with reward shaping. In EMNLP, 2018.

[24] G. A. Miller. WordNet: A lexical database for English. Communications of the ACM, 1995.

[25] K. P. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In UAI, 1999.

[26] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models. Springer, 1998.

[27] M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 2016.

[28] M. Nickel, L. Rosasco, and T. Poggio. Holographic embeddings of knowledge graphs. In AAAI, 2016.

[29] M. Opper and D. Saad. Advanced Mean Field Methods: Theory and Practice. MIT Press, 2001.

[30] D. Poole. Probabilistic Horn abduction and Bayesian networks. 
Artificial Intelligence, 1993.

[31] A. Popescul and L. H. Ungar. Structural logistic regression for link analysis. Departmental Papers (CIS), page 133, 2003.

[32] S. Qi, W. Wang, B. Jia, J. Shen, and S.-C. Zhu. Learning human-object interactions by graph parsing neural networks. In ECCV, 2018.

[33] M. Qu, Y. Bengio, and J. Tang. GMNN: Graph Markov neural networks. In ICML, 2019.

[34] M. Qu, X. Ren, Y. Zhang, and J. Han. Weakly-supervised relation extraction by pattern-enhanced embedding learning. In WWW, 2018.

[35] M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 2006.

[36] R. Salakhutdinov and H. Larochelle. Efficient learning of deep Boltzmann machines. In AISTATS, 2010.

[37] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, 2018.

[38] Y. Shen, J. Chen, P.-S. Huang, Y. Guo, and J. Gao. M-Walk: Learning to walk over graphs using Monte Carlo tree search. In NeurIPS, 2018.

[39] P. Singla and P. Domingos. Discriminative training of Markov logic networks. In AAAI, 2005.

[40] Z. Sun, Z.-H. Deng, J.-Y. Nie, and J. Tang. RotatE: Knowledge graph embedding by relational rotation in complex space. In ICLR, 2019.

[41] B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In UAI, 2002.

[42] B. Taskar, P. Abbeel, M.-F. Wong, and D. Koller. Relational Markov networks. Introduction to Statistical Relational Learning, 2007.

[43] K. Toutanova and D. Chen. Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, 2015.

[44] T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, and G. Bouchard. Complex embeddings for simple link prediction. In ICML, 2016.

[45] Z. Wang, J. Zhang, J. Feng, and Z. 
Chen. Knowledge graph embedding by translating on hyperplanes. In AAAI, 2014.

[46] M. P. Wellman, J. S. Breese, and R. P. Goldman. From knowledge bases to decision models. The Knowledge Engineering Review, 1992.

[47] W. Xiong, T. Hoang, and W. Y. Wang. DeepPath: A reinforcement learning method for knowledge graph reasoning. In EMNLP, 2017.

[48] B. Yang, W.-t. Yih, X. He, J. Gao, and L. Deng. Embedding entities and relations for learning and inference in knowledge bases. In ICLR, 2015.

[49] X. Yao and B. Van Durme. Information extraction over structured data: Question answering with Freebase. In ACL, 2014.

[50] Y. Zhang, X. Chen, Y. Yang, A. Ramamurthy, B. Li, Y. Qi, and L. Song. Can graph neural networks help logic reasoning? arXiv:1906.02111, 2019.

[51] Z. Zheng, W. Wang, S. Qi, and S.-C. Zhu. Reasoning visual dialogs with structural and partial observations. In CVPR, 2019.