{"title": "On the Complexity and Approximation of Binary Evidence in Lifted Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 2868, "page_last": 2876, "abstract": "Lifted inference algorithms exploit symmetries in probabilistic models to speed up inference. They show impressive performance when calculating unconditional probabilities in relational models, but often resort to non-lifted inference when computing conditional probabilities. The reason is that conditioning on evidence breaks many of the model's symmetries, which preempts standard lifting techniques. Recent theoretical results show, for example, that conditioning on evidence which corresponds to binary relations is #P-hard, suggesting that no lifting is to be expected in the worst case. In this paper, we balance this grim result by identifying the Boolean rank of the evidence as a key parameter for characterizing the complexity of conditioning in lifted inference. In particular, we show that conditioning on binary evidence with bounded Boolean rank is efficient. This opens up the possibility of approximating evidence by a low-rank Boolean matrix factorization, which we investigate both theoretically and empirically.", "full_text": "On the Complexity and Approximation of\n\nBinary Evidence in Lifted Inference\n\nGuy Van den Broeck and Adnan Darwiche\n\nComputer Science Department\n\nUniversity of California, Los Angeles\n\n{guyvdb,darwiche}@cs.ucla.edu\n\nAbstract\n\nLifted inference algorithms exploit symmetries in probabilistic models to speed\nup inference. They show impressive performance when calculating unconditional\nprobabilities in relational models, but often resort to non-lifted inference when\ncomputing conditional probabilities. The reason is that conditioning on evidence\nbreaks many of the model\u2019s symmetries, which can preempt standard lifting tech-\nniques. 
Recent theoretical results show, for example, that conditioning on evidence which corresponds to binary relations is #P-hard, suggesting that no lifting is to be expected in the worst case. In this paper, we balance this negative result by identifying the Boolean rank of the evidence as a key parameter for characterizing the complexity of conditioning in lifted inference. In particular, we show that conditioning on binary evidence with bounded Boolean rank is efficient. This opens up the possibility of approximating evidence by a low-rank Boolean matrix factorization, which we investigate both theoretically and empirically.

1 Introduction

Statistical relational models are capable of representing both probabilistic dependencies and relational structure [1, 2]. Due to their first-order expressivity, they concisely represent probability distributions over a large number of propositional random variables, causing inference in these models to quickly become intractable. Lifted inference algorithms [3] attempt to overcome this problem by exploiting symmetries found in the relational structure of the model.

In the absence of evidence, exact lifted inference algorithms can work well. For large classes of statistical relational models [4], they perform inference that is polynomial in the number of objects in the model [5], and are thereby exponentially faster than classical inference algorithms. When conditioning a query on a set of evidence literals, however, these lifted algorithms lose their advantage over classical ones. The intuitive reason is that evidence breaks the symmetries in the model. The technical reason is that these algorithms perform an operation called shattering, which ends up reducing the first-order model to a propositional one. This issue is implicitly reflected in the experiment sections of exact lifted inference papers.
Most report on experiments without evidence. Examples include publications on FOVE [3, 6, 7] and WFOMC [8, 5]. Others found ways to efficiently deal with evidence on only unary predicates. They perform experiments without evidence on binary or higher-arity relations. There are examples for FOVE [9, 10], WFOMC [11], PTP [12] and CP [13].

This evidence problem has largely been ignored in the exact lifted inference literature, until recently, when Bui et al. [10] and Van den Broeck and Davis [11] showed that conditioning on unary evidence is tractable. More precisely, conditioning on unary evidence is polynomial in the size of evidence. This type of evidence expresses attributes of objects in the world, but not relations between them. Unfortunately, Van den Broeck and Davis [11] also showed that this tractability does not extend to evidence on binary relations, for which conditioning on evidence is #P-hard. Even if conditioning is hard in general, its complexity should depend on properties of the specific relation that is conditioned on. It is clear that some binary evidence is easy to condition on, even if it talks about a large number of objects, for example when all atoms are true (∀X, Y p(X, Y)) or false (∀X, Y ¬p(X, Y)). As our first main contribution, we formalize this intuition and characterize the complexity of conditioning more precisely in terms of the Boolean rank of the evidence. We show that it is a measure of how much lifting is possible, and that one can efficiently condition on large amounts of evidence, provided that its Boolean rank is bounded.

Despite the limitations, useful applications of exact lifted inference were found by sidestepping the evidence problem. For example, in lifted generative learning [14], the most challenging task is to compute partition functions without evidence.
Regardless, the lack of symmetries in real applications is often cited as a reason for rejecting the idea of lifted inference entirely (informally called the "death sentence for lifted inference"). This problem has been avoided for too long, and as lifted inference gains maturity, solving it becomes paramount. As our second main contribution, we present a first general solution to the evidence problem. We propose to approximate evidence by an over-symmetric matrix, and will show that this can be achieved by minimizing Boolean rank. The need for approximating evidence is new and specific to lifted inference: in (undirected) probabilistic graphical models, more evidence typically makes inference easier. Practically, we will show that existing tools from the data mining community can be used for this low-rank Boolean matrix factorization task.

The evidence problem is less pronounced in the approximate lifted inference literature. These algorithms often introduce approximations that lead to symmetries in their computation, even when there are no symmetries in the model. For approximate methods too, however, the benefits of lifting decrease with the amount of symmetry-breaking evidence (e.g., Kersting et al. [15]). We will show experimentally that over-symmetric evidence approximation is also a viable technique for approximate lifted inference.

2 Encoding Binary Relations in Unary

Our analysis of conditioning is based on a reduction, turning evidence on a binary relation into evidence on several unary predicates. We first introduce some necessary background.

2.1 Background

An atom p(t1, ..., tn) consists of a predicate p/n of arity n followed by n arguments, which are either (lowercase) constants or (uppercase) logical variables. A literal is an atom a or its negation ¬a. A formula combines atoms with logical connectives (e.g., ∨, ∧, ⇔).
A formula is ground if it does not contain any logical variables. A possible world assigns a truth value to each ground atom. Statistical relational languages define a probability distribution over possible worlds, where ground atoms are individual random variables. Numerous languages have been proposed in recent years, and our analysis will apply to many, including MLNs [16], parfactors [3] and WFOMC problems [8].

Example 1. The following MLNs model the dependencies between web pages. A first, peer-to-peer model says that student web pages are more likely to link to other student pages.

    w    studentpage(X) ∧ linkto(X, Y) ⇒ studentpage(Y)

It increases the probability of a world by a factor e^w with every pair of pages X, Y that satisfies the formula. A second, hierarchical model says that professors are more likely to link to course pages.

    w    profpage(X) ∧ linkto(X, Y) ⇒ coursepage(Y)

In this context, evidence e is a truth-value assignment to a set of ground atoms, and is often represented as a conjunction of literals. In unary evidence, atoms have one argument (e.g., studentpage(a)), while in binary evidence, they have two (e.g., linkto(a, b)). Without loss of generality, we assume full evidence on certain predicates (i.e., all their ground atoms are in e).¹ We will sometimes represent unary evidence as a Boolean vector and binary evidence as a Boolean matrix.

¹ Partial evidence on the relation p can be encoded as full evidence on predicates p0 and p1 by adding formulas ∀X, Y p(X, Y) ⇐ p1(X, Y) and ∀X, Y ¬p(X, Y) ⇐ p0(X, Y) to the model.

Example 2.
Evidence e = p(a, a) ∧ p(a, b) ∧ ¬p(a, c) ∧ ··· ∧ ¬p(d, c) ∧ p(d, d) is represented by

                Y=a  Y=b  Y=c  Y=d
      X=a   [    1    1    0    0   ]
  P = X=b   [    1    1    0    1   ]
      X=c   [    0    0    1    0   ]
      X=d   [    1    0    0    1   ]

We will look at computing conditional probabilities Pr(q | e) for single ground atoms q. Finally, we assume a representation language that can express universally quantified logical constraints.

2.2 Vector-Product Binary Evidence

Certain binary relations can be represented by a pair of unary predicates. By adding the formula

    ∀X, ∀Y, p(X, Y) ⇔ q(X) ∧ r(Y)    (1)

to our statistical relational model and conditioning on the q and r relations, we can condition on certain types of binary p relations. Assuming that we condition on the q and r predicates, adding this formula (as hard clauses) to the model does not change the probability distribution over the atoms in the original model. It is merely an indirect way of conditioning on the p relation. If we now represent these unary relations by vectors q and r, and the binary relation by the binary matrix P, the above technique allows us to condition on any relation P that can be factorized into the outer vector product P = q rᵀ.

Example 3.
Consider the following outer vector factorization of the Boolean matrix P.

        [ 0 0 0 0 ]   [ 0 ]   [ 1 ]ᵀ
    P = [ 1 0 0 1 ] = [ 1 ] · [ 0 ]
        [ 0 0 0 0 ]   [ 0 ]   [ 0 ]
        [ 1 0 0 1 ]   [ 1 ]   [ 1 ]

In a model containing Formula 1, this factorization indicates that we can condition on the 16 binary evidence literals ¬p(a, a) ∧ ¬p(a, b) ∧ ··· ∧ ¬p(d, c) ∧ p(d, d) of P by conditioning on the 8 unary literals ¬q(a) ∧ q(b) ∧ ¬q(c) ∧ q(d) ∧ r(a) ∧ ¬r(b) ∧ ¬r(c) ∧ r(d) represented by q and r.

2.3 Matrix-Product Binary Evidence

This idea of encoding a binary relation in unary relations can be generalized to n pairs of unary relations, by adding the following formula to our model.

    ∀X, ∀Y, p(X, Y) ⇔ (q1(X) ∧ r1(Y)) ∨ (q2(X) ∧ r2(Y)) ∨ ··· ∨ (qn(X) ∧ rn(Y))    (2)

By conditioning on the qi and ri relations, we can now condition on a much richer set of binary p relations. The relations that can be expressed this way are all the matrices that can be represented by a sum of outer products (in Boolean algebra, where + is ∨ and 1 ∨ 1 = 1):

    P = q1 r1ᵀ ∨ q2 r2ᵀ ∨ ··· ∨ qn rnᵀ = Q Rᵀ    (3)

where the columns of Q and R are the qi and ri vectors respectively, and the matrix multiplication is performed in Boolean algebra, that is, (Q Rᵀ)i,j = ∨r Qi,r ∧ Rj,r.

Example 4.
Consider the following P, its decomposition into a sum/disjunction of outer vector products, and the corresponding Boolean matrix multiplication: P = q1 r1ᵀ ∨ q2 r2ᵀ ∨ q3 r3ᵀ = Q Rᵀ, with

    q1 = [0 1 0 1]ᵀ,  r1 = [1 0 0 1]ᵀ
    q2 = [1 1 0 0]ᵀ,  r2 = [1 1 0 0]ᵀ
    q3 = [0 0 1 0]ᵀ,  r3 = [0 0 1 0]ᵀ

so that

        [ 1 1 0 0 ]   [ 0 1 0 ]   [ 1 1 0 ]ᵀ
    P = [ 1 1 0 1 ] = [ 1 1 0 ] · [ 0 1 0 ]
        [ 0 0 1 0 ]   [ 0 0 1 ]   [ 0 0 1 ]
        [ 1 0 0 1 ]   [ 1 0 0 ]   [ 1 0 0 ]

This factorization shows that we can condition on the binary evidence literals of P (see Example 2) by conditioning on the unary literals

    e = [¬q1(a) ∧ q1(b) ∧ ¬q1(c) ∧ q1(d)] ∧ [r1(a) ∧ ¬r1(b) ∧ ¬r1(c) ∧ r1(d)]
      ∧ [q2(a) ∧ q2(b) ∧ ¬q2(c) ∧ ¬q2(d)] ∧ [r2(a) ∧ r2(b) ∧ ¬r2(c) ∧ ¬r2(d)]
      ∧ [¬q3(a) ∧ ¬q3(b) ∧ q3(c) ∧ ¬q3(d)] ∧ [¬r3(a) ∧ ¬r3(b) ∧ r3(c) ∧ ¬r3(d)].

3 Boolean Matrix Factorization

Matrix factorization (or decomposition) is a popular linear algebra tool. Some well-known instances are singular value decomposition and non-negative matrix factorization (NMF) [17, 18]. NMF factorizes into a product of non-negative matrices, which are more easily interpretable, and has therefore attracted much attention for unsupervised learning and feature extraction. These factorizations all work with real-valued matrices.
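As an aside, the Boolean matrix product of Equation 3 is easy to compute in code: take the ordinary integer matrix product and threshold it at 1. A minimal sketch in Python (numpy assumed; Q and R are the factor matrices of Example 4, with the qi and ri as columns):

```python
import numpy as np

# Factor matrices from Example 4: columns are the vectors q1, q2, q3 and r1, r2, r3.
Q = np.array([[0, 1, 0],
              [1, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=bool)
R = np.array([[1, 1, 0],
              [0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=bool)

def boolean_product(Q, R):
    """(Q R^T)_ij = OR_r (Q_ir AND R_jr): integer matrix product, thresholded at 1."""
    return (Q.astype(int) @ R.astype(int).T) > 0

P = boolean_product(Q, R)                # reproduces the evidence matrix of Example 2
real = Q.astype(int) @ R.astype(int).T   # over the reals, entry (b, a) is 2, as in Example 5
```

The threshold is what turns 1 + 1 into 1 ∨ 1 = 1; without it, the real-valued product disagrees with P, which is exactly the point of Example 5 below.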
We instead consider Boolean-valued matrices, with only 0/1 entries.

3.1 Boolean Rank

Factorizing a matrix P as Q Rᵀ in Boolean algebra is a known problem called Boolean matrix factorization (BMF) [19, 20]. BMF factorizes a (k × l) matrix P into a (k × n) matrix Q and an (l × n) matrix R, where potentially n ≪ k and n ≪ l, and we always have that n ≤ min(k, l). Any Boolean matrix can be factorized this way, and the smallest number n for which it is possible is called the Boolean rank of the matrix. Unlike (textbook) real-valued rank, computing the Boolean rank is NP-hard and cannot be approximated unless P = NP [19]. The Boolean and real-valued rank are incomparable, and the Boolean rank can be exponentially smaller than the real-valued rank.

Example 5. The factorization in Example 4 is a BMF with Boolean rank 3. It is only a decomposition in Boolean algebra and not over the real numbers. Indeed, the matrix product over the reals contains an incorrect value of 2:

    [ 0 1 0 ]          [ 1 1 0 ]ᵀ   [ 1 1 0 0 ]
    [ 1 1 0 ]  ×real   [ 0 1 0 ]  = [ 2 1 0 1 ]  ≠ P
    [ 0 0 1 ]          [ 0 0 1 ]    [ 0 0 1 0 ]
    [ 1 0 0 ]          [ 1 0 0 ]    [ 1 0 0 1 ]

Note that P is of full real-valued rank (having four non-zero singular values) and that its Boolean rank is lower than its real-valued rank.

3.2 Approximate Boolean Factorization

Computing Boolean ranks is a theoretical problem. Because most real-world matrices will have nearly full rank (i.e., almost min(k, l)), applications of BMF look at approximate factorizations. The goal is to find a pair of (small) Boolean matrices Q (k × n) and R (l × n) such that P ≈ Q Rᵀ, or more specifically, to find matrices that optimize some objective that trades off approximation error and Boolean rank n.
When n ≪ k and n ≪ l, this approximation extracts interesting structure and removes noise from the matrix. This has caused BMF to receive considerable attention in the data mining community recently, as a tool for analyzing high-dimensional data. It is used to find important and interpretable (i.e., Boolean) concepts in a data matrix. Unfortunately, the approximate BMF optimization problem is NP-hard as well, and inapproximable [20]. However, several algorithms have been proposed that work well in practice. Algorithms exist that find good approximations for fixed values of n [20], or when P is sparse [21]. BMF is related to other data mining tasks, such as biclustering [22] and tiling databases [23], whose algorithms could also be used for approximate BMF. In the context of social network analysis, BMF is related to stochastic block models [24] and their extensions, such as infinite relational models.

4 Complexity of Binary Evidence

Our goal in this section is to provide a new complexity result for reasoning with binary evidence in the context of lifted inference. Our result can be thought of as a parametrized complexity result, similar to ones based on treewidth in the case of propositional inference. To state the new result, however, we must first define the computational task formally. We will also review the key complexity result currently known about this computation (i.e., the one we will be improving on). Consider an MLN ∆ and let Γm be a set of ground literals representing binary evidence. That is, for some binary predicate p(X, Y), evidence Γm contains precisely one literal (positive or negative) for each grounding of predicate p(X, Y).
Here, m represents the number of objects that the parameters X and Y may take.² Therefore, evidence Γm must contain precisely m² literals.

² We assume without loss of generality that all logical variables range over the same set of objects.

Suppose now that Prm is the distribution induced by MLN ∆ over m objects, and q is a ground literal. Our analysis will apply to classes of models ∆ that are domain-liftable [4], which means that the complexity of computing Prm(q) without evidence is polynomial in m. One such class is the set of MLNs with two logical variables per formula [5]. Our task is then to compute the posterior probability Prm(q | em), where em is the conjunction of the ground literals in binary evidence Γm. Moreover, our goal here is to characterize the complexity of this computation as a function of evidence size m. The following recent result provides a lower bound on the complexity of this computation [11].

Theorem 1. Suppose that evidence Γm is binary. Then there exists a domain-liftable MLN ∆ with a corresponding distribution Prm, and a posterior marginal Prm(q | em) that cannot be computed by any algorithm whose complexity grows polynomially in evidence size m, unless P = NP.

This is an analogue to results according to which, for example, the complexity of computing posterior probabilities in propositional graphical models is exponential in the worst case. Yet, for these models, the complexity of inference can be parametrized, allowing one to bound the complexity of inference on some models. Perhaps the best example of such a parametrized complexity is the one based on treewidth, which can be thought of as a measure of the model's sparsity (or tree-likeness). In this case, inference can be shown to be linear in the size of the model and exponential only in its treewidth.
Hence, this parametrized complexity result allows us to state that inference can be done efficiently on models with bounded treewidth. We now provide a similar parametrized complexity result, but for evidence in lifted inference. In this case, the parameter we use to characterize complexity is the Boolean rank of the evidence.

Theorem 2. Suppose that evidence Γm is binary and has bounded Boolean rank. Then for every domain-liftable MLN ∆ and corresponding distribution Prm, the complexity of computing posterior marginal Prm(q | em) grows polynomially in evidence size m.

The proof of this theorem is based on the reduction from binary to unary evidence described in Section 2. In particular, our reduction first extends the MLN ∆ with Formula 2, leading to the new MLN ∆′ and new pairs of unary predicates qi and ri. This does not change the domain-liftability of ∆′, as Formula 2 is itself liftable. We then replace binary evidence Γm by unary evidence Γ′. That is, the ground literals of the binary predicate p are replaced by ground literals of the unary predicates qi and ri (see Example 4). This unary evidence is obtained by Boolean matrix factorization. As the matrix size in our reduction is m², the following lemma implies that the first step of our reduction is polynomial in m for bounded-rank evidence.

Lemma 3 (Miettinen [25]). The complexity of Boolean matrix factorization for matrices with bounded Boolean rank is polynomial in their size.

The main observation in our reduction is that Formula 2 has size n, which is the Boolean rank of the given binary evidence.
Hence, when the Boolean rank n is bounded by a constant, the size of the extended MLN ∆′ is independent of the evidence size and is proportional to the size of the original MLN ∆. We have now reduced inference on MLN ∆ and binary evidence Γm to inference on an extended MLN ∆′ and unary evidence Γ′. The second observation behind the proof is the following.

Lemma 4 (Van den Broeck and Davis [11], Van den Broeck [26]). Suppose that evidence Γm is unary. Then for every domain-liftable MLN ∆ and corresponding distribution Prm, the complexity of computing posterior marginal Prm(q | em) grows polynomially in evidence size m.

Hence, computing posterior probabilities can be done in time which is polynomial in the size m of the unary evidence, which completes our proof.

We can now identify additional similarities between treewidth and Boolean rank. Exact inference algorithms for probabilistic graphical models typically perform two steps, namely to (a) compute a tree decomposition of the graphical model (or a corresponding variable order), and (b) perform inference that is polynomial in the size of the decomposition, but potentially exponential in its (tree)width. The analogous steps for conditioning are to (a) perform a BMF, and (b) perform inference that is polynomial in the size of the BMF, but potentially exponential in its rank. The (a) steps are both NP-hard, yet are efficient assuming bounded treewidth [27] or bounded Boolean rank (Lemma 3). Whereas treewidth is a measure of tree-likeness and sparsity of the graphical model, Boolean rank seems to be a fundamentally different property, more related to the presence of symmetries in evidence.

5 Over-Symmetric Evidence Approximation

Theorem 2 opens up many new possibilities.
Even for evidence with high Boolean rank, it is possible to find a low-rank approximate BMF of the evidence, as is commonly done for other data mining and machine learning problems. Algorithms already exist for solving this task (cf. Section 3).

Example 6. The evidence matrix from Example 4 has Boolean rank three. Dropping the third pair of vectors reduces the Boolean rank to two:

        [ 1 1 0 0 ]   [ 0 1 ]   [ 1 1 ]ᵀ   [ 1 1 0 0 ]
    P = [ 1 1 0 1 ] ≈ [ 1 1 ] · [ 0 1 ]  = [ 1 1 0 1 ]
        [ 0 0 1 0 ]   [ 0 0 ]   [ 0 0 ]    [ 0 0 0 0 ]
        [ 1 0 0 1 ]   [ 1 0 ]   [ 1 0 ]    [ 1 0 0 1 ]

This factorization is approximate, as it flips the evidence for atom p(c, c) from true to false (the third diagonal entry of the product). By paying this price, the evidence has more symmetries, and we can condition on the binary relation by introducing only two instead of three new pairs (qi, ri) of unary predicates.

Low-rank approximate BMF is an instance of a more general idea: that of over-symmetric evidence approximation. This means that when we want to compute Pr(q | e), we approximate it by computing Pr(q | e′) instead, with evidence e′ that permits more efficient inference. In this case, it is more efficient because it maintains more symmetries of the model and permits more lifting.
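The effect of the rank-2 approximation in Example 6 can be checked mechanically. A small numpy sketch (rows and columns are ordered a, b, c, d; the kept factor columns are q1, q2 and r1, r2):

```python
import numpy as np

# Exact evidence matrix P (Examples 2 and 4) and the rank-2 factors of Example 6.
P  = np.array([[1, 1, 0, 0],
               [1, 1, 0, 1],
               [0, 0, 1, 0],
               [1, 0, 0, 1]], dtype=bool)
Q2 = np.array([[0, 1], [1, 1], [0, 0], [1, 0]], dtype=bool)  # columns q1, q2
R2 = np.array([[1, 1], [0, 1], [0, 0], [1, 0]], dtype=bool)  # columns r1, r2

# Boolean matrix product: integer product thresholded at 1.
approx = (Q2.astype(int) @ R2.astype(int).T) > 0

# Literals changed by the approximation: exactly one, entry (2, 2),
# i.e. atom p(c, c), flipped from true to false.
flips = np.argwhere(approx != P)
```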
Because all lifted inference algorithms, exact or approximate, exploit symmetries, we expect this general idea, and low-rank approximate BMF in particular, to improve the performance of any lifted inference algorithm.

Having a small amount of incorrect evidence in the approximation need not be a problem. As these literals are not covered by the first, most important vector pairs, they can be considered noise in the original matrix. Hence, a low-rank approximation may actually improve the performance of, for example, a lifted collective classification algorithm. On the other hand, the approximation made in Example 6 may not be desirable if we are querying attributes of the constant c, and we may prefer to approximate other areas of the evidence matrix instead. There are many challenges in finding appropriate evidence approximations, which makes the task all the more interesting.

6 Empirical Evaluation

To complement the theoretical analysis from the previous sections, we now report on experiments that investigate the following practical questions.

Q1 How well can we approximate a real-world relational data set by a low-rank Boolean matrix?
Q2 Is Boolean rank a good indicator of the complexity of inference, as suggested by Theorem 2?
Q3 Is over-symmetric evidence approximation a viable technique for approximate lifted inference?

To answer Q1, we compute approximations of the linkto binary relation in the WebKB data set using the ASSO algorithm for approximate BMF [20]. The WebKB data set consists of web pages from the computer science departments of four universities [28]. The data has information about words that appear on pages, labels of pages, and links between web pages (the linkto relation). There are four folds, one for each university. The exact evidence matrix for the linkto relation ranges in size from 861 by 861 to 1240 by 1240. Its real-valued rank ranges from 384 to 503.
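ASSO itself is more elaborate; purely as a hedged illustration of what an approximate BMF computes (this is a toy heuristic, not the ASSO algorithm), one can greedily cover the ones of P with rank-one blocks, one per round:

```python
import numpy as np

def greedy_bmf(P, n):
    """Toy greedy approximate BMF: each round tries every distinct row of P as a
    pattern r, sets q to the rows of P that contain r, and keeps the rank-one
    block q r^T covering the most not-yet-covered ones. It never introduces a
    spurious 1, so the error equals the number of ones left uncovered."""
    P = P.astype(bool)
    covered = np.zeros_like(P)
    qs, rs = [], []
    for _ in range(n):
        best_gain, best = 0, None
        for cand in {tuple(row) for row in P}:
            r = np.array(cand)
            if not r.any():
                continue
            q = P[:, r].all(axis=1)  # rows of P that contain the pattern r
            gain = int((np.outer(q, r) & ~covered).sum())
            if gain > best_gain:
                best_gain, best = gain, (q, r)
        if best is None:  # every 1 is covered: the factorization is exact
            break
        q, r = best
        covered |= np.outer(q, r)
        qs.append(q)
        rs.append(r)
    return np.array(qs).T, np.array(rs).T  # columns are the q_i and r_i

# On the 4-by-4 evidence matrix of Example 2, rank 3 is exact and rank 2
# leaves a single incorrect literal, p(c, c), mirroring Example 6.
P = np.array([[1, 1, 0, 0], [1, 1, 0, 1], [0, 0, 1, 0], [1, 0, 0, 1]], dtype=bool)
Q3, R3 = greedy_bmf(P, 3)
Q2, R2 = greedy_bmf(P, 2)
err2 = int((((Q2.astype(int) @ R2.astype(int).T) > 0) != P).sum())
```

On matrices the size of WebKB's one would of course use ASSO or a comparable tool; this sketch only shows what the error-versus-rank curves of Figure 1 measure.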
Performing a BMF approximation in this domain adds or removes hyperlinks between web pages, so that more web pages can be grouped together that behave similarly. Figure 1 plots the approximation error for increasing Boolean ranks, measured as the number of incorrect evidence literals. The error goes down quickly for low rank, and is reduced by half after Boolean rank 70 to 80, even though the matrix dimensions and real-valued rank are much higher. Note that these evidence matrices contain around a million entries, and are sparse. Hence, these approximations correctly label 99.7% to 99.95% of the atoms.

Figure 1: BMF approximation error, in terms of the number of incorrect literals, for the WebKB linkto relation (one curve per fold: cornell, texas, washington, wisconsin).

Figure 2: First-order NNF circuit size (number of nodes) for increasing Boolean rank n, for (a) the peer-to-peer and (b) the hierarchical model.

    Rank n | Circuit Size (a) | Circuit Size (b)
    0      | 18               | 24
    1      | 58               | 50
    2      | 160              | 129
    3      | 1873             | 371
    4      | > 2129           | 1098
    5      | ?                | 3191
    6      | ?                | 9571

Figure 3: KLD of LMCMC on different BMF approximations ((a) Texas and (b) Wisconsin data sets), relative to the KLD of vanilla MCMC on the same approximation. From top to bottom, the lines represent exact evidence (blue), and approximations (red) of rank 150, 100, 75, 50, 20, 10, 5, 2, and 1.

To answer Q2, we perform two sets of experiments. Firstly, we look at exact lifted inference and investigate the influence of adding Formula 2 to the "peer-to-peer" and "hierarchical" MLNs from Example 1. The goal is to condition on linkto relations with increasing rank n. These models are compiled using the WFOMC [8] algorithm into first-order NNF circuits, which allow for exact domain-lifted inference (cf. Lemma 4). Table 2 shows the sizes of these circuits. As expected, circuit sizes grow exponentially with n.
Evidence breaks more symmetries in the peer-to-peer model than in the hierarchical model, causing the circuit size to increase more quickly with Boolean rank.

Since the connection between rank and exact inference is obvious from Theorem 2, the more interesting question in Q2 is whether Boolean rank is indicative of the complexity of approximate lifted inference as well. Therefore, we investigate its influence on the Lifted MCMC algorithm (LMCMC) [29] with Rao-Blackwellized probability estimation [30]. LMCMC interleaves standard MCMC steps (here Gibbs sampling) with jumps to states that are symmetric in the graphical model, in order to speed up mixing of the chain. We run LMCMC on the WebKB MLN of Davis and Domingos [31], which has 333 first-order formulas and over 1 million random variables. It classifies web pages into 6 categories, based on their link structure and the 50 most predictive words they contain. We learn its parameters with the Alchemy package and obtain evidence sets of varying Boolean rank from the factorizations of Figure 1.³ For these, we run both vanilla and lifted MCMC, and measure the KL divergence (KLD) between the marginal distribution at each iteration⁴ and a ground truth obtained from 3 million iterations on the corresponding evidence set. Figure 3 plots the KLD of LMCMC divided by the KLD of MCMC. It shows that the improvement of LMCMC over MCMC goes down with Boolean rank, answering Q2 positively.

To answer Q3, we look at the KLD between different evidence approximations Pr(· | e′n) of rank n, and the true marginals Pr(· | e) conditioned on exact evidence. As this requires a good estimate of Pr(· | e), we make our learned WebKB model more tractable by removing formulas about word content.
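The paper does not spell out how the KLD between marginal distributions is aggregated; one plausible reading is a per-atom Bernoulli KL divergence averaged over the query atoms, which can be sketched as follows (a hypothetical helper, not the authors' evaluation code):

```python
import math

def marginal_kld(p, q, eps=1e-9):
    """Average Bernoulli KL divergence KL(p_i || q_i) over two equal-length
    lists of per-atom marginal probabilities, clamped away from 0 and 1."""
    total = 0.0
    for pi, qi in zip(p, q):
        pi = min(max(pi, eps), 1.0 - eps)
        qi = min(max(qi, eps), 1.0 - eps)
        total += pi * math.log(pi / qi) + (1.0 - pi) * math.log((1.0 - pi) / (1.0 - qi))
    return total / len(p)
```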
For two approximations e′a and e′b such that rank a < b, we expect LMCMC to converge faster to Pr(· | e′a) than to Pr(· | e′b), as suggested by Figure 3. However, because Pr(· | e′a) is a cruder approximation of Pr(· | e) than Pr(· | e′b) is, the KLD at convergence should be worse for a than for b. Hence, we expect to see a trade-off, where the lowest ranks are optimal in the beginning, higher ranks become optimal later on, and the exact model is optimal at convergence.

³ When synthetically generating evidence of these ranks, results are comparable.
⁴ Runtime per iteration is comparable for both algorithms. BMF runtime is negligible.

Figure 4: Error for different low-rank approximations of WebKB, in KLD from true marginals: (a) Cornell, ranks 2 and 10; (b) Cornell, ranks 75 and 150; (c) Washington, ranks 75 and 150; (d) Wisconsin, ranks 75 and 150.

Figure 4 shows exactly that, for a representative sample of ranks and data sets. In Figure 4(a), ranks 2 and 10 outperform LMCMC with the exact evidence at first. Exact evidence overtakes rank 2 after 40k iterations, and rank 10 after 50k. After 80k iterations, even non-lifted MCMC outperforms these crude approximations. Figure 4(b) shows the other side of the spectrum, where the rank 75 and 150 approximations are overtaken at iterations 90k and 125k. Figure 4(c) is representative of other datasets. Note here that at around iteration 50k, rank 75 in turn outperforms the rank 150 approximation, which has fewer symmetries and does not permit as much lifting. Finally, Figure 4(d) shows the ideal case for low-rank approximation. This is the largest dataset, and therefore the most challenging inference task.
In Figure 4(d), LMCMC on e converges slowly compared to its approximations e′, and e′ results in almost perfect marginals. The crossover point where exact inference outperforms the approximation is never reached in practice. This answers Q3 positively.

7 Conclusions

We presented two main results. The first is a more precise complexity characterization of conditioning on binary evidence, in terms of its Boolean rank. The second is a technique to approximate binary evidence by a low-rank Boolean matrix factorization. This is a first type of over-symmetric evidence approximation that can speed up lifted inference. We showed empirically that low-rank BMF speeds up approximate inference, leading to improved approximations.

For future work, we want to evaluate the practical implications of the theory developed for other lifted inference algorithms, such as lifted BP, and look at the performance of over-symmetric evidence approximation on machine learning tasks such as collective classification. There are many remaining challenges in finding good evidence-approximation schemes, including ones that are query-specific (cf. de Salvo Braz et al. [32]) or that incrementally run inference to find better approximations (cf. Kersting et al. [33]). Furthermore, we want to investigate other subsets of binary relations for which conditioning could be efficient, in particular functional relations p(X, Y), where each X has at most a limited number of associated Y values.

Acknowledgments

We thank Pauli Miettinen, Mathias Niepert, and Jilles Vreeken for helpful suggestions.
This work was supported by ONR grant #N00014-12-1-0423, NSF grant #IIS-1118122, NSF grant #IIS-0916161, and the Research Foundation-Flanders (FWO-Vlaanderen).

References

[1] L. Getoor and B. Taskar, editors. An Introduction to Statistical Relational Learning. MIT Press, 2007.

[2] Luc De Raedt, Paolo Frasconi, Kristian Kersting, and Stephen Muggleton, editors. Probabilistic Inductive Logic Programming: Theory and Applications. Springer-Verlag, 2008.

[3] David Poole. First-order probabilistic inference. In Proceedings of IJCAI, pages 985–991, 2003.

[4] Manfred Jaeger and Guy Van den Broeck. Liftability of probabilistic inference: Upper and lower bounds. In Proceedings of the 2nd International Workshop on Statistical Relational AI, 2012.

[5] Guy Van den Broeck. On the completeness of first-order knowledge compilation for lifted probabilistic inference. In Advances in Neural Information Processing Systems 24 (NIPS), pages 1386–1394, 2011.

[6] Rodrigo de Salvo Braz, Eyal Amir, and Dan Roth. Lifted first-order probabilistic inference. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 1319–1325, 2005.

[7] B. Milch, L.S. Zettlemoyer, K. Kersting, M. Haimes, and L.P. Kaelbling. Lifted probabilistic inference with counting formulas.
In Proceedings of the 23rd AAAI Conference on Artificial Intelligence, 2008.

[8] Guy Van den Broeck, Nima Taghipour, Wannes Meert, Jesse Davis, and Luc De Raedt. Lifted probabilistic inference by first-order knowledge compilation. In Proceedings of IJCAI, pages 2178–2185, 2011.

[9] N. Taghipour, D. Fierens, J. Davis, and H. Blockeel. Lifted variable elimination with arbitrary constraints. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, 2012.

[10] H.H. Bui, T.N. Huynh, and R. de Salvo Braz. Exact lifted inference with distinct soft evidence on every object. In Proceedings of the 26th AAAI Conference on Artificial Intelligence, 2012.

[11] Guy Van den Broeck and Jesse Davis. Conditioning in first-order knowledge compilation and lifted probabilistic inference. In Proceedings of the 26th AAAI Conference on Artificial Intelligence, 2012.

[12] Vibhav Gogate and Pedro Domingos. Probabilistic theorem proving. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI), pages 256–265, 2011.

[13] A. Jha, V. Gogate, A. Meliou, and D. Suciu. Lifted inference seen from the other side: The tractable features. In Proceedings of the 24th Conference on Neural Information Processing Systems (NIPS), 2010.

[14] Guy Van den Broeck, Wannes Meert, and Jesse Davis. Lifted generative parameter learning. In Statistical Relational AI (StaRAI) workshop, July 2013.

[15] K. Kersting, B. Ahmadi, and S. Natarajan. Counting belief propagation. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI), pages 277–284, 2009.

[16] M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62(1):107–136, 2006.

[17] D. Lee and H. Seung. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, 13:556–562, 2001.

[18] M. Berry, M.
Browne, A. Langville, V. Pauca, and R. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics and Data Analysis, 2006.

[19] Pauli Miettinen, Taneli Mielikäinen, Aristides Gionis, Gautam Das, and Heikki Mannila. The discrete basis problem. In Knowledge Discovery in Databases, pages 335–346. Springer, 2006.

[20] Pauli Miettinen, Taneli Mielikäinen, Aristides Gionis, Gautam Das, and Heikki Mannila. The discrete basis problem. IEEE Transactions on Knowledge and Data Engineering, 20(10):1348–1362, 2008.

[21] Pauli Miettinen. Sparse Boolean matrix factorizations. In IEEE 10th International Conference on Data Mining (ICDM), pages 935–940. IEEE, 2010.

[22] Boris Mirkin. Mathematical Classification and Clustering, volume 11. Kluwer Academic Publishers, 1996.

[23] Floris Geerts, Bart Goethals, and Taneli Mielikäinen. Tiling databases. In Discovery Science, 2004.

[24] Paul W. Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.

[25] Pauli Miettinen. Matrix decomposition methods for data mining: Computational complexity and algorithms. PhD thesis, 2009.

[26] Guy Van den Broeck. Lifted Inference and Learning in Statistical Relational Models. PhD thesis, KU Leuven, January 2013.

[27] Hans L. Bodlaender. Treewidth: Algorithmic techniques and results. Springer, 1997.

[28] M. Craven and S. Slattery. Relational learning with statistical predicate invention: Better models for hypertext. Machine Learning Journal, 43(1/2):97–119, 2001.

[29] Mathias Niepert. Markov chains on orbits of permutation groups. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence (UAI), 2012.

[30] Mathias Niepert. Symmetry-aware marginal density estimation.
In Proceedings of the 27th Conference on Artificial Intelligence (AAAI), 2013.

[31] Jesse Davis and Pedro Domingos. Deep transfer via second-order Markov logic. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 217–224, 2009.

[32] R. de Salvo Braz, S. Natarajan, H. Bui, J. Shavlik, and S. Russell. Anytime lifted belief propagation. In Proceedings of the 6th International Workshop on Statistical Relational Learning, 2009.

[33] K. Kersting, Y. El Massaoudi, B. Ahmadi, and F. Hadiji. Informed lifting for message-passing. In Proceedings of the 24th AAAI Conference on Artificial Intelligence, 2010.