{"title": "DeepProbLog:  Neural Probabilistic Logic Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 3749, "page_last": 3759, "abstract": "We introduce DeepProbLog, a probabilistic logic programming language that incorporates deep learning by means of neural predicates. We show how existing inference and learning techniques can be adapted for the new language. Our experiments demonstrate that DeepProbLog supports (i) both symbolic and subsymbolic representations and inference, (ii) program induction, (iii) probabilistic (logic) programming, and (iv) (deep) learning from examples. To the best of our knowledge, this work is the first to propose a framework where general-purpose neural networks and expressive probabilistic-logical modeling and reasoning are integrated in a way that exploits the full expressiveness and strengths of both worlds and can be trained end-to-end based on examples.", "full_text": "DeepProbLog: Neural Probabilistic Logic Programming\n\nRobin Manhaeve\nKU Leuven\nrobin.manhaeve@cs.kuleuven.be\n\nSebastijan Duman\u010di\u0107\nKU Leuven\nsebastijan.dumancic@cs.kuleuven.be\n\nAngelika Kimmig\nCardiff University\nKimmigA@cardiff.ac.uk\n\nThomas Demeester\u2217\nGhent University - imec\nthomas.demeester@ugent.be\n\nLuc De Raedt\u2217\nKU Leuven\nluc.deraedt@cs.kuleuven.be\n\nAbstract\n\nWe introduce DeepProbLog, a probabilistic logic programming language that incorporates deep learning by means of neural predicates. We show how existing inference and learning techniques can be adapted for the new language. Our experiments demonstrate that DeepProbLog supports (i) both symbolic and subsymbolic representations and inference, (ii) program induction, (iii) probabilistic (logic) programming, and (iv) (deep) learning from examples. 
To the best of our knowledge, this work is the first to propose a framework where general-purpose neural networks and expressive probabilistic-logical modeling and reasoning are integrated in a way that exploits the full expressiveness and strengths of both worlds and can be trained end-to-end based on examples.\n\n1 Introduction\n\nThe integration of low-level perception with high-level reasoning is one of the oldest, and yet most current open challenges in the field of artificial intelligence. Today, low-level perception is typically handled by neural networks and deep learning, whereas high-level reasoning is typically addressed using logical and probabilistic representations and inference. While it is clear that there have been breakthroughs in deep learning, there has also been a lot of progress in the area of high-level reasoning. Indeed, today there exist approaches that tightly integrate logical and probabilistic reasoning with statistical learning; cf. the areas of statistical relational artificial intelligence [De Raedt et al., 2016, Getoor and Taskar, 2007] and probabilistic logic programming [De Raedt and Kimmig, 2015]. Recently, a number of researchers have revisited and modernized older ideas originating from the field of neural-symbolic integration [Garcez et al., 2012], searching for ways to combine the best of both worlds [Bo\u0161njak et al., 2017, Rockt\u00e4schel and Riedel, 2017, Cohen et al., 2018, Santoro et al., 2017], for example, by designing neural architectures representing differentiable counterparts of symbolic operations in classical reasoning tools. Yet, joining the full flexibility of high-level probabilistic reasoning with the representational power of deep neural networks is still an open problem. This paper tackles this challenge from a different perspective. 
Instead of integrating reasoning capabilities into a complex neural network architecture, we proceed the other way round. We start from an existing probabilistic logic programming language, ProbLog [De Raedt et al., 2007], and extend it with the capability to process neural predicates. The idea is simple: in a probabilistic logic, atomic expressions of the form q(t_1, ..., t_n) (aka tuples in a relational database) have a probability p. Consequently, the output of neural network components can be encapsulated in the form of \u201cneural\u201d predicates as long as the output of the neural network on an atomic expression can be interpreted as a probability. This simple idea is appealing as it allows us to retain all the essential components of the ProbLog language: the semantics, the inference mechanism, as well as the implementation. The main challenge is in training the model based on examples. The input data consists of feature vectors at the input of the neural network components (e.g., images) together with other probabilistic facts and clauses in the logic program, whereas targets are only given at the output side of the probabilistic reasoner. However, the algebraic extension of ProbLog (based on semirings) [Kimmig et al., 2011] already supports automatic differentiation. As a result, we can back-propagate the gradient from the loss at the output through the neural predicates into the neural networks, which allows training the whole model through gradient-descent based optimization. We call the new language DeepProbLog.\n\n\u2217 joint last authors\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nBefore going into further detail, the following example illustrates the possibilities of this approach (also see Section 6). Consider the predicate addition(X, Y, Z), where X and Y are images of digits and Z is the natural number corresponding to the sum of these digits. 
After training, DeepProbLog allows us to make a probabilistic estimate on the validity of, e.g., the example addition(\u27e8image\u27e9, \u27e8image\u27e9, 8). While such a predicate can be directly learned by a standard neural classifier, such a method would have a hard time taking into account background knowledge such as the definition of the addition of two natural numbers. In DeepProbLog such knowledge can easily be encoded in rules such as addition(IX, IY, NZ) :\u2212 digit(IX, NX), digit(IY, NY), NZ is NX + NY (where \u2018is\u2019 is the standard operator of logic programming to evaluate arithmetic expressions). All that needs to be learned in this case is the neural predicate digit which maps an image of a digit ID to the corresponding natural number ND. The learned network can then be reused for arbitrary tasks involving digits. Our experiments show that this leads not only to new capabilities but also to significant performance improvements. An important advantage of this approach compared to standard image classification settings is that it can be extended to multi-digit numbers without additional training. 
We note that the single digit classifier (i.e., the neural predicate) is not explicitly trained by itself: its output can be considered a latent representation, as we only use training data with pairwise sums of digits.\n\nTo summarize, we introduce DeepProbLog which has a unique set of features: (i) it is a programming language that supports neural networks and machine learning, and it has a well-defined semantics (as an extension of Prolog, it is Turing equivalent); (ii) it integrates logical reasoning with neural networks, and so supports both symbolic and subsymbolic representations and inference; (iii) it integrates probabilistic modeling, programming and reasoning with neural networks (as DeepProbLog extends the probabilistic programming language ProbLog, which can be regarded as a very expressive directed graphical modeling language [De Raedt et al., 2016]); (iv) it can be used to learn a wide range of probabilistic logical neural models from examples, including inductive programming. The code is available at https://bitbucket.org/problog/deepproblog.\n\n2 Logic programming concepts\n\nWe briefly summarize basic logic programming concepts. Atoms are expressions of the form q(t_1, ..., t_n) where q is a predicate (of arity n) and the t_i are terms. A literal is an atom or the negation \u00acq(t_1, ..., t_n) of an atom. A term t is either a constant c, a variable V, or a structured term of the form f(u_1, ..., u_k) where f is a functor. We will follow the Prolog convention and let constants start with a lower case character and variables with an upper case. A substitution \u03b8 = {V_1 = t_1, ..., V_n = t_n} is an assignment of terms t_i to variables V_i. When applying a substitution \u03b8 to an expression e we simultaneously replace all occurrences of V_i by t_i and denote the resulting expression as e\u03b8. Expressions that do not contain any variables are called ground. 
A rule is an expression of the form h :\u2212 b_1, ..., b_n where h is an atom and the b_i are literals. The meaning of such a rule is that h holds whenever the conjunction of the b_i holds. Thus :\u2212 represents logical implication (\u2190), and the comma (,) represents conjunction (\u2227). Rules with an empty body (n = 0) are called facts.\n\n3 Introducing DeepProbLog\n\nWe now recall the basics of probabilistic logic programming using ProbLog, illustrate it using the well-known burglary alarm example, and then introduce our new language DeepProbLog.\n\nA ProbLog program consists of (i) a set of ground probabilistic facts \u2131 of the form p :: f where p is a probability and f a ground atom and (ii) a set of rules R. For instance, the following ProbLog program models a variant of the well-known alarm Bayesian network:\n\n0.1 :: burglary.    0.5 :: hears_alarm(mary).\n0.2 :: earthquake.  0.4 :: hears_alarm(john).\nalarm :\u2212 earthquake.\nalarm :\u2212 burglary.\ncalls(X) :\u2212 alarm, hears_alarm(X).    (1)\n\nEach probabilistic fact corresponds to an independent Boolean random variable that is true with probability p and false with probability 1 \u2212 p. Every subset F \u2286 \u2131 defines a possible world w_F = F \u222a {f\u03b8 | R \u222a F \u22a8 f\u03b8 and f\u03b8 is ground}. So w_F contains F and all ground atoms that are logically entailed by F and the set of rules R, e.g.,\n\nw_{burglary, hears_alarm(mary)} = {burglary, hears_alarm(mary)} \u222a {alarm, calls(mary)}\n\nThe probability P(w_F) of such a possible world w_F is given by the product of the probabilities of the truth values of the probabilistic facts, P(w_F) = \u220f_{f_i \u2208 F} p_i \u00b7 \u220f_{f_i \u2208 \u2131\u2216F} (1 \u2212 p_i). For instance,\n\nP(w_{burglary, hears_alarm(mary)}) = 0.1 \u00d7 0.5 \u00d7 (1 \u2212 0.2) \u00d7 (1 \u2212 0.4) = 0.024\n\nThe probability of a ground fact q, also called success probability of q, is then defined as the sum of the probabilities of all worlds containing q, i.e., P(q) = \u2211_{F \u2286 \u2131 : q \u2208 w_F} P(w_F).\n\nOne convenient extension that is nothing else than syntactic sugar are the annotated disjunctions. An annotated disjunction (AD) is an expression of the form p_1 :: h_1; ...; p_n :: h_n :\u2212 b_1, ..., b_m. where the p_i are probabilities so that \u2211 p_i = 1, and h_i and b_j are atoms. The meaning of an AD is that whenever all b_i hold, h_j will be true with probability p_j, with all other h_i false (unless other parts of the program make them true). This is convenient to model choices between different categorical variables, e.g. different severities of the earthquake:\n\n0.4 :: earthquake(none); 0.4 :: earthquake(mild); 0.2 :: earthquake(severe).\n\nProbLog programs with annotated disjunctions can be transformed into equivalent ProbLog programs without annotated disjunctions (cf. De Raedt and Kimmig [2015]).\n\nA DeepProbLog program is a ProbLog program that is extended with (iii) a set of ground neural ADs (nADs) of the form\n\nnn(m_q, t\u20d7, u\u20d7) :: q(t\u20d7, u_1); ...; q(t\u20d7, u_n) :\u2212 b_1, ..., b_m\n\nwhere the b_i are atoms, t\u20d7 = t_1, ..., t_k is a vector of ground terms representing the inputs of the neural network for predicate q, and u_1 to u_n are the possible output values of the neural network. We use the notation nn for \u2018neural network\u2019 to indicate that this is a nAD. m_q is the identifier of a neural network model that specifies a probability distribution over its output values u\u20d7, given input t\u20d7. That is, from the perspective of the probabilistic logic program, an nAD realizes a regular AD p_1 :: q(t\u20d7, u_1); ...; p_n :: q(t\u20d7, u_n) :\u2212 b_1, ..., b_m, and DeepProbLog thus directly inherits its semantics, and to a large extent also its inference, from ProbLog. 
For instance, in the MNIST addition example, we would specify the nAD\n\nnn(m_digit, \u27e8image\u27e9, [0, ..., 9]) :: digit(\u27e8image\u27e9, 0); ...; digit(\u27e8image\u27e9, 9).\n\nwhere m_digit is a network that probabilistically classifies MNIST digits. The neural network could take any shape, e.g., a convolutional network for image encoding, a recurrent network for sequence encoding, etc. However, its output layer, which feeds the corresponding neural predicate, needs to be normalized. In neural networks for multiclass classification, this is typically done by applying a softmax layer to real-valued output scores, a choice we also adopt in our experiments.\n\n4 DeepProbLog Inference\n\nThis section explains how a DeepProbLog model is used for a given query at prediction time.\n\nProbLog Inference. As inference in DeepProbLog closely follows that in ProbLog, we now summarize ProbLog inference using the burglary example explained before. For full details, we refer to Fierens et al. [2015]. The program describing the example is given in Section 3, Equation (1). We can query the program for the probabilities of given query atoms, say, the single query calls(mary). ProbLog inference proceeds in four steps. (i) The first step grounds the logic program with respect to the query, that is, it generates all ground instances of clauses in the program the query depends on. The grounded program for query calls(mary) is shown in Figure 1a. (ii) The second step rewrites the ground logic program into a formula in propositional logic that defines the truth value of the query in terms of the truth values of probabilistic facts. In our example, the resulting formula is calls(mary) \u2194 hears_alarm(mary) \u2227 (burglary \u2228 earthquake). 
(iii) The third step compiles the logic formula into a Sentential Decision Diagram (SDD, Darwiche [2011]), a form that allows for efficient evaluation of the query, using knowledge compilation technology [Darwiche and Marquis, 2002]. The SDD for our example is shown in Figure 1b, where rounded grey rectangles depict variables corresponding to probabilistic facts, and the rounded red rectangle denotes the query atom defined by the formula. The white rectangles correspond to logical operators applied to their children. (iv) The fourth and final step evaluates the SDD bottom-up to calculate the success probability of the given query, starting with the probability labels of the leaves as given by the program and performing addition in every or-node and multiplication in every and-node. The intermediate results are shown next to the nodes in Figure 1b, ignoring the blue numbers for now.\n\n0.2 :: earthquake.\n0.1 :: burglary.\nalarm :\u2212 earthquake.\nalarm :\u2212 burglary.\n0.5 :: hears_alarm(mary).\ncalls(mary) :\u2212 alarm, hears_alarm(mary).\n\n(a) The ground program.\n\n(b) SDD for query calls(mary).\n\nFigure 1: Inference in ProbLog.\n\nDeepProbLog Inference. Inference in DeepProbLog works exactly as described above, except that a forward pass on the neural network components is performed every time we encounter a neural predicate during grounding. When this occurs, the required inputs (e.g., images) are fed into the neural network, after which the resulting scores of their softmax output layer are used as the probabilities of the ground AD.\n\n5 Learning in DeepProbLog\n\nWe now introduce our approach to jointly train the parameters of probabilistic facts and neural networks in DeepProbLog programs. 
We use the learning from entailment setting [De Raedt et al., 2016], that is, given a DeepProbLog program with parameters X, a set Q of pairs (q, p) with q a query and p its desired success probability, and a loss function L, compute:\n\narg min_{x\u20d7} (1/|Q|) \u2211_{(q,p) \u2208 Q} L(P_{X = x\u20d7}(q), p)\n\nIn contrast to the earlier approach for ProbLog parameter learning in this setting by Gutmann et al. [2008], we use gradient descent rather than EM, as this allows for seamless integration with neural network training. An overview of the approach is shown in Figure 2a. Given a DeepProbLog program, its neural network models, and a query used as training example, we first ground the program with respect to the query, getting the current parameters of nADs from the external models, then use the ProbLog machinery to compute the loss and its gradient, and finally use these to update the parameters in the neural networks and the probabilistic program.\n\n(a) The learning pipeline.\n\nflip(coin1). flip(coin2).\nnn(m_side, C, [heads, tails]) :: side(C, heads); side(C, tails).\nt(0.5) :: red; t(0.5) :: blue.\nheads :\u2212 flip(X), side(X, heads).\nwin :\u2212 heads.\nwin :\u2212 \\+heads, red.\nquery(win).\n\n(b) The DeepProbLog program.\n\n(c) SDD for query win.\n\nFigure 2: Parameter learning in DeepProbLog.\n\nMore specifically, to compute the gradient with respect to the probabilistic logic program part, we rely on Algebraic ProbLog (aProbLog, [Kimmig et al., 2011]), a generalization of the ProbLog language and inference to arbitrary commutative semirings, including the gradient semiring [Eisner, 2002]. 
In the following, we provide the necessary background on aProbLog, discuss how to use it to compute gradients with respect to ProbLog parameters and extend the approach to DeepProbLog.\n\naProbLog and the gradient semiring. ProbLog annotates each probabilistic fact f with the probability that f is true, which implicitly also defines the probability that f is false, and thus its negation \u00acf is true. It then uses the probability semiring with regular addition and multiplication as operators to compute the probability of a query on the SDD constructed for this query, cf. Figure 1b. This idea is generalized in aProbLog to compute such values based on arbitrary commutative semirings. Instead of probability labels on facts, aProbLog uses a labeling function that explicitly associates values from the chosen semiring with both facts and their negations, and combines these using semiring addition \u2295 and multiplication \u2297 on the SDD. We use the gradient semiring, whose elements are tuples (p, \u2202p/\u2202x), where p is a probability (as in ProbLog), and \u2202p/\u2202x is the partial derivative of that probability with respect to a parameter x, that is, the probability p_i of a probabilistic fact with learnable probability, written as t(p_i) :: f_i. This is easily extended to a vector of parameters x\u20d7 = [x_1, ..., x_N]^T, the concatenation of all N parameters in the ground program. Semiring addition \u2295, multiplication \u2297 and the neutral elements with respect to these operations are defined as follows:\n\n(a_1, a\u20d7_2) \u2295 (b_1, b\u20d7_2) = (a_1 + b_1, a\u20d7_2 + b\u20d7_2)    (2)\n(a_1, a\u20d7_2) \u2297 (b_1, b\u20d7_2) = (a_1 b_1, b_1 a\u20d7_2 + a_1 b\u20d7_2)    (3)\ne^\u2295 = (0, 0\u20d7)    (4)\ne^\u2297 = (1, 0\u20d7)    (5)\n\nNote that the first element of the tuple mimics ProbLog\u2019s probability computation, whereas the second simply computes gradients of these probabilities using derivative rules.\n\nGradient descent for ProbLog. To use the gradient semiring for gradient descent parameter learning in ProbLog, we first transform the ProbLog program into an aProbLog program by extending the label of each probabilistic fact p :: f to include the probability p as well as the gradient vector of p with respect to the probabilities of all probabilistic facts in the program, i.e.,\n\nL(f) = (p, 0\u20d7)    for p :: f with fixed p    (6)\nL(f_i) = (p_i, e_i)    for t(p_i) :: f_i with learnable p_i    (7)\nL(\u00acf) = (1 \u2212 p, \u2212\u2207p)    with L(f) = (p, \u2207p)    (8)\n\nwhere the vector e_i has a 1 in the i-th position and 0 in all others. For fixed probabilities, the gradient does not depend on any parameters and thus is 0. For the other cases, we use the semiring labels as introduced above. For instance, assume we want to learn the probabilities of earthquake and burglary in the example of Figure 1, while keeping those of the other facts fixed. Then, in Figure 1b, the nodes in the SDD now also contain the gradient (below, in blue). The result shows that the partial derivative of the success probability of the query is 0.45 and 0.4 w.r.t. the earthquake and burglary parameters respectively. To ensure that ADs are always well defined, i.e., the probabilities of the facts in the same AD sum to one, we re-normalize these after every gradient descent update.\n\nGradient descent for DeepProbLog. In contrast to probabilistic facts and ADs, whose parameters are updated based on the gradients computed by aProbLog, the probabilities of the neural predicates are a function of the neural network parameters. The neural predicates serve as an interface between the logic and the neural side, with both sides treating the other as a black box. The logic side can calculate the gradient of the loss w.r.t. the output of the neural network, but is unaware of the internal parameters. However, the gradient w.r.t. the output is sufficient to start backpropagation, which calculates the gradient for the internal parameters. Then, standard gradient-based optimizers (e.g. SGD, Adam, ...) are used to update the parameters of the network. During gradient computation with aProbLog, the probabilities of neural ADs are kept constant. 
Furthermore, updates on neural ADs come from the neural network part of the model, where the use of a softmax output layer ensures they always represent a normalized distribution, hence not requiring the additional normalization as for non-neural ADs. The labeling function for facts in nADs is\n\nL(f_j) = (m_q(t\u20d7)_j, e_j)    for nn(m_q, t\u20d7, u\u20d7) :: f_1; ...; f_k a nAD    (9)\n\nExample. We will demonstrate the learning pipeline (shown in Figure 2a) using the following game. We have two coins and an urn containing red and blue balls. We flip both coins and take a ball out of the urn. We win if the ball is red, or at least one coin comes up heads. However, we need to learn to recognize heads and tails using a neural network, while only observing examples of wins or losses instead of explicitly learning from coin examples labeled with the correct side. We also learn the distribution of the red and blue balls in the urn. We show the program in Figure 2b. There are 6 parameters in this program: the first four originate from the neural predicates (heads and tails for the first and second coin). The last two are the logic parameters that model the chance of picking out a red or blue ball. During grounding, the neural network classifies coin1 and coin2. According to the neural network, the first coin is most likely heads (p = 0.9), and the second one most likely tails (p = 0.2 for heads). Figure 2c shows the corresponding SDD with the AND/OR nodes replaced with the respective semiring operations. The top number is the probability, and the numbers below are the gradient. 
On the top node, we can see that we have a 0.96 probability of winning, but also that the gradient of this probability is 0.4 and 0.05 w.r.t. the probability of being heads for the first and second coin respectively, and 0.08 for the chance of the ball being red.\n\n6 Experimental Evaluation\n\nWe perform three sets of experiments to demonstrate that DeepProbLog supports (i) symbolic and subsymbolic reasoning and learning, that is, both logical reasoning and deep learning; (ii) program induction; and (iii) both probabilistic logic programming and deep learning. We provide implementation details at the end of this section and list all programs in Appendix A.\n\nLogical reasoning and deep learning. To show that DeepProbLog supports both logical reasoning and deep learning, we extend the classic learning task on the MNIST dataset (Lecun et al. [1998]) to two more complex problems that require reasoning:\n\nT1: addition(\u27e8image\u27e9, \u27e8image\u27e9, 8): Instead of using labeled single digits, we train on pairs of images, labeled with the sum of the individual labels. The DeepProbLog program consists of the clause addition(X,Y,Z) :- digit(X,X2), digit(Y,Y2), Z is X2+Y2. and a neural AD for the digit/2 predicate (shorthand notation for the predicate digit of arity 2), which classifies an MNIST image. We compare to a CNN baseline classifying the concatenation of the two images into the 19 possible sums.\n\nT2: addition([\u27e8image\u27e9, \u27e8image\u27e9], [\u27e8image\u27e9, \u27e8image\u27e9], 63): the input consists of two lists of images, each element being a digit. This task demonstrates that DeepProbLog generalizes well beyond training data. Learning the new predicate requires only a small change in the logic program. We train the model on single digit numbers, and evaluate on three digit numbers.\n\nThe learning curves of both models on T1 (Figure 3a) show the benefit of combined symbolic and subsymbolic reasoning: the DeepProbLog model uses the encoded knowledge to reach a higher F1 score than the CNN, and does so after a few thousand iterations, while the CNN converges much more slowly. We also tested an alternative for the neural network baseline. It evaluates convolutional layers with shared parameters on each image separately, instead of a single set of convolutional layers on the concatenation of both images. It converges more quickly and achieves a higher final accuracy than the other baseline, but is still slower and less accurate than the DeepProbLog model. Figure 3b shows the learning curve for T2. DeepProbLog achieves a somewhat lower accuracy compared to the single digit problem due to the compounding effect of the error, but the model generalizes well. The CNN does not generalize to this variable-length problem setting.\n\nProgram Induction. The second set of problems demonstrates that DeepProbLog can perform program induction. We follow the program sketch setting of differentiable Forth [Bo\u0161njak et al., 2017], where holes in given programs need to be filled by neural networks trained on input-output examples for the entire program. As in their work, we consider three tasks: addition, sorting [Reed and de Freitas, 2016] and word algebra problems (WAPs) [Roy and Roth, 2015].\n\nT3: forth_addition/4: where the input consists of two numbers and a carry, with the output being the sum of the numbers and the new carry. The program specifies the basic addition algorithm in which we go from right to left over all digits, calculating the sum of two digits and taking the carry over to the next pair. 
The hole in this program corresponds to calculating the resulting digit (result/4) and carry (carry/4), given two digits and the previous carry.\n\nT4: sort/2: The input consists of a list of numbers, and the output is the sorted list. The program implements bubble sort, but leaves open what to do on each step in a bubble (i.e. whether to swap or not, swap/2).\n\nT5: wap/2: The input to the WAPs consists of a natural language sentence describing a simple mathematical problem, and the output is the solution to this question. These WAPs always contain three numbers and are solved by chaining 4 steps: permuting the three numbers (permute/2), applying an operation on the first two numbers (addition, subtraction or product, operation_1/2), potentially swapping the intermediate result and the last digit (swap/2), and performing a last operation (operation_2/2). The hole in the program is in deciding which of the alternatives should happen on each step.\n\nDeepProbLog achieves 100% on the Forth addition (T3) and sorting (T4) problems (Table 1a). The sorting problem yields a more interesting comparison: differentiable Forth achieves a 100% accuracy with a training length of 2 and 3, but performs poorly on a training length of 4; DeepProbLog generalizes well to larger lengths. As shown in Table 1b, DeepProbLog runs faster and scales better with increasing training length, while differentiable Forth has issues due to computational complexity with larger lengths, as mentioned in their paper.\n\nOn the WAPs (T5), DeepProbLog reaches an accuracy between 96% and 97%, similar to Bo\u0161njak et al. [2017] (96%).\n\nProbabilistic programming and deep learning. The coin-ball problem is a standard example in the probabilistic programming community [De Raedt and Kimmig, 2014]. It describes a game in which we have a potentially biased coin and two urns. 
The first urn contains a mixture of red and blue balls, and the second urn a mixture of red, blue and green balls. To play the game, we toss the coin and take a ball out of each urn. We win if both balls have the same colour, or if the coin came up heads and we have at least one red ball. We want to learn the bias of the coin (the probability of heads), and the ratio of the coloured balls in each urn. We simultaneously train one neural network to classify an image of the coin as being heads or tails (coin/2), and a neural network to classify the colour of the ball as being either red, blue or green (colour/4). These are given as RGB triples. Task T6 is thus to learn the game/4 predicate, requiring a combination of subsymbolic reasoning, learning and probabilistic reasoning. The input consists of an image and two RGB pairs, and the output is the outcome of the game. The coin-ball problem uses a very simple neural network component. Training on a set of 256 instances converges after 5 epochs, leading to 100% accuracy on the test set (64 instances). 
At this point, both networks correctly classify the colours and the coins, and the probabilistic parameters reflect the distributions in the training set.\n\n(a) Single digit (T1)\n\n(b) Multi-digit (T2)\n\nFigure 3: MNIST Addition problems: displaying training loss (red for CNN, orange for DeepProbLog) and F1 score on the test set (green for CNN, blue for DeepProbLog).\n\nSorting (T4), by training length:\nModel | Test length | 2 | 3 | 4 | 5 | 6\n\u22024 [Bo\u0161njak et al., 2017] | 8 | 100.0 | 100.0 | 49.22 | \u2013 | \u2013\n\u22024 [Bo\u0161njak et al., 2017] | 64 | 100.0 | 100.0 | 20.65 | \u2013 | \u2013\nDeepProbLog | 8 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0\nDeepProbLog | 64 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0\n\nAddition (T3), by training length:\nModel | Test length | 2 | 4 | 8\n\u22024 [Bo\u0161njak et al., 2017] | 8 | 100.0 | 100.0 | 100.0\n\u22024 [Bo\u0161njak et al., 2017] | 64 | 100.0 | 100.0 | 100.0\nDeepProbLog | 8 | 100.0 | 100.0 | 100.0\nDeepProbLog | 64 | 100.0 | 100.0 | 100.0\n\n(a) Accuracy on the addition (T3) and sorting (T4) problems (results for \u22024 reported by Bo\u0161njak et al. [2017]).\n\nTraining length | 2 | 3 | 4 | 5 | 6\n\u22024 on GPU | 42 s | 160 s | \u2013 | \u2013 | \u2013\n\u22024 on CPU | 61 s | 390 s | \u2013 | \u2013 | \u2013\nDeepProbLog on CPU | 11 s | 14 s | 32 s | 114 s | 245 s\n\n(b) Time until 100% accurate on test length 8 for the sorting (T4) problem.\n\nTable 1: Results on the Differentiable Forth experiments\n\nImplementation details. In all experiments we optimize the cross-entropy loss between the predicted and desired query probabilities. The network used to classify MNIST images is a basic architecture based on the PyTorch tutorial. It consists of 2 convolutional layers with kernel size 5, and respectively 6 and 16 filters, each followed by a maxpool layer of size 2, stride 2. After this come 3 fully connected layers of sizes 120, 84 and 10 (19 for the CNN baseline). It has a total of 44k parameters. The last layer is followed by a softmax layer, all others are followed by a ReLU layer. The colour network consists of a single fully connected layer of size 3. 
For all experiments we use Adam [Kingma and Ba, 2015] optimization for the neural networks, and SGD for the logic parameters. The learning rate is 0.001 for the MNIST network, and 1 for the colour network. For robustness in optimization, we use a warm-up of the learning rate of the logic parameters for the coin-ball experiments, starting at 0.0001 and raising it linearly to 0.01 over four epochs. For the Forth experiments, the architecture of the neural networks and other hyper-parameters are as described in Bo\u0161njak et al. [2017]. For the coin-ball experiment, we generate the RGB triples by adding Gaussian noise (\u03c3 = 0.03) to the base colours in the HSV domain. The coins are MNIST images, where we use even numbers as heads and odd numbers as tails. For the implementation we integrated ProbLog2 [Dries et al., 2015] with PyTorch [Paszke et al., 2017]. We do not perform actual mini-batching, but instead use gradient accumulation. All programs are listed in the appendix.\n\n7 Related Work\n\nMost of the work on combining neural networks and logical reasoning comes from the neuro-symbolic reasoning literature [Garcez et al., 2012, Hammer and Hitzler, 2007]. These approaches typically focus on approximating logical reasoning with neural networks by encoding logical terms in Euclidean space. However, they support neither probabilistic reasoning nor perception, and are often limited to non-recursive and acyclic logic programs [H\u00f6lldobler et al., 1999]. 
DeepProbLog takes a different approach and integrates neural networks into a probabilistic logic framework, retaining the full power of logical reasoning, probabilistic reasoning and deep learning.\n\n[Figure 3 plots: x-axis Iterations; y-axes Loss and Accuracy; curves for DeepProbLog and CNN.]\n\nThe most prominent recent line of related work focuses on developing differentiable frameworks for logical reasoning. Rockt\u00e4schel and Riedel [2017] introduce a differentiable framework for theorem proving. They re-implemented Prolog\u2019s theorem proving procedure in a differentiable manner and enhanced it with learning subsymbolic representations of the existing symbols, which are used to handle noise in data. Whereas Rockt\u00e4schel and Riedel [2017] use logic only to construct a neural network and focus on learning subsymbolic representations, DeepProbLog focuses on tight interactions between the two and parameter learning for both the neural and the logic components. In this way, DeepProbLog retains the best of both worlds. While the approach of Rockt\u00e4schel and Riedel could in principle be applied to tasks T1 and T5, the other tasks seem to be out of scope. Cohen et al. [2018] introduce a framework to compile a tractable subset of logic programs into differentiable functions and to execute them with neural networks. It provides an alternative probabilistic logic but it has a different and less developed semantics. Furthermore, to the best of our knowledge it has not been applied to the kind of tasks tackled in the present paper. The approach most similar to ours is that of Bo\u0161njak et al. [2017], where neural networks are used to fill in holes in a partially defined Forth program. 
DeepProbLog differs in that it uses ProbLog as the host language, which results in native support for both logical and probabilistic reasoning, something that has to be manually implemented in differentiable Forth. Differentiable Forth has been applied to tasks T3 to T5, but it is unclear whether it could be applied to the remaining ones. Finally, Evans and Grefenstette [2018] introduce a differentiable framework for rule induction that, unlike DeepProbLog, does not focus on the integration of the two approaches.\n\nA different line of work centers around including background knowledge as a regularizer during training. Diligenti et al. [2017] and Donadello et al. [2017] use FOL to specify constraints on the output of the neural network. They use fuzzy logic to create a differentiable way of measuring how much the output of the neural networks violates these constraints. This is then added as an additional loss term that acts as a regularizer. More recent work by Xu et al. [2018] introduces a similar method that uses probabilistic logic instead of fuzzy logic, and is thus more similar to DeepProbLog. They also compile the formulas to an SDD for efficiency. However, whereas DeepProbLog can be used to specify probabilistic logic programs, these methods allow FOL constraints to be specified instead.\n\nDai et al. [2018] show a different way to combine perception with reasoning. Just as in DeepProbLog, they combine domain knowledge specified as purely logical Prolog rules with the output of neural networks. The main difference is that DeepProbLog deals with the uncertainty of the neural network\u2019s output with probabilistic reasoning, while Dai et al. do this by revising the hypothesis, iteratively replacing the output of the neural network with anonymous variables until a consistent hypothesis can be formed.\n\nAn idea similar in spirit to ours is that of Andreas et al. 
[2016], who introduce a neural network for visual question answering composed out of smaller modules responsible for individual tasks, such as object detection. Whereas in their approach the composition of modules is determined by the linguistic structure of the questions, DeepProbLog uses logic programs to connect the neural networks. These successes have inspired a number of works developing (probabilistic) logic formulations of basic deep learning primitives [\u0160ourek et al., 2018, Duman\u010di\u0107 and Blockeel, 2017, Kazemi and Poole, 2018].\n\n8 Conclusion\n\nWe introduced DeepProbLog, a framework where neural networks and probabilistic logic programming are integrated in a way that exploits the full expressiveness and strengths of both worlds and can be trained end-to-end based on examples. This was accomplished by extending an existing probabilistic logic programming language, ProbLog, with neural predicates. Learning is performed by using aProbLog to calculate the gradient of the loss, which is then used in standard gradient-descent based methods to optimize parameters in both the probabilistic logic program and the neural networks. We evaluated our framework on experiments that demonstrate its capabilities in combined symbolic and subsymbolic reasoning, program induction, and probabilistic logic programming.\n\nAcknowledgements\n\nRM is an SB PhD fellow at FWO (1S61718N). SD is supported by the Research Fund KU Leuven (GOA/13/010) and Research Foundation - Flanders (G079416N).\n\nReferences\n\nJacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 39\u201348, 2016.\n\nMatko Bo\u0161njak, Tim Rockt\u00e4schel, and Sebastian Riedel. Programming with a differentiable Forth interpreter. 
In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 547\u2013556, 2017.\n\nWilliam W Cohen, Fan Yang, and Kathryn Rivard Mazaitis. TensorLog: Deep learning meets probabilistic databases. Journal of Artificial Intelligence Research, 1:1\u201315, 2018.\n\nWang-Zhou Dai, Qiu-Ling Xu, Yang Yu, and Zhi-Hua Zhou. Tunneling neural perception and logic reasoning through abductive learning. arXiv preprint arXiv:1802.01173, 2018.\n\nAdnan Darwiche. SDD: A new canonical representation of propositional knowledge bases. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, IJCAI-11, pages 819\u2013826, 2011.\n\nAdnan Darwiche and Pierre Marquis. A knowledge compilation map. Journal of Artificial Intelligence Research, 17:229\u2013264, 2002.\n\nLuc De Raedt and Angelika Kimmig. Probabilistic programming. ECAI tutorial, 2014.\n\nLuc De Raedt and Angelika Kimmig. Probabilistic (logic) programming concepts. Machine Learning, 100(1):5\u201347, 2015.\n\nLuc De Raedt, Angelika Kimmig, and Hannu Toivonen. ProbLog: A probabilistic Prolog and its application in link discovery. In IJCAI, pages 2462\u20132467, 2007.\n\nLuc De Raedt, Kristian Kersting, Sriraam Natarajan, and David Poole. Statistical relational artificial intelligence: Logic, probability, and computation. Synthesis Lectures on Artificial Intelligence and Machine Learning, 10(2):1\u2013189, 2016.\n\nMichelangelo Diligenti, Marco Gori, and Claudio Sacca. Semantic-based regularization for learning and inference. Artificial Intelligence, 244:143\u2013165, 2017.\n\nIvan Donadello, Luciano Serafini, and Artur S. d\u2019Avila Garcez. Logic tensor networks for semantic image interpretation. 
In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pages 1596\u20131602, 2017.\n\nAnton Dries, Angelika Kimmig, Wannes Meert, Joris Renkens, Guy Van den Broeck, Jonas Vlasselaer, and Luc De Raedt. ProbLog2: Probabilistic logic programming. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 312\u2013315. Springer, 2015.\n\nSebastijan Duman\u010di\u0107 and Hendrik Blockeel. Clustering-based relational unsupervised representation learning with an explicit distributed representation. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 1631\u20131637, 2017.\n\nJason Eisner. Parameter estimation for probabilistic finite-state transducers. In Proceedings of the 40th annual meeting on Association for Computational Linguistics, pages 1\u20138. Association for Computational Linguistics, 2002.\n\nRichard Evans and Edward Grefenstette. Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research, 61:1\u201364, 2018.\n\nDaan Fierens, Guy Van den Broeck, Joris Renkens, Dimitar Shterionov, Bernd Gutmann, Ingo Thon, Gerda Janssens, and Luc De Raedt. Inference and learning in probabilistic logic programs using weighted Boolean formulas. Theory and Practice of Logic Programming, 15(3):358\u2013401, 2015.\n\nArtur S d\u2019Avila Garcez, Krysia B Broda, and Dov M Gabbay. Neural-symbolic learning systems: foundations and applications. Springer Science & Business Media, 2012.\n\nLise Getoor and Ben Taskar. Introduction to statistical relational learning. MIT press, 2007.\n\nBernd Gutmann, Angelika Kimmig, Kristian Kersting, and Luc De Raedt. 
Parameter learning in probabilistic databases: A least squares approach. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 473\u2013488. Springer, 2008.\n\nBarbara Hammer and Pascal Hitzler. Perspectives of neural-symbolic integration, volume 8. Springer, Heidelberg, 2007.\n\nSteffen H\u00f6lldobler, Yvonne Kalinke, and Hans-Peter St\u00f6rr. Approximating the semantics of logic programs by recurrent neural networks. Applied Intelligence, 11(1):45\u201358, Jul 1999.\n\nSeyed Mehran Kazemi and David Poole. RelNN: A deep neural model for relational learning. In AAAI, 2018.\n\nAngelika Kimmig, Guy Van den Broeck, and Luc De Raedt. An algebraic Prolog for reasoning about possible worlds. In AAAI, 2011.\n\nDiederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), pages 1\u201313, 2015.\n\nYann LeCun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278\u20132324, 1998.\n\nAdam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In Proceedings of the Workshop on The future of gradient-based machine learning software and techniques, co-located with the 31st Annual Conference on Neural Information Processing Systems (NIPS 2017), 2017.\n\nScott Reed and Nando de Freitas. Neural programmer-interpreters. In International Conference on Learning Representations (ICLR), 2016.\n\nTim Rockt\u00e4schel and Sebastian Riedel. End-to-end differentiable proving. In Advances in Neural Information Processing Systems, volume 30, pages 3788\u20133800, 2017.\n\nSubhro Roy and Dan Roth. Solving general arithmetic word problems. 
In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1743\u20131752, 2015.\n\nAdam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Tim Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, volume 30, pages 4974\u20134983, 2017.\n\nJingyi Xu, Zilu Zhang, Tal Friedman, Yitao Liang, and Guy Van den Broeck. A semantic loss function for deep learning with symbolic knowledge. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsm\u00e4ssan, Stockholm, Sweden, July 10-15, 2018, pages 5498\u20135507, 2018.\n\nGustav \u0160ourek, Vojt\u011bch Aschenbrenner, Filip \u017delezn\u00fd, Steven Schockaert, and Ond\u0159ej Ku\u017eelka. Lifted relational neural networks: Efficient learning of latent relational structures. Journal of Artificial Intelligence Research, to appear, 2018.\n", "award": [], "sourceid": 1878, "authors": [{"given_name": "Robin", "family_name": "Manhaeve", "institution": "KU Leuven"}, {"given_name": "Sebastijan", "family_name": "Dumancic", "institution": "KU LEUVEN"}, {"given_name": "Angelika", "family_name": "Kimmig", "institution": "Cardiff University"}, {"given_name": "Thomas", "family_name": "Demeester", "institution": "Ghent University"}, {"given_name": "Luc", "family_name": "De Raedt", "institution": "KU Leuven"}]}