{"title": "Tractable Learning for Complex Probability Queries", "book": "Advances in Neural Information Processing Systems", "page_first": 2242, "page_last": 2250, "abstract": "Tractable learning aims to learn probabilistic models where inference is guaranteed to be efficient. However, the particular class of queries that is tractable depends on the model and underlying representation.  Usually this class is MPE or conditional probabilities $\\Pr(\\xs|\\ys)$ for joint assignments~$\\xs,\\ys$. We propose a tractable learner that guarantees efficient inference for a broader class of queries. It simultaneously learns a Markov network and its tractable circuit representation, in order to guarantee and measure tractability. Our approach differs from earlier work by using Sentential Decision Diagrams (SDD) as the tractable language instead of Arithmetic Circuits (AC). SDDs have desirable properties, which more general representations such as ACs lack, that enable basic primitives for Boolean circuit compilation. This allows us to support a broader class of complex probability queries, including counting, threshold, and parity, in polytime.", "full_text": "Tractable Learning for Complex Probability Queries\n\nJessa Bekker, Jesse Davis\n\nKU Leuven, Belgium\n\n{jessa.bekker,jesse.davis}@cs.kuleuven.be\n\nArthur Choi, Adnan Darwiche, Guy Van den Broeck\n\nUniversity of California, Los Angeles\n\n{aychoi,darwiche,guyvdb}@cs.ucla.edu\n\nAbstract\n\nTractable learning aims to learn probabilistic models where inference is guaran-\nteed to be ef\ufb01cient. However, the particular class of queries that is tractable de-\npends on the model and underlying representation. Usually this class is MPE\nor conditional probabilities Pr(x|y) for joint assignments x, y. 
We propose a\ntractable learner that guarantees ef\ufb01cient inference for a broader class of queries.\nIt simultaneously learns a Markov network and its tractable circuit representation,\nin order to guarantee and measure tractability. Our approach differs from earlier\nwork by using Sentential Decision Diagrams (SDD) as the tractable language in-\nstead of Arithmetic Circuits (AC). SDDs have desirable properties, which more\ngeneral representations such as ACs lack, that enable basic primitives for Boolean\ncircuit compilation. This allows us to support a broader class of complex proba-\nbility queries, including counting, threshold, and parity, in polytime.\n\n1\n\nIntroduction\n\nTractable learning [1] is a promising new machine learning paradigm that focuses on learning prob-\nability distributions that support ef\ufb01cient querying. It is motivated by the observation that while\nclassical algorithms for learning Bayesian and Markov networks excel at \ufb01tting data, they ignore the\ncost of reasoning with the learned model. However, many applications, such as health-monitoring\nsystems, require ef\ufb01cient and (guaranteed) accurate reasoning capabilities. Hence, new learning\ntechniques are needed to support applications with these requirements.\nInitially, tractable learning focused on the \ufb01rst model class recognized to be tractable: low-treewidth\ngraphical models [2\u20135]. Recent advances in probabilistic inference exploit other properties of a\nmodel, including local structure [6] and exchangeability [7], which even scale to models that have\nhigh treewidth. In particular, the discovery of local structure led to arithmetic circuits (ACs) [8],\nwhich are a much more powerful representation of tractable probability distributions. In turn, this\nled to new tractable learners that targeted ACs to guarantee ef\ufb01cient inference [9, 10]. In this con-\ntext, ACs with latent variables are sometimes called sum-product networks (SPNs) [11, 12]. 
Other\ntractable learners target exchangeable models [13, 14] or determinantal point processes [15].\nThere is a trade-off in tractable learning that is poorly understood and often ignored: tractability\nis not absolute, and is always relative to a class of queries that the user is interested in. Existing\napproaches define tractability as the ability to efficiently compute most-probable explanations (MPE)\nor conditional probabilities Pr(x|y) where x, y are joint assignments to subsets of random variables.\nWhile these queries are indeed efficient on ACs, many other queries of interest are not. For example,\ncomputing partial MAP remains NP-hard on low-treewidth and AC models [16]. Similarly, various\ndecision [17, 18], monotonicity [19], and utility [20] queries remain (co-)NP-hard (see footnote 1). Perhaps the\nsimplest query beyond the reach of tractable AC learners is for probabilities Pr(\u03c6|\u03c8), where \u03c6, \u03c8\nare complex properties, such as counts, thresholds, comparison, and parity of sets of random\nvariables. These properties naturally appear throughout the machine learning literature, for example, in\nneural nets [21], and in exchangeable [13] and statistical relational models [22]. We believe they\nhave not been used to their full potential in the world of graphical models due to their intractability.\nWe call these types of queries complex probability queries.\nThis paper pushes the boundaries of tractable learning by supporting more queries efficiently. While\nwe currently lack any representation tractable for partial MAP, we do have all the machinery\navailable to learn tractable models for complex probability queries. Their tractability is enabled by the\nweighted model counting (WMC) [6] encoding of graphical models and recent advances in\ncompilation of Boolean functions into Sentential Decision Diagrams (SDDs) [23]. 
SDDs can be seen as\na syntactic subset of ACs with more desirable properties, including the ability to (1) incrementally\ncompile a Markov network, via a conjoin operator, (2) dynamically minimize the size and\ncomplexity of the representation, and (3) efficiently perform complex probability queries.\nOur first contribution is a tractable learning algorithm for Markov networks with compact SDDs,\nfollowing the outer loop of the successful ACMN learner [9] for ACs, that uses SDD primitives to\nmodify the circuit during the Markov network structure search. Support for the complex queries\nlisted above also means that these properties can be employed as features in the learned network.\nSecond, we prove that complex symmetric probability queries over n variables, as well as their\nextensions, run in time polynomial in n and linear in the size of the learned SDD. Tighter complexity\nbounds are obtained for specific classes of queries. Finally, we illustrate these tractability properties\nin an empirical evaluation on four real-world data sets and four types of complex queries.\n\n2 Background\n\n2.1 Markov Networks\n\nA Markov network or Markov random field compactly represents the joint distribution over a set of\nvariables X = (X1, X2, . . . , Xn) [24]. Markov networks are often represented as log-linear models,\nthat is, an exponentiated weighted sum of features of the state x of variables X:\n$\\Pr(X = x) = \\frac{1}{Z} \\exp\\big(\\sum_j w_j f_j(x)\\big)$. The fj(X) are real-valued functions of the state, wj is the weight associated\nwith fj, and Z is the partition function. For discrete models, features are often Boolean functions;\ntypically a conjunction of tests of the form (Xi = xi) \u2227 \u00b7\u00b7\u00b7 \u2227 (Xj = xj). One is interested in\nperforming certain inference tasks, such as computing the posterior marginals or most-likely state\n(MPE) given observations. 
In general, such tasks are intractable (#P- and NP-hard).\nLearning Markov networks from data requires estimating the weights of the features (parameter\nlearning), and the features themselves (structure learning). We can learn the parameters by\noptimizing some convex objective function, which is typically the log-likelihood. Evaluation of this\nfunction and its gradient is in general intractable (#P-complete). Therefore, it is common to\noptimize an approximate objective, such as the pseudo-log-likelihood. The classical structure learning\napproach [24] is a greedy, top-down search. It starts with features over individual variables, and\ngreedily searches for new features to add to the model from a set of candidate features, found by\nconjoining pairs of existing features. Other approaches convert local models into a global one [25].\nTo prevent overfitting, one puts a penalty on the complexity of the model (e.g., number of features).\n\n2.2 Tractable Circuit Representations and Tractable Learning\n\nTractable circuit representations overcome the intractability of inference in Markov networks.\nAlthough we are not always guaranteed to find a compact tractable representation for every Markov\nnetwork, in this paper we will guarantee their existence for the learned models.\nAC Arithmetic Circuits (ACs) [8] are directed acyclic graphs whose leaves are inputs representing\neither indicator variables (to assign values to random variables), parameters (weights wj) or constants.\nFigure 1c shows an example. ACs encode the partition function computation of a Markov network.\n\n1 The literature typically shows hardness for polytrees. 
Results carry over because these have compact ACs.\n\nFigure 1: A Markov network over variables A, B, and its tractable SDD and AC representations. Panels: (a) Markov Network, (b) Sentential Decision Diagram, (c) Arithmetic Circuit.\n\nBy setting indicators to 1 and evaluating the AC bottom-up, the value of the partition function, Z, is\nobtained at the root. Other settings of the indicators encode arbitrary evidence. Moreover, a second,\ntop-down pass yields all single-variable marginal probabilities; similar procedures exist for MPE.\nAll these algorithms run in time linear in the size of the AC (number of edges). The tractable\nlearning paradigm for Markov networks is best exemplified by ACMN [9], which concurrently learns a\nMarkov network and its AC. It employs a complexity penalty based on the inference cost. Moreover,\nACMN efficiently computes the exact log-likelihood (as opposed to pseudo-log-likelihood) and its\ngradient on the AC. ACMN uses the standard greedy top-down feature search outlined above.\nSDD Sentential Decision Diagrams (SDDs) are a tractable representation of sentences in\npropositional logic [23]. The supplement (footnote 2) reviews SDDs in detail; a brief summary is next. SDDs are\ndirected acyclic graphs, as depicted in Figure 1b. A circle node represents the disjunction of its\nchildren. A pair of boxes denotes the conjunction of the two boxes, and each box can be a (negated)\nBoolean variable or a reference to another SDD node. The detailed properties of SDDs yield two\nbenefits. First, SDDs support an efficient conjoin operator that can incrementally construct new\nSDDs from smaller SDDs in linear time. Second, SDDs support dynamic minimization, which\nallows us to control the growth of an SDD during incremental construction.\nThere is a close connection between SDDs for logic and ACs for graphical models, through an\nintermediate weighted model counting formulation [6], which is reviewed in the supplement. 
Given\na graphical model M, one can construct a logical sentence \u2206 whose satisfying assignments are in\none-to-one correspondence with the possible worlds of M. Moreover, each satisfying assignment of\n\u2206 encodes the weights wj that apply to its possible world in M. For each feature fj of M, this \u2206\nincludes a constraint fj \u21d4 Pj, meaning that weight wj applies when \u201cparameter\u201d variable Pj is true;\nsee Figure 1a. A consequence of this correspondence is that, given an SDD for \u2206, we can ef\ufb01ciently\nconstruct an AC for the original Markov network M; see Figure 1. Hence, an SDD corresponding\nto M is a tractable representation of M. Different from ACs, SDDs have the following properties:\nsupport for ef\ufb01cient (linear) conjunction allows us to add new features fj and incrementally learn a\nMarkov network. Moreover, dynamic minimization lets us systematically search for more compact\ncircuits for the same Markov network, mitigating the increasing complexity of inference as we learn\nmore features. Such operations are not available for ACs in general.\n\n3 Learning Algorithm\n\nWe propose LearnSDD, which employs a greedy, general-to-speci\ufb01c search that simultaneously\nlearns a Markov network and its underlying SDD which is used for inference. The cost of inference\nin the learned model is dictated by the size of its SDD. 
Conceptually, our approach is similar to\nACMN [9] with the key differences being our use of SDDs instead of ACs, which gives us more\ntractability and freedom in the types of features that are considered.\n\n2 https://dtai.cs.kuleuven.be/software/learnsdd\n\n[Figure 1a table: weight w1, feature A \u2227 B, parameter variable P1; weight w2, feature \u00acA \u2227 \u00acB, parameter variable P2.]\n\nAlgorithm 1 LearnSDD(T, e, \u03b1)\n  initialize model M with variables as features\n  Mbest \u2190 M\n  while number of edges |SDD M| < e and not timeout\n    best_score = \u2212\u221e\n    F \u2190 generateFeatures(M, T)\n    for each feature f in F do\n      M' \u2190 M.add(f)\n      if score(M', T, \u03b1) > best_score\n        best_score = score(M', T, \u03b1)\n        Mbest \u2190 M'\n    M \u2190 Mbest\n\nLearnSDD, outlined in Algorithm 1, receives as input a training set T, a maximum number of edges\ne, and a parameter \u03b1 to control the relative importance of fitting the data vs. the cost of inference. As\nis typical with top-down approaches to structure learning [24], the initial model has one feature Xi =\ntrue for each variable, which corresponds to a fully-factorized Markov network. Next, LearnSDD\niteratively constructs a set of candidate features, where each feature is a logical formula. It scores\neach feature by compiling it into an SDD, conjoining the feature to the current model temporarily,\nand then evaluating the score of the model that includes the feature. The supplement shows how\na feature is added to an SDD. In each iteration, the highest scoring feature is selected and added\nto the model permanently. The process terminates when the maximum number of edges is reached\nor when it runs out of time. Inference time is dictated by the size of the learned SDD. 
To control\nthis cost, we invoke dynamic SDD minimization each time a feature is evaluated, and when we\npermanently add a feature to the model.\nPerforming structure learning with SDDs offers advantages over ACs. First, SDDs support a\npractical conjoin operation, which greatly simplifies the design of a top-down structure learning algorithm\n(ACMN instead relies on a complex special-purpose AC modification algorithm). Second, SDDs\nsupport dynamic minimization, allowing us to search for smaller SDDs, as needed. The following\ntwo sections discuss the score function and feature generation in greater detail.\n\n3.1 Score Function and Weight Learning\n\nScore functions capture a trade-off between fitting the training data and the preference for simpler\nmodels, captured by a regularization term. In tractable learning, the regularization term reflects the\ncost of inference in the model. Therefore, we use the following score function:\n\n$$\\mathrm{score}(M', T) = \\left[\\log \\Pr(T \\mid M') - \\log \\Pr(T \\mid M)\\right] - \\alpha \\, \\frac{|\\mathrm{SDD}\\,M'| - |\\mathrm{SDD}\\,M|}{|\\mathrm{SDD}\\,M|} \\qquad (1)$$\n\nwhere T is the training data, M' is the model extended with feature f, M is the old model, |SDD .|\nreturns the number of edges in the SDD representation, and \u03b1 is a user-defined parameter. The first\nterm is the improvement in the model\u2019s log-likelihood due to incorporating f. The second term\nmeasures the relative growth of the SDD representation after incorporating f. We use the relative\ngrowth because adding a feature to a larger model adds many more edges than adding a feature to\na smaller model. Section 4 shows that any query\u2019s inference complexity depends on the SDD size.\nFinally, \u03b1 lets us control the trade-off between fitting the data and the cost of inference.\nScoring a model requires learning the weights associated with each feature. 
Because we use SDDs,\nwe can efficiently compute the exact log-likelihood and its gradient using only two passes over the\nSDD. Therefore, we learn maximum-likelihood estimates of the weights.\n\n3.2 Generating Features\n\nIn each iteration, LearnSDD constructs a set of candidate features using two different feature\ngenerators: conjunctive and mutex. The conjunctive generator considers each pair of features f1, f2 in\nthe model and proposes four new candidates per pair: f1 \u2227 f2, \u00acf1 \u2227 f2, f1 \u2227 \u00acf2 and \u00acf1 \u2227 \u00acf2.\nThe mutex generator automatically identifies mutually exclusive sets of variables in the data and\nproposes a feature to capture this relationship. Mutual exclusivity arises naturally in data. It\noccurs in tractable learning because existing approaches typically assume Boolean data. Hence,\nany multi-valued attribute is converted into multiple binary variables. For all variable sets X =\n{X1, X2, \u00b7\u00b7\u00b7 , Xn} that have exactly one \u201ctrue\u201d value in each training example, the exactly one\nfeature $\\bigvee_{i=1}^{n} (X_i \\wedge \\bigwedge_{j \\neq i} \\neg X_j)$ is added to the candidate set. When at most one variable is \u201ctrue\u201d, the\nmutual exclusivity feature $\\bigvee_{i=1}^{n} (X_i \\wedge \\bigwedge_{j \\neq i} \\neg X_j) \\vee \\bigwedge_{j=1}^{n} \\neg X_j$ is added to the candidate set.\n\n4 Complex Queries\n\nTractable learning focuses on learning models that can efficiently compute the probability of a query\ngiven some evidence, where both the query and evidence are conjunctions of literals. However, many\nother important and interesting queries do not conform to this structure, including the following:\n\n\u2022 Consider the task of predicting the probability that a legislative bill will pass given that\nsome lawmakers have already announced how they will vote. 
Answering this query\nrequires estimating the probability that a count exceeds a given threshold.\n\n\u2022 Imagine only observing the first couple of sentences of a long review, and wanting to assess\nthe probability that the entire document has more positive words than negative words in\nit, which could serve as proxy for how positive (negative) the review is. Answering this\nrequires comparing two groups, in this case positive words and negative words.\n\nTable 1 lists these and other examples of what we call complex queries, which are logical functions\nthat cannot be written as a conjunction of literals. Unfortunately, tractable models based on ACs\nare, in general, unable to answer these types of queries efficiently. We show that using a model\nwith an SDD as the target tractable representation can permit efficient exact inference for certain\nclasses of complex queries: symmetric queries and their generalizations. No known algorithm exists\nfor efficiently answering these types of queries in ACs. For other classes of complex queries, the\ncomplexity is never worse than for ACs, and in many cases SDDs will be more efficient. Note that\nSPNs have the same complexity for answering queries as ACs since they are interchangeable [12].\nWe first discuss how to answer complex queries using ACs and SDDs. We then discuss some classes\nof complex queries and when we can guarantee tractable inference in SDDs.\n\n4.1 Answering Complex Queries\n\nCurrently, it is only known how to solve conjunctive queries in ACs. Therefore, we will answer\ncomplex queries by asking multiple conjunctive queries. We convert the query into DNF format\n$\\bigvee C$ consisting of n mutually exclusive clauses C = {C1, . . . , Cn}. Now, the probability of the\nquery is the sum of the probabilities of the clauses: $\\Pr(\\bigvee C) = \\sum_{i=1}^{n} \\Pr(C_i)$. In the worst case,\nthis construction requires 2^m clauses for queries over m variables. 
The inference complexity for\neach clause on the AC is O(|AC|). Hence, the total inference complexity is O(2^m \u00b7 |AC|).\nSDDs can answer complex queries without transforming them into mutually exclusive clauses.\nInstead, the query Q can directly be conjoined with the weighted model counting formulation \u2206 of\nthe Markov network M. Given an SDD Sm for the Markov network and an SDD Sq for Q, we\ncan efficiently compile an SDD Sa for Q \u2227 \u2206. From Sa, we can compute the partition function\nof the Markov network after asserting Q, which, once normalized by the original partition function, gives us the probability of Q. This computation is\nperformed efficiently on the AC that corresponds to Sa (cf. Section 2.2). The supplement explains\nthe protocol for answering a query. The size of the SDD Sa is at most |Sq|\u00b7|Sm| [23], and inference\nis linear in the circuit size, therefore it is O(|Sq| \u00b7 |Sm|). When converting an arbitrary query into\nan SDD, the size may grow as large as 2^m, with m the number of variables in the query. But often\nit will be much smaller (see Section 4.2). Thus, the overall complexity is O(2^m \u00b7 |Sm|), but often\nmuch better, depending on the query class.\n\n4.2 Classes of Complex Queries\n\nA first class of tractable queries consists of symmetric Boolean functions. These queries do not depend on\nthe exact input values, but only on how many of them are true.\nDefinition 1. A Boolean function f (X1, . . . , Xn) : {0, 1}^n \u2192 {0, 1} is a symmetric query precisely\nwhen f (X1, . . . , Xn) = f (X\u03c0(1), . . . 
, X\u03c0(n)) for every permutation \u03c0 of the n indexes.\n\nTable 1: Examples of complex queries, with m the SDD size and n the number of query variables.\n\nQuery class           Query type          Inference complexity  Example\nSymmetric             Parity              O(mn)                 #(A, B, C)%2 = 0\nSymmetric             k-Threshold         O(mnk^2)              #(A, B, C) > 1\nSymmetric             Exactly-k           O(mnk^2)              #(A, B, C) = 2\nSymmetric             Modulo-k            O(mnk)                #(A, B, C)%3 = 0\nAsymmetric tractable  Exactly-k           O(mnk^2)              #(A, B,\u00acC) = 2\nAsymmetric tractable  Hamming distance k  O(mnk^2)              #(A, B,\u00acC) \u2264 2\nAsymmetric tractable  Group comparison    O(mn^3)               #(A, B,\u00acC) > #(D,\u00acE)\n\nTable 1 lists examples of functions that can always be answered in polytime because they have\na compact SDD. Note that the features generated by the mutex generator are types of exactly-k\nqueries where k = 1, and therefore have a compact SDD. We have the following result.\nTheorem 1. Markov networks with compact SDDs support tractable querying of symmetric\nfunctions. More specifically, let M be a Markov network with an SDD of size m, and let Q be any\nsymmetric function of n variables. Then, PrM (Q) can be computed in O(mn^3) time. Moreover,\nwhen Q is a parity function, querying takes O(mn) time, and when Q is a k-threshold or exactly-k\nfunction, querying takes O(mnk^2) time.\n\nThe proof shows that any SDD can be conjoined with these queries without increasing the SDD\nsize by more than a factor polynomial in n. The proof of Theorem 1 is given in the supplement.\nThis tractability result can be extended to certain non-symmetric functions. For example, negating\nthe inputs to a symmetric function still yields a tractable complex query. This allows queries for\nthe probability that the state is within a given Hamming distance from a desired state. Moreover,\nBoolean combinations of a bounded number of tractable functions also admit efficient querying. 
This\nallows queries that compare symmetric properties of different groups of variables.\nWe cannot guarantee tractability for other classes of complex queries, because some queries do not\nhave a compact SDD representation. An example of such a query is the weighted k-threshold where\neach literal has a corresponding weight and the total weight of true literals must be bigger than some\nthreshold. While the worst-case complexity of using SDDs and ACs to answer such queries is the\nsame, we show in the supplement that SDDs can still be more efficient in practice.\n\n5 Empirical Evaluation\n\nThe goal of this section is to evaluate the merits of using SDDs as a target representation in tractable\nlearning for complex queries. Specifically, we want to address the following questions:\n\nQ1 Does capturing mutual exclusivity allow LearnSDD to learn more accurate models than ACMN?\nQ2 Do SDDs produced by LearnSDD answer complex queries faster than ACs learned by ACMN?\n\nTo resolve these questions, we run LearnSDD and ACMN on real-world data and compare their\nperformance. Our LearnSDD implementation builds on the publicly available SDD package (footnote 3).\n\n5.1 Data\n\nTable 2 describes the characteristics of each data set.\n\nTable 2: Data Set Characteristics\n\nData Set     Train Set Size  Tune Set Size  Test Set Size  Num. Vars.\nTraffic      3,311           441            662            128\nTemperature  13,541          1,805          2,708          216\nVoting       1,214           200            350            1,359\nMovies       1,600           150            250            1,000\n\n3 http://reasoning.cs.ucla.edu/sdd/\n\nMutex features We used the Traffic and Temperature data sets [5] to evaluate the benefit of\ndetecting mutual exclusivity. 
In the initial version of these data sets, each variable had four values, which\nwere binarized using a 1-of-n encoding.\nComplex queries To evaluate complex queries, we used voting data from GovTrack.us and Pang\nand Lee\u2019s Movie Review data set (footnote 4). The voting data contains all 1764 votes in the House of\nRepresentatives from the 110th Congress. Each bill is an example and the variables are the votes of the\n453 congressmen, which can be yes, no, or present. The movie review data contains 1000 positive\nand 1000 negative movie reviews. We first applied the Porter stemmer and then used the\nscikit-learn CountVectorizer (footnote 5), which counts all 1- and 2-grams, while omitting the standard scikit-learn\nstop words. We selected the 1000 most frequent n-grams in the training data to serve as the features.\n\n5.2 Methodology\n\nFor all data sets, we divided the data into a single train, tune, and test partition. All experiments\nwere run on identically configured machines with 128GB RAM and twelve 2.4GHz cores.\nMutex features Using the training set, we learned models with both LearnSDD and ACMN. For\nLearnSDD, we tried setting \u03b1 to 1.0, 0.1, 0.01 and 0.001. For ACMN, we did a grid search for\nthe hyper-parameters (per-split penalty ps and the L1 and L2-norm weights l1 and l2) with ps \u2208\n{2, 5, 10}, l1 \u2208 {0.1, 1, 5} and l2 \u2208 {0.1, 0.5, 1}. For both methods, we stopped learning if the\ncircuit exceeded two million edges or the algorithm ran for 72 hours. For each approach, we picked\nthe best learned model according to the tuning set log-likelihood. We evaluated the quality of the\nselected model using the log-likelihood on the test set.\nComplex queries In this experiment, the goal is to compare the time needed to answer a query in\nmodels learned by LearnSDD and ACMN. In both SDDs and ACs, inference time depends linearly\non the number of edges in the circuit. 
Therefore, to ensure a fair comparison, the learned models\nshould have approximately the same number of edges. Hence, we first learned an SDD and then\nused the number of edges in the learned SDD to limit the size of the model learned by ACMN.\nIn the voting data set, we evaluated the threshold query: what is the probability that at least 50%\nof the congressmen vote \u201cyes\u201d on a bill, given as evidence that some lawmakers have already\nannounced their vote? We vary the percentage of unknown votes from 1 to 100% in steps\nof 1 percentage point. We evaluated several queries on the movie data set. The first two queries mimic\nan active sensing setting to predict features of the review without reading it entirely. The\nevidence for each query consists of the features that appear in the first 750 characters of the stemmed\nreview. On average, the stemmed reviews have approximately 3,600 characters. The first query is\nPr(#(positive ngrams) > 5) and the second is Pr(#(positive ngrams) > #(negative ngrams)),\nwhich correspond to a threshold query and a group comparison query, respectively. For both queries,\nwe varied the size of the positive and negative ngram sets from 5 to 100 ngrams with an increment\nsize of 1. We randomly selected which ngrams are positive and negative as we are only interested\nin a query\u2019s evaluation time. The third query is the probability that a parity function over a set of\nfeatures is even. We vary the number of variables considered by the parity function from 5 to 100.\nFor each query, we report the average per-example inference time for each learned model on the\ntest set. We impose a 10-minute average time limit and a 100-minute individual time limit for each\nquery. 
For completeness, the supplement reports run times for queries that are guaranteed to (not)\nbe tractable for both ACs and SDDs as well as the conditional log-likelihoods of all queries.\n\n5.3 Results and Discussion\n\nMutex features Figure 2 shows the test set log-likelihoods as a function of the size of the learned\nmodel. In both data sets, LearnSDD produces smaller models that have the same accuracy as those of ACMN.\nThis is because it can add mutex features without the need to add other features that are needed as\nbuilding blocks but are redundant afterwards. These results allow us to affirmatively answer (Q1).\n\nFigure 2: The size and log-likelihood of the models learned by LearnSDD and ACMN, on (a) Temperature and (b) Traffic (x-axis: size; y-axis: log-likelihood). Ideally, the model is small with high accuracy (upper left corner), which is best approached by the LearnSDD models.\n\n4 http://www.cs.cornell.edu/people/pabo/movie-review-data/\n5 http://tartarus.org/martin/PorterStemmer/ and http://scikit-learn.org/\n\nComplex queries Figure 3 shows the inference times for complex queries that are extensions of\nsymmetric queries. For all queries, we see that LearnSDD\u2019s model results in significantly faster\ninference times than ACMN\u2019s model. In fact, ACMN\u2019s model exceeds the ten minute time limit on\n334 out of 388 of the query settings whereas this only happens in 25 settings for LearnSDD. The\nSDD can answer all parity questions and positive word queries in less than three hundred\nmilliseconds and the group comparison in less than three seconds. It can answer the voting query with up to\n75% of the votes unknown in less than ten minutes. 
These results demonstrate LearnSDD\u2019s superior\nability to answer complex queries compared to ACMN and allow us to positively answer (Q2).\n\nFigure 3: The time for SDDs vs. ACs to answer complex queries, varying the number of query variables. Panels: (a) Threshold query (Voting), (b) Threshold query (Movie), (c) Group comparison (Movie), (d) Parity (Movie); the y-axis is time in seconds, up to the 600 s timeout. SDDs need less time in all settings, answering nearly all queries. ACs time out in more than 85% of the cases.\n\n6 Conclusions\n\nThis paper highlighted the fact that tractable learning approaches learn models for only a restricted\nclass of queries, primarily focusing on the efficient computation of conditional probabilities. We\nfocused on enabling efficient inference for complex queries. To achieve this, we proposed using\nSDDs as the target representation for tractable learning. We provided an algorithm for\nsimultaneously learning a Markov network and its SDD representation. We proved that SDDs support\npolytime inference for complex symmetric probability queries. Empirically, SDDs enable significantly\nfaster inference times than ACs for multiple complex queries. 
Probabilistic SDDs are a closely related\nrepresentation: they also support complex queries (in structured probability spaces) [26, 27],\nbut they lack general-purpose structure learning algorithms (a subject of future work).\n\nAcknowledgments\n\nWe thank Songbai Yan for prior collaborations on related projects. JB is supported by IWT\n(SB/141744). JD is partially supported by the Research Fund KU Leuven (OT/11/051, C22/15/015),\nEU FP7 Marie Curie CIG (#294068), IWT (SBO-HYMOP) and FWO-Vlaanderen (G.0356.12). AC\nand AD are partially supported by NSF (#IIS-1514253) and ONR (#N00014-12-1-0423).\n\nReferences\n\n[1] P. Domingos, M. Niepert, and D. Lowd (Eds.). In ICML Workshop on Learning Tractable Probabilistic Models, 2014.\n[2] F. R. Bach and M. I. Jordan. Thin junction trees. In Proceedings of NIPS, pages 569\u2013576, 2001.\n[3] N. L. Zhang. Hierarchical latent class models for cluster analysis. JMLR, 5:697\u2013723, 2004.\n[4] M. Narasimhan and J. Bilmes. PAC-learning bounded tree-width graphical models. In Proceedings of UAI, 2004.\n[5] A. Chechetka and C. Guestrin. Efficient principled learning of thin junction trees. In Proceedings of NIPS, pages 273\u2013280, 2007.\n[6] M. Chavira and A. Darwiche. On probabilistic inference by weighted model counting. AIJ, 172(6\u20137):772\u2013799, 2008.\n[7] M. Niepert and G. Van den Broeck. Tractability through exchangeability: A new perspective on efficient probabilistic inference. In Proceedings of AAAI, 2014.\n[8] A. Darwiche. A differential approach to inference in Bayesian networks. JACM, 50(3):280\u2013305, 2003.\n[9] D. Lowd and A. Rooshenas. Learning Markov networks with arithmetic circuits. In Proceedings of AISTATS, pages 406\u2013414, 2013.\n[10] T. Rahman, P. Kothalkar, and V. Gogate. Cutset networks: A simple, tractable, and scalable approach for improving the accuracy of Chow-Liu trees. In Proceedings of ECML PKDD, pages 630\u2013645, 2014.\n[11] R. Gens and P. Domingos. 
Learning the structure of sum-product networks. In Proceedings of ICML, pages 873\u2013880, 2013.\n[12] A. Rooshenas and D. Lowd. Learning sum-product networks with direct and indirect variable interactions. In Proceedings of ICML, pages 710\u2013718, 2014.\n[13] M. Niepert and P. Domingos. Exchangeable variable models. In Proceedings of ICML, 2014.\n[14] J. Van Haaren, G. Van den Broeck, W. Meert, and J. Davis. Lifted generative learning of Markov logic networks. Machine Learning, 2015. (to appear).\n[15] A. Kulesza and B. Taskar. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 2012.\n[16] J. D. Park. MAP complexity results and approximation methods. In Proceedings of UAI, 2002.\n[17] S. J. Chen, A. Choi, and A. Darwiche. Algorithms and applications for the same-decision probability. JAIR, pages 601\u2013633, 2014.\n[18] A. Krause and C. Guestrin. Optimal nonmyopic value of information in graphical models - efficient algorithms and theoretical limits. In Proceedings of IJCAI, 2005.\n[19] L. C. van der Gaag, H. L. Bodlaender, and A. Feelders. Monotonicity in Bayesian networks. In Proceedings of UAI, pages 569\u2013576, 2004.\n[20] D. D. Mau\u00e1, C. P. De Campos, and M. Zaffalon. On the complexity of solving polytree-shaped limited memory influence diagrams with binary variables. AIJ, 205:30\u201338, 2013.\n[21] I. Parberry and G. Schnitger. Relating Boltzmann machines to conventional models of computation. Neural Networks, 2(1):59\u201367, 1989.\n[22] D. Buchman and D. Poole. Representing aggregators in relational probabilistic models. In Proceedings of AAAI, 2015.\n[23] A. Darwiche. SDD: A new canonical representation of propositional knowledge bases. In Proceedings of IJCAI, pages 819\u2013826, 2011.\n[24] S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields. 
IEEE TPAMI, 19:380\u2013392, 1997.\n[25] D. Lowd and J. Davis. Improving Markov network structure learning using decision trees. JMLR, 15(1):501\u2013532, 2014.\n[26] D. Kisa, G. Van den Broeck, A. Choi, and A. Darwiche. Probabilistic sentential decision diagrams. In Proceedings of KR, 2014.\n[27] A. Choi, G. Van den Broeck, and A. Darwiche. Tractable learning for structured probability spaces: A case study in learning preference distributions. In Proceedings of IJCAI, 2015.\n", "award": [], "sourceid": 1324, "authors": [{"given_name": "Jessa", "family_name": "Bekker", "institution": "KU Leuven"}, {"given_name": "Jesse", "family_name": "Davis", "institution": "KU Leuven"}, {"given_name": "Arthur", "family_name": "Choi", "institution": "UCLA"}, {"given_name": "Adnan", "family_name": "Darwiche", "institution": "UCLA"}, {"given_name": "Guy", "family_name": "Van den Broeck", "institution": "UCLA"}]}