{"title": "Learning Bayesian Networks with Low Rank Conditional Probability Tables", "book": "Advances in Neural Information Processing Systems", "page_first": 8964, "page_last": 8973, "abstract": "In this paper, we provide a method to learn the directed structure of a Bayesian network using data. The data is accessed by making conditional probability queries to a black-box model. We introduce a notion of simplicity of representation of conditional probability tables for the nodes in the Bayesian network, that we call ``low rankness''. We connect this notion to the Fourier transformation of real valued set functions and propose a method which learns the exact directed structure of a `low rank` Bayesian network using very few queries. We formally prove that our method correctly recovers the true directed structure, runs in polynomial time and only needs polynomial samples with respect to the number of nodes. We also provide further improvements in efficiency if we have access to some observational data.", "full_text": "Learning Bayesian Networks with Low Rank\n\nConditional Probability Tables\n\nAdarsh Barik\n\nDepartment of Computer Science\n\nPurdue University\n\nWest Lafayette, Indiana, USA\n\nabarik@purdue.edu\n\nJean Honorio\n\nDepartment of Computer Science\n\nPurdue University\n\nWest Lafayette, Indiana, USA\n\njhonorio@purdue.edu\n\nAbstract\n\nIn this paper, we provide a method to learn the directed structure of a Bayesian\nnetwork using data. The data is accessed by making conditional probability queries\nto a black-box model. We introduce a notion of simplicity of representation of\nconditional probability tables for the nodes in the Bayesian network, that we call\n\u201clow rankness\u201d. We connect this notion to the Fourier transformation of real valued\nset functions and propose a method which learns the exact directed structure of\na \u2018low rank\u2018 Bayesian network using very few queries. 
We formally prove that\nour method correctly recovers the true directed structure, runs in polynomial time\nand only needs polynomial samples with respect to the number of nodes. We also\nprovide further improvements in ef\ufb01ciency if we have access to some observational\ndata.\n\n1\n\nIntroduction\n\nMotivation. Real-world systems are made of large number of constituent variables. Understanding\nthe interactions and relationships of these variables is key to understand the behavior of such systems.\nScientists and researchers from many domains have been using graphs to model and learn relationships\namongst variables of real-world systems for a long time. Bayesian networks are one of the most\nimportant classes of probabilistic graphical models which are used to model complex systems. They\nprovide a compact representation of joint probability distributions among a set of variables.\n\nRelated work. Learning the structure of a Bayesian network from observational data is a well\nknown but an incredibly dif\ufb01cult problem to solve in the machine learning community. Due to its\npopularity and applications, a considerable amount of work has been done in this \ufb01eld. Most of\nthese work use observational data to learn the structure. We can broadly divide these methods in\ntwo categories. The methods in the \ufb01rst category use score maximization techniques to learn the\nDAG from observational data. In this category, there are some heuristics based approaches such as\nFriedman et al. (1999); Tsamardinos et al. (2006); Margaritis and Thrun (2000); Moore and Wong\n(2003) which run in polynomial-time without offering any convergence/consistency guarantee. There\nare also some exact but exponential time score maximizing exact algorithms such as Koivisto and\nSood (2004); Silander and Myllym\u00e4ki (2006); Cussens (2008); Jaakkola et al. (2010). The methods\nin the second category are independence test based methods such as Spirtes et al. 
(2000); Cheng et al. (2002); Yehezkel and Lerner (2005); Xie and Geng (2008).\nThere has also been some work on learning the structure of a Bayesian network using interventional data (Murphy, 2001; Tong and Koller, 2001; Eaton and Murphy, 2007; Triantafillou and Tsamardinos, 2015). Most of these works first find a Markov equivalence class from observational data and then direct the edges using interventions. Unfortunately, the first step of finding the Markov equivalence class remains NP-hard (Chickering, 1996).\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fHauser and B\u00fchlmann (2012), He and Geng (2008) and Kocaoglu et al. (2017) have presented polynomial-time methods to find an optimal set of interventions for chordal DAGs. Bello and Honorio (2018) have proposed a method to learn a Bayesian network using interventional path queries with logarithmic sample complexity; however, their method runs in exponential time in the number of parents.\nIn this paper, our work takes an intermediate path. We do not use pure observational or interventional data directly. Rather, we assume that there exists a black-box which answers conditional probability queries by outputting observational data. Our goal is to limit the number of such queries and learn the directed structure of a Bayesian network. We propose a novel algorithm to achieve this goal, and we also provide a way to improve our results when some observational data is available. We measure performance by the following criteria. 1. Correctness - The method must correctly recover the directed structure of a Bayesian network with provable theoretical guarantees. 2. Computational efficiency - The method must be fast enough to handle high-dimensional cases; ideally, we want polynomial time complexity with respect to the number of nodes. 3. 
Sample complexity - We would like to use as few samples as possible for\nrecovering the structure of the Bayesian network. As with the time complexity, we want to achieve\npolynomial sample complexity with respect to the size of the network.\n\nContribution. Consider a binary node i of a Bayesian network with m parents. The conditional\nprobability table (CPT) of node i has 2m`1 entries. This number quickly becomes very large even\nfor modest values of m. To handle such large tables while still maintaining the effect of all the\nparents, we introduce a notion of simplicity of representation of the CPTs, which we call \u201clow\nrankness\u201d. Our intuition is that each CPT can be treated as summation of multiple simple tables,\neach of them depending only on a handful of parents (say k parents where k is the rank of the CPT).\nWe connect this notion of rank of a CPT to the Fourier transformation of a speci\ufb01c real valued set\nfunction (Stobbe and Krause, 2012) and use compressed sensing techniques (Rauhut, 2010) to show\nthat the Fourier coef\ufb01cients of this set function can be used to learn the structure of the Bayesian\nnetwork. While doing so, we provide a method with theoretical guarantees of correctness, and which\nworks in polynomial time and sample complexity. Our method requires computation of conditional\nprobabilities from data. We do this by making queries to a black-box. One query consists of two\nsteps. The \ufb01rst step is the selection of variables, i.e., choosing a target variable and a set of variables\nfor conditioning. The second step is to assign speci\ufb01c values to the selected conditioning variables.\nThis process is similar to the process used in Bello and Honorio (2018); Kocaoglu et al. (2017), which\nconsider a particular selection of variables as one intervention. An actual setting of the variables\nare considered as one experiment. 
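To make the notion of low rankness concrete, the following sketch (our own illustration in Python; the tables `p0`, `pj` and the helper `rank2_cpt` are hypothetical, not from the paper) materializes a rank-2 CPT for a binary node with m parents as a unary table plus one pairwise table per parent:

```python
import itertools

def rank2_cpt(m, q_unary, q_pair):
    """Materialize the full CPT P(X_i = x_i | parents = x) from a rank-2
    representation: a unary table Q_i plus one pairwise table Q_ij per parent."""
    table = {}
    for x_i in (0, 1):
        for x in itertools.product((0, 1), repeat=m):
            table[(x_i,) + x] = q_unary[x_i] + sum(
                q_pair[j][(x_i, x[j])] for j in range(m))
    return table

m = 10
# One simple way to keep the entries valid probabilities (our choice, not the
# paper's): average a prior p0 on X_i with one conditional pj(X_i | X_j) per
# parent j, each weighted by 1 / (m + 1).
p0 = {0: 0.4, 1: 0.6}
pj = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}
q_unary = {x: p0[x] / (m + 1) for x in (0, 1)}
q_pair = [{k: v / (m + 1) for k, v in pj.items()} for _ in range(m)]
cpt = rank2_cpt(m, q_unary, q_pair)
```

For m = 10 the materialized table already has 2^(m+1) = 2048 entries, while the rank-2 form stores only 2 + 4m = 42 numbers, and each conditional distribution still sums to one by construction.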
For example, a selection of k binary variables can be assigned 2k\ndistinct values and can be queried in 2k different ways. Our setting is similar to an interventional\nsetting where a selection can be compared to an intervention and an assignment can be compared\nto an experiment, although our method never queries the 2k distinct values, but a single random\nassignment instead. Thus, we compare our results to the state-of-the-art interventional methods in\nTable 1. It should be noted that the number of queries (or experiments in the interventional setting)\nare a better metric for comparison than the number of selections (or interventions). This is because a\nselection may involve only one node (Bello and Honorio, 2018) or multiple nodes (in this paper) and\nthus could hide some complexity of the problem. Furthermore, the sample complexity of the problem\ndepends on the number of queries.\n\nTable 1: Sample and time complexity, number of selections (interventions) and queries (experiments)\nrequired for structure learning of binary Bayesian networks. Here n is the number of nodes, k is the\nmaximum size of the Markov blanket. The maximum number of parents of a node is Opkq.\nTime Complexity Selections Queries\nAlgorithms\nn log nq Opnq\nOur Work\n(no observational data)\nOpnq\nOur Work\n(with observational data) Blackbox - Opnk3 log5 kq\nBello and Honorio (2018) Interventional - Opn22k log nq Opn22k log nq Opn2q\nKocaoglu et al. (2017)\n\nSample Complexity\nBlackbox - Opnk3 log4 nplog k Opn4k\n` log log nqq\nObservational - Opnq\nOpn4q\nOpnk4?\n\nInterventional - no guarantees Op2nkn2 log2 nq Oplog nq Op2n log nq\n\n?\n\nk log kq\n\nOpnk3 log4 nq\nOpnk3 log4 kq\nOpn22kq\n\n2\n\n\fn\n\n2 Preliminaries\nIn this section, we introduce formal de\ufb01nitions and notations. Let X \u201c tX1, X2, . . . , Xnu be a set of\nrandom variables. For a set A, XA denotes the set of random variables Xi P X such that i P A. We\nuse the shorthand notation i to denote V ztiu. 
We de\ufb01ne a Bayesian network on a directed acyclic\ngraph G \u201c pV, Eq where V denotes the set of vertices and E is a set of ordered pair of nodes, each\ncorresponding to a directed edge, i.e., if pa, bq P E then there is an edge a \u00d1 b in G. The parents of\na node i,@i P V denoted by \u03c0Gpiq, are set of all nodes j such that edge pj, iq P E. We also de\ufb01ne\nthe Markov blanket MBGpiq for a node i as a set of nodes containing parents, children and parents of\nchildren of node i. The nodes with no children are called terminal nodes.\nDe\ufb01nition 1 (Bayesian network). Let G \u201c pV, Eq be a directed acyclic graph (DAG) and X \u201c\ntX1, X2, . . . , Xnu be a set of random variables such that Xi corresponds to a random variable at\nnode i P V,@i \u201c t1, . . . , nu. Let X\u03c0Gpiq denote the set of random variables de\ufb01ned on the parents\n\u015b\nof node i in DAG G. A Bayesian network B \u201c pG,Pq represents a joint probability distribution P\nover the set of random variables X de\ufb01ned on the nodes of DAG G which factorizes according to\ni\u201c1 PpXi|X\u03c0Gpiqq where PpXi|X\u03c0Gpiqq denotes\nthe DAG structure, i.e., PpX1, X2, . . . , Xnq \u201c\nconditional probability distribution (CPD) of node i given its parents in DAG G.\nWe denote the domain of a random variable Xi,@i P t1, . . . , nu by dompXiq. The cardinality of a\n\u015b\nset is denoted by notation | \u00a8 |. A Bayesian network B \u201c pG,Pq on discrete nodes is called a binary\nBayesian network if | dompXiq| \u201c 2,@i P t1, . . . , nu. For discrete nodes, PpXi|X\u03c0Gpiqq is often\nrepresented as a conditional probability table (CPT) with | dompXiq|\njP\u03c0Gpiq | dompXjq| entries.\nIn this work, we will only focus on binary Bayesian networks. Next, we introduce a novel concept of\nrank of a conditional probability distribution for a node of Bayesian network.\nDe\ufb01nition 2 (Rank k conditional probability distribution). 
A node i P V of a Bayesian network\nBpG,Pq is said to be rank k representable with respect to a set Apiq \u010e V ztiu and probability\ndistribution P if,\nQSpXS \u201c xSq,@xi P dompXiq, xApiq P dompXApiqq\nPpXi \u201c xi|XApiq \u201c xApiqq \u201c\n\n\u015a\n(1)\njPS dompXjq \u00d1 R is a function which depends only on the variables XS. A node i is\nwhere QS :\nsaid to have rank k conditional probability table if it is rank k representable but is not rank k \u00b4 1\nrepresentable with respect to Apiq and P.\n\u0159\nFor example, a node i P V of a Bayesian network BpG,Pq is rank 2 representable with respect\nto its parents \u03c0Gpiq and P if we can write PpXi \u201c xi|X\u03c0Gpiq \u201c x\u03c0Gpiqq \u201c QipXi \u201c xiq `\njP\u03c0Gpiq QijpXi \u201c xi, Xj \u201c xjq, where @xi P dompXiq, xj P dompXjq,@j P \u03c0Gpiq. It is easy to\nobserve that any node i P V is always rank |Apiq|`1 representable with respect to a set Apiq \u0102 V ztiu\nand P. Also, rank k representations for a node i with respect to Apiq and P may not be unique. We\n\u0159\nconsider real-valued set functions on a set T of cardinality t de\ufb01ned as f : 2T \u00d1 R where 2T denotes\nthe power set of T . Let F be the space of all such functions, with corresponding inner product\nAP2T fpAqgpAq. The space F has a natural Fourier basis, and in our set function\nxf, gy \ufb01 2\u00b4t\nnotation the corresponding Fourier basis vectors are \u03c8BpAq \ufb01 p\u00b41q|AXB|. We de\ufb01ne the Fourier\ntransformation coef\ufb01cients of function f as \u02c6fpBq \ufb01 xf, \u03c8By \u201c 2\u00b4t\nAP2T fpAqp\u00b41q|AXB|. 
Using\nFourier coef\ufb01cients, the function f can be reconstructed as:\n\nS\u010etiuYApiq\n1\u010f|S|\u010fk, iPS\n\n\u00ff\n\n\u0159\n\n\u00ff\n\nBP2T\n\nfpAq \u201c\n\n\u02c6fpBq\u03c8BpAq\n\n(2)\n\nThe Fourier support of a set function is the collection of subsets with nonzero Fourier coef\ufb01cient:\nsupportp \u02c6fq \ufb01 tB P 2T| \u02c6fpBq \u2030 0u.\n\n3 Method and Theoretical Analysis\n\nIn this section, we develop our method for learning the structure of a Bayesian network and provide\ntheoretical guarantees for correct and ef\ufb01cient learning. First we would like to mention some technical\nassumptions.\n\n3\n\n\fAssumption 1 (Availability of Black-box). For a Bayesian network BpG,Pq, we can submit a\nconditional probability query BBpi, A, xA, Nq to a black-box on any set of selected nodes i P V, A \u010e\ni and value xA, and receive N i.i.d. samples from the conditional distribution PpXi|XA \u201c xAq.\nAssumption 2 (Faithfulness). The distribution over the nodes of the Bayesian network BpG,Pq\ninduced by pG,Pq exhibits no other independencies beyond those implied by the structure of G.\nAssumption 3 (Low rank CPTs). Each node i P V in the Bayesian network BpG,Pq has rank 2\nconditional probability tables with respect to \u03c0Gpiq and P.\nAssumption 1 implies the availability of observational data for all queries. This is analogous to the\nstandard assumption of availability of interventional data in interventional setting (Murphy, 2001;\nHe and Geng, 2008; Kocaoglu et al., 2017; Tong and Koller, 2001; Hausar and B\u00fchlmann, 2012).\nAssumption 2 is also a standard assumption (Kocaoglu et al., 2017; Tong and Koller, 2001; He and\nGeng, 2008; Spirtes et al., 2000; Trianta\ufb01llou and Tsamardinos, 2015) which ensures that we only\nhave those independence relations between nodes which come from d-separation. We also introduce\na novel Assumption 3 which ensure that CPTs of nodes have a simple representation. 
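Assumption 1 can be pictured with a small simulator. The sketch below is our own toy example (a three-node binary network 1 → 3 ← 2 with made-up parameters), in which rejection sampling stands in for a real source of observational data behind the query interface BB(i, A, x_A, N):

```python
import random

# Toy binary network 1 -> 3 <- 2 with hypothetical parameters.
P1, P2 = 0.6, 0.3                      # P(X1 = 1), P(X2 = 1)
def p3(x1, x2):                        # P(X3 = 1 | X1, X2), a rank-2 form:
    return 0.2 + 0.3 * x1 + 0.4 * x2   # unary term + one term per parent

def sample_joint(rng):
    """Ancestral sampling of one joint configuration."""
    x1 = int(rng.random() < P1)
    x2 = int(rng.random() < P2)
    x3 = int(rng.random() < p3(x1, x2))
    return {1: x1, 2: x2, 3: x3}

def BB(i, A, xA, N, rng=None):
    """Black-box of Assumption 1: return N i.i.d. samples of X_i drawn from
    P(X_i | X_A = x_A), here implemented by rejection sampling of the joint."""
    rng = rng or random.Random(0)
    out = []
    while len(out) < N:
        s = sample_joint(rng)
        if all(s[j] == xA[j] for j in A):
            out.append(s[i])
    return out
```

For instance, BB(3, [1, 2], {1: 1, 2: 1}, N) returns samples whose empirical mean approaches P(X3 = 1 | X1 = 1, X2 = 1) = 0.9 as N grows.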
In the later\nsections, we relate this to sparsity in the Fourier domain. We note that there is nothing special about\nCPTs being rank 2 and our method can be extended for any rank k CPTs.\n\n3.1 Problem Description\n\nIn this work, we address the following question:\nProblem 1 (Recovering structure of a Bayesian network using black-box queries). Consider we have\naccess to a black-box which provides observational data for our conditional probability queries for a\nfaithful Bayesian network BpG,Pq with each node i having rank 2 CPT with respect to its parents\n\u03c0Gpiq and P. Can we recover the directed structure of G with theoretical guarantees of correctness\nand ef\ufb01ciency in terms of time and sample complexity?\n\nWe show that it is indeed possible to do. We control the number of samples by controlling the number\nof queries. We also show that it is possible to further reduce the sample complexity if we have access\nto some observational data.\n\n3.2 Theoretical Result\n\nIn this subsection, we state our theoretical results. We start by analyzing terminal nodes.\n\nAnalyzing Terminal Nodes. Since terminal nodes do not have any children, their Markov blanket\nonly contains their parents. Furthermore, if the Bayesian network is faithful then for any terminal\nnode t P V : PpXt|X\u03c0Gptqq \u201c PpXt|XMBGptqq \u201c PpXt|Xtq. Thus, for any terminal node t P\nV, PpXt|X\u03c0Gptqq can be computed without explicitly knowing its parents. Next, we de\ufb01ne a set\nfunction which computes PpXt|X\u03c0Gptqq. In particular, for an assignment x\u03c0Gptq P t0, 1u|\u03c0Gptq|, we\nare interested in computing PpXt \u201c 1|X\u03c0Gptq \u201c x\u03c0Gptqq. Note that PpXt \u201c 0|X\u03c0Gptq \u201c x\u03c0Gptqq\ncan simply be computed by subtracting PpXt \u201c 1|X\u03c0Gptq \u201c x\u03c0Gptqq from 1. Let t denote the set\nV zttu. For node t and a set A \u010e t, let xA P t0, 1un be an assignment such that xA\ni \u201c 1iPA,@i \u2030 t\nt \u201c 0. 
We de\ufb01ne a set function ft for each terminal node t P V as follows:\nand xA\n\n\u00ff\n\nftpAq \u201c QtpXt \u201c xA\n\nt q `\n\nQtjpXt \u201c xA\n\nt , Xj \u201c xA\nj q,\n\n@A \u010e t\n\njP\u03c0Gptq\n\n(3)\n\u03c0Gptqq and ftpAq \u201c ftpA X \u03c0tq.\nNote that Equation (3) precisely computes PpXt \u201c 1|X\u03c0Gptq \u201c xA\nNext, we prove that the Fourier support of ft only contains singleton sets of parents of node t.\nTheorem 1. If nodes of a Bayesian network BpG,Pq have rank 2 with respect to their parents \u03c0Gp.q\nand P, then the Fourier coef\ufb01cient \u02c6ftpBq for function ft de\ufb01ned by equation (3) for any terminal\n\u0159\nnode t and a set B P 2t is given by:\n\n\u02d8\nQtjpXt \u201c 1, Xj \u201c 0q \u00b4 QtjpXt \u201c 1, Xj \u201c 1q\n\n\u02d8\nQtjpXt \u201c 1, Xj \u201c 0q ` QtjpXt \u201c 1, Xj \u201c 1q\n, B \u201c tju,@j P \u03c0Gptq\n\n$\u2019&\u2019%QtpXt \u201c 1q ` 1\n`\n\n1\n2\n0, Otherwise\n\n, B \u201c \u03c6\n\n\u02c6ftpBq \u201c\n\njP\u03c0Gptq\n\n2\n\n`\n\n4\n\n(4)\n\n\f(See Appendix A for detailed proofs.)\n\nAnalyzing Non-Terminal Nodes. A similar analysis can be done for non-terminal nodes. However,\nfor a non-terminal node i we can not compute PpXi|X\u03c0Gpiqq without explicitly knowing the parents\nof node i. We will rather focus on computing PpXi|XMBGpiqq for non-terminal nodes which equals\nto computing PpXi|Xiq which can be done from data. Similar to the previous case, we de\ufb01ne a set\nfunction gi for each non-terminal node i P V as follows:\n\ngipAq \u201c QipXi \u201c xA\n\ni q `\n\nQijpXi \u201c xA\n\ni , Xj \u201c xA\nj q,\n\n@A \u010e i\n\n(5)\n\n\u00ff\n\njP\u03c0Gpiq\n\nWe can de\ufb01ne a corresponding set function fi which computes PpXi|XMBGpiqq for non-terminal\nnodes. 
We do it in the following way:\nfipAq \u201cPpXi \u201c xA\n\n\u0159\n\u201c PpXi \u201c xA\nPpXi \u201c xA\n\n\u015b\ni |XMBGpiq \u201c xMBGpiqq\n\u015b\nkPchildGpiq PpXk \u201c xA\n\u03c0Gpiqq\ni |X\u03c0Gpiq \u201c xA\n\u03c0Gpiqq\ni |X\u03c0Gpiq \u201c xA\nkPchildGpiq PpXk \u201c xA\n\nk |X\u03c0Gpkq \u201c xA\nk |X\u03c0Gpkq \u201c x\u03c0GpkqAq\n\n\u03c0Gpkqq\n\n(6)\n\nXi\n\nwhere childGpiq is the set of children of node i in DAG G. We can again compute the Fourier support\nfor fi for each non-terminal node.\nTheorem 2. If nodes of a Bayesian network BpG,Pq have rank 2 with respect to their parents \u03c0Gp.q\nand P, then the Fourier coef\ufb01cient \u02c6fipBq for function fi de\ufb01ned by equation (6) for any non-terminal\nnode i and a set B P 2i is given by:\n\u0159\n|Bz MBGpiq| \u011b 1\n\u015b\n\u02c6fipBq \u201c\nkPchildGpiqgkpAq`gipAYtiuq\n\n\u015b\n\u015b\nkPchildGpiqgkpAq\nkPchildGpiqgkpAYtiuq \u03c8BpAq,\n\nAP2V \u00b4i\n\n#\n\ngipAq\n\ngipAq\n\n2n\u00b41\n\n0,\n\n1\n\notherwise\n(7)\n\n3.3 Algorithm\n\nOur algorithm works on the principle that the terminal nodes are rank 2 with respect to their Markov\nBlanket and P, while non-terminal nodes are not. This is true if for every non-terminal node there\nexists a B P 2V such that |Bz MBGpiq| \u201c 0 and \u02c6fipBq is nonzero. This is formalized in what\nfollows.\nAssumption 4 (Non-terminal nodes are not rank 2). There exists a B P 2V for each non-terminal\nnode i, such that |B| \u201c 2, |Bz MBGpiq| \u201c 0 and \u02c6fipBq as de\ufb01ned by Equation (7) is non-zero.\nThis distinction helps us to differentiate between terminal and non-terminal nodes. Note that the\nset function fi is uniquely determined by its Fourier coef\ufb01cients. Moreover, the Fourier support for\nfunction fi is sparse. For terminal nodes, \u02c6fipBq is non-zero only for the empty set or the singleton\nnodes, while for the non-terminal nodes, \u02c6fipBq is non-zero for B \u010e MBGpiq. 
Thus, recovering\nFourier coef\ufb01cients from the measurements of fi can be treated as recovering a sparse vector in R2i.\nHowever, |2i| could be quite large. We avoid this problem by substituting fi by another function\ngi P G2 where Gk \u201c tgi | @B P supportpgiq,|B| \u010f ku. Note that,\n\n\u02c6fipBkq\u03c8BkpAjq `\n\n\u02c6fipBkq\u03c8BkpAjq `\n\n\u02c6fipBkq\u03c8BkpAjq\n\n(8)\n\n\u00ff\n\n|Bk|\u201c1\nBkP2i\n\nfipAjq \u201c\n\nand @gi P G2,\n\ngipAjq \u201c\n\n\u02c6fipBkq\u03c8BkpAjq `\n\n\u02c6fipBkq\u03c8BkpAjq\n\n(9)\n\n\u00ff\n\n|Bk|\u201c2\nBkP2i\n\n\u00ff\n\n|Bk|\u011b3\nBkP2i\n\n\u00ff\n\n|Bk|\u201c1\nBkP2i\n\n\u00ff\n\n|Bk|\u201c2\nBkP2i\n\nIt follows that for a terminal node i, gi \u201c fi as for terminal nodes fi P G1. For non-terminal nodes,\nusing results from Theorem 2, if B \u010e MBGpiq then \u02c6fipBq \u2030 0 and therefore gi R G1. Now, let Ai\n\n5\n\n\fbe a collection of mi sets Aj P 2V \u00b4i chosen uniformly at random. We measure gipAjq for each\nAj P Ai and then using equation (2) we can write:\n\ngipAjq \u201c\n\np\u00b41q|AjXBk| \u02c6fipBkq,@Aj P Ai\n\n(10)\n\n\u00ff\n\nBkP2i,|Bk|\u010f2\n\nLet gi P Rmi be a vector whose jth row is gipAjq and \u02c6gi P Rn`pn\u00b41\nform \u02c6fipBkq@Bk P \u03c1i where\n\n2 q be a vector with elements of\n\n\u03c1i \u201c tBk | Bk P 2i,|Bk| \u010f 2u\n\n(11)\n\nis a set which contains supportp \u02c6fiq. Then,\n\n`\n\n\u02d8\n\njk \u201c p\u00b41q|AjXBk| .\n\ngi \u201c Mi\u02c6gi where, Mi P t\u00b41, 1umi\u02c6n such that Mi\n\n(12)\nAlso note that for terminal nodes \u02c6gi is sparse with |\u03c0Gpiq| ` 1 non-zero elements for terminal\n` k ` 1 non-zero elements for non-terminal nodes where k \u201c | MBGpiq|.\nnodes and at max\nEquation (12) can be solved by any standard compressed sensing techniques to recover the parents of\nthe terminal nodes. 
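As a sanity check of this step, the following sketch recovers the sparse Fourier vector of a toy terminal node and reads the parents off the nonzero singleton coefficients. The Fourier coefficients are made up for illustration, and for simplicity we measure f on every subset and solve by plain least squares, whereas the paper draws only a few random subsets A_j and solves an l1 program (Equation (13)):

```python
import itertools
import numpy as np

def psi(B, A):
    """Fourier basis vector psi_B(A) = (-1)^{|A intersect B|}."""
    return (-1) ** len(A & B)

# Toy terminal node: candidate parent set {0, 1, 2, 3}, true parents {0, 1}.
# By Theorem 1, its set function f is supported on the empty set and the
# parent singletons; the coefficient values here are invented.
coef = {frozenset(): 0.5, frozenset({0}): 0.1, frozenset({1}): 0.2}
f = lambda A: sum(c * psi(B, A) for B, c in coef.items())

universe = range(4)
# Candidate support rho: all subsets of size <= 2, as in Equation (11).
rho = [frozenset(B) for r in (0, 1, 2)
       for B in itertools.combinations(universe, r)]
# Measure f on every subset A of the universe (16 measurements).
As = [frozenset(A) for r in range(5)
      for A in itertools.combinations(universe, r)]
M = np.array([[psi(B, A) for B in rho] for A in As])   # Equation (12)
g = np.array([f(A) for A in As])
beta, *_ = np.linalg.lstsq(M, g, rcond=None)

# Nonzero singleton coefficients reveal the parents of the terminal node.
parents = {next(iter(B)) for B, b in zip(rho, beta)
           if len(B) == 1 and abs(b) > 1e-8}
```

Here `rho` plays the role of Equation (11) and `M` of Equation (12); since the Fourier characters are orthogonal, the 16-by-11 system has full column rank and least squares recovers the planted coefficients exactly, so the singleton support equals the parent set {0, 1}.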
Using this formulation and the fact that terminal nodes have non-zero Fourier\ncoef\ufb01cients on empty or singleton sets, we can provide an algorithm to identify the terminal nodes and\ntheir corresponding parents. We can use this algorithm repeatedly to identify the complete structure\nof the Bayesian network until the last two nodes where we can not apply our algorithm. Algorithm 1\nidenti\ufb01es the parents for each node and consequently the directed structure of the Bayesian network.\n\nk\n2\n\nAlgorithm 1: getParentspV q\n:Nodes V \u201c t1, 2, . . . , nu\n\nInput\nOutput :Recovered parent set \u02c6\u03c0 : V \u00d1 2V\nS \u00d0 V ;\nwhile |S| \u011b 3 do\nT, \u02c6\u03c0 \u201c getTerminalNodespSq ;\nS \u00d0 SzT ;\n\nend\nfor i P S do\n\u02c6\u03c0piq \u201c \u03c6;\nend\n\nAlgorithm 2: getTerminalNodespSq\n:Nodes S \u010e t1, 2, . . . , nu\n\u02c6\u03c0 : T \u00d1 2S\n\nInput\nOutput :Set of terminal nodes T and their parents\nT \u00d0 \u03c6, \u02c6\u03c0piq \u00d0 \u03c6 @i P S ;\nfor node i P S, j P t1, . . . , miu do\n\nChoose Aj P 2Sztiu uniformly at random ;\nCompute\nfipAjq \u201c PpXi \u201c 0|XSztiu \u201c xAj\nSztiuq ;\njk for Bk P \u03c1i (Eq (11) (12)) ;\nCompute Mi\nSolve for \u03b2i using compressed sensing (Eq\n(13)) ;\nif \u03b2ipBq \u201c 0 for all |B| \u0105 1 then\n\nT \u00d0 T Y tiu ;\n\u02c6\u03c0piq \u00d0 YB:\u03b2ipBq\u20300B ;\n\nend\n\nend\n\n4 Analysis in Finite Sample Regime\n\nSo far our results have been in the population setting where we assumed that we had access to the true\nconditional probabilities. However, generally this is not the case and we have to work with a \ufb01nite\nnumber of samples from the black-box. In this section, we provide theoretical results for different\n\ufb01nite sample regimes.\n\n4.1 Without access to any observational data\n\nIn this setting, we assume that we only have access to a black-box which outputs observational\ndata for our conditional probability queries. 
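For intuition, the peeling recursion of Algorithms 1 and 2 can be sketched as follows; here the helper `terminals`, computed from a known DAG, is a stand-in for the compressed-sensing test that Algorithm 2 performs from black-box queries:

```python
def terminals(S, parent_map):
    """Oracle standing in for getTerminalNodes: in the sub-DAG induced on S,
    a node is terminal iff no node of S lists it as a parent."""
    return {i for i in S if not any(i in parent_map[j] for j in S)}

def get_parents(V, parent_map):
    """Skeleton of Algorithm 1: repeatedly peel off terminal nodes, recording
    their parents, until at most two nodes remain (the |S| >= 3 loop)."""
    S, pi_hat = set(V), {}
    while len(S) >= 3:
        T = terminals(S, parent_map)
        for t in T:
            pi_hat[t] = set(parent_map[t])
        S -= T
    for i in S:           # the last (at most two) nodes get empty parent sets
        pi_hat[i] = set()
    return pi_hat

# Toy collider-and-chain DAG: 1 -> 3 <- 2 and 3 -> 4.
toy = {1: set(), 2: set(), 3: {1, 2}, 4: {3}}
pi_hat = get_parents([1, 2, 3, 4], toy)
```

On this toy DAG the loop first peels node 4, then node 3, and assigns the two remaining nodes empty parent sets, mirroring the stopping rule of Algorithm 1. Note that a terminal node's parents are never peeled before the node itself, so the recorded parent sets are always inside the current S.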
One selection of nodes consists of \ufb01xing Xi and then\nmeasuring Xi for each node i. We need only 1 selection for each node. Thus the total number of\nselections for all the nodes is n. One query amounts to \ufb01xing Xi to a particular xi. Note that while\n2n\u00b41 such queries are possible for each selection on each node, we only conduct mi queries for each\nnode i.\n\n6\n\n\fNumber of Queries. We measure gipAjq by querying for fipAjq. Let |fipAjq \u00b4 gipAjq| \u010f\n\u0001j,@Aj P Ai for some \u0001j \u0105 0. Once we have the noisy measurements of gipAjq, we can get a good\napproximation of \u02c6gi by solving the following optimization problem for each node i.\n\n\u03b2i \u201c min\n\n\u02c6giPR|\u03c1i| }\u02c6gi}1\n\ns.t.}Mi\u02c6gi \u00b4 fi}2 \u010f \u0001 where \u0001 \u201c\n\n(13)\n\nd \u00ff\n\n\u00012\nj .\n\nAjPAj\n\nTheorem 3. Suppose \u02c6gi is constructed by computing \u02c6gipBkq using Bk from a \ufb01xed collection \u03c1i as\nde\ufb01ned in Equation (11). Furthermore, suppose gi is computed by selecting mi sets Aj uniformly at\nrandom from 2i. We de\ufb01ne the matrix Mi as in equation (12). Then there exist universal constants\nC1, C2 \u0105 0 such that if, mi \u011b maxpC1|supportp\u02c6giq| log4pn `\n\u03b4q and\n\u03b2i is solved using equation (13). Then with probability at least 1 \u00b4 \u03b4, we have }\u03b2i \u00b4 \u02c6gi}2 \u010f\nfor some universal constant C3 \u0105 0. If the minimum non-zero element of |\u02c6gi| is greater\nC3\n\u0001?\nthan 2C3\nthen \u03b2i recovers \u02c6gi up to the signs. 
Furthermore, if Assumption 4 is satis\ufb01ed then\n,@B P \u03c1i,|B| \u201c 2 if and only if i is a terminal node and \u02c6\u03c0piq \u201c tB | |B| \u201c\n|\u03b2ipBq| \u010f C3\nmi\n\u0001?\nu correctly recovers the parents of the terminal node i, i.e., \u02c6\u03c0piq \u201c \u03c0Gpiq.\n1, |\u03b2ipBq| \u0105 C3\nmi\n\u0001?\nmi\nApplying this recursively shows the correctness of Algorithm 1.\n\nq, C2|supportp\u02c6giq| log 1\n\nn\u00b41\n2\n\n\u0001?\nmi\n\n`\n\n\u02d8\n\nk\n2\n\n` k ` 1. Thus the num-\nThe sparsity for each node would be less than or equal to\nber of queries needed for each node (using arguments from Theorem 3) would be of order\n\u03b4qq. At the \ufb01rst iteration, we query all the nodes. From the next iteration\nOpmaxpk2 log4 n, k2 log 1\nonwards, we query for only the nodes which had terminal nodes as their children,i.e., for a maximum\n\u03b4qq. We\nof k nodes. Thus the total number of queries needed would be Opmaxpnk3 log4 n, nk3 log 1\ncan recover parents for terminal nodes using Theorem 3.\nSample and Time Complexity. The sample complexity is Opmaxp nk3 log4 n\nlog log nq, nk3\nB for details).\n\n\u03b4plog k ` log log nqq and the time complexity is Opn4k\n\nplog k `\nn log nq (See Appendix\n\n\u00012 log 1\n\n?\n\n\u00012\n\n`\n\n\u02d8\n\n4.2 With Access to Some Observational Data\n\nIn this setting, we have access to some observational data as well. We can use the observational data\nto \ufb01gure out the Markov blanket of each node which helps us reduce number of selected variables in\nthe conditional probability queries. Once we have the Markov blanket, we only select the nodes in\nMBGpiq for each query. We need only 1 selection for each node. Thus the total number of selections\nfor all nodes does not exceed n.\nUsing Observational Data. Recall that P is the true joint distribution over the nodes of a Bayesian\nnetwork BpG,Pq. 
We de\ufb01ne a collection of distributions P over the nodes of the Bayesian network\nas: P \u201c\n\n(cid:32)\n(\nP is faithful to G. |PpXi|Xlq \u201c PpXi|Xlq,@i, l P t1, . . . , nu\n\nComputing the Markov Blanket from Observational Data. Consider a probability distribution\n\u02c6P P P on the nodes of the Bayesian network such that each node i is rank 2 with respect to MBGpiq\nand \u02c6P . This allows us to recover the Markov blanket of the node using the observational data.\nTheorem 4. If there exists a probability distribution \u02c6P P P such that each node i is rank 2 with\nrespect to MBGpiq and \u02c6P , then the Markov blanket of a node i can be recovered by solving the\nfollowing system of equations:\nPpXi \u201c 0, Xl \u201c 0q \u201c \u02dcQipXi \u201c 0qPpXl \u201c 0q `\n\n\u02dcQijpXi \u201c 0, Xj \u201c 0qPpXj \u201c 0, Xl \u201c 0q\n\n\u00ff\n\n` \u02dcQilpXi \u201c 0, Xl \u201c 0qPpXl \u201c 0q, @ l \u201c t1, . . . , nu, l \u2030 i\n\nPpXi \u201c 0q \u201c \u02dcQipXi \u201c 0q `\n\n\u02dcQijpXi \u201c 0, Xj \u201c 0qPpXj \u201c 0q\n\n\u00ff\n\njP\u00b4i\nj\u2030l\n\njP\u00b4i\nj\u2030l\n\n7\n\n\fwhich can be written in a more compact form:\n\ny \u201c Aq\n\n(14)\nwhere y P Rn and A P Rn\u02c6n and q P Rn. The entries of y are indexed by l \u201c t1 . . . nu such that\nyl \u201c PpXi \u201c 0, Xl \u201c 0q when l \u2030 i and yl \u201c PpXi \u201c 0q when l \u201c i. The entries of A are indexed\nby l, j P t1, . . . , nu, where Alj \u201c PpXl \u201c 0, Xj \u201c 0q for l \u2030 i, j \u2030 i, j \u2030 l and , Alj \u201c PpXl \u201c 0q\nwhen l \u201c j, l \u2030 i, Alj \u201c PpXl \u201c 0q for l \u2030 i, j \u201c i, Alj \u201c PpXj \u201c 0q for l \u201c i, j \u2030 i and Alj \u201c 1\nfor l \u201c i, j \u201c i. The entries of q are indexed by j P t1, . . . 
, nu such that qj \u201c \u02dcQijpXi \u201c 0, Xj \u201c 0q\nfor j \u2030 i and qj \u201c \u02dcQipXi \u201c 0q for j \u201c i.\nFor terminal nodes, existence of \u02c6P P P as P P P is guaranteed. To ensure that \u02c6P P P also exists for\nnon-terminal nodes, we make the following assumption:\nAssumption 5. The population matrix A P Rn\u02c6n as de\ufb01ned in equation (14) is positive de\ufb01nite.\nThis assumption is not strong. We can, in fact, show that A is a positive semide\ufb01nite matrix.\nLemma 1. The population matrix A as de\ufb01ned in equation (14) is a positive semide\ufb01nite matrix.\n\nWe can solve Equation (14) to get \u02dcQi and \u02dcQij. The Markov blanket of node i is computed by\nMBGpiq \u201c tj| \u02dcQij \u2030 0u. To this end, we prove that:\nLemma 2. If \u02dcQijp\u00a8,\u00a8q,@j P t1, . . . , nu, j \u2030 i is computed by solving system of linear equations (14)\nand \u02c6P P P is faithful to G then \u02dcQijp\u00a8,\u00a8q \u2030 0,@j P t1, . . . , nu, j \u2030 i if and only if j P MBGpiq.\nOnce we know the Markov blanket for each node i, the queries in Algorithm 2 can be changed from\nfipAjq \u201c PpXi \u201c 0|XSztiu \u201c xAj\nSXMBiq which helps\nin reducing the sample and time complexity.\nNumber of Queries. Again, let | MBGpiq| \u010f k,@i P t1, . . . , nu. The sparsity for each node would\n` k ` 1. Thus number of queries needed for each node (using arguments\nbe less than or equal to\nfrom Theorem 3) would be of order Opmaxpk2 log4 k, k2 log 1\n\u03b4q. As before, these queries are repeated\nnk times. Thus the total number of queries needed would be Opmaxpnk3 log4 k, nk3 log 1\n\nSztiuq to fipAjq \u201c PpXi \u201c 0|XSXMBi \u201c xAj\n\n\u02d8\n\n`\n\nk\n2\n\n\u03b4qq.\n\n`\n\n\u02d8\n\nn\n2\n\n` 3nq \u00b4 N \u00012\n\nSample and Time Complexity. 
We use the following lemma to get the sample complexity for the\nobservational data.\n\u00012 q i.i.d observations are suf\ufb01cient to measure elements of A and y, \u0001 close\nLemma 3. N \u201c Op log n\nto their true value. That is |A \u00b4 \u02c6A| \u010f \u0001 and |y \u00b4 \u02c6y| \u010f \u0001, for some \u0001 \u0105 0 with probability at least\n1 \u00b4 2 expplogp\n2 q for some \u0001 \u0105 0 where \u02c6A and \u02c6y are the empirical measurements of\nA and y respectively and | \u00a8 \u00b4 \u00a8 | denotes componentwise comparison for matrices.\nAt this point, it remains to be shown that we can still recover the Markov blanket for the nodes using\nthe noisy measurements of unary and pairwise marginals. Below, we prove that this is true as long as\nA is well conditioned.\nLemma 4. Let \u02c6A and \u02c6y be the empirical measurements of A and y as de\ufb01ned in equation (14)\nrespectively such that |\u02c6A \u00b4 A| \u010f \u0001 and |\u02c6y \u00b4 y| \u010f \u0001 for some \u0001 \u0105 0, where | \u00a8 \u00b4 \u00a8 | denotes\ncomponentwise comparison for matrices. Let \u02c6q be the solution to the system of linear equations given\nby \u02c6y \u201c \u02c6A\u02c6q and \u03b7\u03ba8pAq \u010f 1, then \u02c6q recovers q up to signs as long as N \u201c Opnq i.i.d. measurements\n\u00b41}8 is the condition\nare used to measure \u02c6A and maxi |qi|\nnumber of A and \u03b7 \u201c maxp\nThe time complexity of computing the Markov Blanket is Opn4q. 
The sample complexity for the black-box queries is $O(\max(\frac{nk^3 \log^5 k}{\epsilon^2}, \frac{nk^3}{\epsilon^2} \log \frac{1}{\delta} \log k))$ and the time complexity is $O(nk^4 \sqrt{k} \log k)$ (see Appendix C for details). For synthetic experiments validating our theory, please see Appendix D.

Concluding Remarks. In this paper, we provide a novel method with theoretical guarantees to recover the directed structure of a Bayesian network using black-box queries, and we further improve these results when we have access to some observational data. We developed our theory for rank-2 CPTs; it extends readily to more general rank-$k$ CPTs. It would be interesting to see whether similar results can be obtained for a Bayesian network with low rank CPTs using purely observational or interventional data.

References

Anderson, T. W. (1962). An Introduction to Multivariate Statistical Analysis. Wiley, New York.

Bello, K. and Honorio, J. (2018). Computationally and Statistically Efficient Learning of Causal Bayes Nets Using Path Queries. In Advances in Neural Information Processing Systems, pages 10931–10941.

Cheng, J., Greiner, R., Kelly, J., Bell, D., and Liu, W. (2002). Learning Bayesian Networks From Data: An Information-Theory Based Approach. Artificial Intelligence, 137(1-2):43–90.

Chickering, D. (1996). Learning Bayesian Networks Is NP-Complete. Learning from Data, pages 121–130.

Cussens, J. (2008). Bayesian Network Learning by Compiling to Weighted MAX-SAT. In Uncertainty in Artificial Intelligence.

Dvoretzky, A., Kiefer, J., and Wolfowitz, J. (1956). Asymptotic Minimax Character of the Sample Distribution Function and of the Classical Multinomial Estimator. The Annals of Mathematical Statistics, pages 642–669.

Eaton, D.
and Murphy, K. (2007). Exact Bayesian Structure Learning From Uncertain Interventions. In Artificial Intelligence and Statistics, pages 107–114.

Friedman, N., Nachman, I., and Pe'er, D. (1999). Learning Bayesian Network Structure From Massive Datasets: The Sparse Candidate Algorithm. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 206–215. Morgan Kaufmann Publishers Inc.

Hauser, A. and Bühlmann, P. (2012). Two Optimal Strategies for Active Learning of Causal Models From Interventions. In Proceedings of the 6th European Workshop on Probabilistic Graphical Models.

He, Y. and Geng, Z. (2008). Active Learning of Causal Networks With Intervention Experiments and Optimal Designs. Journal of Machine Learning Research.

Higham, N. J. (1994). A Survey of Componentwise Perturbation Theory, volume 48. American Mathematical Society.

Jaakkola, T., Sontag, D., Globerson, A., and Meila, M. (2010). Learning Bayesian Network Structure Using LP Relaxations. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 358–365.

Kocaoglu, M., Shanmugam, K., and Bareinboim, E. (2017). Experimental Design for Learning Causal Graphs With Latent Variables. In Advances in Neural Information Processing Systems, pages 7018–7028.

Koivisto, M. and Sood, K. (2004). Exact Bayesian Structure Discovery in Bayesian Networks. Journal of Machine Learning Research, 5(May):549–573.

Margaritis, D. and Thrun, S. (2000). Bayesian Network Induction via Local Neighborhoods. In Advances in Neural Information Processing Systems, pages 505–511.

Moore, A. and Wong, W.-K. (2003). Optimal Reinsertion: A New Search Operator for Accelerated and More Accurate Bayesian Network Structure Learning. In International Conference on Machine Learning, volume 3, pages 552–559.

Murphy, K. P. (2001).
Active Learning of Causal Bayes Net Structure. Technical report.

Rauhut, H. (2010). Compressive Sensing and Structured Random Matrices. Theoretical Foundations and Numerical Methods for Sparse Recovery, 9:1–92.

Silander, T. and Myllymäki, P. (2006). A Simple Approach for Finding the Globally Optimal Bayesian Network Structure. In Uncertainty in Artificial Intelligence, pages 445–452.

Spirtes, P., Glymour, C. N., and Scheines, R. (2000). Causation, Prediction, and Search. MIT Press.

Stobbe, P. and Krause, A. (2012). Learning Fourier Sparse Set Functions. In Artificial Intelligence and Statistics, pages 1125–1133.

Tong, S. and Koller, D. (2001). Active Learning for Structure in Bayesian Networks. In International Joint Conference on Artificial Intelligence, 17:863–869.

Triantafillou, S. and Tsamardinos, I. (2015). Constraint-Based Causal Discovery From Multiple Interventions Over Overlapping Variable Sets. Journal of Machine Learning Research.

Tsamardinos, I., Brown, L. E., and Aliferis, C. F. (2006). The Max-Min Hill-Climbing Bayesian Network Structure Learning Algorithm. Machine Learning, 65(1):31–78.

Xie, X. and Geng, Z. (2008). A Recursive Method for Structural Learning of Directed Acyclic Graphs. Journal of Machine Learning Research, 9(Mar):459–483.

Yehezkel, R. and Lerner, B. (2005). Recursive Autonomy Identification for Bayesian Network Structure Learning. In Artificial Intelligence and Statistics, pages 429–436.