{"title": "Bayesian Network Score Approximation using a Metagraph Kernel", "book": "Advances in Neural Information Processing Systems", "page_first": 1833, "page_last": 1840, "abstract": "Many interesting problems, including Bayesian network structure-search, can be cast in terms of finding the optimum value of a function over the space of graphs. However, this function is often expensive to compute exactly. We here present a method derived from the study of reproducing-kernel Hilbert spaces which takes advantage of the regular structure of the space of all graphs on a fixed number of nodes to obtain approximations to the desired function quickly and with reasonable accuracy. We then test this method on both a small testing set and a real-world Bayesian network; the results suggest that not only is this method reasonably accurate, but that the BDe score itself varies quadratically over the space of all graphs.", "full_text": "Bayesian Network Score Approximation using a\n\nMetagraph Kernel\n\nBenjamin Yackley\n\nEduardo Corona\n\nDepartment of Computer Science\n\nCourant Institute of Mathematical Sciences\n\nUniversity of New Mexico\n\nNew York University\n\nTerran Lane\n\nDepartment of Computer Science\n\nUniversity of New Mexico\n\nAbstract\n\nMany interesting problems, including Bayesian network structure-search,\ncan be cast in terms of \ufb01nding the optimum value of a function over the\nspace of graphs. However, this function is often expensive to compute\nexactly. We here present a method derived from the study of Reproducing\nKernel Hilbert Spaces which takes advantage of the regular structure of the\nspace of all graphs on a \ufb01xed number of nodes to obtain approximations\nto the desired function quickly and with reasonable accuracy. 
We then test this method on both a small testing set and a real-world Bayesian network; the results suggest that not only is this method reasonably accurate, but that the BDe score itself varies quadratically over the space of all graphs.\n\n1 Introduction\n\nThe problem we address in this paper is, broadly speaking, function approximation. Specifically, the application we present here is that of estimating scores on the space of Bayesian networks as a first step toward a quick way to obtain a network which is optimal given a set of data. Usually, the search process requires a full recomputation of the posterior likelihood of the graph at every step, and is therefore slow. We present a new approach to the problem of approximating functions such as this one, where the mapping is of an object (the graph, in this particular case) to a real number (its BDe score). In other words, we have a function $f : \\Gamma_n \\to \\mathbb{R}$ (where $\\Gamma_n$ is the set of all directed graphs on $n$ nodes) from which we have a small number of samples, and we would like to interpolate the rest. The technique hinges on the set $\\Gamma_n$ having a structure which can be factored into a Cartesian product, as well as on the function we approximate being smooth over this structure.\n\nAlthough Bayesian networks are by definition acyclic, our approximation technique applies to the general directed-graph case. Because a given directed graph has $n^2$ possible edges, we can imagine the set of all graphs as itself being a Hamming cube of degree $n^2$ \u2013 a \u201cmetagraph\u201d with $2^{n^2}$ nodes, since each edge can be independently present or absent. We say that two graphs are connected with an edge in our metagraph if they differ in one and only one edge. We can similarly identify each graph with a bit string by \u201cunraveling\u201d the adjacency matrix into a long string of zeros and ones. 
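As a small illustration of this identification, the following Python sketch unravels an adjacency matrix into a bit string and lists the metagraph neighbours of a graph, that is, the graphs differing from it in exactly one edge. The function names and the row-by-row ordering of the bits are assumptions made for this example only; any fixed ordering of the edges would serve equally well.

```python
def graph_to_bits(adj):
    # Flatten an n-by-n 0/1 adjacency matrix row by row (assumed ordering).
    return tuple(bit for row in adj for bit in row)

def bits_to_graph(bits, n):
    # Inverse of graph_to_bits: cut the bit string back into n rows.
    return [list(bits[r * n:(r + 1) * n]) for r in range(n)]

def metagraph_neighbours(bits):
    # Flip each position once; every result differs from bits in one edge.
    return [bits[:p] + (1 - bits[p],) + bits[p + 1:] for p in range(len(bits))]

def hamming(a, b):
    # Number of positions in which two bit strings differ.
    return sum(x != y for x, y in zip(a, b))

# A three-node chain 0 -> 1 -> 2 as a hypothetical example graph.
adj = [[0, 1, 0],
       [0, 0, 1],
       [0, 0, 0]]
bits = graph_to_bits(adj)
neighbours = metagraph_neighbours(bits)
```

Each of the nine neighbours is at Hamming distance one from the original bit string, which is exactly the adjacency relation of the hypercube.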
However, if we know beforehand an ordering on the nodes of our graph to which all directed graphs must stay consistent (to enforce acyclicness), then there are only $\\binom{n}{2}$ possible edges, and the size of our metagraph drops to $2^{\\binom{n}{2}}$. The same correspondence can then be made between these graphs and bit strings of length $\\binom{n}{2}$.\n\nSince the eigenvectors of the Laplacian of a graph form a basis for all smooth functions on the graph, we can use our known sampled values (which correspond to a mapping from a subset of nodes on our metagraph to the real numbers) to interpolate the others. Despite the incredible size of the metagraph, we show that this problem is by no means intractable, and functions can in fact be approximated in polynomial time. We also demonstrate this technique both on a small network for which we can exhaustively compute the score of every possible directed acyclic graph, and on a larger real-world network. The results show that the method is accurate, and additionally suggest that the BDe scoring metric used is quadratic over the metagraph.\n\n2 Spectral Properties of the Hypercube\n\n2.1 The Kronecker Product and Kronecker Sum\n\nThe matrix operators known as the Kronecker product and Kronecker sum, denoted $\\otimes$ and $\\oplus$ respectively, play a key role in the derivation of the spectral properties of the hypercube. Given matrices $A \\in \\mathbb{R}^{i \\times j}$ and $B \\in \\mathbb{R}^{k \\times l}$, $A \\otimes B$ is the matrix in $\\mathbb{R}^{ik \\times jl}$ such that:\n\n$$A \\otimes B = \\begin{bmatrix} a_{11}B & a_{12}B & \\cdots & a_{1j}B \\\\ a_{21}B & a_{22}B & \\cdots & a_{2j}B \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ a_{i1}B & a_{i2}B & \\cdots & a_{ij}B \\end{bmatrix}$$\n\nThe Kronecker sum is defined over a pair of square matrices $A \\in \\mathbb{R}^{m \\times m}$ and $B \\in \\mathbb{R}^{n \\times n}$ as $A \\oplus B = A \\otimes I_n + I_m \\otimes B$, where $I_n$ denotes an $n \\times n$ identity matrix[8].\n\n2.2 Cartesian Products of Graphs\n\nThe Cartesian product of two graphs $G_1$ and $G_2$, denoted $G_1 \\times G_2$, is intuitively defined as the result of replacing every node in $G_1$ with a copy of $G_2$ and connecting corresponding edges together. More formally, if the product is the graph $G = G_1 \\times G_2$, then the vertex set of $G$ is the Cartesian product of the vertex sets of $G_1$ and $G_2$. In other words, for any vertex $v_1$ in $G_1$ and any vertex $v_2$ in $G_2$, there exists a vertex $(v_1, v_2)$ in $G$. Additionally, the edge set of $G$ is such that, for any edge $(u_1, u_2) \\to (v_1, v_2)$ in $G$, either $u_1 = v_1$ and $u_2 \\to v_2$ is an edge in $G_2$, or $u_2 = v_2$ and $u_1 \\to v_1$ is an edge in $G_1$[7].\n\nIn particular, the set of hypercube graphs (or, identically, the set of Hamming cubes) can be derived using the Cartesian product operator. If we denote the graph of an $n$-dimensional hypercube as $Q_n$, then $Q_{n+1} = Q_n \\times Q_1$, where the graph $Q_1$ is a two-node graph with a single bidirectional edge.\n\n2.3 Spectral Properties of Cartesian Products\n\nThe Cartesian product has the property that, if we denote the adjacency matrix of a graph $G$ as $A(G)$, then $A(G_1 \\times G_2) = A(G_1) \\oplus A(G_2)$. 
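The adjacency identity just stated can be checked numerically on small graphs. The sketch below, in plain Python with matrices as lists of rows, builds the adjacency matrix of a Cartesian product twice: once from the Kronecker sum and once directly from the edge rule of Section 2.2. All function names are illustrative, and the pairing of vertex (u1, u2) with index u1 * n + u2 is an assumed convention chosen to match the Kronecker ordering.

```python
def kron(a, b):
    # Kronecker product of two matrices given as lists of rows.
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]

def identity(n):
    return [[1 if r == c else 0 for c in range(n)] for r in range(n)]

def mat_add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def kron_sum(a, b):
    # Kronecker sum of an m-by-m matrix a and an n-by-n matrix b:
    # (a kron I_n) + (I_m kron b).
    return mat_add(kron(a, identity(len(b))), kron(identity(len(a)), b))

def cartesian_adjacency(a, b):
    # Adjacency matrix built directly from the edge rule: two product
    # vertices are adjacent when one coordinate agrees and the other
    # coordinate is an edge in its own graph.
    m, n = len(a), len(b)
    out = [[0] * (m * n) for _ in range(m * n)]
    for u1 in range(m):
        for u2 in range(n):
            for v1 in range(m):
                for v2 in range(n):
                    if (u1 == v1 and b[u2][v2]) or (u2 == v2 and a[u1][v1]):
                        out[u1 * n + u2][v1 * n + v2] = 1
    return out

q1 = [[0, 1], [1, 0]]                    # two nodes, one edge
p3 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]   # path on three nodes
q2 = kron_sum(q1, q1)                    # the square, i.e. the 2-cube
```

The two constructions agree for loop-free graphs because at most one of the two Kronecker terms can be nonzero in any given entry.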
Additionally, if $A(G_1)$ has $m$ eigenvectors $\\phi_k$ and corresponding eigenvalues $\\lambda_k$ (with $k = 1 \\ldots m$) while $A(G_2)$ has $n$ eigenvectors $\\psi_l$ with corresponding eigenvalues $\\mu_l$ (with $l = 1 \\ldots n$), then the full spectral decomposition of $A(G_1 \\times G_2)$ is simple to obtain by the properties of the Kronecker sum; $A(G_1 \\times G_2)$ will have $mn$ eigenvectors, each of them of the form $\\phi_k \\otimes \\psi_l$ for every possible $\\phi_k$ and $\\psi_l$ in the original spectra, and each of them having the corresponding eigenvalue $\\lambda_k + \\mu_l$[2].\n\nIt should also be noted that, because hypercubes are all $k$-regular graphs (in particular, the hypercube $Q_n$ is $n$-regular), the form of the normalized Laplacian becomes simple. The usual formula for the normalized Laplacian is:\n\n$$\\tilde{L} = I - D^{-1/2} A D^{-1/2}$$\n\nHowever, since the graph is regular, we have $D = kI$, and so\n\n$$\\tilde{L} = I - (kI)^{-1/2} A (kI)^{-1/2} = I - \\frac{1}{k} A.$$\n\nAlso note that, because the formula for the combinatorial Laplacian is $L = D - A$, we also have $\\tilde{L} = \\frac{1}{k} L$.\n\nThe Laplacian also distributes over graph products, as shown in the following theorem.\n\nTheorem 1 Given two simple, undirected graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$, with combinatorial Laplacians $L_{G_1}$ and $L_{G_2}$, the combinatorial Laplacian of the Cartesian product graph $G_1 \\times G_2$ is then given by:\n\n$$L_{G_1 \\times G_2} = L_{G_1} \\oplus L_{G_2}$$\n\nProof.\n\n$$L_{G_1} = D_{G_1} - A(G_1)$$\n$$L_{G_2} = D_{G_2} - A(G_2)$$\n\nHere, $D_G$ denotes the degree diagonal matrix of the graph $G$. Now, by the definition of the Laplacian,\n\n$$L_{G_1 \\times G_2} = D_{G_1 \\times G_2} - A(G_1) \\oplus A(G_2)$$\n\nHowever, the degree of any vertex $uv$ in the Cartesian product is $\\deg(u) + \\deg(v)$, because all edges incident to a vertex will either be derived from one of the original graphs or the other, leading to corresponding nodes in the product graph. 
So, we have\n\n$$D_{G_1 \\times G_2} = D_{G_1} \\oplus D_{G_2}$$\n\nSubstituting this in, and writing $m$ and $n$ for the numbers of vertices of $G_1$ and $G_2$ respectively, we obtain\n\n$$L_{G_1 \\times G_2} = (D_{G_1} \\oplus D_{G_2}) - (A(G_1) \\oplus A(G_2))$$\n$$= D_{G_1} \\otimes I_n + I_m \\otimes D_{G_2} - A(G_1) \\otimes I_n - I_m \\otimes A(G_2)$$\n$$= D_{G_1} \\otimes I_n - A(G_1) \\otimes I_n + I_m \\otimes D_{G_2} - I_m \\otimes A(G_2)$$\n\nBecause the Kronecker product is distributive over addition[8],\n\n$$L_{G_1 \\times G_2} = (D_{G_1} - A(G_1)) \\otimes I_n + I_m \\otimes (D_{G_2} - A(G_2)) = L_{G_1} \\oplus L_{G_2}$$\n\nAdditionally, if $G_1 \\times G_2$ is $k$-regular,\n\n$$\\tilde{L}_{G_1 \\times G_2} = \\tilde{L}_{G_1} \\oplus \\tilde{L}_{G_2} = \\frac{1}{k}(L_{G_1} \\oplus L_{G_2})$$\n\nTherefore, since the combinatorial Laplacian operator distributes across a Kronecker sum, we can easily find the spectrum of the Laplacian of an arbitrary hypercube through a recursive process if we just find the spectrum of the Laplacian of $Q_1$.\n\n2.4 The Spectrum of the Hypercube Qn\n\nFirst, consider that\n\n$$A(Q_1) = \\begin{bmatrix} 0 & 1 \\\\ 1 & 0 \\end{bmatrix}.$$\n\nThis is a $k$-regular graph with $k = 1$. So,\n\n$$L_{Q_1} = I - \\frac{1}{k} A(Q_1) = \\begin{bmatrix} 1 & -1 \\\\ -1 & 1 \\end{bmatrix}$$\n\nIts eigenvectors and eigenvalues can be easily computed; it has the eigenvector $[1\\ 1]^T$ with eigenvalue 0 and the eigenvector $[1\\ {-1}]^T$ with eigenvalue 2. We can use these to compute the four eigenvectors of $L_{Q_2}$, the Laplacian of the 2-dimensional hypercube; $L_{Q_2} = L_{Q_1 \\times Q_1} = L_{Q_1} \\oplus L_{Q_1}$, so the four possible Kronecker products are $[1\\ 1\\ 1\\ 1]^T$, $[1\\ 1\\ {-1}\\ {-1}]^T$, $[1\\ {-1}\\ 1\\ {-1}]^T$, and $[1\\ {-1}\\ {-1}\\ 1]^T$, with corresponding eigenvalues 0, 1, 1, and 2 (renormalized by a factor of $\\frac{1}{2}$ to take into account that our new hypercube is now degree 2 instead of degree 1; the combinatorial Laplacian would require no normalization). 
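The recursive construction can be checked on a small cube. The following Python sketch assembles the combinatorial Laplacian of the n-cube by repeated Kronecker sums of the Laplacian of Q1, forms candidate eigenvectors as Kronecker products of the two eigenvectors of Q1, and verifies the eigenvalue equation entry by entry. The helper names are illustrative assumptions, and the eigenvalues here are the combinatorial ones (0, 2, 4, ...); dividing by n gives the normalized values.

```python
from collections import Counter
from itertools import product

def kron(a, b):
    # Kronecker product of two matrices given as lists of rows.
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]

def identity(n):
    return [[1 if r == c else 0 for c in range(n)] for r in range(n)]

def kron_sum(a, b):
    # (a kron I) + (I kron b) for square a and b.
    left = kron(a, identity(len(b)))
    right = kron(identity(len(a)), b)
    return [[x + y for x, y in zip(rl, rr)] for rl, rr in zip(left, right)]

L_Q1 = [[1, -1], [-1, 1]]
EIGENPAIRS_Q1 = [([1, 1], 0), ([1, -1], 2)]

def hypercube_laplacian(n):
    # Combinatorial Laplacian of Q_n via n - 1 Kronecker sums.
    lap = L_Q1
    for _ in range(n - 1):
        lap = kron_sum(lap, L_Q1)
    return lap

def hypercube_eigenpairs(n):
    # Every choice of one Q1 eigenvector per dimension gives one
    # eigenvector (their Kronecker product) whose eigenvalue is the sum.
    pairs = []
    for choice in product(EIGENPAIRS_Q1, repeat=n):
        vec, val = [1], 0
        for v, lam in choice:
            vec = [x * y for x in vec for y in v]
            val += lam
        pairs.append((vec, val))
    return pairs

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

n = 3
lap = hypercube_laplacian(n)
pairs = hypercube_eigenpairs(n)
all_ok = all(matvec(lap, vec) == [val * x for x in vec] for vec, val in pairs)
multiplicity = Counter(val for _, val in pairs)
```

For n = 3 the eight vectors all satisfy the eigenvalue equation, and the distinct eigenvalues appear with binomial multiplicities.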
It should be noted here that an $n$-dimensional hypercube graph will have $2^n$ eigenvalues with only $n + 1$ distinct values; they will be the values $\\frac{2k}{n}$ for $k = 0 \\ldots n$, each of which will have multiplicity $\\binom{n}{k}$[4].\n\nIf we arrange the four eigenvectors of $L_{Q_2}$ as columns of a matrix in the proper order, a familiar shape emerges:\n\n$$\\begin{bmatrix} 1 & 1 & 1 & 1 \\\\ 1 & -1 & 1 & -1 \\\\ 1 & 1 & -1 & -1 \\\\ 1 & -1 & -1 & 1 \\end{bmatrix}$$\n\nThis is, in fact, the Hadamard matrix of order 4, just as placing our original two eigenvectors side-by-side creates the order-2 Hadamard matrix. In fact, the eigenvectors of the Laplacian on a hypercube are simply the columns of a Hadamard matrix of the appropriate size; this can be seen by the recursive definition of the Hadamard matrix in terms of the Kronecker product:\n\n$$H_{2^{n+1}} = H_{2^n} \\otimes H_2$$\n\nRecall that the eigenvectors of the Kronecker sum of two matrices are themselves all possible Kronecker products of eigenvectors of those matrices. Since hypercubes can be recursively constructed using Kronecker sums, the basis for smooth functions on hypercubes (i.e. the set of eigenvectors of their graph Laplacian) is the Hadamard basis. Consequently, there is no need to ever compute a full eigenvector explicitly; there is an explicit formula for a given entry of any Hadamard matrix:\n\n$$(H_{2^n})_{ij} = (-1)^{\\langle b_i, b_j \\rangle}$$\n\nThe notation $b_x$ here means \u201cthe $n$-bit binary expansion of $x$ interpreted as a vector of 0s and 1s\u201d. This is the key to computing our kernel efficiently, not only because it takes very little time to compute arbitrary elements of eigenvectors, but because we are free to compute only the elements we need instead of entire eigenvectors at once.\n\n3 The Metagraph Kernel\n\n3.1 The Optimization Framework\n\nGiven the above, we now formulate the regression problem that will allow us to approximate our desired function at arbitrary points. 
Given a set of $k$ observations $\\{y_i\\}_{i=1}^{k}$ corresponding to nodes $x_i$ in the metagraph, we wish to find the $\\hat{f}$ which minimizes the squared error between our estimate and all observed points and which is also a sufficiently smooth function on the graph to avoid overfitting. In other words,\n\n$$\\hat{f} = \\arg\\min_f \\left\\{ \\frac{1}{k} \\sum_{i=1}^{k} \\| f(x_i) - y_i \\|^2 + c f^T L^m f \\right\\}$$\n\nThe variable $m$ in this expression controls the type of smoothing; if $m = 1$, then we are penalizing first-differences (i.e. the gradient of the function). We will take $m = 2$ in our experiments, to penalize second-differences (the usual case when using spline interpolation)[6]. This problem can be formulated and solved within the Reproducing Kernel Hilbert Space framework[9]; consider the space of functions on our metagraph as the sum of two orthogonal spaces, one (called $\\Omega_0$) consisting of functions which are not penalized by our regularization term (which is $c \\hat{f}^T L^m \\hat{f}$), and one (called $\\Omega_1$) consisting of functions orthogonal to those. In the case of our hypercube graph, $\\Omega_0$ turns out to be particularly simple; it consists only of constant functions (i.e. vectors of the form $1^T d$, where $1$ is a vector of all ones). Meanwhile, the space $\\Omega_1$ is formulated under the RKHS framework as a set of columns of the kernel matrix (denoted $K_1$). 
Consequently, we can write $\\hat{f} = 1^T d + K_1 e$, and so our formulation becomes:\n\n$$\\hat{f} = \\arg\\min_f \\left\\{ \\frac{1}{k} \\sum_{i=1}^{k} \\| (1^T d + K_1 e)(x_i) - y_i \\|^2 + c e^T K_1 e \\right\\}$$\n\nThe solution to this optimization problem is for our coefficients $d$ and $e$ to be linear estimates on $y$, our vector of observed values. In other words, there exist matrices $\\Upsilon_d(c, m)$ and $\\Upsilon_e(c, m)$, dependent on our smoothing coefficient $c$ and our exponent $m$, such that:\n\n$$\\hat{d} = \\Upsilon_d(c, m) y$$\n$$\\hat{e} = \\Upsilon_e(c, m) y$$\n$$\\hat{f} = 1^T \\hat{d} + K_1 \\hat{e} = \\Upsilon(c, m) y$$\n\n$\\Upsilon(c, m) = 1^T \\Upsilon_d(c, m) + K_1 \\Upsilon_e(c, m)$ is the influence matrix[9] which provides the function estimate over the entire graph. Because $\\Upsilon(c, m)$ is entirely dependent on the two matrices $\\Upsilon_d$ and $\\Upsilon_e$ as well as our kernel matrix, we can calculate an estimate for any set of nodes in the graph by explicitly calculating only those rows of $\\Upsilon$ which correspond to those nodes and then simply multiplying that sub-matrix by the vector $y$. Therefore, if we have an efficient way to compute arbitrary entries of the kernel matrix $K_1$, we can estimate functions anywhere in the graph.\n\n3.2 Calculating entries of K1\n\nFirst, we must choose an order $r \\in \\{1, 2 \\ldots n\\}$; this is equivalent to selecting the degree of a polynomial used to perform standard interpolation on the hypercube. The effect that $r$ will have on our problem will be to select the set of basis functions we consider; the eigenvectors corresponding to a given eigenvalue $\\frac{2k}{n}$ are the $\\binom{n}{k}$ eigenvectors which divide the space into identically-valued regions which are themselves $(n - k)$-dimensional hypercubes. For example, the 3 eigenfunctions on the 3-dimensional hypercube which correspond to the eigenvalue $\\frac{2}{3}$ (so $k = 1$) are those which separate the space into a positive plane and a negative plane along each of the three axes. Because these eigenfunctions are all equivalent apart from rotation, there is no reason to choose one to be in our basis over another, and so we can say that the total number of eigenfunctions we use in our approximation is equal to $\\sum_{k=0}^{r} \\binom{n}{k}$ for our chosen value of $r$.\n\nEach eigenvector can be identified with a number $l$ corresponding to its position in the natural-ordered Hadamard matrix; the columns where $l$ is an exact power of 2 are ones that alternate in identically-sized blocks of +1 and -1, while the others are element-wise products of the columns corresponding to the ones in $l$\u2019s binary expansion. Therefore, if we use the notation $|x|_1$ to mean \u201cthe number of ones in the binary expansion of $x$\u201d, then choosing the order $r$ is equivalent to choosing a basis of eigenvectors $\\phi_l$ such that $|l|_1$ is less than or equal to $r$. Therefore, we have:\n\n$$(K_1)_{ij} = \\sum_{1 \\le |l|_1 \\le r} \\left( \\frac{n}{2k} \\right)^m H_{il} H_{jl}$$\n\nBecause $k$ is equal to $|l|_1$, and because we already have an explicit form for any $H_{xy}$, we can write\n\n$$(K_1)_{ij} = \\frac{1}{n} \\sum_{1 \\le |l|_1 \\le r} \\left( \\frac{n}{2|l|_1} \\right)^m (-1)^{\\langle b_i, l \\rangle + \\langle b_j, l \\rangle}$$\n$$= \\frac{1}{n} \\sum_{k=1}^{r} \\left( \\frac{n}{2k} \\right)^m \\sum_{|l|_1 = k} (-1)^{\\langle b_i \\dot{\\vee} b_j, l \\rangle}$$\n\nThe $\\dot{\\vee}$ symbol here denotes exclusive-or, which is equivalent to addition mod 2 in this domain. The justification for this is that only the parity of the exponent (odd or even) matters, and locations in the bit strings $b_i$ and $b_j$ which are both zero or both one contribute no change to the overall parity. Notably, this shows that the value of the kernel between any two bit strings $b_i$ and $b_j$ is dependent only on $b_i \\dot{\\vee} b_j$, the key result which allows us to compute these values quickly. 
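To make the dependence on the exclusive-or concrete, the Python sketch below evaluates a kernel entry in two ways. The first is a brute-force sum over every index l with between 1 and r ones. The second uses only the number of ones w in the exclusive-or of i and j, together with a standard counting identity: an index l with k ones that places t of them inside the support of the exclusive-or contributes a sign of (-1)^t, and there are C(w, t) * C(n - w, k - t) such indices. This closed form is offered as one possible shortcut and is an assumption of the example; it is not claimed to be the recursive scheme used in the experiments. The constant prefactor in front of the sum is omitted, and all function names are illustrative.

```python
from math import comb

def popcount(x):
    # Number of ones in the binary expansion of x.
    return bin(x).count('1')

def kernel_entry_bruteforce(i, j, n, r, m):
    # Sum over all l with 1 <= |l|_1 <= r of (n / (2k))^m * H_il * H_jl,
    # where k = |l|_1 and H_xy = (-1)^<b_x, b_y>. Prefactor omitted.
    total = 0.0
    for l in range(1, 2 ** n):
        k = popcount(l)
        if k <= r:
            sign = (-1) ** (popcount(i & l) + popcount(j & l))
            total += (n / (2 * k)) ** m * sign
    return total

def inner_sum(w, n, k):
    # Sum of (-1)^<x, l> over all l with k ones, for any x with w ones.
    return sum((-1) ** t * comb(w, t) * comb(n - w, k - t)
               for t in range(k + 1))

def kernel_entry_fast(i, j, n, r, m):
    # Depends on i and j only through the weight of their exclusive-or.
    w = popcount(i ^ j)
    return sum((n / (2 * k)) ** m * inner_sum(w, n, k)
               for k in range(1, r + 1))

n, r, m = 4, 2, 2
max_gap = max(abs(kernel_entry_bruteforce(i, j, n, r, m)
                  - kernel_entry_fast(i, j, n, r, m))
              for i in range(2 ** n) for j in range(2 ** n))
```

The brute-force version visits on the order of 2^n indices per entry, while the second version needs only on the order of r^2 binomial terms, which is what makes entries of a very large kernel matrix affordable to compute on demand.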
If we let $S_k(b_i, b_j) = \\sum_{|l|_1 = k} (-1)^{\\langle b_i \\dot{\\vee} b_j, l \\rangle}$, there is a recursive formulation for the computation of $S_k(b_i, b_j)$ in terms of $S_{k-1}(b_i, b_j)$, which is the method used in the experiments due to its speed and feasibility of computation.\n\n4 Experiments\n\n4.1 The 4-node Bayesian Network\n\nThe first set of experiments we performed was on a four-node Bayesian Network. We generated a random \u201cbase truth\u201d network and sampled it 1000 times, creating a data set. We then created an exhaustive set of $2^6 = 64$ directed graphs; there are six possible edges in a four-node graph, assuming we already have some sort of node ordering that allows us to orient edges, and so this represented all possibilities. Because we chose the node ordering to be consistent with our base network, one of these graphs was in fact the correct network. We then gave each of the set of 64 graphs a log-marginal-likelihood score (i.e. the BDe score) based on the generated data. As expected, the correct network came out to have the greatest likelihood. Additionally, computation of the Rayleigh quotient shows that the function is a globally smooth one over the graph topology. We then performed a set of experiments using the metagraph kernel.\n\n4.1.1 Randomly Drawn Observations\n\nFirst, we partitioned the set of 64 observations randomly into two groups. The training group ranged in size from 3 to 63 samples, with the rest used as the testing group. We then used the training group as the set of observations, and queried the metagraph kernel to predict the values of the networks in the testing group. We repeated this process 50 times for each of the different sizes of the training group, and averaged the results to obtain Figure 1. Note that order 3 performs the best overall for large numbers of observations, overtaking the order-2 approximation at 41 values observed and staying the best until the end. 
However,\norder 1 performs the best for small numbers of observations (perhaps due to over\ufb01tting\nerrors caused by the higher orders) and order 2 performs the best in the middle. The data\nsuggests that the proper order to which to compute the kernel in order to obtain the best\napproximations is a function of both the size of the space and the number of observations\nmade within that space.\n\n4.1.2 Best/worst-case Observations\n\nSecondly, we performed experiments where the observations were obtained from networks\nwhich were in the neighborhood around the known true maximum, as well as ones from\nnetworks which were as far from it as possible. These results are Figures 2 and 3. Despite\nsmall di\ufb00erences in shape, the results are largely identical, indicating that the distribution\nof the samples throughout \u0393n matters very little.\n\n4.2 The Alarm Network\n\nThe Alarm Bayesian network[1] contains 37 nodes, and has been used in much Bayes-net-\nrelated research[3]. We \ufb01rst generated data according to the true network, sampling it 1000\ntimes, then generated random directed graphs over the 37 nodes to see if their scores could\nbe predicted as well as in the smaller four-node case. We generated two sets of these graphs:\na set of 100, and a set of 1000. We made no attempt to enforce an ordering; although\nthe graphs were all acyclic, we placed no assumption on the graphs being consistent with\nthe same node-ordering as the original. The scores of these sets, calculated using the data\ndrawn from the true network, served as our observed data. 
\n\n[Figure 1: Randomly-drawn samples. Root-mean-squared error (log scale) against number of observed nodes, for kernel orders 1 to 5.]\n\n[Figure 2: Samples drawn near maximum. Root-mean-squared error (log scale) against number of observed nodes, for kernel orders 1 to 6.]\n\n[Figure 3: Samples drawn near minimum. Root-mean-squared error (log scale) against number of observed nodes, for kernel orders 1 to 6.]\n\n[Figure 4: Samples from ALARM network. Root-mean-squared error against order of approximation, for the mean of sampled scores, 100 observations, and 1000 observations.]\n\nWe then used the kernel to approximate, given the observed scores, the score of an additional 100 randomly-generated graphs, with the order of the kernel varying from 1 to 20. The results, with root-mean-squared error plotted against the order of the kernel, are shown in Figure 4. 
Additionally, we calculated a baseline by taking the mean of the 1000 sampled scores and calling that the estimated score for every graph in our testing set.\n\nThe results show that the metagraph approximation method performs significantly better than the baseline for low orders of approximation with higher amounts of sampled data. This makes intuitive sense; the more data there is, the better the approximation should be. Additionally, the spike at order 2 suggests that the BDe score itself varies quadratically over the metagraph. To our knowledge, we are the first to make this observation. In current work, we are analyzing the BDe score in an attempt to analytically validate this empirical observation. If true, this observation may lead to improved optimization techniques for finding the BDe-maximizing Bayesian network. Note, however, that, even if true, exact optimization is still unlikely to be polynomial-time because the quadratic form is almost certainly indefinite and, therefore, NP-hard to optimize.\n\n5 Conclusion\n\nFunctions of graphs to real numbers, such as the posterior likelihood of a Bayesian network given a set of data, can be approximated to a high degree of accuracy by taking advantage of a hypercubic \u201cmetagraph\u201d structure. Because the metagraph is regular, standard techniques of interpolation can be used in a straightforward way to obtain predictions for the values at unknown points.\n\n6 Future Work\n\nAlthough this technique allows for quick and accurate prediction of function values on the metagraph, it offers no hints (as of yet) as to where the maximum of the function might be. Such a hint could, for instance, allow one to generate a Bayesian network which is likely to be close to optimal, and if true optimality is required, that approximate graph could be used as a starting point for a stepwise method such as MCMC. 
Even without a direct way to find such an optimum, though, it may be worth using this approximation technique inside an MCMC search instead of the usual exact-score computation in order to quickly converge on something close to the desired optimum.\n\nAlso, many other problems have a similar flavor. In fact, this technique should be usable unchanged on any problem which involves the computation of a real-valued function over bit strings. For other objects, however, the structure is not necessarily a hypercube. For example, one may desire an approximation to a function of permutations of some number of elements to real numbers. The set of permutations of a given number of elements, denoted $S_n$, has a similarly regular structure (which can be seen as a graph in which two permutations are connected if a single swap leads from one to the other), but not a hypercubic one. The structure-search problem on Bayes Nets can also be cast as a search over orderings of nodes alone[5], so a way to approximate a function over permutations would be useful there as well.\n\nOther domains have this ability to be turned into regular graphs \u2013 the integers mod $n$ with edges between numbers that differ by 1 form a loop, for example. It should be possible to apply a similar trick to obtain function approximations not only on these domains, but on arbitrary Cartesian products of them. So, for instance, remembering that the directions of the edges of a Bayesian network are completely specified given an ordering on the nodes, the network structure search problem on $n$ nodes can be recast as a function approximation over the set $S_n \\times Q_{\\binom{n}{2}}$. 
Many problems can be cast into the metagraph framework; we have only just scratched the surface here.\n\nAcknowledgments\n\nThe authors would like to thank Curtis Storlie and Joshua Neil from the UNM Department of Mathematics and Statistics, as well as everyone in the Machine Learning Reading Group at UNM. This work was supported by NSF grant #IIS-0705681 under the Robust Intelligence program.\n\nReferences\n\n[1] I. Beinlich, H.J. Suermondt, R. Chavez, G. Cooper, et al. The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks. Proceedings of the Second European Conference on Artificial Intelligence in Medicine, 256, 1989.\n\n[2] D.S. Bernstein. Matrix Mathematics: Theory, Facts, and Formulas with Application to Linear Systems Theory. Princeton University Press, 2005.\n\n[3] D.M. Chickering, D. Heckerman, and C. Meek. A Bayesian approach to learning Bayesian networks with local structure. UAI\u201997, pages 80\u201389, 1997.\n\n[4] Fan R. K. Chung. Spectral Graph Theory. Conference Board of the Mathematical Sciences. AMS, 1997.\n\n[5] N. Friedman and D. Koller. Being Bayesian about network structure. Machine Learning, 50(1-2):95\u2013125, 2003.\n\n[6] Chong Gu. Smoothing Spline ANOVA Models. Springer Verlag, 2002.\n\n[7] G. Sabidussi. Graph multiplication. Mathematische Zeitschrift, 72(1):446\u2013457, 1959.\n\n[8] Kathrin Schacke. On the Kronecker product. Master\u2019s thesis, University of Waterloo, 2004.\n\n[9] Grace Wahba. Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, 1990.\n", "award": [], "sourceid": 836, "authors": [{"given_name": "Benjamin", "family_name": "Yackley", "institution": null}, {"given_name": "Eduardo", "family_name": "Corona", "institution": null}, {"given_name": "Terran", "family_name": "Lane", "institution": null}]}