{"title": "Learning the Architecture of Sum-Product Networks Using Clustering on Variables", "book": "Advances in Neural Information Processing Systems", "page_first": 2033, "page_last": 2041, "abstract": "The sum-product network (SPN) is a recently-proposed deep model consisting of a network of sum and product nodes, and has been shown to be competitive with state-of-the-art deep models on certain difficult tasks such as image completion. Designing an SPN network architecture that is suitable for the task at hand is an open question. We propose an algorithm for learning the SPN architecture from data. The idea is to cluster variables (as opposed to data instances) in order to identify variable subsets that strongly interact with one another. Nodes in the SPN network are then allocated towards explaining these interactions. Experimental evidence shows that learning the SPN architecture significantly improves its performance compared to using a previously-proposed static architecture.", "full_text": "Learning the Architecture of Sum-Product Networks\n\nUsing Clustering on Variables\n\nAaron Dennis\n\nDepartment of Computer Science\n\nBrigham Young University\n\nProvo, UT 84602\n\nadennis@byu.edu\n\nDan Ventura\n\nDepartment of Computer Science\n\nBrigham Young University\n\nProvo, UT 84602\n\nventura@cs.byu.edu\n\nAbstract\n\nThe sum-product network (SPN) is a recently-proposed deep model consisting of\na network of sum and product nodes, and has been shown to be competitive with\nstate-of-the-art deep models on certain dif\ufb01cult tasks such as image completion.\nDesigning an SPN network architecture that is suitable for the task at hand is an\nopen question. We propose an algorithm for learning the SPN architecture from\ndata. The idea is to cluster variables (as opposed to data instances) in order to\nidentify variable subsets that strongly interact with one another. Nodes in the SPN\nnetwork are then allocated towards explaining these interactions. 
Experimental\nevidence shows that learning the SPN architecture signi\ufb01cantly improves its per-\nformance compared to using a previously-proposed static architecture.\n\n1\n\nIntroduction\n\nThe number of parameters in a textbook probabilistic graphical model (PGM) is an exponential\nfunction of the number of parents of the nodes in the graph. Latent variables can often be introduced\nsuch that the number of parents is reduced while still allowing the probability distribution to be\nrepresented. Figure 1 shows an example of modeling the relationship between symptoms of a set of\ndiseases. The PGM at the left has no latent variables and the PGM at the right has an appropriately\nadded \u201cdisease\u201d variable. The model is able to be simpli\ufb01ed because the symptoms are statistically\nindependent of one another given the disease. The middle PGM shows a model in which the latent\nvariable is introduced to no simplifying effect, demonstrating the need to be intelligent about what\nlatent variables are added and how they are added.\n\nDeep models can be interpreted as PGMs\nthat\nintroduce multiple layers of latent\nvariables over a layer of observed vari-\nables [1]. The architecture of these latent\nvariables (the size of the layers, the num-\nber of variables, the connections between\nvariables) can dramatically affect the per-\nformance of these models. Selecting a rea-\nsonable architecture is often done by hand.\nThis paper proposes an algorithm that au-\ntomatically learns a deep architecture from\ndata for a sum-product network (SPN), a\nrecently-proposed deep model that takes\nadvantage of the simplifying effect of la-\ntent variables [2]. Learning the appropri-\n\n(a)\n\n(b)\n\n(c)\n\nFigure 1: Introducing a latent variable. The PGM in\n(a) has no latent variables. The PGM in (b) has a latent\nvariable introduced to no bene\ufb01cial effect. 
The PGM\nin (c) has a latent variable that simplifies the model.\n\nFigure 2: A simple SPN over two binary variables A\nand B. The leaf node λā takes value 1 if A = 0 and\n0 otherwise, while leaf node λa takes value 1 if A = 1\nand 0 otherwise. If the value of A is not known then\nboth leaf nodes take value 1. Leaf nodes λb̄ and λb behave similarly. Weights on the edges connecting sum\nnodes with their children are not shown. The short-dashed edge causes the SPN to be incomplete. The\nlong-dashed edge causes the SPN to be inconsistent.\n\nFigure 3: The Poon architecture with m = 1\nsum nodes per region. Three product nodes\nare introduced because the 2×3-pixel image\npatch can be split vertically and horizontally\nin three different ways. In general the Poon\narchitecture has number-of-splits times m^2\nproduct nodes per region.\n\nate architecture for a traditional deep model can be challenging [3, 4], but the nature of SPNs lends\nitself to a remarkably simple, fast, and effective architecture-learning algorithm.\nIn proposing SPNs, Poon & Domingos introduce a general scheme for building an initial SPN architecture; the experiments they run all use one particular instantiation of this scheme to build an\ninitial “fixed” architecture that is suitable for image data. We will refer to this architecture as the\nPoon architecture. Training is done by learning the parameters of an initial SPN; after training is\ncomplete, parts of the SPN may be pruned to produce a final SPN architecture. In this way both the\nweights and architecture are learned from data.\nWe take this a step further by also learning the initial SPN architecture from data. Our algorithm\nworks by finding subsets of variables (and sets of subsets of variables) that are highly dependent and\nthen effectively combining these together under a set of latent variables. 
This encourages the latent\nvariables to act as mediators between the variables, capturing and representing the dependencies\nbetween them. Our experiments show that learning the initial SPN architecture in this way improves\nits performance.\n\n2 Sum-Product Networks\n\nSum-product networks are rooted, directed acyclic graphs (DAGs) of sum, product, and leaf nodes.\nEdges connecting sum nodes to their children are weighted using non-negative weights. The value\nof a sum node is computed as the dot product of its weights with the values of its child nodes. The\nvalue of a product node is computed by multiplying the values of its child nodes. A simple SPN is\nshown in Figure 2.\nLeaf node values are determined by the input to the SPN. Each input variable has an associated set\nof leaf nodes, one for each value the variable can take. For example, a binary variable would have\ntwo associated leaf nodes. The leaf nodes act as indicator functions, taking the value 1 when the\nvariable takes on the value that the leaf node is responsible for and 0 otherwise.\nAn SPN can be constructed such that it is a representation of some probability distribution, with\nthe value of its root node and certain partial derivatives with respect to the root node having probabilistic meaning. In particular, all marginal probabilities and many conditional probabilities can be\ncomputed [5]. Consequently an SPN can perform exact inference and does so efficiently when the\nsize of the SPN is polynomial in the number of variables.\nIf an SPN does represent a probability distribution then we call it a valid SPN; of course, not all\nSPNs are valid, nor do they all facilitate efficient, exact inference. However, Poon & Domingos\nproved that if the architecture of an SPN follows two simple rules then it will be valid. 
(Note that\nthis relationship does not go both ways; an SPN may be valid and violate one or both of these rules.)\nThis result, along with showing that SPNs can represent a broader class of distributions than other models\nthat allow for efficient and exact inference, is one of the key contributions made by Poon & Domingos.\nTo understand these rules it will help to know what the “scope of an SPN node” means. The scope\nof an SPN node n is a subset of the input variables. This subset can be determined by looking at the\nleaf nodes of the subgraph rooted at n. All input variables that have one or more of their associated\nleaf nodes in this subgraph are included in the scope of the node. We will denote the scope of n as\nscope(n).\nThe first rule is that all children of a sum node must have the same scope. Such an SPN is called\ncomplete. The second rule is that for every pair of children, (ci, cj), of a product node, there must\nnot be contradictory leaf nodes in the subgraphs rooted at ci and cj. For example, if the leaf node\ncorresponding to the variable X taking on value x is in the subgraph rooted at ci, then the leaf nodes\ncorresponding to the variable X taking on any other value may not appear in the subgraph rooted at\ncj. An SPN following this rule is called consistent. The SPN in Figure 2 violates completeness (due\nto the short-dashed arrow) and it violates consistency (due to the long-dashed arrow).\nAn SPN may also be decomposable, which is a property similar to, but somewhat more restrictive\nthan, consistency. A decomposable SPN is one in which the scopes of the children of each product\nnode are disjoint. All of the architectures described in this paper are decomposable.\nVery deep SPNs can be built using these rules as a guide. The number of layers in an SPN can be\non the order of tens, whereas the typical deep model has three to five layers. 
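The node semantics and validity rules above can be sketched in code. The following is a minimal, illustrative Python sketch; the class and function names are our own, not from the paper:

```python
# Illustrative sketch of SPN evaluation plus completeness and
# decomposability checks, following the rules described above.
# Class and function names are our own, not from the paper.

class Leaf:
    """Indicator leaf: value 1 iff its variable takes its value (or is unknown)."""
    def __init__(self, var, value):
        self.var, self.value = var, value
        self.children = []
    def scope(self):
        return {self.var}
    def eval(self, assignment):
        v = assignment.get(self.var)      # None means "value not known"
        return 1.0 if v is None or v == self.value else 0.0

class Sum:
    def __init__(self, children, weights):
        self.children, self.weights = children, weights
    def scope(self):
        return set().union(*(c.scope() for c in self.children))
    def eval(self, assignment):
        # dot product of weights with child values
        return sum(w * c.eval(assignment)
                   for w, c in zip(self.weights, self.children))

class Product:
    def __init__(self, children):
        self.children = children
    def scope(self):
        return set().union(*(c.scope() for c in self.children))
    def eval(self, assignment):
        out = 1.0
        for c in self.children:
            out *= c.eval(assignment)
        return out

def is_complete(node):
    """All children of every sum node must share the same scope."""
    ok = all(is_complete(c) for c in node.children)
    if isinstance(node, Sum):
        ok = ok and len({frozenset(c.scope()) for c in node.children}) == 1
    return ok

def is_decomposable(node):
    """Children of every product node must have pairwise-disjoint scopes."""
    if isinstance(node, Product):
        seen = set()
        for c in node.children:
            s = c.scope()
            if seen & s:
                return False
            seen |= s
    return all(is_decomposable(c) for c in node.children)

# A tiny valid SPN over binary variables A and B (cf. Figure 2, but
# without the dashed edges that break completeness and consistency):
root = Product([Sum([Leaf("A", 0), Leaf("A", 1)], [0.6, 0.4]),
                Sum([Leaf("B", 0), Leaf("B", 1)], [0.3, 0.7])])
```

With normalized weights the root evaluates to 1.0 when no evidence is given, and marginals are obtained by fixing only some variables, e.g. `root.eval({"A": 1})` gives 0.4 here.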
Recently it\nwas shown that deep SPNs can compute some functions using exponentially fewer resources than\nshallow SPNs would need [6].\nThe Poon architecture is suited for modeling probability distributions over images, or other domains\nwith local dependencies among variables. It is constructed as follows. For every possible axis-\naligned rectangular region in the image, the Poon architecture includes a set of m sum nodes, all\nof whose scope is the set of variables associated with the pixels in that region. Each of these (non-\nsingle-pixel) regions is conceptually split vertically and horizontally in all possible ways to form\npairs of rectangular subregions. For each pair of subregions, and for every possible pairing of sum\nnodes (one taken from each subregion), a product node is introduced and made the parent of the\npair of sum nodes. The product node is also added as a child to all of the top region's sum nodes.\nFigure 3 shows a fragment of a Poon architecture SPN modeling a 2 × 3 image patch.\n\n3 Cluster Architecture\n\nAs mentioned earlier, care needs to be taken when introducing latent variables into a model. Since\nthe effect of a latent variable is to help explain the interactions between its child variables [7], it\nmakes little sense to add a latent variable as the parent of two statistically independent variables.\nIn the example in Figure 4, variables W and\nX strongly interact and variables Y and Z\ndo as well. But the relationship between all\nother pairs of variables is weak. The PGM\nin (a), therefore, allows latent variable A to\ntake account of the interaction between W\nand X. On the other hand, variable A does\nlittle in the PGM in (b) since W and Y are\nnearly independent. A similar argument can\nbe made about variable B. 
Consequently,\nvariable C in the PGM in (a) can be used to\nexplain the weak interactions between vari-\nables, whereas in the PGM in (b), variable\nC essentially has the task of explaining the\ninteraction between all the variables.\n\nFigure 4: Latent variables explain the interaction be-\ntween child variables, causing the children to be in-\ndependent given the latent variable parent. If variable\npairs (W, X) and (Y, Z) strongly interact and other\nvariable pairs do not, then the PGM in (a) is a more\nsuitable model than the PGM in (b).\n\n(a)\n\n(b)\n\n3\n\nXYZBWACXYZBWAC\fIn the probabilistic interpretation of an SPN, sum nodes are associated with latent variables. (The\nevaluation of a sum node is equivalent to summing out its associated latent variable.) Each latent\nvariable helps the SPN explain interactions between variables in the scope of the sum nodes. Just as\nin the example, then, we would like to place sum nodes over sets of variables with strong interactions.\nThe Poon architecture takes this principle into account. Images exhibit strong interactions between\npixels in local spatial neighborhoods. Taking advantage of this prior knowledge, the Poon architec-\nture chooses to place sum nodes over local spatial neighborhoods that are rectangular in shape.\nThere are a few potential problems with this approach, however. One is that the Poon architecture\nincludes many rectangular regions that are long and skinny. This means that the pixels at each\nend of these regions are grouped together even though they probably have only weak interactions.\nSome grouping of weakly-interacting pixels is inevitable, but the Poon architecture probably does\nthis more than is needed. Another problem is that the Poon architecture has no way of explaining\nstrongly-interacting, non-rectangular local spatial regions. This is a major problem because such\nregions are very common in images. 
Additionally, if the data does not exhibit strong spatially-local\ninteractions then the Poon architecture could perform poorly.\nOur proposed architecture (we will call it the cluster architecture) avoids these problems. Large\nregions containing non-interacting pixels are avoided. Sum nodes can be placed over spatially-local,\nnon-rectangular regions; we are not restricted to rectangular regions, but can explain arbitrarily-\nshaped blob-like regions. In fact, the regions found by the cluster architecture are not required to\nexhibit spatial locality. This makes our architecture suitable for modeling data that does not exhibit\nstrong spatially-local interactions between variables.\n\n3.1 Building a Cluster Architecture\n\nAs was described earlier, a sum node s in an SPN has the task of explaining the interactions between\nall the variables in its scope. Let scope(s) = {V1, ··· , Vn}. If n is large, then this task will likely\nbe very difficult. SPNs have a mechanism for making it easier, however. Essentially, s delegates\npart of its responsibilities to another set of sum nodes. This is done by first forming a partition\n{S1, ··· , Sk} of scope(s), where {S1, ··· , Sk} is a partition of scope(s) if and only if S1 ∪ ··· ∪ Sk = scope(s) and\nSi ∩ Sj = ∅ for all i ≠ j. Then, for each subset Si in the partition, an additional sum node si is introduced\ninto the SPN and is given the task of explaining the interactions between all the variables in Si. The\noriginal sum node s is then given a new child product node p and the product node becomes the\nparent of each sum node si.\nIn this example the node s is analogous to the variable C in Figure 4 and the nodes si are analogous\nto the variables A and B. So this partitioning process allows s to focus on explaining the interactions\nbetween the nodes si and frees it from needing to explain everything about the interactions between\nthe variables {V1, ··· , Vn}. 
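The delegation step just described can be sketched directly. In this hypothetical sketch (nodes as plain dicts; the function name is ours, not the paper's), `delegate` gives a sum node a child product node whose children are new sum nodes, one per part of a partition of the scope:

```python
# Sketch of the scope-partitioning ("delegation") step described above.
# Nodes are plain dicts and the function name is ours, not the paper's.

def delegate(sum_node, parts):
    """Give sum_node a child product node over new sum nodes, one per part."""
    scope = sum_node["scope"]
    # the parts must cover the scope exactly and be pairwise disjoint
    assert set().union(*parts) == scope
    assert sum(len(p) for p in parts) == len(scope)
    subs = [{"type": "sum", "scope": set(p), "children": []} for p in parts]
    prod = {"type": "product", "scope": set(scope), "children": subs}
    sum_node["children"].append(prod)
    return subs

# The situation of Figure 4: s now only explains the interaction
# *between* the groups {W, X} and {Y, Z}; each new sum node explains
# the interaction *within* its group, and can itself be delegated to,
# recursively.
s = {"type": "sum", "scope": {"W", "X", "Y", "Z"}, "children": []}
s1, s2 = delegate(s, [{"W", "X"}, {"Y", "Z"}])
```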
And, of course, the partitioning process can be repeated recursively,\nwith any of the nodes si taking the place of s.\nThis is the main idea behind the algorithm for building a cluster architecture (see Algorithm 1 and\nAlgorithm 2). However, due to the architectural \ufb02exibility of an SPN, discussing this algorithm in\nterms of sum and product nodes quickly becomes tedious and confusing. The following de\ufb01nition\nwill help in this regard.\nDe\ufb01nition 1. A region graph is a rooted DAG consisting of region nodes and partition nodes. The\nroot node is a region node. Partition nodes are restricted to being the children of region nodes and\nvice versa. Region and partition nodes have scopes just like nodes in an SPN. The scope of a node\nn in a region graph is denoted scope(n).\n\nRegion nodes can be thought of as playing the role of sum nodes (explaining interactions among\nvariables) and partition nodes can be thought of as playing the role of product nodes (delegating\nresponsibilities). Using the de\ufb01nition of the region graph may not appear to have made things any\nsimpler, but its bene\ufb01ts will become more clear when discussing the conversion of region graphs to\nSPNs (see Figure 5).\nAt a high level the algorithm for building a cluster architecture is simple: build a region graph\n(Algorithm 1 and Algorithm 2), then convert it to an SPN (Algorithm 3). These steps are described\nbelow.\n\n4\n\n\fAlgorithm 1 BuildRegionGraph\n1: Input: training data D\n2: C(cid:48) \u2190 Cluster(D, 1)\n3: for k = 2 to \u221e do\nC \u2190 Cluster(D, k)\n4:\nr \u2190 Quality(C)/Quality(C(cid:48))\n5:\nif r < 1 + \u03b4 then\n6:\n7:\n8:\n9:\n10: G \u2190 CreateRegionGraph()\n11: n \u2190 AddRegionNodeTo(G)\n12: for i = 1 to k do\n13:\n\nbreak\nC(cid:48) \u2190 C\n\nelse\n\nExpandRegionGraph(G, n, Ci)\n\n(a)\n\n(b)\n\nFigure 5: Sub\ufb01gure (a) shows a region graph fragment con-\nsisting of region nodes R1, R2, R3, R4, and R5. 
R1 has\ntwo partition nodes (the smaller, filled-in nodes). Subfigure\n(b) shows the region graph converted to an SPN. In the SPN\neach region is allotted two sum nodes. The product nodes\nin R1 are surrounded by two rectangles labeled P1 and P2;\nthey correspond to the partition nodes in the region graph.\n\nAlgorithm 1 builds a region graph using training data to guide the construction. In lines 2 through 9\nthe algorithm clusters the training instances into k clusters C = {C1, ··· , Ck}. Our implementation\nuses the scikit-learn [8] implementation of k-means to cluster the data instances, but any clustering\nmethod could be used. The value for k is chosen automatically; larger values of k are tried until\nincreasing the value does not substantially improve a cluster-quality score. The remainder of the\nalgorithm creates a single-node region graph G and then adds nodes and edges to G using k calls to\nAlgorithm 2 (ExpandRegionGraph). To encourage the expansion of G in different ways, a different\nsubset of the training data, Ci, is passed to ExpandRegionGraph on each call.\nAt a high level, Algorithm 2 partitions scopes into sub-scopes recursively, adding region and partition nodes to G along the way. The initial call to ExpandRegionGraph partitions the scope of\nthe root region node. A corresponding partition node is added as a child of the root node. Two\nsub-region nodes (whose scopes form the partition) are then added as children to the partition node.\nAlgorithm 2 is then called recursively with each of these sub-region nodes as arguments (unless the\nscope of the sub-region node is too small).\nIn line 3 of Algorithm 2 the PartitionScope function in our implementation uses the k-means algorithm in an unusual way. Instead of partitioning the instances of the training dataset D into k\ninstance-clusters, it partitions variables into k variable-clusters as follows. 
D is encoded as a matrix,\neach row being a data instance and each column corresponding to a variable. Then k-means is run\non DT , causing it to partition the variables into k clusters. Actually, the PartitionScope function\nis only supposed to partition the variables in scope(n), not all the variables (note its input parame-\nter). So before calling k-means we build a new matrix Dn by removing columns from D, keeping\nonly those columns that correspond to variables in scope(n). Then k-means is run on DT\nn and the\nresulting variable partition is returned. The k-means algorithm serves the purpose of detecting sub-\nsets of variables that strongly interact with one another. Other methods (including other clustering\nalgorithms) could be used in its place.\nAfter the scope Sn of a node n has been partitioned into S1 and S2, Algorithm 2 (lines 4 through 11)\nlooks for region nodes in G whose scope is similar to S1 or S2; if region node r with scope Sr is\nsuch a node, then S1 and S2 are adjusted so that S1 = Sr and {S1, S2} is still a partition of Sn.\nLines 12 through 18 expand the region graph based on the partition of Sn. If node n does not already\nhave a child partition node representing the partition {S1, S2} then one is created (p in line 15); p is\nthen connected to child region nodes n1 and n2, whose scopes are S1 and S2, respectively.\nNote that n1 and n2 may be newly-created region nodes or they may be nodes that were created dur-\ning a previous call to Algorithm 2. We recursively call ExpandRegionGraph only on newly-created\nnodes; the recursive call is also not made if the node is a leaf node (|Si| = 1) since partitioning a\nleaf node is not helpful (see lines 19 through 22).\n\n5\n\nR1R3R2R5R4++xxxxxxxx++xx. . .++xx. . .++xx. . .++xx. . 
.R2P1P2R3R4R5R1\fregion node n in G, training data D\n\nAlgorithm 2 ExpandRegionGraph\n1: Input: region graph G,\n2: Sn \u2190 scope(n)\n3: {S1, S2} \u2190 PartitionScope(Sn, D)\n4: S \u2190 ScopesOfAllRegionNodesIn(G)\n5: for all Sr \u2208 S s.t. Sr \u2282 Sn do\np1 \u2190 |S1 \u2229 Sr|/|S1 \u222a Sr|\n6:\np2 \u2190 |S2 \u2229 Sr|/|S2 \u222a Sr|\n7:\nif max{p1, p2} > threshold then\n8:\n9:\n10:\n11:\n12: n1 \u2190 GetOrCreateRegionNode(G, S1)\n13: n2 \u2190 GetOrCreateRegionNode(G, S2)\n14: if PartitionDoesNotExist(G, n, n1, n2) then\n15:\n16:\n17:\n18:\n19: if S1 /\u2208 S \u2227 |S1| > 1 then\n20:\n21: if S2 /\u2208 S \u2227 |S2| > 1 then\n22:\n\np \u2190 NewPartitionNode()\nAddChildToRegionNode(n, p)\nAddChildToPartitionNode(p, n1)\nAddChildToPartitionNode(p, n2)\n\nS1 \u2190 Sr\nS2 \u2190 Sn \\ Sr\nbreak\n\nExpandRegionGraph(G, n1)\n\nExpandRegionGraph(G, n2)\n\nAlgorithm 3 BuildSPN\n\nInput: region graph G, sums per region m\nOutput: SPN S\nR \u2190 RegionNodesIn(G)\nfor all r \u2208 R do\n\nif IsRootNode(r) then\n\nN \u2190 AddSumNodesToSPN(S, 1)\nN \u2190 AddSumNodesToSPN(S, m)\n\nelse\nP \u2190 ChildPartitionNodesOf(r)\nfor all p \u2208 P do\n\nC \u2190 ChildrenOf(p)\nO \u2190 AddProductNodesToSPN(S, m|C|)\nfor all n \u2208 N do\nQ \u2190 empty list\nfor all c \u2208 C do\n\nAddChildrenToSumNode(n, O)\n\n//We assume the sum nodes associated\n//with c have already been created.\nU \u2190 SumNodesAssociatedWith(c)\nAppendToList(Q, U)\n\nConnectProductsToSums(O, Q)\n\nreturn S\n\nAfter the k calls to Algorithm 2 have been made, the resulting region graph must be converted to\nan SPN. Figure 5 shows a small subgraph from a region graph and its conversion into an SPN;\nthis example demonstrates the basic pattern that can be applied to all region nodes in G in order\nto generate an SPN. 
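The variable-clustering step behind PartitionScope (line 3 of Algorithm 2) can be sketched as k-means run on the transposed data matrix, so that columns (variables) rather than rows (instances) are clustered. The paper's implementation uses scikit-learn's k-means; the pure-NumPy version and the function name below are our own illustrative substitutes:

```python
import numpy as np

def partition_scope(D, scope, k=2, iters=25, seed=0):
    """Partition `scope` (column indices of D) into k variable-clusters
    by running k-means on the rows of D[:, scope].T (one row per variable)."""
    rng = np.random.default_rng(seed)
    X = D[:, scope].T.astype(float)
    # farthest-point initialization keeps this small sketch stable
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[int(d.argmax())])
    centers = np.array(centers)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return [[v for v, l in zip(scope, labels) if l == j] for j in range(k)]

# Two strongly-interacting pairs of variables: columns (0, 1) and (2, 3).
rng = np.random.default_rng(1)
z1, z2 = rng.normal(size=(200, 1)), rng.normal(size=(200, 1))
D = np.hstack([z1, z1 + 0.1 * rng.normal(size=(200, 1)),
               z2, z2 + 0.1 * rng.normal(size=(200, 1))])
parts = partition_scope(D, [0, 1, 2, 3], k=2)
```

Strongly-correlated columns land in the same part (here {0, 1} and {2, 3}), which is exactly the signal used to decide where to allocate sum nodes.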
A more precise description of this conversion is given in Algorithm 3. In\nthis algorithm the assumption is made (noted in the comments) that certain sum nodes are inserted\nbefore others. This assumption can be guaranteed if the algorithm performs a postorder traversal of\nthe region nodes in G in the outermost loop. Also note that the ConnectProductsToSums method\nconnects product nodes of the current region with sum nodes from its subregions; the children of a\nproduct node consist of a single node drawn from each subregion, and there is a product node for\nevery possible combination of such sum nodes.\n\n4 Experiments and Results\n\nPoon & Domingos showed that SPNs can outperform deep belief networks (DBNs), deep Boltzmann machines\n(DBMs), principal component analysis (PCA), and a nearest-neighbors algorithm (NN) on a difficult\nimage completion task. The task is the following: given the right/top half of an image, paint in\nthe left/bottom half of it. The completion results of these models were compared qualitatively by\ninspection and quantitatively using mean squared error (MSE). SPNs produced the best results; our\nexperiments show that the cluster architecture significantly improves SPN performance.\nWe matched the experimental set-up reported in [2] in order to isolate the effect of changing the\ninitial SPN architecture and to make their reported results directly comparable to several of our\nresults. They add 20 sum nodes for each non-unit and non-root region. The root region has one\nsum node and the unit regions have four sum nodes, each of which functions as a Gaussian over pixel\nvalues. The Gaussian means are calculated using the training data for each pixel, with one Gaussian\ncovering each quartile of the pixel-values histogram. Each training image is normalized such that\nits mean pixel value is zero with a standard deviation of one. 
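The unit-region setup just described can be sketched as follows. This is a hedged sketch with function names of our own; the text specifies only the leaf means, so anything beyond per-image normalization and quartile means is an assumption:

```python
import numpy as np

def normalize_images(images):
    """Normalize each image (one row) to zero mean, unit standard deviation."""
    mu = images.mean(axis=1, keepdims=True)
    sd = images.std(axis=1, keepdims=True)
    return (images - mu) / sd

def quartile_gaussian_means(pixel_values):
    """Means of the four Gaussian leaves for one pixel: the average of each
    quartile of that pixel's values across the (normalized) training set."""
    v = np.sort(np.asarray(pixel_values, dtype=float))
    return np.array([q.mean() for q in np.array_split(v, 4)])

imgs = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=(100, 64))
norm = normalize_images(imgs)
leaf_means = quartile_gaussian_means(norm[:, 0])   # 4 means for pixel 0
```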
Hard expectation maximization (EM)\nis used to train the SPNs; mini-batches of 50 training instances are used to calculate each weight\nupdate. All sum node weights are initialized to zero; weight values are decreased after each training\nepoch using an L0 prior; add-one smoothing on sum node weights is used during network evaluation.\n\nTable 1: Results of experiments on the Olivetti, Caltech-101 Faces, artificial, and shuffled-Olivetti datasets\ncomparing the Poon and cluster architectures. Negative\nlog-likelihood (LLH) of the training set and test set is reported along with the MSE for the image completion results (both left-half and bottom-half completion results).\n\nDataset        Measurement     Poon        Cluster\nOlivetti       Train LLH       318 ± 1     433 ± 17\n               Test LLH        863 ± 9     715 ± 31\n               MSE (left)      996 ± 42    814 ± 35\n               MSE (bottom)    963 ± 42    820 ± 38\nCaltech Faces  Train LLH       289 ± 4     379 ± 8\n               Test LLH        674 ± 15    557 ± 11\n               MSE (left)      1968 ± 89   1746 ± 87\n               MSE (bottom)    1925 ± 82   1561 ± 44\nArtificial     Train LLH       195 ± 0     169 ± 0\n               Test LLH        266 ± 4     223 ± 6\n               MSE (left)      842 ± 51    558 ± 27\n               MSE (bottom)    877 ± 85    561 ± 29\nShuffled       Train LLH       793 ± 3     442 ± 14\n               Test LLH        1193 ± 3    703 ± 14\n               MSE (left)      811 ± 11    402 ± 16\n               MSE (bottom)    817 ± 17    403 ± 17\n\nFigure 6: A cluster-architecture SPN\ncompleted the images in the left column and a Poon-architecture SPN completed the images in the right column.\nAll images shown are left-half completions. The top row shows the best results as\nmeasured by MSE and the bottom row\nshows the worst results. 
Note the smooth\nedges in the cluster completions and the\njagged edges in the Poon completions.\n\nWe test the cluster and Poon architectures by learning on the Olivetti dataset [9], the faces from the\nCaltech-101 dataset [10], an artificial dataset that we generated, and the shuffled-Olivetti dataset,\nwhich is the Olivetti dataset with the pixels randomly shuffled (all images are shuffled in the same\nway). The Caltech-101 faces were preprocessed as described by Poon & Domingos. The cluster\narchitecture is compared to the Poon architecture using the negative log-likelihood (LLH) of the\ntraining and test sets as well as the MSE of the image completion results for the left half and bottom\nhalf of the images. We train ten cluster-architecture SPNs and ten Poon-architecture SPNs. Average\nresults across the ten SPNs along with the standard deviation are given for each measurement.\nOn the Olivetti and Caltech-101 Faces datasets the Poon architecture resulted in better training set\nLLH, but the cluster architecture generalized better, getting a better test set LLH (see Table 1). The\ncluster architecture was also clearly better at the image completion tasks as measured by MSE.\nThe difference between the two architectures is most pronounced on the artificial dataset. The\nimages in this dataset are created by pasting randomly-shaded circle- and diamond-shaped image\npatches on top of one another (see Figure 6), ensuring that various pixel patches are statistically\nindependent. The cluster architecture outperforms the Poon architecture across all measures on this\ndataset (see Table 1); this is due to its ability to focus resources on non-rectangular regions.\nTo demonstrate that the cluster architecture does not rely on the presence of spatially-local, strong\ninteractions between the variables, we repeated the Olivetti experiment with the pixels in the images\nhaving been shuffled. 
In this experiment (see Table 1) the cluster architecture was, as expected,\nrelatively unaffected by the pixel shuf\ufb02ing. The LLH measures remained basically unchanged from\nthe Olivetti to the Olivetti-shuf\ufb02ed datasets. (The MSE results did not stay the same because the\nimage completions happened over different subsets of the pixels.) On the other hand, the perfor-\nmance of the Poon architecture dropped considerably due to the fact that it was no longer able to\ntake advantage of strong correlations between neighboring pixels.\nFigure 7 visually demonstrates the difference between the rectangular-regions Poon architecture and\nthe arbitrarily-shaped-regions cluster architecture. Artifacts of the different region shapes can be\nseen in sub\ufb01gure (a), where some regions are shaded lighter or darker, revealing region boundaries.\nSub\ufb01gure (b) compares the best of both architectures, showing image completion results on which\nboth architectures did well, qualitatively speaking. Note how the Poon architecture produces results\nthat look \u201cblocky\u201d, whereas the cluster architecture produces results that are smoother-looking.\n\n7\n\n\f(a)\n\n(b)\n\nFigure 7: The completion results in sub\ufb01gure (a) highlight the difference between the rectangular-\nshaped regions of the Poon architecture (top image) and the blob-like regions of the cluster archi-\ntecture (bottom image), artifacts of which can be seen in the completions. Sub\ufb01gure (b) shows\nground truth images, cluster-architecture SPN completions, and Poon-architecture SPN completions\nin the left, middle, and right columns respectively. Left-half completions are in the top row and\nbottom-half completions are in the bottom row.\n\nTable 2: Test set LLH values for the Olivetti, Olivetti45, and Olivetti4590 datasets for different\nvalues of k. 
For each dataset the best (lowest) LLH value is marked with an asterisk.\n\nDataset / k       1     2     3     4     5     6     7     8\nOlivetti         650*  653   671   685   711   716   717   741\nOlivetti45        523  495*  508   529   541   528   544   532\nOlivetti4590      579  576  550*   554   577   595   608   592\n\nAlgorithm 1 expands a region graph k times (lines 12 and 13). The value of k can significantly affect\ntest set LLH, as shown in Table 2. A value that is too low leads to an insufficiently powerful model\nand a value that is too high leads to a model that overfits the training data and generalizes poorly.\nA singly-expanded model (k = 1) is optimal for the Olivetti dataset. This may be due in part to the\nOlivetti dataset having only one distinct class of images (faces in a particular pose). Datasets with\nmore image classes may benefit from additional expansions. To experiment with this hypothesis\nwe create two new datasets: Olivetti45 and Olivetti4590. Olivetti45 is created by augmenting the\nOlivetti dataset with Olivetti images that are rotated by −45 degrees. Olivetti4590 is built similarly\nbut with rotations by −45 degrees and by −90 degrees. The Olivetti45 dataset, then, has two distinct\nclasses of images: rotated and non-rotated. Similarly, Olivetti4590 has three distinct image classes.\nTable 2 shows that, as expected, the optimal value of k for the Olivetti45 and Olivetti4590 datasets\nis two and three, respectively.\nNote that the Olivetti test set LLH with k = 1 in Table 2 is better than the test set LLH reported in\nTable 1. This shows that the algorithm for automatically selecting k in Algorithm 1 is not optimal.\nAnother option is to use a hold-out set to select k, although this method may not be appropriate\nfor small datasets.\n\n5 Conclusion\n\nThe algorithm for learning a cluster architecture is simple, fast, and effective. 
It allows the SPN\nto focus its resources on explaining the interactions between arbitrary subsets of input variables.\nAnd, being driven by data, the algorithm guides the allocation of SPN resources such that it is able\nto model the data more efficiently. Future work includes experimenting with alternative clustering algorithms, experimenting with methods for selecting the value of k, and experimenting with\nvariations of Algorithm 2 such as generalizing it to handle partitions of size greater than two.\n\nReferences\n\n[1] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, July 2006.\n\n[2] Hoifung Poon and Pedro Domingos. Sum-product networks: A new deep architecture. In Proceedings of the Twenty-Seventh Annual Conference on Uncertainty in Artificial Intelligence (UAI-11), pages 337–346, Corvallis, Oregon, 2011. AUAI Press.\n\n[3] Ryan Prescott Adams, Hanna M. Wallach, and Zoubin Ghahramani. Learning the structure of deep sparse graphical models. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010.\n\n[4] Nevin L. Zhang. Hierarchical latent class models for cluster analysis. Journal of Machine Learning Research, 5:697–723, December 2004.\n\n[5] Adnan Darwiche. A differential approach to inference in Bayesian networks. Journal of the ACM, 50:280–305, May 2003.\n\n[6] Olivier Delalleau and Yoshua Bengio. Shallow vs. deep sum-product networks. In Advances in Neural Information Processing Systems 24, pages 666–674, 2011.\n\n[7] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.\n\n[8] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 
Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.\n\n[9] F.S. Samaria and A.C. Harter. Parameterisation of a stochastic model for human face identification. In Proceedings of the Second IEEE Workshop on Applications of Computer Vision, pages 138–142, December 1994.\n\n[10] Li Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In IEEE CVPR 2004, Workshop on Generative-Model Based Vision, 2004.\n", "award": [], "sourceid": 1012, "authors": [{"given_name": "Aaron", "family_name": "Dennis", "institution": null}, {"given_name": "Dan", "family_name": "Ventura", "institution": null}]}