{"title": "A Unified Approach for Learning the Parameters of Sum-Product Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 433, "page_last": 441, "abstract": "We present a unified approach for learning the parameters of Sum-Product networks (SPNs). We prove that any complete and decomposable SPN is equivalent to a mixture of trees where each tree corresponds to a product of univariate distributions. Based on the mixture model perspective, we characterize the objective function when learning SPNs based on the maximum likelihood estimation (MLE) principle and show that the optimization problem can be formulated as a signomial program. We construct two parameter learning algorithms for SPNs by using sequential monomial approximations (SMA) and the concave-convex procedure (CCCP), respectively. The two proposed methods naturally admit multiplicative updates, hence effectively avoiding the projection operation. With the help of the unified framework, we also show that, in the case of SPNs, CCCP leads to the same algorithm as Expectation Maximization (EM) despite the fact that they are different in general.", "full_text": "A Unified Approach for Learning the Parameters of Sum-Product Networks\n\nHan Zhao\nMachine Learning Dept.\nCarnegie Mellon University\nhan.zhao@cs.cmu.edu\n\nPascal Poupart\nSchool of Computer Science\nUniversity of Waterloo\nppoupart@uwaterloo.ca\n\nGeoff Gordon\nMachine Learning Dept.\nCarnegie Mellon University\nggordon@cs.cmu.edu\n\nAbstract\n\nWe present a unified approach for learning the parameters of Sum-Product networks (SPNs). 
We prove that any complete and decomposable SPN is equivalent to a mixture of trees where each tree corresponds to a product of univariate distributions. Based on the mixture model perspective, we characterize the objective function when learning SPNs based on the maximum likelihood estimation (MLE) principle and show that the optimization problem can be formulated as a signomial program. We construct two parameter learning algorithms for SPNs by using sequential monomial approximations (SMA) and the concave-convex procedure (CCCP), respectively. The two proposed methods naturally admit multiplicative updates, hence effectively avoiding the projection operation. With the help of the unified framework, we also show that, in the case of SPNs, CCCP leads to the same algorithm as Expectation Maximization (EM) despite the fact that they are different in general.\n\n1 Introduction\n\nSum-product networks (SPNs) are new deep graphical model architectures that admit exact probabilistic inference in linear time in the size of the network [14]. Similar to traditional graphical models, there are two main problems when learning SPNs: structure learning and parameter learning. Parameter learning is interesting even if we know the ground truth structure ahead of time; structure learning depends on parameter learning, so better parameter learning can often lead to better structure learning. Poon and Domingos [14] and Gens and Domingos [6] proposed both generative and discriminative learning algorithms for parameters in SPNs. At a high level, these approaches view SPNs as deep architectures and apply projected gradient descent (PGD) to optimize the data log-likelihood. There are several drawbacks associated with PGD. For example, the projection step in PGD hurts the convergence of the algorithm and it will often lead to solutions on the boundary of the feasible region. 
Also, PGD contains an additional arbitrary parameter, the projection margin, which can be hard to set well in practice. In [14, 6], the authors also mentioned the possibility of applying EM algorithms to train SPNs by viewing sum nodes in SPNs as hidden variables. They presented an EM update formula without details. However, the update formula for EM given in [14, 6] is incorrect, as first pointed out and corrected by [12].\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nIn this paper we take a different perspective and present a unified framework, which treats [14, 6] as special cases, for learning the parameters of SPNs. We prove that any complete and decomposable SPN is equivalent to a mixture of trees where each tree corresponds to a product of univariate distributions. Based on the mixture model perspective, we can precisely characterize the functional form of the objective function based on the network structure. We show that the optimization problem associated with learning the parameters of SPNs based on the MLE principle can be formulated as a signomial program (SP), where both PGD and exponentiated gradient (EG) can be viewed as first order approximations of the signomial program after suitable transformations of the objective function. We also show that the signomial program formulation can be equivalently transformed into a difference of convex functions (DCP) formulation, where the objective function of the program can be naturally expressed as a difference of two convex functions. The DCP formulation allows us to develop two efficient optimization algorithms for learning the parameters of SPNs based on sequential monomial approximations (SMA) and the concave-convex procedure (CCCP), respectively. Both proposed approaches naturally admit multiplicative updates, hence effectively deal with the positivity constraints of the optimization. 
Furthermore, under our unified framework, we also show that CCCP leads to the same algorithm as EM despite the fact that these two approaches are different from each other in general. Although we mainly focus on MLE based parameter learning, the mixture model interpretation of SPN also helps to develop a Bayesian learning method for SPNs [21].\n\nPGD, EG, SMA and CCCP can all be viewed as different levels of convex relaxation of the original SP. Hence the framework also provides an intuitive way to compare all four approaches. We conduct extensive experiments on 20 benchmark data sets to compare the empirical performance of PGD, EG, SMA and CCCP. Experimental results validate our theoretical analysis that CCCP is the best among all four approaches, showing that it converges consistently faster and with more stability than the other three methods. Furthermore, we use CCCP to boost the performance of LearnSPN [7], showing that it can achieve results comparable to state-of-the-art structure learning algorithms using SPNs with much smaller network sizes.\n\n2 Background\n\n2.1 Sum-Product Networks\n\nTo simplify the discussion of the main idea of our unified framework, we focus our attention on SPNs over Boolean random variables. However, the framework presented here is general and can be easily extended to other discrete and continuous random variables. We first define the notion of network polynomial. We use $\\mathbb{I}_x$ to denote an indicator variable that returns 1 when $X = x$ and 0 otherwise.\n\nDefinition 1 (Network Polynomial [4]). Let $f(\\cdot) \\geq 0$ be an unnormalized probability distribution over a Boolean random vector $X_{1:N}$. The network polynomial of $f(\\cdot)$ is a multilinear function $\\sum_x f(x) \\prod_{n=1}^N \\mathbb{I}_{x_n}$ of indicator variables, where the summation is over all possible instantiations of the Boolean random vector $X_{1:N}$.\n\nA Sum-Product Network (SPN) over Boolean variables $X_{1:N}$ is a rooted DAG that computes the network polynomial over $X_{1:N}$. 
The leaves are univariate indicators of Boolean variables and internal nodes are either sum or product. Each sum node computes a weighted sum of its children and each product node computes the product of its children. The scope of a node in an SPN is defined as the set of variables that have indicators among the node\u2019s descendants. For any node $v$ in an SPN, if $v$ is a terminal node, say, an indicator variable over $X$, then $\\mathrm{scope}(v) = \\{X\\}$, else $\\mathrm{scope}(v) = \\bigcup_{\\tilde{v} \\in \\mathrm{Ch}(v)} \\mathrm{scope}(\\tilde{v})$. An SPN is complete iff each sum node has children with the same scope. An SPN is decomposable iff for every product node $v$, $\\mathrm{scope}(v_i) \\cap \\mathrm{scope}(v_j) = \\emptyset$ where $v_i, v_j \\in \\mathrm{Ch}(v)$, $i \\neq j$. The scope of the root node is $\\{X_1, \\ldots, X_N\\}$.\n\nIn this paper, we focus on complete and decomposable SPNs. For a complete and decomposable SPN $S$, each node $v$ in $S$ defines a network polynomial $f_v(\\cdot)$ which corresponds to the sub-SPN (subgraph) rooted at $v$. The network polynomial of $S$, denoted by $f_S$, is the network polynomial defined by the root of $S$, which can be computed recursively from its children. The probability distribution induced by an SPN $S$ is defined as $\\Pr_S(x) \\triangleq \\frac{f_S(x)}{\\sum_x f_S(x)}$. The normalization constant $\\sum_x f_S(x)$ can be computed in $O(|S|)$ in SPNs by setting the values of all the leaf nodes to be 1, i.e., $\\sum_x f_S(x) = f_S(\\mathbf{1})$ [14]. This leads to efficient joint/marginal/conditional inference in SPNs.\n\n2.2 Signomial Programming (SP)\n\nBefore introducing SP, we first introduce geometric programming (GP), which is a strict subclass of SP. A monomial is defined as a function $h : \\mathbb{R}^n_{++} \\mapsto \\mathbb{R}$: $h(x) = d x_1^{a_1} x_2^{a_2} \\cdots x_n^{a_n}$, where the domain is restricted to be the positive orthant ($\\mathbb{R}^n_{++}$), the coefficient $d$ is positive and the exponents $a_i \\in \\mathbb{R}, \\forall i$. A posynomial is a sum of monomials: $g(x) = \\sum_{k=1}^K d_k x_1^{a_{1k}} x_2^{a_{2k}} \\cdots x_n^{a_{nk}}$. One of the key properties of posynomials is positivity, which allows us to transform any posynomial into the log domain. A GP in standard form is defined to be an optimization problem where both the objective function and the inequality constraints are posynomials and the equality constraints are monomials. There is also an implicit constraint that $x \\in \\mathbb{R}^n_{++}$.\n\nA GP in its standard form is not a convex program since posynomials are not convex functions in general. However, we can effectively transform it into a convex problem by using the logarithmic transformation trick on $x$, the multiplicative coefficients of each monomial and also each objective/constraint function [3, 1].\n\nAn SP has the same form as GP except that the multiplicative constant $d$ inside each monomial is not restricted to be positive, i.e., $d$ can take any real value. Although the difference seems to be small, there is a huge difference between GP and SP from the computational perspective. The negative multiplicative constant in monomials invalidates the logarithmic transformation trick frequently used in GP. As a result, SPs cannot be reduced to convex programs and are believed to be hard to solve in general [1].\n\n3 Unified Approach for Learning\n\nIn this section we will show that the parameter learning problem of SPNs based on the MLE principle can be formulated as an SP. We will use a sequence of optimal monomial approximations combined with backtracking line search and the concave-convex procedure to tackle the SP. Due to space constraints, we refer interested readers to the supplementary material for all the proof details.\n\n3.1 Sum-Product Networks as a Mixture of Trees\n\nWe introduce the notion of induced trees from SPNs and use it to show that every complete and decomposable SPN can be interpreted as a mixture of induced trees, where each induced tree corresponds to a product of univariate distributions. 
From this perspective, an SPN can be understood as a huge mixture model where the effective number of components in the mixture is determined by its network structure. The method we describe here is not the first method for interpreting an SPN (or the related arithmetic circuit) as a mixture distribution [20, 5, 2]; however, the new method can result in an exponentially smaller mixture (see the end of this section for more details).\n\nDefinition 2 (Induced SPN). Given a complete and decomposable SPN $S$ over $X_{1:N}$, let $T = (T_V, T_E)$ be a subgraph of $S$. $T$ is called an induced SPN from $S$ if\n\n1. $\\mathrm{Root}(S) \\in T_V$.\n2. If $v \\in T_V$ is a sum node, then exactly one child of $v$ in $S$ is in $T_V$, and the corresponding edge is in $T_E$.\n3. If $v \\in T_V$ is a product node, then all the children of $v$ in $S$ are in $T_V$, and the corresponding edges are in $T_E$.\n\nTheorem 1. If $T$ is an induced SPN from a complete and decomposable SPN $S$, then $T$ is a tree that is complete and decomposable.\n\nAs a result of Thm. 1, we will use the terms induced SPNs and induced trees interchangeably. With some abuse of notation, we use $T(x)$ to mean the value of the network polynomial of $T$ with input vector $x$.\n\nTheorem 2. If $T$ is an induced tree from $S$ over $X_{1:N}$, then $T(x) = \\prod_{(v_i,v_j) \\in T_E} w_{ij} \\prod_{n=1}^N \\mathbb{I}_{x_n}$, where $w_{ij}$ is the edge weight of $(v_i, v_j)$ if $v_i$ is a sum node and $w_{ij} = 1$ if $v_i$ is a product node.\n\nRemark. Although we focus our attention on Boolean random variables for the simplicity of discussion and illustration, Thm. 2 can be extended to the case where the univariate distributions at the leaf nodes are continuous or discrete distributions with countably infinitely many values, e.g., Gaussian distributions or Poisson distributions. We can simply replace the product of univariate distributions term, $\\prod_{n=1}^N \\mathbb{I}_{x_n}$, in Thm. 2 to be the general form $\\prod_{n=1}^N p_n(X_n)$, where $p_n(X_n)$ is a univariate distribution over $X_n$. Also note that it is possible for two unique induced trees to share the same product of univariate distributions, but in this case their weight terms $\\prod_{(v_i,v_j) \\in T_E} w_{ij}$ are guaranteed to be different. As we will see shortly, Thm. 2 implies that the joint distribution over $\\{X_n\\}_{n=1}^N$ represented by an SPN is essentially a mixture model with potentially exponentially many components in the mixture.\n\nDefinition 3 (Network cardinality). The network cardinality $\\tau_S$ of an SPN $S$ is the number of unique induced trees.\n\nTheorem 3. $\\tau_S = f_S(\\mathbf{1}|\\mathbf{1})$, where $f_S(\\mathbf{1}|\\mathbf{1})$ is the value of the network polynomial of $S$ with input vector $\\mathbf{1}$ and all edge weights set to be 1.\n\nTheorem 4. $S(x) = \\sum_{t=1}^{\\tau_S} T_t(x)$, where $T_t$ is the $t$-th unique induced tree of $S$.\n\nRemark. The above four theorems prove the fact that an SPN $S$ is an ensemble or mixture of trees, where each tree computes an unnormalized distribution over $X_{1:N}$. The total number of unique trees in $S$ is the network cardinality $\\tau_S$, which only depends on the structure of $S$. Each component is a simple product of univariate distributions. We illustrate the theorems above with a simple example in Fig. 1.\n\nFigure 1: A complete and decomposable SPN is a mixture of induced trees. Double circles indicate univariate distributions over X1 and X2. Different colors are used to highlight unique induced trees; each induced tree is a product of univariate distributions over X1 and X2.\n\nZhao et al. [20] show that every complete and decomposable SPN is equivalent to a bipartite Bayesian network with a layer of hidden variables and a layer of observable random variables. The number of hidden variables in the bipartite Bayesian network is equal to the number of sum nodes in $S$. 
A naive expansion of such Bayesian network to a mixture model will lead to a huge mixture model with $2^{O(M)}$ components, where $M$ is the number of sum nodes in $S$. Here we complement their theory and show that each complete and decomposable SPN is essentially a mixture of trees and the effective number of unique induced trees is given by $\\tau_S$. Note that $\\tau_S = f_S(\\mathbf{1}|\\mathbf{1})$ depends only on the network structure, and can often be much smaller than $2^{O(M)}$. Without loss of generality, assuming that in $S$ layers of sum nodes are alternating with layers of product nodes, then $f_S(\\mathbf{1}|\\mathbf{1}) = \\Omega(2^h)$, where $h$ is the height of $S$. However, the exponentially many trees are recursively merged and combined in $S$ such that the overall network size is still tractable.\n\n3.2 Maximum Likelihood Estimation as SP\n\nLet\u2019s consider the likelihood function computed by an SPN $S$ over $N$ binary random variables with model parameters $w$ and input vector $x \\in \\{0, 1\\}^N$. Here the model parameters in $S$ are edge weights from every sum node, and we collect them together into a long vector $w \\in \\mathbb{R}^D_{++}$, where $D$ corresponds to the number of edges emanating from sum nodes in $S$. By definition, the probability distribution induced by $S$ can be computed by $\\Pr_S(x|w) \\triangleq \\frac{f_S(x|w)}{\\sum_x f_S(x|w)} = \\frac{f_S(x|w)}{f_S(\\mathbf{1}|w)}$.\n\nCorollary 5. Let $S$ be an SPN with weights $w \\in \\mathbb{R}^D_{++}$ over input vector $x \\in \\{0, 1\\}^N$, the network polynomial $f_S(x|w)$ is a posynomial: $f_S(x|w) = \\sum_{t=1}^{f_S(\\mathbf{1}|\\mathbf{1})} \\prod_{n=1}^N \\mathbb{I}^{(t)}_{x_n} \\prod_{d=1}^D w_d^{\\mathbb{I}_{w_d \\in T_t}}$, where $\\mathbb{I}_{w_d \\in T_t}$ is the indicator variable whether $w_d$ is in the $t$-th induced tree $T_t$ or not. Each monomial corresponds exactly to a unique induced tree SPN from $S$.\n\nThe above statement is a direct corollary of Thm. 2, Thm. 3 and Thm. 4. From the definition of network polynomial, we know that $f_S$ is a multilinear function of the indicator variables. Corollary 5 works as a complement to characterize the functional form of a network polynomial in terms of $w$. It follows that the likelihood function $L_S(w) \\triangleq \\Pr_S(x|w)$ can be expressed as the ratio of two posynomial functions. We now show that the optimization problem based on MLE is an SP. Using the definition of $\\Pr(x|w)$ and Corollary 5, let $\\tau = f_S(\\mathbf{1}|\\mathbf{1})$, the MLE problem can be rewritten as\n\n$$\\text{maximize}_w \\quad \\frac{f_S(x|w)}{f_S(\\mathbf{1}|w)} = \\frac{\\sum_{t=1}^{\\tau} \\prod_{n=1}^N \\mathbb{I}^{(t)}_{x_n} \\prod_{d=1}^D w_d^{\\mathbb{I}_{w_d \\in T_t}}}{\\sum_{t=1}^{\\tau} \\prod_{d=1}^D w_d^{\\mathbb{I}_{w_d \\in T_t}}} \\qquad \\text{subject to} \\quad w \\in \\mathbb{R}^D_{++}$$ (1)\n\nProposition 6. The MLE problem for SPNs is a signomial program.\n\nBeing nonconvex in general, SP is essentially hard to solve from a computational perspective [1, 3]. However, despite the hardness of SP in general, the objective function in the MLE formulation of SPNs has a special structure, i.e., it is the ratio of two posynomials, which makes the design of efficient optimization algorithms possible.\n\n3.3 Difference of Convex Functions\n\nBoth PGD and EG are first-order methods and they can be viewed as approximating the SP after applying a logarithmic transformation to the objective function only. Although (1) is a signomial program, its objective function is expressed as the ratio of two posynomials. Hence, we can still apply the logarithmic transformation trick used in geometric programming to its objective function and to the variables to be optimized. More concretely, let $w_d = \\exp(y_d), \\forall d$ and take the log of the objective function; it becomes equivalent to maximize the following new objective without any constraint on $y$:\n\n$$\\text{maximize} \\quad \\log\\left(\\sum_{t=1}^{\\tau(x)} \\exp\\left(\\sum_{d=1}^D y_d \\mathbb{I}_{y_d \\in T_t}\\right)\\right) - \\log\\left(\\sum_{t=1}^{\\tau} \\exp\\left(\\sum_{d=1}^D y_d \\mathbb{I}_{y_d \\in T_t}\\right)\\right)$$ (2)\n\nNote that in the first term of Eq. 2 the upper index $\\tau(x) \\leq \\tau \\triangleq f_S(\\mathbf{1}|\\mathbf{1})$ depends on the current input $x$. 
By transforming into the log-space, we naturally guarantee the positivity of the solution at each iteration, hence transforming a constrained optimization problem into an unconstrained optimization problem without any sacrifice. Both terms in Eq. 2 are convex functions in $y$ after the transformation. Hence, the transformed objective function is now expressed as the difference of two convex functions, which is called a DC function [9]. This helps us to design two efficient algorithms to solve the problem based on the general idea of sequential convex approximations for nonlinear programming.\n\n3.3.1 Sequential Monomial Approximation\n\nLet\u2019s consider the linearization of both terms in Eq. 2 in order to apply first-order methods in the transformed space. To compute the gradient with respect to different components of $y$, we view each node of an SPN as an intermediate function of the network polynomial and apply the chain rule to back-propagate the gradient. The differentiation of $f_S(x|w)$ with respect to the root node of the network is set to be 1. The differentiation of the network polynomial with respect to a partial function at each node can then be computed in two passes of the network: the bottom-up pass evaluates the values of all partial functions given the current input $x$ and the top-down pass differentiates the network polynomial with respect to each partial function. Following the evaluation-differentiation passes, the gradient of the objective function in (2) can be computed in $O(|S|)$. Furthermore, although the computation is conducted in $y$, the results are fully expressed in terms of $w$, which suggests that in practice we do not need to explicitly construct $y$ from $w$.\n\nLet $f(y) = \\log f_S(x|\\exp(y)) - \\log f_S(\\mathbf{1}|\\exp(y))$. 
It follows that approximating $f(y)$ with the best linear function is equivalent to using the best monomial approximation of the signomial program (1). This leads to sequential monomial approximations of the original SP formulation: at each iteration $y^{(k)}$, we linearize both terms in Eq. 2 and form the optimal monomial function in terms of $w^{(k)}$. The additive update of $y^{(k)}$ leads to a multiplicative update of $w^{(k)}$ since $w^{(k)} = \\exp(y^{(k)})$, and we use a backtracking line search to determine the step size of the update in each iteration.\n\n3.3.2 Concave-convex Procedure\n\nSequential monomial approximation fails to use the structure of the problem when learning SPNs. Here we propose another approach based on the concave-convex procedure (CCCP) [18] to use the fact that the objective function is expressed as the difference of two convex functions. At a high level CCCP solves a sequence of concave surrogate optimizations until convergence. In many cases, the maximum of a concave surrogate function can only be solved using other convex solvers and as a result the efficiency of the CCCP highly depends on the choice of the convex solvers. However, we show that by a suitable transformation of the network we can compute the maximum of the concave surrogate in closed form in time that is linear in the network size, which leads to a very efficient algorithm for learning the parameters of SPNs. We also prove the convergence properties of our algorithm.\n\nConsider the objective function to be maximized in DCP: $f(y) = \\log f_S(x|\\exp(y)) - \\log f_S(\\mathbf{1}|\\exp(y)) \\triangleq f_1(y) + f_2(y)$ where $f_1(y) \\triangleq \\log f_S(x|\\exp(y))$ is a convex function and $f_2(y) \\triangleq -\\log f_S(\\mathbf{1}|\\exp(y))$ is a concave function. We can linearize only the convex part $f_1(y)$ to obtain a surrogate function\n\n$$\\hat{f}(y, z) = f_1(z) + \\nabla_z f_1(z)^T (y - z) + f_2(y)$$ (3)\n\nfor $\\forall y, z \\in \\mathbb{R}^D$. Now $\\hat{f}(y, z)$ is a concave function in $y$. 
Due to the convexity of $f_1(y)$ we have $f_1(y) \\geq f_1(z) + \\nabla_z f_1(z)^T (y - z), \\forall y, z$ and as a result the following two properties always hold for $\\forall y, z$: $\\hat{f}(y, z) \\leq f(y)$ and $\\hat{f}(y, y) = f(y)$. CCCP updates $y$ at each iteration $k$ by solving $y^{(k)} \\in \\arg\\max_y \\hat{f}(y, y^{(k-1)})$ unless we already have $y^{(k-1)} \\in \\arg\\max_y \\hat{f}(y, y^{(k-1)})$, in which case a generalized fixed point $y^{(k-1)}$ has been found and the algorithm stops.\n\nIt is easy to show that at each iteration of CCCP we always have $f(y^{(k)}) \\geq f(y^{(k-1)})$. Note also that $f(y)$ is computing the log-likelihood of input $x$ and therefore it is bounded above by 0. By the monotone convergence theorem, $\\lim_{k \\to \\infty} f(y^{(k)})$ exists and the sequence $\\{f(y^{(k)})\\}$ converges.\n\nWe now discuss how to compute a closed form solution for the maximization of the concave surrogate $\\hat{f}(y, y^{(k-1)})$. Since $\\hat{f}(y, y^{(k-1)})$ is differentiable and concave for any fixed $y^{(k-1)}$, a sufficient and necessary condition to find its maximum is\n\n$$\\nabla_y \\hat{f}(y, y^{(k-1)}) = \\nabla_{y^{(k-1)}} f_1(y^{(k-1)}) + \\nabla_y f_2(y) = 0$$ (4)\n\nIn the above equation, if we consider only the partial derivative with respect to $y_{ij}$ ($w_{ij}$), we obtain\n\n$$\\frac{w_{ij}^{(k-1)} f_{v_j}(x|w^{(k-1)})}{f_S(x|w^{(k-1)})} \\frac{\\partial f_S(x|w^{(k-1)})}{\\partial f_{v_i}(x|w^{(k-1)})} = \\frac{w_{ij} f_{v_j}(\\mathbf{1}|w)}{f_S(\\mathbf{1}|w)} \\frac{\\partial f_S(\\mathbf{1}|w)}{\\partial f_{v_i}(\\mathbf{1}|w)}$$ (5)\n\nEq. 5 leads to a system of $D$ nonlinear equations, which is hard to solve in closed form. However, if we do a change of variable by considering locally normalized weights $w'_{ij}$ (i.e., $w'_{ij} \\geq 0$ and $\\sum_j w'_{ij} = 1, \\forall i$), then a solution can be easily computed. As described in [13, 20], any SPN can be transformed into an equivalent normal SPN with locally normalized weights in a bottom up pass as follows:\n\n$$w'_{ij} = \\frac{w_{ij} f_{v_j}(\\mathbf{1}|w)}{\\sum_j w_{ij} f_{v_j}(\\mathbf{1}|w)}$$ (6)\n\nWe can then replace $w_{ij} f_{v_j}(\\mathbf{1}|w)$ in the above equation by the expression it is equal to in Eq. 5 to obtain a closed form solution:\n\n$$w'_{ij} \\propto w_{ij}^{(k-1)} \\frac{f_{v_j}(x|w^{(k-1)})}{f_S(x|w^{(k-1)})} \\frac{\\partial f_S(x|w^{(k-1)})}{\\partial f_{v_i}(x|w^{(k-1)})}$$ (7)\n\nNote that in the above derivation both $f_{v_i}(\\mathbf{1}|w)/f_S(\\mathbf{1}|w)$ and $\\partial f_S(\\mathbf{1}|w)/\\partial f_{v_i}(\\mathbf{1}|w)$ can be treated as constants and hence absorbed since $w'_{ij}, \\forall j$ are constrained to be locally normalized. In order to obtain a solution to Eq. 5, for each edge weight $w_{ij}$, the sufficient statistics include only three terms, i.e., the evaluation value at $v_j$, the differentiation value at $v_i$ and the previous edge weight $w_{ij}^{(k-1)}$, all of which can be obtained in two passes of the network for each input $x$. Thus the computational complexity to obtain a maximum of the concave surrogate is $O(|S|)$. Interestingly, Eq. 7 leads to the same update formula as in the EM algorithm [12] despite the fact that CCCP and EM start from different perspectives. We show that all the limit points of the sequence $\\{w^{(k)}\\}_{k=1}^{\\infty}$ are guaranteed to be stationary points of DCP in (2).\n\nTheorem 7. Let $\\{w^{(k)}\\}_{k=1}^{\\infty}$ be any sequence generated using Eq. 7 from any positive initial point, then all the limiting points of $\\{w^{(k)}\\}_{k=1}^{\\infty}$ are stationary points of the DCP in (2). In addition, $\\lim_{k \\to \\infty} f(y^{(k)}) = f(y^*)$, where $y^*$ is some stationary point of (2).\n\nWe summarize all four algorithms and highlight their connections and differences in Table 1. Although we mainly discuss the batch version of those algorithms, all of the four algorithms can be easily adapted to work in stochastic and/or parallel settings.\n\nTable 1: Summary of PGD, EG, SMA and CCCP. Var. 
means the optimization variables.\n\nAlgo | Var. | Update Type | Update Formula\nPGD | $w$ | Additive | $w_d^{(k+1)} \\leftarrow P_{\\mathbb{R}^{\\epsilon}_{++}}\\{ w_d^{(k)} + \\gamma (\\nabla_{w_d} f_1(w^{(k)}) - \\nabla_{w_d} f_2(w^{(k)})) \\}$\nEG | $w$ | Multiplicative | $w_d^{(k+1)} \\leftarrow w_d^{(k)} \\exp\\{ \\gamma (\\nabla_{w_d} f_1(w^{(k)}) - \\nabla_{w_d} f_2(w^{(k)})) \\}$\nSMA | $\\log w$ | Multiplicative | $w_d^{(k+1)} \\leftarrow w_d^{(k)} \\exp\\{ \\gamma w_d^{(k)} \\times (\\nabla_{w_d} f_1(w^{(k)}) - \\nabla_{w_d} f_2(w^{(k)})) \\}$\nCCCP | $\\log w$ | Multiplicative | $w_{ij}^{(k+1)} \\propto w_{ij}^{(k)} \\times \\nabla_{v_i} f_S(w^{(k)}) \\times f_{v_j}(w^{(k)})$\n\n4 Experiments\n\n4.1 Experimental Setting\n\nWe conduct experiments on 20 benchmark data sets from various domains to compare and evaluate the convergence performance of the four algorithms: PGD, EG, SMA and CCCP (EM). These 20 data sets are widely used in [7, 15] to assess different SPNs for the task of density estimation. All the features in the 20 data sets are binary features. All the SPNs that are used for comparisons of PGD, EG, SMA and CCCP are trained using LearnSPN [7]. We discard the weights returned by LearnSPN and use random weights as initial model parameters. The random weights are determined by the same random seed in all four algorithms. Detailed information about these 20 datasets and the SPNs used in the experiments are provided in the supplementary material.\n\n4.2 Parameter Learning\n\nWe implement all four algorithms in C++. For each algorithm, we set the maximum number of iterations to 50. If the absolute difference in the training log-likelihood at two consecutive steps is less than 0.001, the algorithms are stopped. For PGD, EG and SMA, we combine each of them with backtracking line search and use a weight shrinking coefficient set at 0.8. The learning rates are initialized to 1.0 for all three methods. For PGD, we set the projection margin $\\epsilon$ to 0.01. There is no learning rate and no backtracking line search in CCCP. We set the smoothing parameter to 0.001 in CCCP to avoid numerical issues.\n\nWe show in Fig. 
2 the average log-likelihood scores on 20 training data sets to evaluate the convergence speed and stability of PGD, EG, SMA and CCCP. Clearly, CCCP wins by a large margin over PGD, EG and SMA, both in convergence speed and solution quality. Furthermore, among the four algorithms, CCCP is the most stable one due to its guarantee that the log-likelihood (on training data) will not decrease after each iteration. As shown in Fig. 2, the training curves of CCCP are smoother than those of the other three methods in almost all the cases. These 20 experiments also clearly show that CCCP often converges in a few iterations. On the other hand, PGD, EG and SMA are on par with each other since they are all first-order methods. SMA is more stable than PGD and EG and often achieves better solutions than PGD and EG. On large data sets, SMA also converges faster than PGD and EG. Surprisingly, EG performs worse than PGD in some cases and is quite unstable despite the fact that it admits multiplicative updates. The \u201chook shape\u201d curves of PGD in some data sets, e.g. Kosarak and KDD, are due to the projection operations.\n\nTable 2: Average log-likelihoods on test data. Highest log-likelihoods are highlighted in bold. \u2191 shows statistically better log-likelihoods than CCCP and \u2193 shows statistically worse log-likelihoods than CCCP. The significance is measured based on the Wilcoxon signed-rank test.\n\nData set | CCCP | LearnSPN | ID-SPN\nNLTCS | -6.029 | \u2193-6.099 | \u2193-6.050\nMSNBC | -6.045 | \u2193-6.113 | -6.048\nKDD 2k | -2.134 | \u2193-2.233 | \u2193-2.153\nPlants | -12.872 | \u2193-12.955 | \u2191-12.554\nAudio | -40.020 | \u2193-40.510 | -39.824\nJester | -52.880 | \u2193-53.454 | \u2193-52.912\nNetflix | -56.782 | \u2193-57.385 | \u2191-56.554\nAccidents | -27.700 | \u2193-29.907 | \u2191-27.232\nRetail | -10.919 | \u2193-11.138 | -10.945\nPumsb-star | -24.229 | \u2193-24.577 | \u2191-22.552\nDNA | -84.921 | \u2193-85.237 | \u2191-84.693\nKosarak | -10.880 | \u2193-11.057 | -10.605\nMSWeb | -9.970 | \u2193-10.269 | -9.800\nBook | -35.009 | \u2193-36.247 | \u2191-34.436\nEachMovie | -52.557 | \u2193-52.816 | \u2191-51.550\nWebKB | -157.492 | \u2193-158.542 | \u2191-153.293\nReuters-52 | -84.628 | \u2193-85.979 | \u2191-84.389\n20 Newsgrp | -153.205 | \u2193-156.605 | \u2191-151.666\nBBC | -248.602 | \u2193-249.794 | \u2193-252.602\nAd | -27.202 | \u2193-27.409 | \u2193-40.012\n\nFigure 2: Negative log-likelihood values versus number of iterations for PGD, EG, SMA and CCCP.\n\nThe computational complexity per update is $O(|S|)$ in all four algorithms. CCCP often takes less time than the other three algorithms because it takes fewer iterations to converge. We list detailed running time statistics for all four algorithms on the 20 data sets in the supplementary material.\n\n4.3 Fine Tuning\n\nWe combine CCCP as a \u201cfine tuning\u201d procedure with the structure learning algorithm LearnSPN and compare it to the state-of-the-art structure learning algorithm ID-SPN [15]. More concretely, we keep the model parameters learned from LearnSPN and use them to initialize CCCP. We then update the model parameters globally using CCCP as a fine tuning technique. This normally helps to obtain a better generative model since the original parameters are learned greedily and locally during the structure learning algorithm. We use the validation set log-likelihood score to avoid overfitting. 
The algorithm returns the set of parameters that achieve the best validation set log-likelihood score as output. Experimental results are reported in Table 2. As shown in Table 2, the use of CCCP after LearnSPN always helps to improve the model performance. By optimizing model parameters on these 20 data sets, we boost LearnSPN to achieve better results than state-of-the-art ID-SPN on 7 data sets, where the original LearnSPN only outperforms ID-SPN on 1 data set. Note that the sizes of the SPNs returned by LearnSPN are much smaller than those produced by ID-SPN. Hence, it is remarkable that by fine tuning the parameters with CCCP, we can achieve better performance despite the fact that the models are smaller. For a fair comparison, we also list the size of the SPNs returned by ID-SPN in the supplementary material.\n\n5 Conclusion\n\nWe show that the network polynomial of an SPN is a posynomial function of the model parameters, and that parameter learning yields a signomial program. We propose two convex relaxations to solve the SP. We analyze the convergence properties of CCCP for learning SPNs. Extensive experiments are conducted to evaluate the proposed approaches and current methods. We also recommend combining CCCP with structure learning algorithms to boost the modeling accuracy.\n\nAcknowledgments\n\nHZ and GG gratefully acknowledge support from ONR contract N000141512365. HZ also thanks Ryan Tibshirani for the helpful discussion about CCCP.\n\nReferences\n\n[1] S. Boyd, S.-J. Kim, L. Vandenberghe, and A. Hassibi. A tutorial on geometric programming. Optimization and Engineering, 8(1):67\u2013127, 2007.\n\n[2] H. Chan and A. Darwiche. On the robustness of most probable explanations. In Proceedings of the Twenty Second Conference on Uncertainty in Artificial Intelligence.\n\n[3] M. Chiang. Geometric programming for communication systems. Now Publishers Inc, 2005.\n\n[4] A. Darwiche. 
A differential approach to inference in Bayesian networks. Journal of the ACM (JACM), 50(3):280\u2013305, 2003.\n[5] A. Dennis and D. Ventura. Greedy structure search for sum-product networks. In International Joint Conference on Artificial Intelligence, volume 24, 2015.\n[6] R. Gens and P. Domingos. Discriminative learning of sum-product networks. In Advances in Neural Information Processing Systems, pages 3248\u20133256, 2012.\n[7] R. Gens and P. Domingos. Learning the structure of sum-product networks. In Proceedings of The 30th International Conference on Machine Learning, pages 873\u2013880, 2013.\n[8] A. Gunawardana and W. Byrne. Convergence theorems for generalized alternating minimization procedures. The Journal of Machine Learning Research, 6:2049\u20132073, 2005.\n[9] P. Hartman et al. On functions representable as a difference of convex functions. Pacific J. Math, 9(3):707\u2013713, 1959.\n[10] J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1\u201363, 1997.\n[11] G. R. Lanckriet and B. K. Sriperumbudur. On the convergence of the concave-convex procedure. pages 1759\u20131767, 2009.\n[12] R. Peharz. Foundations of Sum-Product Networks for Probabilistic Modeling. PhD thesis, Graz University of Technology, 2015.\n[13] R. Peharz, S. Tschiatschek, F. Pernkopf, and P. Domingos. On theoretical properties of sum-product networks. In AISTATS, 2015.\n[14] H. Poon and P. Domingos. Sum-product networks: A new deep architecture. In Proc. 12th Conf. on Uncertainty in Artificial Intelligence, pages 2551\u20132558, 2011.\n[15] A. Rooshenas and D. Lowd. Learning sum-product networks with direct and indirect variable interactions. In ICML, 2014.\n[16] R. Salakhutdinov, S. Roweis, and Z. Ghahramani. On the convergence of bound optimization algorithms. UAI, 2002.\n[17] C. J. Wu. 
On the convergence properties of the EM algorithm. The Annals of Statistics, pages 95\u2013103, 1983.\n[18] A. L. Yuille and A. Rangarajan. The concave-convex procedure (CCCP). Advances in Neural Information Processing Systems, 2:1033\u20131040, 2002.\n[19] W. I. Zangwill. Nonlinear programming: a unified approach, volume 196. Prentice-Hall, Englewood Cliffs, NJ, 1969.\n[20] H. Zhao, M. Melibari, and P. Poupart. On the Relationship between Sum-Product Networks and Bayesian Networks. In ICML, 2015.\n[21] H. Zhao, T. Adel, G. Gordon, and B. Amos. Collapsed variational inference for sum-product networks. In ICML, 2016.", "award": [], "sourceid": 250, "authors": [{"given_name": "Han", "family_name": "Zhao", "institution": "Carnegie Mellon University"}, {"given_name": "Pascal", "family_name": "Poupart", "institution": "University of Waterloo"}, {"given_name": "Geoffrey", "family_name": "Gordon", "institution": "CMU"}]}