{"title": "From Stochastic Planning to Marginal MAP", "book": "Advances in Neural Information Processing Systems", "page_first": 3081, "page_last": 3091, "abstract": "It is well known that the problems of stochastic planning and probabilistic inference are closely related. This paper makes two contributions in this context. The first is to provide an analysis of the recently developed SOGBOFA heuristic planning algorithm that was shown to be effective for problems with large factored state and action spaces. It is shown that SOGBOFA can be seen as a specialized inference algorithm that computes its solutions through a combination of a symbolic variant of belief propagation and gradient ascent. The second contribution is a new solver for Marginal MAP (MMAP) inference. We introduce a new reduction from MMAP to maximum expected utility problems which are suitable for the symbolic computation in SOGBOFA. This yields a novel algebraic gradient-based solver (AGS) for MMAP. An experimental evaluation illustrates the potential of AGS in solving difficult MMAP problems.", "full_text": "From Stochastic Planning to Marginal MAP\n\nHao Cui\n\nDepartment of Computer Science\n\nTufts University\n\nMedford, MA 02155, USA\nhao.cui@tufts.edu\n\nRadu Marinescu\n\nIBM Research\nDublin, Ireland\n\nradu.marinescu@ie.ibm.com\n\nRoni Khardon\n\nDepartment of Computer Science\n\nIndiana University\n\nBloomington, IN, USA\nrkhardon@iu.edu\n\nAbstract\n\nIt is well known that the problems of stochastic planning and probabilistic infer-\nence are closely related. This paper makes two contributions in this context. The\n\ufb01rst is to provide an analysis of the recently developed SOGBOFA heuristic plan-\nning algorithm that was shown to be effective for problems with large factored\nstate and action spaces. 
It is shown that SOGBOFA can be seen as a specialized inference algorithm that computes its solutions through a combination of a symbolic variant of belief propagation and gradient ascent. The second contribution is a new solver for Marginal MAP (MMAP) inference. We introduce a new reduction from MMAP to maximum expected utility problems which are suitable for the symbolic computation in SOGBOFA. This yields a novel algebraic gradient-based solver (AGS) for MMAP. An experimental evaluation illustrates the potential of AGS in solving difficult MMAP problems.

1 Introduction

The connection between planning and inference is well known. Over the last decade multiple authors have introduced explicit reductions showing how stochastic planning can be solved using probabilistic inference (for example, [4, 25, 5, 17, 23, 12, 8, 19, 26, 10, 18]), with applications in robotics, scheduling and environmental problems. However, heuristic methods and search are still the best performing approaches for planning in large combinatorial state and action spaces [9, 7, 2].

This paper makes two contributions in this context. We first analyze a recent heuristic planning algorithm that was shown to be effective in practice. SOGBOFA [2] builds an approximate algebraic computation graph capturing marginals of state and reward variables under independence assumptions. It then uses automatic differentiation [6] and gradient based search to optimize the action choice. Our analysis shows that the value computed by SOGBOFA's computation graph is identical to the solution of Belief Propagation (BP) when conditioned on actions. This provides an explicit connection between heuristic planning algorithms and approximate inference. 
Inference through algebraic expressions has been explored before [16] and even applied to planning, but both the symbolic representation and the algorithms are different from the ones in SOGBOFA.

Our second contribution is in showing how planning algorithms can be used to solve inference problems, making use of the correspondence in the reverse direction from prior work. The original construction for SOGBOFA can be seen to solve maximum expected utility (MEU) problems with decision variables as roots of the corresponding graphical model and one leaf node representing the value which is being optimized. This corresponds to MMAP problems with MAP variables at the roots and a single evidence node at a leaf. We provide a new reduction from MMAP problems to MEU whose output satisfies these requirements. When combined with the SOGBOFA solver this provides a novel inference algorithm, the algebraic gradient-based solver (AGS), that can solve general MMAP problems. AGS effectively uses a symbolic variant of BP with gradient search. AGS provides an alternative to the mixed-product BP algorithm of [13] and the stochastic local search algorithm of [20]. An experimental evaluation compares AGS to state of the art algorithms for MMAP [14] and illustrates its potential in solving hard inference problems.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

2 Preliminaries

Belief Propagation in Bayesian Networks: For our results it is convenient to refer to the BP algorithm for directed graphs [21]. A Bayesian Network (BN) is given by a directed acyclic graph where each node x is associated with a random variable and a corresponding conditional probability table (CPT) capturing p(x|parents(x)). The joint probability of all variables is given by ∏_i p(x_i|parents(x_i)). 
In this paper we assume that all random variables are binary.

Assume first that the directed graph is a polytree (no underlying undirected cycles). For node x, BP calculates an approximation of p(x|e), which we denote by BEL(x), where e is the total evidence in the graph. Let π(x) ≡ p(x|e⁺) and λ(x) ≡ p(e⁻|x), where e⁺, e⁻ are evidence nodes reachable from x through its parents and children respectively. We use α to represent a normalization constant and β to denote some constant. For a polytree, x separates its parents from its children and we have

BEL(x) = α π(x) λ(x)    (1)

λ(x) = ∏_{j ∈ children(x)} λ_{z_j}(x)    (2)

π(x) = ∑_w p(x|w) ∏_{k ∈ parents(x)} π_x(w_k)    (3)

where λ() and π() incorporate evidence through children and parents respectively. In (3) the sum variable w ranges over all assignments to the parents of x and w_k is the induced value to the kth parent. λ_{z_j}(x) is the message that a child z_j sends to its parent x and π_x(w_k) is the message that a parent w_k sends to x. The messages are given by

λ_x(w_i) = β ∑_x λ(x) ∑_{w_k : k ≠ i} p(x|w) ∏_{k ≠ i} π_x(w_k)    (4)

π_{z_j}(x) = α π(x) ∏_{k ≠ j} λ_{z_k}(x)    (5)

where in (4) w_i is fixed and the sum is over values to the other parents w_k. Since the nodes are binary the messages have two values (i.e., λ_x(w=0) and λ_x(w=1)). The algorithm is initialized by forcing π and λ of evidence nodes to agree with the evidence, setting π of root nodes equal to the prior probability, and setting λ of leaves to (1,1), i.e., an uninformative value. A node can send a message along an edge if all messages from its other edges have been received. 
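As a concrete illustration of the π update in Eq (3), the following is a minimal sketch (ours, not the paper's implementation) of propagating π values forward through a binary network in topological order when there is no evidence, so that λ messages stay uninformative and only forward messages matter:

```python
# A minimal sketch of the forward (pi) update of Eq (3) for binary variables
# with no evidence: each node's belief is computed from its parents' beliefs
# under the product (independence) approximation of BP's pi messages.
from itertools import product

def forward_bp(nodes, parents, cpt):
    """nodes: list in topological order; parents[x]: list of parent names;
    cpt[x]: dict mapping a tuple of parent bits to p(x=1 | parents)."""
    bel = {}  # bel[x] approximates p(x=1)
    for x in nodes:
        ps = parents[x]
        total = 0.0
        for bits in product([0, 1], repeat=len(ps)):
            # weight of this joint parent assignment under the
            # product-of-marginals approximation
            w = 1.0
            for p, b in zip(ps, bits):
                w *= bel[p] if b else 1.0 - bel[p]
            total += cpt[x][bits] * w
        bel[x] = total
    return bel

# Example: the chain a -> b -> c is a polytree, where forward BP is exact.
nodes = ["a", "b", "c"]
parents = {"a": [], "b": ["a"], "c": ["b"]}
cpt = {"a": {(): 0.3},
       "b": {(0,): 0.1, (1,): 0.8},
       "c": {(0,): 0.5, (1,): 0.9}}
bel = forward_bp(nodes, parents, cpt)
# p(b=1) = 0.3*0.8 + 0.7*0.1 = 0.31 ; p(c=1) = 0.31*0.9 + 0.69*0.5 = 0.624
```

This is exactly the single forward pass that, as argued next, suffices when there is no evidence.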
If the graph is a polytree then two passes of messages on the graph yield BEL(x) = p(x|e) for all x [21].

The loopy BP algorithm applies the same updates even if the graph is not a polytree. In this case we initialize all messages to (1,1) and follow the same update rules for messages according to some schedule. The algorithm is not guaranteed to converge, but it often does and it is known to perform well in many cases. The following property of BP is well known:

Lemma 1. If loopy BP is applied to a BN with no evidence, i.e., λ(x) = (1, 1) for all x at initialization, then for any order of message updates and at any time in the execution of loopy BP, for any node x, λ(x) ∝ (1, 1) and λ_x(w) is a constant independent of w for any parent of x. In addition, a single pass updating π messages in topological order converges to the final output of BP.

Proof. We prove the claim by induction. Assume that λ(x) = (o, o), for some value o, and consider the next λ message from x. From Eq (4) we have

λ_x(w_i) = β ∑_x λ(x) ∑_{w_k : k ≠ i} p(x|w) ∏_{k ≠ i} π_x(w_k) = β o ∑_{w_k : k ≠ i} ∑_x p(x|w) ∏_{k ≠ i} π_x(w_k) = β o

where to get the second equality we extract the constant λ(x) = o and reorder the summations. The last equality is true because ∑_x p(x|w) = 1 and ∑_{w_k : k ≠ i} ∏_{k ≠ i} π_x(w_k) = 1. Therefore, λ_x(w_i) is a constant independent of w_i. Now from Eq (2) we see that λ(w_i) = (1, 1) as well, and from Eq (5) and (3) we see that it suffices to update π messages in topological order.

ARollout and SOGBOFA: Stochastic planning is defined using Markov decision processes [22]. 
A MDP [22] is specified by {S, A, T, R, γ}, where S is a finite state space, A is a finite action space, T(s, a, s′) = p(s′|s, a) defines the transition probabilities, R(s, a) is the immediate reward of taking action a in state s, and γ is the discount factor. A policy π : S → A is a mapping from states to actions, indicating which action to choose at each state. Given a policy π, the value function V^π(s) is the expected discounted total reward E[∑_i γ^i R(s_i, π(s_i)) | π], where s_i is the ith state visited by following π (and s_0 = s). The action-value function Q^π : S × A → ℝ is the expected discounted total reward when taking action a at state s and following π thereafter. In this paper we consider finite horizon planning where the trajectories are taken to a fixed horizon h and γ = 1, i.e., no discounting is used.

In factored spaces [1] the state is specified by a set of variables and the number of states is exponential in the number of variables. Similarly, in factored action spaces an action is specified by a set of variables. We assume that all state and action variables are binary. Finite horizon planning can be captured using a dynamic Bayesian network (DBN) where state and action variables at each time step are represented explicitly and the CPTs of variables are given by the transition probabilities. In off-line planning the task is to compute a policy that optimizes the long term reward. In contrast, in on-line planning we are given a fixed limited time t per step and cannot compute a policy in advance. Instead, given the current state, the algorithm must decide on the next action within time t. Then the action is performed, a transition and reward are observed, and the algorithm is presented with the next state. 
This process repeats and the long term performance of the algorithm is evaluated. On-line planning has been the standard evaluation method in recent planning competitions.

AROLLOUT and SOGBOFA perform on-line planning by estimating the value of initial actions at the current state s, Q^π(s, a), where a fixed rollout policy π, typically a random policy, is used in future steps. The AROLLOUT algorithm [3] introduced the idea of algebraic simulation to estimate values but optimized over actions by enumeration. Then [2] showed how algebraic rollouts can be computed symbolically and that the optimization can be done using automatic differentiation [6]. We next review these algorithms. Finite horizon planning can be translated from a high level language (e.g., RDDL [24]) to a dynamic Bayesian network. AROLLOUT transforms the CPT of a node x into a disjoint sum form. In particular, the CPT for x is represented in the form if (c_11|c_12|...) then p_1 ... if (c_n1|c_n2|...) then p_n, where p_i is p(x=1) and the c_ij are conjunctions of parent values which are mutually exclusive and exhaustive. In this notation c_ij is a set of conjunctions having the same conditional probability p(x=1|c) = p_i. The algorithm then performs a forward pass calculating p̂(x), an approximation of the true marginal p(x), for any node x in the graph. p̂(x) is calculated as a function of p̂(c_ij), an estimate of the probability that c_ij is true, which assumes the parents are independent. This is done using the following equations, where nodes are processed in the topological order of the graph:

p̂(x) = ∑_{ij} p(x|c_ij) p̂(c_ij) = ∑_{ij} p_i p̂(c_ij)    (6)

p̂(c_ij) = ∏_{w_k ∈ c_ij} p̂(w_k) ∏_{w̄_k ∈ c_ij} (1 − p̂(w_k)).    (7)

The following example from [2] illustrates AROLLOUT and SOGBOFA. 
The problem has three state variables s(1), s(2) and s(3), and three action variables a(1), a(2), a(3) respectively. In addition we have two intermediate variables cond1 and cond2 which are not part of the state. The transitions and reward are given by the following RDDL expressions, where primed variants of variables represent the value of the variable after performing the action.

cond1 = Bernoulli(0.7)
cond2 = Bernoulli(0.5)
s'(1) = if (cond1) then ~a(3) else false
s'(2) = if (s(1)) then a(2) else false
s'(3) = if (cond2) then s(2) else false
reward = s(1) + s(2) + s(3)

Figure 1: Left: example of SOGBOFA graph construction. Right: Example of reduction from MMAP to MEU. Original graph (top) and transformed graph (bottom).

AROLLOUT translates the RDDL code into algebraic expressions using standard transformations from a logical to a numerical representation. In our example this yields:

s'(1) = (1-a(3))*0.7
s'(2) = s(1)*a(2)
s'(3) = s(2)*0.5
r = s(1) + s(2) + s(3)

These expressions are used to calculate an approximation of marginal distributions over state and reward variables. The distribution at each time step is approximated using a product distribution over the state variables. To illustrate, assume that the state is s0 = {s(1)=0, s(2)=1, s(3)=0}, which we take to be a product of marginals. At the first step AROLLOUT uses a concrete action, for example a0 = {a(1)=1, a(2)=0, a(3)=0}. This gives values for the reward r0 = 0 + 1 + 0 = 1 and state variables s1 = {s(1)=(1−0)∗0.7=0.7, s(2)=0∗0=0, s(3)=1∗0.5=0.5}. In future steps it calculates marginals for the action variables and uses them in a similar manner. For example, if a1 = {a(1)=0.33, a(2)=0.33, a(3)=0.33} we get r1 = 0.7 + 0 + 0.5 = 1.2 and s2 = {s(1)=(1−0.33)∗0.7, s(2)=0.7∗0.33, s(3)=0∗0.5}. Summing the rewards from all steps gives an estimate of the Q value for a0. 
AROLLOUT randomly enumerates values for a0 and selects the one with the highest estimate.

The main observation in SOGBOFA is that instead of calculating numerical values, as illustrated in the example, we can use the expressions computing these values to construct an explicit directed acyclic graph representing the computation steps, where the last node represents the expectation of the cumulative reward. SOGBOFA uses a symbolic representation for the first action and assumes that the rollout uses the random policy. In our example, if the action variables are mutually exclusive (such constraints are often imposed in high level domain descriptions) this gives marginals of a1 = {a(1)=0.33, a(2)=0.33, a(3)=0.33} over these variables. The SOGBOFA graph for our example expanded to depth 1 is shown in Figure 1. The bottom layer represents the current state and action variables. Each node at the next level represents the expression that AROLLOUT would have calculated for that marginal. To expand the planning horizon we simply duplicate the second layer construction multiple times.

Now, given concrete marginals for the action variables at the first step, i.e., a0, one can plug those values into the computation graph and compute the value of the final Q node. This captures the same calculation as AROLLOUT. In addition, the explicit graph allows us to compute the gradients of the Q value with respect to the action variables, using automatic differentiation. We refer the reader to [6] for details on automatic differentiation; the basic idea is similar to backpropagation of gradients in neural network learning, which can be generalized to arbitrary graphs. In this manner we can perform gradient search over marginals for action variables in a0 and effectively select values for the action variables at the first step. 
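The running example can be reproduced in a few lines. The sketch below is our own (not the SOGBOFA implementation) and uses finite differences in place of automatic differentiation; it also ignores the action constraints that SOGBOFA maintains during the search:

```python
# A runnable sketch of the two-step rollout from the running example:
# state s0 = {s(1)=0, s(2)=1, s(3)=0}, first action given as marginals
# a = (a1, a2, a3), later steps use the uniform random policy (1/3 each).
# Q sums the per-step rewards r = s(1) + s(2) + s(3).

def step(s, a):
    s1, s2, s3 = s
    a1, a2, a3 = a
    return ((1 - a3) * 0.7,   # s'(1) = if (cond1) then ~a(3): (1-a3)*0.7
            s1 * a2,          # s'(2) = if (s(1)) then a(2)
            s2 * 0.5)         # s'(3) = if (cond2) then s(2): s(2)*0.5

def q_value(a0, horizon=2):
    s = (0.0, 1.0, 0.0)                 # current state s0
    total, a = 0.0, a0
    for _ in range(horizon):
        total += sum(s)                 # reward at the current state
        s = step(s, a)
        a = (1/3, 1/3, 1/3)             # random rollout policy afterwards
    return total

q = q_value((1.0, 0.0, 0.0))            # r0 = 1, r1 = 0.7 + 0 + 0.5 = 1.2

# Gradient of Q w.r.t. the first-action marginals by central differences;
# SOGBOFA obtains the same gradients symbolically via autodiff on the graph.
def grad_q(a0, eps=1e-6):
    g = []
    for i in range(3):
        up = list(a0); up[i] += eps
        dn = list(a0); dn[i] -= eps
        g.append((q_value(tuple(up)) - q_value(tuple(dn))) / (2 * eps))
    return g
```

At a0 = (1, 0, 0) the gradient with respect to a(3) is −0.7, so gradient ascent pushes a(3) toward 0, which matches the intuition that a(3) blocks s'(1); a full implementation would additionally project the marginals back onto the feasible action set after each step.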
SOGBOFA includes several additional heuristics, including dynamic control of simulation depth, dynamic selection of gradient step size, maintaining domain constraints, and a balance between gradient search and random restarts. Most of this is orthogonal to the topic of this paper and we omit the details.

3 AROLLOUT is Equivalent to BP

We first show that the computation of AROLLOUT can be rewritten as a sum over assignments.

Lemma 2. AROLLOUT's calculation in Eq (6) and (7) is equivalent to (8), where W is the set of assignments to the parents of x:

p̂(x) = ∑_W p(x|W) ∏_{k ∈ parents(x)} p̂(w_k)^{w_k} (1 − p̂(w_k))^{1−w_k}.    (8)

Proof. The sum in (8) can be divided into disjoint sets of assignments according to the c_ij they satisfy. Consider one fixed c_ij. Let w_l be a parent of x which is not in c_ij. Let W(c_ij) be the assignments to the parents of x which satisfy c_ij, and W_{∖w_l}(c_ij) be the assignments to the parents of x except for w_l which satisfy c_ij. Since w_l is not in c_ij, W_{∖w_l}(c_ij) is well defined. We have that

∑_{W ∈ W(c_ij)} p(x|W) ∏_k p̂(w_k)^{w_k} (1 − p̂(w_k))^{1−w_k}    (9)

is equal to ∑_{W ∈ W_{∖w_l}(c_ij)} ( p(x|W, w_l=1) p̂(w_l) + p(x|W, w_l=0)(1 − p̂(w_l)) ) ∏_{k ≠ l} p̂(w_k)^{w_k} (1 − p̂(w_k))^{1−w_k}. Now, since w_l is not in c_ij, the assignment of w_l does not affect the probability of x. 
So for W ∈ W_{∖w_l}(c_ij) we have p(x|W, w_l=1) = p(x|W, w_l=0) = p(x|W) and therefore the above sum can be simplified to

∑_{W ∈ W_{∖w_l}(c_ij)} p(x|W) ∏_{k ≠ l} p̂(w_k)^{w_k} (1 − p̂(w_k))^{1−w_k}.

Applying the same reasoning to all individual w_l ∉ c_ij, we get that Eq (9) is equal to

∑_{W ∈ wp(c_ij)} p(x|W) ∏_{k ∈ c_ij} p̂(w_k)^{w_k} (1 − p̂(w_k))^{1−w_k}

where wp(c_ij) is the set of assignments to variables in c_ij which satisfy c_ij; that is, we removed all parents not in c_ij from the expression. Now, because c_ij is a conjunction, wp(c_ij) includes a single assignment where if w_k ∈ c_ij we have w_k = 1 and if w̄_k ∈ c_ij we have w_k = 0. In addition, for this assignment W we have that p(x=1|W) = p_i. Therefore, the last expression simplifies to

p_i ∏_{w_k ∈ c_ij} p̂(w_k) ∏_{w̄_k ∈ c_ij} (1 − p̂(w_k)).

Finally, because the c_ij are mutually exclusive and exhaustive, the sum over the disjoint sets of assignments is identical to the sum in (6).

Proposition 3. The marginals calculated by AROLLOUT are identical to the marginals calculated by BP on the DBN generated by the planning problem, conditioned on the initial state, initial action and rollout policy, and with no evidence.

Proof. By Lemma 1, λ(x) and λ_x(w_i) are ∝ (1, 1) for all nodes. Therefore, backward messages do not affect the result of BP and we can argue inductively going forward from roots to leaves in the DBN. By Eq (1) and Lemma 1 we have BEL(x) = α π(x) λ(x) = π(x), where the last equality is true because π(x) is always normalized. 
Therefore from Eq (3) we have

BEL(x) = ∑_w p(x|w) ∏_k π_x(w_k).    (10)

Now, from Eq (5) and Lemma 1 we have

π_x(w_k) = π(w_k) = BEL(w_k)    (11)

and substituting (11) into (10) we get

BEL(x) = ∑_w p(x|w) ∏_k BEL(w_k).    (12)

Inductively assuming BEL(w_k=1) = p̂(w_k) and BEL(w_k=0) = 1 − p̂(w_k), we can rewrite (12) as BEL(x=1) = p̂(x) = ∑_W p(x|W) ∏_k p̂(w_k)^{w_k} (1 − p̂(w_k))^{1−w_k}, which is identical to (8).

4 Algebraic Solver for Marginal MAP

Marginal MAP [20, 13, 11, 14] is a complex inference problem seeking a configuration of a subset of variables that maximizes their marginal probability. Recall that the graph construction in SOGBOFA evaluates exactly to the value returned by AROLLOUT. Therefore, the result in the previous section shows that SOGBOFA can be understood as using gradient search for the best action where the evaluation criterion is given using BP but calculated symbolically. In this section, we show that this approach can be used for MMAP, yielding a novel solver for these problems.

The input to a MMAP problem is a Bayesian network G where the nodes in the network are divided into 3 sets E, D, S standing for evidence nodes, MAP (or decision) nodes and sum nodes, with a specification of values to evidence nodes E = e. The goal is to find argmax_{D=d} ∑_{S=s} p(D=d, S=s, E=e). Anytime algorithms are typically scored using the log of the marginal probability; the score for solution D = d is Q = log ∑_{S=s} p(D=d, S=s, E=e). Current state of the art exact algorithms use branch and bound techniques (e.g., [11]). 
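To make the objective concrete, here is a tiny brute-force reference implementation of the MMAP objective just defined; the two-node network and all numbers in it are our own illustration (real instances are far too large for enumeration):

```python
# Brute-force MMAP on a made-up chain D -> S -> E: maximize over the MAP
# variable D the marginal sum over the sum variable S, given evidence E=e.
from math import log

def p_joint(d, s, e):
    # p(D) p(S|D) p(E|S) with illustrative numbers
    pd = 0.6 if d else 0.4
    ps = (0.9 if s else 0.1) if d else (0.2 if s else 0.8)
    pe = (0.7 if e else 0.3) if s else (0.1 if e else 0.9)
    return pd * ps * pe

def mmap(e=1):
    best = None
    for d in (0, 1):                                  # MAP variable: maximize
        marg = sum(p_joint(d, s, e) for s in (0, 1))  # sum variable: marginalize
        if best is None or marg > best[1]:
            best = (d, marg)
    d, marg = best
    return d, log(marg)       # anytime score: log of the marginal probability
```

Here mmap() returns d=1, since p(D=1, E=1) = 0.378 + 0.006 = 0.384 exceeds p(D=0, E=1) = 0.088.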
Various approximation algorithms for MMAP exist, including mixed product belief propagation [13], an extension of BP that directly addresses MMAP and is therefore closely related to the algorithms in this paper.

To make the connection more precise, we show that the optimization problem in SOGBOFA is a maximum expected utility (MEU) problem. In the graphical models literature such problems are formalized using influence diagrams (ID). An influence diagram is a Bayesian network where two additional types of nodes are allowed in addition to random variable nodes R. Decision nodes D represent variables whose values (conditioned on parents) are being optimized, and value nodes V are leaves. The IDs that arise in this paper (arising from SOGBOFA and later as the output of our reduction) satisfy additional syntactic constraints: decision nodes do not have parents and there is a single value node V. We restrict the discussion to such IDs. This avoids subtleties in defining the optimization problem. In this case, given an ID the goal is to find argmax_{D=d} E_{R=r}(V | D=d) = argmax_{D=d} ∑_{R=r} ∑_{V=v} v · p(V=v, R=r | D=d).

Consider SOGBOFA with a fixed rollout policy π and w.l.o.g. assume a single binary node V representing the cumulative reward.¹ For a start action a we have Q = p(V=1|a, π) = E(V|a, π). Now, assuming a uniform prior over a, we have p(V=1|a, π) ∝ p(V=1, a|π) and arg max_a p(V=1|a, π) = arg max_a p(V=1, a|π) = arg max_a ∑_S p(V=1, a, S|π), where S are the state variables. It is obvious from the last expression that SOGBOFA can be seen to solve MEU for the restricted class of IDs and that this is the same as a restricted version of MMAP problems, where the structural constraints on a and V are given above. This implies:

Corollary 4. 
SOGBOFA can be directly used to solve Marginal MAP problems on graphs with parentless MAP nodes and only one evidence node at a leaf.

The question is whether we can use SOGBOFA for general MMAP problems. We next show how this can be done. Mauá [15] gave a reduction from MMAP to maximum expected utility (MEU) in influence diagrams (reduction 5.1) which satisfies our syntactic requirements. The reduction preserves correctness under exact inference. However, with that construction there is no direct forward path that connects decision nodes, downward evidence nodes, and the value node. Recall that SOGBOFA uses forward propagation of marginals in the directed graph. If no such path exists then downward evidence is ignored and the result of forward BP inference is not informative. We give a new reduction that avoids this issue and in this way introduce a new algorithm for MMAP.

Reduction: Let the input for the MMAP problem be given by E, D, S as above. Without loss of generality we may assume that each E_i ∈ E does not have any children. If it does, we can first disconnect the edges to its children and substitute the value of the evidence node directly into the child's CPT. The reduction performs two modifications on the original graph. (1) Each MAP node D_i ∈ D is replaced by three nodes: Din_i, Dout_i and Deq_i. Din_i has all inputs of D_i and the same CPT. Dout_i has no parents and it connects to all the outputs of D_i. Finally, both Din_i and Dout_i are parents of Deq_i, and the CPT for Deq_i specifies that Deq_i is true iff Din_i = Dout_i. (2) We add a new leaf utility node V whose CPT captures a logical conjunction requiring that all evidence nodes have their observed values and that all nodes Deq_i are true. Although V may have many parents, we can represent its CPT symbolically and this does not adversely affect the complexity of our 
algorithm. The CPTs for all other nodes are unchanged except that a parent D_i is changed to Dout_i. The influence diagram problem is to find the setting for the variables in {Dout_i} which maximizes the expected utility E[V | {Dout_i}] = p(V=1 | {Dout_i}). An example of this construction with one evidence node E and one MAP node D is shown in Figure 1. We have:

¹ There are standard reductions (see [13]) showing how to translate a general reward to a binary variable whose expectations are proportional, so that the optimization problems are equivalent.

Proposition 5. Let G1 represent the original MMAP problem, G2 the transformation into MEU, and let E=e be the evidence for the MMAP problem and D=d an assignment to the MAP variables. Then, p_G1(D=d, E=e) = E_G2[V | Dout=d].

Proof. (sketch) We illustrate how the claim can be proved for the example from Figure 1. In this case, p(D=d, E=e) = ∑_A ∑_C p(A) p(D=d|A) p(C|D=d) p(E=e|A, D=d, C). Now in G2,

E[V | Dout=d] = p(Deq=1, E=e | Dout=d)
= ∑_A ∑_C ∑_{Din} p(Deq=1 | Din, Dout=d) p(A) p(Din|A) p(C|Dout=d) p(E=e|A, Dout=d, C).

Now replace the sum over Din ∈ {0, 1} with a sum over the cases Din=Dout=d and Din ≠ Dout, and observe that p(Deq=1 | Din, Dout=d) is 1 in the first case and 0 in the second. Therefore the last expression can be simplified to

∑_A ∑_C p(A) p(Din=d|A) p(C|Dout=d) p(E=e|A, Dout=d, C)

which by construction is identical to the value for G1.

The proof for the general case follows along the same steps. The crucial point is to replace the sum over Din_i into the cases where it is the same vs. 
not equal to Dout_i. This shows that the irrelevant terms cancel out and the remaining terms are identical to the original ones.

The reduction allows us to solve general MMAP problems using the SOGBOFA heuristic:

AGS – Algebraic Gradient Based Solver for MMAP:

1. Given a MMAP problem G1 with evidence E = e, decision nodes D and sum nodes S, use the reduction to obtain a MEU problem G2 with utility node V and decision nodes Dout.
2. Generate the SOGBOFA graph GSOG from the MEU problem, where decision nodes are treated as action nodes and V is the Q node of the planning problem.
3. Use the gradient based optimizer in SOGBOFA (gradient ascent with random restarts) to optimize the marginal probabilities of the variables {Dout_i}.
4. Extract a discrete solution from the marginal probabilities by thresholding: Dout_i = 1 iff p(Dout_i = 1) ≥ 0.5.

Corollary 6. AGS can be used to solve general Marginal MAP problems.

5 Experimental Validation

In this section, we explore the potential of AGS in solving complex MMAP problems. Specifically, we evaluate the anytime performance of AGS and two natural baselines. The first is the Mixed Product BP (MPBP) algorithm of [13]. MPBP uses belief propagation and is therefore related to AGS, but in MPBP the search over MAP variables is integrated into the messages of BP and, like BP, it can be derived from the corresponding optimization problem. The second algorithm is the recently developed Alternating best-first with depth-first AND/OR search (AAOBF) [14]. AAOBF interleaves best-first and depth-first search over an AND/OR search space to compute both anytime solutions (corresponding to lower bounds) as well as upper bounds on the optimal MMAP value. AAOBF was shown to have excellent anytime performance and dominate other algorithms.

For the evaluation we use several problems from the UAI competition 2008. The original challenge problems were given for sum inference, specifying the network and evidence nodes. 
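As a sanity check on the reduction of Section 4, the equality of Proposition 5 can be verified by brute-force enumeration on the small Figure 1 example; this is our own sketch and all CPT numbers in it are made up:

```python
# Brute-force check of Proposition 5 on the Figure 1 example: G1 factorizes
# as p(A) p(D|A) p(C|D) p(E|A,D,C); the reduction splits D into Din/Dout/Deq,
# and V requires Deq=1 together with the observed evidence value of E.
from itertools import product

pA1 = 0.3                                        # p(A=1)
pD1 = {0: 0.2, 1: 0.9}                           # p(D=1 | A=a)
pC1 = {0: 0.4, 1: 0.6}                           # p(C=1 | D=d)
pE1 = {(a, d, c): 0.1 + 0.2*a + 0.3*d + 0.35*c  # p(E=1 | A,D,C), illustrative
       for a in (0, 1) for d in (0, 1) for c in (0, 1)}

def bern(p1, v):                                 # p(X=v) from p(X=1)
    return p1 if v else 1.0 - p1

def g1_marginal(d, e):
    """p_G1(D=d, E=e), marginalizing the sum variables A and C."""
    return sum(bern(pA1, a) * bern(pD1[a], d) * bern(pC1[d], c)
               * bern(pE1[(a, d, c)], e)
               for a, c in product((0, 1), repeat=2))

def g2_utility(d, e):
    """E_G2[V | Dout=d] = p(Deq=1, E=e | Dout=d): Din keeps D's CPT,
    while C and E now read the decision value from Dout."""
    return sum(bern(pA1, a) * bern(pD1[a], din) * (1.0 if din == d else 0.0)
               * bern(pC1[d], c) * bern(pE1[(a, d, c)], e)
               for a, din, c in product((0, 1), repeat=3))
```

Only the terms with Din = Dout survive the Deq indicator, so the two quantities agree for every d and e, exactly as the proof argues.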
Following previous work, we use these for MMAP by selecting a subset of the variables as MAP nodes. To explore the performance of the algorithms we vary the proportion of MAP variables in each instance, and for each fixed ratio we generate 20 MMAP problems by picking the MAP nodes at random.

Figure 2: Experimental results on UAI instances. Each row shows results for one instance. The top row shows results of one run. Other rows show results aggregated over 20 random choices of MAP nodes. The columns correspond to the proportion of MAP variables (0.5, 0.3, 0.2).

For AAOBF we use the implementation of [14] that can process the UAI competition problems directly. AGS requires CPTs as expressions and our implementation extracts such expressions from the tabular representation of the UAI problems as a preprocessing step. This is not computationally demanding because the tabular representation is naturally restricted to have a small number of parents. We use our own implementation of MPBP, and for consistency the MPBP implementation benefits from the same expression representation of CPTs as AGS. More specifically, we use the join graph version of MPBP (algorithm 5 of [13]) and run it on the factor graph which is obtained from the BN. Since the factor graph is not a cluster tree, we are running loopy MPBP. The max clusters of MPBP correspond to individual MAP variables, and sum nodes include both individual sum variables and factors in the original BN. Factor nodes and sum nodes perform the same computations as in standard loopy BP. The max cluster with node i calculates a message to factor j as follows: first calculate the product of all incoming messages from factors other than j. Then, noting that we have binary variables and thus only two entries in a message, zero out the smaller entry if it is strictly smaller. 
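The max-cluster message computation just described is small enough to sketch directly (our own illustrative code, not the paper's implementation):

```python
# Max-cluster message of our loopy MPBP variant: multiply the incoming
# messages elementwise (binary variables, so messages are 2-vectors), then
# zero out the strictly smaller entry; ties keep both entries.

def max_cluster_message(incoming):
    """incoming: list of (m0, m1) messages from factors other than j."""
    m0 = m1 = 1.0
    for a, b in incoming:
        m0 *= a
        m1 *= b
    # the 'max' part of mixed-product BP: keep only the argmax entry
    if m0 < m1:
        m0 = 0.0
    elif m1 < m0:
        m1 = 0.0
    return (m0, m1)
```

Zeroing the smaller entry is what turns the usual sum-product message into a (hard) max-marginal choice over the MAP variable.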
MPBP keeps iterating over updates to nodes until it runs out of time or the maximal change of the messages becomes smaller than 0.0001. While [13] introduce annealing and restarts to improve the performance of this algorithm, we do not use them here. Note that MPBP can get into a "contradiction state" when the graph has logical constraints, i.e., messages can become (0,0) or legal states are ruled out. AGS does not suffer from this problem. However, to enable the comparison we modified the UAI instances, changing any 0 probability to 0.0001 (and 1 to 0.9999). The implementation of AAOBF replaces every 0 with 0.000001 for similar reasons. The solutions of all algorithms are evaluated off line using an exact solver which uses the same code base as AAOBF.

Figure 2 shows the results. Each algorithm is scored using log marginal probability. The plot shows a relative score c_t = (a_t − b)/a_t, where a_t is the score of algorithm a at time t and b is the best score found by any of the algorithms for this instance. This guarantees that relative scores are between 0 and 1, where the best value is 0. When an algorithm finds an inconsistent solution (probability 0) or does not find a solution we replace the score with 1. We show results for 3 problems, where for one problem (top row) we show results for a single run and for two problems we show results aggregated over 20 runs. Comprehensive results with individual runs and aggregated runs on more instances are given in the supplementary material. Each column in Figure 2 corresponds to a different proportion of MAP variables in the instance (0.5, 0.3, 0.2 respectively). The results for individual runs show more clearly transitions between no solution and the first solution for an algorithm, whereas this is averaged in aggregate results. But the trends are consistent across these graphs. 
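The relative score can be written as a small helper (a sketch; the evaluation code itself is not part of the paper):

```python
# Relative anytime score c_t = (a_t - b) / a_t for log-marginal scores
# a_t <= b <= 0, where b is the best score any algorithm achieved on the
# instance; 0 is best, and missing/inconsistent solutions score 1.

def relative_score(a_t, b):
    if a_t is None:          # no solution (or probability-0 solution) at time t
        return 1.0
    return (a_t - b) / a_t
```

Since both a_t and a_t − b are nonpositive for log probabilities, the ratio indeed lies in [0, 1] with 0 for the best algorithm.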
We see that AAOBF\n\nat\n\n8\n\n 0 0.2 0.4 0.6 0.8 1 128 256 512 1024 2048 4096 8192 16384Normalized Average DifferenceLog-Scaled Time in MillisecondsDomain: 50-20-10 Instance: 0 MAP-Ratio: 0.5AGSMPBPAAOBF 0 0.2 0.4 0.6 0.8 1 128 256 512 1024 2048 4096 8192 16384Normalized Average DifferenceLog-Scaled Time in MillisecondsDomain: 50-20-10 Instance: 0 MAP-Ratio: 0.3AGSMPBPAAOBF 0 0.2 0.4 0.6 0.8 1 128 256 512 1024 2048 4096 8192 16384Normalized Average DifferenceLog-Scaled Time in MillisecondsDomain: 50-20-10 Instance: 0 MAP-Ratio: 0.2AGSMPBPAAOBF-0.2 0 0.2 0.4 0.6 0.8 1 1.2 128 256 512 1024 2048 4096 8192 16384Normalized Average DifferenceLog-Scaled Time in MillisecondsDomain: students_03_02-0015 Instance: Aggregating MAP-Ratio: 0.5AGSMPBPAAOBF-0.2 0 0.2 0.4 0.6 0.8 1 128 256 512 1024 2048 4096 8192 16384Normalized Average DifferenceLog-Scaled Time in MillisecondsDomain: students_03_02-0015 Instance: Aggregating MAP-Ratio: 0.3AGSMPBPAAOBF-0.2 0 0.2 0.4 0.6 0.8 1 1.2 128 256 512 1024 2048 4096 8192 16384Normalized Average DifferenceLog-Scaled Time in MillisecondsDomain: students_03_02-0015 Instance: Aggregating MAP-Ratio: 0.2AGSMPBPAAOBF-0.2 0 0.2 0.4 0.6 0.8 1 1.2 128 256 512 1024 2048 4096 8192 16384Normalized Average DifferenceLog-Scaled Time in MillisecondsDomain: blockmap_05_01-0006 Instance: Aggregating MAP-Ratio: 0.5AGSMPBPAAOBF-0.2 0 0.2 0.4 0.6 0.8 1 1.2 128 256 512 1024 2048 4096 8192 16384Normalized Average DifferenceLog-Scaled Time in MillisecondsDomain: blockmap_05_01-0006 Instance: Aggregating MAP-Ratio: 0.3AGSMPBPAAOBF-0.2 0 0.2 0.4 0.6 0.8 1 1.2 128 256 512 1024 2048 4096 8192 16384Normalized Average DifferenceLog-Scaled Time in MillisecondsDomain: blockmap_05_01-0006 Instance: Aggregating MAP-Ratio: 0.2AGSMPBPAAOBF\fFigure 3: Experimental results on different proportion of MAP variables. 
Each row corresponds to one problem.\nEach column corresponds to different running time, from left to right 1, 5 and 10 seconds.\n\nhas a larger initial overhead and AGS and MPBP are faster to \ufb01nd the \ufb01rst solutions and that AGS\nperforms better than MPBP. AAOBF is signi\ufb01cantly affected by the complexity of the conditional\nsum inference problems (i.e., evaluating the score of a speci\ufb01c MAP assignment). For the problems\nwith 50% of MAP variables (and only 50% sum variables) the complexity is not too high and the\nsearch successfully \ufb01nds high quality solutions. For these problems AAOBF dominates both AGS\nand MPBP. On the other hand, with 70% and 80% of sum variables the summation problems are\nharder and AAOBF is slower to \ufb01nd solutions. In this case AGS dominates as it \ufb01nds reasonable\nsolutions fast and improves with time. To further illustrate the impact of summation dif\ufb01culty we\nrun the algorithms in the same setup but with a \ufb01xed bound on run time varying the proportion of\nMAP variables from 0.1 to 0.9. Figure 3 shows results for the same 3 problems averaged over 20\nruns, for run time of 1,5,10 seconds in corresponding columns. Here, we clearly see the transition\nin relative performance of the algorithms as a function of the proposition of MAP variables. We\nalso see that with shorter run time AGS dominates for a larger range of problems. To summarise,\ngiven enough time AAOBF will \ufb01nd an optimal solution and can dominate AGS which is limited by\nthe approximation inherent in BP. However, with a limited time and dif\ufb01cult conditional summation\nproblems AGS provides a better tradeoff in \ufb01nding solutions quickly.\n\n6 Conclusions\n\nThe paper identi\ufb01es a connection between a successful heuristic for planning in large factored spaces\nand belief propagation. The SOGBOFA heuristic performs its estimation symbolically and through\nthat performs its search using gradients. 
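As a schematic illustration of this style of search, the toy sketch below optimizes a hand-invented differentiable value function by projected gradient ascent over relaxed binary choices. The objective, step size, and finite-difference gradient are all our own simplifications; SOGBOFA and AGS instead differentiate their symbolic computation graph with automatic differentiation.

```python
import numpy as np

def toy_value(q):
    # Hypothetical differentiable surrogate for a value: q holds relaxed
    # marginals of two binary choices feeding a reward node. This stands
    # in for the algebraic computation graph; it is not the paper's graph.
    return q[0] * q[1] + 0.5 * (1 - q[0]) * q[1]

def grad(f, q, eps=1e-6):
    # Central finite differences, standing in for automatic differentiation.
    g = np.zeros_like(q)
    for i in range(len(q)):
        d = np.zeros_like(q)
        d[i] = eps
        g[i] = (f(q + d) - f(q - d)) / (2 * eps)
    return g

def gradient_ascent(f, q0, lr=0.1, steps=200):
    q = np.array(q0, dtype=float)
    for _ in range(steps):
        # Gradient step followed by projection back onto [0, 1].
        q = np.clip(q + lr * grad(f, q), 0.0, 1.0)
    return q

q = gradient_ascent(toy_value, [0.5, 0.5])
# Rounding the relaxed q yields a discrete assignment for the max variables.
assignment = (q > 0.5).astype(int)
```

Here both relaxed marginals are driven to 1, and rounding recovers the maximizing assignment; the same pattern of "differentiate the value graph, step, project" underlies the gradient-based search discussed above.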
This suggests a general scheme for approximate MMAP algorithms where the MAP value is represented using an explicit computation graph which is optimized directly through automatic differentiation. The instantiation of this scheme in AGS shows that it improves over the anytime performance of state-of-the-art algorithms on problems with hard summation sub-problems. In addition, while previous work has shown how inference can be used for planning, this paper shows how ideas from planning can be used for inference. We believe that these connections can be further explored to yield improvements in both fields.

Acknowledgments

This work was partly supported by NSF under grant IIS-1616280. Some of the experiments in this paper were performed on the Tufts Linux Research Cluster supported by Tufts Technology Services.

[Figure 3 appears here: plots of normalized average difference versus ratio of MAP nodes for AGS, MPBP, and AAOBF on instances 50-20-10, students_03_02-0015, and blockmap_05_01-0006, at run times of 1000, 5000, and 10000 milliseconds.]

References

[1] Craig Boutilier, Thomas Dean, and Steve Hanks. Planning under uncertainty: Structural assumptions and computational leverage. In Proceedings of the Second European Workshop on Planning, pages 157-171, 1995.

[2] Hao Cui and Roni Khardon. Online symbolic gradient-based optimization for factored action MDPs. In Proc. of the International Joint Conference on Artificial Intelligence, 2016.

[3] Hao Cui, Roni Khardon, Alan Fern, and Prasad Tadepalli. Factored MCTS for large scale stochastic planning. In Proc. of the AAAI Conference on Artificial Intelligence, 2015.

[4] Carmel Domshlak and Jörg Hoffmann. Fast probabilistic planning through weighted model counting. In Proc. of the International Conference on Automated Planning and Scheduling, pages 243-252, 2006.

[5] Thomas Furmston and David Barber. Variational methods for reinforcement learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics, AISTATS, pages 241-248, 2010.

[6] Andreas Griewank and Andrea Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation (2nd ed.). SIAM, 2008.

[7] Thomas Keller and Malte Helmert. Trial-based heuristic tree search for finite horizon MDPs. In Proc. of the International Conference on Automated Planning and Scheduling, 2013.

[8] Igor Kiselev and Pascal Poupart. A novel single-DBN generative model for optimizing POMDP controllers by probabilistic inference. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3112-3113. AAAI Press, 2014.

[9] Andrey Kolobov, Peng Dai, Mausam Mausam, and Daniel S. Weld. Reverse iterative deepening for finite-horizon MDPs with large branching factors. In Proc. of the International Conference on Automated Planning and Scheduling, 2012.

[10] Junkyu Lee, Radu Marinescu, and Rina Dechter. Applying search based probabilistic inference algorithms to probabilistic conformant planning: Preliminary results. In Proceedings of the International Symposium on Artificial Intelligence and Mathematics (ISAIM), 2016.

[11] Junkyu Lee, Radu Marinescu, Rina Dechter, and Alexander T. Ihler. From exact to anytime solutions for marginal MAP. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3255-3262. AAAI Press, 2016.

[12] Qiang Liu and Alexander T. Ihler. Belief propagation for structured decision making. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), pages 523-532, 2012.

[13] Qiang Liu and Alexander T. Ihler. Variational algorithms for marginal MAP. Journal of Machine Learning Research, 14(1):3165-3200, 2013.

[14] Radu Marinescu, Junkyu Lee, Alexander T. Ihler, and Rina Dechter. Anytime best+depth-first search for bounding marginal MAP. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 3775-3782, 2017.

[15] Denis Deratani Mauá. Equivalences between maximum a posteriori inference in Bayesian networks and maximum expected utility computation in influence diagrams. International Journal of Approximate Reasoning, 68:211-229, 2016.

[16] Martin Mladenov, Vaishak Belle, and Kristian Kersting. The symbolic interior point method. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 1199-1205, 2017.

[17] Gerhard Neumann. Variational inference for policy search in changing situations. In Proceedings of the 28th International Conference on Machine Learning, ICML, pages 817-824, 2011.

[18] Duc Thien Nguyen, Akshat Kumar, and Hoong Chuin Lau. Collective multiagent sequential decision making under uncertainty. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3036-3043, 2017.

[19] Davide Nitti, Vaishak Belle, and Luc De Raedt. Planning in discrete and continuous Markov decision processes by probabilistic programming. In ECML/PKDD (2), volume 9285 of Lecture Notes in Computer Science, pages 327-342. Springer, 2015.

[20] James D. Park and Adnan Darwiche. Complexity results and approximation strategies for MAP explanations. Journal of AI Research, 21(1):101-133, 2004.

[21] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1989.

[22] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 1994.

[23] Régis Sabbadin, Nathalie Peyrard, and Nicklas Forsell. A framework and a mean-field algorithm for the local control of spatial processes. International Journal of Approximate Reasoning, 53(1):66-86, 2012.

[24] Scott Sanner. Relational dynamic influence diagram language (RDDL): Language description. Unpublished manuscript, Australian National University, 2010.

[25] Marc Toussaint and Amos Storkey. Probabilistic inference for solving discrete and continuous state Markov decision processes. In Proceedings of the International Conference on Machine Learning, ICML, 2006.

[26] Jan-Willem van de Meent, Brooks Paige, David Tolpin, and Frank Wood. Black-box policy search with probabilistic programs. In Proceedings of the International Conference on Artificial Intelligence and Statistics, AISTATS, pages 1195-1204, 2016.