{"title": "Symbolic Opportunistic Policy Iteration for Factored-Action MDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 2499, "page_last": 2507, "abstract": "We address the scalability of symbolic planning under uncertainty with factored states and actions. Prior work has focused almost exclusively on factored states but not factored actions, and on value iteration (VI) compared to policy iteration (PI). Our first contribution is a novel method for symbolic policy backups via the application of constraints, which is used to yield a new efficient symbolic implementation of modified PI (MPI) for factored action spaces. While this approach improves scalability in some cases, naive handling of policy constraints comes with its own scalability issues. This leads to our second and main contribution, symbolic Opportunistic Policy Iteration (OPI), which is a novel convergent algorithm lying between VI and MPI. The core idea is a symbolic procedure that applies policy constraints only when they reduce the space and time complexity of the update, and otherwise performs full Bellman backups, thus automatically adjusting the backup per state. We also give a memory bounded version of this algorithm allowing a space-time tradeoff. Empirical results show significantly improved scalability over the state-of-the-art.", "full_text": "Symbolic Opportunistic Policy Iteration for Factored-Action MDPs

Aswin Raghavan[a], Roni Khardon[b], Alan Fern[a], Prasad Tadepalli[a]

[a] School of EECS, Oregon State University, Corvallis, OR, USA
{nadamuna,afern,tadepall}@eecs.orst.edu

[b] Department of Computer Science, Tufts University, Medford, MA, USA
roni@cs.tufts.edu

Abstract

This paper addresses the scalability of symbolic planning under uncertainty with factored states and actions. 
Our first contribution is a symbolic implementation of Modified Policy Iteration (MPI) for factored actions that views policy evaluation as policy-constrained value iteration (VI). Unfortunately, a naïve approach to enforcing policy constraints can lead to large memory requirements, sometimes making symbolic MPI worse than VI. We address this through our second and main contribution, symbolic Opportunistic Policy Iteration (OPI), a novel convergent algorithm lying between VI and MPI, which applies policy constraints only if they do not increase the size of the value function representation, and otherwise performs VI backups. We also give a memory-bounded version of this algorithm allowing a space-time tradeoff. Empirical results show significantly improved scalability over state-of-the-art symbolic planners.

1 Introduction

We study symbolic dynamic programming (SDP) for Markov Decision Processes (MDPs) with exponentially large factored state and action spaces. Most prior SDP work has focused on exact [1] and approximate [2, 3] solutions to MDPs with factored states, assuming just a handful of atomic actions. In contrast, many applications are most naturally modeled as having factored actions described in terms of multiple action variables, which yields an exponential number of joint actions. This occurs, e.g., when controlling multiple actuators in parallel, such as in robotics, traffic control, and real-time strategy games. In recent work [4] we extended SDP to factored actions by giving a symbolic VI algorithm that explicitly reasons about action variables. The key bottleneck of that approach is the space and time complexity of computing symbolic Bellman backups, which requires reasoning about all actions at all states simultaneously. 
This paper is motivated by addressing this bottleneck via the introduction of alternative and potentially much cheaper backups.
We start by considering Modified Policy Iteration (MPI) [5], which adds a few policy evaluation steps between consecutive Bellman backups. MPI is attractive for factored-action spaces because policy evaluation does not require reasoning about all actions at all states, but rather only about the current policy's action at each state. Existing work on symbolic MPI [6] assumes a small atomic action space and does not scale to factored actions. Our first contribution (Section 3) is a new algorithm, Factored Action MPI (FA-MPI), that conducts exact policy evaluation steps by treating the policy as a constraint on normal Bellman backups.
While FA-MPI is shown to improve scalability compared to VI in some cases, we observed that in practice the strict enforcement of the policy constraint can cause the representation of value functions to become too large and dominate run time. Our second and main contribution (Section 4) is to overcome this issue using a new backup operator that lies between policy evaluation and a Bellman backup, and hence is guaranteed to converge. This new algorithm, Opportunistic Policy Iteration (OPI), constrains a select subset of the actions in a way that guarantees that there is no growth in the representation of the value function. We also give a memory-bounded version of the above algorithm (Section 5). Our empirical results (Section 6) show that these algorithms are significantly more scalable than FA-MPI and other state-of-the-art algorithms.

Figure 1: Example of a DBN MDP with factored actions.

2 MDPs with Factored State and Action Spaces

In a factored MDP M, the state space S and action space A are specified by finite sets of binary variables X = (X_1, . . . , X_l) and A = (A_1, . . . , A_m) respectively, so that |S| = 2^l and |A| = 2^m. For emphasis we refer to such MDPs as factored-action MDPs (FA-MDPs). The transition function T and reward function R are specified compactly using a Dynamic Bayesian Network (DBN). The DBN model consists of a two-time-step graphical model that shows, for each next state variable X' and the immediate reward, the set of current state and action variables, denoted by parents(X'). Further, following [1], the conditional probability functions are represented by algebraic decision diagrams (ADDs) [7], which represent real-valued functions of boolean variables as a Directed Acyclic Graph (DAG) (i.e., an ADD maps assignments to n boolean variables to real values). We let P^{X'_i} denote the ADD representing the conditional probability table for variable X'_i.
For example, Figure 1 shows a DBN for the SysAdmin domain (Section 6.1). The DBN encodes that the computers c1, c2 and c3 are arranged in a directed ring so that the running status of each is influenced by its reboot action and the status of its predecessor. The right part of Figure 1 shows the ADD representing the dynamics for the state variable running c1. The variable running c1' represents the truth value of running c1 in the next state. The ADD shows that running c1 becomes true if it is rebooted, and otherwise the next state depends on the status of the neighbors. When not rebooted, c1 fails w.p. 0.3 if its neighboring computer c3 has also failed, and w.p. 0.05 otherwise. When not rebooted, a failed computer becomes operational w.p. 0.05.
ADDs support binary operations over the functions they represent (F op G = H if and only if ∀x, F(x) op G(x) = H(x)) and marginalization operators (e.g., marginalize x via maximization in G(y) = max_x F(x, y) and through sum in G(y) = Σ_x F(x, y)). Operations between diagrams will be represented using the usual symbols +, ×, max etc., and the distinction between scalar operations and operations over functions should be clear from context. 
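The ADD operations just described can be illustrated with a minimal, table-backed stand-in (real ADD packages share isomorphic subgraphs in a DAG; this sketch only mirrors the semantics of pointwise application and marginalization, and all function names here are ours, not from the paper):

```python
# A minimal stand-in for ADD operations: real-valued functions of boolean
# variables represented as {assignment-tuple: value} tables. Real ADDs share
# structure in a DAG; this sketch only mirrors the semantics of the
# operations (apply, max/sum marginalization) described in the text.
from itertools import product

def make_fn(variables, table):
    """A function over `variables` (tuple of names), stored as a dict from
    boolean assignment tuples to real values."""
    return {"vars": tuple(variables), "table": dict(table)}

def apply_op(f, g, op):
    """Pointwise binary operation: (F op G)(x) = op(F(x), G(x))."""
    vs = tuple(dict.fromkeys(f["vars"] + g["vars"]))  # union, order-preserving
    table = {}
    for bits in product((False, True), repeat=len(vs)):
        asg = dict(zip(vs, bits))
        fv = f["table"][tuple(asg[v] for v in f["vars"])]
        gv = g["table"][tuple(asg[v] for v in g["vars"])]
        table[bits] = op(fv, gv)
    return make_fn(vs, table)

def marginalize(f, var, op):
    """Eliminate `var`: G(y) = op(F(var=0, y), F(var=1, y))."""
    i = f["vars"].index(var)
    vs = f["vars"][:i] + f["vars"][i + 1:]
    table = {}
    for bits, val in f["table"].items():
        key = bits[:i] + bits[i + 1:]
        table[key] = val if key not in table else op(table[key], val)
    return make_fn(vs, table)
```

With `op = max` this is the max-marginalization G(y) = max_x F(x, y); with addition it is the Σ-marginalization used to eliminate next-state variables below. The symbolic versions scale with diagram size rather than, as here, with the full truth table.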
Importantly, these operations are carried out symbolically and scale polynomially in the size of the ADD rather than in the potentially exponentially larger tabular representation of the function. ADD operations assume a total ordering O on the variables and impose that ordering on the DAG structure (interior nodes) of any ADD.
SDP uses the compact MDP model to derive compact value functions by iterating symbolic Bellman backups that avoid enumerating all states. It has the advantage that the value function is exact while often being much more compact than explicit tables. Early SDP approaches such as SPUDD [1] only represented the structure in the state variables and enumerated over actions, so that space and time are at least linearly related to the number of actions, and hence exponential in m.
In recent work, we extended SDP to factored action spaces by computing Bellman backups using an algorithm called Factored Action Regression (FAR) [4]. This is done by implementing the following equations using ADD operations over a representation like Figure 1. Let T^Q(V) denote the backup operator that computes the next iterate of the Q-value function starting with value function V,

    T^Q(V) = R + γ Σ_{X'_1} P^{X'_1} · · · Σ_{X'_l} P^{X'_l} × primed(V)    (1)

Then T(V) = max_{A_1} . . . max_{A_m} T^Q(V) gives the next iterate of the value function. Repeating this process we get the VI algorithm. Here primed(V) swaps the state variables X in the diagram V with next state variables X' (c.f. the DBN representation for next state variables). 
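Semantically, Equation 1 computes, for every (state, action) assignment, R + γ·E[V(next state)], with the expectation factored over the per-variable CPTs. As a cross-check of that semantics, a brute-force enumerative backup on a toy two-bit problem (the dynamics, rewards, and function names below are hypothetical, loosely styled after SysAdmin; the symbolic algorithm of course never enumerates like this):

```python
from itertools import product

GAMMA = 0.9

# Toy FA-MDP (hypothetical numbers): one state bit per "computer", one
# action bit "reboot computer 0". P(X_i' = 1 | state, action) factors
# over the next-state variables.
def p_next(i, state, action):
    if action[0] and i == 0:
        return 1.0                      # rebooting makes computer 0 run
    return 0.95 if state[i] else 0.05   # otherwise mostly keep status

def reward(state, action):
    return sum(state) - 0.75 * sum(action)

def backup(V, n_state_bits=2, n_action_bits=1):
    """Enumerative version of Eq. 1 followed by max over actions:
    V'(s) = max_a [ R(s,a) + gamma * sum_{s'} prod_i P(s'_i|s,a) * V(s') ]."""
    states = list(product((0, 1), repeat=n_state_bits))
    actions = list(product((0, 1), repeat=n_action_bits))
    V2 = {}
    for s in states:
        best = float("-inf")
        for a in actions:
            ev = 0.0
            for s2 in states:
                prob = 1.0
                for i, bit in enumerate(s2):
                    p1 = p_next(i, s, a)
                    prob *= p1 if bit else (1.0 - p1)
                ev += prob * V[s2]
            best = max(best, reward(s, a) + GAMMA * ev)
        V2[s] = best
    return V2
```

The symbolic T^Q computes exactly this table, but as an ADD over state and action variables, eliminating each X'_i by Σ-marginalization instead of looping over joint next states.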
Equation 1 should be read right to left as follows: each probability diagram P^{X'_i} assigns a probability to X'_i from assignments to Parents(X'_i) ⊆ (X, A), introducing the variables Parents(X'_i) into the value function. The Σ marginalization eliminates the variable X'_i. We arrive at the Q-function that maps variable assignments ⊆ (X, A) to real values. Written in this way, where the domain dynamics are explicitly expressed in terms of action variables and where max_A = max_{A_1,...,A_m} is a symbolic marginalization operation over action variables, we get the Factored Action Regression (FAR) algorithm [4].
In the following, we use T() to denote a Bellman-like backup, where the superscript in T^Q() denotes that actions are not maximized out so the output is a function of states and actions, and the subscript in T_π(), defined below, denotes that the update is restricted to the actions in π. Similarly, T^Q_π() restricts to a (possibly partial) policy π and does not maximize over the unspecified action choice.
In this work we will build on Modified Policy Iteration (MPI), which generalizes value iteration and policy iteration by interleaving k policy evaluation steps between successive Bellman backups [5]. Here a policy evaluation step corresponds to iterating exact policy backups, denoted by T_π, where the action is prescribed by the policy π in each state. MPI has the potential to speed up convergence over VI because, at least for flat action spaces, policy evaluation is considerably cheaper than full Bellman backups. In addition, when k > 0, one might hope for larger jumps in policy improvement because the greedy action in T is based on a more accurate estimate of the value of the policy.
Interestingly, the first approach to symbolic planning in MDPs was a version of MPI for factored states called Structured Policy Iteration (SPI) [6], which was later adapted to relational problems [8]. 
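On a flat MDP, the MPI scheme just described reduces to a simple loop: one greedy Bellman backup (the improvement step), then k backups under the fixed greedy policy. A minimal sketch, with a hypothetical two-state toy MDP; k = 0 recovers VI, and large k approaches PI:

```python
# Flat-MDP sketch of Modified Policy Iteration. Each outer iteration does
# one Bellman backup (policy improvement) followed by k policy-evaluation
# backups T_pi. The toy transition/reward tables are hypothetical.
GAMMA, EPS = 0.9, 1e-6

# P[s][a] = list of (prob, next_state); R[s][a] = immediate reward
P = {0: {0: [(1.0, 0)], 1: [(0.8, 1), (0.2, 0)]},
     1: {0: [(1.0, 1)], 1: [(1.0, 0)]}}
R = {0: {0: 0.0, 1: -0.1}, 1: {0: 1.0, 1: 0.0}}

def q(V, s, a):
    return R[s][a] + GAMMA * sum(p * V[t] for p, t in P[s][a])

def mpi(k):
    V = {s: 0.0 for s in P}
    while True:
        # policy improvement: one full Bellman backup, recording the argmax
        pi = {s: max(P[s], key=lambda a: q(V, s, a)) for s in P}
        newV = {s: q(V, s, pi[s]) for s in P}
        if max(abs(newV[s] - V[s]) for s in P) < EPS:
            return newV, pi
        V = newV
        # k steps of policy evaluation (policy backups T_pi)
        for _ in range(k):
            V = {s: q(V, s, pi[s]) for s in P}
```

Both settings converge to the same optimal value function; the point of the paper is to carry out the T_π steps symbolically when states and actions are factored.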
SPI represents the policy as a decision tree with state variables labeling interior nodes and a concrete action at each leaf node. The policy backup uses the graphical form of the policy. In each such backup, for each leaf node (policy action) a in the policy tree, its Q-function Q_a is computed and attached to the leaf. Although SPI leverages the factored state representation, it represents the policy in terms of concrete joint actions, which fails to capture the structure among the action variables in FA-MDPs. In addition, in factored action spaces this requires an explicit calculation of Q-functions for all joint actions. Finally, the space required for a policy backup can be prohibitive because each Q-function Q_a is joined to each leaf of the policy. SPI goes to great lengths in order to enforce a policy backup which, intuitively, ought to be much easier to compute than a Bellman backup. In fact, we are not aware of any implementation of this algorithm that scales well for FA-MDPs or even for factored state spaces. The next section provides an alternative algorithm.

3 Factored Action MPI (FA-MPI)

In this section, we introduce Factored Action MPI (FA-MPI), which uses a novel form of policy backup. Pseudocode is given in Figure 2. Each iteration of the outer while loop starts with one full Bellman backup using Equation 1, i.e., policy improvement. The inner loop performs k steps of policy backups using a new algorithm, described below, that avoids enumerating all actions.
We represent the policy using a Binary Decision Diagram (BDD) with state and action variables, where a leaf value of 1 denotes any combination of action variables that is the policy action, and a leaf value of −∞ indicates otherwise. 
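The effect of this 1/−∞ policy representation can be seen in a tabular sketch: folding the constraint into a Q-function sends off-policy (state, action) pairs to −∞, so a subsequent max over actions recovers the policy's own backup. The states, actions, and Q-values below are hypothetical:

```python
# Tabular sketch of the policy-as-constraint idea: pi maps (state, action)
# to 1 (on-policy) or -inf (off-policy). Folding the constraint into the
# backup replaces off-policy entries by -inf, so a subsequent max over
# actions recovers the policy backup T_pi. All values are hypothetical.
NEG_INF = float("-inf")

Q = {("s0", "a0"): 5.0, ("s0", "a1"): 7.0,
     ("s1", "a0"): 2.0, ("s1", "a1"): 4.0}

# Policy: take a0 everywhere, encoded the way a 1 / -inf diagram would.
pi = {("s0", "a0"): 1.0, ("s0", "a1"): NEG_INF,
      ("s1", "a0"): 1.0, ("s1", "a1"): NEG_INF}

def constrained_max(Q, pi, states, actions):
    """max_A over the constrained Q: off-policy pairs become -inf (the
    effect of the pi x primed(V) product) and never win the maximization."""
    return {s: max(Q[(s, a)] if pi[(s, a)] == 1.0 else NEG_INF
                   for a in actions)
            for s in states}
```

Here the constrained max returns the policy's action values (5.0 and 2.0), whereas an unconstrained max would return the greedy values (7.0 and 4.0). In the symbolic algorithm the same masking is achieved by the diagram product π × primed(V) before the Σ and max marginalizations.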
Using this representation, we perform policy backups using T^Q_π(V) given in Equation 2 below, followed by a max over the actions in the resulting diagram. In this equation, the diagram resulting from the product π × primed(V) sets the value of all off-policy state-actions to −∞ before computing any value for them[1], and this ensures correctness of the update as indicated by the next proposition.

    T^Q_π(V) = [ R + γ Σ_{X'_1} P^{X'_1} · · · Σ_{X'_l} P^{X'_l} × (π × primed(V)) ]    (2)

[1] Notice that T^Q_π is equivalent to π × T^Q but the former is easier to compute.

Algorithm 3.1: FA-MPI/OPI(k)
  V^0 ← 0, i ← 0
  (V^{i+1}_0, π^{i+1}) ← max_A T^Q(V^i)
  while ||V^{i+1}_0 − V^i|| > ε do
    for j ← 1 to k do
      for algorithm FA-MPI: V^{i+1}_j ← max_A T^Q_{π^{i+1}}(V^{i+1}_{j−1})
      for algorithm OPI:    V^{i+1}_j ← max_A T̂^Q_{π^{i+1}}(V^{i+1}_{j−1})
    V^{i+1} ← V^{i+1}_k
    i ← i + 1
    (V^{i+1}_0, π^{i+1}) ← max_A T^Q(V^i)
  return π^{i+1}

Figure 2: Factored Action MPI and OPI.

Algorithm 3.2: P(D, π)
  d ← variable at the root node of D
  c ← variable at the root node of π
  if d occurs after c in the ordering then P(D, max(π_T, π_F))
  else if d = c then ADD(d, P(D_T, π_T), P(D_F, π_F))
  else if d occurs before c in the ordering then ADD(d, P(D_T, π), P(D_F, π))
  else if π = −∞ then return −∞
  else return D

Figure 3: Pruning procedure for an ADD. 
Subscripts T and F denote the true and false child respectively.

Proposition 1. FA-MPI computes exact policy backups, i.e., max_A T^Q_π = T_π.

The proof uses the fact that (s, a) pairs that do not agree with the policy get a value −∞ via the constraints and therefore do not affect the maximum. While FA-MPI can lead to improvements over VI (i.e., FAR), like SPI, FA-MPI can lead to large space requirements in practice. In this case, the bottleneck is the ADD product π × primed(V), which can be exponentially larger than primed(V) in the worst case. The next section shows how to approximate the backup in Equation 2 while ensuring no growth in the size of the ADD.

4 Opportunistic Policy Iteration (OPI)

Here we describe Opportunistic Policy Iteration (OPI), which addresses the shortcomings of FA-MPI. As seen in Figure 2, OPI is identical to FA-MPI except that it uses an alternative, more conservative policy backup. The policies generated by FA-MPI (and MPI) may not all have compactly representable ADDs. Fortunately, finding the optimal value function may not require representing the values of the intermediate policies exactly. The key idea in OPI is to enforce the policy constraints opportunistically, i.e., only when they do not increase the size of the value function representation.
In an exponential action space, we can sometimes expect a Bellman backup to be a coarser partitioning of state variables than the value function of a given policy (e.g., two states that have the same value under the optimal action may have different values under the policy action). In this case, enforcing the policy constraint via T^Q_π(V) is actually harmful in terms of the size of the representation. OPI is motivated by retaining the coarseness of Bellman backups in some states, and otherwise enforcing the policy constraint. 
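The pruning recursion of Figure 3 can be sketched on tree-structured diagrams. This is a simplification: real ADDs share subgraphs and cache results, and the variable ordering and example diagrams below are hypothetical:

```python
# Sketch of the pruning procedure P(D, pi) of Figure 3 on tree-structured
# diagrams. A diagram is either a leaf value or ("node", var, true_child,
# false_child); variables are compared via a fixed ordering.
NEG_INF = float("-inf")
ORDER = ["x1", "x2", "a1"]          # hypothetical variable ordering

def is_leaf(d):
    return not (isinstance(d, tuple) and d[0] == "node")

def dmax(f, g):
    """Pointwise max of two constraint diagrams, used to collapse a
    constraint variable that does not appear in D."""
    if is_leaf(f) and is_leaf(g):
        return max(f, g)
    var = min((d[1] for d in (f, g) if not is_leaf(d)), key=ORDER.index)
    ft, ff = (f[2], f[3]) if not is_leaf(f) and f[1] == var else (f, f)
    gt, gf = (g[2], g[3]) if not is_leaf(g) and g[1] == var else (g, g)
    return ("node", var, dmax(ft, gt), dmax(ff, gf))

def prune(D, pi):
    """P(D, pi): send paths of D that cannot satisfy pi to -inf, without
    introducing into D any variable of pi that is absent from D."""
    if pi == NEG_INF:
        return NEG_INF
    if is_leaf(pi):
        return D
    if is_leaf(D) or ORDER.index(D[1]) > ORDER.index(pi[1]):
        # pi's root variable does not appear in D: collapse it by max,
        # so a path is pruned only if all its extensions violate pi
        return prune(D, dmax(pi[2], pi[3]))
    if D[1] == pi[1]:
        return ("node", D[1], prune(D[2], pi[2]), prune(D[3], pi[3]))
    return ("node", D[1], prune(D[2], pi), prune(D[3], pi))
```

Note how, when the constraint's root variable is absent from D, the constraint is collapsed by max over its children rather than multiplied in: the result therefore never gains nodes relative to D, which is the size guarantee OPI exploits.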
The OPI backup is sensitive to the size of the value ADD so that it is\nguaranteed to be smaller than the results of both Bellman backup and policy backup.\nFirst we describe the symbolic implementation of OPI . The trade-off between policy evaluation\nand policy improvement is made via a pruning procedure (pseudo-code in Figure 3). This procedure\nassigns a value of \u2212\u221e to only those paths in a value function ADD that violate the policy constraint\n\u03c0. The interesting case is when the root variable of \u03c0 is ordered below the root of D (and thus\ndoes not appear in D) so that the only way to violate the constraint is to violate both true and false\nbranches. We therefore recurse D with the diagram max{\u03c0T , \u03c0F}.\nExample 1. The pruning procedure is illustrated in Figure 4. Here the input function D does not\ncontain the root variable X of the constraint, and the max under X is also shown. The result of\npruning P(D, \u03c0) is no more complex than D, whereas the product D \u00d7 \u03c0 is more complex.\n\nClearly, the pruning procedure is not sound for ADDs because there may be paths that violate the\npolicy, but are not explicitly represented in the input function D. In order to understand the result\nof P, let p be a path from a root to a leaf in an ADD. The path p induces a partial assignment to the\n\n4\n\n\fFigure 4: An example for pruning. D and \u03c0 denote the given function and constraint respectively.\nThe result of pruning is no larger than D, as opposed to multiplication. T (true) and F (false)\nbranches are denoted by the left and the right child respectively.\n\nvariables in the diagram. Let E(p) be the set of all extensions of this partial assignment to complete\nassignments to all variables. As established in the following proposition, a path is pruned if none of\nits extensions satis\ufb01es the constraint.\nProposition 2. Let G = P(D, \u03c0) where leaves in D do not have the value \u2212\u221e. 
Then for all paths p in G we have:
1. p leads to −∞ in G iff ∀y ∈ E(p), π(y) = −∞.
2. p does not lead to −∞ in G iff ∀y ∈ E(p), G(y) = D(y).
3. The size of the ADD G is smaller or equal to the size of D.

The proof (omitted due to space constraints) uses structural induction on D and π. The novel backup introduced in OPI interleaves the application of pruning with the summation steps so as to prune the diagram as early as possible. Let P_π(D) be shorthand for P(D, π). The backup used by OPI, which is shown in Figure 2, is

    T̂^Q_π(V) = P_π[ P_π(R) + γ P_π(Σ_{X'_1} P^{X'_1} · · · P_π(Σ_{X'_l} P^{X'_l} × primed(V))) ]    (3)

Using the properties of P we can show that T̂^Q_π(V) overestimates the true backup of a policy, but is still bounded by the true value function.

Theorem 1. The policy backup used by OPI is bounded between the full Bellman backup and the true policy backup, i.e., T_π ≤ max_A T̂^Q_π ≤ T.

Since none of the value functions generated by OPI overestimate the optimal value function, it follows that both OPI and FA-MPI converge to the optimal policy under the same conditions as MPI [5]. However, the sequence of value functions/policies generated by OPI is in general different from, and potentially more compact than, those generated by FA-MPI. The relative compactness of these policies is empirically investigated in Section 6. The theorem also implies that OPI converges at least as fast as FA-MPI to the optimal policy, and may converge faster.
In terms of a flat MDP, OPI can be interpreted as sometimes picking a greedy off-policy action while evaluating a fixed policy, when the value function of the greedy policy is at least as good and more compact than that of the given policy. 
Thus, OPI may be viewed as asynchronous policy iteration [9]. However, unlike traditional asynchronous PI, the policy improvement in OPI is motivated by the size of the representation, rather than by any measure of the magnitude of improvement.

Example 2. Consider the example in Figure 5. Suppose that π is a policy constraint that says that the action variable A1 must be true when the state variable X2 is false. The backup T^Q(R) does not involve X2 and therefore pruning does not change the diagram and P_π(T^Q(R)) = T^Q(R). The max chooses A1 = true in all states, regardless of the value of X2, a greedy improvement. Note that the improved policy (always set A1) is more compact than π, and so is its value. In addition, P_π(T^Q(R)) is coarser than π × T^Q(R).

5 Memory-Bounded OPI

Memory is usually a limiting factor for symbolic planning. In [4] we proposed a symbolic memory-bounded (MB) VI algorithm for FA-MDPs, which we refer to below as Memory Bounded Factored Action Regression (MBFAR).

(a) A simple policy for an MDP with two state variables, X1 and X2, and one action variable A1. (b) Optimal policy backup in FA-MPI. (c) OPI backup. Note the smaller size of the value function.

Figure 5: An illustration where OPI computes an incorrect but more compact value function that is a partial policy improvement. T (true) and F (false) branches are denoted by the left and the right child respectively.

MBFAR generalizes SPUDD and FAR by flexibly trading off computation time for memory. The key idea is that a backup can be computed over a partially instantiated action, by fixing the value of an action variable. MBFAR computes what [10] called "Z-value functions": optimal value functions for partially specified actions. 
But in contrast to their work, where the set of partial actions is hand-coded by the designer, MBFAR is domain-independent and depends on the complexity of the value function. In terms of time to convergence, computing these subsets on the fly may lead to some overhead, but in some cases may lead to a speedup. Memory Bounded FA-MPI (MB-MPI) is a simple extension that uses MBFAR in place of FAR for the backups in Figure 2. MB-MPI is parametrized by k, the number of policy backups, and M, the maximum size (in nodes) of a Z-value function. MB-MPI generalizes MPI in that MB-MPI(k, 0) is the same as SPI(k) [6] and MB-MPI(k, ∞) is FA-MPI(k). Also, MB-MPI(0, 0) is SPUDD [1] and MB-MPI(0, ∞) is FAR [4]. We can also combine OPI with memory-bounded backups; we call this algorithm MB-OPI. Since both MB-MPI and OPI address space issues in FA-MPI, the question is whether one dominates the other and whether their combination is useful. This is addressed in the experiments.

6 Experiments

In this section, we experimentally evaluate the algorithms and the contributions of their different components.

6.1 Domain descriptions

The following domains were described using the Relational Dynamic Influence Diagram Language (RDDL) [11]. We ground the relational description to arrive at an MDP similar to Figure 1. In our experiments the variables in the ADDs are ordered so that parents(X'_i) occur above X'_i, and the X'_i are ordered by |parents(X'_i)|. We heuristically chose to do the expectation over state variables in a top-down way, and the maximization over action variables in a bottom-up way with respect to the variable ordering.

Inventory Control (IC): This domain consists of n independent shops, each full or empty, that can be filled by a deterministic action. The total number of shops that can be filled in one time step is restricted. 
The rate of arrival of a customer is distributed independently and identically for all shops as Bernoulli(p) with p = 0.05. A customer at an empty shop continues to wait with a reward of -1 until the shop is filled, and gives a reward of -0.35. An instance of IC with n shops and m trucks has a joint state and action space of size 2^{2n} and Σ_{i=0}^{m} C(n, i) respectively.

SysAdmin: The "SysAdmin" domain was part of the IPC 2011 benchmark and was introduced in earlier work [12]. It consists of a network of n computers connected in a given topology. Each computer is either running (reward of +1) or failed (reward of 0), so that |S| = 2^n, and each computer has an associated deterministic action of rebooting (with a cost of -0.75), so that |A| = 2^n. We restrict the number of computers that can be rebooted in one time step. Unlike the previous domain, the exogenous events are not independent of one another. A running computer that is not being rebooted is running in the next state with probability p proportional to the number of its running neighbors, where p = 0.45 + 0.5 × (1 + n_r)/(1 + n_c), n_r is the number of neighboring computers that have not failed, and n_c is the number of neighbors.

Figure 6: Impact of policy evaluation: Parallel actions vs. Time. In Star and Unidirectional networks VI was stopped at a time limit of six hours and the Bellman error is annotated.

Figure 7: Impact of Pruning. EML denotes Exceeded Memory Limit and the Bellman error is denoted in parentheses.

We test this domain on three topologies of increasing difficulty, viz. 
a star topology, a unidirectional ring, and a bidirectional ring.

Elevator control: We consider the problem of controlling m elevators in a building with n floors. A state is described as follows: for each floor, whether a person is waiting to go up or down; for each elevator, whether a person inside the elevator is going up or down, whether the elevator is at each floor, and its current direction (up or down). A person arrives at a floor f, independently of other floors, with probability Bernoulli(p_f), where p_f is drawn from Uniform(0.1, 0.3) for each floor. Each person gets into an elevator if it is at the same floor and has the same direction (up or down), and exits at the top or bottom floor based on his direction. Each person gets a reward of -1 when waiting at a floor and -1.5 if he is in an elevator that is moving in a direction opposite to his destination. There is no reward if the directions are the same. Each elevator has three actions: move up or down by one floor, or flip its direction.

6.2 Experimental validation

In order to evaluate scaling with respect to the action space we fix the size of the state space and measure time to convergence (Bellman error less than 0.1 with a discount factor of 0.9). Experiments were run on a single core of an Intel Core 2 Quad 2.83GHz with a 4GB limit. The charts denote OPI with k steps of evaluation as OPI(k), and MB-OPI with memory bound M as MB-OPI(k, M) (similarly FA-MPI(k) and MB-MPI(k, M)). In addition, we compare to symbolic value iteration:

Figure 8: Impact of policy evaluation in Elevators.

Figure 9: Impact of memory bounding. 
EML denotes Exceeded Memory Limit.

Domain       |  Compression in V, by # parallel actions  |  Compression in π, by # parallel actions
             |    2      3      4      5      6      7   |    2       3       4       5       6       7
IC(8)        |  0.06   0.03   0.03   0.02   0.02   0.02  |  0.28    0.36    0.35    0.20    0.09    0.03
Star(11)     |  0.67   0.58   0.50   0.40   0.37   0.35  |  1.8e-4  2.3e-4  2.1e-4  1.9e-4  1.4e-4  9.6e-5
Biring(10)   |  0.96   0.96   0.95   0.94   0.88   0.80  |  1.1e-3  1.3e-3  1.2e-3  1.1e-3  9.8e-4  7.4e-4
Uniring(10)  |  0.99   0.99   0.99   0.99   0.99   0.99  |  9.3e-4  1e-3    9.4e-4  8.2e-4  5.2e-4  2.9e-4

Table 1: Ratio of size of ADD function to a table.

the well-established baseline for factored states, SPUDD [1], and factored 
states and actions FA-MPI(0). Since both are variants of VI we will denote the better of the two as VI in the charts.

Impact of policy evaluation: We compare symbolic VI and OPI in Figure 6. For Inventory Control, as the number of parallel actions increases, SPUDD takes increasingly more time but FA-MPI(0) takes increasingly less time, giving VI a bell-shaped profile. An increase in the steps of evaluation in OPI(2) and OPI(5) leads to a significant speedup. For the SysAdmin domain, we tested three different topologies. For all the topologies, as the size of the action space increases, VI takes an increasing amount of time. OPI scales significantly better and does better with more steps of policy evaluation, suggesting that more lookahead is useful in this domain. In the Elevator Control domain (Figure 8) OPI(2) is significantly better than VI, and OPI(5) is marginally better than OPI(2). Overall, we see that more evaluation helps, and that OPI is consistently better than VI.

Impact of pruning: We compare OPI vs. FA-MPI to assess the impact of pruning. Figure 7 shows that with increasing state and action spaces FA-MPI exceeds the memory limit (EML) whereas OPI does not, and that when both converge OPI converges much faster. In Inventory Control, FA-MPI exceeds the memory limit on five out of the seven instances, whereas OPI converges in all cases. In SysAdmin, the plot shows the percentage of extra time FA-MPI takes relative to OPI. On the largest problem, FA-MPI exceeds the memory limit, and is at least 150% slower than OPI. In Elevator Control, FA-MPI exceeds the memory limit while OPI does not, and FA-MPI is at least 250% slower.

Impact of memory-bounding: Even though memory bounding can mitigate the memory problem in FA-MPI, it can cause a large overhead in time, and can still exceed the limit due to intermediate steps in the exact policy backups. Figure 9 shows the effect of memory bounding. MB-OPI scales 
MB-OPI scales better than either MB-MPI or OPI. In the IC domain, MB-MPI is much worse than MB-OPI in time, and MB-MPI exceeds the memory limit in two instances. In the SysAdmin domain, the figure shows that combining pruning and memory bounding is better than either one separately. A similar time profile is seen in the Elevators domain (results omitted).

Representation compactness: The main bottleneck to scalability beyond our current results is the growth of the value and policy diagrams with problem complexity, which is a function of how well our ADD representation suits the problem at hand. To illustrate this, Table 1 shows the compression obtained by representing the optimal value functions and policies as ADDs rather than tables. We observe orders-of-magnitude compression for policies, which shows that ADDs are able to capture the rich structure in policies. The compression ratio for value functions is less impressive and surprisingly close to 1 for the Uniring domain, showing that for these domains ADDs are less effective at capturing the structure of the value function. Possible future directions include better alternative symbolic representations as well as approximations.

7 Discussion
This paper presented symbolic variants of MPI that scale to large action spaces and generalize and improve over state-of-the-art algorithms. The insight that the policy can be treated as a loose constraint within value iteration steps gives a new interpretation of MPI. Our algorithm OPI computes some policy improvements during policy evaluation and is related to Asynchronous Policy Iteration [9]. Further scalability can be achieved by incorporating approximate value backups (e.g., similar to APRICODD [2]) as well as potentially more compact representations (e.g., Affine ADDs [3]). Another avenue for scalability is to use initial-state information to focus computation.
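To make the representation-size issue concrete, the sketch below counts the internal nodes of a reduced ordered ADD for an additively structured value function V(x) = sum_i w_i x_i over n boolean variables. It is a minimal, hypothetical stand-in for a real ADD package (not the implementation used in our experiments), and the function `add_internal_nodes` is illustrative only. When many weights coincide, isomorphic sub-diagrams are shared and the diagram is far smaller than the 2^n-entry table; when all weights are distinct there is almost no sharing, mirroring the contrast between the IC(8) and Uniring(10) ratios in Table 1.

```python
# Toy illustration of the ADD-vs-table compression ratios in Table 1.
# A minimal, hypothetical stand-in for a real ADD library, not the
# implementation used in the experiments reported above.

def add_internal_nodes(weights):
    """Internal node count of a reduced ordered ADD for
    V(x) = sum_i w_i * x_i over boolean variables, in variable order."""
    n = len(weights)
    leaves = {}     # terminal value -> node id
    internal = {}   # (level, low_id, high_id) -> node id (unique table)
    next_id = [0]

    def intern(table, key):
        # Canonicity: structurally identical nodes get the same id.
        if key not in table:
            table[key] = next_id[0]
            next_id[0] += 1
        return table[key]

    def build(level, acc):
        if level == n:                    # all variables decided: a leaf
            return intern(leaves, acc)
        lo = build(level + 1, acc)                    # x_level = 0 branch
        hi = build(level + 1, acc + weights[level])   # x_level = 1 branch
        if lo == hi:                      # redundant test: reduce it away
            return lo
        return intern(internal, (level, lo, hi))

    build(0, 0)
    return len(internal)

n = 10
shared = add_internal_nodes([1] * n)                       # heavy sharing
distinct = add_internal_nodes([2 ** i for i in range(n)])  # no sharing
print(shared / 2 ** n, distinct / 2 ** n)  # ~0.05 vs ~1.0
```

With equal weights the diagram has 55 internal nodes against a 1024-entry table (ratio about 0.05, echoing IC(8)); with distinct powers-of-two weights it has 1023 internal nodes (ratio about 1, echoing Uniring(10)).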
Previous work [13] has studied theoretical properties of such approximations of MPI, but no efficient symbolic version exists. Developing such algorithms is an interesting direction for future work.

Acknowledgements
This work is supported by NSF under grant numbers IIS-0964705 and IIS-0964457.

References
[1] Jesse Hoey, Robert St-Aubin, Alan Hu, and Craig Boutilier. SPUDD: Stochastic Planning Using Decision Diagrams. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI), 1999.
[2] Robert St-Aubin, Jesse Hoey, and Craig Boutilier. APRICODD: Approximate Policy Construction Using Decision Diagrams. In Advances in Neural Information Processing Systems (NIPS), 2001.
[3] Scott Sanner, William Uther, and Karina Valdivia Delgado. Approximate Dynamic Programming with Affine ADDs. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2010.
[4] Aswin Raghavan, Saket Joshi, Alan Fern, Prasad Tadepalli, and Roni Khardon. Planning in Factored Action Spaces with Symbolic Dynamic Programming. In Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI), 2012.
[5] Martin L. Puterman and Moon Chirl Shin. Modified Policy Iteration Algorithms for Discounted Markov Decision Problems. Management Science, 1978.
[6] Craig Boutilier, Richard Dearden, and Moises Goldszmidt. Exploiting Structure in Policy Construction. In International Joint Conference on Artificial Intelligence (IJCAI), 1995.
[7] R. Iris Bahar, Erica A. Frohm, Charles M. Gaona, Gary D. Hachtel, Enrico Macii, Abelardo Pardo, and Fabio Somenzi. Algebraic Decision Diagrams and their Applications. In International Conference on Computer-Aided Design (ICCAD), 1993.
[8] Chenggang Wang and Roni Khardon. Policy Iteration for Relational MDPs. arXiv preprint arXiv:1206.5287, 2012.
[9] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
[10] Jason Pazis and Ronald Parr.
Generalized Value Functions for Large Action Sets. In International Conference on Machine Learning (ICML), 2011.
[11] Scott Sanner. Relational Dynamic Influence Diagram Language (RDDL): Language Description. Unpublished manuscript, Australian National University, 2010.
[12] Carlos Guestrin, Daphne Koller, and Ronald Parr. Multiagent Planning with Factored MDPs. In Advances in Neural Information Processing Systems (NIPS), 2001.
[13] Bruno Scherrer, Victor Gabillon, Mohammad Ghavamzadeh, and Matthieu Geist. Approximate Modified Policy Iteration. In International Conference on Machine Learning (ICML), 2012.