{"title": "APRICODD: Approximate Policy Construction Using Decision Diagrams", "book": "Advances in Neural Information Processing Systems", "page_first": 1089, "page_last": 1095, "abstract": null, "full_text": "APRICODD: Approximate Policy Construction using Decision Diagrams\n\nRobert St-Aubin\nDept. of Computer Science\nUniversity of British Columbia\nVancouver, BC V6T 1Z4\nstaubin@cs.ubc.ca\n\nJesse Hoey\nDept. of Computer Science\nUniversity of British Columbia\nVancouver, BC V6T 1Z4\njhoey@cs.ubc.ca\n\nCraig Boutilier\nDept. of Computer Science\nUniversity of Toronto\nToronto, ON M5S 3H5\ncebly@cs.toronto.edu\n\nAbstract\n\nWe propose a method of approximate dynamic programming for Markov decision processes (MDPs) using algebraic decision diagrams (ADDs). We produce near-optimal value functions and policies with much lower time and space requirements than exact dynamic programming. Our method reduces the sizes of the intermediate value functions generated during value iteration by replacing the values at the terminals of the ADD with ranges of values. Our method is demonstrated on a class of large MDPs (with up to 34 billion states), and we compare the results with the optimal value functions.\n\n1  Introduction\n\nThe last decade has seen much interest in structured approaches to solving planning problems under uncertainty formulated as Markov decision processes (MDPs). Structured algorithms allow problems to be solved without explicit state-space enumeration by aggregating states of identical value. Structured approaches using decision trees have been applied to classical dynamic programming (DP) algorithms such as value iteration and policy iteration [7, 3]. Recently, Hoey et al. [8] have shown that significant computational advantages can be obtained by using an Algebraic Decision Diagram (ADD) representation [1, 4, 5]. 
\n\nNotwithstanding such advances, large MDPs must often be solved approximately. This can be accomplished by reducing the \"level of detail\" in the representation and aggregating states with similar (rather than identical) value. Approximations of this kind have been examined in the context of tree-structured approaches [2]; this paper extends this research by applying them to ADDs. Specifically, the terminal of an ADD will be labeled with the range of values taken by the corresponding set of states. As we will see, ADDs have a number of advantages over trees.\n\nWe develop two approximation methods for ADD-structured value functions, and apply them to the value diagrams generated during dynamic programming. The result is a near-optimal value function and policy. We examine the tradeoff between computation time and decision quality, and consider several variable reordering strategies that facilitate approximate aggregation.\n\n2  Solving MDPs using Algebraic Decision Diagrams\n\nWe assume a fully-observable MDP [10] with finite sets of states $S$ and actions $A$, transition function $\\Pr(s, a, t)$, reward function $R$, and a discounted infinite-horizon optimality criterion with discount factor $\\beta$. Value iteration can be used to compute an optimal stationary policy $\\pi : S \\to A$ by constructing a series of $n$-stage-to-go value functions, where:\n\n$$V^{n+1}(s) = R(s) + \\max_{a \\in A} \\Big\\{ \\beta \\sum_{t \\in S} \\Pr(s, a, t) \\cdot V^n(t) \\Big\\} \\qquad (1)$$\n\nThe sequence of value functions $V^n$ produced by value iteration converges linearly to the optimal value function $V^*$. For some finite $n$, the actions that maximize Equation 1 form an optimal policy, and $V^n$ approximates its value.\n\nADDs [1, 4, 5] are a compact, efficiently manipulable data structure for representing real-valued functions over boolean variables, $B^n \\to \\mathbb{R}$. 
They generalize a tree-structured representation by allowing nodes to have multiple parents, leading to the recombination of isomorphic subgraphs and hence to a possible reduction in the representation size. A more precise definition of the semantics of ADDs can be found in [9].\n\nRecently, we applied ADDs to the solution of large MDPs [8], yielding significant space/time savings over related tree-structured approaches. We assume the state of an MDP is characterized by a set of variables $X = \\{X_1, \\ldots, X_n\\}$. Values of variable $X_i$ will be denoted in lowercase (e.g., $x_i$). We assume each $X_i$ is boolean.$^1$ Actions are described using dynamic Bayesian networks (DBNs) [6, 3] with ADDs representing their conditional probability tables. Specifically, a DBN for action $a$ requires two sets of variables, one set $X = \\{X_1, \\ldots, X_n\\}$ referring to the state of the system before action $a$ has been executed, and $X' = \\{X'_1, \\ldots, X'_n\\}$ denoting the state after $a$ has been executed. Directed arcs from variables in $X$ to variables in $X'$ indicate direct causal influence. The conditional probability table (CPT) for each post-action variable $X'_i$ defines a conditional distribution $P^a_{X'_i}$ over $X'_i$ (i.e., $a$'s effect on $X_i$) for each instantiation of its parents. This can be viewed as a function $P^a_{X'_i}(X_1 \\ldots X_n)$, but where the function value (distribution) depends only on those $X_j$ that are parents of $X'_i$. We represent this function using an ADD. Reward functions can also be represented using ADDs. Figure 1(a) shows a simple example of a single action represented as a DBN as well as a reward function.\n\nWe use the method of Hoey et al. [8] to perform value iteration using ADDs. We refer to that paper for full details on the algorithm, and present only a brief outline here. 
The ADD representations of the CPTs for each action, $P^a_{X'_i}(X)$, are referred to as action diagrams, as shown in Figure 1(b), where $X$ represents the set of pre-action variables, $\\{X_1, \\ldots, X_n\\}$. These action diagrams can be combined into a complete action diagram (Figure 1(c)):\n\n$$P^a(X', X) = \\prod_{i=1}^{n} \\Big[ X'_i \\cdot P^a_{X'_i}(X) + \\overline{X'_i} \\cdot \\big(1 - P^a_{X'_i}(X)\\big) \\Big]. \\qquad (2)$$\n\nFigure 1: ADD representation of an MDP: (a) action network for a single action (top) and the immediate reward network (bottom); (b) matrix and ADD representation of CPTs (action diagrams); (c) complete action diagram.\n\nFigure 2: Approximation of original value diagram (a) with errors of 0.1 (b) and 0.5 (c).\n\n1 An extension to multi-valued variables would be straightforward.\n\nThe complete action diagram represents all the effects of pre-action variables on post-action variables for a given action. The immediate reward function $R(X')$ is also represented as an ADD, as are the $n$-stage-to-go value functions $V^n(X)$. Given the complete action diagrams for each action, and the immediate reward function, value iteration can be performed by setting $V^0 = R$, and applying Eq. 1,\n\n$$V^{n+1}(X) = R(X) + \\max_{a \\in A} \\Big\\{ \\beta \\sum_{X'} P^a(X', X) \\cdot V^n(X') \\Big\\}, \\qquad (3)$$\n\nfollowed by swapping all unprimed variables with primed ones. All operations in Equation 3 are well defined in terms of ADDs [8, 12]. The value iteration loop is continued until some stopping criterion is met. 
Various optimizations are applied to make this calculation as efficient as possible in both space and time.\n\n3  Approximating Value Functions\n\nWhile structured solution techniques offer many advantages, the exact solution of MDPs in this way can only work if there are \"few\" distinct values in a value function. Even if a DBN representation shows little dependence among variables from one stage to another, the influence of variables tends to \"bleed\" through a DBN over time, and many variables become relevant to predicting value. Thus, even using structured methods, we must often relax the optimality constraint and generate only approximate value functions, from which near-optimal policies will hopefully arise. It is generally the case that many of the values distinguished by DP are similar. Replacing such values with a single approximate value leads to size reduction, while not significantly affecting the precision of the value diagrams.\n\n3.1  Decision Diagrams and Approximation\n\nConsider the value diagram shown in Figure 2(a), which has eight distinct values as shown. The value of each state $s$ is represented as a pair $[l, u]$, where the lower, $l$, and upper, $u$, bounds on the values are both represented. The span of a state $s$ is given by $\\mathrm{span}(s) = u - l$. Point values are represented by setting $u = l$, and have zero span. Now suppose that the diagram in Figure 2(a) exceeds resource limits, and a reduction in size is necessary to continue the value iteration process. If we choose to no longer distinguish values which are within 0.1 or 0.5 of each other, the diagrams in Figure 2(b) or (c) result, respectively. The states which had proximal values have been merged, where merging a set of states $s_1, s_2, \\ldots, s_n$ with values $[l_1, u_1], \\ldots, [l_n, u_n]$ results in an aggregate state, $t$, with a ranged value $[\\min(l_1, \\ldots, l_n), \\max(u_1, \\ldots, u_n)]$. 
The midpoint of the range estimates the true value of the states with minimal error, namely, $\\mathrm{span}(t)/2$. The span of $V$ is the maximum of all spans in the value diagram, and therefore the maximum error in $V$ is simply $\\mathrm{span}(V)/2$ [2]. The combined span of a set of states is the span of the pair that would result from merging them all. The extent of a value diagram $V$ is the combined span of the portion of the state space which it represents. The span of the diagram in Figure 2(c) is 0.5, but its extent is 8.7.\n\nADD-structured value functions can be leveraged by approximation techniques because approximations can always be performed directly without pre-processing techniques such as variable reordering. Of course, variable reordering can still play an important computational role in ADD-structured methods, but it is not needed for discovering approximations.\n\n3.2  Value Iteration with Approximate Value Functions\n\nApproximate value iteration simply means applying an approximation technique to the $n$-stage-to-go value function generated at each iteration of Eq. 3. Available resources might dictate that ADDs be kept below some fixed size. In contrast, decision quality might require errors below some fixed value, referred to as the pruning strength, $\\delta$. The remainder of this paper will focus on the latter, although we have examined the former as well [9].\n\nThus, the objective of a single approximation step is a reduction in the size of a ranged value ADD by replacing all leaves which have combined spans less than the specified error bound by a single leaf. Given a leaf $[l, u]$ in $V$, all leaves $[l_i, u_i]$ such that the combined span of $[l_i, u_i]$ with $[l, u]$ is less than the specified error are merged with it. Repeating this process until no more merges are possible gives the desired result. 
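\n\nAs a small worked illustration of this merge step (the leaf values and the error bound are assumed for the example and are not taken from Figure 2), suppose three point-valued leaves $[9.0, 9.0]$, $[9.3, 9.3]$ and $[9.4, 9.4]$ and an error bound of $0.5$. Their combined span is\n\n$$\\max(9.0, 9.3, 9.4) - \\min(9.0, 9.3, 9.4) = 9.4 - 9.0 = 0.4,$$\n\nwhich is below the bound, so under the rule described above the three leaves would be replaced by the single leaf $[9.0, 9.4]$, whose midpoint $9.2$ lies within $0.4 / 2 = 0.2$ of every merged value.\n\n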
We have also examined a quicker, but less exact, method for approximation, which exploits the fact that simply reducing the precision of the values at the leaves of an ADD merges the similar values. We defer explanations to the longer version of this paper [9].\n\nThe sequence of ranged value functions, $\\tilde{V}^n$, converges after $n'$ iterations to an approximate (non-ranged) value function, $\\tilde{V}$, by taking the mid-points of each ranged terminal node in $\\tilde{V}^{n'}$. The pruning strength, $\\delta$, then gives the percentage difference between $\\tilde{V}$ and the optimal $n'$-stage-to-go value function $V^{n'}$. The value function $\\tilde{V}$ induces a policy, $\\tilde{\\pi}$, the value of which is $V_{\\tilde{\\pi}}$. In general, however, $V_{\\tilde{\\pi}} \\neq \\tilde{V}$ [11].$^2$\n\n2 In fact, the equality arises if and only if $\\tilde{V} = V^*$, where $V^*$ is the optimal value function.\n\n3.3  Variable Reordering\n\nAs previously mentioned, variable reordering can have a significant effect on the size of an ADD, but finding the variable ordering which gives rise to the smallest ADD for a boolean function is co-NP-complete [4]. We examine three reordering methods. The first two are standard for reordering variables in BDDs: Rudell's sifting algorithm and random reordering [12]. The last reordering method we consider arises in the decision tree induction literature, and is related to the information gain criterion. Given a value diagram $V$ with extent $\\delta$, each variable $x$ is considered in turn. The value diagram is restricted first with $x = \\mathit{true}$, and the extent $\\delta_t$ and the number of leaves $n_t$ are calculated for the restricted ADD. Similar values $\\delta_f$ and $n_f$ are found for the $x = \\mathit{false}$ restriction. 
If we collapse the entire ADD into a single node, then assuming a uniform distribution over values in the resulting range gives the entropy for the entire ADD:\n\n$$E = -\\int p(v) \\log(p(v))\\, dv = \\log(\\delta), \\qquad (4)$$\n\nwhich represents our degree of uncertainty about the values in the diagram. Splitting the values with the variable $x$ results in two new value diagrams, for each of which the entropy is calculated. The gains in information (decreases in entropy) are used to rank the variables, and the resulting order is applied to the diagram. This method will be referred to as the minimum span method.\n\n4  Results\n\nThe procedures described above were implemented using a modified version of the CUDD package [12], a library of C routines which provides support for manipulation of ADDs.\n\nExperimental results from this section were all obtained using one processor on a dual-processor Pentium II PC running at 400 MHz with 0.5 GB of RAM. Our approximation methods were tested on various adaptations of a process planning problem taken from [7, 8].$^3$\n\n4.1  Approximation\n\nAll experiments in this section were performed on problem domains where the variable ordering was the one selected implicitly by the constructors of the domains.$^4$\n\nValue function   $\\delta$ (%)   time (s)   iter   nodes (int)   leaves   $|V_{\\tilde{\\pi}} - V^*|$ (%)\nOptimal          0            270.91     44     22170         527      0.0\nApproximate      1            562.35     44     17108         117      0.13\nApproximate      2            547.00     44     15960         77       0.14\nApproximate      3            112.7      15     15230         58       5.45\nApproximate      4            68.53      12     14510         48       1.20\nApproximate      5            38.06      10     11208         38       2.48\nApproximate      10           6.24       6      3739          15       11.33\nApproximate      15           0.70       4      580           9        14.11\nApproximate      20           0.57       4      299           6        16.66\nApproximate      30           0.05       2      50            3        25.98\nApproximate      40           0.07       2      10            2        30.28\nApproximate      50           0.04       1      0             1        31.25\n\nTable 1: Comparing optimal with approximate value iteration on a domain with 28 boolean variables. 
\n\nIn Table 1 we compare optimal value iteration using ADDs (SPUDD as presented in [8]) with approximate value iteration using different pruning strengths $\\delta$. In order to avoid overly aggressive pruning in the early stages of value iteration, we need to take into account the size of the value function at every iteration. Therefore, we use a sliding pruning strength specified as $\\delta \\sum_{i=0}^{n} \\beta^i\\, \\mathrm{extent}(R)$, where $R$ is the initial reward diagram, $\\beta$ is the discount factor introduced earlier and $n$ is the iteration number.\n\nWe report running time, value function size (internal nodes and leaf nodes), number of iterations, and the average sum of squared difference between the optimal value function, $V^*$, and the value of the approximate policy, $V_{\\tilde{\\pi}}$.\n\nIt is important to note that the pruning strength is an upper bound on the approximation error. That is, the optimal values are guaranteed to lie within the ranges of the approximate ranged value function. However, as noted earlier, this bound does not hold for the value of an induced policy, as can be seen at 3% pruning in the last column of Table 1.\n\n3 See [9] for details.\n4 Experiments showed that conclusions in this section are independent of variable order.\n\nThe effects of approximation on the performance of the value iteration algorithm are threefold. First, the approximation itself introduces an overhead which depends on the size of the value function being approximated. This effect can be seen in Table 1 at low pruning strengths (1-2%), where the running time is increased from that taken by optimal value iteration. Second, the ranges in the value function reduce the number of iterations needed to attain convergence, as can be seen in Table 1 for pruning strengths greater than 2%. However, for the lower pruning strengths, this effect is not observed. 
This can be explained by the fact that a small number of states with values much greater (or much lower) than that of the rest of the state space may never be approximated. Therefore, to converge, this portion of the state space requires the same number of iterations as in the optimal case.$^5$\n\nThe third effect of approximation is to reduce the size of the value functions, thus reducing the per-iteration computation time during value iteration. This effect is clearly seen at pruning strengths greater than 2%, where it overtakes the cost of approximation, and generates significant time and space savings. Speed-ups of 2- and 4-fold are obtained for pruning strengths of 3% and 4% respectively. Furthermore, fewer than 60 leaf nodes represent the entire state space, while value errors in the policy do not exceed 6%. This confirms our initial hypothesis that many values within a given domain are very similar and thus, replacing such values with ranges drastically reduces the size of the resulting diagram without significantly affecting the quality of the resulting policy. Pruning above 5% has a larger error, and takes a very short time to converge. Pruning strengths of more than 40% generate policies which are close to trivial, where a single action is always taken.\n\n4.2  Variable reordering\n\n[Figure 3 plot; x-axis: boolean variables; series: shuffled - no reorder; intuitive (unshuffled) - no reorder; shuffled - reorder minspan; shuffled - reorder random; shuffled - reorder sift]\n\nFigure 3: Sizes of final value diagrams plotted as a function of the problem domain size.\n\nResults in the previous section were all generated using the \"intuitive\" variable ordering for the problem at hand. 
It is probable that such an ordering is close to optimal, but such orderings may not always be obvious, and the effects of a poor ordering on the resources required for policy generation can be extreme. Therefore, to characterize the reordering methods discussed in Section 3.3, we start with initially randomly shuffled orders and compare the sizes of the final value diagrams with those found using the intuitive order.\n\n5 We are currently looking into alleviating this effect in order to increase convergence speed for low pruning strengths.\n\nIn Figure 3 we present results obtained from approximate value iteration with a pruning strength of 3% applied to a range of problem domain sizes.\n\nIn the absence of any reordering, diagrams produced with randomly shuffled variable orders are up to 3 times larger than those produced with the intuitive (unshuffled) order. The minimum span reordering method, starting from a randomly shuffled order, finds orders which are equivalent to the intuitive one, producing value diagrams with nearly identical size. The sifting and random reordering methods find orders which reduce the sizes further by up to a factor of 7.\n\nReordering attempts take time, but on the other hand, DP is faster with smaller diagrams. Value iteration with the sifting reordering method (starting with shuffled orders) was found to run in time similar to that of value iteration with the intuitive ordering, while the other reordering methods took slightly longer. All reordering methods, however, reduced running times and diagram sizes from that using no reordering, by factors of 3 to 5.\n\n5  Concluding Remarks\n\nWe examined a method for approximate dynamic programming for MDPs using ADDs. ADDs are found to be ideally suited to this task. The results we present have clearly shown their applicability on a range of MDPs with up to 34 billion states. 
Investigations into the use of variable reordering during value iteration have also proved fruitful, and yield large improvements in the sizes of value diagrams. Results show that our policy generator is robust to the variable order, and so this is no longer a constraint for problem specification.\n\nReferences\n\n[1] R. Iris Bahar, Erica A. Frohm, Charles M. Gaona, Gary D. Hachtel, Enrico Macii, Abelardo Pardo, and Fabio Somenzi. Algebraic decision diagrams and their applications. In International Conference on Computer-Aided Design, pages 188-191. IEEE, 1993.\n\n[2] Craig Boutilier and Richard Dearden. Approximating value trees in structured dynamic programming. In Proceedings ICML-96, Bari, Italy, 1996.\n\n[3] Craig Boutilier, Richard Dearden, and Moises Goldszmidt. Exploiting structure in policy construction. In Proceedings Fourteenth Inter. Conf. on AI (IJCAI-95), 1995.\n\n[4] Randal E. Bryant. Graph-based algorithms for boolean function manipulation. IEEE Transactions on Computers, C-35(8):677-691, 1986.\n\n[5] E. M. Clarke, K. L. McMillan, X. Zhao, M. Fujita, and J. Yang. Spectral transforms for large boolean functions with applications to technology mapping. In DAC, 54-60. ACM/IEEE, 1993.\n\n[6] Thomas Dean and Keiji Kanazawa. A model for reasoning about persistence and causation. Computational Intelligence, 5(3):142-150, 1989.\n\n[7] Richard Dearden and Craig Boutilier. Abstraction and approximate decision theoretic planning. Artificial Intelligence, 89:219-283, 1997.\n\n[8] Jesse Hoey, Robert St-Aubin, Alan Hu, and Craig Boutilier. SPUDD: Stochastic planning using decision diagrams. In Proceedings of UAI99, Stockholm, 1999.\n\n[9] Jesse Hoey, Robert St-Aubin, Alan Hu, and Craig Boutilier. Optimal and approximate planning using decision diagrams. Technical Report TR-00-05, UBC, June 2000.\n\n[10] Martin L. 
Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York, NY, 1994.\n\n[11] Satinder P. Singh and Richard C. Yee. An upper bound on the loss from approximate optimal-value function. Machine Learning, 16:227-233, 1994.\n\n[12] Fabio Somenzi. CUDD: CU decision diagram package. Available from ftp://vlsi.colorado.edu/pub/, 1998.\n", "award": [], "sourceid": 1840, "authors": [{"given_name": "Robert", "family_name": "St-Aubin", "institution": null}, {"given_name": "Jesse", "family_name": "Hoey", "institution": null}, {"given_name": "Craig", "family_name": "Boutilier", "institution": null}]}