{"title": "An Improved Policy Iteration Algorithm for Partially Observable MDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 1015, "page_last": 1021, "abstract": "", "full_text": "An Improved Policy Iteratioll Algorithm \n\nfor  Partially  Observable  MDPs \n\nComputer Science  Department \n\nUniversity of Massachusetts \n\nEric A.  Hansen \n\nAmherst,  MA  01003 \nhansen@cs.umass.edu \n\nAbstract \n\nA new  policy iteration algorithm for  partially observable Markov \ndecision processes is  presented that is simpler and more efficient than \nan earlier policy  iteration algorithm of Sondik  (1971,1978).  The key \nsimplification is  representation of a  policy as a finite-state  controller. \nThis representation makes policy evaluation straightforward.  The pa(cid:173)\nper's contribution is  to show that the dynamic-programming update \nused  in the policy improvement step can be interpreted as the trans(cid:173)\nformation of a finite-state controller into an improved finite-state con(cid:173)\ntroller.  The  new  algorithm  consistently  outperforms  value  iteration \nas an approach to solving infinite-horizon  problems. \n\n1 \n\nIntroduction \n\nA  partially observable  Markov  decision  process  (POMDP)  is  a  generalization of the \nstandard completely observable  Markov  decision  process that  allows  imperfect  infor(cid:173)\nmation about the state of the system.  First studied as a  model of decision-making in \noperations research,  it  has  recently  been  used  as  a  framework  for  decision-theoretic \nplanning  and  reinforcement  learning  with  hidden state  (Monahan,  1982;  Cassandra, \nKaelbling,  &  Littman,  1994; Jaakkola, Singh,  &  Jordan,  1995). \nValue  iteration and  policy  iteration algorithms for  POMDPs were first  developed  by \nSondik and rely on a  piecewise linear and convex representation of the value function \n(Sondik,  1971;  Smallwood  &  Sondik,1973;  Sondik,  1978).  Sondik's  policy  iteration \nalgorithm has proved to be impractical, however,  because its  policy evaluation step is \nextremely complicated and difficult  to implement.  As  a  result,  almost  all  subsequent \nwork on dynamic  programming for  POMDPs has used  value iteration.  In this  paper, \nwe  describe an improved policy iteration algorithm for  POMDPs that avoids the dif(cid:173)\nficulties  of Sondik's algorithm.  We  show  that these difficulties  hinge on the choice of \na  policy  representation  and  can be  avoided  by representing  a  policy  as  a  finite-state \n\n\f1016 \n\nE.  A.  Hansen \n\ncontroller.  This  representation  makes  the  policy  evaluation  step  easy  to  implement \nand efficient.  We show  that the policy improvement step can be  interpreted in a  nat(cid:173)\nural way as the transformation of a finite-state controller into an improved finite-state \ncontroller.  Although  it  is  not  always  possible  to  represent  an  optimal  policy  for  an \ninfinite-horizon POMDP as a finite-state controller, it is always possible to do so when \nthe optimal value function  is  piecewise linear and convex.  Therefore representation of \na  poiicy as a finite-state controller is no more limiting than representation of the value \nfunction  as  piecewise  linear  and  convex.  
In fact, it is the close relationship between representation of a policy as a finite-state controller and representation of a value function as piecewise linear and convex that the new algorithm successfully exploits.

The paper is organized as follows. Section 2 briefly reviews the POMDP model and Sondik's policy iteration algorithm. Section 3 describes an improved policy iteration algorithm. Section 4 illustrates the algorithm with a simple example and reports a comparison of its performance to value iteration. The paper concludes with a discussion of the significance of this work.

2 Background

Consider a discrete-time POMDP with a finite set of states $S$, a finite set of actions $A$, and a finite set of observations $\Theta$. Each time period, the system is in some state $i \in S$, an agent chooses an action $a \in A$ for which it receives a reward with expected value $r_i^a$, the system makes a transition to state $j \in S$ with probability $p_{ij}^a$, and the agent observes $\theta \in \Theta$ with probability $q_{j\theta}^a$. We assume the performance objective is to maximize expected total discounted reward over an infinite horizon.

Although the state of the system cannot be directly observed, the probability that it is in a given state can be calculated. Let $\pi$ denote a vector of state probabilities, called an information state, where $\pi_i$ denotes the probability that the system is in state $i$. If action $a$ is taken in information state $\pi$ and $\theta$ is observed, the successor information state is determined by revising each state probability using Bayes' theorem: $\pi'_j = \sum_{i \in S} \pi_i p_{ij}^a q_{j\theta}^a / \sum_{i,j \in S} \pi_i p_{ij}^a q_{j\theta}^a$. Geometrically, each information state $\pi$ is a point in the $(|S| - 1)$-dimensional unit simplex, denoted $\Pi$.

It is well known that an information state $\pi$ is a sufficient statistic that summarizes all information about the history of a POMDP necessary for optimal action selection. Therefore a POMDP can be recast as a completely observable MDP with a continuous state space $\Pi$, and in theory it can be solved using dynamic programming. The key to practical implementation of a dynamic-programming algorithm is a piecewise linear and convex representation of the value function. Smallwood and Sondik (1973) show that the dynamic-programming update for POMDPs preserves the piecewise linearity and convexity of the value function. They also show that an optimal value function for a finite-horizon POMDP is always piecewise linear and convex. For infinite-horizon POMDPs, Sondik (1978) shows that an optimal value function is sometimes piecewise linear and convex, and otherwise can be approximated arbitrarily closely by a piecewise linear and convex function.

A piecewise linear and convex value function $V$ can be represented by a finite set of $|S|$-dimensional vectors, $\Gamma = \{\alpha^0, \alpha^1, \ldots\}$, such that $V(\pi) = \max_k \sum_{i \in S} \pi_i \alpha_i^k$. A dynamic-programming update transforms a value function $V$ represented by a set $\Gamma$ of $\alpha$-vectors into an improved value function $V'$ represented by a set $\Gamma'$ of $\alpha$-vectors. Each possible $\alpha$-vector in $\Gamma'$ corresponds to choice of an action and, for each possible observation, choice of a successor vector in $\Gamma$. Given the combinatorial number of choices that can be made, the maximum number of vectors in $\Gamma'$ is $|A||\Gamma|^{|\Theta|}$.
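To make this notation concrete, here is a minimal sketch in Python with NumPy of the Bayes belief update and of evaluating a piecewise linear and convex value function from a set $\Gamma$ of $\alpha$-vectors. The array layout (row-stochastic matrices indexed as described in the comments) is our own assumption for illustration, not something specified in the paper.

import numpy as np

def belief_update(pi, P_a, Q_a, theta):
    """Bayes update of an information state pi after action a, observation theta.

    pi:    length-|S| belief vector
    P_a:   |S| x |S| matrix with P_a[i, j] = p_ij^a
    Q_a:   |S| x |Theta| matrix with Q_a[j, th] = q_j,th^a
    theta: index of the observation
    """
    unnormalized = (pi @ P_a) * Q_a[:, theta]  # sum_i pi_i p_ij^a q_j,theta^a
    return unnormalized / unnormalized.sum()   # divide by Pr(theta | pi, a)

def value(pi, Gamma):
    """Piecewise linear and convex value V(pi) = max_k sum_i pi_i alpha_i^k."""
    return max(float(np.dot(pi, alpha)) for alpha in Gamma)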
However, most of these potential vectors are not needed to define the updated value function and can be pruned. Thus the dynamic-programming update problem is to find a minimal set of vectors $\Gamma'$ that represents $V'$, given a set of vectors $\Gamma$ that represents $V$. Several algorithms for performing this dynamic-programming update have been developed, but describing them is beyond the scope of this paper. Any algorithm for performing the dynamic-programming update can be used in the policy improvement step of policy iteration. The algorithm that is presently the fastest is described by Cassandra, Littman, and Zhang (1997).

For value iteration, it is sufficient to have a representation of the value function because a policy is defined implicitly by the value function, as follows:

$$\delta(\pi) = a\Big(\arg\max_k \sum_{i \in S} \pi_i \alpha_i^k\Big), \qquad (1)$$

where $a(k)$ denotes the action associated with vector $\alpha^k$. But for policy iteration, a policy must be represented independently of the value function because the policy evaluation step computes the value function of a given policy. Sondik's choice of a policy representation is influenced by Blackwell's proof that for a continuous-space infinite-horizon MDP, there is a stationary, deterministic Markov policy that is optimal (Blackwell, 1965). Based on this result, Sondik restricts policy space to stationary and deterministic Markov policies that map the continuum of information space $\Pi$ into action space $A$. Because it is important for a policy to have a finite representation, Sondik defines an admissible policy as a mapping from a finite number of polyhedral regions of $\Pi$ to $A$. Each region is represented by a set of linear inequalities, where each linear inequality corresponds to a boundary of the region.

This is Sondik's canonical representation of a policy, but his policy iteration algorithm makes use of two other representations. In the policy evaluation step, he converts a policy from this representation to an equivalent, or approximately equivalent, finite-state controller. Although no method is known for computing the value function of a policy represented as a mapping from $\Pi$ to $A$, the value function of a finite-state controller can be computed in a straightforward way. In the policy improvement step, Sondik converts a policy represented implicitly by the updated value function and equation (1) back to his canonical representation. The complexity of translating between these different policy representations - especially in the policy evaluation step - makes Sondik's policy iteration algorithm difficult to implement and explains why it is not used in practice.

3 Algorithm

We now show that policy iteration for POMDPs can be simplified - both conceptually and computationally - by using a single representation of a policy as a finite-state controller.

3.1 Policy evaluation

As Sondik recognized, policy evaluation is straightforward when a policy is represented as a finite-state controller.
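Equation (1) translates directly into a small lookup: evaluate every $\alpha$-vector at the current information state and act according to the maximizer. The pairing of each vector with its generating action is a data layout we assume only for this sketch.

import numpy as np

def greedy_action(pi, Gamma, actions):
    """Implicit policy of equation (1).

    Gamma:   list of alpha-vectors
    actions: actions[k] is the action a(k) used to construct Gamma[k]
    """
    k = int(np.argmax([np.dot(pi, alpha) for alpha in Gamma]))
    return actions[k]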
An $\alpha$-vector representation of the value function of a finite-state controller is computed by solving the system of linear equations

$$\alpha_i^k = r_i^{a(k)} + \beta \sum_{j,\theta} p_{ij}^{a(k)} \, q_{j\theta}^{a(k)} \, \alpha_j^{s(k,\theta)}, \qquad (2)$$

where $k$ is an index of a state of the finite-state controller, $a(k)$ is the action associated with machine state $k$, and $s(k,\theta)$ is the index of the successor machine state if $\theta$ is observed. This value function is convex as well as piecewise linear because the expected value of an information state is determined by assuming the controller is started in the machine state that optimizes it.
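Because equation (2) is linear in the $K|S|$ unknowns $\alpha_i^k$, policy evaluation reduces to assembling one coefficient matrix and making a single call to a linear solver. The following sketch, under the same assumed array layout as before, is an illustration rather than the paper's implementation.

import numpy as np

def evaluate_controller(actions, succ, P, Q, r, beta):
    """Solve equation (2) for the alpha-vectors of a finite-state controller.

    actions[k]      action a(k) of machine state k
    succ[k][theta]  successor machine state s(k, theta)
    P[a], Q[a]      |S| x |S| transition and |S| x |Theta| observation matrices
    r[a]            length-|S| expected reward vector
    beta            discount factor

    Returns a K x |S| array whose row k is alpha^k.
    """
    K = len(actions)
    n = len(r[actions[0]])
    num_obs = Q[actions[0]].shape[1]
    A = np.eye(K * n)  # coefficients of the K|S| linear equations
    b = np.zeros(K * n)
    for k in range(K):
        a = actions[k]
        b[k * n:(k + 1) * n] = r[a]
        for theta in range(num_obs):
            s = succ[k][theta]
            # subtract beta * p_ij^a q_j,theta^a from the block coupling
            # alpha^k to its successor vector alpha^s(k,theta)
            A[k * n:(k + 1) * n, s * n:(s + 1) * n] -= beta * (P[a] * Q[a][:, theta])
    return np.linalg.solve(A, b).reshape(K, n)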
1. Specify an initial finite-state controller, $\delta$, and select $\epsilon$ for detecting convergence to an $\epsilon$-optimal policy.

2. Policy evaluation: Calculate a set $\Gamma$ of $\alpha$-vectors that represents the value function for $\delta$ by solving the system of equations given by equation (2).

3. Policy improvement: Perform a dynamic-programming update and use the new set of vectors $\Gamma'$ to transform $\delta$ into a new finite-state controller, $\delta'$, as follows:

   (a) For each vector $\alpha$ in $\Gamma'$:
       i. If the action and successor links associated with $\alpha$ duplicate those of a machine state of $\delta$, then keep that machine state unchanged in $\delta'$.
       ii. Else if $\alpha$ pointwise dominates a vector associated with a machine state of $\delta$, change the action and successor links of that machine state to those used to create $\alpha$. (If it pointwise dominates the vectors of more than one machine state, they can be combined into a single machine state.)
       iii. Otherwise add a machine state to $\delta'$ that has the same action and successor links used to create $\alpha$.

   (b) Prune any machine state for which there is no corresponding vector in $\Gamma'$, as long as it is not reachable from a machine state to which a vector in $\Gamma'$ does correspond.

4. Termination test: If the Bellman residual is less than or equal to $\epsilon(1 - \beta)/\beta$, exit with an $\epsilon$-optimal policy. Otherwise set $\delta$ to $\delta'$ and go to step 2.

Figure 1: Policy iteration algorithm.

3.2 Policy improvement

The policy improvement step uses the dynamic-programming update to transform a value function $V$ represented by a set $\Gamma$ of $\alpha$-vectors into an improved value function $V'$ represented by a set $\Gamma'$ of $\alpha$-vectors. We now show that the dynamic-programming update can also be interpreted as the transformation of a finite-state controller $\delta$ into an improved finite-state controller $\delta'$. The transformation is based on a simple comparison of $\Gamma'$ and $\Gamma$.

First note that some of the $\alpha$-vectors in $\Gamma'$ are duplicates of $\alpha$-vectors in $\Gamma$; that is, their action and successor links match (and their vector values are pointwise equal). Any machine state of $\delta$ for which there is a duplicate vector in $\Gamma'$ is left unchanged. The vectors in $\Gamma'$ that are not duplicates of vectors in $\Gamma$ indicate how to change the finite-state controller. If a non-duplicate vector in $\Gamma'$ pointwise dominates a vector in $\Gamma$, the machine state that corresponds to the pointwise dominated vector in $\Gamma$ is changed so that its action and successor links match those of the dominating vector in $\Gamma'$. If a non-duplicate vector in $\Gamma'$ does not pointwise dominate a vector in $\Gamma$, a machine state is added to the finite-state controller with the same action and successor links used to generate the vector. There may be some machine states for which there is no corresponding vector in $\Gamma'$, and they can be pruned, but only if they are not reachable from a machine state that corresponds to a vector in $\Gamma'$. This last point is important because it preserves the integrity of the finite-state controller.

A policy iteration algorithm that uses these simple transformations to change a finite-state controller in the policy improvement step is summarized in Figure 1; a sketch of the transformation follows below. An algorithm that performs this transformation is easy to implement and runs very efficiently because it simply compares the $\alpha$-vectors in $\Gamma'$ to the $\alpha$-vectors in $\Gamma$ and modifies the finite-state controller accordingly. The policy evaluation step is invoked to compute the value function of the transformed finite-state controller. (This is only necessary if a machine state has been changed, not if machine states have simply been added.) It is easy to show that the value function of the transformed finite-state controller $\delta'$ dominates the value function of the original finite-state controller $\delta$; we omit the proof, which appears in (Hansen, 1998).

Theorem 1  If a finite-state controller is not optimal, policy improvement transforms it into a finite-state controller with a value function that is as good or better for every information state and better for some information state.
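The transformation in step 3 of Figure 1 can be sketched as follows. The data layout, in which each machine state and each vector in $\Gamma'$ carries its action, its tuple of successor links, and its $\alpha$-vector, is an assumption made for illustration; combining several dominated machine states into one, and step 3(b), the pruning of unreachable machine states, are omitted to keep the sketch short.

import numpy as np

def improve_controller(controller, Gamma_new):
    """One policy improvement step on a finite-state controller.

    controller: list of machine states, each a dict with keys
                'action', 'succ' (tuple of successor indices), 'alpha'
    Gamma_new:  list of dicts with the same keys, one per vector in
                Gamma', recording the choices that generated the vector
    Returns (controller, changed); changed is True if any existing
    machine state was modified, so policy evaluation must be rerun.
    """
    changed = False
    for v in Gamma_new:
        # (i) duplicate action and successor links: keep the state unchanged
        if any(m['action'] == v['action'] and m['succ'] == v['succ']
               for m in controller):
            continue
        # (ii) pointwise dominance: redirect the dominated machine state
        dominated = [m for m in controller
                     if np.all(np.asarray(v['alpha']) >= np.asarray(m['alpha']))]
        if dominated:
            m = dominated[0]
            m['action'], m['succ'], m['alpha'] = v['action'], v['succ'], v['alpha']
            changed = True
        else:
            # (iii) otherwise add a new machine state
            controller.append(dict(v))
    return controller, changed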
3.3 Convergence

If a finite-state controller cannot be improved in the policy improvement step (i.e., all the vectors in $\Gamma'$ are duplicates of vectors in $\Gamma$), it must be optimal because its value function satisfies the optimality equation. However, policy iteration does not necessarily converge to an optimal finite-state controller after a finite number of iterations, because there is not necessarily an optimal finite-state controller. Therefore we use the same stopping condition used by Sondik to detect $\epsilon$-optimality: a finite-state controller is $\epsilon$-optimal when the Bellman residual is less than or equal to $\epsilon(1 - \beta)/\beta$, where $\beta$ denotes the discount factor. Representation of a policy as a finite-state controller makes the following proof straightforward (Hansen, 1998).

Theorem 2  Policy iteration converges to an $\epsilon$-optimal finite-state controller after a finite number of iterations.
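For piecewise linear and convex value functions, the Bellman residual in this stopping condition can be computed exactly by solving one small linear program per vector in $\Gamma'$. The sketch below uses scipy.optimize.linprog and relies on the updated value function dominating the old one (Theorem 1), so the one-sided maximum equals the residual; it illustrates the test in step 4 of Figure 1 and is not the paper's implementation. Gamma and Gamma_new are assumed to be lists of NumPy arrays.

import numpy as np
from scipy.optimize import linprog

def bellman_residual(Gamma, Gamma_new):
    """Exact max over beliefs of V'(pi) - V(pi) for two PWLC functions."""
    n = len(Gamma[0])
    best = 0.0
    for a_new in Gamma_new:
        # variables (pi_1, ..., pi_n, d); maximize d  <=>  minimize -d
        c = np.zeros(n + 1)
        c[-1] = -1.0
        # constraints: pi . (alpha - alpha') + d <= 0 for every alpha in Gamma
        A_ub = np.array([np.append(a - a_new, 1.0) for a in Gamma])
        b_ub = np.zeros(len(Gamma))
        A_eq = np.append(np.ones(n), 0.0).reshape(1, -1)  # sum(pi) = 1
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, 1)] * n + [(None, None)])
        if res.success:
            best = max(best, res.x[-1])
    return best

# Termination test of Figure 1: stop once
# bellman_residual(Gamma, Gamma_new) <= eps * (1 - beta) / beta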
4 Example and performance

We illustrate the algorithm using the same example used by Sondik: a simple two-state, two-action, two-observation POMDP that models the problem of finding an optimal marketing strategy given imperfect information about consumer preferences (Sondik, 1971, 1978). The two states of the problem represent consumer preference or lack of preference for the manufacturer's brand; let B denote brand preference and ¬B denote lack of brand preference. Although consumer preferences cannot be observed, they can be inferred based on observed purchasing behavior; let P denote purchase of the product and let ¬P denote no purchase. There are two marketing alternatives or actions; the company can market a luxury version of the product (L) or a standard version (S). The luxury version is more expensive to market but can bring greater profit. Marketing the luxury version also increases brand preference. However, consumers are more likely to purchase the less expensive, standard product. The transition probabilities, observation probabilities, and reward function for this example are shown in Figure 2. The discount factor is 0.9.

[Figure 2: Parameters (transition probabilities, observation probabilities, and expected rewards) for the marketing example of Sondik (1971, 1978).]

Both Sondik's policy iteration algorithm and the new policy iteration algorithm converge in three iterations from a starting policy that is equivalent to the finite-state controller shown in Figure 3a. Figure 3 shows how the initial finite-state controller is transformed into an optimal finite-state controller by the new algorithm. In the first iteration, the updated set of vectors $\Gamma'$ (indicated by dashed circles in Figure 3b) includes two duplicate vectors and one non-duplicate that results in an added machine state. Figure 3c shows the improved finite-state controller after the first iteration. In the second iteration, each of the three vectors in the updated set of vectors $\Gamma'$ (indicated by dashed circles in Figure 3d) pointwise dominates a vector that corresponds to a current machine state. Thus each of these machine states is changed. Figure 3e shows the improved finite-state controller after the second iteration. The optimality of this finite-state controller is detected in the third iteration.

[Figure 3: (a) shows the initial finite-state controller, (b) uses dashed circles to show the vectors in $\Gamma'$ generated in the first policy improvement step and (c) shows the transformed finite-state controller, (d) uses dashed circles to show the vectors in $\Gamma'$ generated in the second policy improvement step and (e) shows the transformed finite-state controller after policy evaluation. The optimality of this finite-state controller is detected on the third iteration, which is not shown. Arcs are labeled with one of two possible observations, and machine states are labeled with one of two possible actions and a 2-dimensional vector that contains a value for each of the two possible system states.]

This is the only example for which Sondik reports using policy iteration to find an optimal policy. For POMDPs with more than two states, Sondik's algorithm is especially difficult to implement. Sondik reports that his algorithm finds a suboptimal policy for an example described in (Smallwood & Sondik, 1973). No further computational experience with his algorithm has been reported.

The new policy iteration algorithm described in this paper easily finds an optimal finite-state controller for the example described in (Smallwood & Sondik, 1973) and has been used to solve many other POMDPs. In fact, it consistently outperforms value iteration. We compared its performance to the performance of value iteration on a suite of ten POMDPs that represent a range of problem sizes for which exact dynamic-programming updates are currently feasible. (Presently, exact dynamic-programming updates are not feasible for POMDPs with more than about ten or fifteen states, actions, or observations.) Starting from the same point, we measured how soon each algorithm converged to $\epsilon$-optimality for $\epsilon$ values of 10.0, 1.0, 0.1, and 0.01. Policy iteration was consistently faster than value iteration by a factor that ranged from a low of about 10 times faster to a high of over 120 times faster. On average, its rate of convergence was between 40 and 50 times faster than value iteration for this set of examples. The finite-state controllers it found had as many as several hundred machine states, although optimal finite-state controllers were sometimes found with just a few machine states.

5 Discussion

We have demonstrated that the dynamic-programming update for POMDPs can be interpreted as the improvement of a finite-state controller. This interpretation can be applied to both value iteration and policy iteration. It provides no computational speedup for value iteration, but for policy iteration it results in substantial speedup by making policy evaluation straightforward and easy to implement. This representation also has the advantage that it makes a policy easier to understand and execute than representation as a mapping from regions of information space to actions. In particular, a policy can be executed without maintaining an information state at run time.

It is well known that policy iteration converges to $\epsilon$-optimality (or optimality) in fewer iterations than value iteration. For completely observable MDPs, this is not a clear advantage because the policy evaluation step is more computationally expensive than the dynamic-programming update. But for POMDPs, policy evaluation has low-order polynomial complexity compared to the worst-case exponential complexity of the dynamic-programming update (Littman et al., 1995). Therefore, policy iteration appears to have a clearer advantage over value iteration for POMDPs. Preliminary testing bears this out and suggests that policy iteration significantly outperforms value iteration as an approach to solving infinite-horizon POMDPs.

Acknowledgements

Thanks to Shlomo Zilberstein and especially Michael Littman for helpful discussions. Support for this work was provided in part by the National Science Foundation under grants IRI-9409827 and IRI-9624992.

References

Blackwell, D. (1965) Discounted dynamic programming. Ann. Math. Stat. 36:226-235.
Cassandra, A.; Kaelbling, L.P.; & Littman, M.L. (1994) Acting optimally in partially observable stochastic domains. In Proc. 12th National Conf. on AI, 1023-1028.

Cassandra, A.; Littman, M.L.; & Zhang, N.L. (1997) Incremental pruning: A simple, fast, exact algorithm for partially observable Markov decision processes. In Proc. 13th Annual Conf. on Uncertainty in AI.

Hansen, E.A. (1998) Finite-Memory Control of Partially Observable Systems. PhD thesis, Department of Computer Science, University of Massachusetts at Amherst.

Jaakkola, T.; Singh, S.P.; & Jordan, M.I. (1995) Reinforcement learning algorithm for partially observable Markov decision problems. In NIPS-7.

Littman, M.L.; Cassandra, A.R.; & Kaelbling, L.P. (1995) Efficient dynamic-programming updates in partially observable Markov decision processes. Computer Science Technical Report CS-95-19, Brown University.

Monahan, G.E. (1982) A survey of partially observable Markov decision processes: Theory, models, and algorithms. Management Science 28:1-16.

Smallwood, R.D. & Sondik, E.J. (1973) The optimal control of partially observable Markov processes over a finite horizon. Operations Research 21:1071-1088.

Sondik, E.J. (1971) The Optimal Control of Partially Observable Markov Processes. PhD thesis, Department of Electrical Engineering, Stanford University.

Sondik, E.J. (1978) The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs. Operations Research 26:282-304.
", "award": [], "sourceid": 1447, "authors": [{"given_name": "Eric", "family_name": "Hansen", "institution": null}]}