{"title": "Exploiting Model Uncertainty Estimates for Safe Dynamic Control Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1047, "page_last": 1053, "abstract": null, "full_text": "Exploiting Model Uncertainty Estimates for Safe Dynamic Control Learning\n\nJeff G. Schneider\nThe Robotics Institute\nCarnegie Mellon University\nPittsburgh, PA 15213\nschneide@cs.cmu.edu\n\nAbstract\n\nModel learning combined with dynamic programming has been shown to be effective for learning control of continuous state dynamic systems. The simplest method assumes the learned model is correct and applies dynamic programming to it, but many approximators provide uncertainty estimates on the fit. How can they be exploited? This paper addresses the case where the system must be prevented from having catastrophic failures during learning. We propose a new algorithm adapted from the dual control literature and use Bayesian locally weighted regression models with dynamic programming. A common reinforcement learning assumption is that aggressive exploration should be encouraged. This paper addresses the converse case in which the system has to rein in exploration. The algorithm is illustrated on a 4 dimensional simulated control problem.\n\n1 Introduction\nReinforcement learning and related grid-based dynamic programming techniques are increasingly being applied to dynamic systems with continuous valued state spaces. Recent results on the convergence of dynamic programming methods when using various interpolation methods to represent the value (or cost-to-go) function have given a sound theoretical basis for applying reinforcement learning to continuous valued state spaces [Gordon, 1995]. 
These are important steps toward the eventual application of these methods to industrial learning and control problems.\n\nIt has also been reported recently that there are significant benefits in data and computational efficiency when data from running a system is used to build a model, rather than using it once for single value function updates (as Q-learning would do) and discarding it [Sutton, 1990, Moore and Atkeson, 1993, Schaal and Atkeson, 1993, Davies, 1996]. Dynamic programming sweeps can then be done on the learned model either off-line or on-line. In its vanilla form, this method assumes the model is correct and does deterministic dynamic programming using the model. This assumption is often not correct, especially in the early stages of learning. When learning simulated or software systems, there may be no harm in the fact that this assumption does not hold. However, in real, physical systems there are often states that really are catastrophic and must be avoided even during learning. Worse yet, learning may have to occur during normal operation of the system, in which case its performance during learning must not be significantly degraded.\n\nThe literature on adaptive and optimal linear control theory has explored this problem considerably under the names stochastic control and dual control. Overviews can be found in [Kendrick, 1981, Bar-Shalom and Tse, 1976]. The control decision is based on three components called the deterministic, cautionary, and probing terms. The deterministic term assumes the model is perfect and attempts to control for the best performance. Clearly, this may lead to disaster if the model is inaccurate. Adding a cautionary term yields a controller that considers the uncertainty in the model and chooses a control for the best expected performance. 
Finally, if the system learns while it is operating, there may be some benefit to choosing controls that are suboptimal and/or risky in order to obtain better data for the model and ultimately achieve better long-term performance. The addition of the probing term does this and gives a controller that yields the best long-term performance.\n\nThe advantage of dual control is that its strong mathematical foundation can provide the optimal learning controller under some assumptions about the system, the model, noise, and the performance criterion. Dynamic programming methods such as reinforcement learning have the advantage that they do not make strong assumptions about the system, or the form of the performance measure. It has been suggested [Atkeson, 1995, Atkeson, 1993] that techniques used in global linear control, including caution and probing, may also be applicable in the local case. In this paper we propose an algorithm that combines grid based dynamic programming with the cautionary concept from dual control via the use of a Bayesian locally weighted regression model.\n\nOur algorithm is designed with industrial control applications in mind. A typical scenario is that a production line is being operated conservatively. There is data available from its operation, but it only covers a small region of the state space and thus can not be used to produce an accurate model over the whole potential range of operation. Management is interested in improving the line's response to changes in set points or disturbances, but can not risk much loss of production during the learning process. The goal of our algorithm is to collect new data and optimize the process while explicitly minimizing the risk. 
\n\n2 The Algorithm\nConsider a system whose dynamics are given by $x^{k+1} = f(x^k, u^k)$. The state, x, and control, u, are real valued vectors and k represents discrete time increments. A model of f is denoted as $\\hat{f}$. The task is to minimize a cost functional of the form $J = \\sum_{k=0}^{N} L(x^k, u^k, k)$ subject to the system dynamics. N may or may not be fixed depending on the problem. L is given, but f must be learned. The goal is to acquire data to learn f in order to minimize J without incurring huge penalties in J during learning. There is an implicit assumption that the cost function defines catastrophic states. If it were known that there were no disasters to avoid, then simpler, more aggressive algorithms would likely outperform the one presented here.\n\nThe top level algorithm is as follows:\n\n1. Acquire some data while operating the system from an existing controller.\n2. Construct a model from the data using Bayesian locally weighted regression.\n3. Perform DP with the model to compute a value function and a policy.\n4. Operate the system using the new policy and record additional data.\n5. Return to step 2 while there is still some improvement in performance.\n\nIn the rest of this section we describe steps 2 and 3.\n\n2.1 Bayesian locally weighted regression\nWe use a form of locally weighted regression [Cleveland and Delvin, 1988, Atkeson, 1989, Moore, 1992] called Bayesian locally weighted regression [Moore and Schneider, 1995] to build a model from data. When a query, $x_q$, is made, each of the stored data points receives a weight $w_i = \\exp(-\\|x_i - x_q\\|^2 / K)$. K is the kernel width which controls the amount of localness in the regression. 
For Bayesian LWR we assume a wide, weak normal-gamma prior on the coefficients of the regression model and the inverse of the noise covariance. The result of a prediction is a t distribution on the output that remains well defined even in the absence of data (see [Moore and Schneider, 1995] and [DeGroot, 1970] for details).\n\nThe distribution of the prediction in regions where there is little data is crucial to the performance of the DP algorithm. As is often the case with learning through search and experimentation, it is at least as important that a function approximator predicts its own ignorance in regions of no data as it is that it interpolates well in data rich regions.\n\n2.2 Grid based dynamic programming\nIn dynamic programming, the optimal value function, V, represents the cost-to-go from each state to the end of the task assuming that the optimal policy is followed from that point on. The value function can be computed iteratively by identifying the best action from each state and updating it according to the expected results of the action as given by a model of the system. The update equation is:\n\n$V^{k+1}(x) = \\min_u \\left[ L(x, u) + V^k(\\hat{f}(x, u)) \\right]$ (1)\n\nIn our algorithm, updates to the value function are computed while considering the probability distribution on the results of each action. If we assume that the output of the real system at each time step is an independent random variable whose probability density function is given by the uncertainty from the model, the update equation is as follows:\n\n$V^{k+1}(x) = \\min_u \\left[ L(x, u) + E[V^k(f(x, u)) \\mid \\hat{f}] \\right]$ (2)\n\nNote that the independence assumption does not hold when reasonably smooth system dynamics are modeled by a smooth function approximator. 
The model error at one time step along a trajectory is highly correlated with the model error at the following step assuming a small distance traveled during the time step.\n\nOur algorithm for DP with model uncertainty on a grid is as follows:\n\n1. Discretize the state space, X, and the control space, U.\n2. For each state and each control cache the cost of taking this action from this state. Also compute the probability density function on the next state from the model and cache the information. There are two cases which are shown graphically in fig. 1:\n\n\u2022 If the distribution is much narrower than the grid spacing, then the model is confident and a deterministic update will be done according to eq. 1. Multilinear interpolation is used to compute the value function at the mean of the predicted next state [Davies, 1996].\n\n\u2022 Otherwise, a stochastic update will be done according to eq. 2. The pdf of each of the state variables is stored, discretized at the same intervals as the grid representing the value function. Output independence is assumed and later the pdf of each grid point will be computed as the product of the pdfs for each dimension and a weighted sum of all the grid points with significant weight will be computed. Also the total probability mass outside the bounds of the grid is computed and stored.\n\n3. For each state, use the cached information to estimate the cost of choosing each action from that state. Update the value function at that state according to the cost of the best action found.\n4. Repeat 3 until the value function converges, or the desired number of steps has been reached in finite step problems.\n5. Record the best action (policy) for each grid point.\n\n[Figure 1 panels: High Confidence Next State; Low Confidence Next State]\n\nFigure 1: Illustration of the two kinds of cached updates. In the high confidence scenario the transition is treated as deterministic and the value function is computed with multilinear interpolation: $V_{10}^{k+1} = L(x, u) + 0.4 V_7^k + 0.3 V_8^k + 0.2 V_{11}^k + 0.1 V_{12}^k$. In the low confidence scenario the transition is treated stochastically and the update takes a weighted sum over all the vertices of significant weight as well as the probability mass outside the grid: $V_{10}^{k+1} = L(x, u) + \\frac{\\sum_{\\{x' \\mid p(x') > \\epsilon\\}} p(x' \\mid \\hat{f}, x, u) V^k(x')}{\\sum_{\\{x' \\mid p(x') > \\epsilon\\}} p(x' \\mid \\hat{f}, x, u)}$\n\n3 Experiments: Minimal Time Cart-Pole Maneuvers\nThe inverted pendulum is a well studied problem. It is easy to learn to stabilize it in a small number of trials, but not easy to learn quick maneuvers. We demonstrate our algorithm on the harder problem of moving the cart-pole stably from one position to another as quickly as possible. We assume we have a controller that can balance the pole and would like to learn to move the cart quickly to new positions, but never drop the pole during the learning process. The simulation equations and parameters are from [Barto et al., 1983] and the task is illustrated at the top of fig. 2. The state vector is x = [ pole angle ($\\theta$), pole angular velocity ($\\dot{\\theta}$), cart position ($p$), cart velocity ($\\dot{p}$) ]. The control vector, u, is the one dimensional force applied to the cart. $x^0$ is [0 0 17 0] and the cost function is $J = \\sum_{k=0}^{N} x^T x + 0.01 u^T u$. N is not fixed. 
It is determined by the amount of time it takes for the system to reach a goal region about the target state, [0 0 0 0]. If the pole is dropped, the trial ends and an additional penalty of $10^6$ is incurred.\n\nThis problem has properties similar to familiar process control problems such as cooking, mixing, or cooling, because it is trivial to stabilize the system and it can be moved slowly to a new desired position while maintaining the stability by slowly changing positional setpoints. In each case, the goal is to learn how to respond faster without causing any disasters during, or after, the learning process.\n\n3.1 Learning an LQR controller\nWe first learn a linear quadratic regulator that balances the pole. This can be done with minimal data. The system is operated from the state, [0 0 0 0] for 10 steps of length 0.1 seconds with a controller that chooses u randomly from a zero mean gaussian with standard deviation 0.5. This is repeated to obtain a total of 20 data points. That data is used to fit a global linear model mapping x onto x'. An LQR controller is constructed from the model and the given cost function following the derivation in [Dyer and McReynolds, 1970].\n\nThe resulting linear controller easily stabilizes the pole and can even bring the system stably (although very inefficiently as it passes through the goal several times before coming to rest there) to the origin when started as far out as x = [0 0 10 0]. If the cart is started further from the origin, the controller crashes the system.\n\n3.2 Building the initial Bayesian LWR model\nWe use the LQR controller to generate data for an initial model. 
The system is started at x = [0 0 1 0] and controlled by the LQR controller with gaussian noise added as before. The resulting 50 data points are stored for an LWR model that maps $[\\theta, \\dot{\\theta}, u] \\rightarrow [\\ddot{\\theta}, \\ddot{p}]$. The data in each dimension of the state and control space is scaled to [0 1]. In this scaled space, the LWR kernel width is set to 1.0.\n\nNext, we consider the deterministic DP method on this model. The grid covers the ranges: [\u00b11.0 \u00b14.0 \u00b121.0 \u00b120.0] and is discretized to [11 9 11 9] levels. The control is \u00b130.0 discretized to 15 levels. Any state outside the grid bounds is considered failure and incurs the $10^6$ penalty. If we assume the model is correct, we can use deterministic DP on the grid to generate a policy. The computation is done with fixed size steps in time of 0.25 seconds. We observe that this policy is able to move the system safely from an initial state of [0 0 12 0], but crashes if it is started further out. Failure occurs because the best path generated using the model strays far from the region of the data (in variables $\\theta$ and $\\dot{\\theta}$) used to construct the model.\n\nIt is disappointing that the use of LWR for nonlinear modeling didn't improve much over a globally linear model and an LQR controller. We believe this is a common situation. It is difficult to build better controllers from naive use of nonlinear modeling techniques because the available data models only a narrow region of operation and safely acquiring a wider range of data is difficult.\n\n3.3 Cautionary dynamic programming\nAt this point we are ready to test our algorithm. Step 3 is executed using the LWR model from the data generated by the LQR controller as before. A trace of the system's operation when started at a distance of 17 from the goal is shown at the top of fig. 2. 
The controller is extremely conservative with respect to the angle of the pole. The pole is never allowed to go outside \u00b10.13 radians. Even as the cart approaches the goal at a moderate velocity the controller chooses to overshoot the goal considerably rather than making an abrupt action to brake the system.\n\nThe data from this run is added to the model and the steps are repeated. Traces of the runs from three iterations of the algorithm are shown in fig. 2. At each trial, the controller becomes more aggressive and completes the task with less cost. After the third iteration, no significant improvement is observed. The costs are summarized and compared with the LQR and deterministic DP controllers in table 1.\n\nFig. 3 is another illustration of how the policy becomes increasingly aggressive. It plots the pole angle vs. the pole angular velocity for the original LQR data and the executions at each of the following three trials. In summary, our algorithm is able to start from a simple controller that can stabilize the pole and learn to move it aggressively over a long distance without ever dropping the pole during learning.\n\n[Figure 2 image: cart-pole task diagram with the goal region marked, followed by three execution traces]\n\nFigure 2: The task is to move the cart to the origin as quickly as possible without dropping the pole. The bottom three pictures show a trace of the policy execution obtained after one, two, and three trials (shown in increments of 0.5 seconds).\n\nController | Number of data points used to build the controller | Cost from initial state 17\nLQR | 20 | failure\nDeterministic DP | 50 | failure\nStochastic DP trial 1 | 50 | 12393\nStochastic DP trial 2 | 221 | 7114\nStochastic DP trial 3 | 272 | 6270\n\nTable 1: Summary of experimental results\n\n4 Discussion\nWe have presented an algorithm that uses Bayesian locally weighted regression models with dynamic programming on a grid. The result is a cautionary adaptive control algorithm with the flexibility of a non-parametric nonlinear model instead of the more restrictive parametric models usually considered in the dual control literature. We note that this algorithm presents a viewpoint on the exploration vs exploitation issue that is different from many reinforcement learning algorithms, which are devised to encourage exploration (as in the probing concept in dual control). However, we argue that modeling the data first with a continuous function approximator and then doing DP on the model often leads to a situation where exploration must be inhibited to prevent disasters. This is particularly true in the case of real, physical systems.\n\n[Figure 3 plot: Angular Velocity vs. Pole Angle; series: LQR data, 1st trial, 2nd trial, 3rd trial]\n\nFigure 3: Execution trace. At each iteration, the controller is more aggressive.\n\nReferences\n[Atkeson, 1989] C. Atkeson. Using local models to control movement. In Advances in Neural Information Processing Systems, 1989.\n[Atkeson, 1993] C. Atkeson. Using local trajectory optimizers to speed up global optimization in dynamic programming. In Advances in Neural Information Processing Systems (NIPS-6), 1993.\n[Atkeson, 1995] C. Atkeson. Local methods for active learning. Invited talk at AAAI Fall Symposium on Active Learning, 1995.\n[Bar-Shalom and Tse, 1976] Y. Bar-Shalom and E. Tse. Concepts and Methods in Stochastic Control. Academic Press, 1976.\n[Barto et al., 1983] A. Barto, R. Sutton, and C. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 1983.\n[Cleveland and Delvin, 1988] W. Cleveland and S. Delvin. Locally weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association, pages 596-610, September 1988.\n[Davies, 1996] S. Davies. Applying grid-based interpolation to reinforcement learning. In Neural Information Processing Systems 9, 1996.\n[DeGroot, 1970] M. DeGroot. Optimal Statistical Decisions. McGraw-Hill, 1970.\n[Dyer and McReynolds, 1970] P. Dyer and S. McReynolds. The Computation and Theory of Optimal Control. Academic Press, 1970.\n[Gordon, 1995] G. Gordon. Stable function approximation in dynamic programming. In The 12th International Conference on Machine Learning, 1995.\n[Kendrick, 1981] D. Kendrick. 
Stochastic Control for Economic Models. McGraw-Hill, 1981.\n[Moore and Atkeson, 1993] A. Moore and C. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13(1):103-130, 1993.\n[Moore and Schneider, 1995] A. Moore and J. Schneider. Memory based stochastic optimization. In Advances in Neural Information Processing Systems (NIPS-8), 1995.\n[Moore, 1992] A. Moore. Fast, robust adaptive control by learning only forward models. In Advances in Neural Information Processing Systems 4, 1992.\n[Schaal and Atkeson, 1993] S. Schaal and C. Atkeson. Assessing the quality of learned local models. In Advances in Neural Information Processing Systems (NIPS-6), 1993.\n[Sutton, 1990] R. Sutton. First results with dyna, an integrated architecture for learning, planning, and reacting. In AAAI Spring Symposium on Planning in Uncertain, Unpredictable, or Changing Environments, 1990.\n", "award": [], "sourceid": 1317, "authors": [{"given_name": "Jeff", "family_name": "Schneider", "institution": null}]}