{"title": "Convergence of Stochastic Iterative Dynamic Programming Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 703, "page_last": 710, "abstract": null, "full_text": "Convergence of Stochastic Iterative Dynamic Programming Algorithms \n\nTommi Jaakkola* \n\nMichael I. Jordan \n\nSatinder P. Singh \n\nDepartment of Brain and Cognitive Sciences \nMassachusetts Institute of Technology \nCambridge, MA 02139 \n\nAbstract \n\nIncreasing attention has recently been paid to algorithms based on dynamic programming (DP) due to the suitability of DP for learning problems involving control. In stochastic environments where the system being controlled is only incompletely known, however, a unifying theoretical account of these methods has been missing. In this paper we relate DP-based learning algorithms to the powerful techniques of stochastic approximation via a new convergence theorem, enabling us to establish a class of convergent algorithms to which both TD(λ) and Q-learning belong. \n\n1 INTRODUCTION \n\nLearning to predict the future and to find an optimal way of controlling it are the basic goals of learning systems that interact with their environment. A variety of algorithms are currently being studied for the purposes of prediction and control in incompletely specified, stochastic environments. Here we consider learning algorithms defined in Markov environments. There are actions or controls (u) available to the learner that affect both the state transition probabilities and the probability distribution for the immediate, state-dependent costs c_i(u) incurred by the learner. Let p_ij(u) denote the probability of a transition to state j when control u is executed in state i. The learning problem is to predict the expected cost of a 
* E-mail: tommi@psyche.mit.edu \n\nfixed policy p (a function from states to actions), or to obtain the optimal policy (p*) that minimizes the expected cost of interacting with the environment. \n\nIf the learner were allowed to know the transition probabilities as well as the immediate costs, the control problem could be solved directly by Dynamic Programming (see e.g., Bertsekas, 1987). However, when the underlying system is only incompletely known, algorithms such as Q-learning (Watkins, 1989) for prediction and control, and TD(λ) (Sutton, 1988) for prediction, are needed. \n\nOne of the central problems in developing a theoretical understanding of these algorithms is to characterize their convergence; that is, to establish under what conditions they are ultimately able to obtain correct predictions or optimal control policies. The stochastic nature of these algorithms immediately suggests the use of stochastic approximation theory to obtain convergence results. However, there exist no directly applicable stochastic approximation techniques for problems involving the maximum norm, which plays a crucial role in learning algorithms based on DP. \n\nIn this paper, we extend Dvoretzky's (1956) formulation of the classical Robbins-Monro (1951) stochastic approximation theory to obtain a class of converging processes involving the maximum norm. In addition, we show that Q-learning and both the on-line and batch versions of TD(λ) are realizations of this new class. This approach keeps the convergence proofs simple and does not rely on constructions specific to particular algorithms. Several other authors have recently presented results that are similar to those presented here: Dayan and Sejnowski (1993) for TD(λ), Peng and Williams (1993) for TD(λ), and Tsitsiklis (1993) for Q-learning. 
Our results appear to be closest to those of Tsitsiklis (1993). \n\n2 Q-LEARNING \n\nThe Q-learning algorithm produces values, the \"Q-values\", by which an optimal action can be determined at any state. The algorithm is based on DP, rewriting Bellman's equation such that there is a value assigned to every state-action pair instead of only to a state. Thus the Q-values satisfy \n\nQ(s, u) = c̄_s(u) + γ Σ_{s'} p_{ss'}(u) max_{u'} Q(s', u')   (1) \n\nwhere c̄ denotes the mean of c. The solution to this equation can be obtained by updating the Q-values iteratively, an approach known as the value iteration method. In the learning problem the values for the mean of c and for the transition probabilities are unknown. However, the observable quantity \n\nc_{s_t}(u_t) + γ max_u Q(s_{t+1}, u)   (2) \n\nwhere s_t and u_t are the state of the system and the action taken at time t, respectively, is an unbiased estimate of the update used in value iteration. The Q-learning algorithm is a relaxation method that uses this estimate iteratively to update the current Q-values (see below). \n\nThe Q-learning algorithm converges mainly due to the contraction property of the value iteration operator. \n\n2.1 CONVERGENCE OF Q-LEARNING \n\nOur proof is based on the observation that the Q-learning algorithm can be viewed as a stochastic process to which techniques of stochastic approximation are generally applicable. Due to the lack of a formulation of stochastic approximation for the maximum norm, however, we need to slightly extend the standard results. This is accomplished by the following theorem, the proof of which can be found in Jaakkola et al. (1993). \n\nTheorem 1  A random iterative process Δ_{n+1}(x) = (1 - α_n(x)) Δ_n(x) + β_n(x) F_n(x) converges to zero w.p.1 under the following assumptions: \n\n1) The state space is finite. 
\n2) Σ_n α_n(x) = ∞, Σ_n α_n^2(x) < ∞, Σ_n β_n(x) = ∞, Σ_n β_n^2(x) < ∞, and E{β_n(x) | P_n} ≤ E{α_n(x) | P_n} uniformly w.p.1. \n\n3) || E{F_n(x) | P_n} ||_W ≤ γ || Δ_n ||_W, where γ ∈ (0, 1). \n\n4) Var{F_n(x) | P_n} ≤ C(1 + || Δ_n ||_W)^2, where C is some constant. \n\nHere P_n = {Δ_n, Δ_{n-1}, ..., F_{n-1}, ..., α_{n-1}, ..., β_{n-1}, ...} stands for the past at step n. F_n(x), α_n(x) and β_n(x) are allowed to depend on the past insofar as the above conditions remain valid. The notation || · ||_W refers to some weighted maximum norm. \n\nIn applying the theorem, the Δ_n process will generally represent the difference between a stochastic process of interest and some optimal value (e.g., the optimal value function). The formulation of the theorem therefore requires knowledge to be available about the optimal solution to the learning problem before it can be applied to any algorithm whose convergence is to be verified. In the case of Q-learning the required knowledge is available through the theory of DP and Bellman's equation in particular. \n\nThe convergence of the Q-learning algorithm now follows easily by relating the algorithm to the converging stochastic process defined by Theorem 1.¹ \n\nTheorem 2  The Q-learning algorithm given by \n\nQ_{t+1}(s_t, u_t) = (1 - α_t(s_t, u_t)) Q_t(s_t, u_t) + α_t(s_t, u_t) [c_{s_t}(u_t) + γ V_t(s_{t+1})], \n\nwhere V_t(s) = max_u Q_t(s, u) as in (2), converges to the optimal Q*(s, u) values if \n\n1) The state and action spaces are finite. \n\n2) Σ_t α_t(s, u) = ∞ and Σ_t α_t^2(s, u) < ∞ uniformly w.p.1. \n\n3) Var{c_s(u)} is bounded. \n\n4) If γ = 1, all policies lead to a cost-free terminal state w.p.1. \n\n¹ We note that the theorem is more powerful than is needed to prove the convergence of Q-learning. Its generality, however, allows it to be applied to other algorithms as well (see the following section on TD(λ)). \n\nProof. 
By subtracting Q*(s, u) from both sides of the learning rule and by defining Δ_t(s, u) = Q_t(s, u) - Q*(s, u) together with \n\nF_t(s, u) = c_s(u) + γ V_t(s_{t+1}) - Q*(s, u),   (3) \n\nthe Q-learning algorithm can be seen to have the form of the process in Theorem 1 with β_t(s, u) = α_t(s, u). \n\nTo verify that F_t(s, u) has the required properties we begin by showing that it is a contraction mapping with respect to some maximum norm. This is done by relating F_t to the DP value iteration operator for the same Markov chain. More specifically, \n\nmax_u |E{F_t(i, u)}| ≤ γ max_u Σ_j p_ij(u) max_v |Q_t(j, v) - Q*(j, v)| = γ max_u Σ_j p_ij(u) V_Δ(j) = T(V_Δ)(i), \n\nwhere we have used the notation V_Δ(j) = max_v |Q_t(j, v) - Q*(j, v)| and T is the DP value iteration operator for the case where the costs associated with each state are zero. If γ < 1 the contraction property of E{F_t(i, u)} can be obtained by bounding Σ_j p_ij(u) V_Δ(j) by max_j V_Δ(j) and then including the γ factor. When the future costs are not discounted (γ = 1) but the chain is absorbing and all policies lead to the terminal state w.p.1, there still exists a weighted maximum norm with respect to which T is a contraction mapping (see e.g. Bertsekas & Tsitsiklis, 1989), thereby forcing the contraction of E{F_t(i, u)}. The variance of F_t(s, u) given the past is within the bounds of Theorem 1 as it depends on Q_t(s, u) at most linearly and the variance of c_s(u) is bounded. \n\nNote that the proof covers both the on-line and batch versions. □ \n\n3 THE TD(λ) ALGORITHM \n\nTD(λ) (Sutton, 1988) is also a DP-based learning algorithm that is naturally defined in a Markov environment. Unlike Q-learning, however, TD does not involve decision-making tasks but rather predictions about the future costs of an evolving system. 
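Before turning to TD(λ), the relaxation update of Theorem 2 can be illustrated numerically. The following is a minimal Python sketch under invented assumptions: a two-state, two-action MDP with made-up transition probabilities and deterministic costs, a cost-minimizing convention (min over actions in place of the max printed in (1)), and learning rates α_t(s, u) = 1/(number of visits to (s, u)), which satisfy condition 2 of Theorem 2. The reference Q* is obtained by value iteration.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (all numbers invented for illustration).
# Costs are minimized, so the DP operator takes min over actions.
rng = np.random.default_rng(0)
gamma = 0.5
P = np.array([[[0.8, 0.2], [0.2, 0.8]],
              [[0.5, 0.5], [0.9, 0.1]]])   # P[s, u, s'] transition probabilities
c = np.array([[1.0, 2.0], [0.5, 1.5]])     # mean immediate costs c[s, u]

# Reference solution: value iteration on the Q-values (a contraction).
Q_star = np.zeros((2, 2))
for _ in range(500):
    Q_star = c + gamma * P @ Q_star.min(axis=1)

# Q-learning: relax toward the sampled one-step estimate, cf. (2).
Q = np.zeros((2, 2))
visits = np.zeros((2, 2))
s = 0
for _ in range(100_000):
    u = int(rng.integers(2))               # exploratory random policy
    s_next = int(rng.choice(2, p=P[s, u]))
    visits[s, u] += 1
    alpha = 1.0 / visits[s, u]             # sum alpha = inf, sum alpha^2 < inf
    target = c[s, u] + gamma * Q[s_next].min()
    Q[s, u] += alpha * (target - Q[s, u])
    s = s_next

print(np.abs(Q - Q_star).max())            # maximum deviation from Q*
```

With the 1/n rates the iterate is simply an empirical average of the bootstrapped one-step targets; any schedule satisfying condition 2 of Theorem 2 would serve equally well in this sketch.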
TD(λ) converges to the same predictions as a version of Q-learning in which there is only one action available at each state, but the algorithms are derived from slightly different grounds and their behavioral differences are not well understood. \n\nThe algorithm is based on the estimates \n\nV_t^λ(i) = (1 - λ) Σ_{n=1}^∞ λ^{n-1} V_t^{(n)}(i)   (4) \n\nwhere V_t^{(n)}(i) are n-step look-ahead predictions. The expected values of the V_t^λ(i) are strictly better estimates of the correct predictions than the V_t(i) are (see Jaakkola et al., 1993), and the update equation of the algorithm \n\nV_{t+1}(i_t) = V_t(i_t) + α_t [V_t^λ(i_t) - V_t(i_t)]   (5) \n\ncan be written in a practical recursive form as is seen below. The convergence of the algorithm is mainly due to the statistical properties of the V_t^λ(i) estimates. \n\n3.1 CONVERGENCE OF TD(λ) \n\nAs we are interested in strong forms of convergence we need to impose some new constraints, but due to the generality of the approach we can dispense with some others. Specifically, the learning rate parameters α_n are replaced by α_n(i), which satisfy Σ_n α_n(i) = ∞ and Σ_n α_n^2(i) < ∞ uniformly w.p.1. These parameters allow asynchronous updating and they can, in general, be random variables. The convergence of the algorithm is guaranteed by the following theorem, which is an application of Theorem 1. \n\nTheorem 3  For any finite absorbing Markov chain, for any distribution of starting states with no inaccessible states, and for any distributions of the costs with finite variances, the TD(λ) algorithm given by \n\n1) \n\nV_{n+1}(i) = V_n(i) + α_n(i) Σ_{t=1}^m [c_{i_t} + γ V_n(i_{t+1}) - V_n(i_t)] Σ_{k=1}^t (γλ)^{t-k} χ_i(k), \n\nwhere m is the length of the sequence and χ_i(k) indicates whether state i was visited at time k, with \n\nΣ_n α_n(i) = ∞ and Σ_n α_n^2(i) < ∞ uniformly w.p.1, 
\n\n2) \n\nV_{t+1}(i) = V_t(i) + α_t(i) [c_{i_t} + γ V_t(i_{t+1}) - V_t(i_t)] Σ_{k=1}^t (γλ)^{t-k} χ_i(k), \n\nwith Σ_t α_t(i) = ∞ and Σ_t α_t^2(i) < ∞ uniformly w.p.1 and, within sequences, α_t(i)/max_{t∈S} α_t(i) → 1 uniformly w.p.1, \n\nconverges to the optimal predictions w.p.1 provided γ, λ ∈ [0, 1] with γλ < 1. \n\nProof for (1). We use here a slightly different form for the learning rule (cf. the previous section): \n\nV_{n+1}(i) = V_n(i) + α_n(i) [G_n(i) - (m(i)/E{m(i)}) V_n(i)], \n\nG_n(i) = (1/E{m(i)}) Σ_{k=1}^{m(i)} V_n^λ(i; k), \n\nwhere V_n^λ(i; k) is an estimate calculated at the kth occurrence of state i in a sequence and for mathematical convenience we have made the transformation α_n(i) → E{m(i)} α_n(i), where m(i) is the number of times state i was visited during the sequence. \n\nTo apply Theorem 1 we subtract V*(i), the optimal predictions, from both sides of the learning equation. By identifying α_n(i) := α_n(i) m(i)/E{m(i)}, β_n(i) := α_n(i), and F_n(i) := G_n(i) - V*(i) m(i)/E{m(i)}, we need to show that these satisfy the conditions of Theorem 1. For α_n(i) and β_n(i) this is obvious. We begin here by showing that F_n(i) indeed is a contraction mapping. To this end, \n\nmax_i |E{F_n(i) | V_n}| = max_i (1/E{m(i)}) |E{ (V_n^λ(i; 1) - V*(i)) + (V_n^λ(i; 2) - V*(i)) + ··· | V_n }|, \n\nwhich can be bounded above by using the relation \n\n|E{V_n^λ(i; k) - V*(i) | V_n}| ≤ E{ |E{V_n^λ(i; k) - V*(i) | m(i) ≥ k, V_n}| θ(m(i) - k) | V_n } ≤ P{m(i) ≥ k} |E{V_n^λ(i) - V*(i) | V_n}| ≤ γ P{m(i) ≥ k} max_i |V_n(i) - V*(i)|, \n\nwhere θ(x) = 0 if x < 0 and 1 otherwise. Here we have also used the fact that V_n^λ(i) is a contraction mapping independent of possible discounting. 
As Σ_k P{m(i) ≥ k} = E{m(i)}, we finally get \n\nmax_i |E{F_n(i) | V_n}| ≤ γ max_i |V_n(i) - V*(i)|. \n\nThe variance of F_n(i) can be seen to be bounded by E{m^4} max_i |V_n(i)|^2. For any absorbing Markov chain the convergence to the terminal state is geometric and thus for every finite k, E{m^k} ≤ C(k), implying that the variance of F_n(i) is within the bounds of Theorem 1. As Theorem 1 is now applicable we can conclude that the batch version of TD(λ) converges to the optimal predictions w.p.1. □ \n\nProof for (2). The proof for the on-line version is achieved by showing that the effect of the on-line updating vanishes in the limit, thereby forcing the two versions to be equal asymptotically. We view the on-line version as a batch algorithm in which the updates are made after each complete sequence but are made in such a manner as to be equal to those made on-line. \n\nDefine G'_n(i) = G_n(i) + Ĝ_n(i) to be a new batch estimate taking into account the on-line updating within sequences. Here G_n(i) is the batch estimate with the desired properties (see the proof for (1)) and Ĝ_n(i) is the difference between the two. We take the new batch learning parameters to be the maxima over a sequence, that is, α_n(i) = max_{t∈S} α_t(i). As all the α_t(i) satisfy the required conditions uniformly w.p.1, these new learning parameters satisfy them as well. \n\nTo analyze the new batch algorithm we divide it into three parallel processes: the batch TD(λ) with α_n(i) as learning rate parameters, the difference between this and the new batch estimate, and the change in the value function due to the updates made on-line. Under the conditions of the TD(λ) convergence theorem, rigorous upper bounds can be derived for the latter two processes (see Jaakkola et al., 1993). 
These results enable us to write \n\n|| E{G'_n - V*} || ≤ || E{G_n - V*} || + || Ĝ_n || ≤ (γ' + C_n^1) || V_n - V* || + C_n^2, \n\nwhere C_n^1 and C_n^2 go to zero w.p.1. This implies that for any ε > 0 and || V_n - V* || ≥ ε there exists γ_c < 1 such that \n\n|| E{G'_n - V*} || ≤ γ_c || V_n - V* || \n\nfor n large enough. This is the required contraction property of Theorem 1. In addition, it can readily be checked that the variance of the new estimate falls under the conditions of Theorem 1. \n\nTheorem 1 now guarantees that for any ε the value function in the on-line algorithm converges w.p.1 into some ε-bounded region of V*, and therefore the algorithm itself converges to V* w.p.1. □ \n\n4 CONCLUSIONS \n\nIn this paper we have extended results from stochastic approximation theory to cover asynchronous relaxation processes that have a contraction property with respect to some maximum norm (Theorem 1). This new class of converging iterative processes is shown to include both the Q-learning and TD(λ) algorithms in either their on-line or batch versions. We note that the convergence of the on-line version of TD(λ) has not been shown previously. We also wish to emphasize the simplicity of our results. The convergence proofs for Q-learning and TD(λ) utilize only high-level statistical properties of the estimates used in these algorithms and do not rely on constructions specific to the algorithms. Our approach also sheds additional light on the similarities between Q-learning and TD(λ). \n\nAlthough Theorem 1 is readily applicable to DP-based learning schemes, the theory of Dynamic Programming is important only for its characterization of the optimal solution and for the contraction property needed in applying the theorem. The theorem can be applied to iterative algorithms of different types as well. 
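The kind of process covered by Theorem 1 is easy to exercise numerically. The sketch below is a toy instance with invented quantities: it iterates Δ_{n+1}(x) = (1 - α_n(x)) Δ_n(x) + β_n(x) F_n(x) with β_n = α_n = 1/n, where the conditional mean of F_n is γ times a cyclic shift of Δ_n (a maximum-norm contraction, γ < 1) plus zero-mean bounded-variance noise, so conditions 1 through 4 hold and Δ_n is driven to zero.

```python
import numpy as np

# Toy instance of the Theorem 1 process (all numbers hypothetical):
#   Delta_{n+1}(x) = (1 - a_n) Delta_n(x) + a_n F_n(x)
# with E{F_n(x) | past} = gamma * Delta_n(x'), a contraction in the
# maximum norm since gamma < 1, plus zero-mean noise.
rng = np.random.default_rng(1)
gamma = 0.5
Delta = np.array([5.0, -3.0, 4.0])
start = np.abs(Delta).max()
for n in range(1, 20_001):
    a = 1.0 / n                                   # sum a = inf, sum a^2 < inf
    mean_part = gamma * np.roll(Delta, 1)         # contraction via cyclic shift
    F = mean_part + rng.normal(0.0, 0.5, size=3)  # bounded-variance noise
    Delta = (1.0 - a) * Delta + a * F             # here beta_n = alpha_n

print(np.abs(Delta).max())                        # shrinks toward zero
```

Swapping the cyclic shift for any other mapping whose conditional mean stays within γ of Δ_n in maximum norm leaves the conclusion unchanged, which is the sense in which the theorem is indifferent to the details of the underlying algorithm.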
\n\nFinally we note that Theorem 1 can be extended to cover processes that do not show the usual contraction property, thereby increasing its applicability to algorithms of possibly more practical importance. \n\nReferences \n\nBertsekas, D. P. (1987). Dynamic Programming: Deterministic and Stochastic Models. Englewood Cliffs, NJ: Prentice-Hall. \n\nBertsekas, D. P., & Tsitsiklis, J. N. (1989). Parallel and Distributed Computation: Numerical Methods. Englewood Cliffs, NJ: Prentice-Hall. \n\nDayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8, 341-362. \n\nDayan, P., & Sejnowski, T. J. (1993). TD(λ) converges with probability 1. CNL, The Salk Institute, San Diego, CA. \n\nDvoretzky, A. (1956). On stochastic approximation. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability. University of California Press. \n\nJaakkola, T., Jordan, M. I., & Singh, S. P. (1993). On the convergence of stochastic iterative dynamic programming algorithms. Submitted to Neural Computation. \n\nPeng, J., & Williams, R. J. (1993). TD(λ) converges with probability 1. Department of Computer Science preprint, Northeastern University. \n\nRobbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22, 400-407. \n\nSutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9-44. \n\nTsitsiklis, J. N. (1993). Asynchronous stochastic approximation and Q-learning. Submitted to Machine Learning. \n\nWatkins, C. J. C. H. (1989). Learning from delayed rewards. PhD thesis, University of Cambridge, England. \n\nWatkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279-292. 
", "award": [], "sourceid": 764, "authors": [{"given_name": "Tommi", "family_name": "Jaakkola", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}, {"given_name": "Satinder", "family_name": "Singh", "institution": null}]}