{"title": "Convergence of Indirect Adaptive Asynchronous Value Iteration Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 695, "page_last": 702, "abstract": null, "full_text": "Convergence of Indirect Adaptive Asynchronous Value Iteration Algorithms\n\nVijaykumar Gullapalli\nDepartment of Computer Science\nUniversity of Massachusetts\nAmherst, MA 01003\nvijay@cs.umass.edu\n\nAndrew G. Barto\nDepartment of Computer Science\nUniversity of Massachusetts\nAmherst, MA 01003\nbarto@cs.umass.edu\n\nAbstract\n\nReinforcement learning methods based on approximating dynamic programming (DP) are receiving increased attention due to their utility in forming reactive control policies for systems embedded in dynamic environments. Environments are usually modeled as controlled Markov processes, but when the environment model is not known a priori, adaptive methods are necessary. Adaptive control methods are often classified as being direct or indirect. Direct methods directly adapt the control policy from experience, whereas indirect methods adapt a model of the controlled process and compute control policies based on the latest model. This paper focuses on indirect adaptive DP-based methods. We present a convergence result for indirect adaptive asynchronous value iteration algorithms for the case in which a look-up table is used to store the value function. Our result implies convergence of several existing reinforcement learning algorithms such as adaptive real-time dynamic programming (ARTDP) (Barto, Bradtke, & Singh, 1993) and prioritized sweeping (Moore & Atkeson, 1993).  
Although the emphasis of researchers studying DP-based reinforcement learning has been on direct adaptive methods such as Q-Learning (Watkins, 1989) and methods using TD algorithms (Sutton, 1988), it is not clear that these direct methods are preferable in practice to indirect methods such as those analyzed in this paper.\n\n1 INTRODUCTION\n\nReinforcement learning methods based on approximating dynamic programming (DP) are receiving increased attention due to their utility in forming reactive control policies for systems embedded in dynamic environments. In most of this work, learning tasks are formulated as Markovian Decision Problems (MDPs) in which the environment is modeled as a controlled Markov process. For each observed environmental state, the agent consults a policy to select an action, which when executed causes a probabilistic transition to a successor state. State transitions generate rewards, and the agent's goal is to form a policy that maximizes the expected value of a measure of the long-term reward for operating in the environment. (Equivalent formulations minimize a measure of the long-term cost of operating in the environment.) Artificial neural networks are often used to store value functions produced by these algorithms (e.g., (Tesauro, 1992)).\n\nRecent advances in reinforcement learning theory have shown that asynchronous value iteration provides an important link between reinforcement learning algorithms and classical DP methods for value iteration (VI) (Barto, Bradtke, & Singh, 1993).  
Whereas conventional VI algorithms use repeated exhaustive \"sweeps\" of the MDP's state set to update the value function, asynchronous VI can achieve the same result without proceeding in systematic sweeps (Bertsekas & Tsitsiklis, 1989). If the state ordering of an asynchronous VI computation is determined by state sequences generated during real or simulated interaction of a controller with the Markov process, the result is an algorithm called Real-Time DP (RTDP) (Barto, Bradtke, & Singh, 1993). Its convergence to optimal value functions in several kinds of problems follows from the convergence properties of asynchronous VI (Barto, Bradtke, & Singh, 1993).\n\n2 MDPS WITH INCOMPLETE INFORMATION\n\nBecause asynchronous VI employs a basic update operation that involves computing the expected value of the next state for all possible actions, it requires a complete and accurate model of the MDP in the form of state-transition probabilities and expected transition rewards. This is also true for the use of asynchronous VI in RTDP. Therefore, when state-transition probabilities and expected transition rewards are not completely known, asynchronous VI is not directly applicable. Problems such as these, which are called MDPs with incomplete information,\u00b9 require more complex adaptive algorithms for their solution. An indirect adaptive method works by identifying the underlying MDP via estimates of state-transition probabilities and expected transition rewards, whereas a direct adaptive method (e.g., Q-Learning (Watkins, 1989)) adapts the policy or the value function without forming an explicit model of the MDP through system identification.\n\nIn this paper, we prove a convergence theorem for a set of algorithms we call indirect adaptive asynchronous VI algorithms.  
These are indirect adaptive algorithms that result from simply substituting current estimates of transition probabilities and expected transition rewards (produced by some concurrently executing identification algorithm) for their actual values in the asynchronous value iteration computation. We show that under certain conditions, indirect adaptive asynchronous VI algorithms converge with probability one to the optimal value function. Moreover, we use our result to infer convergence of two existing DP-based reinforcement learning algorithms, adaptive real-time dynamic programming (ARTDP) (Barto, Bradtke, & Singh, 1993), and prioritized sweeping (Moore & Atkeson, 1993).\n\n\u00b9 These problems should not be confused with MDPs with incomplete state information, i.e., partially observable MDPs.\n\n3 CONVERGENCE OF INDIRECT ADAPTIVE ASYNCHRONOUS VI\n\nIndirect adaptive asynchronous VI algorithms are produced from non-adaptive algorithms by substituting a current approximate model of the MDP for the true model in the asynchronous value iteration computations. An indirect adaptive algorithm can be expected to converge only if the corresponding non-adaptive algorithm, with the true model used in the place of each approximate model, converges. We therefore restrict attention to indirect adaptive asynchronous VI algorithms that correspond in this way to convergent non-adaptive algorithms.  
We prove the following theorem:\n\nTheorem 1 For any finite state, finite action MDP with an infinite-horizon discounted performance measure, any indirect adaptive asynchronous VI algorithm (for which the corresponding non-adaptive algorithm converges) converges to the optimal value function with probability one if\n1) the conditions for convergence of the non-adaptive algorithm are met,\n2) in the limit, every action is executed from every state infinitely often, and\n3) the estimates of the state-transition probabilities and the expected transition rewards remain bounded and converge in the limit to their true values with probability one.\n\nProof The proof is given in Appendix A.2.\n\n4 DISCUSSION\n\nCondition 2 of the theorem, which is also required by direct adaptive methods to ensure convergence, is usually unavoidable. It is typically ensured by using a stochastic policy. For example, we can use the Gibbs distribution method for selecting actions used by Watkins (1989) and others. Given condition 2, condition 3 is easily satisfied by most identification methods. In particular, the simple maximum-likelihood identification method (see Appendix A.1, items 6 and 7) converges to the true model with probability one under this condition.\n\nOur result is valid only for the special case in which the value function is explicitly stored in a look-up table. The case in which general function approximators such as neural networks are used requires further analysis.\n\nFinally, an important issue not addressed in this paper is the trade-off between system identification and control. To ensure convergence of the model, all actions have to be executed infinitely often in every state.  
On the other hand, on-line control objectives are best served by executing the action in each state that is optimal according to the current value function (i.e., by using the certainty equivalence optimal policy). This issue has received considerable attention from control theorists (see, for example, (Kumar, 1985), and the references therein). Although we do not address this issue in this paper, for a specific estimation method, it may be possible to determine an action selection scheme that makes the best trade-off between identification and control.\n\n5 EXAMPLES OF INDIRECT ADAPTIVE ASYNCHRONOUS VI\n\nOne example of an indirect adaptive asynchronous VI algorithm is ARTDP (Barto, Bradtke, & Singh, 1993) with maximum-likelihood identification. In this algorithm, a randomized policy is used to ensure that every action has a non-zero probability of being executed in each state. The following theorem for ARTDP follows directly from our result and the corresponding theorem for RTDP in (Barto, Bradtke, & Singh, 1993):\n\nTheorem 2 For any discounted MDP and any initial value function, trial-based\u00b2 ARTDP converges with probability one.\n\nAs a special case of the above theorem, we can obtain the result that in similar problems the prioritized sweeping algorithm of Moore and Atkeson (Moore & Atkeson, 1993) converges to the optimal value function. This is because prioritized sweeping is a special case of ARTDP in which states are selected for value updates based on their priority and the processing time available. A state's priority reflects the utility of performing an update for that state, and hence prioritized sweeping can improve the efficiency of asynchronous VI.  
A similar algorithm, Queue-Dyna (Peng & Williams, 1992), can also be shown to converge to the optimal value function using a simple extension of our result.\n\n6 CONCLUSIONS\n\nWe have shown convergence of indirect adaptive asynchronous value iteration under fairly general conditions. This result implies the convergence of several existing DP-based reinforcement learning algorithms. Moreover, we have discussed possible extensions to our result. Our result is a step toward a better understanding of indirect adaptive DP-based reinforcement learning methods. There are several promising directions for future work.\n\nOne is to analyze the trade-off between model estimation and control mentioned earlier to determine optimal methods for action selection and to integrate our work with existing results on adaptive methods for MDPs (Kumar, 1985). Second, analysis is needed for the case in which a function approximation method, such as a neural network, is used instead of a look-up table to store the value function. A third possible direction is to analyze indirect adaptive versions of more general DP-based algorithms that combine asynchronous policy iteration with asynchronous policy evaluation. Several non-adaptive algorithms of this nature have been proposed recently (e.g., (Williams & Baird, 1993; Singh & Gullapalli)).\n\n\u00b2 As in (Barto, Bradtke, & Singh, 1993), by trial-based execution of an algorithm we mean its use in an infinite series of trials such that every state is selected infinitely often to be the start state of a trial. 
\n\nFinally, it will be useful to examine the relative efficacies of direct and indirect adaptive methods for solving MDPs with incomplete information. Although the emphasis of researchers studying DP-based reinforcement learning has been on direct adaptive methods such as Q-Learning and methods using TD algorithms, it is not clear that these direct methods are preferable in practice to indirect methods such as the ones discussed here. For example, Moore and Atkeson (1993) report several experiments in which prioritized sweeping significantly outperforms Q-learning in terms of the computation time and the number of observations required for convergence. More research is needed to characterize circumstances for which the various reinforcement learning methods are best suited.\n\nAPPENDIX\n\nA.1 NOTATION\n\n1. Time steps are denoted $t = 1, 2, \\ldots$, and $z_t$ denotes the last state observed before time $t$. $z_t$ belongs to a finite state set $S = \\{1, 2, \\ldots, n\\}$.\n\n2. Actions in a state are selected according to a policy $\\pi$, where $\\pi(i) \\in A$, a finite set of actions, for $1 \\le i \\le n$.\n\n3. The probability of making a transition from state $i$ to state $j$ on executing action $a$ is $p^a(i, j)$.\n\n4. The expected reward from executing action $a$ in state $i$ is $r(i, a)$. The reward received at time $t$ is denoted $r_t(z_t, a_t)$.\n\n5. $0 \\le \\gamma < 1$ is the discount factor.\n\n6. Let $\\hat{p}^a_t(i, j)$ denote the estimate at time $t$ of the probability of transition from state $i$ to $j$ on executing action $a \\in A$. Several different methods can be used for estimating $\\hat{p}^a_t(i, j)$.  
For example, if $n^a_t(i, j)$ is the observed number of times before time step $t$ that execution of action $a$ when the system was in state $i$ was followed by a transition to state $j$, and $n^a_t(i) = \\sum_{j \\in S} n^a_t(i, j)$ is the number of times action $a$ was executed in state $i$ before time step $t$, then, for $1 \\le i \\le n$ and for all $a \\in A$, the maximum-likelihood state-transition probability estimates at time $t$ are\n\n$$\\hat{p}^a_t(i, j) = \\frac{n^a_t(i, j)}{n^a_t(i)}, \\quad 1 \\le j \\le n.$$\n\nNote that the maximum-likelihood estimates converge to their true values with probability one if $n^a_t(i) \\to \\infty$ as $t \\to \\infty$, i.e., every action is executed from every state infinitely often.\n\nLet $p^a(i) = [p^a(i, 1), \\ldots, p^a(i, n)] \\in [0, 1]^n$, and similarly, $\\hat{p}^a_t(i) = [\\hat{p}^a_t(i, 1), \\ldots, \\hat{p}^a_t(i, n)] \\in [0, 1]^n$. We will denote the $|S| \\times |A|$ matrix of transition probabilities associated with state $i$ by $P(i)$ and its estimate at time $t$ by $\\hat{P}_t(i)$. Finally, $P$ denotes the vector of matrices $[P(1), \\ldots, P(n)]$, and $\\hat{P}_t$ denotes the vector $[\\hat{P}_t(1), \\ldots, \\hat{P}_t(n)]$.\n\n7. Let $\\hat{r}_t(i, a)$ denote the estimate at time $t$ of the expected reward $r(i, a)$, and let $\\hat{r}_t$ denote all the $|S| \\times |A|$ estimates at time $t$. Again, if maximum-likelihood estimation is used,\n\n$$\\hat{r}_t(i, a) = \\frac{\\sum_{k=1}^{t} r_k(z_k, a_k) I_{ia}(z_k, a_k)}{n^a_t(i)},$$\n\nwhere $I_{ia}: S \\times A \\to \\{0, 1\\}$ is the indicator function for the state-action pair $i, a$.\n\n8. $\\hat{V}^*_t$ denotes the optimal value function for the MDP defined by the estimates $\\hat{P}_t$ and $\\hat{r}_t$ of $P$ and $r$ at time $t$. Thus, $\\forall i \\in S$,\n\n$$\\hat{V}^*_t(i) = \\max_{a \\in A} \\Big\\{ \\hat{r}_t(i, a) + \\gamma \\sum_{j \\in S} \\hat{p}^a_t(i, j) \\hat{V}^*_t(j) \\Big\\}.$$\n\nSimilarly, $V^*$ denotes the optimal value function for the MDP defined by $P$ and $r$.\n\n9.  
$B_t \\subseteq S$ is the subset of states whose values are updated at time $t$. Usually, at least $z_t \\in B_t$.\n\nA.2 PROOF OF THEOREM 1\n\nIn indirect adaptive asynchronous VI algorithms, the estimates of the MDP parameters at time step $t$, $\\hat{P}_t$ and $\\hat{r}_t$, are used in place of the true parameters, $P$ and $r$, in the asynchronous VI computations at time $t$. Hence the value function is updated at time $t$ as\n\n$$V_{t+1}(i) = \\begin{cases} \\max_{a \\in A} \\big\\{ \\hat{r}_t(i, a) + \\gamma \\sum_{j \\in S} \\hat{p}^a_t(i, j) V_t(j) \\big\\} & \\text{if } i \\in B_t \\\\ V_t(i) & \\text{otherwise,} \\end{cases}$$\n\nwhere $B_t \\subseteq S$ is the subset of states whose values are updated at time $t$.\n\nFirst note that because $\\hat{P}_t$ and $\\hat{r}_t$ are assumed to be bounded for all $t$, $V_t$ is also bounded for all $t$. Next, because the optimal value function given the model $\\hat{P}_t$ and $\\hat{r}_t$, $\\hat{V}^*_t$, is a continuous function of the estimates $\\hat{P}_t$ and $\\hat{r}_t$, convergence of these estimates w.p. 1 to their true values implies that\n\n$$\\hat{V}^*_t \\xrightarrow{\\text{w.p. 1}} V^*,$$\n\nwhere $V^*$ is the optimal value function for the original MDP. The convergence w.p. 1 of $\\hat{V}^*_t$ to $V^*$ implies that given an $\\epsilon > 0$ there exists an integer $T > 0$ such that for all $t \\ge T$,\n\n$$\\|\\hat{V}^*_t - V^*\\| < \\frac{(1 - \\gamma)}{2\\gamma} \\epsilon \\quad \\text{w.p. 1.} \\tag{1}$$\n\nHere, $\\| \\cdot \\|$ can be any norm on $\\mathbb{R}^n$, although we will use the $\\ell_\\infty$ or max norm.\n\nIn algorithms based on asynchronous VI, the values of only the states in $B_t \\subseteq S$ are updated at time $t$, although the value of each state is updated infinitely often. For an arbitrary $z \\in S$, let us define the infinite subsequence $\\{t^z_k\\}_{k=0}^{\\infty}$ to be the times when the value of state $z$ gets updated. Further, let us only consider updates at, or after, time $T$, where $T$ is from equation (1) above, so that $t^z_0 \\ge T$ for all $z \\in S$.\n\nBy the nature of the VI computation we have, for each $t \\ge 1$,\n\n$$|V_{t+1}(i) - \\hat{V}^*_t(i)| \\le \\gamma \\|V_t - \\hat{V}^*_t\\| \\quad \\text{if } i \\in B_t. \\tag{2}$$\n\nUsing inequality (2), we can get a bound for $|V_{t^z_k+1}(z) - \\hat{V}^*_{t^z_k}(z)|$ as\n\n$$|V_{t^z_k+1}(z) - \\hat{V}^*_{t^z_k}(z)| < \\gamma^{k+1} \\|V_{t^z_0} - \\hat{V}^*_{t^z_0}\\| + (1 - \\gamma^k) \\epsilon \\quad \\text{w.p. 1.} \\tag{3}$$\n\nWe can verify that the bound in (3) is correct through induction. The bound is clearly valid for $k = 0$. Assuming it is valid for $k$, we show that it is valid for $k + 1$:\n\n$$\\begin{aligned} |V_{t^z_{k+1}+1}(z) - \\hat{V}^*_{t^z_{k+1}}(z)| &\\le \\gamma \\|V_{t^z_{k+1}} - \\hat{V}^*_{t^z_{k+1}}\\| \\\\ &\\le \\gamma \\left( \\|V_{t^z_{k+1}} - \\hat{V}^*_{t^z_k}\\| + \\|\\hat{V}^*_{t^z_k} - \\hat{V}^*_{t^z_{k+1}}\\| \\right) \\\\ &< \\gamma |V_{t^z_{k+1}}(z) - \\hat{V}^*_{t^z_k}(z)| + \\gamma \\left( \\frac{(1 - \\gamma) \\epsilon}{\\gamma} \\right) \\quad \\text{w.p. 1} \\\\ &= \\gamma |V_{t^z_k+1}(z) - \\hat{V}^*_{t^z_k}(z)| + (1 - \\gamma) \\epsilon \\\\ &< \\gamma \\left( \\gamma^{k+1} \\|V_{t^z_0} - \\hat{V}^*_{t^z_0}\\| + (1 - \\gamma^k) \\epsilon \\right) + (1 - \\gamma) \\epsilon \\quad \\text{w.p. 1} \\\\ &= \\gamma^{k+2} \\|V_{t^z_0} - \\hat{V}^*_{t^z_0}\\| + (1 - \\gamma^{k+1}) \\epsilon. \\end{aligned}$$\n\nTaking the limit as $k \\to \\infty$ in equation (3) and observing that for each $z$, $\\lim_{k \\to \\infty} \\hat{V}^*_{t^z_k}(z) = V^*(z)$ w.p. 1, we obtain\n\n$$\\lim_{k \\to \\infty} |V_{t^z_k+1}(z) - V^*(z)| < \\epsilon \\quad \\text{w.p. 1.}$$\n\nSince $\\epsilon$ and $z$ are arbitrary, this implies that $V_t \\to V^*$ w.p. 1. $\\Box$\n\nAcknowledgements\n\nWe gratefully acknowledge the significant contribution of Peter Dayan, who pointed out that a restrictive condition for convergence in an earlier version of our result was actually unnecessary. This work has also benefited from several discussions with Satinder Singh. We would also like to thank Chuck Anderson for his timely help in preparing this material for presentation at the conference. This material is based upon work supported by funding provided to A.  
Barto by the AFOSR, Bolling AFB, under Grant AFOSR-F49620-93-1-0269 and by the NSF under Grant ECS-92-14866.\n\nReferences\n\n[1] A. G. Barto, S. J. Bradtke, and S. P. Singh. Learning to act using real-time dynamic programming. Technical Report 93-02, University of Massachusetts, Amherst, MA, 1993.\n\n[2] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Englewood Cliffs, NJ, 1989.\n\n[3] P. R. Kumar. A survey of some results in stochastic adaptive control. SIAM Journal of Control and Optimization, 23(3):329-380, May 1985.\n\n[4] A. W. Moore and C. G. Atkeson. Memory-based reinforcement learning: Efficient computation with prioritized sweeping. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 263-270, San Mateo, CA, 1993. Morgan Kaufmann Publishers.\n\n[5] J. Peng and R. J. Williams. Efficient learning and planning within the dyna framework. In Proceedings of the Second International Conference on Simulation of Adaptive Behavior, Honolulu, HI, 1992.\n\n[6] S. P. Singh and V. Gullapalli. Asynchronous modified policy iteration with single-sided updates. (Under review).\n\n[7] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44, 1988.\n\n[8] G. J. Tesauro. Practical issues in temporal difference learning. Machine Learning, 8(3/4):257-277, May 1992.\n\n[9] C. J. C. H. Watkins. Learning from delayed rewards. PhD thesis, Cambridge University, Cambridge, England, 1989.\n\n[10] R. J. Williams and L. C. Baird.  
Analysis of some incremental variants of policy iteration: First steps toward understanding actor-critic learning systems. Technical Report NU-CCS-93-11, Northeastern University, College of Computer Science, Boston, MA 02115, September 1993.\n", "award": [], "sourceid": 773, "authors": [{"given_name": "Vijaykumar", "family_name": "Gullapalli", "institution": null}, {"given_name": "Andrew", "family_name": "Barto", "institution": null}]}