{"title": "Local Bandit Approximation for Optimal Learning Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 1019, "page_last": 1025, "abstract": null, "full_text": "Local Bandit Approximation for Optimal Learning Problems\n\nMichael O. Duff\n\nAndrew G. Barto\n\nDepartment of Computer Science\n\nUniversity of Massachusetts\n\nAmherst, MA 01003\n\n{duff,barto}@cs.umass.edu\n\nAbstract\n\nIn general, procedures for determining Bayes-optimal adaptive controls for Markov decision processes (MDP's) require a prohibitive amount of computation: the optimal learning problem is intractable. This paper proposes an approximate approach in which bandit processes are used to model, in a certain \"local\" sense, a given MDP. Bandit processes constitute an important subclass of MDP's, and have optimal learning strategies (defined in terms of Gittins indices) that can be computed relatively efficiently. Thus, one scheme for achieving approximately-optimal learning for general MDP's proceeds by taking actions suggested by strategies that are optimal with respect to local bandit models. 
\n\n1 INTRODUCTION\n\nWatkins [1989] has defined optimal learning as: \"the process of collecting and using information during learning in an optimal manner, so that the learner makes the best possible decisions at all stages of learning: learning itself is regarded as a multistage decision process, and learning is optimal if the learner adopts a strategy that will yield the highest possible return from actions over the whole course of learning.\"\n\nFor example, suppose a decision-maker is presented with two biased coins (the decision-maker does not know precisely how the coins are biased) and asked to allocate twenty flips between them so as to maximize the number of observed heads. Although the decision-maker is certainly interested in determining which coin has a higher probability of heads, his principal concern is with optimizing performance en route to this determination. An optimal learning strategy typically intersperses \"exploitation\" steps, in which the coin currently thought to have the highest probability of heads is flipped, with \"exploration\" steps in which, on the basis of observed flips, a coin that would be deemed inferior is flipped anyway to further resolve its true potential for turning up heads. The coin-flip problem is a simple example of a (two-armed) bandit problem. A key feature of these problems, and of adaptive control processes in general, is the so-called \"exploration-versus-exploitation trade-off\" (or problem of \"dual control\" [Fel'dbaum, 1965]).\n\nFigure 1: A simple example: dynamics/rewards under (a) action 1 and (b) action 2. (c) The decision problem in hyperstate space.\n\nAs another example, consider the MDP depicted in Figures 1(a) and (b). This is a 2-state/2-action process; transition probabilities label arcs, and quantities within circles denote expected rewards for taking particular actions in particular states. The goal is to assign actions to states so as to maximize, say, the expected infinite-horizon discounted sum of rewards (the value function) over all states. For the case considered in this paper, the transition probabilities are not known. Given that the process is in some state, one action may be optimal with respect to currently-perceived point-estimates of unknown parameters, while another action may result in greater information gain. Optimal learning is concerned with striking a balance between these two criteria. 
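As a hedged illustration of the two-coin problem described above, the following Python sketch computes the Bayes-optimal expected number of heads by dynamic programming over information states. It assumes independent Beta priors on each coin's head probability; the function name bayes_value and the uniform Beta(1, 1) priors used in the usage note are illustrative choices, not taken from the paper.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def bayes_value(a1, b1, a2, b2, flips_left):
    '''Expected number of heads under a Bayes-optimal allocation of flips.

    (a1, b1) and (a2, b2) are the Beta parameters (assumed priors) that
    describe the current belief about each coin's probability of heads.
    '''
    if flips_left == 0:
        return 0.0
    p1 = a1 / (a1 + b1)  # posterior mean probability of heads, coin 1
    p2 = a2 / (a2 + b2)  # posterior mean probability of heads, coin 2
    # Flip coin 1: a head adds 1 and increments a1, a tail increments b1.
    flip1 = (p1 * (1.0 + bayes_value(a1 + 1, b1, a2, b2, flips_left - 1))
             + (1.0 - p1) * bayes_value(a1, b1 + 1, a2, b2, flips_left - 1))
    # Flip coin 2: same recursion applied to the second coin's parameters.
    flip2 = (p2 * (1.0 + bayes_value(a1, b1, a2 + 1, b2, flips_left - 1))
             + (1.0 - p2) * bayes_value(a1, b1, a2, b2 + 1, flips_left - 1))
    return max(flip1, flip2)
```

Calling bayes_value(1, 1, 1, 1, 20) evaluates the twenty-flip problem from uniform priors. The recursion is tractable here only because the horizon is short and there are just two coins; the number of information states grows quickly with both.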
\n\nWhile reinforcement learning approaches have recognized the dual effects of control, at least in the sense that one must occasionally deviate from a greedy policy to ensure a search of sufficient breadth, many exploration procedures appear not to be motivated by real notions of optimal learning; rather, they aspire to be practical schemes for avoiding unrealistic levels of sampling and search that would be required if one were to strictly adhere to the theoretical sufficient conditions for convergence, namely that all state-action pairs must be considered infinitely many times.\n\nIf one is willing to adopt a Bayesian perspective, then the exploration-versus-exploitation issue has already been resolved, in principle. A solution was recognized by Bellman and Kalaba nearly forty years ago [Bellman & Kalaba, 1959]; their dynamic programming algorithm for computing Bayes-optimal policies begins by regarding \"state\" as an ordered pair, or \"hyperstate,\" (x, I), where x is a point in phase-space (Markov-chain state) and I is the \"information pattern,\" which summarizes past history as it relates to modeling the transitional dynamics of x. Computation grows increasingly burdensome with problem size, however, so one is compelled to seek approximate solutions, some of which ignore the effects of information gain entirely. In contrast, the approach suggested in this paper explicitly acknowledges that there is an information-gain component to the optimal learning problem; if certain salient aspects of the value of information can be captured, even approximately, then one may be led to a reasonable method for approximating optimal learning policies. 
\n\nHere is the basic idea behind the approach suggested in this paper: First note that there exists a special class of problems, namely multi-armed bandit problems, in which the information pattern is the sole component of the hyperstate. These special problems have the important feature that their optimal policies can be defined concisely in terms of \"Gittins indices,\" and these indices can be computed in a relatively efficient way. This paper is an attempt to make use of the fact that this special subclass of MDP's has tractably-computable optimal learning strategies. Actions for general MDP's are derived by, first, attaching to a given general MDP in a given state a \"local\" n-armed bandit process that captures some aspect of the value of information gain as well as explicit reward. Indices for the local bandit model can be computed relatively efficiently; the largest index suggests the best action in an optimal-learning sense. The resulting algorithm has a receding-horizon flavor in that a new local-bandit process is constructed after each transition; it makes use of a mean-process model as in some previously-suggested approximation schemes, but here the value of information gain is explicitly taken into account, in part, through index calculations.\n\n2 THE BAYES-BELLMAN APPROACH FOR ADAPTIVE MDP'S\n\nConsider the two-state, two-action process shown in Figure 1, and suppose that one is uncertain about the transition probabilities. If the process is in a given state and an action is taken, then the result is that the process either stays in the state it is in or jumps to the other state; one observes a Bernoulli process with unknown parameter, just as in the coin-flip example. 
But in this case one observes four Bernoulli processes: the result of taking action 1 in state 1, action 1 in state 2, action 2 in state 1, action 2 in state 2. So if the prior probability for staying in the current state, for each of these state-action pairs, is represented by a beta distribution (the appropriate conjugate family of distributions with regard to Bernoulli sampling; i.e., a Bayesian update of a beta prior remains beta), then one may perform dynamic programming in a space of \"hyperstates,\" in which the components are four pairs of parameters specifying the beta distributions describing the uncertainty in the transition probabilities, along with the Markov chain state: $(x, (\\alpha^1_1, \\beta^1_1), (\\alpha^1_2, \\beta^1_2), (\\alpha^2_1, \\beta^2_1), (\\alpha^2_2, \\beta^2_2))$, where for example $(\\alpha^1_1, \\beta^1_1)$ denotes the parameters specifying the beta distribution that represents uncertainty in the transition probability $P^1_{11}$. Figure 1(c) shows part of the associated decision tree; an optimality equation may be written in terms of the hyperstates. MDP's with more than two states pose no special problem (there exists an appropriate generalization of the beta distribution). What is a problem is what Bellman calls the \"problem of the expanding grid\": the number of hyperstates that must be examined grows exponentially with the horizon.\n\nHow does one proceed if one is constrained to practical amounts of computation and is willing to settle for an approximate solution? One could truncate the decision tree at some shorter and more manageable horizon, compute approximate terminal values by replacing the distributions with their means, and proceed with a receding-horizon approach: Starting from the approximate terminal values at the horizon, perform a backward sweep of dynamic programming, computing an optimal policy. 
Take the initial action of the policy, then shift the entire computational window forward one level and repeat. One can imagine a sort of limiting, degenerate version of this receding-horizon approach in which the horizon is zero; that is, use the means of the current distributions to calculate an optimal policy, take an \"optimal\" action, observe a transition, perform a Bayesian modification of the prior, and repeat. This (certainty-equivalence) heuristic was suggested by [Cozzolino et al., 1965], and has recently reappeared in [Dayan & Sejnowski, 1996]. However, as was noted in [Cozzolino et al., 1965], \"...the trade-off between immediate gain and information does not exist in this heuristic. There is no mechanism which explicitly forces unexplored policies to be observed in early stages. Therefore, if it should happen that there is some very good policy which a priori seemed quite bad, it is entirely possible that this heuristic will never provide the information needed to recognize the policy as being better than originally thought...\" This comment and others seem to refer to what is now regarded as a problem of \"identifiability\" associated with certainty-equivalence controllers in which a closed-loop system evolves identically for both true and false values of the unknown parameters; that is, certainty-equivalence control may make some of the unknown parameters invisible to the identification process and lead one to repeatedly choose the wrong action (see [Borkar & Varaiya, 1979], and also Watkins' discussion of \"metastable policies\" in [Watkins, 1989]). 
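The certainty-equivalence heuristic described above can be sketched in Python for the two-state, two-action example. This is a minimal sketch under stated assumptions: the names bayes_update and certainty_equivalent_action, the reward table, the discount factor gamma, and the fixed number of value-iteration sweeps are all hypothetical choices made for illustration, not details from the paper.

```python
def bayes_update(counts, state, action, next_state):
    '''Return new Beta parameters after observing one transition.

    counts maps (state, action) to (alpha, beta), the Beta parameters for
    the probability of staying in the current state under that action.
    '''
    alpha, beta = counts[(state, action)]
    updated = dict(counts)  # copy so the prior hyperstate is left intact
    if next_state == state:
        updated[(state, action)] = (alpha + 1, beta)
    else:
        updated[(state, action)] = (alpha, beta + 1)
    return updated


def certainty_equivalent_action(counts, rewards, state, gamma=0.9, sweeps=500):
    '''Greedy action in the given state for the mean process.

    Transition probabilities are replaced by their Beta means, and the
    resulting two-state MDP is solved by plain value iteration.
    '''
    states, actions = (1, 2), (1, 2)
    stay = {sa: a / (a + b) for sa, (a, b) in counts.items()}
    values = {s: 0.0 for s in states}

    def q_value(s, a):
        other = 3 - s  # the other state of the two-state chain
        p = stay[(s, a)]
        return rewards[(s, a)] + gamma * (p * values[s] + (1.0 - p) * values[other])

    for _ in range(sweeps):
        values = {s: max(q_value(s, a) for a in actions) for s in states}
    return max(actions, key=lambda a: q_value(state, a))
```

Nothing in this rule assigns value to information gain: an action is chosen only through the current posterior means, which is the limitation noted in the quotation from [Cozzolino et al., 1965].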
\n\n3 BANDIT PROBLEMS AND INDEX COMPUTATION\n\nOne basic version of the bandit problem may be described as follows: There are some number of statistically independent reward processes, that is, Markov chains with an imposed reward structure associated with the chain's arcs. At each discrete time-step, a decision-maker selects one of these processes to activate. The activated process yields an immediate reward and then changes state. The other processes remain frozen and yield no reward. The goal is to splice together the individual reward streams into one sequence having maximal expected discounted value.\n\nThe special Cartesian structure of the bandit problem turns out to imply that there are functions that map process-states to scalars (or \"indices\"), such that optimal policies consist simply of activating the task with the largest index. Consider one of the reward processes, let $S$ be its state space, and let $\\mathcal{B}$ be the set of all subsets of $S$. Suppose that $x(k)$ is the state of the process at time $k$ and, for $B \\in \\mathcal{B}$, let $\\tau(B)$ be the number of transitions until the process first enters the set $B$. Let $\\nu(i; B)$ be the expected discounted reward per unit of discounted time starting from state $i$ until the stopping time $\\tau(B)$:\n\n$$\\nu(i; B) = \\frac{E\\left[\\sum_{k=0}^{\\tau(B)-1} \\gamma^k R(x(k)) \\mid x(0) = i\\right]}{E\\left[\\sum_{k=0}^{\\tau(B)-1} \\gamma^k \\mid x(0) = i\\right]},$$\n\nwhere $\\gamma$ denotes the discount factor and $R(x(k))$ the expected reward received at step $k$.\n\nThen the Gittins index of state $i$ for the process under consideration is\n\n$$\\nu(i) = \\max_{B \\in \\mathcal{B}} \\nu(i; B). \\qquad (1)$$\n\n[Gittins & Jones, 1979] shows that the indices may be obtained by solving a set of functional equations. Other algorithms that have been suggested include those by Beale (see the discussion section following [Gittins & Jones, 1979]), [Robinson, 1981], [Varaiya et al., 1985], and [Katehakis & Veinott, 1987]. 
[Duff, 1995] provides a reinforcement learning approach that gradually learns indices through online/model-free interaction with bandit processes. The details of these algorithms would require more space than is available here. The algorithm proposed in the next section makes use of the approach of [Varaiya et al., 1985].\n\n4 LOCAL BANDIT APPROXIMATION AND AN APPROXIMATELY-OPTIMAL LEARNING ALGORITHM\n\nThe most obvious difference between the optimal learning problem for an MDP and the multi-armed bandit problem is that the MDP has a phase-space component (Markov chain state) to its hyperstate. A first step in bandit-based approximation, then, proceeds by \"removing\" this phase-space component. This can be achieved by viewing the process on a time-scale defined by the recurrence time of a given state. That is, suppose the process is in some state, z. In response to some given action, two things can happen: (1) The process can transition, in one time-step, into z again with some immediate reward, or (2) The process can transition into some state that is not z and experience some \"sojourn\" path of states and rewards before returning to z. On a time-scale defined by sojourn-time, one can view the process in a sort of \"state-z-centric\" way (if state z never recurs, then the sojourn-time is \"infinite\" and there is no value-of-information component of the local bandit model to acknowledge). From this perspective, the process appears to have only one state, and is semi-Markov; that is, the time between transitions is a random variable. Some other action taken in state z would give rise to a different sojourn reward process. 
For both processes (sojourn-processes initiated by different actions applied to state z), the sojourn path/reward will depend upon the policy for states encountered along sojourn paths, but suppose that this policy is fixed for the moment. By viewing the original process on a time-scale of sojourn-time, one has effectively collapsed the phase-space component of the hyperstate. The new process has one state, z, and the problem of choosing an action, given that one is uncertain about the transition probabilities, presents itself as a semi-Markov bandit problem.\n\nThe preceding discussion suggests an algorithm for approximately-optimal learning:\n\n(0) Given: the uncertainty in transition probabilities is expressed in terms of sufficient statistics $\\langle \\alpha, \\beta \\rangle$, and the process is currently in state $z_t$.\n\n(1) Compute the optimal policy for the mean process, $\\pi^*[\\bar{P}(\\alpha, \\beta)]$; that is, compute the policy that is optimal for the MDP whose transition probabilities are taken to be the mean values associated with $\\langle \\alpha, \\beta \\rangle$. This defines a nominal (certainty-equivalent) policy for sojourn states.\n\n(2) Construct a local bandit model at state $z_t$; that is, the decision-maker must choose between some number (the number of admissible actions) of sojourn reward processes. This is a semi-Markov multi-armed bandit problem.\n\n(3) Compute the Gittins indices for the local bandit model.\n\n(4) Take the action with the largest index.\n\n(5) Observe a transition to $z_{t+1}$ in the underlying MDP.\n\n(6) Update $\\langle \\alpha, \\beta \\rangle$ accordingly (Bayes update).\n\n(7) Go to step (1).\n\nThe local semi-Markov bandit process associated with state 1 / action 1 for the 2-state example MDP of Figure 1 is shown in Figure 2. 
The sufficient statistics for $P^1_{11}$ are denoted by $(\\alpha, \\beta)$, and $\\frac{\\alpha}{\\alpha+\\beta}$ and $\\frac{\\beta}{\\alpha+\\beta}$ are the expected probabilities for transition into state 1 and state 2, respectively. $\\tau$ and R121 are random variables signifying sojourn time and reward.\n\nFigure 2: A local semi-Markov bandit process associated with state 1 / action 1 for the 2-state example MDP of Figure 1.\n\nThe goal is to compute the index for the root information-state labeled $\\langle \\alpha, \\beta \\rangle$ and to compare it with that computed for a similar diagram associated with the bandit process for taking action 2. The approximately-optimal action is suggested by the process having the largest root-node index. Indices for semi-Markov bandits can be obtained by considering the bandits as Markov, but performing the optimization in Equation 1 over a restricted set of stopping times. The algorithm suggested in [Tsitsiklis, 1993], which in turn makes use of methods described in [Varaiya et al., 1985], proceeds by \"reducing\" the graph through a sequence of node-excisions and modifications of rewards and transition probabilities; [Duff, 1997] details how these steps may be realized for the special semi-Markov processes associated with problems of optimal learning.\n\n5 DISCUSSION\n\nIn summary, this paper has presented the problem of optimal learning, in which a decision-maker is obliged to enjoy or endure the consequences of its actions in quest of the asymptotically-learned optimal policy. A Bayesian formulation of the problem leads to a clear concept of a solution whose computation, however, appears to entail an examination of an intractably-large number of hyperstates. 
This paper has suggested extending the Gittins index approach (which applies with great power and elegance to the special class of multi-armed bandit processes) to general adaptive MDP's. The hope has been that if certain salient features of the value of information could be captured, even approximately, then one could be led to a reasonable method for avoiding certain defects of certainty-equivalence approaches (problems with identifiability, \"metastability\"). Obviously, positive evidence, in the form of empirical results from simulation experiments, would lend support to these ideas; work along these lines is underway.\n\nLocal bandit approximation is but one approximate computational approach for problems of optimal learning and dual control. Most prominent in the literature of control theory is the \"wide-sense\" approach of [Bar-Shalom & Tse, 1976], which utilizes local quadratic approximations about nominal state/control trajectories. For certain problems, this method has demonstrated superior performance compared to a certainty-equivalence approach, but it is computationally very intensive and unwieldy, particularly for problems with controller dimension greater than one.\n\nOne could revert to the view of the bandit problem, or general adaptive MDP, as simply a very large MDP defined over hyperstates, and then consider a somewhat direct approach in which one performs approximate dynamic programming with function approximation over this domain; details of function-approximation, feature-selection, and \"training\" all become important design issues. 
[Duff, 1997] provides further discussion of these topics, as well as a consideration of action-elimination procedures [MacQueen, 1966] that could result in substantial pruning of the hyperstate decision tree.\n\nAcknowledgements\n\nThis research was supported, in part, by the National Science Foundation under grant ECS-9214866 to Andrew G. Barto.\n\nReferences\n\nBar-Shalom, Y. & Tse, E. (1976) Caution, probing and the value of information in the control of uncertain systems. Ann. Econ. Soc. Meas. 5:323-337.\n\nBellman, R. & Kalaba, R. (1959) On adaptive control processes. IRE Trans. 4:1-9.\n\nBorkar, V. & Varaiya, P.P. (1979) Adaptive control of Markov chains I: finite parameter set. IEEE Trans. Auto. Control 24:953-958.\n\nCozzolino, J.M., Gonzalez-Zubieta, R., & Miller, R.L. (1965) Markov decision processes with uncertain transition probabilities. Tech. Rpt. 11, Operations Research Center, MIT.\n\nDayan, P. & Sejnowski, T. (1996) Exploration bonuses and dual control. Machine Learning (in press).\n\nDuff, M.O. (1995) Q-learning for bandit problems. In Machine Learning: Proceedings of the Twelfth International Conference on Machine Learning, pp. 209-217.\n\nDuff, M.O. (1997) Approximate computational methods for optimal learning and dual control. Technical Report, Department of Computer Science, Univ. of Massachusetts, Amherst.\n\nFel'dbaum, A. (1965) Optimal Control Systems. Academic Press.\n\nGittins, J.C. & Jones, D. (1979) Bandit processes and dynamic allocation indices (with discussion). J. R. Statist. Soc. B 41:148-177.\n\nKatehakis, M.H. & Veinott, A.F. (1987) The multi-armed bandit problem: decomposition and computation. Math. OR 12:262-268.\n\nMacQueen, J. 
(1966) A modified dynamic programming method for Markov decision problems. J. Math. Anal. Appl. 14:38-43.\n\nRobinson, D.R. (1981) Algorithms for evaluating the dynamic allocation index. Research Report No. 80/DRR/4, Manchester-Sheffield School of Probability and Statistics.\n\nTsitsiklis, J. (1993) A short proof of the Gittins index theorem. Proc. 32nd Conf. Dec. and Control: 389-390.\n\nVaraiya, P.P., Walrand, J.C., & Buyukkoc, C. (1985) Extensions of the multiarmed bandit problem: the discounted case. IEEE Trans. Auto. Control 30(5):426-439.\n\nWatkins, C. (1989) Learning from Delayed Rewards. Ph.D. Thesis, Cambridge University.\n\n", "award": [], "sourceid": 1230, "authors": [{"given_name": "Michael", "family_name": "Duff", "institution": null}, {"given_name": "Andrew", "family_name": "Barto", "institution": null}]}