{"title": "Reinforcement Learning Based on On-Line EM Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 1052, "page_last": 1058, "abstract": null, "full_text": "Reinforcement Learning based on On-line EM Algorithm\n\nMasa-aki Sato\u2020\n\n\u2020ATR Human Information Processing Research Laboratories\nSeika, Kyoto 619-0288, Japan\nmasaaki@hip.atr.co.jp\n\nShin Ishii\u2021\u2020\n\n\u2021Nara Institute of Science and Technology\nIkoma, Nara 630-0101, Japan\nishii@is.aist-nara.ac.jp\n\nAbstract\n\nIn this article, we propose a new reinforcement learning (RL) method based on an actor-critic architecture. The actor and the critic are approximated by Normalized Gaussian Networks (NGnet), which are networks of local linear regression units. The NGnet is trained by the on-line EM algorithm proposed in our previous paper. We apply our RL method to the task of swinging-up and stabilizing a single pendulum and the task of balancing a double pendulum near the upright position. The experimental results show that our RL method can be applied to optimal control problems having continuous state/action spaces and that the method achieves good control with a small number of trial-and-error episodes.\n\n1 INTRODUCTION\n\nReinforcement learning (RL) methods (Barto et al., 1990) have been successfully applied to various Markov decision problems having finite state/action spaces, such as the backgammon game (Tesauro, 1992) and a complex task in a dynamic environment (Lin, 1992). On the other hand, applications to continuous state/action problems (Werbos, 1990; Doya, 1996; Sofge & White, 1992) are much more difficult than the finite state/action cases. Good function approximation methods and fast learning algorithms are crucial for successful applications. 
\n\nIn this article, we propose a new RL method that has the above-mentioned two features. This method is based on an actor-critic architecture (Barto et al., 1983), although the detailed implementations of the actor and the critic are quite different from those in the original actor-critic model. The actor and the critic in our method estimate a policy and a Q-function, respectively, and are approximated by Normalized Gaussian Networks (NGnet) (Moody & Darken, 1989). The NGnet is a network of local linear regression units. The model softly partitions the input space by using normalized Gaussian functions, and each local unit linearly approximates the output within its partition. As pointed out by Sutton (1996), local models such as the NGnet are more suitable than global models such as multi-layered perceptrons for avoiding serious learning interference in on-line RL processes. The NGnet is trained by the on-line EM algorithm proposed in our previous paper (Sato & Ishii, 1998). It was shown that this on-line EM algorithm is faster than a gradient descent algorithm. In the on-line EM algorithm, the positions of the local units can be adjusted according to the input and output data distribution. Moreover, unit creation and unit deletion are performed according to the data distribution. Therefore, the model can be adapted to dynamic environments in which the input and output data distribution changes with time (Sato & Ishii, 1998).\n\nWe have applied the new RL method to optimal control problems for deterministic nonlinear dynamical systems. The first experiment is the task of swinging-up and stabilizing a single pendulum with a limited torque (Doya, 1996). 
The second experiment is the task of balancing a double pendulum where a torque is applied only to the first pendulum. Our RL method based on the on-line EM algorithm demonstrated good performance in these experiments.\n\n2 NGNET AND ON-LINE EM ALGORITHM\n\nIn this section, we review the on-line EM algorithm for the NGnet proposed in our previous paper (Sato & Ishii, 1998). The NGnet (Moody & Darken, 1989), which transforms an N-dimensional input vector $x$ to a D-dimensional output vector $y$, is defined by the following equations.\n\n$$y = \\sum_{i=1}^{M} \\left( \\frac{G_i(x)}{\\sum_{j=1}^{M} G_j(x)} \\right) (W_i x + b_i) \\tag{1a}$$\n\n$$G_i(x) \\equiv (2\\pi)^{-N/2} |\\Sigma_i|^{-1/2} \\exp\\left[ -\\frac{1}{2} (x - \\mu_i)' \\Sigma_i^{-1} (x - \\mu_i) \\right] \\tag{1b}$$\n\n$M$ denotes the number of units, and the prime (') denotes a transpose. $G_i(x)$ is an N-dimensional Gaussian function, which has an N-dimensional center $\\mu_i$ and an (N x N)-dimensional covariance matrix $\\Sigma_i$. $W_i$ and $b_i$ are a (D x N)-dimensional linear regression matrix and a D-dimensional bias vector, respectively. Subsequently, we use the notations $\\tilde{W}_i \\equiv (W_i, b_i)$ and $\\tilde{x}' \\equiv (x', 1)$.\n\nThe NGnet can be interpreted as a stochastic model, in which a pair of an input and an output, $(x, y)$, is a stochastic event. For each event, a unit index $i \\in \\{1, \\ldots, M\\}$ is assumed to be selected, which is regarded as a hidden variable. The stochastic model is defined by the probability distribution for a triplet $(x, y, i)$, which is called a complete event:\n\n$$P(x, y, i | \\theta) = (2\\pi)^{-(D+N)/2} \\sigma_i^{-D} |\\Sigma_i|^{-1/2} M^{-1} \\exp\\left[ -\\frac{1}{2} (x - \\mu_i)' \\Sigma_i^{-1} (x - \\mu_i) - \\frac{1}{2\\sigma_i^2} (y - \\tilde{W}_i \\tilde{x})^2 \\right]. \\tag{2}$$\n\nHere, $\\theta \\equiv \\{\\mu_i, \\Sigma_i, \\sigma_i^2, \\tilde{W}_i \\mid i = 1, \\ldots, M\\}$ is a set of model parameters. We can easily prove that the expectation value of the output $y$ for a given input $x$, i.e., $E[y|x] \\equiv \\int y P(y|x, \\theta) dy$, is identical to equation (1). Namely, the probability distribution (2) provides a stochastic model for the NGnet. 
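\n\nA brief sketch of this argument, reconstructed here from the definitions above rather than quoted from the source, and taking $G_i(x)$ to be the normalized Gaussian density with center $\\mu_i$ and covariance $\\Sigma_i$: integrating (2) over $y$ gives $P(x, i | \\theta) = M^{-1} G_i(x)$, so that $P(i | x, \\theta) = G_i(x) / \\sum_{j=1}^{M} G_j(x)$, while $y$ given $(x, i)$ is Gaussian with mean $\\tilde{W}_i \\tilde{x} = W_i x + b_i$. Hence\n\n$$E[y|x] = \\sum_{i=1}^{M} P(i | x, \\theta) \\, \\tilde{W}_i \\tilde{x} = \\sum_{i=1}^{M} \\frac{G_i(x)}{\\sum_{j=1}^{M} G_j(x)} (W_i x + b_i),$$\n\nwhich is the NGnet output (1). 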
\n\nFrom a set of $T$ events (observed data) $(X, Y) \\equiv \\{(x(t), y(t)) \\mid t = 1, \\ldots, T\\}$, the model parameter $\\theta$ of the stochastic model (2) can be determined by the maximum likelihood estimation method, in particular, by the EM algorithm (Dempster et al., 1977). The EM algorithm repeats the following E- and M-steps.\n\nE (Estimation) step: Let $\\bar{\\theta}$ be the present estimator. By using $\\bar{\\theta}$, the posterior probability that the i-th unit is selected for $(x(t), y(t))$ is given as\n\n$$P(i | x(t), y(t), \\bar{\\theta}) = P(x(t), y(t), i | \\bar{\\theta}) \\Big/ \\sum_{j=1}^{M} P(x(t), y(t), j | \\bar{\\theta}). \\tag{3}$$\n\nM (Maximization) step: Using the posterior probability (3), the expected log-likelihood $L(\\theta | \\bar{\\theta}, X, Y)$ for the complete events is defined by\n\n$$L(\\theta | \\bar{\\theta}, X, Y) = \\sum_{t=1}^{T} \\sum_{i=1}^{M} P(i | x(t), y(t), \\bar{\\theta}) \\log P(x(t), y(t), i | \\theta). \\tag{4}$$\n\nSince an increase of $L(\\theta | \\bar{\\theta}, X, Y)$ implies an increase of the log-likelihood for the observed data $(X, Y)$ (Dempster et al., 1977), $L(\\theta | \\bar{\\theta}, X, Y)$ is maximized with respect to $\\theta$. A solution of the necessary condition $\\partial L / \\partial \\theta = 0$ is given by (Xu et al., 1995)\n\n$$\\mu_i = \\langle x \\rangle_i(T) / \\langle 1 \\rangle_i(T) \\tag{5a}$$\n\n$$\\Sigma_i^{-1} = \\left[ \\langle x x' \\rangle_i(T) / \\langle 1 \\rangle_i(T) - \\mu_i(T) \\mu_i'(T) \\right]^{-1} \\tag{5b}$$\n\n$$\\tilde{W}_i = \\langle y \\tilde{x}' \\rangle_i(T) \\left[ \\langle \\tilde{x} \\tilde{x}' \\rangle_i(T) \\right]^{-1} \\tag{5c}$$\n\n$$\\sigma_i^2 = \\frac{1}{D} \\left[ \\langle |y|^2 \\rangle_i(T) - \\mathrm{Tr}\\left( \\tilde{W}_i \\langle \\tilde{x} y' \\rangle_i(T) \\right) \\right] \\Big/ \\langle 1 \\rangle_i(T), \\tag{5d}$$\n\nwhere $\\langle \\cdot \\rangle_i$ denotes a weighted mean with respect to the posterior probability (3) and it is defined by\n\n$$\\langle f(x, y) \\rangle_i(T) \\equiv \\frac{1}{T} \\sum_{t=1}^{T} f(x(t), y(t)) P(i | x(t), y(t), \\bar{\\theta}). \\tag{6}$$\n\nThe EM algorithm introduced above is based on batch learning (Xu et al., 1995), namely, the parameters are updated after seeing all of the observed data. We introduce here an on-line version (Sato & Ishii, 1998) of the EM algorithm. Let $\\theta(t)$ be the estimator after the t-th observed datum $(x(t), y(t))$. 
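In this on-line EM algorithm, the weighted mean (6) is replaced by\n\n$$\\langle\\langle f(x, y) \\rangle\\rangle_i(T) \\equiv \\eta(T) \\sum_{t=1}^{T} \\left( \\prod_{s=t+1}^{T} \\lambda(s) \\right) f(x(t), y(t)) P(i | x(t), y(t), \\theta(t-1)). \\tag{7}$$\n\nThe parameter $\\lambda(t) \\in [0, 1]$ is a discount factor, which is introduced for forgetting the effect of earlier inaccurate estimators. $\\eta(T) \\equiv \\left( \\sum_{t=1}^{T} \\prod_{s=t+1}^{T} \\lambda(s) \\right)^{-1}$ is a normalization coefficient and it is iteratively calculated by $\\eta(t) = (1 + \\lambda(t) / \\eta(t-1))^{-1}$. The modified weighted mean $\\langle\\langle \\cdot \\rangle\\rangle_i$ can be obtained by the step-wise equation:\n\n$$\\langle\\langle f(x, y) \\rangle\\rangle_i(t) = \\langle\\langle f(x, y) \\rangle\\rangle_i(t-1) + \\eta(t) \\left[ f(x(t), y(t)) P_i(t) - \\langle\\langle f(x, y) \\rangle\\rangle_i(t-1) \\right], \\tag{8}$$\n\nwhere $P_i(t) \\equiv P(i | x(t), y(t), \\theta(t-1))$. Using the modified weighted mean, the new parameters are obtained by the following equations.\n\n$$\\tilde{\\Lambda}_i(t) = \\frac{1}{1 - \\eta(t)} \\left[ \\tilde{\\Lambda}_i(t-1) - \\frac{P_i(t) \\tilde{\\Lambda}_i(t-1) \\tilde{x}(t) \\tilde{x}'(t) \\tilde{\\Lambda}_i(t-1)}{(1/\\eta(t) - 1) + P_i(t) \\tilde{x}'(t) \\tilde{\\Lambda}_i(t-1) \\tilde{x}(t)} \\right] \\tag{9a}$$\n\n$$\\mu_i(t) = \\langle\\langle x \\rangle\\rangle_i(t) / \\langle\\langle 1 \\rangle\\rangle_i(t) \\tag{9b}$$\n\n$$\\tilde{W}_i(t) = \\tilde{W}_i(t-1) + \\eta(t) P_i(t) (y(t) - \\tilde{W}_i(t-1) \\tilde{x}(t)) \\tilde{x}'(t) \\tilde{\\Lambda}_i(t) \\tag{9c}$$\n\n$$\\sigma_i^2(t) = \\frac{1}{D} \\left[ \\langle\\langle |y|^2 \\rangle\\rangle_i(t) - \\mathrm{Tr}\\left( \\tilde{W}_i(t) \\langle\\langle \\tilde{x} y' \\rangle\\rangle_i(t) \\right) \\right] \\Big/ \\langle\\langle 1 \\rangle\\rangle_i(t), \\tag{9d}$$\n\nHere $\\tilde{\\Lambda}_i(t)$ stands for $[\\langle\\langle \\tilde{x} \\tilde{x}' \\rangle\\rangle_i(t)]^{-1}$, the on-line counterpart of the inverse matrix in (5c); (9a) follows by applying the matrix inversion lemma to the step-wise equation (8).\n\nIt can be proved that this on-line EM algorithm is equivalent to the stochastic approximation for finding the maximum likelihood estimator, if the time course of the discount factor $\\lambda(t)$ is given by\n\n$$\\lambda(t) \\xrightarrow{t \\to \\infty} 1 - (1 - a)/(at + b), \\tag{11}$$\n\nwhere $a$ ($1 > a > 0$) and $b$ are constants (Sato & Ishii, 1998).\n\nWe also employ dynamic unit manipulation mechanisms in order to efficiently allocate the units (Sato & Ishii, 1998). 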
The probability $P(x(t), y(t), i | \\theta(t-1))$ indicates how probable it is that the i-th unit produces the datum $(x(t), y(t))$ with the present parameter $\\theta(t-1)$. If the probability for every unit is less than some threshold value, a new unit is produced to account for the new datum. The weighted mean $\\langle\\langle 1 \\rangle\\rangle_i(t)$ indicates how much the i-th unit has been used to account for the data until $t$. If the mean becomes less than some threshold value, this unit is deleted.\n\nIn order to deal with a singular input distribution, a regularization for $\\Sigma_i^{-1}(t)$ is introduced as follows.\n\n$$\\Sigma_i^{-1}(t) = \\left[ \\left( \\langle\\langle x x' \\rangle\\rangle_i(t) - \\mu_i(t) \\mu_i'(t) \\langle\\langle 1 \\rangle\\rangle_i(t) + \\alpha \\langle\\langle \\Delta_i^2 \\rangle\\rangle_i(t) I_N \\right) \\Big/ \\langle\\langle 1 \\rangle\\rangle_i(t) \\right]^{-1} \\tag{12a}$$\n\n$$\\langle\\langle \\Delta_i^2 \\rangle\\rangle_i(t) = \\left( \\langle\\langle |x|^2 \\rangle\\rangle_i(t) - |\\mu_i(t)|^2 \\langle\\langle 1 \\rangle\\rangle_i(t) \\right) / N, \\tag{12b}$$\n\nwhere $I_N$ is the (N x N)-dimensional identity matrix and $\\alpha$ is a small constant. The corresponding $\\tilde{\\Lambda}_i(t)$ can be calculated in an on-line manner using a similar equation to (9a) (Sato & Ishii, 1998).\n\n3 REINFORCEMENT LEARNING\n\nIn this section, we propose a new RL method based on the on-line EM algorithm described in the previous section. In the following, we consider optimal control problems for deterministic nonlinear dynamical systems having continuous state/action spaces. It is assumed that there is no knowledge of the controlled system. An actor-critic architecture (Barto et al., 1983) is used for the learning system. In the original actor-critic model, the actor and the critic approximated the probability of each action and the value function, respectively, and were trained by using the TD-error. The actor and the critic in our RL method are different from those in the original model as explained later.\n\nFor the current state, $x_c(t)$, of the controlled system, the actor outputs a control signal (action) $u(t)$, which is given by the policy function $\\Omega(\\cdot)$, i.e., $u(t) = \\Omega(x_c(t))$. The controlled system changes its state to $x_c(t+1)$ after receiving the control signal $u(t)$. Subsequently, a reward $r(x_c(t), u(t))$ is given to the learning system. The objective of the learning system is to find the optimal policy function that maximizes the discounted future return defined by\n\n$$V(x_c) \\equiv \\sum_{t=0}^{\\infty} \\gamma^t r(x_c(t), \\Omega(x_c(t))) \\Big|_{x_c(0) = x_c}, \\tag{13}$$\n\nwhere $0 < \\gamma < 1$ is a discount factor. $V(x_c)$, which is called the value function, is defined for the current policy function $\\Omega(\\cdot)$ employed by the actor. The Q-function is defined by\n\n$$Q(x_c, u) \\equiv \\gamma V(x_c(t+1)) + r(x_c(t), u(t)), \\tag{14}$$\n\nwhere $x_c(t) = x_c$ and $u(t) = u$ are assumed. The value function can be obtained from the Q-function:\n\n$$V(x_c) = Q(x_c, \\Omega(x_c)). \\tag{15}$$\n\nThe Q-function should satisfy the consistency condition\n\n$$Q(x_c(t), u(t)) = \\gamma Q(x_c(t+1), \\Omega(x_c(t+1))) + r(x_c(t), u(t)). \\tag{16}$$\n\nIn our RL method, the policy function and the Q-function are approximated by the NGnets, which are called the actor-network and the critic-network, respectively. In the learning phase, a stochastic actor is necessary in order to explore a better policy. For this purpose, we employ a stochastic model defined by (2), corresponding to the actor-network. A stochastic action is generated in the following way. A unit index $i$ is selected randomly according to the conditional probability $P(i | x_c)$ for a given state $x_c$. Subsequently, an action $u$ is generated randomly according to the conditional probability $P(u | x_c, i)$ for a given $x_c$ and the selected $i$. The value function can be defined for either the stochastic policy or the deterministic policy. 
\nSince the controlled system is deterministic, we use the value function defined for the deterministic policy which is given by the actor-network.\n\nThe learning process proceeds as follows. For the current state $x_c(t)$, a stochastic action $u(t)$ is generated by the stochastic model corresponding to the current actor-network. At the next time step, the learning system gets the next state $x_c(t+1)$ and the reward $r(x_c(t), u(t))$. The critic-network is trained by the on-line EM algorithm. The input to the critic-network is $(x_c(t), u(t))$. The target output is given by the right hand side of (16), where the Q-function and the deterministic policy function $\\Omega(\\cdot)$ are calculated using the current critic-network and the current actor-network, respectively. The actor-network is also trained by the on-line EM algorithm. The input to the actor-network is $x_c(t)$. The target output is given by using the gradient of the critic-network (Sofge & White, 1992):\n\n$$\\Omega(x_c(t)) + \\epsilon \\left. \\frac{\\partial Q(x_c(t), u)}{\\partial u} \\right|_{u = \\Omega(x_c(t))}, \\tag{17}$$\n\nwhere the Q-function and the deterministic policy function $\\Omega(\\cdot)$ are calculated using the modified critic-network and the current actor-network, respectively. $\\epsilon$ is a small constant. This target output gives a better action, which increases the Q-function value for the current state $x_c(t)$, than the current deterministic action $\\Omega(x_c(t))$.\n\nIn the above learning scheme, the critic-network and the actor-network are updated concurrently. One can consider another learning scheme. In this scheme, the learning system tries to control the controlled system for a given period of time by using the fixed actor-network. In this period, the critic-network is trained to estimate the Q-function for the fixed actor-network. The state trajectory in this period is saved. 
\nAt the next stage, the actor-network is trained along the saved trajectory using the critic-network modified in the first stage.\n\n4 EXPERIMENTS\n\nThe first experiment is the task of swinging-up and stabilizing a single pendulum with a limited torque (Doya, 1996). The state of the pendulum is represented by $x_c = (\\dot{\\varphi}, \\varphi)$, where $\\varphi$ and $\\dot{\\varphi}$ denote the angle from the upright position and the angular velocity of the pendulum, respectively. The reward $r(x_c(t), u(t))$ is assumed to be given by $f(x_c(t+1))$, where\n\n$$f(x_c) = \\exp\\left( -\\dot{\\varphi}^2 / (2 v_1^2) - \\varphi^2 / (2 v_2^2) \\right). \\tag{18}$$\n\n$v_1$ and $v_2$ are constants. The reward (18) encourages the pendulum to stay high. After releasing the pendulum from a vicinity of the upright position, the control and the learning process of the actor-critic network is conducted for 7 seconds. This is a single episode. The reinforcement learning is done by repeating these episodes. After 40 episodes, the system is able to make the pendulum achieve an upright position from almost every initial state. Even from a low initial position, the system swings the pendulum several times and stabilizes it at the upright position. Figure 1 shows a control process, i.e., stroboscopic time-series of the pendulum, using the deterministic policy after training. According to our previous experiment, in which both of the actor- and critic-networks are the NGnets with fixed centers trained by the gradient descent algorithm, a good control was obtained after about 2000 episodes. Therefore, our new RL method is able to obtain a good control much faster than that based on the gradient descent algorithm.\n\nThe second experiment is the task of balancing a double pendulum near the upright position. A torque is applied only to the first pendulum. 
The state of the pendulum is represented by $x_c = (\\dot{\\varphi}_1, \\dot{\\varphi}_2, \\varphi_1, \\varphi_2)$, where $\\varphi_1$ and $\\varphi_2$ are the first pendulum's angle from the upright direction and the second pendulum's angle from the first pendulum's direction, respectively. $\\dot{\\varphi}_1$ ($\\dot{\\varphi}_2$) is the angular velocity of the first (second) pendulum. The reward is given by the height of the second pendulum's end from the lowest position. After 40 episodes, the system is able to stabilize the double pendulum. Figure 2 shows the control process using the deterministic policy after training. The upper two figures show stroboscopic time-series of the pendulum. The dashed, dotted, and solid lines in the bottom figure denote $\\varphi_1/\\pi$, $\\varphi_2/\\pi$, and the control signal $u$ produced by the actor-network, respectively. After a transient period, the pendulum is successfully controlled to stay near the upright position.\n\nThe numbers of units in the actor- (critic-) networks after training are 50 (109) and 96 (121) for the single and double pendulum cases, respectively. The RL method using center-fixed NGnets trained by the gradient descent algorithm employed 441 ($= 21^2$) actor units and 18,081 ($= 21^2 \\times 41$) critic units, for the single pendulum task. For the double pendulum task, this scheme did not work even when 14,641 ($= 11^4$) actor units and 161,051 ($= 11^4 \\times 11$) critic units were prepared. The numbers of units in the NGnets trained by the on-line EM algorithm scale moderately as the input dimension increases.\n\n5 CONCLUSION\n\nIn this article, we proposed a new RL method based on the on-line EM algorithm. We showed that our RL method can be applied to the task of swinging-up and stabilizing a single pendulum and the task of balancing a double pendulum near the upright position. The number of trial-and-error episodes needed to achieve good control was found to be very small in the two tasks. In order to apply an RL method to continuous state/action problems, good function approximation methods and fast learning algorithms are crucial. The experimental results showed that our RL method has both features.\n\nReferences\n\nBarto, A. G., Sutton, R. S., & Anderson, C. W. (1983). IEEE Transactions on Systems, Man, and Cybernetics, 13, 834-846.\nBarto, A. G., Sutton, R. S., & Watkins, C. J. C. H. (1990). Learning and Computational Neuroscience: Foundations of Adaptive Networks (pp. 539-602), MIT Press.\nDempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Journal of Royal Statistical Society B, 39, 1-22.\nDoya, K. (1996). Advances in Neural Information Processing Systems 8 (pp. 1073-1079), MIT Press.\nLin, L. J. (1992). Machine Learning, 8, 293-321.\nMoody, J., & Darken, C. J. (1989). Neural Computation, 1, 281-294.\nSato, M., & Ishii, S. (1998). ATR Technical Report, TR-H-243, ATR.\nSofge, D. A., & White, D. A. (1992). Handbook of Intelligent Control (pp. 259-282), Van Nostrand Reinhold.\nSutton, R. S. (1996). Advances in Neural Information Processing Systems 8 (pp. 1038-1044), MIT Press.\nTesauro, G. J. (1992). Machine Learning, 8, 257-278.\nWerbos, P. J. (1990). Neural Networks for Control (pp. 67-95), MIT Press.\nXu, L., Jordan, M. I., & Hinton, G. E. (1995). Advances in Neural Information Processing Systems 7 (pp. 633-640), MIT Press.\n\n[Figure panels omitted. Panel title: Time Sequence of Inverted Pendulum. Horizontal axis: Time (sec.).]\n\nFigure 1\n\nFigure 2\n", "award": [], "sourceid": 1614, "authors": [{"given_name": "Masa-aki", "family_name": "Sato", "institution": null}, {"given_name": "Shin", "family_name": "Ishii", "institution": null}]}