{"title": "Credit Assignment through Time: Alternatives to Backpropagation", "book": "Advances in Neural Information Processing Systems", "page_first": 75, "page_last": 82, "abstract": null, "full_text": "Credit Assignment through Time: Alternatives to Backpropagation \n\nYoshua Bengio * \nDept. Informatique et Recherche Operationnelle \nUniversite de Montreal \nMontreal, Qc H3C-3J7 \n\nPaolo Frasconi \nDip. di Sistemi e Informatica \nUniversita di Firenze \n50139 Firenze (Italy) \n\nAbstract \n\nLearning to recognize or predict sequences using long-term context has many applications. However, practical and theoretical problems arise in training recurrent neural networks to perform tasks in which input/output dependencies span long intervals. Starting from a mathematical analysis of the problem, we consider and compare alternative algorithms and architectures on tasks for which the span of the input/output dependencies can be controlled. Results on the new algorithms show performance qualitatively superior to that obtained with backpropagation. \n\n1 Introduction \n\nRecurrent neural networks can in principle learn to map input sequences to output sequences. Machines that could efficiently learn such tasks would be useful for many applications involving sequence prediction, recognition or production. \n\nHowever, practical difficulties have been reported in training recurrent neural networks to perform tasks in which the temporal contingencies present in the input/output sequences span long intervals. In fact, we can prove that dynamical systems such as recurrent neural networks become increasingly difficult to train with gradient descent as the duration of the dependencies to be captured increases. A mathematical analysis of the problem shows that one of two conditions arises in such systems. 
In the first case, the dynamics of the network allow it to reliably store bits of information (with bounded input noise), but gradients (with respect to an error at a given time step) vanish exponentially fast as one propagates them backward in time. In the second case, the gradients can flow backward but the system is locally unstable and cannot reliably store bits of information in the presence of input noise. \n\n* also with AT&T Bell Labs, Holmdel, NJ 07733 \n\nGiven this problem and the understanding brought by the theoretical analysis, we have explored and compared several alternative algorithms and architectures. Comparative experiments were performed on artificial tasks on which the span of the input/output dependencies can be controlled. In all cases, a duration parameter was varied, with sequence lengths drawn between T/2 and T, to avoid short sequences on which the algorithms could learn much more easily. These tasks require learning to latch, i.e., to store bits of information for arbitrary durations (which may vary from example to example). Such tasks cannot be performed by Time Delay Neural Networks or by recurrent networks whose memories are gradually lost with time constants that are fixed by the parameters of the network. \nOf all the alternatives to gradient descent that we have explored, an approach based on a probabilistic interpretation of a discrete state space, similar to hidden Markov models (HMMs), yielded the most interesting results.
\n\n2 A Difficult Problem of Error Propagation \n\nConsider a non-autonomous discrete-time system with additive inputs, such as a recurrent neural network with a continuous activation function: \n\na_t = M(a_{t-1}) + u_t (1) \n\nand the corresponding autonomous dynamics \n\na_t = M(a_{t-1}) (2) \n\nwhere M is a nonlinear map (which may have tunable parameters such as network weights), and a_t ∈ R^n and u_t ∈ R^m are vectors representing respectively the system state and the external input at time t. \nIn order to latch a bit of state information one wants to restrict the values of the system activity a_t to a subset S of its domain. In this way, it will be possible to later interpret a_t in at least two ways: inside S and outside S. To make sure that a_t remains in such a region, the system dynamics can be chosen such that this region is the basin of attraction of an attractor X (or of an attractor in a sub-manifold or subspace of a_t's domain). To \"erase\" that bit of information, the inputs may push the system activity a_t out of this basin of attraction and possibly into another one. \nIn (Bengio, Simard, & Frasconi, 1994) we show that only two conditions can arise when using hyperbolic attractors to latch bits of information in such a system. Either the system is very sensitive to noise, or the derivatives of the cost at time t with respect to the system activations a_0 converge exponentially to 0 as t increases. This situation is the essential reason for the difficulty in using gradient descent to train a dynamical system to capture long-term dependencies in the input/output sequences.
\n\nA first theorem shows that when the state a_t is in a region where |M'| > 1, small perturbations grow exponentially, which can lead to a loss of the information stored in the dynamics of the system: \n\nTheorem 1 Assume x is a point of R^n such that there exists an open sphere U(x) centered on x for which |M'(z)| > 1 for all z ∈ U(x). Then there exists y ∈ U(x) such that ||M(x) - M(y)|| > ||x - y||. \n\nA second theorem shows that when the state a_t is in a region where |M'| < 1, the gradients propagated backwards in time vanish exponentially fast: \n\nTheorem 2 If the input u_t is such that the system remains robustly latched (|M'(a_t)| < 1) on attractor X after time 0, then ∂a_t/∂a_0 → 0 as t → ∞. \n\nSee proofs in (Bengio, Simard, & Frasconi, 1994). A consequence of these results is that it is generally very difficult to train a parametric dynamical system (such as a recurrent neural network) to learn long-term dependencies using gradient descent. Based on the understanding brought by this analysis, we have explored and compared several alternative algorithms and architectures. \n\n3 Global Search Methods \n\nGlobal search methods such as simulated annealing can be applied to this problem, but they are generally very slow. We implemented the simulated annealing algorithm presented in (Corana, Marchesi, Martini, & Ridella, 1987) for optimizing functions of continuous variables. This is a \"batch learning\" algorithm (updating parameters after all examples of the training set have been seen). It performs a cycle of random moves, each along one coordinate (parameter) direction. Each point is accepted or rejected according to the Metropolis criterion (Kirkpatrick, Gelatt, & Vecchi, 1983). The simulated annealing algorithm is very robust with respect to local minima and long plateaus.
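A minimal sketch of one such annealing cycle (the quadratic cost, fixed step size and geometric cooling schedule below are our illustrative choices; the Corana et al. procedure additionally adapts a step size per coordinate, omitted here):

```python
import math
import random

def anneal_cycle(x, cost, temperature, step, rng):
    """One cycle of random moves, one along each coordinate direction,
    each accepted or rejected by the Metropolis criterion."""
    for i in range(len(x)):
        candidate = list(x)
        candidate[i] += rng.uniform(-step, step)
        delta = cost(candidate) - cost(x)
        # Always accept downhill moves; accept uphill with prob exp(-delta/T).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            x = candidate
    return x

rng = random.Random(0)
cost = lambda v: sum(vi ** 2 for vi in v)  # toy cost to minimize
x, temperature = [5.0, -3.0], 1.0
for _ in range(200):
    x = anneal_cycle(x, cost, temperature, 0.5, rng)
    temperature *= 0.98  # slow geometric cooling
```

The occasional uphill acceptance is what lets the method escape local minima and cross plateaus, at the price of many function evaluations.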
Another global search method evaluated in our experiments is a multi-grid random search. The algorithm tries random points around the current solution (within a hyperrectangle of decreasing size) and accepts only those that reduce the error. Thus it is resistant to problems of plateaus but not as resistant to problems of local minima. Indeed, we found the multi-grid random search to be much faster than simulated annealing but to fail on the parity problem, probably because of local minima. \n\n4 Time Weighted Pseudo-Newton \n\nThe time-weighted pseudo-Newton algorithm uses second order derivatives of the cost with respect to each of the instantiations of a weight at different time steps to try to correct for the vanishing gradient problem. The weight update for a weight w_i is computed as follows: \n\nΔw_i(p) = -η Σ_t (∂C(p)/∂w_it) / (|∂²C(p)/∂w_it²| + μ) (3) \n\nwhere w_it is the instantiation for time t of parameter w_i, η is a global learning rate and C(p) is the cost for pattern p. In this way, each (temporal) contribution to Δw_i(p) is weighted by the inverse curvature with respect to w_it. As for the pseudo-Newton algorithm of Becker and Le Cun (1988), we prefer using a diagonal approximation of the Hessian, which is cheap to compute and guaranteed to be positive definite. \nThe constant μ is introduced to prevent Δw from becoming very large (when |∂²C(p)/∂w_it²| is very small). We found the performance of this algorithm to be better than that of the regular pseudo-Newton algorithm, which is better than the simple stochastic backpropagation algorithm, but all of these algorithms perform worse and worse as the length of the sequences is increased. \n\n5 Discrete Error Propagation \n\nThe discrete error propagation algorithm replaces sigmoids in the network by discrete threshold units and attempts to propagate discrete error information backwards in time. 
The basic idea behind the algorithm is that for a simple discrete element such as a threshold unit or a latch, one can write down an error propagation rule that prescribes desired changes in the values of the inputs in order to obtain certain changes in the values of the outputs. In the case of a threshold unit, such a rule assumes that the desired change for the output of the unit is discrete (+2, 0 or -2). However, error information propagated backwards to such a unit might have a continuous value. A stochastic process is used to convert this continuous value into an appropriate discrete desired change. In the case of a self-loop, a clear advantage of this algorithm over gradient back-propagation through sigmoid units is that the error information does not vanish as it is repeatedly propagated backwards in time around the loop, even though the unit can robustly store a bit of information. Details of the algorithm will appear in (Bengio, Simard, & Frasconi, 1994). This algorithm performed better than the time-weighted pseudo-Newton, pseudo-Newton and back-propagation algorithms, but the learning curve appeared very irregular, suggesting that the algorithm is doing a local random search. \n\n6 An EM Approach to Target Propagation \n\nThe most promising of the algorithms we studied was derived from the idea of propagating targets instead of gradients. For this paper we restrict ourselves to sequence classification. We assume a finite-state learning system with the state q_t at time t taking on one of n values. Different final states for each class are used as targets. The system is given a probabilistic interpretation and we assume a Markovian conditional independence model. As in HMMs, the system propagates forward a discrete distribution over the n states. Transitions may be constrained so that each state j has a defined set of successors S_j.
\n\nFigure 1: The proposed architecture \n\nLearning is formulated as a maximum likelihood problem with missing data. Missing variables, over which an expectation is taken, are the paths in state-space. The EM (Expectation/Maximization) or GEM (Generalized EM) algorithms (Dempster, Laird, & Rubin, 1977) can be used to help decouple the influence of different hypothetical paths in state-space. The estimation step of EM requires propagating backward a discrete distribution of targets. In contrast to HMMs, where parameters are adjusted in an unsupervised learning framework, we use EM in a supervised fashion. This new perspective has been successful in training static models (Jordan & Jacobs, 1994). \nTransition probabilities, conditional on the current input, can be computed by a parametric function such as a layer of a neural network with softmax units. We propose a modular architecture with one subnetwork N_j for each state (see Figure 1). Each subnetwork is feedforward, takes as input a continuous vector of features u_t and has one output for each successor state, interpreted as P(q_t = i | q_{t-1} = j, u_t; θ), (j = 1, ..., n, i ∈ S_j). θ is a set of tunable parameters. Using a Markovian assumption, the distribution over states at time t is thus obtained as a linear combination of the outputs of the subnetworks, gated by the previously computed distribution: \n\nP(q_t = i | u_1^t; θ) = Σ_j P(q_{t-1} = j | u_1^{t-1}; θ) P(q_t = i | q_{t-1} = j, u_t; θ) (4) \n\nwhere u_1^t is the subsequence of inputs from time 1 to t inclusively. 
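The forward recursion (4) can be sketched as follows (the single-layer softmax subnetworks, their sizes and the random inputs below are illustrative stand-ins, not the networks used in the experiments):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def forward_step(p_prev, u_t, Ws):
    """One application of equation (4): each subnetwork N_j outputs a
    softmax over successor states, gated by the previous distribution."""
    p = np.zeros(len(p_prev))
    for j, W in enumerate(Ws):
        # Entry i of softmax(W @ u_t) plays the role of P(q_t=i | q_{t-1}=j, u_t).
        p += p_prev[j] * softmax(W @ u_t)
    return p

rng = np.random.default_rng(0)
n_states, n_features = 3, 4
Ws = [rng.normal(size=(n_states, n_features)) for _ in range(n_states)]
p = np.array([1.0, 0.0, 0.0])  # start with all mass in state 0
for _ in range(5):
    p = forward_step(p, rng.normal(size=n_features), Ws)
# p remains a proper distribution over the n states at every step
```

Because each subnetwork's outputs sum to one and the gating weights form a distribution, the mixture in (4) is automatically a valid distribution at every time step.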
The training algorithm looks for parameters θ of the system that maximize the likelihood L of falling in the \"correct\" state at the end of each sequence: \n\nL(θ) = Π_p P(q_{T_p} = q*_p | u_1^{T_p}; θ) (5) \n\nwhere p ranges over training sequences, T_p is the length of the pth training sequence, and q*_p the desired state at time T_p. \nAn auxiliary function Q(θ, θ_k) is constructed by introducing as hidden variables the whole state sequence; hence the complete likelihood function is defined as follows: \n\nL_c(θ) = Π_p P(q_1^{T_p} | u_1^{T_p}; θ) (6) \n\nand \n\nQ(θ, θ_k) = E_{θ_k}[log L_c(θ)] (7) \n\nwhere at the (k+1)th EM (or GEM) iteration, θ_{k+1} is chosen to maximize (or increase) the auxiliary function Q with respect to θ. \nIf the inputs are quantized and the subnetworks perform a simple look-up in a table of probabilities, then the EM algorithm can be used, i.e., ∂Q(θ, θ_k)/∂θ = 0 can be solved analytically. If the networks have non-linearities (e.g., with hidden units and a softmax at their output to constrain the outputs to sum to 1), then one can use the GEM algorithm (which simply increases Q, for example with gradient ascent) or directly perform (preferably stochastic) gradient ascent on the likelihood. \nAn extra term was introduced in the optimization criterion when we found that in many cases the target information would not propagate backwards (or would be diffused over all the states). These experiments confirmed previous results indicating a general difficulty of training fully connected HMMs, with the EM algorithm converging very often to poor local maxima of the likelihood. In an attempt to better understand the phenomenon, we looked at the quantities propagated forward and the quantities propagated backward (representing credit or blame) in the training algorithm. We found a diffusion of credit or blame occurring when the forward maps (i.e. 
the matrix of transition probabilities) at each time step are such that many inputs map to a few outputs, i.e., when the ratio of the volume of a small region in the image of the map to the volume of the corresponding region in the domain is small. This ratio is the absolute value of the determinant of the Jacobian of the map. Hence, using an optimization criterion that incorporates the maximization of the average magnitude of the determinant of the transition matrices, this algorithm performs much better than the other algorithms. Two other tricks were found to be important to help convergence and reduce the problem of diffusion of credit. \nThe first idea is to use whenever possible a structured model with a sparse connectivity matrix, thus introducing some prior knowledge about the state-space. For example, applications of HMMs to speech recognition always rely on such structured topologies. We could reduce connectivity in the transition matrix for the 2-sequence problem (see next section for its definition) by splitting some of the nodes into two subsets, each specializing on one of the sequence classes. However, sometimes it is not possible to introduce such constraints, as in the parity problem. Another trick that drastically improved performance was to use stochastic gradient ascent in a way that helps the training algorithm get out of local optima. The learning rate is decreased when the likelihood improves, but it is increased when the likelihood remains flat (the system is stuck in a plateau or local optimum). \nAs the results in the next section show, the performance obtained with this algorithm is much better than that obtained with the other algorithms on the two simple test problems that were considered. \n\n7 Experimental Results \n\nWe present here results on two problems for which one can control the span of input/output dependencies. 
The 2-sequence problem is the following: classify an input sequence, at the end of the sequence, as one of two types, when only the first N elements (N = 3 in our experiments) of the sequence carry information about the sequence class. Uniform noise is added to the sequence. For the first 6 methods (see Tables 1 to 4), we used a fully connected recurrent network with 5 units (with 25 free parameters). For the EM algorithm, we used a 7-state system with a sparse connectivity matrix (an initial state, and two separate left-to-right submodels of three states each to model the two types of sequences). \nThe parity problem consists of producing the parity of an input sequence of 1's and -1's (i.e., a 1 should be produced at the final output if and only if the number of 1's in the input is odd). The target is only given at the end of the sequence. For the first 6 methods we used a minimal size network (1 input, 1 hidden, 1 output, 7 free parameters). For the EM algorithm, we used a 2-state system with a full connectivity matrix. \nInitial parameters were chosen randomly for each trial. Noise added to the sequence was also uniformly distributed and chosen independently for each training sequence. We considered two criteria: (1) the average classification error at the end of training, i.e., after a stopping criterion has been met (when either some allowed number of function evaluations has been performed or the task has been learned); (2) the average number of function evaluations needed to reach the stopping criterion. \nIn the tables, \"p-n\" stands for pseudo-Newton. Each column corresponds to a value of the maximum sequence length T for a given set of trials. The sequence length for a particular training sequence was picked randomly between T/2 and T. Numbers reported are averages over 20 or more trials. 
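For concreteness, parity training examples of the kind described above can be generated along these lines (a sketch consistent with the description; the exact sampling protocol of the experiments is not reproduced here):

```python
import random

def parity_example(max_len, rng):
    """Sample a +1/-1 sequence whose length is drawn between max_len//2 and
    max_len, with target +1 iff the number of 1's in the sequence is odd."""
    length = rng.randint(max_len // 2, max_len)
    seq = [rng.choice([1, -1]) for _ in range(length)]
    target = 1 if sum(s == 1 for s in seq) % 2 == 1 else -1
    return seq, target

rng = random.Random(0)
examples = [parity_example(20, rng) for _ in range(100)]
```

Drawing the length between T/2 and T, as above, prevents a learner from succeeding only on the short sequences of a training set.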
\n\n8 Conclusion \n\nRecurrent networks and other parametric dynamical systems are very powerful in their ability to represent and use context. However, theoretical and experimental evidence shows the difficulty of assigning credit through many time steps, which is required in order to learn to use and represent context. This paper studies this fundamental problem and proposes alternatives to the backpropagation algorithm to perform such learning tasks. Experiments show these alternative approaches to perform significantly better than gradient descent. The behavior of these algorithms yields a better understanding of the central issue of learning to use context, or assigning credit through many transformations. Although all of the alternative algorithms presented here showed some improvement with respect to standard stochastic gradient descent, the clear winner in our comparison was an algorithm based on the EM algorithm and a probabilistic interpretation of the system dynamics. However, experiments on more challenging tasks will have to be conducted to confirm those results. Furthermore, several extensions of this model are possible, for example allowing both inputs and outputs, with supervision on outputs rather than on states. Finally, similarly to the work we performed for recurrent networks trained with gradient descent, it would be very important to analyze theoretically the problems of propagation of credit encountered in training such Markov models. \n\nAcknowledgements \n\nWe wish to emphatically thank Patrice Simard, who collaborated with us on the analysis of the theoretical difficulties in learning long-term dependencies, and on the discrete error propagation algorithm. \n\nReferences \n\nS. Becker and Y. Le Cun. (1988) Improving the convergence of back-propagation learning with second order methods, Proc. of the 1988 Connectionist Models Summer School, (eds. 
Touretzky, Hinton and Sejnowski), Morgan Kaufmann, pp. 29-37. \nY. Bengio, P. Simard, and P. Frasconi. (1994) Learning long-term dependencies with gradient descent is difficult, IEEE Trans. Neural Networks, (in press). \nA. Corana, M. Marchesi, C. Martini, and S. Ridella. (1987) Minimizing multimodal functions of continuous variables with the simulated annealing algorithm, ACM Transactions on Mathematical Software, vol. 13, no. 3, pp. 262-280. \nA.P. Dempster, N.M. Laird, and D.B. Rubin. (1977) Maximum-likelihood from incomplete data via the EM algorithm, J. of the Royal Stat. Soc., vol. B39, pp. 1-38. \nM.I. Jordan and R.A. Jacobs. (1994) Hierarchical mixtures of experts and the EM algorithm, Neural Computation, (in press). \nS. Kirkpatrick, C.D. Gelatt, and M.P. Vecchi. (1983) Optimization by simulated annealing, Science 220, 4598, pp. 671-680. \n\nTable 1: Final classification error for the 2-sequence problem wrt sequence length (columns correspond to increasing maximum sequence length T; the column headers and the back-prop row are illegible in the source scan) \n\nback-prop: --, --, --, --, -- \np-n: 2, 3, 10, 25, 29 \ntime-weighted p-n: 0, 0, 9, 34, 14 \nmultigrid: 2, 6, 1, 3, 6 \ndiscrete err. prop.: 6, 16, 23, 29, 22 \nsimulated anneal.: 6, 0, 4, 7, 11 \nEM: 0, 0, 0, 0, 0 \n\nTable 2: # sequence presentations for the 2-sequence problem wrt sequence length (columns correspond to increasing maximum sequence length T; the column headers and the back-prop row are illegible in the source scan) \n\nback-prop: --, --, --, --, -- \np-n: 5.1e2, 1.1e3, 1.9e3, 2.6e3, 2.5e3 \ntime-weighted p-n: 5.4e2, 4.3e2, 2.4e3, 2.9e3, 2.7e3 \nmultigrid: 4.1e3, 5.8e3, 2.5e3, 3.9e3, 6.4e3 \ndiscrete err. prop.: 6.6e2, 1.3e3, 2.1e3, 2.1e3, 2.1e3 \nsimulated anneal.: 2.0e5, 3.9e4, 8.2e4, 7.7e4, 4.3e4 \nEM: 3.2e3, 4.0e3, 2.9e3, 3.2e3, 2.9e3 \n\nTable 3: Final classification error for the parity problem wrt sequence length (T = 5, 10, 20, 50, 100, 500; rows as in Table 1) \n\n[table cell values not reliably recoverable from the source scan]
 \n\nTable 4: # sequence presentations for the parity problem wrt sequence length (columns correspond to increasing maximum sequence length T; rows as in Table 1) \n\n[table cell values not reliably recoverable from the source scan] \n", "award": [], "sourceid": 724, "authors": [{"given_name": "Yoshua", "family_name": "Bengio", "institution": null}, {"given_name": "Paolo", "family_name": "Frasconi", "institution": null}]}