{"title": "The Power of Amnesia", "book": "Advances in Neural Information Processing Systems", "page_first": 176, "page_last": 183, "abstract": null, "full_text": "The Power of Amnesia\n\nDana Ron, Yoram Singer, Naftali Tishby\n\nInstitute of Computer Science and Center for Neural Computation\n\nHebrew University, Jerusalem 91904, Israel\n\nAbstract\n\nWe propose a learning algorithm for a variable memory length Markov process. Human communication, whether given as text, handwriting, or speech, has multiple characteristic time scales. On short scales it is characterized mostly by the dynamics that generate the process, whereas on large scales, more syntactic and semantic information is carried. For that reason the conventionally used fixed memory Markov models cannot effectively capture the complexity of such structures. On the other hand, using long memory models uniformly is not practical even for a memory as short as four. The algorithm we propose is based on minimizing the statistical prediction error by extending the memory, or state length, adaptively, until the total prediction error is sufficiently small. We demonstrate the algorithm by learning the structure of natural English text and applying the learned model to the correction of corrupted text. Using fewer than 3000 states, the model's performance is far superior to that of fixed memory models with a similar number of states. We also show how the algorithm can be applied to intergenic E. coli DNA base prediction with results comparable to HMM based methods.\n\n1 Introduction\n\nMethods for automatically acquiring the structure of the human language are attracting increasing attention. 
One of the main difficulties in modeling the natural language is its multiple temporal scales. As has been known for many years, the language is far more complex than any finite memory Markov source. Yet Markov models are powerful tools that capture the short scale statistical behavior of language, whereas long memory models are generally impossible to estimate. The obvious desired solution is a Markov source with a 'deep' memory just where it is really needed. Variable memory length Markov models have been in use for language modeling in speech recognition for some time [3, 4], yet no systematic derivation, nor rigorous analysis of such a learning mechanism, has been proposed.\n\nMarkov models are a natural candidate for language modeling and temporal pattern recognition, mostly due to their mathematical simplicity. It is nevertheless obvious that finite memory Markov models cannot in any way capture the recursive nature of the language, nor can they be trained effectively with long enough memory. The notion of a variable length memory seems to appear naturally also in the context of universal coding [6]. This information theoretic notion is now known to be closely related to efficient modeling [7]. The natural measure that appears in information theory is the description length, as measured by the statistical predictability via the Kullback-Leibler (KL) divergence.\n\nThe algorithm we propose here is based on optimizing the statistical prediction of a Markov model, measured by the instantaneous KL divergence of the following symbols, or by the current statistical surprise of the model. 
The memory is extended precisely when such a surprise is significant, until the overall statistical prediction of the stochastic model is sufficiently good. We apply this algorithm successfully for statistical language modeling. Here we demonstrate its ability for spelling correction of corrupted English text. We also show how the algorithm can be applied to intergenic E. coli DNA base prediction with results comparable to HMM based methods.\n\n2 Prediction Suffix Trees and Finite State Automata\n\nDefinitions and Notations\n\nLet $\\Sigma$ be a finite alphabet. Denote by $\\Sigma^*$ the set of all strings over $\\Sigma$. A string $s$ over $\\Sigma$ of length $n$ is denoted by $s = s_1 s_2 \\ldots s_n$. We denote by $e$ the empty string. The length of a string $s$ is denoted by $|s|$ and the size of an alphabet $\\Sigma$ is denoted by $|\\Sigma|$. Let $\\mathrm{Prefix}(s) = s_1 s_2 \\ldots s_{n-1}$ denote the longest prefix of a string $s$, and let $\\mathrm{Prefix}^*(s)$ denote the set of all prefixes of $s$, including the empty string. Similarly, $\\mathrm{Suffix}(s) = s_2 s_3 \\ldots s_n$ and $\\mathrm{Suffix}^*(s)$ is the set of all suffixes of $s$. A set of strings $S$ is called a prefix free set if, $\\forall s^1, s^2 \\in S$: $\\{s^1\\} \\cap \\mathrm{Prefix}^*(s^2) = \\emptyset$. We call a probability measure $P$ over the strings in $\\Sigma^*$ proper if $P(e) = 1$, and for every string $s$, $\\sum_{\\sigma \\in \\Sigma} P(s\\sigma) = P(s)$. Hence, for every prefix free set $S$, $\\sum_{s \\in S} P(s) \\le 1$, and specifically for every integer $n \\ge 0$, $\\sum_{s \\in \\Sigma^n} P(s) = 1$.\n\nPrediction Suffix Trees\n\nA prediction suffix tree $T$ over $\\Sigma$ is a tree of degree $|\\Sigma|$. The edges of the tree are labeled by symbols from $\\Sigma$, such that from every internal node there is at most one outgoing edge labeled by each symbol. 
The nodes of the tree are labeled by pairs $(s, \\gamma_s)$ where $s$ is the string associated with the walk starting from that node and ending in the root of the tree, and $\\gamma_s : \\Sigma \\to [0,1]$ is the output probability function related with $s$ satisfying $\\sum_{\\sigma \\in \\Sigma} \\gamma_s(\\sigma) = 1$. A prediction suffix tree induces probabilities on arbitrarily long strings in the following manner. The probability that $T$ generates a string $w = w_1 w_2 \\ldots w_n$ in $\\Sigma^n$, denoted by $P_T(w)$, is $\\prod_{i=1}^{n} \\gamma_{s^{i-1}}(w_i)$, where $s^0 = e$, and for $1 \\le j \\le n-1$, $s^j$ is the string labeling the deepest node reached by taking the walk corresponding to $w_1 \\ldots w_j$ starting at the root of $T$. By definition, a prediction suffix tree induces a proper measure over $\\Sigma^*$, and hence for every prefix free set of strings $\\{w^1, \\ldots, w^m\\}$, $\\sum_{i=1}^{m} P_T(w^i) \\le 1$, and specifically for $n \\ge 1$, $\\sum_{s \\in \\Sigma^n} P_T(s) = 1$. An example of a prediction suffix tree is depicted in Fig. 1 on the left, where the nodes of the tree are labeled by the corresponding suffix they represent.\n\nFigure 1: Right: A prediction suffix tree over $\\Sigma = \\{0, 1\\}$. The strings written in the nodes are the suffixes the nodes represent. For each node there is a probability vector over the next possible symbols. For example, the probability of observing a '1' after observing the string '010' is 0.3. Left: The equivalent probabilistic finite automaton. Bold edges denote transitions with the symbol '1' and dashed edges denote transitions with '0'. 
The states of the automaton are the leaves of the tree except for the leaf denoted by the string 1, which was replaced by the prefixes of the strings 010 and 110: 01 and 11.\n\nFinite State Automata and Markov Processes\n\nA Probabilistic Finite Automaton (PFA) $A$ is a 5-tuple $(Q, \\Sigma, \\tau, \\gamma, \\pi)$, where $Q$ is a finite set of $n$ states, $\\Sigma$ is an alphabet of size $k$, $\\tau : Q \\times \\Sigma \\to Q$ is the transition function, $\\gamma : Q \\times \\Sigma \\to [0,1]$ is the output probability function, and $\\pi : Q \\to [0,1]$ is the probability distribution over the starting states. The functions $\\gamma$ and $\\pi$ must satisfy the following requirements: for every $q \\in Q$, $\\sum_{\\sigma \\in \\Sigma} \\gamma(q, \\sigma) = 1$, and $\\sum_{q \\in Q} \\pi(q) = 1$. The probability that $A$ generates a string $s = s_1 s_2 \\ldots s_n \\in \\Sigma^n$ is $P_A(s) = \\sum_{q^0 \\in Q} \\pi(q^0) \\prod_{i=1}^{n} \\gamma(q^{i-1}, s_i)$, where $q^{i} = \\tau(q^{i-1}, s_i)$.\n\nWe are interested in learning a sub-class of finite state machines which have the following property. Each state in a machine $M$ belonging to this sub-class is labeled by a string of length at most $L$ over $\\Sigma$, for some $L \\ge 0$. The set of strings labeling the states is suffix free. We require that for every two states $q^1, q^2 \\in Q$ and for every symbol $\\sigma \\in \\Sigma$, if $\\tau(q^1, \\sigma) = q^2$ and $q^1$ is labeled by a string $s^1$, then $q^2$ is labeled by a string $s^2$ which is a suffix of $s^1 \\cdot \\sigma$. Since the set of strings labeling the states is suffix free, if there exists a string having this property then it is unique. Thus, in order that $\\tau$ be well defined on a given set of strings $S$, not only must the set be suffix free, but it must also have the property that for every string $s$ in the set and every symbol $\\sigma$, there exists a string in the set which is a suffix of $s\\sigma$. 
For our convenience, from this point on, if $q$ is a state in $Q$ then $q$ will also denote the string labeling that state.\n\nA special case of these automata is the case in which $Q$ includes all $2^L$ strings of length $L$. These automata are known as Markov processes of order $L$. We are interested in learning automata for which the number of states, $n$, is actually much smaller than $2^L$, which means that few states have \"long memory\" and most states have a short one. We refer to these automata as Markov processes with bounded memory $L$. In the case of Markov processes of order $L$, the \"identity\" of the states (i.e. the strings labeling the states) is known and learning such a process reduces to approximating the output probability function. When learning Markov processes with bounded memory, the task of a learning algorithm is much more involved since it must reveal the identity of the states as well.\n\nIt can be shown that under a slightly more complicated definition of prediction suffix trees, and assuming that the initial distribution on the states is the stationary distribution, these two models are equivalent up to a growth in size which is at most linear in $L$. The proof of this equivalence is beyond the scope of this paper, yet the transformation from a prediction suffix tree to a finite state automaton is rather simple. Roughly speaking, in order to implement a prediction suffix tree by a finite state automaton we define the leaves of the tree to be the states of the automaton. If the transition function of the automaton, $\\tau(\\cdot, \\cdot)$, cannot be well defined on this set of strings, we might need to slightly expand the tree and use the leaves of the expanded tree. 
The output probability function of the automaton, $\\gamma(\\cdot, \\cdot)$, is defined based on the prediction values of the leaves of the tree, i.e., for every state (leaf) $s$ and every symbol $\\sigma$, $\\gamma(s, \\sigma) = \\gamma_s(\\sigma)$. The outgoing edges from the states are defined as follows: $\\tau(q^1, \\sigma) = q^2$ where $q^2 \\in \\mathrm{Suffix}^*(q^1 \\sigma)$. An example of a finite state automaton which corresponds to the prediction tree depicted in Fig. 1 on the left is depicted on the right part of the figure.\n\n3 Learning Prediction Suffix Trees\n\nGiven a sample consisting of one sequence of length $l$ or $m$ sequences of lengths $l_1, l_2, \\ldots, l_m$ we would like to find a prediction suffix tree that will have the same statistical properties as the sample and thus can be used to predict the next outcome for sequences generated by the same source. At each stage we can transform the tree into a Markov process with bounded memory. Hence, if the sequence was created by a Markov process, the algorithm will find the structure and estimate the probabilities of the process. The key idea is to iteratively build a prediction tree whose probability measure equals the empirical probability measure calculated from the sample.\n\nWe start with a tree consisting of a single node (labeled by the empty string $e$) and add nodes which we have reason to believe should be in the tree. A node $\\sigma s$ must be added to the tree if it statistically differs from its parent node $s$. A natural measure to check the statistical difference is the relative entropy (also known as the Kullback-Leibler (KL) divergence) [5], between the conditional probabilities $P(\\cdot|s)$ and $P(\\cdot|\\sigma s)$. 
Let $X$ be an observation space and $P_1, P_2$ be probability measures over $X$; then the KL divergence between $P_1$ and $P_2$ is $D_{KL}(P_1 \\| P_2) = \\sum_{x \\in X} P_1(x) \\log \\frac{P_1(x)}{P_2(x)}$. Note that this distance is not symmetric and $P_1$ should be absolutely continuous with respect to $P_2$. In our problem, the KL divergence measures how much additional information is gained by using the suffix $\\sigma s$ for prediction instead of predicting using the shorter suffix $s$. There are cases where the statistical difference is large yet the probability of observing the suffix $\\sigma s$ itself is so small that we can neglect those cases. Hence we weigh the statistical error by the prior probability of observing $\\sigma s$. The statistical error measure in our case is,\n\n$$\\mathrm{Err}(\\sigma s, s) = P(\\sigma s)\\, D_{KL}\\big(P(\\cdot|\\sigma s) \\,\\|\\, P(\\cdot|s)\\big) = P(\\sigma s) \\sum_{\\sigma' \\in \\Sigma} P(\\sigma'|\\sigma s) \\log \\frac{P(\\sigma'|\\sigma s)}{P(\\sigma'|s)} = \\sum_{\\sigma' \\in \\Sigma} P(\\sigma s \\sigma') \\log \\frac{P(\\sigma s \\sigma')}{P(\\sigma'|s)\\, P(\\sigma s)}$$\n\nTherefore, a node $\\sigma s$ is added to the tree if the statistical difference (defined by $\\mathrm{Err}(\\sigma s, s)$) between the node and its parent $s$ is larger than a predetermined accuracy $\\epsilon$. The tree is grown level by level, adding a son of a given leaf in the tree whenever the statistical surprise is large. The problem is that the requirement that a node statistically differs from its parent node is a necessary condition for belonging to the tree, but is not sufficient. The leaves of a prediction suffix tree must differ from their parents (or they are redundant) but internal nodes might not have this property. Therefore, we must continue testing further potential descendants of the leaves in the tree up to depth $L$. 
In order to avoid exponential growth in the number of strings tested, we do not test strings which belong to branches which are reached with small probability. The set of strings tested at each step is denoted by $S$, and can be viewed as a kind of potential 'frontier' of the growing tree $T$. At each stage or when the construction is completed we can produce the equivalent Markov process with bounded memory. The learning algorithm of the prediction suffix tree is depicted in Fig. 2. The algorithm gets two parameters: an accuracy parameter $\\epsilon$ and the maximal order of the process (which is also the maximal depth of the tree) $L$.\n\nThe true source probabilities are not known, hence they should be estimated from the empirical counts of their appearances in the observation sequences. Denote by $\\#s$ the number of times the string $s$ appeared in the observation sequences and by $\\#\\sigma|s$ the number of times the symbol $\\sigma$ appeared after the string $s$. Then, using Laplace's rule of succession, the empirical estimation of the probabilities is,\n\n$$P(s) \\approx \\tilde{P}(s) = \\frac{\\#s + 1}{\\sum_{s' \\in \\Sigma^{|s|}} \\#s' + |\\Sigma|}, \\qquad P(\\sigma|s) \\approx \\tilde{P}(\\sigma|s) = \\frac{\\#\\sigma|s + 1}{\\sum_{\\sigma' \\in \\Sigma} \\#\\sigma'|s + |\\Sigma|}$$\n\n4 A Toy Learning Example\n\nThe algorithm was applied to a 1000 symbol long sequence produced by the automaton depicted top left in Fig. 3. The alphabet was binary. Bold lines in the figure represent transitions with the symbol '0' and dashed lines represent the symbol '1'. The prediction suffix tree is plotted at each stage of the algorithm. 
At the end of the run the corresponding automaton is plotted as well (bottom right). Note that the original automaton and the learned automaton are the same except for small differences in the transition probabilities.\n\n\u2022 Initialize the tree $T$ and the candidate strings $S$: $T$ consists of a single root node, and $S \\leftarrow \\{\\sigma \\mid \\sigma \\in \\Sigma \\wedge P(\\sigma) \\ge \\epsilon\\}$.\n\u2022 While $S \\ne \\emptyset$, do the following:\n1. Pick any $s \\in S$ and remove it from $S$.\n2. If $\\mathrm{Err}(s, \\mathrm{Suffix}(s)) \\ge \\epsilon$ then add to $T$ the node corresponding to $s$ and all the nodes on the path from the deepest node in $T$ (the deepest ancestor of $s$) until $s$.\n3. If $|s| < L$ then for every $\\sigma \\in \\Sigma$, if $P(\\sigma s) \\ge \\epsilon$, add $\\sigma s$ to $S$.\n\nFigure 2: The algorithm for learning a prediction suffix tree.\n\nFigure 3: The original automaton (top left), the instantaneous automata built along the run of the algorithm (left to right and top to bottom), and the final automaton (bottom left).\n\n5 Applications\n\nWe applied the algorithm to the Bible with $L = 30$ and $\\epsilon = 0.001$, which resulted in an automaton having fewer than 3000 states. The alphabet was the English letters and the blank character. The final automaton consists of states that are of length 2, like 'qu' and 'xe', and on the other hand states that are 8 and 9 symbols long, like 'shall be' and 'there was'. This indicates that the algorithm really captures the notion of variable context length prediction, which resulted in a compact yet accurate model. Building a full Markov model in this case is impossible since it requires $|\\Sigma|^L = 27^9$ states. 
Here we demonstrate our algorithm for cleaning corrupted text. A test text (which was taken out of the training sequence) was modified in two different ways. First, stationary noise altered each letter with probability 0.2; then the text was further modified by changing each blank to a random letter. The most probable state sequence was found via dynamic programming. The 'cleaned' observation sequence is the most probable outcome given the knowledge of the error rate. An example of such decoding for these two types of noise is shown in Fig. 4. We also applied the algorithm to intergenic \n\nOriginal Text: \nand  god  called  the  dry  land  earth  and  the  gathering  together  of the  waters  called \nhe  seas  and  god  saw  that  it  was  good  and god said  let  the  earth  bring forth  grass \nthe  herb  yielding seed  and  the  fruit  tree  yielding fruit  after his  kind \nNoisy  text  (1): \nand god  cavsed  the  drxjland earth  ibd shg  gathervng  together oj  the  waters  dlled \nre  seas  aed  god  saw  thctpit  was  good  ann  god  said  let  tae earth  bring forth  gjasb \ntse  hemb  yielpinl peed  and thesfruit  tree  sielxing fzuitnafter  his  kind \nDecoded  text  (1): \nand god  caused  the  dry  land earth  and she  gathering together of the  waters  called \nhe  sees  and god  saw  that it  was  good  and  god said  let  the  earth  bring forth  grass \nthe  memb yielding peed  and the fruit  tree  fielding  fruit  after  his  kind \nNoisy  text  (2): \nandhgodpcilledjthesdryjlandbeasthcandmthelgatceringhlogetherjfytrezaatersoczlled \nxherseasaknddgodbsawwthathitqwasoqoohanwzgodcsaidhletdtheuejrthriringmforth \nbgrasstthexherbyieidingzseedmazdctcybfruitttreeayieidinglfruztbafberihiskind \nDecoded  text  (2): \nand god  called  the  dry  land earth and  the gathering together  of the altars called he \nseasaked  god  saw  that it  was  took  and  
god  said  let  the earthriring forth  grass  the \nherb  yielding seed  and thy  fruit  treescielding fruit  after  his  kind \n\nFigure 4: Cleaning corrupted text using a Markov process with bounded memory.\n\nregions of E. coli DNA, with $L = 20$ and $\\epsilon = 0.0001$. The alphabet is: A, C, T, G. The result of the algorithm is an automaton having 80 states. The names of the states of the final automaton are depicted in Fig. 5. The performance of the model can be compared to other models, such as the HMM based model [8], by calculating the normalized log-likelihood (NLL) over unseen data. The NLL is an empirical measure of the entropy of the source as induced by the model. The NLL of the bounded memory Markov model is about the same as the one obtained by the HMM based model. Yet, the Markov model does not contain the length distribution of the intergenic segments, hence the overall performance of the HMM based model is slightly better. On the other hand, the HMM based model is more complicated and requires manual tuning of its architecture.\n\nA C T G AA AC AT CA CC CT CG TA TC TT TG GA GC GT GG AAC AAT AAG\nACA ATT CAA CAC CAT CAG CCA CCT CCG CTA CTC CTT CGA CGC CGT TAT\nTAG TCA TCT TTA TTG TGC GAA GAC GAT GAG GCA GTA GTC GTT GTG\nGGA GGC GGT AACT CAGC CCAG CCTG CTCA TCAG TCTC TTAA TTGC\nTTGG TGCC GACC GATA GAGC GGAC GGCA GGCG GGTA GGTT GGTG\nCAGCC TTGCA GGCGC GGTTA\n\nFigure 5: The states that constitute the automaton for predicting the next base of intergenic regions in E. coli DNA.\n\n6 Conclusions and Future Research\n\nIn this paper we present a new efficient algorithm for estimating the structure and the transition probabilities of a Markov process with bounded yet variable memory. 
The algorithm, when applied to natural language modeling, results in a compact and accurate model which captures the short term correlations. The theoretical properties of the algorithm will be described elsewhere. In fact, we can prove that a slightly different algorithm constructs a bounded memory Markov process which, with arbitrarily high probability, induces distributions (over $\\Sigma^n$ for $n > 0$) which are very close to those induced by the 'true' Markovian source, in the sense of the KL divergence. This algorithm uses a polynomial size sample and runs in polynomial time in the relevant parameters of the problem. We are also investigating hierarchical models based on these automata which are able to capture multi-scale correlations, and thus can be used to model more of the large scale structure of the natural language.\n\nAcknowledgment\n\nWe would like to thank Lee Giles for providing us with the software for plotting finite state machines, and Anders Krogh and David Haussler for letting us use their E. coli DNA data and for many helpful discussions. Y.S. would like to thank the Clore foundation for its support.\n\nReferences\n\n[1] J.G. Kemeny and J.L. Snell, Finite Markov Chains, Springer-Verlag, 1982.\n[2] Y. Freund, M. Kearns, D. Ron, R. Rubinfeld, R.E. Schapire, and L. Sellie, Efficient Learning of Typical Finite Automata from Random Walks, STOC-93.\n[3] F. Jelinek, Self-Organized Language Modeling for Speech Recognition, 1985.\n[4] A. Nadas, Estimation of Probabilities in the Language Model of the IBM Speech Recognition System, IEEE Trans. on ASSP, Vol. 32, No. 4, pp. 859-861, 1984.\n[5] S. Kullback, Information Theory and Statistics, New York: Wiley, 1959.\n[6] J. Rissanen and G. G. 
Langdon, Universal modeling and coding, IEEE Trans. on Info. Theory, IT-27 (3), pp. 12-23, 1981.\n[7] J. Rissanen, Stochastic complexity and modeling, The Ann. of Stat., 14(3), 1986.\n[8] A. Krogh, S.I. Mian, and D. Haussler, A Hidden Markov Model that finds genes in E. coli DNA, UCSC Tech. Rep. UCSC-CRL-93-16.\n", "award": [], "sourceid": 723, "authors": [{"given_name": "Dana", "family_name": "Ron", "institution": null}, {"given_name": "Yoram", "family_name": "Singer", "institution": null}, {"given_name": "Naftali", "family_name": "Tishby", "institution": null}]}