{"title": "An Entropic Estimator for Structure Discovery", "book": "Advances in Neural Information Processing Systems", "page_first": 723, "page_last": 729, "abstract": null, "full_text": "An entropic estimator for structure discovery\n\nMatthew Brand\n\nMitsubishi Electric Research Laboratories, 201 Broadway, Cambridge MA 02139\n\nbrand@merl.com\n\nAbstract\n\nWe introduce a novel framework for simultaneous structure and parameter learning in\nhidden-variable conditional probability models, based on an entropic prior and a solution\nfor its maximum a posteriori (MAP) estimator. The MAP estimate minimizes uncertainty\nin all respects: cross-entropy between model and data; entropy of the model; entropy\nof the data's descriptive statistics. Iterative estimation extinguishes weakly supported\nparameters, compressing and sparsifying the model. Trimming operators accelerate this\nprocess by removing excess parameters and, unlike most pruning schemes, guarantee\nan increase in posterior probability. Entropic estimation takes an overcomplete random\nmodel and simplifies it, inducing the structure of relations between hidden and observed\nvariables. Applied to hidden Markov models (HMMs), it finds a concise finite-state\nmachine representing the hidden structure of a signal. We entropically model music,\nhandwriting, and video time-series, and show that the resulting models are highly concise,\nstructured, predictive, and interpretable: Surviving states tend to be highly correlated\nwith meaningful partitions of the data, while surviving transitions provide a low-perplexity\nmodel of the signal dynamics.\n\n1  An entropic prior\n\nIn entropic estimation we seek to maximize the information content of parameters. 
For\nconditional probabilities, parameter values near chance add virtually no information\nto the model, and are therefore wasted degrees of freedom. In contrast, parameters\nnear the extrema {0, 1} are informative because they impose strong constraints on the\nclass of signals accepted by the model. In Bayesian terms, our prior should assert that\nparameters that do not reduce uncertainty are improbable. We can capture this intuition in\na surprisingly simple form: For a model of $N$ conditional probabilities $\\theta = \\{\\theta_1, \\ldots, \\theta_N\\}$\nwe write\n\n$$P_e(\\theta) \\propto e^{-H(\\theta)} = \\exp\\Big[\\sum_i \\theta_i \\log \\theta_i\\Big] = \\prod_i \\theta_i^{\\theta_i} \\qquad (1)$$\n\nwhence we can see that the prior measures a model's freedom from ambiguity ($H(\\theta)$ is an\nentropy measure). Applying $P_e(\\cdot)$ to a multinomial yields the posterior\n\n$$P_e(\\theta|\\omega) = \\frac{P(\\omega|\\theta) P_e(\\theta)}{P(\\omega)} \\propto \\Big[\\prod_i \\theta_i^{\\omega_i}\\Big] \\frac{P_e(\\theta)}{P(\\omega)} \\propto \\prod_i \\theta_i^{\\theta_i + \\omega_i} \\qquad (2)$$\n\nwhere $\\omega_i$ is evidence for event type $i$. With extensive evidence this distribution converges\nto \"fair\" (ML) odds for $\\omega$, but with scant evidence it skews to stronger odds.\n\n1.1  MAP estimator\n\nTo obtain MAP estimates we set the derivative of log-posterior to zero, using Lagrange\nmultipliers to ensure $\\sum_i \\theta_i = 1$,\n\n$$0 = \\frac{\\omega_i}{\\theta_i} + \\log \\theta_i + 1 + \\lambda \\qquad (3)$$\n\nWe obtain $\\theta_i$ by working backward from the Lambert $W$ function, a multi-valued inverse\nfunction satisfying $W(x) e^{W(x)} = x$. Taking logarithms and setting $y = \\log x$,\n\n$$\\begin{aligned} 0 &= -W(x) - \\log W(x) + \\log x \\\\\\\\ &= -W(e^y) - \\log W(e^y) + y \\\\\\\\ &= \\frac{-1}{1/W(e^y)} + \\log \\frac{1}{W(e^y)} + \\log z + y - \\log z \\\\\\\\ &= \\frac{-z}{z/W(e^y)} + \\log \\frac{z}{W(e^y)} + y - \\log z \\end{aligned} \\qquad (4)$$\n\nSetting $\\theta_i = z/W(e^y)$, $y = 1 + \\lambda + \\log z$, and $z = -\\omega_i$, eqn. 4 simplifies to eqn. 3, implying\n\n$$\\theta_i = \\frac{-\\omega_i}{W(-\\omega_i e^{1+\\lambda})} \\qquad (5)$$\n\nEquations 3 and 5 together yield a quickly converging fix-point equation for $\\lambda$ and therefore\nfor the entropic MAP estimate. Solutions lie in the $W_{-1}$ branch of Lambert's function. 
See\n[Brand, 1997] for methods we developed to calculate the little-known $W$ function.\n\n1.2  Interpretation\n\nThe negated log-posterior is equivalent to a sum of entropies:\n\n$$H(\\theta) + D(\\omega \\| \\theta) + H(\\omega) \\qquad (6)$$\n\nMaximizing $P_e(\\theta|\\omega)$ minimizes entropy in all respects: the parameter entropy $H(\\theta)$; the\ncross-entropy $D(\\omega \\| \\theta)$ between the parameters $\\theta$ and the data's descriptive statistics $\\omega$;\nand the entropy of those statistics $H(\\omega)$, which are calculated relative to the structure\nof the model. Equivalently, the MAP estimator minimizes the expected coding length,\nmaking it a maximally efficient compressor of messages consisting of the model and the\ndata coded relative to the model. Since compression involves separating essential from\naccidental structure, this can be understood as a form of noise removal. Noise inflates the\napparent entropy of a sampled process; this systematically biases maximum likelihood\n(ML) estimates toward weaker odds, more so in smaller samples. Consequently, the\nentropic prior is a countervailing bias toward stronger odds.\n\n1.3  Model trimming\n\nBecause the prior rewards sparse models, it is possible to remove weakly supported\nparameters from the model while improving its posterior probability, such that\n$P_e(\\theta \\backslash \\theta_i | X) > P_e(\\theta | X)$. This stands in contrast to most pruning schemes, which typically\ntry to minimize damage to the posterior. Expanding via Bayes rule and taking logarithms\nwe obtain\n\n$$h_i(\\theta_i) > \\log P(X|\\theta) - \\log P(X|\\theta \\backslash \\theta_i) \\qquad (7)$$\n\nwhere $h_i(\\theta_i)$ is the entropy due to $\\theta_i$. For small $\\theta_i$, we can approximate via differentials:\n\n$$\\theta_i \\frac{\\partial H(\\theta)}{\\partial \\theta_i} > \\theta_i \\frac{\\partial \\log P(X|\\theta)}{\\partial \\theta_i} \\qquad (8)$$\n\nBy mixing the left- and right-hand sides of equations 7 and 8, we can easily identify\ntrimmable parameters: those that contribute more to the entropy than the log-likelihood.\nE.g., for multinomials we set $h_i(\\theta_i) = -\\theta_i \\log \\theta_i$ against r.h.s. eqn. 8 and simplify to obtain\n\n$$\\theta_i < \\exp\\Big[-\\frac{\\partial \\log P(X|\\theta)}{\\partial \\theta_i}\\Big] \\qquad (9)$$\n\nParameters can be trimmed at any time during training; at convergence trimming can\nbump the model out of a local probability maximum, allowing further training in a\nlower-dimensional and possibly smoother parameter subspace.\n\n2  Entropic HMM training and trimming\n\nIn entropic estimation of HMM transition probabilities, we follow the conventional E-step,\ncalculating the probability mass for each transition to be used as evidence $\\omega$:\n\n$$\\gamma_{j,i} = \\sum_{t}^{T-1} \\alpha_j(t) \\, P_{i|j} \\, P_i(x_{t+1}) \\, \\beta_i(t+1) \\qquad (10)$$\n\nwhere $P_{i|j}$ is the current estimate of the transition probability from state $j$ to state $i$;\n$P_i(x_{t+1})$ is the output probability of observation $x_{t+1}$ given state $i$, and $\\alpha$, $\\beta$ are obtained\nfrom forward-backward analysis and follow the notation of Rabiner [1989]. For the M-step,\nwe calculate new estimates $\\{P_{i|j}\\}_i = \\theta$ by applying the MAP estimator in \u00a71.1 to\neach $\\omega = \\{\\gamma_{j,i}\\}_i$. That is, $\\omega$ is a vector of the evidence for each kind of transition out of\na single state; from this evidence the MAP estimator calculates probabilities $\\theta$. (In Baum-Welch\nre-estimation, the maximum-likelihood estimator simply sets $P_{i|j} = \\gamma_{j,i} / \\sum_i \\gamma_{j,i}$.)\nIn iterative estimation, e.g., expectation-maximization (EM), the entropic estimator drives\nweakly supported parameters toward zero, skeletonizing the model and concentrating\nevidence on surviving parameters until their estimates converge to near the ML estimate. 
\nTrimming appears to accelerate this process by allowing slowly dying parameters to\nleapfrog to extinction. It also averts numerical underflow errors.\nFor HMM transition parameters, the trimming criterion of eqn. 9 becomes\n\n(11)\n\nwhere $\\gamma_j(t)$ is the probability of state $j$ at time $t$. The multinomial output distributions of a\ndiscrete-output HMM can be entropically re-estimated and trimmed in the same manner.\n\n[Figure 1 plot. Panel title: Entropic versus ML HMM models of Bach chorales. Horizontal axis: states at initialization.]\n\nFigure 1: Left: Sparsification, classification, and prediction superiority of entropically\nestimated HMMs modeling Bach chorales. Lines indicate mean performance over 10\ntrials; error bars are 2 standard deviations. Right: High-probability states and subgraphs of\ninterest from an entropically estimated 35-state chorale HMM. Tones output by each state\nare listed in order of probability. Extraneous arcs have been removed for clarity.\n\n3  Structure learning experiments\n\nTo explore the practical utility of this framework, we will use entropically estimated HMMs\nas a window into the hidden structure of some human-generated time-series.\nBach Chorales: We obtained a dataset of melodic lines from 100 of J.S. Bach's 371\nsurviving chorales from the UCI repository [Merz and Murphy, 1998], and transposed all\ninto the key of C. We compared entropically and conventionally estimated HMMs in\nprediction and classification tasks, training both from identical random initial conditions\nand trying a variety of different initial state-counts. We trained with 90 chorales and\ntested with the remaining 10. In ten trials, all chorales were rotated into the test\nset. 
Figure 1 illustrates that despite substantial loss of parameters to sparsification, the\nentropically estimated HMMs were, on average, better predictors of notes. (Each test\nsequence was truncated to a random length and the HMMs were used to predict the first\nmissing note.) They also were better at discriminating between test chorales and temporally\nreversed test chorales, a challenging task because Bach famously employed melodic reversal as a\ncompositional device. With larger models, parameter-trimming became state-trimming: An\naverage of 1.6 states were \"pinched off\" the 35-state models when all incoming transitions\nwere deleted.\nWhile the conventionally estimated HMMs were wholly uninterpretable, in the entropically\nestimated HMMs one can discern several basic musical structures (figure 1, right),\nincluding self-transitioning states that output only tonic (C-E-G) or dominant (G-B-D)\ntriads, lower- or upper-register diatonic tones (C-D-E or F-G-A-B), and mordents (A-nG-A).\nWe also found chordal state sequences (F-A-C) and states that lead to the tonic (C) via\nthe mediant (E) or the leading tone (B).\nHandwriting: We used 2D Gaussian-output HMMs to analyze handwriting data. Training\ndata, obtained from the UNIPEN web site [Reynolds, 1992], consisted of sequences of\nnormalized pen-position coordinates taken at 5 msec intervals from 10 different individuals\nwriting the digits 0-9. The HMMs were estimated from identical data and initial conditions\n(random upper-diagonal transition matrices; random output parameters). The diagrams\nin Figure 2 depict transition graphs of two HMMs modeling the pen-strokes for the digit\n\"5,\" mapped onto the data. Ellipses indicate each state's output probability iso-contours\n(receptive field); Xs and arcs indicate state dwell and transition probabilities, respectively,\nby their thicknesses. 
Entropic estimation induces an interpretable automaton that captures\nessential structure and timing of the pen-strokes. 50 of the 80 original transition parameters\nwere trimmed. Estimation without the entropic prior results in a wholly opaque model, in\nwhich none of the original dynamical parameters were trimmed. Model concision leads to\nbetter classification: the confusion matrices show cumulative classification error over ten\ntrials with random initializations. Inspection of the parameters for the model in 2b showed\nthat all writers began in states 1 or 2. From there it is possible to follow the state diagram\nto reconstruct the possible sequences of pen-strokes: Some writers start with the cap (state\n1) while others start with the vertical (state 2); all loop through states 3-8 and some return\nto the top (via state 10) to add a horizontal (state 12) or diagonal (state 11) cap.\n\n[Figure 2 panels: a. conventional; b. entropic; c. conventional; d. entropic. Titles on the confusion-matrix panels: Confusion Matrix with 96.0% accuracy; Confusion Matrix with 93.0% accuracy.]\n\nFigure 2: (a & b): State machines of conventionally and entropically estimated hidden\nMarkov models of writing \"5.\" (c & d): Confusion matrices for all digits.\n\nOffice activity: Here we demonstrate a model of human activity learned from medium-\nto long-term ambient video. By activity, we mean spatio-temporal patterns in the pose,\nposition, and movement of one's body. To make the vision tractable, we consider the\nactivity of a single person in a relatively stable visual environment, namely, an office.\nWe track the gross shape and position of the office occupant by segmenting each image\ninto foreground and background pixels. 
Foreground pixels are identified with reference\nto an acquired statistical model of the background texture and camera noise. Their\nensemble properties such as motion or color are modeled via adaptive multivariate\nGaussian distributions, re-estimated in each frame. A single bivariate Gaussian is\nfitted to the foreground pixels and we record the associated ellipse parameters [$\\mathrm{mean}_x$,\n$\\mathrm{mean}_y$, $\\Delta\\mathrm{mean}_x$, $\\Delta\\mathrm{mean}_y$, mass, $\\Delta$mass, elongation, eccentricity]. Sequences of these\nobservation vectors are used to train and test the HMMs.\nApproximately 30 minutes of data were taken at 5 Hz from an SGI IndyCam. Data\nwere collected automatically and at random over several days by a program that started\nrecording whenever someone entered the room after it had been empty 5+ minutes.\nBackgrounds were re-learned during absences to accommodate changes in lighting and\nroom configuration. Prior to training, HMM states were initialized to tile the image\nwith their receptive fields, and transition probabilities were initialized to prefer motion\nto adjoining tiles. Three sequences ranging from 1000 to 1900 frames in length were used\nfor entropic training of 12, 16, 20, 25, and 30-state HMMs.\nEntropic training yielded a substantially sparsified model with an easily interpreted state\nmachine (see figure 3). Grouping of states into activities (done only to improve readability)\nwas performed by adaptive clustering on a proximity matrix which combined Mahalanobis\ndistance and transition probability between states. The labels are the author's description\nof the set of frames claimed by each state cluster during forward-backward analysis of\ntest data. Figure 4 illustrates this analysis, showing frames from a test sequence to which\nspecific states are strongly tuned. 
State 5 (figure 3 right) is particularly interesting: it has a\nvery non-specific receptive field, no self-transition, and an extremely low rate of occupancy.\nInstead of modeling data, it serves to compress the model by summarizing transition\npatterns that are common to several other states. The entropic model has proven to be\nquite superior for segmenting new video into activities and detecting anomalous behavior.\n\n[Figure 3 image. Panel labels: initialization; final model.]\n\nFigure 3: Top: The state machine found by entropic training (left) is easily labeled and\ninterpreted. The state machine found by conventional training (right) is not, being fully\nconnected. Bottom: Transition matrices after (1) initialization, (2) entropic training, (3)\nconventional training, and (4 & 5) entropic training from larger initializations. The top row\nindicates initial probabilities of each state; each subsequent row indicates the transition\nprobabilities out of a state. Color key: \u25cb = 0; \u2022 = 1. The state machines above are\nextracted from 2 & 3. Note that 4 & 5 show the same qualitative structure as 2, but sparser,\nwhile 3 shows almost no structure at all.\n\nFigure 4: Some sample frames assigned high state-specific probabilities by the model. Note\nthat some states are tuned to velocities, hence the difference between states 6 and 11.\n\n4  Related work\n\nHMMs: The literature of structure-learning in HMMs is based almost entirely on\ngenerate-and-test algorithms. These algorithms work by merging [Stolcke and Omohundro, 1994]\nor splitting [Takami and Sagayama, 1991] states, then retraining the model to see if any\nadvantage has been gained. 
Space constraints force us to summarize a recent literature\nreview: There are now more than 20 variations and improvements on these approaches, plus\nsome heuristic constructive algorithms (e.g., [Wolfertstetter and Ruske, 1995]). Though\nthese efforts use a variety of heuristic techniques and priors (including MDL) to avoid\ndetrimental model changes, much of the computation is squandered and reported run-times\noften range from hours to days. Entropic estimation is exact, monotonic, and orders of\nmagnitude faster, taking only slightly longer than standard EM parameter estimation.\nMDL: Description length minimization is typically done via gradient ascent or search via\nmodel comparison; few estimators are known. Rissanen [1989] introduced an estimator for\nbinary fractions, from which Vovk [1995] derived an approximate estimator for Bernoulli\nmodels over discrete sample spaces. It approximates a special case of our exact estimator,\nwhich handles multinomial models in continuous sample spaces. Our approach provides\na unified Bayesian framework for two issues that are often treated separately in MDL:\nestimating the number of parameters and estimating their values.\nMaxEnt: Our prior has different premises and an effect opposite that of the \"standard\"\nMaxEnt prior $e^{-\\alpha D(\\theta \\| \\theta_0)}$. Nonetheless, our prior can be derived via MaxEnt reasoning\nfrom the premise that the expectation of the perplexity over all possible models is finite\n[Brand, 1998]. More colloquially, we almost always expect there to be learnable structure.\nExtensions: For simplicity of exposition (and for results that are independent of model\nclass), we have assumed prior independence of the parameters and taken $H(\\theta)$ to be the\ncombined parameter entropies of the model's component distributions. 
Depending on the\nmodel class, we can also provide variants of eqns. 1-8 for $H(\\theta)$ = conditional entropy or\n$H(\\theta)$ = entropy rate of the model. In Brand [1998] we present entropic MAP estimators\nfor spread and covariance parameters with applications to mixtures-of-Gaussians, radial\nbasis functions, and other popular models. In the same paper we generalize eqns. 1-8\nwith a temperature term, obtaining a MAP estimator that minimizes the free energy of the\nmodel. This folds deterministic annealing into EM, turning it into a quasi-global optimizer.\nIt also provides a workaround for one known limitation of entropy minimization: It is\ninappropriate for learning from data that is atypical of the source process.\nOpen questions: Our framework is currently agnostic w.r.t. two important questions: Is\nthere an optimal trimming policy? Is there a best entropy measure? Other questions\nnaturally arise: Can we use the entropy to estimate the peakedness of the posterior\ndistribution, and thereby judge the appropriateness of MAP models? Can we also directly\nminimize the entropy of the hidden variables, thereby obtaining discriminant training?\n\n5  Conclusion\n\nEntropic estimation is a highly efficient hillclimbing procedure for simultaneously estimating\nmodel structure and parameters. It provides a clean Bayesian framework for minimizing all\nentropies associated with modeling, and an E-MAP algorithm that brings the structure of a\nrandomly initialized model into alignment with hidden structures in the data via parameter\nextinction. The applications detailed here are three of many in which entropically estimated\nmodels have consistently outperformed maximum likelihood models in classification and\nprediction tasks. Most notably, entropic estimation tends to produce interpretable models that shed light on\nthe structure of relations between hidden variables and observed effects. 
\nReferences\n\nBrand, M. (1997). Structure discovery in conditional probability models via an entropic prior and parameter extinction. Neural Computation. To appear; accepted 8/98.\nBrand, M. (1998). Pattern discovery via entropy minimization. To appear in Proc. Artificial Intelligence and Statistics #7.\nMerz, C. and Murphy, P. (1998). UCI repository of machine learning databases.\nRabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286.\nReynolds, D. (1992). Handwritten digit data. UNIPEN web site, http://hwr.nici.kun.nl/unipen/. Donated by HP Labs, Bristol, England.\nRissanen, J. (1989). Stochastic Complexity and Statistical Inquiry. World Scientific.\nStolcke, A. and Omohundro, S. (1994). Best-first model merging for hidden Markov model induction. TR-94-003, International Computer Science Institute, U.C. Berkeley.\nTakami, J.-I. and Sagayama, S. (1991). Automatic generation of the hidden Markov model by successive state splitting on the contextual domain and the temporal domain. TR SP91-88, IEICE.\nVovk, V. G. (1995). Minimum description length estimators under the optimal coding scheme. In Vitanyi, P., editor, Proc. Computational Learning Theory / Europe, pages 237-251. Springer-Verlag.\nWolfertstetter, F. and Ruske, G. (1995). Structured Markov models for speech recognition. In International Conference on Acoustics, Speech, and Signal Processing, volume I, pages 544-7.\n", "award": [], "sourceid": 1522, "authors": [{"given_name": "Matthew", "family_name": "Brand", "institution": null}]}