{"title": "An Apobayesian Relative of Winnow", "book": "Advances in Neural Information Processing Systems", "page_first": 204, "page_last": 210, "abstract": null, "full_text": "An Apobayesian Relative of Winnow \n\nNick Littlestone \n\nNEC Research Institute \n\n4 Independence Way \nPrinceton, NJ 08540 \n\nChris Mesterharm \nNEC Research Institute \n\n4 Independence Way \nPrinceton, NJ 08540 \n\nAbstract \n\nWe study a mistake-driven variant of an on-line Bayesian learning \nalgorithm (similar to one studied by Cesa-Bianchi, Helmbold, \nand Panizza [CHP96]). This variant only updates its state (learns) \non trials in which it makes a mistake. The algorithm makes binary \nclassifications using a linear-threshold classifier and runs in time linear \nin the number of attributes seen by the learner. We have been \nable to show, theoretically and in simulations, that this algorithm \nperforms well under assumptions quite different from those embodied \nin the prior of the original Bayesian algorithm. It can handle \nsituations that we do not know how to handle in linear time with \nBayesian algorithms. We expect our techniques to be useful in \nderiving and analyzing other apobayesian algorithms. \n\n1 Introduction \n\nWe consider two styles of on-line learning. In both cases, learning proceeds in a \nsequence of trials. In each trial, a learner observes an instance to be classified, \nmakes a prediction of its classification, and then observes a label that gives the \ncorrect classification. One style of on-line learning that we consider is Bayesian. \nThe learner uses probabilistic assumptions about the world (embodied in a prior \nover some model class) and data observed in past trials to construct a probabilistic \nmodel (embodied in a posterior distribution over the model class). 
The learner uses \nthis model to make a prediction in the current trial. When the learner is told the \ncorrect classification of the instance, the learner uses this information to update the \nmodel, generating a new posterior to be used in the next trial. \n\nIn the other style of learning that we consider, the attention is on the correctness \nof the predictions rather than on the model of the world. The internal state of the \nlearner is only changed when the learner makes a mistake (when the prediction fails \nto match the label). We call such an algorithm mistake-driven. (Such algorithms are \noften called conservative in the computational learning theory literature.) There is a \nsimple way to derive a mistake-driven algorithm from any on-line learning algorithm \n(we restrict our attention in this paper to deterministic algorithms). The derived \nalgorithm is just like the original algorithm, except that before every trial, it makes \na record of its entire state, and after every trial in which its prediction is correct, \nit resets its state to match the recorded state, entirely forgetting the intervening \ntrial. (Typically this is implemented not by making such a record, but by \nmerely omitting the step that updates the state.) For example, if some algorithm \nkeeps track of the number of trials it has seen, then the mistake-driven version of \nthis algorithm will end up keeping track of the number of mistakes it has made. \nWhether the original or mistake-driven algorithm will do better depends on the task \nand on how the algorithms are evaluated. \n\nWe will start with a Bayesian learning algorithm that we call SBSB and use this \nprocedure to derive a mistake-driven variant, SASB. 
Note that the variant cannot \nbe expected to be a Bayesian learning algorithm (at least in the ordinary sense) \nsince a Bayesian algorithm would make a prediction that minimizes the Bayes risk \nbased on all the available data, and the mistake-driven variant has forgotten quite \na bit. We call such algorithms apobayesian learning algorithms. This name is \nintended to suggest that they are derived from Bayesian learning algorithms, but \nare not themselves Bayesian. Our algorithm SASB is very close to an algorithm \nof [CHP96]. We study its application to different tasks than they do, analyzing its \nperformance when it is applied to linearly separable data as described below. \n\nIn this paper instances will be chosen from the instance space X = {0, 1}^n for some \nn. Thus instances are composed of n boolean attributes. We consider only \ntwo-category classification tasks, with predictions and labels chosen from Y = {0, 1}. \n\nWe obtain a bound on the number of mistakes SASB makes that is comparable to \nbounds for various Winnow family algorithms given in [Lit88, Lit89]. As for those \nalgorithms, the bound holds under the assumption that the points labeled 1 are \nlinearly separable from the points labeled 0, and the bound depends on the size δ of \nthe gap between the two classes. (See Section 3 for a definition of δ.) The mistake \nbound for SASB is O((1/δ^2) log(n/δ)). While this bound has an extra factor of log(1/δ) not \npresent in the bounds for the Winnow algorithms, SASB has the advantage of not \nneeding any parameters. The Winnow family algorithms have parameters, and the \nalgorithms' mistake bounds depend on setting the parameters to values that depend \non δ. (Often, the value of δ will not be known by the learner.) 
We expect the \ntechniques used to obtain this bound to be useful in analyzing other apobayesian \nlearning algorithms. \n\nA number of authors have done related research regarding worst-case on-line \nloss bounds including [Fre96, KW95, Vov90]. Simulation experiments involving a \nBayesian algorithm and a mistake-driven variant are described in [Lit95]. That \npaper provides useful background for this paper. Note that our present analysis \ntechniques do not apply to the apobayesian algorithm studied there. The closest of \nthe original Winnow family algorithms to SASB appears to be the Weighted Majority \nalgorithm [LW94], which was analyzed for a case similar to that considered \nin this paper in [Lit89]. One should get a roughly correct impression of SASB if \none thinks of it as a version of the Weighted Majority algorithm that learns its \nparameters. \n\nIn the next section we describe the Bayesian algorithm that we start with. In \nSection 3 we discuss its mistake-driven apobayesian variant. Section 4 mentions \nsome simulation experiments using these algorithms, and Section 5 is the conclusion. \n\n2 A Bayesian Learning Algorithm \n\nTo describe the Bayesian learning algorithm we must specify a family of distributions \nover X x Y and a prior over this family of distributions. We parameterize \nthe distributions with parameters (θ_1, ..., θ_{n+1}) chosen from Θ = [0, 1]^{n+1}. The \nparameter θ_{n+1} gives the probability that the label is 1, and the parameter θ_i gives \nthe probability that the ith attribute matches the label. Note that the probability \nthat the ith attribute is 1 given that the label is 1 equals the probability that the \nith attribute is 0 given that the label is 0. 
We speak of this linkage between the \nprobabilities for the two classes as a symmetry condition. With this linkage, the \nobservation of a point from either class will affect the posterior distribution for both \nclasses. It is perhaps more typical to choose priors that allow the two classes to be \ntreated separately, so that the posterior for each class (giving the probability of elements \nof X conditioned on the label) depends only on the prior and on observations \nfrom that class. The symmetry condition that we impose appears to be important \nto the success of our analysis of the apobayesian variant of this algorithm. (Though \nwe impose this condition to derive the algorithm, it turns out that the apobayesian \nvariant can actually handle tasks where this condition is not satisfied.) \n\nWe choose a prior on Θ that gives probability 1 to the set of all elements \nθ = (θ_1, ..., θ_{n+1}) ∈ Θ for which at most one of θ_1, ..., θ_n does not equal 1/2. \nThe prior is uniform on this set. Note that for any θ in this set only a single attribute \nhas a probability other than 1/2 of matching the label, and thus only a single \nattribute is relevant. Concentrating on this set turns out to lead to an apobayesian \nalgorithm that can, in fact, handle more than one relevant attribute and that performs \nparticularly well when only a small fraction of the attributes are relevant. \n\nThis prior is related to the familiar Naive Bayes model, which also assumes \nthat the attributes are conditionally independent given the labels. However, in the \ntypical Naive Bayes model there is no restriction to a single relevant attribute and \nthe symmetry condition linking the two classes is not imposed. \n\nOur prior leads to the following algorithm. 
(The name SBSB stands for \"Symmetric \nBayesian Algorithm with Singly-variant prior for Bernoulli distribution.\") \n\nAlgorithm SBSB  Algorithm SBSB maintains counts S_i of the number of times \neach attribute matches the label, a count M of the number of times the label is 1, \nand a count t of the number of trials. \n\nInitialization  S_i ← 0 for i = 1, ..., n; M ← 0; t ← 0 \n\nPrediction  Predict 1 given instance (x_1, ..., x_n) if and only if \n\n(M + 1) Σ_{i=1}^{n} [x_i (S_i + 1) + (1 - x_i)(t - S_i + 1)] / C(t, S_i) > (t - M + 1) Σ_{i=1}^{n} [(1 - x_i)(S_i + 1) + x_i (t - S_i + 1)] / C(t, S_i), \n\nwhere C(t, S_i) denotes the binomial coefficient t choose S_i. \n\nUpdate  M ← M + y, t ← t + 1, and for each i, if x_i = y then S_i ← S_i + 1 \n\n3 An Apobayesian Algorithm \n\nWe construct an apobayesian algorithm by converting algorithm SBSB into a \nmistake-driven algorithm using the standard conversion given in the introduction. \nWe call the resulting learning algorithm SASB; we have replaced \"Bayesian\" with \n\"Apobayesian\" in the acronym. \n\nIn the previous section we made assumptions about the generation of the \ninstances and labels that led to SBSB and thence to SASB. These assumptions \nhave served their purpose and we now abandon them. In analyzing the apobayesian \nalgorithm we do not assume that the instances and labels are generated by some \nstochastic process. Instead we assume that the instance-label pairs in all of the \ntrials are linearly separable, that is, that there exist some w_1, ..., w_n, and c such \nthat for every instance-label pair (x, y) we have Σ_{i=1}^{n} w_i x_i ≥ c when y = 1 and \nΣ_{i=1}^{n} w_i x_i ≤ c when y = 0. We actually make a somewhat stronger assumption, \ngiven in the following theorem, which gives our bound for the apobayesian algorithm. \n\nTheorem 1  Suppose that γ_i ≥ 0 and γ̄_i ≥ 0 for i = 1, ..., n, and that Σ_{i=1}^{n} (γ_i + γ̄_i) = 1. \nSuppose that 0 ≤ b_0 < b_1 ≤ 1 and let δ = b_1 - b_0. Suppose that algorithm \nSASB is run on a sequence of trials such that the instance x and label y in each \ntrial satisfy Σ_{i=1}^{n} [γ_i x_i + γ̄_i (1 - x_i)] ≤ b_0 if y = 0 and Σ_{i=1}^{n} [γ_i x_i + γ̄_i (1 - x_i)] ≥ b_1 if \ny = 1. Then the number of mistakes made by SASB will be bounded by O((1/δ^2) log(n/δ)). \n\nWe have space to say only a little about how the derivation of this bound proceeds. \nDetails are given in [Lit96]. \n\nIn analyzing SASB we work with an abstract description of the associated algorithm \nSBSB. This algorithm starts with a prior on Θ as described above. We represent \nthis with a density p_0. Then after each trial it calculates a new posterior density \np_t(θ) = p_{t-1}(θ) P(x, y | θ) / ∫_Θ p_{t-1}(θ') P(x, y | θ') dθ', where p_t is the density after trial t and P(x, y | θ) is the \nconditional probability of the instance x and label y observed in trial t given θ. Thus \nwe can think of the algorithm as maintaining a current distribution on Θ that is \ninitially the prior. SASB is similar, but it leaves the current distribution unchanged \nwhen a mistake is not made. For there to exist a finite mistake bound there must \nexist some possible choice for the current distribution for which SASB would make \nperfect predictions, should it ever arrive at that distribution. We call any such \ndistribution leading to perfect predictions a possible target distribution. It turns out \nthat the separability condition given in Theorem 1 guarantees that a suitable target \ndistribution exists. The analysis proceeds by showing that for an appropriate choice \nof a target density p the relative entropy of the current distribution with respect to \nthe target distribution, ∫ p(θ) log(p(θ) / p_t(θ)) dθ, decreases by at least some amount \nR > 0 whenever a mistake is made. 
Since the relative entropy is never negative, the \nnumber of mistakes is bounded by the initial relative entropy divided by R. This \nform of analysis is very similar to the analysis of the various members of the Winnow \nfamily in [Lit89, Lit91]. \n\nThe same technique can be applied to other apobayesian algorithms. The abstract \nupdate of p_t given above is quite general. The success of the analysis depends on \nconditions on p_0 and P(x, y | θ) that we do not have space here to discuss. \n\n[Figure 1: Comparison of SASB with SBSB. Left panel: p = 0.01, k = 1, n = 20. Right panel: p = 0.1, k = 5, n = 20. Each panel plots cumulative mistakes against trials (0 to 10000) for Optimal, SBSB, SASB, and SASB + voting.] \n\n4 Simulation Experiments \n\nThe bound of the previous section was for perfectly linearly separable data. We \nhave also done some simulation experiments exploring the performance of SASB on \nnon-separable data and comparing it with SBSB and with various other mistake-driven \nalgorithms. A sample comparison of SASB with SBSB is shown in Figure 1. 
In each experimental run we generated 10000 trials with the instances and labels \nchosen randomly according to a distribution specified by θ_1 = ... = θ_k = 1 - p, \nθ_{k+1} = ... = θ_{n+1} = 0.5, where θ_1, ..., θ_{n+1} are interpreted as specified in Section \n2, n is the number of attributes, and n, p, and k are as specified at the top of each \nplot. The line labeled \"optimal\" shows the performance obtained by an optimal \npredictor that knows the distribution used to generate the data ahead of time, and \nthus does not need to do any learning. The lines labeled \"SBSB\" and \"SASB\" show \nthe performance of the corresponding learning algorithms. The lines labeled \"SASB \n+ voting\" show the performance of SASB with the addition of a voting procedure \ndescribed in [Lit95]. This procedure improves the asymptotic mistake rate of the \nalgorithms. Each line on the graph is the average of 30 runs. Each line plots the \ncumulative number of mistakes made by the algorithm from the beginning of the run \nas a function of the number of trials. \n\nIn the left-hand plot, there is only 1 relevant attribute. This is exactly the case that \nSBSB is intended for, and it does better than SASB. In the right-hand plot, there are 5 \nrelevant attributes; SBSB appears unable to take advantage of the extra information \npresent in the extra relevant attributes, but SASB successfully does. \n\nComparison of SASB and previous Winnow family algorithms is still in progress, \nand we defer presenting details until a clearer picture has been obtained. SASB and \nthe Weighted Majority algorithm often perform similarly in simulations. 
Typically, \nas one would expect, the Weighted Majority algorithm does somewhat better than \nSASB when its parameters are chosen optimally for the particular learning task, and \nworse for bad choices of parameters. \n\n5 Conclusion \n\nOur mistake bounds and simulations suggest that SASB may be a useful alternative \nto the existing algorithms in the Winnow family. Based on the analysis style and the \nbounds, SASB should perhaps itself be considered a Winnow family algorithm. Further \nexperiments are in progress comparing SASB with Winnow family algorithms \nrun with a variety of parameter settings. \n\nPerhaps of even greater interest is the potential application of our analytic techniques \nto a variety of other apobayesian algorithms (though as we have observed earlier, \nthe techniques do not appear to apply to all such algorithms). We have already \nobtained some preliminary results regarding an interpretation of the Perceptron \nalgorithm as an apobayesian algorithm. We are interested in looking for entirely \nnew algorithms that can be derived in this way and also in better understanding \nthe scope of applicability of our techniques. All of the analyses that we have looked \nat depend on symmetry conditions relating the probabilities for the two classes. It \nwould be of interest to see what can be said when such symmetry conditions do not \nhold. In simulation experiments [Lit95], a mistake-driven variant of the standard \nNaive Bayes algorithm often does very well, despite the absence of such symmetry \nin the prior that it is based on. 
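
The mistake-driven conversion discussed in this paper can be illustrated with a short program. What follows is a minimal Python sketch under stated assumptions, not the authors' implementation: the class name CountLearner and the helper run_trial are hypothetical, the prediction rule follows the description of SBSB in Section 2 with each attribute weighted by the reciprocal of the binomial coefficient of t and S_i, and exact rational arithmetic is a choice made here to sidestep floating-point underflow.

```python
from fractions import Fraction
from math import comb


class CountLearner:
    '''Count-based state shared by SBSB and SASB (hypothetical name).'''

    def __init__(self, n):
        self.S = [0] * n  # S[i]: updates in which attribute i matched the label
        self.M = 0        # updates in which the label was 1
        self.t = 0        # number of updates absorbed into the state

    def predict(self, x):
        # Weighted evidence for label 1 (lhs) and for label 0 (rhs).
        lhs = Fraction(0)
        rhs = Fraction(0)
        for xi, s in zip(x, self.S):
            weight = Fraction(1, comb(self.t, s))  # assumed weight 1 / C(t, S_i)
            agree = s + 1             # attribute has tended to match the label
            disagree = self.t - s + 1
            lhs += weight * (agree if xi == 1 else disagree)
            rhs += weight * (agree if xi == 0 else disagree)
        return 1 if (self.M + 1) * lhs > (self.t - self.M + 1) * rhs else 0

    def update(self, x, y):
        self.M += y
        self.t += 1
        for i, xi in enumerate(x):
            if xi == y:
                self.S[i] += 1


def run_trial(learner, x, y, mistake_driven):
    '''Run one trial; return the prediction made before seeing the label.

    With mistake_driven=False the state is updated on every trial (SBSB).
    With mistake_driven=True the update is skipped after a correct
    prediction, which is the conversion that yields SASB.
    '''
    y_hat = learner.predict(x)
    if not mistake_driven or y_hat != y:
        learner.update(x, y)
    return y_hat
```

Calling run_trial with mistake_driven=True on a fresh CountLearner gives the apobayesian behavior: the count t then records the number of mistakes rather than the number of trials, as noted in the introduction.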
\n\nOur simulation experiments and also the analysis of the related algorithm Winnow \n[Lit91] suggest that SASB can be expected to handle some instance-label pairs inside \nof the separating gap or on the wrong side, especially if they are not too far on the \nwrong side. In particular it appears to be able to handle data generated according \nto the distributions on which SBSB is based, which do not in general yield perfectly \nseparable data. \n\nIt is of interest to compare the capabilities of the original Bayesian algorithm with \nthe derived apobayesian algorithm. When the data is stochastically generated in a \nmanner consistent with the assumptions behind the original algorithm, the original \nBayesian algorithm can be expected to do better (see, for example, Figure 1). On \nthe other hand, the apobayesian algorithm can handle data beyond the capabilities \nof the original Bayesian algorithm. For example, in the case we consider, the \napobayesian algorithm can take advantage of the presence of more than one relevant \nattribute, even though the prior behind the original Bayesian algorithm assumes a \nsingle relevant attribute. Furthermore, as for all of the Winnow family algorithms, \nthe mistake bound for the apobayesian algorithm does not depend on details of the \nbehavior of the irrelevant attributes (including redundant attributes). \n\nInstead of using the apobayesian variant, one might try to construct a Bayesian \nlearning algorithm for a prior that reflects the actual dependencies among the attributes \nand the labels. However, it may not be clear what the appropriate prior is. \nIt may be particularly unclear how to model the behavior of the irrelevant attributes. \nFurthermore, such a Bayesian algorithm may end up being computationally \nexpensive. 
For example, attempting to keep track of correlations among all pairs \nof attributes may lead to an algorithm that needs time and space quadratic in the \nnumber of attributes. On the other hand, if we start with a Bayesian algorithm that \nuses time and space linear in the number of attributes we can obtain an apobayesian \nalgorithm that still uses linear time and space but that can handle situations beyond \nthe capabilities of the original Bayesian algorithm. \n\nAcknowledgments  This paper has benefited from discussions with Adam Grove. \n\nReferences \n\n[CHP96] Nicolo Cesa-Bianchi, David P. Helmbold, and Sandra Panizza. On Bayes \nmethods for on-line boolean prediction. In Proceedings of the Ninth Annual \nConference on Computational Learning Theory, pages 314-324, 1996. \n\n[Fre96] Yoav Freund. Predicting a binary sequence almost as well as the optimal \nbiased coin. In Proceedings of the Ninth Annual Conference on Computational \nLearning Theory, pages 89-98, 1996. \n\n[KW95] J. Kivinen and M. K. Warmuth. Additive versus exponentiated gradient \nupdates for linear prediction. In Proc. 27th ACM Symp. on Theory of \nComputing, pages 209-218, 1995. \n\n[Lit88] N. Littlestone. Learning quickly when irrelevant attributes abound: A new \nlinear-threshold algorithm. Machine Learning, 2:285-318, 1988. \n\n[Lit89] N. Littlestone. Mistake Bounds and Logarithmic Linear-threshold Learning \nAlgorithms. PhD thesis, Tech. Rept. UCSC-CRL-89-11, Univ. of Calif., \nSanta Cruz, 1989. \n\n[Lit91] N. Littlestone. Redundant noisy attributes, attribute errors, and linear-threshold \nlearning using Winnow. In Proc. 4th Annu. Workshop on Comput. \nLearning Theory, pages 147-156. Morgan Kaufmann, San Mateo, CA, \n1991. \n\n[Lit95] N. Littlestone. 
Comparing several linear-threshold learning algorithms on \ntasks involving superfluous attributes. In Proceedings of the XII International \nConference on Machine Learning, pages 353-361, 1995. \n\n[Lit96] N. Littlestone. Mistake-driven Bayes sports: Bounds for symmetric \napobayesian learning algorithms. Technical report, NEC Research Institute, \nPrinceton, NJ, 1996. \n\n[LW94] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. \nInformation and Computation, 108:212-261, 1994. \n\n[Vov90] Volodimir G. Vovk. Aggregating strategies. In Proceedings of the 1990 \nWorkshop on Computational Learning Theory, pages 371-383, 1990. \n", "award": [], "sourceid": 1194, "authors": [{"given_name": "Nick", "family_name": "Littlestone", "institution": null}, {"given_name": "Chris", "family_name": "Mesterharm", "institution": null}]}