{"title": "Learning Sparse Perceptrons", "book": "Advances in Neural Information Processing Systems", "page_first": 654, "page_last": 660, "abstract": null, "full_text": "Learning  Sparse Perceptrons \n\nJeffrey  C. Jackson \n\nMathematics &  Computer Science Dept. \n\nDuquesne University \n\n600  Forbes Ave \n\nPittsburgh, PA  15282 \n\njackson@mathcs.duq.edu \n\nMark W. Craven \n\nComputer Sciences  Dept. \n\nUniversity of Wisconsin-Madison \n\n1210 West Dayton St. \nMadison,  WI 53706 \ncraven@cs.wisc.edu \n\nAbstract \n\nWe  introduce  a  new  algorithm  designed  to  learn  sparse  percep(cid:173)\ntrons over input representations which include high-order features. \nOur  algorithm,  which  is  based  on a  hypothesis-boosting  method, \nis  able to PAC-learn a  relatively  natural class  of target concepts. \nMoreover, the algorithm appears to work well in practice:  on a set \nof three  problem  domains,  the algorithm  produces  classifiers  that \nutilize  small  numbers  of features  yet  exhibit  good  generalization \nperformance.  Perhaps most  importantly,  our algorithm generates \nconcept descriptions that are easy for  humans to understand. \n\n1 \n\nIntrod uction \n\nMulti-layer perceptron  (MLP)  learning is  a powerful method for tasks such as con(cid:173)\ncept classification.  However,  in many applications, such as  those that may involve \nscientific discovery, it is crucial to be able to explain predictions.  Multi-layer percep(cid:173)\ntrons are limited in this regard,  since their representations are notoriously difficult \nfor  humans  to  understand.  We  present  an  approach  to learning  understandable, \nyet  accurate,  classifiers.  Specifically,  our  algorithm  constructs  sparse  perceptrons, \ni.e.,  single-layer  perceptrons  that  have  relatively  few  non-zero  weights.  Our  algo(cid:173)\nrithm  for  learning sparse  perceptrons  is  based  on a  new  hypothesis  boosting algo(cid:173)\nrithm  (Freund & Schapire,  1995).  Although our algorithm was  initially  developed \nfrom a learning-theoretic point of view and retains certain theoretical guarantees (it \nPAC-learns the class  of sparse perceptrons), it also works  well  in  practice.  Our ex(cid:173)\nperiments in  a  number of real-world domains indicate that our algorithm produces \nperceptrons that are relatively comprehensible, and that exhibit generalization per(cid:173)\nformance  comparable to that of backprop-trained MLP's  (Rumelhart et al.,  1986) \nand better than decision trees learned using  C4.5  (Quinlan,  1993). \n\n\fLearning  Sparse  Perceptrons \n\n655 \n\nWe contend that sparse perceptrons, unlike MLP's, are comprehensible because they \nhave relatively few  parameters, and each parameter describes a  simple  (Le.  linear) \nrelationship.  As evidence that sparse perceptrons are comprehensible, consider that \nsuch linear functions are commonly used to express domain knowledge in fields  such \nas medicine  (Spackman,  1988)  and molecular biology  (Stormo, 1987). \n\n2  Sparse Perceptrons \n\nA perceptron is  a  weighted threshold over the set of input features and over higher(cid:173)\norder  features  consisting  of functions  operating on  only  a  limited  number  of the \ninput features.  Informally, a  sparse perceptron is  any perceptron that has relatively \nfew  non-zero weights.  For our later theoretical results  we  will  need a  more precise \ndefinition  of sparseness  which  we  develop  now.  
Consider a Boolean function f : {0,1}^n \to {-1,+1}. Let C_k be the set of all conjunctions of at most k of the inputs to f. C_k includes the "conjunction" of 0 inputs, which we take as the identically 1 function. All of the functions in C_k map to {-1,+1}, and every conjunction in C_k occurs in both a positive sense (+1 represents true) and a negated sense (-1 represents true). Then the function f is a k-perceptron if there is some integer s such that f(x) = sign(\sum_{i=1}^{s} h_i(x)), where for all i, h_i \in C_k, and sign(y) is undefined if y = 0 and is y/|y| otherwise. Note that while we have not explicitly shown any weights in our definition of a k-perceptron f, integer weights are implicitly present in that we allow a particular h_i \in C_k to appear more than once in the sum defining f. In fact, it is often convenient to think of a k-perceptron as a simple linear discriminant function with integer weights defined over a feature space with O(n^k) features, one feature for each element of C_k.

We call a given collection of s conjunctions h_i \in C_k a k-perceptron representation of the corresponding function f, and we call s the size of the representation. We define the size of a given k-perceptron function f as the minimal size of any k-perceptron representation of f. An s-sparse k-perceptron is a k-perceptron f such that the size of f is at most s. We denote by P_k^n the set of Boolean functions over {0,1}^n which can be represented as k-perceptrons, and we define P_k = \bigcup_n P_k^n. The subclass of s-sparse k-perceptrons is denoted by P_{k,s}. We are also interested in the class \hat{P}_k^r of k-perceptrons with real-valued weights, at most r of which are non-zero.
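To make the definition concrete, here is a small Python sketch (our illustration, not part of the original formalism) that evaluates a k-perceptron given as a list of signed conjunctions; the particular terms and input are hypothetical examples.

def conjunction(x, sign, indices):
    """One element of C_k on x in {0,1}^n, mapped to {-1,+1}; sign = -1
    gives the negated sense of the conjunction."""
    value = 1 if all(x[i] == 1 for i in indices) else -1
    return sign * value

def k_perceptron(x, terms):
    """f(x) = sign(sum_i h_i(x)); the paper leaves sign(0) undefined, so
    we return 0 in that case."""
    total = sum(conjunction(x, s, idx) for s, idx in terms)
    return 0 if total == 0 else (1 if total > 0 else -1)

# A 3-sparse 2-perceptron over n = 4 inputs (hypothetical):
# h1 = (x0 AND x2), h2 = NOT x1, h3 = x3.
terms = [(+1, (0, 2)), (-1, (1,)), (+1, (3,))]
print(k_perceptron([1, 0, 1, 0], terms))   # h1=+1, h2=+1, h3=-1 -> sum=1 -> +1

Representing each weak feature as a (sign, index-tuple) pair mirrors the convention above that every conjunction occurs in both a positive and a negated sense.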
3 The Learning Algorithm

In this section we develop our learning algorithm and prove certain performance guarantees. Our algorithm is based on a recent "hypothesis boosting" algorithm that we describe after reviewing some basic learning-theory terminology.

3.1 PAC Learning and Hypothesis Boosting

Following Valiant (1984), we say that a function class F (such as P_k for fixed k) is (strongly) PAC-learnable if there is an algorithm A and a polynomial function p_1 such that for any positive \epsilon and \delta, any f \in F (the target function), and any probability distribution D over the domain of f, with probability at least 1 - \delta, algorithm A(EX(f,D), \epsilon, \delta) produces a function h (the hypothesis) such that Pr[Pr_D[f(x) \ne h(x)] > \epsilon] < \delta. The outermost probability is over the random choices made by the EX oracle and any random choices made by A. Here EX(f,D) denotes an oracle that, when queried, chooses a vector of input values x with probability D and returns the pair (x, f(x)) to A. The learning algorithm A must run in time p_1(n, s, \epsilon^{-1}, \delta^{-1}), where n is the length of the input vector to f and s is the size of f; the algorithm is charged one unit of time for each call to EX. We sometimes call the function h output by A an \epsilon-approximator (or strong approximator) to f with respect to D. If F is PAC-learnable by an algorithm A that outputs only hypotheses in class H then we say that F is PAC-learnable by H. If F is PAC-learnable for \epsilon = 1/2 - 1/p_2(n, s), where p_2 is a polynomial function, then F is weakly PAC-learnable, and the output hypothesis h in this case is called a weak approximator.

Our algorithm for finding sparse perceptrons is, as indicated earlier, based on the notion of hypothesis boosting. The specific boosting algorithm we use (Figure 1) is a version of the recent AdaBoost algorithm (Freund & Schapire, 1995). In the next section we apply AdaBoost to "boost" a weak learning algorithm for P_{k,s} into a strong learner for P_{k,s}. AdaBoost is given a set S of m examples of a function f : {0,1}^n \to {-1,+1} and a weak learning algorithm WL which takes \epsilon = 1/2 - \gamma for a given \gamma (\gamma must be bounded below by an inverse polynomial in n and s). AdaBoost runs for T = \ln(m)/(2\gamma^2) stages. At each stage it creates a probability distribution D_i over the training set and invokes WL to find a weak hypothesis h_i with respect to D_i (note that an example oracle EX(f, D_i) can be simulated given D_i and S). At the end of the T stages a final hypothesis h is output; this is just a weighted threshold over the weak hypotheses {h_i | 1 \le i \le T}. If the weak learner succeeds in producing a (1/2 - \gamma)-approximator at each stage then AdaBoost's final hypothesis is guaranteed to be consistent with the training set (Freund & Schapire, 1995).

AdaBoost
Input: training set S of m examples of function f, weak learning algorithm WL that is (1/2 - \gamma)-approximate, \gamma
Algorithm:
1.  T \leftarrow \ln(m)/(2\gamma^2)
2.  for all x \in S, w(x) \leftarrow 1/m
3.  for i = 1 to T do
4.      for all x \in S, D_i(x) \leftarrow w(x) / \sum_{x \in S} w(x)
5.      invoke WL on S and distribution D_i, producing weak hypothesis h_i
6.      \epsilon_i \leftarrow \sum_{x : h_i(x) \ne f(x)} D_i(x)
7.      \beta_i \leftarrow \epsilon_i / (1 - \epsilon_i)
8.      for all x \in S, if h_i(x) = f(x) then w(x) \leftarrow w(x) \cdot \beta_i
9.  enddo
Output: h(x) = sign(\sum_{i=1}^{T} -\ln(\beta_i) \cdot h_i(x))

Figure 1: The AdaBoost algorithm.
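Figure 1 translates almost line-for-line into code. The following Python sketch is our illustration, not the authors' implementation; it assumes labels in {-1,+1}, a callable weak_learner(S, D) that returns a hypothesis with weighted error below 1/2, and it handles the degenerate case \epsilon_i = 0 by returning the perfect weak hypothesis directly.

import math

def adaboost(S, f, weak_learner, gamma):
    """Boost weak_learner on sample S with labels f(x) in {-1,+1}.
    weak_learner(S, D) must return a hypothesis h with weighted error
    strictly below 1/2 under the distribution D (dict: index -> prob)."""
    m = len(S)
    T = max(1, int(math.log(m) / (2 * gamma ** 2)))  # number of stages (step 1)
    w = {i: 1.0 / m for i in range(m)}               # initial weights (step 2)
    hyps, votes = [], []
    for _ in range(T):
        Z = sum(w.values())
        D = {i: w[i] / Z for i in range(m)}          # normalize (step 4)
        h = weak_learner(S, D)                       # step 5
        eps = sum(D[i] for i in range(m) if h(S[i]) != f(S[i]))  # step 6
        if eps == 0:                                 # perfect weak hypothesis
            return h
        beta = eps / (1.0 - eps)                     # step 7
        for i in range(m):                           # demote correctly
            if h(S[i]) == f(S[i]):                   # classified examples (step 8)
                w[i] *= beta
        hyps.append(h)
        votes.append(-math.log(beta))                # vote weight -ln(beta_i)
    def H(x):                                        # weighted-threshold output
        total = sum(a * h(x) for a, h in zip(votes, hyps))
        return 1 if total > 0 else -1                # breaking sign(0) toward -1
    return H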
3.2 PAC-Learning Sparse k-Perceptrons

We now show that sparse k-perceptrons are PAC learnable by real-weighted k-perceptrons having relatively few nonzero weights. Specifically, ignoring log factors, P_{k,s} is learnable by \hat{P}_k^{O(s^2)} for any constant k. We first show that, given a training set for any f \in P_{k,s}, we can efficiently find a consistent h \in \hat{P}_k^{O(s^2)}. This consistency algorithm is the basis of the algorithm we later apply to empirical learning problems. We then show how to turn the consistency algorithm into a PAC learning algorithm. Our proof is implicit in somewhat more general work by Freund (1993), although he did not actually present a learning algorithm for this class or analyze the sample size needed to ensure \epsilon-approximation, as we do. Following Freund, we begin our development with the following lemma (Goldmann et al., 1992):

Lemma 1 (Goldmann-Hastad-Razborov) For f : {0,1}^n \to {-1,+1} and H, any set of functions with the same domain and range, if f can be represented as f(x) = sign(\sum_{i=1}^{s} h_i(x)), where h_i \in H, then for any probability distribution D over {0,1}^n there is some h_i such that Pr_D[f(x) \ne h_i(x)] \le 1/2 - 1/(2s).

If we specialize this lemma by taking H = C_k (recall that C_k is the set of conjunctions of at most k input features of f) then this implies that for any f \in P_{k,s} and any probability distribution D over the input features of f there is some h_i \in C_k that weakly approximates f with respect to D. Therefore, given a training set S and distribution D that has nonzero weight only on instances in S, the following simple algorithm is a weak learning algorithm for P_k: exhaustively test each of the O(n^k) possible conjunctions of at most k features until we find a conjunction that (1/2 - 1/(2s))-approximates f with respect to D (we can efficiently compute the approximation of a conjunction h_i by summing the values of D over those inputs where h_i and f agree). Any such conjunction can be returned as the weak hypothesis. The above lemma proves that if f is a k-perceptron then this exhaustive search must succeed at finding such a hypothesis. Therefore, given a training set of m examples of any s-sparse k-perceptron f, AdaBoost run with the above weak learner will, after 2s^2 \ln(m) stages, produce a hypothesis consistent with the training set. Because each stage adds one weak hypothesis to the output hypothesis, the final hypothesis will be a real-weighted k-perceptron with at most 2s^2 \ln(m) nonzero weights.
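The exhaustive weak learner just described admits a direct implementation. The sketch below is ours: it enumerates all O(n^k) signed conjunctions of at most k of the n Boolean inputs and returns the candidate with the greatest weighted agreement (the argument above only requires finding one that (1/2 - 1/(2s))-approximates f, which the best candidate certainly is when f is an s-sparse k-perceptron).

from itertools import combinations

def weak_learn_conjunctions(S, D, f, n, k):
    """Exhaustive weak learner for k-perceptrons: scan every signed
    conjunction of at most k of the n Boolean inputs and return the one
    with maximum weighted agreement with f under D."""
    def make_h(sign, idx):
        # h(x) = sign * (+1 if all indexed inputs are 1, else -1)
        return lambda x: sign * (1 if all(x[j] == 1 for j in idx) else -1)

    candidates = [(s, ()) for s in (+1, -1)]      # the 0-input conjunction
    for size in range(1, k + 1):
        for idx in combinations(range(n), size):
            candidates.append((+1, idx))          # positive sense
            candidates.append((-1, idx))          # negated sense
    best_h, best_agree = None, -1.0
    for sign, idx in candidates:
        h = make_h(sign, idx)
        agree = sum(D[i] for i in range(len(S)) if h(S[i]) == f(S[i]))
        if agree > best_agree:
            best_h, best_agree = h, agree
    return best_h

To plug this into the adaboost sketch above, wrap it as weak_learner = lambda S, D: weak_learn_conjunctions(S, D, f, n, k).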
We can convert this consistency algorithm to a PAC learning algorithm as follows. First, given a finite set of functions F, it is straightforward to show the following (see, e.g., Haussler, 1988):

Lemma 2 Let F be a finite set of functions over a domain X. For any function f over X, any probability distribution D over X, and any positive \epsilon and \delta, given a set S of m examples drawn consecutively from EX(f,D), where m \ge \epsilon^{-1}(\ln \delta^{-1} + \ln|F|), then Pr[\exists h \in F : (\forall x \in S, f(x) = h(x)) and Pr_D[f(x) \ne h(x)] > \epsilon] < \delta, where the outer probability is over the random choices made by EX(f,D).

The consistency algorithm above finds a consistent hypothesis in \hat{P}_k^r, where r = 2s^2 \ln(m). Also, based on a result of Bruck (1990), it can be shown that \ln|\hat{P}_k^r| = O(r^2 + kr \log n). Therefore, ignoring log factors, a randomly-generated training set of size O(k s^4 / \epsilon) is sufficient to guarantee that, with high probability, our algorithm will produce an \epsilon-approximator for any s-sparse k-perceptron target. In other words, the following is a PAC algorithm for P_{k,s}: compute a sufficiently large (but polynomial in the PAC parameters) m, draw m examples from EX(f,D) to create a training set, and run the consistency algorithm on this training set.

So far we have shown that sparse k-perceptrons are learnable by sparse perceptron hypotheses (with potentially polynomially-many more weights). In practice, of course, we expect that many real-world classification tasks cannot be performed exactly by sparse perceptrons. In fact, it can be shown that for certain (reasonable) definitions of "noisy" sparse perceptrons (loosely, functions that are approximated reasonably well by sparse perceptrons), the class of noisy sparse k-perceptrons is still PAC-learnable. This claim is based on results of Aslam and Decatur (1993), who present a noise-tolerant boosting algorithm. In fact, several different boosting algorithms could be used to learn P_{k,s} (e.g., Freund, 1993). We have chosen to use AdaBoost because it seems to offer significant practical advantages, particularly in terms of efficiency. Also, our empirical results to date indicate that our algorithm works very well on difficult (presumably "noisy") real-world problems. However, one potential advantage of basing the algorithm on one of these earlier boosters instead of AdaBoost is that the algorithm would then produce a perceptron with integer weights while still maintaining the sparseness guarantee of the AdaBoost-based algorithm.

3.3 Practical Considerations

We turn now to the practical details of our algorithm, which is based on the consistency algorithm above. First, it should be noted that the theory developed above works over discrete input domains (Boolean or nominal-valued features). Thus, in this paper, we consider only tasks with discrete input features. Also, because the algorithm uses exhaustive search over all conjunctions of size k, learning time depends exponentially on the choice of k. In this study we use k = 2 throughout, since this choice results in reasonable learning times.

Another implementation concern involves deciding when the learning algorithm should terminate. The consistency algorithm uses the size of the target function in calculating the number of boosting stages. Of course, such size information is not available in real-world applications, and in fact, the target function may not be exactly representable as a sparse perceptron. In practice, we use cross-validation to determine an appropriate termination point. To facilitate comprehensibility, we also limit the number of boosting stages to at most the number of weights that would occur in an ordinary perceptron for the task. For similar reasons, we also modify the criteria used to select the weak hypothesis at each stage so that simple features are preferred over conjunctive features. In particular, given distribution D at some stage j, for each h_i \in C_k we compute a correlation E_D[f \cdot h_i]. We then multiply each high-order feature's correlation by a fixed discount factor. The h_i with the largest resulting correlation serves as the weak hypothesis for stage j.
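A sketch of this selection rule follows; it is our rendering, and the discount constant PENALTY is an assumed placeholder, since the exact factor is not reproduced here.

PENALTY = 0.75  # assumed discount for conjunctive features; illustrative only

def select_weak_hypothesis(S, D, f, candidates):
    """candidates: list of (h, order) pairs, order = number of inputs in
    the conjunction. Returns the h maximizing the (discounted)
    correlation E_D[f * h], preferring simple features."""
    best_h, best_score = None, float("-inf")
    for h, order in candidates:
        corr = sum(D[i] * f(S[i]) * h(S[i]) for i in range(len(S)))
        if order > 1:
            corr *= PENALTY          # down-weight high-order features
        if corr > best_score:
            best_h, best_score = h, corr
    return best_h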
4 Empirical Evaluation

In our experiments, we are interested in assessing both the generalization ability and the complexity of the hypotheses produced by our algorithm. We compare our algorithm to ordinary perceptrons trained using backpropagation (Rumelhart et al., 1986), multi-layer perceptrons trained using backpropagation, and decision trees induced using the C4.5 system (Quinlan, 1993). We use C4.5 in our experiments as a representative of "symbolic" learning algorithms. Symbolic algorithms are widely believed to learn hypotheses that are more comprehensible than neural networks. Additionally, to test the hypothesis that the performance of our algorithm can be explained solely by its use of second-order features, we train ordinary perceptrons using feature sets that include all pairwise conjunctions, as well as the ordinary features. To test the hypothesis that the performance of our algorithm can be explained by its use of relatively few weights, we consider ordinary perceptrons which have been pruned using a variant of the Optimal Brain Damage (OBD) algorithm (Le Cun et al., 1989). In our version of OBD, we train a perceptron until the stopping criteria are met, prune the weight with the smallest salience, and then iterate the process. We use a validation set to decide when to stop pruning weights. For each training set, we use cross-validation to select the number of hidden units (5, 10, 20, 40, or 80) for the MLP's, and the pruning confidence level for the C4.5 trees. We use a validation set to decide when to stop training for the MLP's.

We evaluate our algorithm using three real-world domains: the voting data set from the UC-Irvine database; a promoter data set which is a more complex superset of the UC-Irvine one; and a data set in which the task is to recognize protein-coding regions in DNA (Craven & Shavlik, 1993). We remove the physician-fee-freeze feature from the voting data set to make the problem more difficult. We conduct our experiments using a 10-fold cross-validation methodology, except in the protein-coding domain. Because of certain domain-specific characteristics of this data set, we use 4-fold cross-validation for our experiments with it.

Table 1: Test-set accuracy (the multi-layer, ordinary, 2nd-order, and pruned columns are perceptron variants).

domain      boosting   C4.5      multi-layer   ordinary   2nd-order   pruned
voting      91.5%      89.2% *   92.2%         90.8%      89.2% *     87.6% *
promoter    92.7       84.4 *    90.6 *        90.0       88.2 *      88.7 *
coding      72.9       62.6 *    71.6 *        70.7 *     70.3 *      69.8 *

Table 2: Hypothesis complexity (# weights).

domain           boosting   multi-layer   ordinary   2nd-order   pruned
voting           12         651           30         450         12
promoters        59         2267          228        25764       41
protein coding   37         4270          60         1740        52

Table 1 reports test-set accuracy for each method on all three domains. We measure the statistical significance of accuracy differences using a paired, two-tailed t-test. The symbol '*' marks results in cases where another algorithm is less accurate than our boosting algorithm at the p \le 0.05 level of significance. No other algorithm is significantly better than our boosting method in any of the domains. From these results we conclude that (1) our algorithm exhibits good generalization performance on a number of interesting real-world problems, and (2) the generalization performance of our algorithm is not explained solely by its use of second-order features, nor is it solely explained by the sparseness of the perceptrons it produces.
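For concreteness, the significance computation is an ordinary paired, two-tailed t-test over per-fold accuracies; a minimal sketch using scipy, with hypothetical fold values:

from scipy import stats

# Hypothetical per-fold accuracies from 10-fold cross-validation.
boosting_acc = [0.93, 0.90, 0.92, 0.91, 0.92, 0.90, 0.93, 0.91, 0.92, 0.91]
c45_acc      = [0.90, 0.88, 0.89, 0.90, 0.89, 0.88, 0.90, 0.89, 0.90, 0.89]

t_stat, p_value = stats.ttest_rel(boosting_acc, c45_acc)  # paired, two-tailed
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value <= 0.05:
    print("difference is significant at the p <= 0.05 level")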
An interesting open question is whether perceptrons trained with both pruning and second-order features are able to match the accuracy of our algorithm; we plan to investigate this question in future work.

Table 2 reports the average number of weights for all of the perceptrons. For all three problems, our algorithm produces perceptrons with fewer weights than the MLP's, the ordinary perceptrons, and the perceptrons with second-order features. The sizes of the OBD-pruned perceptrons and those produced by our algorithm are comparable for all three domains. Recall, however, that for all three tasks, the perceptrons learned by our algorithm had significantly better generalization performance than their similar-sized OBD-pruned counterparts. We contend that the sizes of the perceptrons produced by our algorithm are within the bounds of what humans can readily understand. In the biological literature, for example, linear discriminant functions are frequently used to communicate domain knowledge about sequences of interest. These functions frequently involve more weights than the perceptrons produced by our algorithm. We conclude, therefore, that our algorithm produces hypotheses that are not only accurate, but also comprehensible.

We believe that the results on the protein-coding domain are especially interesting. The input representation for this problem consists of 15 nominal features representing 15 consecutive bases in a DNA sequence. In the regions of DNA that encode proteins (the positive examples in our task), non-overlapping triplets of consecutive bases represent meaningful "words" called codons. In previous work (Craven & Shavlik, 1993), it has been found that a feature set that explicitly represents codons results in better generalization than a representation of just bases. However, we used the bases representation in our experiments in order to investigate the ability of our algorithm to select the "right" second-order features. Interestingly, nearly all of the second-order features included in our sparse perceptrons represent conjunctions of bases that are in the same codon. This result suggests that our algorithm is especially good at selecting relevant features from large feature sets.
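Verifying that a second-order feature respects codon structure amounts to checking whether its two base positions fall in the same triplet; a small sketch, assuming 0-indexed positions and a reading frame that starts at position 0 (the actual frame depends on how the 15-base windows are aligned):

def same_codon(pos1, pos2):
    """True if two 0-indexed base positions lie in the same
    non-overlapping triplet, assuming the frame begins at position 0."""
    return pos1 // 3 == pos2 // 3

print(same_codon(3, 5))   # True: both in the second codon (positions 3-5)
print(same_codon(2, 3))   # False: the pair spans a codon boundary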
5 Future Work

Our present algorithm has a number of limitations which we plan to address. Two areas of current research are generalizing the algorithm for application to problems with real-valued features and developing methods for automatically suggesting high-order features to be included in our algorithm's feature set.

Acknowledgements

Mark Craven was partially supported by ONR grant N00014-93-1-0998. Jeff Jackson was partially supported by NSF grant CCR-9119319.

References

Aslam, J. A. & Decatur, S. E. (1993). General bounds on statistical query learning and PAC learning with noise via hypothesis boosting. In Proc. of the 34th Annual Symposium on Foundations of Computer Science, (pp. 282-291).

Bruck, J. (1990). Harmonic analysis of polynomial threshold functions. SIAM Journal of Discrete Mathematics, 3(2):168-177.

Craven, M. W. & Shavlik, J. W. (1993). Learning to represent codons: A challenge problem for constructive induction. In Proc. of the 13th International Joint Conf. on Artificial Intelligence, (pp. 1319-1324), Chambery, France.

Freund, Y. (1993). Data Filtering and Distribution Modeling Algorithms for Machine Learning. PhD thesis, University of California at Santa Cruz.

Freund, Y. & Schapire, R. E. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. In Proc. of the 2nd Annual European Conf. on Computational Learning Theory.

Goldmann, M., Hastad, J., & Razborov, A. (1992). Majority gates vs. general weighted threshold gates. In Proc. of the 7th IEEE Conf. on Structure in Complexity Theory.

Haussler, D. (1988). Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, (pp. 177-221).

Le Cun, Y., Denker, J. S., & Solla, S. A. (1989). Optimal brain damage. In Touretzky, D., editor, Advances in Neural Information Processing Systems (volume 2).

Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.

Rumelhart, D., Hinton, G., & Williams, R. (1986). Learning internal representations by error propagation. In Rumelhart, D. & McClelland, J., editors, Parallel Distributed Processing: Explorations in the microstructure of cognition. Volume 1. MIT Press.

Spackman, K. A. (1988). Learning categorical decision criteria. In Proc. of the 5th International Conf. on Machine Learning, (pp. 36-46), Ann Arbor, MI.

Stormo, G. (1987). Identifying coding sequences. In Bishop, M. J. & Rawlings, C. J., editors, Nucleic Acid and Protein Sequence Analysis: A Practical Approach. IRL Press.

Valiant, L. G. (1984). A theory of the learnable. Comm. of the ACM, 27(11):1134-1142.
", "award": [], "sourceid": 1076, "authors": [{"given_name": "Jeffrey", "family_name": "Jackson", "institution": null}, {"given_name": "Mark", "family_name": "Craven", "institution": null}]}