{"title": "Bayesian Modeling of Human Concept Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 59, "page_last": 68, "abstract": null, "full_text": "Bayesian modeling of human concept learning\n\nJoshua B. Tenenbaum\n\nDepartment of Brain and Cognitive Sciences\n\nMassachusetts Institute of Technology, Cambridge, MA 02139\n\njbt@psyche.mit.edu\n\nAbstract\n\nI consider the problem of learning concepts from small numbers of positive examples, a feat which humans perform routinely but which computers are rarely capable of. Bridging machine learning and cognitive science perspectives, I present both theoretical analysis and an empirical study with human subjects for the simple task of learning concepts corresponding to axis-aligned rectangles in a multidimensional feature space. Existing learning models, when applied to this task, cannot explain how subjects generalize from only a few examples of the concept. I propose a principled Bayesian model based on the assumption that the examples are a random sample from the concept to be learned. The model gives precise fits to human behavior on this simple task and provides qualitative insights into more complex, realistic cases of concept learning.\n\n1  Introduction\n\nThe ability to learn concepts from examples is one of the core capacities of human cognition. From a computational point of view, human concept learning is remarkable for the fact that very successful generalizations are often produced after experience with only a small number of positive examples of a concept (Feldman, 1997). While negative examples are no doubt useful to human learners in refining the boundaries of concepts, they are not necessary in order to make reasonable generalizations of word meanings, perceptual categories, and other natural concepts. 
In contrast, most machine learning algorithms require examples of both positive and negative instances of a concept in order to generalize at all, and many examples of both kinds in order to generalize successfully (Mitchell, 1997).\n\nThis paper attempts to close the gap between human and machine concept learning by developing a rigorous theory for concept learning from limited positive evidence and testing it against real behavioral data. I focus on a simple abstract task of interest to both cognitive science and machine learning: learning axis-parallel rectangles in $\\mathbb{R}^m$. We assume that each object $x$ in our world can be described by its values $(x_1, \\ldots, x_m)$ on $m$ real-valued observable dimensions, and that each concept $C$ to be learned corresponds to a conjunction of independent intervals ($\\min_i(C) \\le x_i \\le \\max_i(C)$) along each dimension $i$. For example, the objects might be people, the dimensions might be \"cholesterol level\" and \"insulin level\", and the concept might be \"healthy levels\". Suppose that \"healthy levels\" applies to any individual whose cholesterol and insulin levels are each greater than some minimum healthy level and less than some maximum healthy level. Then the concept \"healthy levels\" corresponds to a rectangle in the two-dimensional cholesterol/insulin space.\n\nFigure 1: (a) A rectangle concept C. (b-c) The size principle in Bayesian concept learning: of the many hypotheses consistent with the observed positive examples, the smallest rapidly become more likely (indicated by darker lines) as more examples are observed. 
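As a rough illustration of the rectangle concept described above, the following sketch represents a concept as one closed interval per dimension and tests membership as a conjunction over dimensions. The class name, the method name, and the numeric bounds are hypothetical and chosen only for illustration; none of them come from the paper.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class RectangleConcept:
    # One (min, max) interval per observable dimension.
    intervals: Tuple[Tuple[float, float], ...]

    def contains(self, x: Sequence[float]) -> bool:
        # Membership is a conjunction: every coordinate must lie in its interval.
        if len(x) != len(self.intervals):
            raise ValueError('point and concept must have the same dimensionality')
        return all(lo <= xi <= hi for xi, (lo, hi) in zip(x, self.intervals))


# Hypothetical bounds for (cholesterol, insulin); illustrative values only.
healthy = RectangleConcept(intervals=((120.0, 200.0), (5.0, 25.0)))
inside = healthy.contains((150.0, 10.0))    # both coordinates within bounds
outside = healthy.contains((250.0, 10.0))   # cholesterol above the maximum
print(inside, outside)
```

The learning problem discussed in the rest of the paper runs in the opposite direction: the intervals are unknown and must be inferred from a handful of points for which contains would return True.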
\n\nThe problem of generalization in this setting is to infer, given a set of positive (+) and negative (-) examples of a concept $C$, which other points belong inside the rectangle corresponding to $C$ (Fig. 1a). This paper considers the question most relevant for cognitive modeling: how to generalize from just a few positive examples?\n\nIn machine learning, the problem of learning rectangles is a common textbook example used to illustrate models of concept learning (Mitchell, 1997). It is also the focus of state-of-the-art theoretical work and applications (Dietterich et al., 1997). The rectangle learning task is not well known in cognitive psychology, but many studies have investigated human learning in similar tasks using simple concepts defined over two perceptually separable dimensions such as size and color (Shepard, 1987). Such impoverished tasks are worth our attention because they isolate the essential inductive challenge of concept learning in a form that is analytically tractable and amenable to empirical study in human subjects.\n\nThis paper makes two main contributions. I first present a new theoretical analysis of the rectangle learning problem based on Bayesian inference and contrast this model's predictions with standard learning frameworks (Section 2). I then describe an experiment with human subjects on the rectangle task and show that, of the models considered, the Bayesian approach provides by far the best description of how people actually generalize on this task when given only limited positive evidence (Section 3). These results suggest an explanation for some aspects of the ubiquitous human ability to learn concepts from just a few positive examples.\n\n2  Theoretical analysis\n\nComputational approaches to concept learning. 
Depending on how they model a concept, different approaches to concept learning differ in their ability to generalize meaningfully from only limited positive evidence. Discriminative approaches embody no explicit model of a concept, but only a procedure for discriminating category members from members of mutually exclusive contrast categories. Most backprop-style neural networks and exemplar-based techniques (e.g. K-nearest neighbor classification) fall into this group, along with hybrid models like ALCOVE (Kruschke, 1992). These approaches are ruled out by definition; they cannot learn to discriminate positive and negative instances if they have seen only positive examples. Distributional approaches model a concept as a probability distribution over some feature space and classify new instances $x$ as members of $C$ if their estimated probability $p(x|C)$ exceeds a threshold $\\theta$. This group includes \"novelty detection\" techniques based on Bayesian nets (Jaakkola et al., 1996) and, loosely, autoencoder networks (Japkowicz et al., 1995). While $p(x|C)$ can be estimated from only positive examples, novelty detection also requires negative examples for principled generalization, in order to set an appropriate threshold $\\theta$, which may vary over many orders of magnitude for different concepts. For learning from positive evidence only, our best hope lies in algorithms that treat a new concept $C$ as an unknown subset of the universe of objects and decide how to generalize $C$ by finding \"good\" subsets in a hypothesis space $H$ of possible concepts.\n\nThe Bayesian framework. For this task, the natural hypothesis space $H$ corresponds to all rectangles in the plane. The central challenge in generalizing using the subset approach is that any small set of examples will typically be consistent with many hypotheses (Fig. 
1b). This problem is not unique to learning rectangles, but is a universal dilemma when trying to generalize concepts from only limited positive data. The Bayesian solution is to embed the hypothesis space in a probabilistic model of our observations, which allows us to weight different consistent hypotheses as more or less likely to be the true concept based on the particular examples observed. Specifically, we assume that the examples are generated by random sampling from the true concept. This leads to the size principle: smaller hypotheses become more likely than larger hypotheses (Fig. 1b; darker rectangles are more likely), and they become exponentially more likely as the number of consistent examples increases (Fig. 1c). The size principle is the key to understanding how we can learn concepts from only a few positive examples.\n\nFormal treatment. We observe $n$ positive examples $X = \\{x^{(1)}, \\ldots, x^{(n)}\\}$ of concept $C$ and want to compute the generalization function $p(y \\in C|X)$, i.e. the probability that some new object $y$ belongs to $C$ given the observations $X$. Let each rectangle hypothesis $h$ be denoted by a quadruple $(l_1, l_2, s_1, s_2)$, where $l_i \\in [-\\infty, \\infty]$ is the location of $h$'s lower-left corner and $s_i \\in [0, \\infty]$ is the size of $h$ along dimension $i$.\n\nOur probabilistic model consists of a prior density $p(h)$ and a likelihood function $p(X|h)$ for each hypothesis $h \\in H$. The likelihood is determined by our assumption of randomly sampled positive examples. In the simplest case, each example in $X$ is assumed to be independently sampled from a uniform density over the concept $C$. For $n$ examples we then have:\n\n$$p(X|h) = 1/|h|^n \\ \\text{ if } \\forall j,\\ x^{(j)} \\in h; \\qquad p(X|h) = 0 \\ \\text{ otherwise}, \\qquad (1)$$\n\nwhere $|h|$ denotes the size of $h$. For rectangle $(l_1, l_2, s_1, s_2)$, $|h|$ is simply $s_1 s_2$. 
Note that because each hypothesis must distribute one unit mass of likelihood over its volume for each example ($\\int_{x \\in h} p(x|h)\\,dx = 1$), the probability density for smaller consistent hypotheses is greater than for larger hypotheses, and exponentially greater as a function of $n$. Figs. 1b,c illustrate this size principle for scoring hypotheses (darker rectangles are more likely).\n\nThe appropriate choice of $p(h)$ depends on our background knowledge. If we have no a priori reason to prefer any rectangle hypothesis over any other, we can choose the scale- and location-invariant uninformative prior, $p(h) = p(l_1, l_2, s_1, s_2) = 1/(s_1 s_2)$. In any realistic application, however, we will have some prior information. For example, we may know the expected size $\\sigma_i$ of rectangle concepts along dimension $i$ in our domain, and then use the associated maximum entropy prior $p(l_1, l_2, s_1, s_2) = \\exp\\{-(s_1/\\sigma_1 + s_2/\\sigma_2)\\}$.\n\nThe generalization function $p(y \\in C|X)$ is computed by integrating the predictions of all hypotheses, weighted by their posterior probabilities $p(h|X)$:\n\n$$p(y \\in C|X) = \\int_{h \\in H} p(y \\in C|h)\\, p(h|X)\\, dh, \\qquad (2)$$\n\nwhere from Bayes' theorem $p(h|X) \\propto p(X|h)\\,p(h)$ (normalized such that $\\int_{h \\in H} p(h|X)\\,dh = 1$), and $p(y \\in C|h) = 1$ if $y \\in h$ and $0$ otherwise. Under the uninformative prior, this becomes:\n\n$$p(y \\in C|X) = \\frac{1}{[(1 + d_1/r_1)(1 + d_2/r_2)]^{n-1}}. \\qquad (3)$$\n\nHere $r_i$ is the maximum distance between the examples in $X$ along dimension $i$, and $d_i$ equals $0$ if $y$ falls inside the range of values spanned by $X$ along dimension $i$, and otherwise equals the distance from $y$ to the nearest example in $X$ along dimension $i$. Under the expected-size prior, $p(y \\in C|X)$ has no closed form solution valid for all $n$. However, except for very small values of $n$ (e.g. 
$< 3$) and $r_i$ (e.g. $< \\sigma_i/10$), the following approximation holds to within 10% (and usually much less) error:\n\n$$p(y \\in C|X) \\approx \\frac{\\exp\\{-(d_1/\\sigma_1 + d_2/\\sigma_2)\\}}{[(1 + d_1/r_1)(1 + d_2/r_2)]^{n-1}}. \\qquad (4)$$\n\nFig. 2 (left column) illustrates the Bayesian learner's contours of equal probability of generalization (at $p = 0.1$ intervals), for different values of $n$ and $r_i$. The bold curve corresponds to $p(y \\in C|X) = 0.5$, a natural boundary for generalizing the concept. Integrating over all hypotheses weighted by their size-based probabilities yields a broad gradient of generalization for small $n$ (row 1) that rapidly sharpens up to the smallest consistent hypothesis as $n$ increases (rows 2-3), and that extends further along the dimension with a broader range $r_i$ of observations. This figure reflects an expected-size prior with $\\sigma_1 = \\sigma_2 =$ axis_width/2; using an uninformative prior produces a qualitatively similar plot.\n\nRelated work: MIN and Weak Bayes. Two existing subset approaches to concept learning can be seen as variants of this Bayesian framework. The classic MIN algorithm generalizes no further than the smallest hypothesis in $H$ that includes all the positive examples (Bruner et al., 1956; Feldman, 1997). MIN is a PAC learning algorithm for the rectangles task, and also corresponds to the maximum likelihood estimate in the Bayesian framework (Mitchell, 1997). However, while it converges to the true concept as $n$ becomes large (Fig. 2, row 3), it appears extremely conservative in generalizing from very limited data (Fig. 2, row 1).\n\nAn earlier approach to Bayesian concept learning, developed independently in cognitive psychology (Shepard, 1987) and machine learning (Haussler et al., 1994; Mitchell, 1997), was an important inspiration for the framework of this paper. I call the earlier approach weak Bayes, because it embodies a different generative model that leads to a much weaker likelihood function than Eq. 1. While Eq. 
1 came from assuming examples sampled randomly from the true concept, weak Bayes assumes the examples are generated by an arbitrary process independent of the true concept. As a result, the size principle for scoring hypotheses does not apply; all hypotheses consistent with the examples receive a likelihood of 1, instead of the factor of $1/|h|^n$ in Eq. 1. The extent of generalization is then determined solely by the prior; for example, under the expected-size prior,\n\n$$p(y \\in C|X) = \\exp\\{-(d_1/\\sigma_1 + d_2/\\sigma_2)\\}. \\qquad (5)$$\n\nWeak Bayes, unlike MIN, generalizes reasonably from just a few examples (Fig. 2, row 1). However, because Eq. 5 is independent of $n$ and $r_i$, weak Bayes does not converge to the true concept as the number of examples increases (Fig. 2, rows 2-3), nor does it generalize further along axes of greater variability. While weak Bayes is a natural model when the examples really are generated independently of the concept (e.g. when the learner himself or a random process chooses objects to be labeled \"positive\" or \"negative\" by a teacher), it is clearly limited as a model of learning from deliberately provided positive examples.\n\nIn sum, previous subset approaches each appear to capture a different aspect of how humans generalize concepts from positive examples. The broad similarity gradients that emerge from weak Bayes seem most applicable when only a few broadly spaced examples have been observed (Fig. 2, row 1), while the sharp boundaries of the MIN rule appear more reasonable as the number of examples increases or their range narrows (Fig. 2, rows 2-3). In contrast, the Bayesian framework guided by the size principle automatically interpolates between these two regimes of similarity-based and rule-based generalization, offering the best hope for a complete model of human concept learning. 
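The three generalization rules compared in this section can be sketched in a few lines of code. This is only an illustration under stated assumptions, not the author's implementation: it takes the uninformative-prior generalization function to be 1 / [(1 + d1/r1)(1 + d2/r2)]^(n-1), which follows from integrating the 1/|h|^n likelihood against the 1/(s1 s2) prior, takes weak Bayes under the expected-size prior to be exp(-(d1/sigma1 + d2/sigma2)), and treats the expected-size Bayesian model as the product of the two, which the text describes as an approximation only. All function names and example values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def distances_and_ranges(examples: Sequence[Point], y: Point) -> Tuple[List[float], List[float]]:
    # r[i]: range spanned by the examples along dimension i.
    # d[i]: 0 if y lies inside that range, else distance to the nearest example.
    d, r = [], []
    for i in range(len(y)):
        values = [x[i] for x in examples]
        lo, hi = min(values), max(values)
        r.append(hi - lo)
        d.append(max(lo - y[i], 0.0, y[i] - hi))
    return d, r


def p_min(examples: Sequence[Point], y: Point) -> float:
    # MIN: generalize no further than the smallest consistent rectangle.
    d, _ = distances_and_ranges(examples, y)
    return 1.0 if all(di == 0.0 for di in d) else 0.0


def p_weak_bayes(examples: Sequence[Point], y: Point, sigma: Sequence[float]) -> float:
    # Weak Bayes, expected-size prior: depends on d and sigma, not on n or r.
    d, _ = distances_and_ranges(examples, y)
    return math.exp(-sum(di / si for di, si in zip(d, sigma)))


def p_bayes_uninformative(examples: Sequence[Point], y: Point) -> float:
    # Size-principle Bayes with the uninformative prior (assumed closed form).
    n = len(examples)
    d, r = distances_and_ranges(examples, y)
    if n < 2 or any(ri <= 0.0 for ri in r):
        raise ValueError('need at least two examples with nonzero range in every dimension')
    base = math.prod(1.0 + di / ri for di, ri in zip(d, r))
    return base ** (-(n - 1))


def p_bayes_expected_size(examples: Sequence[Point], y: Point, sigma: Sequence[float]) -> float:
    # Approximation only: product of the two expressions above.
    return p_weak_bayes(examples, y, sigma) * p_bayes_uninformative(examples, y)


examples = [(0.0, 0.0), (2.0, 1.0), (1.0, 2.0)]   # n = 3, ranges r = (2, 2)
y_outside = (3.0, 1.0)                            # distances d = (1, 0)
sigma = (5.0, 5.0)                                # hypothetical expected sizes

for name, value in [
    ('MIN', p_min(examples, y_outside)),
    ('weak Bayes', p_weak_bayes(examples, y_outside, sigma)),
    ('Bayes (uninformative)', p_bayes_uninformative(examples, y_outside)),
    ('Bayes (expected size, approx.)', p_bayes_expected_size(examples, y_outside, sigma)),
]:
    print(f'{name}: {value:.3f}')
```

Doubling the number of examples while keeping the same ranges (for instance, passing examples * 2) lowers the Bayesian value at y_outside, whereas the MIN and weak Bayes values stay the same; this is the sharpening with n attributed to the size principle.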
\n\n3  Experimental data from human subjects\n\nThis section presents empirical evidence that our Bayesian model, but neither MIN nor weak Bayes, can explain human behavior on the simple rectangle learning task. Subjects were given the task of guessing 2-dimensional rectangular concepts from positive examples only, under the cover story of learning about the range of healthy levels of insulin and cholesterol, as described in Section 1. On each trial of the experiment, several dots appeared on a blank computer screen. Subjects were told that these dots were randomly chosen examples from some arbitrary rectangle of \"healthy levels,\" and their job was to guess that rectangle as nearly as possible by clicking on-screen with the mouse. The dots were in fact randomly generated on each trial, subject to the constraints of three independent variables that were systematically varied across trials in a ($6 \\times 6 \\times 6$) factorial design. The three independent variables were the horizontal range spanned by the dots (.25, .5, 1, 2, 4, 8 units in a 24-unit-wide window), vertical range spanned by the dots (same), and number of dots (2, 3, 4, 6, 10, 50). Subjects thus completed 216 trials in random order. To ensure that subjects understood the task, they first completed 24 practice trials in which they were shown, after entering their guess, the \"true\" rectangle that the dots were drawn from.$^1$\n\nThe data from 6 subjects are shown in Fig. 3a, averaged across subjects and across the two directions (horizontal and vertical). The extent $d$ of subjects' rectangles beyond $r$, the range spanned by the observed examples, is plotted as a function of $r$ and $n$, the number of examples. Two patterns of generalization are apparent. First, $d$ increases monotonically with $r$ and decreases with $n$. Second, the rate of increase of $d$ as a function of $r$ is much slower for larger values of $n$.\n\nFig. 
3b shows that neither MIN nor weak Bayes can explain these patterns. MIN always predicts zero generalization beyond the examples (a horizontal line at $d = 0$) for all values of $r$ and $n$. The predictions of weak Bayes are also independent of $r$ and $n$: $d = \\sigma \\log 2$, assuming subjects give the tightest rectangle enclosing all points $y$ with $p(y \\in C|X) > 0.5$. Under the same assumption, Figs. 3c,d show our Bayesian model's predicted bounds on generalization using uninformative and expected-size priors, respectively. Both versions of the model capture the qualitative dependence of $d$ on $r$ and $n$, confirming the importance of the size principle in guiding generalization independent of the choice of prior. However, the uninformative prior misses the nonlinear dependence on $r$ for small $n$, because it assumes an ideal scale invariance that clearly does not hold in this experiment (due to the fixed size of the computer window in which the rectangles appeared). In contrast, the expected-size prior naturally embodies prior knowledge about typical scale in its one free parameter $\\sigma$. A reasonable value of $\\sigma = 5$ units (out of the 24-unit-wide window) yields an excellent fit to subjects' average generalization behavior on this task.\n\n$^1$ Because dots were drawn randomly, the \"true\" rectangles that subjects saw during practice were quite variable and were rarely the \"correct\" response according to any theory considered here. Thus it is unlikely that this short practice was responsible for any consistent trends in subjects' behavior.\n\n4  Conclusions\n\nIn developing a model of concept learning that is at once computationally principled and able to fit human behavior precisely, I hope to have shed some light on how people are able to infer the correct extent of a concept from only a few positive examples. 
The Bayesian model has two key components: (1) a generalization function that results from integrating the predictions of all hypotheses weighted by their posterior probability; (2) the assumption that examples are sampled from the concept to be learned, and not independently of the concept as previous weak Bayes models have assumed. Integrating predictions over the whole hypothesis space explains why either broad gradients of generalization (Fig. 2, row 1) or sharp, rule-based generalization (Fig. 2, row 3) may emerge, depending on how peaked the posterior is. Assuming examples drawn randomly from the concept explains why learners do not weight all consistent hypotheses equally, but instead weight more specific hypotheses higher than more general ones by a factor that increases exponentially with the number of examples observed (the size principle).\n\nThis work is being extended in a number of directions. Negative instances, when encountered, are easily accommodated by assigning zero likelihood to any hypotheses containing them. The Bayesian formulation applies not only to learning rectangles, but to learning concepts in any measurable hypothesis space, wherever the size principle for scoring hypotheses may be applied. In Tenenbaum (1999), I show that the same principles enable learning number concepts and words for kinds of objects from only a few positive examples.$^2$ I also show how the size principle supports much more powerful inferences than this short paper could demonstrate: automatically detecting incorrectly labeled examples, selecting relevant features, and determining the complexity of the hypothesis space. Such inferences are likely to be necessary for learning in the complex natural settings we are ultimately interested in.\n\nAcknowledgments\n\nThanks to M. Bernstein, W. Freeman, S. Ghaznavi, W. 
Richards, R. Shepard, and Y. Weiss for helpful discussions. The author was a Howard Hughes Medical Institute Predoctoral Fellow.\n\nReferences\n\nBruner, J. S., Goodnow, J. J., & Austin, G. A. (1956). A study of thinking. New York: Wiley.\n\nDietterich, T., Lathrop, R., & Lozano-Perez, T. (1997). Solving the multiple-instance problem with axis-parallel rectangles. Artificial Intelligence 89(1-2), 31-71.\n\nFeldman, J. (1997). The structure of perceptual categories. J. Math. Psych. 41, 145-170.\n\nHaussler, D., Kearns, M., & Schapire, R. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC-dimension. Machine Learning 14, 83-113.\n\nJaakkola, T., Saul, L., & Jordan, M. (1996). Fast learning by bounding likelihoods in sigmoid type belief networks. Advances in Neural Information Processing Systems 8.\n\nJapkowicz, N., Myers, C., & Gluck, M. (1995). A novelty detection approach to classification. Proceedings of the 14th International Joint Conference on Artificial Intelligence.\n\nKruschke, J. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psych. Rev. 99, 22-44.\n\nMitchell, T. (1997). Machine Learning. McGraw-Hill.\n\nMuggleton, S. (preprint). Learning from positive data. Submitted to Machine Learning.\n\nShepard, R. (1987). Towards a universal law of generalization for psychological science. Science 237, 1317-1323.\n\nTenenbaum, J. B. (1999). A Bayesian Framework for Concept Learning. Ph.D. Thesis, MIT Department of Brain and Cognitive Sciences.\n\n$^2$ In the framework of inductive logic programming, Muggleton (preprint) has independently proposed that similar principles may allow linguistic grammars to be learned from positive data only. 
\n\nFigure 2: Performance of three concept learning algorithms on the rectangle task. [Figure not reproduced; columns: Bayes, MIN, weak Bayes; legible row labels: n = 6, n = 12.]\n\nFigure 3: Data from human subjects and model predictions for the rectangle task. [Figure not reproduced; panels: (a) Average data from 6 subjects; (b) MIN and weak Bayes models, with weak Bayes curves for $\\sigma = 2$ and $\\sigma = 1$; (c) Bayesian model (uninformative prior); (d) Bayesian model (expected-size prior). Horizontal axis: r, range spanned by n examples; curves for n = 2, 3, 4, 6, 10, 50.]", "award": [], "sourceid": 1542, "authors": [{"given_name": "Joshua", "family_name": "Tenenbaum", "institution": null}]}