{"title": "Bayesian Models of Inductive Generalization", "book": "Advances in Neural Information Processing Systems", "page_first": 59, "page_last": 66, "abstract": null, "full_text": "Bayesian Models of Inductive Generalization\n\nNeville E. Sanjana & Joshua B. Tenenbaum\nDepartment of Brain and Cognitive Sciences\nMassachusetts Institute of Technology\nCambridge, MA 02139\n{nsanjana, jbt}@mit.edu\n\nAbstract\n\nWe argue that human inductive generalization is best explained in a Bayesian framework, rather than by traditional models based on similarity computations. We go beyond previous work on Bayesian concept learning by introducing an unsupervised method for constructing flexible hypothesis spaces, and we propose a version of the Bayesian Occam\u2019s razor that trades off priors and likelihoods to prevent under- or over-generalization in these flexible spaces. We analyze two published data sets on inductive reasoning as well as the results of a new behavioral study that we have carried out.\n\n1 Introduction\n\nThe problem of inductive reasoning (in particular, how we can generalize after seeing only one or a few specific examples of a novel concept) has troubled philosophers, psychologists, and computer scientists since the early days of their disciplines. Computational approaches to inductive generalization range from simple heuristics based on similarity matching to complex statistical models [5]. Here we consider where human inference falls on this spectrum. Based on two classic data sets from the literature and one more comprehensive data set that we have collected, we will argue for models based on a rational Bayesian learning framework [10]. We also confront an issue that has often been side-stepped in previous models of concept learning: the origin of the learner\u2019s hypothesis space. 
We present a simple, unsupervised clustering method for creating hypothesis spaces that, when applied to human similarity judgments and embedded in our Bayesian framework, consistently outperforms the best alternative models of inductive reasoning based on similarity-matching heuristics.\n\nWe focus on two related inductive generalization tasks introduced in [6], which involve reasoning about the properties of animals. The first task is to judge the strength of a generalization from one or more specific kinds of mammals to a different kind of mammal: given that animals of kind $A$ and $B$ have property $P$, how likely is it that an animal of kind $C$ also has property $P$? For example, $A$ might be chimp, $B$ might be squirrel, and $C$ might be horse. $P$ is always a blank predicate, such as \u201cis susceptible to the disease blicketitis\u201d, about which nothing is known outside of the given examples. Working with blank predicates ensures that people\u2019s inductions are driven by their deep knowledge about the general features of animals rather than the details they might or might not know about any one particular property. Stimuli are typically presented in the form of an argument from premises (examples) to conclusion (the generalization test item), as in\n\nChimps are susceptible to the disease blicketitis.\nSquirrels are susceptible to the disease blicketitis.\n----------------------------------------\nHorses are susceptible to the disease blicketitis.\n\nand subjects are asked to judge the strength of the argument, that is, the likelihood that the conclusion (below the line) is true given that the premises (above the line) are true. The second task is the same except for the form of the conclusion. Instead of asking how likely the property is to hold for another kind of mammal, e.g., horses, we ask how likely it is to hold for all mammals. 
We refer to these two kinds of induction tasks as the specific and general tasks, respectively.\n\nOsherson et al. [6] present data from two experiments using these tasks. One data set contains human judgments for the relative strengths of 36 specific inferences, each with a different pair of mammals given as examples (premises) but the same test species, horses. The other set contains judgments of argument strength for 45 general inferences, each with a different triplet of mammals given as examples and the same test category, all mammals. Osherson et al. also published subjects\u2019 judgments of similarity for all 45 pairs of the 10 mammals used in their generalization experiments, which they (and we) use to build models of generalization.\n\n2 Previous approaches\n\nThere have been several attempts to model the data in [6]: the similarity-coverage model [6], a feature-based model [8], and a Bayesian model [3]. The two factors that determine the strength of an inductive generalization in Osherson et al.\u2019s model [6] are (i) similarity of the animals in the premise(s) to those in the conclusion, and (ii) coverage, defined as the similarity of the animals in the premise(s) to the larger taxonomic category of mammals, including all specific animal types in this domain. To see the importance of the coverage factor, compare the following two inductive generalizations. The chance that horses can get a disease given that we know chimps and squirrels can get that disease seems higher than if we know only that chimps and gorillas can get the disease. Yet simple similarity favors the latter generalization: horses are judged to be more similar to gorillas than to chimps, and much more similar to either primate species than to squirrels. 
Coverage, however, intuitively favors the first generalization: the set {chimp, squirrel} \u201ccovers\u201d the set of all mammals much better than does the set {chimp, gorilla}, and to the extent that a set of examples supports generalization to all mammals, it should also support generalization to horses, a particular type of mammal.\n\nSimilarity and coverage factors are mixed linearly to predict the strength of a generalization. Mathematically, the prediction is given by $\\alpha\\, S(X, Y) + (1 - \\alpha)\\, S(X, \\text{all mammals})$, where $X$ is the set of examples (premises), $Y$ is the test set (conclusion), $\\alpha$ is a free parameter, and $S$ is a setwise similarity metric defined to be the sum of each $Y$ element\u2019s maximal similarity to the $X$ elements: $S(X, Y) = \\sum_{j \\in Y} \\max_{i \\in X} \\mathrm{sim}(i, j)$. For the specific arguments, the test set $Y$ has just one element, $y =$ horse, so $S(X, Y)$ is just the maximum similarity of horses to the example animal types in $X$. For the general arguments, $Y =$ all mammals, which is approximated by the set of all mammal types used in the experiment (see Figure 1). Osherson et al. [6] also consider a sum-similarity model, which replaces the maximum with a sum: $S(X, Y) = \\sum_{j \\in Y} \\sum_{i \\in X} \\mathrm{sim}(i, j)$. Summed similarity has more traditionally been used to model human concept learning, and also has a rational interpretation in terms of nonparametric density estimation, but Osherson et al. favor the max-similarity model based on its match to their intuitions for these particular tasks. We examine both models in our experiments.\n\nSloman [8] developed a feature-based model that encodes the shared features between the premise set and the conclusion set as weights in a neural network. Despite some psychological plausibility, this model consistently fit the two data sets significantly worse than the max-similarity model. Heit [3] outlines a Bayesian framework that provides qualitative explanations of various inductive reasoning phenomena from [6]. His model does not constrain the learner\u2019s hypothesis space, nor does it embody a generative model of the data, so its predictions depend strictly on well-chosen prior probabilities. Without a general method for setting these prior probabilities, it does not make quantitative predictions that can be compared here.\n\n3 A Bayesian model\n\nTenenbaum & colleagues have previously introduced a Bayesian framework for learning concepts from examples, and applied it to learning number concepts [10], word meanings [11], as well as other domains. Formally, for the specific inference task, we observe positive examples $X = \\{x^{(1)}, \\ldots, x^{(n)}\\}$ of the concept $C$ and want to compute the probability $p(y \\in C \\mid X)$ that a particular test stimulus $y$ belongs to the concept $C$ given the observed examples $X$. These generalization probabilities $p(y \\in C \\mid X)$ are computed by averaging the predictions of a set of hypotheses weighted by their posterior probabilities:\n\n$$p(y \\in C \\mid X) = \\sum_{h \\in \\mathcal{H}} p(y \\in C \\mid h)\\, p(h \\mid X). \\qquad (1)$$\n\nHypotheses $h$ pick out subsets of stimuli (candidate extensions of the concept $C$), and $p(y \\in C \\mid h)$ is just 1 or 0 depending on whether the test stimulus $y$ falls under the subset $h$. In the general inference task, we are interested in computing the probability that a whole test category $Y$ falls under the concept $C$:\n\n$$p(Y \\subseteq C \\mid X) = \\sum_{h \\in \\mathcal{H}:\\, Y \\subseteq h} p(h \\mid X). \\qquad (2)$$\n\nA crucial component in modeling both tasks is the structure of the learner\u2019s hypothesis space $\\mathcal{H}$.\n\n3.1 Hypothesis space\n\nElements of the hypothesis space $\\mathcal{H}$ represent natural subsets of the objects in the domain: subsets likely to be the extension of some novel property or concept. Our goal in building up $\\mathcal{H}$ is to capture as many hypotheses as possible that people might employ in concept learning, using a procedure that is ideally automatic and unsupervised. One natural way to begin is to identify hypotheses with the clusters returned by a clustering algorithm [11][7].\n\nHere, hierarchical clustering seems particularly appropriate, as people across cultures appear to organize their concepts of biological species in a hierarchical taxonomic structure [1]. 
We applied four standard agglomerative clustering algorithms [2] (single-link, complete-link, average-link, and centroid) to subjects\u2019 similarity judgments for all pairs of 10 animals given in [6]. All four algorithms produced the same output (Figure 1), suggesting a robust cluster structure. We define the base set of clusters $\\mathcal{B}$ to consist of all 19 clusters in this tree. The most straightforward way to define a hypothesis space for Bayesian concept learning is to take $\\mathcal{H}_1 = \\mathcal{B}$; each hypothesis consists of one base cluster. We refer to $\\mathcal{H}_1$ as the \u201ctaxonomic hypothesis space\u201d.\n\n[Figure 1 (tree diagram omitted). Leaves, left to right: Horse, Cow, Elephant, Rhino, Chimp, Gorilla, Mouse, Squirrel, Dolphin, Seal.]\n\nFigure 1: Hierarchical clustering of mammals based on similarity judgments in [6]. Each node in the tree corresponds to one hypothesis in the taxonomic hypothesis space $\\mathcal{H}_1$.\n\nIt is clear that $\\mathcal{H}_1$ alone is not sufficient. The chance that horses can get a disease given that we know cows and squirrels can get that disease seems much higher than if we know only that chimps and squirrels can get the disease, yet the taxonomic hypotheses consistent with the example sets {cow, squirrel} and {chimp, squirrel} are the same. Bayesian generalization with a purely taxonomic hypothesis space essentially depends only on the least similar example (here, squirrel), ignoring more fine-grained similarity structure, such as that one example in the set {cow, squirrel} is very similar to the target horse even if the other is not. 
This sense of fine-grained similarity has a clear objective basis in biology, because a single property can apply to more than one taxonomic cluster, either by chance or through convergent evolution. If the disease in question could afflict two distinct clusters of animals, one exemplified by cows and the other by squirrels, then it is much more likely also to afflict horses (since they share most taxonomic clusters with cows) than if the disease afflicted two distinct clusters exemplified by chimps and squirrels. Thus we consider richer hypothesis subspaces $\\mathcal{H}_2$, consisting of all pairs of taxonomic clusters (i.e., all unions of two clusters from Figure 1, except those already included in $\\mathcal{H}_1$), and $\\mathcal{H}_3$, consisting of all triples of taxonomic clusters (except those included in lower layers). We stop with $\\mathcal{H}_3$ because we have no behavioral data beyond three examples. Our total hypothesis space is then the union of these three layers, $\\mathcal{H} = \\mathcal{H}_1 \\cup \\mathcal{H}_2 \\cup \\mathcal{H}_3$.\n\nThe notion that the hypothesis space of candidate concepts might correspond to the power set of the base clusters, rather than just single clusters, is broadly applicable beyond the domain of biological properties. If the base system of clusters is sufficiently fine-grained, this framework can parameterize any logically possible concept. It is analogous to other general-purpose representations for concepts, such as disjunctive normal form (DNF) in PAC-Learning, or class-conditional mixture models in density-based classification [5].\n\n3.2 The Bayesian Occam\u2019s razor: balancing priors and likelihoods\n\nGiven this hypothesis space, Bayesian generalization then requires assigning a prior $p(h)$ and likelihood $p(X \\mid h)$ for each hypothesis $h \\in \\mathcal{H}$. Let $|\\mathcal{B}|$ be the number of base clusters, and $h$ be a hypothesis in the $k$th layer of the hypothesis space $\\mathcal{H}_k$, corresponding to a union of $k$ base clusters. A simple but reasonable prior assigns to $h$ a sequence of $|\\mathcal{B}|$ i.i.d. Bernoulli variables with $k$ successes and parameter $\\phi$, with probability\n\n$$p(h) = \\phi^{k} (1 - \\phi)^{|\\mathcal{B}| - k}. \\qquad (3)$$\n\nIntuitively, this choice of prior is like assuming a generative model for hypotheses in which each base cluster has some small independent probability $\\phi$ of expressing the concept $C$; the correspondence is not exact because each hypothesis may be expressed as the union of base clusters in multiple ways, and we consider only the minimal union in defining $p(h)$. For $\\phi < 1/2$, $p(h)$ instantiates a preference for simpler hypotheses, that is, hypotheses consisting of fewer disjoint clusters (smaller $k$). More complex hypotheses receive exponentially lower probability under $p(h)$, and the penalty for complexity increases as $\\phi$ becomes smaller. This prior can be applied with any set of base clusters, not just those which are taxonomically structured. We are currently exploring a more sophisticated domain-specific prior for taxonomic clusters defined by a stochastic mutation process over the branches of the tree.\n\nFollowing [10], the likelihood $p(X \\mid h)$ is calculated by assuming that the examples $X$ are a random sample (with replacement) of instances from the concept to be learned. Let $n = |X|$, the number of examples, and let the size $|h|$ of each hypothesis $h$ be simply the number of animal types it contains. Then $p(X \\mid h)$ follows the size principle,\n\n$$p(X \\mid h) = \\begin{cases} 1 / |h|^{n} & \\text{if } h \\text{ includes all examples in } X \\\\ 0 & \\text{if } h \\text{ does not include all examples in } X \\end{cases} \\qquad (4)$$\n\nassigning greater likelihood to smaller hypotheses, by a factor that increases exponentially as the number of consistent examples observed increases.\n\nNote the tension between priors and likelihoods here, which implements a form of the Bayesian Occam\u2019s razor. The prior favors hypotheses consisting of few clusters, while the likelihood favors hypotheses consisting of small clusters. These factors will typically trade off against each other. For any set of examples, we can always cover them under a single cluster if we make the cluster large enough, and we can always cover them with a hypothesis of minimal size (i.e., including no other animals beyond the examples) if we use only singleton clusters and let the number of clusters equal the number of examples. The posterior probability $p(h \\mid X)$, proportional to the product of these terms, thus seeks an optimal tradeoff between over- and under-generalization.\n\n4 Model results\n\nWe consider three data sets. Data sets 1 and 2 come from the specific and general tasks in [6], described in Section 1. Both tasks drew their stimuli from the same set of 10 mammals shown in Figure 1. Each data set (including the set of similarity judgments used to construct the models) came from a different group of subjects. Our models of the probability of generalization for specific and general arguments are given by Equations 1 and 2, respectively, letting $X$ be the example set that varies from trial to trial and $y$ or $Y$ (respectively) be the fixed test category, horses or all mammals. Osherson et al.\u2019s subjects did not provide an explicit judgment of generalization for each example set, but only a relative ranking of the strengths of all arguments in the general or specific sets. Hence we also converted all models\u2019 predictions to ranks for each data set, to enable the most natural comparisons between model and data.\n\nFigure 3 shows the (rank) predictions of three models, Bayesian, max-similarity and sum-similarity, versus human subjects\u2019 (rank) confirmation judgments on the general (row 1) and specific (row 2) induction tasks from [6]. Each model had one free parameter ($\\phi$ in the Bayesian model, $\\alpha$ in the similarity models), which was tuned to the single value that maximized rank-order correlation between model and data jointly over both data sets. The best correlations achieved by the Bayesian model in both the general and specific tasks were greater than those achieved by either the max-similarity or sum-similarity models. The sum-similarity model is far worse than the other two (it is actually negatively correlated with the data on the general task), while max-similarity consistently scores slightly worse than the Bayesian model.\n\n4.1 A new experiment: Varying example set composition\n\nIn order to provide a more comprehensive test of the models, we conducted a variant of the specific experiment using the same 10 animal types and the same constant test category, horses, but with example sets of different sizes and similarity structures. 
In both data sets 1 and 2, the number of examples was constant across all trials; we expected that varying the number of examples would cause difficulty for the max-similarity model because it is not explicitly sensitive to this factor. For this purpose, we included five three-premise arguments, each with three examples of the same animal species (e.g., {chimp, chimp, chimp}), and five one-premise arguments with the same five animals (e.g., {chimp}). We also included three-premise arguments where all examples were drawn from a low-level cluster of species in Figure 1 (e.g., {chimp, gorilla, chimp}). Because of the increasing preference for smaller hypotheses as more examples are observed, Bayes will in general make very different predictions in these three cases, but max-similarity will not. This manipulation also allowed us to distinguish the predictions of our Bayesian model from alternative Bayesian formulations [5][3] that do not include the size principle, and thus do not predict differences between generalization from one example and generalization from three examples of the same kind.\n\nWe also changed the judgment task and cover story slightly, to match more closely the natural problem of inductive learning from randomly sampled examples. Subjects were told that they were training to be veterinarians, by observing examples of particular animals that had been diagnosed with novel diseases. They were required to judge the probability that horses could get the same disease given the examples observed. This cover story made it clear to subjects that when multiple examples of the same animal type were presented, these instances referred to distinct individual animals. 
Figure 3 (row 3) shows the model\u2019s predicted generalization probabilities along with the data from our experiment: mean ratings of generalization from 24 subjects on 28 example sets, using either 1, 2, or 3 examples and the same test species (horses) across all arguments. Again we show predictions for the best values of the free parameters $\\phi$ and $\\alpha$. All three models fit best at different parameter values than in data sets 1 and 2, perhaps due to the task differences or the greater range of stimuli here.\n\n[Figure 2 (bar chart omitted). Vertical axis: argument strength (ticks from 0.15 to 0.6). Horizontal axis: premise category (cow, chimp, mouse, dolphin, elephant). Legend: 1 example, 3 examples.]\n\nFigure 2: Human generalization to the conclusion category horse when given one or three examples of a single premise type.\n\nAgain, the max-similarity model comes close to the performance of the Bayesian model, but it is inconsistent with several qualitative trends in the data. Most notably, we found a difference between generalization from one example and generalization from three examples of the same kind, in the direction predicted by our Bayesian model. Generalization to the test category of horses was greater from singleton examples (e.g., {chimp}) than from three examples of the same kind (e.g., {chimp, chimp, chimp}), as shown in Figure 2. This effect was relatively small but it was observed for all five animal types tested and it was statistically significant ($p <$ [value illegible]) in a $2 \\times 5$ (number of examples $\\times$ animal type) ANOVA. The max-similarity model, however, predicts no effect here, as do Bayesian accounts that do not include the size principle [5][3].\n\nIt is also of interest to ask whether these models are sufficiently robust as to make reasonable predictions across all three experiments using a single parameter setting, or to make good predictions on held-out data when their free parameter is tuned on the remaining data. On these criteria, our Bayesian model maintains its advantage over max-similarity. At the single value of $\\phi =$ [value illegible], Bayes achieves correlations of [values illegible] on the three data sets, respectively, compared to [values illegible] for max-similarity at its single best parameter value ($\\alpha =$ [value illegible]). Using Monte Carlo cross validation [9] to estimate $\\phi$ (1000 runs for each data set, 80%-20% training-test splits), Bayes obtains average test-set correlations of [values illegible] on the three data sets, respectively, compared to [values illegible] for max-similarity using the same method to tune $\\alpha$.\n\n5 Conclusion\n\nOur Bayesian model offers a moderate but consistent quantitative advantage over the best similarity-based models of generalization, and also predicts qualitative effects of varying sample size that contradict alternative approaches. 
More importantly, our Bayesian approach has a principled rational foundation, and we have introduced a framework for unsupervised construction of hypothesis spaces that could be applied in many other domains. In contrast, the similarity-based approach requires arbitrary assumptions about the form of the similarity measure: it must include both \u201csimilarity\u201d and \u201ccoverage\u201d terms, and it must be based on max-similarity rather than sum-similarity. These choices have no a priori justification and run counter to how similarity models have been applied in other domains, leading us to conclude that rational statistical principles offer the best hope for explaining how people can generalize so well from so little data. Still, the consistently good performance of the max-similarity model raises an important question for future study: whether a relatively small number of simple heuristics might provide the algorithmic machinery implementing approximate rational inference in the brain.\n\nWe would also like to understand how people\u2019s subjective hypothesis spaces have their origin in the objective structure of their environment. Two plausible sources for the taxonomic hypothesis space used here can both be ruled out. The actual biological taxonomy for these 10 animals, based on their evolutionary history, looks quite different from the subjective taxonomy used here. Substituting the true taxonomic clusters from biology for the base clusters of our model\u2019s hypothesis space leads to dramatically worse predictions of people\u2019s generalization behavior. Taxonomies constructed from linguistic co-occurrences, by applying the same agglomerative clustering algorithms to similarity scores output from the LSA algorithm [4], also lead to much worse predictions. Perhaps the most likely possibility has not yet been tested. 
It may well be that by clustering on simple perceptual features (e.g., size, shape, hairiness, speed, etc.), weighted appropriately, we can reproduce the taxonomy constructed here from people\u2019s similarity judgments. However, that only seems to push the problem back, to the question of what defines the appropriate features and feature weights. We do not offer a solution here, but merely point to this question as perhaps the most salient open problem in trying to understand the computational basis of human inductive inference.\n\nAcknowledgments\n\nTom Griffiths provided valuable help with statistical analysis. Supported by grants from NTT Communication Science Laboratories and MERL and an HHMI fellowship to NES.\n\nReferences\n\n[1] S. Atran. Classifying nature across cultures. In An Invitation to Cognitive Science, volume 3. MIT Press, 1995.\n[2] R. Duda, P. Hart, and D. Stork. Pattern Classification. Wiley, New York, NY, 2001.\n[3] E. Heit. A Bayesian analysis of some forms of induction. In Rational Models of Cognition. Oxford University Press, 1998.\n[4] T. Landauer and S. Dumais. A solution to Plato\u2019s problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104:211\u2013240, 1997.\n[5] T. Mitchell. Machine Learning. McGraw-Hill, Boston, MA, 1997.\n[6] D. Osherson, E. Smith, O. Wilkie, A. L\u00f3pez, and E. Shafir. Category-based induction. Psychological Review, 97(2):185\u2013200, 1990.\n[7] N. Sanjana and J. Tenenbaum. 
Capturing property-based similarity in human concept learning. In Sixth International Conference on Cognitive and Neural Systems, 2002.\n[8] S. Sloman. Feature-based induction. Cognitive Psychology, 25:231\u2013280, 1993.\n[9] P. Smyth. Clustering using Monte Carlo cross-validation. In Second International Conference on Knowledge Discovery and Data Mining, 1996.\n[10] J. Tenenbaum. Rules and similarity in concept learning. In S. Solla, T. Leen, and K.-R. M\u00fcller, editors, Advances in Neural Information Processing Systems 12, pages 59\u201365. MIT Press, 2000.\n[11] J. Tenenbaum and F. Xu. Word learning as Bayesian inference. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society, 2000.\n\n[Figure 3 (scatter plots omitted). Columns: Bayes, Max-Similarity, Sum-Similarity. Rows: General: mammals, n=3; Specific: horse, n=2; Specific: horse, n=1,2,3. Correlations printed in the panels, in extraction order (panel assignment not recoverable): r = 0.94, 0.87, -0.33, 0.91, 0.93, 0.87, 0.39, 0.97, 0.97.]\n\nFigure 3: Model predictions plotted against human confirmation scores. Each column shows the results for a particular model. Each row is a different inductive generalization experiment, where $n$ indicates the number of examples (premises) in the stimuli.", "award": [], "sourceid": 2284, "authors": [{"given_name": "Neville", "family_name": "Sanjana", "institution": null}, {"given_name": "Joshua", "family_name": "Tenenbaum", "institution": null}]}