Hierarchical Learning of Dimensional Biases in Human Categorization
Advances in Neural Information Processing Systems, pages 727-735

Abstract (metadata): Existing models of categorization typically represent to-be-classified items as points in a multidimensional space. While, from a mathematical point of view, an infinite number of basis sets can be used to represent points in this space, the choice of basis set is psychologically crucial. People generally choose the same basis dimensions, and have a strong preference to generalize along the axes of these dimensions, but not "diagonally". What makes some choices of dimension special? We explore the idea that the dimensions used by people echo the natural variation in the environment. Specifically, we present a rational model that does not assume dimensions, but learns the same type of dimensional generalizations that people display. This bias is shaped by exposing the model to many categories with a structure hypothesized to be like those which children encounter.
Our model can be viewed as a type of transformed Dirichlet process mixture model, where it is the learning of the base distribution of the Dirichlet process which allows dimensional generalization. The learning behaviour of our model captures the developmental shift from roughly "isotropic" for children to the axis-aligned generalization that adults show.

Hierarchical Learning of Dimensional Biases in Human Categorization

Katherine Heller
Department of Engineering, University of Cambridge, Cambridge CB2 1PZ
heller@gatsby.ucl.ac.uk

Adam Sanborn
Gatsby Computational Neuroscience Unit, University College London, London WC1N 3AR
asanborn@gatsby.ucl.ac.uk

Nick Chater
Cognitive, Perceptual and Brain Sciences, University College London, London WC1E 0AP
n.chater@ucl.ac.uk

Abstract

Existing models of categorization typically represent to-be-classified items as points in a multidimensional space. While, from a mathematical point of view, an infinite number of basis sets can be used to represent points in this space, the choice of basis set is psychologically crucial. People generally choose the same basis dimensions - and have a strong preference to generalize along the axes of these dimensions, but not "diagonally". What makes some choices of dimension special? We explore the idea that the dimensions used by people echo the natural variation in the environment. Specifically, we present a rational model that does not assume dimensions, but learns the same type of dimensional generalizations that people display. This bias is shaped by exposing the model to many categories with a structure hypothesized to be like those which children encounter.
The learning behaviour of the model captures the developmental shift from roughly "isotropic" for children to the axis-aligned generalization that adults show.

1 Introduction

Given only a few examples of a particular category, people have strong expectations as to which new examples also belong to that same category. These expectations provide important insights into how objects are mentally represented. One basic insight into mental representations is that objects that have similar observed properties will be expected to belong to the same category, and that this expectation decreases as the Euclidean distance between the properties of the objects increases [1, 2].

The Euclidean distance between observed properties is only part of the story, however. Dimensions also play a strong role in our expectations about categories. People do not always generalize isotropically: the direction of generalization turns out to be centrally important. Specifically, people generalize along dimensions such as size, color, or shape - dimensions that are termed separable. In contrast, dimensions such as hue and saturation, which show isotropic generalization, are termed integral [3]. An illustration of the importance of separable dimensions is found in the time it takes to learn categories. If dimensions did not play a strong role in generalization, then rotating a category structure in a parameter space of separable dimensions should not influence how easily it can be learned. To the contrary, rotating a pair of categories 45 degrees [3, 4] makes it more difficult to learn to discriminate between them. Similarity rating results also show strong trends of judging objects to be more similar if they match along separable dimensions [3, 5].

The tendency to generalize categories along separable dimensions is learned over development.
On dimensions such as size and color, children produce generalizations that are more isotropic than adults' [6]. Interestingly, the developmental transition between isotropic and dimensionally biased generalizations is gradual [7].

What privileges separable dimensions? And why are they acquired over development? One possibility is that there is corresponding variation in real-world categories, and this provides a bias that learners carry over to laboratory experiments. For example, Rosch et al. [1] identified shape as a key constant in categories, and we can find categories that are constant along other separable dimensions as well. For instance, categories of materials such as gold, wood, and ice all display a characteristic color while being relatively unconstrained as to the shapes and sizes that they take. Size is often constrained in artifacts such as books and cars, while color can vary across a very wide range.

Models of categorization are able to account for both the isotropic and dimension-based components of generalization. Classic models of categorization, such as the exemplar and prototype models, account for these using different mechanisms [8, 9, 10]. Rational models of categorization have accounted for dimensional biases by assuming that the shapes of categories are aligned with the axes that people use for generalization [11, 12, 13]. Neither the classic models nor the rational models have investigated how people learn to use the particular dimension basis that they do.

This paper presents a model that learns the dimensional basis that people use for generalization. We connect these biases with a hypothesis about the structure of categories in the environment and demonstrate how exposure to these categories during development results in human dimensional biases. In the next section, we review models of categorization and how they have accounted for dimensional biases.
Next, we review current nonparametric Bayesian models of categorization, which all require that the dimensions be hand-coded. We then introduce a new prior for categorization models that starts without pre-specified dimensions and learns to generalize new categories in the same way that previous categories varied. We show that without the use of pre-specified dimensions, we are able to produce generalizations that fit human data. We demonstrate that training the model on reasonable category structures produces generalization behavior that mimics that of human subjects at various ages. In addition, our trained model predicts the challenging effect of violations of the triangle inequality for similarity judgments.

2 Modeling Dimensional Biases in Categorization

Models of categorization can be divided into generative and discriminative models - we will focus on generative models here and leave discriminative models for the discussion. Generative models of categorization, such as the prototype [8] and exemplar models [9, 10], assume that people learn category distributions, not just rules for discriminating between categories. In order to judge whether a new item belongs to one category or another, the new item is compared to the already existing categories, using Bayes' rule with a uniform prior on the category labels,

P(c_n = i \mid x_n, \mathbf{x}_{n-1}, \mathbf{c}_{n-1}) = \frac{P(x_n \mid c_n = i, \mathbf{x}_{n-1}, \mathbf{c}_{n-1})}{\sum_j P(x_n \mid c_n = j, \mathbf{x}_{n-1}, \mathbf{c}_{n-1})}    (1)

where x_n is the nth item and c_n = j assigns that item to category j. The remaining items are collected in the vector \mathbf{x}_{n-1} and the known labels for these items are \mathbf{c}_{n-1}.

For the prototype and exemplar models, the likelihood of an item belonging to a category is based on the weighted Minkowski power metric^1,

\sum_i \left( \sum_d w^{(d)} \left| x_n^{(d)} - R_i^{(d)} \right|^r \right)^{1/r}    (2)

which computes the absolute value of the power metric between the new example x_n and the category representation R_i for category i on each dimension d. Integral dimensions are modeled with r = 2, which results in a Euclidean distance metric. The Euclidean metric has the special property that changing the basis set for the dimensions of the space does not affect the distances. Any other choice of r means that the distances are affected by the basis set, and thus the basis must be chosen to match human judgments. Separable dimensions are modeled with either r = 1, the city-block metric, or r < 1, which no longer obeys the triangle inequality [5].

^1 For an exemplar model, R_i^{(d)} is each example in \mathbf{x}_{n-1}, while for the prototype model, it is the single average of \mathbf{x}_{n-1}.

Dimensional biases are also modeled in categorization by modifying the weight for each dimension, w^{(d)}. In effect, the weights stretch or shrink the space of stimuli along each dimension so that some items are closer than others. These dimension weights are assumed to correspond to attention. To model the learning of categories, it is often necessary to provide non-zero weights to only a few features early in learning and gradually shift to uniform weights late in learning [14].

These generative models of categorization have been developed to account for the different types of dimensional biases that are displayed by people, but they lack means for learning the dimensions themselves. Extensions to these classical models learn the dimension weights [15, 16], but can only learn weights for pre-specified dimensions.
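To make Equation 2 concrete, here is a minimal Python sketch. The function names, and the exponential distance-to-similarity transform in the spirit of Shepard [2], are our own illustrative glosses rather than code from the paper:

```python
import math

def minkowski_distance(x, R, w, r):
    """Weighted Minkowski power metric (the inner term of Equation 2)
    between item x and a category representation R, per dimension d."""
    return sum(w_d * abs(x_d - R_d) ** r
               for w_d, x_d, R_d in zip(w, x, R)) ** (1.0 / r)

def summed_similarity(x, representations, w, r):
    """Exemplar-style summed similarity over stored representations,
    each distance passed through an exponential decay.  For a prototype
    model, `representations` would hold only the category average."""
    return sum(math.exp(-minkowski_distance(x, R, w, r))
               for R in representations)

# r = 2 gives the Euclidean (integral-dimension) metric,
# r = 1 gives the city-block (separable-dimension) metric.
d_euclid = minkowski_distance((0, 0), (1, 1), (1, 1), 2)  # sqrt(2)
d_city   = minkowski_distance((0, 0), (1, 1), (1, 1), 1)  # 2.0
```

Note how the two diagonal neighbors are equally far apart under r = 2 regardless of the basis, while under r = 1 the distance depends on how the axes are chosen.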
If the chosen basis set did not match that used by people, then the models would be very poor descriptions of human dimensional biases. A stronger notion of between-category learning is required.

3 Rational Models of Categorization

Rational models of categorization view categorization behavior as the solution to a problem posed by the environment: how best to generalize properties from one object to another. Both exemplar and prototype models can be viewed as restricted versions of rational models of categorization, which also allow interpolations between these two extreme views of representation. Anderson [11] proposed a rational model of categorization which modeled the stimuli in a task as a mixture of clusters. This model treated category labels as features, performing unsupervised learning. The model was extended to supervised learning so that each category is a mixture [17],

P(x_\ell \mid \mathbf{x}_{\ell-1}, \mathbf{s}_{\ell-1}) = \sum_{k=1}^{K} P(s_\ell = k \mid \mathbf{s}_{\ell-1}) P(x_\ell \mid s_\ell = k, \mathbf{x}_{\ell-1}, \mathbf{s}_{\ell-1})    (3)

where x_\ell is the newest example in a category i and \mathbf{x}_{\ell-1} are the other members of category i. x_\ell is a mixture over a set of K components, with the prior probability of x_\ell belonging to a component depending on the component memberships of the other examples, \mathbf{s}_{\ell-1}.

Instead of a single component or a component for each previous item, the mixture model has the flexibility to choose an intermediate number of components. To make full use of this flexibility, Anderson used a nonparametric Chinese Restaurant Process (CRP) prior on the mixing weights, which allows the flexibility of having an unspecified and potentially infinite number of components (i.e., clusters) in the mixture model.
The mixing proportions in a CRP are based on the number of items already included in each cluster,

P(s_\ell = k \mid \mathbf{s}_{\ell-1}) = \begin{cases} \frac{M_k}{\ell - 1 + \alpha} & \text{if } M_k > 0 \text{ (i.e., } k \text{ is old)} \\ \frac{\alpha}{\ell - 1 + \alpha} & \text{if } M_k = 0 \text{ (i.e., } k \text{ is new)} \end{cases}    (4)

where M_k is the number of objects assigned to component k, and \alpha is the dispersion parameter. Using Equation 4, the set of assignments \mathbf{s}_{\ell-1} is built up as a simple sequential stochastic process [18] in which the order of the observations is unimportant [19].

The likelihood of belonging to a component depends on the other members of the cluster. In the case of continuous data, the components were modeled as Gaussian distributions,

P(x_\ell \mid s_\ell = k, \mathbf{x}_{\ell-1}, \mathbf{s}_{\ell-1}) = \prod_d \int_{\mu^{(d)}} \int_{\Sigma^{(d)}} N(x_\ell^{(d)}; \mu^{(d)}, \Sigma^{(d)}) P(\Sigma^{(d)}) P(\mu^{(d)} \mid \Sigma^{(d)}) \, d\mu^{(d)} \, d\Sigma^{(d)}    (5)

where the mean and variance of each Gaussian distribution are given by \mu^{(d)} and \Sigma^{(d)} respectively. The prior for the mean was assumed to be Gaussian given the variance, and the prior for the variance was an inverse-gamma distribution. The likelihood distribution for this model assumes a fixed basis set of dimensions, which must align with the separable dimensions to produce dimensional biases in generalization.

4 A Prior for Dimensional Learning

The rational model presented above assumes a certain basis set of dimensions, and the likelihood distributions are aligned with these dimensions. To allow the learning of the basis set, we first need multivariate versions of the prior distributions over the mean and variance parameters. For the mean parameter, we will use a multivariate Gaussian distribution, and for the covariance matrix, we will use the multivariate generalization of the inverse-gamma distribution, the inverse-Wishart distribution. The inverse-Wishart distribution has its mode at \Sigma / (m + D + 1), where \Sigma is the mean covariance matrix parameter, m is the degrees of freedom, and D is the number of dimensions of the stimulus.
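The sequential process in Equation 4 is easy to simulate. The sketch below is our own illustrative code (not from the paper); note that `random.choices` normalizes its weights, so dividing by \ell - 1 + \alpha is implicit:

```python
import random

def crp_assignments(n_items, alpha, seed=0):
    """Sequentially assign items to clusters with CRP probabilities:
    existing cluster k is chosen in proportion to its count M_k,
    a new cluster in proportion to the dispersion parameter alpha."""
    rng = random.Random(seed)
    counts = []              # M_k for each existing cluster
    assignments = []
    for _ in range(n_items):
        weights = counts + [alpha]   # old clusters, then one new cluster
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)         # open a new cluster
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments
```

Because popular clusters attract new items ("rich get richer"), a small \alpha tends to produce a few large clusters, matching the intermediate-number-of-components behavior described above.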
A covariance matrix is always diagonal under some rotated version of the initial basis set. This new basis set gives the possible dimensional biases for a cluster.

However, using Gaussian distributions for each cluster, with a unimodal prior on the covariance matrix, greatly limits the patterns of generalization that can be produced. For a diagonal covariance matrix, strong generalization along a particular dimension is produced if the covariance matrix has a high variance along that dimension but low variances along the remaining dimensions. Thus, this model can learn to strongly generalize along one dimension, but people often make strong generalizations along multiple dimensions [5], such as in Equation 2 when r < 1. A unimodal prior on covariance matrices cannot produce this behavior, so we use a mixture of inverse-Wishart distributions as a prior for covariance matrices,

p(\Sigma_k \mid \mathbf{u}_{k-1}, \theta) = \sum_{j=1}^{J} p(u_k = j \mid \mathbf{u}_{k-1}) \, p(\Sigma_k \mid \theta_j, u_k = j)    (6)

where \Sigma_k is the covariance parameter for the kth component. For simplicity, the component parameters \Sigma_k are assumed i.i.d. given their class. \theta_j are the parameters of component j, which reflect the expected covariances generated by the jth inverse-Wishart distribution in the mixture. u_k = j is the assignment of parameters \Sigma_k to component j, and the set of all other component assignments is \mathbf{u}_{k-1}. \theta and \mu are the sets of all \theta_j and \mu_k. The means \mu_k of the categories have Gaussian priors, which depend on \Sigma_k, but are otherwise independent of each other.

As before, we will use a nonparametric CRP prior over the component assignments u_k. We now have two infinite mixtures: one that allows a category to be composed of a mixture of clusters, and one that allows the prior for the covariance matrices to be composed of a mixture of inverse-Wishart distributions. The final piece of the model is to specify p(\theta).
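The generative story in Equation 6 can be sketched numerically. The code below is our own illustration (the component parameters are made up, and a finite two-component mixture stands in for the nonparametric prior): it picks a mixture component u_k = j and then draws \Sigma_k from the corresponding inverse-Wishart, using the standard identity that the inverse of a Wishart draw is inverse-Wishart distributed:

```python
import numpy as np

def sample_wishart(df, scale, rng):
    """W ~ Wishart(df, scale), built as a sum of df Gaussian outer products."""
    X = rng.multivariate_normal(np.zeros(len(scale)), scale, size=df)
    return X.T @ X

def sample_inv_wishart(df, psi, rng):
    """Sigma ~ InvWishart(df, psi):  inv(Sigma) ~ Wishart(df, inv(psi))."""
    return np.linalg.inv(sample_wishart(df, np.linalg.inv(psi), rng))

def sample_cluster_cov(thetas, mix_weights, df, rng):
    """Equation 6: choose a component u_k = j, then draw Sigma_k | theta_j."""
    j = rng.choice(len(thetas), p=mix_weights)
    return sample_inv_wishart(df, thetas[j], rng)

rng = np.random.default_rng(0)
# two hypothetical components: one elongated along x, one along y
thetas = [np.diag([4.0, 0.25]), np.diag([0.25, 4.0])]
cov = sample_cluster_cov(thetas, [0.5, 0.5], df=10, rng=rng)
```

Each cluster's covariance thus tends to be elongated along one axis or the other, which is exactly the multimodality a single inverse-Wishart prior cannot express.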
We use another inverse-Wishart prior, but with an identity matrix for the mean parameter, so as not to bias the \theta_j components toward any particular dimension basis set. Figure 1 gives a schematic depiction of the model.

5 Learning the Prior

The categories we learn during development often vary along separable dimensions - and people are sensitive to this variability. The linguistic classification of nouns helps to identify categories that are fixed on one separable dimension and variable on others. Nouns can be classified into count nouns and mass nouns. Count nouns refer to objects that are discrete, such as books, shirts, and cars. Mass nouns refer to objects that appear in continuous quantities, such as grass, steel, and milk. These two types of nouns show an interesting regularity: count nouns are often relatively similar in size but vary greatly in color, while mass nouns are often relatively fixed in color but vary greatly in size.

Smith [7] tested the development of children's dimensionality biases. In this study, experimenters showed participants six green circles that varied in shade and size. The discriminability judgments of adults were used to scale the parameters of the stimuli, so that one step in color caused the same gain in discriminability as one step in size. Participants were asked to group the stimuli into clusters according

Figure 1: Schematic illustration of the hierarchical prior over covariance matrices. The top-level prior is a covariance matrix (shown as the equiprobability curves of a Gaussian) that is not biased towards any dimension. The mid-level priors \theta_j are drawn from an inverse-Wishart distribution centered on the top-level prior. The \theta_j components are used as priors for the covariance matrices of clusters. The plot on the right shows some schematic examples of natural categories that tend to vary along either color or size.
The covariance matrices for these clusters are drawn from an inverse-Wishart prior using one of the \theta_j components.

to their preferences, only being told that they should group the "ones that go together". The partitions of the stimuli into clusters that participants produced tended toward three informative patterns, shown in Figure 2. The Overall Similarity pattern ignores dimension and appears to result from isotropic similarity. The One-dimensional Similarity pattern is more biased towards generalizing along separable dimensions than the Overall Similarity pattern. The strongest dimensional biases are shown by the One-dimensional Identity pattern, with the dimensional match overriding the close isotropic similarity between neighboring stimuli.

Children aged 3 years, 4 years, and 5 years and adults participated in this experiment. There were ten participants in each age group, each participant clustered eight problems, and all dimension-aligned orientations of the stimuli were tested. Figure 2 shows the developmental trend of each of the informative clustering patterns. The tendency to cluster according to Overall Similarity decreased with age, reflecting a reduced influence of isotropic similarity. Clustering according to One-dimensional Similarity increased from 3-year-olds to 5-year-olds, but adults produced few of these patterns. The percentage of One-dimensional Identity clusterings increased with age, and was the dominant response for adults, supporting the idea that strong dimensional biases are learned.

We trained our model with clusters that were aligned with the dimensions of size and color. Half of the clusters varied strongly in color and weakly in size, while the other half varied strongly in size and weakly in color.
The larger standard deviation of the distribution that generated the training stimuli was somewhat smaller than the largest distance between stimuli in the Smith experiment, while the smaller standard deviation was much smaller than the smallest distance between the Smith stimuli. The two dispersion parameters were set to 1, the degrees of freedom for all inverse-Wishart distributions were set to the number of dimensions plus 1, and 0.01 was used for the scale factor of the mean parameters of the inverse-Wishart distributions.^2

^2 The general pattern of the results was only weakly dependent on the parameter settings, but unsupervised learning of the clusters required a small value of the scale factor.

Figure 2: Experiment 2 of Smith [7]. In a free categorization task, the stimuli marked by dots in the top row were grouped by participants. The three critical partitions are shown as circles in the top row of plots. The top bar graph displays the developmental trends for each of the critical partitions. The bottom bar graph displays the trend as the model is trained on a larger number of axis-aligned clusters.

Inference in the model was done as a combination of Gibbs sampling and Metropolis-Hastings algorithms. The assignments of data points to clusters in each class were Gibbs sampled conditioned on the cluster assignments to inverse-Wishart components and the parameters of those components, \theta_j. Following a complete pass over the assignments of data points to clusters, we then Gibbs sampled the assignments of the cluster covariance parameters \Sigma_k to components of the inverse-Wishart mixture prior. After a pass of this mid-level sampling, we resampled \theta_j, the parameters of the inverse-Wishart components, and the prior expected means of each cluster.
This sampling was done using Metropolis-Hastings, with the non-symmetric proposals made from a separate inverse-Wishart distribution. A large finite Dirichlet distribution was used to approximate p(\mathbf{u}). Given the learned \theta and u_k, the predicted probabilities for the Smith experiment were computed exactly.

The predictions of our model as a result of training are shown in Figure 2. The model was trained on 0, 2, 4, 8, and 16 axis-aligned clusters in an unsupervised fashion. For all three patterns, the model shows the same developmental trajectory as the human data. Overall Similarity decreases with the number of trained categories, One-dimensional Similarity increases and then decreases, and One-dimensional Identity patterns are overwhelmingly produced by the fully trained model. The probabilities plotted in the figure are the predicted posterior probabilities of only the partitions that exactly matched the informative patterns, out of all 203 possible partitions of the six stimuli, showing that the patterns in Figure 2 dominated the model's predictions just as they dominated the participants' responses in the free categorization task.

6 Generalization Gradients

Standard models of categorization, such as the prototype or exemplar model, have a variety of mechanisms for producing the dimensional biases seen in experiments with adults. We propose a very different explanation for these dimensional biases.
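Because the inverse-Wishart proposals above are not symmetric, the Metropolis-Hastings acceptance ratio must include the Hastings correction q(x | x') / q(x' | x). A generic sketch of such a step, our own illustration shown on an arbitrary one-dimensional log-density rather than the paper's \theta_j updates:

```python
import math
import random

def mh_step(x, log_target, propose, log_q, rng):
    """One Metropolis-Hastings step with a possibly asymmetric proposal.
    propose(x) draws x' ~ q(. | x); log_q(a, b) evaluates log q(a | b).
    The log_q difference is the Hastings correction for asymmetry."""
    x_new = propose(x)
    log_alpha = (log_target(x_new) - log_target(x)
                 + log_q(x, x_new) - log_q(x_new, x))
    return x_new if math.log(rng.random()) < log_alpha else x

# toy usage: random-walk proposal (symmetric, so log_q is constant)
rng = random.Random(0)
samples, x = [], 0.0
for _ in range(5000):
    x = mh_step(x,
                log_target=lambda v: -0.5 * v * v,       # standard normal
                propose=lambda v: v + rng.gauss(0, 1),
                log_q=lambda a, b: 0.0,
                rng=rng)
    samples.append(x)
```

For the inverse-Wishart proposals described in the text, `log_q` would evaluate the inverse-Wishart log-density of one covariance matrix given the other, and the correction term would no longer cancel.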
In this section we plot generalization gradients, which provide a good feel for how the priors we propose match the mechanisms used in earlier models across a variety of conditions.

Figure 3: Generalization gradients of the exemplar model and the posterior predictive distribution of the model presented in this paper. The dots are the stimuli.

Generalizations of single items are studied by collecting similarity ratings. In this task, participants judge the similarity of two items. In standard models of categorization, similarity ratings are modeled mainly by the exponent in the Minkowski power metric (Equation 2). For rational models, similarity ratings can be modeled as the posterior predictive probability of one item given the second item [20]. The first two columns of Figure 3 give a comparison between the exemplar model and the model we propose for similarity ratings. The central dot is a particular stimulus and the color gradient shows the predicted similarity ratings of all other stimuli. For integral dimensions, a Euclidean metric (r = 2) is used in the exemplar model, which the model we propose matches if it has not been trained on dimension-aligned categories.

For separable dimensions, the exemplar model usually uses a city-block metric (r = 1) [10]. However, experimental evidence shows that dimensions have an even stronger effect than predicted by a city-block metric. In experiments testing violations of the triangle inequality, Tversky and Gati [5] showed that the best-fitting exponent for similarity data is often r < 1. The model we propose can produce this type of similarity prediction by using a prior that is a mixture of covariance matrices, in which each component of the mixture generalizes strongly along one dimension. In a category of one item, which is the case when making similarity judgments with the posterior predictive distribution, it is uncertain which covariance component best describes the category.
This uncertainty results in a generalization gradient that imitates an exponent of r < 1 using Gaussian distributions. As a result, our proposed model predicts violations of the triangle inequality if it has been trained on a set of clusters in which some vary strongly along one dimension and others vary strongly along another dimension. A comparison between this generalization gradient and the exemplar model is shown in the second column of Figure 3.

The second mechanism for dimensional biases in standard models of categorization is selective attention. Selective attention is used to describe biases that occur in categorization experiments, when many items are trained in each category. These biases are implemented in the exemplar model as weights along each dimension, and early in learning there are usually large weights on a small number of separable dimensions [14, 21]. Our proposed model does not have a mechanism for selective attention, but provides a rational explanation for this effect in terms of the strong sampling assumption [13]. If two items are assumed to come from the same cluster, then generalization tends to be along a single dimension that has varied during training (third column of Figure 3). However, if two items are inferred to belong to different clusters, then the generalization gradient corresponds to additive similarity without selective attention (fourth column of Figure 3).

We have shown that the model we have proposed can reproduce the key generalization gradients of the exemplar and prototype models.
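The claim that component uncertainty mimics r < 1 can be checked numerically. In the sketch below (our own illustration, with arbitrary variances), a one-item category's predictive density is a 50/50 average of an x-elongated and a y-elongated Gaussian; at equal Euclidean distance from the item, a probe along an axis scores higher than a diagonal probe, matching the concave, cross-shaped contours that a Minkowski exponent of r < 1 produces:

```python
import math

def gauss2d(x, y, var_x, var_y):
    """Density of an axis-aligned bivariate Gaussian centered at the origin."""
    return (math.exp(-0.5 * (x * x / var_x + y * y / var_y))
            / (2 * math.pi * math.sqrt(var_x * var_y)))

def predictive(x, y):
    """Posterior predictive for a one-item category when it is uncertain
    which covariance component generated the item: average a component
    elongated along x with one elongated along y (variances made up)."""
    return 0.5 * gauss2d(x, y, 4.0, 0.25) + 0.5 * gauss2d(x, y, 0.25, 4.0)

dist = 1.2
on_axis  = predictive(dist, 0.0)                             # probe along a dimension
diagonal = predictive(dist / math.sqrt(2), dist / math.sqrt(2))  # same Euclidean distance
```

Here `on_axis` exceeds `diagonal` even though both probes are equally far from the item in Euclidean terms, so the mixture of axis-aligned Gaussians generalizes more readily along separable dimensions than diagonally.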
The important difference between our model of dimensional biases and these standard categorization models is that we learn the basis set for dimensional biases, assuming these dimensions have proven useful for predicting category structure in the past. Other models must have these dimensions pre-specified. To show that our model is not biased towards a particular basis set, we rotated the training stimuli 45 degrees in space.
The resulting posterior predictive distributions in Figure 3 extended in the same direction as the rotated training categories varied.

7 Discussion

The approach we have outlined in this paper provides a single explanation for dimensional biases, in contrast to standard models of categorization, such as exemplar and prototype models. These standard models assume two distinct mechanisms for producing dimensional biases: a Minkowski metric exponent and attentional weights for each dimension. In our approach, biases in both similarity judgments and categorization experiments are produced by learning covariance matrices that are shared between clusters. For similarity judgments, a single item does not give information about which covariance mixture component was used to generate it. This uncertainty produces similarity judgments that would be best fit with a Minkowski exponent of r < 1. For category judgments, the alignment of the items along a dimension allows the generating covariance mixture component to be inferred, so the judgments will show a bias like that of attentional weights on the dimensions. The difference between tasks drives the different types of dimensional biases in our approach.

We propose that people learn more complex cross-category information than most previous approaches do. Attention to dimensions is learned in connectionist models of categorization by finding the best single set of weights for each dimension in a basis set [15, 16], or by cross-category learning in a Bayesian approach [22]. A more flexible approach is used in associative models of categorization, which allow for different patterns of generalization for different items. One associative model used a Hopfield network to predict different generalizations for solid and non-solid objects [23].
A hierarchical Bayesian model with very similar properties to this associative model derived this result from cross-category learning [24]. The key difference between all these models and our proposal is that they use only a single strong dimensional bias for each item, while we use multiple latent strong dimensional biases for each item, which is needed for modeling both similarity and categorization dimensional biases with a single explanation. The only previous approach we are aware of that learns such complex cross-category information is a Bayesian rule-based model of categorization [25].

The main advantage of our approach over many other models of categorization is that we learn the basis set of dimensions that can display dimensional biases. Our model learns the basis the same way people do, from categories in the environment (as opposed to fitting to human similarity or category judgments). We begin with a feature space of stimuli in which physically similar items are near to each other. Using a version of the Transformed Dirichlet Process [26], a close relation to the Hierarchical Dirichlet Process previously proposed as a unifying model of categorization [17], a mixture of covariance matrices is learned from environmentally plausible training data. Most other models of categorization, including exemplar models [10], prototype models [8], rule-based discriminative models [27], as well as hierarchical Bayesian models for learning features [24, 22] and Bayesian rule-based models [25], must all have a pre-specified basis set.

8 Summary and Conclusions

People generalize categories in two ways: they generalize to stimuli with parameters near to the category, and they generalize to stimuli that match along separable dimensions. Existing models of categorization must assume the dimensions to produce human-like generalization performance.
Our model learns these dimensions from the data: starting with an unbiased prior, the dimensions along which categories vary are learned to be the dimensions important for generalization. After training the model with categories intended to mirror those learned during development, our model reproduces the trajectory of generalization biases as children grow into adults. Using this type of approach, we hope to better tie models of human generalization to the natural world to which we belong.

References

[1] E. Rosch, C. B. Mervis, W. D. Gray, D. M. Johnson, and P. Boyes-Braem. Basic objects in natural categories. Cognitive Psychology, 8:382–439, 1976.

[2] R. N. Shepard. Toward a universal law of generalization for psychological science. Science, 237:1317–1323, 1987.

[3] W. R. Garner. The processing of information and structure. Erlbaum, Hillsdale, NJ, 1974.

[4] J. K. Kruschke. Human category learning: implications for backpropagation models. Connection Science, 5:3–36, 1993.

[5] A. Tversky and I. Gati. Similarity, separability and the triangular inequality. Psychological Review, 93:3–22, 1982.

[6] L. B. Smith and D. G. Kemler. Developmental trends in free classification: Evidence for a new conceptualization of perceptual development. Journal of Experimental Child Psychology, 24:279–298, 1977.

[7] L. B. Smith. A model of perceptual classification in children and adults. Psychological Review, 96:125–144, 1989.

[8] S. K. Reed. Pattern recognition and categorization. Cognitive Psychology, 3:393–407, 1972.

[9] D. L. Medin and M. M. Schaffer. Context theory of classification learning. Psychological Review, 85:207–238, 1978.

[10] R. M. Nosofsky. Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115:39–57, 1986.

[11] J. R. Anderson. The adaptive nature of human categorization.
Psychological Review, 98(3):409–429, 1991.

[12] D. J. Navarro. From natural kinds to complex categories. In Proceedings of CogSci, pages 621–626, Mahwah, NJ, 2006. Lawrence Erlbaum.

[13] J. B. Tenenbaum and T. L. Griffiths. Generalization, similarity, and Bayesian inference. Behavioral and Brain Sciences, 24:629–641, 2001.

[14] M. K. Johansen and T. J. Palmeri. Are there representational shifts in category learning? Cognitive Psychology, 45:482–553, 2002.

[15] J. K. Kruschke. ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99:22–44, 1992.

[16] B. C. Love, D. L. Medin, and T. M. Gureckis. SUSTAIN: A network model of category learning. Psychological Review, 111:309–332, 2004.

[17] T. L. Griffiths, K. R. Canini, A. N. Sanborn, and D. J. Navarro. Unifying rational models of categorization via the hierarchical Dirichlet process. In R. Sun and N. Miyake, editors, Proceedings of CogSci, 2007.

[18] D. Blackwell and J. MacQueen. Ferguson distributions via Pólya urn schemes. The Annals of Statistics, 1:353–355, 1973.

[19] D. Aldous. Exchangeability and related topics. In École d'été de probabilités de Saint-Flour XIII – 1983, pages 1–198. Springer, Berlin, 1985.

[20] T. L. Griffiths, M. Steyvers, and J. B. Tenenbaum. Topics in semantic representation. Psychological Review, 114:211–244, 2007.

[21] R. M. Nosofsky and S. R. Zaki. Exemplar and prototype models revisited: response strategies, selective attention, and stimulus generalization. Journal of Experimental Psychology: Learning, Memory, and Cognition, 28:924–940, 2002.

[22] A. Perfors and J. B. Tenenbaum. Learning to learn categories. In Proceedings of CogSci, 2009.

[23] E. Colunga and L. B. Smith.
From the lexicon to expectations about kinds: a role for associative learning. Psychological Review, 112, 2005.

[24] C. Kemp, A. Perfors, and J. B. Tenenbaum. Learning overhypotheses with hierarchical Bayesian models. Developmental Science, 10:307–321, 2007.

[25] N. D. Goodman, J. B. Tenenbaum, J. Feldman, and T. L. Griffiths. A rational analysis of rule-based concept learning. Cognitive Science, 32:108–154, 2008.

[26] E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. Describing visual scenes using transformed Dirichlet processes. In Advances in Neural Information Processing Systems (NIPS), 2005.

[27] R. M. Nosofsky and T. J. Palmeri. A rule-plus-exception model for classifying objects in continuous-dimension spaces. Psychonomic Bulletin & Review, 5:345–369, 1998.