{"title": "Testing a Bayesian Measure of Representativeness Using a Large Image Database", "book": "Advances in Neural Information Processing Systems", "page_first": 2321, "page_last": 2329, "abstract": "How do people determine which elements of a set are most representative of that set? We extend an existing Bayesian measure of representativeness, which indicates the representativeness of a sample from a distribution, to define a measure of the representativeness of an item to a set. We show that this measure is formally related to a machine learning method known as Bayesian Sets. Building on this connection, we derive an analytic expression for the representativeness of objects described by a sparse vector of binary features. We then apply this measure to a large database of images, using it to determine which images are the most representative members of different sets. Comparing the resulting predictions to human judgments of representativeness provides a test of this measure with naturalistic stimuli, and illustrates how databases that are more commonly used in computer vision and machine learning can be used to evaluate psychological theories.", "full_text": "Testing a Bayesian Measure of Representativeness\n\nUsing a Large Image Database\n\nJoshua T. Abbott\n\nDepartment of Psychology\n\nUniversity of California, Berkeley\n\nBerkeley, CA 94720\n\njoshua.abbott@berkeley.edu\n\nKatherine A. Heller\n\nDepartment of Brain and Cognitive Sciences\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02139\nkheller@mit.edu\n\nZoubin Ghahramani\n\nDepartment of Engineering\nUniversity of Cambridge\nCambridge, CB2 1PZ, U.K.\nzoubin@eng.cam.ac.uk\n\nThomas L. Grif\ufb01ths\n\nDepartment of Psychology\n\nUniversity of California, Berkeley\n\nBerkeley, CA 94720\n\ntom griffiths@berkeley.edu\n\nAbstract\n\nHow do people determine which elements of a set are most representative of that\nset? 
We extend an existing Bayesian measure of representativeness, which indi-\ncates the representativeness of a sample from a distribution, to de\ufb01ne a measure of\nthe representativeness of an item to a set. We show that this measure is formally\nrelated to a machine learning method known as Bayesian Sets. Building on this\nconnection, we derive an analytic expression for the representativeness of objects\ndescribed by a sparse vector of binary features. We then apply this measure to a\nlarge database of images, using it to determine which images are the most repre-\nsentative members of different sets. Comparing the resulting predictions to human\njudgments of representativeness provides a test of this measure with naturalistic\nstimuli, and illustrates how databases that are more commonly used in computer\nvision and machine learning can be used to evaluate psychological theories.\n\n1\n\nIntroduction\n\nThe notion of \u201crepresentativeness\u201d appeared in cognitive psychology as a proposal for a heuristic\nthat people might use in the place of performing a probabilistic computation [1, 2]. For example, we\nmight explain why people believe that the sequence of heads and tails HHTHT is more likely than\nHHHHH to be produced by a fair coin by saying that the former is more representative of the output\nof a fair coin than the latter. This proposal seems intuitive, but raises a new problem: How is rep-\nresentativeness itself de\ufb01ned? Various proposals have been made, connecting representativeness to\nexisting quantities such as similarity [1] (itself an ill-de\ufb01ned concept [3]), or likelihood [2]. Tenen-\nbaum and Grif\ufb01ths [4] took a different approach to this question, providing a \u201crational analysis\u201d\nof representativeness by trying to identify the problem that such a quantity solves. 
They proposed\nthat one sense of representativeness is being a good example of a concept, and then showed how\nthis could be quanti\ufb01ed via Bayesian inference. The resulting model outperformed similarity and\nlikelihood in predicting human representativeness judgments for two kinds of simple stimuli.\nIn this paper, we extend this de\ufb01nition of representativeness, and provide a more comprehensive\ntest of this account using naturalistic stimuli. The question of what makes a good example of a\nconcept is of direct relevance to computer scientists as well as cognitive scientists, providing a way\nto build better systems for retrieving images or documents relevant to a user\u2019s query. However, the\nmodel presented by Tenenbaum and Grif\ufb01ths [4] is overly restrictive in requiring the concept to\nbe pre-de\ufb01ned, and has not been tested in the context of a large-scale information retrieval system.\nWe extend the Bayesian measure of representativeness to apply to the problem of deciding which\nobjects are good examples of a set of objects, show that the resulting model is closely mathematically\nrelated to an existing machine learning method known as Bayesian Sets [5], and compare this model\nto similarity and likelihood as an account of people\u2019s judgments of the extent to which images drawn\nfrom a large database are representative of different concepts. In addition, we show how measuring\nthe representativeness of items in sets can also provide a novel method of \ufb01nding outliers in sets.\nBy extending the Bayesian measure of representativeness to apply to sets of objects and testing it\nwith a large image database, we are taking the \ufb01rst steps towards a closer integration of the methods\nof cognitive science and machine learning. Cognitive science experiments typically use a small set\nof arti\ufb01cial stimuli, and evaluate different models by comparing them to human judgments about\nthose stimuli. 
Machine learning makes use of large datasets, but relies on secondary sources of\n\u201ccognitive\u201d input, such as the labels people have applied to images. We combine these methods by\nsoliciting human judgments to test cognitive models with a large set of naturalistic stimuli. This\nprovides the \ufb01rst experimental comparison of the Bayesian Sets algorithm to human judgments, and\nthe \ufb01rst evaluation of the Bayesian measure of representativeness in a realistic applied setting.\nThe plan of the paper is as follows. Section 2 provides relevant background information, including\npsychological theories of representativeness and the de\ufb01nition of Bayesian Sets. Section 3 then\nintroduces our extended measure of representativeness, and shows how it relates to Bayesian Sets.\nSection 4 describes the dataset derived from a large image database that we use for evaluating this\nmeasure, together with the other psychological models we use for comparison. Section 5 presents\nthe results of an experiment soliciting human judgments about the representativeness of different\nimages. Section 6 provides a second form of evaluation, focusing on identifying outliers from sets.\nFinally, Section 7 concludes the paper.\n\n2 Background\n\nTo approach our main question of which elements of a set are most representative of that set, we \ufb01rst\nreview previous psychological models of representativeness with a particular focus on the rational\nmodel proposed by Tenenbaum and Grif\ufb01ths [4]. We then introduce Bayesian Sets [5].\n\n2.1 Representativeness\n\nWhile the notion of representativeness has been most prominent in the literature on judgment and\ndecision-making, having been introduced by Kahneman and Tversky [1], similar ideas have been\nexplored in accounts of human categorization and inductive inference [6, 7].\nIn these accounts,\nrepresentativeness is typically viewed as a form of similarity between an outcome and a process\nor an object and a concept. 
Assume some data d has been observed, and we want to evaluate\nits representativeness of a hypothesized process or concept h. Then d is representative of h if\nit is similar to the observations h typically generates. Computing similarity requires de\ufb01ning a\nsimilarity metric. In the case where we want to evaluate the representativeness of an outcome to a\nset, we might use metrics of the kind that are common in categorization models: an exemplar model\nde\ufb01nes similarity in terms of the sum of the similarities to the other objects in the set (e.g., [8, 9]),\nwhile a prototype model de\ufb01nes similarity in terms of the similarity to a prototype that captures the\ncharacteristics of the set (e.g., [10]).\nAn alternative to similarity is the idea that representativeness might track the likelihood function\nP (d|h) [11]. The main argument for this proposed equivalence is that the more frequently h leads\nto observing d, the more representative d should be of h. However, people\u2019s judgments from the\ncoin \ufb02ip example with which we started the paper go against this idea of equivalence, since both\n\ufb02ips have equal likelihood yet people tend to judge HHTHT as more representative of a fair coin.\nAnalyses of typicality have also argued against the adequacy of frequency for capturing people\u2019s\njudgments about what makes a good example of a category [6].\nTenenbaum and Grif\ufb01ths [4] took a different approach to this question, asking what problem repre-\nsentativeness might be solving, and then deriving an optimal solution to that problem. This approach\nis similar to that taken in Shepard\u2019s [12] analysis of generalization, and to Anderson\u2019s [13] idea of\nrational analysis. The resulting rational model of representativeness takes the problem to be one\nof selecting a good example, where the best example is the one that best provides evidence for the\ntarget process or concept relative to possible alternatives. 
Given some observed data d and a set of\nhypothetical sources, H, we assume that a learner uses Bayesian inference to infer which h \u2208 H\ngenerated d. Tenenbaum and Griffiths [4] defined the representativeness of d for h to be the evidence\nthat d provides in favor of a specific h relative to its alternatives,\n\n$$R(d, h) = \\log \\frac{P(d \\mid h)}{\\sum_{h' \\neq h} P(d \\mid h')\\, P(h')}, \\tag{1}$$\n\nwhere P(h\u2032) in the denominator is the prior distribution on hypotheses, re-normalized over h\u2032 \u2260 h.\n\n2.2 Bayesian Sets\n\nIf given a small set of items such as \u201cketchup\u201d, \u201cmustard\u201d, and \u201cmayonnaise\u201d and asked to produce\nother examples that fit into this set, one might give examples such as \u201cbarbecue sauce\u201d or \u201choney\u201d.\nThis task is an example of clustering on-demand, in which the original set of items represents some\nconcept or cluster such as \u201ccondiment\u201d and we are to find other items that would fit appropriately\ninto this set. Bayesian Sets is a formalization of this process in which items are ranked by a model-based\nprobabilistic scoring criterion, measuring how well they fit into the original cluster [5].\nMore formally, given a data collection D, and a subset of items Ds = {x1, . . . 
, xN} \u2282 D representing\na concept, the Bayesian Sets algorithm ranks an item x\u2217 \u2208 {D \\ Ds} by the following scoring\ncriterion\n\n$$\\mathrm{Bscore}(x^*) = \\frac{p(x^*, D_s)}{p(x^*)\\, p(D_s)} \\tag{2}$$\n\nThis ratio intuitively compares the probability that x\u2217 and Ds were generated by some statistical\nmodel with the same, though unknown, model parameters \u03b8, versus the probability that x\u2217 and Ds\nwere generated by some statistical model with different model parameters \u03b81 and \u03b82.\nEach of the three terms in Equation 2 is a marginal likelihood and can be expressed as one of\nthe following integrals over \u03b8, since the model parameter is assumed to be unknown:\n\n$$p(x^*) = \\int p(x^* \\mid \\theta)\\, p(\\theta)\\, d\\theta, \\qquad p(D_s) = \\int \\Big[ \\prod_{n=1}^{N} p(x_n \\mid \\theta) \\Big] p(\\theta)\\, d\\theta, \\qquad p(x^*, D_s) = \\int \\Big[ \\prod_{n=1}^{N} p(x_n \\mid \\theta) \\Big] p(x^* \\mid \\theta)\\, p(\\theta)\\, d\\theta.$$\n\nFor computational efficiency reasons, Bayesian Sets is typically run on binary data. Thus, each\nitem in the data collection, xi \u2208 D, is represented as a binary feature vector xi = (xi1, . . . , xiJ)\nwhere xij \u2208 {0, 1}, and defined under a model in which each element of xi has an independent\nBernoulli distribution and conjugate Beta prior,\n\n$$p(x_i \\mid \\theta) = \\prod_j \\theta_j^{x_{ij}} (1 - \\theta_j)^{1 - x_{ij}}, \\qquad p(\\theta \\mid \\alpha, \\beta) = \\prod_j \\frac{\\Gamma(\\alpha_j + \\beta_j)}{\\Gamma(\\alpha_j)\\, \\Gamma(\\beta_j)}\\, \\theta_j^{\\alpha_j - 1} (1 - \\theta_j)^{\\beta_j - 1}.$$\n\nUnder these assumptions, the scoring criterion for Bayesian Sets reduces to\n\n$$\\mathrm{Bscore}(x^*) = \\frac{p(x^*, D_s)}{p(x^*)\\, p(D_s)} = \\prod_j \\frac{\\alpha_j + \\beta_j}{\\alpha_j + \\beta_j + N} \\left( \\frac{\\tilde{\\alpha}_j}{\\alpha_j} \\right)^{x^*_j} \\left( \\frac{\\tilde{\\beta}_j}{\\beta_j} \\right)^{1 - x^*_j} \\tag{3}$$\n\nwhere $\\tilde{\\alpha}_j = \\alpha_j + \\sum_{n=1}^{N} x_{nj}$ and $\\tilde{\\beta}_j = \\beta_j + N - \\sum_{n=1}^{N} x_{nj}$. The logarithm of this score is linear\nin x and can be computed efficiently as\n\n$$\\log \\mathrm{Bscore}(x^*) = c + \\sum_j s_j x^*_j \\tag{4}$$\n\nwhere $c = \\sum_j \\big[ \\log(\\alpha_j + \\beta_j) - \\log(\\alpha_j + \\beta_j + N) + \\log \\tilde{\\beta}_j - \\log \\beta_j \\big]$, $s_j = \\log \\tilde{\\alpha}_j - \\log \\alpha_j - \\log \\tilde{\\beta}_j + \\log \\beta_j$, and $x^*_j$ is the jth component of x\u2217.\nThe Bayesian Sets method has been tested with success on numerous datasets, over various applications\nincluding content-based image retrieval [14] and analogical reasoning with relational data [15].\nMotivated by this method, we now turn to extending the previous measure of representativeness for\na sample from a distribution, to define a measure of representativeness for an item to a set.\n\n3 A Bayesian Measure of Representativeness for Sets of Objects\n\nThe Bayesian measure of representativeness introduced by Tenenbaum and Griffiths [4] indicated\nthe representativeness of data d for a hypothesis h. However, in many cases we might not know what\nstatistical hypothesis best describes the concept that we want to illustrate through an example. 
For\ninstance, in an image retrieval problem, we might just have a set of images that are all assigned to the\nsame category, without a clear idea of the distribution that characterizes that category. In this section,\nwe show how to extend the Bayesian measure of representativeness to indicate the representativeness\nof an element of a set, and how this relates to the Bayesian Sets method summarized above.\nFormally, we have a set of data Ds and we want to know how representative an element d of that set\nis of the whole set. We can perform an analysis similar to that given for the representativeness of d\nto a hypothesis, and obtain the expression\n\n$$R(d, D_s) = \\log \\frac{P(d \\mid D_s)}{\\sum_{D' \\neq D_s} P(d \\mid D')\\, P(D')} \\tag{5}$$\n\nwhich is simply Equation 1 with hypotheses replaced by datasets. The quantities that we need to\ncompute to apply this measure, P(d|Ds) and P(D\u2032), we obtain by marginalizing over all hypotheses.\nFor example, $P(d \\mid D_s) = \\sum_h P(d \\mid h)\\, P(h \\mid D_s)$, being the posterior predictive distribution associated\nwith Ds. If the hypotheses correspond to the continuous parameters of a generative model, then this\nis better expressed as $P(d \\mid D_s) = \\int P(d \\mid \\theta)\\, P(\\theta \\mid D_s)\\, d\\theta$.\nIn the case where the set of possible datasets that is summed over in the denominator is large, this\ndenominator will approximate $\\sum_{D'} P(d \\mid D')\\, P(D')$, which is just P(d). This allows us to observe that\nthis measure of representativeness will actually closely approximate the logarithm of the quantity\nBscore produced by Bayesian Sets for the dataset Ds, with\n\n$$R(d, D_s) = \\log \\frac{P(d \\mid D_s)}{\\sum_{D' \\neq D_s} P(d \\mid D')\\, P(D')} \\approx \\log \\frac{P(d \\mid D_s)}{P(d)} = \\log \\frac{P(d, D_s)}{P(d)\\, P(D_s)} = \\log \\mathrm{Bscore}(d)$$\n\nThis relationship provides a link between the cognitive science literature on representativeness and\nthe machine learning literature on information retrieval, and a new way to evaluate psychological\nmodels of representativeness.\n\n4 Evaluating Models of Representativeness Using Image Databases\n\nHaving developed a measure of the representativeness of an item in a set of objects, we now focus\non the problem of evaluating this measure. The evaluation of psychological theories has historically\ntended to use simple artificial stimuli, which provide precision at the cost of ecological validity. In\nthe case of representativeness, the stimuli previously used by Tenenbaum and Griffiths [4] to evaluate\ndifferent representativeness models consisted of 4 coin flip sequences and 45 arguments based on\npredicates applied to a set of 10 mammals. One of the aims of this paper is to break the general trend\nof using such restricted kinds of stimuli, and the formal relationship between our rational model and\nBayesian Sets allows us to do so. Any dataset that can be represented as a sparse binary matrix can\nbe used to test the predictions of our measure.\nWe formulate our evaluation problem as one of determining how representative an image is of a\nlabeled set of images. 
Using an existing image database of naturalistic scenes, we can better test\nthe predictions of different representativeness theories with stimuli much more in common with the\nenvironment humans naturally confront. In the rest of this section, we present the dataset used for\nevaluation and outline the implementations of existing models of representativeness we compare our\nrational Bayesian model against.\n\n4.1 Dataset\n\nWe use the dataset presented in [14], a subset of images taken from the Corel database commonly\nused in content-based image retrieval systems. The images in the dataset are partitioned into 50\nlabeled sets depicting unique categories, with varying numbers of images in each set (the mean is\n264). The dataset is of particular interest for testing models of representativeness as each image\nfrom the Corel database comes with multiple labels given by human judges. The labels have been\ncriticized for not always being of high quality [16], which provides an additional (realistic) challenge\nfor the models of representativeness that we aim to evaluate.\n\nAlgorithm 1 Representativeness Framework\ninput: a set of items, Dw, for a particular category label w\nfor each item xi \u2208 Dw do\n    let Dwi = {Dw \\ xi}\n    compute score(xi, Dwi)\nend for\nrank items in Dw by this score\noutput: ranked list of items in Dw\n\nFigure 1: Results of the Bayesian model applied to the set labelled coast. (a) The top nine ranked\nimages. (b) The bottom nine ranked images.\n\nThe images in this dataset are represented as 240-dimensional feature vectors, composed of 48\nGabor texture features, 27 Tamura texture features, and 165 color histogram features. The images\nwere additionally preprocessed through a binarization stage, transforming the entire dataset into a\nsparse binary matrix that represents the features which most distinguish each image from the rest of\nthe dataset. 
Details of the construction of this feature representation are presented in [14].\n\n4.2 Models of Representativeness\n\nWe compare our Bayesian model against a likelihood model and two similarity models: a prototype\nmodel and an exemplar model. We build upon a simple leave-one-out framework to allow a fair\ncomparison of these different representativeness models. Given a set of images with a particular\ncategory label, we iterate through each image in the set and compute a score for how well this image\nrepresents the rest of the set (see Algorithm 1). In this framework, only score(xi,Dwi) varies across\nthe different models. We present the different ways to compute this score below.\nBayesian model. Since we have already shown the relationship between our rational measure and\nBayesian Sets, the score in this model is computed ef\ufb01ciently via Equation 2. The hyperparameters\n\u03b1 and \u03b2 are set empirically from the entire dataset, \u03b1 = \u03bam, \u03b2 = \u03ba(1 \u2212 m), where m is the mean\nof x over all images, and \u03ba is a scaling factor. An example of using this measure on the set of 299\nimages for category label coast is presented in Figure 1. Panels (a) and (b) of this \ufb01gure show the\ntop nine and bottom nine ranked images, respectively, where it is quite apparent that the top ranked\nimages depict a better set of coast examples than the bottom rankings. It also becomes clear how\npoorly this label applies to some of the images in the bottom rankings, which is an important issue\nif using the labels provided with the Corel database as part of a training set for learning algorithms.\nLikelihood model. This model treats representative judgments of an item x\u2217 as p(x\u2217|Ds) for a set\nDs = {x1, . . . , xN}. 
Since this probability can also be expressed as p(x\u2217,Ds)/p(Ds), we can derive an\nefficient scheme for computing the score similar to the Bayesian Sets scoring criterion by making\nthe same model assumptions. The likelihood model scoring criterion is\n\n$$\\mathrm{Lscore}(x^*) = \\frac{p(x^*, D_s)}{p(D_s)} = \\prod_j \\frac{1}{\\alpha_j + \\beta_j + N} \\big( \\tilde{\\alpha}_j \\big)^{x^*_j} \\big( \\tilde{\\beta}_j \\big)^{1 - x^*_j} \\tag{6}$$\n\nwhere $\\tilde{\\alpha}_j = \\alpha_j + \\sum_{n=1}^{N} x_{nj}$ and $\\tilde{\\beta}_j = \\beta_j + N - \\sum_{n=1}^{N} x_{nj}$. The logarithm of this score is also\nlinear in x and can be computed efficiently as\n\n$$\\log \\mathrm{Lscore}(x^*) = c + \\sum_j w_j x^*_j \\tag{7}$$\n\nwhere $c = \\sum_j \\big[ \\log \\tilde{\\beta}_j - \\log(\\alpha_j + \\beta_j + N) \\big]$ and $w_j = \\log \\tilde{\\alpha}_j - \\log \\tilde{\\beta}_j$. The hyperparameters \u03b1 and\n\u03b2 are initialized to the same values used in the Bayesian model.\nPrototype model. In this model we define a prototype vector xproto to be the modal features for a\nset of items Ds. The similarity measure then becomes\n\n$$\\mathrm{Pscore}(x^*) = \\exp\\{ -\\lambda\\, \\mathrm{dist}(x^*, x_{\\mathrm{proto}}) \\} \\tag{8}$$\n\nwhere dist(\u00b7,\u00b7) is the Hamming distance between the two vectors and \u03bb is a free parameter. Since\nwe are primarily concerned with ranking images, \u03bb does not need to be optimized as it plays the role\nof a scaling constant.\nExemplar model. We define the exemplar model using a similar scoring metric to the prototype\nmodel, except rather than computing the distance of x\u2217 to a single prototype, we compute a distance\nfor each item in the set Ds. Our similarity measure is thus computed as\n\n$$\\mathrm{Escore}(x^*) = \\sum_{x_j \\in D_s} \\exp\\{ -\\lambda\\, \\mathrm{dist}(x^*, x_j) \\} \\tag{9}$$\n\nwhere dist(\u00b7,\u00b7) is the Hamming distance between two vectors and \u03bb is a free parameter. 
In this\ncase, \u03bb does need to be optimized as the sum means that different values for \u03bb can result in different\noverall similarity scores.\n\n5 Modeling Human Ratings of Representativeness\n\nGiven a set of images provided with a category label, how do people determine which images are\ngood or bad examples of that category? In this section we present an experiment which evaluates\nour models through comparison with human judgments of the representativeness of images.\n\n5.1 Methods\n\nA total of 500 participants (10 per category) were recruited via Amazon Mechanical Turk and\ncompensated $0.25. The stimuli were created by identifying the top 10 and bottom 10 ranked images\nfor each of the 50 categories for the Bayesian, likelihood, and prototype models and then taking the\nunion of these sets for each category. The exemplar model was excluded in this process as it required\noptimization of its \u03bb parameter, meaning that the best and worst images could not be determined in\nadvance. The result was a set of 1809 images, corresponding to an average of 36 images per\ncategory. Participants were shown a series of images and asked to rate how good an example each image\nwas of the assigned category label. The order of images presented was randomized across subjects.\nImage quality ratings were made on a scale of 1-7, with a rating of 1 meaning the image is a very\nbad example and a rating of 7 meaning the image is a very good example.\n\n5.2 Results\n\nOnce the human ratings were collected, we computed the mean ratings for each image and the mean\nof the top 10 and bottom 10 results for each algorithm used to create the stimuli. We also computed\nbounds for the ratings based on the optimal set of top 10 and bottom 10 images per category. These\nare the images which participants rated highest and lowest, regardless of which algorithm was used\nto create the stimuli. The mean rating for the optimal top 10 images was slightly less than the\nhighest possible rating allowed (m = 6.018, se = 0.074), while the mean rating for the optimal\nbottom 10 images was significantly higher than the lowest possible rating allowed (m = 2.933,\nse = 0.151). The results are presented in Figure 2. The Bayesian model had the overall highest\nratings for its top 10 rankings (m = 5.231, se = 0.026) and the overall lowest ratings for its\nbottom 10 rankings (m = 3.956, se = 0.031). The other models performed significantly worse,\nwith likelihood giving the next highest top 10 (m = 4.886, se = 0.028), and next lowest bottom\n10 (m = 4.170, se = 0.031), and prototype having the lowest top 10 (m = 4.756, se = 0.028),\nand highest bottom 10 (m = 4.249, se = 0.031). We tested for statistical significance via pairwise\nt-tests on the mean differences of the top and bottom 10 ratings over all 50 categories, for each pair\nof models. The Bayesian model outperformed both other algorithms (p < .001).\n\nFigure 2: Mean quality ratings of the top 10 and bottom 10 rankings of the different representativeness\nmodels over 50 categories. Error bars show one standard error. The vertical axis is bounded by\nthe best possible top 10 ratings and the worst possible bottom 10 ratings across categories.\n\nAs a second analysis, we ran a Spearman rank-order correlation to examine how well the actual\nscores from the models fit with the entire set of human judgments. Although we did not explicitly\nask participants to rank images, their quality ratings implicitly provide an ordering on the images\nthat can be compared against the models. This also gives us an opportunity to evaluate the exemplar\nmodel, optimizing its \u03bb parameter to maximize the fit to the human data. 
To perform this correlation\nwe recorded the model scores over all images for each category, and then computed the correlation of\neach model with the human judgments within that category. Correlations were then averaged across\ncategories. The Bayesian model had the best mean correlation (\u03c1 = 0.352), while likelihood (\u03c1 =\n0.220), prototype (\u03c1 = 0.160), and the best exemplar model (\u03bb = 2.0, \u03c1 = 0.212) all performed\nless well. Paired t-tests showed that the Bayesian model produced statistically signi\ufb01cantly better\nperformance than the other three models (all p < .01).\n\n5.3 Discussion\n\nOverall, the Bayesian model of representativeness provided the best account of people\u2019s judgments\nof which images were good and bad examples of the different categories. The mean ratings over the\nentire dataset were best predicted by our model, indicating that on average, the model predictions\nfor images in the top 10 results were deemed of high quality and the predictions for images in the\nbottom 10 results were deemed of low quality. Since the images from the Corel database come\nwith labels given by human judges, few images are actually very bad examples of their prescribed\nlabels. This explains why the ratings for the bottom 10 images are not much lower. Additionally,\nthere was some variance as to which images the Mechanical Turk workers considered to be \u201cmost\nrepresentative\u201d. This explains why the ratings for the top 10 images are not much higher, and thus\nwhy the difference between top and bottom 10 on average is not larger. 
When comparing the actual\nscores from the different models against the ranked order of human quality ratings, the Bayesian\naccount was also significantly more accurate than the other models. While the actual correlation\nvalue was less than 1, the dataset was rather varied in terms of quality for each category and thus it\nwas not expected to be a perfect correlation. The methods of the experiment were also not explicitly\ntesting for this effect, providing another source of variation in the results.\n\n6 Finding Outliers in Sets\n\nMeasuring the representativeness of items in sets can also provide a novel method of finding outliers\nin sets. An outlier is defined as an observation that appears to deviate markedly from other members\nof the sample in which it occurs [17]. Since models of representativeness can be used to rank items\nin a set by how good an example they are of the entire set, outliers should receive low rankings.\nThe performance of these different measures in detecting outliers provides another indirect means\nof assessing their quality as measures of representativeness.\n\nTable 1: Model comparisons for the outlier experiment\nModel            Average Outlier Position    S.E.\nBayesian Sets    0.805                       \u00b1 0.014\nLikelihood       0.779                       \u00b1 0.013\nPrototype        0.734                       \u00b1 0.015\nExemplar         0.734                       \u00b1 0.016\n\nTo empirically test this idea we can take an image from a particular category and inject it into\nall other categories, and see whether the different measures can identify it as an outlier. To find\na good candidate image we used the top ranking image per category as ranked by the Bayesian\nmodel. We justify this method because the Bayesian model had the best performance in predicting\nhuman quality judgments. Thus, the top ranked image for a particular category is assumed to be a\nbad example of the other categories. 
We evaluated how low this outlier was ranked by each of the\nrepresentativeness measures 50 times, testing the models with a single injected outlier from each\ncategory to get a more robust measure. The \ufb01nal evaluation was based on the normalized outlier\nranking for each category (position of outlier divided by total number of images in the category),\naveraged over the 50 injections. The closer this quantity is to 1, the lower the ranking of outliers.\nThe results of this analysis are depicted in Table 1, where it can be seen that the Bayesian model\noutperforms the other models. It is interesting to note that these measures are all quite distant from\n1. We interpret this as another indication of the noisiness of the original image labels in the dataset\nsince there were a number of images in each category that were ranked lower than the outlier.\n\n7 Conclusions\n\nWe have extended an existing Bayesian model of representativeness to handle sets of items and\nshowed how it closely approximates a method of clustering on-demand \u2013 Bayesian Sets \u2013 that had\nbeen developed in machine learning. We exploited this relationship to allow us to evaluate a set\nof psychological models of representativeness using a large database of naturalistic images. Our\nBayesian measure of representativeness signi\ufb01cantly outperformed other proposed accounts in pre-\ndicting human judgments of how representative images were of different categories. These results\nprovide strong evidence for this characterization of representativeness, and a new source of valida-\ntion for the Bayesian Sets algorithm. We also introduced a novel method of detecting outliers in sets\nof data using our representativeness measure, and showed that it outperformed other measures. 
We\nhope that the combination of methods from cognitive science and computer science that we used\nto obtain these results is the first step towards closer integration between these disciplines, linking\npsychological theories and behavioral methods to sophisticated algorithms and large databases.\n\nAcknowledgments. This work was supported by grants IIS-0845410 from the National Science Foundation and\nFA-9550-10-1-0232 from the Air Force Office of Scientific Research to TLG and a National Science Foundation\nPostdoctoral Fellowship to KAH.\n\nReferences\n[1] D. Kahneman and A. Tversky. Subjective probability: A judgment of representativeness. Cognitive Psychology, 3:430\u2013454, 1972.\n[2] G. Gigerenzer. On narrow norms and vague heuristics: A reply to Kahneman and Tversky (1996). Psychological Review, 103:592, 1996.\n[3] G. L. Murphy and D. L. Medin. The role of theories in conceptual coherence. Psychological Review, 92:289\u2013316, 1985.\n[4] J. B. Tenenbaum and T. L. Griffiths. The rational basis of representativeness. In Proc. 23rd Annu. Conf. Cogn. Sci. Soc., pages 1036\u20131041, 2001.\n[5] Z. Ghahramani and K. A. Heller. Bayesian sets. In Advances in Neural Information Processing Systems, volume 18, 2005.\n[6] C. B. Mervis and E. Rosch. Categorization of natural objects. Annual Review of Psychology, 32:89\u2013115, 1981.\n[7] D. N. Osherson, E. E. Smith, O. Wilkie, A. Lopez, and E. Shafir. Category-based induction. Psychological Review, 97:185, 1990.\n[8] D. L. Medin and M. M. Schaffer. Context theory of classification learning. Psychological Review, 85:207\u2013238, 1978.\n[9] R. M. Nosofsky. Attention and learning processes in the identification and categorization of integral stimuli. Journal of Experimental Psychology: Learning, Memory, and Cognition, 13:87\u2013108, 1987.\n[10] S. K. Reed. Pattern recognition and categorization. 
Cognitive Psychology, 3:393\u2013407, 1972.\n[11] G. Gigerenzer and U. Hoffrage. How to improve Bayesian reasoning without instruction: Frequency formats. Psychological Review, 102:684, 1995.\n[12] R. N. Shepard. Towards a universal law of generalization for psychological science. Science, 237:1317\u20131323, 1987.\n[13] J. R. Anderson. The adaptive character of thought. Erlbaum, Hillsdale, NJ, 1990.\n[14] K. A. Heller and Z. Ghahramani. A simple Bayesian framework for content-based image retrieval. IEEE Conference on Computer Vision and Pattern Recognition, 2:2110\u20132117, 2006.\n[15] R. Silva, K. A. Heller, and Z. Ghahramani. Analogical reasoning with relational Bayesian sets. International Conference on AI and Statistics, 2007.\n[16] H. M\u00fcller, S. Marchand-Maillet, and T. Pun. The truth about Corel - evaluation in image retrieval. International Conference on Image and Video Retrieval, 2002.\n[17] F. Grubbs. Procedures for detecting outlying observations in samples. Technometrics, 11:1\u201321, 1969.\n", "award": [], "sourceid": 1244, "authors": [{"given_name": "Joshua", "family_name": "Abbott", "institution": null}, {"given_name": "Katherine", "family_name": "Heller", "institution": null}, {"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}, {"given_name": "Thomas", "family_name": "Griffiths", "institution": null}]}