{"title": "Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies", "book": "Advances in Neural Information Processing Systems", "page_first": 1842, "page_last": 1850, "abstract": "Learning a visual concept from a small number of positive examples is a significant challenge for machine learning algorithms. Current methods typically fail to find the appropriate level of generalization in a concept hierarchy for a given set of visual examples. Recent work in cognitive science on Bayesian models of generalization addresses this challenge, but prior results assumed that objects were perfectly  recognized. We present an algorithm for learning visual concepts directly from images, using  probabilistic predictions generated by visual classifiers as the input to a Bayesian generalization model. As no existing challenge data tests this paradigm, we collect and make available a new, large-scale dataset for visual concept learning using the ImageNet hierarchy as the source of possible concepts, with human annotators to provide ground truth labels as to whether a new image is an instance of each concept using a paradigm similar to that used in experiments studying word learning in children.  
We compare the performance of our system to several baseline algorithms, and show that a significant advantage results from combining visual classifiers with the ability to identify an appropriate level of abstraction using Bayesian generalization.", "full_text": "Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies\n\nYangqing Jia1, Joshua Abbott2, Joseph Austerweil3, Thomas Griffiths2, Trevor Darrell1\n\n1UC Berkeley EECS\n\n2Dept of Psychology, UC Berkeley\n\n3Dept of Cognitive, Linguistic, and Psychological Sciences, Brown University\n\n{jiayq, joshua.abbott, tom griffiths, trevor}@berkeley.edu\n\njoseph austerweil@brown.edu\n\nAbstract\n\nLearning a visual concept from a small number of positive examples is a significant challenge for machine learning algorithms. Current methods typically fail to find the appropriate level of generalization in a concept hierarchy for a given set of visual examples. Recent work in cognitive science on Bayesian models of generalization addresses this challenge, but prior results assumed that objects were perfectly recognized. We present an algorithm for learning visual concepts directly from images, using probabilistic predictions generated by visual classifiers as the input to a Bayesian generalization model. As no existing challenge data tests this paradigm, we collect and make available a new, large-scale dataset for visual concept learning using the ImageNet hierarchy as the source of possible concepts, with human annotators to provide ground truth labels as to whether a new image is an instance of each concept using a paradigm similar to that used in experiments studying word learning in children. 
We compare the performance of our system to several baseline algorithms, and show that a significant advantage results from combining visual classifiers with the ability to identify an appropriate level of abstraction using Bayesian generalization.\n\n1 Introduction\n\nMachine vision methods have achieved considerable success in recent years, as evidenced by performance on major challenge problems [4, 7], where strong performance has been obtained for assigning one of a large number of labels to each of a large number of images. However, this research has largely focused on a fairly narrow task: assigning a label (or sometimes multiple labels) to a single image at a time. This task is quite different from that faced by a human child trying to learn a new word, where the child is provided with multiple positive examples and has to generalize appropriately. Even young children are able to learn novel visual concepts from very few positive examples [3], something that still poses a challenge for machine vision systems. In this paper, we define a new challenge task for computer vision – visual concept learning – and provide a first account of a system that can learn visual concepts from a small number of positive examples.\n\nIn our visual concept learning task, a few example images from a visual concept are given and the system has to indicate whether a new image is or is not an instance of the target concept. A key aspect of this task is determining the degree to which the concept should be generalized [21] when multiple concepts are logically consistent with the given examples. For example, consider the concepts represented by the examples in Figure 1 (a-c) respectively, and the task of predicting whether new images (d-e) belong to them or not. The ground truth from human annotators reveals that the level of generalization varies according to the conceptual diversity, with greater diversity leading to broader generalization. 
In the examples shown in Figure 1, people might identify the concepts as (a) Dalmatians, (b) all dogs, and (c) all animals, but not generalize beyond these levels, although no negative images forbid doing so.\n\nFigure 1: Visual concept learning. (a-c): positive examples of three visual concepts. Even without negative data, people are able to learn these concepts: (a) Dalmatians, (b) dogs and (c) animals. Note that although (a) contains valid examples of dogs and both (a) and (b) contain valid examples of animals, people restrict the scope of generalization to more specific concepts, and find it easy to make judgments about whether novel images such as (d) and (e) are instances of the same concepts – the task we refer to as visual concept learning.\n\nDespite recent successes in large-scale category-level object recognition, we will show that state-of-the-art machine vision systems fail to exhibit such patterns of generalization, and have great difficulty learning without negative examples.\n\nBayesian models of generalization [1, 18, 21] account for these phenomena, determining the scope of a novel concept (e.g., does the concept refer to Dalmatians, all dogs, or all animals?) in a similar manner to people. However, these models were developed by cognitive scientists interested in analyzing human cognition, and require examples to be manually labeled as belonging to a particular leaf node in a conceptual hierarchy. This is reasonable if one is asking whether proposed psychological models explain human behavior, but prevents the models from being used to automatically solve visual concept learning problems for a robot or intelligent agent.\n\nWe bring these two threads of research together, using machine vision systems to assign novel images to locations within a conceptual hierarchy and a Bayesian generalization model to determine how to generalize from these examples. 
This results in a system that comes closer to human performance than state-of-the-art machine vision baselines. As an additional contribution, since no existing dataset adequately tests human-like visual concept learning, we have collected and made available to the community the first large-scale dataset for evaluating whether machine vision algorithms can learn concepts that agree with human perception and label new unseen images, with ground-truth labeling obtained from human annotators on Amazon Mechanical Turk. We believe that this new task provides challenges beyond the conventional object classification paradigms.\n\n2 Background\n\nIn machine vision, scant attention has been given to the problem of learning a visual concept from a few positive examples as we have defined it. When the problem has been addressed, it has largely been considered from a hierarchical regularization [16] or transfer learning [14] perspective, assuming that a fixed set of labels is given and exploiting transfer or regularization within a hierarchy. Mid-level representations based on attributes [8, 13] focus on extracting common attributes such as “fluffy” and “aquatic” that can be used to semantically describe object categories better than low-level features. Transfer learning approaches have been proposed to jointly learn classifiers with structured regularization [14].\n\nOf all these previous efforts, our paper is most closely related to work that uses object hierarchies to support classification. Salakhutdinov et al. [16] proposed learning a set of object classifiers with regularization using hierarchical knowledge, which improves the classification of objects at the leaves of the hierarchy. However, this work did not address the problem of determining the level of abstraction within the hierarchy at which to make generalizations, which is a key aspect of the visual concept learning problem. 
Deng et al. [5] proposed predicting object labels only to a granularity with which the classifier is confident, but their goal was minimizing structured loss rather than mimicking human generalization.\n\nExisting models from cognitive science mainly focus on understanding human generalization judgments within fairly restricted domains. Tenenbaum and colleagues [18, 20] proposed mathematical abstractions for the concept learning problem, building on previous work on models of generalization by Shepard [17]. Xu and Tenenbaum [21] and Abbott et al. [1] conducted experiments with human participants that provided support for this Bayesian generalization framework. Xu and Tenenbaum [21] showed participants one or more positive examples of a novel word (e.g., “these three objects are Feps”), while manipulating the taxonomic relationship between the examples. For instance, participants could see three toy Dalmatians, three toy dogs, or three toy animals. Participants were then asked to identify the other “Feps” among a variety of both taxonomically related and unrelated objects presented as queries. If the positive examples were three Dalmatians, people might be asked whether other Dalmatians, dogs, and animals are Feps, along with other objects such as vegetables and vehicles. Subsequent work has used the same basic methodology in experiments using a manually collated set of images as stimuli [1].\n\nAll of these models assume that objects are already mapped onto locations in a perceptual space or conceptual hierarchy. Thus, they are not able to make predictions about genuinely novel stimuli. Linking such generalization models to direct perceptual input is necessary in order to be able to use this approach to learn visual concepts directly from images.\n\n3 A Large-scale Concept Learning Dataset\n\nExisting datasets (PASCAL [7], ILSVRC [2], etc.) 
test supervised learning performance with relatively large amounts of positive and negative examples available, with ground truth as a set of mutually-exclusive labels. To our knowledge, no existing dataset accurately captures the task we refer to as visual concept learning: to learn a novel word from a small set of positive examples, as humans do. In this section, we describe in detail our effort to make available a dataset for such a task.\n\n3.1 Test Procedure\n\nIn our test procedure, an agent is shown n example images (n = 5 in our dataset) sampled from a node (which may be a leaf or an intermediate node) of the ImageNet synset tree, and is then asked whether other new images sampled from ImageNet belong to the concept or not. The scores that the agent gives are then compared against the human ground truth that we collect, and we use precision-recall curves to evaluate performance.\n\nFrom a machine vision perspective, one may ask whether this visual concept learning task differs from the conventional ImageNet-defined classification problem – identifying the node from which the examples are drawn, and then answering yes for images in the subtree corresponding to the node, and no for images not from the node. In fact, we will show in Section 5.2 that this approach fails to explain how people learn visual concepts: human performance in the above task exhibits much more sophisticated concept learning behavior than simply identifying the node itself. In addition, with no negative images, a conventional classification model fails to distinguish between nodes that are both valid candidates (e.g., “dogs” and “animals” when shown a set of dog images). 
This makes our visual concept learning task essentially different from, and richer than, a conventional classification problem.\n\n3.2 Automatic Generation of Examples and Queries\n\nLarge-scale experimentation requires an efficient scheme to generate test data across varying levels of a concept hierarchy. To this end, we developed a fully-automated procedure for constructing a large-scale dataset suitable for a challenge problem focused on visual concept learning. We used the ImageNet LSVRC [2] 2010 data as the basis for automatically constructing a hierarchically-organized set of concepts at four different levels of abstraction. We had two goals in constructing the dataset: to cover concepts at various levels of abstraction (from subordinate concepts to superordinate concepts, such as from Dalmatian to living things), and to find query images that comprehensively test human generalization behavior. We address these two goals in turn.\n\nTo generate concepts at various levels of abstraction, we use all the nodes in the ImageNet hierarchy as concept candidates, starting from the leaf node classes as the most specific level of concept. We then generate three more levels of increasingly broad concepts along the path from the leaf to the root for each leaf node in the hierarchy. Examples from such concepts are then shown to human participants to obtain human generalization judgements, which serve as the ground truth. Specifically, we use the leaf node class itself as the most basic trial type L0, and select three levels of nested concepts L1, L2, L3, which correspond to three intermediate nodes along the path from the leaf node to the root.\n\nFigure 2: Concepts drawn from ImageNet. (a) example images sampled from the four levels for blueberry, and (b) the histogram of the subtree sizes for different levels of concepts (x axis in log scale).\n\n
We choose the three nodes that maximize the combined information gain across these levels:\n\nC(L1···3) = Σ_{i=0}^{3} [log(|L_{i+1}| − |L_i|) − log |L_{i+1}|],   (1)\n\nwhere |L_i| is the number of leaf nodes under the subtree rooted at L_i, and L4 is the whole taxonomy tree. As a result, we obtain levels that are “evenly” distributed over the taxonomy tree. Such levels coarsely correspond to the sub-category, basic, super-basic, and super-category levels in the taxonomy: for example, the four levels used in Figure 1 are dalmatian, domestic dog, animal, organism for the leaf node dalmatian, and in Figure 2(a) are blueberry, berry, edible fruit, and natural object for the leaf node blueberry. Figure 2(b) shows a histogram of the subtree sizes for L1 to L3 respectively.\n\nFor each concept, the five images shown to participants as examples of that concept were randomly sampled from five different leaf node categories from the corresponding subtree in the ILSVRC 2010 test images. Figures 1 and 2 show such examples.\n\nTo obtain the ground truth (the concepts people perceive when given the set of examples), we then randomly sample twenty query images, and ask human participants whether each of these query images belongs to the concept given by the example images. The 20 query images are sampled as follows: three each from the L0, L1, L2 and L3 subtrees, and eight images outside L3. This ensures complete coverage over in-concept and out-of-concept queries. We explicitly made sure that the leaf node classes of the query images were different from those of the examples where possible, and that no duplicates exist among the 20 queries. 
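The level-selection criterion in Eq. (1) amounts to scoring every choice of three strictly nested intermediate nodes along the leaf-to-root path and keeping the best-scoring triple. The following Python sketch is illustrative only; the function names and the example subtree sizes are our own assumptions, not the released dataset-construction code:

```python
import math
from itertools import combinations

def score_levels(sizes):
    """Score candidate levels (|L0|, ..., |L4|) per Eq. (1):
    sum over i = 0..3 of log(|L_{i+1}| - |L_i|) - log|L_{i+1}|."""
    total = 0.0
    for i in range(4):
        diff = sizes[i + 1] - sizes[i]
        if diff <= 0:              # levels must be strictly nested
            return float("-inf")
        total += math.log(diff) - math.log(sizes[i + 1])
    return total

def choose_levels(path_sizes, tree_size):
    """path_sizes: subtree sizes of the internal nodes on the
    leaf-to-root path (ascending). Returns the indices of the three
    nodes used as L1, L2, L3; L0 is the leaf (size 1) and L4 is the
    whole taxonomy (size tree_size)."""
    best, best_combo = float("-inf"), None
    for combo in combinations(range(len(path_sizes)), 3):
        sizes = [1] + [path_sizes[i] for i in combo] + [tree_size]
        s = score_levels(sizes)
        if s > best:
            best, best_combo = s, combo
    return best_combo
```

For instance, `choose_levels([2, 5, 20, 120, 400], 1000)` picks the three path nodes whose subtree sizes are spread most evenly, on a log scale, between the leaf and the full taxonomy.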
Note that we always sampled the example and query images from the ILSVRC 2010 test images, allowing us to subsequently train our machine vision models with the training and validation images from the ILSVRC dataset while keeping those in the visual concept learning dataset as novel test images.\n\n3.3 Collecting Human Judgements\n\nWe created 4,000 distinct concepts (four for each leaf node) using the protocol above, and recruited participants online through Amazon Mechanical Turk (AMT, http://www.mturk.com) to obtain the human ground truth data. For each concept, an AMT HIT (a single task presented to the human participants) is formed with five example images and twenty query images, and the participants were asked whether each query belongs to the concept represented by the examples. Each HIT was completed by five unique participants, with a compensation of $0.05 USD per HIT. Participants were allowed to complete as many unique trials as they wished. Thus, a total of 20,000 AMT HITs were collected, and a total of 100,000 images were shown to the participants. On average, each participant took approximately one minute to finish each HIT, spending about 3 seconds per query image. The dataset is publicly available at http://www.eecs.berkeley.edu/~jiayq/.\n\n4 Visually-Grounded Bayesian Concept Learning\n\nIn this section, we describe an end-to-end framework that combines Bayesian word learning models and visual classifiers, and is able to perform concept learning with perceptual inputs.\n\n4.1 Bayesian Concept Learning\n\nPrior work on concept learning [21] addressed the problem of generalization from examples using a Bayesian framework: given a set of N examples (images in our case) X = {x1, x2, . . .
, xN} that are members of an unknown concept C, the probability that a query instance x_query also belongs to the same concept is given by\n\nP_new(x_query ∈ C | X) = Σ_{h∈H} P_new(x_query | h) P(h | X),   (2)\n\nwhere H is called the “hypothesis space” – a set of possible hypotheses for what the concept might be. Each hypothesis corresponds to a (often semantically related) subset of all the objects in the world, such as “dogs” or “animals”. Given a specific hypothesis h, the probability P_new(x_new | h) that a new instance belongs to it is 1 if x_new is in the set, and 0 otherwise, and P(h | X) is the posterior probability of a hypothesis h given the examples X.\n\nThe posterior distribution over hypotheses is computed using Bayes’ rule: it is proportional to the product of the prior probability P(h) of the hypothesis and the likelihood P(X | h), the probability of drawing these examples uniformly at random from the hypothesis h:\n\nP(h | X) ∝ P(h) Π_{i=1}^{N} P_example(x_i | h),   (3)\n\nwhere we make the strong sampling assumption that each x_i is drawn uniformly at random from the set of instances picked out by h. Importantly, this ensures that the model acts in accordance with the “size principle” [18, 20], meaning that the conditional probability of an instance given a hypothesis is inversely proportional to the size of the hypothesis, i.e., the number of possible instances that could be drawn from the hypothesis:\n\nP_example(x_i | h) = |h|^{−1} I(x_i ∈ h),   (4)\n\nwhere |h| is the size of the hypothesis and I(·) is an indicator function that has value 1 when the statement is true. We note that the probability of an example and that of a query given a hypothesis are different: the former depends on the size of the underlying hypothesis, reflecting the strong sampling assumption during training. 
For example, as the number of examples that are all Dalmatians increases, it becomes increasingly likely that the concept is just Dalmatians and not dogs in general, even though both are logically possible, because it would be incredibly unlikely to sample only Dalmatians if the true concept were dogs. In addition, the prior distribution P(h) captures biases due to prior knowledge, which favor particular kinds of hypotheses over others (as we discuss in the next subsection). For example, it is known that people favor basic level object categories such as dogs over subcategories (such as Dalmatians) or supercategories (such as animals).\n\n4.2 Concept Learning with Perceptual Uncertainty\n\nExisting Bayesian word learning models assume that objects are perfectly recognized, thus representing them as discrete indices into a finite set of tokens. Hypotheses are then subsets of the complete set of tokens and are often hierarchically nested. Although perceptual spaces were adopted in [18], only very simple hypotheses (rectangles over the positions of dots) were used. Performing Bayesian inference with a complex perceptual input such as images is thus still a challenge. To this end, we utilize state-of-the-art image classifiers to classify each image into the set of leaf node classes given in the ImageNet hierarchy, and then build a hypothesis space on top of the classifier outputs.\n\nSpecifically, we construct the hypothesis space over the image labels using the ImageNet hierarchy, with each subtree rooted at a node serving as a possible hypothesis. The hypothesis sizes are then computed as the number of leaf node classes under the corresponding node; e.g., the node “animal” would have a larger size than the node “dogs”. 
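Before introducing perceptual uncertainty, the noiseless model of Eqs. (2)-(4) can be sketched over a toy hypothesis space of nested leaf-label sets. The taxonomy, hypothesis names, and uniform prior below are illustrative assumptions on our part, not part of the paper:

```python
def generalize(examples, query, hypotheses, prior):
    """Bayesian generalization (Eqs. 2-4): hypotheses are sets of
    leaf labels; strong sampling gives P(x|h) = 1/|h| for x in h,
    and 0 otherwise. `prior` maps hypothesis names to P(h)."""
    posterior = {}
    for name, h in hypotheses.items():
        if all(x in h for x in examples):            # I(x_i in h)
            posterior[name] = prior[name] * len(h) ** -len(examples)
        else:
            posterior[name] = 0.0
    z = sum(posterior.values())                      # normalizer
    # P(query in C | X) = sum of P(h|X) over hypotheses containing query
    return sum(p for name, p in posterior.items()
               if query in hypotheses[name]) / z

# Toy nested taxonomy and uniform prior, purely for illustration.
hyps = {"dalmatian": {"dalmatian"},
        "dog": {"dalmatian", "poodle", "terrier"},
        "animal": {"dalmatian", "poodle", "terrier", "cat", "horse"}}
prior = {name: 1.0 for name in hyps}
```

With a single Dalmatian example the model spreads belief over the "dalmatian", "dog", and "animal" hypotheses; after three Dalmatian examples the size principle concentrates the posterior on "dalmatian", so generalization to a poodle query drops sharply.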
The large number of images collected by ImageNet allows us to train classifiers from images to the leaf node labels, which we describe shortly. Assuming that there are a total of K leaf nodes, for an image x_i that is classified as label ŷ_i, the likelihood P_example(x_i | h) is then defined as\n\nP_example(x_i | h) = Σ_{j=1}^{K} A_{j,ŷ_i} (1/|h|) I(j ∈ h),   (5)\n\nwhere A is the normalized confusion matrix, with A_{j,i} being the probability that the true leaf node is j given that the classifier output is i. The motivation for using the confusion matrix is that classifiers are not perfect and misclassification can happen. The confusion matrix thus incorporates visual ambiguity into the word learning framework by providing an unbiased estimate of the true leaf node label for an image.\n\nThe prior probability of a hypothesis was defined to be an Erlang distribution, P(h) ∝ (|h|/σ^2) exp{−|h|/σ}, which is a standard prior over sizes in Bayesian models of generalization [17, 19]. The parameter σ is set to 200 according to [1] in order to fit human cognition, which favors basic level hypotheses [15]. Finally, the probability of a new instance belonging to a hypothesis is similar to the likelihood, but without the size term: P_new(x_new | h) = Σ_{j=1}^{K} A_{j,ŷ_new} I(j ∈ h), where ŷ_new is the classifier prediction.\n\n4.3 Learning the Perceptual Classifiers\n\nTo train the image classifiers for the perceptual component in our model, we used the ILSVRC training images, which consist of 1.2 million images categorized into the 1,000 leaf node classes, and followed the pipeline in [11] to obtain feature vectors representing the images. This pipeline uses 160K-dimensional features, yielding a total of about 1.5TB of training data. 
We trained the classifiers as linear multinomial logistic regressors with the minibatch Adagrad [6] algorithm, a quasi-Newton stochastic gradient descent approach. The hyperparameters of the classifiers were tuned on held-out validation data.\n\nOverall, we obtained 41.33% top-1 accuracy and 61.91% top-5 accuracy on the validation data, and 41.28% and 61.69% respectively on the testing data; training took about 24 hours on 10 commodity computers. Although this is not the best ImageNet classifier to date, we believe that the above pipeline is a fair representation of state-of-the-art computer vision approaches. Algorithms using similar approaches have reported competitive performance in image classification on a large number of classes (on the scale of tens of thousands) [10, 9], which provides reassurance about the possibility of using state-of-the-art classification models in visual concept learning.\n\nTo obtain the confusion matrix A of the classifiers, we note that the validation data alone does not suffice to provide a dense estimate of the full confusion matrix, because there is a large number of entries (1 million) but very few validation images (50K). Thus, instead of using the validation data to estimate A, we approximated the classifier’s leave-one-out (LOO) behavior on the training data with a simple one-step gradient update to “unlearn” each image. Specifically, we started from the trained classifier parameters, and for each training image x, we computed the gradient of the loss function when x is left out of the training set. We then took one update step in the direction of this gradient to obtain an updated classifier, and used it to predict on x. This yields a much denser estimate that worked better than existing methods. 
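As a minimal illustration of this LOO approximation, consider a single multinomial logistic regressor with weights W (K x D): ascending the gradient of an example's own loss approximately removes that example's influence, and the resulting soft predictions can be accumulated into a column-normalized confusion estimate. The sketch below is ours, with a plain fixed-size step in place of the minibatch Adagrad training used in the paper; the names are hypothetical:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def loo_prediction(W, x, y, lr=0.1):
    """Approximate leave-one-out prediction: take one gradient step
    that 'unlearns' example (x, y), then predict on x."""
    p = softmax(W @ x)
    grad = np.outer(p, x)          # d(-log p_y)/dW for example (x, y)
    grad[y] -= x
    W_loo = W + lr * grad          # ascend x's own loss ~ remove its influence
    return softmax(W_loo @ x)

def estimate_confusion(W, X, Y, K, lr=0.1):
    """Confusion estimate A with A[j, i] ~ P(true leaf j | predicted i),
    accumulated as soft counts from LOO predictions."""
    A = np.zeros((K, K))
    for x, y in zip(X, Y):
        A[y] += loo_prediction(W, x, y, lr)   # soft counts for true class y
    A /= A.sum(axis=0, keepdims=True) + 1e-12  # normalize each predicted column
    return A
```

Because the unlearning step raises the loss on the held-out example, the LOO prediction is less confident on it than the fully trained classifier, which is what produces the denser off-diagonal mass in the estimated confusion matrix.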
We refer the reader to the supplementary material for the technical details of the classifier training and the LOO confusion matrix estimation.\n\n5 Experiments\n\nIn this section, we describe the experimental protocol adopted to compare our system with human performance, and compare our system against various baseline algorithms. Quantitatively, we use the precision-recall curve, the average precision (AP) and the F1 score at the point where precision = recall to evaluate performance and to compare against human performance, which is calculated by randomly sampling one human participant per distinctive HIT and comparing his/her predictions against the four others’.\n\nTo the best of our knowledge, there are no existing vision models that explicitly handle our concept learning task. Thus, we compare our vision-based Bayesian generalization algorithm (denoted by VG), described in the previous section, against the following baselines, which are reasonable extensions of existing vision or cognitive science models:\n\n1. Naive vision approach (NV): this uses a nearest neighbor approach by computing the score of a query as its distance to the closest example image, using GIST features [12].\n\nMethod              AP      F1 Score\nNV                  36.37   35.64\nPM                  61.74   56.07\nHC                  60.58   56.82\nHB                  57.50   52.72\nNP                  76.24   72.70\nVG (ours)           72.82   66.97\nHuman Performance   –       75.47\n\nFigure 3: The precision-recall curves of our method and the baseline algorithms. The human results are shown as red crosses, and the non-perceptual Bayesian word learning model (NP) is shown as magenta dashed lines. The table summarizes the average precision (AP) and F1 scores of the methods.\n\n2. Prototype model (PM): an extension of the image classifiers. We use the L1-normalized classifier output from the multinomial logistic regressors as a vector for the query image, and compute the score as its χ2 distance to the closest example image.\n\n3. 
Histogram of classifier outputs (HC): similar to the prototype model, but instead of computing the distance between the query and each example, we compute the score as the χ2 distance to the histogram of classifier outputs aggregated over the examples.\n\n4. Hedging the bets extension (HB): we extend the hedging idea [5] to handle sets of query images. Specifically, we find the subtree in the hierarchy that maximizes the information gain while maintaining an overall accuracy above a threshold ε over the set of example images. The score of a query image is then computed as the probability that it belongs to this subtree. The threshold ε is tuned on a randomly selected subset of the data.\n\n5. Non-perceptual word learning (NP): the classical Bayesian word learning model in [21] assuming a perfect classifier, i.e., taking the ground-truth leaf labels for the test images. This is not practical in actual applications, but evaluating NP helps us understand how a perceptual component contributes to modeling human behavior.\n\n5.1 Main Results\n\nFigure 3 shows the precision-recall curves for our method and the baseline methods, and summarizes the average precision and F1 scores. Conventional vision approaches that build upon image classifiers work better than simple image features (such as GIST), which is sensible given that object categories provide relatively more semantics than simple features. However, all the baselines still perform far below humans, because they miss the key mechanism for inferring the “width” of the latent concept represented by a set of images (instead of a single image, as conventional approaches assume). 
In contrast, adopting the size principle and the Bayesian generalization framework allows us to perform much better, obtaining an increase of about 10% in average precision and F1 scores, coming closer to human performance than the other visual baselines.\n\nThe non-perceptual (NP) model exhibits better overall average precision than our method, which suggests that image classifiers can still be improved. This is indeed the case, as state-of-the-art recognition algorithms may still significantly underperform humans. However, note that for a system to work in a real-world scenario, such as aid-giving robots, it is crucial that the agent be able to take direct perceptual inputs. It is also interesting to note that all visual models yield higher precision values in the low-recall region (top left of Figure 3) than the NP model, which does not use perceptual input and has a lower starting precision. This suggests that perceptual signals do play an important role in human generalization behavior, and should not be left out of the pipeline as previous Bayesian word learning methods do.\n\n5.2 Analysis of Per-level Responses\n\nIn addition to the quantitative precision-recall curves, we perform a qualitative per-level analysis similar to previous word learning work [1]. To this end, we binarize the predictions at the threshold that yields the same precision and recall, and then plot the per-level responses, i.e., the proportion of query images from level Li that are predicted positive, given examples from level Lj.\n\nFigure 4: Per-level generalization predictions from various methods ((a) NP model, (b) our method, (c) PM baseline, (d) IC oracle), where the horizontal axis shows the four levels at which examples were provided (L0 to L3). At each level, five bars show the proportion of queries from levels L0 to L4 that are labeled as instances of the concept by each method. 
These results are summarized in a scatter plot showing model predictions (horizontal axis) vs. human judgments (vertical axis), with the red line showing a linear regression fit.\n\nWe show in Figures 4 and 5 the per-level generalization results from humans, the NP model, our method, and the PM baseline, which best represents the state-of-the-art vision baselines. People show a monotonic decrease in generalization as the query level moves conceptually further from the examples. In addition, for queries of a given level, the generalization score peaks when examples from the same level are presented, and drops when lower- or higher-level examples are presented. The NP model tends to give more extreme predictions (either very low or very high), possibly because it assumes perfect recognition, while visual inputs are actually difficult to classify precisely, even for a human being. The conventional vision baseline does not utilize the size principle to model human concept learning, and as a result shows very similar behavior across different levels of examples. Our method exhibits a good correlation with the human results, although it has a smaller generalization probability for L0 queries, possibly because current visual models are still not completely accurate in identifying leaf node classes [5].\n\nLast but not least, we examine how well a conventional image classification approach can explain our experimental results. To do so, Figure 4(d) plots the results of an image classification (IC) oracle that predicts yes for an image within the ground-truth ImageNet node from which the current examples were sampled, and no otherwise. Note that the IC oracle never generalizes beyond the level from which the examples are drawn, and thus exhibits very different generalization results compared to the human participants in our experiment. 
Thus, visual concept learning poses more realistic and challenging problems for computer vision studies.

Figure 5: Per-level generalization from human participants.

6 Conclusions

We proposed a new task for machine vision – visual concept learning – and presented the first system capable of approaching human performance on this problem. By linking research on object classification in machine vision and Bayesian generalization in cognitive science, we were able to define a system that could infer the appropriate scope of generalization for a novel concept directly from a set of images. This system outperforms baselines that draw on previous approaches in both machine vision and cognitive science, coming closer to human performance than any of these approaches. However, there is still significant room to improve performance on this task, and we present our visual concept learning dataset as the basis for a new challenge problem for machine vision, going beyond assigning labels to individual objects.

References

[1] J. T. Abbott, J. L. Austerweil, and T. L. Griffiths. Constructing a hypothesis space from the Web for large-scale Bayesian word learning. In Proceedings of the 34th Annual Conference of the Cognitive Science Society, 2012.

[2] A. Berg, J. Deng, and L. Fei-Fei. ILSVRC 2010. http://www.image-net.org/challenges/LSVRC/2010/.

[3] S. Carey.
The child as word learner. Linguistic Theory and Psychological Reality, 1978.

[4] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

[5] J. Deng, J. Krause, A. Berg, and L. Fei-Fei. Hedging your bets: Optimizing accuracy-specificity trade-offs in large scale visual recognition. In CVPR, 2012.

[6] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121–2159, 2011.

[7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) challenge. IJCV, 88(2):303–338, 2010.

[8] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, 2009.

[9] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

[10] Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, and A. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2012.

[11] Y. Lin, F. Lv, S. Zhu, M. Yang, T. Cour, K. Yu, L. Cao, and T. Huang. Large-scale image classification: fast feature extraction and SVM training. In CVPR, 2011.

[12] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 42(3):145–175, 2001.

[13] D. Parikh and K. Grauman. Relative attributes. In ICCV, 2011.

[14] A. Quattoni, M. Collins, and T. Darrell. Transfer learning for image classification with sparse prototype representations. In CVPR, 2008.

[15] E. Rosch, C. B. Mervis, W. D. Gray, D. M. Johnson, and P. Boyes-Braem. Basic objects in natural categories. Cognitive Psychology, 8(3):382–439, 1976.

[16] R. Salakhutdinov, A. Torralba, and J. B. Tenenbaum.
Learning to share visual appearance for multiclass object detection. In CVPR, 2011.

[17] R. N. Shepard. Toward a universal law of generalization for psychological science. Science, 237:1317–1323, 1987.

[18] J. B. Tenenbaum. Bayesian modeling of human concept learning. In NIPS, 1999.

[19] J. B. Tenenbaum. Rules and similarity in concept learning. In NIPS, 2000.

[20] J. B. Tenenbaum and T. L. Griffiths. Generalization, similarity, and Bayesian inference. Behavioral and Brain Sciences, 24(4):629–640, 2001.

[21] F. Xu and J. B. Tenenbaum. Word learning as Bayesian inference. Psychological Review, 114(2):245–272, 2007.