{"title": "Learning and using relational theories", "book": "Advances in Neural Information Processing Systems", "page_first": 753, "page_last": 760, "abstract": null, "full_text": "Learning and using relational theories\n\nCharles Kemp, Noah D. Goodman & Joshua B. Tenenbaum\n\nDepartment of Brain and Cognitive Sciences, MIT, Cambridge, MA 02139\n\n{ckemp,ndg,jbt}@mit.edu\n\nAbstract\n\nMuch of human knowledge is organized into sophisticated systems that are often\ncalled intuitive theories. We propose that intuitive theories are mentally repre-\nsented in a logical language, and that the subjective complexity of a theory is\ndetermined by the length of its representation in this language. This complexity\nmeasure helps to explain how theories are learned from relational data, and how\nthey support inductive inferences about unobserved relations. We describe two\nexperiments that test our approach, and show that it provides a better account of\nhuman learning and reasoning than an approach developed by Goodman [1].\n\nWhat is a theory, and what makes one theory better than another? Questions like these are of obvious\ninterest to philosophers of science but are also discussed by psychologists, who have argued that\neveryday knowledge is organized into rich and complex systems that are similar in many respects\nto scienti\ufb01c theories. Even young children, for instance, have systematic beliefs about domains\nincluding folk physics, folk biology, and folk psychology [2]. Intuitive theories like these play many\nof the same roles as scienti\ufb01c theories: in particular, both kinds of theories are used to explain and\nencode observations of the world, and to predict future observations.\nThis paper explores the nature, use and acquisition of simple theories. 
Consider, for instance, an\nanthropologist who has just begun to study the social structure of a remote tribe, and observes that\ncertain words are used to indicate relationships between selected pairs of individuals. Suppose that\nterm T1(\u00b7, \u00b7) can be glossed as ancestor(\u00b7, \u00b7), and that T2(\u00b7, \u00b7) can be glossed as friend(\u00b7, \u00b7). The\nanthropologist might discover that the \ufb01rst term is transitive, and that the second term is symmetric\nwith a few exceptions. Suppose that term T3(\u00b7, \u00b7) can be glossed as defers to(\u00b7, \u00b7), and that the tribe\ndivides into two castes such that members of the second caste defer to members of the \ufb01rst caste. In\nthis case the anthropologist might discover two latent concepts (caste 1(\u00b7) and caste 2(\u00b7)) along\nwith the relationship between these concepts.\nAs these examples suggest, a theory can be de\ufb01ned as a system of laws and concepts that specify\nthe relationships between the elements in some domain [2]. We will consider how these theories are\nlearned, how they are used to encode relational data, and how they support predictions about unob-\nserved relations. Our approach to all three problems relies on the notion of subjective complexity.\nWe propose that theory learners prefer simple theories, that people remember relational data in terms\nof the simplest underlying theory, and that people extend a partially observed data set according to\nthe simplest theory that is consistent with their observations. There is no guarantee that a single\nmeasure of subjective complexity can do all of the work that we require [3]. This paper, however,\nexplores the strong hypothesis that a single measure will suf\ufb01ce.\nOur formal treatment of subjective complexity begins with the question of how theories are mentally\nrepresented. 
We suggest that theories are represented in some logical language, and propose a\nspecific first-order language that serves as a hypothesis about the \u201clanguage of thought.\u201d We then pursue\nthe idea that the subjective complexity of a theory corresponds to the length of its representation in\nthis language. Our approach therefore builds on the work of Feldman [4], and is related to other\npsychological applications of the notion of Kolmogorov complexity [5]. The complexity measure\nwe describe can be used to define a probability distribution over a space of theories, and we develop\na model of theory acquisition by using this distribution as the prior for a Bayesian learner. We also\nshow how the same Bayesian approach helps to explain how theories support inductive\ngeneralization: given a set of observations, future observations (e.g. whether one individual defers to another)\ncan be predicted using the posterior distribution over the space of theories.\nWe test our approach by developing two experiments where people learn and make predictions\nabout binary and ternary relations. As far as we know, the approach of Goodman [1] is the only\nother measure of theory complexity that has previously been tested as a psychological model [6].\nWe show that our experiments support our approach and raise challenges for this alternative model.\n\n(a) Star\nPairs: 11 22 33 44 55 66 77 88 21 31 41 51 61 71 81\nTheory: R(X, X). R(X, 1).\n\n(b) Bipartite\nPairs: 16 26 36 46 56 17 27 37 47 57 18 28 38 48 58\nTheory: T(6). T(7). T(8). R(X, Y) \u2190 \u00afT(X), T(Y).\n\n(c) Exception\nPairs: 11 26 36 46 56 17 27 37 47 57 18 28 38 48 58\nTheory: T(6). T(7). T(8). R(X, Y) \u2190 \u00afT(X), T(Y). R(1, 1). \u00afR(1, 6).\n\n(d) Symmetric\nPairs: 11 22 33 44 55 66 77 12 21 13 31 24 42 56 65\nTheory: R(1, 2). R(1, 3). R(2, 4). R(5, 6). R(X, Y) \u2190 R(Y, X). R(X, X).\n\n(e) Transitive\nPairs: 12 13 23 14 24 34 15 25 35 45 16 26 36 46 56\nTheory: R(1, 2). R(2, 3). R(3, 4). R(4, 5). R(5, 6). R(X, Z) \u2190 R(X, Y), R(Y, Z).\n\n(f) Random\nPairs: 21 13 32 14 24 34 51 52 35 54 61 26 63 46 56\nTheory: R(5, X). R(X, 4). R(2, 1). R(1, 3). R(6, 1). R(3, 2). R(2, 6). R(3, 5). R(6, 3). R(4, 6). \u00afR(X, X). \u00afR(6, 4). \u00afR(5, 3).\n\nFigure 1: Six possible extensions for a binary predicate R(\u00b7, \u00b7). In each case, the objects in the\ndomain are represented as digits, and a pair such as 16 indicates that R(1, 6) is true. Below each set\nof pairs, the simplest theory according to our complexity measure is shown.\n\n1 Theory complexity: a representation length approach\n\nIntuitive theories correspond to mental representations of some sort, and our first task is to\ncharacterize the elements used to build these representations. We explore the idea that a theory is a\nsystem of statements in a logical language, and six examples are shown in Fig. 1. The theory in\nFig. 1b is related to the defers to(\u00b7, \u00b7) example already described. Here we are interested in a\ndomain including 9 elements, and a two-place predicate R(\u00b7, \u00b7) that is true of all and only the 15\npairs shown. R is defined using a unary predicate T which is true of only three elements: 6, 7, and\n8. The theory includes a clause which states that R(X, Y) is true for all pairs XY such that T(X) is\nfalse and T(Y) is true. The theory in Fig. 1c is very similar, but includes an additional clause which\nspecifies that R(1, 1) is true, and an exception which specifies that R(1, 6) is false. Formally, each\ntheory we consider is a collection of function-free definite clauses. 
All variables are universally\nquantified: for instance, the clause R(X, Z) \u2190 R(X, Y), R(Y, Z) is equivalent to the logical formula\n\u2200x \u2200y \u2200z (R(x, z) \u2190 R(x, y) \u2227 R(y, z)). For readability, the theories in Fig. 1 include\nparentheses and arrows, but note that these symbols are unnecessary and can be removed. Our proposed\nlanguage includes only predicate symbols, variable symbols, constant symbols, and a period that\nindicates when one clause finishes and another begins.\nEach theory in Fig. 1 specifies the extension of one or more predicates. The extension of predicate\nP is defined in terms of predicate P+ (which captures the basic rules that lead to membership in P)\nand predicate P\u2212 (which captures exceptions to these rules). The resulting extension of P is defined\nas P+ \\ P\u2212, or the set difference of P+ and P\u2212.\u00b9 Once P has been defined, later clauses in the\ntheory may refer to P or its negation \u00afP. To ensure that our semantics is well-defined, the predicates\nin any valid theory must permit an ordering so that the definition of any predicate does not refer to\npredicates that follow it in the order. Formally, the definition of each predicate P+ or P\u2212 can refer\nonly to itself (recursive definitions are allowed) and to any predicate M or \u00afM where M < P.\nOnce we have committed to a specific language, the subjective complexity of a theory is assumed to\ncorrespond to the number of symbols in its representation. We have chosen a language where there\nis one symbol for each position in a theory where a predicate, variable or constant appears, and one\nsymbol to indicate when each clause ends. 
Given this language, the subjective complexity c(T ) of\ntheory T is equal to the sum of the number of clauses in the theory and the number of positions in\nthe theory where a predicate, variable or constant appears:\n\nc(T ) = #clauses(T ) + #pred slots(T ) + #var slots(T ) + #const slots(T ).\n\n(1)\n\nFor instance, the clause R(X, Z) \u2190 R(X, Y), R(Y, Z). contributes ten symbols towards the complexity\nof a theory (three predicate symbols, six variable symbols, and one period). Other languages might\nbe considered: for instance, we could use a language which uses \ufb01ve symbols (e.g. \ufb01ve bits) to\nrepresent each predicate, variable and constant, and one symbol (e.g. one bit) to indicate the end of\na clause. Our approach to subjective complexity depends critically on the representation language,\nbut once a language has been chosen the complexity measure is uniquely speci\ufb01ed.\nAlthough our approach is closely related to the notion of Kolmogorov complexity and to Minimum\nMessage Length (MML) and Minimum Description Length (MDL) approaches, we refer to it as a\nRepresentation Length (RL) approach. A RL approach includes a commitment to a speci\ufb01c language\nthat is proposed as a psychological hypothesis, but these other approaches aspire towards results that\ndo not depend on the language chosen.2 It is sometimes suggested that the notion of Kolmogorov\ncomplexity provides a more suitable framework for psychological research than the RL approach,\nprecisely because it allows for results that do not depend on a speci\ufb01c description language [8]. We\nsubscribe to the opposite view. Mental representations presumably rely on some particular language,\nand identifying this language is a central challenge for psychological research.\nThe language we described should be considered as a tentative approximation of the language of\nthought. Other languages can and should be explored, but our language has several appealing prop-\nerties. 
Feldman [4] has argued that de\ufb01nite clauses are psychologically natural, and working with\nthese representations allows our approach to account for several classic results from the concept\nlearning literature. For instance, our language leads to the prediction that conjunctive concepts are\neasier to learn than disjunctive concepts [9].3 Working with de\ufb01nite clauses also ensures that each of\nour theories has a unique minimal model, which means that the extension of a theory can be de\ufb01ned\nin a particularly simple way. Finally, human learners deal gracefully with noise and exceptions, and\nour language provides a simple way to handle exceptions.\nAny concrete proposal about the language of thought should make predictions about memory, learn-\ning and reasoning. Suppose that data set D lists the extensions of one or more predicates, and that a\ntheory is a \u201ccandidate theory\u201d for D if it correctly de\ufb01nes the extensions of all predicates in D. Note\nthat a candidate theory may well include latent predicates\u2014predicates that do not appear in D, but\nare useful for de\ufb01ning the predicates that have been observed. We will assume that humans encode\nD in terms of the simplest candidate theory for D, and that the dif\ufb01culty of memorizing D is deter-\nmined by the subjective complexity of this theory. Our approach can and should be tested against\nclassic results from the memory literature. Unlike some other approaches to complexity [10], for\ninstance, our model predicts that a sequence of k items is about equally easy to remember regardless\nof whether the items are drawn from a set of size 2, a set of size 10, or a set of size 1000 [11].\n\n1The extension of P+ is the smallest set that satis\ufb01es all of the clauses that de\ufb01ne P+, and the extension of\nP\u2212 is de\ufb01ned similarly. To simplify our notation, Fig. 1 uses P to refer to both P and P+, and \u00afP to refer to \u00afP and\nP\u2212. 
Any instance of P that appears in a clause defining P is really an instance of P+, and any instance of \u00afP that\nappears in a clause defining \u00afP is really an instance of P\u2212.\n\n2 MDL approaches also commit to a specific language, but this language is often intended to be as general\nas possible. See, for instance, the discussion of universal codes in Gr\u00fcnwald et al. [7].\n\n3 A conjunctive concept C(\u00b7) can be defined using a single clause: C(X) \u2190 A(X), B(X). The shortest definition\nof a disjunctive concept requires two clauses: D(X) \u2190 A(X). D(X) \u2190 B(X).\n\nTo develop a model of inductive learning and reasoning, we take a Bayesian approach, and use\nour complexity measure to define a prior distribution over a hypothesis space of theories:\nP(T) \u221d 2^{-c(T)}.\u2074 Given this prior distribution, we can use Bayesian inference to make predictions about\nunobserved relations and to discover the theory T that best accounts for the observations in data set\nD [12, 13]. Suppose that we have a likelihood function P(D|T) which specifies how the examples\nin D were generated from some underlying theory T. The best explanation for the data D is the\ntheory that maximizes the posterior distribution P(T|D) \u221d P(D|T)P(T). If we need to predict\nwhether ground term g is likely to be true,\u2075 we can sum over the space of theories:\n\nP(g|D) = \\sum_{T} P(g|T) P(T|D) = \\frac{1}{P(D)} \\sum_{T : g \\in T} P(D|T) P(T)    (2)\n\nwhere the final sum is over all theories T that make ground term g true.\n\n1.1 Related work\nThe theories we consider are closely related to logic programs, and methods for Inductive Logic\nProgramming (ILP) explore how these programs can be learned from examples [14]. 
ILP algorithms\nare often inspired by the idea of searching for the shortest theory that accounts for the available data,\nand ILP is occasionally cast as the problem of minimizing an explicit MDL criterion [10]. Although\nILP algorithms are rarely considered as cognitive models, the RL approach has a long psychological\nhistory, and is proposed by Chomsky [15] and Leeuwenberg [16] among others.\nFormal measures of complexity have been developed in many \ufb01elds [17], and there is at least one\nother psychological account of theory complexity. Goodman [1] developed a complexity measure\nthat was originally a philosophical proposal about scienti\ufb01c theories, but was later tested as a model\nof subjective complexity [6]. A detailed description of this measure is not possible here, but we\nattempt to give a \ufb02avor of the approach. Suppose that a basis is a set of predicates. The starting\npoint for Goodman\u2019s model is the intuition that basis B1 is at least as complex as basis B2 if B1\ncan be used to de\ufb01ne B2. Goodman argues that this intuition is \ufb02awed, but his model is founded\non a re\ufb01nement of this intuition. For instance, since the binary predicate in Fig. 1b can be de\ufb01ned\nin terms of two unary predicates, Goodman\u2019s approach requires that the complexity of the binary\npredicate is no more than the sum of the complexities of the two unary predicates.\nWe will use Goodman\u2019s model as a baseline for evaluating our own approach, and a comparison\nbetween these two models should be informed by both theoretical and empirical considerations.\nOn the theoretical side, our approach relies on a simple principle for deciding which structural\nproperties are relevant to the measurement of complexity: the relevant properties are those with\nshort logical representations. 
Goodman\u2019s approach incorporates no such principle, and he proposes\nsomewhat arbitrarily that re\ufb02exivity and symmetry are among the relevant structural properties but\nthat transitivity is not. A second reason for preferring our model is that it makes contact with a\ngeneral principle\u2014the idea that simplicity is related to representation length\u2014that has found many\napplications across psychology, machine learning, and philosophy.\n\n2 Experimental results\n\nWe designed two experiments to explore settings where people learn, remember, and make inductive\ninferences about relational data. Although theories often consist of systems of many interlocking\nrelations, we keep our experiments simple by asking subjects to learn and reason about a single\nrelation at a time. Despite this restriction, our experiments still make contact with several issues\nraised by systems of relations. As the defers to(\u00b7, \u00b7) example suggests, a single relation may be\nbest explained as the observable tip of a system involving several latent predicates (e.g. caste 1(\u00b7)\nand caste 2(\u00b7)).\n\n4To ensure that this distribution can be normalized, we assume that there is some upper bound on the number\nof predicate symbols, variable symbols, and constants, and on the length of the theories we will consider. 
There\nwill therefore be a finite number of possible theories, and our prior will be a valid probability distribution.\n\n5 A ground term is a term such as R(8, 9) that does not include any variables.\n\n[Figure 2 panels: (a) Learning time, (b) Complexity (Human), (c) Complexity (RL), (d) Complexity (Goodman). Each panel plots one value for each of the six sets: star, bipartite, exception, symmetric, transitive, random.]\n\nFigure 2: (a) Average time in seconds to learn the six sets in Fig. 1. (b) Average ratings of set\ncomplexity. (c) Complexity scores according to our representation length (RL) model. (d) Complexity\nscores according to Goodman\u2019s model.\n\n2.1 Experiment 1: memory and induction\nIn our first experiment, we studied the subjective complexity of six binary relations that display a\nrange of structural properties, including reflexivity, symmetry, and transitivity.\nMaterials and Methods. 18 adults participated in this experiment. Subjects were required to learn\nthe 6 sets shown in Fig. 1, and to make inductive inferences about each set. Although Fig. 1 shows\npairs of digits, the experiment used letter pairs, and the letters for each condition and the order\nin which these conditions were presented were randomized across subjects. The pairs for each\ncondition were initially laid out randomly on screen, and subjects could drag them around and\norganize them to help them understand the structure of the set. At any stage, subjects could enter a\ntest phase where they were asked to list the 15 pairs belonging to the current set. Subjects who made\nan error on the test were returned to the learning phase. 
After 9 minutes had elapsed, subjects were\nallowed to pass the test regardless of how many errors they made.\nAfter passing the test, subjects were asked to rate the complexity of the set compared to other sets\nwith 15 pairs. Ratings were provided on a 7 point scale. Subjects were then asked to imagine that\na new letter (e.g. letter 9) had belonged to the current alphabet, and were given two inductive tasks.\nFirst they were asked to enter between 1 and 10 novel pairs that they might have expected to see\n(each novel pair was required to include the new letter). Next they were told about a novel pair that\nbelonged to the set (e.g. pair 91), and were again asked to enter up to 10 additional pairs that they\nmight have expected to see.\nResults. The average time needed to learn each set is shown in Fig. 2a, and ratings of set complexity\nare shown in Fig. 2b. It is encouraging that these measures yield converging results, but they may be\nconfounded since subjects rated the complexity of a set immediately after learning it. The complex-\nities plotted in Fig. 2c are the complexities of the theories shown in Fig. 1, which we believe to be\nthe simplest theories according to our complexity measure. The \ufb01nal plot in Fig. 2 shows complex-\nities according to Goodman\u2019s model, which assigns each binary relation an integer between 0 and\n4. There are several differences between these models: for instance, Goodman\u2019s account incorrectly\npredicts that the exception case is the hardest of the six, but our model acknowledges that a sim-\nple theory remains simple if a handful of exceptions are added. Goodman\u2019s account also predicts\nthat transitivity is not an important structural regularity, but our model correctly predicts that the\ntransitive set is simpler than the same set with some of the pairs reversed (the random set).\nResults for the inductive task are shown in Fig. 3. 
The first two columns show the number of subjects\nwho listed each novel pair. The remaining two columns show the probability of set membership\npredicted by our model. To generate these predictions, we applied Equation 2 and summed over\na set of theories created by systematically extending the theories shown in Fig. 1. Each extended\ntheory includes up to one additional clause for each predicate in the base theory, and each additional\nclause includes at most two predicate slots. For instance, each extended theory for the bipartite\ncase is created by choosing whether or not to add the clause T(9), and adding up to one clause for\npredicate R.\u2076 For the first inductive task, the likelihood term P(D|T) (see Equation 2) is set to 0\nfor all theories that are not consistent with the pairs observed during training, and to a constant for\nall remaining theories. For the second task we assumed in addition that the novel pair observed is\nsampled at random from all pairs involving the new letter.\u2077 All model predictions were computed\nusing Mace4 [18] to generate the extension of each theory considered.\n\n6 R(9, X), \u00afR(2, 9), and R(X, 9) \u2190 R(X, 2) are three possible additions.\n\n[Figure 3 panels: one row per condition (star, bipartite, exception, symmetric, transitive, random) and four columns: Human (no examples), Human (1 example), RL (no examples), RL (one example). Human panels show counts from 0 to 18; RL panels show probabilities from 0 to 1 and are annotated with correlations r.]\n\nFigure 3: Data and model predictions for the induction task in Experiment 1. Columns 1 and 3\nshow predictions before any pairs involving the new letter are observed. Columns 2 and 4 show\npredictions after a single novel pair (marked with a gray bar) is observed to belong to the set. The\nmodel plots for each condition include correlations with the human data.\n\nThe supporting material includes predictions for a model based on the Goodman complexity measure\nand an exemplar model which assumes that the new letter will be just like one of the old letters.\u2078 The\nexemplar model outperforms our model in the random condition, and makes accurate predictions\nabout three other conditions. Overall, however, our model performs better than the two baselines.\nHere we focus on two important predictions that are not well handled by the exemplar model. In\nthe symmetry condition, almost all subjects predict that 78 belongs to the set after learning that 87\nbelongs to the set, suggesting that they have learned an abstract rule. In the transitive condition,\nmost subjects predict that pairs 72 through 76 belong to the set after learning that 71 belongs to the\nset. Our model accounts for this result, but the exemplar model has no basis for making predictions\nabout letter 7, since this letter is now known to be unlike any of the others.\n\n2.2 Experiment 2: learning from positive examples\nDuring the learning phase of our first experiment, subjects learned a theory based on positive\nexamples (the theory included all pairs they had seen) and negative examples (the theory ruled out all\npairs they had not seen). 
Often, however, humans learn theories based on positive examples alone.\nSuppose, for instance, that our anthropologist has spent only a few hours with a new tribe. She may\nhave observed several pairs who are obviously friends, but should realize that many other pairs of\nfriends have not yet interacted in her presence.\n\n7 For the second task, P(D|T) is set to 0 for theories that are inconsistent with the training pairs and theories\nwhich do not include the observed novel pair. For all remaining theories, P(D|T) is set to 1/n, where n is the\ntotal number of novel pairs that are consistent with T.\n\n8 Supporting material is available at www.charleskemp.com\n\n[Figure 4 panels, each headed by a theory and the four observed triples:\n(a) R(X, X, X). 111 222 333 444\n(b) R(X, X, 1). 221 331 441 551\n(c) R(X, X, Y). 221 443 552 663\n(d) R(X, Y, Z). 231 456 615 344\n(e) R(2, 3, X). 231 234 235 236\nRows: Human (ratings from 1 to 7) and RL (log probabilities). Horizontal axis in every panel: the test triples 777, 771, 778, 789, 237.]\n\nFigure 4: Data and model predictions for Experiment 2. The four triples observed for each set are\nshown at the top of the figure. The first row of plots shows average ratings on a scale from 1 (very\nunlikely to belong to the set) to 7 (very likely). 
Model predictions are plotted as log probabilities.\n\nOur framework can handle cases like these if we assume that the data D in Equation 2 are sampled\nfrom the ground terms that are true according to the underlying theory. We follow [10] and [13]\nand use a distribution P (D|T ) which assumes that the examples in D are randomly sampled with\nreplacement from the ground terms that are true. This sampling assumption encourages our model\nto identify the theory with the smallest extension that is compatible with all of the training examples.\nWe tested this approach by designing an experiment where learners were given sets of examples that\nwere compatible with several underlying theories.\nMaterials and Methods. 15 adults participated in this experiment immediately after taking Experi-\nment 1. In each of \ufb01ve conditions, subjects were told about a set of triples built from an alphabet of\n9 letters. They were shown four triples that belonged to the set (Fig. 4), and told that the set might\ninclude triples that they had not seen. Subjects then gave ratings on a seven point scale to indicate\nwhether \ufb01ve additional triples (see Fig. 4) were likely to belong to the set.\nResults. Average ratings and model predictions are shown in Fig. 4. Model predictions for each\ncondition were computed using Equation 2 and summing over a space of theories that included the\n\ufb01ve theories shown at the top of Fig. 4, variants of these \ufb01ve theories which stated that certain pairs\nof slots could not be occupied by the same constant,9 and theories that included no variables but\nmerely enumerated up to 5 triples.10\nAlthough there are general theories like R(X, Y, Z) that are compatible with the triples observed in all\n\ufb01ve conditions, Fig. 4 shows that people were sensitive to different regularities in each case.11 We\nfocus on one condition (Fig. 4b) that exposes the strengths and weaknesses of our model. 
According\nto our model, the two most probable theories given the triples for this condition are R(X, X, 1) and the\nclosely related variant that rules out R(1, 1, 1). The next most probable theory is R(X, X, Y). These\npredictions are consistent with people\u2019s judgments that 771 is very likely to belong to the set, and\nthat 778 is the next most likely option. Unlike our model, however, people consider 777 to be\nsubstantially less likely than 778 to belong to the set. This result may suggest that the variant of\nR(X, X, Y) that rules out R(X, X, X) deserves a higher prior probability than our model recognizes. To\nbetter account for cases like this, it may be worth considering languages where any two variables\nthat belong to the same clause but have different names must refer to different entities.\n\n9 One such theory includes two clauses: R(X, X, Y). \u00afR(X, X, X).\n10 One such theory is the following list of clauses: R(2, 2, 1). R(3, 3, 1). R(4, 4, 1). R(5, 5, 1). R(7, 7, 7).\n11 Similar results have been found with 9-month-old infants. Cases like Figs. 4b and 4c have been tested in an\ninfant language-learning study where the stimuli were three-syllable strings [19]. 9-month-old infants exposed\nto strings like the four in Fig. 4c generalized to other strings consistent with the theory R(X, X, Y), but infants in\nthe condition corresponding to Fig. 4b generalized only to strings consistent with the theory R(X, X, 1).\n\n3 Discussion and Conclusion\nThere are many psychological models of concept learning [4, 12, 13], but few that use\nrepresentations rich enough to capture the content of intuitive theories. We suggested that intuitive theories\nare mentally represented in a first-order logical language, and proposed a specific hypothesis about\nthis \u201clanguage of thought.\u201d We assumed that the subjective complexity of a theory depends on the\nlength of its representation in this language, and described experiments which suggest that the\nresulting complexity measure helps to explain how theories are learned and used for inductive inference.\nOur experiments deliberately used stimuli that minimize the influence of prior knowledge. Theories,\nhowever, are cumulative, and the theory that seems simplest to a learner will often depend on her\nbackground knowledge. Our approach provides a natural place for background knowledge to be\ninserted. A learner can be supplied with a stock of background predicates, and the shortest\nrepresentation for a data set will depend on which background predicates are available. Since different\nsets of predicates will lead to different predictions about subjective complexity, empirical results can\nhelp to determine the background knowledge that people bring to a given class of problems.\nFuture work should aim to refine the representation language and complexity measure we proposed.\nWe expect that something like our approach will be suitable for modeling a broad class of intuitive\ntheories, but the specific framework presented here can almost certainly be improved. Future work\nshould also consider different strategies for searching the space of theories. Some of the\nstrategies developed in the ILP literature should be relevant [14], but a detailed investigation of search\nalgorithms seems premature until our approach has held up to additional empirical tests. It is\ncomparatively easy to establish whether the theories that are simple according to our approach are also\nconsidered simple by people, and our experiments have made a start in this direction. 
It is much\nharder to establish that our approach captures most of the theories that are subjectively simple, and\nmore exhaustive experiments are needed before this conclusion can be drawn.\nBoolean concept learning has been studied for more than fifty years [4, 9], and many psychologists\nhave made empirical and theoretical contributions to this field. An even greater effort will be needed\nto crack the problem of theory learning, since the space of intuitive theories is much richer than\nthe space of Boolean concepts. The difficulty of this problem should not be underestimated, but\ncomputational approaches can contribute part of the solution.\nAcknowledgments Supported by the William Asbjornsen Albert memorial fellowship (CK), the James S. McDonnell\nFoundation Causal Learning Collaborative Initiative (NDG, JBT) and the Paul E. Newton chair (JBT).\nReferences\n[1] N. Goodman. The structure of appearance. 2nd edition, 1961.\n[2] S. Carey. Conceptual change in childhood. MIT Press, Cambridge, MA, 1985.\n[3] H. A. Simon. Complexity and the representation of patterned sequences of symbols. Psychological Review, 79:369\u2013382, 1972.\n[4] J. Feldman. An algebra of human concept learning. JMP, 50:339\u2013368, 2006.\n[5] N. Chater and P. Vitanyi. Simplicity: a unifying principle in cognitive science. TICS, 7:19\u201322, 2003.\n[6] J. T. Krueger. A theory of structural simplicity and its relevance to aspects of memory, perception, and conceptual naturalness. PhD thesis, University of Pennsylvania, 1979.\n[7] P. Gr\u00fcnwald, I. J. Myung, and M. Pitt, editors. Advances in Minimum Description Length: Theory and Applications. 2005.\n[8] N. Chater. Reconciling simplicity and likelihood principles in perceptual organization. Psychological Review, 103:566\u2013581, 1996.\n[9] J. A. Bruner, J. S. Goodnow, and G. J. Austin. A study of thinking. Wiley, 1956.\n[10] D. Conklin and I. H. Witten. Complexity-based induction. 
Machine Learning, 16(3):203\u2013225, 1994.\n[11] G. A. Miller. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63(1):81\u201397, 1956.\n[12] N. D. Goodman, T. L. Griffiths, J. Feldman, and J. B. Tenenbaum. A rational analysis of rule-based concept learning. In CogSci, 2007.\n[13] J. B. Tenenbaum and T. L. Griffiths. Generalization, similarity, and Bayesian inference. BBS, 24:629\u2013641, 2001.\n[14] S. Muggleton and L. De Raedt. Inductive logic programming: theory and methods. Journal of Logic Programming, 19-20:629\u2013679, 1994.\n[15] N. Chomsky. The logical structure of linguistic theory. University of Chicago Press, Chicago, 1975.\n[16] E. L. J. Leeuwenberg. A perceptual coding language for visual and auditory patterns. American Journal of Psychology, 84(3):307\u2013349, 1971.\n[17] B. Edmonds. Syntactic measures of complexity. PhD thesis, University of Manchester, 1999.\n[18] W. McCune. Mace4 reference manual and guide. Technical Report ANL/MCS-TM-264, Argonne National Laboratory, 2003.\n[19] L. Gerken. Decisions, decisions: infant language learning when multiple generalizations are possible. Cognition, 98(3):67\u201374, 2006.\n", "award": [], "sourceid": 3332, "authors": [{"given_name": "Charles", "family_name": "Kemp", "institution": null}, {"given_name": "Noah", "family_name": "Goodman", "institution": null}, {"given_name": "Joshua", "family_name": "Tenenbaum", "institution": null}]}