{"title": "Categorization Under Complexity: A Unified MDL Account of Human Learning of Regular and Irregular Categories", "book": "Advances in Neural Information Processing Systems", "page_first": 35, "page_last": 42, "abstract": "", "full_text": "Categorization Under Complexity: A Unified\nMDL Account of Human Learning of Regular\n\nand Irregular Categories\n\nDavid Fass\n\nDepartment of Psychology\nCenter for Cognitive Science\n\nRutgers University\n\nPiscataway, NJ 08854\n\nJacob Feldman*\n\nDepartment of Psychology\nCenter for Cognitive Science\n\nRutgers University\n\nPiscataway, NJ 08854\n\ndfass@ruccs.rutgers.edu\n\njacob@ruccs.rutgers.edu\n\nAbstract\n\nWe present an account of human concept learning-that is, learning of\ncategories from examples-based on the principle of minimum descrip(cid:173)\ntion length (MDL). In support of this theory, we tested a wide range\nof two-dimensional concept types, including both regular (simple) and\nhighly irregular (complex) structures, and found the MDL theory to give\na good account of subjects' performance. This suggests that the intrin(cid:173)\nsic complexity of a concept (that is, its description -length) systematically\ninfluences its leamability.\n\n1- The Structure of Categories\n\nA number of different principles have been advanced to explain the manner in which hu(cid:173)\nmans learn to categorize objects. It has been variously suggested that the underlying prin(cid:173)\nciple might be the similarity structure of objects [1], the manipulability of decision bound~\naries [2], or Bayesian inference [3][4]. While many of these theories are mathematically\nwell-grounded and have been successful in explaining a range of experimental findings,\nthey have commonly only been tested on a narrow collection of concept types similar to\nthe simple unimodal categories of Figure 1(a-e).\n\n(a)\n\n(b)\n\n(c)\n\n(d)\n\n(e)\n\nFigure 1: Categories similar to those previously studied. 
Lines represent contours of equal probability. All except (e) are unimodal.

*http://ruccs.rutgers.edu/~jacob/feldman.html

Moreover, in the scarce research that has ventured to look beyond simple category types, the goal has largely been to investigate categorization performance for isolated irregular distributions, rather than to present a survey of performance across a range of interesting distributions. For example, Nosofsky has previously examined the "criss-cross" category of Figure 1(d) and a diagonal category similar to Concept 3 of Figure 2, as well as some other multimodal categories [5][6]. While these individual category structures are no doubt theoretically important, they in no way exhaust the range of possible concept structures. Indeed, if we view n-dimensional Cartesian space as the canvas upon which a category may be represented, then any set of manifolds in that space may be considered as a potential category [7]. It is therefore natural to ask whether one such category-manifold is in principle easier or more difficult to learn than another. Since previous investigations have never considered any reasonably broad range of category structures, they have never been in a position to answer this question.

In this paper we present a theory of human categorization, based on the MDL principle, that is much better equipped to answer questions about the intrinsic learnability of both structurally regular and structurally irregular categories. In support of this theory we briefly present an experiment testing human subjects' learning of a range of concept types defined over a continuous two-dimensional feature space, including both highly regular and highly irregular structures. 
We find that our MDL-based theory gives a good account of human learning for these concepts, and that descriptive complexity accurately predicts the subjective difficulty of the various concept types tested.

2 Previous Investigations of Category Structure

The role of category structure in determining learnability has not been overlooked entirely in the literature; in fact, the intrinsic structure of binary-featured categories has been investigated quite thoroughly. The classic work by Shepard et al. [8] showed that human performance in learning such Boolean categories varies greatly depending on the intrinsic logical structure of the concept. More recently, we have shown that this performance is well-predicted by the intrinsic Boolean complexity of each concept, given by the length of the shortest Boolean formula that describes the objects in the category [9]. This result suggests that a principle of simplicity or parsimony, manifested as a minimization of complexity, might play an important role in human category learning.

The details of Boolean complexity analysis do not generalize easily to the type of continuous feature spaces we wish to investigate here. Thus a new approach is required, similar in general spirit but differing in the mathematics. Our goals are therefore (1) to deploy a complexity minimization technique such as MDL to quantify the complexity of categories defined over continuous features, and (2) to investigate the influence of complexity on human category learning by testing a range of concept types differing widely in intrinsic complexity.

3 Experiment

While the MDL principle that we plan to employ is applicable to concepts of any dimension, for reasons of convenience this experiment is limited to category structures that can be formed within a two-dimensional feature space. 
This feature space is discretized into a 4 x 4 grid from which a legitimate category can be specified by the selection of any four grid squares. Our motivation for discretizing the feature space is to place a constraint on possible category structure that will facilitate the computation of a complexity measure; this does not restrict the range of possible feature values that can be adopted by stimuli. In principle, feature values are limited only by machine precision, but as a matter of convenience we restrict features to adopting one of 1000 possible values in the range [0,1].

[Figure 2: Concepts 1-12]

Figure 2: Abstract concepts used in the experiment.

The particular 12 abstract category structures ("concepts") examined in the experiment are shown in Figure 2. These concepts were considered to be individually interesting (from a cross-theoretical perspective) and jointly representative of the broader range of available concepts. The two categories in each concept are referred to as "positive" and "negative." The positive category is represented by the dark-shaded regions, and the corresponding negative category is its complement. Note that in many cases the categories are "disconnected" or multimodal. Nevertheless, these categories are not in any sense "probabilistic" or "ill-defined"; a given point in feature space is always either positive or negative.

During the experiment, each stimulus is drawn randomly from the feature space and is labeled "positive" or "negative" based on the region from which it was drawn. 
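This generation scheme can be sketched in a few lines of code. The sketch below is illustrative only: the representation of a concept as a set of four grid squares follows the discretization described above, but all function names are our own, not from the paper.

```python
import random

GRID = 4            # 4 x 4 discretization of the feature space
LEVELS = 1000       # each feature adopts one of 1000 values in [0, 1]

def draw_stimulus(rng):
    """Draw a stimulus uniformly from the discretized feature space."""
    return tuple(rng.randrange(LEVELS) / (LEVELS - 1) for _ in range(2))

def grid_square(x, y):
    """Map continuous feature values in [0, 1] to a grid square (row, col)."""
    return (min(int(y * GRID), GRID - 1), min(int(x * GRID), GRID - 1))

def label(stimulus, positive_squares):
    """A point is positive iff it falls in one of the concept's grid squares."""
    return "positive" if grid_square(*stimulus) in positive_squares else "negative"

# Example concept: a 2 x 2 block of positive squares in one corner.
concept = {(0, 0), (0, 1), (1, 0), (1, 1)}
rng = random.Random(0)
labels = [label(draw_stimulus(rng), concept) for _ in range(10000)]
```

Because each concept occupies four of the sixteen squares, uniform sampling gives the 1/4 base rate for positives noted in the text.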
Uniform sampling is used, so all 12 categories of Figure 2 have the same base rate for positives, P(positive) = 4/16 = 1/4.

The experiment itself was clothed as a video game that required subjects to discriminate between two classes of spaceships, "Ally" and "Enemy," by destroying Enemy ships and quick-landing Allied ships. Each subject (14 total) played 12 five-minute games in which the distribution of Allies and Enemies corresponded (in random order) to the 12 concepts of Figure 2. The physical features of the spaceships in all cases were the height of the "tube" and the radius of the "pod." As shown in Figure 3, these physical features are mapped randomly onto the abstract feature space such that the experimental concepts may be any rigid rotation or reflection of the abstract concepts in Figure 2.

[Figure 3: panels (a)-(d); axis label: Pod Radius]

Figure 3: (a) A spaceship. (b-d) Three possible instantiations of Concept 6 from Figure 2.

4 Derivation of the MDL Principle

The MDL principle is largely due to Rissanen [10] and is easily shown to be a consequence of optimal Bayesian inference [11]. While several Bayesian algorithms have previously been proposed as models of human concept learning [3][4], the implications of the MDL principle for human learning have only recently come under scrutiny [12][13]. We briefly review the relevant theory.

According to Bayes' rule, a learner ought to select the category hypothesis H that maximizes the posterior P(H | D), where D is the data, and

    P(H | D) = P(D | H) P(H) / P(D)    (1)

Taking negative logarithms of both sides, we obtain

    -log P(H | D) = -log P(D | H) - log P(H) + log P(D)    (2)

The problem of maximizing P(H | D) is thus identical to the problem of minimizing -log P(H | D). 
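The correspondence between maximizing the posterior and minimizing the summed negative-log code lengths can be previewed numerically. The toy numbers below are made up purely for illustration:

```python
from math import log2

# Hypothetical priors P(H) and likelihoods P(D | H) for three candidate hypotheses.
prior = {"H1": 0.5, "H2": 0.3, "H3": 0.2}
likelihood = {"H1": 0.10, "H2": 0.40, "H3": 0.25}

# P(H | D) is proportional to P(D | H) P(H); P(D) is a shared constant.
posterior_score = {h: likelihood[h] * prior[h] for h in prior}

# Two-part description length: -log P(D | H) - log P(H), in bits.
description_length = {h: -log2(likelihood[h]) - log2(prior[h]) for h in prior}

best_bayes = max(posterior_score, key=posterior_score.get)
best_mdl = min(description_length, key=description_length.get)
```

Whatever numbers are substituted, the two criteria select the same hypothesis, since -log is monotonically decreasing.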
Since log P(D) is constant for all hypotheses, its value does not enter into the minimization problem, and we can state that the hypothesis of choice ought to be such as to minimize the quantity

    -log P(D | H) - log P(H)    (3)

If we follow Rissanen and regard the quantity -log P(x) as the description length of x, DL(x), then Equation 3 instructs us to select the hypothesis that minimizes the total description length

    DL(D | H) + DL(H)    (4)

What this means is that the hypothesis that is optimal from the standpoint of the Bayesian decision maker is the same hypothesis that yields the most compact two-part code in Equation 4. Thus, besides the merits of brevity for its own sake, we see that maximal descriptive compactness also corresponds to maximal inferential power. It is this equivalence between description length and inference that leads us to investigate the role of descriptive complexity in the domain of concept learning.

5 Theory

In order to investigate the complexity of the 12 concepts of Figure 2, Equation 4 indicates that we need to analyze (1) the description length of a hypothesis for each concept, DL(H), and (2) the description length of the concept given the hypothesis, DL(D | H). We discuss these in sequence.

5.1 The Hypothesis Description Length, DL(H)

In order to compute DL(H), we first fix a language^1 within which hypotheses about the category structure can be expressed. We choose to use the "rectangle language" whose alphabet (Table 1) consists of 10 classes of symbols representing the 10 different sizes of rectangle that can be composited within a 4 x 4 grid: 1x1, 1x2, 1x3, 1x4, 2x2, 2x3, 2x4, 3x3, 3x4, and 4x4.^2 Each member of the class "m x n" is an m x n or n x m rectangle situated at a particular position in the 4 x 4 grid. 
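The rectangle alphabet can be enumerated mechanically. A quick sketch (illustrative code, not from the paper) that recovers the ten shape classes and the "Possible Locations" counts listed in Table 1:

```python
GRID = 4

# Shape classes "m x n" with m <= n; each class covers both orientations.
classes = [(m, n) for m in range(1, GRID + 1) for n in range(m, GRID + 1)]

def num_positions(m, n):
    """Distinct placements of an m x n (or, when m != n, n x m) rectangle."""
    base = (GRID + 1 - m) * (GRID + 1 - n)
    return base if m == n else 2 * base

counts = {c: num_positions(*c) for c in classes}
```

Summing the counts over all ten classes gives 100 distinct rectangles in the 4 x 4 grid.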
We allow a given hypothesis to be represented by up to four distinct rectangles (i.e., four symbols).

Having specified a language, the issue is now the length of the hypothesis code. The derivation above suggests that a codelength of -log P(x) be assigned to each symbol x, which corresponds to the so-called Shannon code. We therefore proceed to compute the Shannon codelengths for the rectangle alphabet of Table 1.^3

^1 Equivalently, a model class. The particular choice of language (model class) is obviously an important determinant of the ultimate hypothesis description length. We mention that the MDL analysis in this paper might be replaced by another theoretical approach, such as a Bayesian framework, although we have not pursued this possibility. We adopt the MDL formulation partly because its emphasis on representation (i.e., description) seems apt for a study of complexity.

^2 The class "m x n" contains all rectangles of dimension m x n and n x m.

^3 We use the noninteger value -log P(x) rather than the integer ceiling of -log P(x). Logs are base-2.

Table 1: Rectangle alphabet. The third and fourth columns show the probability that the source generates a given member of the class "m x n" and the corresponding codelength.

    Rectangle Class | Possible Locations | Probability  | Codelength
    1x1             | 16                 | (1/10)(1/16) | -log(1/160)
    1x2             | 24                 | (1/10)(1/24) | -log(1/240)
    1x3             | 16                 | (1/10)(1/16) | -log(1/160)
    1x4             | 8                  | (1/10)(1/8)  | -log(1/80)
    2x2             | 9                  | (1/10)(1/9)  | -log(1/90)
    2x3             | 12                 | (1/10)(1/12) | -log(1/120)
    2x4             | 6                  | (1/10)(1/6)  | -log(1/60)
    3x3             | 4                  | (1/10)(1/4)  | -log(1/40)
    3x4             | 4                  | (1/10)(1/4)  | -log(1/40)
    4x4             | 1                  | (1/10)(1/1)  | -log(1/10)

Computing these codelengths requires that we specify the probability mass function of a source, P(x). 
It is convenient for this purpose (and compatible with the subject's perspective) to imagine that the concepts in Figure 2 are produced by a "concept generator," an information source whose parameters are essentially unknown. A reasonable assumption is that the source randomly selects a rectangle class with uniform probability, and then selects an individual member of the chosen class also with uniform probability. Since there are 10 classes, the assumption regarding class selection places a prior on each rectangle class of P(m x n) = 1/10.

Moreover, the assumption of uniform within-class sampling means that in order to encode any individual rectangle, we need only consider the cardinality of the class to which it belongs. We now recall that the individual rectangles of the class "m x n" differ only in their positions within the 4 x 4 grid. Therefore, the cardinality of the class "m x n" is equal to the number of unique ways N_{m x n} in which an m x n or n x m rectangle can be selected from a 4 x 4 grid, where

    N_{m x n} = (5 - m)(5 - n)   if m = n
    N_{m x n} = 2(5 - m)(5 - n)  if m != n    (5)

Thus, the probability associated with an individual rectangle of class "m x n" is P(m x n) / N_{m x n}. The corresponding Shannon codelengths are shown next to these probabilities in Table 1. The description length of a particular hypothesis is the sum of the codeword lengths of all the rectangles (up to four) that make up the hypothesis.

5.2 The Likelihood Description Length, DL(D | H)

The second part of the two-part MDL code is the description of the concept with respect to the selected hypothesis, corresponding to the Bayes likelihood. There are several possible approaches to computing DL(D | H); we discuss one that is particularly straightforward.

We recall that a hypothesis H is composed of up to four rectangular regions. 
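Under these assumptions, the codelengths of Table 1 and the summed hypothesis description length follow directly. A brief sketch (function names are our own, not the paper's):

```python
from math import log2

GRID = 4
NUM_CLASSES = 10  # uniform prior of 1/10 per rectangle class

def num_positions(m, n):
    """Equation 5: placements of an m x n (or n x m) rectangle in the grid."""
    base = (GRID + 1 - m) * (GRID + 1 - n)
    return base if m == n else 2 * base

def rect_codelength(m, n):
    """Shannon codelength -log P(x), in bits, for one rectangle of class m x n."""
    m, n = sorted((m, n))
    p = (1 / NUM_CLASSES) * (1 / num_positions(m, n))
    return -log2(p)

def hypothesis_dl(rectangles):
    """DL(H): summed codelengths of the (up to four) rectangles in a hypothesis."""
    return sum(rect_codelength(m, n) for m, n in rectangles)
```

For example, a hypothesis consisting of a single 2x2 rectangle costs -log2((1/10)(1/9)) = log2(90), roughly 6.49 bits.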
Computing DL(D | H) therefore involves describing that portion of the positive category that falls within each rectangular hypothesis region. This is conceptually the same problem that we faced in computing DL(H) above, except that the region of interest for DL(H) was fixed at 4x4, while the regions for DL(D | H) can be of any dimension 4x4 and smaller. Guided by this analogy, we follow the procedure of the previous section to compute an appropriate probability mass function. Since DL(D | H) must capture just the positive squares in the hypothesis region (a maximum of four squares), the only rectangle classes needed in the alphabet are those of size four or smaller: 1x1, 1x2, 1x3, 1x4, and 2x2.

Table 2: Minimum description lengths for the 12 abstract concepts.

    Concept | MDL Codelength
    1       | 8.0768 bits
    2       | 8.3219 bits
    3       | 27.3236 bits
    4       | 17.8138 bits
    5       | 16.5216 bits
    6       | 14.4919 bits
    7       | 17.1357 bits
    8       | 22.5687 bits
    9       | 14.4919 bits
    10      | 15.0768 bits
    11      | 27.1946 bits
    12      | 28.1536 bits

6 Minimum Description Lengths for Experimental Concepts

Applying the MDL analysis above to the concepts in Figure 2 requires that we compute the total description length DL(D | H) + DL(H) corresponding to all viable hypotheses for each concept. 
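The flavor of this search can be conveyed with a simplified sketch. To keep it short we restrict attention to hypotheses that exactly tile the positive region, so the likelihood term DL(D | H) contributes nothing and only DL(H) is scored; the paper's full analysis also considers broader hypotheses, for which DL(D | H) matters. All names here are illustrative:

```python
from math import log2
from itertools import combinations

GRID = 4

def rect_codelength(h, w):
    """-log P(x), in bits, for a rectangle of height h, width w (see Table 1)."""
    m, n = sorted((h, w))
    positions = (GRID + 1 - m) * (GRID + 1 - n) * (1 if m == n else 2)
    return log2(10 * positions)

def all_rectangles():
    """Every axis-aligned rectangle in the grid, as (cells, codelength)."""
    for r in range(GRID):
        for c in range(GRID):
            for h in range(1, GRID - r + 1):
                for w in range(1, GRID - c + 1):
                    cells = frozenset((r + i, c + j)
                                      for i in range(h) for j in range(w))
                    yield cells, rect_codelength(h, w)

def cheapest_exact_tiling(positives, max_rects=4):
    """Min DL(H) over sets of up to four rectangles whose union is the positives."""
    rects = [(cells, dl) for cells, dl in all_rectangles() if cells <= positives]
    best = None
    for k in range(1, max_rects + 1):
        for combo in combinations(rects, k):
            if frozenset().union(*(cells for cells, _ in combo)) == positives:
                dl = sum(d for _, d in combo)
                if best is None or dl < best:
                    best = dl
    return best

# A 2 x 2 block of positives is tiled most cheaply by the single 2x2 rectangle.
best_dl = cheapest_exact_tiling(frozenset({(0, 0), (0, 1), (1, 0), (1, 1)}))
```

Restricting candidate rectangles to subsets of the positive region keeps the brute-force search over combinations small.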
The hypothesis H corresponding to the shortest total codelength DL(D | H) + DL(H) for each concept is the MDL hypothesis.^4 The MDL hypotheses for all 12 concepts are shown in Table 2 along with the corresponding minimum codelengths. It can be observed that while for some concepts the MDL hypothesis precisely conforms to the true positive category (meaning that almost all of the concept information is carried in the hypothesis code), for the majority of concepts the MDL hypothesis is broader than the true category region (meaning that the concept information is distributed between the hypothesis and likelihood codes).

^4 Note that the MDL hypothesis is not in general the most compact hypothesis, i.e., the hypothesis for which DL(H) is a minimum. Rather, the MDL hypothesis is the one for which the sum DL(D | H) + DL(H) is minimum.

7 Results

For each game played by the subject (i.e., each concept in Figure 2), an overall measure of performance (d') is computed.^5 Figure 4 shows performance for all subjects and all concepts as a function of the concept complexities (MDL codelengths) in Table 2. There is an evident decrease in performance with increasing complexity, which a regression analysis shows to be highly significant (R^2 = .384, F(1,166) = 103.375, p < .000001), meaning that the linear trend in the plot is very unlikely to be a statistical accident. Thus, the MDL complexity predicts the subjective difficulty of learning across a broad range of concepts.

[Figure 4 plot: Performance (d') vs. Complexity, DL(H) + DL(D|H)]

Figure 4: Performance vs. complexity for all 14 subjects. 
The d' performance for each concept is indicated by a '+' and the mean d' for each concept is indicated by an 'o'.

We mention that the MDL approach described here can be further modified to make "real-time" predictions of how subjects will categorize each new stimulus. In the most simplistic approach, the prediction for each new stimulus x is made based on the MDL hypothesis prevailing at the time that stimulus is observed. Correlation between this MDL prediction and the subject's actual decision is found to be highly significant (p <= .002) for each of the 12 concept types. The Pearson r statistics are given below:

    Concept #: 1    2    3    4    5    6    7    8    9    10   11   12
    Pearson r: .46  .19  .47  .18  .51  .20  .18  .14  .34  .32  .32  .05

Figure 5 illustrates the behavior of the real-time MDL algorithm. Simulations for a variety of data sets can be found at http://ruccs.rutgers.edu/~dfass/mdlmovies.html.

[Figure 5 panels: Step 7, Step 9, Step 19, Step 59, Step 113, Step 169, Step 190]

Figure 5: Real-time MDL hypothesis evolution for actual Concept 11 data. As the size of the data set grows beyond 150, there is oscillation between the one-rectangle (2x4) hypothesis shown in Step 169 and the two-rectangle (1x3) hypothesis shown in Step 190.

^5 d' (discriminability) gives a measure of subjects' intrinsic capacity to discriminate categories, i.e., one that is independent of their criterion for responding "positive" [14].

8 Conclusions

As discussed above, MDL bears a tight relationship with Bayesian inference, and hence serves as a reasonable basis for a theory of learning. The data presented above suggest that human learners are indeed guided by something very much like Rissanen's principle when classifying objects. 
While it is premature to conclude that humans construct anything precisely corresponding to the two-part code of Equation 4, it seems likely that they employ some closely related complexity-minimization principle, together with an associated "cognitive code" still to be discovered. This finding is consistent with many earlier observations of minimum principles guiding human inference, especially in perception (e.g., the Gestalt principle of Prägnanz). Moreover, our findings suggest a principled approach to predicting the subjective difficulty of concepts defined over continuous features. As we had previously found with Boolean concepts, subjective difficulty correlates with intrinsic complexity: that which is incompressible is, in turn, incomprehensible. The MDL approach is an elegant framework in which to make this observation rigorous and concrete, and one which apparently accords well with human performance.

Acknowledgments

This research was supported by NSF SBR-9875175.

References

[1] Nosofsky, R. M., "Exemplar-based accounts of relations between classification, recognition, and typicality," Journal of Experimental Psychology: Learning, Memory, and Cognition, Vol. 14, No. 4, 1988, pp. 700-708.

[2] Ashby, F. G. and Alfonso-Reese, L. A., "Categorization as probability density estimation," Journal of Mathematical Psychology, Vol. 39, 1995, pp. 216-233.

[3] Anderson, J. R., "The adaptive nature of human categorization," Psychological Review, Vol. 98, No. 3, 1991, pp. 409-429.

[4] Tenenbaum, J. B., "Bayesian modeling of human concept learning," Advances in Neural Information Processing Systems, edited by M. S. Kearns, S. A. Solla, and D. A. Cohn, Vol. 11, MIT Press, Cambridge, MA, 1999.

[5] Nosofsky, R. M., "Optimal performance and exemplar models of classification," Rational Models of Cognition, edited by M. Oaksford and N. Chater, chap. 
11, Oxford University Press, Oxford, 1998, pp. 218-247.

[6] Nosofsky, R. M., "Further tests of an exemplar-similarity approach to relating identification and categorization," Perception and Psychophysics, Vol. 45, 1989, pp. 279-290.

[7] Feldman, J., "The structure of perceptual categories," Journal of Mathematical Psychology, Vol. 41, No. 2, 1997, pp. 145-170.

[8] Shepard, R. N., Hovland, C. I., and Jenkins, H. M., "Learning and memorization of classifications," Psychological Monographs: General and Applied, Vol. 75, No. 13, 1961, pp. 1-42.

[9] Feldman, J., "Minimization of Boolean complexity in human concept learning," Nature, Vol. 407, 2000, pp. 630-632.

[10] Rissanen, J., "Modeling by shortest data description," Automatica, Vol. 14, 1978, pp. 465-471.

[11] Li, M. and Vitanyi, P., An Introduction to Kolmogorov Complexity and Its Applications, Springer, New York, 2nd ed., 1997.

[12] Pothos, E. M. and Chater, N., "Categorization by simplicity: A minimum description length approach to unsupervised clustering," Similarity and Categorization, edited by U. Hahn and M. Ramscar, chap. 4, Oxford University Press, Oxford, 2001, pp. 51-72.

[13] Myung, I. J., "Maximum entropy interpretation of decision bound and context models of categorization," Journal of Mathematical Psychology, Vol. 38, 1994, pp. 335-365.

[14] Wickens, T. D., Elementary Signal Detection Theory, Oxford University Press, Oxford, 2002.
", "award": [], "sourceid": 2252, "authors": [{"given_name": "David", "family_name": "Fass", "institution": null}, {"given_name": "Jacob", "family_name": "Feldman", "institution": null}]}