{"title": "Constraints on Adaptive Networks for Modeling Human Generalization", "book": "Advances in Neural Information Processing Systems", "page_first": 2, "page_last": 10, "abstract": null, "full_text": "2 \n\nCONSTRAINTS ON ADAPTIVE NETWORKS \nFOR MODELING HUMAN GENERALIZATION \n\nM. Pavel \n\nMark A. Gluck \n\nVan Henkle \n\nDepartm\u00a31Il of Psychology \n\nStanford University \nStanford. CA 94305 \n\nABSTRACT \n\nThe potential of adaptive networks to learn categorization rules and to \nmodel human performance is studied by comparing how natural and \nartificial systems respond to new inputs, i.e., how they generalize. Like \nhumans, networks can learn a detenninistic categorization task by a \nvariety of alternative individual solutions. An analysis of the con(cid:173)\nstraints imposed by using networks with the minimal number of hidden \nunits shows that this \"minimal configuration\" constraint is not \nsufficient to explain and predict human performance; only a few solu(cid:173)\ntions were found to be shared by both humans and minimal adaptive \nnetworks. A further analysis of human and network generalizations \nindicates that initial conditions may provide important constraints on \ngeneralization. A new technique, which we call \"reversed learning\", \nis described for finding appropriate initial conditions. \n\nINTRODUCTION \n\nWe are investigating the potential of adaptive networks to learn categorization tasks and \nto model human performance. In particular we have studied how both natural and \nartificial systems respond to new inputs, that is, how they generalize. In this paper we \nfirst describe a computational technique to analyze generalizations by adaptive networks. \nFor a given network structure and a given classification problem, the technique \nenumerates all possible network solutions to the problem. We then report the results of \nan empirical study of human categorization learning. 
The generalizations of human subjects are compared to those of adaptive networks. A cluster analysis of both human and network generalizations indicates significant differences between human performance and possible network behaviors. Finally, we examine the role of the initial state of a network in biasing the solutions found by the network. Using data on the relations between human subjects' initial and final performance during training, we develop a new technique, called \"reversed learning\", which shows some potential for modeling human learning processes using adaptive networks. The scope of our analyses is limited to generalizations in deterministic pattern classification (categorization) tasks. \n\nThe basic difficulty in generalization is that there exist many different classification rules (\"solutions\") that correctly classify the training set but categorize novel objects differently. The number and diversity of possible solutions depend on the language defining the pattern recognizer. However, additional constraints can be used in conjunction with many types of pattern categorizers to eliminate some, hopefully undesirable, solutions. \n\nOne typical way of introducing additional constraints is to minimize the representation. For example, minimizing the number of equations and parameters in a mathematical expression, or the number of rules in a rule-based system, would ensure that some identification maps are not computable. In the case of adaptive networks, minimizing the size of the network, which reduces the number of possible encoded functions, may result in improved generalization performance (Rumelhart, 1988). \n\nThe critical theoretical and applied questions in pattern recognition involve characterization and implementation of desirable constraints. 
In the first part of this paper we describe an analysis of adaptive networks that characterizes the solution space for any particular problem. \n\nANALYSES OF ADAPTIVE NETWORKS \n\nFeed-forward adaptive networks considered in this paper are defined as directed graphs with linear threshold units (LTUs) as nodes and with edges labeled by real-valued weights. The output, or activation, of a unit is determined by a monotonic nonlinear function of a weighted sum of the activations of all units whose edges terminate on that unit. There are three types of units within a feed-forward layered architecture: (1) input units, whose activity is determined by external input; (2) output units, whose activity is taken as the response; and (3) the remaining units, called hidden units. For the sake of simplicity our discussion is limited to objects represented by binary-valued vectors. \n\nA fully connected feed-forward network with an unlimited number of hidden units can compute any boolean function. Such a general network, therefore, provides no constraints on the solutions, and additional constraints must be imposed for the network to prefer one generalization over another. One such constraint is minimizing the size of the network. In order to explore the effect of minimizing the number of hidden units we first identify the minimal network architecture and then examine its generalizations. \n\nMost of the results in this area have been limited to finding bounds on the expected number of possible patterns that could be classified by a given network (e.g., Cover, 1965; Volper and Hampson, 1987; Valiant, 1984; Baum & Haussler, 1989). The bounds found by these researchers hold for all possible categorizations and are, therefore, too broad to be useful for the analysis of particular categorization problems. 
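As a concrete instance of the linear threshold units defined above, the following sketch (an illustrative example with hand-picked weights, not taken from the paper) wires two hidden LTUs, one computing OR and one AND of two binary inputs, into an output LTU that computes their exclusive or:

```python
def ltu(weights, bias, inputs):
    # Linear threshold unit: fire (1) iff the weighted input sum plus bias is positive.
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if s > 0 else 0

def xor_net(x1, x2):
    # Hidden layer: h1 = OR(x1, x2), h2 = AND(x1, x2).
    h1 = ltu([1.0, 1.0], -0.5, [x1, x2])
    h2 = ltu([1.0, 1.0], -1.5, [x1, x2])
    # Output unit: fires iff h1 and not h2, i.e. XOR of the inputs.
    return ltu([1.0, -1.0], -0.5, [h1, h2])
```

The two hidden units transform the inputs so that the output unit's dichotomization, which is not linearly separable in the input space, becomes linearly separable in the hidden representation.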
\n\nTo determine the generalization behavior for a particular network architecture, a specific categorization problem, and a training set, it is necessary to find all possible solutions and the corresponding generalizations. To do this we used a computational (not a simulation) procedure developed by Pavel and Moore (1988) for finding minimal networks solving specific categorization problems. Pavel and Moore (1988) defined two network solutions to be different if at least one hidden unit categorized at least one object in the training set differently. Using this definition their algorithm finds all possible different solutions. Because finding network solutions is NP-complete (Judd, 1987), for larger problems Pavel and Moore used a probabilistic version of the algorithm to estimate the distribution of generalization responses. \n\nOne way to characterize the constraints on generalization is in terms of the number of possible solutions. A larger number of possible solutions indicates that generalizations will be less predictable. The critical result of the analysis is that, even for minimal networks, the number of different network solutions is often quite large. Moreover, the number of solutions increases rapidly with increases in the number of hidden units. The apparent lack of constraints can also be demonstrated by finding the probability that a network with a randomly selected hidden layer can solve a given categorization problem. That is, suppose that we select n different hidden units, each unit representing a linear discriminant function. The activations of these random hidden units can be viewed as a transformation of the input patterns. We can then ask what is the probability that an output unit can be found to perform the desired dichotomization. A typical example of a result of this analysis is shown in Figure 1 for the three-dimensional (3D) parity problem. 
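The random-hidden-layer analysis just described can be approximated numerically. The sketch below is a Monte Carlo reconstruction of the idea, not the authors' linear-programming procedure: it samples n random linear threshold units, maps the eight 3-bit patterns through them, and checks with a bounded perceptron run whether some output unit can separate the even-parity patterns from the odd ones (the iteration bound makes the separability check conservative):

```python
import random

def sample_hidden_layer(n, dim, rng):
    # Each hidden unit is a random linear discriminant: weights plus a bias.
    return [([rng.gauss(0, 1) for _ in range(dim)], rng.gauss(0, 1))
            for _ in range(n)]

def activations(layer, x):
    # Binary activations of all hidden units for one input pattern.
    return [1 if sum(w * xi for w, xi in zip(ws, x)) + b > 0 else 0
            for ws, b in layer]

def separable(points, labels, epochs=200):
    # Bounded perceptron run: converges iff the labeled points are
    # linearly separable, so a clean pass means an output unit exists.
    dim = len(points[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        errors = 0
        for p, t in zip(points, labels):
            y = 1 if sum(wi * pi for wi, pi in zip(w, p)) + b > 0 else 0
            if y != t:
                errors += 1
                for i in range(dim):
                    w[i] += (t - y) * p[i]
                b += (t - y)
        if errors == 0:
            return True
    return False

def parity_success_rate(n_hidden, trials=200, seed=0):
    # Fraction of random hidden layers for which the 3D parity labels
    # become linearly separable in the hidden representation.
    rng = random.Random(seed)
    patterns = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
    labels = [(a + b + c) % 2 for a, b, c in patterns]
    hits = 0
    for _ in range(trials):
        layer = sample_hidden_layer(n_hidden, 3, rng)
        hidden = [activations(layer, p) for p in patterns]
        if separable(hidden, labels):
            hits += 1
    return hits / trials
```

Under this sketch the estimated success probability rises steeply as hidden units are added, mirroring the qualitative shape of the solid curve in Figure 1.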
In the minimal configuration involving three hidden units there were 62 different solutions to the 3D parity problem. The rapid increase in probability (the high slope of the curve in Figure 1) indicates that adding a few more hidden units rapidly increases the probability that a random hidden layer will solve the 3D parity problem. \n\nFigure 1. The proportion of solutions to the 3D parity problem (solid line) and the experimental task (dashed line) as a function of the number of hidden units. \n\nThe results of a more detailed analysis of the generalization performance of the minimal networks will be discussed following a description of a categorization experiment with human subjects. \n\nHUMAN CATEGORIZATION EXPERIMENT \n\nIn this experiment human subjects learned to categorize objects defined by four-dimensional binary vectors. Of the 2^4 = 16 possible objects, subjects were trained to classify a subset of 8 objects into two categories of 4 objects each. The specific assignment of objects into categories was patterned after Medin et al. (1982) and is shown in Figure 2. Eight of the patterns are designated as a training set and the remaining eight comprise the test set. The assignment of the patterns in the training set into two categories was such that there were many combinations of rules that could be used to correctly perform the categorization. For example, the first two dimensions could be used with one other dimension. The training patterns could also be categorized on the basis of an exclusive or (XOR) of the last two dimensions. 
The type of solution obtained by a human subject could only be determined by examining responses to the test set as well as the training set. \n\nFigure 2. Patterns to be classified (adapted from Medin et al., 1982). Writing each pattern as x1 x2 x3 x4, the training set consisted of category A: 1111, 1100, 0111, 1000, and category B: 0010, 0001, 1010, 0101; the test set consisted of the remaining eight patterns (0000, 0011, 0100, 0110, 1001, 1011, 1101, 1110), whose category labels were unknown to the subjects. \n\nIn the actual experiments, subjects were asked to perform a medical diagnosis for each pattern of four symptoms (dimensions). The experimental procedure is described here only briefly because the details of this experiment have been reported elsewhere (Pavel, Gluck, & Henkle, 1988). Each of the patterns was presented serially in a randomized order. Subjects responded with one of the categories and then received feedback. The training of each individual continued until he reached a criterion (responding correctly to 32 consecutive stimuli) or until each pattern had been presented 32 times. The data reported here are based on 78 subjects, half (39) of whom learned the task to criterion and half of whom did not. \n\nFollowing the training phase, subjects were tested using all 16 possible patterns. The results of the test phase enabled us to determine the generalizations performed by the subjects. Subjects' generalizations were used to estimate the \"functions\" that they may have been using. For example, of the 39 criterion subjects, 15 used a solution that was consistent with the exclusive-or (XOR) of the dimensions x3 and x4. \n\nWe use \"response profiles\" to graph responses for an ensemble of functions, in this case for a group of subjects. A response profile represents the probability of assigning each pattern to category \"A\". For example, the response profile for the XOR solution is shown in Figure 3A. 
For convenience we define the responses to the test set as the \"generalization profile\". The response profile of all subjects who reached the criterion is shown in Figure 3B. The responses of our criterion subjects to the training set were basically identical and correct. The distribution of subjects' generalization profiles, reflected in the overall generalization profile, indicates considerable individual differences. \n\nFigure 3. (A) Response profile of the XOR solution, and (B) the proportion of \"A\" responses to all patterns for human subjects (dark bars) and minimal networks (light bars). The lower 8 patterns are from the training set (0101, 1010, 0001, 0010, 1000, 0111, 1100, 1111) and the upper 8 from the test set (1001, 0110, 1101, 1110, 1011, 0100, 0011, 0000). \n\nMODELING THE RESPONSE PROFILE \n\nOne of our goals is to model subjects' distribution of categorizations as represented by the response profile in Figure 3B. We considered three natural approaches to such modeling: (1) statistical/proximity models, (2) minimal disjunctive normal forms (DNF), and (3) minimal two-layer networks. \n\nThe statistical approach is based on the assumption that the response profile over subjects represents the probability of categorizations performed by each subject. Our data are not consistent with that assumption because each subject appeared to behave deterministically. The second approach, using the minimal DNF, is also not a good candidate because there are only four such solutions and the response profile over those solutions differs considerably from that of the subjects. 
Turning to the adaptive network solutions, we found all the solutions using the linear programming technique described above (Pavel & Moore, 1988). The minimal two-layer adaptive network capable of solving the training set problem consisted of two hidden units. The proportion of solutions as a function of the number of hidden units is shown in Figure 1 by the dashed line. \n\nFor the minimal network there were 18 different solutions. These 18 solutions had 8 different individual generalization profiles. Assuming that each of the 18 network solutions is equally likely, we computed the generalization profile for the minimal network shown in Figure 3B. The response profile for the minimal network represents the probability that a randomly selected minimal network will assign a given pattern to category \"A\". Even without statistical testing we can conclude that the generalization profiles for humans and networks are quite different. It is possible, however, that humans and minimal networks obtain similar solutions and that the differences in the average responses are due to the particular statistical sampling assumption used for the minimal networks (i.e., that each solution is equally likely). In order to determine the overlap of solutions we examined the generalization profiles in more detail. \n\nCLUSTERING ANALYSIS OF GENERALIZATION PROFILES \n\nTo analyze the similarity in solutions we defined a metric on generalization profiles. The Hamming distance between two profiles is equal to the number of patterns that are categorized differently. For example, the distance between the generalization profiles \"A A B A B B B B\" and \"A A B B B B A B\" is equal to two, because the two profiles differ only on the fourth and seventh patterns. 
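The profile metric above is straightforward to compute. In the sketch below (illustrative only), profiles are encoded as 0/1 sequences with category A as 0 and category B as 1, and the worked example from the text is reproduced:

```python
def hamming(profile_p, profile_q):
    # Number of patterns on which two generalization profiles disagree.
    assert len(profile_p) == len(profile_q)
    return sum(1 for a, b in zip(profile_p, profile_q) if a != b)

# The example from the text: A A B A B B B B vs. A A B B B B A B,
# encoded with A = 0 and B = 1; they disagree on the 4th and 7th patterns.
p = [0, 0, 1, 0, 1, 1, 1, 1]
q = [0, 0, 1, 1, 1, 1, 0, 1]
```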
Figure 4 shows the results of a cluster analysis using a hierarchical clustering procedure that maximizes the average distance between clusters. \n\nFigure 4. Results of hierarchical clustering for human (left) and network (right) generalization profiles. \n\nIn this graph the average distance between any two clusters is shown by the value of the lowest common node in the tree. The clustering analysis indicates that humans and networks obtained widely different generalization profiles. Only three generalization profiles were found to be common to humans and networks. This number of common generalizations is to be expected by chance if the human and network solutions are independent. Thus, even if there exists a learning algorithm that approximates the human probability distribution of responses, the minimal network would not be a good model of human performance in this task. \n\nIt is clear from the previously described network analysis that somewhat larger networks with different constraints could account for human solutions. In order to characterize the additional constraints, we examined subjects' individual strategies to find out why individual subjects obtained different solutions. 
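One common reading of the clustering procedure above is average-linkage agglomerative clustering: start from singleton clusters and repeatedly merge the two clusters whose average inter-cluster Hamming distance is smallest, recording that distance at each merge; the recorded merge heights then play the role of the node values in the dendrogram of Figure 4. A minimal pure-Python sketch under that assumption:

```python
def hamming(p, q):
    # Number of positions on which two 0/1 profiles disagree.
    return sum(1 for a, b in zip(p, q) if a != b)

def average_linkage(profiles):
    # Agglomerative clustering; returns the merge history as
    # (cluster_i, cluster_j, average_distance) triples.
    clusters = [[p] for p in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = len(clusters[i]) * len(clusters[j])
                d = sum(hamming(p, q)
                        for p in clusters[i]
                        for q in clusters[j]) / pairs
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges
```

On four toy profiles forming two tight pairs, the first merges occur at height 1 and the final merge at the larger between-group average, reproducing the qualitative dendrogram structure.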
\n\nANALYSIS OF HUMAN LEARNING STRATEGIES \n\nHuman learning strategies that lead to preferences for particular solutions may best be modeled in networks by imposing constraints and providing hints (Abu-Mostafa, 1989). These include choosing the network architecture and a learning rule, constraining connectivity, and specifying initial conditions. We will focus on the specification of initial conditions. \n\nFigure 5. The number of consistent or non-stable responses (black) and the number of stable incorrect responses (light) for XOR and non-XOR criterion subjects, and for those who never reached criterion. \n\nOur effort to examine initial conditions was motivated by large differences in learning curves (Pavel et al., 1988) between subjects who obtained the XOR solutions and those who did not. The subjects who did not obtain the XOR solutions would perform much better on some patterns (e.g., 0001) than the XOR subjects, but worse on other patterns (e.g., 1000). We concluded that these subjects, during the first few trials, discovered rules that categorized most of the training patterns correctly but failed on one or two training patterns. \n\nWe examined the sequences of subjects' responses to see how well they adhered to \"incorrect\" rules. We designated a response to a pattern as stable if the individual responded the same way to that pattern at least four times in a row. We designated a response as consistent if the response was stable and correct. The results of the analysis are shown in Figure 5. These results indicate that the subjects who eventually achieved the XOR solution were less likely to generate stable incorrect responses. 
Another important result is that those subjects who never learned the correct responses to the training set were not responding randomly. Rather, they were systematically using incorrect rules. On the basis of these results, we conclude that subjects' initial strategies may be important determinants of their final solutions. \n\nREVERSED LEARNING \n\nFor simplicity we identify subjects' initial conditions with their responses on the first few trials. An important theoretical question is whether or not it is possible to find a network structure, initial conditions, and a learning rule such that the network can represent both the initial and final behavior of the subject. In order to study this problem we developed a technique we call \"reversed learning\". It is based on a perturbation analysis of feed-forward networks. We use the fact that the error surface in a small neighborhood of a minimum is well approximated by a quadratic surface. Hence, a well-behaved gradient descent procedure with a starting point in the neighborhood of the minimum will find that minimum. \n\nThe reversed learning procedure consists of three phases. (1) A network is trained to the final desired state of a particular individual, using both the training and the test patterns. (2) Using only the training patterns, the network is then trained to achieve the initial state of that individual subject closest to the desired final state. (3) The network is trained with only the training patterns and the solution is compared to the subject's response profiles. Our preliminary results indicate that this procedure leads in many cases to initial conditions that favor the desired solutions. We are currently investigating conditions for finding the optimal initial states. 
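The three phases can be sketched with a small differentiable stand-in for the LTU network: a 4-2-1 sigmoid net trained by online gradient descent on squared error. Everything below is an illustrative reconstruction under stated assumptions, not the authors' implementation: the subject's hypothetical final profile is XOR of the last two dimensions, the hypothetical initial rule responds with the first dimension alone, and the training subset is a stand-in balanced on that XOR (none of these are taken from the paper's data):

```python
import math
import random

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_net(rng):
    # Weights: W1 (2 hidden units x 4 inputs), b1 (2), w2 (2), b2 (1).
    return ([[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(2)],
            [0.0, 0.0],
            [rng.uniform(-0.5, 0.5) for _ in range(2)],
            [0.0])

def forward(net, x):
    W1, b1, w2, b2 = net
    h = [sig(sum(w * xi for w, xi in zip(W1[j], x)) + b1[j]) for j in range(2)]
    y = sig(sum(w2[j] * h[j] for j in range(2)) + b2[0])
    return h, y

def train(net, data, lr=0.5, epochs=2000):
    # Plain online backprop on squared error; mutates the net in place.
    W1, b1, w2, b2 = net
    for _ in range(epochs):
        for x, t in data:
            h, y = forward(net, x)
            dy = (y - t) * y * (1.0 - y)
            for j in range(2):
                dh = dy * w2[j] * h[j] * (1.0 - h[j])
                w2[j] -= lr * dy * h[j]
                for i in range(4):
                    W1[j][i] -= lr * dh * x[i]
                b1[j] -= lr * dh
            b2[0] -= lr * dy
    return net

def sq_error(net, data):
    return sum((forward(net, x)[1] - t) ** 2 for x, t in data)

patterns = [(a, b, c, d) for a in (0, 1) for b in (0, 1)
            for c in (0, 1) for d in (0, 1)]
# Hypothetical training subset, balanced on XOR of the last two dimensions.
training = [(1, 1, 1, 1), (1, 1, 0, 0), (0, 1, 1, 1), (1, 0, 0, 0),
            (0, 0, 1, 0), (0, 0, 0, 1), (1, 0, 1, 0), (0, 1, 0, 1)]
final_profile = [(x, float(x[2] ^ x[3])) for x in patterns]   # XOR(x3, x4)
initial_rule = [(x, float(x[0])) for x in training]           # respond by x1
correct = [(x, float(x[2] ^ x[3])) for x in training]

rng = random.Random(0)
# Phase 1: fit the subject's final behavior on training + test patterns.
net = train(make_net(rng), final_profile)
# Phase 2: from that solution, fit the subject's initial responses on the
# training patterns only, yielding candidate initial weights near the
# basin of the final solution.
net = train(net, initial_rule)
# Phase 3: train from those initial weights on the correct training labels
# and compare the resulting generalization with the subject's profile.
net = train(net, correct)
```

The design choice in phase 2 mirrors the quadratic-surface argument above: starting the search for initial weights from the final solution keeps the recovered initial state close to that solution's basin of attraction.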
\n\nCONCLUSION \n\nThe main goal of this study was to examine constraints imposed by humans (experimentally) and networks (via linear programming) on the learning of simple binary categorization tasks. We characterize the constraints by analyzing responses to novel stimuli. We showed that, like humans, networks learn the deterministic categorization task and find many, very different, individual solutions. Thus adaptive networks are better models than statistical models and DNF rules. The constraints imposed by minimal networks, however, appear to differ from those imposed by human learners in that there are only a few solutions shared between humans and adaptive networks. After a detailed analysis of the human learning process we concluded that initial conditions may provide important constraints. In fact we consider the set of initial conditions as powerful \"hints\" (Abu-Mostafa, 1989) which reduce the number of potential solutions without reducing the complexity of the problem. We demonstrated the potential effectiveness of these constraints using a perturbation technique, which we call reversed learning, for finding appropriate initial conditions. \n\nAcknowledgements \n\nThis work was supported by research grants from the National Science Foundation (BNS-86-18049) to Gordon Bower and Mark Gluck, and (IST-8511589) to M. Pavel, and by a grant from NASA Ames (NCC 2-269) to Stanford University. We thank Steve Sloman and Bob Rehder for useful discussions and their comments on this draft. \n\nReferences \n\nAbu-Mostafa, Y. S. Learning by example with hints. NIPS, 1989. \n\nBaum, E. B., & Haussler, D. What size net gives valid generalization? NIPS, 1989. \n\nCover, T. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, EC-14(3), 326-334. 
\n\nJudd, J. S. Complexity of connectionist learning with various node functions. Presented at the First IEEE International Conference on Neural Networks, San Diego, June 1987. \n\nMedin, D. L., Altom, M. W., Edelson, S. M., & Freko, D. (1982). Correlated symptoms and simulated medical classification. Journal of Experimental Psychology: Learning, Memory, & Cognition, 8(1), 37-50. \n\nPavel, M., Gluck, M. A., & Henkle, V. Generalization by humans and multi-layer adaptive networks. Submitted to the Tenth Annual Conference of the Cognitive Science Society, August 17-19, 1988. \n\nPavel, M., & Moore, R. T. (1988). Computational analysis of solutions of two-layer adaptive networks. APL Technical Report, Dept. of Psychology, Stanford University. \n\nValiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134-1142. \n\nVolper, D. J., & Hampson, S. E. (1987). Learning and using specific instances. Biological Cybernetics, 56. ", "award": [], "sourceid": 106, "authors": [{"given_name": "Mark", "family_name": "Gluck", "institution": null}, {"given_name": "M.", "family_name": "Pavel", "institution": null}, {"given_name": "Van", "family_name": "Henkle", "institution": null}]}