{"title": "An MCMC-Based Method of Comparing Connectionist Models in Cognitive Science", "book": "Advances in Neural Information Processing Systems", "page_first": 937, "page_last": 944, "abstract": "", "full_text": "An MCMC-Based Method of Comparing Connectionist Models in Cognitive Science\n\nWoojae Kim, Daniel J. Navarro\u2217, Mark A. Pitt, In Jae Myung\n\nDepartment of Psychology\nOhio State University\n{kim.1124, navarro.20, pitt.2, myung.1}@osu.edu\n\nAbstract\n\nDespite the popularity of connectionist models in cognitive science, their performance can often be difficult to evaluate. Inspired by the geometric approach to statistical model selection, we introduce a conceptually similar method to examine the global behavior of a connectionist model, by counting the number and types of response patterns it can simulate. The Markov Chain Monte Carlo-based algorithm that we constructed finds these patterns efficiently. We demonstrate the approach using two localist network models of speech perception.\n\n1 Introduction\n\nConnectionist models are popular in some areas of cognitive science, especially language processing. One reason for this is that they provide a means of expressing the fundamental principles of a theory in a readily testable computational form. For example, levels of mental representation can be mapped onto layers of nodes in a connectionist network. Information flow between levels is then defined by the types of connection (e.g., excitatory and inhibitory) between layers. The soundness of the theoretical assumptions is then evaluated by studying the behavior of the network in simulations and testing its predictions experimentally.\n\nAlthough this sort of modeling has enriched our understanding of human cognition, the consequences of the choices made in the design of a model can be difficult to evaluate.
While good simulation performance is assumed to support the model and its underlying principles, a drawback of this testing methodology is that it can obscure the role played by a model's complexity and other reasons why a competing model might simulate human data equally well.\n\nThese concerns are part and parcel of the well-known problem of model selection. A great deal of progress has been made in solving it for statistical models (i.e., those that can be described by a family of probability distributions [1, 2]). Connectionist models, however, are a computationally different beast. The current paper introduces a technique that can be used to assist in evaluating and choosing between connectionist models of cognition.\n\n\u2217Correspondence should be addressed to Daniel Navarro, Department of Psychology, Ohio State University, 1827 Neil Avenue Mall, Columbus OH 43210, USA. Telephone: (614) 292-1030, Facsimile: (614) 292-5601.\n\n2 A Complexity Measure for Connectionist Models\n\nThe ability of a connectionist model to simulate human performance well does not provide conclusive evidence that the network architecture is a good approximation to the human cognitive system that generated the data. For instance, it would be unimpressive if it turned out that the model could also simulate many non-human-like patterns. Accordingly, we need a \"global\" view of the model's behavior to discover all of the qualitatively different patterns it can simulate.\n\nA model's ability to reproduce diverse patterns of data is known as its complexity, an intrinsic property of a model that arises from the interaction between its parameters and functional form. For statistical models, it can be calculated by integrating the square root of the determinant of the Fisher information matrix over the parameter space of the model, and adding a term that is linear in the number of parameters.
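For orientation, this measure can be stated compactly. For a model with k parameters, n observations, and Fisher information matrix I(θ), Rissanen's formulation [1] gives

$$
C \;=\; \frac{k}{2}\,\ln\frac{n}{2\pi} \;+\; \ln \int \sqrt{\det I(\theta)}\; d\theta ,
$$

where the first term is the part linear in the number of parameters and the second, integral term is the geometric quantity discussed next.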
Although originally derived by Rissanen [1] from an algorithmic coding perspective, this measure is sometimes called the geometric complexity, because it is equal to the logarithm of the ratio of two Riemannian volumes. Viewed from this geometric perspective, the measure has an elegant interpretation as a count of the number of \"distinguishable\" distributions that a model can generate [3, 4]. Unfortunately, geometric complexity cannot be applied to connectionist models, because these models rarely possess a likelihood function, much less a well-defined Fisher information matrix. Also, in many cases a learning (i.e., model-fitting) algorithm for finding optimal parameter values is not proposed along with the model, further complicating matters.\n\nA conceptually simple solution to the problem, albeit a computationally demanding one, is first to discretize the data space in some properly defined sense and then to identify all of the data patterns a connectionist model can generate. This approach provides the desired global view of the model's capabilities, and its definition resembles that of geometric complexity: the complexity of a connectionist model is defined in terms of the number of discrete data patterns the model can produce. As such, this reparametrization-invariant complexity measure can be used for virtually all types of network models, provided that the discretization of the data space is both justifiable and meaningful.\n\nA challenge in implementing this solution lies in the enormity of the data space, which may contain a truly astronomical number of patterns. Only a small fraction of these might correspond to a model's predictions, so it is essential to use an efficient search algorithm, one that will find most or all of these patterns in a reasonable time.
We describe an algorithm that uses Markov Chain Monte Carlo (MCMC) to solve such problems. It is tailored to exploit the kinds of search spaces that we suspect are typical of localist connectionist models, and we evaluate its performance on two of them.\n\n3 Localist Models of Phoneme Perception\n\nA central issue in the field of human speech perception is how lexical knowledge influences the perception of speech sounds. That is, how does knowing the word you are hearing influence how you hear the smaller units that make up the word (i.e., its phonemes)? Two localist models have been proposed that represent opposing theoretical positions. Both models were motivated by different theoretical principles.\n\nFigure 1: Network architectures for TRACE (left) and MERGE (right). Arrows indicate excitatory connections between layers; lines with dots indicate inhibitory connections within layers.\n\nProponents of TRACE [5] argue for bi-directional communication between layers, whereas proponents of MERGE [6] argue against it. The models are shown schematically in Figure 1. Each contains two main layers. Phonemes are represented in the first layer and words in the second. Activation flows from the first to the second layer in both models. At the heart of the controversy is whether activation also flows in the reverse direction, directly affecting how the phonemic input is processed. In TRACE it can. In MERGE it cannot. Instead, the processing performed at the phoneme level in MERGE is split in two, with an input stage and a phoneme decision stage.
The second, lexical layer cannot directly affect phoneme activation. Instead, the two sources of information (phonemic and lexical) are integrated only at the phoneme decision stage.\n\nAlthough the precise details of the models are unnecessary for the purposes of this paper, it will be useful to sketch a few of their technical details. The parameters for the models (denoted \u03b8), of which TRACE has 7 and MERGE has 11, correspond to the strength of the excitatory and inhibitory connections between nodes, both within and between layers. The networks receive a continuous input, and stabilize at a final state after a certain number of cycles. In our formulation, a parameter set \u03b8 was considered valid only if the final state satisfied certain decision criteria (discussed shortly). Detailed descriptions of the models, including typical parameter values, are given by [5] and [6].\n\nDespite the differences in motivation, TRACE and MERGE are comparable in their ability to simulate key experimental findings [6], making it quite challenging, if not impossible, to distinguish between them experimentally. Yet surely the models are not identical? Is one more complex than the other? What are the functional differences between the two?\n\nIn order to address these questions, we consider data from experiments by [6] which are captured well by both models. In the experiments, monosyllabic words were presented in which the last phoneme from one word was partially replaced by one from another word (through digital editing) to create word blends that retained residual information about the identity of the phoneme from both words. The six types of blends are listed on the left of Table 1. Listeners had to categorize the last phoneme in one task (phoneme decision) and categorize the entire utterance as a word or a nonsense word in the other task (lexical decision).
The response choices in each task are listed in the table. Three response choices were used in lexical decision to test the models' ability to distinguish between words, not just words and nonwords. The asterisks in each cell indicate the responses that listeners chose most often. Both TRACE and MERGE can simulate this pattern of responses.\n\nTable 1: The experimental design. Asterisks denote human responses.\n\nCondition  Example      Phonemic Decision      Lexical Decision\n                        /b/  /g/  /z/  /v/     job  jog  nonword\nbB         JOb + joB     *                      *\ngB         JOg + joB     *                      *\nvB         JOv + joB     *                      *\nzZ         JOz + joZ               *                      *\ngZ         JOg + joZ               *                      *\nvZ         JOv + joZ               *                      *\n\nTable 2: Two sets of decision rules for TRACE and MERGE. The values shown correspond to activation levels of the appropriate decision node.\n\nPhoneme Decision\n  Choose /b/ if...        Weak: /b/ > 0.4 & others < 0.4   Strong: /b/ > 0.45 & others < 0.25 & (/b/ \u2212 max(others)) > 0.3\nLexical Decision\n  Choose \"job\" if...      Weak: job > 0.4 & jog < 0.4      Strong: job > 0.45 & jog < 0.25 & (job \u2212 jog) > 0.3\n  Choose \"nonword\" if...  Weak: both < 0.4                 Strong: both < 0.25 & abs(difference) < 0.15\n\nThe profile of response decisions (phoneme and lexical) over the six experimental conditions provides a natural definition of a data pattern that the model could produce, and the decision rules establish a natural (surjective) mapping from the continuous space of network states (of which each model can produce some subset) to the discrete space of data patterns. We applied two different sets of decision rules, listed in Table 2, and were interested in determining how many patterns (besides the human-like pattern) each model can generate.
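The weak decision rules in Table 2 are simple thresholding functions, so the mapping from final-state activations to a discrete response can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code; the function names and the dictionary-of-activations interface are assumptions:

```python
def phoneme_response(act, threshold=0.4):
    """Weak phoneme rule (Table 2): choose a phoneme if and only if its
    activation exceeds 0.4 while every competitor stays below 0.4."""
    winners = [p for p, a in act.items() if a > threshold]
    losers_ok = all(a < threshold for p, a in act.items() if p not in winners)
    if len(winners) == 1 and losers_ok:
        return winners[0]
    return None  # no valid decision: this parameter set is discarded


def lexical_response(job, jog, threshold=0.4):
    """Weak lexical rule (Table 2): pick the word whose node alone clears
    the threshold, or 'nonword' if neither node does."""
    if job > threshold and jog < threshold:
        return "job"
    if jog > threshold and job < threshold:
        return "jog"
    if job < threshold and jog < threshold:
        return "nonword"
    return None
```

A data pattern is then the 12-tuple of responses (one phoneme and one lexical decision per blend condition); a parameter set \u03b8 counts as valid only if no condition yields None.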
As previously discussed, these counts will serve as a measure of model complexity.\n\n4 The Search Algorithm\n\nThe search problem that we need to solve differs from the standard Monte Carlo counting problem. Ordinarily, Monte Carlo methods are used to discover how much of the search space is covered by some region, by counting how often co-ordinates are sampled from that region. In our problem, a high-dimensional parameter space has been partitioned into an unknown number of regions, with each region corresponding to a single data pattern. The task is to find all such regions, irrespective of their size. How do we solve this problem? Given the dimensionality of the space, brute-force searches are impossible. Simple Monte Carlo (SMC; i.e., uniform random sampling) will fail because it ignores the structure of the search space.\n\nFigure 2: A parameter space with \"grainy\" structure. Each region corresponds to a single data pattern that the model can generate. Regions vary in size, and small regions cluster together.\n\nThe spaces that we consider seem to possess three regularities, which we call a \"grainy\" structure, illustrated schematically in Figure 2. Firstly, on many occasions the network does not converge on a state that meets the decision criteria, so some proportion of the parameter space does not correspond to any data pattern. Secondly, the sizes of the regions vary a great deal. Some data patterns are elicited by a wide range of parameter values, whereas others can be produced only by a small range of values. Thirdly, small regions tend to cluster together. In these models, there are likely to be regions where the model consistently chooses the dominant phoneme and makes the correspondingly appropriate lexical decision. However, there will also be large regions in which the models always choose \"nonword\" irrespective of whether the stimulus is a word.
Along the borders between regions, however, there might be lots of smaller \"transition regions\", and these regions will tend to be near one another.\n\nThe consequence of this structure is that the size of the region in which the process is currently located provides extensive information about the number of regions that are likely to lie nearby. In a small region, there will probably be other small regions nearby, so a fine-grained search is required in order to find them. However, a fine-grained search process will get stuck in a large region, taking tiny steps when great leaps are required. Our algorithm exploits this structure by using MCMC to estimate a different parameter sampling distribution p(\u03b8|ri) for every region ri that it encounters, and then cycling through these distributions in order to sample parameter sets. The procedure can be reduced to three steps:\n\n1. Set i = 0, m = 0. Sample \u03b8 from p(\u03b8|r0), a uniform distribution over the space. If \u03b8 does not generate a valid data pattern, repeat Step 1.\n\n2. Set m = m + 1 and then i = m. Record the new pattern, and use MCMC to estimate p(\u03b8|ri).\n\n3. Sample \u03b8 from p(\u03b8|ri). If \u03b8 generates a new pattern, return to Step 2. Otherwise, set i = mod(i, m) + 1, and repeat Step 3.\n\nThe process of estimating p(\u03b8|ri) is a fairly straightforward application of MCMC [7]. We specify a uniform jumping distribution over a small hypersphere centered on the current point \u03b8 in the parameter space, accepting candidate points if and only if they produce the same pattern as \u03b8. After collecting enough samples, we calculate the mean and variance-covariance matrix for these observations, and use this to estimate an ellipsoid around the mean, as an approximation to the i-th region. However, since we want to find points in the bordering regions, the estimated ellipsoid is deliberately oversized.
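The region-estimation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `pattern_of` stands in for running the network and applying the decision rules, and the inflation factor of 1.5 is an invented tuning constant.

```python
import numpy as np


def estimate_region(theta0, pattern_of, n_samples=500, step=0.05,
                    inflate=1.5, rng=None):
    """Metropolis-style walk confined to one region: propose uniform jumps
    inside a small hypersphere and accept a candidate only if it produces
    the same pattern as the current point. Returns the (mean, covariance)
    of the visited points, with the covariance inflated so that samples
    from the fitted ellipsoid spill into neighboring regions."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    target = pattern_of(theta)
    samples = []
    while len(samples) < n_samples:
        # uniform proposal inside a hypersphere of radius `step`
        d = rng.normal(size=theta.size)
        d *= step * rng.uniform() ** (1.0 / theta.size) / np.linalg.norm(d)
        cand = theta + d
        if pattern_of(cand) == target:  # reject moves that leave the region
            theta = cand
        samples.append(theta.copy())    # record current state each iteration
    samples = np.array(samples)
    return samples.mean(axis=0), inflate * np.cov(samples.T)


def sample_ellipsoid(mean, cov, rng=None):
    """Draw a point uniformly from the ellipsoid defined by (mean, cov)."""
    rng = np.random.default_rng() if rng is None else rng
    d = mean.size
    u = rng.normal(size=d)
    u *= rng.uniform() ** (1.0 / d) / np.linalg.norm(u)  # uniform in unit ball
    L = np.linalg.cholesky(cov + 1e-9 * np.eye(d))       # jitter for stability
    return mean + L @ u
```

Because the covariance is inflated, draws from `sample_ellipsoid` concentrate near and just beyond the region's boundary in high dimensions, which is exactly where new (often small) regions are expected to lie.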
The sampling distribution p(\u03b8|ri) is simply a uniform distribution over the ellipsoid.\n\nUnlike SMC (or even a more standard application of MCMC), our algorithm has the desirable property that it focuses on each region in equal proportion, irrespective of its size. Not only that, because the parameter space is high-dimensional, the vast majority of the distribution p(\u03b8|ri) will actually lie near the edges of the ellipsoid: that is, the area just outside of the i-th region. Consequently, we search primarily along the edges of the regions that we have already discovered, paying closer attention to the small regions. The overall distribution p(\u03b8) is essentially a mixture distribution that assigns higher density to points known to lie near many regions.\n\n5 Testing the Algorithm\n\nIn the absence of analytic results, the algorithm was evaluated against standard SMC. The first test applied both to a simple toy problem possessing a grainy structure. Inside a hypercube [0, 1]^d, an assortment of large and small regions (also hypercubes) were defined using unevenly spaced grids so that all the regions neighbored each other (d ranged from 3 to 6). In higher dimensions (d \u2265 4), SMC did not find all of the regions. In contrast, the MCMC algorithm found all of the regions, and did so in a reasonable amount of time. Overall, the MCMC-based algorithm is slower than SMC at the beginning of the search, due to the time required for region estimation. However, the time required to learn the structure of the parameter space is time well spent, because the search becomes more efficient and successful, paying large dividends in time and accuracy in the end.\n\nAs a second test, we applied the algorithms to simplified versions of TRACE, constructed so that even SMC might work reasonably well. In one reduced model, for instance, only phoneme responses were considered.
In the other, only lexical responses were considered. Weak and strong constraints (Table 2) were imposed on both models. In all cases, MCMC found as many or more patterns than SMC, and all SMC patterns were among the MCMC patterns.\n\n6 Application to Models of Phoneme Perception\n\nNext we ran the search algorithm on the full versions of TRACE and MERGE, using both the strong and weak constraints (Table 2). The number of patterns discovered in each case is summarized in Figure 3. In this experimental design MERGE is more complex than TRACE, although the extent of this effect is somewhat dependent on the choice of constraints. When strong constraints are applied, TRACE (27 patterns) is nested within MERGE (67 patterns), which produces 148% more patterns. However, when these constraints are eased, the nesting relationship disappears, and MERGE (73 patterns) produces only 40% more patterns than TRACE (52 patterns). Nevertheless, it is noteworthy that the behavior of each is highly constrained, producing fewer than 100 of the 4^6 \u00d7 3^6 = 2,985,984 patterns available. Also, for both models (under both sets of constraints), the vast majority of the parameter space was occupied by only a few patterns.\n\nA second question of interest is whether each model's output veers far from human performance (Table 1). To answer this, we classified every data pattern in terms of the number of mismatches from the human-like pattern (from 0 to 12), and counted how frequently the model patterns fell into each class. The results, shown in Figure 4, are quite similar and orderly for both models. The choice of constraints had little effect, and in both cases the TRACE distribution (open circles) is a little closer to the human-like pattern than the MERGE distribution (closed circles). Even so, both models are remarkably human-like when considered in light of the distribution of all possible patterns (cross hairs).
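The classification of patterns by mismatch count amounts to computing a Hamming distance between 12-tuples of responses (six phoneme decisions followed by six lexical decisions). A minimal sketch, with the tuple encoding assumed for illustration:

```python
from collections import Counter


def mismatches(pattern, human):
    """Number of positions (out of 12) where a pattern departs from the
    human-like responses."""
    assert len(pattern) == len(human) == 12
    return sum(p != h for p, h in zip(pattern, human))


def mismatch_distribution(patterns, human):
    """Tally discovered patterns by their distance from the human-like
    pattern, as in the Figure 4 histograms."""
    return Counter(mismatches(p, human) for p in patterns)
```

Given the set of patterns found by the search, the distribution over distance classes is then a one-line tally.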
In fact, the probability is virtually zero that a \"random model\" (consisting of a random sample of patterns) would display such a low mismatch frequency.\n\nFigure 3: Venn diagrams showing the number of patterns discovered for both models under both types of constraint. [Weak constraints: 20 patterns unique to TRACE, 32 shared, 41 unique to MERGE; strong constraints: the 27 TRACE patterns nested within MERGE, plus 40 unique to MERGE.]\n\nFigure 4: Mismatch distributions for all four models plus the data space. The 0-point corresponds to the lone human-like pattern contained in all distributions. [x-axis: number of mismatches, 0-12; y-axis: proportion of patterns; series: Weak TRACE, Strong TRACE, Weak MERGE, Strong MERGE, All Patterns.]\n\nBuilding on this analysis, we looked for qualitative differences in the types of mismatches made by each model. Since the choice of constraints made no difference, Figure 5 shows the mismatch profiles under weak constraints. Both models produce no mismatches in some conditions (e.g., bB-phoneme identification, vZ-lexical decision) and many in others (e.g., gB-lexical decision). Interestingly, TRACE and MERGE produce similar mismatch profiles for lexical decision, and a comparable number of mismatches (108 vs. 124). However, striking qualitative differences are evident for phoneme decisions, with MERGE producing mismatches in conditions that TRACE does not (e.g., vB, vZ). When the two graphs are compared, an asymmetry is evident in the frequency of mismatches across tasks: MERGE makes phonemic mismatches with about the same frequency as lexical errors (139 vs. 124), whereas TRACE does so less than half as often (56 vs.
108).\n\nThe mismatch asymmetry accords nicely with the architectures shown in Figure 1. The two models make lexical decisions in an almost identical manner: phonemic information feeds into the lexical decision layer, from which a decision is made. It should then come as no surprise that lexical processing in TRACE and MERGE is so similar. In contrast, phoneme processing is split between two layers in MERGE but confined to one in TRACE. The two layers dedicated to phoneme processing provide MERGE an added degree of flexibility (i.e., complexity) in generating data patterns. This shows up in many ways, not just in MERGE's ability to produce mismatches in more conditions than TRACE. For example, these mismatches yield a wider range of phoneme responses. Shown above each bar in Figure 5 is the phoneme that was misrecognized in the given condition. TRACE only misrecognized the phoneme as /g/, whereas MERGE misrecognized it as /g/, /z/, and /v/.\n\nThese analyses describe a few consequences of dividing processing between two layers, as in MERGE, and in doing so creating a more complex model. On the basis of performance (i.e., fit) alone, this additional complexity is unnecessary for modeling phoneme perception, because the simpler architecture of TRACE simulates human data as well as MERGE.
If MERGE's design is to be preferred, the additional complexity must be justified for other reasons [6].\n\nFigure 5: Mismatch profiles for both TRACE and MERGE when the weak constraints are applied. Conditions are denoted by their phoneme blend. [Bars show the number of mismatches per condition (bB, gB, vB, zZ, gZ, vZ) for the phoneme and lexical tasks; the label above each bar gives the misrecognized phoneme or the erroneous lexical response (e.g., /g/, nw, jog).]\n\n7 Conclusions\n\nThe results of this preliminary evaluation suggest that the MCMC-based algorithm is a promising method for comparing connectionist models. Although it was developed to compare localist models like TRACE and MERGE, it may be broadly applicable whenever the search space exhibits this \"grainy\" structure. Indeed, the algorithm could be a general tool for designing, comparing, and evaluating connectionist models of human cognition. Plans are underway to extend the approach to other experimental designs, dependent measures (e.g., reaction time), and models.\n\nAcknowledgements\n\nThe authors were supported by NIH grant R01-MH57472 awarded to IJM and MAP. DJN was also supported by a grant from the Office of Research at OSU. We thank Nancy Briggs, Cheongtag Kim and Yong Su for helpful discussions.\n\nReferences\n\n[1] Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42, 40-47.\n\n[2] Rissanen, J. (2001). Strong optimality of the normalized ML models as universal codes and information in data. IEEE Transactions on Information Theory, 47, 1712-1717.\n\n[3] Balasubramanian, V. (1997).
Statistical inference, Occam's razor and statistical mechanics on the space of probability distributions. Neural Computation, 9, 349-368.\n\n[4] Myung, I. J., Balasubramanian, V., & Pitt, M. A. (2000). Counting probability distributions: Differential geometry and model selection. Proceedings of the National Academy of Sciences USA, 97, 11170-11175.\n\n[5] McClelland, J. L. & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1-86.\n\n[6] Norris, D., McQueen, J. M. & Cutler, A. (2000). Merging phonetic and lexical information in phonetic decision-making. Behavioral & Brain Sciences, 23, 299-325.\n\n[7] Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (1995). Markov Chain Monte Carlo in Practice. London: Chapman and Hall.\n", "award": [], "sourceid": 2374, "authors": [{"given_name": "Woojae", "family_name": "Kim", "institution": null}, {"given_name": "Daniel", "family_name": "Navarro", "institution": null}, {"given_name": "Mark", "family_name": "Pitt", "institution": null}, {"given_name": "In", "family_name": "Myung", "institution": null}]}