{"title": "Learning the Structure of Similarity", "book": "Advances in Neural Information Processing Systems", "page_first": 3, "page_last": 9, "abstract": null, "full_text": "Learning the structure of similarity \n\nJoshua B. Tenenbaum \n\nDepartment of Brain and Cognitive Sciences \n\nMassachusetts Institute of Technology \n\nCambridge, MA 02139 \njbt~psyche.mit.edu \n\nAbstract \n\nThe additive clustering (ADCL US) model (Shepard & Arabie, 1979) \ntreats the similarity of two stimuli as a weighted additive measure \nof their common features. Inspired by recent work in unsupervised \nlearning with multiple cause models, we propose anew, statistically \nwell-motivated algorithm for discovering the structure of natural \nstimulus classes using the ADCLUS model, which promises substan(cid:173)\ntial gains in conceptual simplicity, practical efficiency, and solution \nquality over earlier efforts. We also present preliminary results with \nartificial data and two classic similarity data sets. \n\n1 \n\nINTRODUCTION \n\nThe capacity to judge one stimulus, object, or concept as similado another is thought \nto play a pivotal role in many cognitive processes, including generalization, recog(cid:173)\nnition, categorization, and inference. Consequently, modeling subjective similarity \njudgments in order to discover the underlying structure of stimulus representations \nin the brain/mind holds a central place in contemporary cognitive science. Mathe(cid:173)\nmatical models of similarity can be divided roughly into two families: spatial models, \nin which stimuli correspond to points in a metric (typically Euclidean) space and \nsimilarity is treated as a decreasing function of distance; and set-theoretic models, in \nwhich stimuli are represented as members of salient subsets (presumably correspond(cid:173)\ning to natural classes or features in the world) and similarity is treated as a weighted \nsum of common and distinctive subsets. 
\n\nSpatial models, fit to similarity judgment data with familiar multidimensional scal(cid:173)\ning (MDS) techniques, have yielded concise descriptions of homogeneous, perceptual \ndomains (e.g. three-dimensional color space), often revealing the salient dimensions \nof stimulus variation (Shepard, 1980). Set-theoretic models are more general , in \nprinciple able to accomodate discrete conceptual structures typical of higher-level \ncognitive domains, as well as dimensional stimulus structures more common in per-\n\n\f4 \n\n1. B. TENENBAUM \n\nception (Tversky, 1977). In practice, however, the utility of set-theoretic models is \nlimited by the hierarchical clustering techniques that underlie conventional methods \nfor discovering the discrete features or classes of stimuli. Specifically, hierarchical \nclustering requires that any two classes of stimuli correspond to disjoint or properly \ninclusive subsets, while psychologically natural classes may correspond in general to \narbitrarily overlapping subsets of stimuli. For example, the subjective similarity of \ntwo countries results from the interaction of multiple geographic and cultural fac(cid:173)\ntors, and there is no reason a priori to expect the subsets of communist, African, or \nFrench-speaking nations to be either disjoint or properly inclusive. \nIn this paper we consider the additive clustering (ADCL US) model (Shepard & Ara(cid:173)\nbie, 1979), the simplest instantiation of Tversky 's (1977) general contrast model that \naccommodates the arbitrarily overlapping class structures associated with multiple \ncauses of similarity. Here, the similarity of two stimuli is modeled as a weighted \nadditive measure of their common clusters: \n\nK \n\nSij = I: wkfikfJk + C, \n\nk=l \n\n(1) \n\nwhere Sij is the reconstructed similarity of stimuli i and j, the weight Wk captures \nthe salience of cluster k, and the binary indicator variable fik equals 1 if stimulus i \nbelongs to cluster k and 0 otherwise. 
The additive constant c is necessary because the similarity data are assumed to be on an interval scale.¹ As with conventional clustering models, ADCLUS recovers a system of discrete subsets of stimuli, weighted by salience, and the similarity of two stimuli increases with the number (and weight) of their common subsets. ADCLUS, however, makes none of the structural assumptions (e.g. that any two clusters are disjoint or properly inclusive) which limit the applicability of conventional set-theoretic models. Unfortunately this flexibility also makes the problem of fitting the ADCLUS model to an observed similarity matrix exceedingly difficult.

Previous attempts to fit the model have followed a heuristic strategy to minimize a squared-error energy function,

E = \sum_{i \neq j} (s_{ij} - \hat{s}_{ij})^2 = \sum_{i \neq j} (s_{ij} - \sum_k w_k f_{ik} f_{jk})^2,     (2)

by alternately solving for the best cluster configurations f_{ik} given the current weights w_k and solving for the best weights given the current clusters (Shepard & Arabie, 1979; Arabie & Carroll, 1980). This strategy is appealing because given the cluster configuration, finding the optimal weights becomes a simple linear least-squares problem.² However, finding good cluster configurations is a difficult problem in combinatorial optimization, and this step has always been the weak point in previous work. The original ADCLUS (Shepard & Arabie, 1979) and later MAPCLUS (Arabie & Carroll, 1980) algorithms employ ad hoc techniques of combinatorial optimization that sometimes yield unexpected or uninterpretable final results. Certainly, no rigorous theory exists that would explain why these approaches fail to discover the underlying structure of a stimulus set when they do.

Essentially, the ADCLUS model is so challenging to fit because it generates similarities from the interaction of many independent underlying causes.
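Given fixed cluster memberships, the weight-fitting half of this alternating strategy is ordinary linear least squares: each off-diagonal pair (i, j) contributes one row whose k-th entry is f_ik f_jk, plus a constant column for c. A NumPy sketch using unconstrained least squares (as the paper notes, enforcing nonnegative weights requires more elaborate techniques; the round-trip check below uses invented data):

```python
import numpy as np

def fit_weights(S, F):
    """Least-squares weights and additive constant for fixed cluster memberships F."""
    n, K = F.shape
    rows, targets = [], []
    for i in range(n):
        for j in range(n):
            if i != j:
                # Design row for pair (i, j): shared memberships f_ik * f_jk, then constant term.
                rows.append(np.append(F[i] * F[j], 1.0))
                targets.append(S[i, j])
    coef, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return coef[:K], coef[K]  # cluster weights w, additive constant c

# Round-trip check on noise-free similarities generated from the model itself.
rng = np.random.default_rng(0)
F = rng.integers(0, 2, size=(10, 3))
w_true, c_true = np.array([0.5, 0.3, 0.2]), 0.1
S = F @ np.diag(w_true) @ F.T + c_true
w_est, c_est = fit_weights(S, F)
```

On noise-free data the fitted weights reproduce the off-diagonal similarities exactly, which is why this inner step was never the bottleneck; the hard part is the combinatorial search over F.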
Viewed this way, modeling the structure of similarity looks very similar to the multiple-cause learning problems that are currently a major focus of study in the neural computation literature (Ghahramani, 1995; Hinton, Dayan, et al., 1995; Saund, 1995; Neal, 1992). Here we propose a novel approach to additive clustering, inspired by the progress and promise of work on multiple-cause learning within the Expectation-Maximization (EM) framework (Ghahramani, 1995; Neal, 1992). Our EM approach still makes use of the basic insight behind earlier approaches, that finding {w_k} given {f_ik} is easy, but obtains better performance from treating the unknown cluster memberships probabilistically as hidden variables (rather than parameters of the model), and perhaps more importantly, provides a rigorous and well-understood theory. Indeed, it is natural to consider {f_ik} as "unobserved" features of the stimuli, complementing the observed data {s_ij} in the similarity matrix. Moreover, in some experimental paradigms, one or more of these features may be considered observed data, if subjects report using (or are requested to use) certain criteria in their similarity judgments.

¹ In the remainder of this paper, we absorb c into the sum over k, taking the sum over k = 0, ..., K, defining w_0 ≡ c, and fixing f_{i0} = 1 for all i.

² Strictly speaking, because the weights are typically constrained to be nonnegative, more elaborate techniques than standard linear least-squares procedures may be required.

2 ALGORITHM

2.1 Maximum likelihood formulation

We begin by formulating the additive clustering problem in terms of maximum likelihood estimation with unobserved data.
Treating the cluster weights w = {w_k} as model parameters and the unobserved cluster memberships f = {f_ik} as hidden causes for the observed similarities s = {s_ij}, it is natural to consider a hierarchical generative model for the "complete data" (including observed and unobserved components) of the form p(s, f|w) = p(s|f, w) p(f|w). In the spirit of earlier approaches to ADCLUS that seek to minimize a squared-error energy function, we take p(s|f, w) to be gaussian with common variance \sigma^2:

p(s|f, w) \propto \exp\{-\frac{1}{2\sigma^2} \sum_{i \neq j} (s_{ij} - \hat{s}_{ij})^2\} = \exp\{-\frac{1}{2\sigma^2} \sum_{i \neq j} (s_{ij} - \sum_k w_k f_{ik} f_{jk})^2\}.     (3)

Note that \log p(s|f, w) is equivalent to -E/(2\sigma^2) (ignoring an additive constant), where E is the energy defined above. In general, priors p(f|w) over the cluster configurations may be useful to favor larger or smaller clusters, induce a dependence between cluster size and cluster weight, or bias toward particular kinds of class structures, but only uniform priors are considered here. In this case -E/(2\sigma^2) also gives the "complete data" log likelihood \log p(s, f|w).

2.2 The EM algorithm for additive clustering

Given this probabilistic model, we can now appeal to the EM algorithm as the basis for a new additive clustering technique. EM calls for iterating the following two-step procedure, in order to obtain successive estimates of the parameters w that are guaranteed never to decrease in likelihood (Dempster et al., 1977). In the E-step, we calculate

Q(w|w^{(n)}) = \sum_{f'} p(f'|s, w^{(n)}) \log p(s, f'|w) = \frac{1}{2\sigma^2} \langle -E \rangle_{s, w^{(n)}}.     (4)

Up to a constant factor, Q(w|w^{(n)}) is the expected value of -E as a function of w, averaged over all possible configurations f' of the NK binary cluster memberships, given the observed data s and the current parameter estimates w^{(n)}. In the M-step, we maximize Q(w|w^{(n)}) with respect to w to obtain w^{(n+1)}.
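The expectation in (4) runs over all 2^{NK} membership configurations, which is intractable exactly; the paper approximates it by Gibbs sampling over the binary memberships. A schematic (and deliberately inefficient) single sweep under the Gaussian model of (3) — the function interface is my own sketch, not the paper's implementation:

```python
import numpy as np

def energy(S, F, w, c):
    """Squared-error energy E of equation (2), summed over off-diagonal pairs."""
    S_hat = F @ np.diag(w) @ F.T + c
    mask = ~np.eye(len(S), dtype=bool)
    return np.sum((S - S_hat)[mask] ** 2)

def gibbs_sweep(S, F, w, c, var_eff, rng):
    """One Gibbs sweep: resample each binary membership f_ik from its conditional."""
    n, K = F.shape
    for i in range(n):
        for k in range(K):
            F[i, k] = 0
            e0 = energy(S, F, w, c)
            F[i, k] = 1
            e1 = energy(S, F, w, c)
            # p(f_ik = 1 | everything else) under the Gaussian likelihood, with an
            # effective variance that will later be annealed (Section 2.3).
            p1 = 1.0 / (1.0 + np.exp((e1 - e0) / (2 * var_eff)))
            F[i, k] = int(rng.random() < p1)
    return F
```

Averaging products like F[i, k] * F[j, k] over recorded sweeps estimates the m_ijk needed by the M-step; recomputing the full energy per flip is wasteful (only terms touching stimulus i change), but keeps the sketch short.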
\nEach cluster configuration I' contributes to the mean energy in proportion to its \nprobability under the gaussian generative model in (3). Thus the number of configu(cid:173)\nrations making significant contributions depends on the model variance u 2 . For large \n\n\f6 \n\nJ. B. TENENBAUM \n\nU 2 , the probability is spread over many configurations. In the limiting case u 2 ---+ 0, \nonly the most likely configuration contributes, making EM effectively equivalent to \nthe original approaches presented in Section 1 that use only the single best cluster \nconfiguration to solve for the best cluster weights at each iteration. \n\nIn line with the basic insight embodied less rigorously in these earlier algorithms, the \nM-step still reduces to a simple (constrained) linear least-squares problem, because \nthe mean energy (E} = L:i#j (srj - 2Sij L:k Wk(fik!ik} + L:kl WkWl(fik!jk!il!il}) , \nlike the energy E, is quadratic in the weights Wk. The E-step, which amounts to \ncomputing the expectations mijk = (fik!ik} and mijkl = (fik !ik!il/j I} , is much \nmore involved, because the required sums over all possible cluster configurations f' \nare intractable for any realistic case. We approximate these calculations using Gibbs \nsampling, a Monte Carlo method that has been successfully applied to learning similar \ngenerative models with hidden variables (Ghahramani, 1995; Neal 1992).3 \n\nFinally, the algorithm should produce not only estimates of the cluster weights, but \nalso a final cluster configuration that may be interpreted as the psychologically natural \nfeatures or classes of the relevant domain. Consider the expected cluster memberships \nPik = (fik}$ w(n) , which give the probability that stimulus i belongs to cluster k, given \nthe observed similarity matrix and the current estimates of the weights. Only when \nall Pik are close to 0 or 1, i.e. 
when \sigma^2 is small enough that all the probability becomes concentrated on the most likely cluster configuration and its neighbors, can we fairly assert which stimuli belong to which classes.

2.3 Simulated annealing

Two major computational bottlenecks hamper the efficiency of the algorithm as described so far. First, Gibbs sampling may take a very long time to converge to the equilibrium distribution, particularly when \sigma^2 is small relative to the typical energy difference between neighboring cluster configurations. Second, the likelihood surfaces for realistic data sets are typically riddled with local maxima. We solve both problems by annealing on the variance. That is, we run Gibbs sampling using an effective variance \sigma^2_{eff} initially much greater than the assumed model variance \sigma^2, and decrease \sigma^2_{eff} towards \sigma^2 according to the following two-level scheme. We anneal within the nth iteration of EM to speed the convergence of the Gibbs sampling E-step (Neal, 1993), by lowering \sigma^2_{eff} from some high starting value down to a target \sigma^2_{targ}(n) for the nth EM iteration. We also anneal between iterations of EM to avoid local maxima (Rose et al., 1990), by initializing \sigma^2_{targ}(0) at a high value and taking \sigma^2_{targ}(n) \to \sigma^2 as n increases.

3 RESULTS

In all of the examples below, one run of the algorithm consisted of 100-200 iterations of EM, annealed both within and between iterations. Within each E-step, 10-100 cycles of Gibbs sampling were carried out at the target temperature \sigma^2_{targ} while the statistics for m_{ijk} and m_{ijkl} were recorded. These recorded cycles were preceded by 20-200 unrecorded cycles, during which the system was annealed from a higher temperature (e.g. 8\sigma^2_{targ}) down to \sigma^2_{targ}, to ensure that statistics were collected as close to equilibrium as possible.
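The two-level annealing scheme can be sketched as a pair of schedules: a between-iteration target variance that decays toward the model variance, and a within-iteration ramp from a hot start down to that target. The geometric decay rate and starting values below are my own illustrative choices; only the 8x hot-start factor is taken from the settings quoted above:

```python
def target_variance(n, var_model, var_start=10.0, decay=0.9):
    """Between-EM-iteration annealing: var_targ(n) -> var_model as n grows."""
    return var_model + (var_start - var_model) * decay ** n

def within_step_schedule(var_targ, n_unrecorded=20, hot_factor=8.0):
    """Within-E-step annealing: ramp var_eff from hot_factor * var_targ down to var_targ
    over the unrecorded Gibbs cycles."""
    return [var_targ * hot_factor ** (1 - t / (n_unrecorded - 1))
            for t in range(n_unrecorded)]
```

Each recorded Gibbs cycle then runs at the final schedule value, var_targ, so that the statistics m_ijk are collected as close to the (annealed) equilibrium as possible.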
The precise numbers of recorded and unrecorded iterations were chosen as a compromise between the need for longer samples as the number of hidden variables is increased and the need to keep computation times practical.

³ We generally also approximate m_{ijkl} ≈ m_{ijk} m_{ijl}, which usually yields satisfactory results with much greater efficiency.

Table 1: Classes and weights recovered for the integers 0-9.

Rank  Weight  Stimuli in class   Interpretation
1     .444    2 4 8              powers of two
2     .345    0 1 2              small numbers
3     .331    3 6 9              multiples of three
4     .291    6 7 8 9            large numbers
5     .255    2 3 4 5 6          middle numbers
6     .216    1 3 5 7 9          odd numbers
7     .214    1 2 3 4            smallish numbers
8     .172    4 5 6 7 8          largish numbers

Variance accounted for = 90.9% with 8 clusters (additive constant = .148).

3.1 Artificial data

We first report results with artificial data, for which the true cluster memberships and weights are known, to verify that the algorithm does in fact find the desired structure. We generated 10 data sets by randomly assigning each of 12 stimuli independently and with probability 1/2 to each of 8 classes, and choosing random weights for the classes uniformly from [0.1, 0.6]. These numbers are grossly typical of the real data sets we examine later in this section. We then calculated the observed similarities from (1), added a small amount of random noise (with standard deviation equal to 5% of the mean noise-free similarity), and symmetrized the similarity matrix.

The crucial free parameter is K, the assumed number of stimulus classes. When the algorithm was configured with the correct number of clusters (K = 8), the original classes and weights were recovered during the first run of the algorithm on all 10 data sets, after an average of 58 EM iterations (low 30, high 92).
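The artificial-data generation procedure just described is straightforward to reproduce. A sketch (the membership probability, weight range, 5% noise level, and symmetrization follow the text; the seed and helper name are arbitrary):

```python
import numpy as np

def make_adclus_data(n_stimuli=12, n_classes=8, seed=0):
    """Synthetic similarities per Section 3.1: random overlapping classes,
    weights uniform on [0.1, 0.6], 5% Gaussian noise, symmetrized."""
    rng = np.random.default_rng(seed)
    F = rng.integers(0, 2, size=(n_stimuli, n_classes))  # membership with probability 1/2
    w = rng.uniform(0.1, 0.6, size=n_classes)
    S = F @ np.diag(w) @ F.T                             # noise-free model similarities, eq. (1)
    noise_sd = 0.05 * S.mean()                           # sd = 5% of mean noise-free similarity
    S = S + rng.normal(0.0, noise_sd, size=S.shape)
    return (S + S.T) / 2, F, w                           # symmetrize

S, F_true, w_true = make_adclus_data()
```

Because the true F and w are returned alongside S, recovery can be scored directly, e.g. by correlating recovered weights with w_true as in the K = 7 experiments below.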
When the algorithm was configured with K = 7 clusters, one less than the correct number, the seven classes with highest weight were recovered on 9/10 first runs. On these runs, the recovered weights and true weights had a mean correlation of 0.948 (p < .05 on each run). When configured with K = 5, the first run recovered either four of the top five classes (6/10 trials) or three of the top five (4/10 trials). When configured with too many clusters (K = 12), the algorithm typically recovered only 8 clusters with significantly non-zero weights, corresponding to the 8 correct classes. Comparable results are not available for ADCLUS or MAPCLUS, but at least we can be satisfied that our algorithm achieves a basic level of competence and robustness.

3.2 Judged similarities of the integers 0-9

Shepard et al. (1975) had subjects judge the similarities of the integers 0 through 9, in terms of the "abstract concepts" of the numbers. We analyzed the similarity matrix (Shepard, personal communication) obtained by pooling data across subjects and across three conditions of stimulus presentation (verbal, written-numeral, and written-dots). We chose this data set because it illustrates the power of additive clustering to capture a complex, overlapping system of classes, and also because it serves to compare the performance of our algorithm with the original ADCLUS algorithm. Observe first that two kinds of classes emerge in the solution (Table 1). Classes 1, 3, and 6 represent familiar arithmetic concepts (e.g. "multiples of three", "odd numbers"), while the remaining classes correspond to subsets of consecutive integers

Table 2: Classes and weights recovered for the 16 consonant phonemes.
\n\nf 0 \n\ndg \n\nb \n\nv {t \n\nRank Weight Stimuli in class \n\nInterpretation \nfront unvoiced fricatives \nback voiced stops \nunvoiced stops (omitting t) \nfront voiced \nunvoiced stops \nnasals \nvoiced (omitting b) \nunvoiced (omittings) \nVariance accounted for = 90.2% with 8 clusters (additive constant = .047). \n\n.800 \n.572 \n.463 \n.424 \n.357 \n.292 \n.169 \n.132 \n\n1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n\np k \n\np t k \n\nmn \n\ndgvCTz2 \n\nptkfOs \n\nand thus together represent the dimension of numerical magnitude. In general, both \narithmetic properties and numerical magnitude contribute to judged similarity, as \nevery number has features of both types (e.g. 9 is a \"large\" \"odd\" \"multiple of three\"), \nexcept for 0, whose only property is \"small.\" Clearly an overlapping clustering model \nis necessary here to accomodate the multiple causes of similarity. \n\nThe best solution reported for these data using the original ADCLUS algorithm \nconsisted of 10 classes, accounting for 83.1% of the variance of the data (Shepard & \nArabie, 1979).4 Several of the clusters in this solution differed by only one or two \nmembers (e.g. three of the clusters were {0,1}, {0,1,2}, and {0,1,2,3,4}), which led \nus to suspect that a better fit might be obtained with fewer than 10 classes. Table 2 \nshows the best solution found in five runs of our algorithm, accounting for 90.9% of \nthe variance with eight classes. Compared with our solution, the original ADCLUS \nsolution leaves almost twice as much residual variance unaccounted for, and with 10 \nclasses, is also less parsimonious. \n\n3.3 Confusions between 16 consonant phonemes \n\nFinally, we examine Miller & Nicely's (1955) classic data on the confusability of 16 \nconsonant phonemes, collected under varying signal/noise conditions with the orig(cid:173)\ninal intent of identifying the features of English phonology (compiled and reprinted \nin Carroll & Wish, 1974). 
Note that the recovered classes have reasonably natural interpretations in terms of the basic features of phonological theory, and a very different overall structure from the classes recovered in the previous example. Quite significantly, the classes respect a hierarchical structure almost perfectly, with class 3 included in class 5, classes 1 and 5 included in class 8, and so on. Only the absence of /b/ in class 7 violates the strict hierarchy.

These data also provide the only convenient opportunity to compare our algorithm with the MAPCLUS approach to additive clustering (Arabie & Carroll, 1980). The published MAPCLUS solution accounts for 88.3% of the variance in this data, using eight clusters. Arabie & Carroll (1980) report being "substantively perturbed" (p. 232) that their algorithm does not recover a distinct cluster for the nasals /m n/, which have been considered a very salient subset in both traditional phonology (Miller & Nicely, 1955) and other clustering models (Shepard, 1980). Table 2 presents our eight-cluster solution, accounting for 90.2% of the variance. While this represents only a marginal improvement, our solution does contain a cluster for the nasals, as expected on theoretical grounds.

⁴ Variance accounted for = 1 - E / \sum_{i \neq j} (s_{ij} - \bar{s})^2, where \bar{s} is the mean of the set {s_ij}.

3.4 Conclusion

These examples show that ADCLUS can discover meaningful representations of stimuli with arbitrarily overlapping class structures (arithmetic properties), as well as dimensional structure (numerical magnitude) or hierarchical structure (phoneme families) when appropriate.
We have argued that modeling similarity should be a natural application of learning generative models with multiple hidden causes, and in that spirit, presented a new probabilistic formulation of the ADCLUS model and an algorithm based on EM that promises better results than previous approaches. We are currently pursuing several extensions: enriching the generative model, e.g. by incorporating significant prior structure, and improving the fitting process, e.g. by developing efficient and accurate mean field approximations. More generally, we hope this work illustrates how sophisticated techniques of computational learning can be brought to bear on foundational problems of structure discovery in cognitive science.

Acknowledgements

I thank P. Dayan, W. Richards, S. Gilbert, Y. Weiss, A. Hershowitz, and M. Bernstein for many helpful discussions, and Roger Shepard for generously supplying inspiration and unpublished data. The author is a Howard Hughes Medical Institute Predoctoral Fellow.

References

Arabie, P. & Carroll, J. D. (1980). MAPCLUS: A mathematical programming approach to fitting the ADCLUS model. Psychometrika 45, 211-235.

Carroll, J. D. & Wish, M. (1974). Multidimensional perceptual models and measurement methods. In Handbook of Perception, Vol. 2. New York: Academic Press, 391-447.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion). J. Roy. Stat. Soc. B39, 1-38.

Ghahramani, Z. (1995). Factorial learning and the EM algorithm. In G. Tesauro, D. S. Touretzky, & T. K. Leen (eds.), Advances in Neural Information Processing Systems 7. Cambridge, MA: MIT Press, 617-624.

Hinton, G. E., Dayan, P., Frey, B. J., & Neal, R. M. (1995). The "wake-sleep" algorithm for unsupervised neural networks. Science 268, 1158-1161.

Miller, G. A. & Nicely, P. E. (1955).
An analysis of perceptual confusions among some English consonants. J. Ac. Soc. Am. 27, 338-352.

Neal, R. M. (1992). Connectionist learning of belief networks. Artif. Intell. 56, 71-113.

Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Dept. of Computer Science, U. of Toronto.

Rose, K., Gurewitz, F., & Fox, G. (1990). Statistical mechanics and phase transitions in clustering. Physical Review Letters 65, 945-948.

Saund, E. (1995). A multiple cause mixture model for unsupervised learning. Neural Computation 7, 51-71.

Shepard, R. N. & Arabie, P. (1979). Additive clustering: Representation of similarities as combinations of discrete overlapping properties. Psychological Review 86, 87-123.

Shepard, R. N., Kilpatric, D. W., & Cunningham, J. P. (1975). The internal representation of numbers. Cognitive Psychology 7, 82-138.

Shepard, R. N. (1980). Multidimensional scaling, tree-fitting, and clustering. Science 210, 390-398.

Tversky, A. (1977). Features of similarity. Psychological Review 84, 327-352.
", "award": [], "sourceid": 1052, "authors": [{"given_name": "Joshua", "family_name": "Tenenbaum", "institution": null}]}