{"title": "Stochastic Neighbor Embedding", "book": "Advances in Neural Information Processing Systems", "page_first": 857, "page_last": 864, "abstract": null, "full_text": "Stochastic Neighbor Embedding\n\nGeoffrey Hinton and Sam Roweis\n\nDepartment of Computer Science, University of Toronto\n\n10 King\u2019s College Road, Toronto, M5S 3G5 Canada\n\n hinton,roweis\n\n@cs.toronto.edu\n\nAbstract\n\nWe describe a probabilistic approach to the task of placing objects, de-\nscribed by high-dimensional vectors or by pairwise dissimilarities, in a\nlow-dimensional space in a way that preserves neighbor identities. A\nGaussian is centered on each object in the high-dimensional space and\nthe densities under this Gaussian (or the given dissimilarities) are used\nto de\ufb01ne a probability distribution over all the potential neighbors of\nthe object. The aim of the embedding is to approximate this distribu-\ntion as well as possible when the same operation is performed on the\nlow-dimensional \u201cimages\u201d of the objects. A natural cost function is a\nsum of Kullback-Leibler divergences, one per object, which leads to a\nsimple gradient for adjusting the positions of the low-dimensional im-\nages. Unlike other dimensionality reduction methods, this probabilistic\nframework makes it easy to represent each object by a mixture of widely\nseparated low-dimensional images. This allows ambiguous objects, like\nthe document count vector for the word \u201cbank\u201d, to have versions close to\nthe images of both \u201criver\u201d and \u201c\ufb01nance\u201d without forcing the images of\noutdoor concepts to be located close to those of corporate concepts.\n\n1 Introduction\nAutomatic dimensionality reduction is an important \u201ctoolkit\u201d operation in machine learn-\ning, both as a preprocessing step for other algorithms (e.g. to reduce classi\ufb01er input size)\nand as a goal in itself for visualization, interpolation, compression, etc. 
There are many\nways to \u201cembed\u201d objects, described by high-dimensional vectors or by pairwise dissim-\nilarities, into a lower-dimensional space. Multidimensional scaling methods[1] preserve\ndissimilarities between items, as measured either by Euclidean distance, some nonlinear\nsquashing of distances, or shortest graph paths as with Isomap[2, 3]. Principal compo-\nnents analysis (PCA) \ufb01nds a linear projection of the original data which captures as much\nvariance as possible. Other methods attempt to preserve local geometry (e.g. LLE[4]) or\nassociate high-dimensional points with a \ufb01xed grid of points in the low-dimensional space\n(e.g. self-organizing maps[5] or their probabilistic extension GTM[6]). All of these meth-\nods, however, require each high-dimensional object to be associated with only a single\nlocation in the low-dimensional space. This makes it dif\ufb01cult to unfold \u201cmany-to-one\u201d\nmappings in which a single ambiguous object really belongs in several disparate locations\nin the low-dimensional space. In this paper we de\ufb01ne a new notion of embedding based on\nprobable neighbors. 
Our algorithm, Stochastic Neighbor Embedding (SNE), tries to place the objects in a low-dimensional space so as to optimally preserve neighborhood identity, and can be naturally extended to allow multiple different low-dimensional images of each object.

2 The basic SNE algorithm

For each object, i, and each potential neighbor, j, we start by computing the asymmetric probability, p_ij, that i would pick j as its neighbor:

    p_ij = exp(-d_ij^2) / sum_{k != i} exp(-d_ik^2)    (1)

The dissimilarities, d_ij^2, may be given as part of the problem definition (and need not be symmetric), or they may be computed using the scaled squared Euclidean distance ("affinity") between two high-dimensional points, x_i, x_j:

    d_ij^2 = ||x_i - x_j||^2 / (2 sigma_i^2)    (2)

Here, sigma_i is either set by hand or (as in some of our experiments) found by a binary search for the value of sigma_i that makes the entropy of the distribution over neighbors equal to log k. Here k is the effective number of local neighbors or "perplexity" and is chosen by hand.

In the low-dimensional space we also use Gaussian neighborhoods but with a fixed variance (which we set without loss of generality to be 1/2) so the induced probability q_ij that point i picks point j as its neighbor is a function of the low-dimensional images y_i of all the objects:

    q_ij = exp(-||y_i - y_j||^2) / sum_{k != i} exp(-||y_i - y_k||^2)    (3)

The aim of the embedding is to match these two distributions as well as possible.
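As a concrete illustration of Eqs. 1-3, the following is a minimal NumPy sketch (the function names, bisection bounds, and toy data are ours, not the paper's) that computes the high-dimensional neighbor probabilities and binary-searches sigma_i to hit a target perplexity k:

```python
import numpy as np

def neighbor_probs(X, sigmas):
    """Eq. 1-2: P[i, j] is the probability that object i picks j as its
    neighbor, using d_ij^2 = ||x_i - x_j||^2 / (2 sigma_i^2)."""
    D2 = np.square(X[:, None, :] - X[None, :, :]).sum(-1)
    A = np.exp(-D2 / (2.0 * sigmas[:, None] ** 2))
    np.fill_diagonal(A, 0.0)  # an object never picks itself
    return A / A.sum(axis=1, keepdims=True)

def sigma_for_perplexity(X, i, k, lo=1e-3, hi=1e3, iters=50):
    """Binary search for sigma_i so that the entropy of the distribution
    over neighbors of point i equals log(k), i.e. perplexity k."""
    d2 = np.delete(np.square(X - X[i]).sum(axis=1), i)
    target = np.log(k)
    for _ in range(iters):
        s = 0.5 * (lo + hi)
        p = np.exp(-d2 / (2.0 * s * s))
        p /= p.sum()
        H = -(p * np.log(p + 1e-12)).sum()
        # entropy grows as sigma grows (the distribution flattens)
        lo, hi = (s, hi) if H < target else (lo, s)
    return s
```

With a sigma found this way for every i, `neighbor_probs(X, sigmas)` gives the matrix p_ij of Eq. 1.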
This is achieved by minimizing a cost function which is a sum of Kullback-Leibler divergences between the original (p_ij) and induced (q_ij) distributions over neighbors for each object:

    C = sum_i sum_j p_ij log (p_ij / q_ij) = sum_i KL(P_i || Q_i)    (4)

The dimensionality of the y space is chosen by hand (much less than the number of objects). Notice that making q_ij large when p_ij is small wastes some of the probability mass in the q distribution so there is a cost for modeling a big distance in the high-dimensional space with a small distance in the low-dimensional space, though it is less than the cost of modeling a small distance with a big one. In this respect, SNE is an improvement over methods like LLE [4] or SOM [5] in which widely separated data-points can be "collapsed" as near neighbors in the low-dimensional space. The intuition is that while SNE emphasizes local distances, its cost function cleanly enforces both keeping the images of nearby objects nearby and keeping the images of widely separated objects relatively far apart.

Differentiating C is tedious because y_k affects q_ij via the normalization term in Eq. 3, but the result is simple:

    dC/dy_i = 2 sum_j (y_i - y_j)(p_ij - q_ij + p_ji - q_ji)    (5)

which has the nice interpretation of a sum of forces pulling y_i toward y_j or pushing it away depending on whether j is observed to be a neighbor more or less often than desired.

Given the gradient, there are many possible ways to minimize C and we have only just begun the search for the best method.
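The cost and gradient above can be sketched as follows (a NumPy toy of ours, assuming the fixed low-dimensional variance of 1/2 so that the kernel in Eq. 3 is exp(-||y_i - y_j||^2)):

```python
import numpy as np

def q_probs(Y):
    """Eq. 3: induced neighbor probabilities in the low-dimensional space
    (fixed variance 1/2, so the kernel is exp(-||y_i - y_j||^2))."""
    D2 = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    A = np.exp(-D2)
    np.fill_diagonal(A, 0.0)
    return A / A.sum(axis=1, keepdims=True)

def sne_cost(P, Y):
    """Eq. 4: sum over objects of KL(P_i || Q_i)."""
    Q = q_probs(Y)
    off = ~np.eye(len(P), dtype=bool)  # skip the (undefined) diagonal
    return float(np.sum(P[off] * np.log(P[off] / Q[off])))

def sne_grad(P, Y):
    """Eq. 5: dC/dy_i = 2 * sum_j (y_i - y_j)(p_ij - q_ij + p_ji - q_ji)."""
    Q = q_probs(Y)
    M = P - Q + P.T - Q.T
    return 2.0 * (M.sum(axis=1)[:, None] * Y - M @ Y)
```

A jittered steepest-descent step is then just `Y -= eta * sne_grad(P, Y)` followed by adding Gaussian noise whose scale is annealed toward zero.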
Steepest descent in which all of the points are adjusted in parallel is inefficient and can get stuck in poor local optima. Adding random jitter that decreases with time finds much better local optima and is the method we used for the examples in this paper, even though it is still quite slow. We initialize the embedding by putting all the low-dimensional images in random locations very close to the origin. Several other minimization methods, including annealing the perplexity, are discussed in sections 5 and 6.

3 Application of SNE to image and document collections
As a graphic illustration of the ability of SNE to model high-dimensional, near-neighbor relationships using only two dimensions, we ran the algorithm on a collection of bitmaps of handwritten digits and on a set of word-author counts taken from the scanned proceedings of NIPS conference papers. Both of these datasets are likely to have intrinsic structure in many fewer dimensions than their raw dimensionalities: 256 for the handwritten digits and 13679 for the author-word counts.

To begin, we used a set of 3000 digit bitmaps from the UPS database[7] with 600 examples from each of the five classes 0,1,2,3,4. The variance of the Gaussian around each point in the 256-dimensional raw pixel image space was set to achieve a perplexity of 15 in the distribution over high-dimensional neighbors. SNE was initialized by putting all the y_i in random locations very close to the origin and then was trained using gradient descent with annealed noise. Although SNE was given no information about class labels, it quite cleanly separates the digit groups as shown in figure 1. Furthermore, within each region of the low-dimensional space, SNE has arranged the data so that properties like orientation, skew and stroke-thickness tend to vary smoothly. For the embedding shown, the SNE cost function in Eq. 4 reaches a much lower value than it does under a uniform distribution across low-dimensional neighbors. We also applied principal component analysis (PCA)[8] to the same data; the projection onto the first two principal components does not separate classes nearly as cleanly as SNE because PCA is much more interested in getting the large separations right, which causes it to jumble up some of the boundaries between similar classes. In this experiment, we used digit classes that do not have very similar pairs like 3 and 5 or 7 and 9. When there are more classes and only two available dimensions, SNE does not as cleanly separate very similar pairs.

We have also applied SNE to word-document and word-author matrices calculated from the OCRed text of NIPS volume 0-12 papers[9]. Figure 2 shows a map locating NIPS authors into two dimensions.
Each of the 676 authors who published more than one paper in NIPS vols. 0-12 is shown by a dot at the position y_i found by SNE; larger red dots and corresponding last names are authors who published six or more papers in that period. Dissimilarities d_ij were computed as the norm of the difference between log aggregate author word counts, summed across all NIPS papers. Co-authored papers gave fractional counts evenly to all authors. All words occurring in six or more documents were included, except for stopwords, giving a vocabulary size of 13649. (The bow toolkit[10] was used for part of the pre-processing of the data.) The sigma_i were set to achieve a fixed local perplexity. SNE seems to have grouped authors by broad NIPS field: generative models, support vector machines, neuroscience, reinforcement learning and VLSI all have distinguishable localized regions.

4 A full mixture version of SNE
The clean probabilistic formulation of SNE makes it easy to modify the cost function so that instead of a single image, each high-dimensional object can have several different versions of its low-dimensional image. These alternative versions have mixing proportions that sum to 1. Image-version a of object i has location y_ia and mixing proportion pi_ia. The low-dimensional neighborhood distribution for i is a mixture of the distributions induced by each of its image-versions across all image-versions of a potential neighbor j:

    q_ij = sum_a pi_ia [ sum_b pi_jb exp(-||y_ia - y_jb||^2) / sum_{k != i} sum_c pi_kc exp(-||y_ia - y_kc||^2) ]    (6)
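A direct (unoptimized) sketch of the mixture distribution of Eq. 6, with Y[i, a] the a-th image-version of object i and Pi[i, a] its mixing proportion (the array layout and names are our own):

```python
import numpy as np

def mixture_q(Y, Pi):
    """Eq. 6 sketch. Y has shape (n, m, d): m low-dimensional versions of
    each of n objects; Pi has shape (n, m) with rows summing to one."""
    n, m, _ = Y.shape
    Q = np.zeros((n, n))
    for i in range(n):
        for a in range(m):
            # weight of every version c of every object k, seen from (i, a)
            w = Pi * np.exp(-np.square(Y - Y[i, a]).sum(-1))  # shape (n, m)
            w[i] = 0.0  # exclude all versions of object i itself
            Q[i] += Pi[i, a] * w.sum(axis=1) / w.sum()
    return Q
```

Each row of the result sums to one, so Q[i] is still a proper distribution over the potential neighbors of object i.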
In this multiple-image model, the derivatives with respect to the image locations y_ia are straightforward; the derivatives w.r.t. the mixing proportions pi_ia are most easily expressed in terms of the probability that one image-version picks another (Eqs. 7-9).

Figure 1: The result of running the SNE algorithm on 3000 256-dimensional grayscale images of handwritten digits. Pictures of the original data vectors x_i (scans of handwritten digits) are shown at the location corresponding to their low-dimensional images y_i as found by SNE. The classes are quite well separated even though SNE had no information about class labels. Furthermore, within each class, properties like orientation, skew and stroke-thickness tend to vary smoothly across the space. Not all points are shown: to produce this display, digits are chosen in random order and are only displayed if a small region of the display centered on the 2-D location of the digit in the embedding does not overlap any of the regions for digits that have already been displayed. (SNE was initialized by putting all the y_i in random locations very close to the origin and then was trained using batch gradient descent (see Eq. 5) with annealed noise. The learning rate was 0.2. For the first 3500 iterations, each 2-D point was jittered by adding Gaussian noise after each position update.
The jitter was then reduced for the remaining iterations.)

Figure 2: Embedding of NIPS authors into two dimensions. Each of the 676 authors who published more than one paper in NIPS vols. 0-12 is shown by a dot at the location y_i found by the SNE algorithm. Larger red dots and corresponding last names are authors who published six or more papers in that period. The inset in upper left shows a blowup of the crowded boxed central portion of the space.
Dissimilarities between authors were computed based on squared Euclidean distance between vectors of log aggregate author word counts. Co-authored papers gave fractional counts evenly to all authors. All words occurring in six or more documents were included, except for stopwords, giving a vocabulary size of 13649. The NIPS text data is available at http://www.cs.toronto.edu/~roweis/data.html.

The derivatives w.r.t. the mixing proportions are most easily expressed in terms of q_ij|ab, the probability that version a of i picks version b of j:

    q_ij|ab = pi_jb exp(-||y_ia - y_jb||^2) / sum_{k != i} sum_c pi_kc exp(-||y_ia - y_kc||^2)    (7)

so that q_ij = sum_a pi_ia sum_b q_ij|ab. The effect on q_ij of changing the mixing proportion for version c of object k is given by differentiating Eq. 6:

    dq_ij/dpi_ic = sum_b q_ij|cb    (for k = i)
    dq_ij/dpi_kc = sum_a pi_ia ( delta_jk - sum_b q_ij|ab ) g_ia(kc)    (for k != i)    (8)

where g_ia(kc) = exp(-||y_ia - y_kc||^2) / sum_{l != i} sum_d pi_ld exp(-||y_ia - y_ld||^2) is the probability, before weighting by the mixing proportions, that version a of i picks version c of k (so q_ij|ab = pi_jb g_ia(jb)), and delta_jk = 1 if j = k and 0 otherwise. The effect of changing pi_kc on the cost, C, is then

    dC/dpi_kc = - sum_i sum_j (p_ij / q_ij) dq_ij/dpi_kc    (9)

Rather than optimizing the mixing proportions directly, it is easier to perform unconstrained optimization on "softmax weights" x_ia defined by pi_ia = exp(x_ia) / sum_b exp(x_ib).

As a "proof-of-concept", we recently implemented a simplified mixture version in which every object is represented in the low-dimensional space by exactly two components that are constrained to have mixing proportions of 1/2. The two components are pulled together by a force which increases linearly up to a threshold separation.
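The softmax-weight reparameterization just mentioned can be sketched as follows; `softmax_grad` is a hypothetical helper of ours that applies the chain rule to turn a gradient with respect to the mixing proportions into one with respect to the unconstrained weights:

```python
import numpy as np

def softmax_rows(X):
    """pi_ia = exp(x_ia) / sum_b exp(x_ib): each row of unconstrained
    weights becomes mixing proportions (positive, summing to one)."""
    E = np.exp(X - X.max(axis=1, keepdims=True))  # shift for stability
    return E / E.sum(axis=1, keepdims=True)

def softmax_grad(X, G):
    """Chain rule through the row softmax: given g_ia = dC/dpi_ia,
    return dC/dx_ia = pi_ia * (g_ia - sum_b pi_ib * g_ib)."""
    Pi = softmax_rows(X)
    return Pi * (G - (Pi * G).sum(axis=1, keepdims=True))
```

Because the softmax output is invariant to adding a constant to a row of X, the optimization over X is unconstrained even though the pi_ia always remain a valid set of mixing proportions.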
Beyond this threshold the force remains constant.1 We ran two experiments with this simplified mixture version of SNE. We took a dataset containing pictures of each of the digits 2, 3 and 4 and added hybrid digit-pictures that were each constructed by picking new examples of two of the classes and taking each pixel at random from one of these two "parents". After minimization, a large fraction of the hybrids and only a small fraction of the non-hybrids had significantly different locations for their two mixture components. Moreover, the mixture components of each hybrid always lay in the regions of the space devoted to the classes of its two parents and never in the region devoted to the third class. For this example we used a fixed perplexity in defining the local neighborhoods, a small constant step size for each position update, and a constant jitter.

Our very simple mixture version of SNE also makes it possible to map a circle onto a line without losing any near neighbor relationships or introducing any new ones. Points near one "cut point" on the circle can be mapped to a mixture of two points, one near one end of the line and one near the other end. Obviously, the location of the cut on the two-dimensional circle gets decided by which pairs of mixture components split first during the stochastic optimization. For certain optimization parameters that control the ease with which two mixture components can be pulled apart, only a single cut in the circle is made. For other parameter settings, however, the circle may fragment into two or more smaller line-segments, each of which is topologically correct but which may not be linked to each other.

The example with hybrid digits demonstrates that even the most primitive mixture version of SNE can deal with ambiguous high-dimensional objects that need to be mapped to two widely separated regions of the low-dimensional space. More work needs to be done before SNE is efficient enough to cope with large matrices of document-word counts, but it is the only dimensionality reduction method we know of that promises to treat homonyms sensibly without going back to the original documents to disambiguate each occurrence of the homonym.

1We used a fixed threshold; at threshold the force was a constant number of nats per unit length. The low-d space has a natural scale because the variance of the Gaussian used to determine q_ij is fixed at 0.5.

5 Practical optimization strategies
Our current method of reducing the SNE cost is to use steepest descent with added jitter that is slowly reduced.
This produces quite good embeddings, which demonstrates that the SNE cost function is worth minimizing, but it takes several hours to find a good embedding for just 3000 datapoints so we clearly need a better search algorithm.

The time per iteration could be reduced considerably by ignoring pairs of points for which all four of p_ij, p_ji, q_ij, q_ji are small. Since the matrix p is fixed during the learning, it is natural to sparsify it by replacing all entries below a certain threshold with zero and renormalizing. Then pairs i, j for which both p_ij and p_ji are zero can be ignored from gradient calculations if both q_ij and q_ji are small. This can in turn be determined in logarithmic time in the size of the training set by using sophisticated geometric data structures such as K-D trees, ball-trees and AD-trees, since the q_ij depend only on ||y_i - y_j||. Computational physics has attacked exactly this same complexity when performing multibody gravitational or electrostatic simulations using, for example, the fast multipole method.

In the mixture version of SNE there appears to be an interesting way of avoiding local optima that does not involve annealing the jitter. Consider two components in the mixture for an object that are far apart in the low-dimensional space.
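The sparsification idea described above can be sketched as follows (`eps` is a hypothetical cutoff of ours; the paper does not specify one):

```python
import numpy as np

def sparsify(P, eps):
    """Zero all entries of the fixed matrix p below a threshold and
    renormalize each row so it remains a distribution."""
    P = np.where(P < eps, 0.0, P)
    return P / P.sum(axis=1, keepdims=True)

def active_pairs(P):
    """Pairs (i, j), i < j, with p_ij > 0 or p_ji > 0. Only these need
    exact treatment in the gradient; the remaining pairs can be skipped
    whenever the corresponding q values are also small."""
    keep = (P > 0) | (P.T > 0)
    return [(i, j) for i, j in np.argwhere(keep) if i < j]
```

Deciding cheaply which of the skipped pairs have small q is where the geometric data structures (K-D trees, ball-trees, AD-trees) would come in, since q_ij depends only on ||y_i - y_j||.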
By raising the mixing proportion of one and lowering the mixing proportion of the other, we can move probability mass from one part of the space to another without it ever appearing at intermediate locations. This type of "probability wormhole" seems like a good way to avoid local optima that arise because a cluster of low-dimensional points must move through a bad region of the space in order to reach a better one.

Yet another search method, which we have used with some success on toy problems, is to provide extra dimensions in the low-dimensional space but to penalize non-zero values on these dimensions. During the search, SNE will use the extra dimensions to go around lower-dimensional barriers but as the penalty on using these dimensions is increased, they will cease to be used, effectively constraining the embedding to the original dimensionality.

6 Discussion and Conclusions
Preliminary experiments show that we can find good optima by first annealing the perplexities sigma_i (using high jitter) and only reducing the jitter after the final perplexity has been reached. This raises the question of what SNE is doing when the variance, sigma_i, of the Gaussian centered on each high-dimensional point is very big so that the distribution across neighbors is almost uniform. It is clear that in the high variance limit, the contribution of p_ij log q_ij to the SNE cost function is just as important for distant neighbors as for close ones. When sigma_i is very large, it can be shown that SNE is equivalent to minimizing the mismatch between squared distances in the two spaces, provided all the squared distances from an object are first normalized by subtracting off their "antigeometric" mean, A_i = -log[ (1/(N-1)) sum_{k != i} exp(-d_ik^2) ]:

    dhat_ij^2 = d_ij^2 + log [ (1/(N-1)) sum_{k != i} exp(-d_ik^2) ]    (10)

    ehat_ij^2 = ||y_i - y_j||^2 + log [ (1/(N-1)) sum_{k != i} exp(-||y_i - y_k||^2) ]    (11)

so that p_ij = exp(-dhat_ij^2)/(N-1) and q_ij = exp(-ehat_ij^2)/(N-1), and the cost can be written as

    C = sum_i sum_j p_ij ( ehat_ij^2 - dhat_ij^2 )    (12)

where N is the number of objects.

This mismatch is very similar to "stress" functions used in nonmetric versions of MDS, and enables us to understand the large-variance limit of SNE as a particular variant of such procedures. We are still investigating the relationship to metric MDS and to PCA.

SNE can also be seen as an interesting special case of Linear Relational Embedding (LRE) [11]. In LRE the data consists of triples (e.g. Colin has-mother Victoria) and the task is to predict the third term from the other two. LRE learns an N-dimensional vector for each object and an NxN-dimensional matrix for each relation.
To predict the third term in a triple, LRE multiplies the vector representing the first term by the matrix representing the relationship and uses the resulting vector as the mean of a Gaussian. Its predictive distribution for the third term is then determined by the relative densities of all known objects under this Gaussian. SNE is just a degenerate version of LRE in which the only relationship is "near" and the matrix representing this relationship is the identity.

In summary, we have presented a new criterion, Stochastic Neighbor Embedding, for mapping high-dimensional points into a low-dimensional space based on stochastic selection of similar neighbors. Unlike self-organizing maps, in which the low-dimensional coordinates are fixed to a grid and the high-dimensional ends are free to move, in SNE the high-dimensional coordinates are fixed to the data and the low-dimensional points move. Our method can also be applied to arbitrary pairwise dissimilarities between objects if such are available instead of (or in addition to) high-dimensional observations. The gradient of the SNE cost function has an appealing "push-pull" property in which the forces acting on y_i bring it closer to points it is under-selecting and push it further from points it is over-selecting as its neighbor. We have shown results of applying this algorithm to image and document collections for which it sensibly placed similar objects nearby in a low-dimensional space while keeping dissimilar objects well separated.

Most importantly, because of its probabilistic formulation, SNE has the ability to be extended to mixtures in which ambiguous high-dimensional objects (such as the word "bank") can have several widely-separated images in the low-dimensional space.

Acknowledgments We thank the anonymous referees and several visitors to our poster for helpful suggestions. Yann LeCun provided digit and NIPS text data.
This research was funded by NSERC.

References
[1] T. Cox and M. Cox. Multidimensional Scaling. Chapman & Hall, London, 1994.
[2] J. Tenenbaum. Mapping a manifold of perceptual observations. In Advances in Neural Information Processing Systems, volume 10, pages 682-688. MIT Press, 1998.
[3] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319-2323, 2000.
[4] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323-2326, 2000.
[5] T. Kohonen. Self-organization and Associative Memory. Springer-Verlag, Berlin, 1988.
[6] C. Bishop, M. Svensen, and C. Williams. GTM: The generative topographic mapping. Neural Computation, 10:215, 1998.
[7] J. J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550-554, May 1994.
[8] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.
[9] Yann LeCun. NIPS Online web site. http://nips.djvuzone.org, 2001.
[10] Andrew Kachites McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
[11] A. Paccanaro and G. E. Hinton. Learning distributed representations of concepts from relational data using linear relational embedding. IEEE Transactions on Knowledge and Data Engineering, 13:232-245, 2000.
", "award": [], "sourceid": 2276, "authors": [{"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}, {"given_name": "Sam", "family_name": "Roweis", "institution": null}]}