{"title": "Informed Projections", "book": "Advances in Neural Information Processing Systems", "page_first": 873, "page_last": 880, "abstract": "", "full_text": "Informed Projections\n\nDavid Cohn\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\ncohn+@cs.cmu.edu\n\nAbstract\n\nLow rank approximation techniques are widespread in pattern recogni-\ntion research \u2014 they include Latent Semantic Analysis (LSA), Proba-\nbilistic LSA, Principal Components Analysus (PCA), the Generative As-\npect Model, and many forms of bibliometric analysis. All make use of a\nlow-dimensional manifold onto which data are projected.\nSuch techniques are generally \u201cunsupervised,\u201d which allows them to\nmodel data in the absence of labels or categories. With many practi-\ncal problems, however, some prior knowledge is available in the form\nof context. In this paper, I describe a principled approach to incorpo-\nrating such information, and demonstrate its application to PCA-based\napproximations of several data sets.\n\n1\n\nIntroduction\n\nMany practical problems involve modeling large, high-dimensional data sets to uncover\nsimilarities or latent structure. Linear low rank approximation techniques such as PCA [12],\nLSA [5], PLSA [6] and generative aspect models [1] are powerful tools for approaching\nthese tasks. They identify (relatively) low-dimensional hyperplanes that best approximate\nthe data according to a given noise model. In doing so, they exploit and expose regularities\nin the data: the hyperplanes represent a latent space whose dimensions are often observed\nto correspond to distinct latent categories in the data set. For example, an LSA-derived\nlow-rank approximation to a corpus of news stories may have dimensions corresponding\nto \u201cpolitics,\u201d \u201c\ufb01nance,\u201d \u201csports,\u201d etc. 
Documents with the same inferred sources (and therefore “about” the same topic) generally lie close to each other in the latent space.

The broad applicability of these techniques comes from the fact that they are essentially “unsupervised”: a model is learned in the absence of labels indicating class or category memberships. There are, however, many situations in which some prior information is available; in these cases, we would like to have some way of using that information to improve our model.

Nigam et al. [10] studied the problem of learning to classify data into pre-existing categories in the presence of labeled and unlabeled examples. Their approach augmented a traditional supervised learning algorithm with distribution information made available from the unlabeled data. In contrast, this paper considers a method for augmenting a traditional unsupervised learning problem with the addition of equivalence classes.

Equivalence classes are a natural concept for many real-world problems. We frequently have some reason for believing that a set of observations are similar in some sense without wanting to or being able to say why they are similar. Note that the sets are not required to be comprehensive; we may only have known associations between a handful of observations. Further, the sets are not required to be disjoint; we may know that members of a set are similar, but there is no implication that members of two different sets are dissimilar.

In any case, the hope is that by indicating which observations are similar, we can bias our model to focus on relevant features and to ignore differences that, while statistically significant, are not correlated with our idea of similarity in the problem at hand.
This paper describes an algorithm implementing this approach and validates its use.

1.1 Related work

There is too large a literature examining the combination of supervised and unsupervised learning to cover here; below I mention in passing some of the most relevant research.

In terms of conceptual similarity, multiple discriminant analysis (MDA) and oriented principal components analysis (OPCA) are techniques that attempt to maximize the fidelity of a linear low rank approximation while minimizing the variance of data belonging to designated equivalence classes [2]. The difference with the approach discussed here is that MDA and OPCA maximize a ratio of variances rather than a mixture; this is equivalent to assuming that the covariance matrices for each set are tied. Another related technique is multidimensional scaling (MDS) which, aside from sharing the ratio-based criterion, makes the added assumption that the precise degree of similarity (or dissimilarity) of two data points is to be enforced. In general, which set of assumptions is best depends on the problem at hand.

In terms of implementation, the present algorithm owes a great deal to the “shadow targets” algorithm for Neuroscale [8, 15], whose eponymous data points enforce equivalence classes on sets of (otherwise) unsupervised data. That algorithm trades fidelity of representation against fidelity of equivalence classes in much the same way as Equation 4, although it does so in the context of a Kohonen neural network instead of a linear mapping.

Another closely related technique is CI-LSI [7], which uses latent semantic analysis for cross-language retrieval. The technique involves training on text documents from a parallel corpus for two or more languages (e.g. French and English), such that each document exists as both an English and a French version.
In CI-LSI, each document is merged with its twin, and the hyperplane is fit to the set of paired documents.

The goal of CI-LSI matches the goal of this paper, and the technique can in fact be seen as a special case of the informed projections discussed here. By using the “mean” of a pair of documents as a proxy for the documents themselves, we assert that the two come from a common source; fitting a model to a collection of such means finds a maximum likelihood solution subject to the constraint that both members of a pair come from a common source.

2 Informed and uninformed projections

To introduce informed projections, I will first briefly review principal components analysis (PCA) and an algorithm for efficiently computing the principal components of a data set.

2.1 PCA and EMPCA

Given a finite data set X ⊂ ℝⁿ, where each column corresponds to one observation, PCA can be used to find a rank m approximation X̂ (where m < n) which minimizes the sum squared distortion with respect to X. It does this by identifying the m orthogonal directions in which X exhibits the greatest variance, corresponding to the m largest eigenvectors C = [C_1, ..., C_m].

Figure 1: PCA maximizes the variance of the observations (on left), while an informed projection minimizes variance of projections from observations belonging to the same set.
X can then be projected onto the hyperplane defined by C as

X̂ = C(CᵀC)⁻¹CᵀX.    (1)

Although not strictly a generative model, PCA offers a probabilistic interpretation: C represents a maximum likelihood model of the data under the assumption that X consists of (Gaussian) noise-corrupted observations taken from linear combinations of m sources in an n-dimensional space. The values for X̂ then represent maximum likelihood estimates of the mixtures responsible for the corresponding values in X.

Roweis [13] described an efficient iterative technique for identifying C using an EM procedure. Beginning with an arbitrary guess for C, the latent representation of X is computed as

Y = (CᵀC)⁻¹CᵀX,    (2)

after which C is updated to maximize the estimated likelihoods:

C = XYᵀ(YYᵀ)⁻¹.    (3)

Equations 2 and 3 are iterated until convergence (typically less than 10 iterations), at which time the sum squared error of X̂’s approximation to X will have been minimized.

2.2 Informed projections

PCA only penalizes according to the squared distance of an observation x_i from its projection x̂_i. Given a Gaussian noise model, x̂_i is the maximum likelihood estimate of x_i’s “source,” which is the only constraint with which PCA is concerned.

If we believe that a set of observations S_i = {x_1, x_2, ..., x_n} have a common cause, then they should share a common source. For a hyperplane defined by eigenvectors C, the maximum likelihood source is the mean of S_i’s projections onto C, denoted S̄_i.
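As a concrete illustration, the EMPCA iteration of Equations 2 and 3 can be sketched in a few lines of NumPy. This is a minimal sketch; the function and variable names are mine, not from the paper:

```python
import numpy as np

def empca(X, m, n_iter=20, seed=0):
    '''EM iteration for a rank-m principal subspace of X (one column per
    observation, columns assumed zero-mean): alternate the E-step
    Y = (C^T C)^{-1} C^T X with the M-step C = X Y^T (Y Y^T)^{-1}.'''
    n = X.shape[0]
    C = np.random.default_rng(seed).standard_normal((n, m))  # arbitrary guess
    for _ in range(n_iter):
        Y = np.linalg.solve(C.T @ C, C.T @ X)      # E-step (Equation 2)
        C = X @ Y.T @ np.linalg.inv(Y @ Y.T)       # M-step (Equation 3)
    X_hat = C @ np.linalg.solve(C.T @ C, C.T @ X)  # projection (Equation 1)
    return C, X_hat
```

At convergence the columns of C span the same subspace as the top-m eigenvectors, although they are not themselves orthonormal.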
As such, the likelihood should be penalized not only on the basis of the variance of observations around their projections, ∑_j ||x_j − x̂_j||², but also the variance of the projections around their set means, ∑_i ∑_{x_j ∈ S_i} ||x̂_j − S̄_i||².

These two penalty terms may be at odds with each other, so we must introduce a hyperparameter β representing how much weight to place on accurately reproducing the original observations and how much to place on preserving the integrity of the known sets:

E_β = (1 − β) ∑_j ||x_j − x̂_j||² + β ∑_i ∑_{x_j ∈ S_i} ||x̂_j − S̄_i||².    (4)

When β = 0.5, Equation 4 is equivalent to minimizing ∑_i ∑_{x_j ∈ S_i} ||x_j − S̄_i||² under the assumption that all otherwise unaffiliated x_i are members of their own singleton sets (the residual x_j − x̂_j is orthogonal to the in-plane difference x̂_j − S̄_i, so the two penalties sum to the squared distances to the set means). This is just the squared distance from each observation to its projected cluster mean, which appears to be the criterion CI-LSI minimizes by averaging documents.

2.3 Finding an informed projection

The error criterion in Equation 4 may be efficiently optimized with an expectation-maximization (EM) procedure based on Roweis’ EMPCA [13], alternately computing estimated sources x̂ and maximizing the likelihoods of the observed data given those sources.

The likelihood of a set is maximized by minimizing the variance of projections from members of a set around their mean. This is at odds with the efforts of PCA to maximize likelihood by maximizing the variance of projections from the data set at large.
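For concreteness, the criterion of Equation 4 can be written out directly. A minimal sketch (function and argument names are mine), with each set given as a list of column indices into X:

```python
import numpy as np

def informed_error(X, X_hat, sets, beta):
    '''Equation 4: (1 - beta) * sum_j ||x_j - xhat_j||^2
    + beta * sum_i sum_{x_j in S_i} ||xhat_j - Sbar_i||^2,
    where Sbar_i is the mean of set i's projections.'''
    distortion = np.sum((X - X_hat) ** 2)        # fidelity to the observations
    set_variance = 0.0
    for S in sets:                               # each S: list of column indices
        S_bar = X_hat[:, S].mean(axis=1, keepdims=True)
        set_variance += np.sum((X_hat[:, S] - S_bar) ** 2)
    return (1 - beta) * distortion + beta * set_variance
```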
We can make these forces work together by adding a “complement set” S̃_i for each set S_i, such that the variance of S_i’s projections is minimized by maximizing the variance of S̃_i’s projections. The complement set may be determined analytically, but can also be computed efficiently as an extra step between the “E” and “M” steps of the EM iteration. Given an observation x_j ∈ S_i, the complement for x_j may be computed in terms of its projection x̂_j onto the hyperplane and S̄_i, the mean of the set.

Figure 2: Location of a point’s complement x̃_j with respect to its mean set projection S̄_i and the current hyperplane.

In order to “pull” the current hyperplane in the direction that will minimize x_j’s distance from the set mean, x̃_j must be positioned at a distance of ||x_j − x̂_j|| from the hyperplane such that its projection lies along the line from S̄_i to x̂_j at a distance from S̄_i equal to ||x_j − x̂_j||. With some geometric manipulation (Figure 2), it can be shown that

x̃_j = S̄_i + (x̂_j − S̄_i) · ||x̂_j − x_j|| / ||x̂_j − S̄_i|| + (x̂_j − x_j) · ||x̂_j − S̄_i|| / ||x̂_j − x_j||.

For efficiency, it is worth noting that by subtracting each set’s mean from its constituent observations, all sets may be combined into a single zero-mean “superset” S̃ from which complements are computed.

Once the complement set has been computed, it can be appended to the original observations to create a joint data set, denoted X⁺ = [X | S̃], and the “M” step of the EM procedure is continued as before:¹

Y = (CᵀC)⁻¹CᵀX⁺,    C = X⁺Yᵀ(YYᵀ)⁻¹.    (5)

Applying β to the optimization is straightforward: if we preprocess the data by subtracting the mean of the observations (as is standard for PCA), the effect of
each observation is to apply a “torque” to the current hyperplane around the origin. By multiplying all coordinates of an observation by the same scalar, we scale the torque applied by the same amount. As such, we can trade off the weight attached to enforcing the sets against the weight attached to reconstructing the original data by multiplying S̃ and X by β and 1 − β respectively:

X⁺_β = [(1 − β)X | β·S̃].

¹ Since S̃_i depends on the projections, and therefore the position of the hyperplane, it must be recomputed with each iteration.

3 Experiments

I examined the effect of “informing” projections on three data sets from two domains. The first two were text data sets taken from the WebKB project and the “20 newsgroups” data set. The third data set consisted of acoustic features from recorded music. Finally, I examine the effect of adding set information to the joint probabilistic model described by Cohn and Hofmann [3].

3.1 WebKB data

The first set of experiments began with a subset of the WebKB data set [4]. Using Rainbow [9], I tokenized 1000 randomly-selected documents, stripping out HTML and digits, and kept the 1000 terms with the highest class-dependent information gain (the reduced vocabulary greatly decreased processing times). The result was 1000 documents with 1000 features, where feature f_{i,j} represented the frequency with which term j occurred in document x_i. Sets were constructed from the categories provided with each document.

The experiments varied both the fraction of the training data for which set associations were provided (0-1) and the weight given to preserving those sets (also 0-1). For each combination, I ran 40 trials, each using a randomized split of 200 training documents and 100 test documents.
Accuracy was evaluated based on leave-one-out nearest neighbor classification over the test set.²

Figure 3: Nearest neighbor classification of WebKB data, where a 5D PCA of document terms has been informed by web page category-determined sets (40 independent train/test splits). The fraction of observations that have been given set assignments is varied from 0 to 1 (left plot, with curves for weights 0.4, 0.5, 0.6 and 0.7), as is β, the weight attached to preserving set associations (right plot, with curves for fractions 0.2, 0.4, 0.6 and 0.8).

Figure 3 summarizes the results of these experiments. As expected, the more documents that had set associations, the greater the improvement in classification accuracy, but this improvement was only evident for 0.3 ≤ β ≤ 0.7; below 0.3, the sets were not given enough weight to make a difference, while above 0.7 there is a rapid deterioration in accuracy.

² Obviously, simple nearest neighbor is far from the most effective classification technique for this domain. But the point of the experiment is to evaluate to what degree informing a projection preserves or improves topic locality, which nearest neighbor classifiers are well-suited to measure.

3.2 20 Newsgroups

The second set of experiments also used a standard text classification corpus, but with an unrestricted vocabulary.
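Both of these text experiments score a projection by leave-one-out nearest-neighbor accuracy in the latent space. A minimal sketch of that measure (names are mine; Y holds one latent coordinate vector per column):

```python
import numpy as np

def loo_nn_accuracy(Y, labels):
    '''Fraction of points whose nearest neighbor (self excluded), by
    Euclidean distance between columns of Y, shares their label.'''
    Y = np.asarray(Y, dtype=float)
    labels = np.asarray(labels)
    sq = np.sum(Y ** 2, axis=0)
    D = sq[:, None] + sq[None, :] - 2.0 * Y.T @ Y   # pairwise squared distances
    np.fill_diagonal(D, np.inf)                     # leave-one-out: no self-match
    nearest = np.argmin(D, axis=1)
    return float(np.mean(labels[nearest] == labels))
```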
Beginning with the documents of the 20 newsgroups data set, I again preprocessed the documents as above with Rainbow, but this time kept the entire vocabulary (27214 unique terms), instead of preselecting maximally informative terms.

Because of the additional running time required to handle the complete vocabularies, the experiments used all set labels and only varied the weighting. Thirty independent training and test sets of 100 documents each were run for 0 ≤ β ≤ 1, and as before, accuracy was evaluated in terms of leave-one-out classification error on the test set.

Figure 4: Five categories from the 20 newsgroups data set, where a 5D PCA of document terms has been informed by source category (30 train/test splits, for 0 < β < 1).

Figure 4 summarizes the results of these experiments. The characteristic learning curve is very similar to that for the WebKB data: an intermediate set weighting yields significantly better performance than the purely supervised or unsupervised cases. There is, however, one notable distinction: in these experiments, there is much less variation in accuracy for large values of β; it almost appears that there are three stable regions of performance.

3.3 Album recognition from acoustic features

The third test used a proprietary data set of acoustic properties of recorded music. The data set contained 11252 recorded music tracks from 939 albums. Each observation consisted of 85 highly-processed acoustic features extracted automatically via digital signal processing.

The goal of this experiment was to determine whether informing a projected model could improve the accuracy with which it could identify tracks from the same album.
Recalling Platt’s playlist selection problem [11], this can serve as a proxy for estimating how well the model can predict whether two tracks “belong together” by the subjective measure of the artist who created the album.

For these experiments, I selected the first 8439 tracks (3/4 of the data) for training, assigning each track to be a member of the set defined by the album it came from. Many tracks appeared on multiple albums (“Best of...” and soundtrack collections). The remaining 2813 tracks were used as test data.

The 85-dimensional features were projected down into a 10-dimensional space, informing the projection with sets defined by tracks from the same album. The relatively low dimension of the problem permitted also running OPCA on the data set for comparison. As above, I measured the frequency with which each test track had another track from the same album as its nearest neighbor when projected down into this same space.

While the improvements in performance are not as striking as those from the previous experiments, they are nonetheless significant (Table 1). One reason for the meager improvement may be that the features from which the projections were computed had already been manually optimized for classification accuracy.

weight      β = 0.0   β = 0.5   β = 1.0   OPCA
accuracy    0.1070    0.1241    0.1340    0.0551
ratio       0.3859    0.3223    0.3414    0.3144

Table 1: Album recognition results using 2813 test tracks from 316 albums. For each weighting β, “accuracy” is the fraction of times that the closest track to a test track came from the same album; “ratio” indicates the average ratio of intra-album distances to inter-album distances in the test set. In all cases, informing the projection with a weight of β = 0.5 increases the accuracy and decreases the ratio of the model.
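Table 1’s “ratio” column can be computed along these lines. This is a sketch under my own reading of the metric (per-track mean intra-album distance divided by mean inter-album distance, averaged over tracks), since the exact averaging is not spelled out here:

```python
import numpy as np

def intra_inter_ratio(Y, albums):
    '''Average over tracks of (mean distance to same-album tracks) /
    (mean distance to other-album tracks); Y has one track per column.'''
    Y = np.asarray(Y, dtype=float)
    albums = np.asarray(albums)
    sq = np.sum(Y ** 2, axis=0)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * Y.T @ Y, 0.0))
    ratios = []
    for j in range(Y.shape[1]):
        same = albums == albums[j]
        same[j] = False                  # exclude the track itself
        other = albums != albums[j]
        if same.any() and other.any():
            ratios.append(D[j, same].mean() / D[j, other].mean())
    return float(np.mean(ratios))
```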
Interestingly, OPCA slightly outperforms the informed projection for both criteria on this problem.

3.4 Content, context and connections

Prior work [3] discussed building joint probabilistic models of a document base, using both the content of the documents and the connections (citations or hyperlinks) between them. A document base frequently contains context as well, in the form of documents from the same source or by the same author. Informed projection provides a way for us to inject this third form of information and further improve our models.

Figure 5 summarizes the results of using set information to “inform” the joint content+link models discussed in the previous paper. That work used a multinomial model for its approximation, so we cannot use the equations defined in Section 2.3. Instead, we can make use of the observation of Section 1.1 to approximate the informing process by merging documents from the same set. Figure 5 illustrates that this process complements the earlier content+connections approach, providing a joint model of document content, context and connections.

accuracy (std err)   uninformed     informed
content              0.19 (0.017)   0.33 (0.039)
links                0.11 (0.013)   0.23 (0.098)
both                 0.21 (0.023)   0.33 (0.057)

Figure 5: (left) Classification accuracy of informed vs. uninformed models of separate and joint models of document content and connections, using the WebKB dataset. (right) Effect of adding more document context in the form of set membership information on the Cora data set.
See Cohn and Hofmann [3] for details.

4 Discussion and future work

The experiments so far indicate that adding set information to a low rank approximation does improve the quality of a model, but only to the extent that the information is used in conjunction with the unsupervised information already present in the data set. The improvement in performance is evident for content models (such as LSA), connection models, and joint models of content and connections.

4.1 Future work

Beyond experiments to clarify the effect of β on model fitness, there are many obvious directions for future work. The first is further exploration of the relationship between informed PCA and the variants of MDA discussed in Section 1.1. While the differences are mathematically straightforward, the effect of sum-vs.-ratio criteria bears further study.

A second broad area for future work is the application of the techniques described here to richer low rank approximation models. While this paper considered the effect of informing PCA, it would be fruitful to examine both the process and effect of informing multinomial-based models [3, 6], fully-generative models [1] and local linear embeddings [14].

A third area for exploration is the study of potential applications for this approach, which include improved relevance modeling, directed web crawling, and personalized search and recommendation across a wide variety of media.

References

[1] D. Blei, A. Ng, and M. I. Jordan. Latent Dirichlet allocation. In Advances in Neural Information Processing Systems 14, 2002.

[2] C. J. C. Burges, J. C. Platt, and S. Jana. Extracting noise-robust features from audio data. In Proceedings of ICASSP, 2002.

[3] D. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In T. Leen et al., editors, Advances in Neural Information Processing Systems 13, 2001.

[4] M. Craven, D. DiPasquo, D.
Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the world wide web. In Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98), 1998.

[5] S. Dumais, G. Furnas, T. Landauer, S. Deerwester, and R. Harshman. Using latent semantic analysis to improve access to textual information. In Proceedings of the Conference on Human Factors in Computing Systems CHI’88, 1988.

[6] T. Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence, UAI’99, Stockholm, 1999.

[7] M. Littman, S. Dumais, and T. Landauer. Automatic cross-language information retrieval using latent semantic indexing. In G. Grefenstette, editor, Cross Language Information Retrieval. Kluwer, 1998.

[8] D. Lowe and M. E. Tipping. Feed-forward neural networks and topographic mappings for exploratory data analysis. Neural Computing and Applications, 4:83-95, 1996.

[9] A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.

[10] K. Nigam, A. K. McCallum, S. Thrun, and T. M. Mitchell. Learning to classify text from labeled and unlabeled documents. In Proceedings of AAAI-98, pages 792-799, Madison, US, 1998. AAAI Press, Menlo Park, US.

[11] J. Platt, C. Burges, S. Swenson, C. Weare, and A. Zheng. Learning a Gaussian process prior for automatically generating music playlists. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14. MIT Press, 2002.

[12] B. D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, 1996.

[13] S. Roweis. EM algorithms for PCA and SPCA. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, Advances in Neural Information Processing Systems, volume 10. MIT Press, 1998.

[14] S. Roweis and L. Saul.
Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, Dec 2000.

[15] M. E. Tipping and D. Lowe. Shadow targets: A novel algorithm for topographic projections by radial basis functions. Neurocomputing, 19(1):211-222, 1998.
", "award": [], "sourceid": 2194, "authors": [{"given_name": "David", "family_name": "Cohn", "institution": null}]}