{"title": "Unsupervised Kernel Dimension Reduction", "book": "Advances in Neural Information Processing Systems", "page_first": 2379, "page_last": 2387, "abstract": "We apply the framework of kernel dimension reduction, originally designed for supervised problems, to unsupervised dimensionality reduction. In this framework, kernel-based measures of independence are used to derive low-dimensional representations that maximally capture information in covariates in order to predict responses. We extend this idea and develop similarly motivated measures for unsupervised problems where covariates and responses are the same. Our empirical studies show that the resulting compact representation yields meaningful and appealing visualization and clustering of data. Furthermore, when used in conjunction with supervised learners for classification, our methods lead to lower classification errors than state-of-the-art methods, especially when embedding data in spaces of very few dimensions.", "full_text": "Unsupervised Kernel Dimension Reduction\n\nMeihong Wang\n\nDept. of Computer Science\nU. of Southern California\nLos Angeles, CA 90089\n\nmeihongw@usc.edu\n\nFei Sha\n\nDept. of Computer Science\nU. of Southern California\nLos Angeles, CA 90089\n\nMichael I. Jordan\nDept. of Statistics\nU. of California\nBerkeley, CA\n\nfeisha@usc.edu\n\njordan@cs.berkeley.edu\n\nAbstract\n\nWe apply the framework of kernel dimension reduction, originally designed for\nsupervised problems, to unsupervised dimensionality reduction. In this frame-\nwork, kernel-based measures of independence are used to derive low-dimensional\nrepresentations that maximally capture information in covariates in order to pre-\ndict responses. We extend this idea and develop similarly motivated measures\nfor unsupervised problems where covariates and responses are the same. Our\nempirical studies show that the resulting compact representation yields meaning-\nful and appealing visualization and clustering of data. 
Furthermore, when used in conjunction with supervised learners for classification, our methods lead to lower classification errors than state-of-the-art methods, especially when embedding data in spaces of very few dimensions.\n\n1 Introduction\n\nDimensionality reduction is an important aspect of many statistical learning tasks. In unsupervised dimensionality reduction, the primary interest is to preserve significant properties of the data in a low-dimensional representation. Well-known examples of this theme include principal component analysis, manifold learning algorithms and their many variants [1\u20134].\n\nIn supervised dimensionality reduction, side information is available to influence the choice of the low-dimensional space. For instance, in regression problems, we are interested in jointly discovering a low-dimensional representation Z of the covariates X and predicting the response variable Y well given Z. A classical example is Fisher discriminant analysis for binary response variables, which projects X onto a one-dimensional line. For more complicated cases, however, one needs to specify a suitable regression function, E[Y | X], in order to identify Z. This is often a challenging task in itself, especially for high-dimensional covariates. Furthermore, one can even argue that this task is cyclically dependent on identifying Z, as one of the motivations for identifying Z is the hope that the low-dimensional representation can guide us in selecting a good regression function.\n\nTo address this dilemma, there has been a growing interest in sufficient dimension reduction (SDR) and related techniques [5\u20138]. SDR seeks a low-dimensional Z which captures all the dependency between X and Y. This is ensured by requiring conditional independence among the three variables; i.e., X \u22a5\u22a5 Y | Z.\n
Several classical approaches exist to identify such random vectors Z [6, 9]. Recently, kernel methods have been adapted to this purpose. In particular, kernel dimension reduction (KDR) develops a kernel-based contrast function that measures the degree of conditional independence [7]. Compared to classical techniques, KDR has the significant advantage that it avoids making strong assumptions about the distribution of X. Therefore, KDR has been found especially suitable for high-dimensional problems in machine learning and computer vision [8, 10, 11].\n\nIn this paper we show how the KDR framework can be used in the setting of unsupervised learning. Our idea is similar in spirit to a classical idea from the neural network literature: we construct an \u201cautoencoder\u201d or \u201cinformation bottleneck\u201d where the response variables are the same as the covariates [12, 13]. The key difference is that autoencoders in the neural network literature were based on a specific parametric regression function. By exploiting the SDR and KDR frameworks, on the other hand, we can cast the unsupervised learning problem within a general nonparametric framework involving conditional independence, and in particular as one of optimizing kernel-based measures of independence.\n\nWe refer to this approach as \u201cunsupervised kernel dimension reduction\u201d (UKDR). As we will show in an empirical investigation, the UKDR approach works well in practice, comparing favorably to other techniques for unsupervised dimension reduction. We assess this via visualization and via building classifiers on the compact representations delivered by these methods. We also provide some interesting analytical links of the UKDR approach to stochastic neighbor embedding (SNE) and t-distributed SNE (t-SNE) [14, 15].\n\nThe paper is organized as follows.\n
In Section 2, we review the SDR framework and discuss how kernels can be used to solve the SDR problem. Additionally, we describe two specific kernel-based measures of independence, elucidating a relationship between these measures. We show how the kernel-based approach can be used for unsupervised dimensionality reduction in Section 3. We report empirical studies in Section 4. Finally, we conclude and comment on possible future directions in Section 5.\n\nNotation. Random variables are denoted with upper-case characters such as X and Y. To refer to their specific values, if vectorial, we use bold lower-case such as $\\mathbf{x}$ and $\\mathbf{x}_n$; $x_i$ stands for the i-th element of $\\mathbf{x}$. Matrices are in bold upper-case such as $\\mathbf{M}$.\n\n2 Sufficient dimension reduction and measures of independence with kernels\n\nDiscovering statistical (in)dependencies among random variables is a classical problem in statistics; examples of standard measures include Spearman\u2019s \u03c1, Kendall\u2019s \u03c4 and Pearson\u2019s $\\chi^2$ tests. Recently, there has been a growing interest in computing measures of independence in reproducing kernel Hilbert spaces (RKHSs) [7, 16]. Kernel-based (and other nonparametric) methods detect nonlinear dependence in random variables without assuming specific relationships among them. In particular, the resulting independence measures attain minimum values when random variables are independent. These methods were originally developed in the context of independent component analysis [17] and have found applications in a variety of other problems, including clustering, feature selection, and dimensionality reduction [7, 8, 18\u201321].\n\nWe will be applying these approaches to unsupervised dimensionality reduction. Our proposed techniques aim to yield a low-dimensional representation that is \u201cmaximally\u201d dependent on the original high-dimensional inputs; this will be made precise in a later section.\n
To this end, we first briefly describe kernel-based measures of (conditional) independence, focusing on how they are applied to supervised dimensionality reduction.\n\n2.1 Kernel dimension reduction for supervised learning\n\nIn supervised dimensionality reduction for classification and regression, the response variable, $Y \\in \\mathcal{Y}$, provides side information about the covariates, $X \\in \\mathcal{X}$. In a basic version of this problem we seek a linear projection $B \\in \\mathbb{R}^{D \\times M}$ to project X from a D-dimensional space to an M-dimensional subspace. We would like the low-dimensional coordinates Z = B\u22a4X to be as predictive about Y as X is; i.e., E[Y | B\u22a4X] = E[Y | X]. Intuitively, knowing Z is sufficient for the purpose of regressing Y.\n\nThis problem is referred to as sufficient dimension reduction (SDR) in statistics, where it has been the subject of a large literature [22]. In particular, SDR seeks a projection B such that\n\n$$X \\perp\\!\\!\\!\\perp Y \\mid B^\\top X, \\quad \\text{subject to } B^\\top B = I, \\qquad (1)$$\n\nwhere I is the M \u00d7 M identity matrix. Several methods have been proposed to estimate B [6, 9]. Of special interest is the technique of kernel dimension reduction (KDR), which is based on assessing conditional independence in RKHSs [7]. Concretely, we map the two variables X and Y to the RKHSs $\\mathcal{F}$ and $\\mathcal{G}$ induced by two positive semidefinite kernels $K_X : \\mathcal{X} \\times \\mathcal{X} \\to \\mathbb{R}$ and $K_Y : \\mathcal{Y} \\times \\mathcal{Y} \\to \\mathbb{R}$. For any function $g \\in \\mathcal{G}$, there exists a conditional covariance operator $C_{YY|X} : \\mathcal{G} \\to \\mathcal{G}$ such that\n\n$$\\langle g, C_{YY|X}\\, g \\rangle_{\\mathcal{G}} = E\\big[\\mathrm{var}_{Y|X}[g(Y) \\mid X]\\big] \\qquad (2)$$\n\ncalculates the residual errors of predicting g(Y) with X [7, Proposition 3]. Similarly, we can define the conditional covariance operator $C^B_{YY|X}$ for predicting with B\u22a4X.\n\nThe conditional covariance operator has an important property: for any projection B, $C^B_{YY|X} \\geq C_{YY|X}$, where the (partial) order is defined in terms of the trace operator. Moreover, the equality holds if and only if eq. (1) is satisfied. This gives rise to the possibility of using the trace of the operators as a contrast function to estimate B.\n\nConcretely, with N samples drawn from P(X, Y), we compute the corresponding kernel matrices $K_{B^\\top X}$ and $K_Y$. We center them with a projection matrix $H = I - \\frac{1}{N}\\mathbf{1}\\mathbf{1}^\\top$, where $\\mathbf{1} \\in \\mathbb{R}^N$ is the vector whose elements are all ones. The trace of the estimated conditional covariance operator $C^B_{YY|X}$ is then defined as follows:\n\n$$\\hat{J}_{YY|X}(B^\\top X, Y) = \\mathrm{Trace}\\big[G_Y (G_{B^\\top X} + N \\epsilon_N I_N)^{-1}\\big], \\qquad (3)$$\n\nwhere $G_Y = H K_Y H$ and $G_{B^\\top X} = H K_{B^\\top X} H$. Here $\\epsilon_N$ is a regularizer that smooths the kernel matrix. It should be chosen such that when $N \\to +\\infty$, $\\epsilon_N \\to 0$ and $\\sqrt{N}\\,\\epsilon_N \\to +\\infty$ to ensure consistency [7]. The minimizer of the conditional independence measure yields the optimal projection B for kernel dimension reduction:\n\n$$B_{YY|X} = \\arg\\min_{B^\\top B = I} \\hat{J}_{YY|X}(B^\\top X, Y). \\qquad (4)$$\n\nWe defer discussion on choosing kernels as well as numerical optimization to later sections. When it is clear from context, we use $\\hat{J}_{YY|X}$ as a shorthand for $\\hat{J}_{YY|X}(B^\\top X, Y)$.\n\nThe optimization functional in eq. (3) is not the only way to implement the KDR idea. Indeed, another kernel-based measure of independence that can be optimized in the KDR context is the Hilbert-Schmidt Independence Criterion (HSIC) [16]. This is built as the Hilbert-Schmidt norm of the cross-covariance operator $C_{XY} : \\mathcal{G} \\to \\mathcal{F}$, defined by\n\n$$\\mathrm{cov}(f, g) = \\langle f, C_{XY}\\, g \\rangle_{\\mathcal{F}} = E_{XY}\\{[f(X) - E_X f(X)]\\,[g(Y) - E_Y g(Y)]\\}, \\qquad (5)$$\n\nwhere the expectations are taken with respect to the joint distribution and the two marginals respectively.\n
It has been shown that for universal kernels such as Gaussian kernels the Hilbert-Schmidt norm of $C_{XY}$ is zero if and only if X and Y are independent [16]. Given N samples from P(X, Y), the empirical estimate of HSIC is given by (up to a multiplicative constant):\n\n$$\\hat{J}_{XY}(X, Y) = \\mathrm{Trace}[H K_X H K_Y], \\qquad (6)$$\n\nwhere $K_X$ and $K_Y$ are kernel matrices in $\\mathbb{R}^{N \\times N}$ computed over X and Y respectively. To apply this independence measure to dimensionality reduction, we seek a projection B which maximizes $\\hat{J}_{XY}(B^\\top X, Y)$, such that the low-dimensional coordinates Z = B\u22a4X are maximally dependent on Y,\n\n$$B_{XY} = \\arg\\max_{B^\\top B = I} \\hat{J}_{XY}(B^\\top X, Y) = \\arg\\max_{B^\\top B = I} \\mathrm{Trace}[H K_{B^\\top X} H K_Y]. \\qquad (7)$$\n\nIt is interesting to note that the independence measures in eq. (3) and eq. (6) are similar. In fact, we have been able to find conditions under which they are equivalent, as stated in the following proposition.\n\nProposition 1. Let $N \\to +\\infty$ and $\\epsilon_N \\to 0$. Additionally, assume that the samples are distributed uniformly on the unit sphere. If $\\sigma_N \\ll \\epsilon_N^2$, then up to a constant,\n\n$$\\hat{J}_{YY|X}(B^\\top X, Y) \\approx -c_0 N^2 \\epsilon_N^2\\, \\hat{J}_{XY}(B^\\top X, Y). \\qquad (8)$$\n\nTherefore, under these conditions it is equivalent to minimize $\\hat{J}_{YY|X}(B^\\top X, Y)$ or to maximize $\\hat{J}_{XY}(B^\\top X, Y)$. Thus, $B_{XY} \\approx B_{YY|X}$.\n\nProof. The proof is sketched in the supplementary material. Note that assuming the norm of X is equal to one is not overly restrictive; in practice, one often needs to normalize data points to control the overall scale.\n\nWe note that while the two measures are asymptotically equivalent, they have different computational complexity: computing $\\hat{J}_{XY}$ does not involve matrix inversion.\n
Furthermore, $\\hat{J}_{XY}$ is slightly easier to use in practice as it does not depend on regularization parameters to smooth the kernel matrices.\n\nThe HSIC measure $\\hat{J}_{XY}$ is also closely related to the technique of kernel alignment, which minimizes the angle between the (vectorized) kernel matrices $K_X$ and $K_Y$ [23]. This is equivalent to maximizing $\\mathrm{Trace}[K_X K_Y] / (\\|K_X\\|_F \\|K_Y\\|_F)$. The alignment technique has been used for clustering data X by assigning cluster labels Y so that the two kernel matrices are maximally aligned. The HSIC measure has also been used for similar tasks [18]. While both $\\hat{J}_{YY|X}$ and $\\hat{J}_{XY}$ have been used for supervised dimensionality reduction with known values of Y, they have not yet been applied to unsupervised dimensionality reduction, which is the direction that we pursue here.\n\n3 Unsupervised kernel dimension reduction\n\nIn unsupervised dimensionality reduction, the low-dimensional representation Z can be viewed as a compression of X. The goal is to identify the Z that captures as much of the information in X as possible. This desideratum has been pursued in the neural network literature, where autoencoders learn a pair of encoding and decoding functions, Z = f(X) and X = g(Z). A drawback of this approach is that f and g need to be specified a priori, in terms of the number of layers and neurons in neural nets.\n\nCan we leverage the advantages of SDR and KDR to identify Z without specifying f(X) or g(Z)? In this section, we describe how this can be done, viewing unsupervised dimensionality reduction as a special type of supervised regression problem. We start by considering the simplest case where Z is a linear projection of X. We then consider nonlinear approaches.\n\n3.1 Linear unsupervised kernel dimension reduction\n\nGiven a random variable $X \\in \\mathbb{R}^D$, we consider the regression problem $\\tilde{X} = f(B^\\top X)$ where $\\tilde{X}$ is a copy of X and $Z = B^\\top X \\in \\mathbb{R}^M$ is the low-dimensional representation of X.\n
Following the framework of SDR and KDR in Section 2, we seek B such that $X \\perp\\!\\!\\!\\perp \\tilde{X} \\mid B^\\top X$. Such a B\u22a4X thus captures all the information in X needed to reconstruct X itself (i.e., $\\tilde{X}$).\n\nWith a set of N samples from P(X), the linear projection B can be identified as the minimizer of the following kernel-based measure of independence\n\n$$\\min_{B^\\top B = I} \\hat{J}_{XX|B^\\top X} = \\mathrm{Trace}\\big[G_X (G_{B^\\top X} + N \\epsilon_N I)^{-1}\\big], \\qquad (9)$$\n\nwhere $G_X$ and $G_{B^\\top X}$ are the centered kernel matrices of $K_X$ and $K_{B^\\top X}$ respectively. We can alternatively maximize the corresponding HSIC measure of dependence between B\u22a4X and X\n\n$$\\max_{B^\\top B = I} \\hat{J}_{B^\\top X\\, X} = \\mathrm{Trace}[G_X G_{B^\\top X}]. \\qquad (10)$$\n\nWe refer collectively to this kernel-based dimension reduction method as linear unsupervised KDR (UKDR) and we use $\\hat{J}(B^\\top X, X)$ as a shorthand for the independence measure to be either minimized or maximized.\n\n3.2 Nonlinear unsupervised kernel dimension reduction\n\nFor data with complicated multimodal distributions, a linear transformation of the inputs X is unlikely to be sufficiently flexible to reveal useful structures. For example, linear projections can result in overlapping clusters in low-dimensional spaces. For the purpose of better data visualization and exploratory data analysis, we describe several simple yet effective nonlinear extensions to linear UKDR. The main idea is to find a linear subspace embedding of nonlinearly transformed X. Let $h(X) \\in \\mathbb{R}^H$ denote the nonlinear transformation. The projection B is then computed to optimize $\\hat{J}(B^\\top h(X), X)$.\n\nRadial Basis Network (RBN). In the spirit of neural network autoencoders, one obvious choice of h(X) is to use a network of radial basis functions (RBFs). In this case, H = N, the number of samples from X.\n
For a sample $x_i$, the n-th component of $h(x_i)$ is given by\n\n$$h^{\\mathrm{RBN}}_n(x_i) = \\exp\\{-\\|x_i - x_n\\|^2 / \\sigma_n^2\\}, \\qquad (11)$$\n\nwhere $x_n$ is the center of the n-th RBF and $\\sigma_n$ is the corresponding bandwidth.\n\nRandom Sparse Feature (RSF). In this approach we draw the D \u00d7 H elements of W from a multivariate Gaussian distribution with zero mean and identity covariance matrix. We construct the k-th element of h(X) as\n\n$$h^{\\mathrm{RSF}}_k(X) = \\mathrm{Heaviside}(w_k^\\top X - b), \\qquad (12)$$\n\nwhere $w_k$ is the k-th row of W and b is an adjustable offset term. Heaviside(t) is the step function that takes the value of 1 when t > 0 and the value of 0 otherwise. Note that b controls the sparsity of $h^{\\mathrm{RSF}}(X)$, a property that can be computationally advantageous.\n\nOur choice of random matrix W is motivated by earlier work in neural networks with an infinite number of hidden units, and recent work in large-scale kernel machines and deep learning kernels [24\u201326]. In particular, in the limit of $H \\to +\\infty$, the transformed X induces an RKHS with the arccos kernel: $h^{\\mathrm{RSF}}(u)^\\top h^{\\mathrm{RSF}}(v) = 1 - \\frac{1}{\\pi} \\cos^{-1}\\big(u^\\top v / (\\|u\\| \\|v\\|)\\big)$ [26].\n\nNonparametric. We have also experimented with a setup where Z is not constrained to any parametric form. In particular, we optimize $\\hat{J}(Z, X)$ over all possible values $Z \\in \\mathbb{R}^M$. While this setup is more powerful in principle than either linear KDR or the RBN or RSF variants of nonlinear KDR, we have found empirically that the optimization can get stuck in local optima. However, when initialized with the solutions from the other nonlinear methods, the final solution is generally better.\n\n3.3 Choice of kernels\n\nThe independence measures $\\hat{J}(B^\\top X, X)$ are defined via kernels over B\u22a4X and X. A natural choice is a universal kernel, in particular the Gaussian kernel: $K_{B^\\top X}(x_i, x_j) = \\exp\\{-\\|B^\\top x_i - B^\\top x_j\\|^2 / \\sigma_B^2\\}$, and similarly for X with a different bandwidth $\\sigma_X$. We have also experimented with other types of kernels; we have found the following kernels to be of particular interest.\n\nRandom walk kernel over X. Given N observations, $\\{x_1, x_2, \\ldots, x_N\\}$, we note that the RBN-transformed $x_i$ in eq. (11), when properly normalized, can be seen as the probability of a random walk from $x_i$ to $x_j$,\n\n$$p_{ij} = P(x_i \\to x_j) = \\exp\\{-\\|x_i - x_j\\|^2 / \\sigma_i^2\\} \\Big/ \\sum_{j \\neq i} \\exp\\{-\\|x_i - x_j\\|^2 / \\sigma_i^2\\}. \\qquad (13)$$\n\nThe matrix P with elements $p_{ij}$ is clearly not symmetric and not positive semidefinite. Nevertheless, a simple transformation $K_X = P P^\\top$ turns it into a positive semidefinite kernel. Intuitively, the values of $p_{ij}$ describe local structures around $x_i$ [14]. Thus $K_X(x_i, x_j) = \\sum_k p_{ik} p_{jk}$ measures the similarity between $x_i$ and $x_j$ in terms of these local structures.\n\nCauchy kernel for B\u22a4X. A Cauchy kernel is a positive semidefinite kernel and is given by\n\n$$C(u, v) = 1 / (1 + \\|u - v\\|^2) = \\exp\\{-\\log(1 + \\|u - v\\|^2)\\}. \\qquad (14)$$\n\nWe define $K_{B^\\top X}(x_i, x_j) = C(B^\\top x_i, B^\\top x_j)$. Intuitively, the Cauchy kernel can be viewed as a Gaussian kernel in the transformed space \u03c6(B\u22a4X) such that $\\phi(x_i)^\\top \\phi(x_j) = C(x_i, x_j)$ [27].\n\nThese two types of kernels are closely related to t-distributed stochastic neighbor embedding (t-SNE), a state-of-the-art technique for dimensionality reduction [15]. We discuss the link in the Supplementary Material.\n\n3.4 Numerical optimization\n\nWe apply gradient-based techniques (with line search) to optimize either independence measure.\n
The techniques constrain the projection matrix B to lie on the Grassmann-Stiefel manifold B\u22a4B = I [28]. While the optimization is nonconvex, our optimization algorithm works quite well in practice.\n\nThe complexity of computing gradients is quadratic in the number of data points, as the kernel matrix needs to be computed. Standard tricks for handling large kernel matrices, such as chunking, apply, though our empirical work has not used them. In order to optimize on the Stiefel manifold, computing the search direction from the gradient needs a QR decomposition whose cost is cubic in D, the original dimensionality. A more efficient implementation can bring the complexity down to quadratic in D and linear in M, the dimensionality of the low-dimensional space. One simple strategy is to use PCA as a preprocessing step to obtain a moderate D.\n\nFigure 1: Experiments with synthetic 2D data. (a) Original. (b) 1D embedding by t-SNE. (c) and (d) are 1D embeddings by UKDR. They differ in terms of how the embeddings are constrained (see text for details). Vertical axes are the coordinates of 1D embeddings. t-SNE failed to separate data. UKDR makes fewer mistakes in (c) and no mistakes in (d).\n\n4 Experiments\n\nWe compare the performance of our proposed methods for unsupervised kernel dimension reduction (UKDR) to a state-of-the-art method, specifically t-distributed stochastic neighbor embedding (t-SNE) [15]. t-SNE has been shown to excel in many tasks of data visualization and clustering.\n
In addition to visual examination of 2D embedding quality, we also investigate the performance of the resulting low-dimensional representations in classification. In all of the experiments reported in this section, we have used the independence measure $\\hat{J}_{B^\\top X\\, X}(B^\\top X, X)$ of eq. (10).\n\n4.1 Synthetic example\n\nOur synthetic example contains 300 data points randomly distributed on two rings, shown in Fig. 1(a). We use t-SNE and our proposed method to yield 1D embeddings of these data points, plotted in Fig. 1(b)\u20131(d). The horizontal axis indexes the data points, where the first 100 indices correspond to the inner ring.\n\nFig. 1(b) plots a typical embedding by t-SNE, where we see that there is significant overlap between the clusters. On the other hand, UKDR is able to generate clusters with less or no overlap. In Fig. 1(c), the embedding is computed as the linear projection of the RBN-transformed original data. In Fig. 1(d), the embedding is unconstrained and free to take any value on the 1D axis, corresponding to the \u201cnonparametric embedding\u201d presented in Section 3.\n\n4.2 Images of handwritten digits\n\nOur second data set is a set of 2007 images of USPS handwritten digits [20]. Each image has 256 pixels and is thus represented as a point in $\\mathbb{R}^{256}$. We refer to this data set as \u201cUSPS-2007.\u201d We also sampled a subset of 500 images, 100 each from the digits 1, 2, 3, 4 and 5. Note that images of digits 3 and 5 are often indistinguishable from each other. We refer to this data set as \u201cUSPS-500.\u201d\n\nUSPS-500. Fig. 2 displays 2D embeddings of the 500 images. The colors encode digit categories (which are used only for visualization). The first row was generated with kernel PCA, Laplacian eigenmaps and t-SNE.\n
t-SNE clearly outperforms the other two in yielding well-separated clusters.\n\nThe second row was generated with our UKDR method with Gaussian kernels for both the low-dimensional coordinates Z and X. The difference between the three embeddings is whether Z is constrained as a linear projection of the original X (linear UKDR), an RBN-transformed X (RBN UKDR), or a Random Sparse Feature transform of X (RSF UKDR). The Gaussian kernel bandwidths over Z were 0.1, 0.02 and 0.5, respectively. For the RBN transformation of X, we selected the bandwidth of each RBF in eq. (11) with the \u201cperplexity trick\u201d used in SNE and t-SNE [15]. The bandwidth for the Gaussian kernel over X was 0.5 for all three plots. While linear UKDR yields reasonably good clusters of the data, RBN UKDR and RSF UKDR yield significantly improved clusterings. Indeed, the quality of the embeddings is on par with that of t-SNE.\n\nIn the third row of the figure, the embedding Z is constrained as in RSF UKDR. However, instead of using Gaussian kernels (as in the second row), we have used Cauchy kernels. The kernels over X are Gaussian, Random Walk, and Diffusion Map kernels [29], respectively. In general, in contrast to the embeddings in the second row, using a Cauchy kernel for the embedding space Z leads to tighter clusters.\n
Additionally, the embedding obtained with the diffusion map kernel is the most visually appealing one, outperforming t-SNE by significantly increasing the gap separating digits 1 and 4 from the others.\n\nFigure 2: 2D embedding results for the USPS-500 dataset (digits 1 to 5). Panels: (a) Kernel PCA, (b) Laplacian eigenmap, (c) t-SNE, (d) Linear UKDR, (e) RBN UKDR, (f) RSF UKDR, (g) Gaussian+Cauchy, (h) Random Walk+Cauchy, (i) Diffusion+Cauchy. Embeddings by existing approaches are shown in the first row. Embeddings by UKDR are shown in the bottom two rows.\n\nEffect of sparsity. For RSF features computed with eq. (12), the offset constant b can be used to control the sparsity of the feature vectors. We investigated the effect of the sparsity level on embeddings.\n
We found that a sparsity level as high as 82% still generates reasonable embeddings. Details are reported in the Supplementary Material. Thus RSF features are viable options for handling high-dimensional data for nonlinear UKDR.\n\nUSPS-2007: visualization and classification. In Fig. 3, we compare the embeddings of t-SNE and unsupervised KDR on the full USPS-2007 data set. The data set has many easily confusable pairs of images. Both t-SNE and unsupervised KDR lead to visually appealing clustering of data. In the UKDR framework, using an RBN transformation to parameterize the embedding performs slightly better than using the RSF transformation.\n\nM        2      3      5      10     20     50\nUKDR     11.1   11.6   9.6    9.5    8.8    7.8\nt-SNE    19.8   16.8   19.3   8.4    8.2    8.1\nPCA      49.3   42.2   21.5   10.03  6.7    6.6\n\nTable 1: Classification errors on the USPS-2007 data set with different dimensionality reduction techniques. Columns give the embedding dimensionality M.\n\nFinally, as another way to assess the quality of the low-dimensional embeddings discovered by these methods, we used these embeddings as inputs to supervised classifiers. The classifier we used was the large-margin nearest-neighbor classifier of [30]. We split the 2007 images into 70% for training and 30% for testing and reporting classification errors. We repeated the random split 50 times and report averaged errors. The results are displayed in Table 1, where PCA acts as a baseline. There are several notable findings. First, with very few dimensions (up to and including 5), our UKDR method outperforms both t-SNE and PCA significantly. As the dimensionality goes up, t-SNE starts to perform better than our method but only marginally. PCA is expected to perform well with very high dimensionality as it recovers pairwise distances the best.\n
The superior classi\ufb01cation performance by\nour method is highly desirable when the target dimensionality is very much constrained.\n\n \n\n0.06\n\n0.04\n\n0.02\n\n0\n\n\u22120.02\n\n\u22120.04\n\n\u22120.06\n\n \n\n\u22120.05\n\n0\n\n0.05\n\n(a) RBN UKDR\n\n1\n\n2\n\n3\n\n4\n\n5\n\n6\n\n7\n\n8\n\n9\n\n0\n\n3\n\n2\n\n1\n\n0\n\n\u22121\n\n\u22122\n\n\u22123\n\n\u22124\n \n\u22125\n\n \n\n5\n\n1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n\n0\n\n(b) RSF UKDR\n\n100\n\n50\n\n0\n\n\u221250\n\n\u2212100\n\n \n\n\u2212100\n\n  1\n2\n3\n4\n5\n6\n7\n8\n9\n0\n\n100\n\n\u221250\n\n0\n\n50\n\n(c) t-SNE\n\nFigure 3: Embeddings of the USPS-2007 data set by our nonlinear UKDR approach and by t-SNE.\nBoth methods separate all classes reasonably well. However, using these embeddings as inputs to\nclassi\ufb01ers suggests that the embedding by nonlinear UKDR is of higher quality.\n\n5 Conclusions\n\nWe propose a novel technique for unsupervised dimensionality reduction. Our approach is based on\nkernel dimension reduction. The algorithm identi\ufb01es low-dimensional representations of input data\nby optimizing independence measures computed in a reproducing kernel Hilbert space. We study\nempirically and contrast the performance of our method to that of state-of-the-art approaches. We\nshow that our method yield meaningful and appealing clustering patterns of data. When used for\nclassi\ufb01cation, it also leads to signi\ufb01cantly lower misclassi\ufb01cation.\n\nAcknowledgements\n\nThis work is partially supported by NSF Grant IIS-0957742 and DARPA N10AP20019. F.S. also\nbene\ufb01ted from discussions with J.P. Zhang, under Fudan University Key Laboratory Senior Visiting\nScholar Program.\n\nReferences\n[1] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science,\n\n290:2323, 2000.\n\n[2] J. B. Tenenbaum, V. Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality\n\nreduction. Science, 290:2319, 2000.\n\n[3] C. M. 
Bishop, M. Svens\u00b4en, and C. K. I. Williams. GTM: the generative topographic mapping. Neural\n\nComputation, 10:215\u2013234, 1998.\n\n8\n\n\f[4] N. D. Lawrence. Gaussian process latent variable models for visualisation of high dimensional data. In\n\nAdvances in Neural Information Processing Systems 16, pages 329\u2013336. MIT Press, 2004.\n\n[5] R. D. Cook and X. Yin. Dimension reduction and visualization in discriminant analysis (with discussion).\n\nAustralian & New Zealand Journal of Statistics, 43:147\u2013199, 2001.\n\n[6] K. C. Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Associa-\n\ntion, 86:316\u2013327, 1991.\n\n[7] K. Fukumizu, F. R. Bach, and M. I. Jordan. Kernel dimension reduction in regression. The Annals of\n\nStatistics, 37:1871\u20131905, 2009.\n\n[8] J. Nilsson, F. Sha, and M. I. Jordan. Regression on manifolds using kernel dimension reduction.\nProceedings of the 24th International Conference on Machine Learning, pages 697\u2013704. ACM, 2007.\n\nIn\n\n[9] K.-C. Li. On principal Hessian directions for data visualization and dimension reduction: another appli-\n\ncation of Stein\u2019s lemma. Journal of the American Statistical Association, 86:316\u2013342, 1992.\n\n[10] A. Shyr, R. Urtasun, and M. I. Jordan. Suf\ufb01cient dimensionality reduction for visual sequence classi\ufb01-\ncation. In Proceedings of Twenty-third IEEE Conference on Computer Vision and Pattern Recognition,\npages 3610\u20133617, 2010.\n\n[11] Q. Wu, S. Mukherjee, and F. Liang. Localized sliced inverse regression. In Advances in Neural Informa-\n\ntion Processing Systems 21, pages 1785\u20131792. MIT Press, 2009.\n\n[12] C. M. Bishop et al. Pattern recognition and machine learning. Springer New York, 2006.\n[13] N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In Proceedings of the 37th\n\nAnnual Allerton Conference on Communication, Control, and Computing, pages 368\u2013377, 1999.\n\n[14] G. 
Hinton and S. Roweis. Stochastic neighbor embedding. Advances in Neural Information Processing Systems 15, pages 857\u2013864, 2003.\n[15] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. The Journal of Machine Learning Research, 9:2579\u20132605, 2008.\n[16] A. Gretton, R. Herbrich, A. Smola, O. Bousquet, and B. Sch\u00f6lkopf. Kernel methods for measuring independence. The Journal of Machine Learning Research, 6:2075\u20132129, 2005.\n[17] F. R. Bach and M. I. Jordan. Kernel independent component analysis. The Journal of Machine Learning Research, 3:1\u201348, 2003.\n[18] L. Song, A. Smola, A. Gretton, and K. M. Borgwardt. A dependence maximization view of clustering. In Proceedings of the 24th International Conference on Machine Learning, pages 815\u2013822. ACM, 2007.\n[19] L. Song, A. Smola, A. Gretton, K. M. Borgwardt, and J. Bedo. Supervised feature selection via dependence estimation. In Proceedings of the 24th International Conference on Machine Learning, pages 823\u2013830. ACM, 2007.\n[20] L. Song, A. Smola, K. Borgwardt, and A. Gretton. Colored maximum variance unfolding. Advances in Neural Information Processing Systems 20, pages 1385\u20131392, 2008.\n[21] K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. The Journal of Machine Learning Research, 5:73\u201399, 2004.\n[22] K. P. Adragni and R. D. Cook. Sufficient dimension reduction and prediction in regression. Philosophical Transactions A, 367:4385\u20134405, 2009.\n[23] N., J. Kandola, A. Elisseeff, and J. Shawe-Taylor. On kernel-target alignment. In Advances in Neural Information Processing Systems 14, pages 367\u2013373. MIT Press, 2002.\n[24] C. K. I. Williams. Computation with infinite neural networks. Neural Computation, 10:1203\u20131216, 1998.\n[25] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20, pages 1177\u20131184. MIT Press, 2008.\n[26] Y. Cho and L. Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems 22, pages 342\u2013350. MIT Press, 2009.\n[27] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer Verlag, 1984.\n[28] A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl., 20:303\u2013353, 1998.\n[29] B. Nadler, S. Lafon, R. Coifman, and I. G. Kevrekidis. Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators. In Advances in Neural Information Processing Systems 18, pages 955\u2013962. MIT Press, 2005.\n[30] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research, 10:207\u2013244, 2009.\n", "award": [], "sourceid": 822, "authors": [{"given_name": "Meihong", "family_name": "Wang", "institution": null}, {"given_name": "Fei", "family_name": "Sha", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}