{"title": "Localized Sliced Inverse Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 1785, "page_last": 1792, "abstract": "We developed localized sliced inverse regression for supervised dimension reduction. It has the advantages of preventing degeneracy, increasing estimation accuracy, and automatic subclass discovery in classification problems. A semisupervised version is proposed for the use of unlabeled data. The utility is illustrated on simulated as well as real data sets.", "full_text": "Localized Sliced Inverse Regression\n\nQiang Wu, Sayan Mukherjee\nDepartment of Statistical Science\n\nInstitute for Genome Sciences & Policy\n\nDepartment of Computer Science\n\nDuke University, Durham\nNC 27708-0251, U.S.A\n\nUniversity of Illinois at Urbana-Champaign\n\nFeng Liang\n\nDepartment of Statistics\n\nIL 61820, U.S.A.\n\nliangf@uiuc.edu\n\n{qiang, sayan}@stat.duke.edu\n\nAbstract\n\nWe developed localized sliced inverse regression for supervised dimension re-\nduction. It has the advantages of preventing degeneracy, increasing estimation\naccuracy, and automatic subclass discovery in classi\ufb01cation problems. A semi-\nsupervised version is proposed for the use of unlabeled data. The utility is illus-\ntrated on simulated as well as real data sets.\n\n1 Introduction\n\nThe importance of dimension reduction for predictive modeling and visualization has a long and\ncentral role in statistical graphics and computation In the modern context of high-dimensional data\nanalysis this perspective posits that the functional dependence between a response variable y and a\nlarge set of explanatory variables x \u2208 Rp is driven by a low dimensional subspace of the p variables.\nCharacterizing this predictive subspace, supervised dimension reduction, requires both the response\nand explanatory variables. 
This problem in the context of linear subspaces or Euclidean geometry has been explored by a variety of statistical models such as sliced inverse regression (SIR, [10]), sliced average variance estimation (SAVE, [3]), principal Hessian directions (pHd, [11]), (conditional) minimum average variance estimation (MAVE, [18]), and extensions to these approaches. To extract nonlinear subspaces, one can apply the aforementioned linear algorithms to the data mapped into a feature space induced by a kernel function [13, 6, 17].\n\nIn the machine learning community, research on nonlinear dimension reduction in the spirit of [19] has developed of late. This has led to a variety of manifold learning algorithms such as isometric mapping (ISOMAP, [16]), local linear embedding (LLE, [14]), Hessian Eigenmaps [5], and Laplacian Eigenmaps [1]. Two key differences exist between the paradigm explored in these approaches and that of supervised dimension reduction. The first difference is that the above methods are unsupervised, in that the algorithms take into account only the explanatory variables. This issue can be addressed by extending the unsupervised algorithms to use the label or response data [7]. The bigger problem is that these manifold learning algorithms do not operate on the space of the explanatory variables and hence do not provide a predictive submanifold onto which the data should be projected. These methods are based on embedding the observed data onto a graph and then using spectral properties of the embedded graph for dimension reduction. The key observation in all of these manifold algorithms is that metrics must be local, and that properties that hold in an ambient Euclidean space are true locally on smooth manifolds.\n\nThis suggests that using local information in supervised dimension reduction methods may extend them to the setting of nonlinear subspaces and submanifolds of the ambient space. 
In the context of mixture modeling for classification two such approaches have been developed [9, 15].\n\nIn this paper we extend SIR by taking into account the local structure of the explanatory variables. This localized variant of SIR, LSIR, can be used for classification as well as regression applications. Though the predictive directions obtained by LSIR are linear, they encode nonlinear information. Another advantage of our approach is that ancillary unlabeled data can easily be added to the dimension reduction analysis \u2013 semi-supervised learning.\n\nThe paper is arranged as follows. LSIR is introduced in Section 2 for continuous and categorical response variables. Extensions are discussed in Section 3. The utility with respect to predictive accuracy as well as exploratory data analysis via visualization is demonstrated on a variety of simulated and real data in Sections 4 and 5. We close with a discussion in Section 6.\n\n2 Localized SIR\n\nWe start with a brief review of the SIR method and remark that the failure of SIR in some situations is caused by ignoring local structures. We then propose a generalization of SIR, called localized SIR, by incorporating localization ideas from manifold learning. Connections to existing work are addressed at the end.\n\n2.1 Sliced inverse regression\n\nAssume the functional dependence between a response variable Y and an explanatory variable X \u2208 Rp is given by\n\nY = f(\u03b21^t X, . . . , \u03b2L^t X, \u01eb),  (1)\n\nwhere the \u03b2l's are unknown orthogonal vectors in Rp and \u01eb is noise independent of X. Let B denote the L-dimensional subspace spanned by the \u03b2l's. Then PBX, where PB denotes the projection operator onto the space B, provides a sufficient summary of the information in X relevant to Y. Estimating B or the \u03b2l's becomes the central problem in supervised dimension reduction. 
Though we define B here via a heuristic model assumption (1), a general definition based on conditional independence between Y and X given PBX can be found in [4]. Following [4], we refer to B as the dimension reduction (d.r.) subspace and to the \u03b2l's as the d.r. directions.\n\nThe sliced inverse regression (SIR) model was introduced in [10] to estimate the d.r. directions. The idea underlying this approach is that, if X has an identity covariance matrix, the centered inverse regression curve E(X|Y) \u2212 EX is contained in the d.r. space B under some design conditions; see [10] for details. According to this result the d.r. directions \u03b2l are given by the top eigenvectors of the covariance matrix \u0393 = Cov(E(X|Y)). In general, when the covariance matrix of X is \u03a3, the \u03b2l's can be obtained by solving the generalized eigen decomposition problem\n\n\u0393\u03b2 = \u03bb\u03a3\u03b2.\n\nA simple SIR algorithm operates as follows on a set of samples {(xi, yi)}, i = 1, . . . , n:\n\n1. Compute an empirical estimate of \u03a3,\n\n\u02c6\u03a3 = (1/n) \u2211_{i=1}^{n} (xi \u2212 m)(xi \u2212 m)^T,\n\nwhere m = (1/n) \u2211_{i=1}^{n} xi is the sample mean.\n\n2. Divide the samples into H groups (or slices) G1, . . . , GH according to the value of y. Compute an empirical estimate of \u0393,\n\n\u02c6\u0393 = \u2211_{h=1}^{H} (nh/n) (mh \u2212 m)(mh \u2212 m)^T,\n\nwhere mh = (1/nh) \u2211_{j \u2208 Gh} xj is the sample mean for group Gh, with nh being the group size.\n\n3. Estimate the d.r. directions \u03b2 by solving the generalized eigen problem\n\n\u02c6\u0393\u03b2 = \u03bb\u02c6\u03a3\u03b2.  (2)\n\nWhen Y takes categorical values, as in classification problems, it is natural to divide the data into different groups by their group labels. 
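As a concrete reference, the three-step SIR estimator above can be sketched in a few lines of NumPy/SciPy. This is a minimal sketch, not the authors' code: the function name, slicing sorted y into equal-size groups, and the tiny ridge added for numerical stability are my own choices.

```python
import numpy as np
from scipy.linalg import eigh

def sir_directions(X, y, n_slices=10, n_dirs=1):
    # Step 1: empirical covariance Sigma-hat of X.
    n, p = X.shape
    m = X.mean(axis=0)
    Sigma = (X - m).T @ (X - m) / n
    # Step 2: slice the samples by sorted y and form the
    # between-slice covariance Gamma-hat of the slice means.
    Gamma = np.zeros((p, p))
    for G in np.array_split(np.argsort(y), n_slices):
        d = X[G].mean(axis=0) - m
        Gamma += (len(G) / n) * np.outer(d, d)
    # Step 3: generalized eigenproblem Gamma b = lambda Sigma b;
    # eigh returns eigenvalues in ascending order, so take trailing columns.
    _, B = eigh(Gamma, Sigma + 1e-8 * np.eye(p))  # small ridge: my safeguard
    return B[:, -n_dirs:][:, ::-1]
```

On data generated from model (1) with a single linear direction, the leading estimated direction should align with that direction.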
Then SIR is equivalent to Fisher discriminant analysis (FDA).1\n\nThough SIR has been widely used for dimension reduction and has yielded many useful results in practice, it has some known problems. For example, it is easy to construct a function f such that E(X|Y = y) = 0, in which case SIR fails to retrieve any useful directions [3]. This degeneracy has also restricted the use of SIR in binary classification problems, where only one direction can be obtained. The failure of SIR in these scenarios is partly because the algorithm uses just the mean, E(X|Y = y), as a summary of the information in each slice, which apparently is not enough. Generalizations of SIR include SAVE [3], SIR-II [12] and covariance inverse regression estimation (CIRE, [2]), which exploit information from the second moment of the conditional distribution of X|Y. However, in some scenarios the information in each slice cannot be well described by a global statistic. For example, similar to the multimodal situation considered by [15], the data in a slice may form two clusters; a good description of the data would then not be a single summary such as a moment, but the two cluster centers. Next we propose a new algorithm that generalizes SIR based on the local structure of X in each slice.\n\n2.2 Localization\n\nA key principle in manifold learning is that the Euclidean representation of a data point in Rp is only meaningful locally. Under this principle, it is dangerous to calculate the slice average mh whenever the slice contains data that are far apart. Instead some kind of local average should be considered. Motivated by this idea we introduce a localized SIR (LSIR) method for dimension reduction.\n\nHere is the intuition for LSIR. Let us start with a transformed data set whose empirical covariance is the identity, for example, the data set after PCA. 
In the original SIR method, we shift every data point xi to the corresponding group average and then apply PCA on the new data set to identify the SIR directions. The underlying rationale for this approach is that if a direction does not differentiate the groups well, the group means projected onto that direction will be very close, so the variance of the new data set will be small in that direction. A natural way to incorporate localization into this approach is to shift each data point xi to the average of a local neighborhood instead of the average of its global neighborhood (i.e., the whole group). In manifold learning, a local neighborhood is often chosen by k-nearest neighbors (k-NN). Unlike manifold learning, which is designed for unsupervised learning, the neighborhood selection in LSIR, which is designed for supervised learning, also incorporates information from the response variable y.\n\nHere is the mathematical description of LSIR. Recall that the group average mh is used in estimating \u0393 = Cov(E(X|Y)). The estimate \u02c6\u0393 is equivalent to the sample covariance of a data set {mi}, i = 1, . . . , n, where mi = mh, the average of the group Gh to which xi belongs. In our LSIR algorithm, we set mi equal to a local average, and then use the corresponding sample covariance matrix to replace \u02c6\u0393 in equation (2). Below we give the details of the LSIR algorithm:\n\n1. Compute \u02c6\u03a3 as in SIR.\n\n2. Divide the samples into H groups as in SIR. For each sample (xi, yi) we compute\n\nmi,loc = (1/k) \u2211_{j \u2208 si} xj,\n\nwhere, with h being the group such that i \u2208 Gh,\n\nsi = {j : xj belongs to the k nearest neighbors of xi in Gh}.\n\nThen we compute a localized version of \u0393 by\n\n\u02c6\u0393loc = (1/n) \u2211_{i=1}^{n} (mi,loc \u2212 m)(mi,loc \u2212 m)^T.\n\n3. 
Solve the generalized eigen decomposition problem\n\n\u02c6\u0393loc\u03b2 = \u03bb\u02c6\u03a3\u03b2.  (3)\n\n1 FDA is referred to as linear discriminant analysis (LDA) in some of the literature.\n\nThe neighborhood size k in LSIR is a tuning parameter specified by the user. When k is large enough, say, larger than the size of any group, \u02c6\u0393loc is the same as \u02c6\u0393 and LSIR recovers all SIR directions. With a moderate choice of k, LSIR uses the local information within each slice and is expected to retrieve directions lost by SIR in cases where SIR fails due to degeneracy.\n\nFor classification problems LSIR becomes a localized version of FDA. Suppose the number of classes is C; then the estimate \u02c6\u0393 from the original FDA has rank at most C \u2212 1, which means FDA can estimate at most C \u2212 1 directions. This is why FDA is seldom used for binary classification problems, where C = 2. In LSIR we use more points to describe the data in each class. Mathematically this is reflected by the increase in the rank of \u02c6\u0393loc, which is no longer bounded by C and hence produces more directions. Moreover, if for some classes the data are composed of several sub-clusters, LSIR can automatically identify these sub-cluster structures. As shown in one of our examples, this property of LSIR is very useful in data analysis tasks such as cancer subtype discovery using genomic data.\n\n2.3 Connection to Existing Work\n\nThe idea of localization has been introduced to dimension reduction for classification problems before. For example, the local discriminant information (LDI) introduced by [9] is one of the early works in this area. In LDI, the local information is used to compute a between-group covariance matrix \u0393i over a nearest neighborhood at every data point xi, and the d.r. directions are then estimated by 
the top eigenvectors of the averaged between-group matrix (1/n) \u2211_{i=1}^{n} \u0393i. The local Fisher discriminant analysis (LFDA) introduced by [15] can be regarded as an improvement of LDI in which the within-class covariance matrix is also localized.\n\nCompared to these two approaches, LSIR utilizes the local information directly at the point level. One advantage of this simple localization is computation. For example, for a problem with C classes, LDI needs to compute nC local mean points and n between-group covariance matrices, while LSIR computes only n local mean points and one covariance matrix. Another advantage is that LSIR can easily be extended to handle unlabeled data in semi-supervised learning, as explained in the next section. Such an extension is less straightforward for the other two approaches, which operate on covariance matrices instead of data points.\n\n3 Extensions\n\nRegularization. When the matrix \u02c6\u03a3 is singular or has a very large condition number, which is common in high-dimensional problems, the generalized eigen decomposition problem (3) is unstable. Regularization techniques are often introduced to address this issue [20]. For LSIR we adopt the following regularization:\n\n\u02c6\u0393loc\u03b2 = \u03bb(\u02c6\u03a3 + sI)\u03b2,  (4)\n\nwhere the regularization parameter s can be chosen by cross validation or other criteria (e.g. [20]).\n\nSemi-supervised learning. In semi-supervised learning some data have y's (labeled data) and some do not (unlabeled data). How to incorporate the information from unlabeled data has been the main focus of research in semi-supervised learning. Our LSIR algorithm can easily be modified to take the unlabeled data into consideration. Since the y of an unlabeled sample can take any possible value, we put the unlabeled data into every slice. 
So the neighborhood si is defined as follows: a point in the k-NN of xi belongs to si if it is unlabeled, or if it is labeled and belongs to the same slice as xi.\n\n4 Simulations\n\nIn this section we apply LSIR to several synthetic data sets to illustrate its power. The performance of LSIR is compared with other dimension reduction methods including SIR, SAVE, pHd, and LFDA.\n\nTable 1: Estimation accuracy (and standard deviation) of various dimension reduction methods for semisupervised learning in Example 1.\n\nMethod | SAVE | pHd | LSIR (k = 20) | LSIR (k = 40)\nAccuracy | 0.3451 (\u00b10.1970) | 0.3454 (\u00b10.1970) | 0.9534 (\u00b10.0004) | 0.9011 (\u00b10.0008)\n\nFigure 1: Result for Example 1. (a) Plot of the data in the first two dimensions, where '+' corresponds to y = 1 and 'o' corresponds to y = \u22121. The data points in red and blue are labeled and the ones in green are unlabeled when the semisupervised setting is considered. (b) Projection of the data onto the first two PCA directions. (c) Projection of the data onto the first two LSIR directions when all n = 400 data points are labeled. (d) Projection of the data onto the first two LSIR directions when only the 20 points indicated in (a) are labeled.\n\nLet \u02c6B = (\u02c6\u03b21, . . . , \u02c6\u03b2L) denote an estimate of the d.r. 
subspace B, whose columns \u02c6\u03b2l are the estimated d.r. directions. We introduce the following metric to measure the accuracy:\n\nAccuracy(\u02c6B, B) = (1/L) \u2211_{i=1}^{L} ||PB \u02c6\u03b2i||^2 = (1/L) \u2211_{i=1}^{L} ||(BB^T) \u02c6\u03b2i||^2.\n\nIn LSIR the influence of the parameter k, the size of the local neighborhoods, is subtle. In our simulation study, we found it usually good enough to choose k between 10 and 20, except in the semisupervised setting (e.g. Example 1 below). Further study and a theoretical justification are necessary.\n\nExample 1. Consider a binary classification problem on R10 where the d.r. directions are the first two dimensions and the remaining eight dimensions are Gaussian noise. The data in the first two relevant dimensions are plotted in Figure 1(a); the sample size is n = 400. For this example SIR cannot identify the two d.r. directions because, due to the symmetry in the data, the group averages of the two groups are roughly the same in the first two dimensions. Using local averages instead of group averages, LSIR can find both directions; see Figure 1(c). But so can SAVE and pHd, since the higher-order moments also behave differently in the two groups.\n\nNext we create a data set for semi-supervised learning by randomly selecting 20 samples, 10 from each group, to be labeled and setting the others to be unlabeled. The directions from PCA, which ignores the labels, do not agree with the discriminant directions, as shown in Figure 1(b). So to retrieve the relevant directions, the information from the labeled points has to be taken into consideration. We evaluate the accuracy of LSIR (the semi-supervised version), SAVE and pHd, where the latter two operate on just the labeled set. We repeat this experiment 20 times, each time selecting a different random set to be labeled. The averaged accuracy is reported in Table 1. 
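For reproducibility, the LSIR estimator of Section 2.2 and the accuracy metric above can be sketched as follows. This is a minimal sketch for categorical y, assuming NumPy/SciPy; the brute-force within-slice k-NN search, the function names, and the small ridge term are my own choices, not part of the paper.

```python
import numpy as np
from scipy.linalg import eigh

def lsir_directions(X, y, k=10, n_dirs=2):
    # LSIR for categorical y: shift each point to the mean of its
    # k nearest neighbours *within its own slice*, then solve
    # Gamma_loc b = lambda Sigma b.
    n, p = X.shape
    m = X.mean(axis=0)
    Sigma = (X - m).T @ (X - m) / n
    M = np.empty_like(X)
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        # pairwise squared distances inside the slice (brute force)
        D = ((X[idx][:, None, :] - X[idx][None, :, :]) ** 2).sum(-1)
        for r, i in enumerate(idx):
            nn = idx[np.argsort(D[r])[:k]]  # k-NN of x_i inside its slice
            M[i] = X[nn].mean(axis=0)       # local average m_{i,loc}
    Gamma_loc = (M - m).T @ (M - m) / n
    _, B = eigh(Gamma_loc, Sigma + 1e-8 * np.eye(p))  # ridge: my safeguard
    return B[:, -n_dirs:][:, ::-1]

def accuracy(B_hat, B):
    # (1/L) sum_l ||(B B^T) b_l||^2 with unit-norm estimated directions;
    # B is assumed to have orthonormal columns.
    B_hat = B_hat / np.linalg.norm(B_hat, axis=0)
    return float(np.mean(np.linalg.norm(B @ (B.T @ B_hat), axis=0) ** 2))
```

In a toy binary problem where one class splits into two clusters along the first coordinate, so that the class means coincide and plain SIR degenerates, the leading LSIR direction recovers e1.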
The result for one iteration is displayed in Figure 1, where the labeled points are indicated in (a) and the projection onto the top two directions from LSIR (with k = 40) is in (d). All the results clearly indicate that LSIR outperforms the other two supervised dimension reduction methods.\n\nExample 2. We first generate a 10-dimensional data set where the first three dimensions are the Swiss roll data [14]:\n\nX1 = t cos t, X2 = 21h, X3 = t sin t,\n\nwhere t = (3\u03c0/2)(1 + 2\u03b8), \u03b8 \u223c Uniform(0, 1) and h \u223c Uniform([0, 1]). The remaining 7 dimensions are independent Gaussian noise. All dimensions are then normalized to have unit variance. Consider the following function:\n\nY = sin(5\u03c0\u03b8) + h^2 + \u01eb,  \u01eb \u223c N(0, 0.1^2).  (5)\n\nFigure 2: Estimation accuracy of various dimension reduction methods (SIR, LSIR, SAVE, pHd) for Example 2, as a function of sample size.\n\nWe randomly choose n samples as a training set, let n vary from 200 to 1000, and compare the estimation accuracy of LSIR with SIR, SAVE and pHd. The result is shown in Figure 2. SAVE and pHd outperform SIR, but are still much worse than LSIR.\n\nNote that the Swiss roll (the first three dimensions) is a benchmark data set in manifold learning, where the goal is to \u201cunroll\u201d the data into the intrinsic two-dimensional space. Since LSIR is a linear dimension reduction method we do not expect it to unroll the data, but we do expect it to retrieve the dimensions relevant to the prediction of Y. Meanwhile, with the noise, manifold learning algorithms will not unroll the data either, since the dominant directions are now the noise dimensions.\n\nExample 3. 
(Tai Chi) The Tai Chi figure is well known in Asian culture, where the concepts of Yin and Yang provide the intellectual framework for much of ancient Chinese scientific development. A 6-dimensional data set for this example is generated as follows: X1 and X2 are from the Tai Chi structure shown in Figure 3(a), where the Yin and Yang regions are assigned class labels Y = \u22121 and Y = 1 respectively. X3, . . . , X6 are independent random noise generated from N(0, 1).\n\nThe Tai Chi data set was first used as a dimension reduction example in [12, Chapter 14]. The correct d.r. subspace B is span(e1, e2). SIR, SAVE and pHd are all known to fail on this example. By taking the local structure into account, LSIR can easily retrieve the relevant directions. Following [12], we generate n = 1000 samples as the training data, then run LSIR with k = 10 and repeat 100 times. The average accuracy is 98.6% and the result from one run is shown in Figure 3. For comparison we also applied LFDA to this example. Its average accuracy is 82%, which is much better than SIR, SAVE and pHd but worse than LSIR.\n\nFigure 3: Result for the Tai Chi example. (a) The training data in the first two dimensions; (b) the training data projected onto the first two LSIR directions; (c) an independent test set projected onto the first two LSIR directions.\n\n5 Applications\n\nIn this section we apply LSIR to two real data sets.\n\n5.1 Digit recognition\n\nThe MNIST data set (Y. LeCun, http://yann.lecun.com/exdb/mnist/) is a well-known benchmark data set for classification. It contains 60,000 images of handwritten digits as training data and 10,000 images as test data. This data set is commonly believed to have strong nonlinear structure.\n\nFigure 4: Result for the leukemia data by LSIR. 
Red points are ALL and blue points are AML.\n\nIn our simulations, we randomly sampled 1000 images (100 samples for each digit) as the training set. We applied LSIR and computed d = 20 e.d.r. directions. We then projected the training data and the 10,000 test data onto these directions. Using a k-nearest neighbor classifier with k = 5 to classify the test data, we report the classification error over 100 iterations in Table 2. Compared with SIR, the classification accuracy is increased for almost all digits. The improvement for digits 2, 3 and 5 is especially significant.\n\nTable 2: Classification error rate for digit classification by SIR and LSIR.\n\ndigit | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | average\nLSIR | 0.0350 | 0.0098 | 0.1363 | 0.1055 | 0.1309 | 0.1175 | 0.0445 | 0.1106 | 0.1417 | 0.1061 | 0.0927\nSIR | 0.0487 | 0.0292 | 0.1921 | 0.1723 | 0.1327 | 0.2146 | 0.0816 | 0.1354 | 0.1981 | 0.1533 | 0.1358\n\n5.2 Gene expression data\n\nCancer classification and discovery using gene expression data has become an important technique in modern biology and medical science. In gene expression data the number of genes is huge (usually in the thousands) while the number of samples is quite limited. As a typical large-p, small-n problem, dimension reduction plays an essential role in understanding the data structure and making inference.\n\nLeukemia classification. We consider the leukemia classification problem of [8]. This data set has 38 training samples and 34 test samples. The training sample has two classes, AML and ALL, and the class ALL has two subtypes. We applied SIR and LSIR to these data. The classification accuracy is similar, with the test data predicted with 0 or 1 error. An interesting point is that LSIR automatically achieves subtype discovery while SIR cannot. Projecting the training data onto the first two directions (Figure 4), we immediately notice that ALL has two subtypes. 
It turns out that the 6-sample cluster consists of T-cell ALL and the 19-sample cluster of B-cell ALL samples. Note that there are two samples (which are T-cell ALL) that cannot be assigned to either subtype by visualization alone. This means that LSIR provides useful subclass knowledge for future research but is not itself a perfect clustering method.\n\n6 Discussion\n\nWe developed the LSIR method for dimension reduction by incorporating local information into the original SIR. It can prevent degeneracy, increase estimation accuracy, and automatically identify subcluster structures. A regularization technique is introduced for computational stability. A semi-supervised version is developed for the use of unlabeled data. The utility is illustrated on synthetic as well as real data sets.\n\nSince LSIR involves only linear operations on the data points, it is straightforward to extend it to kernel models [17] via the so-called kernel trick. An extension of LSIR along this direction can be helpful for realizing nonlinear dimension reduction directions and for reducing the computational complexity in the case p \u226b n.\n\nFurther research on LSIR and its kernelized version includes their asymptotic properties, such as consistency, and statistically more rigorous approaches for the choice of k, the size of the local neighborhoods, and L, the dimensionality of the reduced space.\n\nReferences\n\n[1] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373\u20131396, 2003.\n\n[2] R. Cook and L. Ni. Using intra-slice covariances for improved estimation of the central subspace in regression. Biometrika, 93(1):65\u201374, 2006.\n\n[3] R. Cook and S. Weisberg. Discussion of Li (1991). J. Amer. Statist. Assoc., 86:328\u2013332, 1991.\n\n[4] R. Cook and X. Yin. Dimension reduction and visualization in discriminant analysis (with discussion). Aust. N. Z. J. Stat., 43(2):147\u2013199, 2001.\n\n[5] D. 
Donoho and C. Grimes. Hessian eigenmaps: new locally linear embedding techniques for high-dimensional data. PNAS, 100:5591\u20135596, 2003.\n\n[6] K. Fukumizu, F. R. Bach, and M. I. Jordan. Kernel dimension reduction in regression. Annals of Statistics, to appear, 2008.\n\n[7] A. Globerson and S. Roweis. Metric learning by collapsing classes. In Y. Weiss, B. Sch\u00f6lkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 451\u2013458. MIT Press, Cambridge, MA, 2006.\n\n[8] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh, J. Downing, M. Caligiuri, C. Bloomfield, and E. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286:531\u2013537, 1999.\n\n[9] T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6):607\u2013616, 1996.\n\n[10] K. Li. Sliced inverse regression for dimension reduction (with discussion). J. Amer. Statist. Assoc., 86:316\u2013342, 1991.\n\n[11] K. C. Li. On principal Hessian directions for data visualization and dimension reduction: another application of Stein's lemma. J. Amer. Statist. Assoc., 87:1025\u20131039, 1992.\n\n[12] K. C. Li. High dimensional data analysis via the SIR/PHD approach, 2000.\n\n[13] J. Nilsson, F. Sha, and M. I. Jordan. Regression on manifolds using kernel dimension reduction. In Proc. of ICML 2007, 2007.\n\n[14] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323\u20132326, 2000.\n\n[15] M. Sugiyama. Dimension reduction of multimodal labeled data by local Fisher discriminant analysis. Journal of Machine Learning Research, 8:1027\u20131061, 2007.\n\n[16] J. Tenenbaum, V. de Silva, and J. Langford. A global geometric framework for nonlinear dimensionality reduction. 
Science, 290:2319\u20132323, 2000.\n\n[17] Q. Wu, F. Liang, and S. Mukherjee. Regularized sliced inverse regression for kernel models. Technical report, ISDS Discussion Paper, Duke University, 2007.\n\n[18] Y. Xia, H. Tong, W. Li, and L.-X. Zhu. An adaptive estimation of dimension reduction space. J. R. Statist. Soc. B, 64(3):363\u2013410, 2002.\n\n[19] G. Young. Maximum likelihood estimation and factor analysis. Psychometrika, 6:49\u201353, 1941.\n\n[20] W. Zhong, P. Zeng, P. Ma, J. S. Liu, and Y. Zhu. RSIR: regularized sliced inverse regression for motif discovery. Bioinformatics, 21(22):4169\u20134175, 2005.\n", "award": [], "sourceid": 281, "authors": [{"given_name": "Qiang", "family_name": "Wu", "institution": null}, {"given_name": "Sayan", "family_name": "Mukherjee", "institution": null}, {"given_name": "Feng", "family_name": "Liang", "institution": null}]}