{"title": "Manifold Regularization for SIR with Rate Root-n Convergence", "book": "Advances in Neural Information Processing Systems", "page_first": 117, "page_last": 125, "abstract": "In this paper, we study the manifold regularization for the Sliced Inverse Regression (SIR). The manifold regularization improves the standard SIR in two aspects: 1) it encodes the local geometry for SIR and 2) it enables SIR to deal with transductive and semi-supervised learning problems. We prove that the proposed graph Laplacian based regularization is convergent at rate root-n. The projection directions of the regularized SIR are optimized by using a conjugate gradient method on the Grassmann manifold. Experimental results support our theory.", "full_text": "Manifold Regularization for SIR with Rate Root-n\n\nConvergence\n\nWei Bian\n\nSchool of Computer Engineering\nNanyang Technological University\n\nSingapore, 639798\n\nDacheng Tao\n\nSchool of Computer Engineering\nNanyang Technological University\n\nSingapore, 639798\n\nweibian@pmail.ntu.edu.sg\n\ndctao@ntu.edu.sg\n\nAbstract\n\nIn this paper, we study the manifold regularization for the Sliced Inverse Regres-\nsion (SIR). The manifold regularization improves the standard SIR in two aspects:\n1) it encodes the local geometry for SIR and 2) it enables SIR to deal with trans-\nductive and semi-supervised learning problems. We prove that the proposed graph\nLaplacian based regularization is convergent at rate root-n. The projection direc-\ntions of the regularized SIR are optimized by using a conjugate gradient method\non the Grassmann manifold. Experimental results support our theory.\n\n1 Introduction\n\nSliced inverse regression (SIR) [7] was proposed for suf\ufb01cient dimension reduction. In a regression\nsetting, with the predictors X and the response Y, the suf\ufb01cient dimension reduction (SDR) sub-\nspace B is de\ufb01ned by the conditional independency Y\u22a5 X| BTX. Under the assumption that the\ndistribution of X is elliptic symmetric [7], it has been proved that the SDR subsapce B is related\nto the inverse regression curve E(X|Y). It can be estimated at least partially by a generalized eigen-\ndecomposition between the covariance matrix of the predictors Cov(X) and the covariance matrix of\nthe inverse regression curve Cov(E(X|Y)). When Y is a continuous random variable, it is discretized\nby slicing its range into several slices so as to estimate E(X|Y) empirically. This procedure re\ufb02ects\nthe name of SIR.\nFor practical applications, the elliptic symmetric assumption on P (X) in SIR cannot be fully satis-\n\ufb01ed, because many real datasets are embedded on manifolds [1]. Therefore, SIR cannot select an\nef\ufb01cient subspace for predicting the response Y because the local geometry of the predictors X is\nignored. Additionally, SIR only utilizes labeled (given response) data (predictors). Thus, it is valu-\nable to extend SIR to deal with transductive and semi-supervised learning problems by considering\nunlabelled samples.\nWe solve the above two problems of SIR by using the manifold regularization [2], which has been\ndeveloped to incorporate the local geometry in learning classi\ufb01cation or regression functions. In\nthis paper, we utilize it to preserve the local geometry of predictors in learning the SDR subspace\nB. 
In addition, it helps SIR to solve transductive/semi-supervised learning problems, because the regularization encodes the marginal distribution of the unlabeled predictors.

Different regularizations for SIR have been well studied, e.g., the non-singular regularization [14], the ridge regularization [9], and the sparse regularization [8]. However, none of the existing regularizations encodes the local geometry of the predictors. Although the localized sliced inverse regression [12] considers the local geometry, it is heuristic and does not follow the regularization framework.

The rest of the paper is organized as follows. Section 2 presents the manifold regularization for SIR. Section 3 proves the convergence of the new manifold regularization. Section 4 discusses the optimization of the regularized SIR by a conjugate gradient method on the Grassmann manifold. Section 5 presents experimental results on real datasets. Section 6 collects the proofs of the lemmas, and Section 7 concludes the paper.

2 Manifold Regularization for SIR

In the rest of the paper, we use regression terminology and treat classification as regression with a categorical response. Upper case letters X \in R^p and Y \in R denote the predictors and the response, respectively, and lower case letters x and y denote the corresponding realizations. Given a sample set containing n_l labeled samples {(x_i, y_i)}_{i=1}^{n_l} and n_u unlabeled samples {x_i}_{i=n_l+1}^{n = n_l + n_u}, we seek an optimal k-dimensional subspace spanned by B = [\beta_1, ..., \beta_k] such that the response Y is predictable from the projected predictors B^T X. We also use the matrix X = [x_1, x_2, ..., x_n] to denote all predictors in the sample set.

2.1 Sliced Inverse Regression

Suppose the response Y is predictable from a sufficient k-dimensional projection of the original predictors X. We consider the following regression model [7],

    Y = f(\beta_1^T X, \beta_2^T X, ..., \beta_k^T X, \varepsilon),   (1)

where the \beta's are linearly independent projection vectors and \varepsilon is independent noise. Given a set of samples {(x_i, y_i)}_{i=1}^{n_l}, SIR estimates the projection subspace B = [\beta_1, ..., \beta_k] via the following steps: discretize Y by slicing its range into H slices; calculate the sample frequency f_h of Y falling into the h-th slice and the within-slice sample mean \bar{X}_h, the estimate of the conditional mean E(X | Y = h); estimate the mean \bar{X} and covariance matrix \Sigma of the predictors X; and calculate the matrix

    \Gamma = \sum_h f_h (\bar{X}_h - \bar{X})(\bar{X}_h - \bar{X})^T.

B is finally obtained from the generalized eigen-decomposition \Sigma\beta = \lambda\Gamma\beta. It can be proved that the generalized eigen-decomposition is equivalent to the optimization

    \max_B \; \mathrm{trace}\big( (B^T \Sigma B)^{-1} B^T \Gamma B \big).   (2)

We refer to (2) as the objective function of SIR, and we can therefore impose the manifold regularization on (2).
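Before turning to the regularization, a minimal numerical sketch of the estimation steps just described may help fix ideas. This is not the authors' implementation: the function and parameter names (sir_directions, n_slices) are ours, the response is sliced by equal-frequency binning, and the generalized eigen-decomposition is carried out by whitening with \Sigma^{-1/2}.

    import numpy as np

    def sir_directions(X, y, n_slices=10, k=2):
        """Sketch of standard SIR: slice y, form Gamma = sum_h f_h (xbar_h - xbar)(xbar_h - xbar)^T,
        and solve the generalized eigenproblem between Cov(X) and Gamma. X is n x p here."""
        n, p = X.shape
        x_bar = X.mean(axis=0)
        Sigma = np.cov(X, rowvar=False)                       # estimate of Cov(X)

        # Equal-frequency slicing of the response range.
        order = np.argsort(y)
        slices = np.array_split(order, n_slices)

        Gamma = np.zeros((p, p))
        for idx in slices:
            f_h = len(idx) / n                                # sample frequency of the slice
            d = X[idx].mean(axis=0) - x_bar                   # slice mean minus overall mean
            Gamma += f_h * np.outer(d, d)

        # Whitening; equivalent to maximizing trace((B^T Sigma B)^{-1} B^T Gamma B).
        w, V = np.linalg.eigh(Sigma)
        Sigma_inv_half = V @ np.diag(w ** -0.5) @ V.T
        lam, E = np.linalg.eigh(Sigma_inv_half @ Gamma @ Sigma_inv_half)
        B = Sigma_inv_half @ E[:, np.argsort(lam)[::-1][:k]]  # top-k directions, mapped back
        return B

    # Toy usage: y depends on X only through two linear projections.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 6))
    y = X[:, 0] + 0.5 * (X[:, 1] + 1.0) ** 2 + 0.1 * rng.standard_normal(500)
    print(sir_directions(X, y, n_slices=10, k=2).shape)       # (6, 2)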
Remark 2.1 Another way to obtain the objective (2) is based on the least squares formulation of SIR proposed in [3],

    \min_{B, C} \; L(B, C) = \sum_{h=1}^{H} f_h (\bar{X}_h - \bar{X} - \Sigma B C_h)^T \Sigma^{-1} (\bar{X}_h - \bar{X} - \Sigma B C_h),   (3)

where C = [C_1, C_2, ..., C_H] are auxiliary variables. Eliminating C_h by setting the partial derivative \partial L / \partial C_h = 0 recovers (2) directly. In addition, (2) shows that SIR has an objective similar to that of linear discriminant analysis, although the two are derived from different understandings of discriminative dimension reduction.

2.2 Manifold Regularization for SIR

Each dimension reduction projection \beta can be viewed as a linear function or mapping g(x) = \beta^T x. We want the mapping g(x) to preserve the local geometry of the distribution of the predictors X. Suppose the predictors X are embedded on a manifold M; this can be achieved by penalizing the gradient \nabla_M g along the manifold M. Because we are dealing with random variables with distribution P(X), the following formulation applies,

    R = \int_{X \in M} \|\nabla_M g\|^2 \, dP(X).   (4)

This formulation differs from the original manifold regularization [2] in that g(x) is here a dimension reduction mapping, whereas it is a classification or regression function in [2]. Usually, both the manifold and the marginal distribution of X are unknown. It is well established in manifold learning, however, that the regularization (4) can be approximated by the graph Laplacian associated with the labeled and unlabeled samples {x_i}_{i=1}^{n = n_l + n_u}.

Construct an adjacency graph over {x_i}_{i=1}^{n = n_l + n_u}, where the pairwise edge weight (W)_{ij} = \phi(\|x_i - x_j\|) is defined by a kernel function \phi(\cdot), e.g., the heat kernel \phi(d) = \exp(-d^2). The associated graph Laplacian is L = D - W, where D is the diagonal matrix with D_{ii} = \sum_j W_{ij}. The regularization in (4) can then be approximated by R = g^T L g, where g = [\beta^T x_1, ..., \beta^T x_n]^T. Furthermore, because there are k independent projections B = [\beta_1, ..., \beta_k], we take the sum of the k regularizations,

    R = \sum_{i=1}^{k} g_i^T L g_i = \mathrm{trace}(G^T L G),   (5)

where G = [g_1, ..., g_k].

In manifold learning, it is suggested to use the normalized graph Laplacian D^{-1/2} L D^{-1/2} in place of L, or equivalently to impose the constraint G^T D G = I, to obtain better performance [1]; the solution obtained with the normalized graph Laplacian is also consistent under weaker conditions than that of the unnormalized one [13]. In the proposed regularized SIR, we normalize the regularization (5) as R = trace((G^T D G)^{-1} G^T L G), which is equivalent to the constraint G^T D G = I. This normalization makes R invariant to scaling and rotation of the projections B = [\beta_1, ..., \beta_k], which is preferred for dimension reduction problems.
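The normalized regularizer is straightforward to assemble from a finite sample. The sketch below is only illustrative: the names (laplacian_regularizer, bandwidth) are ours, an explicit bandwidth is added to the heat kernel, and a dense n x n weight matrix is used purely for clarity.

    import numpy as np

    def laplacian_regularizer(X, B, bandwidth=1.0):
        """Evaluate R(B) = trace((G^T D G)^{-1} G^T L G) with G = X^T B and a heat-kernel graph.
        X is p x n (columns are predictors), B is p x k."""
        sq_dists = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)   # ||x_i - x_j||^2
        W = np.exp(-sq_dists / bandwidth ** 2)                            # heat kernel, phi(0) = 1
        D = np.diag(W.sum(axis=1))
        L = D - W                                                         # unnormalized graph Laplacian

        G = X.T @ B                                                       # n x k matrix of projected samples
        return np.trace(np.linalg.solve(G.T @ D @ G, G.T @ L @ G))

    # Toy usage: a random orthonormal projection of 3-D points.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((3, 100))                                     # p = 3, n = 100
    B = np.linalg.qr(rng.standard_normal((3, 2)))[0]
    print(laplacian_regularizer(X, B))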
By adding the regularization R = trace((G^T D G)^{-1} G^T L G) to the SIR objective (2) and substituting G = X^T B, we obtain the regularized SIR,

    \max_B \; \mathrm{SIR}_r(B) = \mathrm{trace}\big( (B^T \Sigma B)^{-1} B^T \Gamma B \big) - \eta \, \mathrm{trace}\big( (B^T S B)^{-1} B^T Q B \big),   (6)

where Q = \frac{1}{n(n-1)} X L X^T, S = \frac{1}{n(n-1)} X D X^T, and \eta is a positive weighting factor.

3 Convergence of the Regularization

Unlike the existing regularizations [8, 9, 14] for SIR, which are deterministic terms, the manifold regularization in (6) is a random term that involves two data-dependent matrices, Q and S. It is therefore necessary to discuss the convergence of the proposed manifold regularization.

It is well known that both \Sigma and \Gamma converge at rate root-n [7, 11, 15]. Therefore, the convergence rate of the objective (6) depends on whether the regularization term also converges at rate root-n. Below, we prove that both sample-based estimates Q = \frac{1}{n(n-1)} X L X^T and S = \frac{1}{n(n-1)} X D X^T converge to deterministic matrices at rate root-n. The convergence of a special case, where the graph Laplacian is built with the kernel \phi(d) = 1(d < \varepsilon), was proved in [6]. Our proof scheme, however, is quite different from that used in [6]. In addition, we cover a general choice of kernel \phi(\cdot) and also establish the root-n convergence rate, which has not been obtained before.

Although the samples {x_i}_{i=1}^{n = n_l + n_u} are independent, the dependence of L and D on the samples means that Q and S cannot be expanded as sums of independent terms. It is therefore difficult to apply the law of large numbers and the central limit theorem to prove convergence and obtain the corresponding rate. Instead, we construct the limits explicitly and show that the variance of the sample-based estimates around these limits decays at rate n^{-1}, which yields root-n convergence. Throughout this section, we assume the following conditions hold.

Conditions 3.1 (i) The kernel function \phi(d) satisfies \phi(0) = 1 and |\phi(d)| \le 1.
(ii) For the distribution P(X) of the predictors, the fourth-order moment exists, i.e., E\| vec(xx^T) (vec(xx^T))^T \| < \infty, where vec(\cdot) stacks a matrix into a column vector.

We start by splitting Q into two parts, T_1 and T_2,

    Q = \frac{1}{n(n-1)} X L X^T = \frac{1}{n(n-1)} \sum_{i=1}^{n} (D_{ii} - W_{ii}) x_i x_i^T - \frac{1}{n(n-1)} \sum_{i \ne j} W_{ij} x_i x_j^T = T_1 - T_2.   (7)

Substituting the kernel function \phi(\cdot) into (7) and using W_{ii} = \phi(0), we have

    T_1 = \frac{1}{n} \sum_{i=1}^{n} \Big( \frac{1}{n-1} \sum_{j \ne i} \phi(\|x_i - x_j\|) \Big) x_i x_i^T,
    T_2 = \frac{1}{n} \sum_{i=1}^{n} x_i \Big( \frac{1}{n-1} \sum_{j \ne i} \phi(\|x_i - x_j\|) x_j^T \Big).   (8)

Under Conditions 3.1, the next two lemmas establish the convergence of T_1 and T_2, respectively.

Lemma 3.1 Let \varphi(x) = E(\phi(\|z - x\|) | x) be the conditional expectation, where z and x are independent and both sampled from P(X). Then E(\varphi(x) x x^T) exists, and T_1 in (8) converges almost surely at rate n^{-1/2}, i.e.,

    T_1 \stackrel{a.s.}{=} E(\varphi(x) x x^T) + O(n^{-1/2}).   (9)

Lemma 3.2 Let \eta(x) = E(\phi(\|z - x\|) z | x) be the conditional expectation, where z and x are independent and both sampled from P(X). Then E(x \eta(x)^T) exists, and T_2 in (8) converges almost surely at rate n^{-1/2}, i.e.,

    T_2 \stackrel{a.s.}{=} E(x \eta(x)^T) + O(n^{-1/2}).   (10)

The proofs of the two lemmas are given in Section 6. Based on Lemmas 3.1 and 3.2, we have the following two theorems for the convergence of Q and S.

Theorem 3.1 Under Conditions 3.1, the sample-based estimate Q converges almost surely to the deterministic matrix E(Q) = E(\varphi(x) x x^T) - E(x \eta(x)^T) at rate n^{-1/2}, i.e., Q \stackrel{a.s.}{=} E(Q) + O(n^{-1/2}).

Proof.
Because Q = T_1 - T_2, the theorem follows immediately from Lemmas 3.1 and 3.2.

Theorem 3.2 Under Conditions 3.1, the sample-based estimate S converges almost surely to the deterministic matrix E(\varphi(x) x x^T) at rate n^{-1/2}, i.e., S \stackrel{a.s.}{=} E(\varphi(x) x x^T) + O(n^{-1/2}).

Proof. Because D_{ii} = \sum_j W_{ij} = \sum_{j \ne i} \phi(\|x_i - x_j\|) + \phi(0), we have

    S = \frac{1}{n(n-1)} \sum_{i=1}^{n} D_{ii} x_i x_i^T = T_1 + \frac{1}{n(n-1)} \sum_{i=1}^{n} x_i x_i^T.

Because \frac{1}{n-1} \sum_{i=1}^{n} x_i x_i^T converges almost surely, the second term is O(n^{-1}) almost surely. Therefore, by Lemma 3.1, S = T_1 + O(n^{-1}) \stackrel{a.s.}{=} E(\varphi(x) x x^T) + O(n^{-1/2}). Note that E(S) \ne E(\varphi(x) x x^T) for finite n, but equality is achieved asymptotically as n \to \infty.

4 Optimization on the Grassmann Manifold

The optimization of the regularized SIR (6) is considerably more difficult than that of the standard SIR (2), which can be solved by a generalized eigen-decomposition. In this section, we present a conjugate gradient method on the Grassmann manifold for solving (6), based on the fact that the objective is invariant to scaling and rotation of the projection B. By exploiting the geometry of the Grassmann manifold, the conjugate gradient algorithm converges faster than gradient schemes in Euclidean space.

Given a constrained optimization problem min F(A) subject to A \in R^{p \times k} and A^T A = I, if the problem further satisfies F(A) = F(AO) for every orthonormal matrix O, then it is an optimization problem defined on the Grassmann manifold G_{p,k}. By the following theorem, we can transform (6) into the equivalent form (11), which is defined on the Grassmann manifold.

Theorem 4.1 Suppose \Sigma is nonsingular, and let \Sigma^{-1/2} S \Sigma^{-1/2} = U \tilde{\Lambda} U^T be an eigen-decomposition. Then problem (6) is equivalent to

    \min_{A^T A = I} \; F(A) = -\mathrm{trace}(A^T \tilde{\Gamma} A) + \eta \, \mathrm{trace}\big( (A^T \tilde{\Lambda} A)^{-1} A^T \tilde{Q} A \big),   (11)

where \tilde{\Gamma} = U^T \Sigma^{-1/2} \Gamma \Sigma^{-1/2} U and \tilde{Q} = U^T \Sigma^{-1/2} Q \Sigma^{-1/2} U. Given the optimal solution A of (11), the optimal solution of (6) is B = \Sigma^{-1/2} U A.

Proof. Substituting B = \Sigma^{-1/2} U A into (6) gives SIR_r(A) = trace((A^T A)^{-1} A^T \tilde{\Gamma} A) - \eta trace((A^T \tilde{\Lambda} A)^{-1} A^T \tilde{Q} A). For nonsingular \Sigma, B = \Sigma^{-1/2} U A is an invertible change of variables, so if A maximizes SIR_r(A) then B maximizes SIR_r(B). Because SIR_r(A) is invariant to scaling and rotation, the constraint A^T A = I can be added without loss of generality, under which maximizing SIR_r(A) is the same as minimizing F(A) in (11). This completes the proof.
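As a concrete illustration of this reduction, the following sketch (under the assumption of a nonsingular \Sigma, with helper names of our own) forms \tilde{\Lambda}, \tilde{\Gamma} and \tilde{Q}, evaluates F(A) of (11), and computes the manifold gradient as the tangent-space projection of the Euclidean gradient, which agrees with (12) below up to an overall factor of 2.

    import numpy as np

    def whiten_to_grassmann(Sigma, Gamma, Q, S):
        """Return (Gamma_t, Q_t, Lambda_t, back) so that (6) becomes the Grassmann problem (11);
        back(A) = Sigma^{-1/2} U A recovers B from a solution A of (11)."""
        w, V = np.linalg.eigh(Sigma)
        Sigma_inv_half = V @ np.diag(w ** -0.5) @ V.T
        lam, U = np.linalg.eigh(Sigma_inv_half @ S @ Sigma_inv_half)   # Sigma^{-1/2} S Sigma^{-1/2} = U Lambda U^T
        T = Sigma_inv_half @ U
        return T.T @ Gamma @ T, T.T @ Q @ T, np.diag(lam), (lambda A: T @ A)

    def F_and_grad(A, Gamma_t, Q_t, Lambda_t, eta):
        """Objective (11) and its gradient on the Grassmann manifold, G_A = (I - A A^T) F_A."""
        AL_inv = np.linalg.inv(A.T @ Lambda_t @ A)
        F = -np.trace(A.T @ Gamma_t @ A) + eta * np.trace(AL_inv @ (A.T @ Q_t @ A))
        FA = -2 * Gamma_t @ A + 2 * eta * (Q_t @ A - Lambda_t @ A @ AL_inv @ (A.T @ Q_t @ A)) @ AL_inv
        return F, (np.eye(A.shape[0]) - A @ A.T) @ FA

    # Toy check with random symmetric positive definite matrices (p = 6, k = 2).
    rng = np.random.default_rng(0)
    def spd(p):
        M = rng.standard_normal((p, p)); return M @ M.T + p * np.eye(p)
    Sigma, S, Gamma, Q = spd(6), spd(6), spd(6), spd(6)
    Gamma_t, Q_t, Lambda_t, back = whiten_to_grassmann(Sigma, Gamma, Q, S)
    A = np.linalg.qr(rng.standard_normal((6, 2)))[0]
    F, G = F_and_grad(A, Gamma_t, Q_t, Lambda_t, eta=0.5)
    print(F, np.linalg.norm(A.T @ G))   # A^T G is ~0: the gradient lies in the tangent space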
To implement the conjugate gradient method on the Grassmann manifold, the gradient of F(A) in (11) is required. According to [4], the gradient G_A of F(A) on the manifold is G_A = \Pi_A F_A, where F_A is the gradient of F(A) in Euclidean space and \Pi_A = I - A A^T is the projection onto the tangent space of the manifold at A. For F(A) in (11), it is given (up to an overall factor of 2) by

    G_A = (I - A A^T) \Big[ -\tilde{\Gamma} A + \eta \big( I - \tilde{\Lambda} A (A^T \tilde{\Lambda} A)^{-1} A^T \big) \tilde{Q} A (A^T \tilde{\Lambda} A)^{-1} \Big].   (12)

Next, we present the conjugate gradient method on the Grassmann manifold [4] for solving (11). The algorithm consists of the following three steps:

• 1-D search along the geodesic: given the current position A_k, the gradient G_k and the search direction H_k, the 1-D search along the geodesic is

    \min_t F(A(t)), \quad A(t) = A_k V \cos(\Sigma t) V^T + U \sin(\Sigma t) V^T,   (13)

where U \Sigma V^T is the compact SVD of H_k. Record the minimizer t_k = t_{min} and take A_{k+1} = A(t_k) as the starting position for the next search.

• Transporting the gradient and search direction: parallel transport G_k and H_k from A_k to A_{k+1} by

    \tau G_k = G_k - \big( A_k V \sin(\Sigma t_k) + U (I - \cos(\Sigma t_k)) \big) U^T G_k,   (14)
    \tau H_k = \big( -A_k V \sin(\Sigma t_k) + U \cos(\Sigma t_k) \big) \Sigma V^T.   (15)

• Calculating the conjugate direction: given the gradient G_{k+1} at A_{k+1}, the conjugate search direction is

    H_{k+1} = -G_{k+1} + \frac{\mathrm{trace}\big( (G_{k+1} - \tau G_k)^T G_{k+1} \big)}{\mathrm{trace}(G_k^T G_k)} \, \tau H_k.   (16)

Initialize A_0 with a random guess (subject to A_0^T A_0 = I), let H_0 = -G_0, and repeat the above three steps iteratively to minimize F(A) until convergence, i.e., until |F(A_{k+1}) - F(A_k)| < \varepsilon_0. Note that, as with the conjugate gradient method in Euclidean space, the search direction has to be reset to H_k = -G_k with a period of k(p - k), the dimension of the search space.
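The three steps above translate almost directly into code. The sketch below follows the geodesic, transport and conjugate-direction formulas (13)-(16) of [4]; the crude grid line search, the reset rule and the name grassmann_cg are our own simplifications, and, to keep the example self-contained, it is demonstrated on a toy objective F(A) = -trace(A^T M A), whose minimum over A^T A = I is attained on the dominant eigenspace of M, rather than on the regularized SIR objective itself.

    import numpy as np

    def grassmann_cg(F, grad_F, A, n_iter=200, tol=1e-8):
        """Conjugate gradient on the Grassmann manifold: geodesic line search (13),
        parallel transport of G and H (14)-(15), and the conjugate direction update (16)."""
        p, k = A.shape
        G = grad_F(A)
        H = -G
        for it in range(n_iter):
            U, s, Vt = np.linalg.svd(H, full_matrices=False)     # compact SVD of H_k
            V = Vt.T
            # (13) crude 1-D search along A(t) = A V cos(s t) V^T + U sin(s t) V^T
            geo = lambda t: A @ V @ np.diag(np.cos(s * t)) @ Vt + U @ np.diag(np.sin(s * t)) @ Vt
            t_k = min(np.linspace(0.0, 1.0, 50)[1:], key=lambda t: F(geo(t)))
            A_new = geo(t_k)
            if abs(F(A_new) - F(A)) < tol:
                return A_new
            # (14)-(15) parallel transport of the old gradient and search direction
            tG = G - (A @ V @ np.diag(np.sin(s * t_k)) + U @ np.diag(1 - np.cos(s * t_k))) @ U.T @ G
            tH = (-A @ V @ np.diag(np.sin(s * t_k)) + U @ np.diag(np.cos(s * t_k))) @ np.diag(s) @ Vt
            # (16) conjugate direction; reset to -G every k(p - k) iterations
            G_new = grad_F(A_new)
            gamma = np.trace((G_new - tG).T @ G_new) / np.trace(G.T @ G)
            H = -G_new + gamma * tH if (it + 1) % (k * (p - k)) else -G_new
            A, G = A_new, G_new
        return A

    # Toy usage: minimizing F(A) = -trace(A^T M A) recovers the top-k eigenspace of M.
    rng = np.random.default_rng(0)
    M = rng.standard_normal((8, 8)); M = M @ M.T
    F = lambda A: -np.trace(A.T @ M @ A)
    grad_F = lambda A: (np.eye(8) - A @ A.T) @ (-2 * M @ A)
    A0 = np.linalg.qr(rng.standard_normal((8, 3)))[0]
    print(-F(grassmann_cg(F, grad_F, A0)), np.sort(np.linalg.eigvalsh(M))[-3:].sum())   # should roughly agree

With the sample-based \tilde{\Gamma}, \tilde{Q} and \tilde{\Lambda} from the previous sketch, the same routine can be applied to (11) by passing the corresponding objective and gradient.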
5 Experiments

In this section, we evaluate the proposed regularized SIR on two real datasets. We report the results of the standard SIR and the localized SIR on the same experiments for reference.

5.1 USPS Test

The USPS dataset contains 9,298 handwritten images of the digits 0 to 9. The entire USPS database is divided into two parts, a training set with 7,291 samples and a test set with 2,007 samples [5]. In our experiment, dimension reduction is performed first and the nearest neighbor rule is then used for classification. Using 1/3 of the training set as labeled data and the remaining 2/3 as unlabeled data, we conduct supervised and semi-supervised dimension reduction with five methods: supervised training of the standard SIR, the manifold regularized SIR, and the localized SIR, and semi-supervised training of the manifold regularized SIR and the localized SIR. Performance is evaluated on the independent test set. Table 1 summarizes the experimental results. It shows that both the regularized SIR and the localized SIR [12] achieve superior performance to the standard SIR, and that the manifold regularized SIR outperforms the localized SIR under both supervised and semi-supervised training. These results indicate that the manifold regularized SIR is effective in exploiting the local geometry of a dataset.

Table 1: Classification accuracy on the USPS dataset versus the dimensionality of the reduced subspace: SIR; the manifold regularized SIR (RSIR); semi-supervised training of the manifold regularized SIR (sRSIR); the localized SIR (LSIR); semi-supervised training of the localized SIR (sLSIR). No entries are reported for SIR beyond dimensionality 9.

    Dimensionality   7        9        11       13       15       17       19       21
    SIR              0.8635   0.8794   --       --       --       --       --       --
    RSIR             0.8575   0.8809   0.8859   0.8889   0.9028   0.9108   0.9148   0.9193
    sRSIR            0.8685   0.8864   0.8934   0.8909   0.9053   0.9128   0.9208   0.9193
    LSIR             0.8301   0.8421   0.8535   0.8724   0.8789   0.8949   0.8989   0.9003
    sLSIR            0.8526   0.8675   0.8795   0.8826   0.8914   0.8954   0.9038   0.9063

5.2 Transductive Visualization

In the Coil-20 database [10], each object has 72 images taken from different view angles. All images are cropped into 128 x 128 pixel arrays with 256 gray levels. We reduce the size to 32 x 32 and use the first 10 objects for 2-D visualization, with 6 of the 72 images of each object randomly labeled. Figure 1 shows the visualization results obtained by SIR, the proposed regularized SIR, and the localized SIR [12]. The figure shows that, by exploiting the unlabeled data via the manifold regularization for dimension reduction, the visualization can be significantly improved. The localized SIR performs better than SIR, but not as well as the regularized SIR.

Figure 1: Visualization of the first 10 objects in the Coil-20 database: from left to right, by the standard SIR, the manifold regularized SIR, and the localized SIR.

6 Proofs of Lemmas

Proof of Lemma 3.1 Because the kernel function \phi(\cdot) is bounded by |\phi(d)| \le 1, we have |\varphi(x)| = |E(\phi(\|z - x\|) | x)| \le 1, which implies that E(\varphi(x) x x^T) exists.
Then, to prove T_1 \stackrel{a.s.}{=} E(\varphi(x) x x^T) + O(n^{-1/2}), it is sufficient to show that E(T_1) = E(\varphi(x) x x^T) and that

    Cov(vec(T_1)) = E\big[ (vec(T_1))(vec(T_1))^T \big] - (vec(E(T_1)))(vec(E(T_1)))^T = O(n^{-1}).

First, because x_i and x_j are independent for i \ne j,

    E(T_1) = E\Big[ \frac{1}{n} \sum_{i=1}^{n} \Big( \frac{1}{n-1} \sum_{j \ne i} \phi(\|x_i - x_j\|) \Big) x_i x_i^T \Big]
           = \frac{1}{n} \sum_{i=1}^{n} E\Big[ \Big( \frac{1}{n-1} \sum_{j \ne i} E(\phi(\|x_i - x_j\|) | x_i) \Big) x_i x_i^T \Big]
           = \frac{1}{n} \sum_{i=1}^{n} E\big( \varphi(x_i) x_i x_i^T \big) = E\big( \varphi(x) x x^T \big).   (17)

Next, we show that E[(vec(T_1))(vec(T_1))^T] is the sum of two terms, one equal to (vec(E(T_1)))(vec(E(T_1)))^T up to O(n^{-1}) and the other of order O(n^{-1}). Writing T_1 = \frac{1}{n(n-1)} \sum_{i \ne j} \phi(\|x_i - x_j\|) x_i x_i^T,

    E\big[ (vec(T_1))(vec(T_1))^T \big]
      = \frac{1}{n^2 (n-1)^2} E\Big[ vec\Big( \sum_{i \ne j} \phi(\|x_i - x_j\|) x_i x_i^T \Big) \Big( vec\Big( \sum_{i' \ne j'} \phi(\|x_{i'} - x_{j'}\|) x_{i'} x_{i'}^T \Big) \Big)^T \Big]   (18)
      = \frac{1}{n^2 (n-1)^2} \Big( \sum_{i, j, i', j' \; distinct} E(\Phi_{i,j,i',j'}) + \sum_{else} E(\Phi_{i,j,i',j'}) \Big),   (19)

where \Phi_{i,j,i',j'} = vec\big( \phi(\|x_i - x_j\|) x_i x_i^T \big) \big( vec\big( \phi(\|x_{i'} - x_{j'}\|) x_{i'} x_{i'}^T \big) \big)^T.

When i, j, i', j' are distinct, x_i, x_j, x_{i'} and x_{j'} are independent, so

    E(\Phi_{i,j,i',j'}) = E\big[ vec\big( \phi(\|x_i - x_j\|) x_i x_i^T \big) \big] \Big( E\big[ vec\big( \phi(\|x_{i'} - x_{j'}\|) x_{i'} x_{i'}^T \big) \big] \Big)^T = (vec(E(T_1)))(vec(E(T_1)))^T.   (20)

Since there are n(n-1)(n-2)(n-3) such distinct quadruples, the first term in (19) is

    \frac{n(n-1)(n-2)(n-3)}{n^2 (n-1)^2} (vec(E(T_1)))(vec(E(T_1)))^T = (vec(E(T_1)))(vec(E(T_1)))^T + O(n^{-1}).

For the second term in (19),
E(\Phi_{i,j,i',j'}) is bounded (entrywise) by a constant matrix M under Conditions 3.1, and the number of quadruples with i \ne j and i' \ne j' that are not all distinct is n(n-1)(4n-6). Thus

    \Big| \frac{1}{n^2 (n-1)^2} \sum_{else} E(\Phi_{i,j,i',j'}) \Big| \le \frac{n(n-1)(4n-6)}{n^2 (n-1)^2} M = O(n^{-1}).   (21)

Combining the above two results, we have

    Cov(vec(T_1)) = E\big[ (vec(T_1))(vec(T_1))^T \big] - (vec(E(T_1)))(vec(E(T_1)))^T = O(n^{-1}),   (22)

which completes the proof of Lemma 3.1.

Proof of Lemma 3.2 Similar to the proof of Lemma 3.1, E(x \eta(x)^T) exists. It is then sufficient to show that E(T_2) = E(x \eta(x)^T) and Cov(vec(T_2)) = O(n^{-1}). First,

    E(T_2) = E\Big[ \frac{1}{n} \sum_{i=1}^{n} x_i \Big( \frac{1}{n-1} \sum_{j \ne i} \phi(\|x_i - x_j\|) x_j^T \Big) \Big]
           = \frac{1}{n} \sum_{i=1}^{n} E\Big[ x_i \Big( \frac{1}{n-1} \sum_{j \ne i} E(\phi(\|x_i - x_j\|) x_j^T | x_i) \Big) \Big]
           = \frac{1}{n} \sum_{i=1}^{n} E\big( x_i \eta(x_i)^T \big) = E\big( x \eta(x)^T \big).   (23)

Next, we split E[(vec(T_2))(vec(T_2))^T] into two terms,

    E\big[ (vec(T_2))(vec(T_2))^T \big] = \frac{1}{n^2 (n-1)^2} \Big( \sum_{i, j, i', j' \; distinct} E(\Psi_{i,j,i',j'}) + \sum_{else} E(\Psi_{i,j,i',j'}) \Big),   (24)

where \Psi_{i,j,i',j'} = vec\big( \phi(\|x_i - x_j\|) x_i x_j^T \big) \big( vec\big( \phi(\|x_{i'} - x_{j'}\|) x_{i'} x_{j'}^T \big) \big)^T. Following the same argument as in the proof of Lemma 3.1, we have

    \frac{1}{n^2 (n-1)^2} \sum_{i, j, i', j' \; distinct} E(\Psi_{i,j,i',j'}) = (vec(E(T_2)))(vec(E(T_2)))^T + O(n^{-1})   (25)

and

    \Big| \frac{1}{n^2 (n-1)^2} \sum_{else} E(\Psi_{i,j,i',j'}) \Big| \le O(n^{-1}).   (26)

Therefore, Cov(vec(T_2)) = O(n^{-1}), which completes the proof of Lemma 3.2.
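Although no such experiment appears in the paper, a short simulation of the following kind can serve as a sanity check on the rates established above: it repeatedly draws samples, forms Q = X L X^T / (n(n-1)) with a heat kernel (a bandwidth parameter is added for the sketch), and verifies that the entrywise fluctuation of Q shrinks roughly like n^{-1/2}, as implied by Cov(vec(T_1)) = O(n^{-1}) and Cov(vec(T_2)) = O(n^{-1}).

    import numpy as np

    def Q_hat(X, bandwidth=1.0):
        """Q = X L X^T / (n (n - 1)) with a heat-kernel graph, as in Section 3 (X is p x n)."""
        n = X.shape[1]
        sq = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
        W = np.exp(-sq / bandwidth ** 2)
        L = np.diag(W.sum(axis=1)) - W
        return X @ L @ X.T / (n * (n - 1))

    rng = np.random.default_rng(0)
    p, reps = 3, 200
    for n in (50, 200, 800):
        Qs = np.array([Q_hat(rng.standard_normal((p, n))) for _ in range(reps)])
        # Root-n rate: the entrywise standard deviation of Q should shrink roughly like n^{-1/2},
        # i.e. multiplying n by 4 should roughly halve the fluctuation around the limit.
        print(n, Qs.std(axis=0).mean())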
7 Conclusion

We have studied the manifold regularization for Sliced Inverse Regression (SIR). The regularized SIR extends the original SIR in two ways: it exploits the local geometry that is otherwise ignored, and it enables SIR to deal with transductive/semi-supervised learning problems. We have also analyzed the statistical properties of the proposed regularization and shown that, under mild conditions, the manifold regularization converges at rate root-n. To solve the regularized SIR problem, we presented a conjugate gradient method on the Grassmann manifold. Experiments on real datasets validate the effectiveness of the regularized SIR.

Acknowledgments

This project was supported by the Nanyang Technological University Nanyang SUG Grant (under project number M58020010).

References

[1] Belkin, M. & Niyogi, P. (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6): 1373-1396.

[2] Belkin, M., Niyogi, P. & Sindhwani, V. (2006) Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 1: 1-48.

[3] Cook, R.D. (2004) Testing predictor contributions in sufficient dimension reduction. Annals of Statistics, 32: 1061-1092.

[4] Edelman, A., Arias, T.A. & Smith, S.T. (1998) The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl., 20(2): 303-353.

[5] Hastie, T., Buja, A. & Tibshirani, R. (1995) Penalized discriminant analysis. Annals of Statistics, 2: 73-102.

[6] He, X., Deng, C. & Min, W. (2005) Statistical and computational analysis of locality preserving projection. In Proceedings of the 22nd International Conference on Machine Learning (ICML).

[7] Li, K. (1991) Sliced inverse regression for dimension reduction (with discussion). J. Amer. Statist. Assoc., 86: 316-342.

[8] Li, L. (2007) Sparse sufficient dimension reduction. Biometrika, 94(3): 603-613.

[9] Li, L. & Yin, X. (2008) Sliced inverse regression with regularizations. Biometrics, 64: 124-131.

[10] Nene, S.A., Nayar, S.K. & Murase, H. (1996) Columbia object image library: COIL-20. Technical Report CUCS-006-96, Dept. of Computer Science, Columbia University.

[11] Saracco, J. (1997) An asymptotic theory for sliced inverse regression. Comm. Statist. Theory Methods, 26: 2141-2171.

[12] Wu, Q., Mukherjee, S. & Liang, F. (2008) Localized sliced inverse regression. In Advances in Neural Information Processing Systems 20, Cambridge, MA: MIT Press.

[13] von Luxburg, U., Bousquet, O. & Belkin, M. (2005) Limits of spectral clustering. In L.K. Saul, Y. Weiss & L. Bottou (eds.), Advances in Neural Information Processing Systems 17, Cambridge, MA: MIT Press.

[14] Zhong, W., Zeng, P., Ma, P., Liu, J.S. & Zhu, Y. (2005) RSIR: Regularized sliced inverse regression for motif discovery. Bioinformatics, 21: 4169-4175.

[15] Zhu, L.X. & Ng, K.W. (1995) Asymptotics of sliced inverse regression. Statistica Sinica, 5: 727-736.
", "award": [], "sourceid": 487, "authors": [{"given_name": "Wei", "family_name": "Bian", "institution": null}, {"given_name": "Dacheng", "family_name": "Tao", "institution": null}]}