{"title": "Convergence and Rate of Convergence of a Manifold-Based Dimension Reduction Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 1529, "page_last": 1536, "abstract": null, "full_text": "Convergence and Rate of Convergence of a Manifold-Based Dimension Reduction Algorithm\n\nAndrew K. Smith, Xiaoming Huo\n\nSchool of Industrial and Systems Engineering\n\nGeorgia Institute of Technology\n\nAtlanta, GA 30332\n\nandrewsmith81@gmail.com, huo@gatech.edu\n\nHongyuan Zha\n\nCollege of Computing\n\nGeorgia Institute of Technology\n\nAtlanta, GA 30332\n\nzha@cc.gatech.edu\n\nAbstract\n\nWe study the convergence and the rate of convergence of a local manifold learning algorithm: LTSA [13]. The main technical tool is perturbation analysis of the linear invariant subspace that corresponds to the solution of LTSA. We derive a worst-case upper bound on the error of LTSA, which naturally leads to a convergence result. We then derive the rate of convergence for LTSA in a special case.\n\n1 Introduction\n\nManifold learning (ML) methods have attracted substantial attention due to their demonstrated potential. Many algorithms have been proposed, and some work has appeared analyzing the performance of these methods. The main contribution of this paper is to establish some asymptotic properties of a local manifold learning algorithm, LTSA [13], as well as to demonstrate some of its limitations. The key idea in the analysis is to treat the solutions computed by LTSA as invariant subspaces of certain matrices, and then carry out a matrix perturbation analysis.\n\nMany efficient ML algorithms have been developed, including locally linear embedding (LLE) [6], ISOMAP [9], charting [2], local tangent space alignment (LTSA) [13], Laplacian eigenmaps [1], and Hessian eigenmaps [3]. 
A common feature of many of these manifold learning algorithms is that their solutions correspond to invariant subspaces, typically the eigenspace associated with the smallest eigenvalues of a kernel or alignment matrix. The exact form of this matrix, of course, depends on the details of the particular algorithm.\n\nWe start with LTSA for several reasons. First, in numerical simulations (e.g., using the tools offered by [10]), we find empirically that LTSA performs among the best of the available algorithms. Second, the solution to each step of the LTSA algorithm is an invariant subspace, which makes analysis of its performance more tractable. Third, the similarity between LTSA and several other ML algorithms (e.g., LLE, Laplacian eigenmaps, and Hessian eigenmaps) suggests that our results may generalize. Our hope is that this performance analysis will provide a theoretical foundation for the application of ML algorithms.\n\nThe rest of the paper is organized as follows. The problem formulation and background information are presented in Section 2. The perturbation analysis is carried out, and the main theorem (Theorem 3.7) is proved, in Section 3. The rate of convergence in a special case is derived in Section 4. A discussion of related existing work is included in Section 5. Finally, we present concluding remarks in Section 6.\n\n2 Manifold Learning and LTSA\n\nWe formulate the manifold learning problem as follows. For a positive integer n, let yi \u2208 IR^D, i = 1, 2, . . . , n, denote n observations. We assume that there is a mapping f : IR^d \u2192 IR^D which satisfies a set of regularity conditions (detailed in the next subsection). In addition, we require another set of (possibly multivariate) values xi \u2208 IR^d, d < D, i = 1, 2, . . . , n, such that\n\nyi = f(xi) + \u03b5i,  i = 1, 2, . . . , n,  (1)\n\nwhere \u03b5i \u2208 IR^D denotes a random error. 
For example, we may assume \u03b5i \u223c N(0, \u03c3^2 I_D); i.e., a multivariate normal distribution with mean zero and variance-covariance matrix proportional to the identity. The central questions of manifold learning are: 1) Can we find a set of low-dimensional vectors such that equation (1) holds? 2) What kind of regularity conditions should be imposed on f? 3) Is the model well defined? These questions are the main focus of this paper.\n\n2.1 A Pedagogical Example\n\nFigure 1: An illustrative example of LTSA in nonparametric dimension reduction. (a) Embedded Spiral; (b) Noisy Observations; (c) Learned vs. Truth. The straight-line pattern in (c) indicates that the underlying parametrization has been approximately recovered.\n\nAn illustrative example of dimension reduction that makes our formulation more concrete is given in Figure 1. Subfigure (a) shows the true underlying structure of a toy example, a 1-D spiral. The noiseless observations are equally spaced points on this spiral. In subfigure (b), 1024 noisy observations are generated with multivariate noise satisfying \u03b5i \u223c N(0, (1/100) I_3). We then apply LTSA to the noisy observations, using k = 10 nearest neighbors. In subfigure (c), the result from LTSA is compared with the true parametrization. When the underlying parameter is faithfully recovered, one should see a straight line, which is observed to hold approximately in subfigure (c).\n\n2.2 Regularity and Uniqueness of the Mapping f\n\nIf the conditions on the mapping f are too general, the model in equation (1) is not well defined. For example, if the mapping f(\u00b7) and point set {xi} satisfy (1), so do f(A^{\u22121}(\u00b7 \u2212 b)) and {Axi + b}, where A is an invertible d by d matrix and b is a d-dimensional vector. 
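The data generation in the pedagogical example of Section 2.1 can be sketched in a few lines. Only n = 1024 and the noise law \u03b5i \u223c N(0, (1/100) I_3) come from the example; the exact spiral curve behind Figure 1 is not specified in the text, so the parametrization `f` below is an assumed stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024                  # number of noisy observations, as in Figure 1(b)
sigma2 = 1.0 / 100.0      # noise variance: eps_i ~ N(0, (1/100) I_3)

# A 1-D spiral embedded in IR^3.  This curve is only an assumed stand-in
# for the (unspecified) spiral used in Figure 1.
def f(t):
    return np.column_stack((np.cos(6.0 * np.pi * t),
                            np.sin(6.0 * np.pi * t),
                            3.0 * t))

t = np.linspace(0.0, 1.0, n)          # equally spaced parameters x_i
# Noisy observations y_i = f(x_i) + eps_i, per model (1).
Y = f(t) + rng.normal(scale=np.sqrt(sigma2), size=(n, 3))
print(Y.shape)
```

Feeding `Y` to an LTSA implementation with k = 10 and comparing the output against `t` reproduces the qualitative behavior of Figure 1(c).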
As is common in the manifold-learning literature, we adopt the following condition on f.\n\nCondition 2.1 (Local Isometry) The mapping f is locally isometric: for any \u03b5 > 0 and x in the domain of f, let N\u03b5(x) = {z : \u2016z \u2212 x\u2016_2 < \u03b5} denote an \u03b5-neighborhood of x using Euclidean distance. We have\n\n\u2016f(x) \u2212 f(x0)\u2016_2 = \u2016x \u2212 x0\u2016_2 + o(\u2016x \u2212 x0\u2016_2).\n\nThe above condition indicates that in a local sense, f preserves Euclidean distance. Let J(f; x0) denote the Jacobian of f at x0. We have J(f; x0) \u2208 IR^{D\u00d7d}, where each column (resp., row) of J(f; x0) corresponds to a coordinate in the feature (resp., data) space. The above in fact implies the following lemma [13].\n\nLemma 2.2 The matrix J(f; x0) is orthonormal for any x0, i.e., J^T(f; x0) J(f; x0) = I_d.\n\nGiven the previous condition, model (1) is still not uniquely defined. For example, for any d by d orthogonal matrix O and any d-dimensional vector b, if f(\u00b7) and {xi} satisfy (1) and Condition 2.1, so do f(O^T(\u00b7 \u2212 b)) and {Oxi + b}. We can force b to be 0 by imposing the condition that \u2211_i xi = 0. In dimension reduction, we can consider the sets {xi} and {Oxi} \u201cinvariant,\u201d because one is just a rotation of the other. In fact, this invariance coincides with the concept of an \u201cinvariant subspace,\u201d to be discussed below.\n\nCondition 2.3 (Local Linear Independence Condition) Let Yi \u2208 IR^{D\u00d7k}, 1 \u2264 i \u2264 n, denote a matrix whose columns are the ith observation yi and its k \u2212 1 nearest neighbors. We choose k \u2212 1 neighbors so that the matrix Yi has k columns. 
It is generally assumed that d < k. For any 1 \u2264 i \u2264 n, the rank of Yi P_k is at least d; in other words, the dth largest singular value of the matrix Yi P_k is greater than 0.\n\nIn the above, we use the projection matrix P_k = I_k \u2212 (1/k) \u00b7 1_k 1_k^T, where I_k is the k by k identity matrix and 1_k is a k-dimensional column vector of ones. The regularity of the manifold can be determined by the Hessians of the mapping. Rewrite f(x) for x \u2208 IR^d as f(x) = (f1(x), f2(x), . . . , fD(x))^T. Furthermore, let x = (x1, . . . , xd)^T. For each i, the Hessian is the d by d matrix\n\n[Hi(f; x)]jk = \u2202^2 fi(x) / (\u2202xj \u2202xk),  1 \u2264 i \u2264 D, 1 \u2264 j, k \u2264 d.\n\nThe following condition ensures that f is locally smooth. We impose a bound on all the components of the Hessians.\n\nCondition 2.4 (Regularity of the Manifold) |[Hi(f; x)]jk| \u2264 C1 for all i, j, and k, where C1 > 0 is a prescribed constant.\n\n2.3 Solutions as Invariant Subspaces and a Related Metric\n\nWe now give a more detailed discussion of invariant subspaces. Let R(X) denote the subspace spanned by the columns of X. Recall that xi, i = 1, 2, . . . , n, are the true low-dimensional representations of the observations. We treat the xi\u2019s as column vectors. Let X = (x1, x2, \u00b7\u00b7\u00b7 , xn)^T; i.e., the ith row of X corresponds to xi, 1 \u2264 i \u2264 n. If the set {Oxi}, where O is a d by d orthogonal square matrix, forms another solution to the dimension reduction problem, we have\n\n(Ox1, Ox2, \u00b7\u00b7\u00b7 , Oxn)^T = X O^T.\n\nIt is evident that R(X O^T) = R(X). 
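A quick numerical check of this identity (a sketch; the sizes n, d and the random orthogonal O are arbitrary choices, and `scipy.linalg.subspace_angles` computes the canonical angles between the two column spaces):

```python
import numpy as np
from scipy.linalg import subspace_angles

rng = np.random.default_rng(1)
n, d = 200, 2
X = rng.normal(size=(n, d))                   # rows are the x_i's
O, _ = np.linalg.qr(rng.normal(size=(d, d)))  # a random d x d orthogonal matrix

# All canonical angles between R(X) and R(X O^T) are (numerically) zero,
# so the two parametrizations span the same invariant subspace.
theta = subspace_angles(X, X @ O.T)
print(np.tan(theta).max())
```

The printed value is zero up to floating-point round-off, which is exactly the sense in which {xi} and {Oxi} are treated as one solution.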
This justifies the invariance that was mentioned earlier. The goal of our performance analysis is to answer the following question: letting \u2016tan(\u00b7, \u00b7)\u2016_2 denote the Euclidean norm of the vector of canonical angles between two invariant subspaces ([8, Section I.5]), and letting X and \u02dcX denote the true and estimated parameters, respectively, how do we evaluate \u2016tan(R(X), R(\u02dcX))\u2016_2?\n\n2.4 LTSA: Local Tangent Space Alignment\n\nWe now review LTSA. There are two main steps in the LTSA algorithm [13].\n\n1. The first step is to compute the local representation on the manifold. Recall the projection matrix P_k. It is easy to verify that P_k = P_k \u00b7 P_k, which is a characteristic of projection matrices. We solve the minimization problem min_{\u039b,V} \u2016Yi P_k \u2212 \u039bV\u2016_F, where \u039b \u2208 IR^{D\u00d7d}, V \u2208 IR^{d\u00d7k}, and V V^T = I_d. Let Vi denote the optimal V. Then the row vectors of Vi are the d right singular vectors of Yi P_k corresponding to its d largest singular values.\n\n2. The solution to LTSA corresponds to the invariant subspace which is spanned and determined by the eigenvectors associated with the 2nd to the (d + 1)st smallest eigenvalues of the matrix\n\n(S1, . . . , Sn) diag(P_k \u2212 V1^T V1, . . . , P_k \u2212 Vn^T Vn) (S1, . . . , Sn)^T,  (2)\n\nwhere Si \u2208 IR^{n\u00d7k} is a selection matrix such that Y^T Si = Yi, where Y = (y1, y2, . . . , yn)^T.\n\nAs mentioned earlier, the subspace spanned by the eigenvectors associated with the 2nd to the (d + 1)st smallest eigenvalues of the matrix in (2) is an invariant subspace, which will be analyzed using matrix perturbation techniques. We slightly reformulated the original algorithm as presented in [13] for later analysis.\n\n3 Perturbation Analysis\n\nWe now carry out a perturbation analysis on the reformulated version of LTSA. 
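As a concrete reference point for the analysis, the two LTSA steps reviewed in Section 2.4 can be condensed into a short sketch. This is a simplified reimplementation based on the description above, not the authors' code; brute-force neighbor search and a dense eigensolver are assumed for clarity:

```python
import numpy as np

def ltsa(Y, d, k):
    """Minimal LTSA sketch.  Y is n x D with rows y_i; returns the n x d
    embedding spanned by the eigenvectors of the alignment matrix (2)
    associated with its 2nd to (d+1)st smallest eigenvalues."""
    n = Y.shape[0]
    # Neighborhoods: each y_i together with its k-1 nearest neighbors.
    dist = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    nbrs = np.argsort(dist, axis=1)[:, :k]
    Pk = np.eye(k) - np.ones((k, k)) / k          # centering projection P_k
    M = np.zeros((n, n))                          # alignment matrix (2)
    for i in range(n):
        idx = nbrs[i]
        # Step 1: rows of V_i are the top-d right singular vectors of Y_i P_k.
        _, _, Vt = np.linalg.svd(Y[idx].T @ Pk, full_matrices=False)
        Vi = Vt[:d]
        # Accumulate S_i (P_k - V_i^T V_i) S_i^T into M.
        M[np.ix_(idx, idx)] += Pk - Vi.T @ Vi
    # Step 2: skip the smallest (constant) eigenvector, keep the next d.
    _, U = np.linalg.eigh(M)
    return U[:, 1:d + 1]
```

On noiseless data lying on a flat d-dimensional patch, the columns returned by `ltsa` lie (numerically) in the null space of the alignment matrix, i.e., in the span of 1_n and the centered true parameters.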
There are two steps: in the local step (Section 3.1), we characterize the deviation of the null spaces of the matrices P_k \u2212 Vi^T Vi, i = 1, 2, . . . , n. In the global step (Section 3.2), we derive the variation of the null space under global alignment.\n\n3.1 Local Coordinates\n\nLet X be the matrix of true parameters. We define Xi = X^T Si = (x1, x2, \u00b7\u00b7\u00b7 , xn) Si; i.e., the columns of Xi are xi and those xj\u2019s that correspond to the k \u2212 1 nearest neighbors of yi. We require a bound on the size of the local neighborhoods defined by the Xi\u2019s.\n\nCondition 3.1 (Universal Bound on the Sizes of Neighborhoods) For all i, 1 \u2264 i \u2264 n, we have \u03c4i < \u03c4, where \u03c4 is a prescribed constant and \u03c4i is an upper bound on the distance between two columns of Xi: \u03c4i = max_{xj, xk} \u2016xj \u2212 xk\u2016, where the maximum is taken over all columns of Xi. In this paper, we are interested in the case when \u03c4 \u2192 0.\n\nWe will need conditions on the local tangent spaces. Let d_{min,i} (respectively, d_{max,i}) denote the minimum (respectively, maximum) singular value of Xi P_k. Let\n\nd_min = min_{1\u2264i\u2264n} d_{min,i},  d_max = max_{1\u2264i\u2264n} d_{max,i}.\n\nWe can bound d_max as d_min \u2264 d_max \u2264 \u03c4\u221ak [5].\n\nCondition 3.2 (Local Tangent Space) There exists a constant C2 > 0, such that\n\nC2 \u00b7 \u03c4 \u2264 d_min.  (3)\n\nThe above can roughly be thought of as requiring that the local dimension of the manifold remain constant (i.e., the manifold has no singularities).\n\nThe following condition defines a global bound on the errors \u03b5i.\n\nCondition 3.3 (Universal Error Bound) There exists \u03c3 > 0, such that \u2200i, 1 \u2264 i \u2264 n, we have \u2016yi \u2212 f(xi)\u2016_\u221e < \u03c3. 
Moreover, we assume \u03c3 = o(\u03c4); i.e., we have \u03c3/\u03c4 \u2192 0 as \u03c4 \u2192 0. It is reasonable to require that the error bound (\u03c3) be smaller than the size of the neighborhood (\u03c4), which is reflected in the above condition.\n\nWithin each neighborhood, we give a perturbation bound between an invariant subspace spanned by the true parametrization and the invariant subspace spanned by the singular vectors of the matrix of noisy observations. Let Xi P_k = Ai Di Bi be the singular value decomposition of the matrix Xi P_k; here Ai \u2208 IR^{d\u00d7d} is orthogonal (Ai Ai^T = I_d), Di \u2208 IR^{d\u00d7d} is diagonal, and the rows of Bi \u2208 IR^{d\u00d7k} are the right singular vectors corresponding to the largest singular values (Bi Bi^T = I_d). It is not hard to verify that\n\nBi = Bi P_k.  (4)\n\nLet Yi P_k = \u02dcAi \u02dcDi \u02dcBi be the singular value decomposition of Yi P_k, and assume that this is the \u201cthin\u201d decomposition of rank d. We may think of this as the perturbed version of J(f; x_i^{(0)}) Xi P_k. The rows of \u02dcBi are the eigenvectors of (Yi P_k)^T (Yi P_k) corresponding to the d largest eigenvalues. Let R(Bi^T) (respectively, R(\u02dcBi^T)) denote the invariant subspace that is spanned by the columns of the matrix Bi^T (respectively, \u02dcBi^T).\n\nTheorem 3.4 Given the invariant subspaces R(Bi^T) and R(\u02dcBi^T) as defined above, we have\n\nlim_{\u03c4 \u2192 0} \u2016sin(R(Bi^T), R(\u02dcBi^T))\u2016_2 \u2264 C3 (\u03c3/\u03c4 + C1 \u03c4),\n\nwhere C3 is a constant that depends on k, D and C2.\n\nThe proof is presented in [5]. The above gives an upper bound on the deviation of the local invariant subspace in step 1 of the modified LTSA. 
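Theorem 3.4 can be probed numerically. The following sketch examines one neighborhood of an assumed test curve f(x) = (x, x^2, x^3) in IR^3, which is locally isometric at 0 with bounded Hessian; the theorem is asymptotic in \u03c4, so this is only an illustration of the qualitative behavior:

```python
import numpy as np

def local_sin_angle(tau, sigma, k=10, seed=0):
    """sin of the canonical angle (d = 1) between R(B_i^T), from the
    noiseless local coordinates, and R(~B_i^T), from the noisy data."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, tau, k)                   # local parameters, spread tau
    F = np.column_stack((x, x**2, x**3))           # assumed curve in IR^3
    Y = F + rng.normal(scale=sigma, size=F.shape)  # noisy local observations
    Pk = np.eye(k) - np.ones((k, k)) / k
    # B_i: top right singular vector of X_i P_k (here, normalized centered x).
    b = (x - x.mean()) / np.linalg.norm(x - x.mean())
    # ~B_i: top right singular vector of Y_i P_k.
    bt = np.linalg.svd(Y.T @ Pk, full_matrices=False)[2][0]
    c = abs(float(b @ bt))                         # cosine of the angle
    return float(np.sqrt(max(0.0, 1.0 - c**2)))

# The bound C3 (sigma/tau + C1 tau) shrinks as tau -> 0 with sigma = o(tau),
# and so does the measured angle:
print(local_sin_angle(0.5, 1e-2), local_sin_angle(0.01, 1e-6))
```

The second printed angle (small \u03c4, \u03c3 = \u03c4^2-scale noise) is much smaller than the first, consistent with the bound.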
It will be used later to prove a global upper bound.\n\n3.2 Global Alignment\n\nCondition 3.5 (No Overuse of One Observation) There exists a constant C4, such that\n\n\u2016\u2211_{i=1}^{n} Si\u2016_\u221e \u2264 C4.\n\nNote that we must have C4 \u2265 k. The next condition (Condition 3.6) will implicitly give an upper bound on C4. Recall that the quantity \u2016\u2211_{i=1}^{n} Si\u2016_\u221e is the maximum row sum of the absolute values of the entries in \u2211_{i=1}^{n} Si. The value of \u2016\u2211_{i=1}^{n} Si\u2016_\u221e is equal to the maximum number of nearest neighbor subsets to which a single observation belongs.\n\nWe will derive an upper bound on the angle between the invariant subspace spanned by the result of LTSA and the space spanned by the true parameters. Given (4), it can be shown that Xi P_k (P_k \u2212 Bi^T Bi)(Xi P_k)^T = 0. Recall X = (x1, x2, . . . , xn)^T \u2208 IR^{n\u00d7d}. It is not hard to verify that the row vectors of (1_n, X)^T span the (d + 1)-dimensional null space of the matrix\n\n(S1, . . . , Sn) P_k diag(I \u2212 B1^T B1, . . . , I \u2212 Bn^T Bn) P_k (S1, . . . , Sn)^T.  (5)\n\nAssume that (1_n/\u221an, X, X^c)^T is orthogonal, where X^c \u2208 IR^{n\u00d7(n\u22121\u2212d)}. Although in our original problem formulation we made no assumption about the xi\u2019s, we can still assume that the columns of X are orthonormal, because we can transform any set of xi\u2019s into an orthonormal set by rescaling the columns and multiplying by an orthogonal matrix. Based on the previous paragraph, we have\n\n(1_n/\u221an, X, X^c)^T Mn (1_n/\u221an, X, X^c) = [ 0_{(d+1)\u00d7(d+1)}  0_{(d+1)\u00d7(n\u2212d\u22121)} ; 0_{(n\u2212d\u22121)\u00d7(d+1)}  L2 ],  (6)\n\nwhere\n\nMn = (S1, . . . , Sn) P_k diag(I_k \u2212 B1^T B1, . . . , I_k \u2212 Bn^T Bn) P_k (S1, . . . , Sn)^T\n\nand\n\nL2 = (X^c)^T Mn X^c.\n\nLet \u2113_min denote the minimum singular value (i.e., eigenvalue) of L2. We will need the following condition on \u2113_min.\n\nCondition 3.6 (Appropriateness of Global Dimension) \u2113_min > 0, and \u2113_min goes to 0 at a slower rate than \u03c3/\u03c4 + (1/2) C1 \u03c4; i.e., as \u03c4 \u2192 0, we have\n\n(\u03c3/\u03c4 + (1/2) C1 \u03c4) \u00b7 \u2016\u2211_{i=1}^{n} Si\u2016_\u221e / \u2113_min \u2192 0.\n\nAs discussed in [12, 11], this condition is actually related to the amount of overlap between the nearest neighbor sets.\n\nTheorem 3.7 (Main Theorem)\n\nlim_{\u03c4 \u2192 0} \u2016tan(R(\u02dcX), R(X))\u2016_2 \u2264 C3 (\u03c3/\u03c4 + C1 \u03c4) \u00b7 \u2016\u2211_{i=1}^{n} Si\u2016_\u221e / \u2113_min.  (7)\n\nAs mentioned in the Introduction, the above theorem gives a worst-case bound on the performance of LTSA. For proofs, as well as a discussion of the requirement that \u03c3 \u2192 0, see [7]. A full discussion of when Condition 3.6 is satisfied is beyond the scope of this paper; we leave it to future investigation. We refer to [5] for some simulation results related to the above analysis.\n\n4 A Preliminary Result on the Rate of Convergence\n\nWe discuss the rate of convergence of LTSA (to the true underlying manifold structure) in the aforementioned framework. We modify the LTSA algorithm (mainly in how the size of the nearest neighborhood is chosen) for a reason that will become evident later. We assume that the following result regarding the relationship between k, \u2113_min, and \u03c4 holds (it can be proved for xi sampled on a uniform grid, using the properties of biharmonic eigenvalues for partial differential equations):\n\n\u2113_min \u2248 C(k) \u00b7 \u03bd^+_min(\u2206^2) \u00b7 \u03c4^4,  (8)\n\nwhere \u03bd^+_min(\u2206^2) is a constant, and C(k) \u2248 k^5. 
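A rough empirical probe of this scaling, under the same grid assumption: a noiseless 1-D uniform grid embedded as a straight line, with contiguous windows as neighborhoods. The embedding and sizes here are illustrative assumptions, not part of the proof:

```python
import numpy as np

def ell_min(n, k=8):
    """Smallest nonzero eigenvalue of the alignment matrix M_n for a
    noiseless 1-D uniform grid (d = 1) embedded as a line in IR^2."""
    x = np.linspace(0.0, 1.0, n)
    Y = np.column_stack((x, np.zeros(n)))   # flat, exactly isometric embedding
    Pk = np.eye(k) - np.ones((k, k)) / k
    M = np.zeros((n, n))
    for i in range(n - k + 1):              # contiguous k-point windows
        idx = np.arange(i, i + k)
        Vt = np.linalg.svd(Y[idx].T @ Pk, full_matrices=False)[2]
        M[np.ix_(idx, idx)] += Pk - Vt[:1].T @ Vt[:1]
    w = np.linalg.eigvalsh(M)
    return float(w[2])   # w[0], w[1] ~ 0 correspond to span{1_n, x}

# With k fixed, tau ~ k/n, so (8) suggests ell_min shrinks roughly
# 2^4 = 16-fold when n doubles.
r = ell_min(64) / ell_min(128)
print(r)
```

The measured ratio decaying as n grows is what makes Condition 3.6 delicate: the denominator \u2113_min in (7) vanishes quickly.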
We will address such a result in the more general context in the future.\n\nSo far, we have assumed that k is constant. However, allowing k to be a function of the sample size n, say k = n^\u03b1, where \u03b1 \u2208 [0, 1), allows us to control the asymptotic behavior of \u2113_min along with the convergence of the estimated alignment matrix to the true alignment matrix. Consider our original bound on the angle between the true coordinates and the estimated coordinates:\n\nlim_{\u03c4 \u2192 0} \u2016tan(R(\u02dcX), R(X))\u2016_2 \u2264 C3 (\u03c3/\u03c4 + C1 \u03c4) \u00b7 \u2016\u2211_{i=1}^{n} Si\u2016_\u221e / \u2113_min.\n\nNow, set k = n^\u03b1, where \u03b1 \u2208 [0, 1) is an exponent whose value will be decided later. We must be careful in disregarding constants, since they may involve k. We have that C3 = \u221a(kD)/C2; C1 and C2 are fundamental constants not involving k. Further, it is easy to see that \u2016\u2211_{i=1}^{n} Si\u2016_\u221e is O(k): since each point has k neighbors, the maximum number of neighborhoods to which a point belongs is of the same order as k.\n\nNow, we can use a simple heuristic to estimate the size of \u03c4, the neighborhood size. For example, suppose we fix \u03b5 and consider \u03b5-neighborhoods. For simplicity, assume that the parameter space is the unit hypercube [0, 1]^d, where d is the intrinsic dimension. The law of large numbers tells us that k \u2248 \u03b5^d \u00b7 n. Thus we can approximate \u03c4 as \u03c4 \u2248 O(n^{(\u03b1\u22121)/d}). 
Plugging all this into the original equation and dropping the constants, we get\n\nlim_{\u03c4 \u2192 0} \u2016tan(R(\u02dcX), R(X))\u2016_2 \u2264 n^{(\u03b1\u22121)/d} \u00b7 n^{3\u03b1/2} / \u2113_min \u00b7 Constant.\n\nIf we conjecture that the relationship in (8) holds in general (i.e., the generating coordinates can follow a more general distribution rather than only lying on a uniform grid), then we have\n\nlim_{\u03c4 \u2192 0} \u2016tan(R(\u02dcX), R(X))\u2016_2 \u2264 n^{(\u03b1\u22121)/d} \u00b7 n^{\u03b1/2} \u00b7 n^{\u03b1} / (n^{5\u03b1} \u00b7 n^{4(\u03b1\u22121)/d}) \u00b7 Constant.\n\nNow the exponent is a function only of \u03b1 and the constant d. We can try to solve for \u03b1 such that the convergence is as fast as possible. Simplifying the exponents, we get\n\nlim_{\u03c4 \u2192 0} \u2016tan(R(\u02dcX), R(X))\u2016_2 \u2264 n^{\u22127\u03b1/2 \u2212 3(\u03b1\u22121)/d} \u00b7 Constant.\n\nAs a function of \u03b1 restricted to the interval [0, 1), there is no minimum: the exponent decreases with \u03b1, and we should choose \u03b1 close to 1.\n\nHowever, in the proof of the convergence of LTSA, it is assumed that the errors in the local step converge to 0. 
This error is given by\n\n\u2016sin(R(Bi^T), R(\u02dcBi^T))\u2016_2 \u2264 \u221a(kD) \u00b7 [\u03c3 + (1/2) C1 \u03c4^2] / (C2 \u00b7 \u03c4 \u2212 \u221a(kD) \u00b7 [\u03c3 + (1/2) C1 \u03c4^2]).\n\nThus, our choice of \u03b1 is restricted by the fact that the right-hand side of this equation must still converge to 0. Disregarding constants and writing this as a function of n, we get\n\nn^{\u03b1/2} \u00b7 n^{(2\u03b1\u22122)/d} / (n^{(\u03b1\u22121)/d} \u2212 n^{\u03b1/2} \u00b7 n^{(2\u03b1\u22122)/d}).\n\nThis quantity converges to 0 as n \u2192 \u221e if and only if we have\n\n\u03b1/2 + (2\u03b1 \u2212 2)/d < (\u03b1 \u2212 1)/d  \u21d4  \u03b1 < 2/(d + 2).\n\nNote that this bound is strictly less than 1 for all positive integers d, so our possible choices of \u03b1 are restricted further. By the reasoning above, we want the exponent to be as large as possible. Further, it is easy to see that for all d, choosing an exponent roughly equal to 2/(d + 2) will always yield a bound converging to 0. The following table gives the optimal exponents for selected values of d, along with the convergence rate of lim_{\u03c4 \u2192 0} \u2016tan(R(\u02dcX), R(X))\u2016_2. In general, using the optimal value of \u03b1, the convergence rate will be roughly n^{\u22124/(d+2)}.\n\nTable 1: Convergence rates for a few values of the underlying dimension d.\n\nd                  1      2     3     4      5\nOptimal \u03b1          0.66   0.5   0.4   0.33   0.29\nConvergence rate   \u22121.33  \u22121    \u22120.8  \u22120.66  \u22120.57\n\nThesis [7] presents some numerical experiments to illustrate the above results. Associated with each fixed value of k, there seems to be a threshold value of n, above which the performance degrades. This value increases with k, though perhaps at the cost of worse performance for small n. 
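The entries of Table 1 follow directly from the closed forms derived above, \u03b1 = 2/(d + 2) and rate exponent \u22124/(d + 2); a two-line check of the arithmetic (matching the table up to rounding):

```python
# Reproduce Table 1: optimal alpha = 2/(d+2) and convergence-rate
# exponent -4/(d+2), as derived above (table entries agree up to rounding).
for d in range(1, 6):
    alpha = 2.0 / (d + 2)
    rate = -4.0 / (d + 2)
    print(d, round(alpha, 2), round(rate, 2))
```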
However, we expect from the above analysis that, regardless of the value chosen, the performance will eventually become unacceptable for any fixed k.\n\n5 Discussion\n\nTo the best of our knowledge, the performance analysis based on invariant subspaces is new; consequently, the worst-case upper bound is the first of its kind. There are still open questions to be addressed (Section 5.1). In addition to discussing the relation of LTSA to existing dimension reduction methodologies, we also address the relation to known results (Section 5.2).\n\n5.1 Open Questions\n\nThe rate of convergence of \u2113_min is determined by the topological structure of f. It is important to estimate this rate of convergence, but this issue has not been addressed here. We did not address the correctness of (8) at all; it turns out that the proof of (8) is quite nontrivial and tedious.\n\nWe assume that \u03c4 \u2192 0. One can imagine that this holds when the error bound (\u03c3) goes to 0 and the xi\u2019s are sampled with sufficient density in the support of f. An open problem is how to derive the rate of convergence of \u03c4 \u2192 0 as a function of the topology of f and the sampling scheme. After doing so, we may be able to decide where our theorem is applicable.\n\n5.2 Relation to Existing Work\n\nThe error analysis in the original paper about LTSA is the closest to our result. However, Zhang and Zha [13] do not interpret their solutions as invariant subspaces, and hence their analysis does not yield a worst-case bound as we have derived here.\n\nReviewing the original papers on LLE [6], Laplacian eigenmaps [1], and Hessian eigenmaps [3] reveals that their solutions are subspaces spanned by a specific set of eigenvectors. This naturally suggests that results analogous to ours may be derivable for these algorithms as well. A recent book chapter [4] stresses this point. 
After deriving corresponding upper bounds, one could establish proofs of consistency different from those presented in those papers. ISOMAP, another popular manifold learning algorithm, is an exception: its solution cannot immediately be rendered as an invariant subspace. However, ISOMAP calls for MDS, which can be associated with an invariant subspace; one may derive an analytical result through this route.\n\n6 Conclusion\n\nWe derive an upper bound on the distance between two invariant subspaces: the one associated with the numerical output of LTSA and the one associated with an assumed intrinsic parametrization. Such a bound describes the performance of LTSA with errors in the observations, and thus creates a theoretical foundation for its use in real-world applications, in which we would naturally expect such errors to be present. Our results can also be used to show other desirable properties, including consistency and rate of convergence. Similar bounds may be derivable for other manifold-based learning algorithms.\n\nReferences\n\n[1] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373\u20131396, 2003.\n\n[2] M. Brand. Charting a manifold. In Neural Information Processing Systems, volume 15. MIT Press, March 2003.\n\n[3] D. L. Donoho and C. E. Grimes. Hessian eigenmaps: New locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100:5591\u20135596, 2003.\n\n[4] X. Huo, X. S. Ni, and A. K. Smith. A survey of manifold-based learning methods. Invited book chapter in Mining of Enterprise Data. Springer, New York, 2005.\n\n[5] X. Huo and A. K. Smith. Performance analysis of a manifold learning algorithm in dimension reduction. Technical report, Georgia Institute of Technology, March 2006. 
Downloadable at www2.isye.gatech.edu/statistics/papers/06-06.pdf; to appear in Linear Algebra and Its Applications.\n\n[6] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323\u20132326, 2000.\n\n[7] A. K. Smith. New results in dimension reduction and model selection. Ph.D. Thesis, available at http://etd.gatech.edu, 2008.\n\n[8] G. W. Stewart and J.-G. Sun. Matrix Perturbation Theory. Academic Press, Boston, MA, 1990.\n\n[9] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319\u20132323, 2000.\n\n[10] T. Wittman. MANIfold learning Matlab demo. URL: http://www.math.umn.edu/\u223cwittman/mani/index.html, April 2005.\n\n[11] H. Zha and H. Zhang. Spectral properties of the alignment matrices in manifold learning. SIAM Review, 2008.\n\n[12] H. Zha and Z. Zhang. Spectral analysis of alignment in manifold learning. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005.\n\n[13] Z. Zhang and H. Zha. Principal manifolds and nonlinear dimension reduction via local tangent space alignment. SIAM Journal on Scientific Computing, 26(1):313\u2013338, 2004.", "award": [], "sourceid": 3471, "authors": [{"given_name": "Andrew", "family_name": "Smith", "institution": null}, {"given_name": "Hongyuan", "family_name": "Zha", "institution": null}, {"given_name": "Xiao-ming", "family_name": "Wu", "institution": null}]}