{"title": "Nonlinear Learning using Local Coordinate Coding", "book": "Advances in Neural Information Processing Systems", "page_first": 2223, "page_last": 2231, "abstract": "This paper introduces a new method for semi-supervised learning on high dimensional nonlinear manifolds, which includes a phase of unsupervised basis learning and a phase of supervised function learning. The learned bases provide a set of anchor points to form a local coordinate system, such that each data point x on the manifold can be locally approximated by a linear combination of its nearby anchor points, and the linear weights become its local coordinate coding. We show that a high dimensional nonlinear function can be approximated by a global linear function with respect to this coding scheme, and the approximation quality is ensured by the locality of such coding. The method turns a difficult nonlinear learning problem into a simple global linear learning problem, which overcomes some drawbacks of traditional local learning methods.", "full_text": "Nonlinear Learning using Local Coordinate Coding\n\nKai Yu\n\nNEC Laboratories America\n\nkyu@sv.nec-labs.com\n\nTong Zhang\n\nRutgers University\n\ntzhang@stat.rutgers.edu\n\nYihong Gong\n\nNEC Laboratories America\nygong@sv.nec-labs.com\n\nAbstract\n\nThis paper introduces a new method for semi-supervised learning on high dimen-\nsional nonlinear manifolds, which includes a phase of unsupervised basis learning\nand a phase of supervised function learning. The learned bases provide a set of\nanchor points to form a local coordinate system, such that each data point x on\nthe manifold can be locally approximated by a linear combination of its nearby\nanchor points, and the linear weights become its local coordinate coding. We\nshow that a high dimensional nonlinear function can be approximated by a global\nlinear function with respect to this coding scheme, and the approximation quality\nis ensured by the locality of such coding. 
The method turns a difficult nonlinear learning problem into a simple global linear learning problem, which overcomes some drawbacks of traditional local learning methods.

1 Introduction

Consider the problem of learning a nonlinear function f(x) on a high dimensional space $x \in \mathbb{R}^d$. We are given a set of labeled data $(x_1, y_1), \ldots, (x_n, y_n)$ drawn from an unknown underlying distribution. Moreover, assume that we observe a set of unlabeled data $x \in \mathbb{R}^d$ from the same distribution. If the dimensionality d is large compared to n, then traditional statistical theory predicts overfitting due to the so-called "curse of dimensionality". One intuitive argument for this effect is that as the dimensionality grows, pairwise distances between similar data points grow as well, so more data points are needed to adequately fill the empty space. However, for many real problems with high dimensional data, we do not observe this curse of dimensionality. This is because although the data are physically represented in a high-dimensional space, they often lie on a manifold with a much smaller intrinsic dimensionality.

This paper proposes a new method that takes advantage of the manifold's geometric structure to learn a nonlinear function in high dimension. The main idea is to locally embed points on the manifold into a lower dimensional space, expressed as coordinates with respect to a set of anchor points. Our main observation is simple but very important: we show that a nonlinear function on the manifold can be effectively approximated by a linear function with such a coding under appropriate localization conditions. Therefore, using Local Coordinate Coding, we turn a very difficult high dimensional nonlinear learning problem into a much simpler linear learning problem, which has been extensively studied in the literature.
This idea may also be considered a high dimensional generalization of the low dimensional local smoothing methods in the traditional statistical literature.

2 Local Coordinate Coding

We are interested in learning a smooth function f(x) defined on a high dimensional space $\mathbb{R}^d$. Let $\|\cdot\|$ be a norm on $\mathbb{R}^d$. Although we do not restrict to any specific norm, in practice one often employs the Euclidean norm (2-norm): $\|x\| = \|x\|_2 = \sqrt{x_1^2 + \cdots + x_d^2}$.

Definition 2.1 (Lipschitz Smoothness) A function f(x) on $\mathbb{R}^d$ is $(\alpha, \beta, p)$-Lipschitz smooth with respect to a norm $\|\cdot\|$ if $|f(x') - f(x)| \le \alpha \|x - x'\|$ and $|f(x') - f(x) - \nabla f(x)^\top (x' - x)| \le \beta \|x - x'\|^{1+p}$, where we assume $\alpha, \beta > 0$ and $p \in (0, 1]$.

Note that if the Hessian of f(x) exists, then we may take p = 1. Learning an arbitrary Lipschitz smooth function on $\mathbb{R}^d$ can be difficult due to the curse of dimensionality: the number of samples required to characterize such a function f(x) can be exponential in d. However, in many practical applications, one often observes that the data of interest approximately lie on a manifold $\mathcal{M}$ embedded in $\mathbb{R}^d$. Although d is large, the intrinsic dimensionality of $\mathcal{M}$ can be much smaller. Therefore, if we are only interested in learning f(x) on $\mathcal{M}$, the complexity should depend on the intrinsic dimensionality of $\mathcal{M}$ instead of d. In this paper, we approach this problem by introducing the idea of localized coordinate coding. The formal definition of (non-localized) coordinate coding is given below, where we represent a point in $\mathbb{R}^d$ by a linear combination of a set of "anchor points".
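One simple way to obtain such sum-to-one weights (a minimal numerical sketch with made-up anchors, not the paper's learned coding) is to solve a constrained least-squares fit over a few nearby anchors via the standard Gram-matrix trick:

```python
import numpy as np

def local_affine_weights(x, anchors, reg=1e-3):
    """Affine (sum-to-one) reconstruction weights of x w.r.t. the given anchors.

    Solves min_w ||x - sum_j w_j v_j||^2 subject to sum_j w_j = 1, using the
    local Gram matrix of the shifted anchors plus a small regularizer for
    numerical stability.
    """
    Z = anchors - x                       # shift so that x is the origin
    G = Z @ Z.T                           # local Gram matrix
    G += reg * np.trace(G) * np.eye(len(anchors))
    w = np.linalg.solve(G, np.ones(len(anchors)))
    return w / w.sum()                    # enforce the sum-to-one constraint

# Toy example (hypothetical data): three anchors spanning the plane and a
# point inside their affine hull, so the weights reconstruct it almost exactly.
anchors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
x = np.array([0.2, 0.3])
w = local_affine_weights(x, anchors)
x_hat = w @ anchors                       # the physical approximation gamma(x)
```

Here `local_affine_weights` is a hypothetical helper name; the regularizer `reg` is a sketch-level stability choice, not a parameter from the paper.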
Later we show it is sufficient to choose a set of "anchor points" with cardinality depending on the intrinsic dimensionality of the manifold rather than d.

Definition 2.2 (Coordinate Coding) A coordinate coding is a pair $(\gamma, C)$, where $C \subset \mathbb{R}^d$ is a set of anchor points, and $\gamma$ is a map of $x \in \mathbb{R}^d$ to $[\gamma_v(x)]_{v \in C} \in \mathbb{R}^{|C|}$ such that $\sum_v \gamma_v(x) = 1$. It induces the following physical approximation of x in $\mathbb{R}^d$: $\gamma(x) = \sum_{v \in C} \gamma_v(x)\, v$. Moreover, for all $x \in \mathbb{R}^d$, we define the corresponding coding norm as $\|x\|_\gamma = \big( \sum_{v \in C} \gamma_v(x)^2 \big)^{1/2}$.

The quantity $\|x\|_\gamma$ will become useful in our learning theory analysis. The condition $\sum_v \gamma_v(x) = 1$ follows from the shift-invariance requirement, which means that the coding should remain the same if we use a different origin of the $\mathbb{R}^d$ coordinate system for representing data points. It can be shown (see the appendix file accompanying the submission) that the map $x \mapsto \sum_{v \in C} \gamma_v(x)\, v$ is invariant under any shift of the origin for representing data points in $\mathbb{R}^d$ if and only if $\sum_v \gamma_v(x) = 1$. The importance of the coordinate coding concept is that if a coordinate coding is sufficiently localized, then a nonlinear function can be approximated by a linear function with respect to the coding. This critical observation, illustrated in the following linearization lemma, is the foundation of our approach. Due to the space limitation, all proofs are left to the appendix that accompanies the submission.

Lemma 2.1 (Linearization) Let $(\gamma, C)$ be an arbitrary coordinate coding on $\mathbb{R}^d$. Let f be an $(\alpha, \beta, p)$-Lipschitz smooth function.
We have for all $x \in \mathbb{R}^d$:

$$\Big| f(x) - \sum_{v \in C} \gamma_v(x) f(v) \Big| \le \alpha \|x - \gamma(x)\| + \beta \sum_{v \in C} |\gamma_v(x)|\, \|v - \gamma(x)\|^{1+p}.$$

To understand this result, we note that on the left hand side, a nonlinear function f(x) in $\mathbb{R}^d$ is approximated by a linear function $\sum_{v \in C} \gamma_v(x) f(v)$ with respect to the coding $\gamma(x)$, where $[f(v)]_{v \in C}$ is the set of coefficients to be estimated from data. The quality of this approximation is bounded by the right hand side, which has two terms: the first term $\|x - \gamma(x)\|$ means x should be close to its physical approximation $\gamma(x)$, and the second term means that the coding should be localized. The quality of a coding $\gamma$ with respect to C can be measured by the right hand side. For convenience, we introduce the following definition, which measures the locality of a coding.

Definition 2.3 (Localization Measure) Given $\alpha$, $\beta$, p, and coding $(\gamma, C)$, we define

$$Q_{\alpha,\beta,p}(\gamma, C) = \mathbb{E}_x \Big[ \alpha \|x - \gamma(x)\| + \beta \sum_{v \in C} |\gamma_v(x)|\, \|v - \gamma(x)\|^{1+p} \Big].$$

Observe that in $Q_{\alpha,\beta,p}$, the quantities $\alpha$, $\beta$, p may be regarded as tuning parameters; we may also simply pick $\alpha = \beta = p = 1$. Since the quality function $Q_{\alpha,\beta,p}(\gamma, C)$ only depends on unlabeled data, in principle we can find $(\gamma, C)$ by optimizing this quality using unlabeled data. Later, we will consider simplifications of this objective function that are easier to compute. Next we show that if the data lie on a manifold, then the complexity of local coordinate coding depends on the intrinsic manifold dimensionality instead of d.
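The localization measure can be estimated from a sample. Below is a small Monte-Carlo sketch, evaluated on a hypothetical vector-quantization coding (all weight on the nearest anchor, which trivially sums to one); the data and anchors are made up:

```python
import numpy as np

def localization_measure(X, anchors, Gamma, alpha=1.0, beta=1.0, p=1.0):
    """Monte-Carlo estimate of Q_{alpha,beta,p}(gamma, C) over the sample X.

    X:       (n, d) data points
    anchors: (k, d) anchor set C
    Gamma:   (n, k) coding weights, each row summing to one
    """
    approx = Gamma @ anchors                       # gamma(x) for every x
    fit = np.linalg.norm(X - approx, axis=1)       # ||x - gamma(x)||
    # ||v - gamma(x)||^{1+p} for every (x, v) pair
    dists = np.linalg.norm(anchors[None, :, :] - approx[:, None, :], axis=2)
    locality = np.sum(np.abs(Gamma) * dists ** (1 + p), axis=1)
    return np.mean(alpha * fit + beta * locality)

# Hypothetical example: hard vector-quantization coding.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
anchors = rng.normal(size=(16, 3))
nearest = np.argmin(
    np.linalg.norm(X[:, None, :] - anchors[None, :, :], axis=2), axis=1)
Gamma = np.eye(16)[nearest]                        # weight 1 on nearest anchor
Q = localization_measure(X, anchors, Gamma)
```

For this hard-assignment coding the locality term vanishes (each nonzero weight sits exactly at its own anchor), so Q reduces to the average distance to the nearest anchor.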
We first define a manifold and its intrinsic dimensionality.

Definition 2.4 (Manifold) A subset $\mathcal{M} \subset \mathbb{R}^d$ is called a p-smooth (p > 0) manifold with intrinsic dimensionality $m = m(\mathcal{M})$ if there exists a constant $c_p(\mathcal{M})$ such that given any $x \in \mathcal{M}$, there exist m vectors $v_1(x), \ldots, v_m(x) \in \mathbb{R}^d$ so that for all $x' \in \mathcal{M}$:
$$\inf_{\gamma \in \mathbb{R}^m} \Big\| x' - x - \sum_{j=1}^m \gamma_j v_j(x) \Big\| \le c_p(\mathcal{M})\, \|x' - x\|^{1+p}.$$
This definition is quite intuitive. The smooth manifold structure implies that one can approximate a point in $\mathcal{M}$ effectively using local coordinate coding. Note that for a typical manifold with well-defined curvature, we can take p = 1.

Definition 2.5 (Covering Number) Given any subset $\mathcal{M} \subset \mathbb{R}^d$ and $\epsilon > 0$, the covering number, denoted $N(\epsilon, \mathcal{M})$, is the smallest cardinality of an $\epsilon$-cover $C \subset \mathcal{M}$; that is, $\sup_{x \in \mathcal{M}} \inf_{v \in C} \|x - v\| \le \epsilon$.

For a compact manifold with intrinsic dimensionality m, there exists a constant $c(\mathcal{M})$ such that its covering number is bounded by $N(\epsilon, \mathcal{M}) \le c(\mathcal{M})\, \epsilon^{-m}$. The following result shows that there exists a local coordinate coding to a set of anchor points C of cardinality $O(m(\mathcal{M}) N(\epsilon, \mathcal{M}))$ such that any $(\alpha, \beta, p)$-Lipschitz smooth function can be linearly approximated using local coordinate coding up to accuracy $O(\sqrt{m(\mathcal{M})}\, \epsilon^{1+p})$.

Theorem 2.1 (Manifold Coding) If the data points x lie on a compact p-smooth manifold $\mathcal{M}$, and the norm is defined as $\|x\| = (x^\top A x)^{1/2}$ for some positive definite matrix A, then given any $\epsilon > 0$, there exist anchor points $C \subset \mathcal{M}$ and coding $\gamma$ such that
$$|C| \le (1 + m(\mathcal{M}))\, N(\epsilon, \mathcal{M}), \qquad Q_{\alpha,\beta,p}(\gamma, C) \le \big[ \alpha c_p(\mathcal{M}) + \big(1 + \sqrt{m(\mathcal{M})} + 2^{1+p} \sqrt{m(\mathcal{M})}\big) \beta \big]\, \epsilon^{1+p}.$$
Moreover, for all $x \in \mathcal{M}$, we have $\|x\|_\gamma^2 \le 1 + \big(1 + \sqrt{m(\mathcal{M})}\big)^2$.

The approximation result in Theorem 2.1 means that the complexity of linearization in Lemma 2.1 depends only on the intrinsic dimension $m(\mathcal{M})$ of $\mathcal{M}$ instead of d. Although this result is proved for manifolds, it is important to observe that the coordinate coding method proposed in this paper does not require the data to lie precisely on a manifold, and it does not require knowing $m(\mathcal{M})$. In fact, similar results hold even when the data only approximately lie on a manifold.

In the next section, we characterize the learning complexity of the local coordinate coding method. It implies that linear prediction methods can be used to effectively learn nonlinear functions on a manifold. The nonlinearity is fully captured by the coordinate coding map $\gamma$ (which can be a nonlinear function). This approach has clear advantages, because the problem of finding a local coordinate coding is much simpler than direct nonlinear learning:

• Learning $(\gamma, C)$ only requires unlabeled data, and the number of unlabeled data points can be significantly larger than the number of labeled data points. This step also prevents overfitting with respect to the labeled data.

• In practice, we do not have to find the optimal coding, because the coordinates are merely features for linear supervised learning. This significantly simplifies the optimization problem. Consequently, the approach is more robust than some standard approaches to nonlinear learning that directly optimize nonlinear functions on labeled data (e.g., neural networks).

3 Learning Theory

In machine learning, we minimize the expected loss $\mathbb{E}_{x,y}\, \phi(f(x), y)$ with respect to the underlying distribution within a function class $f(x) \in \mathcal{F}$.
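Since the coordinates serve only as features for linear supervised learning, the supervised step is ordinary regularized linear learning over the codes. For squared loss and zero prior mean this is ridge regression with a closed-form solution; a minimal sketch on hypothetical data (not the paper's experiments) is:

```python
import numpy as np

def fit_ridge_on_codes(Gamma, y, lam=1e-2):
    """Closed-form ridge regression over coordinate-coding features.

    Minimizes sum_i (w . gamma(x_i) - y_i)^2 + lam * ||w||^2, i.e. the
    squared-loss case with zero prior mean for the weights.
    """
    k = Gamma.shape[1]
    return np.linalg.solve(Gamma.T @ Gamma + lam * np.eye(k), Gamma.T @ y)

# Hypothetical data: codes of 100 points over 8 anchors, rows summing to one
# as in the coordinate coding definition, with an exactly linear target.
rng = np.random.default_rng(1)
Gamma = rng.uniform(size=(100, 8))
Gamma /= Gamma.sum(axis=1, keepdims=True)
w_true = rng.normal(size=8)
y = Gamma @ w_true
w_hat = fit_ridge_on_codes(Gamma, y, lam=1e-8)
```

With a negligible ridge penalty and a realizable target, the estimate recovers the generating weights; in practice `lam` trades variance against the prior-mean term.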
In this paper, we are interested in the function class
$$\mathcal{F}_{\alpha,\beta,p} = \{ f(x) : f \text{ is } (\alpha, \beta, p)\text{-Lipschitz smooth on } \mathbb{R}^d \}.$$
The local coordinate coding method considers a linear approximation of functions in $\mathcal{F}_{\alpha,\beta,p}$ on the data manifold. Given a local coordinate coding scheme $(\gamma, C)$, we approximate each $f(x) \in \mathcal{F}_{\alpha,\beta,p}$ by $f(x) \approx f_{\gamma,C}(\hat w, x) = \sum_{v \in C} \hat w_v \gamma_v(x)$, where we estimate the coefficients using ridge regression:

$$[\hat w_v] = \arg\min_{[w_v]} \Big[ \sum_{i=1}^n \phi(f_{\gamma,C}(w, x_i), y_i) + \lambda \sum_{v \in C} (w_v - g(v))^2 \Big], \qquad (1)$$

where g(v) is an arbitrary function assumed to be pre-fixed. In the Bayesian interpretation, g(v) can be regarded as the prior mean for the weights $[w_v]_{v \in C}$; the default value is simply $g(v) \equiv 0$. Given a loss function $\phi(p, y)$, let $\phi_1'(p, y) = \partial \phi(p, y) / \partial p$. For simplicity, in this paper we only consider convex Lipschitz loss functions, where $|\phi_1'(p, y)| \le B$. This includes the standard classification loss functions such as logistic regression and the SVM (hinge loss), both with B = 1.

Theorem 3.1 (Generalization Bound) Suppose $\phi(p, y)$ is Lipschitz: $|\phi_1'(p, y)| \le B$. Consider coordinate coding $(\gamma, C)$ and the estimation method (1) with random training examples $S_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$.
Then the expected generalization error satisfies the inequality:

$$\mathbb{E}_{S_n} \mathbb{E}_{x,y}\, \phi(f_{\gamma,C}(\hat w, x), y) \le \inf_{f \in \mathcal{F}_{\alpha,\beta,p}} \Big[ \mathbb{E}_{x,y}\, \phi(f(x), y) + \lambda \sum_{v \in C} (f(v) - g(v))^2 \Big] + \frac{B^2}{2 \lambda n}\, \mathbb{E}_x \|x\|_\gamma^2 + B\, Q_{\alpha,\beta,p}(\gamma, C).$$

We may choose the regularization parameter $\lambda$ that optimizes the bound in Theorem 3.1. Moreover, if we pick $g(v) \equiv 0$ and find $(\gamma, C)$ at some $\epsilon > 0$, then Theorem 2.1 implies the following simplified generalization bound for any $f \in \mathcal{F}_{\alpha,\beta,p}$ such that $|f(x)| = O(1)$: $\mathbb{E}_{x,y}\, \phi(f(x), y) + O\big( \sqrt{\epsilon^{-m(\mathcal{M})}/n} + \epsilon^{1+p} \big)$. By optimizing over $\epsilon$, we obtain a bound $\mathbb{E}_{x,y}\, \phi(f(x), y) + O\big( n^{-(1+p)/(2+2p+m(\mathcal{M}))} \big)$.

By combining Theorem 2.1 and Theorem 3.1, we immediately obtain the following simple consistency result. It shows that the algorithm can learn an arbitrary nonlinear function on the manifold as $n \to \infty$. Note that Theorem 2.1 implies that the convergence depends only on the intrinsic dimensionality of the manifold $\mathcal{M}$, not d.

Theorem 3.2 (Consistency) Suppose the data lie on a compact manifold $\mathcal{M} \subset \mathbb{R}^d$, the norm $\|\cdot\|$ is the Euclidean norm in $\mathbb{R}^d$, and the loss function $\phi(p, y)$ is Lipschitz. As $n \to \infty$, choose $\alpha, \beta \to \infty$ with $\alpha/n, \beta/n \to 0$ ($\alpha$ and $\beta$ depend on n), and p = 0. Then it is possible to find a coding $(\gamma, C)$ using unlabeled data such that $|C|/n \to 0$ and $Q_{\alpha,\beta,p}(\gamma, C) \to 0$. Suppose further that we pick $\lambda$ with $\lambda n \to \infty$ and $\lambda |C| \to 0$.
Then the local coordinate coding method (1) with $g(v) \equiv 0$ is consistent as $n \to \infty$:
$$\lim_{n \to \infty} \mathbb{E}_{S_n} \mathbb{E}_{x,y}\, \phi(f_{\gamma,C}(\hat w, x), y) = \inf_{f : \mathcal{M} \to \mathbb{R}} \mathbb{E}_{x,y}\, \phi(f(x), y).$$

4 Practical Learning of Coding

Given a coordinate coding $(\gamma, C)$, we can use (1) to learn a nonlinear function in $\mathbb{R}^d$. We showed that $(\gamma, C)$ can be obtained by optimizing $Q_{\alpha,\beta,p}(\gamma, C)$. In practice, we may also consider the following simplification of the localization term:
$$\sum_{v \in C} |\gamma_v(x)|\, \|v - \gamma(x)\|^{1+p} \approx \sum_{v \in C} |\gamma_v(x)|\, \|v - x\|^{1+p}.$$
Note that we may simply choose p = 0 or p = 1. The formulation is related to sparse coding [6], which has no locality constraints and corresponds to p = -1. In this representation, we may either enforce the constraint $\sum_v \gamma_v(x) = 1$ or, for simplicity, remove it, because the formulation is already shift-invariant. Putting the above together, in practice we optimize the following objective function:
$$Q(\gamma, C) = \mathbb{E}_x \inf_{[\gamma_v]} \Big[ \big\| x - \sum_{v \in C} \gamma_v v \big\|^2 + \mu \sum_{v \in C} |\gamma_v|\, \|v - x\|^{1+p} \Big].$$
We update C and $\gamma$ via alternating optimization. The step of updating $\gamma$ can be transformed into a canonical LASSO problem, for which efficient algorithms exist. The step of updating C is a least-squares problem in the case p = 1.

5 Relationship to Other Methods

Our work is related to several existing approaches in the literature of machine learning and statistics. The first class of them is nonlinear manifold learning, such as LLE [8], Isomap [9], and Laplacian Eigenmaps [1]. These methods find global coordinates of the data manifold based on a pre-computed affinity graph of data points.
The use of affinity graphs requires expensive computation and lacks a coherent way to generalize to new data. Our method learns a compact set of bases to form local coordinates, which has linear complexity with respect to data size and naturally handles unseen data. More importantly, local coordinate coding has a direct connection to nonlinear function approximation on the manifold, and thus provides a theoretically sound unsupervised pre-training method to facilitate further supervised learning tasks.

Another set of related models are local models in statistics, such as local kernel smoothing and local regression, e.g., [4, 2], both traditionally using fixed-bandwidth kernels. Local kernel smoothing can be regarded as a zero-order method, while local regression is higher-order, with local linear regression as the first-order case. Traditional local methods are not widely used in machine learning practice, because data with a non-uniform distribution on the manifold require adaptive-bandwidth kernels. The problem can be somewhat alleviated by using K-nearest neighbors; however, adaptive kernel smoothing still suffers from the high dimensionality and noise of the data. On the other hand, higher-order methods are computationally expensive and prone to overfitting, because they are highly flexible in locally fitting many segments of data in a high-dimensional space. Our method can be seen as a generalized first-order local method with basis learning and adaptive locality. Compared to local linear regression, learning is achieved by fitting a single globally linear function with respect to a set of learned local coordinates, which is much less prone to overfitting and computationally much cheaper. This means that our method achieves a better balance between the local and global aspects of learning.
The importance of such balance has been recently discussed in [10].

Finally, local coordinate coding draws connections to vector quantization (VQ) coding, e.g., [3], and sparse coding, which have been widely applied in processing sensory data such as acoustic and image signals. Learning linear functions of VQ codes can be regarded as a generalized zero-order local method with basis learning. Our method has an intimate relationship with sparse coding: in fact, we can regard local coordinate coding as locally constrained sparse coding. Inspired by biological visual systems, people have argued that sparse features of signals are useful for learning [7]. However, to the best of our knowledge, there is no analysis in the literature that directly answers the question of why sparse codes can help learning nonlinear functions in high dimensional space. Our work reveals an important finding: a good first-order approximation to a nonlinear function requires the codes to be local, which in turn requires the codes to be sparse. However, sparsity does not always guarantee locality. Our experiments demonstrate that sparse coding is helpful for learning only when the codes are local. Therefore locality is the essential property of a coding, and sparsity is a consequence of this condition.

6 Experiments

Due to the space limitation, we only include two examples, one synthetic and one real, to illustrate various aspects of our theoretical results. We note that image classification based on LCC recently achieved state-of-the-art performance in the PASCAL Visual Object Classes Challenge 2009.¹

6.1 Synthetic Data

Our first example is based on a synthetic data set, where a nonlinear function is defined on a Swiss-roll manifold, as shown in Figure 1-(1).
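A Swiss-roll sample of this kind can be generated in a few lines; the specific target function below is a hypothetical stand-in for the one shown in Figure 1, not the paper's exact choice:

```python
import numpy as np

def swiss_roll(n, rng):
    """Sample n points on a Swiss-roll manifold embedded in R^3.

    The intrinsic coordinates are the roll angle t and the height h; the
    target y is a hypothetical smooth function of these two coordinates.
    """
    t = 1.5 * np.pi * (1 + 2 * rng.random(n))   # roll parameter
    h = 21.0 * rng.random(n)                    # height along the roll axis
    X = np.column_stack([t * np.cos(t), h, t * np.sin(t)])
    y = np.sin(t) + 0.1 * h                     # hypothetical target function
    return X, y

rng = np.random.default_rng(0)
X, y = swiss_roll(1000, rng)
```

Although the ambient dimension is 3, the sample has only two intrinsic coordinates (t, h), which is the situation the theory addresses.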
The primary goal is to demonstrate the performance of nonlinear function learning using simple linear ridge regression based on representations obtained from traditional sparse coding and the newly suggested local coordinate coding. The local coordinate coding representation is computed by solving

$$\min_{\gamma, C} \sum_x \Big[ \frac{1}{2} \|x - \gamma(x)\|^2 + \mu \sum_{v \in C} |\gamma_v(x)|\, \|v - x\|^2 \Big] + \lambda \sum_{v \in C} \|v\|^2, \qquad (2)$$

where $\gamma(x) = \sum_{v \in C} \gamma_v(x)\, v$; the sparse coding representation is obtained analogously, except that the locality weights $\|v - x\|^2$ are dropped from the $\ell_1$ penalty. We note that (2) is an approximation to the original formulation, mainly for the simplicity of computation.

¹ http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2009/workshop/index.html

Figure 1: Experiments of nonlinear regression on the Swiss roll: (1) a nonlinear function on the Swiss-roll manifold, where the color indicates function values; (2) sparse coding with fixed random anchor points (RMSE = 4.394); (3) local coordinate coding with fixed random anchor points (RMSE = 0.499); (4) sparse coding (RMSE = 4.661); (5) local coordinate coding (RMSE = 0.201); (6) local kernel smoothing (RMSE = 0.109); (7) local coordinate coding on noisy data (RMSE = 0.669); (8) local kernel smoothing on noisy data (RMSE = 1.170).

We randomly sample 50,000 data points on the manifold for unsupervised basis learning, and 500 labeled points for supervised regression. The number of bases is fixed to be 128. The learned nonlinear functions are tested on another set of 10,000 data points, with their performance evaluated by root mean square error (RMSE). In the first setting, we let both coding methods use the same set of fixed bases, which are 128 points randomly sampled from the manifold.
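For fixed anchors, the coding step of the local coordinate coding objective is a LASSO-type problem with per-coordinate locality penalties. A single-point coordinate-descent sketch (hypothetical anchors, p = 1; the paper's solver codes all points and also updates the anchors) is:

```python
import numpy as np

def soft(z, t):
    """Scalar soft-thresholding operator."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lcc_code(x, V, mu=0.1, iters=200):
    """Coding step for one point x: minimize
    0.5 * ||x - V.T @ g||^2 + mu * sum_j d_j * |g_j|,
    with locality weights d_j = ||v_j - x||^2, by coordinate descent."""
    k = V.shape[0]
    d = np.sum((V - x) ** 2, axis=1)        # locality weights
    g = np.zeros(k)
    r = x.copy()                            # residual x - V.T @ g
    for _ in range(iters):
        for j in range(k):
            r += g[j] * V[j]                # remove coordinate j's contribution
            rho = V[j] @ r
            g[j] = soft(rho, mu * d[j]) / (V[j] @ V[j])
            r -= g[j] * V[j]
    return g

# Hypothetical anchors: two near the query point, one far away.
V = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
x = np.array([0.9, 0.1])
g = lcc_code(x, V, mu=0.05)
```

The distant anchor carries a large penalty weight and receives a zero coefficient, so the code is automatically local; a fixed iteration count stands in for a proper convergence check.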
The regression results are shown in Figure 1-(2) and (3), respectively. The sparse coding based approach fails to capture the nonlinear function, while local coordinate coding behaves much better. We take a closer look at the data representations obtained from the two encoding methods by visualizing, in Figure 2, the distributions of distances from encoded data to bases that have positive, negative, or zero coefficients. It shows that sparse coding lets bases far away from the encoded data have nonzero coefficients, while local coordinate coding allows only nearby bases to have nonzero coefficients. In other words, sparse coding on this data does not ensure good locality and thus fails to facilitate nonlinear function learning. As another interesting phenomenon, local coordinate coding seems to encourage coefficients to be nonnegative, which is intuitively understandable: if we use several bases close to a data point to linearly approximate the point, each basis should have a positive contribution. However, whether there is any merit in explicitly enforcing nonnegativity remains interesting future work.

In the next two experiments, given the random bases as a common initialization, we let the two algorithms learn bases from the 50,000 unlabeled data points. The regression results based on the learned bases are depicted in Figure 1-(4) and (5), which indicate that the regression error is further reduced for local coordinate coding, but remains high for sparse coding. We also make a comparison with local kernel smoothing, which takes a weighted average of the function values of the K nearest training points to make a prediction.
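This baseline can be sketched in a few lines (a generic inverse-distance weighting on hypothetical 1-D data; the actual experiment uses the Swiss roll and the paper does not specify its kernel):

```python
import numpy as np

def knn_kernel_smooth(Xtr, ytr, Xte, k=5):
    """Local kernel smoothing baseline: predict at each test point by a
    distance-weighted average of the function values of its k nearest
    training points (a zero-order local method)."""
    preds = np.empty(len(Xte))
    for i, x in enumerate(Xte):
        d = np.linalg.norm(Xtr - x, axis=1)
        nn = np.argsort(d)[:k]
        w = 1.0 / (d[nn] + 1e-12)            # inverse-distance weights
        preds[i] = np.dot(w, ytr[nn]) / w.sum()
    return preds

# Hypothetical 1-D sanity check: smoothing recovers a smooth function.
rng = np.random.default_rng(0)
Xtr = rng.uniform(0, 2 * np.pi, size=(400, 1))
ytr = np.sin(Xtr[:, 0])
Xte = np.linspace(0.5, 5.5, 50)[:, None]
pred = knn_kernel_smooth(Xtr, ytr, Xte, k=5)
```

In low dimension the neighborhoods are dense and the zero-order average is accurate, which matches the Figure 1-(6) outcome; the method degrades once noisy dimensions are added.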
As shown in Figure 1-(6), the method works very well on this simple low-dimensional data, even outperforming the local coordinate coding approach. However, if we increase the data dimensionality to 256 by adding 253-dimensional independent Gaussian noise with zero mean and unit variance, local coordinate coding becomes superior to local kernel smoothing, as shown in Figure 1-(7) and (8). This is consistent with our theory, which suggests that local coordinate coding can work well in high dimension; on the other hand, local kernel smoothing is known to suffer from high dimensionality and noise.

Figure 2: Coding locality on the Swiss roll: (a) sparse coding vs. (b) local coordinate coding.

6.2 Handwritten Digit Recognition

Our second example is based on the MNIST handwritten digit recognition benchmark, where each data point is a 28 × 28 gray image, pre-normalized into a unit-length 784-dimensional vector. In our setting, the set C of anchor points is obtained from sparse coding, with the regularization on v replaced by the inequality constraints $\|v\| \le 1$. Our focus here is not on anchor point learning, but rather on checking whether a good nonlinear classifier can be obtained if we enforce sparsity and locality in the data representation and then apply simple one-against-all linear SVMs.

Since the optimization cost of sparse coding is invariant under flipping the sign of v, we take a postprocessing step that changes the sign of v if the corresponding $\gamma_v(x)$ is negative for most x. This rectification ensures the anchor points lie on the data manifold. With the obtained C, for each data point x we solve the local coordinate coding problem (2), optimizing $\gamma$ only, to obtain the representation $[\gamma_v(x)]_{v \in C}$. In the experiments we try different sizes of bases. The classification error rates are provided in Table 1.
In addition, we also compare with a linear classifier on raw images, local kernel smoothing based on K-nearest neighbors, and linear classifiers using representations obtained from various unsupervised learning methods, including an autoencoder based on deep belief networks [5], Laplacian eigenmaps [1], locally linear embedding (LLE) [8], and VQ coding based on K-means. We note that, like most other manifold learning approaches, Laplacian eigenmaps and LLE are transductive methods which have to incorporate both training and testing data in training. The comparison results are summarized in Table 2. Both sparse coding and local coordinate coding perform quite well on this nonlinear classification task, significantly outperforming linear classifiers on raw images. In addition, local coordinate coding is consistently better than sparse coding across the various basis sizes. We further check the locality of both representations in Figure 3, where the basis number is 512, and find that sparse coding on this data set happens to be quite local, unlike the case of the Swiss-roll data: here only a small portion of nonzero coefficients (again mostly negative) are assigned to bases whose distances to the encoded data exceed the average of basis-to-datum distances. This locality explains why sparse coding works well on MNIST data. On the other hand, local coordinate coding is able to remove the unusual coefficients and further improve locality. Among the compared methods in Table 2, we note that the 1.2% error rate of the deep belief network reported in [5] was obtained via unsupervised pre-training followed by supervised backpropagation.
The error rate based on unsupervised training of deep belief networks is about 1.90%.² Therefore our result is competitive with the state-of-the-art results that are based on unsupervised feature learning plus linear classification, without using additional image geometric information.

² This is obtained via a personal communication with Ruslan Salakhutdinov at the University of Toronto.

Figure 3: Coding locality on MNIST: (a) sparse coding vs. (b) local coordinate coding.

Table 1: Error rates (%) of MNIST classification with different |C|.

  |C|                                         512    1024   2048   4096
  Linear SVM with sparse coding               2.96   2.64   2.16   2.02
  Linear SVM with local coordinate coding     2.64   2.44   2.08   1.90

Table 2: Error rates (%) of MNIST classification with different methods.

  Methods                                      Error Rate
  Linear SVM with raw images                   12.0
  Linear SVM with VQ coding                    3.98
  Local kernel smoothing                       3.48
  Linear SVM with Laplacian eigenmap           2.73
  Linear SVM with LLE                          2.38
  Linear classifier with deep belief network   1.90
  Linear SVM with sparse coding                2.02
  Linear SVM with local coordinate coding      1.90

7 Conclusion

This paper introduces a new method for high dimensional nonlinear learning with data distributed on manifolds. The method can be seen as generalized local linear function approximation, but is achieved by learning a global linear function with respect to coordinates obtained from unsupervised local coordinate coding. Compared to popular manifold learning methods, our approach naturally handles unseen data and has linear complexity with respect to data size. The work also generalizes the popular VQ coding and sparse coding schemes, and reveals that locality of the coding is essential for supervised function learning. The generalization performance depends on the intrinsic dimensionality of the data manifold.
The experiments on synthetic and handwritten digit data further confirm the findings of our analysis.

References

[1] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373-1396, 2003.
[2] Leon Bottou and Vladimir Vapnik. Local learning algorithms. Neural Computation, 4:888-900, 1992.
[3] Robert M. Gray and David L. Neuhoff. Quantization. IEEE Transactions on Information Theory, pages 2325-2383, 1998.
[4] Trevor Hastie and Clive Loader. Local regression: Automatic kernel carpentry. Statistical Science, 8:139-143, 1993.
[5] Geoffrey E. Hinton and Ruslan R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313:504-507, 2006.
[6] Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng. Efficient sparse coding algorithms. Neural Information Processing Systems (NIPS) 19, 2007.
[7] Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y. Ng. Self-taught learning: Transfer learning from unlabeled data. International Conference on Machine Learning, 2007.
[8] Sam Roweis and Lawrence Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323-2326, 2000.
[9] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319-2323, 2000.
[10] Alon Zakai and Ya'acov Ritov. Consistency and localizability. Journal of Machine Learning Research, 10:827-856, 2009.
", "award": [], "sourceid": 719, "authors": [{"given_name": "Kai", "family_name": "Yu", "institution": null}, {"given_name": "Tong", "family_name": "Zhang", "institution": null}, {"given_name": "Yihong", "family_name": "Gong", "institution": null}]}