{"title": "A fast, universal algorithm to learn parametric nonlinear embeddings", "book": "Advances in Neural Information Processing Systems", "page_first": 253, "page_last": 261, "abstract": "Nonlinear embedding algorithms such as stochastic neighbor embedding do dimensionality reduction by optimizing an objective function involving similarities between pairs of input patterns. The result is a low-dimensional projection of each input pattern. A common way to define an out-of-sample mapping is to optimize the objective directly over a parametric mapping of the inputs, such as a neural net. This can be done using the chain rule and a nonlinear optimizer, but is very slow, because the objective involves a quadratic number of terms each dependent on the entire mapping's parameters. Using the method of auxiliary coordinates, we derive a training algorithm that works by alternating steps that train an auxiliary embedding with steps that train the mapping. This has two advantages: 1) The algorithm is universal in that a specific learning algorithm for any choice of embedding and mapping can be constructed by simply reusing existing algorithms for the embedding and for the mapping. A user can then try possible mappings and embeddings with less effort. 2) The algorithm is fast, and it can reuse N-body methods developed for nonlinear embeddings, yielding linear-time iterations.", "full_text": "A Fast, Universal Algorithm\n\nto Learn Parametric Nonlinear Embeddings\n\nMiguel \u00b4A. Carreira-Perpi \u02dcn\u00b4an\n\nMax Vladymyrov\n\nEECS, University of California, Merced\n\nUC Merced and Yahoo Labs\n\nhttp://eecs.ucmerced.edu\n\nmaxv@yahoo-inc.com\n\nAbstract\n\nNonlinear embedding algorithms such as stochastic neighbor embedding do di-\nmensionality reduction by optimizing an objective function involving similarities\nbetween pairs of input patterns. The result is a low-dimensional projection of each\ninput pattern. 
A common way to define an out-of-sample mapping is to optimize the objective directly over a parametric mapping of the inputs, such as a neural net. This can be done using the chain rule and a nonlinear optimizer, but is very slow, because the objective involves a quadratic number of terms each dependent on the entire mapping's parameters. Using the method of auxiliary coordinates, we derive a training algorithm that works by alternating steps that train an auxiliary embedding with steps that train the mapping. This has two advantages: 1) The algorithm is universal in that a specific learning algorithm for any choice of embedding and mapping can be constructed by simply reusing existing algorithms for the embedding and for the mapping. A user can then try possible mappings and embeddings with less effort. 2) The algorithm is fast, and it can reuse N-body methods developed for nonlinear embeddings, yielding linear-time iterations.

1 Introduction

Given a high-dimensional dataset Y_{D×N} = (y_1, ..., y_N) of N points in R^D, nonlinear embedding algorithms seek to find low-dimensional projections X_{L×N} = (x_1, ..., x_N) with L < D by optimizing an objective function E(X) constructed using an N × N matrix of similarities W = (w_nm) between pairs of input patterns (y_n, y_m). For example, the elastic embedding (EE) [5] optimizes:

  E(X) = ∑_{n,m=1}^N w_nm ‖x_n − x_m‖² + λ ∑_{n,m=1}^N exp(−‖x_n − x_m‖²),  λ > 0.  (1)

Here, the first term encourages projecting similar patterns near each other, while the second term repels all pairs of projections. Other algorithms of this type are stochastic neighbor embedding (SNE) [15], t-SNE [27], the neighbor retrieval visualizer (NeRV) [28] and the Sammon mapping [23], as well as spectral methods such as metric multidimensional scaling and Laplacian eigenmaps [2] (though our focus is on nonlinear objectives).
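As a concrete reference point, the EE objective of eq. (1) can be evaluated directly with vectorized pairwise distances (a minimal NumPy sketch; the function name and vectorization are ours, not the authors' released code):

```python
import numpy as np

def elastic_embedding_objective(X, W, lam):
    """EE objective of eq. (1).

    X   : L x N array of low-dimensional projections (one column per point).
    W   : N x N symmetric nonnegative affinity matrix.
    lam : trade-off weight lambda > 0.
    """
    # Pairwise squared Euclidean distances between columns of X.
    sq = np.sum(X**2, axis=0)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    np.maximum(D2, 0.0, out=D2)          # clip tiny negatives from roundoff
    attract = np.sum(W * D2)             # pulls similar points together
    repel = np.sum(np.exp(-D2))          # pushes all pairs of points apart
    return attract + lam * repel
```

Note that the O(N²) cost of the two sums is exactly what motivates the N-body approximations discussed later in the paper.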
Nonlinear embeddings can produce visualizations of high-dimensional data that display structure such as manifolds or clustering, and have been used for exploratory purposes and other applications in machine learning and beyond.

Optimizing nonlinear embeddings is difficult for three reasons: there are many parameters (NL); the objective is very nonconvex, so gradient descent and other methods require many iterations; and it involves O(N²) terms, so evaluating the gradient is very slow. Major progress on these problems has been achieved in recent years. For the second problem, the spectral direction [29] is constructed by "bending" the gradient using the curvature of the quadratic part of the objective (for EE, this is the graph Laplacian L of W). This significantly reduces the number of iterations, while evaluating the direction itself is about as costly as evaluating the gradient. For the third problem, N-body methods such as tree methods [1] and fast multipole methods [11] approximate the gradient in O(N log N) and O(N) time for small dimensions L, respectively, and have made it possible to scale up embeddings to millions of patterns [26, 31, 34].

Another issue that arises with nonlinear embeddings is that they do not define an "out-of-sample" mapping F: R^D → R^L that can be used to project patterns not in the training set. There are two basic approaches to define an out-of-sample mapping for a given embedding. The first one is a variational argument, originally put forward for Laplacian eigenmaps [6] and also applied to the elastic embedding [5]. The idea is to optimize the embedding objective for a dataset consisting of the N training points and one test point, but keeping the training projections fixed. Essentially, this constructs a nonparametric mapping implicitly defined by the training points Y and their projections X, without introducing any assumptions.
The mapping comes out in closed form for Laplacian eigenmaps (a Nadaraya-Watson estimator) but not in general (e.g. for EE), in which case it needs a numerical optimization. In either case, evaluating the mapping for a test point is O(ND), which is slow and does not scale. (For spectral methods one can also use the Nyström formula [3], but it does not apply to nonlinear embeddings, and is still O(ND) at test time.) The second approach is to use a mapping F belonging to a parametric family F of mappings (e.g. linear or neural net), which is fast at test time. Directly fitting F to (Y, X) is inelegant, since F is unrelated to the embedding, and may not work well if the mapping cannot model the data well (e.g. if F is linear). A better way is to involve F in the learning from the beginning, by replacing x_n with F(y_n) in the embedding objective function and optimizing it over the parameters of F. For example, for the elastic embedding of (1) this means

  P(F) = ∑_{n,m=1}^N w_nm ‖F(y_n) − F(y_m)‖² + λ ∑_{n,m=1}^N exp(−‖F(y_n) − F(y_m)‖²).  (2)

This will give better results because the only embeddings that are allowed are those that are realizable by a mapping F in the family F considered. Hence, the optimal F will exactly match the embedding, which is still trying to optimize the objective E(X). This provides an intermediate solution between the nonparametric mapping described above, which is slow at test time, and the direct fit of a parametric mapping to the embedding, which is suboptimal. We will focus on this approach, which we call a parametric embedding (PE), following previous work [25].

A long history of PEs exists, using unsupervised [14, 16-18, 24, 25, 32] or supervised [4, 9, 10, 13, 20, 22] embedding objectives, and using linear or nonlinear mappings (e.g. neural nets).
Each of these papers develops a specialized algorithm to learn the particular PE it defines (= embedding objective and mapping family). In addition, PEs have also been used as regularization terms in semisupervised classification, regression and deep learning [33].

Our focus in this paper is on optimizing an unsupervised parametric embedding defined by a given embedding objective E(X), such as EE or t-SNE, and a given family for the mapping F, such as linear or a neural net. The straightforward approach, used in all papers cited above, is to derive a training algorithm by applying the chain rule to compute gradients over the parameters of F and feeding them to a nonlinear optimizer (usually gradient descent or conjugate gradients). This has three problems. First, a new gradient and optimization algorithm must be developed and coded for each choice of E and F. For a user who wants to try different choices on a given dataset, this is very inconvenient; the power of nonlinear embeddings and unsupervised methods in general lies precisely in their use as exploratory techniques to understand the structure in data, so a user needs to be able to try multiple techniques. Ideally, the user should simply be able to plug different mappings F into any embedding objective E, with minimal development work. Second, computing the gradient involves O(N²) terms, each depending on the entire mapping's parameters, which is very slow. Third, both E and F must be differentiable for the chain rule to apply.

Here, we propose a new approach to optimizing parametric embeddings, based on the recently introduced method of auxiliary coordinates (MAC) [7, 8], that partially alleviates these problems. The idea is to solve an equivalent, constrained problem by introducing new variables (the auxiliary coordinates).
Alternating optimization over the coordinates and the mapping's parameters results in a step that trains an auxiliary embedding with a "regularization" term, and a step that trains the mapping by solving a regression, both of which can be solved by existing algorithms. Section 2 introduces important concepts and describes the chain-rule based optimization of parametric embeddings, section 3 applies MAC to parametric embeddings, and section 4 shows with different combinations of embeddings and mappings that the resulting algorithm is very easy to construct, including use of N-body methods, and is faster than the chain-rule based optimization.

[Figure 1. Left: illustration of the feasible set {Z ∈ R^{L×N}: Z = F(Y) for F ∈ F} (grayed areas) of embeddings that can be produced by the mapping family F. This corresponds to the feasible set of the equality constraints in the MAC-constrained problem (4). A parametric embedding Z∗ = F∗(Y) is a feasible embedding with locally minimal value of E. A free embedding X∗ is a minimizer of E and is usually not feasible. A direct fit F′ (to the free embedding X∗) is feasible but usually not optimal. Right 3 panels: 2D embeddings of 3 objects from the COIL-20 dataset using a linear mapping: a free embedding, its direct fit, and the parametric embedding (PE) optimized with MAC.]

2 Free embeddings, parametric embeddings and chain-rule gradients

Consider a given nonlinear embedding objective function E(X) that takes an argument X ∈ R^{L×N} and maps it to a real value. E(X) is constructed for a dataset Y ∈ R^{D×N} according to a particular embedding model.
We will use as a running example the equations (1), (2) for the elastic embedding, which are simpler than for most other embeddings. We call a free embedding X∗ the result of optimizing E, i.e., a (local) optimizer of E. A parametric embedding (PE) objective function for E using a family F of mappings F: R^D → R^L (for example, linear mappings) is defined as P(F) = E(F(Y)), where F(Y) = (F(y_1), ..., F(y_N)), as in eq. (2) for EE. Note that, to simplify the notation, we do not write explicitly the parameters of F. Thus, a specific PE can be defined by any combination of embedding objective function E (EE, SNE...) and parametric mapping family F (linear, neural net...). The result of optimizing P, i.e., a (local) optimizer of P, is a mapping F∗ which we can apply to any input y ∈ R^D, not necessarily from among the training patterns. Finally, we call the direct fit the mapping resulting from fitting F to (Y, X∗) by least-squares regression, i.e., to map the input patterns to a free embedding. We have the following results.

Theorem 2.1. Let X∗ be a global minimizer of E. Then ∀F ∈ F: P(F) ≥ E(X∗).
Proof. P(F) = E(F(Y)) ≥ E(X∗).

Theorem 2.2 (Perfect direct fit). Let F∗ ∈ F. If F∗(Y) = X∗ and X∗ is a global minimizer of E, then F∗ is a global minimizer of P.
Proof. Let F ∈ F with F ≠ F∗. Then P(F) = E(F(Y)) ≥ E(X∗) = E(F∗(Y)) = P(F∗).

Theorem 2.2 means that if the direct fit of F∗ to (Y, X∗) has zero error, i.e., F∗(Y) = X∗, then it is the solution of the parametric embedding, and there is no need to optimize P. Theorem 2.1 means that a PE cannot do better than a free embedding¹. This is obvious in that a PE is not free but constrained to use only embeddings that can be produced by a mapping in F, as illustrated in fig. 1. A PE will typically worsen the free embedding: more powerful mapping families, such as neural nets, will distort the embedding less than more restricted families, such as linear mappings. In this sense, the free embedding can be seen as using as mapping family F a table (Y, X) with parameters X. It represents the most flexible mapping, since every projection x_n is a free parameter, but it can only be applied to patterns in the training set Y. We will assume that the direct fit has a positive error, i.e., the direct fit is not perfect, so that optimizing P is necessary.

Computationally, the complexity of the gradient of P(F) appears to be O(N² |F|), where |F| is the number of parameters in F, because P(F) involves O(N²) terms, each dependent on all the parameters of F (e.g. for linear F this would cost O(N²LD)). However, if manually simplified and coded, the gradient can actually be computed in O(N²L + N |F|). For example, for the elastic embedding with a linear mapping F(y) = Ay, where A is of L × D, the gradient of eq. (2) is:

  ∂P/∂A = 2 ∑_{n,m=1}^N [(w_nm − λ exp(−‖Ay_n − Ay_m‖²)) (Ay_n − Ay_m)(y_n − y_m)^T]  (3)

and this can be computed in O(N²L + NDL) if we precompute X = AY and take common factors of the summation over x_n and x_m.

¹ By a continuity argument, theorem 2.2 carries over to the case where F∗ and X∗ = F∗(Y) are local minimizers of P and E, respectively. However, theorem 2.2 would then apply only locally, that is, P(F) ≥ E(X∗) holds locally but there may be mappings F with P(F) < E(X∗) associated with another (lower) local minimizer of E. However, the same intuition remains: we cannot expect a PE to improve over a good free embedding.
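To make the saving concrete, the simplified gradient of eq. (3) can be written with a graph-Laplacian identity: for a symmetric coefficient matrix C = (c_nm), ∑_{n,m} c_nm (x_n − x_m)(y_n − y_m)^T = 2 X L_C Y^T, where L_C = diag(C1) − C is the graph Laplacian of C. A minimal NumPy sketch under this identity (the function name is ours; this is an illustration, not the authors' implementation):

```python
import numpy as np

def ee_linear_grad(A, Y, W, lam):
    """Gradient of eq. (2) wrt A for a linear map F(y) = Ay, in O(N^2 L + NDL).

    Uses the identity  sum_{nm} c_nm (x_n - x_m)(y_n - y_m)^T = 2 X L_c Y^T
    for symmetric C, where L_c = diag(C 1) - C is the graph Laplacian of C.
    """
    X = A @ Y                                  # precompute projections, L x N
    sq = np.sum(X**2, axis=0)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X.T @ X), 0.0)
    C = W - lam * np.exp(-D2)                  # per-pair coefficients of eq. (3)
    Lc = np.diag(C.sum(axis=1)) - C            # graph Laplacian of C
    return 4.0 * X @ Lc @ Y.T                  # = 2 * (2 X L_c Y^T)
```

The O(N²L) cost comes from the pairwise distances and the product X L_C; the trailing multiplication by Y^T adds the O(NDL) term.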
An automatic differentiation package may or may not be able to realize these savings in general.

The obvious way to optimize P(F) is to compute the gradient wrt the parameters of F by applying the chain rule (since P is a function of E, which in turn is a function of the parameters of F), assuming E and F are differentiable. While perfectly doable in theory, in practice this has several problems. (1) Deriving, debugging and coding the gradient of P for a nonlinear F is cumbersome. One could use automatic differentiation [12], but current packages can produce inefficient, non-simplified gradients in time and memory, and are not in widespread use in machine learning. Also, combining autodiff with N-body methods seems difficult, because the latter require spatial data structures that are effective for points in low dimension (no more than 3 as far as we know) and depend on the actual point values. (2) The PE gradient may not benefit from special-purpose algorithms developed for embeddings. For example, the spectral direction method [29] relies on special properties of the free-embedding Hessian which do not apply to the PE Hessian. (3) Given the gradient, one then has to choose and possibly adapt a suitable nonlinear optimization method and set its parameters (line search parameters, etc.) so that convergence is assured and the resulting algorithm is efficient. Simple choices such as gradient descent or conjugate gradients are usually not efficient, and developing a good algorithm is a research problem in itself (as evidenced by the many papers that study specific combinations of embedding objective and parametric mapping). (4) Even having done all this, the resulting algorithm will still be very slow because of the complexity of computing the gradient: O(N²L + N |F|). It may be possible to approximate the gradient using N-body methods, but again this would involve significant development effort.
(5) As noted earlier, the chain rule only applies if both E and F are differentiable. Finally, all of the above needs to be redone if we change the mapping (e.g. from a neural net to an RBF network) or the embedding (e.g. from EE to t-SNE). We now show how these problems can be addressed by using a different approach to the optimization.

3 Optimizing a parametric embedding using auxiliary coordinates

The PE objective function, e.g. (2), can be seen as a nested function where we first apply F and then E. A recently proposed strategy, the method of auxiliary coordinates (MAC) [7, 8], can be used to derive optimization algorithms for such nested systems. We write the nested problem min P(F) = E(F(Y)) as the following, equivalent constrained optimization problem:

  min P̄(F, Z) = E(Z)  s.t.  z_n = F(y_n),  n = 1, ..., N  (4)

where we have introduced an auxiliary coordinate z_n for each input pattern and a corresponding equality constraint. z_n can be seen as the output of F (i.e., the low-dimensional projection) for y_n. The optimization is now over an augmented space (F, Z) with NL extra parameters Z ∈ R^{L×N}, and F ∈ F. The feasible set of the equality constraints is shown in fig. 1. We solve the constrained problem (4) using a quadratic-penalty method (it is also possible to use the augmented Lagrangian method), by optimizing the following unconstrained problem and driving µ → ∞:

  min P_Q(F, Z; µ) = E(Z) + (µ/2) ∑_{n=1}^N ‖z_n − F(y_n)‖² = E(Z) + (µ/2) ‖Z − F(Y)‖².  (5)

Under mild assumptions, the minima (Z∗(µ), F∗(µ)) trace a continuous path that converges to a local optimum of P̄(F, Z) and hence of P(F) [7, 8]. Finally, we optimize P_Q using alternating optimization over the coordinates and the mapping. This results in two steps:

Over F given Z: min_{F∈F} ∑_{n=1}^N ‖z_n − F(y_n)‖².
This is a standard least-squares regression for a dataset (Y, Z) using F, and can be solved using existing, well-developed code for many families of mappings. For example, for a linear mapping F(y) = Ay we solve a linear system, A = ZY⁺ (done efficiently by caching Y⁺ in the first iteration and doing a matrix multiplication in subsequent iterations); for a deep net, we can use stochastic gradient descent with pretraining, possibly on a GPU; for a regression tree or forest, we can use any tree-growing algorithm; etc. Also, note that if we want to have a regularization term R(F) in the PE objective (e.g. for weight decay, or for model complexity), that term will appear in the F step but not in the Z step. Hence, the training and regularization of the mapping F are confined to the F step, given the inputs Y and current outputs Z. The mapping F "communicates" with the embedding objective precisely through these low-dimensional coordinates Z.

Over Z given F: min_Z E(Z) + (µ/2) ‖Z − F(Y)‖². This is a regularized embedding, since E(Z) is the original embedding objective function and ‖Z − F(Y)‖² is a quadratic regularization term on Z, with weight µ/2, which encourages Z to be close to a given embedding F(Y). We can reuse existing, well-developed code to learn the embedding E(Z) with simple modifications. For example, the gradient has an added term µ(Z − F(Y)); the spectral direction now uses a curvature matrix L + (µ/2) I. The embedding "communicates" with the mapping F through the outputs F(Y) (which are constant within the Z step), which gradually force the embedding Z to agree with the output of a member of the family of mappings F.

Hence, the intricacies of nonlinear optimization (line search, method parameters, etc.) remain confined within the regression for F and within the embedding for Z, separately from each other. Designing an optimization algorithm for an arbitrary combination of embedding and mapping is achieved simply by alternately calling existing algorithms for the embedding and for the mapping.

Although we have introduced a large number of new parameters to optimize over, the NL auxiliary coordinates Z, the cost of a MAC iteration is actually the same (asymptotically) as the cost of computing the PE gradient, i.e., O(N²L + N |F|), where |F| is the number of parameters in F. In the Z step, the objective function has O(N²) terms but each term depends only on 2 projections (z_n and z_m, i.e., 2L parameters), hence it costs O(N²L). In the F step, the objective function has N terms, each depending on the entire mapping's parameters, hence it costs O(N |F|).

Another advantage of MAC is that, because it does not use chain-rule gradients, it is even possible to use something like a regression tree for F, which is not differentiable, so that the PE objective function is not differentiable either. In MAC, we can use an algorithm to train regression trees within the F step using as data (Y, Z), reducing the constraint error ‖Z − F(Y)‖² and the PE objective.

A final advantage is that we can benefit from recent work on using N-body methods to reduce the O(N²) complexity of computing the embedding gradient exactly to O(N log N) (using tree-based methods such as the Barnes-Hut algorithm [26, 34]) or even O(N) (using fast multipole methods [31]), at a small approximation error. We can reuse such code as is, without any extra work, to approximate the gradient of E(Z) and then add to it the exact gradient of the regularization term ‖Z − F(Y)‖², which is already linear in N.
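The alternating (Z, F) scheme above, with the penalty parameter µ of eq. (5) increased over an outer loop, can be sketched for EE with a linear mapping as follows. This is a hypothetical minimal sketch, with plain gradient descent standing in for the spectral direction and an exact O(N²) gradient standing in for an N-body approximation; all function names and parameter values are ours:

```python
import numpy as np

def ee_grad(Z, W, lam):
    """Exact O(N^2 L) gradient of the EE objective E(Z) of eq. (1);
    in practice this is where an N-body approximation would be plugged in."""
    sq = np.sum(Z**2, axis=0)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (Z.T @ Z), 0.0)
    C = W - lam * np.exp(-D2)
    Lc = np.diag(C.sum(axis=1)) - C            # graph Laplacian of C
    return 4.0 * Z @ Lc

def mac_pe(Y, W, Z, lam=1.0, mus=(1e-2, 1e-1, 1.0, 10.0),
           z_iters=50, step=1e-3):
    """MAC for a PE with a linear mapping F(y) = Ay.

    Y : D x N inputs; Z : L x N initial coordinates (e.g. a free embedding).
    Alternates an F step (least squares, A = Z Y^+) and a Z step
    (gradient steps on E(Z) + (mu/2)||Z - AY||^2) while increasing mu.
    """
    Ypinv = np.linalg.pinv(Y)                  # cache Y^+ once
    for mu in mus:                             # quadratic-penalty path, mu -> inf
        A = Z @ Ypinv                          # F step: least-squares regression
        for _ in range(z_iters):               # Z step: regularized embedding
            G = ee_grad(Z, W, lam) + mu * (Z - A @ Y)
            Z = Z - step * G
    return Z @ Ypinv, Z                        # final mapping A and coordinates Z
```

Swapping the F step body for a neural-net or tree trainer, and the Z step gradient for another embedding's gradient, changes the PE without touching the rest of the loop, which is the universality argument of the paper.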
Hence, each MAC iteration (Z and F steps) runs in linear time in the sample size, and is thus scalable to larger datasets.

The problem of optimizing parametric embeddings is closely related to that of learning binary hashing for fast information retrieval using affinity-based loss functions [21]. The only difference is that in binary hashing the mapping F (an L-bit hash function) maps a D-dimensional vector y ∈ R^D to an L-dimensional binary vector z ∈ {0, 1}^L. The MAC framework can also be applied there, and the resulting algorithm alternates an F step that fits a classifier for each bit of the hash function, and a Z step that optimizes a regularized binary embedding using combinatorial optimization.

Schedule of µ, initial Z and the path to a minimizer. The MAC algorithm for parametric embeddings introduces no new optimization parameters except for the penalty parameter µ. The convergence theory of quadratic-penalty methods and MAC [7, 8, 19] tells us that convergence to a local optimum is guaranteed if each iteration achieves sufficient decrease (always possible by running enough (Z, F) steps) and if µ → ∞. The latter condition ensures the equality constraints are eventually satisfied. Mathematically, the minima (Z∗(µ), F∗(µ)) of P_Q as a function of µ ∈ [0, ∞) trace a continuous path in the (Z, F) space that ends at a local minimum of the constrained problem (4) and thus of the parametric embedding objective function. Hence, our algorithm belongs to the family of path-following methods, such as quadratic-penalty, augmented Lagrangian, homotopy and interior-point methods, widely regarded as effective with nonconvex problems.

In practice, one follows that path loosely, i.e., doing fast, inexact steps on Z and F for the current value of µ and then increasing µ.
How fast to increase µ depends on the particular problem; typically, one multiplies µ by a factor of around 2. Increasing µ very slowly will follow the path more closely, but the runtime will increase. Since µ does not appear in the F step, increasing µ is best done within a Z step (i.e., we run several iterations over Z, increase µ, run several iterations over Z, and then do an F step).

The starting point of the path is µ → 0⁺. Here, the Z step simply optimizes E(Z) and hence gives us a free embedding (e.g. we just train an elastic embedding model on the dataset). The F step then fits F to (Y, Z) and hence gives us the direct fit (which generally will have a positive error ‖Z − F(Y)‖²; otherwise we stop with an optimal PE). Thus, the beginning of the path is the direct fit to the free embedding. As µ increases, we follow the path (Z∗(µ), F∗(µ)), and as µ → ∞, F converges to a minimizer of the PE and Z converges to F(Y). Hence, the "lifetime" of the MAC algorithm over the "time" µ starts with a free embedding and a direct fit which disagree with each other, and progressively reduces the error in the F fit by increasing the error in the Z embedding, until F(Y) and Z agree at an optimal PE.

Although it is possible to initialize Z in a different way (e.g. randomly) and start with a large value of µ, we find this converges to worse local optima than starting from a free embedding with a small µ. Good local optima for the free embedding itself can be found by homotopy methods as well [5].

4 Experiments

Our experiments confirm that MAC finds optima as good as those of the conventional optimization based on chain-rule gradients, but that it is faster (particularly if using N-body methods).
We demonstrate this with different embedding objectives (the elastic embedding and t-SNE) and mappings (linear and neural net). We report on a representative subset of experiments.

Illustrative example. The simple example of fig. 1 shows the different embedding types described in the paper. We use the COIL-20 dataset, containing rotation sequences of 20 physical objects every 5 degrees, each a grayscale image of 128 × 128 pixels, for a total of N = 1 440 points in 16 384 dimensions; thus, each object traces a closed loop in pixel space. We produce 2D embeddings of 3 objects, using the elastic embedding (EE) [5]. The free embedding X∗ results from optimizing the EE objective function (1), without any limitations on the low-dimensional projections. It gives the best visualization of the data, but no out-of-sample mapping. We now seek a linear out-of-sample mapping F. The direct fit fits a linear mapping to map the high-dimensional images Y to their 2D projections X∗ from the free embedding. The resulting predictions F(Y) give a quite distorted representation of the data, because a linear mapping cannot realize the free embedding X∗ with low error. The parametric embedding (PE) finds the linear mapping F∗ that optimizes P(F), which for EE is eq. (2). To optimize the PE, we used MAC (which was faster than gradient descent and conjugate gradients). The resulting PE represents the data worse than the free embedding (since the PE is constrained to produce embeddings that are realizable by a linear mapping), but better than the direct fit, because the PE can search for embeddings that, while being realizable by a linear mapping, produce a lower value of the EE objective function.

The details of the optimization are as follows. We preprocess the data using PCA, projecting to 15 dimensions (otherwise learning a mapping would be trivial, since there are more degrees of freedom than there are points).
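The PCA preprocessing mentioned above is standard; for completeness, a minimal sketch (ours, not the paper's code) is:

```python
import numpy as np

def pca_project(Y, k=15):
    """Project D x N data onto its top-k principal components.

    Centers the data and uses the SVD of the centered matrix; returns
    the k x N array of projected coordinates.
    """
    Yc = Y - Y.mean(axis=1, keepdims=True)     # center each feature
    U, S, _ = np.linalg.svd(Yc, full_matrices=False)
    return U[:, :k].T @ Yc                     # k x N projected data
```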
The free embedding was optimized using the spectral direction [29] until consecutive iterates differed by a relative error of less than 10⁻³. We increased µ from 0.003 to 0.015 with a step of 0.001 (12 µ values) and did 40 iterations for each µ value. The Z step uses the spectral direction, stopping when the relative error is less than 10⁻².

Cost of the iterations. Fig. 2(left) shows, as a function of the number of data points N (using a 3D Swissroll dataset), the time needed to compute the gradient of the PE objective (red curve) and the gradients of the MAC Z and F steps (black and magenta, respectively, as well as their sum in blue). We use t-SNE and a sigmoidal neural net with an architecture 3-100-500-2. We approximate the Z gradient in O(N log N) using the Barnes-Hut method [26, 34]. The log-log plot shows the asymptotic complexity to be quadratic for the PE gradient, but linear for the F step and O(N log N) for the Z step. The PE gradient runs out of memory for large N.

Quality of the local optima. For the same Swissroll dataset, fig. 2(right) shows, as a function of the number of data points N, the final value of the PE objective function achieved by the chain-rule CG optimization and by MAC, both using the same initialization. There is practically no difference between the two optimization algorithms. We sometimes do find that they converge to different local optima, as in some of our other experiments.

Different embedding objectives and mapping families. The goal of this experiment is to show that we can easily derive a convergent, efficient algorithm for various combinations of embeddings and mappings. We consider as embedding objective functions E(X) t-SNE and EE, and as mappings F a neural net and a linear mapping. We apply each combination to learn a parametric embedding for the MNIST dataset, containing N = 60 000 images of handwritten digits.
Figure 2: Runtime per iteration and final PE objective for a 3D Swissroll dataset, using as mapping F a sigmoidal neural net with an architecture 3–100–500–2, for t-SNE. For PE, we give the runtime needed to compute the gradient of the PE objective using CG with chain-rule gradients. For MAC, we give the runtime needed to compute the (Z,F) steps, separately and together. The gradient of the Z step is approximated with an N-body method. Errorbars over 5 randomly generated Swissrolls.

Figure 3: MNIST dataset. Top: t-SNE with a neural net. Bottom: EE with a linear mapping. Left: initial, free embedding (we show a sample of 5 000 points to avoid clutter). Middle: final parametric embedding. Right: learning curves for MAC and chain-rule optimization. Each marker indicates one iteration. For MAC, the solid markers indicate iterations where µ increased.

Training a nonlinear (free) embedding on a dataset of this size was very slow until the recent introduction of N-body methods for t-SNE, EE and other methods [26, 31, 34]. We are the first to use N-body methods for PEs, thanks to the decoupling between mapping and embedding introduced by MAC.

For each combination, we derive the MAC algorithm by reusing code available online: for the EE and t-SNE (free) embeddings we use the spectral direction [29]; for the N-body methods to approximate the embedding objective function gradient we use the fast multipole method for EE [31] and the Barnes-Hut method for t-SNE [26, 34]; and for training a deep net we use unsupervised pretraining and backpropagation [22, 25]. Fig. 3(left) shows the free embedding of MNIST obtained with t-SNE and EE after 100 iterations of the spectral direction. To compute the Gaussian affinities between pairs of points, we used entropic affinities with perplexity K = 30 neighbors [15, 30].

The optimization details are as follows. For the neural net, we replicated the setup of [25]. This uses a neural net with an architecture (28 × 28)–500–500–2000–2, initialized with pretraining as described in [22] and [25]. For the chain-rule PE optimization we used the code from [25]. Because of memory limitations, [25] actually solved an approximate version of the PE objective function, where rather than using all N^2 pairwise point interactions, only BN interactions are used, corresponding to using minibatches of B = 5 000 points. Therefore, the solution obtained is not a minimizer of the PE objective, as can be seen from the higher objective value in fig. 3(bottom). However, we did also solve the exact objective by using B = N (i.e., one minibatch containing the entire dataset).
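The entropic affinities with perplexity K = 30 used above pick a separate Gaussian bandwidth for each point so that its conditional distribution has a fixed entropy log K. A minimal sketch by bisection follows (our own naming; the method of [30] computes this far more efficiently):

```python
import numpy as np

def entropic_affinities(Y, perplexity=30.0, tol=1e-5, max_iter=50):
    """Row-stochastic affinities p(m|n) = exp(-beta_n ||y_n - y_m||^2) / Z_n,
    with each precision beta_n found by bisection so that the entropy of
    p(.|n) equals log(perplexity). Minimal sketch, O(N^2) memory."""
    N = Y.shape[1]
    D2 = np.sum((Y[:, :, None] - Y[:, None, :]) ** 2, axis=0)  # sq. distances
    P = np.zeros((N, N))
    target = np.log(perplexity)
    for n in range(N):
        d = np.delete(D2[n], n)                    # exclude self-affinity
        lo, hi = 1e-10, 1e10                       # bracket for beta_n
        for _ in range(max_iter):
            beta = (lo + hi) / 2.0
            p = np.exp(-beta * (d - d.min()))      # shift for stability
            p /= p.sum()
            H = -np.sum(p * np.log(p + 1e-300))    # entropy of p(.|n)
            if abs(H - target) < tol:
                break
            if H > target:                         # too uniform: sharpen
                lo = beta
            else:                                  # too peaked: widen
                hi = beta
        P[n, np.arange(N) != n] = p
    return P

# Demo on random points (hypothetical data, not MNIST)
rng = np.random.default_rng(1)
P = entropic_affinities(rng.standard_normal((3, 20)), perplexity=5.0)
```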
Each minibatch was trained with 3 CG iterations and a total of 30 epochs.

For MAC, we used µ ∈ {10^-7, 5·10^-7, 10^-6, 5·10^-6, 10^-5, 5·10^-5}, optimizing until the objective function decrease (before the Z step and after the F step) was less than a relative error of 10^-3. The rest of the optimization details concern the embedding and neural net, and are based on existing code. The initialization for Z is the free embedding. The Z step (like the free embedding) uses the spectral direction with a fixed step size γ = 0.05, using 10 iterations of linear conjugate gradients to solve the linear system (L + (µ/2) I) P = −G, and using warm start (i.e., initialized from the previous iteration's direction). The gradient G of the free embedding is approximated in O(N log N) using the Barnes-Hut method with accuracy θ = 1.5. Altogether one Z iteration took around 5 seconds. We exit the Z step when the relative error between consecutive embeddings is less than 10^-3. For the F step we used stochastic gradient descent with minibatches of 100 points, step size 10^-3 and momentum rate 0.9, and trained for 5 epochs for the first 3 values of µ and for 3 epochs for the rest.

For the linear mapping F(y) = Ay, we implemented our own chain-rule PE optimizer with gradient descent and backtracking line search for 30 iterations. In MAC, we used 10 µ values spaced logarithmically from 10^-2 to 10^2, optimizing at each µ value until the objective function decrease was less than a relative error of 10^-4. Both the Z step and the free embedding use the spectral direction with a fixed step size γ = 0.01. We stop optimizing them when the relative error between consecutive embeddings is less than 10^-4. The gradient is approximated using fast multipole methods with accuracy p = 6 (the number of terms in the truncated series).
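The Z-step system (L + (µ/2) I) P = −G above is symmetric positive definite for µ > 0, so a few warm-started conjugate-gradient iterations per embedding dimension suffice. A minimal sketch with SciPy; the graph Laplacian and gradient here are random stand-ins, not the paper's data:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def z_step_direction(Lap, G, mu, P_prev=None, cg_iters=10):
    """Solve (Lap + mu/2 I) P = -G one embedding dimension at a time.
    Lap: N x N graph Laplacian (PSD), G: L x N gradient of the embedding
    objective, P_prev: previous direction, reused as CG warm start."""
    A = Lap + (mu / 2.0) * sp.identity(Lap.shape[0])   # positive definite
    P = np.empty_like(G)
    for l in range(G.shape[0]):
        x0 = None if P_prev is None else P_prev[l]
        P[l], _ = cg(A, -G[l], x0=x0, maxiter=cg_iters)
    return P

# Tiny demo on a random symmetric affinity matrix (stand-in data)
rng = np.random.default_rng(0)
N = 30
W = rng.random((N, N))
W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)
Lap = sp.csr_matrix(np.diag(W.sum(axis=1)) - W)      # graph Laplacian
G = rng.standard_normal((2, N))
P = z_step_direction(Lap, G, mu=0.1, cg_iters=200)   # extra iters: solve well
```

With cg_iters = 10 and P_prev carried over from the previous Z iteration, each call is cheap; combined with an N-body approximation of G, the whole Z step stays O(N log N).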
In the F step, the linear system to find A was solved using 10 iterations of linear conjugate gradients with warm start.

Fig. 3 shows the final parametric embeddings for MAC, neural-net t-SNE (top) and linear EE (bottom), and the learning curves (PE error P(F(Y)) over iterations). MAC is considerably faster than the chain-rule optimization in all cases.

For the neural-net t-SNE, MAC is almost 5× faster than using minibatches (the approximate PE objective) and 20× faster than the exact, batch mode. This is partly thanks to the use of N-body methods in the Z step. The runtimes were (excluding the 40' taken by pretraining): MAC: 42'; PE (minibatch): 3.36 h; PE (batch): 15 h; free embedding: 63". Without N-body methods, MAC is 4× faster than PE (batch) and comparable to PE (minibatch). For the linear EE, the runtimes were: MAC: 12.7'; PE: 63'; direct fit: 40".

The neural-net t-SNE embedding preserves the overall structure of the free t-SNE embedding, but the two embeddings do differ. For example, the free embedding creates small clumps of points, which the neural net, being a continuous mapping, tends to smooth out. The linear EE embedding distorts the free EE embedding considerably more than a neural net does, because a linear mapping has a much harder time approximating the complex mapping from the high-dimensional data into 2D that the free embedding implicitly demands.

5 Conclusion

In our view, the main advantage of using the method of auxiliary coordinates (MAC) to learn parametric embeddings is that it simplifies algorithm development. One only needs to plug in existing code for the embedding (with minor modifications) and the mapping.
This is particularly useful to benefit from complex, highly optimized code for specific problems, such as the N-body methods we used here, or perhaps GPU implementations of deep nets and other machine learning models. In many applications, the efficiency of programming an easy, robust solution is more valuable than the speed of the machine. But, in addition, we find that the MAC algorithm can be considerably faster than the chain-rule optimization of the parametric embedding.

Acknowledgments

Work funded by NSF award IIS–1423515. We thank Weiran Wang for help with training the deep net in the MNIST experiment.

References

[1] J. Barnes and P. Hut. A hierarchical O(N log N) force-calculation algorithm. Nature, 324, 1986.

[2] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373–1396, 2003.

[3] Y. Bengio, O. Delalleau, N. Le Roux, J.-F. Paiement, P. Vincent, and M. Ouimet. Learning eigenfunctions links spectral embedding and kernel PCA. Neural Computation, 16:2197–2219, 2004.

[4] J. Bromley, J. W. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Säckinger, and R. Shah. Signature verification using a "siamese" time delay neural network. Int. J. Pattern Recognition and Artificial Intelligence, 5:669–688, 1993.

[5] M. Carreira-Perpiñán. The elastic embedding algorithm for dimensionality reduction. ICML, 2010.

[6] M. Carreira-Perpiñán and Z. Lu. The Laplacian Eigenmaps Latent Variable Model. AISTATS, 2007.

[7] M. Carreira-Perpiñán and W. Wang. Distributed optimization of deeply nested systems. arXiv:1212.5921 [cs.LG], Dec. 24 2012.

[8] M. Carreira-Perpiñán and W. Wang. Distributed optimization of deeply nested systems. AISTATS, 2014.

[9] A. Globerson and S. Roweis. Metric learning by collapsing classes. NIPS, 2006.

[10] J.
Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. NIPS, 2005.

[11] L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. J. Comp. Phys., 73, 1987.

[12] A. Griewank and A. Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. SIAM Publ., second edition, 2008.

[13] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. CVPR, 2006.

[14] X. He and P. Niyogi. Locality preserving projections. NIPS, 2004.

[15] G. Hinton and S. T. Roweis. Stochastic neighbor embedding. NIPS, 2003.

[16] D. Lowe and M. E. Tipping. Feed-forward neural networks and topographic mappings for exploratory data analysis. Neural Computing & Applications, 4:83–95, 1996.

[17] J. Mao and A. K. Jain. Artificial neural networks for feature extraction and multivariate data projection. IEEE Trans. Neural Networks, 6:296–317, 1995.

[18] R. Min, Z. Yuan, L. van der Maaten, A. Bonner, and Z. Zhang. Deep supervised t-distributed embedding. ICML, 2010.

[19] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, second edition, 2006.

[20] J. Peltonen and S. Kaski. Discriminative components of data. IEEE Trans. Neural Networks, 16, 2005.

[21] R. Raziperchikolaei and M. Carreira-Perpiñán. Learning hashing with affinity-based loss functions using auxiliary coordinates. arXiv:1501.05352 [cs.LG], Jan. 21 2015.

[22] R. Salakhutdinov and G. Hinton. Learning a nonlinear embedding by preserving class neighbourhood structure. AISTATS, 2007.

[23] J. W. Sammon, Jr. A nonlinear mapping for data structure analysis. IEEE Trans. Computers, 18, 1969.

[24] Y. W. Teh and S. Roweis. Automatic alignment of local representations. NIPS, 2003.

[25] L. J. P. van der Maaten. Learning a parametric embedding by preserving local structure. AISTATS, 2009.

[26] L. J. P. van der Maaten.
Barnes-Hut-SNE. Int. Conf. Learning Representations (ICLR), 2013.

[27] L. J. P. van der Maaten and G. E. Hinton. Visualizing data using t-SNE. JMLR, 9:2579–2605, 2008.

[28] J. Venna, J. Peltonen, K. Nybo, H. Aidos, and S. Kaski. Information retrieval perspective to nonlinear dimensionality reduction for data visualization. JMLR, 11:451–490, 2010.

[29] M. Vladymyrov and M. Carreira-Perpiñán. Partial-Hessian strategies for fast learning of nonlinear embeddings. ICML, 2012.

[30] M. Vladymyrov and M. Carreira-Perpiñán. Entropic affinities: Properties and efficient numerical computation. ICML, 2013.

[31] M. Vladymyrov and M. Carreira-Perpiñán. Linear-time training of nonlinear low-dimensional embeddings. AISTATS, 2014.

[32] A. R. Webb. Multidimensional scaling by iterative majorization using radial basis functions. Pattern Recognition, 28:753–759, 1995.

[33] J. Weston, F. Ratle, and R. Collobert. Deep learning via semi-supervised embedding. ICML, 2008.

[34] Z. Yang, J. Peltonen, and S. Kaski. Scalable optimization for neighbor embedding for visualization. ICML, 2013.