{"title": "No Pressure! Addressing the Problem of Local Minima in Manifold Learning Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 680, "page_last": 689, "abstract": "Nonlinear embedding manifold learning methods provide invaluable visual insights into a structure of high-dimensional data. However, due to a complicated nonconvex objective function, these methods can easily get stuck in local minima and their embedding quality can be poor. We propose a natural extension to several manifold learning methods aimed at identifying pressured points, i.e. points stuck in the poor local minima and have poor embedding quality. We show that the objective function can be decreased by temporarily allowing these points to make use of an extra dimension in the embedding space. Our method is able to improve the objective function value of existing methods even after they get stuck in a poor local minimum.", "full_text": "No Pressure! Addressing the Problem of Local\n\nMinima in Manifold Learning Algorithms\n\nMax Vladymyrov\nGoogle Research\nmxv@google.com\n\nAbstract\n\nNonlinear embedding manifold learning methods provide invaluable visual in-\nsights into the structure of high-dimensional data. However, due to a complicated\nnonconvex objective function, these methods can easily get stuck in local minima\nand their embedding quality can be poor. We propose a natural extension to sev-\neral manifold learning methods aimed at identifying pressured points, i.e. points\nstuck in poor local minima and have poor embedding quality. We show that the\nobjective function can be decreased by temporarily allowing these points to make\nuse of an extra dimension in the embedding space. 
Our method is able to improve the objective function value of existing methods even after they get stuck in a poor local minimum.

1 Introduction

Given a dataset Y ∈ R^{D×N} of N points in some high-dimensional space with dimensionality D, manifold learning algorithms try to find a low-dimensional embedding X ∈ R^{d×N} of every point from Y in some space with dimensionality d ≪ D. These algorithms play an important role in high-dimensional data analysis, specifically for data visualization, where d = 2 or d = 3. The quality of the methods has come a long way in recent decades, from classic linear methods (e.g. PCA, MDS), to more nonlinear spectral methods, such as Laplacian Eigenmaps [Belkin and Niyogi, 2003], LLE [Saul and Roweis, 2003] and Isomap [de Silva and Tenenbaum, 2003], finally followed by even more general nonlinear embedding (NLE) methods, which include Stochastic Neighbor Embedding (SNE, Hinton and Roweis, 2003), t-SNE [van der Maaten and Hinton, 2008], NeRV [Venna et al., 2010] and Elastic Embedding (EE, Carreira-Perpiñán, 2010). This last group of methods is considered state-of-the-art in manifold learning and has become a go-to tool for high-dimensional data analysis in many domains (e.g. to compare the learning states in Deep Reinforcement Learning [Mnih et al., 2015] or to visualize learned vectors of an embedding model [Kiros et al., 2015]).

While the results of NLE have improved in quality, their algorithmic complexity has increased as well. NLE methods are defined using a nonconvex objective that requires careful iterative minimization. A lot of effort has been spent on improving the convergence of NLE methods, including Spectral Direction [Vladymyrov and Carreira-Perpiñán, 2012], which uses partial-Hessian information in order to define a better search direction, and optimization using a Majorization-Minimization approach [Yang et al., 2015]. 
However, even with these sophisticated custom algorithms, it is still often necessary to perform a few random restarts in order to achieve a decent solution. Sometimes it is not even clear whether the learned embedding represents the structure of the input data, noise, or the artifacts of the embedding algorithm [Wattenberg et al., 2016].

Consider the situation in fig. 1. There we run EE 100 times on the same dataset with the same parameters, varying only the initialization. The dataset, COIL-20, consists of photos of 20 different objects as they are rotated on a platform, with a new photo taken every 5 degrees (72 images per object). A good embedding should separate objects from one another and also reflect the rotational sequence of each object (ideally via a circular embedding). We see in the left plot that for virtually every run the embedding gets stuck in a distinct local minimum. The other two figures show the difference between the best and the worst embedding depending on how lucky we get with the initialization.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Abundance of local minima in the Elastic Embedding objective function space. We run the algorithm 100× on the COIL-20 dataset with different random initializations. We show the objective function decrease (left), and the embedding results for the runs with the lowest (center; best embedding, e = 0.40) and the highest (right; worst embedding, e = 0.45) final objective function values. Color encodes different objects.

The embedding in the center has much better quality compared to the one on the right, since most of the objects are separated from each other and their embeddings more closely resemble a circle. In this paper we focus on the analysis of the reasoning behind the occurrence of local minima in the NLE objective function and ways for the algorithms to avoid them. 
Speci\ufb01cally, we discuss the\nconditions under which some points get caught in high-energy states of the objective function. We\ncall these points \u201cpressured points\u201d and show that speci\ufb01cally for the NLE class of algorithms there\nis a natural way to identify and characterize them during optimization.\nOur contribution is twofold. First, we look at the objective function of the NLE methods and provide\na mechanism to identify the pressured points for a given embedding. This can be used on its own\nas a diagnostic tool for assessing the quality of a given embedding at the level of individual points.\nSecond, we propose an optimization algorithm that is able to utilize the insights from the pressured\npoints analysis to achieve better objective function values even from a converged solution of an\nexisting state-of-the-art optimizer. The proposed modi\ufb01cation augments the existing analysis of the\nNLE and can be run on top of state-of-the-art optimization methods: Spectral Direction and N-body\nalgorithms [Yang et al., 2013, van der Maaten, 2014, Vladymyrov and Carreira-Perpi\u02dcn\u00b4an, 2014].\nOur analysis arises naturally from a given NLE objective function and does not depend on any other\nassumptions. Other papers have looked into the problem of assessing the quality of the embedding\n[Peltonen and Lin, 2015, Lee and Verleysen, 2009, Lespinats and Aupetit, 2011]. However, their\nquality criteria are de\ufb01ned separately from the actual learned objective function, which introduces\nadditional assumptions and does not connect to the original objective function. Moreover, we also\npropose a method for improving the embedding quality in addition to assessing it.\n2 Nonlinear Manifold Learning Algorithms\nThe objective functions for SNE and t-SNE were originally de\ufb01ned as a KL-divergence between\ntwo normalized probability distributions of points being in the neighborhood of each other. 
They use a positive affinity matrix W+, usually computed as $w^+_{ij} = \exp(-\frac{1}{2\sigma^2}\|y_i - y_j\|^2)$, to capture the similarity of points in the original space D. The algorithms differ in the kernels they use in the low-dimensional space. SNE uses the normalized Gaussian kernel¹ $K_{ij} = \frac{\exp(-\|x_i - x_j\|^2)}{\sum_{n,m}\exp(-\|x_n - x_m\|^2)}$, while t-SNE uses the normalized Student's t kernel $K_{ij} = \frac{(1 + \|x_i - x_j\|^2)^{-1}}{\sum_{n,m}(1 + \|x_n - x_m\|^2)^{-1}}$. UMAP [McInnes et al., 2018] uses the unnormalized kernel $K_{ij} = (1 + a\|x_i - x_j\|^{2b})^{-1}$, which is similar to Student's t, but with additional constants a, b calculated based on the topology of the original manifold. Its objective function is given by the cross entropy, as opposed to the KL-divergence.

Carreira-Perpiñán [2010] showed that these algorithms could be defined as an interplay between two additive terms: E(X) = E+(X) + E−(X). The attractive term E+, usually convex, pulls points close to each other with a force that is larger for points located nearby in the original space. The repulsive term E−, on the contrary, pushes points away from each other. For SNE and t-SNE the attraction is given by the numerator of the normalized kernel, while the repulsion is given by the denominator. This intuitively makes sense, since in order to pull some point closer (decrease the numerator), you have to push all the other points away a little bit (increase the denominator) so that the probabilities still sum to one. For UMAP, there is no normalization to act as a repulsion; instead, the repulsion is given by the second term in the cross entropy (i.e. the entropy of the low-dimensional probabilities).

¹ Instead of the classic SNE, in this paper we are going to use symmetric SNE [Cook et al., 2007], where each probability is normalized by the interaction between all pairs of points and not every point individually.

Figure 2: Left: an illustration of the local minimum typically occurring in NLE optimization. Blue dashed lines indicate the location of 3 points in 1D. The curves show the objective function landscape wrt x0. Right: by enabling an extra dimension for x0, we can create a "tunnel" that avoids a local minimum in the original space, but follows a continuous minimization path in the augmented space.

Elastic Embedding (EE) modifies the repulsive term of the SNE objective by dropping the log, adding a weight W− to better capture non-local interactions (e.g. as $w^-_{ij} = \|y_i - y_j\|^2$), and introducing a scaling hyperparameter λ to control the interplay between the two terms.

Here are the objective functions of the described methods:

$E_{EE}(X) = \sum_{i,j} w^+_{ij}\|x_i - x_j\|^2 + \lambda\sum_{i,j} w^-_{ij} e^{-\|x_i - x_j\|^2}$,   (1)

$E_{SNE}(X) = \sum_{i,j} w^+_{ij}\|x_i - x_j\|^2 + \log\sum_{i,j} e^{-\|x_i - x_j\|^2}$,   (2)

$E_{t-SNE}(X) = \sum_{i,j} w^+_{ij}\log(1 + \|x_i - x_j\|^2) + \log\sum_{i,j}(1 + \|x_i - x_j\|^2)^{-1}$,   (3)

$E_{UMAP}(X) = \sum_{i,j} w^+_{ij}\log(1 + a\|x_i - x_j\|^{2b}) + \sum_{i,j}(w^+_{ij} - 1)\log\Big(1 - \frac{1}{1 + a\|x_i - x_j\|^{2b}}\Big)$.   (4)

3 Identifying pressured points

Let us consider the optimization with respect to a given point x0 from X. For all the algorithms the attractive term E+ grows as $\|x_0 - x_n\|^2$ and thus has a high penalty for points placed far away in the embedding space (especially if they are located nearby in the original space). The repulsive term E− is mostly localized and concentrated around individual neighbors of x0. As x0 navigates the landscape of E it tries to get to the minimum of E+ while avoiding the "hills" of E− created around repulsive neighbors. However, the degrees of freedom of X are limited by d, which is typically much smaller than the intrinsic dimensionality of the data. It might happen that the point gets stuck surrounded by its non-local neighbors and is unable to find a path through.

We can illustrate this with a simple scenario involving three points y0, y1, y2 in the original R^D space, where y0 and y1 are near each other and y2 is further away. We decrease the dimensionality to d = 1 using the EE algorithm and assume that, due to e.g. poor initialization, x2 is located between x0 and x1. In the left plot of fig. 
2 we show different parts of the objective function as a function of x0. The attractive term E+(x0) creates a high pressure for x0 to move towards x1. However, the repulsion between x0 and x2 creates a counter pressure that pushes x0 away from x2, thus creating two minima: one local near x = −1 and another global near x = 1.5. Points like x0 are trapped in high-energy regions and are not able to move. We argue that these situations are the reason behind many of the local minima of NLE objective functions. By identifying and repositioning these points we can improve the objective function and the overall quality of the embedding.

We propose to evaluate the pressure of every point with a very simple and intuitive idea: increased pressure from the "false" neighbors would create a higher energy for the point to escape that location. However, for a true local minimum, there are no directions for that point to move, given the existing number of dimensions. If we were to add a new dimension Z temporarily just for that point, it would be possible for the point to move along that new dimension (see fig. 2, right). The more that point is pressured by other points, the farther along this new dimension it would go.

More formally, we say that a point is pressured if the objective function has a nontrivial minimum when evaluated at that point along the new dimension Z. We define the minimum ẑ along the dimension Z as the pressure of that point.

It is important to notice the distinction between pressured points and points at which the objective function simply has a high value (a criterion that is used e.g. in Lespinats and Aupetit [2011] to assess the embedding quality). A large objective function value alone does not necessarily mean that the point is stuck in a local minimum. 
First, the point could still be on its way to the minimum. Second, even for an embedding that represents the global minimum, each point would converge to its own unique objective function value, since the affinities for every point are distinct. Finally, not every NLE objective function can be easily evaluated for every point separately: the SNE (2) and t-SNE (3) objective functions contain a log term that does not allow for easy decoupling.

In what follows we are going to characterize the pressure of each point and look at how the objective function changes when we add an extra dimension to each of the algorithms described above.

Elastic Embedding. For a given point k we extend the objective function of EE (1) along the new dimension Z. Notice that we consider points individually one by one, therefore zi = 0 for all i ≠ k. The objective function of EE along the new dimension zk becomes

$\tilde{E}_{EE}(z_k) = 2z_k^2 d^+_k + 2\tilde{d}^-_k e^{-z_k^2} + C$,   (5)

where $d^+_k = \sum_{i=1}^N w^+_{ik}$, $\tilde{d}^-_k = \lambda\sum_{i=1}^N w^-_{ik} e^{-\|x_i - x_k\|^2}$ and C is a constant independent from zk. The function is symmetric wrt 0 and convex for zk ≥ 0. Its derivative is

$\frac{\partial\tilde{E}_{EE}(z_k)}{\partial z_k} = 4z_k\big(d^+_k - e^{-z_k^2}\tilde{d}^-_k\big)$.   (6)

The function has a stationary point at zk = 0, which is a minimum when $\tilde{d}^-_k < d^+_k$. Otherwise, zk = 0 is a maximum and the only non-trivial minimum is $\hat{z}_k = \sqrt{\log(\tilde{d}^-_k / d^+_k)}$. The magnitude of the fraction under the log corresponds to the amount of pressure for xk. The numerator $\tilde{d}^-_k$ depends on X and represents the pressure that the neighbors of xk exert on it. The denominator is given by the diagonal element k of the degree matrix D+ and represents the attraction of the points in the original high-dimensional space. 
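This EE pressure test is cheap to evaluate from quantities already needed for the gradient. Below is a minimal NumPy sketch, not the paper's implementation: the affinity matrices and the embedding are assumed given, and self-interactions are ignored for simplicity.

```python
import numpy as np

def ee_pressure(X, Wp, Wm, lam=1.0):
    """Pressure of each point under the Elastic Embedding objective.

    Point k is pressured when its repulsive degree exceeds its attractive
    degree; its pressure is then z_hat = sqrt(log(d_minus / d_plus)), else 0.
    X: (N, d) embedding; Wp, Wm: (N, N) attractive/repulsive affinities.
    """
    # pairwise squared distances ||x_i - x_k||^2 in the embedding
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    d_plus = Wp.sum(axis=0)                          # attraction from the original space
    d_minus = lam * (Wm * np.exp(-sq)).sum(axis=0)   # repulsive pressure in the embedding
    ratio = d_minus / d_plus
    z_hat = np.sqrt(np.log(np.maximum(ratio, 1.0)))  # zero for non-pressured points
    return ratio > 1.0, z_hat
```

As a toy check, placing all points on top of each other makes the degrees directly comparable: with uniform attraction, a point whose repulsive degree is twice its attractive degree is flagged with pressure $\sqrt{\log 2}$.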
The fraction is smallest when points are ordered by $w^-_{ik}$ for all i ≠ k, i.e. ordered by distance from yk. As points change order and move closer to xk (especially those far in the original space, i.e. with high $w^-_{ik}$), $\tilde{d}^-_k$ increases and eventually $\tilde{E}_{EE}(z_k = 0)$ turns from a minimum into a maximum, thus creating a pressured point.

Stochastic Neighbor Embedding. The objective along the dimension Z for a point k is given by

$\tilde{E}_{SNE}(z_k) = 2z_k^2 d^+_k + \log\big(2(e^{-z_k^2} - 1)\tilde{d}^-_k + \sum_n \tilde{d}^-_n\big) + C$,

where, slightly abusing the notation between different methods, we define $d^+_k = \sum_{i=1}^N w^+_{ik}$ and $\tilde{d}^-_k = \sum_{i=1}^N e^{-\|x_i - x_k\|^2}$. The derivative is equal to

$\frac{\partial\tilde{E}_{SNE}(z_k)}{\partial z_k} = 4z_k\Big(d^+_k - \frac{e^{-z_k^2}\tilde{d}^-_k}{2(e^{-z_k^2} - 1)\tilde{d}^-_k + \sum_n \tilde{d}^-_n}\Big)$.

Similarly to EE, the function is convex and has a stationary point at zk = 0, which is a minimum when $\tilde{d}^-_k(1 - 2d^+_k) < d^+_k\big(\sum_n \tilde{d}^-_n - 2\tilde{d}^-_k\big)$. This can be rewritten as $\frac{\sum_{i=1}^N e^{-\|x_i - x_k\|^2}}{\sum_{i,j \not= k} e^{-\|x_i - x_j\|^2}} < \frac{\sum_{i=1}^N w^+_{ik}}{\sum_{i,j \not= k} w^+_{ij}}$. The LHS represents the pressure of the points on xk normalized by the overall pressure from the rest of the points. If this pressure gets larger than the similar quantity in the original space (RHS), the point becomes pressured, with the minimum at $\hat{z}_k = \sqrt{\log\frac{\tilde{d}^-_k(1 - 2d^+_k)}{d^+_k(\sum_n \tilde{d}^-_n - 2\tilde{d}^-_k)}}$.

t-SNE. t-SNE uses the Student's t distribution, which does not decouple as nicely as the Gaussian kernel of EE and SNE. The objective along zk and its derivative are given by

$\tilde{E}_{t-SNE}(z_k) = 2\sum_{i=1}^N w^+_{ik}\log(K^{-1}_{ik} + z_k^2) + \log\big(\sum_{i,j \not= k} K_{ij} + 2\sum_{i=1}^N (K^{-1}_{ik} + z_k^2)^{-1}\big) + C$,

$\frac{\partial\tilde{E}_{t-SNE}(z_k)}{\partial z_k} = 4z_k\Big(\sum_{i=1}^N \frac{w^+_{ik}}{K^{-1}_{ik} + z_k^2} - \frac{\sum_{i=1}^N (K^{-1}_{ik} + z_k^2)^{-2}}{\sum_{i,j \not= k} K_{ij} + 2\sum_{i=1}^N (K^{-1}_{ik} + z_k^2)^{-1}}\Big)$,

where $K_{ij} = (1 + \|x_i - x_j\|^2)^{-1}$. The function is convex, but the closed-form solution is now harder to obtain. Practically, it can be found with just a few iterations of Newton's method initialized at some positive value close to 0. In addition, we can quickly test whether the point is pressured or not from the sign of the second derivative at zk = 0: $\frac{\partial^2\tilde{E}_{t-SNE}(0)}{\partial z_k^2} = 4\Big(\sum_{i=1}^N w^+_{ik} K_{ik} - \frac{\sum_{i=1}^N K^2_{ik}}{\sum_{i,j=1}^N K_{ij}}\Big)$.

We don't provide formulas for UMAP due to space limitations, but similarly to t-SNE, the UMAP objective is also convex along zk with zero or one minimum, depending on the sign of the second derivative at zk = 0.

Figure 3: Some examples of pressured points for different datasets. Larger marker size corresponds to a higher pressure value. Color corresponds to the ground truth. 
Left: SNE embedding of the swissroll dataset with poor initialization that results in a twist in the middle of the roll. Right: 10 objects from the COIL-20 dataset after 100 iterations of EE.

4 Pressured points for quality analysis

The analysis above can be directly applied to the existing algorithms as is, resulting in a quantitative statistic on the amount of pressure each point is experiencing during the optimization. A nice additional property is that computing pressured points can be done in constant time by reusing parts of the gradient. A practitioner can run the analysis at every iteration of the algorithm essentially for free to see how many points are pressured and whether the embedding results can be trusted.

In fig. 3 we show a couple of examples of embeddings with pressured points computed. The embedding of the swissroll on the left had a poor initialization that SNE was not able to recover from. Pressured points are concentrated around the twist in the embedding and in the corners, precisely where the difference with the ground truth occurs. On the right, we can see the embedding of a subset of the COIL-20 dataset midway through optimization with EE. The embeddings of some objects overlap with each other, which results in high pressure.

In fig. 4 we show an embedding of a subset of MNIST after 200 iterations of t-SNE. We highlight some of the digits that ended up in clusters different from their ground truth. We put them in a red frame if a digit has a high pressure and in a green frame if its pressure is 0. For the most part, the digits in red squares do not belong to the clusters where they are currently located, while the digits in green squares look very similar to the digits around them.

5 Improving convergence by pressured points optimization

The analysis above can also be used for improvements in optimization. 
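Recall from Section 3 that for t-SNE the pressure ẑ_k has no closed form and must be found numerically. The following sketch illustrates that per-point test; it is not the paper's code: the quantities w⁺_ik, K⁻¹_ik and the off-k kernel sum are assumed precomputed, and for simplicity it brackets the root of the derivative by bisection rather than the Newton steps used in the paper.

```python
import numpy as np

def tsne_pressure(w_k, kinv_k, k_rest):
    """Pressure z_hat for one point k under the t-SNE objective.

    w_k:    attractive affinities w+_{ik} to point k, shape (N,)
    kinv_k: K^{-1}_{ik} = 1 + ||x_i - x_k||^2, shape (N,)
    k_rest: sum of kernel values K_ij over pairs not involving k
    """
    def slope(z):
        # bracket of dE/dz_k = 4 z * slope(z): attraction minus repulsion
        u = kinv_k + z * z
        return np.sum(w_k / u) - np.sum(u ** -2.0) / (k_rest + 2.0 * np.sum(1.0 / u))

    if slope(0.0) >= 0.0:        # second derivative at 0 is positive: not pressured
        return 0.0
    lo, hi = 0.0, 1.0
    while slope(hi) < 0.0:       # grow the bracket until the slope turns positive
        hi *= 2.0
    for _ in range(60):          # bisect down to the nontrivial minimum z_hat
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if slope(mid) < 0.0 else (lo, mid)
    return 0.5 * (lo + hi)
```

A point with weak attraction but a close repulsive neighbor comes out pressured (ẑ > 0), while increasing the attraction to that same neighbor drives the pressure back to zero.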
Imagine the embedding X has a set of points P that are pressured according to the definition above. Effectively it means that, given a new dimension, these points would utilize it in order to improve their current location. Let us create this new dimension Z with zk ≠ 0 for all k ∈ P. Non-pressured points can only move along the original d dimensions. For example, here is the augmented objective function for EE:

$\tilde{E}(X, Z) = E(x_{j \not\in P}) + E\big(\big[\begin{smallmatrix} x_i \\ z_i \end{smallmatrix}\big]_{i \in P}\big) + 2\Big(\sum_{i \in P}\sum_{j \not\in P} w^+_{ij}\|x_i - x_j\|^2 + \lambda\sum_{i \in P}\sum_{j \not\in P} w^-_{ij} e^{-\|x_i - x_j\|^2} e^{-z_i^2} + \sum_{i \in P} z_i^2 \sum_{j \not\in P} w^+_{ij}\Big)$.   (7)

Figure 4: MNIST embedding after 200 iterations of t-SNE. We highlight two sets of digits located in clusters different from their ground truth: digits in red are pressured and look different from their neighbors; digits in green are non-pressured and look similar to their neighboring class.

Algorithm 1: Pressured Points Optimization
Input: Initial X, sequence of regularization steps μ.
Compute a set of pressured points P from X and initialize Z according to their pressure value.
foreach μi ∈ μ do
  repeat
    Update X, Z by minimizing $\tilde{E}(X, Z) + \mu_i Z Z^T$.
    Update P using pressured points from the new X:
      1. Add new points to P according to their pressure value.
      2. Remove points that are not pressured anymore.
  until convergence;
end
Output: final X

The first two terms represent the minimization of pressured and non-pressured points independently. The last term defines the interaction between pressured and non-pressured points and itself has three terms. 
The first one gives the attraction between pressured and non-pressured points in the d-dimensional space. The second term captures the interaction between Z for pressured points and X for non-pressured ones. On one hand, it pushes Z away from 0 as pressured and non-pressured points move closer to each other in the d-dimensional space. On the other hand, it re-weights the repulsion between pressured and non-pressured points proportionally to $\exp(-z_i^2)$, reducing the repulsion for larger values of zi. In fact, since $\exp(-z^2) < 1$ for all z > 0, the repulsion between pressured and non-pressured points would always be weaker than the repulsion of non-pressured points between each other. Finally, the last term pulls each zi to 0 with a weight proportional to the attraction between point i and all the non-pressured points; its form is identical to an l2 penalty applied to the extended dimension Z with that attraction as the weight.

Since our final objective is not to find the minimum of (7), but rather to get a better embedding of X, we are going to add a couple of additional steps to facilitate this. First, after each iteration of minimizing (7), we are going to update P by removing points that stopped being pressured and adding points that just became pressured. Second, we want pressured points to explore the new dimension only to the extent that it could eventually help lower the original objective function. We want to restrict the use of the new dimension so that it would be harder for the points to use it compared to the original d dimensions. This can be achieved by adding an l2 penalty on the Z dimension as $\mu\sum_{i \in P} z_i^2$. This is an organic extension, since it has the same form as the last term in (7): for μ = 0 the penalty is given by the weight between pressured and non-pressured points. This property gives an advantage to our algorithm compared to the typical use of l2 regularization, where a user has to resort to trial and error in order to find a perfect μ. In our case, the regularizer already exists in the objective and its weight sets a natural scale of μ values to try. Another advantage is that large μ values don't restrict the algorithm: all the points along Z just collapse to 0 and the algorithm falls back to the original one.

Practically, we propose to use a sequence of μ values starting at 0 and increasing proportionally to the magnitude of $d^+_k$, k = 1 ... N. In the experiments below, we set $step = \frac{1}{N}\sum_k d^+_k$, although a more aggressive schedule of $step = \max_k(d^+_k)$ or a more conservative $step = \min_k(d^+_k)$ could be used as well. We increase μ up until zk = 0 for all the points. Typically, this occurs after 4–5 steps.

The resulting method is described in Algorithm 1. The algorithm can be embedded in and run on top of the existing optimization methods for NLE: Spectral Direction and N-body methods.

In Spectral Direction the Hessian is approximated using the second derivative of E+. The search direction has the form $P = (4L^+ + \epsilon)^{-1}G$, where G is the gradient, L+ is the graph Laplacian defined on W+ and ε is a small constant that makes the inverse well defined (since the graph Laplacian is psd). The modified objective that we propose has one more quadratic term, $\mu Z Z^T$, and thus the Hessian for the pressured points along Z is regularized by 2μ. 
This is good for two reasons: it improves the Spectral Direction by adding new bits of Hessian, and it makes the Hessian approximation positive definite, thus avoiding the need to add any constant to it.

Large-scale N-body approximations use Barnes-Hut [Yang et al., 2013, van der Maaten, 2014] or Fast Multipole Methods (FMM, Vladymyrov and Carreira-Perpiñán, 2014) to decrease the cost of the objective function and the gradient from O(N^2) to O(N log N) or O(N) by approximating the interactions between distant points. Pressured points computation uses the same quantities as the gradient, so whichever approximation is applied carries over to pressured points as well.

Figure 5: The optimization of COIL-20 using EE (left) and SNE (center), and the optimization of MNIST using EE (right). The black line shows SD, the green line shows PP initialized at random, and the blue line shows PP initialized from the local minima of SD. The dashed red line indicates the absolute best result that we were able to get with homotopy optimization. Top plots show the change in the objective function, while the bottom plots show the fraction of pressured points at a given iteration. Markers 'o' indicate a change of the μ value.

6 Experiments

Here we are going to compare the original optimization algorithm, which we call simply spectral direction (SD)², to the Pressured Point (PP) framework defined above, using the EE and SNE algorithms. While the proposed methodology could also be applied to t-SNE and UMAP, in practice we were not able to find it useful. t-SNE and UMAP are defined on kernels that have much longer tails than the Gaussian kernel used in EE and SNE. Because of that, the repulsion between points is much stronger and points are spread far away from each other. 
The extra space given by the new dimension is not utilized well, and the objective function decrease is similar with and without the PP modification.

For the first experiment, we run the algorithm on 10 objects from the COIL-20 dataset. We run both SNE and EE 10 different times with the original algorithm until the objective function does not change by more than 10^{-5} per iteration. We then run PP optimization with two different initializations: the same as the original algorithm, and initialized from the convergence value of SD. Over 10 runs for EE, SD got to an average objective function value of 3.84 ± 0.18, whereas PP with random initialization got to 3.6 ± 0.14. Initializing from the convergence of SD, 10 out of 10 times PP was able to find a better local minimum, with an average objective function value of 3.61 ± 0.19. We got similar results for SNE: the average objective function value for SD is 11.07 ± 0.03, which PP improved to 11.03 ± 0.02 for random initialization and to 11.05 ± 0.03 for initialization from the local minima of SD.

In fig. 5 we show the results of one of the runs of EE and SNE for COIL. Notice that for the initial small μ values the algorithm extensively uses and explores the extra dimension, which one can see from the increase in the original objective function values as well as from the large fraction of pressured points. However, for larger μ the number of pressured points drops sharply, eventually going to 0. Once μ gets large enough that the extra dimension is not used, optimization for every new μ goes very fast, since essentially nothing is changing.

As another comparison point, we evaluate how much headroom we have on top of the improvements demonstrated by the PP algorithm. 
For that, we run EE on the COIL dataset with the homotopy method [Carreira-Perpiñán, 2010], where we performed a series of optimizations from a very small λ, for which the objective function has a single global minimum, to the final λ = 200, each time initializing from the previous solution. We got a final value of the objective function around E = 3.28 (dashed red line on the EE objective function plot in fig. 5). While we could not get to the same value with PP, we got very close with E = 3.3 (compared to E = 3.68 for the best SD optimization).

Finally, in the right plot of fig. 5 we show the minimization of MNIST using the FMM approximation with p = 5 accuracy (i.e. truncating the Hermite functions to 5 terms). PP optimization improved the convergence both in the case of random initialization and for initialization from the solution of SD. Thus, the benefits of the PP algorithm can be increased by also applying SD to improve the optimization direction and FMM to speed up the objective function and gradient computation.

² It would be more fair to call our method SD+PP, since we also apply spectral direction to minimize the extended objective function, but we are going to call it simply PP to avoid extra clutter.

Figure 6: Embedding of the subset of word2vec data using EE, optimized with SD (Spectral Direction embedding, left) and further refined by PP (Pressured Point embedding, right). We highlight six word categories that were affected the most by the embedding adjustment.

As a final experiment, we run EE on word embedding vectors pretrained using word2vec [Mikolov et al., 2013] on the Google News dataset. 
The dataset consists of 200 000 word vectors, which we downsampled to the 5 000 most popular English words. We first run SD 100 times with different initializations until the embedding changes by less than 10−5. We then run PP, initialized from SD. Fig. 6 shows the embedding of one of the worst results that we got from SD and the way the embedding improved after running the PP algorithm. We specifically highlight six different word categories for which the embedding improved significantly. Notice that the words from the same category got closer to each other and formed tighter clusters. Note that the more feelings-oriented categories, such as emotion, sensation and nonverbal communication, got grouped together and now occupy the right side of the embedding instead of being spread across. In fig. 7 we show the final objective function values for all 100 runs together with the improvements achieved by continuing the optimization using PP. In the inset, we show the histogram of the final objective function values of SD and PP. While the very best results of SD did not improve much (suggesting that a near-global minimum had been reached), most of the time SD gets stuck in higher regions of the objective function that are improved by the PP algorithm.

Figure 7: The difference in final objective function values between PP and SD for 100 runs on the word2vec dataset using the EE algorithm. See main text for description.

7 Conclusions
We proposed a novel framework for assessing the quality of the most popular manifold learning methods using an intuitive, natural and computationally cheap way to measure the pressure that each point experiences from its neighbors. The pressure is defined as the minimum of the objective function when evaluated along a new extra dimension. We then outlined a method that makes use of that extra dimension in order to find a better embedding location for the pressured points.
Our proposed algorithm is able to reach a better solution both from a converged local minimum of an existing optimizer and when initialized randomly. An interesting future direction is to extend the analysis beyond one extra dimension and see whether there is a connection to the intrinsic dimensionality of the manifold.

Acknowledgments
I would like to thank Nataliya Polyakovska for the initial analysis and Makoto Yamada for useful suggestions that helped improve this work significantly.
References

S. Becker, S. Thrun, and K. Obermayer, editors. Advances in Neural Information Processing Systems (NIPS), volume 15, 2003. MIT Press, Cambridge, MA.

M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, June 2003.

M. Á. Carreira-Perpiñán. The elastic embedding algorithm for dimensionality reduction. In Proc. of the 27th Int. Conf. Machine Learning (ICML 2010), Haifa, Israel, June 21–25 2010.

J. Cook, I. Sutskever, A. Mnih, and G. Hinton. Visualizing similarity data with a mixture of maps. In M. Meilă and X. Shen, editors, Proc. of the 11th Int. Workshop on Artificial Intelligence and Statistics (AISTATS 2007), San Juan, Puerto Rico, Mar. 21–24 2007.

V. de Silva and J. B. Tenenbaum. Global versus local methods in nonlinear dimensionality reduction. In Becker et al. [2003], pages 721–728.

G. Hinton and S. T. Roweis. Stochastic neighbor embedding. In Becker et al. [2003], pages 857–864.

R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294–3302, 2015.

J. A. Lee and M. Verleysen. Quality assessment of dimensionality reduction: Rank-based criteria. Neurocomputing, 72, 2009.

S. Lespinats and M. Aupetit. CheckViz: Sanity check and topological clues for linear and non-linear mappings. In Computer Graphics Forum, volume 30, pages 113–125, 2011.

L. McInnes, J. Healy, and J. Melville. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426, 2018.

T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality.
In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

J. Peltonen and Z. Lin. Information retrieval approach to meta-visualization. Machine Learning, 99(2):189–229, 2015.

L. K. Saul and S. T. Roweis. Think globally, fit locally: Unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4:119–155, June 2003.

L. van der Maaten. Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research, 15:1–21, 2014.

L. J. van der Maaten and G. E. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, November 2008.

J. Venna, J. Peltonen, K. Nybo, H. Aidos, and S. Kaski. Information retrieval perspective to nonlinear dimensionality reduction for data visualization. Journal of Machine Learning Research, 11:451–490, Feb. 2010.

M. Vladymyrov and M. Á. Carreira-Perpiñán. Partial-Hessian strategies for fast learning of nonlinear embeddings. In Proc. of the 29th Int. Conf. Machine Learning (ICML 2012), pages 345–352, Edinburgh, Scotland, June 26 – July 1 2012.

M. Vladymyrov and M. Á. Carreira-Perpiñán. Linear-time training of nonlinear low-dimensional embeddings. In Proc. of the 17th Int. Workshop on Artificial Intelligence and Statistics (AISTATS 2014), pages 968–977, Reykjavik, Iceland, Apr. 22–25 2014.

M. Wattenberg, F. Viégas, and I. Johnson. How to use t-SNE effectively. Article at https://distill.pub/2016/misread-tsne/, 2016.

Z. Yang, J. Peltonen, and S. Kaski. Scalable optimization for neighbor embedding for visualization. In Proc. of the 30th Int. Conf.
Machine Learning (ICML 2013), pages 127–135, Atlanta, GA, 2013.

Z. Yang, J. Peltonen, and S. Kaski. Majorization-minimization for manifold embedding. In Proc. of the 18th Int. Workshop on Artificial Intelligence and Statistics (AISTATS 2015), pages 1088–1097, San Diego, CA, May 9–12 2015.