{"title": "Dimensionality Reduction has Quantifiable Imperfections: Two Geometric Bounds", "book": "Advances in Neural Information Processing Systems", "page_first": 8453, "page_last": 8463, "abstract": "In this paper, we investigate dimensionality reduction (DR) maps in an information retrieval setting from a quantitative topology point of view. In particular, we show that no DR map can achieve perfect precision and perfect recall simultaneously. Thus a continuous DR map must have imperfect precision. We further prove an upper bound on the precision of Lipschitz continuous DR maps. While precision is a natural measure in an information retrieval setting, it does not measure `how' wrong the retrieved data is. We therefore propose a new measure based on the Wasserstein distance that comes with a similar theoretical guarantee. A key technical step in our proofs is a particular optimization problem of the $L_2$-Wasserstein distance over a constrained set of distributions. We provide a complete solution to this optimization problem, which can be of independent interest on the technical side.", "full_text": "Dimensionality Reduction has Quantifiable Imperfections: Two Geometric Bounds

Kry Yik Chau Lui, Borealis AI, Canada (yikchau.y.lui@borealisai.com)
Gavin Weiguang Ding, Borealis AI, Canada (gavin.ding@borealisai.com)
Ruitong Huang, Borealis AI, Canada (ruitong.huang@borealisai.com)
Robert J. McCann, Department of Mathematics, University of Toronto, Canada (mccann@math.toronto.edu)

Abstract

In this paper, we investigate dimensionality reduction (DR) maps in an information retrieval setting from a quantitative topology point of view. In particular, we show that no DR map can achieve perfect precision and perfect recall simultaneously. Thus a continuous DR map must have imperfect precision. We further prove an upper bound on the precision of Lipschitz continuous DR maps. 
While precision is a natural measure in an information retrieval setting, it does not measure 'how' wrong the retrieved data is. We therefore propose a new measure based on the Wasserstein distance that comes with a similar theoretical guarantee. A key technical step in our proofs is a particular optimization problem of the L2-Wasserstein distance over a constrained set of distributions. We provide a complete solution to this optimization problem, which can be of independent interest on the technical side.

1 Introduction

Dimensionality reduction (DR) is a core problem in machine learning tasks including information compression, clustering, manifold learning, feature extraction, logits and other modules in a neural network, and data visualization [16, 8, 34, 19, 25]. In many machine learning applications, the data manifold is reduced to a dimension lower than its intrinsic dimension (e.g. for data visualization, the output dimension is reduced to 2 or 3; for classification, it is the number of classes). In such cases, it is not possible to have a continuous bijective DR map (by the classic algebraic topology result on invariance of dimension [26]). With different motivations, many nonlinear DR maps have been proposed in the literature, such as Isomap, kernel PCA, and t-SNE, just to name a few [31, 33, 22]. A common way to compare the performances of different DR maps is to use a downstream supervised learning task as the ground truth performance measure. However, when such a downstream task is unavailable, e.g. in an unsupervised learning setting as above, one has to design a performance measure based on the particular context. In this paper, we focus on the information retrieval setting, which falls into this case. An information retrieval system extracts the features f(x) from the raw data x for future queries. 
When a new query y0 = f(x0) is submitted, the system returns the most relevant data with similar features, i.e. all the x such that f(x) is close to y0. For computational efficiency and storage, f is usually a DR map, retaining only the most informative features. Assume that the ground truth relevant data of x0 is defined as a neighbourhood U of x that is a ball with radius rU centered at x,¹ and that the system retrieves the data based on relevance in the feature space, i.e. the inverse image, f⁻¹(V), of a retrieval neighbourhood V ∋ f(x0). Here V is the ball centered at y0 = f(x0) with radius rV, which is determined by the system. It is natural to measure the system's performance based on the discrepancy between U and f⁻¹(V). Many empirical measures of this discrepancy have been proposed in the literature, among which precision and recall are arguably the most popular [32, 23, 20, 34]. However, theoretical understanding of these measures is still very limited.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

In this paper, we start by analyzing the theoretical properties of precision and recall in the information retrieval setting. Naively computing precision and recall in the discrete setting gives undesirable properties, e.g. precision always equals recall when computed using k nearest neighbors. How to measure them properly is unclear in the literature (Section 3.2). On the other hand, numerous experiments have suggested that there exists a tradeoff between the two when dimensionality reduction happens [34], yet this tradeoff remains a conceptual mystery in theory. To understand this tradeoff theoretically, we look for continuous analogues of precision and recall, and exploit the geometric and function-analytic tools that study dimensionality reduction maps [15]. 
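The retrieval setup described above is simple to sketch in code. The following is a minimal illustration, not the paper's construction: the corpus, the linear map standing in for f, and all parameter values are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: 1000 raw data points in R^20; a (hypothetical) linear DR map to R^3.
X = rng.normal(size=(1000, 20))
P = rng.normal(size=(20, 3)) / np.sqrt(20)
f = lambda x: x @ P

def retrieve(x0, r_V):
    """Return indices of all x whose features f(x) lie within r_V of y0 = f(x0)."""
    y0 = f(x0)
    dists = np.linalg.norm(f(X) - y0, axis=1)
    return np.flatnonzero(dists < r_V)

hits = retrieve(X[0], r_V=0.5)
print(len(hits))  # number of retrieved items; always includes x0 itself
```

The retrieved set is exactly the sample analogue of f⁻¹(V): every point whose feature lies in the ball V of radius r_V around y0.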
The first question we ask is what property a DR map should have so that the information retrieval system can attain zero false positive error (or false negative error) when the relevant neighbourhood U and the retrieved neighbourhood V are properly selected. Our analysis shows the equivalence between the achievability of perfect recall (i.e. zero false negatives) and the continuity of the DR map. We further prove that no DR map can achieve both perfect precision and perfect recall simultaneously. Although it may seem intuitive, to the best of our knowledge this is the first theoretical guarantee in the literature of the necessity of the tradeoff between precision and recall in a dimension reduction setting.

Our main results are developed for the class of (Lipschitz) continuous DR maps. The first main result of this paper is an upper bound for the precision of a continuous DR map. We show that given a continuous DR map, its precision decays exponentially fast with respect to the number of (intrinsic) dimensions reduced. To the best of our knowledge, this is the first theoretical result in the literature on the decay rate of the precision of a dimensionality reduction map. The second main result is an alternative measure for the performance of a continuous DR map, called the W2 measure, based on the L2-Wasserstein distance. This new measure is more desirable as it can also detect the distance distortion between U and f⁻¹(V). Moreover, we show that our measure also enjoys a theoretical lower bound for continuous DR maps. Several other distance-based measures have been proposed in the literature [32, 23, 20, 34], yet all are proposed heuristically, with meagre theoretical understanding. Simulation results suggest that optimizing the Wasserstein measure lower bound corresponds to optimizing a weighted f-1 score (i.e. f-β score). 
Thus we may optimize precision and recall without dealing with their computational difficulties in the discrete setting.

Finally, let us make some comments on the technical parts of the paper. The first key step is the Waist Inequality from the field of quantitative algebraic topology. At a high level, we need to analyse f⁻¹(V), the inverse image of an open ball under an arbitrary continuous map f. The waist inequality guarantees the existence of a 'large' fiber, which allows us to analyse f⁻¹(V) and prove our first main result. We further show that in a common setting, a significant proportion of fibers are actually 'large'. For our second main result, a key step in the proof is a complete solution to the following iterated optimization problem:

inf_{W : Vol_n(W) = M} W₂(P_{B_r}, P_W) = inf_{W : Vol_n(W) = M} inf_{ξ ∈ Ξ(P_{B_r}, P_W)} E_{(a,b)∼ξ}[‖a − b‖₂²]^{1/2},

where B_r is a ball with radius r, P_{B_r} (respectively P_W) is the uniform distribution over B_r (respectively W), and W₂ is the L2-Wasserstein distance. Unlike a typical optimal transport problem, where the transport map between fixed source and target distributions is optimized, in the above problem the source distribution is also optimized at the outer level. This makes it a difficult constrained iterated optimization problem. To address it, we borrow tools from optimal partial transport theory [9, 11]. Our proof techniques leverage the uniqueness of the solution to the optimal partial transport problem and the rotational symmetry of B_r to deduce W.

¹The value of rU is unknown, and it depends on the user and the input data x0. However, we can assume rU is small compared to the input domain size. For example, the number of relevant items to a particular user is much fewer than the number of total items.

1.1 Notations

We collect our notations in this section. 
Let m be the embedding dimension and M be an n-dimensional data manifold² embedded in R^N, where N is the ambient dimension. M is typically modelled as a Riemannian manifold, so it is a metric space with a volume form. Let m < n < N and f : M ⊂ R^N → R^m be a DR map. The pair (x, y) will be the points of interest, where y = f(x). The inverse image of y under the map f is called a fiber, denoted f⁻¹(y). We say f is continuous at a point x iff osc_f(x) = 0, where osc_f(x) = inf_{U open, x ∈ U} diam(f(U)) is the oscillation of f at x ∈ M. We say f is one-to-one or injective when every fiber f⁻¹(y) is a singleton set {x}.

We let A ⊕ ε := {x ∈ R^N | d(x, A) < ε} denote the ε-neighborhood of the nonempty set A. In R^N, we note the ε-neighborhood of the nonempty set A is the Minkowski sum of A with B^N_ε(0), where the Minkowski sum between two sets A and B is A ⊕ B = {a + b | a ∈ A, b ∈ B}. For example, an n-dimensional open ball with radius r centered at a point x can be expressed as B^n_r(x) = x ⊕ B^n_r(0) = x ⊕ r, where the last expression is used to simplify notation. If not specified, the dimension of the ball is n. We also use B_r to denote the ball with radius r when its center is irrelevant. Similarly, S^n_r denotes the n-dimensional sphere in R^{n+1} with radius r. Let Vol_n denote n-dimensional volume.³ When the intrinsic dimension of A is greater than n, we set Vol_n(A) = ∞.

Throughout the rest of the paper, we use U to denote B_{rU}(x), a ball with radius rU centered at x, and V = B_{rV}(y), a ball with radius rV centered at y. These are metric balls in a metric space. For example, they are geodesic balls in a Riemannian manifold, whenever they are well defined. In Euclidean spaces, U is a Euclidean ball with the L2 norm. By T#(µ) = ν, we mean a map T pushes forward a measure µ to ν, i.e. 
ν(B) = µ(T⁻¹(B)) for any Borel set B. We say a measure µ is dominated by another measure ν if for every measurable set A, µ(A) ≤ ν(A).

2 Precision and recall

We present the definitions of precision and recall in a continuous setting in this section. We then prove the equivalence between perfect recall and continuity, followed by a theorem on the necessary tradeoff between perfect recall and perfect precision for a dimension reduction information retrieval system. The main result of this section is a theoretical upper bound for the precision of a continuous DR map.

2.1 Precision and recall

While precision and recall are commonly defined based on finite counts in practice, when analysing DR maps between spaces it is natural to extend their definitions to a continuous setting as follows.

Definition 1 (Precision and Recall). Let f be a continuous DR map. Fix (x, y = f(x)), rU > 0 and rV > 0, and let U = B_{rU}(x) ⊂ R^N and V = B^m_{rV}(y) ⊂ R^m be the balls with radii rU and rV respectively. The precision and recall of f at U and V are defined as:

Precision_f(U, V) = Vol_n(f⁻¹(V) ∩ U) / Vol_n(f⁻¹(V));    Recall_f(U, V) = Vol_n(f⁻¹(V) ∩ U) / Vol_n(U).

We say f achieves perfect precision at x if for every rU there exists rV such that Precision_f(U, V) = 1. Also, f achieves perfect recall at x if for every rV there exists rU such that Recall_f(U, V) = 1. Finally, we say f achieves perfect precision (respectively, perfect recall) in an open set W if f achieves perfect precision (respectively, perfect recall) at w for every w ∈ W.

Note that perfect precision requires f⁻¹(V) ⊂ U except for a measure zero set. Similarly, perfect recall requires U ⊂ f⁻¹(V) except for a measure zero set. Figure 1 illustrates the precision and recall defined above. 
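Definition 1 can be estimated numerically by replacing volumes with counts of uniform samples. The following sketch uses a coordinate projection as the DR map and a cube as the sampling domain; both are illustrative choices of ours, not constructions from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 2                      # intrinsic and embedding dimensions (illustrative)
N_samples = 200_000

# Uniform samples in the cube [-1, 1]^n stand in for the volume measure Vol_n.
X = rng.uniform(-1, 1, size=(N_samples, n))
f = lambda x: x[..., :m]         # hypothetical DR map: coordinate projection

x0 = np.zeros(n)
rU, rV = 0.3, 0.3
in_U = np.linalg.norm(X - x0, axis=1) < rU            # membership in U
in_fV = np.linalg.norm(f(X) - f(x0), axis=1) < rV     # membership in f^{-1}(V)

precision = (in_U & in_fV).sum() / in_fV.sum()
recall = (in_U & in_fV).sum() / in_U.sum()
print(precision, recall)
```

For a projection with rV = rU, every point of U lands inside f⁻¹(V), so the estimated recall is exactly 1, while f⁻¹(V) is a 'cylinder' far larger than the ball U, so the estimated precision is far below 1, in line with the tradeoff discussed next.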
To measure the performance of the information retrieval system, we would like to understand how different f⁻¹(V) is from the ideal response U = B_{rU}(x). Precision and recall provide two meaningful measures for this difference based on their volumes.

²There is empirical and theoretical evidence that data distributions lie on low dimensional submanifolds of the ambient space [27].
³Let A be a set. In Euclidean space, Vol_n(A) = L^n(A) is the Lebesgue measure. For a general n-rectifiable set, Vol_n(A) = H^n(A) is the Hausdorff measure. When A is not rectifiable, Vol_n(A) = M^n_*(A) is the lower Minkowski content.

Figure 1: Illustration of precision and recall.

Note that f achieving perfect precision at x implies that no matter how small the relevant radius rU is for the image, the system is able to achieve zero false positives by picking a proper rV. Similarly, perfect recall at x implies that no matter how small rV is, the system will not miss the most relevant images around x.

In fact, the definitions of perfect precision and perfect recall are closely related to the continuity and injectivity of a function f. Here we only present an informal statement. Rigorous statements are given in Appendix B.

Proposition 1. Perfect recall is equivalent to continuity. If f is continuous, then perfect precision is equivalent to injectivity.

The next result shows that no DR map f, continuous or not, can achieve perfect recall and perfect precision simultaneously, a widely observed but unproved phenomenon in practice. In other words, it rigorously justifies the intuition that perfectly maintaining the local neighbourhood structure is impossible for a DR map.

Theorem 1 (Precision and Recall Tradeoff). Let n > m, M ⊂ R^N be a Riemannian n-dimensional submanifold. 
Then for any (dimensionality reduction) map f : M → R^m and any open set W ⊂ M, f cannot achieve both perfect precision and perfect recall on W.

2.2 Upper bound for the precision of a continuous DR map

In this section, we provide a quantitative analysis of the imperfection of f. In particular, we prove an upper bound for the precision of a continuous DR map f (thus f achieves perfect recall). For simplicity, we assume the domain of f is an n-ball with radius R embedded in R^N, denoted by B^n_R. Our main tool is the Waist Inequality [29, 1] from quantitative topology. See Appendix A for an exact statement.

Intuitively, the Waist Inequality guarantees the existence of y ∈ R^m such that f⁻¹(y) is a 'large' fiber. If f is also L-Lipschitz, then for p in a small neighbourhood V of y, f⁻¹(p) is also a 'large' fiber, and thus f⁻¹(V) has a positive volume in M. Exploiting the lower bound for Vol_n(f⁻¹(V)) leads to our upper bound in Theorem 2 on the precision of f, Precision_f(U, V). A rigorous proof is given in Appendix C.

Theorem 2 (Precision Upper Bound, Worst Case). Assume n > m, and that f : B^n_R → R^m is a continuous map with Lipschitz constant L. Let rU and rV > 0 be fixed. Denote

D(n, m) = Γ((n − m)/2 + 1) Γ(m/2 + 1) / Γ(n/2 + 1).    (1)

Then there exists y ∈ R^m such that for any x ∈ f⁻¹(y), we have:

Precision_f(U, V) ≤ D(n, m) (rU/R)^{n−m} · rU^m / p_m(rV/L),    (2)

where p_m(r) is r^m (1 + o(1)), i.e. lim_{r→0} p_m(r)/r^m = 1.

Remark 1. Key to the bound is the Waist Inequality. As such, upper bounds on precision for other spaces (e.g. the cube, see Klartag [17]) can be established, provided there is a Waist Inequality for the space. 
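For intuition on the decay, the bound of Theorem 2 is easy to evaluate numerically. The sketch below uses the small-radius approximation p_m(r) ≈ r^m; all parameter values are illustrative assumptions, not values from the paper.

```python
from math import gamma

def D(n, m):
    # D(n, m) = Γ((n-m)/2 + 1) Γ(m/2 + 1) / Γ(n/2 + 1), as in Eq. (1)
    return gamma((n - m) / 2 + 1) * gamma(m / 2 + 1) / gamma(n / 2 + 1)

def precision_upper_bound(n, m, r_U, r_V, R, L):
    # Eq. (2), approximating p_m(r) by r^m (valid for small r)
    return D(n, m) * (r_U / R) ** (n - m) * r_U ** m / (r_V / L) ** m

# The bound shrinks exponentially as the number of reduced dimensions n - m grows.
for n in [5, 10, 20, 40]:
    print(n, precision_upper_bound(n, m=2, r_U=0.1, r_V=0.1, R=1.0, L=1.0))
```

With rU/R = 0.1 fixed, each extra reduced dimension multiplies the bound by another factor of rU/R, which is the exponential decay in n − m discussed in Remark 2.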
The Euclidean norm setting can also be extended to arbitrary norms by exploiting convex geometry (see Akopyan and Karasev [2]). Rigorous proofs are given in Appendix C.

Remark 2. With m fixed as a constant, note that D(n, m) decays asymptotically at a rate of (1/n)^{m/2}. Also note that rU < R implies (rU/R)^{n−m} decays exponentially. Typically, L can grow at a rate of √n. Moreover, while p_m(r)'s behaviour is given asymptotically, it is independent of n. Thus the decay of the upper bound is dominated by the exponential rate in n − m. For fixed n, m, this upper bound can be trivial when rU ≫ rV. However, this rarely happens in practice in the information retrieval setting. Note that the number of relevant items, which is indexed by rU, is often smaller than the number of retrieved items, which depends on rV, while both are much smaller than the number of total items, indexed by R.

We note however that this bound depends on the intrinsic dimension n. When n ≪ N and the ambient dimension N is used in its place, the upper bound could be misleading in practice, as it is much smaller than it should be. To estimate this bound in practice, a good estimate of the intrinsic dimension [13] is needed, which is an active research topic in its own right and beyond the scope of this paper.

Theorem 2 guarantees the existence of a particular point y ∈ R^m where the precision of f on its neighbourhood is small. It is natural to ask whether this is also true in an average sense, for every y. In other words, we know an information retrieval system based on DR maps always has a blind spot, but is this blind-spot behaviour typical? In general, when m > 1, it is not, due to a recent counterexample constructed by Alpert and Guth [3]. However, our next result shows that for a large class of continuous DR maps used in the field, such an upper bound still holds with high probability.

Theorem 3 (Precision Upper Bound, Average Case). 
Assume n > m and that B^n_R is equipped with the uniform probability distribution. Consider the following cases:

• case 1: m = 1 and f : B^n_R → R^m is L-Lipschitz continuous, or
• case 2: f : B^n_R → R^m is a k-layer feedforward neural network map with Lipschitz constant L, with surjective linear maps in each layer.

Let 0 < δ² < R² − rU² and rU, rV > 0 be fixed. Then with probability at least q₁ for case 1, or q₂ for case 2, it holds that

Precision_f(U, V) ≤ D(n, m) (rU/√(rU² + δ²))^{n−m} · rU^m / p_m(rV/L),    (3)

where

q₁ = (1/(2πR)) ∫_{B^m_ℜ} Vol_{n−m+1}(Proj₁⁻¹(t)) dt / Vol_n(B^n_R),    q₂ = ∫_{B^m_ℜ} Vol_{n−m}(Proj₂⁻¹(t)) dt / Vol_n(B^n_R),

ℜ = √(R² − rU² − δ²), and Proj₁ : S^{n+1}_R → R^m and Proj₂ : B^n_R → R^m are arbitrary surjective linear maps. Furthermore,

lim_{(rU² + δ²)/R² → 0} q₁ = 1,    lim_{(rU² + δ²)/R² → 0} q₂ = 1.

See Appendix D for an explicit characterization of Proj₁⁻¹(t) and Proj₂⁻¹(t). Theorems 2 and 3 together suggest that practitioners should be cautious in applying and interpreting DR maps.

One important application of DR maps is data visualization. Among the many algorithms, t-SNE's empirical success has made it the de facto standard. While [5] shows that t-SNE can recover inter-cluster structure in some provable settings, the resulting intra-cluster embedding will very likely be subject to the constraints given in our work.⁴ For example, recall within a cluster will be good, but the intra-cluster precision will not be. In more general cases and/or when the perplexity is too small, t-SNE can create artificial clusters, separating neighboring datapoints. The resulting visualization embedding may enjoy higher precision, but its recall suffers. The interested reader is referred to Appendix G.1 for more experimental illustrations. Our work thus sheds light on the inherent tradeoffs in any visualization embedding. It also suggests that any data visualization for exploratory data analysis should be accompanied by a reliability measure, which quantifies how well a low dimensional visualization represents the true underlying high dimensional neighborhood structure.⁵

⁴Strictly speaking, the DR maps induced by t-SNE may not be continuous, and hence our theorems do not apply directly. However, since we can measure how closely parametric t-SNE (which is continuous) behaves like t-SNE, and there is empirical evidence of their similarity [21], our theorems may apply again.

3 Wasserstein measure

Intuitively, we would like to measure how different the original neighbourhood U of x is from the retrieved neighbourhood f⁻¹(V) obtained by using the neighbourhood of f(x) in R^m. Precision and recall in Section 2.1 provide a semantically meaningful way to do this, and we gave a non-trivial upper bound on precision when the feature extraction is a continuous DR map. However, precision and recall are purely volume-based measures. It would be more desirable if the measure could also reflect information about the distance distortions between U and f⁻¹(V). In this section, we propose an alternative measure reflecting such information, based on the L2-Wasserstein distance. Efficient algorithms for computing the empirical Wasserstein distance exist in the literature [4]. Unlike the measure proposed in Venna et al. 
[34], our measure also enjoys a theoretical guarantee similar to Theorem 2, which provides a non-trivial characterization of the imperfection of dimension reduction information retrieval.

Let P_U (respectively, P_{f⁻¹(V)}) denote the uniform probability distribution over U (respectively, f⁻¹(V)), and let Ξ(P_U, P_{f⁻¹(V)}) be the set of all joint distributions over B^n_R × B^n_R whose marginal distributions are P_U over the first B^n_R and P_{f⁻¹(V)} over the second B^n_R. We propose to measure the difference between U and f⁻¹(V) by the L2-Wasserstein distance between P_U and P_{f⁻¹(V)}:

W₂(P_U, P_{f⁻¹(V)}) = inf_{ξ ∈ Ξ(P_U, P_{f⁻¹(V)})} E_{(a,b)∼ξ}[‖a − b‖₂²]^{1/2}.

In practice, it is reasonable to assume that Vol_n(U) is small in most retrieval systems. In such cases, a low W₂(P_U, P_{f⁻¹(V)}) cost is closely related to high precision retrieval. To see this, when Vol_n(U) is small, achieving high precision retrieval requires a small Vol_n(f⁻¹(V)), which is a precise quantitative way of saying that f is roughly injective. Moreover, as seen in Section 2.1, f being roughly injective corresponds to f giving high precision retrieval. As a result, we can expect high precision retrieval performance when optimizing the W₂(P_U, P_{f⁻¹(V)}) measure. This relation is also empirically confirmed in the simulation in Section 3.2.

Besides its computational benefits, for a continuous DR map f the following theorem provides a lower bound on W₂(P_U, P_{f⁻¹(V)}) with a similar flavour to the precision upper bound in Theorem 2.

Theorem 4 (Wasserstein Measure Lower Bound). Let n > m and let f : B^n_R → R^m be an L-Lipschitz continuous map, where R is the radius of the ball B^n_R. There exists y ∈ R^m such that for any x ∈ f⁻¹(y), any rU > 0 such that B^n_{rU}(x) ⊂ B^n_R, and any rV > 0 such that r ≥ rU, we have:

W₂²(P_U, P_{f⁻¹(V)}) ≥ (n/(n + 2)) (r − rU)²,

where r = (Γ(n/2 + 1) / (Γ((n − m)/2 + 1) Γ(m/2 + 1)))^{1/n} (p_m(rV/L))^{1/n} R^{(n−m)/n}. In particular, as n → ∞,

W₂²(P_U, P_{f⁻¹(V)}) = Ω((R − rU)²).

We sketch the proof here; a complete proof can be found in Appendix E. The proof starts with a lower bound on Vol_n(f⁻¹(V)) given by the topologically flavored waist inequality (Equation (6)). Heuristically, Vol_n(f⁻¹(V)) is much larger than Vol_n(U) when n ≫ m and R ≫ rU. The main component of the proof is to establish an explicit lower bound for W₂(P_U, P_W) over all possible W of a fixed volume V,⁶ where U is a ball with radius rU, as shown in Theorem 5. In particular, we prove that the shape of the optimal W* must be rotationally invariant, so W* must be a union of spheres. This is achieved by leveraging the uniqueness of the solution to the optimal partial transport problem [9, 11]. We then prove that the optimal solution for W is the ball that shares its center with U.

⁵Such attempts existed in the literature on visualization of dimensionality reduction (e.g. [34]). However, since these works are based on heuristics, it is less clear what they measure, nor do they enjoy a theoretical guarantee.
⁶An antecedent of this problem was studied in Section 2.3 of [24], where the authors optimize over the more restricted class of ellipses with fixed area. For our purpose, the minimization is over bounded measurable sets.

Theorem 5. Let U = B_{rU} and V ≥ Vol(U). 
Then

inf_{W : Vol_n(W) ≥ V} W₂(P_U, P_W) = inf_{W : Vol_n(W) = V} W₂(P_U, P_W) = W₂(P_U, P_{B_{rV}}),

where B_{rV} is an rV-ball with the same center as U such that Vol_n(B_{rV}) = V. Moreover, T(x) = (rU/rV) x, for x ∈ B_{rV}, is the optimal transport map (up to a measure zero set), so that

W₂²(P_U, P_{B_{rV}}) = ∫_{B_{rV}} |x − T(x)|² dP_{B_{rV}}(x).

Complementarily, when 0 < V < Vol_n(U), the infimum inf_{W : Vol_n(W) = V} W₂(P_U, P_W) = 0 is not attained by any set. On the other hand, inf_{W : Vol_n(W) ≥ V} W₂(P_U, P_W) = 0 by taking W = U.

Remark 3. Our lower bound in Theorem 4 is (asymptotically) tight. Note that by Theorem 4, W₂²(P_U, P_{f⁻¹(V)}) has a (maximum) lower bound of scale (R − rU)². On the other hand, by Theorem 5, W₂²(P_U, P_{f⁻¹(V)}) ≤ W₂²(P_U, P_{B^n_R}) = Ω((R − rU)²), where the equality follows from standard algebraic calculations.

3.1 Iso-Wasserstein inequality

We believe Theorem 5 is of independent interest, as it has the same flavor as the isoperimetric inequality (see Appendix A for an exact statement), which is arguably the most important inequality in metric geometry. In fact, the first statement of Theorem 5 can be restated as the following inequality:

Theorem 6 (Iso-Wasserstein Inequality). Let B_{r1}, B_{r2} ⊂ B^n_R be two concentric n-balls with radii r1 ≤ r2 centered at the origin. For all measurable A ⊂ B^n_R with Vol_n(A) = Vol_n(B_{r2}), we have

W₂(P(A), P(B_{r1})) ≥ W₂(P(B_{r2}), P(B_{r1})),

where P(S) denotes the uniform probability distribution on S, i.e. P(S) has density 1/Vol_n(S).

Recall that an isoperimetric inequality in Euclidean space roughly says that balls have the least perimeter among all equal-volume sets. 
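The extremal configuration in Theorem 5, two concentric uniform balls, makes the transport cost of the radial map T(x) = (rU/rV)x explicit: E‖x − T(x)‖² = (1 − rU/rV)² E‖x‖² = n/(n + 2) (rV − rU)², the same n/(n + 2) factor that appears in Theorem 4. A quick Monte Carlo check (the sampling scheme below is our illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_ball(n_pts, dim, radius, rng):
    """Uniform samples in a dim-ball: Gaussian direction, radius ~ U^(1/dim)."""
    d = rng.normal(size=(n_pts, dim))
    d /= np.linalg.norm(d, axis=1, keepdims=True)
    r = radius * rng.uniform(size=(n_pts, 1)) ** (1 / dim)
    return d * r

n, r_U, r_V = 6, 0.5, 1.0
X = sample_ball(100_000, n, r_V, rng)

# Transport cost of the radial map T(x) = (r_U / r_V) x from P_{B_{r_V}} to P_{B_{r_U}}
cost = np.mean(np.sum((X - (r_U / r_V) * X) ** 2, axis=1))
closed_form = n / (n + 2) * (r_V - r_U) ** 2
print(cost, closed_form)   # the two values agree up to Monte Carlo error
```

The agreement reflects the identity E‖x‖² = n rV²/(n + 2) for the uniform distribution on an n-ball of radius rV.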
Theorem 6 acts as a transportation cousin of the isoperimetric inequality. While the isoperimetric inequality compares the (n − 1)-volume of two sets, the iso-Wasserstein inequality compares their Wasserstein distances to a small ball. The extrema in both inequalities are attained by Euclidean balls.

3.2 Simulations

In this section, we demonstrate on a synthetic dataset that our lower bound in Theorem 4 can provide reasonable guidance for selecting the retrieval neighborhood radius rV, and that it emphasizes high precision. The simulation computes the optimal rV by minimizing the lower bound in Theorem 4, given a relevant neighborhood radius rU and embedding dimension m. Note that minimizing the lower bound instead of the exact cost itself is beneficial, as it avoids the direct computation of the cost. Recall that the lower bound of W₂(P_U, P_{f⁻¹(V)}) is (asymptotically) tight (Remark 3) and matches its upper bound when n − m ≫ 0. If the lower bound behaves roughly like W₂(P_U, P_{f⁻¹(V)}), our simulation result also serves as empirical evidence that W₂(P_U, P_{f⁻¹(V)}) weighs more heavily towards high precision.

Specifically, we generate 10000 uniformly distributed samples in a 10-dimensional unit ℓ2-ball. We choose rU such that on average each data point has 500 neighbors inside B_{rU}. We then linearly project these 10-dimensional points into lower dimensional spaces with embedding dimension m from 1 to 9. For each m, a different rV is used to calculate discrete precision and recall. This simulates how the optimal rV according to the Wasserstein measure changes with respect to m. The result is shown on the left in Figure 2. Similarly, we can fix m = 5 and track the optimal rV's behavior as rU changes. This is shown on the right in Figure 2.

We evaluate our measures based on traditional information retrieval metrics such as the f-score. 
To compute it, we need the discrete/sample-based precision and recall. As discussed in the introduction, a naive sample-based calculation of precision and recall makes Precision = Recall at all times. We instead compute them by discretizing Definition 1, fixing radii rU and rV, so that each U and f⁻¹(V) contains a different number of neighbors:

Precision = #(points within rU of x and within rV of y) / #(points within rV of y),    (4)

Recall = #(points within rU of x and within rV of y) / #(points within rU of x).    (5)

Figure 2: Precision and recall results on uniform samples in a 10-dimensional unit ball. The left figure contains precision-recall curves for a fixed rU, where the optimal rV is chosen according to m = 1, ..., 9. The right figure plots the curves for m = 5 with the optimal rV chosen for different rU, where rU is indexed by k, the average number of neighbors across all points.

The optimal rV according to the lower bound in Theorem 4 (the blue circle-dash-dotted line) aligns closely with the optimal f-score at β = 0.3, where the β-weighted f-score, also known as the f-β score, is:

(1 + β²) · Precision · Recall / (β² · Precision + Recall).

Note that an f-score with β < 1 indeed emphasizes high precision. In this provable setting, we have demonstrated our bound's utility. This shows the W₂ measure's potential for evaluating dimension reduction. In general cases, we will not have such tight lower bounds, and it is natural to optimize according to the sample-based W₂ measures instead. 
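For the sample-based W₂ measure, the distance between two equal-size empirical distributions with uniform weights can be computed exactly as an optimal assignment of sample pairs. A minimal sketch (the shifted-Gaussian data is only a sanity check of ours, not the paper's setup):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def empirical_w2(A, B):
    """Exact L2-Wasserstein distance between two equal-size, uniformly
    weighted empirical distributions, via an optimal assignment."""
    # Cost matrix of squared Euclidean distances between all sample pairs
    C = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    rows, cols = linear_sum_assignment(C)
    return np.sqrt(C[rows, cols].mean())

rng = np.random.default_rng(3)
A = rng.normal(size=(300, 2))
B = rng.normal(size=(300, 2)) + np.array([5.0, 0.0])  # shifted copy
print(empirical_w2(A, B))   # close to the shift magnitude, 5
```

The reduction to an assignment problem is exact here because, for uniform empirical measures of equal size, an optimal coupling can always be taken to be a permutation of the samples.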
We performed some preliminary experiments with this heuristic, shown in Appendix G.

4 Relation to metric space embedding and manifold learning

Lastly, we situate our work within the lines of research on metric space embedding and manifold learning. One obvious difference between our work and the literature on metric space embedding and manifold learning is that our work mainly focuses on intrinsic dimensionality reduction maps, i.e. n ≫ m, while in metric space embedding and manifold learning, having n ≤ m < N is common.

Our work also differs from the literature on metric space embedding and manifold learning in its learning objective. Learning in these fields aims to preserve the metric structure of the data. Our work attempts to preserve precision and recall, a weaker structure in the sense of embedding dimension (Proposition 2). While they typically look for the lowest embedding dimension subject to a certain loss (e.g. smoothness, isometry, etc.), our learning goal, in contrast, is to minimize the loss (precision and recall, etc.) subject to a fixed embedding dimension constraint. In these cases, the desired structures will break (Theorem 3) because we cannot choose the embedding dimension m (e.g. for visualization m = 2; for classification m = the number of classes).

We now discuss the technical relations with metric space embedding and manifold learning. Many datasets can be modelled as a finite metric space M_k with k points. A natural unsupervised learning task is to learn an embedding that approximately preserves pairwise distances. The Bourgain embedding [7] guarantees that the metric structure can be preserved with distortion O(log k) in l_p^{O(log² k)}. When the samples are collected in Euclidean spaces, i.e. M_k ⊂ l₂, the Johnson-Lindenstrauss lemma [10] improves the distortion to (1 + ε) in l₂^{O(log(k/ε²))}. 
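The Johnson-Lindenstrauss behaviour is easy to observe empirically with a random Gaussian projection; the dimensions below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
k, N, m = 50, 1000, 300          # number of points, ambient dim, target dim

X = rng.normal(size=(k, N))
G = rng.normal(size=(N, m)) / np.sqrt(m)   # scaled random Gaussian projection
Y = X @ G

def pairwise(Z):
    return np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)

orig, proj = pairwise(X), pairwise(Y)
iu = np.triu_indices(k, 1)
distortion = np.abs(proj[iu] / orig[iu] - 1).max()
print(distortion)   # small: all pairwise distances preserved within 1 ± ε
```

Because m grows only logarithmically in k for a fixed ε, this kind of embedding reduces dimension dramatically while preserving the global metric structure, which is precisely the structure our setting does not insist on.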
These embeddings approximately preserve all pairwise distances: the global metric structure of Mk is compatible with the ambient vector space norms. Coming back to our work, it is natural to mimic this approach for precision and recall in Mk. The first problem is that the naive sample-based precision and recall are always equal (Section 3.2). A second problem is that discrete precision and recall form a non-differentiable objective. In fact, the difficulty of analyzing discrete precision and recall motivates us to look for continuous analogues.

Roughly, our approach is somewhat similar to manifold learning, where researchers postulate that the data Mk are sampled from a continuous manifold M, typically a smooth or Riemannian manifold with intrinsic dimension n. In this setting, one is interested in embedding M into l2 locally isometrically. Then one designs learning algorithms that combine the local information to learn some global structure of M. By relaxing to the continuous case, just like our setting, manifold learning researchers gain access to a vast literature in geometry. By the Whitney embedding [25], M can be smoothly embedded into R2n. By the Nash embedding [35], a compact Riemannian manifold M can be isometrically embedded into Rp(n), where p(n) is a quadratic polynomial. Hence the task in manifold learning is well-posed: one seeks an embedding f : M ⊂ RN → Rm with m ≤ 2n ≪ N in the smooth category, or m ≤ p(n) ≪ N in the Riemannian category. Note that the embedded manifold metrics (e.g. the Riemannian geodesic distances) are not guaranteed to be compatible with the ambient vector space's norm structure up to a fixed distortion factor, unlike the Bourgain embedding or the Johnson-Lindenstrauss lemma in the discrete setting.
A continuous analogue of the norm-compatible discrete metric space embeddings is the Kuratowski embedding, which embeds any metric space globally isometrically (preserving pairwise distances) into the infinite-dimensional Banach space L∞. With an ε distortion relaxation, it is possible to embed a compact Riemannian manifold into a finite-dimensional normed space. But this appears to be very hard, in that the embedding dimension may grow faster than exponentially in n [30].

Like DR in manifold learning, and unlike DR in discrete metric space embedding, we want to preserve local notions such as precision and recall rather than global structure. Unlike DR in manifold learning, since precision and recall are almost equivalent to continuity and injectivity (Theorem 1), we are interested in embeddings in the topological category, instead of the smooth or the Riemannian category. Thus, our work can be considered as manifold learning from the perspective of information retrieval, which leads to the following result.

Proposition 2. If m ≥ 2n, where n is the dimension of the data manifold M in the domain and m is the dimension of the codomain Rm, then there exists a continuous map f : M → Rm such that f achieves perfect precision and recall for every point x ∈ M.

Note that the dimension reduction rate is actually much stronger than in the case of Riemannian isometric embedding, where the lowest embedding dimension grows polynomially [35]. This is because preserving precision and recall is weaker than isometric embedding. A practical implication is that we can reduce many more dimensions if we only care about precision and recall.
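A minimal, admittedly degenerate, numerical illustration of Proposition 2 (our own sketch, not a construction from the paper): take the data manifold to be a circle, so n = 1 and m = 2n = 2, sitting in R^3 with a constant third coordinate. Dropping that coordinate is an isometry on the manifold, so every neighborhood, and hence precision and recall, is preserved exactly:

```python
import numpy as np

# Circle (n = 1) in R^3 (N = 3), mapped to R^2 (m = 2n) by dropping the
# constant third coordinate. Distances between sample points are unchanged.
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
X = np.stack([np.cos(t), np.sin(t), np.ones_like(t)], axis=1)  # data in R^3
Y = X[:, :2]                                                   # f: drop x3

r = 0.5
for i in range(len(X)):
    in_U = np.linalg.norm(X - X[i], axis=1) <= r  # true neighbors of x
    in_V = np.linalg.norm(Y - Y[i], axis=1) <= r  # retrieved points around y
    assert np.array_equal(in_U, in_V)             # perfect precision and recall
```

The general case of Proposition 2 of course needs a genuinely topological argument; this sketch only shows the claimed behavior on one concrete manifold.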
To further quantify the distortion, we proposed a new measure based on L2-Wasserstein distances, and also proved its lower bound for Lipschitz continuous DR maps. It is also interesting to analyse the relation between the recall of a continuous DR map and its modulus of continuity. However, the generality and complexity of the fibers (inverse images) of these maps have so far defied our efforts, and this problem remains open. Furthermore, it is interesting to develop a corresponding theory in the discrete setting.

Acknowledgments

We would like to thank Yanshuai Cao, Christopher Srinivasa, and the broader Borealis AI team for their discussion and support. We also thank Marcus Brubaker, Cathal Smyth, and Matthew E. Taylor for proofreading the manuscript and their suggestions, as well as April Cooper for creating graphics for this work.

References

[1] Arseniy Akopyan and Roman Karasev. A tight estimate for the waist of the ball. Bulletin of the London Mathematical Society, 49(4):690–693, 2017.

[2] Arseniy Akopyan and Roman Karasev. Waist of balls in hyperbolic and spherical spaces. International Mathematics Research Notices, page rny037, 2018. doi: 10.1093/imrn/rny037. URL http://dx.doi.org/10.1093/imrn/rny037.

[3] Hannah Alpert and Larry Guth. A family of maps with many small fibers. Journal of Topology and Analysis, 7(01):73–79, 2015.

[4] Jason Altschuler, Jonathan Weed, and Philippe Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems, pages 1961–1971, 2017.

[5] Sanjeev Arora, Wei Hu, and Pravesh K. Kothari. An analysis of the t-SNE algorithm for data visualization. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 1455–1462. PMLR, 06–09 Jul 2018.
URL http://proceedings.mlr.press/v75/arora18a.html.

[6] Nicolas Bonneel, Michiel Van De Panne, Sylvain Paris, and Wolfgang Heidrich. Displacement interpolation using Lagrangian mass transport. In ACM Transactions on Graphics (TOG), volume 30, page 158. ACM, 2011.

[7] Jean Bourgain. On Lipschitz embedding of finite metric spaces in Hilbert space. Israel Journal of Mathematics, 52(1):46–52, 1985.

[8] Christos Boutsidis, Anastasios Zouzias, and Petros Drineas. Random projections for k-means clustering. In NIPS, pages 298–306, 2010.

[9] Luis A Caffarelli and Robert J McCann. Free boundaries in optimal transport and Monge-Ampère obstacle problems. Annals of Mathematics, 171:673–730, 2010.

[10] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 22(1):60–65, 2003.

[11] Alessio Figalli. The optimal partial transport problem. Archive for Rational Mechanics and Analysis, 195(2):533–560, 2010.

[12] Rémi Flamary and Nicolas Courty. POT Python Optimal Transport library, 2017. URL https://github.com/rflamary/POT.

[13] Daniele Granata and Vincenzo Carnevale. Accurate estimation of the intrinsic dimension using graph distances: Unraveling the geometric complexity of datasets. Scientific Reports, 6, 2016.

[14] Victor Guillemin and Alan Pollack. Differential Topology, volume 370. American Mathematical Soc., 2010.

[15] Larry Guth. The waist inequality in Gromov's work. The Abel Prize 2008, pages 181–195, 2012.

[16] Gísli R. Hjaltason and Hanan Samet. Properties of embedding methods for similarity searching in metric spaces. IEEE Trans. Pattern Anal. Mach. Intell., 25(5):530–549, May 2003. ISSN 0162-8828. doi: 10.1109/TPAMI.2003.1195989. URL https://doi.org/10.1109/TPAMI.2003.1195989.

[17] Bo'az Klartag. Convex geometry and waist inequalities.
Geometric and Functional Analysis, 27(1):130–164, 2017.

[18] Jonathan Korman and Robert J McCann. Insights into capacity-constrained optimal transport. Proceedings of the National Academy of Sciences, 110(25):10064–10067, 2013.

[19] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.

[20] Sylvain Lespinats and Michaël Aupetit. CheckViz: Sanity check and topological clues for linear and non-linear mappings. In Computer Graphics Forum, volume 30, pages 113–125. Wiley Online Library, 2011.

[21] Laurens Maaten. Learning a parametric embedding by preserving local structure. In Artificial Intelligence and Statistics, pages 384–391, 2009.

[22] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[23] Rafael Messias Martins, Danilo Barbosa Coimbra, Rosane Minghim, and Alexandru C Telea. Visual analysis of dimensionality reduction quality for parameterized projections. Computers & Graphics, 41:26–42, 2014.

[24] Robert J McCann and Adam M Oberman. Exact semi-geostrophic flows in an elliptical ocean basin. Nonlinearity, 17(5):1891, 2004.

[25] James McQueen, Marina Meila, and Dominique Joncas. Nearly isometric embedding by relaxation. In NIPS, pages 2631–2639, 2016.

[26] Michael Müger. A remark on the invariance of dimension. Mathematische Semesterberichte, 62(1):59–68, 2015.

[27] Hariharan Narayanan and Sanjoy Mitter. Sample complexity of testing the manifold hypothesis. In NIPS, pages 1786–1794, 2010.

[28] Lawrence E Payne. Isoperimetric inequalities and their applications. SIAM Review, 9(3):453–488, 1967.

[29] P Rayón and M Gromov. Isoperimetry of waists and concentration of maps. Geometric & Functional Analysis GAFA, 13(1):178–215, 2003.

[30] Malte Roeer.
On the finite dimensional approximation of the Kuratowski-embedding for compact manifolds. arXiv preprint arXiv:1305.1529, 2013.

[31] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Kernel principal component analysis. In International Conference on Artificial Neural Networks, pages 583–588. Springer, 1997.

[32] Tobias Schreck, Tatiana Von Landesberger, and Sebastian Bremm. Techniques for precision-based visual analysis of projected data. Information Visualization, 9(3):181–193, 2010.

[33] Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

[34] Jarkko Venna, Jaakko Peltonen, Kristian Nybo, Helena Aidos, and Samuel Kaski. Information retrieval perspective to nonlinear dimensionality reduction for data visualization. Journal of Machine Learning Research, 11(Feb):451–490, 2010.

[35] Nakul Verma. Distance preserving embeddings for general n-dimensional manifolds. Journal of Machine Learning Research, 14(1):2415–2448, 2013.

[36] Xianfu Wang. Volumes of generalized unit balls. Mathematics Magazine, 78(5):390–395, 2005.