{"title": "Landmark Ordinal Embedding", "book": "Advances in Neural Information Processing Systems", "page_first": 11506, "page_last": 11515, "abstract": "In this paper, we aim to learn a low-dimensional Euclidean representation from a set of constraints of the form \u201citem j is closer to item i than item k\u201d. Existing approaches for this \u201cordinal embedding\u201d problem require expensive optimization procedures, which cannot scale to handle increasingly larger datasets. To address this issue, we propose a landmark-based strategy, which we call Landmark Ordinal Embedding (LOE). Our approach trades off statistical efficiency for computational efficiency by exploiting the low-dimensionality of the latent embedding. We derive bounds establishing the statistical consistency of LOE under the popular Bradley- Terry-Luce noise model. Through a rigorous analysis of the computational complexity, we show that LOE is significantly more efficient than conventional ordinal embedding approaches as the number of items grows. We validate these characterizations empirically on both synthetic and real datasets. We also present a practical approach that achieves the \u201cbest of both worlds\u201d, by using LOE to warm-start existing methods that are more statistically efficient but computationally expensive.", "full_text": "Landmark Ordinal Embedding\n\nNikhil Ghosh\u2217\nUC Berkeley\n\nnikhil_ghosh@berkeley.edu\n\nYuxin Chen\u2217\nUChicago\n\nchenyuxin@uchicago.edu\n\nYisong Yue\n\nCaltech\n\nyyue@caltech.edu\n\nAbstract\n\nIn this paper, we aim to learn a low-dimensional Euclidean representation from\na set of constraints of the form \u201citem j is closer to item i than item k\u201d. Existing\napproaches for this \u201cordinal embedding\u201d problem require expensive optimization\nprocedures, which cannot scale to handle increasingly larger datasets. To address\nthis issue, we propose a landmark-based strategy, which we call Landmark Ordinal\nEmbedding (LOE). 
Our approach trades off statistical ef\ufb01ciency for computational\nef\ufb01ciency by exploiting the low-dimensionality of the latent embedding. We derive\nbounds establishing the statistical consistency of LOE under the popular Bradley-\nTerry-Luce noise model. Through a rigorous analysis of the computational com-\nplexity, we show that LOE is signi\ufb01cantly more ef\ufb01cient than conventional ordinal\nembedding approaches as the number of items grows. We validate these character-\nizations empirically on both synthetic and real datasets. We also present a practical\napproach that achieves the \u201cbest of both worlds\u201d, by using LOE to warm-start\nexisting methods that are more statistically ef\ufb01cient but computationally expensive.\n\n1\n\nIntroduction\n\nUnderstanding similarities between data points is critical for numerous machine learning problems\nsuch as clustering and information retrieval. However, we usually do not have a \u201cgood\" notion of\nsimilarity for our data a priori. For example, we may have a collection of images of objects, but\nthe natural Euclidean distance between the vectors of pixel values does not capture an interesting\nnotion of similarity. To obtain a more natural similarity measure, we can rely on (weak) supervision\nfrom an oracle (e.g. a human expert). A popular form of weak supervision is ordinal feedback of the\nform \u201citem j is closer to item i than item k\u201d [8]. Such feedback has been shown to be substantially\nmore reliable to elicit than cardinal feedback (i.e. how close item i is to item j), especially when\nthe feedback is subjective [11]. 
Furthermore, ordinal feedback arises in a broad range of real-world domains, most notably in user interaction logs from digital systems such as search engines and recommender systems [10, 15, 2, 24, 19, 17].
We thus study the ordinal embedding problem [18, 22, 14, 13, 21], which pertains to finding low-dimensional representations that respect ordinal feedback. One major limitation of current state-of-the-art ordinal embedding methods is their high computational complexity, which often makes them unsuitable for large datasets. Given the dramatic growth in real-world dataset sizes, it is desirable to develop methods that can scale computationally.

Our contribution. In this paper, we develop computationally efficient methods for ordinal embedding that are also statistically consistent (i.e. run on large datasets and converge to the "true" solution with enough data). Our method draws inspiration from Landmark Multidimensional Scaling [7], which approximately embeds points given distances to a set of "landmark" points. We adapt this technique to the ordinal feedback setting, by using results from ranking with pairwise comparisons and properties of Euclidean distance matrices. The result is a fast embedding algorithm, which we call Landmark Ordinal Embedding (LOE).

*Research done when Nikhil Ghosh and Yuxin Chen were at Caltech.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

We provide a thorough analysis of our algorithm, in terms of both the sample complexity and the computational complexity. We prove that LOE recovers the true latent embedding under the Bradley-Terry-Luce noise model [4, 16] with a sufficient number of data samples. The computational complexity of LOE scales linearly with respect to the number of items, which in the large data regime is orders of magnitude more efficient than conventional ordinal embedding approaches.
We\nempirically validate these characterizations on both synthetic and real triplet comparison datasets,\nand demonstrate dramatically improved computational ef\ufb01ciency over state-of-the-art baselines.\nOne trade-off from using LOE is that it is statistically less ef\ufb01cient than previous approaches (i.e.\nneeds more data to precisely characterize the \u201ctrue\u201d solution). To offset this trade-off, we empirically\ndemonstrate a \u201cbest of both worlds\u201d solution by using LOE to warm-start existing state-of-the-art\nembedding approaches that are statistically more ef\ufb01cient but computationally more expensive. We\nthus holistically view LOE as a \ufb01rst step towards analyzing the statistical and computational trade-offs\nof ordinal embedding in the large data regime.\n\n2 Related Work\n\nMultidimensional Scaling\nIn general, the task of assigning low-dimensional Euclidean coordi-\nnates to a set of objects such that they approximately obey a given set of (dis)similarity relations, is\nknown as Euclidean embedding. When given an input matrix D of pairwise dissimilarities, \ufb01nding an\nembedding with inter-point distances aligning with D is the classical problem of (metric) Multidimen-\nsional Scaling (MDS). This problem has a classical solution that is guaranteed to \ufb01nd an embedding\nexactly preserving D, if D actually has a low-dimensional Euclidean structure [3]. The algorithm\n\ufb01nds the global optimum of a sensible cost function in closed form, running in approximately O(dn2)\ntime, where n is the number of objects and d is the dimension of the embedding.\n\nLandmark MDS When n becomes too large the MDS algorithm may be too expensive in practice.\nThe bottleneck in the solution is the calculation of the top d eigenvalues of the n \u00d7 n matrix D.\nWhen n is very large, there exists a computationally ef\ufb01cient approximation known as Landmark\nMDS (LMDS) [7]. 
LMDS \ufb01rst selects a subset of l \u201clandmark\u201d points, where l (cid:28) n, and embeds\nthese points using classical MDS. It then embeds the remaining points using their distances to the\nlandmark points. This \u201ctriangulation\u201d procedure corresponds to computing an af\ufb01ne map. If the af\ufb01ne\ndimension of the landmark points is at least d, this algorithm has the same guarantee as classical\nMDS in the noiseless scenario. However, it runs in time roughly O(dln + l3), which is linear in n.\nThe main drawback is that it is more sensitive to noise, the sensitivity being heavily dependent on the\n\u201cquality\u201d of the chosen landmarks.\n\nOrdinal embedding Currently there exist several techniques for ordinal embedding. Generalized\nNon-metric Multidimensional Scaling (GNMDS) [1] takes a max-margin approach by minimizing\nhinge loss. Stochastic Triplet Embedding (STE) [22] assumes the Bradley-Terry-Luce (BTL) noise\nmodel [4, 16] and minimizes logistic loss. The Crowd Kernel [20] and t-STE [22] propose alternative\nnon-convex loss measures based on probabilistic generative models. All of these approaches rely on\nexpensive gradient or projection computations and are unsuitable for large datasets. The results in\nthese papers are primarily empirical and focus on minimizing prediction error on unobserved triplets.\nRecently in Jain et al. (2016) [9], theoretical guarantees were made for recovery of the true distance\nmatrix using the maximum likelihood estimator for the BTL model.\n\n3 Problem Statement\n\nIn this section, we formally state the ordinal embedding problem. Consider n objects [n] = {1, . . . , n}\nwith respective unknown embeddings x1, . . . , xn \u2208 Rd. The Euclidean distance matrix D\u2217 is de\ufb01ned\nso that D\u2217ij = (cid:107)xi \u2212 xj(cid:107)2\n2. Let T := {(cid:104)i, j, k(cid:105) : 1 \u2264 i (cid:54)= j (cid:54)= k \u2264 n, j < k} be the set of\nunique triplets. 
We have access to a noisy triplet comparison oracle O, which when given a triplet\n(cid:104)i, j, k(cid:105) \u2208 T returns a binary label +1 or \u22121 indicating if D\u2217ij > D\u2217ik or not. We assume that O\n\n2\n\n\fmakes comparisons as follows:\n\n,\n\nO((cid:104)i, j, k(cid:105)) =(cid:26)+1 w.p. f (D\u2217ij \u2212 D\u2217ik)\n\u22121 w.p. 1 \u2212 f (D\u2217ij \u2212 D\u2217ik)\n\n(3.1)\nwhere f : R \u2192 [0, 1] is the known link function. In this paper, we consider the popular BTL model\n1+exp(\u2212\u03b8). In general, the ideas in this\n[16], where f corresponds to the logistic function: f (\u03b8) =\npaper can be straightforwardly generalized to any linear triplet model which says j is farther from i\nthan k with probability F (Dij \u2212 Dik), where F is the CDF of some 0 symmetric random variable.\nThe two most common models of this form are the BTL model, where F is the logistic CDF, and the\nThurstone model, where F is the normal CDF [5].\nOur goal is to recover the points x1, . . . , xn. However since distances are preserved by orthogonal\ntransformations, we can only hope to recover the points up to an orthogonal transformation. For this\nreason, the error metric we will use is the Procrustes distance [12] de\ufb01ned as:\n\n1\n\n,\n\nQ\u2208O(n)(cid:13)(cid:13)(cid:13)X \u2212 Q(cid:98)X(cid:13)(cid:13)(cid:13)F\n\nd(X,(cid:98)X) := min\nwhere O(n) is the group of n \u00d7 n orthogonal matrices.\nWe now state our problem formally as follows:\nProblem 1 (Ordinal Embedding). Consider n points X = [x1, . . . , xn] \u2208 Rd\u00d7n centered about the\norigin. Given access to oracle O in (3.1) and budget of m oracle queries, output an embedding\nestimate (cid:98)X = [(cid:98)x1, . . . ,(cid:98)xn] minimizing the Procrustes distance d(X,(cid:98)X)\n\n4 Landmark Ordinal Embedding\n\n(3.2)\n\nIn this section, we present our algorithm Landmark Ordinal Embedding (LOE), for addressing the\nordinal embedding problem as stated in Problem 1. 
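As a concrete reference for the setup above, the following is a minimal sketch (our own illustration, not the authors' code) of the BTL triplet oracle in Eq. (3.1) and the Procrustes distance in Eq. (3.2). The function names are our own; the minimizing orthogonal matrix is obtained in closed form from an SVD.

```python
import numpy as np

def btl_triplet_oracle(X, i, j, k, rng):
    """Answer "is j farther from i than k?" under the BTL model of Eq. (3.1).

    X is d x n (columns are points); returns +1 w.p. f(D*_ij - D*_ik),
    where f is the logistic link.
    """
    d_ij = np.sum((X[:, i] - X[:, j]) ** 2)
    d_ik = np.sum((X[:, i] - X[:, k]) ** 2)
    p = 1.0 / (1.0 + np.exp(-(d_ij - d_ik)))
    return 1 if rng.random() < p else -1

def procrustes_distance(X, X_hat):
    """d(X, X_hat) = min_Q ||X - Q X_hat||_F over orthogonal Q (Eq. 3.2).

    The minimizer is Q = V U^T, where U S V^T = svd(X_hat @ X.T).
    """
    U, _, Vt = np.linalg.svd(X_hat @ X.T)
    Q = Vt.T @ U.T
    return np.linalg.norm(X - Q @ X_hat, "fro")
```

Note that since X is d × n, the aligning Q in this sketch is a d × d orthogonal matrix acting on the left, so that Q X̂ has the same shape as X.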
Instead of directly optimizing a maximum\nlikelihood objective, LOE recovers the embedding in two stages. First, LOE estimates (cid:96) = O(d)\ncolumns of D\u2217. Then it estimates the embedding X from these (cid:96) columns via LMDS. We provide the\npseudocode in Algorithm 1. In the remainder of this section, we focus on describing the \ufb01rst stage,\nwhich is our main algorithmic contribution.\n\n4.1 Preliminaries\nNotation We establish some notation and conventions. For vectors the default norm is the (cid:96)2-norm.\nFor matrices, the default inner product/norm is the Fr\u00f6benius inner product/norm.\n\ncone of (cid:96) \u00d7 (cid:96) Euclidean distance matrices\nprojection onto closed, convex set C\n(cid:107)x \u2212 PC(x)(cid:107) i.e. distance of point x to C\nthe pseudoinverse of X\n{(j, k) : 1 \u2264 j < k \u2264 n}\n\nTable 1: List of Notation\n1(cid:96)1(cid:62)(cid:96) \u2212 I(cid:96)\nJ\nJ\u22a5 {X \u2208 R(cid:96)\u00d7(cid:96) : (cid:104)X, J(cid:105) = 0}\nS(cid:96)\n{X \u2208 R(cid:96)\u00d7(cid:96) : X(cid:62) = X}\nS(cid:96) \u2229 J\u22a5\nV\n\u03c3X (cid:104)X, J(cid:105)/(cid:107)J(cid:107)2\n\nEDM(cid:96)\nPC\ndist(x, C)\nX\u2020\n\n(cid:2)n\n2(cid:3)\n\nConsistent estimation We will say that for some estimator(cid:98)a and quantity a that(cid:98)a \u2248 a with respect\nto a random variable \u0001 if (cid:107)(cid:98)a \u2212 a(cid:107) \u2264 g(\u0001) for some g : R \u2192 R which is monotonically increasing and\ng(x) \u2192 0 as x \u2192 0. Note that \u2248 is an equivalence relation. If(cid:98)a \u2248 a, we say(cid:98)a approximates or\n\nestimates a, with the approximation improving to an equality as \u0001 \u2192 0.\nPrelude: Ranking via pairwise comparisons Before describing the main algorithm, we discuss\nsome results associated to the related problem of ranking via pairwise comparisons which we will\nuse later. We consider the parametric BTL model for binary comparisons.\nAssume there are n items [n] of interest. 
We have access to a pairwise comparison oracle O\nparametrized by an unknown score vector \u03b8\u2217 \u2208 Rn. Given (cid:104)j, k(cid:105) \u2208(cid:2)n\nw.p. f (\u03b8\u2217j \u2212 \u03b8\u2217k)\n\u22121 w.p. 1 \u2212 f (\u03b8\u2217j \u2212 \u03b8\u2217k)\n\n2(cid:3) the oracle O returns:\n\nO((cid:104)j, k(cid:105)) =(cid:26)1\n\n,\n\n3\n\n\f0\n\nAlgorithm 1 Landmark Ordinal Embedding (LOE)\n1: Input: # of items n; # of landmarks (cid:96); # of samples per column m; dimension d; triplet\n\ncomparison oracle O; regularization parameter \u03bb;\n\n(cid:46) form associated W\n\n(cid:46) rank landmark columns\n\ni (cid:54)= j\ni = j \u2200i, j \u2208 [(cid:96)]\n\nForm comparison oracle Oi((cid:104)j, k(cid:105)) in Eq. (4.3)\nRi \u2190 regularized MLE estimate Eq. (4.1)\n\n2: Randomly select (cid:96) landmarks from [n] and relabel them so that landmarks are [(cid:96)]\n3: for all i \u2208 [(cid:96)] do\n4:\n5:\n6: for ranking from Oi using m comparisons\n7: end for\n8: R \u2190 [R1, . . . , R(cid:96)]\n9: (cid:102)W \u2190 R(1 : (cid:96) \u2212 1, 1 : (cid:96))\n10: W =(cid:40)(cid:102)Wi,j\u22121(j>i)\n11: J \u2190 1(cid:96)1(cid:62)(cid:96) \u2212 I(cid:96)\n12: (cid:98)\u03c3 \u2190 least-squares solution to (4.8)\n13: (cid:98)\u03c3E \u2190 \u03bb2(PS(cid:96)(W + J \u00b7 diag(s)))\n14: (cid:98)s \u2190(cid:98)\u03c3 +(cid:98)\u03c3E\n15: (cid:98)E \u2190 PEDM(cid:96)(W + J \u00b7 diag((cid:98)s))\n16: (cid:98)F \u2190 R((cid:96) : n \u2212 1, 1 : (cid:96)) + 1n\u2212(cid:96) \u00b7(cid:98)s(cid:62)\n17: (cid:98)X \u2190 LMDS((cid:98)E,(cid:98)F )\n18: Output: Embedding (cid:98)X\nwhere as before f is the logistic function. We call the oracle O a \u03b8\u2217-oracle.\nGiven access to a \u03b8\u2217-oracle O, our goal is to estimate \u03b8\u2217 from a set of m pairwise comparisons\nmade uniformly at random labeled, that are labeled by O. Namely, we wish to estimate \u03b8\u2217 from\nthe comparison set \u2126 = {(j1, k1,O((cid:104)j1, k1(cid:105)), . . . 
, (j_m, k_m, O(⟨j_m, k_m⟩))}, where the pairs ⟨j_i, k_i⟩ are chosen i.i.d. uniformly from {(j, k) : 1 ≤ j < k ≤ n}. We refer to this task as ranking from a θ*-oracle using m comparisons. To estimate θ*, we use the ℓ2-regularized maximum likelihood estimator:

θ̂ = argmax_θ L_λ(θ; Ω),    (4.1)

where L_λ(θ; Ω) := Σ_{(i,j)∈Ω+} log f(θ_i − θ_j) + Σ_{(i,j)∈Ω−} log(1 − f(θ_i − θ_j)) − (1/2) λ ‖θ‖²₂ for some regularization parameter λ > 0 and Ω± := {(i, j) : (i, j, ±1) ∈ Ω}.
Observe that [θ] = {θ + s·1_n : s ∈ R} forms an equivalence class of score vectors, in the sense that each score vector in [θ] forms an identical comparison oracle. In order to ensure identifiability for recovery, we assume Σ_i θ*_i = 0 and enforce the constraint Σ_i θ̂_i = 0. Now we state the central result about θ̂ that we rely upon.

Theorem 2 (Adapted from Theorem 6 of [6]). Given m = Ω(n log n) observations of the form (j, k, O(⟨j, k⟩)), where ⟨j, k⟩ are drawn uniformly at random from {(j, k) : 1 ≤ j < k ≤ n} and O is a θ*-oracle, with probability exceeding 1 − O(n⁻⁵) the regularized MLE θ̂ with λ = Θ(√(n³ log n / m)) satisfies:

‖θ* − θ̂‖∞ = O(√(n log n / m)).    (4.2)

If it is not the case that Σ_i θ*_i = 0, we can still apply Theorem 2 to the equivalent score vector θ* − θ̄*·1, where θ̄* = (1/n)·1_nᵀ θ*. In place of Eq. (4.2) this yields ‖θ* − (θ̂ + θ̄*·1)‖∞ = O(√(n log n / m)), where it is not possible to directly estimate the unknown shift θ̄*.

4.2 Estimating Landmark Columns up to Columnwise Constant Shifts

Ranking landmark columns LOE starts by choosing ℓ items as landmark items. Upon relabeling the items, we can assume these are the first ℓ items. We utilize the ranking results from the previous section to compute (shifted) estimates of the first ℓ columns of D*.
Let D*₋ᵢ ∈ R^{n−1} denote the i-th column of D* with the i-th entry removed (in MATLAB notation D*₋ᵢ := D*([1:i−1, i+1:n], i)). We identify [n] \ {i} with [n−1] via the bijection j ↦ j − 1{j > i}. Observe that using our triplet oracle O, we can query the D*₋ᵢ-oracle Oᵢ, which compares items [n] \ {i} by their distance to xᵢ. Namely, for ⟨j, k⟩ ∈ {(j, k) : 1 ≤ j < k ≤ n−1}:

Oᵢ(⟨j, k⟩) := O(⟨i, j + 1{j ≥ i}, k + 1{k ≥ i}⟩).    (4.3)

By Theorem 2, using m comparisons from Oᵢ we compute an MLE estimate Rᵢ on Line 6 such that:

D*₋ᵢ = Rᵢ + s*ᵢ·1_{n−1} + εᵢ,  i ∈ [ℓ],    (4.4)

with shift s*ᵢ = (1/(n−1))·1ᵀ_{n−1} D*₋ᵢ.
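To make the per-column ranking step concrete, here is a minimal sketch (our own illustration, not the authors' code) of the ℓ2-regularized BTL maximum likelihood estimate of Eq. (4.1), solved by plain gradient descent on the regularized negative log-likelihood, with the centering constraint Σᵢ θ̂ᵢ = 0 enforced after each step. The function name, step size, and iteration count are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ranking_mle(n, comparisons, lam=0.1, lr=0.5, iters=2000):
    """l2-regularized BTL MLE (Eq. 4.1) via gradient descent.

    comparisons: list of (j, k, y) with y = +1 meaning the oracle
    reported theta*_j > theta*_k (a label from a theta*-oracle).
    """
    theta = np.zeros(n)
    m = len(comparisons)
    for _ in range(iters):
        grad = lam * theta                 # gradient of (1/2) lam ||theta||^2
        for j, k, y in comparisons:
            # gradient of the negative log-likelihood of one comparison
            g = -y * sigmoid(-y * (theta[j] - theta[k]))
            grad[j] += g
            grad[k] -= g
        theta -= lr * grad / m
        theta -= theta.mean()              # enforce sum(theta) = 0
    return theta
```

Because the regularized loss is strongly convex, this simple scheme converges to the unique centered maximizer; any off-the-shelf logistic regression solver would serve equally well.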
Define the ranking error ε as:

ε := max_{i∈[ℓ]} ‖εᵢ‖∞.    (4.5)

Assuming that Eq. (4.2) in Theorem 2 occurs for each Rᵢ, we see that ε → 0 as m → ∞ at a rate O(1/√m). From now on we use ≈ with respect to this ε. The use of this notation is to make the exposition and motivation for the algorithm more clear. We will keep track of the suppressed g carefully in our theoretical sample complexity bounds, detailed in Appendix A of the supplementary.

4.3 Recovering Landmark Column Shifts s*

Estimating the shifts s*: A motivating strategy After solving ℓ ranking problems and computing estimates Rᵢ for each i ∈ [ℓ], we wish to estimate the unknown shifts s*ᵢ so that we can approximate the first ℓ columns of D* using the fact that D*₋ᵢ ≈ Rᵢ + s*ᵢ·1 and D*ᵢ,ᵢ = 0. As remarked before, we cannot recover such s*ᵢ purely from ranking information alone.
To circumvent this issue, we incorporate known structural information about the distance matrix to estimate s*ᵢ. Let E* := D*(1:ℓ, 1:ℓ) and F* := D*(ℓ+1:n, 1:ℓ) be the first ℓ rows and last n−ℓ rows of the landmark columns D*(1:n, 1:ℓ), respectively. Observe that E* is simply the distance matrix of the ℓ landmark objects. Consider the (n−1) × ℓ ranking matrix R = [R₁, ..., R_ℓ] and let W be its upper (ℓ−1) × ℓ submatrix with a diagonal of zeroes inserted (see Fig. 1a). After shifting column i of R by s*ᵢ, column i of W is shifted by s*ᵢ with the diagonal unchanged. The resulting shift
The resulting shift\nof W is equivalent to adding J \u00b7 diag(s\u2217), so for shorthand we will de\ufb01ne:\n\nshift(X, y) := X + J \u00b7 diag(y).\n\nBy de\ufb01nition of R, we have shift(W, s\u2217) \u2248 E\u2217 \u2208 EDM(cid:96) (see Fig. 1b). This motivates an initial\nmatrix. Concretely, we choose:\n\nstrategy for estimating s\u2217i : choose(cid:98)s such that shift(W, s) is approximately a Euclidean distance\nHowever, it is not obvious how to solve this optimization problem to compute(cid:98)s or if(cid:98)s \u2248 s\u2217. We\n(cid:96)((cid:96)\u22121)(cid:104)E\u2217, J(cid:105) be the average of\n\nEstimating the shifts s\u2217 using properties of EDMs Let \u03c3E\u2217 := 1\nthe non-diagonal entries of E\u2217. Consider the orthogonal decomposition of E\u2217:\n\naddress these issues by modifying this motivating strategy in the next section.\n\ndist(shift(W, s), EDM(cid:96)).\n\n(cid:98)s \u2208 arg min\n\ns\u2208R(cid:96)\n\n(4.6)\n\nE\u2217 = E\u2217c \u2212 \u03c3E\u2217 J\n\nwhere the centered distance matrix E\u2217c := E\u2217 \u2212 \u03c3E\u2217 J is the projection of E\u2217 onto J\u22a5 and conse-\nquently lies in the linear subspace V := S(cid:96) \u2229 J\u22a5. Letting \u03c3\u2217 := s\u2217 \u2212 \u03c3E\u22171 we see:\nshift(W, \u03c3\u2217) = shift(W, s\u2217) \u2212 \u03c3E\u2217 J \u2248 E\u2217 \u2212 \u03c3E\u2217 J = E\u2217c \u2208 V\n\n5\n\n\f(a) Obtaining W from R (c.f. Line 9 and Line 10). Left: The entries of(cid:102)W are unshifted estimates of the\noff-diagonal entries E\u2217. Right: We add a diagonal of zeroes to(cid:102)W to match the diagonal of zeroes in E\u2217.\n\n(b) Shifting each Ri by s\u2217\n\ni to get E\u2217 and F \u2217.\n\nFigure 1: Shifting the rankings Ri to estimate the \ufb01rst (cid:96) columns of D\u2217 (see Algorithm 1).\n\nSince the space V is more tractable than EDM(cid:96) it will turn out to be simpler to estimate \u03c3\u2217 than to\nestimate s\u2217 directly. 
We will see later that we can in fact estimate s\u2217 using an estimate of \u03c3\u2217. In\n\nanalogy with (4.6) we choose(cid:98)\u03c3 such that:\n(cid:98)\u03c3 \u2208 arg min\nThis time(cid:98)\u03c3 is easy to compute. Using basic linear algebra one can verify that the least-squares\n\nsolution sls to the system of linear equations:\n\ndist(shift(W, s), V)\n\nsi \u2212 sj = Wij \u2212 Wji,\u2200i < j \u2208 [(cid:96)]\n\n(4.8a)\n\n(4.7)\n\ns\u2208R(cid:96)\n\n(cid:88)i\u2208[(cid:96)]\n\nsi = \u2212\n\n1\n\n(cid:96) \u2212 1 (cid:88)i(cid:54)=j\u2208[(cid:96)]\n\nWij\n\n(4.8b)\n\nsupplementary.).\n\nwhich states that if (cid:96) \u2265 d + 3, then \u03c3E\u2217 = \u03bb2(E\u2217c ) i.e. the second largest eigenvalue of E\u2217c . If we\n\nsolves the optimization problem (4.7), so we can take(cid:98)\u03c3 := sls (c.f. Line 12). It is not hard to\nshow that (cid:98)\u03c3 \u2248 \u03c3\u2217 and a proof is given in the appendix (c.f. \ufb01rst half of Appendix A.1 of the\nNow it remains to see how to estimate s\u2217 using(cid:98)\u03c3. To do so we will make use of Theorem 4 of [9]\nestimate \u03c3E\u2217 by(cid:98)\u03c3E\u2217 := \u03bb2(PV (shift(W,(cid:98)\u03c3))), then by the theorem(cid:98)\u03c3E\u2217 \u2248 \u03c3E\u2217 (c.f. Line 13). Now\nwe can take(cid:98)s :=(cid:98)\u03c3 +(cid:98)\u03c3E\u22171 (c.f. Line 14) which will satisfy(cid:98)s \u2248 s\u2217.\nAfter computing(cid:98)s, we use it to shift our rankings Ri to estimate columns of the distance matrix. We\ntake (cid:98)E = PEDM(cid:96)(shift(W,(cid:98)s)) and (cid:98)F to be the last n\u2212 (cid:96) rows of R with the columns shifted by(cid:98)s (c.f.\nLine 15 and Line 16). Together, (cid:98)E and (cid:98)F estimate the \ufb01rst (cid:96) columns of D\u2217. We \ufb01nish by applying\nLMDS to this estimate (c.f. Line 17).\nMore generally, the ideas in this section show that if we can estimate the difference of distances\nD\u2217ij \u2212 D\u2217ik, i.e. 
estimate the distances up to some constant, then the additional Euclidean distance structure actually allows us to estimate this constant.

5 Theoretical Analysis

We now analyze both the sample complexity and
computational complexity of LOE.

5.1 Sample Complexity

We first present the key lemma which is required to bound the sample complexity of LOE.
Lemma 3. Consider n objects x₁, ..., x_n ∈ R^d with distance matrix D* = [D*₁, ..., D*_n] ∈ R^{n×n}. Let ℓ = d + 3 and define ε as in Eq. (4.5). Let Ê, F̂ be as in Line 15 and Line 16 of LOE (Algorithm 1), respectively. If E* = D*(1:ℓ, 1:ℓ) and F* = D*(ℓ+1:n, 1:ℓ), then

‖Ê − E*‖ = O(ε ℓ² √ℓ),  ‖F̂ − F*‖ = O(ε ℓ² √(n−ℓ)).

The proof of Lemma 3 is deferred to the appendix. Lemma 3 gives a bound for the propagated error of the landmark columns estimate in terms of the ranking error ε. Combined with a perturbation bound for LMDS, we use this result to prove the following sample complexity bound.

Theorem 4. Let X̂ be the output of LOE with ℓ = d + 3 landmarks and let X ∈ R^{d×n} be the true embedding. Then Ω(d⁸ n log n) triplets queried in LOE is sufficient to recover the embedding, i.e. with probability at least 1 − O(dn⁻⁵):

(1/√(nd)) · d(X̂, X) = O(1).

Although the dependence on d in our given bound is high, we believe it can be significantly improved. Nonetheless, the rate we obtain is still polynomial in d and O(n log n), and most importantly proves the statistical consistency of our method. For most ordinal embedding problems, d is small (often 2 or 3), so there is not a drastic loss in statistical efficiency. Moreover, we are concerned primarily with the large-data regime where n and m are very large and computation is the primary bottleneck.
In this setting, LOE arrives at a reasonable embedding much more quickly than other ordinal embedding methods. This can be seen in Figure 2 and is supported by the computational complexity analysis in the following section. LOE can be used to warm-start other more accurate methods to achieve a more balanced trade-off, as demonstrated empirically in Figure 3.
We choose the number of landmarks ℓ = d + 3 since these are the fewest landmarks for which Theorem 4 from [9] applies. An interesting direction for further work would be to refine the theoretical analysis to better understand the dependence of the error on the number of samples and landmarks. Increasing the number of landmarks decreases the accuracy of the ranking of each column since m/ℓ decreases, but increases the stability of the LMDS procedure. A quantitative understanding of this trade-off would be useful for choosing an appropriate ℓ.

5.2 Computational Complexity

Computing the regularized MLE for a ranking problem amounts to solving a regularized logistic regression problem. Since the loss function for this problem is strongly convex, the optimization can be done via gradient descent in time O(C log(1/ε)), where C is the cost of computing a single gradient and ε is the desired optimization error. If m total triplets are used so that each ranking uses m/ℓ triplets, then C = O(m/ℓ + n). Since ℓ = O(d), solving all ℓ ranking problems (c.f. Line 6) with error ε takes time O((m + nd) log(1/ε)).
Let us consider the complexity of the steps following the ranking problems. A series of O(nd + d³) operations are performed in order to compute Ê and F̂ (c.f. Line 16). The LMDS procedure then takes O(d³ + nd²) time. Thus overall, these steps take O(d³ + nd²) time.
If we treat d and log(1/ε) as constants, we see that LOE takes time linear in n and m.
Other ordinal embedding methods, such as STE, GNMDS, and CKL, require gradient or projection operations that take at least Ω(n² + m) time and need to be done multiple times.

[Figure 2: Scalability. (a) Time to completion vs n, with (n, d, c) = (10⁵, 2, 200); (b) time to LOE error vs n.]
[Figure 3: Warm-start. (a) (n, d, c) = (10⁵, 2, 50); (b) (n, d, c) = (2 × 10⁴, 10, 200).]
[Figure 4: MNIST. (b) Purity vs n.]

6 Experiments
6.1 Synthetic Experiments
We tested the performance of LOE on synthetic datasets orders of magnitude larger than any dataset in previous work. The points of the latent embedding were generated by a normal distribution: xᵢ ∼ N(0, (1/√(2d)) I_d) for 1 ≤ i ≤ n. Triplet comparisons were made by a noisy BTL oracle. The total number of triplet queries m for embedding n items was set to be c·n log n for various values of c. To evaluate performance, we measured the Procrustes error (3.2) with respect to the ground truth embedding. We compared the performance of LOE with the non-convex versions of STE and GNMDS. For fairness, we did not compare against methods which assume different noise models. We did not compare against other versions of STE and GNMDS since, as demonstrated in the appendix, they are much less computationally efficient than their non-convex versions.

Scalability with respect to n In this series of experiments, the number of items n ranged over [20, 40, 60, 80, 100] × 10³. The embedding dimension d = 2 was fixed and c was set to 200. For each n, we ran 3 trials. In Figure 2a we plot time versus error for n = 10⁵. In Figure 2b, for each n we plotted the time for LOE to finish and the time for each baseline to achieve the error of LOE. See appendix for remaining plots.
From Figure 2a we see that LOE reaches a solution very quickly. The solution is also fairly accurate, and it takes the baseline methods around 6 times as long to achieve a similar accuracy.
In Figure 2b we observe that the running time of LOE grows at a modest linear rate. In fact, we were able to compute embeddings of n = 10⁶ items in reasonable time using LOE, but were unable to compare with the baselines since we did not have enough memory. The additional running time for the baseline methods to achieve the same accuracy, however, grows at a significant linear rate, and we see that for large n, LOE is orders of magnitude faster.

Warm start  In these experiments we used LOE to warm-start STE. We refer to this warm-start method as LOE-STE. To obtain the warm start, LOE-STE first uses αm triplets to obtain a solution using LOE. This solution is then used to initialize STE, which uses the remaining (1 − α)m triplets to reach a final solution. Since the warm-start triplets are not drawn from a uniform distribution on T, they are not used for STE, which requires uniformly sampled triplets. We chose α to be small enough that the final solution of LOE-STE is not much worse than a solution obtained by STE using m samples, but large enough that the initial solution from LOE is close enough to the final solution for there to be significant computational savings.
For d = 2 we set n = 10⁵, c = 50, and α = 0.3. For d = 10 we set n = 2 × 10⁴, c = 200, and α = 0.2. For each setting, we ran 3 trials. The results are shown in Figure 3. We see that LOE-STE is able to reach lower error much more rapidly than the baselines and yields final solutions that are competitive with the slower baselines. This demonstrates the utility of LOE-STE in settings where the number of triplets is not too large and accurate solutions are desired.

6.2 MNIST Dataset
To evaluate our approach on less synthetic data, we followed the experiment conducted in [14] on the MNIST dataset.
For n = 100, 500, and 1000, we chose n MNIST digits uniformly at random and generated 200n log n triplet comparisons drawn uniformly at random, based on the Euclidean distances between the digits, with each comparison being incorrect independently with probability p = 0.15. We then generated an ordinal embedding with d = 5 and computed a k-means clustering of the embedding. To evaluate the embedding, we measure the purity of the clustering, defined as purity(Ω, C) = (1/n) Σₖ maxⱼ |ωₖ ∩ cⱼ|, where Ω = {ω₁, ω₂, . . . , ωₖ} are the clusters and C = {c₁, c₂, . . . , cⱼ} are the classes. The higher the purity, the better the embedding. The correct number of clusters was always provided as input. We set the number of replicates in k-means to 5 and the maximum number of iterations to 100. For LOE-STE we set α = 0.5. The results are shown in Figure 4. Even in a non-synthetic setting with a misspecified noise model, we observe that LOE-STE achieves faster running times than vanilla STE, with only a slight loss in embedding quality.

6.3 Food Dataset
To evaluate our method on a real dataset and qualitatively assess the embeddings, we used the food relative-similarity dataset from [23] to compute two-dimensional embeddings of food images. The embedding methods we considered were STE and LOE-STE. For LOE-STE, the warm-start solution was computed with ℓ = 25 landmarks, using all available landmark comparisons. STE was then run using the entire dataset. The cold-start STE method used the entire dataset as well. For each method, we repeatedly computed an embedding 30 times and recorded the time taken. We observed that the warm-start solution always converged to the same solution as the cold start (as can be seen in the appendix), suggesting that LOE warm-starts do not provide poor initializations for STE.
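The purity score used in the MNIST experiment is simple to compute from the cluster and class label vectors; a small self-contained sketch (the toy labels below are illustrative):

```python
from collections import Counter

def purity(clusters, classes):
    """purity(Omega, C) = (1/n) * sum over clusters of the size of the
    largest intersection with any true class."""
    assert len(clusters) == len(classes)
    n = len(clusters)
    # group the true class labels by assigned cluster
    by_cluster = {}
    for w, c in zip(clusters, classes):
        by_cluster.setdefault(w, []).append(c)
    # each cluster contributes the count of its dominant class
    return sum(max(Counter(cs).values()) for cs in by_cluster.values()) / n

# toy example: 6 points, 2 clusters, 2 classes
clusters = [0, 0, 0, 1, 1, 1]
classes = ['a', 'a', 'b', 'b', 'b', 'a']
score = purity(clusters, classes)  # dominant classes have 2 + 2 points -> 4/6
```

A purity of 1.0 means every cluster contains points of only one class; random assignments score close to the frequency of the largest class.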
On average, STE took 9.8770 ± 0.2566 seconds and LOE-STE took 8.0432 ± 0.1227 seconds, a 22% speedup. Note, however, that this experiment is not an ideal setting for LOE-STE, since n is small and the dataset consists of triplets sampled uniformly, resulting in very few usable triplets for LOE.

7 Conclusion
We proposed a novel ordinal embedding procedure which avoids a direct optimization approach and instead leverages the low-dimensionality of the embedding to learn a low-rank factorization. This leads to a multi-stage approach in which each stage is computationally efficient, depending linearly on n, but results in a loss of statistical efficiency. However, our experimental results show that this method can still be advantageous in settings where one may wish to sacrifice a small amount of accuracy for significant computational savings. This can be the case if either (i) data is highly abundant, so that the accuracy of LOE is comparable to that of other, more expensive methods, or (ii) data is less abundant and LOE is used to warm-start more accurate methods. Furthermore, we showed that LOE is guaranteed to recover the embedding and gave sample complexity rates for learning the embedding. Understanding these rates more carefully, by more precisely characterizing the effect of varying the number of landmarks, may be interesting, since more landmarks lead to better stability but at the cost of fewer triplets per column. Additionally, extending work on active ranking into our framework may provide a useful method for active ordinal embedding. Applying our method to massive real-world datasets, such as those arising in NLP or networks, may provide interesting insights into datasets for which previous ordinal embedding methods would be infeasible.

Acknowledgements  Nikhil Ghosh was supported in part by a Caltech Summer Undergraduate Research Fellowship.
Yuxin Chen was supported in part by a Swiss NSF Early Mobility Postdoctoral Fellowship. This work was also supported in part by gifts from PIMCO and Bloomberg.

References

[1] Sameer Agarwal, Josh Wills, Lawrence Cayton, Gert Lanckriet, David Kriegman, and Serge Belongie. Generalized non-metric multidimensional scaling. In Artificial Intelligence and Statistics, pages 11–18, 2007.

[2] Eugene Agichtein, Eric Brill, and Susan Dumais. Improving web search ranking by incorporating user behavior information. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 19–26. ACM, 2006.

[3] Ingwer Borg and Patrick Groenen. Modern Multidimensional Scaling: Theory and Applications. Journal of Educational Measurement, 40(3):277–280, 2003.

[4] Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.

[5] Manuela Cattelan. Models for paired comparison data: A review with emphasis on dependent data. Statistical Science, pages 412–433, 2012.

[6] Yuxin Chen, Jianqing Fan, Cong Ma, and Kaizheng Wang. Spectral method and regularized MLE are both optimal for top-K ranking. arXiv preprint arXiv:1707.09971, 2017.

[7] Vin de Silva and Joshua B. Tenenbaum. Sparse multidimensional scaling using landmark points. Technical report, Stanford University, 2004.

[8] Johannes Fürnkranz and Eyke Hüllermeier. Preference Learning. Springer, 2010.

[9] Lalit Jain, Kevin G. Jamieson, and Rob Nowak. Finite sample prediction and recovery bounds for ordinal embedding. In Advances in Neural Information Processing Systems, pages 2711–2719, 2016.

[10] Thorsten Joachims. Optimizing search engines using clickthrough data.
In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 133–142. ACM, 2002.

[11] Thorsten Joachims, Laura A. Granka, Bing Pan, Helene Hembrooke, and Geri Gay. Accurately interpreting clickthrough data as implicit feedback. In SIGIR, volume 5, pages 154–161, 2005.

[12] I. L. Dryden and K. V. Mardia. Statistical Shape Analysis. Wiley, 2016.

[13] Matthäus Kleindessner and Ulrike von Luxburg. Uniqueness of ordinal embedding. In Conference on Learning Theory, pages 40–67, 2014.

[14] Matthäus Kleindessner and Ulrike von Luxburg. Kernel functions based on triplet comparisons. In Advances in Neural Information Processing Systems, pages 6807–6817, 2017.

[15] Tie-Yan Liu et al. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331, 2009.

[16] R. Duncan Luce. Individual Choice Behavior. 1959.

[17] Julian McAuley, Rahul Pandey, and Jure Leskovec. Inferring networks of substitutable and complementary products. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2015.

[18] Dohyung Park, Joe Neeman, Jin Zhang, Sujay Sanghavi, and Inderjit Dhillon. Preference completion: Large-scale collaborative ranking from pairwise comparisons. In International Conference on Machine Learning, pages 1907–1916, 2015.

[19] Seung-Taek Park and Wei Chu. Pairwise preference regression for cold-start recommendation. In Proceedings of the Third ACM Conference on Recommender Systems, pages 21–28. ACM, 2009.

[20] Omer Tamuz, Ce Liu, Serge Belongie, Ohad Shamir, and Adam Tauman Kalai. Adaptively learning the crowd kernel. In Proceedings of the 28th International Conference on Machine Learning, pages 673–680, 2011.

[21] Yoshikazu Terada and Ulrike von Luxburg. Local ordinal embedding.
In International Conference on Machine Learning, pages 847–855, 2014.

[22] Laurens van der Maaten and Kilian Weinberger. Stochastic triplet embedding. In 2012 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6, 2012.

[23] Michael J. Wilber, Iljung S. Kwak, and Serge J. Belongie. Cost-effective HITs for relative similarity comparisons. In Second AAAI Conference on Human Computation and Crowdsourcing, 2014.

[24] Yisong Yue, Chong Wang, Khalid El-Arini, and Carlos Guestrin. Personalized collaborative clustering. In Proceedings of the 23rd International Conference on World Wide Web, pages 75–84. ACM, 2014.