{"title": "Hierarchical Optimal Transport for Multimodal Distribution Alignment", "book": "Advances in Neural Information Processing Systems", "page_first": 13474, "page_last": 13484, "abstract": "In many machine learning applications, it is necessary to meaningfully aggregate, through alignment, different but related datasets. Optimal transport (OT)-based approaches pose alignment as a divergence minimization problem: the aim is to transform a source dataset to match a target dataset using the Wasserstein distance as a divergence measure. We introduce a hierarchical formulation of OT which leverages clustered structure in data to improve alignment in noisy, ambiguous, or multimodal settings. To solve this numerically, we propose a distributed ADMM algorithm that also exploits the Sinkhorn distance, thus it has an efficient computational complexity that scales quadratically with the size of the largest cluster. When the transformation between two datasets is unitary, we provide performance guarantees that describe when and how well aligned cluster correspondences can be recovered with our formulation, as well as provide worst-case dataset geometry for such a strategy. We apply this method to synthetic datasets that model data as mixtures of low-rank Gaussians and study the impact that different geometric properties of the data have on alignment. Next, we applied our approach to a neural decoding application where the goal is to predict movement directions and instantaneous velocities from populations of neurons in the macaque primary motor cortex. Our results demonstrate that when clustered structure exists in datasets, and is consistent across trials or time points, a hierarchical alignment strategy that leverages such structure can provide significant improvements in cross-domain alignment.", "full_text": "Hierarchical Optimal Transport for\nMultimodal Distribution Alignment\n\nJohn Lee\u2020\u21e4, Max Dabagia\u2020, Eva L. Dyer\u2020\u2021\u00a7, Christopher J. 
Rozell\u2020\u00a7\n\n\u2020School of Electrical and Computer Engineering,\n\u2021Coulter Department of Biomedical Engineering\n\nGeorgia Institute of Technology, Atlanta, GA, 30332 USA\n\n{john.lee, maxdabagia, evadyer, crozell}@gatech.edu\n\nAbstract\n\nIn many machine learning applications, it is necessary to meaningfully aggregate,\nthrough alignment, different but related datasets. Optimal transport (OT)-based\napproaches pose alignment as a divergence minimization problem: the aim is to\ntransform a source dataset to match a target dataset using the Wasserstein distance\nas a divergence measure under alignment constraints. We introduce a hierarchical\nformulation of OT which leverages clustered structure in data to improve alignment\nin noisy, ambiguous, or multimodal settings. To solve this numerically, we propose\na distributed ADMM algorithm that exploits the Sinkhorn distance, thus it has an\nef\ufb01cient computational complexity that scales quadratically with the size of the\nlargest cluster. When the transformation between two datasets is unitary, we provide\nperformance guarantees that describe when and how well cluster correspondences\ncan be recovered with our formulation, and then describe the worst-case dataset\ngeometry for such a strategy. We apply this method to synthetic datasets that\nmodel data as mixtures of low-rank Gaussians and study the impact that different\ngeometric properties of the data have on alignment. Next, we applied our approach\nto a neural decoding application where the goal is to predict movement directions\nand instantaneous velocities from populations of neurons in the macaque primary\nmotor cortex. 
Our results demonstrate that when clustered structure exists in\ndatasets, and is consistent across trials or time points, a hierarchical alignment\nstrategy that leverages such structure can provide signi\ufb01cant improvements in\ncross-domain alignment.\n\nIntroduction\n\n1\nIn many machine learning applications, it is necessary to meaningfully aggregate, through alignment,\ndifferent but related datasets (e.g., data across time points or under different conditions or contexts).\nAlignment is an important problem at the heart of transfer learning [1, 2], point set registration [3, 4, 5],\nand shape analysis [6, 7, 8], but is generally NP hard. In recent years, distribution alignment methods\nthat use optimal transport (OT) to quantify similarity between two distributions have increased in\npopularity due to their attractive mathematical properties and impressive performance in a variety of\ntasks [9, 10]. However, using OT to solve unsupervised distribution alignment problems that must\nsimultaneously match two datasets\u2019 distributions (using OT) while also learning a transformation\nbetween their latent spaces, is extremely challenging, especially when the data has complicated\nmulti-modal structure. Leveraging additional structure in the problem is thus necessary to regularize\nOT and constrain the solution space.\nHere, we leverage the fact that heterogeneous datasets often admit clustered or multi-subspace struc-\nture to improve OT-based distribution alignment. Our solution to this problem is to simultaneously\n\n\u21e4JL is currently with DSO National Laboratories of Singapore.\n\u00a7Equal contributing senior authors.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\festimate the cluster alignment across two datasets using their local geometry, while also solving a\nglobal alignment problem to meld these local estimates. 
While it is advantageous to regularize the OT\nproblem with known cluster pairings [10, 11], we are instead concerned with the substantially harder\nunsupervised setting where such information is missing. We introduce a hierarchical formulation of\nOT for clustered and multi-subspace datasets called Hierarchical Wasserstein Alignment (HiWA)3.\nWe empirically show that when data are well approximated with Gaussian mixture models (GMMs)\nor lie on a union of subspaces, we may leverage existing clustering pipelines (e.g., sparse subspace\nclustering [12] [13]) to improve alignment. When the transformation between datasets is unitary,\nwe provide analyses that reveal key geometric and sampling insights, as well as perturbation and\nfailure mode analyses. To solve the problem numerically, we propose an ef\ufb01cient distributed ADMM\nalgorithm that also exploits the Sinkhorn distance, thus bene\ufb01ting from ef\ufb01cient computational\ncomplexity that scales quadratically with the size of the largest cluster.\nTo test and benchmark our approach, we applied it to synthetic data generated from mixtures of\nlow-rank Gaussians and studied the impact of different geometric properties of the data on alignment\nto con\ufb01rm the predictions of our theoretical analysis. Next, we applied our approach to a neural\ndecoding application where the goal is to predict movement directions from populations of neurons\nin the macaque primary motor cortex. Our results demonstrate that when clustered structure exists\nin neural datasets and is consistent across trials or time points, a hierarchical alignment strategy\nthat leverages such structure can provide signi\ufb01cant improvements in unsupervised decoding from\nambiguous (symmetric) movement patterns. 
This suggests OT can be applied to a wider range of\nneural datasets, and shows that a hierarchical strategy avoids local minima encountered by a global\nalignment strategy that ignores clustered structure.\n2 Background and related work\nTransfer learning and distribution alignment. A fundamental goal in transfer learning is to\naggregate related datasets by learning a mapping between them. We wish to learn a transformation\nT ∈ 𝒯, where 𝒯 refers to some class of transformations, that aligns distributions under a notion of\nprobability divergence D(·|·) between a target distribution µ and a reference (source) distribution ν:\n\nmin_{T ∈ 𝒯} D(T(µ) | ν).  (1)\n\nVarious probability divergences have been proposed in the literature, such as Euclidean least-squares\n(when the data ordering is known) [14, 15, 16], Kullback-Leibler (KL) [17], maximum mean discrepancy\n(MMD) [18, 19, 20, 21], and the Wasserstein distance [10], where the trade-offs are often statistical\n(e.g., consistency, sample complexity) versus computational. Alignment problems are ill-posed since\nthe space 𝒯 is large, so a priori structure is often necessary to constrain 𝒯 based on geometric\nassumptions. Compact manifolds like the Grassmann or Stiefel [22, 23] are primary choices when\nlittle information is present, as they preserve isometry. Non-isometric transformations, though richer,\ndemand much more structure (e.g., manifold or graph structure) [24, 25, 26, 27, 10].\nLow-rank and union of subspaces models. Principal components analysis (PCA), one of the most\npopular methods in data science, assumes a low-rank model where the top-k principal components of\na dataset provide the optimal rank-k approximation under a Euclidean loss. This has been extended\nto robust (sparse errors) settings [12], and to multi- (union of) subspaces settings where data can be\npartitioned into disjoint subsets, each of which is locally low-rank [28]. 
Transfer learning\nmethods based on subspace alignment [29, 30, 31] work well with zero-mean unimodal datasets, but\nstruggle on more complicated modalities (e.g., Gaussian mixtures or unions of subspaces) due to a\nmixing of covariances. Related to our work, [32] performs multi-subspace alignment by greedily\nassigning correspondences between subspaces using chordal distances; this, however, discards valuable\ninformation about a distribution's shape.\nOptimal transport. Optimal transport (OT) [33] is a natural type of divergence for registration\nproblems because it accounts for the underlying geometry of the space. In Euclidean settings, OT\ngives rise to a metric known as the Wasserstein distance W(µ, ν), which measures the minimum effort\nrequired to “displace” points across the measures µ and ν (understood here as empirical point clouds).\nTherefore, OT relieves the need for kernel estimation to create an overlapping support of the measures\nµ, ν.\n\n3MATLAB code can be found at https://github.com/siplab-gt/hiwa-matlab. Neural datasets and Python code\nare provided at http://nerdslab.github.io/neuralign\n\nDespite this attractive property, OT has both a poor numerical complexity of O(n^3 log n) (where\nn is the sample size) and a dimension-dependent sample complexity of O(n^{-1/d}), where the data\ndimension is d [34, 35]. Recently, an entropically regularized version of OT known as the Sinkhorn\ndistance [36] has emerged as a compelling divergence measure; it not only inherits OT's geometric\nproperties but also has superior computational and sample complexities of O(n^2) and O(n^{-1/2})^4,\nrespectively. It has also become a versatile building block in domain adaptation [10, 38]. 
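For concreteness, the Sinkhorn iterations behind this distance can be sketched in a few lines of NumPy. This is our own minimal illustration (with a standard subtract-the-minimum stabilization), not the authors' released code:

```python
import numpy as np

def sinkhorn(C, lam, n_iter=200):
    """Entropically regularized OT over the uniform transport polytope.

    Returns a coupling P whose rows sum to 1/m and columns to 1/n;
    larger lam yields smoother (higher-entropy) couplings.
    """
    m, n = C.shape
    K = np.exp(-(C - C.min()) / lam)   # Gibbs kernel (stabilized)
    u, v = np.ones(m), np.ones(n)
    for _ in range(n_iter):            # alternate the two marginal scalings
        u = (np.ones(m) / m) / (K @ v)
        v = (np.ones(n) / n) / (K.T @ u)
    return np.diag(u) @ K @ np.diag(v)

# toy example: two nearly identical point sets on a line
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.1, 1.1, 2.1])
C = (x[:, None] - y[None, :]) ** 2     # squared Euclidean cost matrix
P = sinkhorn(C, lam=0.05)              # mass concentrates near the diagonal
```

At small lam the coupling approaches the unregularized OT plan; at large lam it approaches the independent (uniform) coupling.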
Prior art\n[10] has largely exploited the OT push-forward as the alignment map, since this map minimizes\nthe OT cost between the source and target distributions while allowing a priori structure to be easily\nincorporated (e.g., to preserve label/graphical integrity). Such an approach, however, is fundamentally\nexpensive when d ≪ n, since the primary optimization variable is a large transport coupling (i.e., in\nR^{n×n}), while in reality the alignment mapping is merely R^d → R^d. Moreover, it assumes that the\nsource and target distributions are close in terms of their squared Euclidean distance (i.e., an identity\ntransformation), but this does not generally hold between arbitrary latent spaces.\nHierarchical OT and related work. The idea of learning an affine or unitary transformation to align\ndatasets with an OT-based divergence has previously been studied in [39, 40, 41], a problem known\nas OT Procrustes. However, these methods do not use problem-specific or clustered structure in data.\nHierarchical OT is a recent generalization of OT [42, 43, 44] that is an effective and efficient way of\ninjecting structure into OT, but it has never been used to jointly solve alignment problems; our work\nrepresents a first attempt at doing so. Thus, a key contribution of this paper is putting these two\ningredients together to develop a scalable strategy that leverages multimodal structure in data to\nsolve the OT Procrustes problem.\n3 Hierarchical Wasserstein alignment\nPreliminaries and notation. Consider clustered datasets {X_i ∈ R^{D×n_{x,i}}}_{i=1}^S and {Y_j ∈ R^{D×n_{y,j}}}_{j=1}^S, whose clusters are denoted by the indices i, j and whose columns are treated\nas R^D embedding coordinates. The number of samples in the i-th (j-th) cluster of dataset X\n(dataset Y) is given by n_{x,i} (n_{y,j}). 
We express the empirical measures of clusters X_i and Y_j as\nµ_i := (1/n_{x,i}) Σ_{k=1}^{n_{x,i}} δ_{X_i(k)} and ν_j := (1/n_{y,j}) Σ_{l=1}^{n_{y,j}} δ_{Y_j(l)}, respectively, where δ_x refers to a point mass\nlocated at coordinate x ∈ R^D. The squared 2-Wasserstein distance between µ_i and ν_j is defined as\n\nW_2^2(µ_i, ν_j) := min_{Q ∈ U(n_{x,i}, n_{y,j})} Σ_{k=1}^{n_{x,i}} Σ_{l=1}^{n_{y,j}} Q(k, l) ‖X_i(k) − Y_j(l)‖²,\n\nwhere Q is a doubly stochastic matrix that encodes point-wise correspondences (i.e., the (k, l)-th\nentry describes the flow of mass between X_i(k) and Y_j(l)), X_i(k) is the k-th column of matrix\nX_i, and the constraint U(m, n) := {Q ∈ R_+^{m×n} : Q 1_n = 1_m/m, Qᵀ 1_m = 1_n/n} refers to the\nuniform transport polytope (with 1_m a length-m vector containing ones). We will use ‖·‖ to denote\nthe operator norm, X† to denote the pseudo-inverse of X, and I_d to denote the d × d identity matrix.\nOverview. Although unsupervised alignment is challenging due to the presence of local minima, the\nimposition of additional structure helps to prune them away. Our key insight is that hierarchical\nstructure decomposes a complicated optimization surface into simpler ones that are less prone to\nlocal minima. We formulate a hierarchical Wasserstein approach to align datasets with known (or\nestimated) clusters {µ_i}_{i=1}^S, {ν_j}_{j=1}^S whose correspondences are unknown. The task therefore is\nto jointly learn the alignment T and the cluster correspondences:\n\nmin_{P ∈ B_S, T ∈ 𝒯} Σ_{i=1}^S Σ_{j=1}^S P_ij W_2^2(T(µ_i), ν_j),  (2)\n\nwhere the matrix P encodes the strength of correspondences between clusters, with a large P_ij value\nindicating a correspondence between clusters i, j, and a small value indicating a lack thereof. 
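To make the nested structure of (2) concrete, the sketch below (our own toy construction, with the transformation T fixed to the identity and Sinkhorn standing in as the inner solver) builds the outer S × S cost matrix from inner cluster-to-cluster OT costs, then resolves cluster correspondences at the outer level:

```python
import numpy as np

def sinkhorn(C, lam, n_iter=300):
    # entropically smoothed OT over the uniform transport polytope
    m, n = C.shape
    K = np.exp(-(C - C.min()) / lam)   # stabilized Gibbs kernel
    u, v = np.ones(m), np.ones(n)
    for _ in range(n_iter):
        u = (np.ones(m) / m) / (K @ v)
        v = (np.ones(n) / n) / (K.T @ u)
    return np.diag(u) @ K @ np.diag(v)

def cluster_cost(Xi, Yj, lam=0.1):
    # inner level: entropic W_2^2 between two clusters (columns are points)
    C = ((Xi[:, :, None] - Yj[:, None, :]) ** 2).sum(axis=0)
    return (sinkhorn(C, lam) * C).sum()

rng = np.random.default_rng(0)
# two 2-cluster datasets; Y's clusters are X's clusters, permuted and jittered
X = [rng.normal(0.0, 0.1, (2, 30)), rng.normal(5.0, 0.1, (2, 30))]
Y = [rng.normal(5.0, 0.1, (2, 30)), rng.normal(0.0, 0.1, (2, 30))]
S = 2
C_outer = np.array([[cluster_cost(X[i], Y[j]) for j in range(S)]
                    for i in range(S)])
P_clusters = sinkhorn(C_outer, lam=0.5)  # outer level: cluster correspondences
```

Here P_clusters places its mass on the anti-diagonal, recovering the permuted cluster pairing; the full method additionally optimizes the Stiefel transformation inside each inner cost.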
We\nnote that B_S := U(S, S) is a special type of transport polytope known as the S-th Birkhoff polytope.\nInterestingly, this becomes a nested (or block) OT formulation, where correspondences are resolved\nat two levels: the outer level resolves cluster correspondences (via P) while the inner level resolves\npoint-wise correspondences between cluster points (via the Wasserstein distance).\nAlignment over the Stiefel manifold. Assuming clusters lie on subspaces and principal angles\nbetween subspaces are “well preserved” across X and Y (we make this precise in Theorem 4.2), an\nisometric transformation suffices. Hence, we solve (2) with 𝒯 = V_{D,D}, the Stiefel manifold, which\nis defined as V_{k,d} := {R ∈ R^{k×d} : RᵀR = I_d}. Explicitly, we can re-formulate equation (2) as:\n\nmin_{P, R, {Q_ij}} Σ_{i,j} P_ij C_ij(R, Q_ij)  s.t. P ∈ B_S, R ∈ V_{D,D}, Q_ij ∈ U(n_{x,i}, n_{y,j}),  (3)\n\nwhere\n\nC_ij(R, Q_ij) := (1/D) Σ_{k,l} Q_ij(k, l) ‖R X_i(k) − Y_j(l)‖²  (4)\n\nmeasures pairwise cluster divergences using the squared 2-Wasserstein distance under a Stiefel\ntransformation R acting on the i-th cluster.\nFinally, we include entropic regularization over the transportation couplings P and all Q_ij's to turn\nthe Wasserstein distances into Sinkhorn distances, so as to take advantage of their superior computational\nand sample complexities. Omitting constraints for brevity, our final problem is given as\n\nmin_{P, R, {Q_ij}} Σ_{i,j} ( P_ij C_ij(R, Q_ij) + λ2 H(Q_ij) ) + λ1 H(P),  (5)\n\nwhere λ1, λ2 > 0 are the entropic regularization parameters and the negative entropy function is\ndefined as H(P) := Σ_{i,j} P_ij log P_ij. Parameters λ1, λ2 control the correspondence entropy;\ntherefore (5) approximates (3) when λ1, λ2 > 0, but reverts to the original problem (3) as λ1, λ2 → 0.\n\n4Dependent on a regularization parameter [37].\n\nDistributed ADMM approach. 
Problem (5) is non-convex due to multilinearity in the objective\nand its Stiefel manifold domain. Although the alternating directions method of multipliers (ADMM) is a\nconvergent convex solver framework [45, 46], it is being applied in increasingly many non-convex\nsettings [47]. Since (5) readily admits a splitting structure that separates the individual C_ij blocks,\nwe develop a distributed ADMM approach. We proceed to split (5) as follows:\n\nmin_{P, R̃, {R_ij, Q_ij}} Σ_{i,j} ( P_ij C_ij(R_ij, Q_ij) + λ2 H(Q_ij) ) + λ1 H(P)  s.t. R_ij = R̃, ∀i, j,\n\nnoting that the set constraints are omitted for brevity. The augmented Lagrangian is given by\n\nL_µ = Σ_{i,j} ( P_ij C_ij(R_ij, Q_ij) + ⟨Λ_ij, R_ij − R̃⟩ + (µ/2D) ‖R_ij − R̃‖_F² + λ2 H(Q_ij) ) + λ1 H(P),\n\nwhere µ > 0 is the ADMM parameter and {Λ_ij} are Lagrange multipliers. Full details of the\nupdate steps are included in the Supplementary Material. The algorithm may be summarized in\ntwo steps (Alg. 1): (i) a distributed step that asks all cluster pairs to individually find their optimal\ntransformations R_ij in parallel, and (ii) a consensus step that aggregates all the locally estimated\ntransformations according to a weighting proportional to the correspondence strengths P_ij.\nParameters. The entropic parameters λ1, λ2 relax the one-to-one cluster correspondence assumption,\nbalancing a trade-off between alignment precision (small λ) and sample complexity (large λ).\nNumerically, the negative entropy adds strong convexity to the program, reducing sensitivity to\nperturbations at the cost of a slower convergence rate. The ADMM parameter µ controls the ‘strength’\nof the consensus or, from an algorithmic viewpoint, the gradient step size.\nDistributed consensus. Update steps for Q_ij, R_ij, Λ_ij can be performed in parallel over all cluster\npairs (S² in total), making the algorithm amenable to a distributed implementation. 
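The consensus step amounts to an orthogonal Procrustes projection: the closest orthogonal matrix (in Frobenius norm) to the aggregated local estimates is the polar factor of their sum. A minimal sketch of this projection (our own names, square D × D case):

```python
import numpy as np

def stiefel_projection(A):
    """Project A onto the orthogonal matrices: with SVD A = U S V^T,
    the minimizer of ||R - A||_F subject to R^T R = I is R = U V^T."""
    U, _, Vt = np.linalg.svd(A)
    return U @ Vt

rng = np.random.default_rng(1)
R_true = stiefel_projection(rng.normal(size=(3, 3)))  # ground-truth rotation
# consensus: sum noisy local transform estimates, then re-orthogonalize
local_estimates = [R_true + 0.05 * rng.normal(size=(3, 3)) for _ in range(4)]
R_hat = stiefel_projection(sum(local_estimates))
```

R_hat is exactly orthogonal by construction and close to R_true; in Algorithm 1 the aggregation additionally incorporates the dual variables Λ_ij.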
The runtime complexity of\nthis algorithm is presented in the Supplementary Material.\nRobustness against initial conditions. We intentionally build robustness against initial conditions\nby ordering the updates for R_ij and Q_ij before P, such that when µ is sufficiently small, the ADMM\nsequence is influenced more by the data than by the initial conditions.\n\nAlgorithm 1 Hierarchical Wasserstein Alignment (HiWA)\n1: procedure HIERARCHICALWASSERSTEINALIGNMENT(λ1, λ2, µ, {X_i}_{i=1}^S, {Y_j}_{j=1}^S)\n2:  R ← random V_{D,D}, P ← 1_S 1_Sᵀ/S², Λ_ij ← 0, Q_ij ← 1_{n_{x,i}} 1_{n_{y,j}}ᵀ/(n_{x,i} n_{y,j}), ∀i, j  ▷ Initialization\n3:  while not converged do\n4:   while not converged do\n5:    for all i, j in parallel do\n6:     R_ij ← STIEFELALIGNMENT(2 P_ij Y_j Q_ijᵀ X_iᵀ + µ(R − Λ_ij))\n7:     Q_ij ← SINKHORN(λ2/P_ij, C(k, l) ← (1/D) ‖R_ij X_i(k) − Y_j(l)‖²)\n8:    end for\n9:    P ← SINKHORN(λ1, C(i, j) ← C_ij(R_ij, Q_ij))\n10:    R ← STIEFELALIGNMENT(Σ_{i,j} R_ij + Λ_ij)\n11:    Λ_ij ← Λ_ij + R_ij − R, ∀i, j\n12:   end while\n13:  end while\n14: end procedure\n\n1: procedure SINKHORN(λ, C ∈ R^{m×n})\n2:  K ← exp(−C/λ), v ← 1_n\n3:  while not converged do\n4:   u ← 1_m ⊘ Kv\n5:   v ← 1_n ⊘ Kᵀu\n6:  end while\n7:  P ← diag(u) K diag(v)\n8: end procedure\n\n1: procedure STIEFELALIGNMENT(A)\n2:  (U, Σ, V) ← SVD(A)\n3:  R ← UVᵀ\n4: end procedure\n\nNotation: ⊘: elementwise division; exp(·): elementwise exponential; diag(·): diagonal matrix of argument.\n\n4 Theoretical guarantees for cluster-based alignment\nWhile the previous section explains how to align clustered datasets, in this section we aim to answer\nthe question of when and how well they can be aligned. We provide necessary conditions for cluster-\nbased alignability, as well as alignment perturbation bounds for equation (3)'s formulation.\nTo simplify our analysis, we make the following assumptions: (i) each cluster contains the\nsame number of datapoints n; (ii) the ground truth cluster correspondences are P⋆ = I_S/S (i.e.,\ndiagonal containing 1/S). However, this analysis can be extended to the case where the number of\npoints is unequal without loss of generality. Detailed proofs are given in the Supplementary Material.\nThe following result is a criterion that, if met, ensures the existence of a global minimizer at the\ncluster correspondence P⋆. This criterion requires that matched clusters must be closer in Wasserstein\ndistance than mismatched clusters, according to a threshold determined by the Wasserstein distance's sample\ncomplexity (i.e., an asymptotic rate dependent on the clusters' sample sizes and intrinsic dimensions).\nSince these sample complexity results are based on the Wasserstein distance, we expect a less stringent\ncriterion when using the Sinkhorn distance in (5) (due to its superior sample complexity [37]).\nTheorem 4.1 (Correspondence disambiguity criterion). Let all clusters be strictly low-rank, where the\ndimension of the i-th cluster in the x-th dataset is d_{x,i}. Let d_{x,i}, d_{y,j} > 4, ∀i, j ∈ [S]. Define Ĉ⋆_ij :=\nmin_{R ∈ V_{D,D}, Q_ij ∈ B_n} C_ij(R, Q_ij). Problem (3) yields the solution P⋆ = I_S/S with probability at\nleast 1 − δ if, ∀i, j : i ≠ j, the following criterion is satisfied:\n\nĈ⋆_ij + Ĉ⋆_ji − Ĉ⋆_ii − Ĉ⋆_jj > B_{x,i}(δ) + B_{y,i}(δ) + B_{x,j}(δ) + B_{y,j}(δ),\n\nwhere B_{z,k}(δ) := c_{z,k} n^{−2/d_{z,k}} + √(log(1/δ)/2n) and c_{z,k} = 1458 (2 + 1/(3^{d_{z,k}/2−2} − 1)).\n\nProof sketch. The proof contains two parts. In the first part, we consider perturbation conditions on\nthe cost matrix C in a (non-variational) optimal transport program over the Birkhoff polytope. To\nbe unperturbed from P⋆ = I_S/S, we require that C_ij + C_ji − C_ii − C_jj > 0, ∀i, j : i ≠ j. In the\nsecond part, we extend this condition to the finite-sample regime by utilizing recently developed\nconcentration bounds [35] for the p-Wasserstein distance, which essentially raise the disambiguity\nlower bound due to finite-sample uncertainty. (Supp. 
Material, Section 2)\n\nNow, even if we know the global correspondence P⋆, we still do not have the full picture of\nthe alignment's quality. For example, all matching clusters may have very similar covariances, but the\nprincipal angles between the clusters may be “distorted” across the datasets. Our next theorem gives\nan upper bound on the alignment error (for unitary transformations) and makes precise the notion of\nglobal structure distortion.\n\nTheorem 4.2 (Cluster-based alignment perturbation bounds). Consider data matrices {X_i, Y_i ∈ R^{D×n}}_{i=1}^c with known point-wise correspondence matrices {Q_ii ∈ B_n}_{i=1}^c. Define the matrices\n\nX := [X_1Q_11, X_2Q_22, . . . , X_cQ_cc],  Y := [Y_1, Y_2, . . . , Y_c],\n\nand set ε² := ‖YᵀY − XᵀX‖_F. If the criterion stated in Theorem 4.1 is satisfied, X is full row rank,\nand ε‖X†‖ ≤ 1/(√2 (‖X‖‖X†‖)^{1/2}), then\n\nmin_{P ∈ B_c, R ∈ V_{D,D}} Σ_{i,j} P_ij C_ij(R) ≤ (‖X‖‖X†‖ + 2)² ‖X†‖² ε⁴ + Δ_D,\n\nwhere Δ_D = Σ_{i=1}^c tr(X_i(I/n − Q_iiQ_iiᵀ)X_iᵀ + (1/n − 1)Y_iY_iᵀ) is a data-dependent constant.\n\nProof sketch. We utilize a recent perturbation result on the Procrustes problem (with a Frobenius-\nnorm objective) by Arias-Castro et al. [48] and adapt it to our squared 2-Wasserstein objective.\n(Supp. Material, Section 3)\n\nNote that ε plays a major role in the alignment error bound and quantifies the notion of global structure\ndistortion, which allows us to understand how phenomena like covariate shift or misclustering\nimpact alignment. To shed some light in this regard, we consider a simple analysis of a cluster\npair's error contribution to ε, denoted ε_ij. Consider the decomposition of the (i, j)-th block of\nthe Gramians related to clusters i and j, where the respective singular value decompositions are\nX_iQ_ii = A_iΣ_{x,i}Vᵀ and Y_j = B_jΣ_{y,j}Vᵀ. 
Defining the blockwise error between clusters i, j as\n\nε_ij := ‖Y_iᵀY_j − Q_iiᵀX_iᵀX_jQ_jj‖_F = ‖Σ_{y,i}B_iᵀB_jΣ_{y,j} − Σ_{x,i}A_iᵀA_jΣ_{x,j}‖_F,\n\ntwo components stand out: (i) angular shift, which is characterized by differences in the principal angles\nbetween B_iᵀB_j and A_iᵀA_j, and (ii) spectral shift, which is characterized by differences in the spectra.\nFinally, we show that the subspace configuration of a dataset's clusters can also affect alignment.\nPretend for a moment that external alignment information were present to aid in the disambiguation\nbetween two clusters. The following lemma tells us when such information is useless (proof in\nSupp. Material, Section 4).\nLemma 4.3 (Uninformative alignment). Consider clusters X_i, Y_j ∈ R^{D×n} and known point-wise\ncorrespondences Q_ij ∈ U(n, n). Denote the left and right singular vectors of Y_jQ_ijᵀX_iᵀ associated\nwith the non-zero singular values by Ũ, Ṽ ∈ R^{D×r} with r ≤ D. Define the set of orthogonal\ntransformations that are constrained to agree with known angular directions as\n\nT(U_0, V_0) := {R ∈ R^{D×D} : RᵀR = I, RV_0 = U_0}.\n\nGiven U_0, V_0 ∈ R^{D×r_0} with r_0 ≤ D, we have\n\nmin_{R ∈ T(U_0, V_0)} C_ij(R) ≥ min_{R ∈ V_{D,D}} C_ij(R),  (6)\n\nwith equality holding when ⟨Ũ, U_0⟩ = ⟨Ṽ, V_0⟩.\nA direct consequence of this lemma is the following: when a dataset has equally-spaced subspaces,\nit has a maximally uninformative geometric configuration, since angular information from other\nclusters (i.e., U_0, V_0) can never increase the inter-cluster distance C_ij (i.e., equality in (6) always\nholds); it is hence a worst-case scenario for alignment. 
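The role of the global distortion in Theorem 4.2 can be checked numerically: a unitary transformation leaves the Gramian YᵀY untouched, while a non-isometric scaling does not. A toy sketch (our own names and data, not from the paper):

```python
import numpy as np

def gram_distortion(X, Y):
    # Theorem 4.2's global structure distortion: eps^2 := ||Y^T Y - X^T X||_F
    return np.linalg.norm(Y.T @ Y - X.T @ X)

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 40))                  # corresponded, concatenated clusters
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # a random orthogonal transform

eps2_iso = gram_distortion(X, R @ X)                             # isometry: ~0
eps2_sheared = gram_distortion(X, np.diag([1.0, 1.0, 2.0]) @ X)  # non-isometry
```

A near-zero distortion indicates the two datasets differ by (at most) a rotation plus sampling noise, the regime in which the perturbation bound is informative.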
This also explains why alignment in very\nhigh-dimensional spaces is harder: all subspaces may be orthogonal to each other, and hence offer no\n“geometric” advantage.\n5 Numerical experiments\n5.1 Synthetic low-rank Gaussian mixture dataset\nIn this section, we validate our method and demonstrate its limiting characteristics under\nsymmetric-subspace and finite-sample regimes. To generate our synthetic data, we repeat the\nfollowing procedure for each of the S clusters. We first randomly generate Gaussian distribution\nparameters µ_i ∈ R^d, Σ_i ∈ R^{d×d} with Σ_i ⪰ 0 (positive semi-definite), then randomly sample n data points\nfrom this distribution, and finally project them into a random subspace V_i ∈ R^{D×d} in a D > d\ndimensional embedding. In these experiments, we assume that the clusters are known but the\ncluster correspondence across datasets is unknown.\n\nFigure 1: Synthetic experiments. HiWA was tested in two subspace configurations (a, b): randomly-spaced\n(average-case, solid) versus equally-spaced (worst-case, dashed) for S = 5, d = 2, D = 6, n = {25, 100},\nwhere S is the number of clusters, d the dimension of each cluster, D the embedding dimension, and n the\nsample size. As we expect, performance in terms of the (a) alignment and (b) correspondence error is better\nin the average (vs. worst) case. In (c, d), we report (c) alignment and (d) correspondence errors as d and n\nvary, and report the errors' 25th/50th/75th percentiles. In (e, f), we show ablation results (50 trials, no random\nrestarts permitted) for semi-supervised HiWA (known clusters), completely unsupervised HiWA-SSC (unknown\nclusters), non-structured Wasserstein alignment (WA), subspace alignment methods (SA [29], CORAL [31]),\nand iterative closest point (ICP) [49] for n = 50, d = 2, and (e) S = 5, D = 6, and (f) S = 2, D = 2.\n\n
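The generation procedure described above can be sketched as follows (our own function names; the parameters match the S = 5, d = 2, D = 6 configuration used in Figure 1):

```python
import numpy as np

def sample_low_rank_cluster(rng, n, d, D):
    """One cluster: draw n points from a random Gaussian in R^d, then
    embed them into a random d-dimensional subspace of R^D."""
    mu = rng.normal(size=d)
    A = rng.normal(size=(d, d))
    Sigma = A @ A.T                                   # random PSD covariance
    Z = rng.multivariate_normal(mu, Sigma, size=n).T  # d x n points
    V, _ = np.linalg.qr(rng.normal(size=(D, d)))      # orthonormal subspace basis
    return V @ Z                                      # D x n, rank <= d

rng = np.random.default_rng(3)
clusters = [sample_low_rank_cluster(rng, n=100, d=2, D=6) for _ in range(5)]
```

Each cluster is a D × n matrix of rank at most d, so the union of clusters is a multi-subspace (mixture of low-rank Gaussians) dataset.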
We measure performance with respect to two\nmetrics: (i) alignment error, defined as the relative difference between the recovered and true\nrotations acting on the data, ‖R̂X − R⋆X‖_F² / ‖R⋆X‖_F², and (ii) correspondence error, defined as the\nsum of absolute differences between the recovered and true correspondences, Σ_{ij} |P̂ − P⋆|_ij.\nTo understand how global geometry impacts alignment, we applied HiWA in two different settings\n(Figure 1a-b): (i) a worst-case setting where subspaces are equally spaced, with the subspace similarity\n‖V_iᵀV_j‖ constant ∀i ≠ j, and (ii) a random setting where subspaces are randomly selected from\nthe Grassmann manifold. We observe that equally-spaced subspaces yield significantly inferior\nperformance compared to randomly-spaced subspaces, providing some evidence that equally\nspaced subspaces are indeed the worst-case scenario in alignment, as suggested by Lemma 4.3.\nNext, we studied the effect of the dimension d and sample size n on the accuracy of alignment (Figure 1\n(c-d)). We tested HiWA across various dataset conditions by varying the parameters d = {2, 3, 4, 5} and\nn = {12, 25, 50, 100, 200} while approximately maintaining the average subspace correlations (i.e.,\nE‖V_iᵀV_j‖) by fixing the cluster count S = 5 and tuning D to control the subspace spacing. In both\ncases, sample complexities are better than the theoretical rate of O(n^{-1/d}), which is likely due to the\nSinkhorn distance's superior sample complexity. In Figure 1e-f, we conduct an ablation study and\nevaluate our algorithm against benchmark methods in transfer learning and point set registration in\ntwo settings: a simple one in low-d (e) and a harder one in higher-d (f). 
Specifically, we compare\nHiWA when clusters are known (but pairwise correspondences are unknown), HiWA with clustering\nvia sparse subspace clustering [12] (HiWA-SSC) to represent completely unsupervised alignment,\na Wasserstein alignment variant with no cluster structure (WA), which is akin to OT Procrustes\n[50, 39, 40, 41], subspace alignment [29], correlation alignment [31], and iterative closest point (ICP)\n[49]. HiWA exhibits the strongest performance, with HiWA-SSC trailing closely behind (since clusters\nare independently resolved), followed by WA, then the other algorithms. Subspace alignment methods\nhave remarkably poor performance in higher dimensions due to their inability to resolve subspace\nsign ambiguities, while ICP demonstrates its notorious dependence on good initial conditions. These\nresults indicate HiWA's strong robustness against initial conditions and good scaling properties.\n5.2 Neural population decoding example\nDecoding intent (e.g., where you want to move your arm) or evoked responses (e.g., what you are\nlooking at or listening to) directly from neural activity is a widely studied problem in neuroscience,\nand the first step in the design of a brain-machine interface (BMI). A critical challenge with BMIs\nis that neural decoders need to be recalibrated (or re-trained) due to drift in neural responses or\nelectrophysiology measurements/readouts [51]. A recent method for semi-supervised brain decoding\nfinds a transformation between projected neural responses and movements by solving a KL-divergence\n\nFigure 2: Results on neural decoding dataset: How distribution alignment is used to translate neural activity\ninto movement: low-dimensional embeddings of neural data are aligned with target movement patterns (a). 
In\n(b), we compare the performance (cluster correspondence) of HiWA, WA, and DAD as the number of points in\nthe source dataset decreases. Next, we compare the performance of HiWA with known and estimated clusters\n(via GMM). Movement patterns in which cluster separability is high and the geometry is preserved across\ndatasets can be aligned in both cases (green stars). Patterns where separability is low but geometry is useful can\nbe aligned when the cluster arrangements are known; these are denoted with yellow stars.\nminimization problem [52]. Using this approach, one could build robust decoders that work across\ndays and shifts in neural responses through alignment.\nWe test the utility of hierarchical alignment for neural decoding on datasets collected from the arm\nregion of the primary motor cortex of a non-human primate (NHP) during a center-out reaching task [52].\nAfter spike sorting and binning the data, we applied factor analysis to reduce the data dimensionality\nto 3D (source distribution) and applied HiWA to align the neural data to a 3D movement distribution\n(target distribution) (Figure 2). We compared its performance to (Procrustes) Wasserstein alignment\n(WA) without hierarchical structure, and to a baseline brute-force search method called distribution\nalignment decoding (DAD) [52]. We examined the prediction accuracy of the target reach direction\nfor the motor decoding task (i.e., the cluster classification accuracy).\nNext, we examined the impact of the sampling density (Figure 2b) on alignment performance. Our\nresults demonstrate that HiWA continues to produce consistent cluster correspondences (> 70%\naccuracy) even as the number of samples per cluster drops to 8. In comparison, DAD is competitive\nat larger sample sizes, but its performance rapidly drops off as sampling density decreases because it\nrequires estimating a distribution from samples. 
WA suffers from the presence of many local minima and fails to find the correct cluster correspondences. Our results suggest that HiWA consistently provides stable solutions, outperforming competitor methods for this application.

Finally, to study the impact of local and global geometry on whether an unlabeled source and target can be aligned, we applied HiWA to permutations of eight subsets of reach directions (movement patterns). When just two reach directions are considered (Figure 2c, Columns 1-4), global geometry becomes useless in determining the correct rotation. In this case, we observe that HiWA is only capable of consistent alignment when cluster asymmetries are sufficiently extreme in both the source and target. When three reach directions are considered (Figure 2c, Columns 5-8), the global geometry can be used, yet there still exist symmetrical cases where recovering the correct rotation is difficult without adequate local asymmetries or some supervised (labeled) data to match clusters. These results suggest that hierarchical structure can be critical in resolving ambiguities in alignment of globally symmetric movement distributions.

6 Conclusion

This paper introduces a new method for hierarchical alignment with Wasserstein distances, and provides an efficient numerical solution with analytical guarantees. We tested our method and compared its performance against other methods on a synthetic mixture model dataset and on a real neural decoding dataset. Future directions include extensions to non-rigid transformations, and applications to higher-dimensional neural datasets that do not rely on externally measured behavioral covariates.

Acknowledgments

JL was supported by DSO National Laboratories of Singapore, ED and MD were supported by NSF grant IIS-1755871, and CR was supported by NSF grant CCF-1409422 and CAREER award CCF-1350954.

References

[1] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning.
IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2009.

[2] Karl Weiss, Taghi M Khoshgoftaar, and DingDing Wang. A survey of transfer learning. Journal of Big Data, 3(1):9, 2016.

[3] Haili Chui and Anand Rangarajan. A new point matching algorithm for non-rigid registration. Computer Vision and Image Understanding, 89(2-3):114–141, 2003.

[4] Andriy Myronenko and Xubo Song. Point set registration: Coherent point drift. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(12):2262–2275, 2010.

[5] Gary KL Tam, Zhi-Quan Cheng, Yu-Kun Lai, Frank C Langbein, Yonghuai Liu, David Marshall, Ralph R Martin, Xian-Fang Sun, and Paul L Rosin. Registration of 3D point clouds and meshes: a survey from rigid to nonrigid. IEEE Transactions on Visualization and Computer Graphics, 19(7):1199–1217, 2013.

[6] Alexander M Bronstein, Michael M Bronstein, and Ron Kimmel. Generalized multidimensional scaling: a framework for isometry-invariant partial surface matching. Proceedings of the National Academy of Sciences, 103(5):1168–1172, 2006.

[7] Alexander M Bronstein, Michael M Bronstein, Leonidas J Guibas, and Maks Ovsjanikov. Shape Google: Geometric words and expressions for invariant shape retrieval. ACM Transactions on Graphics (TOG), 30(1):1, 2011.

[8] Maks Ovsjanikov, Mirela Ben-Chen, Justin Solomon, Adrian Butscher, and Leonidas Guibas. Functional maps: a flexible representation of maps between shapes. ACM Transactions on Graphics, 31(4):30, 2012.

[9] Gabriel Peyré and Marco Cuturi. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

[10] Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853–1865, 2017.

[11] Debasmit Das and CS George Lee.
Unsupervised domain adaptation using regularized hyper-graph matching. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 3758–3762. IEEE, 2018.

[12] Ehsan Elhamifar and Rene Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2765–2781, 2013.

[13] Eva L Dyer, Aswin C Sankaranarayanan, and Richard G Baraniuk. Greedy feature selection for subspace clustering. The Journal of Machine Learning Research, 14(1):2487–2517, 2013.

[14] Xiaoxiao Shi, Qi Liu, Wei Fan, S Yu Philip, and Ruixin Zhu. Transfer learning on heterogenous feature spaces via spectral transformation. In 2010 IEEE 10th International Conference on Data Mining (ICDM), pages 1049–1054. IEEE, 2010.

[15] Sumit Shekhar, Vishal M Patel, Hien V Nguyen, and Rama Chellappa. Generalized domain-adaptive dictionaries. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 361–368, 2013.

[16] Yahong Han, Fei Wu, Dacheng Tao, Jian Shao, Yueting Zhuang, and Jianmin Jiang. Sparse unsupervised dimensionality reduction for multiple view data. IEEE Transactions on Circuits and Systems for Video Technology, 22(10):1485, 2012.

[17] Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul V Buenau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems, pages 1433–1440, 2008.

[18] Sinno Jialin Pan, Ivor W Tsang, James T Kwok, and Qiang Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2011.

[19] Mahsa Baktashmotlagh, Mehrtash T Harandi, Brian C Lovell, and Mathieu Salzmann. Unsupervised domain adaptation by domain invariant projection.
In Proceedings of the IEEE International Conference on Computer Vision, pages 769–776, 2013.

[20] Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and Philip S Yu. Transfer joint matching for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1410–1417, 2014.

[21] Mingming Gong, Kun Zhang, Tongliang Liu, Dacheng Tao, Clark Glymour, and Bernhard Schölkopf. Domain adaptation with conditional transferable components. In Proceedings of the 33rd International Conference on Machine Learning, pages 2839–2848, 2016.

[22] Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. Domain adaptation for object recognition: An unsupervised approach. In IEEE International Conference on Computer Vision, pages 999–1006. IEEE, 2011.

[23] Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2066–2073. IEEE, 2012.

[24] Chang Wang and Sridhar Mahadevan. Manifold alignment using Procrustes analysis. In Proceedings of the 25th International Conference on Machine Learning, pages 1120–1127. ACM, 2008.

[25] Chang Wang and Sridhar Mahadevan. A general framework for manifold alignment. In 2009 AAAI Fall Symposium Series, 2009.

[26] Sira Ferradans, Nicolas Papadakis, Gabriel Peyré, and Jean-François Aujol. Regularized discrete optimal transport. SIAM Journal on Imaging Sciences, 7(3):1853–1882, 2014.

[27] Zhen Cui, Hong Chang, Shiguang Shan, and Xilin Chen. Generalized unsupervised manifold alignment. In Advances in Neural Information Processing Systems, pages 2429–2437, 2014.

[28] Yonina C Eldar and Moshe Mishali.
Robust recovery of signals from a structured union of subspaces. IEEE Transactions on Information Theory, 55(11):5302–5316, 2009.

[29] Basura Fernando, Amaury Habrard, Marc Sebban, and Tinne Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In Proceedings of the IEEE International Conference on Computer Vision, pages 2960–2967, 2013.

[30] Baochen Sun and Kate Saenko. Subspace distribution alignment for unsupervised domain adaptation. In BMVC, volume 4, pages 24–1, 2015.

[31] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[32] Kowshik Thopalli, Rushil Anirudh, Jayaraman J Thiagarajan, and Pavan Turaga. Multiple subspace alignment improves domain adaptation. arXiv preprint arXiv:1811.04491, 2018.

[33] Leonid Vitalevich Kantorovich. On a problem of Monge. Journal of Mathematical Sciences, 133(4):1383–1383, 2006.

[34] Richard M Dudley. The speed of mean Glivenko–Cantelli convergence. The Annals of Mathematical Statistics, 40(1):40–50, 1969.

[35] Jonathan Weed and Francis Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. arXiv preprint arXiv:1707.00087, 2017.

[36] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.

[37] Aude Genevay, Lénaic Chizat, Francis Bach, Marco Cuturi, and Gabriel Peyré. Sample complexity of Sinkhorn divergences. arXiv preprint arXiv:1810.02733, 2018.

[38] Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. In Advances in Neural Information Processing Systems, pages 3730–3739, 2017.

[39] Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun.
Earth mover's distance minimization for unsupervised bilingual lexicon induction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1934–1945, 2017.

[40] David Alvarez-Melis, Stefanie Jegelka, and Tommi S Jaakkola. Towards optimal transport with global invariances. arXiv preprint arXiv:1806.09277, 2018.

[41] Edouard Grave, Armand Joulin, and Quentin Berthet. Unsupervised alignment of embeddings with Wasserstein Procrustes. arXiv preprint arXiv:1805.11222, 2018.

[42] Mikhail Yurochkin, Sebastian Claici, Edward Chien, Farzaneh Mirzazadeh, and Justin Solomon. Hierarchical optimal transport for document representation. arXiv preprint arXiv:1906.10827, 2019.

[43] Bernhard Schmitzer and Christoph Schnörr. A hierarchical approach to optimal transport. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 452–464. Springer, 2013.

[44] David Alvarez-Melis, Tommi S Jaakkola, and Stefanie Jegelka. Structured optimal transport. arXiv preprint arXiv:1712.06199, 2017.

[45] Jonathan Eckstein and Dimitri P Bertsekas. On the Douglas–Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1-3):293–318, 1992.

[46] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[47] Yu Wang, Wotao Yin, and Jinshan Zeng. Global convergence of ADMM in nonconvex nonsmooth optimization. Journal of Scientific Computing, 78(1):29–63, 2019.

[48] Ery Arias-Castro, Adel Javanmard, and Bruno Pelletier. Perturbation bounds for Procrustes, classical scaling, and trilateration, with applications to manifold learning.
arXiv preprint arXiv:1810.09569, 2018.

[49] Paul J Besl and Neil D McKay. Method for registration of 3-D shapes. In Sensor Fusion IV: Control Paradigms and Data Structures, volume 1611, pages 586–607. International Society for Optics and Photonics, 1992.

[50] Anand Rangarajan, Haili Chui, and Fred L Bookstein. The softassign Procrustes matching algorithm. In Biennial International Conference on Information Processing in Medical Imaging, pages 29–42. Springer, 1997.

[51] Chethan Pandarinath, K Cora Ames, Abigail A Russo, Ali Farshchian, Lee E Miller, Eva L Dyer, and Jonathan C Kao. Latent factors and dynamics in motor cortex and their application to brain–machine interfaces. Journal of Neuroscience, 38(44):9390–9401, 2018.

[52] Eva L Dyer, Mohammad Gheshlaghi Azar, Matthew G Perich, Hugo L Fernandes, Stephanie Naufel, Lee E Miller, and Konrad P Körding. A cryptography-based approach for movement decoding. Nature Biomedical Engineering, 1(12):967, 2017.