{"title": "Principal Differences Analysis: Interpretable Characterization of Differences between Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 1702, "page_last": 1710, "abstract": "We introduce principal differences analysis for analyzing differences between high-dimensional distributions. The method operates by finding the projection that maximizes the Wasserstein divergence between the resulting univariate populations. Relying on the Cramer-Wold device, it requires no assumptions about the form of the underlying distributions, nor the nature of their inter-class differences. A sparse variant of the method is introduced to identify features responsible for the differences. We provide algorithms for both the original minimax formulation as well as its semidefinite relaxation. In addition to deriving some convergence results, we illustrate how the approach may be applied to identify differences between cell populations in the somatosensory cortex and hippocampus as manifested by single cell RNA-seq. Our broader framework extends beyond the specific choice of Wasserstein divergence.", "full_text": "Principal Differences Analysis: Interpretable Characterization of Differences between Distributions\n\nJonas Mueller\nCSAIL, MIT\njonasmueller@csail.mit.edu\n\nTommi Jaakkola\nCSAIL, MIT\ntommi@csail.mit.edu\n\nAbstract\n\nWe introduce principal differences analysis (PDA) for analyzing differences between high-dimensional distributions. The method operates by finding the projection that maximizes the Wasserstein divergence between the resulting univariate populations. Relying on the Cramer-Wold device, it requires no assumptions about the form of the underlying distributions, nor the nature of their inter-class differences. A sparse variant of the method is introduced to identify features responsible for the differences. 
We provide algorithms for both the original minimax formulation as well as its semidefinite relaxation. In addition to deriving some convergence results, we illustrate how the approach may be applied to identify differences between cell populations in the somatosensory cortex and hippocampus as manifested by single cell RNA-seq. Our broader framework extends beyond the specific choice of Wasserstein divergence.\n\n1 Introduction\n\nUnderstanding differences between populations is a common task across disciplines, from biomedical data analysis to demographic or textual analysis. For example, in biomedical analysis, a set of variables (features) such as genes may be profiled under different conditions (e.g. cell types, disease variants), resulting in two or more populations to compare. The hope of this analysis is to answer whether or not the populations differ and, if so, which variables or relationships contribute most to this difference. In many cases of interest, the comparison may be challenging primarily for three reasons: 1) the number of variables profiled may be large, 2) populations are represented by finite, unpaired, high-dimensional sets of samples, and 3) information may be lacking about the nature of possible differences (exploratory analysis).\n\nWe will focus on the comparison of two high-dimensional populations. Therefore, given two unpaired i.i.d. sets of samples X̂(n) = x(1), . . . , x(n) ~ PX and Ŷ(m) = y(1), . . . , y(m) ~ PY, the goal is to answer the following two questions about the underlying multivariate random variables X, Y ∈ Rd: (Q1) Is PX = PY? (Q2) If not, what is the minimal subset of features S ⊆ {1, . . . , d} such that the marginal distributions differ, PXS ≠ PYS, while PXSC ≈ PYSC for the complement? 
A finer version of (Q2) may additionally be posed which asks how much each feature contributes to the overall difference between the two probability distributions (with respect to the given scale on which the variables are measured).\n\nMany two-sample analyses have focused on characterizing limited differences such as mean shifts [1, 2]. More general differences beyond the mean of each feature remain of interest, however, including variance/covariance of demographic statistics such as income. It is also undesirable to restrict the analysis to specific parametric differences, especially in exploratory analysis where the nature of the underlying distributions may be unknown. In the univariate case, a number of nonparametric tests of equality of distributions are available with accompanying concentration results [3]. Popular examples of such divergences (also referred to as probability metrics) include: f-divergences (Kullback-Leibler, Hellinger, total-variation, etc.), the Kolmogorov distance, or the Wasserstein metric [4]. Unfortunately, this simplicity vanishes as the dimensionality d grows, and complex test-statistics have been designed to address some of the difficulties that appear in high-dimensional settings [5, 6, 7, 8].\n\nIn this work, we propose the principal differences analysis (PDA) framework, which circumvents the curse of dimensionality through explicit reduction back to the univariate case. Given a pre-specified statistical divergence D which measures the difference between univariate probability distributions, PDA seeks to find a projection β which maximizes D(βᵀX, βᵀY) subject to the constraints ||β||2 ≤ 1, β1 ≥ 0 (to avoid underspecification). This reduction is justified by the Cramer-Wold device, which ensures that PX ≠ PY if and only if there exists a direction along which the univariate linearly projected distributions differ [9, 10, 11]. 
Assuming D is a positive definite divergence (meaning it is nonzero between any two distinct univariate distributions), the projection vector produced by PDA can thus capture arbitrary types of differences between high-dimensional PX and PY. Furthermore, the approach can be straightforwardly modified to address (Q2) by introducing a sparsity penalty on β and examining the features with nonzero weight in the resulting optimal projection. The resulting comparison pertains to marginal distributions up to the sparsity level. We refer to this approach as sparse differences analysis or SPARDA.\n\n2 Related Work\n\nThe problem of characterizing differences between populations, including feature selection, has received a great deal of study [2, 12, 13, 5, 1]. We limit our discussion to projection-based methods which, as a family, are closest to our approach. For multivariate two-class data, the most widely adopted methods include (sparse) linear discriminant analysis (LDA) [2] and the logistic lasso [12]. While interpretable, these methods seek specific differences (e.g., covariance-rescaled average differences) or operate under stringent assumptions (e.g., log-linear model). In contrast, SPARDA (with a positive-definite divergence) aims to find features that characterize a priori unspecified differences between general multivariate distributions.\n\nPerhaps most similar to our general approach is the Direction-Projection-Permutation (DiProPerm) procedure of Wei et al. [5], in which the data is first projected along the normal to the separating hyperplane (found using linear SVM, distance weighted discrimination, or the centroid method), followed by a univariate two-sample test on the projected data. The projections could also be chosen at random [1]. In contrast to our approach, the choice of the projection in such methods is not optimized for the test statistic. 
We note that by restricting the divergence measure in our technique, methods such as the (sparse) linear support vector machine [13] could be viewed as special cases. The divergence in this case would measure the margin between projected univariate distributions. While suitable for finding well-separated projected populations, it may fail to uncover more general differences between possibly multi-modal projected populations.\n\n3 General Framework for Principal Differences Analysis\n\nFor a given divergence measure D between two univariate random variables, we find the projection β̂ that solves\n\nmax_{β∈B, ||β||0≤k} D(βᵀX̂(n), βᵀŶ(m))    (1)\n\nwhere B := {β ∈ Rd : ||β||2 ≤ 1, β1 ≥ 0} is the feasible set, ||β||0 ≤ k is the sparsity constraint, and βᵀX̂(n) denotes the observed random variable that follows the empirical distribution of n samples of βᵀX. Instead of imposing a hard cardinality constraint ||β||0 ≤ k, we may instead penalize by adding a penalty term −λ||β||0 (see footnote 1) or its natural relaxation, the ℓ1 shrinkage used in the Lasso [12], sparse LDA [2], and sparse PCA [14, 15]. Sparsity in our setting explicitly restricts the comparison to the marginal distributions over features with non-zero coefficients. We can evaluate the null hypothesis PX = PY (or its sparse variant over marginals) using permutation testing (cf. [5, 16]) with statistic D(β̂ᵀX̂(n), β̂ᵀŶ(m)).\n\nFootnote 1: In practice, the shrinkage parameter λ (or the explicit cardinality constraint k) may be chosen via cross-validation by maximizing the divergence between held-out samples.\n\nThe divergence D plays a key role in our analysis. If D is defined in terms of density functions, as in an f-divergence, one can use univariate kernel density estimation to approximate projected pdfs, with additional tuning of the bandwidth hyperparameter. For a suitably chosen kernel (e.g. 
Gaussian), the unregularized PDA objective (without shrinkage) is a smooth function of β, and thus amenable to the projected gradient method (or its accelerated variants [17, 18]). In contrast, when D is defined over the cdfs along the projected direction – e.g. the Kolmogorov or Wasserstein distance that we focus on in this paper – the objective is nondifferentiable due to the discrete jumps in the empirical cdf. We specifically address the combinatorial problem implied by the Wasserstein distance. Moreover, since the divergence assesses general differences between distributions, Equation (1) is typically a non-concave optimization. To this end, we develop a semidefinite relaxation for use with the Wasserstein distance.\n\n4 PDA using the Wasserstein Distance\n\nIn the remainder of the paper, we focus on the squared L2 Wasserstein distance (a.k.a. Kantorovich, Mallows, Dudley, or earth-mover distance), defined as\n\nD(X, Y) = min_{PXY} E_{PXY} ||X − Y||²  s.t. (X, Y) ~ PXY, X ~ PX, Y ~ PY    (2)\n\nwhere the minimization is over all joint distributions of (X, Y) with given marginals PX and PY. Intuitively interpreted as the amount of work required to transform one distribution into the other, D provides a natural dissimilarity measure between populations that integrates both the fraction of individuals which are different and the magnitude of these differences. While component analysis based on the Wasserstein distance has been limited to [19], this divergence has been successfully used in many other applications [20]. In the univariate case, (2) may be analytically expressed as the L2 distance between quantile functions. 
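Concretely, this quantile identity means the empirical univariate distance is obtained by pairing sorted samples; a minimal sketch (function names are ours, not from the paper, and equal sample sizes are assumed for simplicity):

```python
import numpy as np

def wasserstein2_1d(x, y):
    # Squared L2 Wasserstein distance between two univariate empirical
    # distributions with equal sample sizes: the L2 distance between
    # empirical quantile functions reduces to pairing order statistics.
    x, y = np.sort(x), np.sort(y)
    return np.mean((x - y) ** 2)

def projected_divergence(beta, X, Y):
    # PDA objective for one fixed projection direction beta
    # (rows of X and Y are samples): project, then compare in 1-D.
    return wasserstein2_1d(X @ beta, Y @ beta)
```

In PDA this evaluation is what the outer maximization over candidate directions β repeatedly invokes.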
We can thus efficiently compute empirical projected Wasserstein distances by sorting X and Y samples along the projection direction to obtain quantile estimates.\n\nUsing the Wasserstein distance, the empirical objective in Equation (1) between unpaired sampled populations {x(1), . . . , x(n)} and {y(1), . . . , y(m)} can be shown to be\n\nmax_{β∈B, ||β||0≤k} { min_{M∈M} Σ_{i=1..n} Σ_{j=1..m} (βᵀx(i) − βᵀy(j))² Mij } = max_{β∈B, ||β||0≤k} { min_{M∈M} βᵀ WM β }    (3)\n\nwhere M is the set of all n × m nonnegative matching matrices with fixed row sums = 1/n and column sums = 1/m (see [20] for details), WM := Σ_{i,j} [Zij ⊗ Zij] Mij, and Zij := x(i) − y(j). If we omitted (fixed) the inner minimization over the matching matrices and set λ = 0, the solution of (3) would simply be the largest eigenvector of WM. Similarly, for the sparse variant without minimizing over M, the problem would be solvable as sparse PCA [14, 15, 21]. The actual max-min problem in (3) is more complex and non-concave with respect to β. We propose a two-step procedure similar to the “tighten after relax” framework used to attain minimax-optimal rates in sparse PCA [21]. First, we solve a convex relaxation of the problem and subsequently run a steepest ascent method (initialized at the global optimum of the relaxation) to greedily improve the current solution with respect to the original nonconvex problem whenever the relaxation is not tight.\n\nFinally, we emphasize that PDA (and SPARDA) not only computationally resembles (sparse) PCA, but the latter is actually a special case of the former in the Gaussian, paired-sample-differences setting. This connection is made explicit by considering the two-class problem with paired samples (x(i), y(i)) where X, Y follow two multivariate Gaussian distributions. 
Here, the largest principal component of the (uncentered) differences x(i) − y(i) is in fact equivalent to the direction which maximizes the projected Wasserstein distance between the distribution of X − Y and a delta distribution at 0.\n\n4.1 Semidefinite Relaxation\n\nThe SPARDA problem may be expressed in terms of d × d symmetric matrices B as\n\nmax_B min_{M∈M} tr(WM B)  subject to tr(B) = 1, B ⪰ 0, ||B||0 ≤ k², rank(B) = 1    (4)\n\nwhere the correspondence between (3) and (4) comes from writing B = ββᵀ (note that any solution of (3) will have unit norm). When k = d, i.e., we impose no sparsity constraint as in PDA, we can relax by simply dropping the rank constraint. The objective is then an infimum of linear functions of B, and the resulting semidefinite problem is concave over a convex set and may be written as:\n\nmax_{B∈Br} min_{M∈M} tr(WM B)    (5)\n\nwhere Br is the convex set of positive semidefinite d × d matrices with trace = 1. If B* ∈ Rd×d denotes the global optimum of this relaxation and rank(B*) = 1, then the best projection for PDA is simply the dominant eigenvector of B* and the relaxation is tight. Otherwise, we can truncate B* as in [14], treating the dominant eigenvector as an approximate solution to the original problem (3).\n\nTo obtain a relaxation for the sparse version where k < d (SPARDA), we follow [14] closely. Because B = ββᵀ implies ||B||0 ≤ k², we obtain an equivalent cardinality-constrained problem by incorporating this nonconvex constraint into (4). Since tr(B) = 1 and ||B||F = ||β||2² = 1, a convex relaxation of the squared ℓ0 constraint is given by ||B||1 ≤ k. By selecting λ as the optimal Lagrange multiplier for this ℓ1 constraint, we can obtain an equivalent penalized reformulation parameterized by λ rather than k [14]. 
The sparse semidefinite relaxation is thus the following concave problem:\n\nmax_{B∈Br} { min_{M∈M} tr(WM B) − λ||B||1 }    (6)\n\nWhile the relaxation bears strong resemblance to the DSPCA relaxation for sparse PCA, the inner minimization over matchings prevents direct application of general semidefinite programming solvers. Let M(B) denote the matching that minimizes tr(WM B) for a given B. Standard projected subgradient ascent could be applied to solve (6), where at the tth iterate the (matrix-valued) supergradient is W_{M(B(t))}. However, this approach requires solving optimal transport problems with large n × m matrices at each iteration. Instead, we turn to a dual form of (6), assuming n ≥ m (cf. [22, 23]):\n\nmax_{B∈Br, u∈Rn, v∈Rm} (1/m) Σ_{i=1..n} Σ_{j=1..m} min{0, tr([Zij ⊗ Zij] B) − ui − vj} + (1/n) Σ_{i=1..n} ui + (1/m) Σ_{j=1..m} vj − λ||B||1    (7)\n\n(7) is simply a maximization over B ∈ Br, u ∈ Rn, and v ∈ Rm which no longer requires matching matrices nor their cumbersome row/column constraints. While the dual variables u and v can be solved in closed form for each fixed B (via sorting), we describe a simple subgradient approach that works better in practice.\n\nRELAX Algorithm: Solves the dualized semidefinite relaxation of SPARDA (7). Returns the largest eigenvector of the solution to (6) as the desired projection direction for SPARDA.\nInput: d-dimensional data x(1), . . . , x(n) and y(1), . . . , y(m) (with n ≥ m)\nParameters: λ ≥ 0 controls the amount of regularization, γ > 0 is the step-size used for B updates, η > 0 is the step-size used for updates of dual variables u and v, T is the maximum number of iterations without improvement in cost after which the algorithm terminates.\n1: Initialize β(0) := [1/√d, . . . , 1/√d], B(0) := β(0) ⊗ β(0) ∈ Br, u(0) := 0n×1, v(0) := 0m×1\n2: While the number of iterations since the last improvement in the objective function is less than T:\n3:   ∂u := [1/n, . . . , 1/n] ∈ Rn, ∂v := [1/m, . . . , 1/m] ∈ Rm, ∂B := 0d×d\n4:   For i, j ∈ {1, . . . , n} × {1, . . . , m}:\n5:     Zij := x(i) − y(j)\n6:     If tr([Zij ⊗ Zij] B(t)) − u(t)i − v(t)j < 0:\n7:       ∂ui := ∂ui − 1/m, ∂vj := ∂vj − 1/m, ∂B := ∂B + Zij ⊗ Zij / m\n8:   End For\n9:   u(t+1) := u(t) + η · ∂u and v(t+1) := v(t) + η · ∂v\n10:  B(t+1) := Projection(B(t) + (γ/||∂B||F) · ∂B ; λ, γ/||∂B||F)\nOutput: β̂relax ∈ Rd defined as the largest eigenvector (based on the corresponding eigenvalue's magnitude) of the matrix B(t*) which attained the best objective value over all iterations.\n\nProjection Algorithm: Projects a matrix onto the positive semidefinite cone of unit-trace matrices Br (the feasible set in our relaxation). Step 4 applies the soft-thresholding proximal operator for sparsity.\nInput: B ∈ Rd×d\nParameters: λ ≥ 0 controls the amount of regularization, γ' = γ/||∂B||F ≥ 0 is the actual step-size used in the B-update.\n1: QΛQᵀ := eigendecomposition of B\n2: w* := argmin { ||w − diag(Λ)||2² : w ∈ [0, 1]d, ||w||1 = 1 } (quadratic program)\n3: B̃ := Q · diag{w*1, . . . , w*d} · Qᵀ\n4: If λ > 0: For r, s ∈ {1, . . . , d}²: B̃r,s := sign(B̃r,s) · max{0, |B̃r,s| − λγ'}\nOutput: B̃ ∈ Br\n\nThe RELAX algorithm (boxed) is a projected supergradient method with supergradients computed in Steps 3-8. For scaling to large samples, one may alternatively employ incremental supergradient directions [24], where Step 4 would be replaced by drawing random (i, j) pairs. After each supergradient step, projection back into the feasible set Br is done via a quadratic program involving the current solution's eigenvalues. In SPARDA, sparsity is encouraged via the soft-thresholding proximal map corresponding to the ℓ1 penalty. The overall form of our iterations matches the subgradient-proximal updates (4.14)-(4.15) in [24]. By the convergence analysis in §4.2 of [24], the RELAX algorithm (as well as its incremental variant) is guaranteed to approach the optimal solution of the dual, which also solves (6), provided we employ sufficiently large T and small step-sizes. In practice, fast and accurate convergence is attained by: (a) renormalizing the B-supergradient (Step 10) to ensure balanced updates of the unit-norm constrained B, (b) using diminishing learning rates which are initially set larger for the unconstrained dual variables (or even taking multiple subgradient steps in the dual variables per each update of B).\n\n4.2 Tightening after relaxation\n\nIt is unreasonable to expect that our semidefinite relaxation is always tight. Therefore, we can sometimes further refine the projection β̂relax obtained by the RELAX algorithm by using it as a starting point in the original non-convex optimization. We introduce a sparsity-constrained tightening procedure that applies projected gradient ascent to the original nonconvex objective J(β) = min_{M∈M} βᵀ WM β, where β is now forced to lie in B ∩ Sk and Sk := {β ∈ Rd : ||β||0 ≤ k}. The sparsity level k is fixed based on the relaxed solution (k = ||β̂relax||0). 
After initializing β(0) = β̂relax ∈ Rd, the tightening procedure iterates steps in the gradient direction of J followed by straightforward projections onto the unit half-ball B and the set Sk (accomplished by greedily truncating all entries of β to zero besides the largest k in magnitude).\n\nLet M(β) again denote the matching matrix chosen in response to β. J fails to be differentiable at the β̃ where M(β̃) is not unique. This occurs, e.g., if two samples have identical projections under β̃. While this situation becomes increasingly likely as n, m → ∞, J interestingly becomes smoother overall (assuming the distributions admit density functions). For all other β: M(β') = M(β) where β' lies in a small neighborhood around β, and J admits a well-defined gradient 2·W_{M(β)}β. In practice, we find that the tightening always approaches a local optimum of J with a diminishing step-size. We note that, for a given projection, we can efficiently calculate gradients without recourse to matrices M(β) or W_{M(β)} by sorting β(t)ᵀx(1), . . . , β(t)ᵀx(n) and β(t)ᵀy(1), . . . , β(t)ᵀy(m). The gradient is directly derivable from expression (3), where the nonzero Mij are determined by appropriately matching empirical quantiles (represented by sorted indices), since the univariate Wasserstein distance is simply the L2 distance between quantile functions [20]. Additional computation can be saved by employing insertion sort, which runs in nearly linear time for almost-sorted points (in iteration t − 1, the points have been sorted along the β(t−1) direction and their sorting in direction β(t) is likely similar under a small step-size). Thus the tightening procedure is much more efficient than the RELAX algorithm (respective runtimes are O(dn log n) vs. O(d³n²) per iteration).\n\nWe require the combined steps for good performance. 
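The sorted-matching gradient step at the heart of this tightening procedure can be sketched as follows; this is a simplified illustration under the assumption n = m, with hypothetical function names, and it omits the β1 ≥ 0 sign convention of the feasible set:

```python
import numpy as np

def tightening_step(beta, X, Y, step, k):
    # One projected gradient-ascent step on J(beta) = min_M beta^T W_M beta.
    # With n == m, the optimal matching pairs samples by sorted projection,
    # so the gradient 2 * W_{M(beta)} beta is computed via sorting alone.
    n = len(X)
    Z = X[np.argsort(X @ beta)] - Y[np.argsort(Y @ beta)]  # quantile-matched differences
    grad = (2.0 / n) * Z.T @ (Z @ beta)
    beta = beta + step * grad
    # project onto S_k: keep only the k largest-magnitude entries
    beta[np.argsort(np.abs(beta))[:-k]] = 0.0
    # project onto the unit ball B
    nrm = np.linalg.norm(beta)
    return beta / nrm if nrm > 1 else beta
```

Iterating this step with a diminishing `step` from a good initialization (e.g. the RELAX solution) mirrors the procedure described above.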
The projection found by the tightening algorithm heavily depends on the starting point β(0), finding only the closest local optimum (as in Figure 1a). It is thus important that β(0) is already a good solution, as can be produced by our RELAX algorithm. Additionally, we note that, as first-order methods, both the RELAX and tightening algorithms are amenable to a number of (sub)gradient-acceleration schemes (e.g. momentum techniques, adaptive learning rates, or FISTA and other variants of Nesterov's method [18, 17, 25]).\n\n4.3 Properties of semidefinite relaxation\n\nWe conclude the algorithmic discussion by highlighting basic conditions under which our PDA relaxation is tight. Assuming n, m → ∞, each of (i)-(iii) implies that the B* which maximizes (5) is nearly rank one, or equivalently B* ≈ β̃ ⊗ β̃ (see Supplementary Information §S4 for intuition). Thus, the tightening procedure initialized at β̃ will produce a global maximum of the PDA objective.\n\n(i) There exists a direction in which the projected Wasserstein distance between X and Y is nearly as large as the overall Wasserstein distance in Rd. This occurs for example if ||E[X] − E[Y]||2 is large while both ||Cov(X)||F and ||Cov(Y)||F are small (the distributions need not be Gaussian).\n\n(ii) X ~ N(µX, ΣX) and Y ~ N(µY, ΣY) with µX ≠ µY and ΣX ≈ ΣY.\n\n(iii) X ~ N(µX, ΣX) and Y ~ N(µY, ΣY) with µX = µY, where the underlying covariance structure is such that argmax_{B∈Br} ||(B^{1/2} ΣX B^{1/2})^{1/2} − (B^{1/2} ΣY B^{1/2})^{1/2}||F² is nearly rank 1. For example, if the primary difference between covariances is a shift in the marginal variance of some features, i.e. 
ΣY ≈ V · ΣX where V is a diagonal matrix.\n\n5 Theoretical Results\n\nIn this section, we characterize statistical properties of an empirical divergence-maximizing projection β̂ := argmax_{β∈B} D(βᵀX̂(n), βᵀŶ(n)), although we note that the algorithms may not succeed in finding such a global maximum for severely nonconvex problems. Throughout, D denotes the squared L2 Wasserstein distance between univariate distributions, and C represents universal constants that change from line to line. All proofs are relegated to the Supplementary Information §S3. We make the following simplifying assumptions: (A1) n = m; (A2) X, Y admit continuous density functions; (A3) X, Y are compactly supported with nonzero density in the Euclidean ball of radius R. Our theory can be generalized beyond (A1)-(A3) to obtain similar (but complex) statements through careful treatment of the distributions' tails and zero-density regions where cdfs are flat.\n\nTheorem 1. Suppose there exists a direction β* ∈ B such that D(β*ᵀX, β*ᵀY) ≥ δ. Then: D(β̂ᵀX̂(n), β̂ᵀŶ(n)) > δ − ε with probability greater than 1 − 4·exp(−nε²/(16R⁴)).\n\nTheorem 1 gives basic concentration results for the projections used in empirical applications of our method. To relate distributional differences between X, Y in the ambient d-dimensional space with their estimated divergence along the univariate linear representation chosen by PDA, we turn to Theorems 2 and 3. Finally, Theorem 4 provides sparsistency guarantees for SPARDA in the case where X, Y exhibit large differences over a certain feature subset (of known cardinality).\n\nTheorem 2. 
If X and Y are identically distributed in Rd, then: D(β̂ᵀX̂(n), β̂ᵀŶ(n)) < ε with probability greater than 1 − C1·(1 + R²/ε)^d · exp(−C2·nε²/R⁴).\n\nTo measure the difference between the untransformed random variables X, Y ∈ Rd, we define the following metric between distributions on Rd, parameterized by a ≥ 0 (cf. [11]):\n\nTa(X, Y) := |Pr(|X1| ≤ a, . . . , |Xd| ≤ a) − Pr(|Y1| ≤ a, . . . , |Yd| ≤ a)|    (8)\n\nIn addition to (A1)-(A3), we assume the following for the next two theorems: (A4) Y has sub-Gaussian tails, meaning the cdf FY satisfies 1 − FY(y) ≤ (C/y)·exp(−y²/2); (A5) E[X] = E[Y] = 0 (note that mean differences can trivially be captured by linear projections, so these are not the differences of interest in the following theorems); (A6) Var(Xℓ) = 1 for ℓ = 1, . . . , d.\n\nTheorem 3. Suppose there exist a ≥ 0 and δ > 0 s.t. Ta(X, Y) > h(g(δ)), where h(g(δ)) := min{Δ1, Δ2} with\n\nΔ1 := (a + d)·d·(g(δ) + d) + exp(−a²/2) + exp(−1/(√2·λ))    (9)\n\nΔ2 := (g(δ) + exp(−a²/2)) · d    (10)\n\nγ := ||Cov(X)||1, g(δ) := 4δ·(1 + γ)^{−4}, and λ := sup_{α∈B} sup_y |f_{αᵀY}(y)|, with f_{αᵀY}(y) defined as the density of the projection of Y in the α direction. Then:\n\nD(β̂ᵀX̂(n), β̂ᵀŶ(n)) > C − ε    (11)\n\nwith probability greater than 1 − C1·exp(−C2·nε²/R⁴).\n\nTheorem 4. Define C as in (11). Suppose there exists a feature subset S ⊂ {1, . . . , d} s.t. |S| = k, T(XS, YS) ≥ h(g(ε(d + 1)/C)), and the remaining marginal distributions XSC, YSC are identical. Then:\n\nβ̂(k) := argmax_{β∈B} {D(βᵀX̂(n), βᵀŶ(n)) : ||β||0 ≤ k}\n\nsatisfies β̂(k)i ≠ 0 and β̂(k)j = 0 ∀ i ∈ S, j ∈ SC with probability greater than 1 − C1·(1 + R²/ε)^{d−k} · exp(−C2·nε²/R⁴).\n\n6 Experiments\n\nFigure 1a illustrates the cost function of PDA pertaining to two 3-dimensional distributions (see details in Supplementary Information §S1). In this example, the point of convergence β̂ of the tightening method after random initialization (in green) is significantly inferior to the solution produced by the RELAX algorithm (in red). It is therefore important to use RELAX before tightening, as we advise.\n\nThe synthetic MADELON dataset used in the NIPS 2003 feature selection challenge consists of points (n = m = 1000, d = 500) which have 5 features scattered on the vertices of a five-dimensional hypercube (so that interactions between features must be considered in order to distinguish the two classes), 15 features that are noisy linear combinations of the original five, and 480 useless features [26]. While the focus of the challenge was on extracting features useful to classifiers, we direct our attention toward more interpretable models. Figure 1b demonstrates how well SPARDA (red), the top sparse principal component (black) [27], sparse LDA (green) [2], and the logistic lasso (blue) [12] are able to identify the 20 relevant features over different settings of their respective regularization parameters (which determine the cardinality of the vector returned by each method). 
The red asterisk indicates the SPARDA result with λ automatically selected via our cross-validation procedure (without information of the underlying features' importance), and the black asterisk indicates the best reported result in the challenge [26].\n\nFigure 1: (a) example where PDA is nonconvex, (b) SPARDA vs. other feature selection methods, (c) power of various tests for multi-dimensional problems with 3-dimensional differences. [Panel (b) plots relevant features recovered vs. cardinality on MADELON; panel (c) plots p-value vs. data dimension d for two-sample testing.]\n\nThe restrictive assumptions in logistic regression and linear discriminant analysis are not satisfied in this complex dataset, resulting in poor performance. Despite being class-agnostic, PCA was successfully utilized by numerous challenge participants [26], and we find that sparse PCA performs on par with logistic regression and LDA. Although the lasso fairly efficiently picks out 5 relevant features, it struggles to identify the rest due to severe multi-collinearity. Similarly, the challenge-winning Bayesian SVM with Automatic Relevance Determination [26] only selects 8 of the 20 relevant features. In many applications, the goal is to thoroughly characterize the set of differences rather than select a subset of features that maintains predictive accuracy. SPARDA is better suited for this alternative objective. Many settings of λ return 14 of the relevant features with zero false positives. 
If λ is chosen automatically through cross-validation, the projection returned by SPARDA contains 46 nonzero elements, of which 17 correspond to relevant features.\n\nFigure 1c depicts (average) p-values produced by SPARDA (red), PDA (purple), the overall Wasserstein distance in Rd (black), Maximum Mean Discrepancy [8] (green), and DiProPerm [5] (blue) in two-sample synthetically controlled problems where PX ≠ PY and the underlying differences have varying degrees of sparsity. Here, d indicates the overall number of features included, of which only the first 3 are relevant (see Supplementary Information §S1 for details). As we evaluate the significance of each method's statistic via permutation testing, all the tests are guaranteed to exactly control Type I error [16], and we thus only compare their respective power in determining PX ≠ PY. The figure demonstrates clear superiority of SPARDA, which leverages the underlying sparsity to maintain high power even with increasing overall dimensionality. Even when all the features differ (when d = 3), SPARDA matches the power of methods that consider the full space despite only selecting a single direction (which cannot be based on mean-differences, as there are none in this controlled data). This experiment also demonstrates that the unregularized PDA retains greater power than DiProPerm, a similar projection-based method [5].\n\nRecent technological advances allow complete transcriptome profiling in thousands of individual cells, with the goal of fine molecular characterization of cell populations (beyond the crude average-tissue-level expression measure that is currently standard) [28]. We apply SPARDA to expression measurements of 10,305 genes profiled in 1,691 single cells from the somatosensory cortex and 1,314 hippocampus cells sampled from the brains of juvenile mice [29]. 
The resulting projection identifies many previously characterized subtype-specific genes and is in many respects more informative than the results of standard differential expression methods (see Supplementary Information §S2 for details). Finally, we also apply SPARDA to normalized data with mean-zero, unit-variance marginals in order to explicitly restrict our search to genes whose relationship with other genes' expression differs between hippocampus and cortex cells. This analysis reveals many genes known to be heavily involved in signaling, the regulation of important processes, and other forms of functional interaction between genes (see Supplementary Information §S2.1 for details). These important types of changes cannot be detected by standard differential expression analyses, which consider each gene in isolation or require gene-sets to be explicitly identified as features [28].

7 Conclusion

This paper introduces the overall principal differences methodology and demonstrates the numerous practical benefits of this approach. While we focused on algorithms for PDA and SPARDA tailored to the Wasserstein distance, different divergences may be better suited for certain applications. Further theoretical investigation of the SPARDA framework is of interest, particularly in the high-dimensional d = O(n) setting. Here, rich theory has been derived for compressed sensing and sparse PCA by leveraging ideas such as restricted isometry or spiked covariance [15]. A natural question is then which analogous properties of PX, PY theoretically guarantee the strong empirical performance of SPARDA observed in our high-dimensional applications.
Finally, we also envision extensions of the methods presented here which employ multiple projections in succession, or adapt the approach to non-pairwise comparison of multiple populations.

Acknowledgements
This research was supported by NIH Grant T32HG004947.

References
[1] Lopes M, Jacob L, Wainwright M (2011) A More Powerful Two-Sample Test in High Dimensions using Random Projection. NIPS: 1206–1214.
[2] Clemmensen L, Hastie T, Witten D, Ersbøll B (2011) Sparse Discriminant Analysis. Technometrics 53: 406–413.
[3] van der Vaart AW, Wellner JA (1996) Weak Convergence and Empirical Processes. Springer.
[4] Gibbs AL, Su FE (2002) On Choosing and Bounding Probability Metrics. International Statistical Review 70: 419–435.
[5] Wei S, Lee C, Wichers L, Marron JS (2015) Direction-Projection-Permutation for High Dimensional Hypothesis Tests. Journal of Computational and Graphical Statistics.
[6] Rosenbaum PR (2005) An exact distribution-free test comparing two multivariate distributions based on adjacency. Journal of the Royal Statistical Society Series B 67: 515–530.
[7] Szekely G, Rizzo M (2004) Testing for equal distributions in high dimension. InterStat 5.
[8] Gretton A, Borgwardt KM, Rasch MJ, Scholkopf B, Smola A (2012) A Kernel Two-Sample Test. The Journal of Machine Learning Research 13: 723–773.
[9] Cramer H, Wold H (1936) Some Theorems on Distribution Functions. Journal of the London Mathematical Society 11: 290–294.
[10] Cuesta-Albertos JA, Fraiman R, Ransford T (2007) A sharp form of the Cramer–Wold theorem. Journal of Theoretical Probability 20: 201–209.
[11] Jirak M (2011) On the maximum of covariance estimators. Journal of Multivariate Analysis 102: 1032–1046.
[12] Tibshirani R (1996) Regression shrinkage and selection via the lasso.
Journal of the Royal Statistical Society Series B: 267–288.
[13] Bradley PS, Mangasarian OL (1998) Feature Selection via Concave Minimization and Support Vector Machines. ICML: 82–90.
[14] D'Aspremont A, El Ghaoui L, Jordan MI, Lanckriet GR (2007) A direct formulation for sparse PCA using semidefinite programming. SIAM Review: 434–448.
[15] Amini AA, Wainwright MJ (2009) High-dimensional analysis of semidefinite relaxations for sparse principal components. The Annals of Statistics 37: 2877–2921.
[16] Good P (1994) Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag.
[17] Duchi J, Hazan E, Singer Y (2011) Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research 12: 2121–2159.
[18] Wright SJ (2010) Optimization Algorithms in Machine Learning. NIPS Tutorial.
[19] Sandler R, Lindenbaum M (2011) Nonnegative Matrix Factorization with Earth Mover's Distance Metric for Image Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 33: 1590–1602.
[20] Levina E, Bickel P (2001) The Earth Mover's distance is the Mallows distance: some insights from statistics. ICCV 2: 251–256.
[21] Wang Z, Lu H, Liu H (2014) Tighten after Relax: Minimax-Optimal Sparse PCA in Polynomial Time. NIPS 27: 3383–3391.
[22] Bertsekas DP (1998) Network Optimization: Continuous and Discrete Models. Athena Scientific.
[23] Bertsekas DP, Eckstein J (1988) Dual coordinate step methods for linear network flow problems. Mathematical Programming 42: 203–243.
[24] Bertsekas DP (2011) Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. In: Optimization for Machine Learning, MIT Press. pp.
85–119.
[25] Beck A, Teboulle M (2009) A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal on Imaging Sciences 2: 183–202.
[26] Guyon I, Gunn S, Nikravesh M, Zadeh LA (2006) Feature Extraction: Foundations and Applications. Secaucus, NJ, USA: Springer-Verlag.
[27] Zou H, Hastie T, Tibshirani R (2005) Sparse Principal Component Analysis. Journal of Computational and Graphical Statistics 67: 301–320.
[28] Geiler-Samerotte KA, Bauer CR, Li S, Ziv N, Gresham D, et al. (2013) The details in the distributions: why and how to study phenotypic variability. Current Opinion in Biotechnology 24: 752–759.
[29] Zeisel A, Munoz-Manchado AB, Codeluppi S, Lonnerberg P, La Manno G, et al. (2015) Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347: 1138–1142.
", "award": [], "sourceid": 1036, "authors": [{"given_name": "Jonas", "family_name": "Mueller", "institution": "MIT"}, {"given_name": "Tommi", "family_name": "Jaakkola", "institution": "MIT"}]}