{"title": "Geometry Based Data Generation", "book": "Advances in Neural Information Processing Systems", "page_first": 1400, "page_last": 1411, "abstract": "We propose a new type of generative model for high-dimensional data that learns a manifold geometry of the data, rather than density, and can generate points evenly along this manifold. This is in contrast to existing generative models that represent data density, and are strongly affected by noise and other artifacts of data collection. We demonstrate how this approach corrects sampling biases and artifacts, thus improves several downstream data analysis tasks, such as clustering and classification. Finally, we demonstrate that this approach is especially useful in biology where, despite the advent of single-cell technologies, rare subpopulations and gene-interaction relationships are affected by biased sampling. We show that SUGAR can generate hypothetical populations, and it is able to reveal intrinsic patterns and mutual-information relationships between genes on a single-cell RNA sequencing dataset of hematopoiesis.", "full_text": "Geometry Based Data Generation\n\nApplied Mathematics Program\n\nComputational Biology & Bioinformatics Program\n\nApplied Mathematics Program\n\nDepartments of Genetics & Computer Science\n\nO\ufb01r Lindenbaum\u2217\n\nYale University\n\nNew Haven, CT 06511\n\nofir.lindenbaum@yale.edu\n\nGuy Wolf\u2020\n\nYale University\n\nNew Haven, CT 06511\nguy.wolf@yale.edu\n\nJay S. Stanley III\u2217\n\nYale University\n\nNew Haven, CT 06510\n\njay.stanley@yale.edu\n\nSmita Krishnaswamy\u2020 \n\nYale University\n\nNew Haven, CT 06510\n\nsmita.krishnawamy@yale.edu\n\nAbstract\n\nWe propose a new type of generative model for high-dimensional data that learns\na manifold geometry of the data, rather than density, and can generate points\nevenly along this manifold. 
This is in contrast to existing generative models that\nrepresent data density, and are strongly affected by noise and other artifacts of\ndata collection. We demonstrate how this approach corrects sampling biases and\nartifacts, thus improves several downstream data analysis tasks, such as clustering\nand classi\ufb01cation. Finally, we demonstrate that this approach is especially useful in\nbiology where, despite the advent of single-cell technologies, rare subpopulations\nand gene-interaction relationships are affected by biased sampling. We show that\nSUGAR can generate hypothetical populations, and it is able to reveal intrinsic\npatterns and mutual-information relationships between genes on a single-cell RNA\nsequencing dataset of hematopoiesis.\n\n1\n\nIntroduction\n\nManifold learning methods in general, and diffusion geometry ones in particular (Coifman & Lafon,\n2006), are traditionally used to infer latent representations that capture intrinsic geometry in data, but\nthey do not relate them to original data features. Here, we propose a novel data synthesis method,\nwhich we call SUGAR (Synthesis Using Geometrically Aligned Random-walks), for generating data\nin its original feature space while following its intrinsic geometry. This geometry is inferred by a\ndiffusion kernel that captures a data-driven manifold and reveals underlying structure in the full range\nof the data space \u2013 including undersampled regions that can be augmented by new synthesized data.\nGeometry-based data generation with SUGAR is motivated by numerous uses in data exploration. For\ninstance, in biology, despite the advent of single-cell technologies such as single-cell RNA sequencing\nand mass cytometry, sampling biases and artifacts often make it dif\ufb01cult to evenly sample the data\nspace. 
Rare populations of relevance to disease and development are often left out (Gr\u00fcn et al., 2015).\nBy learning the data geometry rather than density, SUGAR is able to generate hypothetical cell types\nfor exploration, and uncover patterns and interactions in the data.\nFurther, imbalanced data is problematic for many machine learning applications. In classi\ufb01cation,\nfor example, class density can strongly bias some classi\ufb01ers (He & Garcia, 2009; L\u00f3pez et al.,\n\n\u2217These authors contributed equally\n\u2020These authors contributed equally;  Corresponding author\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\f2013; Hensman & Masko, 2015). In clustering, imbalanced sampling of ground truth clusters can\nlead to distortions in learned clusters (Xuan et al., 2013; Wu, 2012). Sampling biases can also\ncorrupt regression tasks; relationship measures such as mutual information are heavily weighted by\ndensity estimates and thus may mis-quantify the strength of dependencies with data whose density is\nconcentrated in a particular region of the relationship (Krishnaswamy et al., 2014). SUGAR can aid\nsuch machine learning algorithms by generating data that is balanced along its manifold.\nThere are several advantages of our approach over contemporary generative models. Most other\ngenerative models attempt to learn and replicate the density of the data; this approach is intractable\nin high dimensions. Distribution-based generative models typically require vast simpli\ufb01cations\nsuch as parametric forms or restriction to marginals in order to become tractable. Examples for\nsuch methods include Gaussian Mixture Models (GMM, Rasmussen, 2000), variational Bayesian\nmethods (Beal & Ghahramani, 2003), and kernel density estimates (Scott, 2008). 
In contrast to these\nmethods, SUGAR does not rely on high dimensional probability distributions or parametric forms.\nSUGAR selectively generates points to equalize density; as such, the method can be used generally to\ncompensate for sparsity and heavily biased sampling in data in a way that is agnostic to downstream\napplication. In other words, whereas more specialized methods may use prior information (e.g.,\nlabels) to correct class imbalances for classi\ufb01er training (Chawla et al., 2002), SUGAR does not\nrequire such information and can apply even in cases such as clustering or regression, where such\ninformation does not exist.\nHere, we construct SUGAR from diffusion methods and theoretically justify its density equalization\nproperties. We then demonstrate SUGAR on imbalanced arti\ufb01cial data. Subsequently, we use\nSUGAR to improve classi\ufb01cation accuracy on 61 imbalanced datasets. We then provide an illustrative\nsynthetic example of clustering with SUGAR and show the clustering performance of the method\non 115 imbalanced datasets obtained from the KEEL-dataset repository (Alcal\u00e1-Fdez et al., 2009).\nFinally, we use SUGAR for exploratory analysis of a biological dataset, recovering imbalanced cell\ntypes and restoring canonical gene-gene relationships.\n\n2 Related Work\n\nMost existing methods for data generation assume a probabilistic data model. Parametric density\nestimation methods, such as Rasmussen (2000) or Varanasi & Aazhang (1989), \ufb01nd a best \ufb01tting\nparametric model for the data using maximum likelihood, which is then used to generate new data.\nNonparametric density estimators (e.g., Seaman & Powell, 1996; Scott, 1985; Gin\u00e9 & Guillou, 2002)\nuse a histogram or a kernel (Scott, 2008) to estimate the generating distribution. 
Recently, Variational\nAuto-Encoders (VAE, Kingma & Welling, 2014; Doersch, 2016) and Generative Adversarial\nNetworks (GAN, Goodfellow et al., 2014) have been demonstrated for generating new points from\ncomplex high dimensional distributions.\nA family of manifold based Parzen window estimators is presented in Vincent & Bengio (2003);\nBengio & Monperrus (2005); Bengio et al. (2006). These methods exploit manifold structures to\nimprove density estimation of high dimensional data. Markov-Chain Monte Carlo (MCMC) on\nimplicitly defined manifolds was presented in Girolami & Calderhead (2011); Brubaker et al. (2012).\nThere, the authors use implicit constraints to generate new points that follow a manifold structure.\nAnother scheme by \u00d6ztireli et al. (2010) defines a spectral measure to resample existing points such\nthat manifold structure is preserved and the density of points is uniform. These methods differ from\nthe proposed approach as they either require implicit constraints or they change the values of existing\npoints in the resampling process.\n\n3 Background\n\n3.1 Diffusion Geometry\n\nCoifman & Lafon (2006) proposed the nonlinear dimensionality reduction framework called Diffusion\nMaps (DM). This popular method robustly captures an intrinsic manifold geometry using a\nrow-stochastic Markov matrix associated with a graph of the data. This graph is commonly constructed\nusing a Gaussian kernel\n\nK(x_i, x_j) \u225c K_{i,j} = exp(\u2212\u2016x_i \u2212 x_j\u2016^2 / (2\u03c3^2)),   i, j = 1, ..., N,   (1)\n\nwhere x1, . . . , xN are data points, and \u03c3 is a bandwidth parameter that controls neighborhood sizes.\nThen, a diffusion operator is defined as the row-stochastic matrix Pi,j = P(xi, xj) = [D\u22121K]i,j,\ni, j = 1, ..., N, where D is a diagonal matrix with values corresponding to the degree of the kernel,\nD_{i,i} = \u02c6d(i) = \u2211_j K(x_i, x_j). 
The degree \u02c6d(i) of each point xi encodes the total connectivity the\npoint has to its neighbors. The Markov matrix P defines an imposed diffusion process, shown by\nCoifman & Lafon (2006) to efficiently capture the diffusion geometry of a manifold M.\nThe DM framework may be used for dimensionality reduction to embed data using the\neigendecomposition of the diffusion operator. However, in this paper, we do not directly use the DM embedding,\nbut rather a variant of the operator P that captures diffusion geometry. In Sec. 4, we explain how\nthis operator allows us to ensure the data we generate follows diffusion geometry and the manifold\nstructure it represents.\n\n3.2 Measure Based Gaussian Correlation\n\nBermanis et al. (2016a,b) suggest the Measure-based Gaussian Correlation (MGC) kernel as an\nalternative to the Gaussian kernel (Eq. 1) for constructing diffusion geometry based on a measure \u00b5.\nThe measure could be provided in advance or approximated based on the data samples. The MGC\nkernel with a measure \u00b5(r), r \u2208 X, defined over a set X of reference points, is\n\n\u02c6K(x_i, x_j) = \u2211_{r \u2208 X} K(x_i, r) K(r, x_j) \u00b5(r),   i, j = 1, ..., N,\n\nwhere the kernel K is some decaying symmetric function. Here, we use a Gaussian kernel for K and\na sparsity-based measure for \u00b5.\n\n3.3 Kernel Bandwidth Selection\n\nThe choice of kernel bandwidth \u03c3 in Eq. 1 is crucial for the performance of Gaussian-kernel methods.\nFor small values of \u03c3, the resulting kernel K converges to the identity matrix; inversely, large values\nof \u03c3 yield the all-ones matrix. Many methods have been proposed for tuning \u03c3. A range of values is\nsuggested in Singer et al. (2009) based on an analysis of the sum of values in K. Lindenbaum et al.\n(2017) presented a kernel scaling method that is well suited for classification and manifold learning\ntasks. 
We describe here two methods for setting the bandwidth: a global scale suggested in Keller\net al. (2010) and an adaptive local scale based on Zelnik-Manor & Perona (2005).\nFor degree estimation we use the max-min bandwidth (Keller et al., 2010) as it is simple and effective.\nThe max-min bandwidth is defined by\n\n\u03c3^2_{MaxMin} = C \u00b7 max_j [ min_{i, i\u2260j} (\u2016x_i \u2212 x_j\u2016^2) ],\n\nwhere C \u2208 [2, 3]. This approach attempts to force each point to be connected to at least one other\npoint. This method is simple, but highly sensitive to outliers. Zelnik-Manor & Perona (2005) propose\nadaptive bandwidth selection. At each point xi, the scale \u03c3i is chosen as the L1 distance of xi from\nits r-th nearest neighbor. This adaptive bandwidth guarantees that at least half of the points are\nconnected to r neighbors. Since an adaptive bandwidth obscures density biases, it is more suitable\nfor applying the resulting diffusion process to the data than for degree estimation.\n\n4 Data Generation\n\n4.1 Problem Formulation\n\nLet M be a d dimensional manifold that lies in a higher dimensional space R^D, with d < D, and\nlet X \u2286 M be a dataset of N = |X| data points, denoted x1, . . . , xN , sampled from the manifold.\nIn this paper, we propose an approach that uses the samples in X in order to capture the manifold\ngeometry and generate new data points from the manifold. In particular, we focus on the case where\nthe points in X are unevenly sampled from M, and aim to generate a set of M new data points\nY = {y1, ..., yM} \u2286 R^D such that 1. the new points Y approximately lie on the manifold M, and\n2. the distribution of points in the combined dataset Z \u225c X \u222a Y is uniform. Our proposed approach\nis based on using an intrinsic diffusion process to robustly capture a manifold geometry from X (see\nSec. 3). Then, we use this diffusion process to generate new data points along the manifold geometry\nwhile adjusting their intrinsic distribution, as explained in the following sections.\n\nAlgorithm 1 SUGAR: Synthesis Using Geometrically Aligned Random-walks\nInput: Dataset X = {x1, x2, . . . , xN}, xi \u2208 R^D.\nOutput: Generated set of points Y = {y1, y2, . . . , yM}, yi \u2208 R^D.\n1: Compute the diffusion geometry operators K, P , and degrees \u02c6d(i), i = 1, ..., N (see Sec. 3).\n2: Define a sparsity measure \u02c6s(i), i = 1, ..., N (Eq. 2).\n3: Estimate a local covariance \u03a3i, i = 1, ..., N, using k nearest neighbors around each xi.\n4: For each point i = 1, ..., N draw \u02c6\u2113(i) vectors (see Sec. 4.3) from a Gaussian distribution N (xi, \u03a3i). Let \u02c6Y^0 be a matrix with these M = \u2211_{i=1}^{N} \u02c6\u2113(i) generated vectors as its rows.\n5: Compute the sparsity based diffusion operator \u02c6P (see Sec. 4.2).\n6: Apply the operator \u02c6P at time instant t to the new generated points in \u02c6Y^0 to get diffused points as rows of Y^t = \u02c6P^t \u00b7 Y^0.\n7: Rescale Y^t to get the output Y[\u00b7, j] = Y^t[\u00b7, j] \u00b7 percentile(X[\u00b7, j], .99) / max Y^t[\u00b7, j], j = 1, . . . , D, in order to fit the original range of feature values in the data.\n\n4.2 SUGAR: Synthesis Using Geometrically Aligned Random-walks\n\nSUGAR initializes by forming a Gaussian kernel GX (see Eq. 1) over the input data X in order\nto estimate the degree \u02c6d(i) of each xi \u2208 X. Because the space in which the degree is estimated\nimpacts the output of SUGAR, X may consist of the full data dimensions or learned dimensions\nfrom manifold learning algorithms. We then define the sparsity of each point \u02c6s(i) via\n\n\u02c6s(i) \u225c [\u02c6d(i)]^{\u22121},   i = 1, ..., N.   (2)\n\nSubsequently, we sample \u02c6\u2113(i) points h_j \u2208 H^i, j = 1, ..., \u02c6\u2113(i) around each xi \u2208 X from a set of\nlocalized Gaussian distributions Gi = N (xi, \u03a3i) \u2208 G. 
The choice of \u02c6\u2113(i) based on the density (or\nsparsity) around xi is discussed in Sec. 4.3. This construction elaborates local manifold structure in\nmeaningful directions by 1. compensating for data sparsity according to \u02c6s(i), and 2. centering each\nGi on an existing point xi with local covariance \u03a3i based on the k nearest neighbors of xi. The set\nof all M = \u2211_i \u02c6\u2113(i) new points, Y^0 = {y1, ..., yM}, is then given by the union of all local point sets\nY^0 = H^1 \u222a H^2 \u222a ... \u222a H^N. Next, we construct a sparsity-based MGC kernel (see Sec. 3.2)\n\n\u02c6K(y_i, y_j) = \u2211_r K(y_i, x_r) K(x_r, y_j) \u02c6s(r)\n\nusing the affinities in the sampled set X and the generated set Y^0. We use this kernel to pull the new\npoints Y^0 toward the sparse regions of the manifold M using the row-stochastic diffusion operator \u02c6P\n(see Sec. 3.1). We then apply the powered operator \u02c6P^t to Y^0, which averages points in Y^0 according\nto their neighbors in X.\n\nThe powered operator \u02c6P^t controls the diffusion distance over which points are averaged; higher\nvalues of t lead to wider local averages over the manifold. The operator may be modeled as a low\npass filter in which higher powers decrease the cutoff frequency. Because Y^0 is inherently noisy in\nthe ambient space of the data, Y^t = \u02c6P^t \u00b7 Y^0 is a denoised version of Y^0 along M. The number of\nsteps required can be set manually or using the Von Neumann Entropy as suggested by Moon et al.\n(2017b). Because the filter \u02c6P^t \u00b7 Y^0 is not power preserving, Y^t is rescaled to fit the range of original\nvalues of X. A full description of the approach is given in Alg. 1.\n\n4.3 Manifold Density Equalization\n\nThe generation level \u02c6\u2113(i) in Alg. 1 (step 4), i.e., the number of points generated around each xi,\ndetermines the distribution of points in Y^1. 
Given a biased dataset X, we wish to generate points in\nsparse regions such that the resulting density over M becomes uniform. To do this we have proposed\nto draw \u02c6\u2113(i) points around each point xi, i = 1, ..., N, from N (xi, \u03a3i) (as described in Alg. 1). The\nfollowing proposition provides bounds on the \u201ccorrect\u201d number of points \u02c6\u2113(i), i = 1, ..., N, required\nto balance the density over the manifold by equalizing the degrees \u02c6d(i).\nProposition 4.1. The generation level \u02c6\u2113(i) required to equalize the degree \u02c6d(i) is bounded by\n\ndet(I + \u03a3_i / (2\u03c3^2))^{1/2} \u00b7 [max(\u02c6d(\u00b7)) \u2212 \u02c6d(i)] / [\u02c6d(i) + 1] \u2212 1 \u2264 \u02c6\u2113(i) \u2264 det(I + \u03a3_i / (2\u03c3^2))^{1/2} \u00b7 [max(\u02c6d(\u00b7)) \u2212 \u02c6d(i)],\n\nwhere \u02c6d(i) is the degree value at point xi, \u03c3^2 is the bandwidth of the kernel K (Eq. 1) and \u03a3i is the\ncovariance of the Gaussian designed for generating new points (as described in Algorithm 1).\n\nIn practice we suggest using the mean of the upper and lower bounds to set the number of generated\npoints \u02c6\u2113(i). In Sec. 5.2 we demonstrate how the proposed scheme enables density equalization using\nfew iterations of SUGAR. The proof of Prop. 4.1 is presented in the supplemental material.\n\n5 Experimental Results\n\n5.1 MNIST Manifold\n\nIn the following experiment we empirically demonstrate the ability of SUGAR to fill in missing\nsamples and compare it to two generative Neural Networks: a Variational Autoencoder (VAE, Kingma\n& Welling, 2014), which has an implicit probabilistic model of the data, and a Generative Adversarial\nNetwork (GAN, Goodfellow et al., 2014), which learns to mimic input data distributions. 
Note that\nwe are not able to use other density estimates in general due to the high dimensionality of datasets\nand the inability of density estimates to scale to high dimensions. To begin, we rotated an example of\na handwritten \u20186\u2019 from the MNIST dataset in N = 320 different angles non-uniformly sampled over\nthe range [0, 2\u03c0]. This circular construction was recovered by the diffusion maps embedding of the\ndata, with points towards the undersampled regions having a lower degree than other regions of the\nembedding (Fig. 1, left, colored by degree). We then generated new points around each sample in the\nrotated data according to Alg. 1. We show the results of SUGAR before and after diffusion in Fig. 1\n(top and bottom right, respectively).\n\n[Figure 1 image: panels (a)-(d); axes DM1 and DM2]\n\nFigure 1: A two-dimensional DM representation of: (a) Original data, 320 rotated images of\nhandwritten \u20186\u2019 colored by the degree value \u02c6d(i). (b) VAE output; (c) GAN output; (d) Top: SUGAR\naugmented data before diffusion (i.e., t = 0); (d) Bottom: SUGAR augmented data with one step of\ndiffusion (t = 1). Black asterisks \u2013 original data; Blue circles \u2013 output data.\n\nNext, we compared our results to a two-layer VAE and a GAN trained over the original data (Fig. 1,\n(b) and (c)). Training a GAN on a dataset whose number of samples is of the same order as the dimension\nwas not a simple task. In our experience, adding the gradient penalty as suggested in Gulrajani\net al. (2017) helps prevent mode collapse. The GAN was injected with uniform noise. Both SUGAR\n(t = 1) and the VAE generated points along the circular structure of the original manifold. For the\noutput of the GAN, we had to filter out around 5% of the points, which fall far from the original\nmanifold and look very noisy. Examples of images from both techniques are presented in Fig. 
1.\nNotably, the VAE generated images similar to the original angle distribution, such that sparse regions\nof the manifold were not \ufb01lled. In contrast, points generated by SUGAR occupied new angles not\npresent in the original data but clearly present along the circular manifold. This example illustrates\nthe ability of SUGAR to recover sparse areas of a data manifold.\n\n5.2 Density Equalization\n\nGiven the circular manifold recovered in Sec. 5.1, we next sought to evaluate the density equalization\nproperties proposed in Sec. 4.3. We begin by sampling one hundred points from a circle such that the\nhighest density is at the origin (\u03b8 = 0) and the density decreases away from it (Fig. 2(a), colored by\ndegree \u02c6d(i)). SUGAR was then used to generate new points based on \u02c6(cid:96)(i) around each original point\n(Fig. 2(b), before diffusion, 2(c), after diffusion). We repeat this process for different initial densities\nand evaluate the resulting distribution of point against the amount of iteration of SUGAR. We perform\na Kolmogorov-Smirnov (K-S) test to determine if the points came from a uniform distribution. The\nresulting p-values are presented in Fig. 2(d).\n\n(a)\n\n(b)\n\n(c)\n\n(d)\n\nFigure 2: Density equalization demonstrated on a circle shaped manifold. (a) The original non-\nuniform samples of X. (b) The original points X (black asterisks) and set of new generated points\nY 0 (blue circles). (c) The \ufb01nal set of points Z, original points X (black asterisks) and set of new\ngenerated points Y t (blue circles). In this example only one diffusion time step is required (i.e.\nt = 1). (d) The p-values of a Kolmogorov-Smirnov (K-S) test comparing to a uniform distribution,\nthe x-axis represents the number of SUGAR iterations ran.\n\n5.3 Classi\ufb01cation of Imbalanced Data\n\nThe loss functions of many standard classi\ufb01cation algorithms are global; these algorithms are thus\neasily biased when trained on imbalanced datasets. 
Imbalanced training typically manifests in poor\nclassi\ufb01cation of rare samples. These rare samples are often important (Weiss, 2004). For example,\nthe preponderance of healthy individuals in medical data can obscure the diagnosis of rare diseases.\nResampling and boosting strategies have been used to combat data imbalance. Removing points\n(undersampling) is a simple solution, but this strategy leads to information loss and can decrease\ngeneralization performance. RUSBoost (Seiffert et al., 2010) combines this approach with boosting,\na technique that resamples the data using a set of weights learned by iterative training. Oversampling\nmethods remove class imbalance by generating synthetic data alongside the original data. Synthetic\n\nTable 1: Average class precision (ACP), class recall (ACR), and the Matthews correlation coef\ufb01cient\n(MCC) for k-NN and kernel SVM classi\ufb01ers (using 10-fold cross validation) before / after SMOTE\nand SUGAR, and for RUSBoost classi\ufb01cation.\n\nACP\nACR\nMCC\n\nOrig\n0.67\n0.64\n0.66\n\nk-NN\n\nSMOTE SUGAR Orig\n0.77\n0.78\n0.78\n\n0.78\n0.77\n0.78\n\n0.76\n0.73\n0.74\n\nSVM\n\nSMOTE SUGAR\n0.78\n0.84\n0.84\n\n0.77\n0.78\n0.78\n\nRUSBoost\n0.75\n0.81\n0.80\n\n6\n\n\fC(cid:88)\n\nc=1\n\nC(cid:88)\n\nc=1\n\nMinority Over-sampling Technique (SMOTE, Chawla et al., 2002) oversamples by generating points\nalong lines between existing points of minority classes.\nWe compared SUGAR, RUSBoost, and SMOTE for improving k-NN and kernel SVM classi\ufb01cation\nof 61 imbalanced datasets of varying size (from hundreds to thousands) and imbalance ratio (1.8\u2013130),\nobtained from Alcal\u00e1-Fdez et al. (2009). To quantify classi\ufb01cation performance we used Precision,\nRecall, and the Mathews correlation coef\ufb01cient (MCC), which capture classi\ufb01cation accuracy in light\nof data imbalance. 
For binary classification, precision measures the fraction of predicted\npositives that are true positives, recall measures the fraction of true positives identified, and MCC is a discrete version of\nPearson correlation between the observed and predicted class labels. Formally, they are defined as\n\nPrecision = TP / (TP + FP),\n\nRecall = TP / (TP + FN),\n\nMCC = (TP \u00b7 TN \u2212 FP \u00b7 FN) / \u221a[(TP + FP)(TP + FN)(TN + FP)(TN + FN)].\n\nFor handling multiple classes, the first two are extended via average class precision and recall (ACP\nand ACR), which are defined as\n\nACP = (1/C) \u2211_{c=1}^{C} Precision(class = c),\n\nACR = (1/C) \u2211_{c=1}^{C} Recall(class = c),\n\nwhile MCC is extended to multiclass settings as defined in Gorodkin (2004). These metrics ignore\nclass population biases by equally weighting classes. This experiment is summarized in Table 1 (see\nsupplement for full details).\n\n5.4 Clustering of Imbalanced Data\n\nIn order to examine the effect of SUGAR on clustering, we performed spectral clustering on a set of\nGaussians in the shape of the word \u201cSUGAR\u201d (top panel, Fig. 3(a)). Next, we altered the mixtures to\nsample heavily towards points on the edges of the word (middle panel, Fig. 3(a)). This perturbation\ndisrupted letter clusters. Finally, we performed SUGAR on the biased data to recreate data along the\nmanifold. The combined data and its resultant clustering are shown in the bottom panel of Fig. 3(a),\nrevealing that the letter clustering was restored after SUGAR.\nThe effect of sample density on spectral clustering is evident in the eigendecomposition of the graph\nLaplacian, which describes the connectivity of the graph and is the basis for spectral clustering. We\nshall focus on the multiplicity of the zero eigenvalue, which corresponds to the number of connected\ncomponents of a graph. 
In our example, we see that the zero eigenvalue for the ground truth and\nSUGAR graphs has a multiplicity of 5 whereas the corrupted graph only has a multiplicity of 4 (see\nFig. 3(b)). This connectivity difference arises from the k-neighborhoods of points in each ground\n\n(a)\n\n(b)\n\n(c)\n\nFigure 3: Augmented clustering using SUGAR. (a) Spectral clustering of a mixed Gaussian (top\npanel) with uneven sample density (middle panel). After SUGAR (bottom panel), the original cluster\ngeometries are recovered. (b) Graph Laplacian eigenvalues from (a); the corrupted graph has a lower\nmultiplicity of the zero eigenvalue, indicating fewer connected components. (c) Rand Index of 115\ndata sets (from Alcal\u00e1-Fdez et al., 2009) clustered by k-means before and after applying SUGAR.\n\n7\n\n\ftruth cluster. We note that variation in sample density disrupts the k-neighborhood of points in the\ndownsampled region to include points outside of their ground truth cluster. These connections across\nthe letters of \u201cSUGAR\u201d thus lead to a lower multiplicity of the zero eigenvalue, which negatively\naffects the spectral clustering. Augmenting the biased data via SUGAR equalizes the sampling\ndensity, restoring ground-truth neighborhood structure to the graph built on the data.\nNext, we explored the effects of SUGAR on traditional k-means across 115 datasets obtained\nfrom Alcal\u00e1-Fdez et al. (2009). K-means was performed using the ground truth number of clusters,\nand the Rand Index (RI, Hubert & Arabie, 1985) between the ground truth clustering and the empirical\nclustering was taken (Fig. 3(c), x-axis). Subsequently, SUGAR was used to generate new points for\nclustering together with the original data. The RI over the original data was again computed, this\ntime using the SUGAR clusters (Fig. 3(c), y-axis). 
Our results indicate that SUGAR can be used to\nimprove the cluster quality of k-means.\n\n5.5 Biological Manifolds\n\n[Figure 4 image: panels (a)-(d)]\n\nFigure 4: SUGAR was used to augment scRNA-seq data collected by Velten et al. (2017). (a)\nAugmented data embedded with PHATE (Moon et al., 2017b) and colored by k-means over the gene\nmodule dimensions identified by Velten et al. (2017). Seven canonical cell types are present. EBM:\neosinophil/basophil/mast cells; N: neutrophils; MD: monocytes/dendritic cells; E: erythroid cells;\nMK: megakaryocytes; Pre-B: immature B cell; B: mature B cell. (b) Cell type prevalence before\nand after SUGAR. (c) Explained variation (r2) and scaled mutual information (MI_i / max MI) between the\ncomponents of the fourteen coexpression modules identified by Velten et al. (2017). (d) Relationship\nbetween CD19 (B cell maturation marker) and HOXA3 (a cell immaturity marker), CASP1 (B cell\nlinear commitment marker), and EAF2 (neutrophil and monocyte marker that is upregulated in mature\nB cells; see Sec. 5.5). Marker names appear above the figure, values represented on the y axis.\n\nNext, we used SUGAR for exploratory analysis of a biological dataset. In Velten et al. (2017), a high\ndimensional yet small (X \u2208 R^{1029\u00d712553}) single-cell RNA sequencing (scRNA-seq) dataset was\ncollected to elucidate the development of human blood cells, which is posited to form a continuum of\ndevelopment trajectories from a central reservoir of immature cells. This dataset thus represents an\nideal substrate to explore manifold learning (Moon et al., 2017a). However, the data presents two\ndistinct challenges due to 1. undersampling of cell types, and 2. dropout and artifacts associated with\nscRNA-seq (Kim et al., 2015). 
These challenges stand at odds with a central task of computational\nbiology; namely, the characterization of gene-gene interactions that foment phenotypes.\nWe \ufb01rst sought to enrich rare phenotypes in the Velten data by generating Y \u2208 R4116\u00d712553 new data\npoints with SUGAR. A useful tool for this analysis is the \u2018gene module\u2019, a pair or set of genes that\nare expressed together to drive phenotype development. K-means clustering of the augmented data\nover fourteen principal gene modules (32 dimensions) revealed six cell types described in Velten\net al. (2017) and a seventh cluster consisting of mature B-cells (Fig. 4(a)). Analysis of population\nprevalence before and after SUGAR revealed a dramatic enrichment of mature B and pre-B cells,\neosinophil/basophil/mast cells (EBM), and neutrophils (N), while previously dominant megakary-\nocytes (MK) became a more equal portion of the post-SUGAR population (Fig. 4(b)). These results\ndemonstrate the ability of SUGAR to balance population prevalence along a data manifold.\nIn Fig. 4(c), we examine the effect of SUGAR on intra-module relationships. Because expression\nof genes in a module are molecularly linked, intra-module relationships should be strong in the\nabsence of sampling biases and experimental noise. After SUGAR, we note an improvement in linear\nregression (r2) and scaled mutual information coef\ufb01cients. We note that in some cases the change in\nmutual information was stronger than linear regression, likely due to nonlinearities in the module\nrelationship. Because this experiment was based on putative intra-module relationships we next\nsought to identify strong improvements in regression coef\ufb01cients de novo. To this end, we compared\nthe relationship of the B cell maturation marker CD19 with the entire dataset before and after SUGAR.\nIn Fig. 4(d) we show three relationships with marked improvement from the original data (top panel)\nto the augmented data (bottom panel). 
The markers uncovered by this search, HOXA3, CASP1, and\nEAF2, each have disparate relationships with CD19. HOXA3 marks stem cell immaturity, and is\nnegatively correlated with CD19. In contrast, CASP1 is known to mark commitment to the B cell\nlineage (Velten et al., 2017). After SUGAR, both of these relationships were enhanced. EAF2 is\na part of a module that is expressed during early development of neutrophils and monocytes; we\nobserve that its correlation and mutual information with B cell maturation are also increased after\nSUGAR. In light of the early development discussed by Velten et al. (2017), this new\nrelationship may at first seem problematic. However, Li et al. (2016) showed that EAF2 is upregulated in mature\nB cells as a mechanism against autoimmunity. Taken together, our analyses show that SUGAR\nis effective for bolstering relationships between dimensions in the absence of prior knowledge for\nexploratory data analysis.\n\n6 Conclusion\n\nSUGAR presents a new type of generative model, based on data geometry rather than density. This\nenables us to compensate for sparsity and heavily biased sampling in many data types of interest,\nespecially biomedical data. We assume that the training data lies on a low-dimensional manifold.\nThis manifold assumption holds for many datasets (e.g., single-cell RNA sequencing (Moon\net al., 2017a)) because they are globally high-dimensional but locally generated by a small number of\nfactors. We use a diffusion kernel to capture the manifold structure. Then, we randomly generate new\npoints along the incomplete manifold, with emphasis on its sparse areas. Finally, we use a weighted\ntransition kernel to pull the new points towards the structure of the manifold. 
The presented method\ndemonstrated promising results on synthetic data, MNIST images, and high-dimensional biological\ndatasets in applications such as clustering, classification, and mutual information relationship analysis.\nWe note that a toolbox implementing the presented algorithm is available via GitHub\n(github.com/KrishnaswamyLab/SUGAR) for free\nacademic use (see supplement for details), and we expect future work to apply SUGAR to study\nextremely biased biological datasets and improve classification and regression performance on them.\n\nAcknowledgments\n\nThis research was partially funded by a grant from the Chan-Zuckerberg Initiative (ID: 182702).\n\nReferences\n\nAlcal\u00e1-Fdez, Jes\u00fas, Sanchez, Luciano, Garcia, Salvador, del Jesus, Maria Jose, Ventura, Sebastian,\nGarrell, Josep Maria, Otero, Jos\u00e9, Romero, Crist\u00f3bal, Bacardit, Jaume, Rivas, Victor M., Fern\u00e1ndez,\nJuan C., and Herrera, Francisco. KEEL: a software tool to assess evolutionary algorithms for data\nmining problems. Soft Computing, 13(3):307\u2013318, 2009.\n\nBeal, Matthew J. and Ghahramani, Zoubin. The variational Bayesian EM algorithm for incomplete\ndata: with application to scoring graphical model structures. Bayesian statistics, 7, 2003.\n\nBengio, Yoshua and Monperrus, Martin. Non-local manifold tangent learning. In Advances in Neural\nInformation Processing Systems (NIPS), volume 18, pp. 129\u2013136, 2005.\n\nBengio, Yoshua, Larochelle, Hugo, and Vincent, Pascal. Non-local manifold Parzen windows. In\nAdvances in neural information processing systems (NIPS), volume 19, pp. 115\u2013122, 2006.\n\nBermanis, Amit, Salhov, Moshe, Wolf, Guy, and Averbuch, Amir. Measure-based diffusion grid\nconstruction and high-dimensional data discretization. Applied and Computational Harmonic\nAnalysis, 40(2):207\u2013228, 2016a.\n\nBermanis, Amit, Wolf, Guy, and Averbuch, Amir. Diffusion-based kernel methods on Euclidean\nmetric measure spaces. 
Applied and Computational Harmonic Analysis, 41(1):190\u2013213, 2016b.\n\nBrubaker, Marcus, Salzmann, Mathieu, and Urtasun, Raquel. A family of MCMC methods on\nimplicitly defined manifolds. In Artificial Intelligence and Statistics, pp. 161\u2013172, 2012.\n\nChawla, Nitesh V., Bowyer, Kevin W., Hall, Lawrence O., and Kegelmeyer, W. Philip. SMOTE:\nsynthetic minority over-sampling technique. Journal of artificial intelligence research, 16:321\u2013357,\n2002.\n\nCoifman, Ronald R. and Lafon, St\u00e9phane. Diffusion maps. Applied and Computational Harmonic\nAnalysis, 21(1):5\u201330, 2006.\n\nDoersch, Carl. Tutorial on variational autoencoders. arXiv:1606.05908, 2016.\n\nGin\u00e9, Evarist and Guillou, Armelle. Rates of strong uniform consistency for multivariate kernel\ndensity estimators. Annales de l\u2019Institut Henri Poincare (B) Probability and Statistics, 38(6):\n907\u2013921, 2002.\n\nGirolami, Mark and Calderhead, Ben. Riemann manifold Langevin and Hamiltonian Monte Carlo\nmethods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):\n123\u2013214, 2011.\n\nGoodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil,\nCourville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in neural\ninformation processing systems (NIPS), volume 27, pp. 2672\u20132680, 2014.\n\nGorodkin, Jan. Comparing two k-category assignments by a k-category correlation coefficient.\nComputational biology and chemistry, 28(5-6):367\u2013374, 2004.\n\nGr\u00fcn, Dominic, Lyubimova, Anna, Kester, Lennart, Wiebrands, Kay, Basak, Onur, Sasaki, Nobuo,\nClevers, Hans, and van Oudenaarden, Alexander. Single-cell messenger RNA sequencing reveals\nrare intestinal cell types. Nature, 525(7568):251, 2015.\n\nGulrajani, Ishaan, Ahmed, Faruk, Arjovsky, Martin, Dumoulin, Vincent, and Courville, Aaron C.\nImproved training of Wasserstein GANs. 
In Advances in Neural Information Processing Systems\n(NIPS), volume 30, pp. 5767\u20135777, 2017.\n\nHe, Haibo and Garcia, Edwardo A. Learning from imbalanced data. IEEE Transactions on knowledge\nand data engineering, 21(9):1263\u20131284, 2009.\n\nHensman, Paulina and Masko, David. The impact of imbalanced training data for convolutional\nneural networks. Degree Project in Computer Science, KTH Royal Institute of Technology, 2015.\n\nHubert, Lawrence and Arabie, Phipps. Comparing partitions. Journal of classification, 2(1):193\u2013218,\n1985.\n\nKeller, Yosi, Coifman, Ronald R., Lafon, St\u00e9phane, and Zucker, Steven W. Audio-visual group\nrecognition using diffusion maps. IEEE Transactions on Signal Processing, 58(1):403\u2013413, 2010.\n\nKim, Jong Kyoung, Kolodziejczyk, Aleksandra A., Ilicic, Tomislav, Teichmann, Sarah A., and\nMarioni, John C. Characterizing noise structure in single-cell RNA-seq distinguishes genuine from\ntechnical stochastic allelic expression. Nature communications, 6:8687, 2015.\n\nKingma, Diederik P. and Welling, Max. Auto-encoding variational Bayes. In International Conference\non Learning Representations (ICLR), 2014. arXiv:1312.6114.\n\nKrishnaswamy, Smita, Spitzer, Matthew H., Mingueneau, Michael, Bendall, Sean C., Litvin, Oren,\nStone, Erica, Pe\u2019er, Dana, and Nolan, Garry P. Conditional density-based analysis of T cell\nsignaling in single-cell data. Science, 346(6213):1250689, 2014.\n\nLi, Yingqian, Takahashi, Yoshimasa, Fujii, Shin-ichiro, Zhou, Yang, Hong, Rongjian, Suzuki, Akari,\nTsubata, Takeshi, Hase, Koji, and Wang, Ji-Yang. EAF2 mediates germinal centre B-cell apoptosis\nto suppress excessive immune responses and prevent autoimmunity. Nature communications, 7:\n10836, 2016.\n\nLindenbaum, Ofir, Salhov, Moshe, Yeredor, Arie, and Averbuch, Amir. Kernel scaling for manifold\nlearning and classification. 
arXiv:1707.01093, 2017.\n\nL\u00f3pez, Victoria, Fern\u00e1ndez, Alberto, Garc\u00eda, Salvador, Palade, Vasile, and Herrera, Francisco. An\ninsight into classification with imbalanced data: Empirical results and current trends on using data\nintrinsic characteristics. Information Sciences, 250:113\u2013141, 2013.\n\nMoon, Kevin R., Stanley, Jay, Burkhardt, Daniel, van Dijk, David, Wolf, Guy, and Krishnaswamy,\nSmita. Manifold learning-based methods for analyzing single-cell RNA-sequencing data. Current\nOpinion in Systems Biology, 2017a.\n\nMoon, Kevin R., van Dijk, David, Wang, Zheng, Burkhardt, Daniel, Chen, William, van den Elzen,\nAntonia, Hirn, Matthew J., Coifman, Ronald R., Ivanova, Natalia B., Wolf, Guy, and Krishnaswamy,\nSmita. Visualizing transitions and structure for high dimensional data exploration. bioRxiv:120378,\nDOI: 10.1101/120378, 2017b.\n\n\u00d6ztireli, A. Cengiz, Alexa, Marc, and Gross, Markus. Spectral sampling of manifolds. ACM\nTransactions on Graphics (TOG), 29(6):168:1\u2013168:8, 2010.\n\nRasmussen, Carl Edward. The infinite Gaussian mixture model. In Advances in neural information\nprocessing systems (NIPS), volume 13, pp. 554\u2013560, 2000.\n\nScott, David W. Averaged shifted histograms: effective nonparametric density estimators in several\ndimensions. The Annals of Statistics, 13:1024\u20131040, 1985.\n\nScott, David W. Kernel density estimators. In Multivariate Density Estimation: Theory, Practice,\nand Visualization, chapter 6, pp. 125\u2013193. Wiley Online Library, 2008.\n\nSeaman, D. Erran and Powell, Roger A. An evaluation of the accuracy of kernel density estimators\nfor home range analysis. Ecology, 77(7):2075\u20132085, 1996.\n\nSeiffert, Chris, Khoshgoftaar, Taghi M., Van Hulse, Jason, and Napolitano, Amri. RUSBoost:\nA hybrid approach to alleviating class imbalance. 
IEEE Transactions on Systems, Man, and\nCybernetics-Part A: Systems and Humans, 40(1):185\u2013197, 2010.\n\nSinger, Amit, Erban, Radek, Kevrekidis, Ioannis G., and Coifman, Ronald R. Detecting intrinsic\nslow variables in stochastic dynamical systems by anisotropic diffusion maps. Proceedings of the\nNational Academy of Sciences, 106(38):16090\u201316095, 2009.\n\nVaranasi, Mahesh K. and Aazhang, Behnaam. Parametric generalized Gaussian density estimation.\nThe Journal of the Acoustical Society of America, 86(4):1404\u20131415, 1989.\n\nVelten, Lars, Haas, Simon F., Raffel, Simon, Blaszkiewicz, Sandra, Islam, Saiful, Hennig, Bianca P.,\nHirche, Christoph, Lutz, Christoph, Buss, Eike C., Nowak, Daniel, Boch, Tobias, Hofmann,\nWolf-Karsten, Ho, Anthony D., Huber, Wolfgang, Trumpp, Andreas, Essers, Marieke A. G., and\nSteinmetz, Lars M. Human haematopoietic stem cell lineage commitment is a continuous process.\nNature Cell Biology, 19:271\u2013281, 2017.\n\nVincent, Pascal and Bengio, Yoshua. Manifold Parzen windows. In Advances in neural information\nprocessing systems (NIPS), volume 16, pp. 849\u2013856, 2003.\n\nWeiss, Gary M. Mining with rarity: a unifying framework. ACM SIGKDD Explorations Newsletter,\n6(1):7\u201319, 2004.\n\nWu, Junjie. The uniform effect of k-means clustering. In Advances in K-means Clustering, pp. 17\u201335.\nSpringer, 2012.\n\nXuan, Li, Zhigang, Chen, and Fan, Yang. Exploring of clustering algorithm on class-imbalanced\ndata. In The 8th International Conference on Computer Science & Education (ICCSE 2013), pp.\n89\u201393, 2013.\n\nZelnik-Manor, Lihi and Perona, Pietro. Self-tuning spectral clustering. 
In Advances in neural\ninformation processing systems (NIPS), volume 18, pp. 1601\u20131608, 2005.\n", "award": [], "sourceid": 722, "authors": [{"given_name": "Ofir", "family_name": "Lindenbaum", "institution": "Yale"}, {"given_name": "Jay", "family_name": "Stanley", "institution": "Yale University"}, {"given_name": "Guy", "family_name": "Wolf", "institution": "Yale University"}, {"given_name": "Smita", "family_name": "Krishnaswamy", "institution": "Yale University"}]}