{"title": "Multiscale Dictionary Learning for Estimating Conditional Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 1797, "page_last": 1805, "abstract": "Nonparametric estimation of the conditional distribution of a response given high-dimensional features is a challenging problem. It is important to allow not only the mean but also the variance and shape of the response density to change flexibly with features, which are massive-dimensional. We propose a multiscale dictionary learning model, which expresses the conditional response density as a convex combination of dictionary densities, with the densities used and their weights dependent on the path through a tree decomposition of the feature space. A fast graph partitioning algorithm is applied to obtain the tree decomposition, with Bayesian methods then used to adaptively prune and average over different sub-trees in a soft probabilistic manner. The algorithm scales efficiently to approximately one million features. State of the art predictive performance is demonstrated for toy examples and two neuroscience applications including up to a million features.", "full_text": "Multiscale Dictionary Learning for\nEstimating Conditional Distributions\n\nFrancesca Petralia\n\nDepartment of Genetics and Genomic Sciences\n\nIcahn School of Medicine at Mt Sinai\n\nNew York, NY 10128, U.S.A.\n\nfrancesca.petralia@mssm.edu\n\nJoshua Vogelstein\nChild Mind Institute\n\nDepartment of Statistical Science\n\nDuke University\n\nDurham, North Carolina 27708, U.S.A.\n\njo.vo@duke.edu\n\nDavid B. Dunson\n\nDepartment of Statistical Science\n\nDuke University\n\nDurham, North Carolina 27708, U.S.A.\n\ndunson@stat.duke.edu\n\nAbstract\n\nNonparametric estimation of the conditional distribution of a response given high-\ndimensional features is a challenging problem. 
It is important to allow not only the\nmean but also the variance and shape of the response density to change \ufb02exibly\nwith features, which are massive-dimensional. We propose a multiscale dictio-\nnary learning model, which expresses the conditional response density as a convex\ncombination of dictionary densities, with the densities used and their weights de-\npendent on the path through a tree decomposition of the feature space. A fast graph\npartitioning algorithm is applied to obtain the tree decomposition, with Bayesian\nmethods then used to adaptively prune and average over different sub-trees in a\nsoft probabilistic manner. The algorithm scales ef\ufb01ciently to approximately one\nmillion features. State of the art predictive performance is demonstrated for toy\nexamples and two neuroscience applications including up to a million features.\n\n1\n\nIntroduction\n\nMassive datasets are becoming an ubiquitous by-product of modern scienti\ufb01c and industrial ap-\nplications. These data present statistical and computational challenges because many previously\ndeveloped analysis approaches do not scale-up suf\ufb01ciently. Challenges arise because of the ultra\nhigh-dimensionality and relatively low sample size. Parsimonious models for such big data assume\nthat the density in the ambient space concentrates around a lower-dimensional (possibly nonlinear)\nsubspace. A plethora of methods are emerging to estimate such lower-dimensional subspaces [1, 2].\nWe are interested in using such lower-dimensional embeddings to obtain estimates of the conditional\ndistribution of some target variable(s). This conditional density estimation setting arises in a number\nof important application areas, including neuroscience, genetics, and video processing. For exam-\nple, one might desire automated estimation of a predictive density for a neurologic phenotype of\ninterest, such as intelligence, on the basis of available data for a patient including neuroimaging. 
The challenge is to estimate the probability density function of the phenotype nonparametrically based on a 10^6-dimensional image of the subject\u2019s brain. It is crucial to avoid parametric assumptions on the density, such as Gaussianity, while allowing the density to change flexibly with predictors. Otherwise, one can obtain misleading predictions and poorly characterize predictive uncertainty.\n\nThere is a rich machine learning and statistical literature on conditional density estimation of a response y \u2208 Y given a set of features (predictors) x = (x1, x2, . . . , xp)T \u2208 X \u2286 Rp. Common approaches include hierarchical mixtures of experts [3, 4], kernel methods [5, 6, 7], Bayesian finite mixture models [8, 9, 10] and Bayesian nonparametrics [11, 12, 13, 14]. However, there has been limited consideration of scaling to large p settings, with the variational Bayes approach of [9] being a notable exception. For dimensionality reduction, [9] follow a greedy variable selection algorithm. Their approach does not scale to applications of the size we are interested in. For example, in a problem with p = 1,000 and n = 500, they reported a CPU time of 51.7 minutes for a single analysis. We are interested in problems in which p is several orders of magnitude larger, requiring a faster computing time while also accommodating flexible nonlinear dimensionality reduction (variable selection is a limited sort of dimension reduction). To our knowledge, there are no nonparametric density regression competitors to our approach that maintain a characterization of uncertainty in estimating the conditional densities; rather, all sufficiently scalable algorithms provide point predictions and/or rely on restrictive assumptions such as linearity.\nIn big data problems, scaling is often accomplished using divide-and-conquer techniques. 
However,\nas the number of features increases, the problem of \ufb01nding the best splitting attribute becomes\nintractable, so that CART, MARS and multiple tree models cannot be ef\ufb01ciently applied. Similarly,\nmixture of experts becomes computationally demanding, since both mixture weights and dictionary\ndensities are predictor dependent. To improve ef\ufb01ciency, sparse extensions relying on different\nvariable selection algorithms have been proposed [15]. However, performing variable selection in\nhigh dimensions is effectively intractable: algorithms need to ef\ufb01ciently search for the best subsets\nof predictors to include in weight and mean functions within a mixture model, an NP-hard problem\n[16].\nIn order to ef\ufb01ciently deal with massive datasets, we propose a novel multiscale approach which\nstarts by learning a multiscale dictionary of densities. This tree is ef\ufb01ciently learned in a \ufb01rst stage\nusing a fast and scalable graph partitioning algorithm applied to the high-dimensional observations\n[17]. Expressing the conditional densities f (y|x) for each x \u2208 X as a convex combination of\ncoarse-to-\ufb01ne scale dictionary densities, the learning problem in the second stage estimates the\ncorresponding multiscale probability tree. This is accomplished in a Bayesian manner using a novel\nmultiscale stick-breaking process, which allows the data to inform about the optimal bias-variance\ntradeoff; weighting coarse scale dictionary densities more highly decreases variance while adding\nto bias. This results in a model that borrows information across different resolution levels and\nreaches a good compromise in terms of the bias-variance tradeoff. We show that the algorithm\nscales ef\ufb01ciently to millions of features.\n\n2 Setting\nLet X : \u2126 \u2192 X \u2286 Rp be a p-dimensional Euclidean vector-valued predictor random variable, taking\nvalues x \u2208 X , with a marginal probability distribution fX. 
Similarly, let Y : \u2126 \u2192 Y be a target-valued random variable (e.g., Y \u2286 R). For inferential expedience, we posit the existence of a latent variable \u03b7 : \u2126 \u2192 M \u2286 X , where M is only d \u201cdimensional\u201d and d \u226a p. Note that M need not be a linear subspace of X ; rather, M could be, for example, a union of affine subspaces, or a smooth compact Riemannian manifold. Regardless of the nature of M, we assume that we can approximately decompose the joint distribution as follows: fX,Y,\u03b7 = fX,Y |\u03b7 f\u03b7 = fY |X,\u03b7 fX|\u03b7 f\u03b7 \u2248 fY |\u03b7 fX|\u03b7 f\u03b7. Hence, we assume that the signal approximately concentrates around a low-dimensional latent space, fY |X,\u03b7 = fY |\u03b7. This is a much less restrictive assumption than the commonplace assumption in manifold learning that the marginal distribution fX concentrates around a low-dimensional latent space.\nTo provide some intuition for our model, we consider the following concrete example, where the distribution of y \u2208 R is a Gaussian function of the coordinate \u03b7 \u2208 M along a swissroll, which is embedded in a high-dimensional ambient space. Specifically, we sample the manifold coordinate, \u03b7 \u223c U (0, 1). We sample x = (x1, . . . , xp)T as follows:\n\nx1 = \u03b7 sin(\u03b7); x2 = \u03b7 cos(\u03b7); xr \u223c N (0, 1), r \u2208 {3, . . . , p}.\n\nFinally, we sample y from N (\u00b5(\u03b7), \u03c3(\u03b7)). Clearly, x and y are conditionally independent given \u03b7, which is the low-dimensional signal manifold. In particular, x lives on a swissroll embedded in a p-dimensional ambient space, but y is only a function of the coordinate \u03b7 along the swissroll M. The left panels of Figure 1 depict this example when \u00b5(\u03b7) = \u03b7 and \u03c3(\u03b7) = \u03b7 + 1.\n\nFigure 1: Illustration of our generative model and algorithm on a swissroll. 
The top left panel shows the manifold M (a swissroll) embedded in a p-dimensional ambient space, where the color indicates the coordinate along the manifold, \u03b7 (only the first 3 dimensions are shown for visualization purposes). The bottom left panel shows the distribution of y as a function of \u03b7, in particular, fY |\u03b7 = N (\u03b7, \u03b7 + 1). The middle and right panels show our estimates of fY |\u03b7 at scales 3 and 4, respectively, which follow from partitioning our data. Sample size was n = 10,000.\n\n3 Goal\n\nOur goal is to develop an approach to learn about fY |X from n pairs of observations that we assume are exchangeable samples from the joint distribution, (xi, yi) \u223c fX,Y \u2208 F. Let Dn = {(xi, yi)}i\u2208[n], where [n] = {1, . . . , n}. More specifically, we seek to obtain a posterior over fY |X. We insist that our approach satisfies several desiderata, including most importantly: (i) it scales up to p \u2248 10^6 in reasonable time, (ii) it yields good empirical results, and (iii) it automatically adapts to the complexity of the data corpus. To our knowledge, no extant approach for estimating conditional densities or posteriors thereof satisfies even our first criterion.\n\n4 Methodology\n\n4.1 Ms. Deeds Framework\n\nWe propose here a general modular approach which we refer to as multiscale dictionary learning for estimating conditional distributions (\u201cMs. Deeds\u201d). Ms. Deeds consists of two components: (i) a tree decomposition of the space, and (ii) an assumed form of the conditional probability model.\n\nTree Decomposition A tree decomposition \u03c4 yields a multiscale partition of the data or the ambient space in which the data live. Let (W, \u03c1W , FW ) be a measurable metric space, where FW is a Borel probability measure on W and \u03c1W : W \u00d7 W \u2192 R is a metric on W. Let B_r^W(w) be the \u03c1W -ball inside W of radius r > 0 centered at w \u2208 W. 
For example, W could be the data corpus Dn, or it could be X \u00d7 Y. We define a tree decomposition as in [2, 18]. A partition tree \u03c4 of W consists of a collection of cells, \u03c4 = {Cj,k}j\u2208Z,k\u2208Kj. At each scale j, the set of cells Cj = {Cj,k}k\u2208Kj provides a disjoint partition of W almost everywhere. We define j = 0 as the root node. For each j > 0, each set has a unique parent node. Denote by\n\nAj,k = {(j\u2032, k\u2032) : Cj,k \u2286 Cj\u2032,k\u2032, j\u2032 < j}, Dj,k = {(j\u2032, k\u2032) : Cj\u2032,k\u2032 \u2286 Cj,k, j\u2032 > j}\n\nrespectively the ancestors and the descendants of node (j, k).\n\nUnlike classical harmonic theory which presupposes \u03c4 (e.g., in wavelets [19]), we choose to learn \u03c4 from the data. Previously, Chen et al. [18] developed a multiscale measure estimation strategy, and proved that there exists a scale j such that the approximate measure is within some bound of the true measure, under certain relatively general assumptions. We decided to simply partition the x\u2019s, ignoring the y\u2019s in the partitioning strategy. Our justification for this choice is as follows. First, sometimes there are many different y\u2019s for many different applications. In such cases, we do not want to bias the partitioning to any specific y\u2019s, all the more so when new unknown y\u2019s may later emerge. Second, because the x\u2019s are so much higher dimensional than the y\u2019s in our applications of interest, the partitions would be dominated by the x\u2019s, unless we chose a partitioning strategy that emphasized the y\u2019s. Thus, our strategy mitigates this difficulty (while certainly introducing others).\nGiven that we are going to partition using only the x\u2019s, we still face the choice of precisely how to partition. A fully Bayesian approach would construct a large number of partitions, and integrate over them to obtain posteriors. 
However, such a fully Bayesian strategy remains computationally in-\ntractable at scale, so we adopt a hybrid strategy. Speci\ufb01cally, we employ METIS [17], a well-known\nrelatively ef\ufb01cient multiscale partitioning algorithm with demonstrably good empirical performance\non a wide range of graphs. Given n observations, i.e. xi = (xi1, . . . , xip)T \u2208 X for i \u2208 [n], the\ngraph construction follows via computing all pairwise distances using \u03c1(xu, xv) = (cid:107)\u02dcxu \u2212 \u02dcxv(cid:107)2,\nwhere \u02dcx is the whitened x (i.e., mean subtracted and variance normalized). We let there be an edge\nbetween xu and xv whenever e\u2212\u03c1(xu,xv)2\n> t, where t is some threshold chosen to elicit the desired\nsparsity level. Applying METIS recursively on the graph constructed in this way yields a single tree\n(see supplementary material for further details).\n\nConditional Probability Model Given the tree decomposition of the data, we place a non-\nparametric prior over the tree. Speci\ufb01cally, we de\ufb01ne fY |X as\n\nfY |X =\n\n\u03c0j,kj (x)fj,kj (x)(y|x)\n\n(1)\n\n(cid:88)\n\nj\u2208Z\n\n(cid:89)\n\nsuch that(cid:80)\n\nwhere(cid:80)\n\nwhere kj(x) is the set at scale j where x has been allocated and \u03c0j,kj (x) are weights across scales\nj\u2208Z \u03c0j,kj (x) = 1. We let weights in Eq. (1) be generated by a stick-breaking process\n[20]. For each node Cj,k in the partition tree, we de\ufb01ne a stick length Vj,k \u223c Beta(1, \u03b1). The\nparameter \u03b1 encodes the complexity of the model, with \u03b1 = 0 corresponding to the case in which\nf (y|x) = f (y). The stick-breaking process is de\ufb01ned as\n\n\u03c0j,k = Vj,k\n\n(j(cid:48),k(cid:48))\u2208Aj,k\n\n[1 \u2212 Vj(cid:48),k(cid:48)] ,\n\n(2)\n\n(j(cid:48),k(cid:48))\u2208Aj,k\n\n\u03c0j(cid:48),k(cid:48) = 1. The implication of this is that each scale within a path is weighted\nto optimize the bias/variance trade-off across scales. 
We refer to this prior as a multiscale stick-breaking process. Note that this Bayesian nonparametric prior assigns a positive probability to all possible paths, including those not observed in the training data. Thus, by adopting this Bayesian formulation, we are able to obtain posterior estimates for any newly observed data, regardless of the amount and variability of training data. This is a pragmatically useful feature of the Bayesian formulation, in addition to the alleviation of the need to choose a scale [18].\nEach fj,k in Eq. (1) is an element of a family of distributions. This family might be quite general, e.g., all possible conditional densities, or quite simple, e.g., Gaussian distributions. Moreover, the family can adapt with j or k, being more complex at the coarser scales (for which the nj,k\u2019s are larger), and simpler at the finer scales (or partitions with fewer samples). We let the family of conditional densities for y be Gaussian for simplicity; that is, we assume that fj,k = N (\u00b5j,k, \u03c3j,k) with \u00b5j,k \u2208 R and \u03c3j,k \u2208 R+. Because we are interested in posteriors over the conditional distribution fY |X, we place relatively uninformative but conjugate priors on \u00b5j,k and \u03c3j,k; specifically, assuming the y\u2019s have been whitened and are unidimensional, \u00b5j,k \u223c N (0, 1) and \u03c3j,k \u223c IG(a, b). Obviously, other choices, such as finite or infinite mixtures of Gaussians, are also possible for continuous-valued data.\n\n4.2 Inference\n\nWe introduce the latent variable \u2113i \u2208 Z, for i \u2208 [n], denoting the multiscale level used by the ith observation. Let nj,k be the number of observations in Cj,k. Let kh(xi) be a variable indicating the set at level h where xi has been allocated. 
Each Gibbs sampler iteration can be summarized in the following steps:\n\n(i) Update \u2113i by sampling from the multinomial full conditional:\n\nPr(\u2113i = j | \u00b7) = \u03c0j,kj(xi) fj,kj(xi)(yi|xi) / \u2211s\u2208Z \u03c0s,ks(xi) fs,ks(xi)(yi|xi).\n\n(ii) Update the stick-breaking random variable Vj,k, for any j \u2208 Z and k \u2208 Kj, from Beta(\u03b2\u2032, \u03b1\u2032), with \u03b2\u2032 = 1 + nj,k and \u03b1\u2032 = \u03b1 + \u2211(r,s)\u2208Dj,k nr,s.\n\n(iii) Update \u00b5j,k and \u03c3j,k, for any j \u2208 Z and k \u2208 Kj, by sampling from\n\n\u00b5j,k \u223c N (\u03c5j,k \u03bdj,k \u00afyj,k, \u03c5j,k), \u03c3j,k \u223c IG(a\u03c3, b + 0.5 \u2211i\u2208Ij,k (yi \u2212 \u00b5j,k)^2),\n\nwhere \u03c5j,k = 1/(1 + \u03bdj,k), \u03bdj,k = nj,k/\u03c3j,k, a\u03c3 = a + nj,k/2, \u00afyj,k is the average of the observations {yi} allocated to cell Cj,k, and Ij,k = {i : \u2113i = j, xi \u2208 Cj,k}.\n\nTo make predictions, the Gibbs sampler was run with up to 20,000 iterations, including a burn-in of 1,000 (see Supplementary material for details). Gibbs sampler chains were stopped by testing normality of normalized averages of functions of the Markov chain [21]. The parameters (a, b) and \u03b1 involved in the prior densities of the \u03c3j,k\u2019s and Vj,k\u2019s were set to (3, 1) and 1, respectively. All predictions used a leave-one-out strategy.\n\n4.3 Simulation Studies\n\nIn order to assess the predictive performance of the proposed model, we considered the four different simulation scenarios described below:\n(1) Nonlinear Mixture We first consider a relatively simple yet nonlinear joint model, with a conditional Gaussian mixture y|\u03b7 \u223c |\u03b7|N (\u00b51, \u03c31) + (1 \u2212 |\u03b7|)N (\u00b52, \u03c32), a marginal distribution for each dimension of x, xr|\u03b7 \u223c N (\u03b7, \u03c3x), r \u2208 {1, 2, . . . 
, p}, and a uniform distribution over the latent manifold, \u03b7 \u223c sin(U (0, c)). In the simulations we let (\u00b51, \u03c31) = (\u22122, 1), (\u00b52, \u03c32) = (2, 1), \u03c3x = 0.1, c = 20, and p = 1000. Thus, fY |X is a highly nonlinear function of x, and even of \u03b7, and x is high-dimensional.\n(2) Swissroll We then return to the swissroll example of Figure 1; in Figure 3 we show results for (\u00b5, \u03c3) = (\u03b7, 1).\n(3) Linear Subspace Letting \u0393 \u2208 R^{(p+1)\u00d7q} and \u0398 be a q \u00d7 d \u201cdiagonal\u201d matrix (meaning all entries other than the first d < q elements of the diagonal are zero), we assume the following model: Y, X|\u03b7 \u223c Np+1(\u0393\u0398\u03b7, I), where \u0393 \u223c Sp+1,d indicates \u0393 is uniformly sampled from the set of all orthonormal d-frames in R^{p+1} (a Stiefel manifold), \u03b8ii \u223c IG(a\u03b8, b\u03b8) for i \u2208 {1, . . . , d} and all other elements of \u0398 are zero, and \u03b7 \u223c Nd(0, I). In the simulation, we let q = d = 5, (a\u03b8, b\u03b8) = (1, 0.25).\n(4) Union of Linear Subspaces This model is a direct extension of the linear subspace model, as it is a union of subspaces. We let the dimensionality of each subspace vary to demonstrate the generality of our procedure. Specifically, we assume Y, X|\u03b7 \u223c \u2211_{g=1}^{G} \u03c9g Np+1(\u0393g\u0398g\u03b7, I), \u03c9 \u223c Dirichlet(\u03b1), \u03b7 \u223c Nd(0, I), where \u0393g \u223c Sp+1,g and \u0398g is \u201cdiagonal\u201d with \u03b8ii \u223c IG(ag, bg) for i \u2208 {1, . . . , g}, and the remaining elements of \u0398g are zero. In the simulation, we let G = 5, \u03b1 = (1, . . . , 1)T, (ag, bg) = (a\u03b8, b\u03b8) as above.\n\n4.4 Neuroscience Applications\n\nWe assessed the predictive performance of the proposed method on two very different neuroimaging datasets. 
For all analyses, each variable was normalized by subtracting its mean and dividing by its standard deviation. The prior specification and Gibbs sampler described in \u00a7\u00a74.1 and 4.2 were utilized.\nIn the first experiment we investigated the extent to which we could predict creativity (as measured via the Composite Creativity Index [22]) via a structural connectome dataset collected at the Mind Research Network (data were collected as described in Jung et al. [23]). For each subject, we estimate a 70-vertex undirected weighted brain-graph using the Magnetic Resonance Connectome Automated Pipeline (MRCAP) [24] from diffusion tensor imaging data [25]. Because our graphs are undirected and lack self-loops, we have a total of p = (70 choose 2) = 2,415 potential weighted edges. The p-dimensional feature vector is defined by the natural logarithm of the vectorized matrix described above.\nThe second dataset comes from a resting-state functional magnetic resonance experiment as part of the Autism Brain Imaging Data Exchange [26]. We selected the Yale Child Study Center for analysis. Each brain-image was processed using the Configurable Pipeline for Analysis of Connectomes (CPAC) [27]. For each subject, we computed a measure of normalized power at each voxel called fALFF [28]. To ensure the existence of nonlinear signal relating these predictors, we let yi correspond to an estimate of overall head motion in the scanner, called mean framewise displacement (FD), computed as described in Power et al. [29]. In total, there were p = 902,629 voxels.\n\n4.5 Evaluation Criteria\n\nTo compare algorithmic performance we considered r^A_m defined as r^A_m = \u03c6(MSB)/\u03c6(A), where \u03c6 is the quantity of interest (for example, CPU time in seconds or mean squared error), MSB is our approach, and A is the competitor algorithm. 
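The matched comparison just described amounts to forming, for each dataset, the ratio of the two algorithms' scores, φ(MSB)/φ(A). A minimal sketch (the per-dataset MSE arrays below are made-up illustrative numbers, not results from the paper):

```python
import numpy as np

# Hypothetical per-dataset mean-squared errors for MSB and a competitor A,
# evaluated on the same matched collection of simulated datasets.
mse_msb = np.array([0.52, 0.61, 0.48, 0.55])
mse_a   = np.array([0.80, 0.75, 0.90, 0.60])

# r^A_m = phi(MSB) / phi(A), computed per matched dataset; values < 1 favor MSB.
r = mse_msb / mse_a

# Fraction of simulations on which MSB outperforms the competitor.
frac_msb_wins = np.mean(r < 1.0)
```

Reporting the distribution of r rather than two separate performance distributions is what lets one state the fraction of simulations on which one algorithm beats the other.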
To obtain mean-squared error estimates from MSB, we select our posterior mean as a point-estimate (the comparison algorithms do not generate posterior predictions, only point estimates). For each simulation scenario, we sampled multiple datasets and computed the matched distribution of r^A_m. In other words, rather than running simulations and reporting the distribution of performance for each algorithm, we compare the algorithms per simulation. This provides a much more informative indication of algorithmic performance, in that we indicate the fraction of simulations in which one algorithm outperforms another on some metric. For each example, we sampled 20 datasets to obtain estimates of the distribution over r^A_m. All experiments were performed on a typical workstation, an Intel Core i7-2600K Quad-Core Processor with 8192 MB of RAM.\n\n5 Results\n\n5.1 Illustrative Example\n\nThe middle and right panels of Figure 1 depict the quality of partitioning and density estimation for the swissroll example described in \u00a72, with the ambient dimension p = 1000 and the predictive manifold dimension d = 1. We drew n = 10^4 samples for this illustration. At scale 3 we have 4 partitions, and at scale 4 we have 8 (note that the partition tree, in general, need not be binary). The top panels are color-coded to indicate which xi\u2019s fall into which partition. Although imperfect, it should be clear that the data are partitioned very well. The bottom panels show the resulting estimate of the posteriors at the two scales. These posteriors are piecewise constant, as they are invariant to the manifold coordinate within a given partition.\nTo obviate the need to choose a scale to use to make a prediction, we choose to adopt a Bayesian approach and integrate across scales. 
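Concretely, integrating across scales evaluates Eq. (1): the predictive density for a new x is the weighted mixture of the Gaussian dictionary densities along x's path through the tree. A schematic sketch (the path weights, means, and standard deviations below are made-up illustrative values, not fitted quantities):

```python
import math

def predictive_density(y, weights, means, sds):
    """Eq. (1) with Gaussian dictionary elements: a convex combination of
    N(mu_{j,k_j(x)}, sigma_{j,k_j(x)}) over the scales j on x's path."""
    def normal_pdf(y, mu, sd):
        return math.exp(-0.5 * ((y - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
    return sum(w * normal_pdf(y, m, s) for w, m, s in zip(weights, means, sds))

# Hypothetical path: the stick-breaking prior puts most mass on coarse scales.
weights = [0.5, 0.3, 0.2]          # pi_{j,k_j(x)}, summing to one
means   = [0.0, 0.4, 0.6]          # mu_{j,k} at each scale on the path
sds     = [1.5, 1.0, 0.8]          # sigma_{j,k} at each scale on the path

density_at_zero = predictive_density(0.0, weights, means, sds)
```

Coarse scales pool many observations (low variance, high bias); fine scales fit locally (low bias, high variance); the weights arbitrate the trade-off.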
Figure 2 shows the estimated density of two observations of model (1) with parameters (\u00b51, \u03c31) = (\u22122, 1), (\u00b52, \u03c32) = (2, 1), \u03c3x = 0.1, and c = 20 for different sample sizes. Posteriors of the conditional density fY |X were computed for various sample sizes. Figure 2 suggests that our estimate of fY |X approaches the true density as the number of observations in the training set increases. We are unable to compare our strategy for posterior estimation to previous literature because we are unaware of previous Bayesian approaches for this problem that scale up to problems of this size. Therefore, we numerically compare the performance of our point-estimates (which we define as the posterior mean of \u02c6fY |X) with the predictions of the competitor algorithms.\n\nFigure 2: Illustrative example of model (1) suggesting that our posterior estimates of the conditional density are converging as n increases even when fY |\u03b7 is highly nonlinear and fX|\u03b7 is very high-dimensional. True (red) and estimated (black) density (50th percentile: solid line, 2.5th and 97.5th percentiles: dashed lines) for two data positions along the manifold (top panels: \u03b7 \u2248 \u22120.9, bottom panels: \u03b7 \u2248 0.5) considering different training set sizes.\n\n5.2 Quantitative Comparisons for Simulated Data\n\nFigure 3 compares the numerical performance of our algorithm (MSB) with Lasso (black), CART (red), and PC regression (green) in terms of both mean-squared error (top) and CPU time (bottom) for models (2), (3), and (4) in the left, middle, and right panels, respectively. These figures show relative performance on a per-simulation basis, thus enabling a much more powerful comparison than averaging performance for each algorithm over a set of simulations. Note that these three simulations span a wide range of models, including nonlinear smooth manifolds such as the swissroll (model 2), relatively simple linear subspace manifolds (model 3), and a union of linear subspaces model (model 4; which is neither linear nor a manifold).\nIn terms of predictive accuracy, the top panels show that for all three simulations, in every dimensionality that we considered\u2014including p = 0.5 \u00d7 10^6\u2014MSB is more accurate than Lasso, CART, or PC regression. Note that this is the case even though MSB provides much more information about the posterior fY |X, yielding an entire posterior over fY |X, rather than merely a point estimate.\nIn terms of computational time, MSB is much faster than the competitors for large p and n, as shown in the bottom three panels. The supplementary materials show that computational time for MSB is relatively constant as a function of p, whereas Lasso\u2019s computational time grows considerably with p. Thus, for large enough p, MSB is significantly faster than Lasso. MSB is faster than CART and PC regression for all p and n under consideration. Thus, it is clear from these simulations that MSB has better scaling properties\u2014in terms of both predictive accuracy and computational time\u2014than the competitor methods.\n\nFigure 3: Numerical results for various simulation scenarios. Top plots depict the relative mean-squared error of MSB (our approach) versus CART (red), Lasso (black), and PC regression (green) as a function of the ambient dimension of x. Bottom plots depict the ratio of CPU time as a function of sample size. The three simulation scenarios are: swissroll (left), linear subspaces (middle), union of linear subspaces (right). 
MSB outperforms CART, Lasso, and PC regression in all three scenarios regardless of ambient dimension (r^A_mse < 1 for all p). MSB compute time is relatively constant as n or p increases, whereas Lasso\u2019s compute time increases; thus, as n or p increases, MSB CPU time becomes less than Lasso\u2019s. MSB was always significantly faster than CART and PC regression, regardless of n or p. For all panels, n = 100 when p varies, and p = 300k when n varies, where k indicates 1000, e.g., 300k = 3 \u00d7 10^5.\n\nTable 1: Neuroscience application quantitative performance comparisons. Squared error predictive accuracy per subject (using leave-one-out) was computed. We report the mean and standard deviation (s.d.) across subjects of squared error, and CPU time (in seconds). We compare multiscale stick-breaking (MSB), CART, Lasso, random forest (RF), and PC regression. MSB outperforms all the competitors in terms of predictive accuracy and scalability. Only MSB and Lasso even ran for the \u2248 10^6-dimensional application. 
Bold indicates best MSE, \u2217 indicates best CPU time.\n\nDATA         n    p        MODEL           MSE (S.D.)     TIME (S.D.)\nCREATIVITY   108  2,415    MSB             0.56 (0.85)    1.1 (0.02)\n                           CART            1.10 (1.00)    0.9 (0.01)\n                           LASSO           0.63 (0.95)    0.40 (0.10)\u2217\n                           RF              0.57 (0.90)    78.2 (0.59)\nMOVEMENT     56   \u2248 10^6   MSB             0.65 (0.88)    0.46 (0.37)\u2217\n                           LASSO           0.76 (0.90)    20.98 (2.31)\n                           PC REGRESSION   1.02 (0.98)    96.18 (9.66)\n\n5.3 Quantitative Comparisons for Neuroscience Applications\n\nTable 1 shows the mean and standard deviation of point-estimate predictions per subject (using leave-one-out) for the two neuroscience applications that we investigated: (i) predicting creativity from diffusion MRI (creativity) and (ii) predicting head motion based on functional MRI (movement). For the creativity application, p was relatively small, \u201cmerely\u201d 2,415, so we could run Lasso, CART, and random forests (RF) [30]. For the movement application, p was nearly one million. For both applications, MSB yielded improved predictive accuracy over all competitors. Although CART and Lasso were faster than MSB on the relatively low-dimensional predictor example (creativity), their computational scaling was poor, such that CART yielded a memory fault on the higher-dimensional case, and Lasso required substantially more time than MSB.\n\n6 Discussion\n\nIn this work we have introduced a general formalism to estimate conditional distributions via multiscale dictionary learning. An important property of any such strategy is the ability to scale up to ultrahigh-dimensional predictors. We considered simulations and real-data examples where the dimensionality of the predictor space approached one million. To our knowledge, no other approach to learn conditional distributions can run at this scale. 
Our approach explicitly assumes that the posterior fY |X can be well approximated by projecting x onto a lower-dimensional space, fY |X \u2248 fY |\u03b7, where \u03b7 \u2208 M \u2282 Rd, and x \u2208 Rp. Note that this assumption is much less restrictive than assuming that x is close to a low-dimensional space; rather, we only assume that the part of fX that \u201cmatters\u201d to predict y lives near a low-dimensional subspace. Because a fully Bayesian strategy remains computationally intractable at this scale, we developed an empirical Bayes approach, estimating the partition tree based on the data, but integrating over scales and posteriors.\nWe demonstrate that even though we obtain posteriors over the conditional distribution fY |X, our approach, dubbed multiscale stick-breaking (MSB), outperforms several standard machine learning algorithms in terms of both predictive accuracy and computational time, as the sample size (n) and ambient dimension (p) increase. This improvement was demonstrated when M was a swissroll, a latent subspace, a union of latent subspaces, and real data (for which the latent space may not even exist).\nIn future work, we will extend these numerical results to obtain theory on posterior convergence. Indeed, while multiscale methods benefit from a rich theoretical foundation [2], the relative advantages and disadvantages of a fully Bayesian approach, in which one can estimate posteriors over all functionals of fY |X at all scales, remain relatively unexplored.\n\nReferences\n\n[1] I. U. Rahman, I. Drori, V. C. Stodden, and D. L. Donoho. Multiscale representations for manifold-valued data. SIAM J. Multiscale Model, 4:1201\u20131232, 2005.\n\n[2] W. K. Allard, G. Chen, and M. Maggioni. Multiscale geometric methods for data sets II: geometric wavelets. Applied and Computational Harmonic Analysis, 32:435\u2013462, 2012.\n\n[3] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. 
Adaptive mixtures of local experts. Neural Computation, 3:79–87, 1991.

[4] W. X. Jiang and M. A. Tanner. Hierarchical mixtures-of-experts for exponential family regression models: Approximation and maximum likelihood estimation. Annals of Statistics, 27:987–1011, 1999.

[5] J. Q. Fan, Q. W. Yao, and H. Tong. Estimation of conditional densities and sensitivity measures in nonlinear dynamical systems. Biometrika, 83:189–206, 1996.

[6] M. P. Holmes, G. A. Gray, and C. L. Isbell. Fast kernel conditional density estimation: A dual-tree Monte Carlo approach. Computational Statistics & Data Analysis, 54:1707–1718, 2010.

[7] G. Fu, F. Y. Shih, and H. Wang. A kernel-based parametric method for conditional density estimation. Pattern Recognition, 44:284–294, 2011.

[8] D. J. Nott, S. L. Tan, M. Villani, and R. Kohn. Regression density estimation with variational methods and stochastic approximation. Journal of Computational and Graphical Statistics, 21:797–820, 2012.

[9] M. N. Tran, D. J. Nott, and R. Kohn. Simultaneous variable selection and component selection for regression density estimation with mixtures of heteroscedastic experts. Electronic Journal of Statistics, 6:1170–1199, 2012.

[10] A. Norets and J. Pelenis. Bayesian modeling of joint and conditional distributions. Journal of Econometrics, 168:332–346, 2012.

[11] J. E. Griffin and M. F. J. Steel. Order-based dependent Dirichlet processes. Journal of the American Statistical Association, 101:179–194, 2006.

[12] D. B. Dunson, N. Pillai, and J. H. Park. Bayesian density regression. Journal of the Royal Statistical Society, Series B, 69:163–183, 2007.

[13] Y. Chung and D. B. Dunson. Nonparametric Bayes conditional distribution modeling with variable selection. Journal of the American Statistical Association, 104:1646–1660, 2009.

[14] S. T. Tokdar, Y. M. Zhu, and J. K.
Ghosh. Bayesian density regression with logistic Gaussian process and subspace projection. Bayesian Analysis, 5:319–344, 2010.

[15] I. Mossavat and O. Amft. Sparse Bayesian hierarchical mixture of experts. IEEE Statistical Signal Processing Workshop (SSP), 2011.

[16] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.

[17] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392, 1999.

[18] G. Chen, M. Iwen, S. Chin, and M. Maggioni. A fast multiscale framework for data in high-dimensions: Measure estimation, anomaly detection, and compressive measurements. In Visual Communications and Image Processing (VCIP), IEEE, 2012.

[19] I. Daubechies. Ten Lectures on Wavelets (CBMS-NSF Regional Conference Series in Applied Mathematics). SIAM, 1992.

[20] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.

[21] D. Chauveau and J. Diebolt. An automated stopping rule for MCMC convergence assessment. Computational Statistics, 14:419–442, 1998.

[22] R. Arden, R. S. Chavez, R. Grazioplene, and R. E. Jung. Neuroimaging creativity: A psychometric view. Behavioural Brain Research, 214:143–156, 2010.

[23] R. E. Jung, R. Grazioplene, A. Caprihan, R. S. Chavez, and R. J. Haier. White matter integrity, creativity, and psychopathology: Disentangling constructs with diffusion tensor imaging. PLoS ONE, 5(3):e9818, 2010.

[24] W. R. Gray, J. A. Bogovic, J. T. Vogelstein, B. A. Landman, J. L. Prince, and R. J. Vogelstein. Magnetic resonance connectome automated pipeline: An overview. IEEE Pulse, 3(2):42–48, March 2010.

[25] S. Mori and J. Zhang.
Principles of diffusion tensor imaging and its applications to basic neuroscience research. Neuron, 51(5):527–539, September 2006.

[26] ABIDE. http://fcon_1000.projects.nitrc.org/indi/abide/.

[27] S. Sikka, J. T. Vogelstein, and M. P. Milham. Towards automated analysis of connectomes: The Configurable Pipeline for the Analysis of Connectomes (C-PAC). Neuroinformatics, 2012.

[28] Q.-H. Zou, C.-Z. Zhu, Y. Yang, X.-N. Zuo, X.-Y. Long, Q.-J. Cao, Y.-F. Wang, and Y.-F. Zang. An improved approach to detection of amplitude of low-frequency fluctuation (ALFF) for resting-state fMRI: Fractional ALFF. Journal of Neuroscience Methods, 172(1):137–141, July 2008.

[29] J. D. Power, K. A. Barnes, A. Z. Snyder, B. L. Schlaggar, and S. E. Petersen. Spurious but systematic correlations in functional connectivity MRI networks arise from subject motion. NeuroImage, 59:2142–2154, 2012.

[30] L. Breiman. Statistical modeling: The two cultures. Statistical Science, 16(3):199–231, 2001.