{"title": "DECOrrelated feature space partitioning for distributed sparse regression", "book": "Advances in Neural Information Processing Systems", "page_first": 802, "page_last": 810, "abstract": "Fitting statistical models is computationally challenging when the sample size or the dimension of the dataset is huge. An attractive approach for down-scaling the problem size is to first partition the dataset into subsets and then fit using distributed algorithms. The dataset can be partitioned either horizontally (in the sample space) or vertically (in the feature space). While the majority of the literature focuses on sample space partitioning, feature space partitioning is more effective when p >> n. Existing methods for partitioning features, however, are either vulnerable to high correlations or inefficient in reducing the model dimension. In this paper, we solve these problems through a new embarrassingly parallel framework named DECO for distributed variable selection and parameter estimation. In DECO, variables are first partitioned and allocated to m distributed workers. The decorrelated subset data within each worker are then fitted via any algorithm designed for high-dimensional problems. We show that by incorporating the decorrelation step, DECO can achieve consistent variable selection and parameter estimation on each subset with (almost) no assumptions. In addition, the convergence rate is nearly minimax optimal for both sparse and weakly sparse models and does NOT depend on the partition number m. Extensive numerical experiments are provided to illustrate the performance of the new framework.", "full_text": "DECOrrelated feature space partitioning for distributed sparse regression\n\nXiangyu Wang, Dept. of Statistical Science, Duke University, wwrechard@gmail.com\n\nDavid Dunson, Dept. of Statistical Science, Duke University, dunson@stat.duke.edu\n\nChenlei Leng, C.Leng@warwick.ac.uk, Dept. 
of Statistics, University of Warwick\n\nAbstract\n\nFitting statistical models is computationally challenging when the sample size or the dimension of the dataset is huge. An attractive approach for down-scaling the problem size is to first partition the dataset into subsets and then fit using distributed algorithms. The dataset can be partitioned either horizontally (in the sample space) or vertically (in the feature space). While the majority of the literature focuses on sample space partitioning, feature space partitioning is more effective when p ≫ n. Existing methods for partitioning features, however, are either vulnerable to high correlations or inefficient in reducing the model dimension. In this paper, we solve these problems through a new embarrassingly parallel framework named DECO for distributed variable selection and parameter estimation. In DECO, variables are first partitioned and allocated to m distributed workers. The decorrelated subset data within each worker are then fitted via any algorithm designed for high-dimensional problems. We show that by incorporating the decorrelation step, DECO can achieve consistent variable selection and parameter estimation on each subset with (almost) no assumptions. In addition, the convergence rate is nearly minimax optimal for both sparse and weakly sparse models and does NOT depend on the partition number m. Extensive numerical experiments are provided to illustrate the performance of the new framework.\n\n1 Introduction\n\nIn modern science and technology applications, it has become routine to collect complex datasets with a huge number p of variables and/or enormous sample size n. 
Most of the emphasis in the literature has been on addressing large n problems, with a common strategy relying on partitioning data samples into subsets and fitting a model containing all the variables to each subset [1, 2, 3, 4, 5, 6]. In scientific applications, it is much more common to have huge p, small n data sets. In such cases, a sensible strategy is to break the features into groups, fit a model separately to each group, and combine the results. We refer to this strategy as feature space partitioning, and to the large n strategy as sample space partitioning.\n\nThere are several recent attempts at parallel variable selection by partitioning the feature space. [7] proposed a Bayesian split-and-merge (SAM) approach in which variables are first partitioned into subsets and then screened over each subset. A variable selection procedure is then performed on the variables that survive for selecting the final model. One caveat of this approach is that the algorithm cannot guarantee the efficiency of screening, i.e., the screening step taken on each subset might select a large number of unimportant but correlated variables [7], so SAM could be ineffective in reducing the model dimension. Inspired by group testing, [8] proposed a parallel feature selection algorithm that repeatedly fits partial models on a set of re-sampled features, and then aggregates the residuals to form scores for each feature. This approach is generic and efficient, but its performance relies on a strong condition that is almost equivalent to an independence assumption on the design.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nIntuitively, feature space partitioning is much more challenging than sample space partitioning, mainly because of the correlations between features. 
A partition of the feature space would succeed only when the features across the partitioned subsets were mutually independent. Otherwise, it is highly likely that any model posed on the subsets is mis-specified and the results are biased regardless of the sample size. In reality, however, mutually independent groups of features may not exist; even if they do, finding these groups is likely more challenging than fitting a high-dimensional model. Therefore, although conceptually attractive, feature space partitioning is extremely challenging.\n\nOn the other hand, feature space partitioning is straightforward if the features are independent. Motivated by this key fact, we propose a novel embarrassingly parallel framework named DECO that decorrelates the features before partitioning. With the aid of decorrelation, each subset of data after feature partitioning can now produce consistent estimates even though the model on each subset is intrinsically mis-specified due to missing features. To the best of our knowledge, DECO is the first embarrassingly parallel framework accommodating arbitrary correlation structures in the features. We show, quite surprisingly, that the DECO estimate, by leveraging the estimates from subsets, achieves the same convergence rate in ℓ2 norm and ℓ∞ norm as the estimate obtained by using the full dataset, and that the rate does not depend on the number of partitions. In view of the huge computational gain and the easy implementation, DECO is extremely attractive for fitting large-p data.\n\nThe most closely related work to DECO is [9], where a similar procedure was introduced to improve lasso. Our work differs substantially in various aspects. 
First, our motivation is to develop a parallel computing framework for fitting large-p data by splitting features, which can potentially accommodate any penalized regression method, while [9] aims solely at complying with the irrepresentable condition for lasso. Second, the conditions posed on the feature matrix are more flexible in DECO, and our theory, applicable not only to sparse signals but also to those in ℓr balls, can be readily applied to the preconditioned lasso in [9].\n\nThe rest of the paper is organized as follows. In Section 2, we detail the proposed framework. Section 3 provides the theory of DECO. In particular, we show that DECO is consistent for both sparse and weakly sparse models. Section 4 presents extensive simulation studies to illustrate the performance of our framework, and Section 5 applies DECO to three real datasets. In Section 6, we outline future challenges and future work. All the technical details are relegated to the Appendix.\n\n2 Motivation and the DECO framework\n\nConsider the linear regression model\n\nY = Xβ + ε, (1)\n\nwhere X is an n × p feature (design) matrix, ε consists of n i.i.d. random errors and Y is the response vector. A large class of approaches estimate β by solving the following optimization problem\n\nβ̂ = argmin_β (1/n)‖Y − Xβ‖2² + 2λn ρ(β),\n\nwhere ‖·‖2 is the ℓ2 norm and ρ(β) is a penalty function. In this paper, we specialize our discussion to the ℓ1 penalty, where ρ(β) = Σ_{j=1}^p |βj| [10], to highlight the main message of the paper.\n\nAs discussed in the introduction, a naive partition of the feature space will usually give unsatisfactory results under a parallel computing framework. 
That is why a decorrelation step is introduced. For data with p ≤ n, the most intuitive way is to orthogonalize the features via the singular value decomposition (SVD) of the design matrix, X = U D V^T, where U is an n × p matrix, D a p × p diagonal matrix and V a p × p orthogonal matrix. If we pre-multiply both sides of (1) by √p U D^{−1} U^T = (XX^T/p)^{+/2}, where A^+ denotes the Moore-Penrose pseudo-inverse of A, we get\n\n(XX^T/p)^{+/2} Y = √p U V^T β + (XX^T/p)^{+/2} ε, (2)\n\nthat is, Ỹ = X̃β + ε̃ with the new data defined as Ỹ = (XX^T/p)^{+/2} Y and X̃ = √p U V^T, and the new noise as ε̃ = (XX^T/p)^{+/2} ε. It is obvious that the new features (the columns of √p U V^T) are mutually orthogonal. The mutually orthogonal property allows us to decompose X̃ column-wise into m subsets X̃(i), i = 1, 2, ..., m, and still retain consistency if one fits a linear regression on each subset. To see this, notice that each sub-model now takes the form Ỹ = X̃(i)β(i) + W̃(i), where W̃(i) = X̃(−i)β(−i) + ε̃ and X̃(−i) stands for the variables not included in the ith subset. If, for example, we would like to compute the ordinary least squares estimates, it follows that\n\nβ̂(i) = (X̃(i)^T X̃(i))^{−1} X̃(i)^T Ỹ = β(i) + (X̃(i)^T X̃(i))^{−1} X̃(i)^T W̃(i) = β(i) + (X̃(i)^T X̃(i))^{−1} X̃(i)^T ε̃,\n\nwhere the last equality holds because X̃(i)^T X̃(−i) = 0, so β̂(i) converges at the same rate as if the full dataset were used.\n\nWhen p is larger than n, the new features are no longer exactly orthogonal to each other due to the high dimension. Nevertheless, as proved later in the article, the correlations between different columns are roughly of the order √(log p/n) for random designs, making the new features approximately orthogonal when log(p) ≪ n. This allows us to follow the same strategy of partitioning the feature space as in the low-dimensional case. It is worth noting that when p > n, the SVD of X induces a different form on the three matrices, i.e., U is now an n × n orthogonal matrix, D an n × n diagonal matrix and V an n × p matrix, and (XX^T/p)^{+/2} becomes (XX^T/p)^{−1/2}.\n\nIn this paper, we primarily focus on datasets where p is so large that a single computer is only able to store and perform operations on an n × q matrix (n < q < p) but not on an n × p matrix. Because the two decorrelation matrices yield almost the same properties, we will only present the algorithm and the theoretical analysis for (XX^T/p)^{−1/2}.\n\nThe concrete DECO framework consists of two main steps. Assume X has been partitioned column-wise into m subsets X(i), i = 1, 2, ..., m (each with a maximum of q columns) and distributed onto m machines together with Y. In the first stage, we obtain the decorrelation matrix (XX^T/p)^{−1/2} or √p U D^{−1} U^T by computing XX^T in a distributed way as XX^T = Σ_{i=1}^m X(i) X(i)^T and performing the SVD of XX^T on a central machine. In the second stage, each worker receives the decorrelation matrix, multiplies it with the local data (Y, X(i)) to obtain (Ỹ, X̃(i)), and fits a penalized regression. When the model is assumed to be exactly sparse, we can potentially apply a refinement step by re-estimating the coefficients of all the selected variables simultaneously on the master machine via ridge regression. 
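The two-stage pipeline just described (distributed Gram matrix, inverse square root, per-worker decorrelated data) can be sketched in a few lines of NumPy. This is an illustrative sketch at toy sizes, not the authors' implementation (the paper's experiments are coded in Matlab); all variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 100, 300, 3                       # toy sizes; DECO targets much larger p
X = rng.standard_normal((n, p))
Y = 3.0 * X[:, 0] + rng.standard_normal(n)

# Columns are split into m blocks, as if stored on m workers.
blocks = np.array_split(np.arange(p), m)

# Stage 1: every worker contributes X(i) X(i)^T; the center machine sums them,
# so the full n x n Gram matrix is assembled without ever forming X centrally.
F = sum(X[:, b] @ X[:, b].T for b in blocks)
assert np.allclose(F, X @ X.T)              # distributed Gram equals the full Gram

# Decorrelation matrix sqrt(p) * (F + r1*I)^(-1/2) via an eigendecomposition.
r1 = 0.0                                    # the paper adds a small r1 > 0 for robustness
w, V = np.linalg.eigh(F + r1 * np.eye(n))
F_bar = np.sqrt(p) * (V / np.sqrt(w)) @ V.T

# Each worker decorrelates its own block with F_bar before Stage 2.
Y_tilde = F_bar @ Y
X_tilde = np.hstack([F_bar @ X[:, b] for b in blocks])

# Sanity check: with r1 = 0, X_tilde X_tilde^T = p * I, which is what makes the
# p columns of X_tilde approximately orthogonal when log(p) << n.
assert np.allclose(X_tilde @ X_tilde.T, p * np.eye(n), atol=1e-6)
```

The off-diagonal entries of X̃^T X̃/p are not exactly zero when p > n, but for random designs they concentrate at the √(log p/n) scale claimed above.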
The details are provided in Algorithm 1. The entire algorithm contains only two map-reduce passes and is thus communication-efficient. Lines 14 - 18 in Algorithm 1 are added only for the data analysis in Section 5.3, in which p is massive compared to n in that log(p) is comparable to n, and the algorithm may not scale down the size of p sufficiently for even obtaining a ridge regression estimator afterwards. Thus, a further sparsification step is recommended. The condition in Line 16 is only triggered in our last experiment, but is crucial for improving the performance in extreme cases. In Line 5, the algorithm inverts XX^T + r1 I instead of XX^T for robustness, because the rank of XX^T after standardization will be n − 1. Using ridge refinement instead of ordinary least squares is also for robustness. The precise choice of r1 and r2 will be discussed in the numerical section.\n\nPenalized regression fitted using a regularization path usually involves a computational complexity of O(knp + kd²), where k is the number of path segments and d is the number of features selected. Although the segmentation number k could be as bad as (3^p + 1)/2 in the worst case [11], real data experience suggests that k is on average O(n) [12]; thus the complexity of DECO takes the form O(n³ + n²p/m + m), in contrast to the full lasso, which takes the form O(n²p).\n\n3 Theory\n\nIn this section, we provide theoretical justification for DECO on random feature matrices. We specialize our attention to lasso due to page limits and will provide the theory on general penalties in the long version. We prove the consistency results for the estimator obtained after Stage 2 of DECO, while the consistency of Stage 3 will then follow immediately. 
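On each worker, Stage 2 is just a lasso fit on the decorrelated block; the paper uses glmnet, but any solver works. Below is a minimal coordinate-descent lasso of our own (a sketch, not the paper's code; the objective (1/(2n))‖y − Xβ‖² + λ‖β‖1 has the same minimizer as the paper's (1/n)‖y − Xβ‖² + 2λρ(β) with the ℓ1 penalty). On an exactly orthonormal design, which is the idealized decorrelated case, one soft-thresholding pass already gives the solution:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n       # x_j^T x_j / n
    resid = y.copy()                        # residual for beta = 0
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]      # put coordinate j back into the residual
            rho = X[:, j] @ resid / n       # marginal fit for coordinate j
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * beta[j]
    return beta

rng = np.random.default_rng(1)
n, p = 60, 20
# Idealized decorrelated block: columns orthonormal after scaling, X^T X / n = I.
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
X = Q * np.sqrt(n)
beta_true = np.zeros(p)
beta_true[:3] = [4.0, -3.0, 2.0]
y = X @ beta_true + 0.1 * rng.standard_normal(n)

lam = 0.5
beta_hat = lasso_cd(X, y, lam)
# Under an orthonormal design the lasso has the closed form
# beta_j = soft_threshold(x_j^T y / n, lam); coordinate descent matches it.
assert np.allclose(beta_hat, soft_threshold(X.T @ y / n, lam))
```

In the full framework, each worker would run such a fit on (ỹ, x̃(i)) and the center machine would simply concatenate the m coefficient sub-vectors, as in Line 11 of Algorithm 1.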
For simplicity, we assume that ε follows a sub-Gaussian distribution and X ∼ N(0, Σ) throughout this section, although the theory can be easily extended to the situation where X follows an elliptical distribution and ε is heavy-tailed. Recall that DECO fits the following linear regression on each worker\n\nỸ = X̃(i)β(i) + W̃(i), and W̃(i) = X̃(−i)β(−i) + ε̃,\n\nAlgorithm 1 The DECO framework\nInitialization:\n1: Input (Y, X), p, n, m, λn. Standardize X and Y to x and y with mean zero;\n2: Partition (arbitrarily) (y, x) into m disjoint subsets (y, x(i)) and distribute to m machines;\nStage 1: Decorrelation\n3: On each worker, compute x(i) x(i)^T and push to the center machine;\n4: On the center machine, compute F = Σ_{i=1}^m x(i) x(i)^T;\n5: F̄ = √p (F + r1 I_p)^{−1/2};\n6: Push F̄ to each worker.\n7: for i = 1 to m do\n8: ỹ = F̄ y and x̃(i) = F̄ x(i); # obtain decorrelated data\n9: end for\nStage 2: Estimation\n10: On each worker we estimate β̂(i) = argmin_β (1/n)‖ỹ − x̃(i)β‖2² + 2λn ρ(β);\n11: Push β̂(i) to the center machine and combine β̂ = (β̂(1), β̂(2), ..., β̂(m));\n12: β̂0 = mean(Y) − mean(X)^T β̂ for the intercept.\nStage 3: Refinement (optional)\n13: if #{β̂ ≠ 0} ≥ n then # Sparsification is needed before ridge regression.\n14: M = {k : |β̂k| ≠ 0};\n15: β̂M = argmin_β (1/n)‖ỹ − x̃M β‖2² + 2λn ρ(β);\n16: update β̂ with β̂M;\n17: end if\n18: M = {k : |β̂k| ≠ 0};\n19: β̂M = (X_M^T X_M + r2 I_|M|)^{−1} X_M^T Y;\n20: Return β̂;\n\nwhere X(−i) stands for variables not included in the 
ith subset. Our proof relies on verifying that each pair (Ỹ, X̃(i)), i = 1, 2, ..., m satisfies the consistency condition of lasso for random features. Due to the page limit, we only state the main theorems in the article and defer all the proofs to the supplementary materials.\n\nTheorem 1 (s-sparse). Assume that β* is an s-sparse vector. Define σ0² = var(Y). For any A > 0 we choose λn = A σ0 √(log p/n). Now if p > c0 n for some c0 > 1 and 64 C0² A² s² log p/n ≤ 1, then with probability at least 1 − 8p^{1−C1A²} − 18p e^{−Cn} we have\n\n‖β̂ − β*‖∞ ≤ (5C0Aσ0/8) √(log p/n) and ‖β̂ − β*‖2² ≤ (9C0A²σ0²/8) s log p/n,\n\nwhere C0 = 8c*/(c1 c∗) and C1 = min{c∗ c0²/(8 c* c2 (1 − c0)²), c∗³/(32 c2 c*²)} are two constants and c1, c2, c4, c*, c∗, C are defined in Lemma 6 in the supplementary materials. Furthermore, if we have\n\nmin_{βk ≠ 0} |βk| ≥ 4 C0 A σ0 √(log p/n),\n\nthen β̂ is sign consistent, i.e., sign(β̂k) = sign(βk) for all βk ≠ 0 and β̂k = 0 for all βk = 0.\n\nTheorem 1 looks a bit surprising since the convergence rate does not depend on m. This is mainly because the bounds used to verify the consistency conditions for lasso hold uniformly over all subsets of variables. For subsets to which no true signals are allocated, lasso will estimate all coefficients to be zero, so the loss on these subsets will be exactly zero. Thus, when summing over all subsets, we retrieve the s log p/n rate. In addition, it is worth noting that Theorem 1 guarantees the ℓ∞ convergence and sign consistency for lasso without assuming the irrepresentable condition [13]. 
A similar but weaker result was obtained in [9].\n\nTheorem 2 (ℓr-ball). Assume that β* ∈ B(r, R) and all conditions in Theorem 1, except that 64 C0² A² s² log p/n ≤ 1 is now replaced by 64 C0² A² R² (log p/n)^{1−r} ≤ 1. Then with probability at least 1 − 8p^{1−C1A²} − 18p e^{−Cn}, we have\n\n‖β̂ − β*‖∞ ≤ (3C0Aσ0/2) √(log p/n) and ‖β̂ − β*‖2² ≤ (9C0/8 + 38) (Aσ0)^{2−r} R (log p/n)^{1−r/2}.\n\nNote that σ0² = var(Y) instead of σ² appears in the convergence rates in both Theorem 1 and 2, which is inevitable due to the nonzero signals contained in W̃. Compared to the estimation risk using the full data, the results in Theorem 1 and 2 are similar up to a factor of σ²/σ0² = 1 − R̂², where R̂² is the coefficient of determination. Thus, for a model with an R̂² = 0.8, the risk of DECO is upper bounded by five times the risk of the full data inference. The rates in Theorem 1 and 2 are nearly minimax-optimal [14, 15], but the sample requirement n ≍ s² is slightly off the optimal. This requirement is rooted in the ℓ∞-convergence and sign consistency and is almost unimprovable for random designs. We will detail this argument in the long version of the paper.\n\n4 Experiments\n\nIn this section, we present the empirical performance of DECO via extensive numerical experiments. In particular, we compare DECO after 2-stage fitting (DECO-2) and DECO after 3-stage fitting (DECO-3) with the full data lasso (lasso-full), the full data lasso with ridge refinement (lasso-refine) and lasso with a naive feature partition without decorrelation (lasso-naive). This section consists of three parts. 
In the first part, we run DECO-2 on some simulated data and monitor its performance on one randomly chosen subset that contains part of the true signals. In the second part, we verify our claim in Theorems 1 and 2 that the accuracy of DECO does not depend on the subset number. In the last part, we provide a comprehensive evaluation of DECO's performance by comparing DECO with other methods under various correlation structures.\n\nThe synthetic datasets are from model (1) with X ∼ N(0, Σ) and ε ∼ N(0, σ²). The variance σ² is chosen such that R̂² = var(Xβ)/var(Y) = 0.9. We consider five different structures of Σ.\n\nModel (i) Independent predictors. The support of β is S = {1, 2, 3, 4, 5}. We generate Xi from a standard multivariate normal distribution with independent components. The coefficients are specified as\n\nβi = (−1)^{Ber(0.5)} (|N(0, 1)| + 5 √(log p/n)) for i ∈ S, and βi = 0 for i ∉ S.\n\nModel (ii) Compound symmetry. All predictors are equally correlated with correlation ρ = 0.6. The coefficients are the same as those in Model (i).\n\nModel (iii) Group structure. This example is Example 4 in [16], for which we allocate the 15 true variables into three groups. Specifically, the predictors are generated as x_{1+3m} = z1 + N(0, 0.01), x_{2+3m} = z2 + N(0, 0.01) and x_{3+3m} = z3 + N(0, 0.01), where m = 0, 1, 2, 3, 4 and zi ∼ N(0, 1) are independent. The coefficients are set as βi = 3 for i = 1, 2, ..., 15, and βi = 0 for i = 16, ..., p.\n\nModel (iv) Factor models. This model is considered in [17]. Let φj, j = 1, 2, ..., k be independent standard normal variables. We set the predictors as xi = Σ_{j=1}^k φj fij + ηi, where fij and ηi are independent standard normal random variables. The number of factors is chosen as k = 5 in the simulation, while the coefficients are specified the same as in Model (i).\n\nModel (v) ℓ1-ball. This model takes the same correlation structure as Model (ii), with the coefficients drawn from a Dirichlet distribution, β ∼ Dir(1/p, 1/p, ..., 1/p) × 10. This model is to test the performance under a weakly sparse assumption on β, since β is non-sparse, satisfying ‖β‖1 = 10.\n\nThroughout this section, the performance of all the methods is evaluated in terms of four metrics: the number of false positives (# FPs), the number of false negatives (# FNs), the mean squared error ‖β̂ − β*‖2² (MSE) and the computational time (runtime). We use glmnet [18] to fit lasso and choose the tuning parameter using the extended BIC criterion [19] with γ fixed at 0.5. For DECO, the features are partitioned randomly in Stage 1 and the tuning parameter r1 is fixed at 1 for DECO-3. Since DECO-2 does not involve any refinement step, we choose r1 to be 10 to aid robustness. The ridge parameter r2 is chosen by 5-fold cross-validation for both DECO-3 and lasso-refine. All the algorithms are coded and timed in Matlab on computers with Intel i7-3770k cores. For any embarrassingly parallel algorithm, we report the preprocessing time plus the longest runtime of a single machine as its runtime.\n\nFigure 1: Performance of DECO on one subset with p = 10,000 and different n's.\n\nFigure 2: Performance of DECO on one subset with n = 500 and different p's.\n\n4.1 Monitor DECO on one subset\n\nIn this part, using data generated from Model (ii), we illustrate the performance of DECO on one randomly chosen subset after partitioning. 
The particular subset we examine contains two nonzero coefficients, β1 and β2, with 98 coefficients, randomly chosen, being zero. We either fix p = 10,000 and change n from 100 to 500, or fix n at 500 and change p from 2,000 to 10,000 to simulate datasets. We fit DECO-2, lasso-full and lasso-naive to 100 simulated datasets, and monitor their performance on that particular subset. The results are shown in Figs 1 and 2.\n\nIt can be seen that, though the sub-model on each subset is mis-specified, DECO performs as if the full dataset were used, as its performance is on par with lasso-full. On the other hand, lasso-naive fails completely. This result clearly highlights the advantage of decorrelation before feature partitioning.\n\n4.2 Impact of the subset number m\n\nAs shown in Theorems 1 and 2, the performance of DECO does not depend on the number of partitions m. We verify this property by using Model (ii) again. This time, we fix p = 10,000 and n = 500, and vary m from 1 to 200. We compare the performance of DECO-2 and DECO-3 with lasso-full and lasso-refine. The averaged results from 100 simulated datasets are plotted in Fig 3. Since p and n are both fixed, lasso-full and lasso-refine are expected to perform stably over different m's. DECO-2 and DECO-3 also maintain a stable performance regardless of the value of m. In addition, DECO-3 achieves a similar performance to, and sometimes better accuracy than, lasso-refine, possibly because the irrepresentable condition is satisfied after decorrelation (see the discussion after Theorem 1).\n\n4.3 Comprehensive comparison\n\nIn this section, we compare all the methods under the five different correlation structures. The model dimension and the sample size are fixed at p = 10,000 and n = 500 respectively, and the number of subsets is fixed as m = 100. 
For each model, we simulate 100 synthetic datasets and record the average performance in Table 1.\n\n[Figures 1 and 2 show the number of false positives, the number of false negatives, the estimation error and the runtime for DECO-2, lasso-full and lasso-naive, plotted against the sample size n and the model dimension p respectively.]\n\nFigure 3: Performance of DECO with different number of subsets.\n\nTable 1: Results for five models with (n, p) = (500, 10000)\n\nModel | Metric | DECO-3 | DECO-2 | lasso-refine | lasso-full | lasso-naive\n(i) | MSE | 0.102 | 3.502 | 0.104 | 0.924 | 3.667\n(i) | # FPs | 0.470 | 0.570 | 0.420 | 0.420 | 0.650\n(i) | # FNs | 0.010 | 0.020 | 0.000 | 0.000 | 0.010\n(i) | Time | 65.5 | 60.3 | 804.5 | 802.5 | 9.0\n(ii) | MSE | 0.241 | 4.636 | 1.873 | 3.808 | 171.05\n(ii) | # FPs | 0.460 | 0.550 | 2.39 | 2.39 | 1507.2\n(ii) | # FNs | 0.010 | 0.030 | 0.160 | 0.160 | 1.290\n(ii) | Time | 66.9 | 61.8 | 809.2 | 806.3 | 13.1\n(iii) | MSE | 6.620 | 1220.5 | 57.74 | 105.99 | 1235.2\n(iii) | # FPs | 0.410 | 0.570 | 0.110 | 0.110 | 1.180\n(iii) | # FNs | 0.130 | 0.120 | 3.93 | 3.93 | 0.110\n(iii) | Time | 65.5 | 60.0 | 835.3 | 839.9 | 9.1\n(iv) | MSE | 0.787 | 5.648 | 11.15 | 6.610 | 569.56\n(iv) | # FPs | 0.460 | 0.410 | 19.90 | 19.90 | 1129.9\n(iv) | # FNs | 0.090 | 0.100 | 0.530 | 0.530 | 1.040\n(iv) | Time | 69.4 | 64.1 | 875.1 | 880.0 | 14.6\n(v) | MSE | — | 2.341 | — | 1.661 | 356.57\n(v) | Time | — | 57.5 | — | 829.5 | 13.3\n\n
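As a side note, the selection and estimation metrics reported in this section (# FPs, # FNs and MSE = ‖β̂ − β*‖2²) are straightforward to compute from an estimate and the ground truth. The helper below is our own sketch, not part of the paper's Matlab code:

```python
import numpy as np

def selection_metrics(beta_hat, beta_true, tol=0.0):
    """Return (# false positives, # false negatives, squared l2 error)."""
    est = np.abs(beta_hat) > tol            # selected variables
    tru = np.abs(beta_true) > 0             # true support
    fps = int(np.sum(est & ~tru))           # selected but truly zero
    fns = int(np.sum(tru & ~est))           # true signals that were missed
    mse = float(np.sum((beta_hat - beta_true) ** 2))   # ||b_hat - b*||_2^2
    return fps, fns, mse

beta_true = np.array([3.0, -2.0, 0.0, 0.0, 1.5])
beta_hat = np.array([2.8, 0.0, 0.4, 0.0, 1.2])
print(selection_metrics(beta_hat, beta_true))  # (1, 1, 4.29) up to float rounding
```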
First, when all variables are independent as in\nModel (i), lasso-naive performs similarly to DECO-2 because no decorrelation is needed in this\nsimple case. However, lasso-naive fails completely for the other four models when correlations are\npresented. Second, DECO-3 achieves the overall best performance. The better estimation error over\nlasso-re\ufb01ne is due to the better variable selection performance, since the irrepresentable condition\nis not needed for DECO. Finally, DECO-2 performs similarly to lasso-full and the difference is as\nexpected according to the discussions after Theorem 2.\n\n5 Real data\n\nWe illustrate the competitve performance of DECO via three real datasets that cover a range of\nhigh dimensionalities, by comparing DECO-3 to lasso-full, lasso-re\ufb01ne and lasso-naive in terms of\nprediction error and computational time. The algorithms are con\ufb01gured in the same way as in Section\n4. Although DECO allows arbitrary partitioning over the feature space, for simplicity, we con\ufb01ne\nour attention to random partitioning. In addition, we perform DECO-3 multiple times on the same\ndataset to ameliorate the uncertainty due to the randomness in partitioning.\nStudent performance dataset. We look at one of the two datasets used for evaluating student\nachievement in two Portuguese schools [20]. The particular dataset used here provides the students\u2019\nperformance in mathematics. The goal of the research is to predict the \ufb01nal grade (range from 0 to 20).\nThe original data set contains 395 students and 32 raw attributes. The raw attributes are recoded as\n40 attributes and form 767 features after adding interaction terms. To reduce the conditional number\nof the feature matrix, we remove features that are constant, giving 741 features. 
We standardize all features and randomly partition them into 5 subsets for DECO. To compare the performance of all methods, we use 10-fold cross-validation and record the prediction error (mean squared error, MSE), model size and runtime. The averaged results are summarized in Table 2. We also report the performance of the null model, which predicts the final grade on the test set using the mean final grade in the training set.\n\nMammalian eye diseases. This dataset, taken from [21], was collected to study mammalian eye diseases, with gene expression for the eye tissues of 120 twelve-week-old male F2 rats recorded. One gene, coded as TRIM32 and responsible for causing Bardet-Biedl syndrome, is the response of interest. Following the method in [21], 18,976 probes were selected as they exhibited sufficient signal for reliable analysis and at least 2-fold variation in expression, and we confine our attention to the top 5,000 genes with the highest sample variance. The 5,000 genes are standardized and partitioned into 100 subsets for DECO. The performance is assessed via 10-fold cross-validation following the same approach as in Section 5.1. The results are summarized in Table 2. As a reference, we also report these values for the null model.\n\nElectricity load diagram. This dataset [22] consists of the electricity load from 2011 - 2014 for 370 clients. The data are originally recorded in kW for every 15 minutes, resulting in 140,256 attributes. Our goal is to predict the most recent electricity load by using all previous data points. 
The variance of the 140,256 features ranges from 0 to 10^7. To reduce the condition number of the feature matrix, we remove features whose variances are below the lower 10% quantile (a value of 10^5) and retain 126,231 features. We then expand the feature set by including the interactions between the first 1,500 attributes that have the largest correlations with the clients' most recent load. The resulting 1,251,980 features are then partitioned into 1,000 subsets for DECO. Because cross-validation is computationally demanding for such a large dataset, we put the first 200 clients in the training set and the remaining 170 clients in the test set. We also scale the values of the electricity load to between 0 and 300, so that patterns are more visible. The results are summarized in Table 2.

Table 2: The results of all methods on the three datasets.

                Student Performance       Mammalian eye disease      Electricity load diagram
              MSE    size  runtime       MSE     size  runtime       MSE         size   runtime
DECO-3        3.64    1.5      9.6       0.012    4.3     37.0       0.691          4      67.9
lasso-full    3.79    2.2    139.0       0.012   11       60.8       2.205          6  23,515.5
lasso-refine  3.89    2.2    139.7       0.010   11       70.9       1.790          6  22,260.9
lasso-naive   6.4    16.5      7.9       37.65    6.8     44.6       3.6 × 10^8  4966      52.9
Null         20.7      —        —        0.021     —        —        520.6          —        —

6 Concluding remarks

In this paper, we have proposed an embarrassingly parallel framework named DECO for distributed estimation. DECO is theoretically attractive, empirically competitive and straightforward to implement. In particular, we have shown that DECO achieves the same minimax convergence rate as if the full data were used, and that the rate does not depend on the number of partitions. We demonstrated the empirical performance of DECO via extensive experiments and compared it to various approaches that fit the full data.
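The random feature-space partitioning at the heart of DECO can be sketched as follows. Only the partitioning step is shown; the decorrelation and per-subset fitting of Algorithm 1 are omitted, and the function name is ours:

```python
import numpy as np

def partition_features(p, m, seed=None):
    """Randomly split the feature indices {0, ..., p-1} into m roughly
    equal-sized subsets, one per distributed worker. (Sketch of DECO's
    partitioning step only; decorrelation and fitting are not shown.)"""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(p), m)

# E.g. the 741 student-performance features split across 5 workers:
blocks = partition_features(741, 5, seed=0)
```

Each worker then receives the columns of X indexed by its block; the subsets are disjoint and jointly cover all p features.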
As illustrated in the experiments, DECO can not only reduce the computational cost substantially, but also often outperform the full-data approaches in terms of model selection and parameter estimation.

Although DECO is designed to solve large-p-small-n problems, it can be extended to deal with large-p-large-n problems by adding a sample space partitioning step, for example, using the message approach [5]. More precisely, we first partition the large-p-large-n dataset in the sample space to obtain l row blocks, each of which is a large-p-small-n dataset. We then partition the feature space of each row block into m subsets. This procedure is equivalent to partitioning the original data matrix X into l × m small blocks, each of a feasible size that can be stored and fitted on a single machine. We then apply the DECO framework to the subsets in the same row block using Algorithm 1. The last step is to apply the message method to aggregate the l row-block estimators into the final estimate. This extremely scalable approach will be explored in future work.

References

[1] Ryan McDonald, Mehryar Mohri, Nathan Silberman, Dan Walker, and Gideon S Mann. Efficient large-scale distributed training of conditional maximum entropy models. In Advances in Neural Information Processing Systems, pages 1231–1239, 2009.

[2] Yuchen Zhang, Martin J Wainwright, and John C Duchi. Communication-efficient algorithms for statistical optimization. In Advances in Neural Information Processing Systems, pages 1502–1510, 2012.

[3] Steven L Scott, Alexander W Blocker, Fernando V Bonassi, H Chipman, E George, and R McCulloch. Bayes and big data: the consensus Monte Carlo algorithm. In EFaBBayes 250 conference, volume 16, 2013.

[4] Xiangyu Wang, Fangjian Guo, Katherine A Heller, and David B Dunson. Parallelizing MCMC with random partition trees.
In Advances in Neural Information Processing Systems, pages 451–459, 2015.

[5] Xiangyu Wang, Peichao Peng, and David B Dunson. Median selection subset aggregation for parallel inference. In Advances in Neural Information Processing Systems, pages 2195–2203, 2014.

[6] Stanislav Minsker et al. Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308–2335, 2015.

[7] Qifan Song and Faming Liang. A split-and-merge Bayesian variable selection approach for ultrahigh dimensional regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2014.

[8] Yingbo Zhou, Utkarsh Porwal, Ce Zhang, Hung Q Ngo, Long Nguyen, Christopher Ré, and Venu Govindaraju. Parallel feature selection inspired by group testing. In Advances in Neural Information Processing Systems, pages 3554–3562, 2014.

[9] Jinzhu Jia and Karl Rohe. Preconditioning to comply with the irrepresentable condition. arXiv preprint arXiv:1208.5584, 2012.

[10] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.

[11] Julien Mairal and Bin Yu. Complexity analysis of the lasso regularization path. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 353–360, 2012.

[12] Saharon Rosset and Ji Zhu. Piecewise linear regularized solution paths. The Annals of Statistics, pages 1012–1030, 2007.

[13] Peng Zhao and Bin Yu. On model selection consistency of lasso. The Journal of Machine Learning Research, 7:2541–2563, 2006.

[14] Fei Ye and Cun-Hui Zhang. Rate minimaxity of the lasso and Dantzig selector for the ℓq loss in ℓr balls. The Journal of Machine Learning Research, 11:3519–3540, 2010.

[15] Garvesh Raskutti, Martin J Wainwright, and Bin Yu.
Minimax rates of convergence for high-dimensional regression under ℓq-ball sparsity. In 47th Annual Allerton Conference on Communication, Control, and Computing (Allerton 2009), pages 251–257. IEEE, 2009.

[16] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.

[17] Nicolai Meinshausen and Peter Bühlmann. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473, 2010.

[18] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.

[19] Jiahua Chen and Zehua Chen. Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3):759–771, 2008.

[20] Paulo Cortez and Alice Maria Gonçalves Silva. Using data mining to predict secondary school student performance. 2008.

[21] Todd E Scheetz, Kwang-Youn A Kim, Ruth E Swiderski, Alisdair R Philp, Terry A Braun, Kevin L Knudtson, Anne M Dorrance, Gerald F DiBona, Jian Huang, Thomas L Casavant, et al. Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences, 103(39):14429–14434, 2006.

[22] Artur Trindade. UCI Machine Learning Repository, 2014.