{"title": "Median Selection Subset Aggregation for Parallel Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 2195, "page_last": 2203, "abstract": "For massive data sets, efficient computation commonly relies on distributed algorithms that store and process subsets of the data on different machines, minimizing communication costs. Our focus is on regression and classification problems involving many features. A variety of distributed algorithms have been proposed in this context, but challenges arise in defining an algorithm with low communication, theoretical guarantees and excellent practical performance in general settings. We propose a MEdian Selection Subset AGgregation Estimator (message) algorithm, which attempts to solve these problems. The algorithm applies feature selection in parallel for each subset using Lasso or another method, calculates the `median' feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates. The algorithm is simple, involves very minimal communication, scales efficiently in both sample and feature size, and has theoretical guarantees. In particular, we show model selection consistency and coefficient estimation efficiency. Extensive experiments show excellent performance in variable selection, estimation, prediction, and computation time relative to usual competitors.", "full_text": "Median Selection Subset Aggregation for Parallel Inference

Xiangyu Wang, Dept. of Statistical Science, Duke University, xw56@stat.duke.edu
Peichao Peng, Statistics Department, University of Pennsylvania, ppeichao@yahoo.com
David B. Dunson, Dept. of Statistical Science, Duke University, dunson@stat.duke.edu

Abstract

For massive data sets, efficient computation commonly relies on distributed algorithms that store and process subsets of the data on different machines, minimizing communication costs. 
Our focus is on regression and classification problems involving many features. A variety of distributed algorithms have been proposed in this context, but challenges arise in defining an algorithm with low communication, theoretical guarantees and excellent practical performance in general settings. We propose a MEdian Selection Subset AGgregation Estimator (message) algorithm, which attempts to solve these problems. The algorithm applies feature selection in parallel for each subset using Lasso or another method, calculates the ‘median’ feature inclusion index, estimates coefficients for the selected features in parallel for each subset, and then averages these estimates. The algorithm is simple, involves very minimal communication, scales efficiently in both sample and feature size, and has theoretical guarantees. In particular, we show model selection consistency and coefficient estimation efficiency. Extensive experiments show excellent performance in variable selection, estimation, prediction, and computation time relative to usual competitors.

1 Introduction

The explosion in both size and velocity of data has brought new challenges to the design of statistical algorithms. Parallel inference is a promising approach for solving large scale problems. The typical procedure for parallelization partitions the full data into multiple subsets, stores subsets on different machines, and then processes subsets simultaneously. Processing on subsets in parallel can lead to two types of computational gains. The first reduces time for calculations within each iteration of optimization or sampling algorithms via faster operations; for example, in conducting linear algebra involved in calculating likelihoods or gradients [1–7]. 
Although such approaches can lead to substantial reductions in computational bottlenecks for big data, the amount of gain is limited by the need to communicate across computers at each iteration. It is well known that communication costs are a major factor driving the efficiency of distributed algorithms, so that it is of critical importance to limit communication. This motivates the second type of approach, which conducts computations completely independently on the different subsets, and then combines the results to obtain the final output. This limits communication to the final combining step, and may lead to simpler and much faster algorithms. However, a major issue is how to design algorithms that are close to communication free, which can preserve or even improve the statistical accuracy relative to (much slower) algorithms applied to the entire data set simultaneously. We focus on addressing this challenge in this article.

There is a recent flurry of research in both Bayesian and frequentist settings focusing on the second approach [8–14]. Particularly relevant to our approach is the literature on methods for combining point estimators obtained in parallel for different subsets [8, 9, 13]. Mann et al. [9] suggest using averaging for combining subset estimators, and Zhang et al. [8] prove that such estimators will achieve the same error rate as the ones obtained from the full set if the number of subsets m is well chosen. Minsker [13] utilizes the geometric median to combine the estimators, showing robustness and sharp concentration inequalities. These methods function well in certain scenarios, but might not be broadly useful. 
In practice, inference for regression and classification typically contains two important components: one is variable or feature selection, and the other is parameter estimation. Current combining methods are not designed to produce good results for both tasks.

To obtain a simple and computationally efficient parallel algorithm for feature selection and coefficient estimation, we propose a new combining method, referred to as message. The detailed algorithm is fully described in the next section. There are related methods, which were proposed with the very different goal of combining results from different imputed data sets in missing data contexts [15]. However, these methods are primarily motivated for imputation aggregation, do not improve computational time, and lack theoretical guarantees. Another related approach is the bootstrap Lasso (Bolasso) [16], which runs Lasso independently for multiple bootstrap samples, and then intersects the results to obtain the final model. Asymptotic properties are provided under a fixed number of features (p fixed), and the computational burden is not improved over applying Lasso to the full data set. Our message algorithm has strong justification in leading to excellent convergence properties in both feature selection and prediction, while being simple to implement and computationally highly efficient.

The article is organized as follows. In Section 2, we describe message in detail. In Section 3, we provide theoretical justifications and show that message can produce better results than full data inferences under certain scenarios. Section 4 evaluates the performance of message via extensive numerical experiments. Section 5 contains a discussion of possible generalizations of the new method to broader families of models and online learning. 
All proofs are provided in the supplementary materials.

2 Parallelized framework

Consider the linear model, which has n observations and p features,

Y = Xβ + ε,

where Y is an n × 1 response vector, X is an n × p matrix of features and ε is the observation error, which is assumed to have mean zero and variance σ². The fundamental idea for communication-efficient parallel inference is to partition the data set into m subsets, each of which contains a small portion of the data, n/m. Separate analyses on each subset are then carried out, and the results are aggregated to produce the final output.

As mentioned in the previous section, regression problems usually consist of two stages: feature selection and parameter estimation. For linear models, there is a rich literature on feature selection and we only consider two approaches. The risk inflation criterion (RIC), or more generally, the generalized information criterion (GIC), is an l0-based feature selection technique for high dimensional data [17–20]. GIC attempts to solve the following optimization problem,

M̂_λ = argmin_{M ⊂ {1, 2, · · · , p}} ‖Y − X_M β_M‖₂² + λ|M|σ²,   (1)

for some well-chosen λ. For λ = 2(log p + log log p) it corresponds to RIC [17], for λ = (2 log p + log n) it corresponds to extended BIC [19], and λ = log n reduces to the usual BIC. Konishi and Kitagawa [18] prove the consistency of GIC for high dimensional data under some regularity conditions.

Lasso [21] is an l1-based feature selection technique, which solves the following problem,

β̂ = argmin_β ‖Y − Xβ‖₂² + λ‖β‖₁,   (2)

for some well-chosen λ. Lasso transfers the original NP-hard l0-based optimization to a problem that can be solved in polynomial time. 
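For tiny p, the GIC objective in (1) can be optimized by exhaustive enumeration; the numpy sketch below (synthetic data and illustrative constants, not the paper's code, which pairs GIC with a Lasso solution path) scores every candidate model and returns the minimizer.

```python
# Minimal sketch: minimize ||Y - X_M beta_M||^2 + lam*|M|*sigma^2 over all
# candidate models M by enumeration (feasible only for very small p).
import itertools
import numpy as np

def gic_select(Y, X, lam, sigma2):
    n, p = X.shape
    best, best_score = (), float(np.sum(Y ** 2))   # empty model: RSS = ||Y||^2
    for size in range(1, p + 1):
        for M in itertools.combinations(range(p), size):
            XM = X[:, M]
            beta, *_ = np.linalg.lstsq(XM, Y, rcond=None)
            score = float(np.sum((Y - XM @ beta) ** 2)) + lam * size * sigma2
            if score < best_score:
                best, best_score = M, score
    return best

rng = np.random.default_rng(0)
n, p = 200, 6
beta = np.zeros(p)
beta[[0, 2]] = [3.0, -2.0]                         # true support {0, 2}
X = rng.standard_normal((n, p))
Y = X @ beta + rng.standard_normal(n)
lam = 2 * (np.log(p) + np.log(np.log(p)))          # RIC-type penalty
print(gic_select(Y, X, lam, sigma2=1.0))
```

Substituting λ = 2 log p + log n in the same routine gives the extended BIC score instead.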
Zhao and Yu [22] prove the selection consistency of Lasso under the irrepresentable condition. Based on the model selected by either GIC or Lasso, we can then apply the ordinary least squares (OLS) estimator to find the coefficients.

As briefly discussed in the introduction, averaging and median aggregation approaches possess different advantages but also suffer from certain drawbacks. To carefully adapt these features to regression and classification, we propose the median selection subset aggregation (message) algorithm, which is motivated as follows.

Averaging of sparse regression models leads to an inflated number of features having non-zero coefficients, and hence is not appropriate for model aggregation when feature selection is of interest. When conducting Bayesian variable selection, the median probability model has been recommended as selecting the single model that produces the best approximation to model-averaged predictions under some simplifying assumptions [23]. The median probability model includes those features having inclusion probabilities greater than 1/2. We can apply this notion to subset-based inference by including features that are included in a majority of the subset-specific analyses, leading to selecting the ‘median model’. Let γ^(i) = (γ^(i)_1, · · · , γ^(i)_p) denote a vector of feature inclusion indicators for the ith subset, with γ^(i)_j = 1 if feature j is included, so that the coefficient β_j on this feature is non-zero, and γ^(i)_j = 0 otherwise. 
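Under this notation, the median model is a per-feature majority vote across the m subset-specific inclusion vectors; a minimal numpy sketch with a hypothetical 5 × 4 indicator matrix (standing in for five subset Lasso/GIC fits):

```python
import numpy as np

# Rows: inclusion vectors gamma^(i) from m = 5 hypothetical subset fits.
# Columns: p = 4 features.
gamma = np.array([
    [1, 0, 1, 0],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
])

# Entrywise median over subsets: feature j is kept iff a majority includes it.
median_model = (gamma.mean(axis=0) > 0.5).astype(int)
print(median_model)   # features 0 and 2 are selected by 4 of the 5 subsets
```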
The inclusion indicator vector for the median model M_γ can be obtained by

γ = argmin_{γ ∈ {0,1}^p} ∑_{i=1}^m ‖γ − γ^(i)‖₁,

or equivalently,

γ_j = median{γ^(i)_j, i = 1, 2, · · · , m} for j = 1, 2, · · · , p.

If we apply Lasso or GIC to the full data set, in the presence of heavy-tailed observation errors, the estimated feature inclusion indicator vector will converge to the true inclusion vector at a polynomial rate. It is shown in the next section that the convergence rate of the inclusion vector for the median model can be improved to be exponential, leading to substantial gains in not only computational time but also feature selection performance. The intuition for this gain is that in the heavy-tailed case, a proportion of the subsets will contain outliers having a sizable influence on feature selection. By taking the median, we obtain a central model that is not so influenced by these outliers, and hence can concentrate more rapidly. As large data sets typically contain outliers and data contamination, this is a substantial practical advantage in terms of performance even putting aside the computational gain. After feature selection, we obtain estimates of the coefficients for each selected feature by averaging the coefficient estimates from each subset, following the spirit of [8]. The message algorithm (described in Algorithm 1) only requires each machine to pass the feature indicators to a central computer, which (essentially instantaneously) calculates the median model, passes back the corresponding indicator vector to the individual computers, which then pass back coefficient estimates for averaging. The communication costs are negligible.

3 Theory

In this section, we provide theoretical justification for the message algorithm in the linear model case. 
The theory is easily generalized to a much wider range of models and estimation techniques, as will be discussed in the last section.

Throughout the paper we will assume X = (x_1, · · · , x_p) is an n × p feature matrix, s = |S| is the number of non-zero coefficients and λ(A) denotes an eigenvalue of the matrix A. Before we proceed to the theorems, we enumerate several conditions that are required for establishing the theory. We assume there exist constants V₁, V₂ > 0 such that

A.1 Consistency condition for estimation.
• (1/n) x_i^T x_i ≤ V₁ for i = 1, 2, · · · , p
• λ_min((1/n) X_S^T X_S) ≥ V₂

A.2 Conditions on ε, |S| and β.
• E(ε^{2k}) < ∞ for some k > 0
• s = |S| ≤ c₁n^ι for some 0 ≤ ι < 1
• min_{i∈S} |β_i| ≥ c₂n^{−(1−τ)/2} for some 0 < τ ≤ 1

A.3 (Lasso) The strong irrepresentable condition.
• Assuming X_S and X_{S^c} are the features having non-zero and zero coefficients, respectively, there exists some positive constant vector η such that

|X_{S^c}^T X_S (X_S^T X_S)^{−1} sign(β_S)| < 1 − η

A.4 (Generalized information criterion, GIC) The sparse Riesz condition.
• There exist constants κ ≥ 0 and c > 0 such that ρ > cn^{−κ}, where

ρ = inf_{|π| ≤ |S|} λ_min(X_π^T X_π / n)

Algorithm 1 Message algorithm
Initialization:
1: Input (Y, X), n, p, and m
2: # n is the sample size, p is the number of features and m is the number of subsets
3: Randomly partition (Y, X) into m subsets (Y^(i), X^(i)) and distribute them on m machines.
Iteration:
4: for i = 1 to m do
5:   γ^(i) = argmin_{M_γ} loss(Y^(i), X^(i)) # γ^(i) is the estimated model via Lasso or GIC
6: # Gather all subset models γ^(i) to obtain the median model M_γ
7: for j = 1 to p do
8:   γ_j = median{γ^(i)_j, i = 1, 2, · · · , m}
9: # Redistribute the estimated model M_γ to all subsets
10: for i = 1 to m do
11:   β^(i) = (X^(i)T_γ X^(i)_γ)^{−1} X^(i)T_γ Y^(i) # Estimate the coefficients
12: # Gather all subset estimates β^(i)
13: β̄ = ∑_{i=1}^m β^(i)/m
14:
15: return β̄, γ

A.1 is the usual consistency condition for regression. A.2 restricts the behavior of the three key terms and is crucial for model selection. These are both usual assumptions; see [19, 20, 22]. A.3 and A.4 are specific conditions for model selection consistency for Lasso/GIC. As noted in [22], A.3 is almost sufficient and necessary for sign consistency. A.4 could be relaxed slightly as shown in [19], but for simplicity we rely on this version. To ameliorate possible concerns on how realistic these conditions are, we provide further justifications via Theorems 3 and 4 in the supplementary material.

Theorem 1. (GIC) Assume each subset satisfies A.1, A.2 and A.4, and p ≤ n^α for some α < k(τ − η), where η = max{ι/k, 2κ}. If ι < τ, 2κ < τ and λ in (1) is chosen so that λ = (c₀/σ²)(n/m)^{τ−κ} for some c₀ < cc₂/2, then there exists some constant C₀ such that for n ≥ (2C₀p)^{(kτ−kη)^{−1}} and m = ⌊(4C₀)^{−(kτ−kη)^{−1}} · n/p^{(kτ−kη)^{−1}}⌋, the selected model M_γ follows

P(M_γ = M_S) ≥ 1 − exp{−n^{1−α/(kτ−kη)} / (24(4C₀)^{kτ−kη})},

and, defining C′₀ = min_i λ_min(X^(i)T_γ X^(i)_γ / n_i), the mean square error follows

E‖β̄ − β‖₂² ≤ σ²V₂^{−1}s/n + exp{−n^{1−α/(kτ−kη)} / (24(4C₀)^{kτ−kη})} ((1 + 2C′₀^{−1}sV₁)‖β‖₂² + C′₀^{−1}σ²).

Theorem 2. (Lasso) Assume each subset satisfies A.1, A.2 and A.3, and p ≤ n^α for some α < k(τ − ι). If ι < τ and λ in (2) is chosen so that λ = c₀(n/m)^{(ι−τ+1)/2} for some c₀ < c₁V₂/c₂, then there exists some constant C₀ such that for n ≥ (2C₀p)^{(kτ−kι)^{−1}} and m = ⌊(4C₀)^{−(kτ−kι)^{−1}} · n/p^{(kτ−kι)^{−1}}⌋, the selected model M_γ follows

P(M_γ = M_S) ≥ 1 − exp{−n^{1−α/(kτ−kι)} / (24(4C₀)^{kτ−kι})},

and, with the same C′₀ defined in Theorem 1, we have

E‖β̄ − β‖₂² ≤ σ²V₂^{−1}s/n + exp{−n^{1−α/(kτ−kι)} / (24(4C₀)^{kτ−kι})} ((1 + 2C′₀^{−1}sV₁)‖β‖₂² + C′₀^{−1}σ²).

The above two theorems boost the model selection consistency property from the original polynomial rate [20, 22] to an exponential rate for heavy-tailed errors. In addition, the mean square error, as shown above, preserves almost the same convergence rate as if the full data were employed and the true model were known. Therefore, we expect similar or better performance of message with a significantly lower computational load. Detailed comparisons are demonstrated in Section 4.

4 Experiments

This section assesses the performance of the message algorithm via extensive examples, comparing the results to

• Full data inference. (denoted as “full data”)
• Subset averaging. Partition and average the estimates obtained on all subsets. (denoted as “averaging”)
• Subset median. Partition and take the marginal median of the estimates obtained on all subsets. (denoted as “median”)
• Bolasso. Run Lasso on multiple bootstrap samples and intersect to select the model. Then estimate the coefficients based on the selected model. 
(denoted as “Bolasso”)

The Lasso part of all algorithms is implemented with the “glmnet” package [24]. (We did not use ADMM [25] for Lasso, as its actual performance might suffer from certain drawbacks [6] and it is reported to be slower than “glmnet” [26].)

4.1 Synthetic data sets

We use the linear model and the logistic model for (p; s) = (1000; 3) or (10,000; 3) with different sample sizes n and different partition numbers m to evaluate the performance. The feature vector is drawn from a multivariate normal distribution with correlation ρ = 0 or 0.5. Coefficients β are chosen as

β_i ∼ (−1)^{Ber(0.4)} (8 log n/√n + |N(0, 1)|), i ∈ S.

Since GIC is intractable to implement exactly (NP-hard), we combine it with Lasso for variable selection: implement Lasso for a set of different λ's and determine the optimal one via GIC. The concrete setup of the models is as follows.

Case 1: Linear model with ε ∼ N(0, 2²).
Case 2: Linear model with ε ∼ t(0, df = 3).
Case 3: Logistic model.

For p = 1,000, we simulate 200 data sets for each case, and vary the sample size from 2,000 to 10,000. For each case, the subset size is fixed to 400, so the number of subsets changes from 5 to 25. In each experiment, we record the mean square error for β̂, the probability of selecting the true model and the computational time, and plot them in Figures 1–6. For p = 10,000, we simulate 50 data sets for each case, and let the sample size range from 20,000 to 50,000 with the subset size fixed to 2,000. Results for p = 10,000 are provided in the supplementary materials.

It is clear that message had excellent performance in all of the simulation cases, with low MSE, high probability of selecting the true model, and low computational time. 
The other subset-based methods we considered had similar computational times and also had computational burdens that effectively did not increase with sample size, while the full data analysis and the bootstrap Lasso approach both were substantially slower than the subset methods, with the gap increasing linearly in sample size. In terms of MSE, the averaging and median approaches both had dramatically worse performance than message in every case, while bootstrap Lasso was competitive (MSEs were the same order of magnitude as message, ranging from effectively identical to having a small but significant advantage), with both message and bootstrap Lasso clearly outperforming the full data approach. In terms of feature selection performance, averaging had by far the worst performance, followed by the full data approach, which was substantially worse than bootstrap Lasso, median and message, with no clear winner among these three methods. Overall, message clearly had by far the best combination of low MSE, accurate model selection and fast computation.

[Each of Figures 1–6 plots the mean square error, the probability of selecting the true model, and the computational time (seconds) against sample size n for the median, fullset, average, message and bolasso methods.]
Figure 1: Results for case 1 with ρ = 0.
Figure 2: Results for case 1 with ρ = 0.5.
Figure 3: Results for case 2 with ρ = 0.
Figure 4: Results for case 2 with ρ = 0.5.
Figure 5: Results for case 3 with ρ = 0.
Figure 6: Results for case 3 with ρ = 0.5.

4.2 Individual household electric power consumption

This data set contains measurements of electric power consumption for every household with a one-minute sampling rate [27]. The data have been collected over a period of almost 4 years and contain 2,075,259 measurements. There are 8 predictors, which are converted to 74 predictors due to re-coding of the categorical variables (date and time). We use the first 2,000,000 samples as the training set and the remaining 75,259 for testing the prediction accuracy. The data are partitioned into 200 subsets for parallel inference. 
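The partitioning step itself is elementary; a small numpy sketch of splitting n rows into m near-equal random subsets (the function name is illustrative, not from the paper's code):

```python
import numpy as np

def random_partition(n, m, seed=0):
    # Shuffle row indices, then split into m near-equal blocks.
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, m)

parts = random_partition(n=20, m=3)
print([len(part) for part in parts])   # block sizes differ by at most one
```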
We plot the prediction accuracy (mean square error for test samples) against time for the full data, message, averaging and median methods in Figure 7. Bolasso is excluded as it did not produce meaningful results within the time span.

To illustrate details of the performance, we split the time line into two parts: the early stage shows how all algorithms adapt to a low prediction error, and a later stage captures more subtle performance of the faster algorithms (full set inference excluded due to the scale). It can be seen that message dominates the other algorithms in both speed and accuracy.

4.3 HIGGS classification

The HIGGS data have been produced using Monte Carlo simulations from a particle physics model [28]. They contain 27 predictors that are of interest to physicists wanting to distinguish between two classes of particles. The sample size is 11,000,000. We use the first 10,000,000 samples for training a logistic model and the rest to test the classification accuracy. The training set is partitioned into 1,000 subsets for parallel inference. 
The classification accuracy (probability of correctly predicting the class of test samples) against computational time is plotted in Figure 8 (Bolasso is excluded for the same reason as above).

[Figure 7 plots the mean prediction error against time (sec) at an earlier stage (message, median, average, fullset) and at a later stage (message, median, average).]
Figure 7: Results for power consumption data.

[Figure 8 plots the mean prediction accuracy against time (sec) for message, median, average and fullset.]
Figure 8: Results for HIGGS classification.

Message adapts to the prediction bound quickly. Although the classification results are not as good as the benchmarks listed in [28] (due to the choice of a simple parametric logistic model), our new algorithm achieves the best performance subject to the constraints of the model class.

5 Discussion and conclusion

In this paper, we proposed a flexible and efficient message algorithm for regression and classification with feature selection. Message essentially eliminates the computational burden attributable to communication among machines, and is as efficient as other simple subset aggregation methods. By selecting the median model, message can achieve better accuracy even than feature selection on the full data, resulting in an improvement also in MSE performance. 
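As a compact end-to-end illustration, the pipeline (per-subset selection, median model, per-subset OLS on the selected features, then averaging) can be sketched in numpy. The per-subset selector below is a simple marginal-correlation threshold, a stand-in for the Lasso/GIC step; the data, threshold and variable names are illustrative, not from the paper's code:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, m = 3000, 20, 6
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]                       # true support {0, 1, 2}
X = rng.standard_normal((n, p))
Y = X @ beta + rng.standard_normal(n)

subsets = np.array_split(rng.permutation(n), m)   # distribute rows to m "machines"

def select(Yi, Xi, thresh=0.2):
    # Stand-in selector: keep features whose marginal correlation exceeds thresh.
    corr = np.abs(Xi.T @ Yi) / (np.linalg.norm(Xi, axis=0) * np.linalg.norm(Yi))
    return (corr > thresh).astype(int)

gammas = np.array([select(Y[s], X[s]) for s in subsets])
gamma = (gammas.mean(axis=0) > 0.5).astype(int)   # median (majority) model
S = np.flatnonzero(gamma)

# Per-subset OLS restricted to the median model, then average the estimates.
betas = [np.linalg.lstsq(X[s][:, S], Y[s], rcond=None)[0] for s in subsets]
beta_bar = np.mean(betas, axis=0)
print(S, beta_bar.round(2))
```

Only the two combine steps (the majority vote and the final average) are the message-specific logic here; swapping `select` for a real Lasso or GIC fit recovers Algorithm 1.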
Extensive simulation experiments show outstanding performance relative to competitors in terms of computation, feature selection and prediction.

Although the theory described in Section 3 is mainly concerned with linear models, the algorithm is applicable in fairly wide situations. The geometric median is a topological concept, which allows the median model to be obtained in any normed model space. The properties of the median model result from independence of the subsets and weak consistency on each subset. Once these two conditions are satisfied, the properties shown in Section 3 can be transferred to essentially any model space. The follow-up averaging step has been proven to be consistent for all M-estimators with a proper choice of the partition number [8].

References

[1] Gonzalo Mateos, Juan Andrés Bazerque, and Georgios B. Giannakis. Distributed sparse linear regression. Signal Processing, IEEE Transactions on, 58(10):5262–5276, 2010.

[2] Joseph K. Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin. Parallel coordinate descent for l1-regularized loss minimization. arXiv preprint arXiv:1105.5379, 2011.

[3] Chad Scherrer, Ambuj Tewari, Mahantesh Halappanavar, and David Haglin. Feature clustering for accelerating parallel coordinate descent. In NIPS, pages 28–36, 2012.

[4] Alexander Smola and Shravan Narayanamurthy. An architecture for parallel topic models. Proceedings of the VLDB Endowment, 3(1-2):703–710, 2010.

[5] Feng Yan, Ningyi Xu, and Yuan Qi. Parallel inference for latent dirichlet allocation on graphics processing units. In NIPS, volume 9, pages 2134–2142, 2009.

[6] Zhimin Peng, Ming Yan, and Wotao Yin. Parallel and distributed sparse optimization. Preprint, 2013.

[7] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. 
The Journal of Machine Learning Research, 13:165–202, 2012.

[8] Yuchen Zhang, John C. Duchi, and Martin J. Wainwright. Communication-efficient algorithms for statistical optimization. In NIPS, volume 4, pages 5–1, 2012.

[9] Gideon Mann, Ryan T. McDonald, Mehryar Mohri, Nathan Silberman, and Dan Walker. Efficient large-scale distributed training of conditional maximum entropy models. In NIPS, volume 22, pages 1231–1239, 2009.

[10] Steven L. Scott, Alexander W. Blocker, Fernando V. Bonassi, Hugh A. Chipman, Edward I. George, and Robert E. McCulloch. Bayes and big data: the consensus Monte Carlo algorithm. In EFaBBayes 250 conference, volume 16, 2013.

[11] Willie Neiswanger, Chong Wang, and Eric Xing. Asymptotically exact, embarrassingly parallel MCMC. arXiv preprint arXiv:1311.4780, 2013.

[12] Xiangyu Wang and David B. Dunson. Parallelizing MCMC via Weierstrass sampler. arXiv preprint arXiv:1312.4605, 2013.

[13] Stanislav Minsker. Geometric median and robust estimation in Banach spaces. arXiv preprint arXiv:1308.1334, 2013.

[14] Stanislav Minsker, Sanvesh Srivastava, Lizhen Lin, and David B. Dunson. Robust and scalable Bayes via a median of subset posterior measures. arXiv preprint arXiv:1403.2660, 2014.

[15] Angela M. Wood, Ian R. White, and Patrick Royston. How should variable selection be performed with multiply imputed data? Statistics in Medicine, 27(17):3227–3246, 2008.

[16] Francis R. Bach. Bolasso: model consistent lasso estimation through the bootstrap. In Proceedings of the 25th International Conference on Machine Learning, pages 33–40. ACM, 2008.

[17] Dean P. Foster and Edward I. George. The risk inflation criterion for multiple regression. The Annals of Statistics, pages 1947–1975, 1994.

[18] Sadanori Konishi and Genshiro Kitagawa. Generalised information criteria in model selection. Biometrika, 83(4):875–890, 1996.

[19] Jiahua Chen and Zehua Chen. 
Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3):759–771, 2008.

[20] Yongdai Kim, Sunghoon Kwon, and Hosik Choi. Consistent model selection criteria on high dimensions. The Journal of Machine Learning Research, 13:1037–1057, 2012.

[21] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.

[22] Peng Zhao and Bin Yu. On model selection consistency of lasso. The Journal of Machine Learning Research, 7:2541–2563, 2006.

[23] Maria Maddalena Barbieri and James O. Berger. Optimal predictive model selection. The Annals of Statistics, pages 870–897, 2004.

[24] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.

[25] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.

[26] Xingguo Li, Tuo Zhao, Xiaoming Yuan, and Han Liu. An R package flare for high dimensional linear regression and precision matrix estimation, 2013.

[27] K. Bache and M. Lichman. UCI machine learning repository, 2013.

[28] Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Deep learning in high-energy physics: improving the search for exotic particles. arXiv preprint arXiv:1402.4735, 2014.
", "award": [], "sourceid": 1148, "authors": [{"given_name": "Xiangyu", "family_name": "Wang", "institution": "Duke University"}, {"given_name": "Peichao", "family_name": "Peng", "institution": "University of Pennsylvania"}, {"given_name": "David", "family_name": "Dunson", "institution": "Duke University"}]}