{"title": "Communication-Efficient Algorithms for Statistical Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 1502, "page_last": 1510, "abstract": "We study two communication-efficient algorithms for distributed statistical optimization on large-scale data. The first algorithm is an averaging method that distributes the $N$ data samples evenly to $m$ machines, performs separate minimization on each subset, and then averages the estimates.  We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error that decays as $\\order(N^{-1}+(N/m)^{-2})$. Whenever $m \\le \\sqrt{N}$, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all $N$ samples.  The second algorithm is a novel method, based on an appropriate form of the bootstrap.  Requiring only a single round of communication, it has mean-squared error that decays as $\\order(N^{-1}+(N/m)^{-3})$, and so is more robust to the amount of parallelization. We complement our theoretical results with experiments on large-scale problems from the Microsoft Learning to Rank dataset.", "full_text": "Communication-Efficient Algorithms for Statistical Optimization\n\nYuchen Zhang$^1$\n\nJohn C. Duchi$^1$\n\nMartin Wainwright$^{1,2}$\n\n$^1$Department of Electrical Engineering and Computer Science and $^2$Department of Statistics\n\nUniversity of California, Berkeley\n\nBerkeley, CA 94720\n\n{yuczhang,jduchi,wainwrig}@eecs.berkeley.edu\n\nAbstract\n\nWe study two communication-efficient algorithms for distributed statistical optimization on large-scale data. The first algorithm is an averaging method that distributes the $N$ data samples evenly to $m$ machines, performs separate minimization on each subset, and then averages the estimates. 
We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error that decays as $O(N^{-1} + (N/m)^{-2})$. Whenever $m \\le \\sqrt{N}$, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all $N$ samples. The second algorithm is a novel method, based on an appropriate form of the bootstrap. Requiring only a single round of communication, it has mean-squared error that decays as $O(N^{-1} + (N/m)^{-3})$, and so is more robust to the amount of parallelization. We complement our theoretical results with experiments on large-scale problems from the internet search domain. In particular, we show that our methods efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which consists of $N \\approx 2.4 \\times 10^8$ samples and $d \\ge 700{,}000$ dimensions.\n\n1 Introduction\n\nMany problems in machine learning are based on a form of (regularized) empirical risk minimization. Given the current explosion in the size and amount of data, a central challenge in machine learning is to design efficient algorithms for solving large-scale problem instances. In a centralized setting, there are many procedures for solving empirical risk minimization problems, including standard convex programming approaches [3] as well as various types of stochastic approximation [19, 8, 14]. When the size of the dataset becomes extremely large, however, it may be infeasible to store all of the data on a single computer, or at least to keep the data in memory. 
Accordingly, the focus of this paper is the theoretical analysis and empirical evaluation of some distributed and communication-efficient procedures for empirical risk minimization.\n\nRecent years have witnessed a flurry of research on distributed approaches to solving very large-scale statistical optimization problems (e.g., see the papers [13, 17, 9, 5, 4, 2, 18] and references therein). It can be difficult within a purely optimization-theoretic setting to show explicit benefits arising from distributed computation. In statistical settings, however, distributed computation can lead to gains in statistical efficiency, as shown by Dekel et al. [4] and extended by other authors [2, 18]. Within the family of distributed algorithms, there can be significant differences in communication complexity: different computers must be synchronized, and when the dimensionality of the data is high, communication can be prohibitively expensive. It is thus interesting to study distributed inference algorithms that require limited synchronization and communication while still enjoying the statistical power guaranteed by having a large dataset.\n\nWith this context, perhaps the simplest algorithm for distributed statistical inference is what we term the average mixture (AVGM) algorithm. This approach has been studied for conditional random fields [10], for perceptron-type algorithms [12], and for certain stochastic approximation methods [23]. It is an appealingly simple method: given $m$ different machines and a dataset of size $N = nm$, give each machine a (distinct) dataset of size $n = N/m$, have each machine $i$ compute the empirical minimizer $\\theta_i$ on its fraction of the data, then average all the parameters $\\theta_i$ across the network. 
Given an empirical risk minimization algorithm that works on one machine, the procedure is straightforward to implement and is extremely communication efficient (requiring only one round of communication); it is also relatively robust to failure and slow machines, since there is no repeated synchronization. To the best of our knowledge, however, no work has shown theoretically that the AVGM procedure has greater statistical efficiency than the naive approach of using $n$ samples on a single machine. In particular, Mann et al. [10] prove that the AVGM approach enjoys a variance reduction relative to the single processor solution, but they only prove that the final mean-squared error of their estimator is $O(1/n)$, since they do not show a reduction in the bias of the estimator. Zinkevich et al. [23] propose a parallel stochastic gradient descent (SGD) procedure, which runs SGD independently on $k$ machines for $T$ iterations, averaging the outputs. The algorithm enjoys good practical performance, but their main result [23, Theorem 12] guarantees a convergence rate of $O(\\log k / T)$, which is no better than sequential SGD on a single machine processing $T$ samples.\n\nThis paper makes two main contributions. First, we provide a sharp analysis of the AVGM algorithm, showing that under a reasonable set of conditions on the statistical risk function, it can indeed achieve substantially better rates. More concretely, we provide bounds on the mean-squared error that decay as $O((nm)^{-1} + n^{-2})$. Whenever the number of machines $m$ is less than the number of samples $n$ per machine, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all $N = nm$ samples. This conclusion is non-trivial and requires a surprisingly careful analysis. 
Our second contribution is to develop a novel extension of simple averaging; it is based on an appropriate form of the bootstrap [6, 7], which we refer to as the bootstrap average mixture (BAVGM) approach. At a high level, the BAVGM algorithm distributes samples evenly among $m$ processors or computers as before, but instead of simply returning the empirical minimizer, each processor further subsamples its own dataset in order to estimate the bias of its local estimate, returning a bootstrap-corrected estimate. We then prove that the BAVGM algorithm has mean-squared error decaying as $O(m^{-1} n^{-1} + n^{-3})$. Thus, as long as $m < n^2$, the bootstrap method matches the centralized gold standard up to higher order terms. Finally, we complement our theoretical results with experiments on simulated data and a large-scale logistic regression experiment that arises from the problem of predicting whether a user of a search engine will click on an advertisement. Our experiments show that the resampling and correction of the BAVGM method provide substantial performance benefits over naive solutions as well as the averaging algorithm AVGM.\n\n2 Problem set-up and methods\n\nLet $\\{f(\\cdot; x), x \\in \\mathcal{X}\\}$ be a collection of convex loss functions with domain containing the convex set $\\Theta \\subseteq \\mathbb{R}^d$. Let $P$ be a probability distribution over the sample space $\\mathcal{X}$, and define the population risk function $F_0 : \\Theta \\to \\mathbb{R}$ via\n\n$$F_0(\\theta) := \\mathbb{E}_P[f(\\theta; X)] = \\int_{\\mathcal{X}} f(\\theta; x) \\, dP(x).$$\n\nWe wish to estimate the risk-minimizing parameter $\\theta^* = \\arg\\min_{\\theta \\in \\Theta} F_0(\\theta) = \\arg\\min_{\\theta \\in \\Theta} \\int_{\\mathcal{X}} f(\\theta; x) \\, dP(x)$, which we assume to be unique. In practice, the population distribution $P$ is unknown to us, but we have access to a collection $S$ of samples from the distribution $P$. In empirical risk minimization, one estimates the vector $\\theta^*$ by solving the optimization problem $\\hat{\\theta} \\in \\arg\\min_{\\theta \\in \\Theta} \\frac{1}{|S|} \\sum_{x \\in S} f(\\theta; x)$.\n\nThroughout the paper, we impose some standard regularity conditions on the parameter space and its relationship to the optimal parameter $\\theta^*$.\n\nAssumption A (Parameters). The parameter space $\\Theta \\subset \\mathbb{R}^d$ is closed and convex with $\\theta^* \\in \\mathrm{int}\\,\\Theta$.\n\nWe use $R = \\sup_{\\theta \\in \\Theta} \\|\\theta - \\theta^*\\|_2$ to denote the $\\ell_2$-diameter of the parameter space with respect to the optimum. In addition, the risk function is required to have some amount of curvature:\n\nAssumption B (Local strong convexity). There exists a $\\lambda > 0$ such that the population Hessian matrix satisfies $\\nabla^2 F_0(\\theta^*) \\succeq \\lambda I_{d \\times d}$.\n\nHere $\\nabla^2 F_0(\\theta^*)$ denotes the Hessian of the population objective $F_0$ evaluated at $\\theta^*$. Note that this local condition is milder than a global strong convexity condition and is required to hold only for the population risk $F_0$. It is of course well-known that some type of curvature is required to consistently estimate the parameters $\\theta^*$.\n\nWe now describe our methods. In the distributed setting, we are given a dataset of $N = mn$ samples drawn i.i.d. according to the initial distribution $P$, which we divide evenly amongst $m$ processors or inference procedures. Let $S_j$, $j \\in \\{1, 2, \\ldots, m\\}$, denote a subsampled dataset of size $n$, and define the (local) empirical distribution $P_{1,j}$ and empirical objective $F_{1,j}$ via\n\n$$P_{1,j} := \\frac{1}{|S_j|} \\sum_{x \\in S_j} \\delta_x \\quad \\mathrm{and} \\quad F_{1,j}(\\theta) := \\frac{1}{|S_j|} \\sum_{x \\in S_j} f(\\theta; x) = \\int_{\\mathcal{X}} f(\\theta; x) \\, dP_{1,j}(x).$$\n\nThe AVGM procedure operates as follows: for $j \\in \\{1, \\ldots, m\\}$, machine $j$ uses its dataset $S_j$ to compute a vector $\\theta_{1,j} \\in \\arg\\min_{\\theta \\in \\Theta} F_{1,j}(\\theta)$. AVGM combines these $m$ estimates by averaging:\n\n$$\\overline{\\theta}_1 := \\frac{1}{m} \\sum_{j=1}^m \\theta_{1,j}. \\qquad (1)$$\n\nThe bootstrap average mixture (BAVGM) procedure is based on an additional level of random sampling. In particular, for a parameter $r \\in (0, 1)$, each machine $j$ draws a subset $S_{2,j}$ of size $\\lceil rn \\rceil$ by sampling uniformly at random without replacement from its local data set $S_j$. In addition to computing the empirical minimizer $\\theta_{1,j}$ based on $S_j$, BAVGM also computes the empirical minimizer $\\theta_{2,j}$ of the function $F_{2,j}(\\theta) := \\frac{1}{|S_{2,j}|} \\sum_{x \\in S_{2,j}} f(\\theta; x)$, constructing the bootstrap average $\\overline{\\theta}_2 := \\frac{1}{m} \\sum_{j=1}^m \\theta_{2,j}$ and returning the estimate\n\n$$\\overline{\\theta}_{\\mathrm{BAVGM}} := \\frac{\\overline{\\theta}_1 - r \\overline{\\theta}_2}{1 - r}. \\qquad (2)$$\n\nThe parameter $r \\in (0, 1)$ is a user-defined quantity. The purpose of the weighted estimate (2) is to perform a form of bootstrap bias correction [6, 7]. In rough terms, if $b_0 = \\theta^* - \\theta_1$ is the bias of the first estimator, then we may approximate $b_0$ by the bootstrap estimate of bias $b_1 = \\theta_1 - \\theta_2$. Then, since $\\theta^* = \\theta_1 + b_0$, we use the fact that $b_1 \\approx b_0$ to argue that $\\theta^* = \\theta_1 + b_0 \\approx \\theta_1 + b_1$.$^1$ The weights $r$ and $1 - r$ account for the difference in sample sizes: heuristically, if the leading-order bias of an empirical minimizer computed from $k$ samples is $c/k$ for some vector $c$, then $\\mathbb{E}[\\theta_1] \\approx \\theta^* + c/n$ and $\\mathbb{E}[\\theta_2] \\approx \\theta^* + c/(rn)$, so that $\\mathbb{E}[\\theta_1 - r\\theta_2] \\approx (1 - r)\\theta^*$ and dividing by $1 - r$ cancels the leading bias term.\n\n3 Main results\n\n3.1 Bounds for simple averaging\n\nTo guarantee good estimation properties of our algorithms, we require regularity conditions on the empirical risks $F_1$ and $F_2$. 
It is simplest to state these in terms of the sample functions $f$, and we note that, as with Assumption B, we require these to hold only locally around the optimal point $\\theta^*$.\n\nAssumption C. For some $\\rho > 0$, there exists a neighborhood $U = \\{\\theta \\in \\mathbb{R}^d : \\|\\theta^* - \\theta\\|_2 \\le \\rho\\} \\subseteq \\Theta$ such that for arbitrary $x \\in \\mathcal{X}$, the gradient and the Hessian of $f$ exist and satisfy the bounds\n\n$$\\|\\nabla f(\\theta; x)\\|_2 \\le G \\quad \\mathrm{and} \\quad |\\!|\\!|\\nabla^2 f(\\theta; x)|\\!|\\!|_2 \\le H$$\n\nfor finite constants $G, H$. For $x \\in \\mathcal{X}$, the Hessian matrix $\\nabla^2 f(\\theta; x)$ is Lipschitz continuous for $\\theta \\in U$: there is a constant $L$ such that $|\\!|\\!|\\nabla^2 f(v; x) - \\nabla^2 f(w; x)|\\!|\\!|_2 \\le L \\|v - w\\|_2$ for $v, w \\in U$.\n\nWhile Assumption C may appear strong, some smoothness of $\\nabla^2 f$ is necessary for averaging methods to work, as we now demonstrate by an example. (In fact, this example underscores the difficulty of proving that the AVGM algorithm achieves better mean-squared error than single-machine strategies.) Consider a distribution on $\\{0, 1\\}$ with $P(X = 0) = P(X = 1) = 1/2$, and use the loss\n\n$$f(\\theta; x) = \\begin{cases} \\theta^2 - \\theta & \\mathrm{if}\\ x = 0 \\\\ \\theta^2 \\mathbf{1}(\\theta \\le 0) + \\theta & \\mathrm{if}\\ x = 1. \\end{cases} \\qquad (3)$$\n\nThe associated population risk is $F_0(w) = \\frac{1}{2}(w^2 + w^2 \\mathbf{1}(w \\le 0))$, which is strongly convex and smooth, since $|F_0'(w) - F_0'(v)| \\le 2|w - v|$, but has discontinuous second derivative. Evidently $\\theta^* = 0$, and by an asymptotic expansion we have that $\\mathbb{E}[\\theta_1] = \\Omega(n^{-1/2})$ (see the long version of our paper [22, Appendix D] for this asymptotic result). Consequently, the bias of $\\theta_1$ is $\\Omega(n^{-1/2})$, and the AVGM algorithm using $N = mn$ observations must suffer mean squared error $\\mathbb{E}[(\\overline{\\theta}_1 - \\theta^*)^2] = \\Omega(n^{-1})$. Some type of smoothness is necessary for fast rates.\n\n$^1$ When the index $j$ is immaterial, we use the shorthand notation $\\theta_1$ and $\\theta_2$ to denote $\\theta_{1,j}$ and $\\theta_{2,j}$, respectively, and similarly with other quantities.\n\nThat being said, Assumptions B and C are somewhat innocuous for practical problems. Both hold for logistic and linear regression problems so long as the population data covariance matrix is not rank deficient and the data is bounded; moreover, in the linear regression case, we have $L = 0$.\n\nOur assumptions in place, we present our first theorem on the convergence of the AVGM procedure. We provide the proof of Theorem 1 (under somewhat milder assumptions) and its corollaries in the full version of this paper [22].\n\nTheorem 1. For each $i \\in \\{1, \\ldots, m\\}$, let $S_i$ be a dataset of $n$ independent samples, and let\n\n$$\\theta_{1,i} \\in \\arg\\min_{\\theta \\in \\Theta} \\frac{1}{n} \\sum_{x_j \\in S_i} f(\\theta; x_j)$$\n\ndenote the minimizer of the empirical risk for the dataset $S_i$. Define $\\overline{\\theta}_1 = \\frac{1}{m} \\sum_{i=1}^m \\theta_{1,i}$ and let $\\theta^*$ denote the population risk minimizer. Then under Assumptions A\u2013C, we have\n\n$$\\mathbb{E}\\left[\\|\\overline{\\theta}_1 - \\theta^*\\|_2^2\\right] \\le \\frac{2}{nm} \\mathbb{E}\\left[\\|\\nabla^2 F_0(\\theta^*)^{-1} \\nabla f(\\theta^*; X)\\|_2^2\\right] + \\frac{5}{\\lambda^2 n^2} \\left(H^2 \\log d + \\mathbb{E}\\left[\\|\\nabla^2 F_0(\\theta^*)^{-1} \\nabla f(\\theta^*; X)\\|_2^2\\right]\\right) \\mathbb{E}\\left[\\|\\nabla^2 F_0(\\theta^*)^{-1} \\nabla f(\\theta^*; X)\\|_2^2\\right] + O(m^{-1} n^{-2}) + O(n^{-3}). \\qquad (4)$$\n\nA simple corollary of Theorem 1 makes it somewhat easier to parse, though we prefer the general form in the theorem as its dimension dependence is somewhat stronger. Specifically, note that by definition of the operator norm, $\\|Ax\\|_2 \\le |\\!|\\!|A|\\!|\\!|_2 \\|x\\|_2$ for any matrix $A$ and vector $x$. Consequently,\n\n$$\\|\\nabla^2 F_0(\\theta^*)^{-1} \\nabla f(\\theta^*; x)\\|_2 \\le |\\!|\\!|\\nabla^2 F_0(\\theta^*)^{-1}|\\!|\\!|_2 \\, \\|\\nabla f(\\theta^*; x)\\|_2 \\le \\frac{1}{\\lambda} \\|\\nabla f(\\theta^*; x)\\|_2,$$\n\nwhere for the last inequality we used Assumption B. In general, this upper bound may be quite loose, and in many statistical applications (such as linear regression) multiplying $\\nabla f(\\theta^*; X)$ by the inverse Hessian standardizes the data. Assumption C implies $\\mathbb{E}[\\|\\nabla f(\\theta^*; X)\\|_2^2] \\le G^2$, so that we arrive at the following:\n\nCorollary 1. Under the same conditions as Theorem 1, we have\n\n$$\\mathbb{E}\\left[\\|\\overline{\\theta}_1 - \\theta^*\\|_2^2\\right] \\le \\frac{2G^2}{\\lambda^2 nm} + \\frac{5G^2}{\\lambda^4 n^2} \\left(H^2 \\log d + \\frac{G^2}{\\lambda^2}\\right) + O(m^{-1} n^{-2}) + O(n^{-3}).$$\n\nA comparison of Theorem 1\u2019s conclusions with classical statistical results is also informative. 
If the loss $f(\\cdot; x) : \\Theta \\to \\mathbb{R}$ is the negative log-likelihood $\\ell(x \\mid \\theta)$ for a parametric model $P(\\cdot \\mid \\theta^*)$, then under suitable smoothness conditions on the log likelihood [21], we can define the Fisher information matrix\n\n$$I(\\theta^*) := \\mathbb{E}_{\\theta^*}\\left[\\nabla \\ell(X \\mid \\theta^*) \\nabla \\ell(X \\mid \\theta^*)^\\top\\right] = \\mathbb{E}_{\\theta^*}[\\nabla^2 \\ell(X \\mid \\theta^*)],$$\n\nwhere $\\mathbb{E}_{\\theta^*}$ denotes expectation under the model $P(\\cdot \\mid \\theta^*)$. Let $N = mn$ denote the total number of samples available. Then under our assumptions, we have the minimax result [21, Theorem 8.11] that for any estimator $\\hat{\\theta}_N$ based on $N$ samples,\n\n$$\\sup_{M < \\infty} \\liminf_{N \\to \\infty} \\sup_{\\|\\delta\\| \\le M/\\sqrt{N}} N \\, \\mathbb{E}_{\\theta^* + \\delta}\\left[\\|\\hat{\\theta}_N - \\theta^* - \\delta\\|_2^2\\right] \\ge \\mathrm{tr}(I(\\theta^*)^{-1}). \\qquad (5)$$\n\nIn connection with Theorem 1, we obtain the following comparative result.\n\nCorollary 2. Let the assumptions of Theorem 1 hold, and assume that the loss functions $f(\\cdot; x)$ are the negative log-likelihood $\\ell(x \\mid \\theta)$ for a parametric model $P(\\cdot \\mid \\theta^*)$. Let $N = mn$. Then\n\n$$\\mathbb{E}\\left[\\|\\overline{\\theta}_1 - \\theta^*\\|_2^2\\right] \\le \\frac{2}{N} \\mathrm{tr}(I(\\theta^*)^{-1}) + \\frac{5 m^2 \\, \\mathrm{tr}(I(\\theta^*)^{-1})}{\\lambda^2 N^2} \\left(H^2 \\log d + \\mathrm{tr}(I(\\theta^*)^{-1})\\right) + O(m^{-1} n^{-2}).$$\n\nExcept for the factor of 2 in the bound, Corollary 2 shows that Theorem 1 essentially achieves the best possible result. The important aspect of our bound, however, is that we obtain this convergence rate without calculating an estimate on all $N = mn$ data samples $x_i$; we calculate $m$ independent estimators and average them to attain the convergence guarantee.\n\n3.2 Bounds for bootstrap mixture averaging\n\nAs shown in Theorem 1 and the immediately preceding corollary, for small $m$, the convergence rate of the AVGM algorithm is mainly determined by the first term in the bound (4), which is at worst $\\frac{G^2}{\\lambda^2 mn}$. When the number of processors $m$ grows, however, the second term in the bound (4) may have non-negligible effect in spite of being $O(n^{-2})$. In addition, when the population risk\u2019s local strong convexity parameter $\\lambda$ is close to zero or the Lipschitz continuity constant $H$ of $\\nabla f(\\theta; x)$ is large, the $n^{-2}$ term in the bound (4) and Corollary 1 may dominate the leading term. This concern motivates our development of the bootstrap average mixture (BAVGM) algorithm and analysis.\n\nDue to the additional randomness introduced by the bootstrap algorithm BAVGM, its analysis requires an additional smoothness condition. In particular, we require that in a neighborhood of the optimal point $\\theta^*$, the loss function $f$ is smooth through its third derivatives.\n\nAssumption D. For a $\\rho > 0$, there exists a neighborhood $U = \\{\\theta \\in \\mathbb{R}^d : \\|\\theta^* - \\theta\\|_2 \\le 2\\rho\\} \\subseteq \\Theta$ such that the smoothness conditions of Assumption C hold. For $x \\in \\mathcal{X}$, the third derivatives of $f$ are Lipschitz continuous: there is a constant $M \\ge 0$ such that for $v, w \\in U$ and $u \\in \\mathbb{R}^d$,\n\n$$\\left\\|\\left(\\nabla^3 f(v; x) - \\nabla^3 f(w; x)\\right)(u \\otimes u)\\right\\|_2 \\le M \\|v - w\\|_2 \\, |\\!|\\!|u \\otimes u|\\!|\\!|_2 = M \\|v - w\\|_2 \\|u\\|_2^2.$$\n\nNote that Assumption D holds for linear regression (in fact, with $M = 0$); it also holds for logistic regression problems with finite $M$ as long as the data is bounded.\n\nWe now state our second main theorem, which shows that the use of bootstrap samples to reduce the bias of the AVGM algorithm yields improved performance. (Again, see [22] for a proof.)\n\nTheorem 2. Let Assumptions A\u2013D hold. Then the output $\\overline{\\theta}_{\\mathrm{BAVGM}} = (\\overline{\\theta}_1 - r\\overline{\\theta}_2)/(1 - r)$ of the bootstrap BAVGM algorithm satisfies\n\n$$\\mathbb{E}\\left[\\|\\overline{\\theta}_{\\mathrm{BAVGM}} - \\theta^*\\|_2^2\\right] \\le \\frac{2 + 3r}{(1 - r)^2} \\cdot \\frac{1}{nm} \\, \\mathbb{E}\\left[\\|\\nabla^2 F_0(\\theta^*)^{-1} \\nabla f(\\theta^*; X)\\|_2^2\\right] + O\\left(\\frac{1}{(1 - r)^2} m^{-1} n^{-2} + \\frac{1}{r(1 - r)^2} n^{-3}\\right). \\qquad (6)$$\n\nComparing the conclusions of Theorem 2 to those of Theorem 1, we see that the $O(n^{-2})$ term in the bound (4) has been eliminated. The reason for this elimination is that resampling at a rate $r$ reduces the bias of the BAVGM algorithm to $O(n^{-3})$; the bias of the AVGM algorithm induces terms of order $n^{-2}$ in Theorem 1. Unsurprisingly, Theorem 2 suggests that the performance of the BAVGM algorithm is affected by the resampling rate $r$; typically, one uses $r \\in (0, 1)$. Roughly, when $m$ becomes large we increase $r$, since the bias of the independent solutions may increase and we enjoy averaging effects from the BAVGM algorithm. When $m$ is small, the BAVGM algorithm appears to provide limited benefits. The big-O notation hides some problem dependent constants for simplicity in the bound. 
We leave as an intriguing open question whether computing multiple bootstrap samples at each machine can yield improved performance for the BAVGM procedure.\n\n3.3 Time complexity\n\nIn practice, the exact empirical minimizers assumed in Theorems 1 and 2 may be unavailable. In this section, we sketch an argument that shows that both the AVGM algorithm and the BAVGM algorithm can use approximate empirical minimizers to achieve the same (optimal) asymptotic bounds. Indeed, suppose that we employ approximate empirical minimizers in AVGM and BAVGM instead of the exact ones.$^2$ Let the vector $\\theta'$ denote the approximation to the vector $\\theta$ (at each point of the algorithm). With this notation, we have by the triangle inequality and Jensen\u2019s inequality that\n\n$$\\mathbb{E}[\\|\\overline{\\theta}_1' - \\theta^*\\|_2^2] \\le 2\\mathbb{E}[\\|\\overline{\\theta}_1 - \\theta^*\\|_2^2] + 2\\mathbb{E}[\\|\\overline{\\theta}_1' - \\overline{\\theta}_1\\|_2^2] \\le 2\\mathbb{E}[\\|\\overline{\\theta}_1 - \\theta^*\\|_2^2] + 2\\mathbb{E}[\\|\\theta_1' - \\theta_1\\|_2^2]. \\qquad (7)$$\n\nThe bound (7) shows that solving the empirical minimization problem to accuracy sufficient to have $\\mathbb{E}[\\|\\theta_1' - \\theta_1\\|_2^2] = O((mn)^{-2})$ guarantees the same convergence rates provided by Theorem 1.\n\n$^2$ We provide the arguments only for the AVGM algorithm to save space; the arguments for the BAVGM algorithm are completely similar, though they also include $\\theta_2$.\n\n[Figure 1, panels (a) and (b): error $\\|\\hat{w} - w^*\\|_2^2$ versus number $m$ of machines, with curves Average, Bootstrap, and All.]\n\nFigure 1: Experiments plotting the error in the estimate of $\\theta^*$ given by the AVGM algorithm and BAVGM algorithm for total number of samples $N = 10^5$ versus number of dataset splits (parallel machines) $m$. Each plot indicates a different dimension $d$ of problem. (a) $d = 20$, (b) $d = 100$.\n\nNow we show that in time $O(n \\log(mn))$, assuming that processing one sample requires one unit of time, it is possible to achieve empirical accuracy $O((nm)^{-2})$. When this holds, the speedup of the AVGM and similar algorithms over the naive approach of processing all $N = mn$ samples on one processor is at least of order $m / \\log(N)$. Let us argue that for such time complexity the necessary empirical convergence is achievable. As we show in our proof of Theorem 1, with high probability the empirical risk $F_1$ is strongly convex in a ball $B_\\rho(\\theta_1)$ of constant radius $\\rho > 0$ around $\\theta_1$. (A similar conclusion holds for $F_2$.) A combination of stochastic gradient descent [14] and standard convex programming approaches [3] completes the argument. Indeed, performing stochastic gradient descent for $O(\\log^2(mn)/\\rho^2)$ iterations on the empirical objective $F_1$ yields that with probability at least $1 - m^{-2} n^{-2}$, the resulting parameter falls within $B_\\rho(\\theta_1)$ [14, Proposition 2.1]. 
The local strong convexity guarantees that $O(\\log(mn))$ iterations of standard gradient descent [3, Chapter 9], each requiring $O(n)$ units of time, beginning from this parameter suffice to achieve $\\mathbb{E}[\\|\\theta_1' - \\theta_1\\|_2^2] = O((mn)^{-2})$, since gradient descent enjoys a locally linear convergence rate. The procedure outlined requires at most $O(n \\log(mn))$ units of time.\n\nWe also remark that under a slightly more global variant of Assumptions A\u2013C, we can show that stochastic gradient descent achieves convergence rates of $O((mn)^{-1} + n^{-3/2})$, which is order optimal. See the full version of this paper [22, Section 3.4] for this result.\n\n4 Experiments with synthetic data\n\nIn this section, we report the results of simulation studies comparing the AVGM and BAVGM methods, as well as a trivial method using only a fraction of the data available on a single machine. For our simulated experiments, we solve linear regression problems of varying dimensionality. For each experiment, we use a fixed total number $N = 10^5$ of samples, but we vary the number of parallel splits $m$ of the data (and consequently, the local dataset sizes $n = N/m$) and the dimensionality $d$ of the problem solved. For each simulation, we choose a constant vector $u \\in \\mathbb{R}^d$. The data samples consist of pairs $(x, y)$, where $x \\in \\mathbb{R}^d$ and $y \\in \\mathbb{R}$ is the target value. To sample each $x$ vector, we choose five entries of $x$ distributed as $\\mathcal{N}(0, 1)$; the remainder of $x$ is zero. The target $y$ is sampled as $y = \\langle u, x \\rangle + \\sum_{j=1}^d (x_j/2)^3$, so the noise in the linear estimate $\\langle u, x \\rangle$ is correlated with $x$. For our linear regression problem, we use the loss $f(\\theta; (x, y)) := \\frac{1}{2}(\\langle \\theta, x \\rangle - y)^2$. 
We attempt to find the vector $\\theta^*$ minimizing $F(\\theta) = \\mathbb{E}[f(\\theta; (X, Y))]$ using the standard batch solution, using AVGM, using BAVGM, and simply solving the linear regression problem resulting from a single split of the data (of size $N/m$). We use $m \\in \\{2, 4, 8, 16, 32, 64, 128\\}$ datasets, recalling that the distributed datasets are of size $n = N/m$. We perform experiments with each of the dimensionalities $d = 20, 50, 100, 200, 400$. (We plot $d = 20$ and $d = 100$; other results are qualitatively similar.)\n\nLet $\\hat{\\theta}$ denote the vector output by any of our procedures after inference (so in the BAVGM case, for example, this is the vector $\\hat{\\theta} = \\overline{\\theta}_{\\mathrm{BAVGM}} = (\\overline{\\theta}_1 - r\\overline{\\theta}_2)/(1 - r)$). We obtain the true optimal vector $\\theta^*$ by solving the linear regression problem with a sufficiently large number of samples. In Figure 1, we plot the error $\\|\\hat{\\theta} - \\theta^*\\|_2^2$ of the inferred parameter vector $\\hat{\\theta}$ for the true parameters $\\theta^*$ versus the number of splits, or number of parallel machines, $m$ we use. We also plot standard errors (across fifty experiments) for each curve. In each plot, the flat bottom line is the error of the batch method using all the $N$ samples.\n\n[Figure 2, panel (a): error $\\|\\hat{w} - w^*\\|_2^2$ versus number $m$ of machines, with curves Average and Single. Panel (b): negative log-likelihood versus sub-sampling rate $r$ for BAVGM ($m = 128$).]\n\nFigure 2: (a) Synthetic data: comparison of AVGM estimator to linear regression estimator based on $N/m$ data points. (b) Advertising data: the log-loss on held-out data for the BAVGM method applied with $m = 128$ parallel splits of the data, plotted versus the sub-sampling rate $r$.\n\nFrom the plots in Figure 1, we can make a few claims. 
First, the AVGM and BAVGM algorithms indeed enjoy excellent performance, as our theory predicts. Even as the dimensionality $d$ grows, we see that splitting the data into as many as $m = 64$ independent pieces and averaging the solution vectors $\\theta_i$ estimated from each subsample $i$ yields a vector $\\hat{\\theta}$ whose estimate of $\\theta^*$ is no worse than twice the solution using all $N$ samples. We also see that the AVGM curve appears to increase roughly quadratically with $m$. This agrees with our theoretical predictions in Theorem 1. Indeed, setting $n = N/m$, we see that Theorem 1 implies $\\mathbb{E}[\\|\\overline{\\theta}_1 - \\theta^*\\|_2^2] = O(\\frac{1}{mn} + \\frac{1}{n^2}) = O(\\frac{1}{N} + \\frac{m^2}{N^2})$, which matches Figure 1. In addition, we see that the BAVGM algorithm enjoys somewhat more stable performance, with increasing benefit as the number of machines $m$ increases. We chose $r \\propto \\sqrt{d/n}$ for the BAVGM algorithm, as that choice appeared to give reasonable performance. (The optimal choice of $r$ remains an open question.)\n\nAs a check that our results are not simply consequences of the fact that the problems are easy to solve, even using a fraction $1/m$ of the data in a single machine, in Figure 2(a) we plot the estimation error $\\|\\hat{\\theta} - \\theta^*\\|_2^2$ of an estimate of $\\theta^*$ based on just a fraction $1/m$ of the data versus the number of machines/data splits $m$. Clearly, the average mixture approach dominates. (Figure 2(a) uses $d = 20$; larger dimensions are similar but more pronounced.)\n\n5 Experiments with advertising data\n\nPredicting whether a user of a search engine will click on an advertisement presented to him or her is of central importance to the business of several internet companies, and in this section, we present experiments studying the performance of the AVGM and BAVGM methods for this task. 
We use a large dataset from the Tencent search engine, soso.com [20], which contains 641,707 distinct advertisement items with $N = 235{,}582{,}879$ data samples. Each sample consists of a so-called impression, which is a list containing a user-issued search, the advertisement presented to the user, and a label $y \\in \\{+1, -1\\}$ indicating whether the user clicked on the advertisement. The ads in our dataset were presented to 23,669,283 distinct users.\n\nThe Tencent dataset provides a standard encoding to transform an impression into a usable set of regressors $x$. We list the features present in the data in Table 1 of the full version of this paper [22]. Each text-based feature is given a \u201cbag-of-words\u201d encoding [11]. Real-valued features are binned into a fixed number of intervals. When a feature falls into a particular bin, the corresponding entry of $x$ is assigned a 1, and otherwise assigned 0. This combination of encodings yields a binary-valued covariate vector $x \\in \\{0, 1\\}^d$ with $d = 741{,}725$ dimensions.\n\nOur goal is to predict the probability of a user clicking a given advertisement as a function of the covariates $x$. 
In order to do so, we use a logistic regression model to estimate the probability of a\nclick response $P(y = 1 \\mid x; \\theta) := \\frac{1}{1 + \\exp(-\\langle \\theta, x \\rangle)}$, where $\\theta \\in \\mathbb{R}^d$ is the unknown regression vector.\n\n[Figure 3 plots. Vertical axis: negative log-likelihood. Panel (a): AVGM, BAVGM (r = 0.1), and BAVGM (r = 0.25) versus number of machines m. Panel (b): SGD versus number of passes.]\n\nFigure 3: The negative log-likelihood of the output of the AVGM, BAVGM, and a stochastic gradient\ndescent method on the held-out dataset for the click-through prediction task. (a) Performance of the\nAVGM and BAVGM methods versus the number of splits m of the data. (b) Performance of the SGD\nbaseline as a function of number of passes through the entire dataset.\n\nWe use the negative logarithm of P as the loss, incorporating a ridge regularization penalty. This\ncombination yields the optimization objective\n\n$f(\\theta; (x, y)) = \\log(1 + \\exp(-y \\langle \\theta, x \\rangle)) + \\frac{\\lambda}{2} \\|\\theta\\|_2^2.$\n\nIn all our experiments, we use regularization parameter $\\lambda = 10^{-6}$, a choice obtained by\ncross-validation.\n\nFor this problem, we cannot evaluate the mean-squared error $\\|\\hat{\\theta} - \\theta^*\\|_2^2$, as we do not know the true\noptimal parameter \u03b8\u2217. Consequently, we evaluate the performance of an estimate $\\hat{\\theta}$ using log-loss\non a held-out dataset. Specifically, we perform a five-fold validation experiment, where we shuffle\nthe data and partition it into five equal-sized subsets. For each of our five experiments, we hold out\none partition to use as the test set, using the remaining data as the training set used for inference.\nWhen studying the AVGM or BAVGM method, we compute the local estimate $\\theta_i$ via a trust-region\nNewton-based method [15].\n\nThe dataset is too large to fit in main memory on most computers: in total, four splits of the data\nrequire 55 gigabytes. Consequently, it is difficult to provide an oracle training comparison using the\nfull N samples. Instead, for each experiment, we perform 10 passes of stochastic gradient descent\nthrough the dataset to get a rough baseline of the performance attained by the empirical minimizer\nfor the entire dataset. Figure 3(b) shows the hold-out set log-loss after each of the sequential passes\nthrough the training data finishes.\n\nIn Figure 3(a), we show the average hold-out set log-loss (with standard errors) of the estimator $\\theta_1$\nprovided by the AVGM method and the BAVGM method versus number of splits of the data m. The\nplot shows that for small m, both AVGM and BAVGM enjoy good performance, comparable to or\nbetter than (our proxy for) the oracle solution using all N samples. As the number of machines m\ngrows, the de-biasing provided by the subsampled bootstrap method yields substantial improvements\nover the standard AVGM method. In addition, even with m = 128 splits of the dataset, the BAVGM\nmethod gives better hold-out set performance than performing two passes of stochastic gradient descent on\nthe entire dataset of N samples. This is striking, as doing even one pass through the data with\nstochastic gradient descent is known to give minimax optimal convergence rates [16, 1].\n\nIt is instructive and important to understand the sensitivity of the BAVGM method to the resampling\nparameter r. We explore this question in Figure 2(b) using m = 128 splits. 
We choose m = 128\nbecause more data splits provide more variable performance in r. For the soso.com ad prediction\ndata set, the choice r = 0.25 achieves the best performance, but Figure 2(b) suggests that\nmisspecifying the ratio is not terribly detrimental. Indeed, while the performance of BAVGM degrades\nto that of the AVGM method, there is a wide range of r giving improved performance, and there does\nnot appear to be a phase transition to poor performance.\n\nAcknowledgments This work is based on research supported in part by the Office of Naval\nResearch under MURI grant N00014-11-1-0688. JCD was also supported by an NDSEG fellowship\nand a Facebook PhD fellowship.\n\nReferences\n\n[1] A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of convex optimization. IEEE Transactions on Information Theory, 58(5):3235\u20133249, May 2012.\n\n[2] A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems 25, 2011.\n\n[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n\n[4] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13:165\u2013202, 2012.\n\n[5] J. C. Duchi, A. Agarwal, and M. J. Wainwright. Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592\u2013606, 2012.\n\n[6] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, 1993.\n\n[7] P. Hall. The Bootstrap and Edgeworth Expansion. Springer, 1992.\n\n[8] E. Hazan, A. Kalai, S. Kale, and A. Agarwal. Logarithmic regret algorithms for online convex optimization. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, 2006.\n\n[9] B. Johansson, M. Rabi, and M. Johansson. 
A randomized incremental subgradient method for distributed optimization in networked systems. SIAM Journal on Optimization, 20(3):1157\u20131170, 2009.\n\n[10] G. Mann, R. McDonald, M. Mohri, N. Silberman, and D. Walker. Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models. In Advances in Neural Information Processing Systems 22, pages 1231\u20131239, 2009.\n\n[11] C. Manning, P. Raghavan, and H. Sch\u00fctze. Introduction to Information Retrieval. Cambridge University Press, 2008.\n\n[12] R. McDonald, K. Hall, and G. Mann. Distributed training strategies for the structured perceptron. In North American Chapter of the Association for Computational Linguistics (NAACL), 2010.\n\n[13] A. Nedi\u0107 and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54:48\u201361, 2009.\n\n[14] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574\u20131609, 2009.\n\n[15] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2006.\n\n[16] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838\u2013855, 1992.\n\n[17] S. S. Ram, A. Nedi\u0107, and V. V. Veeravalli. Distributed stochastic subgradient projection algorithms for convex optimization. Journal of Optimization Theory and Applications, 147(3):516\u2013545, 2010.\n\n[18] B. Recht, C. R\u00e9, S. Wright, and F. Niu. Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems 25, 2011.\n\n[19] H. Robbins. Asymptotically subminimax solutions of compound statistical decision problems. In Proceedings of the 2nd Berkeley Symposium on Mathematical Statistics and Probability, pages 131\u2013148, 1951.\n\n[20] G. Sun. 
KDD Cup track 2 soso.com ads prediction challenge, 2012. Accessed August 1, 2012.\n\n[21] A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998.\n\n[22] Y. Zhang, J. C. Duchi, and M. J. Wainwright. Communication-efficient algorithms for statistical optimization. arXiv:1209.4129 [stat.ML], 2012.\n\n[23] M. A. Zinkevich, A. Smola, M. Weimer, and L. Li. Parallelized Stochastic Gradient Descent. In Advances in Neural Information Processing Systems 24, 2010.", "award": [], "sourceid": 716, "authors": [{"given_name": "Yuchen", "family_name": "Zhang", "institution": null}, {"given_name": "Martin", "family_name": "Wainwright", "institution": null}, {"given_name": "John", "family_name": "Duchi", "institution": null}]}