{"title": "DETOX: A Redundancy-based Framework for Faster and More Robust Gradient Aggregation", "book": "Advances in Neural Information Processing Systems", "page_first": 10320, "page_last": 10330, "abstract": "To improve the resilience of distributed training to worst-case, or Byzantine node failures, several recent methods have replaced gradient averaging with robust aggregation methods. Such techniques can have high computational costs, often quadratic in the number of compute nodes, and only have limited robustness guarantees. Other methods have instead used redundancy to guarantee robustness, but can only tolerate limited numbers of Byzantine failures. In this work, we present DETOX, a Byzantine-resilient distributed training framework that combines algorithmic redundancy with robust aggregation. DETOX operates in two steps, a filtering step that uses limited redundancy to significantly reduce the effect of Byzantine nodes, and a hierarchical aggregation step that can be used in tandem with any state-of-the-art robust aggregation method. We show theoretically that this leads to a substantial increase in robustness, and has a per iteration runtime that can be nearly linear in the number of compute nodes. 
We provide extensive experiments over real distributed setups across a variety of large-scale machine learning tasks, showing that DETOX leads to orders of magnitude accuracy and speedup improvements over many state-of-the-art Byzantine-resilient approaches.", "full_text": "DETOX: A Redundancy-based Framework for Faster and More Robust Gradient Aggregation\n\nShashank Rajput*\n\nUniversity of Wisconsin-Madison\n\nrajput3@wisc.edu\n\nHongyi Wang*\n\nUniversity of Wisconsin-Madison\n\nhongyiwang@cs.wisc.edu\n\nZachary Charles\n\nUniversity of Wisconsin-Madison\n\nzcharles@math.wisc.edu\n\nDimitris Papailiopoulos\n\nUniversity of Wisconsin-Madison\n\ndimitris@papail.io\n\nAbstract\n\nTo improve the resilience of distributed training to worst-case, or Byzantine node\nfailures, several recent approaches have replaced gradient averaging with robust\naggregation methods. Such techniques can have high computational costs, often\nquadratic in the number of compute nodes, and only have limited robustness\nguarantees. Other methods have instead used redundancy to guarantee robustness,\nbut can only tolerate a limited number of Byzantine failures. In this work, we\npresent DETOX, a Byzantine-resilient distributed training framework that combines\nalgorithmic redundancy with robust aggregation. DETOX operates in two steps,\na filtering step that uses limited redundancy to significantly reduce the effect of\nByzantine nodes, and a hierarchical aggregation step that can be used in tandem\nwith any state-of-the-art robust aggregation method. We show theoretically that\nthis leads to a substantial increase in robustness, and has a per iteration runtime\nthat can be nearly linear in the number of compute nodes. 
We provide extensive\nexperiments over real distributed setups across a variety of large-scale machine\nlearning tasks, showing that DETOX leads to orders of magnitude accuracy and\nspeedup improvements over many state-of-the-art Byzantine-resilient approaches.\n\n*Authors contributed equally to this paper and are listed alphabetically.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n1 Introduction\n\nTo scale the training of machine learning models, gradient computations can often be distributed\nacross multiple compute nodes. After computing these local gradients, a parameter server (PS) then\naverages them, and updates a global model. As the scale of data and available compute power grows,\nso does the probability that some compute nodes output unreliable gradients. This can be due to\npower outages, faulty hardware, or communication failures, or due to security issues, such as the\npresence of an adversary governing the output of a compute node.\nDue to the difficulty in quantifying these different types of errors separately, we often model them\nas Byzantine failures. Such failures are assumed to be able to result in any output, adversarial or\notherwise. Unfortunately, the presence of a single Byzantine compute node can result in arbitrarily\nbad global models when aggregating gradients via their average [1].\nIn distributed training, there have generally been two distinct approaches to improve Byzantine\nrobustness. The first replaces the gradient averaging step at the PS with a robust aggregation step,\nsuch as the geometric median and variants thereof [1, 2, 3, 4, 5, 6]. The second approach instead\nassigns each node redundant gradients, and uses this redundancy to eliminate the effect of Byzantine\nfailures [7, 8, 9].\nBoth of the above approaches have their own limitations. 
For the first, robust aggregators are typically\nexpensive to compute and scale super-linearly (in many cases quadratically [10, 4]) with the number\nof compute nodes. Moreover, such methods often come with limited theoretical guarantees of\nByzantine robustness (e.g., only establishing convergence in the limit, or only guaranteeing that the\noutput of the aggregator has positive inner product with the true gradient [1, 10]) and often require\nstrong assumptions, such as bounds on the dimension of the model being trained. On the other hand,\nredundancy or coding-theoretic based approaches offer strong or even perfect recovery guarantees.\nUnfortunately, such approaches may, in the worst case, require each node to compute $\\Omega(q)$ times\nmore gradients, where q is the number of Byzantine machines [7]. This overhead is prohibitive in\nsettings with large numbers of Byzantine machines.\n\n[Figure 1 diagram: p compute nodes arranged in r-groups; a majority vote per r-group; aggregation over groups of votes; robust aggregation; model update.]\n\nFigure 1: DETOX is a hierarchical scheme for Byzantine gradient\naggregation. In its first step, the PS partitions the compute nodes into\ngroups and assigns every node in a group the same batch of data.\nAfter the nodes compute gradients with respect to this batch, the PS\ntakes a majority vote of their outputs. 
This filters out a large fraction\nof the Byzantine gradients. In the second step, the PS partitions the\nfiltered gradients into large groups, and applies a given aggregation\nmethod to each group. In the last step, the PS applies a robust\naggregation method (e.g., geometric median) to the previous outputs.\nThe final output is used to perform a gradient update step.\n\nFigure 2: Top: Convergence comparisons among various vanilla robust\naggregation methods and their DETOX paired versions under the \u201ca little\nis enough\u201d Byzantine attack [11]. Bottom: Per iteration runtime analysis of\nvarious methods. All results are for ResNet-18 trained on CIFAR-10. The\nprefix \u201cD-\u201d stands for a robust aggregation method paired with DETOX.\n\nOur contributions. In this work, we present DETOX, a Byzantine-resilient distributed training\nframework that first uses computational redundancy to filter out almost all Byzantine gradients, and\nthen performs a hierarchical robust aggregation method. DETOX is scalable, flexible, and is designed\nto be used on top of any robust aggregation method to obtain improved robustness and efficiency. A\nhigh-level description of the hierarchical nature of DETOX is given in Fig. 1.\nDETOX proceeds in three steps. First, the PS partitions the compute nodes into groups of r, and all\nnodes in a group compute the same gradients. While this step requires redundant computation at the node level, it will eventually\nallow for much faster computation at the PS level, as well as improved robustness. After all compute\nnodes send their gradients to the PS, the PS takes the majority vote of each group of gradients. We\nshow that by setting r to be logarithmic in the number of compute nodes, after the majority vote step\nonly a constant number of Byzantine gradients are still present, even if the number of Byzantine nodes\nis a constant fraction of the total number of compute nodes. 
DETOX then performs hierarchical robust\naggregation in two steps: First, it partitions the filtered gradients into a small number of groups, and\naggregates them using simple techniques such as averaging. Second, it applies any robust aggregator\n(e.g., geometric median [2, 6], BULYAN [10], MULTI-KRUM [4], etc.) to the averaged gradients to\nfurther minimize the effect of any remaining traces of the original Byzantine gradients.\n\nWe prove that DETOX can obtain orders of magnitude improved robustness guarantees compared to\nits competitors, and can achieve this at a nearly linear complexity in the number of compute nodes p,\nunlike methods like BULYAN [10] that require complexity that is quadratic in p. We extensively test\nour method in real distributed setups and large-scale settings, showing that by combining DETOX with\npreviously proposed Byzantine robust methods, such as MULTI-KRUM, BULYAN, and coordinate-wise\nmedian, we increase the robustness and reduce the overall runtime of the algorithm. Moreover,\nwe show that under strong Byzantine attacks, DETOX can lead to almost a 40% increase in accuracy\nover vanilla implementations of Byzantine-robust aggregation. A brief performance comparison with\nsome of the current state-of-the-art aggregators is shown in Fig. 2.\nRelated work. The topic of Byzantine fault tolerance has been extensively studied since the early\n80s by Lamport et al. [12], and deals with worst-case, and/or adversarial failures, e.g., system crashes,\npower outages, software bugs, and adversarial agents that exploit security flaws. In the context of\ndistributed optimization, these failures are manifested through a subset of compute nodes returning to\nthe master flawed or adversarial updates. 
It is now well understood that first-order methods, such\nas gradient descent or mini-batch SGD, are not robust to Byzantine errors; even a single erroneous\nupdate can introduce arbitrary errors to the optimization variables.\nByzantine-tolerant ML has been extensively studied in recent years [13, 14, 15, 16, 17, 2], establishing\nthat while average-based gradient methods are susceptible to adversarial nodes, median-based update\nmethods can in some cases achieve better convergence, while being robust to some attacks. Although\ntheoretical guarantees are provided in many works, the proposed algorithms in many cases only ensure\na weak form of resilience against Byzantine failures, and often fail against strong Byzantine attacks\n[10]. A stronger form of Byzantine resilience is desirable for most distributed machine learning\napplications. To the best of our knowledge, DRACO [7] and BULYAN [10] are the only proposed\nmethods that guarantee strong Byzantine resilience. However, as mentioned above, DRACO requires\nheavy redundant computation from the compute nodes, while BULYAN requires heavy computation\noverhead on the PS end.\nWe note that [18] presents an alternative approach that does not fit easily under either category, but\nrequires convexity of the underlying loss function. Finally, [19] examines the robustness of SIGNSGD\nwith a majority vote aggregation, but studies a restricted Byzantine failure setup that only allows for a\nblind multiplicative adversary.\n\n2 Problem Setup\n\nOur goal is to solve the following empirical risk minimization problem:\n$\\min_w F(w) := \\frac{1}{n} \\sum_{i=1}^{n} f_i(w)$, where $w \\in \\mathbb{R}^d$ denotes the parameters of a model, and $f_i$ is the loss function on the\n$i$-th training sample. To approximately solve this problem, we often use mini-batch SGD. First, we\ninitialize at some $w_0$. At iteration t, we sample $S_t$ uniformly at random from $\\{1, \\ldots, n\\}$, and then\nupdate via\n\n$$w_{t+1} = w_t - \\frac{\\eta_t}{|S_t|} \\sum_{i \\in S_t} \\nabla f_i(w_t), \\qquad (1)$$\n\nwhere $S_t$ is a randomly selected subset of the n data points. To perform mini-batch SGD in a\ndistributed manner, the global model $w_t$ is stored at the PS and updated according to (1), i.e., by\nusing the mean of gradients that are evaluated at the compute nodes.\nLet p denote the total number of compute nodes. At each iteration t, during distributed mini-batch\nSGD, the PS broadcasts $w_t$ to each compute node. Each compute node is assigned $S_{i,t} \\subseteq S_t$, and\nthen evaluates the mean of gradients $g_i = \\frac{1}{|S_{i,t}|} \\sum_{j \\in S_{i,t}} \\nabla f_j(w_t)$. The PS then updates the global\nmodel via $w_{t+1} = w_t - \\frac{\\eta_t}{p} \\sum_{i=1}^{p} g_i$. We note that in our setup we assume that the PS is the owner of\nthe data, and has access to the entire data set of size n.\nDistributed training with Byzantine nodes. We assume that a fixed subset Q of size q of the p\ncompute nodes are Byzantine. Let $\\hat{g}_i$ be the output of node i. If i is not Byzantine ($i \\notin Q$), we say it\nis \u201chonest\u201d, in which case its output is $\\hat{g}_i = g_i$, where $g_i$ is the true mean of gradients assigned to node i.\nIf i is Byzantine ($i \\in Q$), its output $\\hat{g}_i$ can be any d-dimensional vector. The PS receives $\\{\\hat{g}_i\\}_{i=1}^{p}$,\nand can then process these vectors to produce some approximation to the true gradient update in (1).\nWe make no assumptions on the Byzantine outputs. In particular, we allow adversaries with full\ninformation about F and $w_t$, and we allow the Byzantine compute nodes to collude. Let $\\epsilon = q/p$ be the\nfraction of Byzantine nodes. We will assume $\\epsilon < 1/2$ throughout.\n\n3 DETOX: A Redundancy Framework to Filter most Byzantine Gradients\n\nWe now describe DETOX, a framework for Byzantine-resilient mini-batch SGD with p nodes, q of\nwhich are Byzantine. Let $b \\geq p$ be the desired batch-size, and let r be an odd integer. We refer to r as\nthe redundancy ratio. For simplicity, we will assume r divides p and that p divides b. 
DETOX can be\ndirectly extended to the setting where this does not hold.\nDETOX first computes a random partition of [p] into p/r node groups $A_1, \\ldots, A_{p/r}$, each of size r.\nThis will be fixed throughout. We then initialize at some $w_0$. For $t \\geq 0$, we wish to compute some\napproximation to the gradient update in (1). To do so, we need a Byzantine-robust estimate of the true\ngradient. Fix t, and let us suppress the notation t when possible. As in mini-batch SGD, let S be a\nsubset of [n] of size b, with each element sampled uniformly at random from [n]. We then partition\nS into groups $S_1, \\ldots, S_{p/r}$ of size br/p. For each $i \\in A_j$, the PS assigns node i the task of computing\n\n$$g_j := \\frac{1}{|S_j|} \\sum_{k \\in S_j} \\nabla f_k(w) = \\frac{p}{rb} \\sum_{k \\in S_j} \\nabla f_k(w). \\qquad (2)$$\n\nIf i is an honest node, then its output is $\\hat{g}_i = g_j$, while if i is Byzantine, it outputs some d-dimensional\n$\\hat{g}_i$, which is then sent to the PS. The PS then computes $z_j := \\mathrm{maj}(\\{\\hat{g}_i \\mid i \\in A_j\\})$, where maj denotes\nthe majority vote. If there is no majority, we set $z_j = 0$. We will refer to $z_j$ as the \u201cvote\u201d of group j.\nSince some of these votes are still Byzantine, we must do some robust aggregation of the votes.\nWe employ a hierarchical robust aggregation process HIER-AGGR, which uses two user-specified\naggregation methods $A_0$ and $A_1$. First, the votes are partitioned into k groups. Let $\\hat{z}_1, \\ldots, \\hat{z}_k$ denote\nthe output of $A_0$ on each group. The PS then computes $\\hat{G} = A_1(\\hat{z}_1, \\ldots, \\hat{z}_k)$ and updates the model\nvia $w = w - \\eta \\hat{G}$. This hierarchical aggregation resembles a median of means approach on the votes\n[20], and has the benefit of improved robustness and efficiency. We discuss this in further detail in\nSection 4. A description of DETOX is given in Algorithm 1.\n\nAlgorithm 1 DETOX: Algorithm to be performed at the parameter server\ninput Batch size b, redundancy ratio r, compute nodes 1, ..., p, step sizes $\\{\\eta_t\\}_{t \\geq 0}$.\n1: Randomly partition [p] into \u201cnode groups\u201d $\\{A_j \\mid 1 \\leq j \\leq p/r\\}$ of size r.\n2: for t = 0 to T do\n3:     Draw $S_t$ of size b randomly from [n].\n4:     Partition $S_t$ into groups $\\{S_{t,j} \\mid 1 \\leq j \\leq p/r\\}$ of size rb/p.\n5:     For each $j \\in [p/r]$, $i \\in A_j$, push $w_t$ and $S_{t,j}$ to compute node i.\n6:     Receive the (potentially Byzantine) p gradients $\\hat{g}_{t,i}$ from each node.\n7:     Let $z_{t,j} := \\mathrm{maj}(\\{\\hat{g}_{t,i} \\mid i \\in A_j\\})$, and 0 if no majority exists.    % Filtering step\n8:     Set $\\hat{G}_t = \\text{HIER-AGGR}(\\{z_{t,1}, \\ldots, z_{t,p/r}\\})$.    % Hierarchical aggregation\n9:     Set $w_{t+1} = w_t - \\eta \\hat{G}_t$.    % Gradient update\n10: end for\n\nAlgorithm 2 HIER-AGGR: Hierarchical aggregation\ninput Aggregators $A_0$, $A_1$, votes $\\{z_1, \\ldots, z_{p/r}\\}$, vote group size k.\n1: Let $\\hat{p} := p/r$.\n2: Randomly partition $\\{z_1, \\ldots, z_{\\hat{p}}\\}$ into k \u201cvote groups\u201d $\\{Z_j \\mid 1 \\leq j \\leq k\\}$ of size $\\hat{p}/k$.\n3: For each vote group $Z_j$, calculate $\\hat{z}_j = A_0(Z_j)$.\n4: Return $A_1(\\{\\hat{z}_1, \\ldots, \\hat{z}_k\\})$.\n\n3.1 Filtering out Almost Every Byzantine Node\nWe now show that DETOX filters out the vast majority of Byzantine gradients. Fix the iteration t.\nRecall that all honest nodes in a node group $A_j$ send $\\hat{g}_j = g_j$ as in (2) to the PS. If $A_j$ has more\nhonest nodes than Byzantine nodes then $z_j = g_j$ and we say $z_j$ is honest. If not, then $z_j$ may not\nequal $g_j$, in which case $z_j$ is a Byzantine vote. Let $X_j$ be the indicator variable for whether block $A_j$\nhas more Byzantine nodes than honest nodes, and let $\\hat{q} = \\sum_j X_j$. This is the number of Byzantine\nvotes. By filtering, DETOX goes from a Byzantine compute node ratio of $\\epsilon = q/p$ to a Byzantine vote\nratio of $\\hat{\\epsilon} = \\hat{q}/\\hat{p}$, where $\\hat{p} = p/r$.\nWe first show that $E[\\hat{q}]$ decreases exponentially with r, while $\\hat{p}$ only decreases linearly with r. 
That\nis, by incurring a constant factor loss in compute resources, we gain an exponential improvement in\nthe reduction of Byzantine nodes. Thus, even small r can drastically reduce the Byzantine ratio of\nvotes. This observation will allow us to instead use robust aggregation methods on the $z_j$, i.e., the\nvotes, greatly improving our Byzantine robustness. We have the following theorem about $E[\\hat{q}]$. All\nproofs can be found in the appendix. Note that throughout, we did not focus on optimizing constants.\nTheorem 1. There is a universal constant c such that if the fraction of Byzantine nodes is $\\epsilon < c$, then\nthe effective number of Byzantine votes after filtering satisfies $E[\\hat{q}] = O(\\epsilon^{(r-1)/2} q / r)$.\nWe now wish to use this to derive high probability bounds on $\\hat{q}$. While the variables $X_j$ are not\nindependent, they are negatively correlated. By using a version of Hoeffding\u2019s inequality for weakly\ndependent variables, we can show that if the redundancy is logarithmic, i.e., $r \\approx \\log(q)$, then with\nhigh probability the number of effective Byzantine votes drops to a constant, i.e., $\\hat{q} = O(1)$.\nCorollary 2. There is a constant c such that if $\\epsilon \\leq c$ and $r \\geq 3 + 2 \\log_2(q)$, then for any\n$\\delta \\in (0, 1/2)$, with probability at least $1 - \\delta$, we have that $\\hat{q} \\leq 1 + 2 \\log(1/\\delta)$.\nIn the next section, we exploit this dramatic reduction of Byzantine votes to derive strong robustness\nguarantees for DETOX.\n\n4 DETOX Improves the Speed and Robustness of Robust Estimators\n\nUsing the results of the previous section, if we set the redundancy ratio to $r \\approx \\log(q)$, the filtering\nstage of DETOX reduces the number of Byzantine votes $\\hat{q}$ to roughly a constant. While we could\napply some robust aggregator A directly to the output votes of the filtering stage, such methods often\nscale poorly with the number of votes $\\hat{p}$. By instead applying HIER-AGGR, we greatly improve\nefficiency and robustness. 
Recall that in HIER-AGGR, we partition the votes into k \u201cvote groups\u201d,\napply some $A_0$ to each group, and apply some $A_1$ to the k outputs of $A_0$. We analyze the case where\nk is roughly constant, $A_0$ computes the mean of its inputs, and $A_1$ is a robust aggregator. In this case,\nHIER-AGGR is analogous to the Median of Means (MoM) method from robust statistics [20].\nImproved speed. Suppose that without redundancy, the time required for the compute nodes to\nfinish is T. Applying KRUM [1], MULTI-KRUM [4], and BULYAN [10] to their p outputs requires\n$O(p^2 d)$ operations, so their overall runtime is $O(T + p^2 d)$. In DETOX, the compute nodes require\nr times more computation to evaluate redundant gradients. If $r \\approx \\log(q)$, this can be done in\n$O(\\ln(q) T)$. With HIER-AGGR as above, DETOX performs three major operations: (1) majority\nvoting, (2) mean computation of the k vote groups and (3) robust aggregation of these k means\nusing $A_1$. (1) and (2) require $O(pd)$ time. For practical $A_1$ aggregators, including MULTI-KRUM\nand BULYAN, (3) requires $O(k^2 d)$ time. Since $k \\ll p$, DETOX has runtime $O(\\ln(q) T + pd)$. If\n$T = O(d)$ (which generally holds for gradient computations), KRUM, MULTI-KRUM, and BULYAN\nrequire $O(p^2 d)$ time, but DETOX only requires $O(pd)$ time. Thus, DETOX can lead to significant\nspeedups, especially when the number of workers is large.\nImproved robustness. To analyze robustness, we first need some distributional assumptions. At a\ngiven iteration, let G denote the full gradient of F(w). Throughout this section, we assume that the\ngradient of each sample is drawn from a distribution D on $\\mathbb{R}^d$ with mean G and covariance $\\Sigma$. Let\n$\\sigma^2 = \\mathrm{Tr}(\\Sigma)$; we will refer to this as the variance. In DETOX, the \u201chonest\u201d votes $z_i$ will also have mean G,\nbut their variance will be $\\sigma^2 p/(rb)$. This is because each honest compute node gets rb/p samples, so its\nvariance is reduced by a factor of rb/p. 
Note that this variance reduction is integral in proving that we achieve\noptimal rates (see Theorem 3 and the discussion after it). To see this intuitively, consider a scenario\nwithout Byzantine machines; then the variance of the empirical mean is $\\sigma^2/b$. A simple calculation\nshows that the variance of the mean of each \u201cvote group\u201d is $\\frac{\\sigma^2 p/(rb)}{\\hat{p}/k} = k\\sigma^2/b$, where k is the number of\nvote groups. Thus, if k is small, we are still able to optimally reduce the variance.\nSuppose $\\hat{G}$ is some approximation to the true gradient G. We say that $\\hat{G}$ is a $\\Delta$-inexact gradient\noracle for G if $\\|\\hat{G} - G\\| \\leq \\Delta$. [5] shows that access to a $\\Delta$-inexact gradient oracle is sufficient to\nupper bound the error of a model $\\hat{w}$ produced by performing gradient updates with $\\hat{G}$. Thus, to bound\nthe robustness of an aggregator, it suffices to bound $\\Delta$. Under the distributional assumptions above,\nwe will derive bounds on $\\Delta$ for the hierarchical aggregator A with different base aggregators $A_1$.\nWe will analyze DETOX when $A_0$ computes the mean of the vote groups, and $A_1$ is geometric\nmedian, coordinate-wise median, or $\\alpha$-trimmed mean [6]. We will denote the approximation $\\hat{G}$ to\nG computed by DETOX in these three instances by $\\hat{G}_1$, $\\hat{G}_2$ and $\\hat{G}_3$, respectively. Using proof\ntechniques similar to [20], we get the following.\nTheorem 3. Assume $r \\geq 3 + 2 \\log_2(q)$ and $\\epsilon \\leq c$, where c is the constant from Corollary 2. There\nare constants $c_1, c_2, c_3$ such that for all $\\delta \\in (0, 1/2)$, with probability at least $1 - 2\\delta$:\n1. If $k = 128 \\ln(1/\\delta)$, then $\\hat{G}_1$ is a $c_1 \\sigma \\sqrt{\\ln(1/\\delta)/b}$-inexact gradient oracle.\n2. If $k = 128 \\ln(d/\\delta)$, then $\\hat{G}_2$ is a $c_2 \\sigma \\sqrt{\\ln(d/\\delta)/b}$-inexact gradient oracle.\n3. If $k = 128 \\ln(d/\\delta)$ and $\\alpha = 1/4$, then $\\hat{G}_3$ is a $c_3 \\sigma \\sqrt{\\ln(d/\\delta)/b}$-inexact gradient oracle.\n\nThe above theorem has three important implications. First, we can derive robustness guarantees\nfor DETOX that are virtually independent of the Byzantine ratio $\\epsilon$. 
Second, even when there are no\nByzantine machines, it is known that no aggregator can achieve $\\Delta = o(\\sigma/\\sqrt{b})$ [21], and because we\nachieve $\\Delta = \\tilde{O}(\\sigma/\\sqrt{b})$, we cannot expect order-wise better robustness from any other aggregator.\nThird, other than a logarithmic dependence on q, there is no dependence on the number of nodes p.\nEven as p and q increase, we still maintain roughly the same robustness guarantees.\nBy comparison, the robustness guarantees of KRUM and Geometric Median applied directly to the\ncompute nodes worsen as p increases [17, 3]. Similarly, [6] show that if we apply coordinate-wise\nmedian to p nodes, each of which is assigned b/p samples, we get a $\\Delta$-inexact gradient oracle where\n$\\Delta = O(\\sigma\\sqrt{\\epsilon p/b} + \\sigma\\sqrt{d/b})$. If $\\epsilon$ is constant and p is comparable to b, then this is roughly $\\sigma$, whereas\nDETOX can produce a $\\Delta$-inexact gradient oracle for $\\Delta = \\tilde{O}(\\sigma/\\sqrt{b})$. Thus, the robustness of DETOX\ncan scale much better with the number of nodes than naive robust aggregation of gradients.\n\n5 Experiments\n\nIn this section we present an experimental study on pairing DETOX with a set of previously proposed\nrobust aggregation methods, including MULTI-KRUM [17], BULYAN [10], and coordinate-wise median\n[5]. We also incorporate DETOX with a recently proposed Byzantine resilient distributed training\nmethod, i.e., SIGNSGD with majority vote [19]. We conduct extensive experiments on the scalability\nand robustness of these Byzantine-resilient methods, and the improvements gained when pairing them\nwith DETOX. All our experiments are deployed on real distributed clusters under various Byzantine\nattack models. 
Our implementation is publicly available for reproducibility at https://github.com/hwang595/DETOX.\n\n5.1 Experimental Setup\nThe main findings are as follows: 1) Applying DETOX leads to significant speedups, e.g., up to an\norder of magnitude end-to-end training speedup is observed; 2) in defending against state-of-the-art\nByzantine attacks, DETOX leads to significant Byzantine-resilience improvement, e.g., applying\nBULYAN on top of DETOX improves the test-set prediction accuracy from 11% to 60% when training\nVGG13-BN on CIFAR-100 under the \u201ca little is enough\u201d (ALIE) [11] Byzantine attack. Moreover,\nincorporating SIGNSGD with DETOX improves the test set prediction accuracy from 34.92% to\n78.75% when defending against a constant Byzantine attack for ResNet-18 trained on CIFAR-10.\nWe implemented vanilla versions of the aforementioned Byzantine resilient methods, as well as\nversions of these methods paired with DETOX, in PyTorch [22] with MPI4py [23]. Our experiments\nare deployed on a cluster of 46 m5.2xlarge instances on Amazon EC2, where 1 node serves as the\nPS and the remaining p = 45 nodes are compute nodes. In all the following experiments, we set the\nnumber of Byzantine nodes to be q = 5. We also study the performance of all considered methods\nwith a smaller number of (and without) Byzantine nodes; the results can be found in Appendix B.6.\n\n5.2 Implementation of DETOX\n\nIn DETOX, the 45 compute nodes are randomly partitioned into node groups of size r = 3, which\ngives p/r = 15 node groups. Batch size b is set to 1,440. In each iteration of the vanilla Byzantine\nresilient methods, each compute node evaluates b/p = 32 gradients sampled from its partition of data,\nwhile in DETOX each node evaluates $r\\times$ more gradients, i.e., rb/p = 96, which makes DETOX $r\\times$\nmore computationally expensive than the vanilla Byzantine resilient methods. 
Compute nodes in the\nsame node group evaluate the same gradients to create algorithmic redundancy for the majority voting\nstage in DETOX. The mean of these locally computed gradients is sent back to the PS. Note that\nalthough DETOX requires each compute node to evaluate $r\\times$ more gradients, the communication cost\nof DETOX is the same as the vanilla Byzantine resilient methods since only the gradient means are\ncommunicated instead of individual gradients. After receiving all gradient means from the compute\nnodes, the PS uses either vanilla Byzantine-resilient methods or their DETOX paired variants.\n\nFigure 3: Results of VGG13-BN on CIFAR-100. Left: Convergence performance of various robust aggregation\nmethods against ALIE attack. Right: Per iteration runtime analysis of various robust aggregation methods.\n\nWe emphasize that DETOX is not simply a new robust aggregation technique. It is instead a general\nByzantine-resilient distributed training framework, and any robust aggregation method can be\nimmediately implemented on top of it to increase its Byzantine-resilience and scalability. Note that\nafter the majority voting stage on the PS one has a wide range of choices for $A_0$ and $A_1$. In our\nimplementations, we had the following setups: 1) $A_0$ = Mean, $A_1$ = Coordinate-wise Median, 2)\n$A_0$ = MULTI-KRUM, $A_1$ = Mean, 3) $A_0$ = BULYAN, $A_1$ = Mean, and 4) $A_0$ = coordinate-wise\nmajority vote, $A_1$ = coordinate-wise majority vote (designed specifically for pairing DETOX with\nSIGNSGD). We tried $A_0$ = Mean and $A_1$ = MULTI-KRUM/BULYAN but we found that setups 2)\nand 3) had better resilience than these choices. More details on the implementation and system-level\noptimizations that we performed can be found in Appendix B.1.\n\nByzantine attack models. We consider two Byzantine attack models for pairing MULTI-KRUM,\nBULYAN, and coordinate-wise median with DETOX. 
First, we consider the \u201creversed gradient\u201d attack,\nwhere Byzantine nodes that were supposed to send $g \\in \\mathbb{R}^d$ to the PS instead send $-cg$, for some\n$c > 0$. Secondly, we study the recently proposed ALIE [11] attack, where the Byzantine compute\nnodes collude and use their locally calculated gradients to estimate the coordinate-wise mean and\nstandard deviation of the entire set of gradients of all other compute nodes. The Byzantine nodes\nthen use the estimated mean and standard deviation to manipulate the gradient they send back to the PS. To\nbe more specific, Byzantine nodes will send $\\hat{\\mu}_i + z \\cdot \\hat{\\sigma}_i, \\forall i \\in [d]$, where $\\hat{\\mu}$ and $\\hat{\\sigma}$ are the estimated\ncoordinate-wise mean and standard deviation of each gradient dimension and z is a hyper-parameter\nwhich was tuned empirically in [11]. Finally, to compare the resilience of the vanilla SIGNSGD\nand the one paired with DETOX, we consider the \u201cconstant Byzantine attack\u201d where Byzantine\ncompute nodes send a constant gradient matrix with dimension same as that of the true gradient but\nall elements set to 1.\nDatasets and models. Our experiments are over ResNet-18 [24] on CIFAR-10 and VGG13-BN\n[25] on CIFAR-100. For each dataset, we use data augmentation (random crops and flips) and image\nnormalization. Also, we tune the learning rate schedules and use a constant momentum of 0.9 in all\nexperiments. 
The details of parameter tuning and dataset normalization are in Appendix B.2.\n\n(a) ResNet-18, MULTI-KRUM\n\n(b) ResNet-18, BULYAN\n\n(c) ResNet-18, Coord-Median\n\n(d) VGG13-BN, MULTI-KRUM\n\n(e) VGG13-BN, BULYAN\n\n(f) VGG13-BN, Coord-Median\n\nFigure 4: End-to-end comparisons between DETOX paired with different baseline methods under reverse gradient\nattack. (a)-(c): Vanilla vs. DETOX paired version of MULTI-KRUM, BULYAN, and coordinate-wise median\non ResNet-18 trained on CIFAR-10. (d)-(f): Same comparisons for VGG13-BN trained on CIFAR-100.\n\n(a) ResNet-18, CIFAR-10\n\n(b) VGG13-BN, CIFAR-100\n\nFigure 5: Speedups in converging to given accuracies for vanilla robust aggregation methods and their\nDETOX-paired variants under reverse gradient attack: (a) ResNet-18 on CIFAR-10, (b) VGG13-BN on CIFAR-100\n\n5.3 Results\nScalability. We report the per-iteration runtime of all considered robust aggregation methods and their DETOX\npaired variants for both ResNet-18 on CIFAR-10 and VGG13-BN on CIFAR-100. The results on\nResNet-18 and VGG13-BN are shown in Figures 2 and 3. We observe that although DETOX requires\nslightly more compute time per iteration, due to its algorithmic redundancy as explained in Section\n5.2, it largely reduces the PS computation cost during the aggregation stage, which matches our\ntheoretical analysis. Surprisingly, we observe that by applying DETOX, the communication costs\ndecrease. This is because the variance of computation time among compute nodes increases with\nheavier computational redundancy. 
Therefore, after applying DETOX, compute nodes tend not to send\ntheir gradients to the PS at the same time, which mitigates potential network bandwidth congestion.\nIn a nutshell, applying DETOX can lead to up to a 3\u00d7 per-iteration speedup.\nByzantine-resilience under various attacks We first study the Byzantine-resilience of all considered\nmethods under the ALIE attack, which, to the best of our knowledge, is the strongest Byzantine\nattack proposed in the literature. The results on ResNet-18 and VGG13-BN are shown in Figures\n2 and 3, respectively. Applying DETOX leads to a significant improvement in Byzantine-resilience\ncompared to vanilla MULTI-KRUM, BULYAN, and coordinate-wise median on both datasets, as shown\nin Table 1. We then consider the reverse gradient attack; the results are shown in Figure 4. Since\nreverse gradient is a much weaker attack, all vanilla robust aggregation methods and their DETOX-paired\nvariants defend well. Moreover, applying DETOX leads to significant end-to-end speedups.\nIn particular, combining the coordinate-wise median with DETOX led to a 5\u00d7 speedup in the\ntime needed to reach 90% test set prediction accuracy for ResNet-18 trained on CIFAR-10.\nThe speedup results are shown in Figure 5. For VGG13-BN trained on CIFAR-100, an order of\nmagnitude end-to-end speedup is observed when coordinate-wise median is applied on top of DETOX.\n\nTable 1: Defense results summary for ALIE attacks [11]; the reported numbers are test set prediction accuracy.\n\n           D-MULTI-KRUM  D-BULYAN  D-Med.  MULTI-KRUM  BULYAN  Med.\nResNet-18  80.3%         76.8%     86.21%  45.24%      42.56%  43.7%\nVGG13-BN   42.98%        46.82%    59.51%  17.18%      11.06%  8.64%\n\n(a) ResNet-18 on CIFAR-10\n(b) VGG13-BN on CIFAR-100\n\nFigure 6: Convergence comparisons between DETOX paired with SIGNSGD and vanilla SIGNSGD under the constant\nByzantine attack on: (a) ResNet-18 trained on CIFAR-10; (b) VGG13-BN trained on CIFAR-100.\n\nComparison between DETOX and SIGNSGD We compare DETOX-paired SIGNSGD with vanilla\nSIGNSGD, where only the sign of each gradient coordinate is sent to the PS. The PS, on receiving\nthese gradient signs, takes coordinate-wise majority votes to get the model update. We consider the\nstronger constant Byzantine attack introduced in Section 5.2. The details of our implementation and\nthe hyper-parameters used are in Appendix B.4. The results on both datasets are shown\nin Figure 6, where we see that DETOX paired with SIGNSGD improves the Byzantine resilience of\nSIGNSGD significantly. 
For ResNet-18 trained on CIFAR-10, DETOX improves the test set prediction\naccuracy of vanilla SIGNSGD from 34.92% to 78.75%, while for VGG13-BN trained on CIFAR-100,\nDETOX improves the test set prediction accuracy (TOP-1) of vanilla SIGNSGD from 2.12% to 40.37%.\nFor completeness, we also compare DETOX with DRACO [7]. This is not the focus of this work, as we are\nprimarily interested in showing that DETOX improves the robustness of traditional robust aggregators;\nthe comparisons with DRACO are in Appendix B.7. Another experimental study, on a mean\nestimation task over synthetic data that directly matches our theory, can be found in Appendix B.5.\n\n6 Conclusion\n\nIn this paper, we present DETOX, a new framework for Byzantine-resilient distributed training.\nNotably, any robust aggregator can be immediately used with DETOX to increase its robustness and\nefficiency. We demonstrate these improvements theoretically and empirically. In the future, we\nwould like to devise a privacy-preserving version of DETOX, as it currently requires the PS to be\nthe owner of the data and to partition the data among compute nodes, which hurts data privacy.\nOvercoming this limitation would allow us to develop variants of DETOX for federated learning.\n\nAcknowledgments\nThis research is supported by an NSF CAREER Award #1844951, a Sony Faculty Innovation Award,\nan AFOSR & AFRL Center of Excellence Award FA9550-18-1-0166, and an NSF TRIPODS Award\n#1740707. The authors also thank Ankit Pensia for useful discussions about the Median of Means\napproach.\n\nReferences\n[1] Peva Blanchard, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer. Machine learning\nwith adversaries: Byzantine tolerant gradient descent. 
In Advances in Neural Information\nProcessing Systems 30: Annual Conference on Neural Information Processing Systems 2017,\n4-9 December 2017, Long Beach, CA, USA, pages 118\u2013128, 2017.\n\n[2] Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial\nsettings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of\nComputing Systems, 1(2):44, 2017.\n\n[3] Cong Xie, Oluwasanmi Koyejo, and Indranil Gupta. Generalized byzantine-tolerant sgd. arXiv\npreprint arXiv:1802.10116, 2018.\n\n[4] Georgios Damaskinos, El Mahdi El Mhamdi, Rachid Guerraoui, Arsany Guirguis, and\nS\u00e9bastien Rouault. Aggregathor: Byzantine machine learning via robust gradient aggregation.\nConference on Systems and Machine Learning, 2019.\n\n[5] Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter Bartlett. Defending against saddle\npoint attack in byzantine-robust distributed learning. CoRR, abs/1806.05358, 2018.\n\n[6] Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter Bartlett. Byzantine-robust distributed\nlearning: Towards optimal statistical rates. In International Conference on Machine Learning,\npages 5636\u20135645, 2018.\n\n[7] Lingjiao Chen, Hongyi Wang, Zachary Charles, and Dimitris Papailiopoulos. Draco: Byzantine-resilient\ndistributed training via redundant gradients. In International Conference on Machine\nLearning, pages 902\u2013911, 2018.\n\n[8] Deepesh Data, Linqi Song, and Suhas Diggavi. Data encoding for byzantine-resilient distributed\ngradient descent. In 2018 56th Annual Allerton Conference on Communication, Control, and\nComputing (Allerton), pages 863\u2013870. IEEE, 2018.\n\n[9] Qian Yu, Netanel Raviv, Jinhyun So, and A Salman Avestimehr. Lagrange coded computing:\nOptimal design for resiliency, security and privacy. arXiv preprint arXiv:1806.00939, 2018.\n\n[10] El Mahdi El Mhamdi, Rachid Guerraoui, and S\u00e9bastien Rouault. The hidden vulnerability of\ndistributed learning in byzantium. 
arXiv preprint arXiv:1802.07927, 2018.\n\n[11] Moran Baruch, Gilad Baruch, and Yoav Goldberg. A little is enough: Circumventing defenses\nfor distributed learning. arXiv preprint arXiv:1902.06156, 2019.\n\n[12] Leslie Lamport, Robert Shostak, and Marshall Pease. The byzantine generals problem. ACM\nTransactions on Programming Languages and Systems (TOPLAS), 4(3):382\u2013401, 1982.\n\n[13] El-Mahdi El-Mhamdi, Rachid Guerraoui, Arsany Guirguis, and Sebastien Rouault. Sgd:\nDecentralized byzantine resilience. arXiv preprint arXiv:1905.03853, 2019.\n\n[14] Cong Xie, Oluwasanmi Koyejo, and Indranil Gupta. Zeno: Byzantine-suspicious stochastic\ngradient descent. arXiv preprint arXiv:1805.10032, 2018.\n\n[15] Cong Xie, Sanmi Koyejo, and Indranil Gupta. Fall of empires: Breaking byzantine-tolerant sgd\nby inner product manipulation. arXiv preprint arXiv:1903.03936, 2019.\n\n[16] El-Mahdi El-Mhamdi and Rachid Guerraoui. Fast and secure distributed learning in high\ndimension. arXiv preprint arXiv:1905.04374, 2019.\n\n[17] Peva Blanchard, Rachid Guerraoui, Julien Stainer, et al. Machine learning with adversaries:\nByzantine tolerant gradient descent. In Advances in Neural Information Processing Systems,\npages 119\u2013129, 2017.\n\n[18] Dan Alistarh, Zeyuan Allen-Zhu, and Jerry Li. Byzantine stochastic gradient descent. In\nS. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors,\nAdvances in Neural Information Processing Systems 31, pages 4618\u20134628. Curran Associates,\nInc., 2018.\n\n[19] Jeremy Bernstein, Jiawei Zhao, Kamyar Azizzadenesheli, and Anima Anandkumar. signsgd\nwith majority vote is communication efficient and fault tolerant. arXiv, 2018.\n\n[20] Stanislav Minsker et al. Geometric median and robust estimation in banach spaces. Bernoulli,\n21(4):2308\u20132335, 2015.\n\n[21] G\u00e1bor Lugosi, Shahar Mendelson, et al. Sub-gaussian estimators of the mean of a random\nvector. 
The Annals of Statistics, 47(2):783\u2013794, 2019.\n\n[22] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito,\nZeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in\npytorch. 2017.\n\n[23] Lisandro D Dalcin, Rodrigo R Paz, Pablo A Kler, and Alejandro Cosimo. Parallel distributed\ncomputing using python. Advances in Water Resources, 34(9):1124\u20131139, 2011.\n\n[24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image\nrecognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,\npages 770\u2013778, 2016.\n\n[25] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale\nimage recognition. arXiv preprint arXiv:1409.1556, 2014.\n\n[26] Nathan Linial and Zur Luria. Chernoff\u2019s inequality - a very elementary proof. arXiv preprint\narXiv:1403.7739, 2014.\n\n[27] J. Ramon C. Pelekis. Hoeffding\u2019s inequality for sums of weakly dependent random variables.\nMediterranean Journal of Mathematics, 2017.\n\n[28] Georgios Damaskinos, El Mahdi El Mhamdi, Rachid Guerraoui, Arsany Guirguis, and S\u00e9bastien\nRouault. Aggregathor: Byzantine machine learning via robust gradient aggregation. In SysML,\n2019.", "award": [], "sourceid": 5443, "authors": [{"given_name": "Shashank", "family_name": "Rajput", "institution": "University of Wisconsin - Madison"}, {"given_name": "Hongyi", "family_name": "Wang", "institution": "University of Wisconsin-Madison"}, {"given_name": "Zachary", "family_name": "Charles", "institution": "University of Wisconsin - Madison"}, {"given_name": "Dimitris", "family_name": "Papailiopoulos", "institution": "University of Wisconsin-Madison"}]}