Review for NeurIPS paper: Election Coding for Distributed Learning: Protecting SignSGD against Byzantine Attacks

NeurIPS 2020

Election Coding for Distributed Learning: Protecting SignSGD against Byzantine Attacks

Review 1

Summary and Contributions: This paper addresses the problem of designing first-order optimization methods that are both communication efficient and robust to byzantine workers. In particular, the paper focuses on an existing variant of SignSGD, namely SignSGD with majority voting (SignSGD-MV), which is already communication efficient by design. The paper proposes a new coding theoretic approach to make SignSGD-MV robust to byzantine workers. In a regular SignSGD-MV method, each of the n workers computes a gradient estimate based on the data partition assigned to it and sends a sign of the gradient estimate to the master node. The master node takes the coordinate wise majority of the signed gradient estimates received from all the workers to obtain the final signed gradient estimate. However, the byzantine workers may send adversarially generated signed gradient estimates so that the majority operation at the master end up computing the incorrect final signed gradient estimate. This paper proposed a novel election coding framework where each worker node is assigned multiple data partitions. Each worker then sends the coordinate wise majority of the local signed gradient estimates based on its partitions to the master. The master then performs the coordinate wise majority of the signed vectors (outputs of the local majority operations) received from the workers. The paper shows that with a careful design of the data partition allocation to the workers can ensure that the master computes the correct final signed gradient estimate despite the presence of the byzantine workers. The paper studies two allocation schemes (coding schemes): 1) random Bernoulli codes that provide byzantine tolerance with high probability, and 2) deterministic codes that provide perfect byzantine tolerance. The paper then evaluates the proposed coding schemes on Amazon EC2. For two combinations of datasets and model architecture, the paper empirically shows that the proposed schemes significantly improve upon the regular (uncoded) SignSGD-MV. ######### Post author response phase ########### Thank you for taking the time to address my comments. I have updated my score accordingly. I would encourage the authors to incorporate their responses and new results (from their response) to the main body of the revised version.

Strengths: The paper studies a timely problem of designing communication efficient and robust distributed optimization methods. The paper proposes the novel framework of election coding and presents two coding schemes under this framework along with their theoretical analysis. The paper carries out experiments on the real systems and demonstrates the utility of the proposed coding schemes over the natural baseline of regular (uncoded) SignSGD-MV.

Weaknesses: The proposed scheme significantly increases the amount of redundant computation performed at the workers. The authors should discuss this point in detail in the main text. For example, for the deterministic scheme, if one takes b=1, Theorem 3 shows that, for large n, computational redundancy becomes, r = \Omega(n/4), which is rather large. Would it be more beneficial to use full other robust schemes based on full gradients + median approaches? The paper also does not address the issue of how far the proposed schemes are from optimal solutions. The experiment results can be more comprehensive. Please include the results for *no_attack* in Fig. 5 and 6. Also, the author should present results for larger networks. For such networks, with large n, do the proposed scheme have tolerable computational redundancy? The paper only considers the setting where n data partitions are assigned to n workers. Could the authors comment on the case where # of data partitions is different from # of workers? Is it possible to lower the computational redundancy by exploiting this dimension?

Correctness: The results in the paper appear correct. (The reviewer did not verify the details of the proof.)

Clarity: The paper is generally well written. The reviewer has some minor comments/suggestions to improve the quality/clarity of the presentation. Please see the weaknesses and the additional feedback sections.

Relation to Prior Work: The paper adequately covers the relevant prior work. In information theory and complexity theory literature, there has been some work on understanding the boolean computation in the presence of noisy gates. Could the authors comment on the relevance of that literature to their work?

Reproducibility: Yes

Additional Feedback: 1. For deterministic scheme, in Lemma 2, the authors only focus on the vectors m with ||m||_0 = \lceil n/2 \rceil. Clearly by the definition of S_v(m) in (4), master would obtain the correct result for ||m||_0 < \lceil n/2 \rceil. Does this also hold for the ||m||_0 < \lceil n/2 \rceil case? 2. In equation (2), q --> q_i and S --> S_i ? 3. The coding scheme in Figure 1(b) and 2 is not tolerant to 1 byzantine worker. As shown in Fig. 1(b) (D1, D2, D3, D4, D5) = (-1, -1, +1, -1, +1) gives (c1, c2, c3, c4, c5) = (-1, -1, +1, -1, -1), which ensures that the master receives three -1 even in the presence of 1 byzantine. However, (D1, D2, D3, D4, D5) = (-1, +1, -1, -1, +1) gives (c1, c2, c3, c4, c5) = (-1, +1, -1, +1, -1). Here, clearly 1 byzantine would lead to an error at the master. Please consider presenting an example that is actually tolerant to 1 Byzantine worker. 4. In line 83-85, "The codes proposed in [10] guarantees Byzantine tolerance, as long as each node transmits real-valued gradients, while our codes provide tolerance even under the compressed gradient constraint." This gives the impression that the proposed codes work for both settings, with the compressed gradients, and with the real gradients. I would be happy to revise my score based on the author's responses to my comments.

Review 2

Summary and Contributions: In this work the authors propose ELECTION CODING which is byzantine robust and communication efficient in distributed learning framework. In this framework with n worker nodes, the data is split into n blocks. The authors propose randomized (Bernoulli coding ) and a deterministic data allocation scheme over the workers. The sign of the vector is used to provide low communication overhead and majority voting scheme along with data redundancy using randomized and deterministic allocation matrix is for robustness of the algorithm. The paper also offers analysis and experimental result in the support of the algorithm.

Strengths: 1. In terms of redundancy this paper improves over the previous paper DRACO. 2.The hierarchical voting scheme seems like a good idea to handle the error in mismatch of the sign of the gradient locally and globally. 3. The analysis for random Bernoulli code and design of deterministic codes (as claimed by the authors) looks promising. 4. The experimental result shows improvement over the previous work of signsgd with majority voting.

Weaknesses: Please see the comments.

Correctness: I have not checked the proof for the section 4 (deterministic code). The rest of the paper looks ok.

Clarity: The paper is overall well-written.

Relation to Prior Work: The related work section is very concise and addresses the relevant works in quantization and byzantine robust learning.

Reproducibility: No

Additional Feedback: 1. The byzantine setup seems a bit weak as the byzantine attacks only happen by bit flipping. The paper does not address the problem when the attack happens in computing gradient. For example if Gaussian noise is added to the gradient of a data block and then the sign of the gradient is applied. 2. With the data allocation scheme, the computational load increases. Also the computational load is not uniform among the workers. 3. The idea of the paper seems to be the fact that in absence of the byzantine set-up it should be sign-sgd algorithm. The sign-sgd algorithm uses the majority voting of the signs of the gradients form workers. The lemma 1 focuses on the local error which is probability of error in the mismatch of the sign of g and c_i. Should we not worry about the sign of the g_i (local gradient)? ######################### The authors have responded well to all my queries in the review and accordingly I am changing my score to 6.

Review 3

Summary and Contributions: This paper proposes a Byzantine-robust framework for signed stochastic gradient descent (SignSGD) based on the SignSGD majority vote framework. Two election coding schemes are constructed and experiments are done to confirm the mathematical results. ----- Post rebuttal: After reading the authors' response, the authors do address my concern about where the errors could be introduced (although I still think the way it is worded in the paper could lead to misunderstanding this point). However, I still question the novelty of the methods. I will keep my score.

Strengths: The paper supports the claims made with proofs and experimental results. The problem studied is significant and valuable.

Weaknesses: The paper proposes a scheme, based on the majority voting scheme for the SignSGD, however, the solution is not so novel. Moreover, the discussion of the scheme is a little confusing and a simple example of how the algorithm works exactly would benefit the paper largely. Also, in Figure 1b, when the bit flip happened between the polling station and the master, that did not affect the results, but if the bit flip has happened between the voters and the polling stations, then it would have changed enough of the polling station results to change the final result. In remark 1, you say that the transmitted messages of the nodes are compromised, so that would mean that the change could happen between the node and the polling stations. The byzantine attack happening only between the polling stations and the master needs more justification.

Correctness: The paper satisfies the claims and the methodology is correct.

Clarity: The paper is well written, however, it could benefit from an explicit example and some more high level explanation of the methods.

Relation to Prior Work: Prior work is discussed clearly, and the links between this work and the preceding literature is well explained.

Reproducibility: Yes

Additional Feedback: