Reviews: DETOX: A Redundancy-based Framework for Faster and More Robust Gradient Aggregation

This paper presents DETOX, a new framework for improving resistance to Byzantine compute nodes during distributed optimization. The framework utilizes existing robust aggregation techniques, as well as redundancy, within the proposed framework to make better use of resources toward mitigating the effects of Byzantine compute nodes. The paper is very well written. Presentation of existing methods and description of their framework is very clear. The algorithm is presented clearly with minimal assumptions on readers knowledge of distributed training. The theory is sound and shows improvements in performance over multiple relevant metrics. The framework is, from a software design point of view, substantially more complex (cyclomatic complexity may capture this observation) than existing methods and may make adoption of the method more difficult for certain system architectures. The method appears to do quite well when q=5 number of compute nodes are Byzantine, but how does performance suffer when using DETOX if epsilon is small (epsilon=q=0, q=1)? How do other methods (including no robustness and just redundancy) do outside the DETOX framework when epsilon is small? It may make sense to adopt simpler robust methods when epsilon is small. This is minimally addressed with experiments that employ the weaker reverse gradient attack, but having some experiments with varying epsilon may help system developers/architects identify whether DETOX is the right framework for their infrastructure/existing system. Additional thoughts after authors response: I'm a bit confused by their response regarding variance. For p=45, for vanilla aggregation, the PS will average over 45 batches of size b, giving an "effective" batchsize of 45*b, whereas with DETOX, due to redundancy (r=3), we'll see an "effective" batchsize of 15*b. Additional thoughts having gone over the paper again: 1. I think the paper could use more clarity around how their method compares to other methods in terms of resource requirements. It's mentioned in passing (section 5.1 and 5.3) but I think it could be made a bit more explicit. In particular, in section 5.1, do they consider an r=3 group of nodes a compute node? Is this why DETOX computes 3 times as many gradients as the compute nodes in the vanilla aggregation case? 2. Greater clarity around how DETOX doesn't increase variance (for fixed p). My current understanding: For p=45, for vanilla aggregation, the PS will average over 45 batches of size b, giving an "effective" batchsize of 45*b, whereas with DETOX, due to redundancy (r=3), we'll see an "effective" batchsize of 15*b. 3.The paper could benefit from illustrating the problem (Byzantine node failures) with concrete examples (in particular the adversarial setting, unless this is purely hypothetical). I also think this would help make the work more interesting to a wider audience.

Paper ID:	5443
Title:	DETOX: A Redundancy-based Framework for Faster and More Robust Gradient Aggregation

Reviewer 1

Reviewer 2

Reviewer 3