NeurIPS 2020

FedSplit: an algorithmic framework for fast federated optimization


Review 1

Summary and Contributions: The paper presents a new splitting method for federated optimization. Convergence rates are established under a variety of plausible assumptions, ranging from geometric rates under Lipschitz smoothness and strong convexity to O(1/t^2) rates for smooth but not strongly convex functions. In all cases, exact (i.e., non-noisy) gradients are considered.

Strengths: The results are new and the problem formulation is natural and useful. The method is elegant. Overall, I think the paper makes a good contribution to the literature, especially compared to many previous works whose results are much messier.

Weaknesses: - The main weakness is the lack of a good theoretical result in the case when the proximal steps are not solved exactly. In that case, Corollary 1 gives a bound for b, but this bound is itself given in terms of the iterates. I suppose one can always assume that the optimization is done over a set of finite diameter, in which case the right-hand side can be naturally bounded; but in the absence of this assumption, the analysis is not fully complete. - The results are quite easy to obtain and the techniques come from a natural modification of well-known arguments. However, this might be just because the method is cleverly chosen so that the ultimate analysis is simple.

Correctness: I have not found any errors.

Clarity: The paper is well written.

Relation to Prior Work: This is adequately discussed, though I'm not super familiar with the existing literature.

Reproducibility: Yes

Additional Feedback: **Edit:** I have now read the author response. I do not think the authors do a good job of responding to my comments. Overall, my opinion of the paper is unchanged and I'm leaving the rating at a 7. First, my complaint was about the inherently circular nature of the results: Theorem 1 bounds what happens to the iterates in terms of b, and then Corollary 1 bounds b in terms of the iterates (see the schematic sketch after this paragraph). Even if, as the authors reply, their analysis could handle the case when b changes from step to step, the conceptual problem remains. Typically, this is fixed either by adding a projection or by adding a separate argument that the iterates remain bounded even without a projection. Second, the authors list some technical contributions they make in terms of operator splitting, but frankly these do not sound very impressive. Overall, I like this paper because it is simple and cute, and I would like to see it published. However, it is not without weaknesses.
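Schematically, the dependence I am objecting to has the following shape (this is only my reading of the statements, not the exact constants or norms of the paper):

```latex
\|x^{t} - x^{\star}\| \;\le\; \rho^{t}\,\|x^{0} - x^{\star}\| \;+\; c\, b
\quad \text{(Theorem 1)},
\qquad
b \;\le\; \varepsilon \,\sup_{t}\, \|z^{t}\|
\quad \text{(Corollary 1)},
```

so the error bound needs b, while the bound on b needs the iterates to be bounded a priori; a projection step or a separate boundedness argument would break this loop.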


Review 2

Summary and Contributions: The authors showed that the fixed points of FedSGD and FedProx need not correspond to stationary points of the original optimization problem. They then proposed the FedSplit framework to deal with this issue.

Strengths: - Strong theory: in fact, the authors derived an explicit formula for the fixed points of FedSGD and FedProx. - They gave examples of situations where these points do not correspond to stationary points of the original optimization problem, with a particular focus on least squares. - They proposed FedSplit to handle this issue and proved its convergence to the right points.
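To illustrate the kind of mismatch being formalized, here is a minimal numerical sketch (my own toy construction, not the paper's example): two heterogeneous clients with scalar quadratic losses, where iterating the deterministic FedSGD/FedAvg-style map with several local gradient steps converges to a fixed point different from the true minimizer.

```python
# Two clients with scalar quadratics f_j(x) = 0.5 * a_j * (x - c_j)^2, so the
# global minimizer of f_1 + f_2 is x* = (a1*c1 + a2*c2) / (a1 + a2). Running
# e > 1 deterministic local gradient steps per round and then averaging
# converges to a different point when the clients are heterogeneous.

a = [1.0, 10.0]        # local curvatures (heterogeneous clients)
c = [0.0, 1.0]         # local minimizers
s = 0.05               # local step size
e = 10                 # local gradient steps per communication round

x_star = sum(ai * ci for ai, ci in zip(a, c)) / sum(a)  # true minimizer

def local_update(x, ai, ci):
    """Run e deterministic gradient steps on f_i from the shared iterate x."""
    for _ in range(e):
        x = x - s * ai * (x - ci)
    return x

x = 0.5  # shared iterate
for _ in range(2000):  # iterate the averaged map to its (numerical) fixed point
    x = sum(local_update(x, ai, ci) for ai, ci in zip(a, c)) / len(a)

print(f"true minimizer  x* = {x_star:.6f}")
print(f"fixed point     x  = {x:.6f}")   # differs from x* whenever e > 1
```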

Weaknesses: - The experiments are not strong. - A numerical illustration is given only for simple logistic regression on FEMNIST. - The authors could strengthen the numerical part using bigger models and larger datasets.

Correctness: - In the proof of Theorem 2, I cannot see where the condition on the step size is used. - In the same theorem, I am surprised by the condition on the step size. In fact, the proposed step size is proportional to 1/sqrt(\lambda), and \lambda is of order \epsilon, so the step size is of order 1/sqrt(\epsilon), which is too big.
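Spelled out (with \lambda the smoothing/regularization parameter and assuming \lambda \asymp \epsilon, as I read the construction):

```latex
s \;\propto\; \frac{1}{\sqrt{\lambda}} \;\asymp\; \frac{1}{\sqrt{\epsilon}} \;\to\; \infty \quad \text{as } \epsilon \to 0,
```

i.e., the prescribed step size blows up exactly in the high-accuracy regime.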

Clarity: The paper is clear and well written.

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback: - See my concern about Theorem 2 above. - In equation (17) you used x^1 in the regularization term. It seems that other vectors could work as well? I think an explanation and comment on this should be included in the paper. - Some experiments on the impact of this regularization are needed. ----After the rebuttal----- I read the other reviews and the author response. I understand from the other reviews that there is some criticism of the theoretical properties and their comparison with results from the literature. Nevertheless, I still think this is a good submission, so I have decided to keep my score unchanged at 7.


Review 3

Summary and Contributions: Main merit: The paper introduces a novel local algorithm for federated learning -- FedSplit. While classical FL methods such as FedAvg and FedProx do not have correct fixed points, FedSplit has the correct fixed points despite its quite simple structure. I believe this is a very nice contribution and should be appreciated by the community. _______________ After the rebuttal _________________ The authors did not disagree with the points I raised. At the same time, they have proposed reasonable fixes to these issues: mention SCAFFOLD, be a little more careful about taking credit for noticing the wrong fixed points of FedAvg/FedProx, and mention that the rate of FedSplit is no better than the rate of AGD in the heterogeneous setting (but argue the benefits of FedSplit in the iid setup). Just one last comment: I was not persuaded by the comment that FedProx is better than AGD in terms of inexactness; I believe more details would be needed to make this argument solid (I understand there is no space for this in the rebuttal): 1) if one has access to exact gradients, AGD can be performed directly, while FedSplit still needs to solve the local subproblem inexactly, and 2) AGD can still work well under inexact updates once done correctly, see https://arxiv.org/abs/1109.2415 for example (one would need a more detailed argument comparing the inexactness handling there with Theorems 1 and 3 of your paper). Given the above (I read the other reviews too), I stand by my initial score -- 6.

Strengths: Explained in the contributions.

Weaknesses: Main criticism: 1) The paper claims two main contributions, one of which is: "The first contribution of this paper is to analyze some past procedures, and show that even in the favorable setting of deterministic updates (i.e., no stochastic approximation used), these methods typically fail to preserve solutions of the original optimization problem as fixed points." I believe this text is misleading. In fact, it was already well known that the "past procedures" do not have the correct fixed points; one alternative approach to deal with this issue is to incorporate the "drift"; see https://arxiv.org/abs/1910.06378 for example. Therefore, I believe it would be more appropriate not to claim as a contribution the observation that the local algorithms have the wrong fixed points. 2) While FedSplit is an honest local algorithm with a correct fixed point, the convergence properties of FedSplit are strictly worse than those of plain accelerated gradient descent (AGD). Specifically, in the strongly convex case (the same arguments apply in the weakly convex case), the communication complexity of FedSplit is O(sqrt(kappa)log(1/epsilon)), which is identical to the communication complexity of AGD (see the schematic count after this paragraph). In fact, AGD is favorable (in terms of the rate) as it requires a single gradient evaluation instead of evaluating the prox with high enough precision that the inexactness does not drive the rate. To be fair, local methods do not, in general, outperform their non-local counterparts in terms of communication even when the fixed point is correct; see https://arxiv.org/abs/2006.04735 for example (no need to cite it as the paper appeared online only recently) -- in fact, one might argue that plain AGD is optimal in such a case (see https://arxiv.org/abs/2005.10675 for example). For that reason, one might not hope for a better rate of FedSplit. To "fix" this issue, I suggest the following: i) consider the data-homogeneous setting (i.e., identical data across nodes). In such a case, FedSplit should strictly outperform AGD (I presume it would even converge in a single iteration in this case); ii) mention transparently in the paper that FedSplit is not favorable over AGD in terms of theory in the heterogeneous setting.
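To make the comparison concrete, a schematic round count under mu-strong convexity and L-smoothness with kappa = L/mu (order-of-magnitude only, not the papers' exact constants):

```latex
\text{AGD:} \quad O\!\big(\sqrt{\kappa}\,\log(1/\epsilon)\big) \text{ rounds, one exact gradient per client per round};
\qquad
\text{FedSplit:} \quad O\!\big(\sqrt{\kappa}\,\log(1/\epsilon)\big) \text{ rounds, one (inexact) prox solve per client per round}.
```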

Correctness: I did a high-level check of the proofs and all seem correct.

Clarity: I am happy with the level of writing of the paper (besides the )

Relation to Prior Work: Yes, the difference and motivation of this work is well justified.

Reproducibility: Yes

Additional Feedback: See Weaknesses


Review 4

Summary and Contributions: The paper considers federated distributed optimization under convexity assumptions. The paper first identifies that the fixed points of some federated optimization algorithms may not be stationary points of the original optimization problem. To fix this, the paper introduces a new algorithm whose fixed points correspond to the optima of the original problem. The convergence rate is also established for the proposed algorithm.

Strengths: The paper shows that the federated SGD and federated proximal algorithms have fixed points that do not correspond to zeros of the sum of gradients of the consensus problem, even in the deterministic case. An example is presented to verify the claim. The paper then proposes the FedSplit algorithm, based on operator splitting for problems with additive structure. The algorithm performs a local proximal update on the local parameters and aggregates the local parameters to update the server. Standard convergence rates are established for the algorithm using common tools.
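For concreteness, here is a minimal sketch of the consensus operator splitting that, as I understand it, FedSplit instantiates (Peaceman-Rachford type); the quadratic local losses and all variable names are mine, not the paper's.

```python
import numpy as np

# Consensus splitting sketch: each client keeps z_j; the server averages; each
# client applies its prox to a reflected point and then re-centers. The key
# structural difference from averaged gradient/prox maps is this reflect +
# center pattern, whose fixed points recover minimizers of the sum.

np.random.seed(0)
m, d = 5, 3                                      # clients, dimension
A = [np.random.randn(10, d) for _ in range(m)]   # local design matrices
b = [np.random.randn(10) for _ in range(m)]      # local responses
s = 0.1                                          # step size

def local_prox(j, v):
    """prox_{s f_j}(v) for f_j(x) = 0.5 * ||A_j x - b_j||^2, in closed form."""
    H = s * A[j].T @ A[j] + np.eye(d)
    return np.linalg.solve(H, v + s * A[j].T @ b[j])

z = [np.zeros(d) for _ in range(m)]              # per-client auxiliary variables
for _ in range(200):
    x_bar = sum(z) / m                           # server: average client variables
    for j in range(m):
        x_half = local_prox(j, 2 * x_bar - z[j])   # client: prox on reflected point
        z[j] = z[j] + 2 * (x_half - x_bar)         # client: local centering step

# x_bar should approach the minimizer of sum_j 0.5 * ||A_j x - b_j||^2.
x_star = np.linalg.lstsq(np.vstack(A), np.concatenate(b), rcond=None)[0]
print(np.linalg.norm(sum(z) / m - x_star))       # should be near machine precision
```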

Weaknesses: The paper could add more details on the derivation of Algorithm 1 and give some intuition about why the proposed algorithm fixes the issue. The FedSplit method is more like a deterministic distributed optimization algorithm. The connections to the multi-device communication and failure issues mentioned in the paper are weak.
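One candidate version of that intuition (assuming, as in my reading of Algorithm 1, an update of the Peaceman-Rachford consensus form z_j^+ = z_j + 2(prox_{s f_j}(2 \bar z - z_j) - \bar z)): at a fixed point every local prox returns the average \bar z, and the prox optimality conditions then telescope.

```latex
\mathrm{prox}_{s f_j}(2\bar z - z_j) = \bar z
\;\Longleftrightarrow\;
\bar z - z_j \in s\,\partial f_j(\bar z)
\quad \text{for every client } j,
\qquad
\sum_{j=1}^{m} (\bar z - z_j) = 0
\;\Longrightarrow\;
0 \in \sum_{j=1}^{m} \partial f_j(\bar z),
```

so fixed points of the splitting yield minimizers of the original sum, which is exactly what the averaged maps of FedSGD/FedProx fail to guarantee.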

Correctness: The algorithm looks correct, and the convergence rates are classical optimization results obtained using classical tools.

Clarity: The paper is well written. It presents good motivation and example, followed by the proposal to address the issue. The experiment is designed to show the advantages of the new method.

Relation to Prior Work: The connections to the existing work are clearly stated, and the contributions are also clearly presented.

Reproducibility: Yes

Additional Feedback: