NeurIPS 2020

### Review 1

Summary and Contributions: The paper proposes to address non-iid local datasets in distributed machine learning by clustering models during the learning process. That is, local learners make iterative estimations on cluster attribution and then optimize corresponding global models. Thus, every step of synchronization will send k (number of clusters) models to each of the local learners and each local learner returns its cluster and the update for its model. The paper provides a convergence rate for linear models with squared loss, and one for strongly convex learning problems, as well as an empirical analysis on MNIST and CIFAR.
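The synchronization step described above can be sketched in a few lines. The following is a hypothetical illustration of one IFCA round with squared loss, not the authors' code: the function name `ifca_round`, the learning rate, and the data layout are all invented for illustration.

```python
import numpy as np

def ifca_round(models, local_data, lr=0.1):
    """One synchronization round (sketch): the server broadcasts all k
    cluster models; each worker picks the model with the lowest local
    loss and returns a gradient for that model only; the server averages
    the gradients per cluster and updates each cluster model."""
    k = len(models)
    grads = [np.zeros_like(m) for m in models]
    counts = [0] * k
    for X, y in local_data:  # each (X, y) is one worker's local dataset
        # cluster estimation: the model with the lowest squared loss wins
        losses = [np.mean((X @ m - y) ** 2) for m in models]
        j = int(np.argmin(losses))
        # gradient of the mean squared loss w.r.t. the chosen model
        grads[j] += 2 * X.T @ (X @ models[j] - y) / len(y)
        counts[j] += 1
    # average gradients within each cluster and take one step
    new_models = [m - lr * g / max(c, 1)
                  for m, g, c in zip(models, grads, counts)]
    return new_models, counts
```

On well-separated synthetic clusters, repeating this round recovers one parameter vector per cluster, which is the behavior the convergence results formalize.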

Strengths:
- The authors address an important problem of personalized distributed learning, i.e., the case of distributed learning where local learners do not have i.i.d. data and need a model suited exactly to their local distribution.
- The algorithm is sound and well explained.

Weaknesses:
- The theoretical assumptions on the initial parameters appear to be too strong (see additional feedback for details).
- The empirical evaluation is not fully convincing, since it (i) does not compare the approach to existing methods, but only to very simple baselines, and (ii) uses an unjustified success criterion (requiring the difference between the right model and the found solution to be less than 0.6 of the noise deviation). Again, see additional feedback for details.

Correctness: The proofs of Theorems 1 and 2 seem to be correct. For the experiments, the parameters of the algorithm (step size, number of gradient steps) are reported, but it is not documented how they were chosen. Was a separate dataset used for parameter tuning?

Clarity: The paper is well written and easy to follow, it is a pleasure to read.

Relation to Prior Work: The paper adequately discusses related work and clearly stated the difference in the setup they consider, including a comparison with a recent pre-print [25].

Reproducibility: Yes

### Review 2

Summary and Contributions: This paper proposes a framework for clustered federated learning that can deal with heterogeneity in the data distribution, in particular non-i.i.d. data. As for its contribution, the theoretical analysis of this work advances statistical estimation problems with latent variables in distributed settings.

Strengths: It is proved that, under the assumption of good initialization, linear models and general strongly convex losses obtain near-optimal statistical error rates. It is also shown that IFCA is efficient on non-convex problems such as neural networks and outperforms the baselines by 5-8% on several clustered FL benchmarks created from the MNIST and CIFAR-10 datasets.
(1) Novelty: personalized predictions based on similarity among users; a statistical setting with cluster structure is considered.
(2) Significance: efficient on non-convex problems.
(3) Soundness: the algorithm can succeed even if the requirements on initialization are relaxed to random initialization with multiple restarts.

Weaknesses: The paper does not say whether this framework will work on non-linear problems. Moreover, the meanings of the global model and the local model are not explained very clearly. The method may also cause a privacy issue in collecting identity information. The empirical evaluation lacks comparisons with SOTA models.

Correctness: Something went wrong when I ran the code, but in this article the authors have clearly explained their approach and algorithm, and show impressive results.

Clarity: This paper is well organized and provides sufficient theoretical proofs; however, a conclusion section is necessary in my opinion.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: I want to thank the authors for their detailed response, and I will maintain my score.

### Review 3

Summary and Contributions: This paper investigates the problem of heterogeneity in federated learning with the clustered federated learning framework. A new algorithm, Iterative Federated Clustering Algorithm (IFCA), is introduced, in which the workers first estimate the cluster they belong to and then compute the corresponding updates. Convergence bounds are provided in two cases: linear Gaussian regression and strongly convex loss functions. Parametric experiments on synthetic data show that the algorithm is able to retrieve approximate parameters in the linear case. Experiments on real data (modified MNIST & CIFAR-10) show that the proposed IFCA yields better results than global or local training.

Strengths:
- This paper introduces a novel algorithm for the previously introduced clustered FL framework, which is analysed both theoretically and empirically. I think this is one of the first analyses of such an algorithm, which is very significant. This makes it very relevant to the NeurIPS community.
- As a byproduct, the theoretical analysis yields novel results for the statistical estimation of distributed models with latent variables.

Weaknesses:
- The theoretical analysis is restricted to either a linear Gaussian case in relatively small dimension or to a strongly convex problem, with quite restrictive assumptions on the initialisation.
- Experiments do not compare the results of the proposed algorithm to previous clustered federated learning algorithms (notably ClusteredFL, Sattler et al. 2019), making it difficult to appreciate the practical improvement.
- Contrary to previous works (Sattler et al. 2019), this algorithm assumes that the number of clusters is known, which reduces its practical applications.

Correctness: Regarding the experiments, a comparison to ClusteredFL would have strengthened the paper. Besides that, the methodology seems sound.

Clarity: The paper is well written and easy to follow. However, a conclusion or discussion is missing.

Relation to Prior Work: The paper discusses well how it differs from previous work. However, it lacks experimental validation with respect to previous works.

Reproducibility: Yes

Additional Feedback:
- Contrary to Sattler et al., the proposed algorithm is not communication-efficient, as all models need to be broadcast to all clients participating in a round, which multiplies the number of models to send by K. How would it be possible to reduce this communication cost?
- Intuitively, the size of the mini-batch on which the losses are evaluated in order to estimate the local cluster plays an important role, as it adds noise which prevents finding the correct cluster ID. Did you always use the full dataset in the experiments? If not, did you see an effect of this batch size on the convergence speed?

### Review 4

Summary and Contributions: Authors consider a Federated Learning setting in which each device belongs to a cluster, and data points within the same cluster follow the same model; it is a mixture of (linear) regressions model, where we have additional information about subgroups that share the same latent variable. The model seems a bit simplistic for reflecting the heterogeneity of data in the real world, but it is still theoretical progress over an i.i.d. model. Authors prove the convergence rate of their algorithm to the true parameters for both linear models and strongly convex loss functions. Experiments on synthetic data validate their theoretical findings, and experiments on semi-artificial data (rotated MNIST) show that the proposed method can be applied to non-convex problems.
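The statistical model this review refers to can be made concrete with a small data generator; this is a hypothetical sketch, with the function name `generate_clustered_data` and all parameter names invented for illustration.

```python
import numpy as np

def generate_clustered_data(m=8, n=30, d=2, k=2, sigma=0.1, seed=0):
    """Sketch of the setting described above: m devices, each assigned a
    latent cluster z_i; every data point on a device follows the linear
    model of that device's cluster, y = x^T theta_{z_i} + noise."""
    rng = np.random.default_rng(seed)
    thetas = rng.normal(size=(k, d))    # one true parameter vector per cluster
    z = rng.integers(0, k, size=m)      # latent cluster label of each device
    data = []
    for i in range(m):
        X = rng.normal(size=(n, d))
        y = X @ thetas[z[i]] + sigma * rng.normal(size=n)
        data.append((X, y))
    return thetas, z, data
```

The key difference from a plain mixture of regressions is that the latent variable is shared by all n points on a device, which is the extra structure the theoretical analysis exploits.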

Strengths: The proposed algorithm is simple and thus allows a sound theoretical analysis, while still being applicable to models that don't allow theoretical analysis (such as neural networks). The proposed model has close connections to fundamental models in statistics, and the theoretical results may have implications for an audience broader than the Federated Learning community. I haven't carefully checked the proofs, but the results in this paper seem to align with known results.

Weaknesses: The proposed model seems to be too simplistic to reflect the heterogeneity of devices in the real world. At the least, a large number of clusters $k$ would be expected in the real world; a large $k$ will lead to a small $p$, so the algorithm would not scale very well with $k$. Rotated MNIST is an interesting idea to demonstrate the applicability of the proposed algorithm to non-convex problems, but it is still somewhat artificial, as there is a small number of latent clusters, and it doesn't convincingly demonstrate the practical usefulness of the algorithm. Authors did not address my review in their rebuttal, so I stayed with my original overall score.

Correctness: I wasn't able to closely verify the proofs in the Appendix, but the authors' claims are sensible and aligned with known results. The empirical evaluation is based on artificial data, but this is reasonable for a theory paper.

Clarity: The paper is very well written. The outline of the proof clearly explains high-level steps needed for the final result, and each step is stated with a well-defined lemma and self-contained proof. Authors also compare their results with special cases of known results, which is a great sanity check.

Relation to Prior Work: Authors acknowledge that Mansour et al (2020) simultaneously proposed the same statistical model. Authors also make good connections to related work on Federated Learning for non-i.i.d. data and more classical models such as mixture of regressions model.

Reproducibility: Yes