NeurIPS 2020

### Review 1

Summary and Contributions: This paper presents an algorithm and corresponding upper bounds for federated contextual bandits under a differential privacy constraint. They consider both a centralised and distributed setting. While the results are not surprising and the proof methods are relatively standard, it is the first theoretical study of this problem, which has significant practical importance.

Strengths: Rigorous, good theory, with an analysis of both centralised and distributed setting including: - Regret bounds - Communication complexity

Weaknesses: It would be good if there were also some lower bounds for the federated/distributed case.

Correctness: I have not checked thoroughly for correctness but everything appears rigorous

Clarity: The paper is quite clear. However, some details of the model are unclear.

Relation to Prior Work: All directly related work is cited and discussed.

Reproducibility: Yes

Additional Feedback: It is slightly unclear from your DP model whether the agents simply broadcast randomised versions of their actual actions to the mechanism, or if they actually have to take the same actions as the ones they broadcast. It seems to be the latter, as far as I can tell. In l. 36 you say that your regret bound is $1/\sqrt(\epsilon)$ and you say that is a factor $1/\sqrt(\epsilon)$ away from [42]. But I guess that one is $1/\epsilon$, so your bound is lower than the lower bound. This is perhaps not very surprising, as my reading of [42] suggests that the lower bound does not apply to DP bandit algorithims in gneral, but only those that use local differential privacy. Privatizer, however, only adds noise to the actions. However, their reasoning is a bit unclear. Lower bounds would be good, I guess! The decentralized setting is particularly interesting. It uses the notion of delayed feedback to do the analysis, something that has been recently a focus in the regret literature. update : thank you for the clarification

### Review 2

Summary and Contributions: The paper describes a distributed learning protocol which balances quality of decisions, communication overhead, and leakage of information. Algorithmic components from the contextual bandit and differential privacy literature are combined and analyzed. ------------------- Having read the other reviews, author's response, and reviewer discussion: I do not see any reason to adjust my overall score.

Strengths: The topic is important and the paper is well written. This particular combination has high potential practical impact and the analysis of the combination is thorough.

Weaknesses: No complaints, solid work.

Correctness: I did not analyze proofs in detail.

Clarity: Very well written.

Relation to Prior Work: Yes.

Reproducibility: No

### Review 3

Summary and Contributions: The paper considers the problem of federated/distributed contextual linear bandits under differential privacy constraints. More concretely, the goal is to compute the optimal parameters vector  \theta that defines a linear reward function to minimize the regret over a time horizon T. The key contribution claimed by the authors is the first differentially private federated algorithm for this problem.

Strengths: Contextual bandit is a classic problem and it is indeed, theoretically, interesting to study it from a differential privacy point of view. It is indeed relevant to the theoretical NeurIPS community.

Weaknesses: I have few key concerns that would be good to get some clarification on. 1) What is the privacy angle? Is there an application where sharing the reward/current parameters would lead to privacy leakage (typically the angle would be either data sensitivity or its proprietary nature, -- what is it here )? 2) Some of the theoretical aspects were very hard to follow. Partly this is because, the paper draws heavily from prior art -- which I am unfamiliar with. On the other hand, I blame it partly on the writeup.   The critical one, which I would really need some clarity on, is the relationship between (epsilon, deltas) and the privacy budget parameters (rho_max, rho_min, and kappa). On page 5, it is mentioned: "In turn, the quantities ε and δ affect the algorithm (and regret) via the quantities ρmin, ρmax, and κ which bound the norm of perturbations at any round:" But these privacy budget quantities ρmin, ρmax, and κ are never specified in terms of epsilon, deltas. Definition 2 is perhaps defining the former quantities in terms of perturbations (H) norms but I do not see their relationship with epsilon, delta. This is critical to understand because, the experiments are performed with respect to privacy budget parameters (\rhos and kappa). It is important to understand the plots in terms of the standard DP parameters (namely epsilon and delta).

Correctness: Methodology is sound but I would like clarity on the privacy budgets as mentioned above. I have not verified the proofs.

Clarity: For a non-expert the paper is really hard to read/follow. The presentation should be (if accepted) made better.

Relation to Prior Work: Somewhat, but I am not sure how distinguished the techniques are from the prior approaches. While I appreciate that the challenges primarily stem from the asymmetric communication between agents, I cannot judge on how complicated this really is and what new techniques are employed here.

Reproducibility: Yes

Additional Feedback: Thank you authors for the response to my questions. Based on that and discussions with fellow reviewers and Area chair, I have decided to increase my score. However, for the sake of better understandability, please address the concerns I have raised before mainly that the privacy analysis must be augmented with epsilon and delta values used.

### Review 4

Summary and Contributions: This work introduces FedUCB - an algorithm for differentially-privates contextual bandits (CB) in a centralized and decentralized federated learning setting, with the main contributions of the paper being: 1. Defining differential privacy for federated CBs 2. Introducing a centralized federated learning algorithm for CB with regret bounds that match single agent regret when synchronization happens every round, and 3. Introducing a decentralized version of the algorithm, with regret bounds that additionally depend on the topology of the graph.

Strengths: The problem that the work targets is relevant and well-motivated, and the work is technically sound, novel, and thorough in its presentation.

Weaknesses: would like to see discussion on non-homogeneous arms for different agents and the effect of new arms being introduced in the course of an experiment. Have the authors considered if we can have a privacy budget that is dependent on each pair of agents? Meaning some agents are allowed to share more compared to others?

Correctness: Although have not check all the proofs, parts that I needed to reproduce to get deeper understanding seems all to be correct.

Clarity: The paper is clearly written and expands the problem and its motivation in a nice way.

Relation to Prior Work: The introduction is easy to read and follow and covers mostly what has been done in the field. I didn't search extensively for any other prior work that needs to be added though.

Reproducibility: Yes

Additional Feedback: page 8, Line 276 . "Beta" should be /beta .