NeurIPS 2020

### Review 1

Summary and Contributions: The paper studies cumulative regret when there are multiple best arms and the number of best arms is unknown. It defines a new notion of difficulty for such problems, \psi. It provides a new algorithm for this problem that is Pareto optimal over problems with different hardness levels. It also proves that there is no algorithm that is simultaneously optimal across all problem difficulties. However, the authors show that if mu_* is known then simultaneous optimality is possible. Finally, they demonstrate the superior performance of their algorithms in experiments.

Strengths: The regime where n can be larger than T is important and well-motivated by many modern applications. Their observation that there are various levels of difficulty for this problem, and that it is not possible to be simultaneously optimal for all of them, is very insightful. It seems that Theorem 2 implies that no algorithm is within log(T) of the minimax rate for all problem difficulties, correct? If so, this polynomial gap seems quite convincing. Algorithm 1 is non-trivial and leads to strong empirical performance. The experiments are thorough.

Weaknesses: The paper does not seem to develop fundamentally new algorithmic ideas, although it applies sophisticated algorithms from the literature to an important problem in an enlightening way. It would seem that mu_* is never known exactly in practice. How does misspecification of mu_* affect the performance of Algorithm 3? Is minimax optimality achievable even under misspecification? How would this affect empirical performance? It seems odd that Algorithm 3 does worse in the experiments even though it has extra information. Is there a way to get around the issue that you are running several versions of the algorithm? Does it remain an open question how to leverage this extra information into an empirically superior algorithm?

Correctness: Yes.

Clarity: The paper is extremely clear.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: After rebuttal: I have read the response and other reviews and my score remains the same.

### Review 2

Summary and Contributions: The authors consider a stochastic bandit setting with n arms, where n is taken to be very large. Unlike other large-armed bandit work, no structure is assumed between the payoffs of the various arms. Their methods rely heavily on MOSS [Audibert et al.], which attains the optimal regret rate of \sqrt{nT} in the usual setting where n <= T. Letting m <= n denote the number of arms that are optimal (or near-optimal), the authors begin by defining alpha satisfying n/m = T^{\alpha} as the key quantity characterizing hardness. They observe that if alpha were known, one could sample a subset of the arms of cardinality T^{\alpha} \log T, which contains an optimal arm whp; applying MOSS then attains a regret bound of O(T^{(1+\alpha)/2}). Meanwhile, there exists an instance with parameter alpha and a matching lower bound (when m = 1 and n = T^{\alpha}). Thus, the key difficulty is attaining the same result when alpha is not known. They give an algorithm that runs multiple epochs of MOSS, for exponentially increasing numbers of rounds, on subsampled arms. In each epoch, the number of sampled arms decays by a factor of 2. However, the algorithm also includes a “virtual” arm that simulates the empirical distribution of arms played in each of the previous epochs. The algorithm’s analysis is very clean and follows by identifying an epoch i* where the number of sampled arms falls below T^{\alpha} \log T. Before i*, so many arms are sampled that an optimal one is bound to be among them; after i*, the performance of the virtual arm becomes near-optimal. The ultimate bound, however, depends on the correct setting of a user-specified parameter beta in order to achieve the desired rate of O(T^{(1+\alpha)/2}). The remainder of the theoretical content of the paper essentially defends this property of their algorithm in two ways.
(1) They argue that no algorithm can be optimal for all levels of alpha, and that their algorithm sits on the Pareto frontier while naive algorithms, such as subsampling arms and running MOSS, do not; and (2) with knowledge of the mean of the best arm, an algorithm achieving the lower bound is possible. Finally, they give experiments demonstrating the performance of their algorithms at a fixed beta and varying alphas. They also give an algorithm inspired by the theory but more empirically robust (re-using statistics from previous epochs). The experiments demonstrate strong performance until alpha becomes larger than the user-defined beta (at which point the theory predicts vacuous regret bounds).
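The subsample-then-MOSS idea summarized above (for the case where alpha is known) can be sketched in a few lines. This is a minimal illustration under assumed Bernoulli rewards, not the authors' actual algorithm or code; the function names `moss` and `subsample_moss` are hypothetical, and the arm index used is the standard MOSS index of Audibert and Bubeck.

```python
import math
import random

def moss(means, T, rng):
    """Run MOSS on len(means) Bernoulli arms for T rounds; return total reward.

    MOSS index: empirical mean + sqrt(max(0, log(T / (K * n_k))) / n_k).
    """
    K = len(means)
    counts = [0] * K
    sums = [0.0] * K
    total = 0.0
    for t in range(T):
        if t < K:
            a = t  # play each arm once before using the index
        else:
            a = max(range(K), key=lambda k: sums[k] / counts[k]
                    + math.sqrt(max(0.0, math.log(T / (K * counts[k]))) / counts[k]))
        r = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1
        sums[a] += r
        total += r
    return total

def subsample_moss(all_means, T, alpha, rng):
    """If alpha with n/m = T^alpha were known, subsample ~T^alpha * log T arms.

    Whp the subsample contains a (near-)optimal arm, and running MOSS on it
    yields regret O(T^{(1+alpha)/2}), matching the reviewer's summary.
    """
    k = min(len(all_means), max(1, int(T ** alpha * math.log(T))))
    subsample = rng.sample(all_means, k)
    return moss(subsample, T, rng)
```

The epoch-based algorithm for unknown alpha repeats this with geometrically shrinking subsamples plus the "virtual" arm; the sketch above only covers the known-alpha baseline that motivates it.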

Strengths: 1) The paper does a very good job of analyzing a large MAB problem without a great deal of cumbersome assumptions. The only salient quantity is the fraction of arms that are optimal. 2) The paper is very clear and well written. 3) The algorithmic ideas and analysis are interesting, and could be of value beyond the setting considered.

Weaknesses: 1) The optimality of the bound ultimately depends on a hyperparameter being set properly. However, the authors do a good job defending this, and one can imagine making reasonable choices for this hyperparameter in practice.

Correctness: The proofs appear to be correct, and the empirical methodology sound.

Clarity: The paper is easy to read.

Relation to Prior Work: The authors do a good job of setting the material in context of previous work. I don't think there is missing literature except for more large-arm/continuum bandit work, which is only tangentially related due to the lack of smoothness assumptions here.

Reproducibility: Yes

Additional Feedback:
- 115: Definition of psi. The setup claims that n can be infinite, but then it seems that this definition would cause problems… Can the authors clarify?
- 137: typo “an”
- 138: I think it is standard enough to not warrant definition, but take note that [K] notation is used before it is defined (e.g. first paragraph in Section 2).
- 167: typo -> “still holds”

Appendix:
- 427: What does “valid algorithm” mean in this context? If anything, it seems like you would want to upper bound T_p (i.e., show that the algorithm does not exceed its time horizon).
- 446: A is used as the random arm, so probably better to call this event something else.
- 450: Typo: “if it exists”
- 454: Typo: “particular”