Review for NeurIPS paper: Unreasonable Effectiveness of Greedy Algorithms in Multi-Armed Bandit with Many Arms

NeurIPS 2020

Unreasonable Effectiveness of Greedy Algorithms in Multi-Armed Bandit with Many Arms

Review 1

Summary and Contributions: ----------------------------------- After Rebuttal: I have read other reviews and author response. My score remains the same. ---------------------------------- The paper studies the standard multi-armed bandit problems when number of arms is much higher than $O(\sqrt{T})$, $T$ being the time horizon. It first shows that a subsampled version of the UCB algorithm achieves optimal regret in this setting. However, motivated by not-so-good performance of UCB in experiments, the paper studies a simple greedy rule and analyze its performance in the many arm regime.

Strengths: 1. The paper is extremely well-written. The paper flow makes a lot of sense. First, an experimental comparison is shown to motivate the study of the greedy algorithm. Then, a lower bound for the problem is proved and optimal UCB based and greedy algorithms are discussed. 2. The proof of the regret bound of the greedy algorithm incorporates new machinery, which is an welcome addition to the MAB literature.

Weaknesses: The paper considers a week notion of Bayesian regret. It would make the work stronger if similar guarantees for the greedy algorithm can be established in the worst case setting also. I would like the authors to comment on that.

Correctness: I was not able to check the proofs in detail. But the proofs seem correct to me.

Clarity: The paper is very well written and there is a smooth flow of arguments, which helps to understand the main idea pretty well.

Relation to Prior Work: Prior works are well cited and it is clearly mentioned how this work differs from prior works.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: This paper studies the effectiveness of greedy algorithms in the multi-armed bandit with many arms, e.g., k>\sqrt{T}. They show that greedy algorithms with subsampling outperform many other approaches that depend on active exploration. Theoretical analysis and empirical investigation are provided.

Strengths: This paper first shows the importance of subsampling in the many-armed Bayesian MAB problem with theoretical guarantees. Then simulation results are provided to show the power of greedy algorithms. To explain this phenomenon, the paper gives a theoretical analysis of Greedy in the many-armed regime. Specifically, this paper shows that with high probability, the greedy policy is likely to concentrate on arms with high mean rewards. And the paper derives an upper bound on regret of SS-Greedy for Bernoulli rewards and more general reward distributions under a mild condition.

Weaknesses: As mentioned in the paper, SS-Greedy does not establish universal rate optimality, although it shows better empirical performance in simulations provided in the paper. Experimental results for subgaussian rewards and uniformly upward-looking rewards explored in the theoretical parts are not shown. It seems that this paper does not show whether their analysis holds when k > T, where is realistic, as discussed at the beginning of the paper. ****************** Thanks for your response. I will hold my current score because my concerns are not fully addressed. For example, the sentence for the motivating example ‘In practice, however, there are many situations where the number of arms is large relative to the time horizon of interest’ In Section 1 brings the case with k>T to me. It is very unclear for me that the current algorithm will work for the case with k>T.

Correctness: Seems correct.

Clarity: Satisfied.

Relation to Prior Work: The relations of theoretical results and related work are not clear.

Reproducibility: Yes

Additional Feedback: A summary for contributions in the first section should be provided Some notations, e.g., free exploration and subsampling, should be explained/defined before they are used. Typos: Line 54: behavior->behaviors Line 69: reward -> rewards Line 68: in supplementary -> in the supplementary Line 76: Figure 1 ) -> Figure 1) The tense in Section 1.1 Related Work should be consistent. References, e.g., [21], should not be used as a noun. Line 267: Figure 6 -> Figure 2

Review 3

Summary and Contributions: This paper continues along the interesting recent direction of characterizing the performance of a simple greedy algorithm in bandit optimization. The main problem considered is the many-armed Bayesian MAB problem, i.e., where the number of arms k > sqrt{T}. Motivated by the empirical findings in this regime, the authors study the theoretical performance of popular MAB algorithms and provide insights into the effectiveness of greedy algorithms. The main contributions include: 1) lower bound on the Bayesian expected regret, 2) two algorithms based on the simple idea of subsampling: SS-UCB and SS-Greedy, 3) optimal regret rates for SS-UCB and (sub)optimal regret rates for SS-Greedy for different types of rewards.

Strengths: The problem considered in this paper is very well motivated by the interesting empirical study presented in the introduction. Recently, there has also been an increasing interest in the line of works that study the performance of the greedy algorithm in bandit optimization (e.g., in the contextual setting). This work seems to be the first to study the performance (in the Bayesian setting) of both the standard bandit and greedy algorithms in the regime of many arms. The obtained theoretical results in the many arms regime seem novel. These are of interest since in many real-world problems the number of arms can be huge, while the number of optimization rounds can be limited. The obtained regret rates are supported by the lower bound performance. A simple idea of subsampling seems to work well in both theory and practice. The authors go one step further to show that the better regret rates can be obtained for a special (but still a large) family of sub-gaussian reward distributions. Some non-standard tools are used in the analysis such as Lundberg inequality. There are also some interesting additional contributions in Section 7, i.e., going beyond 1-regular priors.

Weaknesses: Overall, I do not find any very major weaknesses when it comes to this paper. Apart from that, it seems that Assumption 1 is crucial in obtaining results of almost every theorem. However, although the statement of this assumption is easy to understand, I feel that the role (intuition, importance) of this assumption in the theoretical analysis is not properly discussed. Next, the main algorithmic idea is very simple, and it is not clear whether the obtained upper bounds for the (ss)-greedy algorithms and sub-gaussian rewards can be improved. Currently, there is a linear dependence on T in the small k regime. Is this unavoidable? Finally, I have some concerns when it comes to Section 6. Please see below for more details.

Correctness: I haven’t checked the proofs in the appendix carefully, but I did not find a good reason to believe that the presented results are incorrect.

Clarity: Overall, the paper is clearly written, however, the paper would benefit from more intuition/explanations when it comes to the main used assumptions.

Relation to Prior Work: This paper clearly discusses the connection to previous works and how it differs from other recent ones that also study the performance of greedy algorithms in some related but different settings.

Reproducibility: Yes

Additional Feedback: Post-rebuttal comments: I've read the rebuttal and other reviews. The authors have addressed most of my concerns and hence I increase my score. I hope the authors would make the suggested edits in the revised version and explain the role of their main assumption. _______________________________________________ Some comments/questions: -- Assumption 1 seems to be the main ingredient of all the main theorems, while the intuition of this assumption and its impact on the analysis (i.e., challenges) are not properly discussed. Can you explain why things fail if this assumption does not hold? -- Why is Assumption made in Theorem 2 stronger compared to Assumption 1? How are these compared? -- What are the inherent difficulties in the analysis of the greedy algorithm when it comes to sub-Gaussian rewards (in comparison to Bernoulli rewards) that result in higher regret rates? -- In the case of sub-gaussian rewards, the bounds for (ss)greedy do not match the lower bounds, and in the small k regime bounds are linear in T. Currently, it is not clear if these bounds can be improved. -- Are there other ideas for subsampling other than uniform sampling? Can you make use of a prior (in the case it is informative)? -- Can you elaborate on considering the Bayesian framework for the “many arms” problem? Can similar insights/results be obtained in the frequentist setting? Simulations: -- Can you comment on the high variance of the performance of greedy (when d=2) in Figure 2 in comparison to its performance when d=6? Next, why is the variance of the greedy methods larger than in the case of TS or OFUL? This raises the question of whether we should really care about the expected performance when it comes to the greedy algorithm. -- Why would contextual simulations be considered since this is not the actual setup studied in the paper? In particular, the context diversity can be present here as well, and for which we know that the greedy algorithm works well. Can the results be due to this form of exploration only?! -- “Performance of Greedy is better for d=6, due to another source of exploration…” -- What is this conclusion based upon?