__ Summary and Contributions__: Several authors have tried to mimic the idea underlying Thompson sampling by replacing posterior sampling with bootstrap sampling. Unfortunately, when little data has been collected, subsampling does not adequately reflect the range of possible uncertainty. These algorithms require some extra randomization to succeed.
This paper proposes a modification of this approach. The authors sample only from the observations of the 'leading' arm and compare the sub-sampled mean estimate to the empirical means of the challenger arms. All victorious arms are sampled, and the algorithm proceeds to the next round. The paper shows that for several common distributions the algorithm attains the Lai and Robbins lower bound on regret. Numerical experiments show very competitive performance.
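As a rough illustration of the duel mechanism described above (not the paper's exact algorithm), one round might look like the following sketch. The leader rule, the `history` representation, and the `duel_round` helper are all hypothetical names chosen here for exposition:

```python
import random

def duel_round(history, rng=random):
    """One round of a sub-sampling duel (illustrative sketch only).

    `history` maps each arm to its list of observed rewards. The leader rule
    (most observations, empirical mean as tie-break) is an assumption here.
    Returns the set of arms to pull next.
    """
    mean = lambda xs: sum(xs) / len(xs)
    # Leader: the arm with the most observations (empirical mean breaks ties).
    leader = max(history, key=lambda a: (len(history[a]), mean(history[a])))
    winners = set()
    for arm, obs in history.items():
        if arm == leader:
            continue
        # Sub-sample as many leader observations as the challenger has,
        # then compare the sub-sampled leader mean to the challenger's mean.
        sub = rng.sample(history[leader], min(len(obs), len(history[leader])))
        if mean(obs) >= mean(sub):
            winners.add(arm)
    # Every victorious challenger is pulled; if none win, pull the leader.
    return winners if winners else {leader}
```

Under this sketch, a challenger with a clearly worse empirical mean loses every duel, so only the leader is pulled; a clearly better challenger wins and is sampled next round.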

__ Strengths__: There have been so many papers about MABs with independent arms that it is hard to have really creative and elegant new ideas. In my opinion, the algorithms in this paper are exactly that. Moreover, the algorithms are simple enough that I could imagine them being used in practice. Beyond this, the paper seems generally to be well written and technically sound.

__ Weaknesses__: For my taste, the numerical experiments in the body of the paper should focus on problem settings that are more practical (not 2 arms with tens of thousands of observations). The regret analysis is already compelling in that regime. The numerical experiments in the appendix were more convincing to me.

__ Correctness__: I did not find any mistakes, but I did not verify all of the proofs in the appendix.

__ Clarity__: Yes. Thank you to the authors.

__ Relation to Prior Work__: I am not an expert on the sampling based nonparametric bandit algorithms of the last 2 years, so the authors could have overlooked something I did not catch. But they certainly credit the influence of past work on this paper and clearly discuss differences.

__ Reproducibility__: Yes

__ Additional Feedback__: I got very stuck trying to understand what happens when some arms have been sampled zero times. What do we use for an empirical mean? I imagine many choices work for the theory, but please clarify what you use in the code. (Perhaps I missed this)

__ Summary and Contributions__: -The paper proposes a non-parametric multi-armed bandit algorithm based on 2 core ideas: sub-sampling and dueling.
-The new method combines ideas from 2 previous works; it uses a sub-sampling scheme that is independent of the observations, and performs the duels in rounds between the leader arm and all the other arms. The independence of the sub-sampling scheme enables the use of concentration results for sub-samples, and the duels in rounds enable generalization to many arms.
-The resulting algorithm is simple to implement, does not require distribution-specific tuning (as in parametric bandit algorithms), does not require wasteful forced exploration, and guarantees asymptotically optimal performance for a large class of exponential family distributions.

__ Strengths__: The proposed method is simple to implement, and can be applied to various reward distributions (including those with unbounded support) without knowledge of the parametric form of the distribution. At the same time, it guarantees asymptotically optimal performance. Compared to parametric methods, the proposed method would be much more useful in practice, as it is usually hard to verify parametric assumptions and to tune hyperparameters in the online learning setup.

__ Weaknesses__: The authors show that the proposed algorithm is asymptotically optimal when rewards follow a Gaussian distribution with "known" variance, and also present experimental results for this case. However, it is unclear how this "known" variance is incorporated in the algorithm. The presented Algorithm 1 does not require any input. The authors should clarify this point.

__ Correctness__: The claims and methods are stated in a clear manner.

__ Clarity__: The paper is well-motivated, and the exposition is clear.

__ Relation to Prior Work__: The authors did a thorough review of previous works, comparing them one by one with their method in terms of theoretical guarantees, empirical performance, assumptions, computational cost, and robustness to the reward distributions.

__ Reproducibility__: Yes

__ Additional Feedback__: ==== Comments after rebuttal ====
Thank you for the clarification.

__ Summary and Contributions__:
The paper proposes the sub-sampling duelling algorithm (SDA) approach to achieve an efficient multi-armed bandit regret result that matches the asymptotic lower bound. The main feature of the SDA approach is that it is non-parametric: it works for several families of distributions and does not require knowing the distribution family beforehand. This approach seems to be different from the UCB approach and the Thompson sampling approach, the two dominant approaches in the MAB literature. It has the potential to provide alternative learning methods in practice.

__ Strengths__:
- The paper proposes the SDA approach, which is different from the dominant UCB and Thompson sampling approaches, and has some potential to become a new approach in practice.
- The instantiation of the SDA approach, RB-SDA, provides the theoretical guarantee that matches the lower bound result.
- Empirical evaluation in synthetic data sets show that the approach is promising.

__ Weaknesses__:
- One key question that I did not find answered in the paper is whether the proposed algorithm still provides any regret guarantee if the unknown distribution does not satisfy the conditions given in Theorem 1. This is very important for understanding the limitations of the approach.
- The approach is not brand new; it builds upon prior work [1] and [2]. In particular, the SSMC algorithm in [2] also uses the sub-sampling approach and also achieves the optimal result in a nonparametric way. The only difference seems to be that [2] requires forced exploration while the current paper does not.

__ Correctness__:
The claims seem to be correct, though I did not check the complete proofs in the appendix.

__ Clarity__:
The paper is in general easy to follow, but some points need to be further clarified.

__ Relation to Prior Work__:
Yes.

__ Reproducibility__: Yes

__ Additional Feedback__:
The authors propose the SDA approach, which has the potential to be an alternative to UCB and Thompson sampling for bandit learning. The main advantage of the proposed approach is that it is non-parametric, and for exponential families of distributions it can achieve the optimal regret bound matching the lower bound, without knowing which distribution family the unknown distribution belongs to.
However, I feel that achieving the optimal regret guarantee is mostly of theoretical interest. UCB may not be optimal in this sense, but it achieves a consistent regret guarantee for all distributions. For the current approach, e.g. the RB-SDA algorithm, it is unclear to me whether its theoretical guarantee of O(log T) regret could be achieved for any distribution. Theorem 3.1 is only stated in the way that conditions 1 and 2 need to hold for the regret guarantee. What if 1 and 2 do not hold? Can we still achieve O(log T) regret (perhaps not optimal)? If SDA could achieve consistent O(log T) regret for all distributions, and optimal regret for the exponential family in a nonparametric way, then I believe SDA would become an attractive alternative to UCB or Thompson sampling in practice. But if not, then I would only view SDA as of theoretical interest, because one cannot typically guarantee that the unknown distribution encountered in practice belongs to an exponential family. I believe this is an important missing point in the paper, and a positive answer may significantly raise the contribution of the paper.
The second issue of the paper is its novelty compared to prior work, especially the SSMC algorithm from [2]. From the authors' description in the paper, it looks like SSMC in [2] already achieves the nonparametric optimal regret bound for exponential families of distributions. The main difference is that SSMC uses a deterministic sub-sampling scheme that depends on the past rewards, while the SDA framework proposed in the current paper uses random or deterministic sub-sampling that does not depend on the history. Also, SSMC requires forced exploration for arms that have very few samples, but SDA in this paper does not. But such differences look minor, and thus the contribution of the paper also seems minor.
Other issues:
- Line 30-31. It talks about optimal tuning, and in various places the paper talks about optimal regret. I think this needs to be more specific and needs more detailed discussion. UCB already matches the asymptotic lower bound in the gap terms and the time horizon term T. So what exactly is the optimal regret? Does it mean matching the constant term, or measuring the gaps in terms of KL divergence?
- Algorithm 1. The SP-SDA algorithm conducts a duel between the leader and every other arm. This introduces a separate notion of rounds. Is it necessary? Could the algorithm instead just choose a random opponent from the arm set and play one duel between the leader and this opponent? Some discussion would be helpful.
- Line 139, "SSDA" appears here and in some other places, whereas in most places the authors use SDA. What is SSDA? Or is it just a typo?
- Line 161, "... that have non-overlapping support". The non-overlapping support is not very clear to me. It needs more elaboration.
- Experiment section.
UCB is not compared. I think it is important to compare against UCB, since UCB is the method used in many solutions. Moreover, different exploration parameters for the UCB confidence radius term should be tried to see their effect compared to the sub-sampling method.
- Lines 256-257
The sentence explains that the lower bound lies above the upper bounds because the lower bound is asymptotic. But it still looks strange to me that the lower bound is significantly above all of the upper bounds. Is it really just due to the asymptotic behavior, or is there some other issue in the experiments? Perhaps the authors could run the experiments with a much longer time horizon to see if the lower bound eventually goes below the algorithms' regret. Figures 1 and 2 just look suspicious in this regard.
--------- Post Response Comments -----------------
I am in general satisfied with the authors' response, and agree with the other reviewers that the paper is worth publishing at NeurIPS. The authors mention that some minor changes would allow their algorithm to work with bounded distributions. I would very much like to see this research, and hope that the authors could add this part to their final version.

__ Summary and Contributions__: The authors propose a new algorithmic framework for multi-armed bandits which is based on resampling, and in particular sub-sampling and dueling (hence the SDA name). It is shown that this algorithm provably matches the Lai and Robbins lower bound for Gaussian, Bernoulli, and Poisson distributed arms under limited assumptions.

__ Strengths__: The work appears quite strong theoretically, and provides an interesting framework for future work. It introduces a new algorithmic setting with limited assumptions on the underlying distributions. Empirically, the work backs up the theoretical results and seems quite strong and extensive. Note, though, that all algorithms compared against perform quite similarly (as one might perhaps expect from their bounds).
The authors also do a good job of relating to previous approaches, and the work is quite clear and the various Lemmas are well explained---although more on this below.
The authors also plan to release code for this work and their experiments and have included it with the submission, which is great.
Overall I found this to be a strong paper and worth accepting!

__ Weaknesses__: What follows are a number of complaints/weaknesses of the work that center mainly around notational/descriptive clarity. It is worth noting that while I think addressing these would make the paper better, they are otherwise somewhat minor complaints.
The first is that while the work itself is quite clear, some of the notation is quite dense and terse. Some of this does obviously come from previous literature. For example, on line 95 we have Y_{k_i_r , N_{k_i_r}(r)} which although it does make sense can be difficult to unpack. I'm not sure how much this can be improved, but it would definitely be great to simplify in any way.
Similarly, the paper could use some additional signposting of how results will be used later on in the work. For example, up to around line 111, it is unclear what form the sampling mechanism SP will take. This is given below, but having some brief description ahead of time would have made the algorithm easier to understand (with details obviously given later). Similarly, more description and an attempt to intuitively explain condition 2 of Theorem 3.1 would be quite useful. These conditions come from the proof itself, but I would have liked to see more discussion of their implications. The authors do show how this is satisfied in the appendix for the distributions noted above, but I would have also liked to see slightly more discussion of this---albeit briefly, as space is obviously a concern.
Finally, a greater discussion of the computational complexity would have been useful, and/or additional results for longer horizons when this complexity can become more of a burden.

__ Correctness__: As far as I can tell the work is correct (I have not fully read the proofs in the appendix however). The empirical results back this claim up and are relatively extensive.

__ Clarity__: Yes, minus the (minor!) caveats given above.

__ Relation to Prior Work__: As far as I can tell the authors situate this work quite well within previous work.

__ Reproducibility__: Yes

__ Additional Feedback__: