The reviewers warmed up to this paper during the discussion. Its scores increased from (5, 5, 6, 7) to (6, 6, 6, 7). Although we agreed that there are shortcomings, there are also new results. So the paper is worth accepting. My additional comments are below: 1) The title is indeed confusing. The authors study a very specific problem and it is unclear how this addresses the bigger picture of regret minimization when no arm can be pulled twice. My opinion is that the learning agent has to learn to sort the arms. If you cannot pull the best arm twice, the second best thing is to pull the best two arms. In general, you want to pull the best K arms in the descending order. 2) Sections 5.1 and 5.2 study the above problem. I find the solutions unconvincing though. For instance, in Theorem 1, the regret seems to be driven by the hardness of sorting the best and second best arms, in term 1 / (p_1 - p_2)^2. What happened to the other arms? This is due to the algorithm design, which only learns to identify the best cluster. 3) The proposed algorithms are impractical. The authors essentially try to redo online low-rank factorization without going into the latent space.