NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:6022
Title:Distribution oblivious, risk-aware algorithms for multi-armed bandits with unbounded rewards

Reviewer 1

The main idea of this submission is the new concentration inequality, Theorem 2, proved for analyzing the proposed algorithm. To generalize from bounded random variable X to X that satisfies the mild assumptions made in the submission, the application of a truncation-based estimator plays a crucial role for developing a new concentration inequality. Another idea is to introduce the linear combination of the mean and CVaR studied in financial community as the optimization objective. This formulation includes the expected reward maximization and the CVaR minimization as special cases. The proposed Algorithm 1, Generalized successive rejects, identifies the arm with minimum loss and CVaR has bounded misidentification probability shown in Theorem 4. The resulting bounds are discussed and compared to that of non-oblivious settings, including the probability in terms of T (number of pulls), the minimum requirement of T, and the interaction between T and the truncation parameters. The authors also show that when the same information is provided as the non-oblivious setting, Algorithm 1 achieves the same performance as that proposed in the non-oblivious setting. Simulations verifying the performance of the proposed algorithm and demonstrating how truncation parameters affect the performance are provided. The reviewer has concerns about the submission. Please see Improvements for details. Originality. The submission relaxes the assumptions on the distributions. A novel concentration inequality is proved. Quality. The analysis is rigorous. The performance of this work is compared and discussed with the non-oblivious ones. Clarity. The submission and the analysis is well written. Significance. Bounds on the misidentification probabilities under milder assumptions on distributions are proved. A new concentration inequality is proved and could be applied to other research. ------------------------------ Update Thank you for the response. I am satisfied with the authors' reply, especially the first point clarifying the algorithm actually doesn't need C2.

Reviewer 2

The paper considers a new pure exploration problem where the target criterion is a linear combination of the mean and the conditional value at risk. While the algorithmic framework is the usual Successive Rejects (the fixed budget setting), the newly proposed estimator and its concentration inequality enables the algorithm to be agnostic (=oblivious) to the type of the distributions (up to very mild assumptions). The empirical results are compelling, and it suggests that the theory might be tightened up. Overall, a well-written paper. The key novelty is in the new estimator and concentration inequality that is not restricted to bounded random variables. The idea of setting b value as a function of n and enjoying guarantees when n is large enough seem like an interesting idea though similar concepts appear in the literature. I think the paper is in a good shape. It took me some time to appreciate "where" the distribution-obliviousness comes from, since the SR algorithm has been distribution-obliviousness, I thought (e.g., we don't need to know the variance when the arms are sub-Gaussian). However, I realized that it becomes interesting in the heavy-tailed distribution / CVaR regime where the estimator itself would require some knowledge on the distributions itself. I don't think the authors said anything incorrect, but maybe they can avoid confusion by adding more explanation. - "MAB formulations consider reward distributions with bounded support, typically [0, 1]. Moreover, the support is assumed to be known beforehand, and this knowledge is baked into the algorithm. " attacking this is quite weak, since many times you can relax this to \sigma^2-sub-Gaussianity, and for SR you don't need to know \sigma ahead of time. It'd be nice to address this case as well. - It might be helpful to explain that the sample average in used in the standard SR does not serve the purpose in heavy-tailed distribution / CVaR problems anymore, and naïve applications of existing techniques would require information about the distribution." # minor comments - Note the authors are not using the correct formatting for the submission, so there is no line numbers in the submission. This makes things a bit difficult. - I am pretty sure the definition of v_\alpha(X) should have ">= \alpha" rather than "<= \alpha". Otherwise v_\alpha(X) would be -\infty. - It would've been easier to define CVaR as E[Y | Y \ge VaR_\alpha(Y)] then say it becomes c_\alpha(X) defined in the paper since that's easier to understand the meaning. I myself had to track down the reference to dig it out. - In section 4, first paragraph, I think (\xi_1, \xi_2) = (1,0) for the first appearance rather than (0,1). - Figure 2 (a), b_m must be q_m? ==== [after rebuttal] I stand by my score. It'd be great if you could make the paper more accessible to nonexperts by adding definitions/explanations/references on the basic concept such as CVaR.

Reviewer 3

This paper proposes an algorithm for the multi-armed bandit, where the goal is to identify the arm that has the minimum value of the conditional value at risk (CVaR) of the reward, which may be heavy-tailed. The proposed algorithm is enabled by the new truncation-based estimator for the CVaR, for which new concentration bound is established. The new truncation-based estimator is of interest in its own right and may find applications beyond multi-armed bandit. Overall, the paper makes solid contributions primarily from mathematical perspectives. The proposed algorithm has limited applicability for the following reasons (please explain if there are misunderstandings): - The total number of arm-pulls needs to be known in advance to determine the number of arm-pulls in each stage. - To compute the estimator, all of the observed reward needs to be stored. I also wonder what real applications involve heavy-tailed reward. I would like to see experimental comparisons of the new truncation-based estimator against existing estimators of CVaR. --- I have read the responses from the authors. I am not completely sure exactly how the "doubling trick" and "summary statistics" can be used. I hope the authors explain them in more detail in the final version. Also, I understand that heavy-tailed distributions are ubiquitous but not sure exactly what applications would benefit from optimization via bandit. In any case, this paper makes sufficient contributions to be accepted.