NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:554
Title:Combinatorial Bandits with Relative Feedback

Reviewer 1

The authors study the problem of online learning with subset-wise preferences with relative feedback information under the Multinomial logit choice model. More speci_x000C_cally, they study two regret minimisation problems over subsets of a fi_x000C_nite ground set [n]. In the fi_x000C_rst setting, the learner can play subsets of size bounded by a maximum size and receives top-m rank-ordered feedback. For this setting, they give a lower bound and an instance-dependent and order-optimal winner-regret algorithm with regret O( n/m \ln T) based on the well-known upper-con_x000C_dence bound (UCB) strategy. In the second setting the learner can play subsets of a _x000C_xed size k with a full subset ranking observed as feedback, they also derive a regret lower bound, and then give a UCB-based algorithm, along with a matching upper bound regret of O(n/k \ln T). The authors also provide an experimental study with synthetically generated data to validate their _x000C_findings. The authors give a thorough study of interesting extensions of the dueling bandit problem, including deriving lower bounds, and giving algorithms with theoretical guarantees. The paper has a good quality and is written in a good way, and should be signifi_x000C_cant for the preference- based learning community.

Reviewer 2

For me, the regrets they introduced seem unnatural. In their feedback settings, the best arm identification framework is more appropriate. Show some cases in which the introduced regrets are more appropriate. [comments after authors' feedback] I understood that their regret definition is a natural extension of the popular dueling bandit regret. However, I still think that regret minimization setting is unnatural considering real applications.

Reviewer 3

The paper considers a bandit problem where the learner selects a subset of (up to) k>= 2 arms in each round. Subsequently, the learner observes as feedback a rank-ordered list of m >=1 items from the subset, generated probabilistically according to the Pluckett-Luce distribution on rankings based on the multinomial logit choice. That is, each arm is associated with a weight and, at each round, the online algorithm plays a subset S of at most k arms. It then receives feedback according to one of the following models: 1. Winner Feedback: The environment returns one item from S. The probability of each subset is proportional to its weight. This scheme is known as the ``Placket-Luce probability model" on S. 2. Top-m Ranking Feedback: The environment returns an ordered list of m items sampled without replacement from the Plackett-Luce probability model on S. The main contribution of the author(s) is to give algorithms for regret minimization that perform optimally (up to constants) under the assumptions of the feedback models (1) and (2). The exact bounds on the regret are somewhat involved to state here, but they depend logarithmically on the number of rounds and, interestingly (especially in terms of upper bound), on the weight distribution over arms. In terms of techniques, while I am not an expert in the area so I cannot judge their novelty, I did find the proposed algorithms and their analysis interesting and intuitive. A possible criticism is that the setting considered by the author(s) is similar to submitted Paper 396 (indeed, it was flagged as a possible dual submission). That said, the two papers seem to use sufficiently different techniques. Overall, I found the paper interesting. My main concern has to do with whether the assumptions of the proposed models are natural enough, and their relevance to practice.