Optimal Algorithms for Stochastic Contextual Preference Bandits

Part of Advances in Neural Information Processing Systems 34 (NeurIPS 2021)

Bibtex Paper Reviews And Public Comment » Supplemental

Authors

Aadirupa Saha

Abstract

We consider the problem of preference bandits in the contextual setting. At each round, the learner is presented with a context set of $K$ items, chosen randomly from a potentially infinite set of arms $\mathcal D \subseteq \mathbf R^d$. However, unlike classical contextual bandits, our framework only allows the learner to receive feedback in terms of item preferences: At each round, the learner is allowed to play a subset of size $q$ (any $q \in \{2,\ldots,K\}$) upon which only a (noisy) winner of the subset is revealed. Yet, same as the classical setup, the goal is still to compete against the best context arm at each round. The problem is relevant in various online decision-making scenarios, including recommender systems, information retrieval, tournament ranking--typically any application where it's easier to elicit the items' relative strength instead of their absolute scores. To the best of our knowledge, this work is the first to consider preference-based stochastic contextual bandits for potentially infinite decision spaces. We start with presenting two algorithms for the special case of pairwise preferences $(q=2)$: The first algorithm is simple and easy to implement with an $\tilde O(d\sqrt{T})$ regret guarantee, while the second algorithm is shown to achieve the optimal $\tilde O(\sqrt{dT})$ regret, as follows from our $\Omega(\sqrt {dT})$ matching lower bound analysis. We then proceed to analyze the problem for any general $q$-subsetwise preferences ($q \ge 2$), where surprisingly, our lower bound proves the fundamental performance limit to be $\Omega(\sqrt{d T})$ yet again, independent of the subsetsize $q$. Following this, we propose a matching upper bound algorithm justifying the tightness of our results. This implies having access to subsetwise preferences does not help in faster information aggregation for our feedback model. All the results are corroborated empirically against existing baselines.