Review for NeurIPS paper: Latent Bandits Revisited

NeurIPS 2020

Latent Bandits Revisited

Review 1

Summary and Contributions: The paper considers the latent bandits with context information, where the parameter vector is available from offline evaluations (either exactly or approximately). The paper proposes both UCB and Thompson Sampling-based algorithms for the problem and demonstrates their efficacy via synthetic/real-world experiments

Strengths: The main strength I believe lies in the practicality of the setting. Often a parameter estimate is available from offline estimates and the ideas proposed (m2 variants) use these carefully to develop efficient algorithms.

Weaknesses: While the authors perform empirical evaluation on synthetic and real world data, they seem to restrict themselves to "fast personalization" horizons. The algorithm per se is not designed keeping short horizons in mind. It appears from the experimental results that the general trend indicates that LinTS would perform better than m2TS if the horizon was extended a bit more. It would be nice if the authors can discuss this a bit more.

Correctness: The empircal methodology is reasonable. The authors give a slightly corrected version of their proofs in the supplementary material (where the constants change). Other than that, the overall proof techniques seem to be standard.

Clarity: The paper is well written and is clear.

Relation to Prior Work: Looks reasonable.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: The paper considers a bandit setting with a latent space: each user is associated to a state which controls her preferences. The setting assumes an approximate version of the latent space is given by some offline algorithm. Therefore, the bandit algorithm is aiming at identifying the state of the user which corresponds the most to her preferences. 4 algorithms are proposed: * 2 based on upper confidence bounds (UCB), 2 based on Thompson Sampling (TS) approach; * 2 assuming knowledge of the exact latent space, 2 assuming an approximation thereof. A proof of an upper-bound on the worst case regret is given for the UCB-based algorithms and a proof of an upper-bound on the bayesian regret is given for TS-based algorithms. Some experiments compare the result of the proposed algorithm vs. state of the art algorithms making no use of the latent model, or using it in a conservative way. --- Post Rebuttal --- Thank you for your precise remarks. I'm eager to look at a more refined analysis with respect to epsilon; either in the context where mmTS posterior reduces the impact of epsilon, or in the entire bandit setting which would include the offline learning. Note that Zhou and Brunskill (2016) handle the entire setting by adding a uniform exploration for the \tau_u first steps of each sequence of recommendations, which was sufficient to control the evolution of epsilon along successive sequences (under some assumption on the offline learning problem).

Strengths: The work presents and analyzes in a unified way a set of approaches (UCB-like and TS-like), of settings (independent arms and contextual bandit), and of assumptions (exact and approximate latent space).

Weaknesses: First, the approach puts appart the offline learning of the latent factors. While this learning build upon much more data, it also suffers an exploration-exploitation tradeoff. Secondly, the unified analyses cannot cover instance-dependent bounds of the regret which inform about the settings which are harder to handle.

Correctness: yes

Clarity: yes

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: The paper adds to the algorithmic and regret analysis of the contextual latent bandits model of Zhou and Brunskill (itself based on work by Maillard and Mannor), which is designed to allow quick adaptation of bandit models to user characteristics. The paper adds a conservative correction to deal with model error and explores the use of Thompson sampling as an alternative to UCB algorithms.

Strengths: A cleanly written and technically competent paper with analysis and experimental results suggesting that the approach could be useful in practice. Does a good job of motivating the model. The "pedagogical" experiments nicely illustrate the benefits of the algorithmic contributions.

Weaknesses: The algorithms are essentially standard and the results are somewhat incremental. The algorithm changes and analysis that allow for epsilon-inaccurate conditional models are fairly minor. The regret analysis for Thompson sampling follows the lines laid out by Van Roy. From a technical point of view, the possible weak point is in the need to supply an epsilon to the m^2UCB algorithm. Often, the available epsilon-bounds are wildly pessimistic and would render the algorithm useless, so you'd have to just tweak it experimentally to get something to work. Similar issue with a misspecified hyperprior P_1 for the m^2TS algorithm. The paper makes no attempt to analyze the problem in computational terms: how hard is it? Could one compute an optimal policy? If that could be done even for small cases, how does your algorithm compare to optimality in practice? (The graphs show "optimal", but that's post hoc clairvoyant.) The algorithmic ideas here are all inherited from bandit algorithms, but this is not really an arm-space bandit problem from the computational viewpoint because the reward models are known.

Correctness: The theorems appear to be correct. The paper could do better than gloss over the epsilon/hyperprior issue mentioned above.

Clarity: The paper is very clear and maintains the right level of detail in the main body. Related to the point that this is also a value-of-information problem, might be useful to show the setup as a Bayes net or plate diagram Minor comments: l.30 adapt it results -> adapt its results l.49 "its effectiveness" - referent is "algorithms" or "approaches", so plural? l.77 "offline interaction data" is an oxymoronic phrase; I assume you mean repositories of interaction data? l.115 "An advantage of TS over UCB is that it obviates the need to design UCBs" - hmmmm. l.203 i.e. tensor decomposition -> e.g., tensor decomposition l.204 as size of offline dataset -> as the size of the offline dataset l.218 no offline -> no offline model l.218 "using offline reward model as experts" ?? rewrite l.226 "non-" ?? They have 10 arms. Do you mean non-contextual? l.227 uniformly at random -> uniformly at random: l.321 yadkori -> Yadkori [check the pdf!] l.339 thompson -> Thompson Missing sources in [16] and [30]

Relation to Prior Work: Does a good job within the narrow area of latent/contextual bandits. This has become a fairly dense area these days so I could be missing something. l.21 "explicit exploration for (latent) state identification (i.e., reducing uncertainty regarding the true state) is less common in practice." - Is this really true? It seems close to a standard value-of-information problem. The "bandit" aspect seems nearly orthogonal. Possibly some of the many results by Krause and collaborators on optimal and near-optimal information-gathering would apply here.

Reproducibility: Yes

Additional Feedback: Re broader impacts: content selection algorithms that explicitly aim to identify user "type" could exacerbate stereotyping and lead to enhanced psychological manipulation if combined with RL-like methods that learn to apply content *sequences* to change user mental state in profit-maximizing directions.

Review 4

Summary and Contributions: This paper provides new algorithms for contextual multi-arm bandits where reward depends not only on a context x and an action (or arm) a but also on a latent variable s, which is assumed to be unknown. Both UCB and Thompson sampling algorithms are proposed and regret bounds are derived.

Strengths: - The problem of "Latent bandits" is useful as often the some useful information (e.g. intent behind action) in off-line data is not known. - The paper is well-written and provides a theoretical analysis of the algorithm.

Weaknesses: Given the previous works such as Latent Bandit [23] and the connections drawn between UCB and TS in Russo and Van Roy [26], the results of this paper, especially incorporating model uncertainty in the latent bandits seems straightforward. Therefore, novelty is weak. Proof techniques are also heavily borrowed from the previous papers.

Correctness: As far as I can tell the claims made and methods used in this paper are correct, though I have not checked all details for the proofs in the supplementary.

Clarity: The paper is easy to read and usually clear except the problem setting for the MovieLens experiments. See my comments below.

Relation to Prior Work: The paper discusses some closely related works but it could be better if it could also cover the work on latent bandits with confounding information and its relation with them.

Reproducibility: No

Additional Feedback: Comment on the proposed algorithm: Overall, the algorithm seems reasonable and is simple. I have the following questions: - The regret bounds for the misspecified model case (Theorem 2 and Corollary 2) are linear in n, does this mean the proposed algorithm is not efficient in this case? For a fixed size off-line dataset, this could be a problem. - Does the algorithm assume the latent state to be fixed throughout the optimisation or can it be changed over time? - What about the number of latent states, it seems the number needs to be known in advance? If not so, it is not clear how will the proposed algorithm tackle that case? Comments on the experiments: The proposed methods are compared with some methods with no offline model, and Exp4. While Exp4 performance seems a bit surprising, I am wondering how will the proposed methods compare with a simple baseline where latent contexts are identified via an online clustering and then we simply use the traditional contextual MAB algorithms on top of clustering? Further, to measure its effectiveness, the proposed method needs to be compared with the latent bandit paper in [23]. For the MovieLens experiments, I could not understand the problem setting properly, it is not clear to me in this setting what is the context, what is action and what is a latent state? This needs to be clearly described. Minor points: -The definition of \mu function is not consistent, e.g. we have \mu(a,x,s) or \mu(a,x,s,\theta), is it on purpose? -Line 164: Should s* be s? -Line 30: "it" should be "its" *** Post rebuttal *** After reading the rebuttal, I have updated my score.