Reviews: Thompson Sampling for Multinomial Logit Contextual Bandits

I think overall this paper is a good contribution. The Bayesian and the frequentist analysis for the TS-MNL algorithm is novel. Especially the frequentist analysis builds on recently proposed proof techniques in [2,4] (for linear contextual bandits) and extends it to the MNL problem. The frequentist regret bounds matches the corresponding bounds in the linear case. I have a few concerns and it would be great if the authors can correct/respond to them: 1. Concerns: Lines 142-144 is confusingly written. y_{t,i} is supposed to belong in {0,1} as noted before. It is better to write that the y vector is sampled from the multi-nomial logit distribution given by equation below line 140, rather the sub-gaussianity based description where it seems like y_{t,i} can take values in [0,1], even though it is technically correct. Assumption 1 actually says that x_{t,i} is actually i.i.d. This should be clarified before in the paper, as there are multiple mentions of x_{t,i} varying over time. It should be clarified that the distribution is i.i.d. Line 4 in the algorithm: I think the monte-carlo sampling scheme should be discussed in more details in the body of the paper. Also, how does this approximate sampling the theoretical or practical regret performance of the algorithm, needs to be discussed. The exact details of the markov chains construction are missing from the appendix as well. Line 201: There is no needs to use the term oracle if a polynomial time algorithm exists to solve the problem exactly. You can just refer to the algorithm (I am just commenting on the usage of the term oracle). It would also be appreciated if the computational complexity of the said algorithm is revealed in the discussion. Add a list of differences from [6] earlier in the paper. A little more details about the experimental setup is needed for the main body of the paper. For example, which algorithm was used as the optimization algorithm for line 5 in the algorithm. The real experiments are essentially semi-synthetic. A model is learnt from the actual dataset and then that model is used as the ground truth parameter. Can you think of a better evaluation method? Typos: Line 24 -- exploitation-exploitation trade-off -> exploration-exploitation trade-off Line 24-25 -- there are epoch/epsilon greedy strategies as well Line 136: and and -> and

Overall, the paper is fine: setting, proposed approaches, theoretical analysis, and clarity. Still it may be improved in few ways. The main improvement would regard the motivation of the paper. Typically * the paper should give insight on why the best assortment selection with current assumption wrt. the reward distribution is different from the best assumption with cascading click model, or position-based click models; * the paper claims frugality of the proposed approach wrt. computation time (compared to UCB based approaches). That claim could be enforced by providing the computation time measured in experiments. It would also be enforced by a theoretical analysis of computation time. * current experiments are on simulated data (derived from real-world data). Experiments on real-world data with assortments would be more convincing. It's also unusual to have bibliographic citations linking to arXiv version of each paper without reference to the conference/journal where the paper was accepted. To name one, paper [39] has been accepted at UAI'16. ____________________________ POST REBUTTAL --- I thank the authors for their thorough answer. I agree with the authors regarding the originality and the soundness of assortment optimization as opposed to cascading/position-based model (let's denote them "independent optimization models") for multiple recommendation setting. Still, I think THE PAPER should include as much insights as possible to demonstrate the potentiality of assortment optimization: * some simple examples where assortment and independent optimization are selecting two different assortments (with substituable items, obviously), * some real-world datasets/analysis demonstrating the benefit of such modelization, * an empirical comparison to state of the art regarding independent optimization models. Such comparaison is even more required given that several papers with independent modelization were published in Machine Learning avenues. By the way, I would prefer the semi-synthetic data to be derived from real-world datasets with assortment properties. There are publicly available "baskets" datasets, I wonder if they may provide the required starting data. Finally, I acknowledge that appart from paper [39], arXiv citations concern papers which are still unpublished. Having five of them cited remains surprising.

Paper ID:	1775
Title:	Thompson Sampling for Multinomial Logit Contextual Bandits

Reviewer 1

Reviewer 2

Reviewer 3