NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID 5120: Contextual Bandits with Cross-Learning

### Reviewer 1

This paper studies contextual bandits with cross-learning. Existing literature, which does not assume any structure among the contexts, establishes a weak O(\sqrt{CKT}) regret bound. In this paper, the authors assume that the rewards corresponding to every context are observable whenever an arm is pulled, and prove an O(\sqrt{KT}) bound under two settings: the setting where the contexts are adversarial but the rewards are stochastic, and the setting where the contexts are stochastic but the rewards are adversarial. The paper also proves a lower bound of \Omega(\sqrt{CKT}) for the setting where the rewards and contexts are both adversarial, arguing that cross-learning does not help when both are adversarially generated. The paper abstracts a number of interesting applications into a new theoretical problem and presents two different algorithms, UCB1.CL and EXP3.CL, along with their theoretical analysis.

The paper acknowledges that UCB1.CL and EXP3.CL are minor variations of the existing bandit algorithms UCB1 and EXP3, and claims that their analysis requires significantly new ideas. However, I think the paper fails to convincingly argue that existing proof techniques (for both UCB1 and EXP3) do not extend to the problem setting under consideration. In particular, the paper does not explain why the "optimism under uncertainty" proof argument (see, for example, Section 4 in [1]) does not carry over to proving the regret bound of UCB1.CL: since we observe the rewards corresponding to all contexts for any selected arm, and hence can construct the corresponding confidence intervals, that argument can seemingly be used verbatim to prove the desired bound.

[EDIT] After considering the author response, this reviewer no longer holds the earlier concern about the correctness of the lower bound, and has changed their score accordingly.

References
[1] Agrawal, Shipra, and Nikhil R. Devanur. "Bandits with concave rewards and convex knapsacks." Proceedings of the fifteenth ACM conference on Economics and Computation. ACM, 2014.
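To make the cross-learning feedback concrete, a minimal sketch of a cross-learning UCB (this reviewer's own illustration under simplifying assumptions — uniform stochastic contexts, Bernoulli-style rewards — not necessarily the authors' exact UCB1.CL) looks as follows: pulling arm a_t yields a reward sample for a_t in every context, so confidence intervals for all (context, arm) pairs tighten on each round.

```python
import math
import random

def ucb_cross_learning(K, contexts, reward_fn, T, rng):
    """Illustrative cross-learning UCB sketch (not the paper's exact
    UCB1.CL): pulling arm a in any context reveals a reward sample for
    arm a in *every* context, so the per-(context, arm) statistics
    grow much faster than under ordinary UCB1."""
    means = {(c, a): 0.0 for c in contexts for a in range(K)}
    counts = {(c, a): 0 for c in contexts for a in range(K)}
    total_reward = 0.0
    for t in range(1, T + 1):
        c_t = rng.choice(contexts)  # context arrival (uniform, for the demo)

        def index(a):  # optimistic upper confidence bound for arm a in c_t
            n = counts[(c_t, a)]
            if n == 0:
                return float("inf")
            return means[(c_t, a)] + math.sqrt(2.0 * math.log(t) / n)

        a_t = max(range(K), key=index)
        # cross-learning: one sample of arm a_t's reward per context,
        # each context's running mean and pull count are updated
        for c in contexts:
            r = reward_fn(a_t, c, rng)
            if c == c_t:
                total_reward += r
            counts[(c, a_t)] += 1
            means[(c, a_t)] += (r - means[(c, a_t)]) / counts[(c, a_t)]
    return total_reward, counts
```

Note that after the update loop, every context holds the same pull count for the chosen arm — exactly the property that lets the standard optimism argument be applied per context.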

### Reviewer 2

When contexts are adversarial, a UCB-style algorithm that operates pointwise achieves good regret, especially in the partial cross-learning case, where the average clique size in the reward-elicitation graph is the relevant quantity. This algorithm is computationally viable when rewards have a convenient parametric form, obviating the need to enumerate contexts, and it seems likely to find practical application. The EXP3 variant is more of theoretical interest and is unlikely to be applied in practice without additional research, e.g., eliminating the need to know the context distribution.
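Why the context distribution matters can be seen in a generic EXP3-style sketch with cross-learning (an illustration under this reviewer's own assumptions, not the paper's exact EXP3.CL): one natural unbiased estimator importance-weights each observed reward by the marginal probability that the arm is pulled, \sum_c f(c) p_c(a), and computing that marginal requires knowing the context distribution f.

```python
import math
import random

def exp3_cross_learning(K, contexts, f, reward_fn, T, eta, rng, gamma=0.1):
    """Illustrative EXP3-style sketch with cross-learning (NOT the
    paper's exact EXP3.CL).  Hypothetical design choice: observed
    rewards are divided by the *marginal* pull probability of the
    chosen arm, sum_c f(c) * p_c(a) -- this is where knowledge of the
    context distribution f enters."""
    w = {(c, a): 1.0 for c in contexts for a in range(K)}
    for _ in range(T):
        # per-context sampling distributions, with uniform exploration
        p = {}
        for c in contexts:
            z = sum(w[(c, a)] for a in range(K))
            p[c] = [(1 - gamma) * w[(c, a)] / z + gamma / K for a in range(K)]
        c_t = rng.choices(contexts, weights=[f[c] for c in contexts])[0]
        a_t = rng.choices(range(K), weights=p[c_t])[0]
        # marginal pull probability of a_t, computable only because f is known
        q = sum(f[c] * p[c][a_t] for c in contexts)
        # cross-learning: arm a_t's reward is observed in *every* context
        for c in contexts:
            r = reward_fn(a_t, c, rng)
            w[(c, a_t)] *= math.exp(eta * r / q)
    return w
```

Without access to f, the normalizer q cannot be computed, which is precisely the obstacle to practical deployment noted above.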

### Reviewer 3

This paper explores contextual bandits with cross-learning, i.e., when an arm i is pulled in context c, the learner not only observes the reward for arm i in context c, but also observes the reward of arm i in all other contexts. The authors investigate the settings in which the contexts/rewards are stochastic or adversarial and show the corresponding regret bounds (or regret lower bound). Overall I think it's a well-written paper. The results seem sound and reasonable (though I didn't carefully check the proofs), and in particular, I found the analysis of UCB1.CL (which drops the dependency on C in the regret bound) to be non-trivial and interesting. I would lean towards acceptance.

Other comments: I'm not entirely convinced that auctions are best formulated as contextual bandits with cross-learning (which seems to be the major motivating example of this paper). As the authors also point out in the experiments, this requires the assumption that other bidders' valuations are independent of the bidder's own valuation. The other examples illustrated in the paper (e.g., multi-armed bandits with exogenous costs) are reasonable, though they don't seem as impactful/pressing as the auction setting. Honestly, though I don't have other applications in mind, I think the setting is general enough (especially with the added discussion on partial cross-learning) and worth investigating. However, given that one of the main contributions of the paper is to introduce this cross-learning setting, it would be nice if the paper could provide stronger motivation for it.

In the experimental setup, the authors assume the "true" valuations of some participants are known. How is this achieved? Are there additional assumptions on bidders' beliefs, so that you can infer the bidders' values from their bids in first-price auctions?