Reviews: Low-Rank Bandit Methods for High-Dimensional Dynamic Pricing

The authors consider the problem of regret minimization in dynamic pricing, where the price/demand relation is linear and the linearity dependence is (dynamically) adversarial. The paper considers that this linearity actually comes from a low rank structure (d << N). From there, the authors propose two algorithms for this problem and provide theoretical regret bounds, scaling with d, while the current methods (without low rank embeddings) scale with N. The authors then evaluate the algorithms on synthetic data. Considering low rank structures is a good model and it could lead to significant improvements in practice. However, this paper presents several downsides detailed below. The considered model is not justified and seems different from the mentioned related work. Strong assumptions are required for the proper functioning of the algorithms and are sometimes omitted in the paper. Finally, I think the presented experiments are not well chosen (or poorly interpreted). For these different reasons, I give a weak reject score to the paper in its current state. -------------------------------------------- Major comments 1. The model of Equation (1) is justified as previously used in some mentioned papers. After having a look at a couple of them, it seems that the model they considered is pretty different. If they are indeed similar, this needs to be justified as this is not obvious at first glance. 2. Many assumptions are needed for the proper functioning of the algorithms (OPOK and OPOL). Some of them are mild, but some of them are very strong in my opinion. (A5) seems strong to me. I would have liked a more relaxed assumption such as "S included in some ball". Would the proof be different for such a setting ? More importantly, (A4) seems more problematic than what is pretended by the following paragraph (lines 137 to 143). If U is known, this paragraph is indeed accurate, as we would write U = O * P with O orthogonal, and we would then work with the transformed action Px instead of x. But if U is unknown (which is the significant setting), the transformed action Px is unknown. Thus, I do believe that this assumption is very strong in the setting where U is unknown. Furthermore, even in the unknown case, the algorithm needs to know exactly both d and r. It does not seem reasonable in practice, and we would prefer algorithms working with upper bounds of both d and r. Hence the relaxed inclusion version of (A5) I mentioned earlier. 3. Actually, when U is known, the setting just consists in directly optimizing on the low dimensional action Ux in the original setting, and thus no assumption should be required on U (but maybe on the action space). So why all of this is needed in the case of known features ? 4. My last major comment concerns the experiments. First, it seems odd that the regret can be decreasing in time (which happens for figures C, D, E and F). How is this possible ? Also, OPOL outperforms OPOK as mentioned in section 5.1. I am not convinced by your justification of this behavior. You claim that OPOL can more robustly represent the way changing an action impacts the revenue. But OPOL depends on noise which is totally independent from the action, while OPOK exactly knows how the action can influence the revenue. Besides the estimation of U, another difference between the two algorithms comes from FindPrice (from 0 or p_t-1). What is its influence in terms of regret ? I think the difference in the experiments can actually come from this point. Also, the curves of GDG and OPOK (or OPOL) are suspiciously similar for the case d=N. Are GDG and OPOK exactly the same in this case ? (and why is there actually a difference between OPOK and OPOL in this case ?) Actually, I even suspect that the chosen parameters in the experiments make the problem easy. First, some rescaling is mentioned, but there is not a single parameter rescaled to 1 (I would have guessed either r or the variance of the noise), which is a standard way to rescale. More importantly, the values of z are very large compared to V. Thus, I think the influence of the prices p in the demand is marginal with these parameters, which would make the problem very easy. Unfortunately, the code was not given in the supplementary material and I could not confirm whether my claim was true or not. Also, why do you consider lambda=10 ? Only positivity is required according to the introduction, why not choose lambda=0 ? Also, if the model of equation (1) indeed shows up to be considered in previous papers, why did you not compare your algorithms with them (at least in the stationary case, since I understand the dynamic setting is not always considered in previous papers) ? ---------------------------------- Minor comments 5. Page 4 line 145: does C has a dependence in T, d, r, etc. ? 6. Page 5 Lines 190-197: it would be great to have a thorough study of the complexity of OPOK/OPOL (and especially the significance of speeding up the SVD) 7. Page 6 line 222: typo here, "inqualities" 8. page 10 line 574: what happens for d<10 ? 9. Page 10 equation (12). As in the main text, I am not convinced that p_t will have a large influence on the demand here, as p_t << 100. -------------------------------------------------------------------- The authors answered to my main concerns and I thus decide to raise my score in consequence. Still, the justification of why (A4) is not cristal clear to me and still needs to be clarified for the camera ready version.

Paper ID:	8962
Title:	Low-Rank Bandit Methods for High-Dimensional Dynamic Pricing

Reviewer 1

Reviewer 2

Reviewer 3