Reviews: Semi-Parametric Efficient Policy Learning with Continuous Actions

This paper considers the off-policy learning problem for the case of continuous treatments, and provides regret bounds for the doubly-robust estimator, as well as a study of semiparametric efficiency. The primary assumption is that the “value function” is of known parametric form in the treatment, but with arbitrary dependence on covariates. In this sense, it generalizes the classical partially linear regression model.

Originality: The paper generalizes recent semiparametric efficient analysis of doubly robust estimators, particularly the "slicing" analysis innovation for policy learning as in Athey and Wager, Efficient policy learning, as well as the analysis of Foster and Syrgkanis 2019. The proposed approach avoids the unfavorable dimension dependence of previous approaches for continuous treatments; instead, the difficulty lies in the matrix regression problem of the covariance-based generalization of the propensity score for the continuous case.

Quality: The paper is technically sound, with claims well supported by theoretical analysis.

Clarity: The paper is clear overall but sometimes vague in its descriptions. I found the remarks helpful in instantiating the results. Fig. 1a: the axes are unreadable, and the results could be summarized differently or moved to the appendix.

Significance: The paper studies continuous treatments for policy learning and applies recent advances in the analysis of policy learning and doubly-robust estimators to this setting, in order to study the doubly-robust estimator.

Minor comments/clarification questions:
- Lines 93-98: could you clarify the relationship to the “slates” estimator of Swaminathan et al. 2017, as well as the dependence of the rate on the estimate of \hat\Sigma(z), or include a reference on a “typical” nuisance rate in that setting?
- The efficiency bound is achieved if the model is misspecified, but achieving it otherwise requires an additional homoskedasticity assumption: this seems nonstandard, and commenting inline on why this arises could be useful.
- Regarding remark 4, which abstracts away technical details regarding first-stage rates: while this may hold for the outcome regression, for the matrix-valued regression of \hat\Sigma, specific details regarding typical rates and their dependence on dimension or other quantities may be helpful to the reader. For example, the special assumption of homoskedasticity for the pricing example reduces the problem to density estimation; while convenient, this abstracts away some potential issues in implementing this approach in practice.
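
To make the object discussed in the comments above concrete, here is a minimal sketch of a doubly robust policy-value estimate of the kind I understand the paper to analyze. Everything specific in it is an assumption for illustration and is not taken from the paper: the feature map phi(a) = (1, a), the simulated data, the names `dr_policy_value`, `theta0`, and `zero_model`, and the choice of a logged action drawn independently of z so that \Sigma(z) is the identity and does not need to be estimated.

```python
import numpy as np

def phi(a):
    """Assumed feature map phi(a) = (1, a): the outcome is linear in the action."""
    a = np.asarray(a, dtype=float)
    return np.stack([np.ones_like(a), a], axis=-1)

def dr_policy_value(y, a, z, theta_hat, sigma_inv, policy):
    """Doubly robust value estimate of `policy` from logged (y, a, z).

    theta_hat(z) returns first-stage coefficient estimates, shape (n, 2).
    sigma_inv is the inverse of Sigma = E[phi(a) phi(a)^T | z], assumed
    constant in z here (a simplification made only for this sketch).
    """
    feats = phi(a)                                    # (n, 2)
    theta = theta_hat(z)                              # (n, 2)
    resid = y - np.sum(theta * feats, axis=1)         # first-stage residuals
    # Row i of (feats @ sigma_inv.T) is Sigma^{-1} phi(a_i).
    theta_dr = theta + (feats @ sigma_inv.T) * resid[:, None]
    return float(np.mean(np.sum(theta_dr * phi(policy(z)), axis=1)))

# Synthetic logged data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n = 200_000
z = rng.uniform(0.0, 1.0, n)
a = rng.normal(0.0, 1.0, n)            # independent of z, so Sigma = identity

def theta0(z):
    """True coefficients: outcome = z + (1 - 2z) * a + noise."""
    return np.stack([z, 1.0 - 2.0 * z], axis=-1)

y = np.sum(theta0(z) * phi(a), axis=1) + 0.1 * rng.normal(size=n)

sigma_inv = np.eye(2)
policy = lambda z: np.where(z < 0.5, 1.0, -1.0)   # action restricted to {-1, +1}
zero_model = lambda z: np.zeros((len(z), 2))      # deliberately wrong first stage

# True value of this policy is E[z] + E|1 - 2z| = 0.5 + 0.5 = 1.0.
value_good = dr_policy_value(y, a, z, theta0, sigma_inv, policy)
value_bad = dr_policy_value(y, a, z, zero_model, sigma_inv, policy)
print(value_good, value_bad)
```

With \Sigma known, both estimates should land near the true value of 1.0 even when the outcome model is badly wrong, which is the robustness property at stake. The sketch deliberately sidesteps the harder case raised in the comments, where \hat\Sigma(z) must itself be estimated by a matrix-valued regression.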

The paper builds on recent work on orthogonal machine learning, efficient policy learning, and continuous-action policy learning. The theory is solid and it is a reasonable advance with good empirical results. The results are most similar to Foster and Syrgkanis [9], which makes the present paper less novel and interesting. The big improvement over [9], per the authors, is that they provide an analysis giving a regret guarantee for the unpenalized policy learning approach that depends on the efficient variance of estimating the optimal policy. [9], in contrast, provides a bound that depends on the variance of estimating any policy in the class, and only shows a regret bound depending on the variance of estimating the optimal policy for a modified learning procedure that penalizes the second moment.

First, I'm not sure how groundbreaking it is to improve the bound to depend on the efficient variance of the optimal policy instead of that of an arbitrary policy. Sure, it's better, but is it of sufficient general interest to merit acceptance?

Second, I'm not sure I buy the authors' claim about the computational issues of variance regularization. Optimizing the orthogonal machine learning estimate over policies is already non-convex and rather intractable (and the authors don't explain how they actually optimize it in section 4!), so adding a variance regularizer does not destroy any convexity. Also, unlike the policy optimization objective, which is difficult to convexify for continuous actions (whereas for binary actions one usually uses classification surrogate losses), the variance regularization has very nice convexifications via phi-divergences / distributionally robust optimization.

Post-response: I have read the authors' response and I find that it appropriately and sufficiently addresses my questions about the comparison to [9] and the tractability of the different optimization problems. However, I would suggest that the authors add discussion reflecting the explanations they offered in the response to the text of the camera-ready version.
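
For readers unfamiliar with the distinction debated above, the following sketch contrasts an unpenalized empirical objective with a generic second-moment-penalized one. It is an illustration under my own assumptions, not the exact procedure of [9] or of the paper under review: the penalty form mean - lam * sqrt(second moment / n), the function name `penalized_objective`, and the toy per-sample scores are all hypothetical.

```python
import numpy as np

def penalized_objective(scores, lam):
    """Empirical mean of per-sample scores minus a second-moment penalty.

    scores: per-sample (e.g. doubly robust) value contributions of one policy.
    lam:    penalty weight; lam = 0 recovers the unpenalized objective.
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    return float(scores.mean() - lam * np.sqrt(np.mean(scores ** 2) / n))

# Two hypothetical candidate policies with hand-picked per-sample scores.
candidates = {
    "high_variance": [3.0, -1.0, 3.0, -1.0],   # mean 1.0, second moment 5.0
    "low_variance": [0.9, 0.9, 0.9, 0.9],      # mean 0.9, second moment 0.81
}

def select(lam):
    """Return the candidate name maximizing the (penalized) objective."""
    return max(candidates, key=lambda k: penalized_objective(candidates[k], lam))

unpenalized_choice = select(0.0)   # compares raw means: 1.0 vs 0.9
penalized_choice = select(1.0)     # compares 1 - sqrt(5/4) vs 0.9 - sqrt(0.81/4)
print(unpenalized_choice, penalized_choice)
```

In this toy case the unpenalized objective prefers the higher-mean, noisier candidate, while the penalized one prefers the steadier candidate. Whether the penalty is needed to obtain a regret bound that scales with the variance at the optimal policy is exactly the point on which the paper and [9] differ.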

Paper ID:	8611
Title:	Semi-Parametric Efficient Policy Learning with Continuous Actions
