NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:6322
Title:Multivariate Distributionally Robust Convex Regression under Absolute Error Loss

Reviewer 1

While mostly I was able to understand the contributions, and comparison with the related work is reasonably well-written, I think that the organization and the quality of writing could be significantly improved, as summarized below. a. The main statistical result, Theorem 2, is stated in a very concise manner in the main text. For example, for me it is unclear how $\alpha$ in Assumption 1 influences the result (I guess the ``constant'' C actually depends on it, as well as on some other parameters). b. The structure of the paper could be significantly improved. In particular, discussion of the results and motivation for the estimator are somewhat scattered throughout the text, appearing before the estimator is introduced (see, in particular, lines 29-30, 52-55). c. There are numerous typos, see, e.g., lines 6, 42, display under line 45 (missing subscript of $\inf$), display (4), 92, 101, 177 (missed $\nabla$), 179 and mild stylistic error (``touched'' in line 94, ``order $O(\cdot)$'' and ``at most $O(\cdot)$'' throughout, etc.) Regarding the significance of the contributions and mathematical quality of the paper, I see the following issues. a. First of all, Assumption 1 (Weibull-type tails of the design vector) is relatively weak, and the distrubution robustness of the novel estimator is quite limited. It would be interesting (and more useful in practice) to consider finite moment assumption on the design. I expect that using Huber-type M-estimators, or median-of-means, one can obtain statistically optimal rate $O(-2/(d+4))$ in any dimension (see lines 111-113), perhaps even without explicit Wasserstein regularization, and under finite-moment assumption. This would be in line with the recent parametric results (see, e.g., S. Minsker. Geometric median and robust estimation in Banach spaces), as well as with the empirical results obtained in Sec. 3. Algorithmically, the Huber-loss criterion would result in a quadratic program rather than a linear one. b. While the authors claim that their estimator does not depend on the a priori bound on the size of the gradient of the target function (see, e.g., lines 79-80), such dependence indirectly appears in the constraints of their optimization problem. Also, it is not entirely clear why they impose logarithmic scaling of the gradient. On the positive side, I would like to emphasize the technical quality of the paper, in particular, the proof of Theorem 3 which features a somewhat non-standard chaining argument.

Reviewer 2

Update after reading rebuttal: I appreciate the authors' response to my questions. There are still few points that I would like to discuss here: 1- "our approach also allows l be function maps R^2 to R". I do not see any necessity to complicate the loss function. In the end, the authors consider error-based loss functions where the cost function is in the form of L(y - f(x)), and as far as I know, most lost functions in regression models penalize the prediction error e = y - f(x). 2- regarding to infinite dimensional problem: the loss function \ell(f(X), Y) is Lipschitz with respect to the transportation cost. By the Kantorovich-Rubinstein duality theorem, for any 1-Lipschitz loss function g and Q in the Wasserstein set with parameter \delta we have | - | \leq \delta, where the inner product is a short representation for the integral. Now, by [1, Theorem 2], we know that if \delta = \mathcal{O}(n^{-1/d}, then the data generating distribution P is inside the Wasserstein sett with high probability. Therefore, for any Lipschitz function and with high probability we have | - \leq \delta Denoting the optimizer of (3) by f_n and it optimal value g_n, we have |g_n - | \leq \delta * L \| \nabla f \|_\infty. I should emphasis here that the searching over infinite dimensional decision would not change the guarantees. In this paper, the authors consider a different guarantee as follows | - | \leq C \delta I agree with the authors that their analysis is different, but I want to kindly ask them to add a comments about the previous results in the literature, and show the similarities and differences here. One potential benefit of the proposed approach is that the constant in the concentration rate can be found in closed form while [1, Theorem 2] guarantees the existence of a constant C. I would also like to see a comparison between the concentration bound in [5, Theorem 4.3] and the one in your paper. 3- Given the statistical guarantee and new simulation results, I increase my overall evaluation grade. ======================================= 1- Theorem 1: consider a general convex loss function l(y, z): \mathbb{R} \times \mathbb{R} \to \mathbb{R} and a general convex estimator f(x): \mathbb{R^d} \to \mathbb{R}. In the parametric setting, it is known that if f(x) is an affine function then the distributionally robust problem coincides with the regularized empirical loss minimization problem whenever the loss function is convex and Lipschitz [4, Theorem 3.1, Remark 3.5]. The result is significantly generalized in [2, Theorem 1] where it is shown that convexity of the loss function is not required. Theorem 1 in the current submission offers a new regularization whenever the loss function l and the estimator f are convex. To the best of my understanding, the proof of Theorem 1 is similar to the proof in [2, Theorem 1]. The key point here is that any convex function can be represented by the maximum of infinitely many affine functions thanks to its bi-conjugate. Repeating the proof in [2, Theorem 1] for \tilde\beta = \nabla f(x_0) basically gives the proof of Theorem 1 in this paper. Therefore, it is important to illustrate the similarities and differences between these papers. Moreover, similar result to Lemma 1 can be found (or with simple generalization) in [3, 4, 7]. Please cite these papers as well. 2- The setting in this paper with the transportation cost c(\cdot, \cdot) introduced in equation (2) resembles to the robust or adverserial covariate shift problems, where the ambiguity in the distribution is due to the uncertainty in the distribution of the covariate variable x. Consider the followings: the joint probability distribution P(x, y) (under some technical assumptions) can be written as P(x, y) = \int_y P(x|y) P(y). In my opinion, giving infinite weight to perturbations in y is the same as perturbing P(x|y) while P(y) is fixed to P_n(y). Solving the following distributionally robust problem $$ \inf_f \frac{1}{n} \sup_{D(P_i(x|y_i), \delta_{(x_i, y_i)}) \leq \delta} \sum_{i=1}^n \mathbb{E}_{P_i(x|y_i)} l(y_i, f(x)) $$ gives the exact regularized problem in Theorem 1. I believe it would be instructive to relate this problem to the covariate shift setting. 3- Theorem 2: Assumption 1 is the same as the assumption required for \mathcal{E} in [1, Theorem 2]. Therefore, the consistency and the convergence rate of the estimator can be guaranteed using [4, Theorem 3.4 & 3.5] without imposing assumption 2 and restricting the search space \mathcal{F}. Note that the rate can be further sharpened using the result in [6]. While I like the new proof in the paper, I am not convinced that the new result adds any new insights given that the search space for the convex function f is also restricted to \mathcal{F}_n. As far as I understood, this restriction is crucial to obtain the new concentration bound. 4- \mathcal{F}_n: I don't really understand the necessity of the extra constraint \| \nabla d \|_\infty < \log(n) in Section 2.2. It is not conventional to solve a constrained regularized problem where we have a regularization term in both objective and constraint. As far as I understood, the extra constraint is crucial for Theorem 2 and the statistical bounds. However, from the optimization point of view and using Lagrangian theory, we can always move the constraint to the objective and solve a regularized problem. I was wondering if the authors can comment on this restriction. Moreover, it is necessary to have a theorem for optimality of piecewise affine functions here. While I believe the linear program and representation of convex estimator with an n-piece affine function are correct, I think a formal proof is required here. Moreover, the linear program suggests that the worst-case distribution is an n-point discrete distribution. I was wondering if the authors could comment on the structure of the worst-case distribution. 5- The numerical section is too short, and it doesn't offer any meaningful conclusion. I would suggest to show the actual and the theoretical convergence rate as the function of n. [1] Fournier, Nicolas, and Arnaud Guillin. "On the rate of convergence in Wasserstein distance of the empirical measure." Probability Theory and Related Fields 162.3-4 (2015): 707-738. [2] Gao, Rui, Xi Chen, and Anton J. Kleywegt. "Wasserstein distributional robustness and regularization in statistical learning." arXiv preprint arXiv:1712.06050 (2017). [3] Gao, Rui, and Anton J. Kleywegt. "Distributionally robust stochastic optimization with Wasserstein distance." arXiv preprint arXiv:1604.02199 (2016). [4] Mohajerin Esfahani, Peyman, and Daniel Kuhn. "Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations." Mathematical Programming 171.1-2 (2018): 115-166. [5] Shafieezadeh-Abadeh, Soroosh, Daniel Kuhn, and Peyman Mohajerin Esfahani. "Regularization via mass transportation." arXiv preprint arXiv:1710.10016 (2017). [6] Weed, Jonathan, and Francis Bach. "Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance." arXiv preprint arXiv:1707.00087 (2017). [7] Zhao, Chaoyue, and Yongpei Guan. "Data-driven risk-averse stochastic optimization with Wasserstein metric." Operations Research Letters 46.2 (2018): 262-267.

Reviewer 3

The paper is well written and is easy to follow. The conducted analysis is also sound.