{"title": "Multivariate Distributionally Robust Convex Regression under Absolute Error Loss", "book": "Advances in Neural Information Processing Systems", "page_first": 11817, "page_last": 11826, "abstract": "This paper proposes a novel non-parametric multidimensional convex\nregression estimator which is designed to be robust to adversarial\nperturbations in the empirical measure. We minimize over convex functions\nthe maximum (over Wasserstein perturbations of the empirical measure) of the\nabsolute regression errors. The inner maximization is solved in closed form,\nresulting in a regularization penalty that involves the norm of the gradient. We\nshow consistency of our estimator and a rate of convergence of order $\n\\widetilde{O}\\left( n^{-1/d}\\right) $, matching the bounds of alternative\nestimators based on square-loss minimization. Contrary to all of the existing results, our convergence rates hold without imposing compactness on the underlying domain and with no a priori bounds on the underlying convex function or its gradient norm.", "full_text": "Multivariate Distributionally Robust Convex Regression under Absolute Error Loss\n\nJose Blanchet\nStanford MS&E\njose.blanchet@stanford.edu\n\nJun Yan\nStanford Statistics\njunyan65@stanford.edu\n\nPeter W. Glynn\nStanford MS&E\nglynn@stanford.edu\n\nZhengqing Zhou\nStanford Mathematics\nzqzhou@stanford.edu\n\nAbstract\n\nThis paper proposes a novel non-parametric multidimensional convex regression\nestimator which is designed to be robust to adversarial perturbations in the empirical\nmeasure. We minimize over convex functions the maximum (over Wasserstein\nperturbations of the empirical measure) of the absolute regression errors. The\ninner maximization is solved in closed form, resulting in a regularization penalty\nthat involves the norm of the gradient. 
We show consistency of our estimator and a rate\nof convergence of order $\\widetilde{O}(n^{-1/d})$, matching the bounds of alternative estimators\nbased on square-loss minimization. Contrary to all of the existing results, our\nconvergence rates hold without imposing compactness on the underlying domain\nand with no a priori bounds on the underlying convex function or its gradient\nnorm.\n\n1 Introduction\n\nConvex regression estimation arises in a wide range of learning applications, for example, when\nfitting demand functions, production curves or utility functions; see [14, 22, 23]. Economic theory\noften dictates that demand functions are concave [2]. In financial engineering, stock option prices\noften exhibit convexity restrictions [1]. This paper introduces a novel convex regression estimator\nwhich, by design, enjoys enhanced robustness properties. This estimator requires no a priori uniform\nbounds on the underlying convex function or its Lipschitz constant, nor does our estimator require\nthat the domain of the convex function be compact, in contrast to existing convex function estimators\nthat have known convergence rate guarantees. Furthermore, our numerical experiments show that\nour estimator exhibits good empirical performance, in comparison with existing estimators, and is a\npromising alternative to existing methods.\nLet $X$ be a $d$-dimensional random vector and let $Y$ be a scalar random variable. Given a sample\n$(X_1, Y_1), \\ldots, (X_n, Y_n)$ of i.i.d. copies of $(X, Y)$, we adopt the convex regression model\n\n$$Y_i = f^*(X_i) + E_i, \\tag{1}$$\n\nwhere $f^* : \\mathbb{R}^d \\to \\mathbb{R}$ is an (unknown) convex function and $E_i$ is a zero-median random variable\nindependent of $X_i$, satisfying mild regularity conditions indicated in the sequel. 
Unlike the existing\nliterature on convex regression (or, more generally, shape-based regression), we base our estimation\nmethodology not on minimizing the squared error loss, but on minimizing mean absolute error loss.\nWe adopt this viewpoint as a means of reducing the sensitivity of our regression estimator to outliers\nin the data.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nWe further wish to regularize our estimator. One vehicle towards accomplishing this goal in a\nprincipled fashion is to consider a distributionally robust formulation in which we robustify over a\nWasserstein ball around the data, using a diameter that is driven by consistency and convergence rate\nconsiderations. When we do this, we arrive at a computationally tractable formulation of the problem\nthat can be solved as a linear program. This is to be contrasted against the quadratic program that\narises when minimizing squared error loss. Furthermore, the form of regularization that appears in\nthis problem involves a novel gradient-based penalization term, to be described in more detail later in\nthis Introduction.\nIn order to introduce our Wasserstein-based distributionally robust optimization formulation, we first\nrecall how the Wasserstein distance is defined.\nFirst, let $\\mathcal{P}(\\mathbb{R}^m \\times \\mathbb{R}^m)$ be the space of Borel probability measures defined on $\\mathbb{R}^m \\times \\mathbb{R}^m$. Let $\\Pi(\\mu, \\nu)$\nbe the subset of $\\mathcal{P}(\\mathbb{R}^m \\times \\mathbb{R}^m)$ with fixed marginals given by $\\mu$ and $\\nu$, respectively. That is, if\n$U \\in \\mathbb{R}^m$, $V \\in \\mathbb{R}^m$ are random vectors with joint distribution $\\pi \\in \\mathcal{P}(\\mathbb{R}^m \\times \\mathbb{R}^m)$, then $\\pi \\in \\Pi(\\mu, \\nu)$\nif the marginal distribution of $U$, $\\pi_U$, equals $\\mu$ and the marginal distribution of $V$, $\\pi_V$, equals $\\nu$. 
The\nWasserstein distance between $\\mu$ and $\\nu$ is given by\n\n$$D(\\mu, \\nu) := \\inf \\left\\{ E_\\pi[c(U, V)] : \\pi \\in \\mathcal{P}(\\mathbb{R}^m \\times \\mathbb{R}^m),\\, \\pi_U = \\mu,\\, \\pi_V = \\nu \\right\\},$$\n\nwhere $c : \\mathbb{R}^m \\times \\mathbb{R}^m \\to [0, \\infty]$ is a metric. In our setting, we have $m = d + 1$, and we will choose as\nour metric\n\n$$c((x, y), (x', y')) = \\|x - x'\\|_1 \\, \\mathbf{1}(y = y') + \\infty \\cdot \\mathbf{1}(y \\neq y'). \\tag{2}$$\n\nWe take the view here that distributional uncertainty is incorporated only in terms of the predictors\nand not the responses, since the responses already include a measurement error (in the term $E$). This\ntype of cost function has been used in the literature [6] to exactly recover regularized estimators\nsuch as sqrt-Lasso, among others. It is possible to add distributional uncertainty in the response. The\nmethods that we propose allow for adding distributional uncertainty in the response with only a small\nvariation in the form of the estimator and without any change in the learning rates or the assumptions\nthat we impose. Since the challenge here arises from the multidimensional aspect of the predictor\nvariable, we decided to mostly impose the distributional robustness on the predictors.\nNow, consider a loss function $l(y, z) : \\mathbb{R} \\times \\mathbb{R} \\to \\mathbb{R}$, which is assumed to be convex and uniformly\nLipschitz. Our distributionally robust convex regression (DRCR) formulation takes the form\n\n$$\\inf_{f \\in \\mathcal{F}} \\sup_{P \\in \\mathcal{P}(\\mathbb{R}^{d+1}) : D(P, P_n) \\le \\delta} E_P[l(Y, f(X))], \\tag{3}$$\n\nwhere $\\mathcal{F}$ represents the class of convex and Lipschitz functions (formally defined in Section 2.3), and the\nparameter $\\delta := \\delta_n > 0$ is the uncertainty radius. This radius will be judiciously chosen as a function\nof $n$ to obtain consistency and suitable rates of convergence. 
The notation $P_n$ encodes the empirical\ndistribution of the observations $(X_1, Y_1), \\ldots, (X_n, Y_n)$, namely,\n\n$$P_n(dx, dy) := \\frac{1}{n} \\sum_{i=1}^n \\delta_{\\{(X_i, Y_i)\\}}(dx, dy).$$\n\nDistributionally robust optimization formulations such as (3) have been used in a wide range of\nsettings in the operations research literature and these formulations have become increasingly popular\nin machine learning and statistics.\nOur main contributions in this paper are as follows.\n\ni) We provide a tractable formulation of (3). In particular, we will show that\n\n$$\\inf_{f \\in \\mathcal{F}} \\sup_{P \\in \\mathcal{P}(\\mathbb{R}^{d+1}) : D(P, P_n) \\le \\delta} E_P[l(Y, f(X))] = \\inf_{f \\in \\mathcal{F}} \\left\\{ \\delta L \\|\\nabla f\\|_\\infty + E_{P_n} l(Y, f(X)) \\right\\}, \\tag{4}$$\n\nwhere $\\|\\nabla f\\|_\\infty$ is the largest $l_\\infty$-norm of all subgradients of $f(x)$ for all $x$, and similarly, $L :=\n\\sup_{(y,z) \\in \\mathbb{R} \\times \\mathbb{R}} |\\nabla_z l(y, z)|$ (see Theorem 1). Note the penalty term is expressed in terms of the\nnorm of the gradient of the estimator. The appearance of the $l_\\infty$-norm is intimately connected to\nthe choice of the $l_1$ cost function given in (2).\n\nii) Assuming that $l(y, f(x)) = |y - f(x)|$, we provide statistical guarantees for the rate of convergence\nof the estimators obtained in (4), improving upon the results obtained using a quadratic loss.\nIn particular, we show that if $\\|X\\|_\\infty^\\gamma$ has a finite moment generating function in a neighborhood\nof the origin for some $\\gamma > 0$ and if $\\delta_n$ is chosen to be $\\widetilde{O}(n^{-2/d})$, then, under suitable regularity\nconditions on the residuals (see Theorem 2),\n\n$$\\widehat{f}_{n,\\delta_n} = f^* + \\widetilde{O}\\left(n^{-1/d}\\right),$$\n\nin a suitable sense, where $\\widehat{f}_{n,\\delta_n} \\in \\arg\\inf_{f \\in \\mathcal{F}} \\{\\delta_n L \\|\\nabla f\\|_\\infty + E_{P_n} l(Y, f(X))\\}$ and the notation\n$\\widetilde{O}(n^{-1/d})$ ignores poly-log factors in $n$. In contrast to the current results in the literature, our rate\nof convergence does not require $X$ to have compact support, nor do we need to build an a priori\nbound on the size of the gradient of $f$ into our estimator in order to obtain a convergence rate result.\n\nOur contributions have several significant features. First, it is not difficult to see that choosing the\nabsolute error loss $l(y, f(x)) = |y - f(x)|$ makes (4) equivalent to a linear programming problem.\nIn fact, since $P_n$ is finitely supported, the problem becomes a finite dimensional linear programming\nproblem. Hence, this problem is, in principle, easier to solve than the standard quadratic program that\narises in typical non-parametric convex regression formulations based on minimizing the\nsquared error loss.\nSecond, our estimator is naturally endowed with desirable out-of-sample features due to the presence\nof the inner maximization, which explores the impact on the loss function due to statistical variations\nin the data. This interpretation follows from the left hand side of (4). The right hand side of (4), on\nthe other hand, shows a direct connection to regularization in terms of the norm of the gradient of $f$,\nand the resulting norm is the dual of the transportation cost. This regularization term, as we shall see, allows\nus to construct an estimator that is free of a priori bounds imposed on the size of the gradient of\n$f$, which typically are required in order to obtain statistical guarantees. 
We now provide a literature\nreview in the scientific areas touched by our contribution, namely, convex regression estimation and\ndistributionally robust optimization.\n\n1.1 Related Literature\n\nIn the context of convex regression, the overwhelming majority of the literature focuses on empirical\nleast-squares estimators (leading to a quadratic programming formulation of the same size as the\nlinear programming formulation that we offer). In one dimension, the work of [11] proves the\nconsistency of the least squares estimator, and provides a rate of convergence of order $O(n^{-2/5})$ and\nan asymptotic distribution for this estimator; matching upper and lower bounds for the min-max\nrisk (in terms of quadratic loss) were obtained in [12], also with the same rate of order $O(n^{-2/5})$ up\nto a logarithmic factor. The first consistency results in higher dimensional problems were obtained\nin [16, 19]. Associated rates of convergence have only been derived recently, in [3, 13, 15], all of\nwhich assume that the predictor takes values on a compact set. It is shown in these papers that a phase\ntransition occurs at $d = 4$. When $d \\le 4$, the least squares estimator achieves the convergence rate of\n$n^{-2/(d+4)}$, which matches the optimal convergence rate in the non-parametric setting (when $f^*$ is\ntwice continuously differentiable and the data is restricted to lie on a compact set). However, when\n$d > 4$, the convergence rate of the least squares estimator deteriorates to $O(n^{-1/d})$. Moreover, the\nresults in [15] and [3] require a priori knowledge of $\\|\\nabla f^*\\|_\\infty$ in the construction of their estimator,\nwhile [13] requires knowledge of $\\|f^*\\|_\\infty$. 
The work of [13] shows that under additional smoothness\nassumptions, the optimal min-max risk is of order $n^{-2/(d+4)}$, although, interestingly, no explicit\nestimator was given to recover such a rate in dimensions larger than four.\nIn connection to optimization, our formulation connects to an area which has been active in operations\nresearch for many years, namely, robust and distributionally robust optimization [5]. Distributionally\nrobust optimization (DRO) problems informed by optimal transport costs, as in this paper\u2019s formulation,\nhave become popular in recent years not only in operations research but also in the machine\nlearning community. The work of [20] is the first one to show a connection to regularized estimators,\nin the context of logistic regression. The paper [6] provides an exact recovery of sqrt-Lasso and\nsupport vector machines. The work in [6] uses the DRO formulation to define a statistical criterion to\noptimally choose the uncertainty size $\\delta$. This criterion, when applied to linear regression problems,\nrecovers the scalings both in dimension and sample size obtained in the high-dimensional statistics\nliterature (see, for example, [4]). Applications in training of deep neural networks are given in [21],\nand additional representations of other estimators are given in [8, 10, 18], among others. A key step\nin obtaining these representations is a duality result, which is given in [7].\n\n1.2 Organization\n\nThe rest of this paper is organized as follows. In Section 2.1, we state and prove a strong duality\nresult for the DRCR formulation in (3). Section 2.2 provides an explicit construction of the DRCR\nestimator, and in Section 2.3, we show that the convergence rate of this estimator is at most $\\widetilde{O}(n^{-1/d})$.\nFinally, in Section 3, we run a simulation study showing that the DRCR estimator can outperform the standard\nLSE or kernel based estimator. 
The proof of Theorem 2, as well as the main lemmas, is deferred to\nthe supplementary materials.\n\n2 Main Results\n\nWe first discuss our main result corresponding to the first contribution stated in the Introduction.\nWe later turn to the second contribution. In order to state the strong duality result, we introduce\nsome notation as follows. Let $x = (x_1, \\ldots, x_d)$, denote by $\\partial f(x)$ the subdifferential of $f$\nat $x$, and define $\\partial_{x_i} f(x)$ to be the partial subdifferential of $f$ at $x$ with respect to $x_i$. We\ndefine $\\|\\nabla f\\|_\\infty := \\sup_{x \\in \\mathbb{R}^d} \\max\\{\\|g\\|_\\infty : g \\in \\partial f(x)\\}$, and $|\\nabla_{x_i} f(x)| := \\max\\{|g| : g \\in \\partial_{x_i} f(x)\\}$.\nFinally, let $\\nabla f(x)$ denote one of the solutions in $\\arg\\max\\{\\|g\\|_\\infty : g \\in \\partial f(x)\\}$.\n\n2.1 Dual formulation of DRCR\n\nIn this section, we establish the strong duality result for the DRCR problem (3), which plays an\nimportant role in the construction of our estimator and the analysis of the rate of convergence.\nTheorem 1 (Strong Duality). Suppose $l(y, z) : \\mathbb{R} \\times \\mathbb{R} \\to \\mathbb{R}$ is a convex and Lipschitz function, such\nthat $l(y, z) = l(-y, -z)$. Define\n\n$$L := \\sup_{(y,z) \\in \\mathbb{R} \\times \\mathbb{R}} |\\nabla_z l(y, z)|.$$\n\nThen, for any $\\delta \\ge 0$,\n\n$$\\inf_{f \\in \\mathcal{F}} \\sup_{P \\in \\mathcal{P}(\\mathbb{R}^{d+1}) : D(P, P_n) \\le \\delta} E_P[l(Y, f(X))] = \\inf_{f \\in \\mathcal{F}} \\left\\{ \\delta L \\|\\nabla f\\|_\\infty + \\frac{1}{n} \\sum_{i=1}^n l(Y_i, f(X_i)) \\right\\}.$$\n\nBy the above theorem, we see that the DRCR problem (3) is essentially equivalent to a regularized\nempirical loss minimization, where the supremum norm of $\\nabla f$ is penalized.\n\nProof of Theorem 1. To begin, we invoke the following lemma.\nLemma 1 ([7]). 
Given any probability distribution $\\mu \\in \\mathcal{P}(\\mathbb{R}^d)$, for any upper semi-continuous\nfunction $f \\in L^1(d\\mu)$ and any cost function $c$, the following strong duality holds:\n\n$$\\sup_{\\nu \\in \\mathcal{P}(\\mathbb{R}^d) : D(\\mu, \\nu) \\le \\delta} E_\\nu f(X) = \\inf_{\\lambda \\ge 0} \\left\\{ \\lambda \\delta + E_\\mu \\left[ \\sup_{y \\in \\mathbb{R}^d} \\{ f(y) - \\lambda c(X, y) \\} \\right] \\right\\}.$$\n\nAs a direct consequence of Lemma 1, we have for any $f \\in \\mathcal{F}$ that\n\n$$\\begin{aligned} \\sup_{P \\in \\mathcal{P}(\\mathbb{R}^{d+1}) : D(P, P_n) \\le \\delta} E_P[l(Y, f(X))] &= \\inf_{\\lambda \\ge 0} \\left\\{ \\lambda \\delta + E_{P_n} \\left[ \\sup_{(x, y) \\in \\mathbb{R}^d \\times \\mathbb{R}} \\{ l(y, f(x)) - \\lambda c((X, Y), (x, y)) \\} \\right] \\right\\} \\\\ &= \\inf_{\\lambda \\ge 0} \\left\\{ \\lambda \\delta + \\frac{1}{n} \\sum_{i=1}^n \\sup_{x \\in \\mathbb{R}^d} \\{ l(Y_i, f(x)) - \\lambda \\|x - X_i\\|_1 \\} \\right\\}. \\end{aligned} \\tag{5}$$\n\nFor simplicity, let $\\nabla_i f(x)$ denote the $i$th coordinate of $\\nabla f(x)$ ($1 \\le i \\le d$). Suppose $\\lambda < L \\|\\nabla f\\|_\\infty$;\nthen there exist $y_0 \\in \\mathbb{R}$, $z_0 \\in \\mathbb{R}$, $x_0 \\in \\mathbb{R}^d$ and $i_0 \\in \\{1, \\ldots, d\\}$ such that $\\lambda < |\\nabla_z l(y_0, z_0)| \\cdot\n|\\nabla_{i_0} f(x_0)|$. Without loss of generality, we may assume that $\\nabla_z l(y_0, z_0) \\nabla_{i_0} f(x_0) > 0$. Otherwise,\nwe consider $(-y_0, -z_0)$. We may consider the case that both $\\nabla_z l(y_0, z_0), \\nabla_{i_0} f(x_0) > 0$, since the\ncase in which both of them are negative is similar. Let $\\{e_i\\}_{i=1}^d$ be the canonical basis of $\\mathbb{R}^d$. If\n$x_t := x_0 + t \\cdot e_{i_0} \\in \\mathbb{R}^d$, then $f(x_t)$ is a convex function of $t$. Moreover, under the above assumptions,\nwe have $f(x_t) \\to +\\infty$ as $t \\to +\\infty$. 
Hence, together with the convexity of $l$, for $t > 0$ sufficiently\nlarge,\n\n$$\\begin{aligned} l(Y_i, f(x_t)) - \\lambda \\|x_t - X_i\\|_1 &\\ge l(y_0, f(x_t)) - \\lambda \\|x_t - x_0\\|_1 - L_0 |y_0 - Y_i| - \\lambda \\|x_0 - X_i\\|_1 \\\\ &\\ge l(y_0, z_0) + \\nabla_z l(y_0, z_0) \\cdot (f(x_t) - z_0) - \\lambda t - L_0 |y_0 - Y_i| - \\lambda \\|x_0 - X_i\\|_1 \\\\ &\\ge (\\nabla_z l(y_0, z_0) \\nabla_{i_0} f(x_0) - \\lambda) t + \\nabla_z l(y_0, z_0) \\cdot (f(x_0) - z_0) + l(y_0, z_0) - L_0 |y_0 - Y_i| - \\lambda \\|x_0 - X_i\\|_1, \\end{aligned}$$\n\nwhere $L_0 := \\sup_{(y,z) \\in \\mathbb{R} \\times \\mathbb{R}} |\\nabla_y l(y, z)| < \\infty$. By taking the supremum over $t$, we have\n\n$$\\sup_{x \\in \\mathbb{R}^d} \\{ l(Y_i, f(x)) - \\lambda \\|x - X_i\\|_1 \\} = \\infty.$$\n\nOn the other hand, if $\\lambda \\ge L \\|\\nabla f\\|_\\infty$, we have for any $x \\in \\mathbb{R}^d$ that\n\n$$l(Y_i, f(x)) - l(Y_i, f(X_i)) \\le L \\|\\nabla f\\|_\\infty \\|x - X_i\\|_1 \\le \\lambda \\|x - X_i\\|_1,$$\n\nwhere equality holds if $x = X_i$. 
Hence\n\n$$\\sup_{x \\in \\mathbb{R}^d} \\{ l(Y_i, f(x)) - \\lambda \\|x - X_i\\|_1 \\} = l(Y_i, f(X_i)).$$\n\nNow, we can rewrite the equation (5) as\n\n$$\\sup_{P \\in \\mathcal{P}(\\mathbb{R}^{d+1}) : D(P, P_n) \\le \\delta} E_P[l(Y, f(X))] = \\inf_{\\lambda \\ge L \\|\\nabla f\\|_\\infty} \\left\\{ \\lambda \\delta + \\frac{1}{n} \\sum_{i=1}^n l(Y_i, f(X_i)) \\right\\} = \\delta L \\|\\nabla f\\|_\\infty + \\frac{1}{n} \\sum_{i=1}^n l(Y_i, f(X_i)).$$\n\n2.2 Construction of the DRCR Estimator\n\nTo construct the DRCR estimator, we focus now on the absolute error loss $l(y, f(x)) = |y - f(x)|$.\nConsider the following class of convex and Lipschitz functions:\n\n$$\\mathcal{F}_n := \\{ f : f \\text{ is convex}, \\|\\nabla f\\|_\\infty \\le \\log n \\}.$$\n\nIt can be checked directly that the loss function $l$ satisfies the requirements in Theorem 1 with the\nconstant $L = 1$, so we can rewrite the DRCR problem (3) as follows:\n\n$$\\inf_{f \\in \\mathcal{F}_n} \\left\\{ \\delta \\|\\nabla f\\|_\\infty + \\frac{1}{n} \\sum_{i=1}^n l(Y_i, f(X_i)) \\right\\}. \\tag{6}$$\n\nNow we construct an estimator $\\widehat{f}_{n,\\delta}$ that solves the problem (6). Consider the following finite\ndimensional linear programming (LP) problem:\n\n$$\\begin{aligned} \\min_{g_i, \\xi_i} \\quad & \\frac{1}{n} \\sum_{i=1}^n l(Y_i, g_i) + \\delta \\max_{1 \\le i \\le n} \\|\\xi_i\\|_\\infty \\\\ \\text{s.t.} \\quad & g_j \\ge g_i + \\langle \\xi_i, X_j - X_i \\rangle, \\quad 1 \\le i, j \\le n, \\\\ & |\\xi_i^k| \\le \\log n, \\quad \\text{where } \\xi_i = (\\xi_i^1, \\ldots, \\xi_i^d), \\quad 1 \\le i \\le n. \\end{aligned} \\tag{7}$$\n\nLet $(\\widehat{g}_1, \\widehat{\\xi}_1), \\ldots, (\\widehat{g}_n, \\widehat{\\xi}_n)$ be any solution of problem (7). Then, we can define the DRCR estimator\nby\n\n$$\\widehat{f}_{n,\\delta}(x) := \\max_{1 \\le i \\le n} \\left( \\widehat{g}_i + \\langle \\widehat{\\xi}_i, x - X_i \\rangle \\right), \\tag{8}$$\n\nwhere $\\langle \\cdot, \\cdot \\rangle$ is the standard inner product. 
Next, we show that $\\widehat{f}_{n,\\delta}$ also solves the problem (6). In fact,\n$\\widehat{f}_{n,\\delta}$ is a solution to the problem\n\n$$\\inf_{f \\in \\mathcal{F}_n} \\left\\{ \\delta \\sup_{1 \\le i \\le n} \\|\\nabla f(X_i)\\|_\\infty + \\frac{1}{n} \\sum_{i=1}^n l(Y_i, f(X_i)) \\right\\},$$\n\nwhere the objective value certainly serves as a lower bound for that of (6). Moreover, observe that\n$\\|\\nabla \\widehat{f}_{n,\\delta}\\|_\\infty = \\max_{1 \\le i \\le n} \\|\\widehat{\\xi}_i\\|_\\infty = \\sup_{1 \\le i \\le n} \\|\\nabla \\widehat{f}_{n,\\delta}(X_i)\\|_\\infty$; hence $\\widehat{f}_{n,\\delta}$ is also a solution of (6).\n\n2.3 Rate of Convergence\n\nIn order to state our rate of convergence result, corresponding to the second contribution stated in the\nIntroduction, we need to impose some assumptions and state some definitions.\nLet $\\mathcal{P}(\\mathbb{R}^n)$ denote the set of all probability measures supported on $\\mathbb{R}^n$. Given a metric space\n$(\\mathcal{X}, \\rho)$ and any subset $\\mathcal{G} \\subset \\mathcal{X}$, the $\\varepsilon$-covering number $M(\\mathcal{G}, \\varepsilon; \\rho)$ is defined as the smallest\nnumber of balls with radius $\\varepsilon$ whose union contains $\\mathcal{G}$, and we let $A_\\varepsilon$ denote any corresponding\n$\\varepsilon$-covering set. We say a random variable $W$ is $\\sigma$-sub-Gaussian if its Orlicz norm\n$\\|W\\|_{\\psi_2} := \\sup_{k \\ge 1} k^{-1/2} \\left( E|W - EW|^k \\right)^{1/k} \\le \\sigma$, which is equivalent to the standard definition\nof a sub-Gaussian random variable, see [24]. Furthermore, we use standard Landau\u2019s asymptotic\nnotations as follows: for two non-negative sequences $\\{a_n\\}$ and $\\{b_n\\}$, let $a_n = O(b_n)$ iff\n$\\limsup_{n \\to \\infty} a_n / b_n < \\infty$, $a_n = \\Theta(b_n)$ iff $a_n = O(b_n)$ and $b_n = O(a_n)$, and $a_n = \\widetilde{O}(b_n)$ iff\n$a_n = O(b_n)$ up to a poly-log factor of $b_n$.\nWe assume that the data $\\{(X_i, Y_i)\\}_{i=1}^n$ are i.i.d. samples from $P$. To analyze the asymptotic behavior\nof the DRCR estimator, we shall impose the following assumptions on the distribution of $X$ and the\nrandom variable $E$ in (1).\nAssumption 1. 
There exist some $\\alpha, \\gamma > 0$ such that\n\n$$E \\exp\\left( \\alpha \\|X\\|_\\infty^\\gamma \\right) < \\infty. \\tag{9}$$\n\nAssumption 2. The distribution of $E$ is $\\sigma$-sub-Gaussian for some $\\sigma > 0$, symmetric about zero, and\nhas a continuous positive density $p_E(\\cdot)$ in a neighborhood of 0.\nRemark 1. Assumption 1 allows the study of random variables (such as Weibull random variables)\nexhibiting heavy tail behavior [9].\nRemark 2. The assumptions on the symmetry and the density ensure that 0 is the unique median of\n$E$. As is standard in statistical formulations involving absolute error minimization, this assumption is\nneeded to guarantee the consistency of our estimator.\n\nIn the rest of this section, we study the convergence rate of the DRCR estimator $\\widehat{f}_{n,\\delta_n}$ introduced in\nSection 2.2. We consider the general question of convergence rate for robustified estimators of the\nform\n\n$$\\widehat{g}_{n,\\delta_n}(x) \\in \\arg\\min_{f \\in \\mathcal{F}_n} \\left\\{ \\sup_{P \\in \\mathcal{P}(\\mathbb{R}^{d+1}) : D(P, P_n) \\le \\delta_n} E_P[l(Y, f(X))] \\right\\}. \\tag{10}$$\n\nWe will show that by a suitable choice of $\\delta_n$, the convergence rate of $\\widehat{g}_{n,\\delta_n}$ to $f^*$ under the empirical\n$l_1$ loss is of order $\\widetilde{O}(n^{-1/d})$, where the empirical $l_1$ loss of any two functions $f, g$ is defined as\n\n$$l_1(f, g) := \\frac{1}{n} \\sum_{i=1}^n |f(X_i) - g(X_i)|.$$\n\nNow we state our main theorem. The proof details are deferred to the supplementary materials\n(Appendix A).\nTheorem 2. If $\\|\\nabla f^*\\|_\\infty < \\infty$ and $d > 4$, and Assumptions 1 and 2 hold, we can pick a $\\delta_n$ of order\n$\\Theta\\left( n^{-2/d} (\\log n)^{1 + 3/\\gamma} \\right)$ so that for any $\\widehat{g}_{n,\\delta_n}(\\cdot)$ defined via (10), there exists some constant $C > 0$ such\nthat\n\n$$P\\left( l_1(\\widehat{g}_{n,\\delta_n}, f^*) > C n^{-1/d} (\\log n)^{\\frac{\\gamma + 3}{2\\gamma}} \\right) \\to 0 \\quad \\text{as } n \\to \\infty. \\tag{11}$$\n\nIn particular, the DRCR estimator $\\widehat{f}_{n,\\delta_n}$ defined in (8) also enjoys the rate of $\\widetilde{O}(n^{-1/d})$, which is\nthe best known rate so far (compare to [3, 13, 15]). In contrast to prior work, the estimators are not\ndefined in terms of a priori bounds on $\\|f^*\\|_\\infty$ and $\\|\\nabla f^*\\|_\\infty$.\n\n3 Numerical Experiments\n\n3.1 Synthetic datasets\n\nIn this section we investigate the performance of our estimator $\\widehat{f}_{n,\\delta}$, and compare it with the least\nsquares estimator (LSE) of convex regression in [15], as well as the kernel smoothing estimator.\nWe conduct the experiments in the following setting. For each $d$ and $n$, we generate i.i.d. random\nvariables $X_i \\in \\mathbb{R}^d$, $i = 1, \\ldots, n$, such that the coordinates of $X_i$ are i.i.d. from $N(0, 1)$, or from a standard\nStudent\u2019s t-distribution with 10 degrees of freedom. We include this heavy-tailed specification to\nempirically test the impact of Assumption 1 on our estimator. The results suggest that even if such an\nassumption is violated, our estimator still performs remarkably well.\nLet $f^* : \\mathbb{R}^d \\to \\mathbb{R}$ be such that\n\n$$f^*(x) = \\sum_{i=1}^d |x_i|, \\quad x = (x_1, \\ldots, x_d).$$\n\nWe generate $Y_i$, $i = 1, \\ldots, n$, by $Y_i = f^*(X_i) + E_i$, where the noises $E_i$ are sampled i.i.d. from\n$N(0, \\sigma^2)$.\n\nWe construct our DRCR estimator $\\widehat{f}_{n,\\delta_n}$ by taking $\\delta_n = n^{-2/d}$. For the LSE of convex regression, in\nline with the setting in [3, 15], let $c$ be any numerical constant greater than $\\|\\nabla f^*\\|_\\infty$, and consider\nthe class of functions\n\n$$\\mathcal{F}_c := \\{ f : f \\text{ is convex}, \\|\\nabla f\\|_\\infty \\le c \\}.$$\n\nLet $\\widehat{f}^{LS}_{n,c}$ be the least squares convex regression estimator, namely,\n\n$$\\widehat{f}^{LS}_{n,c} = \\arg\\min_{f \\in \\mathcal{F}_c} \\left\\{ \\frac{1}{n} \\sum_{i=1}^n (Y_i - f(X_i))^2 \\right\\}.$$\n\nIn [3, 15] it is shown that $\\widehat{f}^{LS}_{n,c}$ converges to $f^*$ for any $c > \\|\\nabla f^*\\|_\\infty$. Given that $\\|\\nabla f^*\\|_\\infty = 1$, we\nset $c = 10$ or $0.8$, since in practice we typically do not have a tight bound for $\\|\\nabla f^*\\|_\\infty$ (we may\noverestimate/underestimate $\\|\\nabla f^*\\|_\\infty$).\nNext we construct the kernel regression estimator. Although not required to be convex, the\nkernel estimator is a good benchmark comparison choice in the non-parametric setting. For\nsome bandwidth $h_n > 0$, we define the kernel regression estimator $\\widehat{k}_{n,h_n}$ by\n\n$$\\widehat{k}_{n,h_n}(x) = \\frac{\\sum_{i=1}^n Y_i K\\left( \\frac{x - X_i}{h_n} \\right)}{\\sum_{i=1}^n K\\left( \\frac{x - X_i}{h_n} \\right)},$$\n\nwhere $K : \\mathbb{R}^d \\to \\mathbb{R}$ denotes the Gaussian kernel with\n$K(x) = (2\\pi)^{-d/2} e^{-\\|x\\|^2/2}$. We then choose the best bandwidth $h_n$ via cross validation. To be\nspecific, we pick $h_n = C n^{-1/(d+4)}$, and then optimize the choice of $C$ via line search. That is, for each\n$1 \\le j \\le n$, let\n\n$$\\widehat{k}^{(-j)}_{n,h_n}(x) = \\frac{\\sum_{i=1, i \\neq j}^n Y_i K\\left( \\frac{x - X_i}{h_n} \\right)}{\\sum_{i=1, i \\neq j}^n K\\left( \\frac{x - X_i}{h_n} \\right)},$$\n\nand we select $C$ to be the minimizer of\n\n$$\\min_{C \\in \\{j/100,\\, 1 \\le j \\le 100\\}} \\sum_{i=1}^n \\left( Y_i - \\widehat{k}^{(-i)}_{n, C n^{-1/(d+4)}}(X_i) \\right)^2.$$\n\nDefine the empirical $l_2$ loss of any two functions $f, g$ as\n\n$$l_2(f, g) := \\left( \\frac{1}{n} \\sum_{i=1}^n |f(X_i) - g(X_i)|^2 \\right)^{1/2}.$$\n\nIn the experiments, we set $d = 5$, $n \\in \\{50, 100, 150, 200, 250, 300, 350\\}$ and $\\sigma = 0.2$. We compare\nthe performance of $\\widehat{f}_{n,\\delta_n}$, $\\widehat{f}^{LS}_{n,0.8}$, $\\widehat{f}^{LS}_{n,10}$ and $\\widehat{k}_{n,h_n}$ under both the empirical $l_1$ and $l_2$ losses. For each\nchoice of $n$ and $d$, we repeat the simulation 100 times and calculate the average loss.\nWe first sample i.i.d. $X_i \\sim N(0, I_d)$ for the light tail case, which satisfies Assumption 1. For comparison,\nwe also sample i.i.d. heavy tail random variables $X_i$ such that the coordinates of $X_i$ are i.i.d. from the\nt-distribution with 10 degrees of freedom. 
The results of the experiment follow.\n\n(a) Light tail covariates, $l_1$ loss\n(b) Light tail covariates, $l_2$ loss\n(c) Heavy tail covariates, $l_1$ loss\n(d) Heavy tail covariates, $l_2$ loss\n\nFigure 1: In the above plots, the blue solid line stands for the estimator $\\widehat{f}_{n,\\delta}$, the black dotted line stands for\n$\\widehat{f}^{LS}_{n,0.8}$, the red dash-dot line stands for the estimator $\\widehat{f}^{LS}_{n,10}$, and the green dashed line stands for the kernel\nestimator $\\widehat{k}_{n,h_n}$.\n\nFrom Figure 1, we observe that our estimator $\\widehat{f}_{n,\\delta}$ outperforms $\\widehat{f}^{LS}_{n,0.8}$, $\\widehat{f}^{LS}_{n,10}$ and $\\widehat{k}_{n,h_n}$\nin both $l_1$ and $l_2$ losses, and that the performance of the least squares estimator is highly sensitive to\nthe choice of the constant $c$, the a priori bound on $\\|\\nabla f^*\\|_\\infty$. We believe that a key factor in the\nperformance of our estimator is the regularization penalty introduced in the DRCR formulation.\n\n3.2 Real dataset\n\nWe consider a public dataset from the United States Environmental Protection Agency, which was\nsuggested by [17]. The dataset consists of 600 air market records from California in the first quarter of\n2019. The response is the amount of heat input, with the covariates corresponding to the amounts\nof emissions of SO2, NOx, CO2 (in tons) and the NOx rate. Empirical evidence suggests that the\nrelationship between the response and the log transformation of each individual covariate can be\nmodeled well by a convex fit, so we apply the log transformation to the covariates and then standardize the\ndata. Since $f^*$ is unknown for real data, we cannot evaluate our method in the same way as in the\nsynthetic experiments of Section 3.1. 
Instead, we randomly split the dataset into a training set with 400 observations and a test set\nwith 200 observations, and we implement three different approaches: DRCR, LSE and LR (linear regression).\nWe repeat the experiment 10 times and then compare the average training $l_1$ loss and average test $l_1$\nerror.\n\nMethod   Training loss   Test error\nDRCR     0.1238          0.1294\nLSE      0.1485          0.1516\nLR       0.1691          0.1692\n\nWe summarize the results in the above table. In this experiment, our method attains lower training loss\nand test error than both LSE and LR.\n\nReferences\n\n[1] Yacine Ait-Sahalia and Jefferson Duarte. Nonparametric option pricing under shape restrictions.\nJournal of Econometrics, 116(1-2):9\u201347, 2003.\n\n[2] Gad Allon, Michael Beenstock, Steven Hackman, Ury Passy, and Alexander Shapiro. Nonparametric\nestimation of concave production technologies by entropic methods. Journal of Applied\nEconometrics, 22(4):795\u2013816, 2007.\n\n[3] Gabor Balazs, Andr\u00e1s Gy\u00f6rgy, and Csaba Szepesvari. Near-optimal max-affine estimators for\nconvex regression. In Proceedings of the Eighteenth International Conference on Artificial\nIntelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages\n56\u201364, San Diego, California, USA, 09\u201312 May 2015. PMLR.\n\n[4] A. Belloni, V. Chernozhukov, and L. Wang. Square-root lasso: pivotal recovery of sparse signals\nvia conic programming. Biometrika, 98(4):791\u2013806, 12 2011.\n\n[5] Aharon Ben-Tal and Arkadi Semenovich Nemirovski. Lectures on Modern Convex\nOptimization: Analysis, Algorithms, and Engineering Applications. Society for Industrial and\nApplied Mathematics, Philadelphia, PA, USA, 2001.\n\n[6] Jose Blanchet, Yang Kang, and Karthyek Murthy. Robust wasserstein profile inference and\napplications to machine learning. arXiv e-prints, page arXiv:1610.05627, Oct 2016.\n\n[7] Jose Blanchet and Karthyek Murthy. 
Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 2019.\n\n[8] Jose H. Blanchet and Yang Kang. Distributionally robust groupwise regularization estimator. In ACML, volume 77 of Proceedings of Machine Learning Research, pages 97\u2013112. PMLR, 2017.\n\n[9] Paul Embrechts, Thomas Mikosch, and Claudia Kl\u00fcppelberg. Modelling extremal events: for insurance and finance. Springer-Verlag, Berlin, Heidelberg, 1997.\n\n[10] Rui Gao and Anton J. Kleywegt. Distributionally robust stochastic optimization with Wasserstein distance. arXiv e-prints, page arXiv:1604.02199, Apr 2016.\n\n[11] Piet Groeneboom, Geurt Jongbloed, and Jon A. Wellner. Estimation of a convex function: Characterizations and asymptotic theory. Ann. Statist., 29(6):1653\u20131698, 12 2001.\n\n[12] Adityanand Guntuboyina and Bodhisattva Sen. Global risk bounds and adaptation in univariate convex regression. Probability Theory and Related Fields, 163(1):379\u2013411, Oct 2015.\n\n[13] Qiyang Han and Jon A. Wellner. Multivariate convex regression: global risk bounds and adaptation. arXiv e-prints, page arXiv:1601.06844, Jan 2016.\n\n[14] Lauren A. Hannah and David B. Dunson. Multivariate convex regression with adaptive partitioning. J. Mach. Learn. Res., 14(1):3261\u20133294, January 2013.\n\n[15] Eunji Lim. On convergence rates of convex regression in multiple dimensions. INFORMS Journal on Computing, 26(3):616\u2013628, 2014.\n\n[16] Eunji Lim and Peter W. Glynn. Consistency of multidimensional convex regression. Operations Research, 60(1):196\u2013208, 2012.\n\n[17] Rahul Mazumder, Arkopal Choudhury, Garud Iyengar, and Bodhisattva Sen. A computational framework for multivariate convex regression and its variants. Journal of the American Statistical Association, 114(525):318\u2013331, 2019.\n\n[18] Peyman Mohajerin Esfahani and Daniel Kuhn. 
Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. Mathematical Programming, 171(1):115\u2013166, Sep 2018.\n\n[19] Emilio Seijo and Bodhisattva Sen. Nonparametric least squares estimation of a multivariate convex regression function. Ann. Statist., 39(3):1633\u20131657, 06 2011.\n\n[20] Soroosh Shafieezadeh Abadeh, Peyman Mohajerin Esfahani, and Daniel Kuhn. Distributionally robust logistic regression. In Advances in Neural Information Processing Systems 28, pages 1576\u20131584. Curran Associates, Inc., 2015.\n\n[21] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifying some distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571, 2017.\n\n[22] Hal Varian. The nonparametric approach to demand analysis. Econometrica, 50(4):945\u201373, 1982.\n\n[23] Hal Varian. The nonparametric approach to production analysis. Econometrica, 52(3):579\u201397, 1984.\n\n[24] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices, pages 210\u2013268. Cambridge University Press, 2012.\n", "award": [], "sourceid": 6322, "authors": [{"given_name": "Jose", "family_name": "Blanchet", "institution": "Stanford University"}, {"given_name": "Peter", "family_name": "Glynn", "institution": "Stanford University"}, {"given_name": "Jun", "family_name": "Yan", "institution": "Stanford"}, {"given_name": "Zhengqing", "family_name": "Zhou", "institution": "Stanford University"}]}