__ Summary and Contributions__: The paper proposes a framework to incorporate parametric information into DRO for supervised machine learning. DRO is a promising alternative to traditional ERM, as it typically enjoys better performance in small-sample regimes or under out-of-distribution corruption. This paper considers the parametric case of exponential families and proposes to use ambiguity sets that respect this parametric structure.

__ Strengths__: The paper is well-written and easy to follow.

__ Weaknesses__: - The contribution of the paper is too marginal. DRO with KL ambiguity sets has been extensively studied. For the nonparametric case, see Faury et al., "Distributionally Robust Counterfactual Risk Minimization" (AAAI 2020), and Si et al., "Distributionally Robust Policy Evaluation and Learning in Offline Contextual Bandits" (ICML 2020). These papers propose explicit worst-case distributions, tractable algorithms for the modified robust ERM (via reweighting), and a complete theoretical analysis thereof (consistency, sample complexity, etc.).
- Moreover, the particular parametric case of DRO on exponential families (the subject of this manuscript) has already been studied in Hu et al., "Kullback-Leibler Divergence Constrained Distributionally Robust Optimization".
- The paper lacks a solid theoretical background. Since everything is parametric, I'd expect explicit rates of convergence involving all problem complexity parameters (n, m, p, etc.).
To make the rest of my points clear, let me recall the notation used in the paper:
- n: the dimensionality of the covariate (i.e., feature) vector X. Thus X is a random vector in R^n. By the way, in the context of ML or statistics, I'd use another symbol here, as n conventionally stands for "sample size".
- m: the dimensionality of the output variable Y. Thus Y is a random vector in R^m.
- p: the dimensionality of the ambient space in which the model parameter theta lives.
- N: the number of samples in the training dataset
- (x_1, y_1), ..., (x_N, y_N): an i.i.d. sample of size N from the unknown joint distribution of X and Y.
- C: the number of distinct feature vectors x_i in the sample. Thus 1 <= C <= N. WLOG, let the distinct feature vectors be x_1, ..., x_C.
- N_c (with 1 <= c <= C): #{i | x_i = x_c}, i.e., the number of examples whose feature vector equals x_c.
- Lines 95 to 98: Since the covariates (i.e., features) are continuous, we have C = N and N_c = 1 for all c almost surely.
- Lines 99 to 102: The model in (5) has C + dim(Theta)^C parameters, which is roughly N + dim(Theta)^N in the case of continuous covariates (see the previous comment). This cannot possibly work as soon as dim(Theta) > 1.
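To make the C = N point above concrete, here is a minimal Python sketch (my own illustration, not code from the paper) showing that i.i.d. continuous covariates are all distinct with probability one:

```python
import numpy as np

# With continuous covariates, duplicate feature vectors occur with
# probability zero, so the number of distinct covariates C equals the
# sample size N and every group size N_c is 1.
rng = np.random.default_rng(0)
N, n = 1000, 5
X = rng.normal(size=(N, n))   # N i.i.d. continuous feature vectors in R^n

distinct, counts = np.unique(X, axis=0, return_counts=True)
C = len(distinct)             # number of distinct feature vectors

assert C == N                 # C = N ...
assert counts.max() == 1      # ... and N_c = 1 for every c
```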
- Proposition 4.2: This should be rewritten to clearly outline the dependence on the sample size N. Also, the third term on the right-hand side is mysterious. Finally, at what rates do kappa_1 and kappa_2 go to zero (if they do at all)?
- Line 227: I don't see how you can let "N_c tend to infinity" in view of my comment above that N_c = 1 almost surely. The rest of the analysis (Lemma 4.4) is therefore awkward.
- Why are the experimental setups presented in Section 5 relevant to the subject?
- Documentation on the datasets in Table 2 should be provided (n = ?, m = ?, etc.).

__ Correctness__: Lemma 4.4 sounds awkward. How can N_c tend to infinity when the covariates are continuous? (See my previous comments.)

__ Clarity__: Yes.

__ Relation to Prior Work__: No. DRO with KL ambiguity sets has been extensively studied. For the nonparametric case, see Faury et al., "Distributionally Robust Counterfactual Risk Minimization" (AAAI 2020), and Si et al., "Distributionally Robust Policy Evaluation and Learning in Offline Contextual Bandits" (ICML 2020). These papers propose explicit worst-case distributions, tractable algorithms for the modified robust ERM (via reweighting), and a complete theoretical analysis thereof (consistency, sample complexity, etc.).
- Moreover, the particular parametric case of DRO on exponential families (the subject of this manuscript) has already been studied in Hu et al., "Kullback-Leibler Divergence Constrained Distributionally Robust Optimization".

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: This paper studies the robust maximum likelihood estimator for exponential families using KL ambiguity set. Comprehensive theoretical results are presented, including the tractable reformulation and statistical guarantees.

__ Strengths__: This work presents a strong theoretical contribution to maximum likelihood estimation in the DRO setting. Both the theoretical results and the empirical evaluation on synthetic/real data are promising. This work is a good attempt at applying DRO techniques to classical statistical problems and could serve as a good complement to the current literature on robust maximum likelihood estimation.

__ Weaknesses__: I think the major limitation is that the model is still restrictive in the following sense: the ambiguity set does not change the support of X, which remains that of the training data. This restriction indeed leads to the tractable reformulation, but it does not account for possible contamination in the X variable. In other words, the restriction to discrete X might be inappropriate in certain applications.

__ Correctness__: Correct.

__ Clarity__: This paper is well written. The organization is very clear and easy to follow.

__ Relation to Prior Work__: Satisfactory.

__ Reproducibility__: Yes

__ Additional Feedback__: The ambiguity set defined in (7) is good in terms of tractability and its fully parametric structure, but it is also restrictive and relies heavily on the empirical samples. If the underlying distribution is continuous and the training set is small, then the ambiguity set defined in (7) might not be a good set in terms of its similarity to the true underlying distribution. Therefore, I think it would be more interesting if the authors could add some discussion of possible extensions of this DRO MLE framework to other uncertainty sets (such as the Wasserstein ambiguity set, which does not restrict the support).
Moreover, Huber has a lot of seminal work on robust estimation of parameters under distribution contamination, which is highly related to the content of this paper but is not sufficiently discussed or cited.
A minor comment: in the Poisson counting model of Example 3.4, the pmf actually corresponds to Poi(e^{w0^T x}), not Poi(w0^T x) as written in the paper.
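To illustrate the minor comment above, a small sketch with made-up weights (my own, not the paper's code): the rate must pass through the exponential (log link), since w0^T x can be negative while a Poisson rate cannot:

```python
import math
import numpy as np

# Hypothetical weights and covariate, for illustration only.
w0 = np.array([0.5, -1.0])
x = np.array([1.0, 2.0])

linear = float(w0 @ x)    # w0^T x = -1.5: not a valid Poisson rate
rate = math.exp(linear)   # lambda = e^{w0^T x} > 0, as in Poi(e^{w0^T x})

def poisson_pmf(k, lam):
    """pmf of Poi(lam) at k."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

assert linear < 0 and rate > 0            # log link fixes the sign issue
# sanity check: the pmf sums to 1 (truncated sum, small lambda)
assert abs(sum(poisson_pmf(k, rate) for k in range(50)) - 1.0) < 1e-9
```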

__ Summary and Contributions__: This paper considers parametric distributionally robust optimization under the Kullback-Leibler divergence. Unlike the existing nonparametric framework, the considered distributions are assumed to belong to an exponential family whose parameter is determined by a function of a covariate, and the nominal distribution is also a parametric distribution whose parameters are estimated from data. Parallel to the existing nonparametric results, a dual reformulation, consistency, and worst-case distributions are derived. The paper also discusses its connection to exponential reweighting and variance regularization, and conducts experiments for Poisson and logistic regression using UCI datasets.

__ Strengths__: (1) The paper complements the existing non-parametric distributionally robust optimization by considering the exponential family with covariates.
(2) The theoretical results build off of techniques from distributionally robust optimization under the Kullback-Leibler divergence, and the derivations demonstrate the authors' mastery of these techniques.
(3) Numerical results demonstrate some superiority of the proposed approach.
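As context for the KL-DRO techniques mentioned in (2), the standard duality and its exponential-reweighting structure can be sketched numerically (my own generic illustration of the well-known result, not the paper's formulation): sup over KL(Q||P) <= rho of E_Q[l] equals min over a > 0 of a*log E_P[exp(l/a)] + a*rho, attained by exponentially tilting P.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
losses = rng.normal(size=200)              # per-sample losses under P
p = np.full(losses.size, 1 / losses.size)  # empirical (uniform) nominal P
rho = 0.1                                  # KL radius

def dual(alpha):
    # alpha * log E_P[exp(l/alpha)] + alpha * rho, via log-sum-exp
    m = losses.max()
    return m + alpha * np.log(p @ np.exp((losses - m) / alpha)) + alpha * rho

res = minimize_scalar(dual, bounds=(1e-3, 100.0), method="bounded")
alpha_star = res.x

# Worst-case distribution: exponential reweighting of P
q = p * np.exp(losses / alpha_star)
q /= q.sum()

kl = float(q @ np.log(q / p))    # binds at the radius: kl ~ rho
worst_case = float(q @ losses)   # matches the dual optimum res.fun
```

This is the reweighting structure that the nonparametric KL-DRO literature exploits; the paper's parametric setting adds the exponential-family constraint on top of it.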

__ Weaknesses__: (1) One key issue with the current framework is the specification of the radii rho_c. Lemma 4.4 needs significant improvement. By the nature of MLE estimators, for Examples 2.2 and 2.3 the conditions should state joint convergence across all c, whereas the current form treats each c separately. In addition, I would suggest providing an explicit form for Z, as it is crucial to determining the radii.
(2) Several issues regarding the numerical experiments are in order. First, how does the parametric framework compare to existing nonparametric distributionally robust optimization under the Kullback-Leibler divergence? Second, in the current setup it is assumed that rho_c depends only on the sample size N_c, while my intuition is that the radii rho_c should depend on the magnitude of x_c as well, which can be inferred from the asymptotic convergence. I would suggest running the experiments with a more carefully designed tuning scheme.
I will raise my score if the authors can successfully address these issues.
==== Edit after rebuttal ====
The rebuttal briefly addresses my concerns about the joint asymptotic convergence results, and I wish to see a full exposition in the final version. On the other hand, I am not convinced that choosing \rho_c based on x_c is unrealistic. One may try to study the asymptotics at the optimal w, in a spirit similar to [13]. Hence, I will keep my score.

__ Correctness__: Yes to the best of my knowledge.

__ Clarity__: Yes.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: The paper presents a regularization scheme for parameter estimation using maximum likelihood estimation (MLE). The new scheme leverages the distributionally robust optimization (DRO) [arXiv:1908.05659] framework, which is related to adversarial learning. The goal is to achieve better generalization for MLE as compared to existing techniques such as L1 and L2 regularization. The paper casts the MLE learning problem within a min-max setup where the loss function is minimized with respect to an expectation (which is maximized) over an adversarial distribution (an element of a parametric ambiguity set). The learning scheme is applied to Poisson and logistic regression, which are trained using off-the-shelf optimization software and empirically evaluated on synthetic and UCI datasets.

__ Strengths__: The paper casts regularized MLE within a min-max (adversarial) setting. A theoretical analysis is provided.

__ Weaknesses__: For Poisson regression, the technique works for extremely small sample sizes (50, 100), as demonstrated in Table 1, but at 500 (still very small) the performance difference between L2 and DRO is much smaller. This technique may find application in extremely small data regimes, but it is difficult to imagine it being applicable (given the higher setup complexity) to the big-data regime. For logistic regression on UCI datasets (Table 2), the performance improvements are not significant.

__ Correctness__: They appear correct to me.

__ Clarity__: Yes.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: How does your approach relate to the idea of over-parameterization?
----- update after authors' response -----
I have read through the authors' feedback and reviews. My review remains unchanged.