Paper ID: 1709 Title: Nearly Optimal Private LASSO
Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes a differentially private algorithm for the Lasso with significantly cleaner construction and guarantees than previous known results. The idea is to adapt the Franke-Wolfe algorithm, which at each iteration finds the one feature most correlated with the current gradient and adds it into the estimate with a fixed weight. The adaptation is simply to add Laplace noise to the calculation of the correlation of feature X_j with the current gradient, before selecting which feature to add at this iteration.

The resulting guarantee shows an excess risk of rate O(1/n^(2/3)), with log factors. This power of n is also proven to be tight. In contrast, in a setting with a true sparse model, restricted eigenvalue type conditions on X, etc, the non-private error is O(1/n) and existing (more complicated) private mechanisms also introduce error at a rate O(1/n). The result given here is far more general as it does not rely on any properties such as restricted eigenvalues, which in practice are not verifiable.

The proof of the upper bound is a simple corollary of an existing result on noisy Franke-Wolfe, while the lower bound involves some computation. The novelty here is in the idea of the algorithm itself, which is elegant.

Overall this is a nice contribution that largely resolves the question of privacy for the Lasso.

Page 2 typo: "logartihmic"

Theorem 1.2, 3.1, 3.2 are not clear on the definition of (eps,delta). Theorem 1.2 uses eps=0.1 and delta=1/n. Theorem 3.1 uses eps=0.1 and delta=o(1/n^2) which is a narrower class of algorithms. Theorem 3.2 seems to give a result that is independent of (eps,delta) since they are not specified but it must be the case that there are assumptions on (eps,delta).
Overall this is a nice contribution that largely resolves the question of privacy for the Lasso. The main results for the upper bound are straightforward applications of

existing bounds, but the idea of the algorithm is very clean and novel.

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper introduces differentially private Frank-Wolfe algorithms, which allows to conduct LASSO with differential privacy constraints. The proposal method is designed by taking account of the fact that the constraints induced by LASSO forms a polytope. As a result, the excess empirical risk of the proposal learning algorithm is bounded above by nearly O(1/n^{2/3}) under l1-Lipschitz continuous assumption of loss functions. This bound is better than O(polylog(n)/n^(1/3)) by [31]. The bound on the excess empirical risk is higher than O(polylog(n)/n) by [32,37]. However considering that this is derived without strong assumptions (e.g., restricted strong convexity and mutual incoherence as required in [32,37]), this can be a new contribution. Furthermore, the authors show that the bound of the excess empirical risk is near optimal under epsilon<=0.1 and delta=o(1/n^2). The authors also derive a differentially private Frank-Wolfe algorithm that can work with any convex constraint region.

The derivation of the near-optimality of the proposed algorithm is an original contribution while the derivations of the differential privacy and the excess empirical risk seems to be applications of existing results.

In terms of privacy preservation, assumptions on the parameter, epsilon<=0.1 and delta=o(1/n^2), are adequate. On the other hand, in terms of utility, derivation of the constant term of the excess risk is needed. The noise added by the mechanism can occupy the domain of the prediction y

in realistic situations if the constant term of the bound on the excess empirical risk is not small.

Say, suppose the constant term of the bound of the excess empirical risk is 1 (i.e., the excess empirical risk <= log(np/delta)/(n epsilon)^(2/3) ). Suppose privacy parameters are set to epsilon=0.1delta=1/n^2, and p=2. Then,

we have excess empirical risk <= 0.5 if n<=4000 by Corollary 2.7. Since the domain of the prediction y is [-1,1], the half of the domain can be covered with not a small probability. This can be improved with a larger size of samples, but n=4000 is not a small sample size in practice. Consequently, the assumptions on privacy parameters, epsilon=0.1 and delta=1/n^2, might not be practical in some situations.
This paper introduces differentially private Frank-Wolfe algorithms, which allows to conduct LASSO with differential privacy constraints. The study contains some novel theoretical contribution.

Author Feedback
Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We would first like to thank all the reviewers for their careful reviews. Below are our responses to some specific comments from the reviewers.

Assigned_reviewer_1: Theorem 1.2, 3.1, 3.2 are not clear on the definition of (eps,delta). Theorem 1.2 uses eps=0.1 and delta=1/n. Theorem 3.1 uses eps=0.1 and delta=o(1/n^2) which is a narrower class of algorithms. Theorem 3.2 seems to give a result that is independent of (eps,delta) since they are not specified but it must be the case that there are assumptions on (eps,delta).

Response: Thanks for pointing out the inconsistency between Theorems 1.2. and Theorem 3.1. It should have been \delta = o(1/n^2) in both the cases, and \epsilon < = 0.1 . Our lower bounds hide the exact dependence on \delta, assuming it is o(1/n^2) and \epsilon <=0.1. We will correct these in the final version. Note that proving a lower bound with \delta=1/n^2, and \epsilon=0.1 suffices for the regime \delta= o(1/n^2), and \epsilon<=0.1, as the privacy guarantees are only stronger in the latter.

Assigned_reviewer_3: In terms of privacy preservation, assumptions on the parameter, epsilon<=0.1 and delta=o(1/n^2), are adequate. On the other hand, in terms of utility, derivation of the constant term of the excess risk is needed. The noise added by the mechanism can occupy the domain of the prediction y in realistic situations if the constant term of the bound on the excess empirical risk is not small. Say, suppose the constant term of the bound of the excess empirical risk is 1 (i.e., the excess empirical risk <= log(np/delta)/(n epsilon)^(2/3) ). Suppose privacy parameters are set to epsilon=0.1,delta=1/n^2, and p=2. Then, we have excess empirical risk <= 0.5 if n<=4000 by Corollary 2.7. Since the domain of the prediction y is [-1,1], the half of the domain can be covered with not a small probability. This can be improved with a larger size of samples, but n=4000 is not a small sample size in practice. Consequently, the assumptions on privacy parameters, epsilon=0.1 and delta=1/n^2, might not be practical in some situations.

Response: We will work out the leading constant in the next version. (A rough estimate gives an upper bound of 5.) We believe that our results are most applicable in the domain that $p$ is large. For example, if the constant term is 5, p is 10^5 (which is reasonable in modern applications) and n = 10^5 suffices to get excess empirical risk <= 0.5, for eps=0.1, delta=1/n^2. (As an example for such a choice of parameters at a similar scale present in Table 1 of https://web.stanford.edu/~boyd/papers/pdf/l1_logistic_reg_aaai.pdf). We want to mention that our result is an upper bound based on worst case analysis.

Assigned_reviewer_8: Moreover, it seems that the lower bound is only valid for large n, which is not the standard setting for the Lasso. Could the authors discuss that point in more details?

Response: Our lower-bounds hold as long as p=Omega(n^{2/3}). Our algorithm achieves the upper bound of O(log p / n^{2/3}). So the gap is O(\log p), which is nearly tight for the usual case of p = n^O(1). Hence the L1 constraints used in LASSO allow us to achieve a DP algorithm whose excess risk has a weak dependence on p, a phenomenon permeated in the other quality analysis of LASSO (this is also why LASSO scales well with p). We will add a discussion about this point in the next version, and also add a conclusion section.