Paper ID: 515 Title: Estimating LASSO Risk and Noise Level
Reviews

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This work proposes a new method for estimating the risk of the Lasso procedure,
also providing an estimate of the unknown noise level, in the Gaussian regression
model. The considerations are based on asymptotic results relying on
approximate message passing as well as on Stein's Lemma.

The paper offers asymptotic guarantees in the regime where n/p converges to
a certain level. It also contains a short experimental section.

The general impression is that the paper is not fully finished (cf. the number
of typos below), and is a large collection of difficult results that are not
necessarily explained in detail.
Moreover, the experimental section is not really convincing, since the comparisons
given only consider naive procedures (or nearly so).

It should also be noted that all the results on page 8 rely on a conjecture,
and therefore cannot be taken for granted as written.

Recent works have also focused on the Lasso estimator when the noise level is unknown,
and several proposals are worth mentioning. Among others:

"L1-Penalization for Mixture Regression Models",
Nicolas Städler, Peter Bühlmann, Sara van de Geer, 2010

"Pivotal Estimation of Nonparametric Functions via Square-root Lasso"
Alexandre Belloni, Victor Chernozhukov, Lie Wang, 2011

"SOCP based variance free Dantzig Selector with application to robust estimation"
Arnak S. Dalalyan, 2012

Comparisons with some of those methods are sorely needed to validate in practice
the results obtained.

Questions: can the authors relax the assumption that n/p converges to a fixed ratio?
What about cases where p >> n?

-l83: "for of " -> "for"

-l100: latex issue...

-l153: lambda -> latex lambda
Moreover, I do not understand why, when changing lambda, the signal is re-sampled
again. Is there any particular reason for that?

-l157: delta is defined only afterwards. Please correct.

-l179: tha?

-l190: no clue is given for choosing y_0...

-l208/214: the N_1 is different in the 2 lines, please correct the typo.

-l295 and 300: what is nu?

-l297: what is D?

-l307: it seems that the normalization differs in that line w.r.t. the equation line 297.

-l442: this book simply does not exist. The one I know is co-written by S. Van de Geer. Please
correct.

-l445/450: Candes should be Cand\`es.

-In the references, harmonize the author names so that they always appear with the same spelling;
e.g., A. Montanari sometimes also appears as "Andrea Montari".

-l472 is corrupted.

-l498: define the scalar product used with the 1/p normalization.

-l513: lemma lemma... lemma?

-l521: I think there is a hat missing on the tau.

This paper provides theoretical work on the asymptotic properties of an estimator of the MSE (as well as an estimator of the noise level) of the Lasso procedure in high dimension.
Though the theory seems appealing, the practical benefit of the considered method is not conclusive as is.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
(*) The paper proposes new estimators for:
(i.) the risk of a lasso estimator
(ii.) the noise-level in the lasso model.
Both estimators are computed purely from the data. Under some conditions, it is proven that these are consistent
estimators of the respective quantities as the dimension p and the number of measurements n go to infinity.
There are also a few experimental results showing the accuracy of the proposed estimators on simulated examples.

(*) The paper provides important results for a fundamental problem, estimating the lasso risk from data, and thus potentially offers
a significant contribution. An important practical implication is that the estimator could be used to optimize the regularization parameter \lambda of the LASSO.
However, the writing of the paper is not clear enough and is often sloppy. This makes it hard to follow the paper and verify
the correctness (and sometimes even the meaning) of the results, and it prevented me from assigning a higher score to the paper.
It looks like more effort should have been devoted to the presentation.

(*) The computer-experiments part is encouraging but limited and not well explained. What is the distribution of the non-zero
values of x_0? How was each point in the plot obtained? From a single run of the lasso? If so, why not average over multiple runs,
to also see the standard deviation of the estimation error and of the estimators the authors provide?

(*) Many things are not explained:
- What are 'irrepresentability conditions' in the introduction?
- How does one initialize y and z in the AMP algorithm in eq. (3.1)?
- I don't understand what the joint empirical distribution of {\hat{y}_i, x_{0,i}} is (first defined in line 270).
x_0 is the signal, with coordinates i=1,..,p. But \hat{y} is the data, with i=1,..,n. Why is the same i used here?
- In line 198 it is said: '\eta_t' is the derivative ...'. Derivative with respect to what? y^t?
- What is D in proposition 3.2 (line 297)?
- The authors provide a general estimator for a general design matrix, in Definition 3. But when should one use each estimator? Is it assumed
that the inverse covariance matrix \Omega is known? Or should one estimate it from the data?
- It is often not clear, when the authors write expectations E[..], over what these expectations are taken, for example in equations (3.2), (3.4), (3.5).
Is it over the choice of the elements of the matrix A? Of the error variables Z? Of the signal X? All of these together?

(*) The authors provide estimators for the error for one specific estimator: the lasso.
But could it be that other estimators give better MSE for the case of sparse vectors? What would a comparison of the MSE to minimax lower bounds yield?
(e.g. E. J. Candès and M. A. Davenport. How well can we estimate a sparse vector? Applied and Computational Harmonic Analysis, 2011)

(*) I have a problem interpreting the main results. In Theorem 4.1, the authors provide a consistency result for the estimator of the lasso risk,
as both the dimension p and the number of measurements n go to infinity together (with their ratio fixed). But, first, it looks like the result is given for
a fixed iteration t, yet one would expect the number of iterations required for convergence to depend on n and p (and the data). How do the results apply to the
actual LASSO solution? Second, from eq. (4.1) it looks like n and p approach infinity independently, which is confusing.
Finally, it seems to me that in this limit the lasso risk itself approaches some limit (i.e., some constant; for example, in [Wai09] it is shown that
under certain conditions the error goes to zero). If this is true, then what is the utility of the authors' estimator in this limit, if the error can anyway be
calculated as a function of the distribution of A and the noise?

(*) There are many typos and inaccuracies, some of which really hinder the reading and understanding:

The last section of the abstract ('We compare our variance ..') is unclear and confusing.
What are 'two naive LASSO estimator[s]'? Estimators of the error? Also, why 'On the other hand'?

Page 1, line 43: 'p < n' should be 'p > n'?

Page 2, line 69: 'the the risk' -> 'the risk'.

Page 2, line 82: 'data y,A' -> 'data y and A'?

Page 2, line 100: 'underbraccdelta' ???

Page 3, line 132: 'SURE' is written before it was introduced/defined.

Page 3, line 153: 'for each lambda'

Page 3, line 155: \delta is first written here but defined only later.

Page 4, lines 192-196: many indices in eq. (3.1) are wrong/missing (e.g. when is t used, when t+1,t-1? the iteration looks circular). Also, shouldn't the 2nd line be z+Ax?

Page 5, line 260: 'provides' -> 'provide'

Page 6, line 295: 'for' should be removed. It would also be good to define 'weakly differentiable'.

Page 9, line 473: Reference is messed up


Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This is an interesting paper that applies Stein's Unbiased Risk Estimator to a pseudo-data vector, which has a certain distribution shown by previous papers on approximate message passing, to derive an asymptotic estimator of the risk and the noise variance of the lasso.

The results of this paper are significant. The risk formula can inform the choice of the tuning parameter lambda and the variance estimate is certainly useful for residual analysis. The formulas presented are asymptotically correct and the authors show some simulations in which the formulas are accurate with 5000 samples.

The theory used and developed by the paper is not at all straightforward and the paper is well-written given its complexity. I could only understand the outline of the proof and I can only trust that the details are correct.

My only complaint about this paper is the shortage of simulation experiments. It would be very interesting to see how the estimation formulas perform as we vary n and p. I tried implementing the formulas and running my own simulations, but the formulas didn't work, even when I tried to replicate the setting of the paper's simulation. I find that $1-b_n$ is often negative, which renders the other estimated quantities senseless. I'd appreciate it if the authors could either share their code for the simulations (through an anonymous URL in the rebuttal, possibly) or inspect and debug my R code below.

Overall a good paper; good theoretical contribution with possible practical benefits.

library(glmnet)
n=5000; p=10000; sigma=0.5; lambda=0.2; beta=c(rep(1,0.1*p), rep(0,0.9*p));
X = (1/sqrt(n))*matrix(rnorm(n*p), nrow=n, ncol=p);
noise = sigma*rnorm(n); y = X %*% beta + noise;

glmnet.fit = glmnet(x=X, y=y, lambda=lambda*(1/n)) # glmnet uses average loss
beta_hat = glmnet.fit$beta; relvars = which(abs(beta_hat) > 0.000001)

bn = length(relvars)/n;
pseudo.y = beta_hat + (1/(1-bn)) * t(X) %*% (y - X %*% beta_hat)
theta = lambda/(1-bn)
tau = sqrt(sum((y - X %*% beta_hat)^2))/(sqrt(n)*(1-bn))

Rhat = tau^2 - (2*tau^2/p)*sum(abs(pseudo.y) < theta) + (1/p)*sum((sapply(abs(pseudo.y), FUN=function(x) {min(x,theta)})^2))
varhat = tau^2 - Rhat
Good theoretical contribution toward an important problem. The results are possibly practical but not easy to use.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
REVIEWER 4:

We acknowledge the presence of typos and thank the reviewer for pointing them out to us. We will definitely address them.
Regarding the other comments/questions, we think the following points are important to highlight.

(*) Our results are rigorously proven for the case of the standard Gaussian design model (Theorems 4.1, 4.2, and Corollary 4.3). Even in this case, they are highly non-trivial. We agree that the treatment of the general Gaussian design model relies on a conjecture that is supported by earlier work and sophisticated statistical-physics arguments such as the replica method. The simulation results strongly suggest that our formulas are accurate.

(*) Thanks for pointing us to other methods that analyze LASSO-like regression methods with unknown noise.

We will definitely cite them as related work.
However, after reading the aforementioned papers, we found that:

(a) None of them provides an estimator for the LASSO risk.

(b) All of them (except Städler-Bühlmann-van de Geer [SBG12]) assume sparsity of the original signal x0, whereas in our work this condition is not required. We will definitely compare with [SBG12] and a related method by Sun-Zhang'11 in the next version of the paper. However, we did compare our method with the refitted cross-validation (RCV) method of [FGH12] for noise-level estimation, as its assumptions are as general as ours.

(*) Regarding the convergence of n/p to a fixed ratio \delta: since our results hold for any constant \delta in (0,\infty), in practice, even if p >> n, one can use our formulas with the actual value of n/p as the limit \delta. Indeed, the formulas allow one to recover known results (with sharp constants) in the regime p >> n.

----------------------------
REVIEWER 5:
-----------

We thank the reviewer for his/her detailed analysis.

(*) Regarding the extension to other estimation methods:

Theorem 4.1 applies to general AMP algorithms in which the non-linearities \eta_t can be arbitrary, and hence its results apply to a very broad class of iterative methods. Each such algorithm produces an estimator different from the LASSO. While further generalizations might be possible, AMP algorithms have the property of admitting a fairly explicit characterization.

(*) Regarding interpreting the results:

(a) Theorem 4.1 applies to any fixed number of iterations and does not require convergence of the algorithm. This is of independent interest because one might want to halt the iteration when a certain degree of estimation accuracy is achieved. Theorem 4.2 applies to the actual LASSO solution.
(b) p and n are related by the condition that the sequence must be converging (cf. Definition 1 or lines 220-221 of the paper). In particular, the ratio n/p converges.
(c) Existing results (e.g., Bickel et al. 2009) imply that the (\ell_2) risk of the LASSO vanishes in probability under restricted-eigenvalue or similar conditions. However, these upper bounds are quite conservative.
For instance, they would not appear in the plots in the left panels of Figures 1 and 2 because the upper bounds are very large (outside the plotted range).
While these upper bounds are very useful for supporting the use of the LASSO or similar methods, they cannot be used for practical matters such as tuning \lambda, as they only estimate the risk up to some large constant.

(*) Thanks for pointing out the typos. We will definitely address them.
Below are responses to the reviewer's other questions:

- The distribution of x0 can be arbitrary as long as its empirical distribution converges and the limit has a finite second moment (cf. Definition 1).
In the simulations, the coordinates of x0 are i.i.d.: equal to 1 with probability 0.05, -1 with probability 0.05, or 0 with probability 0.9.
- In the simulations we use a single run of the LASSO to highlight the statistical error of the estimator.
We also note that each point uses a new (independent) x0 and noise, which allows one to gauge the statistical fluctuations.
Nevertheless, we agree that showing average values over several runs with standard errors would be useful, and we plan to add them in the longer version of the paper.
- Irrepresentability states that the LASSO selects the true model consistently iff the predictors that are not in the true model are "irrepresentable" by the predictors that are in the true model (cf. Section 2 of [ZY06] for an exact definition).
- The AMP algorithm is initialized with x^0 = y^0 = z^0 = 0 (see the sketch after this list).
- Regarding the joint empirical distribution of {\hat{y}_i, x_{0,i}}: i goes from 1 to p for \hat{y}_i as well, since \hat{y} is in R^p.
- In line 198, \eta_t' is the derivative with respect to y^t. Since \eta_t is assumed to be separable, the derivative is also applied coordinate-wise.
- In line 297, D is the multivariate derivative.
- In practice, the inverse covariance can be estimated from the data.
- All expectations E[] are taken with respect to the joint distribution of all random variables inside the brackets: i.e., either w.r.t. A, x0, and the noise, or w.r.t. X and Z.
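
For concreteness, here is a minimal R sketch of the AMP iteration with this zero initialization, using soft thresholding as \eta with a fixed threshold theta. This is an illustrative simplification (the fixed threshold and the helper names soft and amp_lasso are assumptions of this sketch), not the exact statement of eq. (3.1):

soft <- function(u, th) sign(u) * pmax(abs(u) - th, 0)   # eta: soft thresholding

amp_lasso <- function(A, y, theta, iters = 50) {
  n <- nrow(A); p <- ncol(A); delta <- n / p
  x <- rep(0, p); z <- rep(0, n)             # x^0 = z^0 = 0, as stated above
  for (t in seq_len(iters)) {
    pseudo.y <- drop(x + crossprod(A, z))    # y^t = x^t + A' z^t (pseudo-data)
    onsager  <- mean(abs(pseudo.y) > theta) / delta  # <eta_t'(y^t)> / delta
    x <- soft(pseudo.y, theta)               # eta applied coordinate-wise
    z <- drop(y - A %*% x + z * onsager)     # residual with Onsager correction
  }
  list(x = x, pseudo.y = pseudo.y)
}

To mimic the simulation setting described above, one could take x0 = sample(c(1, -1, 0), p, replace = TRUE, prob = c(0.05, 0.05, 0.9)) and A with i.i.d. N(0, 1/n) entries before calling amp_lasso(A, y, theta).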

----------------------------
REVIEWER 6:
-----------

We thank the reviewer for his/her positive comments.

(*) Changing n/p will only change the shape of the curve; the quality of the estimators will remain the same.

(*) We debugged the reviewer's R code and found the following two minor errors:

1) glmnet by default adds an intercept and standardizes the data, so these options should be turned off (the latter requires glmnet version 1.93 or later).
The correct line in R should be:

glmnet.fit = glmnet(x=X, y=y, lambda=lambda/n, standardize = FALSE, intercept=FALSE)

2) This one is caused by a LaTeX error in line 100 of the paper: Rhat should be divided by delta.
The formula is stated correctly in Corollary 4.3.
Therefore, the correct line in R should be (given that delta=n/p):

varhat = tau^2 - Rhat/(n/p)

The code will run perfectly after these changes (i.e., varhat will be close to sigma^2=0.25).
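
For convenience, here is the reviewer's script with both fixes applied as a sketch (all other lines unchanged; as noted above, varhat should then come out close to sigma^2 = 0.25):

library(glmnet)
n=5000; p=10000; sigma=0.5; lambda=0.2; beta=c(rep(1,0.1*p), rep(0,0.9*p));
X = (1/sqrt(n))*matrix(rnorm(n*p), nrow=n, ncol=p);
noise = sigma*rnorm(n); y = X %*% beta + noise;

# Fix 1: turn off the intercept and standardization (needs glmnet 1.93 or later)
glmnet.fit = glmnet(x=X, y=y, lambda=lambda/n, standardize=FALSE, intercept=FALSE)
beta_hat = glmnet.fit$beta; relvars = which(abs(beta_hat) > 0.000001)

bn = length(relvars)/n;
pseudo.y = beta_hat + (1/(1-bn)) * t(X) %*% (y - X %*% beta_hat)
theta = lambda/(1-bn)
tau = sqrt(sum((y - X %*% beta_hat)^2))/(sqrt(n)*(1-bn))

Rhat = tau^2 - (2*tau^2/p)*sum(abs(pseudo.y) < theta) + (1/p)*sum((sapply(abs(pseudo.y), FUN=function(x) {min(x,theta)})^2))
# Fix 2: divide Rhat by delta = n/p (cf. Corollary 4.3)
varhat = tau^2 - Rhat/(n/p)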