Summary and Contributions: This work studies high-dimensional linear regression under a constraint imposed by a truncation set S: any pair (a, y) is hidden if y does not lie in S. A computationally efficient algorithm is proposed to recover the ground-truth signal, and a near-optimal statistical error rate is established.
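For concreteness, a minimal simulation of this observation model is sketched below; the dimensions, noise level, and the specific half-line chosen for S are illustrative assumptions, not the paper's setup.

    import numpy as np

    rng = np.random.default_rng(0)
    n, m, k = 1000, 500, 5              # ambient dimension, samples drawn, sparsity (assumed values)
    x_star = np.zeros(n)
    x_star[:k] = 1.0                    # k-sparse ground-truth signal

    A = rng.standard_normal((m, n))
    y = A @ x_star + rng.standard_normal(m)

    S = (0.0, np.inf)                   # hypothetical truncation set: only positive labels survive
    keep = (y > S[0]) & (y < S[1])
    A_obs, y_obs = A[keep], y[keep]     # only these pairs are ever revealed to the learner
    print(f"observed {keep.mean():.2%} of the {m} generated samples")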
Strengths:
- The paper is clearly written, and the connection to prior work is elaborated in detail, so the main contribution is easy to follow.
- Both computational and statistical results are presented.
- The authors also make a significant effort to explain the primary proof ideas (which is appreciated).
Weaknesses:
- My major concern is why the problem is difficult. Assumption 1 literally enforces that the adversary cannot pick an arbitrary S, but only one under which a constant alpha-fraction of the observations survives truncation. Thus, suppose that before removal we have a total of m samples (a, y); after removal we are still left with alpha * m pairs (a, y), which suffice for accurate recovery provided m = O(k log n).
- It is not convincing to me that the sample complexity in Theorem 3.1 is near-optimal. I know that O(k log n) is near-optimal, but does your result really imply such a bound? Observe that your success probability is 1 - 1/log n, which converges to 1 very slowly. In contrast, almost all existing works provide a strong probability of 1 - n^{-100}, or even 1 - exp(-n). If you rephrase Theorem 3.1 so that, with probability 1 - delta, || x_bar - x^* || < f(m, k, n, delta), what is the expression for the function f?
- In the rest of the paper, the success probability is replaced by "with high probability", which I do not believe conforms to common practice. Again, note that 1 - 1/log n does not suffice for a high-probability argument, and it raises serious issues for the sample complexity.
- I am disappointed that the proof of Theorem 3.1 is nowhere to be found, and the organization of the supplement is quite a mess. I was not able to track the true sample complexity under a genuinely high-probability event.
- The term "adversarial noise" is misleading. You are not handling arbitrary noise; e.g., the algorithm cannot be slightly tailored to handle gross outliers.
- The PDW construction in Wainwright 2009 requires a lower bound on |xhat - x|, whereas this paper asks for an upper bound. Is there any intuition behind this essential difference?
- I find Lemmas 5.7 and 5.8 hard to follow. How can you ensure/test that, for any x, f(x) - f(xhat) < log n / m^3? If the purpose of Section 5.2 is to provide a faster rate for SGD, why not show that the Hessian matrix (Lemma 5.1) has a lower bound on its smallest singular value when restricted to sparse directions?

My overall feeling is that the paper presents a dense set of results that extend previous work on truncated linear regression to the high-dimensional regime. Yet most of the key techniques, e.g. the log-likelihood of truncated linear regression and projected SGD, were already established in Daskalakis 2019 (a brief reminder of this primitive is given below). The main new ingredient is an application of the primal-dual witness technique, which is also known from Wainwright 2009. The discussion in the main body and the proofs in the supplement are hard to follow. I believe many of the results have already been proved in Daskalakis 2019, so references to the corresponding lemmas there (perhaps with a short paragraph explaining the differences) should serve the purpose.

---updates after rebuttal---
Thank you for addressing my concerns; I suggest incorporating them into your revision. I still feel Section 5 is dense and hard to follow, and the notation and terminology can be improved. Overall this is a good submission, but it turns out the computational efficiency relies heavily on Assumption II.
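For reference, the projected-SGD primitive mentioned above (from Daskalakis et al. 2019) rests on the following standard identity for unit-variance Gaussian noise; this is background only, not a statement about the present paper's proofs. For an observed pair (a, y) with y in S and standard normal density \varphi,

    \ell(w; a, y) = \tfrac{1}{2}\,(y - a^\top w)^2 + \log \int_S \varphi(z - a^\top w)\, dz,
    \qquad
    \nabla_w \ell(w; a, y) = \bigl(\mathbb{E}\bigl[z \mid z \sim \mathcal{N}(a^\top w, 1),\ z \in S\bigr] - y\bigr)\, a,

so an unbiased stochastic gradient is obtained by replacing the conditional expectation with a single sample from N(a^T w, 1) restricted to S (e.g. by rejection sampling), followed by the projection step.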
Correctness: Not entirely; see Weaknesses.
Clarity: Yes.
Relation to Prior Work: Yes.
Reproducibility: Yes
Additional Feedback:
Summary and Contributions: This paper looks at the problem of truncated linear regression, where labels are observed only if they lie in some fixed, known set, in the case where the underlying true model is k-sparse. The authors give an algorithm that achieves optimal error guarantees for this setting as a function of the number of samples, under standard assumptions. The algorithm is also computationally efficient under certain conditions on the truncation set.
Strengths: This paper looks at the problem of truncated linear regression, where labels are observed only if they lie in some fixed, known set, in the case where the underlying true model is k-sparse. The authors give an algorithm that achieves optimal error guarantees, with l2 squared error scaling linearly in the sparsity, as a function of the number of samples under standard assumptions; the algorithm is also computationally efficient under certain conditions on the truncation set. I think the problem of biased data is an important one. The paper presents interesting results and appears to make strong technical contributions. The paper is also very well written, clearly explaining the challenges in extending previous techniques and how they are dealt with.
Weaknesses: Can the authors comment on how easy this algorithm is to implement in practice? What is the precise dependence of the running time on n and m? -------- I have read the authors' response and would like to thank the authors for clarifying the runtime dependence.
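Regarding the implementation question above, a rough sketch of the kind of per-sample SGD step involved is given below (rejection sampling for the truncated Gaussian plus a hard-thresholding projection); the projection choice, step size, and fallback behaviour are assumptions for illustration, not the paper's actual algorithm.

    import numpy as np

    def sample_truncated(mean, S, rng, max_tries=1000):
        """Draw z ~ N(mean, 1) conditioned on z in S = (lo, hi) by rejection sampling."""
        lo, hi = S
        for _ in range(max_tries):
            z = mean + rng.standard_normal()
            if lo < z < hi:
                return z
        return np.clip(mean, lo, hi)   # fallback when S has tiny mass under the current mean

    def hard_threshold(w, k):
        """Keep the k largest-magnitude coordinates of w, zero out the rest."""
        out = np.zeros_like(w)
        idx = np.argsort(np.abs(w))[-k:]
        out[idx] = w[idx]
        return out

    def sgd_step(w, a, y, S, k, lr, rng):
        """One stochastic-gradient step on the truncated negative log-likelihood,
        followed by a sparsity projection (illustrative only)."""
        z = sample_truncated(a @ w, S, rng)    # unbiased surrogate for E[z | z in S]
        grad = (z - y) * a                     # per-sample gradient of the truncated NLL
        return hard_threshold(w - lr * grad, k)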
Correctness: The proofs look correct at a first glance.
Clarity: Yes, the paper is very well written.
Relation to Prior Work: Yes, the paper has sufficiently discussed comparison with previous work.
Reproducibility: Yes
Additional Feedback:
Summary and Contributions: The authors propose to solve the truncated sparse linear regression problem, which has recently gained attention due to works by Daskalakis and Wainwright. However, those techniques do not directly apply to the current formulation, and the authors propose new tools to address the resulting issues, such as the lack of strong convexity and finding a good initialization for the SGD approach.
Strengths: (A) The paper successfully solves the very relevant truncated linear regression problem and provides theoretical guarantees of l2 error O(\sqrt{k log n / m}) (see the short calculation below). (B) The setting is highly relevant for real-world scenarios where data could be censored or truncated for privacy reasons.
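To spell out what the rate in (A) buys, assuming the intended bound is ||xhat - x^*||_2 <= C \sqrt{k log n / m} (consistent with the squared-error statement in the other review), reaching l2 error epsilon requires on the order of k log n / epsilon^2 samples:

    \|\hat{x} - x^{*}\|_{2} \le C \sqrt{\frac{k \log n}{m}} \le \varepsilon
    \quad \Longleftrightarrow \quad
    m \ge \frac{C^{2}\, k \log n}{\varepsilon^{2}}.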
Weaknesses: (A) I did not see any experimental work, which would have strengthened the paper.
Correctness: Based on my reading of the main paper the results seem correct, but I did not check the proofs in the supplementary material.
Clarity: The paper is clearly organized, presenting prior work, the issues in applying it to the current problem, and how these are resolved by the proposed approach.
Relation to Prior Work: Prior works are clearly discussed and built on.
Reproducibility: Yes
Additional Feedback: Update after discussion: Keeping the score, since the contribution seems interesting and the discussion phase did not uncover any surprises.