NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center

### Reviewer 1

1.One of the applications for the proposed algorithm may be the case with the zero norm. However, for the zero norm, extensive theoretical analyses have been done, such as matching pursuit, IHT, and StoIHT. Some of them even guarantee perfect support recovery. Comparing with those algorithms, what are the benefits of the newly-proposed algorithms? Here, first-order stationary points may be far away from the optimal solution when the sparsity regularizer is imposed. It might be interesting to know the property of support recovery for the proposed algorithms. Or in a more general case, what is the behavior related to the convergence of x_t to the “optimal x^*”? 2. Based on Theorem 5, the proposed algorithm 3 will not converge to a first-order stationary point, since $(\gamma + 4\theta L ) sigma^2 / 2 \theta L |S_1|$ is dependent on batch size |S_1| and |S_1| is upper bounded by data size n ( at least in the finite-sum setting ). In order to make (\gamma + 4\theta L ) sigma^2 / 2 \theta L |S_1| \leq O(\epsilon), |S_1| may need to be larger than n, which is not possible. At least from Theorem 5, this reviewer has problems understanding why it is true for the finite-sum setting in Corrollary 6, which may need further explanation. 3. Please give the definition of \tilda( O) and O in table 1. 4, line 76 typo: mataching -> matching

### Reviewer 2

This paper provides the first non-asymptotic analysis for stochastic proximal gradient algorithms in the fully non-smooth non-convex setting. So far only convergence results in the deterministic case, and recently in the stochastic case, have been obtained. The rates derived are optimal. No regularization or smoothing of the objective is performed (such as reformulating an auxiliary function like the Moreau envelope), but rather a direct proximal sub gradient step is performed at every stage. The drawback of this approach is that no potential function is available, making it difficult how to measure progress of the algorithm. Moreover, Theorem 2 shows that the distance measure to the set of stationary points is given by a sum of variances of the stochastic oracle. As a consequence, it has to be assumed that the stochastic oracle's noise is uniformly bounded, which is a quite strong assumption. Specific comments: -) line 23: If r(x) is assumed to be LSC it can be non-smooth as well. There is no need to emphasize this again here. -) line 76: typo in „matching“. -) line 87: It is a bit unfortunate to use the same letter for a data sample and a generic set. I would recommend introducing different notation to distinguish them. -) I don’t agree that eq. (2) is a special case of (1). Mathematically speaking this is correct of course, but the interpretation is usually different. In Stochastic optimization our goal is to solve problem (1) and problem (2) would represent a sample average stochastic approximation for this problem. -) line 106: I don’t think that argmin needs an explanation. -) line 110: two times „the“ -) The iteration complexity of Theorem 1 depends on the quantity Delta. It seems to be hard to have a good guess for this quantity, unless we restrict the analysis to an a-priori known compact set. How do you justify this assumption? -) Proof of Theorem 2: In line 421, instead of what is written, you should use the Fenchel-Young inequality to get to the bound (16), i.e a.b<(1/2L)a^{2}+(L/2)b^{2}. Same remark applies to line 443. Line 429 is just the Pythagoras Theorem.

### Reviewer 3

For neural network compression, it is common to train sparse models with L0 projections, as well as quantized models. It is significant to analyze and propose new algorithms for such training. The algorithms are limited in practice because the learning rate is constant. The practical fix of increasing the batch size throughout training is significantly less convenient than decreasing the learning rate. For the convergence analysis, the main trick is to exploit the optimality conditions from the proximal update, and I found the proof sketch in Section 2.1 instructive. *Update* I have read the author response and other reviews, and I am keeping my review the same. I think my point about the practicality is still true, but this is only a minor comment. The authors plan to define the practical algorithm in a way that is more clear.