Paper ID: | 2899 |
---|---|

Title: | Optimal Stochastic and Online Learning with Individual Iterates |

originality, significance: I am not sufficiently familiar with the contextual literature to deeply judge these aspects. However the ability to preserve sparsity over similar methods with such convergence guarantees is significant. quality: The quality seems high, but I wonder about the lack of error bars. How does the method react to random initializations? The large number of datasets makes me not worry too much about that-- but it seems like you did more than sufficient number of experiments to generate such uncertainty estimates so why aren't they shown? clarity: The writing is very clear and a pleasure to read. ==================================== ==================================== Thank you for your thorough response and for including the error bars.

*the paper is very well written; main ideas are explained clearly; results are presented mathematically rigorously *I have only a view comments which refer to comparing the proposed strategy to taking just the last iterate: -Algorithm 1: when I understand correctly, one has to calculate all iterates up to t=2T-1 and needs to store all iterates from t=T up to t=2T-1 -is the map t --> A_t monotone under the assumption of (strong) convexity ? this question refers to the choice of T* in Algorithm 1 - wouldn't any t \geq T* also do the job? ... the best choice would possibly be the argmin of all t satisfying the condition in l.16 (which is possibly the last iterate) -could you comment on this? *the benefit of this strategy compared to just taking some kind of averaging seems clear; but I do not see the benefit compared to taking the last iterate

The paper makes an important contribution to the literature of SGD optimization algorithms in that it introduces an algorithm that achieves (1) last round convergence, (2) allows for sparsity constraints imposed by regularizer, and (3) achieves the optimal convergence rate. The algorithm is a simple variant of the SCMD algorithm. The original version either gets (1) and (2) with suboptimal (up to a log T factor) convergence rate, or with time-averaging achieves (3). The main change in this paper is a subroutine that searches for a specific round that admits the optimal convergence rate, the existence of such a round is guaranteed by the fact that time-averaging gets optimal convergences; searching for such a round, however, is non-trivial. Experiments on real-world dataset clearly show the competitive edge of the new algorithm.