NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 6627
Title: Think out of the "Box": Generically-Constrained Asynchronous Composite Optimization and Hedging

Reviewer 1

Summary: This paper concerns the asynchronous sparse online and stochastic optimization setting, in which several algorithms work in parallel to optimize the same objective. The difficulty lies in the fact that not all of the parallel workers are aware of the current state of the iterate, which complicates the analysis. Existing convergence guarantees in this setting hold only for box-shaped constraint sets. In this paper the authors develop two new algorithms, "AsynCADA" and "HedgeHog!", that can deal with constraint sets that are not box-shaped. Both algorithms can additionally handle inexact readings of the current time point. Both are analysed in a setting related to online convex optimization in which the algorithm only has access to a perturbed state. This setting allows for the use of standard online convex optimization algorithms, but the analysis now requires us to bound the perturbation penalty between the actual update and an update in which the true state is known.

Qualitative assessment: The build-up of the paper could be improved. While the description of the problems, algorithms, and analysis is fine as is, I would prefer that the paper be reordered. As the paper is set up now, the algorithms are presented first and the perturbed ADA-FTRL framework is introduced afterwards. I would present it the other way around: first the framework, then the algorithms. This would allow the reader to understand some of the design choices of the algorithms by relating them to the analysis of the framework. The same holds in the appendix, where the proofs for both AsynCADA and HedgeHog! are presented before the analysis of the framework, even though the proofs of both algorithms hinge on that analysis, which explains some of the choices made in the design of the algorithms.
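To make the perturbed-state idea from the summary concrete, here is a minimal sketch (my own illustration, not the paper's AsynCADA): projected gradient descent on f(x) = x^2 over the interval [-1, 1], where each update uses a gradient evaluated at a stale ("perturbed") iterate from a few steps earlier. The function name and parameters are illustrative.

```python
def delayed_projected_gd(x0=1.0, steps=200, tau=3, eta=0.05):
    """Projected gradient descent on f(x) = x^2 over [-1, 1],
    with gradients computed at an iterate that is `tau` steps stale."""
    history = [x0]
    x = x0
    for _ in range(steps):
        stale = history[max(0, len(history) - 1 - tau)]  # perturbed state
        grad = 2.0 * stale                               # gradient of x^2 at the stale point
        x = min(1.0, max(-1.0, x - eta * grad))          # project back onto [-1, 1]
        history.append(x)
    return x
```

Despite the stale reads, the iterate still converges to the minimizer 0 for a small enough step size; the gap between this run and a synchronous one is exactly the kind of perturbation penalty the framework's analysis must bound.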
The new algorithms seem to be a useful contribution to the asynchronous composite optimization setting, since they allow for more general constraint sets than previous work. The generality of the analysis is a plus: several variants of the assumptions on the loss functions are presented, which helps the reader understand the framework. The analysis itself was interesting, and I agree with the authors when they say that the framework could be of independent interest.

Minor comments:
Line 32: sothcastic gradient → stochastic gradient
Line 93: f() → f_t()
Line 123: prox_\phi(z, \eta) → prox(\phi, z, \eta), to be consistent with (2)
Line 136: an over-estimates → an over-estimate
The definitions of the imaginary iterate in line 524 of the appendix and in (9) are not the same; I think that (t+1) should be t in line 524.

Post-rebuttal update: The authors addressed my concerns about the ordering of the paper. Therefore, I have increased my score for this paper.

Reviewer 2

In my opinion, the main advantage over existing results such as [Recht et al, Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent] and [Duchi et al, Estimation, Optimization, and Parallelism when Data is Sparse] is the type of performance bound. While the bounds in the cited papers are mainly concerned with sparsity of the data, the bound in Theorem 1 is a generic regret bound of the order O(\sqrt{T}) for the convex case and O(\log T) for the strongly convex case. This is clearly related to delayed-feedback results in prediction with expert advice. The asymptotic behaviour of the regret bound for HedgeHog! is also O(\sqrt{T}).

There is also an advantage in the required conditions. The paper considers the case of a convex (rather than strictly convex) target function in combination with a generic convex (rather than box-shaped) domain. All these conditions are well-known in the theory of gradient descent algorithms (and discussed at length in [Bottou et al, Optimization Methods for Large-Scale Machine Learning]), but not in the theory of asynchronous algorithms. The result of [Recht et al, Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent] on asynchronous algorithms is limited to strictly convex functions, and the result of [Duchi et al, Estimation, Optimization, and Parallelism when Data is Sparse] to box-shaped domains.
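As a toy illustration of the O(\sqrt{T}) regret regime mentioned above (my own sketch, not the paper's algorithm), consider online gradient descent on losses f_t(x) = (x - z_t)^2 over the domain [0, 1] with step size 1/\sqrt{t}; the names and loss choice are assumptions for illustration only.

```python
import math
import random

def ogd_regret(T, seed=0):
    """Regret of online gradient descent with eta_t = 1/sqrt(t) on
    f_t(x) = (x - z_t)^2, z_t ~ Uniform[0, 1], domain [0, 1]."""
    rng = random.Random(seed)
    zs = [rng.random() for _ in range(T)]
    x, loss = 0.0, 0.0
    for t, z in enumerate(zs, start=1):
        loss += (x - z) ** 2
        grad = 2.0 * (x - z)
        x = min(1.0, max(0.0, x - grad / math.sqrt(t)))  # projected OGD step
    best = sum(zs) / T  # minimizer of sum_t (x - z_t)^2 over [0, 1]
    comparator_loss = sum((best - z) ** 2 for z in zs)
    return loss - comparator_loss
```

The regret grows sublinearly in T, consistent with the generic O(\sqrt{T}) guarantee the review describes (and, since these losses are strongly convex, in practice closer to the O(\log T) rate).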

Reviewer 3

Post feedback:
1. Thanks, this clarifies the speed-up confusion. So, you want the factor p^*\tau^* to be not too large, both by ensuring high sparsity and by having a small enough number of parallel processors that \tau^* does not blow up.
2. Got it, but this does limit the usability of the algorithm to a narrower class of problems that adhere to p^* < 1. I don't recall too many practical examples of this in standard machine / deep learning.
3. I think you may be missing a \mu in your expression; it should be f + \phi - \frac{\mu}{2} ||x||^2? I see what you are saying here, though.
Thanks for the clarification; I'm happy to raise my evaluation to 7.
---------------
Originality: The authors claim a broad-framework-style contribution that subsumes many open or previously solved analysis cases.

Clarity: The paper, while written fluently, is addressed to an audience that is well-versed in the precursor work. Key terms such as "linear speedup" seem hard to grasp for me, not being from the asynchronous optimization area (see below in "improvements").

Quality & Significance: The results sound significant in the breadth of cases covered in the analysis framework proposed by the authors. A main claim here is that the analysis allows for optimization over arbitrary convex constraint sets, while previous asynchronous optimization algorithms required unconstrained or box-shaped domains only. This improvement is achieved by pushing the constraints into the prox operator (pg 4), which is assumed to be easily solvable.