NIPS 2017
Mon Dec 4th through Sat the 9th, 2017 at Long Beach Convention Center

### Reviewer 1

The paper considers accelerating ADMM using an adaptive penalty parameter. It develops a new adaptive scheme for the penalty parameter and establishes theoretical guarantees for both deterministic and stochastic ADMM. Overall, the use of a local error bound is interesting, but the contribution is not significant enough:

1. The new adaptive penalization scheme turns out to be a warm-started multi-stage algorithm. The algorithm could be slow in practice, even though faster convergence is claimed.
2. Increasing the penalty parameter after each outer-loop iteration could greatly slow down the algorithm when K is large.

### Reviewer 2

Summary: This paper shows that the O(1/eps) iteration complexity of ADMM can be improved to O(1/eps^(1-theta)), where theta is a parameter that characterizes how sharply the objective function increases with respect to increasing distance to the optimal solution. This improvement is shown for a locally adaptive version of ADMM in which the penalty parameter is increased after every $t$ steps of ADMM. The method is extended to stochastic ADMM, whose O(1/eps^2) iteration complexity is shown to improve similarly. The results are backed by experiments on generalized Lasso problems.

Overall, the paper is well written and makes an important contribution towards improving the analysis of ADMM under adaptive penalty parameters. On a practical note, I would have liked to see some comparisons against other "adaptive-rho" heuristics used in the literature (see Boyd's monograph). As background, it may be valuable to some readers to see how the local error bound relates to the KL property, along with some examples of "theta" for problems of interest in machine learning.

There are many grammatical errors, typos, and odd phrases in the abstract: "tremendous interests" --> "tremendous interest", "variants are" --> "variants is", "penalty scheme of lies at it is...", "iterate message".

Further comments:
- The LADMM-AP behavior in plot 1(e) is somewhat strange. Any explanation?
- Is the factor of 2 in LA-ADMM optimal in any sense?
- In practice, ADMM stopping criteria include primal-dual tolerance thresholds. Shouldn't LA-ADMM use those instead of a fixed t?
- With regard to linearized ADMM, Eqn 7, please comment on its relationship to the Chambolle-Pock updates (https://hal.archives-ouvertes.fr/hal-00490826/document).
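For readers unfamiliar with the scheme under review, a minimal sketch of the multi-stage, doubling-penalty idea on a plain lasso problem (a special case of the generalized lasso used in the experiments) may help. This is not the authors' implementation; the function name, the stage counts `t` and `K`, and the choice to rescale the scaled dual variable when rho doubles are all my own illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, k):
    # elementwise soft-thresholding: the prox operator of k * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

def la_admm_lasso(A, b, lam, rho0=1.0, t=50, K=5):
    """Illustrative sketch (not the paper's code) of a locally adaptive
    ADMM for the lasso: min 0.5*||Ax-b||^2 + lam*||x||_1, split as x = z.
    After every t inner ADMM steps the penalty rho is doubled (the
    "factor of 2") and the next stage is warm-started from the last
    iterate."""
    n = A.shape[1]
    x = np.zeros(n)
    z = np.zeros(n)
    u = np.zeros(n)                  # scaled dual variable, u = y / rho
    AtA, Atb = A.T @ A, A.T @ b
    rho = rho0
    for _ in range(K):               # outer loop: K stages
        # cache the linear-system factor for this stage's x-update
        x_fact = np.linalg.inv(AtA + rho * np.eye(n))
        for _ in range(t):           # inner loop: t plain ADMM steps
            x = x_fact @ (Atb + rho * (z - u))
            z = soft_threshold(x + u, lam / rho)
            u = u + x - z
        rho *= 2.0                   # double the penalty after each stage
        u /= 2.0                     # keep the unscaled dual y = rho*u unchanged
    return z
```

The warm start is implicit: `x`, `z`, and `u` simply carry over between stages, which is exactly the "loop warm start" structure Reviewer 1 comments on.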