Paper ID: 1462 Title: Bregman Alternating Direction Method of Multipliers
Current Reviews

Submitted by Assigned_Reviewer_21

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper proposes a variant of ADMM with the quadratic penalty of
the augmented Lagrangian replaced by a Bregman divergence. The
authors prove convergence rate results where the L2 distance (of
the standard ADMM guarantee) is replaced by the corresponding
Bregman divergence. As with mirror descent, this can yield an
improvement in the dimension-dependent constants of the convergence
bounds. The authors then demonstrate the utility of their algorithm
on synthetic instances of a specific optimization problem (the mass
transportation problem).
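
To make the summarized idea concrete, here is a minimal, hypothetical sketch (not the paper's code) of the Bregman divergence that replaces the quadratic penalty: with phi(x) = 0.5||x||^2 it recovers the usual ADMM penalty, and with the negative entropy it gives the KL divergence used in the simplex examples.

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """Bregman divergence B_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

# phi(x) = 0.5 ||x||^2 recovers the usual quadratic ADMM penalty 0.5 ||x - y||^2.
sq = lambda x: 0.5 * (x @ x)
sq_grad = lambda x: x

# phi(x) = sum_i x_i log x_i (negative entropy) gives the KL divergence
# when x and y both lie on the probability simplex.
negent = lambda x: np.sum(x * np.log(x))
negent_grad = lambda x: np.log(x) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])

assert np.isclose(bregman(sq, sq_grad, x, y), 0.5 * np.sum((x - y) ** 2))
kl = np.sum(x * np.log(x / y))  # no mass-correction term since both sum to 1
assert np.isclose(bregman(negent, negent_grad, x, y), kl)
```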

I was not able to check the proofs, but the results look plausible
and are in line with similar results for Bregman-divergence
versions of other first-order methods. The key contribution of the
paper is that it fills a gap in the ADMM literature and helps
unify the analysis of various variants of ADMM. The experimental
section is fairly limited (and a bit extreme), but it is a
reasonable illustration of the gains of Bregman-divergence-based
approaches in constrained optimization settings.

Overall, I think that this paper fills a gap in the theory of
ADMM-type methods. Detailed comments:

* line 182: eta should be replaced by rho_x

* line 185: some of the terms on the right-hand side of Eq. 16 need
to be multiplied by rho_x; also, it might be more convenient to
state the right-hand side in terms of the Bregman divergence B_psi
(since that's how it's applied in examples)

* line 217: the definition of h and \nabla h should include the
multiplier rho

* line 227: "y^t" -> "y_t"

* line 250: some more explanation is needed of why L(G) is
difficult to solve. The number of constraints does not look
prohibitively big.

* line 322: the divergence D is only defined in the appendix

* line 352: it seems that the constraints should be Xe=a and X^T e=b

* page 9: problems with capitalization in the bibliography, e.g.,
"poissonian" and "hungarian"
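
As a quick numerical sanity check on the corrected constraints from the line-352 comment (Xe = a and X^T e = b), the following hypothetical snippet verifies that the independent coupling always satisfies them:

```python
import numpy as np

# Mass transportation polytope: X >= 0 with row sums a and column sums b,
# i.e. X e = a and X^T e = b (e the all-ones vector), assuming sum(a) == sum(b).
a = np.array([0.5, 0.3, 0.2])
b = np.array([0.25, 0.25, 0.5])
e = np.ones(3)

X = np.outer(a, b) / b.sum()  # the independent coupling is always feasible

assert np.allclose(X @ e, a)    # row-sum constraint  X e   = a
assert np.allclose(X.T @ e, b)  # col-sum constraint  X^T e = b
assert np.all(X >= 0)
```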

----
After author response: My review remains unchanged.
This paper fills a gap in the theory of ADMM-type methods.

Submitted by Assigned_Reviewer_33

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper under review considers a generalization of the ADMM algorithm, named BADMM, in which the quadratic penalty term is replaced by Bregman divergence terms. Various
convergence results (on the iterates, and on the objective values in an ergodic sense) are given. Moreover, the performance of BADMM is evaluated on a high-dimensional linear program arising from a mass transportation problem, with promising empirical results.

The problem faced in the paper is interesting and deserves further investigation.
The proof of convergence relies on different tools from the ones typically used for the ADMM algorithm. Nevertheless, I have one main concern about the proof of the main results. It seems to me that the optimality conditions stated in (29) for the Lagrangian function of (1) refer to the unconstrained case (i.e. X=R^{n1} and Z=R^{n2}); indeed, the normal cones of the two sets do not appear in the formulation. The same mistake is made in equations (30)-(31), where the optimality conditions corresponding to (3) and (4) are derived. Then, in Theorem 1, the iterates are shown to converge to a point (x*,z*,y*) satisfying (29). Therefore, it seems to me that the proof of convergence given in this paper holds only for the unconstrained case; convergence of the algorithm for the examples with explicit constraints given in the paper is not, in principle, proved.

The organization of the paper could be improved. In particular, I think that the informal presentation at the beginning of page 4 is hard to follow. For instance, the definition of h as "the problematic term which needs to be linearized" is not completely clear to me (maybe Example 1 could be moved earlier to guide the reader).

1) X and Z should be closed (otherwise the projection may not be well-defined).
2) p. 3, line 111: "since \phi is convex" should be replaced by "since \phi is strictly convex".
3) p. 4: according to the definition of Bregman divergence given at the beginning of the section, the method described here works only if $\varphi_x$ is continuously differentiable and strictly convex (not merely convex, as stated after Eq. (15)).
4) p. 6, line 311: I don't understand what $\tau=O(\rho)$ means here. Since in the previous result $\tau$ is a constant multiplied by $\rho$, this is already satisfied.
5) p. 8, Figure 1(c): I don't clearly see that BADMM is faster here.

The authors stated in their reply that the constraints can be added explicitly and that the proofs also work in this case. I suggest adding them to fix the error.
The paper under review studies the convergence of a generalized ADMM algorithm that includes Bregman divergence terms. The problem is interesting and deserves further investigation, but I think that there is a mistake in the proof of the main result, which in my opinion does not take into account the presence of constraints. According to the author reply, the proofs can be modified to treat the constrained case.

Submitted by Assigned_Reviewer_41

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
In this paper, the authors extend the ordinary ADMM algorithm by replacing the l2-norm penalty with a Bregman divergence term. Their method also covers the generalized ADMM and Bethe-ADMM methods as special cases. The benefit of using the Bregman divergence penalty is that the penalty can incorporate the structure of the problem. For example, when the parameter space is a simplex, using the KL divergence as the penalty term can yield a significant performance improvement over the l2-norm penalty, analogous to how mirror descent generalizes gradient descent.
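
The simplex/KL advantage mentioned here can be illustrated with a small hypothetical sketch: the KL-penalized proximal step has a closed-form multiplicative update that stays on the simplex, with no Euclidean projection needed. The step below is the standard entropic mirror-descent update, shown for illustration rather than as the paper's exact BADMM update.

```python
import numpy as np

def kl_prox_step(x, g, eta):
    """Entropic mirror-descent / KL-proximal step on the simplex:
    argmin_z <g, z> + (1/eta) * KL(z, x) has this closed form."""
    z = x * np.exp(-eta * g)
    return z / z.sum()  # stays strictly inside the simplex; no projection needed

x = np.array([0.25, 0.25, 0.5])
g = np.array([1.0, -1.0, 0.0])   # an arbitrary (sub)gradient
z = kl_prox_step(x, g, eta=0.5)

assert np.isclose(z.sum(), 1.0) and np.all(z > 0)
```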

From an algorithmic point of view, this generalization is meaningful, as it allows the ADMM algorithm to exploit structured parameter spaces. From a theoretical perspective, the paper proves the 1/T convergence rate for BADMM, which is the tight convergence rate for ADMM-type methods. In the experimental section, the authors compare the BADMM algorithm with commercial software and show that the proposed algorithm is faster and more accurate. Nevertheless, the comparison is a little unfair, as the BADMM algorithm uses the power of the GPU while Gurobi doesn't.

The paper is well-written and technically solid. The idea is a little incremental, but it may have practical significance, as solving constrained optimization problems appears to be difficult in large-scale and distributed settings. Overall, this paper presents a nice generalization of ADMM-type methods.
Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank all the reviewers for their detailed and insightful comments. We will fix the typos, update the draft to incorporate the feedback, and address the concerns.

Reviewer21: We will fix the typos.
We will add more explanation for the polytope L(G). The number of constraints is O(nK + |E|K^2), where n is the number of nodes, K is the number of values each node can take, and |E| is the number of edges. For a 1000x1000 grid graph, there will be millions of variables and constraints.
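
The count O(nK + |E|K^2) can be made concrete with a quick back-of-the-envelope computation; the 4-connected grid and K = 2 below are illustrative assumptions, not details from the paper.

```python
# Variable/constraint count for the local polytope L(G) on a grid graph.
side = 1000
n = side * side               # number of nodes
E = 2 * side * (side - 1)     # edges of a 4-connected grid (assumption)
K = 2                         # values per node (illustrative)

node_terms = n * K            # O(nK) node marginals
edge_terms = E * K * K        # O(|E| K^2) edge marginals
total = node_terms + edge_terms
print(total)                  # roughly 10 million even for K = 2
```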

Reviewer33: The augmented Lagrangian (29) is actually a partial augmented Lagrangian, which relaxes only the equality constraints and keeps the other constraints untouched. For any x^{t+1}, z^{t+1} satisfying the constraints, equations (30)-(31) hold. The proof does not rely on the assumption that the variables are unconstrained, e.g., the optimality condition \nabla f(x^*) = 0. The convergence results hold for the constrained case. We will state the constraints in Eqs. (29)-(31) explicitly.

We will improve the exposition and the quality of the figures, as the reviewers suggested.
1) We consider both the constrained and unconstrained cases. The unconstrained case does not require the projection.
2,3) Traditionally, the Bregman divergence is defined for continuously differentiable and strictly convex functions, though there are some extensions of the Bregman divergence. Strict convexity is generally required in convex optimization; otherwise, the problem may have multiple solutions. In most of the literature, the condition is not explicitly stated.
4) In Theorem 1, the constant decreases at an order of O(m^{2/p}). We want to find a \tau = c\rho where c does not depend on O(m^{2/p}).
5) In Figure 1(c), the x-axis is time (log scale). The stopping criterion for both ADMM and BADMM is that the primal-dual residuals are less than 1e-4. Although ADMM's objective value is close, its primal-dual residuals are still larger, so ADMM takes longer than BADMM, as the figure shows. ADMM takes 9 seconds (log) to satisfy the residual threshold of 1e-4, while BADMM takes 8 seconds (log) to achieve the same accuracy.

Reviewer41: We decided to use Gurobi as a baseline since it is highly optimized commercial software, far faster than any implementation available online, and can handle large problems, e.g., a linear program with a quarter billion variables, as in our experiments. However, most methods used in Gurobi cannot be implemented on the GPU, since they either lack massive parallelism or use data structures that are efficient for some algorithms but not suitable for the GPU. In contrast, BADMM provides massive parallelism for the mass transportation problem. Our GPU implementation is still far from highly optimized.