NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:186
Title:Block Coordinate Regularization by Denoising

Reviewer 1

A recent trend in large scale optimization, specially in the machine learning community, was to replace full gradient based algorithm by its coordinate descent counterpart. The idea being to reduce the computational cost of each iteration while enjoying similar rate of convergence. Often, the solution of maximum a posteriori (estimated with a proximal algorithm) is hard to estimate exactly when the prior is not directly available. In that case, the "proximal iteration" is replaced by "Denoised iteration" where the proximal operator of the prior is replaced by another adequate denoising operator. Such algorithm is then based on full vector update just as vanilla (proximal) gradient descent. Then, this paper follows the recent trend and make the same coordinate descent moves. Using (same) classical assumptions for establishing rate of convergence of smooth + non-smooth optimization problem, this time replacing non expansiveness of the prox by non expansiveness of the denoiser, and using same proof techniques, the authors obtain similar rates of convergence O(1/t). At the difference that, the fixed point is not explicitly known as a minimizer of some objective function. This seems to be not too surprising and immediately follows from the current results in smooth + nonsmooth convex optimization. Again, just replace the proximal operator by any other non-expansive operator and all the results remains the same (up to explicit characterization of the fixed point) When D is a proximal operator of h, the iterations in Eq. 3 correspond to the two steps 1- a (proximal) smoothing of h 2- a gradient descent on the smooth function {g + smoothed(h)}. With this view, it is natural that line 161: "when the denoiser is a proximal operator of some convex function h, BC-RED is not directly solving (1), but rather its approximation". Indeed, it solve the smoothed version of (1). These strategies are similar to Nesterov smoothing see (Beck and Teboulle, 2012: Smoothing and First Order Methods: A Unified Framework). More precisely, it seems to correspond to the "partial smoothing" method. Small questions: - Usually, proximal iteration on a smooth + nonsmooth strongly convex, enjoys geometric rate of convergence. using a general denoiser, what would be a natural assumption to get the geometric rates? - Usually, convergence analysis of bloc coordinate descent require separability of the regularizer. (If necessary) Can the authors explicitly clarify the equivalent property for a generic denoiser? Small typos: - In proof of Theorem 1, in the second and third equation after line 322, there is some missing ^2. - In the proof of Proposition 8, the function phi can be more than 1-strongly convex eg when h is already alpha-strongly convex.

Reviewer 2

EDIT: My comments about notation and figure were indeed cosmetic, hence their position at the end of the reviews. Other points have been clarified in the rebuttal, I expect the authors to include these clarifications in their manuscript despite the lack of space mentioned. ###################### The proximal gradient algorithm allows to efficiently minimize the sum of a smooth term and a non smooth regularizer. Plug and play and regularization by denoising (RED) approaches mimicks the updates of proximal gradient, using a "denoising function" instead of an explicit prox (for some cases of denoising functions, PnP and RED correspond to proximal gradient). The authors adapt this method to the case where only a subset of the parameter vector is updated at each iteration (generalizing proximal block coordinate descent in some cases) The goal is to benefit from the known speed-ups of BCD compared to full gradient. The assumptions made for convergence analysis in Thm1 sound reasonable (smoothness of the fidelity term, non expansiveness of the denoiser), but Assumption 4 is quite strong in my opinion. Thm 1 is somehow disappointing as it does not chow convergence, but only that some points get close to the zero set of G. I don't think the sentence L220 is correct, as Thm 1 does not show convergence of the iterates. I don't understand how (11) is a *generic* denoiser (L175), since it has the form of a proximal operator. ALso, in Thm 2, why not take tau going to infinity ? It seems that some constant blows up with tau, but this is not specified. The article reads well, although the reference list is quite long and often not cited very specificly, eg [5-14]. Specificities of each paper could be highlighted better in the literature review. Similar approaches have been developped in the imaging community, where the prox step (or its equivalent in ADMM) is replaced by the application of a predefined function (see eg Image deblurring by augmented Lagrangian with BM3D frame prior, Danyelan et al). The connection with these methods from another field is somehow lacking in the literature review. It is not clear for readers not familiar with PnP and RED if the algorithm corresponds to the minimization of a functional, which should be better explained. The seminal work of Paul Tseng on coordinate descent is somehow missing in the 24-27 part of the bibliography. The use of a special font for U and G is confusing in the case of U, as it looks like an union. Can't the authors use normal U ? Fig 1 is not very informative in my opinion: denoising problems are standard and the role of the function D is not particularly detailed. The sentence L17 is not completely correct to me: the true prior can be known (eg the L0 penalization for sparse vectors), but using it is intractable.

Reviewer 3

The paper is very well written, however, from my point of view, the generalisation of the study made in [21] to block coordinate approach is quite straightforward. However, Theorem 2 is a very nice and important result. But it is only a small part of the paper. It is the first time that such a bound is given and it is very useful for the understanding of RED approaches. Experimental section lake of some computation time comparison (not only with respect to the number of iterations), while block coordinate approach can be faster than a batch approach for some problem when the operator A has very correlated columns. ** After feedback ** The authors have clearly answer my concerns. For thm 1, I was convinced that a similar result was in [21], but I was wrong. So I have decided to revise my evaluation.