NeurIPS 2020

Reparameterizing Mirror Descent as Gradient Descent


Review 1

Summary and Contributions: This paper presents a method for using a different gradient scheme for optimization. This scheme, mirror descent, is based on Bregman divergences and can offer a convergence advantage for certain problems. Neural networks are typically trained with gradient descent, which corresponds to the quadratic (squared Euclidean) divergence, and the proposed method could offer an advantage for neural network training. Namely, the authors point out that, for small neural networks, training convergence is provably faster with this method. The authors’ proposal is a general framework for casting mirror descent as gradient descent via reparameterization. If later applied to neural network training, this method could lead to large advantages in learning from data faster and more cheaply on current compute systems.
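For concreteness, a brief recap in standard notation (my own, not the paper's exact formulation): mirror descent replaces the implicit squared Euclidean proximity term of gradient descent with a Bregman divergence D_F induced by a strictly convex function F, and choosing F(w) = (1/2)||w||^2 recovers plain GD.

```latex
% Bregman divergence induced by a strictly convex F, and the mirror descent step;
% F(w) = \tfrac{1}{2}\|w\|^2 gives D_F(w, w_t) = \tfrac{1}{2}\|w - w_t\|^2 and recovers GD.
D_F(w, \tilde w) = F(w) - F(\tilde w) - \nabla F(\tilde w)^\top (w - \tilde w),
\qquad
w_{t+1} = \arg\min_{w} \big\{ \eta\, \nabla L(w_t)^\top w + D_F(w, w_t) \big\}
\;\Longleftrightarrow\;
\nabla F(w_{t+1}) = \nabla F(w_t) - \eta\, \nabla L(w_t).
```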

Strengths: The theoretical grounding is well founded in the optimization literature, and the work is well motivated by the difficulty and expense of training large neural networks. This could be on the path to fundamental breakthroughs in our field that allow us to approach the capacity and efficiency of biological brains. To my knowledge, there is no existing generalized framework such as this, so it is novel work.

Weaknesses: The authors do a good job of listing some of the future directions to which this work could lead. I believe the authors could relate this work more directly to neural networks, as this is a timely subject. In this vein, seeing where the assumptions of casting the problem as mirror descent break down in relation to neural networks would be of high value.

Correctness: To the best of my knowledge, the derivations are correct. However, I am not well equipped to critique each step of them.

Clarity: Generally, a well-written paper. The authors assume the readers are well versed in the background of this work. While I understand it is difficult to provide adequate background and include all of the derivations required to motivate the result, I would have appreciated more context around each step. For instance, what is the intuition behind mirror descent and Bregman divergences? Why is using MD rather than GD potentially better? I believe more background is needed to motivate such work, followed by an application to a problem that the field works on directly.

Relation to Prior Work: This is part of a body of work around MD, EGU, and Bregman divergences. To my reading, the appropriate background is cited, but I am not familiar enough with the area to identify missing work. I was able to read some of the references and get a sense of what led to this work.

Reproducibility: Yes

Additional Feedback: This is a theoretical result. There aren’t any experiments to reproduce, but researchers in this field should be able to follow the logic and identify any errors. I have read the authors’ rebuttal and stand by my review; the current review has been updated accordingly.


Review 2

Summary and Contributions: The paper provides a framework to convert a mirror descent algorithm into a gradient descent (GD) algorithm through a reparametrization of the parameter to be learned. The mirror descent considered in the paper is the continuous-time version, i.e., continuous mirror descent (CMD). The paper then presents several examples of CMD and the corresponding GD to demonstrate the idea.
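To make the summary concrete, here is a minimal numerical sketch (my own illustration, not the authors' code) of the kind of equivalence the paper establishes: continuous-time EGU on w, i.e. dw/dt = -eta * w ⊙ ∇L(w), agrees with plain GD on u under the reparameterization w = u^2/4, up to discretization error, so a small step size is used below.

```python
# Minimal sketch, assuming a toy under-determined least-squares loss
# L(w) = 0.5 * ||X w - y||^2; not the authors' code.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))          # 5 examples, 8 parameters (under-determined)
y = rng.normal(size=5)

def grad_L(w):
    # gradient of L(w) = 0.5 * ||X w - y||^2
    return X.T @ (X @ w - y)

eta, steps = 1e-4, 20000
w_egu = np.full(8, 0.5)              # EGU iterate (kept positive by the multiplicative update)
u = 2.0 * np.sqrt(w_egu)             # GD iterate, initialized so that u**2 / 4 == w_egu

for _ in range(steps):
    w_egu = w_egu * np.exp(-eta * grad_L(w_egu))    # discretized unnormalized EG (EGU) update
    u = u - eta * (u / 2.0) * grad_L(u ** 2 / 4.0)  # plain GD on the reparameterized loss L(u**2 / 4)

print(np.max(np.abs(w_egu - u ** 2 / 4.0)))         # small gap -> the two trajectories agree
```

The gap should shrink further as the step size is reduced, reflecting that the equivalence is exact for the continuous-time flows and only approximate after discretization.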

Strengths: The claimed result is clearly stated. The main results are well demonstrated with several examples. Converting CMD to GD is an interesting topic and is of close relevance to the NeurIPS community.

Weaknesses: The writing: The writing of the paper gave me some difficulty in understanding it. There is a lack of definitions: some symbols and expressions are used without being clearly defined, and problems are referred to without being defined. The transitions between parts/sections also require improvement. Please see 'Additional feedback' for more detail.

Correctness: The claim and method look correct to me.

Clarity: There is still room for improvement.

Relation to Prior Work: Yes, it is discussed.

Reproducibility: Yes

Additional Feedback: Suggestions:

Lack of definitions (anything that is not 'common knowledge' should be defined and explained before it is used; readers should not be left to guess):
1. In eq(1), w and L are used without being defined. The authors could first introduce the problem and state that L is the loss (or target function) and w is the model parameter.
2. In Theorem 2 (line 116), the phrase 'coincides with' is unclear. 'Coincides with' is not a commonly used, mathematically rigorous expression, so if the authors want to use it, they should define clearly what it means beforehand. Although I could guess the meaning and eventually understood what Theorem 2 is saying after reading the paper, theorems should be self-contained and their claims mathematically precise.
3. In Section 4.1, the under-determined linear regression problem is not properly defined before being referred to. What are L and w, and the corresponding F and f, in this case?
4. Line 83: what is the meaning of 'independent variable' in this setting?

Connection between sections:
1. The paper discusses the dual form of the problem between lines 87 and 103. When I first read this part, I was not sure why the authors put it there: there is no motivation for it and no connection to the preceding part, which makes the transition abrupt. I recommend adding some motivation before line 87 so that the transition is smooth. It would even be fine to make this part a subsection of its own, with motivation and explanation, since it is important: the equivalence is a special case of the main result, and eq(11) is used in the proof of Theorem 2.

Questions:
1. Why should we care about reparametrizing CMD as GD? What is the benefit (e.g., faster training, better interpretation)?


Review 3

Summary and Contributions: This paper studies a reparameterization method for the continuous-time mirror descent algorithm. The authors first prove that, similar to MD, CMD can be presented as minimizing a tradeoff between the loss and a Bregman momentum. They then provide a general reparameterization theorem, which allows reparameterizing one CMD update as another. Based on this technique, they further show that various CMD updates, including EGU, Burg updates, and EG, can be reparameterized as GD updates. Finally, the authors study tempered updates, which allow interpolating between different updates.
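For reference, the CMD flow that the summary refers to can be written in standard notation as follows (my own recap, not the paper's exact formulation; as I understand it, the Bregman momentum is a rate-of-change quantity built from the Bregman divergence of the link function F):

```latex
% Continuous-time mirror descent (CMD) flow for a strictly convex link F.
% Example: F(w) = \sum_i (w_i \log w_i - w_i) gives \nabla^2 F(w) = \mathrm{diag}(1/w),
% hence the EGU flow \dot w = -\eta\, w \odot \nabla L(w).
\frac{d}{dt}\,\nabla F\big(w(t)\big) = -\eta\, \nabla L\big(w(t)\big)
\;\Longleftrightarrow\;
\dot w(t) = -\eta\, \big(\nabla^2 F(w(t))\big)^{-1} \nabla L\big(w(t)\big)
```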

Strengths: 1. Gradient descent is widely used in real-world applications, while CMD can be more efficient for many different classes of problems. This paper studies how to reparameterize CMD as GD updates, which is an interesting and well-motivated optimization problem. 2. The authors successfully develop a general framework that reveals the connections between different CMD updates, and they show how to reparameterize a variety of CMD methods as GD updates.

Weaknesses: In Section 2, the authors show that CMD can be presented as minimizing a tradeoff between the loss and a Bregman momentum. Although this is novel and very interesting, I am unsure about the significance of this contribution. It would be better if the authors could add more discussion on this point.

Correctness: I have read the main paper and I didn’t find any significant errors.

Clarity: The paper is generally well-written.

Relation to Prior Work: The relation to prior work is clearly discussed.

Reproducibility: Yes

Additional Feedback: