NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 2300
Title: Limitations of the empirical Fisher approximation for natural gradient descent

Reviewer 1

Originality: While the paper does not propose a novel algorithm, it presents an in-depth discussion of the (lack of a) relationship between the EF and F despite their seeming similarity. From a scientific standpoint, I find this kind of contribution, which strengthens the understanding of a family of methods and raises questions for new avenues of research, substantially more valuable than yet another "state-of-the-art" algorithm on some arbitrary benchmark that provides little insight into why it works.

Quality: The write-up is clearly focused, with a thoughtfully chosen set of empirical examples and experiments. I think this is a well-executed paper that clearly warrants publication at NeurIPS.

Clarity: The paper is well written and structured, and overall easy to follow. I appreciate that crucial ideas are re-emphasized throughout the paper, making them easy to keep in mind when relevant, and that existing work is credited clearly.

Significance: The paper raises some overlooked, subtle, but, as demonstrated, important issues. I therefore believe that this is a significant piece of work, even though second-order optimization is perhaps somewhat niche at the moment.

Reviewer 2

Originality: The paper lacks a sound and novel contribution. Theoretically, there is only one minor result, as stated above. Technically, there is no systematic experimental study on real deep networks. The main contribution is a discussion of two different formulations of the Fisher matrix. The key point in making these two formulations different (even though the authors take a sophisticated route through the GGN) is that the so-called empirical Fisher relies on y_n (the target of the neural network output): if one considers y_n to be randomly distributed with fixed variance based on the neural network output, the two formulations are equivalent; otherwise, there is a scale parameter in Eq. (3) whose shrinking, together with damping, makes the two formulations differ. In practice, for applying the exact natural gradient (without approximation) to deep neural networks, there is no reason to use the EF over the Fisher. For example, in the cited "Revisiting natural gradient" paper, the Fisher matrix is given in Eqs. (26), (28), etc., does not depend on the target y_n, and one has no reason to use the EF formulation.

Quality: The experiments are mainly performed on toy examples. It is not clear how these results generalize to real networks. The literature review could be strengthened: the authors use the term "empirical Fisher", which appears in some machine learning papers (or did the authors coin the term? Please make this clear on its first appearance). This should be connected to the observed Fisher information, which is a well-defined concept in statistics. There should be more discussion of information geometry and deep learning, where much of the literature is skipped.

Clarity: The paper is well written, both in English and in math.

Significance: The paper can potentially broaden the audience of natural gradient methods to the deep learning community. However, it has limited significance due to the lack of novel contributions.
Overall, I feel that it is below the bar for standard NeurIPS papers.
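Reviewer 2's central point, that the empirical Fisher depends on the observed targets y_n while the Fisher averages over the model's own predictive distribution, can be made concrete with a small numerical sketch. The logistic-regression setup below is hypothetical and not taken from the paper; it only illustrates the distinction:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 3
X = rng.normal(size=(N, D))
theta = rng.normal(size=D)               # arbitrary parameter (illustrative)
mu = 1.0 / (1.0 + np.exp(-X @ theta))    # model probabilities sigma(theta^T x_n)
y = rng.binomial(1, 0.7, size=N)         # "real" labels, not drawn from the model

# Fisher: expectation over the model's own predictive distribution.
# For logistic regression it has the closed form sum_n mu_n(1 - mu_n) x_n x_n^T,
# which does not depend on the labels at all.
F = (X * (mu * (1 - mu))[:, None]).T @ X

# Empirical Fisher: outer products of gradients at the *observed* labels,
# sum_n (y_n - mu_n)^2 x_n x_n^T -- this does depend on y_n.
EF = (X * ((y - mu) ** 2)[:, None]).T @ X

print(np.linalg.norm(F - EF) / np.linalg.norm(F))
```

With labels drawn from the model itself, E[(y_n - mu_n)^2] = mu_n(1 - mu_n), so the empirical Fisher matches the Fisher in expectation; with mismatched labels, as here, the two generally differ, which is the reviewer's point.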

Reviewer 3

Originality: Such a paper is much needed, as many authors blindly use the empirical FIM in place of other matrices in many applications.

Quality: Section 3.1 would require some clarification.

Clarity: The paper is well written and concise.

Significance: As already said, such a work that critically analyses the empirical FIM is much needed.

Reviewer 4

Strengths
---------
In terms of new material:
- One contribution of the paper is to propose a refined definition of a generalized Gauss-Newton matrix. This could open new convergence analyses, but it is not pursued in the present paper.
- Numerous experiments are presented comparing the empirical Fisher matrix to the true one as preconditioners for optimization on simple models. These experiments clearly support the use of the true Fisher matrix. The code seems very well written and reusable.

The real strength of the paper is pedagogical:
- It is very well written and illustrated (see Fig. 2 or 4, for example).
- It has even more value given that a lot of previous authors did not make any distinction and made numerous false statements.

Weaknesses
----------
1. Except for the new definition of the generalized Gauss-Newton matrix (which is not pursued), no other proposition in the paper is original.
2. As the authors point out themselves, analyzing the EF as a variance adaptation method would have explained its efficiency and strengthened the paper: "This perspective on the empirical Fisher is currently not well studied. Of course, there are obvious difficulties ahead:" Overcoming these difficulties is what a research paper is about, not only discussing them.
3. The main point of the paper lies in Section 3.2. This requires clear and sound propositions, such as: for a well-specified model and a consistent estimator, the empirical Fisher matrix converges to the Hessian at a rate ... It is claimed to be specified in Appendix C.3, but there seems to be a referencing problem in the paper. This would highlight both the reasoning of previous papers and the difference with the actual approximation made here.

Minor comments
--------------
Typos:
- Eq. 5: no square for the gradient of a_n
- Eq. 8: the subscript theta should be under p, not log
- Replace the occurrences of Appendix A with Appendix C

Conclusion
----------
Overall, I think this is a good lecture on natural gradient and its subtleties, yet not a research paper, since almost no new results are demonstrated. Yet, if the choice has to be made between another paper that uses the empirical Fisher and this one that explains it, I'll advocate for this paper. Therefore I tend to marginally accept this paper, though I think its place is in lecture notes (in fact, Martens' long review of natural gradient [New insights and perspectives on the natural gradient method, Martens 2014] should incorporate it; that is where this paper should be, in my opinion).

After discussion
----------------
After the discussion, I increased my score. I don't think that it is a top paper, as it does not have new results, but it should clearly be accepted, as it would be much more helpful than "another state-of-the-art technique for deep learning" with some misleading approximations like ADAM. Note that though refining the definition of a generalized Gauss-Newton matrix seems to be a detail, I think it could have real potential for further analysis in optimization.
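Reviewer 4's third weakness asks for a precise statement that, for a well-specified model, the empirical Fisher converges to the Hessian. The claim can be sketched numerically; the linear-Gaussian setup below is hypothetical and not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 50_000, 3
X = rng.normal(size=(N, D))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + rng.normal(size=N)  # well-specified: unit-variance Gaussian noise

# Hessian of the average negative log-likelihood: (1/N) sum_n x_n x_n^T.
# For this model it coincides exactly with the Fisher (GGN = Fisher = Hessian).
H = (X.T @ X) / N

# Empirical Fisher at the *true* parameter: (1/N) sum_n r_n^2 x_n x_n^T,
# where r_n = y_n - theta^T x_n are the residuals. Since E[r_n^2] = 1 under a
# well-specified model, EF approaches the Hessian as N grows.
r = y - X @ theta_true
EF = (X * (r ** 2)[:, None]).T @ X / N

print(np.linalg.norm(EF - H) / np.linalg.norm(H))  # should be small for large N
```

The observed relative gap shrinks roughly like 1/sqrt(N), which is the kind of rate statement the reviewer asks the authors to make explicit.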