NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 6877
Title:MetaInit: Initializing learning by learning to initialize

Reviewer 1

Update: The authors have addressed my questions. I hope the camera-ready includes a clear discussion of Taylor expansion vs. finite difference (at least in the appendix). I also second the other reviewer on the importance of comparing in the batch norm case, since the method is meant as a general-purpose initialization scheme.

Longer summary:
- The authors introduce the GradientDeviation criterion, which characterizes how much the gradient changes after a gradient step. Simple and avoids the full Hessian (+)
- They use meta-learning to learn the scale of the initialization such that GradientDeviation is minimized.
- They claim that meta-learning the scale is mostly architecture-dependent and can be done with random data and random labels, without the need for a specific dataset (+)
- They compare with other initialization schemes (DeltaOrthogonal, LSUV) and with batch norm. The method beats the other schemes and is competitive with batch norm (+)

Shortcomings:
- Can be problematic for very large models (-)
- Can only learn the scale of the parameters (though from experience I think the actual distribution does not matter too much anyway for independent initializations)

The authors claim their approach is more general than analytical approaches, which do not account for nonlinearity.

Originality: Combines two lines of work: learning to initialize and developing analytic initializations (e.g. Xavier, Kaiming).

Quality: Enough experiments; the paper is well written. The related work section seems a bit thin.

Clarity: The paper is very well written and easy to read.

Significance: Their "GradientDeviation" criterion can be reused and explored by others. Their method can be combined with usual initialization schemes.
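To make the criterion above concrete, here is a minimal toy sketch of a GradientDeviation-style quantity, computed by finite difference on a quadratic loss. The function name `gradient_deviation`, the relative-change formulation, and the step size `eta` are my illustrative assumptions, not the paper's exact definitions:

```python
import numpy as np

# Toy quadratic loss L(w) = 0.5 * w^T A w, whose gradient is A @ w.
# A GradientDeviation-style criterion (hypothetical formulation): the mean
# relative change of the gradient after one SGD step, measured by finite
# difference rather than by forming the Hessian.

def gradient(A, w):
    return A @ w

def gradient_deviation(A, w, eta=0.1, eps=1e-8):
    """Mean relative change of the gradient after one SGD step of size eta."""
    g1 = gradient(A, w)                  # gradient at the initial point
    g2 = gradient(A, w - eta * g1)       # gradient one step ahead
    return np.mean(np.abs(g2 / (g1 + eps) - 1.0))

A = np.diag([1.0, 10.0])                 # curvature differs per coordinate
w = np.array([1.0, 1.0])
print(gradient_deviation(A, w))          # larger when curvature along the step is high
```

Minimizing such a quantity with respect to per-layer scales (rather than computing it on a fixed quadratic) is the meta-learning step the review describes.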

Reviewer 2

They propose a novel algorithm that automatically finds a good initialization for a neural network via meta-learning, even without specific data (i.e., in a domain-independent way). To do that, they propose a new metric called gradient deviation, which measures the scale of the one-step-ahead gradient, similar to MAML. They assume that the gradient of good initial parameters is less affected by the curvature near them. I think it's nice to bring the idea of meta-learning into learning the initial parameters of a model and the objective function. But the evidence that the method works is lacking, in the sense that there is no theoretical justification for the experimental results.

Q1. I'm not an expert in optimization theory, so I can't fully confirm whether the hypothesis you made is valid. Is there any theory showing that a flat surface at the initial point leads to a better local minimum? Even without a theoretical proof, there should be experimental support that gradient deviation is a metric of good initialization (i.e., gradient deviation vs. final performance for various initializations of the model).

Q2. I don't understand the protocols you used in the experiments. Why do you have to remove skip connections and batch normalization layers? Is it natural to compare random init (or other known initialization methods) vs. meta-learned init?

Reviewer 3

This is a clear, well-written paper on using meta-learning to learn better initial parameters for training deep neural networks. It uses automatic differentiation to optimize a ratio of the magnitude of the gradient change in order to learn a scale value for the initial parameters of different layers. While it is a simple algorithm, Figure 2 is very interesting. However, why not also show results for non-random data? The paper also mentions that it operates in a 'data-agnostic' fashion; what advantages does being data-agnostic bring? If the algorithm is truly data-agnostic, are there results on how the scale learned for a network transfers to multiple datasets? The paper makes a statement regarding 'less curvy' starting regions. However, the loss in Equation 2 only looks at the magnitude of the gradient change and not necessarily the curvature, due to the absolute value taken.
- What happens if MetaInit is combined with batch norm?
- Can you show a training-error plot as a function of update iterations? That would be helpful for comparing MetaInit vs. baseline optimization.
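On the curvature question raised above: a first-order Taylor expansion gives g(w - eta*g) ≈ g - eta*H@g, so the one-step gradient change is (to first order) a Hessian-gradient product and does carry curvature information along the gradient direction, even if the absolute value discards its sign. A minimal numerical check of this identity on a quadratic, where it holds exactly (my own sketch, not code from the paper):

```python
import numpy as np

# For a quadratic loss L(w) = 0.5 * w^T H w, the gradient is H @ w, so
# g(w - eta*g) - g(w) = -eta * H @ g exactly: the one-step gradient
# change is precisely a (scaled) Hessian-gradient product.

H = np.array([[3.0, 1.0],
              [1.0, 2.0]])              # hypothetical Hessian
def grad(w):
    return H @ w

w = np.array([1.0, -1.0])               # arbitrary starting point
eta = 0.01
g1 = grad(w)
g2 = grad(w - eta * g1)                 # gradient one step ahead
print(np.allclose(g2 - g1, -eta * H @ g1))  # True: change equals -eta*H@g
```

For non-quadratic losses the identity holds only to first order in eta, which is where the Taylor-expansion-vs.-finite-difference discussion requested by Reviewer 1 becomes relevant.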