NeurIPS 2019
Sun, Dec 8th through Sat, Dec 14th, 2019 at the Vancouver Convention Center
Paper ID: 8583
Title: Learning to Optimize in Swarms

Reviewer 1

### Overall opinion

* While this method is useful for certain non-convex problems, I'm not sure if NeurIPS is a suitable publication venue for this paper.

### Clarity

* The motivation for each feature is well laid out.
* The overall motivation is clear, and the design of the model reflects this.

### Originality

* To the best of my knowledge, this is the first work to propose a meta-optimizer for population-based optimization.
* The entropy term that encourages exploration seems novel, at least in the context of optimizer learning.

### Significance

* The model performs well on a variety of benchmarks.
* While this paper seems to be a useful contribution to highly non-convex optimization problems such as protein docking, it is a fairly straightforward combination of optimizer learning and population-based optimization.

------------------------ post-rebuttal ----------------------------

The author response and other reviews have convinced me of the significance of this work; I change my score from 5 to 6.

Reviewer 2

This paper introduces a new meta-learning algorithm that combines population-based and point-based optimization. While population-based approaches have been very popular on very rugged landscapes, current meta-learning methods are point-based and thus not suitable for optimizing such functions. This work presents two contributions. (1) A new architecture for population-based meta-learning. This architecture, while more complicated, can be summarized as follows: each particle is described by a set of four features (gradient, momentum, velocity, and attraction), and an attention mechanism is applied to those features together with the hidden state. The outputs of this attention mechanism for all particles are fed into an inter-particle attention module together with a similarity matrix. The output of the inter-particle attention mechanism constitutes the input of the LSTM learner, which outputs the update to the optimized parameters. (2) The addition of a differential entropy term to the meta-loss that balances exploration and exploitation during the optimization process.

This paper tackles the important problem of extending current meta-learning algorithms to take advantage of population-based training, which is necessary in extremely non-convex problems. I consider the contributions of this work novel, especially the proposed architecture. While the work is well motivated, the paper lacks clarity. More specifically, it leaves so many important components to the supplementary material that it is impossible to fully understand the approach without reading it. The paper could be restructured so that all of it fits in the 8-page limit without compromising readability. Along these lines, my main concerns are:

* How is P(x* | D_t) defined? It seems a crucial part of your contribution, but the paper lacks its definition (besides citing Cao and Shen, 2019).
* The model architecture should be more clearly explained in the main section. A key component of your approach is the attention mechanism; it seems crucial to me that you explain in the main text how it works. Right now, your main contribution is explained in just a paragraph.
* Section 4.1 heavily discusses plots and results from the supplementary material. Those results are interesting and important; they should be included in the main body.
* Figure 3 (c): the definitions of Q and M should be in the main text; otherwise it is impossible to interpret what that plot means without looking at the supplementary material.
* Section 4.4's results are entirely in the supplementary material. Those results are interesting and important; they should be included in the main body.

Regarding the experimental evaluation, the paper would highly benefit from an ablation study. This work presents an architecture that consists of many parts, but it is not clear which parts have significant effects. Regarding baselines, an important one would be to run DM_LSTM from k different initializations and pick the best. This would show whether your method merely benefits from having k independent runs or there is actually a benefit in the attention mechanisms (given that the particle interdependence is not high).
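To make the pipeline I summarized above concrete (per-particle features, intra-particle attention conditioned on the hidden state, inter-particle attention with a similarity matrix, then a recurrent cell emitting the parameter update), here is a schematic NumPy sketch. This is a reviewer's reading aid, not the authors' implementation: the dot-product attention scores, the position-based similarity matrix, and the plain tanh cell standing in for the LSTM are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# k particles in a d-dimensional search space, each with 4 features
# (gradient, momentum, velocity, attraction) -- random stand-ins here.
k, d = 8, 2
features = rng.normal(size=(k, 4, d))   # per-particle feature stack
hidden = rng.normal(size=(k, d))        # stand-in per-particle RNN hidden state
positions = rng.normal(size=(k, d))     # current particle positions

# (1) Intra-particle attention: weight the 4 features of each particle,
#     conditioning on its hidden state (dot-product scores as a stand-in).
scores = np.einsum('kfd,kd->kf', features, hidden)          # (k, 4)
intra = np.einsum('kf,kfd->kd', softmax(scores), features)  # (k, d)

# (2) Inter-particle attention: mix the particle summaries using a
#     similarity matrix over particle positions.
sim = positions @ positions.T                               # (k, k)
inter = softmax(sim) @ intra                                # (k, d)

# (3) Recurrent update: a plain tanh cell standing in for the LSTM,
#     producing the increment applied to each particle's parameters.
W = rng.normal(size=(d, d)) * 0.1
delta = np.tanh(inter @ W + hidden)                         # (k, d) update
positions_next = positions + delta
```

Even this toy version makes the ablation question concrete: zeroing out `sim` off-diagonals reduces step (2) to k independent point-based optimizers, which is exactly the comparison the DM_LSTM-with-k-restarts baseline would probe.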

Reviewer 3

The authors did an outstanding job of addressing "where to learn," "how to learn," and "what more to learn," and clearly established the novelty of their method. Their work opens the door to solving more sophisticated optimization problems using L2L. The uncertainty-aware loss fits the goal of the exploration-exploitation tradeoff, which has been widely studied in Bayesian optimization and RL (but not so much yet in L2L). I also like the intra- and inter-particle attention modules, which add explainability as to whether particles are working collaboratively or independently. A clear conclusion can be drawn from their Rastrigin experiments: population-based L2L outperforms gradient-based meta-optimizers, including that of Andrychowicz et al. (2016). The meta-optimizer then outperformed a very recent state-of-the-art method (Cao and Shen, 2019) on the real protein docking application.