Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Originality: This is the first approach I know of that attempts to train the network architecture independently of the weight values.

Significance: The relationship to neuroscience is somewhat remote, since synapses in the brain do have different connection strengths. The significance of the empirical results is rather low, as the resulting network architectures look rather shallow and the tasks involved seem solvable with shallow networks trained with other evolution-like algorithms.

Clarity: The language is quite good. However, many technical details are not spelled out. I think the description of the algorithm (l. 123 to 129) is not sufficient to understand it.

Quality: I think it is unfortunate that the authors do not analyze their training algorithm in more detail in comparison to other approaches, or to variants of their own algorithm.

-------- Second review --------

I am glad that the positive reviews of the other reviewers got me to look at the paper again. After another reading, I find that the originality of this work should be valued much more highly than in my first review. It is still hard for me to know whether the reported performance is trivial or satisfactory as a proof of concept. Despite this, I am changing the "overall score" of my review. One big improvement is that the authors promise to add pseudocode in the supplements; I hope that the details of the algorithm will make it more accessible. I also think it would be great to discuss somewhere what is new in this algorithm in comparison with the NEAT framework.
Originality: This paper draws from many fields (especially neural architecture search), but its core is a unique and powerfully original idea.

Quality: The execution of this idea by the authors is thorough and the results are compelling. The scholarship evident in this work is also exemplary.

Clarity: This paper is extremely well written and easy to follow. One small exception, however, was how the authors discussed their use of a single shared weight across all the connections in the network. At the beginning of the paper their stated ambition to use a single weight was confusing: why limit the network so much? In Section 3 they explain this choice clearly and convincingly: the dimensionality of the full weight space is too large to explore effectively, so they test only a 1-D manifold within that space. But after reading the continuous control results section, it actually seemed like even this 1-D weight manifold was over-parameterized: for nearly every problem and network, performance was either good with a large absolute shared weight regardless of sign, or good with one particular sign of large shared weight (e.g. the bipedal walker network, which only performs well with large negative weights). This suggests that the same results could be obtained without randomizing the shared weight, instead using a single fixed value throughout training (e.g. 1). Besides the ensemble classifier discussed in the last section, is there a strong advantage to using a shared random weight rather than a shared fixed weight? Does it make the networks more easily fine-tuned when the individual weights are trained? It would help the paper somewhat if the authors clarified the importance of and reasoning behind using a single shared weight early on (i.e. in the intro).

Significance: This work is highly significant: it introduces a new realm of problems (weight-agnostic neural networks) that have strong analogues in biological systems.
This work opens many new avenues for research and is sure to be widely cited.
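The 1-D weight manifold discussed above can be illustrated with a minimal sketch. Everything here (the toy topology, task, and function names) is invented for illustration and is not from the paper: a fixed wiring pattern is held constant while a single shared weight, used on every connection, is swept over a range of values.

```python
import numpy as np

def forward(x, w, topology):
    # Every connection shares the same weight w; only the wiring
    # (the binary masks in `topology`) distinguishes architectures.
    h = x
    for mask in topology:
        h = np.tanh((mask * w) @ h)
    return h

def score(w, topology, inputs, targets):
    # Higher is better: negative mean squared error on a toy task.
    preds = np.array([forward(x, w, topology) for x in inputs])
    return -np.mean((preds - targets) ** 2)

# A toy two-layer wiring and the shared-weight sweep the paper performs:
topology = [np.array([[1, 1], [0, 1]]), np.array([[1, 1]])]
inputs = [np.array([0.5, -0.5]), np.array([-1.0, 1.0])]
targets = np.array([[0.3], [-0.2]])
sweep = {w: score(w, topology, inputs, targets)
         for w in (-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)}
```

Incidentally, with tanh units and an even number of layers this toy network's output is symmetric in the sign of w, echoing the review's observation that performance sometimes depends only on the magnitude of the shared weight.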
Review update: Thanks for your response. I have read it and I still like the paper. Nice work!

Summary: The paper uses architecture search based on genetic algorithms to find architectures for simple reinforcement learning tasks and MNIST that have only a single shared weight parameter. The authors demonstrate that these networks perform competitively with hand-picked, trained architectures.

Strengths and weaknesses: The paper contributes to the recent body of work on architecture search and explores the inductive bias induced by the network architecture in a very extreme case, where the network has only a single parameter. Given that most networks currently use standardized architectural components and focus heavily on weight training, this paper offers a refreshing view of the different aspects responsible for a good inductive bias. In particular, it is in line with a recent report (https://arxiv.org/abs/1904.01569) demonstrating that the graph architecture class (i.e. random connectivity vs. small-world, etc.) seems to have a relatively larger influence on performance than the weights of the network. Apart from a few minor comments, the paper is clearly written and represents a significant and original contribution to this line of research.

I think the paper has two main weaknesses:
1. The strategy is obviously only applicable to very small problems/networks. A demonstration that networks composed of repeating/recursive simple network motifs can solve more complex problems such as CIFAR would strengthen the paper.
2. Apart from the demonstration that a single weight parameter is enough, the paper offers little insight into the quantification of inductive biases by architecture. For instance, how many networks perform well? Do well-performing networks differ strongly (in their components and the class of functions they implement)?

Minor comments:
- Is there a reason why you don't use a delete operation?
- Table 1 doesn't say what the number is.
I guess it is reward, but it would be good to say so explicitly.
- A bit of information on how the VAE for MNIST was trained would make the paper more self-contained.
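The kind of search loop this review summarizes (genetic architecture search with grow-only mutation operators) might be sketched schematically as below. This is not the authors' algorithm; `score`, `mutate`, and the toy target wiring are all invented for illustration.

```python
import random

TARGET = [[1, 0], [1, 1]]  # hypothetical "good" wiring pattern

def score(topology):
    # Toy fitness: how many entries match the target wiring.
    return sum(t == g for row_t, row_g in zip(topology, TARGET)
               for t, g in zip(row_t, row_g))

def mutate(topology):
    # Grow-only mutation: add a connection, never delete one
    # (mirroring the absence of a delete operator the review asks about).
    t = [row[:] for row in topology]
    i = random.randrange(len(t))
    t[i][random.randrange(len(t[i]))] = 1
    return t

def search(init, generations=30, population=8, seed=0):
    random.seed(seed)
    pop = [init] * population
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        survivors = pop[: population // 2]  # truncation selection
        pop = survivors + [mutate(random.choice(survivors))
                           for _ in range(population - len(survivors))]
    return max(pop, key=score)

best = search([[0, 0], [0, 0]])
```

Because the best individual always survives truncation selection, fitness is monotonically non-decreasing over generations; with grow-only operators, pruning a bad connection is only possible by out-competing it with a fresh lineage, which is one plausible reason a delete operator could matter.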