NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
Edit: Thanks to the authors for addressing my comments and running additional experiments. I have consequently increased my score.

The paper proposes to decompose the parameters into L distinct parameter blocks. Each of these blocks is seen as solving a "pseudo-task", learning a linear map from inputs to outputs. The parameters of these blocks are generated by K hypermodules (small hypernetworks) that condition on a context vector for each pseudo-task. The alignment of hypermodules to pseudo-tasks is governed by a softmax function and learned during training, similar to a mixture of experts. By sampling hypermodules in proportion to their usage, more general modules are used more often. The approach is evaluated on a synthetic dataset (modeling linear regression tasks) and on a setting that involves three cross-modal tasks: object recognition on CIFAR-10, language modeling on WikiText-2, and CRISPR binding prediction.

The approach combines hypernetworks and mixture-of-experts in an interesting way to learn to capture information that can be shared across different tasks. It is interesting to see that some hypermodules become general and are used by many pseudo-tasks.

My main concerns about this submission regard its evaluation and the lack of analysis of model components. The main experiment of the paper is the evaluation on cross-modal multi-task learning using a vision, a text, and a DNA task. However, based on the provided scores, the actual performance of the model remains unclear, as the results are a) not competitive and b) not compared to any multi-task learning baselines.

a) is most apparent for language modeling (which I am most familiar with), where the state of the art on WikiText-2 is around 39 perplexity (lower is better) and reasonably tuned models from 2017 (https://arxiv.org/abs/1708.02182) achieve around 65 perplexity, compared to the 128 reported here. For CIFAR-10, there is a gap (albeit a smaller one) to the results from the WideResNet or Hypernetworks papers that the method uses. Importantly, state-of-the-art performance is not necessary, but in order to make a statement about any gains from the new model, the proposed method should at least be competitive with results from the literature.

Re b), I am missing a comparison against other multi-task learning models. Even if these are found not to be feasible in the cross-modal setting, a comparison would still be useful to demonstrate how challenging the benchmark is. In addition, the proposed model should be compared to other cross-modal MTL models such as the MultiModel (https://arxiv.org/abs/1706.05137). Alternatively, to get more confidence in the performance of the model, I would have appreciated seeing a comparison on a standard MTL benchmark of a single modality.

Finally, as the model involves many subjective choices regarding the number of pseudo-tasks, the number of hypermodules, the initialization and sampling strategies, etc., I would have liked to see an ablation analysis that studies and identifies the most impactful components of the model.

Overall, even though I like the proposed approach, I do not think it is ready for acceptance at this point, and I think it could be significantly improved through more careful experiments and comparison to prior work.
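For readers trying to follow the mechanism summarized above, a minimal PyTorch sketch might look as follows. The class name, dimensions, and linear hypermodule parameterization are illustrative assumptions rather than the paper's exact design, and the usage-proportional sampling step is omitted in favor of the dense softmax alignment:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HypermoduleBank(nn.Module):
    """Sketch: K small hypernetworks ("hypermodules") generate the weights
    of L pseudo-task parameter blocks, with a learned softmax alignment
    over hypermodules for each block (analogous to a mixture-of-experts gate)."""

    def __init__(self, num_hypermodules, num_blocks, context_dim, block_in, block_out):
        super().__init__()
        # Each hypermodule maps a context vector to a flattened weight block.
        self.hypermodules = nn.ModuleList(
            [nn.Linear(context_dim, block_in * block_out)
             for _ in range(num_hypermodules)]
        )
        # One learned context vector per pseudo-task (parameter block).
        self.contexts = nn.Parameter(torch.randn(num_blocks, context_dim))
        # Alignment logits; a softmax over hypermodules gates each block.
        self.align_logits = nn.Parameter(torch.zeros(num_blocks, num_hypermodules))
        self.block_shape = (block_out, block_in)

    def block_weights(self, block_idx):
        """Generate one block's weight matrix as a softmax-weighted
        combination of the K hypermodules' outputs for its context."""
        ctx = self.contexts[block_idx]
        gate = F.softmax(self.align_logits[block_idx], dim=-1)  # shape (K,)
        w = sum(g * h(ctx) for g, h in zip(gate, self.hypermodules))
        return w.view(self.block_shape)
```

Under these assumptions, `bank.block_weights(0) @ x` would apply block 0's generated linear map to an input `x` of size `block_in`; the "general" hypermodules the review mentions would be those whose gate weights are large across many blocks.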
Reviewer 2
The paper is well written. The problem is well described and formalized. In places the theory is a bit hard to follow, partly because some proofs/explanations are postponed to the supplementary material, but this is of course due to the space limits. The aim of having 'one' deep architecture that can accommodate multiple learning tasks which benefit from reciprocal transfer even beyond their specific differences is challenging and of great interest to the community.
Reviewer 3
The content of this paper is dense at times, but the clarity and writing quality are good for the most part. The main contribution is quite interesting, as the paper proposes a new framework, MUiR, that can share parameters across different tasks even when the architectures used are different. A nice analysis of the algorithm's time complexity is also provided in Section 3. Where the paper falls short for me is in the theoretical and empirical comparisons with prior work. For the synthetic results, MUiR is compared to some multi-task learning baselines. The experiments seem to indicate that other algorithms can achieve similar or better performance. For the cross-modal results, MUiR achieves good performance, but is not compared to any relevant architecture selection models. I still lean towards accept because of the nice framework and algorithmic analysis. However, the paper could be significantly improved by providing some more competitive multi-task learning baselines as part of the cross-modal results for context.