Review for NeurIPS paper: AdaShare: Learning What To Share For Efficient Deep Multi-Task Learning

NeurIPS 2020

Review 1

Summary and Contributions: This work proposes to search for a feature-sharing strategy for multi-task learning. It relies on standard back-propagation to jointly learn the feature-sharing policy and the network weights. It also uses two regularizers to learn a compact network. Experiments are conducted on three standard MTL datasets, and improved results are obtained.

Strengths:
+ Searching for a feature-sharing strategy for multi-task learning is interesting.
+ The related work is appropriately discussed and compared.
+ The performance looks good compared to other methods on MTL benchmarks.
+ Detailed ablation studies are also presented.

Weaknesses:
- The method relies heavily on the ResNet residual block for search. How does this method extend to other network architectures (e.g., MobileNet or Inception)?
- The motivation for the sparsity loss is to make the model more compact. It is not clear that such a strategy can largely affect multi-task learning performance according to Table 5.
- RL-based methods are widely used in many existing NAS methods. It is also not clear to me why the RL-based search method performs much worse than the proposed differentiable method in Table 5.
- I suggest presenting computation cost (FLOPs) in addition to #params to show the efficiency of MTL methods.
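The distinction behind the last point can be illustrated with a small sketch. It assumes a plain, bias-free 2D convolution and counts multiply-accumulates (MACs) as a stand-in for FLOPs; the function name `conv2d_cost` and the layer shapes are illustrative and not taken from the paper.

```python
def conv2d_cost(c_in, c_out, kernel, out_h, out_w):
    """Return (parameter count, multiply-accumulate count) for a bias-free 2D convolution."""
    # Each output channel owns one kernel x kernel filter per input channel.
    params = kernel * kernel * c_in * c_out
    # Every weight is applied once at each output location.
    macs = params * out_h * out_w
    return params, macs


# Two hypothetical layers with identical weights but different feature-map sizes.
early_params, early_macs = conv2d_cost(64, 64, 3, out_h=56, out_w=56)
late_params, late_macs = conv2d_cost(64, 64, 3, out_h=7, out_w=7)

print(early_params == late_params)  # same #params
print(early_macs // late_macs)      # compute differs by the ratio of output areas
```

Because parameter count ignores spatial resolution, two models with the same #params can differ widely in compute, which is why reporting both gives a fuller picture of efficiency.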

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: See above.

Update after the rebuttal: I read the rebuttal and agree that this paper can be accepted.

Review 2

Summary and Contributions: This paper introduces a multi-task scheme that adaptively selects what to share across tasks. That is, it decides which layers should be shared across specific tasks and which layers are best kept task-specific. The sharing-policy parameters are learned jointly with the model parameters. The method outperforms several existing MTL methods on three multi-task learning benchmarks.

Strengths:
- The paper is well written and well organized. I liked Figures 1 and 2, which compare their method with two popular MTL schemes and depict their method in more detail.
- They introduce an MTL method that learns, during training, how tasks are correlated with each other and which layers are best for which tasks. The cost is learning additional (sharing-policy) parameters, which does not introduce much computational overhead. This is a very interesting innovation, as we can learn which tasks are related after training the model. (They explain how tasks are related in their experiments in the Policy Visualization and Task Correlation section.)
- Thorough and extensive experiments, especially the ablation study where they examine the effect of each loss function. I also like the qualitative analysis in Figure 3, where they show how task relatedness is learned after training.

Weaknesses:
- I wonder how easy and straightforward this framework is to apply to MTL tasks. I did not see a discussion of this in the paper (or am I missing it?), nor any discussion of the computational cost.
- Even though their method works well on three MTL benchmarks, I would recommend adding one or two datasets from another domain, such as language.

Correctness: yes

Clarity: yes

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback: I read the rebuttal. I like the paper, and my score is the same.

Review 3

Summary and Contributions: This paper explores the layer dropping/pruning technique in multi-task learning (MTL). It proposes to train a layer-drop policy for each task individually. Through this, each task may choose to skip or execute each layer and in theory the network can learn what to share and what not to share. Overall, this strategy has been shown effective on MTL with a few tasks (<= 5).
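The skip-or-execute mechanism described in this summary can be sketched as follows. This is a toy illustration, assuming scalar "features" and hand-written residual blocks; the names `forward_with_policy` and `shared_blocks`, the task names, and the policies are hypothetical and not the authors' implementation.

```python
from typing import Callable, Dict, List

Block = Callable[[float], float]


def forward_with_policy(x: float, blocks: List[Block], policy: List[int]) -> float:
    """Run residual blocks in order, skipping every block whose policy entry is 0."""
    for block, execute in zip(blocks, policy):
        if execute:
            # Executed block: identity shortcut plus the block's output.
            x = x + block(x)
        # Skipped block: only the identity shortcut remains, so x is unchanged.
    return x


def shared_blocks(policies: Dict[str, List[int]]) -> List[int]:
    """Indices of blocks executed by more than one task."""
    depth = len(next(iter(policies.values())))
    return [i for i in range(depth) if sum(p[i] for p in policies.values()) > 1]


# Toy residual branches standing in for learned layers (assumption).
blocks: List[Block] = [lambda x: 0.5 * x, lambda x: 1.0, lambda x: -0.25 * x]

# One binary execute/skip vector per task (hypothetical values).
policies = {"segmentation": [1, 1, 0], "depth": [1, 0, 1]}

outputs = {task: forward_with_policy(2.0, blocks, p) for task, p in policies.items()}
print(outputs)
print(shared_blocks(policies))
```

In this sketch all tasks draw from the same list of blocks, so a block executed by several tasks is shared, and a block executed by only one task is effectively task-specific.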

Strengths: In general, the paper is easy to follow. It also contains a good number of experiments that clearly show its improvement over baselines. It demonstrates that dropping layers for individual tasks in MTL may mitigate negative transfer.

Weaknesses: Despite its effectiveness, I am concerned about certain aspects, as follows:
(1) The idea of layer dropping is not new. It has been explored for regularization [1] as well as structured pruning [2, 3]. In addition, methods that route subnetworks and learn task-specific parameters for each task [4] have also been studied before.
(2) The three datasets included are similar in nature. It would be more comprehensive to additionally test on a dataset with a larger number of tasks.
(3) It is not very clear what the sources of improvement are for this method and whether it remains effective for larger datasets. First, I think that the method's main advantage is not memory efficiency, since it is at least as large as a standard MTL network (denoted as Multi-Task in the paper). Therefore, the main point should be addressing why it has been so effective. Overall, the results seem to show that there is negative transfer [5] among tasks and thus dropping some layers may benefit certain tasks. This is particularly true given that 'Random #2' performs well. Some unanswered questions then include:
(a) How is the number of layers dropped for each task correlated with the performance gain?
(b) How does the method compare with adding task-specific adapter layers?
(c) How does it compare to learning a stochastic depth pattern for each task (using the expected value instead of pruning)?
(d) What would happen with a higher task-to-layer ratio (20 tasks vs. 18 layers, etc.)?
(e) What role would pre-training play here? Will it improve or degrade AdaShare?
(f) What about other components (e.g., channels)? Can we do something similar?

[1] Deep networks with stochastic depth. ECCV 2016.
[2] Reducing transformer depth on demand with structured dropout. ICLR 2020.
[3] Data-driven sparse structure selection for deep neural networks. ECCV 2018.
[4] Adversarial examples improve image recognition. CVPR 2020.
[5] Characterizing and avoiding negative transfer. CVPR 2019.

Correctness: The method is generally correct.

Clarity: The paper is well written with good visualizations.

Relation to Prior Work: Some citations are missing (see above). I recommend adding some discussion on how this work differs from task-specific stochastic depth and structured pruning, and its relation to addressing negative transfer.

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: The paper introduces a method for multi-task learning with an adaptive sharing approach called AdaShare. This method differs from current soft-parameter-sharing methods in that it takes resource efficiency into consideration instead of learning a task-specific network per task. It also differs from hard parameter sharing, as it has no hand-set network split for each task. The paper uses Gumbel-Softmax sampling to jointly learn the execution paths per task. The contributions are a differentiable approach for adaptive feature sharing that does not require any reinforcement learning, two new loss terms to balance learning new features against sharing features, and ablation experiments on some well-known datasets.
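For readers unfamiliar with the sampling step mentioned in this summary, the following is a minimal standard-library sketch of Gumbel-Softmax sampling for a two-way execute/skip decision. It illustrates the general technique only; the parameterization by log-probabilities, the temperatures, and the 0.75/0.25 policy are assumptions and are not taken from the paper.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Draw one relaxed (soft) one-hot sample from the categorical distribution softmax(logits)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise by inverse transform; clamp u away from 0 to avoid log(0).
        u = max(rng.random(), 1e-12)
        noisy.append(logit - math.log(-math.log(u)))
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [v / temperature for v in noisy]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical policy for one (task, block) pair: P(execute) = 0.75, P(skip) = 0.25.
logits = [math.log(0.75), math.log(0.25)]

# Same seed means same noise, so only the temperature differs between these two samples.
soft = gumbel_softmax(logits, temperature=5.0, rng=random.Random(1))
sharp = gumbel_softmax(logits, temperature=0.1, rng=random.Random(1))

# The larger entry of each sample follows the original distribution (Gumbel-max trick).
rng = random.Random(0)
draws = 20000
executed = sum(1 for _ in range(draws) if gumbel_softmax(logits, 1.0, rng)[0] > 0.5)
execute_rate = executed / draws
print(soft, sharp, execute_rate)
```

Lower temperatures push the sample toward a one-hot vector, while the output stays a smooth function of the logits for any positive temperature, which is what allows the decision to be trained with ordinary back-propagation rather than reinforcement learning.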

Strengths: The method is very interesting. It is simple and clear and makes sense. I like the emphasis on efficiency. This seems like a nice method that has advantages of both some soft-parameter-sharing methods and some hard-parameter-sharing methods by being adaptable but also efficient in memory and computation. I especially like the paragraph at line 28 arguing for more flexible multi-task learning that remains computationally and memory efficient. I also think the added loss terms (sparsity loss and sharing loss) in Equations 3 and 4 are a nice addition. They make a lot of sense to have, and I haven't seen their exact like before in this type of work. I like Figure 3 and the interesting relationships it shows. I think that's an excellent addition to the paper and worth having for others to note/cite in future works. Despite the weaknesses of only one architecture, only related tasks, and some small clarity issues, I think the paper is well written, and I think the concept is novel and clever. I would put my score at 7.5, so I'm rounding up to an 8.

Weaknesses: The method learns a policy distribution for a feature-sharing pattern but then is optimized on the full training set. The effect of this is not clear. It would be simpler and cleaner if the network didn't need several stages of training. The curriculum learning method does make sense, but nowhere in the paper is there an ablation study showing this, except for the line in Table 5, which is only one experiment. It doesn't seem like curriculum learning has that large an effect on this specific example. Furthermore, I don't believe the random policies are that meaningful a comparison in Table 5. Random is a very low baseline. It's unusual that Random #2 does so well. The biggest weakness I see is that this is only shown for one network architecture and only very related tasks. It would be interesting to see some more diverse and different tasks. Fig. 3B at least shows that some of these are more related than others, so perhaps an ablation study could show the architectures found for the 2 closest tasks compared to the 2 furthest tasks. I think it would be interesting to show that the closer tasks have more shared features. In terms of only one architecture, the paper only shows a large ResNet-style encoder, which is interesting since the paper argues for more memory efficiency. It would be good to show this also works on a smaller architecture like MobileNetV3, etc.

Correctness: The type of surface normals for NYUv2 are not shown or cited appropriately. There are several different surface normals produced for this dataset. Which did the authors use? (Ladicky, Eigen, Hickson, etc. all have different surface normals produced for this dataset)

Clarity:
- Lines 23-24: Cumbersome and hard-to-read sentence.
- Lines 80-81: The method attempts to minimize negative interference. None of the experiments prove that it does this. I see the theory and argument, but I think the language is important here. Nothing is shown that proves this.
- Line 154 introduces P_T but doesn't define it. Perhaps it's better to define it earlier, in Equation 1?
- Line 153: remove "Clearly". It's understandable, but I wouldn't call that clear. Best to avoid language like that and "it's trivial that…", etc.
- Following that, in line 156, there is no variable that decides where to split the distribution into one-hot during training. Is a probability greater than 0.5 mapped to 1 and anything less to 0? This is not made clear in the paper.
- Lines 207-210: why are the citations in a different format, with conference/year?
- In Equation 6, what is STL?
- Table 3: Sluice/T3 is missing a +/-.

Relation to Prior Work: This paper has a very similar title (Adashare) to: Kong, Linghe, et al. "AdaSharing: Adaptive data sharing in collaborative robots." IEEE Transactions on Industrial Electronics 64.12 (2017): 9569-9579. It’s slightly different but at least something to consider in case the authors decide to change the name to be more distinct. Otherwise the relationship to previous work is well established.

Reproducibility: Yes

Additional Feedback: Many of the citations refer to arXiv versions instead of the correct published versions. Please take the time to correct these to the published versions. Google Scholar sometimes has these wrong. Ex: [1] Ahn, Chanho, Eunwoo Kim, and Songhwai Oh. "Deep elastic networks with model selection for multi-task learning." Proceedings of the IEEE International Conference on Computer Vision. 2019.