{"title": "Pareto Multi-Task Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 12060, "page_last": 12070, "abstract": "Multi-task learning is a powerful method for solving multiple correlated tasks simultaneously. However, it is often impossible to find one single solution to optimize all the tasks, since different tasks might conflict with each other. Recently, a novel method is proposed to find one single Pareto optimal solution with good trade-off among different tasks by casting multi-task learning as multiobjective optimization. In this paper, we generalize this idea and propose a novel Pareto multi-task learning algorithm (Pareto MTL) to find a set of well-distributed Pareto solutions which can represent different trade-offs among different tasks. The proposed algorithm first formulates a multi-task learning problem as a multiobjective optimization problem, and then decomposes the multiobjective optimization problem into a set of constrained subproblems with different trade-off preferences. By solving these subproblems in parallel, Pareto MTL can find a set of well-representative Pareto optimal solutions with different trade-off among all tasks. Practitioners can easily select their preferred solution from these Pareto solutions, or use different trade-off solutions for different situations. Experimental results confirm that the proposed algorithm can generate well-representative solutions and outperform some state-of-the-art algorithms on many multi-task learning applications.", "full_text": "Pareto Multi-Task Learning\n\nXi Lin1, Hui-Ling Zhen1, Zhenhua Li2, Qingfu Zhang1, Sam Kwong1\n\n1City University of Hong Kong, 2Nanjing University of Aeronautics and Astronautics\n\n{qingfu.zhang, cssamk}@cityu.edu.hk\n\nxi.lin@my.cityu.edu.hk,\n\nhuilzhen@um.cityu.edu.hk,\n\nzhenhua.li@nuaa.edu.cn\n\nAbstract\n\nMulti-task learning is a powerful method for solving multiple correlated tasks simul-\ntaneously. 
However, it is often impossible to find one single solution that optimizes all the tasks, since different tasks might conflict with each other. Recently, a novel method was proposed to find one single Pareto optimal solution with a good trade-off among different tasks by casting multi-task learning as multiobjective optimization. In this paper, we generalize this idea and propose a novel Pareto multi-task learning algorithm (Pareto MTL) to find a set of well-distributed Pareto solutions which can represent different trade-offs among different tasks. The proposed algorithm first formulates a multi-task learning problem as a multiobjective optimization problem, and then decomposes the multiobjective optimization problem into a set of constrained subproblems with different trade-off preferences. By solving these subproblems in parallel, Pareto MTL can find a set of well-representative Pareto optimal solutions with different trade-offs among all tasks. Practitioners can easily select their preferred solution from these Pareto solutions, or use different trade-off solutions for different situations. Experimental results confirm that the proposed algorithm can generate well-representative solutions and outperform some state-of-the-art algorithms on many multi-task learning applications.\n\n1 Introduction\n\nMulti-task learning (MTL) [1], which aims at learning multiple correlated tasks at the same time, is a popular research topic in the machine learning community. By solving multiple related tasks together, MTL can further improve the performance of each task and reduce the inference time for conducting all the tasks in many real-world applications. 
Many MTL approaches have been proposed in the past, and they have achieved strong performance in many areas such as computer vision [2], natural language processing [3] and speech recognition [4].\n\nMost MTL approaches are proposed for finding one single solution to improve the overall performance of all tasks [5, 6]. However, it is observed in many applications that some tasks could conflict with each other, and no single optimal solution can optimize the performance of all tasks at the same time [7]. In real-world applications, MTL practitioners have to make a trade-off among different tasks, such as in self-driving cars [8], AI assistants [9] and network architecture search [10, 11].\n\nFigure 1: Pareto MTL can find a set of widely distributed Pareto solutions with different trade-offs for a given MTL problem. Then the practitioners can easily select their preferred solution(s).\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nHow to combine different tasks together and make a proper trade-off among them is a difficult problem. In many MTL applications, especially those using deep multi-task neural networks, all tasks are first combined into a single surrogate task via linear weighted scalarization. A set of fixed weights, which reflects the practitioners' preference, is assigned to these tasks. Then the single surrogate task is optimized. Setting proper weights for different tasks is not easy and usually requires an exhaustive weight search. In fact, no single solution can achieve the best performance on all tasks at the same time if some tasks conflict with each other.\n\nRecently, Sener and Koltun [12] formulated a multi-task learning problem as a multi-objective optimization problem in a novel way. They proposed an efficient algorithm to find one Pareto optimal solution among different tasks for a MTL problem. 
However, the MTL problem can have many (even an infinite number of) optimal trade-offs among its tasks, and the single solution obtained by this method might not always satisfy the MTL practitioners' needs.\n\nIn this paper, we generalize the multi-objective optimization idea [12] and propose a novel Pareto Multi-Task Learning (Pareto MTL) algorithm to generate a set of well-representative Pareto solutions for a given MTL problem. As shown in Fig. 1, MTL practitioners can easily select their preferred solution(s) among the set of obtained Pareto optimal solutions with different trade-offs, rather than exhaustively searching for a set of proper weights for all tasks.\n\nThe main contributions of this paper are:1\n\n\u2022 We propose a novel method to decompose a MTL problem into multiple subproblems with different preferences. By solving these subproblems in parallel, we can obtain a set of well-distributed Pareto optimal solutions with different trade-offs for the original MTL.\n\u2022 We show that the proposed Pareto MTL can be reformulated as a linear scalarization approach to solve MTL with dynamically adaptive weights. We also propose a scalable optimization algorithm to solve all constrained subproblems with different preferences.\n\u2022 Experimental results confirm that the proposed Pareto MTL algorithm can successfully find a set of well-representative solutions for different MTL applications.\n\n2 Related Work\n\nMulti-task learning (MTL) algorithms aim at improving the performance of multiple related tasks by learning them at the same time. These algorithms often construct shared parameter representations to combine multiple tasks. They have been applied in many machine learning areas. 
However, most MTL algorithms mainly focus on constructing shared representations rather than making trade-offs among multiple tasks [5, 6].\n\nLinear task scalarization, together with grid search or random search of the weight vectors, is the current default practice when a MTL practitioner wants to obtain a set of different trade-off solutions. This approach is straightforward but could be extremely inefficient. Some recent works [7, 13] show that a single run of an algorithm with well-designed weight adaptation can outperform the random search approach with more than one hundred runs. These adaptive weight methods focus on balancing all tasks during the optimization process and are not suitable for finding different trade-off solutions.\n\nMulti-objective optimization [14] aims at finding a set of Pareto solutions with different trade-offs rather than one single solution. It has been used in many machine learning applications such as reinforcement learning [15], Bayesian optimization [16, 17, 18] and neural architecture search [10, 19]. In these applications, the gradient information is usually not available. Population-based and gradient-free multi-objective evolutionary algorithms [20, 21] are popular methods to find a set of well-distributed Pareto solutions in a single run. However, they cannot be used for solving large-scale gradient-based MTL problems.\n\nMulti-objective gradient descent [22, 23, 24] is an efficient approach for multi-objective optimization when gradient information is available. Sener and Koltun [12] proposed a novel method for solving MTL by treating it as multi-objective optimization. However, similar to the adaptive weight methods, this method tries to balance different tasks during the optimization process and does not have a systematic way to incorporate trade-off preference. 
In this paper, we generalize it for finding a set of well-representative Pareto solutions with different trade-offs among tasks for MTL problems.\n\n1The code is available at: https://github.com/Xi-L/ParetoMTL\n\n(a) Random Linear Scalarization (b) MOO-MTL (c) Pareto MTL (Ours)\n\nFigure 2: The convergence behaviors of different algorithms on a synthetic example. (a) The obtained solutions of random linear scalarization after 100 runs. (b) The obtained solutions of the MOO-MTL [12] method after 10 runs. (c) The obtained solutions of the Pareto MTL method proposed by this paper after 10 runs. The proposed Pareto MTL successfully generates a set of widely distributed Pareto solutions with different trade-offs. Details of the synthetic example can be found in Section 5.\n\n3 Multi-Task Learning as Multi-Objective Optimization\n\n3.1 MTL as Multi-Objective Optimization\n\nA MTL problem involves a set of m correlated tasks with a loss vector:\n\nminθ L(θ) = (L1(θ), L2(θ), ..., Lm(θ))ᵀ,  (1)\n\nwhere Li(θ) is the loss of the i-th task. A MTL algorithm is to optimize all tasks simultaneously by exploiting the shared structure and information among them.\n\nProblem (1) is a multi-objective optimization problem. No single solution can optimize all objectives at the same time. What we can obtain instead is a set of so-called Pareto optimal solutions, which provide different optimal trade-offs among all objectives. We have the following definitions [25]:\n\nPareto dominance. Let θa, θb be two points in Ω; θa is said to dominate θb (θa ≺ θb) if and only if Li(θa) ≤ Li(θb), ∀i ∈ {1, ..., m} and Lj(θa) < Lj(θb), ∃j ∈ {1, ..., m}.\n\nPareto optimality. 
θ∗ is a Pareto optimal point and L(θ∗) is a Pareto optimal objective vector if there does not exist θ̂ ∈ Ω such that θ̂ ≺ θ∗. The set of all Pareto optimal points is called the Pareto set. The image of the Pareto set in the loss space is called the Pareto front.\n\nIn this paper, we focus on finding a set of well-representative Pareto solutions that can approximate the Pareto front. This idea and the comparison results of our proposed method with two others are presented in Fig. 2.\n\n3.2 Linear Scalarization\n\nLinear scalarization is the most commonly-used approach for solving multi-task learning problems. This approach uses a linear weighted sum method to combine the losses of all tasks into a single surrogate loss:\n\nminθ L(θ) = Σ_{i=1}^{m} wiLi(θ),  (2)\n\nwhere wi is the weight for the i-th task. This approach is simple and straightforward, but it has some drawbacks from both multi-task learning and multi-objective optimization perspectives.\n\nIn a typical multi-task learning application, the weights wi need to be assigned manually before optimization, and the overall performance is highly dependent on the assigned weights. Choosing a proper weight vector could be very difficult even for an experienced MTL practitioner who has expertise in the given problem.\n\nSolving a set of linear scalarization problems with different weight assignments is also not a good idea for multi-objective optimization. As pointed out in [26, Chapter 4.7], this method can only provide solutions on the convex part of the Pareto front. The linear scalarization method with different weight assignments is unable to handle a concave Pareto front as shown in Fig. 
2.\n\n3.3 Gradient-based method for multi-objective optimization\n\nMany gradient-based methods have been proposed for solving multi-objective optimization problems [22, 23]. Fliege and Svaiter [24] have proposed a simple gradient-based method, which is a generalization of the single-objective steepest descent algorithm. The update rule of the algorithm is θt+1 = θt + ηdt, where η is the step size and the search direction dt is obtained as follows:\n\n(dt, αt) = argmin_{d∈Rⁿ, α∈R} α + (1/2)‖d‖²,  s.t. ∇Li(θt)ᵀd ≤ α, i = 1, ..., m.  (3)\n\nThe solutions of the above problem will satisfy:\n\nLemma 1 [24]: Let (dt, αt) be the solution of problem (3).\n1. If θt is Pareto critical, then dt = 0 ∈ Rⁿ and αt = 0.\n2. If θt is not Pareto critical, then\n\nαt ≤ −(1/2)‖dt‖² < 0,\n∇Li(θt)ᵀdt ≤ αt, i = 1, ..., m,  (4)\n\nwhere θ is called Pareto critical if no other solution in its neighborhood can have better values in all objective functions. In other words, if dt = 0, no direction can improve the performance for all tasks at the same time. If we want to improve the performance of a specific task, another task's performance will deteriorate (e.g., ∃i, ∇Li(θt)ᵀdt > 0). Therefore, the current solution is a Pareto critical point. When dt ≠ 0, we have ∇Li(θt)ᵀdt < 0, i = 1, ..., m, which means dt is a valid descent direction for all tasks. The current solution should be updated along the obtained direction θt+1 = θt + ηdt.\n\nRecently, Sener and Koltun [12] used the multiple gradient descent algorithm (MGDA) [22] for solving MTL problems and achieved promising results. 
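For intuition, the descent-direction problem (3) with m = 2 tasks has a well-known closed form: the direction is the negated minimum-norm point in the convex hull of the two task gradients. The following is a minimal pure-Python sketch of this two-task case (our own illustration with hypothetical names, not the paper's released implementation):

```python
# Two-task special case of the min-norm common descent direction behind
# problem (3): d = -(lam*g1 + (1-lam)*g2), with lam minimizing the norm.
# Illustrative sketch only; names are not from the paper's code.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def common_descent_direction(g1, g2):
    """Return (d, lam): the negated min-norm convex combination of g1, g2."""
    diff = [x - y for x, y in zip(g1, g2)]            # g1 - g2
    denom = dot(diff, diff)
    if denom == 0.0:                                  # identical gradients
        lam = 0.5
    else:
        # unconstrained minimizer of ||lam*g1 + (1-lam)*g2||^2, clipped to [0, 1]
        lam = min(1.0, max(0.0, dot([y - x for x, y in zip(g1, g2)], g2) / denom))
    d = [-(lam * x + (1.0 - lam) * y) for x, y in zip(g1, g2)]
    return d, lam

# Two fully conflicting task gradients: d = [-0.5, -0.5] with lam = 0.5,
# a descent direction for both tasks.
d, lam = common_descent_direction([1.0, 0.0], [0.0, 1.0])
```

When the gradients conflict, the resulting direction has a negative inner product with both gradients, consistent with Lemma 1.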
However, this method does not have a systematic way to incorporate different trade-off preference information. As shown in Fig. 2, running the algorithm multiple times can only generate some solutions in the middle of the Pareto front on the synthetic example. In this paper, we generalize this method and propose a novel Pareto MTL algorithm to find a set of well-distributed Pareto solutions with different trade-offs among all tasks.\n\n4 Pareto Multi-Task Learning\n\n4.1 MTL Decomposition\n\nWe propose the Pareto Multi-Task Learning (Pareto MTL) algorithm in this section. The main idea of Pareto MTL is to decompose a MTL problem into several constrained multi-objective subproblems with different trade-off preferences among the tasks in the original MTL. By solving these subproblems in parallel, a MTL practitioner can obtain a set of well-representative solutions with different trade-offs.\n\nFigure 3: Pareto MTL decomposes a given MTL problem into several subproblems with a set of preference vectors. Each MTL subproblem aims at finding one Pareto solution in its restricted preference region.\n\nDecomposition-based multi-objective evolutionary algorithms [27, 28], which decompose a multi-objective optimization problem (MOP) into several subproblems and solve them simultaneously, are among the most popular gradient-free multi-objective optimization methods. Our proposed Pareto MTL algorithm generalizes the decomposition idea for solving large-scale and gradient-based MTL.\n\nWe adopt the idea from [29] and decompose the MTL into K subproblems with a set of well-distributed unit preference vectors {u1, u2, ..., uK} in R^m_+. Assuming all objectives in the MOP are non-negative, the multi-objective subproblem corresponding to the preference vector uk is:\n\nminθ L(θ) = (L1(θ), L2(θ), ..., Lm(θ))ᵀ,  s.t. 
L(θ) ∈ Ωk,  (5)\n\nwhere Ωk (k = 1, ..., K) is a sub-region in the objective space:\n\nΩk = {v ∈ R^m_+ | ujᵀv ≤ ukᵀv, ∀j = 1, ..., K},  (6)\n\nand ujᵀv is the inner product between the preference vector uj and a given vector v. That is to say, v ∈ Ωk if and only if v has the smallest acute angle to uk and hence the largest inner product ukᵀv among all K preference vectors.\n\nThe subproblem (5) can be further reformulated as:\n\nminθ L(θ) = (L1(θ), L2(θ), ..., Lm(θ))ᵀ\ns.t. Gj(θ) = (uj − uk)ᵀL(θ) ≤ 0, ∀j = 1, ..., K.  (7)\n\nAs shown in Fig. 3, the preference vectors divide the objective space into different sub-regions. The solution for each subproblem would be attracted by the corresponding preference vector and hence be guided to its representative sub-region. The set of solutions for all subproblems would be in different sub-regions and represent different trade-offs among the tasks.\n\n4.2 Gradient-based Method for Solving Subproblems\n\n4.2.1 Finding the Initial Solution\n\nTo solve the constrained multi-objective subproblem (5) with a gradient-based method, we need to find an initial solution which is feasible or at least satisfies most constraints. For a randomly generated solution θr, one straightforward method is to find a feasible initial solution θ0 which satisfies:\n\nminθ0 ‖θ0 − θr‖²,  s.t. L(θ0) ∈ Ωk.  (8)\n\nHowever, this projection approach is an n-dimensional constrained optimization problem [30]. It is inefficient to solve this problem directly, especially for a deep neural network with millions of parameters. 
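The sub-region test in Eq. (6) reduces to picking the preference vector with the largest inner product against the loss vector. A minimal sketch, with illustrative names of our own rather than the paper's code:

```python
# Sub-region assignment from Eq. (6): a loss vector v lies in Omega_k when
# the preference vector u_k attains the largest inner product u_k . v.
import math

def assign_subregion(v, prefs):
    """Return the index k of the sub-region Omega_k containing v."""
    scores = [sum(ui * vi for ui, vi in zip(u, v)) for u in prefs]
    return max(range(len(prefs)), key=lambda k: scores[k])

# Three unit preference vectors for a 2-task problem:
prefs = [(1.0, 0.0), (math.sqrt(2) / 2, math.sqrt(2) / 2), (0.0, 1.0)]
```

For example, the loss vector (0.9, 0.1) falls into the first sub-region, while (0.5, 0.5) falls into the middle one.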
In the proposed Pareto MTL algorithm, we reformulate this problem as an unconstrained optimization problem, and use a sequential gradient-based method to find the initial solution θ0.\n\nFor a solution θr, we define the index set of all activated constraints as I(θr) = {j | Gj(θr) ≥ 0, j = 1, ..., K}. We can find a valid descent direction dr to reduce the value of all activated constraints {Gj(θr) | j ∈ I(θr)} by solving:\n\n(dr, αr) = argmin_{d∈Rⁿ, α∈R} α + (1/2)‖d‖²,  s.t. ∇Gj(θr)ᵀd ≤ α, j ∈ I(θr).  (9)\n\nThis approach is similar to the unconstrained gradient-based method (3), but it reduces the value of all activated constraints. The gradient-based update rule is θr,t+1 = θr,t + ηr dr,t, and it will be stopped once a feasible solution is found or a predefined number of iterations is reached.\n\n4.2.2 Solving the Subproblem\n\nOnce we have an initial solution, we can use a gradient-based method to solve the constrained subproblem. For a constrained multiobjective optimization problem, the Pareto optimality restricted on the feasible region Ωk can be defined as [24]:\n\nRestricted Pareto Optimality. θ∗ is a Pareto optimal point for L(θ) restricted on Ωk if θ∗ ∈ Ωk and there does not exist θ̂ ∈ Ωk such that θ̂ ≺ θ∗.\n\nAccording to [24, 30], we can find a descent direction for this constrained MOP by solving a subproblem similar to the subproblem (3) for the unconstrained case:\n\n(dt, αt) = argmin_{d∈Rⁿ, α∈R} α + (1/2)‖d‖²,  s.t. 
∇Li(θt)ᵀd ≤ α, i = 1, ..., m,\n∇Gj(θt)ᵀd ≤ α, j ∈ Iε(θt),  (10)\n\nwhere Iε(θ) is the index set of all activated constraints:\n\nIε(θ) = {j ∈ I | Gj(θ) ≥ −ε}.  (11)\n\nWe add a small threshold ε to deal with the solutions near the constraint boundary. Similar to the unconstrained case, for a feasible solution θt, by solving problem (10), we either obtain dt = 0 and confirm that θt is a Pareto critical point restricted on Ωk, or obtain dt ≠ 0 as a descent direction for the constrained multi-objective problem (7). In the latter case, if all constraints are inactivated (i.e., Iε(θ) = ∅), dt is a valid descent direction for all tasks. Otherwise, dt is a valid direction to reduce the values of all tasks and all activated constraints.\n\nLemma 2 [30]: Let (dt, αt) be the solution of problem (10).\n1. If θt is Pareto critical restricted on Ωk, then dt = 0 ∈ Rⁿ and αt = 0.\n2. If θt is not Pareto critical restricted on Ωk, then\n\nαt ≤ −(1/2)‖dt‖² < 0,\n∇Li(θt)ᵀdt ≤ αt, i = 1, ..., m,\n∇Gj(θt)ᵀdt ≤ αt, j ∈ Iε(θt).  (12)\n\nTherefore, we can obtain a restricted Pareto critical solution for each subproblem with a simple iterative gradient-based update rule θt+1 = θt + ηdt. By solving all subproblems, we can obtain a set of diverse Pareto critical solutions restricted on different sub-regions, which can represent different trade-offs among all tasks for the original MTL problem.\n\n4.2.3 Scalable Optimization Method\n\nBy solving the constrained optimization problem (10), we can obtain a valid descent direction for each multi-objective constrained subproblem. However, the optimization problem itself does not scale well to high-dimensional decision spaces. 
For example, when training a deep neural network, we often have millions of parameters to be optimized, and solving the constrained optimization problem (10) at this scale would be extremely slow. In this subsection, we propose a scalable optimization method to solve the constrained optimization problem.\n\nInspired by [24], we first rewrite the optimization problem (10) in its dual form. Based on the KKT conditions, we have\n\ndt = −(Σ_{i=1}^{m} λi∇Li(θt) + Σ_{j∈Iε(θ)} βj∇Gj(θt)),  Σ_{i=1}^{m} λi + Σ_{j∈Iε(θ)} βj = 1,  (13)\n\nwhere λi ≥ 0 and βj ≥ 0 are the Lagrange multipliers for the linear inequality constraints. Therefore, the dual problem is:\n\nmax_{λi, βj} −(1/2)‖Σ_{i=1}^{m} λi∇Li(θt) + Σ_{j∈Iε(θ)} βj∇Gj(θt)‖²\ns.t. Σ_{i=1}^{m} λi + Σ_{j∈Iε(θ)} βj = 1,  λi ≥ 0, βj ≥ 0, ∀i = 1, ..., m, ∀j ∈ Iε(θ).  (14)\n\nFor the above problem, the decision space is no longer the parameter space; it becomes the objective and constraint space. For a multiobjective optimization problem with two objective functions and five activated constraints, the dimension of problem (14) is 7, which is significantly smaller than the dimension of problem (10), which could be more than a million.\n\nThe algorithm framework of Pareto MTL is shown in Algorithm 1. All subproblems can be solved in parallel since there is no communication between them during the optimization process. The only preference information for each subproblem is the set of preference vectors. 
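For 2 tasks, a set of evenly distributed unit preference vectors can be generated directly from angles on the positive quadrant; a small sketch (our own naming, not the released code):

```python
# K + 1 evenly spaced unit preference vectors on the positive quadrant for a
# 2-task problem: (cos(k*pi/(2K)), sin(k*pi/(2K))) for k = 0, ..., K.
# Illustrative sketch; not taken from the paper's repository.
import math

def preference_vectors_2d(K):
    return [(math.cos(k * math.pi / (2 * K)), math.sin(k * math.pi / (2 * K)))
            for k in range(K + 1)]

vecs = preference_vectors_2d(4)   # 5 unit vectors sweeping from (1, 0) to (0, 1)
```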
Without any prior knowledge of the MTL problem, a set of evenly distributed unit preference vectors would be a reasonable default choice, such as the K + 1 preference vectors {(cos(kπ/(2K)), sin(kπ/(2K))) | k = 0, 1, ..., K} for 2 tasks. We provide more discussion on preference vector setting and sensitivity analysis of the preference vectors in the supplementary material.\n\nAlgorithm 1 Pareto MTL Algorithm\n1: Input: a set of evenly distributed preference vectors {u1, u2, ..., uK}\n2: Update Rule (all subproblems can be solved in parallel):\n3: for k = 1 to K do\n4:   randomly generate parameters θ(k)r\n5:   find the initial parameters θ(k)0 from θ(k)r using the gradient-based method\n6:   for t = 1 to T do\n7:     obtain λ(k)ti ≥ 0, β(k)tj ≥ 0, ∀i = 1, ..., m, ∀j ∈ Iε(θ) by solving subproblem (14)\n8:     calculate the direction d(k)t = −(Σ_{i=1}^{m} λ(k)ti ∇Li(θ(k)t) + Σ_{j∈Iε(θ)} β(k)tj ∇Gj(θ(k)t))\n9:     update the parameters θ(k)t+1 = θ(k)t + η d(k)t\n10: end for\n11: end for\n12: Output: the set of solutions for all subproblems with different trade-offs {θ(k)T | k = 1, ..., K}\n\n4.3 Pareto MTL as an Adaptive Linear Scalarization Approach\n\nWe have proposed the Pareto MTL algorithm from the multi-objective optimization perspective. In this subsection, we show that the Pareto MTL algorithm can be reformulated as a linear scalarization of tasks with adaptive weight assignment. In this way, we can gain a deeper understanding of the differences between Pareto MTL and other MTL algorithms.\n\nWe first tackle the unconstrained case. Suppose we do not decompose the multi-objective problem and hence remove all constraints from problem (14); it will immediately reduce to the update rule proposed by MGDA [22], which is used in [12]. 
It is straightforward to rewrite the corresponding MTL into a linear scalarization form:\n\nL(θt) = Σ_{i=1}^{m} λiLi(θt),  (15)\n\nwhere we adaptively assign the weights λi by solving the following problem in each iteration:\n\nmax_{λi} −(1/2)‖Σ_{i=1}^{m} λi∇Li(θt)‖²,  s.t. Σ_{i=1}^{m} λi = 1, λi ≥ 0, ∀i = 1, ..., m.  (16)\n\nIn the constrained case, we have extra constraint terms Gj(θt). If Gj(θt) is inactivated, we can ignore it. For an activated Gj(θt), assuming the corresponding reference vector is uk, we have:\n\n∇Gj(θt) = (uj − uk)ᵀ∇L(θt) = Σ_{i=1}^{m} (uji − uki)∇Li(θt).  (17)\n\nSince the gradient direction dt can be written as a linear combination of all ∇Li(θt) and ∇Gj(θt) as in (13), the general Pareto MTL algorithm can be rewritten as:\n\nL(θt) = Σ_{i=1}^{m} αiLi(θt),  where αi = λi + Σ_{j∈Iε(θ)} βj(uji − uki),  (18)\n\nwhere λi and βj are obtained by solving problem (14) with the assigned reference vector uk.\n\nTherefore, although MOO-MTL [12] and Pareto MTL are both derived from multi-objective optimization, they can also be treated as linear MTL scalarization with adaptive weight assignments. Both methods are orthogonal to many existing MTL approaches. 
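As a worked illustration of Eq. (18), the sketch below assembles the adaptive weights αi = λi + Σj βj(uji − uki) from given multipliers. The multiplier values in the example are made up for illustration, not solved from problem (14), and the names are our own:

```python
# Adaptive task weights of Eq. (18); lam is a list of lambda_i, beta maps an
# activated constraint index j to beta_j, prefs holds the preference vectors,
# and k is the subproblem index. Made-up inputs, illustrative names.
import math

def adaptive_weights(lam, beta, prefs, k):
    uk = prefs[k]
    alpha = list(lam)
    for j, bj in beta.items():
        for i in range(len(alpha)):
            alpha[i] += bj * (prefs[j][i] - uk[i])
    return alpha

prefs = [(1.0, 0.0), (math.sqrt(2) / 2, math.sqrt(2) / 2), (0.0, 1.0)]
# Made-up multipliers with one activated constraint (j = 0):
alpha = adaptive_weights([0.4, 0.4], {0: 0.2}, prefs, k=1)
```

Here the activated constraint shifts weight between the tasks so that descent also reduces G0, steering the solution back toward its designated sub-region.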
We provide further discussion on the adaptive weight vectors in the supplementary material.\n\n5 A Synthetic Example\n\nTo better analyze the convergence behavior of the proposed Pareto MTL, we first compare it with two commonly used methods, namely the linear scalarization method and the multiple gradient descent algorithm used in MOO-MTL [12], on a simple synthetic multi-objective optimization problem:\n\nmin_x f1(x) = 1 − exp(−Σ_{i=1}^{d} (xi − 1/√d)²),\nmin_x f2(x) = 1 − exp(−Σ_{i=1}^{d} (xi + 1/√d)²),  (19)\n\nwhere f1(x) and f2(x) are two objective functions to be minimized at the same time and x = (x1, x2, ..., xd) is the d-dimensional decision variable. This problem has a concave Pareto front in the objective space.\n\nThe results obtained by different algorithms are shown in Fig. 2. In this case, the proposed Pareto MTL can successfully find a set of well-distributed Pareto solutions with different trade-offs. Since MOO-MTL tries to balance different tasks during the optimization process, it gets a set of solutions with similar trade-offs in the middle of the Pareto front in multiple runs. It is also interesting to observe that the linear scalarization method can only generate extreme solutions for the concave Pareto front even with 100 runs. This observation is consistent with the theoretical analysis in [26] that the linear scalarization method will miss all concave parts of a Pareto front. 
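The synthetic problem (19) is straightforward to implement; the sketch below uses xi inside the sums, following the definition of x = (x1, ..., xd):

```python
# The two objectives of the synthetic problem (19). At x* = (1/sqrt(d), ...),
# f1 attains its minimum 0 while f2 = 1 - exp(-4), one endpoint of the
# concave Pareto front (and symmetrically for -x*).
import math

def f1(x):
    d = len(x)
    return 1.0 - math.exp(-sum((xi - 1.0 / math.sqrt(d)) ** 2 for xi in x))

def f2(x):
    d = len(x)
    return 1.0 - math.exp(-sum((xi + 1.0 / math.sqrt(d)) ** 2 for xi in x))

d = 10
x_star = [1.0 / math.sqrt(d)] * d   # minimizer of f1
```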
It is evident that fixed linear scalarization is not always a good idea for solving the MTL problem from the multi-objective optimization perspective.\n\n6 Experiments\n\nIn this section, we compare our proposed Pareto MTL algorithm on different MTL problems with the following algorithms: 1) Single Task: the single-task baseline; 2) Grid Search: linear scalarization with fixed weights; 3) GradNorm [13]: gradient normalization; 4) Uncertainty [7]: adaptive weight assignments with uncertainty balance; 5) MOO-MTL [12]: finding one Pareto optimal solution for the multi-objective optimization problem. More experimental results and discussion can be found in the supplementary material.\n\n6.1 Multi-Fashion-MNIST\n\n(a) MultiMNIST (b) MultiFashionMNIST (c) Multi-(Fashion+MNIST)\n\nFigure 4: The results for three experiments with Task 1 & 2 accuracy: our proposed Pareto MTL can successfully find a set of well-distributed solutions with different trade-offs for all experiments, and it significantly outperforms Grid Search, Uncertainty and GradNorm. The MOO-MTL algorithm can also find promising solutions, but their diversity is worse than the solutions generated by Pareto MTL.\n\nIn order to evaluate the performance of Pareto MTL on multi-task learning problems with different task relations, we first conduct experiments on MultiMNIST [31] and two MultiMNIST-like datasets. To construct the MultiMNIST dataset, we randomly pick two images with different digits from the original MNIST dataset [32], and then combine these two images into a new one by putting one digit on the top-left corner and the other one on the bottom-right corner. Each digit can be moved up to 4 pixels in each direction. With the same approach, we can construct a MultiFashionMNIST dataset with overlapping FashionMNIST items [33], and a Multi-(Fashion+MNIST) dataset with overlapping MNIST and FashionMNIST items. 
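A hypothetical sketch of the image-composition step described above; the 36 x 36 canvas size and the per-pixel maximum merge rule are our assumptions, and the actual dataset construction may differ:

```python
# Compose two 28x28 images on one canvas: one toward the top-left, one toward
# the bottom-right, each randomly shifted by up to `max_shift` pixels, merged
# with a per-pixel maximum. Canvas size and merge rule are assumptions.
import random

def overlay(img_a, img_b, size=36, max_shift=4, seed=0):
    rng = random.Random(seed)
    canvas = [[0] * size for _ in range(size)]
    h = len(img_a)
    offsets = [
        (rng.randint(0, max_shift), rng.randint(0, max_shift)),      # top-left item
        (size - h - rng.randint(0, max_shift),
         size - h - rng.randint(0, max_shift)),                      # bottom-right item
    ]
    for img, (oy, ox) in zip((img_a, img_b), offsets):
        for y, row in enumerate(img):
            for x, v in enumerate(row):
                canvas[oy + y][ox + x] = max(canvas[oy + y][ox + x], v)
    return canvas
```

The composed canvas then carries two labels, one per item, giving the two classification tasks described next.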
For each dataset, we have a two-objective MTL problem: to classify the item on the top-left (task 1) and to classify the item on the bottom-right (task 2). We build a LeNet [32] based MTL neural network similar to the one used in [12]. The obtained results are shown in Fig. 4.\n\nIn all experiments, since the tasks conflict with each other, solving each task separately results in a hard-to-beat single-task baseline. Our proposed Pareto MTL algorithm can generate multiple well-distributed Pareto solutions for all experiments, which are competitive with the strong single-task baseline but with different trade-offs among the tasks. The Pareto MTL algorithm achieves the overall best performance among all MTL algorithms. These results confirm that our proposed Pareto MTL can successfully provide a set of well-representative Pareto solutions for a MTL problem.\n\nIt is not surprising to observe that Pareto MTL's solutions for subproblems with extreme preference vectors (e.g., (0, 1) and (1, 0)) always have the best performance in the corresponding task. This is especially true in the Multi-(Fashion+MNIST) experiment, where the two tasks are less correlated with each other. In this problem, almost all MTL solutions are dominated by the strong single-task baseline. However, Pareto MTL can still generate solutions with the best performance for each task separately. 
In this case, Pareto MTL behaves like auxiliary learning, where the task with the assigned preference vector is the main task, and the others are auxiliary tasks.

Pareto MTL uses neural networks with simple hard parameter sharing architectures as the base model for MTL problems. It will be very interesting to generalize Pareto MTL to other advanced soft parameter sharing architectures [5]. Some recently proposed works on task relation learning [34, 35, 36] could also be useful for Pareto MTL to make better trade-offs for less relevant tasks.

6.2 Self-Driving Car: Localization

Method       Reference Vector   Translation (m)   Rotation (°)
Single Task  -                  8.392             2.461
Grid Search  (0.25, 0.75)       9.927             2.177
             (0.5, 0.5)         7.840             2.306
             (0.75, 0.25)       7.585             2.621
GradNorm     -                  7.751             2.287
Uncertainty  -                  7.624             2.263
MOO-MTL      -                  7.909             2.090
Pareto MTL   (0, 1)             7.285             2.335
             (√2/2, √2/2)       7.724             2.156
             (1, 0)             8.411             1.771

Figure 5: The results of the self-localization MTL experiment: our proposed Pareto MTL outperforms the other algorithms and provides solutions with different trade-offs.

We further evaluate Pareto MTL on an autonomous driving self-localization problem [8]. In this experiment, we simultaneously estimate the location and orientation of a camera mounted on a driving car based on the images it takes. We use data from the ApolloScape autonomous driving dataset [37, 38], and focus on the Zpark sample subset. We build a PoseNet with a ResNet18 [39] encoder as the MTL model. The experiment results are shown in Fig. 5.
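Statements like "dominated by the single-task baseline" refer to Pareto dominance: for objectives to be minimized, one solution dominates another if it is no worse on every objective and strictly better on at least one. A small check, applied to two (translation, rotation) error pairs taken from Fig. 5:

```python
def dominates(err_a, err_b):
    """True if solution A Pareto-dominates solution B, where each argument
    is a tuple of objective values to be minimized (here: translation error
    in meters, rotation error in degrees)."""
    no_worse = all(a <= b for a, b in zip(err_a, err_b))
    strictly_better = any(a < b for a, b in zip(err_a, err_b))
    return no_worse and strictly_better

# Pareto MTL with the balanced preference vector vs. GradNorm (Fig. 5):
print(dominates((7.724, 2.156), (7.751, 2.287)))  # True: better on both tasks
# Pareto MTL with (0, 1) vs. Uncertainty: neither dominates the other.
print(dominates((7.285, 2.335), (7.624, 2.263)))  # False: worse on rotation
```

Solutions that survive this pairwise test form the non-dominated set from which a practitioner picks a preferred trade-off.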
Our proposed Pareto MTL generates solutions with different trade-offs and outperforms the other MTL approaches. We provide more experimental results and analysis on finding the initial solution, Pareto MTL with many tasks, and other related discussions in the supplementary material.

7 Conclusion

In this paper, we proposed a novel Pareto Multi-Task Learning (Pareto MTL) algorithm to generate a set of well-distributed Pareto solutions with different trade-offs among tasks for a given multi-task learning (MTL) problem. MTL practitioners can then easily select their preferred solutions among these Pareto solutions. Experimental results confirm that our proposed algorithm outperforms some state-of-the-art MTL algorithms and can successfully find a set of well-representative solutions for different MTL applications.

Acknowledgments

This work was supported by the Natural Science Foundation of China under Grant 61876163 and Grant 61672443, the ANR/RGC Joint Research Scheme sponsored by the Research Grants Council of the Hong Kong Special Administrative Region, China and the France National Research Agency (Project No. A-CityU101/16), and Hong Kong RGC General Research Funds under Grant 9042489 (CityU 11206317) and Grant 9042322 (CityU 11200116).

References

[1] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.

[2] Iasonas Kokkinos. UberNet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6129–6138, 2017.

[3] Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J Pal.
Learning general purpose distributed sentence representations via large scale multi-task learning. In International Conference on Learning Representations, 2018.

[4] Zhen Huang, Jinyu Li, Sabato Marco Siniscalchi, I-Fan Chen, Ji Wu, and Chin-Hui Lee. Rapid adaptation for deep neural networks through multi-task learning. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.

[5] Sebastian Ruder. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098, 2017.

[6] Yu Zhang and Qiang Yang. A survey on multi-task learning. arXiv preprint arXiv:1707.08114, 2017.

[7] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[8] Peng Wang, Ruigang Yang, Binbin Cao, Wei Xu, and Yuanqing Lin. DeLS-3D: Deep localization and segmentation with a 3D semantic map. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5860–5869, 2018.

[9] Jaebok Kim, Gwenn Englebienne, Khiet P. Truong, and Vanessa Evers. Towards speech emotion recognition "in the wild" using aggregated corpora and deep multi-task learning. In 18th Annual Conference of the International Speech Communication Association, pages 1113–1117, 2017.

[10] Jin-Dong Dong, An-Chieh Cheng, Da-Cheng Juan, Wei Wei, and Min Sun. DPP-Net: Device-aware progressive search for Pareto-optimal neural architectures. In Proceedings of the European Conference on Computer Vision (ECCV), pages 517–531, 2018.

[11] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations, 2019.

[12] Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization.
In Advances in Neural Information Processing Systems, pages 525–536, 2018.

[13] Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In Proceedings of the 35th International Conference on Machine Learning, pages 794–803, 2018.

[14] Kaisa Miettinen. Nonlinear Multiobjective Optimization, volume 12. Springer Science & Business Media, 2012.

[15] Kristof Van Moffaert and Ann Nowé. Multi-objective reinforcement learning using sets of Pareto dominating policies. The Journal of Machine Learning Research, 15(1):3483–3512, 2014.

[16] Marcela Zuluaga, Guillaume Sergent, Andreas Krause, and Markus Püschel. Active learning for multi-objective optimization. In International Conference on Machine Learning, pages 462–470, 2013.

[17] Daniel Hernández-Lobato, Jose Hernandez-Lobato, Amar Shah, and Ryan Adams. Predictive entropy search for multi-objective Bayesian optimization. In International Conference on Machine Learning, pages 1492–1501, 2016.

[18] Amar Shah and Zoubin Ghahramani. Pareto frontier learning with expensive correlated objectives. In International Conference on Machine Learning, pages 1919–1927, 2016.

[19] Thomas Elsken, Jan Metzen, and Frank Hutter. Efficient multi-objective neural architecture search via Lamarckian evolution. In International Conference on Learning Representations, 2019.

[20] Eckart Zitzler. Evolutionary Algorithms for Multiobjective Optimization: Methods and Applications, volume 63. Citeseer, 1999.

[21] Kalyanmoy Deb. Multi-Objective Optimization Using Evolutionary Algorithms. John Wiley & Sons, Inc., New York, NY, USA, 2001.

[22] Jean-Antoine Désidéri. Multiple-gradient descent algorithm for multiobjective optimization.
In European Congress on Computational Methods in Applied Sciences and Engineering (ECCOMAS 2012), 2012.

[23] Jörg Fliege and A. Ismael F. Vaz. A method for constrained multiobjective optimization based on SQP techniques. SIAM Journal on Optimization, 26(4):2091–2119, 2016.

[24] Jörg Fliege and Benar Fux Svaiter. Steepest descent methods for multicriteria optimization. Mathematical Methods of Operations Research, 51(3):479–494, 2000.

[25] Eckart Zitzler and Lothar Thiele. Multiobjective evolutionary algorithms: a comparative case study and the strength Pareto approach. IEEE Transactions on Evolutionary Computation, 3(4):257–271, 1999.

[26] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[27] Qingfu Zhang and Hui Li. MOEA/D: A multiobjective evolutionary algorithm based on decomposition. IEEE Transactions on Evolutionary Computation, 11(6):712–731, 2007.

[28] Anupam Trivedi, Dipti Srinivasan, Krishnendu Sanyal, and Abhiroop Ghosh. A survey of multiobjective evolutionary algorithms based on decomposition. IEEE Transactions on Evolutionary Computation, 21(3):440–462, 2016.

[29] Hai-Lin Liu, Fangqing Gu, and Qingfu Zhang. Decomposition of a multiobjective optimization problem into a number of simple multiobjective subproblems. IEEE Transactions on Evolutionary Computation, 18(3):450–455, 2014.

[30] Bennet Gebken, Sebastian Peitz, and Michael Dellnitz. A descent method for equality and inequality constrained multiobjective optimization problems. In Numerical and Evolutionary Optimization, pages 29–61. Springer, 2017.

[31] Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3856–3866, 2017.

[32] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11):2278–2324, 1998.

[33] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

[34] Amir R. Zamir, Alexander Sax, William Shen, Leonidas Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3712–3722, 2018.

[35] Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, and Ed H. Chi. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2018.

[36] Yu Zhang, Ying Wei, and Qiang Yang. Learning to multitask. In Advances in Neural Information Processing Systems, pages 5771–5782, 2018.

[37] Xinyu Huang, Xinjing Cheng, Qichuan Geng, Binbin Cao, Dingfu Zhou, Peng Wang, Yuanqing Lin, and Ruigang Yang. The ApolloScape dataset for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[38] Peng Wang, Xinyu Huang, Xinjing Cheng, Dingfu Zhou, Qichuan Geng, and Ruigang Yang. The ApolloScape open dataset for autonomous driving and its application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.

[39] Alex Kendall, Matthew Grimes, and Roberto Cipolla. PoseNet: A convolutional network for real-time 6-DOF camera relocalization.
In Proceedings of the IEEE International Conference on Computer Vision, pages 2938–2946, 2015.