{"title": "Adaptive Auxiliary Task Weighting for Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 4772, "page_last": 4783, "abstract": "Reinforcement learning is known to be sample inefficient, preventing its application to many real-world problems, especially with high dimensional observations like images. Transferring knowledge from other auxiliary tasks is a powerful tool for improving the learning efficiency. However, the usage of auxiliary tasks has been limited so far due to the difficulty in selecting and combining different auxiliary tasks. In this work, we propose a principled online learning algorithm that dynamically combines different auxiliary tasks to speed up training for reinforcement learning. Our method is based on the idea that auxiliary tasks should provide gradient directions that, in the long term, help to decrease the loss of the main task. We show in various environments that our algorithm can effectively combine a variety of different auxiliary tasks and achieve significant speedup compared to previous heuristic approaches of adapting auxiliary task weights.", "full_text": "Adaptive Auxiliary Task Weighting for Reinforcement Learning

Xingyu Lin*  Harjatin Singh Baweja*  George Kantor  David Held
Robotics Institute, Carnegie Mellon University
{xlin3, harjatis, kantor, dheld}@andrew.cmu.edu

Abstract

Reinforcement learning is known to be sample inefficient, preventing its application to many real-world problems, especially with high dimensional observations like images. Transferring knowledge from other auxiliary tasks is a powerful tool for improving the learning efficiency. However, the usage of auxiliary tasks has been limited so far due to the difficulty in selecting and combining different auxiliary tasks.
In this work, we propose a principled online learning algorithm that dynamically combines different auxiliary tasks to speed up training for reinforcement learning. Our method is based on the idea that auxiliary tasks should provide gradient directions that, in the long term, help to decrease the loss of the main task. We show in various environments that our algorithm can effectively combine a variety of different auxiliary tasks and achieve significant speedup compared to previous heuristic approaches of adapting auxiliary task weights.

1 Introduction

Deep reinforcement learning has enjoyed recent success in domains like games [1, 2], robotic manipulation, and locomotion tasks [3, 4]. However, most of these applications are either limited to simulation or require a large number of samples collected from real-world experiences. For complex tasks, the reinforcement learning algorithm often requires a prohibitively large number of samples to learn the policy [5, 6]. The sample complexity is even worse when learning from image observations, in which more samples are needed to learn a good feature representation.

Transferring knowledge from other tasks can be a powerful tool for learning efficiently. Two types of transfer are often used: representational and functional transfer [7]. In representational transfer, a representation previously learned from other tasks is used for the task at hand. For example, visual features can be taken from pre-trained features of other tasks, such as image classification or depth estimation [8]. However, visual features learned from these static tasks might not be useful features for the decision-making tasks studied in reinforcement learning.
Further, the visual environment encountered by the reinforcement learning agent might have a different appearance from the one on which these visual features were trained, thus limiting the benefit of such transfer.

In functional transfer, multiple tasks sharing the representation are trained jointly. In the context of reinforcement learning, auxiliary tasks are trained jointly with the reinforcement learning task [9, 10, 11], as illustrated in Figure 1. This approach has the advantage that the learned representation will be relevant to the environment in which the agent is operating. Many self-supervised tasks proposed in previous works can be used [9, 11, 12, 13, 14].

The challenge for using auxiliary tasks is to select what set of auxiliary tasks to use and to determine the weighting of the different auxiliary tasks; some auxiliary tasks may be more or less relevant to the reinforcement learning objective. Further, the usefulness of an auxiliary task may change over the course of the learning process: one auxiliary task might be useful for learning a feature representation for reinforcement learning in the beginning of training, but it might no longer be useful later in training.

*Equal contribution

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: An illustration of learning with auxiliary tasks. All of the visual observations and any auxiliary visual information are passed through a shared-weight CNN to get a low-dimensional representation. The representation, along with other auxiliary information, is then used to perform different auxiliary tasks, as well as the main task. A final loss is computed by weighting the main loss and all the auxiliary losses.
Some auxiliary tasks may even slow down reinforcement learning if weighted too strongly. While auxiliary tasks have been commonly used for learning a feature representation, it is still an open question how to determine which auxiliary tasks to use [15]. Previous work either relies on prior knowledge about the tasks [8, 16] or uses a grid search to determine the best weighting parameters, which requires repeatedly learning the same task with different hyper-parameters.

We approach this challenge from a different perspective: instead of pre-determining the importance of each auxiliary task before training begins, we adjust the weight of each task online during the training process. In this way, we exploit the information obtained after actually applying the auxiliary tasks while performing reinforcement learning. Our framework follows the principle that a useful auxiliary task should provide a gradient that points in a good direction such that, in the long term, the loss of the reinforcement learning objective decreases.

In this work, we apply the above principle and propose a simple online learning algorithm that dynamically determines the importance of different auxiliary tasks for the specific task at hand. We show in various environments that our algorithm can effectively combine a variety of different auxiliary tasks and achieve significant speedup compared to previous heuristic approaches of adapting auxiliary task weights.

2 Related Work

2.1 Auxiliary Tasks for Reinforcement Learning

While reinforcement learning with low-dimensional state input can also benefit from auxiliary tasks in partially observable environments [17, 18], auxiliary tasks have been more commonly used for reinforcement learning from images or other high-dimensional sensor readings. The auxiliary tasks can be supervised learning tasks such as depth prediction [19], reward prediction [20], or task-specific prediction [10].
Here we focus on self-supervised auxiliary tasks, whose labels can be acquired without manual annotation. Other works use alternative control tasks as auxiliary tasks, such as maximizing pixel changes or network features [9]. The weights for each of the auxiliary tasks are either manually set using prior intuition or tuned as hyper-parameters by running the full training procedure multiple times. In this work, we propose to determine what auxiliary tasks are useful at each time step of the training procedure and adaptively tune the weights, eliminating the hyper-parameter tuning, which becomes much harder as the number of auxiliary tasks grows.

While we use auxiliary tasks for learning a feature representation, auxiliary tasks can be used in other ways as well. For example, some works use auxiliary tasks for better exploration [21, 22]. These types of methods are orthogonal to the use of the auxiliary tasks in our work.

2.2 Adaptive Weights for Multiple Losses

The closest analogue of auxiliary tasks in the context of supervised learning is multi-task learning (MTL) [7, 15, 23]. Past work has found that, by sharing a representation among related tasks and jointly learning all the tasks, better generalization can be achieved over independently learning each task [24]. While MTL focuses on simultaneously learning multiple tasks, in this work, we only have one main task that we care about, and the auxiliary tasks are only used to help learn the main task. Some works in MTL assume the amount of knowledge to transfer among tasks, i.e., the task relationships, to be known a priori [25] or to be learned from the relationships among task-specific parameters [26, 27]. In comparison, our work aims to adapt the weights of the auxiliary tasks online, learning how much we should transfer from each auxiliary task to the main task by looking at the gradients that the different tasks induce on the shared parameters.
As such, our method could in future work also be applied to MTL problems, by extending it from considering only the transfer from the auxiliary tasks to the main task, to considering the transfer from each task to all the other tasks.

Other works involving multiple losses assume that all auxiliary tasks matter equally and adapt the weights based on the gradient norm [28] or task uncertainty [29]. A similar approach for anytime prediction balances the weights based on the average loss over the previous training time [30]. These methods assume that all of the auxiliary tasks are equally important. However, if we scale up the number of auxiliary tasks, it is highly probable that some tasks will be more useful than others. In contrast, our method evaluates the usefulness of each task online and adapts the weights accordingly, such that the more useful tasks receive a higher weight.

2.3 Learning Meta-parameters with Gradient Descent

An early work first proposed the idea of using online cross-validation with gradient descent to learn meta-parameters, which was also referred to as "adaptive bias" [31]. Some recent works can be viewed as instantiations of this method, where the meta-parameters are the parameters of the return function [32] or of an intrinsic reward function [33]. Our method can be viewed as another variant of online cross-validation, where we treat the auxiliary task weights as meta-parameters and optimize over the weights for online representation learning.

3 Problem Definition

Assume that we have a main task T_main that we want to complete and a set of auxiliary tasks T_aux,i, where i ∈ {1, 2, ..., K}. Each task has a corresponding loss, L_main and L_aux,i. In the context of reinforcement learning, L_main can either be the expected return loss L_π, which is used for calculating the policy gradient, or the Bellman error L_Q, such as in Q-learning.
The losses are functions of all the model parameters θ_t at each training time step t.

Our goal is to optimize the main loss L_main. However, using gradient-based optimization with only the main task gradient ∇_θ L_main is often slow and unstable, due to the high variance of reinforcement learning. Thus, auxiliary tasks are commonly used, especially for image-based tasks, to help learn a good feature representation. We can combine the main loss with the loss from the auxiliary tasks as

    L(θ_t) = L_main(θ_t) + Σ_{i=1}^{K} w_i L_aux,i(θ_t),    (1)

where w_i is the weight for auxiliary task i and θ_t is the set of all model parameters at training step t. We assume that we update the parameters θ_t using gradient descent on this combined objective:

    θ_{t+1} = θ_t − α ∇_{θ_t} L(θ_t).    (2)

If a large number of auxiliary tasks are used, some auxiliary tasks may be more beneficial than others for learning a feature representation for the main task; thus the weights w_i of each auxiliary task (Eqn. 1) need to be tuned. Previous work manually tunes the auxiliary task weights w_i [9]. However, a number of issues arise when we try to scale the number of auxiliary tasks. First, tuning the parameters w_i becomes more computationally intensive as the number of auxiliary tasks K increases. Second, if the values of w_i are learned via hyperparameter optimization, then the reinforcement learning optimization must be run to near-convergence multiple times to determine the optimal values of w_i; ideally the weights would be learned online so that the reinforcement learning optimization only needs to be performed once. Last, the importance of each auxiliary task, and hence the optimal weight w_i, might change throughout the learning process; using a fixed value for w_i might limit the performance.

4 Approach

We propose to dynamically adapt the weights for each auxiliary task.
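For reference, the combined objective of Eqn. 1 and the gradient step of Eqn. 2 can be written in a few lines. The following is a minimal sketch, not the paper's implementation: central finite differences stand in for backpropagation, and all function names are ours.

```python
import numpy as np

def total_loss(theta, main_loss, aux_losses, w):
    # Eqn. 1: L(theta) = L_main(theta) + sum_i w_i * L_aux_i(theta)
    return main_loss(theta) + sum(wi * li(theta) for wi, li in zip(w, aux_losses))

def grad(f, theta, eps=1e-5):
    # Central finite differences; a stand-in for backpropagation in this sketch.
    g = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d[i] = eps
        g[i] = (f(theta + d) - f(theta - d)) / (2 * eps)
    return g

def sgd_step(theta, main_loss, aux_losses, w, alpha=0.1):
    # Eqn. 2: theta <- theta - alpha * grad_theta L(theta)
    return theta - alpha * grad(lambda t: total_loss(t, main_loss, aux_losses, w), theta)
```

Any differentiable main and auxiliary losses can be plugged in; the weights w are held fixed within a single step and adapted separately, as described next.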
Below we describe our approach to doing so. We will first describe an approach that uses a one-step gradient in Section 4.1; we will then extend this framework to an N-step gradient in Section 4.2.

4.1 Local Update from One-step Gradient

In our initial approach, we aim to find the weights for the auxiliary tasks such that L_main decreases the fastest. Specifically, define V_t(w) as the speed at which the main task loss decreases at time step t, where w = [w_1, ..., w_K]^T. We then have

    V_t(w) = dL_main(θ_t)/dt
           ≈ L_main(θ_{t+1}) − L_main(θ_t)
           = L_main(θ_t − α ∇_{θ_t} L(θ_t)) − L_main(θ_t)
           ≈ L_main(θ_t) − α ∇_{θ_t} L_main(θ_t)^T ∇_{θ_t} L(θ_t) − L_main(θ_t)
           = −α ∇_{θ_t} L_main(θ_t)^T ∇_{θ_t} L(θ_t),    (3)

where α is the gradient step size. The first approximation is a finite-difference approximation of the time derivative, with Δt = 1 (where t is the iteration number of the learning process). The second approximation is a first-order Taylor expansion.

To update w, we can simply calculate its gradient:

    ∂V_t(w)/∂w_i = −α ∇_{θ_t} L_main(θ_t)^T ∇_{θ_t} L_aux,i(θ_t),  ∀i = 1, ..., K.    (4)

This leads to an update rule that is based on the dot product between the gradient of the auxiliary task and the gradient of the main task. Intuitively, our approach leverages the online experience to determine whether an auxiliary task has been useful in decreasing the main task loss. The form of this equation resembles recent work which uses a thresholded cosine similarity to determine whether to use each auxiliary task [34]; however, our update rule is a product of our derivation of maximizing the speed at which the main task loss decreases, and we will later show that it outperforms this method.

4.2 N-step Update

The gradient in Eqn.
4 is optimized for the instantaneous update rate of the main task, dL_main(θ_t)/dt, as shown in Equation 3. However, we are not actually concerned with the one-step update of the main task loss L_main(θ_t); rather, we are concerned with the long-term value of L_main(θ_t) after multiple gradient updates. In this section, we extend the method of the previous section to obtain an optimization objective for w that accounts for the performance on L_main(θ_t) in the longer term. Since the loss change in one step does not necessarily reflect the long-term performance, we instead seek to optimize the N-step decrease of the main task loss:

    V^N_t(w) = L_main(θ_{t+N}) − L_main(θ_t).

Exact computation of the gradient with respect to w requires calculating higher-order Jacobians, which can be computationally expensive. We thus adopt a first-order approximation:

    V^N_t(w) = L_main(θ_{t+N}) − L_main(θ_t)
             = L_main(θ_{t+N−1} − α ∇_{θ_{t+N−1}} L(θ_{t+N−1})) − L_main(θ_t)
             ≈ L_main(θ_{t+N−1}) − L_main(θ_t) − α ∇_{θ_{t+N−1}} L_main(θ_{t+N−1})^T ∇_{θ_{t+N−1}} L(θ_{t+N−1})
             ...
             ≈ −α Σ_{j=0}^{N−1} ∇_{θ_{t+j}} L_main(θ_{t+j})^T ∇_{θ_{t+j}} L(θ_{t+j}).    (5)

Next, we want to update w by calculating ∇_w V^N_t(w), which requires differentiating through the optimization process. To avoid this cumbersome computation in an online process, we ignore all the higher-order terms, essentially assuming that a small perturbation of w only affects the immediate next step. With this approximation, we get that, ∀i = 1, ..., K:

    ∇_{w_i} V^N_t(w) ≈ −α Σ_{j=0}^{N−1} ∇_{θ_{t+j}} L_main(θ_{t+j})^T ∇_{θ_{t+j}} L_aux,i(θ_{t+j}).    (6)

We call this approach Online Learning for Auxiliary losses (OL-AUX).
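Under this approximation, the weight update of Eqn. 6 reduces to accumulating dot products between the main-task gradient and each auxiliary-task gradient over the last N parameter updates. A minimal sketch (gradient vectors are assumed to be flattened into 1-D arrays; all names are ours):

```python
import numpy as np

def olaux_weight_update(w, grad_history, alpha, beta):
    """One OL-AUX weight update based on Eqn. 6.

    w:            current auxiliary weights, shape (K,)
    grad_history: N tuples (g_main, [g_aux_1, ..., g_aux_K]) of flattened
                  gradients, one tuple per parameter update since w changed
    """
    dV = np.zeros_like(w)
    for g_main, g_auxs in grad_history:
        for i, g_aux in enumerate(g_auxs):
            # Eqn. 6: each term is -alpha * <grad L_main, grad L_aux_i>
            dV[i] += -alpha * float(g_main @ g_aux)
    # Gradient descent on V^N_t(w): w <- w - beta * dV
    return w - beta * dV
```

An auxiliary task whose gradients align with the main-task gradient has its weight increased; a task whose gradients oppose it has its weight decreased.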
The full algorithm is described in Algorithm 1. As an implementation detail, to balance the norm of the gradient between different losses, we adopt the Adaptive Loss Balancing technique [30] and wrap all the auxiliary task losses inside a log operator. Figure 1 provides an illustration of the pipeline for computing all the individual losses that constitute L(θ_t).

The benefit of the N-step update, compared to the one-step update, comes from two sources. First, as shown in Eqn. 5, the N-step objective considers the long-term effect on the main task loss of updating the weights w, which aligns with our longer-term goal. Second, as shown in Eqn. 6, the N-step method also averages the auxiliary weight gradient over more θ update iterations, which yields a less noisy gradient. Ablation experiments are shown in Section 5 to differentiate between these effects, and we show that both of these effects contribute to our performance.

Algorithm 1 Learning with OL-AUX

Input:
    Main task loss: L_main
    K auxiliary task losses: L_aux,1, ..., L_aux,K
    Horizon N
    Step sizes α, β
Initialize θ_0, w = 1, t = 0
for i = 0 to TrainingEpoch − 1 do
    Collect new data using θ_t
    for j = 0 to UpdateIteration − 1 do
        t ← i · UpdateIteration + j
        Sample a mini-batch from the dataset
        L(θ_t) ← log L_main(θ_t) + Σ_{i=1}^{K} w_i log L_aux,i(θ_t)
        θ_{t+1} ← θ_t − α ∇_{θ_t} L(θ_t)
        if (t + 1) mod N == 0 then
            ∇_{w_i} V^N_{t−N+1}(w) ← −α Σ_{j=0}^{N−1} ∇_{θ_{t−j}} log L_main(θ_{t−j})^T ∇_{θ_{t−j}} log L_aux,i(θ_{t−j})   (based on Eqn. 6)
            w ← w − β ∇_w V^N_{t−N+1}(w)

5 Experiments

In the following experiments, we aim to answer the following questions:

• Can OL-AUX adapt the weights online to optimize for the main task and ignore harmful auxiliary tasks?
• How much can we improve sample efficiency by leveraging a set of diverse auxiliary tasks?
• Is dynamically tuning the weights of the auxiliary tasks important to the achieved sample efficiency, compared to using a fixed set of weights?
• Is it beneficial to adapt the auxiliary task weights based on their longer-term effect, i.e., the N-step update (Section 4.2), compared to the 1-step update (Section 4.1)?

We first answer some of these questions on a simple optimization problem. Then, we empirically evaluate different approaches on three Atari games and three goal-oriented reinforcement learning environments with visual observations, where the issue of sample complexity is exacerbated due to the high-dimensional input.

5.1 Ignore Harmful Auxiliary Tasks

We first show in a simple optimization problem that OL-AUX is able to ignore an adversarial auxiliary task and find the global optimum of the main task. Later we will also show this in more complex and realistic experiments.
In this example, the main task is to find x, y to minimize the loss L_0(x, y) = x² + y². There are two auxiliary losses, L_1(x, y) = (x − 0.5)² + (y − 0.5)² and L_2(x, y) = −L_0(x, y). Clearly L_2 is an adversarial auxiliary task, as its gradient always points in the opposite direction to the optimum of the main task. On the other hand, L_1 is a useful auxiliary task. The baseline we compare to optimizes the total loss L = L_0 + w_1 L_1 + w_2 L_2 with fixed weights w_1 = w_2 = 1, using gradient descent. Our method is OL-AUX as described in Algorithm 1, using N = 1 and without applying the log operator in front of the losses, as we do not need to balance the gradients for this simple example. The results are shown in Figure 2. The fixed-weight baseline converges to a sub-optimal point, as the main task gradient is canceled out by that of the adversarial auxiliary task. On the other hand, OL-AUX finds the optimum of the main task from different starting points. From the auxiliary task weights during training of OL-AUX, we can see that the weight of auxiliary task L_1 first increases, as it helps to learn the main task, and later decreases to focus on the main task. On the other hand, the weight of auxiliary task L_2 quickly decreases from the beginning.

Figure 2: A simple optimization problem to show that our OL-AUX is able to ignore harmful auxiliary tasks. Left and middle show the trajectories of the optimization from four starting points using fixed weights and OL-AUX, respectively. Right shows the loss and auxiliary task weights during training for OL-AUX.

5.2 Auxiliary Tasks

For more complex visual manipulation, we consider the set of auxiliary tasks shown in Table 1.
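The fixed-weight failure and the OL-AUX recovery in the toy study of Section 5.1 are easy to reproduce. The following is a minimal sketch with N = 1 and no log balancing, as in the paper's setup; the step sizes α, β, the step count, and the starting point are our own choices, not the paper's.

```python
import numpy as np

# Toy study of Sec. 5.1: main loss L0 = x^2 + y^2, helpful auxiliary
# L1 = (x - 0.5)^2 + (y - 0.5)^2, adversarial auxiliary L2 = -L0.
def g_main(p):
    return 2.0 * p                        # gradient of L0

def g_aux(p):
    return [2.0 * (p - 0.5), -2.0 * p]    # gradients of L1 and L2

def run(adapt_weights, steps=2000, alpha=0.01, beta=0.5):
    p = np.array([2.0, 2.0])              # one illustrative starting point
    w = np.array([1.0, 1.0])              # w_1 = w_2 = 1 initially
    for _ in range(steps):
        g0 = g_main(p)
        gs = g_aux(p)
        p = p - alpha * (g0 + w[0] * gs[0] + w[1] * gs[1])
        if adapt_weights:
            # OL-AUX with N = 1 (Eqn. 4): w_i <- w_i + beta * alpha * <g0, g_aux_i>
            w = w + beta * alpha * np.array([g0 @ gs[0], g0 @ gs[1]])
    return p, w

p_fixed, _ = run(adapt_weights=False)   # stalls near (0.5, 0.5), where L0 = 0.5
p_olaux, w = run(adapt_weights=True)    # reaches the optimum; w_2 is driven negative
```

With fixed weights, the gradients of L_0 and L_2 cancel exactly, leaving only the pull of L_1 toward (0.5, 0.5); with the adaptive rule, the weight of L_2 decreases monotonically (its gradient always opposes the main gradient), so the descent recovers.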
For a more detailed description of the auxiliary tasks, see Appendix B.

5.3 Base Learning Algorithm and Environments

We evaluate our method in two scenarios: 1) Reinforcement learning with A2C [35] as the base learning algorithm, evaluated on three Atari games [36]: Breakout, Pong and SeaQuest. 2) Goal-conditioned reinforcement learning using DDPG [3] with hindsight experience replay [37], evaluated on three visual robotic manipulation tasks simulated in MuJoCo [38]:

• Visual Fetch Reach (OpenAI Gym [39]). The goal is to move the end effector of a Fetch robot to a randomly sampled 3D location.
• Visual Hand Reach (OpenAI Gym [39]). A target hand pose with the positions of the five fingers of a 24-DOF Shadow hand is randomly sampled from 3D space; the policy is required to control the hand to reach the corresponding positions of all five fingers. The policy outputs motor commands for the 24 DOFs of the hand.
• Visual Finger Turn (DeepMind Control Suite [5]). A policy needs to control a 3-DOF robot finger to rotate a body on an unactuated hinge. The agent receives a positive reward if the body tip coincides with a randomly sampled target location.

Auxiliary Task | Description
Forward Dynamics [12] | Given the current visual observation and action, predict the latent space representation of the next observation.
Inverse Dynamics [12] | Given consecutive image observations, predict the action taken.
Egomotion [13] | Given a raw and a transformed visual observation, predict the transformation applied.
Autoencoder | Reconstruct the visual observation from the latent space representation.
Optical Flow [14] | Given two consecutive visual observations, predict the optical flow.

Table 1: Brief description of the auxiliary losses used.

For all the manipulation environments, the goal is an RGBD image with objects in the desired configuration.
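To make Table 1 concrete, two of the self-supervised losses can be sketched with stand-in linear models. All shapes and model choices below are ours for illustration only; the paper uses CNN encoders over images.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in shapes: 8-d latent observation z, 2-d action a (our choice).
W_fwd = rng.normal(scale=0.1, size=(8, 10))   # forward model: [z; a] -> z'
W_inv = rng.normal(scale=0.1, size=(2, 16))   # inverse model: [z; z'] -> a

def forward_dynamics_loss(z, a, z_next):
    # Predict the latent representation of the next observation (Table 1).
    pred = W_fwd @ np.concatenate([z, a])
    return float(np.mean((pred - z_next) ** 2))

def inverse_dynamics_loss(z, z_next, a):
    # Predict the action taken between consecutive observations (Table 1).
    pred = W_inv @ np.concatenate([z, z_next])
    return float(np.mean((pred - a) ** 2))
```

The labels (the next observation and the executed action) come from the agent's own experience stream, which is what makes these tasks self-supervised.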
We use sparse rewards specified by the L2 distance of the underlying ground-truth state from the goal state. For hindsight experience replay, with probability 0.9 we relabel the original goal with another observation from a future time step of the same episode. More details on the environments and the algorithm used can be found in Appendix A and B.

In both cases, we use Adam as our optimizer. While this creates a discrepancy between our theoretically-derived gradient and the gradient used in practice, we do not find this to be a big issue during our experiments.

5.4 Baselines Compared

We compare the following approaches:

1. No Auxiliary Losses. This is our base learning algorithm without using any auxiliary tasks.
2. Gradient Balancing. This baseline combines all the auxiliary tasks with the same weight of 1 but uses adaptive loss balancing [30] to balance the norm of the gradient for different auxiliary tasks.
3. Cosine Similarity. This baseline combines the gradients from the auxiliary tasks and the main task based on their cosine similarities [34]. Specifically, ∇_θ L_aux,i is added to ∇_θ L_main to update the shared parameters θ only if cos(∇_θ L_main, ∇_θ L_aux,i) ≥ 0.
4. OL-AUX. This is our method as described in Algorithm 1, with N = 5 (OL-AUX-5).
5. In Appendix E, we compare with grid search and heuristic fixed weights on the Visual Hand Reach task.

5.5 Online Learning of Auxiliary Task Weighting

The learning curves of all methods are shown in Figure 3. All experiments are run with five different random seeds, and the shaded region shows the standard deviation across different seeds. We can see that, in all environments, using auxiliary tasks with gradient balancing [30] gives consistent improvement over not using auxiliary tasks.
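The gradient-combination rule of the Cosine Similarity baseline [34] described in Section 5.4 is simple to state in code. A sketch, assuming flattened gradient vectors (function name is ours):

```python
import numpy as np

def cosine_masked_gradient(g_main, g_auxs):
    # Baseline of [34]: add grad L_aux_i to grad L_main only when
    # cos(grad L_main, grad L_aux_i) >= 0, which for nonzero vectors is
    # equivalent to a non-negative dot product.
    g = g_main.copy()
    for g_aux in g_auxs:
        if float(g_main @ g_aux) >= 0.0:
            g += g_aux
    return g
```

Unlike OL-AUX, this rule makes a hard, per-step include/exclude decision and keeps no state about how useful each auxiliary task has been over time.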
By adapting the auxiliary task weights online, our OL-AUX-5 method shows even more improvement and requires fewer than half as many samples to converge in all the environments.

Figure 3: The training curves of our method compared to other baselines. For the manipulation environments, the y-axis shows the percentage of the time that the goal is reached.

For the manipulation environments, we further compare with using only a single auxiliary task along with gradient balancing in Figure 4. We can see that using a single auxiliary task usually gives marginal improvement over the baseline. On the other hand, our method, by leveraging the combination of all auxiliary tasks, always performs better than or as well as the best single auxiliary task. Additionally, we can see that the importance of an auxiliary task depends on the RL task. For example, inverse dynamics is the best single auxiliary task for the Visual Hand Reach environment but does not help much in the Visual Finger Turn environment. Our method is able to exploit the best combination of the auxiliary tasks without prior knowledge about the relationships among the auxiliary tasks, or about the relationship between the auxiliary tasks and the main RL task.

Figure 4: The training curve for our method (which combines multiple auxiliary losses) compared to using each individual auxiliary loss one at a time. The y-axis shows the percentage of the time that the goal is reached.

Figure 5: Change of the weights of the auxiliary tasks during training.

In Figure 5, we show how the weights of all the auxiliary tasks change during training for OL-AUX. Looking at the weight of an auxiliary task alongside the single-auxiliary-task ablation in Figure 4, we can see that they often align. For example, inverse dynamics is the best single auxiliary task in Hand Reach, and it also retains a large weight when combined with other auxiliary tasks.
There are also exceptions: in Finger Turn, optical flow does not work well as a single auxiliary task, but when combined with other auxiliary tasks, it still has the largest weight for a small amount of time; in Hand Reach, optical flow performs well as a single task, but when combined with others, its weight is kept at a relatively low level. This shows that our method is able to take advantage of the auxiliary tasks that best suit the RL task at hand, while taking into consideration the interplay of different auxiliary tasks, without any prior knowledge.

5.6 Auxiliary Task Gradients should Provide Long-term Guidance

Our N-step update method incorporates the idea that the auxiliary tasks should be used to decrease the loss of the main task in the long term. To verify that this long-term reasoning is important, we compare OL-AUX-5 with OL-AUX-1, where the weights are updated every time step (as in Sec. 4.1). For OL-AUX-1, we scale the learning rate β down by a factor of 5 to make a fair comparison, as it updates w more frequently. The results are shown in Figure 6. As shown, OL-AUX-1 performs much worse than OL-AUX-5, providing evidence of the importance of using auxiliary tasks to provide long-term guidance for reinforcement learning.

Figure 6: Learning curves comparing different ablation methods.

However, as discussed in Sec. 4.2, there are two reasons why our method might outperform the OL-AUX-1 baseline: our method takes into account the long-term effect of the auxiliary weight update on the main objective; also, our method averages the auxiliary weight gradient over more updates on θ, which results in a less noisy gradient update. To isolate the influence of each of these effects, we perform another ablation experiment. In this ablation, we perform a one-step gradient update using Eqn. 4, but only update w every N (N = 5) steps.
When updating w, we use N times as much data to compute the gradient, and we use the same learning rate β as OL-AUX-5. This makes the algorithm the same as OL-AUX-5 in terms of update frequency, learning rate, and the amount of data used for the update. The plots from this method are labeled OL-AUX-5 (Ablation) in Figure 6. The gap between OL-AUX-5 and OL-AUX-5 (Ablation) shows the effect of "long-term reasoning", while the gap between OL-AUX-5 (Ablation) and OL-AUX-1 shows the effect of "gradient smoothing". We can see that both factors contribute to the gained improvement in training time.

6 Conclusions

In this work, we have shown that dynamically combining a set of auxiliary tasks can give a significant performance improvement for reinforcement learning from high-dimensional input. Our method adaptively adjusts the weights for the auxiliary tasks in an online manner, showing large improvement over a baseline method that treats each auxiliary task as equally important. Our method uses the idea that auxiliary tasks should provide a gradient update direction that, in the long term, helps to decrease the loss of the main task, showing large improvement over one-step reasoning. The task weights we learn with OL-AUX indicate the optimal amount of knowledge to transfer between the auxiliary tasks and the main task. In future work, OL-AUX could also be extended to multi-task learning. Additionally, it would be interesting to explore whether the task weights we learn with OL-AUX can transfer across different domains.

Acknowledgments

We would like to thank Ben Eysenbach for helpful feedback on the workshop version of the paper, and Wen Sun, Lerrel Pinto, Siddharth Ancha, and Brian Okorn for useful discussions.

This material is based upon work supported by the United States Air Force and DARPA under Contract No. FA8750-18-C-0092, National Science Foundation under Grant No.
IIS-1849154\nand USDA Specialty Crop Research Initiative Ef\ufb01cient Vineyards Project 2015-51181-24393.\n\n9\n\n\fReferences\n[1] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur\nGuez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of\ngo without human knowledge. Nature, 550(7676):354, 2017.\n\n[2] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G\nBellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al.\nHuman-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.\n\n[3] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa,\nDavid Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv\npreprint arXiv:1509.02971, 2015.\n\n[4] Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. Learning\nhand-eye coordination for robotic grasping with deep learning and large-scale data collection.\nThe International Journal of Robotics Research, 37(4-5):421\u2013436, 2018.\n\n[5] Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David\nBudden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite.\narXiv preprint arXiv:1801.00690, 2018.\n\n[6] Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub\nPachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous\nin-hand manipulation. arXiv preprint arXiv:1808.00177, 2018.\n\n[7] Sebastian Thrun and Lorien Pratt. Learning to learn. Springer Science & Business Media,\n\n2012.\n\n[8] Alexander Sax, Bradley Emi, Amir R Zamir, Leonidas Guibas, Silvio Savarese, and Jitendra\nMalik. Mid-level visual representations improve generalization and sample ef\ufb01ciency for\nlearning active tasks. 
arXiv preprint arXiv:1812.11971, 2018.

[9] Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.

[10] Jan Matas, Stephen James, and Andrew J Davison. Sim-to-real reinforcement learning for deformable object manipulation. arXiv preprint arXiv:1806.07851, 2018.

[11] Michelle A Lee, Yuke Zhu, Krishnan Srinivasan, Parth Shah, Silvio Savarese, Li Fei-Fei, Animesh Garg, and Jeannette Bohg. Making sense of vision and touch: Self-supervised learning of multimodal representations for contact-rich tasks. arXiv preprint arXiv:1810.10191, 2018.

[12] Pulkit Agrawal, Ashvin V Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke by poking: Experiential learning of intuitive physics. In Advances in Neural Information Processing Systems, pages 5074–5082, 2016.

[13] Pulkit Agrawal, Joao Carreira, and Jitendra Malik. Learning to see by moving. In Proceedings of the IEEE International Conference on Computer Vision, pages 37–45, 2015.

[14] Vikash Goel, Jameson Weng, and Pascal Poupart. Unsupervised video object segmentation for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 5683–5694, 2018.

[15] Sebastian Ruder. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098, 2017.

[16] Amir R Zamir, Alexander Sax, William Shen, Leonidas J Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3712–3722, 2018.

[17] Long-Ji Lin and Tom M Mitchell. Memory approaches to reinforcement learning in non-Markovian domains. Citeseer, 1992.

[18] Xiujun Li, Lihong Li, Jianfeng Gao, Xiaodong He, Jianshu Chen, Li Deng, and Ji He. 
Recurrent reinforcement learning: a hybrid approach. arXiv preprint arXiv:1509.03044, 2015.

[19] Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, et al. Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673, 2016.

[20] Evan Shelhamer, Parsa Mahmoudieh, Max Argus, and Trevor Darrell. Loss is its own reward: Self-supervision for reinforcement learning. arXiv preprint arXiv:1612.07307, 2016.

[21] Martin Riedmiller, Roland Hafner, Thomas Lampe, Michael Neunert, Jonas Degrave, Tom Van de Wiele, Volodymyr Mnih, Nicolas Heess, and Jost Tobias Springenberg. Learning by playing: solving sparse reward tasks from scratch. arXiv preprint arXiv:1802.10567, 2018.

[22] Richard S Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M Pilarski, Adam White, and Doina Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems - Volume 2, pages 761–768. International Foundation for Autonomous Agents and Multiagent Systems, 2011.

[23] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.

[24] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.

[25] Theodoros Evgeniou, Charles A Micchelli, and Massimiliano Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6(Apr):615–637, 2005.

[26] Yu Zhang and Dit-Yan Yeung. A regularization approach to learning task relationships in multitask learning. ACM Transactions on Knowledge Discovery from Data (TKDD), 8(3):12, 2014.

[27] Mingsheng Long, Zhangjie Cao, Jianmin Wang, and S Yu Philip. Learning multiple tasks with multilinear relationship networks. 
In Advances in Neural Information Processing Systems, pages 1594–1603, 2017.

[28] Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. arXiv preprint arXiv:1711.02257, 2017.

[29] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7482–7491, 2018.

[30] Hanzhang Hu, Debadeepta Dey, Martial Hebert, and J Andrew Bagnell. Learning anytime predictions in neural networks via adaptive loss balancing. arXiv preprint arXiv:1708.06832, 2017.

[31] Richard S Sutton. Adapting bias by gradient descent: An incremental version of delta-bar-delta. In AAAI, pages 171–176, 1992.

[32] Zhongwen Xu, Hado P van Hasselt, and David Silver. Meta-gradient reinforcement learning. In Advances in Neural Information Processing Systems, pages 2396–2407, 2018.

[33] Zeyu Zheng, Junhyuk Oh, and Satinder Singh. On learning intrinsic rewards for policy gradient methods. In Advances in Neural Information Processing Systems, pages 4644–4654, 2018.

[34] Yunshu Du, Wojciech M Czarnecki, Siddhant M Jayakumar, Razvan Pascanu, and Balaji Lakshminarayanan. Adapting auxiliary losses using gradient similarity. arXiv preprint arXiv:1812.02224, 2018.

[35] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.

[36] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. 
Journal of Artificial Intelligence Research, 47:253–279, 2013.

[37] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058, 2017.

[38] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.

[39] Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464, 2018.