{"title": "A Generalized Algorithm for Multi-Objective Reinforcement Learning and Policy Adaptation", "book": "Advances in Neural Information Processing Systems", "page_first": 14636, "page_last": 14647, "abstract": "We introduce a new algorithm for multi-objective reinforcement learning (MORL) with linear preferences, with the goal of enabling few-shot adaptation to new tasks. In MORL, the aim is to learn policies over multiple competing objectives whose relative importance (preferences) is unknown to the agent. While this alleviates dependence on scalar reward design, the expected return of a policy can change significantly with varying preferences, making it challenging to learn a single model to produce optimal policies under different preference conditions. We propose a generalized version of the Bellman equation to learn a single parametric representation for optimal policies over the space of all possible preferences. After an initial learning phase, our agent can execute the optimal policy under any given preference, or automatically infer an underlying preference with very few samples. Experiments across four different domains demonstrate the effectiveness of our approach.", "full_text": "A Generalized Algorithm for Multi-Objective\nReinforcement Learning and Policy Adaptation\n\nRunzhe Yang\n\nDepartment of Computer Science\n\nPrinceton University\n\nrunzhey@cs.princeton.edu\n\nXingyuan Sun\n\nDepartment of Computer Science\n\nPrinceton University\n\nxs5@cs.princeton.edu\n\nKarthik Narasimhan\n\nDepartment of Computer Science\n\nPrinceton University\n\nkarthikn@cs.princeton.edu\n\nAbstract\n\nWe introduce a new algorithm for multi-objective reinforcement learning (MORL)\nwith linear preferences, with the goal of enabling few-shot adaptation to new tasks.\nIn MORL, the aim is to learn policies over multiple competing objectives whose\nrelative importance (preferences) is unknown to the agent. 
While this alleviates\ndependence on scalar reward design, the expected return of a policy can change\nsigni\ufb01cantly with varying preferences, making it challenging to learn a single\nmodel to produce optimal policies under different preference conditions. We\npropose a generalized version of the Bellman equation to learn a single parametric\nrepresentation for optimal policies over the space of all possible preferences. After\nan initial learning phase, our agent can execute the optimal policy under any given\npreference, or automatically infer an underlying preference with very few samples.\nExperiments across four different domains demonstrate the effectiveness of our\napproach.1\n\n1\n\nIntroduction\n\nIn recent years, there has been increased interest\nin the paradigm of multi-objective reinforcement\nlearning (MORL), which deals with learning\ncontrol policies to simultaneously optimize over\nseveral criteria. Compared to traditional RL,\nwhere the aim is to optimize for a scalar reward,\nthe optimal policy in a multi-objective setting\ndepends on the relative preferences among com-\npeting criteria. For example, consider a virtual\nassistant (Figure 1) that can communicate with\na human to perform a speci\ufb01c task (e.g., provide\nweather or navigation information). Depending on the user\u2019s relative preferences between aspects like\nsuccess rate or brevity, the agent might need to follow completely different strategies. If success is all\nthat matters (e.g., providing an accurate weather report), the agent might provide detailed responses\nor ask several follow-up questions. 
On the other hand, if brevity is crucial (e.g., while providing\nturn-by-turn guidance), the agent needs to find the shortest way to complete the task. In traditional\nRL, this is often a fixed choice made by the designer and incorporated into the scalar reward. While\nthis suffices in cases where we know the preferences of a task beforehand, the learned policy is limited\nin its applicability to scenarios with different preferences. The MORL framework provides two\ndistinct advantages \u2013 (1) reduced dependence on scalar reward design to combine different objectives,\nwhich is both a tedious manual task and can lead to unintended consequences [1], and (2) dynamic\nadaptation or transfer to related tasks with different preferences.\n\nFigure 1: Task-oriented dialogue policy learning is a real-life example of unknown linear preference scenario.\nUsers may expect either briefer dialogue or more informative dialogue depending on the task.\n\n1 Code is available at https://github.com/RunzheYang/MORL\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nHowever, learning policies over multiple preferences under the MORL setting has proven to be\nquite challenging, with most prior work using one of two strategies [2]. The first is to convert the\nmulti-objective problem into a single-objective one through various techniques [3, 4, 5, 6] and use\ntraditional RL algorithms. These methods only learn an \u2018average\u2019 policy over the space of preferences\nand cannot be tailored to be optimal for specific preferences. 
The second strategy is to compute a set\nof optimal policies that encompass the entire space of possible preferences in the domain [7, 8, 9].\nThe main drawback of these approaches is their lack of scalability \u2013 the challenge of representing a\nPareto front (or its convex approximation) of optimal policies is handled by learning several individual\npolicies, which can grow signi\ufb01cantly with the size of the domain.\nIn this paper, we propose a novel algorithm for learning a single policy network that is optimized over\nthe entire space of preferences in a domain. This allows our trained model to produce the optimal\npolicy for any user-speci\ufb01ed preference. We tackle two concrete challenges in MORL: (1) provide\ntheoretical convergence results of a multi-objective version of Q-Learning for MORL with linear\npreferences, and (2) demonstrate effective use of deep neural networks to scale MORL to larger\ndomains. Our algorithm is based on two key insights \u2013 (1) the optimality operator for a generalized\nversion of Bellman equation [10] with preferences is a valid contraction, and (2) optimizing for the\nconvex envelope of multi-objective Q-values ensures an ef\ufb01cient alignment between preferences\nand corresponding optimal policies. We use hindsight experience replay [11] to re-use transitions\nfor learning with different sampled preferences and homotopy optimization [12] to ensure tractable\nlearning. 
In addition, we also demonstrate how to use our trained model to automatically infer hidden\npreferences on a new task, when provided with just scalar rewards, through a combination of policy\ngradient and stochastic search over the preference parameters.\nWe perform empirical evaluation on four different domains \u2013 deep sea treasure (a popular MORL\nbenchmark), a fruit tree navigation task, task-oriented dialog, and the video game Super Mario Bros.\nOur experiments demonstrate that our methods significantly outperform competitive baselines on all\ndomains. For instance, our envelope MORL algorithm achieves an % improvement on average user\nutility compared to the scalarized MORL in the dialog task and a 2x average improvement on the\nSuper Mario game with random preferences. We also demonstrate that our agent can reasonably infer\nhidden preferences at test time using very few sampled trajectories.\n\n2 Background\n\nA multi-objective Markov decision process (MOMDP) can be represented by the tuple\n\u27e8S, A, P, r, \u2126, f_\u2126\u27e9 with state space S, action space A, transition distribution P(s\u2032|s, a), vector\nreward function r(s, a), the space of preferences \u2126, and preference functions, e.g., f_\u03c9(r), which\nproduces a scalar utility using preference \u03c9 \u2208 \u2126. In this work, we consider the class of MOMDPs\nwith linear preference functions, i.e., f_\u03c9(r(s, a)) = \u03c9\u22a4r(s, a). We observe that if \u03c9 is fixed to a\nsingle value, this MOMDP collapses into a standard MDP. On the other hand, if we consider all\npossible returns from an MOMDP, we have a Pareto frontier F\u2217 := {\u02c6r | \u2204 \u02c6r\u2032 \u2265 \u02c6r}, where the return\n\u02c6r := \u2211_t \u03b3^t r(s_t, a_t). For all possible preferences in \u2126, we define a convex coverage set (CCS) of\nthe Pareto frontier as:\n\nCCS := {\u02c6r \u2208 F\u2217 | \u2203\u03c9 \u2208 \u2126 s.t. 
\u03c9\u22a4\u02c6r \u2265 \u03c9\u22a4\u02c6r\u2032, \u2200\u02c6r\u2032 \u2208 F\u2217},\n\nwhich contains all returns that provide the maximum cumulative utility. Figure 2 (a) shows an\nexample of CCS and the Pareto frontier. The CCS is a subset of the Pareto frontier (points A to H, and\nK), containing all the solutions on its outer convex boundary (excluding point K). When a specific\nlinear preference \u03c9 is given, the point within the CCS with the largest projection along the direction\nof the relative importance weights will be the optimal solution (Figure 2(b)).\n\nFigure 2: (a) The Pareto frontier may encapsulate local concave parts (points A-H, plus point K), whereas\nCCS is a convex subset of Pareto frontier (points A-H). Point L indicates a non-optimal solution. (b) Linear\npreferences select the optimal solution from CCS with the highest utility, represented by the projection length\nalong preference vector. Arrows are different linear preferences, and points indicate possible returns. Return D\nhas better cumulative utility than return F under the preference in solid line. (c) The scalarized MORL algorithms\n(e.g., [13]) find the optimal solutions at a stage while they are not aligned with preference, e.g., two optimal\nsolutions D and F in the CCS, misaligned with preferences \u03c9_2 and \u03c9_1. The scalarized update cannot use the\ninformation of max_a Q(s, a, \u03c9_1) (corresponding to F) to update the optimal solution aligned with \u03c9_2 or vice\nversa. It only searches along \u03c9_1 direction leading to non-optimal L, even if solution D has been seen under \u03c9_2.\nIt still requires many iterations for the value-preference alignment.\n\nOur goal is to train an agent to recover policies for the entire CCS of MOMDP and then adapt to the\noptimal policy for any given \u03c9 \u2208 \u2126 at test time. We emphasize that we are not solving for a single,\nunknown \u03c9, but instead aim for generalization across the entire space of preferences. Accordingly,\nour MORL setup has two phases:\n\nLearning phase. In this phase, the agent learns a set of optimal policies \u03a0_L corresponding to the\nentire CCS of the MOMDP, using interactions with the environment and historical trajectories. For\neach \u03c0 \u2208 \u03a0_L, there exists at least one linear preference \u03c9 such that no other policy \u03c0\u2032 generates\nhigher utility under that \u03c9:\n\n\u03c0 \u2208 \u03a0_L \u21d2 \u2203 \u03c9 \u2208 \u2126, s.t. \u2200\u03c0\u2032 \u2208 \u03a0, \u03c9\u22a4v^\u03c0(s_0) \u2265 \u03c9\u22a4v^{\u03c0\u2032}(s_0),\n\nwhere s_0 is a fixed initial state, and v^\u03c0 is the value function, i.e., v^\u03c0(s) = E_\u03c0[\u02c6r | s_0 = s]. Given any\npreference \u03c9, \u03a0_L(\u03c9) determines the optimal policy.\nAdaptation phase. After learning, the agent is provided a new task, with either a) a preference \u03c9\nspecified by a human, or b) an unknown preference, where the agent has to automatically infer \u03c9.\nEfficiently aligning \u03a0_L(\u03c9) with the preferred optimal policy is non-trivial since the CCS can be very\nlarge. In both cases, the agent is evaluated on how well it can adapt to tasks with unseen preferences.\n\n2.1 Related Work\n\nMulti-Objective RL. Existing MORL algorithms can be roughly divided into two main categories\n[14, 15, 2]: single-policy methods and multiple-policy methods. Single-policy methods aim\nto find the optimal policy for a given preference among the objectives [16, 17]. These methods\nexplore different forms of preference functions, including non-linear ones such as the minimum over\nall objectives or the number of objectives that exceed a certain threshold. 
However, single-policy\nmethods do not work when preferences are unknown.\nMulti-policy approaches learn a set of policies to obtain the approximate Pareto frontier of optimal\nsolutions. The most common strategy is to perform multiple runs of a single-policy method over\ndifferent preferences [7, 18]. Policy-based RL algorithms [19, 20] simultaneously learn the optimal\nmanifold over a set of preferences. Several value-based reinforcement learning algorithms employ\nan extended version of the Bellman equation and maintain the convex hull of the discrete Pareto\nfrontier [8, 21, 22]. Multi-objective \ufb01tted Q-iteration (MOFQI) [23, 24] encapsulates preferences\nas input to a Q-function approximator and uses expanded historical trajectories to learn multiple\npolicies. This allows the agent to construct the optimal policy for any given preference during testing.\nHowever, these methods explicitly maintain sets of policies, and hence are dif\ufb01cult to scale up to\nhigh-dimensional preference spaces. Furthermore, these methods are designed to work during the\nlearning phase but cannot be easily adapted to new preferences at test time.\nScalarized Q-Learning. Recent work has proposed the scalarized Q-learning algorithm [9] which\nuses a vector value function but performs updates after computing the inner product of the value\nfunction with a preference vector. 
This method uses an outer loop to perform a search over preferences,\nwhile the inner loop performs the scalarized updates. Recently, Abels et al. [13] extended this to use\na single neural network to represent value functions over the entire space of preferences. However,\nscalarized updates are not sample efficient and lead to sub-optimal MORL policies \u2013 our approach\nuses a global optimality filter to perform envelope Q-function updates, leading to faster and better\nlearning (as we demonstrate in Figure 2(c) and Section 4).\nThree key contributions distinguish our work from Abels et al. 
[13]: (1) At the algorithmic level,\nour envelope Q-learning algorithm utilizes the convex envelope of the solution frontier to update\nparameters of the policy network, which allows our method to quickly align one preference with\noptimal rewards and trajectories that may have been explored under other preferences.\n(2) At the\ntheoretical level, we introduce a theoretical framework for designing and analyzing value-based\nMORL algorithms, and provide convergence proofs for our envelope Q-learning algorithm. (3) At the empirical\nlevel, we provide new evaluation metrics and benchmark environments for MORL and apply our\nalgorithm to a wider variety of domains, including two complex larger-scale domains \u2013 task-oriented\ndialog and Super Mario. Our fruit tree navigation (FTN) domain is a scaled-up, more complex version of Minecart in [13].\nPolicy Adaptation. Our policy adaptation scheme is related to prior work in preference elicitation\n[25, 26, 27] or inverse reinforcement learning [28, 29]. Inverse RL (IRL) aims to learn a scalar\nreward function from expert demonstrations, or directly imitate the expert\u2019s policy without intermediate\nsteps for solving a scalar reward function [30]. Chajewska et al. [31] proposed a Bayesian\nversion to learn the utility function. IRL is effective when the hidden preference is fixed and expert\ndemonstrations are available. In contrast, we require policy adaptation across different\npreferences and do not use any demonstrations.\n\n3 Multi-objective RL with Envelope Value Updates\n\nIn this section, we propose a new algorithm for multi-objective RL called envelope Q-learning. Our\nkey idea is to use vectorized value functions and perform envelope updates, which utilize the convex\nenvelope of the solution frontier to update parameters. This is in contrast to approaches like scalarized\nQ-Learning, which perform value function updates using only a single preference at a time. 
Since\nwe learn a set of policies simultaneously over multiple preferences, and our concept of optimality is\ndefined on vectorized rewards, existing convergence results from single-objective RL no longer hold.\nHence, we first provide a theoretical analysis of our proposed update scheme below, followed by a\nsketch of the resulting algorithm.\n\nBellman operators. The standard Q-Learning [32] algorithm for single-objective RL utilizes the\nBellman optimality operator T:\n\n(T Q)(s, a) := r(s, a) + \u03b3 E_{s\u2032\u223cP(\u00b7|s,a)} (HQ)(s\u2032),    (1)\n\nwhere the operator H, defined by (HQ)(s\u2032) := sup_{a\u2032\u2208A} Q(s\u2032, a\u2032), is an optimality filter over the\nQ-values for the next state s\u2032.\nWe extend this to the MORL case by considering a value space Q \u2286 (\u2126 \u2192 R^m)^{S\u00d7A}, containing all\nbounded functions Q(s, a, \u03c9) \u2013 estimates of expected total rewards under m-dimensional preference\n(\u03c9) vectors. We can define a corresponding value metric d as:\n\nd(Q, Q\u2032) := sup_{s\u2208S, a\u2208A, \u03c9\u2208\u2126} |\u03c9\u22a4(Q(s, a, \u03c9) \u2212 Q\u2032(s, a, \u03c9))|.    (2)\n\nSince the identity of indiscernibles [33] does not hold, we note that d forms a complete pseudo-metric\nspace, and refer to Q as a Multi-Objective Q-value (MOQ) function. Given a policy \u03c0 and sampled\ntrajectories \u03c4, we first define a multi-objective evaluation operator T_\u03c0 as:\n\n(T_\u03c0 Q)(s, a, \u03c9) := r(s, a) + \u03b3 E_{\u03c4\u223c(P,\u03c0)} Q(s\u2032, a\u2032, \u03c9).    (3)\n\nWe then define an optimality filter H for the MOQ function as (HQ)(s, \u03c9) :=\narg_Q sup_{a\u2208A, \u03c9\u2032\u2208\u2126} \u03c9\u22a4Q(s, a, \u03c9\u2032), where the arg_Q takes the multi-objective value corresponding to\nthe supremum (i.e., Q(s, a, \u03c9\u2032) such that (a, \u03c9\u2032) \u2208 arg sup_{a\u2208A, \u03c9\u2032\u2208\u2126} \u03c9\u22a4Q(s, a, \u03c9\u2032)). The return of\narg_Q depends on which \u03c9 is chosen for scalarization, and we keep arg_Q for simplicity. This can\nbe thought of as a generalized version of the single-objective optimality filter in Eq. 1. Intuitively, H\nsolves the convex envelope (hence the name envelope Q-learning) of the current solution frontier to\nproduce the Q that optimizes utility given state s and preference \u03c9. This allows for more optimistic\nQ-updates compared to using just the standard Bellman filter (H) that optimizes over actions only \u2013\nthis is the update used by scalarized Q-learning [13]. We can then define a multi-objective optimality\noperator T as:\n\n(T Q)(s, a, \u03c9) := r(s, a) + \u03b3 E_{s\u2032\u223cP(\u00b7|s,a)} (HQ)(s\u2032, \u03c9).    (4)\n\nThe following theorems demonstrate the feasibility of using our optimality operator for multi-objective\nRL. Proofs for all the theorems are provided in the supplementary material.\nTheorem 1 (Fixed Point of Envelope Optimality Operator). Let Q\u2217 \u2208 Q be the preferred optimal\nvalue function in the value space, such that\n\nQ\u2217(s, a, \u03c9) = arg_Q sup_{\u03c0\u2208\u03a0} \u03c9\u22a4 E_{\u03c4\u223c(P,\u03c0) | s_0=s, a_0=a} [ \u2211_{t=0}^{\u221e} \u03b3^t r(s_t, a_t) ],    (5)\n\nwhere the arg_Q takes the multi-objective value corresponding to the supremum. Then, Q\u2217 = T Q\u2217.\nTheorem 1 tells us the preferred optimal value function is a fixed point of T in the value space.\nTheorem 2 (Envelope Optimality Operator is a Contraction). Let Q, Q\u2032 be any two multi-objective Q-value\nfunctions in the value space Q as defined above. 
Then, the Lipschitz condition d(T Q, T Q\u2032) \u2264\n\u03b3 d(Q, Q\u2032) holds, where \u03b3 \u2208 [0, 1) is the discount factor of the underlying MOMDP M.\nFinally, we provide a generalized version of Banach\u2019s Fixed-Point Theorem in the pseudo-metric\nspace.\nTheorem 3 (Multi-Objective Banach Fixed-Point Theorem). If T is a contraction mapping with\nLipschitz coefficient \u03b3 on the complete pseudo-metric space \u27e8Q, d\u27e9, and Q\u2217 is defined as in Theorem\n1, then lim_{n\u2192\u221e} d(T^n Q, Q\u2217) = 0 for any Q \u2208 Q.\nTheorems 1-3 guarantee that iteratively applying the optimality operator T to any MOQ-value function\nconverges to a function Q that is equivalent to Q\u2217 under the pseudo-metric d.\nThese Qs are as good as Q\u2217 since they all have the same utilities for each \u03c9, and will only differ\nwhen the utility corresponds to a recess in the frontier (see Figure 2(c) for an example: at the recess,\neither D or F is optimal).\nMaintaining the envelope sup_{\u03c9\u2032} \u03c9\u22a4Q(\u00b7, \u00b7, \u03c9\u2032) allows our method to quickly align one preference\nwith optimal rewards and trajectories that may have been explored under other preferences, while\nscalarized updates that optimize the scalar utility cannot use the information of max_a Q(s, a, \u03c9\u2032) to\nupdate the optimal solution aligned with a different \u03c9. As illustrated in Figure 2(c), suppose we\nhave found two optimal solutions D and F in the CCS, misaligned with preferences \u03c9_2 and \u03c9_1. The\nscalarized update cannot use the information of max_a Q(s, a, \u03c9_1) (corresponding to F) to update\nthe optimal solution aligned with \u03c9_2 or vice versa. It only searches along the \u03c9_1 direction, leading to\nthe non-optimal L, even if solution D has been seen under \u03c9_2. 
Hence, the envelope updates can have\nbetter sample efficiency in theory, as is also seen from the empirical results.\n\nLearning Algorithm. Using the above theorems, we provide a sample-efficient learning algorithm\nfor multi-objective RL (Algorithm 1). Since our goal is to induce a single model that can adapt to the\nentire space of \u2126, we use one parameterized function to represent Q \u2286 (\u2126 \u2192 R^m)^{S\u00d7A}. We achieve\nthis by using a deep neural network with s, \u03c9 as input and |A| \u00d7 m Q-values as output. We then\nminimize the following loss function at each step k (see footnote 2):\n\nL^A(\u03b8) = E_{s,a,\u03c9}[ \u2016y \u2212 Q(s, a, \u03c9; \u03b8)\u2016_2^2 ],    (6)\n\nwhere y = E_{s\u2032}[r + \u03b3 arg_Q max_{a,\u03c9\u2032} \u03c9\u22a4Q(s\u2032, a, \u03c9\u2032; \u03b8_k)], which empirically can be estimated by sampling\ntransitions (s, a, s\u2032, r) from a replay buffer.\nOptimizing L^A directly is challenging in practice because the optimal frontier contains a large number\nof discrete solutions, which makes the landscape of the loss function considerably non-smooth. To\naddress this, we use an auxiliary loss function L^B:\n\nL^B(\u03b8) = E_{s,a,\u03c9}[ |\u03c9\u22a4y \u2212 \u03c9\u22a4Q(s, a, \u03c9; \u03b8)| ].    (7)\n\n2 We use double Q learning with target Q networks following Mnih et al. [34].\n\nCombined, our final loss function is L(\u03b8) = (1 \u2212 \u03bb) \u00b7 L^A(\u03b8) + \u03bb \u00b7 L^B(\u03b8), where \u03bb is a\nweight to trade off between the two losses. We slowly increase the value of \u03bb from\n0 to 1, to shift our loss function from L^A to L^B. This method, known as homotopy\noptimization [12], is effective since for each update step, it uses the optimization result\nfrom the previous step as the initial guess. L^A\nfirst ensures the prediction of Q is close to any real expected total reward, although it\nmay not be optimal. 
LB provides an auxiliary pull along the direction with better utility.\nThe loss function above has an expectation over \u03c9, which entails sampling random preferences in the algorithm. However, since the \u03c9s are decoupled from the transitions, we can increase sample efficiency by using a scheme similar to Hindsight Experience Replay [11]. Furthermore, computing the optimality filter H over the entire Q is infeasible; instead we approximate this by applying H over a minibatch of transitions before performing parameter updates. Further details on our model architectures and implementation are available in the supplementary material (Section A.2.3).\n\nAlgorithm 1: Envelope MOQ-Learning\nInput: a preference sampling distribution D_\u03c9, path p_\u03bb for the balance weight \u03bb increasing from 0 to 1.\nInitialize replay buffer D_\u03c4, network Q_\u03b8, and \u03bb = 0.\nfor episode = 1, . . . , M do\n  Sample a linear preference \u03c9 \u223c D_\u03c9.\n  for t = 0, . . . , N do\n    Observe state s_t.\n    Sample an action \u03b5-greedily:\n      a_t = random action in A, w.p. \u03b5;\n      a_t = arg max_{a\u2208A} \u03c9\u22a4Q(s_t, a, \u03c9; \u03b8), w.p. 1 \u2212 \u03b5.\n    Receive a vectorized reward r_t and observe s_{t+1}.\n    Store transition (s_t, a_t, r_t, s_{t+1}) in D_\u03c4.\n    if update then\n      Sample N_\u03c4 transitions (s_j, a_j, r_j, s_{j+1}) \u223c D_\u03c4.\n      Sample N_\u03c9 preferences W = {\u03c9_i \u223c D_\u03c9}.\n      Compute y_ij = (T Q)_ij =\n        r_j, for terminal s_{j+1};\n        r_j + \u03b3 arg_Q max_{a\u2208A, \u03c9\u2032\u2208W} \u03c9_i\u22a4Q(s_{j+1}, a, \u03c9\u2032; \u03b8), otherwise,\n      for all 1 \u2264 i \u2264 N_\u03c9 and 1 \u2264 j \u2264 N_\u03c4.\n      Update Q_\u03b8 by descending its stochastic gradient according to equations 6 and 7:\n        \u2207_\u03b8 L(\u03b8) = (1 \u2212 \u03bb) \u00b7 \u2207_\u03b8 LA(\u03b8) + \u03bb \u00b7 \u2207_\u03b8 LB(\u03b8).\n      Increase \u03bb along the path p_\u03bb.\n\nPolicy adaptation. 
Once we obtain a policy model \u03a0_L(\u03c9) from the learning phase, the agent can adapt to any provided preference by simply feeding the \u03c9 into the network. While this is a straightforward scenario, we also consider a more challenging test where only scalar rewards are available and the agent has to uncover a hidden preference \u03c9 while adapting to the new task. For this case, we assume preferences are drawn from a truncated multivariate Gaussian distribution D^m_\u03c9(\u00b51, . . . , \u00b5m; \u03c3) on an (m\u22121)-simplex, where nonnegative parameters \u00b51, . . . , \u00b5m are the means with \u00b51 + \u00b7\u00b7\u00b7 + \u00b5m = 1, and \u03c3 is a fixed standard deviation for all dimensions. Our goal is then to infer the parameters of this Gaussian distribution, for which we perform a combination of policy gradient (e.g., REINFORCE [35]) and stochastic search while keeping the policy model fixed. We determine the best preference parameters that maximize the expected return in the target task:\n\narg max_{\u00b51,...,\u00b5m} E_{\u03c9\u223cD^m_\u03c9}[ E_{\u03c4\u223c(P, \u03a0_L(\u03c9))}[ \u03a3_{t=0}^{\u221e} \u03b3^t r_t(s_t, a_t) ] ].   (8)\n\n4 Experiments\n\nEvaluation Metrics. We use three metrics to evaluate the empirical performance on test tasks:\na) Coverage Ratio (CR). The first metric is coverage ratio (CR), which evaluates the agent\u2019s ability to recover optimal solutions in the convex coverage set (CCS). If F \u2286 R^m is the set of solutions found by the agent (via sampled trajectories), we define F \u2229_\u03b5 CCS := {x \u2208 F | \u2203y \u2208 CCS s.t. \u2016x \u2212 y\u2016_1/\u2016y\u2016_1 \u2264 \u03b5} as the intersection between these sets with a tolerance of \u03b5. 
The CR is then defined as:\n\nCR_F1(F) := 2 \u00b7 (precision \u00b7 recall) / (precision + recall),   (9)\n\nwhere the precision = |F \u2229_\u03b5 CCS|/|F|, indicating the fraction of optimal solutions among the retrieved solutions, and the recall = |F \u2229_\u03b5 CCS|/|CCS|, indicating the fraction of optimal instances that have been retrieved over the total amount of optimal solutions (see Figure 3(a)).\nb) Adaptation Error (AE). Our second metric compares the retrieved control frontier with the optimal one, when an agent is provided with a specific preference \u03c9 during the adaptation phase:\n\nAE(C) := E_{\u03c9\u223cD_\u03c9}[ |C(\u03c9) \u2212 C_opt(\u03c9)| / C_opt(\u03c9) ],   (10)\n\nwhich is the expected relative error between the optimal control frontier C_opt : \u2126 \u2192 R with \u03c9 \u21a6 max_{r\u0302\u2208CCS} \u03c9\u22a4r\u0302 and the agent\u2019s control frontier C_{\u03c0_\u03c9} = \u03c9\u22a4r\u0302_{\u03c0_\u03c9}.\nc) Average Utility (UT). This measures the average utility obtained by the trained agent on randomly sampled preferences and is a useful proxy to AE when we don\u2019t have access to the optimal policy.\n\nFigure 3: Illustration of evaluation metrics for MORL. (a) Coverage ratio (CR) measures an agent\u2019s ability to find all the potential optimal solutions in the convex coverage set of the Pareto frontier. Dots with black boundary are solutions in CCS, dots without black boundary are non-optimal returns, and dots in green are solutions retrieved by an MORL algorithm. CR is the F1 score based on the precision and recall calculation. (b) Adaptation error (AE) measures an agent\u2019s ability of policy adaptation to real-time specified preferences. The gray curve indicates the theoretical limit of the best cumulative utilities under all preferences, and the green curve indicates the cumulative utilities of an MORL algorithm. AE is the average gap between these two curves over all preferences.\n\nDomains. We evaluate on four different domains (complete details in supplementary material):\n\n1. Deep Sea Treasure (DST) A classic MORL benchmark [14] in which an agent controls a submarine searching for treasures in a 10 \u00d7 11-grid world while trading off time-cost and treasure-value. The grid world contains 10 treasures of different values. Their values increase as their distances from the starting point s_0 = (0, 0) increase. We ensure that the Pareto frontier of this environment is convex.\n2. Fruit Tree Navigation (FTN) A full binary tree of depth d with randomly assigned vectorial reward r \u2208 R^6 on the leaf nodes. These rewards encode the amounts of six different components of nutrition of the fruits on the tree: {Protein, Carbs, Fats, Vitamins, Minerals, Water}. For every leaf node, \u2203\u03c9 for which its reward is optimal, thus all leaves lie on the CCS. The goal of our MORL agent is to find a path from the root to a leaf node that maximizes utility for a given preference, choosing between left or right subtrees at every non-terminal node.\n3. Task-Oriented Dialog Policy Learning (Dialog) A modified task-oriented dialog system in the restaurant reservation domain based on PyDial [36]. We consider the task success rate and the dialog brevity (measured by number of turns) as two competing objectives of this domain.\n4. Multi-Objective SuperMario Game (SuperMario) A multi-objective version of the popular video game Super Mario Bros. 
We modify the open-source environment from OpenAI gym [37] to provide vectorized rewards encoding five different objectives: x-pos: value corresponding to the difference in Mario\u2019s horizontal position between the current and the last time point; time: a small negative time penalty; deaths: a large negative penalty given each time Mario dies; coin: rewards for collecting coins; and enemy: rewards for eliminating an enemy.\n\nBaselines. We compare our envelope MORL algorithm with classic and state-of-the-art baselines:\n\n1. MOFQI [24]: Multi-objective fitted Q-iteration where the Q-approximator is a large linear model.\n2. CN+OLS [13]: Conditional neural network with the Optimistic Linear Support (OLS) method as the outer loop for selecting \u03c9. This method was first proposed in [9] with multiple neural networks, and we employ an improved version using a single conditional neural network [13].\n3. Scalarized [13]: The state-of-the-art algorithm, which uses scalarized Q-updates with double Q-learning and prioritized and hindsight experience replay. It is equivalent to CN+DER proposed in [13].\n\nMain Results. Table 1 shows the performance comparison of different MORL algorithms in four domains. We give training and test details for each domain in the supplementary material. In DST and FTN we compare CR and AE as defined in section 4. In the task-oriented dialog policy learning task, we compare the average utility (Avg. UT) for 5,000 test dialogues with uniformly sampled user preferences on success and brevity. In the SuperMario game, the Avg. 
UT is over 500 test episodes with uniformly sampled preferences.\n\n[Figure 3 graphic. Axes: Quantity of Objective 1 vs. Quantity of Objective 2. Labels: Non-optimal Solutions, Retrieved Solutions, CCS, Precision, Recall, Retrieved Control Frontier, Optimal Control Frontier, Control Errors.]\n\nMethod | DST CR \u2191 | DST AE \u2193 | FTN (d = 6) CR \u2191 | FTN (d = 6) AE \u2193 | Dialog\u00b2 Avg. UT \u2191 | SuperMario\u00b2 Avg. UT \u2191\nMOFQI | 0.639 \u00b1 0.421 | 139.6 \u00b1 25.98 | 0.197 \u00b1 0.000 | 0.176 \u00b1 0.001 | 2.17 \u00b1 0.21 | \u2013\nCN+OLS | 0.751 \u00b1 0.163 | 34.63 \u00b1 1.396 | \u2013 | \u2013 | 2.53 \u00b1 0.22 | \u2013\nScalarized | 0.989 \u00b1 0.024 | 0.165 \u00b1 0.096 | 0.914 \u00b1 0.044 | 0.016 \u00b1 0.005 | 2.38 \u00b1 0.22 | 162.7 \u00b1 77.66\nEnvelope (ours)\u00b9 | 0.994 \u00b1 0.001 | 0.152 \u00b1 0.006 | 0.987 \u00b1 0.021 | 0.006 \u00b1 0.001 | 2.65 \u00b1 0.22 | 321.2 \u00b1 146.9\n\nTable 1: Comparison of different MORL algorithms in learning and adaptation phases across four experimental domains. \u2191 indicates higher is better, and \u2193 indicates lower is better for the scores. Each data point indicates the mean and standard deviation over 5 independent training and test runs. \u00b9 Using the unpaired t-test, we obtain significance scores of p < 0.05 vs MOFQI on all domains, p < 0.01 vs CN+OLS on DST and p < 0.05 vs Scalarized on FTN, Dialog and SuperMario. \u00b2 Additional results are in the supplementary material C.4 and C.5.\n\n
The envelope algorithm steadily achieves the best performance\nin terms of both learning and adaptation among all the MORL methods in all four domains.\n\nFigure 4: Coverage Ratio (CR) and Adaptation Error (AE) comparison of the scalarized algorithm [13] and our\nenvelope deep MORL algorithm over 5000 episodes of FTN tasks of depths d = 5, 6, 7. Higher CR indicates\nbetter coverage of optimal policies, lower AE indicates better adaptation. The error bars are standard deviations\nof CR and AE estimated from 5 independent runs under each con\ufb01guration.\n\nScalability. There are three aspects of the scalability of a MORL algorithm: the ability to deal\nwith (1) large state space, (2) many objectives, and (3) large optimal policy set. Unlike other neural\nnetwork-based methods, MOFQI cannot deal with the large state space, e.g., the video frames in\nSuperMario Game. The CN+OLS baseline requires solving all the intersection points of a set of\nhyper-planes thus is computationally intractable in domains with m > 3 objectives, such as FTN\nand SuperMario. We denote these entries as \u201c\u2013\" in Table 1. Both scalarized and envelope methods\ncan be applied to cases having large state space and reasonably many objectives. However, the size\nof optimal policy set may affect the performance of these algorithms. Figure 4 shows CR and AE\nresults in three FTN environments with d = 5 (with 32 solutions), d = 6 (with 64 solutions), and\nd = 7 (with 128 solutions). We observe that both scalarized and envelope algorithms are close to\noptimal when d = 5 but both CR and AE values are worse for d = 7. However, the envelope version\nis more stable and outperforms the scalarized MORL algorithm in all three cases. These results point\nto the robustness and scalability of our algorithms.\n\nSample Ef\ufb01ciency. 
To compare sample efficiency during the learning phase, we train both the scalarized and our envelope deep MORL algorithms on the FTN task with different depths for 5,000 episodes. We compute coverage ratio (CR) over 2,000 episodes and adaptation error (AE) over 5,000 episodes. Figure 4 shows plots for the metrics computed over a varying number of sampled preferences N_\u03c9 (more details can be found in the supplementary material). Each point on the curve is averaged over 5 experiments. We observe that the envelope MORL algorithm consistently has better CR and AE scores than the scalarized version, with smaller variances. As N_\u03c9 increases, CR increases and AE decreases, which shows better use of historical interactions for both algorithms when N_\u03c9 is larger. Moreover, to achieve the same level of AE, the envelope algorithm requires a smaller N_\u03c9 than the scalarized algorithm. This reinforces our theoretical analysis that the envelope MORL algorithm has better sample efficiency than the scalarized version.\n\nPolicy Adaptation. We show how the MORL agents respond to user preference during the adaptation phase in the dialog policy learning task, where the agent must trade off between the dialog success rate and the conversation brevity. 
Figure 5 shows the success rate (SR) curves as we vary the weight of the preference on task completion success. The success rates of both MORL algorithms increase as the user\u2019s weight on success increases, while those of the single-objective algorithms do not change. This shows that our envelope MORL agent can adapt gracefully to the user\u2019s preference. Furthermore, our envelope deep MORL algorithm outperforms other algorithms whenever success is relatively more important to the user (weight > 0.5).\n\nTable 2: Inferred preferences of the envelope MOQ-learning algorithm on different FTN (d = 6) tasks (v1 to v6) after only 15 episodes of interaction. The underlying preferences are all ones on the diagonal of the table and zeros for the off-diagonal. Columns: Protein, Carbs, Fats, Vitamins, Minerals, Water; rows: v1 to v6. [Cell alignment lost in extraction. Nonzero entries in extraction order: 0.9639, 0.0361, 0.9461, 0.0291, 0.9067, 0.0539, 0.1366, 0.7503, 0.0428, 0.0459, 0.0148, 0.0505, 0.0671, 0.7503, 0.0933, 0.1629, 0.9495; all remaining entries are 0.]\n\nTask | x-pos | time | life | coin | enemy\ng1 | 0.5288 | 0.1770 | 0.1500 | 0.0470 | 0.0972\ng2 | 0.1985 | 0.2237 | 0.2485 | 0.1422 | 0.1868\ng3 | 0.2196 | 0.1296 | 0.3541 | 0.1792 | 0.1175\ng4 | 0.0211 | 0.2404 | 0.0211 | 0.6960 | 0.0211\ng5 | 0.0715 | 0.1038 | 0.2069 | 0.3922 | 0.2253\n\nTable 3: Inferred preferences of the envelope multi-objective A3C algorithm in different Mario Game variants (g1 to g5) with 100 episodes. The underlying preferences are all ones on the diagonal of the table and zeros for the off-diagonal.\n\nRevealing underlying preferences. Finally, we test the ability of our agent to infer and adapt to unknown preferences on FTN and SuperMario. During the learning phase, the agent does not know the underlying preference, and hence learns a multi-objective policy. 
During the adaptation phase, our agent infers the underlying preference that maximizes utility (as described in Section 3). Table 2 shows the learned preferences for 6 different FTN tasks (v1 to v6) with unknown one-hot preferences [1, 0, 0, 0, 0, 0] to [0, 0, 0, 0, 0, 1], respectively, meaning the agent should only care about one elementary nutrient. These were learned in a few-shot adaptation setting, using just 15 episodes. For the SuperMario Game, we implement an A3C [38] variant of our envelope MORL agent (see supplementary material for details). Table 3 shows the learned preferences for 5 different tasks (g1 to g5) with unknown one-hot preferences using just 100 episodes.\nWe observe that the learned preferences are concentrated on the diagonal, indicating good alignment with the actual underlying preferences. For example, in the SuperMario game variant g4, the envelope MORL agent finds that the preference with the highest weight (0.6960) on the coin objective best describes the goal of g4, which is to collect as many coins as possible. We also tested policy adaptation on the original Mario game using game scores for the scalar rewards. We find that the agent learns preference weights of 0.37 for x-pos and 0.23 for time, which seems consistent with a common strategy that humans employ: simply move Mario towards the flag as quickly as possible.\n\nFigure 5: The success-weight curves of task-oriented dialog. Each data point is a moving average over roughly the 500 closest dialogues within an interval of about \u00b10.05 in the weight of success. 
The light shadow indicates the standard deviations of 5 independent runs under each configuration. [Figure 5 graphic. Axes: Weight of Success vs. Success Rate (%). Legend: Scalarized Version, Envelope Version, Single-Obj (0.5 turn + 0.5 succ), Single-Obj (0.2 turn + 0.8 succ), Single-Obj (0.8 turn + 0.2 succ).]\n\n5 Conclusion\nWe have introduced a new algorithm for multi-objective reinforcement learning (MORL) with linear preferences, with the goal of enabling few-shot adaptation of autonomous agents to new scenarios. Specifically, we propose a multi-objective version of the Bellman optimality operator, and utilize it to learn a single parametric representation for all optimal policies over the space of preferences. We provide convergence proofs for our multi-objective algorithm and also demonstrate how to use our model to adapt and elicit an unknown preference on a new task. Our experiments across four different domains demonstrate that our algorithms exhibit effective generalization and policy adaptation.\n\nAcknowledgements\nThe authors would like to thank Yexiang Xue at Purdue University and Carla Gomes at Cornell University for helpful discussions on multi-objective optimization, Lu Chen and Kai Yu at Shanghai Jiao Tong University for discussing dialogue applications, Haoran Cai at MIT for help running part of the synthetic experiments, and the anonymous reviewers for constructive suggestions.\n\nReferences\n[1] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Man\u00e9. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.\n\n[2] Diederik M. 
Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A survey of multi-objective\n\nsequential decision-making. J. Artif. Intell. Res., 48:67\u2013113, 2013.\n\n[3] Il Yong Kim and OL De Weck. Adaptive weighted sum method for multiobjective optimization: a new\nmethod for pareto front generation. Structural and multidisciplinary optimization, 31(2):105\u2013116, 2006.\n\n[4] Abdullah Konak, David W Coit, and Alice E Smith. Multi-objective optimization using genetic algorithms:\n\nA tutorial. Reliability Engineering & System Safety, 91(9):992\u20131007, 2006.\n\n[5] Hirotaka Nakayama, Yeboon Yun, and Min Yoon. Sequential approximate multiobjective optimization\n\nusing computational intelligence. Springer Science & Business Media, 2009.\n\n[6] JiGuan G Lin. On min-norm and min-max methods of multi-objective optimization. Mathematical\n\nprogramming, 103(1):1\u201333, 2005.\n\n[7] Sriraam Natarajan and Prasad Tadepalli. Dynamic preferences in multi-criteria reinforcement learning. In\nLuc De Raedt and Stefan Wrobel, editors, Machine Learning, Proceedings of the Twenty-Second Interna-\ntional Conference (ICML 2005), Bonn, Germany, August 7-11, 2005, volume 119 of ACM International\nConference Proceeding Series, pages 601\u2013608. ACM, 2005.\n\n[8] Leon Barrett and Srini Narayanan. Learning all optimal policies with multiple criteria. In William W.\nCohen, Andrew McCallum, and Sam T. Roweis, editors, Machine Learning, Proceedings of the Twenty-\nFifth International Conference (ICML 2008), Helsinki, Finland, June 5-9, 2008, volume 307 of ACM\nInternational Conference Proceeding Series, pages 41\u201347. ACM, 2008.\n\n[9] Hossam Mossalam, Yannis M. Assael, Diederik M. Roijers, and Shimon Whiteson. Multi-objective deep\n\nreinforcement learning. CoRR, abs/1610.02707, 2016.\n\n[10] Richard Ernest Bellman. Dynamic Programming. 
Princeton University Press, Princeton, NJ, USA, 1957.\n\n[11] Marcin Andrychowicz, Dwight Crow, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob\nMcGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in\nNeural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems\n2017, 4-9 December 2017, Long Beach, CA, USA, pages 5055\u20135065, 2017.\n\n[12] Layne T Watson and Raphael T Haftka. Modern homotopy methods in optimization. Computer Methods\n\nin Applied Mechanics and Engineering, 74(3):289\u2013305, 1989.\n\n[13] Axel Abels, Diederik M. Roijers, Tom Lenaerts, Ann Now\u00e9, and Denis Steckelmacher. Dynamic weights\nin multi-objective deep reinforcement learning. In Proceedings of the 36th International Conference on\nMachine Learning, page TBA, 2019.\n\n[14] Peter Vamplew, Richard Dazeley, Adam Berry, Rustam Issabekov, and Evan Dekker. Empirical evaluation\nmethods for multiobjective reinforcement learning algorithms. Machine Learning, 84(1-2):51\u201380, 2011.\n\n[15] Chunming Liu, Xin Xu, and Dewen Hu. Multiobjective reinforcement learning: A comprehensive overview.\n\nIEEE Trans. Systems, Man, and Cybernetics: Systems, 45(3):385\u2013398, 2015.\n\n[16] Shie Mannor and Nahum Shimkin. The steering approach for multi-criteria reinforcement learning. In\nThomas G. Dietterich, Suzanna Becker, and Zoubin Ghahramani, editors, Advances in Neural Information\nProcessing Systems 14 [Neural Information Processing Systems: Natural and Synthetic, NIPS 2001,\nDecember 3-8, 2001, Vancouver, British Columbia, Canada], pages 1563\u20131570. MIT Press, 2001.\n\n[17] Gerald Tesauro, Rajarshi Das, Hoi Chan, Jeffrey O. Kephart, David Levine, Freeman L. Rawson III,\nand Charles Lefurgy. Managing power consumption and performance of computing systems using\nreinforcement learning. In John C. Platt, Daphne Koller, Yoram Singer, and Sam T. 
Roweis, editors, Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007, pages 1497\u20131504. Curran Associates, Inc., 2007.\n\n[18] Kristof Van Moffaert, Madalina M. Drugan, and Ann Now\u00e9. Scalarized multi-objective reinforcement learning: Novel design techniques. In Proceedings of the 2013 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, ADPRL 2013, IEEE Symposium Series on Computational Intelligence (SSCI), 16-19 April 2013, Singapore, pages 191\u2013199. IEEE, 2013.\n\n[19] Matteo Pirotta, Simone Parisi, and Marcello Restelli. Multi-objective reinforcement learning with continuous pareto frontier approximation. In Blai Bonet and Sven Koenig, editors, Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA, pages 2928\u20132934. AAAI Press, 2015.\n\n[20] Simone Parisi, Matteo Pirotta, and Jan Peters. Manifold-based multi-objective policy search with sample reuse. Neurocomputing, 263:3\u201314, 2017.\n\n[21] Kazuyuki Hiraoka, Manabu Yoshida, and Taketoshi Mishima. Parallel reinforcement learning for weighted multi-criteria model with adaptive margin. In Masumi Ishikawa, Kenji Doya, Hiroyuki Miyamoto, and Takeshi Yamakawa, editors, Neural Information Processing, 14th International Conference, ICONIP 2007, Kitakyushu, Japan, November 13-16, 2007, Revised Selected Papers, Part I, volume 4984 of Lecture Notes in Computer Science, pages 487\u2013496. Springer, 2007.\n\n[22] Hitoshi Iima and Yasuaki Kuroe. Multi-objective reinforcement learning for acquiring all pareto optimal policies simultaneously - method of determining scalarization weights. In 2014 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2014, San Diego, CA, USA, October 5-8, 2014, pages 876\u2013881. 
IEEE, 2014.\n\n[23] Andrea Castelletti, Francesca Pianosi, and Marcello Restelli. Multi-objective fitted q-iteration: Pareto frontier approximation in one single run. In Proceedings of the IEEE International Conference on Networking, Sensing and Control, ICNSC 2011, Delft, The Netherlands, 11-13 April 2011, pages 260\u2013265. IEEE, 2011.\n\n[24] Andrea Castelletti, Francesca Pianosi, and Marcello Restelli. Tree-based fitted q-iteration for multi-objective markov decision problems. In The 2012 International Joint Conference on Neural Networks (IJCNN), Brisbane, Australia, June 10-15, 2012, pages 1\u20138. IEEE, 2012.\n\n[25] Wolfram Conen and Tuomas Sandholm. Preference elicitation in combinatorial auctions. In Proceedings of the 3rd ACM conference on Electronic Commerce, pages 256\u2013259. ACM, 2001.\n\n[26] Craig Boutilier. A POMDP formulation of preference elicitation problems. In Proceedings of the Eighteenth National Conference on Artificial Intelligence and Fourteenth Conference on Innovative Applications of Artificial Intelligence, July 28 - August 1, 2002, Edmonton, Alberta, Canada, pages 239\u2013246, 2002.\n\n[27] Li Chen and Pearl Pu. Survey of preference elicitation methods. Technical report, 2004.\n\n[28] Andrew Y. Ng and Stuart J. Russell. Algorithms for inverse reinforcement learning. In Pat Langley, editor, Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), Stanford University, Stanford, CA, USA, June 29 - July 2, 2000, pages 663\u2013670. Morgan Kaufmann, 2000.\n\n[29] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Carla E. Brodley, editor, Machine Learning, Proceedings of the Twenty-first International Conference (ICML 2004), Banff, Alberta, Canada, July 4-8, 2004, volume 69 of ACM International Conference Proceeding Series. ACM, 2004.\n\n[30] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. 
In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 4565\u20134573, 2016.\n\n[31] Urszula Chajewska, Daphne Koller, and Dirk Ormoneit. Learning an agent\u2019s utility function by observing behavior. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), Williams College, Williamstown, MA, USA, June 28 - July 1, 2001, pages 35\u201342, 2001.\n\n[32] Christopher J. C. H. Watkins and Peter Dayan. Technical note: Q-learning. Machine Learning, 8:279\u2013292, 1992.\n\n[33] Arch W Naylor and George R Sell. Linear operator theory in engineering and science. Springer Science & Business Media, 2000.\n\n[34] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529\u2013533, 2015.\n\n[35] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229\u2013256, 1992.\n\n[36] Stefan Ultes, Lina Maria Rojas-Barahona, Pei-Hao Su, David Vandyke, Dongho Kim, I\u00f1igo Casanueva, Pawel Budzianowski, Nikola Mrksic, Tsung-Hsien Wen, Milica Gasic, and Steve J. Young. Pydial: A multi-domain statistical dialogue system toolkit. 
In Proceedings of the 55th Annual Meeting of the Association\nfor Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, System Demonstrations,\npages 73\u201378, 2017.\n\n[37] Christian Kauten. Super Mario Bros for OpenAI Gym. https://github.com/Kautenja/gym-super-mario-bros, 2018.\n\n[38] Volodymyr Mnih, Adri\u00e0 Puigdom\u00e8nech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim\nHarley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In\nProceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY,\nUSA, June 19-24, 2016, pages 1928\u20131937, 2016.\n\n[39] Mohamed A. Khamsi and William A. Kirk. An introduction to metric spaces and fixed point theory,\nvolume 53. John Wiley & Sons, 2011.\n\n[40] Dimitri P. Bertsekas. Regular policies in abstract dynamic programming. SIAM Journal on Optimization,\n27(3):1694\u20131727, 2017.\n\n[41] Dimitri P. Bertsekas. Abstract dynamic programming. Athena Scientific, Belmont, MA, 2018.\n\n[42] Marc G. Bellemare, Will Dabney, and R\u00e9mi Munos. A distributional perspective on reinforcement learning.\nIn Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW,\nAustralia, 6-11 August 2017, pages 449\u2013458, 2017.\n\n[43] Jost Schatzmann and Steve J. Young. The hidden agenda user simulation model. IEEE Trans. Audio,\nSpeech & Language Processing, 17(4):733\u2013747, 2009.\n\n[44] Stefan Ultes, Pawel Budzianowski, I\u00f1igo Casanueva, Nikola Mrksic, Lina Maria Rojas-Barahona, Pei-Hao\nSu, Tsung-Hsien Wen, Milica Gasic, and Steve J. Young. Reward-balancing for statistical spoken dialogue\nsystems using multi-objective reinforcement learning. In Proceedings of the 18th Annual SIGdial Meeting\non Discourse and Dialogue, Saarbr\u00fccken, Germany, August 15-17, 2017, pages 65\u201370, 2017.\n\n[45] Timothy P. Lillicrap, Jonathan J. 
Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David\nSilver, and Daan Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971,\n2015.\n\n[46] Shixiang Gu, Timothy P. Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep Q-learning with\nmodel-based acceleration. In Proceedings of the 33rd International Conference on Machine Learning,\nICML 2016, New York City, NY, USA, June 19-24, 2016, pages 2829\u20132838, 2016.\n\n[47] Luke Metz, Julian Ibarz, Navdeep Jaitly, and James Davidson. Discrete sequential prediction of continuous\nactions for deep RL. CoRR, abs/1705.05035, 2017.\n\n[48] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning.\nIn Dale Schuurmans and Michael P. Wellman, editors, Proceedings of the Thirtieth AAAI Conference on\nArtificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pages 2094\u20132100. AAAI Press,\n2016.\n\n[49] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. CoRR\n(Published at ICLR 2016), abs/1511.05952, 2015.\n\n[50] L. J. P. van der Maaten and G. E. Hinton. Visualizing high-dimensional data using t-SNE. Journal of Machine\nLearning Research, Nov 2008.", "award": [], "sourceid": 8277, "authors": [{"given_name": "Runzhe", "family_name": "Yang", "institution": "Princeton University"}, {"given_name": "Xingyuan", "family_name": "Sun", "institution": "Princeton University"}, {"given_name": "Karthik", "family_name": "Narasimhan", "institution": "Princeton University"}]}