{"title": "Learner-aware Teaching: Inverse Reinforcement Learning with Preferences and Constraints", "book": "Advances in Neural Information Processing Systems", "page_first": 4145, "page_last": 4155, "abstract": "Inverse reinforcement learning (IRL) enables an agent to learn complex behavior by observing demonstrations from a (near-)optimal policy. The typical assumption is that the learner's goal is to match the teacher\u2019s demonstrated behavior. In this paper, we consider the setting where the learner has its own preferences that it additionally takes into consideration. These preferences can for example capture behavioral biases, mismatched worldviews, or physical constraints. We study two teaching approaches: learner-agnostic teaching, where the teacher provides demonstrations from an optimal policy ignoring the learner's preferences, and learner-aware teaching, where the teacher accounts for the learner\u2019s preferences. We design learner-aware teaching algorithms and show that significant performance improvements can be achieved over learner-agnostic teaching.", "full_text": "Learner-aware Teaching: Inverse Reinforcement Learning with Preferences and Constraints\n\nSebastian Tschiatschek\u2217\nMicrosoft Research\nsetschia@microsoft.com\n\nAhana Ghosh\u2217\nMPI-SWS\ngahana@mpi-sws.org\n\nLuis Haug\u2217\nETH Zurich\nlhaug@inf.ethz.ch\n\nRati Devidze\nMPI-SWS\nrdevidze@mpi-sws.org\n\nAdish Singla\nMPI-SWS\nadishs@mpi-sws.org\n\nAbstract\n\nInverse reinforcement learning (IRL) enables an agent to learn complex behavior by observing demonstrations from a (near-)optimal policy. The typical assumption is that the learner's goal is to match the teacher's demonstrated behavior. In this paper, we consider the setting where the learner has its own preferences that it additionally takes into consideration. These preferences can, for example, capture behavioral biases, mismatched worldviews, or physical constraints. 
We study two teaching approaches: learner-agnostic teaching, where the teacher provides demonstrations from an optimal policy ignoring the learner's preferences, and learner-aware teaching, where the teacher accounts for the learner's preferences. We design learner-aware teaching algorithms and show that significant performance improvements can be achieved over learner-agnostic teaching.\n\n1 Introduction\n\nInverse reinforcement learning (IRL) enables a learning agent (learner) to acquire skills from observations of a teacher's demonstrations. The learner infers a reward function explaining the demonstrated behavior and optimizes its own behavior accordingly. IRL has been studied extensively [Abbeel and Ng, 2004, Ratliff et al., 2006, Ziebart, 2010, Boularias et al., 2011, Osa et al., 2018] under the premise that the learner is able and willing to imitate the teacher's behavior.\nIn real-world settings, however, a learner typically does not blindly follow the teacher's demonstrations, but also has its own preferences and constraints. For instance, consider demonstrating to the auto-pilot of a self-driving car how to navigate from A to B by taking the most fuel-efficient route. These demonstrations might conflict with the auto-pilot's preference to drive on highways in order to ensure maximum safety. Similarly, in robot-human interaction with the goal of teaching people how to cook, a teaching robot might demonstrate to a human user how to cook "roast chicken", which could conflict with the preferences of a learner who is "vegetarian". To give yet another example, consider a surgical training simulator which provides virtual demonstrations of expert behavior; a novice learner might not be confident enough to imitate a difficult procedure because of safety concerns. 
In all these examples, the learner might not be able to acquire useful skills from the teacher's demonstrations.\nIn this paper, we formalize the problem of teaching a learner with preferences and constraints. First, we are interested in understanding the suboptimality of learner-agnostic teaching, i.e., of ignoring the learner's preferences. Second, we are interested in designing learner-aware teachers who account for the learner's preferences and thus enable more efficient learning. To this end, we study a learner model with preferences and constraints in the context of the Maximum Causal Entropy (MCE) IRL framework [Ziebart, 2010, Ziebart et al., 2013, Zhou et al., 2018]. This enables us to formulate the teaching problem as an optimization problem, and to derive and analyze algorithms for learner-aware teaching. Our main contributions are:\nI We formalize the problem of IRL under preference constraints (Section 2 and Section 3).\nII We analyze the problem of optimizing demonstrations for the learner when preferences are known to the teacher, and we propose a bilevel optimization approach to the problem (Section 4).\nIII We propose strategies for adaptively teaching a learner with preferences unknown to the teacher, and we provide theoretical guarantees under natural assumptions (Section 5).\nIV We empirically show that significant performance improvements can be achieved by learner-aware teachers as compared to learner-agnostic teachers (Section 6).\n\n\u2217Authors contributed equally to this work.\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n2 Problem Setting\nEnvironment. Our environment is described by a Markov decision process (MDP) M := (S, A, T, γ, P0, R). Here S and A denote finite sets of states and actions. T : S × S × A → [0, 1] describes the state transition dynamics, i.e., T(s' | s, a) is the probability of landing in state s' by taking action a from state s. γ ∈ (0, 1) is the discounting factor. P0 : S → [0, 1] is an initial distribution over states. R : S → R is the reward function. We assume that there exists a feature map φr : S → [0, 1]^{dr} such that the reward function is linear, i.e., R(s) = ⟨w*_r, φr(s)⟩ for some w*_r ∈ R^{dr}. Note that the bound ‖w*_r‖1 ≤ 1 ensures that |R(s)| ≤ 1 for all s.\nBasic definitions. A policy is a map π : S × A → [0, 1] such that π(· | s) is a probability distribution over actions for every state s. We denote by Π the set of all such policies. The performance measure for policies we are interested in is the expected discounted reward R(π) := E(∑_{t=0}^∞ γ^t R(s_t)), where the expectation is taken with respect to the distribution over trajectories ξ = (s0, s1, s2, ...) induced by π together with the transition probabilities T and the initial state distribution P0. A policy π is optimal for the reward function R if π ∈ arg max_{π' ∈ Π} R(π'), and we denote an optimal policy by π*. Note that R(π) = ⟨w*_r, μr(π)⟩, where μr : Π → R^{dr}, π ↦ E(∑_{t=0}^∞ γ^t φr(s_t)), is the map taking a policy to its vector of (discounted) feature expectations. We denote by Ωr = {μr(π) : π ∈ Π} the image μr(Π) of this map. Note that the set Ωr ⊆ R^{dr} is convex (see [Ziebart, 2010, Theorem 2.8] and [Abbeel and Ng, 2004]), and also bounded due to the discounting factor γ ∈ (0, 1). For a finite collection of trajectories Ξ = {s^i_0, s^i_1, s^i_2, ...}_{i=1,2,...} obtained by executing a policy π in the MDP M, we denote the empirical counterpart of μr(π) by μ̂r(Ξ) := (1/|Ξ|) ∑_i ∑_t γ^t φr(s^i_t).\nAn IRL learner and a teacher. We consider a learner L implementing an inverse reinforcement learning (IRL) algorithm and a teacher T. The teacher has access to the full MDP M; the learner knows the MDP and the parametric form of the reward function R(s) = ⟨wr, φr(s)⟩ but does not know the true reward parameter w*_r. The learner, upon receiving demonstrations from the teacher, outputs a policy πL using its algorithm. The teacher's objective is to provide a set of demonstrations ΞT to the learner that ensures that the learner's output policy πL achieves high reward R(πL).\nThe standard IRL algorithms are based on the idea of feature matching [Abbeel and Ng, 2004, Ziebart, 2010, Osa et al., 2018]: the learner's algorithm finds a policy πL that matches the feature expectations of the received demonstrations, ensuring that ‖μr(πL) − μ̂r(ΞT)‖2 ≤ ε, where ε specifies a desired level of accuracy. In this standard setting, the learner's primary goal is to imitate the teacher (via feature matching), and this makes the teaching process easy. In fact, the teacher just needs to provide a sufficiently rich pool of demonstrations ΞT obtained by executing π*, ensuring ‖μ̂r(ΞT) − μr(π*)‖2 ≤ ε. This guarantees that ‖μr(πL) − μr(π*)‖2 ≤ 2ε. Furthermore, the linearity of the rewards and ‖w*_r‖1 ≤ 1 ensure that the learner's output policy πL satisfies R(πL) ≥ R(π*) − 2ε.\nKey challenges in teaching a learner with preference constraints. In this paper, we study a novel setting where the learner has its own preferences, which it additionally takes into consideration when learning a policy πL from the teacher's demonstrations. We formally specify our learner model in the next section; here we highlight the key challenges that arise in teaching such a learner. Given that the learner's primary goal is no longer just imitating the teacher via feature matching, the learner's output policy can be suboptimal with respect to the true reward even if it had access to μr(π*), i.e.,\n\n(a) Environment (b) Set of μr(π) vectors\n\nFigure 1: An illustrative example to showcase the suboptimality of teaching when the learner has preferences and constraints. Environment: Figure 1a shows a grid-world environment inspired by the object-world and gathering game environments [Levine et al., 2010, Leibo et al., 2017, Mendez et al., 2018]. Each cell represents a state; there are five actions given by "left", "up", "right", "down", "stay"; the transitions are deterministic; and the starting state is the top-left cell. The agent's goal is to collect objects in the environment: collecting a "star" provides a reward of 1.0 and a "plus" a reward of 0.9; objects immediately appear again upon collection, and the rewards are discounted with γ close to 1. The optimal policy π* is to go to the nearest "star" and then "stay" there. Preferences: A small number of states in the environment are distractors, depicted by colored cells in Figure 1a. We consider a learner who prefers to avoid "green" distractors: it has a hard constraint that the probability of having a "green" distractor within a 3x3 neighborhood, i.e., 1-cell distance, is at most ε = 0.1. 
Feature expectation vectors: Figure 1b shows the set of feature expectation vectors {μr(π) : π ∈ Π}. The x-axis and the y-axis represent the discounted feature counts for collecting "star" and "plus" objects, respectively. The striped region represents policies that are feasible w.r.t. the learner's constraint. Suboptimality of teaching: Upon receiving demonstrations from an optimal policy π* with feature vector μr(π*), the learner under its preference constraint can best match the teacher's demonstrations (in the sense of minimizing ‖μr(πL) − μr(π*)‖2) by outputting a policy with feature vector μr(π2), which is clearly suboptimal w.r.t. the true rewards. Policy π3 with feature vector μr(π3) represents an alternate teaching policy which would have led to higher reward for the learner.\n\nthe feature expectation vector of an optimal policy π*. Figure 1 provides an illustrative example to showcase the suboptimality of teaching when the learner has preferences and constraints. The key challenge that we address in this paper is that of designing a teaching algorithm that selects demonstrations while accounting for the learner's preferences.\n\n3 Learner Model\nIn this section, we describe the learner models we consider, including different ways of defining preferences and constraints. First, we introduce some notation and definitions that will be helpful. We capture the learner's preferences via a feature map φc : S → [0, 1]^{dc}. We define φ(s) as the concatenation of the two feature maps φr(s) and φc(s), given by [φr(s)†, φc(s)†]†, and let d = dr + dc. Similar to the map μr, we define μc : Π → R^{dc}, π ↦ E(∑_{t=0}^∞ γ^t φc(s_t)), and μ : Π → R^d, π ↦ E(∑_{t=0}^∞ γ^t φ(s_t)). 
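As an aside, the empirical counterparts of these feature expectation maps are straightforward to estimate from sampled trajectories. The following is a minimal sketch (not code from the paper; the state representation and the feature map are hypothetical) of the Monte Carlo estimate μ̂(Ξ) = (1/|Ξ|) ∑_i ∑_t γ^t φ(s^i_t):

```python
# Minimal sketch: Monte Carlo estimate of discounted feature expectations.
# The feature map `phi` and the trajectories below are hypothetical examples,
# not the paper's object-world environment.

def empirical_feature_expectations(trajectories, phi, gamma=0.99):
    """Return mu_hat = (1/|Xi|) * sum_i sum_t gamma^t * phi(s_t^i)."""
    d = len(phi(trajectories[0][0]))
    mu_hat = [0.0] * d
    for traj in trajectories:
        for t, s in enumerate(traj):
            feats = phi(s)
            for k in range(d):
                mu_hat[k] += (gamma ** t) * feats[k]
    n = len(trajectories)
    return [v / n for v in mu_hat]

# Hypothetical 2-dimensional indicator features over two states "a" and "b".
phi = lambda s: [1.0 if s == "a" else 0.0, 1.0 if s == "b" else 0.0]
mu_hat = empirical_feature_expectations([["a", "b"]], phi, gamma=0.5)
# state "a" at t=0 contributes 1.0, state "b" at t=1 contributes 0.5
```

The same estimator applies unchanged to φr, φc, or the concatenated φ; only the feature map passed in differs.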
Similar to Ωr, we define Ωc ⊆ R^{dc} and Ω ⊆ R^d as the images μc(Π) and μ(Π) of these maps. Note that for any policy π ∈ Π, we have μ(π) = [μr(π)†, μc(π)†]†.\nStandard (discounted) MCE-IRL. Our learner models build on the (discounted) Maximum Causal Entropy (MCE) IRL framework [Ziebart et al., 2008, Ziebart, 2010, Ziebart et al., 2013, Zhou et al., 2018]. In the standard (discounted) MCE-IRL framework, a learning agent aims to identify a policy that matches the feature expectations of the teacher's demonstrations while simultaneously maximizing the (discounted) causal entropy H(π) := H({a_t}_{t=0,1,...} ‖ {s_t}_{t=0,1,...}) := ∑_{t=0}^∞ γ^t E[−log π(a_t | s_t)]. More background is provided in Appendix D of the supplementary.\nIncluding preference constraints. The standard framework can be readily extended to include the learner's preferences in the form of constraints on the preference features φc. Clearly, the learner's preferences can render exact matching of the teacher's demonstrations infeasible, and hence we relax this condition. To this end, we consider the following generic learner model:\n\nmax_{π, δsoft_r ≥ 0, δsoft_c ≥ 0}  H(π) − Cr · ‖δsoft_r‖p − Cc · ‖δsoft_c‖p\ns.t.  |μr(π)[i] − μ̂r(ΞT)[i]| ≤ δhard_r[i] + δsoft_r[i]  ∀i ∈ {1, 2, . . . , dr}\n  gj(μc(π)) ≤ δhard_c[j] + δsoft_c[j]  ∀j ∈ {1, 2, . . . , m}.  (1)\n\nHere, the gj : R^{dc} → R are m convex functions representing preference constraints. The coefficients Cr and Cc are the learner's parameters which quantify the relative importance of matching the teacher's demonstrations and of satisfying the learner's preferences. The learner model is further characterized by the parameters δhard_r ∈ R^{dr}_{≥0} and δhard_c ∈ R^m_{≥0}. The optimization variables for the learner are π, δsoft_r[i], and δsoft_c[j] (we will use the vector notation δsoft_r ∈ R^{dr}_{≥0} and δsoft_c ∈ R^m_{≥0}). The parameters (δhard_r, δhard_c) and the optimization variables (δsoft_r, δsoft_c) characterize the following behavior:\n• While a mismatch of up to δhard_r between the learner's and the teacher's reward feature expectations incurs no cost regarding the optimization objective, a mismatch larger than δhard_r incurs a cost of Cr · ‖δsoft_r‖p.\n• Similarly, while a violation of up to δhard_c of the learner's preference constraints incurs no cost regarding the optimization objective, a violation larger than δhard_c incurs a cost of Cc · ‖δsoft_c‖p.\nNext, we discuss two special instances of this generic learner model.\n\n3.1 Learner Model with Hard Preference Constraints\n\nIt is instructive to study the following special case of the above generic learner model:\n\nmin_π  ‖μr(π) − μ̂r(ΞT)‖p\ns.t.  gj(μc(π)) ≤ 0  ∀j ∈ {1, 2, . . . , m}.  (2)\n\n
Let us consider the model in Eq. (1) with δhard_r = 0, δhard_c = 0, in the limiting case Cr, Cc ≫ 0 where the term H(π) can be neglected. If we additionally assume that Cc ≫ Cr, the learner's objective can be thought of as finding a policy π that minimizes the Lp-norm distance to the reward feature expectations of the teacher's demonstrations while satisfying the constraints gj(μc(π)) ≤ 0 ∀j ∈ {1, 2, . . . , m}. More formally, we study the learner model given in Eq. (2).\nTo get a better understanding of the model, we define the learner's constraint set as ΩL := {μ ∈ Ω : gj(μc) ≤ 0 ∀j ∈ {1, 2, . . . , m}}. Similarly, we define ΩL_r ⊆ Ωr, where ΩL_r is the projection of the set ΩL onto the subspace R^{dr}. We can now rewrite the optimization problem in Eq. (2) as min_{π : μr(π) ∈ ΩL_r} ‖μr(π) − μ̂r(ΞT)‖p. Hence, the learner's behavior is given by:\n(i) Learner can match: When μ̂r(ΞT) ∈ ΩL_r, the learner outputs a policy πL s.t. μr(πL) = μ̂r(ΞT).\n(ii) Learner cannot match: Otherwise, the learner outputs a policy πL such that μr(πL) is the Lp-norm projection of the vector μ̂r(ΞT) onto the set ΩL_r.\nFigure 1 provides an illustration of the behavior of this learner model. We will design learner-aware teaching algorithms for this learner model in Section 4.1 and Section 5.\n\n3.2 Learner Model with Soft Preference Constraints\n\nAnother interesting learner model that we study in this paper arises from the generic learner model when we consider m = dc box-type linear constraints with gj(μc(π)) = μc(π)[j] ∀j ∈ {1, 2, . . . , dc}. 
We consider an L1-norm penalty on violations, and for simplicity we consider δhard_r[i] = 0 ∀i ∈ {1, 2, . . . , dr}. In this case, the learner's model is given by\n\nmax_{π, δsoft_r ≥ 0, δsoft_c ≥ 0}  H(π) − Cr · ‖δsoft_r‖1 − Cc · ‖δsoft_c‖1\ns.t.  |μr(π)[i] − μ̂r(ΞT)[i]| ≤ δsoft_r[i]  ∀i ∈ {1, 2, . . . , dr}\n  μc(π)[j] ≤ δhard_c[j] + δsoft_c[j]  ∀j ∈ {1, 2, . . . , dc}.  (3)\n\nThe solution to this problem corresponds to a softmax policy with a reward function Rλ(s) = ⟨wλ, φ(s)⟩, where wλ ∈ R^d is parametrized by λ. The optimal parameters λ can be computed efficiently, and the corresponding softmax policy is then obtained by the Soft-Value-Iteration procedure (see [Ziebart, 2010, Algorithm 9.1], [Zhou et al., 2018]). Details are provided in Appendix E of the supplementary. We will design learner-aware teaching algorithms for this learner model in Section 4.2.\n\n4 Learner-aware Teaching under Known Constraints\n\nIn this section, we analyze the setting in which the teacher has full knowledge of the learner's constraints.\n\n4.1 A Learner-aware Teacher for Hard Preferences: AWARE-CMDP\n\nHere, we design a learner-aware teaching algorithm for the learner from Section 3.1. Given that the teacher has full knowledge of the learner's preferences, it can compute an optimal teaching policy by maximizing the reward over policies that satisfy the learner's preference constraints, i.e., the teacher solves the constrained-MDP problem (see [De, 1960, Altman, 1999]) given by\n\nmax_π  ⟨w*_r, μr(π)⟩  s.t.  μr(π) ∈ ΩL_r.\n\nWe refer to an optimal solution of this problem as πaware and to the corresponding teacher as AWARE-CMDP. We can make the following observation formalizing the value of learner-aware teaching:\nTheorem 1. For simplicity, assume that the teacher can provide the exact feature expectations μ(π) of a policy instead of providing demonstrations to the learner. Then, the value of learner-aware teaching is\n\nmax_{π : μr(π) ∈ ΩL_r} ⟨w*_r, μr(π)⟩ − ⟨w*_r, Proj_{ΩL_r}(μr(π*))⟩ ≥ 0.\n\nWhen the set ΩL is defined via a set of linear constraints, the above problem can be formulated as a linear program and solved exactly. Details are provided in Appendix F of the supplementary material.\n\n4.2 A Learner-aware Teacher for Soft Preferences: AWARE-BIL\n\nFor the learner models in Section 3, the optimal learner-aware teaching problem can be naturally formalized as the following bi-level optimization problem:\n\nmax_{πT}  R(πL)  s.t.  πL ∈ arg max_π IRL(π, μ(πT)),  (4)\n\nwhere IRL(π, μ(πT)) stands for the IRL problem solved by the learner given demonstrations from πT and can include the preferences of the learner (see Eq. 1 in Section 3).\nThere are many possibilities for solving this bi-level optimization problem; see, e.g., [Sinha et al., 2018] for an overview. In this paper, we adopt a single-level reduction approach to simplify the above bi-level optimization problem, as this results in particularly intuitive optimization problems for the teacher. 
The basic idea of single-level reduction is to replace the lower-level problem, i.e., arg max_π IRL(π, μ(πT)), by the optimality conditions for that problem given by the Karush-Kuhn-Tucker conditions [Boyd and Vandenberghe, 2004, Sinha et al., 2018]. For the learner model outlined in Section 3.2, this reduction takes the following form (see Appendix G in the supplementary material for details):\n\nmax_{λ := {αlow ∈ R^{dr}, αup ∈ R^{dr}, β ∈ R^{dc}}}  ⟨w*_r, μr(πλ)⟩\ns.t.  0 ≤ αlow ≤ Cr,  0 ≤ αup ≤ Cr,\n  {0 ≤ β ≤ Cc AND μc(πλ) ≤ δhard_c} OR {β = Cc AND μc(πλ) ≥ δhard_c},  (5)\n\nwhere πλ corresponds to a softmax policy with a reward function Rλ(s) = ⟨wλ, φ(s)⟩ for wλ = [(αlow − αup)†, −β†]†. Thus, finding optimal demonstrations means optimizing over softmax teaching policies while respecting the learner's preferences. To actually solve the above optimization problem and find good teaching policies, we use an approach inspired by the Frank-Wolfe algorithm [Jaggi, 2013], detailed in Appendix G of the supplementary material. We refer to a teacher implementing this approach as AWARE-BIL.\n\n5 Learner-Aware Teaching Under Unknown Constraints\n\nIn this section, we consider the more realistic and challenging setting in which the teacher T does not know the learner L's constraint set ΩL_r. Without feedback from L, T can generally not do better than the agnostic teacher who simply ignores any constraints. We therefore assume that T and L interact in rounds as described by Algorithm 1. 
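In code, a round-based interaction of this kind can be sketched as follows. This is a schematic only: the stopping threshold and the placeholder teacher/learner updates below are hypothetical, not the paper's ADAWARE-VOL or ADAWARE-LIN implementations.

```python
# Schematic of the round-based teacher-learner interaction (cf. Algorithm 1).
# `teacher_demo`, `learner_respond`, and `adapt` are hypothetical callbacks;
# the paper's adaptive teachers plug concrete update rules into `adapt`.

def dist(a, b):
    """Euclidean distance between two feature-expectation vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def interact(teacher_demo, learner_respond, adapt, rounds=20, eps=1e-3):
    """Generic loop: teach, observe the learner's features, adapt, repeat."""
    mu_T = teacher_demo()                 # initial demonstration features
    mu_L = learner_respond(mu_T)
    for _ in range(rounds):
        if dist(mu_L, mu_T) <= eps:       # learner (almost) matches: stop
            break
        mu_T = adapt(mu_T, mu_L)          # teacher adapts the teaching policy
        mu_L = learner_respond(mu_T)
    return mu_L

# Toy 1-D example: the learner caps its feature at 4.0 (a stand-in for a
# constraint), and the teacher simply moves its demonstration to the
# learner's last response.
result = interact(
    teacher_demo=lambda: (10.0,),
    learner_respond=lambda mu: (min(mu[0], 4.0),),
    adapt=lambda mu_T, mu_L: mu_L,
)
# -> (4.0,)
```

The two adaptive teachers of Sections 5.1 and 5.2 differ only in what `adapt` does with the observed learner features (volume reduction vs. line search).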
The two versions of the algorithm we describe in Sections 5.1 and 5.2 are obtained by specifying how T adapts the teaching policy in each round.\n\nAlgorithm 1 Teacher-learner interaction in the adaptive teaching setting\n1: Initial teaching policy πT,0 (e.g., optimal policy ignoring any constraints)\n2: for round i = 0, 1, 2, . . . do\n3: Teacher provides demonstrations with feature vector μr^{T,i} using policy πT,i\n4: Learner upon receiving μr^{T,i} computes a policy πL,i with feature vector μr^{L,i}\n5: Teacher observes the learner's feature vector μr^{L,i} and adapts the teaching policy\n\nIn this section, we assume that L is as described in Section 3.1: given demonstrations ΞT, L finds a policy πL such that μr(πL) matches the L2-projection of μ̂r(ΞT) onto ΩL_r. For the sake of simplifying the presentation and the analysis, we also assume that L and T can observe the exact feature expectations of their respective policies, e.g., μ̂r(ΞT) = μr(πT) if ΞT is sampled from πT.\n\n5.1 An Adaptive Learner-aware Teacher Using Volume Search: ADAWARE-VOL\n\nIn our first adaptive teaching algorithm, ADAWARE-VOL, T maintains an estimate Ω̂L_r ⊃ ΩL_r of the learner's constraint set, which in each round gets updated by intersecting the current version with a certain affine halfspace, thus reducing the volume of Ω̂L_r. The new teaching policy is then any policy πT,i+1 which is optimal under the constraint that μr^{T,i+1} ∈ Ω̂L_r. The interaction ends as soon as ‖μr^{L,i} − μr^{T,i}‖2 ≤ ε for a threshold ε. Details are provided in Appendix C.1 of the supplementary.\nTheorem 2. Upon termination of ADAWARE-VOL, L's output policy πL satisfies R(πL) ≥ R(πaware) − ε for any policy πaware which is optimal under L's constraints. For the special case that ΩL_r is a polytope defined by m linear inequalities, the algorithm terminates in O(m^{dr}) iterations.\n\n5.2 An Adaptive Learner-aware Teacher Using Line Search: ADAWARE-LIN\n\nIn our second adaptive teaching algorithm, ADAWARE-LIN, T adapts the teaching policy by performing a binary search on a line segment of the form {μr^{L,i} + αw*_r | α ∈ [αmin, αmax]} ⊂ R^{dr} to find a vector μr^{T,i+1} = μr^{L,i} + α_i w*_r that is the vector of feature expectations of a policy; here αmax > αmin > 0 are fixed constants. If that is not successful, the teacher finds a teaching policy with μr^{T,i+1} ∈ arg min_{μr ∈ Ωr} ‖μr − μr^{L,i} − αmin · w*_r‖2. The following theorem analyzes the convergence of L's performance to RL := max_{μr ∈ Ωr} R(μr) under the assumption that T's search succeeds in every round. The proof and further details are provided in Appendix C.2 of the supplementary.\nTheorem 3. Fix some ε > 0 and assume that there exists a constant αmin > 0 such that, as long as RL − R(μr^{L,i}) > ε, the teacher can find a teaching policy πT,i+1 satisfying μr^{T,i+1} = μr^{L,i} + α_i w*_r for some α_i ≥ αmin. Then the learner's performance increases monotonically in each round of ADAWARE-LIN, i.e., R(μr^{L,i+1}) > R(μr^{L,i}). Moreover, after at most O((D²/(ε · αmin)) log(D/ε)) teaching steps, the learner's performance satisfies R(μr^{L,i}) > RL − 2ε. Here we abbreviate D := diam Ωr.\n\n6 Experimental Evaluation\n\nIn this section, we evaluate our teaching algorithms for different types of learners on the environment introduced in Figure 1. 
The environment we consider here has three types of reward objects: a "star" object with reward 1.0, a "plus" object with reward 0.9, and a "dot" object with reward 0.2. Two objects of each type are placed randomly on the grid such that there is always only a single object in each grid cell. The presence of an object of type "star", "plus", or "dot" in some state s is encoded in the reward features φr(s) by a binary indicator for each type, such that dr = 3. We use a discount factor of γ = 0.99. Upon collecting an object, there is a 0.1 probability of transitioning to a terminal state.\nLearner models. We consider a total of 5 different learners whose preferences can be described by distractors in the environment. Each learner prefers to avoid a certain subset of these distractors. There is a total of 4 distractors: (i) two "green" distractors are randomly placed at a distance of 0 cells and 1 cell to the "star" objects, respectively; (ii) two "yellow" distractors are randomly placed at a distance of 1 cell and 2 cells to the "plus" objects, respectively, see Figure 2a.\nThrough these distractors we define learners L1-L5 as follows: (L1) no preference features (dc = 0); (L2) two preference features (dc = 2) such that φc(s)[1] and φc(s)[2] are binary indicators of whether there is a "green" distractor at a distance of 0 cells or 1 cell, respectively; (L3) four preference features (dc = 4) such that φc(s)[1], φc(s)[2] are as for L2, and φc(s)[3] and φc(s)[4] are binary indicators of whether there is a "green" distractor at a distance of 2 cells or a "yellow" distractor at a distance of 0 cells, respectively; (L4) five preference features (dc = 5) such that φc(s)[1], . . . 
, φc(s)[4] are as for L3, and φc(s)[5] is a binary indicator of whether there is a "yellow" distractor at a distance of 1 cell; and (L5) six preference features (dc = 6) such that φc(s)[1], . . . , φc(s)[5] are as for L4, and φc(s)[6] is a binary indicator of whether there is a "yellow" distractor at a distance of 2 cells.\nThe first row in Figure 2 shows an instance of the considered object-worlds and indicates the preference of the learners to avoid certain regions by the gray area.\n\n(a) Environments and learners' preferences for 5 different learners L1, . . ., L5\n(b) Learners' rewards inferred from the learner-agnostic teacher's (AGNOSTIC) demonstrations\n(c) Learners' rewards inferred from the learner-aware teacher's (AWARE-BIL) demonstrations\n\nFigure 2: Teaching in object-world environments under full knowledge of the learner's preferences. Green and yellow cells indicate distractors associated with either "star" or "plus" objects, respectively. Learners' preferences to avoid cells are indicated in gray. The learner model from Section 3.2 with Cr = 5, Cc = 10, and δhard_c = 0 is considered for these experiments. The learner-aware teacher enables the learners to infer reward functions that are compatible with their preferences and achieve higher average rewards. In Figure 2b and Figure 2c, blue color represents positive reward, red color represents negative reward, and the magnitude of the reward is indicated by color intensity.\n\n6.1 Teaching under known constraints\n\nIn this section, we consider learners with soft constraints from Section 3.2, with preference features as described above and parameters Cr = 5, Cc = 10, and δhard_c = 0 (more experimental results for different values of Cr and Cc are provided in Appendix B.1 of the supplementary). Our first results are presented in Figure 2. 
The second and third rows show the rewards inferred by the learners for demonstrations provided by a learner-agnostic teacher who ignores any constraints (AGNOSTIC) and the bi-level learner-aware teacher (AWARE-BIL), respectively. We observe that AGNOSTIC fails to teach the learners about objects’ positive rewards in cases where the learners’ preferences conflict with the position of the most rewarding objects (second row). In contrast, AWARE-BIL always successfully teaches the learners about rewarding objects that are compatible with the learners’ preferences (third row).

We also compare AGNOSTIC and AWARE-BIL in terms of the reward achieved by the learner after teaching for object-worlds of size 10 × 10 in Table 1. The numbers show the average reward over 10 randomly generated object-worlds. Note that AWARE-BIL has to solve a non-convex optimization problem to find the optimal teaching policy, cf. Eq. 5. Because we use a gradient-based optimization approach, the teaching policies found can depend on the initial point for optimization. Hence, we always consider the following two initial points for optimization and select the teaching policy which results in a higher objective value: (i) all optimization variables in Eq. 5 are set to zero, and (ii) the optimization variables are initialized as αlow[i] = max{wλ[i], 0}, αup[i] = max{−wλ[i], 0}, and β = 0, where wλ is as inferred by the learner when taught by AGNOSTIC and i ∈ {1, . . . , dr}, cf. Section 3.2. From Table 1 we observe that a learner can learn better policies from a teacher that accounts for the learner’s preferences.

Table 1: Learners’ average rewards after teaching. L1, . . . , L5 correspond to learners with preferences as shown in Figure 2.
Results are averaged over 10 random object-worlds, ± standard error. Learner parameters: Cr = 5, Cc = 10.

Teacher        L1            L2            L3            L4            L5
AGNOSTIC       7.99 ± 0.02   0.01 ± 0.00   0.01 ± 0.00   0.01 ± 0.00   0.00 ± 0.00
AWARE-BIL      8.00 ± 0.02   7.20 ± 0.01   4.86 ± 0.30   3.15 ± 0.27   1.30 ± 0.07

6.2 Teaching under unknown constraints

In this section we evaluate the teaching algorithms from Section 5. We consider the learner model from Section 3.1 that uses L2-projection to match reward feature expectations as studied in Section 5, cf. Eq. 2 (see Footnote 2). For modeling the hard constraints, we consider box-type linear constraints with δ_c^hard[j] = 2.5 ∀j ∈ {1, 2, . . . , dc} for the preference features, cf. Eq. 3.

We study the learners L1, L2, and L3 with preferences corresponding to the first three object-worlds shown in Figure 2a. We report the results for learner L2 below; results for learners L1 and L3 are deferred to Appendix B.2 of the supplementary material.

In this context it is instructive to investigate how quickly these adaptive teaching strategies converge to the performance of a teacher who has full knowledge about the learner. Results comparing the adaptive teaching strategies (ADAWARE-VOL and ADAWARE-LIN) are shown in Figure 3a. We observe that both teaching strategies get close to the best possible performance under full knowledge about the learner (AWARE-CMDP). We also provide results showing the performance achieved by the adaptive teaching strategies on object-worlds of varying sizes; see Figure 3b.

Note that the performance of ADAWARE-VOL decreases slightly when teaching for more rounds, i.e., comparing the results after 3 teaching rounds and at the end of the teaching process.
This is because of approximations when the learner computes the policy via projection, which in turn leads to errors on the teacher’s side when approximating Ω̂_r^L (refer to the discussion in Footnote 2). In contrast, the performance of ADAWARE-LIN always increases when teaching for more rounds.

² To implement the learner in Eq. 2, we approximated the learner’s projection onto the set Ω_r^L as follows: we implemented the learner based on the optimization problem given in Eq. 3 with a hard constraint on preferences and an L2-norm penalty on reward mismatch scaled with a large value of Cr = 20.

Teacher               10 × 10        15 × 15        20 × 20
AWARE-CMDP            7.62 ± 0.02    7.44 ± 0.04    7.19 ± 0.04
AGNOSTIC              3.94 ± 0.09    3.84 ± 0.06    3.95 ± 0.06
CONSERV               1.68 ± 0.01    1.67 ± 0.012   1.62 ± 0.02
ADAWARE-VOL (3rd)     7.50 ± 0.14    7.50 ± 0.04    7.29 ± 0.05
ADAWARE-VOL (end)     6.85 ± 0.33    7.06 ± 0.06    6.77 ± 0.08
ADAWARE-LIN (3rd)     6.14 ± 0.08    6.28 ± 0.10    6.37 ± 0.08
ADAWARE-LIN (end)     7.64 ± 0.02    7.53 ± 0.03    7.29 ± 0.06

(a) Reward over teaching rounds    (b) Varying grid-size

Figure 3: Performance of the adaptive teaching strategies ADAWARE-VOL and ADAWARE-LIN. (left) Figure 3a shows the reward of the learner’s policy over the number of teaching interactions. The horizontal lines indicate the performance of the learner’s policy for the learner-aware teacher with full knowledge of the learner’s constraints (AWARE-CMDP), the learner-agnostic teacher (AGNOSTIC) who ignores any constraints, and a conservative teacher (CONSERV) who considers all 6 constraints (assuming the learner model L5 in Figure 2). Our adaptive teaching strategies ADAWARE-VOL and ADAWARE-LIN significantly outperform the baselines (AGNOSTIC and CONSERV) and quickly converge towards the optimal performance of AWARE-CMDP. The dotted lines ADAWARE-VOL:T and ADAWARE-LIN:T show the rewards corresponding to the teacher’s policy at each round and are shown to highlight the very different behavior of the two adaptive teaching strategies. (right) Table 3b shows results for varying grid-sizes of the environment. Results are reported at the i = 3rd round and at the “end” round when the algorithm reaches its stopping criterion. Results are reported as averages over 10 runs ± standard error, where each run corresponds to a random environment.

7 Related Work

Our work is closely related to algorithmic machine teaching [Goldman and Kearns, 1995, Zhu, 2015, Zhu et al., 2018], whose general goal is to design teaching algorithms that optimize the data that is provided to a learning algorithm. Most works in machine teaching so far focus on supervised learning tasks and assume that the learning algorithm is fully known to the teacher; see, e.g., [Zhu, 2013, Singla et al., 2014, Liu and Zhu, 2016, Mac Aodha et al., 2018].

In the IRL setting, a few works study how to provide maximally informative demonstrations to the learner, e.g., [Cakmak and Lopes, 2012, Brown and Niekum, 2019]. In contrast to our work, their teacher fully knows the learner model and provides the demonstrations without any adaptation to the learner. The question of how a teacher should adaptively react to a learner has been addressed by [Singla et al., 2013, Liu et al., 2018, Chen et al., 2018, Melo et al., 2018, Yeo et al., 2019, Hunziker et al., 2019], but only in the supervised setting. In a recent work, [Kamalaruban et al., 2019] studied the problem of adaptively teaching an IRL agent by providing an informative sequence of demonstrations. However, they assume that the teacher has full knowledge of the learner’s dynamics.

Within the area of IRL, there is a line of work on active learning approaches [Cohn et al., 2011, Brown et al., 2018, Brown and Niekum, 2018, Amin et al., 2017, Cui and Niekum, 2018], which is related to our work. In contrast to us, they take the perspective of the learner, who actively influences the demonstrations it receives. A few papers have addressed the problem that arises when the learner does not have full access to the reward features, e.g., [Levine et al., 2010] and [Haug et al., 2018].

Our work is also loosely related to multi-agent reinforcement learning. [Dimitrakakis et al., 2017] studied the interaction between agents with misaligned models, with a focus on the question of how to jointly optimize a policy. [Ghosh et al., 2019] studied the problem of designing robust AI agents that can interact with another agent of unknown type. However, these works do not tackle the problem of teaching an agent by demonstrations. Another related work is [Hadfield-Menell et al., 2016], which studied the cooperation of agents who do not perfectly understand each other.

8 Conclusions and Outlook

In this paper we considered inverse reinforcement learning in the context of learners with preferences and constraints. In this setting, the learner does not only focus on matching the teacher’s demonstrated behavior but also takes its own preferences, e.g., behavioral biases or physical constraints, into account. We developed a theoretical framework for this setting, and proposed and studied algorithms for learner-aware teaching in which the teacher accounts for the learner’s preferences for the cases of known and unknown preference constraints.
We demonstrated significant performance improvements of our learner-aware teaching strategies as compared to learner-agnostic teaching, both theoretically and empirically. Our theoretical framework and our proposed algorithms foster the application of IRL in real-world settings in which the learner does not blindly follow a teacher’s demonstrations.

There are several promising directions for future work, including but not limited to: the evaluation of our approach in machine-human and human-machine tasks; extensions of our approach to other learner models; and approaches for learning efficiently, from a learner’s point of view, from a fixed set of (potentially suboptimal) demonstrations in the case of preference constraints.

Acknowledgements

This work was supported by Microsoft Research through its PhD Scholarship Programme.

References

[Abbeel and Ng, 2004] Abbeel, P. and Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. In ICML.

[Altman, 1999] Altman, E. (1999). Constrained Markov decision processes, volume 7. CRC Press.

[Amin et al., 2017] Amin, K., Jiang, N., and Singh, S. P. (2017). Repeated inverse reinforcement learning. In NIPS, pages 1813–1822.

[Boularias et al., 2011] Boularias, A., Kober, J., and Peters, J. (2011). Relative entropy inverse reinforcement learning. In AISTATS, pages 182–189.

[Boyd and Vandenberghe, 2004] Boyd, S. and Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.

[Brown et al., 2018] Brown, D. S., Cui, Y., and Niekum, S. (2018). Risk-aware active inverse reinforcement learning. In Conference on Robot Learning, pages 362–372.

[Brown and Niekum, 2018] Brown, D. S. and Niekum, S. (2018).
Efficient probabilistic performance bounds for inverse reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence.

[Brown and Niekum, 2019] Brown, D. S. and Niekum, S. (2019). Machine teaching for inverse reinforcement learning: Algorithms and applications. In AAAI.

[Cakmak and Lopes, 2012] Cakmak, M. and Lopes, M. (2012). Algorithmic and human teaching of sequential decision tasks. In AAAI.

[Chen et al., 2018] Chen, Y., Singla, A., Mac Aodha, O., Perona, P., and Yue, Y. (2018). Understanding the role of adaptivity in machine teaching: The case of version space learners. In Advances in Neural Information Processing Systems, pages 1476–1486.

[Cohn et al., 2011] Cohn, R., Durfee, E., and Singh, S. (2011). Comparing action-query strategies in semi-autonomous agents. In AAMAS, pages 1287–1288, Richland, SC.

[Cui and Niekum, 2018] Cui, Y. and Niekum, S. (2018). Active reward learning from critiques. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6907–6914. IEEE.

[De, 1960] De, G. G. (1960). Les problèmes de décisions séquentielles. Cahiers du Centre d’Études de Recherche Opérationnelle, 2:161–179.

[Dimitrakakis et al., 2017] Dimitrakakis, C., Parkes, D. C., Radanovic, G., and Tylkin, P. (2017). Multi-view decision processes: The helper-AI problem. In Advances in Neural Information Processing Systems, pages 5443–5452.

[Ghosh et al., 2019] Ghosh, A., Tschiatschek, S., Mahdavi, H., and Singla, A. (2019). Towards deployment of robust AI agents for human-machine partnerships. In Workshop on Safety and Robustness in Decision Making (SRDM) at NeurIPS’19.

[Goldman and Kearns, 1995] Goldman, S. A. and Kearns, M. J. (1995). On the complexity of teaching. Journal of Computer and System Sciences, 50(1):20–31.

[Hadfield-Menell et al., 2016] Hadfield-Menell, D., Russell, S. J., Abbeel, P., and Dragan, A.
(2016). Cooperative inverse reinforcement learning. In NIPS.

[Haug et al., 2018] Haug, L., Tschiatschek, S., and Singla, A. (2018). Teaching inverse reinforcement learners via features and demonstrations. In Advances in Neural Information Processing Systems, pages 8473–8482.

[Hunziker et al., 2019] Hunziker, A., Chen, Y., Mac Aodha, O., Rodriguez, M. G., Krause, A., Perona, P., Yue, Y., and Singla, A. (2019). Teaching multiple concepts to a forgetful learner. In Advances in Neural Information Processing Systems.

[Jaggi, 2013] Jaggi, M. (2013). Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, pages 427–435.

[Kamalaruban et al., 2019] Kamalaruban, P., Devidze, R., Cevher, V., and Singla, A. (2019). Interactive teaching algorithms for inverse reinforcement learning. In IJCAI, pages 2692–2700.

[Leibo et al., 2017] Leibo, J. Z., Zambaldi, V., Lanctot, M., Marecki, J., and Graepel, T. (2017). Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pages 464–473.

[Levine et al., 2010] Levine, S., Popovic, Z., and Koltun, V. (2010). Feature construction for inverse reinforcement learning. In NIPS, pages 1342–1350.

[Liu and Zhu, 2016] Liu, J. and Zhu, X. (2016). The teaching dimension of linear learners. Journal of Machine Learning Research, 17(162):1–25.

[Liu et al., 2018] Liu, W., Dai, B., Li, X., Rehg, J. M., and Song, L. (2018). Towards black-box iterative machine teaching. In ICML.

[Mac Aodha et al., 2018] Mac Aodha, O., Su, S., Chen, Y., Perona, P., and Yue, Y. (2018). Teaching categories to human learners with visual explanations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3820–3828.

[Melo et al., 2018] Melo, F.
S., Guerra, C., and Lopes, M. (2018). Interactive optimal teaching with unknown learners. In IJCAI, pages 2567–2573.

[Mendez et al., 2018] Mendez, J. A. M., Shivkumar, S., and Eaton, E. (2018). Lifelong inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 4507–4518.

[Osa et al., 2018] Osa, T., Pajarinen, J., Neumann, G., Bagnell, J. A., Abbeel, P., Peters, J., et al. (2018). An algorithmic perspective on imitation learning. Foundations and Trends® in Robotics, 7(1-2):1–179.

[Ratliff et al., 2006] Ratliff, N. D., Bagnell, J. A., and Zinkevich, M. A. (2006). Maximum margin planning. In ICML, pages 729–736.

[Singla et al., 2013] Singla, A., Bogunovic, I., Bartók, G., Karbasi, A., and Krause, A. (2013). On actively teaching the crowd to classify. In NIPS Workshop on Data Driven Education.

[Singla et al., 2014] Singla, A., Bogunovic, I., Bartók, G., Karbasi, A., and Krause, A. (2014). Near-optimally teaching the crowd to classify. In ICML.

[Sinha et al., 2018] Sinha, A., Malo, P., and Deb, K. (2018). A review on bilevel optimization: From classical to evolutionary approaches and applications. IEEE Transactions on Evolutionary Computation, 22(2):276–295.

[Yeo et al., 2019] Yeo, T., Kamalaruban, P., Singla, A., Merchant, A., Asselborn, T., Faucon, L., Dillenbourg, P., and Cevher, V. (2019). Iterative classroom teaching. In AAAI, pages 5684–5692.

[Zhou et al., 2018] Zhou, Z., Bloem, M., and Bambos, N. (2018). Infinite time horizon maximum causal entropy inverse reinforcement learning. IEEE Transactions on Automatic Control, 63(9):2787–2802.

[Zhu, 2013] Zhu, X. (2013). Machine teaching for Bayesian learners in the exponential family. In NIPS, pages 1905–1913.

[Zhu, 2015] Zhu, X. (2015). Machine teaching: An inverse problem to machine learning and an approach toward optimal education.
In AAAI, pages 4083–4087.

[Zhu et al., 2018] Zhu, X., Singla, A., Zilles, S., and Rafferty, A. N. (2018). An overview of machine teaching. arXiv:1801.05927.

[Ziebart, 2010] Ziebart, B. D. (2010). Modeling purposeful adaptive behavior with the principle of maximum causal entropy. PhD thesis, Carnegie Mellon University.

[Ziebart et al., 2013] Ziebart, B. D., Bagnell, J. A., and Dey, A. K. (2013). The principle of maximum causal entropy for estimating interacting processes. IEEE Transactions on Information Theory, 59(4):1966–1980.

[Ziebart et al., 2008] Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. (2008). Maximum entropy inverse reinforcement learning. In AAAI.