{"title": "Teaching Inverse Reinforcement Learners via Features and Demonstrations", "book": "Advances in Neural Information Processing Systems", "page_first": 8464, "page_last": 8473}

Teaching Inverse Reinforcement Learners via Features and Demonstrations

Luis Haug
Department of Computer Science, ETH Zurich
lhaug@inf.ethz.ch

Sebastian Tschiatschek
Microsoft Research, Cambridge, UK
setschia@microsoft.com

Adish Singla
Max Planck Institute for Software Systems, Saarbrücken, Germany
adishs@mpi-sws.org

Abstract

Learning near-optimal behaviour from an expert's demonstrations typically relies on the assumption that the learner knows the features that the true reward function depends on. In this paper, we study the problem of learning from demonstrations in the setting where this is not the case, i.e., where there is a mismatch between the worldviews of the learner and the expert.
We introduce a natural quantity, the teaching risk, which measures the potential suboptimality of policies that look optimal to the learner in this setting. We show that bounds on the teaching risk guarantee that the learner is able to find a near-optimal policy using standard algorithms based on inverse reinforcement learning. Based on these findings, we suggest a teaching scheme in which the expert can decrease the teaching risk by updating the learner's worldview, and thus ultimately enable her to find a near-optimal policy.

1 Introduction

Reinforcement learning has recently led to impressive and widely recognized results in several challenging application domains, including game-play, e.g., of classical games (Go) and Atari games. In these applications, a clearly defined reward function, i.e., whether a game is won or lost in the case of Go or the number of achieved points in the case of Atari games, is optimized by a reinforcement learning agent interacting with the environment.

However, in many applications it is very difficult to specify a reward function that captures all important aspects. For instance, in an autonomous driving application, the reward function of an autonomous vehicle should capture many different desiderata, including the time to reach a specified goal, safe driving characteristics, etc. In such situations, learning from demonstrations can be a remedy, replacing the need to specify the reward function with the task of providing an expert's demonstrations of desired behaviour; we will refer to this expert as the teacher. Based on these demonstrations, a learning agent, or simply learner, attempts to infer a (stochastic) policy that approximates the feature counts of the teacher's demonstrations.
Examples of algorithms that can be used to that end are those in [Abbeel and Ng, 2004] and [Ziebart et al., 2008], which use inverse reinforcement learning (IRL) to estimate a reward function for which the demonstrated behaviour is optimal, and then derive a policy based on that.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

For this strategy to be successful, i.e., for the learner to find a policy that achieves good performance with respect to the reward function set by the teacher, the learner has to know the features that the teacher considers and that the reward function depends on. However, as we argue, this assumption does not hold in many real-world applications. For instance, in the autonomous driving application, the teacher, e.g., a human driver, might consider very different features, including high-level semantic features, while the learner, i.e., the autonomous car, only has sensory inputs in the form of distance sensors and cameras providing low-level semantic features. In such a case, there is a mismatch between the teacher's and the learner's features, which can lead to degraded performance and unexpected behaviour of the learner.

In this paper we investigate exactly this setting. We assume that the true reward function is a linear combination of a set of features known to the teacher. The learner also assumes that the reward function is linear, but in features which are different from the truly relevant ones; e.g., the learner could only observe a subset of those features. In this setting, we study the potential decrease in performance of the learner as a function of the learner's worldview.
We introduce a natural and easily computable quantity, the teaching risk, which bounds the maximum possible performance gap between the teacher and the learner.

We continue our investigation by considering a teaching scenario in which the teacher can provide additional features to the learner, e.g., add additional sensors to an autonomous vehicle. This naturally raises the question of which features should be provided to the learner to maximize her performance. To this end, we propose an algorithm that greedily minimizes the teaching risk, thereby shrinking the maximal gap in performance between policies optimized with respect to the learner's and the teacher's worldviews.

Our main contributions are:

1. We formalize the problem of worldview mismatch for reward computation and policy optimization based on demonstrations.
2. We introduce the concept of teaching risk, bounding the maximal performance gap between the teacher and the learner as a function of the learner's worldview and the true reward function.
3. We formally analyze the teaching risk and its properties, giving rise to an algorithm for teaching a learner with an incomplete worldview.
4. We substantiate our findings in a large set of experiments.

2 Related Work

Our work is related to the area of algorithmic machine teaching, where the objective is to design effective teaching algorithms to improve the learning process of a learner [Zhu et al., 2018, Zhu, 2015]. Machine teaching has recently been studied in the context of diverse real-world applications such as personalized education and intelligent tutoring systems [Hunziker et al., 2018, Rafferty et al., 2016, Patil et al., 2014], social robotics [Cakmak and Thomaz, 2014], adversarial machine learning [Mei and Zhu, 2015], program synthesis [Mayer et al., 2017], and human-in-the-loop crowdsourcing systems [Singla et al., 2014, Singla et al., 2013].
However, different from ours, most of the current work in machine teaching is limited to supervised learning settings, and to a setting where the teacher has full knowledge about the learner's model.

Going beyond supervised learning, [Cakmak et al., 2012, Brown and Niekum, 2018, Kamalaruban et al., 2018] have studied the problem of teaching an IRL agent, similar in spirit to what we do in our work. Our work differs from theirs in several aspects: they assume that the teacher has full knowledge of the learner's feature space, and then provides a near-optimal set/sequence of demonstrations; we consider a more realistic setting where there is a mismatch between the teacher's and the learner's feature spaces. Furthermore, in our setting, the teaching signal is a mixture of demonstrations and features.

Our work is also related to teaching via explanations and features as explored recently by [Aodha et al., 2018] in a supervised learning setting. However, we explore the space of teaching by explanations when teaching an IRL agent, which makes our work technically very different from [Aodha et al., 2018]. Another important aspect of our teaching algorithm is that it is adaptive in nature, in the sense that the next teaching signal accounts for the current performance of the learner (i.e., her worldview in our setting). Recent work of [Chen et al., 2018, Liu et al., 2017, Yeo et al., 2019] has studied adaptive teaching algorithms, however only in a supervised learning setting.

Apart from machine teaching, our work is related to [Stadie et al., 2017] and [Sermanet et al., 2018], which also study imitation learning problems in which the teacher and the learner view the world differently. However, these two works are technically very different from ours, as we consider the problem of providing teaching signals under worldview mismatch from the perspective of the teacher.

3 The Model

Basic definitions.
Our environment is described by a Markov decision process $M = (S, A, T, D, R, \gamma)$, where $S$ is a finite set of states, $A$ is a finite set of available actions, $T$ is a family of distributions on $S$ indexed by $S \times A$ with $T_{s,a}(s')$ describing the probability of transitioning from state $s$ to state $s'$ when action $a$ is taken, $D$ is the initial-state distribution on $S$ describing the probability of starting in a given state, $R : S \to \mathbb{R}$ is a reward function, and $\gamma \in (0,1)$ is a discount factor. We assume that there exists a feature map $\phi : S \to \mathbb{R}^k$ such that the reward function is linear in the features given by $\phi$, i.e.,

$$R(s) = \langle w^*, \phi(s) \rangle$$

for some $w^* \in \mathbb{R}^k$ which we assume to satisfy $\|w^*\| = 1$.

By a policy we mean a family of distributions on $A$ indexed by $S$, where $\pi_s(a)$ describes the probability of taking action $a$ in state $s$. We denote by $\Pi$ the set of all such policies. The performance measure for policies we are interested in is the expected discounted reward $R(\pi) := \mathbb{E}\left(\sum_{t=0}^{\infty} \gamma^t R(s_t)\right)$, where the expectation is taken with respect to the distribution over trajectories $(s_0, s_1, \ldots)$ induced by $\pi$ together with the transition probabilities $T$ and the initial-state distribution $D$. We call a policy $\pi$ optimal for the reward function $R$ if $\pi \in \arg\max_{\pi' \in \Pi} R(\pi')$. Note that

$$R(\pi) = \langle w^*, \mu(\pi) \rangle,$$

where $\mu : \Pi \to \mathbb{R}^k$, $\pi \mapsto \mathbb{E}\left(\sum_{t=0}^{\infty} \gamma^t \phi(s_t)\right)$, is the map taking a policy to its vector of (discounted) feature expectations.
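As a concrete illustration of these definitions, the feature expectations $\mu(\pi)$ can be computed in closed form for a small MDP via the discounted state-occupancy vector. The following sketch assumes numpy and a made-up 3-state, 2-action MDP; it only illustrates the definitions above and is not part of the paper's experimental code.

```python
import numpy as np

gamma = 0.9
# Hypothetical MDP, for illustration only: P[a][s][s'] = T_{s,a}(s').
P = np.array([
    [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.5, 0.5]],  # action 0
    [[0.5, 0.0, 0.5], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]],  # action 1
])
D = np.array([1.0, 0.0, 0.0])        # initial-state distribution
Phi = np.eye(3)                      # feature map as a |S| x k matrix (one-hot)
w_star = np.array([1.0, -1.0, 0.5])
w_star /= np.linalg.norm(w_star)     # the model assumes ||w*|| = 1

def feature_expectations(pi, P, D, Phi, gamma):
    """mu(pi) = E[sum_t gamma^t phi(s_t)], computed exactly:
    with P_pi[s, s'] = sum_a pi[s, a] * P[a][s][s'], the discounted
    occupancy is nu = (I - gamma * P_pi^T)^{-1} D and mu(pi) = Phi^T nu."""
    P_pi = np.einsum('sa,asx->sx', pi, P)
    nu = np.linalg.solve(np.eye(len(D)) - gamma * P_pi.T, D)
    return Phi.T @ nu

pi = np.array([[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]])  # some stochastic policy
mu_pi = feature_expectations(pi, P, D, Phi, gamma)
R_pi = w_star @ mu_pi                # R(pi) = <w*, mu(pi)>
```

With one-hot features, the entries of $\mu(\pi)$ sum to $\sum_t \gamma^t = 1/(1-\gamma)$, which gives a convenient sanity check.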
Note also that the image $\mu(\Pi)$ of this map is a bounded subset of $\mathbb{R}^k$ due to the finiteness of $S$ and the presence of the discount factor $\gamma \in (0,1)$; we denote by $\operatorname{diam} \mu(\Pi) = \sup_{\mu_0, \mu_1 \in \mu(\Pi)} \|\mu_0 - \mu_1\|$ its diameter. Here and in what follows, $\|\cdot\|$ denotes the Euclidean norm.

Problem formulation. We consider a learner $L$ and a teacher $T$, whose ultimate objective is that $L$ finds a near-optimal policy $\pi_L$ with the help of $T$.

The challenge we address in this paper is achieving this objective under the assumption that there is a mismatch between the worldviews of $L$ and $T$, by which we mean the following: instead of the "true" feature vectors $\phi(s)$, $L$ observes feature vectors $A_L \phi(s) \in \mathbb{R}^\ell$, where

$$A_L : \mathbb{R}^k \to \mathbb{R}^\ell$$

is a linear map (i.e., a matrix) that we interpret as $L$'s worldview. The simplest case is that $A_L$ selects a subset of the features given by $\phi(s)$, thus modelling the situation where $L$ only has access to a subset of the features relevant for the true reward, which is a reasonable assumption for many real-world situations. More generally, $A_L$ could encode different weightings of those features.

The question we ask is whether and how $T$ can provide demonstrations or perform other teaching interventions so as to ensure that $L$ achieves the goal of finding a policy with near-optimal performance.

Assumptions on the teacher and on the learner. We assume that $T$ knows the full specification of the MDP as well as $L$'s worldview $A_L$, and that she can help $L$ to learn in two different ways:

1. By providing $L$ with demonstrations of behaviour in the MDP;
2.
By updating $L$'s worldview $A_L$.

Demonstrations can be provided in the form of trajectories sampled from a (not necessarily optimal) policy $\pi_T$, or in the form of (discounted) feature expectations of such a policy. The method by which $T$ can update $A_L$ will be discussed in Section 5. Based on $T$'s instructions, $L$ then attempts to train a policy $\pi_L$ whose feature expectations approximate those of $\pi_T$. Note that, if this is successful, the performance of $\pi_L$ is close to that of $\pi_T$ due to the form of the reward function.

Figure 1: A simple example to illustrate the challenges arising when teaching under worldview mismatch. We consider an MDP in which $S = \{s_1, \ldots, s_5\}$ is the set of cells in the gridworld displayed, $A = \{\leftarrow, \rightarrow\}$, and $R(s) = \langle w^*, \phi(s) \rangle$ with feature map $\phi : S \to \mathbb{R}^5$ taking $s_i$ to the one-hot vector $e_i \in \mathbb{R}^5$. The initial state distribution is uniform and the transition dynamics are deterministic. More specifically, when the agent takes action $\rightarrow$ (resp. $\leftarrow$), it moves to the neighboring cell to the right (resp. left); when the agent is in the rightmost (resp. leftmost) cell, the action $\rightarrow$ (resp. $\leftarrow$) is not permitted. The reward weights are given by $w^* = (-1, -0.5, 0, 0.5, 1)^T \in \mathbb{R}^5$ up to normalization; the values $R(s_i) = \langle w^*, \phi(s_i) \rangle$ are also encoded by the colors of the cells. The policy $\pi^*$ in (b) is the optimal policy with respect to the true reward function. Assuming that the learner $L$ only observes the feature corresponding to the central cell, i.e., $A_L = (0\ 0\ 1\ 0\ 0) \in \mathbb{R}^{1 \times 5}$, the policy $\pi_0$ in (a) is a better teaching policy in the worst-case sense. See the main text for a detailed description.

We assume that $L$ has access to an algorithm that enables her to do the following: whenever she is given sufficiently many demonstrations sampled from a policy $\pi_T$, she is able to find a policy $\pi_L$ whose feature expectations in her worldview approximate those of $\pi_T$, i.e., $A_L \mu(\pi_L) \approx A_L \mu(\pi_T)$. Examples of algorithms that $L$ could use to that end are those in [Abbeel and Ng, 2004] and [Ziebart et al., 2008], which are based on IRL. The following discussion does not require any further specification of the precise algorithm $L$ uses in order to match feature expectations.

Challenges when teaching under worldview mismatch. If there were no mismatch in the worldview (i.e., if $A_L$ were the identity matrix in $\mathbb{R}^{k \times k}$), the teacher could simply provide demonstrations from the optimal policy $\pi^*$ to achieve the desired objective. However, the example in Figure 1 illustrates that this is not the case when there is a mismatch between the worldviews.

For the MDP in Figure 1, assume that the teacher provides demonstrations using $\pi_T = \pi^*$, which moves to the rightmost cell as quickly as possible and then alternates between cells 4 and 5 (see Figure 1b). Note that the policy $\tilde{\pi}^*$ which moves to the leftmost cell as quickly as possible and then alternates between cells 1 and 2 has the same feature expectations as $\pi^*$ in the learner's worldview; in fact, $\tilde{\pi}^*$ is the unique policy other than $\pi^*$ with that property (provided we restrict to deterministic policies).
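The indistinguishability of $\pi^*$ and $\tilde{\pi}^*$ to this learner can be checked numerically. The sketch below reconstructs the five-cell chain of Figure 1 (deterministic dynamics, uniform start, one-hot features) with numpy; the next-state encodings of the policies are our own illustration, not code from the paper.

```python
import numpy as np

gamma, k = 0.9, 5
D = np.full(k, 1.0 / k)                          # uniform initial distribution
w_star = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
w_star /= np.linalg.norm(w_star)                 # normalize as in the model
A_L = np.array([[0.0, 0.0, 1.0, 0.0, 0.0]])      # L sees only the central cell

def mu(next_state):
    """Feature expectations of a deterministic policy given as a
    state -> next-state map (0-indexed). With one-hot features, mu equals
    the discounted occupancy nu = (I - gamma * P^T)^{-1} D."""
    P = np.zeros((k, k))
    for s, t in enumerate(next_state):
        P[s, t] = 1.0
    return np.linalg.solve(np.eye(k) - gamma * P.T, D)

mu_star   = mu([1, 2, 3, 4, 3])  # pi*: go right, then alternate cells 4 and 5
mu_star_t = mu([1, 0, 1, 2, 3])  # mirror image: alternate cells 1 and 2
mu_0      = mu([1, 2, 3, 2, 3])  # pi_0: go to the centre, alternate cells 3 and 4
mu_0_t    = mu([1, 2, 1, 2, 3])  # mirror image: alternate cells 2 and 3
returns = [float(w_star @ m) for m in (mu_star, mu_0, mu_0_t, mu_star_t)]
```

In $L$'s worldview the paired policies coincide, $A_L \mu(\pi^*) = A_L \mu(\tilde{\pi}^*)$ and $A_L \mu(\pi_0) = A_L \mu(\tilde{\pi}_0)$, while the true returns are strictly ordered $R(\pi^*) > R(\pi_0) > R(\tilde{\pi}_0) > R(\tilde{\pi}^*)$.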
As the teacher is unaware of the internal workings of the learner, she has no control over which of these two policies the learner will eventually learn by matching feature expectations. However, the teacher can ensure that the learner achieves better performance in a worst-case sense by providing demonstrations tailored to the problem at hand. In particular, assume that the teacher uses $\pi_T = \pi_0$, the policy shown in Figure 1a, which moves to the central cell as quickly as possible and then alternates between cells 3 and 4. The only other policy $\tilde{\pi}_0$ with which the learner could match the feature expectations of $\pi_0$ in her worldview (restricting again to deterministic policies) is the one that moves to the central cell as quickly as possible and then alternates between cells 2 and 3.

Note that $R(\pi^*) > R(\pi_0) > R(\tilde{\pi}_0) > R(\tilde{\pi}^*)$, and hence $\pi_0$ is a better teaching policy than $\pi^*$ regarding the performance that a learner matching feature expectations in her worldview achieves in the worst case. In particular, this example shows that providing demonstrations from the truly optimal policy $\pi^*$ does not in general guarantee that the learner's policy achieves good performance.

4 Teaching Risk

Definition of teaching risk. The fundamental problem in the setting described in Section 3 is that two policies $\pi_0, \pi_1$ that perform equally well with respect to any estimate $s \mapsto \langle w_L, A_L \phi(s) \rangle$ that $L$ may have of the reward function may perform very differently with respect to the true reward function. Hence, even if $L$ is able to imitate the behaviour of the teacher well in her worldview, there is generally no guarantee on how good her performance is with respect to the true reward function. For an illustration, see Figure 2.

Figure 2: Two policies in the environment introduced in Figure 1 (the policy $\pi_0$ here is identical to the one in Figure 1(a)). We assume again that $L$ can only observe the feature corresponding to the central cell. Provided that the initial state distribution is uniform, the feature expectations of $\pi_0$ and $\pi_1$ in $L$'s worldview are equal, and hence these policies perform equally well with respect to any estimate of the reward function that $L$ may have. In fact, both look optimal to $L$ if she assumes that the central cell carries positive reward. However, their performance with respect to the true reward function is positive for $\pi_0$ but negative for $\pi_1$. This illustrates that, if all we know about $L$ is that she matches feature counts in her worldview, we can generally not give good performance guarantees for the policy she finds.

To address this problem, we define the following quantity: the teaching risk for a given worldview $A_L$ with respect to reward weights $w^*$ is

$$\rho(A_L; w^*) := \max_{v \in \ker A_L,\, \|v\| \leq 1} \langle w^*, v \rangle. \tag{1}$$

Here $\ker A_L = \{v \in \mathbb{R}^k \mid A_L v = 0\}$ and $\ker w^* = \{v \in \mathbb{R}^k \mid \langle w^*, v \rangle = 0\}$ denote the kernels of $A_L$ and of $\langle w^*, \cdot \rangle$, respectively. Geometrically, $\rho(A_L; w^*)$ is the cosine of the angle between $\ker A_L$ and $w^*$; in other words, $\rho(A_L; w^*)$ measures the degree to which $\ker A_L$ deviates from satisfying $\ker A_L \subseteq \ker w^*$. Yet another way of characterizing the teaching risk is as $\rho(A_L; w^*) = \|\operatorname{pr}(w^*)\|$, where $\operatorname{pr} : \mathbb{R}^k \to \ker A_L$ denotes the orthogonal projection onto $\ker A_L$.

Significance of teaching risk.
To understand the significance of the teaching risk in our context, assume that $L$ is able to find a policy $\pi_L$ which matches the feature expectations of $T$'s (not necessarily optimal) policy $\pi_T$ perfectly in her worldview, which is equivalent to $\mu(\pi_T) - \mu(\pi_L) \in \ker A_L$. Directly from the definition of the teaching risk, we see that the gap between their performances with respect to the true reward function satisfies

$$|\langle w^*, \mu(\pi_T) - \mu(\pi_L) \rangle| \leq \rho(A_L; w^*) \cdot \|\mu(\pi_T) - \mu(\pi_L)\|, \tag{2}$$

with equality if $\mu(\pi_T) - \mu(\pi_L)$ is proportional to a vector $v$ realizing the maximum in (1). If the teaching risk is large, this performance gap can generally be large as well. This motivates the interpretation of $\rho(A_L; w^*)$ as a measure of the risk when teaching the task modelled by an MDP with reward weights $w^*$ to a learner whose worldview is represented by $A_L$.

On the other hand, smallness of the teaching risk implies that this performance gap cannot be too large. The following theorem, proven in the extended version of this paper [Haug et al., 2018], generalizes the bound in (2) to the situation in which $\pi_L$ only approximates the feature expectations of $\pi_T$.

Theorem 1. Assume that $\|A_L(\mu(\pi_T) - \mu(\pi_L))\| < \varepsilon$.
Then the gap between the true performances of $\pi_L$ and $\pi_T$ satisfies

$$|\langle w^*, \mu(\pi_T) - \mu(\pi_L) \rangle| < \frac{\varepsilon}{\sigma(A_L)} + \rho(A_L; w^*) \cdot \operatorname{diam} \mu(\Pi)$$

with $\sigma(A_L) = \min_{v \perp \ker A_L,\, \|v\| = 1} \|A_L v\|$.

Theorem 1 shows the following: if $L$ imitates $T$'s behaviour well in her worldview (meaning that $\varepsilon$ can be chosen small) and if the teaching risk $\rho(A_L; w^*)$ is sufficiently small, then $L$ will perform nearly as well as $T$ with respect to the true reward. In particular, if $T$'s policy is optimal, $\pi_T = \pi^*$, then $L$'s policy $\pi_L$ is guaranteed to be near-optimal.

Algorithm 1 TRGREEDY: Feature- and demo-based teaching with TR-greedy feature selection
Require: Reward vector $w^*$, set of teachable features $F$, feature budget $B$, initial worldview $A_L$, teacher policy $\pi_T$, initial learner policy $\pi_L$, performance threshold $\varepsilon$.
for $i = 1, \ldots, B$ do
    if $|\langle w^*, \mu(\pi_L) \rangle - \langle w^*, \mu(\pi_T) \rangle| > \varepsilon$ then
        $f \leftarrow \arg\min_{f \in F} \rho(A_L \oplus \langle f, \cdot \rangle; w^*)$ ▷ $T$ selects feature to teach
        $A_L \leftarrow A_L \oplus \langle f, \cdot \rangle$ ▷ $L$'s worldview gets updated
        $\pi_L \leftarrow$ LEARNING$(\pi_L, A_L \mu(\pi_T))$ ▷ $L$ trains a new policy
    else
        return $\pi_L$
    end if
end for
return $\pi_L$

The quantity $\sigma(A_L)$ appearing in Theorem 1 is a bound on the amount to which $A_L$ distorts lengths of vectors in the orthogonal complement of $\ker A_L$. Note that $\sigma(A_L)$ is independent of the teaching risk, in the sense that one can change it, e.g., by rescaling $A_L \to \alpha A_L$ for some $\alpha \in \mathbb{R}_+$, without changing the teaching risk.

Teaching risk as obstruction to recognizing optimality.
We now provide a second motivation for the consideration of the teaching risk, by interpreting it as a quantity that measures the degree to which truly optimal policies deviate from looking optimal to $L$. We make the technical assumption that $\mu(\Pi)$ is the closure of a bounded open set with smooth boundary $\partial \mu(\Pi)$ (this will only be needed for the proofs). Our first observation is the following:

Proposition 1. Let $\pi^*$ be a policy which is optimal for $s \mapsto \langle w^*, \phi(s) \rangle$. If $\rho(A_L; w^*) > 0$, then $\pi^*$ is suboptimal with respect to any choice of reward function $s \mapsto \langle w, A_L \phi(s) \rangle$ with $w \in \mathbb{R}^\ell$.

In view of Proposition 1, a natural question is whether we can bound the suboptimality, in $L$'s view, of a truly optimal policy in terms of the teaching risk. The following theorem provides such a bound:

Theorem 2. Let $\pi^*$ be a policy which is optimal for $s \mapsto \langle w^*, \phi(s) \rangle$. There exists a unit vector $w_L^* \in \mathbb{R}^\ell$ such that

$$\max_{\mu \in \mu(\Pi)} \langle w_L^*, A_L \mu \rangle - \langle w_L^*, A_L \mu(\pi^*) \rangle \leq \frac{\operatorname{diam} \mu(\Pi) \cdot \|A_L\| \cdot \rho(A_L; w^*)}{\sqrt{1 - \rho(A_L; w^*)^2}},$$

where $\|A_L\| = \max_{v \in \mathbb{R}^k,\, \|v\| = 1} \|A_L v\|$.

Proofs of Proposition 1 and Theorem 2 are given in the extended version of this paper [Haug et al., 2018]. Note that the expression on the right-hand side of the inequality in Theorem 2 tends to $0$ as $\rho(A_L; w^*) \to 0$, provided $\|A_L\|$ is bounded.
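The quantities appearing in Theorems 1 and 2 are all computable from an SVD of $A_L$: $\sigma(A_L)$ is the smallest nonzero singular value, $\|A_L\|$ the largest, and $\rho(A_L; w^*)$ the norm of the projection of $w^*$ onto $\ker A_L$. The following sketch (assuming numpy; the random worldview and data are our own, chosen for illustration) computes them and checks Theorem 1's bound on randomly generated feature-expectation gaps:

```python
import numpy as np

rng = np.random.default_rng(0)
k, l = 6, 3
A_L = rng.normal(size=(l, k))       # a random (generically full-rank) worldview
w_star = rng.normal(size=k)
w_star /= np.linalg.norm(w_star)

_, S, Vt = np.linalg.svd(A_L)
rank = int(np.sum(S > 1e-10))
sigma = S[rank - 1]                 # sigma(A_L): smallest nonzero singular value
norm_A = S[0]                       # ||A_L||: largest singular value
K = Vt[rank:]                       # rows: orthonormal basis of ker A_L
rho = np.linalg.norm(K.T @ (K @ w_star))   # teaching risk rho(A_L; w*)

# Check Theorem 1 numerically: draw Delta = mu(pi_T) - mu(pi_L) with
# ||A_L Delta|| < eps and verify |<w*, Delta>| < eps/sigma + rho * ||Delta||
# (||Delta|| stands in for diam mu(Pi), which upper-bounds it).
eps = 0.05
max_slack = -np.inf
for _ in range(100):
    d_ker = K.T @ rng.normal(size=k - rank)       # component inside ker A_L
    d_perp = Vt[:rank].T @ rng.normal(size=rank)  # component orthogonal to it
    d_perp *= 0.9 * eps / np.linalg.norm(A_L @ d_perp)
    delta = d_ker + d_perp
    lhs = abs(w_star @ delta)
    rhs = eps / sigma + rho * np.linalg.norm(delta)
    max_slack = max(max_slack, lhs - rhs)
```

`max_slack` stays negative, i.e., the bound holds with slack on every draw, by the same decomposition argument as in the theorem's proof sketch.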
Theorem 2 therefore implies that, if $\rho(A_L; w^*)$ is small, a truly optimal policy $\pi^*$ is near-optimal for some choice of reward function linear in the features $L$ observes, namely, the reward function $s \mapsto \langle w_L^*, A_L \phi(s) \rangle$ with $w_L^* \in \mathbb{R}^\ell$ the vector whose existence is claimed by the theorem.

5 Teaching

Feature teaching. The discussion in the last section shows that, under our assumptions on how $L$ learns, a teaching scheme in which $T$ solely provides demonstrations to $L$ cannot in general, i.e., without any assumption on the teaching risk, lead to reasonable guarantees on the learner's performance with respect to the true reward. A natural strategy is to introduce additional teaching operations by which the teacher can update $L$'s worldview $A_L$ and thereby decrease the teaching risk.

The simplest way by which the teacher $T$ can change $L$'s worldview is by informing her about features $f \in \mathbb{R}^k$ that are relevant to performing well in the task, thus causing her to update her worldview $A_L : \mathbb{R}^k \to \mathbb{R}^\ell$ to

$$A_L \oplus \langle f, \cdot \rangle : \mathbb{R}^k \to \mathbb{R}^{\ell+1}.$$

Figure 3: Performance vs. teaching risk. Each point in the plots shows the relative performance that a learner $L$ with a random worldview matrix $A_L$ achieved after one round of learning, and the teaching risk of $A_L$. For all plots, a gridworld with $N = 20$, $n = 2$ was used. The reward vector $w^*$ was sampled randomly in each round. (a)-(c) correspond to different values of the discount factor: (a) $\gamma = 0.5$, (b) $\gamma = 0.75$, (c) $\gamma = 0.9$.

Viewing $A_L$ as a matrix, this operation appends $f$ as a row to $A_L$.
(Strictly speaking, the feature that is thus provided is $s \mapsto \langle f, \phi(s) \rangle$; we identify this map with the vector $f$ in the following and thus keep calling $f$ a "feature".)

This operation has simple interpretations in the settings we are interested in: if $L$ is a human learner, "teaching a feature" could mean making $L$ aware that a certain quantity, which she might not have taken into account so far, is crucial to achieving high performance. If $L$ is a machine, such as an autonomous car or a robot, it could mean installing an additional sensor.

Teachable features. Note that if $T$ could provide arbitrary vectors $f \in \mathbb{R}^k$ as new features, she could always, no matter what $A_L$ is, decrease the teaching risk to zero in a single teaching step by choosing $f = w^*$, which amounts to telling $L$ the true reward function. We assume that this is not possible, and that instead only the elements of a fixed finite set of teachable features

$$F = \{f_i \mid i \in I\} \subset \mathbb{R}^k$$

can be taught. In real-world applications, such constraints could come from the limited availability of sensors and their costs; in the case that $L$ is a human, they could reflect the requirement that features need to be interpretable, i.e., that they can only be simple combinations of basic observable quantities.

Greedy minimization of teaching risk. Our basic teaching algorithm TRGREEDY (Algorithm 1) works as follows: $T$ and $L$ interact in rounds, in each of which $T$ provides $L$ with the feature $f \in F$ which reduces the teaching risk of $L$'s worldview with respect to $w^*$ by the largest amount. $L$ then trains a policy $\pi_L$ with the goal of imitating her current view $A_L \mu(\pi_T)$ of the feature expectations of the teacher's policy; the LEARNING algorithm she uses could be the apprenticeship learning algorithm from [Abbeel and Ng, 2004].

Computation of the teaching risk.
The computation of the teaching risk required of $T$ in every round of Algorithm 1 can be performed as follows: one first computes the orthogonal complement of $\ker A_L \cap \ker w^*$ in $\mathbb{R}^k$ and intersects it with $\ker A_L$, thus obtaining (generically) a 1-dimensional subspace $\lambda = (\ker A_L \cap \ker w^*)^\perp \cap \ker A_L$ of $\mathbb{R}^k$; this can be done using SVD. The teaching risk is then $\rho(A_L; w^*) = \langle w^*, v^\perp \rangle$, with $v^\perp$ the unique unit vector in $\lambda$ satisfying $\langle w^*, v^\perp \rangle > 0$.

6 Experiments

Our experimental setup is similar to the one in [Abbeel and Ng, 2004], i.e., we use $N \times N$ gridworlds in which non-overlapping square regions of neighbouring cells are grouped together to form $n \times n$ macrocells for some $n$ dividing $N$. The state set $S$ is the set of gridpoints, the action set is $A = \{\leftarrow, \rightarrow, \uparrow, \downarrow\}$, and the feature map $\phi : S \to \mathbb{R}^k$ maps a gridpoint belonging to macrocell $i \in \{1, \ldots, (N/n)^2\}$ to the one-hot vector $e_i \in \mathbb{R}^k$; the dimension of the "true" feature space is therefore $k = (N/n)^2$. Note that these gridworlds satisfy the quite special property that for states $s \neq s'$, we either have $\phi(s) = \phi(s')$ (if $s, s'$ belong to the same macrocell) or $\phi(s) \perp \phi(s')$. The reward weights $w^* \in \mathbb{R}^k$ are sampled randomly for all experiments unless mentioned otherwise. As the LEARNING algorithm within Algorithm 1, we use the projection version of the apprenticeship learning algorithm from [Abbeel and Ng, 2004].

Figure 4: Gridworld with $N = 10$, $n = 2$. The colors in (a) indicate the reward of the corresponding macrocell, with blue meaning positive and red meaning negative reward. The numbers within each macrocell correspond to the feature index. The histograms in (b) and (c) show how often, in a series of 100 experiments, each feature was selected as the first resp. second feature to be taught to a learner with a random 5-dimensional initial worldview.

Performance vs. teaching risk. The plots in Figure 3 illustrate the significance of the teaching risk for the problem of teaching a learner under worldview mismatch. To obtain these plots, we used a gridworld with $N = 20$, $n = 2$; for each value $\ell \in [1, 100]$, we sampled five random worldview matrices $A_L \in \mathbb{R}^{\ell \times 100}$, and let $L$ train a policy $\pi_L$ using the projection algorithm in [Abbeel and Ng, 2004], with the goal of matching the feature expectations $\mu(\pi_T)$ corresponding to an optimal policy $\pi_T$ for a reward vector $w^*$ that was sampled randomly in each round. Each point in the plots corresponds to one such experiment and shows the relative performance of $\pi_L$ after the training round vs. the teaching risk of $L$'s worldview matrix $A_L$.

All plots in Figure 3 show that the variance of the learner's performance decreases as the teaching risk decreases. This supports our interpretation of the teaching risk as a measure of the potential gap between the performances of $\pi_L$ and $\pi_T$ when $L$ matches the feature expectations of $\pi_T$ in her worldview. The plots also show that the bound for this gap provided in Theorem 1 is overly conservative in general, given that $L$'s performance is often high and has small variance even if the teaching risk is relatively large.

The plots indicate that for larger $\gamma$ (i.e., less discounting), it is easier for $L$ to achieve high performance even if the teaching risk is large.
This makes intuitive sense: if there is a lot of discounting, it is important to reach high-reward states quickly in order to perform well, which requires being able to recognize where these states are located, and hence a small teaching risk. If there is little discounting, it suffices to know the location of some possibly distant reward state, and hence even a learner with a very deficient worldview (i.e., high teaching risk) can do well in that case.

Small gridworlds with high-reward states and obstacles. We tested TRGREEDY (Algorithm 1) on gridworlds such as the one in Figure 4a, with a small number of states with high positive rewards, some obstacle states with high negative rewards, and all other states having rewards close to zero. The histograms in Figures 4b and 4c show how often each of the features was selected by the algorithm as the first and second feature, respectively, to be taught to the learner in 100 experiments, in each of which the learner was initialized with a random 5-dimensional worldview. In most cases, the algorithm first selected the features corresponding to one of the high-reward cells 4 and 9 or to one of the obstacle cells 18 and 23, which are clearly those that the learner must be most aware of in order to achieve high performance.

Comparison of algorithms. We compared the performance of TRGREEDY (Algorithm 1) to two variants of the algorithm which differ only in how the features to be taught in each round are selected: the first variant, RANDOM, simply selects a random feature f ∈ F from the set of all teachable features.
The second variant, PERFGREEDY, greedily selects the feature that will lead to the best performance in the next round among all f ∈ F (computed by simulating the teaching process for each feature and evaluating the corresponding learner).

Figure 5: Comparison of TRGREEDY vs. PERFGREEDY vs. RANDOM. The plots show (a) the relative performance that the learner achieved after each round of feature teaching and training a policy, (b) the teaching risk after each such step, and (c) the runtime required to perform each step. We averaged over 100 experiments, in each of which a new random gridworld of size (N, n) = (20, 2) and a new set F of randomly selected features with |F| = 70 were sampled; the bars in the relative performance plot indicate the standard deviations. The discount factor used was γ = 0.9 in all cases.

The plots in Figure 5 show, for each of the three algorithms, the relative performance with respect to the true reward function s ↦ ⟨w*, φ(s)⟩ that the learner achieved after each round of feature teaching and training a policy π_L, as well as the corresponding teaching risks and runtimes, plotted over the number of features taught.
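The relative-performance values reported in these plots are normalized returns, (R(π_L) − min_π R(π)) / (max_π R(π) − min_π R(π)). A minimal sketch of this normalization (the helper and its arguments are illustrative; the extremal returns are assumed precomputed, e.g., by planning for the reward vectors w* and −w*):

```python
def relative_performance(r_learner, r_min, r_max):
    """Map the learner's return R(pi_L) into [0, 1]:
    0 corresponds to the worst achievable return, 1 to the optimal one."""
    if r_max == r_min:         # degenerate reward: every policy is optimal
        return 1.0
    return (r_learner - r_min) / (r_max - r_min)
```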
The relative performance of the learner's policy π_L was computed as (R(π_L) − min_π R(π)) / (max_π R(π) − min_π R(π)).

We observed in all our experiments that TRGREEDY performed significantly better than RANDOM. While the comparison between TRGREEDY and PERFGREEDY was slightly in favour of the latter, one should note that a teacher T running PERFGREEDY must simulate a learning round of L for all features f ∈ F not yet taught, which presupposes that T knows L's learning algorithm, and which also leads to very high runtime. If T only knows that L is able to match (her view of) the feature expectations of T's demonstrations, and T simulates L using some algorithm capable of this, there is no guarantee that L will perform as well as T's simulated counterpart, as there may be a large discrepancy between the true performances of two policies which in L's view have the same feature expectations. In contrast, TRGREEDY relies on much less information, namely the kernel of A_L, and in particular is agnostic to the precise learning algorithm that L uses to approximate feature counts.

7 Conclusions and Outlook

We presented an approach to dealing with the problem of worldview mismatch in situations in which a learner attempts to find a policy matching the feature counts of a teacher's demonstrations. We introduced the teaching risk, a quantity that depends on the worldview of the learner and the true reward function, and which (1) measures the degree to which policies that are optimal from the point of view of the learner can be suboptimal from the point of view of the teacher, and (2) is an obstruction for truly optimal policies to look optimal to the learner.
We showed that under the condition that the teaching risk is small, a learner matching feature counts using, e.g., standard IRL-based methods is guaranteed to learn a near-optimal policy from demonstrations of the teacher even under worldview mismatch.

Based on these findings, we presented our teaching algorithm TRGREEDY, in which the teacher updates the learner's worldview by teaching her features which are relevant for the true reward function in a way that greedily minimizes the teaching risk, and then provides her with demonstrations based on which she learns a policy using any suitable algorithm. We tested our algorithm in gridworld settings and compared it to other ways of selecting features to be taught. Experimentally, we found that TRGREEDY performed comparably to a variant which selected features based on greedily maximizing performance, and consistently better than a variant with randomly selected features.

We plan to investigate extensions of our ideas to nonlinear settings and to test them in more complex environments in future work. We hope that, ultimately, such extensions will be applicable in real-world scenarios, for example in systems in which human expert knowledge is represented as a reward function, and where the goal is to teach this expert knowledge to human learners.

References

[Abbeel and Ng, 2004] Abbeel, P. and Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. In ICML.

[Aodha et al., 2018] Aodha, O. M., Su, S., Chen, Y., Perona, P., and Yue, Y. (2018). Teaching categories to human learners with visual explanations. In CVPR.

[Brown and Niekum, 2018] Brown, D. S. and Niekum, S. (2018).
Machine teaching for inverse reinforcement learning: Algorithms and applications. CoRR, abs/1805.07687.

[Cakmak et al., 2012] Cakmak, M., Lopes, M., et al. (2012). Algorithmic and human teaching of sequential decision tasks. In AAAI.

[Cakmak and Thomaz, 2014] Cakmak, M. and Thomaz, A. L. (2014). Eliciting good teaching from humans for machine learners. Artificial Intelligence, 217:198–215.

[Chen et al., 2018] Chen, Y., Singla, A., Mac Aodha, O., Perona, P., and Yue, Y. (2018). Understanding the role of adaptivity in machine teaching: The case of version space learners. In Neural Information Processing Systems (NeurIPS).

[Haug et al., 2018] Haug, L., Tschiatschek, S., and Singla, A. (2018). Teaching inverse reinforcement learners via features and demonstrations. CoRR, abs/1810.08926.

[Hunziker et al., 2018] Hunziker, A., Chen, Y., Mac Aodha, O., Gomez-Rodriguez, M., Krause, A., Perona, P., Yue, Y., and Singla, A. (2018). Teaching multiple concepts to a forgetful learner. CoRR, abs/1805.08322.

[Kamalaruban et al., 2018] Kamalaruban, P., Devidze, R., Yeo, T., Mittal, T., Cevher, V., and Singla, A. (2018). Assisted inverse reinforcement learning. In Learning by Instruction Workshop at Neural Information Processing Systems (NeurIPS).

[Liu et al., 2017] Liu, W., Dai, B., Humayun, A., Tay, C., Yu, C., Smith, L. B., Rehg, J. M., and Song, L. (2017). Iterative machine teaching. In ICML, pages 2149–2158.

[Mayer et al., 2017] Mayer, M., Hamza, J., and Kuncak, V. (2017). Proactive synthesis of recursive tree-to-string functions from examples (artifact). In DARTS-Dagstuhl Artifacts Series, volume 3. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.

[Mei and Zhu, 2015] Mei, S. and Zhu, X. (2015). Using machine teaching to identify optimal training-set attacks on machine learners. In AAAI, pages 2871–2877.

[Patil et al., 2014] Patil, K. R., Zhu, X., Kopeć, Ł., and Love, B. C. (2014).
Optimal teaching for limited-capacity human learners. In NIPS, pages 2465–2473.

[Rafferty et al., 2016] Rafferty, A. N., Brunskill, E., Griffiths, T. L., and Shafto, P. (2016). Faster teaching via POMDP planning. Cognitive Science, 40(6):1290–1332.

[Sermanet et al., 2018] Sermanet, P., Lynch, C., Chebotar, Y., Hsu, J., Jang, E., Schaal, S., Levine, S., and Brain, G. (2018). Time-contrastive networks: Self-supervised learning from video. In ICRA, pages 1134–1141.

[Singla et al., 2013] Singla, A., Bogunovic, I., Bartók, G., Karbasi, A., and Krause, A. (2013). On actively teaching the crowd to classify. In NIPS Workshop on Data Driven Education.

[Singla et al., 2014] Singla, A., Bogunovic, I., Bartók, G., Karbasi, A., and Krause, A. (2014). Near-optimally teaching the crowd to classify. In ICML, pages 154–162.

[Stadie et al., 2017] Stadie, B. C., Abbeel, P., and Sutskever, I. (2017). Third-person imitation learning. CoRR, abs/1703.01703.

[Yeo et al., 2019] Yeo, T., Kamalaruban, P., Singla, A., Merchant, A., Asselborn, T., Faucon, L., Dillenbourg, P., and Cevher, V. (2019). Iterative classroom teaching. In AAAI.

[Zhu, 2015] Zhu, X. (2015). Machine teaching: An inverse problem to machine learning and an approach toward optimal education. In AAAI, pages 4083–4087.

[Zhu et al., 2018] Zhu, X., Singla, A., Zilles, S., and Rafferty, A. N. (2018). An overview of machine teaching. CoRR, abs/1801.05927.

[Ziebart et al., 2008] Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. (2008). Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438.