{"title": "Imitation Learning from Observations by Minimizing Inverse Dynamics Disagreement", "book": "Advances in Neural Information Processing Systems", "page_first": 239, "page_last": 249, "abstract": "This paper studies Learning from Observations (LfO) for imitation learning with access to state-only demonstrations. In contrast to Learning from Demonstration (LfD) that involves both action and state supervisions, LfO is more practical in leveraging previously inapplicable resources (e.g., videos), yet more challenging due to the incomplete expert guidance. In this paper, we investigate LfO and its difference with LfD in both theoretical and practical perspectives. We first prove that the gap between LfD and LfO actually lies in the disagreement of inverse dynamics models between the imitator and expert, if following the modeling approach of GAIL. More importantly, the upper bound of this gap is revealed by a negative causal entropy which can be minimized in a model-free way. We term our method as Inverse-Dynamics-Disagreement-Minimization (IDDM) which enhances the conventional LfO method through further bridging the gap to LfD. 
Considerable empirical results on challenging benchmarks indicate that our method attains consistent improvements over other LfO counterparts.", "full_text": "Imitation Learning from Observations by Minimizing Inverse Dynamics Disagreement

Chao Yang1*, Xiaojian Ma12*, Wenbing Huang1*, Fuchun Sun1, Huaping Liu1, Junzhou Huang3, Chuang Gan4

1 Beijing National Research Center for Information Science and Technology (BNRist), State Key Lab on Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University
2 Center for Vision, Cognition, Learning and Autonomy, Department of Computer Science, UCLA
3 Tencent AI Lab, 4 MIT-IBM Watson AI Lab

yangchao18@mails.tsinghua.edu.cn, maxiaojian@ucla.edu, hwenbing@126.com, fcsun@tsinghua.edu.cn

Abstract

This paper studies Learning from Observations (LfO) for imitation learning with access to state-only demonstrations. In contrast to Learning from Demonstration (LfD), which involves both action and state supervision, LfO is more practical in leveraging previously inapplicable resources (e.g., videos), yet more challenging due to the incomplete expert guidance. In this paper, we investigate LfO and its difference from LfD in both theoretical and practical perspectives. We first prove that the gap between LfD and LfO actually lies in the disagreement of inverse dynamics models between the imitator and the expert, if following the modeling approach of GAIL [15]. More importantly, the upper bound of this gap is revealed by a negative causal entropy which can be minimized in a model-free way. 
We term our method Inverse-Dynamics-Disagreement-Minimization (IDDM), which enhances the conventional LfO method by further bridging the gap to LfD. Considerable empirical results on challenging benchmarks indicate that our method attains consistent improvements over other LfO counterparts.

1 Introduction

A crucial aspect of intelligent robots is their ability to perform a task of interest by imitating expert behaviors from raw sensory observations [5]. Towards this goal, GAIL [15] is one of the most successful imitation learning methods; it adversarially minimizes the discrepancy of the occupancy measure between the agent and the expert for policy optimization. However, along with many other methods [31, 3, 29, 30, 1, 23, 6, 9, 13, 2], GAIL adopts a heavily supervised training mechanism, which demands not only the expert's states (e.g., observable spatial locations), but also its accurate actions (e.g., controllable motor commands) performed at each time step.
While providing expert actions indeed enriches the information and hence facilitates the imitation learning process, collecting them can be difficult and sometimes infeasible in certain practical cases, particularly when we would like to learn skills from a large number of internet videos. Besides, imitation learning under action guidance is not biologically plausible [39], as humans can imitate skills by adjusting their actions to match the demonstrator's states, without knowing what exact action the demonstrator has performed. To address these concerns, several methods have been proposed [35, 39, 5, 21, 36], including the one named GAIfO [40] that extends the idea of GAIL to the case where action guidance is absent.

*Denotes equal contributions. Corresponding author: Fuchun Sun.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Following the previous terminology, this paper refers to the original problem as Learning from Demonstrations (LfD), and to the new action-agnostic setting as Learning from Observations (LfO).
Undoubtedly, conducting LfO is non-trivial. For many tasks (e.g., robotic manipulation, locomotion, and video-game playing), the reward function depends on both action and state. It remains challenging to determine the optimal action corresponding to the best reward purely from experts' state observations, since there could be multiple choices of action corresponding to the same sequence of states in a demonstration: when manipulating redundant-degree-of-freedom robotic hands, for example, countless joint force controls give rise to the same pose change. Yet, realizing LfO is still possible, especially if the expert and the agent share the same dynamics system (namely, the same robot). Under this condition, which this paper assumes, the correlation between action and state can be learned through the agent's self-play (see for example [39]).
In this paper, we approach LfO by leveraging the concept of inverse dynamics disagreement minimization. As its name implies, inverse dynamics disagreement is defined as the discrepancy between the inverse dynamics models of the expert and the agent. Minimizing such disagreement becomes the task of inverse dynamics prediction, a well-known problem that has been studied in robotics [24]. Interestingly, as we will show in this paper, the inverse dynamics disagreement is closely related to LfD and LfO. To be more specific, we prove that the inverse dynamics disagreement exactly accounts for the optimization gap between LfD and naive LfO, if we model LfD by GAIL [15] and consider naive LfO as GAIfO [40]. 
This result is crucial: not only does it quantify the difference between LfD and naive LfO, but it also enables us to solve LfO more elegantly by minimizing the inverse dynamics disagreement as well.
To mitigate the inverse dynamics disagreement, we propose a model-free solution for the sake of efficiency. In detail, we derive an upper bound of the gap, which turns out to be a negative entropy of the state-action occupancy measure. Under the assumption of a deterministic system, this entropy contains a mutual information term that can be optimized with a popular tool (i.e., MINE [4]). For convenience, we term our method Inverse-Dynamics-Disagreement-Minimization (IDDM) based LfO in what follows. To verify the effectiveness of IDDM, we perform experimental comparisons on seven challenging control tasks, ranging from traditional control to locomotion [8]. The experimental results demonstrate that our proposed method attains consistent improvements over other LfO counterparts.
The rest of the paper is organized as follows. In Sec. 2, we first review the necessary notation and preliminaries. Our proposed method is then detailed in Sec. 3 with theoretical analysis and an efficient implementation, and discussions of existing LfD and LfO methods follow in Sec. 4. Finally, experimental evaluations and ablation studies are presented in Sec. 5.

2 Preliminaries

Notation. 
To model the action decision procedure in our context, we consider a standard Markov decision process (MDP) [37] (S, A, r, T, μ, γ), where S and A are the sets of feasible states and actions, respectively; r(s, a) : S × A → R denotes the reward function on state s and action a; T(s′|s, a) : S × A × S → [0, 1] characterizes the dynamics of the environment and defines the transition probability to the next state s′ if the agent takes action a at the current state s; μ(s) : S → [0, 1] is the distribution of the initial state, and γ ∈ (0, 1) is the discount factor. A stationary policy π(a|s) : S × A → [0, 1] defines the probability of choosing action a at state s. A temporal sequence of state-action pairs {(s0, a0), (s1, a1), ...} is called a trajectory, denoted by ζ.

Occupancy measure. To characterize the statistical properties of an MDP, the concept of occupancy measure [28, 38, 15, 17] is proposed to describe the distribution of state and action under a given policy π. Below, we introduce its simplest form, i.e., the State Occupancy Measure.
Definition 1 (State Occupancy Measure). Given a stationary policy π, the state occupancy measure ρπ(s) : S → R denotes the discounted state appearance frequency under policy π:

    ρπ(s) = Σ_{t=0}^{∞} γ^t P(st = s | π).    (1)

With the use of the state occupancy measure, we can define other kinds of occupancy measures on different supports, including the state-action occupancy measure, the state transition occupancy measure, and the joint occupancy measure. We list their definitions in Tab. 1 for the reader's reference.

Inverse dynamics model. 
We present the inverse dynamics model [34, 33] in Definition 2; it infers the action given the state transition (s, s′).

Table 1: Different occupancy measures for an MDP
- State-Action Occupancy Measure: denoted ρπ(s, a), support S × A, defined as ρπ(s)π(a|s).
- State Transition Occupancy Measure: denoted ρπ(s, s′), support S × S, defined as ∫_A ρπ(s, a)T(s′|s, a) da.
- Joint Occupancy Measure: denoted ρπ(s, a, s′), support S × A × S, defined as ρπ(s, a)T(s′|s, a).

Definition 2 (Inverse Dynamics Model). Let ρπ(a|s, s′) denote the density of the inverse dynamics model under policy π; its relation to T and π is

    ρπ(a|s, s′) := T(s′|s, a)π(a|s) / ∫_A T(s′|s, ā)π(ā|s) dā.    (2)

3 Methodology

In this section, we first introduce the concepts of LfD, naive LfO, and inverse dynamics disagreement. Then, we prove that the optimization gap between LfD and naive LfO is exactly the inverse dynamics disagreement. As such, we enhance naive LfO by further minimizing the inverse dynamics disagreement. We also demonstrate that this disagreement can be bounded by an entropy term and minimized in a model-free manner. Finally, we provide a practical implementation of our proposed method.

3.1 Inverse Dynamics Disagreement: the Gap between LfD and LfO

LfD. In Sec. 1, we mentioned that GAIL and many other LfD methods [15, 18, 16, 27] exploit the discrepancy of the occupancy measure between the agent and the expert as a reward for policy optimization. Without loss of generality, we will consider GAIL as the representative LfD framework and build our analysis on this description. 
This LfD framework requires computing the discrepancy over the state-action occupancy measure, leading to

    min_π DKL(ρπ(s, a) || ρE(s, a)),    (3)

where ρE(s, a) denotes the occupancy measure under the expert policy, and DKL(·) is the Kullback-Leibler (KL) divergence2. We have omitted the policy entropy term in GAIL, but our following derivations will show that the policy entropy term is naturally contained in the gap between LfD and LfO.

2 The original GAIL method applies the Jensen-Shannon (JS) divergence rather than the KL divergence for measurement. Here, we use the KL divergence for consistency throughout our derivations. Indeed, our method is also compatible with the JS divergence, with details provided in the supplementary material.

Naive LfO. In LfO, the expert action is absent, so directly working on DKL(ρπ(s, a)||ρE(s, a)) is infeasible. An alternative objective is to minimize the discrepancy on the state transition occupancy measure ρπ(s, s′), as in GAIfO [40]. The objective function in (3) becomes

    min_π DKL(ρπ(s, s′) || ρE(s, s′)).    (4)

We will refer to this as naive LfO in the following. Compared to LfD, the key challenge in LfO comes from the absence of action information, which prevents applying typical action-involved imitation learning approaches like behavior cloning [31, 3, 11, 29, 30] or apprenticeship learning [23, 1, 38]. Actually, action information is implicitly encoded in the state transition (s, s′), and we have assumed the expert and the agent share the same dynamics system T(s′|s, a). 
It is thus possible to learn the action-state relation by exploring the difference between their inverse dynamics models.
We define the inverse dynamics disagreement between the expert and the agent as follows.
Definition 3 (Inverse Dynamics Disagreement). Given expert policy πE and agent policy π, the inverse dynamics disagreement is defined as the KL divergence between the inverse dynamics models of the expert and the agent:

    Inverse Dynamics Disagreement := DKL(ρπ(a|s, s′) || ρE(a|s, s′)).    (5)

Given a state transition (s, s′), minimizing the inverse dynamics disagreement amounts to learning an optimal policy that fits the expert/ground-truth action labels. This is a typical robotic task [24], and it can be solved by methods that combine machine learning and control models.
Here, we identify another role of the inverse dynamics disagreement in the context of imitation learning. Combining the notation in (3), (4), and Definition 3, we provide the following result.
Theorem 1. If the agent and the expert share the same dynamics system, the relation between LfD, naive LfO, and inverse dynamics disagreement can be characterized as

    DKL(ρπ(a|s, s′) || ρE(a|s, s′)) = DKL(ρπ(s, a) || ρE(s, a)) − DKL(ρπ(s, s′) || ρE(s, s′)).    (6)

Theorem 1 states that the inverse dynamics disagreement exactly captures the optimization gap between LfD and naive LfO. As (5) is non-negative by nature, optimizing the objective of LfD implies minimizing the objective of LfO, but not vice versa. 
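Since Theorem 1 is an exact identity when both sides are computed under the true occupancy measures, it can be checked numerically on a small synthetic MDP. The sketch below is our own illustration, not the paper's code; the MDP sizes, seed, and helper names are arbitrary choices. It builds random agent/expert policies on a shared dynamics, computes the occupancy measures of Tab. 1 in closed form, and compares both sides of (6), with the conditional KL on the left-hand side taken in expectation under the agent's state-transition occupancy measure.

```python
# Sanity check of Theorem 1 on a small random MDP (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

# Shared stochastic dynamics T[s, a, s'] and initial distribution mu(s).
T = rng.random((nS, nA, nS)) + 0.1
T /= T.sum(axis=2, keepdims=True)
mu = rng.random(nS) + 0.1
mu /= mu.sum()

def random_policy():
    pi = rng.random((nS, nA)) + 0.1
    return pi / pi.sum(axis=1, keepdims=True)

def occupancies(pi):
    """Normalized rho(s, a) and rho(s, s') under policy pi (Tab. 1)."""
    P = np.einsum('sa,sap->sp', pi, T)           # induced state transition matrix
    # The normalized state occupancy solves rho = (1 - gamma) mu + gamma P^T rho.
    rho_s = np.linalg.solve(np.eye(nS) - gamma * P.T, (1 - gamma) * mu)
    rho_sa = rho_s[:, None] * pi                 # rho(s, a) = rho(s) pi(a|s)
    rho_ss = np.einsum('sa,sap->sp', rho_sa, T)  # rho(s, s')
    return rho_sa, rho_ss

def inverse_dynamics(pi):
    """rho(a|s, s') from Definition 2: proportional to T(s'|s, a) pi(a|s)."""
    num = pi[:, :, None] * T                     # shape (s, a, s')
    return num / num.sum(axis=1, keepdims=True)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

pi_A, pi_E = random_policy(), random_policy()    # agent and "expert"
rho_sa_A, rho_ss_A = occupancies(pi_A)
rho_sa_E, rho_ss_E = occupancies(pi_E)
inv_A, inv_E = inverse_dynamics(pi_A), inverse_dynamics(pi_E)

# Left-hand side of (6): E_{rho_A(s,s')} KL(rho_A(a|s,s') || rho_E(a|s,s')).
lhs = float(np.sum(rho_ss_A[:, None, :] * inv_A * np.log(inv_A / inv_E)))
# Right-hand side of (6): KL over (s, a) minus KL over (s, s').
rhs = kl(rho_sa_A, rho_sa_E) - kl(rho_ss_A, rho_ss_E)
print(lhs, rhs)  # the two sides agree up to floating-point error
```

The check rests on the chain rule ρπ(s, a, s′) = ρπ(s, s′)ρπ(a|s, s′) together with the shared-dynamics cancellation DKL(ρπ(s, a, s′)||ρE(s, a, s′)) = DKL(ρπ(s, a)||ρE(s, a)), mirroring the proof sketch of Theorem 1.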
One interesting observation is that when the action corresponding to a given state transition is unique (equivalently, the dynamics T(s′|s, a) is injective w.r.t. a), the inverse dynamics model is invariant to the conducted policy, and hence the inverse dynamics disagreement between the expert and the agent reduces to zero. We summarize this in the following corollary.
Corollary 1. If the dynamics T(s′|s, a) is injective w.r.t. a, LfD is equivalent to naive LfO:

    DKL(ρπ(s, a) || ρE(s, a)) = DKL(ρπ(s, s′) || ρE(s, s′)).    (7)

However, since most real-world tasks are performed in rather complex environments, (5) is usually not zero and the gap between LfD and LfO should not be overlooked, which makes minimizing the inverse dynamics disagreement unavoidable.

3.2 Bridging the Gap with Entropy Maximization

We have shown that the inverse dynamics disagreement amounts to the optimization gap between LfD and naive LfO. Therefore, the key to improving naive LfO lies in inverse dynamics disagreement minimization. Nevertheless, accurately computing the disagreement is difficult, as it relies on the environment dynamics T and the expert policy (see (2)), both of which are assumed to be unknown. In this section, we take a smarter route and derive an upper bound for the gap that requires access to neither the dynamics model nor expert guidance. This upper bound is tractable to minimize if the dynamics is assumed to be deterministic. We introduce the upper bound in the following theorem.
Theorem 2. Let Hπ(s, a) and HE(s, a) denote the causal entropies over the state-action occupancy measures of the agent and the expert, respectively. When DKL(ρπ(s, s′)||ρE(s, s′)) is minimized, we have

    DKL(ρπ(a|s, s′) || ρE(a|s, s′)) ≤ −Hπ(s, a) + Const.    (8)

Now we take a closer look at Hπ(s, a). Following the definitions in Tab. 1, the entropy of the state-action occupancy measure can be decomposed into the sum of the policy entropy and the state entropy:

    Hπ(s, a) = E_{ρπ(s,a)}[− log ρπ(s, a)] = E_{ρπ(s,a)}[− log π(a|s)] + E_{ρπ(s)}[− log ρπ(s)]
             = Hπ(a|s) + Hπ(s).    (9)

For the first term, the policy entropy Hπ(a|s) can be estimated via sampling, as in previous studies [15]. For the second term, we leverage the mutual information (MI) between s and (s′, a) to obtain an unbiased estimator of the entropy over the state occupancy measure, namely,

    Hπ(s) = Iπ(s; (s′, a)) + Hπ(s|s′, a) = Iπ(s; (s′, a)),    (10)

where Hπ(s|s′, a) = 0 because we have assumed (s, a) → s′ is a deterministic function3.

Algorithm 1 Inverse-Dynamics-Disagreement-Minimization (IDDM)
Input: State-only expert demonstrations DE = {ζE_i} where ζE_i = {sE_0, sE_1, ...}, policy πθ, discriminator Dφ, MI estimator I, entropy weights λp, λs, maximum iterations M.
for 1 to M do
    Sample agent rollouts DA = {ζi}, ζi ∼ πθ, and update the MI estimator I with DA.
    Update the discriminator Dφ with the gradient
        Ê_DA[∇φ log Dφ(s, s′)] + Ê_DE[∇φ log(1 − Dφ(s, s′))].
    Update the policy πθ using the following gradient (which can be integrated into methods like PPO [32]):
        Ê_DA[∇θ log πθ(a|s) Q(s, a)] − λp ∇θ Hπθ(a|s) − λs ∇θ Iπθ(s; (s′, a)),
        where Q(s̄, ā) = Ê_DA[log Dφ(s, s′) | s0 = s̄, a0 = ā].
end for
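As a concrete illustration (ours, not the paper's implementation), the MI term Iπ(s; (s′, a)) in (10) can be estimated by maximizing a Donsker-Varadhan lower bound, the objective family that MINE optimizes. The sketch below replaces MINE's neural-network critic with a linear critic on fixed quadratic features, so the bound is concave in the critic parameters, and checks it on correlated Gaussians where the true MI is known; all hyperparameters and names are illustrative choices.

```python
# Minimal sketch of a Donsker-Varadhan (MINE-style) MI lower bound with a
# linear critic over fixed quadratic features (our illustration; MINE itself
# uses a neural-network critic).
import numpy as np

def feat(x, y):
    # Fixed feature map phi(x, y); the critic f = theta . phi is linear in
    # theta, so the DV objective below is concave and easy to maximize.
    return np.stack([x, y, x * y, x**2, y**2], axis=1)

def mi_lower_bound(x, y, steps=300, seed=0):
    """Maximize E_joint[f] - log E_marginal[exp f] over the linear critic."""
    rng = np.random.default_rng(seed)
    raw_j = feat(x, y)
    mu, sd = raw_j.mean(axis=0), raw_j.std(axis=0)
    # The same affine standardization is applied to both batches, so the
    # critic stays one single function evaluated on joint and marginal data.
    phi_j = (raw_j - mu) / sd                        # samples from the joint
    phi_m = (feat(x, rng.permutation(y)) - mu) / sd  # shuffled -> marginals

    def dv(theta):
        sm = phi_m @ theta
        m = sm.max()
        return (phi_j @ theta).mean() - (m + np.log(np.exp(sm - m).mean()))

    theta = np.zeros(phi_j.shape[1])
    for _ in range(steps):
        sm = phi_m @ theta
        w = np.exp(sm - sm.max())
        w /= w.sum()                                 # softmax weights on marginal batch
        grad = phi_j.mean(axis=0) - w @ phi_m        # ascent direction of the DV bound
        step, cur = 0.5, dv(theta)
        while dv(theta + step * grad) < cur and step > 1e-8:
            step *= 0.5                              # backtracking keeps the ascent monotone
        theta += step * grad
    return float(dv(theta))

rng = np.random.default_rng(1)
x = rng.standard_normal(2000)
y = 0.9 * x + np.sqrt(1 - 0.9**2) * rng.standard_normal(2000)
mi_corr = mi_lower_bound(x, y)    # true MI here is -0.5*log(1 - 0.81) ~ 0.83 nats
mi_indep = mi_lower_bound(x, rng.standard_normal(2000))
print(mi_corr > mi_indep)
```

In the paper's setting, x would play the role of s and y the role of (s′, a) gathered from agent rollouts; the estimated bound is what the λs term of the policy gradient differentiates through.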
In our implementation, the MI Iπ(s; (s′, a)) is computed by maximizing a lower bound on the KL divergence between the product of marginals and the joint distribution, following the formulation of [25]. Specifically, we adopt MINE [4, 14], which implements the score function with a neural network to achieve a low-variance MI estimator.

Overall loss. By combining the results in Theorem 1, Theorem 2, (9), and (10), we enhance naive LfO by further minimizing the upper bound of its gap to LfD. The eventual objective is

    Lπ = DKL(ρπ(s, s′) || ρE(s, s′)) − λp Hπ(a|s) − λs Iπ(s; (s′, a)),    (11)

where the first term is from naive LfO, and the last two terms minimize the gap between LfD and naive LfO. We also add trade-off weights λp and λs to the last two terms for more flexibility.

Implementation. As our above derivations generalize to the JS divergence (see Sec. A.3-4 in the supplementary material), we can utilize a GAN-like [25] method to minimize the first term in (11). In detail, we introduce a parameterized discriminator network Dφ and a policy network πθ (serving as a generator) to realize the first term in (11). The term log Dφ(s, s′) can be interpreted as an immediate cost, since we minimize its expectation over the current occupancy measure. A similar training method can be found in GAIL [15], but it relies on state-action input instead. We defer the derivations of the gradients of the causal entropy ∇Hπ(a|s) and the MI ∇Iπ(s; (s′, a)) with respect to the policy to Sec. A.5 of the supplementary material. Note that the objective (11) can be optimized by any policy gradient method, such as A3C [22] or PPO [32]; we apply PPO in our experiments. The algorithm details are summarized in Alg. 
1.

4 Related Work

4.1 Learning from Demonstrations

Modern dominant approaches to LfD mainly fall into two categories: Behavior Cloning (BC) [31, 3, 29, 30], which seeks the policy that minimizes the action prediction error on the demonstrations, and Inverse Reinforcement Learning (IRL) [23, 1], which infers the reward used by the expert to guide the agent's policy learning. A notable implementation of the latter is GAIL [15], which reformulates IRL as an occupancy measure matching problem [28] and utilizes the GAN [12] method along with forward RL to minimize the discrepancy of occupancy measures between the imitator and the demonstrator. Several follow-up works attempt to enhance the effectiveness of the discrepancy computation [19, 16, 10, 27], whereas all these methods require exact action guidance at each time step.

3 In this paper, the tasks in our experiments indeed exhibit deterministic dynamics. That the mapping (s, a) → s′ is deterministic also implies that (s′, a) → s is deterministic. Referring to [20], when s, s′, a are continuous, H(s|s′, a) can be negative; but since these variables are actually quantized as finite-bit precision numbers (e.g., stored as 32-bit discrete numbers in a computer), it still holds that the conditional entropy is zero in practice.

4.2 Learning from Observations

There has already been some research on LfO. Existing approaches exploit either a complex hand-crafted reward function or an inverse dynamics model that predicts the exact action given state transitions. Here is a summary of how they are connected to our method.

LfO with Hand-crafted Reward and Forward RL. Recently, Peng et al. proposed DeepMimic, a method that can imitate locomotion behaviors from motion clips without action labeling. 
They design a reward that encourages the agent to directly match the expert's physical properties, such as joint angles and velocities, and run forward RL to learn the imitation policy. However, as the hand-crafted reward function takes neither the expert action nor, implicitly, the state transition into account, it is hard to generalize to tasks whose reward depends on actions.

Model-Based LfO. BCO [39] is another LfO approach. The authors infer the exact action from a state transition with a learned inverse dynamics model (2). The state demonstrations, augmented with the predicted actions, deliver ordinary state-action pairs that enable imitation learning via BC [31]. At its heart, the inverse dynamics model is trained in parallel by collecting rollouts in the environment. However, as shown in (2), the inverse dynamics model depends on the current policy, implying that an optimal inverse dynamics model is infeasible to obtain before the optimal policy is learned. The performance of BCO is thus not theoretically guaranteed.

LfO with GAIL. GAIfO [40, 41] is the closest work to our method. The authors follow the formulation of GAIL [15] but replace the state-action pair (s, a) with the state transition (s, s′), which gives the same objective as Eq. (4) if KL is replaced with the JS divergence. As we discussed in Sec. 3.1, there is a gap between Eq. (4) and the objective of the original LfD in Eq. (3), and this gap is induced by the inverse dynamics disagreement. Unlike our method, the solution by GAIfO never minimizes this gap and is therefore no better than ours in principle.

5 Experiments

In the experiments below, we investigate the following questions:

1. Does inverse dynamics disagreement really account for the gap between LfD and LfO?
2. With state-only guidance, can our method achieve better performance than other counterparts that do not consider inverse dynamics disagreement minimization?
3. 
What are the key ingredients of our method that contribute to the performance improvement?

To answer the first question, we conduct toy experiments in the Gridworld environment [37]. We test and contrast the performance of our method (Eq. (11)) against GAIL (Eq. (3)) and GAIfO (Eq. (4)) on tasks with different levels of inverse dynamics disagreement. Regarding the second question, we evaluate our method against several baselines on six physics-based control benchmarks [8], ranging from low-dimensional control to challenging high-dimensional continuous control. Finally, to address the last question, we perform an ablation analysis of the two major components of our method (the policy entropy term and the MI term). Due to the space limit, we defer more detailed specifications of all the evaluated tasks to the supplementary material.

5.1 Understanding the Effect of Inverse Dynamics Disagreement

This collection of experiments demonstrates how inverse dynamics disagreement influences LfO approaches. We first observe that inverse dynamics disagreement increases when the number of possible action choices grows. This is justified in Fig. 1a, and more details on the relation between inverse dynamics disagreement and the number of action choices are provided in Sec. B.2 of the supplementary material. Hence, we can use different action scales to reflect different levels of inverse dynamics disagreement in our experiments. Controlling the action scale in Gridworld is straightforward. For example in Fig. 
1b, the agent block (in red) may take various kinds of actions (walk, jump, or others) to move to a neighboring position towards the target (in green), and we can specify different numbers of action choices.
We simulate the expert demonstrations by collecting the trajectories of a policy trained with PPO [32]. Then we run GAIL, GAIfO, and our method, and evaluate the pre-defined reward values for the policies they learn. It should be noted that none of the imitation learning methods have access to the reward function during training. As we can see in Fig. 1c, the gap between GAIL and GAIfO grows as the number of action choices (equivalently, the level of inverse dynamics disagreement) increases, which is consistent with our conclusion in Theorem 1. We also find that the rewards of GAIL and GAIfO are the same when the number of action choices is 1 (i.e., the dynamics is injective), which follows the statement of Corollary 1. Our method lies between GAIL and GAIfO, indicating that the gap between GAIL and GAIfO can be mitigated to some extent by explicitly minimizing inverse dynamics disagreement. Note that GAIL also suffers a performance drop when inverse dynamics disagreement becomes large. This is mainly because the imitation learning problem itself becomes more difficult when the dynamics is complicated and far from injective.

Figure 1: Toy examples illustrating the effect of inverse dynamics disagreement.

5.2 Comparative Evaluations

For comparative evaluations, we run several LfO baselines, including DeepMimic [26], BCO [39], and GAIfO [40]. In particular, we introduce a modified version of GAIfO that takes only a single state as input, to illustrate the necessity of leveraging state transitions; we denote this method GAIfO-s. We also run GAIL [15] to provide an oracle reference. All experiments are evaluated within a fixed number of steps. 
On each task, we run each algorithm five times with different random seeds. In Fig. 2, the solid curves correspond to the mean returns, and the shaded regions represent the variance over the five runs. The eventual results are summarized in Tab. 2, averaged over 50 trials of the learned policies. Due to the space limit, we defer more details to the supplementary material.

Table 2: Summary of quantitative results. All results correspond to the original exact reward defined in [7]. CartPole is excluded for DeepMimic because no hand-crafted reward is available. Columns: CartPole | Pendulum | DoublePendulum | Hopper | HalfCheetah | Ant.

DeepMimic:  -          | 731.0±19.0  | 454.4±154.0   | 2292.6±1068.9 | 202.6±4.4    | -985.3±13.6
BCO:        200.0±0.0  | 24.9±0.8    | 80.3±13.1     | 1266.2±1062.8 | 4557.2±90.0  | 562.5±384.1
GAIfO:      197.5±7.3  | 980.2±3.0   | 4240.6±4525.6 | 1021.4±0.6    | 3955.1±22.1  | -1415.0±161.1
GAIfO-s*:   200.0±0.0  | 952.1±23.0  | 1089.2±51.4   | 1022.5±0.40   | 2896.5±53.8  | -5062.3±56.9
Ours:       200.0±0.0  | 1000.0±0.0  | 9359.7±0.2    | 3300.9±52.1   | 5699.3±51.8  | 2800.4±14.0
GAIL:       200.0±0.0  | 1000.0±0.0  | 9174.8±1292.5 | 3249.9±34.0   | 6279.0±56.5  | 5508.8±791.5
Expert:     200.0±0.0  | 1000.0±0.0  | 9318.8±8.5    | 3645.7±181.8  | 5988.7±61.8  | 5746.8±117.5

*GAIfO with a single state only.

The results show that our method achieves performance comparable to the baselines on the easy tasks (such as CartPole) and outperforms them by a large margin on the difficult tasks (such as Ant and Hopper). We also find that our algorithm exhibits more stable behaviors. 
For example, the performance of BCO on Ant and Hopper unexpectedly drops as training continues. We conjecture that BCO explicitly, but not accurately, learns the inverse dynamics model from data, which is prone to over-fitting and leads to performance degradation. Conversely, our algorithm is model-free and guarantees training stability as well as eventual performance, even on complex tasks such as HalfCheetah, Ant, and Hopper.

Figure 2: Learning curves under challenging robotic control benchmarks. For each experiment, a step represents one interaction with the environment. Detailed plots can be found in the supplementary.

Besides, GAIfO performs better than GAIfO-s in most of the evaluated tasks. This illustrates the importance of taking the state transition into account to reflect action information in LfO. Compared with GAIfO, our method clearly attains consistent and significant improvements on HalfCheetah (+1744.2), Ant (+4215.0), and Hopper (+2279.5), convincingly verifying that minimizing the optimization gap induced by inverse dynamics disagreement plays an essential role in LfO and that our proposed approach can effectively bridge the gap. For the tasks with relatively simple dynamics (e.g., CartPole), GAIfO achieves satisfying performance, which is consistent with our conclusion in Corollary 1.
DeepMimic, which relies on a hand-crafted reward, struggles on most of the evaluated tasks. Our proposed method does not depend on any manually designed reward signal, and is thus more self-contained and more practical in general applications.

Figure 3: Comparative results of GAIL [15], GAIfO [40], and our method with different numbers of trajectories in demonstrations. 
The performance is the averaged cumulative return over 5 trajectories and has been scaled to [0, 1] (the random and the expert policies are fixed to 0 and 1, respectively). We also conduct experiments with demonstrations containing fewer state-action/state-transition pairs than one complete trajectory; we use 32′, 128′ and 256′ pairs (denoted at the beginning of the x-axes) for the first three tasks, respectively.

Finally, we compare the performances of GAIL, GAIfO and our method with different numbers of demonstrations. The results are presented in Fig. 3. They show that for simple tasks like CartPole and Pendulum, there are no significant differences among the evaluated approaches when the number of demonstrations changes. For tasks with higher-dimensional states and actions, our method performs advantageously over GAIfO. Even compared with GAIL, which involves action demonstrations, our method still delivers comparable results. For all methods, more demonstrations facilitate better performance, especially when the tasks become more complicated (HalfCheetah and Ant).

5.3 Ablation Study

The results presented in the previous section suggest that our proposed method can outperform other LfO approaches on several challenging tasks.
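The [0, 1] scaling used in Fig. 3 is the usual expert-normalized return: a linear rescaling that maps the random policy's score to 0 and the expert's to 1. A minimal sketch, with hypothetical numbers in the usage line:

```python
def normalized_return(score, random_score, expert_score):
    # Linearly rescale a raw return so that the random policy maps to 0
    # and the expert policy maps to 1 (the scaling used in Fig. 3).
    return (score - random_score) / (expert_score - random_score)

# Hypothetical example: a policy scoring 150 on a task where the random
# policy scores 0 and the expert scores 200 reaches 75% of expert level.
score = normalized_return(150.0, 0.0, 200.0)  # -> 0.75
```

Note that a policy worse than random yields a negative normalized score, which is why some curves in Fig. 3 dip below 0.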
We now further analyze the impact of the policy entropy term and the MI term in (11). As these two terms are weighted by λp and λs, we explore the sensitivity of our algorithm to their values.

Sensitivity to Policy Entropy. We design four groups of parameters on HalfCheetah, where λp is selected from {0, 0.0005, 0.001, 0.01} and λs is fixed at 0.01. The final results are plotted in Fig. 4, with the learning curves and detailed quantitative results provided in the supplementary material. The results suggest that adding the policy entropy term consistently improves performance. Although different choices of λp induce minor differences in the final performance, all of them outperform GAIfO, which does not include the policy entropy term in its objective function.

Figure 4: Sensitivity to the policy entropy weight λp.

Sensitivity to Mutual Information. We conduct four groups of experiments on HalfCheetah by ranging λs from 0.0 to 0.1 and fixing λp at 0.001. The final results are shown in Fig. 5 (the learning curves and averaged returns are also reported in the supplementary material). We observe that the imitation performance consistently benefits from adding the MI term, and the improvements become more significant when λs has a relatively large magnitude. All variants of our method consistently outperform GAIfO, indicating the importance of the mutual information term in our optimization objective. We also provide the results of a grid search over λs and λp in the supplementary material to further illustrate how better performance could potentially be obtained.

Figure 5: Sensitivity to the MI weight λs.

6 Conclusion

In this paper, our goal is to perform imitation Learning from Observations (LfO).
Based on the theoretical analysis of the difference between LfO and Learning from Demonstrations (LfD), we introduce the notion of inverse dynamics disagreement and demonstrate that it amounts to the gap between LfD and LfO. To minimize inverse dynamics disagreement in a principled and efficient way, we characterize its upper bound as a particular negative causal entropy and optimize it via a model-free method. Our model, dubbed Inverse-Dynamics-Disagreement-Minimization (IDDM), attains consistent improvements over other LfO counterparts on various challenging benchmarks. While our paper mainly focuses on control planning, combining our work with representation learning to enable imitation across different domains could be a promising direction for future work.

Acknowledgments

This research was funded by the National Science and Technology Major Project of the Ministry of Science and Technology of China (No. 2018AAA0102900). It was also partially supported by the National Science Foundation of China (Grant No. 91848206), and by the National Science Foundation of China (NSFC) and the German Research Foundation (DFG) in the project Cross Modal Learning, NSFC 61621136008/DFG TRR-169. We would like to thank Mingxuan Jing and Dr. Boqing Gong for the insightful discussions, and the anonymous reviewers for the constructive feedback.

References

[1] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning (ICML), 2004.

[2] Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177, 2018.

[3] Christopher G Atkeson and Stefan Schaal. Robot learning from demonstration.
In International Conference on Machine Learning (ICML), 1997.

[4] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. Mutual information neural estimation. In International Conference on Machine Learning (ICML), 2018.

[5] Darrin C Bentivegna, Christopher G Atkeson, and Gordon Cheng. Learning tasks from observation and practice. Robotics and Autonomous Systems, 47(2-3):163–169, 2004.

[6] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.

[7] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.

[8] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning (ICML), 2016.

[9] Yan Duan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

[10] Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. In International Conference on Learning Representations (ICLR), 2018.

[11] Alessandro Giusti, Jérôme Guzzi, Dan C Ciresan, Fang-Lin He, Juan P Rodríguez, Flavio Fontana, Matthias Faessler, Christian Forster, Jürgen Schmidhuber, Gianni Di Caro, et al. A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters (RA-L), 2016.

[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio.
Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS), 2014.

[13] Tuomas Haarnoja, Aurick Zhou, Sehoon Ha, Jie Tan, George Tucker, and Sergey Levine. Learning to walk via deep reinforcement learning. In Robotics: Science and Systems (RSS), 2019.

[14] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations (ICLR), 2019.

[15] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems (NeurIPS), 2016.

[16] Mingxuan Jing, Xiaojian Ma, Wenbing Huang, Fuchun Sun, and Huaping Liu. Task transfer by preference-based cost learning. In AAAI Conference on Artificial Intelligence (AAAI), 2019.

[17] Bingyi Kang, Zequn Jie, and Jiashi Feng. Policy optimization with demonstrations. In International Conference on Machine Learning (ICML), 2018.

[18] Beomjoon Kim and Joelle Pineau. Maximum mean discrepancy imitation learning. In Robotics: Science and Systems (RSS), 2013.

[19] Kee-Eung Kim and Hyun Soo Park. Imitation learning via kernel mean embedding. In AAAI Conference on Artificial Intelligence (AAAI), 2018.

[20] Rithesh Kumar, Anirudh Goyal, Aaron Courville, and Yoshua Bengio. Maximum entropy generators for energy-based models. arXiv preprint arXiv:1901.08508, 2019.

[21] YuXuan Liu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Imitation from observation: Learning to imitate behaviors from raw video via context translation. In IEEE International Conference on Robotics and Automation (ICRA), 2018.

[22] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning.
In International Conference on Machine Learning (ICML), 2016.

[23] Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning (ICML), 2000.

[24] Duy Nguyen-Tuong and Jan Peters. Model learning for robot control: a survey. Cognitive Processing, 12(4):319–340, 2011.

[25] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems (NeurIPS), 2016.

[26] Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics (TOG), 2018.

[27] Xue Bin Peng, Angjoo Kanazawa, Sam Toyer, Pieter Abbeel, and Sergey Levine. Variational discriminator bottleneck: Improving imitation learning, inverse RL, and GANs by constraining information flow. In International Conference on Learning Representations (ICLR), 2019.

[28] Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 1994.

[29] Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

[30] Stephane Ross and J Andrew Bagnell. Agnostic system identification for model-based reinforcement learning. In International Conference on Machine Learning (ICML), 2012.

[31] Stefan Schaal. Learning from demonstration. In Advances in Neural Information Processing Systems (NeurIPS), 1997.

[32] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[33] Bruno Siciliano and Oussama Khatib. Springer Handbook of Robotics. Springer, 2008.

[34] Mark W Spong and Romeo Ortega.
On adaptive inverse dynamics control of rigid robots. IEEE Transactions on Automatic Control (T-AC), 1990.

[35] Bradly C Stadie, Pieter Abbeel, and Ilya Sutskever. Third-person imitation learning. In International Conference on Learning Representations (ICLR), 2017.

[36] Wen Sun, Anirudh Vemula, Byron Boots, and Drew Bagnell. Provably efficient imitation learning from observation alone. In International Conference on Machine Learning (ICML), 2019.

[37] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[38] Umar Syed, Michael Bowling, and Robert E Schapire. Apprenticeship learning using linear programming. In International Conference on Machine Learning (ICML), 2008.

[39] Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation. In International Joint Conference on Artificial Intelligence (IJCAI), 2018.

[40] Faraz Torabi, Garrett Warnell, and Peter Stone. Generative adversarial imitation from observation. arXiv preprint arXiv:1807.06158, 2018.

[41] Faraz Torabi, Garrett Warnell, and Peter Stone. Imitation learning from video by leveraging proprioception. In International Joint Conference on Artificial Intelligence (IJCAI), 2019.